Seleman Simon Sewangi

COMPUTER-ASSISTED EXTRACTION OF TERMS IN SPECIFIC DOMAINS: THE CASE OF SWAHILI

ACADEMIC DISSERTATION

To be presented, with the permission of the Faculty of Arts of the University of Helsinki, for public criticism in auditorium XIII, Fabianinkatu 33, on the 15th of December, 2001, at 10 o’clock.

Institute for Asian and African Studies Publications, 1
University of Helsinki
2001

ISBN 952-10-0253-0 (printed)
ISBN 952-10-0254-9 (pdf)
ISSN 1458-5359

Helsinki University Printing House, Helsinki 2001

ABSTRACT

This dissertation proposes the method of computer-assisted extraction of terms in domain-specific corpora. The method is developed based on constraints of structures of Swahili terms, primarily with the focus on the extraction of Swahili terms. The title of the dissertation reflects the dual nature of the tasks involved in the implementation of the method: both machine and manual tasks are employed in the implementation.

The method incorporates techniques for the formulation of term patterns appropriate for the extraction of terms in their domains. The techniques concern the unique analysis of terms in a domain-specific corpus and formulation of term-patterns on the term structural constraints discovered, based on the terms analysed in the corpus.

For the unique analysis of terms, the techniques introduce the term-domain feature, among others, for analysing words in a domain-specific corpus. Tags for the domain-feature are introduced into a lexicon of the morphological analyser or into the rule file for the BETA system, where they are used for marking term base-forms according to their domains. The morphological analyser or the BETA system applies the tags in the lexicon or in the rule file to analyse terms uniquely in the corpus by their domains.

The formulation of term-patterns involves manual identification of compound terms in the analysed corpus, where words analysed as terms are used as searching words.
Then the identified terms are specified as sequences of tags selected from the annotation. Thereafter, the tags and their relationships in the sequences are employed to derive the possible term formation constraints on which the patterns are formulated for the extraction of terms.

The effectiveness of the proposed method in the extraction of terms is evaluated with respect to the extraction of Swahili terms in the domains of health care and literature, and the results obtained are encouraging. The terms compiled from the evaluation are indexed at the end of the dissertation.

ACKNOWLEDGEMENTS

I am most grateful to my supervisors, Professor Arvi Hurskainen and Professor Kimmo Koskenniemi for initiating me in the field of computational linguistics and for their untiring encouragement and guidance which made this research a success.

My sincere gratitude is due to the Norwegian Council of Universities' Committee for Development Research and Education (NUFU) for awarding me a scholarship, and the leadership of the University of Dar es Salaam for granting me study leave.

I cordially thank Professor Sean O'Fahey of the University of Bergen for the fatherly support and advice that he has always given me, and Mr Ove Stocknes of the University of Bergen for friendly cooperation throughout my period of study.

Special thanks go to my wife Felister for her encouragement and support, which have been the source of strength for the completion of this work.

My personal debt is very great to friends who have contributed in one way or another to the success of this study. I deeply acknowledge the contribution of Jussi Piitulainen who wrote the pattern-matching program for this work. I sincerely appreciate the material and moral support I have received from: Harry Halén, Juri Ahlfors, Traute Stude, Mr. & Mrs. Tapio Pitkänen, Sinikka Tuovinen, Helena Pykälämäki, Mr. & Mrs. Leif Packalén, Eeva Uusitalo, Mr. & Mrs. Eljas Suikkanen, Riikka Halme, Lotta Har