Building Web Corpora for Minority Languages

dc.contributor.author Jauhiainen, Heidi
dc.contributor.author Jauhiainen, Tommi
dc.contributor.author Linden, Krister
dc.contributor.editor Barbaresi, Adrien
dc.contributor.editor Bildhauer, Felix
dc.contributor.editor Schäfer, Roland
dc.contributor.editor Stemle, Egon
dc.date.accessioned 2020-09-11T14:04:01Z
dc.date.available 2020-09-11T14:04:01Z
dc.date.issued 2020
dc.identifier.citation Jauhiainen, H, Jauhiainen, T & Linden, K 2020, Building Web Corpora for Minority Languages. in A Barbaresi, F Bildhauer, R Schäfer & E Stemle (eds), Proceedings of the 12th Web as Corpus Workshop. The Association for Computational Linguistics, Stroudsburg, pp. 23-32, Language Resources and Evaluation Conference, 11/05/2020.
dc.identifier.citation conference
dc.identifier.other PURE: 143101618
dc.identifier.other PURE UUID: 826a5a11-af38-422b-8736-fb261c195161
dc.identifier.other ORCID: /0000-0003-2337-303X/work/80215472
dc.identifier.other ORCID: /0000-0002-8227-5627/work/80225167
dc.identifier.other ORCID: /0000-0002-6474-3570/work/80225638
dc.description.abstract Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the "Finno-Ugric Languages and the Internet" (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project. en
dc.format.extent 10
dc.language.iso eng
dc.publisher The Association for Computational Linguistics
dc.relation.ispartof Proceedings of the 12th Web as Corpus Workshop
dc.relation.isversionof 979-10-95546-68-9
dc.rights cc_by
dc.rights.uri info:eu-repo/semantics/openAccess
dc.subject 6121 Languages
dc.title Building Web Corpora for Minority Languages en
dc.type Conference contribution
dc.contributor.organization Language Technology
dc.contributor.organization Department of Digital Humanities
dc.contributor.organization Centre of Excellence in Ancient Near Eastern Empires (ANEE)
dc.description.reviewstatus Peer reviewed
dc.rights.accesslevel openAccess
dc.type.version publishedVersion

Files in this item

Files Size Format View
JauhiainenEtAl_BuildingWebCorpora_2020.wac_1.4.pdf 237.4Kb PDF View/Open
