Language Set Identification in Noisy Synthetic Multilingual Documents

dc.contributor University of Helsinki, Department of Modern Languages 2010-2017 en
dc.contributor.author Jauhiainen, Tommi Sakari
dc.contributor.author Linden, Krister
dc.contributor.author Jauhiainen, Heidi Annika
dc.contributor.editor Gelbukh, A.
dc.date.accessioned 2016-01-08T06:31:02Z
dc.date.available 2016-01-08T06:31:02Z
dc.date.issued 2015
dc.identifier.citation Jauhiainen, T S, Linden, K & Jauhiainen, H A 2015, Language Set Identification in Noisy Synthetic Multilingual Documents. in A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing. vol. Part I, Lecture Notes in Computer Science, vol. 9041, Springer International Publishing AG, pp. 633-643, International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, 14/04/2015. https://doi.org/10.1007/978-3-319-18111-0_48 en
dc.identifier.citation conference en
dc.identifier.isbn 978-3-319-18110-3
dc.identifier.isbn 978-3-319-18111-0
dc.identifier.other PURE: 49707109
dc.identifier.other PURE UUID: 4ee7998d-9fe3-445d-a16e-de573a40b702
dc.identifier.other Scopus: 84942684044
dc.identifier.other ORCID: /0000-0003-2337-303X/work/29934316
dc.identifier.other ORCID: /0000-0002-8227-5627/work/29566440
dc.identifier.other ORCID: /0000-0002-6474-3570/work/34198886
dc.identifier.uri http://hdl.handle.net/10138/159361
dc.description.abstract In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters, and this gives us enough reason to evaluate again the possibility of using existing language identifiers for monolingual text to detect the language set of a multilingual document. We use a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages. en
dc.format.extent 11
dc.language.iso eng
dc.publisher Springer International Publishing AG
dc.relation.ispartof Computational Linguistics and Intelligent Text Processing
dc.relation.ispartofseries Lecture Notes in Computer Science
dc.rights en
dc.subject 6121 Languages en
dc.subject 113 Computer and information sciences en
dc.title Language Set Identification in Noisy Synthetic Multilingual Documents en
dc.type Conference contribution
dc.identifier.doi https://doi.org/10.1007/978-3-319-18111-0_48
dc.type.uri info:eu-repo/semantics/other

Files in this item

Files Size Format View
CICLing2015.pdf 200.1Kb PDF View/Open
