Language Set Identification in Noisy Synthetic Multilingual Documents

Permalink

http://hdl.handle.net/10138/159361

Citation

Jauhiainen, T S, Linden, K & Jauhiainen, H A 2015, Language Set Identification in Noisy Synthetic Multilingual Documents. in A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing. vol. Part I, Lecture Notes in Computer Science, vol. 9041, Springer International Publishing AG, pp. 633-643, International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, 14/04/2015. https://doi.org/10.1007/978-3-319-18111-0_48

Title: Language Set Identification in Noisy Synthetic Multilingual Documents
Author: Jauhiainen, Tommi Sakari; Linden, Krister; Jauhiainen, Heidi Annika
Editor: Gelbukh, A.
Contributor: University of Helsinki, Department of Modern Languages 2010-2017
Publisher: Springer International Publishing AG
Date: 2015
Language: eng
Number of pages: 11
Part of series: Computational Linguistics and Intelligent Text Processing
Part of series: Lecture Notes in Computer Science
ISBN: 978-3-319-18110-3
978-3-319-18111-0
Permanent link (URI): http://hdl.handle.net/10138/159361
Abstract: In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have improved steadily from the 1970s to recent years. Current state-of-the-art language identifiers are quite efficient even with only a few characters of input, which gives us reason to re-evaluate the possibility of using existing language identifiers for monolingual text to detect the language set of a multilingual document. We use a previously developed language identifier for monolingual documents on the multilingual documents of the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.
Subject: 6121 Languages
113 Computer and information sciences

Files in this item

File Size Format
CICLing2015.pdf 200.1Kb PDF