Language Set Identification in Noisy Synthetic Multilingual Documents

Permalink

http://hdl.handle.net/10138/159361

Citation

Jauhiainen, T S, Linden, K & Jauhiainen, H A 2015, Language Set Identification in Noisy Synthetic Multilingual Documents. In A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing. vol. Part I, Lecture Notes in Computer Science, vol. 9041, Springer International Publishing AG, pp. 633-643, International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, 14/04/2015. https://doi.org/10.1007/978-3-319-18111-0_48

Title: Language Set Identification in Noisy Synthetic Multilingual Documents
Author: Jauhiainen, Tommi Sakari; Linden, Krister; Jauhiainen, Heidi Annika
Other contributor: Gelbukh, A.
Contributor organization: Department of Modern Languages 2010-2017; Krister Linden / Research Group; Language Technology
Publisher: Springer International Publishing AG
Date: 2015
Language: eng
Number of pages: 11
Belongs to series: Computational Linguistics and Intelligent Text Processing
Belongs to series: Lecture Notes in Computer Science
ISBN: 978-3-319-18110-3; 978-3-319-18111-0
ISSN: 0302-9743
DOI: https://doi.org/10.1007/978-3-319-18111-0_48
URI: http://hdl.handle.net/10138/159361
Abstract: In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages.
Description: Proceeding volume: Part I
Subject: 6121 Languages; 113 Computer and information sciences
Peer reviewed: Yes
Usage restriction: openAccess
Self-archived version: acceptedVersion


Files in this item

Files Size Format View
CICLing2015.pdf 200.1Kb PDF View/Open
