Jauhiainen, T S, Linden, K & Jauhiainen, H A 2015, Language Set Identification in Noisy Synthetic Multilingual Documents. In A Gelbukh (ed.), Computational Linguistics and Intelligent Text Processing. vol. Part I, Lecture Notes in Computer Science, vol. 9041, Springer International Publishing AG, pp. 633-643, International Conference on Intelligent Text Processing and Computational Linguistics, Cairo, Egypt, 14/04/2015. https://doi.org/10.1007/978-3-319-18111-0_48
Title: | Language Set Identification in Noisy Synthetic Multilingual Documents |
Author: | Jauhiainen, Tommi Sakari; Linden, Krister; Jauhiainen, Heidi Annika |
Other contributor: | Gelbukh, A. |
Contributor organization: | Department of Modern Languages 2010-2017 Krister Linden / Research Group Language Technology |
Publisher: | Springer International Publishing AG |
Date: | 2015 |
Language: | eng |
Number of pages: | 11 |
Belongs to series: | Computational Linguistics and Intelligent Text Processing |
Belongs to series: | Lecture Notes in Computer Science |
ISBN: | 978-3-319-18110-3; 978-3-319-18111-0 |
ISSN: | 0302-9743 |
DOI: | https://doi.org/10.1007/978-3-319-18111-0_48 |
URI: | http://hdl.handle.net/10138/159361 |
Abstract: | In this paper, we reconsider the problem of language identification of multilingual documents. Automated language identification algorithms have been improving steadily from the seventies until recent years. The current state-of-the-art language identifiers are quite efficient even with only a few characters and this gives us enough reason to again evaluate the possibility to use existing language identifiers for monolingual text to detect the language set of a multilingual document. We are using a previously developed language identifier for monolingual documents with the multilingual documents from the WikipediaMulti dataset published in a recent study. Our method outperforms previous methods tested with the same data, achieving an F1-score of 97.6 when classifying between 44 languages. |
Description: | Proceeding volume: Part I |
Subject: |
6121 Languages
113 Computer and information sciences |
Peer reviewed: | Yes |
Usage restriction: | openAccess |
Self-archived version: | acceptedVersion |
Files | Size | Format |
---|---|---|
CICLing2015.pdf | 200.1Kb | PDF |