Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

Permanent link

http://hdl.handle.net/10138/136269

Citation

Kettunen, K, Honkela, T, Linden, K, Kauppinen, P, Pääkkönen, T & Kervinen, J 2014, Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. in IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly. IFLA, Lyon, France, IFLA World Library and Information Congress, Lyon, France, 16/08/2014. <http://hdl.handle.net/10138/136269>

Title: Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods
Author: Kettunen, Kimmo; Honkela, Timo; Linden, Krister; Kauppinen, Pekka; Pääkkönen, Tuula; Kervinen, Jukka
Author's organization: Centre for Preservation and Digitisation
Department of Modern Languages 2010-2017
Publisher: IFLA
Date: 2014-08-16
Language: eng
Number of pages: 23
Belongs to series: IFLA World Library and Information Congress Proceedings
URI: http://hdl.handle.net/10138/136269
Abstract: In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially for the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the material makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is to reach a clear reduction in the human effort needed in the post-processing of the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements.
Keywords: 6121 Languages
Peer reviewed: Yes
Access restrictions: openAccess
Self-archived version: publishedVersion


Files

File(s) Size Format
IFLA2014_kettunen_en.pdf 609.7KB PDF
