Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods

dc.contributor.author Kettunen, Kimmo
dc.contributor.author Honkela, Timo
dc.contributor.author Linden, Krister
dc.contributor.author Kauppinen, Pekka
dc.contributor.author Pääkkönen, Tuula
dc.contributor.author Kervinen, Jukka
dc.date.accessioned 2014-10-18T21:12:47Z
dc.date.available 2014-10-18T21:12:47Z
dc.date.issued 2014-08-16
dc.identifier.citation Kettunen, K, Honkela, T, Linden, K, Kauppinen, P, Pääkkönen, T & Kervinen, J 2014, Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods. in IFLA World Library and Information Congress Proceedings: 80th IFLA General Conference and Assembly. IFLA, Lyon, France, IFLA World Library and Information Congress, Lyon, France, 16/08/2014. <http://hdl.handle.net/10138/136269>
dc.identifier.citation conference
dc.identifier.other PURE: 42253298
dc.identifier.other PURE UUID: b94cf653-eb0b-45cd-b350-977fea3f443f
dc.identifier.other ORCID: /0000-0003-2337-303X/work/29934318
dc.identifier.other ORCID: /0000-0003-2747-1382/work/29423261
dc.identifier.other ORCID: /0000-0003-2071-5110/work/29585359
dc.identifier.other ORCID: /0000-0003-3958-9732/work/28762995
dc.identifier.other ORCID: /0000-0003-0917-2020/work/37919208
dc.identifier.uri http://hdl.handle.net/10138/136269
dc.description.abstract In this paper, we study how to analyze and improve the quality of a large historical newspaper collection. The National Library of Finland has digitized millions of newspaper pages. The quality of the OCR output is limited, especially in the oldest parts of the collection. Approaches such as crowd-sourcing have been used in this field to improve the quality of the texts, but in this case the volume of the material makes it impossible to manually edit any substantial proportion of the texts. Therefore, we experiment with quality evaluation and improvement methods based on corpus statistics, language technology and machine learning in order to find ways to automate the analysis and improvement process. The final objective is a clear reduction in the human effort needed to post-process the texts. We present quantitative evaluations of the current quality of the corpus, describe challenges related to texts written in a morphologically complex language, and describe two different approaches to achieving quality improvements. en
dc.format.extent 23
dc.language.iso eng
dc.publisher IFLA
dc.relation.ispartof IFLA World Library and Information Congress Proceedings
dc.rights.uri info:eu-repo/semantics/openAccess
dc.subject 6121 Languages
dc.title Analyzing and Improving the Quality of a Historical News Collection using Language Technology and Statistical Machine Learning Methods en
dc.type Conference contribution
dc.contributor.organization Centre for Preservation and Digitisation
dc.contributor.organization Department of Modern Languages 2010-2017
dc.description.reviewstatus Peer reviewed
dc.rights.accesslevel openAccess
dc.type.version publishedVersion

Files in this item

File                        Size      Format
IFLA2014_kettunen_en.pdf    609.7 KB  PDF
