Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software

dc.contributor.author Kettunen, Kimmo
dc.contributor.author Ruokolainen, Teemu
dc.contributor.author Liukkonen, Erno Samuli
dc.contributor.author Tranouez, Pierrick
dc.contributor.author Antelme, Daniel
dc.contributor.author Paquet, Thierry
dc.date.accessioned 2020-03-03T12:36:02Z
dc.date.available 2020-03-03T12:36:02Z
dc.date.issued 2019-05
dc.identifier.citation Kettunen, K, Ruokolainen, T, Liukkonen, E S, Tranouez, P, Antelme, D & Paquet, T 2019, Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software. in Proceedings of DaTECH 2019. The Association for Computing Machinery, DATeCH 2019, Brussels, Belgium, 08/05/2019. https://doi.org/10.1145/3322905.3322911
dc.identifier.citation conference
dc.identifier.other PURE: 127487381
dc.identifier.other PURE UUID: 7d4f772c-42e3-45e1-8b28-62a83410cf62
dc.identifier.other ORCID: /0000-0003-2747-1382/work/70950328
dc.identifier.other ORCID: /0000-0001-7454-5300/work/70951934
dc.identifier.uri http://hdl.handle.net/10138/312739
dc.description.abstract This paper describes first large scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF) using data of one newspaper, Uusi Suometar 1869–1898. The historical digital newspaper archive environment of the NLF is based on commercial docWorks software. The software is capable of article detection and extraction, but our material does not seem to behave well in the system in this respect. Therefore, we have been in search of an alternative article segmentation system and have now focused our efforts on the PIVAJ machine learning based platform developed at the LITIS laboratory of University of Rouen Normandy. As training and evaluation data for PIVAJ we chose one newspaper, Uusi Suometar. We established a data set that contains 56 issues of the newspaper from years 1869–1898 with 4 pages each, i.e. 224 pages in total. Given the selected set of 56 issues, our first data annotation and experiment phase consisted of annotating a subset of 28 issues (112 pages) and conducting preliminary experiments. After the preliminary annotation and annotation of the first 28 issues accordingly. Subsequently, we annotated the remaining 28 issues. We then divided the annotated set into training and evaluation sets of 168 and 56 pages. We trained PIVAJ successfully and evaluated the results using the layout evaluation software developed by PRImA research laboratory of University of Salford. The results of our experiments show that PIVAJ achieves success rates of 67.9, 76.1, and 92.2 for the whole data set of 56 pages with three different evaluation scenarios introduced in [6]. On the whole, the results seem reasonable considering the varying layouts of the different issues of Uusi Suometar along the time scale of the data. fi
dc.format.extent 6
dc.language.iso eng
dc.publisher The Association for Computing Machinery
dc.relation.ispartof Proceedings of DaTECH 2019
dc.relation.isversionof 978-1-4503-7194-0
dc.rights unspecified
dc.rights.uri info:eu-repo/semantics/openAccess
dc.subject 113 Computer and information sciences
dc.subject 518 Media and communications
dc.title Detecting Articles in a Digitized Finnish Historical Newspaper Collection 1771–1929: Early Results Using the PIVAJ Software en
dc.type Conference contribution
dc.contributor.organization The National Library of Finland, Research Library
dc.description.reviewstatus Peer reviewed
dc.relation.doi https://doi.org/10.1145/3322905.3322911
dc.rights.accesslevel openAccess
dc.type.version publishedVersion

Files in this item

Files Size Format View
3322905.3322911.pdf 1.092Mb PDF View/Open
