OCR and post-correction of historical newspapers and journals

Permalink: http://urn.fi/URN:ISBN:978-951-51-6512-1
Title: OCR and post-correction of historical newspapers and journals
Author: Drobac, Senka
Contributor: University of Helsinki, Faculty of Arts
Doctoral Programme in Language Studies
Publisher: Helsingin yliopisto
Date: 2020-10-09
URI: http://urn.fi/URN:ISBN:978-951-51-6512-1
http://hdl.handle.net/10138/319496
Thesis level: Doctoral dissertation (article-based)
Abstract: The corpus of historical newspapers and journals published in Finland, with more than 11 million pages of historical text, is of great value to the research community. The National Library of Finland (NLF) has OCRed the corpus with ABBYY FineReader, a commercial software package that provides OCR models pre-trained on general historical fonts. The estimated accuracy of the OCRed text is between 87% and 92% on the character level, which is rather low for scientific research. Optical character recognition (OCR) of printed text commonly reaches over 99% accuracy for modern Latin fonts. Historical documents, on the other hand, contain a large variety of fonts, can be in poor condition, and are often written without an orthographic standard (the same word may be spelled in several ways). All of these factors make it challenging to create robust and highly accurate OCR models for historical data. The corpus of historical newspapers and journals published in Finland is particularly challenging because it is written in both official languages of Finland (Finnish and Swedish) and printed in two font families (Blackletter and Antiqua). With two main languages and a large number of different fonts from two font families, it is not possible to achieve high OCR accuracy with models pre-trained on different materials. A research group at the NLF has worked on re-OCRing this corpus and has trained OCR models using the open-source software Tesseract, but only for the Finnish Blackletter part of the corpus. They report high accuracy (97.64% on the character level) for Finnish Blackletter but also slow performance. For the Antiqua part of the corpus, they reportedly use Tesseract's pre-trained Antiqua model but do not report any accuracy results. They have also not yet published any work on the material written in Swedish.
In this work, we have explored methods and practices for training high-accuracy OCR models that can be used to recognize the entire corpus of historical Finnish newspapers and journals efficiently. We selected 13,000 Finnish and 11,000 Swedish text lines from the corpus, of which half are printed in Blackletter and half in Antiqua fonts. After transcribing these lines, we used them for training and testing OCR models with two open-source OCR toolkits, Ocropy and Calamari. We performed experiments with different training data setups, along with different neural network configurations and architectures. Furthermore, we tested how the voting mechanism behaves with different OCR models. Post-correction can further improve the final OCR results, especially when the text, due to material damage or ink bleed, is incomprehensible without a broader context. Therefore, we have also explored different post-correction methods and implemented one of them. We compared the method's effect on OCR output of varying accuracy. The biggest accomplishment of this work is succeeding in training a high-accuracy model that is capable of recognizing both Finnish and Swedish text, as well as Blackletter and Antiqua fonts. Having a mixed model for all the data and not needing to separately perform language or font identification is extremely practical when dealing with such a large corpus. Furthermore, we found that the results improve when voting with five mixed models, resulting in accuracy between 97.2% and 98.4% on the character level, which is up to 11 percentage points better than the current ABBYY results. Finally, the post-correction experiments showed that, even with a simple automatic method, post-correction can further improve OCR results. Depending on the starting OCR accuracy, post-correction improved accuracy by 0.1 to 0.4 percentage points, which is a relative improvement of 0.9% to 12.5%.
The corpus of historical newspapers and journals published in Finland, with more than 11 million pages of historical text, is of great value to the research community. The National Library of Finland (NLF) has performed optical character recognition (OCR) on the corpus with ABBYY FineReader, a commercial software package that provides OCR models pre-trained on general historical fonts. The estimated accuracy of the OCRed text is between 87% and 92% on the character level, which is rather low for scientific research. OCR of printed text commonly reaches over 99% accuracy for modern Latin fonts. Historical documents, on the other hand, contain a large variety of fonts, can be in poor condition, and are often written without an orthographic standard (the same word may be spelled in several ways). All of these factors make it challenging to create robust and highly accurate OCR models for historical data. The corpus of historical newspapers and journals published in Finland is particularly challenging because it is written in both official languages of Finland (Finnish and Swedish) and printed in two font families (Blackletter and Antiqua). With two main languages and a large number of different fonts from two font families, it is not possible to achieve high OCR accuracy with models pre-trained on different materials. In this work, we have explored methods and practices for training high-accuracy OCR models that can be used to recognize the entire corpus of historical Finnish newspapers and journals efficiently. We performed experiments with different training data setups, along with different neural network configurations and architectures. Furthermore, we tested how the voting mechanism behaves with different OCR models and checked whether post-correction can further improve the final OCR results.
The biggest accomplishment of this work is succeeding in training a high-accuracy model that is capable of recognizing both Finnish and Swedish text, as well as Blackletter and Antiqua fonts. Having a mixed model for all the data and not needing to separately perform language or font identification is extremely practical when dealing with such a large corpus. We also managed to drastically improve OCR accuracy, to between 97.2% and 98.4% on the character level, which is up to 11 percentage points better than the current ABBYY results. Depending on the starting OCR accuracy, post-correction further improved accuracy by 0.1 to 0.4 percentage points, which is a relative improvement of 0.9% to 12.5%.
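The abstract reports character-level accuracy alongside both an absolute and a relative improvement, but it does not give the formulas behind those figures. The following is a minimal Python sketch of one common way to compute such numbers. It rests on two assumptions: that character accuracy is derived from the edit distance to a ground-truth transcription, and that "relative improvement" means the share of remaining character errors removed. The function names, the sample text line, and the sample accuracies are hypothetical illustrations, not values or code taken from the thesis.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_accuracy(ground_truth: str, ocr_output: str) -> float:
    """Character accuracy in percent, assumed here to be
    100 * (1 - edit_distance / length of the ground truth)."""
    if not ground_truth:
        raise ValueError("ground truth must not be empty")
    errors = levenshtein(ground_truth, ocr_output)
    return 100.0 * (1.0 - errors / len(ground_truth))


def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Share of the remaining character errors that were removed, in percent.
    Assumed interpretation of 'relative improvement'."""
    error_before = 100.0 - acc_before
    error_after = 100.0 - acc_after
    return 100.0 * (error_before - error_after) / error_before


# Hypothetical line with one misrecognized character (s read as f).
print(character_accuracy("Helsingin yliopisto", "Helsingin yliopifto"))  # about 94.7

# Hypothetical accuracies: a gain of 0.4 percentage points from 96.8%
# removes about 12.5% of the remaining errors.
print(relative_error_reduction(96.8, 97.2))  # about 12.5
```

Under this interpretation, the same absolute gain counts for more when the starting accuracy is already high, because fewer errors remain to be corrected.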
Subject: digital humanities
Rights: This publication is copyrighted. You may download, display and print it for Your own personal use. Commercial use is prohibited.


Files in this item

File          Size     Format
OCRandpo.pdf  8.140Mb  PDF
