Drobac, Senka
(Helsingin yliopisto, 2020)
The corpus of historical newspapers and journals published in Finland, with more than 11 million pages of historical text, is of great value to the research community. The National Library of Finland (NLF) has OCRed the corpus with ABBYY FineReader, a commercial software product that provides OCR models pre-trained on general historical fonts. The estimated accuracy of the OCRed text is between 87% and 92% at the character level, which is rather low for reliable scientific research.
Optical character recognition of printed text commonly reaches over 99% accuracy for modern Latin fonts. Historical documents, on the other hand, contain a wide variety of fonts, may be in poor physical condition, and were often written without an orthographic standard (the same word may be spelled in several different ways). All of these factors make it challenging to create robust, highly accurate OCR models for historical data.
The corpus of historical newspapers and journals published in Finland is particularly challenging because it is written in both of Finland's official languages (Finnish and Swedish) and is printed in two font families (Blackletter and Antiqua). With two main languages and a large number of different fonts from two font families, it is not possible to achieve high OCR accuracy with models pre-trained on other materials.
A research group at the NLF has worked on re-OCRing this corpus and has trained OCR models with the open-source software Tesseract, but only for the Finnish Blackletter part of the corpus. They report high accuracy (97.64% at the character level) for Finnish Blackletter, but also slow processing. For the Antiqua part of the corpus, they reportedly use Tesseract's pre-trained Antiqua model, but they do not report any accuracy results. Furthermore, they have not yet published any work on the Swedish-language material.
In this work, we have explored methods and practices for training high-accuracy OCR models that can efficiently recognize the entire corpus of historical Finnish newspapers and journals. We selected 13,000 Finnish and 11,000 Swedish text lines from the corpus, half of them printed in Blackletter and half in Antiqua fonts. After transcribing these lines, we used them to train and test OCR models with two open-source OCR toolkits, Ocropy and Calamari. We experimented with different training data setups as well as different neural network configurations and architectures, and we tested how the voting mechanism behaves with different OCR models.
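All accuracy figures in this work are reported at the character level. As a point of reference for how such figures are typically computed, below is a minimal sketch based on edit (Levenshtein) distance; the function names are illustrative and not taken from Ocropy or Calamari, both of which ship their own evaluation tooling.

```python
# Illustrative sketch: character-level accuracy via Levenshtein distance.
# Names are hypothetical; this is not the thesis's evaluation code.

def levenshtein(a: str, b: str) -> int:
    """Minimum number of character edits (insertions, deletions,
    substitutions) needed to turn string a into string b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

def char_accuracy(recognized: str, ground_truth: str) -> float:
    """1 - CER: the share of ground-truth characters recognized correctly."""
    if not ground_truth:
        return 1.0 if not recognized else 0.0
    return 1.0 - levenshtein(recognized, ground_truth) / len(ground_truth)

print(char_accuracy("Helsingfors Tidningar", "Helsingfors Tidningar"))  # 1.0
print(char_accuracy("He1singfors Tidningar", "Helsingfors Tidningar"))  # ~0.952
```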
Post-correction can further improve the final OCR results, especially where the text, due to material damage or ink bleed-through, is incomprehensible without broader context. We have therefore also explored different post-correction methods and implemented one of them, comparing its effect on OCR output of varying accuracy.
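The abstract does not name the implemented method, so the sketch below should be read only as one plausible shape of a simple automatic post-correction: tokens missing from a lexicon are replaced with the closest lexicon entry when the match is close enough. The lexicon contents and similarity cutoff are assumptions for illustration.

```python
import difflib

# Hypothetical sketch of lexicon-based OCR post-correction. A token not
# found in the lexicon is replaced by its most similar lexicon entry,
# provided the match clears a similarity threshold.

LEXICON = ["sanomalehti", "historia", "kirjasto"]  # placeholder word list
CUTOFF = 0.8  # assumed similarity threshold (1.0 = identical strings)

def correct_token(token: str) -> str:
    if token.lower() in LEXICON:
        return token
    match = difflib.get_close_matches(token.lower(), LEXICON, n=1, cutoff=CUTOFF)
    return match[0] if match else token  # leave unrecognized tokens alone

def post_correct(line: str) -> str:
    return " ".join(correct_token(t) for t in line.split())

print(post_correct("sanomalehli historia"))  # -> "sanomalehti historia"
```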
The biggest accomplishment of this work is the training of a single high-accuracy model capable of recognizing both Finnish and Swedish text, in both Blackletter and Antiqua fonts. Having one mixed model for all the data, with no need for separate language or font identification, is extremely practical when dealing with such a large corpus.
Furthermore, we found that the results improve when voting with five mixed models, yielding character-level accuracy between 97.2% and 98.4%, up to 11 percentage points better than the current ABBYY results.
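Calamari's own voting operates on the confidences of several models' network outputs; purely to illustrate the underlying idea, here is a minimal per-position majority vote over five hypotheses, assumed to be already aligned to equal length (a real voter must first align outputs of differing lengths, e.g. by edit distance).

```python
from collections import Counter

# Simplified illustration of voting over five aligned OCR hypotheses.
# This is not Calamari's confidence-voting implementation.

def vote(hypotheses):
    assert len({len(h) for h in hypotheses}) == 1, "hypotheses must be aligned"
    return "".join(
        Counter(chars).most_common(1)[0][0]  # most frequent character wins
        for chars in zip(*hypotheses)
    )

outputs = [
    "Suomen Kansa",
    "Suomen Kanfa",   # s/f (long-s) confusion in one model
    "Suomen Kansa",
    "Su0men Kansa",   # o/0 confusion in another
    "Suomen Kansa",
]
print(vote(outputs))  # -> "Suomen Kansa"
```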
Finally, the post-correction experiments showed that even a simple automatic method can further improve OCR results. Depending on the starting OCR accuracy, post-correction improved accuracy by 0.1-0.4 percentage points, a relative improvement of 0.9-12.5%.
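These relative figures are consistent with measuring the gain against the remaining error rather than against the accuracy itself. A worked instance, assuming purely for illustration a starting accuracy of 96.8%:

```latex
\text{relative improvement}
  = \frac{\text{accuracy gain}}{100\% - \text{accuracy}_{\text{before}}}
  = \frac{0.4\%}{100\% - 96.8\%}
  = \frac{0.4\%}{3.2\%}
  = 12.5\%
```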