Evaluating the Robustness of Embedding-based Topic Models to OCR Noise

Permalink

http://hdl.handle.net/10138/337045

Citation

Zosa, E, Granroth-Wilding, M, Mutuvi, S & Doucet, A 2021, Evaluating the Robustness of Embedding-based Topic Models to OCR Noise. in H-R Ke, C S Lee & K Sugiyama (eds), Towards Open and Trustworthy Digital Societies. ICADL 2021. Lecture Notes in Computer Science, vol. 13133, Springer, Cham, pp. 392-400, International Conference on Asia-Pacific Digital Libraries, 01/12/2021. https://doi.org/10.1007/978-3-030-91669-5_30

Title: Evaluating the Robustness of Embedding-based Topic Models to OCR Noise
Author: Zosa, Elaine; Granroth-Wilding, Mark; Mutuvi, Stephen; Doucet, Antoine
Other contributor: Ke, Hao-Ren; Lee, Chei Sian; Sugiyama, Kazunari
Contributor organization: Department of Computer Science; Discovery Research Group/Prof. Hannu Toivonen
Publisher: Springer
Date: 2021-11-30
Language: eng
Number of pages: 9
Belongs to series: Towards Open and Trustworthy Digital Societies. ICADL 2021
Belongs to series: Lecture Notes in Computer Science
ISBN: 978-3-030-91668-8; 978-3-030-91669-5
ISSN: 0302-9743
DOI: https://doi.org/10.1007/978-3-030-91669-5_30
URI: http://hdl.handle.net/10138/337045
Abstract: Unsupervised topic models such as Latent Dirichlet Allocation (LDA) are popular tools to analyse digitised corpora. However, the performance of these tools has been shown to degrade with OCR noise. Topic models that incorporate word embeddings during inference have been proposed to address the limitations of LDA, but these models have not seen much use in historical text analysis. In this paper we explore the impact of OCR noise on two embedding-based models, Gaussian LDA and the Embedded Topic Model (ETM), and compare their performance to LDA. Our results show that these models, especially ETM, are slightly more resilient than LDA in the presence of noise in terms of topic quality and classification accuracy.
Subject: 113 Computer and information sciences; topic modelling; word embeddings; OCR
Peer reviewed: Yes
Rights: unspecified
Usage restriction: openAccess
Self-archived version: acceptedVersion


Files in this item

Files Size Format View
ICADL_OCR_impact.pdf 361.2Kb PDF View/Open
