A Method for Wavelet-Based Time Series Analysis of Historical Newspapers

Show full item record



Permalink

http://urn.fi/URN:NBN:fi:hulib-202001211112
Title: A Method for Wavelet-Based Time Series Analysis of Historical Newspapers
Author: Avikainen, Jari
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2019
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202001211112
http://hdl.handle.net/10138/310021
Thesis level: master's thesis
Degree program: Tietojenkäsittelytieteen maisteriohjelma
Master's Programme in Computer Science
Magisterprogrammet i datavetenskap
Specialisation: Algoritmit
Algorithms
Algoritmer
Discipline: none
Abstract: This thesis presents a wavelet-based method for detecting moments of fast change in the textual contents of historical newspapers. The method works by generating time series of the relative frequencies of different words in the newspaper contents over time, and calculating their wavelet transforms. Wavelet transform is essentially a group of transformations describing the changes happening in the original time series at different time scales, and can therefore be used to pinpoint moments of fast change in the data. The produced wavelet transforms are then used to detect fast changes in word frequencies by examining products of multiple scales of the transform. The properties of the wavelet transform and the related multi-scale product are evaluated in relation to detecting various kinds of steps and spikes in different noise environments. The suitability of the method for analysing historical newspaper archives is examined using an example corpus consisting of 487 issues of Uusi Suometar from 1869–1918 and 250 issues of Wiipuri from 1893–1918. Two problematic features in the newspaper data, noise caused by OCR (optical character recognition) errors and uneven temporal distribution of the data, are identified and their effects on the results of the presented method are evaluated using synthetic data. Finally, the method is tested using the example corpus, and the results are examined briefly. The method is found to be adversely affected especially by the uneven temporal distribution of the newspaper data. Without additional processing, or improving the quality of the examined data, a significant amount of the detected steps are due to the noise in the data. Various ways of alleviating the effect are proposed, among other suggested improvements on the system.


Files in this item

Total number of downloads: Loading...

Files Size Format View
Avikainen_Jari_Pro_gradu_2019.pdf 1.264Mb PDF View/Open

This item appears in the following Collection(s)

Show full item record