Hussain, Zafar
(Helsingin yliopisto, 2020)
The National Library of Finland has digitized newspapers dating from the late eighteenth century onward.
The digitized Finnish newspaper data is a heterogeneous data set that contains both the content and
the metadata of historical newspapers. This research focuses on studying this rich materiality data
to find a data-driven categorization of newspapers. Since the data is not known beforehand, the
objective is to understand the development of newspapers and use statistical methods to analyze
the fluctuations in the attributes of this metadata. An important aspect of this research work is to
study the computational and statistical methods which can better express the complexity of Finnish
historical newspaper metadata. Exploratory analyses are performed to get an understanding of the
attributes and to extract patterns among them. To explicate the attributes’ dependencies on each
other, Ordinary Least Squares and Linear Regression methods are applied. The results of these
regression methods confirm a significant correlation between the attributes. To categorize the
data, spectral and hierarchical clustering methods are studied for grouping the newspapers with
similar attributes. The clustered data further helps in dividing and understanding the data over time
and place. Decision trees are constructed to split the newspapers according to logical divisions of
the attributes. The results of the Random Forest decision trees show the development paths of the attributes.
The goal of applying various methods is to get a comprehensive interpretation of the attributes’
development based on language, time, and place and evaluate the usefulness of these methods on the
newspaper data. In terms of features, area appears to be the most important one, and the
language-based comparison shows that Swedish-language newspapers were ahead of Finnish-language
newspapers in adopting the popular trends of the time. When the publishing places are divided into
regions, small towns show more fluctuations in publishing trends, while from the perspective of time
the second half of the twentieth century saw a large increase in newspapers and publishing trends. This research
work coordinates information on regions, language, page size, density, and area of newspapers and
offers a robust statistical analysis of newspapers published in Finland.
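
As a rough illustration of the regression step described above, the following Python sketch fits an ordinary least squares line between two numeric attributes and reports their Pearson correlation. It is not the implementation used in this work: the function name `ols_fit`, the attribute names, and the toy values are assumptions made purely for illustration.

```python
from math import sqrt


def ols_fit(x, y):
    """Fit y = intercept + slope * x by ordinary least squares.

    Returns (slope, intercept, r), where r is the Pearson correlation.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length, at least 2")

    mean_x = sum(x) / n
    mean_y = sum(y) / n

    # Sums of squared deviations and of cross-deviations from the means.
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    if sxx == 0 or syy == 0:
        raise ValueError("x and y must each contain some variation")

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)
    return slope, intercept, r


# Synthetic toy values, NOT taken from the newspaper metadata.
# The attribute names are hypothetical stand-ins for two numeric attributes.
page_area = [1.0, 2.0, 3.0, 4.0]
text_density = [2.1, 3.9, 6.2, 7.8]

slope, intercept, r = ols_fit(page_area, text_density)
print(f"slope={slope:.3f} intercept={intercept:.3f} r={r:.3f}")
```

A correlation close to 1 or -1 in such a fit would suggest a strong linear dependency between the two attributes, although judging its significance requires a separate test.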