Browsing by Subject "digital humanities"

Now showing items 1-14 of 14
  • Tolonen, Mikko; Lahti, Leo; Ilomäki, Niko (2015)
    This article analyses publication trends in the field of history in early modern Britain and North America in 1470–1800, based on English Short-Title Catalogue (ESTC) data. Its major contribution is to demonstrate the potential of digitized library catalogues as an essential scholastic tool and part of reproducible research. We also introduce a novel way of quantitatively analysing a particular trend in book production, namely the publishing of works in the field of history. The study is also our first experimental analysis of paper consumption in early modern book production, and demonstrates in practice the importance of open-science principles for library and information science. Three main research questions are addressed: 1) who wrote history; 2) where history was published; and 3) how publishing changed over time in early modern Britain and North America. In terms of our main findings we demonstrate that the average book size of history publications decreased over time, and that the octavo-sized book was the rising star in the eighteenth century, which is a true indication of expanding audiences. The article also compares different aspects of the most popular writers on history, such as Edmund Burke and David Hume. Although focusing on history, these findings may reflect more widespread publishing trends in the early modern era. We show how some of the key questions in this field can be addressed through the quantitative analysis of large-scale bibliographic data collections.
  • Haq, Ehsan ul; Braud, Tristan; Kwon, Young D.; Hui, Pan (2020)
    Computational Politics is the study of computational methods to analyze and moderate users' behaviors related to political activities such as election campaign persuasion, political affiliation, and opinion mining. With the rapid development and ease of access to the Internet, Information Communication Technologies (ICT) have given rise to massive numbers of users joining online communities and the digitization of political practices such as debates. These communities and digitized data contain both explicit and latent information about users and their behaviors related to politics and social movements. For researchers, it is essential to utilize data from these sources to develop and design systems that not only provide solutions to computational politics but also help other businesses, such as marketers, to increase users' participation and interactions. In this survey, we attempt to categorize the main areas in computational politics and summarize the prominent studies in one place to better understand computational politics across different and multidimensional platforms, e.g., online social networks, online forums, and political debates. We then conclude this study by highlighting future research directions, opportunities, and challenges.
  • Tolonen, Mikko Sakari; Lahti, Leo Mikael (2015)
  • Tolonen, Mikko; Mäkelä, Eetu; Marjanen, Jani; Tahko, Tuuli (2020)
  • Lahti, Leo; Vaara, Ville; Marjanen, Jani; Tolonen, Mikko (University of Oulu, 2019)
    Studia humaniora Ouluensia
  • Lahti, Leo; Marjanen, Jani Pekka; Roivainen, Hege Henri Markus; Tolonen, Mikko Sakari (2019)
    National bibliographies have been identified as a crucial resource for historical research on the publishing landscape, but using them requires addressing challenges of data quality, completeness, and interpretation. We call this approach bibliographic data science. In this article, we briefly assess the development of book formats and the vernacularization process in early modern Europe. The work undertaken paves the way for more extensive integration of library catalogs to map the history of the book.
  • Tolonen, Mikko Sakari; Lahti, Leo (Gaudeamus, 2018)
  • Fridlund, Mats; Oiva, Mila; Paju, Petri (Helsinki University Press, 2020)
    Historical scholarship is currently undergoing a digital turn. All historians have experienced this change in one way or another, by writing on word processors, applying quantitative methods to digitized source materials, or using internet resources and digital tools. Digital Histories showcases this emerging wave of digital history research. It presents work by historians who – on their own or through collaborations with e.g. information technology specialists – have uncovered new, empirical historical knowledge through digital and computational methods. The topics of the volume range from the medieval period to the present day, including various parts of Europe. The chapters apply an exemplary array of methods, such as digital metadata analysis, machine learning, network analysis, topic modelling, named entity recognition, collocation analysis, critical search, and text and data mining. The volume argues that digital history is entering a mature phase, digital history ‘in action’, where its focus is shifting from the building of resources towards the making of new historical knowledge. This also involves novel challenges that digital methods pose to historical research, including awareness of the pitfalls and limitations of the digital tools and the necessity of new forms of digital source criticism. Through its combination of empirical, conceptual and contextual studies, Digital Histories is a timely and pioneering contribution taking stock of how digital research currently advances historical scholarship.
  • Tamper, Minna; Leal, Rafael; Sinikallio, Laura; Leskinen, Petri; Tuominen, Jouni; Hyvönen, Eero (CEUR-WS.org, 2022)
    CEUR Workshop Proceedings
    This paper presents knowledge extraction and natural language processing methods used to enrich the knowledge graph of the plenary debates (textual transcripts of speeches) of the Parliament of Finland. This knowledge graph includes some 960 000 speeches (1907–2021) interlinked with a prosopographical knowledge graph about the politicians. A recent subset of the speeches was used to extract named entities and topical keywords for semantic searching and browsing of the data and for data analysis. The process is based on linguistic analysis, named entity linking, and automatic subject indexing. The results were included in the ParliamentSampo knowledge graph in a SPARQL endpoint. This data can be used for studying parliamentary language and culture in Digital Humanities research and for developing applications, such as the ParliamentSampo portal.
  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (University of Helsinki, 2021)
  • Sarv, Mari; Kallio, Kati; Janicki, Maciej Michal; Mäkelä, Eetu (ICL CAS, 2021)
    This article represents a first step in the corpus-based study of metric variation in Finnic runosong, a poetic tradition shared by several Finnic peoples and documented extensively in the 19th and 20th centuries. Runosong metre has generally been assumed to be a syllabic tetrametric trochee with specific rules about the placement of stressed syllables according to their quantity: long stressed syllables occupy the strong positions in the trochaic schema while short stressed syllables appear in the weak positions. Recent studies by Mari Sarv (2008, 2015, 2019) of Estonian runosong metre have shown, however, that due to linguistic changes, it has gradually lost its quantitative properties and acquired the features of accentual metre. Using computational methods, this study aims to give a preliminary overview of the extent of metric variation on the quantitative-accentual scale across the entire Finnic runosong area. After an approximate syllabification, we apply two separate indirect methods for estimating variation. These appear to generate coherent results: quantitative runosong metre dominates in the north-east and has gradually been replaced by accentual runosong metre towards the south-west. Subsequent studies should verify these results through more precise and detailed investigations.
  • Drobac, Senka (Helsingin yliopisto, 2020)
    The corpus of historical newspapers and journals published in Finland, with more than 11 million pages of historical text, is of great value to the research community. The National Library of Finland (NLF) has OCRed the corpus with ABBYY FineReader, a commercial software package that provides OCR models pre-trained on general historical fonts. The estimated accuracy of the OCRed text is between 87% and 92% on the character level, which is rather low even for scientific research. Optical character recognition of printed text commonly reaches over 99% accuracy for modern Latin fonts. Historical documents, on the other hand, contain a large variety of fonts, can be in poor condition, and are often written without an orthographic standard (the same words are spelled differently). All these reasons present a challenge to creating robust and highly accurate OCR models for historical data. The corpus of historical newspapers and journals published in Finland is particularly challenging because it is written in both the official languages of Finland (Finnish and Swedish) and is printed in two font-families (Blackletter and Antiqua). With two main languages and a large number of different fonts from two font-families, it is not possible to achieve high OCR accuracy with models pre-trained on different materials. A research group at the NLF has worked on re-OCRing this corpus and they have trained OCR models using the open-source software Tesseract, but only for the Finnish Blackletter part of the corpus. They report high accuracy results (97.64% on character level) for Finnish Blackletter but also slow performance. For the Antiqua part of the corpus, they reportedly use Tesseract's pre-trained Antiqua model, but they do not report any accuracy results. Also, they have still not published any work done on the material written in Swedish.
    In this work, we have explored methods and practices for training high-accuracy OCR models that can be used for efficiently recognizing the entire corpus of historical Finnish newspapers and journals. We selected 13,000 Finnish and 11,000 Swedish text lines from the corpus, of which half are printed in Blackletter and half in Antiqua fonts. After transcribing these lines, we used them for training and testing OCR models with two open-source OCR tool-kits, Ocropy and Calamari. We performed experiments with different training data setups, along with different neural network configurations and architectures. Furthermore, we tested how the voting mechanism behaves with different OCR models. Post-correction can further improve the final OCR results, especially in cases when the text, due to material damage or ink bleed, is incomprehensible without a broader context. Therefore, we have also explored different post-correction methods and implemented one of them. We compared the method's effect on OCR results of different accuracy. The biggest accomplishment of this work is succeeding in training a high-accuracy model that is capable of recognizing both Finnish and Swedish text, as well as Blackletter and Antiqua fonts. Having a mixed model for all the data and not needing to separately perform language or font identification is extremely practical when dealing with such a large corpus. Furthermore, we found that the results improve when voting with five mixed models, resulting in accuracy between 97.2% and 98.4% on the character level, which is up to 11% better than the current ABBYY results. Finally, the post-correction experiments showed that, even with a simple automatic method, post-correction can further improve OCR results. Depending on the starting OCR accuracy, the post-correction improved accuracy by 0.1-0.4 percentage points, which is a relative improvement of 0.9-12.5%.
  • Tolonen, Mikko; Mäkelä, Eetu; Lahti, Leo (2022)
  • Rantala, Heikki; Ikkala, Esko; Jokipii, Ilkka; Hyvönen, Eero (2022)
    This article presents the semantic portal and Linked Open Data service WARVICTIMSAMPO 1914-1922 about the war victims, battles, and prisoner camps in the Finnish Civil and other wars in 1914-1922. The system is based on a database of the National Archives of Finland and additional related data created, compiled, and linked during the project. The system contains detailed information about some 40,000 deaths extracted from several data sources and data about over 1,000 battles of the Civil War. A key novelty of WARVICTIMSAMPO 1914-1922 is the integration of ready-to-use Digital Humanities visualizations and data analysis tooling with semantic faceted search and data exploration, which allows, e.g., studying data about wider prosopographical groups in addition to individual war victims. The article focuses on demonstrating how the tools of the portal, as well as the underlying SPARQL endpoint openly available on the Web, can be used to explore and analyze war history in flexible and visual ways. WARVICTIMSAMPO 1914-1922 is a new member in the series of "Sampo" model-based semantic portals. The portal is in use and has had 23,000 users, including both war historians and the general public seeking information about their deceased relatives.
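
Note on the OCR figures in the Drobac (2020) abstract above: the abstract quotes a small gain in accuracy alongside a much larger "relative improvement". The sketch below shows one way the two can fit together, assuming that "relative improvement" means the share of remaining character errors removed by post-correction. The abstract does not define the term, so this reading is an assumption. The function name and the starting accuracies (88.9% and 96.8%) are hypothetical, chosen only because they are consistent with the reported range.

```python
def relative_error_reduction(acc_before: float, acc_after: float) -> float:
    """Return the percentage of character errors removed by a correction step.

    Both arguments are character-level accuracies in percent (0-100).
    Assumption: "relative improvement" is measured on the error rate,
    i.e. (errors_before - errors_after) / errors_before.
    """
    err_before = 100.0 - acc_before
    err_after = 100.0 - acc_after
    if err_before <= 0:
        raise ValueError("accuracy before correction must be below 100%")
    return 100.0 * (err_before - err_after) / err_before


if __name__ == "__main__":
    # Hypothetical low-accuracy input: a gain of 0.1 percentage points.
    print(round(relative_error_reduction(88.9, 89.0), 1))
    # Hypothetical high-accuracy input: a gain of 0.4 percentage points.
    print(round(relative_error_reduction(96.8, 97.2), 1))
```

Under this reading, the same absolute gain counts for more when the starting accuracy is already high, because fewer errors remain to be removed.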