  • Alnajjar, Khalid (2021)
    Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages, for which such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and Universal Dependencies treebanks. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the Universal Dependencies treebanks and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with those of the resource-rich languages, and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis models, which achieved high accuracies. All our cross-lingual word embeddings and sentiment analysis models will be released openly via an easy-to-use Python library.
  • Saarikivi, Janne (2021)
    The question of how linguistic and archaeological data can be combined to create a comprehensive account of the prehistory of present ethnicities is a debated issue around the globe. In particular, identifying new language groups in the material remnants of a particular area, or discerning material-culture correlates for the language contact periods reflected in loanword layers, are complex and often probably unsolvable questions. Regarding the early history of the Finns and related peoples, Valter Lang’s new monograph on the archaeology of Estonia and the “arrivals of the Finnic people” (Läänemeresoome tulemised, 2018) has been considered a paradigm-changing work in this respect. In my article I argue that despite the undisputed progress made in this oeuvre, many of the old questions regarding time, place and method still stand.
  • Hämäläinen, Mika (2021)
    The term low-resourced has been tossed around in the field of natural language processing to such a degree that almost any language that is not English can be called "low-resourced", sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I examine the relation between the endangered and the low-resourced on the basis of my own experiences.
  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (2021)
    We study neologism use in two samples of early English correspondence, from 1640-1660 and 1760-1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while male letter-writers tend to use neologisms more frequently than women, the eighteenth century seems to have provided more opportunities for women and the lower ranks to participate in neologism use as well. In both samples, neologisms most frequently occur in letters written between close friends, which could be due to this less stable relationship triggering more creative language use. In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing, as correspondence is increasingly being used as a tool for building and maintaining social relationships in addition to exchanging information.
  • Blokland, Rogier; Partanen, Niko; Rießler, Michael (2021)
    In this paper we analyse an epic song performed by Ulita Koskova in 1966 in Kolva in the Komi ASSR and recorded by the Hungarian-Australian researcher Erik Vászolyi, and discuss its background and wider historical context. We look at different ways in which such material can contribute to data-driven and sociolinguistically oriented research, specifically in connection with contemporary documentary linguistics, and point to directions for further research.
  • Hämäläinen, Mika; Partanen, Niko; Alnajjar, Khalid (University of Helsinki, 2021)