  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (2021)
    We study neologism use in two samples of early English correspondence, from 1640-1660 and 1760-1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while male letter-writers tend to use neologisms more frequently than women, the eighteenth century seems to have provided more opportunities for women and the lower ranks to participate in neologism use as well. In both samples, neologisms most frequently occur in letters written between close friends, which could be due to this less stable relationship triggering more creative language use. In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing, as correspondence is increasingly being used as a tool for building and maintaining social relationships in addition to exchanging information.
  • Hämäläinen, Mika; Partanen, Niko; Alnajjar, Khalid (University of Helsinki, 2021)
  • Tiedemann, Jörg (2021)
    This paper presents our ongoing efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show that effective multilingual translation models can be trained with skewed training data, but they also highlight the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.
  • Partanen, Niko; Jalava, Lotta (2021)
    The article describes the creation of a facsimile edition of Nykysuomen sanakirja (the Dictionary of Modern Finnish) and the work stages involved. It also describes a download package containing the recognised line-level texts and styles. Together they make it possible to create various electronic versions and research datasets in the future, but at present they are only one step in this work. The study serves as an example of modern text recognition applied to dictionary material and critically evaluates the results, making it possible to apply the same practices to other comparable materials. The proofread materials and text recognition models described will be published alongside the facsimile edition of the dictionary.
  • Jantunen, Tommi; Rousi, Rebekah; Rainò, Päivi; Turunen, Markku; Moeen Valipoor, Mohammad; García, Narciso (2021)
    This article discusses the prerequisites for the machine translation of sign languages. The topic is complex, including questions relating to technology, interaction design, linguistics and culture. At the moment, despite the affordances provided by the technology, automated translation between signed and spoken languages – or between sign languages – is not possible. The very need for such translation and its associated technology can also be questioned. Yet, we believe that contributing to the improvement of sign language detection, processing and even sign language translation to spoken languages in the future is a matter that should not be abandoned. However, we argue that this work should focus on all necessary aspects of sign languages and sign language user communities. Thus, a more diverse and critical perspective on these issues is needed in order to avoid the generalisations and biases that are often manifested within dominant research paradigms, particularly in research on spoken languages and speech communities.
  • Trosterud, Trond; Moshagen, Sjur (2021)
    The article discusses the correction of typos caused by erroneous use of the so-called soft sign in Skolt Sami, which is one of the most common orthographic symbols and the most common source of typographic errors. The discussion is based upon the suggestion mechanism of an existing open source Skolt Sami speller. It shows that with an improved suggestion mechanism, the speller is able to restore a single missing soft sign in over 97 % of the cases, and to remove a hypercorrect soft sign as the first correction in 90 % of the cases. When the target form is allowed to be within the top 5, the correction performance is well above 99 %. Improving the suggestion mechanism also had a positive impact on the speller's overall performance, raising the percentage of target forms within the top 5 from 74.1 % to 84.7 %.
  • Blokland, Rogier; Partanen, Niko; Rießler, Michael (2021)
    In this paper we analyse an epic song, performed by Ulita Koskova in 1966 in Kolva in the Komi ASSR and recorded by the Hungarian-Australian researcher Erik Vászolyi, and discuss its background and wider historical context. We look at different ways in which such material can contribute to data-driven and sociolinguistically oriented research, specifically in connection with contemporary documentary linguistics, and point to directions for further research.
  • Da Silva Facundes, Sidney; Fernanda Pereira de Freitas, Marília; Soares de Lima-Padovani, Bruna Fernanda (2021)
    Apurinã (Arawak), spoken along several tributaries of the Purus River (Southwest of Amazonas State, Brazil), has a morphological plural system that marks pronouns and nouns. The language has some free pronominal forms that distinguish singular from plural; additionally, it has bound pronominal forms, with the singular/plural distinction made only in the first person for the enclitic forms. In the case of nouns, there are two suffixes that mark plural: -waku, which occurs only with [+human] nouns, as in kyky-waku-ry (man-pl-m) ‘men’, and -ny, which can occur both with [+human] nouns, as in pupỹka-ry-ny-ry (indigenous person-m-pl-m) ‘indigenous people’, and with [-human] nouns, as in aiku-ny-ry (house-pl-m) ‘houses’. The language also has some quantifiers and numerals that encode number syntactically. The quantifiers are ithu, kaiãu and kuna kamuny, encoding the notion of ‘much’; puiãu, referring to ‘some/few/little’; and ykyny, meaning ‘all/every’. Additionally, there are the following numerals: (h)ãty(tu) ‘one’ and epi ‘two’, which combine to derive higher numbers, and the word for ‘hand’, waku/ piu, indicating the numeral five. Thus, plural in the language can be marked in different ways, none of which is, however, required by the grammar. With that in mind, we discuss the extent to which plural marking is constructed by the speakers in daily language use, according to whether it is contextually important to do so, and raise the question of the relevance of this problem to a computationally implementable grammar of the language.
  • Saarikivi, Janne (2021)
    The question of how linguistic and archaeological data can be combined to create a comprehensive account of the prehistory of present ethnicities is debated around the globe. In particular, identifying new language groups in the material remnants of a particular area, or discerning in the material culture correlates for the language contact periods reflected in loanword layers, are complex and often probably unsolvable questions. Regarding the early history of the Finns and related peoples, Valter Lang’s new monograph on the archaeology of Estonia and the “arrivals of the Finnic people” (Läänemeresoome tulemised, 2018) has been considered a paradigm-changing work in this respect. In my article I argue that despite the undisputed progress made in this oeuvre, many of the old questions regarding time, place and method are still in place.
  • Bradley, Jeremy; Skribnik, Elena (2021)
    The paper at hand presents the recently published COPIUS Orthographic Toolset’s Mansi module. This open-source software, part of the COPIUS drive to create necessary international infrastructures for teaching/learning and researching Uralic languages, allows for rule-based transcription between four basic writing systems historically used for Mansi: the Cyrillic alphabet, the Latin-based Unified Northern Alphabet (UNA), Finno-Ugric Transcription (FUT), and the International Phonetic Alphabet (IPA). The software aims to take variation in the usage of these respective writing systems into consideration as best possible in a purely rule-based approach currently lacking lexical support. Section 1 will give a short summary of the history of Mansi literacy and aims to elucidate how changing trends, both local and Russia-wide, influenced the manner in which Mansi was captured in writing by scientists and speakers throughout history. Section 2 will give an overview of (Northern) Mansi phonology and discuss how difficult aspects of it are handled in the writing systems under consideration. Finally, Section 3 will illustrate the transcription software, in its current version, in action, with a sample text transcribed from each of the four writing systems under consideration into the three other ones.
  • Hämäläinen, Mika (2021)
    The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
  • Hulden, Mans; Silfverberg, Miikka (2021)
    We design an FST-driven computational method to calculate the minimal number of nominal forms—the principal parts—one must know to be able to fully inflect a lexeme in standard Finnish. To do this, we model the nominal inflection pattern as an FST according to the KOTUS inflectional classes. Our results show that knowing five forms always suffices to uniquely determine a nominal’s inflectional class, and to subsequently correctly inflect all the remaining forms. This contrasts with most sources in the literature that tend to assume seven forms are needed.
  • Iwatsuki, Kenichi (2021)
    While scholarly papers in many disciplines are written in English, papers are also published in other languages. Formulaic expressions used in research articles have been studied, but past work has mainly focused on English formulaic expressions. In this study, we applied an existing formulaic expression extraction method, originally proposed for English papers, to the introduction sections of Japanese papers on natural language processing. The results show that the extraction is to some extent successful. However, the paucity of datasets of scholarly papers hinders the construction of a comprehensive list of formulaic expressions and comparison among multiple disciplines.
  • Juutinen, Markus; Mettovaara, Jukka (2021)
    We provide an overview of indefinite pronouns in Saami languages that have been borrowed or calqued from Finnic, Scandinavian or Russian. We define indefinite pronouns in the traditional way, i.e. encompassing all pronouns not belonging to any other pronoun class. The treatment of Saami indefinite pronouns in earlier literature varies, but generally they have not received as much attention as other pronouns. From Finnic sources, Saami languages have borrowed e.g. the pronouns harva ‘few’, joku ‘some(one)’, kaikki ‘all’, moni ‘many’ and muu ‘other’ as well as the pronominal elements ikänänsä ‘-ever’, saati ‘let alone’ and vaikka ‘even (if)’. Loans from Scandinavian include e.g. mange ‘many’, noen ~ någon ‘some’ and same ~ samma ‘same’. Russian loans include the pronominal elements ни- ‘not (even)’ and хоть ‘even (if)’. Indefinite pronouns in Saami prove to be a rather open class, and elements with similar meanings have been borrowed time after time. The variation is especially abundant in pronouns of indifference and free choice. Most of the pronouns in our data have been noted as loans before, but there are some previously unnoticed cases. These in particular warrant further study.
  • Nevalainen, Terttu (2021)
    This paper analyses language users’ participation in real-time grammatical change. The question addressed is the extent to which individuals continue using both the incoming form and the recessive, outgoing form as opposed to using one of them categorically. Variable grammars are related to the sociolinguistic discussion of whether language change is a generational or a communal process. Ultimately, they also raise the question of the predictability of real-time language change.
  • Alnajjar, Khalid (2021)
    Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages, for which such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and universal dependencies. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the universal dependencies and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with the resource-rich languages, and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis models, which achieved high accuracies. All our cross-lingual word embeddings and sentiment analysis models will be released openly via an easy-to-use Python library.
  • Пунегова, Галина (2021)
    In a written text, the voice itself and its acoustics cannot be shown directly. For this reason, the distinctive features of a character's manner of speaking are depicted through the author's words; quite often, the type of speech can also be learned from the characters' own speech and from their assessment of it. The article examines and presents the sound of a character's speech and its loudness, speech tempo and rhythm, and everything else that affects the type of the character's speech. It concludes that the author's writing skill is of great importance for conveying spoken speech in writing accurately, fully and beautifully.
  • Swanson, Daniel; Howell, Nick (2021)
    This paper presents lexd, a lexicon compiler for languages with non-suffixational morphology, which is intended to be faster and easier to use than existing solutions while also being compatible with other tools. We perform a case study for Chukchi, comparing against a hand-optimised analyser written in lexc, and find that while lexd is easier to use, performance remains an obstacle to its use at production level. We also compare the performance of lexd and hfst-lexc on three analysers still in the prototype phase, finding that lexd is at least as fast to compile, and sometimes faster; we conclude that it is a reasonable choice for prototyping new analysers. Future work will explore how to move lexd performance toward production grade.
  • Pirinen, Tommi A; Tyers, Francis M. (2021)
    Digital infrastructures are a vital part of the support needed to provide a research framework and platform for engineering digital lexicography and grammars and for deploying them to end-users as real NLP software products.
  • Цыпанов, Йöлгинь (2021)
    Translation studies has emerged in modern linguistics as a branch that aims at the comprehensive study of translation processes from one language into others from different perspectives. For the Finno-Ugric languages of Russia, this branch of science is taking its first steps. The purpose of this paper is to consider lexical and semantic language errors in the Komi translation of P. A. Sorokin's autobiography, identified by systematic comparison of text fragments in English, Russian and Komi. The material of the study consists of the texts of P. A. Sorokin's autobiography, published as separate books in different years. The language errors found in the Komi translation of the autobiography of the world-famous sociologist and native of the Komi region, Pitirim A. Sorokin, published as a separate book in Syktyvkar in 2013, are considered here for the first time. The errors are analyzed by comparison with the English-language original and with the Russian translation of the same book, published in Syktyvkar in 1991. Analysis of the Komi text (the first 40 pages of the autobiography) allows us to conclude that the Komi translation was made not from the original language, as recorded in the bibliographic description, but from the Russian translation, as most translation errors in the Russian text were carried over into the Komi one.