Multilingual Facilitation


Recent Submissions

  • Hämäläinen, Mika; Partanen, Niko; Alnajjar, Khalid (University of Helsinki, 2021)
  • Hämäläinen, Mika (2021)
    The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
  • Bradley, Jeremy; Skribnik, Elena (2021)
    This paper presents the recently published COPIUS Orthographic Toolset’s Mansi module. This open-source software, part of the COPIUS drive to create necessary international infrastructures for teaching/learning and researching Uralic languages, allows for rule-based transcription between four basic writing systems historically used for Mansi: the Cyrillic alphabet, the Latin-based Unified Northern Alphabet (UNA), Finno-Ugric Transcription (FUT), and the International Phonetic Alphabet (IPA). The software aims to take variation in the usage of these writing systems into consideration as well as possible in a purely rule-based approach that currently lacks lexical support. Section 1 gives a short summary of the history of Mansi literacy and aims to elucidate how changing trends, both local and Russia-wide, influenced the manner in which Mansi was captured in writing by scientists and speakers throughout history. Section 2 gives an overview of (Northern) Mansi phonology and discusses how difficult aspects of it are handled in the writing systems under consideration. Finally, Section 3 illustrates the current version of the transcription software in action, with a sample text transcribed from each of the four writing systems into the three other ones.
  • Alnajjar, Khalid; Hämäläinen, Mika (2021)
    Every NLP researcher has to work with different XML- or JSON-encoded files. This often involves writing code that serves a very specific purpose. Corpona is meant to streamline any workflow that involves XML- and JSON-based corpora by offering easy and reusable functionality. The current features cover easy parsing of and access to XML files, easy access to sub-items in a nested JSON structure, and visualization of complex data structures. Corpona is fully open-source and is available on GitHub and Zenodo.
  • Da Silva Facundes, Sidney; Fernanda Pereira de Freitas, Marília; Soares de Lima-Padovani, Bruna Fernanda (2021)
    Apurinã (Arawak), spoken along several tributaries of the Purus River (southwest of Amazonas State, Brazil), has a morphological plural system that marks pronouns and nouns. The language has some free pronominal forms that distinguish singular from plural; additionally, it has bound pronominal forms, with the singular/plural distinction made only in the first person for the enclitic forms. In the case of nouns, there are two suffixes that mark plural: -waku (that occurs only with [+human] nouns, as in kyky-waku-ry (man-pl-m) ‘men’), and -ny (that can occur both with [+human] nouns, as in pupỹka-ry-ny-ry (indigenous person-m-pl-m) ‘indigenous people’, and with [-human] nouns, as in aiku-ny-ry (house-pl-m) ‘houses’). The language also has some quantifiers and numerals that encode number syntactically. The quantifiers are ithu, kaiãu and kuna kamuny, which encode the notion of ‘much’; puiãu, referring to ‘some/few/little’; and ykyny, meaning ‘all/every’. Additionally, there are the following numerals: (h)ãty(tu) ‘one’ and epi ‘two’, which combine to derive higher numbers, and the word for ‘hand’, waku/ piu, indicating the numeral five. Thus, plural can be marked in the language in different ways, none of which is, however, required by the grammar. With that in mind, we discuss the extent to which plural marking is constructed by speakers in daily language use, according to whether it is contextually important to do so, and raise the question of the relevance of this problem to a computationally implementable grammar of the language.
  • Пунегова, Галина (2021)
    In writing, the voice itself and its acoustics cannot be shown in any way. For this reason, the distinctive features of a character's manner of speaking are depicted in writing through the author's words; quite often, the type of speech can also be learned from the characters' own speech and from their evaluation of it. The article examines and presents the sound of a character's speech and its loudness, tempo and rhythm, that is, everything that affects the type of the character's speech. It concludes that the author's writing skill is of great importance in depicting spoken language in writing accurately, fully and beautifully.
  • Pirinen, Tommi A; Tyers, Francis M. (2021)
    Digital infrastructures are a vital part of the support for language research: they provide a framework and platform for engineering digital lexicography and grammars and for deploying them to end-users as real NLP software products.
  • Koponen, Eino; Kuokkala, Juha (2021)
    A survey of Saami *-(e̮)hče̮ frequentative verbs is made based on dictionary data from all Saami languages. The analysis of their base verbs shows that in most of the languages, the frequentative derivatives are not restricted to *ē-stem bases as in North Saami; specifically in Skolt and Kildin Saami, the derivational type seems to be productive on *e̮- and *ō-stems as well.
  • Trosterud, Trond; Moshagen, Sjur (2021)
    The article discusses the correction of typos caused by erroneous use of the so-called soft sign in Skolt Sami, one of the most common orthographic symbols and the most common source of typographic errors. The discussion is based on the suggestion mechanism of an existing open-source Skolt Sami speller. It shows that with an improved suggestion mechanism, the speller is able to restore a single soft sign error in over 97 % of the cases, and to remove a hypercorrect soft sign as its first correction in 90 % of the cases. If the target form is allowed to be within the top 5, correction performance is well above 99 %. Improving the suggestion mechanism also had a positive impact on the speller's overall performance, raising the percentage of target forms within the top 5 from 74.1 % to 84.7 %.
  • Blokland, Rogier; Partanen, Niko; Rießler, Michael (2021)
    In this paper we analyse an epic song, performed by Ulita Koskova in 1966 in Kolva in the Komi ASSR and recorded by the Hungarian-Australian researcher Erik Vászolyi, and discuss its background and wider historical context. We look at different ways in which such material can contribute to data-driven and sociolinguistically oriented research, specifically in connection with contemporary documentary linguistics, and point to directions for further research.
  • Jauhiainen, Tommi; Jauhiainen, Heidi; Lindén, Krister (2021)
    In this article we present the planning and implementation of the Kone Foundation funded project Suomalais-ugrilaiset kielet ja internet (Finno-Ugric Languages and the Internet), which started in 2013 and ended in 2019, and summarize the results achieved. In addition to previously published finished results, we also present some outputs that remained unfinished. The project used web crawling and automatic language identification to collect text from open web pages written in rare Uralic languages. The Wanca portal site developed in the project serves as a collection of links to the pages written in these languages that were found during crawling. The project developed a process for building sentence corpora for the desired languages from the texts found by the web crawler. The resulting sentence corpora are openly available in the Korp service of Kielipankki (the Language Bank of Finland), maintained by the FIN-CLARIN consortium. Alongside web crawling and corpus compilation, the project focused especially on developing language identification methods, an area in which it achieved internationally very significant results. The project's researchers have taken part in international competitions on the language identification of text and have won several of them.
  • Tiedemann, Jörg (2021)
    This paper presents our ongoing efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show that effective multilingual translation models can be trained with skewed training data, but they also highlight the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.
  • Агафонова, Нина А.; Рябов, Иван Н. (2021)
    The article draws on materials collected during a linguistic expedition from the dialects of Erzya villages in the Novomalyklinsky district of Ulyanovsk Oblast. A comparison with the Erzya literary language showed that these dialects have preserved a more complete archaic system of possessive suffixes. These dialects developed separately from other Erzya dialects. This helped to preserve all four series of the possessive suffix system up to the present day.
  • Alnajjar, Khalid (2021)
    Big languages such as English and Finnish have many natural language processing (NLP) resources and models, but this is not the case for low-resourced and endangered languages, for which such resources are scarce despite the great advantages they would provide for the language communities. The most common types of resources available for low-resourced and endangered languages are translation dictionaries and Universal Dependencies treebanks. In this paper, we present a method for constructing word embeddings for endangered languages using existing word embeddings of different resource-rich languages and the translation dictionaries of resource-poor languages. Thereafter, the embeddings are fine-tuned using the sentences in the Universal Dependencies treebanks and aligned to match the semantic spaces of the big languages, resulting in cross-lingual embeddings. The endangered languages we work with here are Erzya, Moksha, Komi-Zyrian and Skolt Sami. Furthermore, we build a universal sentiment analysis model for all the languages that are part of this study, whether endangered or not, by utilizing cross-lingual word embeddings. The evaluation shows that our word embeddings for endangered languages are well aligned with the resource-rich languages, and that they are suitable for training task-specific models, as demonstrated by our sentiment analysis models, which achieved high accuracies. All our cross-lingual word embeddings and sentiment analysis models will be released openly via an easy-to-use Python library.
  • Saarikivi, Janne (2021)
    The question of how linguistic and archaeological data can be combined to create a comprehensive account of the prehistory of present ethnicities is debated around the globe. In particular, identifying new language groups in the material remnants of a particular area, or discerning correlates in the material culture for the language contact periods reflected in loanword layers, are complex and often probably unsolvable questions. Regarding the early history of the Finns and related peoples, Valter Lang’s new monograph on the archaeology of Estonia and the “arrivals of the Finnic people” (Läänemeresoome tulemised, 2018) has been considered a paradigm-changing work in this respect. In my article I argue that despite undisputed progress in this oeuvre, many of the old questions regarding time, place and method are still in place.
  • Jantunen, Tommi; Rousi, Rebekah; Rainò, Päivi; Turunen, Markku; Moeen Valipoor, Mohammad; García, Narciso (2021)
    This article discusses the prerequisites for the machine translation of sign languages. The topic is complex, including questions relating to technology, interaction design, linguistics and culture. At the moment, despite the affordances provided by the technology, automated translation between signed and spoken languages, or between sign languages, is not possible. The very need for such translation and its associated technology can also be questioned. Yet we believe that contributing to the improvement of sign language detection, processing and even the translation of sign languages into spoken languages in the future is a matter that should not be abandoned. However, we argue that this work should focus on all necessary aspects of sign languages and sign language user communities. Thus, a more diverse and critical perspective on these issues is needed in order to avoid the generalisations and biases that are often manifested within dominant research paradigms, particularly in the fields of spoken language research and speech community.
  • Iwatsuki, Kenichi (2021)
    While scholarly papers in many disciplines are written in English, non-English papers are also published. Formulaic expressions used in research articles have been studied, but past work has mainly focused on English formulaic expressions. In this study, we applied an existing formulaic expression extraction method, originally proposed for English papers, to the introduction sections of Japanese papers on natural language processing. The results show that the extraction is to some extent successful. However, the paucity of datasets of scholarly papers hinders the construction of a comprehensive list of formulaic expressions and comparison among multiple disciplines.
  • Hulden, Mans; Silfverberg, Miikka (2021)
    We design an FST-driven computational method to calculate the minimal number of nominal forms—the principal parts—one must know to be able to fully inflect a lexeme in standard Finnish. To do this, we model the nominal inflection pattern as an FST according to the KOTUS inflectional classes. Our results show that knowing five forms always suffices to uniquely determine a nominal’s inflectional class, and to subsequently correctly inflect all the remaining forms. This contrasts with most sources in the literature that tend to assume seven forms are needed.
  • Цыпанов, Йöлгинь (2021)
    In modern linguistics, a branch called translation studies has formed, which aims at the comprehensive study of translation processes from one language into others from different perspectives. For the Finno-Ugric languages of Russia, this branch of science is taking its first steps. The purpose of this paper is to consider lexical and semantic language errors in the Komi translation of P. A. Sorokin's autobiography, identified by systematic comparison of text fragments in English, Russian and Komi. The material of the study consists of the texts of P. A. Sorokin's autobiography, published as separate books in different years. The language errors found in the Komi translation of the autobiography of the world-famous sociologist Pitirim A. Sorokin, a native of the Komi region, published as a separate book in Syktyvkar in 2013, are considered here for the first time. The errors are analyzed by comparison with the English-language original and with the Russian translation of the same book, published in Syktyvkar in 1991. Analysis of the Komi text (the first 40 pages of the autobiography) leads to the conclusion that the translation into Komi was made not from the original language, as recorded in the bibliographic description, but from the Russian translation, as most translation errors in the Russian text carried over into the Komi one.
  • Juutinen, Markus; Mettovaara, Jukka (2021)
    We provide an overview of indefinite pronouns in Saami languages that have been borrowed or calqued from Finnic, Scandinavian or Russian. We define indefinite pronouns in the traditional way, i.e. as encompassing all pronouns not belonging to any other pronoun class. The treatment of Saami indefinite pronouns in earlier literature varies, but in general they have not received as much attention as other pronouns. From Finnic sources, Saami languages have borrowed e.g. the pronouns harva ‘few’, joku ‘some(one)’, kaikki ‘all’, moni ‘many’ and muu ‘other’ as well as the pronominal elements ikänänsä ‘-ever’, saati ‘let alone’ and vaikka ‘even (if)’. Loans from Scandinavian include e.g. mange ‘many’, noen ~ någon ‘some’ and same ~ samma ‘same’. Russian loans include the pronominal elements ни- ‘not (even)’ and хоть ‘even (if)’. Indefinite pronouns in Saami prove to be a rather open class, and elements with similar meanings have been borrowed time after time. The variation is especially abundant in pronouns of indifference and free choice. Most of the pronouns in our data have been noted as loans before, but there are some previously unnoticed cases. These in particular warrant further study.
