Browsing Multilingual Facilitation by Title

Now showing items 1-20 of 26
  • Pirinen, Tommi A; Tyers, Francis M. (2021)
    Digital infrastructures are a vital part of support for providing a research framework and platform in engineering their digital lexicography and grammars and deploying them to end-users as real NLP software products.
  • Alnajjar, Khalid; Hämäläinen, Mika (2021)
    Every NLP researcher has to work with different XML or JSON encoded files. This often involves writing code that serves a very specific purpose. Corpona is meant to streamline any workflow that involves XML and JSON based corpora, by offering easy and reusable functionalities. The current functionalities relate to easy parsing and access to XML files, easy access to sub-items in a nested JSON structure and visualization of a complex data structure. Corpona is fully open-source and it is available on GitHub and Zenodo.
  • Hämäläinen, Mika (2021)
    The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
  • Säily, Tanja; Mäkelä, Eetu; Hämäläinen, Mika (2021)
    We study neologism use in two samples of early English correspondence, from 1640-1660 and 1760-1780. Of especial interest are the early adopters of new vocabulary, the social groups they represent, and the types and functions of their neologisms. We describe our computer-assisted approach and note the difficulties associated with massive variation in the corpus. Our findings include that while male letter-writers tend to use neologisms more frequently than women, the eighteenth century seems to have provided more opportunities for women and the lower ranks to participate in neologism use as well. In both samples, neologisms most frequently occur in letters written between close friends, which could be due to this less stable relationship triggering more creative language use. In the seventeenth-century sample, we observe the influence of the English Civil War, while the eighteenth-century sample appears to reflect the changing functions of letter-writing, as correspondence is increasingly being used as a tool for building and maintaining social relationships in addition to exchanging information.
  • Jantunen, Tommi; Rousi, Rebekah; Rainò, Päivi; Turunen, Markku; Moeen Valipoor, Mohammad; García, Narciso (2021)
    This article discusses the prerequisites for the machine translation of sign languages. The topic is complex, including questions relating to technology, interaction design, linguistics and culture. At the moment, despite the affordances provided by the technology, automated translation between signed and spoken languages – or between sign languages – is not possible. The very need for such translation and its associated technology can also be questioned. Yet, we believe that contributing to the improvement of sign language detection, processing and even sign language translation to spoken languages in the future is a matter that should not be abandoned. However, we argue that this work should focus on all necessary aspects of sign languages and sign language user communities. Thus, a more diverse and critical perspective towards these issues is needed in order to avoid generalisations and bias that are often manifested within dominant research paradigms, particularly in the fields of spoken language research and speech community.
  • Koponen, Eino; Kuokkala, Juha (2021)
    A survey of Saami *-(e̮)hče̮ frequentative verbs is made based on dictionary data from all Saami languages. The analysis of their base verbs shows that in most of the languages, the frequentative derivatives are not restricted to *ē-stem bases as in North Saami; specifically in Skolt and Kildin Saami, the derivational type seems to be productive on *e̮- and *ō-stems as well.
  • Swanson, Daniel; Howell, Nick (2021)
    This paper presents lexd, a lexicon compiler for languages with non-suffixational morphology, which is intended to be faster and easier to use than existing solutions while also being compatible with other tools. We perform a case-study for Chukchi, comparing against a hand-optimised analyser written in lexc, and find that while lexd is easier to use, performance remains an obstacle to its use at production level. We also compare performance between lexd and hfst-lexc for three analysers still in the prototype phase, finding that lexd is at least as fast, sometimes faster, to compile; we conclude it is a reasonable choice for prototyping new analysers. Future work will explore how to move lexd performance toward production-grade.
  • Hämäläinen, Mika; Partanen, Niko; Alnajjar, Khalid (University of Helsinki, 2021)
  • Kokkonen, Paula (2021)
    The article discusses the Komi-language poems found in Heimolasten laulukirja and the booklet Sukukansain lauluja, their Finnish translations, and the musical settings composed for them. The study examines the poems of Mihail Lebedev and Ivan Kuratov, as well as V. I. Lytkin's translations of poems by J. H. Erkko.
  • Da Silva Facundes, Sidney; Fernanda Pereira de Freitas, Marília; Soares de Lima-Padovani, Bruna Fernanda (2021)
    Apurinã (Arawak), spoken along several tributaries of the Purus River (Southwest of Amazonas State, Brazil), presents a plural morphological system that marks pronouns and nouns. The language has some free pronominal forms that distinguish singular from plural; additionally, it has bound pronominal forms, with singular/plural distinction made only in the first person for the enclitic forms. In the case of nouns, there are two suffixes that mark plural, -waku (that occurs only with [+human] nouns, as kyky-waku-ry (man-pl-m) ‘men’), and -ny (that can occur both with [+human] nouns, as in pupỹka-ry-ny-ry (indigenous person-m-pl-m) ‘indigenous people’, and with [-human] nouns, as in aiku-ny-ry (house-pl-m) ‘houses’). The language also presents some quantifiers and numerals that encode number syntactically. The quantifiers are ithu, kaiãu and kuna kamuny to encode the notion of ‘much’, puiãu, referring to ‘some/few/little’, and ykyny to mean ‘all/every’. Additionally, there are the following numerals: (h)ãty(tu) ‘one’ and epi ‘two’, which combine to derive higher numbers, and the word for ‘hand’, waku/ piu, indicating the numeral five. Thus, plural in the language can be marked in different ways, none of which is, however, required by the grammar. With that in mind, we discuss the extent to which plural marking is, to a great extent, constructed by the speakers in daily language use, according to whether it is contextually important to do so, and raise the question of the relevance of this problem to a computationally implementable grammar of the language.
  • Partanen, Niko; Jalava, Lotta (2021)
    The article describes the creation of the facsimile edition of Nykysuomen sanakirja and the work stages involved. It also describes a download package containing the recognized line-level texts and styles. Together they make it possible to create various electronic versions and research datasets in the future, but at present they are only one step in this work. The study serves as an example of modern text recognition applied to dictionary material and evaluates the results critically, so that the same practices can be applied to other comparable materials. The proofread materials and text recognition models described will be published alongside the facsimile edition of the dictionary.
  • Saarikivi, Janne (2021)
    The question as to how linguistic and archaeological data can be combined to create a comprehensive account of the prehistory of present ethnicities is a debated issue around the globe. In particular, the identification of new language groups in the material remnants of a particular area, or discerning in the material culture correlates for the language contact periods reflected in the loan word layers, are complex and often probably insolvable questions. Regarding the early history of the Finns and the related people, Valter Lang’s new monograph on the archaeology of Estonia and the “arrivals of the Finnic people” (Läänemeresoome tulemised, 2018) has been considered a paradigm changing work in this respect. In my article I argue that despite undisputed progress in this oeuvre, many of the old questions regarding time, place and method are still in place.
  • Juutinen, Markus; Mettovaara, Jukka (2021)
    We provide an overview of indefinite pronouns in Saami languages that have been borrowed or calqued from Finnic, Scandinavian or Russian. We define indefinite pronouns in the traditional way, i.e. encompassing all pronouns not belonging to any other pronoun class. The treatment of Saami indefinite pronouns in earlier literature varies, but generally they haven’t received as much attention as other pronouns. From Finnic sources, Saami languages have borrowed e.g. pronouns harva ‘few’, joku ‘some(one)’, kaikki ‘all’, moni ‘many’ and muu ‘other’ as well as pronominal elements ikänänsä ‘-ever’, saati ‘let alone’ and vaikka ‘even (if)’. Loans from Scandinavian include e.g. mange ‘many’, noen ~ någon ‘some’ and same ~ samma ‘same’. Russian loans include pronominal elements ни- ‘not (even)’ and хоть ‘even (if)’. Indefinite pronouns in Saami prove to be rather an open class, and elements with similar meanings have been borrowed time after time. The variation is especially abundant in pronouns of indifference and free choice. Most of the pronouns in our data have been noted as loans before, but there are some unnoticed cases. These especially warrant further study.
  • Trosterud, Trond; Moshagen, Sjur (2021)
    The article discusses the correction of typos due to erroneous use of the so-called soft sign in Skolt Sami, one of the most common orthographic symbols, and the most common source of typographic errors. The discussion is based upon the suggestion mechanism of an existing open source Skolt Sami speller. The discussion shows that with an improved suggestion mechanism, the speller is able to restore a single soft sign error in over 97 % of the cases, and remove a hypercorrect soft sign as first correction in 90 % of the cases. Allowing the target form to be within top-5, the correction performance is well above 99 %. Improving the suggestion mechanism also had a positive impact on its overall performance, raising the percentage of target forms within top-5 from 74.1 % to 84.7 %.
  • Jauhiainen, Tommi; Jauhiainen, Heidi; Lindén, Krister (2021)
    In this article we present the planning and implementation of the project Suomalais-ugrilaiset kielet ja internet (Finno-Ugric Languages and the Internet), which was funded by the Kone Foundation, began in 2013 and ended in 2019, and we summarize the results achieved. In addition to previously published finished results, we also present some outputs that were left unfinished. The project used web crawling and automatic language identification to collect text from open web pages written in rare Uralic languages. The Wanca portal site developed in the project serves as a collection of links to pages written in these languages that were found during crawling. The project developed a process by which sentence corpora for the desired languages are built from the texts found by the web crawler. The resulting sentence corpora are openly available in the Korp service of Kielipankki, maintained by the FIN-CLARIN consortium. Besides web crawling and corpus compilation, the project focused especially on developing language identification methods, an area in which it achieved internationally highly significant results. The project's researchers have taken part in international competitions focused on the language identification of text and have won several of them.
  • Tiedemann, Jörg (2021)
    This paper presents our on-going efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show the capabilities of training effective multilingual translation models with skewed training data but also stress the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.
  • Bradley, Jeremy; Skribnik, Elena (2021)
    The paper at hand presents the recently published COPIUS Orthographic Toolset’s Mansi module. This open-source software, part of the COPIUS drive to create necessary international infrastructures for teaching/learning and researching Uralic languages, allows for rule-based transcription between four basic writing systems historically used for Mansi: the Cyrillic alphabet, the Latin-based Unified Northern Alphabet (UNA), Finno-Ugric Transcription (FUT), and the International Phonetic Alphabet (IPA). The software aims to take variation in the usage of these respective writing systems into consideration as well as possible in a purely rule-based approach currently lacking lexical support. Section 1 will give a short summary of the history of Mansi literacy and aims to elucidate how changing trends, both local and Russia-wide, influenced the manner in which Mansi was captured in writing by scientists and speakers throughout history. Section 2 will give an overview of (Northern) Mansi phonology and discuss how difficult aspects of it are handled in the writing systems under consideration. Finally, Section 3 will illustrate the transcription software, in its current version, in action, with a sample text transcribed from each of the four writing systems under consideration into the three other ones.
  • Hulden, Mans; Silfverberg, Miikka (2021)
    We design an FST-driven computational method to calculate the minimal number of nominal forms—the principal parts—one must know to be able to fully inflect a lexeme in standard Finnish. To do this, we model the nominal inflection pattern as an FST according to the KOTUS inflectional classes. Our results show that knowing five forms always suffices to uniquely determine a nominal’s inflectional class, and to subsequently correctly inflect all the remaining forms. This contrasts with most sources in the literature that tend to assume seven forms are needed.
  • Blokland, Rogier; Partanen, Niko; Rießler, Michael (2021)
    In this paper we analyse an epic song, performed by Ulita Koskova in 1966 in Kolva in the Komi ASSR, and recorded by the Hungarian-Australian researcher Erik Vászolyi, and discuss its background and wider historical context. We look at different ways in which such material can contribute to data-driven and sociolinguistically oriented research, specifically in connection to contemporary documentary linguistics, and point to directions for further research.
  • Nevalainen, Terttu (2021)
    This paper analyses language users’ participation in real-time grammatical change. The question addressed is the extent to which individuals continue using both the incoming form and the recessive, outgoing form as opposed to using one of them categorically. Variable grammars are related to the sociolinguistic discussion of whether language change is a generational or a communal process. Ultimately, they also raise the question of the predictability of real-time language change.