  • Yli-Jyrä, Anssi Mikael; Koskenniemi, Kimmo; Linden, Krister (2006)
    Finite-state methods have been adopted widely in computational morphology and related linguistic applications. To enable efficient development of finite-state based linguistic descriptions, these methods should be a freely available resource for academic language research and the language technology industry. The following needs can be identified: (i) a registry that maps the existing approaches, implementations and descriptions, (ii) managing the incompatibilities of the existing tools, (iii) increasing synergy and complementary functionality of the tools, (iv) persistent availability of the tools used to manipulate the archived descriptions, (v) an archive for free finite-state based tools and linguistic descriptions. Addressing these challenges contributes to building a common research infrastructure for advanced language technology.
  • Koskenniemi, Kimmo (Northern European Association for Language Technology, 2013)
    NEALT Proceedings Series
    Regular correspondences between historically related languages can be modelled using finite-state transducers (FSTs). A new method is presented and demonstrated with a bidirectional experiment between Finnish and Estonian. An artificial representation (resembling a proto-language) is established between the two related languages. This representation, AFE (Aligned Finnish-Estonian), is based on the letter-by-letter alignment of the two languages and uses mechanically constructed morphophonemes which represent the corresponding characters. By describing the constraints of AFE using two-level rules, one may construct useful mappings between the languages. In this way, the badly ambiguous FSTs from Finnish and Estonian to AFE can be composed into a practically unambiguous transducer from Finnish to Estonian. The inverse mapping from Estonian to Finnish is mildly ambiguous. The steps of the proposed method could be repeated as such with dialectal or older written texts. Choosing a set of model words, aligning them, recording the mechanical correspondences and designing rules for the constraints could be done with limited effort. For the purposes of indexing and searching, the mild ambiguity may be tolerable as such. The ambiguity can be further reduced by composing the resulting FST with a speller or morphological analyser of the standard language.
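The idea of mapping through an aligned intermediate representation can be illustrated with a minimal sketch. It is not the paper's implementation: finite relations stored as Python sets stand in for transducers, no two-level rules are involved, and the model words, their alignments, the `ZERO` symbol and all function names are assumptions chosen for illustration.

```python
ZERO = "Ø"  # assumed placeholder for "no letter on this side"

# Hypothetical model words, already aligned letter by letter (equal length).
ALIGNED = [("kala", "kala"), ("silmä", "silmØ"), ("jalka", "jalgØ")]


def to_afe(fin, est):
    """Build an AFE-like word: a tuple of (Finnish, Estonian) letter pairs."""
    if len(fin) != len(est):
        raise ValueError("aligned strings must have equal length")
    return tuple(zip(fin, est))


def surface(afe_word, side):
    """Project one side of an AFE-like word, dropping zero symbols."""
    return "".join(pair[side] for pair in afe_word if pair[side] != ZERO)


def compose(r1, r2):
    """Compose two relations given as sets of (input, output) pairs."""
    return {(a, c) for (a, b) in r1 for (b2, c) in r2 if b == b2}


afe_words = [to_afe(f, e) for f, e in ALIGNED]
fin_to_afe = {(surface(w, 0), w) for w in afe_words}  # Finnish -> AFE
afe_to_est = {(w, surface(w, 1)) for w in afe_words}  # AFE -> Estonian
fin_to_est = compose(fin_to_afe, afe_to_est)          # Finnish -> Estonian
est_to_fin = {(e, f) for f, e in fin_to_est}          # inverse mapping
```

Here `fin_to_est` relates each Finnish model word to its Estonian counterpart only because both sides share the same intermediate word. A real system would generalise beyond the listed words by constraining the pair symbols with context rules rather than enumerating whole words.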
  • Pirinen, Tommi; Silfverberg, Miikka; Linden, Krister (2012)
    In this paper we demonstrate a finite-state implementation of context-aware spell checking utilizing an N-gram based part of speech (POS) tagger to rerank the suggestions from a simple edit-distance based spell-checker. We demonstrate the benefits of context-aware spell-checking for English and Finnish and introduce modifications that are necessary to make traditional N-gram models work for morphologically more complex languages, such as Finnish.
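The reranking idea can be sketched as follows. This is a toy illustration, not the authors' finite-state implementation: the lexicon, the tag bigram scores and all function names are invented for the example, and a plain dictionary lookup stands in for a real POS tagger.

```python
# Hypothetical toy lexicon mapping each word to a single POS tag.
LEXICON = {"the": "DET", "cat": "NOUN", "can": "VERB", "car": "NOUN", "eat": "VERB"}

# Hypothetical tag bigram scores P(tag | previous tag); unseen pairs score 0.
TAG_BIGRAM = {("DET", "NOUN"): 0.6, ("DET", "VERB"): 0.05}


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution or match
        prev = cur
    return prev[-1]


def suggest(word, max_dist=1):
    """Context-free suggestions: lexicon words within max_dist edits."""
    return sorted(w for w in LEXICON if edit_distance(word, w) <= max_dist)


def rerank(prev_word, word):
    """Order suggestions by tag bigram score, then distance, then spelling."""
    prev_tag = LEXICON[prev_word]
    return sorted(
        suggest(word),
        key=lambda w: (-TAG_BIGRAM.get((prev_tag, LEXICON[w]), 0.0),
                       edit_distance(word, w), w),
    )
```

For the misspelling `caa` after `the`, `suggest("caa")` returns the three candidates at distance one in alphabetical order, while `rerank("the", "caa")` moves the two nouns ahead of the verb because a noun is more likely after a determiner under the assumed scores.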
  • Rueter, Jack; Partanen, Niko; Hämäläinen, Mika; Trosterud, Trond (The Association for Computational Linguistics, 2021)
  • Rueter, Jack; Hämäläinen, Mika (Peter Lang, 2020)
    Österreichisches Deutsch – Sprache der Gegenwart
    This paper briefly describes Skolt Sami and how it might be construed as a pluricentric language. It identifies historical factors that might contribute to a pluricentric identity: geographic location and political history, shortages of language documentation, and the establishment of a normative body for the development of a standard language. Skolt Sami is assessed in the context of the Sami languages and is put forward as one member of a group of closely related yet distinct languages. The issue then becomes one of facilitating diversity even for under-documented languages, and the paper describes opportunities in language technology that have been used to this end. Finally, brief insight is given into the pluricentric character of other Uralic languages and the possibilities for language users to maintain their individual language needs.
  • Rueter, Jack (Издательский центр Историко-социологического института, 2020)
    This paper addresses the issue of a national corpus for language documentation of the Moksha and Erzya literary languages in coordination with dialect archives comprising over 80 years of fieldwork (including Shoksha and Karatai). It outlines the necessary development of computer-assisted research tools and ongoing research aligned with a consistent and systematic open research project.