Browsing by Subject "finite-state transducers"


Now showing items 1-10 of 10
  • Yli-Jyrä, Anssi Mikael; Koskenniemi, Kimmo; Linden, Krister (2006)
    Finite-state methods have been adopted widely in computational morphology and related linguistic applications. To enable efficient development of finite-state based linguistic descriptions, these methods should be a freely available resource for academic language research and the language technology industry. The following needs can be identified: (i) a registry that maps the existing approaches, implementations and descriptions, (ii) managing the incompatibilities of the existing tools, (iii) increasing synergy and complementary functionality of the tools, (iv) persistent availability of the tools used to manipulate the archived descriptions, (v) an archive for free finite-state based tools and linguistic descriptions. Addressing these challenges contributes to building a common research infrastructure for advanced language technology.
  • Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2011)
    The paper describes an unconventional method for compiling (phonological and morpho-syntactic) context restriction rules into nondeterministic automata in finite-state tools and surface parsing systems. The method reduces any context restriction to a simple constraint that restricts the occurrences of the empty string, and it represents right-hand contexts by means of co-deterministic (reverse-deterministic) states. In cases where a fully deterministic representation would be exponentially larger, this kind of inward-deterministic representation of contexts can be preferable to the various De Morgan approaches, in which full determinization is necessary. With the method, every accepted string gets an unambiguous path that is a projection of a ladder-like structure in the context recognizer automaton. This projection can be computed (for the whole constraint) in time that is polynomial in the number of context states. It may nevertheless be difficult to benefit from the method in a finite-state library that coerces intermediate results into canonical automata and whose intersection operation requires deterministic automata as operands.
  • Yli-Jyrä, Anssi Mikael (Northern European Association for Language Technology, 2011)
    NEALT Proceedings Series
    The paper proposes morphophonemic markers called positionwise flags. These flags are inspired by the techniques used in compiling two-level rules. The approach compiles practically all the rules in parallel, yet efficiently. The technique handles morphophonemic processes without a separate morphophonemic representation. The occurrences of allomorphophonemes in latent phonological strings are tracked through a dynamic data structure in which the most prominent (i.e. best-ranked) flags are collected. The technique is suspected to offer advantages in describing the morphology of Bantu languages and dialects.
  • Koskenniemi, Kimmo (Northern European Association for Language Technology, 2013)
    NEALT Proceedings Series
    Regular correspondences between historically related languages can be modelled using finite-state transducers (FST). A new method is presented by demonstrating it with a bidirectional experiment between Finnish and Estonian. An artificial representation (resembling a proto-language) is established between two related languages. This representation, AFE (Aligned Finnish-Estonian) is based on the letter by letter alignment of the two languages and uses mechanically constructed morphophonemes which represent the corresponding characters. By describing the constraints of this AFE using two-level rules, one may construct useful mappings between the languages. In this way, the badly ambiguous FSTs from Finnish and Estonian to AFE can be composed into a practically unambiguous transducer from Finnish to Estonian. The inverse mapping from Estonian to Finnish is mildly ambiguous. Steps according to the proposed method could be repeated as such with dialectal or older written texts. Choosing a set of model words, aligning them, recording the mechanical correspondences and designing rules for the constraints could be done with a limited effort. For the purposes of indexing and searching, the mild ambiguity may be tolerable as such. The ambiguity can be further reduced by composing the resulting FST with a speller or morphological analyser of the standard language.
  • Drobac, Senka; Linden, Krister; Pirinen, Tommi; Silfverberg, Miikka (European Language Resources Association (ELRA), 2014)
    Flag diacritics, which are special multi-character symbols executed at runtime, enable optimising finite-state networks by combining identical sub-graphs of their transition graphs. Traditionally, the feature has required linguists to devise the optimisations to the graph by hand alongside the morphological description. In this paper, we present a novel method for discovering flag positions in morphological lexicons automatically, based on the morpheme structure implicit in the language description. With this approach, we have achieved a significant decrease in the size of finite-state networks while maintaining reasonable application speed. The algorithm can be applied to any language description, and the biggest gains are expected in large and complex morphologies. We obtained the most noticeable size reduction with a morphological transducer for Greenlandic, whose original size is on average about 15 times larger than that of other morphologies. With the presented hyper-minimization method, the transducer is reduced to 10.1% of the original size, with lookup speed decreased by only 9.5%.
  • Kokkinakis, Dimitrios; Niemi, Jyrki; Hardwick, Sam; Linden, Krister; Borin, Lars (European Language Resources Association (ELRA), 2014)
    Named entity recognition (NER) is a knowledge-intensive information extraction task that is used for recognizing textual mentions of entities that belong to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging yet essential preprocessing technology for many natural language processing applications, and particularly crucial for language understanding. NER has been actively explored in academia and in industry, especially in recent years, due to the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers).
  • Drobac, Senka; Silfverberg, Miikka; Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2012)
    We explain the implementation of replace rules with the .r-glc. operator and preference relations. Our modular approach combines various preference constraints to form different replace rules. In addition to describing the method, we present illustrative examples.
  • Koskenniemi, Kimmo Matti; Kuutti, Pirkko (Research Institute for Linguistics, Hungarian Academy of Sciences, 2017)
  • Yli-Jyrä, Anssi Mikael (Springer-Verlag, 2012)
    Arc contractions in syntactic dependency graphs can be used to decide which graphs are trees. The paper observes that these contractions can be expressed with weighted finite-state transducers (weighted FST) that operate on string-encoded trees. The observation gives rise to a finite-state parsing algorithm that computes the parse forest and extracts the best parses from it. The algorithm is customizable to functional and bilexical dependency parsing, and it can be extended to non-projective parsing via a multi-planar encoding with prior results on high recall. Our experiments support an analysis of projective parsing according to which the worst-case time complexity of the algorithm is quadratic in the sentence length, and linear in the number of overlapping arcs and the number of functional categories of the arcs. The results suggest several interesting directions towards efficient and high-precision dependency parsing that takes advantage of the flexibility and the demonstrated ambiguity-packing capacity of such a parser.
  • Yli-Jyrä, Anssi (The Association for Computational Linguistics, 2015)
    A string encoding of bipartite graphs makes it possible to implement the graph rewriting used in autosegmental descriptions of tone phonology with existing, highly optimized finite-state transducer tools (Yli-Jyrä 2013). The present work gives a rigorous description of this code-theoretic approach and generalizes the associated methodology to all bipartite graphs that have no crossing arcs and no unordered nodes. The work presents three bijectively related codes, each with its own characteristics and the freedom to violate or express the so-called Obligatory Contour Principle constraint. The codes are infinite, finite-state representable and optimal (efficiently computable, invertible, locally iconic and compositional) in the sense of Kornai (1995). These three codes extend the encoding-based approach with visualization, generality and flexibility, and they make encoded graphs a strong candidate when the formal semantics of autosegmental phonology or non-crossing alignment relations are implemented within the confines of a regular grammar.