Browsing by Subject "finite-state transducers"

Sort by: Order: Results:

Now showing items 1-10 of 10
  • Yli-Jyrä, Anssi Mikael; Koskenniemi, Kimmo; Linden, Krister (2006)
    Finite-state methods have been adopted widely in computational morphology and related linguistic applications. To enable efficient development of finite-state based linguistic descriptions, these methods should be a freely available resource for academic language research and the language technology industry. The following needs can be identified: (i) a registry that maps the existing approaches, implementations and descriptions, (ii) managing the incompatibilities of the existing tools, (iii) increasing synergy and complementary functionality of the tools, (iv) persistent availability of the tools used to manipulate the archived descriptions, (v) an archive for free finite-state based tools and linguistic descriptions. Addressing these challenges contributes to building a common research infrastructure for advanced language technology.
  • Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2011)
    This paper describes a non-conventional method for compiling (phonological or morpho-syntactic) context restriction (CR) constraints into non-deterministic automata in finite-state tools and surface parsing systems. The method reduces any CR into a simple one that constraints the occurrences of the empty string and represents right contexts with co-determististic states. In cases where a fully deterministic representation would be exponentially larger, this kind of inward de- terminism in contexts can bring benefits over various De Morgan approaches where full determinization is necessary. In the method, an accepted word gets a unique path that is a projection of a ladder-shaped structure in the context recognizer. This projection is computed in time that is polynomial to the number of context states. However, it may be difficult to take advantage of the method in a finite-state library that coerces intermediate results into canonical automata and whose intersection operation assumes deterministic automata.
  • Yli-Jyrä, Anssi Mikael (Northern European Association for Language Technology, 2011)
    NEALT Proceedings Series
    A novel technique of adding positionwise flags to one-level finite state lexicons is presented. The proposed flags are kinds of morphophonemic markers and they constitute a flexible method for describing morphophonological processes with a formalism that is tightly coupled with lexical entries and rule-like regular expressions. The formalism is inspired by the techniques used in two-level rule compilation and it practically compiles all the rules in parallel, but in an efficient way. The technique handles morphophonological processes without a separate morphophonemic representation. The occurrences of the allomorphophonemes in latent phonological strings are tracked through a dynamic data structure into which the most prominent (i.e. the best ranked) flags are collected. The application of the technique is suspected to give advantages when describing the morphology of Bantu languages and dialects.
  • Koskenniemi, Kimmo (Northern European Association for Language Technology, 2013)
    NEALT Proceedings Series
    Regular correspondences between historically related languages can be modelled using finite-state transducers (FST). A new method is presented by demonstrating it with a bidirectional experiment between Finnish and Estonian. An artificial representation (resembling a proto-language) is established between two related languages. This representation, AFE (Aligned Finnish-Estonian) is based on the letter by letter alignment of the two languages and uses mechanically constructed morphophonemes which represent the corresponding characters. By describing the constraints of this AFE using two-level rules, one may construct useful mappings between the languages. In this way, the badly ambiguous FSTs from Finnish and Estonian to AFE can be composed into a practically unambiguous transducer from Finnish to Estonian. The inverse mapping from Estonian to Finnish is mildly ambiguous. Steps according to the proposed method could be repeated as such with dialectal or older written texts. Choosing a set of model words, aligning them, recording the mechanical correspondences and designing rules for the constraints could be done with a limited effort. For the purposes of indexing and searching, the mild ambiguity may be tolerable as such. The ambiguity can be further reduced by composing the resulting FST with a speller or morphological analyser of the standard language.
  • Drobac, Senka; Linden, Krister; Pirinen, Tommi; Silfverberg, Miikka (European Language Resources Association (ELRA), 2014)
    Flag diacritics, which are special multi-character symbols executed at runtime, enable optimising finite-state networks by combining identical sub-graphs of its transition graph. Traditionally, the feature has required linguists to devise the optimisations to the graph by hand alongside the morphological description. In this paper, we present a novel method for discovering flag positions in morphological lexicons automatically, based on the morpheme structure implicit in the language description. With this approach, we have gained significant decrease in the size of finite-state networks while maintaining reasonable application speed. The algorithm can be applied to any language description, where the biggest achievements are expected in large and complex morphologies. The most noticeable reduction in size we got with a morphological transducer for Greenlandic, whose original size is on average about 15 times larger than other morphologies. With the presented hyper-minimization method, the transducer is reduced to 10,1% of the original size, with lookup speed decreased only by 9,5%.
  • Kokkinakis, Dimitrios; Niemi, Jyrki; Hardwick, Sam; Linden, Krister; Borin, Lars (European Language Resources Association (ELRA), 2014)
    Named entity recognition (NER) is a knowledge-intensive information extraction task that is used for recognizing textual mentions of entities that belong to a predefined set of categories, such as locations, organizations and time expressions. NER is a challenging, difficult, yet essential preprocessing technology for many natural language processing applications, and particularly crucial for language understanding. NER has been actively explored in academia and in industry especially during the last years due to the advent of social media data. This paper describes the conversion, modeling and adaptation of a Swedish NER system from a hybrid environment, with integrated functionality from various processing components, to the Helsinki Finite-State Transducer Technology (HFST) platform. This new HFST-based NER (HFST-SweNER) is a full-fledged open source implementation that supports a variety of generic named entity types and consists of multiple, reusable resource layers, e.g., various n-gram-based named entity lists (gazetteers).
  • Drobac, Senka; Silfverberg, Miikka; Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2012)
    We explain the implementation of replace rules with the .r-glc. operator and preference relations. Our modular approach combines various preference constraints to form different replace rules. In addition to describing the method, we present illustrative examples.
  • Koskenniemi, Kimmo Matti; Kuutti, Pirkko (Research Institute for Linguistics, Hungarian Academy of Sciences, 2017)
  • Yli-Jyrä, Anssi Mikael (Springer-Verlag, 2012)
    Arc contractions in syntactic dependency graphs can be used to decide which graphs are trees. The paper observes that these contractions can be expressed with weighted finite-state transducers (weighted FST) that operate on string-encoded trees. The observation gives rise to a finite-state parsing algorithm that computes the parse forest and extracts the best parses from it. The algorithm is customizable to functional and bilexical dependency parsing, and it can be extended to non-projective parsing via a multi-planar encoding with prior results on high recall. Our experiments support an analysis of projective parsing according to which the worst-case time complexity of the algorithm is quadratic to the sentence length, and linear to the overlapping arcs and the number of functional categories of the arcs. The results suggest several interesting directions towards efficient and highprecision dependency parsing that takes advantage of the flexibility and the demonstrated ambiguity-packing capacity of such a parser.
  • Yli-Jyrä, Anssi (The Association for Computational Linguistics, 2015)
    A string encoding for a subclass of bipartite graphs enables graph rewriting used in autosegmental descriptions of tone phonology via existing and highly optimized finite- state transducer toolkits (Yli-Jyrä 2013). The current work offers a rigorous treatment of this code-theoretic approach, generalizing the methodology to all bipartite graphs having no crossing edges and un- ordered nodes. We present three bijectively related codes each of which exhibit unique characteristics while preserving the freedom to violate or express the OCP constraint. The codes are infinite, finite-state representable and optimal (efficiently computable, invertible, locally iconic, compositional) in the sense of Kornai (1995). They extend the encoding approach with visualisation, generality and flexibility and they make encoded graphs a strong candidate when the formal semantics of autosegmental phonology or non-crossing alignment relations are implemented within the con- fines of regular grammar.