Browsing by Subject "finite-state methods"

Sort by: Order: Results:

Now showing items 1-12 of 12
  • Yli-Jyrä, Anssi Mikael (Northern European Association for Language Technology, 2011)
    NEALT Proceedings Series
    The paper reconceptualizes Constraint Grammar as a framework where the rules refine the compact representations of local ambiguity while the rule conditions are matched against a string of feature vectors that summarize the compact representations. Both views to the ambiguity are processed with pure finite-state operations. The compact representations are mapped to feature vectors with the aid of a rational power series. This magical interconnection is not less pure than a prevalent interpretation that requires that the reading set provided by a lexical transducer is magically linearized to a marked concatenation of readings given to pure transducers. The current approach has several practical benefits, including the inward deterministic way to compute, represent and maintain all the applications of the rules in the sentence.
  • Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2017)
    A recently proposed encoding for non- crossing digraphs can be used to imple- ment generic inference over families of these digraphs and to carry out first-order factored dependency parsing. It is now shown that the recent proposal can be substantially streamlined without information loss. The improved encoding is less dependent on hierarchical processing and it gives rise to a high-coverage bounded-depth approximation of the space of non- crossing digraphs. This subset is presented elegantly by a finite-state machine that recognises an infinite set of encoded graphs. The set includes more than 99.99% of the 0.6 million noncrossing graphs obtained from the UDv2 treebanks through planarisation. Rather than taking the low probability of the residual as a flat rate, it can be modelled with a joint probability distribution that is factorised into two underlying stochastic processes – the sentence length distribution and the related conditional distribution for deep nesting. This model points out that deep nesting in the streamlined code requires extreme sentence lengths. High depth is categorically out in common sentence lengths but emerges slowly at infrequent lengths that prompt further inquiry.
  • Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2011)
    This paper describes a non-conventional method for compiling (phonological or morpho-syntactic) context restriction (CR) constraints into non-deterministic automata in finite-state tools and surface parsing systems. The method reduces any CR into a simple one that constraints the occurrences of the empty string and represents right contexts with co-determististic states. In cases where a fully deterministic representation would be exponentially larger, this kind of inward de- terminism in contexts can bring benefits over various De Morgan approaches where full determinization is necessary. In the method, an accepted word gets a unique path that is a projection of a ladder-shaped structure in the context recognizer. This projection is computed in time that is polynomial to the number of context states. However, it may be difficult to take advantage of the method in a finite-state library that coerces intermediate results into canonical automata and whose intersection operation assumes deterministic automata.
  • Yli-Jyrä, Anssi Mikael (Northern European Association for Language Technology, 2011)
    NEALT Proceedings Series
    A novel technique of adding positionwise flags to one-level finite state lexicons is presented. The proposed flags are kinds of morphophonemic markers and they constitute a flexible method for describing morphophonological processes with a formalism that is tightly coupled with lexical entries and rule-like regular expressions. The formalism is inspired by the techniques used in two-level rule compilation and it practically compiles all the rules in parallel, but in an efficient way. The technique handles morphophonological processes without a separate morphophonemic representation. The occurrences of the allomorphophonemes in latent phonological strings are tracked through a dynamic data structure into which the most prominent (i.e. the best ranked) flags are collected. The application of the technique is suspected to give advantages when describing the morphology of Bantu languages and dialects.
  • Koskenniemi, Kimmo Matti (The Association for Computational Linguistics, 2018)
    A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries which could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) guessing entries from corpora. A method of representing affixes as a table is presented as well how the tables can be converted into LEXC format for several different purposes including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and its Python embedding plus a number of small Python scripts for conversions. The method is tested with a near complete implementation of Finnish verbs. An experiment of generating Finnish verb entries out of corpus data is also described as well as a creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
  • Linden, Krister; Silfverberg, Miikka; Axelson, Erik; Hardwick, Sam; Pirinen, Tommi (Springer, 2011)
    Communications in Computer and Information Science
    HFST–Helsinki Finite-State Technology ( hfst.sf.net ) is a framework for compiling and applying linguistic descriptions with finite-state methods. HFST currently connects some of the most important finite-state tools for creating morphologies and spellers into one open-source platform and supports extending and improving the descriptions with weights to accommodate the modeling of statistical information. HFST offers a path from language descriptions to efficient language applications in key environments and operating systems. HFST also provides an opportunity to exchange transducers between different software providers in order to get the best out of each finite-state library.
  • Drobac, Senka; Silfverberg, Miikka; Yli-Jyrä, Anssi Mikael (The Association for Computational Linguistics, 2012)
    We explain the implementation of replace rules with the .r-glc. operator and preference relations. Our modular approach combines various preference constraints to form different replace rules. In addition to describing the method, we present illustrative examples.
  • Yli-Jyrä, Anssi Mikael (Springer-Verlag, 2012)
    Arc contractions in syntactic dependency graphs can be used to decide which graphs are trees. The paper observes that these contractions can be expressed with weighted finite-state transducers (weighted FST) that operate on string-encoded trees. The observation gives rise to a finite-state parsing algorithm that computes the parse forest and extracts the best parses from it. The algorithm is customizable to functional and bilexical dependency parsing, and it can be extended to non-projective parsing via a multi-planar encoding with prior results on high recall. Our experiments support an analysis of projective parsing according to which the worst-case time complexity of the algorithm is quadratic to the sentence length, and linear to the overlapping arcs and the number of functional categories of the arcs. The results suggest several interesting directions towards efficient and highprecision dependency parsing that takes advantage of the flexibility and the demonstrated ambiguity-packing capacity of such a parser.
  • Koskenniemi, Kimmo (CSLI publications, 2019)
    CSLI Lecture Notes
  • Yli-Jyrä, Anssi (2005)
    Most of the world’s languages lack electronic word form dictionaries. The linguists who gather such dictionaries could be helped with an efficient morphology workbench that adapts to different environments and uses. A widely usable workbench could be characterized, ideally, as generally applicable, extensible, and freely available (GEA). It seems that such a solution could be implemented in the framework of finite-state methods. The current work defines the GEA desiderata and starts a series of articles concerning these desiderata in finite- state morphology. Subsequent parts will review the state of the art and present an action plan toward creating a widely usable finite-state morphology workbench.
  • Yli-Jyrä, Anssi Mikael (The Linguistic Association of Finland, 2006)
    SKY journal of linguistics, special supplement
    The trees in the Penn Treebank have a standard representation that involves complete balanced bracketing. In this article, an alternative for this standard representation of the tree bank is proposed. The proposed representation for the trees is loss-less, but it reduces the total number of brackets by 28%. This is possible by omitting the redundant pairs of special brackets that encode initial and final embedding, using a technique proposed by Krauwer and des Tombe (1981). In terms of the paired brackets, the maximum nesting depth in sentences decreases by 78%. The 99.9% coverage is achieved with only five non-top levels of paired brackets. The observed shallowness of the reduced bracketing suggests that finite-state based methods for parsing and searching could be a feasible option for tree bank processing.
  • Yli-Jyrä, Anssi (The Linguistic Association of Finland, 2006)
    The trees in the Penn Treebank have a standard representation that involves complete balanced bracketing. In this article, an alternative for this standard representation of the tree bank is proposed. The proposed representation for the trees is loss-less, but it reduces the total number of brackets by 28%. This is possible by omitting the redundant pairs of special brackets that encode initial and final embedding, using a technique proposed by Krauwer and des Tombe (1981). In terms of the paired brackets, the maximum nesting depth in sentences decreases by 78%. The 99.9% coverage is achieved with only five non-top levels of paired brackets. The observed shallowness of the reduced bracketing suggests that finite-state based methods for parsing and searching could be a feasible option for tree bank processing.