Browsing by Subject "computational linguistics"


  • Koskenniemi, Kimmo (The Association for Computational Linguistics, 1984)
    A language-independent model for recognition and production of word forms is presented. This "two-level model" is based on a new way of describing morphological alternations. All rules describing the morphophonological variations are parallel and relatively independent of each other. Individual rules are implemented as finite state automata, as in an earlier model due to Martin Kay and Ron Kaplan. The two-level model has been implemented as an operational computer program in several places. A number of operational two-level descriptions have been written or are in progress (Finnish, English, Japanese, Rumanian, French, Swedish, Old Church Slavonic, Greek, Lappish, Arabic, Icelandic). The model is bidirectional: it is capable of both analyzing and synthesizing word-forms.
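    A minimal Python sketch of the two-level idea described above (an illustration, not Koskenniemi's implementation): a lexical/surface correspondence is a sequence of symbol pairs, and every rule inspects the same pair sequence in parallel, so a pairing is accepted only if all rules accept it. In the real model each rule is compiled into a finite-state automaton; plain predicates stand in for the automata here, and the toy kk -> k "gradation" rule is invented for illustration.

      # Toy two-level checker: rules are parallel constraints on the
      # whole lexical/surface pair sequence, not ordered rewrites.

      def rule_k_deletion(pairs):
          """k:0 (lexical k realized as zero) is allowed only right after k:k."""
          for i, (lex, surf) in enumerate(pairs):
              if lex == "k" and surf == "0":
                  if i == 0 or pairs[i - 1] != ("k", "k"):
                      return False
          return True

      def rule_identity_default(pairs):
          """Every symbol must surface unchanged unless deleted (toy default)."""
          return all(lex == surf or surf == "0" for lex, surf in pairs)

      RULES = [rule_k_deletion, rule_identity_default]

      def accepts(pairs):
          # Parallel application: the pairing must satisfy every rule at once.
          return all(rule(pairs) for rule in RULES)

      # lexical "takka" ~ surface "taka" (toy consonant gradation kk -> k)
      good = [("t","t"), ("a","a"), ("k","k"), ("k","0"), ("a","a")]
      bad  = [("t","t"), ("a","a"), ("k","0"), ("k","k"), ("a","a")]
      print(accepts(good))  # True
      print(accepts(bad))   # False: k:0 not preceded by k:k

    Because the same pair sequence is checked in both directions, the same rule set serves analysis and synthesis alike, which is the bidirectionality the abstract refers to.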
  • Shao, Yan; Hardmeier, Christian; Tiedemann, Jörg; Nivre, Joakim (Asian Federation of Natural Language Processing, 2017)
    We present a character-based model for joint segmentation and POS tagging for Chinese. The bidirectional RNN-CRF architecture for general sequence tagging is adapted and applied with novel vector representations of Chinese characters that capture rich contextual information and sub-character-level features. The proposed model is extensively evaluated and compared with a state-of-the-art tagger on each of CTB5, CTB9 and UD Chinese. The experimental results indicate that our model is accurate and robust across datasets of different sizes, genres and annotation schemes. We obtain state-of-the-art performance on CTB5, achieving a 94.38 F1-score for joint segmentation and POS tagging.
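    To make the joint tagging problem concrete, the Python sketch below (with an invented tag inventory, not the paper's exact one) shows the character-level label scheme such a model predicts: each character receives a segmentation position fused with the POS of its word, and the label sequence decodes back into segmented, tagged words. The RNN-CRF itself is omitted; only the input/output representation is shown.

      # Encode/decode between word-level annotation and per-character
      # joint segmentation+POS labels such as "B-NN" / "I-NN".

      def encode(words_with_pos):
          """[('中国', 'NR'), ...] -> per-character joint tags."""
          chars, tags = [], []
          for word, pos in words_with_pos:
              for i, ch in enumerate(word):
                  chars.append(ch)
                  tags.append(("B-" if i == 0 else "I-") + pos)
          return chars, tags

      def decode(chars, tags):
          """Per-character joint tags -> segmented, POS-tagged words."""
          words, cur, cur_pos = [], "", None
          for ch, tag in zip(chars, tags):
              bio, pos = tag.split("-", 1)
              if bio == "B":
                  if cur:
                      words.append((cur, cur_pos))
                  cur, cur_pos = ch, pos
              else:
                  cur += ch
          if cur:
              words.append((cur, cur_pos))
          return words

      sent = [("中国", "NR"), ("经济", "NN"), ("发展", "VV")]
      chars, tags = encode(sent)
      print(list(zip(chars, tags)))     # [('中', 'B-NR'), ('国', 'I-NR'), ...]
      print(decode(chars, tags) == sent)  # True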
  • Tiedemann, Jörg (CEUR Workshop Proceedings, 2018)
  • Koskenniemi, Kimmo; Linden, Krister; Nordgård, Torbjørn (University of Helsinki. Department of General Linguistics, 2007)
  • Koskenniemi, Kimmo Matti (The Association for Computational Linguistics, 2018)
    A practical method for interactive guessing of LEXC lexicon entries is presented. The method is based on describing groups of similarly inflected words using regular expressions. The patterns are compiled into a finite-state transducer (FST) which maps any word form into the possible LEXC lexicon entries that could generate it. The same FST can be used (1) for converting conventional headword lists into LEXC entries, (2) for interactive guessing of entries, (3) for corpus-assisted interactive guessing and (4) for guessing entries from corpora. A method of representing affixes as tables is also presented, along with how the tables can be converted into LEXC format for several different purposes, including morphological analysis and entry guessing. The method has been implemented using the HFST finite-state transducer tools and their Python embedding, plus a number of small Python scripts for conversions. The method is tested with a near-complete implementation of Finnish verbs. An experiment in generating Finnish verb entries from corpus data is also described, as well as the creation of a full-scale analyzer for Finnish verbs using the conversion patterns.
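    A plain-Python stand-in for the guessing step may clarify the mapping from word forms to candidate entries. The patterns, class names and LEXC-style entry format below are invented for illustration; the paper compiles its patterns into an HFST transducer rather than matching regular expressions directly.

      import re

      # Each pattern describes word forms of one (hypothetical) inflection
      # class and maps a matched form back to a candidate entry.
      PATTERNS = [
          # forms like "sanoo" (3sg present) -> entry "sanoa" of class V-OA
          (re.compile(r"^(?P<stem>\w+?o)o$"), lambda m: m.group("stem") + "a:V-OA"),
          # forms like "puhuu" -> entry "puhua" of class V-UA
          (re.compile(r"^(?P<stem>\w+?u)u$"), lambda m: m.group("stem") + "a:V-UA"),
      ]

      def guess_entries(word_form):
          """Return all candidate entries that could generate the form."""
          candidates = []
          for pattern, to_entry in PATTERNS:
              m = pattern.match(word_form)
              if m:
                  candidates.append(to_entry(m))
          return candidates

      print(guess_entries("sanoo"))  # ['sanoa:V-OA']
      print(guess_entries("puhuu"))  # ['puhua:V-UA']

    Since an FST is invertible, the compiled version of such patterns can run in both directions, which is what lets one machine serve headword conversion, interactive guessing and corpus-based guessing alike.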
  • Koskenniemi, Kimmo Matti; Kuutti, Pirkko (Research Institute for Linguistics, Hungarian Academy of Sciences, 2017)
  • Hakala, Tero; Hulten, Annika; Lehtonen, Minna; Lagus, Krista; Salmelin, Riitta (2018)
    Neuroimaging studies of the reading process point to functionally distinct stages in word recognition. Yet, current understanding of the operations linked to those various stages is mainly descriptive in nature. Approaches developed in the field of computational linguistics may offer a more quantitative approach to understanding brain dynamics. Our aim was to evaluate whether a statistical model of morphology, with well-defined computational principles, can capture the neural dynamics of reading, using the concept of surprisal from information theory as the common measure. The Morfessor model, created for unsupervised discovery of morphemes, is based on the minimum description length principle and attempts to find optimal units of representation for complex words. In a word recognition task, we correlated brain responses with word surprisal values derived from Morfessor and with other psycholinguistic variables that have been linked with various levels of linguistic abstraction. The magnetoencephalography data analysis focused on spatially, temporally and functionally distinct components of cortical activation observed in reading tasks. The early occipital and occipito-temporal responses were correlated with parameters relating to visual complexity and orthographic properties, whereas the later bilateral superior temporal activation was correlated with whole-word-based and morphological models. The results show that the word processing costs estimated by the statistical Morfessor model are relevant to the brain dynamics of reading during late processing stages.
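    A small hand-computed sketch of the surprisal measure used here (with invented toy probabilities, not Morfessor estimates): under a model that treats a word as a sequence of independent morphs, the word's surprisal is the sum of -log2 P(morph) over the morphs in its segmentation.

      import math

      # Invented unigram morph probabilities for illustration only.
      morph_prob = {"talo": 0.01, "issa": 0.02, "kirja": 0.008}

      def surprisal(morphs):
          """Word surprisal in bits under an independent-morph model."""
          return sum(-math.log2(morph_prob[m]) for m in morphs)

      # "taloissa" ('in the houses') with a toy segmentation talo + issa:
      # -log2(0.01) + -log2(0.02) = 6.64 + 5.64 bits
      print(round(surprisal(["talo", "issa"]), 2))  # 12.29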
  • Lison, Pierre; Tiedemann, Jörg; Kouylekov, Milen (European Language Resources Association (ELRA), 2018)
  • Bozovic, Petar; Erjavec, Tomaz; Tiedemann, Jörg; Ljubesic, Nikola; Gorjanc, Vojko (Ljubljana University Press, 2018)
  • Zampieri, Marcos; Nakov, Preslav; Ljubesic, Nikola; Tiedemann, Jörg; Malmasi, Shervin; Ali, Ahmed (The Association for Computational Linguistics, 2018)
  • Venekoski, Viljami (Helsingfors universitet, 2016)
    Advances in computational linguistics have made analyzing large quantities of text data a more feasible task than ever before. In particular, recent distributional language models hold promise of effective semantic analysis at a low computational cost. Semantics, however, is a multifaceted phenomenon, and although various language model architectures have been presented, there is relatively little research evaluating the semantic validity of such models. The aim of this research is to evaluate the semantic validity of different distributional language models, particularly as tools for representing Finnish-language online text data. The models and methods are evaluated based on their performance in three empirical studies, each estimating a different aspect of semantic representation. The language models in the studies were built using the word2vec architecture and trained on approximately 2.6 billion tokens from the Suomi24 corpus of Finnish-language social media discussions. In total, 18 models were built, each with a different combination of feature processing methods. The models were evaluated in three studies. For Study I, a resource consisting of 300 similarity ratings for word pairs from 55 human annotators was collected. This resource was used as an evaluation task by comparing model-estimated similarity scores to the human similarity judgments. Study II investigated relational semantics as an evaluation method, operationalized in the form of an analogy task, for which a Finnish-language resource is presented. In Study III, the language models were evaluated based on their performance in classifying Suomi24 messages into their respective topics. The results of the studies indicate that each presented evaluation task is a sufficiently reliable method for estimating language model semantic validity. In turn, the distributional language models are shown to be able to represent semantics given morphologically rich yet fragmentary Finnish-language social media data. Feature processing methods are shown to increase the semantic accuracy of language models in most cases, but to a limited extent. If validated, semantic language technologies are proposed to hold widespread applicability across scientific as well as commercial fields.
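    A hedged Python sketch of the Study I setup, assuming gensim (version 4 or later) and toy data in place of the Suomi24 corpus and the 300-pair rating resource: train word2vec on tokenized sentences, score the same word pairs with the model, and rank-correlate the model scores with human ratings.

      from gensim.models import Word2Vec
      from scipy.stats import spearmanr

      # Tiny repeated corpus so the toy model has something to fit;
      # the real models were trained on ~2.6 billion Suomi24 tokens.
      sentences = [
          ["kissa", "istuu", "matolla"],
          ["koira", "istuu", "matolla"],
          ["kissa", "ja", "koira", "leikkivät"],
      ] * 50

      model = Word2Vec(sentences, vector_size=50, window=2, min_count=1, epochs=20)

      pairs = [("kissa", "koira"), ("kissa", "istuu"), ("kissa", "matolla")]
      human = [0.8, 0.3, 0.2]  # invented ratings on a 0-1 scale

      model_scores = [model.wv.similarity(a, b) for a, b in pairs]
      rho, p = spearmanr(human, model_scores)
      print(rho)  # rank correlation between human and model similarities

    The same trained vectors would feed the other two evaluations: an analogy task (Study II) and topic classification of messages (Study III).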
  • Koskenniemi, Kimmo (CSLI publications, 2019)
    CSLI Lecture Notes
  • Koskenniemi, Kimmo (University of Helsinki. Department of General Linguistics, 1983)
    This dissertation presents a new computationally implemented linguistic model for morphological analysis and synthesis. The model incorporates a general formalism for writing morphological descriptions of particular languages, and a language-independent program implementing the model. The two-level formalism and the structure of the program are formally defined. The program can utilize descriptions of various languages, including highly inflected ones such as Finnish, Russian, or Sanskrit. The new model is unrestricted in scope and capable of handling the entire language system as well as ordinary running text. A full description of Finnish inflectional morphology is presented in order to validate the model. The two-level model is based on a lexicon system and a set of two-level rules. It differs from generative phonology in the following respects: the rules are parallel rather than sequentially ordered, as the rewriting rules of generative phonology are, and the two-level model is fully bidirectional, both conceptually and processually. It can also be interpreted as a morphological model of the performance processes of word-form recognition and production. The model and the descriptions are based on computationally simple machinery, mostly small finite-state automata. The computational complexity of the model is discussed, and the description of Finnish is evaluated with respect to external evidence from child language acquisition.
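    The contrast with generative phonology can be made concrete with a toy Python sketch (invented rules, not the dissertation's): sequentially ordered rewrite rules compose, so their order changes the output, whereas parallel two-level constraints all inspect the same lexical/surface pairing at once and have no order at all.

      # Sequential regime: each rewrite feeds the next, so order matters.
      def rewrite_a_to_b(s): return s.replace("a", "b")
      def rewrite_b_to_c(s): return s.replace("b", "c")

      print(rewrite_b_to_c(rewrite_a_to_b("ab")))  # 'cc'
      print(rewrite_a_to_b(rewrite_b_to_c("ab")))  # 'bc' (different result)

      # Parallel regime: every constraint sees the same pairing, and the
      # pairing stands or falls on all of them simultaneously.
      def no_a_on_surface(pairs): return all(surf != "a" for _, surf in pairs)
      def b_only_from_a(pairs):   return all(lex == "a" for lex, surf in pairs if surf == "b")

      constraints = [no_a_on_surface, b_only_from_a]
      pairing = [("a", "b"), ("b", "b")]
      print(all(c(pairing) for c in constraints))  # False: ('b','b') violates b_only_from_a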
  • Bjerva, Johannes; Östling, Robert; Han Veiga, Maria; Tiedemann, Jörg; Augenstein, Isabelle (2019)
    A neural language model trained on a text corpus can be used to induce distributed representations of words, such that similar words end up with similar representations. If the corpus is multilingual, the same model can be used to learn distributed representations of languages, such that similar languages end up with similar representations. We show that this holds even when the multilingual corpus has been translated into English, by picking up the faint signal left by the source languages. However, just as it is a thorny problem to separate semantic from syntactic similarity in word representations, it is not obvious what type of similarity is captured by language representations. We investigate correlations and causal relationships between language representations learned from translations on the one hand, and genetic, geographical, and several levels of structural similarity between languages on the other. Of these, structural similarity is found to correlate most strongly with language representation similarity, while genetic relationships (a convenient benchmark used for evaluation in previous work) appear to be a confounding factor. Apart from implications about translation effects, we see this more generally as a case where NLP and linguistic typology can interact and benefit one another.
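    A sketch of the correlation analysis with invented data (random vectors and made-up structural scores, not the paper's learned representations): pairwise cosine similarities between language vectors are rank-correlated with an external similarity measure for the same language pairs.

      import numpy as np
      from scipy.stats import spearmanr

      # Stand-ins for learned language representations.
      rng = np.random.default_rng(0)
      lang_vecs = {lang: rng.normal(size=8) for lang in ["fi", "et", "hu", "de"]}

      def cosine(u, v):
          return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

      pairs = [("fi", "et"), ("fi", "hu"), ("fi", "de"),
               ("et", "hu"), ("et", "de"), ("hu", "de")]
      rep_sim = [cosine(lang_vecs[a], lang_vecs[b]) for a, b in pairs]

      # Invented structural-similarity scores for the same pairs (0-1 scale).
      structural = [0.9, 0.5, 0.2, 0.5, 0.2, 0.2]

      rho, p = spearmanr(rep_sim, structural)
      print(rho)  # how well representation similarity tracks structural similarity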
  • Linden, Krister ([s.n.], 2005)