Browsing by Subject "Natural language processing"


Now showing items 1-15 of 15
  • Koskenniemi, Kimmo Matti (Linköping University Electronic Press, 2017)
    NEALT Proceedings Series
    The paper presents two finite-state methods which can be used for aligning pairs of cognate words or sets of different allomorphs of stems. Both methods use weighted finite-state machines for choosing the best alternative. Individual letter or phoneme correspondences can be weighted according to various principles, e.g. using distinctive features. The comparison of just two forms at a time is simple, so that method is easier to refine to include context conditions. Both methods are language independent and could be tuned for and applied to several types of languages for producing gold standard data. The algorithms were implemented as short Python programs using the HFST finite-state library. The paper demonstrates that solving some non-trivial problems has become easier and more accessible to a wider range of scholars.
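    The core idea of the pairwise method, weighted letter correspondences selecting the cheapest alignment, can be illustrated without finite-state machinery. The sketch below is an illustrative dynamic-programming stand-in, not the paper's HFST implementation; the vowel/consonant substitution costs and the gap weight are made-up assumptions playing the role of distinctive-feature weights.

```python
# Minimal weighted alignment of two word forms (illustration only; the
# paper's methods are built with the HFST finite-state library).

def align(a, b, sub_cost=None):
    """Return (cost, alignment) for the cheapest letter-by-letter
    alignment of strings a and b. sub_cost(x, y) weights pairing letter
    x with letter y; '-' marks an insertion or deletion."""
    if sub_cost is None:
        # Default weights: identical letters are free; same-class pairs
        # (vowel-vowel, consonant-consonant) are cheaper than cross-class
        # pairs -- a crude stand-in for distinctive-feature weighting.
        vowels = set("aeiouyäö")
        def sub_cost(x, y):
            if x == y:
                return 0.0
            return 1.0 if (x in vowels) == (y in vowels) else 2.0
    GAP = 1.5  # weight of aligning a letter with nothing
    n, m = len(a), len(b)
    dp = [[None] * (m + 1) for _ in range(n + 1)]  # dp[i][j] = (cost, backpointer)
    dp[0][0] = (0.0, None)
    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:  # substitute (or match) a letter pair
                cands.append((dp[i-1][j-1][0] + sub_cost(a[i-1], b[j-1]),
                              (i-1, j-1, a[i-1], b[j-1])))
            if i > 0:            # delete from a
                cands.append((dp[i-1][j][0] + GAP, (i-1, j, a[i-1], '-')))
            if j > 0:            # insert from b
                cands.append((dp[i][j-1][0] + GAP, (i, j-1, '-', b[j-1])))
            dp[i][j] = min(cands)
    # Trace the best path back to recover the alignment.
    pairs, i, j = [], n, m
    while (i, j) != (0, 0):
        pi, pj, x, y = dp[i][j][1]
        pairs.append((x, y))
        i, j = pi, pj
    return dp[n][m][0], pairs[::-1]
```

    In a finite-state setting the same weights would be compiled into a transducer and the best path found by shortest-distance search; the dynamic program above computes the identical optimum for the two-form case.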
  • Pollak, Senja; Boggia, Michele; Linden, Carl-Gustav; Leppänen, Leo; Zosa, Elaine; Toivonen, Hannu (The Association for Computational Linguistics, 2021)
  • Oberbichler, Sarah; Boroş, Emanuela; Doucet, Antoine; Marjanen, Jani; Pfanzelter, Eva; Rautiainen, Juha; Toivonen, Hannu; Tolonen, Mikko (2022)
    This article considers the interdisciplinary opportunities and challenges of working with digital cultural heritage, such as digitized historical newspapers, and proposes an integrated digital hermeneutics workflow to combine purely disciplinary research approaches from computer science, humanities, and library work. Common interests and motivations of the above-mentioned disciplines have resulted in interdisciplinary projects and collaborations such as the NewsEye project, which is working on novel solutions for how digital heritage data is (re)searched, accessed, used, and analyzed. We argue that collaborations of different disciplines can benefit from a good understanding of the workflows and traditions of each of the disciplines involved but must find integrated approaches to successfully exploit the full potential of digitized sources. The paper furthermore provides insight into digital tools, methods, and hermeneutics in action, showing that integrated interdisciplinary research needs to build something in between the disciplines while respecting and understanding each other's expertise and expectations.
  • Moisio, Mikko (Helsingin yliopisto, 2021)
    Semantic textual similarity (STS), the procedure of determining how similar pieces of text are in terms of their meaning, is an important problem in the rapidly evolving field of natural language processing (NLP). STS accelerates major information retrieval applications dealing with natural language text, such as web search engines. For computational efficiency reasons, text pieces are often encoded into semantically meaningful real-valued vectors, sentence embeddings, that can be compared with similarity metrics. The majority of recent NLP research has focused on a small set of the largest Indo-European languages and Chinese. Although much of the research is machine learning oriented and thus often applicable across languages, languages with smaller speaker populations, such as Finnish, often lack the annotated data required to train, or even evaluate, complex models. BERT, a language representation framework building on transfer learning, is one of the recent quantum leaps in NLP research. BERT-type models take advantage of unsupervised pre-training, reducing the annotated data demands of supervised tasks. Furthermore, a BERT modification called Sentence-BERT enables us to extend and train BERT-type models to derive semantically meaningful sentence embeddings. However, even though the annotated data demands for conventionally training a Sentence-BERT model are relatively low, such data is often unavailable for low-resource languages. Multilingual knowledge distillation has been shown to be a working strategy for extending monolingual Sentence-BERT models to new languages. This technique allows transferring and merging desired properties of two language models and, instead of annotated data, consumes bilingual parallel samples. In this thesis we study using knowledge distillation to transfer STS properties learnt from English into a model pre-trained on Finnish, bypassing the lack of annotated Finnish data. Further, we experiment with distillation using different types of data (English-Finnish bilingual, English monolingual, and random pseudo samples) to observe which properties of the training data are really necessary. We acquire a bilingual English-Finnish test dataset by translating an existing annotated English dataset and use this set to evaluate the fit of our resulting models. We evaluate the performance of the models on different tasks (English, Finnish, and English-Finnish cross-lingual STS) to observe how well the properties being transferred are captured, and how well the models retain the desired properties they already have. We find that knowledge distillation is indeed a feasible approach for obtaining a relatively high-quality Sentence-BERT for Finnish. Surprisingly, in all setups a large portion of the desired properties are transferred to the Finnish model, and training with English-Finnish bilingual data yields the best Finnish sentence embedding model we are aware of.
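    The distillation objective described above can be sketched in miniature: a frozen English "teacher" defines target embeddings, and a bilingual "student" is trained so that both an English sentence and its Finnish translation land on the teacher's embedding of the English side. The sketch below is a toy linear-model illustration, not the thesis code; sentences are stand-in vectors, and the "translation" is a fixed permutation introduced purely for demonstration.

```python
# Toy sketch of the multilingual knowledge-distillation objective:
# minimize MSE between student(en) / student(fi) and teacher(en).

import numpy as np

rng = np.random.default_rng(42)
d = 4                                  # toy "vocabulary" dimension
teacher_W = rng.normal(size=(3, d))    # frozen English teacher (3-d embeddings)
student_W = np.zeros((3, 2 * d))       # student sees an 8-d bilingual input

perm = [2, 0, 3, 1]                    # fake "translation": a fixed permutation

def bilingual_inputs(sent):
    """English and Finnish views of the same toy 'sentence' vector."""
    en = np.concatenate([sent, np.zeros(d)])        # English half active
    fi = np.concatenate([np.zeros(d), sent[perm]])  # Finnish half active
    return en, fi

data = [rng.normal(size=d) for _ in range(64)]      # parallel "sentences"

lr = 0.01
for _ in range(200):                   # epochs of plain SGD
    for sent in data:
        target = teacher_W @ sent      # teacher embeds the English side
        en, fi = bilingual_inputs(sent)
        for x in (en, fi):             # distillation: both views chase the target
            err = student_W @ x - target
            student_W -= lr * 2 * np.outer(err, x)  # MSE gradient step

# After training, the student's Finnish embedding should sit close to
# the teacher's embedding of the English counterpart.
sent = data[0]
en, fi = bilingual_inputs(sent)
gap = float(np.mean((student_W @ fi - teacher_W @ sent) ** 2))
```

    The real setup replaces the linear maps with BERT-type encoders, which is what lets one shared student map a Finnish sentence and its English translation to the same point without the artificial split input used here.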
  • Koponen, Maarit; Sulubacak, Umut; Vitikainen, Kaisa; Tiedemann, Jörg (European Association for Machine Translation, 2020)
    This paper presents a user evaluation of machine translation and post-editing for TV subtitles. Based on a process study where 12 professional subtitlers translated and post-edited subtitles, we compare effort in terms of task time and number of keystrokes. We also discuss examples of specific subtitling features like condensation, and how these features may have affected the post-editing results. In addition to overall MT quality, segmentation and timing of the subtitles are found to be important issues to be addressed in future work.
  • Vazquez Carrillo, Juan Raul; Raganato, Alessandro; Tiedemann, Jörg; Creutz, Mathias (The Association for Computational Linguistics, 2019)
    In this paper, we propose a multilingual encoder-decoder architecture capable of obtaining multilingual sentence representations by means of incorporating an intermediate attention bridge that is shared across all languages. That is, we train the model with language-specific encoders and decoders that are connected via self-attention with a shared layer that we call the attention bridge. This layer exploits the semantics from each language for performing translation and develops into a language-independent meaning representation that can efficiently be used for transfer learning. We present a new framework for the efficient development of multilingual NMT using this model and scheduled training. We have tested the approach in a systematic way with a multi-parallel data set. We show that the model achieves substantial improvements over strong bilingual models and that it also works well for zero-shot translation, which demonstrates its capacity for abstraction and transfer learning.
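    The mechanics of such a shared layer can be sketched as structured self-attention pooling: one weight matrix, shared by every language pair, turns a variable-length sequence of encoder states into a fixed number of attention-pooled vectors that all decoders consume. The numpy sketch below is an assumed simplification (shapes and the two-matrix scoring function are illustrative choices, not the authors' exact parameterization).

```python
# Rough sketch of an "attention bridge": k shared attention heads pool
# any encoder's hidden states into a fixed-size sentence matrix.

import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_bridge(H, W1, W2):
    """H:  (seq_len, hidden) encoder states from ANY language's encoder.
    W1: (hidden, da), W2: (da, k) -- shared across all languages.
    Returns (k, hidden): k attention-pooled views of the sentence."""
    A = softmax(np.tanh(H @ W1) @ W2, axis=0)  # (seq_len, k) attention weights
    return A.T @ H                             # (k, hidden) fixed-size bridge

rng = np.random.default_rng(0)
hidden, da, k = 16, 8, 4
W1 = rng.normal(size=(hidden, da))
W2 = rng.normal(size=(da, k))

# Sentences of different lengths (e.g. from different encoders) map to
# the same fixed-size representation that every decoder attends to.
short = attention_bridge(rng.normal(size=(5, hidden)), W1, W2)
long_ = attention_bridge(rng.normal(size=(12, hidden)), W1, W2)
```

    Because the output size is independent of sentence length and of the source language, the bridge gives decoders a uniform interface, which is what makes zero-shot pairings possible.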
  • Sulubacak, Umut; Caglayan, Ozan; Grönroos, Stig-Arne; Rouhe, Aku; Elliott, Desmond; Specia, Lucia; Tiedemann, Jörg (2020)
    Multimodal machine translation involves drawing information from more than one modality, based on the assumption that the additional modalities will contain useful alternative views of the input data. The most prominent tasks in this area are spoken language translation, image-guided translation, and video-guided translation, which exploit the audio and visual modalities. These tasks are distinguished from their monolingual counterparts of speech recognition, image captioning, and video captioning by the requirement of models to generate outputs in a different language. This survey reviews the major data resources for these tasks, the evaluation campaigns concentrated around them, the state of the art in end-to-end and pipeline approaches, and also the challenges in performance evaluation. The paper concludes with a discussion of directions for future research in these areas: the need for more expansive and challenging datasets, for targeted evaluations of model performance, and for multimodality in both the input and output space.
  • Çolakoğlu, Talha; Sulubacak, Umut; Tantuğ, Ahmet Cüneyd (The Association for Computational Linguistics, 2019)
    With the growth of the social web, user-generated text data has reached unprecedented sizes. Non-canonical text normalization provides a way to exploit this as a practical source of training data for language processing systems. The state of the art in Turkish text normalization is composed of a token level pipeline of modules, heavily dependent on external linguistic resources and manually defined rules. Instead, we propose a fully automated, context-aware machine translation approach with fewer stages of processing. Experiments with various implementations of our approach show that we are able to surpass the current best-performing system by a large margin.
  • Lison, Pierre; Tiedemann, Jörg; Kouylekov, Milen (European Language Resources Association (ELRA), 2018)
  • Talman, Aarne; Suni, Antti; Celikkanat, Hande; Kakouros, Sofoklis; Tiedemann, Jörg; Vainio, Martti (Linköping University Electronic Press, 2019)
    Linköping Electronic Conference Proceedings
    In this paper we introduce a new natural language processing dataset and benchmark for predicting prosodic prominence from written text. To our knowledge this will be the largest publicly available dataset with prosodic labels. We describe the dataset construction and the resulting benchmark dataset in detail and train a number of different models ranging from feature-based classifiers to neural network systems for the prediction of discretized prosodic prominence. We show that pre-trained contextualized word representations from BERT outperform the other models even with less than 10% of the training data. Finally we discuss the dataset in light of the results and point to future research and plans for further improving both the dataset and methods of predicting prosodic prominence from text. The dataset and the code for the models are publicly available.
  • Toivonen, Hannu; Boggia, Michele (The Association for Computational Linguistics, 2021)
  • Rosa, Aaron B.; Gudowsky, Niklas; Repo, Petteri (2021)
    As foresight activities continue to increase across multiple arenas and types of organizations, the need to develop effective modes of reviewing future-oriented information against long-term goals and policies becomes more pressing. The activities of institutional sensemaking are vital in constructing potential and desired futures, but remain sensitive to organizational culture and ethos, thus raising concerns about whose futures are being constructed. In viewing foresight studies as a critical component in such sensemaking, this research investigates a method of textual analysis that deploys natural language processing (NLP) algorithms. In this research, we introduce and apply the methodology of topic modelling for conducting a comparative analysis to explore how citizen-derived foresight differs from other institutional foresight. Finally, we present prospects for further employing NLP for strategic foresight and futures studies.
  • Kangasharju, Arja Irmeli; Ilomäki, Liisa; Toom, Auli; Lakkala, Minna; Kantosalo, Anna; Toivonen, Hannu (2021)
    In this study we investigate whether a digital tool supports lower secondary school students in poetry writing and influences students' perceptions of poetry. It is essential to find new means to develop students' weakening writing competencies with digital tools and methods. This study analyzed students' perceptions of poems before and after writing poems with a co-creative tool called the Poetry Machine, together with the log data of poems written with it. We found that draft poems offered by the tool supported the students. Interestingly, this support received a higher evaluation by male students compared with the assessment by female students. The participants' perceptions of poetry writing changed positively during the period when using the tool, and most of them considered writing with it to be easy and fun. Our findings suggest that digital tools have the potential to positively change perceptions of challenging literary forms, such as poetry, and especially to support male students in writing. Digital tools, such as the Poetry Machine, offer opportunities to motivate students in online learning. However, young students need support in both face-to-face and online learning environments.
  • Vázquez, Raúl; Aulamo, Mikko; Sulubacak, Umut; Tiedemann, Jörg (The Association for Computational Linguistics, 2020)
    This paper describes the University of Helsinki Language Technology group’s participation in the IWSLT 2020 offline speech translation task, addressing the translation of English audio into German text. In line with this year’s task objective, we train both cascade and end-to-end systems for spoken language translation. We opt for an end-to-end multitasking architecture with shared internal representations and a cascade approach that follows a standard procedure consisting of ASR, correction, and MT stages. We also describe the experiments that served as a basis for the submitted systems. Our experiments reveal that multitasking training with shared internal representations is not only possible but allows for knowledge-transfer across modalities.
  • Dubossarsky, Haim; Hengchen, Simon; Tahmasebi, Nina; Schlechtweg, Dominik (ACL, 2019)
    State-of-the-art models of lexical semantic change detection suffer from noise stemming from vector space alignment. We have empirically tested the Temporal Referencing method for lexical semantic change and show that, by avoiding alignment, it is less affected by this noise. We show that, trained on a diachronic corpus, the skip-gram with negative sampling architecture with temporal referencing outperforms alignment models on a synthetic task as well as on a manual test set. We introduce a principled way to simulate lexical semantic change and systematically control for possible biases.
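    The preprocessing step behind Temporal Referencing is simple to illustrate: target words are rewritten with a time-period tag before embedding training, so the model learns one vector per tagged variant in a single shared space and no post-hoc alignment of separate spaces is needed. The sketch below is a minimal illustration with invented example sentences, not the authors' pipeline.

```python
# Minimal sketch of the Temporal Referencing preprocessing step.

def temporal_reference(tokens, period, targets):
    """Tag occurrences of target words with their time period; all
    other words stay shared across periods."""
    return [f"{t}_{period}" if t in targets else t for t in tokens]

# Invented toy corpora from two periods.
corpus_1900 = ["the", "apple", "fell", "from", "the", "tree"]
corpus_2000 = ["the", "apple", "released", "a", "new", "phone"]

targets = {"apple"}
tagged = (temporal_reference(corpus_1900, "1900", targets)
          + temporal_reference(corpus_2000, "2000", targets))
# A skip-gram model trained on `tagged` learns separate vectors for
# "apple_1900" and "apple_2000" in one shared space; the distance
# between them measures the word's semantic change directly.
```

    Because the non-target vocabulary is shared, the two variants' vectors are directly comparable, which is precisely what removes the alignment step and the noise it introduces.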