Browsing by Subject "NLP"

Now showing items 1-12 of 12
  • Leppämäki, Tatu (Helsingin yliopisto, 2022)
    Ever more data is available and shared through the internet. These large masses of data often have a spatial dimension and can take many forms, one of which is digital text, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with the fuzzy linguistic information. This creates the need to transform the linguistic location information into an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. The major language of study in geoparser research has been English, and geoparsers tend to be language-specific, which threatens to leave the experiences studied and expressed in smaller languages unexplored. This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how can the reliability and usability of geoparsers be evaluated? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly. Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have seen a recent turn to deep learning methods and models, where the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold standard datasets where toponyms and their coordinates are marked. 
Performance is measured with equivalence-based metrics for toponym recognition and distance-based metrics for toponym resolution. Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata, with input texts and the recognized toponyms and coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to the performance of geoparsers of English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguity of the studied language and the geographical world or by the processing itself, for example by the lemmatizer. Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has potential to be the first step for geoparsers for Finnish and it can be a stepping stone for future applied research.
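    The two kinds of evaluation metric mentioned in the abstract above can be illustrated with a minimal sketch. It is not taken from Finger or its evaluation code: the function names, the representation of toponyms as character spans, and the use of the haversine formula for the distance error are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1 = map(math.radians, a)
    lat2, lon2 = map(math.radians, b)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    h = (math.sin(dlat / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(h))


def recognition_scores(gold_spans, predicted_spans):
    """Equivalence-based metric: a predicted span counts only on exact match.

    Spans are hypothetical (start, end) character offsets of toponyms.
    Returns (precision, recall, f1).
    """
    gold, pred = set(gold_spans), set(predicted_spans)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def resolution_errors_km(gold_coords, predicted_coords):
    """Distance-based metric: error in km for each resolved toponym."""
    return [haversine_km(g, p) for g, p in zip(gold_coords, predicted_coords)]
```

    The per-toponym errors returned by `resolution_errors_km` could then be summarized, for example by their mean or median; which summary the thesis reports is not stated in the abstract.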
  • Joosten, Rick (Helsingin yliopisto, 2020)
    In the past two decades, an increasing number of discussions have been held via online platforms such as Facebook or Reddit. The most common source of disruption in these discussions is trolls. Traditional trolls try to divert the discussion into a nonconstructive argument. One strategy to achieve this is to give asymmetric responses, that is, responses that do not follow the conventional patterns. In this thesis we propose a modern machine learning NLP method called ULMFiT to automatically detect the discourse acts of online forum posts in order to detect these conversational patterns. ULMFiT fine-tunes the language model before training its classifier in order to create a more accurate language representation of the domain language. This task of discourse act recognition is unique since it attempts to classify the pragmatic role of each post within a conversation, as opposed to the functional role, which is related to tasks such as question-answer retrieval, sentiment analysis, or sarcasm detection. Furthermore, most discourse act recognition research has focused on synchronous conversations where all parties can directly interact with each other, while this thesis looks at asynchronous online conversations. Trained on a dataset of Reddit discussions, the proposed model achieves a Matthews correlation coefficient of 0.605 and an F1-score of 0.69 when predicting the discourse acts. Other experiments show that this model is also effective at question-answer classification and that language model fine-tuning has a positive effect on both classification performance and the required size of the training data. These results could be beneficial for current trolling detection systems.
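    For readers unfamiliar with the two scores reported above, the following sketch computes them for the simplest, binary case from confusion-matrix counts. The thesis classifies several discourse acts, and the abstract does not say which multi-class variant or averaging scheme was used, so the binary formulas and the function names here are illustrative assumptions only.

```python
import math


def matthews_corrcoef_binary(tp, tn, fp, fn):
    """Binary Matthews correlation coefficient from confusion counts.

    Returns 0.0 when any marginal total is zero, a common convention.
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def f1_binary(tp, fp, fn):
    """Binary F1-score: harmonic mean of precision and recall."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0
```

    Unlike F1, the Matthews coefficient also takes true negatives into account, which is one reason the two scores can differ noticeably on the same predictions.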
  • Hämäläinen, Mika (2021)
    The term low-resourced has been tossed around in the field of natural language processing to a degree that almost any language that is not English can be called "low-resourced"; sometimes even just for the sake of making a mundane or mediocre paper appear more interesting and insightful. In a field where English is a synonym for language and low-resourced is a synonym for anything not English, calling endangered languages low-resourced is a bit of an overstatement. In this paper, I inspect the relation of the endangered with the low-resourced from my own experiences.
  • Mickus, Timothee; Paperno, Denis; Constant, Mathieu (2022)
    Pretrained embeddings based on the Transformer architecture have taken the NLP community by storm. We show that they can mathematically be reframed as a sum of vector factors and showcase how to use this reframing to study the impact of each component. We provide evidence that multi-head attentions and feed-forwards are not equally useful in all downstream applications, as well as a quantitative overview of the effects of finetuning on the overall embedding space. This approach allows us to draw connections to a wide range of previous studies, from vector space anisotropy to attention weights.
  • Suolahti, Riitta (Helsingin yliopisto, 2000)
    Verkkari ; 2000 (13)
  • Salmenkivi, Essi (Helsingin yliopisto, 2020)
    This work introduces a system for generating radio play scripts. Generating dramatic dialogue presents unique challenges in language generation. In addition to fluency of language, dramatic text should exhibit plot and the characters' affective stances towards each other and events. Character relationships and affect may be expressed beneath the surface level of everyday conversation topics. In the affect-driven dialogue generation system introduced by this thesis, characters have goals, relationships and a three-dimensional model of mood which influences their behaviour. Given conflicting goals, characters will navigate the web of conversation, making choices that influence others to accept their goal while simultaneously trying to maintain their relationships with others. Characters react emotionally to each other's speech acts and express their own affective state in how they speak. The system separates the form of a sentence from its content, allowing the system to generate a wide range of coherent, dramatic conversations by combining affect-expressing sentence templates with goal-expressing content. Because content and form are independent of each other, only a finite number of sentence templates need to be prepared to generate conversations about any content.
  • Palma-Suominen, Saara (Helsingin yliopisto, 2021)
    This master's thesis addresses multilingual named entity recognition. The thesis tests two approaches to multilingual named entity recognition: transferring annotated data to other languages, and building a multilingual model. In addition, the two approaches are combined. The aim is to find methods with which named entity recognition can be performed reliably also for smaller languages for which annotated named entity recognition datasets are not widely available. Models are trained and tested in four languages: Finnish, Estonian, Dutch and Spanish. In the first method, annotated data is transferred from one language to another using a multilingual parallel corpus, and the resulting data is used to train a neural-network-based machine learning model. The second method uses a multilingual BERT model. The model is trained on annotated corpora that are combined into a multilingual training set. In the third method, the two previous methods are combined, and data transferred from one language to another is used to train the multilingual BERT model. All three approaches are tested on an annotated test set for each language, and the results are compared with one another. The method in which a multilingual BERT model was built achieved clearly the best results in named entity recognition. The neural network models trained on annotations transferred from one language to another obtained clearly weaker results. Training the BERT model on transferred annotations also produced weak results. Transferring annotations from one language to another proved challenging, and the resulting data contained errors. The weak results were also affected by the training and test data belonging to different genres. According to the thesis, the multilingual BERT model is the best-performing of the methods tested, and it is also suitable for languages for which little annotated data is available.
  • Malmelin, Karoliina (2001)
    The study examines the foundations of NLP, that is, neuro-linguistic programming, and its conception of communication from the perspective of persuasion. The focus is on interpersonal and small-group communication. The purpose of the study is to produce basic knowledge about NLP from the viewpoint of communication. The study asks what NLP is, what its conception of communication is, and what means of persuasion NLP recognizes. The source literature consists of works by NLP's pioneers as well as Finnish basic texts and applied NLP guides. Most of the authors have worked under the personal guidance of NLP's founder Richard Bandler, and Bandler has commented on and guided the writing of the books. The literature used therefore represents NLP as a whole well. The literature has been used both as a source and as an object of analysis. The work surveys the history of NLP and presents its central aims. NLP's relationship to science, theory and research is discussed, and NLP's most central models and views are compared against theories and models of communication. Particular attention is paid to how NLP meets the criteria of persuasive communication and what means its models provide for persuasion. The other half of the work is devoted to the background assumptions, values and attitudes that guide NLP's conception of communication, and to their metaphors. The material and the questions have guided the choice of method and the use of the material. The descriptive and comparative approach is complemented with ideas familiar from, among others, the analysis of metaphors, rhetoric and discourses. The aim is to treat the topic systematically. The material for the metaphor analysis consists of NLP's central background assumptions, which have underpinned all of the books. The presentation of the models aims at a cross-section. The study shows that the communicative material in NLP is abundant but rests on a weak theoretical foundation. NLP's models can be used to classify the different parts of the communication process, to analyze various persuasion strategies and also to produce them. 
NLP's systems-theoretical background is strong, but the requirement placed on a system, that changes in one subsystem carry over to other subsystems, is not always met in NLP. The explanatory power of the communication models and claims is weak, because they do not sufficiently account for different variables and contexts. The metaphors used by NLP followed quite conventional ways of describing interaction and perception. They did not particularly distinguish NLP from other approaches, nor did they bring radically new perspectives to NLP's objects of interest. The use of metaphor even seemed in part to work against its purpose, which is to illustrate and simplify matters. The programmability of a human being is in NLP mainly a metaphor, although NLP does not always treat it as such. Significant paradoxes emerged from NLP's central background assumptions. A further problem is NLP's way of presenting value claims as scientific claims.
  • Mickus, Timothee; Van Deemter, Kees; Constant, Mathieu; Paperno, Denis (The Association for Computational Linguistics, 2022)
    Word embeddings have advanced the state of the art in NLP across numerous tasks. Understanding the contents of dense neural representations is of utmost interest to the computational semantics community. We propose to focus on relating these opaque word vectors with human-readable definitions, as found in dictionaries. This problem naturally divides into two subtasks: converting definitions into embeddings, and converting embeddings into definitions. This task was conducted in a multilingual setting, using comparable sets of embeddings trained homogeneously.
  • Nevalainen, Janne (Helsingin yliopisto, 2020)
    Modern neural-network-based language models can reach state-of-the-art performance on a wide range of natural language tasks. Their success is based on the capability to learn from large unlabeled data by pretraining, using transfer learning to learn strong representations of the language and transferring what has been learned to new domains and tasks. I look at how language models produce transfer learning for NLP, especially from the viewpoint of classification. How can transfer learning be formally defined? I compare different LM implementations in theory and also use two example data sets to empirically test their performance on very small labeled training data.
  • Leal, Rafael (Helsingin yliopisto, 2020)
    In modern Natural Language Processing, document categorisation tasks can achieve success rates of over 95% using fine-tuned neural network models. However, so-called "zero-shot" situations, where specific training data is not available, are researched much less frequently. The objective of this thesis is to investigate how pre-trained Finnish language models fare when classifying documents in a completely unsupervised way: by relying only on their general "knowledge of the world" obtained during training, without using any additional data. Two datasets are created expressly for this study, since labelled and openly available datasets in Finnish are very uncommon: one is built using around 5k news articles from Yle, the Finnish Broadcasting Company, and the other from 100 pieces of Finnish legislation obtained from the Semantic Finlex data service. Several language representation models are built, based on the vector space model, by combining modular elements: different kinds of textual representations for documents and category labels, different algorithms that transform these representations into vectors (TF-IDF, Annif, fastText, LASER, FinBERT, S-BERT), different similarity measures and post-processing techniques (such as SVD and ensemble models). This approach allows for a variety of models to be tested. The combination of Annif for extracting keywords and fastText for producing word embeddings out of them achieves F1 scores of 0.64 on the Finlex dataset and 0.73-0.74 on the Yle datasets. Model ensembles are able to raise these figures by up to three percentage points. SVD can bring these numbers to 0.7 and 0.74-0.75 respectively, but these gains are not necessarily reproducible on unseen data. These results are distant from the ones obtained from state-of-the-art supervised models, but this is a method that is flexible, can be quickly deployed and, most importantly, does not depend on labelled data, which can be slow and expensive to make. 
A reliable way to set the input parameter for SVD would be an important next step for the work done in this thesis.
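    The general vector space idea described in the abstract above (represent documents and category labels as vectors, then pick the most similar label) can be sketched in a few lines. This is only an illustration under stated assumptions: it uses a hand-rolled TF-IDF weighting and cosine similarity on toy English texts, and the function names, the whitespace tokenizer and the label descriptions are hypothetical. It does not reproduce the Annif, fastText or BERT-based pipelines evaluated in the thesis.

```python
import math
from collections import Counter


def tokenize(text):
    """Naive lowercase whitespace tokenizer (an assumption for the sketch)."""
    return text.lower().split()


def tfidf_vectors(texts):
    """Sparse TF-IDF vectors (token -> weight) with smoothed IDF."""
    tokenized = [tokenize(t) for t in texts]
    n_texts = len(tokenized)
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        term_freq = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n_texts) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in term_freq.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(tok, 0.0) for tok, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_classify(documents, label_descriptions):
    """Assign each document the label whose description is most similar.

    No labelled training documents are used: labels are represented
    only by their textual descriptions.
    """
    names = list(label_descriptions)
    texts = list(documents) + [label_descriptions[name] for name in names]
    vectors = tfidf_vectors(texts)
    doc_vecs = vectors[:len(documents)]
    label_vecs = dict(zip(names, vectors[len(documents):]))
    return [max(names, key=lambda name: cosine(doc, label_vecs[name]))
            for doc in doc_vecs]
```

    With dense embeddings in place of the sparse TF-IDF dictionaries, the same nearest-label step would apply unchanged, which is what makes the modular design described in the abstract possible.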
  • Kjellman, Martin (Helsingin yliopisto, 2021)
    This thesis examines how representatives of service providers for news automation perceive a) journalists and news organisations and b) the service providers’ relationship to these. By introducing new technology (natural language generation, i.e. the transformation of data into everyday language) that influences both the production and business models of news media, news automation represents a type of media innovation. The service providers represent actors peripheral to journalism. The theoretical framework takes hybrid media logics as its starting point, meaning that the power dynamics of news production are thought to be influenced by the field-specific logics of the actors involved. The hybridity metaphor is deepened by using a typology for journalistic strangers that takes into account the different roles peripheral actors adopt in relation to journalists and news organisations. Journalism is understood throughout as a professional ideology encountered by service providers who work with news organisations. Semi-structured interviews were conducted with representatives from companies that create natural language generation software used to produce journalistic text based on data. Participants were asked about their experiences working with news media and the interviews (N=6) were analysed phenomenologically. The findings form three distinct but interrelated dimensions of how the service providers perceive news media and journalism: an area that sorely needs innovators (potential) but lacks resources in terms of knowledge, money and will to innovate (obstacles), but one that they can ultimately learn from and collaborate with (solutions). Their own relationship to journalism and news media is not fixed to one single role. Instead, they alternate between challenging news media (explicit interloping) and inhabiting a supportive role (implicit interloping). 
This thesis serves as an exploration into how service providers for news automation affect the power dynamics of news production. It does so by unveiling how journalists and news organisations are perceived, and by adding further understanding to previous research on actors peripheral to journalism. In order to further untangle how service providers for news automation shift the balance of power shaping news production, future research should attempt to unify the way traditional news media actors and service providers perceive each other and their collaborations.