Unsupervised zero-shot classification of Finnish documents using pre-trained language models

Permalink

http://urn.fi/URN:NBN:fi:hulib-202012155147
Title: Unsupervised zero-shot classification of Finnish documents using pre-trained language models
Author: Leal, Rafael
Other contributor: Helsingin yliopisto, Humanistinen tiedekunta
University of Helsinki, Faculty of Arts
Helsingfors universitet, Humanistiska fakulteten
Publisher: Helsingin yliopisto
Date: 2020
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202012155147
http://hdl.handle.net/10138/323019
Thesis level: master's thesis
Degree program: Kielellisen diversiteetin ja digitaalisten menetelmien maisteriohjelma
Master's Programme Linguistic Diversity in the Digital Age
Magisterprogrammet i språklig diversitet och digitala metoder
Specialisation: Kieliteknologia
Language Technology
Språkteknologi
Abstract: In modern Natural Language Processing, document categorisation tasks can achieve success rates of over 95% using fine-tuned neural network models. However, so-called "zero-shot" situations, where specific training data is not available, are researched much less frequently. The objective of this thesis is to investigate how pre-trained Finnish language models fare when classifying documents in a completely unsupervised way: by relying only on their general "knowledge of the world" obtained during training, without using any additional data. Two datasets are created expressly for this study, since labelled and openly available datasets in Finnish are very uncommon: one is built from around 5k news articles from Yle, the Finnish Broadcasting Company, and the other from 100 pieces of Finnish legislation obtained from the Semantic Finlex data service. Several language representation models are built, based on the vector space model, by combining modular elements: different kinds of textual representations for documents and category labels, different algorithms that transform these representations into vectors (TF-IDF, Annif, fastText, LASER, FinBERT, S-BERT), different similarity measures and post-processing techniques (such as SVD and ensemble models). This approach allows for a variety of models to be tested. The combination of Annif for extracting keywords and fastText for producing word embeddings out of them achieves F1 scores of 0.64 on the Finlex dataset and 0.73-0.74 on the Yle datasets. Model ensembles are able to raise these figures by up to three percentage points. SVD can bring these numbers to 0.7 and 0.74-0.75 respectively, but these gains are not necessarily reproducible on unseen data. These results fall well short of those obtained by state-of-the-art supervised models, but the method is flexible, can be quickly deployed and, most importantly, does not depend on labelled data, which can be slow and expensive to produce. A reliable way to set the input parameter for SVD would be an important next step for the work done in this thesis.
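The vector space approach summarised in the abstract can be illustrated with a minimal sketch: documents and category labels are both mapped to vectors, and each document is assigned the label whose vector is most similar. This is not the thesis's implementation. The toy three-dimensional word vectors, the label definitions and the function names (`embed`, `cosine`, `classify`) are all hypothetical stand-ins; in the setting described above, the keywords would come from a tool such as Annif and the vectors from a pre-trained model such as fastText.

```python
import math

# Hypothetical toy word vectors standing in for pre-trained embeddings.
# Real embeddings (e.g. fastText) would have hundreds of dimensions.
TOY_VECTORS = {
    "law": [1.0, 0.0, 0.1],
    "legislation": [1.0, 0.1, 0.0],
    "court": [0.9, 0.1, 0.0],
    "statute": [0.8, 0.2, 0.0],
    "sports": [0.0, 1.0, 0.0],
    "match": [0.0, 0.9, 0.1],
    "goal": [0.1, 1.0, 0.0],
    "team": [0.1, 0.8, 0.1],
}
DIMENSIONS = 3


def embed(words):
    """Average the vectors of the known words; unknown words are skipped."""
    known = [TOY_VECTORS[w] for w in words if w in TOY_VECTORS]
    if not known:
        return [0.0] * DIMENSIONS
    return [sum(column) / len(known) for column in zip(*known)]


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def classify(keywords, labels):
    """Return the label whose description vector is closest to the document."""
    doc_vector = embed(keywords)
    return max(labels, key=lambda name: cosine(doc_vector, embed(labels[name])))


# Hypothetical category labels, each described by a few words.
LABELS = {
    "legislation": ["legislation", "law"],
    "sports": ["sports", "match"],
}

predicted = classify(["court", "statute"], LABELS)
```

No labelled training documents are used at any point: the only inputs are the document's keywords, the label descriptions and the pre-existing word vectors, which is what makes the setup zero-shot.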
Subject: NLP
vector space model
zero-shot classification
Finnish language
pre-trained language models


Files in this item

Files Size Format View
Leal_Rafael_Masters_Thesis_2020.pdf 1.755Mb PDF View/Open