Distribution Matching – Semi-Supervised Feature Selection for Biased Labelled Data

Title: Distribution Matching – Semi-Supervised Feature Selection for Biased Labelled Data
Author: Lange, Moritz Johannes
Contributor: University of Helsinki, Faculty of Science
Publisher: Helsingin yliopisto
Date: 2020
URI: http://urn.fi/URN:NBN:fi:hulib-202006243440
Thesis level: master's thesis
Abstract: In data science and machine learning, feature selection is a widely used technique for reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by preventing data redundancy and over-fitting, but can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often obtainable in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which offers a way to perform feature selection in a semi-supervised setup. Distribution matching is a wrapper method: it trains models to select the features that most improve model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function which approximates the expected loss on unseen data. In experiments, the method is shown to successfully and transparently minimise the expected loss on a synthetic dataset. Additionally, a comparison with related methods is performed on the more complex EMNIST dataset.
Subject: Semi-supervised; Feature selection; Wrapper method
Discipline: none
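
As a rough illustration of what the abstract means by a wrapper method, the following is a minimal sketch of generic greedy forward selection: a model is retrained for each candidate feature, and the feature that gives the best held-out accuracy is kept. It is not the thesis's distribution matching method. In particular, it scores candidates with plain validation accuracy on labelled data and makes no attempt to reproduce the cost function that incorporates unlabelled data. The nearest-centroid model, the synthetic data, and all function names are assumptions made purely for illustration.

```python
import random


def nearest_centroid_accuracy(train, test, features):
    """Fit a nearest-centroid classifier on `features`; return accuracy on `test`."""
    sums, counts = {}, {}
    for x, y in train:
        s = sums.setdefault(y, [0.0] * len(features))
        for i, f in enumerate(features):
            s[i] += x[f]
        counts[y] = counts.get(y, 0) + 1
    centroids = {y: [v / counts[y] for v in s] for y, s in sums.items()}

    correct = 0
    for x, y in test:
        # Predict the class whose centroid is closest in the selected features.
        pred = min(
            centroids,
            key=lambda c: sum((x[f] - m) ** 2 for f, m in zip(features, centroids[c])),
        )
        correct += pred == y
    return correct / len(test)


def forward_select(train, valid, n_features, k):
    """Greedy wrapper: repeatedly add the feature that maximises validation accuracy."""
    selected = []
    while len(selected) < k:
        candidates = [f for f in range(n_features) if f not in selected]
        best = max(
            candidates,
            key=lambda f: nearest_centroid_accuracy(train, valid, selected + [f]),
        )
        selected.append(best)
    return selected


def make_data(n, rng):
    """Hypothetical toy data: feature 0 depends on the label, features 1-4 are noise."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        informative = 4.0 * y + rng.gauss(0, 1)
        noise = [rng.gauss(0, 1) for _ in range(4)]
        data.append(([informative] + noise, y))
    return data


rng = random.Random(0)
train, valid = make_data(200, rng), make_data(200, rng)
selected = forward_select(train, valid, n_features=5, k=2)
print("selected features:", selected)
```

In this sketch the selection criterion is the only place where the scoring rule enters, so a different cost, such as one that also draws on unlabelled data as the abstract describes, would replace the call to `nearest_centroid_accuracy` inside `forward_select`.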

Files in this item

Files Size Format View
masters_thesis_moritz_lange.pdf 1.000Mb PDF View/Open