Distribution Matching : Semi-Supervised Feature Selection for Biased Labelled Data

dc.contributor Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta fi
dc.contributor University of Helsinki, Faculty of Science en
dc.contributor Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten sv
dc.contributor.author Lange, Moritz Johannes
dc.date.issued 2020
dc.identifier.uri URN:NBN:fi:hulib-202006243440
dc.identifier.uri http://hdl.handle.net/10138/316959
dc.description.abstract In data science and machine learning, feature selection is a widely used technique for reducing the dimensionality of a dataset. It is commonly used to improve model accuracy by removing redundant features and reducing over-fitting, but it can also be beneficial in applications such as data compression. The majority of feature selection techniques rely on labelled data. In many real-world scenarios, however, data is only partially labelled and thus requires so-called semi-supervised techniques, which can utilise both labelled and unlabelled data. While unlabelled data is often available in abundance, labelled datasets are smaller and potentially biased. This thesis presents a method called distribution matching, which performs feature selection in a semi-supervised setting. Distribution matching is a wrapper method: it trains models in order to select the features that most improve model accuracy. It addresses the problem of biased labelled data directly by incorporating unlabelled data into a cost function that approximates the expected loss on unseen data. In experiments on a synthetic dataset, the method is shown to minimise the expected loss successfully and transparently. Additionally, the method is compared with related methods on the more complex EMNIST dataset. en
dc.language.iso eng
dc.publisher Helsingin yliopisto fi
dc.publisher University of Helsinki en
dc.publisher Helsingfors universitet sv
dc.subject Semi-supervised
dc.subject Feature selection
dc.subject Wrapper method
dc.subject Bias
dc.title Distribution Matching : Semi-Supervised Feature Selection for Biased Labelled Data en
dc.type.ontasot pro gradu -tutkielmat fi
dc.type.ontasot master's thesis en
dc.type.ontasot pro gradu-avhandlingar sv
dc.subject.discipline none und
dct.identifier.urn URN:NBN:fi:hulib-202006243440
dc.subject.specialization ei opintosuuntaa fi
dc.subject.specialization no specialization en
dc.subject.specialization ingen studieinriktning sv
dc.subject.degreeprogram Datatieteen maisteriohjelma fi
dc.subject.degreeprogram Master's Programme in Data Science en
dc.subject.degreeprogram Magisterprogrammet i data science sv

Files in this item

Files Size Format View
masters_thesis_moritz_lange.pdf 1.000Mb PDF View/Open
