Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters

Permanent link

http://urn.fi/URN:NBN:fi-fe201804208653
Title: Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters
Author: Gallegos Gutierrez, Angel Manuel
Contributor: University of Helsinki (Helsingin yliopisto), Faculty of Science, Department of Computer Science
Publisher: University of Helsinki (Helsingin yliopisto)
Date: 2018
Language: eng
URI: http://urn.fi/URN:NBN:fi-fe201804208653
http://hdl.handle.net/10138/273493
Thesis level: Master's thesis (pro gradu)
Abstract: Statistical bureaus are responsible for producing meaningful statistical publications. The reliability of these publications depends on the quality of the source dataset, so a significant amount of resources is allocated to detecting and correcting inconsistencies before any statistical output is produced. Statistics Finland (Tilastokeskus), in particular, is developing a pilot project based on the selective data editing methodology, aiming to preserve high quality standards in its datasets while reducing manual interventions. Label noise is presumably common in real-world datasets, yet the current development does not include a module capable of handling such inconsistencies. Moreover, the labels characterizing the instances are defined over a class hierarchy with a tree structure. This thesis is therefore an initial assessment of including a preprocessing module for explicit label noise recognition in two of Statistics Finland's survey datasets. Although label noise cannot be corrected automatically if high data quality standards are to be preserved, plausible replacements could be used as a tool to assist manual interventions. With these motivations, the thesis focuses on explicitly recognizing hierarchical label inconsistencies and on the impact of label noise on hierarchical classification performance. The performance of several hierarchical classification techniques was assessed under different levels of artificial label noise. Only mandatory leaf node predictions were considered in the evaluations. Two promising noise filtering techniques were evaluated on their ability to uncover the artificially created label noise. Because the labels are structured over a class hierarchy, the best-performing hierarchical methods were selected as the base noise filters.
Although the results are not conclusive, certain hierarchical classification methods showed some robustness against label noise, and their performance is competitive with that of conventional methods. The noise filtering techniques, in turn, were effective against hierarchical noise completely at random. Hierarchical adaptations of the noise filters remain competitive and may handle rare cases better.
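The evaluation protocol described in the abstract (inject artificial label noise, run an ensemble noise filter, check how much of the injected noise it uncovers) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function names, the representation of a leaf label as a root-to-leaf tuple, the uniform flipping scheme used for noise completely at random, and the majority-vote rule. This record does not state which filters or noise models the thesis actually used.

```python
import random


def inject_ncar_noise(labels, classes, rate, seed=0):
    """Flip each label with probability `rate` to a different class chosen
    uniformly at random (an assumed model of noise completely at random).
    Returns the noisy labels and the set of indices that were flipped."""
    rng = random.Random(seed)
    noisy, flipped = [], set()
    for i, y in enumerate(labels):
        if rng.random() < rate:
            noisy.append(rng.choice([c for c in classes if c != y]))
            flipped.add(i)
        else:
            noisy.append(y)
    return noisy, flipped


def majority_vote_filter(noisy_labels, predictions_per_model):
    """Flag instance i when more than half of the base models predict a
    label different from the one it currently carries."""
    n_models = len(predictions_per_model)
    flagged = set()
    for i, y in enumerate(noisy_labels):
        disagreements = sum(1 for preds in predictions_per_model if preds[i] != y)
        if disagreements > n_models / 2:
            flagged.add(i)
    return flagged


def filter_precision_recall(flagged, flipped):
    """Precision and recall of the flagged set against the injected noise."""
    true_positives = len(flagged & flipped)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(flipped) if flipped else 0.0
    return precision, recall


# Hypothetical leaf labels, each written as a path in a two-level class tree.
labels = [("A", "A1"), ("A", "A2"), ("B", "B1"), ("B", "B2")]
classes = sorted(set(labels))
noisy, flipped = inject_ncar_noise(labels, classes, rate=0.25, seed=42)
```

In practice, each entry of `predictions_per_model` would hold the out-of-fold predictions of one base classifier trained on the noisy labels, so that no model votes on an instance it was trained on. Repeating the run over several values of `rate` gives the "different levels of artificial label noise" mentioned in the abstract.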


Files

No files are attached to this publication.
