Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters




Permalink

http://urn.fi/URN:NBN:fi-fe201804208653
Title: Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters
Author: Gallegos Gutierrez, Angel Manuel
Contributor: University of Helsinki, Faculty of Science, Department of Computer Science
Publisher: Helsingin yliopisto
Date: 2018
Language: eng
Permanent link (URI): http://urn.fi/URN:NBN:fi-fe201804208653
http://hdl.handle.net/10138/273493
Level: master's thesis (pro gradu)
Abstract: Statistical bureaus are responsible for producing meaningful statistical publications. The reliability of these publications depends on the quality of the source dataset, so a significant amount of resources is allocated to detecting and correcting inconsistencies before any statistical output is produced. In particular, Statistics Finland (Tilastokeskus) is developing a pilot project based on the selective data editing methodology, aiming to preserve high standards in the quality of its datasets while reducing manual interventions. Label noise is presumably common in real-world datasets, and the current development does not include a module capable of handling such inconsistencies. Moreover, the labels characterizing the instances are defined over a class hierarchy with a tree structure. This thesis is therefore an initial assessment for including a preprocessing module for explicit label noise recognition in two of Statistics Finland's survey datasets. Although automatic label noise corrections cannot be performed if high data quality standards are to be preserved, plausible replacements could be used as a tool to assist the manual interventions. With these motivations, this thesis focused on explicitly recognizing hierarchical label inconsistencies and on the impact of label noise on hierarchical classification performance. The performance of several hierarchical classification techniques was assessed under different levels of artificial label noise. Only mandatory leaf node predictions were considered during the evaluations. Two promising noise filtering techniques were evaluated on their capability to uncover the artificially created label noise. Given that the labels are structured over a class hierarchy, the best-performing hierarchical methods were selected to work as the base noise filters.
Although the results are not conclusive, certain hierarchical classification methods showed some robustness against label noise, and their performance is competitive with that of the conventional methods. The noise filtering techniques, in turn, were effective against hierarchical noise completely at random. Hierarchical adaptations of the noise filters remain competitive and might show signs of handling rare cases better.
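
The evaluation setup described in the abstract (inject artificial label noise, then check how much of it a filter uncovers) can be illustrated with a minimal sketch. It is not the thesis's implementation. The two-level hierarchy, the function names, the uniform choice of a replacement leaf, and the majority-vote rule are all assumptions made for illustration, and the base-classifier predictions are simulated rather than produced by real hierarchical classifiers.

```python
import random

# Hypothetical two-level class hierarchy (parent -> leaf labels); not from the thesis.
HIERARCHY = {
    "A": ["A.1", "A.2"],
    "B": ["B.1", "B.2", "B.3"],
}
LEAVES = [leaf for children in HIERARCHY.values() for leaf in children]


def inject_ncar_noise(labels, noise_rate, rng):
    """Replace each leaf label, with probability noise_rate, by a different
    leaf drawn uniformly (an assumed form of noise completely at random)."""
    noisy, flipped = [], set()
    for i, label in enumerate(labels):
        if rng.random() < noise_rate:
            noisy.append(rng.choice([leaf for leaf in LEAVES if leaf != label]))
            flipped.add(i)
        else:
            noisy.append(label)
    return noisy, flipped


def majority_vote_filter(noisy_labels, ensemble_predictions):
    """Flag instance i when more than half of the base classifiers
    predict a label different from the one it currently carries."""
    flagged = set()
    for i, label in enumerate(noisy_labels):
        disagreements = sum(1 for preds in ensemble_predictions if preds[i] != label)
        if 2 * disagreements > len(ensemble_predictions):
            flagged.add(i)
    return flagged


def precision_recall(flagged, flipped):
    """Precision and recall of the flagged set against the truly flipped set."""
    hits = len(flagged & flipped)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(flipped) if flipped else 0.0
    return precision, recall


if __name__ == "__main__":
    rng = random.Random(0)
    clean = [rng.choice(LEAVES) for _ in range(200)]
    noisy, flipped = inject_ncar_noise(clean, noise_rate=0.2, rng=rng)

    # Simulated base classifiers: each one returns the clean label 90% of the
    # time and a random leaf otherwise. Real filters would use out-of-fold
    # predictions from trained classifiers instead.
    ensemble = [
        [c if rng.random() < 0.9 else rng.choice(LEAVES) for c in clean]
        for _ in range(3)
    ]
    flagged = majority_vote_filter(noisy, ensemble)
    print(precision_recall(flagged, flipped))
```

Because the noise is created artificially, the set of flipped indices is known, which is what makes precision and recall of the filter measurable in this kind of experiment.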


Files in this item


There are no files associated with this item.
