Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters

dc.contributor Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta, Tietojenkäsittelytieteen laitos fi
dc.contributor University of Helsinki, Faculty of Science, Department of Computer Science en
dc.contributor Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap sv
dc.contributor.author Gallegos Gutierrez, Angel Manuel
dc.date.issued 2018
dc.identifier.uri URN:NBN:fi-fe201804208653
dc.identifier.uri http://hdl.handle.net/10138/273493
dc.description.abstract Statistical bureaus are responsible for producing meaningful statistical publications. The reliability of these publications depends on the quality of the source dataset, and consequently a significant amount of resources is allocated to detecting and correcting inconsistencies before any statistical output is produced. In particular, Statistics Finland (Tilastokeskus) is developing a pilot project based on the selective data editing methodology, aiming to preserve high standards in the quality of its datasets while reducing manual interventions. Label noise is presumably common in real-world datasets, and the current development does not include a module capable of handling such inconsistencies. Moreover, the labels characterizing the instances are defined over a class hierarchy following a tree structure. This thesis is therefore an initial assessment for including a preprocessing module for explicit label noise recognition in two of the bureau's survey datasets. Although automatic label noise corrections cannot be performed if high data quality standards are to be preserved, plausible replacements could be used as a tool assisting the manual interventions. Based on these motivations, this thesis focused on explicitly recognizing hierarchical label inconsistencies and on the impact of label noise on hierarchical classification performance. The performance of several hierarchical classification techniques was assessed under different levels of artificial label noise. Only mandatory leaf node predictions were considered during the evaluations. Two promising noise filtering techniques were evaluated on their capability to uncover the artificially created label noise. Given that the labels are structured over a class hierarchy, the best performing hierarchical methods were selected to work as the base noise filters.
Although the results are not conclusive, some hierarchical classification methods showed a degree of robustness against label noise, and their performance is competitive with the conventional methods. The noise filtering techniques, in turn, were effective against hierarchical noise completely at random. Hierarchical adaptations of the noise filters remain competitive and may handle rare cases better. en
dc.language.iso eng
dc.publisher Helsingin yliopisto fi
dc.publisher University of Helsinki en
dc.publisher Helsingfors universitet sv
dc.title Label Noise Influence and Recognition on Survey Datasets Using Hierarchical Ensemble Noise Filters en
dc.type.ontasot pro gradu -tutkielmat fi
dc.type.ontasot master's thesis en
dc.type.ontasot pro gradu-avhandlingar sv
dct.identifier.urn URN:NBN:fi-fe201804208653

Files in this item

File: computer_science_gallegos_gutierrez.pdf
Size: 860.1Kb
Format: application/pdf
