Large-scale Multi-Label Text Classification for an Online News Monitoring System

dc.contributor Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta, Tietojenkäsittelytieteen laitos fi
dc.contributor University of Helsinki, Faculty of Science, Department of Computer Science en
dc.contributor Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten, Institutionen för datavetenskap sv
dc.contributor.author Pierce, Matthew
dc.date.issued 2015
dc.identifier.uri URN:NBN:fi-fe2017112251298
dc.identifier.uri http://hdl.handle.net/10138/159184
dc.description.abstract This thesis provides a detailed exploration of numerous methods, some established and some novel, considered in the construction of a text-categorization system for use in a large-scale, online news-monitoring system known as PULS. PULS is an information extraction (IE) system comprising a number of tools for automatically collecting named entities from text. The system also has access to large training corpora in the business domain, in which documents are annotated with their associated industry sectors. These assets are leveraged in the construction of a multi-label industry-sector classifier, whose output is displayed for new articles on the web-based front end of PULS. Through a review of background literature and direct experimentation with each stage of development, we illuminate many major challenges of multi-label classification. These challenges include: working effectively in a real-world scenario that poses time and memory restrictions; organizing and processing semi-structured, pre-annotated text corpora; handling large-scale data sets and label sets with significant class imbalances; weighing the trade-offs of different learning algorithms and feature-selection methods with respect to end-user performance; and finding meaningful evaluations for each system component. In addition to presenting the challenges associated with large-scale multi-label learning, this thesis presents a number of experiments and evaluations to determine which methods enhance overall performance. The major outcome of these experiments is a multi-stage, multi-label classifier that combines IE-based rote classification, using features extracted by the PULS system, with an array of balanced statistical classifiers. When tested on a commonly used corpus, this multi-stage system improves on a baseline classifier and, for certain evaluations, on state-of-the-art performance reported in the literature.
Aspects of the classification method and their associated experimental results have also been published in international conference proceedings. en
dc.language.iso en
dc.publisher Helsingfors universitet sv
dc.publisher University of Helsinki en
dc.publisher Helsingin yliopisto fi
dc.title Large-scale Multi-Label Text Classification for an Online News Monitoring System en
dc.type.ontasot pro gradu-avhandlingar sv
dc.type.ontasot pro gradu -tutkielmat fi
dc.type.ontasot master's thesis en
dct.identifier.urn URN:NBN:fi-fe2017112251298

Files in this item

Files Size Format View
matthewpiercethesis.pdf 1.215Mb PDF View/Open
matthew-pierce-thesis.pdf 1.215Mb PDF View/Open
