Large-scale Multi-Label Text Classification for an Online News Monitoring System

Permalink

http://hdl.handle.net/10138/159184
Title: Large-scale Multi-Label Text Classification for an Online News Monitoring System
Author: Pierce, Matthew
Contributor: Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta, Tietojenkäsittelytieteen laitos
Thesis level:
Abstract: This thesis provides a detailed exploration of numerous methods, some established and some novel, considered in the construction of a text-categorization system for use in a large-scale online news-monitoring system known as PULS. PULS is an information extraction (IE) system consisting of a number of tools for automatically collecting named entities from text. The system also has access to large training corpora in the business domain, where documents are annotated with associated industry sectors. These assets are leveraged in the construction of a multi-label industry-sector classifier, whose output is displayed for new articles on the web-based front-end of PULS. Through review of background literature and direct experimentation with each stage of development, we illuminate many major challenges of multi-label classification. These challenges include: working effectively in a real-world scenario that poses time and memory restrictions; organizing and processing semi-structured, pre-annotated text corpora; handling large-scale data sets and label sets with significant class imbalances; weighing the trade-offs of different learning algorithms and feature-selection methods with respect to end-user performance; and finding meaningful evaluations for each system component. In addition to presenting the challenges associated with large-scale multi-label learning, this thesis presents a number of experiments and evaluations to determine which methods enhance overall performance. The major outcome of these experiments is a multi-stage, multi-label classifier that combines IE-based rote classification, using features extracted by the PULS system, with an array of balanced statistical classifiers. Evaluation of this multi-stage system shows improvement over a baseline classifier and, for certain evaluations, over state-of-the-art performance reported in the literature, when tested on a commonly used corpus. Aspects of the classification method and their associated experimental results have also been published in international conference proceedings.
URI: http://hdl.handle.net/10138/159184
Date: 2015-12-19
Discipline: Algorithms and Machine Learning


Files in this item

Files Size Format View
matthewpiercethesis.pdf 1.215Mb PDF View/Open
