A Comparative Study on Large-scale Multi-label Text Classification of Social Media


Title: A Comparative Study on Large-scale Multi-label Text Classification of Social Media
Author: Huang, Biyun
Other contributor: Helsingin yliopisto, Matemaattis-luonnontieteellinen tiedekunta
University of Helsinki, Faculty of Science
Helsingfors universitet, Matematisk-naturvetenskapliga fakulteten
Publisher: Helsingin yliopisto
Date: 2018
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-202004221915
Thesis level: master's thesis
Discipline: Computer Science
Abstract: Text classification, also known as text categorization, is the task of assigning documents to predefined categories. With the growth of social networks, the volume of unstructured text being generated is increasing rapidly. Social media text, because of its limited length, extreme imbalance, high dimensionality, and multi-label nature, needs special processing before it is fed to machine learning classifiers. Many statistical, machine learning, and natural language processing approaches address this problem, and two families of machine learning algorithms represent the state of the art. One is large-scale linear classification, which handles large sparse data, especially short social media text; the other is deep learning, an active area of research whose techniques take advantage of word order. This thesis provides an end-to-end solution for large-scale, multi-label, and extremely imbalanced text data, compares the two families, and discusses the effect of balanced learning. The results show that deep learning does not necessarily work well in this context. Well-designed large-scale linear classifiers can achieve the best scores. Also, when the data set is large enough, simpler classifiers may perform better.

Files in this item

Files Size Format View
bhuang_msc.pdf 2.451Mb PDF View/Open
