Software Newsroom – an approach to automation of news search and editing

Huovelin, J, Gross, O, Solin, O, Linden, K, Maisala, S P T, Oittinen, T, Toivonen, H, Niemi, J & Silfverberg, M 2013, 'Software Newsroom – an approach to automation of news search and editing', Journal of Print Media Technology research, vol. 2, no. 3, pp. 141-156.

Title: Software Newsroom – an approach to automation of news search and editing
Author: Huovelin, Juhani; Gross, Oskar; Solin, Otto; Linden, Krister; Maisala, Sami Petri Tapio; Oittinen, Tero; Toivonen, Hannu; Niemi, Jyrki; Silfverberg, Miikka
Contributor organization: Department of Physics
Helsinki Institute for Information Technology
Discovery Research Group/Prof. Hannu Toivonen
Finnish Centre of Excellence in Algorithmic Data Analysis Research (Algodan)
Helsinki Graduate School in Computer Science and Engineering (Hecse)
Department of Modern Languages 2010-2017
Krister Linden / Research Group
Department of Computer Science
Planetary-system research
Date: 2013-11-07
Language: eng
Number of pages: 15
Belongs to series: Journal of Print Media Technology research
ISSN: 2223-8905
Abstract: We have developed tools and applied methods for automated identification of potential news from textual data for an automated news search system called Software Newsroom. The purpose of the tools is to analyze data collected from the internet and to identify content that has a high probability of containing new information. The identified information is summarized in order to help in understanding the semantic contents of the data and to assist the news editing process. It has been demonstrated that words with a certain set of syntactic and semantic properties are effective when building topic models for English. We demonstrate that words with the same properties in Finnish are useful as well. Extracting such words requires knowledge about the special characteristics of the Finnish language, which are taken into account in our analysis. Two different methodological approaches have been applied for the news search. The first method is based on topic analysis and applies Multinomial Principal Component Analysis (MPCA) for topic model creation and data profiling. The second method is based on word association analysis and applies the log-likelihood ratio (LLR). For the topic mining, we have created English and Finnish language corpora from Wikipedia and Finnish corpora from several Finnish news archives, and we have used bag-of-words representations of these corpora as training data for the topic model. We have performed topic analysis experiments both with the training data itself and with arbitrary text parsed from internet sources. The results suggest that the effectiveness of news search strongly depends on the quality of the training data and its linguistic analysis. In the association analysis, we use a combined methodology for detecting novel word associations in the text. For detecting novel associations, we use a background corpus from which we extract common word associations. In parallel, we collect the statistics of word co-occurrences from the documents of interest and search for associations that have a larger likelihood in these documents than in the background. We have demonstrated the applicability of these methods for Software Newsroom. The results indicate that the background-foreground model has significant potential in news search. The experiments also indicate great promise in employing background-foreground word associations for other applications. A combined application of the two methods is planned, as well as the application of the methods to social media using a pre-translator of social media language.
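The background-foreground idea in the abstract can be illustrated with a minimal sketch. It assumes Dunning's log-likelihood ratio over a 2x2 contingency table and a simple sentence-level co-occurrence count; the function names `llr` and `pair_novelty` and the table layout are hypothetical choices for illustration and are not taken from the paper, whose exact formulation may differ.

```python
import math


def llr(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table."""
    n = k11 + k12 + k21 + k22
    row1, row2 = k11 + k12, k21 + k22
    col1, col2 = k11 + k21, k12 + k22
    cells = (
        (k11, row1, col1),
        (k12, row1, col2),
        (k21, row2, col1),
        (k22, row2, col2),
    )
    total = 0.0
    for observed, row_sum, col_sum in cells:
        # Empty cells contribute nothing (limit of k * log k as k -> 0).
        if observed > 0:
            total += observed * math.log(observed * n / (row_sum * col_sum))
    return 2.0 * total


def pair_novelty(fg_pair, fg_total, bg_pair, bg_total):
    """Score how much more often a word pair co-occurs in the foreground.

    Assumption: fg_pair / bg_pair count sentences containing both words,
    and fg_total / bg_total count all sentences in each corpus (both > 0).
    Returns 0.0 when the pair is not more frequent in the foreground.
    """
    if fg_pair / fg_total <= bg_pair / bg_total:
        return 0.0
    return llr(fg_pair, fg_total - fg_pair, bg_pair, bg_total - bg_pair)
```

Under these assumptions, a pair whose co-occurrence rate matches the background scores 0.0, while a pair that is markedly more frequent in the documents of interest receives a positive score and can be ranked as a candidate novel association.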
Subject: 113 Computer and information sciences
social media
data mining
topic analysis
word associations
6121 Languages
linguistic analysis
named entities
Peer reviewed: Yes
Usage restriction: restrictedAccess
Self-archived version: submittedVersion

Files in this item

Files Size Format View
JPMTR_1311.pdf 446.3Kb PDF View/Open