The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation

Title: The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation
Author: Tiedemann, Jörg
Date: 2021
Language: fi
DOI: https://doi.org/10.31885/9789515150257.22
URI: http://hdl.handle.net/10138/327794
Abstract: This paper presents our ongoing efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500 GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies through baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other language groups. Our initial results show that effective multilingual translation models can be trained on skewed training data, but they also highlight the shortcomings in low-resource settings and the difficulty of obtaining sufficient information through straightforward transfer from related languages.
Subject: machine translation; low-resource languages; multilingual NLP
Rights: CC BY 4.0


Files in this item

File: 22_Tiedemann_Multilingual_Facilitation.pdf
Size: 158.9Kb
Format: PDF

Except where otherwise noted, this item's license is described as CC BY 4.0.