TY  - 
T1  - The Development of a Comprehensive Data Set for Systematic Studies of Machine Translation
A1  - Tiedemann, Jörg
Y1  - 2021
UR  - http://hdl.handle.net/10138/327794
LA  - fi
AB  - This paper presents our on-going efforts to develop a comprehensive data set and benchmark for machine translation beyond high-resource languages. The current release includes 500GB of compressed parallel data for almost 3,000 language pairs covering over 500 languages and language variants. We present the structure of the data set and demonstrate its use for systematic studies based on baseline experiments with multilingual neural machine translation between Finno-Ugric languages and other lang...
KW  - machine translation
KW  - low-resource languages
KW  - multilingual NLP
ER  - 