Domain adaptation : Retraining NMT with translation memories

Title: Domain adaptation : Retraining NMT with translation memories
Author: Mäkinen, Maria
Contributor: University of Helsinki, Faculty of Arts
Publisher: Helsingin yliopisto
Date: 2019
Language: eng
URI: http://urn.fi/URN:NBN:fi:hulib-201906112531
http://hdl.handle.net/10138/302732
Thesis level: master's thesis
Degree program: Master's Programme in Translation and Interpreting
Specialisation: Translation Technology
Abstract: The topic of this thesis is domain adaptation of an NMT system by retraining it with translation memories. The translation memory used in the experiments is the EMEA corpus, which consists of medical texts, mostly package leaflets. The NMT system used is OpenNMT, chosen because it is completely free and easy to use. The goal of the thesis is to find out how an NMT system can be adapted to a special domain and whether translation quality improves after domain adaptation. The original plan was to continue training OpenNMT's pretrained model with EMEA data, but this turned out not to be possible; it was therefore necessary to train a new baseline model on the same data the pretrained model was trained on. After this, two domain adaptation methods were tested: continuation training with EMEA data and continuation training with unknown terms. The manual evaluation showed that domain adaptation with unknown terms worsens translation quality drastically, because all sentences are translated as single words; this method is suitable only for translating wordlists, although it did improve the translation of unknown terms. Domain adaptation with EMEA data, on the other hand, improves translation quality significantly. The EMEA-retrained system translates long sentences and medical terms much better than the pretrained and baseline models. Long and complicated terms remain difficult to translate, but the EMEA-retrained model makes fewer errors than the other models. The metrics used for automatic evaluation are BLEU and LeBLEU, of which BLEU is the stricter. The results mirror the manual evaluation: the EMEA-retrained model translates medical texts much better than the other models, and the translation quality of the UNK-retrained model is the worst of all.
It can be presumed that an NMT system needs contextual information in order to learn to translate terms and long sentences without reducing the text to a wordlist. In addition, long terms appear to be translated in smaller pieces, so the NMT system may translate some of the pieces incorrectly, which makes the whole term wrong.
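The automatic evaluation described above relies on BLEU, which scores a candidate translation by its n-gram overlap with a reference, penalizing overly short output. As an illustration only (not the evaluation code used in the thesis), here is a minimal corpus-level BLEU sketch in plain Python, assuming whitespace-tokenized input:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU: geometric mean of modified n-gram precisions
    (n = 1..max_n) multiplied by a brevity penalty. One reference per
    hypothesis; sentences are plain whitespace-separated strings."""
    clipped = [0] * max_n   # clipped n-gram matches per order
    total = [0] * max_n     # total hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ngrams = ngrams(h, n)
            r_ngrams = ngrams(r, n)
            total[n - 1] += sum(h_ngrams.values())
            # Clip each n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ngrams[g])
                                  for g, c in h_ngrams.items())
    # Score is 0 if any precision is 0 (no smoothing in this sketch).
    if min(total) == 0 or min(clipped) == 0:
        return 0.0
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, total)) / max_n
    # Brevity penalty punishes hypotheses shorter than the reference.
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

A perfect match scores 1.0, and any deviation lowers the score, which is why BLEU is stricter than LeBLEU (a lenient variant that also credits near-matches at the word level). Real evaluations would use a standard implementation such as sacreBLEU, which additionally handles tokenization, multiple references, and smoothing.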
Subject: NMT
neural machine translation
domain adaptation
translation memory
machine translation


Files in this item

Mäkinen_Maria_Pro_Gradu_2019.pdf (309.3 KB, PDF)
