Generalised Unsupervised Domain Adaptation of Neural Machine Translation with Cross-Lingual Data Selection

09/09/2021
by   Thuy-Trang Vu, et al.

This paper considers unsupervised domain adaptation for neural machine translation (NMT), where we assume access to monolingual text in only the source or the target language of the new domain. We propose a cross-lingual data selection method that extracts in-domain sentences for the missing language side from a large generic monolingual corpus. The method trains an adaptive layer on top of multilingual BERT with contrastive learning to align the representations of the source and target languages; this lets a domain classifier trained on one language transfer to the other in a zero-shot manner. Once the classifier has identified the in-domain data, the NMT model is adapted to the new domain by jointly learning the translation and domain-discrimination tasks. We evaluate our cross-lingual data selection method on NMT across five diverse domains in three language pairs, as well as on a real-world COVID-19 translation scenario. The results show that our method outperforms other selection baselines by up to +1.5 BLEU.
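The two ingredients of the selection pipeline, a contrastive loss that pulls translation pairs together and a top-k filter over domain-classifier scores, can be sketched in miniature. This is an illustrative toy, not the paper's implementation: the InfoNCE-style loss, the 2-d embedding vectors, and the function names are all assumptions standing in for mBERT representations and the trained classifier.

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def info_nce(src_embs, tgt_embs, temperature=0.1):
    """Contrastive (InfoNCE-style) loss: the i-th source sentence should be
    more similar to its own translation than to any other target sentence."""
    loss = 0.0
    for i, s in enumerate(src_embs):
        sims = [cosine(s, t) / temperature for t in tgt_embs]
        log_denom = math.log(sum(math.exp(x) for x in sims))
        loss += -(sims[i] - log_denom)
    return loss / len(src_embs)

def select_in_domain(scores, sentences, k):
    """Keep the k sentences the (hypothetical) domain classifier scores highest."""
    ranked = sorted(zip(scores, sentences), reverse=True)
    return [s for _, s in ranked[:k]]
```

In the paper's setting the loss would be minimized over an adaptive layer on top of frozen mBERT, so that the zero-shot transfer of the classifier across languages becomes possible; here the loss simply drops when paired embeddings are aligned and rises when they are shuffled.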


Related research:

- Neural Machine Translation with Monolingual Translation Memory (05/24/2021): Prior work has proved that Translation memory (TM) can boost the perform...
- Multilingual Domain Adaptation for NMT: Decoupling Language and Domain Information with Adapters (10/18/2021): Adapter layers are lightweight, learnable units inserted between transfo...
- Tailoring Domain Adaptation for Machine Translation Quality Estimation (04/18/2023): While quality estimation (QE) can play an important role in the translat...
- Cross-lingual Pre-training Based Transfer for Zero-shot Neural Machine Translation (12/03/2019): Transfer learning between different language pairs has shown its effecti...
- Unsupervised Domain Clusters in Pretrained Language Models (04/05/2020): The notion of "in-domain data" in NLP is often over-simplistic and vague...
- Master Thesis: Neural Sign Language Translation by Learning Tokenization (11/18/2020): In this thesis, we propose a multitask learning based method to improve ...
- Contextual Parameter Generation for Universal Neural Machine Translation (08/26/2018): We propose a simple modification to existing neural machine translation ...
