Synthesizing Monolingual Data for Neural Machine Translation

01/29/2021
by Benjamin Marie, et al.

In neural machine translation (NMT), monolingual data in the target language are usually exploited through a method called "back-translation" to synthesize additional parallel training data. Such synthetic data have been shown to help train better NMT systems, especially for low-resource language pairs and domains. Nonetheless, large monolingual data in the target domains or languages are not always available for generating large synthetic parallel data. In this work, we propose a new method that generates large synthetic parallel data from very small monolingual data in a specific domain. We fine-tune a pre-trained GPT-2 model on such small in-domain monolingual data and use the resulting model to generate a large amount of synthetic in-domain monolingual data. Then, we perform back-translation, or forward translation, to generate synthetic in-domain parallel data. Our preliminary experiments on three language pairs and five domains show that our method generates fully synthetic but useful in-domain parallel data that improve NMT in all configurations. We also show promising results in extreme adaptation for personalized NMT.
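The final step of the pipeline described above, turning generated monolingual sentences into parallel data, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the generation step with the fine-tuned GPT-2 model is not shown, `translate` is a hypothetical stand-in for any NMT system, and the filtering of empty and duplicate sentences is an added assumption rather than something the abstract specifies.

```python
from typing import Callable, Iterable, List, Tuple


def build_synthetic_parallel(
    synthetic_sentences: Iterable[str],
    translate: Callable[[str], str],
    direction: str = "back",
) -> List[Tuple[str, str]]:
    """Pair generated sentences with machine translations as (source, target).

    direction="back":    sentences are in the target language and
                         `translate` maps target -> source.
    direction="forward": sentences are in the source language and
                         `translate` maps source -> target.

    `translate` is a hypothetical callable standing in for an NMT system.
    """
    if direction not in ("back", "forward"):
        raise ValueError("direction must be 'back' or 'forward'")

    pairs: List[Tuple[str, str]] = []
    seen = set()
    for sentence in synthetic_sentences:
        sentence = sentence.strip()
        # Assumption: drop empty and duplicate generations before translating.
        if not sentence or sentence in seen:
            continue
        seen.add(sentence)
        translation = translate(sentence)
        if direction == "back":
            # Synthetic source side, generated target side.
            pairs.append((translation, sentence))
        else:
            # Generated source side, synthetic target side.
            pairs.append((sentence, translation))
    return pairs


def toy_translate(sentence: str) -> str:
    """Toy placeholder for a real NMT model: upper-cases the input."""
    return sentence.upper()


generated = [
    "the patient was discharged .",
    "",
    "the patient was discharged .",
    "take two tablets daily .",
]
back_pairs = build_synthetic_parallel(generated, toy_translate, "back")
forward_pairs = build_synthetic_parallel(generated, toy_translate, "forward")
```

In both directions the output keeps the (source, target) order, so the resulting pairs could be concatenated with genuine parallel data. The only difference is which side is machine-translated: the source side for back-translation, the target side for forward translation.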

