AUGVIC: Exploiting BiText Vicinity for Low-Resource NMT

06/09/2021
by Tasnim Mohiuddin, et al.

The success of Neural Machine Translation (NMT) largely depends on the availability of large bitext training corpora. Because such corpora are lacking for low-resource language pairs, NMT systems often perform poorly on them. Extra relevant monolingual data often helps, but acquiring it can be quite expensive, especially for low-resource languages. Moreover, domain mismatch between the bitext (train/test) and the monolingual data might degrade performance. To alleviate these issues, we propose AUGVIC, a novel data augmentation framework for low-resource NMT that exploits the vicinal samples of the given bitext without explicitly using any extra monolingual data. It can diversify the in-domain bitext data with finer-level control. Through extensive experiments on four low-resource language pairs comprising data from different domains, we show that our method is comparable to traditional back-translation that uses extra in-domain monolingual data. When we combine the synthetic parallel data generated by AUGVIC with that generated from the extra monolingual data, we achieve further improvements. We also show that AUGVIC helps to attenuate the discrepancies between relevant and distant-domain monolingual data in traditional back-translation. To understand the contributions of the different components of AUGVIC, we perform an in-depth framework analysis.

research
01/29/2021

Synthesizing Monolingual Data for Neural Machine Translation

In neural machine translation (NMT), monolingual data in the target lang...
research
05/23/2023

When Does Monolingual Data Help Multilingual Translation: The Role of Domain and Model Scale

Multilingual machine translation (MMT), trained on a mixture of parallel...
research
01/23/2020

Pre-training via Leveraging Assisting Languages and Data Selection for Neural Machine Translation

Sequence-to-sequence (S2S) pre-training using large monolingual data is ...
research
06/10/2019

Generalized Data Augmentation for Low-Resource Translation

Translation to or from low-resource languages (LRLs) poses challenges for ...
research
11/14/2020

Iterative Self-Learning for Enhanced Back-Translation in Low Resource Neural Machine Translation

Many language pairs are low resource - the amount and/or quality of para...
research
05/24/2021

Neural Machine Translation with Monolingual Translation Memory

Prior work has proved that Translation memory (TM) can boost the perform...
research
09/28/2019

The Source-Target Domain Mismatch Problem in Machine Translation

While we live in an increasingly interconnected world, different places ...
