Self-Training Sampling with Monolingual Data Uncertainty for Neural Machine Translation

06/02/2021
by Wenxiang Jiao, et al.

Self-training has proven effective for improving NMT performance by augmenting model training with synthetic parallel data. The common practice is to construct synthetic data based on a randomly sampled subset of large-scale monolingual data, which we empirically show is sub-optimal. In this work, we propose to improve the sampling procedure by selecting the most informative monolingual sentences to complement the parallel data. To this end, we compute the uncertainty of monolingual sentences using the bilingual dictionary extracted from the parallel data. Intuitively, monolingual sentences with lower uncertainty generally correspond to easy-to-translate patterns which may not provide additional gains. Accordingly, we design an uncertainty-based sampling strategy to efficiently exploit the monolingual data for self-training, in which monolingual sentences with higher uncertainty would be sampled with higher probability. Experimental results on large-scale WMT English⇒German and English⇒Chinese datasets demonstrate the effectiveness of the proposed approach. Extensive analyses suggest that emphasizing the learning on uncertain monolingual sentences by our approach does improve the translation quality of high-uncertainty sentences and also benefits the prediction of low-frequency words at the target side.
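The sampling idea described above can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so several things here are assumptions made only for illustration: word-level uncertainty is taken to be the entropy of a word's translation distribution in the bilingual dictionary, sentence-level uncertainty is the average over the words the dictionary covers, and sampling weights are directly proportional to that score. The function names and the toy dictionary are hypothetical and are not taken from the paper.

```python
import math
import random


def word_entropy(translation_probs):
    """Entropy of one source word's translation distribution (assumed measure)."""
    return sum(-p * math.log(p) for p in translation_probs.values() if p > 0)


def sentence_uncertainty(sentence, dictionary):
    """Average word entropy over the words covered by the dictionary.

    Words missing from the dictionary are skipped; a sentence with no
    covered words gets uncertainty 0.0 (an assumption of this sketch).
    """
    entropies = [word_entropy(dictionary[w]) for w in sentence.split() if w in dictionary]
    return sum(entropies) / len(entropies) if entropies else 0.0


def sample_by_uncertainty(sentences, dictionary, k, seed=0):
    """Draw up to k distinct sentences, favouring higher uncertainty."""
    rng = random.Random(seed)
    pool = [(s, sentence_uncertainty(s, dictionary)) for s in sentences]
    chosen = []
    for _ in range(min(k, len(pool))):
        # A small constant keeps zero-uncertainty sentences selectable.
        weights = [u + 1e-9 for _, u in pool]
        idx = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(pool.pop(idx)[0])
    return chosen


# Hypothetical toy dictionary: source word -> {target word: probability}.
toy_dictionary = {
    "the": {"die": 1.0},
    "bank": {"Bank": 0.5, "Ufer": 0.5},
}
monolingual = ["the", "the bank", "bank"]
selected = sample_by_uncertainty(monolingual, toy_dictionary, k=2)
```

In this toy setup, "bank" has two equally likely translations and therefore a higher entropy than "the", so sentences containing it are more likely to be drawn. In a real pipeline the dictionary would be extracted from the parallel data, as the abstract states, and the selected sentences would then be translated to build the synthetic parallel corpus.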

research
08/27/2018

Back-Translation Sampling by Targeting Difficult Words in Neural Machine Translation

Neural Machine Translation has achieved state-of-the-art performance for...
research
08/28/2018

Understanding Back-Translation at Scale

An effective method to improve neural machine translation with monolingu...
research
08/31/2019

Improving Back-Translation with Uncertainty-based Confidence Estimation

While back-translation is simple and effective in exploiting abundant mo...
research
12/19/2014

Leveraging Monolingual Data for Crosslingual Compositional Word Representations

In this work, we present a novel neural network based architecture for i...
research
12/02/2022

Improving Simultaneous Machine Translation with Monolingual Data

Simultaneous machine translation (SiMT) is usually done via sequence-lev...
research
04/05/2020

AR: Auto-Repair the Synthetic Data for Neural Machine Translation

Compared with only using limited authentic parallel data as training cor...
research
06/12/2021

Don't Rule Out Monolingual Speakers: A Method For Crowdsourcing Machine Translation Data

High-performing machine translation (MT) systems can help overcome langu...
