Neural machine translation (NMT) based on the encoder-decoder framework with attention mechanism Sutskever et al. (2014); Bahdanau et al. (2014); Cho et al. (2014); Vaswani et al. (2017) has achieved state-of-the-art (SOTA) results in many language pairs Xia et al. (2019). Generally, millions or even more parallel sentence pairs are needed to train a decent NMT system Wu et al. (2016). However, authentic parallel data is limited in many scenarios, e.g., low-resource languages, which restricts the use of NMT in the real world Sennrich and Zhang (2019).
As collecting large-scale authentic parallel data is expensive and impractical in many scenarios, approaches that use freely available monolingual data to create additional synthetic parallel data, e.g., back-translation, have drawn much attention. Since the pioneering work of Sennrich et al. (2017), which trains a high-quality NMT model on a mixture of synthetic and authentic parallel data, many approaches Gulcehre et al. (2015); He et al. (2016); Poncelas et al. (2018); Lample et al. (2018); Edunov et al. (2018) have further shown that synthetic parallel data obtained by back-translation (BT) or forward-translation (FT) Bogoychev and Sennrich (2019) is a simple yet effective way to improve translation quality.
Most of the existing work focuses on how to better leverage given synthetic data. Caswell et al. (2019) proposed tagged back-translation to guide the training process by informing the NMT model that the given source is synthetic. Wang et al. (2019) use pre-calculated uncertainty scores for BT data to weight the attention of NMT models. Wu et al. (2019) proposed noised training to better leverage both BT and FT data. Despite the success of the above works, none of them proposes a method to improve the quality of the synthetic data itself, even though the quality of synthetic data matters in NMT training Gulcehre et al. (2015). Iterative BT Hoang et al. (2018); Gwinnup et al. (2017) generates increasingly better synthetic data along with the training iterations. However, the iteration procedure is time-consuming and expensive in real-world applications.
In this paper, we propose a simple and effective auto-repair (AR) model that improves the quality of synthetic parallel data at a lower time cost and can be applied to various types of synthetic corpora. Our auto-repair model learns the mapping from low-quality synthetic data to high-quality authentic data. After training, the AR model can be used to repair the mistakes in synthetic data directly. NMT models trained on the repaired synthetic data achieve larger gains than those trained on the original back-translated synthetic data. We evaluate the proposed approach on the WMT14 ENDE and IWSLT14 DEEN translation tasks. Experimental results show that our AR model repairs the mistakes in synthetic data effectively, and translation quality on both tasks improves considerably with the higher-quality synthetic data.
In this section, we elaborate on the proposed AR framework and how we integrate AR models into NMT.
2.1 Background and Notation
We first define x and y as the source and target authentic sentences, respectively, and x' and y' as the source and target synthetic sentences. Since our data come from either bilingual or monolingual corpora, we use different superscripts to indicate the data source. For example, x^b and x^m represent authentic source sentences from the bilingual data and the monolingual data, respectively.
The left part (Traditional Synthesis) of Figure 1 shows a diagram of a traditional data synthesis process. We first pre-train a source-to-target (S2T) and a target-to-source (T2S) NMT model on the authentic parallel corpus (x^b, y^b). Then we use the pre-trained models to translate the source- and target-side monolingual data, x^m and y^m, to get the target synthetic sentences y' and the source synthetic sentences x', respectively. Finally, we can use the combination of the authentic data (x^b, y^b), the FT pseudo-parallel corpus (x^m, y'), and the BT pseudo-parallel corpus (x', y^m) to train our S2T model.
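The traditional pipeline above can be sketched end to end. This is a minimal illustration only: `train` and `translate` are hypothetical stand-ins for NMT training and decoding, not the authors' actual code.

```python
# Sketch of the traditional synthesis pipeline. `train` and `translate`
# are hypothetical helpers standing in for real NMT training/decoding.

def synthesize(bitext, mono_src, mono_tgt, train, translate):
    """bitext: list of authentic (x, y) pairs; mono_src / mono_tgt:
    monolingual source / target sentences."""
    # 1) Pre-train both directions on the authentic parallel corpus.
    s2t = train([(x, y) for x, y in bitext])
    t2s = train([(y, x) for x, y in bitext])
    # 2) Forward-translate source monolingual data -> FT pairs (x, y').
    ft = [(x, translate(s2t, x)) for x in mono_src]
    # 3) Back-translate target monolingual data -> BT pairs (x', y).
    bt = [(translate(t2s, y), y) for y in mono_tgt]
    # 4) Final S2T training data: authentic + FT + BT.
    return bitext + ft + bt
```

The S2T model is then trained on the returned mixture of authentic and pseudo-parallel pairs.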
2.2 The Proposed AR Framework
The right part (Auto-Repair) of Figure 1 schematically illustrates the AR framework. We follow the definitions for traditional synthesis in Section 2.1, and we define x̂ and ŷ as the repaired synthetic source and target sentences, respectively. Rather than directly using the synthetic data (x^m, y') and (x', y^m), we use a T2T and an S2S auto-repair model to repair y' and x', obtaining the repaired data (x^m, ŷ) and (x̂, y^m). Then we continue to train the S2T model with the combination of the authentic parallel corpus (x^b, y^b), the FT synthetic (pseudo-parallel) corpus (x^m, y'), the BT synthetic corpus (x', y^m), the FT repaired corpus (x^m, ŷ), and the BT repaired corpus (x̂, y^m). For simplicity, we drop the superscripts for monolingual data in the following descriptions, e.g., x for x^m.
The aim of the AR model is to transform low-quality (noisy) sentences into high-quality sentences. We adopt the seq2seq architecture to build our S2S and T2T AR models; in this work, we use the SAN-based (self-attention network) structure Vaswani et al. (2017). Given a pair (s', s) consisting of a low-quality sentence s' and a high-quality sentence s, the conditional distribution of each target token s_t predicted by the AR model is computed as P(s_t | s_<t, s'; θ), so the sequence probability factorizes as P(s | s'; θ) = ∏_t P(s_t | s_<t, s'; θ).
The input of our AR model is a synthetic sentence s', and the AR model transforms it into a repaired sentence ŝ, which is of higher quality than s'.
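The token-level distribution above composes into a sequence probability by the chain rule. A minimal numeric sketch, where `token_probs` holds the per-position probabilities a decoder would emit:

```python
import math

def sequence_log_prob(token_probs):
    """Autoregressive factorization: log P(s | s') = sum_t log P(s_t | s_<t, s').
    `token_probs` holds P(s_t | s_<t, s') for each target position t,
    e.g. as produced by a decoder's softmax at each step."""
    return sum(math.log(p) for p in token_probs)
```

For example, a two-token target with per-token probabilities 0.5 and 0.5 has sequence probability 0.25.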
| Model | WMT14 ENDE | Δ | IWSLT14 DEEN | Δ | AVG | Δ |
| --- | --- | --- | --- | --- | --- | --- |
| Transformer-base Vaswani et al. (2017) | 27.3 | N/A | N/A | N/A | N/A | N/A |
| Transformer-base Gao et al. (2019) | N/A | N/A | 34.79 | N/A | N/A | N/A |
| BASE + BT Sennrich et al. (2017) | 28.64 | +1.12 | 37.11 | +2.07 | 32.88 | +1.60 |
| BASE + FT | 27.88 | +0.36 | 35.88 | +0.84 | 31.88 | +0.60 |
| BASE + BT + FT | 28.84 | +1.32 | 37.70 | +2.66 | 33.27 | +1.99 |
| BASE + BTR-REP | 29.05 | +1.53 | 37.79 | +2.75 | 33.42 | +2.14 |
| BASE + FTR-REP | 28.11 | +0.59 | 36.17 | +1.13 | 32.14 | +0.86 |
| BASE + BTR-ADD | 29.29 | +1.77 | 38.15 | +3.11 | 33.72 | +2.44 |
| BASE + FTR-ADD | 28.20 | +0.68 | 36.34 | +1.30 | 32.27 | +0.99 |
| BASE + BTR-ADD + FTR-ADD | 29.59 | +2.07 | 38.89 | +3.85 | 34.24 | +2.96 |
AR Model Training
We take the BT scenario as an example to describe how we generate the training data and train our AR models; the FT scenario is identical except that the language to be repaired differs. To build the AR model for BT data, we need sentence pairs (x', x) as the training corpus, where x' is generated by NMT systems and is of low quality, and x is a high-quality sentence.
For x, we can simply use large-scale authentic monolingual data, for two reasons: 1) monolingual data is almost always universally available Gulcehre et al. (2015), which makes it abundant for NMT training; 2) monolingual data is originally written in the specific language, so its fluency and accuracy are guaranteed, which makes it of high quality.
For x', we investigate a data-driven method to generate it. We first use the pre-trained S2T model to translate the monolingual data x and obtain y'. Then we use the pre-trained T2S model to translate y' back into x'.
Because x' here is generated by the S2T and T2S NMT models, it inherently reveals the mistakes made by NMT models, which exactly meets our requirements for the low-quality side. In addition, the AR dev data contains 1,000 sentence pairs randomly selected from the AR training data. After the training and dev data are obtained, we can train the AR models following typical seq2seq training methods Vaswani et al. (2017).
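The round-trip generation of AR training pairs can be sketched as follows; `translate` is a hypothetical batch decoding helper, not the authors' code:

```python
# Round-trip generation of AR training pairs for the BT scenario,
# assuming a hypothetical `translate(model, sents)` batch helper.

def make_ar_pairs(mono_src, s2t, t2s, translate):
    """Return (noisy, clean) pairs: x -> y' (S2T) -> x' (T2S), so that
    x' carries the systematic errors of both NMT models while x is the
    fluent, authentic target for the S2S auto-repair model."""
    y_syn = translate(s2t, mono_src)   # x  -> y'
    x_syn = translate(t2s, y_syn)      # y' -> x'
    return list(zip(x_syn, mono_src))  # train the AR model on x' -> x
```

The resulting pairs map machine-generated noise back onto authentic text, which is exactly the supervision signal the AR model needs.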
3.1 Implementation Detail
We evaluate the proposed approach on the publicly available WMT2014 English-to-German (ENDE) and IWSLT2014 German-to-English (DEEN) tasks. On the WMT ENDE task, our training set consists of about 4.5 million sentence pairs. We use newstest2013 as our validation set and newstest2014 as our test set (http://www.statmt.org/wmt17/translation-task.html). We learned a joint byte pair encoding (BPE) Sennrich et al. (2015) for English and German with 32,000 operations and limit the size of our English and German vocabularies to 30,000. On the IWSLT14 DEEN task, following the setting of Gehring et al. (2017), the training set includes about 160k sentence pairs and the dev set includes 7,000 sentence pairs randomly selected from the training set. We concatenate tst2010-2012, dev2010, and dev2012 as our test data (https://github.com/facebookresearch/fairseq/tree/master/data). We learned a joint BPE with 10,000 operations and limit the size of both English and German vocabularies to 10,000. For both tasks, we randomly sampled 16 million monolingual sentences for each of English and German from the WMT News Crawl data.
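As a rough illustration of how the BPE merge operations above are learned Sennrich et al. (2015), here is a toy learner; it is a sketch of the algorithm only, not the production subword tooling used for preprocessing:

```python
import re
from collections import Counter

def get_pair_stats(vocab):
    """Count frequencies of adjacent symbol pairs across the vocabulary."""
    pairs = Counter()
    for word, freq in vocab.items():
        symbols = word.split()
        for i in range(len(symbols) - 1):
            pairs[(symbols[i], symbols[i + 1])] += freq
    return pairs

def merge_pair(pair, vocab):
    """Apply one merge operation: join every occurrence of the pair."""
    pattern = re.compile(r'(?<!\S)' + re.escape(' '.join(pair)) + r'(?!\S)')
    return {pattern.sub(''.join(pair), word): freq for word, freq in vocab.items()}

def learn_bpe(word_counts, num_merges):
    # Represent each word as space-separated characters plus an end marker.
    vocab = {' '.join(list(w)) + ' </w>': c for w, c in word_counts.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_stats(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # greedily take the most frequent pair
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges
```

Learning 32,000 (WMT) or 10,000 (IWSLT) operations corresponds to running this greedy loop for that many merges over the joint English-German corpus.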
We use an in-house implementation of Transformer (https://github.com/tensorflow/tensor2tensor/tree/master/tensor2tensor). For the model structure, we use the transformer-base settings from Vaswani et al. (2017). Sentence pairs are batched together by approximate sentence length, with each batch containing approximately 25,000 source tokens and 25,000 target tokens. We set both label smoothing and the dropout rate to 0.1. We use Adam Kingma and Ba (2014) to update the parameters, varying the learning rate under a warm-up strategy with 4,000 steps. We use beam search with a beam size of 4 for decoding, and case-insensitive 4-gram BLEU Papineni et al. (2002) to evaluate our translation results.
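The warm-up strategy follows the standard transformer-base schedule from Vaswani et al. (2017): a linear warm-up over the first 4,000 steps, then inverse square-root decay. A sketch (d_model = 512 for transformer-base):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning-rate schedule from Vaswani et al. (2017): linear warm-up
    for `warmup_steps`, then inverse square-root decay afterwards."""
    step = max(step, 1)  # avoid 0 ** -0.5 at the first step
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The rate peaks exactly at `warmup_steps` and decays proportionally to 1/sqrt(step) afterwards.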
3.2 Experimental Results
We show our translation results on the WMT14 ENDE and IWSLT14 DEEN tasks in Table 1. The first 3 lines show that our in-house implementation is comparable with the open-source implementation Vaswani et al. (2018).
Results of FT & BT
The next 3 lines of Table 1 show consistent improvements when using different synthetic data. The BASE + BT + FT model outperforms the baseline by an average of 1.99 BLEU points and outperforms the other two models (+FT or +BT) by 0.4–1.39 BLEU points on the two tasks. It is worth mentioning that the translation quality improvement brought by FT is much lower than that of BT, which is consistent with the observations in Bogoychev and Sennrich (2019). (We found that directly adding FT data to the authentic bilingual data and continuing to train the NMT model makes translation quality deteriorate rapidly. One assumption is that the target side of bilingual data plays a more important role in NMT training, so the relatively low-quality target side of FT data hinders NMT performance. Instead, to use FT data, we first train an NMT model with the FT data and then fine-tune the model with the authentic data in all our experiments.)
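The two-stage FT recipe described in the parenthetical note can be sketched as follows; `train_steps` is a hypothetical helper, and the step counts and learning rates are illustrative assumptions, not values reported in the paper:

```python
# Two-stage recipe for FT data: pre-train on the synthetic FT corpus,
# then fine-tune on authentic bitext. `train_steps(model, data, steps, lr)`
# is a hypothetical helper; ft_steps / tune_steps / lrs are illustrative.

def train_with_ft(model, ft_pairs, authentic_pairs, train_steps,
                  ft_steps=100_000, ft_lr=7e-4, tune_steps=30_000, tune_lr=1e-4):
    # Stage 1: learn from the noisy forward-translated corpus.
    train_steps(model, ft_pairs, steps=ft_steps, lr=ft_lr)
    # Stage 2: fine-tune on clean authentic data so its high-quality
    # target side dominates the final model.
    train_steps(model, authentic_pairs, steps=tune_steps, lr=tune_lr)
    return model
```

The ordering matters: fine-tuning on authentic data last lets the clean target side override errors absorbed from the FT corpus.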
Results of AR Models
With respect to BLEU scores, we observe that the *-ADD models work better than the *-REP models, which indicates that both synthetic and repaired data are useful for NMT training. All repaired models (BTR and FTR) achieve better results than the corresponding non-repaired models, and the BASE + BTR-ADD + FTR-ADD model achieves the best BLEU score on all test sets.
On the WMT14 EN-DE translation task, the BASE + BTR-ADD + FTR-ADD model outperforms the baseline by 2.07 BLEU points and the BASE + BT + FT model by 0.75 BLEU points. On the IWSLT DE-EN task, the BASE + BTR-ADD + FTR-ADD model outperforms the baseline and the BASE + BT + FT model by 3.85 and 1.19 BLEU points, respectively. Overall, our best AR models achieve an average improvement of 2.96 BLEU points over the baseline and 0.97 BLEU points over the strong synthetic model.
3.3 AR Quality Analysis
We further analyze the quality of the proposed AR models themselves from three aspects in Table 2: BLEU score, Change Rate, and Better Rate. We follow the definition of the dev data described in Section 2.2: we use our AR model to transform x' of the dev pairs (x', x) into x̂, then use x as the reference to evaluate the quality of x̂.
We apply the corpus-level BLEU score as the evaluation criterion to measure whether x̂ has higher quality than x'. From Table 2, we can see that x̂ achieves over +11 BLEU points improvement over x' for both the EN2EN and DE2DE AR models. These BLEU score improvements indicate that our AR models can improve the quality of synthetic data.
We count the differences between x̂ and x' and call the resulting ratio the change rate (CR), calculated as CR = N_c / N, where N_c is the number of sentences whose output x̂ differs from the input sentence x', and N is the total number of sentences. We found that over 76% of the input sentences were changed by the AR models, which indicates that our AR models indeed learn transformation information for synthetic data.
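The change rate can be computed directly:

```python
def change_rate(inputs, outputs):
    """CR = N_c / N: the fraction of sentences the AR model modified,
    comparing each repaired output against its synthetic input."""
    assert len(inputs) == len(outputs)
    changed = sum(1 for x, x_hat in zip(inputs, outputs) if x != x_hat)
    return changed / len(inputs)
```

A CR near zero would mean the AR model is merely copying its input; the observed 76%+ shows it actively rewrites the synthetic sentences.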
"Better" here means that x̂ achieves a higher sentence-level BLEU score than x'. We find that over 69% of the x̂ are better than x', which shows again that the AR models indeed improve the quality of synthetic data.
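The Better Rate counts repaired sentences that score strictly higher against the reference than their synthetic originals. A sketch, substituting a clipped unigram precision for sentence-level BLEU purely for illustration (the paper uses sentence-level BLEU):

```python
from collections import Counter

def unigram_precision(hyp, ref):
    """Clipped unigram precision -- a crude stand-in for sentence-level
    BLEU, used here only to illustrate the Better Rate computation."""
    hyp_toks, ref_counts = hyp.split(), Counter(ref.split())
    if not hyp_toks:
        return 0.0
    matched = sum(min(c, ref_counts[t]) for t, c in Counter(hyp_toks).items())
    return matched / len(hyp_toks)

def better_rate(repaired, synthetic, references):
    """Fraction of sentences where the repaired output scores strictly
    higher against the reference than the original synthetic sentence."""
    better = sum(1 for x_hat, x_syn, ref in zip(repaired, synthetic, references)
                 if unigram_precision(x_hat, ref) > unigram_precision(x_syn, ref))
    return better / len(references)
```

Swapping `unigram_precision` for a real sentence-level BLEU implementation reproduces the metric reported in Table 2.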
In this paper, we have presented an AR framework that uses a seq2seq-based AR model to directly repair synthetic parallel data. The proposed method can be applied to various types of synthetic corpora at a low time cost. On both the WMT14 ENDE and IWSLT14 DEEN translation tasks, experimental results and further in-depth analysis demonstrate that the proposed AR method is able to 1) improve the quality of synthetic parallel data and 2) significantly improve the quality of NMT models trained with the repaired data.
- Bahdanau et al. (2014). Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Bogoychev and Sennrich (2019). Domain, translationese and noise in synthetic data for neural machine translation. arXiv preprint arXiv:1911.03362.
- Cho et al. (2014). Learning phrase representations using RNN encoder–decoder for statistical machine translation. In EMNLP.
- Edunov et al. (2018). Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
- Gao et al. (2019). Soft contextual data augmentation for neural machine translation. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 5539–5544.
- Gulcehre et al. (2015). On using monolingual corpora in neural machine translation. arXiv preprint arXiv:1503.03535.
- Gwinnup et al. (2017). The AFRL-MITLL WMT17 systems: old, new, borrowed, BLEU. In Proceedings of the Second Conference on Machine Translation, pp. 303–309.
- He et al. (2016). Dual learning for machine translation. In Advances in Neural Information Processing Systems, pp. 820–828.
- Hoang et al. (2018). Iterative back-translation for neural machine translation. In Proceedings of the 2nd Workshop on Neural Machine Translation and Generation, pp. 18–24.
- Kingma and Ba (2014). Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Lample et al. (2018). Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
- Papineni et al. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pp. 311–318.
- Poncelas et al. (2018). Investigating backtranslation in neural machine translation. arXiv preprint arXiv:1804.06189.
- Sennrich et al. (2017). The University of Edinburgh's neural MT systems for WMT17. arXiv preprint arXiv:1708.00726.
- Sennrich et al. (2015). Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.
- Sennrich and Zhang (2019). Revisiting low-resource neural machine translation: a case study. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.
- Sutskever et al. (2014). Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems.
- Vaswani et al. (2018). Tensor2Tensor for neural machine translation. arXiv preprint arXiv:1803.07416.
- Vaswani et al. (2017). Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008.
- Wu et al. (2016). Google's neural machine translation system: bridging the gap between human and machine translation. arXiv preprint arXiv:1609.08144.
- Xia et al. (2019). Microsoft Research Asia's systems for WMT19. In Proceedings of the Fourth Conference on Machine Translation.