Data Diversification: An Elegant Strategy For Neural Machine Translation

by Xuan-Phi Nguyen, et al.

A common approach to improving neural machine translation is to invent new architectures. However, the research process of designing and refining such new models is often exhausting. Another approach is to resort to large amounts of extra monolingual data to conduct semi-supervised training, as in back-translation. But extra monolingual data is not always available, especially for low-resource languages. In this paper, we propose to diversify the available training data by using multiple forward and backward peer models to augment the original training dataset. Our method does not require extra data like back-translation, nor additional computations and parameters like methods based on pretrained models. Our data diversification method achieves a state-of-the-art BLEU score of 30.7 in the WMT'14 English-German task. It also consistently and substantially improves translation quality in 8 other translation tasks: 4 IWSLT tasks (English-German and English-French) and 4 low-resource translation tasks (English-Nepali and English-Sinhala).





1 Introduction

Neural machine translation (NMT) is one of the core and most attractive fields of research in neural approaches to natural language processing (NLP). A standard way to build a better NMT system is to invent new architectures, such as recurrent (Luong et al., 2015), convolutional (Gehring et al., 2017), and self-attention (Vaswani et al., 2017) models. While the deployment-ready product of such novel architectures is often appealing to practitioners, the invention process is exhausting and usually requires much subsequent research to refine and implement optimally. A simpler approach is to use extra monolingual data to perform semi-supervised training, as is done with the standard back-translation method (Edunov et al., 2018). However, collecting and cleaning huge amounts of monolingual data also requires substantial effort. Furthermore, extra data is not always available, especially for low-resource languages like Nepali and Sinhala (spoken in Sri Lanka).

Method                       Training               Inference
                             FLOPs   |θ|    |D|     FLOPs   |θ|
New Architectures
  Transformer                1x      1x     1x      1x      1x
  Dynamic Conv               1x      1x     1x      1x      1x
NMT+BERT                     60x     3x     25x     3x      3x
Back-Translation             2x      1x     50x     1x      1x
(So et al., 2019)            15000x  1x     1x      1x      1x
Our Data Diversification
  Default Setup              7x      1x     1x      1x      1x
Table 1: Estimated method comparison. |θ| denotes the number of parameters, while |D| denotes the size of training data required.

In contrast, in this paper we propose Data Diversification, a simple but effective way to avoid many disadvantages of the aforementioned approaches while still improving translation quality consistently and significantly. In particular, we first train different models on both the backward (target→source) and forward (source→target) translation tasks. Then, we use the backward models to translate the target sentences of the training set and obtain a diverse set of source sentences that augment the original training dataset. We follow a similar procedure with the forward models to augment the training dataset with a diverse set of target sentences. After that, we train another descendant model on the augmented data to acquire the final translation model. Our approach is inspired by, and combines, multiple well-known strategies: back-translation, ensembles of models, and data augmentation for NMT. Table 1 compares the trade-offs made by our method and other well-known approaches (elaborated in Section 2).

Our contributions are threefold. First, we propose a new data diversification strategy (Section 3), which is simple but effective in many translation tasks. Second, we conduct extensive experiments (Section 4) and show that the technique improves the baseline Transformer by 1.0 to 2.0 BLEU in four IWSLT translation tasks: English-German, German-English, English-French and French-English. Our approach also boosts performance by around 1.0 BLEU in the supervised settings of the four newly introduced low-resource tasks: English-Nepali, Nepali-English, English-Sinhala and Sinhala-English (Guzmán et al., 2019). Our approach also achieves state-of-the-art in the WMT'14 English-German translation task. Finally, we explain with various ablation studies (Section 5) why such a straightforward strategy works well. Our analysis suggests that our data diversification method has a strong correlation with ensembles of models. Besides, the technique also acts as a regularizer, naturally trading perplexity off for a better BLEU score.

2 Background

In this section, we describe different approaches to NMT advancements and compare them with ours.

Developing Novel Architectures

Since the pioneering work of Sutskever et al. (2014) on sequence-to-sequence (seq2seq) learning, a number of new architectures have been proposed. The inclusion of the attention mechanism inside recurrence-based seq2seq models (Bahdanau et al., 2014; Luong et al., 2015) was an important breakthrough that made seq2seq models effective for longer sequence transduction tasks. Gehring et al. (2017) proposed a convolutional approach that parallelizes the sequence-encoding process. Vaswani et al. (2017) proposed the Transformer architecture with self-attention layers as the main component. This architecture can model long-range dependencies with a maximum path length of O(1) between any two positions. As such, the model achieves state-of-the-art in both efficiency and performance in machine translation and other NLP tasks.

Despite their scientific significance, the process of developing novel architectures for NMT often requires huge research effort, while the gains are not always substantial in practice. New models often come with a different optimal set of training hyper-parameters (e.g., learning rate schedule, dropout rate), which are usually found through trial and error. Moreover, novel and effective neural models like the Transformer may not stop at one research work, but are often refined and improved subsequently. For instance, Shaw et al. (2018) and Ahmed et al. (2017) propose minor modifications that improve the Transformer model with slight gains in performance. Popel and Bojar (2018) provide a set of training tips to train the Transformer more efficiently. On the other hand, Ott et al. (2018) propose scaling the training process to 128 GPUs to achieve more significant improvements. Wu et al. (2019) recently introduced Dynamic Convolution, yet another new architecture. Our data diversification strategy is orthogonal to these advancements in model architectures, and can further improve translation quality without having to go through the tedious process of model development.

Semi-supervised Training for NMT

Semi-supervised learning offers considerable improvements to NMT models. Back-translation (Sennrich et al., 2016a) is a simple and very effective way to exploit monolingual data. In this method, the target-language (monolingual) data is translated to the source language by a backward model trained on the target-to-source parallel corpus. The translated data is then combined with the original training data to train the source-to-target NMT model. Another effective way to use monolingual data is through pretrained models. Anonymous (2020) recently proposed an effective way to incorporate pretrained BERT (Devlin et al., 2018) to improve NMT. Nonetheless, the drawback of both methods is that they require large amounts of extra monolingual data to train or pretrain. Acquiring such datasets is sometimes expensive, especially for very low-resource languages like Nepali and Sinhala. Moreover, in the case of using pretrained BERT, the packaged translation model incurs the additional computational cost of the pretrained model. Our approach shares many similarities with back-translation, but differs from it in that our method does not use any extra monolingual data.
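To make the back-translation baseline concrete, here is a minimal sketch of its pipeline. The callables `train_model` and `translate` are hypothetical stand-ins for a real NMT toolkit's training and inference routines, not an actual API:

```python
# A minimal sketch of standard back-translation (Sennrich et al., 2016a).
# `train_model` and `translate` are hypothetical stand-ins for a real NMT
# toolkit's training and beam-search inference routines.

def back_translate(parallel, target_monolingual, train_model, translate):
    """Augment a parallel corpus with synthetic pairs built from
    target-side monolingual data."""
    # 1. Train a backward (target -> source) model on the reversed bitext.
    backward = train_model([(t, s) for (s, t) in parallel])
    # 2. Translate the monolingual target sentences into synthetic sources.
    synthetic_sources = [translate(backward, t) for t in target_monolingual]
    # 3. Pair synthetic sources with the real targets and join them with
    #    the original bitext for training the source -> target model.
    return parallel + list(zip(synthetic_sources, target_monolingual))
```

Note the contrast with our method: back-translation draws its target sentences from extra monolingual data, whereas data diversification only re-translates the sentences already in the parallel training set.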

Making it More Concrete

Table 1 summarizes the different types of costs required for training and inference by different approaches to improving NMT. As can be seen, developing new and efficient architectures, like dynamic convolution (Wu et al., 2019), offers virtually no measurable compromise in either training or inference. However, such an approach incurs other hidden costs, as explained earlier. On the other hand, semi-supervised methods are often simpler, but require significantly more training data. In particular, back-translation (Edunov et al., 2018) employs 50x more training data. Meanwhile, NMT+BERT requires 60x more computation and 25x more data to train (including the pre-training stage). It also needs 3x more computation and parameters during inference. Another strategy is evolution, which trains a population of thousands of models to find the best hyper-parameters. The Evolved Transformer (So et al., 2019), for instance, requires more than 15,000 times more FLOPs to train. This may not be practical for common practitioners.

On the other hand, our approach is as simple as back-translation, but requires no extra data to improve translation performance. It also has the same inference efficiency as the "New Architectures" approach. However, it makes one compromise: extra computation during training. In our view, moderate extra training computation is usually affordable, while extra monolingual data may not always be accessible.
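The 7x training entry for our default setup in Table 1 follows directly from counting model trainings. A rough sketch of that arithmetic (a simplification: it counts trainings only, while later rounds actually train on larger, augmented datasets):

```python
def training_cost_factor(k, rounds):
    """Rough training-cost multiple of data diversification relative to a
    single baseline model: each round trains k forward and k backward
    models, plus one final model trained on the fully augmented data.
    """
    return 2 * k * rounds + 1

# Default setup: k = 3 and one round -> 3 forward + 3 backward + 1 final
# model = 7 trainings, matching the 7x entry in Table 1.
```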

3 Data Diversification

In this section, we formally describe our data diversification strategy. Let D = (S, T) be the parallel training data, where S and T denote the source-side and target-side corpora, respectively. Also, let M_{S→T} and M_{T→S} be forward and backward NMT models, which translate from source to target and from target to source, respectively. In our case, we use the Transformer (Vaswani et al., 2017) as the main architecture for the NMT models. In addition, given a corpus X in language s and an NMT model M_{s→t} which translates from language s to language t, we denote M_{s→t}(X) as the translation of corpus X produced by the model M_{s→t}. The translation is conducted following standard procedures: maximum likelihood training with cross-entropy loss and beam-search inference.

Our data diversification strategy trains the models in N rounds. In the first round, we train k forward NMT models F_1, ..., F_k and k backward NMT models B_1, ..., B_k, where k denotes a diversification factor.

Then, we use the forward models to translate the source-side corpus S of the original data to generate synthetic training data. In other words, we obtain multiple synthetic target-side corpora as:

    T'_i = F_i(S)    for i = 1, ..., k

Similarly, the backward models are used to translate the target-side original corpus T to synthetic source-side corpora as:

    S'_i = B_i(T)    for i = 1, ..., k

After that, we augment the original data with the newly generated synthetic data, which sums up to the new round-1 dataset as follows:

    D_1 = (S, T) ∪ ⋃_{i=1}^{k} (S, T'_i) ∪ ⋃_{i=1}^{k} (S'_i, T)

where the union notation (∪) means joining multiple datasets into one dataset.
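The round-1 augmentation above can be sketched in a few lines. Here `translate_corpus` is a hypothetical placeholder for beam-search inference with a trained NMT model, and datasets are represented as sets of sentence pairs:

```python
def diversify_once(source, target, forward_models, backward_models,
                   translate_corpus):
    """Build the round-1 dataset: the original pairs, plus (S, F_i(S)) for
    each forward model and (B_i(T), T) for each backward model.

    `translate_corpus(model, sentences)` is a hypothetical stand-in for
    beam-search inference with a trained NMT model.
    """
    data = set(zip(source, target))  # original pairs (S, T)
    for f in forward_models:         # synthetic targets T'_i = F_i(S)
        data |= set(zip(source, translate_corpus(f, source)))
    for b in backward_models:        # synthetic sources S'_i = B_i(T)
        data |= set(zip(translate_corpus(b, target), target))
    return data
```

Representing the dataset as a set of pairs also realizes the duplicate filtering described in the experimental setup.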

1: procedure Train(D)
2:     Initialize a model θ with random parameters
3:     Train θ on D until convergence
4:     return θ

1: procedure Reverse(D)
2:     return {(t, s) : (s, t) ∈ D}    ▷ Swap source and target sides

1: procedure DataDiverse(D_0, k, N)
2:     D_0 ← (S_0, T_0)    ▷ Assign original dataset to round-0 dataset
3:     for r = 1, ..., N do
4:         D_r ← D_{r-1}
5:         for i = 1, ..., k do
6:             F_i ← Train(D_{r-1})                   ▷ Train forward model
7:             B_i ← Train(Reverse(D_{r-1}))          ▷ Train backward model
8:             D_r ← D_r ∪ (S_{r-1}, F_i(S_{r-1}))    ▷ Add forward data
9:             D_r ← D_r ∪ (B_i(T_{r-1}), T_{r-1})    ▷ Add backward data
10:    M ← Train(D_N)    ▷ Train the final model
11:    return M
Algorithm 1 Data Diversification: Given a dataset D_0, a diversification factor k, and the number of rounds N; return a trained source→target translation model M.

After that, if the number of rounds N > 1, we continue training round-2 forward and backward models on the augmented data D_1. The same process continues until the final augmented dataset D_N is generated. Eventually, we train the final model M on the dataset D_N, which is the model used for testing. For a clearer presentation, Algorithm 1 summarizes the process concretely. In the experiments, unless specified otherwise, we use the default setup of k = 3 and N = 1.
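The full multi-round procedure of Algorithm 1 can be sketched end to end as follows. As before, `train` and `translate` are hypothetical callables standing in for a real NMT toolkit:

```python
def data_diverse(data, k, n_rounds, train, translate):
    """Multi-round data diversification, following Algorithm 1.

    data: list of (source, target) pairs. `train(pairs)` returns a trained
    model; `translate(model, sentences)` returns translations. Both are
    hypothetical stand-ins for a real NMT toolkit.
    """
    current = list(data)
    for _ in range(n_rounds):
        augmented = set(current)
        src = [s for s, _ in current]
        tgt = [t for _, t in current]
        for _ in range(k):
            forward = train(current)                        # source -> target
            backward = train([(t, s) for s, t in current])  # target -> source
            augmented |= set(zip(src, translate(forward, src)))
            augmented |= set(zip(translate(backward, tgt), tgt))
        current = sorted(augmented)  # deterministic order for the next round
    return train(current)  # final model trained on D_N
```

In a real setup each `train` call would also use a different random seed, which Section 5.3 shows contributes to the diversity of the synthetic data.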

4 Experiments

In this section, we present a series of experiments to demonstrate that our data diversification approach improves translation quality in many translation tasks, encompassing WMT and IWSLT tasks, and high- and low-resource translation tasks.

4.1 WMT’14 Translation Task


We conduct experiments on the WMT'14 English-German (En-De) translation task. The training dataset contains about 4.5 million sentence pairs. The sentences are encoded with Byte-Pair Encoding (BPE) (Sennrich et al., 2016b) with 32,000 operations, which results in a shared vocabulary of 32,768 tokens. We use newstest2013 as the development set, and newstest2014 for testing. The WMT'14 En-De translation is considered a high-resource task, as the amount of parallel training data is relatively large.

We use the Transformer (Vaswani et al., 2017) as our NMT model and follow the training configuration suggested by Ott et al. (2018). The model has 6 layers, each with model dimension 1024, feed-forward dimension 4096, and 16 attention heads. The model has approximately 209M parameters. The Adam optimizer (Kingma and Ba, 2014) was used with the learning rate schedule, warm-up steps, and token batch size of Ott et al. (2018), and a dropout rate of 0.3. For data diversification, we use a diversification factor of k = 3. This yields six (6) intermediate models, which are selected based on the validation loss. When augmenting the dataset, we filter out duplicate pairs. After the filtering process, the resulting dataset available for training the final model contains about 27 million sentence pairs. Note that we do not use any extra monolingual data whatsoever. For inference, we average the last several checkpoints of the final model and use beam search with a length penalty, following Ott et al. (2018). We measure performance in standard tokenized BLEU.
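Checkpoint averaging, used above before inference, simply averages the saved parameter values key by key across the last few checkpoints. A minimal sketch (real toolkits average whole tensors; plain floats keep it simple):

```python
def average_checkpoints(checkpoints):
    """Average saved parameter values key by key across checkpoints.

    checkpoints: list of dicts mapping parameter names to values, a
    simplified stand-in for real model state dicts of tensors.
    """
    n = len(checkpoints)
    return {key: sum(c[key] for c in checkpoints) / n
            for key in checkpoints[0]}
```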


We report our results on the newstest2014 test set in Table 2. We can observe that the Transformer model, which achieves 29.3 BLEU on this test set, now achieves 30.7 BLEU with our data diversification strategy, setting a new state-of-the-art. In other words, our approach yields an improvement of 1.4 BLEU over the without-diversification baseline model (Ott et al., 2018) and 1.0 BLEU over the previous state-of-the-art (Wu et al., 2019).

Method WMT’14
Transformer (Vaswani et al., 2017) 28.4
Transformer + Rel. Pos (Shaw et al., 2018) 29.2
Scale Transformer (Ott et al., 2018) 29.3
Dynamic Convolution (Wu et al., 2019) 29.7
Our Data Diversification with
Scale Transformer (Ott et al., 2018) 30.7
Table 2: WMT’14 English-German (En-De) performance in BLEU scores on newstest2014 testset.

4.2 IWSLT Translation Tasks


In this section, we show the effectiveness of our approach on the IWSLT'14 English-German (En-De) and German-English (De-En), and IWSLT'13 English-French (En-Fr) and French-English (Fr-En) translation tasks. The IWSLT'14 En-De training set contains about 160K sentence pairs. We randomly sample a portion of the training data for validation and combine the test sets IWSLT14.TED.{dev2010, dev2012, tst2010, tst2011, tst2012} for testing. The IWSLT'13 En-Fr dataset has about 200K training sentence pairs. We use the IWSLT15.TED.tst2012 set for validation and the IWSLT15.TED.tst2013 set for testing. We apply BPE to all four tasks, which results in a shared vocabulary for the English-German pair and another for the English-French pair.

We compare our approach against two baselines that do not use data diversification: the Transformer (Vaswani et al., 2017) and Dynamic Convolution (Wu et al., 2019). To make a fair comparison, we use the base setup of the Transformer model for both the baselines and our approach. Specifically, the models have 6 layers, each with model dimension 512, feed-forward dimension 1024, and 4 attention heads. We use a dropout of 0.3, weight decay of 0.0001, and label smoothing of 0.1 for all our IWSLT experiments. The models are trained until convergence based on the validation loss. Note that we do not perform checkpoint averaging for these tasks; instead, we run each experiment 5 times with different random seeds and report the mean BLEU score to provide more consistent and stable results.

For inference, we use a beam size of 5, with the length penalty tuned on the validation set separately for the English-German and English-French pairs. Performance is measured in case-insensitive BLEU. For our data diversification settings, we set the factor k = 3.


Table 3 shows the results of our data diversification method in comparison to the baselines. As can be seen, our method substantially and consistently boosts performance by 1.0 to 2.0 BLEU in all four translation tasks. In particular, in the En-De task, our method achieves up to 30.4 BLEU, 1.8 points above the Transformer baseline. A similar trend can be seen in De-En, where data diversification achieves 36.8 BLEU, a gain of +2.1. Meanwhile, in the En-Fr and Fr-En tasks, our proposed approach reaches 45.3 and 44.5 BLEU, improvements of 1.3 and 1.2 points over the Transformer baseline.

Method IWSLT
En-De En-Fr De-En Fr-En
No Data Diversification
Transformer 28.6 44.0 34.7 43.3
Dynamic Conv 28.7 43.8 35.0 43.5
Our Data Diversification
Transformer 30.4 45.3 36.8 44.5
Table 3: Performances in BLEU scores on IWSLT’14 English-German, German-English, and IWSLT’13 English-French and French-English translation tasks.

4.3 Low-Resource Translation Tasks

Having demonstrated the effectiveness of our approach on high-resource languages such as English, German and French, we now test how well data diversification performs on very low-resource languages. Guzmán et al. (2019) recently proposed two new machine translation datasets that target the low-resource setup: English-Nepali and English-Sinhala. Both Nepali (Ne) and Sinhala (Si) are very challenging to translate because their vocabularies and grammars are vastly different from those of high-resource languages like English. Furthermore, data sources are particularly scarce, with little parallel text available. Therefore, NMT models can barely translate from or to these languages correctly without the support of huge amounts of monolingual data.


We evaluate our data diversification strategy on the supervised setups of the four low-resource translation tasks: En-Ne, Ne-En, En-Si, and Si-En. We compare our approach against the baseline proposed in (Guzmán et al., 2019). The English-Nepali parallel dataset contains about 500K sentence pairs, while the English-Sinhala dataset has about 400K pairs. We use the provided dev set for development and devtest set for testing.

In terms of training parameters, we replicate the setup of Guzmán et al. (2019), with some exceptions. Specifically, we use a Transformer model with 5 layers, each with 2 attention heads, model dimension 512, and feed-forward dimension 2048. We use a dropout rate of 0.4, label smoothing of 0.2, and weight decay of 0.0001. We train the models for 100 epochs with a batch size of 4,000 tokens. We select the inference models and length penalty based on the validation loss. The Nepali and Sinhala corpora are tokenized using the Indic NLP library. We reuse the shared BPE vocabulary of 5,000 tokens learned with the sentencepiece library.

For inference, we use beam search with a beam size of 5, with the length penalty selected per task on the validation set. We report tokenized BLEU for the from-English tasks and detokenized SacreBLEU (Post, 2018) for the to-English tasks. We use k = 3 in our data diversification experiments. Note that we do not use any additional monolingual data.


We report the low-resource results in Table 4. Our data diversification strategy consistently improves performance by more than 1 BLEU in all four tested tasks. Specifically, the method achieves 5.7, 8.9, 2.2, and 8.2 BLEU in the En-Ne, Ne-En, En-Si and Si-En tasks, respectively. In absolute terms, these are improvements of 1.4, 1.3, 1.2 and 1.5 BLEU over the baseline model (Guzmán et al., 2019). Without any monolingual data involved, our method establishes a new state-of-the-art in all four low-resource tasks.

Method En-Ne Ne-En En-Si Si-En
Baseline 4.3 7.6 1.0 6.7
Ours 5.7 8.9 2.2 8.2
Table 4: Performances on low-resource language pairs. English-Nepali and English-Sinhala pairs are measured in tokenized BLEU, while Nepali-English and Sinhala-English are measured in detokenized SacreBLEU.

5 Understanding Data Diversification

We have shown that our data diversification method performs well in a variety of translation scenarios. In this section, we propose several intuitive hypotheses to explain why and how it works, and provide deeper insight into its mechanism. We conduct a series of experimental analyses to confirm or reject these hypotheses. Some are confirmed by the experiments, while others, though intuitive, are rejected. We first present the explanations that are empirically verified, and then briefly discuss the failed hypotheses.

5.1 Ensemble Effects


Data diversification exhibits a strong correlation with ensembles of models.


To show this, we perform inference with an ensemble of seven (7) models (3 forward, 3 backward, and the final model) and compare its performance with ours. We evaluate this setup on the WMT'14 English-German, and IWSLT'14 English-German and German-English translation tasks. The results are reported in Table 5. We notice that the ensemble of models outdoes the single-model baseline by 1.0 BLEU in WMT'14 and by 1.6 to 1.8 BLEU in the IWSLT tasks. These results are comparable to those achieved by our data diversification technique, which suggests that our method has an ensembling effect. However, an ensemble of models has a major drawback: it requires multiple times (7 in this case) more computation and parameters to perform inference. Our approach does not have this disadvantage.

Method     Params   IWSLT En-De   IWSLT De-En   WMT En-De
Baseline   1x       28.6          34.7          29.3
Ensemble   7x       30.2          36.5          30.3
Ours       1x       30.4          36.8          30.7
Table 5: Data diversification preserves the effects of ensembling, while it keeps the number of inference parameters constant.


The correlation between ensembling and our data diversification can be explained as follows. Different models trained on the original dataset converge to different local optima; as such, individual models tend to have high variance. Ensembles of models are known to reduce variance, and thus improve performance. More formally, suppose a single model M_i estimates a model-specific distribution P_{M_i}(T|S), which approximates the data-generating distribution P(T|S). An ensemble of n models averages the multiple P_{M_i} (for i = 1, ..., n), which leads to a model distribution that is closer to P(T|S) and improves generalization. Our data diversification strategy achieves the same effect by forcing a single model to learn from the data generated from those different model distributions. In other words, not only is our model trained to estimate the original data sampled from the real distribution, it also learns to estimate the multiple P_{M_i} for i = 1, ..., n simultaneously. As a result, our model generalizes like an ensemble of models and achieves comparable performance.
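The variance-reduction argument above can be illustrated numerically. In this toy sketch, each "model" is the true distribution perturbed by estimation noise; by the triangle inequality, the averaged (ensemble) distribution is never farther from the truth than the average single-model error:

```python
import random

random.seed(0)
true_dist = [0.5, 0.3, 0.2]  # toy stand-in for the data distribution P(T|S)

def noisy_model():
    """A 'trained model': the true distribution perturbed by estimation
    noise, then renormalized."""
    raw = [max(p + random.gauss(0.0, 0.1), 1e-6) for p in true_dist]
    z = sum(raw)
    return [p / z for p in raw]

def l1_error(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

models = [noisy_model() for _ in range(10)]
ensemble = [sum(m[i] for m in models) / len(models)
            for i in range(len(true_dist))]

mean_single_error = sum(l1_error(m, true_dist) for m in models) / len(models)
ensemble_error = l1_error(ensemble, true_dist)
# ensemble_error <= mean_single_error always holds here (triangle
# inequality); averaging cancels independent estimation noise.
```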

5.2 Perplexity vs. BLEU Score


Data diversification sacrifices perplexity for better BLEU score.


We tested this hypothesis as follows. We recorded the validation perplexity at full convergence for the baseline setup and for our data diversification method. We report the results in Figure 1 for the WMT'14 English-German and IWSLT'14 English-German and German-English tasks. The left axis of the figure shows perplexity (PPL) values for the models, comparing the dark blue (baseline) and red (ours) bars. Meanwhile, the right axis shows the respective BLEU scores, reflected by the faded bars.


Common wisdom holds that lower perplexity leads to better BLEU scores; indeed, our NMT models are trained to minimize perplexity (equivalently, cross-entropy loss). However, existing research (Vaswani et al., 2013) also suggests that sacrificing perplexity may sometimes result in better generalization and performance. As shown in Figure 1, our models consistently exhibit higher perplexity than the baseline in all the translation tasks, even though this was not our intention; yet the BLEU score is also consistently higher than the baseline.
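For reference, perplexity is just the exponential of the average per-token negative log-likelihood, which is why minimizing cross-entropy and minimizing perplexity are equivalent:

```python
import math

def perplexity(total_nll, num_tokens):
    """Perplexity from a corpus-level negative log-likelihood (natural log).

    total_nll: summed -log p over all reference tokens;
    num_tokens: number of reference tokens.
    """
    return math.exp(total_nll / num_tokens)

# A model that assigns probability 0.25 to every reference token has a
# per-token loss of -ln(0.25) and hence a perplexity of 4.
```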

Figure 1: Relationship between validation perplexity and the BLEU scores for the Transformer baseline and data diversification, in the IWSLT’14 English-German, German-English and WMT’14 English-German tasks.

5.3 Initial Parameters vs. Diversity


Models with different initial parameters increase diversity in the augmented data, while the ones with fixed initial parameters decrease it.

Experiments and Explanation

With the intuition that diversity in training data improves translation quality, we speculated that the initialization of model parameters plays a crucial role in data diversification. Since neural networks are sensitive to initialization, different initial parameters may lead the models along different convergence paths (Goodfellow et al., 2016) and thus to different model distributions, while models with the same initialization are more likely to converge along similar paths. To verify this, we ran an experiment initializing all the constituent models (the forward models F_i and backward models B_i) with the same initial parameters to suppress data diversity. We conducted this experiment on the IWSLT'14 English-German and German-English tasks, with a reduced diversification factor. The results are shown in Figure 2. The BLEU scores with fixed (same) initialization drop compared to the randomized counterpart in both language pairs. However, the performance is still significantly higher than the single-model baseline. This suggests that initialization is not the only contributing factor to diversity: even with the same initial checkpoint, each constituent model is trained on a different dataset and learns to estimate a different distribution.
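The fixed- versus random-initialization contrast comes down to seeding. A toy illustration with a stand-in initializer (`init_params` is hypothetical; real frameworks seed their tensor initializers the same way):

```python
import random

def init_params(seed, size=4):
    """Stand-in for a network initializer: draw 'weights' from a seeded RNG."""
    rng = random.Random(seed)
    return [rng.uniform(-0.1, 0.1) for _ in range(size)]

same_a = init_params(seed=42)
same_b = init_params(seed=42)
different = init_params(seed=7)
# Identical seeds reproduce identical initial parameters (the "fixed"
# setting of this experiment), while different seeds start the constituent
# models at different points in parameter space.
```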

Figure 2: Effects of random and fixed parameters initialization on the performance on the IWSLT’14 English-German and German-English translation tasks.

5.4 Forward-Translation is Important


Forward-translation is as vital as back-translation.

Experiments and Explanation

We separate our method into forward and backward diversification, in which we train the final model M with the original data augmented by either the translations of the forward models F_i or those of the backward models B_i, separately. We compare these variants with the bidirectionally diversified model and the single-model baseline. Experiments were conducted on the IWSLT'14 English-German and German-English tasks.

As shown in Table 6, both the forward and backward diversification methods perform worse than the bidirectional counterpart, but still better than the baseline. Interestingly, diversification with forward models outperforms diversification with backward models, even though recent research has focused mainly on back-translation, which uses a backward model to translate target-side monolingual data to the source language (Sennrich et al., 2016a; Edunov et al., 2018). Our finding in data diversification is similar to that of Zhang and Zong (2016), who used source-side monolingual data to improve translation quality.

Task Baseline Backward Forward Bidirectional
En-De 28.6 29.2 29.86 30.4
De-En 34.7 35.8 35.94 36.8
Table 6: The performance of forward and backward diversification in comparison to bidirectional diversification and the baseline in the IWSLT’14 English-German and German-English tasks.

5.5 Failed Hypotheses

We also explored other plausible hypotheses that were eventually not supported by the experiments. We briefly present them in this section for a better understanding of the approach. The results of these experiments are given in the Appendix.

First, given that parameter initialization affects diversity, it is logical to assume that dropout would magnify the diversification effects. However, our empirical results did not support this. We ran an experiment with our method and the baseline, with and without dropout, expecting the gains to be higher in the non-zero-dropout setting. However, we observed similar, inconclusive gains in both scenarios.

Second, we speculated that beam search (beam size > 1) makes the synthetic data more diverse, and that greedy search (beam size = 1) generates samples closer to the original data, which reduces diversity. We tested this hypothesis by training the final model with the synthetic data generated by either beam search or greedy search. However, we found that both versions achieve similar results in our experiments (see Appendix).

6 Related Works

Our data diversification method shares many similarities with a variety of techniques in literature, namely, data augmentation, back-translation, and boosting ensembles of models.

Firstly, our approach is genuinely a data augmentation method. In this vein, Fadaee et al. (2017) proposed an augmentation strategy that targets rare words to improve low-resource translation. Wang et al. (2018) suggested simply replacing random words with other words from the vocabulary. Gao et al. (2019) proposed swapping random words with contextually related words. Li and Specia (2019) introduced an edit-distance-based augmentation strategy to improve the robustness of NMT models to noise. Our data diversification approach is distinct from all of these previous augmentation methods: rather than randomly corrupting the data and training on the augmented data on the fly, it transforms the data into synthetic translations, which follow different model distributions.

Secondly, our method is similar to back-translation, which generates synthetic data from target-side extra monolingual data. Sennrich et al. (2016a) were the first to propose this strategy, while Edunov et al. (2018) implemented the approach at scale and achieved higher BLEU scores. Our method's main advantage over these works is that it does not require any extra monolingual data, which may not always be available. Our technique also differs in that it employs forward translation, which we have shown to be important. Moreover, back-translation has also been used for data augmentation in question-answering (Yu et al., 2018).

Using multiple models to average predictions and reduce variance is the defining feature of an ensemble of models. In traditional ensembling, multiple forward models vote on the final predictions, which reduces generalization error and increases confidence compared to individual models (Perrone and Cooper, 1992). Huang et al. (2017) used snapshots of the same model, while Xie et al. (2013) used an epoch-based strategy. The common drawback of model ensembles is that the number of parameters at test time is several times that of an individual model; our approach does not suffer from this disadvantage. Our method is also related to knowledge distillation (Hinton et al., 2015), where a student model trains on the predictions of a trained teacher model.

7 Conclusion

We have proposed a simple yet very effective method to improve translation performance in many standard translation tasks. Our method involves training multiple forward and backward models and using them to augment the training data, which is then used to train the final model.

The approach achieves state-of-the-art in the WMT'14 English-German translation task with 30.7 BLEU. It also improves the IWSLT'14 English-German and German-English, and IWSLT'13 English-French and French-English tasks by 1.0-2.0 BLEU. Furthermore, it outperforms the baselines in the low-resource tasks: English-Nepali, Nepali-English, English-Sinhala and Sinhala-English.

Our experimental analysis of different hypotheses reveals that our approach has an ensembling effect. It also trades perplexity for a better BLEU score, which indicates a regularizing effect. We have also shown that forward translation and diversified parameter initialization of the constituent models are important.


Appendix A Appendix

a.1 Results for Failed Hypotheses

Effects of Dropout.

We ran experiments to test whether non-zero dropout magnifies the improvements of our method over the baseline. We trained the single-model baseline and our data diversification's final model both with dropout (0.3) and without dropout (0.0), on the IWSLT'14 English-German and German-English tasks, using factor k = 3 in these experiments. As reported in Table 7, the no-dropout versions perform much worse than the non-zero-dropout versions in all experiments. However, the gains made by our data diversification with dropout are not particularly higher than those of the no-dropout counterpart. This suggests that dropout may not contribute to the diversity of the synthetic data.

Dropout   Task    Baseline   Ours   Gain
0.3       En-De   28.6       30.1   +1.5 (5%)
0.3       De-En   34.7       36.5   +1.8 (5%)
0.0       En-De   25.7       27.5   +1.8 (6%)
0.0       De-En   30.7       32.5   +1.8 (5%)

Table 7: Improvements of data diversification under conditions with and without dropout in the IWSLT'14 English-German and German-English tasks.

Effects of Beam Search.

We hypothesized that beam search would generate more diverse synthetic translations of the original dataset, thus increasing diversity and improving generalization. We tested this hypothesis by using greedy maximum likelihood decoding (beam size 1) to generate the synthetic data and comparing its performance against the beam-search counterpart (beam size 5). We again used the IWSLT'14 English-German and German-English tasks as a testbed. Note that for testing with the final model, we used the same beam-search procedure in both cases. As shown in Table 8, the performance with greedy decoding is not particularly lower than with beam search.

Task    Baseline   Ours (Beam=1)   Ours (Beam=5)
En-De   28.6       30.3            30.4
De-En   34.7       36.6            36.8

Table 8: Improvements of data diversification under maximum likelihood (greedy) decoding and beam search in the IWSLT'14 English-German and German-English tasks.
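The contrast being tested above can be illustrated with a toy decoder. In this contrived sketch (all names and distributions are hypothetical), a token that looks worse at the first step leads to a better continuation, so greedy decoding (beam size 1) misses the highest-probability sequence that a wider beam recovers; the experiments above suggest this difference did not translate into more useful synthetic data:

```python
import math

def beam_search(next_dist, length, beam_size):
    """Toy beam search: next_dist(prefix) returns a token -> probability
    dict; returns the highest-scoring token sequence of the given length."""
    beams = [([], 0.0)]  # (token sequence, log-probability)
    for _ in range(length):
        candidates = [
            (tokens + [tok], score + math.log(p))
            for tokens, score in beams
            for tok, p in next_dist(tokens).items()
        ]
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
    return beams[0][0]

def next_dist(prefix):
    # Contrived decoder: "b" scores worse at step one (0.4 vs 0.6) but
    # enables a much better continuation (0.9), so the best full sequence
    # is ("b", "x") with probability 0.36.
    if not prefix:
        return {"a": 0.6, "b": 0.4}
    return {"x": 0.5, "y": 0.5} if prefix == ["a"] else {"x": 0.9, "y": 0.1}

greedy = beam_search(next_dist, length=2, beam_size=1)  # commits to "a" first
beam = beam_search(next_dist, length=2, beam_size=2)    # recovers ("b", "x")
```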