Back-translation (Sennrich et al., 2016a) is an effective strategy for improving the performance of neural machine translation (NMT) using monolingual data, delivering impressive gains over already competitive NMT models (Edunov et al., 2018). The strategy is simple: given monolingual data in the target language, one can use a translation model in the opposite of the desired translation direction to back-translate the monolingual data, effectively synthesizing a parallel dataset, which is in turn used to train the final translation model. Further improvements can be obtained by iteratively repeating this process (Hoang et al., 2018) in both directions, resulting in strong forward and backward translation models, an approach known as iterative back-translation.
However, not all monolingual data are equally important. An envisioned downstream application is very often characterized by a unique data distribution. In such cases of domain shift, back-translating target domain data can be an effective strategy (Hu et al., 2019) for obtaining a better in-domain translation model. One common strategy is to select samples that are both (1) close to the target distribution and (2) dissimilar to the average general-domain text (Moore and Lewis, 2010). However, as depicted in Figure 1, this method is not ideal because the second objective could bias towards the selection of sentences far from the center of the target distribution, potentially leading to selecting a non-representative set of sentences.
Even if we could select all in-domain monolingual data, the back-translation model would still not be suited for translating them because it has not been trained on in-domain parallel data and thus the back-translated data will be of poor quality. As we demonstrate in the experiments, the quality of the back-translated data can have a large influence on the final model performance.
To achieve the two goals of both selecting target-domain data and back-translating them with high quality, in this paper, we propose a method to combine dynamic data selection with weighting strategies for iterative back-translation. Specifically, the dynamic data selection selects subsets of sentences from a monolingual corpus at each training epoch, gradually transitioning from selecting general-domain data to choosing target-domain sentences. The gradual transition ensures that the back-translation model of each iteration can adequately translate the selected sentences, as they are close to the distribution of its current training data. We also assign weights to the back-translated data that reflect their quality, which further reduces the effect of potential noise due to low quality translations. The proposed data selection and weighting strategies are complementary to each other, as the former focuses on domain information while the latter emphasizes the quality of sentences.
We investigate the performance of our methods in domain adaptation, low-resource and high-resource MT settings and on German-English and Lithuanian-English datasets. Our strategies demonstrate improvements of up to 1.8 BLEU points over a competitive iterative back-translation baseline and up to 1.2 BLEU points over the best static data selection strategies. In addition, our analysis reveals that the selected samples can represent the target distribution well and that the weighting strategies are effective in noisy settings.
2 Background: Back-Translation
Back-translation (Sennrich et al., 2016a) has proven to be an effective way of utilizing monolingual data for machine translation. Given a parallel training corpus, we first train a target-to-source machine translation model. Then, we use this pre-trained model to translate a target-language monolingual corpus into the source language and obtain a synthetic parallel corpus. Last, we concatenate the back-translated data with the original parallel corpus to train a source-to-target model.
The success of back-translation has motivated researchers to investigate and extend the method (Edunov et al., 2018; Xia et al., 2019). The dual learning approach (He et al., 2016) integrates training on parallel data and training on monolingual data via round-trip translation and the use of language models to improve output fluency. Cheng et al. (2016) attempt to augment back-translation with weighting strategies and conduct the back-translation steps iteratively.
Because a better back-translation system will likely lead to a better synthetic corpus, Hoang et al. (2018) propose iterative back-translation and achieve improvements over previous state-of-the-art models in various settings. As shown in Algorithm 1, at each training step, a batch of monolingual sentences is sampled from one language and back-translated to the other language. The back-translated data is utilized to train the model in the other direction. The process is repeated in both directions. Parallel data can be mixed with the back-translated data to train the translation models.
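The loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not a reproduction of Algorithm 1: `ToyModel`, its `translate` and `train_on` methods, and the uniform batch sampling are hypothetical stand-ins for real NMT models and data loaders.

```python
import random
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (model input, model output)


class ToyModel:
    """Hypothetical stand-in for an NMT model; it only records its training pairs."""

    def __init__(self, translate_fn: Callable[[str], str]):
        self.translate_fn = translate_fn
        self.seen: List[Pair] = []

    def translate(self, sentences: List[str]) -> List[str]:
        return [self.translate_fn(s) for s in sentences]

    def train_on(self, pairs: List[Pair]) -> None:
        # A real model would take gradient steps here.
        self.seen.extend(pairs)


def iterative_back_translation(fwd: ToyModel, bwd: ToyModel,
                               mono_src: List[str], mono_tgt: List[str],
                               parallel: List[Pair],
                               steps: int = 2, batch_size: int = 2,
                               seed: int = 0) -> None:
    """Alternate between the two translation directions for `steps` rounds."""
    rng = random.Random(seed)
    reversed_parallel = [(t, s) for s, t in parallel]
    for _ in range(steps):
        # Target-side batch -> synthetic sources -> train the forward model.
        tgt_batch = rng.sample(mono_tgt, min(batch_size, len(mono_tgt)))
        synthetic_src = bwd.translate(tgt_batch)
        fwd.train_on(list(zip(synthetic_src, tgt_batch)) + parallel)

        # Source-side batch -> synthetic targets -> train the backward model.
        src_batch = rng.sample(mono_src, min(batch_size, len(mono_src)))
        synthetic_tgt = fwd.translate(src_batch)
        bwd.train_on(list(zip(synthetic_tgt, src_batch)) + reversed_parallel)
```

In each round the model being trained always sees genuine text on its output side and synthetic text on its input side, which is the property back-translation relies on.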
In this section, we first introduce our data selection and weighting strategies separately, and then illustrate how we combine the two ideas. In our problem setting, we are given two MT models, one for each translation direction, pretrained on parallel data, together with source and target monolingual corpora. The goal is to select and weight samples from the two monolingual corpora for back-translation, in order to best improve the performance of the two translation models.
3.1 Data Selection Strategies
We first describe a commonly used static selection strategy, and then illustrate our dynamic approach.
3.1.1 The Moore and Lewis (2010) Method
A common approach for data selection is the Moore and Lewis (2010) method (and extensions, e.g., Axelrod et al. (2011); Duh et al. (2013)), which computes the language model cross-entropy difference for each sentence $x$ in a monolingual corpus:
$$\mathrm{score}(x) = H_{\mathrm{in}}(x) - H_{\mathrm{gen}}(x), \tag{1}$$
where $H_{\mathrm{in}}(x)$ and $H_{\mathrm{gen}}(x)$ represent the cross-entropy scores of $x$ measured with an in-domain and a general-domain language model (LM), respectively. Sentences with the highest scores will be selected for training. Typically, the in-domain language model is trained with a small set of sentences in the target domain, and the general-domain language model is trained with all data available.
3.1.2 Our Two Scoring Criteria
Instead of static data selection, we propose a new curriculum strategy for iterative back-translation. Specifically, for each sentence in the monolingual corpus we measure both its representativeness, i.e. how well the sentence represents the target distribution, and its simplicity, i.e. how well the MT models can translate the sentence. First, we select the simplest samples for back-translation to ensure the quality of the back-translated data. As training progresses, the model becomes better at translating in-domain sentences, and we then shift to choosing more representative examples.
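Formally, at each epoch $t$, we rank the corpus according to
$$\mathrm{score}_t(x) = \lambda(t)\,\mathrm{Rep}(x) + \bigl(1 - \lambda(t)\bigr)\,\mathrm{Simp}(x), \tag{2}$$
where $\mathrm{Rep}(x)$ and $\mathrm{Simp}(x)$ denote the representativeness and simplicity of sentence $x$, respectively. We discuss the representativeness and the simplicity metrics in the following sections. The term $\lambda(t)$ balances the two criteria and is a function of the current epoch $t$.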
We adopt the square-root growing function of Platanios et al. (2019) for $\lambda(t)$ and set
$$\lambda(t) = \min\left(1, \sqrt{t\,\frac{1 - \lambda_0^2}{T} + \lambda_0^2}\right), \tag{3}$$
where $\lambda_0$ is the initial value and $T$ denotes the time after which we solely select representative samples. $\lambda(t)$ increases relatively quickly at first, and its rate of increase then gradually drops as training progresses. This suits our task: the sentences selected at first are relatively simple, so we do not need to spend much time on them. (In our preliminary studies, increasing $\lambda(t)$ linearly over time achieved worse performance.)
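As a rough illustration of how such a schedule behaves, the sketch below assumes the square-root competence form of Platanios et al. (2019) and a convex combination of the two criteria; the function names `balance` and `curriculum_score` are hypothetical, and the default values 0.1 and 5 are the settings reported for the domain adaptation experiments.

```python
import math


def balance(t: float, lambda0: float = 0.1, T: float = 5.0) -> float:
    """Square-root growing schedule: starts at lambda0 and reaches 1 at t = T."""
    return min(1.0, math.sqrt(t * (1.0 - lambda0 ** 2) / T + lambda0 ** 2))


def curriculum_score(rep: float, simp: float, t: float,
                     lambda0: float = 0.1, T: float = 5.0) -> float:
    """Weight representativeness by the schedule and simplicity by its complement.

    Both `rep` and `simp` are assumed to be already normalized to [0, 1].
    """
    lam = balance(t, lambda0, T)
    return lam * rep + (1.0 - lam) * simp
```

With these defaults the weight on representativeness rises from 0.1 at epoch 0 to roughly 0.46 after one epoch and 0.64 after two, so most of the shift away from simplicity happens early.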
Connections to Moore and Lewis (2010).
Our proposed criteria generalize the Moore and Lewis (2010) method. The first term of Equation 1, namely $H_{\mathrm{in}}(x)$, measures the representativeness of data because the in-domain LM assigns low entropy to sentences that appear frequently in the target domain. The second term, $H_{\mathrm{gen}}(x)$, on the other hand, measures the simplicity of the sentences. If $H_{\mathrm{gen}}(x)$ is high, it is likely that some $n$-grams of the sentence appear frequently in the parallel training data, indicating that the MT models will likely translate the sentence well. In other words, the sentence can provide limited additional information if $H_{\mathrm{gen}}(x)$ is high. Therefore, one can view the Moore and Lewis (2010) method as selecting the most representative and difficult sentences.
3.1.3 Representativeness Metrics
We propose three approaches to measure the sentence representativeness.
In-Domain Language Model Cross-Entropy (LM-in).
As in Axelrod et al. (2011) and Duh et al. (2013), we can use $H_{\mathrm{in}}(x)$ to measure the representativeness of the instances. Concretely, we train a language model with in-domain monolingual data and compute the score $H_{\mathrm{in}}(x)$ for each sentence $x$.
TF-IDF Scores (TF-IDF).
TF-IDF scores offer another way to perform data selection for machine translation (Kirchhoff and Bilmes, 2014). For a sentence $x$, one can compute the term frequency and the inverse document frequency of each of its words, and thus obtain the TF-IDF vector of $x$. Finally, we calculate the cosine similarity between the TF-IDF vector of $x$ and that of each sentence in a small in-domain dataset, and treat the maximum value as its representativeness score.
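The TF-IDF criterion can be sketched in a few lines of plain Python. Several details here are assumptions rather than facts from the text: whitespace tokenization, length-normalized term frequency, a smoothed IDF of the form log((1 + N) / (1 + df)) + 1, and the choice of corpus over which document frequencies are counted. All function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def idf_table(docs: List[List[str]]) -> Dict[str, float]:
    """Smoothed inverse document frequency over a list of tokenized sentences."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    return {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}


def tfidf(tokens: List[str], idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse TF-IDF vector of one sentence (unknown words get weight 0)."""
    tf = Counter(tokens)
    return {w: (c / len(tokens)) * idf.get(w, 0.0) for w, c in tf.items()}


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def representativeness(sentence: str, in_domain: List[str],
                       idf: Dict[str, float]) -> float:
    """Maximum cosine similarity to any sentence of a non-empty in-domain set."""
    vec = tfidf(sentence.split(), idf)
    return max(cosine(vec, tfidf(s.split(), idf)) for s in in_domain)
```

Taking the maximum rather than the mean means a candidate only has to resemble one in-domain sentence closely to receive a high score.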
BERT Representation Similarities (BERT).
BERT (Devlin et al., 2019) has proven to be a powerful model for learning sentence representations. Following the conclusion of Pires et al. (2019), we feed each sentence to the multilingual BERT model and average the hidden states of all input tokens except [CLS] and [SEP] at the eighth layer to obtain the sentence representation. We then compute the cosine similarity between the representation of each sentence in the monolingual corpus and that of each sentence in a small in-domain set, and the maximum value is treated as the representativeness score.
3.1.4 Simplicity Metrics
In our experiments, we study two metrics for measuring the simplicity of sentences.
General-Domain Language Model Cross-Entropy (LM-gen).
We train a language model on one side of the parallel training data. Then, for each sentence $x$, we compute the score $H_{\mathrm{gen}}(x)$.
Round-Trip BLEU (R-BLEU).
Given two pre-trained MT models, one for each translation direction, round-trip translation first translates a sentence $x$ into the other language using one model and then back-translates the result using the other, obtaining the reconstructed sentence $x'$. The BLEU score between $x$ and $x'$ is treated as our simplicity metric. Similar ideas have been applied to filter low-quality sentences (Imankulova et al., 2017).
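A minimal sketch of the round-trip idea is given below. The `forward` and `backward` callables are hypothetical placeholders for the two MT models, and `sentence_bleu` is a simplified, add-one smoothed sentence-level BLEU written for illustration; it is an assumption and not the BLEU implementation used in the experiments.

```python
import math
from collections import Counter
from typing import Callable, List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams; empty when the sentence is shorter than n."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(hyp: List[str], ref: List[str], max_n: int = 4) -> float:
    """Add-one smoothed sentence-level BLEU with a brevity penalty."""
    if not hyp or not ref:
        return 0.0
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        overlap = sum(min(c, r[g]) for g, c in h.items())  # clipped matches
        total = sum(h.values())
        log_prec += math.log((overlap + 1) / (total + 1)) / max_n
    bp = 1.0 if len(hyp) >= len(ref) else math.exp(1 - len(ref) / len(hyp))
    return bp * math.exp(log_prec)


def round_trip_bleu(sentence: str,
                    forward: Callable[[str], str],
                    backward: Callable[[str], str]) -> float:
    """Translate out and back, then compare the reconstruction to the original."""
    reconstructed = backward(forward(sentence))
    return sentence_bleu(reconstructed.split(), sentence.split())
```

A sentence the two models can carry there and back without loss scores close to 1, which is the sense in which the metric captures simplicity.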
For both the representativeness and simplicity scores, it should be noted that they are separately normalized to [0, 1] using the equation $s' = (s - s_{\min}) / (s_{\max} - s_{\min})$, where $s_{\max}$ and $s_{\min}$ are the maximum and minimum scores. (Other standardization strategies, such as subtracting the mean and dividing by the standard deviation, can also be applied, and we found they achieved comparable performance.) Also, both the representativeness and simplicity scores can be computed in a pre-processing step, and we only need to adjust $\lambda(t)$ in Equation 3 during training.
3.2 Weighting Strategies
Next, we illustrate how we perform data weighting on the back-translated data by measuring both the current quality of the sentences and their improvement over the previous iteration.
3.2.1 Measuring the Current Quality
As general translation models could perform poorly on the in-domain data, it is important to down-weight the examples of bad quality. To this end, we present two ways to measure the current quality of the back-translated sentences.
Encoder Representation Similarities (Enc).
We feed the source sentence and the target sentence to the encoders of the forward and backward models, respectively, and average the hidden states at the final layer to obtain their representations. The cosine similarity between the two representations is treated as the quality metric. (We share the parameters of the top layers of the encoders between the two models so that the encoder representations are comparable; details are given in §4.1.)
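The computation reduces to mean pooling followed by a cosine similarity, sketched here on plain lists of floats. The hidden states are assumed to be given as one vector per token; how they are extracted from the encoders is outside this sketch, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def mean_pool(hidden_states: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token vectors of one sentence into a single vector."""
    n = len(hidden_states)
    return [sum(column) / n for column in zip(*hidden_states)]


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def enc_quality(src_states: Sequence[Sequence[float]],
                tgt_states: Sequence[Sequence[float]]) -> float:
    """Quality score of a sentence pair from its two encoder outputs."""
    return cosine(mean_pool(src_states), mean_pool(tgt_states))
```

Because the two sentences generally differ in length, pooling to a fixed-size vector is what makes the two sides comparable at all.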
Agreement Between Forward and Backward Models (Agree).
Inspired by Junczys-Dowmunt (2018), the second approach utilizes the agreement of the two translation models. For each sentence pair, we compute its length-normalized conditional probabilities under the forward and backward models, then exponentiate the absolute difference between them to obtain the weight. Intuitively, it is likely that the back-translated sentences are of poor quality if there are huge disagreements on them between the two models.
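One way to realize the agreement score is sketched below. Two assumptions are made that the text does not spell out: the per-token scores are log-probabilities, and the absolute difference is negated before exponentiation so that larger disagreement yields a smaller weight, as in dual conditional cross-entropy filtering. The function names are hypothetical.

```python
import math
from typing import Sequence


def length_normalized_logprob(token_logprobs: Sequence[float]) -> float:
    """Average per-token log-probability of one (non-empty) output sequence."""
    return sum(token_logprobs) / len(token_logprobs)


def agreement_weight(fwd_token_logprobs: Sequence[float],
                     bwd_token_logprobs: Sequence[float]) -> float:
    """Weight in (0, 1]; equals 1 when both models score the pair identically."""
    diff = abs(length_normalized_logprob(fwd_token_logprobs)
               - length_normalized_logprob(bwd_token_logprobs))
    return math.exp(-diff)
```

Length normalization matters here because the two models score sequences of different lengths, one in each language.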
3.2.2 Measuring Quality Improvements
In domain adaptation, it is natural that at first the in-domain sentences are poorly translated. As training progresses, however, the quality should be improved. We therefore propose a metric to measure the improvement in translation quality and combine it with the current quality metric, in order to encourage the inclusion of in-domain sentences where the translation qualities have improved.
Specifically, every time we obtain the quality score of a sentence, we store it. The next time we come across the same sentence, we can compare the new quality score with the previous one:
where the clipping function limits the weights to a reasonable range. We set to .
|TF-IDF + R-BLEU||39.11*||28.93*||44.91*||36.19*|
3.3 Overall Algorithm: Combining Curriculum and Weighting Strategies
Our final algorithm is shown in Figure 2. At each epoch, we compute the score of each sentence in the monolingual corpora using Equation 2 and select the top-scoring sentences, where the percentage of sentences selected is a hyper-parameter. Afterwards, we perform back-translation and data weighting on the selected data, then use the back-translated data to train the translation model. The process is repeated iteratively for both directions, with $\lambda(t)$ increased at each training epoch.
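The per-epoch selection step can be sketched as below. It is an illustration under assumptions, not the algorithm of Figure 2: it assumes precomputed raw scores, min-max normalization, a square-root schedule for the balance term, and a convex combination of the two criteria. The function names are hypothetical, and the defaults (30%, 0.1, 5) are the values reported for the domain adaptation experiments.

```python
import math
from typing import List, Sequence


def min_max(scores: Sequence[float]) -> List[float]:
    """Rescale scores to [0, 1]; a constant list maps to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]


def select_for_epoch(rep: Sequence[float], simp: Sequence[float], t: float,
                     fraction: float = 0.3, lambda0: float = 0.1,
                     T: float = 5.0) -> List[int]:
    """Return indices of the sentences chosen for back-translation at epoch t."""
    lam = min(1.0, math.sqrt(t * (1.0 - lambda0 ** 2) / T + lambda0 ** 2))
    rep_n, simp_n = min_max(rep), min_max(simp)
    combined = [lam * r + (1.0 - lam) * s for r, s in zip(rep_n, simp_n)]
    k = max(1, round(fraction * len(combined)))
    ranked = sorted(range(len(combined)), key=lambda i: combined[i], reverse=True)
    return ranked[:k]
```

The selected indices would then be back-translated, weighted, and used for training; only the schedule value changes between epochs, so the raw scores never need to be recomputed.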
|TF-IDF + LM-gen||38.67||28.67||44.90||35.49|
|TF-IDF + R-BLEU||39.11||28.93||44.91||36.19|
4 Experiments on Domain Adaptation
We first conduct experiments in the domain adaptation setting, where we adapt models from a general domain to a specific domain.
We first train the translation models on the (general-domain) WMT-14 German-English dataset, consisting of about 4.5M training sentences, then perform iterative back-translation with (in-domain) law or medical OPUS monolingual data (Tiedemann, 2012). We de-duplicate the law and medical training data and sub-sample 250K and 200K sentences respectively in both languages to obtain the monolingual corpora. The development and test sets contain 2K sentences in each domain. Byte-pair encoding (Sennrich et al., 2016b) is applied with 32K merge operations. The general-domain and in-domain language models are trained on the WMT training data and the OPUS monolingual data respectively. The OPUS development sets are used to compute the TF-IDF and BERT representativeness scores.
We implement our approaches upon the Transformer-base model (Vaswani et al., 2017). Both the encoder and decoder consist of 6 layers and the hidden size is set to 512. For the translation models, weights of the top 4 layers of the encoders and bottom 4 layers of the decoders are shared between forward and backward models, as these parameters tend to be language-independent (Yang et al., 2018). We also tie the source and target word embeddings. We build 5-gram language models with modified Kneser-Ney smoothing using KenLM (Heafield, 2011).
$\lambda_0$ and $T$ in Equation 3 are set to 0.1 and 5, respectively. We select 30% of the sentences with the highest score at each epoch for our curriculum methods and 50% of the sentences for the static data selection baselines.
|Ite-Sampling + Enc||35.67||27.76||+0.89|
|Source||- wenn der Viehhalter seinen Betrieb einem Nachfolger bis zum dritten Verwandtschaftsgrad übergibt ;||-||-|
|Reference||- when the farmer gives over his farm to his family successor up to the third degree of relationship ,||-||-|
|Ite-5K||- if the livestock farmer hands over his holding to a successor up to the third degree of kinship ;||0.550||0.353|
|Ite-10K||- when the livestock farmer passes his holding to a successor up to the third degree of kinship ;||0.572||0.383|
|Ite-15K||- when the livestock farmer gives his holding to a successor up to the third degree of kinship ;||0.585||0.402|
|Source||folgerichtig sollte dies auch auf Antisubventionsuntersuchungen zutreffen .||-||-|
|Reference||the same principles should logically apply to anti - subsidy investigations .||-||-|
|Ite-5K||this should also be followed up by anti - subsidy investigations .||0.389||0.331|
|Ite-10K||it should also be folly to apply to anti - subsidy investigations .||0.403||0.486|
|Ite-15K||it should also be folly true to apply to anti - subsidy investigations .||0.397||0.447|
We compare our dynamic curriculum and weighting methods with three baselines: the iterative back-translation baseline, a baseline trained with only data selection strategies, and a baseline trained with only data weighting strategies. The results with the best-performing representativeness and simplicity metrics (TF-IDF and R-BLEU, respectively) in the domain adaptation setting are listed in Table 1.
The iterative back-translation method is rather competitive, as it improves over the unadapted baseline by 9.6 BLEU and simple back-translation by 1.8 BLEU points.
We can see from the table that the best-performing selection strategy, namely selecting sentences with high TF-IDF scores, is generally effective and can improve the baseline by about 0.5 BLEU points.
Curriculum and Weighting Strategies.
Both our curriculum and weighting strategies outperform the unadapted and the iterative back-translation models, with the curriculum learning method achieving better performance and improving the strong iterative back-translation baseline by 1.1 BLEU points. Combining curriculum and weighting methods can further improve the performance by up to 0.5 BLEU points, demonstrating the two strategies are complementary to each other.
| ||High (TF-IDF)||Low (TF-IDF)|
|High (R-BLEU)||Article 20||( 2005 / 686 / EC )|
|Low (R-BLEU)||any Contracting Party may request that a meeting be held .||MS Danuta HÜBNER|
4.3 Choices of Metrics
We examine different choices of representativeness and simplicity metrics. The performance of different models is listed in Table 2.
All data selection strategies outperform the baseline, with TF-IDF, LM-diff, and BERT metrics exhibiting fairly robust performance in all settings. Due to its simplicity, we choose TF-IDF for experiments where a good in-domain development set is available.
Data Weighting Strategies.
The agreement-based weighting method (“Agree”) performs slightly worse than the encoder-similarity weighting strategy (“Enc”), probably because the two languages are similar and thus encoders with shared parameters can accurately measure the data quality.
Our curriculum strategies need to consider both representativeness and simplicity metrics. Table 2 demonstrates that TF-IDF is a better metric than other representativeness metrics in both static and dynamic data selection settings. Also, the round-trip BLEU score can be better at measuring the simplicity of sentences than LM-gen. Last, by comparing the Moore-Lewis method (“LM-diff”) with our curriculum strategy (“LM-in+LM-gen”), the advantages of dynamic data selection are clear.
In this part, we investigate how noise in the back-translated data impacts the model performance and how many sentences we should select. We also qualitatively examine if our weighting methods assign weights appropriately.
Effect of Back-Translation Quality.
We generate the back-translated data using sampling, greedy search, and beam search for iterative back-translation, and the results are listed in Table 3. We find that the sampling method significantly degrades the model performance, as it introduces more noise than other approaches, demonstrating that noise can have a negative impact in domain adaptation settings. The conclusion is similar to the findings in low-resource settings (Edunov et al., 2018). In addition, we find that our weighting strategies are more beneficial in noisy settings.
Effect of the Selection Percentage.
We test how many sentences should be selected at each epoch for our curriculum strategies. As shown in Figure 3, selecting 30% of the monolingual sentences achieves the best performance in general. Selecting fewer samples can discard valuable information whereas choosing more instances can introduce more noise.
We use our model (Curri+Enc) to back-translate some sentences from the monolingual corpus and Table 4 shows the weights our models assign at different training stages. It is clear that the assigned weights correlate well with the BLEU scores, demonstrating our methods can perform weighting appropriately.
4.5 Characteristics of the Selected Data
In this part, we investigate certain characteristics of the selected samples.
Figure 4 shows the average lengths of the selected sentences in each bucket. We can see that 1) both LM-in and BERT favor long sentences, with one possible explanation being that those sentences are more likely to contain in-domain words; 2) TF-IDF does not share this feature, likely due to the IDF term; 3) sentences with high R-BLEU scores are generally short, which is unsurprising since NMT models are typically bad at translating long sentences.
Unigram Distribution Distance.
We also compute the unigram distribution distance using the Hellinger distance. Concretely, we compute the unigram distributions $P = (p_1, \dots, p_V)$ and $Q = (q_1, \dots, q_V)$ for the selected data and the test set, respectively, and calculate
$$d_H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i=1}^{V} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2},$$
where $V$ is the size of the vocabulary. The larger the Hellinger distance is, the more dissimilar the two distributions are. Figure 4 shows that both TF-IDF and BERT match the test distribution well. Also, LM-in performs better than LM-diff, which confirms our hypothesis that the data selected by the Moore-Lewis method cannot adequately represent the target distribution.
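For reference, the standard Hellinger distance between two unigram distributions can be computed as in the sketch below. Whitespace tokenization and maximum-likelihood (unsmoothed) unigram estimates are assumptions, and the function names are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Iterable


def unigram_distribution(sentences: Iterable[str]) -> Dict[str, float]:
    """Relative frequency of each whitespace-separated token."""
    counts = Counter(w for s in sentences for w in s.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def hellinger(p: Dict[str, float], q: Dict[str, float]) -> float:
    """Hellinger distance in [0, 1]; words missing from one side count as 0."""
    vocab = set(p) | set(q)
    squared = sum((math.sqrt(p.get(w, 0.0)) - math.sqrt(q.get(w, 0.0))) ** 2
                  for w in vocab)
    return math.sqrt(squared / 2.0)
```

The distance is 0 for identical distributions and 1 for distributions with no word in common, which makes values comparable across selection methods.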
Diversity Among Selected Data at Each Epoch.
As our curriculum strategies dynamically select different subsets of data, here we examine how many new sentences are actually introduced at each epoch. We find that starting from the second epoch, 12.5%, 10.4%, 12.5%, 18.3%, 21.5% of the selected sentences will be replaced at each epoch, and 52.5% of the monolingual sentences will be selected at least once in total.
Table 5 shows examples of the selected sentences. Sentences with both high TF-IDF and R-BLEU scores are typically short and match the target distribution well. Sentences with high TF-IDF but low R-BLEU scores can be long and contain some out-of-vocabulary words, while sentences with low TF-IDF but high R-BLEU scores are generally short and frequently include digits and single characters. Most of the sentences with both low TF-IDF and R-BLEU scores are extremely noisy and can be safely discarded.
5 Experiments on Low-Resource and High-Resource Scenarios
Next, we conduct experiments in both low- and high-resource scenarios over two language pairs: German-English and Lithuanian-English.
|low||WMT en-de (100K)||test2013 (3K)||test2014 (3K)||CC (1M)|
| || ||LAW (2K)||LAW (2K)||LAW (25K)|
| || ||MED (2K)||MED (2K)||MED (20K)|
|high||WMT en-de (4.5M)||test2013 (3K)||test2014 (3K)||CC (10M)|
| ||WMT en-lt (2M)||dev2019 (2K)||test2019 (1K)||News lt (5M) + CC en (5M)|
The results are reported in Table 6. We find that LM-in and LM-gen is the best metric combination for curriculum strategies when the target distribution is the news domain. TF-IDF and R-BLEU as the representativeness and simplicity metrics are the best in all other settings.
In low-resource settings, iterative back-translation can improve the baseline model by a large margin, and our curriculum strategies can still outperform the strong baseline by 1.3 BLEU points. Weighting methods also generally help and our best method outperforms iterative back-translation by 1.8 BLEU points.
In high-resource settings, our curriculum strategies improve the iterative back-translation baseline by up to 0.3 BLEU points. Data weighting strategies do not always help, probably because in high-resource settings the back-translated data is already of high quality. Our best method outperforms iterative back-translation by 0.6 BLEU points.
6 Related Work
Back-translation (Sennrich et al., 2016a) has proven to be successful, and several extensions of it have been proposed (He et al., 2016; Cheng et al., 2016; Xia et al., 2019), among which iterative back-translation methods (Cotterell and Kreutzer, 2018; Hoang et al., 2018; Niu et al., 2018) have demonstrated strong empirical performance.
For domain adaptation, Moore and Lewis (2010) and Axelrod et al. (2011) use language model cross-entropy differences to select data that are similar to in-domain text, an approach adapted to neural models by Duh et al. (2013). Similarly, Kirchhoff and Bilmes (2014) propose to use TF-IDF scores to select relevant samples for machine translation. van der Wees et al. (2017) propose dynamic data selection strategies for machine translation models, and Zhang et al. (2019) extend the idea to curriculum strategies. As for filtering noisy sentences, Junczys-Dowmunt (2018) proposes to utilize the agreement between forward and backward translation models, and Wang et al. (2019a) propose uncertainty-based confidence estimation to improve back-translation. Wang et al. (2019b) compose dynamic domain-data selection with dynamic clean-data selection. Our methods generalize the previous strategies, and our primary focus is to improve iterative back-translation.
7 Conclusion
In this paper, we provide a novel insight into a widely-used data selection method (Moore and Lewis, 2010) and generalize it to a curriculum strategy for iterative back-translation. We also propose data weighting methods. Extensive experiments are performed to evaluate the model performance. Analyses reveal the selected samples can represent the target domain well, and our weighting strategies benefit noisy settings the most. Future directions include experimenting on other datasets and language pairs, as well as developing better scoring criteria and score combination techniques.
References
- Axelrod et al. (2011). Domain adaptation via pseudo in-domain data selection. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Cheng et al. (2016). Semi-supervised learning for neural machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Cotterell and Kreutzer (2018). Explaining and generalizing back-translation through wake-sleep. arXiv.
- Devlin et al. (2019). BERT: pre-training of deep bidirectional transformers for language understanding. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Duh et al. (2013). Adaptation data selection using neural language models: experiments in machine translation. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Edunov et al. (2018). Understanding back-translation at scale. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- He et al. (2016). Dual learning for machine translation. In Conference on Neural Information Processing Systems (NeurIPS).
- Heafield (2011). KenLM: faster and smaller language model queries. In Conference on Machine Translation (WMT).
- Hoang et al. (2018). Iterative back-translation for neural machine translation. In Workshop on Neural Generation and Translation (WNGT).
- Hu et al. (2019). Domain adaptation of neural machine translation by lexicon induction. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Imankulova et al. (2017). Improving low-resource neural machine translation with filtered pseudo-parallel corpus. In Workshop on Asian Translation (WAT).
- Kirchhoff and Bilmes (2014). Submodularity for data selection in machine translation. In Conference on Empirical Methods in Natural Language Processing (EMNLP).
- Moore and Lewis (2010). Intelligent selection of language model training data. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Neubig et al. (2019). Compare-mt: a tool for holistic comparison of language generation systems. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL), Demo Track.
- Niu et al. (2018). Bi-directional neural machine translation with synthetic parallel data. In Workshop on Neural Generation and Translation (WNGT).
- Platanios et al. (2019). Competence-based curriculum learning for neural machine translation. In Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL).
- Sennrich et al. (2016a). Improving neural machine translation models with monolingual data. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Sennrich et al. (2016b). Neural machine translation of rare words with subword units. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Tiedemann (2012). Parallel data, tools and interfaces in OPUS. In Language Resources and Evaluation Conference (LREC).
- Vaswani et al. (2017). Attention is all you need. In Conference on Neural Information Processing Systems (NeurIPS).
- Xia et al. (2019). Generalized data augmentation for low-resource translation. In Annual Meeting of the Association for Computational Linguistics (ACL).
- Yang et al. (2018). Unsupervised neural machine translation with weight sharing. In Annual Meeting of the Association for Computational Linguistics (ACL).