Simulated Multiple Reference Training Improves Low-Resource Machine Translation

04/30/2020 · Huda Khayrallah, et al. · Johns Hopkins University

Many valid translations exist for a given sentence, and yet machine translation (MT) is trained with a single reference translation, exacerbating data sparsity in low-resource settings. We introduce a novel MT training method that approximates the full space of possible translations by (1) sampling a paraphrase of the reference sentence from a paraphraser, and (2) training the MT model to predict the paraphraser's distribution over possible tokens. With an English paraphraser, we demonstrate the effectiveness of our method in low-resource settings, with gains of 1.2 to 7 BLEU.

1 Introduction

Variability and expressiveness are core features of language, and they extend to translation as well. Dreyer and Marcu (2012) showed that naturally occurring sentences have billions of valid translations. Despite this variety, machine translation (MT) models are optimized toward a single translation of each sentence in the training corpus.

Training high-resource MT on millions of sentence pairs exposes the model to similar sentences translated in different ways, but training low-resource MT with a single translation for each sentence (out of potentially billions) exacerbates data sparsity. Despite active research in the area, low-resource settings remain a challenge for MT (Koehn and Knowles, 2017; Sennrich and Zhang, 2019).

A natural question is: To what extent does the discrepancy between linguistic diversity and standard single-reference training hinder MT performance? This was previously impractical to explore, since obtaining multiple human translations of training data is typically not feasible. However, recent neural sentential paraphrasers produce fluent, meaning-preserving English paraphrases (Hu et al., 2019c). We introduce a novel method that incorporates such a paraphraser directly in the training objective, and uses it to simulate the full space of translations.

We demonstrate the effectiveness of our method on two MATERIAL program low-resource datasets, and on publicly available data from GlobalVoices. We release data & code: data.statmt.org/smrt

Figure 1: Some possible paraphrases of ‘the turtle beat a hare’ including a sampled path and some of the other tokens also considered in the training objective

2 Method

We propose a novel training method that uses a paraphraser to approximate the full space of possible translations, since explicitly training on billions of possible translations per sentence is intractable.

In standard neural MT training, the reference is: (1) used in the training objective; and (2) conditioned on as the previous target token. (In autoregressive NMT inference, each prediction conditions on the previously generated target tokens. In training, predictions typically condition on the previous tokens in the reference, not the model's output, i.e., teacher forcing; Williams and Zipser, 1989.)

We approximate the full space of possible translations by: (1) training the MT model to predict the distribution of possible tokens from the paraphraser at each time step; and (2) sampling the previous target token from the paraphraser distribution. Figure 1 shows an example of possible paraphrases and highlights a sampled path and some of the other tokens used in the training objective distribution.

We review the standard training objective, and then introduce our proposed objective.

NLL Objective

The standard negative log likelihood (NLL) training objective in NMT, for the target word $y_t$ in the reference $y$, is:

$$\mathcal{L}_{\mathrm{NLL}} = -\sum_{v \in V} \mathbb{1}\{v = y_t\}\, \log p_{\mathrm{MT}}(y_t = v \mid x, y_{<t}) \qquad (1)$$

where $V$ is the vocabulary, $\mathbb{1}$ is the indicator function, and $p_{\mathrm{MT}}$ is the MT output distribution (conditioned on the source $x$ and on the previous tokens in the reference, $y_{<t}$). Equation 1 computes the cross-entropy between the MT model's distribution and the one-hot human reference.
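To make the notation concrete, the following is a minimal PyTorch sketch of Equation 1 for a batch of teacher-forced decoder outputs; the tensor names (`mt_logits`, `reference_ids`) and the padding convention are our own illustration, not the paper's code.

```python
import torch
import torch.nn.functional as F

def nll_loss(mt_logits: torch.Tensor, reference_ids: torch.Tensor, pad_id: int = 1) -> torch.Tensor:
    """Equation 1: cross-entropy between the MT model's distribution at each
    time step and the one-hot human reference (teacher-forced on y_{<t}).

    mt_logits:     (batch, time, vocab) decoder output scores.
    reference_ids: (batch, time) reference token ids y_t.
    pad_id:        padding index (an assumption; adjust to your vocabulary).
    """
    log_probs = F.log_softmax(mt_logits, dim=-1)
    token_nll = -log_probs.gather(-1, reference_ids.unsqueeze(-1)).squeeze(-1)
    mask = reference_ids.ne(pad_id).float()   # ignore padding positions
    return (token_nll * mask).sum() / mask.sum()
```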

Proposed Objective

We compute the cross-entropy between the distribution of the MT model and the distribution from a paraphraser conditioned on the reference (note the paraphraser parameters are not modified when training the MT model):

$$\mathcal{L}_{\mathrm{SMRT}} = -\sum_{v \in V} p_{\mathrm{para}}(\tilde{y}_t = v \mid y, \tilde{y}_{<t})\, \log p_{\mathrm{MT}}(\tilde{y}_t = v \mid x, \tilde{y}_{<t}) \qquad (2)$$

where $y$ is the single human reference, and $\tilde{y}$ is the paraphrase of that reference. $p_{\mathrm{para}}$ is the output distribution from the paraphraser (conditioned on the single human reference $y$ and the previous tokens in the sentence produced by the paraphraser, $\tilde{y}_{<t}$). $p_{\mathrm{MT}}$ is the MT output distribution (conditioned on the source sentence $x$ and the previous tokens in the sentence produced by the paraphraser, $\tilde{y}_{<t}$). At each time step we sample a target token $\tilde{y}_t$ from the paraphraser's output distribution to ensure coverage of the full space of translations (Graves (2013) introduced sampling in sequence-to-sequence models for variety in handwriting generation; we resample every time a sentence is observed in training). We condition on this sampled $\tilde{y}_t$ as the previous target token for both the MT model and the paraphraser.
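As a minimal sketch of one time step of this objective, the code below assumes hypothetical `paraphraser` and `mt_model` callables that map an input and a target prefix to logits of shape (batch, time, vocab); this interface and the tensor names are assumptions for illustration, not fairseq's API or the released code.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def paraphraser_step(paraphraser, reference, para_prefix):
    """Frozen paraphraser: returns its distribution over the next paraphrase
    token and a token sampled from that distribution (used as the next
    prefix token for both models)."""
    logits = paraphraser(reference, para_prefix)[:, -1, :]    # (B, V)
    probs = F.softmax(logits, dim=-1)
    sampled = torch.multinomial(probs, num_samples=1)         # (B, 1)
    return probs, sampled

def smrt_step_loss(mt_model, source, para_prefix, para_probs):
    """Equation 2 at one time step: cross-entropy between the paraphraser's
    distribution and the MT model's distribution, both conditioned on the
    sampled paraphrase prefix."""
    mt_logits = mt_model(source, para_prefix)[:, -1, :]       # (B, V)
    mt_log_probs = F.log_softmax(mt_logits, dim=-1)
    return -(para_probs * mt_log_probs).sum(dim=-1).mean()
```

During training, the sampled token returned by `paraphraser_step` is appended to `para_prefix` before the next step, so that both models condition on the sampled paraphrase prefix as described above.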

dataset     |                      GlobalVoices                           | MATERIAL
* -> en     | hu    id    cs    sr    ca    sw    nl    pl    mk    ar    | sw    tl
train lines | 8k    8k    11k   14k   15k   24k   32k   40k   44k   47k   | 19k   46k
baseline    | 2.3   5.3   3.4   11.8  16.0  17.9  22.2  16.0  27.0  12.7  | 37.8  32.5
this work   | 5.4   12.3  6.6   16.1  20.0  20.5  24.8  18.0  28.2  14.9  | 39.0  33.7
Δ           | +3.1  +7.0  +3.2  +4.3  +4.0  +2.6  +2.6  +2.0  +1.2  +2.2  | +1.2  +1.2

Table 1: Test set results translating to English. 'train lines' indicates the amount of training bitext. We bold the best value; all improvements are statistically significant at the 95% confidence level.

3 Experimental Setup

3.1 Paraphraser

For our paraphraser we train a Transformer model (Vaswani et al., 2017) in fairseq (Ott et al., 2019) with an 8-layer encoder and decoder, and we optimize using Adam (Kingma and Ba, 2014). We train on ParaBank2 (Hu et al., 2019c), an English paraphrase dataset. (ParaBank2 also released a trained Sockeye paraphrase model, but since we use fairseq we retrain the paraphraser.) ParaBank2 was generated by training an MT system on CzEng 1.7, a large Czech-English bitext (Bojar et al., 2016), re-translating the Czech training sentences, and pairing the English output with the human English translations. Many candidate translations were generated for each sentence, and high-quality, diverse paraphrases were selected.

3.2 NMT models

For both the baseline and our method, we train Transformer models in fairseq using the hyperparameters from the flores low-resource benchmark (Guzmán et al., 2019). We regularize with label smoothing and dropout, and we optimize using Adam. We train for a maximum of 200 epochs, selecting the model from the saved checkpoints based on validation set perplexity, and we translate with beam search.

For our method, each training sentence is trained with the proposed objective (Equation 2) with some probability, and with standard NLL training (Equation 1) on the original reference otherwise. When sampling from the paraphraser distribution, we sample from only the 100 highest-probability vocabulary items at a given time step, to avoid very unlikely tokens (Fan et al., 2018).
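As an illustration of the top-100 restriction, the sketch below (our own, with assumed tensor shapes) samples a token from only the k highest-probability entries of a paraphraser logit matrix:

```python
import torch
import torch.nn.functional as F

def sample_top_k(logits: torch.Tensor, k: int = 100) -> torch.Tensor:
    """Sample one token id per row from the k most probable entries of a
    (batch, vocab) logit matrix, renormalizing over those k entries."""
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(k, dim=-1)              # (B, k)
    choice = torch.multinomial(topk_probs, num_samples=1)     # index into top-k
    return topk_ids.gather(-1, choice).squeeze(-1)            # (B,) vocab ids

# Toy usage: restrict sampling to the 100 most likely of a 4,000-item vocabulary.
print(sample_top_k(torch.randn(2, 4000), k=100))
```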

Using our English paraphraser, we aim to demonstrate improvements in low-resource settings. We use Tagalog (tl) to English and Swahili (sw) to English bitext from the MATERIAL low-resource program (Rubino, 2018). We also report results on public data, using MT bitext from GlobalVoices, a non-profit news site that publishes in 53 languages. (We use v2017q3 released on Opus: opus.nlpl.eu/GlobalVoices.php. Not all 53 languages have MT bitext.) We evaluate on the 10 lowest-resource settings that have at least 10,000 lines of parallel text with English: Hungarian (hu), Indonesian (id), Czech (cs), Serbian (sr), Catalan (ca), Swahili (sw), Dutch (nl), Polish (pl), Macedonian (mk), and Arabic (ar). (Swahili appears in both datasets; MATERIAL data is not widely available, so we keep it separate to keep the GlobalVoices results reproducible.)

We use 2,000 lines each for a validation set (for model selection from checkpoints) and a test set (for reporting results). The approximate number of lines of training data is shown in Table 1.

We train an English SentencePiece model (Kudo and Richardson, 2018) on the paraphraser data, and apply it to the target (English) side of the MT bitext, so that the paraphraser and MT models have the same output vocabulary. We also train SentencePiece models on the source side of the bitexts. We use a subword vocabulary size of 4,000 for each.
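A sketch of this preprocessing with the SentencePiece Python API is below; the file names are placeholders, not the released data paths.

```python
import sentencepiece as spm

# Train a 4k-subword English model on the paraphraser training text, and a
# separate 4k-subword model on the source side of the bitext.
spm.SentencePieceTrainer.train(
    input="parabank2.en.txt", model_prefix="spm_en", vocab_size=4000)
spm.SentencePieceTrainer.train(
    input="train.src.txt", model_prefix="spm_src", vocab_size=4000)

# Apply the shared English model to the target (English) side of the MT bitext
# so the paraphraser and the MT model share an output vocabulary.
sp_en = spm.SentencePieceProcessor(model_file="spm_en.model")
with open("train.en.txt") as fin, open("train.en.sp", "w") as fout:
    for line in fin:
        pieces = sp_en.encode(line.strip(), out_type=str)
        fout.write(" ".join(pieces) + "\n")
```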

4 Results

Results are shown in Table 1. We improve over the baseline in all settings, by 1.2 to 7.0 BLEU (all improvements are statistically significant at the 95% confidence level; Koehn, 2004). All BLEU scores are computed with SacreBLEU (Post, 2018). We see more pronounced improvements in the lower-resource settings. We acknowledge that our three lowest-resource baselines (hu-en, id-en, cs-en) have very low BLEU scores and indicate very poor translations, and even our large improvements may not be enough to make those systems practically usable. However, based on manual inspection, the improvement from 5.3 to 12.3 for id-en makes that system useful for gisting.

5 Analysis

In this section, we analyze our method to explore: (1) How it performs at a variety of resource levels; and (2) How it compares to the popular data augmentation method of back-translation.

dataset                  |                      GlobalVoices                           | MATERIAL
* -> en                  | hu    id    cs    sr    ca    sw    nl    pl    mk    ar    | sw    tl
train lines              | 8k    8k    11k   14k   15k   24k   32k   40k   44k   47k   | 19k   46k
baseline                 | 2.3   5.3   3.4   11.8  16.0  17.9  22.2  16.0  27.0  12.7  | 37.8  32.5
baseline w/ back-trans.  | 2.8   7.1   4.6   17.6  20.1  20.7  26.9  19.3  29.1  16.0  | 38.8  33.0
this work                | 5.4   12.3  6.6   16.1  20.0  20.5  24.8  18.0  28.2  14.9  | 39.0  33.7
this work w/ back-trans. | 4.9   12.8  6.6   19.6  23.4  23.0  27.5  20.2  29.7  16.8  | 39.3  33.7

Table 2: Comparison between back-translation and this work on the test set. We bold the best value as well as any result where the difference from it is not statistically significant at the 95% confidence level.

5.1 MT Data Ablation

In order to better understand how this method performs across data sizes on the same dataset, we ablate the Bengali-English bitext from GlobalVoices. (We choose bn-en because it is relatively large while still being a dissimilar language pair; ablating French-English, another similarly sized option from GlobalVoices, would not reflect typical low-resource MT performance.) After reserving validation and test sets (as in § 3.2), approximately 132k lines are left for training; we ablate this to 100k, 50k, 25k, and 15k lines.

Figure 2 plots the performance of our method and the baseline against the log of the data amount. Our improvements of 2.7, 3.7, 1.6, and 0.8 BLEU at the 15k, 25k, 50k, and 100k subsets are statistically significant at the 95% confidence level; the 0.1 improvement for the full 132k data amount is not. Similar to Table 1, we see more pronounced improvements in lower-resource ablations.

Neural paraphrasers are rapidly improving in both adequacy and diversity (Wieting et al., 2017, 2019b; Li et al., 2018; Wieting and Gimpel, 2018; Hu et al., 2019a,b,c); as they continue to improve, our method will likely provide larger improvements across the board, including for higher-resource MT.

Figure 2: Bengali-English data ablation.

5.2 Back-translation

Back-translation (Sennrich et al., 2016) is the most common method for incorporating non-parallel data in NMT. We investigate how our method interacts with it. Table 2 shows the results for back-translation, our work, and the combination of back-translation and our work. (We use a 1:1 ratio of bitext to back-translated bitext, with newscrawl2016 (data.statmt.org/news-crawl) as our monolingual text; when combining with our work, we run our method on both the original and the back-translated data.) Adding our method to the strong data augmentation baseline of back-translation improves performance by 0.5 to 5.7 BLEU over back-translation alone (all statistically significant at the 95% confidence level).

For all our settings, the best performance comes either from our method combined with back-translation or from our method alone. In the lowest-resource setting (hu-en), our method alone outperforms the baseline by 3.1 BLEU, but adding back-translation reduces the improvement by 0.5 BLEU. For cs-en and tl-en, adding back-translation to our method does not change performance. In the remaining 9 (of 12) settings, back-translation and our proposed method are complementary, and we see improvements of 1.2 to 7.8 BLEU over the baseline when combining the two (all statistically significant at the 95% confidence level).

6 Related Work

Knowledge Distillation

Our proposed objective is similarly structured to word-level knowledge distillation (KD; Hinton et al., 2015; Kim and Rush, 2016), where a student model is trained to match the output distribution of a teacher model. In KD both models are translation models trained on the same data, have the same input and output languages, and use the human reference as the previous token. In contrast, we train toward the distribution of the paraphraser, which takes as input the human reference sentence (in the target language), with the sampled paraphrase as the previous token. KD is usually used to train smaller models and does not incorporate additional data sources, like we do.

Integrating Paraphrases in MT

Hu et al. (2019a) present case studies on paraphrasing as data augmentation for NLP tasks, including an appendix on NMT, where they show small gains. They generate paraphrases as an offline preprocessing step using heuristic constraints on the model's output, and train on the synthetic and original data; they also find it necessary to fine-tune on only the original data. Our work differs in that we train toward the paraphraser distribution, and we sample from the distribution rather than using heuristics.

Wieting et al. (2019a) used a paraphrase-similarity metric for minimum risk training (MRT; Shen et al., 2016) in NMT. They note MRT is expensive and, following prior work, use it only for fine-tuning after maximum likelihood training. While our method is slower than standard NLL training, this is not prohibitive in low-resource settings.

Paraphrasing was explored in the context of statistical machine translation (SMT) too. Callison-Burch et al. (2006) and Marton et al. (2009) used paraphrases to augment the phrase table directly, focusing on source-side paraphrasing to improve test set coverage. Madnani et al. (2007, 2008) used a coverage-focused paraphrasing technique to augment the set of references used during SMT tuning.

Data Augmentation in NMT

Back-translation (BT) translates target-language monolingual text to create synthetic source sentences (Sennrich et al., 2016). BT needs a reverse translation model for each language pair; in contrast, our work needs only a paraphraser for each target language. Zhou et al. (2019) found that BT is harmful in some low-resource language pairs, whereas a strong paraphraser can be trained as long as the target language is sufficiently high-resource.

Fadaee et al. (2017) insert low-frequency words into novel contexts in the existing bitext, using automatic word alignment and a language model. RAML (Norouzi et al., 2016) and SwitchOut (Wang et al., 2018) randomly replace words with another word from the vocabulary. In contrast to random or targeted word replacement, we generate semantically similar sentential paraphrases. Label smoothing (which we also use in training) spreads probability mass over all non-reference tokens equally (Szegedy et al., 2016); in contrast, the paraphraser places more mass on semantically plausible tokens.
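As a point of comparison, one standard formulation of the label-smoothed target distribution (smoothing weight $\epsilon$ over vocabulary $V$) is uniform over alternatives, whereas the target distribution in Equation 2 is the paraphraser's own:

$$q_{\mathrm{LS}}(v) = (1-\epsilon)\,\mathbb{1}\{v = y_t\} + \frac{\epsilon}{|V|} \qquad \text{vs.} \qquad q_{\mathrm{para}}(v) = p_{\mathrm{para}}(\tilde{y}_t = v \mid y, \tilde{y}_{<t}).$$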

7 Conclusion

In this work we find that our novel method for simulating multiple references during MT training leads to significantly improved performance in low-resource settings, with gains of 1.2 to 7 BLEU.

References