Facebook FAIR's WMT19 News Translation Task Submission

Nathan Ng et al. · 07/15/2019

This paper describes Facebook FAIR's submission to the WMT19 shared news translation task. We participate in two language pairs and four language directions, English↔German and English↔Russian. Following our submission from last year, our baseline systems are large BPE-based Transformer models trained with the fairseq sequence modeling toolkit, relying on sampled back-translations. This year we experiment with different bitext data filtering schemes as well as with adding filtered back-translated data. We also ensemble and fine-tune our models on domain-specific data, then decode using noisy channel model reranking. Our submissions are ranked first in all four directions of the human evaluation campaign. On En→De, our system significantly outperforms other systems as well as human translations, improving on our WMT'18 submission by 4.5 BLEU points.


1 Introduction

We participate in the WMT19 shared news translation task in two language pairs and four language directions: English→German (En→De), German→English (De→En), English→Russian (En→Ru), and Russian→English (Ru→En). Our methods build on the techniques and approaches of our submission from last year (Edunov et al., 2018), including subword models (Sennrich et al., 2016), large-scale back-translation, and model ensembling. We train all models using the fairseq sequence modeling toolkit (Ott et al., 2019). Although document-level context is now available for English-German, all our systems are pure sentence-level systems. In the future, we expect better results from leveraging this additional context information.

Compared to our WMT18 submission, we additionally compete in the En↔Ru and De→En translation directions. Although all four directions are considered high-resource settings where large amounts of bitext data are available, we demonstrate that leveraging high-quality monolingual data through back-translation remains very important. For all language directions, we back-translate the Newscrawl dataset using a reverse-direction bitext system. In addition to back-translating the relatively clean Newscrawl dataset, we also experiment with back-translating portions of the much larger and noisier Commoncrawl dataset. For our final models, we apply a domain-specific fine-tuning process and decode using noisy channel model reranking (Anonymous, 2019).

Compared to our WMT18 submission in the En→De direction, we observe a substantial improvement of 4.5 BLEU. Some of these gains can be attributed to differences in dataset quality, but we believe most of the improvement comes from larger models, larger-scale back-translation, and noisy channel model reranking with strong channel and language models.

2 Data

For the En-De language pair we use all available bitext data, including the bicleaner version of Paracrawl. For monolingual data we use English and German Newscrawl. Although our language models were trained on document-level data, we did not use document boundaries in our final decoding step, so all our systems are purely sentence-level systems.

For the En-Ru language pair we also use all available bitext data. For monolingual data we use English and Russian Newscrawl as well as a filtered portion of Russian Commoncrawl. We use Russian Commoncrawl to augment our monolingual data because Russian Newscrawl is relatively small compared to English and German.

2.1 Data Preprocessing

As in last year's En→De submission, we normalize punctuation and tokenize all data with the Moses tokenizer (Koehn et al., 2007). For En-De we use a joint byte-pair encoding (BPE) with 32K split operations for subword segmentation (Sennrich et al., 2016). For En-Ru, we learn separate BPE encodings with 24K split operations for each language; systems trained with separate BPE encodings performed significantly better than those trained with a joint BPE.
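As a concrete illustration of this step, here is a minimal sketch using the subword-nmt package, one common BPE implementation; the paper does not name its exact tooling, and the file paths are placeholders.

```python
# Hypothetical sketch of learning/applying BPE with subword-nmt;
# file names are placeholders, and the authors' exact tooling is unspecified.
import codecs
from subword_nmt import learn_bpe, apply_bpe

# Learn a joint BPE model with 32K merge operations on concatenated,
# tokenized English+German text (En-De; En-Ru instead uses separate
# 24K models learned per language).
with codecs.open("train.ende.tok", encoding="utf-8") as infile, \
     codecs.open("bpe.codes", "w", encoding="utf-8") as outfile:
    learn_bpe.learn_bpe(infile, outfile, num_symbols=32000)

# Apply the learned merge operations to segment sentences into subwords.
with codecs.open("bpe.codes", encoding="utf-8") as codes:
    bpe = apply_bpe.BPE(codes)
print(bpe.process_line("an uncommonly compounded word"))
```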

2.2 Data Filtering

2.2.1 Bitext

Large datasets crawled from the internet are naturally very noisy and can potentially decrease the performance of a system if they are used in their raw form. Cleaning these datasets is an important step to achieving good performance on any downstream tasks.

We apply language identification filtering with langid (Lui and Baldwin, 2012), keeping only sentence pairs for which both sides are identified as the correct language. Although langid is not the most accurate language identification method (Joulin et al., 2016), a useful side effect of using it is the removal of very noisy sentences consisting mostly of garbage tokens, which are classified incorrectly and filtered out.

We also remove sentences longer than 250 tokens as well as sentence pairs with a source/target length ratio exceeding 1.5. In total, we filter out about 30% of the original bitext data. See Table 1 for details on the bitext dataset sizes.
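The filtering rules above can be sketched in a few lines; this assumes the langid Python package and whitespace-tokenized input, and is only an illustration of the stated heuristics.

```python
# Sketch of the bitext filtering: langid checks plus length heuristics.
import langid

MAX_LEN = 250    # maximum sentence length in tokens
MAX_RATIO = 1.5  # maximum source/target length ratio

def keep_pair(src, tgt, src_lang="en", tgt_lang="de"):
    src_toks, tgt_toks = src.split(), tgt.split()
    if not src_toks or not tgt_toks:
        return False
    if len(src_toks) > MAX_LEN or len(tgt_toks) > MAX_LEN:
        return False
    ratio = len(src_toks) / len(tgt_toks)
    if ratio > MAX_RATIO or 1.0 / ratio > MAX_RATIO:
        return False
    # langid.classify returns (language, score); keep the pair only if
    # both sides are identified as the expected languages.
    return (langid.classify(src)[0] == src_lang
            and langid.classify(tgt)[0] == tgt_lang)

print(keep_pair("The cat sat on the mat .", "Die Katze saß auf der Matte ."))
```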

2.2.2 Monolingual

For the monolingual Newscrawl data we also apply langid filtering. Since the monolingual Newscrawl corpus for Russian is significantly smaller than that of German or English, we augment our monolingual Russian data with data from the Commoncrawl corpus. Commoncrawl is the largest monolingual corpus available for training but is also very noisy. In order to select a limited amount of high-quality, in-domain sentences from this larger corpus, we adopt the method of Moore and Lewis (2010) for selecting in-domain data (§3.2.1).

                  En-De   En-Ru
No filter         38.8M   38.5M
+ length filter   35.7M   33.4M
+ langid filter   27.7M   26.0M

Table 1: Number of sentences in the bitext datasets under successive filtering schemes.

3 System Overview

3.1 Base System

Our base system uses the big Transformer architecture (Vaswani et al., 2017) as implemented in fairseq. We experiment with increasing network capacity by increasing the embedding dimension, FFN size, number of heads, and number of layers. We find that using a larger FFN size (8192) gives a reasonable improvement in performance while maintaining a manageable network size. All subsequent models, including ensembles, use this larger-FFN Transformer architecture.

We trained all our models using fairseq (Ott et al., 2019) on 128 Volta GPUs, following the setup described in Ott et al. (2018).
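For illustration, the shape of this architecture can be sketched with stock PyTorch; this is not fairseq's implementation, only the same dimension settings.

```python
# Illustrative only: "big" Transformer dimensions with the enlarged FFN.
import torch.nn as nn

model = nn.Transformer(
    d_model=1024,          # embedding dimension of the big Transformer
    nhead=16,              # attention heads in the big configuration
    num_encoder_layers=6,
    num_decoder_layers=6,
    dim_feedforward=8192,  # enlarged FFN size used here (vs. the usual 4096)
)
print(sum(p.numel() for p in model.parameters()))  # rough capacity check
```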

3.2 Large-scale Back-translation

Back-translation is an effective and commonly used data augmentation technique to incorporate monolingual data into a translation system. Back-translation first trains an intermediate target-to-source system that is used to translate monolingual target data into additional synthetic parallel data. This data is used in conjunction with human translated bitext data to train the desired source-to-target system.

In this work we used back-translations obtained by sampling (Edunov et al., 2018) from an ensemble of three target-to-source models. We found that models trained on data back-translated by an ensemble rather than a single model performed better (Table 2). Previous work also found that upsampling the bitext data can improve back-translation (Edunov et al., 2018). We adopt this method to tune the proportion of bitext and synthetic data the model is trained on, and find that a 1:1 ratio of synthetic to bitext data performs best.
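The sampling-based generation can be sketched as follows; `next_token_probs` is a hypothetical ensemble interface (the actual pipeline uses fairseq's generation tooling), so this only illustrates sampling from an averaged ensemble distribution rather than taking the argmax.

```python
# Sketch of sampled back-translation from an ensemble of reverse models.
import torch

def sample_backtranslation(reverse_models, tgt_tokens, max_len=250, bos=2, eos=3):
    """Draw one synthetic source sentence for a monolingual target sentence."""
    out = [bos]
    for _ in range(max_len):
        # Average the next-token distributions of the ensemble members;
        # next_token_probs is a hypothetical per-model scoring method.
        probs = torch.stack(
            [m.next_token_probs(tgt_tokens, out) for m in reverse_models]
        ).mean(dim=0)
        tok = torch.multinomial(probs, 1).item()  # sample instead of argmax
        out.append(tok)
        if tok == eos:
            break
    return out  # (out, tgt_tokens) becomes a synthetic training pair
```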

              En→Ru
              Single Model   Ensemble
newstest15    35.98          36.32
newstest16    32.78          33.28
newstest17    36.57          36.77
newstest18    34.72          34.72

Table 2: SacreBLEU for English→Russian models trained with data back-translated using a single model vs. an ensemble of two models.
                  En     De     Ru
Newscrawl         434M   559M   80M
+ langid filter   424M   521M   76M
Commoncrawl       -      -      1.2B
+ KenLM filter    -      -      60M
Total             424M   521M   136M

Table 3: Number of sentences in the monolingual datasets available for back-translation.

3.2.1 Back-translating Commoncrawl

The amount of monolingual Russian data available in the Newscrawl dataset is significantly smaller than that of English and German (Table 3). In order to increase the amount of monolingual Russian data for back-translation, we experiment with incorporating Commoncrawl data. Commoncrawl is a much larger and noisier dataset than Newscrawl, and is not domain-specific. We therefore experiment with methods to identify the subset of Commoncrawl that is most similar to Newscrawl; specifically, we use the in-domain data selection method of Moore and Lewis (2010).

Given an in-domain corpus $I$, in this case Newscrawl, and a non-domain-specific corpus $N$, in this case Commoncrawl, we would like to find the subcorpus $N_I \subset N$ that is drawn from the same distribution as $I$. Using Bayes' rule, we can calculate the probability that a sentence $s$ in $N$ is drawn from $N_I$:

$$P(N_I \mid s, N) = \frac{P(s \mid N_I, N)\,P(N_I \mid N)}{P(s \mid N)} \qquad (1)$$

We ignore the $P(N_I \mid N)$ term, since it is constant for any given $N$ and $I$, and use $P(s \mid I)$ in place of $P(s \mid N_I, N)$, since $N_I$ and $I$ are drawn from the same distribution. Moving into the log domain, we can score a sentence by $\log P(s \mid I) - \log P(s \mid N)$, or, after normalizing for length, by $H_N(s) - H_I(s)$, where $H_I(s)$ and $H_N(s)$ are the word-normalized cross-entropy scores of a sentence according to language models $LM_I$ and $LM_N$ trained on $I$ and $N$ respectively.

Our corpora are very large, so we use an n-gram language model (Heafield, 2011) rather than a neural language model, which would be much slower to train and evaluate. We train the two language models $LM_I$ and $LM_N$ on Newscrawl and Commoncrawl respectively, then score every sentence in Commoncrawl by $H_N(s) - H_I(s)$. We select a cutoff score and keep all sentences scoring above it for back-translation, about 5% of the entire dataset.
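A minimal sketch of this scoring with the KenLM Python bindings; the .arpa paths are placeholders for the two trained models.

```python
# Moore-Lewis domain score H_N(s) - H_I(s) using KenLM.
import kenlm

lm_in = kenlm.Model("newscrawl.ru.arpa")     # LM_I, in-domain Newscrawl LM
lm_out = kenlm.Model("commoncrawl.ru.arpa")  # LM_N, general Commoncrawl LM

def domain_score(sentence):
    n = len(sentence.split()) + 1  # +1 for the end-of-sentence token
    # kenlm's score() returns a log10 probability; negating and dividing
    # by length gives the word-normalized cross-entropies H_I(s), H_N(s).
    h_i = -lm_in.score(sentence) / n
    h_n = -lm_out.score(sentence) / n
    return h_n - h_i  # higher means more Newscrawl-like

# Sentences whose score clears the chosen cutoff (~top 5%) are kept.
```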

3.3 Fine-tuning

Fine-tuning with domain-specific data is a common and effective method to improve translation quality for a downstream task. After completing training on the bitext and back-translated data, we train for one additional epoch on a smaller in-domain corpus. For De→En, we fine-tune on test sets from previous years, including newstest2012, newstest2013, newstest2015, and newstest2017. For En→De, we fine-tune on previous test sets as well as the News-Commentary dataset. For En↔Ru, we fine-tune on a combination of News-Commentary, newstest2013, newstest2015, and newstest2017. The remaining test sets are held out for other tuning procedures and for evaluation.
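The fine-tuning step amounts to one extra epoch of ordinary training on the in-domain corpus; a generic sketch follows, in which `model`, `in_domain_loader`, and the learning rate are placeholders rather than the actual fairseq configuration.

```python
# Generic one-epoch fine-tuning sketch (placeholders, not the fairseq setup).
import torch
import torch.nn.functional as F

def finetune_one_epoch(model, in_domain_loader, lr=1e-5):
    model.train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # small LR, assumed
    for src, tgt_in, tgt_out in in_domain_loader:
        opt.zero_grad()
        logits = model(src, tgt_in)  # teacher-forced forward pass
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)),
                               tgt_out.view(-1))
        loss.backward()
        opt.step()
```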

3.4 Noisy Channel Model Reranking

              En→De   De→En   En→Ru   Ru→En
newstest12    26.7    28.0    -       -
newstest13    27.8    27.6    42.7    27.6
newstest14    21.4    24.0    32.3    22.4
newstest15    25.1    24.6    34.7    21.8
newstest16    24.5    22.0    35.5    19.4
newstest17    25.0    21.9    37.9    19.5
newstest18    25.1    26.0    39.3    20.0

Table 4: Perplexity of the target-language language model used for each translation direction, evaluated on the target side of the news test sets.

n-best reranking is a method of improving translation quality by scoring and selecting a candidate hypothesis from a list of n-best hypotheses generated by a source-to-target, or forward, model. For our submissions, we rerank using a noisy channel model approach.

Given a source sequence $x$ and a target sequence $y$, the noisy channel approach applies Bayes' rule to model

$$P(y \mid x) = \frac{P(x \mid y)\,P(y)}{P(x)} \qquad (2)$$

Since $P(x)$ is constant for a given source sequence $x$, we can ignore it. We refer to $P(y \mid x)$, $P(x \mid y)$, and $P(y)$ as the forward model, channel model, and language model respectively. In order to combine these scores for reranking, we calculate, for every one of our n-best hypotheses,

$$\log P(y \mid x) + \lambda_1 \log P(x \mid y) + \lambda_2 \log P(y) \qquad (3)$$

The weights $\lambda_1$ and $\lambda_2$ are determined by random search on a validation set, selecting the values that give the best performance. In addition, we also tune a length penalty.

For all translation directions, our forward models are ensembles of fine-tuned, back-translation-trained models. Since we compete in both directions of both language pairs, for any given translation direction we can use the forward model of the reverse direction as the channel model. Our language models for each of the target languages (English, German, and Russian) are big Transformer decoder models with an FFN size of 8192. We train the language models on the monolingual Newscrawl dataset, using document-level context for the English and German models. Perplexity scores of each direction's target-language model are shown in Table 4. With less monolingual Russian data available, our Russian language model performs worse than the German and English ones.

To select the length penalty and the weights $\lambda_1$ and $\lambda_2$ for decoding, we use random search, drawing candidate values from fixed ranges for the weights and for the length penalty. For all language directions, we choose the configuration that gives the highest BLEU score on a combined dataset of newstest2014 and newstest2016.

To run our final decoding step, we first use the forward model to generate an n-best list for each source sentence. We then score each hypothesis with the channel and language models, using the previously tuned weights and length penalty, and select the highest-scoring hypothesis as our output.
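Putting Eq. (3) together with the tuned weights, the reranking step reduces to a weighted sum of three log-probabilities plus a length penalty; in this sketch the three scoring functions are placeholders for the forward, channel, and language models.

```python
# Noisy channel reranking of an n-best list (Eq. 3 plus a length penalty).
def rerank(src, hyps, fwd_logp, ch_logp, lm_logp, lam1, lam2, len_pen):
    def score(y):
        return (fwd_logp(y, src)             # log P(y|x), forward model
                + lam1 * ch_logp(src, y)     # lambda_1 * log P(x|y), channel
                + lam2 * lm_logp(y)          # lambda_2 * log P(y), LM
                + len_pen * len(y.split()))  # tuned length penalty
    return max(hyps, key=score)

# lam1, lam2, and len_pen come from the random search described above,
# keeping the setting with the best validation BLEU.
```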

3.5 Postprocessing

For En→De and En→Ru, we also change the standard English quotation marks (“ … ”) to German-style quotation marks („ … “).
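This postprocessing is a simple character substitution; a one-function sketch:

```python
# Replace English quotation marks with German-style ones.
def germanize_quotes(line):
    # U+201C/U+201D (English opening/closing) -> U+201E/U+201C (German).
    # Order matters: map the opening mark first so the second replacement
    # does not touch characters produced by the first.
    return line.replace("\u201c", "\u201e").replace("\u201d", "\u201c")

print(germanize_quotes("\u201cHallo\u201d"))  # -> „Hallo“
```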

4 Results

Results and ablations are shown in Table 5 for En→De, Table 6 for De→En, Table 7 for En→Ru, and Table 8 for Ru→En. We report case-sensitive SacreBLEU scores (Post, 2018), using international tokenization for En↔Ru. The SacreBLEU signatures are BLEU+case.mixed+lang.{en-de,de-en,ru-en}+numrefs.1+smooth.exp+test.wmt{17/18}+tok.13a+version.1.2.11 and BLEU+case.mixed+lang.en-ru+numrefs.1+smooth.exp+test.wmt{17/18}+tok.intl+version.1.2.11. In the final row of each table we also report the case-sensitive BLEU score of our submitted system on this year's test set. All single models, and the individual models within ensembles, are averages of the final training checkpoints. Our baseline systems are big Transformers as described in Vaswani et al. (2017), trained on minimally filtered data: we remove only sentences longer than 250 words or with a source/target length ratio exceeding 1.5. This setup gives a reasonable baseline against which to evaluate data filtering.
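For reference, scores in this style can be reproduced with sacrebleu's Python API; this toy example only shows the call shape, not the actual test data.

```python
# Toy SacreBLEU computation; real scores use the WMT test sets above.
import sacrebleu

hyps = ["the cat sat on the mat"]
refs = [["the cat sat on the mat"]]  # one stream of references
print(sacrebleu.corpus_bleu(hyps, refs).score)  # corpus-level BLEU
```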

4.1 English→German

En→De
System                news2017   news2018
baseline              30.90      45.40
+ langid filtering    30.78      46.43
+ ffn 8192            31.15      46.28
+ BT                  33.62      46.66
+ fine-tuning         -          47.61
+ ensemble            -          49.27
+ reranking           -          50.63
WMT'18 submission     -          46.10
WMT'19 submission (newstest2019): 42.7

Table 5: SacreBLEU scores on English→German.

For En→De, langid filtering, a larger FFN, and ensembling improve our baseline performance on news2018 by about 1.5 BLEU. Note that our best bitext-only system already outperforms our system from last year by 1 BLEU point, perhaps due to the addition of higher-quality bitext data and improved data filtering. The addition of back-translated (BT) data improves single-model performance by only 0.3 BLEU, but combining it with fine-tuning and ensembling gives a total gain of 3 BLEU. Finally, applying reranking on top of these strong ensembled systems adds another 1.4 BLEU.

4.2 German→English

De→En
System                   news2017   news2018
baseline                 37.28      45.32
+ langid and ffn 8192    38.45      46.16
+ BT                     41.08      48.78
+ fine-tuning            -          49.07
+ ensemble               -          49.60
+ reranking              -          51.13
WMT'19 submission (newstest2019): 40.8

Table 6: SacreBLEU scores on German→English.

For De→En, as with En→De, we see similar improvements of around 1.4 BLEU from langid filtering, a larger FFN, and ensembling. Compared to En→De, however, the addition of back-translated data matters much more, improving single-model performance by over 2.5 BLEU. Fine-tuning, ensembling, and reranking add a further 2.4 BLEU, with reranking contributing 1.5 BLEU, the majority of the improvement.

4.3 English→Russian

En→Ru
System                news2017   news2018
baseline              35.42      31.53
+ langid filtering    35.69      31.77
+ ffn 8192            36.66      33.49
+ BT Newscrawl        40.09      37.07
+ BT Commoncrawl      40.42      37.30
+ fine-tuning         -          37.74
+ ensemble            -          38.59
+ reranking           -          39.53
WMT'19 submission (newstest2019): 36.3

Table 7: SacreBLEU scores on English→Russian.

For En→Ru, we observe a large improvement of 2.4 BLEU over the bitext-only baseline after applying langid filtering, a larger FFN, and ensembling. Since the initial En↔Ru bitext is of lower quality, adding back-translated data yields a large improvement of 3.5 BLEU, and augmenting the back-translation data with Commoncrawl adds a further 0.2 BLEU. Finally, fine-tuning, ensembling, and reranking add 2.2 BLEU, with reranking contributing 1 BLEU.

4.4 Russian→English

Ru→En
System                   news2017   news2018
baseline                 37.07      32.69
+ langid and ffn 8192    37.72      33.44
+ BT                     41.68      36.49
+ fine-tuning            -          38.54
+ ensemble               -          38.96
+ reranking              -          40.16
WMT'19 submission (newstest2019): 40.0

Table 8: SacreBLEU scores on Russian→English.

For Ru→En, we observe trends similar to En→De, with langid filtering, a larger FFN, and ensembling improving performance over a bitext-only system by 1.6 BLEU. Back-translation adds 3 BLEU, again most likely because of the lower-quality bitext data. Fine-tuning, ensembling, and reranking add almost 4 BLEU, with reranking contributing 1.2 BLEU.

4.5 Reranking

For every language direction, reranking gives a significant improvement, even when applied on top of an ensemble of very strong back-translation-trained models. The biggest improvement, 1.5 BLEU, comes in the De→En direction, and the smallest, 1 BLEU, in the En→Ru direction. This is perhaps due to the relatively weak Russian language model, which is trained on significantly less data than the English and German ones. Improving our language models may lead to even greater gains from reranking.

4.6 Human Evaluations

All our systems participated in the WMT'19 human evaluation campaign, with different evaluation styles used for different directions. All our systems except Ru→En were evaluated with document-level context and received a document-level rating. Source-based direct assessment was used for systems translating out of English, and target-based direct assessment for systems translating into English. See Table 9 for details.

         DR+DC   SR+DC   SR-DC
de-en    M       M       -
en-de    B       B       -
en-ru    B       B       -
ru-en    -       -       M

Table 9: Human evaluation configurations. DR denotes a document-level rating and SR a segment-level rating; +DC denotes evaluation with document-level context, -DC without. M denotes monolingual (target-based) direct assessment, where translations are compared to human references; B denotes bilingual (source-based) evaluation, where annotators judge MT output based only on the source sentence, with no reference translation present.

Facebook-FAIR was ranked first in all four language directions we competed in. Table 10 shows that our En→De submission significantly outperforms other systems as well as human translations. Our submissions for De→En, En→Ru, and Ru→En also achieve the highest scores.

Although our systems are pure sentence-level models, they performed well regardless of whether the evaluation method used document context. In the document-level rankings, our En→De system also ranked first and significantly outperformed human translations. Our En→Ru submission achieved the highest score among all submissions and is tied for first place with human translations. The De→En system achieved the second-highest score among constrained systems. See Bojar et al. (2019) for details.

Ave.   Ave. z    System
90.3    0.347    Facebook-FAIR
93.0    0.311    Microsoft-WMT19-sent-doc
92.6    0.296    Microsoft-WMT19-doc-level
90.3    0.240    HUMAN
87.6    0.214    MSRA-MADL
88.7    0.213    UCAM
89.6    0.208    NEU
87.5    0.189    MLLP-UPV
87.5    0.130    eTranslation
86.8    0.119    dfki-nmt
84.2    0.094    online-B
86.6    0.094    Microsoft-WMT19-sent-level
87.3    0.081    JHU
84.4    0.077    Helsinki-NLP
84.2    0.038    online-Y
83.7    0.010    lmu-ctx-tf-single
84.1    0.001    PROMT-NMT
82.8   -0.072    online-A
82.7   -0.119    online-G
80.3   -0.129    UdS-DFKI
82.4   -0.132    TartuNLP-c
76.3   -0.400    online-X
43.3   -1.769    en-de-task

Table 10: Official results of the WMT'19 En→De news translation task. Systems are ordered by DA z-score; systems within a cluster are considered tied. In the official results, grayed entries indicate systems using resources beyond the provided data.

5 Conclusions

This paper describes Facebook FAIR's submission to the WMT19 news translation task. For all four translation directions, En↔De and En↔Ru, we use the same strategy: filter the bitext data, perform sampling-based back-translation of monolingual data, and train strong individual models on the combination. Each of these models is fine-tuned and ensembled into a final system that decodes with noisy channel model reranking. We demonstrate the effectiveness of our noisy channel reranking approach even when applied on top of very strong systems, and rank first in all four directions of the human evaluation campaign.

References

  • Bojar et al. (2019) Ondřej Bojar, Christian Federmann, Mark Fishel, Yvette Graham, Barry Haddow, Matthias Huck, Philipp Koehn, and Christof Monz. 2019. Findings of the 2019 conference on machine translation (wmt19). In Proceedings of the Fourth Conference on Machine Translation, Volume 2: Shared Task Papers, Florence, Italy. Association for Computational Linguistics.
  • Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 489–500.
  • Heafield (2011) Kenneth Heafield. 2011. KenLM: Faster and smaller language model queries. In Proceedings of the Sixth Workshop on Statistical Machine Translation, pages 187–197.
  • Joulin et al. (2016) Armand Joulin, Edouard Grave, Piotr Bojanowski, and Tomas Mikolov. 2016. Bag of tricks for efficient text classification. arXiv preprint arXiv:1607.01759.
  • Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In ACL Demo Session.
  • Lui and Baldwin (2012) Marco Lui and Timothy Baldwin. 2012. langid.py: An off-the-shelf language identification tool. In Proceedings of the ACL 2012 System Demonstrations, pages 25–30. Association for Computational Linguistics.
  • Moore and Lewis (2010) Robert Moore and William Lewis. 2010. Intelligent selection of language model training data. In Proceedings of the ACL 2010 Conference Short Papers, pages 220–224.
  • Ott et al. (2019) Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  • Ott et al. (2018) Myle Ott, Sergey Edunov, David Grangier, and Michael Auli. 2018. Scaling neural machine translation. In Proc. of WMT.
  • Post (2018) Matt Post. 2018. A call for clarity in reporting BLEU scores. In Proceedings of the Third Conference on Machine Translation: Research Papers. Association for Computational Linguistics.
  • Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, pages 1715–1725.
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems.