
Human-Paraphrased References Improve Neural Machine Translation

by   Markus Freitag, et al.

Automatic evaluation comparing candidate translations to human-generated paraphrases of reference translations has recently been proposed by Freitag et al. When used in place of original references, the paraphrased versions produce metric scores that correlate better with human judgment. This effect holds for a variety of different automatic metrics, and tends to favor natural formulations over more literal (translationese) ones. In this paper we compare the results of performing end-to-end system development using standard and paraphrased references. With state-of-the-art English-German NMT components, we show that tuning to paraphrased references produces a system that is significantly better according to human judgment, but 5 BLEU points worse when tested on standard references. Our work confirms the finding that paraphrased references yield metric scores that correlate better with human judgment, and demonstrates for the first time that using these scores for system development can lead to significant improvements.



1 Introduction

Machine Translation (MT) has shown impressive progress in recent years. Neural architectures Bahdanau et al. (2015); Gehring et al. (2017); Vaswani et al. (2017) have greatly contributed to this improvement, especially for languages with abundant training data Bojar et al. (2016, 2018); Barrault et al. (2019). This progress creates novel challenges for the evaluation of machine translation, both for human Toral (2020); Läubli et al. (2020) and automated evaluation protocols Lo (2019); Zhang et al. (2019).

Both types of evaluation play an important role in machine translation Koehn (2010). While human evaluations provide a gold standard evaluation, they involve a fair amount of careful and hence expensive work by human assessors. Cost therefore limits the scale of their application. On the other hand, automated evaluations are much less expensive. They typically only involve human labor when collecting human reference translations and can hence be run at scale to compare a wide range of systems or validate design decisions. The value of automatic evaluations therefore resides in their capacity to be used as a proxy for human evaluations for large scale comparisons and system development.

The recent progress in MT has raised concerns about whether automated evaluation methodologies reliably reflect human ratings in high accuracy ranges. In particular, it has been observed that the best systems according to humans might fare less well with automated metrics Barrault et al. (2019). Most metrics such as Bleu Papineni et al. (2002) and TER Snover et al. (2006) measure overlap between a system output and a human reference translation. More refined ways to compute such overlap have consequently been proposed Banerjee and Lavie (2005); Lo (2019); Zhang et al. (2019).

Orthogonal to the work of building improved metrics, Freitag et al. (2020) hypothesized that human references are also an important factor in the reliability of automated evaluations. In particular, they observed that standard references exhibit simple, monotonic language due to human ‘translationese’ effects. These standard references might favor systems which excel at reproducing these effects, independent of the underlying translation quality. They showed that better correlation between human and automated evaluations could be obtained when replacing standard references with paraphrased references, even when still using surface overlap metrics such as BLEU (Papineni et al., 2002). The novel references, collected by asking linguists to paraphrase standard references, were shown to steer evaluation away from rewarding translation artifacts. This improves the assessment of alternative, but equally good translations.

Our work builds on the success of paraphrased translations for evaluating existing systems, and asks if different design choices could have been made when designing a system with such an evaluation protocol in mind. This examination has several potential benefits: it can help identify choices which improve BLEU on standard references but have limited impact on final human evaluations; or those that result in better translations for the human reader, but worse in terms of standard reference BLEU. Conversely, it might turn out that paraphrased references are not robust enough to support system development due to the presence of ‘metric honeypots’: settings that produce poor translations, but which are nevertheless assigned high BLEU scores.

To address these points, we revisit the major design choices of the best English→German system from WMT2019 Ng et al. (2019) step-by-step, and measure their impact on standard reference BLEU as well as on paraphrased BLEU. This allows us to measure the extent to which steps such as data cleaning, back-translation, fine-tuning, ensemble decoding and reranking benefit standard reference BLEU more than paraphrase BLEU. Revisiting these development choices with the two metrics results in two systems with quite different behaviors. We conduct a human evaluation for adequacy and fluency to assess the overall impact of designing a system using paraphrased BLEU.

Our main findings show that optimizing for paraphrased BLEU is advantageous for human evaluation when compared to an identical system optimized for standard BLEU. The system optimized for paraphrased BLEU significantly improves WMT newstest19 adequacy ratings (4.72 vs 4.27 on a six-point scale) and fluency ratings (63.8% vs 27.2% on side-by-side preference) despite scoring 5 BLEU points lower on standard references.

2 Related Work

Collecting human paraphrases of existing references has recently been shown to be useful for system evaluation Freitag et al. (2020). Our work considers applying the same methodology for system tuning. There is some earlier work relying on automated paraphrases for system tuning, especially for Statistical Machine Translation (SMT). Madnani et al. (2007) introduced an automatic paraphrasing technique based on English-to-English translation of full sentences using a statistical MT system, and showed that this permitted reliable system tuning using half as much data. Similar automatic paraphrasing has also been used to augment training data, e.g. Marton et al. (2009), but relying on standard references for evaluation. In contrast to human paraphrases, the quality of current machine-generated paraphrases degrades significantly as overlap with the input decreases Mallinson et al. (2017); Roy and Grangier (2019). This makes their use difficult for evaluation, since Freitag et al. (2020) suggest that substantial paraphrasing (‘paraphrase as much as possible’) is necessary for evaluation.

Our work can be seen as replacing the regular BLEU metric with a new paraphrase BLEU metric for system tuning. Alternative automatic evaluation metrics have also been considered for system tuning He and Way (2010); Servan and Schwenk (2011) with Minimum Error Rate Training, MERT Och (2003). This work showed some specific cases where Translation Error Rate (TER) was superior to Bleu.

Our work is also related to the bias that the human translation process introduces in the references, including source language artifacts—Translationese Koppel and Ordan (2011)—as well as source-independent artifacts—Translation Universals Mauranen and Kujamäki (2004). The professional translation community studies both systematic biases inherent to translated texts Baker (1993); Selinker (1972), as well as biases resulting specifically from interference from the source text Toury (1995). For MT, Freitag et al. (2019) point at Translationese as a source of mismatch between BLEU and human evaluation, raising concerns that overlap-based metrics might reward hypotheses with translationese language more than hypotheses using more natural language. The impact of Translationese on human evaluation of MT has recently received attention as well (Toral et al., 2018; Zhang and Toral, 2019; Graham et al., 2019). More generally, the question of bias to a specific reference has also been raised, in the case of monolingual manual evaluation (Fomicheva and Specia, 2016; Ma et al., 2017). Different from the impact of Translationese on evaluation, the impact of Translationese in the training data has also been studied Kurokawa et al. (2009); Lembersky et al. (2012a); Bogoychev and Sennrich (2019); Riley et al. (2020).

Finally, our work is also related to studies measuring the importance of the test data quality, looking specifically at the test set translation direction. For SMT evaluation, Lembersky et al. (2012) and Stymne (2017) explored how the translation direction affects translation results. Holmqvist et al. (2009) noted that the original language of the test sentences influences the BLEU score of translations. They showed that the BLEU scores for target-original sentences are on average higher than for sentences that have their original source in a different language. Recently, a similar study was conducted for neural MT Bogoychev and Sennrich (2019).

3 Experimental Setup

We first describe data and models, then present our human evaluation protocol.

3.1 Data

We ran all experiments on the WMT 2019 English→German news translation task (Barrault et al., 2019). The task provides 38M parallel sentences. As German monolingual data, we concatenate all News Crawl data from 2007 to 2018, comprising 264M sentences after removing duplicates.

In addition to the training data, we use newstest2018 for development and newstest2019 for evaluation only. There is an important difference between these two test sets. Newstest2018 was created from monolingual news data from both English and German online sources. Half of the data consists of English text translated into German, while the other half consists of German text translated into English. This results in a joint test set of 2,998 sentences. Newstest2019, on the other hand, consists only of 1,997 sentences translated from English into German (see Figure 1). To provide a joint test set similar to newstest2018, we took newstest2019 from the reverse translation direction German→English, swapped source and target, and concatenated it with the original test set. This results in a new joint newstest2019 test set of 3,997 sentences.
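The construction of the joint set can be sketched as follows. This is a minimal illustration rather than the script used for the experiments; the function name, the tuple layout and the toy sentences are assumptions made for the example.

```python
def build_joint_test_set(forward_pairs, reverse_pairs):
    """Combine an English->German set with a swapped German->English set.

    forward_pairs: list of (english_source, german_reference), English-original.
    reverse_pairs: list of (german_source, english_reference), German-original.
    Returns a list of (english, german, origin) triples.
    """
    joint = [(en, de, "orig-en") for en, de in forward_pairs]
    # Swap source and target so that every item reads English -> German.
    joint += [(en, de, "orig-de") for de, en in reverse_pairs]
    return joint


# Toy data, invented for the example.
forward = [("Good morning.", "Guten Morgen.")]
reverse = [("Vielen Dank.", "Thank you very much.")]
joint = build_joint_test_set(forward, reverse)
```

Keeping the origin label on every item makes it straightforward to score the orig-en and orig-de halves separately later on.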

Figure 1: (a) Forward-translated, i.e. source original; (b) backward-translated, i.e. target original. Sentences in a test set are either natural in the source and forward-translated into the target language, or vice-versa. If a test set consists of both kinds of sentences, we call it a joint test set. WMT English→German newstest2018 is a joint test set with half of the sentences being forward-translated. WMT English→German newstest2019 is a forward-translated test set.

In addition to reporting overall Bleu scores on the different test sets, we also report results on the two subsets (based on the original language) of each newstest20XX, which we call the orig-en and the orig-de halves of the test set.

Freitag et al. (2020) provided an alternative reference translation for the orig-en half of newstest2019. For both standard and alternative references, they provided an additional paraphrased ‘as much as possible’ version (four different references in all). In order to enable our parameter tuning experiments, we created a paraphrased version of the reference for the orig-en half of newstest2018 (1,500 sentences) following the instructions from Freitag et al. (2020). We will release this new paraphrased reference, newstest2018.orig-en.p, as part of our work.

3.2 Models

For our translation models, we adopt the transformer implementation from Lingvo Shen et al. (2019), using the transformer-big model size Vaswani et al. (2017). We use a vocabulary of 32k subword units and exponential moving averaging of checkpoints (EMA decay) with the weight decrease parameter set to  Buduma and Locascio (2017). We used a batch size of around 32k sentences in all our experiments.
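For readers unfamiliar with exponential moving averaging of parameters, the update can be sketched as below. This is a generic illustration and not the Lingvo implementation; the decay value is a hypothetical placeholder, since the value used in the experiments is not recoverable from the text above.

```python
def ema_update(shadow, params, decay):
    """Return the moving average of the parameters after one training step."""
    return {
        name: decay * shadow[name] + (1.0 - decay) * value
        for name, value in params.items()
    }


DECAY = 0.999  # hypothetical value, NOT taken from the paper
shadow = {"w": 0.0}
# Three training steps in which the raw parameter stays at 1.0.
for step_value in [1.0, 1.0, 1.0]:
    shadow = ema_update(shadow, {"w": step_value}, DECAY)
```

The averaged copy moves slowly towards the raw parameters, which smooths out step-to-step noise when checkpoints are evaluated.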

We report Bleu Papineni et al. (2002) in addition to human evaluation. All Bleu scores are calculated with sacreBLEU Post (2018) (footnote 1: BLEU+case.mixed+lang.en-de+numrefs.1+smooth.exp+test.SET+tok.13a+version.1.4.12, with SET in {wmt18, wmt19, wmt19/google/ar, wmt19/google/arp, wmt19/google/wmtp}).

3.3 Human Evaluation

To collect human rankings, we ran side-by-side evaluation for overall quality and fluency. We hired 20 linguists and divided them equally between the two evaluations. Each evaluation included 1,000 items with each item being rated exactly once. We acquired only a single rating per sentence from the professional linguists as we found that they were more reliable than crowd workers Toral (2020). We evaluated the orig-en sentences corresponding to the official WMT-19 English→German test set Barrault et al. (2019). Results in this natural translation direction are more meaningful as pointed out by Zhang and Toral (2019), who show that translating a ‘translationese’ source is simpler and should not be used for human evaluation.

Our human evaluation followed the protocol:

  • Fluency: We present two translations of the same source sentence to professional linguists without showing the actual source sentence. We then ask the rater whether they prefer one of the outputs or rate them equally based on fluency.

  • Overall Quality: We present two translations along with the source and ask the raters to evaluate each translation on a 6-point scale. A score of 6 will be assigned to translations with ‘perfect meaning and grammar’, while a score of 0 will be assigned to ‘nonsense/no meaning preserved’ translations. The average over all ratings yields the system’s final quality score.
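The aggregation of the two kinds of ratings can be sketched as follows. The helper names and the toy ratings are invented for illustration; only the averaging of quality scores and the side-by-side preference setup come from the protocol above.

```python
from collections import Counter


def mean_quality(ratings):
    """Average the per-item quality ratings of one system."""
    return sum(ratings) / len(ratings)


def preference_shares(choices):
    """Share of items where raters preferred A, preferred B, or rated both equally."""
    counts = Counter(choices)
    total = len(choices)
    return {label: counts[label] / total for label in ("A", "B", "equal")}


# Toy ratings, invented for the example.
quality_a = [5, 4, 6, 3]
quality_b = [4, 4, 5, 3]
fluency_choices = ["A", "A", "equal", "B"]

scores = (mean_quality(quality_a), mean_quality(quality_b))
shares = preference_shares(fluency_choices)
```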

4 Experimental Results

This section first presents our main result comparing the same system tuned with BLEU on standard versus paraphrased references. We then break down how system design choices impact each metric differently. Throughout, we refer to scores computed with standard references as Bleu, and those computed with paraphrased references as BleuP.

4.1 Overall Performance

We compare the performance of a system optimized on newstest2018 with standard references (opt-on-Bleu) with one optimized on newstest2018.orig-en with paraphrased references (opt-on-BleuP). Both systems were developed using only newstest2018 data, keeping newstest2019 as a blind test set. Table 1 summarizes the results on newstest2019. Details of how these two systems were developed and how they differ are given in Section 4.2.

The opt-on-Bleu system outperforms opt-on-BleuP by 5.2 Bleu points. Normally this would lead us to discard opt-on-BleuP. However, the BleuP scores tell a different story: opt-on-BleuP outperforms opt-on-Bleu by 0.3 points, a potentially large improvement given the smaller natural range of this metric. Under a significance test with random approximation Riezler and Maxwell III (2005), both the Bleu and BleuP differences are significant at p < 5e-18.
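A paired approximate randomization test of the kind discussed by Riezler and Maxwell III (2005), which is how we read "random approximation" here, can be sketched as below. This is a generic sketch under stated assumptions, not the script used for the reported p-values: the metric is passed in as a function, and the toy exact-match metric stands in for corpus-level Bleu.

```python
import random


def approximate_randomization(hyps_a, hyps_b, refs, metric, trials=1000, seed=0):
    """Estimate a p-value for the difference in a corpus-level metric."""
    rng = random.Random(seed)
    observed = abs(metric(hyps_a, refs) - metric(hyps_b, refs))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(hyps_a, hyps_b):
            # Swap the two systems' outputs for this sentence with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        diff = abs(metric(shuffled_a, refs) - metric(shuffled_b, refs))
        if diff >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)


def exact_match_rate(hyps, refs):
    """Toy stand-in metric: fraction of hypotheses identical to the reference."""
    return sum(h == r for h, r in zip(hyps, refs)) / len(refs)


refs = ["a", "b", "c", "d"] * 5
p_same = approximate_randomization(refs, refs, refs, exact_match_rate, trials=200)
p_diff = approximate_randomization(refs, ["x"] * 20, refs, exact_match_rate, trials=200)
```

With the add-one estimate, the smallest attainable p-value is 1 / (trials + 1), so very small p-values require a correspondingly large number of trials.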

                 opt-on-Bleu   opt-on-BleuP
Bleu             45.0          39.8
BleuP            13.4          13.7
human quality    4.27          4.72
human fluency    27.2%         63.8%
Table 1: Bleu scores and human ratings for WMT newstest2019 English→German (original English sources). We optimized the system to perform best on either newstest2018 with standard reference translations (opt-on-Bleu) or newstest2018.orig-en with paraphrased reference translations (opt-on-BleuP). BLEU differences are significant according to random approximation Riezler and Maxwell III (2005) with p < 5e-18. Human score differences are significant according to a Wilcoxon rank-sum test with p < 5e-18.

Freitag et al. (2020) showed that BLEU scores calculated on paraphrased references have higher correlation with human judgment than those calculated on standard references. To verify their findings, we ran a human evaluation for the two different outputs on 1,000 sentences randomly drawn from newstest2019 (orig-en), as described above. As shown in Table 1, opt-on-BleuP is consistently evaluated as better for both quality and fluency. To measure the significance between the two ratings, we ran a Wilcoxon rank-sum test on the human ratings and found that both improvements are significant with p < 5e-18.

This experiment demonstrates that we can actually tune our MT system on paraphrased references to yield higher translation quality when compared to a typical system tuned on standard Bleu. Interestingly, the Bleu score for the better system is much lower, supporting our contention that Bleu rewards spurious translation features (e.g. monotonicity and common translations) that are filtered out by BleuP.

4.2 Analysing Performance

We now describe the individual model decisions that went into the two final systems of Section 4.1. To build a classical system optimized on Bleu with standard references, we replicate the WMT 2019 winning submission Ng et al. (2019) and examine the effect of each of its major design decisions (footnote 2: our replication achieves 45.0 BLEU on newstest19, competitive with the reference system at 42.7 BLEU). In particular, we look into the effect of data cleaning, back-translation, fine-tuning, ensembling and noisy channel reranking. We examine the impact of each method on Bleu and BleuP. For our experiments, we used newstest2018 as our development set and newstest2019 as our held-out test set. All model decisions (checkpoint, variants) are made solely on newstest2018.

Experimental results are presented in Table 2. As described in Section 3.1, we report 4 different Bleu scores for newstest2018 (dev) and newstest2019 (test). In addition to reporting Bleu score on the joint or the orig-de/orig-en halves of the test sets, we also report Bleu scores that are calculated on paraphrased references (BleuP).

                                      newstest2018 (dev)                      newstest2019 (test)
                                      joint  orig-de  orig-en  orig-en.p      joint  orig-de  orig-en  orig-en.p
(1) bitext                            46.0   38.8     50.6     12.8           38.5   34.9     40.9     12.1
(2) + CDS                             46.1   39.4     50.5     13.4           39.6   35.6     42.3     12.6
(3) + BT                              47.2   45.3     47.7     13.6           40.9   43.1     39.4     13.1
(4) + Fine tuning                     47.7   43.6     49.2     13.8           41.2   41.3     41.1     13.6
(5) + Ensemble of 4                   49.8   45.4     52.1     13.7           43.1   42.1     43.6     13.3
+ reranking of (5) (opt on Bleu)      50.7   44.8     53.9     13.8           43.4   41.2     45.0     13.4
+ reranking of (4) (opt on BleuP)     47.1   45.9     47.1     14.7           41.6   44.0     39.8     13.7
Table 2: Bleu scores for WMT 2019 English→German. The joint sets combine orig-en and orig-de subsets. The orig-en.p sets use paraphrased references instead of standard references. Our experiments compared newstest2018.joint and newstest2018.orig-en.p for system tuning. The standard newstest2018 and newstest2019 sets are newstest2018.joint and newstest2019.orig-en, respectively.

4.2.1 Data Cleaning

For data cleaning, we used CDS Wang et al. (2018). We trained a CDS model for English→German taking news-commentary as the in-domain/clean data set. We scored all parallel sentences with our trained CDS model and kept the 70% highest scoring sentences. Our experimental results suggest that data cleaning is useful for all four types of test sets and consistently improves over a baseline system that is trained on raw parallel data. We conclude that data cleaning is useful for all systems independently of which test set it will be optimized for.
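The selection step that follows scoring can be sketched as below. The scoring model itself is out of scope here: the scores are assumed to be given, with higher meaning cleaner, and the function name and toy data are invented for the example.

```python
def keep_top_fraction(scored_pairs, fraction=0.7):
    """Keep the highest-scoring share of sentence pairs.

    scored_pairs: list of (score, source, target); a higher score is assumed
    to mean a cleaner pair.
    """
    ranked = sorted(scored_pairs, key=lambda item: item[0], reverse=True)
    keep = int(round(len(ranked) * fraction))
    return [(src, tgt) for _, src, tgt in ranked[:keep]]


# Ten toy pairs with scores 0..9, invented for the example.
toy = [(float(i), f"src{i}", f"tgt{i}") for i in range(10)]
kept = keep_top_fraction(toy)
```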

4.2.2 Back-Translation

We trained a strong German→English model on the same parallel data (with flipped source/target) and used that model to (back-)translate (BT) all deduplicated German monolingual sentences from NewsCrawl 2007-2018 into English. We filtered out sentences with a source-target ratio lower than 0.5 or higher than 1.5. We further ran language identification and filtered out all back-translations going into the wrong language. We then oversampled our bitext data to match the size of the back-translation data and trained an NMT model on the concatenation of both datasets.
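The ratio filter and the oversampling step can be sketched as follows. Two assumptions are made for the example: the ratio is computed over whitespace tokens (the unit is not specified above), and oversampling simply repeats the bitext. Language identification is omitted because it needs an external tool.

```python
import math


def length_ratio_ok(source, target, low=0.5, high=1.5):
    """True if the source/target length ratio lies within [low, high]."""
    # Assumption: lengths are counted in whitespace-separated tokens.
    ratio = len(source.split()) / max(len(target.split()), 1)
    return low <= ratio <= high


def oversample(bitext, target_size):
    """Repeat the bitext until it has target_size examples."""
    if not bitext:
        return []
    repeats = math.ceil(target_size / len(bitext))
    return (bitext * repeats)[:target_size]


# Toy (synthetic English, German) pairs, invented for the example.
back_translated = [
    ("a b c", "x y z"),     # ratio 1.0, kept
    ("a", "x y z w"),       # ratio 0.25, dropped
    ("a b c d", "x y"),     # ratio 2.0, dropped
    ("a b", "x y z"),       # ratio ~0.67, kept
]
filtered = [pair for pair in back_translated if length_ratio_ok(*pair)]
bitext = [("hello", "hallo")]
training_data = oversample(bitext, len(filtered)) + filtered
```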

As previously reported by Freitag et al. (2019); Bogoychev and Sennrich (2019), the original language of the sentences within a test set is crucial and can lead to very different conclusions, in particular for back-translation systems. This difference is visible when looking at the Bleu scores on the standard references. While the Bleu score on orig-de improves by 7.5 points, the Bleu score drops by 2.9 points on the orig-en half. Due to the big gain on the orig-de half, BT also improves the Bleu score on the joint set. The paraphrased references were designed to overcome these kinds of mismatches and they show a gain of 0.5 BLEU points. We conclude that back-translation helps improve both Bleu and BleuP, and we include BT in the systems optimized for either standard or paraphrased Bleu scores.

4.2.3 Fine-Tuning

Similar to Ng et al. (2019), we fine-tuned our back-translated model on a concatenation of previous WMT test sets (newstest{2013,2015,2016,2017}) and the clean in-domain news-commentary corpus. In total, we fine-tuned the model on 330k sentences. We kept all model parameters the same (batch size, learning rate) and continued training on the fine-tuning data for one epoch. The Bleu scores on the standard references suggest a small improvement of 0.3 Bleu on the joint test set. Interestingly, the improvement is visible on the orig-en half, by 1.7 points, while the Bleu score on orig-de actually drops by 1.8 points (Table 2, newstest2019). Nevertheless, BleuP does improve by 0.5 points, suggesting that fine-tuning is especially helpful when measuring scores with paraphrased references. Despite the small gain on standard references, we include fine-tuning in both our optimized systems.

4.2.4 Ensemble

Combining different predictions is a standard approach in MT to boost Bleu scores. We run ensemble decoding with 4 previously built models. In addition to using the 3 models described in Sections 4.2.1, 4.2.2, and 4.2.3, we build a second fine-tuned model with the same approach, but different initialization.

Although ensemble decoding improves the performance on our standard references by up to 1.9 Bleu points, the score on the paraphrased references drops by 0.3 BleuP points. We suspect that using an ensemble for decoding favors common, average language by promoting target spans where all systems agree. Paraphrased references actually downweight the importance of this language, which seems important for agreeing with human judgments Freitag et al. (2020). This promotion of average language and monotonic translation may explain the effectiveness of ensembling only for standard reference Bleu. Similar to the WMT 2019 winning submission, we include the ensemble approach in our system that is optimized on the joint Bleu scores. However, we do not include it in our system optimized on BleuP.
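The intuition that ensembling promotes tokens on which the models agree can be illustrated with a toy next-token example. This sketch assumes the ensemble takes the arithmetic mean of the models' next-token probabilities, which is one common choice and not necessarily the one used in our decoder; the distributions are invented.

```python
def ensemble_next_token(distributions):
    """Average next-token distributions and return the most probable token."""
    vocab = distributions[0].keys()
    averaged = {
        tok: sum(d[tok] for d in distributions) / len(distributions)
        for tok in vocab
    }
    best = max(averaged, key=averaged.get)
    return best, averaged


# Invented distributions: two of the three models prefer a less common verb,
# but they disagree on which one.
model_outputs = [
    {"sagte": 0.5, "meinte": 0.4, "erklärte": 0.1},
    {"sagte": 0.4, "meinte": 0.1, "erklärte": 0.5},
    {"sagte": 0.4, "meinte": 0.5, "erklärte": 0.1},
]
token, probs = ensemble_next_token(model_outputs)
```

In this toy case the ensemble picks the generic verb that every model finds acceptable, even though it is the first choice of only one model.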

4.3 Reranking

Finally, we extend the noisy-channel approach Yee et al. (2019), which consists of re-ranking the top-50 beam search output of either the ensemble model (when tuned for Bleu) or the fine-tuned model (when tuned for BleuP). Instead of using 4 features (forward probability, backward probability, language model and word penalty), we use 11 forward probabilities, 10 backward probabilities and 2 language model scores. Unlike Ng et al. (2019), we did not pick the re-ranking weights through random search, but used MERT Och (2003) for efficient tuning.
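Re-ranking with a weighted combination of feature scores can be sketched as below. The feature values and weights are invented for illustration and only three features are shown; in the experiments the weights are tuned with MERT, which is not reproduced here.

```python
def rerank(nbest, weights):
    """Return the candidate with the highest weighted sum of feature scores."""
    def score(candidate):
        return sum(
            weights[name] * value
            for name, value in candidate["features"].items()
        )
    return max(nbest, key=score)


# Invented log-probability-style scores for two candidates.
nbest = [
    {"text": "Morgen ist ein anderes Biest.",
     "features": {"forward": -1.0, "backward": -4.0, "lm": -3.0}},
    {"text": "Morgen ist alles anders.",
     "features": {"forward": -1.5, "backward": -2.0, "lm": -1.0}},
]
weights = {"forward": 1.0, "backward": 0.5, "lm": 0.5}  # hand-set, hypothetical
best = rerank(nbest, weights)
```

Here the first candidate has the better forward score, but the backward and language model scores move the second candidate to the top.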

The 11 different forward translation scores come from different English→German NMT models that are replicas of the previously described models (Sections 4.2.1, 4.2.2, and 4.2.3). The 10 backward translation scores come from the same approaches, but trained in the reverse direction. These 21 NMT model scores are combined with 2 language model (LM) scores. The first LM is trained on the German monolingual NewsCrawl data, while the second LM is trained on forward-translated English NewsCrawl data. The first LM should assign high scores to genuine German text, while the second LM should assign high scores to translationese German originating from English.

We first reranked the 50-best list generated by the ensemble model with MERT on newstest2018. Similar to the original WMT 2019 submission, the Bleu scores on the joint and orig-en set increase. This reranked output corresponds to our opt-on-Bleu model. Next, we reranked the 50-best list generated by the fine-tuned model with MERT on newstest2018.orig-en with paraphrased references. This led to further small increases in BleuP, and corresponds to our opt-on-BleuP model.

In summary, optimizing on BleuP leads us to keep back-translation, even though evaluation with standard English-original references would have us drop it, and also leads us to drop the ensembling step. Rescoring using MERT weights learned with Bleu or BleuP further separates the systems according to these metrics.

5 Analysis

This section confirms the results from the previous section with additional references for newstest2019 and illustrates the behaviour of our systems on individual sentences.

5.1 Alternative Reference Translations

Freitag et al. (2020) released an additional standard reference translation (AR) and two ‘paraphrase as-much-as-possible’ reference translations for newstest2019 (WMT.p and AR.p). We used WMT.p in all our above experiments; here we report Bleu scores for all four available reference translations in Table 3. The Bleu improvements between the two standard reference translations agree perfectly. Similarly, the BleuP improvements between the two paraphrased references also coincide. This indicates that by optimizing on Bleu or BleuP we have not somehow overfit to a specific set of reference translations or their paraphrases, but instead have molded our model to better match a style of reference translation.

                                      WMT        AR         WMT.p        AR.p
                                      (orig-en)  (orig-en)  (orig-en.p)  (orig-en.p)
(1) bitext                            40.9       32.2       12.1         12.0
(2) + CDS                             42.3       34.2       12.6         12.3
(3) + BT                              39.4       33.6       13.1         13.0
(4) + Fine tuning                     41.1       35.5       13.6         13.4
(5) + Ensemble of 4                   43.6       36.0       13.3         13.0
+ reranking of (5) (opt-on-Bleu)      45.0       36.7       13.4         13.1
+ reranking of (4) (opt-on-BleuP)     39.8       34.4       13.7         13.5
Table 3: Bleu scores for English→German newstest2019 for the additional references from Freitag et al. (2020).

5.2 Translation Examples

This section presents translation examples from our two differently optimized systems in Table 4. The first 3 examples show sentences where opt-on-BleuP has higher translation quality than opt-on-Bleu. One observation of Freitag et al. (2020) was that Bleu scores calculated on standard references prefer monotonic translations. This is visible in our first translation example, where opt-on-Bleu incorrectly translates the saying “Tomorrow’s a different beast” into “Morgen ist ein anderes Biest”, using an inappropriately monotonic strategy. On the other hand, the opt-on-BleuP system captures the meaning of the source sentence and generates a valid translation.

Another drawback of standard reference Bleu is the preference for literal translation. This is visible in our second example, where the word “cap” is translated into “Kappe” and “tip” into “kippen”. Both are valid word-by-word translations, but do not make much sense in this context. The third example again shows the monotonic translation style of a regularly tuned system. The opt-on-Bleu translation is an incorrect word-by-word translation. The opt-on-BleuP system is able to introduce a natural German sentence structure and generate a flawless translation.

The last translation example is a loss for the paraphrase-tuned system and demonstrates that sometimes a more literal translation can be better. Even though the word “run” can be translated into “Ansturm”, it is not appropriate in this context and the simpler translation “Lauf” is correct.

source        Tomorrow’s a different beast.
opt-on-Bleu   Morgen ist ein anderes Biest.
opt-on-BleuP  Morgen ist alles anders.

source        You have to tip your cap.
opt-on-Bleu   Sie müssen Ihre Kappe kippen.
opt-on-BleuP  Man muss den Hut ziehen.

source        He averaged 5.6 points and 2.6 rebounds a game last season.
opt-on-Bleu   Er durchschnittlich 5,6 Punkte und 2,6 Rebounds ein Spiel in der vergangenen Saison.
opt-on-BleuP  In der vergangenen Saison erzielte er im Schnitt 5,6 Punkte und 2,6 Rebounds pro Spiel.

source        Thirty-two percent supported such a run.
opt-on-Bleu   32 Prozent unterstützten einen solchen Lauf.
opt-on-BleuP  32 Prozent sprachen sich für einen solchen Ansturm aus.

Table 4: Example output for English→German for systems optimized on standard Bleu or BleuP. Translations for opt-on-Bleu tend to be more literal, and adhere closely to the source sentence structure.

5.3 Matched n-grams

The Bleu scores calculated on the two different references yield different conclusions. Bleu on standard references evaluated opt-on-Bleu higher by more than 5 Bleu points. BleuP came to a different conclusion and gave a higher score to opt-on-BleuP. In this section, we look at the n-grams that contributed most to these different outcomes. Those that contribute most to the difference in Bleu across the two systems are:

  • Er sagte, dass (He said that)

  • , sagte er der (, he said the)

  • stellte fest, dass (noted that)

These are all generic, high-frequency n-grams. They are crucial for attaining high BLEU scores, and tend to appear in translations that employ the same structure as the source sentence. In contrast, the n-grams that contribute most to the difference in BleuP are:

  • Menschen ums Leben kamen (humans died)

  • Großbritannien keine Steuern zahlen (Great Britain pay no tax)

  • von BBC Scotland (from BBC Scotland)

These are much less frequent sequences with more semantic content.
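One way to surface such n-grams is to compare, for each n-gram, how often each system matches it against the reference. The sketch below assumes that "contribution" means the difference in clipped match counts, as in the Bleu n-gram precision; the exact procedure behind the lists above is not specified, and the toy sentences are invented.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def matched_ngrams(hyps, refs, n):
    """Clipped n-gram matches between hypotheses and references."""
    matched = Counter()
    for hyp, ref in zip(hyps, refs):
        # Counter intersection clips each count at the reference count.
        matched += ngrams(hyp.split(), n) & ngrams(ref.split(), n)
    return matched


def top_differences(hyps_a, hyps_b, refs, n=3, k=3):
    """N-grams that system A matches more often than system B, largest first."""
    a = matched_ngrams(hyps_a, refs, n)
    b = matched_ngrams(hyps_b, refs, n)
    diff = {gram: a[gram] - b[gram] for gram in set(a) | set(b)}
    return sorted(diff.items(), key=lambda item: (-item[1], item[0]))[:k]


# Toy tokenized sentences, invented for the example.
refs = ["Er sagte , dass es regnet"]
sys_a = ["Er sagte , dass es regnet"]
sys_b = ["Ihm zufolge regnet es"]
top = top_differences(sys_a, sys_b, refs)
```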

6 Conclusions

Prior work has shown that BLEU measured on paraphrased references (BleuP) has better correlation with human evaluation than BLEU measured on regular references (Bleu) for the comparison of existing systems Freitag et al. (2020). Motivated by this finding, we collected a development set of paraphrased references and assessed BleuP for system development. This allowed us to evaluate if the design choices of a modern neural MT system impact Bleu and BleuP differently, including tuning a re-ranking noisy channel model to these metrics. Our experiments followed the setup from the winning newstest19 English→German entry at WMT19 Ng et al. (2019).

For design choices, we observe that BleuP seems to emphasize the importance of back-translation even when test sets are source original. On the other hand, BleuP seems to de-emphasize the importance of ensembles, as the reliable prediction of common language by ensembles is less rewarded by this metric.

Our tuning experiments led to positive results. In human evaluation, the system tuned on BleuP showed significant improvements in terms of adequacy and even greater gains in terms of fluency compared to the system tuned on Bleu. Example translations indicate that the model tuned on BleuP produces noticeably less literal translations. Our experiments also highlight a disconnect between regular Bleu and human evaluation: the system tuned on BleuP degrades standard Bleu scores by over 5 points, while faring significantly better in human evaluation. Paraphrased automatic evaluation therefore seems to be a promising proxy for human evaluation when making design choices for MT systems.

This research opens the question of whether these results can be confirmed over a wide range of language pairs. We also hope to achieve further improvements by refining the paraphrased evaluation protocol.