A Call for Clarity in Reporting BLEU Scores

by Matt Post, et al.

The field of machine translation is blessed with new challenges resulting from the regular production of fresh test sets in diverse settings. But it is also cursed with a lack of consensus in how to report scores from its dominant metric. Although people refer to "the" BLEU score, BLEU scores can vary wildly with changes to the metric's parameterization and, especially, reference processing schemes, yet these details are absent from papers or hard to determine. We quantify this variation, finding differences as high as 1.8 between commonly used configurations. Pointing to the success of the parsing community, we suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not permit user-supplied preprocessing of the reference. We provide a new tool to facilitate this.



1 Introduction

Science is the process of formulating hypotheses, making predictions, and measuring their outcomes. In machine translation research, the predictions are made by models whose development is the focus of the research, and the measurement, more often than not, is done via BLEU (Papineni et al., 2002). BLEU’s relative language independence, its ease of computation, and its reasonable correlation with human judgments have led to its adoption as the dominant metric for machine translation research. On the whole, it has been a boon to the community, providing a fast and cheap way for researchers to gauge the performance of their models. Together with larger-scale controlled manual evaluations, BLEU has shepherded the field through a decade and a half of quality improvements (Graham et al., 2014).

This is of course not to claim there are no problems with BLEU. Its weaknesses abound, and much has been written about them (cf. Callison-Burch et al., 2006). This paper is not, however, concerned with the shortcomings of using BLEU as a proxy for human evaluation of quality; instead, our goal is to bring attention to problems with the reporting of BLEU scores. These can be summarized as follows:

  • BLEU is not a single metric, but requires a number of parameters (§2.1).

  • Preprocessing schemes have a large effect on scores (§2.2). Importantly, BLEU scores computed against differently-processed references are not comparable.

  • Papers vary in the hidden parameters and schemes they use, yet rarely report them (§2.3). Even when they do, it is often hard to discover the details.

Together, these issues make it difficult to evaluate and compare BLEU scores across papers, which impedes replication. We quantify these issues and show that they are serious, with variation larger than many reported gains. In particular, we identify user-supplied reference tokenizations as a source of incompatibility. As a solution, we suggest the community use only “detokenized” reference processing, as done by the annual Conference on Machine Translation (WMT; Bojar et al., 2017). In support of this, we release a Python script, SacreBLEU (pip3 install sacrebleu), which computes this metric and reports a version string recording the parameters. It also provides a number of other features, such as automatic download and management of common test sets.

          from English                               into English
config    en-cs  en-de  en-fi  en-lv  en-ru  en-tr   cs-en  de-en  fi-en  lv-en  ru-en  tr-en

cased
basic     20.7   25.8   22.2   16.9   33.3   18.5    26.8   31.2   26.6   21.1   36.4   24.4
split     20.7   26.1   22.6   17.0   33.3   18.7    26.9   31.7   26.9   21.3   36.7   24.7
unk       20.9   26.5   25.4   18.7   33.8   20.6    26.9   31.4   27.6   22.7   37.5   25.2
metric    20.1   26.6   22.0   17.9   32.0   19.9    27.4   33.0   27.6   22.0   36.9   25.6
range      0.6    0.8    0.6    1.0    1.3    1.4     0.6    1.8    1.0    0.9    0.5    1.2

uncased
basic     21.2   26.3   22.5   17.4   33.3   18.9    27.7   32.5   27.5   22.0   37.3   25.2
split     21.3   26.6   22.9   17.5   33.4   19.1    27.8   32.9   27.8   22.2   37.5   25.4
unk       21.4   27.0   25.6   19.1   33.8   21.0    27.8   32.6   28.3   23.6   38.3   25.9
metric    20.6   27.2   22.4   18.5   32.8   20.4    28.4   34.2   28.5   23.0   37.8   26.4
range      0.6    0.9    0.5    1.1    0.6    1.5     0.7    1.7    1.0    1.0    0.5    1.2

Table 1: BLEU score variation across WMT’17 language arcs for cased (top) and uncased (bottom) BLEU. Each column varies the processing of the “online-B” system output and its references. basic denotes basic user-supplied tokenization, split adds compound splitting, unk replaces words not appearing at least twice in the training data with UNK, and metric denotes the metric-supplied tokenization used by WMT. The range row lists the difference between the smallest and largest scores, excluding unk.

2 Problem Description

2.1 Problem: BLEU is underspecified

“BLEU” does not signify any one thing, but refers to a constellation of parameterized methods. Among these parameters are:

  • the number of references used;

  • for multi-reference settings, the computation of the length penalty;

  • the maximum n-gram length; and

  • smoothing applied to 0-count n-grams.

It is true that many of these are often not problems in practice. Most often, there is only one reference, and the length penalty calculation is therefore moot. The maximum n-gram length is virtually always set to four, and since BLEU is corpus level, it is rare that there are any zero counts.

But it is also true that people use BLEU scores as very rough guides to MT performance across test sets and languages (comparing, for example, Chinese and Arabic). And traps exist. For example, WMT 2017 includes two references for English–Finnish. Scoring the online-B system with one reference produces a BLEU score of 22.04, and with two, 25.25. How sure are you that the results in that paper you just reviewed with those good EN-FI results were using just one reference?
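To make the parameters listed above concrete, here is a minimal sketch of corpus-level BLEU over text that is already tokenized. It is not the code of mteval-v13a.pl or SacreBLEU: the function names, the add-constant `smooth` argument, and the choice of the reference closest in length for the length penalty are all assumptions made for illustration, and other implementations choose differently.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of order n in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(hyps, refs_per_hyp, max_n=4, smooth=0.0):
    """Corpus-level BLEU (0..100) over pre-tokenized segments.

    hyps:          list of token lists, one per segment
    refs_per_hyp:  per segment, a list of one or more reference token lists
    max_n:         maximum n-gram length (an assumption-free BLEU parameter)
    smooth:        hypothetical smoothing: replaces a zero match count
    """
    matches = [0.0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, refs in zip(hyps, refs_per_hyp):
        hyp_len += len(hyp)
        # Length penalty choice: closest reference length, ties to the shorter.
        ref_len += min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
        for n in range(1, max_n + 1):
            hyp_counts = ngrams(hyp, n)
            # Clip each n-gram by its largest count in any single reference.
            max_ref = Counter()
            for r in refs:
                for gram, count in ngrams(r, n).items():
                    max_ref[gram] = max(max_ref[gram], count)
            matches[n - 1] += sum(min(c, max_ref[g]) for g, c in hyp_counts.items())
            totals[n - 1] += sum(hyp_counts.values())
    log_precision = 0.0
    for m, t in zip(matches, totals):
        if m == 0:
            m = smooth
        if m == 0 or t == 0:
            return 0.0
        log_precision += math.log(m / t) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * brevity * math.exp(log_precision)


hyp = "the cat is on the mat .".split()
ref_a = "the cat sat on the mat .".split()
ref_b = "there is a cat on the mat .".split()

one_ref = corpus_bleu([hyp], [[ref_a]])
two_refs = corpus_bleu([hyp], [[ref_a, ref_b]])
bigram_only = corpus_bleu([hyp], [[ref_a]], max_n=2)
print(f"1 ref: {one_ref:.2f}  2 refs: {two_refs:.2f}  max_n=2: {bigram_only:.2f}")
```

On this toy segment, adding the second reference raises the score, and so does lowering the maximum n-gram length. Neither change says anything about the quality of the hypothesis, which is why these settings need to be reported alongside the score.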

2.2 Problem: Different reference preprocessings cannot be compared

The first problem dealt with parameters used in BLEU scores, and was more theoretical. We now discuss a second problem, preprocessing, and demonstrate its existence in practice.

Preprocessing includes input text modifications such as normalization (e.g., collapsing punctuation, removing special characters), tokenization (e.g., splitting off punctuation), compound-splitting, the removal of case, and so on. Its general goal is to deliver meaningful white-space delimited tokens to the MT system. Of these, tokenization is one of the most important and central. This is because BLEU is a precision metric, and changing the reference processing changes the set of n-grams against which system n-gram precision is computed. Rehbein and Genabith (2007) showed that the analogous use in the parsing community of $F_1$ scores as rough estimates of cross-lingual parsing difficulty was unreliable, for this exact reason.

We note that BLEU scores are often reported as being tokenized or detokenized. But for computing BLEU, both the system output and reference are always tokenized; what this distinction refers to is whether the reference preprocessing is user-supplied or metric-internal (i.e., handled by the code implementing the metric), respectively. And since BLEU scores can only be compared when the reference processing is the same, user-supplied preprocessing is error-prone and inadequate for comparing across papers.

Table 1 demonstrates the effect of computing BLEU scores with different reference tokenizations. We took a single system (online-B) from the WMT 2017 system outputs, and processed both it and the reference in the following ways:

  • basic. User-supplied preprocessing with the Moses tokenizer (Koehn et al., 2007), run with the arguments -q -no-escape -protected basic-protected-patterns -l LANG.

  • split. We also split compounds, as in Luong et al. (2015a), e.g., rich-text → rich - text. (This step is not mentioned in the paper, but see http://nlp.stanford.edu/projects/nmt for details.)

  • unk. All word types not appearing at least twice in the target side of the WMT training data (with “basic” tokenization) are mapped to UNK. This hypothetical scenario could easily happen if this common user-supplied preprocessing were inadvertently applied to the reference.

  • metric. Only the metric-internal tokenization of the official WMT scoring script, mteval-v13a.pl, is applied (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl).

The changes in each column show the effect these different schemes have, as high as 1.8 for one arc, and averaging around 1.0. The largest effect comes from the treatment of case, which is well known, yet many papers are not clear about whether they report cased or case-insensitive BLEU.
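The mechanism behind these differences can be illustrated with a small sketch: the same hypothesis and reference strings, processed by three toy tokenizers, give three different n-gram precisions. The regular-expression tokenizers below are hypothetical stand-ins written for this example; they are not the Moses tokenizer, the compound splitter of Luong et al. (2015a), or mteval-v13a.pl.

```python
import re
from collections import Counter


def tok_whitespace(text):
    """No tokenization beyond splitting on white space."""
    return text.split()


def tok_basic(text):
    """Toy tokenizer: split punctuation off words, keeping hyphenated words whole."""
    return re.findall(r"[\w-]+|[^\w\s]", text)


def tok_split(text):
    """Toy tokenizer: like tok_basic, but hyphens become separate tokens."""
    return re.findall(r"\w+|[^\w\s]", text)


def bigram_precision(hyp_tokens, ref_tokens):
    """Clipped bigram precision of a hypothesis against a single reference."""
    hyp = Counter(zip(hyp_tokens, hyp_tokens[1:]))
    ref = Counter(zip(ref_tokens, ref_tokens[1:]))
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / max(sum(hyp.values()), 1)


hypothesis = "The editor supports rich-text formatting, too."
reference = "The editor supports rich-text editing, too."

for tokenize in (tok_whitespace, tok_basic, tok_split):
    p = bigram_precision(tokenize(hypothesis), tokenize(reference))
    print(f"{tokenize.__name__:15s} bigram precision = {p:.3f}")
```

For this pair the three schemes yield bigram precisions of 3/5, 5/7, and 7/9, although the translation itself never changes. Finer tokenization produces more, and more easily matched, n-grams, so the reference processing has to be identical before two scores can be compared.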

Allowing the user to handle pre-processing of the reference has other traps. For example, many systems (particularly before sub-word splitting (Sennrich et al., 2016) was proposed) limited the vocabulary in their attempt to deal with unknown words. How sure are you that they didn’t apply the same unknown-word masking to the reference, making word matches much more likely? Such mistakes are easy to introduce. (The observations in this paper stem in part from an early version of the authors’ research workflow, which applied preprocessing to the reference, affecting scores by half a point.)
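A toy sketch shows how unknown-word masking inflates matches when it leaks into the reference. The vocabulary, the sentences, and the helper names are invented for this example, and unigram precision stands in for full BLEU.

```python
from collections import Counter


def mask_rare(tokens, vocab):
    """Replace every token that is not in the vocabulary with UNK."""
    return [t if t in vocab else "UNK" for t in tokens]


def unigram_precision(hyp, ref):
    """Clipped unigram precision of a hypothesis against a single reference."""
    hyp_counts, ref_counts = Counter(hyp), Counter(ref)
    matched = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matched / len(hyp)


# Hypothetical limited vocabulary: only frequent words survive.
vocab = {"the", "visited", "in", "."}

hyp = "the senator visited Ouagadougou in March .".split()
ref = "the minister visited Bamako in April .".split()

clean = unigram_precision(hyp, ref)
leaked = unigram_precision(mask_rare(hyp, vocab), mask_rare(ref, vocab))
print(f"reference untouched: {clean:.3f}   reference masked: {leaked:.3f}")
```

With the reference left alone, four of the seven hypothesis tokens match. Once the same masking is applied to both sides, every wrong content word becomes UNK and matches the UNK in the reference, so precision rises to 1.0 for a translation that gets every name wrong.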

2.3 Problem: Details are hard to come by

User-supplied reference processing precludes direct comparison of published numbers, but if enough detail is specified in the paper, it is at least possible to reconstruct comparable numbers. Unfortunately, this is not the trend, and even for meticulous researchers, it is often unwieldy to include this level of technical detail. In any case, it creates uncertainty and work for the reader. One has to read the experiments section, scour the footnotes, and look for other clues which are sometimes scattered throughout the paper. Figuring out what another team did is not easy.

paper                     configuration
Chiang (2005)             metric
Bahdanau et al. (2014)    (?)
Luong et al. (2015b)      (?) user or metric
Jean et al. (2015)        user
Wu et al. (2016)          (?) user or user
Vaswani et al. (2017)     (?) user or user
Gehring et al. (2017)     user, metric

Table 2: Benchmarks set by well-cited papers use different BLEU configurations (Table 1). Which one was used is often difficult to determine.

The variations in Table 1 are only some of the possible configurations, since there is no limit to the preprocessing that a group could apply. But assuming these represent common, concrete configurations, one might wonder how easy it is to determine which of them was used by a particular paper. In Table 2, we attempt to do this for a handful of influential papers in the literature. Not only are systems not comparable due to different schemes; in many cases, no easy determination can be made.

2.4 Summary

Figure 1: The proper pipeline for computing reported BLEU scores. White boxes denote user-supplied processing, and the black box, metric-supplied. The user should not touch the reference, while the metric applies its own processing to the system output and reference.

Reference tokenization must be identical in order for scores to be comparable (see Figure 1). The widespread use of user-supplied reference preprocessing prevents this, needlessly complicating comparisons. The lack of details about preprocessing pipelines exacerbates this problem. This situation should be fixed.

3 A way forward

3.1 The example of PARSEVAL

An instructive comparison is the PARSEVAL metric for computing parser accuracy (Black et al., 1991). PARSEVAL works by taking labeled spans of the form $(N, i, j)$ representing a nonterminal $N$ spanning a constituent from word $i$ to word $j$. These are extracted from the parser output and used to compute precision and recall against the gold-standard set taken from the correct parse tree. Precision and recall are then combined to compute the $F_1$ metric that is commonly reported and compared across parsing papers.
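The span-based computation can be sketched in a few lines. This is an illustration only, not the evalb implementation: each span is written as a (label, start, end) tuple, spans are compared as multisets, and none of evalb's edge-case handling is modeled.

```python
from collections import Counter


def parseval(predicted, gold):
    """Return (precision, recall, F1) over labeled spans (label, start, end)."""
    pred_counts, gold_counts = Counter(predicted), Counter(gold)
    # Multiset intersection: a span matches only if label and both ends agree.
    matched = sum((pred_counts & gold_counts).values())
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f1


# Hypothetical spans for a five-word sentence.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
predicted = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5), ("NP", 4, 5)]

precision, recall, f1 = parseval(predicted, gold)
print(f"P = {precision:.3f}  R = {recall:.3f}  F1 = {f1:.3f}")
```

Three of the five predicted spans appear in the gold set of four, giving a precision of 3/5, a recall of 3/4, and an F1 of 2/3.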

Computing parser $F_1$ is not without its own set of edge cases. Do we count the TOP (ROOT) node? What about -NONE-? Punctuation? Should any labels be considered equivalent? These boundary cases are resolved by that community’s adoption of a standard codebase, evalb (http://nlp.cs.nyu.edu/evalb/), which included a parameters file that answers each of these questions. (The configuration file, COLLINS.PRM, answers them as no, no, no, and ADVP=PRT.) This has facilitated thirty years of cross-paper comparisons on treebanks in the parsing community, in some cases (WSJ23) on the same test set (Marcus et al., 1993).

3.2 Existing scripts

Moses (http://statmt.org/moses) has a number of scoring scripts. Unfortunately, each of them has problems. Moses’ multi-bleu.perl cannot be used because it requires user-supplied preprocessing. The same is true of another evaluation framework, MultEval (Clark et al., 2011), which explicitly advocates for user-supplied tokenization (https://github.com/jhclark/multeval). A good candidate is Moses’ mteval-v13a.pl, which makes use of metric-internal preprocessing and is used in the annual WMT evaluations. However, this script requires the data to be wrapped into XML. Nematus (Sennrich et al., 2017) contains a version (multi-bleu-detok.perl) that removes the XML requirement. This is a good idea, but it still requires the user to manually handle the reference translations. A better approach is to keep the reference away from the user entirely.

3.3 SacreBLEU

SacreBLEU is a Python script that aims to treat BLEU with a bit more reverence:

  • It expects detokenized outputs, applying its own metric-internal preprocessing, and produces the same values as WMT;

  • it produces a short version string that documents the settings used; and

  • it automatically downloads and manages WMT (2008–2018) and IWSLT 2017 (Cettolo et al., 2017) test sets and processes them to plain text.
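The idea behind a version string can be sketched as follows: every setting that affects the score is serialized into one short, sorted string that can be pasted next to a reported number. The format, field names, and values below are hypothetical placeholders chosen for illustration and should not be read as SacreBLEU's actual output.

```python
def make_signature(settings, metric="BLEU"):
    """Join sorted key.value pairs into one reproducible string."""
    parts = [f"{key}.{settings[key]}" for key in sorted(settings)]
    return "+".join([metric] + parts)


def parse_signature(signature):
    """Invert make_signature: return (metric name, dict of string values)."""
    metric, *parts = signature.split("+")
    return metric, dict(part.split(".", 1) for part in parts)


# Hypothetical settings for one evaluation run.
settings = {
    "tok": "13a",
    "case": "mixed",
    "numrefs": 1,
    "lang": "en-de",
    "test": "wmt17",
    "smooth": "exp",
}

signature = make_signature(settings)
print(signature)
print(parse_signature(signature))
```

Sorting the keys makes the string independent of the order in which settings were supplied, so two runs with the same configuration always produce the same signature. This simple format assumes that keys contain no "." and that neither keys nor values contain "+".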

SacreBLEU can be installed via the Python package management system:

    pip3 install sacrebleu

It is open source software under the Apache 2.0 license (https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu).

4 Summary

Machine translation benefits from the regular introduction of test sets for many different language arcs, from academic, government, and industry sources. This should make it easy to share and compare scores on a constant set of fresh data. It is a shame, therefore, that we are in a situation where we cannot in fact easily do so. One might be tempted to shrug this off as an unimportant detail, but as we have shown, these differences are in fact quite important, resulting in large variances in the score that are often much higher than the gains reported by a new method.

Fixing the problem is relatively simple. Groups should only report BLEU computed using a metric-internal tokenization and preprocessing scheme for the reference. With the reference processed the same way every time, scores can be directly compared across papers. We recommend the version used by WMT, and provide a new tool that makes it even easier.