Science is the process of formulating hypotheses, making predictions, and measuring their outcomes. In machine translation research, the predictions are made by models whose development is the focus of the research, and the measurement, more often than not, is done via BLEU (Papineni et al., 2002). BLEU’s relative language independence, its ease of computation, and its reasonable correlation with human judgments have led to its adoption as the dominant metric for machine translation research. On the whole, it has been a boon to the community, providing a fast and cheap way for researchers to gauge the performance of their models. Together with larger-scale controlled manual evaluations, BLEU has shepherded the field through a decade and a half of quality improvements (Graham et al., 2014).
This is of course not to claim there are no problems with BLEU. Its weaknesses abound, and much has been written about them (cf. Callison-Burch et al. (2006)). This paper is not, however, concerned with the shortcomings of using BLEU as a proxy for human evaluation of quality; instead, our goal is to bring attention to problems with the reporting of BLEU scores. These can be summarized as follows:
BLEU is not a single metric, but requires a number of parameters (§2.1).
Preprocessing schemes have a large effect on scores (§2.2). Importantly, BLEU scores computed against differently-processed references are not comparable.
Papers vary in the hidden parameters and schemes they use, yet rarely report them (§2.3). Even when they do, it is often hard to discover the details.
Together, these issues make it difficult to evaluate and compare BLEU scores across papers, impeding comparison and replication. We quantify these issues and show that they are serious, with variances bigger than many reported gains. In particular, we identify user-supplied reference tokenizations as a source of incompatibility. As a solution, we suggest the community use only “detokenized” reference processing, as done by the annual Conference on Machine Translation (WMT; Bojar et al., 2017). In support of this, we release a Python script, SacreBLEU (installable via pip3 install sacrebleu), which computes this metric and reports a version string recording the parameters. It also provides a number of other features, such as automatic download and management of common test sets.
2 Problem Description
2.1 Problem: BLEU is underspecified
“BLEU” does not signify any one thing, but rather refers to a constellation of parameterized methods. Among these parameters are:
the number of references used;
for multi-reference settings, the computation of the length penalty;
the maximum n-gram length; and
smoothing applied to 0-count n-grams.
It is true that many of these are often not problems in practice. Most often, there is only one reference, and the length penalty calculation is therefore moot. The maximum n-gram length is virtually always set to four, and since BLEU is corpus level, it is rare that there are any zero counts.
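To make these parameters concrete, here is a self-contained sketch of corpus-level BLEU. This is not the official mteval implementation (real implementations differ in tokenization, length-penalty details, and smoothing variants), and the sentences are made up; it only exposes the maximum n-gram order and a simple add-k smoothing constant as arguments to show how they move the score:

```python
# Sketch of corpus-level BLEU with explicit parameters (not the official
# implementation): max_n is the maximum n-gram order, smooth is an add-k
# constant applied to zero-count n-gram orders.
from collections import Counter
import math

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hyps, refs, max_n=4, smooth=0.0):
    """hyps, refs: parallel lists of token lists (single reference)."""
    hyp_len = sum(len(h) for h in hyps)
    ref_len = sum(len(r) for r in refs)
    log_prec = 0.0
    for n in range(1, max_n + 1):
        match, total = 0, 0
        for h, r in zip(hyps, refs):
            r_counts = ngrams(r, n)
            # clipped n-gram matches, as in BLEU's modified precision
            match += sum(min(c, r_counts[g]) for g, c in ngrams(h, n).items())
            total += max(0, len(h) - n + 1)
        if match == 0:
            match = smooth           # add-k smoothing of a zero count
            if match == 0:
                return 0.0           # unsmoothed BLEU: any zero precision -> 0
        log_prec += math.log(match / total) / max_n
    # brevity penalty
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100 * bp * math.exp(log_prec)

hyp = ["the cat sat on the mat".split()]
ref = ["the cat was sitting on the mat".split()]
print(corpus_bleu(hyp, ref, max_n=4))                # 0.0: no 4-gram matches
print(corpus_bleu(hyp, ref, max_n=4, smooth=0.1))    # nonzero once smoothed
print(corpus_bleu(hyp, ref, max_n=2))                # much higher at order 2
```

On this toy pair, the same output yields three different “BLEU scores” depending on the smoothing and maximum order, which is why these settings need to be reported.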
But it is also true that people use BLEU scores as very rough guides to MT performance across test sets and languages (comparing, for example, Chinese and Arabic). And traps exist. For example, WMT 2017 includes two references for English–Finnish. Scoring the online-B system with one reference produces a BLEU score of 22.04, and with two, 25.25. How sure are you that the paper you just reviewed with those good EN–FI results used just one reference?
2.2 Problem: Different reference preprocessings cannot be compared
The first problem dealt with parameters used in BLEU scores, and was more theoretical. We now discuss a second problem, preprocessing, and demonstrate its existence in practice.
Preprocessing includes input text modifications such as normalization (e.g., collapsing punctuation, removing special characters), tokenization (e.g., splitting off punctuation), compound-splitting, the removal of case, and so on. Its general goal is to deliver meaningful white-space delimited tokens to the MT system. Of these, tokenization is one of the most important and central. This is because BLEU is a precision metric, and changing the reference processing changes the set of n-grams against which system n-gram precision is computed. Rehbein and Genabith (2007) showed that the analogous use in the parsing community of F1 scores as rough estimates of cross-lingual parsing difficulty was unreliable, for this exact reason. We note that BLEU scores are often reported as being tokenized or detokenized. But for computing BLEU, both the system output and reference are always tokenized; what this distinction refers to is whether the reference preprocessing is user-supplied or metric-internal (i.e., handled by the code implementing the metric), respectively. And since BLEU scores can only be compared when the reference processing is the same, user-supplied preprocessing is error-prone and inadequate for comparing across papers.
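A tiny illustration of this point, using a made-up sentence and bigram precision only: the same system output scores perfectly against one tokenization of the reference and not at all against another.

```python
# The same reference string, tokenized two ways, yields different n-gram
# sets, so the same system output gets different measured precisions.
from collections import Counter
import re

def bigrams(tokens):
    return Counter(zip(tokens, tokens[1:]))

ref = 'He said, "No."'
hyp = 'He said , " No . "'.split()   # system output, already tokenized

# tokenization A: split punctuation off the words
ref_a = re.sub(r'([.,"])', r' \1 ', ref).split()
# tokenization B: whitespace only
ref_b = ref.split()

for name, ref_toks in [("split punctuation:", ref_a), ("whitespace only:", ref_b)]:
    ref_counts = bigrams(ref_toks)
    matches = sum(min(c, ref_counts[g]) for g, c in bigrams(hyp).items())
    print(name, matches, "of", len(hyp) - 1, "bigrams match")
```

With punctuation split off, all six of the hypothesis bigrams match; with whitespace-only tokenization, none do.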
Table 1 demonstrates the effect of computing BLEU scores with different reference tokenizations. We took a single system (online-B) from the WMT 2017 system outputs, and processed both it and the reference in the following ways:
basic. User-supplied preprocessing with the Moses tokenizer (Koehn et al., 2007), run with the arguments -q -no-escape -protected basic-protected-patterns -l LANG.
unk. All word types not appearing at least twice in the target side of the WMT training data (with “basic” tokenization) are mapped to UNK. This hypothetical scenario could easily happen if this common user-supplied preprocessing were inadvertently applied to the reference.
metric. Only the metric-internal tokenization of the official WMT scoring script, mteval-v13a.pl, is applied (https://github.com/moses-smt/mosesdecoder/blob/master/scripts/generic/mteval-v13a.pl).
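For illustration, here is a simplified Python approximation of the metric-internal tokenization that mteval-v13a.pl applies. The actual Perl script handles additional normalization (skipped-segment markers, end-of-line hyphenation, and more), so treat this as a sketch of the kind of rules involved, not a reimplementation:

```python
# Simplified approximation of mteval-v13a.pl's internal tokenization.
import re

def tokenize_v13a_like(line):
    # unescape a few XML entities, as the WMT data format uses them
    line = line.replace('&quot;', '"').replace('&amp;', '&')
    line = line.replace('&lt;', '<').replace('&gt;', '>')
    # split off most punctuation symbols
    line = re.sub(r'([\{-\~\[-\` -\&\(-\+\:-\@\/])', r' \1 ', line)
    # split period/comma unless they sit between digits (e.g. "1,000")
    line = re.sub(r'([^0-9])([\.,])', r'\1 \2 ', line)
    line = re.sub(r'([\.,])([^0-9])', r' \1 \2', line)
    # split a dash that follows a digit
    line = re.sub(r'([0-9])(-)', r'\1 \2 ', line)
    return line.split()

print(tokenize_v13a_like('The "1,000-strong" crowd cheered.'))
```

Note how the comma inside “1,000” survives while the sentence-final period is split off; user-supplied tokenizers typically make different choices on exactly such cases, which is where score incompatibilities creep in.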
The changes in each column show the effect of these different schemes, as high as 1.8 BLEU for one language arc and averaging around 1.0. The largest effect is the treatment of case, which is well known; yet many papers are not clear about whether they report cased or case-insensitive BLEU.
Allowing the user to handle pre-processing of the reference has other traps. For example, many systems (particularly before sub-word splitting (Sennrich et al., 2016) was proposed) limited the vocabulary in their attempt to deal with unknown words. How sure are you that they didn’t apply the same unknown-word masking to the reference, making word matches much more likely? Such mistakes are easy to introduce. (The observations in this paper stem in part from an early version of the authors’ research workflow, which applied preprocessing to the reference, affecting scores by half a point.)
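The masking trap can be demonstrated in a few lines. The vocabulary and sentences here are hypothetical, and per-position token matches stand in for unigram precision:

```python
# If rare-word masking is (wrongly) applied to the reference as well as the
# hypothesis, every masked token becomes a guaranteed match.
vocab = {"the", "committee", "met", "on", "."}   # hypothetical training vocab

def mask(tokens):
    return [t if t in vocab else "<unk>" for t in tokens]

hyp = "the committee met on Friday .".split()
ref = "the committee met on Tuesday .".split()

# per-position matches, a stand-in for unigram precision
plain  = sum(h == r for h, r in zip(hyp, ref))
masked = sum(h == r for h, r in zip(mask(hyp), mask(ref)))
print(plain, "matches without masking;", masked, "with masking the reference")
```

The wrong word “Friday” is rewarded as a match once both sides are masked to `<unk>`, so the error silently inflates the score.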
2.3 Problem: Details are hard to come by
User-supplied reference processing precludes direct comparison of published numbers, but if enough detail is specified in the paper, it is at least possible to reconstruct comparable numbers. Unfortunately, this is not the trend, and even for meticulous researchers, it is often unwieldy to include this level of technical detail. In any case, it creates uncertainty and work for the reader. One has to read the experiments section, scour the footnotes, and look for other clues which are sometimes scattered throughout the paper. Figuring out what another team did is not easy.
| Paper | Reference processing |
|---|---|
| Bahdanau et al. (2014) | (?) |
| Luong et al. (2015b) | (?) user or metric |
| Jean et al. (2015) | user |
| Wu et al. (2016) | (?) user or user |
| Vaswani et al. (2017) | (?) user or user |
| Gehring et al. (2017) | user, metric |
The variations in Table 1 are only some of the possible configurations, since there is no limit to the preprocessing that a group could apply. But assuming these represent common, concrete configurations, one might wonder how easy it is to determine which of them was used by a particular paper. In Table 2, we attempt to do this for a handful of influential papers in the literature. Not only are systems not comparable due to different schemes; in many cases, no easy determination can be made.
Reference tokenization must be identical in order for scores to be comparable (see Figure 1 below). The widespread use of user-supplied reference preprocessing prevents this, needlessly complicating comparisons. The lack of details about preprocessing pipelines exacerbates this problem. This situation should be fixed.
3 A way forward
3.1 The example of PARSEVAL
An instructive comparison is the PARSEVAL metric for computing parser accuracy (Black et al., 1991). PARSEVAL works by taking labeled spans of the form (N, i, j), representing a nonterminal N spanning a constituent from word i to word j. These are extracted from the parser output and used to compute precision and recall against the gold-standard set taken from the correct parse tree. Precision and recall are then combined to compute the F1 metric that is commonly reported and compared across parsing papers.
Computing parser F1 is not without its own set of edge cases. Do we count the TOP (ROOT) node? Do we count punctuation? Do we count empty elements? Should any labels be considered equivalent?
These boundary cases are resolved by that community’s adoption of a standard codebase, evalb (http://nlp.cs.nyu.edu/evalb/), which includes a parameters file that answers each of these questions. (The standard configuration file, COLLINS.PRM, answers them as no, no, no, and ADVP=PRT.) This has facilitated nearly thirty years of comparable cross-paper comparisons on treebanks in the parsing community, in some cases (WSJ section 23; Marcus et al., 1993) on the very same test set.
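For a sense of what this looks like in practice, evalb parameter files consist of simple directives. The excerpt below is illustrative, written in the style of COLLINS.PRM rather than copied verbatim from it:

```text
## Illustrative evalb parameter file, in the style of COLLINS.PRM
LABELED 1            ## compare (label, i, j) triples, not just spans
DELETE_LABEL TOP     ## do not count the root node
DELETE_LABEL -NONE-  ## do not count empty elements
DELETE_LABEL ,       ## do not count punctuation
DELETE_LABEL .
EQ_LABEL ADVP PRT    ## treat these two labels as equivalent
```

Because every paper runs the same binary with the same parameter file, the boundary cases are settled once, for everyone.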
3.2 Existing scripts
Moses (http://statmt.org/moses) has a number of scoring scripts. Unfortunately, each of them has problems. multi-bleu.perl cannot be used because it requires user-supplied preprocessing. The same is true of another evaluation framework, MultEval (Clark et al., 2011; https://github.com/jhclark/multeval), which explicitly advocates for user-supplied tokenization.
A good candidate is Moses’ mteval-v13a.pl, which makes use of metric-internal preprocessing and is used in the annual WMT evaluations. However, this script requires the data to be wrapped in XML. Nematus (Sennrich et al., 2017) contains a version (multi-bleu-detok.perl) that removes the XML requirement. This is a good idea, but it still requires the user to manually handle the reference translations. A better approach is to keep the reference away from the user entirely.
SacreBLEU is a Python script that aims to treat BLEU with a bit more reverence:
It expects detokenized outputs, applying its own metric-internal preprocessing, and produces the same values as WMT;
it produces a short version string that documents the settings used; and
it automatically downloads and manages WMT (2008–2018) and IWSLT 2017 (Cettolo et al., 2017) test sets and processes them to plain text.
SacreBLEU can be installed via the Python package management system:
pip3 install sacrebleu
It is open source software under the Apache 2.0 license (https://github.com/awslabs/sockeye/tree/master/contrib/sacrebleu).
4 Summary
Machine translation benefits from the regular introduction of test sets for many different language arcs, from academic, government, and industry sources. This should make it easy to share and compare scores on a constant set of fresh data. It is a shame, therefore, that we are in a situation where we cannot in fact easily do so. One might be tempted to shrug this off as an unimportant detail, but as we have shown, these differences are in fact quite important, resulting in variances in the score that are often much larger than the gains reported by a new method.
Fixing the problem is relatively simple. Groups should only report BLEU computed using a metric-internal tokenization and preprocessing scheme for the reference. With the reference processed the same way every time, scores can be directly compared across papers. We recommend the version used by WMT, and provide a new tool that makes it even easier.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR, abs/1409.0473.
- Black et al. (1991) E. Black, S. Abney, D. Flickenger, C. Gdaniec, R. Grishman, P. Harrison, D. Hindle, R. Ingria, F. Jelinek, J. Klavans, M. Liberman, M. Marcus, S. Roukos, B. Santorini, and T. Strzalkowski. 1991. A procedure for quantitatively comparing the syntactic coverage of English grammars. In Speech and Natural Language: Proceedings of a Workshop Held at Pacific Grove, California, February 19-22, 1991.
- Bojar et al. (2017) Ondřej Bojar, Rajen Chatterjee, Christian Federmann, Yvette Graham, Barry Haddow, Shujian Huang, Matthias Huck, Philipp Koehn, Qun Liu, Varvara Logacheva, Christof Monz, Matteo Negri, Matt Post, Raphael Rubino, Lucia Specia, and Marco Turchi. 2017. Findings of the 2017 conference on machine translation (WMT17). In Proceedings of the Second Conference on Machine Translation, pages 169–214. Association for Computational Linguistics.
- Callison-Burch et al. (2006) Chris Callison-Burch, Miles Osborne, and Philipp Koehn. 2006. Re-evaluating the role of BLEU in machine translation research. In 11th Conference of the European Chapter of the Association for Computational Linguistics.
- Cettolo et al. (2017) Mauro Cettolo, Marcello Federico, Luisa Bentivogli, Jan Niehues, Sebastian Stüker, Katsuhito Sudoh, Koichiro Yoshino, and Christian Federmann. 2017. Overview of the IWSLT 2017 evaluation campaign. In 14th International Workshop on Spoken Language Translation, pages 2–14, Tokyo, Japan.
- Chiang (2005) David Chiang. 2005. A hierarchical phrase-based model for statistical machine translation. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL’05), pages 263–270. Association for Computational Linguistics.
- Clark et al. (2011) Jonathan H. Clark, Chris Dyer, Alon Lavie, and Noah A. Smith. 2011. Better hypothesis testing for statistical machine translation: Controlling for optimizer instability. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 176–181. Association for Computational Linguistics.
- Gehring et al. (2017) Jonas Gehring, Michael Auli, David Grangier, and Yann Dauphin. 2017. A convolutional encoder model for neural machine translation. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 123–135. Association for Computational Linguistics.
- Graham et al. (2014) Yvette Graham, Timothy Baldwin, Alistair Moffat, and Justin Zobel. 2014. Is machine translation getting better over time? In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, pages 443–451. Association for Computational Linguistics.
- Jean et al. (2015) Sébastien Jean, Kyunghyun Cho, Roland Memisevic, and Yoshua Bengio. 2015. On using very large target vocabulary for neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1–10. Association for Computational Linguistics.
- Koehn et al. (2007) Philipp Koehn, Hieu Hoang, Alexandra Birch, Chris Callison-Burch, Marcello Federico, Nicola Bertoldi, Brooke Cowan, Wade Shen, Christine Moran, Richard Zens, Chris Dyer, Ondrej Bojar, Alexandra Constantin, and Evan Herbst. 2007. Moses: Open source toolkit for statistical machine translation. In Proceedings of the 45th Annual Meeting of the Association for Computational Linguistics Companion Volume Proceedings of the Demo and Poster Sessions, pages 177–180. Association for Computational Linguistics.
- Luong et al. (2015a) Thang Luong, Hieu Pham, and Christopher D. Manning. 2015a. Effective approaches to attention-based neural machine translation. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1412–1421. Association for Computational Linguistics.
- Luong et al. (2015b) Thang Luong, Ilya Sutskever, Quoc Le, Oriol Vinyals, and Wojciech Zaremba. 2015b. Addressing the rare word problem in neural machine translation. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 11–19. Association for Computational Linguistics.
- Marcus et al. (1993) Mitchell P. Marcus, Beatrice Santorini, and Mary Ann Marcinkiewicz. 1993. Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, Volume 19, Number 2, June 1993, Special Issue on Using Large Corpora: II.
- Papineni et al. (2002) Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. 2002. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
- Rehbein and Genabith (2007) Ines Rehbein and Josef van Genabith. 2007. Treebank annotation schemes and parser evaluation for German. In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL).
- Sennrich et al. (2017) Rico Sennrich, Orhan Firat, Kyunghyun Cho, Alexandra Birch, Barry Haddow, Julian Hitschler, Marcin Junczys-Dowmunt, Samuel Läubli, Antonio Valerio Miceli Barone, Jozef Mokry, and Maria Nadejde. 2017. Nematus: a toolkit for neural machine translation. In Proceedings of the Software Demonstrations of the 15th Conference of the European Chapter of the Association for Computational Linguistics, pages 65–68. Association for Computational Linguistics.
- Sennrich et al. (2016) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016. Neural machine translation of rare words with subword units. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725. Association for Computational Linguistics.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. CoRR, abs/1706.03762.
- Wu et al. (2016) Yonghui Wu, Mike Schuster, Zhifeng Chen, Quoc V. Le, Mohammad Norouzi, Wolfgang Macherey, Maxim Krikun, Yuan Cao, Qin Gao, Klaus Macherey, Jeff Klingner, Apurva Shah, Melvin Johnson, Xiaobing Liu, Łukasz Kaiser, Stephan Gouws, Yoshikiyo Kato, Taku Kudo, Hideto Kazawa, Keith Stevens, George Kurian, Nishant Patil, Wei Wang, Cliff Young, Jason Smith, Jason Riesa, Alex Rudnick, Oriol Vinyals, Greg Corrado, Macduff Hughes, and Jeffrey Dean. 2016. Google’s neural machine translation system: Bridging the gap between human and machine translation. CoRR, abs/1609.08144.