A Call for Clarity in Reporting BLEU Scores
The field of machine translation is blessed with new challenges resulting from the regular production of fresh test sets in diverse settings. But it is also cursed with a lack of consensus on how to report scores from its dominant metric. Although people refer to "the" BLEU score, BLEU scores can vary wildly with changes in parameterization and, especially, in reference processing schemes, yet these details are often absent from papers or hard to determine. We quantify this variation, finding differences as high as 1.8 BLEU between commonly used configurations. Pointing to the success of the parsing community, we suggest machine translation researchers settle upon the BLEU scheme used by the annual Conference on Machine Translation (WMT), which does not permit user-supplied preprocessing of the references. We provide a new tool to facilitate this.
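The tool in question is SacreBLEU. As a minimal sketch of the intended workflow, the snippet below scores a system output against untokenized references using SacreBLEU's Python API; the hypothesis and reference strings are illustrative placeholders, and the exact API surface may differ across package versions.

# A minimal sketch using the SacreBLEU Python API (package: sacrebleu).
# The hypotheses and references below are illustrative placeholders.
import sacrebleu

hypotheses = ["The dog bit the man.", "It was not unexpected."]
references = ["The dog had bit the man.", "No one was surprised."]

# corpus_bleu applies SacreBLEU's own standard tokenization internally,
# so no user-supplied preprocessing of the references is required.
bleu = sacrebleu.corpus_bleu(hypotheses, [references])

print(bleu.score)  # the corpus-level BLEU score as a float
print(bleu)        # the full result object, including precision details

# Roughly equivalent command-line usage on detokenized files:
#   cat hyps.detok.txt | sacrebleu refs.detok.txt
# which also prints a signature recording the scoring configuration.

Computing the score on detokenized text with the tool's own fixed tokenization is what makes the resulting numbers comparable across papers.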