Reproducibility Issues for BERT-based Evaluation Metrics

by   Yanran Chen, et al.

Reproducibility is of utmost concern in machine learning and natural language processing (NLP). In the field of natural language generation (especially machine translation), the seminal paper of Post (2018) has pointed out problems of reproducibility of the dominant metric, BLEU, at the time of publication. Nowadays, BERT-based evaluation metrics considerably outperform BLEU. In this paper, we ask whether results and claims from four recent BERT-based metrics can be reproduced. We find that reproduction of claims and results often fails because of (i) heavy undocumented preprocessing involved in the metrics, (ii) missing code and (iii) reporting weaker results for the baseline metrics. (iv) In one case, the problem stems from correlating not to human scores but to a wrong column in the csv file, inflating scores by 5 points. Motivated by the impact of preprocessing, we then conduct a second study where we examine its effects more closely (for one of the metrics). We find that preprocessing can have large effects, especially for highly inflectional languages. In this case, the effect of preprocessing may be larger than the effect of the aggregation mechanism (e.g., greedy alignment vs. Word Mover Distance).


page 17

page 18

page 20

page 22


A global analysis of metrics used for measuring performance in natural language processing

Measuring the performance of natural language processing models is chall...

MENLI: Robust Evaluation Metrics from Natural Language Inference

Recently proposed BERT-based evaluation metrics perform well on standard...

Shades of BLEU, Flavours of Success: The Case of MultiWOZ

The MultiWOZ dataset (Budzianowski et al.,2018) is frequently used for b...

Out of the BLEU: how should we assess quality of the Code Generation models?

In recent years, researchers have created and introduced a significant n...

Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

Automatic evaluation of language generation systems is a well-studied pr...

Can we do that simpler? Simple, Efficient, High-Quality Evaluation Metrics for NLG

We explore efficient evaluation metrics for Natural Language Generation ...

Evaluating natural language processing models with generalization metrics that do not need access to any training or testing data

The search for effective and robust generalization metrics has been the ...

Please sign up or login with your details

Forgot password? Click here to reset