Scientific Credibility of Machine Translation Research: A Meta-Evaluation of 769 Papers

06/29/2021
by Benjamin Marie, et al.

This paper presents the first large-scale meta-evaluation of machine translation (MT). We annotated MT evaluations conducted in 769 research papers published from 2010 to 2020. Our study shows that practices for automatic MT evaluation have changed dramatically during the past decade and follow concerning trends. An increasing number of MT evaluations rely exclusively on differences between BLEU scores to draw conclusions, without performing any statistical significance testing or human evaluation, even though at least 108 metrics claiming to be better than BLEU have been proposed. MT evaluations in recent papers tend to copy automatic metric scores from previous work and compare against them to claim the superiority of a method or an algorithm, without confirming either that exactly the same training, validation, and test data have been used or that the metric scores are comparable. Furthermore, tools for reporting standardized metric scores are still far from being widely adopted by the MT community. After showing how the accumulation of these pitfalls leads to dubious evaluation, we propose a guideline to encourage better automatic MT evaluation, along with a simple meta-evaluation scoring method to assess its credibility.
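As an illustration of the kind of statistical significance testing the abstract refers to, the following is a minimal sketch of paired bootstrap resampling, a test commonly used in MT evaluation. It is not taken from the paper. The function name `paired_bootstrap` and its parameters are hypothetical, and the sketch assumes each system already has one score per test segment, so that the system-level score is the mean of the segment scores. Corpus-level BLEU is not a mean of sentence scores, so a real comparison would recompute the corpus-level metric on each resampled test set instead.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=1000, seed=0):
    """Return the fraction of resampled test sets on which system A's
    mean segment score is strictly higher than system B's.

    Assumption: scores_a[i] and scores_b[i] are scores of the two systems
    on the same i-th test segment (hypothetical input format).
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same segments")
    n = len(scores_a)
    if n == 0:
        raise ValueError("at least one segment is required")

    rng = random.Random(seed)  # fixed seed so the result is reproducible
    wins = 0
    for _ in range(n_resamples):
        # Draw n segment indices with replacement; the same indices are
        # used for both systems, which is what makes the test paired.
        idx = [rng.randrange(n) for _ in range(n)]
        mean_a = sum(scores_a[i] for i in idx) / n
        mean_b = sum(scores_b[i] for i in idx) / n
        if mean_a > mean_b:
            wins += 1
    return wins / n_resamples
```

A win fraction close to 1.0 suggests that A's advantage does not depend on which segments happen to be in the test set, while a value near 0.5 suggests the observed difference could easily be due to chance. One common convention is to treat `1 - win_fraction` as an approximate p-value, though the threshold used should be stated alongside the result.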

research
11/30/2020

Machine Translation of Novels in the Age of Transformer

In this chapter we build a machine translation (MT) system tailored to t...
research
05/23/2023

Ties Matter: Modifying Kendall's Tau for Modern Metric Meta-Evaluation

Kendall's tau is frequently used to meta-evaluate how well machine trans...
research
12/20/2022

IndicMT Eval: A Dataset to Meta-Evaluate Machine Translation metrics for Indian Languages

The rapid growth of machine translation (MT) systems has necessitated co...
research
10/20/2022

Searching for a higher power in the human evaluation of MT

In MT evaluation, pairwise comparisons are conducted to identify the bet...
research
04/23/2018

A Call for Clarity in Reporting BLEU Scores

The field of machine translation is blessed with new challenges resultin...
research
01/22/2021

Evaluation Discrepancy Discovery: A Sentence Compression Case-study

Reliable evaluation protocols are of utmost importance for reproducible ...
research
05/30/2023

Breeding Machine Translations: Evolutionary approach to survive and thrive in the world of automated evaluation

We propose a genetic algorithm (GA) based method for modifying n-best li...