Tangled up in BLEU: Reevaluating the Evaluation of Automatic Machine Translation Evaluation Metrics

06/11/2020
by Nitika Mathur, et al.

Automatic metrics are fundamental to the development and evaluation of machine translation systems. Judging whether, and to what extent, automatic metrics concur with the gold standard of human evaluation is not a straightforward problem. We show that current methods for judging metrics are highly sensitive to the translations used for assessment, particularly the presence of outliers, which often leads to falsely confident conclusions about a metric's efficacy. Finally, we turn to pairwise system ranking, developing a method for thresholding performance improvement under an automatic metric against human judgements, which allows quantification of type I versus type II errors incurred, i.e., insignificant human differences in system quality that are accepted, and significant human differences that are rejected. Together, these findings suggest improvements to the protocols for metric evaluation and system performance evaluation in machine translation.
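As a rough illustration of the two evaluation problems the abstract describes, the sketch below shows (1) how a single outlier system can dominate the Pearson correlation between metric scores and human scores, and (2) how pairwise system comparisons made from a metric-score threshold can be tallied as type I and type II errors against human significance judgements. All scores, the MAD-based outlier rule, and the decision threshold here are illustrative assumptions, not the paper's data or exact method.

# Minimal sketch of metric-vs-human evaluation (illustrative, not the paper's code).
import numpy as np
from scipy.stats import pearsonr

def remove_outliers(scores, cutoff=2.5):
    """Flag systems whose human score is far from the median,
    using a robust z-score based on the median absolute deviation (MAD)."""
    scores = np.asarray(scores, dtype=float)
    mad = np.median(np.abs(scores - np.median(scores)))
    robust_z = 0.6745 * (scores - np.median(scores)) / (mad + 1e-12)
    return np.abs(robust_z) < cutoff

# Hypothetical system-level scores: human judgements and an automatic metric.
# The last system is a deliberately weak outlier.
human  = np.array([0.10, 0.12, 0.15, 0.18, 0.20, 0.22, -1.50])
metric = np.array([28.0, 28.5, 28.2, 29.0, 29.3, 29.1, 12.0])

keep = remove_outliers(human)
r_all, _  = pearsonr(metric, human)
r_trim, _ = pearsonr(metric[keep], human[keep])
print(f"Pearson r with outlier: {r_all:.2f}, without outlier: {r_trim:.2f}")

def error_counts(metric_delta, human_significant, threshold):
    """Count disagreements between a metric-delta decision rule and human judgements:
    type I = metric accepts a pair whose human difference is insignificant,
    type II = metric rejects a pair whose human difference is significant."""
    metric_delta = np.asarray(metric_delta, dtype=float)
    human_significant = np.asarray(human_significant, dtype=bool)
    claims_win = metric_delta > threshold
    type_1 = int(np.sum(claims_win & ~human_significant))
    type_2 = int(np.sum(~claims_win & human_significant))
    return type_1, type_2

# Hypothetical system pairs: metric score differences and whether the
# human difference was judged significant.
deltas    = np.array([0.2, 0.8, 1.5, 0.1, 2.3])
human_sig = np.array([False, False, True, True, True])
print(error_counts(deltas, human_sig, threshold=1.0))

Sweeping the threshold in the second step traces out the trade-off between the two error types, which is the kind of quantification the abstract refers to; the specific threshold of 1.0 above is only a placeholder.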
