BLEU might be Guilty but References are not Innocent

04/13/2020
by Markus Freitag, et al.

The quality of automatic metrics for machine translation has been increasingly called into question, especially for high-quality systems. This paper demonstrates that, while the choice of metric is important, the nature of the references is also critical. We study different methods to collect references and compare their value in automated evaluation by reporting correlation with human evaluation for a variety of systems and metrics. Motivated by the finding that typical references exhibit poor diversity, concentrating around translationese language, we develop a paraphrasing task for linguists to perform on existing reference translations, which counteracts this bias. Our method yields higher correlation with human judgment not only for the submissions of WMT 2019 English to German, but also for back-translation- and APE-augmented MT output, which have been shown to have low correlation with automatic metrics using standard references. We demonstrate that our methodology improves correlation with all modern evaluation metrics we look at, including embedding-based methods. To complete this picture, we reveal that multi-reference BLEU does not improve the correlation for high-quality output, and present an alternative multi-reference formulation that is more effective.
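To make the closing point concrete, the standard multi-reference BLEU the abstract critiques clips each n-gram count by its maximum count across all references and uses the closest reference length for the brevity penalty. A minimal sentence-level sketch of that standard formulation (not the paper's proposed alternative, whose details are in the full text):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def multi_ref_bleu(hypothesis, references, max_n=4):
    """Sentence-level BLEU against multiple references.

    Standard multi-reference handling: each hypothesis n-gram count is
    clipped by the maximum count of that n-gram over all references, and
    the brevity penalty uses the reference length closest to the
    hypothesis length.
    """
    hyp = hypothesis.split()
    refs = [r.split() for r in references]

    log_precision = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngrams(hyp, n)
        # Clip each n-gram by its maximum count across references.
        max_ref_counts = Counter()
        for r in refs:
            for g, c in ngrams(r, n).items():
                max_ref_counts[g] = max(max_ref_counts[g], c)
        clipped = sum(min(c, max_ref_counts[g]) for g, c in hyp_counts.items())
        total = max(sum(hyp_counts.values()), 1)
        if clipped == 0:
            return 0.0  # geometric mean is zero if any precision is zero
        log_precision += math.log(clipped / total) / max_n

    # Brevity penalty against the closest reference length.
    closest = min((abs(len(r) - len(hyp)), len(r)) for r in refs)[1]
    bp = 1.0 if len(hyp) >= closest else math.exp(1 - closest / max(len(hyp), 1))
    return bp * math.exp(log_precision)
```

A hypothesis identical to any single reference scores 1.0 regardless of how many other references are supplied, which illustrates the paper's observation: adding references only helps if they contribute genuinely diverse wording rather than near-duplicates of the same translationese.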

Related research:

- Explicit Representation of the Translation Space: Automatic Paraphrasing for Machine Translation Evaluation (04/30/2020)
- Human-Paraphrased References Improve Neural Machine Translation (10/20/2020)
- BLEU Neighbors: A Reference-less Approach to Automatic Evaluation (04/27/2020)
- Sentence-Level Fluency Evaluation: References Help, But Can Be Spared! (09/24/2018)
- Improving Metrics for Speech Translation (05/22/2023)
- KoBE: Knowledge-Based Machine Translation Evaluation (09/23/2020)
- Automatic Reference-Based Evaluation of Pronoun Translation Misses the Point (08/13/2018)
