Experts, Errors, and Context: A Large-Scale Study of Human Evaluation for Machine Translation

04/29/2021
by Markus Freitag et al.

Human evaluation of modern high-quality machine translation systems is a difficult problem, and there is increasing evidence that inadequate evaluation procedures can lead to erroneous conclusions. While there has been considerable research on human evaluation, the field still lacks a commonly-accepted standard procedure. As a step toward this goal, we propose an evaluation methodology grounded in explicit error analysis, based on the Multidimensional Quality Metrics (MQM) framework. We carry out the largest MQM research study to date, scoring the outputs of top systems from the WMT 2020 shared task in two language pairs using annotations provided by professional translators with access to full document context. We analyze the resulting data extensively, finding among other results a substantially different ranking of evaluated systems from the one established by the WMT crowd workers, exhibiting a clear preference for human over machine output. Surprisingly, we also find that automatic metrics based on pre-trained embeddings can outperform human crowd workers. We make our corpus publicly available for further research.
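To make the MQM-based scoring concrete, below is a minimal Python sketch of how per-segment error annotations (category plus severity spans marked by professional translators) can be aggregated into a system-level penalty. The category names and severity weights used here are illustrative assumptions for the sketch, not necessarily the exact weighting scheme adopted in the study; lower scores mean fewer or less severe errors.

```python
# Minimal sketch: aggregating MQM-style error annotations into scores.
# Weights below are illustrative assumptions, not the paper's official values.

# Hypothetical (category, severity) -> penalty table; "*" is the default category.
WEIGHTS = {
    ("Non-translation", "major"): 25.0,
    ("Fluency/Punctuation", "minor"): 0.1,
    ("*", "major"): 5.0,
    ("*", "minor"): 1.0,
}

def error_weight(category: str, severity: str) -> float:
    """Penalty for a single annotated error span."""
    return WEIGHTS.get((category, severity), WEIGHTS[("*", severity)])

def mqm_segment_score(errors: list[tuple[str, str]]) -> float:
    """Sum of penalties for all errors in one segment (0.0 = no errors found)."""
    return sum(error_weight(cat, sev) for cat, sev in errors)

def mqm_system_score(segments: list[list[tuple[str, str]]]) -> float:
    """Average per-segment penalty for a system; lower is better."""
    return sum(mqm_segment_score(errs) for errs in segments) / len(segments)

if __name__ == "__main__":
    # Toy example: one clean segment, one segment with a major accuracy error
    # and a minor punctuation issue.
    system_annotations = [
        [],
        [("Accuracy/Mistranslation", "major"), ("Fluency/Punctuation", "minor")],
    ]
    print(mqm_system_score(system_annotations))  # (0 + 5.1) / 2 = 2.55
```

Ranking systems by such penalty scores, rather than by direct-assessment ratings from crowd workers, is what drives the re-ranking reported in the abstract.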


