Quantitative Fine-Grained Human Evaluation of Machine Translation Systems: a Case Study on English to Croatian

02/02/2018
by Filip Klubička, et al.

This paper presents a quantitative fine-grained manual evaluation approach to comparing the performance of different machine translation (MT) systems. We build upon the well-established Multidimensional Quality Metrics (MQM) error taxonomy and implement a novel method that assesses whether the differences in performance for MQM error types between different MT systems are statistically significant. We conduct a case study for English-to-Croatian, a language direction that involves translating into a morphologically rich language, for which we compare three MT systems belonging to different paradigms: pure phrase-based, factored phrase-based and neural. First, we design an MQM-compliant error taxonomy tailored to the relevant linguistic phenomena of Slavic languages, which makes the annotation process feasible and accurate. Errors in the MT outputs are then annotated by two annotators following this taxonomy. Subsequently, we carry out a statistical analysis which shows that the best-performing system (neural) reduces the errors produced by the worst system (pure phrase-based) by more than half (54%). Moreover, we conduct an additional analysis of agreement errors in which we distinguish between short-distance (phrase-level) and long-distance (sentence-level) errors. We find that phrase-based MT approaches are of limited use for long-distance agreement phenomena, for which neural MT is especially effective.
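To make the significance-testing step concrete, the sketch below illustrates how per-error-type counts from two annotated MT outputs could be compared. It assumes a paired approximate randomization (permutation) test over sentence-level error counts, a common choice in MT evaluation; it is offered purely as an illustration of the idea, not as the paper's exact statistical procedure. The function name and the toy data are hypothetical.

```python
import random

def approximate_randomization(counts_a, counts_b, trials=10000, seed=0):
    """Two-sided paired approximate randomization test.

    counts_a, counts_b: per-sentence counts of one MQM error type,
    produced by system A and system B on the same sentences.
    Returns an estimated p-value for the observed difference in totals.
    """
    assert len(counts_a) == len(counts_b)
    rng = random.Random(seed)
    observed = abs(sum(counts_a) - sum(counts_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        diff = 0
        for a, b in zip(counts_a, counts_b):
            # Randomly swap the paired counts, simulating the null
            # hypothesis that both systems are equally error-prone.
            if rng.random() < 0.5:
                a, b = b, a
            diff += a - b
        if abs(diff) >= observed:
            at_least_as_extreme += 1
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical per-sentence counts of one error type (e.g. Agreement)
# for a phrase-based and a neural system on the same ten sentences.
pbmt_errors = [2, 1, 3, 0, 2, 1, 4, 2, 1, 3]
nmt_errors  = [1, 0, 1, 0, 1, 0, 2, 1, 0, 1]

p = approximate_randomization(pbmt_errors, nmt_errors)
print(f"p-value for difference in Agreement errors: {p:.4f}")
```

In practice such a test would be run separately for each MQM error type, with a multiple-comparison correction if many types are tested at once.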

