The Inside Story: Towards Better Understanding of Machine Translation Neural Evaluation Metrics

05/19/2023
by   Ricardo Rei, et al.
0

Neural metrics for machine translation evaluation, such as COMET, exhibit significant improvements in their correlation with human judgments, as compared to traditional metrics based on lexical overlap, such as BLEU. Yet, neural metrics are, to a great extent, "black boxes" returning a single sentence-level score without transparency about the decision-making process. In this work, we develop and compare several neural explainability methods and demonstrate their effectiveness for interpreting state-of-the-art fine-tuned neural metrics. Our study reveals that these metrics leverage token-level information that can be directly attributed to translation errors, as assessed through comparison of token-level neural saliency maps with Multidimensional Quality Metrics (MQM) annotations and with synthetically-generated critical translation errors. To ease future research, we release our code at: https://github.com/Unbabel/COMET/tree/explainable-metrics.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
05/30/2023

BLEU Meets COMET: Combining Lexical and Neural Metrics Towards Robust Machine Translation Evaluation

Although neural-based machine translation evaluation metrics, such as CO...
research
12/20/2022

BMX: Boosting Machine Translation Metrics with Explainability

State-of-the-art machine translation evaluation metrics are based on bla...
research
11/05/2020

Detecting Hallucinated Content in Conditional Neural Sequence Generation

Neural sequence models can generate highly fluent sentences but recent s...
research
07/06/2023

BLEURT Has Universal Translations: An Analysis of Automatic Metrics by Minimum Risk Training

Automatic metrics play a crucial role in machine translation. Despite th...
research
11/02/2019

Machine Translation Evaluation using Bi-directional Entailment

In this paper, we propose a new metric for Machine Translation (MT) eval...
research
10/25/2022

DEMETR: Diagnosing Evaluation Metrics for Translation

While machine translation evaluation metrics based on string overlap (e....
research
11/17/2021

Minimum Bayes Risk Decoding with Neural Metrics of Translation Quality

This work applies Minimum Bayes Risk (MBR) decoding to optimize diverse ...

Please sign up or login with your details

Forgot password? Click here to reset