Curious Case of Language Generation Evaluation Metrics: A Cautionary Tale

10/26/2020
by   Ozan Caglayan, et al.

Automatic evaluation of language generation systems is a well-studied problem in Natural Language Processing. Although novel metrics are proposed every year, a few popular ones remain the de facto choices for evaluating tasks such as image captioning and machine translation, despite their known limitations. This is partly due to ease of use, and partly because researchers expect to see them and know how to interpret them. In this paper, we urge the community to consider more carefully how models are automatically evaluated, by demonstrating important failure cases across multiple datasets, language pairs, and tasks. Our experiments show that these metrics (i) usually prefer system outputs to human-authored texts, (ii) can be insensitive to correct translations of rare words, and (iii) can yield surprisingly high scores when given a single sentence as the system output for the entire test set.
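Failure case (iii) is easy to reproduce with any n-gram overlap metric. The sketch below, a minimal corpus-level BLEU implemented from scratch (clipped n-gram precision with a brevity penalty, no smoothing; the toy sentences are illustrative, not from the paper), shows that repeating one fixed sentence for every test instance can still earn a non-trivial score, because corpus-level statistics pool n-gram matches across the whole test set:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(references, hypotheses, max_n=4):
    """Minimal corpus-level BLEU: clipped n-gram precisions, geometric mean,
    brevity penalty. One reference per hypothesis, no smoothing."""
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for ref, hyp in zip(references, hypotheses):
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            h, r = ngrams(hyp, n), ngrams(ref, n)
            # Clipped matches: each hypothesis n-gram counts at most
            # as often as it appears in the reference.
            matches[n - 1] += sum(min(c, r[g]) for g, c in h.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if 0 in matches:
        return 0.0
    log_prec = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    bp = 1.0 if hyp_len >= ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)

# Toy test set with one reference translation per source sentence.
refs = [["the", "cat", "sat", "on", "the", "mat"],
        ["a", "dog", "slept", "on", "the", "rug"]]

# A real system produces a (here, perfect) output per input.
print(corpus_bleu(refs, refs))                     # 1.0

# Degenerate probe: the SAME sentence for every test instance.
fixed = [["the", "cat", "sat", "on", "the", "mat"]] * len(refs)
print(corpus_bleu(refs, fixed))                    # still well above 0
```

Because matches are pooled over the corpus before the precision ratios are taken, a single output that happens to overlap strongly with even one reference keeps all four precisions non-zero, so the degenerate "translation" is not penalized as harshly as intuition suggests.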


