BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation

10/18/2021
by Thomas Scialom, et al.

Natural language processing (NLP) systems are increasingly trained to generate open-ended text rather than classifying between responses. This makes research on evaluation metrics for generated language – functions that score system output given the context and/or human reference responses – of critical importance. However, different metrics have different strengths and biases, and reflect human intuitions better on some tasks than others. There is currently no simple, unified way to compare, analyse or evaluate metrics across a representative set of tasks. Here, we describe the Benchmark to Evaluate Automatic Metrics (BEAMetrics), a resource to make research into new metrics itself easier to evaluate. BEAMetrics users can quickly compare existing and new metrics with human judgements across a diverse set of tasks, quality dimensions (fluency vs. coherence vs. informativeness, etc.), and languages. As generation experts might predict, BEAMetrics reveals stark task-dependent differences between existing metrics, and consistently poor performance on tasks with complex answer spaces or high reliance on general knowledge. While this analysis highlights a critical issue facing current research practice, BEAMetrics also contributes to its resolution by facilitating research into better metrics – particularly those that can account for the complex interaction between context and general knowledge inherent to many modern NLP applications. BEAMetrics is available under the MIT License: https://github.com/ThomasScialom/BEAMetrics
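The core operation in a meta-evaluation benchmark of this kind is correlating a metric's automatic scores with human ratings, separately for each task and quality dimension. The sketch below illustrates that loop with a toy token-overlap metric. It is a generic illustration under stated assumptions, not the actual BEAMetrics API: the function names, the `metric_fn(context, hypothesis, references)` signature, and the example data fields are hypothetical.

```python
# Illustrative sketch (NOT the BEAMetrics API): score each example with a
# candidate metric, then correlate those scores with human ratings for each
# quality dimension (e.g. fluency, coherence, informativeness).
from typing import Callable, Dict, List

from scipy.stats import kendalltau, pearsonr


def correlate_with_humans(
    metric_fn: Callable[[str, str, List[str]], float],  # hypothetical signature
    examples: List[Dict],
) -> Dict[str, Dict[str, float]]:
    """Return Pearson and Kendall correlations between metric scores and
    human ratings, one entry per annotated quality dimension."""
    scores = [
        metric_fn(ex["context"], ex["hypothesis"], ex["references"])
        for ex in examples
    ]
    results: Dict[str, Dict[str, float]] = {}
    for dim in examples[0]["human"]:
        ratings = [ex["human"][dim] for ex in examples]
        results[dim] = {
            "pearson": pearsonr(scores, ratings)[0],
            "kendall_tau": kendalltau(scores, ratings)[0],
        }
    return results


def token_overlap(context: str, hypothesis: str, references: List[str]) -> float:
    """Toy reference-overlap metric: best token recall over the references."""
    hyp = set(hypothesis.lower().split())
    return max(
        len(hyp & set(ref.lower().split())) / max(len(ref.split()), 1)
        for ref in references
    )


# Toy annotated data in the assumed format.
toy_examples = [
    {"context": "Q: What is the capital of France?",
     "hypothesis": "Paris is the capital.", "references": ["Paris"],
     "human": {"fluency": 5, "informativeness": 5}},
    {"context": "Q: What is the capital of France?",
     "hypothesis": "It is a city.", "references": ["Paris"],
     "human": {"fluency": 4, "informativeness": 1}},
    {"context": "Q: What is the capital of France?",
     "hypothesis": "Lyon capital France no.", "references": ["Paris"],
     "human": {"fluency": 1, "informativeness": 1}},
]

print(correlate_with_humans(token_overlap, toy_examples))
```

A full benchmark run would repeat this correlation analysis over many datasets and languages, which is exactly where task-dependent differences between metrics become visible.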


