Perturbation CheckLists for Evaluating NLG Evaluation Metrics

09/13/2021
by Ananya B. Sai, et al.

Natural Language Generation (NLG) evaluation is a multifaceted task requiring assessment of multiple desirable criteria, e.g., fluency, coherence, coverage, relevance, adequacy, overall quality, etc. Across existing datasets for 6 NLG tasks, we observe that the human evaluation scores on these multiple criteria are often not correlated. For example, there is a very low correlation between human scores on fluency and data coverage for the task of structured data to text generation. This suggests that the current recipe of proposing new automatic evaluation metrics for NLG by showing that they correlate well with scores assigned by humans on a single criterion (overall quality) alone is inadequate. Indeed, our extensive study involving 25 automatic evaluation metrics across 6 different tasks and 18 different evaluation criteria shows that, for most NLG tasks, there is no single metric which correlates well with human scores on all desirable criteria. Given this situation, we propose CheckLists for better design and evaluation of automatic metrics. We design templates which target a specific criterion (e.g., coverage) and perturb the output such that the quality is affected only along that criterion (e.g., the coverage drops). We show that existing evaluation metrics are not robust against even such simple perturbations and disagree with scores assigned by humans to the perturbed outputs. The proposed templates thus allow for a fine-grained assessment of automatic evaluation metrics, exposing their limitations, and will facilitate better design, analysis and evaluation of such metrics.
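For intuition, here is a minimal sketch of what a criterion-targeted perturbation check might look like. The perturbation functions and the `unigram_overlap` stand-in metric are illustrative assumptions, not the paper's actual templates or any of the 25 metrics it studies; the point is only to show how a single criterion (coverage or fluency) can be degraded while the rest of the output is left untouched, and how a metric's score change can then be compared against the expected human judgement.

```python
import random

def unigram_overlap(hypothesis: str, reference: str) -> float:
    """Toy stand-in metric: fraction of reference unigrams present in the
    hypothesis. Any automatic metric could be plugged in here instead."""
    hyp_tokens = set(hypothesis.lower().split())
    ref_tokens = reference.lower().split()
    if not ref_tokens:
        return 0.0
    return sum(t in hyp_tokens for t in ref_tokens) / len(ref_tokens)

def drop_coverage(output: str, fraction: float = 0.3) -> str:
    """Coverage perturbation: remove a trailing chunk of the output so that
    some content is no longer covered, while fluency is preserved."""
    sentences = output.split(". ")
    keep = max(1, int(len(sentences) * (1 - fraction)))
    return ". ".join(sentences[:keep])

def break_fluency(output: str, seed: int = 0) -> str:
    """Fluency perturbation: shuffle word order so that content (and hence
    coverage) is preserved but fluency is destroyed."""
    words = output.split()
    random.Random(seed).shuffle(words)
    return " ".join(words)

reference = "The hotel has 120 rooms. It is located near the airport. Breakfast is free."
output = "The hotel offers 120 rooms. It sits close to the airport. Breakfast is free."

for name, perturb in [("coverage drop", drop_coverage), ("fluency break", break_fluency)]:
    perturbed = perturb(output)
    before, after = unigram_overlap(output, reference), unigram_overlap(perturbed, reference)
    # A criterion-aware metric should penalise the coverage drop, while a pure
    # coverage metric should be unaffected by the fluency break.
    print(f"{name}: score {before:.2f} -> {after:.2f}")
```

In this toy setup the order-insensitive stand-in metric drops under the coverage perturbation but is blind to the fluency one; the paper applies the same logic across its templates to expose metrics that fail to move (or move spuriously) under perturbations targeting the criteria they are supposed to capture.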


