NLG Evaluation Metrics Beyond Correlation Analysis: An Empirical Metric Preference Checklist

05/15/2023
by Iftitahu Ni'mah, et al.

In this study, we analyze NLG automatic metrics based on whether the human evaluation aspect is used as a context or an objective for computing the metrics: (i) task-agnostic and (ii) human-aligned. Task-agnostic metrics, such as Perplexity, BLEU, and BERTScore, are cost-effective and highly adaptable to diverse NLG tasks, yet they correlate only weakly with human judgment. Human-aligned metrics (CTC, CtrlEval, UniEval) improve the correlation level by incorporating desirable human-like qualities as a training objective. However, their effectiveness at discerning system-level performance and the quality of system outputs remains unclear. We present a metric preference checklist as a framework to assess the discriminative power of automatic metrics in three NLG tasks: Text Summarization, Dialogue Response Generation, and Controlled Generation. We show that the multi-aspect human-aligned metric (UniEval) is not necessarily dominant over single-aspect human-aligned metrics (CTC, CtrlEval) and task-agnostic metrics (BLEU, BERTScore), particularly when there is disagreement between human evaluation aspects. We also show particular use cases in which automatic metrics provide better guidance than humans for discriminating system-level performance. Our proposed framework provides access: (i) for verifying whether automatic metrics are faithful to human preference, regardless of their correlation level with humans; and (ii) for scrutinizing the strengths and limitations of NLG systems, which are often obscured by the standard practice of averaging evaluation scores.
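As a minimal illustration of the preference-checking idea (a simplified sketch, not the authors' exact framework), the snippet below compares the system-level preference implied by an automatic metric against the preference implied by human ratings; the system names and per-output scores are hypothetical.

```python
# Minimal sketch of a system-level preference check.
# Hypothetical scores for two systems; not the authors' framework or data.

def system_level_preference(scores_a, scores_b):
    """Return which system is preferred based on mean per-output scores."""
    mean_a = sum(scores_a) / len(scores_a)
    mean_b = sum(scores_b) / len(scores_b)
    if mean_a == mean_b:
        return "tie"
    return "A" if mean_a > mean_b else "B"

# Hypothetical per-output scores for two summarization systems.
human_ratings = {"A": [4.2, 3.8, 4.5], "B": [3.9, 4.0, 3.7]}   # e.g. coherence ratings
metric_scores = {"A": [0.91, 0.88, 0.93], "B": [0.90, 0.89, 0.87]}  # e.g. BERTScore F1

human_pref = system_level_preference(human_ratings["A"], human_ratings["B"])
metric_pref = system_level_preference(metric_scores["A"], metric_scores["B"])

# A metric is "faithful" on this comparison if it prefers the same system as humans,
# even when its instance-level correlation with human scores is weak.
print(f"Human prefers system {human_pref}, metric prefers system {metric_pref}:",
      "agree" if human_pref == metric_pref else "disagree")
```

The point of the sketch is that agreement on which system wins can be checked directly, independently of how strongly the metric correlates with human scores at the instance level.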

