Compression, Transduction, and Creation: A Unified Framework for Evaluating Natural Language Generation

09/14/2021
by Mingkai Deng, et al.

Natural language generation (NLG) spans a broad range of tasks, each serving specific objectives and demanding different properties of the generated text. This diversity makes automatic evaluation of NLG particularly challenging. Previous work has typically focused on a single task and developed individual evaluation metrics based on task-specific intuitions. In this paper, we propose a unifying perspective based on the nature of information change in NLG tasks, which we group into compression (e.g., summarization), transduction (e.g., text rewriting), and creation (e.g., dialog). Information alignment between the input, context, and output text plays a common, central role in characterizing generation. Using automatic alignment prediction models, we develop a family of interpretable metrics suitable for evaluating key aspects of different NLG tasks, often without the need for gold reference data. Experiments show that the uniformly designed metrics achieve stronger or comparable correlations with human judgments than state-of-the-art metrics on each of a diverse set of tasks, including text summarization, style transfer, and knowledge-grounded dialog.
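To make the role of information alignment concrete, the sketch below shows one way per-token alignment scores could be aggregated into metrics for the three task families the abstract names. It is a hedged illustration, not the authors' implementation: the paper learns alignment prediction models, whereas this toy version uses lexical overlap as the aligner, and the function names (align, consistency, preservation, engagingness) and exact aggregation choices are assumptions for exposition.

```python
# Minimal sketch of alignment-based NLG evaluation in the spirit of the
# paper's compression / transduction / creation framing. The paper trains
# alignment *prediction models*; here a toy lexical-overlap aligner stands
# in so the example runs with no model downloads. Names and aggregation
# details are illustrative, not the authors' implementation.

def align(candidate: str, reference: str) -> list[float]:
    """Score each candidate token by whether it is supported by the
    reference text (1.0/0.0). A learned aligner would return soft
    per-token alignment probabilities instead."""
    ref_vocab = {tok.lower() for tok in reference.split()}
    return [1.0 if tok.lower() in ref_vocab else 0.0
            for tok in candidate.split()]

def consistency(output: str, source: str) -> float:
    """Compression (e.g., summarization): mean alignment of output tokens
    to the source measures whether the summary stays faithful."""
    scores = align(output, source)
    return sum(scores) / len(scores) if scores else 0.0

def preservation(output: str, source: str) -> float:
    """Transduction (e.g., style transfer): content must survive in both
    directions, so combine output->source (precision-like) and
    source->output (recall-like) alignment with a harmonic mean."""
    p, r = consistency(output, source), consistency(source, output)
    return 2 * p * r / (p + r) if p + r else 0.0

def engagingness(response: str, context: str, knowledge: str) -> float:
    """Creation (e.g., knowledge-grounded dialog): reward responses whose
    tokens are grounded in the dialog context or external knowledge;
    summing (rather than averaging) also rewards informative length."""
    return sum(align(response, context + " " + knowledge))

if __name__ == "__main__":
    doc = "the cat sat on the mat while the dog slept"
    print(consistency("the cat sat on the mat", doc))   # 1.0, fully supported
    print(consistency("the cat chased a mouse", doc))   # 0.4, partly supported
```

Note how the three aggregations mirror the three task families: compression wants every output token backed by the input, transduction additionally wants nothing lost, and creation measures grounding against context and knowledge rather than against a single source.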


Related research

- Spurious Correlations in Reference-Free Evaluation of Text Generation (04/21/2022)
- AlignScore: Evaluating Factual Consistency with a Unified Alignment Function (05/26/2023)
- ChatGPT as a Factual Inconsistency Evaluator for Abstractive Text Summarization (03/27/2023)
- Toward Human-Like Evaluation for Natural Language Generation with Error Analysis (12/20/2022)
- Text Style Transfer Evaluation Using Large Language Models (08/25/2023)
- BEAMetrics: A Benchmark for Language Generation Evaluation Evaluation (10/18/2021)
- Human-Centered Metrics for Dialog System Evaluation (05/24/2023)
