Not All Errors are Equal: Learning Text Generation Metrics using Stratified Error Synthesis

10/10/2022
by Wenda Xu, et al.

Is it possible to build a general and automatic natural language generation (NLG) evaluation metric? Existing learned metrics either perform unsatisfactorily or are restricted to tasks where large amounts of human rating data are already available. We introduce SESCORE, a model-based metric that is highly correlated with human judgements without requiring human annotation, by utilizing a novel, iterative error synthesis and severity scoring pipeline. This pipeline applies a series of plausible errors to raw text and assigns severity labels by simulating human judgements with entailment. We evaluate SESCORE against existing metrics by comparing how their scores correlate with human ratings. SESCORE outperforms all prior unsupervised metrics on multiple diverse NLG tasks, including machine translation, image captioning, and WebNLG text generation. For WMT 20/21 En-De and Zh-En, SESCORE improves the average Kendall correlation with human judgement from 0.154 to 0.195. SESCORE even achieves performance comparable to the best supervised metric, COMET, despite receiving no human-annotated training data.
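
The abstract compares metrics by how well their scores correlate with human ratings, reported as a Kendall correlation. As a rough illustration of that kind of comparison, the sketch below computes a plain Kendall tau-a over all pairs of outputs. It is an assumption-laden example, not the paper's evaluation code: the function name `kendall_tau` and the score lists are hypothetical, and the exact Kendall variant used for the WMT results may differ (for instance in how ties or pair selection are handled).

```python
from itertools import combinations


def kendall_tau(metric_scores, human_ratings):
    """Kendall tau-a: (concordant - discordant) / total number of pairs.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(metric_scores) != len(human_ratings):
        raise ValueError("score lists must have the same length")
    n = len(metric_scores)
    if n < 2:
        raise ValueError("need at least two items to form a pair")

    concordant = 0
    discordant = 0
    for i, j in combinations(range(n), 2):
        # The sign of the product tells whether both rankings agree on this pair.
        product = (metric_scores[i] - metric_scores[j]) * (
            human_ratings[i] - human_ratings[j]
        )
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1

    total_pairs = n * (n - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical scores for four system outputs (illustrative values only).
metric = [0.9, 0.4, 0.7, 0.2]
human = [5, 2, 3, 4]
print(kendall_tau(metric, human))
```

A value near 1 means the metric ranks outputs the way human raters do, a value near -1 means it ranks them in the opposite order, and a value near 0 means there is little agreement.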

research
12/19/2022

SEScore2: Retrieval Augmented Pretraining for Text Generation Evaluation

Is it possible to leverage large scale raw and raw parallel corpora to b...
research
03/06/2023

Models See Hallucinations: Evaluating the Factuality in Video Captioning

Video captioning aims to describe events in a video with natural languag...
research
12/20/2022

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

The state-of-the-art language model-based automatic metrics, e.g. BARTSc...
research
06/06/2023

Correction of Errors in Preference Ratings from Automated Metrics for Text Generation

A major challenge in the field of Text Generation is evaluation: Human e...
research
07/02/2021

Scarecrow: A Framework for Scrutinizing Machine Text

Modern neural text generation systems can produce remarkably fluent and ...
research
04/12/2023

ImageReward: Learning and Evaluating Human Preferences for Text-to-Image Generation

We present ImageReward – the first general-purpose text-to-image human p...
research
08/27/2021

Automatic Text Evaluation through the Lens of Wasserstein Barycenters

A new metric to evaluate text generation based on deep contextualized e...
