Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

by Jing Gu et al.

Automatic evaluation of open-ended natural language generation remains a challenge: existing metrics such as BLEU correlate poorly with human judgment. We propose Perception Score, a novel and powerful learning-based evaluation metric. Rather than focusing on a single criterion such as word overlap, it measures the overall quality of a generation and scores it holistically. It also reports the uncertainty of its evaluation result; by incorporating this uncertainty, Perception Score evaluates generation systems more accurately. Perception Score achieves state-of-the-art results on two conditional generation tasks and two unconditional generation tasks.
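The abstract does not specify how the uncertainty estimate is computed. One common way a learned scorer can report both a quality score and an uncertainty is Monte Carlo dropout: run several stochastic forward passes and use the spread of the outputs as the uncertainty. The sketch below is a toy illustration of that general idea, not the paper's actual model; the feature names, weights, and dropout-based estimator are all assumptions for illustration.

```python
import random
import statistics

def uncertainty_aware_score(features, weights, n_passes=100,
                            dropout=0.1, seed=0):
    """Toy uncertainty-aware scorer (Monte Carlo dropout sketch).

    Runs n_passes stochastic scoring passes, randomly dropping
    features each time, and returns the mean score together with
    the standard deviation across passes as an uncertainty estimate.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_passes):
        # Randomly drop each feature; rescale by 1/(1 - dropout)
        # so the expected score is unchanged (inverted dropout).
        s = sum(w * f for w, f in zip(weights, features)
                if rng.random() > dropout) / (1.0 - dropout)
        scores.append(s)
    return statistics.fmean(scores), statistics.stdev(scores)

# Hypothetical quality features of one generated text
# (e.g., fluency, coherence, relevance) and learned weights.
features = [0.8, 0.6, 0.9]
weights = [0.5, 0.3, 0.2]
score, uncertainty = uncertainty_aware_score(features, weights)
```

A downstream evaluator could then trust scores with low `uncertainty` more, or flag high-uncertainty generations for human review, which matches the abstract's point that exposing uncertainty makes the evaluation more reliable.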


