MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation

02/02/2021
by Krishna Pillutla, et al.

Despite major advances in open-ended text generation, there has been limited progress in designing evaluation metrics for this task. We propose MAUVE, a metric for open-ended text generation that directly compares the distribution of machine-generated text to that of human language. MAUVE measures the mean area under the divergence curve for the two distributions, exploring the trade-off between two types of errors: those arising when the model places mass on text that is unlikely under the human distribution, and those arising when the model fails to cover parts of the human distribution. We present experiments across two open-ended generation tasks, in the web text and story domains, with a variety of decoding algorithms and model sizes. Our results show that evaluation under MAUVE reflects the natural behavior with respect to model size more faithfully than prior metrics. MAUVE's ordering of the decoding algorithms also agrees with that of generation perplexity, the most widely used metric in open-ended text generation; however, MAUVE is a more principled evaluation metric for the task, as it considers both model and human text.
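To make the description above concrete, here is a minimal sketch of a MAUVE-style divergence curve for two discrete distributions. It assumes, beyond what the abstract states, that the curve is traced by KL divergences to mixtures of the human distribution p and the model distribution q, softened by an exponential, and that the score is the area under that curve. The helper names (kl, divergence_curve, mauve_like_score) and the scaling constant c are illustrative choices, not the paper's exact estimator, which works with distributions estimated from samples of text rather than hand-specified histograms.

# Illustrative sketch, not the paper's implementation.
import numpy as np


def kl(a, b, eps=1e-12):
    """KL(a || b) for discrete distributions given as 1-D arrays."""
    a = np.asarray(a, dtype=float) + eps
    b = np.asarray(b, dtype=float) + eps
    a, b = a / a.sum(), b / b.sum()
    return float(np.sum(a * np.log(a / b)))


def divergence_curve(p, q, num_points=100, c=5.0):
    """Trace (x, y) points of a divergence curve for mixtures of p and q."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    p, q = p / p.sum(), q / q.sum()
    # Extreme points so the curve always spans from (0, 1) to (1, 0).
    points = [(0.0, 1.0), (1.0, 0.0)]
    for lam in np.linspace(1e-3, 1.0 - 1e-3, num_points):
        r = lam * p + (1.0 - lam) * q
        x = np.exp(-c * kl(q, r))  # near 1 when the model rarely strays outside the mixture
        y = np.exp(-c * kl(p, r))  # near 1 when the mixture covers the human distribution
        points.append((float(x), float(y)))
    points.sort(key=lambda t: (t[0], -t[1]))  # order by x for the area integral below
    xs, ys = map(np.array, zip(*points))
    return xs, ys


def mauve_like_score(p, q, **kwargs):
    """Area under the divergence curve: ~1.0 when p == q, smaller as they diverge."""
    xs, ys = divergence_curve(p, q, **kwargs)
    return float(np.sum(0.5 * (ys[1:] + ys[:-1]) * np.diff(xs)))  # trapezoidal rule


# Tiny usage example on hand-made histograms.
human = [0.4, 0.3, 0.2, 0.1]
model = [0.25, 0.25, 0.25, 0.25]
print(mauve_like_score(human, human))  # ~1.0: identical distributions
print(mauve_like_score(human, model))  # smaller: the distributions diverge

In this sketch, identical distributions trace a curve that hugs the corner (1, 1) and yield a score near 1, while mismatched distributions pull the curve inward and shrink the area; the constant c only controls how sharply divergence is penalized.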
