MAUVE: Human-Machine Divergence Curves for Evaluating Open-Ended Text Generation

by Krishna Pillutla, et al.

Despite major advances in open-ended text generation, there has been limited progress in designing evaluation metrics for this task. We propose MAUVE, a metric for open-ended text generation that directly compares the distribution of machine-generated text to that of human language. MAUVE measures the mean area under the divergence curve for the two distributions, summarizing the trade-off between two types of errors: the model placing probability mass on text that is unlikely under the human distribution, and the model failing to cover parts of the human distribution. We present experiments across two open-ended generation tasks, in the web text domain and the story domain, with a variety of decoding algorithms and model sizes. Our results show that, compared to prior metrics, evaluation under MAUVE reflects more natural behavior with respect to model size. MAUVE's ordering of the decoding algorithms also agrees with that of generation perplexity, the most widely used metric in open-ended text generation; however, MAUVE offers a more principled evaluation metric for the task, as it considers both model and human text.
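The divergence curve behind MAUVE can be sketched for two discrete distributions: each mixture R = λP + (1-λ)Q contributes one point of softened KL divergences, and the score is the area under the resulting curve. This is a minimal illustrative sketch, not the paper's method: the function names, the scaling constant `c`, and the closure points at (0, 1) and (1, 0) are assumptions here, and the full pipeline (quantizing contextual embeddings of text samples into discrete distributions) is omitted.

```python
import numpy as np

def kl(p, q):
    # KL(p || q) on a shared discrete support; assumes q > 0 wherever p > 0
    # (true for the mixtures below, since each mixes in mass from both sides).
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def divergence_curve(p, q, num_points=100, c=1.0):
    # For mixtures R = lam*P + (1-lam)*Q with lam in (0, 1), each lam yields
    # one point (exp(-c*KL(Q, R)), exp(-c*KL(P, R))) on the trade-off curve.
    # The curve is closed at (0, 1) and (1, 0), the two extreme error regimes.
    pts = [(0.0, 1.0), (1.0, 0.0)]
    for lam in np.linspace(0.0, 1.0, num_points + 2)[1:-1]:
        r = lam * p + (1 - lam) * q
        pts.append((np.exp(-c * kl(q, r)), np.exp(-c * kl(p, r))))
    return np.array(pts)

def mauve_score(p, q, **kwargs):
    # Area under the divergence curve via the trapezoidal rule.
    # Sort by x ascending, breaking ties by y descending, so the
    # monotonically decreasing frontier is traversed left to right.
    pts = divergence_curve(p, q, **kwargs)
    order = np.lexsort((-pts[:, 1], pts[:, 0]))
    x, y = pts[order, 0], pts[order, 1]
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * (x[1:] - x[:-1])))
```

When P and Q are identical, every mixture collapses to the point (1, 1) and the area is 1; as the two distributions diverge, the curve is pulled toward the axes and the area shrinks toward 0.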





MoverScore: Text Generation Evaluating with Contextualized Embeddings and Earth Mover Distance

A robust evaluation metric has a profound impact on the development of t...

CoTK: An Open-Source Toolkit for Fast Development and Fair Evaluation of Text Generation

In text generation evaluation, many practical issues, such as inconsiste...

The Perils of Using Mechanical Turk to Evaluate Open-Ended Text Generation

Recent text generation research has increasingly focused on open-ended d...

Investigating Label Bias in Beam Search for Open-ended Text Generation

Beam search is an effective and widely used decoding algorithm in many s...

Perception Score, A Learned Metric for Open-ended Text Generation Evaluation

Automatic evaluation for open-ended natural language generation tasks re...

Decoding Methods for Neural Narrative Generation

Narrative generation is an open-ended NLP task in which a model generate...

On the Relation between Quality-Diversity Evaluation and Distribution-Fitting Goal in Text Generation

The goal of text generation models is to fit the underlying real probabi...