Unifying Human and Statistical Evaluation for Natural Language Generation

04/04/2019
by Tatsunori B. Hashimoto, et al.

How can we measure whether a natural language generation system produces both high-quality and diverse outputs? Human evaluation captures quality but not diversity, as it does not catch models that simply plagiarize from the training set. On the other hand, statistical evaluation (i.e., perplexity) captures diversity but not quality, as models that occasionally emit low-quality samples would be insufficiently penalized. In this paper, we propose a unified framework which evaluates both diversity and quality, based on the optimal error rate of predicting whether a sentence is human- or machine-generated. We demonstrate that this error rate can be efficiently estimated by combining human and statistical evaluation, using an evaluation metric which we call HUSE. On summarization and chit-chat dialogue, we show that (i) HUSE detects diversity defects which fool pure human evaluation and that (ii) techniques such as annealing for improving quality actually decrease HUSE due to decreased diversity.
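
The abstract does not spell out how the optimal error rate is estimated; the full paper describes it as the leave-one-out error of a k-nearest-neighbor classifier over two features per sentence, a human quality judgment and a length-normalized model log-probability. The Python sketch below illustrates that idea under those assumptions; the function name huse_score, the default k, and the absence of feature scaling are illustrative choices here, not the authors' reference implementation.

import numpy as np
from sklearn.model_selection import LeaveOneOut, cross_val_score
from sklearn.neighbors import KNeighborsClassifier

def huse_score(human_feats, model_feats, k=15):
    # human_feats / model_feats: arrays of shape (n, 2), one row per sentence,
    # columns = (human judgment score, length-normalized model log-probability).
    X = np.vstack([human_feats, model_feats])
    y = np.concatenate([np.ones(len(human_feats)), np.zeros(len(model_feats))])
    # Leave-one-out accuracy of a k-NN discriminator between the two sources.
    knn = KNeighborsClassifier(n_neighbors=k)
    accuracy = cross_val_score(knn, X, y, cv=LeaveOneOut()).mean()
    error = 1.0 - accuracy
    # HUSE is twice the classification error: a value near 1.0 means human- and
    # machine-generated sentences are indistinguishable on these features,
    # while a value near 0.0 means they are trivially separable.
    return 2.0 * error

Under this estimate, a model scores well only if its samples both look plausible to human judges (quality) and receive model probabilities comparable to those of human-written text (diversity), which is the unification of the two evaluation modes that the abstract refers to.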
