How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory

05/24/2022
by   Kawin Ethayarajh, et al.

Human ratings are treated as the gold standard in NLG evaluation. The standard protocol is to collect ratings of generated text, average across annotators, and then rank NLG systems by their average scores. However, little consideration has been given to whether this approach faithfully captures human preferences. In this work, we analyze the standard protocol through the lens of utility theory in economics. We first identify the implicit assumptions it makes about annotators and find that these assumptions are often violated in practice, in which case annotator ratings become an unfaithful reflection of their preferences. The most egregious violations come from using Likert scales, which provably reverse the direction of the true preference in certain cases. We suggest improvements to the standard protocol to make it more theoretically sound, but even in its improved form, it cannot be used to evaluate open-ended tasks like story generation. For the latter, we propose a new evaluation protocol called system-level probabilistic assessment (SPA). In our experiments, we find that according to SPA, annotators prefer larger GPT-3 variants to smaller ones – as expected – with all comparisons being statistically significant. In contrast, the standard protocol yields significant results only half the time.
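To make the protocol under discussion concrete, here is a minimal sketch of the standard averaging-and-ranking procedure the abstract describes. It is written in Python with made-up Likert ratings; the system labels and the Mann-Whitney U significance test are illustrative assumptions, not the authors' exact setup.

```python
# Minimal sketch of the standard NLG human-evaluation protocol:
# collect Likert ratings, average across annotators, and rank systems
# by their mean score. All ratings and system names below are made up;
# the Mann-Whitney U test is an illustrative choice of significance test.
from statistics import mean

from scipy.stats import mannwhitneyu

# ratings[system] = Likert ratings (1-5) pooled across annotators and outputs
ratings = {
    "gpt3-small": [3, 2, 4, 3, 2, 3, 4, 2],
    "gpt3-large": [4, 3, 4, 5, 3, 4, 4, 3],
}

# Steps 1-2: average each system's ratings across annotators.
mean_scores = {system: mean(r) for system, r in ratings.items()}

# Step 3: rank systems by average score (higher is better).
ranking = sorted(mean_scores, key=mean_scores.get, reverse=True)
print("ranking:", ranking, "scores:", mean_scores)

# Pairwise significance check between the top two systems.
u, p = mannwhitneyu(ratings[ranking[0]], ratings[ranking[1]])
print(f"Mann-Whitney U = {u:.1f}, p = {p:.3f}")
```

The paper's concern, as summarized above, is that the mean of such bounded ordinal ratings need not order systems the same way as annotators' true preferences, which is the failure mode it analyzes with utility theory.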


Related research

RankME: Reliable Human Ratings for Natural Language Generation (03/15/2018)
Human evaluation for natural language generation (NLG) often suffers fro...

To Rate or Not To Rate: Investigating Evaluation Methods for Generated Co-Speech Gestures (08/12/2021)
While automatic performance metrics are crucial for machine learning of ...

A Gold Standard Methodology for Evaluating Accuracy in Data-To-Text Systems (11/08/2020)
Most Natural Language Generation systems need to produce accurate texts....

Peering Through Preferences: Unraveling Feedback Acquisition for Aligning Large Language Models (08/30/2023)
Aligning large language models (LLMs) with human values and intents crit...

StoryER: Automatic Story Evaluation via Ranking, Rating and Reasoning (10/16/2022)
Existing automatic story evaluation methods place a premium on story lex...

A Data-Oriented Model of Literary Language (01/12/2017)
We consider the task of predicting how literary a text is, with a gold s...

A no-gold-standard technique to objectively evaluate quantitative imaging methods using patient data: Theory (06/03/2020)
Objective evaluation of quantitative imaging (QI) methods using measurem...
