RankME: Reliable Human Ratings for Natural Language Generation

03/15/2018
by Jekaterina Novikova et al.

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.
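
To make the method concrete, here is a minimal sketch, in Python, of how RankME-style ratings could be aggregated. It assumes each rater scores competing system outputs side by side on a continuous scale anchored to a reference item, and that per-rater scores are normalized by that reference. The rater data, function names, and the median aggregation are illustrative assumptions, not the paper's exact pipeline; the Bayesian estimation of system quality mentioned in the abstract (e.g., a TrueSkill-style model) would then consume the per-rater orderings produced at the end.

```python
from statistics import median

# Hypothetical raw data. In magnitude estimation, each rater first assigns
# a number to a fixed reference utterance and then scores each system output
# relative to it on an unbounded continuous scale. In RankME the outputs are
# shown side by side, so one rater's scores are directly comparable.
ratings = {
    "rater_1": {"reference": 100, "sys_A": 150, "sys_B": 80},
    "rater_2": {"reference": 200, "sys_A": 240, "sys_B": 120},
    "rater_3": {"reference": 100, "sys_A": 90,  "sys_B": 85},
}

def normalize(rater_scores):
    """Divide one rater's magnitude estimates by their own reference score,
    making scores comparable across raters with different internal scales."""
    ref = rater_scores["reference"]
    return {sys: score / ref
            for sys, score in rater_scores.items() if sys != "reference"}

normalized = [normalize(scores) for scores in ratings.values()]

# Aggregate per system with the median, which is robust to outlier raters.
systems = sorted({s for scores in normalized for s in scores})
aggregate = {s: median(scores[s] for scores in normalized) for s in systems}

# The relative (rank) part of RankME: side-by-side rating means every
# rater's scores induce an ordering, which a Bayesian model of system
# quality (e.g. a TrueSkill-style estimator) could treat as pairwise wins.
per_rater_rankings = [sorted(scores, key=scores.get, reverse=True)
                      for scores in normalized]

print(aggregate)           # {'sys_A': 1.2, 'sys_B': 0.8}
print(per_rater_rankings)  # [['sys_A', 'sys_B'], ...]
```

Normalizing by the reference before aggregating is what lets ratings from raters with very different internal scales (here, rater_2 works in multiples of 200) be pooled without one rater dominating the result.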

Related Research

Automatic Quality Estimation for Natural Language Generation: Ranting (Jointly Rating and Ranking) (10/10/2019)
We present a recurrent neural network based system for automatic quality...

k-Rater Reliability: The Correct Unit of Reliability for Aggregated Human Annotations (03/24/2022)
Since the inception of crowdsourcing, aggregation has been a common stra...

Is GPT-4 a reliable rater? Evaluating Consistency in GPT-4 Text Ratings (08/03/2023)
This study investigates the consistency of feedback ratings generated by...

How Human is Human Evaluation? Improving the Gold Standard for NLG with Utility Theory (05/24/2022)
Human ratings are treated as the gold standard in NLG evaluation. The st...

Exploring ChatGPT's Ability to Rank Content: A Preliminary Study on Consistency with Human Preferences (03/14/2023)
As a natural language assistant, ChatGPT is capable of performing variou...

Towards Best Experiment Design for Evaluating Dialogue System Output (09/23/2019)
To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for...

Analysis of Problem Tokens to Rank Factors Impacting Quality in VoIP Applications (03/26/2018)
User-perceived quality-of-experience (QoE) in internet telephony systems...
