Dynamic Human Evaluation for Relative Model Comparisons

Collecting human judgements is currently the most reliable evaluation method for natural language generation systems. Automatic metrics have known flaws when used to measure quality aspects of generated text, and they have been shown to correlate poorly with human judgements. However, human evaluation is time- and cost-intensive, and there is no consensus on how to design and conduct human evaluation experiments. There is therefore a need for streamlined approaches that collect human judgements efficiently when evaluating natural language generation systems. We present a dynamic approach to measuring the number of human annotations required when evaluating generated outputs in relative comparison settings. We propose an agent-based framework of human evaluation to assess multiple labelling strategies and methods for deciding which model is better, in both a simulation and a crowdsourcing case study. The main results indicate that a decision about the superior model can be made with high probability across different labelling strategies, and that assigning a single random worker per task requires the least overall labelling effort and thus the least cost.
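To make the idea of a dynamically determined number of annotations concrete, the following is a minimal simulation sketch. It is not the paper's framework: the stopping rule (an exact two-sided binomial test against a 50/50 preference), the parameter names, and the helper functions `binomial_two_sided_p` and `simulate_comparison` are all assumptions chosen for illustration. It assumes one simulated worker per task, each giving a single "A is better" or "B is better" label.

```python
import math
import random


def binomial_two_sided_p(wins: int, n: int) -> float:
    """Exact two-sided p-value for H0: P(prefer A) = 0.5.

    Under H0 the distribution is symmetric, so the two-sided value is
    twice the upper tail of the larger count, capped at 1.
    """
    k = max(wins, n - wins)
    tail = sum(math.comb(n, i) for i in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def simulate_comparison(p_prefer_a, alpha=0.05, min_labels=10,
                        max_labels=500, seed=0):
    """Collect one label per task until the preference looks decisive.

    p_prefer_a is the (hypothetical) probability that a random worker
    prefers model A on a random task. Returns (winner, labels_used),
    with winner None if the label budget runs out first.

    Caveat: testing after every label inflates the false-positive rate
    compared with a single fixed-size test; this sketch ignores that.
    """
    rng = random.Random(seed)
    wins_a = 0
    for n in range(1, max_labels + 1):
        # One simulated worker labels one task; True counts as a vote for A.
        wins_a += rng.random() < p_prefer_a
        if n >= min_labels and binomial_two_sided_p(wins_a, n) < alpha:
            return ("A" if wins_a > n - wins_a else "B"), n
    return None, max_labels


if __name__ == "__main__":
    for p in (0.5, 0.6, 0.8):
        winner, used = simulate_comparison(p)
        print(f"p_prefer_a={p}: winner={winner}, labels used={used}")
```

In this toy setting, a larger gap between the two models should generally lead to an earlier stop, which is the intuition behind measuring the required labelling effort rather than fixing it in advance.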

04/14/2020

A Human Evaluation of AMR-to-English Generation Systems

Most current state-of-the-art systems for generating English text from A...
07/06/2018

The price of debiasing automatic metrics in natural language evaluation

For evaluating generation systems, automatic metrics such as BLEU cost n...
03/17/2022

RoMe: A Robust Metric for Evaluating Natural Language Generation

Evaluating Natural Language Generation (NLG) systems is a challenging ta...
01/02/2019

Judge the Judges: A Large-Scale Evaluation Study of Neural Language Models for Online Review Generation

Recent advances in deep learning have resulted in a resurgence in the po...
04/06/2020

Evaluating the Evaluation of Diversity in Natural Language Generation

Despite growing interest in natural language generation (NLG) models tha...
08/24/2020

How To Evaluate Your Dialogue System: Probe Tasks as an Alternative for Token-level Evaluation Metrics

Though generative dialogue modeling is widely seen as a language modelin...
12/09/2016

Evaluating Creative Language Generation: The Case of Rap Lyric Ghostwriting

Language generation tasks that seek to mimic human ability to use langua...