Near-Negative Distinction: Giving a Second Life to Human Evaluation Datasets

05/13/2022
by Philippe Laban, et al.

Precisely assessing progress in natural language generation (NLG) tasks is challenging, and human evaluation to establish the preference of one model's output over another is often necessary. However, human evaluation is usually costly, difficult to reproduce, and non-reusable. In this paper, we propose a new and simple automatic evaluation method for NLG called Near-Negative Distinction (NND) that repurposes prior human annotations into NND tests. In an NND test, an NLG model must place higher likelihood on a high-quality output candidate than on a near-negative candidate with a known error. Model performance is established by the number of NND tests a model passes, as well as the distribution of task-specific errors the model fails on. Through experiments on three NLG tasks (question generation, question answering, and summarization), we show that NND achieves higher correlation with human judgments than standard NLG evaluation metrics. We then illustrate NND evaluation in four practical scenarios, for example performing fine-grained model analysis or studying model training dynamics. Our findings suggest NND can give a second life to human annotations and provide low-cost NLG evaluation.
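To make the test procedure concrete, the sketch below shows one way an NND-style comparison could be scored with a Hugging Face seq2seq model: the model's log-likelihood is computed for a high-quality candidate and for a near-negative candidate, and a test is passed when the former is higher. This is a minimal illustration under assumptions of ours, not the paper's released implementation; the checkpoint name, the helper functions `candidate_log_likelihood` and `nnd_pass_rate`, and the example test triple are all hypothetical.

```python
# Minimal NND-style sketch, assuming a Hugging Face seq2seq model (e.g. T5).
# The checkpoint, helper names, and example data are illustrative only.
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "t5-small"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
model.eval()

def candidate_log_likelihood(source: str, candidate: str) -> float:
    """Total log-probability the model assigns to `candidate` given `source`."""
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(candidate, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(**inputs, labels=labels)
    # `out.loss` is the mean cross-entropy per label token; negate and
    # rescale by length to get a summed log-likelihood.
    return -out.loss.item() * labels.shape[1]

def nnd_pass_rate(tests) -> float:
    """Fraction of tests where the high-quality candidate outscores the near-negative."""
    passed = sum(
        candidate_log_likelihood(src, good) > candidate_log_likelihood(src, bad)
        for src, good, bad in tests
    )
    return passed / len(tests)

# One illustrative test triple: (input, high-quality candidate,
# near-negative candidate containing a known error).
tests = [
    (
        "summarize: The city council approved the new park budget on Tuesday.",
        "The council approved the park budget.",
        "The council rejected the park budget.",
    ),
]
print(f"NND pass rate: {nnd_pass_rate(tests):.2f}")
```

In a fuller version, each near-negative would carry an error label (e.g. a factual or fluency error type), so failed tests could also be aggregated into the per-error-type distribution described above.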


