Active Evaluation: Efficient NLG Evaluation with Few Pairwise Comparisons

03/11/2022
by Akash Kumar Mohankumar, et al.

Recent studies have shown the advantages of evaluating NLG systems using pairwise comparisons as opposed to direct assessment. Given k systems, a naive approach for identifying the top-ranked system would be to uniformly obtain pairwise comparisons from all (k choose 2) pairs of systems. However, this can be very expensive as the number of human annotations required would grow quadratically with k. In this work, we introduce Active Evaluation, a framework to efficiently identify the top-ranked system by actively choosing system pairs for comparison using dueling bandit algorithms. We perform extensive experiments with 13 dueling bandit algorithms on 13 NLG evaluation datasets spanning 5 tasks and show that the number of human annotations can be reduced by 80%. To further reduce the number of human annotations, we propose model-based dueling bandit algorithms which combine automatic evaluation metrics with human evaluations. Specifically, we eliminate sub-optimal systems even before the human annotation process and perform human evaluations only on test examples where the automatic metric is highly uncertain. This reduces the number of human annotations required further by 89%. In effect, identifying the top-ranked system requires only a few hundred human annotations, which grow linearly with k. Lastly, we provide practical recommendations and best practices to identify the top-ranked system efficiently. Our code has been made publicly available at https://github.com/akashkm99/duelnlg
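To make the active selection idea concrete, below is a minimal, illustrative Python sketch of a Relative Upper Confidence Bound (RUCB)-style dueling bandit choosing which pair of systems to duel next, paired with a hypothetical model-based `compare` oracle that answers from an automatic metric when it is confident and escalates to a (here simulated) human annotation otherwise. This is not the authors' implementation (see the duelnlg repository for that); the `compare` oracle, the latent `quality` scores, and the 0.8 confidence threshold are toy assumptions for demonstration.

```python
# Minimal sketch of Active Evaluation with an RUCB-style dueling bandit.
# Assumptions (not from the paper): the `compare` oracle, the latent
# `quality` scores, and the 0.8 confidence threshold are all toy choices.
import numpy as np

rng = np.random.default_rng(0)

def rucb_top_system(compare, k, budget, alpha=0.51):
    """Actively pick system pairs to duel and return a likely top system."""
    wins = np.zeros((k, k))                      # wins[i, j]: times i beat j
    for t in range(1, budget + 1):
        n = wins + wins.T                        # duels per pair so far
        with np.errstate(divide="ignore", invalid="ignore"):
            u = wins / n + np.sqrt(alpha * np.log(t) / n)
        u[np.isnan(u)] = 1.0                     # optimism for unexplored pairs
        np.fill_diagonal(u, 0.5)
        # Champion candidates: systems whose optimistic win rate never
        # falls below 1/2 against any rival.
        champs = [i for i in range(k) if np.all(u[i] >= 0.5)]
        c = rng.choice(champs) if champs else rng.integers(k)
        # Challenger: the rival most optimistic about beating c.
        order = np.argsort(u[:, c])
        d = order[-1] if order[-1] != c else order[-2]
        if compare(c, d):                        # True iff c wins the duel
            wins[c, d] += 1
        else:
            wins[d, c] += 1
    totals = (wins + wins.T).sum(axis=1)
    rates = np.divide(wins.sum(axis=1), totals,
                      out=np.zeros(k), where=totals > 0)
    return int(np.argmax(rates))

# Hypothetical model-based oracle: answer with an automatic metric when
# it is confident, otherwise fall back to a (simulated) human judgement.
quality = rng.normal(size=10)                    # toy latent system qualities

def compare(i, j, threshold=0.8):
    p_human = 1.0 / (1.0 + np.exp(quality[j] - quality[i]))  # Bradley-Terry
    p_metric = float(np.clip(p_human + rng.normal(0, 0.15), 0, 1))
    if max(p_metric, 1 - p_metric) >= threshold:  # metric is confident
        return p_metric > 0.5
    return rng.random() < p_human                 # escalate to a human

best = rucb_top_system(compare, k=10, budget=2000)
print(f"estimated top-ranked system: {best}")
```

The intuition behind the savings: a loop like this quickly stops dueling clearly sub-optimal systems and concentrates comparisons on the few closely matched top candidates, which is consistent with the paper's finding that the annotation budget grows linearly rather than quadratically with k.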


Related research

02/10/2022 · Decreasing Annotation Burden of Pairwise Comparisons with Human-in-the-Loop Sorting: Application in Medical Image Artifact Rating
Ranking by pairwise comparisons has shown improved reliability over ordi...

04/05/2023 · Human-like Summarization Evaluation with ChatGPT
Evaluating text summarization is a challenging problem, and existing eva...

09/30/2022 · CEREAL: Few-Sample Clustering Evaluation
Evaluating clustering quality with reliable evaluation metrics like norm...

08/21/2021 · CushLEPOR: Customised hLEPOR Metric Using LABSE Distilled Knowledge Model to Improve Agreement with Human Judgements
Human evaluation has always been expensive while researchers struggle to...

12/19/2022 · LENS: A Learnable Evaluation Metric for Text Simplification
Training learnable metrics using modern language models has recently eme...

08/22/2016 · Multi-Dueling Bandits and Their Application to Online Ranker Evaluation
New ranking algorithms are continually being developed and refined, nece...

08/03/2020 · OFAI-UKP at HAHA@IberLEF2019: Predicting the Humorousness of Tweets Using Gaussian Process Preference Learning
Most humour processing systems to date make at best discrete, coarse-gra...
