The price of debiasing automatic metrics in natural language evaluation

07/06/2018
by Arun Tejasvi Chaganty, et al.

For evaluating generation systems, automatic metrics such as BLEU cost nothing to run but have been shown to correlate poorly with human judgment, leading to systematic bias against certain model improvements. On the other hand, averaging human judgments, the unbiased gold standard, is often too expensive. In this paper, we use control variates to combine automatic metrics with human evaluation to obtain an unbiased estimator with lower cost than human evaluation alone. In practice, however, we obtain only a 7-13% cost reduction when evaluating summarization and open-response question answering systems. We then prove that our estimator is optimal: there is no unbiased estimator with lower cost. Our theory further highlights the two fundamental bottlenecks, the automatic metric and the prompt shown to human evaluators, both of which need to be improved to obtain greater cost savings.
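
To make the construction concrete, here is a minimal sketch of the standard control variates estimator the abstract describes, written in Python with NumPy; the function and variable names are illustrative, not taken from the authors' code. Human judgments are collected on a small sample, the automatic metric is run on every output (which is free), and the metric's in-sample fluctuation around its known overall mean is subtracted out. With a fixed coefficient alpha the estimator is exactly unbiased; estimating the optimal alpha = Cov(f, g) / Var(g) from the same sample, as below, adds only a small higher-order bias.

```python
import numpy as np

def control_variate_estimate(human, metric, metric_mean_full):
    """Estimate the mean human judgment, using an automatic metric
    (e.g. BLEU or ROUGE) as a control variate.

    human            : human judgments on the n sampled outputs
    metric           : automatic metric scores on the same n outputs
    metric_mean_full : mean metric score over the full output pool,
                       cheap to compute since the metric is free to run
    """
    human = np.asarray(human, dtype=float)
    metric = np.asarray(metric, dtype=float)
    # Optimal coefficient alpha = Cov(f, g) / Var(g), estimated in-sample.
    alpha = np.cov(human, metric, ddof=1)[0, 1] / np.var(metric, ddof=1)
    # Subtracting the metric's zero-mean fluctuation leaves the estimate
    # centered on the true mean while shrinking its variance by a factor
    # of (1 - rho^2), where rho is the human-metric correlation.
    return human.mean() - alpha * (metric.mean() - metric_mean_full)
```

Because the variance, and hence the number of human judgments needed for a given precision, shrinks only by the factor (1 - rho^2), a modest human-metric correlation of rho around 0.3 cuts costs by only about 9%, which makes the 7-13% figure above unsurprising and shows why a better automatic metric is one of the two bottlenecks.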


