The statistical advantage of automatic NLG metrics at the system level

05/26/2021
by Johnny Tian-Zheng Wei, et al.

Estimating the expected output quality of generation systems is central to NLG. This paper qualifies the notion that automatic metrics are not as good as humans at estimating system-level quality. Statistically, humans are unbiased, high-variance estimators, while metrics are biased, low-variance estimators. We compare these estimators by their error in pairwise prediction (which generation system is better?) using the bootstrap. Measuring this error is complicated: predictions are evaluated against noisy, human-predicted labels instead of the ground truth, and metric predictions fluctuate based on the test sets they were calculated on. By applying a bias-variance-noise decomposition, we adjust this error to a noise-free, infinite-test-set setting. Our analysis compares the adjusted error of metrics to that of humans and of a derived, perfect segment-level annotator, both of which are unbiased estimators that depend on the number of judgments collected. In machine translation (MT), we identify two settings where metrics outperform humans due to a statistical advantage in variance: when the number of human judgments used is small, and when the quality difference between the compared systems is small. The data and code to reproduce our analyses are available at https://github.com/johntzwei/metric-statistical-advantage.
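To make the estimator comparison concrete, the sketch below shows one way to estimate pairwise prediction error with the bootstrap: resample the test set, recompute each estimator's prediction of which system is better, and count how often it disagrees with a human-derived reference label. This is an illustrative toy, not the authors' released code (see the repository linked in the abstract); the score arrays, sample sizes, and distributions are invented for the example, and the bias-variance-noise adjustment described above is not included.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical segment-level scores for two systems, A and B, on shared test
# segments. These stand in for an automatic metric's segment scores and for
# human judgments; the sizes and distributions are placeholders, not the
# paper's data.
metric_A, metric_B = rng.normal(0.52, 0.10, 1000), rng.normal(0.50, 0.10, 1000)
human_A, human_B = rng.normal(0.52, 0.25, 200), rng.normal(0.50, 0.25, 200)

def predict(scores_a, scores_b):
    """Pairwise prediction: is system A better than system B on average?"""
    return scores_a.mean() > scores_b.mean()

def bootstrap_error(scores_a, scores_b, label, n_boot=1000):
    """Fraction of bootstrap resamples of the test set whose pairwise
    prediction disagrees with the reference label."""
    n = len(scores_a)
    flips = 0
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample segments with replacement
        flips += predict(scores_a[idx], scores_b[idx]) != label
    return flips / n_boot

# As in the paper's setup, the reference label itself comes from (noisy)
# human judgments rather than the ground truth.
label = predict(human_A, human_B)
print("metric pairwise error:", bootstrap_error(metric_A, metric_B, label))
print("human  pairwise error:", bootstrap_error(human_A, human_B, label))
```

In this toy setting the smaller, noisier human sample flips its decision far more often than the metric does, which mirrors the variance advantage the abstract describes for small numbers of human judgments and small quality differences between systems.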


Related Research

11/07/2021
Variance-Aware Machine Translation Test Sets
We release 70 small and discriminative test sets for machine translation...

04/21/2022
Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics
How reliably an automatic summarization evaluation metric replicates hum...

09/24/2018
Statistical Estimation of Malware Detection Metrics in the Absence of Ground Truth
The accurate measurement of security metrics is a critical research prob...

07/06/2020
Comparing representational geometries using the unbiased distance correlation
Representational similarity analysis (RSA) tests models of brain computa...

04/08/2019
Unbiased variance reduction in randomized experiments
This paper develops a flexible method for decreasing the variance of est...

08/01/2019
Estimating the Standard Error of Cross-Validation-Based Estimators of Classification Rules Performance
First, we analyze the variance of the Cross Validation (CV)-based estima...

03/06/2022
Social-Implicit: Rethinking Trajectory Prediction Evaluation and The Effectiveness of Implicit Maximum Likelihood Estimation
Best-of-N (BoN) Average Displacement Error (ADE)/ Final Displacement Err...

Code Repositories

metric-statistical-advantage

Data and analyses for our ACL 2021 work