A Statistical Analysis of Summarization Evaluation Metrics using Resampling Methods

03/31/2021
by Daniel Deutsch, et al.

The quality of a summarization evaluation metric is quantified by calculating the correlation between its scores and human annotations across a large number of summaries. Currently, it is not clear how precise these correlation estimates are, nor whether differences between two metrics' correlations reflect a true difference or are due to random chance. In this work, we address these two problems by proposing methods for calculating confidence intervals and running hypothesis tests for correlations using two resampling methods, bootstrapping and permutation. After evaluating which of the proposed methods is most appropriate for summarization through two simulation experiments, we analyze the results of applying these methods to several different automatic evaluation metrics across three sets of human annotations. We find that the confidence intervals are rather wide, demonstrating high uncertainty in how reliable automatic metrics truly are. Further, although many metrics fail to show statistical improvements over ROUGE, two recent works, QAEval and BERTScore, do in some evaluation settings.
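The two resampling procedures can be sketched in a few lines of Python. This is a minimal illustration rather than the paper's exact implementation: it assumes one metric score and one human score per summary, and the function names, resample counts, and 0.05 significance level are illustrative choices; the paper also considers other correlation coefficients and grouping levels.

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)  # fixed seed for reproducibility

def bootstrap_ci(metric_scores, human_scores, n_resamples=1000, alpha=0.05):
    """Percentile-bootstrap confidence interval for a metric's Pearson
    correlation with human scores, resampling summaries with replacement."""
    metric_scores = np.asarray(metric_scores, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    n = len(human_scores)
    corrs = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)  # draw summary indices with replacement
        corrs.append(pearsonr(metric_scores[idx], human_scores[idx])[0])
    lower, upper = np.percentile(corrs, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

def permutation_test(metric_a, metric_b, human_scores, n_permutations=1000):
    """Two-sided permutation test of H0: metric A and metric B correlate
    equally well with the human scores. Each permutation randomly swaps the
    two metrics' scores on each summary to simulate the null distribution."""
    metric_a = np.asarray(metric_a, dtype=float)
    metric_b = np.asarray(metric_b, dtype=float)
    human_scores = np.asarray(human_scores, dtype=float)
    observed = (pearsonr(metric_a, human_scores)[0]
                - pearsonr(metric_b, human_scores)[0])
    hits = 0
    for _ in range(n_permutations):
        swap = rng.random(len(human_scores)) < 0.5  # per-summary coin flip
        a = np.where(swap, metric_b, metric_a)
        b = np.where(swap, metric_a, metric_b)
        diff = pearsonr(a, human_scores)[0] - pearsonr(b, human_scores)[0]
        hits += abs(diff) >= abs(observed)
    return hits / n_permutations  # estimated two-sided p-value
```

Under this sketch, a wide interval from bootstrap_ci corresponds to the high uncertainty reported above, and a large p-value from permutation_test means the observed gap between two metrics' correlations is consistent with random chance.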

Related research:

- Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics (04/21/2022). How reliably an automatic summarization evaluation metric replicates hum...
- Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries (10/23/2020). Reference-based metrics such as ROUGE or BERTScore evaluate the content ...
- Human-like Summarization Evaluation with ChatGPT (04/05/2023). Evaluating text summarization is a challenging problem, and existing eva...
- Does Summary Evaluation Survive Translation to Other Languages? (09/16/2021). The creation of a large summarization quality dataset is a considerable,...
- An Empirical Study into Annotator Agreement, Ground Truth Estimation, and Algorithm Evaluation (07/01/2013). Although agreement between annotators has been studied in the past from ...
- Quantifying Performance Changes with Effect Size Confidence Intervals (07/21/2020). Measuring performance and quantifying a performance change are core eval...
- Jackknife Empirical Likelihood Methods for Gini Correlations and their Equality Testing (06/03/2018). The Gini correlation plays an important role in measuring dependence of ...
