Re-Examining System-Level Correlations of Automatic Summarization Evaluation Metrics

04/21/2022
by Daniel Deutsch, et al.

System-level correlations quantify how reliably an automatic summarization evaluation metric replicates human judgments of summary quality. We identify two ways in which the definition of the system-level correlation is inconsistent with how metrics are used to evaluate systems in practice, and we propose changes to rectify this disconnect. First, we calculate the system score for an automatic metric using the full test set instead of only the subset of summaries judged by humans; using only that subset is the current standard practice. We demonstrate that this small change leads to more precise estimates of system-level correlations. Second, we propose to calculate correlations only on pairs of systems that are separated by small differences in automatic scores, which are the differences commonly observed in practice. This allows us to demonstrate that our best estimate of the correlation of ROUGE to human judgments is near 0 in realistic scenarios. The results of these analyses point to the need to collect more high-quality human judgments and to improve automatic metrics when differences in system scores are small.
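
The two proposed changes can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `max_metric_gap` parameter, the choice of Kendall's tau-a as the correlation, and all of the toy scores are assumptions made for illustration. The metric's system score is averaged over every test-set summary, the human system score is averaged over the smaller judged subset, and the correlation can optionally be restricted to system pairs whose metric scores are close.

```python
from itertools import combinations
from statistics import mean


def system_scores(scores_by_system):
    """Average per-summary scores into a single score per system."""
    return {name: mean(vals) for name, vals in scores_by_system.items()}


def pairwise_kendall_tau(human, metric, max_metric_gap=None):
    """Kendall's tau-a over pairs of systems.

    If max_metric_gap is given, only pairs whose metric system scores differ
    by at most that amount are counted (hypothetical parameter name).
    Returns None when no pair qualifies.
    """
    concordant = discordant = total = 0
    for a, b in combinations(sorted(human), 2):
        metric_diff = metric[a] - metric[b]
        if max_metric_gap is not None and abs(metric_diff) > max_metric_gap:
            continue
        total += 1
        product = (human[a] - human[b]) * metric_diff
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    if total == 0:
        return None
    return (concordant - discordant) / total


# Toy data, invented for illustration only.
# Metric scores cover the full (here: 4-summary) test set per system.
metric_full = {
    "A": [0.30, 0.32, 0.31, 0.29],
    "B": [0.33, 0.31, 0.34, 0.30],
    "C": [0.45, 0.47, 0.44, 0.46],
}
# Human judgments exist only for a 2-summary subset per system.
human_judged = {
    "A": [3.5, 3.7],
    "B": [3.0, 3.2],
    "C": [4.5, 4.3],
}

metric_sys = system_scores(metric_full)
human_sys = system_scores(human_judged)

tau_all = pairwise_kendall_tau(human_sys, metric_sys)
tau_close = pairwise_kendall_tau(human_sys, metric_sys, max_metric_gap=0.05)
```

In this toy setup the metric agrees with the human ranking on the two pairs involving the clearly stronger system C, but it disagrees on the only pair with a small metric gap (A versus B), so the restricted correlation is far lower than the one computed over all pairs. That is the kind of gap between the standard definition and the realistic scenario that the abstract describes.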


