Understanding the Extent to which Summarization Evaluation Metrics Measure the Information Quality of Summaries

10/23/2020
by Daniel Deutsch, et al.

Reference-based metrics such as ROUGE or BERTScore evaluate the content quality of a summary by comparing it to a reference summary. Ideally, this comparison should measure the summary's information quality by calculating how much information the two summaries have in common. In this work, we analyze the token alignments that ROUGE and BERTScore use to compare summaries and argue that their scores largely cannot be interpreted as measuring information overlap; rather, they measure the extent to which the summaries discuss the same topics. Further, we provide evidence that this result holds for many other summarization evaluation metrics. Consequently, the summarization community has not yet found a reliable automatic metric that aligns with its research goal of generating summaries with high-quality information. Finally, we propose a simple and interpretable method of evaluating summaries that directly measures information overlap, and we demonstrate how it can be used to gain insights into model behavior that other methods alone could not provide.
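
To make the point about token alignments concrete, the following is a minimal sketch of a unigram-overlap score in the spirit of ROUGE-1. It is not the official ROUGE implementation (it assumes whitespace tokenization and skips stemming and stopword handling), and the function names and the two example sentences are hypothetical, chosen only for illustration. The two sentences share a topic and most of their tokens but state different facts.

```python
from collections import Counter


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real ROUGE
    # implementations typically also stem and normalize punctuation.
    return text.lower().split()


def unigram_overlap(candidate, reference):
    """Simplified ROUGE-1-style precision, recall, and F1."""
    cand_counts = Counter(tokenize(candidate))
    ref_counts = Counter(tokenize(reference))

    # Clipped overlap: each token is matched at most as many times
    # as it occurs in both summaries.
    overlap = sum((cand_counts & ref_counts).values())

    precision = overlap / max(sum(cand_counts.values()), 1)
    recall = overlap / max(sum(ref_counts.values()), 1)
    if overlap == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical example: same topic, different (conflicting) information.
reference = "the mayor said the bridge will open in may"
candidate = "the mayor said the bridge will close in june"

scores = unigram_overlap(candidate, reference)
print(scores)
```

Seven of the nine tokens in each sentence are matched, so the score is high even though the candidate conveys different information than the reference. This is the kind of topic-level agreement that a token-overlap score can reward.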

