
Re-evaluating Evaluation in Text Summarization

by Manik Bhandari et al.

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
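The two meta-evaluation settings the abstract mentions can be made concrete: system-level evaluation correlates per-system average scores with human judgments, while summary-level evaluation averages per-document correlations across systems. The sketch below uses Kendall's tau as the correlation measure (the paper considers several correlation statistics; this is just one choice), and the function names and data layout are illustrative, not from the paper.

```python
from itertools import combinations

def kendall_tau(xs, ys):
    """Kendall's tau-a over paired scores (no tie correction)."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n = len(xs)
    return (concordant - discordant) / (n * (n - 1) / 2)

def system_level(metric, human):
    """Correlate per-system mean scores.

    metric, human: dicts mapping system name -> list of per-document scores.
    """
    systems = sorted(metric)
    m = [sum(metric[s]) / len(metric[s]) for s in systems]
    h = [sum(human[s]) / len(human[s]) for s in systems]
    return kendall_tau(m, h)

def summary_level(metric, human):
    """Average, over documents, of the per-document correlation across systems."""
    systems = sorted(metric)
    n_docs = len(metric[systems[0]])
    taus = []
    for d in range(n_docs):
        taus.append(kendall_tau([metric[s][d] for s in systems],
                                [human[s][d] for s in systems]))
    return sum(taus) / n_docs
```

A metric that ranks systems the same way humans do in aggregate can still disagree with humans on individual documents, which is why the two settings can yield different conclusions about the same metric.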
