
Re-evaluating Evaluation in Text Summarization

10/14/2020
by Manik Bhandari, et al.

Automated evaluation metrics as a stand-in for manual evaluation are an essential part of the development of text-generation tasks such as text summarization. However, while the field has progressed, our standard metrics have not – for nearly 20 years ROUGE has been the standard evaluation in most summarization papers. In this paper, we make an attempt to re-evaluate the evaluation method for text summarization: assessing the reliability of automatic metrics using top-scoring system outputs, both abstractive and extractive, on recently popular datasets for both system-level and summary-level evaluation settings. We find that conclusions about evaluation metrics on older datasets do not necessarily hold on modern datasets and systems.
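
The abstract contrasts system-level and summary-level settings for judging how well an automatic metric such as ROUGE agrees with human judgments. The sketch below illustrates that distinction in a minimal way; it assumes the rouge-score and scipy packages, uses placeholder summaries and made-up human scores, and picks Kendall's tau as one common correlation choice. It is not the paper's exact protocol or data.

```python
# Minimal sketch of system-level vs. summary-level metric evaluation.
# Assumes the `rouge-score` and `scipy` packages; all texts and human
# judgments below are placeholders, not the paper's data.
from rouge_score import rouge_scorer
from scipy.stats import kendalltau

# Toy inputs: reference summaries and outputs from three hypothetical systems.
references = [
    "the cat sat on the mat",
    "heavy rain caused flooding in the city",
]
system_outputs = {
    "system_a": ["a cat sat on the mat",
                 "heavy rain led to flooding across the city"],
    "system_b": ["the mat was where the cat sat",
                 "the city saw heavy rain and some flooding"],
    "system_c": ["a dog slept on the rug",
                 "the weather was bad yesterday"],
}
# Placeholder human judgments, one score per (system, document).
human_scores = {
    "system_a": [4.5, 4.0],
    "system_b": [3.5, 3.0],
    "system_c": [1.5, 2.0],
}

scorer = rouge_scorer.RougeScorer(["rouge2"], use_stemmer=True)

def rouge2_f1(reference, prediction):
    # ROUGE-2 F1 between a single reference and a single system output.
    return scorer.score(reference, prediction)["rouge2"].fmeasure

# Per-summary metric scores for every system.
metric_scores = {
    sys: [rouge2_f1(ref, out) for ref, out in zip(references, outs)]
    for sys, outs in system_outputs.items()
}

systems = sorted(system_outputs)

# System-level: average metric and human scores per system, then correlate
# the per-system averages across systems.
sys_metric = [sum(metric_scores[s]) / len(metric_scores[s]) for s in systems]
sys_human = [sum(human_scores[s]) / len(human_scores[s]) for s in systems]
tau_system, _ = kendalltau(sys_metric, sys_human)

# Summary-level: correlate metric and human scores across systems for each
# document separately, then average the per-document correlations.
taus = []
for doc_idx in range(len(references)):
    m = [metric_scores[s][doc_idx] for s in systems]
    h = [human_scores[s][doc_idx] for s in systems]
    tau, _ = kendalltau(m, h)
    taus.append(tau)
tau_summary = sum(taus) / len(taus)

print(f"system-level Kendall tau:  {tau_system:.3f}")
print(f"summary-level Kendall tau: {tau_summary:.3f}")
```

The key difference is where the averaging happens: the system-level setting averages scores per system before correlating, while the summary-level setting correlates within each document and averages the resulting coefficients, so a metric can look reliable in one setting and unreliable in the other.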

Related research

10/24/2020  Go Figure! A Meta Evaluation of Factuality in Summarization
Text generation models can generate factually inconsistent text containi...

04/21/2022  Spurious Correlations in Reference-Free Evaluation of Text Generation
Model-based, reference-free evaluation metrics have been proposed as a f...

11/08/2020  Metrics also Disagree in the Low Scoring Range: Revisiting Summarization Evaluation Metrics
In text summarization, evaluating the efficacy of automatic metrics with...

03/07/2023  Is ChatGPT a Good NLG Evaluator? A Preliminary Study
Recently, the emergence of ChatGPT has attracted wide attention from the...

12/12/2022  T5Score: Discriminative Fine-tuning of Generative Evaluation Metrics
Modern embedding-based metrics for evaluation of generated text generall...

10/31/2022  Questioning the Validity of Summarization Datasets and Improving Their Factual Consistency
The topic of summarization evaluation has recently attracted a surge of ...

08/23/2019  Neural Text Summarization: A Critical Evaluation
Text summarization aims at compressing long documents into a shorter for...