BUMP: A Benchmark of Unfaithful Minimal Pairs for Meta-Evaluation of Faithfulness Metrics

12/20/2022
by Liang Ma, et al.

The proliferation of automatic faithfulness metrics for summarization has produced a need for benchmarks to evaluate them. While existing benchmarks measure the correlation with human judgements of faithfulness on model-generated summaries, they are insufficient for diagnosing whether metrics are: 1) consistent, i.e., decrease as errors are introduced into a summary, 2) effective on human-written texts, and 3) sensitive to different error types (as summaries can contain multiple errors). To address these needs, we present a benchmark of unfaithful minimal pairs (BUMP), a dataset of 889 human-written, minimally different summary pairs, where a single error (from an ontology of 7 types) is introduced to a summary from the CNN/DailyMail dataset to produce an unfaithful summary. We find that BUMP complements existing benchmarks in a number of ways: 1) the summaries in BUMP are harder to discriminate and less probable under SOTA summarization models, 2) BUMP enables measuring the consistency of metrics, and reveals that the most discriminative metrics tend not to be the most consistent, and 3) BUMP enables the measurement of metrics' performance on individual error types and highlights areas of weakness for future work.
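The consistency criterion above can be sketched concretely: over a set of (article, faithful summary, unfaithful summary) minimal pairs, count how often a metric's score decreases once the error is introduced. The sketch below uses a toy token-overlap score as a stand-in for a real faithfulness metric; the function names and the example pair are illustrative assumptions, not artifacts from the paper or the BUMP dataset.

```python
def score(article: str, summary: str) -> float:
    """Toy stand-in for a faithfulness metric: the fraction of summary
    tokens that also appear in the source article (NOT a metric from
    the paper; any real metric could be dropped in here)."""
    words = summary.lower().replace(".", " ").split()
    article_words = set(article.lower().replace(".", " ").split())
    return sum(w in article_words for w in words) / max(len(words), 1)

def consistency(pairs) -> float:
    """Fraction of minimal pairs on which the metric decreases when the
    single error is introduced -- the 'consistency' criterion above."""
    hits = sum(score(art, faithful) > score(art, unfaithful)
               for art, faithful, unfaithful in pairs)
    return hits / len(pairs)

# Illustrative minimal pair: one entity swap makes the summary unfaithful.
pairs = [
    ("The cat sat on the red mat in the kitchen.",
     "The cat sat on the red mat.",   # faithful summary
     "The dog sat on the red mat."),  # minimally edited, unfaithful
]
print(consistency(pairs))  # 1.0 on this toy pair
```

A metric can be highly discriminative in aggregate yet fail this pairwise test, which is why the paper measures consistency per pair rather than via corpus-level correlation alone.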

Related research:
- Understanding Factuality in Abstractive Summarization with FRANK: A Benchmark for Factuality Metrics (04/27/2021)
- Asking and Answering Questions to Evaluate the Factual Consistency of Summaries (04/08/2020)
- How Far are We from Robust Long Abstractive Summarization? (10/30/2022)
- Evaluating the Factual Consistency of Large Language Models Through Summarization (11/15/2022)
- Understanding Factual Errors in Summarization: Errors, Summarizers, Datasets, Error Detectors (05/25/2022)
- Towards Abstractive Grounded Summarization of Podcast Transcripts (03/22/2022)
- An End-to-End Workflow using Topic Segmentation and Text Summarisation Methods for Improved Podcast Comprehension (07/25/2023)
