USB: A Unified Summarization Benchmark Across Tasks and Domains

05/23/2023
by   Kundan Krishna, et al.

An abundance of datasets exist for training and evaluating models on the task of summary generation. However, these datasets are often derived heuristically, and lack sufficient annotations to support research into all aspects of summarization, such as evidence extraction and controllable summarization. We introduce a benchmark comprising 8 tasks that require multi-dimensional understanding of summarization, e.g., surfacing evidence for a summary, assessing its correctness, and gauging its relevance to different topics. We compare various methods on this benchmark and discover that on multiple tasks, moderately-sized fine-tuned models consistently outperform much larger few-shot prompted language models. For factuality-related tasks, we also evaluate existing heuristics for creating training data and find that training on them performs worse than training on 20× less human-labeled data. Our benchmark consists of data from 6 different domains, allowing us to study cross-domain performance of trained models. We find that for some tasks, the amount of training data matters more than the domain it comes from, while for other tasks training specifically on data from the target domain, even if limited, is more beneficial. Our work fulfills the need for a well-annotated summarization benchmark with diverse tasks, and provides useful insights about the impact of the quality, size, and domain of training data.

