News Summarization and Evaluation in the Era of GPT-3

09/26/2022
by Tanya Goyal, et al.

The recent success of zero- and few-shot prompting with models like GPT-3 has led to a paradigm shift in NLP research. In this paper, we study its impact on text summarization, focusing on the classic benchmark domain of news summarization. First, we investigate how zero-shot GPT-3 compares against fine-tuned models trained on large summarization datasets. We show that not only do humans overwhelmingly prefer GPT-3 summaries, but these also do not suffer from common dataset-specific issues such as poor factuality. Next, we study what this means for evaluation, particularly the role of gold standard test sets. Our experiments show that both reference-based and reference-free automatic metrics, e.g. recently proposed QA- or entailment-based factuality approaches, cannot reliably evaluate zero-shot summaries. Finally, we discuss future research challenges beyond generic summarization, specifically, keyword- and aspect-based summarization, showing how dominant fine-tuning approaches compare to zero-shot prompting. To support further research, we release: (a) a corpus of 10K generated summaries from fine-tuned and zero-shot models across 4 standard summarization benchmarks, (b) 1K human preference judgments and rationales comparing different systems for generic- and keyword-based summarization.

research
09/18/2023

Summarization is (Almost) Dead

How well can large language models (LLMs) generate summaries? We develop...
research
10/24/2020

Improving Zero and Few-Shot Abstractive Summarization with Intermediate Fine-tuning and Data Augmentation

Models pretrained with self-supervised objectives on large text corpora ...
research
11/29/2022

Zero-Shot Opinion Summarization with GPT-3

Very large language models such as GPT-3 have shown impressive performan...
research
12/20/2022

DocAsRef: A Pilot Empirical Study on Repurposing Reference-Based Summary Quality Metrics Reference-Freely

Summary quality assessment metrics have two categories: reference-based ...
research
04/28/2022

Neural Label Search for Zero-Shot Multi-Lingual Extractive Summarization

In zero-shot multilingual extractive text summarization, a model is typi...
research
05/24/2023

Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

Research on automated text summarization relies heavily on human and aut...
research
05/12/2022

CiteSum: Citation Text-guided Scientific Extreme Summarization and Low-resource Domain Adaptation

Scientific extreme summarization (TLDR) aims to form ultra-short summari...
