LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization

01/30/2023
by Kalpesh Krishna, et al.

While human evaluation remains best practice for accurately judging the faithfulness of automatically-generated summaries, few solutions exist to address the increased difficulty and workload when evaluating long-form summaries. Through a survey of 162 papers on long-form summarization, we first shed light on current human evaluation practices surrounding long-form summaries. We find that 73% of these papers do not perform any human evaluation on model-generated summaries, while other works face new difficulties that manifest when dealing with long documents (e.g., low inter-annotator agreement). Motivated by our survey, we present LongEval, a set of guidelines for human evaluation of faithfulness in long-form summaries that addresses the following challenges: (1) How can we achieve high inter-annotator agreement on faithfulness scores? (2) How can we minimize annotator workload while maintaining accurate faithfulness scores? and (3) Do humans benefit from automated alignment between summary and source snippets? We deploy LongEval in annotation studies on two long-form summarization datasets in different domains (SQuALITY and PubMed), and we find that switching to a finer granularity of judgment (e.g., clause-level) reduces inter-annotator variance in faithfulness scores (e.g., std-dev from 18.5 to 6.8). We also show that scores from a partial annotation of fine-grained units correlate highly with scores from a full annotation workload (0.89 Kendall's tau using 50% of the judgments). We release our human judgments, annotation templates, and our software as a Python library for future research.
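To make the partial-annotation idea concrete, here is a minimal sketch in Python (the language of the released library, though none of the names below come from its actual API, and the toy judgment data is invented): a summary's faithfulness is scored as the percentage of fine-grained, clause-level units judged faithful, a partial score is computed from a random 50% subset of those judgments, and the two rankings are compared with Kendall's tau via SciPy.

# Illustrative sketch only; these helpers and the toy data are hypothetical
# and are not the API of the released LongEval library.
import random
from scipy.stats import kendalltau

def faithfulness_score(judgments):
    # Score a summary as the percentage of fine-grained (e.g., clause-level)
    # units that annotators marked as faithful (1) rather than unfaithful (0).
    return 100.0 * sum(judgments) / len(judgments)

def partial_score(judgments, fraction=0.5, seed=0):
    # Score the same summary from a random subset of its fine-grained units,
    # simulating a reduced annotation workload.
    rng = random.Random(seed)
    k = max(1, int(len(judgments) * fraction))
    return faithfulness_score(rng.sample(judgments, k))

# Toy clause-level judgments for four model-generated summaries.
summaries = [
    [1, 1, 0, 1, 1, 1, 0, 1],
    [1, 0, 0, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 0, 1, 0, 1, 0, 0, 1],
]

full_scores = [faithfulness_score(s) for s in summaries]
half_scores = [partial_score(s, fraction=0.5) for s in summaries]
tau, _ = kendalltau(full_scores, half_scores)
print(f"Kendall's tau between 50% and 100% annotation: {tau:.2f}")

In the annotation studies reported above, this kind of partial fine-grained scoring tracks the full-annotation scores at 0.89 Kendall's tau while requiring only half of the judgments.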


