Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

12/15/2022
by   Yixin Liu, et al.

Human evaluation is the foundation upon which the evaluation of both summarization systems and automatic metrics rests. However, existing human evaluation protocols and benchmarks for summarization either exhibit low inter-annotator agreement or lack the scale needed to draw statistically significant conclusions, and an in-depth analysis of human evaluation practice is still missing. In this work, we address these shortcomings along the following axes: 1) We propose a modified summarization salience protocol, Atomic Content Units (ACUs), which relies on fine-grained semantic units and yields high inter-annotator agreement. 2) We curate the Robust Summarization Evaluation (RoSE) benchmark, a large human evaluation dataset of over 22k summary-level annotations of state-of-the-art systems on three datasets. 3) We compare our ACU protocol with three other human evaluation protocols, highlighting potential confounding factors in evaluation setups. 4) We evaluate existing automatic metrics against the collected human annotations across evaluation protocols and show that our benchmark yields more statistically stable and significant results. Our findings also have important implications for evaluating large language models (LLMs): we show that LLMs tuned with human feedback (e.g., GPT-3.5) may overfit unconstrained human evaluation, which is influenced by annotators' prior, input-agnostic preferences, calling for more robust, targeted evaluation methods.
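
Since the abstract describes the ACU protocol only at a high level, the sketch below illustrates one plausible way a summary-level ACU score could be computed: annotators mark which reference ACUs a system summary supports, and the score is the fraction of ACUs matched, averaged over annotators. The function names, the aggregation scheme, and the example data are assumptions for illustration, not the paper's reference implementation.

```python
# Minimal sketch (not the authors' implementation): scoring a system summary
# against a set of Atomic Content Units (ACUs) written for the reference summary.
# Each annotator marks which ACUs are supported by the system summary; the
# summary-level score is the fraction of ACUs matched, averaged over annotators.
from statistics import mean


def acu_score(matched_acus: set[str], all_acus: set[str]) -> float:
    """Fraction of reference ACUs judged present in the system summary."""
    if not all_acus:
        return 0.0
    return len(matched_acus & all_acus) / len(all_acus)


def aggregate_over_annotators(annotations: list[set[str]], all_acus: set[str]) -> float:
    """Average the per-annotator ACU scores for one system summary."""
    return mean(acu_score(marks, all_acus) for marks in annotations)


# Example: three annotators check four ACUs extracted from the reference.
acus = {"a1", "a2", "a3", "a4"}
annotator_marks = [{"a1", "a2"}, {"a1", "a2", "a3"}, {"a1", "a2"}]
print(aggregate_over_annotators(annotator_marks, acus))  # ~0.58
```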

Related research:

07/24/2020 · SummEval: Re-evaluating Summarization Evaluation
The scarcity of comprehensive up-to-date studies on evaluation metrics f...

05/24/2023 · Analyzing Influential Factors in Human Preference Judgments via GPT-4
Pairwise human judgments are pivotal in guiding large language models (L...

03/07/2023 · Towards Interpretable and Efficient Automatic Reference-Based Summarization Evaluation
Interpretability and efficiency are two important considerations for the...

05/23/2023 · LLMs as Factual Reasoners: Insights from Existing Benchmarks and Beyond
With the recent appearance of LLMs in practical settings, having methods...

06/18/2023 · Summarization from Leaderboards to Practice: Choosing A Representation Backbone and Ensuring Robustness
Academic literature does not give much guidance on how to build the best...

03/08/2020 · ESBM: An Entity Summarization BenchMark
Entity summarization is the problem of computing an optimal compact summ...

01/30/2023 · LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization
While human evaluation remains best practice for accurately judging the ...
