How to Evaluate a Summarizer: Study Design and Statistical Analysis for Manual Linguistic Quality Evaluation

01/27/2021
by Julius Steen, et al.

Manual evaluation is essential to judge progress on automatic text summarization. However, a survey we conducted of recent summarization system papers reveals little agreement on how to perform such evaluation studies. We conduct two evaluation experiments on two aspects of summaries' linguistic quality (coherence and repetitiveness) to compare Likert-type and ranking annotations, and show that the best choice of evaluation method can vary from one aspect to another. Our survey also finds that study parameters, such as the overall number of annotators and the distribution of annotators to annotation items, are often not fully reported, and that subsequent statistical analysis ignores grouping factors that arise from one annotator judging multiple summaries. Using our evaluation experiments, we show that the total number of annotators can have a strong impact on study power, and that current statistical analysis methods can inflate type I error rates up to eight-fold. In addition, we highlight that, for the purpose of system comparison, the current practice of eliciting multiple judgements per summary leads to less powerful and reliable annotations given a fixed study budget.
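The type I error inflation described above can be illustrated with a small Monte Carlo simulation. The sketch below (a hypothetical illustration, not the authors' analysis code; all parameter values are assumptions) draws ratings for two systems of identical true quality, where each annotator judges multiple summaries and contributes a personal bias. A naive two-sample test that treats every rating as independent then rejects the null far more often than the nominal 5% level:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_study(n_annotators=6, items_per_annotator=20,
                   annotator_sd=1.0, noise_sd=1.0):
    """Ratings for two systems with *identical* true quality.

    Each system is judged by its own group of annotators; every
    annotator rates several summaries and carries a personal bias
    (the grouping factor a correct analysis must account for).
    """
    scores = {}
    for system in ("A", "B"):
        offsets = rng.normal(0.0, annotator_sd, n_annotators)      # per-annotator bias
        noise = rng.normal(0.0, noise_sd, (n_annotators, items_per_annotator))
        scores[system] = offsets[:, None] + noise                  # shape: annotators x items
    return scores["A"], scores["B"]

def naive_test_rejects(a, b, z_crit=1.96):
    """Two-sample z-test treating all ratings as independent samples."""
    a, b = a.ravel(), b.ravel()
    se = np.sqrt(a.var(ddof=1) / a.size + b.var(ddof=1) / b.size)
    return abs(a.mean() - b.mean()) / se > z_crit

# Since both systems are identical, every rejection is a false positive.
n_sims = 2000
false_positives = sum(naive_test_rejects(*simulate_study()) for _ in range(n_sims))
print(false_positives / n_sims)  # far above the nominal 0.05
```

The intuition: with per-annotator bias, the effective sample size is closer to the number of annotators than to the number of ratings, so the naive standard error is badly underestimated. A mixed-effects model with a random intercept per annotator (e.g. via `statsmodels.mixedlm`) is the standard remedy.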


Related research

- SNaC: Coherence Error Detection for Narrative Summarization (05/19/2022)
- HighRES: Highlight-based Reference-less Evaluation of Summarization (06/04/2019)
- LongEval: Guidelines for Human Evaluation of Faithfulness in Long-form Summarization (01/30/2023)
- What Makes a Good Summary? Reconsidering the Focus of Automatic Summarization (12/14/2020)
- SumQE: a BERT-based Summary Quality Estimation Model (09/02/2019)
- A comprehensive review of automatic text summarization techniques: method, data, evaluation and coding (01/04/2023)
- Crowdsourcing Lightweight Pyramids for Manual Summary Evaluation (04/11/2019)
