Is Summary Useful or Not? An Extrinsic Human Evaluation of Text Summaries on Downstream Tasks

05/24/2023
by   Xiao Pu, et al.

Research on automated text summarization relies heavily on human and automatic evaluation. While recent work on human evaluation has mainly adopted intrinsic methods, judging generic qualities of summaries such as informativeness and coherence, our work evaluates the usefulness of summaries with extrinsic methods. We carefully design three downstream tasks for extrinsic human evaluation of summaries: question answering, text classification, and text similarity assessment. We carry out experiments using system rankings and user behavior data to evaluate the performance of different summarization models. We find that summaries are particularly useful in tasks that rely on an overall judgment of the text, but less effective in question answering. Summaries generated by fine-tuned models show consistent usefulness across all three tasks: rankings of fine-tuned summarization systems remain close across downstream tasks under the proposed extrinsic metrics. Summaries generated by models in the zero-shot setting, however, are biased towards the text classification and similarity assessment tasks because of their general, less detailed style. We further evaluate the correlation of 14 intrinsic automatic metrics with human criteria and show that these metrics assess the usefulness of summaries well in the question-answering task, but are less effective in the other two tasks. This highlights the limitations of relying solely on intrinsic automatic metrics to evaluate the performance and usefulness of summaries.
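The comparison between intrinsic automatic metrics and extrinsic usefulness described above amounts to comparing system rankings, which is commonly done with a rank correlation coefficient. The sketch below illustrates the idea with Spearman's rho; all system scores are hypothetical placeholders, not the paper's data, and the paper does not specify which correlation coefficient it uses.

```python
# Illustrative sketch (not the paper's code or data): comparing system
# rankings produced by an intrinsic automatic metric against rankings
# produced by an extrinsic downstream task, via Spearman rank correlation.

def ranks(scores):
    """Map each score to its 1-based rank, ascending (assumes no ties)."""
    order = sorted(scores)
    return [order.index(s) + 1 for s in scores]

def spearman_rho(xs, ys):
    """Spearman rank correlation: rho = 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n * n - 1))

# Hypothetical per-system scores for four summarization systems.
intrinsic_metric = [0.41, 0.38, 0.45, 0.30]  # e.g. a ROUGE-style score
extrinsic_task = [0.72, 0.75, 0.78, 0.61]    # e.g. QA accuracy using the summary

print(spearman_rho(intrinsic_metric, extrinsic_task))  # -> 0.8
```

A rho near 1.0 would mean the intrinsic metric ranks systems much as the downstream task does; the paper reports this holding for question answering but not for classification or similarity assessment.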


Related research

09/18/2023 · Summarization is (Almost) Dead
How well can large language models (LLMs) generate summaries? We develop...

10/06/2020 · Multi-Fact Correction in Abstractive Text Summarization
Pre-trained neural abstractive summarization systems have dominated extr...

09/26/2022 · News Summarization and Evaluation in the Era of GPT-3
The recent success of zero- and few-shot prompting with models like GPT-...

03/30/2023 · Comparing Abstractive Summaries Generated by ChatGPT to Real Summaries Through Blinded Reviewers and Text Classification Algorithms
Large Language Models (LLMs) have gathered significant attention due to ...

09/14/2022 · How to Find Strong Summary Coherence Measures? A Toolbox and a Comparative Study for Summary Coherence Measure Evaluation
Automatically evaluating the coherence of summaries is of great signific...

08/23/2023 · Evaluation of Faithfulness Using the Longest Supported Subsequence
As increasingly sophisticated language models emerge, their trustworthin...

10/01/2020 · Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
Recently, there has been growing interest in using question-answering (Q...
