FEQA: A Question Answering Evaluation Framework for Faithfulness Assessment in Abstractive Summarization

05/07/2020
by Esin Durmus, et al.

Neural abstractive summarization models are prone to generating content that is inconsistent with the source document, i.e., unfaithful. Existing automatic metrics do not capture such mistakes effectively. We tackle the problem of evaluating the faithfulness of a generated summary given its source document. We first collect human annotations of faithfulness for the outputs of numerous models on two datasets. We find that current models exhibit a trade-off between abstractiveness and faithfulness: outputs with less word overlap with the source document are more likely to be unfaithful. Next, we propose FEQA, an automatic question answering (QA) based metric for faithfulness that leverages recent advances in reading comprehension. Given question-answer pairs generated from the summary, a QA model extracts answers from the document; non-matching answers indicate unfaithful information in the summary. Compared with metrics based on word overlap, embedding similarity, and learned language understanding models, our QA-based metric correlates significantly better with human faithfulness scores, especially on highly abstractive summaries.
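To make the pipeline concrete, here is a minimal Python sketch of how such a QA-based faithfulness score can be computed. It is an illustration under stated assumptions, not the authors' implementation: the QA checkpoint named below is an arbitrary SQuAD-trained model, the question-generation step is left as a stub (FEQA uses a learned QG model that asks about masked spans of the summary), and answers are compared with token-level F1.

```python
# Minimal sketch of a FEQA-style faithfulness score (illustrative, not the
# authors' released code). The QA model name is an assumption; the
# question-generation step is stubbed, since FEQA relies on a learned QG
# model that asks questions about masked spans of the summary.
from collections import Counter

from transformers import pipeline

qa = pipeline("question-answering", model="distilbert-base-cased-distilled-squad")

def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style token-level F1 between two answer strings."""
    pred_toks, gold_toks = pred.lower().split(), gold.lower().split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)

def generate_qa_pairs(summary: str) -> list[tuple[str, str]]:
    """Placeholder: return (question, answer) pairs whose answers are spans
    of the summary. A real system plugs in a trained QG model here."""
    raise NotImplementedError("plug in a question-generation model")

def feqa_score(document: str, summary: str) -> float:
    """Ask the summary's questions against the source document and average
    the F1 between each document answer and the summary's answer; a low
    score flags information the document does not support."""
    pairs = generate_qa_pairs(summary)
    scores = [token_f1(qa(question=q, context=document)["answer"], a)
              for q, a in pairs]
    return sum(scores) / len(scores) if scores else 0.0
```

In this setup, a summary that hallucinates an entity yields questions the document cannot answer consistently, so the answer-level F1 drops and pulls the overall score down.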

Related research

10/01/2020 · Towards Question-Answering as an Automatic Metric for Evaluating the Content Quality of a Summary
Recently, there has been growing interest in using question-answering (Q...

06/02/2019 · Question Answering as an Automatic Evaluation Metric for News Article Summarization
Recent work in the field of automatic summarization and headline generat...

11/27/2020 · FFCI: A Framework for Interpretable Automatic Evaluation of Summarization
In this paper, we propose FFCI, a framework for automatic summarization ...

09/26/2021 · QA-Align: Representing Cross-Text Content Overlap by Aligning Question-Answer Propositions
Multi-text applications, such as multi-document summarization, are typic...

04/04/2019 · Guiding Extractive Summarization with Question-Answering Rewards
Highlighting while reading is a natural behavior for people to track sal...

06/04/2019 · HighRES: Highlight-based Reference-less Evaluation of Summarization
There has been substantial progress in summarization research enabled by...

06/25/2022 · Evaluation of Semantic Answer Similarity Metrics
There are several issues with the existing general machine translation o...
