Evaluating the Factual Consistency of Abstractive Text Summarization

by Wojciech Kryściński, et al.

Currently used metrics for assessing summarization algorithms do not account for whether summaries are factually consistent with source documents. We propose a weakly-supervised, model-based approach for verifying factual consistency and identifying conflicts between source documents and a generated summary. Training data is generated by applying a series of rule-based transformations to the sentences of source documents. The factual consistency model is then trained jointly for three tasks: 1) identify whether sentences remain factually consistent after transformation, 2) extract a span in the source documents to support the consistency prediction, 3) extract a span in the summary sentence that is inconsistent if one exists. Transferring this model to summaries generated by several state-of-the art models reveals that this highly scalable approach substantially outperforms previous models, including those trained with strong supervision using standard datasets for natural language inference and fact checking. Additionally, human evaluation shows that the auxiliary span extraction tasks provide useful assistance in the process of verifying factual consistency.




1 Introduction

Source article fragments
(CNN) The mother of a quadriplegic man who police say was left in the woods for days cannot be extradited to face charges in Philadelphia until she completes an unspecified "treatment," Maryland police said Monday. The Montgomery County (Maryland) Department of Police took Nyia Parler, 41, into custody Sunday (…)

(CNN) The classic video game "Space Invaders" was developed in Japan back in the late 1970's – and now their real-life counterparts are the topic of an earnest political discussion in Japan's corridors of power. Luckily, Japanese can sleep soundly in their beds tonight as the government's top military official earnestly revealed that (…)
Model generated claims
Quadriplegic man Nyia Parler, 41, left in woods for days can not be extradited.

Video game "Space Invaders" was developed in Japan back in 1970.
Table 1: Examples of factually incorrect claims output by summarization models. Green text highlights the support in the source documents for the generated claims, red text highlights the errors made by summarization models.

The goal of text summarization models is to transduce long documents into a shorter form that retains the most important aspects of the source document. Common approaches to summarization are extractive (Dorr et al., 2003; Nallapati et al., 2017) where the model directly copies the salient parts of the source document into the summary, abstractive (Rush et al., 2015; Paulus et al., 2017) where the important parts are paraphrased to form novel sentences, and hybrid (Gehrmann et al., 2018; Hsu et al., 2018; Chen and Bansal, 2018), combining the two methods by employing specialized extractive and abstractive components.

Advancements in neural architectures (Cho et al., 2014; Sutskever et al., 2014; Bahdanau et al., 2015; Vinyals et al., 2015; Vaswani et al., 2017), pre-training and transfer learning (McCann et al., 2017; Peters et al., 2018; Devlin et al., 2018), and the availability of large-scale supervised datasets (Sandhaus, 2008; Nallapati et al., 2016; Grusky et al., 2018; Narayan et al., 2018; Sharma et al., 2019) allowed deep learning-based approaches to dominate the field. State-of-the-art solutions utilize self-attentive Transformer blocks (Liu, 2019; Liu and Lapata, 2019; Zhang et al., 2019), attention and copying mechanisms (See et al., 2017; Cohan et al., 2018), and multi-objective training strategies (Guo et al., 2018; Pasunuru and Bansal, 2018), including reinforcement learning techniques (Kryściński et al., 2018; Dong et al., 2018; Wu and Hu, 2018).

Despite significant efforts made by the research community, there are still many challenges limiting progress in summarization: insufficient evaluation protocols that leave important dimensions, such as factual consistency, unchecked; noisy, automatically collected datasets that leave the task underconstrained; and strong, domain-specific layout biases in the data that dominate the training signal (Kryściński et al., 2019).

We address the problem of verifying factual consistency between source documents and generated summaries: a factually consistent summary contains only statements that are entailed by the source document. Recent studies show that up to 30% of summaries generated by abstractive models contain factual inconsistencies (Cao et al., 2018; Goodrich et al., 2019; Falke et al., 2019; Kryściński et al., 2019). Such high levels of factual inconsistency render automatically generated summaries virtually useless in practice.

The problem of factual consistency is closely related to natural language inference (NLI) and fact checking. Current NLI datasets (Bowman et al., 2015; Conneau et al., 2018; Williams et al., 2018) focus on classifying logical entailment between short, single-sentence pairs, but verifying factual consistency can require incorporating the entire context of the source document. Fact checking focuses on verifying facts against the whole of available knowledge, whereas factual consistency checking focuses on the adherence of facts to the information provided by a source document, without any guarantee that the information is true.

We propose a novel, weakly-supervised BERT-based (Devlin et al., 2018) model for verifying factual consistency, and we add specialized modules that explain which portions of both the source document and generated summary are pertinent to the model’s decision. Training data is generated from source documents by applying a series of rule-based transformations that were inspired by error-analysis of state-of-the-art summarization model outputs. Training with this weak supervision substantially improves over using the strong supervision provided by existing datasets for NLI (Williams et al., 2018) and fact-checking (Thorne et al., 2018). Through human evaluation we show that the explanatory modules that augment our factual consistency model provide useful assistance to humans as they verify the factual consistency between a source document and generated summaries.

2 Related Work

This work builds on prior work for factual consistency in text summarization and natural language generation.

Goodrich et al. (2019) propose an automatic, model-dependent metric for evaluating the factual accuracy of generated text. Facts are represented as subject-relation-object triplets and factual accuracy is defined as the precision between facts extracted from the generated summary and source document. The authors proposed a new dataset for training fact extraction models based on Wikipedia articles and used it to train a Transformer-based architecture to extract fact triplets. Factual accuracy was then measured by applying this model to the outputs of a separate text summarization model, which had been trained to generate the introduction sections of Wikipedia articles from a set of reference documents (Liu et al., 2018). Human evaluation demonstrated that the proposed technique outperformed other, non-model-based evaluation metrics, such as ROUGE (Lin, 2004), in assessing factual accuracy. Despite positive results, the authors highlighted remaining challenges, such as the metric's inability to adapt to negated relations or relation names expressed by synonyms.

A parallel line of research focused on improving the factual consistency of summarization models by exploring different architectural choices and strategies for both training and inference. In Falke et al. (2019), the authors proposed re-ranking potential summaries based on factual correctness during beam search. The solution used textual entailment (NLI) models, trained on the SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) datasets, to score summaries by means of the entailment probability between all source document-summary sentence pairs. The summary with the highest aggregate entailment score was used as the final output of the summarization model. The authors validated their approach using summaries generated by models trained on the CNN/DailyMail dataset (Nallapati et al., 2016). The authors concluded that out-of-the-box NLI models do not transfer well to the task of factual correctness. The work also showed that the ROUGE metric (Lin, 2004), commonly used to evaluate summarization models, does not correlate with factual correctness. In Cao et al. (2018), the authors proposed a novel, dual-encoder architecture that encodes, in parallel, the source documents and all the facts contained in them. During generation, the decoder attends to both the encoded source and the facts, which, according to the authors, forces the output to be conditioned on both inputs. The facts encoded by the model are explicitly extracted by out-of-the-box open information extraction and dependency parsing models. Experiments were conducted on the Gigaword (Graff and Cieri, 2003) dataset; through human evaluation, the authors showed that the proposed technique substantially lowered the number of errors in generated single-sentence summaries. Li et al. (2018a) incorporated entailment knowledge into summarization by introducing an entailment-aware encoder-decoder model for sentence summarization. Entailment knowledge is injected in two ways: the encoder is shared between the tasks of summarization and textual entailment, and the decoder is trained using reward augmented maximum likelihood (RAML) with rewards coming from a pre-trained entailment classifier. Experiments conducted on the Gigaword (Graff and Cieri, 2003) dataset showed improvements over baselines on both correctness and informativeness.

More loosely related work explored training summarization models in multi-task (Guo et al., 2018) and multi-reward (Pasunuru and Bansal, 2018) settings where the additional task and reward were textual entailment (NLI). The intuition was that incorporating NLI in the training procedure should improve entailment between the summary and source document; however, neither of the mentioned works conducted studies or analysis that would verify this.

Paraphrasing
Original: Sheriff Lee Baca has now decided to recall some 200 badges his department has handed out to local politicians just two weeks after the picture was released by the U.S. attorney's office in support of bribery charges against three city officials.
Transformed: Two weeks after the US Attorney's Office issued photos to support bribery allegations against three municipal officials, Lee Baca has now decided to recall about 200 badges issued by his department to local politicians.

Sentence negation
Original: Snow was predicted later in the weekend for Atlanta and areas even further south.
Transformed: Snow wasn't predicted later in the weekend for Atlanta and areas even further south.

Pronoun swap
Original: It comes after his estranged wife Mona Dotcom filed a $20 million legal claim for cash and assets.
Transformed: It comes after your estranged wife Mona Dotcom filed a $20 million legal claim for cash and assets.

Entity swap
Original: Charlton coach Guy Luzon had said on Monday: 'Alou Diarra is training with us.'
Transformed: Charlton coach Bordeaux had said on Monday: 'Alou Diarra is training with us.'

Number swap
Original: He says he wants to pay off the $12.6million lien so he can sell the house and be done with it, according to the Orlando Sentinel.
Transformed: He says he wants to pay off the $3.45million lien so he can sell the house and be done with it, according to the Orlando Sentinel.

Noise injection
Original: Snow was predicted later in the weekend for Atlanta and areas even further south.
Transformed: Snow was was predicted later in the weekend for Atlanta and areas even further south.

Table 2: Examples of text transformations used to generate training data. Green and red text highlight the changes made by the transformation. Paraphrasing is a semantically invariant transformation; sentence negation and the entity, pronoun, and number swaps are semantically variant transformations.

3 Methods

A careful study of the outputs of state-of-the-art summarization models provided us with valuable insights about the specifics of factual errors made during generation and possible means of detecting them. Primarily, checking factual consistency on a sentence-sentence level, where each sentence of the summary is verified against each sentence from the source document, is insufficient. Some cases might require a longer, multi-sentence context from the source document due to ambiguities present in either of the compared sentences. Summary sentences might paraphrase multiple fragments of the source document, while source document sentences might use certain linguistic constructs, such as coreference, which bind different parts of the document together. In addition, errors made by summarization models are most often related to the use of incorrect entity names, numbers, and pronouns. Other errors, such as negations and commonsense errors, occur less often.[1] Taking these insights into account, we propose and test a document-sentence approach for factual consistency checking, where each sentence of the summary is verified against the entire body of the source document.

[1] A more fine-grained taxonomy of errors could be created, where, for example, incorrectly attributing quotes to entities would be distinguished from choosing an incorrect subject in a sentence. However, it would carry the implicit assumption that NLP models have the ability to reason about the processed text in a similar way as humans do. We refrain from anthropomorphizing summarization models.

3.1 Training data

Currently, there are no supervised training datasets for factual consistency checking. Creating a large-scale, high-quality dataset with strong supervision collected from human annotators is prohibitively expensive and time-consuming. Thus, alternative approaches of acquiring training data are necessary.

Considering the current state of summarization, in which the level of abstraction of generated summaries is low and models mostly paraphrase single sentences and short spans from the source (Kryściński et al., 2018; Zhang et al., 2018), we propose using an artificial, weakly-supervised dataset for the task at hand. Our data creation method requires an unannotated collection of source documents in the same domain as the summarization models that are to be checked. Examples are created by first sampling single sentences, later referred to as claims, from the source documents. Claims then pass through a set of textual transformations that output novel sentences with both positive and negative labels. A detailed description of the data generation procedure is presented in Figure 1. The obvious benefit of using an artificially generated dataset is that it allows for the creation of large volumes of data at a marginal cost. The data generation process also allows us to collect additional metadata that can be used in the training process. In our case, the metadata contains information about the original location of the extracted claim in the source document and the locations in the claim where text transformations were applied.

Our data generation process incorporates both semantically invariant and semantically variant text transformations to generate novel claims with CORRECT and INCORRECT labels, respectively. This work uses the following transformations:


Paraphrasing

A paraphrasing transformation covers cases where source document sentences are rephrased by the summarization model. Paraphrases were produced by backtranslation using Neural Machine Translation systems (Edunov et al., 2018). The original sentence was translated to an intermediate language and translated back to English, yielding a semantically equivalent sentence with minor syntactic and lexical changes. French, German, Chinese, Spanish, and Russian were used as intermediate languages. These languages were chosen based on the performance of recent NMT systems, with the expectation that well-performing languages would ensure better translation quality. We used the Google Cloud Translation API (https://cloud.google.com/translate/) for translations.
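As a minimal illustration of the round-trip idea, the sketch below factors the translation step out into a caller-supplied function; `backtranslate` and the `translate_fn(text, source_lang, target_lang)` interface are hypothetical stand-ins for a real translation client (e.g. a thin wrapper around a cloud translation API), not part of any particular library.

```python
def backtranslate(sentence, translate_fn, pivot_lang):
    """Paraphrase a sentence by round-tripping it through a pivot language.

    translate_fn(text, source_lang, target_lang) is supplied by the caller;
    the wrapper itself only encodes the English -> pivot -> English round trip.
    """
    intermediate = translate_fn(sentence, "en", pivot_lang)
    return translate_fn(intermediate, pivot_lang, "en")
```

A unit test can exercise the round-trip logic with a stub in place of the real translation service, which is also how the sketch stays runnable without API credentials.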

Entity and Number swapping

To learn how to identify examples where the summarization model uses incorrect numbers and entities in generated text, we used the entity and number swapping transformation. An NER system was applied to both the claim sentence and the source document to extract all mentioned entities. To generate a novel, semantically changed claim, an entity in the claim sentence was replaced with an entity from the source document. Both of the swapped entities were chosen at random while ensuring that they were unique. Extracted entities were divided into two groups: named entities, covering person, location, and institution names, and number entities, such as dates and all other numeric values. Entities were swapped within their groups, i.e. named entities would only be replaced with other named entities. In this work we used the SpaCy NER tagger (Honnibal and Montani, 2017).
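A simplified sketch of the swap, assuming entities have already been extracted and grouped (in the pipeline above that step is done by an NER tagger); the function and argument names are illustrative.

```python
import random

def entity_swap(claim, claim_entities, source_entities, rng=random):
    """Replace one entity in the claim with a different entity drawn from
    the source document. Entities are passed in as plain strings and are
    assumed to already be restricted to a single group (named entities or
    number entities), so swaps stay within their group."""
    candidates = [e for e in source_entities if e not in claim_entities]
    if not claim_entities or not candidates:
        return None  # no valid, unique swap is possible for this claim
    old = rng.choice(claim_entities)
    new = rng.choice(candidates)
    return claim.replace(old, new, 1)
```

Returning `None` when no unique swap exists lets the data generation loop simply skip claims that a transformation does not apply to.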

Pronoun swapping

To teach the factual consistency checking model how to find incorrect pronoun use in claim sentences we used a pronoun swapping data augmentation. All gender-specific pronouns were first extracted from the claim sentence. Next, a randomly chosen pronoun was swapped with a different one from the same pronoun group to ensure syntactic correctness, i.e. a possessive pronoun could only be replaced with another possessive pronoun. New sentences were considered semantically variant.
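The swap-within-group constraint can be sketched as follows; the pronoun inventory here is deliberately small and illustrative (note that "her" is ambiguous between the object and possessive groups, which a real implementation would resolve with POS tags).

```python
import random

# Pronouns grouped so that a swap keeps the sentence syntactically valid.
PRONOUN_GROUPS = [
    {"he", "she"},    # subject pronouns
    {"him", "her"},   # object pronouns
    {"his", "her"},   # possessive determiners
]

def pronoun_swap(sentence, rng=random):
    """Swap one randomly chosen gendered pronoun for another pronoun from
    the same group, producing a semantically variant claim."""
    tokens = sentence.split()
    hits = [(i, group) for i, tok in enumerate(tokens)
            for group in PRONOUN_GROUPS if tok.lower() in group]
    if not hits:
        return None  # no gendered pronoun to swap
    i, group = rng.choice(hits)
    tokens[i] = rng.choice(sorted(group - {tokens[i].lower()}))
    return " ".join(tokens)
```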

Sentence negation

To give the factual consistency checking model the ability to handle negated sentences, we used a sentence negation transformation. In the first step, the claim sentence was scanned in search of auxiliary verbs. To switch the meaning of the new sentence, a randomly chosen auxiliary verb was replaced with its negation. Positive sentences would be negated by adding not or n't after the chosen verb; negative sentences would be switched by removing the negation.
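A minimal sketch of the polarity flip: it negates the first auxiliary found (the procedure above picks a random one) and handles only the bare "not" form, both simplifications of the described transformation.

```python
AUXILIARY_VERBS = {"is", "was", "are", "were", "will", "would", "can",
                   "could", "should", "has", "have", "had", "do", "does", "did"}

def negate(sentence):
    """Flip the polarity of the first auxiliary verb in the sentence.
    Returns None when no auxiliary verb is found."""
    tokens = sentence.split()
    for i, tok in enumerate(tokens):
        if tok.lower() in AUXILIARY_VERBS:
            if i + 1 < len(tokens) and tokens[i + 1].lower() == "not":
                del tokens[i + 1]            # negative -> positive
            else:
                tokens.insert(i + 1, "not")  # positive -> negative
            return " ".join(tokens)
    return None
```

Applying the function twice recovers the original sentence, which is a convenient sanity check for the transformation.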

Noise injection

Because the verified summaries are fully generated by a deep neural network, they should be expected to contain certain types of noise. In order to make the trained factual consistency model robust to such generation errors, all training examples were injected with noise using a simple algorithm. For each token in a claim, a decision was made whether noise should be added at the given position with a preset probability. If noise was to be injected, the token was randomly duplicated or removed from the sequence. Examples of all transformations are presented in Table 2.
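The per-token noise decision can be sketched directly; the 50/50 split between duplicating and dropping a noised token is our assumption about the "randomly duplicated or removed" step.

```python
import random

def inject_noise(tokens, prob, rng=None):
    """For each token, with probability `prob`, either duplicate the token
    or drop it (chosen uniformly); otherwise keep it unchanged."""
    rng = rng or random.Random()
    noisy = []
    for tok in tokens:
        if rng.random() < prob:
            if rng.random() < 0.5:
                noisy.extend([tok, tok])  # duplicate the token
            # else: drop the token entirely
        else:
            noisy.append(tok)
    return noisy
```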


3.2 Development and test data

Apart from the artificially generated training set, separate, manually annotated development and test sets were created. Both of the manually annotated datasets utilized summaries output by state-of-the-art summarization models. Each summary was split into separate sentences and all (document, sentence) pairs were annotated by the authors of this work. Since the focus was to collect data that would allow verifying the factual consistency of summarization models, any unreadable sentences caused by poor generation were not labeled. The development set consists of 931 examples; the test set contains 503 examples. The model outputs used for annotation were provided by the authors of the following papers: Hsu et al. (2018); Gehrmann et al. (2018); Jiang and Bansal (2018); Chen and Bansal (2018); See et al. (2017); Kryściński et al. (2018); Li et al. (2018b); Pasunuru and Bansal (2018); Zhang et al. (2018); Guo et al. (2018).

Effort was made to collect a larger set of annotations through crowdsourcing platforms; however, the inter-annotator agreement and general quality of annotations was too low to be considered reliable for the task at hand. This aligns with the conclusions of Falke et al. (2019), where the authors showed that for the task of factual consistency the inter-annotator agreement coefficient reached 0.75 only when 12 annotations were collected for each example. This, in turn, yields prohibitively high annotation costs.

S - set of source documents
T+ - set of semantically invariant transformations
T- - set of semantically variant transformations

function generate_data(S, T+, T-)
     D <- {}   (set of generated data points)
     for each document s in S do
          c <- sample a claim sentence from s
          for each transformation t in T+ do
               D <- D ∪ {(s, t(c), +)}
          end for
          for each transformation t in T- do
               D <- D ∪ {(s, t(c), -)}
          end for
     end for
     return D
end function
Figure 1: Procedure to generate weakly-supervised training data. S is the set of source documents, T+ is the set of semantically invariant text transformations, T- is the set of semantically variant text transformations, + is a positive (CORRECT) label, and - is a negative (INCORRECT) label.
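The procedure of Figure 1 can be sketched in Python as follows; the function and variable names are ours, and `sample_claim` stands in for the claim-sentence sampling step described in Section 3.1. Transformations are assumed to return `None` when they do not apply to a claim.

```python
def generate_data(source_docs, invariant_ts, variant_ts, sample_claim):
    """Build weakly supervised (document, claim, label) triples.

    sample_claim(doc) extracts a single claim sentence from a document;
    each transformation maps a claim to a new claim (or None). Semantically
    invariant transformations yield CORRECT examples, semantically variant
    transformations yield INCORRECT examples."""
    data = []
    for doc in source_docs:
        claim = sample_claim(doc)
        for transform in invariant_ts:
            new_claim = transform(claim)
            if new_claim is not None:
                data.append((doc, new_claim, "CORRECT"))
        for transform in variant_ts:
            new_claim = transform(claim)
            if new_claim is not None:
                data.append((doc, new_claim, "INCORRECT"))
    return data
```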

3.3 Models

Considering the significant recent improvements on natural language understanding (NLU) tasks (including natural language inference) coming from pre-trained Transformer-based models (see http://gluebenchmark.com/leaderboard), we decided to use BERT (Devlin et al., 2018) as the base model for our work. An uncased, base BERT architecture was used as the starting checkpoint and fine-tuned on the generated training data. The source document and claim sentence were fed as input to the model, and the two-way classification (CONSISTENT/INCONSISTENT) was done using a single-layer classifier based on the [CLS] token. We refer to this model as the factual consistency checking model (FactCC).

We also trained a version of FactCC with additional span selection heads using supervision of start and end indices for selection and transformation spans in the source and claim. The span selection heads allow the model not only to classify the consistency of the claim, but also highlight spans in the source document that contain the support for the claim and spans in the claim where a possible mistake was made. We refer to this model as the factual consistency checking model with explanations (FactCCX).

Model | Weighted Accuracy | F1 score
BERT+MNLI | 51.51 | 0.0882
BERT+FEVER | 52.07 | 0.0857
FactCC (ours) | 74.15 | 0.5106
FactCCX (ours) | 72.88 | 0.5005
Table 3: Performance of models evaluated by means of weighted (class-balanced) accuracy and F1 score on the manually annotated test set.
(CNN) Blues legend B.B. King was hospitalized for dehydration, though the ailment didn't keep him out for long. King's dehydration was caused by his Type II diabetes, but he "is much better," his daughter, Claudette King, told the Los Angeles Times. The legendary guitarist and vocalist released a statement thanking those who have expressed their concerns. "I'm feeling much better and am leaving the hospital today," King said in a message Tuesday. Angela Moore, a publicist for Claudette King, said later in the day that he was back home resting and enjoying time with his grandchildren. "He was struggling before, and he is a trouper," Moore said. "He wasn't going to let his fans down." No more information on King's condition or where he was hospitalized was immediately available. (…)
Angela Moore was back home resting and enjoying time with his grandchildren.
Table 4: Example of a test pair correctly classified as incorrect and highlighted by our explainable model. Orange text indicates the span of the source documents that should contain support for the claim. Red text indicates the span of the claim that was selected as incorrect.

4 Experiments

Model | Incorrect | Δ vs. random
Random | 50.0% | –
DA (Falke et al., 2019) | 42.6% | -7.4
InferSent (Falke et al., 2019) | 41.3% | -8.7
SSE (Falke et al., 2019) | 37.3% | -12.7
BERT (Falke et al., 2019) | 35.9% | -14.1
ESIM (Falke et al., 2019) | 32.4% | -17.6
FactCC (ours) | 30.0% | -20.0
Table 5: Percentage of incorrectly ordered sentence pairs using different consistency prediction models and crowdsourced human performance on the dataset.

Experimental Setup

Training data was generated as described in Section 3.1 using news articles from the CNN/DailyMail dataset (Nallapati et al., 2016) as source documents. 1,003,355 training examples were created, out of which 50.2% were labeled as negative (INCONSISTENT) and the remaining 49.8% were labeled as positive (CONSISTENT).

Models described in this work were implemented using the Huggingface Transformers library (Wolf et al., 2019) written in PyTorch (Paszke et al., 2017). An uncased, base BERT model pre-trained on English data was used as the starting point for all experiments. Models were trained on the artificially created data for 10 epochs using a batch size of 12 examples and a learning rate of 2e-5. The best model checkpoints were chosen based on performance on the validation set; final model performance was evaluated on the test set, both described in Section 3.2. Experiments were conducted using 8 Nvidia V100 GPUs with 16GB of memory.


To verify how datasets from related tasks transfer to the task of verifying the factual correctness of summarization models, we trained fact consistency checking models on the MNLI entailment data (Williams et al., 2018) and the FEVER fact-checking data (Thorne et al., 2018). For a fair comparison, before training, we removed examples assigned to the neutral class from both of the datasets. Table 3 shows the performance of the trained models evaluated by means of class-balanced accuracy and F1 score. Results show that our FactCC model substantially outperforms classifiers trained on the MNLI and FEVER datasets, despite being trained using weakly-supervised data. The performance differences between models can be explained by a domain gap between the examples in MNLI and FEVER and the news articles in CNN/DailyMail. It is also likely that the errors made by neural summarization models are specific enough not to be present in any of the other datasets, especially those where examples were obtained from human annotators.
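Class-balanced (weighted) accuracy, as reported in Table 3, is the mean of per-class recalls, so the minority class counts as much as the majority class on a skewed test set. A minimal sketch:

```python
from collections import defaultdict

def class_balanced_accuracy(y_true, y_pred):
    """Mean of per-class recalls over the classes present in y_true."""
    total = defaultdict(int)    # examples per true class
    correct = defaultdict(int)  # correct predictions per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        correct[truth] += int(truth == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

On a 3:1 skewed set where the majority class is mostly right and the minority class perfectly right, plain accuracy would be 3/4 but the balanced score averages the two recalls instead.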

To compare our model with other NLI models for factual consistency checking, we conducted the sentence ranking experiment described by Falke et al. (2019) using the test data provided by the authors. In this experiment, an article sentence is paired with two claim sentences, one positive and one negative. The goal is to see how often a model assigns a higher probability of being correct to the positive rather than the negative claim. Results are presented in Table 5. Despite being trained in a (document, sentence) setting, our model transfers well to the (sentence, sentence) setting and outperforms all other NLI models, including BERT fine-tuned on the MNLI dataset. We were unable to recreate the summary re-ranking experiment because the test data was not made publicly available.
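Given per-pair model scores, the "Incorrect" column of Table 5 reduces to a counting exercise; the sketch below assumes one (positive-claim score, negative-claim score) tuple per test item, and treating ties as errors is our assumption.

```python
def incorrect_order_rate(score_pairs):
    """Fraction of pairs where the negative claim scores at least as high
    as the positive claim (ties counted as errors).

    score_pairs holds (positive_score, negative_score) tuples: the model's
    consistency score for the correct claim and for the incorrect claim."""
    wrong = sum(1 for pos, neg in score_pairs if pos <= neg)
    return wrong / len(score_pairs)
```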

Table 3 also shows the performance of our explainable model, FactCCX. Metrics show a small drop in performance in comparison to the classifier-only FactCC; however, the explainable model still substantially outperforms the other two models while also returning informative span selections. Examples of span selections generated by FactCCX are shown in Table 4. The test set consists of model-generated summaries that do not have annotations for quantifying the quality of spans returned by FactCCX. Instead, span quality is measured through human evaluation and discussed in Section 5.

5 Analysis

Annotation subset | Model Highlight Helpfulness (Helpful / Somewhat Helpful / Not Helpful) | Highlight Overlap (Accuracy / F1 score)
Article Highlights: Raw Data, Golden Aligned, Majority Aligned
Claim Highlights: Raw Data, Golden Aligned, Majority Aligned
Table 6: Quality of spans highlighted in the article and claim by the FactCCX model evaluated by human annotators. The left side shows whether the highlights were considered helpful for the task of factual consistency annotations. The right side shows the overlap between model generated and human annotated highlights. Different rows show how the scores change depending on how the collected annotations are filtered. Raw Data shows results without filtering, Golden Aligned only considers annotations where the human-assigned label agreed with the author-assigned label, Majority Aligned only considers annotations where the human-assigned label agreed with the majority-vote label from all annotators.

To further inspect the performance of proposed models, we conducted a series of human-based experiments and manually inspected the outputs of the models.

5.1 Crowdsourced Experiments

Experiments using human annotators on the MTurk platform demonstrated that the span highlights returned by FactCCX are useful tools for researchers and crowdsource workers manually assessing the factual consistency of summaries. For each experiment, examples were annotated by 3 human judges selected from English-speaking countries. Annotator compensation was set to ensure a 10 USD hourly rate. These experiments used 100 examples sampled from the manually annotated test set. Data points were sampled to ensure an equal split between CONSISTENT and INCONSISTENT examples.

To establish whether the model-generated spans in the article and claim are helpful for the task of fact checking, we hired human annotators to complete the mentioned task. Each of the presented document-sentence pairs was augmented with the highlighted spans output by FactCCX. Judges were asked to evaluate the correctness of the claim and instructed to use the provided segment highlights only as suggestions. After the annotation task, judges were asked whether they found the highlighted spans helpful for solving the task. The helpfulness of article and claim highlights was evaluated separately. The left part of Table 6 presents the results of the survey: a majority of annotators found the article highlights at least somewhat helpful for the task, and a majority likewise declared the claim highlights at least somewhat helpful. To verify that low-quality judges did not bias the presented scores, we applied different data filters to the annotations: Raw Data considered all submitted annotations, Golden Aligned only considered annotations where the annotator-assigned label aligned with the author-assigned label for the example, and Majority Aligned only considered examples where the annotator-assigned label aligned with the majority-vote label assigned to the example by all judges. As shown in Table 6, filtering the annotations does not yield substantial changes in the helpfulness assessment.

Despite instructing the annotators to consider the provided highlights only as a suggestion when solving the underlying task, the annotators' perception of the task could have been biased by the model-highlighted spans. To check how well the generated spans align with unbiased human judgement, we repeated the previous experiment with the difference that the model-generated highlights were not displayed to the annotators. Instead, the annotators were asked to solve the underlying task and to highlight the spans of the source and claim that they found adequate. Using the annotations provided by the judges, we computed the overlap between the model-generated spans and the unbiased human spans. Results are shown in the right part of Table 6. The overlap between spans was evaluated using two metrics: accuracy, based on a binary score of whether the entire model-generated span was contained within the human-selected span, and F1 score between the tokens of the two spans, with the human-selected spans considered ground truth. Accuracy and F1 are reported separately for the article and claim highlights. Similarly to the previous experiment, we applied different data filters to check how the quality of annotations affects the scores and found that removing noisy annotations increases both the accuracy and the F1 score.
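The two overlap metrics can be sketched as follows; representing a span as an end-exclusive (start, end) pair of token indices is our assumption about the span encoding.

```python
def span_containment(model_span, human_span):
    """Binary score: 1 if the model span (start, end) falls entirely
    inside the human-annotated span, else 0."""
    return int(human_span[0] <= model_span[0] and model_span[1] <= human_span[1])

def span_token_f1(model_span, human_span):
    """Token-level F1 between the two index ranges, treating the
    human-annotated span as ground truth."""
    model_tokens = set(range(*model_span))
    human_tokens = set(range(*human_span))
    overlap = len(model_tokens & human_tokens)
    if overlap == 0:
        return 0.0
    precision = overlap / len(model_tokens)
    recall = overlap / len(human_tokens)
    return 2 * precision * recall / (precision + recall)
```

A short model span inside a long human span gets full containment credit but a reduced F1, which is why the two metrics are reported together.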

To verify that providing highlights to annotators has a positive effect on annotation efficiency, we ran two factual consistency annotation tasks in parallel: in one, highlights were shown to the annotators; in the other, they were not. We measured the effect of providing highlights on the average time spent by an annotator on the task and on the inter-annotator agreement of the annotations. Results are shown in Table 7. The experiment showed that when completing the task with highlights, annotators were able to complete it faster, and the inter-annotator agreement, measured with Fleiss' κ, increased by .
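For reference, Fleiss' κ can be computed as in the sketch below, where the input is a per-item table of category counts (each row: how many judges assigned the item to each label). This is a standard textbook formulation, not the authors' evaluation code.

```python
def fleiss_kappa(table):
    """table[i][j]: number of judges assigning item i to category j.
    Every item must be rated by the same number of judges."""
    n_items = len(table)
    n_raters = sum(table[0])
    # Observed agreement: average per-item pairwise agreement.
    p_bar = sum(
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in table
    ) / n_items
    # Expected agreement from the marginal category proportions.
    n_categories = len(table[0])
    p_j = [sum(row[j] for row in table) / (n_items * n_raters)
           for j in range(n_categories)]
    p_e = sum(p * p for p in p_j)
    return (p_bar - p_e) / (1 - p_e)
```

A κ of 1 indicates perfect agreement, 0 indicates chance-level agreement, and negative values indicate systematic disagreement.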

Results obtained through crowdsourcing tasks support the hypothesis that the span selections generated by our explainable model can be a valuable asset for supporting human-based factual consistency checking.

5.2 Limitations

In order to better understand the limitations of the proposed approach, we manually inspected examples that were misclassified by our models. The majority of errors made by our fact checking model were related to commonsense mistakes made by the summarization models. Such errors are easy for humans to spot, but hard to define as a set of transformations that would allow them to be added to the training data.

In addition, certain types of errors stemming from dependencies between different sentences within the summary, such as temporal inconsistencies or incorrect coreference, are not handled by the document-sentence setting used in this work.

                                       Task without model highlights   Task with model highlights
Average work time (sec)
Inter-annotator agreement (Fleiss' κ)
Table 7: Annotation speed and inter-annotator agreement measured for factual consistency checking, with and without assisting model-generated highlights.

6 Conclusions

We introduced a novel approach to verifying the factual consistency of summaries generated by abstractive neural models. In our approach, models are trained to perform factual consistency checking at the document-sentence level, which allows them to handle a broader range of errors than the previously proposed sentence-sentence approaches. Models are trained on artificially generated, weakly-supervised data created based on insights from an analysis of errors made by state-of-the-art summarization models. Through quantitative studies we showed that the proposed approach outperforms other models trained on textual entailment and fact-checking data. A series of human-based experiments showed that the proposed approach, including the explainable factual consistency checking model, can be a valuable tool for assisting humans in checking factual consistency.

Shortcomings of our approach, described in Section 5.2, can be treated as guidelines for potential future work. The methods proposed in this work could be extended with more advanced data augmentation techniques, such as generating claims that cover multi-sentence spans of the source document or that include commonsense mistakes.

We hope that this work will encourage more research efforts in the important task of verifying and improving the factual consistency of abstractive summarization models.


Acknowledgments

We thank Nitish Shirish Keskar, Dragomir Radev, Ben Krause, and Wenpeng Yin for reviewing this manuscript and providing valuable feedback.

