Text summarization models aim to produce an abridged version of long text while preserving salient information. Abstractive summarization is a type of such models that can freely generate summaries, with no constraint on the words or phrases used. This format is closer to human-edited summaries and is both flexible and informative. Thus, there is a large body of literature on producing abstractive summaries (See et al., 2017; Paulus et al., 2017; Dong et al., 2019; Gehrmann et al., 2018).
|Article||Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died… Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday. Lewis began his long association with Eastwood in “High Plains Drifter”…|
|BottomUp||Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose”…|
|FASum (Ours)||It’s the father of Oscar-nominated actress Juliette Lewis. Lewis began his long association with Eastwood in “High Plains Drifter”…|
|Seq2Seq||Oscar-nominated actress Juliette Lewis dies at 79. Lewis starred in “High Plains Drifter”…|
Table 1: An example article and excerpts of generated summaries. Seq2Seq indicates our model without the knowledge graph component. Factual errors in summaries are marked in red and the corresponding facts in the article are marked in green.
However, one prominent issue with abstractive summarization is factual inconsistency: the summary sometimes distorts or fabricates the facts in the article. Recent studies show that up to 30% of the summaries generated by abstractive models contain such factual inconsistencies (Kryściński et al., 2019; Falke et al., 2019). This brings serious problems to the credibility and usability of abstractive summarization systems. Table 1 demonstrates an example article and excerpts of generated summaries. As shown, the article mentions that the actor Geoffrey Lewis began his association with Eastwood in “High Plains Drifter”, while the summary from BottomUp (Gehrmann et al., 2018) picks the wrong movie title from another place in the article. Also, the article says that the deceased Geoffrey Lewis is the father of Juliette Lewis. However, an ablated version of our model that does not leverage the factual knowledge graph in the article, denoted by Seq2Seq, wrongly states that Juliette passed away. Comparatively, our model FASum generates a summary that correctly reflects the relationship in the article.
|Article||The flame of remembrance burns in Jerusalem, and a song of memory haunts Valerie Braham as it never has before. This year, Israel’s Memorial Day commemoration is for bereaved family members such as Braham. “Now I truly understand everyone who has lost a loved one,” Braham said. Her husband, Philippe Braham, was one of 17 people killed in January’s terror attacks in Paris… As Israel mourns on the nation’s remembrance day, French Prime Minister Manuel Valls announced after his weekly Cabinet meeting that French authorities had foiled a terror plot…|
|BottomUp||Valerie Braham was one of 17 people killed in January ’s terror attacks in Paris. France’s memorial day commemoration is for bereaved family members as Braham. Israel’s Prime Minister says the terror plot has not been done.|
|Corrected by FC||Philippe Braham was one of 17 people killed in January’s terror attacks in Paris. Israel’s memorial day commemoration is for bereaved family members as Braham. France’s Prime Minister says the terror plot has not been done.|
On the other hand, most existing abstractive summarization models apply a conditional language model to focus on the token-level accuracy of summaries, while neglecting semantic-level consistency between the summary and article. Therefore, the generated summaries are often high in token-level metrics like ROUGE (Lin, 2004) but lack factual correctness. In view of this, we argue that a robust abstractive summarization system must be equipped with factual knowledge to accurately summarize the article.
In this paper, we represent facts in the form of a knowledge graph. Although there are numerous efforts in building commonly applicable knowledge networks to facilitate knowledge extraction and integration, such as ConceptNet (Speer et al., 2017) and WikiData, we find that these tools are more useful in conferring commonsense knowledge. In abstractive summarization for contents like news articles, many entities and relations are previously unseen. Moreover, our goal is to produce summaries that do not conflict with the facts in the article. Thus, we propose to extract factual knowledge from within the article.
We employ the information extraction (IE) tool OpenIE (Angeli et al., 2015) to extract facts from the article in the form of relational tuples: (subject, relation, object). We then use the Levi transformation (Levi, 1942) to convert each tuple component into a node in the knowledge graph. This graph contains the facts in the article and is integrated in the summary generation process.
Then, we use a graph attention network (Veličković et al., 2017) to obtain the representation of each node, and fuse that into a transformer-based encoder-decoder architecture. The decoder attends to each graph node in every transformer block. Finally, it leverages the copy-generate mechanism to selectively produce contents either from the dictionary or from the graph entities. We denote this model as a Fact-Aware Summarization model, FASum.
In addition, to utilize existing summarization systems, we propose a Factual Corrector model, FC, to help improve the factual correctness of any given summary. We frame the correction process as a seq2seq problem: the input is the original summary and the article, and the output is the corrected summary. Thus, we leverage the large-scale pre-trained language generation model UniLM (Dong et al., 2019). We finetune the model on summarization datasets with correct and wrong summaries automatically generated by backtranslation and random entity swaps, respectively. Therefore, the training data does not require additional human labelling. As shown in Table 2, FC makes three corrections, replacing the original wrong entities, which appear elsewhere in the article, with the right ones.
In the experiments on benchmark summarization datasets, the FASum model shows great improvements in the factual correctness of generated summaries. Using an independently trained BERT-based factual correctness evaluator (Kryściński et al., 2019), we find that on CNN/DailyMail, FASum obtains 1.2% higher fact correctness scores than UniLM (Dong et al., 2019) and 4.5% higher than BottomUp (Gehrmann et al., 2018). On the other hand, FC can effectively improve the factual correctness of given summaries: after correction, the factual score of summaries from BottomUp increases 1.7% on CNN/DailyMail and 1.2% on XSum, and the score of summaries from TConvS2S increases 3.9% on XSum. Human evaluation further corroborates the effectiveness of our models.
We further propose an easy-to-compute metric, matched relation tuples, to evaluate the factual precision of summaries without using ground-truth labels. This is an extension to the metric proposed in Goodrich et al. (2019). Under this metric, we show that FASum can obtain a higher score than other baselines and FC can help boost the score of summaries from other abstractive models. Furthermore, we quantitatively show that FASum is a highly abstractive summarization model.
In summary, our contribution in this paper is three-fold:
We propose FASum, which conducts neural graph computation on the factual knowledge graph and integrates it into an end-to-end process of summary generation.
We propose FC, which modifies summaries from abstractive systems to improve their factual correctness.
We propose a simple-to-use metric, matched relation tuples, to evaluate factual correctness in abstractive summarization.
To the best of the authors’ knowledge, FASum is the first approach to leverage a knowledge graph in boosting factual correctness, while FC is the first summary-correction model for factual correctness.
2 Related Work
2.1 Abstractive Summarization
Abstractive text summarization has been intensively studied in recent literature. Most models employ the encoder-decoder architecture (seq2seq) (Sutskever et al., 2014). Rush et al. (2015) introduces an attention-based seq2seq model for abstractive sentence summarization. See et al. (2017) uses a copy-generate mechanism that can both produce words from the vocabulary via a generator and copy words from the article via a pointer. Paulus et al. (2017) leverages reinforcement learning to improve summarization quality. Gehrmann et al. (2018) uses a content selector to over-determine phrases in source documents, which helps constrain the model to likely phrases. Zhu et al. (2019) defines a pretraining scheme for summarization and produces a zero-shot abstractive summarization model. Dong et al. (2019) employs different masking techniques for both classification and generation tasks in NLP. The resulting pretrained model UniLM achieves state-of-the-art results on various tasks including abstractive summarization.
2.2 Fact-Aware Summarization
Since the task of judging whether a summary is consistent with an article is similar to the text entailment problem, several recent works use text entailment models to evaluate and boost the factual correctness of summarization. Li et al. (2018) co-trains summarization and entailment for the encoder and employs an entailment-aware decoder via Reward Augmented Maximum Likelihood (RAML). Falke et al. (2019) proposes using off-the-shelf entailment models to rerank candidate summary sentences to boost factual correctness.
Apart from using entailment models for factual correctness, Zhang et al. (2019) approaches the factual correctness problem in the medical domain, where the space of facts is limited and can be depicted with a descriptor vector. The proposed model utilizes information extraction and reinforcement learning and achieves good results in medical summarization. Cao et al. (2018) extracts relational information from the article and maps it to a sequence as input to the encoder. The decoder attends to both the article tokens and the relations. Gunel et al. (2019) employs an entity-aware transformer structure to boost the factual correctness in abstractive summarization, where the entities come from the Wikidata knowledge graph. In comparison, our model utilizes the knowledge graph extracted from the article and fuses it into the generated text via graph neural computation.
As previous work mostly uses human evaluation of factual correctness (Kryściński et al., 2019) and text entailment models suffer from the domain shift problem (Falke et al., 2019), several papers put forward automatic factual checking metrics better aligned with human judgement. Goodrich et al. (2019) proposes using information extraction to obtain the relation tuples in the ground-truth summary and the candidate summary, then computing the factual accuracy as the ratio of overlapping tuples. Kryściński et al. (2019) frames factual correctness as a binary classification problem: a summary is either consistent or inconsistent with the article. Therefore, it applies positive and negative transformations on the summary to produce training data for a BERT-based classification model. The evaluator, FactCC, significantly outperforms text entailment models and shows a high correlation with human metrics. In this paper, we use FactCC as the factual evaluator for our model.
3.1 Problem Formulation
We formalize abstractive summarization as a supervised seq2seq problem. The input consists of pairs of articles and summaries: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_a, Y_a)\}$. Each article is tokenized into $X_i = (x_1, \ldots, x_{L_i})$ and each summary is tokenized into $Y_i = (y_1, \ldots, y_{N_i})$. In abstractive summarization, the model-generated summary can contain tokens, phrases and sentences not present in the article. For simplicity, in the following we will drop the data index subscript. Therefore, each training pair becomes $(X, Y)$, and the model needs to generate an abstractive summary $\hat{Y}$.
3.2 Factual Correctness Evaluator
We leverage the FactCC evaluator (Kryściński et al., 2019), which frames correctness evaluation as a binary classification problem, namely finding a function $f: (A, C) \rightarrow [0, 1]$, where $A$ is an article and $C$ is a summary sentence defined as a claim. $f(A, C)$ represents the probability that $C$ is factually correct with respect to the article $A$. If a summary $S$ is composed of multiple sentences $C_1, \ldots, C_k$, we define the factual score of $S$ as: $f(A, S) = \frac{1}{k} \sum_{i=1}^{k} f(A, C_i)$.
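The summary-level score can be illustrated with a short sketch. It assumes the score of a multi-sentence summary is the mean of the sentence-level probabilities; `claim_scorer` is a hypothetical stand-in for the trained evaluator, not the FactCC interface.

```python
from typing import Callable, Sequence


def summary_factual_score(
    article: str,
    sentences: Sequence[str],
    claim_scorer: Callable[[str, str], float],
) -> float:
    """Average the per-sentence correctness probabilities of a summary.

    `claim_scorer(article, claim)` is a hypothetical callable returning the
    probability that `claim` is consistent with `article`.
    """
    if not sentences:
        raise ValueError("summary must contain at least one sentence")
    scores = [claim_scorer(article, sentence) for sentence in sentences]
    return sum(scores) / len(scores)


def toy_scorer(article: str, claim: str) -> float:
    # Toy stand-in: a claim counts as correct only if it appears verbatim.
    return 1.0 if claim in article else 0.0


if __name__ == "__main__":
    text = "Lewis died Tuesday. Lewis was 79."
    print(summary_factual_score(text, ["Lewis died Tuesday.", "Lewis was 80."], toy_scorer))
```

With a real evaluator, only `claim_scorer` would change; the aggregation stays the same.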
|InferSent (Falke et al., 2019)||41.3%|
|SSE (Falke et al., 2019)||37.3%|
|BERT (Falke et al., 2019)||35.9%|
|ESIM (Falke et al., 2019)||32.4%|
|FactCC (Kryściński et al., 2019)||30.0%|
|FactCC (our version)||26.8%|
To generate training data, we adopt backtranslation as a paraphrasing tool. The ground-truth summary is translated into an intermediate language, including French, German, Chinese, Spanish and Russian, and then translated back to English. The generated claims are used as positive training examples. We then apply entity swap, negation and pronoun swap to generate negative examples (Kryściński et al., 2019).
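As a rough illustration of the entity-swap transformation, the sketch below replaces one entity in a claim with a different entity from the article. The entity lists are assumed to come from some upstream named-entity recognizer; the function name and arguments are hypothetical and do not reproduce the procedure of Kryściński et al. (2019) in detail.

```python
import random
from typing import Optional, Sequence


def entity_swap(
    claim: str,
    claim_entities: Sequence[str],
    article_entities: Sequence[str],
    rng: random.Random,
) -> Optional[str]:
    """Build a negative example by swapping one entity in the claim.

    Returns None when no swap is possible (no entity in the claim, or no
    alternative entity in the article).
    """
    present = [e for e in claim_entities if e in claim]
    if not present:
        return None
    target = rng.choice(present)
    # Candidate replacements: article entities that do not already occur in the claim.
    replacements = [e for e in article_entities if e != target and e not in claim]
    if not replacements:
        return None
    return claim.replace(target, rng.choice(replacements), 1)


if __name__ == "__main__":
    rng = random.Random(0)
    print(entity_swap(
        "Philippe Braham was killed in Paris",
        ["Philippe Braham"],
        ["Philippe Braham", "Valerie Braham"],
        rng,
    ))
```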
Following Kryściński et al. (2019), we finetune the BERT-base model (Devlin et al., 2018b) to evaluate factual correctness. We concatenate the article and the generated claim together with special tokens [CLS] and [SEP]. The final embedding of [CLS] is used to compute the probability that the claim is entailed by the article content.
The hyperparameters were tuned on the development set, with the best setup using a learning rate of 1e-5 and a batch size of 24. As shown in Table 3, on CNN/DailyMail, our reproduced model achieves better accuracy than that in Kryściński et al. (2019) on the human-labelled sentence-pair-ordering data from Falke et al. (2019). Thus, we use this evaluator for all the factual correctness assessment tasks in the following. (We use the same setting and train another evaluator for the XSum dataset.)
3.3 Fact-Aware Summarizer
We propose the Fact-Aware abstractive Summarizer, FASum. It utilizes the seq2seq architecture built upon transformers (Vaswani et al., 2017). In detail, the encoder produces contextualized embeddings of the article and the decoder attends to the encoder’s output to generate the summary.
To make the summarization model fact-aware, we extract, represent and integrate knowledge from the source article into the summary generation process, as described below. The overall architecture of FASum is shown in Figure 1.
3.3.1 Knowledge Extraction
To extract important entity-relation information from the article, we employ the Stanford OpenIE tool (Angeli et al., 2015). The extracted knowledge is a list of tuples. Each tuple contains a subject (S), a relation (R) and an object (O), each as a segment of text from the article. For example, for the sentence "Born in a town, she took the midnight train", the extracted subject is "she", the relation is "took", and the object is "midnight train". In the experiments, there are on average 165.4 tuples extracted per article in CNN/DailyMail (Hermann et al., 2015) and 84.5 tuples in XSum (Narayan et al., 2018).
3.3.2 Knowledge Representation
We construct a knowledge graph to represent the information extracted from OpenIE. We apply the Levi transformation (Levi, 1942) to treat each entity and relation equally. In detail, suppose a tuple is $(s, r, o)$; we create nodes $s$, $r$ and $o$, and add an edge between $s$ and $r$ and an edge between $r$ and $o$. In this way, we obtain an undirected knowledge graph $G = (V, E)$, where each node $v \in V$ is associated with text $t(v)$.
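The graph construction can be sketched as follows. This is a minimal illustration under two assumptions that the text does not confirm: entity nodes with identical text are merged, and every tuple gets its own relation node. The function name and data layout are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def build_levi_graph(triples: List[Triple]) -> Tuple[Dict[int, str], Set[Tuple[int, int]]]:
    """Turn relational tuples into an undirected graph with relation nodes.

    Returns (node_text, edges): node_text maps node id -> text, and edges
    holds undirected edges as (smaller_id, larger_id) pairs.
    """
    node_text: Dict[int, str] = {}
    entity_id: Dict[str, int] = {}
    edges: Set[Tuple[int, int]] = set()

    def entity_node(text: str) -> int:
        # Assumption: entities with the same surface text share one node.
        if text not in entity_id:
            entity_id[text] = len(node_text)
            node_text[entity_id[text]] = text
        return entity_id[text]

    for subj, rel, obj in triples:
        s_id = entity_node(subj)
        o_id = entity_node(obj)
        r_id = len(node_text)  # Assumption: one fresh relation node per tuple.
        node_text[r_id] = rel
        edges.add((min(s_id, r_id), max(s_id, r_id)))
        edges.add((min(o_id, r_id), max(o_id, r_id)))
    return node_text, edges


if __name__ == "__main__":
    nodes, links = build_levi_graph([
        ("she", "took", "midnight train"),
        ("she", "was born in", "town"),
    ])
    print(nodes)
    print(sorted(links))
```

Because relations become nodes, a downstream graph network can embed entities and relations with the same machinery.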
We then employ a graph attention network (Veličković et al., 2017) to obtain embeddings for each node. We use a bidirectional LSTM over the text associated with each node $v$ to generate its initial embedding, taking the final state of the RNN.
Then, in each round, every node $v$ updates its embedding using attention over itself and all of its neighbors, denoted by the set $N(v)$. The attention is parametrized by a trainable matrix and a trainable vector. In the experiment, we set the number of rounds .
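One round of such an update can be sketched in plain Python. The sketch follows the standard single-head formulation of Veličković et al. (2017), which may differ from the exact equations used here; the `tanh` output nonlinearity, the LeakyReLU slope and all names are assumptions made for illustration.

```python
import math
from typing import Dict, List, Set

Vector = List[float]


def matvec(matrix: List[Vector], vec: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def leaky_relu(z: float, slope: float = 0.2) -> float:
    return z if z > 0 else slope * z


def gat_round(
    h: Dict[int, Vector],
    neighbors: Dict[int, Set[int]],
    weight: List[Vector],
    attn_vec: Vector,
) -> Dict[int, Vector]:
    """One attention round: each node attends over itself and its neighbors."""
    projected = {v: matvec(weight, x) for v, x in h.items()}
    updated: Dict[int, Vector] = {}
    for v in h:
        scope = [v] + [u for u in sorted(neighbors.get(v, set())) if u != v]
        # Unnormalized score: LeakyReLU(a^T [W h_v ; W h_u]).
        scores = [
            leaky_relu(sum(a * z for a, z in zip(attn_vec, projected[v] + projected[u])))
            for u in scope
        ]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
        total = sum(exps)
        alphas = [e / total for e in exps]
        dim = len(projected[v])
        updated[v] = [
            math.tanh(sum(alpha * projected[u][k] for alpha, u in zip(alphas, scope)))
            for k in range(dim)
        ]
    return updated


if __name__ == "__main__":
    embeddings = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    adjacency = {0: {1}, 1: {0}}
    identity = [[1.0, 0.0], [0.0, 1.0]]
    print(gat_round(embeddings, adjacency, identity, [0.0, 0.0, 0.0, 0.0]))
```

Running several rounds simply means feeding the returned dictionary back in as `h`.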
3.3.3 Knowledge Integration
The knowledge graph embedding is obtained in parallel with the encoder. During decoding, we conduct attention over the knowledge graph nodes in each transformer block. Suppose the input embedding to a decoder transformer block is $U$, and the embedded article from the encoder is $H$. We first apply self-attention on $U$ and inter-attention between $U$ and $H$, each followed by layer normalization and residual mechanism (denoted by LR). This leads to the intermediate embeddings $\tilde{U}$.
Next, to capture information from the knowledge graph, we conduct inter-attention between $\tilde{U}$ and the nodes’ embeddings $E$.
Notice that we multiply the attended vector by a parameter $\beta$. This is due to the different numerical magnitudes of encoder embeddings and knowledge graph embeddings. For the $i$-th transformer block, we train one multiplier scalar $\beta_i$.
The results are passed through an LR layer, a feed-forward network (FFN) layer and another LR layer to obtain the final embeddings $U'$.
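The scaled graph attention step can be sketched for a single decoder position. This is a simplified single-head version without learned projections, layer normalization or residual connections; `beta` stands for the per-block multiplier, and the function name is hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def scaled_graph_attention(
    query: Vector,
    node_embeddings: List[Vector],
    beta: float,
) -> Tuple[Vector, List[float]]:
    """Attend from one decoder position over graph nodes, then scale by beta.

    Returns the scaled attended vector and the attention weights.
    """
    dim = len(query)
    scores = [
        sum(q * e for q, e in zip(query, node)) / math.sqrt(dim)
        for node in node_embeddings
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(exps)
    weights = [e / total for e in exps]
    attended = [
        sum(w * node[k] for w, node in zip(weights, node_embeddings))
        for k in range(dim)
    ]
    # The scalar multiplier rescales graph information relative to the
    # encoder-side embeddings it will be combined with.
    return [beta * x for x in attended], weights


if __name__ == "__main__":
    vec, attn = scaled_graph_attention([0.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], beta=0.5)
    print(vec, attn)
```

The attention weights returned here are the kind of quantity a copy mechanism can reuse as per-node copy probabilities.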
3.3.4 Summary Generation
To produce the next token $y_t$, we leverage the copy-generate mechanism (Vinyals et al., 2015). First, a copy probability $p_{\text{copy}}$ is computed from the decoder's final embedding.
The probability of the next token being $w$ then combines a copy term, weighted by $p_{\text{copy}}$, and a generation term, weighted by $1 - p_{\text{copy}}$.
In the copy term, $\alpha$ denotes the attention weights over the knowledge graph nodes from the last transformer block, representing the probability of copying each node’s text. We consider a token $w$ to be produced if a node whose text starts with $w$ is selected. The generation term $P_{\text{gen}}$ represents the probability of generating each token in the dictionary; it is computed with the embedding matrix shared by the encoder and decoder.
During training, we use cross entropy as the loss function, computed between the predicted distribution and the one-hot vector of each ground-truth token and minimized over the parameters of the network.
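The copy-generate step described above can be sketched as a mixture of two distributions. The sketch assumes the usual pointer-generator form, where the copy probability weights the node attention and its complement weights the vocabulary distribution; the function and argument names are hypothetical.

```python
from typing import Dict, List


def next_token_distribution(
    p_copy: float,
    node_attention: List[float],
    node_tokens: List[List[str]],
    gen_probs: Dict[str, float],
) -> Dict[str, float]:
    """Mix vocabulary generation with copying the first token of graph nodes.

    node_attention[j] is the attention weight on node j, and node_tokens[j]
    is that node's tokenized text.
    """
    probs = {token: (1.0 - p_copy) * p for token, p in gen_probs.items()}
    for weight, tokens in zip(node_attention, node_tokens):
        if not tokens:
            continue
        # A token counts as copied when a node whose text starts with it is selected.
        first = tokens[0]
        probs[first] = probs.get(first, 0.0) + p_copy * weight
    return probs


if __name__ == "__main__":
    dist = next_token_distribution(
        p_copy=0.5,
        node_attention=[0.75, 0.25],
        node_tokens=[["midnight", "train"], ["she"]],
        gen_probs={"she": 0.5, "took": 0.5},
    )
    print(dist)
```

The result stays a valid distribution whenever both inputs sum to one, since the two mixture weights sum to one as well.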
3.4 Fact Corrector
To better utilize existing summarization systems, we propose a Factual Corrector model, FC, to improve the factual correctness of any summary generated by abstractive systems. FC frames the correction process as a seq2seq problem: given an article and a candidate summary, the model generates a corrected summary with minimal changes to be more factually consistent with the article. We show its architecture in Figure 2.
To create training data, we follow the approach used for the factual correctness evaluator and apply positive and negative transformations to golden summaries to generate claims, including back-translation and entity swap.
We then utilize the UniLM (Dong et al., 2019) architecture initialized with weights from the RoBERTa-Large model (Liu et al., 2019) to conduct seq2seq finetuning. The training goal is to recover the original ground-truth summary.
During inference, the article and candidate summary are fed into FC to generate the corrected version. We use beam search with a beam size of 2, and block duplicate tri-grams. The minimum and maximum generation lengths are tuned on the validation data.
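Tri-gram blocking can be sketched as a check applied to every candidate token during beam search: a candidate is rejected if appending it would repeat a tri-gram already present in the hypothesis. The helper below is an illustrative assumption about how such a check can be written, not the authors' decoding code.

```python
from typing import Sequence


def repeats_trigram(prefix: Sequence[str], candidate: str) -> bool:
    """Return True if appending `candidate` would duplicate an earlier tri-gram."""
    tokens = list(prefix) + [candidate]
    if len(tokens) < 3:
        return False
    new_trigram = tuple(tokens[-3:])
    # All tri-grams that end before the newly formed one.
    seen = {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 3)}
    return new_trigram in seen


if __name__ == "__main__":
    hypothesis = ["a", "b", "c", "a", "b"]
    print(repeats_trigram(hypothesis, "c"))  # would repeat ("a", "b", "c")
    print(repeats_trigram(hypothesis, "d"))
```

In a beam search loop, candidates for which this returns True would be given zero probability before the beam is pruned.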
We evaluate our model on the benchmark summarization datasets CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018). They contain 312K and 227K news articles and human-edited summaries respectively, covering different topics and various summarization styles. We show the details of the datasets in Table 4, adapted from Liu and Lapata (2019).
4.2 Implementation Details
In FASum, we use the subword tokenizer SentencePiece (Kudo and Richardson, 2018). The dictionary is shared across all the datasets. The vocabulary has a size of 32K and a dimension of 720. Both the encoder and decoder have 10 layers with 8 attention heads. The knowledge graph LSTM has a hidden state of size 128 and the graph attention network has a hidden state of size 50. The dropout rate is 0.1. The knowledge graph attention vector multiplier is initialized with . Teacher forcing is used in training. We use Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 2e-4.
In FC, we utilize the RoBERTa-Large model, which has 24 transformer layers with 340M parameters. We fine-tune the model for 5 epochs with a learning rate of 1e-5, linear warmup over the first one-fifth of the total steps, and linear decay afterwards. The batch size is 24. More details are presented in Appendix A.
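The warmup-then-decay schedule can be sketched as a function of the training step. The peak rate of 1e-5 and the one-fifth warmup fraction come from the text; decaying linearly to zero at the final step is an assumption, and the function name is hypothetical.

```python
def linear_warmup_decay(
    step: int,
    total_steps: int,
    peak_lr: float = 1e-5,
    warmup_fraction: float = 0.2,
) -> float:
    """Learning rate at `step`: linear ramp-up, then linear decay to zero."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    return peak_lr * max(0.0, (total_steps - step) / decay_steps)


if __name__ == "__main__":
    for s in (0, 10, 20, 60, 100):
        print(s, linear_warmup_decay(s, total_steps=100))
```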
For factual correctness, we leverage the FactCC model (Kryściński et al., 2019) trained on each dataset. All evaluator models are finetuned starting from the BERT-base model (Devlin et al., 2018a). The training is independent of the summarizer, so no parameters are shared.
We also employ the standard ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004) to measure summary quality. These three metrics evaluate the accuracy on unigrams, bigrams and the longest common subsequence. We report the F1 ROUGE scores in all experiments.
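To make the n-gram F1 computation concrete, here is a simplified sketch of ROUGE-N on pre-tokenized text. It uses clipped n-gram overlap and omits the stemming, tokenization and multi-reference handling of the official ROUGE package, so its numbers should not be expected to match reported scores.

```python
from collections import Counter
from typing import Sequence


def ngram_counts(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate: Sequence[str], reference: Sequence[str], n: int = 1) -> float:
    """F1 of clipped n-gram overlap between a candidate and one reference."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # Counter intersection clips counts
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    print(rouge_n_f1("the cat sat".split(), "the cat ran".split(), n=1))
    print(rouge_n_f1("the cat sat".split(), "the cat ran".split(), n=2))
```

Note that a summary can score well here while swapping a single entity, which is why a separate factual metric is reported.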
Note that during training, FASum chooses the model with the highest ROUGE-L score on the validation set. Therefore, it does not have access to the evaluator during training and validation.
The following abstractive summarization models are selected as baseline systems. TConvS2S (Narayan et al., 2018) is based on convolutional neural networks. BottomUp (Gehrmann et al., 2018) uses a bottom-up approach to generate summaries. UniLM (Dong et al., 2019) utilizes large-scale pretraining to produce state-of-the-art abstractive summaries. We obtain the prediction results of these baseline systems via their open-source repositories, or train the model when the predictions are not available.
As shown in Table 5, our model FASum outperforms all baseline systems in factual correctness scores in CNN/DailyMail and is only behind UniLM in XSum. In CNN/DailyMail, FASum is 1.2% higher than UniLM and 4.5% higher than BottomUp in factual correctness score. We conduct statistical tests and show that the lead is statistically significant with p-value smaller than 0.05.
It is worth noting that the ROUGE metric does not always reflect factual correctness, sometimes even showing an inverse relationship, as has been observed in Kryściński et al. (2019) and Nogueira and Cho (2019). For instance, although BottomUp scores 2.42 ROUGE-1 points higher than FASum on CNN/DailyMail, there are many factual errors in its summaries, as shown in the human evaluation and Appendix B. Therefore, we argue that factual correctness should be specifically handled both in summary generation and evaluation.
Furthermore, the correction model FC can effectively enhance the factual correctness of summaries generated by various baseline models. For instance, on CNN/DailyMail, after correction, the factual score of BottomUp increases by 1.7%. On XSum, the factual scores increase by 0.2% to 3.9% for summaries by all baseline models. Most improvements (marked by *) are statistically significant. It is worth noting that FC only makes the modest modifications necessary to the original summaries. For instance, FC modifies 48.3% of summaries by BottomUp on CNN/DailyMail. These modified summaries contain very few changed tokens: 94.4% of the corrected summaries contain 3 or fewer new tokens, while the summaries have on average 48.3 tokens. In Appendix B, we show more example summary corrections made by FC.
Ablation Study. To evaluate the effectiveness of our proposed modules, we removed the attention vector multiplier, then the Levi transformation and finally the knowledge graph module. The results are reported in Table 6. As shown, each module plays a significant role in boosting the factual correctness score, but not necessarily the ROUGE metrics. The multiplier contributes 2.1 points in factual score, and the knowledge graph module as a whole boosts the factual score by as much as 3.7 points. In the following, we define Seq2Seq as the ablated version of our model without the knowledge graph component.
|Corrected by FC||85.3 (1.7%)||40.95||18.37||37.86|
|Corrected by FC||87.0 (0.2%)||42.75||20.07||39.83|
|Corrected by FC||78.9 (1.2%)||28.21||8.00||20.69|
|Corrected by FC||82.9 (3.9%)||32.44||11.83||26.02|
|Corrected by FC||83.4 (0.2%)||42.18||19.53||34.15|
4.6.1 Novel n-grams
To measure the abstractiveness of our model, we compute the ratio of novel n-grams in summaries that do not appear in the article on the XSum test set, following See et al. (2017). As shown in Figure 3, the novel n-gram ratio of FASum is closer to that of the reference summaries than those of BottomUp, UniLM and the ablated version of our model, Seq2Seq. The novel 1-gram ratio is almost 40%. Furthermore, our model obtains slightly higher novelty percentage of -gram than reference summary when . This demonstrates that FASum can produce highly abstractive summaries while ensuring factual correctness.
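The novel n-gram ratio can be sketched as below. The sketch counts every n-gram occurrence in the summary (rather than distinct n-grams) and works on pre-tokenized input; both choices are assumptions, since the exact counting convention is not spelled out here.

```python
from typing import List, Sequence, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def novel_ngram_ratio(summary: Sequence[str], article: Sequence[str], n: int) -> float:
    """Fraction of summary n-grams that never occur in the article."""
    summary_ngrams = ngrams(summary, n)
    if not summary_ngrams:
        return 0.0
    article_ngrams = set(ngrams(article, n))
    novel = sum(1 for gram in summary_ngrams if gram not in article_ngrams)
    return novel / len(summary_ngrams)


if __name__ == "__main__":
    source = "lewis died tuesday".split()
    output = "lewis died at home".split()
    print(novel_ngram_ratio(output, source, 1))
    print(novel_ngram_ratio(output, source, 2))
```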
|Corrected by FC||68.6||49.6|
|Corrected by FC||61.4||40.7|
4.6.2 Matched relations
As the relational tuples in the knowledge graph capture the factual information in the text, we compute the precision of extracted tuples in the summary. In detail, suppose the set of relational tuples in the summary is $R_s = \{(s_i, r_i, o_i)\}$, and the set of relational tuples in the article is $R_a$. Then, each tuple $(s_i, r_i, o_i)$ in $R_s$ falls into one of the following three cases:
1. $(s_i, r_i, o_i) \in R_a$, which we define as a correct hit;
2. $(s_i, r_i, o_i) \notin R_a$, but $\exists\, o' \neq o_i$ with $(s_i, r_i, o') \in R_a$, or $\exists\, s' \neq s_i$ with $(s', r_i, o_i) \in R_a$. We define this case as a wrong hit;
3. Otherwise, we define it as a miss.
It follows that a more factually correct summary would have more correct hits and fewer wrong hits. Suppose the numbers of tuples in cases 1-3 are $C$, $W$ and $M$, respectively. We define two kinds of precision metrics to measure the ratio of correct hits: $P_1 = \frac{C}{C + W}$ and $P_2 = \frac{C}{C + W + M}$.
Note that this metric is different from the ratio of overlapping tuples proposed in Goodrich et al. (2019). In Goodrich et al. (2019), the ratio is computed between the ground-truth summary and the candidate summary. However, since even the ground-truth summary may not cover all the salient information in the article, we choose to compare the knowledge tuples in the candidate summary directly against those in the article. An advantage of our metric is that it can work in cases where ground-truth summaries are not available.
Table 7 displays the average precision metrics of our model and the baseline systems on the CNN/DailyMail test set. As shown, FASum achieves the highest precision of correct hits under both measures. There is a considerable boost from the knowledge graph component: 13.6% in $P_1$ and 17.3% in $P_2$.
Furthermore, the summaries from both BottomUp and UniLM show a notable increase of 1.1% to 1.9% in the precision metrics after being corrected by FC. This demonstrates that FC can help rectify incorrect knowledge relations in summaries.
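The matched-relation computation can be sketched as follows. It assumes exact string matching between tuple components and assumes the two precision metrics are the ratio of correct hits to correct plus wrong hits, and the ratio of correct hits to all summary tuples; the function names are hypothetical.

```python
from typing import Iterable, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)


def classify_tuples(
    summary_tuples: Iterable[Triple],
    article_tuples: Iterable[Triple],
) -> Tuple[int, int, int]:
    """Count correct hits, wrong hits and misses among summary tuples."""
    article = set(article_tuples)
    subject_relation = {(s, r) for s, r, _ in article}
    relation_object = {(r, o) for _, r, o in article}
    correct = wrong = miss = 0
    for s, r, o in summary_tuples:
        if (s, r, o) in article:
            correct += 1
        elif (s, r) in subject_relation or (r, o) in relation_object:
            # Same subject and relation with another object, or same relation
            # and object with another subject: the summary contradicts the article.
            wrong += 1
        else:
            miss += 1
    return correct, wrong, miss


def matched_relation_precision(correct: int, wrong: int, miss: int) -> Tuple[float, float]:
    hits = correct + wrong
    total = hits + miss
    first = correct / hits if hits else 0.0
    second = correct / total if total else 0.0
    return first, second


if __name__ == "__main__":
    article_facts = [
        ("Lewis", "began association in", "High Plains Drifter"),
        ("Lewis", "is father of", "Juliette Lewis"),
    ]
    summary_facts = [
        ("Lewis", "began association in", "Every Which Way But Loose"),
        ("Lewis", "is father of", "Juliette Lewis"),
        ("Juliette Lewis", "dies at", "79"),
    ]
    counts = classify_tuples(summary_facts, article_facts)
    print(counts, matched_relation_precision(*counts))
```

Since only the article and the candidate summary are needed, the computation runs without any reference summary.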
4.6.3 Human Evaluation
|BottomUp is better||FC is better||Same|
We conduct human evaluation on the factual correctness and informativeness of summaries. We randomly sample 100 articles from the test set of CNN/DailyMail. Then, each article and summary pair is labelled by 3 people from Amazon Mechanical Turk (AMT) to evaluate the factual correctness and informativeness. Each labeller should give a score in each category between 1 and 3 (3 being perfect).
Here, factual correctness indicates whether the summary’s content is faithful with respect to the article; informativeness indicates how well the summary covers the salient information in the article.
As shown in Table 8, our model FASum achieves the highest factual correctness score, higher than UniLM and considerably outperforming BottomUp. We conduct a statistical test and find that, compared with UniLM, the improvement of our model is statistically significant with a p-value smaller than 0.05 under a paired t-test. In terms of informativeness, our model is comparable with UniLM. Finally, without the knowledge graph component, the Seq2Seq model generates summaries with both lower factual correctness and lower informativeness.
To assess the effectiveness of the correction model FC, we conduct a human evaluation of side-by-side summaries. In CNN/DailyMail, we first collect the articles where the summaries generated by BottomUp are modified by FC. Then we sample 100 articles and juxtapose the original summary and the modified version. Three labelers from AMT are asked to choose the one with better factual correctness. To reduce bias, we randomly shuffle the order of the two versions for each article. We also conduct a similar evaluation for TConvS2S.
As shown in Table 9, after correction by FC, among the summaries that are modified, 40.4% are judged to be factually more correct, 43.8% are assessed to be factually unchanged, while only 15.8% are considered to become worse than the original version from BottomUp. Therefore, FC can help boost the factual correctness of summaries from given systems.
Factual correctness is an important but often neglected criterion for abstractive text summarization. In this paper, we extract factual information from the article to be represented in a knowledge graph. Via neural graph computation, we integrate the factual knowledge into the process of producing summaries. The resulting model FASum enhances the ability to preserve facts during summarization. Both automatic and human evaluation show the effectiveness of our model. We also propose a correction model, FC, to rectify factual errors in candidate summaries.
For future work, we plan to integrate knowledge graphs into pre-trained models for better summarization. Moreover, we will combine the internally extracted knowledge graph with an external knowledge graph (e.g. ConceptNet) to enhance the commonsense capability of summarization.
Leveraging linguistic structure for open domain information extraction.
Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. 344–354. Cited by: §1, §3.3.1.
Faithful to the original: fact aware neural abstractive summarization.
Thirty-Second AAAI Conference on Artificial Intelligence, Cited by: §2.2.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §4.3.
- Bert: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805. Cited by: §3.2.
- Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197. Cited by: §1, §1, §1, §2.1, Figure 2, §3.4, §4.4.
- Ranking generated summaries by correctness: an interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 2214–2220. Cited by: §1, §2.2, §2.2, §3.2, Table 3.
- Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792. Cited by: §1, §1, §1, §2.1, §4.4.
- Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 166–175. Cited by: §1, §2.2, §4.6.2.
Mind the facts: knowledge-boosted coherent abstractive text summarization.
Proceedings of The Workshop on Knowledge Representation and Reasoning Meets Machine Learning in NIPS 2019, Cited by: §2.2.
- Teaching machines to read and comprehend. Advances in neural information processing systems, pp. 1693–1701. Cited by: §3.3.1, §4.1.
- Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980. Cited by: §4.2.
- Neural text summarization: a critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 540–551. External Links: Cited by: §2.2, §4.5.
- Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840. Cited by: §1, §1, §2.2, §3.2, §3.2, §3.2, §3.2, Table 3, §4.3.
- Sentencepiece: a simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226. Cited by: §4.2.
- Finite geometrical systems. Cited by: §1, §3.3.2.
- Ensure the correctness of the summary: incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pp. 1430–1441. Cited by: §2.2.
- Rouge: a package for automatic evaluation of summaries. In Text Summarization Branches Out. Cited by: §1, §4.3.
- Text summarization with pretrained encoders. EMNLP. Cited by: §4.1, Table 4.
- Roberta: a robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692. Cited by: Figure 2, §3.4.
- Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745. Cited by: §3.3.1, §4.1, §4.4.
- Passage re-ranking with bert. arXiv preprint arXiv:1901.04085. Cited by: §4.5.
- A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §1, §2.1.
- A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §2.1.
- Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368. Cited by: §1, §2.1, §4.6.1.
- Conceptnet 5.5: an open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence, Cited by: §1.
- Sequence to sequence learning with neural networks. Advances in neural information processing systems, pp. 3104–3112. Cited by: §2.1.
- Attention is all you need. Advances in neural information processing systems, pp. 5998–6008. Cited by: §3.3.
- Graph attention networks. arXiv preprint arXiv:1710.10903. Cited by: §1, §3.3.2.
- Pointer networks. In Advances in neural information processing systems, pp. 2692–2700. Cited by: §3.3.4.
- Optimizing the factual correctness of a summary: a study of summarizing radiology reports. arXiv preprint arXiv:1911.02541. Cited by: §2.2.
- Make lead bias in your favor: a simple and effective method for news summarization. arXiv preprint arXiv:1912.11602. Cited by: §2.1.
Appendix A Implementation details
During inference, we use beam search with a width of 18 for CNN/DailyMail and 2 for XSum. The minimum summary length is 30 for CNN/DailyMail and 40 for XSum.
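To illustrate how a beam width and a minimum summary length interact during decoding, the following is a minimal sketch of beam search in which the end-of-sequence token is blocked until the hypothesis reaches the minimum length. It is not the implementation used for FASum: the function `beam_search`, the token `EOS`, the scoring callback `next_log_probs` and the toy model are hypothetical, and blocking the end-of-sequence token is only one common way to enforce a minimum length. No length normalization is applied.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "</s>"  # hypothetical end-of-sequence token


def beam_search(
    next_log_probs: Callable[[List[str]], Dict[str, float]],
    beam_width: int,
    min_len: int,
    max_len: int,
) -> List[str]:
    """Return the best finished hypothesis as a token list (without EOS)."""
    beams: List[Tuple[float, List[str]]] = [(0.0, [])]
    finished: List[Tuple[float, List[str]]] = []
    for _ in range(max_len):
        candidates: List[Tuple[float, List[str]]] = []
        for score, tokens in beams:
            for tok, log_p in next_log_probs(tokens).items():
                if tok == EOS:
                    # EOS is only accepted once the minimum length is reached.
                    if len(tokens) >= min_len:
                        finished.append((score + log_p, tokens))
                    continue
                candidates.append((score + log_p, tokens + [tok]))
        if not candidates:
            break
        # Keep the beam_width highest-scoring partial hypotheses.
        candidates.sort(key=lambda c: c[0], reverse=True)
        beams = candidates[:beam_width]
    if not finished:
        finished = beams  # fall back to unfinished hypotheses
    return max(finished, key=lambda c: c[0])[1]


def toy_model(prefix: List[str]) -> Dict[str, float]:
    """Toy next-token distribution that always prefers to stop."""
    return {EOS: math.log(0.6), "a": math.log(0.3), "b": math.log(0.1)}


short = beam_search(toy_model, beam_width=2, min_len=0, max_len=5)
constrained = beam_search(toy_model, beam_width=2, min_len=3, max_len=5)
```

With the toy model, the unconstrained search stops immediately, whereas the constrained search is forced to emit at least three tokens before it may end the summary.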
For hyperparameter search, we tried 4 layers with 4 heads, 6 layers with 6 heads and 10 layers with 8 heads. The beam search width was searched over [5, 20] for CNN/DailyMail and [1, 5] for XSum. The minimum summary length was chosen from [30, 35, 40, 45, 50, 55, 60] for both datasets.
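The search space described above can be written down as a small sketch. The values are taken from the text, but several points are assumptions made only for illustration: that the beam width ranges are inclusive integer ranges, that the search was an exhaustive grid, and that architecture and decoding settings were searched jointly. The names `ARCHITECTURES`, `BEAM_WIDTHS`, `MIN_LENGTHS` and `search_space` are hypothetical.

```python
from itertools import product
from typing import Dict, List

# (layers, heads) combinations listed in the text.
ARCHITECTURES = [(4, 4), (6, 6), (10, 8)]

# Beam widths per dataset; inclusive integer ranges are an assumption.
BEAM_WIDTHS = {
    "CNN/DailyMail": range(5, 21),
    "XSum": range(1, 6),
}

# Candidate minimum summary lengths, shared by both datasets.
MIN_LENGTHS = [30, 35, 40, 45, 50, 55, 60]


def search_space(dataset: str) -> List[Dict[str, int]]:
    """Enumerate every configuration of the assumed joint grid."""
    return [
        {"layers": layers, "heads": heads, "beam_width": beam, "min_length": min_len}
        for (layers, heads), beam, min_len in product(
            ARCHITECTURES, BEAM_WIDTHS[dataset], MIN_LENGTHS
        )
    ]


cnn_dm_grid = search_space("CNN/DailyMail")
xsum_grid = search_space("XSum")
```

Under these assumptions the grid contains 3 × 16 × 7 = 336 configurations for CNN/DailyMail and 3 × 5 × 7 = 105 for XSum. The reported beam widths (18 and 2) and minimum lengths (30 and 40) both lie inside it.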
The model has 163.2M parameters. Training one epoch on 4 V100 GPUs takes 2 hours for CNN/DailyMail and 0.5 hours for XSum. The batch size is set to 48 for both datasets.
On the validation sets, FASum achieves ROUGE-1 39.33%, ROUGE-2 17.70% and ROUGE-L 36.23% on CNN/DailyMail, and ROUGE-1 26.86%, ROUGE-2 8.82% and ROUGE-L 21.54% on XSum.
Appendix B Examples
Table 10, Table 11 and Table 12 show examples of CNN/DailyMail articles and summaries generated by our model and several baseline systems. The factual errors in summaries are marked in red and the corresponding facts are marked in green in the article.
As shown, while baseline systems like BottomUp and UniLM achieve high ROUGE scores, they are susceptible to factual errors. For instance, in Article 2, BottomUp wrongly assigns the speaker to the concept “north star” and UniLM misinterprets the voting relationship: the voters are the readers, not the journalist Sutter. The ablation model Seq2Seq also makes a number of errors, e.g. in Article 3 it mistakenly states that the daughter of the deceased Geoffrey Lewis passed away at 79.
In comparison, our proposed fact-aware summarizer FASum faithfully summarizes the salient information in the article, and it can re-organize the phrasing instead of merely copying content from the article.
Table 13 and Table 14 show examples of CNN/DailyMail articles, summaries generated by BottomUp and the versions corrected by FC. As shown, our correction model can identify wrong entities and replace them with the correct ones. For instance, in Article 1, BottomUp’s summary states that Raul Castro, who appears elsewhere in the article, is the President of Venezuela, while FC correctly replaces him with Nicolas Maduro. In Article 2, BottomUp wrongly says that Sofia Vergara wants to keep the frozen embryos, but it is her ex-fiance Nick Loeb who expresses this opinion, as corrected by FC.
|Article 1||A fiery sunset greeted people in Washington Sunday. … The stunning sunsets were the result of raging wildfires in parts of Siberia. … The fires were started in southeastern Siberia, by farmers burning grass in their fields. But on April 14, it is believed that the flames quickly grew out of control because of strong winds and spread throughout the region, according to CNN affiliate KOMO-TV. As a result, the fires have destroyed dozens of villages in the region. …|
|Reference||Smoke from massive fires in Siberia created fiery sunsets in the Pacific Northwest. Atmospheric winds carried smoke from the wildfires across the Pacific Ocean. Smoke particles altered wavelengths from the sun, creating a more intense color.|
|BottomUp||The fires were started in southeastern Siberia on April 14. The fires have destroyed dozens of villages in the region. The smoke from Siberia grew out of control.|
|UniLM||The fiery sunsets were the result of raging wildfires in parts of Siberia. Strong winds picked up the smoke from the fires. As the smoke starts to dissipate, air quality will get bette|
|FASum||“The dramatic sunsets began showing up over the weekend,” Seattle native Tim Durkan says. The fires were started in southeastern Siberia, by farmers burning grass in their fields.|
|Seq2Seq||As a result, the flames grew out of control in Siberia. The sunsets were the result of raging wildfires in parts of siber|
|Article 2||We’re 2 degrees from a different world. Humans never have lived on a planet that’s 2 degrees Celsius (3.6 Fahrenheit) warmer than it was before we started burning fossil fuels, in the late 1800s, and climate experts say we risk fundamentally changing life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” said Gernot Wagner, the lead senior economist at the Environmental Defense Fund and co-author of the book “Climate Shock.” … I’ve heard it referred to as the “north star” for climate negotiations. … It’s so significant that it’s going to be the subject of my reporting for the rest of the year. I’m calling that effort “2 degrees,” and I need your help to make it work. Until 5 p.m. ET Monday, you can vote on the first story I’ll report for the series. Vote using the Facebook poll below (or go here if you don’t see it.) Tell CNN’s John Sutter which of these climate change stories you like best - and he’ll report on the winner…|
|Reference||Experts have raised red flags about the warming of planet by 2 degrees Celsius. John Sutter: This one little number is significant as a way to focus world’s attention on problem.|
|BottomUp||Climate experts say we risk life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” says “north star” for climate negotiations. Vote using Facebook poll below #2degrees.|
|UniLM||John Sutter’s “2 degrees” will be the subject of CNN’s next series on climate change. The number is a way to focus the world ’s attention, he says. “This is gambling with the planet,” says Gernot Wagner, the lead economist at the Environmental Defense Fund. Sutter will vote on the first story he ’ll report for the series|
|FASum||I’ve heard it referred to as the “north star” for climate negotiations. John Sutter: “climate change stories you like best” and he’ll report on the winner.|
|Seq2Seq||Gernot Wagner says we risk fundamentally changing life on our planet. “This is gambling with the planet,” says Wagner not Wagner, lead economist at the Environmental Defense Fund. “It’s so significant that it’s going to be the subject of my reporting for the rest of the year,” she says.|
|Article 3||Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died. He was 79. Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday, family friend Michael Henderson said. No other details were immediately available. Lewis began his long association with Eastwood in “High Plains Drifter” (1973). He also appeared with the actor in “Thunderbolt and Lightfoot” (1974) …|
|Reference||Geoffrey Lewis appeared in many movies, TV shows. Actor was frequently collaborator with Clint Eastwood. Actress Juliette Lewis, his daughter, called him “my hero”|
|BottomUp||Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose” Lewis scored a Golden Globe nomination for playing bartender Earl Tucker. He also appeared with the actor in “Thunderbolt and Lightfoot” “Flo,” “Pink Cadillac” and “Midnight in the Evil ” The actor also stood out as a gravedigger turned vampire in the 1979 Tobe “Salem’s Lot,” a 1979 actor. Lewis was the father of Oscar-nominated actress Juliette Lewis.|
|UniLM||Geoffrey Lewis was a prolific character actor. He played Orville Boggs in “Every Which Way But Loose” Lewis was the father of Oscar-nominated actress Juliette Lewis.|
|FASum||It’s the father of Oscar-nominated actress Juliette Lewis. Lewis began his long association with Eastwood in “High Plains Drifter” he also appeared with the actor in “Thunderbolt and Lightfoot”|
|Seq2Seq||Oscar-nominated actress Juliette Lewis dies at 79. Lewis starred in “High Plains Drifter” he also appeared in “Bronco Billy” and “Midnight in the Garden of Good and Evil”|
|Article 4||… This Sunday in Rome, Pope Francis faced just such a dilemma. First, the back story: One hundred years ago, more than 1 million Armenians (some estimates run as high as 1.5 million) died at the hand of the Turks. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide.” (A U.S. resolution to do the same has languished in Congress.) Turkish officials disagree… John Paul II used the “g” word in 2001, but didn’t dare speak it out loud…|
|Reference||Previous popes had finessed the question of whether the killing of 1.5 million Armenians was genocide. Because he often shines such a smiley face on the world, it can be easy to forget the bluntness Francis sometimes brings to the bully pulpit.|
|BottomUp||More than 1 million Armenians died at St. John Paul II ’s Turks. Turkish officials say the killings were a “genocide” the Vatican says it’s not the pope, but he says he has no plans to use it. He says the church’s moral and political woes have improved.|
|UniLM||More than 1 million Armenians died at the hands of the Turks one hundred years ago. Frida Ghitis: Pope Francis faced just such a dilemma this Sunday in Rome. She says previous popes had to make a hard choice: Adopt the sharp tongue of a prophet or the discretion of a diplomat?|
|FASum||As scholars like to say, the Vatican has walked the line between spiritual and worldly concerns for centuries. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide”|
|Seq2Seq||One hundred years ago, more than 1 million Armenians died at the hand of the Turks. A slew of historians and at least 20 countries call the killings a “genocide”. John Paul II used the “g” word in 2001.|
|Article 5||Can you imagine paying $1,000 a month in rent to live in a one-car garage? Nicole, a 30-year-old woman, doesn’t have to imagine this scenario because it’s her everyday reality… CNN recently published a powerful piece called “Poor kids of Silicon Valley” that documents the affordable housing challenges facing families in the Bay Area… The agency I lead, the U.S. Department of Housing and Urban Development, recently released a report estimating that 7.7 million low-income households live in substandard housing, spend more than half their incomes on rent or both… HUD created the Rental Assistance Demonstration initiative to bring private investment into the fold for the public good… RAD has allowed local communities to raise more than $733 million in new capital to date…|
|Reference||CNN’s John Sutter told the story of the “Poor kids of Silicon Valley” HUD Secretary Julian Castro: Our shortage of affordable housing is a national crisis that stunts the economy.|
|BottomUp||“Poor kids of Silicon Valley ” spend $1,000 a month in rent to live in San Mateo, California. Nicole, a woman, has a house that is home to 16 people and 11 children. Silicon Valley is in an affordable housing crisis.|
|UniLM||Writers: Our entire nation is in the midst of an affordable housing crisis. Writers: 7.7 million low-income households live in substandard housing , spend more than half their incomes on rent or both. Writers: HUD has allowed local communities to raise more than $733 million in new capital. Writers: No American should ever have to wait six decades to have a decent|
|FASum||CNN recently published a powerful piece called “Poor kids of Silicon Valley” the affordable housing housing crisis agency says 7.7 million low-income households live in substandard housing.|
|Seq2Seq||“Poor kids of Silicon Valley” documents affordable housing challenges families. U.S. agency estimating 7.7 million low-income households live in substandard housing.|
|Article 6||An American teenager who helped her boyfriend stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali has been sentenced to 10 years in prison. Heather Mack, 19, who gave birth to her own daughter just weeks ago, was found guilty with her 21-year-old boyfriend, Tommy Schaefer, of killing Sheila von Wiese-Mack on the Indonesian island last August. Schaefer was sentenced to 18 years in prison for battering von Wiese-Mack to death in room 317 of the St. Regis Bali Resort. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship…|
|BottomUp||Heather Mack, 19, gave birth to her own daughter just weeks ago. Schaefer was found guilty with her boyfriend, Tommy Schaefer, on the Indonesian island. Schaefer was sentenced to 18 years in prison.|
|UniLM||Tommy Schaefer, 21, and Heather Mack, 19, were found guilty of killing Sheila von Wiese-Mack in August. The American helped her boyfriend stuff her mother’s body into a suitcase at an upmarket hotel in Bali. Schaafer had claimed he killed his girlfriend’s mother in self-defense after an argument over their relationship. Mack told the court her mother had threatened to kill|
|FASum||Heather Mack, 19, helped boyfriend Tommy Schaefer stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship.|
|Seq2Seq||Heather Mack, 19, was found guilty of murdering her boyfriend Tommy Schaefer. Schaefer claimed von Wiese-Mack killed his girlfriend’s mother in self-defense. Von wi|
|Article 1||The VII Summit of the Americas was supposed to be all about the symbolic handshake between the United States and Cuba. But insert Venezuela into the mix and Panama City, Panama, quickly turns into a “triangle of tension.”… Cuba has historically been the wrench in the diplomatic machinery, with some Latin American leaders threatening not to attend the Summit of the Americas if the United States and Canada didn’t agree to invite President Raul Castro… The much anticipated handshake between Obama and Castro would steal all the headlines if it wasn’t for Cuba’s strongest ally, Venezuela. Venezuelan President Nicolas Maduro recently accused the United States of trying to topple his government and banned former President George Bush…|
|BottomUp||Heads of state from 35 countries have met every three years to discuss economic, social or political issues since the summit in 1994. Venezuela’s President Raul Castro has been criticized for human rights violations. The u.s. government says the summit of the Americas is a “triangle of tension.”|
|Corrected by FC||Heads of state from 35 countries have met every three years to discuss economic, social or political issues since the summit in 1994. Venezuela’s President Nicolas Maduro has been criticized for human rights violations. The u.s. government says the summit of the Americas is a “triangle of tension.”|
|Article 2||She’s one of the hottest and most successful Latinas in Hollywood, but now Sofia Vergara is playing defense in a legal battle initiated by her ex-fiance: He wants to keep the two frozen embryos from their relationship, both female. The 42-year-old actress and star of the hit TV sitcom “Modern Family” split from businessman Nick Loeb in May 2014. Loeb is suing the Colombian-born actress in Los Angeles to prevent Vergara from destroying their two embryos conceived through in vitro fertilization in November 2013, according to published reports by New York Daily News and In Touch magazine…|
|BottomUp||Sofia Vergara wants to keep the frozen embryos from their relationship, both female. He is suing the actress in Los Angeles to prevent Vergara from their embryos. The actress and star of the “Modern Family” split from Nick Loeb in May 2014.|
|Corrected by FC||Nick Loeb wants to keep the frozen embryos from their relationship, both female. He is suing the actress in Los Angeles to prevent Vergara from their embryos. The actress and star of the“Modern Family”split from Businessman Nick Loeb in May 2014.|
|Article 3||Volvo says it will begin exporting vehicles made in a factory in southwest China to the United States next month, the first time Chinese-built passenger cars will roll into American showrooms. Parent company Geely Automobile, which bought Volvo in 2010, is seeking to prove that a Chinese company can manage a global auto brand…|
|BottomUp||Volvo is seeking to prove that a Chinese company can manage a global auto brand. The car will be one of four models produced in a manufacturing plant in Chengdu. China is largest market for car sales globally in 2009.|
|Corrected by FC||Geely Automobile is seeking to prove that a Chinese company can manage a global auto brand. The car will be one of four models produced in a manufacturing plant in Chengdu. China is largest market for car sales globally in 2009.|
|Article 4||American suburbanites who can do all their shopping without getting wet, driving from point-to-point or looking for a new place to park, can give much of the credit to Alfred Taubman. Taubman, a real estate developer who helped change the face of suburban life by popularizing upscale indoor shopping malls, died Friday at the age of 91…|
|BottomUp||American suburbanites died Friday at the age of 91. His first name was Adolph. His son, Alfred Taubman, was born in Michigan to German Jewish immigrants.|
|Corrected by FC||Alfred Taubman died Friday at the age of 91. His first name was Adolph. His son, Alfred Taubman, was born in Michigan to German Jewish immigrants.|
|Article 5||In July of 2013, the oldest of Jesus relics stories rose again when Turkish archaeologists discovered a stone chest in a 1,350-year-old church that appeared to contain a piece of Jesus’ cross. “We have found a holy thing in a chest. It is a piece of a cross,” said excavation team leader Gulgun Koroglu, an art historian and archaeologist…|
|BottomUp||“we have found a holy thing in a chest. It is a piece of a cross,” says archaeologist and archaeologist. Turkish archaeologists found a stone chest in a 1,350-year-old church that Jesus was crucified. “true cross” is the first Roman emperor to convert to Christianity.|
|Corrected by FC||“we have found a holy thing in a chest. It is a piece of a cross,” says art historian and archaeologist. Turkish archaeologists found a stone chest in a 1,350-year-old church that Jesus was crucified. “true cross” is the first Roman emperor to convert to Christianity.|
|Article 6||The last time Frank Jordan spoke with his son, Louis Jordan was fishing on a sailboat a few miles off the South Carolina coast. The next time he spoke with him, more than two months had passed and the younger Jordan was on a German-flagged container ship 200 miles from North Carolina, just rescued from his disabled boat. “I thought I lost you,” the relieved father said. Louis Jordan, 37, took his 35-foot sailboat out in late January and hadn’t been heard from in 66 days when he was spotted Thursday afternoon by the Houston Express on his ship drifting in the Atlantic Ocean…|
|BottomUp||Frank Jordan, 37, was fishing on a sailboat in the Atlantic Ocean. He hadn’t been heard from in 66 days when he was spotted on his ship. His son, Louis Jordan, took his sailboat out in January.|
|Corrected by FC||Louis Jordan, 37, was fishing on a sailboat in the Atlantic Ocean. He hadn’t been heard from in 66 days when he was spotted on his ship. His son, Louis Jordan, took his sailboat out in late January.|
|Article||Feidin Santana, the man who recorded a South Carolina police officer fatally shooting a fleeing, unarmed man, told CNN on Thursday night he was told by another cop to stop using his phone to capture the incident. “One of the officers told me to stop, but it was because I (said) to them that what they did it was an abuse and I witnessed everything,” he told CNN’s “Anderson Cooper 360°.”… Santana recalled the moments when he recorded a roughly three-minute video of North Charleston Police officer Michael Slager shooting Walter Scott as Scott was running away Saturday…|
|BottomUp||“One of the officers told me to stop, but it was an abuse,” Santana says. Santana has said he feared for his life. Scott was told by another cop to stop using his phone to capture the incident.|
|Corrected by FC||“One of the officers told me to stop, but it was an abuse,” Santana says. Santana has said he feared for his life. Feidin Santana was told by another cop to stop using his phone to capture the incident.|