Text summarization models aim to produce an abridged version of a long text while preserving its salient information. Abstractive summarization models generate summaries freely, with no constraint on the words or phrases used. This format is closer to human-edited summaries and is both flexible and informative. Accordingly, there is a large body of work on producing abstractive summaries (See et al., 2017; Paulus et al., 2017; Dong et al., 2019; Gehrmann et al., 2018).
However, one prominent issue with abstractive summarization is factual inconsistency: the summary sometimes distorts or fabricates facts from the article. Recent studies show that up to 30% of the summaries generated by abstractive models contain such factual inconsistencies (Kryściński et al., 2019b; Falke et al., 2019), which seriously undermines the credibility and usability of abstractive summarization systems. Table 1 shows an example article and excerpts of generated summaries. The article states that the actor Geoffrey Lewis began his association with Eastwood in “High Plains Drifter”, while the summary from BottomUp (Gehrmann et al., 2018) picks the wrong movie title from elsewhere in the article. The article also says that the deceased Geoffrey Lewis is the father of Juliette Lewis. However, an ablated version of our model that does not use the factual knowledge graph of the article, denoted by -KG, wrongly states that Juliette passed away. In comparison, our model FASum generates a summary that correctly reflects the relationship stated in the article.
|Article||Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died… Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday. Lewis began his long association with Eastwood in “High Plains Drifter”…|
|BottomUp||Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose”.|
|FASum||It’s the father of Oscar-nominated actress Juliette Lewis.|
|-KG||Oscar-nominated actress Juliette Lewis dies at 79.|
Most existing abstractive summarization models apply a conditional language model that focuses on the token-level accuracy of summaries while neglecting semantic-level consistency between the summary and the article. As a result, the generated summaries often score high on token-level metrics like ROUGE (Lin, 2004) but lack factual correctness. In view of this, we argue that a robust abstractive summarization system must be equipped with factual knowledge during both the comprehension of the article and the generation of the summary.
In this paper, we represent facts in the form of a knowledge graph. Although there have been numerous efforts to build commonly applicable knowledge networks that facilitate knowledge extraction and integration, such as ConceptNet (Speer et al., 2017) and WikiData, we find that these tools are more useful for conferring commonsense knowledge. In abstractive summarization of content such as news articles, many entities and relations are previously unseen. Moreover, our goal is to produce summaries that have no or very few conflicts with the facts in the article. Thus, we propose to extract factual knowledge from within the article itself.
Therefore, we employ the information extraction (IE) tool OpenIE (Angeli et al., 2015) to extract facts from the article in the form of relational tuples: (subject, relation, object). We then use the Levi transformation (Levi, 1942) to convert each tuple component into a node in the knowledge graph. This graph contains the facts in the article and is integrated in the summary generation process.
Then, we use a graph attention network (Veličković et al., 2017) to obtain the representation of each node, and fuse these representations into a transformer-based encoder-decoder architecture. The decoder attends to each graph node in every transformer block. Finally, it leverages the copy-generate mechanism to selectively produce content either from the dictionary or from the graph entities. We denote this model as the Fact-Aware Summarization model, FASum.
In experiments on benchmark summarization datasets, the FASum model shows clear improvements in the factual correctness of generated summaries. Using an independently trained BERT-based factual correctness evaluator (Kryściński et al., 2019b), we find that on the CNN/DailyMail dataset, FASum obtains 1.2% higher factual correctness scores than UniLM (Dong et al., 2019) and 4.5% higher than BottomUp (Gehrmann et al., 2018). Human evaluation further corroborates the effectiveness of our models.
We further propose an easy-to-compute metric, matched relation tuples, and show that FASum obtains a higher score than the baselines. Using the ratio of novel n-grams, we also show that FASum is a highly abstractive summarization model.
In summary, our contribution in this paper is three-fold:
We extract factual information from the article in the form of a relational knowledge graph.
We conduct neural graph computation on the knowledge graph and integrate its information into an end-to-end process of summary generation.
We propose a simple-to-use metric, matched relation tuples, to evaluate factual correctness in abstractive summarization.
2 Related Work
2.1 Abstractive Summarization
Abstractive text summarization has been intensively studied in recent literature. Most models employ the encoder-decoder architecture (seq2seq) (Sutskever et al., 2014). Rush et al. (2015) introduces an attention-based seq2seq model for abstractive sentence summarization. See et al. (2017) uses a copy-generate mechanism that can both produce words from the vocabulary via a generator and copy words from the article via a pointer. Paulus et al. (2017) leverages reinforcement learning to improve summarization quality. Gehrmann et al. (2018) uses a content selector to over-determine phrases in source documents, which helps constrain the model to likely phrases. Zhu et al. (2019) defines a pretraining scheme for summarization and produces a zero-shot abstractive summarization model. Dong et al. (2019) employs different masking techniques for both classification and generation tasks in NLP. The resulting pretrained model, UniLM, achieves state-of-the-art results on various tasks including abstractive summarization.
2.2 Fact-Aware Summarization
Since the task of judging whether a summary is consistent with an article is similar to the text entailment problem, several recent works use text entailment models to evaluate and boost the factual correctness of summarization. Li et al. (2018) co-trains summarization and entailment for the encoder and employs an entailment-aware decoder via Reward Augmented Maximum Likelihood (RAML). Falke et al. (2019) proposes using off-the-shelf entailment models to rerank candidate summary sentences to boost factual correctness.
Apart from using entailment models for factual correctness, Zhang et al. (2019) approaches the factual correctness problem in the medical domain, where the space of facts is limited and can be depicted with a descriptor vector. The proposed model utilizes information extraction and reinforcement learning and achieves good results in medical summarization. Cao et al. (2018) extracts relational information from the article and maps it to a sequence as input to the encoder. The decoder attends to both the article tokens and the relations. Gunel et al. (2019) employs an entity-aware transformer structure to boost the factual correctness of abstractive summarization, where the entities come from the Wikidata knowledge graph. In comparison, our model uses the knowledge graph extracted from the article and fuses it into the generation via graph neural computation.
As previous work mostly uses human evaluation of factual correctness (Kryściński et al., 2019a) and text entailment models suffer from the domain shift problem (Falke et al., 2019), several papers put forward automatic fact-checking metrics that are better aligned with human judgement. Goodrich et al. (2019) proposes using information extraction to obtain the relation tuples in the ground-truth summary and the candidate summary, then computing the factual accuracy as the ratio of overlapping tuples. Kryściński et al. (2019b) frames factual correctness as a binary classification problem: a summary is either consistent or inconsistent with the article. It therefore applies positive and negative transformations to the summary to produce training data for a BERT-based classification model. The resulting evaluator, FactCC, significantly outperforms text entailment models and shows high correlation with human judgements. In this paper, we use FactCC as the factual evaluator for our model.
3.1 Problem Formulation
We formalize abstractive summarization as a supervised seq2seq problem. The input consists of $a$ pairs of articles and summaries: $\{(X_1, Y_1), (X_2, Y_2), \ldots, (X_a, Y_a)\}$. Each article is tokenized into $X_i = (x_1, \ldots, x_{L_i})$ and each summary is tokenized into $Y_i = (y_1, \ldots, y_{N_i})$. In abstractive summarization, the model-generated summary can contain tokens, phrases and sentences not present in the article. For simplicity, in the following we drop the data index subscript. Each training pair therefore becomes $X = (x_1, \ldots, x_m)$, $Y = (y_1, \ldots, y_n)$, and the model needs to generate an abstractive summary $\hat{Y} = (\hat{y}_1, \ldots, \hat{y}_{n'})$.
3.2 Factual Correctness Evaluator
We leverage the FactCC evaluator (Kryściński et al., 2019b), which casts correctness evaluation as a binary classification problem, namely finding a function $f: (A, C) \rightarrow [0, 1]$, where $A$ is an article and $C$ is a summary sentence defined as a claim.
$f(A, C)$ represents the probability that $C$ is factually correct with respect to the article $A$. If a summary $S$ is composed of multiple sentences $C_1, \ldots, C_k$, we define the factual score of $S$ as:
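One natural way to aggregate the per-sentence scores, assumed here for illustration, is their mean. The symbol names ($f$ for the evaluator, $A$ for the article, $C_1, \ldots, C_k$ for the sentences of summary $S$) are likewise assumptions:

```latex
f(A, S) = \frac{1}{k} \sum_{i=1}^{k} f(A, C_i)
```

Under this reading, a summary with one unsupported sentence out of three can score at most about $2/3$, however well supported the other two are.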
|Model||Incorrect|
|InferSent (Falke et al., 2019)||41.3%|
|SSE (Falke et al., 2019)||37.3%|
|BERT (Falke et al., 2019)||35.9%|
|ESIM (Falke et al., 2019)||32.4%|
|FactCC (Kryściński et al., 2019b)||30.0%|
|FactCC (our version)||26.8%|
To generate training data, multiple positive and negative forms of transformation are applied to each summary sentence as well as some randomly selected sentences from the article.
To augment the number of entailed examples, we adopt backtranslation as a paraphrasing tool. The original summary is translated into an intermediate language (French, German, Chinese, Spanish or Russian) and then translated back to English. The resulting claim is semantically equivalent to the original one but carries minor syntactic and lexical changes, so it is used as a positive training example.
A common failure mode of abstractive summarization models is to include incorrect entities from the article. To predict the occurrence of this type of error, we generate summary examples that are not entailed by the article by swapping entities in the ground-truth summary sentences. We use spaCy to find all entities in the article and the summary, then randomly swap one entity in the summary with a different entity of the same type (e.g. person, location, organization) from the article. The generated claim is factually incorrect and is thus used as a negative training example.
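The entity-swap step can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the function name `swap_entity` and the `(text, type)` pair format are hypothetical, and entities are passed in directly (in practice they would come from an NER tool such as spaCy) so that the sketch has no dependencies.

```python
import random


def swap_entity(summary, summary_entities, article_entities, rng=None):
    """Return a negative claim by swapping one entity in `summary`.

    `summary_entities` and `article_entities` are lists of (text, type)
    pairs, e.g. ("Juliette Lewis", "PERSON"). This input format is an
    assumption made for illustration.
    Returns None when no same-type replacement exists.
    """
    rng = rng or random.Random()
    candidates = []
    for text, etype in summary_entities:
        if text not in summary:
            continue
        # Different entities of the same type taken from the article.
        replacements = [t for t, ty in article_entities
                        if ty == etype and t != text]
        if replacements:
            candidates.append((text, replacements))
    if not candidates:
        return None
    original, replacements = rng.choice(candidates)
    # Replace only the first occurrence of the chosen entity.
    return summary.replace(original, rng.choice(replacements), 1)
```

For example, given the summary sentence "Geoffrey Lewis died Tuesday." and an article that also mentions the person "Juliette Lewis", the only available swap yields "Juliette Lewis died Tuesday.", which mirrors the -KG error in Table 1.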
Using these methods, we generate a training set from CNN/DailyMail’s training data. The training set contains 20M claims, with an equal number of positive and negative instances. The development set is similarly obtained from the validation set and has a size of 2,900 claims after sampling.
Following Kryściński et al. (2019b), we finetune the BERT-Base model (Devlin et al., 2018a) to evaluate factual correctness. We concatenate the article and the generated claim, together with the special tokens [CLS] and [SEP]: [CLS] Article [SEP] Claim. A linear layer and a softmax operation convert the BERT embedding of [CLS] into the probability that the claim is entailed by the article content.
The hyperparameters were tuned on the development set, with the best setup using a learning rate of 1e-5 and a batch size of 24. As shown in Table 2, our reproduced model achieves better accuracy than that in Kryściński et al. (2019b) on the human-labelled sentence-pair-ordering data from Falke et al. (2019). Thus, we use this evaluator for all the factual correctness assessment tasks in the following. (We use the same setting to train another evaluator for the XSum dataset.)
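The input layout and the classification head can be sketched in plain Python as follows. The BERT encoder itself is omitted, and the function names, the two-row weight layout and the class order (index 1 meaning "entailed") are assumptions made for illustration.

```python
import math


def build_input(article_tokens, claim_tokens):
    """Lay out the evaluator input as: [CLS] Article [SEP] Claim."""
    return ["[CLS]"] + article_tokens + ["[SEP]"] + claim_tokens


def entail_probability(cls_embedding, weights, bias):
    """Linear layer followed by a two-class softmax.

    `weights` has two rows; row 1 is assumed to score the "entailed" class.
    Returns the probability assigned to that class.
    """
    logits = [sum(w * x for w, x in zip(row, cls_embedding)) + b
              for row, b in zip(weights, bias)]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - peak) for logit in logits]
    return exps[1] / sum(exps)
```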
3.3 Fact-Aware Summarizer
We propose the Fact-Aware abstractive Summarizer, FASum. It utilizes the seq2seq architecture built upon transformers (Vaswani et al., 2017). In detail, the encoder produces contextualized embeddings of the article and the decoder attends to the encoder’s output to generate the summary.
To make the summarization model fact-aware, we extract, represent and integrate knowledge from the source article into the summary generation process, as described below. The overall architecture of FASum is shown in Figure 1.
3.3.1 Knowledge Extraction
To extract important entity-relation information from the article, we employ the Stanford OpenIE tool (Angeli et al., 2015). The extracted knowledge is a list of tuples. Each tuple contains a subject (S), a relation (R) and an object (O), each of which is a segment of text from the article. For example, for the sentence “Born in a town, she took the midnight train”, the extracted subject is “she”, the relation is “took”, and the object is “midnight train”. In the experiments, there are on average 165.4 tuples extracted per article in the CNN/DailyMail dataset (Hermann et al., 2015) and 84.5 tuples in the XSum dataset (Narayan et al., 2018).
3.3.2 Knowledge Representation
We construct a knowledge graph to represent the information extracted by OpenIE. We apply the Levi transformation (Levi, 1942) to treat each entity and relation equally. In detail, given a tuple $(s, r, o)$, we create nodes $s$, $r$ and $o$, and add edges $(s, r)$ and $(r, o)$. In this way, we obtain an undirected knowledge graph $G = (V, E)$, where each node $v \in V$ is associated with text $t(v)$.
We then employ a graph attention network (Veličković et al., 2017) to obtain embeddings for each node. We use a bidirectional LSTM to generate the initial embedding of each node:
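The graph construction can be sketched as below. The function name `build_levi_graph` is hypothetical, and keying nodes by their text, so that identical strings from different tuples share one node, is an assumption of this sketch rather than something the text specifies.

```python
from collections import defaultdict


def build_levi_graph(tuples):
    """Build an undirected graph where subjects, relations and objects
    are all nodes, with edges subject-relation and relation-object.

    Nodes are keyed by their text (an assumption), so equal strings are
    merged into a single node. Returns a dict: node text -> set of neighbors.
    """
    adjacency = defaultdict(set)
    for subj, rel, obj in tuples:
        for u, v in ((subj, rel), (rel, obj)):
            adjacency[u].add(v)
            adjacency[v].add(u)
    return dict(adjacency)


graph = build_levi_graph([("she", "took", "midnight train")])
```

Because the relation becomes a node of its own, the subject and the object are never directly adjacent; information flows between them only through the relation node.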
where the index indicates the final state of the RNN.
Then, in each round, every node updates its embeddings using attention over itself and all of its neighbors, denoted by the set :
where , and are parametrized matrix and vector, respectively. In the experiment, we set the number of rounds .
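For reference, the standard single-head update of a graph attention network (Veličković et al., 2017) is sketched below. Whether FASum uses exactly this form is an assumption, as are the symbol names: $e_v^{(l)}$ is the embedding of node $v$ after round $l$, $W$ and $a$ are the learned matrix and vector, $\mathcal{N}(v)$ is the neighbor set of $v$, $\Vert$ denotes concatenation and $\sigma$ is a nonlinearity.

```latex
\begin{align}
z_{vu} &= \mathrm{LeakyReLU}\!\left(a^{\top}\left[W e_v^{(l)} \,\Vert\, W e_u^{(l)}\right]\right), \\
\alpha_{vu} &= \frac{\exp(z_{vu})}{\sum_{w \in \mathcal{N}(v) \cup \{v\}} \exp(z_{vw})}, \\
e_v^{(l+1)} &= \sigma\!\left(\sum_{u \in \mathcal{N}(v) \cup \{v\}} \alpha_{vu}\, W e_u^{(l)}\right).
\end{align}
```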
3.3.3 Knowledge Integration
The knowledge graph embedding is obtained in parallel with the encoder. During decoding, we conduct attention over the knowledge graph nodes in each transformer block. Suppose the input embedding to a decoder transformer block is , and the embedded article from encoder is . We first apply self-attention on and inter-attention between and , each followed by layer normalization and residual mechanism (denoted by LR). This leads to the intermediate embeddings .
Next, to capture information from the knowledge graph, we conduct inter-attention between and the nodes’ embeddings :
Notice that we multiply the attended vector with a parameter . This is due to the different numerical magnitude between encoder embeddings and knowledge graph embeddings. For the -th transformer block, we train one multiplier scalar .
The results are passed through a LR layer, FFN layer and another LR layer to obtain the final embeddings :
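Putting the steps of this subsection together, one decoder block can be sketched as follows. All symbol names are assumptions: $Y$ is the block input, $X$ the encoder output, $E$ the matrix of node embeddings, $\beta$ the learned scalar multiplier of the block, and $\mathrm{Attn}(Q, K)$ multi-head attention with queries $Q$ over keys and values $K$. The exact placement of the residual connections is also assumed.

```latex
\begin{align}
\mathrm{LR}(x, h) &= \mathrm{LayerNorm}(x + h), \\
Y_1 &= \mathrm{LR}\big(Y, \mathrm{Attn}(Y, Y)\big), \\
Y_2 &= \mathrm{LR}\big(Y_1, \mathrm{Attn}(Y_1, X)\big), \\
U &= \beta \cdot \mathrm{Attn}(Y_2, E), \\
Y_3 &= \mathrm{LR}(Y_2, U), \\
Y_{\mathrm{out}} &= \mathrm{LR}\big(Y_3, \mathrm{FFN}(Y_3)\big).
\end{align}
```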
3.3.4 Summary Generation
To produce the next token , we leverage the copy-generate mechanism (Vinyals et al., 2015). First, a copy probability is computed from :
The final next-token probability distribution is:
where from the last transformer block, , and is the embedding matrix for both the encoder and decoder.
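A generic copy-generate mixture can be sketched as below. It follows the pointer-generator pattern in general and is not taken from the paper: the function name, the argument layout and the choice of which source sequence the copy weights range over (article tokens or graph entity tokens) are assumptions.

```python
def mix_copy_generate(p_copy, vocab_probs, copy_weights, source_token_ids):
    """Combine a generation distribution with a copy distribution.

    final[w] = (1 - p_copy) * vocab_probs[w]
               + p_copy * (sum of copy_weights at source positions holding w)

    `vocab_probs` sums to 1 over the vocabulary and `copy_weights` sums
    to 1 over source positions, so the result also sums to 1.
    """
    final = [(1.0 - p_copy) * p for p in vocab_probs]
    for weight, token_id in zip(copy_weights, source_token_ids):
        final[token_id] += p_copy * weight
    return final
```

When `p_copy` is close to 1, the output is dominated by tokens present in the source, which is how copying can favor entities that actually occur in the article.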
During training, we use cross entropy as the loss function:
where is the one-hot vector for the -th token, and represent the parameters in the network.
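In standard form, the token-level cross-entropy loss reads as follows. The symbols are assumptions: $y_t$ is the one-hot vector of the $t$-th reference token, $p_t$ the predicted next-token distribution at step $t$, $n$ the summary length and $\theta$ the network parameters. Whether the sum is normalized by $n$ is not specified here.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{n} y_t^{\top} \log p_t
```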
We evaluate our model on the benchmark summarization datasets CNN/DailyMail (Hermann et al., 2015) and XSum (Narayan et al., 2018). They contain 312K and 227K news articles with human-edited summaries, respectively, covering different topics and various summarization styles. We show the details of the datasets in Table 3, adapted from Liu and Lapata (2019).
4.2 Implementation Details
We use the subword tokenizer SentencePiece (Kudo and Richardson, 2018). The dictionary is shared across all the datasets. The vocabulary has a size of 32K and the embedding dimension is 720. Both the encoder and the decoder have 10 layers with 8 attention heads each. The LSTM in the knowledge graph has a hidden state of size 128 and the graph attention network has a hidden state of size 50. The dropout rate is 0.1. The knowledge graph attention vector multiplier is initialized with . Teacher forcing is used in training. We use Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 2e-4. More details are presented in the Appendix.
For factual correctness, we use the FactCC evaluator (Kryściński et al., 2019b) trained on each dataset. All evaluator models are finetuned starting from the BERT-Large model (Devlin et al., 2018b). The training is independent of the summarizer, so no parameters are shared.
We also employ the standard ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004) to measure summary quality. These three metrics evaluate accuracy on unigrams, bigrams and the longest common subsequence, respectively. We report the F1 ROUGE scores in all experiments.
Note that during training, we select the FASum checkpoint with the highest ROUGE-L score on the validation set. Therefore, the model does not have access to the evaluator during training and validation.
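As an illustration of what the n-gram variants measure, the following sketch computes a simplified ROUGE-N F1 score on pre-tokenized text. It omits the stemming and other preprocessing of the official package and does not cover ROUGE-L, so its numbers are not directly comparable to reported scores. The function names are hypothetical.

```python
from collections import Counter


def ngram_counts(tokens, n):
    """Multiset of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """Simplified ROUGE-N F1 from clipped n-gram overlap."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # clipped match count
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)
```

A summary that swaps one entity for another changes only a few n-grams, which is why such scores can stay high while a fact is wrong.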
The following abstractive summarization models are selected as baseline systems. TConvS2S (Narayan et al., 2018) is based on convolutional neural networks. BottomUp (Gehrmann et al., 2018) uses a bottom-up approach to generate summaries. UniLM (Dong et al., 2019) utilizes large-scale pretraining to produce state-of-the-art abstractive summaries. We obtain the predictions of these baseline systems from their open-source repositories, or train the model when the predictions are not available.
As shown in Table 4, our model FASum outperforms all baseline systems in factual correctness score on CNN/DailyMail and is behind only UniLM on XSum. On CNN/DailyMail, FASum is 1.2% higher than UniLM and 4.5% higher than BottomUp in factual correctness score. We conduct statistical tests and find that the lead is statistically significant, with a p-value smaller than 0.05. Meanwhile, removing the knowledge graph component from FASum (indicated by -KG) leads to a clear drop in factual score on both datasets.
It is worth noting that the ROUGE metric does not always reflect factual correctness and sometimes even shows an inverse relationship, which has also been observed in Kryściński et al. (2019a) and Nogueira and Cho (2019). For instance, although BottomUp scores 2.42 ROUGE-1 points higher than FASum on CNN/DailyMail, there are many factual errors in its summaries, as shown in the human evaluation and the Appendix. Therefore, we argue that factual correctness should be specifically handled in both summary generation and evaluation.
Ablation Study. To evaluate the effectiveness of our proposed modules, we removed the attention vector multiplier, then the Levi transformation and finally the knowledge graph module. The results are reported in Table 5. As shown, each module plays a significant role in boosting the factual correctness score, but not necessarily the ROUGE metrics. The multiplier contributes 2.1 points in factual score, and the knowledge graph module as a whole boosts the factual score by as much as 3.7 points.
4.6.1 Novel n-grams
To measure the abstractiveness of our model, we compute the ratio of novel n-grams in summaries, i.e. n-grams that do not appear in the article, on the XSum test set, following See et al. (2017). As shown in Figure 2, among BottomUp, UniLM, FASum and the ablated version of our model -KG, FASum achieves the novel n-gram ratio closest to that of the reference summaries. The novel 1-gram ratio is almost 40%. Furthermore, our model obtains slightly higher novelty percentage of $n$-gram than reference summary when . Therefore, FASum can produce highly abstractive summaries while ensuring factual correctness.
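The novel n-gram ratio can be computed along the following lines. This sketch counts every n-gram occurrence in the summary (tokens rather than distinct types), which is an assumption, and the function name is hypothetical.

```python
def novel_ngram_ratio(summary_tokens, article_tokens, n):
    """Fraction of summary n-grams that never occur in the article."""
    summary_ngrams = [tuple(summary_tokens[i:i + n])
                      for i in range(len(summary_tokens) - n + 1)]
    if not summary_ngrams:
        return 0.0
    article_ngrams = {tuple(article_tokens[i:i + n])
                      for i in range(len(article_tokens) - n + 1)}
    novel = sum(1 for gram in summary_ngrams if gram not in article_ngrams)
    return novel / len(summary_ngrams)
```

A purely extractive summary scores 0 for every n, while the ratio typically grows with n for abstractive systems because longer n-grams are less likely to be reproduced verbatim.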
4.6.2 Matched relations
As the relational tuples in the knowledge graph capture the factual information in the text, we compute the precision of the tuples extracted from the summary. In detail, suppose the set of relational tuples in the summary is $R_s$ and the set of relational tuples in the article is $R_a$. Then, each tuple $(s, r, o) \in R_s$ falls into one of the following three cases:
1. $(s, r, o) \in R_a$, which we define as a correct hit;
2. $(s, r, o) \notin R_a$, but $\exists\, o' \neq o$ with $(s, r, o') \in R_a$, or $\exists\, s' \neq s$ with $(s', r, o) \in R_a$. We define this case as a wrong hit;
3. Otherwise, we define it as a miss.
It follows that a more factually correct summary would have more correct hits and fewer wrong hits. Suppose the numbers of tuples in cases 1-3 are $C$, $W$ and $M$, respectively. We define two kinds of precision metrics to measure the ratio of correct hits:
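The three cases and the resulting ratios can be sketched as below. The two precision definitions used here, one that ignores misses and one that counts them in the denominator, are assumptions chosen as natural readings of "ratio of correct hits". The function names and the exact-string matching of tuple components are likewise illustrative.

```python
def classify_tuples(summary_tuples, article_tuples):
    """Count correct hits, wrong hits and misses among summary tuples."""
    article = set(article_tuples)
    correct = wrong = miss = 0
    for s, r, o in set(summary_tuples):
        if (s, r, o) in article:
            correct += 1
        elif any((a_s == s and a_r == r) or (a_r == r and a_o == o)
                 for a_s, a_r, a_o in article):
            # Same subject and relation with another object, or same
            # relation and object with another subject.
            wrong += 1
        else:
            miss += 1
    return correct, wrong, miss


def precisions(correct, wrong, miss):
    """Assumed metrics: C / (C + W) and C / (C + W + M)."""
    hits = correct + wrong
    total = hits + miss
    p1 = correct / hits if hits else 0.0
    p2 = correct / total if total else 0.0
    return p1, p2
```

The first ratio penalizes only tuples that contradict the article, while the second also penalizes tuples that cannot be matched at all.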
Note that this metric is different from the ratio of overlapping tuples proposed in Goodrich et al. (2019). In Goodrich et al. (2019), the ratio is computed between the ground-truth summary and the candidate summary. However, since even the ground-truth summary may not cover all the salient information in the article, we choose to compare the knowledge tuples in the candidate summary directly against those in the article.
Table 6 displays the average precision metrics of our model and the baseline systems in CNN/DailyMail test set. As shown, FASum achieves the highest precision of correct hits under both measures. And there is a considerable boost from the knowledge graph component: 13.6% in and 17.3% in .
4.6.3 Human Evaluation
We conduct human evaluation on the factual correctness and informativeness of summaries. We randomly sample 100 articles from the test set of CNN/DailyMail. Then, each article and summary pair is labelled by 3 people from Amazon Mechanical Turk to evaluate the factual correctness and informativeness. Each labeller should give a score in each category between 1 and 3 (3 being perfect).
Here, factual correctness indicates whether the summary’s content is faithful with respect to the article; informativeness indicates how well the summary covers the salient information in the article.
As shown in Table 7, our model FASum achieves the highest factual correctness score, higher than UniLM and considerably higher than BottomUp. We conduct a statistical test and find that our model’s score is significantly better than that of UniLM, with a p-value smaller than 0.05 under a paired t-test. In terms of informativeness, our model is comparable with UniLM. Finally, without the knowledge graph, the ablated version of our model generates summaries that are both less factually correct and less informative.
Factual correctness is an important but often neglected criterion for abstractive text summarization. As most existing models and evaluation metrics focus on token-level accuracy, generated summaries often distort or misrepresent facts in the original text. In this paper, we extract factual information from the article and represent it in a knowledge graph. Via neural graph computation, we integrate the factual knowledge into the process of producing summaries. The resulting model, FASum, exhibits remarkable improvement in the ability to preserve facts during summarization. Both automatic and human evaluation show the effectiveness of our model.
For future work, we plan to integrate knowledge graphs into pre-trained models for better summarization. Moreover, we will combine the internally extracted knowledge graph with external knowledge graphs (e.g. ConceptNet) to enhance the commonsense capability of summarization.
- Angeli et al. (2015) Gabor Angeli, Melvin Jose Johnson Premkumar, and Christopher D Manning. 2015. Leveraging linguistic structure for open domain information extraction. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 344–354.
- Cao et al. (2018) Ziqiang Cao, Furu Wei, Wenjie Li, and Sujian Li. 2018. Faithful to the original: Fact aware neural abstractive summarization. In Thirty-Second AAAI Conference on Artificial Intelligence.
- Devlin et al. (2018a) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018a. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Devlin et al. (2018b) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018b. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
- Dong et al. (2019) Li Dong, Nan Yang, Wenhui Wang, Furu Wei, Xiaodong Liu, Yu Wang, Jianfeng Gao, Ming Zhou, and Hsiao-Wuen Hon. 2019. Unified language model pre-training for natural language understanding and generation. arXiv preprint arXiv:1905.03197.
- Falke et al. (2019) Tobias Falke, Leonardo FR Ribeiro, Prasetya Ajie Utama, Ido Dagan, and Iryna Gurevych. 2019. Ranking generated summaries by correctness: An interesting but challenging application for natural language inference. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2214–2220.
- Gehrmann et al. (2018) Sebastian Gehrmann, Yuntian Deng, and Alexander M Rush. 2018. Bottom-up abstractive summarization. arXiv preprint arXiv:1808.10792.
- Goodrich et al. (2019) Ben Goodrich, Vinay Rao, Peter J Liu, and Mohammad Saleh. 2019. Assessing the factual accuracy of generated text. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pages 166–175.
- Gunel et al. (2019) Beliz Gunel, Chenguang Zhu, Michael Zeng, and Xuedong Huang. 2019. Mind the facts: Knowledge-boosted coherent abstractive text summarization. In Proceedings of The Workshop on Knowledge Representation and Reasoning Meets Machine Learning in NIPS 2019.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. Advances in neural information processing systems, pages 1693–1701.
- Kingma and Ba (2014) Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
- Kryściński et al. (2019a) Wojciech Kryściński, Nitish Shirish Keskar, Bryan McCann, Caiming Xiong, and Richard Socher. 2019a. Neural text summarization: A critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 540–551, Hong Kong, China. Association for Computational Linguistics.
- Kryściński et al. (2019b) Wojciech Kryściński, Bryan McCann, Caiming Xiong, and Richard Socher. 2019b. Evaluating the factual consistency of abstractive text summarization. arXiv preprint arXiv:1910.12840.
- Kudo and Richardson (2018) Taku Kudo and John Richardson. 2018. Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. arXiv preprint arXiv:1808.06226.
- Levi (1942) Friedrich Wilhelm Levi. 1942. Finite geometrical systems.
- Li et al. (2018) Haoran Li, Junnan Zhu, Jiajun Zhang, and Chengqing Zong. 2018. Ensure the correctness of the summary: Incorporate entailment knowledge into abstractive sentence summarization. In Proceedings of the 27th International Conference on Computational Linguistics, pages 1430–1441.
- Lin (2004) Chin-Yew Lin. 2004. Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out.
- Liu and Lapata (2019) Yang Liu and Mirella Lapata. 2019. Text summarization with pretrained encoders. EMNLP.
- Narayan et al. (2018) Shashi Narayan, Shay B Cohen, and Mirella Lapata. 2018. Don’t give me the details, just the summary! topic-aware convolutional neural networks for extreme summarization. arXiv preprint arXiv:1808.08745.
- Nogueira and Cho (2019) Rodrigo Nogueira and Kyunghyun Cho. 2019. Passage re-ranking with bert. arXiv preprint arXiv:1901.04085.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- Speer et al. (2017) Robyn Speer, Joshua Chin, and Catherine Havasi. 2017. Conceptnet 5.5: An open multilingual graph of general knowledge. In Thirty-First AAAI Conference on Artificial Intelligence.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V Le. 2014. Sequence to sequence learning with neural networks. Advances in neural information processing systems, pages 3104–3112.
- Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, pages 5998–6008.
- Veličković et al. (2017) Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Lio, and Yoshua Bengio. 2017. Graph attention networks. arXiv preprint arXiv:1710.10903.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In Advances in neural information processing systems, pages 2692–2700.
- Zhang et al. (2019) Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D Manning, and Curtis P Langlotz. 2019. Optimizing the factual correctness of a summary: A study of summarizing radiology reports. arXiv preprint arXiv:1911.02541.
- Zhu et al. (2019) Chenguang Zhu, Ziyi Yang, Robert Gmyr, Michael Zeng, and Xuedong Huang. 2019. Make lead bias in your favor: A simple and effective method for news summarization. arXiv preprint arXiv:1912.11602.
Appendix A
A.1 Implementation details
During training, we use beam search with a width of 18 and 2 for CNN/DailyMail and XSum, respectively. The minimum summary length is 30 and 40 for CNN/DailyMail and XSum, respectively.
For hyperparameter search, we tried 4 layers with 4 heads, 6 layers with 6 heads and 10 layers with 8 heads. The beam search width ranges are [5, 20] and [1, 5] for CNN/DailyMail and XSum, respectively. The minimum summary length is chosen from [30, 35, 40, 45, 50, 55, 60] for both datasets.
There are 163.2M parameters in the model, and it takes 2 hours (CNN/DailyMail) or 0.5 hours (XSum) to train 1 epoch on 4 V100 GPUs. The batch size is set to 48 for both datasets.
On the validation sets, FASum achieves ROUGE-1 39.33%, ROUGE-2 17.70% and ROUGE-L 36.23% on CNN/DailyMail, and ROUGE-1 26.86%, ROUGE-2 8.82% and ROUGE-L 21.54% on XSum.
Table 8, Table 9 and Table 10 show examples of CNN/DailyMail articles and summaries generated by our model and several baseline systems. The factual errors in summaries are marked in red and the corresponding facts are marked in green in the article.
As shown, while baseline systems like BottomUp and UniLM achieve high ROUGE scores, they are susceptible to factual errors. For instance, in Article 2, BottomUp wrongly assigns the speaker to the concept “north star” and UniLM misinterprets the voting relationship: the voters are the readers, not the journalist Sutter. The ablated model FASum-KG also makes a number of errors, e.g. it mistakenly summarizes that the daughter of the deceased Geoffrey Lewis passed away at 79 in Article 3.
In comparison, our proposed fact-aware summarizer FASum could faithfully summarize the salient information in the article. And it can re-organize the phrasing instead of merely copying content from the article.
|Article 1||A fiery sunset greeted people in Washington Sunday. … The stunning sunsets were the result of raging wildfires in parts of Siberia. … The fires were started in southeastern Siberia, by farmers burning grass in their fields. But on April 14, it is believed that the flames quickly grew out of control because of strong winds and spread throughout the region, according to CNN affiliate KOMO-TV. As a result, the fires have destroyed dozens of villages in the region. …|
|Reference||Smoke from massive fires in Siberia created fiery sunsets in the Pacific Northwest. Atmospheric winds carried smoke from the wildfires across the Pacific Ocean. Smoke particles altered wavelengths from the sun, creating a more intense color.|
|BottomUp||The fires were started in southeastern Siberia on April 14. The fires have destroyed dozens of villages in the region. The smoke from Siberia grew out of control.|
|UniLM||The fiery sunsets were the result of raging wildfires in parts of Siberia. Strong winds picked up the smoke from the fires. As the smoke starts to dissipate, air quality will get bette|
|FASum||“The dramatic sunsets began showing up over the weekend,” Seattle native Tim Durkan says. The fires were started in southeastern Siberia, by farmers burning grass in their fields.|
|-KG||As a result, the flames grew out of control in Siberia. The sunsets were the result of raging wildfires in parts of siber|
|Article 2||We’re 2 degrees from a different world. Humans never have lived on a planet that’s 2 degrees Celsius (3.6 Fahrenheit) warmer than it was before we started burning fossil fuels, in the late 1800s, and climate experts say we risk fundamentally changing life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” said Gernot Wagner, the lead senior economist at the Environmental Defense Fund and co-author of the book “Climate Shock.” … I’ve heard it referred to as the “north star” for climate negotiations. … It’s so significant that it’s going to be the subject of my reporting for the rest of the year. I’m calling that effort “2 degrees,” and I need your help to make it work. Until 5 p.m. ET Monday, you can vote on the first story I’ll report for the series. Vote using the Facebook poll below (or go here if you don’t see it.) Tell CNN’s John Sutter which of these climate change stories you like best - and he’ll report on the winner…|
|Reference||Experts have raised red flags about the warming of planet by 2 degrees Celsius. John Sutter: This one little number is significant as a way to focus world’s attention on problem.|
|BottomUp||Climate experts say we risk life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” says “north star” for climate negotiations. Vote using Facebook poll below #2degrees.|
|UniLM||John Sutter’s “2 degrees” will be the subject of CNN’s next series on climate change. The number is a way to focus the world ’s attention, he says. “This is gambling with the planet,” says Gernot Wagner, the lead economist at the Environmental Defense Fund. Sutter will vote on the first story he ’ll report for the series|
|FASum||I’ve heard it referred to as the “north star” for climate negotiations. John Sutter: “climate change stories you like best” and he’ll report on the winner.|
|-KG||Gernot Wagner says we risk fundamentally changing life on our planet. “This is gambling with the planet,” says Wagner not Wagner, lead economist at the Environmental Defense Fund. “It’s so significant that it’s going to be the subject of my reporting for the rest of the year,” she says.|
|Article 3||Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died. He was 79. Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday, family friend Michael Henderson said. No other details were immediately available. Lewis began his long association with Eastwood in “High Plains Drifter” (1973). He also appeared with the actor in “Thunderbolt and Lightfoot” (1974) …|
|Reference||Geoffrey Lewis appeared in many movies, TV shows. Actor was frequently collaborator with Clint Eastwood. Actress Juliette Lewis, his daughter, called him “my hero”|
|BottomUp||Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose” Lewis scored a Golden Globe nomination for playing bartender Earl Tucker. He also appeared with the actor in “Thunderbolt and Lightfoot” “Flo,” “Pink Cadillac” and “Midnight in the Evil ” The actor also stood out as a gravedigger turned vampire in the 1979 Tobe “Salem’s Lot,” a 1979 actor. Lewis was the father of Oscar-nominated actress Juliette Lewis.|
|UniLM||Geoffrey Lewis was a prolific character actor. He played Orville Boggs in “Every Which Way But Loose” Lewis was the father of Oscar-nominated actress Juliette Lewis.|
|FASum||It’s the father of Oscar-nominated actress Juliette Lewis. Lewis began his long association with Eastwood in “High Plains Drifter” he also appeared with the actor in “Thunderbolt and Lightfoot”|
|-KG||Oscar-nominated actress Juliette Lewis dies at 79. Lewis starred in “High Plains Drifter” he also appeared in “Bronco Billy” and “Midnight in the Garden of Good and Evil”|
|Article 4||… This Sunday in Rome, Pope Francis faced just such a dilemma. First, the back story: One hundred years ago, more than 1 million Armenians (some estimates run as high as 1.5 million) died at the hand of the Turks. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide.” (A U.S. resolution to do the same has languished in Congress.) Turkish officials disagree… John Paul II used the “g” word in 2001, but didn’t dare speak it out loud…|
|Reference||Previous popes had finessed the question of whether the killing of 1.5 million Armenians was genocide. Because he often shines such a smiley face on the world, it can be easy to forget the bluntness Francis sometimes brings to the bully pulpit.|
|BottomUp||More than 1 million Armenians died at St. John Paul II ’s Turks. Turkish officials say the killings were a “genocide” the Vatican says it’s not the pope, but he says he has no plans to use it. He says the church’s moral and political woes have improved.|
|UniLM||More than 1 million Armenians died at the hands of the Turks one hundred years ago. Frida Ghitis: Pope Francis faced just such a dilemma this Sunday in Rome. She says previous popes had to make a hard choice: Adopt the sharp tongue of a prophet or the discretion of a diplomat?|
|FASum||As scholars like to say, the Vatican has walked the line between spiritual and worldly concerns for centuries. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide”|
|-KG||One hundred years ago, more than 1 million Armenians died at the hand of the Turks. A slew of historians and at least 20 countries call the killings a “genocide”. John Paul II used the “g” word in 2001.|
|Article 5||Can you imagine paying $1,000 a month in rent to live in a one-car garage? Nicole, a 30-year-old woman, doesn’t have to imagine this scenario because it’s her everyday reality… CNN recently published a powerful piece called “Poor kids of Silicon Valley” that documents the affordable housing challenges facing families in the Bay Area… The agency I lead, the U.S. Department of Housing and Urban Development, recently released a report estimating that 7.7 million low-income households live in substandard housing, spend more than half their incomes on rent or both… HUD created the Rental Assistance Demonstration initiative to bring private investment into the fold for the public good… RAD has allowed local communities to raise more than $733 million in new capital to date…|
|Reference||CNN’s John Sutter told the story of the “Poor kids of Silicon Valley” HUD Secretary Julian Castro: Our shortage of affordable housing is a national crisis that stunts the economy.|
|BottomUp||“Poor kids of Silicon Valley ” spend $1,000 a month in rent to live in San Mateo, California. Nicole, a woman, has a house that is home to 16 people and 11 children. Silicon Valley is in an affordable housing crisis.|
|UniLM||Writers: Our entire nation is in the midst of an affordable housing crisis. Writers: 7.7 million low-income households live in substandard housing , spend more than half their incomes on rent or both. Writers: HUD has allowed local communities to raise more than $733 million in new capital. Writers: No American should ever have to wait six decades to have a decent|
|FASum||CNN recently published a powerful piece called “Poor kids of Silicon Valley” the affordable housing housing crisis agency says 7.7 million low-income households live in substandard housing.|
|-KG||“Poor kids of Silicon Valley” documents affordable housing challenges families. U.S. agency estimating 7.7 million low-income households live in substandard housing.|
|Article 6||An American teenager who helped her boyfriend stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali has been sentenced to 10 years in prison. Heather Mack, 19, who gave birth to her own daughter just weeks ago, was found guilty with her 21-year-old boyfriend, Tommy Schaefer, of killing Sheila von Wiese-Mack on the Indonesian island last August. Schaefer was sentenced to 18 years in prison for battering von Wiese-Mack to death in room 317 of the St. Regis Bali Resort. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship…|
|BottomUp||Heather Mack, 19, gave birth to her own daughter just weeks ago. Schaefer was found guilty with her boyfriend, Tommy Schaefer, on the Indonesian island. Schaefer was sentenced to 18 years in prison.|
|UniLM||Tommy Schaefer, 21, and Heather Mack, 19, were found guilty of killing Sheila von Wiese-Mack in August. The American helped her boyfriend stuff her mother’s body into a suitcase at an upmarket hotel in Bali. Schaafer had claimed he killed his girlfriend’s mother in self-defense after an argument over their relationship. Mack told the court her mother had threatened to kill|
|FASum||Heather Mack, 19, helped boyfriend Tommy Schaefer stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship.|
|-KG||Heather Mack, 19, was found guilty of murdering her boyfriend Tommy Schaefer. Schaefer claimed von Wiese-Mack killed his girlfriend’s mother in self-defense. Von wi|