Boosting Factual Correctness of Abstractive Summarization with Knowledge Graph

03/19/2020 ∙ by Chenguang Zhu, et al. ∙ University of Notre Dame

A commonly observed problem with abstractive summarization is the distortion or fabrication of factual information from the article. This inconsistency between the summary and the original text has raised various concerns over the applicability of abstractive summarization. In this paper, we propose to boost the factual correctness of summaries via the fusion of knowledge, i.e. factual relations extracted from the article. We present a Fact-Aware Summarization model, FASum. In this model, the knowledge information is organically integrated into the summary generation process via neural graph computation, which effectively improves factual correctness. Empirical results show that FASum generates summaries with significantly higher factual correctness compared with state-of-the-art abstractive summarization systems, both under an independently trained factual correctness evaluator and under human evaluation. For example, on the CNN/DailyMail dataset, FASum obtains a 1.2% higher factual correctness score than UniLM and a 4.5% higher score than BottomUp.


1 Introduction

Text summarization models aim to produce an abridged version of long text while preserving salient information. Abstractive summarization is a type of such models that can freely generate summaries, with no constraint on the words or phrases used. This format is closer to human-edited summaries and is both flexible and informative. Thus, there is a large body of literature on producing abstractive summaries (See et al., 2017; Paulus et al., 2017; Dong et al., 2019; Gehrmann et al., 2018).

However, one prominent issue with abstractive summarization is factual inconsistency. It refers to the phenomenon that the summary sometimes distorts or fabricates the facts in the article. Recent studies show that up to 30% of the summaries generated by abstractive models contain such factual inconsistencies (Kryściński et al., 2019b; Falke et al., 2019). This brings serious problems to the credibility and usability of abstractive summarization systems. Table 1 shows an example article and excerpts of generated summaries. As shown, the article mentions that the actor Geoffrey Lewis began his association with Eastwood in “High Plains Drifter”, while the summary from BottomUp (Gehrmann et al., 2018) picks the wrong movie title from another place in the article. Also, the article says that the deceased Geoffrey Lewis is the father of Juliette Lewis. However, an ablated version of our model that does not leverage the factual knowledge graph of the article, denoted by -KG, wrongly expresses that Juliette passed away. In comparison, our model FASum generates a summary that correctly exhibits the relationship stated in the article.

 

Article Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died… Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday. Lewis began his long association with Eastwood in “High Plains Drifter”
BottomUp Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose”.
FASum It’s the father of Oscar-nominated actress Juliette Lewis.
  -KG Oscar-nominated actress Juliette Lewis dies at 79.

 

Table 1: Example article and summary excerpts from CNN/DailyMail dataset with summaries generated by BottomUp and our model FASum. -KG indicates our model without the knowledge graph component. Factual errors in summaries are marked in red and the corresponding facts in the article are marked in green.

On the other hand, most existing abstractive summarization models apply a conditional language model that focuses on the token-level accuracy of summaries, while neglecting semantic-level consistency between the summary and the article. Therefore, the generated summaries often score high on token-level metrics like ROUGE (Lin, 2004) but lack factual correctness. In view of this, we argue that a robust abstractive summarization system must be equipped with factual knowledge during the comprehension of the article and the generation of the summary.

In this paper, we represent facts in the form of a knowledge graph. Although there are numerous efforts in building commonly applicable knowledge networks to facilitate knowledge extraction and integration, such as ConceptNet (Speer et al., 2017) and WikiData, we find that these tools are more useful for conferring commonsense knowledge. In abstractive summarization for contents like news articles, many entities and relations are previously unseen. Moreover, our goal is to produce summaries that have no or very few conflicts with the facts in the article. Thus, we propose to extract factual knowledge from within the article.

Therefore, we employ the information extraction (IE) tool OpenIE (Angeli et al., 2015) to extract facts from the article in the form of relational tuples: (subject, relation, object). We then use the Levi transformation (Levi, 1942) to convert each tuple component into a node in the knowledge graph. This graph contains the facts in the article and is integrated in the summary generation process.

Then, we use graph attention network (Veličković et al., 2017) to obtain the representation of each node, and fuse that into a transformer-based encoder-decoder architecture. The decoder conducts attention to each graph node in every transformer block. Finally, it leverages the copy-generate mechanism to selectively produce contents either from the dictionary or from the graph entities. We denote this model as a Fact-Aware Summarization model, FASum.

In the experiments on benchmark summarization datasets, the FASum model shows great improvements on factual correctness of generated summaries. Using an independently trained BERT-based factual correctness evaluator (Kryściński et al., 2019b), we find that on CNN/DailyMail dataset, FASum obtains 1.2% higher fact correctness scores than UniLM (Dong et al., 2019) and 4.5% higher than BottomUp (Gehrmann et al., 2018). Human evaluation further corroborates the effectiveness of our models.

We further propose an easy-to-compute metric, matched relation tuples, and show that FASum obtains a high score compared with other baselines. Via the ratio of novel n-grams, we show that FASum is a highly abstractive summarization model.

In summary, our contribution in this paper is three-fold:

  1. We extract factual information from the article in the form of a relational knowledge graph.

  2. We conduct neural graph computation on the knowledge graph and integrate its information into an end-to-end process of summary generation.

  3. We propose a simple-to-use metric, matched relation tuples, to evaluate factual correctness in abstractive summarization.

2 Related Work

2.1 Abstractive Summarization

Abstractive text summarization has been intensively studied in recent literature. Most models employ the encoder-decoder (seq2seq) architecture (Sutskever et al., 2014). Rush et al. (2015) introduces an attention-based seq2seq model for abstractive sentence summarization. See et al. (2017) uses a copy-generate mechanism that can both produce words from the vocabulary via a generator and copy words from the article via a pointer. Paulus et al. (2017) leverages reinforcement learning to improve summarization quality.

Gehrmann et al. (2018) uses a content selector to over-determine phrases in source documents that helps constrain the model to likely phrases. Zhu et al. (2019) defines a pretraining scheme for summarization and produces a zero-shot abstractive summarization model. Dong et al. (2019) employs different masking techniques for both classification and generation tasks in NLP. The resulting pretrained model UniLM achieves state-of-the-art results on various tasks including abstractive summarization.

2.2 Fact-Aware Summarization

Since the task of judging whether a summary is consistent with an article is similar to the text entailment problem, several recent works use text entailment models to evaluate and boost the factual correctness of summarization. Li et al. (2018) co-trains summarization and entailment for the encoder and employs an entailment-aware decoder via Reward Augmented Maximum Likelihood (RAML). Falke et al. (2019) proposes using off-the-shelf entailment models to rerank candidate summary sentences to boost factual correctness.

Apart from using entailment models for factual correctness, Zhang et al. (2019) approaches the factual correctness problem in the medical domain, where the space of facts is limited and can be depicted with a descriptor vector. The proposed model utilizes information extraction and reinforcement learning and achieves good results in medical summarization.

Cao et al. (2018) extracts relational information from the article and maps it to a sequence as input to the encoder. The decoder attends to both the article tokens and the relations. Gunel et al. (2019) employs an entity-aware transformer structure to boost the factual correctness of abstractive summarization, where the entities come from the Wikidata knowledge graph. In comparison, our model utilizes the knowledge graph extracted from the article and fuses it into the generation process via neural graph computation.

As previous work mostly uses human evaluation for factual correctness (Kryściński et al., 2019a) and text entailment models suffer from the domain shift problem (Falke et al., 2019), several papers put forward automatic fact-checking metrics better aligned with human judgement. Goodrich et al. (2019) proposes using information extraction to obtain the relation tuples in the ground-truth summary and the candidate summary, then computing the factual accuracy as the ratio of overlapping tuples. Kryściński et al. (2019b) frames factual correctness as a binary classification problem: a summary is either consistent or inconsistent with the article. It applies positive and negative transformations to the summary to produce training data for a BERT-based classification model. The resulting evaluator, FactCC, significantly outperforms text entailment models and shows high correlation with human metrics. In this paper, we use FactCC as the factual evaluator for our model.

3 Model

3.1 Problem Formulation

We formalize abstractive summarization as a supervised seq2seq problem. The input consists of pairs of articles and summaries: {(X_1, Y_1), ..., (X_N, Y_N)}. Each article is tokenized into X_i = (x_1, ..., x_m) and each summary is tokenized into Y_i = (y_1, ..., y_n). In abstractive summarization, the model-generated summary can contain tokens, phrases and sentences not present in the article. For simplicity, in what follows we drop the data index subscript. Therefore, each training pair becomes (X, Y), and the model needs to generate an abstractive summary Ŷ for the article X.

Figure 1: The model architecture of FASum. It has multiple layers of transformer blocks in both the encoder and decoder. The knowledge graph is obtained from information extraction results and it participates in the decoder’s attention.

3.2 Factual Correctness Evaluator

We leverage the FactCC evaluator (Kryściński et al., 2019b), which casts correctness evaluation as a binary classification problem, namely finding a function f(A, C), where A is an article and C is a summary sentence, referred to as a claim. f(A, C) represents the probability that C is factually correct with respect to the article A. If a summary S is composed of multiple sentences s_1, ..., s_k, we define the factual score of S as:

F(A, S) = (1/k) Σ_{i=1}^{k} f(A, s_i)    (1)

 

Model Incorrect
Random 50.0%
InferSent (Falke et al., 2019) 41.3%
SSE (Falke et al., 2019) 37.3%
BERT (Falke et al., 2019) 35.9%
ESIM (Falke et al., 2019) 32.4%
FactCC (Kryściński et al., 2019b) 30.0%
FactCC (our version) 26.8%

 

Table 2: Percentage of incorrectly ordered sentence pairs using different consistency prediction models in CNN/DailyMail dataset. The test data is the same as in Falke et al. (2019).

To generate training data, multiple positive and negative forms of transformation are applied to each summary sentence as well as some randomly selected sentences from the article.

To augment the number of entailed examples, we adopt backtranslation as a paraphrasing tool. The original summary is translated into an intermediate language (French, German, Chinese, Spanish or Russian) and then translated back to English. In this way, the generated claim is semantically equivalent to the original one but modified with minor syntactic and lexical changes. Thus, it is used as a positive training example.

A common failure mode for abstractive summarization models is to include incorrect entities from the article. To predict the occurrence of these types of errors, we generate summary examples that are not entailed by the article by swapping entities in the ground-truth summary sentences. We use spaCy to find all entities in the article and summary, and then randomly swap one entity in the summary with a different entity of the same type from the article, e.g. person, location or organization. The generated claim is factually incorrect and is thus used as a negative training example.
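
This entity-swap transformation can be sketched as follows. The snippet below is a minimal illustration using spaCy's named entity recognizer; the chosen label set, the pipeline name and the sampling details are assumptions rather than the authors' exact implementation.

    import random
    from typing import Optional

    import spacy

    nlp = spacy.load("en_core_web_sm")  # small English pipeline with an NER component
    SWAPPABLE = {"PERSON", "GPE", "LOC", "ORG"}  # person, location and organization labels

    def make_negative_claim(article: str, claim: str) -> Optional[str]:
        """Swap one entity in the claim with a same-type entity from the article."""
        art_ents = [e for e in nlp(article).ents if e.label_ in SWAPPABLE]
        claim_ents = [e for e in nlp(claim).ents if e.label_ in SWAPPABLE]
        random.shuffle(claim_ents)
        for ent in claim_ents:
            candidates = [a for a in art_ents
                          if a.label_ == ent.label_ and a.text != ent.text]
            if candidates:
                repl = random.choice(candidates)
                # Replace the entity span to obtain a factually incorrect claim.
                return claim[:ent.start_char] + repl.text + claim[ent.end_char:]
        return None  # no swappable entity found; skip this claim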

Using these methods, we generate a training set from CNN/DailyMail’s training data. The training set contains 20M claims, with an equal number of positive and negative instances. The development set is similarly obtained from the validation set and has a size of 2,900 claims after sampling.

Following Kryściński et al. (2019b), we finetune the BERT-Base model (Devlin et al., 2018a) to evaluate factual correctness. We concatenate the article and the generated claim, together with the special tokens [CLS] and [SEP]: [CLS] Article [SEP] Claim. A linear layer and a softmax operation are employed to convert the BERT embedding of [CLS] into the probability that the claim is entailed by the article content.
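
A minimal sketch of this scoring setup with the HuggingFace transformers library is shown below. The checkpoint name and the assumption that label index 1 means "consistent" are illustrative rather than the released configuration; the second function corresponds to the summary-level score of Equation (1), averaging over sentences.

    import torch
    from transformers import BertForSequenceClassification, BertTokenizer

    tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
    model.eval()

    def claim_probability(article: str, claim: str) -> float:
        """Probability that the claim is entailed by the article, read off the [CLS] token."""
        # Encodes the pair as "[CLS] article [SEP] claim [SEP]".
        inputs = tokenizer(article, claim, truncation=True, max_length=512,
                           return_tensors="pt")
        with torch.no_grad():
            logits = model(**inputs).logits
        return torch.softmax(logits, dim=-1)[0, 1].item()  # assume index 1 = consistent

    def factual_score(article: str, summary_sentences):
        """Equation (1): average per-sentence consistency probability."""
        probs = [claim_probability(article, s) for s in summary_sentences]
        return sum(probs) / len(probs)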

The hyperparameters were tuned on the development set, with the best setup using a learning rate of 1e-5 and a batch size of 24. As shown in Table 2, our reproduced model achieves better accuracy than that reported in Kryściński et al. (2019b) on the human-labelled sentence-pair ordering data from Falke et al. (2019). Thus, we use this evaluator for all factual correctness assessments in what follows. (We use the same setting to train another evaluator for the XSum dataset.)

3.3 Fact-Aware Summarizer

We propose the Fact-Aware abstractive Summarizer, FASum. It utilizes the seq2seq architecture built upon transformers (Vaswani et al., 2017). In detail, the encoder produces contextualized embeddings of the article and the decoder attends to the encoder’s output to generate the summary.

To make the summarization model fact-aware, we extract, represent and integrate knowledge from the source article into the summary generation process, as described below. The overall architecture of FASum is shown in Figure 1.

3.3.1 Knowledge Extraction

To extract important entity-relation information from the article, we employ the Stanford OpenIE tool (Angeli et al., 2015). The extracted knowledge is a list of tuples. Each tuple contains a subject (S), a relation (R) and an object (O), each being a segment of text from the article. For example, for the sentence “Born in a town, she took the midnight train”, the extracted subject is “she”, the relation is “took”, and the object is “midnight train”. In the experiments, there are on average 165.4 tuples extracted per article in the CNN/DailyMail dataset (Hermann et al., 2015) and 84.5 tuples in the XSum dataset (Narayan et al., 2018).
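
The extraction step can be sketched as below using the Stanza client for Stanford CoreNLP, which exposes the OpenIE annotator; a local CoreNLP installation is required, and the protobuf field names shown are how the client surfaces the triples, not a detail of our model.

    from stanza.server import CoreNLPClient  # requires Stanford CoreNLP (CORENLP_HOME set)

    def extract_triples(article_text):
        """Return (subject, relation, object) tuples produced by Stanford OpenIE."""
        triples = []
        with CoreNLPClient(annotators=["openie"], be_quiet=True) as client:
            ann = client.annotate(article_text)
            for sentence in ann.sentence:             # one entry per sentence
                for triple in sentence.openieTriple:  # OpenIE relation triples
                    triples.append((triple.subject, triple.relation, triple.object))
        return triples

    # For the example sentence above, one plausible extraction is ("she", "took", "midnight train").
    print(extract_triples("Born in a town, she took the midnight train."))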

3.3.2 Knowledge Representation

We construct a knowledge graph to represent the information extracted by OpenIE. We apply the Levi transformation (Levi, 1942) to treat each entity and relation equally. In detail, for a tuple (s, r, o), we create nodes v_s, v_r and v_o, and add the edges (v_s, v_r) and (v_r, v_o). In this way, we obtain an undirected knowledge graph G = (V, E), where each node i ∈ V is associated with a text segment t_i.
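
The Levi transformation over the extracted tuples can be sketched with networkx as follows; treating every tuple's relation as its own node while sharing entity nodes across tuples is an illustrative choice, not a detail given in the paper.

    import networkx as nx

    def build_levi_graph(triples):
        """Turn (subject, relation, object) tuples into an undirected Levi graph."""
        graph = nx.Graph()
        for idx, (subj, rel, obj) in enumerate(triples):
            s_node = ("ent", subj)      # entity nodes are shared across tuples
            o_node = ("ent", obj)
            r_node = ("rel", rel, idx)  # one relation node per tuple occurrence (assumption)
            for node, text in [(s_node, subj), (r_node, rel), (o_node, obj)]:
                graph.add_node(node, text=text)
            graph.add_edge(s_node, r_node)  # edge subject -- relation
            graph.add_edge(r_node, o_node)  # edge relation -- object
        return graph

    g = build_levi_graph([("she", "took", "midnight train")])
    print(g.number_of_nodes(), g.number_of_edges())  # 3 nodes, 2 edges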

We then employ a graph attention network (Veličković et al., 2017) to obtain embeddings for each node. We use a bidirectional LSTM over the text t_i of node i to generate its initial embedding:

v_i^{(0)} = BiLSTM(t_i)_{L_i}    (2)

where the index L_i indicates the final state of the RNN.

Then, in each round, every node updates its embedding using attention over itself and all of its neighbors N(i):

e_{ij} = LeakyReLU( a^T [W v_i ; W v_j] )    (3)
α_{ij} = exp(e_{ij}) / Σ_{u ∈ N(i) ∪ {i}} exp(e_{iu})    (4)
v_i ← σ( Σ_{j ∈ N(i) ∪ {i}} α_{ij} W v_j )    (5)

where W and a are a parameterized matrix and vector, respectively. In the experiments, the number of update rounds is a fixed hyperparameter.
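
A minimal PyTorch sketch of this node encoder is given below: a BiLSTM over each node's tokens provides the initial embedding (Eq. 2) and single-head graph attention rounds update it (Eqs. 3-5). The dimensions, the single attention head and the in-place update over rounds are assumptions made for illustration.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class GraphEncoder(nn.Module):
        def __init__(self, vocab_size, emb_dim=128, node_dim=50, rounds=2):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, emb_dim)
            # BiLSTM over a node's text; its final states form the initial node embedding (Eq. 2).
            self.lstm = nn.LSTM(emb_dim, node_dim // 2, batch_first=True, bidirectional=True)
            self.W = nn.Linear(node_dim, node_dim, bias=False)  # shared projection W
            self.a = nn.Linear(2 * node_dim, 1, bias=False)     # attention vector a
            self.rounds = rounds

        def init_node(self, token_ids):
            # token_ids: (1, length) LongTensor holding one node's text tokens.
            _, (h, _) = self.lstm(self.embed(token_ids))
            return torch.cat([h[0], h[1]], dim=-1).squeeze(0)   # concat final fwd/bwd states

        def forward(self, node_token_ids, neighbors):
            # neighbors[i] lists the node indices adjacent to node i (self-loop added below).
            v = torch.stack([self.init_node(t) for t in node_token_ids])  # (N, node_dim)
            for _ in range(self.rounds):
                Wv = self.W(v)
                updated = []
                for i, nbrs in enumerate(neighbors):
                    idx = torch.tensor(nbrs + [i])                    # neighbors and self
                    pair = torch.cat([Wv[i].expand(len(idx), -1), Wv[idx]], dim=-1)
                    e = F.leaky_relu(self.a(pair)).squeeze(-1)        # Eq. (3)
                    alpha = torch.softmax(e, dim=0)                   # Eq. (4)
                    updated.append(torch.sigmoid(alpha @ Wv[idx]))    # Eq. (5)
                v = torch.stack(updated)
            return v  # final node embeddings, later attended to by the decoder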

3.3.3 Knowledge Integration

The knowledge graph embeddings are computed in parallel with the encoder. During decoding, we conduct attention over the knowledge graph nodes in each transformer block. Suppose the input embedding to a decoder transformer block is E, and the embedded article from the encoder is H. We first apply self-attention on E and inter-attention between E and H, each followed by layer normalization with a residual connection (denoted by LR). This leads to the intermediate embeddings Ẽ = (ẽ_1, ..., ẽ_n).

Next, to capture information from the knowledge graph, we conduct inter-attention between Ẽ and the node embeddings V = (v_1, ..., v_{|V|}):

β_{ij} = softmax_j( (W_Q ẽ_i)^T (W_K v_j) )    (6)
c_i = Σ_j β_{ij} W_V v_j    (7)
ĉ_i = λ_l c_i    (8)

Notice that we multiply the attended vector c_i by a parameter λ_l. This is due to the different numerical magnitudes of the encoder embeddings and the knowledge graph embeddings. For the l-th transformer block, we train one multiplier scalar λ_l.

The results are passed through an LR layer, an FFN layer and another LR layer to obtain the final embeddings E^out:

ē_i = LayerNorm( ẽ_i + ĉ_i )    (9)
e_i^out = LayerNorm( ē_i + FFN(ē_i) )    (10)
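
The graph-attention step inside one decoder block can be sketched in PyTorch as below; the use of standard multi-head attention, the placement of layer normalization and the projection from the graph encoder's node dimension to the model dimension are assumptions, while the learned per-block scalar multiplier on the attended graph vectors follows the description above.

    import torch
    import torch.nn as nn

    class GraphFusion(nn.Module):
        """Fusion of knowledge-graph node embeddings inside a decoder block (Eqs. 6-10, sketched)."""
        def __init__(self, d_model=720, d_node=50, n_heads=8, d_ff=2048, lambda_init=1.0):
            super().__init__()
            self.node_proj = nn.Linear(d_node, d_model)  # lift GAT outputs to model size (assumption)
            self.graph_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.lam = nn.Parameter(torch.tensor(lambda_init))  # per-block multiplier lambda_l
            self.norm1 = nn.LayerNorm(d_model)
            self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
            self.norm2 = nn.LayerNorm(d_model)

        def forward(self, e_mid, node_emb):
            # e_mid:    (batch, tgt_len, d_model) intermediate decoder states (after the usual
            #           self-attention and encoder attention, which are omitted here)
            # node_emb: (batch, n_nodes, d_node)  final node embeddings from the graph encoder
            nodes = self.node_proj(node_emb)
            attended, _ = self.graph_attn(e_mid, nodes, nodes)  # Eqs. (6)-(7)
            x = self.norm1(e_mid + self.lam * attended)         # Eqs. (8)-(9): scale, residual, LR
            return self.norm2(x + self.ffn(x))                  # Eq. (10): FFN + LR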

3.3.4 Summary Generation

To produce the next token y_t, we leverage the copy-generate mechanism (Vinyals et al., 2015). First, a copy probability p_copy is computed from the decoder output u_t of the last transformer block:

p_copy = σ( w_c^T u_t + b_c )    (11)

The final next-token probability distribution is:

P(y_t) = (1 − p_copy) softmax(W_emb u_t) + p_copy Σ_{j : t_j = y_t} β_{tj}    (12)

where u_t is the output of the last transformer block at step t, β_{tj} is its attention weight over graph node j (cf. Eq. 6), t_j is the text of node j, and W_emb is the embedding matrix shared by both the encoder and decoder.

During training, we use cross entropy as the loss function:

L = − Σ_{t=1}^{n} q_t^T log P(y_t | y_{<t}, X; θ)    (13)

where q_t is the one-hot vector for the t-th ground-truth token, and θ represents the parameters in the network.
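
A sketch of this mixing step is shown below; mapping each graph node to a single vocabulary id is a simplification of copying graph entities, introduced only for illustration.

    import torch
    import torch.nn as nn

    class CopyGenerate(nn.Module):
        """Mix a vocabulary distribution with a copy distribution over graph nodes (Eqs. 11-12)."""
        def __init__(self, d_model, tied_embedding: nn.Embedding):
            super().__init__()
            self.copy_gate = nn.Linear(d_model, 1)  # produces p_copy from the decoder state (Eq. 11)
            self.embedding = tied_embedding         # embedding matrix shared with encoder/decoder

        def forward(self, dec_state, graph_attn, node_token_ids):
            # dec_state:      (batch, d_model)  last-block decoder output at the current step
            # graph_attn:     (batch, n_nodes)  attention weights over graph nodes (Eq. 6)
            # node_token_ids: (batch, n_nodes)  vocabulary id standing in for each node (simplified)
            p_copy = torch.sigmoid(self.copy_gate(dec_state))                 # (batch, 1)
            p_gen = torch.softmax(dec_state @ self.embedding.weight.t(), -1)  # generate from vocab
            p_copied = torch.zeros_like(p_gen).scatter_add_(1, node_token_ids, graph_attn)
            return (1 - p_copy) * p_gen + p_copy * p_copied                   # Eq. (12)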

4 Experiments

4.1 Datasets

We evaluate our model on benchmark summarization datasets CNN/DailyMail Hermann et al. (2015) and XSum Narayan et al. (2018). They contain 312K and 227K news articles and human-edited summaries respectively, covering different topics and various summarization styles. We show the details of the datasets in Table 3, adapted from Liu and Lapata (2019).

Dataset: # docs (train/val/test); avg. doc len. (words / sentences); avg. summary len. (words / sentences); % novel bi-grams in ref. summary
CNN: 90,266/1,220/1,093; 760.50 / 33.98; 45.70 / 3.59; 52.90
DailyMail: 196,961/12,148/10,397; 653.33 / 29.33; 54.65 / 3.86; 52.16
XSum: 204,045/11,332/11,334; 431.1 / 19.8; 23.3 / 1.00; 83.3

Table 3: Comparison of summarization datasets in the experiments: size of training, validation, and test sets and average document and summary length, adapted from Liu and Lapata (2019). The percentage of novel bi-grams that are not in the article but are in the gold summaries measures the abstractiveness of the dataset.

4.2 Implementation Details

We use the subword tokenizer SentencePiece (Kudo and Richardson, 2018). The dictionary is shared across all the datasets. The vocabulary has a size of 32K and an embedding dimension of 720. Both the encoder and the decoder have 10 layers with 8 attention heads each. The LSTM in the knowledge graph module has a hidden state of size 128, and the graph attention network has a hidden state of size 50. The dropout rate is 0.1. The knowledge graph attention vector multiplier λ_l is initialized to a fixed constant. Teacher forcing is used in training. We use Adam (Kingma and Ba, 2014) as the optimizer with a learning rate of 2e-4. More details are presented in the Appendix.

4.3 Metrics

For factual correctness, we leverage the FactCC evaluator (Kryściński et al., 2019b) trained on each dataset. All evaluator models are finetuned starting from the BERT-Large model (Devlin et al., 2018b). The training is independent of the summarizer, so no parameters are shared.

We also employ the standard ROUGE-1, ROUGE-2 and ROUGE-L metrics (Lin, 2004) to measure summary quality. These three metrics evaluate the overlap of unigrams, bigrams and the longest common subsequence, respectively. We report the F1 ROUGE scores in all experiments.
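
For reference, these F1 ROUGE scores can be computed with the rouge-score package as sketched below; this is a convenience illustration and not necessarily the exact toolkit configuration behind the reported numbers.

    from rouge_score import rouge_scorer

    scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

    def rouge_f1(reference: str, candidate: str):
        """Return the F1 values of ROUGE-1, ROUGE-2 and ROUGE-L for one summary pair."""
        scores = scorer.score(reference, candidate)
        return {name: s.fmeasure for name, s in scores.items()}

    print(rouge_f1("the cat sat on the mat", "a cat was sitting on the mat"))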

Note that during training, FASum chooses the model with the highest ROUGE-L score on the validation set. Therefore, it does not have access to the evaluator during training and validation.

4.4 Baseline

The following abstractive summarization models are selected as baseline systems. TConvS2S (Narayan et al., 2018) is based on convolutional neural networks. BottomUp (Gehrmann et al., 2018) uses a bottom-up approach to generate summaries. UniLM (Dong et al., 2019) utilizes large-scale pretraining to produce state-of-the-art abstractive summaries. We obtain the prediction results of these baseline systems from their open-source repositories, or train the models when predictions are not available.

4.5 Results

As shown in Table 4, our model FASum outperforms all baseline systems in factual correctness score on CNN/DailyMail and is only behind UniLM on XSum. On CNN/DailyMail, FASum is 1.2% higher than UniLM and 4.5% higher than BottomUp in factual correctness score. We conduct statistical tests and show that the lead is statistically significant with a p-value smaller than 0.05. Meanwhile, removing the knowledge graph component from FASum leads to a clear drop in factual score on both datasets, as indicated by -KG.

It is worth noting that the ROUGE metric does not always reflect factual correctness, sometimes even showing a reverse relationship, as observed in Kryściński et al. (2019a); Nogueira and Cho (2019). For instance, although BottomUp scores 2.42 ROUGE-1 points higher than FASum on CNN/DailyMail, its summaries contain many factual errors, as shown in the human evaluation and the Appendix. Therefore, we argue that factual correctness should be explicitly handled both in summary generation and in evaluation.

Ablation Study. To evaluate the effectiveness of our proposed modules, we removed the attention vector multiplier, then the Levi transformation and finally the knowledge graph module. The results are reported in Table 5. As shown, each module plays a significant role in boosting the factual correctness score, but not necessarily the ROUGE metrics. The multiplier contributes 2.1 points in factual score, and the knowledge graph module as a whole boosts the factual score by as much as 3.7 points.

 

Model Fact Score (×100) ROUGE-1 ROUGE-2 ROUGE-L

 

CNN/DailyMail
BottomUp 83.9 41.22 18.68 38.34
UniLM 87.2 43.33 20.21 40.51
FASum 88.4 38.80 17.23 35.70
    -KG 84.7 38.35 16.75 35.47

 

XSum
BottomUp 71.1 25.33 6.32 19.42
TConvS2S 79.8 31.89 11.54 25.75
UniLM 83.6 39.04 16.71 31.42
FASum 81.4 28.60 8.97 22.80
    -KG 80.4 29.23 9.17 23.21

 

Table 4: Factual correctness score and ROUGE scores on the CNN/DailyMail and XSum test sets. * means that the result is statistically significant with p-value smaller than 0.05, compared with the second largest score. -KG means the ablated version of our model without the knowledge graph component.

 

Model Fact R1 R2 RL
FASum 88.4 38.80 17.23 35.70
    -multiplier 86.3 38.85 17.13 35.80
    -Levi 85.5 38.38 16.81 35.39
    -KG 84.7 38.35 16.75 35.47

 

Table 5: Ablation study on CNN/DailyMail test set. The knowledge graph attention multiplier, Levi transformation and the knowledge graph module are sequentially removed.

Figure 2: Percentage of novel n-grams for summaries generated by different models in XSum test set. An n-gram is novel if it does not appear in the article.

4.6 Insights

4.6.1 Novel n-grams

To measure the abstractiveness of our model, we compute the ratio of novel n-grams in summaries, i.e. n-grams that do not appear in the article, on the XSum test set, following See et al. (2017). As shown in Figure 2, among BottomUp, UniLM and the ablated version of our model (-KG), FASum achieves the novel n-gram ratios closest to those of the reference summaries. The novel 1-gram ratio is almost 40%. Furthermore, our model obtains a slightly higher novelty percentage than the reference summaries for larger n. Therefore, FASum can produce highly abstractive summaries while ensuring factual correctness.
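
The novel n-gram ratio in Figure 2 can be computed as sketched below; whitespace tokenization and lowercasing are simplifying assumptions.

    def novel_ngram_ratio(article: str, summary: str, n: int) -> float:
        """Fraction of the summary's n-grams that never appear in the article."""
        art_tokens = article.lower().split()
        sum_tokens = summary.lower().split()
        art_ngrams = {tuple(art_tokens[i:i + n]) for i in range(len(art_tokens) - n + 1)}
        sum_ngrams = [tuple(sum_tokens[i:i + n]) for i in range(len(sum_tokens) - n + 1)]
        if not sum_ngrams:
            return 0.0
        novel = sum(1 for g in sum_ngrams if g not in art_ngrams)
        return novel / len(sum_ngrams)

    print(novel_ngram_ratio("the cat sat on the mat", "a cat rested on the mat", 2))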

 

Model prec_1 prec_2
BottomUp 66.7 48.4
UniLM 60.0 39.6
FASum 66.8 48.6
    -KG 53.2 31.3

 

Table 6: Average precision of matched relational tuples between the summary and the article in the CNN/DailyMail test set. The precision metrics prec_1 and prec_2 are defined in Equations 14 and 15.

4.6.2 Matched relations

As the relational tuples in the knowledge graph capture the factual information in the text, we compute the precision of extracted tuples in the summary. In detail, suppose the set of relational tuples in the summary is T_S, and the set of relational tuples in the article is T_A. Then, each tuple (s, r, o) ∈ T_S falls into one of the following three cases:

  1. (s, r, o) ∈ T_A, which we define as a correct hit;

  2. (s, r, o) ∉ T_A, but T_A contains a tuple (s, r, o′) with o′ ≠ o, or a tuple (s′, r, o) with s′ ≠ s. We define this case as a wrong hit;

  3. Otherwise, we define it as a miss.

It follows that a more factually correct summary should have more correct hits and fewer wrong hits. Suppose the numbers of tuples in cases 1-3 are n_c, n_w and n_m, respectively. We define two precision metrics to measure the ratio of correct hits:

prec_1 = n_c / (n_c + n_w)    (14)
prec_2 = n_c / (n_c + n_w + n_m)    (15)

Note that this metric is different from the ratio of overlapping tuples proposed in Goodrich et al. (2019). In Goodrich et al. (2019), the ratio is computed between the ground-truth summary and the candidate summary. However, since even the ground-truth summary may not cover all the salient information in the article, we choose to compare the knowledge tuples in the candidate summary directly against those in the article.
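
Given OpenIE tuples for a summary and its article, the two precision measures can be computed as sketched below; exact string matching of tuple elements, and the reading of a wrong hit as a tuple that matches an article tuple in two of its three elements, follow the reconstruction above and are simplifying assumptions.

    def matched_relation_precision(summary_tuples, article_tuples):
        """Compute prec_1 and prec_2 (Eqs. 14-15) over (subject, relation, object) tuples."""
        article_set = set(article_tuples)
        subj_rel = {(s, r) for s, r, _ in article_tuples}
        rel_obj = {(r, o) for _, r, o in article_tuples}
        n_correct = n_wrong = n_miss = 0
        for s, r, o in summary_tuples:
            if (s, r, o) in article_set:
                n_correct += 1                             # case 1: correct hit
            elif (s, r) in subj_rel or (r, o) in rel_obj:
                n_wrong += 1                               # case 2: wrong hit
            else:
                n_miss += 1                                # case 3: miss
        hits = n_correct + n_wrong
        total = n_correct + n_wrong + n_miss
        prec1 = n_correct / hits if hits else 0.0
        prec2 = n_correct / total if total else 0.0
        return prec1, prec2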

Table 6 displays the average precision metrics of our model and the baseline systems on the CNN/DailyMail test set. As shown, FASum achieves the highest precision of correct hits under both measures. There is also a considerable boost from the knowledge graph component: 13.6% in prec_1 and 17.3% in prec_2.

4.6.3 Human Evaluation

 

Model Factual Score Informativeness
BottomUp 2.32 2.23
UniLM 2.65 2.45
FASum 2.79 2.36
    -KG 2.57 2.15

 

Table 7: Human evaluation results for factual correctness and informativeness of generated summaries for 100 randomly sampled articles in CNN/DailyMail test set. Each summary is assessed by 3 labellers. * means that the advantage over the second highest score is statistically significant with p-value smaller than 0.05.

We conduct human evaluation on the factual correctness and informativeness of summaries. We randomly sample 100 articles from the test set of CNN/DailyMail. Then, each article and summary pair is labelled by 3 people from Amazon Mechanical Turk to evaluate the factual correctness and informativeness. Each labeller should give a score in each category between 1 and 3 (3 being perfect).

Here, factual correctness indicates whether the summary’s content is faithful with respect to the article; informativeness indicates how well the summary covers the salient information in the article.

As shown in Table 7, our model FASum achieves the highest factual correctness score, higher than UniLM and considerably outperforming BottomUp. We conduct a statistical test and find that, compared with UniLM, our model’s score is statistically significantly better with a p-value smaller than 0.05 under a paired t-test. In terms of informativeness, our model is comparable with UniLM. Finally, without the knowledge graph, the ablated version of our model generates summaries with both lower factual correctness and lower informativeness.

5 Conclusion

Factual correctness is an important but often neglected criterion for abstractive text summarization. As most existing models and evaluation metrics focus on token-level accuracy, generated summaries often distort or misrepresent facts in the original text. In this paper, we extract factual information from the article and represent it in a knowledge graph. Via neural graph computation, we integrate the factual knowledge into the process of producing summaries. The resulting model FASum exhibits remarkable improvement in the ability to preserve facts during summarization. Both automatic and human evaluation show the effectiveness of our model.

For future work, we plan to integrate knowledge graph into pre-trained models for better summarization. Moreover, we will combine internally extracted knowledge graph with external knowledge graph (e.g. ConceptNet) to enhance the commonsense capability of summarization.

References

Appendix A Appendix

A.1 Implementation Details

During decoding, we use beam search with widths of 18 and 2 for CNN/DailyMail and XSum, respectively. The minimum summary length is 30 and 40 for CNN/DailyMail and XSum, respectively.

For hyperparameter search, we tried 4 layers with 4 heads, 6 layers with 6 heads, and 10 layers with 8 heads. The beam search width ranges are [5, 20] and [1, 5] for CNN/DailyMail and XSum, respectively. Minimum summary lengths from [30, 35, 40, 45, 50, 55, 60] are tried for both datasets.

There are 163.2M parameters in the model, and it takes 2 hours (CNN/DailyMail) / 0.5 hours (XSum) on 4 V100 GPUs to train 1 epoch. The batch size is set to 48 for both datasets.

On the validation sets, FASum achieves ROUGE-1 39.33%, ROUGE-2 17.70% and ROUGE-L 36.23% on CNN/DailyMail, and ROUGE-1 26.86%, ROUGE-2 8.82% and ROUGE-L 21.54% on XSum.

A.2 Examples

Table 8, Table 9 and Table  10 show examples of CNN/DailyMail articles and summaries generated by our model and several baseline systems. The factual errors in summaries are marked in red and the corresponding facts are marked in green in the article.

As shown, while baseline systems like BottomUp and UniLM achieve high ROUGE scores, they are susceptible to factual errors. For instance, in Article 2, BottomUp wrongly assigns the speaker to the concept “north star” and UniLM misinterprets the voting relationship: the voters are the readers, not the journalist Sutter. The ablation model FASum -KG also makes a number of errors, e.g. it mistakenly summarizes that the daughter of the deceased Geoffrey Lewis passed away at 79 in Article 3.

In comparison, our proposed fact-aware summarizer FASum faithfully summarizes the salient information in the article. It can also re-organize phrasing instead of merely copying content from the article.

 

Article 1 A fiery sunset greeted people in Washington Sunday. … The stunning sunsets were the result of raging wildfires in parts of Siberia. … The fires were started in southeastern Siberia, by farmers burning grass in their fields. But on April 14, it is believed that the flames quickly grew out of control because of strong winds and spread throughout the region, according to CNN affiliate KOMO-TV. As a result, the fires have destroyed dozens of villages in the region. …
Reference Smoke from massive fires in Siberia created fiery sunsets in the Pacific Northwest. Atmospheric winds carried smoke from the wildfires across the Pacific Ocean. Smoke particles altered wavelengths from the sun, creating a more intense color.
BottomUp The fires were started in southeastern Siberia on April 14. The fires have destroyed dozens of villages in the region. The smoke from Siberia grew out of control.
UniLM The fiery sunsets were the result of raging wildfires in parts of Siberia. Strong winds picked up the smoke from the fires. As the smoke starts to dissipate, air quality will get bette
FASum “The dramatic sunsets began showing up over the weekend,” Seattle native Tim Durkan says. The fires were started in southeastern Siberia, by farmers burning grass in their fields.
  -KG As a result, the flames grew out of control in Siberia. The sunsets were the result of raging wildfires in parts of siber

 

Article 2 We’re 2 degrees from a different world. Humans never have lived on a planet that’s 2 degrees Celsius (3.6 Fahrenheit) warmer than it was before we started burning fossil fuels, in the late 1800s, and climate experts say we risk fundamentally changing life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” said Gernot Wagner, the lead senior economist at the Environmental Defense Fund and co-author of the book “Climate Shock.” … I’ve heard it referred to as the “north star” for climate negotiations. … It’s so significant that it’s going to be the subject of my reporting for the rest of the year. I’m calling that effort “2 degrees,” and I need your help to make it work. Until 5 p.m. ET Monday, you can vote on the first story I’ll report for the series. Vote using the Facebook poll below (or go here if you don’t see it.) Tell CNN’s John Sutter which of these climate change stories you like best - and he’ll report on the winner…
Reference Experts have raised red flags about the warming of planet by 2 degrees Celsius. John Sutter: This one little number is significant as a way to focus world’s attention on problem.
BottomUp Climate experts say we risk life on this planet if we do cross that 2-degree mark. “This is gambling with the planet,” says “north star” for climate negotiations. Vote using Facebook poll below #2degrees.
UniLM John Sutter’s “2 degrees” will be the subject of CNN’s next series on climate change. The number is a way to focus the world ’s attention, he says. “This is gambling with the planet,” says Gernot Wagner, the lead economist at the Environmental Defense Fund. Sutter will vote on the first story he ’ll report for the series
FASum I’ve heard it referred to as the “north star” for climate negotiations. John Sutter: “climate change stories you like best” and he’ll report on the winner.
  -KG Gernot Wagner says we risk fundamentally changing life on our planet. “This is gambling with the planet,” says Wagner not Wagner, lead economist at the Environmental Defense Fund. “It’s so significant that it’s going to be the subject of my reporting for the rest of the year,” she says.

 

Table 8: Sample articles and summaries from our model FASum and baseline systems. -KG indicates our model without the knowledge graph component. The factual errors in summaries are marked in red and the corresponding facts are marked in green in the article.

 

Article 3 Geoffrey Lewis, a prolific character actor who appeared opposite frequent collaborator Clint Eastwood as his pal Orville Boggs in “Every Which Way But Loose” and its sequel, has died. He was 79. Lewis, the father of Oscar-nominated actress Juliette Lewis, died Tuesday, family friend Michael Henderson said. No other details were immediately available. Lewis began his long association with Eastwood in “High Plains Drifter” (1973). He also appeared with the actor in “Thunderbolt and Lightfoot” (1974) …
Reference Geoffrey Lewis appeared in many movies, TV shows. Actor was frequently collaborator with Clint Eastwood. Actress Juliette Lewis, his daughter, called him “my hero”
BottomUp Geoffrey Lewis began his long association with Eastwood in “Every which Way but Loose” Lewis scored a Golden Globe nomination for playing bartender Earl Tucker. He also appeared with the actor in “Thunderbolt and Lightfoot” “Flo,” “Pink Cadillac” and “Midnight in the Evil ” The actor also stood out as a gravedigger turned vampire in the 1979 Tobe “Salem’s Lot,” a 1979 actor. Lewis was the father of Oscar-nominated actress Juliette Lewis.
UniLM Geoffrey Lewis was a prolific character actor. He played Orville Boggs in “Every Which Way But Loose” Lewis was the father of Oscar-nominated actress Juliette Lewis.
FASum It’s the father of Oscar-nominated actress Juliette Lewis. Lewis began his long association with Eastwood in “High Plains Drifter” he also appeared with the actor in “Thunderbolt and Lightfoot”
  -KG Oscar-nominated actress Juliette Lewis dies at 79. Lewis starred in “High Plains Drifter” he also appeared in “Bronco Billy” and “Midnight in the Garden of Good and Evil”

 

Article 4 … This Sunday in Rome, Pope Francis faced just such a dilemma. First, the back story: One hundred years ago, more than 1 million Armenians

(some estimates run as high as 1.5 million)

died at the hand of the Turks. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide.” (A U.S. resolution to do the same has languished in Congress.) Turkish officials disagree… John Paul II used the “g” word in 2001, but didn’t dare speak it out loud…
Reference Previous popes had finessed the question of whether the killing of 1.5 million Armenians was genocide. Because he often shines such a smiley face on the world, it can be easy to forget the bluntness Francis sometimes brings to the bully pulpit.
BottomUp More than 1 million Armenians died at St. John Paul II ’s Turks. Turkish officials say the killings were a “genocide” the Vatican says it’s not the pope, but he says he has no plans to use it. He says the church’s moral and political woes have improved.
UniLM More than 1 million Armenians died at the hands of the Turks one hundred years ago. Frida Ghitis: Pope Francis faced just such a dilemma this Sunday in Rome. She says previous popes had to make a hard choice: Adopt the sharp tongue of a prophet or the discretion of a diplomat?
FASum As scholars like to say, the Vatican has walked the line between spiritual and worldly concerns for centuries. Many of the victims were part of a branch of Christianity closely aligned with Catholicism. A slew of historians and at least 20 countries call the killings a “genocide”
  -KG One hundred years ago, more than 1 million Armenians died at the hand of the Turks. A slew of historians and at least 20 countries call the killings a “genocide”. John Paul II used the “g” word in 2001.

 

Table 9: Continuation of Table 8.

 

Article 5 Can you imagine paying $1,000 a month in rent to live in a one-car garage? Nicole, a 30-year-old woman, doesn’t have to imagine this scenario because it’s her everyday reality… CNN recently published a powerful piece called ”Poor kids of Silicon Valley” that documents the affordable housing challenges facing families in the Bay Area… The agency I lead, the U.S. Department of Housing and Urban Development, recently released a report estimating that 7.7 million low-income households live in substandard housing, spend more than half their incomes on rent or both… HUD created the Rental Assistance Demonstration initiative to bring private investment into the fold for the public good… RAD has allowed local communities to raise more than $733 million in new capital to date…
Reference CNN’s John Sutter told the story of the “Poor kids of Silicon Valley” HUD Secretary Julian Castro: Our shortage of affordable housing is a national crisis that stunts the economy.
BottomUp “Poor kids of Silicon Valley ” spend $1,000 a month in rent to live in San Mateo, California. Nicole, a woman, has a house that is home to 16 people and 11 children. Silicon Valley is in an affordable housing crisis.
UniLM Writers: Our entire nation is in the midst of an affordable housing crisis. Writers: 7.7 million low-income households live in substandard housing , spend more than half their incomes on rent or both. Writers: HUD has allowed local communities to raise more than $733 million in new capital. Writers: No American should ever have to wait six decades to have a decent
FASum CNN recently published a powerful piece called “Poor kids of Silicon Valley” the affordable housing housing crisis agency says 7.7 million low-income households live in substandard housing.
  -KG “Poor kids of Silicon Valley” documents affordable housing challenges families. U.S. agency estimating 7.7 million low-income households live in substandard housing.

 

Article 6 An American teenager who helped her boyfriend stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali has been sentenced to 10 years in prison. Heather Mack, 19, who gave birth to her own daughter just weeks ago, was found guilty with her 21-year-old boyfriend, Tommy Schaefer, of killing Sheila von Wiese-Mack on the Indonesian island last August. Schaefer was sentenced to 18 years in prison for battering von Wiese-Mack to death in room 317 of the St. Regis Bali Resort. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship…
BottomUp Heather Mack, 19, gave birth to her own daughter just weeks ago. Schaefer was found guilty with her boyfriend, Tommy Schaefer, on the Indonesian island. Schaefer was sentenced to 18 years in prison.
UniLM Tommy Schaefer, 21, and Heather Mack, 19, were found guilty of killing Sheila von Wiese-Mack in August. The American helped her boyfriend stuff her mother’s body into a suitcase at an upmarket hotel in Bali. Schaafer had claimed he killed his girlfriend’s mother in self-defense after an argument over their relationship. Mack told the court her mother had threatened to kill
FASum Heather Mack, 19, helped boyfriend Tommy Schaefer stuff her mother’s lifeless body into a suitcase at an upmarket hotel in Bali. Schaefer had claimed he killed his girlfriend’s mother in self-defense after a violent argument erupted over the young couple’s relationship.
  -KG Heather Mack, 19, was found guilty of murdering her boyfriend Tommy Schaefer. Schaefer claimed von Wiese-Mack killed his girlfriend’s mother in self-defense. Von wi

 

Table 10: Continuation of Table 8.