Abstractive Summarization Improved by WordNet-based Extractive Sentences

08/04/2018 ∙ by Niantao Xie, et al. ∙ Peking University

Recently, seq2seq abstractive summarization models have achieved good results on the CNN/Daily Mail dataset. Still, how to improve abstractive methods with extractive methods remains a promising research direction, since extractive methods can exploit a variety of efficient features for extracting important sentences from a text. In this paper, in order to improve the semantic relevance of abstractive summaries, we adopt a WordNet-based sentence ranking algorithm to extract the sentences that are most semantically relevant to the text. Then, we design a dual attentional seq2seq framework to generate summaries that take the extracted information into account. At the same time, we combine pointer-generator and coverage mechanisms to solve the problems of out-of-vocabulary (OOV) words and duplicate words that exist in abstractive models. Experiments on the CNN/Daily Mail dataset show that our models achieve ROUGE scores competitive with the state of the art. Human evaluations also show that the summaries generated by our models have high semantic relevance to the original text.




1 Introduction

For automatic summarization, there are two main methods: extractive and abstractive. Extractive methods use scoring rules or ranking methods to select a certain number of important sentences from the source texts. For example, earlier work represented queries and sentences with Convolutional Neural Networks (CNN) and adopted a greedy algorithm combined with a pair-wise ranking algorithm for extraction. Based on Recurrent Neural Networks (RNN), other work constructed a sequence classifier and obtained the highest extractive scores on the CNN/Daily Mail corpus. Abstractive summarization models, in contrast, attempt to simulate how human beings write summaries, and need to analyze, paraphrase, and reorganize the source texts. Abstractive methods are known to suffer from two main problems: OOV words and duplicate words.

[16] proposed an improved pointer mechanism named pointer-generator to handle OOV words, and came up with a variant of the coverage vector, called coverage, to deal with duplicate words. [12] created diverse cell structures to handle the duplicate words problem in query-based summarization. [14] was the first to propose a neural network model based on reinforcement learning, and it obtained state-of-the-art scores on the CNN/Daily Mail corpus.

Both extractive and abstractive methods have their merits. In this paper, we combine extractive and abstractive methods at the sentence level. In the extractive process, we find that there are some ambiguous words in the source texts. The different meanings of each word can be acquired through WordNet, a lexical database of synonym sets. First, the WordNet-based Lesk algorithm is used to analyze the word semantics. Then we apply a modified sentence ranking algorithm to extract a specified number of sentences according to this sentence-level information. In the abstractive part, which is based on the seq2seq model, we add a new encoder derived from the extracted sentences and use a dual attention mechanism for decoding. As far as we know, this is the first time that joint training of sentence-level extractive and abstractive models has been conducted. Additionally, we combine the pointer-generator and coverage mechanisms to handle OOV words and duplicate words.

Our contributions in this paper are mainly summarized as follows:

  • Considering the semantics of words and sentences, we improve the sentence ranking algorithm based on the WordNet-based simplified Lesk algorithm to obtain important sentences from the source texts.

  • We construct two parallel encoders from the extracted sentences and the source texts separately, and make use of a seq2seq dual attentional model for joint training.

  • We adopt the pointer-generator and coverage mechanisms to deal with the OOV words and duplicate words problems. Our results are competitive with the state-of-the-art scores.

2 Our Method

Our method is based on the seq2seq attentional model, which is implemented with reference to [11], with the attention distribution calculated as in [1]. Figure 1 shows the architecture of our model, which is composed of eight parts. We construct two encoders (②, ④) based on the source texts and the extracted sentences, and use a dual attentional decoder (①, ③, ⑤, ⑥) to generate summaries. Finally, we combine the pointer-generator (⑦) and coverage (⑧) mechanisms to manage the OOV and duplicate words problems.

Figure 1: A dual attentional encoders-decoder model with pointer-generator network.

2.1 Seq2seq dual attentional model

2.1.1 Encoders-decoder model.

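Referring to [1], we use two single-layer bidirectional Long Short-Term Memory (BiLSTM) encoders, a source encoder and an extractive encoder, and a single-layer unidirectional LSTM (UniLSTM) decoder in our model, as shown in Figure 1. At encoding time $i$, the source texts and the extracted information feed the word embeddings $w_i^s$ and $w_i^e$ into the two encoders respectively, and the corresponding hidden layer states $h_i^s$ and $h_i^e$ are generated. At decoding step $t$, the decoder receives the word embedding from step $t-1$, which is obtained from the previous word in the reference summary during training, or provided by the decoder itself when testing. Next, we acquire the state $s_t$ and produce the vocabulary distribution $P_{vocab}$.

Here, $h_i^s$ is calculated by the following formulas:

$$\overrightarrow{h_i^s} = \mathrm{LSTM}(w_i^s, \overrightarrow{h_{i-1}^s}), \quad \overleftarrow{h_i^s} = \mathrm{LSTM}(w_i^s, \overleftarrow{h_{i+1}^s}), \quad h_i^s = [\overrightarrow{h_i^s}; \overleftarrow{h_i^s}]$$

Also, $h_i^e$ can be obtained as follows:

$$\overrightarrow{h_i^e} = \mathrm{LSTM}(w_i^e, \overrightarrow{h_{i-1}^e}), \quad \overleftarrow{h_i^e} = \mathrm{LSTM}(w_i^e, \overleftarrow{h_{i+1}^e}), \quad h_i^e = [\overrightarrow{h_i^e}; \overleftarrow{h_i^e}]$$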


2.1.2 Dual attention mechanism.

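At the $t$-th step, we need not only the previous hidden state $s_{t-1}$, but also the context vectors $c_t^s$, $c_t^e$, $c_t^g$, obtained from the corresponding attention distributions [1], to obtain the state $s_t$ and the vocabulary distribution $P_{vocab}$. Below, $h_i^s$ and $h_i^e$ denote the hidden states of the source and extractive encoders at position $i$.

Firstly, for the source encoder, we calculate the context vector $c_t^s$ in the following way ($v_s$, $W_s$, $U_s$, $b_s$ are learnable parameters):

$$e_{ti}^s = v_s^\top \tanh(W_s s_{t-1} + U_s h_i^s + b_s), \quad \alpha_{ti}^s = \frac{\exp(e_{ti}^s)}{\sum_j \exp(e_{tj}^s)}, \quad c_t^s = \sum_i \alpha_{ti}^s h_i^s$$

Secondly, for the extractive encoder, we use the identical method to compute the context vector $c_t^e$ ($v_e$, $W_e$, $U_e$, $b_e$ are learnable parameters):

$$e_{ti}^e = v_e^\top \tanh(W_e s_{t-1} + U_e h_i^e + b_e), \quad \alpha_{ti}^e = \frac{\exp(e_{ti}^e)}{\sum_j \exp(e_{tj}^e)}, \quad c_t^e = \sum_i \alpha_{ti}^e h_i^e$$

Thirdly, we get the gated context vector $c_t^g$ by calculating the weighted sum of the context vectors $c_t^s$ and $c_t^e$, where the weight is a gate network $g_t$ obtained from the concatenation of $c_t^s$ and $c_t^e$ via a multi-layer perceptron (MLP). Details are shown below ($\sigma$ is the sigmoid function, and $W_g$, $b_g$ are learnable parameters):

$$g_t = \sigma(W_g [c_t^s; c_t^e] + b_g), \quad c_t^g = g_t \odot c_t^s + (1 - g_t) \odot c_t^e$$

In the same way, we obtain the decoder hidden state $s_t$ from $s_{t-1}$, the previous word embedding and $c_t^g$, and predict the probability distribution $P_{vocab}$ at time $t$; all weight matrices and bias terms involved are learnable parameters.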
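
The dual attention step described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `attention` and `gated_context`, the parameter shapes, the random toy inputs, and the choice of an element-wise (vector) gate are all assumptions made for the example.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def attention(s_prev, H, W, U, v, b):
    """Additive attention in the style of [1].

    s_prev: previous decoder state, shape (d_s,)
    H:      encoder hidden states, shape (n, d_h)
    Returns the attention weights (n,) and the context vector (d_h,).
    """
    scores = np.tanh(s_prev @ W.T + H @ U.T + b) @ v
    weights = softmax(scores)
    return weights, weights @ H

def gated_context(c_src, c_ext, W_g, b_g):
    """Sigmoid gate over the concatenated contexts, then a weighted sum.

    Assumption: the gate is a vector applied element-wise.
    """
    z = W_g @ np.concatenate([c_src, c_ext]) + b_g
    gate = 1.0 / (1.0 + np.exp(-z))
    return gate * c_src + (1.0 - gate) * c_ext

# Toy dimensions and random parameters (hypothetical values).
rng = np.random.default_rng(0)
d_s, d_h, d_a = 4, 6, 5
s_prev = rng.normal(size=d_s)
H_src = rng.normal(size=(7, d_h))   # 7 source positions
H_ext = rng.normal(size=(3, d_h))   # 3 positions in the extracted sentences

def make_params():
    return (rng.normal(size=(d_a, d_s)), rng.normal(size=(d_a, d_h)),
            rng.normal(size=d_a), rng.normal(size=d_a))

a_src, c_src = attention(s_prev, H_src, *make_params())
a_ext, c_ext = attention(s_prev, H_ext, *make_params())
W_g, b_g = rng.normal(size=(d_h, 2 * d_h)), rng.normal(size=d_h)
c_gate = gated_context(c_src, c_ext, W_g, b_g)
```

Because the gate lies strictly between 0 and 1, each component of `c_gate` falls between the corresponding components of the two context vectors, which is one way to read the "weighted sum" in the text.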


2.2 WordNet-based Sentence Ranking Algorithm

To extract the important sentences, we adopt a WordNet-based sentence ranking algorithm. WordNet (http://www.nltk.org/howto/wordnet.html) is a lexical database for the English language, which groups English words into sets of synonyms called synsets and provides short definitions and usage examples. [13] used the simplified Lesk approach based on WordNet to extract abstracts. We refer to its algorithm and set up our sentence ranking algorithm so as to construct the extractive encoder.

For a sentence $S$, after filtering out the stop words and unambiguous tokens through WordNet, we obtain a reserved subsequence $S'$. Since some words have so many senses that the computation becomes too expensive, we set a window size $n$ (default value is 5), sort $S'$ in descending order of the number of senses of each word, and keep the first $n$ words (at most $|S'|$) to get $S''$. Next, we count the number of common words among the sense descriptions of each word as the word weight. Finally, we sum the word weights of each sentence and compute an average sentence weight.

Taking a sentence $S = \{w_1, w_2, w_3\}$ for instance, we assume that $w_1$ has two senses $s_{11}$ and $s_{12}$, $w_2$ has two senses $s_{21}$ and $s_{22}$, while $w_3$ has two senses $s_{31}$ and $s_{32}$. Taking $w_1$ as the keyword, we measure the number of common words between a pair of sentences that describe a sense of $w_1$ and a sense of another word.

Table 1 shows all possible matches of the senses of $w_1$, $w_2$, $w_3$. For each of the two senses of $w_1$, we obtain the sum of co-occurring word pairs. Writing $c(a, b)$ for the number of common words between the descriptions of senses $a$ and $b$, for $s_{11}$ we obtain $score(s_{11}) = c(s_{11}, s_{21}) + c(s_{11}, s_{22}) + c(s_{11}, s_{31}) + c(s_{11}, s_{32})$, and for $s_{12}$ we obtain $score(s_{12}) = c(s_{12}, s_{21}) + c(s_{12}, s_{22}) + c(s_{12}, s_{31}) + c(s_{12}, s_{32})$. The sense corresponding to the higher score ($score(s_{11})$ or $score(s_{12})$) is assigned to the keyword $w_1$.

 Pair of sentences  common words in sense description
Table 1: The number of common words between a pair of sentences.

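In this way, we can acquire the average weight of a sentence $S$, where $S''$ is the set of words kept from $S$ and the weight of each word is the highest score among its senses:

$$weight(S) = \frac{1}{|S''|} \sum_{w \in S''} \max_{s \in senses(w)} score(s)$$

Assume that a document $D$ contains a total of $m$ sentences. We sort them in descending order according to the average weights of the sentences, and then extract the top $k$ sentences (default value is 3).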
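
The ranking procedure above can be sketched with a toy sense inventory standing in for WordNet. Everything named here is an assumption for illustration: the `SENSES` dictionary and its glosses, the stop-word list, the helper names, and the choice to return sentences in ranking order. A real pipeline would read the sense descriptions from WordNet instead.

```python
# Hypothetical sense inventory: word -> list of sense descriptions (glosses).
SENSES = {
    "bank": ["institution that accepts money deposits",
             "sloping land beside river water"],
    "deposit": ["money placed in bank account",
                "sediment left by river water"],
    "river": ["large stream of water",
              "large flowing quantity"],
}
STOP_WORDS = {"the", "a", "of", "in", "by", "that"}

def overlap(gloss_a, gloss_b):
    """Number of common non-stop words between two sense descriptions."""
    a = set(gloss_a.split()) - STOP_WORDS
    b = set(gloss_b.split()) - STOP_WORDS
    return len(a & b)

def word_weight(word, others):
    """Highest sense score of `word`, scored against all senses of `others`."""
    best = 0
    for gloss in SENSES[word]:
        score = sum(overlap(gloss, g) for o in others for g in SENSES[o])
        best = max(best, score)
    return best

def sentence_weight(tokens, window=5):
    """Average weight over the (at most `window`) most ambiguous words."""
    kept = [t for t in dict.fromkeys(tokens)
            if t not in STOP_WORDS and len(SENSES.get(t, [])) > 1]
    kept = sorted(kept, key=lambda t: len(SENSES[t]), reverse=True)[:window]
    if not kept:
        return 0.0
    total = sum(word_weight(w, [o for o in kept if o != w]) for w in kept)
    return total / len(kept)

def extract_top_sentences(sentences, k=3, window=5):
    """Return the k sentences with the highest average weight."""
    ranked = sorted(sentences,
                    key=lambda s: sentence_weight(s.lower().split(), window),
                    reverse=True)
    return ranked[:k]

doc = ["nothing ambiguous here",
       "the bank of the river",
       "the bank holds a deposit"]
top2 = extract_top_sentences(doc, k=2)
```

With this toy data, "bank" and "deposit" share two gloss words in their second senses, so the third sentence receives the highest average weight.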

2.3 Pointer-generator and coverage mechanisms

2.3.1 Pointer-generator network.

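Pointer-generator is an effective method to solve the problem of OOV words, and its structure is expanded in Figure 1. We borrow the method improved by [16]. $p_{gen}$ is defined as a switch that decides whether to generate a word from the vocabulary or to copy a word according to the source encoder attention distribution. We maintain an extended vocabulary including the vocabulary and all words in the source texts. For decoding step $t$, with decoder state $s_t$, decoder input $x_t$, source context vector $c_t^s$ and source attention weights $\alpha_{ti}^s$, we define $p_{gen}$ and the final distribution $P(w)$ over the extended vocabulary as:

$$p_{gen} = \sigma(w_c^\top c_t^s + w_s^\top s_t + w_x^\top x_t + b_{ptr})$$

$$P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: w_i = w} \alpha_{ti}^s$$

where $p_{gen}$ takes a value in $[0, 1]$, and $w_c$, $w_s$, $w_x$, $b_{ptr}$ are learnable parameters.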
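
To make the copy-or-generate switch concrete, here is a small NumPy sketch of how a final distribution over the extended vocabulary could be assembled, following the formulation of [16]. The function name `final_distribution`, the toy numbers, and the convention that OOV source words receive ids starting at the vocabulary size are assumptions for this example.

```python
import numpy as np

def final_distribution(p_gen, p_vocab, attn, src_ids, n_oov):
    """Mix the generation and copy distributions.

    p_gen:   scalar switch in [0, 1]
    p_vocab: distribution over the fixed vocabulary, shape (V,)
    attn:    source attention weights, shape (n,)
    src_ids: extended-vocabulary id of each source token, shape (n,)
             (assumed convention: OOV words get ids V, V+1, ...)
    n_oov:   number of distinct OOV words in the source text
    """
    extended = np.concatenate([p_gen * p_vocab, np.zeros(n_oov)])
    # np.add.at accumulates correctly when a source word occurs several times.
    np.add.at(extended, src_ids, (1.0 - p_gen) * attn)
    return extended

p_vocab = np.array([0.1, 0.2, 0.3, 0.4])   # vocabulary of size 4
attn = np.array([0.5, 0.3, 0.2])           # three source tokens
src_ids = np.array([1, 4, 1])              # id 4 is an OOV source word
dist = final_distribution(0.6, p_vocab, attn, src_ids, n_oov=1)
```

In this toy case the OOV word (id 4) has zero generation probability but still receives probability mass through the copy term, which is how the mechanism lets the decoder emit words outside the fixed vocabulary.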

2.3.2 Coverage mechanism.

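Duplicate words are a critical problem in seq2seq models, and become even more serious when generating long texts such as multi-sentence summaries. [16] made some minor modifications to the coverage model [18], which is also displayed in Figure 1.

First, we sum the source attention distributions $\alpha_{t'}^s$ from the previous decoder steps ($t' = 0, \dots, t-1$) to get a coverage vector $cov_t$:

$$cov_t = \sum_{t'=0}^{t-1} \alpha_{t'}^s$$

Then, we use the coverage vector to update the attention distribution over the source encoder states $h_i^s$, given the previous decoder state $s_{t-1}$:

$$e_{ti}^s = v_s^\top \tanh(W_s s_{t-1} + U_s h_i^s + w_{cov}\, cov_{ti} + b_s)$$

Finally, we define the coverage loss function $covloss_t$ to penalize duplicate words appearing at decoding time $t$, and update the total loss:

$$covloss_t = \sum_i \min(\alpha_{ti}^s, cov_{ti}), \quad loss_t = -\log P(w_t^*) + \lambda\, covloss_t$$

where $w_t^*$ is the target word at the $t$-th step, $-\log P(w_t^*)$ is the primary loss for timestep $t$ during training, the hyperparameter $\lambda$ (default value is 1.0) is the weight for $covloss_t$, and $v_s$, $W_s$, $U_s$, $w_{cov}$, $b_s$ are learnable parameters.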
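
The bookkeeping behind the coverage vector and coverage loss, as formulated in [16], can be sketched as below. The function name `coverage_losses` and the toy attention values are assumptions; the attention distributions are taken as given rather than recomputed with the coverage term.

```python
import numpy as np

def coverage_losses(attn_steps, lam=1.0):
    """Per-step coverage loss and the final coverage vector.

    attn_steps: array of shape (T, n); row t is the source attention
                distribution at decoding step t.
    lam:        weight of the coverage loss (1.0 by default, as in the text).
    """
    coverage = np.zeros_like(attn_steps[0])
    losses = []
    for attn in attn_steps:
        # Penalize attention placed on positions that are already covered.
        losses.append(lam * np.minimum(attn, coverage).sum())
        coverage = coverage + attn
    return losses, coverage

attn_steps = np.array([[0.7, 0.2, 0.1],
                       [0.7, 0.2, 0.1],   # attends to the same positions again
                       [0.1, 0.1, 0.8]])
losses, coverage = coverage_losses(attn_steps)
```

The second step repeats the first step's attention and therefore incurs the largest penalty, while the first step, with an empty coverage vector, incurs none.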

3 Experiments

3.1 Dataset

The CNN/Daily Mail dataset (https://cs.nyu.edu/~kcho/DMQA/) is widely used in public automatic summarization evaluation. It contains online news articles (781 tokens on average) paired with multi-sentence summaries (56 tokens on average). [16] provided the data processing script, and we use it to obtain the non-anonymized version of the data, including 287,226 training pairs, 13,368 validation pairs and 11,490 test pairs, whereas [10, 11] used the anonymized version. During training, we find that 114 of the 287,226 articles are empty, so we use the remaining 287,112 pairs for training. Then, we perform the splitting preprocessing on the data pairs with the help of the Stanford CoreNLP toolkit (https://stanfordnlp.github.io/CoreNLP/), convert them into binary files, and build the vocab file for convenient data reading.

3.2 Implementation

3.2.1 Model parameters configuration.

The parameters of the controlled experimental models are described as follows. For all models, we set the word embeddings and RNN hidden states to be 128-dimensional and 256-dimensional respectively for the source encoders, extractive encoders and decoders. Contrary to [11], we learn the word embeddings from scratch during training, because our training dataset is large enough. We apply the Adagrad optimizer with a learning rate of 0.15 and an initial accumulator value of 0.1, and employ gradient clipping with a maximum gradient norm of 2.
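
For readers unfamiliar with this optimizer setup, the following NumPy sketch shows one Adagrad update with gradient clipping, using the hyperparameters quoted above. It is only an illustration: the helper names are hypothetical, and clipping by the global norm across all parameters is an assumption, since the text states only the maximum norm.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=2.0):
    """Rescale all gradients so that their joint L2 norm is at most max_norm."""
    norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grads]

def adagrad_step(params, grads, accums, lr=0.15):
    """In-place Adagrad update: accumulate squared gradients, then scale the step."""
    for p, g, a in zip(params, grads, accums):
        a += g ** 2
        p -= lr * g / np.sqrt(a)

params = [np.array([1.0, -1.0])]
accums = [np.full(2, 0.1)]                    # initial accumulator value 0.1
raw_grads = [np.array([3.0, 4.0])]            # joint norm 5.0, above the limit
grads = clip_by_global_norm(raw_grads, max_norm=2.0)
adagrad_step(params, grads, accums, lr=0.15)
```

The non-zero initial accumulator keeps the very first step finite without a separate epsilon term.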

For the one-encoder models, we set the vocabulary size to 50k for both the source encoder and the target decoder. We also tried a vocabulary size of 150k, and found that when the model is trained to convergence, the time cost is doubled but the scores on the test dataset drop slightly. In our analysis, the number of model parameters grows excessively when the vocabulary is enlarged, leading to overfitting during training. For the models with two encoders, we set the vocabulary size to 40k.

Each pair in the dataset consists of an article and a multi-sentence summary. We truncate the article to 400 tokens and limit the summary to 100 tokens at both training and testing time. In decoding mode, we generate at least 35 words with the beam search algorithm. Truncating the data not only reduces memory consumption and speeds up training and testing, but also improves the experimental results. The reason is that the vital information of news texts is mainly concentrated in the first half.

We train on a single GeForce GTX 1080 GPU with 8114 MiB of memory. The batch size is set to 16, and the beam size is 4 for beam search in decoding mode. We trained the seq2seq dual attentional models without pointer-generator for about two days. Models with pointer-generator train faster, and the time cost is reduced to about one day. When we add coverage, the coverage loss weight is set to 1.0, and the model needs about one hour for training.

3.2.2 Controlled experiments.

In order to figure out how each part of our models contributes to the test results, we have implemented all the models based on the released TensorFlow code (https://github.com/tensorflow/models/tree/master/research/textsum) and done a series of experiments.

The baseline model is a general seq2seq attentional model, in which the encoder consists of a biLSTM and the decoder is a uniLSTM. The second baseline model is our encoders-decoder dual attention model, which contains two biLSTM encoders and one uniLSTM decoder. This model combines the extractive and generative methods to perform joint training effectively through a dual attention mechanism.

For the above two basic models, in order to show how the OOV and duplicate words are treated, we introduce the pointer-generator and coverage mechanisms step by step. For the second baseline, the two tricks are only applied to the source encoder, because we think that the source encoder already covers all the tokens in the extractive encoder. For the extractive encoder, we adopt two methods for extraction. One is the leading three (lead-3) sentences technique, which is simple but indeed a strong baseline. The other is the modified sentence ranking algorithm based on WordNet that we explain in detail in Section 2.2. It considers semantic relations among words and sentences in the source texts.

3.3 Results


ROUGE is a set of metrics with a software package used for evaluating automatic summarization and machine translation results. It counts the number of overlapping basic units, including n-grams and longest common subsequences (LCS). We use pyrouge (https://pypi.org/project/pyrouge/0.1.3/), a Python wrapper, to obtain ROUGE-1, ROUGE-2 and ROUGE-L scores, and list the scores in Table 2.
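
As a rough illustration of what these metrics count, the sketch below computes n-gram overlap and the LCS length for a toy sentence pair. It is a simplified stand-in and not the official ROUGE package or pyrouge (for example, it does no stemming and omits the ROUGE-L F-measure weighting); the function names and the example sentences are assumptions.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n=1):
    """Precision, recall and F1 of clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def lcs_length(a, b):
    """Length of the longest common subsequence (basis of ROUGE-L)."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
r1 = rouge_n(candidate, reference, n=1)
r2 = rouge_n(candidate, reference, n=2)
lcs = lcs_length(candidate, reference)
```

Here five of the six unigrams and three of the five bigrams match, and the longest common subsequence is "the cat on the mat".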

  Models                                ROUGE-1  ROUGE-2  ROUGE-L
  Seq2seq + Attn                          31.50    11.95    28.85
  Seq2seq + Attn (150k)                   30.67    11.32    28.11
  Seq2seq + Attn + PGN                    36.58    15.76    33.33
  Seq2seq + Attn + PGN + Cov              39.16    16.98    35.81
  Lead-3 + Dual-attn + PGN                37.26    16.12    33.87
  WordNet + Dual-attn + PGN               36.91    15.97    33.58
  Lead-3 + Dual-attn + PGN + Cov          39.41    17.30    35.92
  WordNet + Dual-attn + PGN + Cov         39.32    17.15    36.02
  Lead-3 ([16])                           40.34    17.70    36.57
  Lead-3 ([10]) (a)                       39.20    15.70    35.50
  SummaRuNNer ([10]) (a)(b)               39.60    16.20    35.30
  RL + Intra-attn ([14]) (a)(b)           41.16    15.75    39.08
  ML + RL + Intra-attn ([14]) (a)         39.87    15.82    36.90
Table 2: ROUGE scores on the CNN/Daily Mail non-anonymized testing dataset for all the controlled experiment models mentioned above. According to the official ROUGE usage description, all our ROUGE scores have a 95% confidence interval of at most ±0.25. PGN, Cov, ML, RL are abbreviations for pointer-generator, coverage, mixed-objective learning and reinforcement learning. Models marked with (a) were trained and tested on the anonymized CNN/Daily Mail dataset, and models marked with (b) are the state-of-the-art extractive and abstractive summarization models on the anonymized dataset to date.

We carry out the experiments on the original dataset, i.e., the non-anonymized version of the data. For the top three models in Table 2, the ROUGE scores are slightly higher than those reported by [16], except for the ROUGE-L score of Seq2seq + Attn + PGN, which is 0.09 points lower than the former result. For the fourth model, we did not reproduce the results of [16]: ROUGE-1, ROUGE-2, and ROUGE-L decreased by an average of 0.41 points.

For the four models in the middle, we apply the dual attention mechanism to integrate extraction with abstraction for joint training and decoding. These model variants, with PGN alone or PGN together with Cov, all achieve better results than the corresponding plain attentional models. We conclude that the extractive encoders play a role: we obtain higher ROUGE-1 and ROUGE-2 scores with the Lead-3 + Dual-attn + PGN + Cov model, and a better ROUGE-L score with the WordNet + Dual-attn + PGN + Cov model.

Consider now the five models at the bottom, two of which give the state-of-the-art scores for the extractive and generative methods. Our scores are already comparable to them. It is worth mentioning that, based on the dual attention, our models built on both Lead-3 and WordNet with PGN and Cov have exceeded the previous best ROUGE-2 scores. However, the previous SummaRuNNer and RL related models are based on the anonymized dataset, and this difference may cause some deviations in the comparison of experimental results.

We give some generated summaries of different models for one selected test article. In Figure 2, the red words represent key information about who, what, where and when. We can match the corresponding keywords in the remaining seven summaries to find out whether they cover all the significant points, and check whether they are expressed in a concise and coherent way. Figure 2 shows that most of the models have lost several vital points, and that the Lead-3 + Dual-attn + PGN model suffers from fairly serious repetition. Our WordNet + Dual-attn + PGN + Cov model retains the main key information and has better readability and semantic correctness.

Figure 2: Summaries for all the models of one test article example.

4 Related Work

Automatic summarization with extractive and abstractive methods is under active research. On the one hand, extractive techniques extract topic-related keywords and significant sentences from the source texts to constitute summaries. [3] proposed a seq2seq model with a hierarchical encoder and an attentional decoder to solve extractive summarization tasks at the word and sentence levels. More recently, [10] put forward SummaRuNNer, an RNN-based sequence model for extractive summarization, which achieved the previous state-of-the-art performance. On the other hand, abstractive methods establish an intrinsic semantic representation and use natural language generation techniques to produce summaries that are closer to what human beings express.

[1] applied the combination of the seq2seq model and the attention mechanism to machine translation tasks for the first time. [15] applied the seq2seq model to sentence compression, laying the groundwork for subsequent summarization at different granularities. [8] used an encoder-decoder with attention to generate news headlines. [20] added a selective gate network to the basic model in order to control which part of the information flows from the encoder to the decoder. [17] proposed a model based on a graph and an attention mechanism to strengthen the positioning of vital information in the source texts.

To handle rare and unseen words, [5, 6] proposed the COPYNET model and the pointing mechanism, and [19] created read-again and copy mechanisms. [11] combined the basic model with the large vocabulary trick (LVT), a feature-rich encoder, pointer-generator, and hierarchical attention. In addition to pointer-generator, the other tricks of that paper also contributed to the experimental results. [16] presented an updated version of pointer-generator which proved to be better. As for duplicate words, in order to solve the problems of over-translation and under-translation, [18] came up with a coverage mechanism that makes use of historical information for attention calculation, while [16] provided a refined version. [12] introduced a series of diverse cell structures to solve the duplicate words problem.

So far, few papers have considered structural or semantic issues at the language level in the field of summarization. [4] presented a novel unsupervised method that made use of a pruned dependency tree to perform sentence compression. Based on a Chinese short text summary dataset (LCSTS) and the attentional seq2seq model, [9] proposed to enhance semantic relevance by calculating the cosine similarities of summaries and source texts.

5 Conclusion

In this paper, we construct a dual attentional seq2seq model comprising source and extractive encoders to generate summaries. In addition, we put forward a modified sentence ranking algorithm to extract a specific number of highly weighted sentences, in order to strengthen the semantic representation of the extractive encoder. Furthermore, we introduce the pointer-generator and coverage mechanisms in our models to solve the problems of OOV and duplicate words. On the non-anonymized CNN/Daily Mail dataset, our results are close to the state-of-the-art ROUGE scores. Moreover, we get the highest abstractive ROUGE-2 scores, and obtain summaries that have better readability and higher semantic accuracy. In future work, we plan to unify the reinforcement learning method with our abstractive models.


We thank the anonymous reviewers for their insightful comments on this paper. This work was partially supported by the National Natural Science Foundation of China (61572049 and 61333018). The corresponding author is Sujian Li.