Selective Encoding for Abstractive Sentence Summarization in DyNet
We propose a selective encoding model to extend the sequence-to-sequence framework for abstractive sentence summarization. It consists of a sentence encoder, a selective gate network, and an attention-equipped decoder. The sentence encoder and decoder are built with recurrent neural networks. The selective gate network constructs a second level sentence representation by controlling the information flow from encoder to decoder. The second level representation is tailored for the sentence summarization task, which leads to better performance. We evaluate our model on the English Gigaword, DUC 2004 and MSR abstractive sentence summarization datasets. The experimental results show that the proposed selective encoding model outperforms the state-of-the-art baseline models.
Sentence summarization aims to shorten a given sentence and produce a brief summary of it. This is different from document level summarization task since it is hard to apply existing techniques in extractive methods, such as extracting sentence level features and ranking sentences. Early works propose using rule-based methods (Zajic et al., 2007), syntactic tree pruning methods (Knight and Marcu, 2002), statistical machine translation techniques (Banko et al., 2000) and so on for this task. We focus on abstractive sentence summarization task in this paper.
Recently, neural network models have been applied in this task. Rush et al. (2015) use auto-constructed sentence-headline pairs to train a neural network summarization model. They use a Convolutional Neural Network (CNN) encoder and feed-forward neural network language model decoder for this task. Chopra et al. (2016) extend their work by replacing the decoder with a Recurrent Neural Network (RNN). Nallapati et al. (2016) follow this line and change the encoder to an RNN to make it a full RNN based sequence-to-sequence model (Sutskever et al., 2014).
All the above works fall into the encoding-decoding paradigm, which first encodes the input sentence to an abstract representation and then decodes the intended output sentence based on the encoded information. As an extension of the encoding-decoding framework, the attention-based approach (Bahdanau et al., 2015) has been broadly used: the encoder produces a list of vectors for all tokens in the input, and the decoder uses an attention mechanism to dynamically extract encoded information and align with the output tokens. This approach achieves huge success in tasks like machine translation, where alignment between all parts of the input and output is required. However, in abstractive sentence summarization, there is no explicit alignment relationship between the input sentence and the summary except for the extracted common words. The challenge here is not to infer the alignment, but to select the highlights while filtering out secondary information in the input. A desired workflow for abstractive sentence summarization is encoding, selection, and decoding. After selecting the important information from an encoded sentence, the decoder produces the output summary using the selected information. For example, in Figure 1, given the input sentence, the summarization system first selects the important information, and then rephrases or paraphrases to produce a well-organized summary. Although this is implicitly modeled in the encoding-decoding framework, we argue that abstractive sentence summarization should benefit from explicitly modeling this selection process.
In this paper we propose Selective Encoding for Abstractive Sentence Summarization (SEASS). We treat sentence summarization as a three-phase task: encoding, selection, and decoding. The model consists of a sentence encoder, a selective gate network, and a summary decoder. First, the sentence encoder reads the input words through an RNN unit to construct the first level sentence representation. Then the selective gate network selects the encoded information to construct the second level sentence representation. The selective mechanism controls the information flow from encoder to decoder by applying a gate network according to the sentence information, which helps improve encoding effectiveness and relieves the burden on the decoder. Finally, the attention-equipped decoder generates the summary using the second level sentence representation. We conduct experiments on English Gigaword, DUC 2004 and Microsoft Research Abstractive Text Compression test sets. Our SEASS model achieves 17.54 Rouge-2 F1, 9.56 Rouge-2 recall and 10.63 Rouge-2 F1 on these test sets respectively, improving on the state-of-the-art methods.
Abstractive sentence summarization, also known as sentence compression and similar to headline generation, is used to help compress or fuse the selected sentences in extractive document summarization systems since they may inadvertently include unnecessary information. The sentence summarization task has long been connected to the headline generation task. There are some previous methods to solve this task, such as the linguistic rule-based method (Dorr et al., 2003). As for the statistical machine learning based methods, Banko et al. (2000) apply statistical machine translation techniques by modeling headline generation as a translation task and use 8000 article-headline pairs to train the system.
Rush et al. (2015) propose leveraging news data in the Annotated English Gigaword (Napoles et al., 2012) corpus to construct large scale parallel data for the sentence summarization task. They propose an ABS model, which consists of an attentive Convolutional Neural Network encoder and a neural network language model (Bengio et al., 2003) decoder. On this Gigaword test set and the DUC 2004 test set, the ABS model produces the state-of-the-art results. Chopra et al. (2016) extend this work, keeping the CNN encoder but replacing the decoder with recurrent neural networks. Their experiments show that the CNN encoder with RNN decoder model performs better than Rush et al. (2015). Nallapati et al. (2016) further change the encoder to an RNN encoder, which leads to a full RNN sequence-to-sequence model. Besides, they enrich the encoder with lexical and statistical features which play important roles in traditional feature based summarization systems, such as NER and POS tags, to improve performance. Experiments on the Gigaword and DUC 2004 test sets show that the above models achieve state-of-the-art results.
Gu et al. (2016) and Gulcehre et al. (2016) come up with the similar idea that the summarization task can benefit from copying words from input sentences. Gu et al. (2016) propose CopyNet to model the copying action in response generation, which also applies to the summarization task. Gulcehre et al. (2016) propose a switch gate to control whether to copy from the source or generate from the decoder vocabulary. Zeng et al. (2016) also propose using a copy mechanism and add a scalar weight on the gate of GRU/LSTM for this task. Cheng and Lapata (2016) use an RNN based encoder-decoder for extractive summarization of documents.
Yu et al. (2016) propose a segment to segment neural transduction model for the sequence-to-sequence framework. The model introduces a latent segmentation which determines correspondences between tokens of the input sequence and the output sequence. Experiments on this task show that the proposed transduction model performs comparably to the ABS model. Shen et al. (2016) apply minimum risk training (MRT) on the abstractive sentence summarization task and the results show that optimizing for Rouge improves the test performance.
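For sentence summarization, given an input sentence $x = (x_1, x_2, \dots, x_n)$, where $n$ is the sentence length, $x_i \in \mathcal{V}_s$ and $\mathcal{V}_s$ is the source vocabulary, the system summarizes $x$ by producing $y = (y_1, y_2, \dots, y_l)$, where $l \le n$ is the summary length, $y_i \in \mathcal{V}_t$ and $\mathcal{V}_t$ is the target vocabulary.
If $\{y_1, \dots, y_l\} \subseteq \{x_1, \dots, x_n\}$, which means all words in the summary must appear in the given input, we denote this as extractive sentence summarization. If $\{y_1, \dots, y_l\} \not\subseteq \{x_1, \dots, x_n\}$, which means not all words in the summary come from the input sentence, we denote this as abstractive sentence summarization. Table 1 provides an example. We focus on the abstractive sentence summarization task in this paper.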
Table 1:
Input: South Korean President Kim Young-Sam left here Wednesday on a week - long state visit to Russia and Uzbekistan for talks on North Korea ’s nuclear confrontation and ways to strengthen bilateral ties .
Output: Kim leaves for Russia for talks on NKorea nuclear standoff
As shown in Figure 2, our model consists of a sentence encoder using the Gated Recurrent Unit (GRU) (Cho et al., 2014), a selective gate network and an attention-equipped GRU decoder. First, the bidirectional GRU encoder reads the input words $x = (x_1, x_2, \dots, x_n)$ and builds its representation $(h_1, h_2, \dots, h_n)$. Then the selective gate selects and filters the word representations according to the sentence meaning representation to produce a tailored sentence word representation for the abstractive sentence summarization task. Lastly, the GRU decoder produces the output summary with attention to the tailored representation. In the following sections, we introduce the sentence encoder, the selective mechanism, and the summary decoder respectively.
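The role of the sentence encoder is to read the input sentence and construct the basic sentence representation. Here we employ a bidirectional GRU (BiGRU) as the recurrent unit, where GRU is defined as:
$$z_i = \sigma(W_z [x_i, h_{i-1}]) \tag{1}$$
$$r_i = \sigma(W_r [x_i, h_{i-1}]) \tag{2}$$
$$\tilde{h}_i = \tanh(W_h [x_i, r_i \odot h_{i-1}]) \tag{3}$$
$$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i \tag{4}$$
where $W_z$, $W_r$ and $W_h$ are weight matrices.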
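The BiGRU consists of a forward GRU and a backward GRU. The forward GRU reads the input sentence word embeddings from left to right and gets a sequence of hidden states, $(\overrightarrow{h}_1, \overrightarrow{h}_2, \dots, \overrightarrow{h}_n)$. The backward GRU reads the input sentence embeddings in reverse, from right to left, and results in another sequence of hidden states, $(\overleftarrow{h}_1, \overleftarrow{h}_2, \dots, \overleftarrow{h}_n)$:
$$\overrightarrow{h}_i = \mathrm{GRU}(x_i, \overrightarrow{h}_{i-1}) \tag{5}$$
$$\overleftarrow{h}_i = \mathrm{GRU}(x_i, \overleftarrow{h}_{i+1}) \tag{6}$$
The initial states of the BiGRU are set to zero vectors, i.e., $\overrightarrow{h}_0 = 0$ and $\overleftarrow{h}_{n+1} = 0$. After reading the sentence, the forward and backward hidden states are concatenated, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, to get the basic sentence representation.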
In the sequence-to-sequence machine translation (MT) model, the encoder and decoder are responsible for mapping input sentence information to a list of vectors and decoding the sentence representation vectors to generate an output sentence (Bahdanau et al., 2015). Some previous works apply this framework to summarization generation tasks (Nallapati et al., 2016; Gu et al., 2016; Gulcehre et al., 2016). However, abstractive sentence summarization is different from MT in two ways. First, there is no explicit alignment relationship between the input sentence and the output summary except for the common words. Second, the summarization task needs to keep the highlights and remove the unnecessary information, while MT needs to keep all information literally.
Herein, we propose a selective mechanism to model the selection process for abstractive sentence summarization. The selective mechanism extends the sequence-to-sequence model by constructing a tailored representation for the abstractive sentence summarization task. Concretely, the selective gate network in our model takes two vector inputs, the sentence word vector $h_i$ and the sentence representation vector $s$. The sentence word vector $h_i$ is the output of the BiGRU encoder and represents the meaning and context information of word $x_i$. The sentence vector $s$ is used to represent the meaning of the sentence. For each word $x_i$, the selective gate network generates a gate vector $\mathrm{sGate}_i$ using $h_i$ and $s$, then the tailored representation is constructed, i.e., $h'_i$.
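In detail, we concatenate the last forward hidden state $\overrightarrow{h}_n$ and backward hidden state $\overleftarrow{h}_1$ as the sentence representation $s$:
$$s = [\overleftarrow{h}_1; \overrightarrow{h}_n] \tag{7}$$
For each time step $i$, the selective gate takes the sentence representation $s$ and the BiGRU hidden state $h_i$ as inputs to compute the gate vector $\mathrm{sGate}_i$:
$$\mathrm{sGate}_i = \sigma(W_s h_i + U_s s + b) \tag{8}$$
$$h'_i = h_i \odot \mathrm{sGate}_i \tag{9}$$
where $W_s$ and $U_s$ are weight matrices, $b$ is the bias vector, $\sigma$ denotes the sigmoid activation function, and $\odot$ is element-wise multiplication. After the selective gate network, we obtain another sequence of vectors $(h'_1, h'_2, \dots, h'_n)$. This new sequence is then used as the input sentence representation for the decoder to generate the summary.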
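To make the gating step concrete, the following is a minimal sketch in plain Python rather than the DyNet implementation. It assumes the gate is a sigmoid of an affine function of the word vector and the sentence vector, multiplied element-wise into the word vector. The names `selective_gate`, `matvec`, `W_s`, `U_s` and `b`, as well as the toy values, are illustrative assumptions.

```python
import math


def sigmoid(value):
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-value))


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def selective_gate(hidden_states, sentence_vec, W_s, U_s, b):
    """Return (gates, tailored), one gate and one gated vector per word.

    hidden_states: encoder outputs, one vector per input word
    sentence_vec:  sentence representation shared by all words
    W_s, U_s, b:   hypothetical gate parameters (illustrative names)
    """
    # The sentence term is identical for every word, so compute it once.
    sentence_term = matvec(U_s, sentence_vec)
    gates, tailored = [], []
    for h in hidden_states:
        word_term = matvec(W_s, h)
        gate = [sigmoid(a + c + bias)
                for a, c, bias in zip(word_term, sentence_term, b)]
        gates.append(gate)
        # Element-wise product scales each dimension of h by its gate value.
        tailored.append([h_j * g_j for h_j, g_j in zip(h, gate)])
    return gates, tailored


# Toy usage with 2-dimensional vectors and arbitrary parameters.
states = [[1.0, -2.0], [0.5, 0.5]]
sentence = [1.0, 0.0]
W = [[0.3, 0.0], [0.0, 0.3]]
U = [[0.2, 0.1], [0.1, 0.2]]
gates, tailored = selective_gate(states, sentence, W, U, [0.0, 0.0])
```

Because every gate value lies in (0, 1), each tailored vector is a dimension-wise attenuated copy of the encoder state, so the gate can suppress information but never amplify it.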
On top of the sentence encoder and the selective gate network, we use GRU with attention as the decoder to produce the output summary.
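At each decoding time step $t$, the GRU reads the previous word embedding $w_{t-1}$ and previous context vector $c_{t-1}$ as inputs to compute the new hidden state $s_t$:
$$s_t = \mathrm{GRU}(w_{t-1}, c_{t-1}, s_{t-1}) \tag{10}$$
To initialize the GRU hidden state, we use a linear layer with the last backward encoder hidden state $\overleftarrow{h}_1$ as input:
$$s_0 = \tanh(W_d \overleftarrow{h}_1 + b) \tag{11}$$
where $W_d$ is the weight matrix and $b$ is the bias vector.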
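The context vector $c_t$ for the current time step $t$ is computed through the concatenate attention mechanism (Luong et al., 2015), which matches the decoder state $s_{t-1}$ with each encoder hidden state $h'_i$ to get an importance score. The importance scores are then normalized to get the current context vector by weighted sum:
$$e_{t,i} = v_a^{\top} \tanh(W_a s_{t-1} + U_a h'_i) \tag{12}$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})} \tag{13}$$
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h'_i \tag{14}$$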
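The attention step can be sketched in plain Python as follows. This is an illustration under assumptions, not the authors' code: it uses the common additive scoring form, a weight vector applied to the tanh of a sum of two projections, followed by a softmax and a weighted sum. The function and parameter names (`attention_context`, `W_a`, `U_a`, `v_a`) and the toy values are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def attention_context(decoder_state, encoder_states, W_a, U_a, v_a):
    """Return (weights, context) for one decoding step."""
    projected_state = matvec(W_a, decoder_state)
    scores = []
    for h in encoder_states:
        projected_h = matvec(U_a, h)
        hidden = [math.tanh(a + c) for a, c in zip(projected_state, projected_h)]
        scores.append(sum(v * x for v, x in zip(v_a, hidden)))
    # Softmax over source positions; subtracting the max avoids overflow.
    top = max(scores)
    exps = [math.exp(score - top) for score in scores]
    total = sum(exps)
    weights = [value / total for value in exps]
    # The context vector is the attention-weighted sum of encoder states.
    dim = len(encoder_states[0])
    context = [sum(w * h[j] for w, h in zip(weights, encoder_states))
               for j in range(dim)]
    return weights, context


# Toy usage: 1-dimensional decoder state, two 2-dimensional encoder states.
weights, context = attention_context(
    [0.1],
    [[1.0, 0.0], [3.0, 2.0]],
    [[1.0], [1.0]],            # hypothetical W_a (2 x 1)
    [[1.0, 0.0], [0.0, 1.0]],  # hypothetical U_a (2 x 2)
    [0.5, -0.5],               # hypothetical v_a
)
```

In the model described here, the encoder states passed to such a function would be the gated vectors produced by the selective gate network rather than the raw BiGRU outputs.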
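We then combine the previous word embedding $w_{t-1}$, the current context vector $c_t$, and the decoder state $s_t$ to construct the readout state $r_t$. The readout state is then passed through a maxout hidden layer (Goodfellow et al., 2013) to predict the next word with a softmax layer over the decoder vocabulary.
$$r_t = W_r w_{t-1} + U_r c_t + V_r s_t \tag{15}$$
$$m_t = [\max\{r_{t,2j-1}, r_{t,2j}\}]^{\top}_{j=1,\dots,d} \tag{16}$$
$$p(y_t \mid y_1, \dots, y_{t-1}) = \mathrm{softmax}(W_o m_t) \tag{17}$$
where $W_a$, $U_a$, $W_r$, $U_r$, $V_r$ and $W_o$ are weight matrices. Readout state $r_t$ is a $2d$-dimensional vector, and the maxout layer (Equation 16) picks the max value for every two numbers in $r_t$ and produces a $d$-dimensional vector $m_t$.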
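As a small illustration of the maxout step, the sketch below halves an even-length readout vector by keeping the larger value of each consecutive pair, then applies a softmax output layer. The helper names `maxout`, `softmax` and `predict_next_word`, and the output matrix `W_o` used in the example, are assumptions made for this sketch.

```python
import math


def maxout(readout):
    """Keep the larger value of each consecutive pair (2d inputs -> d outputs)."""
    if len(readout) % 2 != 0:
        raise ValueError("readout length must be even")
    return [max(readout[k], readout[k + 1]) for k in range(0, len(readout), 2)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    top = max(logits)
    exps = [math.exp(value - top) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def predict_next_word(readout, W_o):
    """Return a probability distribution over a (toy) decoder vocabulary."""
    pooled = maxout(readout)
    logits = [sum(w * m for w, m in zip(row, pooled)) for row in W_o]
    return softmax(logits)


# Toy usage: a 4-dimensional readout state and a 3-word vocabulary.
pooled = maxout([1.0, 3.0, -2.0, -5.0])
probs = predict_next_word([1.0, 3.0, -2.0, -5.0],
                          [[0.1, 0.2], [0.0, 0.0], [-0.3, 0.4]])
```

Here `pooled` is `[3.0, -2.0]`: the pairs `(1.0, 3.0)` and `(-2.0, -5.0)` each contribute their maximum.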
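We train the model by minimizing the negative log-likelihood of the reference summaries:
$$J(\theta) = -\frac{1}{|\mathcal{D}|} \sum_{(x, y) \in \mathcal{D}} \log p(y \mid x) \tag{18}$$
where $\mathcal{D}$ denotes a set of parallel sentence-summary pairs and $\theta$ is the model parameter. We use Stochastic Gradient Descent (SGD) with mini-batch to learn the model parameter.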
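For illustration, the sketch below computes a mini-batch loss under the assumption that the training objective is the average negative log-likelihood of the reference summaries, with each summary's probability factored into per-token probabilities. The function name and the toy probabilities are hypothetical.

```python
import math


def negative_log_likelihood(batch_token_probs):
    """Average negative log-likelihood over a mini-batch.

    batch_token_probs: one list per reference summary, holding the model
    probability assigned to each reference token given its prefix.
    """
    total = 0.0
    for token_probs in batch_token_probs:
        # log p(y | x) is the sum of the per-token log-probabilities.
        total -= sum(math.log(p) for p in token_probs)
    return total / len(batch_token_probs)


# Toy usage: two summaries with made-up token probabilities.
loss = negative_log_likelihood([[0.5, 0.5], [0.25]])
```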
In this section we introduce the dataset we use, the evaluation metric, the implementation details, the baselines we compare to, and the performance of our system.
For training, we use a parallel corpus constructed from the Annotated English Gigaword dataset (Napoles et al., 2012), following Rush et al. (2015). The parallel corpus is produced by pairing the first sentence and the headline in the news article with some heuristic rules. We use the script (https://github.com/facebook/NAMAS) released by Rush et al. (2015) to pre-process and extract the training and development datasets. The script performs various basic text normalization, including PTB tokenization, lower-casing, replacing all digit characters with #, and replacing word types seen less than 5 times with <unk>. The extracted corpus contains about 3.8M sentence-summary pairs for the training set and 189K examples for the development set.
For our test set, we use the English Gigaword, DUC 2004, and Microsoft Research Abstractive Text Compression test sets.
We randomly sample 8000 pairs from the extracted development set as our development set since it is relatively large. For the test set, we use the same randomly held-out test set of 2000 sentence-summary pairs as Rush et al. (2015) (thanks to Rush et al. (2015), we acquired the test set they used). Following Chopra et al. (2016), we remove pairs with empty titles, resulting in slightly different accuracy compared to Rush et al. (2015) for their systems. The cleaned test set contains 1951 sentence-summary pairs.
We also find that, apart from the empty titles, this test set has some invalid lines, such as input sentences containing only one word. Therefore, we further sample 2000 pairs as our internal test set and release it for future work (our development and test sets can be found at https://res.qyzhou.me).
We employ DUC 2004 data for tasks 1 & 2 (Over et al., 2007) in our experiments as one of the test sets since it is too small to train a neural network model on. The dataset pairs each document with 4 different human-written reference summaries which are capped at 75 bytes. It has 500 input sentences with each sentence paired with 4 summaries.
Toutanova et al. (2016) release a new dataset for sentence summarization task by crowdsourcing. This dataset contains approximately 6,000 source text sentences with multiple manually-created summaries (about 26,000 sentence-summary pairs in total). Toutanova et al. (2016) provide a standard split of the data into training, development, and test sets, with 4,936, 448 and 785 input sentences respectively. Since the training set is too small, we only use the test set as one of our test sets. We denote this dataset as MSR-ATC (Microsoft Research Abstractive Text Compression) test set in the following.
Table 2 summarizes the statistics of the three datasets we used.
We employ Rouge (Lin, 2004) as our evaluation metric. Rouge measures the quality of a summary by computing overlapping lexical units, such as unigram, bigram, trigram, and longest common subsequence (LCS). It has become the standard evaluation metric for DUC shared tasks and is popular for summarization evaluation. Following previous work, we use Rouge-1 (unigram), Rouge-2 (bigram) and Rouge-L (LCS) as the evaluation metrics in the reported experimental results.
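To illustrate what the n-gram variants of Rouge measure, here is a simplified single-reference sketch of Rouge-N precision, recall and F1 based on clipped n-gram overlap. It is not the official Rouge script: it omits stemming, multiple references and byte limits. The candidate summary in the example is made up for illustration; the reference is the lower-cased output from Table 1.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n):
    """Return (precision, recall, f1) from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


reference = "kim leaves for russia for talks on nkorea nuclear standoff".split()
candidate = "kim leaves for russia for nuclear talks".split()  # hypothetical output
p2, r2, f2 = rouge_n(candidate, reference, 2)
```

In this example the candidate shares 4 of its 6 bigrams with the reference, which has 9 bigrams, so bigram precision is 4/6 and bigram recall is 4/9.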
The input and output vocabularies are collected from the training data, which have 119,504 and 68,883 word types respectively. We set the word embedding size to 300 and all GRU hidden state sizes to 512. We use dropout (Srivastava et al., 2014) with probability .
We initialize model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We use Adam (Kingma and Ba, 2015) as our optimizing algorithm. For the hyperparameters of Adam optimizer, we set the learning rate, two momentum parameters and respectively, and . During training, we test the model performance (Rouge-2 F1) on the development set every 2,000 batches. We halve the Adam learning rate if the Rouge-2 F1 score drops for twelve consecutive tests on the development set. We also apply gradient clipping (Pascanu et al., 2013) with range during training. To both speed up the training and converge quickly, we use mini-batch size 64 by grid search.
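The learning-rate rule described above could be tracked with a small helper like the sketch below. It rests on two assumptions that the text does not settle: a "drop" means the development score is lower than at the previous evaluation, and the counter restarts after each halving. The class name `HalvingSchedule` and the starting rate in the example are hypothetical.

```python
class HalvingSchedule:
    """Halve the learning rate after `patience` consecutive score drops."""

    def __init__(self, learning_rate, patience=12):
        self.learning_rate = learning_rate
        self.patience = patience
        self.previous_score = None
        self.consecutive_drops = 0

    def update(self, dev_score):
        """Record one development-set evaluation; return the current rate."""
        # Assumption: a drop is a score below the previous evaluation.
        if self.previous_score is not None and dev_score < self.previous_score:
            self.consecutive_drops += 1
        else:
            self.consecutive_drops = 0
        self.previous_score = dev_score
        if self.consecutive_drops >= self.patience:
            self.learning_rate /= 2.0
            self.consecutive_drops = 0  # assumption: restart counting
        return self.learning_rate


# Toy usage with a short patience so the effect is visible.
schedule = HalvingSchedule(learning_rate=1.0, patience=3)
rates = [schedule.update(score) for score in [10.0, 9.0, 8.0, 7.0, 8.0]]
```

With a patience of 3, the rate is halved at the fourth evaluation, the third consecutive drop, and stays there when the score recovers.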
We use beam search to generate multiple summary candidates to get better results. To avoid favoring shorter outputs, we average the ranking score along the beam path by dividing it by the number of generated words. To both decode fast and get better results, we set the beam size to 12 in our experiments.
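The length normalization described above can be sketched as follows, assuming each beam candidate carries the log-probabilities of its generated words. Ranking by the summed log-probability favors short outputs, because every extra word adds a negative term; dividing by the number of words removes that bias. The function names and the toy candidates are illustrative.

```python
def normalized_score(token_log_probs):
    """Average log-probability per generated word."""
    return sum(token_log_probs) / len(token_log_probs)


def rank_candidates(candidates):
    """Sort (tokens, token_log_probs) pairs, best normalized score first."""
    return sorted(candidates,
                  key=lambda item: normalized_score(item[1]),
                  reverse=True)


# Toy candidates: the short one wins on the raw sum, the long one on average.
short = (["talks"], [-1.2])
longer = (["kim", "leaves", "for", "russia"], [-0.4, -0.4, -0.4, -0.4])
best_by_sum = max([short, longer], key=lambda item: sum(item[1]))
best_by_average = rank_candidates([short, longer])[0]
```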
We compare SEASS model with the following state-of-the-art baselines:
Based on ABS model, Rush et al. (2015) further tune their model using DUC 2003 dataset, which leads to improvements on DUC 2004 test set.
As an extension of the ABS model, Chopra et al. (2016) use a convolutional attention-based encoder and RNN decoder, which outperforms the ABS model.
Nallapati et al. (2016) use a full RNN sequence-to-sequence encoder-decoder model and add some features to enhance the encoder, such as POS tag, NER, and so on.
We also implement a sequence-to-sequence model with attention as our baseline and denote it as “s2s+att”.
We report Rouge F1, Rouge recall and Rouge F1 for English Gigaword, DUC 2004 and MSR-ATC test sets respectively. We use the official Rouge script (version 1.5.5, http://www.berouge.com/) to evaluate the summarization quality in our experiments. For the English Gigaword (Rouge evaluation options, the same as Rush et al. (2015): -m -n 2 -w 1.2) and MSR-ATC (Rouge evaluation options: -m -n 2 -w 1.2) test sets, the outputs have different lengths so we evaluate the system with the F1 metric. As for the DUC 2004 test set (Rouge evaluation options: -m -b 75 -n 2 -w 1.2), the task requires the system to produce a fixed length summary (75 bytes), therefore we employ Rouge recall as the evaluation metric. To satisfy the length requirement, we decode the output summary to a roughly expected length following Rush et al. (2015).
We acquire the test set from Rush et al. (2015) so we can make fair comparisons to the baselines.
In the result tables, a marked baseline score indicates that our SEASS model with beam search performs significantly better than that baseline, as given by the 95% confidence interval in the official Rouge script.
In Table 3, we report the Rouge F1 score of our model and the baseline methods. Our SEASS model with beam search outperforms all baseline models by a large margin. Even with greedy search, our model still performs better than other methods which used beam search. For the popular Rouge-2 metric, our SEASS model achieves 17.54 F1 score and performs better than the previous works. Compared to the ABS model, our model has a 6.22-point Rouge-2 F1 gain. Compared to the highest-scoring baseline, CAs2s, our model achieves a 1.57 Rouge-2 F1 improvement and passes the significance test according to the official Rouge script.
Table 4 summarizes our results on our internal test set using Rouge F1 evaluation metrics. The performance on our internal test set is comparable to our development set, which achieves 24.58 Rouge-2 F1 and outperforms the baselines.
We evaluate our model using the Rouge recall score since the reference summaries of the DUC 2004 test set are capped at 75 bytes. Therefore, we decode the summary to a fixed length 18 to ensure that the generated summary satisfies the minimum length requirement. As summarized in Table 5, our SEASS outperforms all the baseline methods and achieves 29.21, 9.56 and 25.51 for Rouge 1, 2 and L recall. Compared to the ABS+ model which is tuned using DUC 2003 data, our model performs significantly better by 1.07 Rouge-2 recall score and is trained only with English Gigaword sentence-summary data without being tuned using DUC data.
We report the full length Rouge F1 score on the MSR-ATC test set in Table 6. To the best of our knowledge, this is the first work that reports Rouge metric scores on the MSR-ATC dataset. Note that we only compare our model with ABS since the others are not publicly available. Our SEASS achieves 10.63 Rouge-2 F1 and outperforms the s2s+att baseline by 1.02 points.
In this section, we first compare the performance of SEASS with the s2s+att baseline model to illustrate that the proposed method succeeds in selecting information and building tailored representation for abstractive sentence summarization. We then analyze selective encoding by visualizing the heat map.
We further test the SEASS model with different sentence lengths on English Gigaword test sets, which are merged from the Rush et al. (2015) test set and our internal test set. The length of sentences in the test sets ranges from 10 to 80. We group the sentences with an interval of 4, which gives 18 different groups, and we plot the first 14 groups. We find that the performance curve of our SEASS model always lies above that of s2s+att by a certain margin. For the groups of 16, 20, 24, 32, 56 and 60, the SEASS model obtains big improvements compared to the s2s+att model. Overall, these improvements on all groups indicate that the selective encoding method benefits the abstractive sentence summarization task.
Since the output of the selective gate network is a high dimensional vector, it is hard to visualize all the gate values. We use the method in Li et al. (2016) to visualize the contribution of the selective gate to the final output, which can be approximated by the first derivative. Given sentence words $x$ with associated output summary $y$, the trained model associates the pair $(x, y)$ with a score $S_y(x)$. The goal is to decide which gate $g$ associated with a specific word makes the most significant contribution to $S_y(x)$. We approximate $S_y(g)$ by computing the first-order Taylor expansion since the score is a highly non-linear function in the deep neural network models:
$$S_y(g) \approx w(g)^{\top} g + b \tag{19}$$
where $w(g)$ is the first derivative of $S_y$ with respect to the gate $g$:
$$w(g) = \left. \frac{\partial S_y}{\partial g} \right|_{g} \tag{20}$$
We then draw the Euclidean norm of the first derivative of the output with respect to the selective gate associated with each input word.
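The per-word saliency value can be illustrated with the sketch below. It substitutes central finite differences for backpropagation, so it only approximates the derivative and is practical only for tiny toy functions; a real implementation would take the gradient from the framework's automatic differentiation. The score function `toy_score` and its coefficients are made up for the example.

```python
import math


def gate_saliency(score_fn, gates, eps=1e-6):
    """Euclidean norm of d(score)/d(gate_i) for each word's gate vector."""
    saliency = []
    for i, gate in enumerate(gates):
        squared_norm = 0.0
        for j in range(len(gate)):
            # Perturb one gate component up and down, leaving the rest fixed.
            plus = [list(g) for g in gates]
            minus = [list(g) for g in gates]
            plus[i][j] += eps
            minus[i][j] -= eps
            derivative = (score_fn(plus) - score_fn(minus)) / (2.0 * eps)
            squared_norm += derivative * derivative
        saliency.append(math.sqrt(squared_norm))
    return saliency


# Hypothetical score: a linear function of the gates, so the gradient with
# respect to each gate vector equals the matching coefficient row.
COEFFICIENTS = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]


def toy_score(gates):
    return sum(c * g
               for row, gate in zip(COEFFICIENTS, gates)
               for c, g in zip(row, gate))


heat = gate_saliency(toy_score, [[0.5, 0.5], [0.9, 0.1], [0.2, 0.8]])
```

For this linear toy score the saliency values approximate the row norms of the coefficients, about 5, 0 and 1, so the first word would appear hottest in a heat map.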
Figure 3 shows an example of the first derivative heat map, in which most of the important words are selected by the selective gate, such as “europe”, “slammed”, “unacceptable”, “conditions”, and “france”. We can observe that the selective gate determines the importance of each word before the decoder, which relieves the decoder's burden by providing a tailored sentence encoding.
This paper proposes a selective encoding model which extends the sequence-to-sequence model for abstractive sentence summarization task. The selective mechanism mimics one of the human summarizers’ behaviors, selecting important information before writing down the summary. With the proposed selective mechanism, we build an end-to-end neural network summarization model which consists of three phases: encoding, selection, and decoding. Experimental results show that the selective encoding model greatly improves the performance with respect to the state-of-the-art methods on English Gigaword, DUC 2004 and MSR-ATC test sets.
We thank Chuanqi Tan, Junwei Bao, Shuangzhi Wu and the anonymous reviewers for their helpful comments. We also thank Alexander M. Rush for providing the dataset for comparison and helpful discussions.
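Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Doha, Qatar, pages 1724–1734.
Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Lisbon, Portugal, pages 379–389.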