Learning Syntactic and Dynamic Selective Encoding for Document Summarization

03/25/2020 ∙ by Haiyang Xu, et al. ∙ Beijing Didi Infinity Technology and Development Co., Ltd.

Text summarization aims to generate a headline or a short summary consisting of the major information of the source text. Recent studies employ the sequence-to-sequence framework to encode the input with a neural network and generate an abstractive summary. However, most studies feed the encoder with semantic word embeddings but ignore the syntactic information of the text. Further, although previous studies proposed a selective gate to control the information flow from the encoder to the decoder, the gate is static during decoding and cannot differentiate the information based on the decoder states. In this paper, we propose a novel neural architecture for document summarization. Our approach makes the following contributions: first, we incorporate syntactic information such as constituency parsing trees into the encoding sequence to learn both the semantic and syntactic information from the document, resulting in more accurate summaries; second, we propose a dynamic gate network to select the salient information based on the context of the decoder state, which is essential to document summarization. The proposed model has been evaluated on the CNN/Daily Mail summarization datasets, and the experimental results show that the proposed approach outperforms baseline approaches.




I Introduction

Text summarization is a very challenging task in natural language processing (NLP) and information retrieval. Existing approaches for text summarization are categorized into two major types: extractive and abstractive. Extractive methods produce summaries by extracting sentences or tokens from the source text, which yields grammatically correct summaries and preserves the meaning of the original text. However, these approaches rely heavily on the text in the original documents, and the extracted sentences may contain redundant information or lack readability. In contrast, abstractive methods produce summaries by generating new sentences or tokens which do not necessarily appear in the source text. However, abstractive approaches are more difficult in practice because they need to address many NLP problems including document understanding, semantic representation and natural language generation, which are harder than sentence extraction.

Fig. 1: An example of text summarization from the CNN dataset. The colored boxes show the source text and the corresponding summaries written by a human, generated by the original Seq2Seq with attention, and generated by the proposed approach, respectively.

The recent neural sequence-to-sequence (Seq2Seq) approach [sutskever2014sequence] has achieved tremendous success in many NLP tasks such as machine translation [bahdanau2014neural, luong2015effective] and dialogue systems [serban2016building, bordes2016learning]. The essence of Seq2Seq based text summarization methods is an encoder-decoder framework, which first encodes an input sentence to a low dimensional representation and then decodes the abstract representation into an output sequence. As an extension of Seq2Seq methods, attention based Seq2Seq models encode an input sequence to a context vector using an attention mechanism and dynamically calculate the attentive probability distribution at each generation step.


Similar to machine translation, some researchers have applied the neural Seq2Seq model to abstractive text summarization [nallapati2016abstractive, see2017get, tan2017abstractive]. However, there is a significant difference between the two tasks: machine translation aims to capture all the semantic details of the source text, while text summarization focuses only on the salient information, so it is critical to utilize only the key information in the source text rather than the whole text.

Furthermore, the original Seq2Seq with attention method does not learn the syntactic structure of the source text, which is important to text summarization. In Figure 1, a piece of original text is shown at the top. The human-written summary is shown in the next box as the target. The third box shows a model-generated summary using the original Seq2Seq with attention method (baseline). The green text in the target and the red text in the baseline show the summary corresponding to the blue text in the original text. As shown in the figure, the baseline model incorrectly summarizes that “he says he will become the citizens”, mainly because it is not able to capture the internal syntactic structure of the original text (e.g. “300, 000 applicants” is a noun phrase, and “applied to .. ceremony” is an attributive clause for it).

To address these problems, we propose a novel syntactic and dynamic selective encoding method for document summarization. We incorporate structured linguistic information such as parsing trees to learn a more effective sentence representation by serializing parsing trees into an encoder sequence as in [li2017modeling]. In this way, the encoding sequence contains both the semantic (words) and the syntactic (parsing symbols) information, both of which are fed into the decoder for summary generation.

In addition, a document may contain hundreds of words and it is hard to directly encode the key information from the whole source text. A selective gate was proposed in a previous study [Zhou2017Selective] to filter out the secondary information. However, the salient information varies across decoding stages, so it is better to select the salient information based on the context of the decoder states. Therefore, at each decoding step, we use a dynamic selective gate network to control the salient information flow according to the document information, the current decoder state and the previous gated encoder state.

In this way, our approach can learn a better representation of the sentences and select the salient information from the long input sequence for summary generation. As an example, we show the summary generated by our approach in Figure 1, where the text correctly summarizes the original sentences from the document. For reference, Figure 3 shows the constituency parsing tree for the source sentence, and Figure 4 shows how the states of the dynamic selective gates change over an input sentence. We also conduct experiments on the large-scale CNN/Daily Mail datasets, and the experimental results show that our model outperforms baseline summarization models.

We organize the paper as follows: Sec.II introduces the related work. Sec.III describes our proposed method. In Sec.IV we present the experiment settings and illustrate experimental results. We conclude the paper in Sec.V.

II Related Work

In general, there are two broad approaches to automatic text summarization: extractive and abstractive. Extractive methods work by selecting important sentences or passages in the original text and reproducing them as summary [Wong2008Extractive, Wang2009Multi, celikyilmaz2010hybrid, Alguliev2011MCMR, Erkan2011LexRank, Alguliev2013Multiple].

In contrast, abstractive summarization techniques generate new, shorter texts that seldom consist of sentences identical to those in the document. Banko et al. [banko2000headline] apply statistical models of the term selection and term ordering processes to produce short summaries. Bonnie and Dorr [bonnie2004bbn] implement a system using a combination of linguistically motivated sentence compression techniques. Other notable methods for abstractive summarization include a discriminative tree-to-tree transduction model [Cohn2008Sentence] and a quasi-synchronous grammar approach utilizing both context-free parses and dependency parses [woodsend2010generation].

Recently, researchers have started utilizing deep learning frameworks in extractive and abstractive summarization. For extractive methods, Nallapati et al. use recurrent neural networks (RNNs) to read the article, obtain the representations of the sentences, and select important sentences. Yasunaga et al. [yasunaga2017graph] combine RNNs with graph convolutional networks (GCNs) to compute the salience of each sentence. Narayan et al. [narayan2018document] propose a framework composed of a hierarchical encoder based on convolutional neural networks (CNNs) and an attention-based extractor with attention over external information.

More work has been published recently on abstractive methods. Rush et al. [rush2015neural] first apply modern neural networks to text summarization by using a local attention-based model to generate each word conditioned on the input sentence. Several works have extended this approach and achieved further improvements in performance. Chopra et al. [chopra2016abstractive] use a similar convolutional attention-based encoder and replace the decoder with a conditional RNN. Nallapati et al. [nallapati2016abstractive] apply an encoder-decoder RNN framework with hierarchical attention and feature-rich embedding vectors. Tan et al. [tan2017abstractive] propose a graph-based attention mechanism to summarize the salient information of a document. However, the above neural models cannot emit unseen words since the vocabulary is fixed at training time. To solve this problem, the pointer network [vinyals2015pointer, see2017get] and CopyNet [gu2016incorporating] have been proposed to allow both copying words from the original text and generating words from a fixed vocabulary. Hsu et al. [hsu2018unified] combine the strengths of extractive and abstractive summarization and propose an inconsistency loss. Zhou et al. [Zhou2017Selective] extend the general encoder-decoder framework with a selective gate network, which helps improve encoding effectiveness and relieve the burden of the decoder.

Our work makes several significant improvements compared with previous studies. First, to incorporate syntactic information, previous works only use unstructured linguistic information such as part-of-speech (POS) tags and named entities [nallapati2016abstractive]. In this work, we utilize a structured syntactic parsing tree to learn a more effective context vector, which improves the performance of word prediction and alleviates the repetition problem. Second, to choose the salient information, previous works employ a selective gate network which is static during the decoding stage [Zhou2017Selective]. We improve the gate network and let the states of the gate adjust dynamically according to the context of the decoder states, which is essential to document summarization.

Fig. 2: Overall architecture of the proposed syntactic and dynamic selective encoding model. The parsing tree of each sentence is serialized and fed into the encoder to help capture syntactic meaning. In the decoding stage, the decoder benefits from the dynamic selective gate, which drops trivial words, as well as from the attention mechanism influenced by the syntactic vector.

III Methodology

In this section, we describe the proposed model. The architecture of the syntactic and dynamic selective encoding model is shown in Figure 2. It consists of the syntactic sequence encoder, the dynamic selective gates, and the pointer-generator network with a syntactic attention decoder.

III-A Syntactic Sequence Encoder

Previous studies usually treat a document as a sequence of words but ignore the syntactic structure of document. To leverage the syntactic knowledge, we design a syntactic sequence encoder to learn document representations.

A document is denoted as a sequence of sentences $D = (s_1, s_2, \dots, s_N)$, where $N$ is the number of sentences in the document. For each sentence $s_j$, we apply a syntactic parser to generate a parsing tree, and then adopt a depth-first traversal [li2017modeling] to serialize the parsing tree into a sequence of tokens $(x_{j,1}, x_{j,2}, \dots, x_{j,M_j})$, where $M_j$ is the number of tokens in the serialized parsing tree. Note that a token is not necessarily a word. In a parsing tree, a leaf node represents a word, while a non-leaf node represents a parsing symbol, which is either a phrase label or a POS tag.
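The serialization step can be illustrated with a minimal Python sketch. The nested-tuple tree encoding, the function name `serialize` and the choice of a pre-order traversal without closing brackets are assumptions made for illustration; the text only states that a depth-first traversal is used.

```python
from typing import List, Tuple, Union

# Hypothetical tree encoding (an assumption of this sketch): a leaf is a
# word (str); an internal node is a tuple (label, child_1, ..., child_k)
# whose label is a phrase label or a POS tag.
Tree = Union[str, Tuple]


def serialize(tree: Tree) -> List[Tuple[str, bool]]:
    """Pre-order depth-first traversal returning (token, is_symbol) pairs."""
    if isinstance(tree, str):
        return [(tree, False)]            # leaf node: a word
    label, *children = tree
    tokens = [(label, True)]              # non-leaf node: a parsing symbol
    for child in children:
        tokens.extend(serialize(child))
    return tokens


tree = ("S",
        ("NP", ("NNP", "Mary")),
        ("VP", ("VBZ", "hates"), ("NP", ("NNP", "Lucy"))))
sequence = serialize(tree)
print([token for token, _ in sequence])
# ['S', 'NP', 'NNP', 'Mary', 'VP', 'VBZ', 'hates', 'NP', 'NNP', 'Lucy']
```

The `is_symbol` flag is kept alongside each token so that later stages can tell words from parsing symbols.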

Then, for a document, we concatenate all the serialized parsing trees into a long sequence $X = (x_1, x_2, \dots, x_n)$. Here $n$ is the total number of tokens in all parsing trees. To model the sequential information, we first use an embedding vector $e_i$ to represent the token $x_i$, which can be either a word or a symbol in the parsing tree. Then we employ a bidirectional long short-term memory (BiLSTM) [graves2005framewise] to encode the sequence information:

$$\overrightarrow{h}_i = \mathrm{LSTM}(e_i, \overrightarrow{h}_{i-1}), \qquad \overleftarrow{h}_i = \mathrm{LSTM}(e_i, \overleftarrow{h}_{i+1})$$

where $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ denote the hidden states of the forward LSTM and the backward LSTM, respectively. The whole representation of the $i$-th token is the concatenation of the hidden states from both directions, $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$.

To model the syntactic information, we apply max-pooling over all the hidden states corresponding to the parsing symbols to produce the syntactic vector:

$$p = \max_{i \in S} h_i$$

where the maximum is taken element-wise and $S$ is the set of all parsing symbols in the document.

As shown in Figure 2, the syntactic sequence encoder takes the serialized parsing tree as input. The BiLSTM computes the hidden states for both the words (e.g. “Mary”, “hates”, “Lucy”) and the parsing symbols (e.g. “NP”, “VP”, “NNP”). The word hidden states are used for further computation, while the hidden states of the parsing symbols are max-pooled to generate a syntactic vector.
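The max-pooling over parsing-symbol states described above can be sketched in plain Python. The function name `syntactic_vector`, the boolean mask `is_symbol` and the toy hidden states are illustrative assumptions; in the model the hidden states would come from the BiLSTM.

```python
from typing import List, Sequence


def syntactic_vector(hidden_states: Sequence[Sequence[float]],
                     is_symbol: Sequence[bool]) -> List[float]:
    """Element-wise max over the hidden states of parsing symbols only."""
    symbol_states = [h for h, flag in zip(hidden_states, is_symbol) if flag]
    if not symbol_states:
        raise ValueError("the sequence contains no parsing symbols")
    # zip(*rows) iterates over dimensions, so max() pools each dimension.
    return [max(column) for column in zip(*symbol_states)]


# Toy 2-dimensional hidden states for four tokens; the third is a word.
hidden_states = [[0.1, 0.9], [0.4, -0.2], [0.7, 0.3], [0.2, 0.5]]
is_symbol = [True, True, False, True]
p = syntactic_vector(hidden_states, is_symbol)
print(p)  # [0.4, 0.9]
```

Note that the word state `[0.7, 0.3]` is excluded from pooling even though its first component is the largest.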

III-B Dynamic Selective Gates

As discussed in Sec. I, for document summarization, not all the information in the source should be fed into the decoder; it is more important to select only the salient information and remove the unnecessary information from the input. Herein, we propose a novel dynamic selective gate to model the generation process of the salient information. We use a parameterized gate network to select the useful information for summary generation. The gate takes as input both the state of the source and the previous state of the generated target, as well as a low-dimensional representation of the whole document $d$, which is the concatenation of the last state of the forward LSTM $\overrightarrow{h}_n$ and the first state of the backward LSTM $\overleftarrow{h}_1$.

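Specifically, for the $i$-th encoder step and the $t$-th decoder step, the state of the dynamic selective gate $g_{i,t}$ is calculated as:

$$g_{i,t} = \sigma\left(W_g h'_{i,t-1} + U_g s_{t-1} + V_g d + b_g\right), \qquad h'_{i,t} = h_i \odot g_{i,t}$$

where $s_{t-1}$ is the state of the $(t-1)$-th step from the decoder LSTM, which will be discussed in the next subsection, $h_i$ is the encoder hidden state of the $i$-th token, $d$ is the document representation, $h'_{i,t}$ is the gated hidden state of the encoder, $\sigma$ denotes the sigmoid function, and $\odot$ denotes element-wise multiplication. When $t$ equals $1$, $g_{i,0}$ is set to an all-ones vector. $W_g$, $U_g$, $V_g$ and $b_g$ are trainable parameters.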

Note that a previous study [Zhou2017Selective] has utilized a selective gate to control the information flow, but there the gate state depends only on the hidden states of the source text and the selective gate is static during the whole decoding stage. In contrast, the proposed dynamic selective gate depends on both the encoder and the decoder states, so the gate opens only to the information that is useful for the current target output rather than for the whole target sequence. This is critical to document summarization, because the document is long and a static gate may select much irrelevant information from the source at every decoding step. We will show the effectiveness of the dynamic selective gate in Sec. IV-E.
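A minimal sketch of one gate update for a single encoder position may help make the mechanism concrete. The affine-plus-sigmoid parameterization, the function names and the parameter names `W`, `U`, `V`, `b` are assumptions of this sketch, chosen only to match the inputs named in the text (previous gated encoder state, previous decoder state, document vector).

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, v: Vector) -> List[float]:
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def dynamic_gate(h_i: Vector, prev_gated_i: Vector, s_prev: Vector, d: Vector,
                 W: Matrix, U: Matrix, V: Matrix, b: Vector
                 ) -> Tuple[List[float], List[float]]:
    """Return (gate, gated encoder state) for one encoder position.

    Assumed form: gate = sigmoid(W h'_prev + U s_prev + V d + b),
    gated = h_i * gate (element-wise).
    """
    pre = [x + y + z + bias
           for x, y, z, bias in zip(matvec(W, prev_gated_i),
                                    matvec(U, s_prev),
                                    matvec(V, d), b)]
    gate = [1.0 / (1.0 + math.exp(-z)) for z in pre]
    gated = [hv * gv for hv, gv in zip(h_i, gate)]
    return gate, gated


zeros = [[0.0, 0.0], [0.0, 0.0]]
h_i = [2.0, -4.0]
# First decoding step: the previous gate is all ones, so h'_prev equals h_i.
gate, gated = dynamic_gate(h_i, h_i, [0.0, 0.0], [0.0, 0.0],
                           zeros, zeros, zeros, [0.0, 0.0])
print(gate, gated)  # [0.5, 0.5] [1.0, -2.0]
```

Because the gate is recomputed at every decoder step from the decoder state, the same encoder position can be open at one step and nearly closed at the next, which is the behavior a static gate cannot provide.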

III-C Pointer-generator Network with Syntactic Attention Decoder

We use the recently proposed pointer-generator network [see2017get] for decoding, which allows either copying words from the original text via pointers or generating new words from the source vocabulary, in order to handle the out-of-vocabulary (OOV) problem.

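Specifically, the attention strength $e_{i,t}$ between the $i$-th source step and the $t$-th target step is calculated from the current decoder state $s_t$, the current gated encoder hidden state $h'_{i,t}$, and the document syntactic vector $p$. The context vector $c_t$ is calculated as the attention-weighted summation of the gated encoding hidden states:

$$e_{i,t} = v^\top \tanh\left(W_h h'_{i,t} + W_s s_t + W_p p\right), \qquad a_{i,t} = \frac{\exp(e_{i,t})}{\sum_k \exp(e_{k,t})}, \qquad c_t = \sum_i a_{i,t} h'_{i,t}$$

where $v$, $W_h$, $W_s$, $W_p$ are trainable parameters.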
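The attention step can be sketched as follows. The additive (tanh) scoring form and the parameter names `W_h`, `W_s`, `W_p`, `v` are assumptions of this sketch; the text only states which three inputs the attention strength depends on and that the context vector is the attention-weighted sum of the gated encoder states.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(m: Matrix, v: Vector) -> List[float]:
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def attention_context(gated_states: Sequence[Vector], s_t: Vector, p: Vector,
                      W_h: Matrix, W_s: Matrix, W_p: Matrix, v: Vector
                      ) -> Tuple[List[float], List[float]]:
    """Return (attention weights, context vector) for one decoder step."""
    dec_term = matvec(W_s, s_t)       # decoder-state contribution
    syn_term = matvec(W_p, p)         # syntactic-vector contribution
    scores = []
    for h in gated_states:
        enc_term = matvec(W_h, h)
        hidden = [math.tanh(a + b + c)
                  for a, b, c in zip(enc_term, dec_term, syn_term)]
        scores.append(sum(vi * hi for vi, hi in zip(v, hidden)))
    # Numerically stable softmax over source positions.
    top = max(scores)
    exps = [math.exp(e - top) for e in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(gated_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, gated_states))
               for k in range(dim)]
    return weights, context


identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
gated_states = [[1.0, 0.0], [3.0, 2.0]]
weights, context = attention_context(gated_states, [0.0, 0.0], [0.0, 0.0],
                                     identity, zeros, zeros, [1.0, 1.0])
print(weights, context)
```

With these toy parameters the second source position receives the larger weight, so the context vector is pulled toward `[3.0, 2.0]`.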

An LSTM takes as input the word embedding vector of the previously generated word $y_{t-1}$, the previous context vector $c_{t-1}$, and the previous decoder hidden state $s_{t-1}$ to compute the new decoder state:

$$s_t = \mathrm{LSTM}(y_{t-1}, c_{t-1}, s_{t-1})$$

The current context vector $c_t$ and the current decoder hidden state $s_t$ are then fed into two linear layers, and the probability of each word in the vocabulary is predicted with the softmax function:

$$P_{vocab} = \mathrm{softmax}\left(V'\left(V[s_t; c_t] + b\right) + b'\right)$$

where $V$, $V'$, $b$, $b'$ are trainable parameters.

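Further, the pointer-generator network produces a switch probability $p_{gen}$ to decide whether to generate a word from $P_{vocab}$ or to copy a word from the original source text. $p_{gen}$ is calculated from the context vector $c_t$, the decoder state $s_t$ and the decoder input word $y_{t-1}$. The final probability of the word $w$ is calculated from $p_{gen}$ and the attention distribution:

$$p_{gen} = \sigma\left(w_c^\top c_t + w_s^\top s_t + w_y^\top y_{t-1} + b_{gen}\right), \qquad P(w) = p_{gen} P_{vocab}(w) + (1 - p_{gen}) \sum_{i: x_i = w} a_{i,t}$$

where $a_{i,t}$ is the attention weight on the $i$-th source token $x_i$, and $w_c$, $w_s$, $w_y$, $b_{gen}$ are trainable parameters.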
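The mixing of the generation and copy distributions, as in the pointer-generator network of See et al., can be sketched in a few lines. The function name `final_distribution`, the dictionary representation and the toy numbers are illustrative assumptions.

```python
from collections import defaultdict
from typing import Dict, Sequence


def final_distribution(p_gen: float,
                       vocab_dist: Dict[str, float],
                       attention: Sequence[float],
                       source_words: Sequence[str]) -> Dict[str, float]:
    """Mix the vocabulary distribution with the copy (attention) distribution."""
    final: Dict[str, float] = defaultdict(float)
    for word, prob in vocab_dist.items():
        final[word] += p_gen * prob                  # generate from vocabulary
    for weight, word in zip(attention, source_words):
        final[word] += (1.0 - p_gen) * weight        # copy from the source
    return dict(final)


# "vit" and "jedlicka" are not in the toy vocabulary, so they can only be copied.
vocab_dist = {"the": 0.6, "politician": 0.4}
source_words = ["vit", "jedlicka", "the", "jedlicka"]
attention = [0.1, 0.5, 0.1, 0.3]
dist = final_distribution(0.7, vocab_dist, attention, source_words)
print(dist)
```

A word that occurs several times in the source accumulates the attention mass of all its occurrences, and an out-of-vocabulary source word still receives non-zero probability through the copy term.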

III-D Model Training

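To train the model, we use the negative log-likelihood as the loss for each document. We further adopt the coverage loss from See et al. [see2017get], aiming to handle the repetition problem in text summarization. The coverage at decoding step $t$ for encoding step $i$ is the summation of the attention distributions over all previous decoding steps:

$$\mathrm{cov}_{i,t} = \sum_{t'=1}^{t-1} a_{i,t'}$$

The final loss function at decoding step $t$ is:

$$\mathcal{L}_t = -\log P(w_t^*) + \lambda \sum_i \min\left(a_{i,t}, \mathrm{cov}_{i,t}\right)$$

where $w_t^*$ is the target word, $a_{i,t}$ is the attention weight on the $i$-th source token, and $\lambda$ is the weight of the coverage loss.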
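As a rough illustration of the coverage mechanism adopted from See et al., the sketch below accumulates attention over decoding steps and adds a penalty for attending again to already covered source positions. The function names and the coverage weight of `1.0` are assumptions of this sketch; the weight used in the experiments is not recoverable from the text.

```python
import math
from typing import List, Sequence


def step_loss(target_prob: float, attention: Sequence[float],
              coverage: Sequence[float], cov_weight: float) -> float:
    """Negative log-likelihood plus the weighted coverage penalty."""
    cov_penalty = sum(min(a, c) for a, c in zip(attention, coverage))
    return -math.log(target_prob) + cov_weight * cov_penalty


def update_coverage(coverage: Sequence[float],
                    attention: Sequence[float]) -> List[float]:
    """Coverage is the running sum of all previous attention distributions."""
    return [c + a for c, a in zip(coverage, attention)]


coverage = [0.0, 0.0, 0.0]                 # nothing attended before step 1
attention_1 = [0.7, 0.2, 0.1]
loss_1 = step_loss(0.5, attention_1, coverage, cov_weight=1.0)
coverage = update_coverage(coverage, attention_1)

attention_2 = [0.6, 0.3, 0.1]              # attends to position 0 again
loss_2 = step_loss(0.5, attention_2, coverage, cov_weight=1.0)
print(loss_1, loss_2)
```

At the first step the penalty is zero because the coverage is empty; at the second step the repeated attention on the first position is penalized, which is how the term discourages repetition.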



IV Experiments

In this section, we describe the experiment details including datasets, implementation details, baselines and the results.

IV-A Datasets

We conduct experiments on the CNN/Daily Mail dataset [see2017get] (https://github.com/abisee/pointer-generator), which comprises multi-sentence summaries and has been widely used in automatic text summarization. We use the released scripts (https://github.com/abisee/cnn-dailymail) to obtain the same version of the data, which has 287,227 training pairs, 13,368 validation pairs and 11,490 test pairs. The source documents have 681 words spanning 40 sentences on average, while the summaries consist of 48 words and 3.9 sentences. The dataset is released in two versions: an anonymized version, which has been pre-processed to replace each named entity with a unique identifier, and the original version, which contains the actual entity names. In this work, we use the original text since it requires no pre-processing and is more challenging: the anonymized dataset replaces named entities, which are frequently out of vocabulary, with unique identifiers. In the following experiments, all the models are trained and tested on three different datasets separately: the CNN corpus, the Daily Mail corpus and the combination of the CNN and Daily Mail corpora. Table I shows the detailed statistics of the experiment datasets.

| Data Set | CNN | Daily Mail | CNN/Daily Mail |
|---|---|---|---|
| AvgDocSents | 34.4 | 42 | 40 |
| AvgDocWords | 655 | 692 | 681 |
| AvgSumSents | 3.7 | 4 | 3.92 |
| AvgSumWords | 42 | 42 | 48.3 |

TABLE I: Data statistics for the CNN, Daily Mail and CNN/Daily Mail datasets. AvgDocSents is the average number of sentences per original document and AvgDocWords is the average number of words per original document. AvgSumSents is the average number of sentences per summary and AvgSumWords is the average number of words per summary.

IV-B Implementation

For all experiments, we use a source vocabulary of 50k words and the Stanford Constituency Parser (https://nlp.stanford.edu/software/srparser.html) to obtain the syntactic information of the sentences in the corpora, which includes 16 phrase labels and 32 POS tags. Our model uses 256-dimensional hidden states and 128-dimensional word embedding vectors, and is trained with Adagrad with a learning rate of 0.15 and an initial accumulator value of 0.1. This was found to work best among stochastic gradient descent, Adadelta, momentum, Adam and RMSprop. In addition, we set the maximum source-side length to 1200, and the maximum target-side length to 100 for training and 120 for testing. To both decode fast and get better results, we set the beam size to 4 in our experiments. Furthermore, we add the coverage mechanism to the loss function (Eq. 12) with a weighted coverage loss term.
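To make the optimizer setting concrete, the following sketch shows a single Adagrad update with the learning rate 0.15 and the initial accumulator value 0.1 stated above. The function name `adagrad_step` and the toy weights and gradients are hypothetical, and the update omits any epsilon term, which is safe here only because the accumulator starts above zero.

```python
import math
from typing import List, Sequence, Tuple


def adagrad_step(weights: Sequence[float], grads: Sequence[float],
                 accumulators: Sequence[float],
                 learning_rate: float = 0.15
                 ) -> Tuple[List[float], List[float]]:
    """One Adagrad update: accumulate squared gradients, then scale the step."""
    new_acc = [acc + g * g for acc, g in zip(accumulators, grads)]
    new_weights = [w - learning_rate * g / math.sqrt(acc)
                   for w, g, acc in zip(weights, grads, new_acc)]
    return new_weights, new_acc


weights = [1.0, -2.0]
accumulators = [0.1, 0.1]          # initial accumulator value from the text
weights, accumulators = adagrad_step(weights, [0.5, 0.0], accumulators)
print(weights, accumulators)
```

Parameters that keep receiving large gradients accumulate a large denominator and take progressively smaller steps, while a parameter with zero gradient is left unchanged.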

IV-C Baselines

We compare our proposed model with the following state-of-the-art automatic text summarization systems and techniques, which include both extractive and abstractive methods (the results of the baselines are incomplete on the sub-datasets because some researchers chose to report results on only one sub-dataset):

  • Lead-3 is a standard extractive baseline, which generates a summary simply by selecting the “leading” three sentences from the source document.

  • NN-SE [Cheng2016Neural] utilizes an encoder-decoder framework, which learns the representation of the source through the encoder and classifies the sentences of the document with the decoder.

  • SummaRuNN [nallapati2016abstractive] applies encoder-decoder RNN abstractive framework with hierarchical attention and feature-rich embedding vector.

  • SummaRuNNer [nallapati2017summarunner] treats extractive summarization as a sequence classification problem, where a binary decision is made on each sentence about whether or not it should be included in the summary.

  • SummaRuNNer-abs [nallapati2017summarunner] is also an extractive model similar to SummaRuNNer but is trained directly on the abstractive summaries.

  • Seq2Seq+attn [see2017get] is a Seq2Seq framework based on Uni-GRU with non-hierarchical attention, which we use as our baseline model.

  • Distraction-M3 [chen2016distraction] is an extension of the Seq2Seq+attn model with a distraction mechanism that traverses between different content of a document to better grasp the overall meaning for summarization.

  • Graph-Based Model [tan2017abstractive] proposes a novel abstractive graph-based attention mechanism in the Seq2Seq framework, which aims to find salient content from the original document.

  • DeepRL [paulus2017deep] proposes a unified framework combining Seq2Seq and reinforcement learning (RL) to improve the quality of the summary.

  • Pointer-generator+Coverage(Po-Gen+Cov)[see2017get] improves the standard Seq2Seq model with a hybrid pointer-generator, which can not only produce novel words but also copy words from the source text.

  • SelectiveGate [Zhou2017Selective] proposes an encoder-decoder framework based on a static selective gate network, which helps improve encoding effectiveness and relieve the burden of the decoder.

IV-D Experimental Results

We adopt the widely used ROUGE [lin2004rouge], computed with pyrouge (pypi.python.org/pypi/pyrouge/0.1.3), as the evaluation metric. It measures the similarity between the output summary and the reference summary by computing overlapping units such as unigrams, bigrams and the longest common subsequence. In the following experiments, we adopt ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (longest common subsequence) for evaluation.
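The overlap counts behind these metrics can be illustrated with a simplified Python sketch. This is not the official ROUGE script used through pyrouge: it omits stemming, multi-reference handling and the summary-level ROUGE-L variant, and the function names are chosen for illustration only.

```python
from collections import Counter
from typing import List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap: int, cand_total: int, ref_total: int) -> float:
    if overlap == 0 or cand_total == 0 or ref_total == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n(candidate: Sequence[str], reference: Sequence[str], n: int) -> float:
    """F1 over clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())      # multiset intersection clips counts
    return f1(overlap, sum(cand.values()), sum(ref.values()))


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    prev: List[int] = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l(candidate: Sequence[str], reference: Sequence[str]) -> float:
    """Sentence-level ROUGE-L F1 based on the longest common subsequence."""
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


candidate = "the cat sat on the mat".split()
reference = "the cat lay on the mat".split()
print(rouge_n(candidate, reference, 1),
      rouge_n(candidate, reference, 2),
      rouge_l(candidate, reference))
```

For this toy pair, five of six unigrams and three of five bigrams overlap, and the longest common subsequence has length five.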

It can be observed from Tables II and III that the proposed approach achieves the best performance on the two datasets. Our best model outperforms all baseline extractive and abstractive models on ROUGE-1, ROUGE-2 and ROUGE-L. Compared with the Graph-based, RL-based and SummaRuNNer models, our model leverages the structural information of the document and improves the pointer network with syntactic attention, so that it copies semantically and structurally relevant words from the original text to handle OOV problems, whereas the Graph-based, RL-based and SummaRuNNer models all use the anonymized data, in which named entities have been replaced with “@entity” to alleviate OOV problems. Furthermore, unlike these models, we do not pretrain the word embedding vectors.

| Method | R1 | R2 | RL |
|---|---|---|---|
| Seq2Seq+attn | 18.4 | 4.8 | 14.3 |
| Distraction-M3 | 27.1 | 8.2 | 18.7 |
| Graph-based model | 30.3 | 9.8 | 20.0 |
| Po-Gen+Cov | 29.8 | 10.4 | 26.6 |
| SelectiveGate(w/o Po-Gen+Cov) | 19.8 | 5.8 | 15.6 |
| SelectiveGate(Po-Gen+Cov) | 30.1 | 10.6 | 26.5 |
| Our Model(Source Syntax) | 30.6 | 11.1 | 26.8 |
| Our Model(Dynamic Selective) | 30.7 | 10.9 | 26.9 |
| Our Model(All) | 31.2 | 11.5 | 27.3 |

TABLE II: Comparison results on the CNN test set using the full-length F1 variants of ROUGE. Baseline model results with a mark are taken from the corresponding papers. All our ROUGE scores have a 95% confidence interval of at most 0.25 as reported by the official ROUGE script.

| Method | R1 | R2 | RL |
|---|---|---|---|
| Seq2Seq+attn(150k) | 30.49 | 11.17 | 28.08 |
| Seq2Seq+attn(50k) | 31.33 | 11.81 | 28.83 |
| SummaRuNN | 35.46 | 13.30 | 32.65 |
| SummaRuNNer | 39.6 | 16.2 | 35.3 |
| Graph-based model | 38.1 | 13.9 | 34.0 |
| DeepRL | 39.87 | 15.82 | 36.90 |
| Pointer-generator+coverage | 39.53 | 17.28 | 36.38 |
| Our model | 40.37 | 17.82 | 37.3 |
| lead-3 | 40.34 | 17.70 | 36.57 |

TABLE III: Comparison results on the CNN/Daily Mail test set using the full-length F1 variants of ROUGE. Baseline model results with a mark are taken from the corresponding papers. 150k and 50k denote vocabulary sizes of 150k and 50k, respectively. All our ROUGE scores have a 95% confidence interval of at most 0.25 as reported by the official ROUGE script.

We also compare in detail with two similar methods in Table II. For the pointer-generator with coverage (Po-Gen+Cov) model, we show that, with the help of structural information and the dynamic selective gate, our best model improves over the Po-Gen+Cov model on all evaluation metrics (+1.4 ROUGE-1, +1.1 ROUGE-2 and +0.7 ROUGE-L). For the static SelectiveGate model, we conduct two experiments, with and without Po-Gen+Cov, because its original paper focuses on short-text summarization and does not use the Po-Gen+Cov mechanism to alleviate the OOV and word repetition problems. The results demonstrate that the static SelectiveGate improves the performance of the Po-Gen+Cov model, and that the dynamic selective gate further improves the ROUGE scores of the static SelectiveGate model by selecting the currently important information for decoding at every time step.

Further, to study the different impacts of the source syntax and the dynamic selective gate on the performance of the proposed model, we conduct ablation experiments on the CNN dataset, where we train the model with the source syntax encoding only and with the dynamic selective gate only, respectively. The last three rows in Table II show that 1) either the source syntax encoding or the dynamic selective gate alone improves the performance compared with the SelectiveGate approach; and 2) combining both approaches leads to further improvement and achieves the best results, as shown in the last row of the table.

| Max. source length (L) | R1 | R2 | RL |
|---|---|---|---|
| 1000 | 31.1 | 11.4 | 27.1 |
| 1200 | 31.2 | 11.5 | 27.3 |
| 1400 | 31.0 | 11.3 | 27.2 |

TABLE IV: Comparison results with different document lengths on the CNN dataset using the full-length F1 variants of ROUGE. All our ROUGE scores have a 95% confidence interval of at most 0.25 as reported by the official ROUGE script.

In addition, to study the impact of document length on the performance of the proposed model, we conduct experiments on the CNN test set with maximum source lengths between 1000 and 1400. Table IV shows that the performance of the proposed approach is stable across these lengths: the ROUGE scores vary by at most 0.2 points.

IV-E Example Analysis

In this subsection, we use an example to show the effectiveness of the syntactic encoding and the dynamic selective mechanism. We choose the same example introduced in Figure 1.

Figure 3 shows the constituency parsing tree of the sentence. The baseline model in Figure 1 generates a wrong summary because it is not able to model the syntactic structure, e.g., “300, 000 applicants” is a noun phrase, and “applied to .. ceremony” is an attributive clause. In our approach, the BiLSTM encoder takes the serialized parsing tree as input, and each word token is surrounded by parsing symbols. Intuitively, if two consecutive words do not belong to the same syntactic subtree, more parsing symbols will be inserted between them in the BiLSTM encoder, and it will be less likely that these two words have a strong connection. As shown at the top of Figure 3, the summary generated by our model correctly conveys the content of the source text.

Fig. 3: A parsing tree corresponding to the blue text in Figure 1. The dashed box shows that “applied to .. ceremony” is an attributive clause for “300, 000 applicants”. The upper box shows our generated summary, which correctly summarizes the original document.

For the dynamic selective gate, we use the method in [li2015visualizing] to visualize it. The method defines a highly non-linear function to measure the contribution of each source word, as gated by the selective gate, in each generation step. As shown in Figure 4, the dynamic selective mechanism can select the most important information from the original document at every decoding step. For example, at a given decoding step, the selective gate filters out trivial words (e.g. “the”, “is”, “he”) and selects the currently important words (e.g. “jedlicka”, “politician”) to help the subsequent attention generate the most important word (e.g. “jedlicka”). Furthermore, Figure 4 also shows that words outside the source vocabulary can be generated (e.g. “vit”, “jedlicka”), and that the weights of already selected words decrease in the following decoding steps, indicating that our model can address both the OOV problem and the word repetition problem.

Fig. 4: Visualization of the dynamic selective gates. The gates dynamically adjust their states across decoding steps. Darkness of the blocks indicates the openness of the gates.

V Conclusions and Future Work

In this work, we propose a novel document summarization model that takes a serialized parsing tree as its input, which enables the encoder to learn the inherent syntactic structure of each sentence in the document. Further, we propose a dynamic selective gate to control the information flow from the encoder to the decoder. This mechanism dynamically controls the salient information based on the context of the decoder state, which is essential to document summarization. The experimental results on the long-text CNN/Daily Mail datasets show the advantage of our model over several baseline approaches.

In this work, we use a depth-first traversal to generate the serialized parsing tree, which cannot retain all the structural information of the tree. We will consider better representations to model the hierarchical structure of the parsing tree. In addition, the proposed dynamic selective gate applies only to the word tokens in the source. We may consider integrating both the word tokens and the parsing symbols to better model the information flow from the source to the target.