Hybrid MemNet for Extractive Summarization

12/25/2019 ∙ by Abhishek Kumar Singh, et al. ∙ IIIT Hyderabad

Extractive text summarization has been an extensively studied research problem in the field of natural language understanding. While conventional approaches rely mostly on manually compiled features to generate the summary, few attempts have been made to develop data-driven systems for extractive summarization. To this end, we present a fully data-driven end-to-end deep network, which we call Hybrid MemNet, for the single-document summarization task. The network learns a continuous unified representation of a document before generating its summary. It jointly captures local and global sentential information along with the notion of summary-worthy sentences. Experimental results on two different corpora confirm that our model shows significant performance gains compared with state-of-the-art baselines.


1. Introduction

The tremendous growth of data on the web has increased the need to retrieve, analyze and understand large amounts of information, which can often be time-consuming. The motivation to produce a concise representation of a large text while retaining its core meaning has led to the development of various summarization systems. Summarization methods can be broadly classified into two categories: extractive and abstractive. Extractive methods aim to select salient phrases, sentences or elements from the text, while abstractive techniques focus on generating summaries from scratch without the constraint of reusing phrases from the original text.

Most successful summarization systems use extractive methods. Sentence extraction is a crucial step in such systems. The idea is to find a representative subset of sentences that contains the information of the entire set. Traditional approaches to extractive summarization identify sentences based on human-crafted features such as sentence position and length (Erkan and Radev, 2004), the words in the title, the presence of proper nouns, content features like term frequency (Nenkova et al., 2006), and event features like action nouns (Filatova and Hatzivassiloglou, 2004). Generally, sentences are assigned a saliency score indicating the strength of presence of these features. Kupiec et al. (1995) use binary classifiers to select summary-worthy sentences. Conroy and O’Leary (2001) investigated the use of Hidden Markov Models, while Erkan and Radev (2004) and Mihalcea (2005) introduced graph-based algorithms for selecting salient sentences.

Recently, interest has shifted towards neural network based approaches for modeling the extractive summarization task. Kågebäck et al. (2014) employed the recursive autoencoder (Socher et al., 2011) to summarize documents. Yin and Pei (2015) exploit convolutional neural networks to project sentences to a continuous vector space and select sentences based on their ‘prestige’ and ‘diversity’ cost for the multi-document extractive summarization task. Very recently, Cheng and Lapata (2016) introduced an attention-based neural encoder-decoder model for the extractive single-document summarization task, trained on a large corpus of news articles collected from the Daily Mail. Similar to Cheng and Lapata (2016), our work focuses on sentential extractive summaries of a single document using deep neural networks. However, we propose the use of memory networks and convolutional bidirectional long short-term memory networks to capture a better document representation.

In this work, we propose a data-driven, end-to-end enhanced encoder-decoder based deep network that summarizes a news article by extracting salient sentences. Figure 1 shows the architecture of the proposed Hybrid MemNet model. The model consists of a document reader (encoder) and a sentence extractor (decoder). In contrast to the model of Cheng and Lapata (2016), which uses attention only in the decoder, our model uses attention in both the encoder and the decoder. Our focus is to learn a better document representation that incorporates local as well as global document features, along with attention to sentences, to capture the notion of saliency of a sentence. In contrast to the conventional approach of computing hand-crafted sentential features, our model uses neural networks and is purely data-driven. Zhang and Lapata (2014) and Kim (2014) have shown the successful use of Convolutional Neural Networks (CNN) in obtaining latent feature representations. Hence, our network applies a CNN with multiple filters to automatically capture latent semantic features. A Long Short-Term Memory (LSTM) network is then applied to obtain a comprehensive set of features known as the thought vector, which captures the overall abstract representation of a document. We obtain the final document representation by combining the document embedding obtained from the Convolutional LSTM (Conv-LSTM) with the document embedding from the memory network. The final unified document embedding, along with the embeddings of the sentences, is used by the decoder to select salient sentences in a document. We experiment with a Conv-LSTM encoder as well as a Convolutional Bidirectional LSTM (Conv-BLSTM) encoder.

Figure 1. The Architecture of the Hybrid MemNet Model

We summarize our primary contributions below:

  1. We propose a novel architecture that learns a better unified document representation by combining the features from the memory network with the features from the convolutional LSTM/BLSTM network.

  2. We investigate the application of a memory network (which incorporates attention to sentences) and Conv-BLSTM (which incorporates n-gram features and sentence-level information) for learning a better thought vector with rich semantics.

  3. We experimentally show that the proposed method outperforms the basic systems and several competitive baselines. Our model achieves a significant performance gain on the DUC 2002 generic single-document summarization dataset.

We begin by describing our network architecture in Section 2 followed by experimental details including corpus details in Section 3. We analyze our system against various benchmarks in Section 4 and finally conclude our work in Section 5.

2. Hybrid MemNet Model

The primary building blocks of our model are:

  • Document Encoder - captures local (n-gram level) information, global (sentence level) information and the notion of summary-worthy sentences.

  • Decoder - attention based sequence to sequence decoder.

The final unified document encoding and the sentence vectors from the convolutional sentence encoder are fed to the decoder model. In this section, we discuss the details of the encoder and decoder modules.

2.1. Document Encoder

The idea is to learn a unified document representation that not only incorporates n-gram features and sentence-level information but also includes the notion of salience and redundancy of sentences. For this purpose, we sum the document representation vectors learned from the Convolutional LSTM (Conv-LSTM; for hierarchical encoding) and the MemNet (Sukhbaatar et al., 2015) (for capturing salience and redundancy). Since the unified document embedding is learned from the joint interaction of these two models, we refer to this network as Hybrid MemNet.

Sentence Encoder

Convolutional neural networks are used to encode sentences, as they have been shown to work well for multiple sentence-level classification tasks (Kim, 2014). A conventional convolutional neural network applies a convolution operation over the word embeddings, followed by a max pooling operation. Suppose the d-dimensional word embedding of the $i^{th}$ word in the sentence is $w_i$, and $w_{i:i+c-1}$ is the concatenation of the word embeddings $w_i, \dots, w_{i+c-1}$. Then a convolution operation over a window of $c$ words using a filter $K_t \in \mathbb{R}^{cd}$ yields new features with $N - c + 1$ dimensions for a sentence of $N$ words. Here, $t$ is the filter index. The convolution operation is written as:

$f_i^t = \tanh(K_t \cdot w_{i:i+c-1} + b)$ (1)

Here b is the bias term. We obtain a feature map $F^t$ by applying the filter $K_t$ over all possible windows of words in the sentence of length N.

$F^t = [f_1^t, f_2^t, \dots, f_{N-c+1}^t]$ (2)

Our intention is to capture the most prominent features in the feature map; hence, we use a max-over-time pooling operation (Collobert et al., 2011) to acquire the set of features for a filter of fixed window size. A single pooled feature ($\hat{f}^t$) can be represented as:

$\hat{f}^t = \max_i f_i^t$ (3)

We use multiple convolution nets with different filter sizes {1, 2, 3, 4, 5, 6, 7} to compute a list of embeddings which are summed to obtain the final sentence vector.
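The sentence encoder described above can be sketched in NumPy as follows. This is a minimal illustration and not the authors' implementation: the function name `encode_sentence`, the `tanh` nonlinearity, the number of filters per window width, and the random weights are all assumptions made for the example.

```python
import numpy as np

def encode_sentence(word_vecs, filters, biases):
    """Encode one sentence with convolution + max-over-time pooling.

    word_vecs: array (N, d), one row per word embedding.
    filters:   dict mapping window width c -> array (m, c, d) of m filters.
    biases:    dict mapping window width c -> array (m,).
    Returns an m-dimensional sentence vector (sum over window widths).
    """
    n_words, _ = word_vecs.shape
    sentence_vec = None
    for width, kernel in filters.items():
        if width > n_words:
            continue  # sentence too short for this window width
        # Stack every window of `width` consecutive words: (N - c + 1, c, d).
        windows = np.stack([word_vecs[i:i + width]
                            for i in range(n_words - width + 1)])
        # Feature maps: one row per filter, one column per window position.
        # tanh is an assumed nonlinearity for this sketch.
        feature_maps = np.tanh(
            np.einsum("pcd,mcd->mp", windows, kernel) + biases[width][:, None])
        pooled = feature_maps.max(axis=1)  # max-over-time pooling -> (m,)
        sentence_vec = pooled if sentence_vec is None else sentence_vec + pooled
    return sentence_vec

# Hypothetical sizes: 8-dimensional word embeddings, 5 filters per width.
rng = np.random.default_rng(0)
d, m = 8, 5
widths = [1, 2, 3, 4, 5, 6, 7]
filters = {c: rng.normal(scale=0.1, size=(m, c, d)) for c in widths}
biases = {c: np.zeros(m) for c in widths}
sentence = rng.normal(size=(10, d))   # a 10-word sentence
vec = encode_sentence(sentence, filters, biases)
print(vec.shape)  # (5,)
```

Summing the pooled vectors requires every window width to use the same number of filters, which is why `m` is shared across widths in this sketch.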

Conv-BLSTM Document Encoder

Since Recurrent Neural Networks (RNN) suffer from the vanishing gradient problem over long sequences (Siegelmann and Sontag, 1992), we use a Long Short-Term Memory (LSTM) network. To obtain a hierarchical document encoding, the sentence vectors obtained from the convolutional sentence encoder are fed to the LSTM. This new representation intuitively captures both local and global sentential information. We explore an LSTM network as well as a Bidirectional LSTM (BLSTM) network in our experiments. Experiments show that the combination of the convolutional network and the BLSTM performs better in our case. The BLSTM also exploits future context in the sequence by processing the data in both directions.
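To make the two-directional processing concrete, the following NumPy sketch runs one LSTM forward and another backward over a sequence of sentence vectors. It is an illustration under stated assumptions rather than the paper's code: the gate layout, the concatenation of forward and backward states, and all names (`lstm_step`, `run_lstm`, `blstm_encode`) and sizes are hypothetical.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, b):
    """One LSTM step. W has shape (4 * hidden, input + hidden)."""
    hidden = h.shape[0]
    z = W @ np.concatenate([x, h]) + b
    i = sigmoid(z[:hidden])                # input gate
    f = sigmoid(z[hidden:2 * hidden])      # forget gate
    o = sigmoid(z[2 * hidden:3 * hidden])  # output gate
    g = np.tanh(z[3 * hidden:])            # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

def run_lstm(inputs, W, b, hidden):
    """Run an LSTM over a sequence and return all hidden states."""
    h, c = np.zeros(hidden), np.zeros(hidden)
    states = []
    for x in inputs:
        h, c = lstm_step(x, h, c, W, b)
        states.append(h)
    return np.stack(states)

def blstm_encode(sentence_vecs, fwd, bwd, hidden):
    """Bidirectional pass; concatenating both directions is an assumption."""
    forward = run_lstm(sentence_vecs, fwd[0], fwd[1], hidden)
    # Process the reversed sequence, then flip back to document order.
    backward = run_lstm(sentence_vecs[::-1], bwd[0], bwd[1], hidden)[::-1]
    return np.concatenate([forward, backward], axis=1)

# Hypothetical sizes: 5 sentences, 6-dimensional vectors, 4 hidden units.
rng = np.random.default_rng(0)
e, hidden, n = 6, 4, 5

def make_params():
    W = rng.uniform(-0.05, 0.05, size=(4 * hidden, e + hidden))
    return W, np.zeros(4 * hidden)

fwd, bwd = make_params(), make_params()
sentence_vecs = rng.normal(size=(n, e))
states = blstm_encode(sentence_vecs, fwd, bwd, hidden)
print(states.shape)  # (5, 8)
```

In this sketch the first half of each row depends only on the current and earlier sentences, while the second half depends on the current and later ones, which is the "future context" the text refers to.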

MemNet based Document Encoder

We leverage a memory network encoder, inspired by the recurrent attention model for question answering and language modeling tasks (Sukhbaatar et al., 2015). The model uses an attention mechanism and has been shown to capture temporal context. In our case, it learns a document representation that captures the notion of salience and redundancy of sentences.

We first describe the model that implements a single memory hop operation (single layer); we then extend it to multiple hops in memory. Consider an input set of sentence vectors $s_1, \dots, s_n$, obtained from the sentence encoder for a document D. Let $d_{conv}$ be the document representation of D obtained from the Conv-LSTM model and $d_{mem}$ be the document embedding from the MemNet model. The entire set of $\{s_i\}$ is transformed into memory vectors $\{m_i\}$ of dimension $k$ in continuous space, using a learned weight matrix $A$ (of size $k \times e$, where $e$ is the embedding size of a sentence). Similarly, the input document embedding $d_{conv}$ is transformed via a learned weight matrix $B$ with the same dimension as $A$ to obtain the internal state $u$. We then compute the match between $u$ and each memory $m_i$ by taking the inner product followed by a softmax as follows.

$p_i = \mathrm{softmax}(u^T m_i)$ (4)

where $\mathrm{softmax}(z_i) = \exp(z_i) / \sum_j \exp(z_j)$ and $p$ is the probability vector over the inputs. Each $s_i$ also has a corresponding output vector $v_i$ (using another embedding matrix $C$). The output vector from memory, $o$, is computed as the sum over the transformed inputs $v_i$, weighted by the probability vector from the input as follows.

$o = \sum_i p_i v_i$ (5)

In the case of a multiple-layer model that handles $H$ (2 in our case) hop operations, the memory layers are stacked and the input to layer $l + 1$ is computed as follows.

$u^{l+1} = u^l + o^l$ (6)

Let $d_{mem}$ be the output obtained from the last memory unit, $o^H$. The final unified document representation $d$ is obtained by summing the output from the Conv-BLSTM ($d_{conv}$) and the output from the MemNet ($d_{mem}$).

$d = d_{conv} + d_{mem}$ (7)

Intuitively, $d$ captures the hierarchical information of a document as well as the notion of worthiness of a sentence.
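The memory read described in this subsection can be sketched in NumPy as below. This is a hedged illustration, not the authors' code: the function name `memnet_document_embedding` is hypothetical, the same weight matrices are reused across hops for brevity (the text does not say how weights are shared), and the memory dimension is set equal to the sentence embedding size so that the final sum is well defined.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def memnet_document_embedding(sent_vecs, doc_vec, A, B, C, hops=2):
    """Multi-hop memory read over sentence vectors.

    sent_vecs: (n, e) sentence vectors.
    doc_vec:   (e,) document vector from the Conv-LSTM encoder.
    A, B, C:   (k, e) weight matrices for memories, query, and outputs.
    Returns the output of the last hop and its attention weights.
    """
    memory = sent_vecs @ A.T      # memory vector for each sentence
    outputs = sent_vecs @ C.T     # output vector for each sentence
    u = B @ doc_vec               # internal state
    for _ in range(hops):
        p = softmax(memory @ u)   # inner product match, then softmax
        o = p @ outputs           # attention-weighted sum of output vectors
        u = u + o                 # input to the next memory layer
    return o, p

# Hypothetical sizes: 6 sentences, 4-dimensional embeddings, k = e.
rng = np.random.default_rng(0)
n, e = 6, 4
A, B, C = (rng.normal(scale=0.1, size=(e, e)) for _ in range(3))
sent_vecs = rng.normal(size=(n, e))
doc_vec = rng.normal(size=e)

d_mem, attention = memnet_document_embedding(sent_vecs, doc_vec, A, B, C)
unified = doc_vec + d_mem         # sum of the two document representations
print(attention.round(3), unified.shape)
```

The attention weights give one score per sentence, which is how the memory network exposes a notion of sentence salience before any summary is generated.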

2.2. Decoder

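The decoder uses an LSTM to label sentences sequentially, taking into account both their individual relevance and their mutual redundancy. The next sentence is labeled by considering both the encoded document and the previously labeled sentences. If the encoder hidden states are denoted by ($h_1, \dots, h_n$) and the decoder hidden states by ($\bar{h}_1, \dots, \bar{h}_n$), then at time step $t$, for the $t^{th}$ sentence, the decoder equations are as follows.

$\bar{h}_t = \mathrm{LSTM}(p_{t-1} s_{t-1}, \bar{h}_{t-1})$ (8)
$p(y_L(t) = 1 \mid D) = \sigma(\mathrm{MLP}(\bar{h}_t : h_t))$ (9)

where $s_{t-1}$ is the vector of the previous sentence and $p_{t-1}$ is the degree to which the decoder assumes the previous sentence should be a part of the summary and is memorized; $p_{t-1}$ is 1 if the system is certain. $y_L(t)$ is the $t^{th}$ sentence's label and $\sigma$ is the sigmoid function. The concatenation of $\bar{h}_t$ and $h_t$ is given as input to an MLP (Multi-layer Perceptron).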
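The sequential labeling loop can be sketched as follows. To keep the example short, a plain tanh recurrent cell stands in for the LSTM, so this is not the decoder used in the paper. The function name `label_sentences`, the parameter names, the zero vector fed at the first step, and the way the initial state is chosen are all assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def label_sentences(sent_vecs, enc_states, init_state, params):
    """Sequentially score sentences for inclusion in the summary.

    sent_vecs:  (n, e) sentence vectors.
    enc_states: (n, h) encoder hidden states.
    init_state: (h,) initial decoder state (assumed to be derived from
                the unified document representation).
    params:     dict with recurrent weights W_x (h, e), W_h (h, h) and
                MLP weights W_1 (q, 2h), w_2 (q,).
    """
    h_bar = init_state
    prev_input = np.zeros(sent_vecs.shape[1])   # nothing labeled yet
    probs = []
    for s_t, h_t in zip(sent_vecs, enc_states):
        # Simple tanh recurrence standing in for the decoder LSTM.
        h_bar = np.tanh(params["W_x"] @ prev_input + params["W_h"] @ h_bar)
        # MLP over the concatenated decoder and encoder states, then sigmoid.
        hidden = np.tanh(params["W_1"] @ np.concatenate([h_bar, h_t]))
        p_t = sigmoid(params["w_2"] @ hidden)
        probs.append(p_t)
        # The next step sees this sentence weighted by its probability.
        prev_input = p_t * s_t
    return np.array(probs)

# Hypothetical sizes: 5 sentences, e = 6, h = 4, MLP width q = 3.
rng = np.random.default_rng(0)
n, e, h, q = 5, 6, 4, 3
params = {
    "W_x": rng.normal(scale=0.1, size=(h, e)),
    "W_h": rng.normal(scale=0.1, size=(h, h)),
    "W_1": rng.normal(scale=0.1, size=(q, 2 * h)),
    "w_2": rng.normal(scale=0.1, size=q),
}
probs = label_sentences(rng.normal(size=(n, e)), rng.normal(size=(n, h)),
                        np.zeros(h), params)
print(probs.shape)  # (5,)
```

Weighting the previous sentence by its own probability means that confidently selected sentences influence later decisions more strongly, which is how the decoder can account for redundancy.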

3. Experimental Results

In this section, we present the experimental setup for assessing the performance of the proposed system. We describe the corpora used for training and evaluation and give the implementation details of our approach.

3.1. Datasets

For training the model, we use the Daily Mail corpus, which was also used for single-document summarization by Cheng and Lapata (2016). Overall, this corpus contains 193,986 training documents, 12,417 validation documents and 10,350 test documents. To evaluate our model, we use the standard DUC-2002 single-document summarization dataset, which consists of 567 documents. We also evaluate our system on 500 articles from the Daily Mail test set (with human-written highlights as the gold standard). The average byte count for each document is 278, and article-highlight pairs are sampled such that the highlights include a minimum of 3 sentences.

3.2. Implementation Details

We use the three highest-scoring sentences, subject to the standard word limit of 75 words, to generate summaries. The sizes of the word, sentence, and document embeddings are set to 150, 300, and 750, respectively. A list of kernel sizes {1, 2, 3, 4, 5, 6, 7} is used for the convolutional sentence encoder. A two-hop operation is performed in the MemNet encoder. All LSTM parameters were randomly initialized from a uniform distribution within [-0.05, 0.05]. We use a batch size of 20 documents with a learning rate of 0.001 and the two momentum parameters set to 0.99 and 0.999. We use Adam (Kingma and Ba, 2014) as the optimizer.
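The summary generation step (three highest-scoring sentences within a 75-word limit) might look like the sketch below. The text does not say how the limit is enforced or in what order the selected sentences are emitted, so truncating to the limit and restoring document order are assumptions, as is the function name `select_summary`.

```python
def select_summary(sentences, scores, top_k=3, word_limit=75):
    """Pick the top_k highest-scoring sentences, keep document order,
    and truncate the result to word_limit words."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:top_k])           # restore document order
    words = " ".join(sentences[i] for i in chosen).split()
    return " ".join(words[:word_limit])

sentences = ["A b c.", "D e.", "F g h i.", "J."]
scores = [0.9, 0.1, 0.8, 0.7]
print(select_summary(sentences, scores))                # A b c. F g h i. J.
print(select_summary(sentences, scores, word_limit=5))  # A b c. F g
```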

3.3. Evaluation Metrics

We evaluate the quality of system summaries using ROUGE (Lin and Hovy, 2003): ROUGE-1 (unigram overlap) and ROUGE-2 (bigram overlap) as a means of assessing informativeness, and ROUGE-L as a means of assessing fluency.
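For intuition, the recall-oriented forms of these metrics can be sketched in a few lines of Python. This is a simplified illustration on whitespace tokens, not the official ROUGE toolkit, which offers further options that the sketch omits; the function names are hypothetical.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Fraction of reference n-grams that also occur in the candidate."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total

def lcs_length(a, b):
    """Length of the longest common subsequence (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def rouge_l_recall(candidate, reference):
    """Longest common subsequence length divided by the reference length."""
    return lcs_length(candidate, reference) / len(reference) if reference else 0.0

reference = "the cat sat on the mat".split()
candidate = "the cat lay on the mat".split()
print(rouge_n(candidate, reference, 1))      # 5 of 6 unigrams match
print(rouge_n(candidate, reference, 2))      # 3 of 5 bigrams match
print(rouge_l_recall(candidate, reference))  # LCS has 5 of 6 tokens
```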

3.4. Baseline Methods

We evaluate our system against several state-of-the-art baselines. We select the best systems with state-of-the-art summarization results on the DUC 2002 corpus for the single-document summarization task, which are: (1) ILP (Woodsend and Lapata, 2010), (2) TGRAPH (Parveen et al., 2015), (3) URANK (Wan, 2010), (4) NN-SE (Cheng and Lapata, 2016), (5) SummaRuNNer (Nallapati et al., 2017), and (6) Deep-Classifier (Nallapati et al., 2016). ILP is a phrase-based extraction model that selects salient phrases and recombines them subject to length and grammar constraints via Integer Linear Programming (ILP). TGRAPH is a graph-based sentence extraction model. URANK uses a unified ranking for single- as well as multi-document summarization. We also use LEAD as a standard baseline, which simply selects the leading three sentences from the document as the summary. NN-SE is a neural network based sentence extractor. Deep-Classifier uses a GRU-RNN to sequentially accept or reject each sentence in the document for inclusion in the summary. SummaRuNNer is an RNN-based extractive summarizer.

4. Results and Analysis

In this section, we compare the performance of our system against the summarization baselines described in Section 3.4. Table 1 shows our results on the DUC 2002 test dataset and on the 500 samples from the Daily Mail corpus. Hybrid MemNet (Conv-LSTM) denotes our system with the Conv-LSTM encoder and the MemNet encoder, while Hybrid MemNet (Conv-BLSTM) uses the Conv-BLSTM encoder and the MemNet encoder; the two variants are listed in this order in Table 1. Both variants outperform the LEAD and ILP baselines by a large margin, which is an encouraging result, as our system does not have access to the manually crafted features, syntactic information and sophisticated linguistic constraints used by ILP. The results also show that our system performs better than URANK without relying on a sentence ranking mechanism. It also achieves a significant performance gain over NN-SE, Deep-Classifier, and SummaRuNNer.

To explore the contribution of the MemNet encoder to the performance of our system, we compare the results of NN-SE with those of Hybrid MemNet. There is a significant performance gain of about 2%. Post-hoc Tukey tests showed that the proposed Hybrid MemNet model is significantly better than NN-SE. This is because the MemNet learns a document representation that captures the salience estimation of a sentence (using the attention mechanism) prior to summary generation. We also notice that replacing the LSTM with a BLSTM in the encoder improves the performance of the system. This may be because the BLSTM in our setting is able to learn a richer set of semantics, as it exploits future context by processing the sequential data in both directions, while the LSTM can only make use of the previous context.

DUC 2002                     ROUGE-1  ROUGE-2  ROUGE-L
LEAD                         43.6     21.0     40.2
ILP                          45.4     21.3     42.8
TGRAPH                       48.1     24.3
URANK                        48.5     21.5
NN-SE                        47.4     23.0     43.5
Deep-Classifier
SummaRuNNer
Hybrid MemNet (Conv-LSTM)    49.1     24.7     44.6
Hybrid MemNet (Conv-BLSTM)   50.1     25.2     44.9

Daily Mail                   ROUGE-1  ROUGE-2  ROUGE-L
LEAD                         20.4     7.7      11.4
NN-SE                        21.2     8.3      12.0
Deep-Classifier
SummaRuNNer
Hybrid MemNet (Conv-LSTM)    27.1     11.6     15.2
Hybrid MemNet (Conv-BLSTM)   27.9     12.2     15.5

Table 1. ROUGE Evaluation (%) on the DUC-2002 Corpus and 500 Samples from the Daily Mail Corpus

5. Conclusions

In this work, we proposed a data-driven end-to-end deep neural network approach for extractive summarization of a document. Our system combines a memory network with a convolutional bidirectional long short-term memory network to learn a better unified document representation, which jointly captures n-gram features, sentence-level information and the notion of summary worthiness of sentences, eventually leading to better summary generation. Experimental results on the DUC 2002 and Daily Mail datasets confirm that our system outperforms several state-of-the-art baselines.

References

  • J. Cheng and M. Lapata (2016) Neural Summarization by Extracting Sentences and Words. In Proc. of the 54th Annual Meeting of the Association for Computational Linguistics, pp. 484–494.
  • R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa (2011) Natural Language Processing (almost) from Scratch. Journal of Machine Learning Research 12 (Aug), pp. 2493–2537.
  • J. M. Conroy and D. P. O’Leary (2001) Text Summarization via Hidden Markov Models. In Proc. of the 24th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 406–407.
  • G. Erkan and D. R. Radev (2004) Lexrank: Graph-based Lexical Centrality as Salience in Text Summarization. Journal of Artificial Intelligence Research 22, pp. 457–479.
  • E. Filatova and V. Hatzivassiloglou (2004) Event-based Extractive Summarization. In Proc. of ACL Workshop on Summarization.
  • M. Kågebäck, O. Mogren, N. Tahmasebi, and D. Dubhashi (2014) Extractive Summarization using Continuous Vector Space Models. In Proc. of the 2nd Workshop on Continuous Vector Space Models and their Compositionality (CVSC)@ EACL, pp. 31–39.
  • Y. Kim (2014) Convolutional Neural Networks for Sentence Classification. In EMNLP, pp. 1746–1751.
  • D. Kingma and J. Ba (2014) Adam: A Method for Stochastic Optimization. arXiv preprint arXiv:1412.6980.
  • J. Kupiec, J. Pedersen, and F. Chen (1995) A Trainable Document Summarizer. In Proc. of the 18th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 68–73.
  • C. Lin and E. Hovy (2003) Automatic Evaluation of Summaries using n-gram Co-occurrence Statistics. In Proc. of the 2003 Conf. of the North American Chapter of the Association for Computational Linguistics on Human Language Technology-Volume 1, pp. 71–78.
  • R. Mihalcea (2005) Language Independent Extractive Summarization. In Proc. of the ACL 2005 on Interactive Poster and Demonstration Sessions, pp. 49–52.
  • R. Nallapati, F. Zhai, and B. Zhou (2017) SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In Proc. of the Thirty-First AAAI Conf. on Artificial Intelligence, pp. 3075–3081.
  • R. Nallapati, B. Zhou, and M. Ma (2016) Classify or Select: Neural Architectures for Extractive Document Summarization. CoRR abs/1611.04244.
  • A. Nenkova, L. Vanderwende, and K. McKeown (2006) A Compositional Context Sensitive Multi-Document Summarizer: Exploring the Factors that Influence Summarization. In Proc. of the 29th Annual Intl. ACM SIGIR Conf. on Research and Development in Information Retrieval, pp. 573–580.
  • D. Parveen, H. Ramsl, and M. Strube (2015) Topical Coherence for Graph-based Extractive Summarization. pp. 1949–1954.
  • H. T. Siegelmann and E. D. Sontag (1992) On the Computational Power of Neural Nets. In Proc. of the Fifth Annual Workshop on Computational Learning Theory, pp. 440–449.
  • R. Socher, E. H. Huang, J. Pennington, A. Y. Ng, and C. D. Manning (2011) Dynamic Pooling and Unfolding Recursive Autoencoders for Paraphrase Detection. In NIPS, Vol. 24, pp. 801–809.
  • S. Sukhbaatar, J. Weston, R. Fergus, et al. (2015) End-to-end Memory Networks. In Advances in Neural Information Processing Systems, pp. 2440–2448.
  • X. Wan (2010) Towards a Unified Approach to Simultaneous Single-Document and Multi-Document Summarizations. In Proc. of the 23rd Intl. Conf. on Computational Linguistics, pp. 1137–1145.
  • K. Woodsend and M. Lapata (2010) Automatic Generation of Story Highlights. In Proc. of the 48th Annual Meeting of the Association for Computational Linguistics, pp. 565–574.
  • W. Yin and Y. Pei (2015) Optimizing Sentence Modeling and Selection for Document Summarization. In IJCAI, pp. 1383–1389.
  • X. Zhang and M. Lapata (2014) Chinese Poetry Generation with Recurrent Neural Networks. In EMNLP, pp. 670–680.