Salience Estimation with Multi-Attention Learning for Abstractive Text Summarization

04/07/2020 ∙ by Piji Li, et al. ∙ The Chinese University of Hong Kong FUDAN University 0

Attention mechanism plays a dominant role in the sequence generation models and has been used to improve the performance of machine translation and abstractive text summarization. Different from neural machine translation, in the task of text summarization, salience estimation for words, phrases or sentences is a critical component, since the output summary is a distillation of the input text. Although the typical attention mechanism can conduct text fragment selection from the input text conditioned on the decoder states, there is still a gap to conduct direct and effective salience detection. To bring back direct salience estimation for summarization with neural networks, we propose a Multi-Attention Learning framework which contains two new attention learning components for salience estimation: supervised attention learning and unsupervised attention learning. We regard the attention weights as the salience information, which means that the semantic units with large attention value will be more important. The context information obtained based on the estimated salience is incorporated with the typical attention mechanism in the decoder to conduct summary generation. Extensive experiments on some benchmark datasets in different languages demonstrate the effectiveness of the proposed framework for the task of abstractive summarization.




1 Introduction

The sequence-to-sequence (seq2seq) framework with an attention mechanism has achieved significant improvements in neural machine translation Bahdanau et al. (2015). Encouraged by this outcome, researchers applied the seq2seq framework to abstractive text summarization Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016) and also obtained encouraging results. Since then, abstractive text summarization has become a popular research task and quite a few seq2seq-based frameworks have been proposed. For example, See et al. (2017) integrated the copy operation Gu et al. (2016); Vinyals et al. (2015) and the coverage model Tu et al. (2016) into the typical attention-based seq2seq model to generate better summaries. Li et al. (2017b) designed a recurrent generative decoder to capture the latent structures in the target summaries. Paulus et al. (2018) employed deep reinforcement learning techniques to enhance the performance of this task.

The above frameworks can improve the quality of the generated abstractive summaries to some extent. However, when we immerse ourselves in designing ever more elaborate techniques on top of the seq2seq model, we may unintentionally ignore some important characteristics specific to the task of text summarization. Throughout the history of summarization research, salience detection, i.e. finding the most important information (words, phrases, or sentences) in the source input text, has always been the most crucial and essential component. Supervised Ng et al. (2012); Wang et al. (2013) and unsupervised Erkan and Radev (2004); Mihalcea and Tarau (2004) learning methods were proposed to estimate salience scores for producing better summaries. However, for the attention-based seq2seq framework, it is not straightforward to figure out how to conduct salience detection. The current attention mechanism is not as natural and effective for summarization as it is for some other tasks. For instance, in neural machine translation, it is reasonable to use the current decoding state to attend to the source sequence to get the relevant information for translating the next target word. In reading comprehension, it makes sense to use the question to attend to the reading passage to retrieve relevant information for extracting the answer. But for text summarization, it is difficult to connect the attention mechanism with the salience estimation operation. Although several works have tried strategies to conduct salience detection, some limitations remain. For example, the selective mechanism Zhou et al. (2017) only implicitly performs salience detection. The graph-based attention mechanism Tan et al. (2017) only adopts an unsupervised method, so it cannot exploit the supervised signal in the training data.

In this paper, we propose two global attention mechanisms, based on supervised learning and unsupervised learning respectively, for salient information detection. For the supervised attention mechanism, we employ a supervised learning method to estimate the probability of each word in the input text being included in the generated summary. The normalized probability value is regarded as the supervised attention signal. For the unsupervised attention mechanism, inspired by the PageRank Page et al. (1999) based text summarization methods such as LexRank Erkan and Radev (2004) and TextRank Mihalcea and Tarau (2004), as well as the graph-based attention mechanism Tan et al. (2017), we employ the PageRank algorithm to estimate the salience score of each input word, which is regarded as the unsupervised attention signal. Thus, these two types of attention signals contain the salience information of the terms in the source text. To examine the efficacy of the obtained salience information, we integrate these signals into a simple base model for abstractive summarization, i.e. the attention-based seq2seq model. Note that we do not employ more sophisticated and powerful models, because the aim of this work is to verify that bringing back salience estimation for neural abstractive summarization helps to improve the performance, and a simple base model keeps the conclusion from being biased by other modeling structures.

Our main contributions are summarized as follows. (1) We investigate a crucial element of text summarization problem, namely salience estimation, which has been overlooked by the prior neural abstractive summarization approaches. (2) We propose a supervised attention mechanism to directly estimate the salience under the supervision signal provided by the state of the input text, and an unsupervised attention mechanism which employs a graph algorithm to estimate the salience of each input word. (3) We integrate the two types of attention information into a base model and propose a unified neural network based framework, named Multi-Attention Learning (MAL), to tackle the task of abstractive summarization. (4) Experimental results on some benchmark datasets in different languages demonstrate the effectiveness of the proposed attention learning methods for salience estimation.

2 Our Framework

2.1 Overview

Figure 1: Our Multi-Attention Learning (MAL) framework for abstractive summarization.

The proposed Multi-Attention Learning (MAL) framework is shown in Figure 1. The input is a variable-length sequence $X = (x_1, x_2, \ldots, x_m)$ representing the source text. The output ground truth is also a sequence $Y = (y_1, y_2, \ldots, y_n)$. We denote the generated summary sequence as $\hat{Y}$. For global salience estimation, we add two tailor-made attention learning mechanisms: supervised attention learning and unsupervised attention learning. The aim of supervised attention learning is to predict whether words from the input source text should be selected into the generated summaries, i.e. predicting a $1$ or $0$ label for each word. As shown in Figure 1, the word embeddings and the encoder recurrent neural network (RNN) hidden states are taken as the input information of this supervised attention learning module. We also design a self-attention model to capture more context information from the source text for better feature representation learning. The output of this component is regarded as the supervised attention information $\mathbf{a}^s$. For unsupervised attention learning, we employ the PageRank algorithm to estimate the salience score of each input word in an unsupervised manner. We treat the salience score as the unsupervised attention information $\mathbf{a}^u$. Then these two types of global attention information representing word salience are combined with the hidden states of the input source text to obtain the global attention context. Finally, the attention context information is incorporated in the decoding procedure to generate the abstractive summaries.

2.2 Supervised Attention Learning

The aim of supervised attention learning is to estimate the probability that each word in the source text appears in the generated summary. With sufficient training data, we can regard this problem as a supervised sequence labeling task. We employ a straightforward method to prepare the ground truth labels for the source text. The words in the source text (except the stopwords) that appear in the ground truth summaries are annotated with the positive label $1$. All other words and the punctuation marks are annotated with the negative label $0$.
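The labeling rule above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' preprocessing code: the function name `salience_labels`, the tiny stopword set, and the rule that a token with no alphanumeric character counts as punctuation are all assumptions made for the example.

```python
def salience_labels(source_tokens, summary_tokens, stopwords):
    """Return one 0/1 label per source token.

    A token is positive (1) when it is a non-stopword that also occurs in the
    reference summary; stopwords, punctuation and all other tokens get 0.
    """
    summary_vocab = set(summary_tokens)
    labels = []
    for tok in source_tokens:
        # Assumption: a token without any alphanumeric character is punctuation.
        is_word = any(ch.isalnum() for ch in tok)
        positive = is_word and tok not in stopwords and tok in summary_vocab
        labels.append(1 if positive else 0)
    return labels


# Hypothetical toy stopword list, shortened source sentence, and reference.
stop = {"a", "are", "for", "from", "the", "were", "one"}
source = "toyota team europe were banned from the world rally championship for one year .".split()
summary = "toyota are banned for a year .".split()
labels = salience_labels(source, summary, stop)
```

In this toy case only "toyota", "banned" and "year" receive the positive label; "for" and the final period also occur in the summary but are excluded as a stopword and a punctuation mark.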

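The structure of the supervised attention learning framework (we illustrate the computational logic with the first two states) is depicted on the left of Figure 1. We first map each input word $x_t$ into a vector $\mathbf{x}_t$ by looking it up in an embedding table, which is randomly initialized and fine-tuned during training. The word embedding sequence is fed into a bi-directional RNN to capture the context information. Compared with LSTM Hochreiter and Schmidhuber (1997), GRU Cho et al. (2014) has comparable performance but with fewer parameters and more efficient computation, so we employ GRU as the basic recurrent unit:

$$
\begin{aligned}
\mathbf{r}_t &= \sigma(\mathbf{W}_{xr}\mathbf{x}_t + \mathbf{W}_{hr}\mathbf{h}_{t-1} + \mathbf{b}_r) \\
\mathbf{z}_t &= \sigma(\mathbf{W}_{xz}\mathbf{x}_t + \mathbf{W}_{hz}\mathbf{h}_{t-1} + \mathbf{b}_z) \\
\mathbf{g}_t &= \tanh(\mathbf{W}_{xh}\mathbf{x}_t + \mathbf{W}_{hh}(\mathbf{r}_t \odot \mathbf{h}_{t-1}) + \mathbf{b}_h) \\
\mathbf{h}_t &= \mathbf{z}_t \odot \mathbf{h}_{t-1} + (1 - \mathbf{z}_t) \odot \mathbf{g}_t
\end{aligned}
$$

where $\mathbf{r}_t$ is the reset gate, $\mathbf{z}_t$ is the update gate to control the mixture of the previous hidden state $\mathbf{h}_{t-1}$ and the candidate $\mathbf{g}_t$ to get the current hidden state $\mathbf{h}_t$, and those $\mathbf{W}$'s and $\mathbf{b}$'s are learnable parameters. $\odot$ denotes the element-wise multiplication, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent activation function. We employ a bidirectional GRU network to produce two hidden states at the time step $t$:

$$
\overrightarrow{\mathbf{h}}_t = \overrightarrow{\mathrm{GRU}}(\mathbf{x}_t, \overrightarrow{\mathbf{h}}_{t-1}), \qquad
\overleftarrow{\mathbf{h}}_t = \overleftarrow{\mathrm{GRU}}(\mathbf{x}_t, \overleftarrow{\mathbf{h}}_{t+1})
$$

Then the overall hidden state of the encoder is a concatenation of both directions:

$$
\mathbf{h}^e_t = [\overrightarrow{\mathbf{h}}_t ; \overleftarrow{\mathbf{h}}_t]
$$
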
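As a rough illustration of the encoder described above, the following NumPy sketch runs a GRU in both directions and concatenates the two hidden states at each position. It uses the standard GRU formulation of Cho et al. (2014); the exact gate convention, the parameter names (`W_xr`, `W_hr`, ...), the dimensions, and the random initialization are assumptions made for this example and are not taken from the authors' implementation.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def init_params(emb_dim, hidden_dim, rng):
    """Small random GRU parameters (hypothetical initialization)."""
    def mat(rows, cols):
        return rng.normal(scale=0.1, size=(rows, cols))
    params = {}
    for gate in ("r", "z", "h"):
        params["W_x" + gate] = mat(hidden_dim, emb_dim)
        params["W_h" + gate] = mat(hidden_dim, hidden_dim)
        params["b_" + gate] = np.zeros(hidden_dim)
    return params


def gru_step(x, h_prev, p):
    """One GRU update: reset gate r, update gate z, candidate g."""
    r = sigmoid(p["W_xr"] @ x + p["W_hr"] @ h_prev + p["b_r"])
    z = sigmoid(p["W_xz"] @ x + p["W_hz"] @ h_prev + p["b_z"])
    g = np.tanh(p["W_xh"] @ x + p["W_hh"] @ (r * h_prev) + p["b_h"])
    return z * h_prev + (1.0 - z) * g


def bi_gru(X, p_fwd, p_bwd, hidden_dim):
    """Run forward and backward GRUs and concatenate states per time step."""
    fwd, h = [], np.zeros(hidden_dim)
    for x in X:
        h = gru_step(x, h, p_fwd)
        fwd.append(h)
    bwd, h = [], np.zeros(hidden_dim)
    for x in X[::-1]:
        h = gru_step(x, h, p_bwd)
        bwd.append(h)
    bwd.reverse()  # realign backward states with the original positions
    return np.stack([np.concatenate([f, b]) for f, b in zip(fwd, bwd)])


rng = np.random.default_rng(0)
emb_dim, hidden_dim, seq_len = 5, 3, 4
X = rng.normal(size=(seq_len, emb_dim))  # stand-in for word embeddings
H = bi_gru(X, init_params(emb_dim, hidden_dim, rng),
           init_params(emb_dim, hidden_dim, rng), hidden_dim)
```

Each row of `H` is one encoder state of size `2 * hidden_dim`, which is the quantity the later attention components operate on.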
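In order to capture more context information of the input sequence, we integrate a self-attention modeling component. The self-attention weight at the time step $t$ is calculated based on the relationship of $\mathbf{h}^e_t$ and all the source hidden states $\{\mathbf{h}^e_j\}$. Let $\alpha_{t,j}$ be the attention weight between $\mathbf{h}^e_t$ and $\mathbf{h}^e_j$, which can be calculated as:

$$
e_{t,j} = \mathbf{v}^{\top} \tanh(\mathbf{W}_1 \mathbf{h}^e_t + \mathbf{W}_2 \mathbf{h}^e_j + \mathbf{b}_a), \qquad
\alpha_{t,j} = \frac{\exp(e_{t,j})}{\sum_{j'=1}^{m} \exp(e_{t,j'})}
$$

where $\mathbf{W}_1$, $\mathbf{W}_2$, $\mathbf{b}_a$, and $\mathbf{v}$ are learnable parameters. Then the self-attention context is obtained by the weighted linear combination of all the source hidden states:

$$
\mathbf{c}^e_t = \sum_{j=1}^{m} \alpha_{t,j} \mathbf{h}^e_j
$$

where $m$ is the sequence length. The original hidden state $\mathbf{h}^e_t$ is then revised using the self-attention context information $\mathbf{c}^e_t$, which gives the self-attention state $\tilde{\mathbf{h}}^e_t$.
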
Finally, as shown in Figure 1, we feed the word embedding vector $\mathbf{x}_t$, the hidden state $\mathbf{h}^e_t$, the self-attention context $\mathbf{c}^e_t$, and the self-attention state $\tilde{\mathbf{h}}^e_t$ into the final output layer, which applies the sigmoid function $\sigma$ to produce the prediction $\hat{p}_t \in (0, 1)$. The value of $\hat{p}_t$ represents the salience of the corresponding word in the source text.

In order to get the attention information and the attention context information, we first add a normalization procedure to the predicted $\hat{p}_t$:

$$
a^s_t = \frac{\hat{p}_t}{\sum_{j=1}^{m} \hat{p}_j}
$$

We regard the vector $\mathbf{a}^s = (a^s_1, \ldots, a^s_m)$ as the supervised attention information.

Based on the supervised attention information $\mathbf{a}^s$, we can obtain one type of global attention context by the weighted linear combination of the source hidden states:

$$
\mathbf{c}^s = \sum_{t=1}^{m} a^s_t \mathbf{h}^e_t
$$

Finally, $\mathbf{c}^s$ is incorporated in the decoder as the supervised attention context information for the summary generation.
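The step from per-word predictions to a global context vector can be sketched as follows. The sketch assumes that the normalization simply divides each predicted score by the sum of all scores; the function name `attention_context` and the toy numbers are hypothetical.

```python
import numpy as np


def attention_context(scores, hidden_states):
    """Normalize salience scores and build the weighted sum of hidden states.

    Assumption: normalization divides by the sum of the scores, so the
    resulting weights are non-negative and add up to 1.
    """
    scores = np.asarray(scores, dtype=float)
    weights = scores / scores.sum()
    context = weights @ np.asarray(hidden_states, dtype=float)
    return weights, context


# Four source positions with 2-dimensional hidden states (toy values).
scores = [0.9, 0.1, 0.5, 0.5]
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0],
              [0.0, 0.0]])
weights, context = attention_context(scores, H)
```

Because the weights do not depend on the decoding step, the resulting context is computed once per input and acts as a global signal, in contrast to the step-wise decoder attention.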

2.3 Unsupervised Attention Learning

In the traditional text summarization research, the PageRank Page et al. (1999) based salience estimation methods play a crucial role in identifying the most important information from the source text. Some classical methods such as LexRank Erkan and Radev (2004) and TextRank Mihalcea and Tarau (2004) were proposed to tackle the problems of text summarization and keyphrase extraction, and have been applied into practical summarization applications and products. Tan et al. (2017) introduced the graph-based attention mechanism into the seq2seq framework for sentence salience estimation and obtained encouraging results. Here, we also employ PageRank algorithm to conduct the unsupervised attention learning for salience estimation, as depicted in the middle-upper part of Figure 1. The difference is that we conduct the learning on word level to estimate the salience.

For an input text sequence $X$ with length $m$, and $\mathbf{x}_i$ representing the embedding vector for the word $x_i$, to build the word-based graph $G = (V, E)$, we take the non-stopwords as the vertex set $V$, and the relations between the words, computed with Equation 10, as the edge set $E$. We employ a parameterized tensor method to calculate the weights of the edges. Assume that the adjacency matrix is $\mathbf{A}$, then each element $A_{i,j}$ can be calculated as:

$$
A_{i,j} = \mathbf{x}_i^{\top} \mathbf{W} \mathbf{x}_j \tag{10}
$$

where $\mathbf{W}$ is a neural parameter to be learned. PageRank is an iterative algorithm, but we can get the closed form as discussed in Tan et al. (2017):

$$
\mathbf{f} = (1 - \lambda)(\mathbf{I} - \lambda \mathbf{A} \mathbf{D}^{-1})^{-1} \mathbf{y}
$$

where $\mathbf{D}$ is a diagonal matrix and $D_{i,i} = \sum_{j} A_{j,i}$, $\lambda$ is the damping factor, and $\mathbf{y}$ is a vector with all the elements equal to $1/|V|$. Then the vector $\mathbf{f}$ is the estimated salience score for all the words. We also add a normalization procedure to $\mathbf{f}$:

$$
a^u_i = \frac{f_i}{\sum_{j} f_j}
$$

Then the vector $\mathbf{a}^u$ is regarded as the unsupervised attention information. We can also obtain the second type of global attention context by the weighted linear combination of the word embeddings using $\mathbf{a}^u$:

$$
\mathbf{c}^u = \sum_{i} a^u_i \mathbf{x}_i
$$

And $\mathbf{c}^u$ will be incorporated with the seq2seq framework as the unsupervised attention context information.
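To make the closed-form PageRank step concrete, here is a small NumPy sketch. It follows the closed form given by Tan et al. (2017), with the diagonal matrix built from column sums of the adjacency matrix. The function name `pagerank_closed_form` and the toy graph are assumptions for illustration; the example also assumes non-negative edge weights with non-zero column sums, which a learned bilinear edge weight does not guarantee by itself.

```python
import numpy as np


def pagerank_closed_form(A, damping=0.9):
    """Closed-form PageRank scores for a weighted adjacency matrix A.

    Computes f = (1 - d) * (I - d * A * D^{-1})^{-1} * y, where D holds the
    column sums of A and y is the uniform vector, then renormalizes f.
    """
    A = np.asarray(A, dtype=float)
    n = A.shape[0]
    D_inv = np.diag(1.0 / A.sum(axis=0))   # inverse column sums
    y = np.full(n, 1.0 / n)                # uniform teleport vector
    # Solve the linear system instead of forming the explicit inverse.
    f = (1.0 - damping) * np.linalg.solve(np.eye(n) - damping * A @ D_inv, y)
    return f / f.sum()


# Toy star graph: word 0 is connected to words 1 and 2.
A = np.array([[0.0, 1.0, 1.0],
              [1.0, 0.0, 0.0],
              [1.0, 0.0, 0.0]])
scores = pagerank_closed_form(A, damping=0.9)
```

In the toy graph the central word receives the largest score, and the two leaf words receive equal scores. The damping value 0.9 matches the setting reported in the experimental settings.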

2.4 Summary Generation

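The decoder of our MAL framework is still a GRU-based recurrent neural network with improved attention modeling. The first hidden state of the decoder is initialized using the average of all the source input hidden states: $\mathbf{h}^{d}_0 = \frac{1}{m}\sum_{j=1}^{m} \mathbf{h}^e_j$. Then two layers of GRUs are designed to conduct the attention weight calculation and the decoder hidden state update. On the first GRU layer, the hidden state is calculated only using the current input word embedding $\mathbf{y}_{t-1}$ and the previous hidden state $\mathbf{h}^{d_2}_{t-1}$:

$$
\mathbf{h}^{d_1}_t = \mathrm{GRU}_1(\mathbf{y}_{t-1}, \mathbf{h}^{d_2}_{t-1})
$$

Then the attention weights at the time step $t$ are calculated based on the relationship of $\mathbf{h}^{d_1}_t$ and all the source hidden states $\{\mathbf{h}^e_j\}$:

$$
e_{t,j} = \mathbf{v}_d^{\top} \tanh(\mathbf{W}^{d} \mathbf{h}^{d_1}_t + \mathbf{W}^{e} \mathbf{h}^e_j + \mathbf{b}_d), \qquad
a_{t,j} = \frac{\exp(e_{t,j})}{\sum_{j'=1}^{m} \exp(e_{t,j'})}
$$

The attention context is obtained by the weighted linear combination of all the source hidden states: $\mathbf{c}^d_t = \sum_{j=1}^{m} a_{t,j} \mathbf{h}^e_j$. The final hidden state $\mathbf{h}^{d_2}_t$ is the output of the second GRU layer, jointly considering the word $\mathbf{y}_{t-1}$, the previous hidden state $\mathbf{h}^{d_2}_{t-1}$, and the attention context $\mathbf{c}^d_t$:

$$
\mathbf{h}^{d_2}_t = \mathrm{GRU}_2(\mathbf{y}_{t-1}, \mathbf{h}^{d_2}_{t-1}, \mathbf{c}^d_t)
$$

The traditional seq2seq framework will predict the target word $y_t$ based on $\mathbf{h}^{d_2}_t$.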

2.4.1 Multi-Attention Integration

Recall that we have obtained the supervised attention context $\mathbf{c}^s$ in Section 2.2 and the unsupervised attention context $\mathbf{c}^u$ in Section 2.3. We integrate all the attention context information with the decoder hidden state in a straightforward manner.

Finally, the probability of generating any target word $y_t$ is obtained by applying the softmax function to the integrated representation. At prediction time, we use the beam search algorithm Koehn (2004) for decoding and generating the best summary.

2.5 Model Training

For supervised attention learning, we use the cross-entropy as the objective function, which needs to be minimized:

$$
\mathcal{L}_s = -\sum_{t=1}^{m} \left[ p_t \log \hat{p}_t + (1 - p_t) \log (1 - \hat{p}_t) \right]
$$

where $\hat{p}_t$ and $p_t$ are the prediction and the ground truth respectively.

For summary generation, we employ the negative log likelihood (NLL) as the objective function. Given the ground truth summary $Y$ for the input sequence $X$, we have:

$$
\mathcal{L}_g = -\sum_{t=1}^{n} \log p(y_t \mid y_{<t}, X)
$$

Then the final objective loss function combines the two terms $\mathcal{L}_s$ and $\mathcal{L}_g$.

The whole framework can be trained using the multi-task learning paradigm with the back-propagation method in an end-to-end training style. Adadelta Zeiler (2012) with hyperparameters $\rho$ and $\epsilon$ is used for gradient-based optimization.
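The two training objectives can be illustrated with a short NumPy sketch. How the two terms are combined is an assumption here: the sketch simply adds them with equal weight, and it averages the salience cross-entropy over source positions. The function names and toy probabilities are hypothetical.

```python
import math

import numpy as np


def salience_cross_entropy(pred, gold, eps=1e-12):
    """Binary cross-entropy between predicted salience and 0/1 labels."""
    pred = np.clip(np.asarray(pred, dtype=float), eps, 1.0 - eps)
    gold = np.asarray(gold, dtype=float)
    return float(-np.mean(gold * np.log(pred) + (1.0 - gold) * np.log(1.0 - pred)))


def summary_nll(step_probs, target_ids):
    """Negative log likelihood of the reference summary.

    step_probs[t] is the predicted distribution over the vocabulary at
    decoding step t; target_ids[t] is the index of the reference word.
    """
    step_probs = np.asarray(step_probs, dtype=float)
    picked = step_probs[np.arange(len(target_ids)), target_ids]
    return float(-np.sum(np.log(picked)))


def joint_loss(pred, gold, step_probs, target_ids):
    # Assumption: unweighted sum of the two task losses.
    return salience_cross_entropy(pred, gold) + summary_nll(step_probs, target_ids)


ce = salience_cross_entropy([0.5, 0.5], [1, 0])
nll = summary_nll([[0.5, 0.5], [0.25, 0.75]], [0, 1])
total = joint_loss([0.5, 0.5], [1, 0], [[0.5, 0.5], [0.25, 0.75]], [0, 1])
```

Since both terms are differentiable with respect to the shared encoder parameters, gradients from the labeling task and the generation task flow into the same representation, which is the point of the multi-task setup.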

3 Experimental Setup

3.1 Datasets

We train and evaluate our framework on three popular benchmark datasets. Gigawords is an English sentence summarization dataset prepared based on Annotated Gigawords by extracting the first sentence from a news report together with the headline to form a source and summary pair (i.e. the first sentence and headline). We directly download the prepared dataset used in Rush et al. (2015). It contains roughly 3.8M training pairs, 190K validation pairs, and 2,000 test pairs. The test set is identical to the one used in all the comparative methods. DUC-2004 is another English dataset used only for testing, where we directly apply the model trained on Gigawords. It contains 500 documents. Each document comes with 4 model summaries written by experts. The length of the summary is limited to 75 bytes. LCSTS is a large-scale Chinese short text summarization dataset, consisting of pairs of short text and summary, collected from Sina Weibo Hu et al. (2015). We take Part-I as the training set, Part-II as the development set, and Part-III as the test set. Each pair in Part-II and Part-III carries a human-labeled score in the range of 1 to 5 indicating the relevance between an article and its summary. We only make use of those pairs with scores no less than 3. Thus, the three parts contain 2.4M, 8.7K, and 725 data points respectively. In our experiments, we directly take the Chinese character sequences as input, without performing word segmentation.

3.2 Evaluation Metrics

We use ROUGE Lin (2004) with standard options as our evaluation metric. The idea of ROUGE is to count the number of overlapping units between the generated summaries and the reference summaries, such as overlapped n-grams, word sequences, and word pairs. F-measures of ROUGE-1 (R-1), ROUGE-2 (R-2) and ROUGE-L (R-L) are reported for the Gigawords and LCSTS datasets. ROUGE recalls are reported for the DUC dataset.
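For intuition, the unigram case (ROUGE-1) can be sketched as a clipped count of overlapping words. This is a simplified illustration, not the official ROUGE toolkit: it skips stemming, stopword handling, multiple references, and the other standard options, and the function name `rouge1` is an assumption.

```python
from collections import Counter


def rouge1(candidate_tokens, reference_tokens):
    """Return (precision, recall, F1) based on clipped unigram overlap."""
    cand = Counter(candidate_tokens)
    ref = Counter(reference_tokens)
    # Counter intersection keeps the minimum count of each shared word.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0, 0.0, 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


candidate = "toyota barred from world rally championship".split()
reference = "toyota are banned for a year".split()
precision, recall, f1 = rouge1(candidate, reference)
```

The toy pair shares only the word "toyota", so all three values equal 1/6. Recall divides the overlap by the reference length and precision by the candidate length, which is why recall is the natural choice when the output length is capped, as in the 75-byte DUC setting.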

3.3 Comparative Methods

We compare our MAL with a number of previous methods. Since the datasets are quite standard, we extract the results from the original papers where reported. Therefore, the compared methods on different datasets may be slightly different. TOPIARY Zajic et al. (2004) is the best on DUC2004 Task-1 for compressive text summarization. It combines a system using linguistic based transformations and an unsupervised topic detection algorithm for compressive text summarization. MOSES+ Rush et al. (2015) uses a phrase-based statistical machine translation system trained on Gigaword to produce summaries. ABS and ABS+ Rush et al. (2015) are both neural network based models with local attention modeling for abstractive sentence summarization. RNN and RNN-context Hu et al. (2015) are two seq2seq architectures. RNN-context integrates an attention mechanism to model the context. CopyNet Gu et al. (2016) integrates a copying mechanism into the seq2seq framework. RNN-distract Chen et al. (2016) uses a new attention mechanism that distracts the historical attention in the decoding steps. RAS-LSTM and RAS-Elman Chopra et al. (2016) both consider words and word positions as input and use convolutional encoders to handle the source information. For the attention based sequence decoding process, RAS-Elman selects Elman RNN Elman (1990) as its decoder, and RAS-LSTM selects the LSTM architecture Hochreiter and Schmidhuber (1997). LenEmb Kikuchi et al. (2016) uses a mechanism to control the summary length by considering the length embedding vector as the input. ASC+FSC Miao and Blunsom (2016) uses a generative model with an attention mechanism to tackle the sentence compression problem. lvt2k-1sent and lvt5k-1sent Nallapati et al. (2016) utilize a trick to control the vocabulary size to improve the training efficiency. SEASS Zhou et al. (2017) integrates a selective gated network into the seq2seq framework to control the information flow from encoder to decoder. DRGD Li et al. (2017b) proposes a deep recurrent generative decoder to enhance the modeling ability of latent structures in the target summaries.

3.4 Experimental Settings

For the experiments on the English dataset of Gigawords, we set the dimension of word embeddings to 300, and the dimension of hidden states and latent variables to 500. The maximum length of documents and summaries is 100 and 50 respectively. For DUC-2004, the maximum length of summaries is 75 bytes. For the dataset of LCSTS, the dimension of word embeddings is 350. We also set the dimension of hidden states and latent variables to 500. The maximum length of documents and summaries is 120 and 25 Chinese characters respectively. The damping factor $\lambda$ of the PageRank algorithm for the unsupervised attention learning is set to 0.9. The beam size of the decoder is set to 10. Our neural network based framework is implemented using Theano Theano Development Team (2016).

4 Results and Discussions

4.1 ROUGE Evaluation

System                   R-1     R-2     R-L
ABS                      29.55   11.32   26.42
ABS+                     29.78   11.89   26.97
RAS-LSTM                 32.55   14.70   30.03
RAS-Elman                33.78   15.97   31.15
ASC+FSC                  34.17   15.94   31.92
lvt2k-1sent              32.67   15.59   30.64
lvt5k-1sent              35.30   16.64   32.62
SEASS                    36.15   17.54   33.63
seq2seq (our version)    34.49   16.79   33.06
seq2seq+SuAtt            35.80   17.02   33.25
seq2seq+UnAtt            35.91   17.16   33.38
seq2seq+MAL              36.39   17.37   33.82
DRGD                     36.27   17.57   33.62
DRGD+MAL                 36.30   17.77   33.64
Table 1: ROUGE-F1 on Gigawords.

System                   R-1     R-2     R-L
TOPIARY                  25.12    6.46   20.12
MOSES+                   26.50    8.13   22.85
ABS                      26.55    7.06   22.05
ABS+                     28.18    8.49   23.81
RAS-Elman                28.97    8.26   24.06
RAS-LSTM                 27.41    7.69   23.06
LenEmb                   26.73    8.39   23.88
lvt2k-1sent              28.35    9.46   24.59
lvt5k-1sent              28.61    9.42   25.24
SEASS                    29.21    9.56   25.51
seq2seq (our version)    28.82    9.47   25.27
seq2seq+SuAtt            29.46    9.92   25.85
seq2seq+UnAtt            29.13    9.96   25.63
seq2seq+MAL              29.49   10.13   25.91
DRGD                     28.99    9.72   25.28
DRGD+MAL                 29.04    9.90   25.31
Table 2: ROUGE-Recall on DUC-2004.

System                   R-1     R-2     R-L
RNN                      21.50    8.90   18.60
RNN-context              29.90   17.40   27.20
CopyNet                  34.40   21.60   31.30
RNN-distract             35.20   22.60   32.50
DRGD                     36.99   24.15   34.21
seq2seq (our version)    35.38   23.00   32.85
seq2seq+SuAtt            36.80   24.20   34.45
seq2seq+UnAtt            37.00   24.20   34.39
seq2seq+MAL              37.04   24.65   34.70
DRGD                     36.99   24.15   34.21
DRGD+MAL                 37.36   24.73   34.65
Table 3: ROUGE-F1 on LCSTS.

The results on the English datasets of Gigawords and DUC-2004 are shown in Table 1 and Table 2 respectively. Among the ablations, “seq2seq (our version)” is the typical attention based seq2seq framework implemented by us. “seq2seq+SuAtt” is the ablation method only considering supervised attention information. “seq2seq+UnAtt” only considers unsupervised attention information. “seq2seq+MAL” is our proposed framework. From the experimental results, we can see that our MAL framework performs better than the typical seq2seq method as well as some other strong comparisons, which means that the multi-attention context information can indeed improve the performance of the typical seq2seq summarization models. It is worth noting that the methods lvt2k-1sent and lvt5k-1sent utilize linguistic features such as parts-of-speech tags, named-entity tags, and TF and IDF statistics of the words as part of the document representation. Generally, more useful features can indeed improve the performance. Nevertheless, our framework is still better than them which demonstrates the effectiveness of our salience detection components.

The results on the Chinese dataset LCSTS are shown in Table 3. Our MAL also achieves the best performance. Although CopyNet employs a copying mechanism to improve the summary quality, RNN-distract considers attention information diversity in its decoder, and DRGD integrates a recurrent variational auto-encoder into the typical seq2seq framework, our model is still better than these methods, which demonstrates the effectiveness of incorporating the multi-attention context information. We expect that integrating the copying mechanism and coverage diversity into our framework will further improve the summarization performance.

4.1.1 Highlight Discussion

Note that in our framework, we integrate the multi-attention information with a simple base model, namely, the attention-based seq2seq model. Thus the performance of the whole framework is inherently limited, and the evaluation results are not as good as some very strong recent methods, such as SEASS Zhou et al. (2017), Pointer-Generator See et al. (2017), and the Reinforced model Paulus et al. (2018). However, the purpose of this work is to investigate the performance of applying the traditional salience detection intuitions in the simple attention-based seq2seq framework, and such a simple base model keeps the conclusions from being biased by other modeling complications. The experimental analysis demonstrates its effectiveness. Therefore, our study not only reminds peer researchers that the crucial salience detection component for summarization should be reexamined in the scope of neural network based models, but also presents a practical approach to solving this problem. If the two types of attention signals are appropriately integrated into the above recent models, we believe that their performance can be improved as well. Moreover, our attention learning framework can also help revise the design of the copy mechanism as well as the coverage modeling strategy. All these are worthwhile directions to investigate in future work.

4.2 Attention Analysis

System   Giga    DUC     LCSTS
SuAtt    30.97   29.14   24.97
UnAtt    21.38   20.16   17.87
Table 4: ROUGE-1 evaluation for the top-10 words extracted from SuAtt and UnAtt.

S(1): japan ’s toyota team europe were banned from the world rally championship for one year here on friday in a crushing ruling by the world council of the international automobile federation fia.
Golden: toyota are banned for a year.
SuAtt: toyota, rally, world, europe, banned, championship, team, ruling, year, fia
UnAtt: world, council, europe, federation, international, japan, ruling, friday, championship, banned
S(2): a powerful bomb exploded outside a navy base near the sri lankan capital colombo tuesday, seriously wounding at least one person, military officials said.
Golden: bomb attack outside srilanka navy base.
SuAtt: sri, bomb, base, navy, colombo, ankan, powerful, military, wounding, exploded
UnAtt: sri, military, capital, tuesday, bomb, powerful, navy, exploded, base, officials
S(3): palestinian prime minister ismail haniya insisted friday that his hamas-led government was continuing efforts to secure the release of an israeli soldier captured by militants.
Golden: efforts still underway to secure soldier’s release: hamas pm.
SuAtt: palestinian, release, haniya, hamas-led, soldier, israeli, government, secure, efforts, minister
UnAtt: government, prime, palestinian, friday, israeli, militants, efforts, continuing, minister, secure
Table 5: Top-10 words extracted from SuAtt and UnAtt respectively for samples in Gigawords.

We regard the supervised attention and the unsupervised attention as the salience scores for the words in the source text. So we also design experiments to verify the performance of the two attention mechanisms for finding important words. For each input sequence, $\mathbf{a}^s$ and $\mathbf{a}^u$ are the two attention vectors obtained by supervised attention learning and unsupervised attention learning respectively. Each element value represents a word salience score. Therefore, we can select the top-$k$ words from the input sequence according to the salience scores in $\mathbf{a}^s$ and $\mathbf{a}^u$. Intuitively, the extracted top-$k$ words are very important and may have a large overlap with the ground truth summary. To verify this quantitatively, we regard the top words as summaries and conduct ROUGE evaluation on them. Because the order of the top words is ignored, we employ the F-measure score of ROUGE-1 as the evaluation metric. The experimental results on the three datasets are given in Table 4. We set $k$ to 10 here. The results illustrate that both methods can extract the important words from the source text, and the quality of the top words extracted from the supervised attention $\mathbf{a}^s$, i.e. SuAtt, is better than that of the words extracted from the unsupervised attention $\mathbf{a}^u$, i.e. UnAtt. This matches our intuition because the SuAtt method receives stronger supervision signals than the unsupervised method UnAtt.
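The top-word selection used in this analysis amounts to sorting the source words by their attention weight. A minimal sketch follows; the function name `top_k_words` and the salience values are hypothetical and chosen only to show the mechanics.

```python
import numpy as np


def top_k_words(tokens, salience, k=10):
    """Return the k tokens with the largest salience, highest first."""
    scores = np.asarray(salience, dtype=float)
    # Stable sort on negated scores keeps source order among exact ties.
    order = np.argsort(-scores, kind="stable")
    return [tokens[i] for i in order[:k]]


tokens = ["toyota", "team", "banned", "year", "friday"]
salience = [0.40, 0.10, 0.30, 0.15, 0.05]  # made-up attention weights
top3 = top_k_words(tokens, salience, k=3)
```

The selected words can then be scored against the reference summary with a unigram overlap measure, which is insensitive to word order and therefore suits an unordered bag of top words.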

However, from the ROUGE results presented in Tables 1, 2, and 3, we find that the performance of seq2seq+UnAtt is similar to or even better than that of seq2seq+SuAtt. A possible reason is that both seq2seq and SuAtt receive supervision signals to guide the training, whereas UnAtt is an unsupervised salience detection method that may find complementary information to further improve the summarization performance. In order to show the differences vividly, we present the extracted top words in Table 5, ranked by the corresponding salience scores. The results show that SuAtt and UnAtt can indeed assign large salience scores to the important words. For instance, SuAtt can extract the words “toyota”, “banned”, and “year”, which are the core elements of the golden summary “toyota are banned for a year”. The result of UnAtt is more diversified. Although SuAtt and UnAtt perform differently, their integration performs well in the quantitative evaluation experiments in the previous subsection, possibly because different attention methods capture different aspects of the source text and complement each other.

4.3 Summary Case Analysis

S(1): japan ’s toyota team europe were banned from the world rally championship for one year here on friday in a crushing ruling by the world council of the international automobile federation fia.
Golden: toyota are banned for a year.
seq2seq: toyota ’s world rally europe banned from world rally championship.
MAL: toyota barred from world rally championship.
S(2): slovaks started voting at #:## am on saturday in elections to the #-seat parliament, with centre-right prime minister mikulas dzurinda fighting to continue far-reaching but painful reforms.
Golden: slovaks start voting in legislative elections.
seq2seq: slovakia’s parliament begins voting.
MAL: slovaks start voting in early elections.
S(3): the thai government has set aside ### million baht about ##.## million u.s. dollars to support new eco-tourism plans during ####-#### , according to a report of the thai news agency tna tuesday.
Golden: thai government to support eco-tourism.
seq2seq: thailand to support new eco-tourism in ####-####.
MAL: thailand to support new eco-tourism plans.
Table 6: Examples of the generated summaries.

Finally, some examples of the source texts, golden summaries, and the generated summaries by the typical attention-based seq2seq framework and our proposed MAL framework are shown in Table 6. From these cases we can see that the generated summaries by MAL generally have better quality. Moreover, because of the attention learning components for salience detection, our framework has the ability to assign small salience scores to unimportant words and uninformative symbols, while the summary generated by seq2seq contains more noisy symbols as the cases shown in S(3).

5 Related Works

Automatic summarization is the process of automatically generating a summary that retains the most important content of the original text document Nenkova and McKeown (2012). Conventional summarization methods can be classified into three categories: extraction-based methods Erkan and Radev (2004); Min et al. (2012), compression-based methods Li et al. (2013); Wang et al. (2013); Li et al. (2017a), and abstraction-based methods Barzilay and McKeown (2005); Bing et al. (2015). (Some researchers regard the compression approach as a special case of the extraction approach.)

Recently, some researchers have employed neural network based frameworks to tackle the abstractive summarization problem and obtained encouraging performance. Rush et al. (2015) proposed a neural model with local attention modeling, which is trained on the Gigaword corpus, but combined with an additional log-linear extractive summarization model with handcrafted features. Nallapati et al. (2016) utilized a trick to control the vocabulary size to improve the training efficiency. Gu et al. (2016) integrated a copying mechanism into a seq2seq framework to improve the quality of the generated summaries. Chen et al. (2016) proposed a new attention mechanism that not only considers the important source segments, but also distracts them in the decoding step in order to better grasp the overall meaning of input documents. Miao and Blunsom (2016) extended the seq2seq framework and proposed a generative model to capture the latent summary information. Zhou et al. (2017) integrated a selective gated network into the seq2seq framework to control the information flow from encoder to decoder. Li et al. (2017b) proposed a deep recurrent generative decoder to enhance the modeling ability of latent structures in the target summaries. See et al. (2017) employed pointer networks and a coverage mechanism to improve the quality of the generated summaries. Paulus et al. (2018) proposed a reinforcement learning based framework to enhance the performance of summarization. Chen et al. (2018) proposed a generative bridging network in which a bridge module is introduced to assist the training of the sequence prediction model. Li et al. (2018) employed an actor-critic training paradigm to enhance the quality of the generated summaries.

Meanwhile, some researchers have also incorporated traditional salience estimation methods into seq2seq frameworks in order to enhance summarization performance. Tan et al. (2017) incorporated the graph-based attention information obtained by the PageRank algorithm into their framework. Hsu et al. (2018) weighted the attention mechanism using sentence salience information calculated by a traditional supervised method. In contrast, we consider both supervised and unsupervised salience information in our framework to generate better summaries.
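As a rough illustration of the kind of traditional graph-based salience score mentioned above, the following minimal sketch ranks sentences by PageRank over a bag-of-words cosine-similarity graph. It is not the model of Tan et al. (2017), nor the framework proposed in this paper; the function names, the whitespace tokenization, the damping factor, and the iteration count are all illustrative assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank_salience(sentences, damping=0.85, iters=50):
    """Score sentences by PageRank over a cosine-similarity graph.

    Hypothetical helper: whitespace tokenization and the default
    hyperparameters are assumptions made for this sketch.
    """
    n = len(sentences)
    if n == 0:
        return []
    bags = [Counter(s.lower().split()) for s in sentences]
    # Edge weights are pairwise similarities; self-loops are excluded.
    sim = [[cosine(bags[i], bags[j]) if i != j else 0.0 for j in range(n)]
           for i in range(n)]
    out = [sum(row) for row in sim]  # total outgoing weight per sentence
    scores = [1.0 / n] * n
    for _ in range(iters):
        new_scores = []
        for i in range(n):
            # Mass flowing into sentence i from every connected sentence j.
            incoming = sum(scores[j] * sim[j][i] / out[j]
                           for j in range(n) if out[j] > 0)
            new_scores.append((1 - damping) / n + damping * incoming)
        scores = new_scores
    # Isolated sentences leak mass, so renormalize to a distribution.
    total = sum(scores)
    return [s / total for s in scores]


sentences = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "stock prices fell sharply today",
]
scores = pagerank_salience(sentences)
```

In this toy input the third sentence shares no words with the others, so it receives only the uniform teleportation mass and ends up with the lowest score. Normalized scores of this kind are one simple way to obtain a salience distribution over input units.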

6 Conclusions

In this work, we investigate the effect of adding the traditional salience detection of text summarization back to the typical attention-based seq2seq framework for abstractive summarization. We propose a Multi-Attention Learning (MAL) framework which contains two new attention learning components for salience estimation, namely supervised attention learning and unsupervised attention learning. The salience information obtained from these two types of attention is combined with the typical attention mechanism in the decoder to conduct summary generation. Extensive experiments on benchmark datasets in different languages demonstrate the effectiveness of the proposed framework for the task of abstractive summarization.

References

  • D. Bahdanau, K. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In ICLR. Cited by: §1.
  • R. Barzilay and K. R. McKeown (2005) Sentence fusion for multidocument news summarization. Computational Linguistics 31 (3), pp. 297–328. Cited by: §5.
  • L. Bing, P. Li, Y. Liao, W. Lam, W. Guo, and R. Passonneau (2015) Abstractive multi-document summarization via phrase selection and merging. In ACL, pp. 1587–1597. Cited by: §5.
  • Q. Chen, X. Zhu, Z. Ling, S. Wei, and H. Jiang (2016) Distraction-based neural networks for document summarization. In IJCAI, pp. 2754–2760. Cited by: §3.3, §5.
  • W. Chen, G. Li, S. Ren, S. Liu, Z. Zhang, M. Li, and M. Zhou (2018) Generative bridging network in neural sequence prediction. NAACL. Cited by: §5.
  • K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. In EMNLP, pp. 1724–1734. Cited by: §2.2.
  • S. Chopra, M. Auli, and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. NAACL-HLT, pp. 93–98. Cited by: §1, §3.3.
  • J. L. Elman (1990) Finding structure in time. Cognitive science 14 (2), pp. 179–211. Cited by: §3.3.
  • G. Erkan and D. R. Radev (2004) LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479. Cited by: §1, §1, §2.3, §5.
  • J. Gu, Z. Lu, H. Li, and V. O. Li (2016) Incorporating copying mechanism in sequence-to-sequence learning. In ACL, pp. 1631–1640. Cited by: §1, §3.3, §5.
  • S. Hochreiter and J. Schmidhuber (1997) Long short-term memory. Neural computation 9 (8), pp. 1735–1780. Cited by: §2.2, §3.3.
  • W. Hsu, C. Lin, M. Lee, K. Min, J. Tang, and M. Sun (2018) A unified model for extractive and abstractive summarization using inconsistency loss. arXiv preprint arXiv:1805.06266. Cited by: §5.
  • B. Hu, Q. Chen, and F. Zhu (2015) LCSTS: a large scale Chinese short text summarization dataset. In EMNLP, pp. 1962–1972. Cited by: §3.1, §3.3.
  • Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling output length in neural encoder-decoders. In EMNLP, pp. 1328–1338. Cited by: §3.3.
  • P. Koehn (2004) Pharaoh: a beam search decoder for phrase-based statistical machine translation models. In Conference of the Association for Machine Translation in the Americas, pp. 115–124. Cited by: §2.4.1.
  • C. Li, F. Liu, F. Weng, and Y. Liu (2013) Document summarization via guided sentence compression. In EMNLP, pp. 490–500. Cited by: §5.
  • P. Li, L. Bing, and W. Lam (2018) Actor-critic based training framework for abstractive summarization. arXiv preprint arXiv:1803.11070. Cited by: §5.
  • P. Li, W. Lam, L. Bing, W. Guo, and H. Li (2017a) Cascaded attention based unsupervised information distillation for compressive summarization. In EMNLP. Cited by: §5.
  • P. Li, W. Lam, L. Bing, and Z. Wang (2017b) Deep recurrent generative decoder for abstractive text summarization. In EMNLP, pp. 2091–2100. Cited by: §1, §3.3, §5.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, Vol. 8. Cited by: §3.2.
  • Y. Miao and P. Blunsom (2016) Language as a latent variable: discrete generative models for sentence compression. In EMNLP, pp. 319–328. Cited by: §3.3, §5.
  • R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In EMNLP. Cited by: §1, §1, §2.3.
  • Z. L. Min, Y. K. Chew, and L. Tan (2012) Exploiting category-specific information for multi-document summarization. COLING, pp. 2093–2108. Cited by: §5.
  • R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023. Cited by: §1, §3.3, §5.
  • A. Nenkova and K. McKeown (2012) A survey of text summarization techniques. In Mining Text Data, pp. 43–76. Cited by: §5.
  • J. Ng, P. Bysani, Z. Lin, M. Kan, and C. Tan (2012) Exploiting category-specific information for multi-document summarization. COLING, pp. 2093–2108. Cited by: §1.
  • L. Page, S. Brin, R. Motwani, and T. Winograd (1999) The PageRank citation ranking: bringing order to the web. Technical report, Stanford InfoLab. Cited by: §1, §2.3.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. ICLR. Cited by: §1, §4.1.1, §5.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In EMNLP, pp. 379–389. Cited by: §1, §3.1, §3.3, §5.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In ACL, Vol. 1, pp. 1073–1083. Cited by: §1, §4.1.1, §5.
  • J. Tan, X. Wan, and J. Xiao (2017) Abstractive document summarization with a graph-based attentional neural model. In ACL, Vol. 1, pp. 1171–1181. Cited by: §1, §1, §2.3, §2.3, §5.
  • Theano Development Team (2016) Theano: A Python framework for fast computation of mathematical expressions. arXiv e-prints abs/1605.02688. Cited by: §3.4.
  • Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li (2016) Modeling coverage for neural machine translation. In ACL, Vol. 1, pp. 76–85. Cited by: §1.
  • O. Vinyals, M. Fortunato, and N. Jaitly (2015) Pointer networks. In NIPS, pp. 2692–2700. Cited by: §1.
  • L. Wang, H. Raghavan, V. Castelli, R. Florian, and C. Cardie (2013) A sentence compression based framework to query-focused multi-document summarization. In ACL, pp. 1384–1394. Cited by: §1, §5.
  • D. Zajic, B. Dorr, and R. Schwartz (2004) BBN/UMD at DUC-2004: Topiary. In HLT-NAACL, pp. 112–119. Cited by: §3.3.
  • M. D. Zeiler (2012) ADADELTA: an adaptive learning rate method. arXiv preprint arXiv:1212.5701. Cited by: §2.5.
  • Q. Zhou, N. Yang, F. Wei, and M. Zhou (2017) Selective encoding for abstractive sentence summarization. In ACL, pp. 1095–1104. Cited by: §1, §3.3, §4.1.1, §5.