Transfer Learning for Abstractive Summarization at Controllable Budgets

02/18/2020 · by Ritesh Sarkhel, et al. · The Ohio State University

Summarizing a document within an allocated budget while maintaining its major concepts is a challenging task. If the budget can take any arbitrary value and is not known beforehand, the task becomes even more difficult. Most existing methods for abstractive summarization, including state-of-the-art neural networks, are data intensive. If the number of available training samples is limited, they fail to construct high-quality summaries. We propose MLS, an end-to-end framework that generates abstractive summaries with limited training data at arbitrary compression budgets. MLS employs a pair of supervised sequence-to-sequence networks. The first network, called the MFS-Net, constructs a minimal feasible summary by identifying the key concepts of the input document. The second network, called the Pointer-Magnifier, then generates the final summary from the minimal feasible summary by leveraging an interpretable multi-headed attention model. Experiments on two cross-domain datasets show that MLS outperforms baseline methods over a range of success metrics including ROUGE and METEOR. We observed a relative improvement of approximately 4% over the strongest baseline at small compression budgets. Results from a human evaluation study also establish the effectiveness of MLS in generating complete, coherent summaries at arbitrary compression budgets.


1. Introduction

The wealth of data available at a single click often adds to the information overload problem (Wurman et al., 2001). Summarization is an intuitive way to address this problem by constructing a condensed equivalent of the available data. There are two main approaches to text summarization: extractive and abstractive. In the extractive approach, sentences are sampled from the input document, while in the abstractive approach the summary is not constrained by the vocabulary of the input document. Great progress (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017) has been made in recent years on abstractive summarization techniques. Various sequence-to-sequence networks with attention have been proposed to generate abstractive summaries, ranging from RNNs (Rush et al., 2015) and BLSTM networks (See et al., 2017) to CNNs (Gehring et al., 2017) with Gated Linear Units (Dauphin et al., 2017).

Controlling the length of a summary while preserving its quality is one of the most challenging but important aspects of abstractive summarization. One of the most important real-world applications of budgeted summarization is optimizing web content for varying screen sizes. Web content creators such as news portals, bloggers, and online advertisement agencies with target audiences on multiple digital platforms (e.g., mobiles, laptops, smart-watches) are some of its biggest beneficiaries. The high variance in the screen sizes of these devices makes it difficult to deliver textual content effectively with a traditional supervised approach.

To employ sequence-to-sequence networks such as (Nallapati et al., 2016; See et al., 2017) for generating summaries at a given compression budget $b$, we need a parallel corpus of text documents and their corresponding summaries at $b$. Constructing such a corpus for any given budget is a resource-intensive task that usually requires human supervision. Repeating this process for all possible values of $b$ to account for the inherent variability makes it even more costly. Furthermore, in many real-world applications the allowed budget is only known at run-time. One of the current practices to get around this problem is to generate a summary independent of the budget and then truncate it to fit the budget. Naive approaches such as this often produce incomplete and/or incoherent summaries. We propose MLS, an end-to-end framework to construct high-quality summaries at arbitrary compression budgets, leveraging limited training data. Given a document $D$ with $N$ tokens and a compression budget $b$, the objective of our framework is to generate an abstractive summary $S_b$ of $D$ such that the following conditions are satisfied:

C1: Information redundancy is minimized with respect to $D$.

C2: Coverage of the major topics of $D$ is maximized.

C3: $S_b$ is maximal with respect to the allocated budget, i.e., there is no summary $S'$ of $D$ with $|S_b| < |S'| \le b \cdot N$ that does not worsen the first two conditions.

Conditions C1 and C2 ensure that the qualities of a good summary are preserved in $S_b$, while C3 ensures that $S_b$ is the largest possible summary within the allocated budget that does not compromise its quality. MLS takes a prototype-driven approach (Liu et al., 2019; Saito et al., 2020) towards summarization: summaries at specified compression budgets are constructed using a prototype text as a guide. We call this prototype the minimal feasible summary ($S_{mfs}$) of the document. Contrary to previous works that followed a similar approach, the minimal feasible summary is not a bag of keywords extracted from the document; it is a coherent and complete summary representing the most prevalent concepts of the input document.

MLS constructs abstractive summaries of a document $D$ at a specified budget in two steps. A sequence-to-sequence network, referred to as the MFS-Net, first constructs the minimal feasible summary $S_{mfs}$: an abstractive summary of the input document that captures its key concepts while maintaining coherence and fluency. MFS-Net is an LSTM-based encoder-decoder network similar to (See et al., 2017) with one key difference: we pretrain it on the CNN-DailyMail dataset (Nallapati et al., 2016) and then fine-tune it (L1-transfer of encoder-decoder weights (Pan and Yang, 2010)) on the experimental dataset, allowing us to obtain good generalization in the final summaries with a significantly smaller training set (see Section 4). A second network, called the Pointer-Magnifier network, constructs the final budgeted summary $S_b$ from $S_{mfs}$ using attention (Vaswani et al., 2017). It is a sequence-to-sequence network with interpretable, multi-headed attention in which each attention-head represents a desirable quality of the final summary. Sentences in $S_{mfs}$ are copied or expanded depending on the budget to construct the final summary. To summarize, the main contributions of this work are as follows:

  • We propose MLS, an end-to-end framework to construct abstractive summaries with limited number of training samples at arbitrary compression budgets.

  • We develop an interpretable multi-headed attention mechanism to construct budgeted summaries from the minimal feasible summary.

  • Results show that MLS-generated summaries are coherent and complete, as corroborated by human evaluation at multiple compression budgets. The gains are larger when the desired length is short.

We evaluated our framework on two cross-domain datasets. Results show that MLS performed competitively with or better than a number of baseline methods with limited training data. A subsequent human evaluation of MLS-generated summaries further establishes that it generates coherent, grammatically correct summaries over a range of compression budgets on both datasets.


Input text

police are hunting a man aged between 50 and 60 suspected of robbing a bank in broad daylight and running off with £3,000 in cash. the robbery took place at 12.30pm at a lloyds bank branch in fairwater, cardiff, police said. detectives have issued cctv images of the suspect, who is 50 to 60, 5ft 9in to 6ft and was wearing black clothing. the white male suspect, who has greying black hair and wore glasses, was captured on camera inside the bank. detectives said no one was injured during the robbery and they were ‘confident’ the public would be able to identify the suspect. detective sergeant andy miles, from fairwater cid, said: ‘inquiries are continuing to identify the culprit. the cctv is clear and i am confident that members of the public will know his identity. i can confirm there have been no reports of any injuries as a result of the incident. while incidents of this nature are rare in south wales, when they do occur we will investigate them thoroughly to trace whoever is responsible’. (truncated)

 

Summary at the smaller compression budget

robbery took place at 12.30pm at a lloyds bank branch in fairwater, cardiff. detectives have issued cctv images of the suspect, who is 50 to 60.

 

Summary at the larger compression budget

police are hunting a man aged between 50 and 60 suspected of robbing a bank in broad daylight and running off with £3,000 in cash. the robbery took place at 12.30pm at a lloyds bank branch in fairwater, cardiff, police said. the white male suspect, who has greying black hair and wore glasses, was captured on camera inside the bank. detectives have issued cctv images of the suspect, who is 50 to 60, 5ft 9in to 6ft and was wearing black clothing. detective sergeant andy miles, from fairwater cid, said: ‘inquiries are continuing to identify the culprit.

 

Minimal feasible summary

robbery took place at 12.30pm at a lloyds bank branch in fairwater , cardiff. detectives have issued cctv images of the suspect , who is 50 to 60. detective sergeant andy miles , from fairwater cid , said : ’ inquiries are continuing to identify the culprit.

Figure 1. MLS constructs summaries at different compression budgets for a document (truncated for illustration purposes) by copying from or expanding its minimal feasible summary using attention. The highlighted parts of the summary at the smaller budget represent sentences that were copied from the minimal feasible summary. The boldfaced sentences in the input text were included in the summary at the larger budget following an expansion operation.

2. Related Work

From handcrafted features to deep learning based methods, text summarization has garnered considerable attention in recent years. In this section, we review some of these works.

Document structure: Most of the earlier approaches to generating summaries at multiple resolutions leveraged the physical structure of the document. In (Buyukkokten et al., 2001), Buyukkokten et al. used HTML tags to identify the structural components of a document from its DOM-tree; these structural tags were then used to generate summaries at different compression budgets. The physical structure of the document was also exploited by Yang et al. (Yang and Wang, 2003) to construct summaries for mobile devices: multi-level summaries were constructed by iteratively adding finer details to a skeleton summary. Contrary to these methods, we make no assumption about the physical structure of a document.

Incremental summarization: One of the earlier efforts to summarize a document at multiple budgets was proposed by Otterbacher et al. (Otterbacher et al., 2006). A tree-like structure was constructed for each document, and sentences were selected from each level to build summaries incrementally. Campana et al. (Campana and Tombros, 2009) proposed a sampling-based approach to generate personalized summaries by taking a user-specific interaction model and information need into consideration. Summaries generated by these methods are extractive in nature; moreover, they cannot generate summaries at arbitrary compression budgets.

Supervised methods: Kikuchi et al. (Kikuchi et al., 2016) were the first to propose a supervised approach for budget-controlled abstractive summarization using length embeddings. Fan et al. (Fan et al., 2017) also used length embeddings as an additional input for controlling the length of the final summary. Their method, however, cannot generate summaries at arbitrary budgets; instead, it approximates the length constraint within a set of predefined ranges. Liu et al. (Liu et al., 2018) proposed a convolutional architecture with Gated Linear Units following a similar approach, feeding the desired length of the final summary as an input to the decoder's initial state. Contrary to these methods, MLS shares some high-level intuition with extract-then-compress methods (Meng et al., 2012; Liu et al., 2019). Earlier attempts in this paradigm used Hidden Markov Models and rule-based systems (Jing and McKeown, 2000), statistical models (Knight and Marcu, 2000), and integer linear programming based methods (Berg-Kirkpatrick et al., 2011). More recently, researchers (Cheng and Lapata, 2016; Chen and Bansal, 2018) have proposed neural models that select the most salient sentences from the document and then compress/rewrite them using a second neural network. However, these methods cannot construct summaries at specified budgets. This has recently been addressed by Saito et al. (Saito et al., 2020), who generate abstractive summaries at specified lengths following a prototype-driven natural language generation approach (Gatt and Krahmer, 2018). To construct a summary of a specified length (in tokens), they first extract the top-K keywords from the document and then generate an abstractive summary using an LSTM based encoder-decoder model (See et al., 2017). One limitation of this approach is finding the optimal number of keywords to extract: as there is a direct relationship between the quality of the prototype text and the final summary, setting K appropriately with respect to the length of the gold-standard summary is critical for optimal performance. MLS gets around this by inferring the length of the minimal feasible summary with a supervised neural network trained on gold-standard summaries to decide when to output the EOS (end of summary) token for each document. Using a supervised neural network to obtain the prototype text also improves the quality of the final summary.

3. Proposed Methodology

The first network in our framework, the MFS-Net, constructs a prototype summary of the input document, called the minimal feasible summary ($S_{mfs}$), independent of the allocated budget. We describe it in Section 3.1. The Pointer-Magnifier network then generates the final summary $S_b$ at the specified budget $b$ from $S_{mfs}$ using a multi-headed attention model; we discuss it in Section 3.2. Both networks are trained separately.

3.1. Overview of the MFS-Net

We extend the BLSTM-based encoder-decoder model with attention proposed in (See et al., 2017) to construct the minimal feasible summary. The encoder network (blue rectangles in Fig. 2) takes a multi-sentence document as input, converts it to lowercase, tokenizes it, processes each token sequentially, and updates its hidden states. The decoder network (pink rectangles in Fig. 2) constructs the minimal feasible summary one token at a time by soft-selection between the input document and an external vocabulary. Named entities in the input text are anonymized (See et al., 2017) before it is fed to the encoder network. An overview of the architecture is shown in Fig. 2. We describe how the minimal feasible summary is constructed below.

Figure 2. MLS architecture. The MFS-Net (the network on the left) constructs the minimal feasible summary ($S_{mfs}$) of the input document. The Pointer-Magnifier network (on the right) constructs the final budgeted summary from $S_{mfs}$.

Attention distribution over the input document: Upon encountering a token from the input document at timestep $t$, the hidden states of the encoder network are updated using Eq. 1 to 3. Each hidden state is represented by a 256-dimensional vector.

(1) $(i_t, f_t, o_t) = \sigma(W x_t + U h_{t-1} + b)$
(2) $c_t = f_t \odot c_{t-1} + i_t \odot \tanh(W_c x_t + U_c h_{t-1} + b_c)$
(3) $h_t = o_t \odot \tanh(c_t)$

$W$, $U$, $W_c$, $U_c$, $b$, and $b_c$ are learnable parameters, $\odot$ denotes the Hadamard product, and $\sigma$

represents the activation function. During inference, the decoder network takes the last encoder hidden state as input, updates its hidden state, and derives the next token to be included in $S_{mfs}$ using beam-search. If $h_i$ denotes the $i$-th encoder hidden state and $s_t$ the decoder state at timestep $t$, the attention distribution $a^t$ over the input text is computed as follows.

(4) $e_i^t = v^T \tanh(W_h h_i + W_s s_t + w_c c_i^t + b_{attn})$
(5) $a^t = \mathrm{softmax}(e^t)$

$a^t$ represents the probability distribution of copying a token from the input text at timestep $t$. $v$, $W_h$, $W_s$, $w_c$, and $b_{attn}$ are learnable parameters. $c^t$ represents the coverage-vector, introduced in the attention distribution to avoid repetition in the summary.
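As an illustration, the coverage-aware attention of Eq. 4 and 5 can be sketched in PyTorch as follows. This is a minimal sketch after See et al. (2017); the module and tensor names, and anything beyond the 256-dimensional hidden states stated above, are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoverageAttention(nn.Module):
    """Sketch of the coverage-aware attention in Eq. 4-5."""
    def __init__(self, hidden_dim=256):
        super().__init__()
        self.W_h = nn.Linear(2 * hidden_dim, hidden_dim, bias=False)  # encoder states (BLSTM -> 2x)
        self.W_s = nn.Linear(hidden_dim, hidden_dim, bias=True)       # decoder state
        self.w_c = nn.Linear(1, hidden_dim, bias=False)               # coverage vector
        self.v = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, enc_states, dec_state, coverage):
        # enc_states: (batch, src_len, 2*hidden); dec_state: (batch, hidden)
        # coverage: (batch, src_len), the running sum of past attention
        scores = self.v(torch.tanh(
            self.W_h(enc_states)
            + self.W_s(dec_state).unsqueeze(1)
            + self.w_c(coverage.unsqueeze(-1))
        )).squeeze(-1)                        # e^t, Eq. 4
        attn = F.softmax(scores, dim=-1)      # a^t, Eq. 5
        return attn, coverage + attn          # attention and updated coverage
```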

Generation from external vocabulary: Let $P_{vocab}$ represent the probability distribution over the tokens of an external vocabulary at timestep $t$. We compute it as follows.

(6) $h_t^* = \sum_i a_i^t h_i$
(7) $o_t = [s_t;\, h_t^*]$
(8) $P_{vocab} = \mathrm{softmax}\big(V_3(V_2(V_1 o_t + b_1) + b_2) + b_3\big)$

$P_{vocab}$ is computed by forward-propagating the context-vector ($h_t^*$) concatenated with the decoder hidden state ($s_t$) through three fully-connected layers. For each mini-batch, the external vocabulary consists of tokens that appeared in that batch and the top-k most frequent tokens from a target dictionary. In our setup, the target dictionary was made up of the top 80K tokens appearing in the training samples of the experimental dataset, the dataset the network was pretrained on, and their gold-standard summaries.

Soft-selection between copying and generation: The final probability $P(w)$ of including a token $w$ in $S_{mfs}$ at timestep $t$ is defined as a weighted sum of the attention distribution $a^t$ over the input text and the generation probability from the external vocabulary as follows.

(9) $P(w) = p_{gen}\, P_{vocab}(w) + (1 - p_{gen}) \sum_{i:\, w_i = w} a_i^t$
(10) $p_{gen} = \sigma(w_{h^*}^T h_t^* + w_s^T s_t + w_x^T x_t + b_{ptr})$

$p_{gen}$ acts as a soft-switch between generating a token from the external vocabulary and copying from the input text. This mixture-model approach allows us to copy while simultaneously consulting the language model, enabling operations like stitching, truncation, and paraphrasing to be performed with reasonable grammatical accuracy (see Fig. 1 for an example). We used beam-search with a beam-size of 4 to select the next token during inference.
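A minimal sketch of the mixture in Eq. 9, assuming the source token ids have already been mapped into the extended vocabulary (the function and argument names are ours):

```python
import torch

def final_distribution(p_gen, p_vocab, attn, src_ids):
    """Eq. 9 sketch: P(w) = p_gen * P_vocab(w) + (1 - p_gen) * copy mass on w.

    p_gen: (batch, 1); p_vocab: (batch, extended_vocab_size)
    attn: (batch, src_len); src_ids: (batch, src_len) extended-vocab ids
    """
    copy_probs = (1.0 - p_gen) * attn
    # scatter each source position's copy probability onto its token id
    return (p_gen * p_vocab).scatter_add(1, src_ids, copy_probs)
```

For an out-of-vocabulary source token, $P_{vocab}(w) = 0$, so its final probability comes entirely from the scattered attention mass.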

Pretraining, learning objective and parameter settings: We aim to minimize the negative log-likelihood of the next token to be included at each timestep. The learning objective used to train our network is defined as follows:

(11) $L = \frac{1}{T}\sum_{t=0}^{T}\Big( -\log P(w_t^*) + \lambda \sum_i \min(a_i^t, c_i^t) \Big)$

In Eq. 11, $\lambda$ is a regularization term whose value is gradually increased at later epochs during training. One of the major differences between MFS-Net and (See et al., 2017) is its training procedure. To construct the minimal feasible summary using the limited training samples available for a dataset $\mathcal{D}$, we train MFS-Net following the principles of transfer learning (Pan and Yang, 2010). Specifically, the network is first trained on the CNN-DailyMail corpus (Nallapati et al., 2016); this pretrained network is then fine-tuned on the training samples from $\mathcal{D}$, with all encoder-decoder weights allowed to update. The external vocabulary for training MFS-Net consisted of the top 80K most frequent tokens that appeared in the training samples of the CNN-DailyMail dataset, $\mathcal{D}$, or both. It is worth mentioning that we cannot use a different vocabulary during the fine-tuning stage, as the indexing of the tokens may differ. During inference, if an out-of-vocabulary word is encountered by the network, then $P_{vocab}(w) = 0$; the probability of including that word in the summary therefore depends on the attention distribution over the input text (see Eq. 9).

Implementation details: We used Adagrad (Duchi et al., 2011) to train our network, with the learning-rate and initial accumulator value set to 0.15 and 0.1 respectively. Our network was trained on a single NVIDIA Titan-XP GPU with a batch size of 16. The encoder read up to 400 tokens of each document, and the decoder generated up to 100 tokens. Validation loss was used to implement early stopping; to prevent overfitting, training was stopped after 3000 epochs during the fine-tuning stage. An example of the minimal feasible summary constructed by the MFS-Net is shown in Fig. 1.
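The training configuration above can be sketched as follows; the model, the data iterators, and the patience value are placeholders, not part of the original setup.

```python
import torch

def train_mfs_net(model, train_batches, val_batches, evaluate,
                  max_epochs=3000, patience=5):
    """Sketch of the fine-tuning loop: Adagrad with lr=0.15 and initial
    accumulator value 0.1, early stopping on validation loss."""
    optimizer = torch.optim.Adagrad(model.parameters(), lr=0.15,
                                    initial_accumulator_value=0.1)
    best_val, bad = float("inf"), 0
    for epoch in range(max_epochs):
        for batch in train_batches:        # batch size 16 in our setup
            loss = model(batch)            # Eq. 11 objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        val_loss = evaluate(model, val_batches)
        if val_loss < best_val:
            best_val, bad = val_loss, 0
        else:
            bad += 1
            if bad >= patience:            # stop when validation loss plateaus
                break
```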

3.2. The Pointer-Magnifier Network

The task of generating the final summary $S_b$ at compression budget $b$ falls to the Pointer-Magnifier network. It is a sequence-to-sequence network with multi-headed attention (Vaswani et al., 2017), where each attention-head represents a desirable quality of the final summary (see C1 and C2 in Section 1). The network consists of a multiplex layer, a stacked encoder layer, and a stacked decoder layer; we describe each in detail in the following sections. Each sentence in $S_{mfs}$ is processed sequentially until the EOS (end of summary) token is encountered. During inference, a sentence from $S_{mfs}$ is copied, or expanded to a set of similar sentences from the input document, based on the remaining budget. An overview of the network is shown in Fig. 2.

The Multiplex Layer: The multiplex layer $M$ is a nested matrix with one row per attention-head. Each row of $M$ represents an optimizable property $p_i$ of the final summary and contains information on how to take $p_i$ into account when computing the final attention distribution: (a) a distance-metric ($\delta_i$), (b) a scalar value ($w_i$), and (c) a query-matrix ($Q_i$). $Q_i$ is used to compute the contribution of a sentence towards optimizing $p_i$, $\delta_i$ is the metric used for this computation, and $w_i$ denotes the relative importance of $p_i$ in the final attention distribution.

Figure 3. The encoder module in the Pointer-Magnifier network consists of parallel encoder-blocks, one per attention-head in the multiplex layer. An encoder-block (shown on the left) consists of an embedding layer and a local-attention layer followed by a normalization layer.

We identify three such properties: (1) topic-coverage, (2) keyword-coverage, and (3) information redundancy. The query-matrix for measuring the contribution of a sentence towards topic-coverage has three rows, each representing one of the three most dominant topic vectors of the input document; symmetric KL-divergence is used as the distance metric. We used the unsupervised LDA model by Blei et al. (Blei et al., 2003) to compute this matrix. The query-matrix used for measuring keyword-coverage is a fixed-length vector of length 50, each component representing the relative term frequency of one of the top-50 most frequent keywords in the input document; symmetric KL-divergence is again the distance metric. We used RAKE (Rose et al., 2010), a publicly available open-source library, to construct this vector for each document. Lastly, the query-matrix for measuring redundancy contains one row per sentence of the input document, each row being an embedding vector of that sentence. We used the distributed memory model (Le and Mikolov, 2014) trained on English Wikipedia to generate the embedding vector of each sentence, with cosine similarity as the distance metric.

It is worth noting that our choice of query matrices is driven by limiting ourselves to models that are either unsupervised or utilize limited training samples; other methods satisfying the same constraints can also be used. The scalar values $w_i$ depend on the experimental dataset and are learned by performing grid-search over the interval $[-1, 1]$. Each row in the multiplex layer is associated with an attention-head. We discuss how the final attention distribution is computed by combining local attention from these attention-heads in the following sections.
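The three query matrices can be sketched as follows. This is a hedged sketch: it substitutes a raw term-frequency count where we use RAKE for keywords, and `d2v_model` stands in for a distributed-memory doc2vec model pretrained on English Wikipedia.

```python
from collections import Counter
import numpy as np
from gensim import corpora, models

def build_query_matrices(sentences, tokens, d2v_model, top_k=50):
    # Q for topic-coverage: the three most dominant LDA topic vectors
    dictionary = corpora.Dictionary([tokens])
    bow = [dictionary.doc2bow(tokens)]
    lda = models.LdaModel(bow, num_topics=3, id2word=dictionary)
    q_tc = lda.get_topics()                        # (3, |vocab|)

    # Q for keyword-coverage: relative frequencies of the top-50 keywords
    top = Counter(tokens).most_common(top_k)
    total = sum(c for _, c in top)
    q_kc = np.array([c / total for _, c in top])   # length-50 vector

    # Q for redundancy: one embedding vector per document sentence
    q_ir = np.stack([d2v_model.infer_vector(s.split()) for s in sentences])
    return q_tc, q_kc, q_ir
```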

Stacked Encoder Layer: Our encoder layer contains parallel encoder-blocks, one per attention-head. Each encoder-block consists of an embedding layer followed by a local-attention layer. At each timestep $t$ during inference, a sentence from $S_{mfs}$ is introduced to the embedding layer. A fixed-length embedding-vector ($v_t$) is generated and propagated to the local-attention layer. The query-matrix $Q_i$ and metric $\delta_i$ associated with $p_i$, the indexed property in $M$, are retrieved and local-attention ($\alpha_i$) is computed as follows.

(12) $M_t^{(i)} = \delta_i(v_t, Q_i)$
(13) $\alpha_i(s_t) = \mathrm{avg}(M_t^{(i)})$

For a sentence embedding vector $v_t$ of length $d$ and a query-matrix $Q_i$ of dimensions $k \times d$, $M_t^{(i)}$ is a matrix of dimensions $k \times d$. We compute the local-attention $\alpha_i$ for the sentence by taking the column-wise average of the matrix $M_t^{(i)}$. This is repeated for all sentences until the EOS (end of summary) token is encountered. The distribution obtained from this process is then normalized to get the local-attention distribution $\alpha_i$ over $S_{mfs}$. Similar distributions are obtained from all encoder-blocks. The final attention distribution is then computed as follows.

(14) $A'(s) = \sum_i w_i\, \alpha_i(s)$
(15) $A(s) = A'(s) \big/ \sum_{s'} A'(s')$

We compute the final attention distribution $A$ over $S_{mfs}$ by normalizing the weighted average of the local-attention distributions from each encoder-block; the positional information of each sentence is maintained during this computation. $w_i$ represents the weight of the $i$-th attention-head (as indexed in $M$) in the final attention distribution. It is worth noting that there is a dedicated pathway for each attention-head in our encoder architecture, from the multiplex layer to the local-attention layer of every encoder-block, which allows us to parallelize and speed up inference.
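A small sketch of Eq. 14 and 15; clipping negative mass (which the redundancy head's negative weight can produce) before normalization is our assumption.

```python
import numpy as np

def combine_heads(local_attns, head_weights):
    """local_attns: (k, num_sentences), one distribution per attention-head.
    head_weights: (k,), the learned weights w_i in [-1, 1]."""
    mixed = head_weights @ local_attns       # weighted average, Eq. 14
    mixed = np.clip(mixed, 0.0, None)        # keep the result a valid distribution
    return mixed / (mixed.sum() + 1e-12)     # normalize, Eq. 15
```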

Stacked Decoder Layer: The architecture of our decoder layer is similar to that of the encoder. It consists of parallel decoder-blocks, each comprising an embedding layer followed by a local-attention layer. Parameters of corresponding encoder and decoder-blocks are shared. Contrary to the MFS-Net, the Pointer-Magnifier network constructs the final summary using sentence-level attention.

Definition 1: We define the compression ratio ($r_{mfs}$) of the minimal feasible summary as follows.

(16) $r_{mfs} = N_{mfs} / N_D$

In Eq. 16, $N_{mfs}$ and $N_D$ denote the number of tokens in the minimal feasible summary and the input document respectively.

Copying from the minimal feasible summary: During inference, the probability of copying a sentence $s_j$ from $S_{mfs}$ at timestep $t$ is defined as follows.

(17) $P_{copy}^t(s) = A_t(s)$
(18) $A_{t+1}(s) \propto A_t(s)\,\mathbb{1}[s \neq s_j]$

$A_t$ represents the attention distribution at timestep $t$. Initialized as $A$ (from Eq. 15), the attention distribution is updated at each timestep when a sentence from $S_{mfs}$ is included in the final summary: after the inclusion of sentence $s_j$, we set the attention at $s_j$ to 0 and re-normalize the distribution. Positional information of each sentence is maintained.
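The update after a copy operation amounts to zeroing one position and renormalizing, e.g.:

```python
import numpy as np

def update_attention(attn, copied_idx):
    """Zero the attention of the copied sentence and renormalize (Eq. 18
    sketch); sentence order, and hence positional information, is untouched."""
    attn = attn.copy()
    attn[copied_idx] = 0.0
    return attn / (attn.sum() + 1e-12)
```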

Expanding a sentence from the minimal feasible summary: If $r_{mfs}$ is less than the allocated budget $b$ at timestep 0, a sentence $s$ in $S_{mfs}$ can be 'expanded' to a set of coherent sentences $E(s)$ from the input document and included in the final summary. Let $v_j$ denote the fixed-length vector representation of the $j$-th sentence in $E(s)$ produced by the embedding layer of the $i$-th decoder-block. The probability $p_j$ of including it in the summary is computed as follows.

(19) $\hat{p}_j = \mathrm{avg}\big(\delta_i(v_j, Q_i)\big)$
(20) $p_j = \hat{p}_j \big/ \sum_{j'} \hat{p}_{j'}$

$Q_i$ is the query-matrix of dimensions $k \times d$ shared between the $i$-th encoder and decoder-block, and $v_j$ is a fixed-length vector of length $d$. The decoder-block utilizes the shared attention-head to compute $p_j$, the normalized (see Eq. 20) distribution computed over the sentences in $E(s)$. The probability of including $E(s)$ in the final summary is defined as the average (see Eq. 22) of the inclusion probabilities of all sentences in $E(s)$.

(21) $n = \mathrm{sizeof}(E(s))$
(22) $P_{expand}(E(s)) = \frac{1}{n}\sum_{j=1}^{n} p_j$
Dataset Metric Budget = 1/32 Budget = 1/16 Budget = 1/8 Budget = 1/4 Budget = 1/2
MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3
D1 ROUGE-1 45.99 23.44 37.46 41.65 45.99 30.5 37.68 43.07 45.99 31.27 38.05 43.50 46.11 41.86 43.95 44.10 45.67 40.67 41.13 45.50
ROUGE-2 35.97 14.79 22.59 30.65 35.97 20.77 25.50 30.65 35.98 22.95 29.14 33.50 35.6 27.57 32.36 34.50 36.7 29.38 31.02 35.02
ROUGE-L 40.89 21.35 32.38 37.65 42.50 27.9 33.07 38.92 43.01 36.25 37.62 43.50 42.83 38.83 40.95 41.07 40.18 39.60 40.74 41.50
METEOR 47.12 18.91 24.22 45.51 47.12 13.07 25.02 45.60 46.50 20.89 30.86 43.88 46.61 27.26 33.05 44.65 45.71 27.84 32.95 45.39
D2 ROUGE-1 40.25 16.20 21.06 35.60 40.0 17.08 22.0 36.0 40.25 22.59 28.10 39.72 41.01 23.55 27.83 38.50 44.36 29.53 32.75 44.06
ROUGE-2 33.25 11.25 17.22 26.50 34.50 12.0 16.75 30.05 35.67 14.60 19.01 31.80 36.0 17.90 20.06 31.0 38.70 20.67 23.46 36.44
ROUGE-L 37.17 14.50 19.06 33.67 37.0 15.60 20.55 35.70 37.05 21.65 20.26 34.33 37.96 21.87 22.60 32.77 41.50 26.04 27.17 39.75
METEOR 40.22 12.68 24.33 35.05 44.82 15.17 23.22 42.90 44.82 11.96 30.79 42.0 42.88 24.20 21.83 38.05 44.79 28.08 25.82 45.70
Table 1. Experimental results on the MSR-Narrative (D1) and Thinking Machines (D2) dataset by MLS (highlighted column), Budgeted Systematic Sampling (A1), PageRank-guided Sampling (A2) and a pretrained Convolutional Seq-to-Seq model (Liu et al., 2018) (A3). The best performance achieved for each metric is shown in boldface

Definition 2: For each sentence $s$ in $S_{mfs}$, the expansion-set $E(s)$ is a mutually exclusive n-gram of sentences from the input document that are similar to $s$. sizeof($E(s)$) is therefore equal to $n$ in Eq. 22.

To determine the expansion-set of $s$, we perform beam-search over all possible n-grams of sentences in the input document with the following objective: maximize the number of token overlaps with $s$, weighted by the average pairwise cosine similarity of the candidate set. To remove 'across-sentence' repetitions, we apply a reranking strategy similar to Chen et al. (Chen and Bansal, 2018). We keep all $k$ candidates generated by beam search, where $k$ is the size of the beam. Next, we rerank all combinations of summaries generated by including an expansion-set candidate into the partially constructed summary. Each summary generated this way is then reranked by the number of repeated n-grams, the smaller the better.
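A simplified sketch of the expansion-set search, scoring only consecutive sentence n-grams rather than the full beam-search space; `sim` is a placeholder for the pairwise cosine similarity of sentence embeddings.

```python
from itertools import combinations

def expansion_candidates(sentence, doc_sentences, sim, n=3, beam=4):
    """Score each consecutive n-gram of document sentences by its token
    overlap with `sentence`, weighted by average pairwise similarity,
    and keep the top `beam` candidates for downstream reranking."""
    tokens = set(sentence.split())
    scored = []
    for i in range(len(doc_sentences) - n + 1):
        group = doc_sentences[i:i + n]
        overlap = sum(len(tokens & set(s.split())) for s in group)
        pair_sims = [sim(a, b) for a, b in combinations(group, 2)]
        weight = sum(pair_sims) / len(pair_sims)
        scored.append((overlap * weight, group))
    scored.sort(key=lambda x: x[0], reverse=True)
    return [group for _, group in scored[:beam]]
```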

Soft-selection between copying and expansion: The final probability $P(s)$ of including a sentence $s$ or its expansion-set $E(s)$ in the final summary at timestep $t$ is defined as a weighted sum of the probability distributions over $S_{mfs}$ and $E(s)$.

(23)
(24)

In Eq. 23, $b$ denotes the compression budget allocated for the final summary and sgn denotes the sign function. Sentences in $S_{mfs}$ are processed sequentially in the order they appear. At each timestep $t$, a sentence in $S_{mfs}$ is copied, or expanded to $E(s)$, and included in the budgeted summary. If the allocated compression budget $b$ is less than $r_{mfs}$, the compression ratio of $S_{mfs}$, the probability of including a sentence in the final summary depends only on the attention distribution over the sentences of $S_{mfs}$ not yet included; otherwise, Eq. 24 acts as a soft-switch between copying and expanding a sentence from the minimal feasible summary. A sentence in $S_{mfs}$ is expanded only if doing so does not violate (see Eq. 24) the remaining budget; otherwise, its inclusion probability depends on the attention distribution $A_t$ over $S_{mfs}$. Eq. 24 involves the average number of tokens in the expansion-sets, the average number of tokens in the sentences that have not yet been included or expanded into the summary up to timestep $t$, and the compression ratio of the partially constructed summary up to timestep $t$. Once the inclusion probability of a sentence (or its expansion-set) is computed, the decoder attends to the position with the highest probability and includes it in the final summary. Generation stops once the allocated budget has been reached.
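The overall control flow of the decoder can be summarized by the following sketch. It follows the copy-or-expand logic described above only approximately; the token accounting of Eq. 23-24 is simplified, and `expand_fn` is a placeholder for the expansion-set search.

```python
def decode_budgeted(mfs_sentences, budget, doc_tokens, r_mfs, expand_fn):
    """Copy or expand sentences of the minimal feasible summary, in order,
    until the allocated budget is reached."""
    summary, used = [], 0
    limit = int(budget * doc_tokens)            # token budget b * N
    for sent in mfs_sentences:
        if used >= limit:
            break                               # allocated budget reached
        n_sent = len(sent.split())
        if budget <= r_mfs:
            # budget tighter than the mfs itself: copy only
            if used + n_sent <= limit:
                summary.append(sent); used += n_sent
        else:
            expansion = expand_fn(sent)
            n_exp = sum(len(s.split()) for s in expansion)
            if used + n_exp <= limit:           # expand only within budget
                summary.extend(expansion); used += n_exp
            elif used + n_sent <= limit:        # otherwise fall back to copy
                summary.append(sent); used += n_sent
    return summary
```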

Implementation details: The learnable parameters $w_i$ are pretrained on the CNN-DailyMail dataset first. We learn the optimal values of these parameters by performing grid-search over the interval $[-1, 1]$, maximizing the F1 score of ROUGE-1 on the cross-validation dataset. All parameters are initialized to the same fixed value before grid-search. For both of our datasets, the local-attention weights associated with topic-coverage and keyword-coverage were positive, whereas information redundancy was assigned a negative weight. We set the cardinality of the expansion-set to 3 in our experiments and change it to 2 during later iterations.

4. Experiments

We seek to answer three key questions in our experiments. Given a summary $S_b$ at compression budget $b$: (a) how good is the quality of $S_b$? (b) is the summary coherent and complete? (c) how abstractive is $S_b$? We answer the first two questions by evaluating MLS-generated summaries (Sections 4.3 and 4.4) on two publicly available datasets at five compression budgets. The third question is answered by computing the percentage of tokens introduced from the external vocabulary in Section 4.5. We also performed a human evaluation of the summaries at multiple compression budgets and present our findings in Section 4.6.

Index Dataset Size Max Median Mean
D1 MSR Narrative 476 130 15 18.65
D2 Thinking Machines 186 82 33 33.23
D3 CNN-DailyMail 312804 221 24 28.24
Table 2. Dataset statistics: the number of documents (Size) and the maximum, median, and mean number of sentences per document.

4.1. Experiment design

4.1.1. Datasets:

We evaluate MLS on two publicly available cross-domain datasets: the MSR-Narrative dataset (Ouyang et al., 2017) (D1) and the Thinking-Machines dataset (Brockman, 2018) (D2). We used the CNN-DailyMail dataset (Nallapati et al., 2016) for pretraining the MFS-Net in our framework. Each of these contains documents from a separate domain. The CNN-DailyMail dataset contains 312,804 online news articles collected from two national news websites; 287,226 articles were used to construct the training corpus, and the test corpus contained 11,490 articles. The MSR-Narrative dataset contains 476 personal stories shared by users of the social network Reddit, while the Thinking-Machines dataset contains 186 op-ed articles by different authors on a popular topic, published on an educational website. For both datasets, 25% of the documents were randomly selected to construct the training corpus. Summary statistics of our experimental datasets are shown in Table 2.

Dataset Metric Budget = 1/32 Budget = 1/16 Budget = 1/8 Budget = 1/4 Budget = 1/2
MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3 MLS A1 A2 A3
D1 Topic 0.12 0.28 0.29 0.21 0.12 0.27 0.27 0.20 0.12 0.26 0.23 0.15 0.13 0.21 0.19 0.18 0.13 0.21 0.21 0.18
Sentiment 0.09 0.22 0.19 0.11 0.09 0.23 0.15 0.13 0.09 0.19 0.15 0.12 0.1 0.14 0.12 0.1 0.16 0.07 0.17 0.13
Coherence 0.08 0.3 0.20 0.11 0.08 0.26 0.18 0.09 0.08 0.21 0.11 0.07 0.09 0.13 0.10 0.12 0.1 0.06 0.09 0.1
D2 Topic 0.05 0.27 0.24 0.15 0.05 0.27 0.25 0.16 0.05 0.17 0.2 0.12 0.05 0.08 0.08 0.11 0.03 0.03 0.02 0.10
Sentiment 0.03 0.24 0.16 0.10 0.03 0.21 0.13 0.07 0.03 0.12 0.15 0.04 0.03 0.06 0.08 0.05 0.04 0.02 0.03 0.03
Coherence 0.03 0.27 0.20 0.05 0.03 0.18 0.12 0.10 0.03 0.09 0.09 0.05 0.03 0.05 0.05 0.06 0.04 0.03 0.03 0.04
Table 3. Experimental results on the MSR-Narrative (D1) and Thinking Machines (D2) datasets by MLS (highlighted column), Budgeted Systematic Sampling (A1), PageRank-guided Sampling (A2), and a pretrained convolutional Seq-to-Seq model (Liu et al., 2018) (A3). The best performance for each metric is shown in boldface. 'Coherence' denotes the average absolute difference between the semantic coherence scores of a summary and the input document; 'Topic' and 'Sentiment' denote the average KL-divergence between the topic and sentiment-polarity distributions of a summary and the input document.

4.1.2. Metrics:

We evaluate the quality of our summaries by computing the average ROUGE-1, ROUGE-2, and ROUGE-L scores against gold-standard summaries. Given an input document, abstractive summarization may create summaries that do not share many words with the reference but have the same meaning; it is therefore important to capture semantic similarity beyond n-grams, which ROUGE metrics fail to take into account (Yao et al., 2017). We introduced the METEOR score into our test suite to complement this aspect. We used the py-rouge library (https://pypi.org/project/py-rouge/) to compute ROUGE scores and the nltk library (https://www.nltk.org/_modules/nltk/translate/meteor_score.html) for computing METEOR scores in our experimental setup. To evaluate the completeness of our summaries, we compute the average KL-divergence between the (normalized) top-3 topic vectors of the summary and the input document. A good summary is coherent and follows the narrative style of its source document. We measure coherence by computing the average pairwise cosine similarity between consecutive sentences (Srinivasan et al., 2018) and the sentiment polarity distribution of a summary. For both metrics, scores obtained from the summary are compared against the input document: we report the absolute difference between the average pairwise similarity scores for the first metric, and the symmetric KL-divergence between the sentiment polarity distributions of the summary and the input document for the second. We used a publicly available sentiment analysis tool (Hutto and Gilbert, 2014) to obtain the polarity of a document.
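A sketch of the scoring calls, assuming the py-rouge and nltk APIs behave as documented at the URLs above (nltk's METEOR implementation expects pre-tokenized input in recent versions):

```python
import rouge                                   # the py-rouge package
from nltk import word_tokenize                 # requires the punkt data
from nltk.translate.meteor_score import meteor_score

def score_summary(hypothesis, reference):
    evaluator = rouge.Rouge(metrics=["rouge-n", "rouge-l"], max_n=2)
    rouge_scores = evaluator.get_scores(hypothesis, reference)
    meteor = meteor_score([word_tokenize(reference)],
                          word_tokenize(hypothesis))
    return rouge_scores, meteor
```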

4.2. Baselines

Budgeted Systematic Sampling (A1): A systematic sampling approach following the expand-till-it-is-allowed principle. Initialized with a sentence randomly chosen from the first $k$ sentences of the input document, the budgeted summary is constructed by sampling the $k$-th sentence from the last sampled position at each round until the allocated budget is met. We set $k$ to 3, i.e., the size of the expansion-set in the Pointer-Magnifier network. A sketch is shown below.
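A sketch of this baseline (token counting and tie-breaking details are our assumptions):

```python
import random

def systematic_sample(sentences, budget, k=3):
    """Baseline A1 sketch: start from a random sentence among the first k,
    then take every k-th sentence until the token budget is met."""
    total = sum(len(s.split()) for s in sentences)
    limit = int(budget * total)
    idx = random.randrange(min(k, len(sentences)))
    summary, used = [], 0
    while idx < len(sentences):
        n = len(sentences[idx].split())
        if used + n > limit:
            break
        summary.append(sentences[idx]); used += n
        idx += k
    return summary
```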

PageRank-guided Sampling (A2): We extend the TextRank (Mihalcea and Tarau, 2004) algorithm to construct summaries at a specified budget. Each sentence in the input document is represented as a node in an undirected complete graph, where the edge between two nodes is assigned a weight equal to the cosine similarity between them. We construct the final summary by sampling the top-ranked nodes within the specified budget using the weighted PageRank algorithm.
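A sketch of this baseline using networkx; `embed` is a placeholder for the sentence-embedding function.

```python
import networkx as nx
import numpy as np

def pagerank_sample(sentences, embed, budget):
    """Baseline A2 sketch: complete sentence graph with cosine-similarity
    edge weights, ranked by weighted PageRank, sampled within the budget."""
    vecs = [embed(s) for s in sentences]
    g = nx.Graph()
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            w = float(np.dot(vecs[i], vecs[j]) /
                      (np.linalg.norm(vecs[i]) * np.linalg.norm(vecs[j]) + 1e-8))
            g.add_edge(i, j, weight=w)
    ranks = nx.pagerank(g, weight="weight")
    total = sum(len(s.split()) for s in sentences)
    limit = int(budget * total)
    picked, used = [], 0
    for i in sorted(ranks, key=ranks.get, reverse=True):
        n = len(sentences[i].split())
        if used + n <= limit:
            picked.append(i); used += n
    return [sentences[i] for i in sorted(picked)]   # restore document order
```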

Seq-to-seq model with length control (A3): A supervised method for abstractive summarization proposed by Liu et al. (Liu et al., 2018) that employs a convolutional sequence-to-sequence model with Gated Linear Units (Dauphin et al., 2017) to construct budgeted summaries of arbitrary lengths; the desired summary length is fed as an additional input to the decoder. We followed a training procedure similar to MFS-Net for this model, i.e., the network was trained on gold-standard summaries from the CNN-DailyMail dataset first and then fine-tuned on a limited training set from the experimental dataset.

4.3. Experimental Results 1

To measure its capability of generating high-quality summaries, we compared MLS against each baseline method on four metrics (ROUGE-1, ROUGE-2, ROUGE-L and METEOR) at five compression budgets. Results are shown in Table 1. We highlight some of the key findings below.

Dataset Metric Budget = 1/32 Budget = 1/16 Budget = 1/8 Budget = 1/4 Budget = 1/2
MLS MLS⁻ MLS MLS⁻ MLS MLS⁻ MLS MLS⁻ MLS MLS⁻
D1 ROUGE-1 45.99 17.60 45.99 17.60 45.99 16.62 46.11 17.75 45.67 17.81
ROUGE-2 35.97 10.50 35.97 11.50 35.98 10.25 35.6 10.06 36.7 11.06
ROUGE-L 40.89 15.41 42.50 14.09 43.01 13.09 42.83 14.83 40.18 15.95
METEOR 47.12 18.06 47.12 18.50 46.50 19.02 46.61 18.10 45.71 19.67
D2 ROUGE-1 40.25 14.88 40.0 14.50 40.25 14.02 41.01 13.86 44.36 15.05
ROUGE-2 33.25 9.05 34.50 9.88 35.67 10.02 36.0 9.57 38.70 10.25
ROUGE-L 37.17 11.02 37.0 11.90 37.05 12.22 37.96 12.28 41.50 12.67
METEOR 40.22 15.06 44.82 14.99 44.82 15.75 42.88 14.99 44.79 15.20
Table 4. Experimental results on the MSR-Narrative (D1) and Thinking Machines (D2) datasets by MLS (highlighted column) and MLS⁻, a variant of MLS without pretraining. The best performance for each metric is shown in boldface.

Abstractive methods performed better: In terms of generating high-quality summaries, the abstractive methods (MLS and A3) performed better than the sampling-based methods at all compression budgets.

MLS performed better at smaller budgets: MLS performed consistently well at all budgets, but its performance was better at smaller budgets. We observed a relative improvement of at least 4% on all metrics against the convolutional baseline (A3) at the smallest budget. Averaged over all compression budgets, we obtained an absolute improvement of approximately 2.40% in ROUGE-1 score on both datasets, and improvements of 1.49% and 2.77% over A3 in METEOR score for datasets D1 and D2 respectively. As the length of the summaries increased, we performed competitively with the convolutional baseline (A3).

Good generalization using limited training data: To generate budgeted summaries with limited training samples, we pretrained our framework on the CNN-DailyMail dataset and then fine-tuned it on our experimental datasets. End-to-end results in Table 1 show that we were able to obtain summaries with desirable qualities for both datasets over a range of evaluation metrics. We further investigate the quality of MLS-generated summaries through a human evaluation study in Section 4.6.

(a) MSR-Narrative
(b) Thinking Machines
Figure 4. Percentage of novel word n-grams and sentences in our summaries at five compression budgets

4.4. Experimental Results 2

We measured the coherence and completeness of a budgeted summary using three metrics. For each of these metrics, the score obtained from each document was compared against the input document for five compression budgets. A smaller score signifies better performance by a competing method for these three metrics. We present the results obtained from this experiment in Table 3. Some of the key takeaways from this experiment are as follows.

Topic coverage is better in MLS: Compared to our abstractive baseline (A3), MLS performed well at all compression budgets in covering the main concepts of the document, and the improvement was significant at lower compression budgets. More specifically, MLS was at least 75% more accurate than A3 in capturing the main concepts of the input document at lower compression budgets for both datasets. We also outperformed both sampling-based baselines. Performance became comparable at higher budgets; this is because more sentences from the minimal feasible summary were expanded for inclusion in the final summary at higher compression budgets.

MLS-generated summaries are coherent: A good summary should be coherent and follow the narrative style of its source document. The bottom two rows of Table 3 for each dataset evaluate the coherence of summaries generated by all competing methods on our experimental datasets. Results show that MLS generates coherent summaries from the input document at all budgets; the improvement in this aspect is significant compared to our sampling-based baselines A1 and A2. We examine this aspect further in a human evaluation study in Section 4.6.

(a) MSR-Narrative
(b) Thinking Machines
Figure 5. MLS outperforms the ablative baseline MLS+ at all compression budgets for both datasets

4.5. Experimental Results 3

Abstractiveness of the summaries: We evaluate the abstractiveness of a summary by measuring the percentage of novel word n-grams and sentences included in the final summary (Fig. 4). We observe that the percentage of novel n-grams and sentences in the final summaries increases with the allocated budget. For the MSR-Narrative dataset, 44.63% of sentences were novel in summaries constructed at a budget of 0.5, whereas 11.5% of all sentences were novel in summaries constructed at the smallest budget.
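The novelty statistic can be computed as in the following sketch (sentence-level novelty is analogous, with whole sentences in place of n-grams):

```python
def novel_ngram_pct(summary_tokens, source_tokens, n):
    """Percentage of distinct summary n-grams that never appear in the source."""
    def ngrams(toks):
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    summ = ngrams(summary_tokens)
    if not summ:
        return 0.0
    return 100.0 * len(summ - ngrams(source_tokens)) / len(summ)
```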

Effect of pretraining: To isolate gains in end-to-end performance due to pretraining on the CNN-DailyMail dataset, we compare against MLS⁻, a baseline identical to MLS but without any pretraining. Results of this experiment are shown in Table 4. We observed significant improvement over MLS⁻, establishing that pretraining helps improve the quality of our summaries. The effect on performance is apparent at all compression budgets in our test suite.

Quality of the minimal feasible summary: We evaluate the extent to which the quality of the final summaries depends on $S_{mfs}$ by comparing against an ablative baseline, MLS+. Instead of the MFS-Net, MLS+ constructs a fixed-length minimal feasible summary following a greedy sampling-based heuristic (Otterbacher et al., 2006). Results of this experiment are shown in Fig. 5.

4.6. User Study

We conducted a user study to evaluate the coherence and completeness of the generated summaries at various compression budgets. We consider a scenario in which the user needs to complete a fact-checking task in limited time. We randomly chose three documents from each of our datasets for this purpose. Users were given a summary of a document and asked to answer a question. The questions were mostly concerned with the key facts or events described in the original article and asked users to verify whether a key fact was present in the summary. Instructions were provided to complete the task solely based on the content of the summary, not on prior knowledge. Similar to (Otterbacher et al., 2006), we used multiple-choice questions: users could answer "Yes", "No", or "More information is required". If users indicated that they needed more information, a longer summary was shown paired with the same question. If users could definitively verify the presence or absence of the information, the task was considered complete. For example, one question-summary pair provided to the users was the question "Does the story tell us why the narrator was fired?" with the summary "I tried to return a lost wallet to a customer who accused me of stealing it and then grabbed my hair. We got in a physical fight and I was fired from my job".

4.6.1. Treatment:

Our experiments (see Sections 4.3 and 4.4) show that, of the two extractive baselines, PageRank-guided Sampling (A2) performed significantly better on both datasets. Thus, for fair comparison, we only compare MLS against A2 and A3 from our setup. In addition to these, we also added two extreme settings: (a) the Full-Content setting, in which the entire article was shown to the users, and (b) the No-Content setting, in which only the question (and no content) was shown. The Full-Content control setting was added to make sure that the question could indeed be answered from the article, while the No-Content setting was added to ensure that the questions themselves did not contain any hint about the answers. Users were first shown summaries generated at the smallest budget in our test suite. If they chose "More information is required" as their response, we provided a summary generated by the same method at double the budget, and the budget was doubled each time until the user chose "Yes" or "No" or we reached the largest budget in our test suite. At each compression budget, new sentences shown for the first time were highlighted. We recorded task completion time and user response for each treatment.

4.6.2. Study details:

15 graduate students participated in this study. Each user was assigned two different settings. To prevent information retention, no user was exposed to summaries generated by two different methods for the same article. Users were encouraged to complete the task as accurately and as quickly as possible. Using a balanced incomplete block design, all 10 combinations of the 5 settings (MLS, A2, A3, Full-Content, and No-Content) and 2 datasets were assigned to 3 subjects each. Users were not given any information on the method used to generate a summary, and the order of the summaries shown to a user by two competing methods was randomized.

4.6.3. Results:

The average accuracy and task completion time (in seconds) are shown in Table 5. We observe that the accuracy of the No-Content setting is 0 on both datasets, meaning that the questions did not contain any hint that could guide the users to the correct answer. Our results on the MSR-Narrative dataset (D1) show that users could respond to the questions from MLS summaries as accurately as in the Full-Content setting, while completing the task more than twice as fast; MLS outperformed both competing methods on this dataset. We also noticed that one user was able to answer the question correctly from an MLS-generated summary but failed to do so from the original article. For the Thinking Machines dataset (D2), MLS outperformed A2 and closely followed A3. In terms of accuracy, out of the 6 articles selected for our experiment, MLS outperformed A3 on 4 articles and performed comparably on 1; in terms of task completion time, MLS outperformed A3 on 5 out of 6 articles.

MLS A2 A3 No-Content Full-Content
D1 Accuracy 0.88 0.55 0.55 0.0 0.88
Duration (s) 36.7 43.69 69.08 12.0 75.6
D2 Accuracy 0.55 0.44 0.66 0.0 0.88
Duration (s) 70.24 68.9 96.47 20.95 132.86
Table 5. Average completion time and accuracy for MLS, PageRank-guided Sampling (A2), the convolutional Seq-to-Seq model (A3), and the No-Content and Full-Content settings.

5. Conclusion

We have proposed MLS, a supervised method for generating abstractive summaries with limited training data at arbitrary compression budgets. Leveraging an extract-then-compress approach, we construct budgeted summaries following a two-phase protocol. A sequence-to-sequence network with attention, the MFS-Net, constructs the minimal feasible summary ($S_{mfs}$) by capturing the key concepts of the document; a second network then generates the budgeted summaries from $S_{mfs}$ by leveraging an interpretable multi-headed attention model. Following the principles of transfer learning, we were able to construct high-quality summaries with limited training samples by pretraining our networks on the CNN-DailyMail dataset. Using a mixture-model approach, MLS constructs high-quality summaries with reasonable grammatical accuracy. The completeness and coherence of the summaries were further established by a human evaluation study. In the future, we would like to extend our work to task-driven summaries for domain-specific extraction tasks. Personalized summaries leveraging a user-specific context model are also an exciting direction for future work.

References

  • [1] T. Berg-Kirkpatrick, D. Gillick, and D. Klein (2011) Jointly learning to extract and compress. In ACL.
  • [2] D. M. Blei, A. Y. Ng, and M. I. Jordan (2003) Latent Dirichlet allocation. JMLR.
  • [3] J. Brockman (2018) Website: https://www.edge.org/annual-question/what-do-you-think-about-machines-that-think
  • [4] O. Buyukkokten, H. Garcia-Molina, and A. Paepcke (2001) Seeing the whole in parts: text summarization for web browsing on handheld devices. In WWW.
  • [5] M. Campana and A. Tombros (2009) Incremental personalised summarisation with novelty detection. In FQAS.
  • [6] Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv:1805.11080.
  • [7] J. Cheng and M. Lapata (2016) Neural summarization by extracting sentences and words. arXiv:1603.07252.
  • [8] S. Chopra, M. Auli, and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. In NAACL.
  • [9] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier (2017) Language modeling with gated convolutional networks. In ICML.
  • [10] J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
  • [11] A. Fan, D. Grangier, and M. Auli (2017) Controllable abstractive summarization. arXiv:1711.05217.
  • [12] A. Gatt and E. Krahmer (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. JAIR.
  • [13] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin (2017) Convolutional sequence to sequence learning. In ICML.
  • [14] C. J. Hutto and E. Gilbert (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In ICWSM.
  • [15] H. Jing and K. R. McKeown (2000) Cut and paste based text summarization. In NAACL.
  • [16] Y. Kikuchi, G. Neubig, R. Sasano, H. Takamura, and M. Okumura (2016) Controlling output length in neural encoder-decoders. arXiv:1609.09552.
  • [17] K. Knight and D. Marcu (2000) Statistics-based summarization-step one: sentence compression. In AAAI/IAAI.
  • [18] Q. Le and T. Mikolov (2014) Distributed representations of sentences and documents. In ICML.
  • [19] C. Liu, P. Wang, J. Xu, Z. Li, and J. Ye (2019) Automatic dialogue summary generation for customer service. In SIGKDD.
  • [20] Y. Liu, Z. Luo, and K. Zhu (2018) Controlling length in abstractive summarization using a convolutional neural network. In EMNLP.
  • [21] X. Meng, F. Wei, X. Liu, M. Zhou, S. Li, and H. Wang (2012) Entity-centric topic-oriented opinion summarization in Twitter. In SIGKDD.
  • [22] R. Mihalcea and P. Tarau (2004) TextRank: bringing order into text. In EMNLP.
  • [23] R. Nallapati, B. Zhou, C. Gulcehre, B. Xiang, et al. (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv:1602.06023.
  • [24] J. Otterbacher, D. Radev, and O. Kareem (2006) News to go: hierarchical text summarization for mobile devices. In SIGIR.
  • [25] J. Ouyang, S. Chang, and K. McKeown (2017) Crowd-sourced iterative annotation for narrative summarization corpora. In EACL.
  • [26] S. J. Pan and Q. Yang (2010) A survey on transfer learning. TKDE.
  • [27] S. Rose, D. Engel, N. Cramer, and W. Cowley (2010) Automatic keyword extraction from individual documents. In Text Mining: Applications and Theory.
  • [28] A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv:1509.00685.
  • [29] I. Saito, K. Nishida, K. Nishida, A. Otsuka, H. Asano, J. Tomita, H. Shindo, and Y. Matsumoto (2020) Length-controllable abstractive summarization by guiding with summary prototype. arXiv:2001.07331.
  • [30] A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. arXiv:1704.04368.
  • [31] B. V. Srinivasan, P. Maneriker, K. Krishna, and N. Modani (2018) Corpus-based content construction. In COLING.
  • [32] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NIPS.
  • [33] R. S. Wurman, L. Leifer, D. Sume, and K. Whitehouse (2001) Information Anxiety 2. Que.
  • [34] C. C. Yang and F. L. Wang (2003) Fractal summarization for mobile devices to access large documents on the web. In WWW.
  • [35] J. Yao, X. Wan, and J. Xiao (2017) Recent advances in document summarization. KAIS.