The wealth of data available at a single click often adds to the information overload problem (Wurman et al., 2001). Summarization is an intuitive way to address this problem by constructing a condensed equivalent of the available data. There are two main approaches to text summarization: extractive and abstractive. In the extractive approach, sentences are sampled from the input document, while in the abstractive approach, the summary is not constrained by the vocabulary of the input document. Great progress (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017) has been made in recent years on abstractive summarization techniques. Various sequence-to-sequence networks with attention have been proposed to generate abstractive summaries, ranging from RNN (Rush et al., 2015) and BLSTM (See et al., 2017) networks to CNN (Gehring et al., 2017) with Gated Linear Units (Dauphin et al., 2017). Controlling the length of a summary while preserving its quality is one of the most challenging but important aspects of abstractive summarization. One of the most important real-world applications of budgeted summarization is optimizing web content for varying screen sizes. Web content creators such as news portals, bloggers and online advertising agencies with target audiences on multiple digital platforms (e.g. mobiles, laptops, smart-watches) are some of its biggest beneficiaries. High variance in the screen sizes available for these devices makes it difficult to implement a solution that effectively delivers textual content following a traditional supervised approach.
To employ sequence-to-sequence networks such as (Nallapati et al., 2016; See et al., 2017) for generating summaries at a given compression budget, we need a parallel corpus of text documents and their corresponding summaries at that budget. Constructing such a corpus for any given budget is a resource-intensive task that usually requires human supervision. Repeating this process for all possible budget values to account for the inherent variability makes it even more costly. Furthermore, in many real-world applications the allowed budget can only be known at run-time. One of the current practices to get around this problem is to generate a summary independent of the budget and then truncate it to fit the budget. Naive approaches such as this often produce incomplete and/or incoherent summaries. We propose MLS, an end-to-end framework to construct high-quality summaries at arbitrary compression budgets, leveraging limited training data. Given a document and a compression budget, the objective of our framework is to generate an abstractive summary of the document such that the following conditions are satisfied:
C1: Information redundancy is minimized with respect to .
C2: Coverage of the major topics of is maximized.
C3: is maximal with respect to the allocated budget i.e., & such that without worsening the first two conditions.
Conditions C1 and C2 ensure that the qualities of a good summary are preserved in the generated summary, whereas C3 ensures that it is the largest possible summary within the allocated budget without compromising its quality. MLS takes a prototype-driven approach (Liu et al., 2019; Saito et al., 2020) towards summarization. In simple words, summaries at specified compression budgets are constructed using a prototype-text as a guide. We call this prototype-text the minimal feasible summary of the document. Contrary to previous works that followed a similar approach, the minimal feasible summary is not a bag of keywords extracted from the document. It is a coherent and complete summary representing the most prevalent concepts of the input document.
MLS constructs abstractive summaries of a document at a specified budget in two steps. First, a sequence-to-sequence network, referred to as the MFS-Net, constructs the minimal feasible summary. It is an abstractive summary of the input document that captures its key concepts while maintaining coherence and fluency. MFS-Net is an LSTM-based encoder-decoder network similar to (See et al., 2017) with one key difference. We pretrain it on the CNN-DailyMail dataset (Nallapati et al., 2016) first and then fine-tune it (L1-transfer of encoder-decoder weights (Pan and Yang, )) on the experimental dataset to construct the minimal feasible summary, allowing us to obtain good generalization capabilities in the final summaries with a significantly smaller training set (refer to Section 4). A second network, called the Pointer-Magnifier network, constructs the final budgeted summary from the minimal feasible summary using attention (Vaswani et al., 2017). It is a sequence-to-sequence network with interpretable, multi-headed attention. Each attention-head represents a desirable quality in the final summary. Sentences in the minimal feasible summary are copied or expanded depending on the budget to construct the final summary. To summarize, the main contributions of this work are as follows:
We propose MLS, an end-to-end framework to construct abstractive summaries with a limited number of training samples at arbitrary compression budgets.
We develop an interpretable multi-headed attention mechanism to construct budgeted summaries from the minimal feasible summary.
Results show that MLS generated summaries are coherent and complete, corroborated by human evaluation at multiple compression budgets. Better results are observed when the desired length is short.
We evaluated our framework on three cross-domain datasets. Results show that MLS performed competitively or better against a number of baseline methods with limited training data. Subsequent human evaluation of MLS generated summaries further establishes that we were able to generate coherent, grammatically correct summaries for a range of compression budgets for all datasets.
2. Related Works
From handcrafted features to deep learning based methods, text summarization has garnered a lot of attention from researchers in recent times. In this section, we will review some of these works.
Document structure: Most of the earlier approaches towards generating summaries at multiple resolutions leveraged the physical structure of the document. In (Buyukkokten et al., 2001), Buyukkokten et al. used HTML tags to identify the structural components of a document from its DOM-tree. The structural tags were then leveraged to generate summaries at different compression budgets. Physical structure of the document was also leveraged by Yang et al. (Yang and Wang, 2003) to construct summaries for mobile devices. Multi-level summaries were constructed by iteratively adding finer details to a skeleton summary. Contrary to these methods, we do not make any assumption on the physical structure of a document.
Incremental summarization : One of the earlier efforts to summarize a document at multiple budgets was proposed by Otterbacher et al. (Otterbacher et al., 2006). A tree-like structure was constructed for each document to select sentences from each level to construct summaries incrementally. Campana et al. (Campana and Tombros, 2009) proposed a sampling based approach to generate personalized summaries by taking user-specific interaction model and information need into consideration. Summaries generated by these methods are extractive in nature. Moreover, they cannot generate summaries at arbitrary compression budgets.
Supervised methods: Kikuchi et al. (Kikuchi et al., 2016) were the first to propose a supervised approach for budget-controlled abstractive summarization using length embeddings. Fan et al. (Fan et al., 2017) also used length embeddings as an additional input for controlling the length of the final summary. Their method, however, cannot generate summaries at arbitrary budgets; rather, it approximates the length constraints within a set of predefined ranges. Liu et al. (Liu et al., 2018) proposed a convolutional architecture with Gated Linear Units following a similar approach. The desired length of the final summary is fed as an input to the decoder's initial state. Contrary to these methods, MLS shares some high-level intuition with extract-then-compress methods (Meng et al., 2012; Liu et al., 2019), statistical models (Knight and Marcu, 2000), and integer linear programming based methods (Berg-Kirkpatrick et al., 2011). More recently, researchers (Cheng and Lapata, 2016; Chen and Bansal, 2018) have proposed neural models to select the most salient sentences from the document and then compress/rewrite them using a second neural network. However, these methods cannot construct summaries at specified budgets. This has been recently addressed by Saito et al. (Saito et al., 2020). They were able to generate abstractive summaries at specified lengths following a prototype-driven natural language generation approach (Gatt and Krahmer, 2018). To construct a summary of a given length (in tokens), they first extract the top-K keywords from the document, and then generate an abstractive summary using an LSTM-based encoder-decoder model (See et al., 2017). One of the limitations of this approach is finding the optimal number of keywords to be extracted. As there is a direct relationship between the quality of the prototype-text and the final summary, setting K close to the length of the gold-standard summary is critical for optimal performance. MLS gets around this by inferring the length of the minimal feasible summary using a supervised neural network trained on gold-standard summaries to infer when to output the EOS (end of summary) token for each document. Using a supervised neural network to obtain the prototype-text also improves the quality of the final summary.
3. Proposed Methodology
The first network in our framework, i.e., the MFS-Net, constructs a prototype-summary of the input document, called the minimal feasible summary, independent of the allocated budget. We describe it in Section 3.1. The Pointer-Magnifier network then generates the final summary at the specified budget from the minimal feasible summary using a multi-headed attention model. We discuss this in Section 3.2. Both networks are trained separately.
3.1. Overview of the MFS-Net
We extend the BLSTM-based encoder-decoder model with attention proposed in (See et al., 2017) to construct the minimal feasible summary. The encoder network (blue rectangles in Fig. 2) takes a multi-sentence document as input, converts it to lowercase, tokenizes it, processes each token sequentially and updates its hidden states. The decoder network (pink rectangles in Fig. 2) constructs the minimal feasible summary one token at a time by soft-selection between the input document and an external vocabulary. Named entities in the input text are anonymized (See et al., 2017) before feeding it to the encoder network. An overview of the architecture is shown in Fig. 2. We describe how the minimal feasible summary is constructed below.
Attention distribution over the input document: Upon encountering a token from the input document at each timestep, the hidden states of the encoder network are updated using Eq. 1 to 3. Each hidden state is represented using a 256-dimensional vector.
and , and are learnable parameters, denotes the Hadamard product, and
represents the activation function. During inference, the decoder network takes the last encoder hidden state as input, updates its hidden state and derives the next token to be included in the minimal feasible summary using beam-search. Given the last hidden state of the encoder and the hidden state of the decoder network at a timestep, the attention distribution over the input text is computed as follows.
represents the probability distribution of copying a token from the input text at timestep. , , and are learnable parameters. represents the coverage-vector, introduced in the attention distribution to avoid repetition in the summary.
Generation from external vocabulary: The probability distribution over the tokens in an external vocabulary at each timestep is computed as follows.
It is computed by forward propagating the context-vector concatenated with the decoder hidden state through three fully-connected layers. For each mini-batch, the external vocabulary consists of tokens that appeared in that batch and the top-k most frequent tokens from a target dictionary. In our setup, the target dictionary was made up of the top 80K tokens appearing in the training samples of the experimental dataset, the dataset the network was pretrained on, and their gold-standard summaries.
Soft-selection between copying and generation: The final probability of including a token in the minimal feasible summary at each timestep is defined as a weighted sum of the attention distribution over the input text and the generation probability of the external vocabulary as follows.
The weight in this sum acts as a soft-switch between generating a token from the external vocabulary and copying from the input text. This mixture model approach allows us to copy while simultaneously consulting the language model, enabling operations like stitching, truncation, and paraphrasing to be performed with reasonable grammatical accuracy (see Fig. 1 for an example). We used beam-search with a beam-size of 4 to select the next token during inference.
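The soft-selection step can be illustrated with a small sketch. It assumes the pointer-generator formulation of See et al. (2017), in which a scalar switch weights the vocabulary distribution and the remaining mass is distributed over source tokens according to attention. The function name `mix_copy_and_generate` and the toy inputs are hypothetical and are not taken from the MLS implementation.

```python
from collections import defaultdict


def mix_copy_and_generate(p_gen, vocab_dist, attention, source_tokens):
    """Combine generation and copy distributions into one distribution.

    p_gen         -- soft switch in [0, 1]; weight given to the external vocabulary
    vocab_dist    -- dict mapping token -> generation probability (sums to 1)
    attention     -- attention weights over source positions (sums to 1)
    source_tokens -- source tokens aligned with `attention`
    """
    final = defaultdict(float)
    # Mass contributed by generating from the external vocabulary.
    for token, prob in vocab_dist.items():
        final[token] += p_gen * prob
    # Mass contributed by copying; repeated source tokens accumulate.
    for token, weight in zip(source_tokens, attention):
        final[token] += (1.0 - p_gen) * weight
    return dict(final)


vocab_dist = {"the": 0.5, "cat": 0.3, "sat": 0.2}
attention = [0.6, 0.4]
source_tokens = ["cat", "zanzibar"]  # "zanzibar" is out-of-vocabulary
final = mix_copy_and_generate(0.7, vocab_dist, attention, source_tokens)
```

In this toy case the out-of-vocabulary token receives probability only through the copy term, which matches the behaviour the text describes for out-of-vocabulary words.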
Pretraining, learning objective and parameter settings: We aim to minimize the negative log-likelihood of the next token to be included at each timestep. The learning objective used to train our network is defined as follows:
The learning objective in Eq. 11 includes a regularization term, the value of which is gradually increased at later epochs during the training process. One of the major differences between MFS-Net and (See et al., 2017) is its training procedure. To construct the minimal feasible summary using the limited training samples available for a dataset, we train MFS-Net following the principles of transfer learning (Pan and Yang, ). Specifically, the network is trained on the CNN-DailyMail corpus (Nallapati et al., 2016). This pretrained network is then fine-tuned on the training samples from the experimental dataset. All encoder-decoder weights are allowed to be updated. The external vocabulary for training MFS-Net consisted of the top 80K most frequent tokens that appeared in the training samples of the CNN-DailyMail dataset, the experimental dataset, or both. It is worth mentioning here that we cannot use a different vocabulary during the fine-tuning stage, as the indexing of the tokens may be different. During inference, if an out-of-vocabulary word is encountered by the network, its generation probability from the external vocabulary is zero. The probability of including that word in the summary therefore depends on the attention distribution of the input text (see Eq. 9).
Implementation details: We used Adagrad (Duchi et al., 2011) to train our network. The learning-rate and initial accumulator value were set to 0.15 and 0.1. Our network was trained on a single NVIDIA Titan-XP GPU with a batch size of 16. At each timestep the encoder read 400 tokens, and the decoder generated 100 tokens. Validation loss was used to implement early stopping. To prevent overfitting, training was stopped after 3000 epochs during the fine-tuning stage. An example of the minimal feasible summary constructed by the MFS-Net is shown in Fig. 1.
3.2. The Pointer-Magnifier Network
The task of generating the final summary at the specified compression budget is the responsibility of the Pointer-Magnifier network. It is a sequence-to-sequence network with multi-headed attention (Vaswani et al., 2017). Each attention-head represents a desirable quality in the final summary (see C1 and C2 in Section 1). The network consists of a multiplex layer, a stacked encoder layer and a decoder layer. We describe each in detail in the following sections. Each sentence in the minimal feasible summary is processed sequentially until the EOS (end of summary) token is encountered. During inference, a sentence from the minimal feasible summary is copied or expanded to a set of similar sentences from the input document based on the remaining budget. An overview of the network is shown in Fig. 2.
The Multiplex Layer: The multiplex layer () is a nested matrix of dimensions . Each row of represents an optimizable property of the final summary. A row in contains information on how to take this property into account when computing the final attention distribution. Each row contains (a) a distance-metric (), (b) a scalar value (), and (c) a query-matrix (). computes the contribution of a sentence towards optimizing , represents the metric used for this computation, and denotes its relative importance in the final attention distribution. We define such that and .
We identify three such properties: (1) topic-coverage, (2) keyword-coverage, and (3) information redundancy. The query-matrix for measuring the contribution of a sentence towards topic-coverage is a matrix with three rows, each representing one of the three most dominant topic vectors of the input document. Symmetric KL-divergence is used as the distance metric. We used the unsupervised LDA model by Blei et al. (Blei et al., 2003) to compute these topic vectors. The query-matrix used for measuring keyword-coverage is a one-dimensional fixed-length vector of length 50. Each component represents the relative term frequency of one of the top-50 most frequent keywords in the input document. Symmetric KL-divergence is used as the distance metric for this computation. We used RAKE (Rose et al., 2010), a publicly available open-source library, to construct this vector for each document. Lastly, each row of the query-matrix for measuring redundancy represents a sentence embedding vector from the input document. We used the distributed memory model (Le and Mikolov, 2014) trained on English Wikipedia to generate the embedding vector for each sentence. Cosine similarity was used as the distance metric. (It is worth noting here that our choice of query matrices is driven by limiting ourselves to models that are either unsupervised or use limited training samples; other methods with the same constraints can also be used.) The scalar values depend on the experimental dataset and are learned by performing grid-search over a fixed interval. Each row in the multiplex layer is associated with an attention-head. We discuss how the final attention distribution is computed by combining local-attention from these attention-heads in the following sections.
Stacked Encoder Layer: Our encoder layer contains parallel encoder-blocks. Each encoder-block consists of an embedding layer followed by a local-attention layer. At each timestep during inference, a sentence from the minimal feasible summary is introduced to the embedding layer. A fixed-length embedding-vector is generated and propagated to the local-attention layer. The query-matrix and distance metric associated with the corresponding property in the multiplex layer are retrieved and the local-attention is computed as follows.
For a sentence embedding vector of length and a query-matrix of dimensions , is a matrix of dimensions . We compute the local-attention for the sentence by taking the column-wise average of this matrix. This is repeated for all sentences until the EOS (end of summary) token is encountered. The distribution obtained from this process is then normalized to get the local-attention distribution over the minimal feasible summary. Similar distributions are obtained from all encoder-blocks. The final attention distribution is then computed as follows.
We compute the final attention distribution over the minimal feasible summary by normalizing the weighted average of the local-attention distributions from each encoder-block. Positional information of each sentence is maintained during this computation. The scalar value stored in each row of the multiplex layer represents the weight of the corresponding attention-head in the final attention distribution. It is worth noting here that there is a dedicated pathway for each attention-head in our encoder architecture, from the multiplex layer to the local-attention layer of every encoder-block. This allows us to parallelize and speed up the inference process.
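As an illustration of how per-head distributions might be combined, the sketch below takes one local-attention distribution per attention-head and a scalar weight per head, averages them, and normalizes the result. Because a head weight can be negative (for example, for redundancy), the sketch clips negative combined scores to zero before normalizing; that clipping, the uniform fallback, and the name `combine_local_attention` are assumptions made for this example and are not stated in the text.

```python
def combine_local_attention(local_dists, head_weights):
    """Normalized weighted average of per-head local-attention distributions.

    local_dists  -- list of distributions, one per attention-head, each a list
                    with one entry per sentence of the minimal feasible summary
    head_weights -- one scalar weight per attention-head (may be negative)
    """
    n_heads = len(local_dists)
    n_sentences = len(local_dists[0])
    combined = []
    for j in range(n_sentences):
        avg = sum(w * dist[j] for w, dist in zip(head_weights, local_dists)) / n_heads
        combined.append(max(avg, 0.0))  # assumption: clip negative scores to zero
    total = sum(combined)
    if total == 0.0:
        # Assumption: fall back to a uniform distribution if all scores vanish.
        return [1.0 / n_sentences] * n_sentences
    return [value / total for value in combined]


topic = [0.5, 0.3, 0.2]
keyword = [0.4, 0.4, 0.2]
redundancy = [0.2, 0.2, 0.6]
final_attention = combine_local_attention(
    [topic, keyword, redundancy], head_weights=[0.6, 0.5, -0.3]
)
```

Sentence positions are preserved because every distribution is indexed by the sentence's position in the minimal feasible summary.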
Stacked Decoder Layer: The architecture of our decoder layer is similar to the encoder. The decoder-layer consists of parallel decoder-blocks. Each block consists of an embedding layer followed by a local-attention layer. Parameters of the encoder and decoder-blocks are shared. Contrary to the MFS-Net, the Pointer-Magnifier network constructs the final summary using sentence-level attention.
Definition 1: We define the compression ratio of the minimal feasible summary as follows.
In Eq. 16, and denote the number of tokens in the minimal feasible summary and the input document respectively.
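A minimal sketch of this definition follows, assuming the ratio is the token count of the minimal feasible summary divided by the token count of the input document, which is consistent with budgets such as 1/32 or 1/2 used later. The function name and the whitespace tokenization are hypothetical choices for the example.

```python
def compression_ratio(summary_tokens, document_tokens):
    """Assumed form of the ratio: |summary tokens| / |document tokens|."""
    if not document_tokens:
        raise ValueError("document must contain at least one token")
    return len(summary_tokens) / len(document_tokens)


document = ("word " * 64).split()  # toy document with 64 tokens
mfs = ("word " * 8).split()        # toy minimal feasible summary with 8 tokens
ratio = compression_ratio(mfs, document)

budget = 1 / 4
# When the ratio is below the budget there is room left to expand sentences.
has_room_to_expand = ratio < budget
```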
Copying from the minimal feasible summary: During inference, the probability of copying a sentence from the minimal feasible summary at each timestep is defined as follows.
The attention distribution is initialized with the final attention distribution from Eq. 15 and is updated at each timestep when a sentence from the minimal feasible summary is included in the final summary. To update the attention distribution after a sentence has been included in the final summary, we set the attention at that sentence's position to 0 and re-normalize the distribution. Positional information of each sentence is maintained.
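The update rule described above (zero out the included sentence, then re-normalize) can be sketched in a few lines. The function name `remove_and_renormalize` is hypothetical; positions are list indices, so positional information is preserved.

```python
def remove_and_renormalize(attention, index):
    """Zero the attention at `index` and re-normalize the remaining mass."""
    updated = list(attention)
    updated[index] = 0.0
    total = sum(updated)
    if total == 0.0:
        # Every sentence has been consumed; nothing left to normalize.
        return updated
    return [value / total for value in updated]


attention = [0.5, 0.3, 0.2]
# The sentence at position 0 has just been included in the final summary.
attention_after = remove_and_renormalize(attention, 0)
```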
Expanding a sentence from the minimal feasible summary: If the compression ratio of the minimal feasible summary is less than the allocated budget at timestep 0, a sentence in the minimal feasible summary can be 'expanded' to a set of coherent sentences from the input document (its expansion-set) and included in the final summary. Each sentence in the expansion-set is represented by a fixed-length vector produced by the embedding-layer of the decoder-block. The probability of including the expansion-set in the summary is computed as follows.
is the query-matrix of dimensions , shared between the encoder and decoder-block, is a fixed-length vector of length . The decoder-block utilizes the shared attention-head to compute . It is the normalized (see Eq. 20) distribution computed from each sentence in , . The probability of including in the final summary is defined as the average (see Eq. 22) of the inclusion probabilities of all sentences in .
|Dataset||Metric||Budget = 1/32||Budget = 1/16||Budget = 1/8||Budget = 1/4||Budget = 1/2|
Definition 2: For each sentence in the minimal feasible summary, the expansion-set is a mutually exclusive n-gram of sentences from the input document that are similar to it. The size of the expansion-set is therefore equal to the number of sentences averaged over in Eq. 22.
To determine the expansion-set of a sentence, we perform beam-search over all possible n-grams of sentences in the input document with the following objective: maximize the number of token overlaps with the sentence, weighted by its average pairwise cosine similarity. To remove 'across-sentence' repetitions, we apply a reranking strategy similar to Chen et al. (Chen and Bansal, 2018). We keep all candidates generated by beam search, as many as the size of the beam. Next, we consider all combinations of summaries generated by including an expansion-set candidate in the partially constructed summary. Each summary generated this way is then reranked by the number of repeated n-grams, the smaller the better.
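The reranking criterion can be sketched as follows: each candidate is appended to the partially constructed summary, repeated n-grams are counted, and candidates are sorted so that fewer repetitions rank first. The helper names, the bigram default, and whitespace tokenization are assumptions for this example; the beam-search step that produces the candidates is not shown.

```python
from collections import Counter


def count_repeated_ngrams(tokens, n=2):
    """Number of n-gram occurrences beyond the first for each distinct n-gram."""
    counts = Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return sum(c - 1 for c in counts.values() if c > 1)


def rerank_candidates(partial_summary, candidates, n=2):
    """Sort candidates by repeated n-grams in (partial summary + candidate)."""
    def repeats(candidate):
        tokens = (partial_summary + " " + candidate).lower().split()
        return count_repeated_ngrams(tokens, n)
    return sorted(candidates, key=repeats)  # fewer repetitions first


partial = "the storm hit the coast"
candidates = [
    "the storm hit the coast on monday",
    "residents were evacuated on monday",
]
ranked = rerank_candidates(partial, candidates)
```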
Soft-selection between copying and expansion: The final probability of including a sentence or its expansion-set in the final summary at each timestep is defined as a weighted sum of the probability distributions for copying and for expansion.
Eq. 23 is defined in terms of the compression budget allocated for the final summary and the sign function, sgn. Sentences in the minimal feasible summary are processed sequentially based on the position in which they appeared. At each timestep, a sentence in the minimal feasible summary is either copied or expanded to its expansion-set and included in the budgeted summary. If the allocated compression budget is less than the compression ratio of the minimal feasible summary, the probability of including a sentence in the final summary depends on the attention distribution of the sentences from the minimal feasible summary not yet included in the final summary. Otherwise, the weight in Eq. 23 acts as a soft-switch between copying or expanding a sentence from the minimal feasible summary. A sentence is expanded only if doing so does not violate the remaining budget (see Eq. 24). Otherwise, the inclusion probability of that sentence depends on the attention distribution of the minimal feasible summary. Eq. 24 uses the average number of tokens in the expansion-set, the average number of tokens in the sentences that have not yet been included or expanded into the summary, and the compression ratio of the partially constructed summary at the current timestep. Once the inclusion probability of a sentence (or its expansion-set) is computed, the decoder attends to the position with the highest probability and includes it in the final summary. Generation stops once the allocated budget has been reached.
Implementation details: The learnable parameters are pretrained on the CNN-DailyMail dataset first. We learn the optimal values of these parameters by performing grid-search over the interval [-1,1], maximizing the ROUGE-1 F1 score on the cross-validation dataset. All parameters are initialized with the same value. For both of our datasets, the local-attention weights associated with topic-coverage and keyword-coverage were positive numbers, whereas information redundancy was assigned a negative weight. We set the cardinality of the expansion-set to 3 in our experiments and change it to 2 during the later iterations.
We seek to answer three key questions in our experiments. Given a summary at a specified compression budget: (a) how good is the quality of the summary? (b) is the summary coherent and complete? (c) how abstractive is the summary? We answer the first two questions by evaluating MLS-generated summaries (Sections 4.3 and 4.4) on two publicly available datasets for five compression budgets. The third key question is answered by computing the percentage of tokens introduced from the external vocabulary in Section 4.5. We also performed a human evaluation of the summaries at multiple compression budgets and present our findings in Section 4.6.
4.1. Experiment design
We evaluate MLS on two publicly available cross-domain datasets: the MSR-Narrative dataset (Ouyang et al., 2017) (D1) and the Thinking-Machines dataset (Brockman, 2018) (D2). We used the CNN-DailyMail dataset (Nallapati et al., 2016) for pretraining the MFS-Net in our framework. Each of these contains documents from a separate domain. The CNN-DailyMail dataset contains 312,804 online news articles collected from two national news websites. 287,226 articles were used to construct the training corpus; the test corpus contained 11,490 articles. The MSR-Narrative dataset contains 476 personal stories shared by users of the social network Reddit. The Thinking-Machines dataset contains 186 op-ed articles by different authors on a popular topic published on an educational website. 25% of the documents were randomly selected to construct the training corpus for both datasets. Some of the significant summary statistics of our experimental datasets are shown in Table 2.
We evaluate the quality of our summary by computing the average score for the ROUGE-1, ROUGE-2 and ROUGE-L metrics against gold-standard summaries. Given an input document, abstractive summarization may create summaries that do not share many words but have the same meaning. Therefore, it is important to capture semantic similarity beyond n-grams. ROUGE metrics fail to take this into account (Yao et al., 2017). We introduced the METEOR score in our test suite to complement this aspect. We used the py-rouge library (https://pypi.org/project/py-rouge/) to compute ROUGE scores. The nltk library (https://www.nltk.org/_modules/nltk/translate/meteor_score.html) was used for computing METEOR scores in our experimental setup. To evaluate the completeness of our summaries, we compute the average KL-divergence between the top-3 topic vectors (normalized) from the summary and the input document. A good summary is coherent and follows the narrative style of its source document. We measure coherence by computing the average pairwise cosine similarity between consecutive sentences (Srinivasan et al., 2018) and the sentiment polarity distribution of a summary. For both metrics, scores obtained from the summary were compared against the input document. For the first metric, we report the absolute difference between the average pairwise similarity score computed for a summary and the input document. For the second metric, we report the symmetric KL-divergence score between the sentiment polarity distributions obtained from the summary and the input document. We used a publicly available sentiment analysis tool (Hutto and Gilbert, 2014) to obtain the polarity of a document.
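The two comparison measures used here can be sketched as below. The symmetric KL-divergence is written as the sum of the two directed divergences with a small smoothing constant, and sentence vectors are plain bag-of-words counts standing in for learned sentence embeddings; both choices, and all function names, are assumptions for this illustration rather than the evaluation code used in the experiments.

```python
import math
from collections import Counter


def symmetric_kl(p, q, eps=1e-12):
    """KL(p || q) + KL(q || p) for two discrete distributions of equal length."""
    p = [x + eps for x in p]  # smoothing avoids log(0)
    q = [x + eps for x in q]
    p_sum, q_sum = sum(p), sum(q)
    p = [x / p_sum for x in p]
    q = [x / q_sum for x in q]
    kl_pq = sum(a * math.log(a / b) for a, b in zip(p, q))
    kl_qp = sum(b * math.log(b / a) for a, b in zip(p, q))
    return kl_pq + kl_qp


def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[t] * v[t] for t in u)
    norm = math.sqrt(sum(c * c for c in u.values())) * math.sqrt(
        sum(c * c for c in v.values())
    )
    return dot / norm if norm else 0.0


def consecutive_similarity(sentences):
    """Average cosine similarity between each pair of consecutive sentences."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    sims = [cosine(a, b) for a, b in zip(vectors, vectors[1:])]
    return sum(sims) / len(sims) if sims else 0.0


def coherence_gap(summary_sentences, document_sentences):
    """Absolute difference between summary and document similarity scores."""
    return abs(
        consecutive_similarity(summary_sentences)
        - consecutive_similarity(document_sentences)
    )
```

A smaller value is better for both measures, since each quantifies how far the summary departs from its source document.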
Budgeted Systematic Sampling (A1): A systematic sampling approach is undertaken following the expand-till-it-is-allowed principle with a fixed sampling interval. Initialized with a sentence randomly chosen from within the first interval of the input document, the budgeted summary is constructed by sampling the sentence one interval after the last sampled position at each round until the allocated budget is met. We set the interval to 3, i.e., the size of the expansion-set in the Pointer-Magnifier network.
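A minimal sketch of this baseline, under stated assumptions: the budget is interpreted as a fraction of the document's whitespace-token count, the start position is drawn uniformly from the first `step` sentences, and sampling stops at the first sentence that would exceed the budget. The function name and the `seed` parameter are hypothetical.

```python
import random


def systematic_sample(sentences, budget, step=3, seed=0):
    """Pick every `step`-th sentence from a random start until the budget is met."""
    if not sentences:
        return []
    rng = random.Random(seed)
    max_tokens = budget * sum(len(s.split()) for s in sentences)
    start = rng.randrange(min(step, len(sentences)))
    summary, used = [], 0
    for i in range(start, len(sentences), step):
        length = len(sentences[i].split())
        if used + length > max_tokens:
            break  # adding this sentence would exceed the budget
        summary.append(sentences[i])
        used += length
    return summary


sentences = [f"sentence number {i} here" for i in range(12)]  # 4 tokens each
sampled = systematic_sample(sentences, budget=1 / 4)
```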
PageRank-guided Sampling (A2) : We extend the TextRank (Mihalcea and Tarau, 2004) algorithm to construct summaries at a specified budget. Each sentence in the input document is represented as a node in an undirected complete graph, where the edge between two nodes is assigned a weight equal to the cosine similarity between them. We construct the final summary by sampling the top nodes within the specified budget using the weighted PageRank algorithm.
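The sketch below illustrates one way this baseline could work: a weighted PageRank computed by power iteration over a precomputed sentence-similarity matrix, followed by greedy selection of the highest-scoring sentences that fit the token budget. The damping factor, the iteration count, returning the selected sentences in document order, and all names are assumptions made for the example.

```python
def weighted_pagerank(weights, damping=0.85, iterations=50):
    """Power-iteration PageRank on a weighted graph given as a square matrix."""
    n = len(weights)
    scores = [1.0 / n] * n
    out_sum = [sum(row) for row in weights]
    for _ in range(iterations):
        new_scores = []
        for i in range(n):
            incoming = sum(
                weights[j][i] * scores[j] / out_sum[j]
                for j in range(n)
                if out_sum[j] > 0
            )
            new_scores.append((1.0 - damping) / n + damping * incoming)
        scores = new_scores
    return scores


def pagerank_summary(sentences, similarity, budget):
    """Select top-ranked sentences whose total length fits the token budget."""
    scores = weighted_pagerank(similarity)
    max_tokens = budget * sum(len(s.split()) for s in sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen, used = [], 0
    for i in ranked:
        length = len(sentences[i].split())
        if used + length <= max_tokens:
            chosen.append(i)
            used += length
    return [sentences[i] for i in sorted(chosen)]  # assumption: document order


sentences = ["storms hit coast", "coast was flooded", "markets stayed calm"]
# Toy symmetric cosine-similarity matrix with a zero diagonal.
similarity = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
summary = pagerank_summary(sentences, similarity, budget=1 / 3)
```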
Seq-to-seq model with length control (A3) : This is a supervised method for abstractive summarization proposed by Liu et al. (Liu et al., 2018) that employs a convolutional sequence-to-sequence model with Gated Linear Units (Dauphin et al., 2017) to construct budgeted summaries of arbitrary lengths. Desired summary length is fed as an additional input to the decoder. We followed a training procedure similar to MFS-Net for this model i.e., the network was trained on gold-standard summaries from the CNN-DailyMail dataset first and then fine-tuned on a limited training set from the experimental dataset.
4.3. Experimental Results 1
To measure its capability of generating high-quality summaries, we compared MLS against each baseline method on four metrics (ROUGE-1, ROUGE-2, ROUGE-L and METEOR) at five compression budgets. Results are shown in Table 1. We highlight some of the key findings in the following section.
Abstractive methods performed better: In terms of generating high-quality summaries, abstractive methods (MLS and A3) performed better than sampling-based methods for constructing budgeted summaries at all compression budgets.
MLS performed better at smaller budgets: MLS performed consistently well on all budgets. However, performance was better on smaller budgets. We observed a relative improvement of at least 4% on all metrics against the convolutional baseline (A3) at the budget of . Averaged over all compression budgets, we obtained an absolute improvement of approximately 2.40% in ROUGE-1 score on both datasets. An improvement of 1.49% and 2.77% were observed over A3 in METEOR score for datasets D1 and D2 respectively. As the length of the summaries increased, we performed competitively with the convolutional baseline (A3).
Good generalization using limited training data: To generate budgeted summaries with limited training samples, we pretrained our framework on the CNN-DailyMail dataset and then fine-tuned it on our experimental datasets. End-to-end results in Table 1 show that we were able to obtain summaries with desirable qualities for both datasets over a range of evaluation metrics. We further investigate the quality of MLS-generated summaries by performing a human evaluation study in Section 4.6.
4.4. Experimental Results 2
We measured the coherence and completeness of a budgeted summary using three metrics. For each of these metrics, the score obtained from each document was compared against the input document for five compression budgets. A smaller score signifies better performance by a competing method for these three metrics. We present the results obtained from this experiment in Table 3. Some of the key takeaways from this experiment are as follows.
Topic coverage is better in MLS: Compared to our abstractive baseline (A3), MLS performed well at all compression budgets in covering the main concepts of the document. The improvement was significant at lower compression budgets. More specifically, MLS was at least 75% more accurate than A3 in capturing the main concepts of the input document in summaries at lower compression budgets for both datasets. MLS also outperformed both sampling-based baselines. Performance became comparable at higher budgets, because more sentences from the minimal feasible summary were expanded to be included in the final summary at higher compression budgets.
MLS-generated summaries are coherent: A good summary should be coherent and follow the narrative style of its source document. The bottom two rows in Table 3 for each dataset evaluate the coherence of summaries generated by all competing methods on our experimental datasets. Results show that MLS generates coherent summaries from the input document at all budgets. The improvement in this aspect is significant compared to our sampling-based baselines A1 and A2. We examine this aspect further through a human evaluation study in Section 4.6.
4.5. Experimental Results 3
Abstractiveness of the summaries: We evaluate the abstractiveness of a summary by measuring the percentage of novel word n-grams and sentences included in the final summary (Fig. 4). We observe that the percentage of novel n-grams and sentences in the final summaries increases with the allocated budget. For the MSR-Narrative dataset, 44.63% of sentences were novel in summaries constructed at a budget of 0.5, whereas 11.5% of all sentences were novel in summaries constructed at a budget of .
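The novelty measurement described above can be sketched as follows. This is an illustrative approximation: it assumes lowercased whitespace tokenization, counts distinct n-grams rather than occurrences, and treats a sentence as novel when it does not appear verbatim in the source. The text does not specify these details, and the function names are hypothetical.

```python
from typing import List, Set, Tuple


def ngram_set(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    """Set of distinct word n-grams in a token sequence."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_percentage(summary: str, source: str, n: int) -> float:
    """Percentage of distinct summary n-grams that never occur in the source."""
    summ = ngram_set(summary.lower().split(), n)
    if not summ:
        return 0.0
    src = ngram_set(source.lower().split(), n)
    return 100.0 * len(summ - src) / len(summ)


def novel_sentence_percentage(summary_sents: List[str],
                              source_sents: List[str]) -> float:
    """Percentage of summary sentences not found verbatim in the source."""
    if not summary_sents:
        return 0.0
    src = {s.strip().lower() for s in source_sents}
    novel = [s for s in summary_sents if s.strip().lower() not in src]
    return 100.0 * len(novel) / len(summary_sents)


source = "the cat sat on the mat"
summary = "the cat slept on the mat"
print(novel_ngram_percentage(summary, source, 1))
print(novel_ngram_percentage(summary, source, 2))
```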
Effect of pretraining: To isolate the gains in end-to-end performance due to pretraining on the CNN-DailyMail dataset, we compare MLS against a baseline variant that is identical to MLS but trained without any pretraining. Results of this experiment are shown in Table 4. We observed a significant improvement over this baseline, establishing that pretraining helps improve the quality of our summaries. The effect on performance is apparent at all compression budgets in our test suite.
Quality of the minimal feasible summary: We evaluate the extent to which the quality of the final summaries depends on the minimal feasible summary by comparing MLS against an ablative baseline, MLS+. Instead of using the MFS-Net, MLS+ constructs a fixed-length minimal feasible summary following a greedy sampling-based heuristic (Otterbacher et al., 2006). Results of this experiment are shown in Fig. 5.
4.6. User Study
We conducted a user study to evaluate the coherence and completeness of the generated summaries at various compression budgets. We consider a scenario in which the user needs to complete a fact-checking task in limited time. We randomly chose three documents from each of our datasets for this purpose. Users were given a summary of a document and asked to answer a question. The questions mostly concerned the key facts or events described in the original article and asked users to verify whether a key fact was present in the summary. Instructions were provided to complete the task solely based on the content of the summary, without relying on prior knowledge. Similar to (Otterbacher et al., 2006), we used multiple-choice questions. Users could answer “Yes”, “No”, or “More information is required”. If users indicated that they needed more information, a longer summary was shown paired with the same question. If users could definitively verify the presence or absence of the information, the task was considered complete. For example, one of the question and summary pairs provided to the users was as follows. Question: “Does the story tell us why the narrator was fired?” Summary: “I tried to return a lost wallet to a customer who accused me of stealing it and then grabbed my hair. We got in a physical fight and I was fired from my job.”
Our experiments (see Sections 4.3 and 4.4) show that of the two extractive baselines, PageRank-guided Sampling (A2) performed significantly better on both datasets. Thus, for a fair comparison, we only compare MLS against A2 and A3 from our setup. In addition, we added two extreme settings: (a) the Full-Content setting, in which the entire article was shown to the users, and (b) the No-Content setting, in which only the question (and no content) was shown. The Full-Content control setting was added to make sure that the question could indeed be answered from the article, while the No-Content setting was added to ensure that the questions themselves did not contain any hint about the answers. Users were shown summaries generated with a budget equal to first. If they chose “More information is required” as their response, we provided a summary generated by the same method with a budget of . The budget was doubled each time until the user chose “Yes” or “No” or we reached the budget of . At each compression budget, sentences shown for the first time were highlighted. We computed task completion time and user response for each treatment.
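The escalation protocol described above can be sketched as a simple loop. The starting budget of 1/32 and the final budget of 1/2 are assumptions borrowed from the budgets listed in Table 1; the budgets actually used in the study may differ. The `respond` callback is a hypothetical stand-in for a user reading the summary at a given budget and choosing one of the three options.

```python
from fractions import Fraction
from typing import Callable, List, Tuple

MORE = "More information is required"


def run_escalation(respond: Callable[[Fraction], str],
                   start: Fraction = Fraction(1, 32),  # assumed first budget
                   stop: Fraction = Fraction(1, 2),    # assumed final budget
                   ) -> Tuple[str, List[Fraction]]:
    """Show summaries at doubling budgets until a definitive answer or `stop`."""
    budget = start
    shown: List[Fraction] = []
    while True:
        shown.append(budget)
        answer = respond(budget)
        # Stop on a definitive answer, or when the largest budget was shown.
        if answer in ("Yes", "No") or budget >= stop:
            return answer, shown
        budget *= 2


# Hypothetical user who can answer once the summary reaches a budget of 1/8.
def simulated_user(budget: Fraction) -> str:
    return "Yes" if budget >= Fraction(1, 8) else MORE


final_answer, budgets_shown = run_escalation(simulated_user)
print(final_answer, [str(b) for b in budgets_shown])
```

Using `Fraction` keeps the doubled budgets exact, which avoids floating-point comparisons when checking the stopping condition.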
4.6.2. Study details:
Fifteen graduate students participated in this study. Each user was assigned to two different settings. To prevent information retention, no user was exposed to summaries generated by two different methods for the same article. Users were encouraged to complete the task as accurately and as quickly as possible. Using a balanced, incomplete block design, all 10 combinations of 5 settings (MLS, A2, A3, Full-Content, and No-Content) and 2 datasets were each assigned to 3 subjects. Users were not given any information on the method used to generate a summary. The order in which summaries from the two competing methods were shown to a user was randomized.
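As a quick bookkeeping check of the design described above, the number of assignment slots required by the setting and dataset combinations should match the number of slots supplied by the participants. The sketch below only verifies these counts; it does not reproduce the actual assignment of users to settings, and the variable names are illustrative.

```python
from itertools import product

settings = ["MLS", "A2", "A3", "Full-Content", "No-Content"]
datasets = ["D1", "D2"]
num_users = 15          # participants in the study
settings_per_user = 2   # each user was assigned to two settings
subjects_per_cell = 3   # each combination was assigned to three subjects

# Every (setting, dataset) combination is one cell of the design.
cells = list(product(settings, datasets))
slots_needed = len(cells) * subjects_per_cell
slots_available = num_users * settings_per_user

print(len(cells), slots_needed, slots_available)
```

Both totals come to 30, so the stated numbers of users, settings per user, and subjects per combination are mutually consistent.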
The average accuracy and task completion time (in seconds) are shown in Table 5. We observe that the accuracy of the No-Content setting is 0 on both datasets, meaning that the questions did not contain any hint that could guide the users to the correct answer. Our results on the MSR-Narrative dataset (D1) show that users could respond to the questions using MLS as accurately as in the Full-Content setting, while completing the task more than twice as fast. MLS outperformed both competing methods on dataset D1. We also noticed that one of the users was able to answer the question correctly from the MLS-generated summaries but failed to do so from the original article. For the Thinking Machines dataset (D2), MLS outperformed A2 and closely followed A3. In terms of accuracy, out of the 6 articles selected for our experiment, MLS outperformed A3 on 4 articles and performed comparably on 1. In terms of task completion time, MLS outperformed A3 on 5 out of 6 articles.
We have proposed MLS, a supervised method for generating abstractive summaries with limited training data at arbitrary compression budgets. Leveraging an extract-then-compress approach, we construct budgeted summaries following a two-phase protocol. A sequence-to-sequence network with attention, called the MFS-Net, constructs the minimal feasible summary by capturing the key concepts of the document. The second network then generates the budgeted summaries from the minimal feasible summary by leveraging an interpretable multi-headed attention model. Following the principles of transfer learning, we were able to construct high-quality summaries with limited training samples by pretraining our networks on the CNN-DailyMail dataset. Using a mixture model approach, MLS constructs high-quality summaries with reasonable grammatical accuracy. The completeness and coherence of the summaries were further established by a human evaluation study. In the future, we would like to extend our work to task-driven summaries for domain-specific extraction tasks. Personalized summarization leveraging a user-specific context model is also an exciting direction for future work.
- (2011) Jointly learning to extract and compress. In ACL.
- (2003) Latent Dirichlet allocation. JMLR.
- (2018) Website: https://www.edge.org/annual-question/what-do-you-think-about-machines-that-think
- (2001) Seeing the whole in parts: text summarization for web browsing on handheld devices. In WWW.
- Incremental personalised summarisation with novelty detection. In FQAS.
- (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. arXiv preprint arXiv:1805.11080.
- (2016) Neural summarization by extracting sentences and words. arXiv preprint arXiv:1603.07252.
- Abstractive sentence summarization with attentive recurrent neural networks. In NAACL.
- (2017) Language modeling with gated convolutional networks. In ICML.
- (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR.
- (2017) Controllable abstractive summarization. arXiv preprint arXiv:1711.05217.
- (2018) Survey of the state of the art in natural language generation: core tasks, applications and evaluation. JAIR.
- (2017) Convolutional sequence to sequence learning. In ICML.
- (2014) VADER: a parsimonious rule-based model for sentiment analysis of social media text. In ICWSM.
- (2000) Cut and paste based text summarization. In NAACL.
- (2016) Controlling output length in neural encoder-decoders. arXiv preprint arXiv:1609.09552.
- (2000) Statistics-based summarization-step one: sentence compression. AAAI/IAAI.
- (2014) Distributed representations of sentences and documents. In ICML.
- (2019) Automatic dialogue summary generation for customer service. In SIGKDD.
- Controlling length in abstractive summarization using a convolutional neural network. In EMNLP.
- (2012) Entity-centric topic-oriented opinion summarization in Twitter. In SIGKDD.
- (2004) TextRank: bringing order into text. In EMNLP.
- (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
- (2006) News to go: hierarchical text summarization for mobile devices. In SIGIR.
- (2017) Crowd-sourced iterative annotation for narrative summarization corpora. In EACL.
- A survey on transfer learning. TKDE.
- (2010) Automatic keyword extraction from individual documents. Text mining: applications and theory.
- (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- (2020) Length-controllable abstractive summarization by guiding with summary prototype. arXiv preprint arXiv:2001.07331.
- (2017) Get to the point: summarization with pointer-generator networks. arXiv preprint arXiv:1704.04368.
- (2018) Corpus-based content construction. In COLING.
- (2017) Attention is all you need. In NIPS.
- (2001) Information anxiety 2. Que.
- (2003) Fractal summarization for mobile devices to access large documents on the web. In WWW.
- (2017) Recent advances in document summarization. KAIS.