Abstractive text summarization http://arxiv.org/abs/1509.00685
Summarization based on text extraction is inherently limited, but generation-style abstractive methods have proven challenging to build. In this work, we propose a fully data-driven approach to abstractive sentence summarization. Our method utilizes a local attention-based model that generates each word of the summary conditioned on the input sentence. While the model is structurally simple, it can easily be trained end-to-end and scales to a large amount of training data. The model shows significant performance gains on the DUC-2004 shared task compared with several strong baselines.
Summarization is an important challenge of natural language understanding. The aim is to produce a condensed representation of an input text that captures the core meaning of the original. Most successful summarization systems utilize extractive approaches that crop out and stitch together portions of the text to produce a condensed version. In contrast, abstractive summarization attempts to produce a bottom-up summary, aspects of which may not appear as part of the original.
We focus on the task of sentence-level summarization. While much work on this task has looked at deletion-based sentence compression techniques (Knight and Marcu 2002, among many others), studies of human summarizers show that it is common to apply various other operations while condensing, such as paraphrasing, generalization, and reordering [Jing 2002]. Past work has modeled this abstractive summarization problem either using linguistically-inspired constraints [Dorr et al. 2003, Zajic et al. 2004] or with syntactic transformations of the input text [Cohn and Lapata 2008, Woodsend et al. 2010]. These approaches are described in more detail in Section 6.
We instead explore a fully data-driven approach for generating abstractive summaries. Inspired by the recent success of neural machine translation, we combine a neural language model with a contextual input encoder. Our encoder is modeled on the attention-based encoder of Bahdanau et al. (2014) in that it learns a latent soft alignment over the input text to help inform the summary (as shown in Figure 1). Crucially, both the encoder and the generation model are trained jointly on the sentence summarization task. The model is described in detail in Section 3. Our model also incorporates a beam-search decoder as well as additional features to model extractive elements; these aspects are discussed in Sections 4 and 5.
This approach to summarization, which we call Attention-Based Summarization (Abs), incorporates less linguistic structure than comparable abstractive summarization approaches, but can easily scale to train on a large amount of data. Since our system makes no assumptions about the vocabulary of the generated summary, it can be trained directly on any document-summary pair (in contrast to large-scale sentence compression systems such as that of Filippova and Altun 2013, which require monotonic aligned compressions). This allows us to train a summarization model for headline generation on a corpus of article pairs from Gigaword [Graff et al. 2003] consisting of around 4 million articles. An example of generation is given in Figure 2, and we discuss the details of this task in Section 7.
To test the effectiveness of this approach we run extensive comparisons with multiple abstractive and extractive baselines, including traditional syntax-based systems, integer linear program-constrained systems, information-retrieval style approaches, as well as statistical phrase-based machine translation. Section 8 describes the results of these experiments. Our approach outperforms a machine translation system trained on the same large-scale dataset and yields a large improvement over the highest scoring system in the DUC-2004 competition.
We begin by defining the sentence summarization task. Given an input sentence, the goal is to produce a condensed summary. Let the input consist of a sequence of $M$ words $x_1, \ldots, x_M$ coming from a fixed vocabulary $\mathcal{V}$ of size $|\mathcal{V}| = V$. We will represent each word as an indicator vector $x_i \in \{0,1\}^V$ for $i \in \{1, \ldots, M\}$, sentences as a sequence of indicators, and $\mathcal{X}$ as the set of possible inputs. Furthermore define the notation $x_{[i,j,k]}$ to indicate the sub-sequence of elements $i, j, k$.
A summarizer takes $x$ as input and outputs a shortened sentence $y$ of length $N < M$. We will assume that the words in the summary also come from the same vocabulary $\mathcal{V}$ and that the output is a sequence $y_1, \ldots, y_N$. Note that in contrast to related tasks, like machine translation, we will assume that the output length $N$ is fixed, and that the system knows the length of the summary before generation. (For the DUC-2004 evaluation, it is actually the number of bytes of the output that is capped. More detail is given in Section 7.)
Next consider the problem of generating summaries. Define the set $\mathcal{Y} \subset (\{0,1\}^V, \ldots, \{0,1\}^V)$ as all possible sentences of length $N$, i.e. for all $i$ and $y \in \mathcal{Y}$, $y_i$ is an indicator. We say a system is abstractive if it tries to find the optimal sequence from this set $\mathcal{Y}$,
$$\arg\max_{y \in \mathcal{Y}} s(x, y), \tag{1}$$
under a scoring function $s : \mathcal{X} \times \mathcal{Y} \mapsto \mathbb{R}$. Contrast this to a fully extractive sentence summary, which transfers words from the input (unfortunately the literature is inconsistent on the formal definition of this distinction; some systems self-described as abstractive would be extractive under our definition):
$$\arg\max_{m \in \{1, \ldots, M\}^N} s(x, x_{[m_1, \ldots, m_N]}), \tag{2}$$
or to the related problem of sentence compression that concentrates on deleting words from the input:
$$\arg\max_{m \in \{1, \ldots, M\}^N,\; m_{i-1} < m_i} s(x, x_{[m_1, \ldots, m_N]}). \tag{3}$$
While abstractive summarization poses a more difficult generation challenge, the lack of hard constraints gives the system more freedom in generation and allows it to fit with a wider range of training data.
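To make the difference between the three search spaces concrete, the following minimal sketch enumerates them by brute force for a toy input. The vocabulary, the input sentence, and the scoring function `score` are illustrative assumptions only; a real system never enumerates these sets explicitly.

```python
from itertools import product

def abstractive_candidates(vocab, n):
    """All length-n word sequences over the vocabulary."""
    return [list(c) for c in product(vocab, repeat=n)]

def extractive_candidates(x, n):
    """Length-n sequences built from input positions, in any order, repeats allowed."""
    return [[x[m] for m in idx] for idx in product(range(len(x)), repeat=n)]

def compression_candidates(x, n):
    """Length-n sequences that keep input words in their original order (deletion only)."""
    return [[x[m] for m in idx]
            for idx in product(range(len(x)), repeat=n)
            if all(a < b for a, b in zip(idx, idx[1:]))]

def best(candidates, x, score):
    """argmax over a candidate set under a scoring function score(x, y)."""
    return max(candidates, key=lambda y: score(x, y))

# Hypothetical toy data: the word "urges" is in the vocabulary but not in the input.
vocab = ["russia", "calls", "for", "joint", "front", "urges"]
x = ["russia", "calls", "for", "joint", "front"]

def score(x, y):
    # Hypothetical scoring function: position-wise agreement with a fixed target.
    target = ["russia", "urges", "front"]
    return sum(1 for a, b in zip(y, target) if a == b)

n = 3
abs_best = best(abstractive_candidates(vocab, n), x, score)
ext_best = best(extractive_candidates(x, n), x, score)
cmp_best = best(compression_candidates(x, n), x, score)
```

In this toy setting only the abstractive search can reach the full score, because the target uses a word that does not occur in the input; the extractive and compression searches are restricted to progressively smaller candidate sets.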
In this work we focus on factored scoring functions, $s$, that take into account a fixed window of previous words:
$$s(x, y) \approx \sum_{i=0}^{N-1} g(y_{i+1}, x, y_c), \tag{4}$$
where we define $y_c \triangleq y_{[i-C+1, \ldots, i]}$ for a window of size $C$.
In particular consider the conditional log-probability of a summary given the input, $s(x, y) = \log p(y \mid x; \theta)$. We can write this as:
$$\log p(y \mid x; \theta) \approx \sum_{i=0}^{N-1} \log p(y_{i+1} \mid x, y_c; \theta),$$
where we make a Markov assumption on the length of the context as size $C$ and assume that for $i < 1$, $y_i$ is a special start symbol $\langle S \rangle$.
With this scoring function in mind, our main focus will be on modelling the local conditional distribution $p(y_{i+1} \mid x, y_c; \theta)$. The next section defines a parameterization for this distribution. In Section 4, we return to the question of generation for factored models, and in Section 5 we introduce a modified factored scoring function.
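The factored score can be sketched in a few lines: the summary log-probability is a sum of local terms, each conditioned on the input and on a fixed window of preceding summary words, with start-symbol padding at the beginning. The function `toy_local_prob` below is a hypothetical stand-in for the neural model, and the names `START`, `context`, and `factored_log_prob` are introduced only for this illustration.

```python
import math

START = "<s>"  # assumed name for the special start symbol

def context(y, i, c):
    """The c summary words preceding 0-based position i, padded with START."""
    padded = [START] * c + list(y)
    return tuple(padded[i:i + c])

def factored_log_prob(x, y, local_prob, c):
    """Sum of local log-probabilities, one per summary token."""
    return sum(math.log(local_prob(y[i], x, context(y, i, c)))
               for i in range(len(y)))

def toy_local_prob(word, x, ctx):
    # Hypothetical local model: words copied from the input are twice as likely.
    return 0.5 if word in x else 0.25

x = ["a", "b", "c"]
y = ["a", "d"]
lp = factored_log_prob(x, y, toy_local_prob, c=2)
```

Because each term depends only on a bounded context, the same local model can be reused unchanged for training (one term per gold token) and for left-to-right decoding.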
The distribution of interest, $p(y_{i+1} \mid x, y_c; \theta)$, is a conditional language model based on the input sentence $x$. Past work on summarization and compression has used a noisy-channel approach to split and independently estimate a language model and a conditional summarization model [Banko et al. 2000, Knight and Marcu 2002, Daumé III and Marcu 2002], i.e.,
$$\arg\max_{y} \log p(y \mid x) = \arg\max_{y} \log p(y)\, p(x \mid y),$$
where $p(y)$ and $p(x \mid y)$ are estimated separately. Here we instead follow work in neural machine translation and directly parameterize the original distribution as a neural network. The network contains both a neural probabilistic language model and an encoder which acts as a conditional summarization model.
The core of our parameterization is a language model for estimating the contextual probability of the next word. The language model is adapted from a standard feed-forward neural network language model (NNLM), particularly the class of NNLMs described by Bengio et al. (2003). The full model is:
$$p(y_{i+1} \mid y_c, x; \theta) \propto \exp(V h + W\, \mathrm{enc}(x, y_c)),$$
$$\tilde{y}_c = [E y_{i-C+1}, \ldots, E y_i],$$
$$h = \tanh(U \tilde{y}_c).$$
The parameters are $\theta = (E, U, V, W)$ where $E \in \mathbb{R}^{D \times V}$ is a word embedding matrix, $U \in \mathbb{R}^{H \times (CD)}$, $V \in \mathbb{R}^{V \times H}$, $W \in \mathbb{R}^{V \times H}$ are weight matrices (each of which also has a corresponding bias term; for readability, we omit these terms throughout the paper), $D$ is the size of the word embeddings, and $h$ is a hidden layer of size $H$. The black-box function $\mathrm{enc}$ is a contextual encoder term that returns a vector of size $H$ representing the input and current context; we consider several possible variants, described subsequently. Figure 2(a) gives a schematic representation of the decoder architecture.
Note that without the encoder term this represents a standard language model. By incorporating the encoder term $\mathrm{enc}$ and training the two elements jointly, we can crucially incorporate the input text into generation. We next discuss several possible instantiations of the encoder.
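The decoder described above can be sketched as a single forward step: embed and concatenate the context words, apply a tanh hidden layer, add the contribution of the encoder vector, and normalize with a softmax. This is a minimal pure-Python sketch under stated assumptions: embeddings are stored one row per word, `U` is stored with one row per hidden unit, bias terms are omitted, and all sizes and random parameters are hypothetical.

```python
import math
import random

def matvec(m, v):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def softmax(z):
    mx = max(z)
    e = [math.exp(a - mx) for a in z]
    total = sum(e)
    return [a / total for a in e]

def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def nnlm_step(context_ids, enc_vec, E, U, V, W):
    """Distribution over the next word given the context and an encoder vector."""
    y_tilde = [val for wid in context_ids for val in E[wid]]  # concatenated embeddings
    h = [math.tanh(a) for a in matvec(U, y_tilde)]            # hidden layer
    logits = [a + b for a, b in zip(matvec(V, h), matvec(W, enc_vec))]
    return softmax(logits)

# Hypothetical toy sizes: 6 word types, embedding size 3, hidden size 4, context 2.
rng = random.Random(0)
vocab_size, D, H, C = 6, 3, 4, 2
E = rand_matrix(vocab_size, D, rng)
U = rand_matrix(H, C * D, rng)
V = rand_matrix(vocab_size, H, rng)
W = rand_matrix(vocab_size, H, rng)
enc_vec = [0.0] * H  # a zero encoder output reduces the model to a plain NNLM
probs = nnlm_step([1, 4], enc_vec, E, U, V, W)
```

With a zero encoder vector the matrix `W` has no effect on the output, which mirrors the observation that the model without the encoder term is a standard language model.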
Our most basic model simply uses the bag-of-words of the input sentence embedded down to size $H$, while ignoring properties of the original order or relationships between neighboring words. We write this model as:
$$\mathrm{enc}_1(x, y_c) = p^\top \tilde{x},$$
$$p = [1/M, \ldots, 1/M],$$
$$\tilde{x} = [F x_1, \ldots, F x_M],$$
where the input-side embedding matrix $F \in \mathbb{R}^{H \times V}$ is the only new parameter of the encoder and $p \in [0,1]^M$ is a uniform distribution over the input words.
For summarization this model can capture the relative importance of words to distinguish content words from stop words or embellishments. Potentially the model can also learn to combine words, although it is inherently limited in representing contiguous phrases.
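As a small illustration, the bag-of-words encoder amounts to averaging the input-side embeddings of the words in the sentence. The sketch below assumes the embedding table is stored one row per word type; the table values are hypothetical.

```python
def bow_encoder(input_ids, F):
    """Uniformly weighted average of the input-side word embeddings."""
    m = len(input_ids)
    dim = len(F[0])
    return [sum(F[w][k] for w in input_ids) / m for k in range(dim)]

# Hypothetical embedding table: 3 word types, embedding size 2.
F = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
enc = bow_encoder([0, 2], F)
```

Since the result does not change when the input words are permuted, the encoder cannot distinguish different orderings of the same words, which is the limitation noted above.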
To address some of the modelling issues with bag-of-words we also consider using a deep convolutional encoder for the input sentence. This architecture improves on the bag-of-words model by allowing local interactions between words while also not requiring the context $y_c$ while encoding the input.
We utilize a standard time-delay neural network (TDNN) architecture, alternating between temporal convolution layers and max pooling layers:
$$\mathrm{enc}_2(x, y_c) = \max_i \tilde{x}^L_i, \tag{5}$$
$$\forall l \in \{1, \ldots, L\},\quad \tilde{x}^l_i = \tanh(\max\{\bar{x}^l_{2i-1}, \bar{x}^l_{2i}\}), \tag{6}$$
$$\forall l \in \{1, \ldots, L\},\quad \bar{x}^l_i = Q^l \tilde{x}^{l-1}_{[i-Q, \ldots, i+Q]}, \tag{7}$$
$$\tilde{x}^0 = [F x_1, \ldots, F x_M],$$
where $F$ is a word embedding matrix and $Q \in \mathbb{R}^{L \times H \times (2Q+1)}$ consists of a set of filters for each layer $\{1, \ldots, L\}$. Eq. 7 is a temporal (1D) convolution layer, Eq. 6 consists of a 2-element temporal max pooling layer and a pointwise non-linearity, and the final output Eq. 5 is a max over time. At each layer $\tilde{x}$ is one half the size of $\bar{x}$. For simplicity we assume that the convolution is padded at the boundaries, and that $M$ is greater than $2^L$ so that the dimensions are well-defined.
While the convolutional encoder has richer capacity than bag-of-words, it still is required to produce a single representation for the entire input sentence. A similar issue in machine translation inspired Bahdanau et al. (2014) to instead utilize an attention-based contextual encoder that constructs a representation based on the generation context. Here we note that if we exploit this context, we can actually use a rather simple model similar to bag-of-words:
$$\mathrm{enc}_3(x, y_c) = p^\top \bar{x},$$
$$p \propto \exp(\tilde{x} P \tilde{y}'_c),$$
$$\tilde{x} = [F x_1, \ldots, F x_M],$$
$$\tilde{y}'_c = [G y_{i-C+1}, \ldots, G y_i],$$
$$\forall i,\quad \bar{x}_i = \sum_{q=i-Q}^{i+Q} \tilde{x}_q / Q,$$
where $G \in \mathbb{R}^{D \times V}$ is an embedding of the context, $P \in \mathbb{R}^{H \times (CD)}$ is a new weight matrix parameter mapping between the context embedding and input embedding, and $Q$ is a smoothing window. The full model is shown in Figure 2(b).
Informally we can think of this model as simply replacing the uniform distribution in bag-of-words with a learned soft alignment, $p$, between the input and the summary. Figure 1 shows an example of this distribution as a summary is generated. The soft alignment is then used to weight the smoothed version of the input $\bar{x}$ when constructing the representation. For instance if the current context aligns well with position $i$ then the words $x_{i-Q}, \ldots, x_{i+Q}$ are highly weighted by the encoder. Together with the NNLM, this model can be seen as a stripped-down version of the attention-based neural machine translation model. (To be explicit, compared to Bahdanau et al. (2014) our model uses an NNLM instead of a target-side LSTM, source-side windowed averaging instead of a source-side bi-directional RNN, and a weighted dot-product for alignment instead of an alignment MLP.)
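The attention-based encoder can be sketched as follows: score each input position against the embedded context through a weight matrix, normalize the scores into a soft alignment, and use that alignment to weight a locally smoothed version of the input embeddings. This is a minimal sketch, not the authors' implementation. It assumes embedding tables stored one row per word, a window that is truncated at the sentence boundaries and normalized by the number of positions it covers, and hypothetical toy parameters.

```python
import math

def softmax(z):
    mx = max(z)
    e = [math.exp(a - mx) for a in z]
    total = sum(e)
    return [a / total for a in e]

def attention_encoder(input_ids, context_ids, F, G, P, q):
    """Return (encoder vector, soft alignment over input positions)."""
    x_tilde = [F[w] for w in input_ids]                        # input embeddings
    y_tilde = [v for w in context_ids for v in G[w]]           # concatenated context
    Py = [sum(a * b for a, b in zip(row, y_tilde)) for row in P]
    p = softmax([sum(a * b for a, b in zip(xi, Py)) for xi in x_tilde])
    m, hdim = len(x_tilde), len(Py)
    x_bar = []
    for i in range(m):
        lo, hi = max(0, i - q), min(m, i + q + 1)              # window around i
        x_bar.append([sum(x_tilde[j][k] for j in range(lo, hi)) / (hi - lo)
                      for k in range(hdim)])
    enc = [sum(p[i] * x_bar[i][k] for i in range(m)) for k in range(hdim)]
    return enc, p

# Hypothetical toy parameters: 3 word types, input embedding size 2,
# context embedding size 1, context length 1.
F = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
G = [[1.0], [0.0], [-1.0]]
P = [[2.0], [0.0]]
enc, p = attention_encoder([0, 1, 2], [0], F, G, P, q=0)

# With an all-zero P the alignment is uniform and, for q = 0, the result
# coincides with a plain average of the input embeddings.
enc_uniform, p_uniform = attention_encoder([0, 1, 2], [0], F, G,
                                           [[0.0], [0.0]], q=0)
```

The uniform case shows the sense in which this encoder generalizes bag-of-words: only the weighting over input positions changes.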
The lack of generation constraints makes it possible to train the model on arbitrary input-output pairs. Once we have defined the local conditional model, $p(y_{i+1} \mid x, y_c; \theta)$, we can estimate the parameters to minimize the negative log-likelihood of a set of summaries. Define this training set as consisting of $J$ input-summary pairs $(x^{(1)}, y^{(1)}), \ldots, (x^{(J)}, y^{(J)})$. The negative log-likelihood conveniently factors into a term for each token in the summary (this is dependent on using the gold standard contexts $y_c$; an alternative is to use the predicted context within a structured or reinforcement-learning style objective):
$$\mathrm{NLL}(\theta) = -\sum_{j=1}^{J} \log p(y^{(j)} \mid x^{(j)}; \theta) = -\sum_{j=1}^{J} \sum_{i=0}^{N-1} \log p(y^{(j)}_{i+1} \mid x^{(j)}, y_c; \theta).$$
We now return to the problem of generating summaries. Recall from Eq. 4 that our goal is to find
$$y^* = \arg\max_{y \in \mathcal{Y}} \sum_{i=0}^{N-1} g(y_{i+1}, x, y_c).$$
Unlike phrase-based machine translation where inference is NP-hard, it actually is tractable in theory to compute $y^*$. Since there is no explicit hard alignment constraint, Viterbi decoding can be applied and requires $O(N V^C)$ time to find an exact solution. In practice though $V$ is large enough to make this difficult. An alternative approach is to approximate the $\arg\max$ with a strictly greedy or deterministic decoder.
A compromise between exact and greedy decoding is to use a beam-search decoder (Algorithm 1) which maintains the full vocabulary $\mathcal{V}$ while limiting itself to $K$ potential hypotheses at each position of the summary. This has been the standard approach for neural MT models [Bahdanau et al. 2014, Sutskever et al. 2014, Luong et al. 2015]. Algorithm 1 gives the beam-search procedure, modified for the feed-forward model.
As with Viterbi, this beam search algorithm is much simpler than beam search for phrase-based MT. Because there is no explicit constraint that each source word be used exactly once, there is no need to maintain a bit set and we can simply move from left to right generating words. The beam search algorithm requires $O(KNV)$ time. From a computational perspective though, each round of beam search is dominated by computing $p(y_i \mid x, y_c)$ for each of the $K$ hypotheses. These can be computed as a mini-batch, which in practice greatly reduces the factor of $K$.
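The left-to-right beam search described above can be sketched compactly. This is an illustrative sketch rather than the paper's Algorithm 1: the local model is a hypothetical lookup table conditioned on the previous word only, the context size is assumed to be at least 1, and the hypotheses are scored one at a time instead of as a mini-batch.

```python
import math

START = "<s>"  # assumed name for the start symbol

def beam_search(x, vocab, local_log_prob, n, k, c):
    """Return (score, summary) for the best length-n summary found with beam size k."""
    beam = [(0.0, [])]
    for _ in range(n):
        candidates = []
        for score, y in beam:
            ctx = tuple(([START] * c + y)[-c:])  # last c words, START-padded (c >= 1)
            for w in vocab:
                candidates.append((score + local_log_prob(w, x, ctx), y + [w]))
        candidates.sort(key=lambda t: t[0], reverse=True)
        beam = candidates[:k]                    # keep the k best hypotheses
    return beam[0]

# Hypothetical local model: a table of next-word probabilities given the previous word.
TABLE = {
    START: {"a": 0.6, "b": 0.39, "c": 0.01},
    "a": {"a": 0.34, "b": 0.33, "c": 0.33},
    "b": {"a": 0.05, "b": 0.05, "c": 0.9},
    "c": {"a": 0.4, "b": 0.3, "c": 0.3},
}

def toy_log_prob(word, x, ctx):
    return math.log(TABLE[ctx[-1]][word])

vocab = ["a", "b", "c"]
greedy = beam_search([], vocab, toy_log_prob, n=2, k=1, c=1)
wider = beam_search([], vocab, toy_log_prob, n=2, k=2, c=1)
```

In this toy table the greedy decoder (beam size 1) commits to the locally best first word and ends with a lower-probability sequence than the search with beam size 2, which illustrates why a small beam is a useful compromise between greedy and exact decoding.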
While we will see that the attention-based model is effective at generating summaries, it does miss an important aspect seen in the human-generated references. In particular, the abstractive model does not have the capacity to find extractive word matches when necessary, for example transferring unseen proper noun phrases from the input. Similar issues have also been observed in neural translation models, particularly in terms of translating rare words [Luong et al. 2015].
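To address this issue we experiment with tuning a very small set of additional features that trade off the abstractive/extractive tendency of the system. We do this by modifying our scoring function to directly estimate the probability of a summary using a log-linear model, as is standard in machine translation:
$$p(y \mid x; \theta, \alpha) \propto \exp\Big(\alpha^\top \sum_{i=0}^{N-1} f(y_{i+1}, x, y_c)\Big),$$
where $\alpha \in \mathbb{R}^5$ is a weight vector and $f$ is a feature function. Finding the best summary under this distribution corresponds to maximizing a factored scoring function $s$,
$$s(x, y) = \sum_{i=0}^{N-1} \alpha^\top f(y_{i+1}, x, y_c),$$
where $g(y_{i+1}, x, y_c) \triangleq \alpha^\top f(y_{i+1}, x, y_c)$ to satisfy Eq. 4. The function $f$ is defined to combine the local conditional probability with some additional indicator features:
$$f(y_{i+1}, x, y_c) = \big[\, \log p(y_{i+1} \mid x, y_c; \theta),\; \mathbb{1}\{\exists j.\ y_{i+1} = x_j\},\; \mathbb{1}\{\exists j.\ y_{i+1-k} = x_{j-k}\ \forall k \in \{0,1\}\},\; \mathbb{1}\{\exists j.\ y_{i+1-k} = x_{j-k}\ \forall k \in \{0,1,2\}\},\; \mathbb{1}\{\exists k > j.\ y_i = x_k,\ y_{i+1} = x_j\} \,\big].$$
These features correspond to indicators of unigram, bigram, and trigram match with the input as well as reordering of input words. Note that setting $\alpha = \langle 1, 0, \ldots, 0 \rangle$ gives a model identical to standard Abs.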
After training the main neural model, we fix $\theta$ and tune the $\alpha$ parameters. We follow the statistical machine translation setup and use minimum-error rate training (MERT) to tune for the summarization metric on tuning data [Och 2003]. This tuning step is also identical to the one used for the phrase-based machine translation baseline.
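The extractive indicator features can be sketched directly from their description: for each generated word, check whether it (and the n-gram ending at it) also occurs in the input, and whether it appears in the input before the previously generated word. The function and variable names below are hypothetical, and the local log-probability is passed in as a plain number rather than computed by a model.

```python
def extractive_features(x, y, i, log_p):
    """Feature vector for generating y[i] (0-based) given input x and history y[:i]."""
    def ngram_in_input(n):
        if i + 1 < n:                      # not enough history for an n-gram
            return 0.0
        gram = y[i - n + 1:i + 1]
        return float(any(x[j:j + n] == gram for j in range(len(x) - n + 1)))

    reorder = 0.0
    if i >= 1:
        # previous summary word occurs later in the input than the current one
        reorder = float(any(x[k] == y[i - 1] and x[j] == y[i]
                            for j in range(len(x))
                            for k in range(j + 1, len(x))))
    return [log_p, ngram_in_input(1), ngram_in_input(2), ngram_in_input(3), reorder]

def local_score(alpha, feats):
    """Log-linear local score: dot product of weights and features."""
    return sum(a * f for a, f in zip(alpha, feats))

x = ["the", "cat", "sat", "down"]
y = ["cat", "sat", "the"]
f1 = extractive_features(x, y, 1, log_p=-1.0)   # generating "sat"
f2 = extractive_features(x, y, 2, log_p=-2.0)   # generating "the"
```

With the weight vector `[1, 0, 0, 0, 0]` the local score reduces to the model log-probability alone, matching the remark that this setting recovers the purely neural system.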
Abstractive sentence summarization has been traditionally connected to the task of headline generation. Our work is similar to the early work of Banko et al. (2000), who developed a statistical machine translation-inspired approach for this task using a corpus of headline-article pairs. We extend this approach by: (1) using a neural summarization model as opposed to a count-based noisy-channel model, (2) training the model at a much larger scale (4 million articles compared to 25K), and (3) allowing fully abstractive decoding.
This task was standardized around the DUC-2003 and DUC-2004 competitions [Over et al. 2007]. The Topiary system [Zajic et al. 2004] performed the best in this task, and is described in detail in the next section. We point interested readers to the DUC web page (http://duc.nist.gov/) for the full list of systems entered in this shared task.
More recently, Cohn and Lapata (2008) give a compression method which allows for more arbitrary transformations. They extract tree transduction rules from aligned, parsed texts and learn weights on transformations using a max-margin learning algorithm. Woodsend et al. (2010) propose a quasi-synchronous grammar approach utilizing both context-free parses and dependency parses to produce legible summaries. Both of these approaches differ from ours in that they directly use the syntax of the input/output sentences. The latter system is W&L in our results; we attempted to train the former system T3 on this dataset but could not train it at scale.
In addition to Banko et al. (2000) there has been some work using statistical machine translation directly for abstractive summarization. Wubben et al. (2012) utilize Moses directly as a method for text simplification.
Recently Filippova and Altun (2013) developed a strictly extractive system that is trained on a relatively large corpus (250K sentences) of article-title pairs. Because their focus is extractive compression, the sentences are transformed by a series of heuristics such that the words are in monotonic alignment. Our system does not require this alignment step but instead uses the text directly.
This work is closely related to recent work on neural network language models (NNLM) and to work on neural machine translation. The core of our model is an NNLM based on that of Bengio et al. (2003). Of the recent neural machine translation models, ours is most closely related to the attention-based model of Bahdanau et al. (2014), which explicitly finds a soft alignment between the current position and the input source. Most of these models utilize recurrent neural networks (RNNs) for generation as opposed to feed-forward models. We hope to incorporate an RNN-LM in future work.
We experiment with our attention-based sentence summarization model on the task of headline generation. In this section we describe the corpora used for this task, the baseline methods we compare with, and implementation details of our approach.
The standard sentence summarization evaluation set is associated with the DUC-2003 and DUC-2004 shared tasks [Over et al. 2007]. The data for this task consists of 500 news articles from the New York Times and Associated Press Wire services, each paired with 4 different human-generated reference summaries (not actually headlines), capped at 75 bytes. This data set is evaluation-only, although the similarly sized DUC-2003 data set was made available for the task. The expectation is for a summary of roughly 14 words, based on the text of a complete article (although we only make use of the first sentence). The full data set is available by request at http://duc.nist.gov/data.html.
For this shared task, systems were entered and evaluated using several variants of the recall-oriented ROUGE metric [Lin 2004]. To make recall-only evaluation unbiased to length, the output of all systems is cut off after 75 characters and no bonus is given for shorter summaries. Unlike BLEU, which interpolates various n-gram matches, there are several versions of ROUGE for different match lengths. The DUC evaluation uses ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest common subsequence), all of which we report.
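For intuition, n-gram recall against multiple references with a character cap on the system output can be sketched as below. This is a simplified illustration, not the official ROUGE toolkit: it uses whitespace tokenization, pools clipped n-gram matches over all references, and omits stemming, jackknifing, and confidence intervals. The function names are introduced only for this example.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n_recall(candidate, references, n, max_chars=75):
    """Clipped n-gram recall of a truncated candidate against reference summaries."""
    cand = ngrams(candidate[:max_chars].split(), n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref.split(), n)
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
        total += sum(ref_counts.values())
    return matched / total if total else 0.0

refs = ["the cat slept", "a cat sat"]
r1 = rouge_n_recall("the cat sat", refs, 1)
r2 = rouge_n_recall("the cat sat", refs, 2)
```

Because the score is recall-only, a longer output can only help, which is why the evaluation truncates every system output to the same length.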
In addition to the standard DUC-2004 evaluation, we also report evaluation on single reference headline-generation using a randomly held-out subset of Gigaword. This evaluation is closer to the task the model is trained for, and it allows us to use a bigger evaluation set, which we will include in our code release. For this evaluation, we tune systems to generate output of the average title length.
For training data for both tasks, we utilize the annotated Gigaword data set [Graff et al. 2003, Napoles et al. 2012], which consists of standard Gigaword, preprocessed with Stanford CoreNLP tools [Manning et al. 2014]. Our model only uses annotations for tokenization and sentence separation, although several of the baselines use parsing and tagging as well. Gigaword contains around 9.5 million news articles sourced from various domestic and international news services over the last two decades.
For our training set, we pair the headline of each article with its first sentence to create an input-summary pair. While the model could in theory be trained on any pair, Gigaword contains many spurious headline-article pairs. We therefore prune training based on the following heuristic filters: (1) Are there no non-stop-words in common? (2) Does the title contain a byline or other extraneous editing marks? (3) Does the title have a question mark or colon? After applying these filters, the training set consists of roughly 4 million title-article pairs. We apply a minimal preprocessing step using PTB tokenization, lower-casing, replacing all digit characters with #, and replacing word types seen fewer than 5 times with UNK. We also remove all articles from the time period of the DUC evaluation.
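The filtering and normalization steps can be sketched as follows. This is a minimal illustration that assumes the text is already tokenized; the stop-word set is supplied by the caller, and the byline test (a title starting with the token "by") is a hypothetical placeholder, since the exact markers used are not specified.

```python
import re
from collections import Counter

def normalize(tokens):
    """Lower-case each token and replace every digit character with '#'."""
    return [re.sub(r"\d", "#", t.lower()) for t in tokens]

def keep_pair(title, article, stop_words):
    """Apply the three pruning heuristics to a tokenized (title, article) pair."""
    if not (set(title) & set(article)) - stop_words:
        return False                      # (1) no content words in common
    if title[:1] == ["by"]:
        return False                      # (2) hypothetical byline check
    if "?" in title or ":" in title:
        return False                      # (3) question mark or colon in title
    return True

def replace_rare(sentences, min_count=5):
    """Replace word types seen fewer than min_count times with UNK."""
    counts = Counter(t for s in sentences for t in s)
    return [[t if counts[t] >= min_count else "UNK" for t in s] for s in sentences]

tokens = normalize(["Dec.", "1998", "Yeltsin"])
```

Mapping digits to a single symbol and rare types to UNK keeps the vocabulary small, which matters because the output softmax ranges over every word type.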
The complete input training vocabulary consists of million word tokens and 110K unique word types with an average sentence size of words. The headline vocabulary consists of million tokens and 69K word types with the average title of length words (note that this is significantly shorter than the DUC summaries). On average there are overlapping word types between the headline and the input; although only in the first 75-characters of the input.
Due to the variety of approaches to the sentence summarization problem, we report a broad set of headline-generation baselines.
From the DUC-2004 task we include the Prefix baseline that simply returns the first 75 characters of the input as the headline. We also report the winning system on this shared task, Topiary [Zajic et al. 2004]. Topiary merges a compression system using linguistically-motivated transformations of the input [Dorr et al. 2003] with an unsupervised topic detection (UTD) algorithm that appends key phrases from the full article onto the compressed output. Woodsend et al. (2010) (described above) also report results on the DUC dataset.
The DUC task also includes a set of manual summaries performed by 8 human summarizers, each summarizing half of the test data sentences (yielding 4 references per sentence). We report the average inter-annotator agreement score as Reference. For reference, the best human evaluator scores 31.7 ROUGE-1.
We also include several baselines that have access to the same training data as our system. The first is a sentence compression baseline Compress [Clarke and Lapata 2008]. This model uses the syntactic structure of the original sentence along with a language model trained on the headline data to produce a compressed output. The syntax and language model are combined with a set of linguistic constraints and decoding is performed with an ILP solver.
To control for memorizing titles from training, we implement an information retrieval baseline, IR. This baseline indexes the training set, and gives the title for the article with the highest BM25 match to the input (see Manning et al. 2008).
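A retrieval baseline of this kind can be sketched with a small BM25 scorer over tokenized training articles. This is an illustrative sketch only: the parameter values `k1` and `b` are commonly used defaults rather than values taken from the paper, the inverse document frequency uses one common smoothed variant, and the toy articles and titles are hypothetical.

```python
import math
from collections import Counter

def bm25_retrieve(query, articles, titles, k1=1.2, b=0.75):
    """Return the title of the training article with the highest BM25 score."""
    n = len(articles)
    avg_len = sum(len(a) for a in articles) / n
    df = Counter(t for a in articles for t in set(a))   # document frequencies

    def score(doc):
        tf = Counter(doc)
        total = 0.0
        for term in set(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(doc) / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    best = max(range(n), key=lambda i: score(articles[i]))
    return titles[best]

articles = [["oil", "prices", "rise"],
            ["election", "results", "announced"],
            ["oil", "giant", "merger"]]
titles = ["t0", "t1", "t2"]
title = bm25_retrieve(["election", "results"], articles, titles)
```

Such a baseline can only return titles it has already seen, so it measures how far memorization of the training headlines alone can go.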
Finally, we use a phrase-based statistical machine translation system trained on Gigaword to produce summaries, Moses+ [Koehn et al. 2007]. To improve the baseline for this task, we augment the phrase table with “deletion” rules mapping each article word to the empty string $\epsilon$, include an additional deletion feature for these rules, and allow for an infinite distortion limit. We also explicitly tune the model using MERT to target the 75-byte capped ROUGE score as opposed to standard BLEU-based tuning. Unfortunately, one remaining issue is that it is non-trivial to modify the translation decoder to produce fixed-length outputs, so we tune the system to produce roughly the expected length.
For training, we use mini-batch stochastic gradient descent to minimize negative log-likelihood. We use a learning rate of , and halve the learning rate if validation log-likelihood does not improve for an epoch. Training is performed with shuffled mini-batches of size 64. The mini-batches are grouped by input length. After each epoch, we renormalize the embedding tables [Hinton et al. 2012]. Based on the validation set, we set hyperparameters as , , , , and .
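The learning-rate rule can be sketched as a small function over per-epoch validation scores. This sketch makes two assumptions that are not stated in the text: "does not improve" is measured against the best validation value seen so far, and the initial learning rate in the example is an arbitrary placeholder.

```python
def learning_rate_schedule(initial_lr, validation_nll):
    """Learning rate to use after each epoch, halving when validation NLL stalls."""
    lr, best, history = initial_lr, float("inf"), []
    for nll in validation_nll:
        if nll >= best:          # no improvement over the best epoch so far
            lr /= 2.0
        best = min(best, nll)
        history.append(lr)
    return history

# Hypothetical validation negative log-likelihoods for five epochs.
rates = learning_rate_schedule(0.1, [5.0, 4.0, 4.5, 3.9, 3.9])
```

The rate is left unchanged while validation performance keeps improving and is halved each time an epoch fails to improve on the best value.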
Our implementation uses the Torch numerical framework (http://torch.ch/) and will be openly available along with the data pipeline. Crucially, training is performed on GPUs and would be intractable or require approximations otherwise. Processing 1000 mini-batches with , requires 160 seconds. Best validation accuracy is reached after 15 epochs through the data, which requires around 4 days of training.
Our main results are presented in Table 1. We run experiments both using the DUC-2004 evaluation data set (500 sentences, 4 references, 75 bytes) with all systems and a randomly held-out Gigaword test set (2000 sentences, 1 reference). We first note that the baselines Compress and IR do relatively poorly on both datasets, indicating that neither article information nor language model information alone is sufficient for the task. The Prefix baseline actually performs surprisingly well on ROUGE-1, which makes sense given the earlier observed overlap between article and summary.
Both Abs and Moses+ perform better than Topiary, particularly on ROUGE-2 and ROUGE-L in DUC. The full model Abs+ scores the best on these tasks, and is significantly better based on the default ROUGE confidence level than Topiary on all metrics, and Moses+ on ROUGE-1 for DUC as well as ROUGE-1 and ROUGE-L for Gigaword. Note that the additional extractive features bias the system towards retaining more input words, which is useful for the underlying metric.
Next we consider ablations to the model and algorithm structure. Table 2 shows experiments for the model with various encoders. For these experiments we look at the perplexity of the system as a language model on validation data, which controls for the variable of inference and tuning. The NNLM language model with no encoder gives a gain over the standard n-gram language model. Including even the bag-of-words encoder reduces perplexity to below 50. Both the convolutional encoder and the attention-based encoder further reduce the perplexity, with attention giving a value below 30.
We also consider model and decoding ablations on the main summary model, shown in Table 3. These experiments compare against the BoW encoder, compare beam search with greedy decoding, and restrict the system to be completely extractive. Of these features, the biggest impact is from using a more powerful encoder (attention versus BoW), as well as using beam search to generate summaries. The abstractive nature of the system helps, but for ROUGE even using pure extractive generation is effective.
Finally we consider example summaries shown in Figure 4. Despite improving on the baseline scores, this model is far from human performance on this task. Generally the models are good at picking out key words from the input, such as names and places. However, both models will reorder words in syntactically incorrect ways; for instance, in Sentence 7 both models have the wrong subject. Abs often uses more interesting re-wording, for instance new nz pm after election in Sentence 4, but this can also lead to attachment mistakes such as russian oil giant chevron in Sentence 11.
We have presented a neural attention-based model for abstractive summarization, based on recent developments in neural machine translation. We combine this probabilistic model with a generation algorithm which produces accurate abstractive summaries. As a next step we would like to further improve the grammaticality of the summaries in a data-driven way, as well as scale this system to generate paragraph-level summaries. Both pose additional challenges in terms of efficient alignment and consistency in generation.
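Yoshua Bengio, Réjean Ducharme, Pascal Vincent, and Christian Janvin. 2003. A neural probabilistic language model. The Journal of Machine Learning Research, 3:1137–1155.
James Clarke and Mirella Lapata. 2008. Global inference for sentence compression: An integer linear programming approach. Journal of Artificial Intelligence Research, pages 399–429.
Hongyan Jing. 2002. Using hidden Markov modeling to decompose human-written summaries. Computational Linguistics, 28(4):527–543.
Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of 52nd Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 55–60.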