What comes next? Extractive summarization by next-sentence prediction

01/12/2019 ∙ by Jingyun Liu, et al. ∙ McGill University

Existing approaches to automatic summarization assume that a length limit for the summary is given, and view content selection as an optimization problem to maximize informativeness and minimize redundancy within this budget. This framework ignores the fact that human-written summaries have rich internal structure which can be exploited to train a summarization system. We present NEXTSUM, a novel approach to summarization based on a model that predicts the next sentence to include in the summary using not only the source article, but also the summary produced so far. We show that such a model successfully captures summary-specific discourse moves, and leads to better content selection performance, in addition to automatically predicting how long the target summary should be. We perform experiments on the New York Times Annotated Corpus of summaries, where NEXTSUM outperforms lead and content-model summarization baselines by significant margins. We also show that the lengths of summaries produced by our system correlate with the lengths of the human-written gold standards.




1 Introduction

Writing a summary is a different task compared to producing a longer article. As a consequence, it is likely that the topic and discourse moves made in summaries differ from those in regular articles. In this work, we present a powerful extractive summarization system which exploits rich summary-internal structure to perform content selection, redundancy reduction, and even predict the target summary length, all in one joint model.

Text summarization has been addressed by numerous techniques in the community Nenkova and McKeown (2011). For extractive summarization, which is the focus of this paper, a popular task setup is to generate summaries that respect a fixed length limit. In the summarization shared tasks of the past Document Understanding Conferences (DUC; http://duc.nist.gov/), these limits are defined in terms of words or bytes. As a result, much work has framed summarization as a constrained optimization problem, in order to select a subset of sentences with desirable summary qualities such as informativeness, coherence, and non-redundancy within the length budget Gillick and Favre (2009); Lin and Bilmes (2011); Kulesza and Taskar (2011).

One problem with this setup is that it does not match many real-world summarization settings. For example, writers can tailor the length of their summaries to the amount of noteworthy content in the source article. Summaries created by news editors for archives, such as the New York Times Annotated Corpus Sandhaus (2008), exhibit a variety of lengths. There is also evidence that in the context of web search, people prefer summaries of different lengths for the documents in search results depending on the type of the search query Kaisser et al. (2008). More generally, current systems focus heavily on properties of the source document to learn to identify important sentences, and score the coherence of sentence transitions. They reason about the content of summaries primarily for purposes of avoiding redundancy, and respecting the length budget. But they ignore the idea that it might actually be useful to learn content structure and discourse planning for summaries from large collections of multi-sentence summaries.

This work proposes an extractive summarization system that focuses on capturing rich summary-internal structure. Our key idea is that since summaries in a domain often follow some predictable structure, a partial summary or set of summary sentences should help predict other summary sentences. We formalize this intuition in a model called NextSum, which selects the next summary sentence based not only on properties of the source text, but also on the previously selected sentences in the summary. An example choice is shown in Table 1. This setup allows our model to capture summary-specific discourse and topic transitions. For example, it can learn to expand on a topic that is already mentioned in the summary, or to introduce a new topic. It can learn to follow a script or discourse relations that are expected for that domain’s summaries. It can even learn to predict the end of the summary, avoiding the need to explicitly define a length cutoff.

The core of our system is a next-sentence prediction component, which is a feed-forward neural network driven by features capturing the prevalence of domain subtopics in the source and the summary, sentence importance in the source, and coverage of the source document by the summary so far. A full summary can then be generated by repeatedly predicting the next sentence until the model predicts that the summary should end.

Since summary-specific moves may depend on the domain, we first explore domain-specific summarization on event-oriented news topics (War Crimes, Assassinations, Bombs) from the New York Times Annotated Corpus Sandhaus (2008). We also train a domain-general model across multiple types of events. NextSum predicts the next summary sentence with remarkably high accuracy, reaching 67% compared to a chance accuracy of 9%. The generated summaries outperform the lead baseline as well as domain-specific summarization baselines without requiring an explicit redundancy check or a length constraint. Moreover, the system produces summaries of variable lengths which correlate with the lengths of human summaries for the same texts.

  Summary so far
  [S] After a sordid campaign overshadowed by the killing last month of a leading liberal politician, the citizens of St. Petersburgh, Russia’s second largest city, voted in record numbers Sunday, and today’s preliminary results indicated that they had improved the fortunes of the city’s embattled democratic alliance.
  Correct next summary sentence
  [A] The biggest winner on Sunday was Yabloko, a liberal party led by a presidential aspirant, Grigory A. Yablinsky, whose candidates in 24 districts scored well enough to move to the final round.
  Incorrect as next sentence
  [B] Yabloko, which has long considered St. Petersburg as its stronghold, was even opposed by a party that called itself Yabloko-St Petersburg.
Table 1: Example of a partial summary [S], and two sentences from the same source article. Both [A] and [B] are about the same entity, but [A] is clearly a logical next sentence to continue [S] when compared to [B].

2 Related work

Many approaches to extractive summarization are unsupervised, and focus on the role of word frequency and source document representation for selecting informative and non-redundant content Erkan and Radev (2004); Mihalcea and Tarau (2004); Nenkova et al. (2006). More recently, supervised approaches have become popular, which view content selection as a sentence-level binary classification problem, typically using a neural network Cheng and Lapata (2016); Nallapati et al. (2017).

Using source structure. Source structure is a common cue for summarization. Relative word frequency and position of sentences are standardly used in many systems. Discourse- and graph-based summarization techniques explicitly focus on computing document structure Marcu (1998); Louis et al. (2010); Christensen et al. (2013). Other techniques include learning probabilistic topic models over source articles within a domain to capture subtopics and transitions between them Barzilay and Lee (2004); Haghighi and Vanderwende (2009); Cheung and Penn (2013). However, the use of structure from summaries is less explored.

Using summary structure. In fact, almost all systems maintain some representation of the partial summary at each timestep. At the very least, it is needed for respecting a length limit and for preventing redundancy. Even in recent neural network based extractive summarization, a representation of the summary so far has been proposed to allow redundancy checks Nallapati et al. (2017). However, current methods do not focus on capturing rich summary discourse and content structure.

Recent abstractive neural summarization models based on encoder-decoder frameworks actually have greater scope for capturing summary structure and content. The use of techniques such as attention and pointer mechanisms can be viewed as a form of summary structure modelling Rush et al. (2015); Nallapati et al. (2016); See et al. (2017); Paulus et al. (2017). However, because such systems currently operate at the word level, these mechanisms are mostly used for handling issues such as grammaticality, out-of-vocabulary items, predicate-argument structure, and local coherence. By contrast, we aim to capture higher-level transitions in the contents of a summary.

Next-sentence prediction. The way we learn summary structure is by training a module for next summary sentence prediction. A parallel idea can be found in the form of next-utterance prediction in retrieval-based dialogue systems Jafarpour et al. (2010); Wang et al. (2013); Lowe et al. (2016). There have also been recent attempts at predicting the next sentence in text. The skip-thought model Kiros et al. (2015) is trained to predict a sentence from its neighbouring sentences to produce sentence representations. Ghosh et al. (2016) and Pichotta and Mooney (2016) evaluate neural language models on next-sentence and event prediction. In contrast, we aim to predict the next output sentence within the tangible application of summarization.

3 NextSum model overview

We first present the key ideas, and the next section explains how we implement the model.

NextSum comprises two components, a next-sentence prediction system, and a summary generation module. The first is a supervised system trained to select the next summary sentence, given a set of candidate sentences from the source, and the summary generated so far. NextSum’s generation component builds a summary by making repeated calls to the next-sentence predictor.

3.1 Predicting the next summary sentence

The next-sentence predictor is trained on a corpus of source articles and their gold-standard summaries written by humans. In this work, we focus on single-document summarization.

Consider a source article X containing n sentences, and a gold-standard extractive summary Y = (y_1, ..., y_m), a sequence of m sentences. Since Y is extractive, every y_i is a sentence of X.

In NextSum, summaries are created by adding one sentence at a time. Let S_t be the partial summary at timestep t; S_t has t sentences. At time t+1, the goal of NextSum is to score a set of candidate sentences from the source, C_t, and find the best next sentence to follow S_t. Let the gold-standard next sentence be y_{t+1}. The set C_t may either be all of the source sentences which have not yet been included in the summary, or be limited to a smaller size k. For now, assume that all the unselected source sentences are in the candidate set, and thus |C_t| = n - t.

The model selects the next summary sentence s* from C_t such that:

s* = argmax_{s ∈ C_t} P(s | X, S_t)

When there is a tie, the earlier sentence in the article is selected. In this work, P(s | X, S_t) is estimated by a neural network parameterized by θ. Recall that the oracle next sentence y_{t+1} is in C_t. Hence one approach to learn the parameters θ is to frame it as a binary classification problem where the label for sentence y_{t+1} is 1, and 0 for every other s in C_t. We implement this classifier using a feed-forward neural network which takes the encoded representations of X, S_t, and s, and outputs the probability of label 1, P(1 | s, X, S_t; θ), which we use as P(s | X, S_t). The loss for the classification at timestep t is the binary cross-entropy loss:

L_t = - Σ_{s ∈ C_t} [ z_s log P(1 | s, X, S_t; θ) + (1 - z_s) log(1 - P(1 | s, X, S_t; θ)) ]

where z_s is 1 if s = y_{t+1} and 0 otherwise.


One of the special features of NextSum is that we model the end of the summary within the same setup. To do so, we introduce a special sentence EOS (End of Summary) to mark the end of every gold-standard summary, i.e. y_{m+1} = EOS. In the model, EOS is included in the candidate set at every timestep. This inclusion allows the model to learn to discriminate between selecting a sentence from the source versus ending the summary by picking the marker. Thus our candidate set is in fact C_t ∪ {EOS}.
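A single prediction step then reduces to an argmax over the candidate set plus EOS, breaking ties in favour of the earlier sentence. A minimal sketch, in which `score` is a hypothetical stand-in for the trained network P(1 | s, X, S_t; θ):

```python
# One prediction step (Sec. 3.1): argmax over C_t ∪ {EOS}, ties broken
# in favour of the earlier source sentence. `score` is a stand-in for
# the trained scorer; the names here are our own, not the paper's code.

EOS = "<EOS>"  # special end-of-summary candidate

def select_next(candidates, score):
    """candidates: source sentences in article order, with EOS appended."""
    best, best_score = None, float("-inf")
    for cand in candidates:
        s = score(cand)
        if s > best_score:  # strict '>' keeps the earliest candidate on ties
            best, best_score = cand, s
    return best
```

With a toy scorer that prefers shorter sentences and disprefers EOS, the earliest of two equally scored candidates is returned, matching the tie-breaking rule above.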

3.2 Summary generation

After the next sentence prediction model is trained, it can be used to generate a complete summary for a source article. The model performs this task by iteratively predicting the next sentence until EOS is selected. Note that, unlike previous work, the generation component is not given the target length of the summary.

4 Implementing NextSum

In this section, we explain how we select the candidate set, what features we use in the neural network for next sentence prediction, and the design of the generation component.

4.1 Candidate selection

Some source articles are very long, which means that C_t can contain many candidate sentences if we take all of the unselected sentences as candidates. In practice, we limit the size of C_t in order to reduce the search space of the model, which improves running time.

In the single-document scenario, the source text sentences are in a natural discourse, and thus in a logical and temporal order. Hence, it is not unreasonable to assume that a good summary is a subsequence of the source. Given this assumption, suppose the last sentence chosen for the summary occupies position j in the source at timestep t; then we consider the k source sentences immediately following position j as the candidate set at time t+1.

During development, we found that when k = 10, the gold-standard next summary sentence is in the candidate set 90% of the time, and is present 80% of the time when using k = 5. Based on this empirical support for the subsequence hypothesis, we use k = 10 plus the end-of-summary marker for all the experiments in this paper, for a total candidate set size of 11. For comparison, a source article in our corpus has on average 33 sentences, and the maximum is as high as 500 sentences. During training, when fewer than 10 sentences remain, we randomly sample other sentences from the entire article to ensure having enough negative samples. The model is trained on a balanced dataset obtained by downsampling, and tested on the distribution where each candidate set has size 11.
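Candidate construction can be sketched as follows (the helper names are ours; padding by random sampling mirrors the negative sampling described above):

```python
# Sketch of candidate-set construction (Sec. 4.1): the k sentences
# following the last selected sentence, padded by random sampling when
# fewer than k remain, plus the EOS marker.
import random

EOS = "<EOS>"

def candidate_set(source, last_idx, k=10, rng=random):
    """source: list of sentences; last_idx: index of the last summary pick."""
    window = source[last_idx + 1 : last_idx + 1 + k]
    if len(window) < k:  # near the end of the article: pad with samples
        pool = [s for s in source if s not in window]
        window += rng.sample(pool, min(k - len(window), len(pool)))
    return window + [EOS]  # total size up to k + 1 = 11
```

For an average 33-sentence article, every candidate set has exactly 11 entries, whether or not the window runs past the end of the article.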

4.2 Features for next sentence prediction

We have a source document X with n sentences; S_t is the partial summary at time t; and let s be a sentence (or EOS) in the candidate set C_t. NextSum's next-sentence prediction relies on computing P(1 | s, X, S_t; θ) using a feedforward neural network with parameters θ. This network learns from rich feature-based representations of X, S_t, s, and their interactions.

Domain subtopics. These features are based on topics induced from a large collection of documents in the same domain as the source article.

These topics are obtained using the content-model approach of Barzilay and Lee (2004). The content model is a Hidden Markov Model (HMM), where the states correspond to topics, and transitions between them indicate how likely it is for one topic to follow another. The emission distribution from a state is a bigram language model indicating what lexical content is likely under that topic. Each sentence in the article is emitted by one state (i.e., one topic). The probability of an article X = (x_1, ..., x_n) under an HMM with K states is given by:

P(X) = Σ over state sequences (z_1, ..., z_n) of Π_{i=1..n} P(z_i | z_{i-1}) P(x_i | z_i)

Content models can be trained in an unsupervised fashion to maximize the log likelihood of the articles from the domain. We choose the number of topics on a development set (between 10 and 27 for the domains in our corpus).
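The article probability above can be computed with the standard HMM forward recursion. A minimal sketch, assuming the per-sentence emission probabilities P(x_i | topic) have already been computed by the topic language models:

```python
import numpy as np

# Forward-algorithm sketch for the article probability under the
# content-model HMM. Emission probabilities are assumed precomputed;
# the function and argument names are ours, not the paper's code.

def article_prob(start, trans, emit):
    """start: (K,) initial topic distribution; trans: (K, K) transition
    matrix; emit: (T, K) emission prob of each of T sentences per topic."""
    alpha = start * emit[0]              # initialize with first sentence
    for t in range(1, len(emit)):
        alpha = (alpha @ trans) * emit[t]  # propagate and re-weight
    return float(alpha.sum())            # marginalize over final topics
```

In a degenerate one-topic model emitting each sentence with probability 0.5, a two-sentence article has probability 0.25, as expected.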

Once trained, the model can compute the most likely state sequence for sentences in the source document, and in the partial summary, using Viterbi decoding. Based on the predicted topics, we compute a variety of features:

  • the proportion of source sentences assigned to each topic

  • the proportion of sentences in the partial summary assigned to each topic

  • the most likely topic of the candidate s, given by argmax_k P(s | topic k)

  • the emission probability of s from each topic, P(s | topic k)

  • the transition probability between the topic of the previous summary sentence and the topic of s

  • a global estimate of observing the candidate, Σ_k P(topic k) P(s | topic k)
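Given Viterbi topic assignments for the source and partial summary, together with the content model's transition matrix and emission probabilities for the candidate, these features can be sketched as follows (the function and argument names are our own, hypothetical):

```python
import numpy as np

# Sketch of the topic-based features above, assuming Viterbi topic
# assignments and HMM parameters are already available; the content
# model itself (Barzilay and Lee, 2004) is trained separately.

def topic_features(src_topics, sum_topics, cand_emissions, trans, prior):
    """src_topics / sum_topics: topic ids per sentence (lists of int);
    cand_emissions: (K,) P(candidate | topic) for each of K topics;
    trans: (K, K) topic transition matrix; prior: (K,) P(topic)."""
    K = len(cand_emissions)
    src_prop = np.bincount(src_topics, minlength=K) / max(len(src_topics), 1)
    sum_prop = np.bincount(sum_topics, minlength=K) / max(len(sum_topics), 1)
    cand_topic = int(np.argmax(cand_emissions))         # most likely topic
    prev_topic = sum_topics[-1] if sum_topics else 0
    trans_prob = trans[prev_topic, cand_topic]          # topic transition
    global_prob = float(np.dot(prior, cand_emissions))  # Σ_k P(k) P(s|k)
    return np.concatenate([src_prop, sum_prop, cand_emissions,
                           [cand_topic, trans_prob, global_prob]])
```

The returned vector concatenates the per-topic proportions and emissions with the three scalar features, giving 3K + 3 values for a K-topic model.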

Content. We compute an encoding of the source, the summary so far, and the candidate sentence by averaging the pretrained word2vec embeddings Mikolov et al. (2013) (trained on Google News Corpus) of each word in the span (900 features in total, 300 each for the source, summary so far, and candidate). We also add features for the 1,000 most frequent words in the training articles, in order to encode their presence in the candidate s, and in the sentence preceding s in the source article. Similarly, for s and S_t, we record the presence of each part-of-speech tag and named entity. We expect these features to be useful for predicting EOS, since the last sentence in a summary might contain some lexical cues.

Redundancy. These features calculate the degree to which the candidate sentence overlaps with the summary so far. They include the similarity sim(s, ·) between the candidate and three spans of the summary so far (3 features), where sim is computed using cosine similarity between count-vector representations of the words. We also include the number of overlapping nouns and verbs between s and S_t (2 features).
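The two ingredients of these features, count-vector cosine similarity and POS-based noun/verb overlap, might be sketched as follows (helper names are ours; POS tags are assumed to come from the CoreNLP preprocessing described in Section 5):

```python
from collections import Counter
import math

# Sketch of the redundancy features: cosine similarity over word-count
# vectors, plus noun/verb overlap between candidate and summary so far.

def cosine(a_words, b_words):
    a, b = Counter(a_words), Counter(b_words)
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def overlap(cand_tagged, summary_tagged, pos_prefix):
    """Count shared words whose POS tag starts with pos_prefix
    ('NN' for nouns, 'VB' for verbs in Penn Treebank tags)."""
    cand = {w for w, p in cand_tagged if p.startswith(pos_prefix)}
    summ = {w for w, p in summary_tagged if p.startswith(pos_prefix)}
    return len(cand & summ)
```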

Position. The position of a sentence in the source document is an important indicator for content selection and is widely used in systems. We indicate the position in the source of the last generated summary sentence (as one of 5 bins, where the size of each bin depends on the length of the source article). We also indicate the position of the candidate sentence, and its distance in the source to the last generated summary sentence (normalized by the length of the source).

Length. We include features for the length of the source, both as number of sentences, and number of words (binned into 5 bins). We also include the number of sentences and words in the summary so far. The length measures for the partial summary are not binned.

Coverage. These features compute how much of the source will be covered by the summary when a candidate sentence is added to it. We use the KL divergence between the source and the candidate summary with s included: KL(P_X || P_{S_t + s}), where P_X and P_{S_t + s} are unigram language models of the source and of the extended summary.

Sentence importance. We also indicate the individual importance of a candidate sentence. The frequency of a word in the source is known to be a strong feature for importance Nenkova and Vanderwende (2005). With this intuition, we include the score Σ_w log P_X(w), where w ranges over the tokens in the candidate sentence, and P_X(w) is the unigram probability of w in the source X.
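The coverage and importance scores above can be sketched as follows, using add-one smoothed unigram models (the smoothing choice is our assumption; the paper does not specify one):

```python
from collections import Counter
import math

# Sketch of the coverage (KL divergence) and importance (sum of log
# source-unigram probabilities) scores, with add-one smoothing so that
# log and ratio terms are always defined.

def unigram(tokens, vocab):
    counts = Counter(tokens)
    total = len(tokens) + len(vocab)  # add-one smoothing
    return {w: (counts[w] + 1) / total for w in vocab}

def kl_coverage(source_toks, summary_plus_cand_toks):
    """KL(P_X || P_{S_t + s}): lower means better coverage of the source."""
    vocab = set(source_toks) | set(summary_plus_cand_toks)
    p = unigram(source_toks, vocab)
    q = unigram(summary_plus_cand_toks, vocab)
    return sum(p[w] * math.log(p[w] / q[w]) for w in vocab)

def importance(cand_toks, source_toks):
    """Sum of log source-unigram probabilities of candidate tokens."""
    vocab = set(source_toks) | set(cand_toks)
    p = unigram(source_toks, vocab)
    return sum(math.log(p[w]) for w in cand_toks)
```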

We also use a separate pre-trained model of word importance. This model feeds the context of a target word (the two words before and two words after) into an LSTM model which outputs the probability of the target word appearing in a summary. The importance score of a sentence is then the average and maximum of the predicted scores of each word in the sentence. This model is trained on the same training and development data sets.

4.3 Summary generation

To generate the full summary, the model employs a greedy method that simply calls the next-sentence prediction module repeatedly until EOS is selected. We also tried beam search decoding for a more globally optimal sequence of sentences, but we found in preliminary experiments that this search did not improve our results.
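The greedy loop can be sketched as follows; `predict_next` stands in for the trained next-sentence model and `make_candidates` for the candidate windowing of Section 4.1 (both names are hypothetical):

```python
# Greedy generation (Sec. 4.3): call the next-sentence predictor until
# it selects EOS. No length constraint or redundancy check is imposed.

EOS = "<EOS>"

def generate(source, predict_next, make_candidates):
    summary = []
    last_idx = -1  # no sentence selected yet
    while True:
        cands = make_candidates(source, last_idx)
        choice = predict_next(source, summary, cands)
        if choice == EOS:
            return summary          # model decided the summary is complete
        summary.append(source[choice])
        last_idx = choice           # window moves past the chosen sentence
```

Because the loop terminates only when the model picks EOS, the summary length is determined by the model itself rather than by an external budget.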

5 Data

We hypothesize that next-sentence prediction is more likely to be successful in event-oriented domains (describing events as opposed to explanations and opinions). Moreover, summary-specific moves may be more prominent and learnable from summary-article pairs within specific domains compared to a general corpus.

So we create three domain-specific datasets and one domain-general dataset, all focusing on events. We use the New York Times Annotated Corpus (NYTimes) Sandhaus (2008) since it provides topic metadata, has thousands of article-summary pairs on different topics, and its summaries are not written to set lengths. We selected three topics: “War Crimes and Criminals” (crime), “Assassinations and Attempted Assassinations” (assassin.), and “Bombs and Explosives” (bombs). We also create a more general dataset (mixed) by randomly sampling from all three domains.

We sample a similar number of articles across each domain, and randomly split each domain into 80% training, 10% development and 10% test data. Table 2 shows the sizes of these datasets.

We use the Stanford CoreNLP toolkit Manning et al. (2014) to tokenize, segment sentences, and assign part of speech tags to all the texts.

5.1 Length of articles and summaries

As previously mentioned, summaries are often written to express the summary-worthy content of an article, and not restricted to an arbitrary length. This property can be seen in our data (Table 3).

The NYTimes summaries are abstractive in nature and range from a minimum of 2 words (sometimes just the caption to a photo; not very common) to as many as 278 words. The last column of the table gives the Kendall Tau correlation (corrected for ties) between the length of the source and the summary. There is a significant positive correlation, implying that the length of the article is indicative of its information content. This finding motivates us to include the length of the source article as a feature for next sentence prediction, though we note that the source length by itself is not enough to determine the summary length without further analysis of the source content.

Domain Train. Dev. Test
crime 986 123 123
assassin. 1,087 136 136
bombs 1,440 180 180
mixed 1,600 200 200
Table 2: Number of article-summary pairs in our data.
Domain Source Summary Tau
min max avg min max avg
crime 4 8,300 648 2 236 51 0.548
assassin. 3 6,081 705 3 226 60 0.481
bombs 48 7,808 874 15 278 82 0.343
mixed 3 7,819 815 3 278 81 0.358
Table 3: Min, max and average lengths (in words) of source articles and abstracts. Tau is the Kendall Tau correlation between the length of source and abstract.

5.2 Obtaining extractive summaries

The summaries from NYTimes are abstractive in nature. Our system is extractive, and for training the next sentence selection from the source, we need a mapping between the abstractive summary and the sentences in the source article. Note that we create these extractive summaries only for training our model. We will evaluate NextSum’s output by comparing with the abstractive human summaries as is standard practice.

We map each sentence in the abstract to the most similar sentence in the source article. Let A = (a_1, ..., a_m) be the sequence of sentences in the abstract. For each a_i, we find y_i = argmax_{x ∈ X} sim(a_i, x), where X is the set of source sentences, and sim(a_i, x) is the cosine similarity between the word unigrams of a_i and x.

The sequence (y_1, ..., y_m) corresponding to (a_1, ..., a_m) forms the gold-standard extractive summary. Since the extractive summary mirrors the sequence of content in the abstract, the structure of the summary is preserved, allowing our next-sentence prediction system to be trained on the extractive sequence of sentences. It is also for this reason that we do not use summarization datasets such as the CNN/Daily Mail corpus Hermann et al. (2015), where summaries are three-sentence highlights and do not have any discernible discourse structure as a whole.
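This mapping can be sketched directly with unigram cosine similarity; resolving ties to the earlier source sentence is our assumption, as the paper does not state a tie-breaking rule:

```python
from collections import Counter
import math

# Sketch of the abstract-to-source mapping (Sec. 5.2): each abstract
# sentence maps to its most similar source sentence under unigram
# cosine similarity, preserving the abstract's sentence order.

def cosine(a, b):
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def map_abstract(abstract_sents, source_sents):
    """Return, for each abstract sentence, the index of its most
    similar source sentence (earliest wins on ties)."""
    return [max(range(len(source_sents)),
                key=lambda j: cosine(a, source_sents[j]))
            for a in abstract_sents]
```

The resulting index sequence, read in order, is the extractive training target described above.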

6 Experiments

We first evaluate our model intrinsically on the next-sentence prediction task, then test its performance on the full summary generation problem.

6.1 Next-sentence prediction

Here, the goal is to select the best sentence to follow the partial summary from a candidate set of 11 options (see Section 4.1). For evaluating this part of our system, we assume that we have oracle partial summaries; i.e., the partial summary at timestep t is the same as the gold summary sequence up to time t. The question is how well we can predict the next sentence in this sequence from the candidate set C_t. The correct answer is the sentence in the gold standard at position t+1. The prediction at each timestep is a separate classification example.

Recall that we framed the machine learning problem as one of binary classification. We thus present two sets of results: (a) on the binary task, and (b) on the final choice of one sentence from the candidate set (among the 11 candidates). In task (a), the binary evaluation, the model discriminates between the two classes by thresholding the output probability at 0.5. The best setting has 4 hidden layers, each layer comprising between 500 and 1,500 neurons. We trained the model by backpropagation using the Adam optimizer Kingma and Ba (2014) for up to 75 epochs. Hyperparameters were tuned on the development set. The choice of a final sentence, task (b), is made by picking the candidate with the highest predicted probability P(1 | s, X, S_t; θ).


Table 4 shows the accuracy on the binary classification task and the 1-of-11 task, across the different domains. In the 1-of-11 task, the expected chance-level accuracy is roughly 9.1%, since we force every candidate set to have size 11. Our next-sentence prediction system's accuracy is between 60 and 67% on the different domains, showing that there are distinctive clues about summary-internal structure and content which can be learned by a model. Note also that the accuracy numbers are consistent across all domains and the mixed case, indicating that the patterns are fairly domain-general within event-oriented documents.

These evaluations are somewhat idealistic in that the model has access to oracle partial summaries during prediction. We next evaluate NextSum on the full summarization task.

6.2 Summary generation

We developed two versions of our system. Previous methods of summary content selection assume a fixed length limit. To compare against these systems, in one version of our model, NextSum (fixed length), the length limit is provided as a constraint. If, after the model generates a summary sentence, the word count exceeds the given length, we stop generation and truncate the last sentence so the summary is within the length limit. The second version, NextSum (full), is the full model, which predicts the summary length itself. Both systems have no access to the oracle partial summary, and use their own previous decisions to construct the partial summary.

We evaluate all the summaries by comparing them with gold-standard abstracts using ROUGE Lin (2004). (The settings are ROUGE-1.5.5.pl -n 2 -x -m -2 4 -u -c 95 -r 1000 -f A -p 0.5 -t 0 -d.) We use ROUGE-2 F-score, as NextSum generates summaries of varied length.

Domain Binary 1 of 11
Accuracy (%) Accuracy (%)
Crime 74.5 67.2
Assassin. 71.7 63.8
Bombs 73.5 60.0
Mixed 73.1 60.9
Random 50.0 9.0
Table 4: Results on next sentence prediction task.

6.2.1 Baselines and comparison systems

In all these systems, the target length of the summary is given as a constraint. We set the length to the average length (in words) of summaries in the training dataset for each domain (Table 3).

Lead takes the opening words of the source article up to the target length. For single-document extractive summarization, the lead is a very strong baseline which many systems fail to beat Over and Liggett (2002).

CHMM is the approach used by Barzilay and Lee (2004) for extractive summarization using content models. CHMM computes an importance score for each topic. This score is a probability computed by: 1) counting the articles in the training set where the topic appears in both the article and its summary, and 2) normalizing by the number of articles containing the topic. To generate a summary, the model ranks the topics in order of decreasing importance, and adds one sentence from the source for each topic (breaking ties randomly if multiple sentences decode into the same topic). The generation stops upon reaching the length limit. This method scores the summary-worthy nature of sentences based solely on their topic.

Transition is an iterative greedy approach based on the transition probability of topics from the content model. At each timestep, it selects the candidate whose topic is most likely to follow the topic of the last selected sentence, until the length limit is reached. This baseline simulates a degenerate version of next-sentence prediction, where the choice is based on a single topic-level feature: the probability of transitioning from the topic of the last summary sentence to the topic of the candidate. Like our model, this baseline has no access to the oracle partial summary, and uses its previous decisions for next-sentence selection.

CHMM-T is also an iterative greedy approach, where the evaluation function is the product of the topic transition probability (Transition) and the topic importance (CHMM).

Apart from the above domain baselines, we also compare with two other types of summaries.

General is based on a recent competitive neural network based extractive system Cheng and Lapata (2016). This model is designed to be domain-general. We trained it on the DailyMail dataset Hermann et al. (2015), containing around 200K articles and their highlights, without using pretrained embeddings. Our systems are not directly comparable, because NextSum is trained on much less data, but we show this result to give an idea of the performance of recent methods.

Oracle is the gold-standard extractive summary created from abstracts using the mapping method from Section 5.2. It represents an upper bound on the performance of any extractive summary.

6.2.2 Results

Table 5 shows the ROUGE-2 F-score results for all the systems. The baselines, NextSum (fixed length), Oracle, and General produce fixed-length summaries.

Model ROUGE-2 F-scores
crime assassin. bombs mixed
Lead 0.240 0.210 0.250 0.232
CHMM 0.220 0.156 0.135 0.139
Transition 0.210 0.120 0.179 0.153
CHMM-T 0.210 0.120 0.176 0.153
Our models:
NextSum (fixed length) 0.278 0.227 0.240 0.234
NextSum (full) 0.281 0.241 0.250 0.241
Other comparisons:
General 0.281 0.201 0.237 0.225
Oracle 0.420 0.350 0.365 0.363
Table 5: ROUGE-2 F-scores for the generated summaries. The best results for each domain are bolded.

Among the baselines, we see that the simple lead summary, comprising the opening words of the source article, is the strongest, outperforming the domain-trained content-model systems in all the domains. The oracle results, however, show that there is still considerable scope for improvement in automatic sentence extraction. The oracle extractive summary (which was chosen to maximize similarity with the abstract) achieves close to double the ROUGE score of the lead baseline in the crime domain.

Both NextSum variants outperform the lead (with statistical significance) in all cases except the bombs domain. Importantly, NextSum (full), which does automatic length prediction, outperforms NextSum (fixed length), indicating that automatically tailoring summaries to different lengths is clearly of value. In the next section, we examine this length prediction ability in detail.

Comparing performance across domains, the source articles in the bombs domain are on average longer than those in the other domains (see Table 3), which could explain why content selection performance is lower there. This domain also has longer gold-standard summaries, and the correlation between the lengths of human abstracts and source articles is lowest in this domain.

The domain-general system of Cheng and Lapata (2016) is trained on a much larger general corpus of summary-article pairs. While our results are not directly comparable, we see that NextSum's performance is competitive with current methods, and since it is based on a new outlook and imposes no explicit constraints, it provides much scope for future improvement.

6.3 Performance of length prediction

NextSum requires neither redundancy removal nor length constraints. In this section, we show that our system produces summaries of varied lengths which correlate with the lengths of human-written summaries of the same source article.

Figure 1 shows the distribution of the length (in words) of NextSum summaries (all domains put together). The generated lengths vary greatly, and span the range covered by the summaries in the training data. The majority of lengths fall in the 30 to 50 word range. Hence NextSum tailors summary lengths over a wide range.

Next, we measure how well these summary lengths correlate with the lengths of the human-written abstracts. Table 6 shows the Kendall Tau correlation (corrected for ties) between length in words of the NextSum summary and the length of the abstract for the same source.

Figure 1: Distribution of lengths (in words) of summaries generated by NextSum.

NextSum’s summary lengths correlate fairly well with those of the abstracts, reaching statistically significant correlations in all domains as well as the mixed case. Again, length prediction is worse on the bombs domain than on the rest. Overall, this result shows promise that we can develop summarization systems which automatically tailor their content to the properties of the source.

Domain Tau
crime 0.46
assassin. 0.40
bombs 0.28
mixed 0.32
Table 6: Kendall Tau correlation between length (in words) of NextSum summaries and human abstracts.
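For reference, the tie-corrected Kendall Tau in Table 6 (the tau-b variant, as computed by e.g. scipy.stats.kendalltau) can be obtained directly from the paired length lists. A minimal self-contained implementation:

```python
from itertools import combinations
from math import sqrt

def kendall_tau_b(x, y):
    """Kendall's tau-b (corrected for ties) between two paired lists.
    Assumes not every pair is tied, so the denominator is nonzero."""
    concordant = discordant = ties_x = ties_y = 0
    for (xi, yi), (xj, yj) in combinations(zip(x, y), 2):
        dx, dy = xi - xj, yi - yj
        if dx == 0 and dy == 0:
            continue            # tied in both lists: ignored by tau-b
        elif dx == 0:
            ties_x += 1
        elif dy == 0:
            ties_y += 1
        elif dx * dy > 0:
            concordant += 1
        else:
            discordant += 1
    denom = sqrt((concordant + discordant + ties_x) *
                 (concordant + discordant + ties_y))
    return (concordant - discordant) / denom
```

Here `x` would hold the NextSum summary lengths and `y` the abstract lengths, one entry per source article.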

7 Conclusion

In this work, we have presented the first summarization system which integrates content selection, summary length prediction, and redundancy removal. Central to this system is the use of a next-sentence prediction system which learns summary-internal discourse transitions. We show that NextSum outperforms a number of baselines on ROUGE-2 F-scores even when the summary length is not provided to the system. Furthermore, the lengths of the predicted summaries correlate positively with the lengths of human-written abstracts, indicating that our method implicitly captures some aspect of how much summary-worthy content is present in the source article.

In future work, we plan to investigate whether this approach also leads to more coherent summaries. This issue will be especially important in the multi-document setting, which we would also like to investigate using an extension of our model.


  • Barzilay and Lee (2004) R. Barzilay and L. Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of NAACL-HLT. pages 113–120.
  • Cheng and Lapata (2016) J. Cheng and M. Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of ACL. pages 484–494.
  • Cheung and Penn (2013) J. Cheung and G. Penn. 2013. Probabilistic domain modelling with contextualized distributional semantic vectors. In Proceedings of ACL. pages 392–401.
  • Christensen et al. (2013) J. Christensen, Mausam, S. Soderland, and O. Etzioni. 2013. Towards coherent multi-document summarization. In Proceedings of NAACL: HLT. pages 1163–1173.
  • Erkan and Radev (2004) G. Erkan and D. Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research (JAIR).

  • Ghosh et al. (2016) S. Ghosh, O. Vinyals, B. Strope, S. Roy, T. Dean, and L. Heck. 2016. Contextual LSTM (CLSTM) models for large scale NLP tasks. In Proceedings of the KDD Workshop on Large-scale Deep Learning for Data Mining.

  • Gillick and Favre (2009) D. Gillick and B. Favre. 2009. A scalable global model for summarization. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing. pages 10–18.
  • Haghighi and Vanderwende (2009) A. Haghighi and L. Vanderwende. 2009. Exploring content models for multi-document summarization. In Proceedings of NAACL-HLT. pages 362–370.
  • Hermann et al. (2015) K. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of NIPS. pages 1693–1701.
  • Jafarpour et al. (2010) S. Jafarpour, C. J. Burges, and A. Ritter. 2010. Filter, rank, and transfer the knowledge: Learning to chat. Advances in Ranking 10.
  • Kaisser et al. (2008) M. Kaisser, M. A. Hearst, and J. B. Lowe. 2008. Improving search results quality by customizing summary lengths. In Proceedings of ACL-HLT. pages 701–709.
  • Kingma and Ba (2014) Diederik P. Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. CoRR abs/1412.6980. http://arxiv.org/abs/1412.6980.
  • Kiros et al. (2015) R. Kiros, Y. Zhu, R. Salakhutdinov, R. Zemel, A. Torralba, R. Urtasun, and S. Fidler. 2015. Skip-thought vectors. In Proceedings of NIPS. pages 3294–3302.
  • Kulesza and Taskar (2011) A. Kulesza and B. Taskar. 2011. Learning determinantal point processes. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence.
  • Lin (2004) C. Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of Text Summarization Branches Out Workshop, ACL. pages 74–81.
  • Lin and Bilmes (2011) H. Lin and J. Bilmes. 2011. A class of submodular functions for document summarization. In Proceedings of ACL: HLT. pages 510–520.
  • Louis et al. (2010) A. Louis, A. Joshi, and A. Nenkova. 2010. Discourse indicators for content selection in summarization. In Proceedings of SIGDIAL. pages 147–156.
  • Lowe et al. (2016) R. Lowe, I. Serban, M. Noseworthy, L. Charlin, and J. Pineau. 2016. On the evaluation of dialogue systems with next utterance classification. In Proceedings of SIGDIAL. pages 264–269.
  • Manning et al. (2014) C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Proceedings of ACL (System Demonstrations). pages 55–60.
  • Marcu (1998) D. Marcu. 1998. To build text summaries of high quality, nuclearity is not sufficient. In Working Notes of the the AAAI-98 Spring Symposium on Intelligent Text Summarization. pages 1–8.
  • Mihalcea and Tarau (2004) R. Mihalcea and P. Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of EMNLP. pages 404–411.
  • Mikolov et al. (2013) T. Mikolov, W-T. Yih, and G. Zweig. 2013. Linguistic regularities in continuous space word representations. In Proceedings of HLT-NAACL. pages 746–751.
  • Nallapati et al. (2017) R. Nallapati, F. Zhai, and B. Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of AAAI.
  • Nallapati et al. (2016) R. Nallapati, B. Zhou, C. Nogueira dos Santos, Ç. Gülçehre, and B. Xiang. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of CoNLL. pages 280–290.
  • Nenkova and McKeown (2011) A. Nenkova and K. McKeown. 2011. Automatic summarization. Foundations and Trends® in Information Retrieval 5(2-3):103–233.
  • Nenkova and Vanderwende (2005) A. Nenkova and L. Vanderwende. 2005. The impact of frequency on summarization. Technical Report MSR-TR-2005-101, Microsoft research.
  • Nenkova et al. (2006) A. Nenkova, L. Vanderwende, and K. McKeown. 2006. A compositional context sensitive multi-document summarizer: exploring the factors that influence summarization. In Proceedings of SIGIR.
  • Over and Liggett (2002) P. Over and W. Liggett. 2002. Introduction to DUC: An intrinsic evaluation of generic news text summarization systems. Technical report, Document Understanding Conference.
  • Paulus et al. (2017) R. Paulus, C. Xiong, and R. Socher. 2017. A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304.
  • Pichotta and Mooney (2016) K. Pichotta and R. Mooney. 2016. Using sentence-level LSTM language models for script inference. In Proceedings of ACL. pages 279–289.
  • Rush et al. (2015) A. Rush, S. Chopra, and J. Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of EMNLP. pages 379–389.
  • Sandhaus (2008) E. Sandhaus. 2008. The New York Times Annotated Corpus. Corpus number LDC2008T19, Linguistic Data Consortium, Philadelphia.
  • See et al. (2017) A. See, P. Liu, and C. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of ACL. pages 1073–1083.
  • Wang et al. (2013) H. Wang, Z. Lu, H. Li, and E. Chen. 2013. A dataset for research on short-text conversations. In Proceedings of EMNLP. pages 935–945.