StructSum: Incorporating Latent and Explicit Sentence Dependencies for Single Document Summarization

Traditional pre-neural approaches to single-document summarization relied on modeling the intermediate structure of a document before generating the summary. In contrast, the current state-of-the-art neural summarization models do not preserve any intermediate structure, resorting to encoding the document as a sequence of tokens. The goal of this work is two-fold: to improve the quality of generated summaries and to learn interpretable document representations for summarization. To this end, we propose incorporating latent and explicit sentence dependencies into single-document summarization models. We use structure-aware encoders to induce latent sentence relations, and inject an explicit graph of coreferring mentions across sentences to incorporate explicit structure. On the CNN/DM dataset, our model outperforms standard baselines and provides intermediate latent structures for analysis. We present an extensive analysis of our summaries and show that modeling document structure reduces copying of long sequences and incorporates richer content from the source document, while maintaining comparable summary lengths and an increased degree of abstraction.




1 Introduction

Traditional approaches to abstractive summarization have relied on interpretable structured representations such as graph-based sentence centrality (Erkan and Radev, 2004), AMR parses (Liu et al., 2015), and discourse-based compression and anaphora constraints (Durrett et al., 2016). On the other hand, state-of-the-art neural approaches to single-document summarization encode the document as a sequence of tokens and compose them into a document representation (See et al., 2017; Li et al., 2018; Paulus et al., 2018; Tan et al., 2017; Gehrmann et al., 2018). Although effective, these systems learn to rely significantly on the layout bias of the source documents (Kryscinski et al., 2019) and do not lend themselves easily to interpretation via intermediate structures.

Recent work provides evidence that structured representations of text lead to better document representations (Bhatia et al., 2015; Ji and Smith, 2017). However, structured representations are under-explored in the neural summarization literature. Motivated by this, we propose a structure-aware end-to-end model (§2) for summarization. Our proposed model, StructSum, augments the existing pointer-generator network (See et al., 2017) with two novel components: (1) a latent-structure attention module that adapts structured representations (Kim et al., 2017; Liu and Lapata, 2017) to the summarization task, and (2) an explicit-structure attention module that incorporates a coreference graph. Together, the components model sentence-level dependencies in a document, generating rich structured representations. The goal of this work is to provide a framework that induces rich interpretable latent structures and injects external document structures, and that can be introduced into any document encoder model.

Encoders with induced latent structures have been shown to benefit several tasks, including document classification, natural language inference (Liu and Lapata, 2017; Cheng et al., 2016), and machine translation (Kim et al., 2017). Building on this motivation, our latent-structure attention module builds upon Liu and Lapata (2017) to model the dependencies between sentences in a document. It uses a variant of Kirchhoff's matrix-tree theorem (Tutte, 1984) to model such dependencies as non-projective tree structures (§2.2). The explicit-structure attention module is linguistically motivated and aims to incorporate sentence-level structures from externally annotated document structures. We incorporate a coreference-based sentence dependency graph, which is then combined with the output of the latent-structure attention module to produce a hybrid structure-aware sentence representation (§2.3).

We evaluate our model on the CNN/DM dataset (Hermann et al., 2015) and show in §4 that it outperforms strong baselines by up to 1.1 ROUGE-L. We find that the latent and explicit structures are complementary, both contributing to the final performance improvement. Our modules are also independent of the underlying encoder-decoder architecture, making them flexible enough to be incorporated into any advanced model. Our analysis quantitatively compares our generated summaries with those of the baselines and the references (§5). It reveals that structure-aware summarization reduces the bias of copying large sequences from the source, making the summaries inherently more abstractive: we generate 15% more novel n-grams than a competitive baseline. We also show qualitative examples of the learned interpretable sentence dependency structures, motivating further research into structure-aware modeling.

2 StructSum Model

Consider a source document $D$ consisting of $n$ sentences $\{s_1, \ldots, s_n\}$, where each sentence $s_i$ is composed of a sequence of words. Document summarization aims to map the source document to a target summary of $m$ words $\{y_1, \ldots, y_m\}$. A typical neural abstractive summarization system is an attentional sequence-to-sequence model that encodes the input sequence as a continuous sequence of tokens using a BiLSTM. The encoder produces a set of hidden representations $\{h_1, \ldots, h_N\}$. An LSTM decoder maps the previously generated token $y_{t-1}$ to a hidden state and computes a soft attention probability distribution over the encoder hidden states. A distribution over the vocabulary is computed at every timestep $t$ and the network is trained using the negative log-likelihood loss $\mathcal{L} = -\sum_{t=1}^{m} \log p(y_t \mid y_{1:t-1}, D)$. The pointer-generator network (See et al., 2017) augments the standard encoder-decoder architecture by linearly interpolating a pointer-based copy mechanism.

StructSum uses the pointer-generator network as the base model. Our encoder is a structured hierarchical encoder (Yang et al., 2016), which computes hidden representations of the sequence at both the token and sentence level. The model then uses the latent-structure and explicit-structure attention modules to augment the sentence representations with rich sentence dependency information, leveraging both learned latent structure and additional external structure from other NLP modules. The attended vectors are then passed to the decoder, which produces the output sequence for abstractive summarization. In the rest of this section, we describe our model architecture, shown in Figure 1, in detail.

2.1 Encoder

Figure 1: StructSum Model Architecture.

Our hierarchical encoder consists of a BiLSTM encoder over words, followed by a sentence-level BiLSTM encoder. The word encoder takes the sequence of words in a sentence as input and produces a contextual hidden representation $h_{ij}$ for each word $w_{ij}$, where $w_{ij}$ is the $j$-th word of the $i$-th sentence and $n_i$ is the number of words in sentence $s_i$. The word hidden representations are max-pooled at the sentence level and the result is passed to a BiLSTM sentence encoder, which produces a new hidden representation $u_i$ for each sentence $s_i$. The sentence hidden representations are then passed as inputs to the latent and explicit structure attention modules.
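The pooling step of this encoder can be sketched as follows. This is an illustrative NumPy fragment, not the paper's implementation: the function name is ours, and the word- and sentence-level BiLSTMs themselves are omitted.

```python
import numpy as np

def pool_sentence_inputs(word_hidden, sent_lens):
    """Max-pool word-level BiLSTM states into one vector per sentence.

    `word_hidden` is a (total_words, d) array of contextual word states;
    `sent_lens` lists the number of words in each sentence. The pooled
    vectors would then feed the sentence-level BiLSTM.
    """
    reps, start = [], 0
    for length in sent_lens:
        # one vector per sentence: element-wise max over its word states
        reps.append(word_hidden[start:start + length].max(axis=0))
        start += length
    return np.stack(reps)
```

Max-pooling keeps the sentence representation independent of sentence length while preserving the strongest feature activations from any word.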

2.2 Latent Structure (LS) Attention

We model the latent structure of a source document as a non-projective dependency tree and force a pair-wise attention module to automatically induce this tree. We denote the marginal probability of a dependency edge as $a_{ij} = p(z_{ij} = 1)$, where $z_{ij}$ is the latent variable representing the edge from sentence $s_i$ to sentence $s_j$. We parameterize the unnormalized pair-wise scores between sentences with a neural network and use Kirchhoff's matrix-tree theorem (Tutte, 1984) to compute the marginal probability of a dependency edge between any two sentences.

We decompose the representation of sentence $s_i$ into a semantic vector $g_i$ and a structure vector $d_i$ as $u_i = [g_i; d_i]$. Using the structure vectors $d_i$, we compute a score $f_{ij}$ between sentence pairs $(s_i, s_j)$ (where sentence $s_i$ is the parent node of sentence $s_j$) and a score $r_i$ for sentence $s_i$ being the root node:

$$f_{ij} = F_p(d_i)^\top W_a F_c(d_j), \qquad r_i = F_r(d_i)$$

where $F_p$, $F_c$ and $F_r$ are linear projection functions that build representations for the parent, child and root node respectively, and $W_a$ is the weight matrix of the bilinear transformation. Here, $f_{ij}$ is the edge weight between nodes $(s_i, s_j)$ in a weighted adjacency graph and is computed for all pairs of sentences. Using $f_{ij}$ and $r_i$, we compute normalized attention scores $a_{ij}$ and $a_i^r$ using a variant of Kirchhoff's matrix-tree theorem (Liu and Lapata, 2017; Tutte, 1984), where $a_{ij}$ is the marginal probability of a dependency edge between sentences $s_i$ and $s_j$, and $a_i^r$ is the probability of sentence $s_i$ being the root.

Using these probabilistic attention weights and the semantic vectors $g_i$, we compute the attended sentence representations as:

$$p_i = \sum_{j} a_{ji}\, g_j + a_i^r\, g_{\text{root}}, \qquad c_i = \sum_{j} a_{ij}\, g_j, \qquad \hat{u}_i = [p_i; c_i]$$

where $p_i$ is the context vector gathered from possible parents of sentence $s_i$, $c_i$ is the context vector gathered from possible children, and $g_{\text{root}}$ is a special embedding for the root node. Here, the updated sentence representation $\hat{u}_i$ incorporates the latent structural information.
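As a concrete illustration, the matrix-tree marginal computation can be sketched in NumPy as below. This follows the standard formulation of edge marginals for single-root non-projective trees that structured-attention models such as Liu and Lapata (2017) build on; the function and variable names are ours, and in the actual model the scores would come from the structure vectors rather than being passed in directly.

```python
import numpy as np

def matrix_tree_marginals(scores, root_scores):
    """Marginal edge probabilities of a single-root non-projective
    dependency tree via Kirchhoff's matrix-tree theorem.

    scores[i, j] is the unnormalized score for an edge from parent
    sentence i to child sentence j; root_scores[i] scores sentence i
    as the root of the tree.
    """
    n = scores.shape[0]
    A = np.exp(scores) * (1.0 - np.eye(n))  # edge weights, no self-loops
    r = np.exp(root_scores)

    # Laplacian: L[j, j] = total incoming weight of j, L[i, j] = -A[i, j]
    L = np.diag(A.sum(axis=0)) - A
    # Replace the first row with root weights to account for root choice
    L_bar = L.copy()
    L_bar[0, :] = r
    L_inv = np.linalg.inv(L_bar)

    d0 = np.zeros(n)
    d0[0] = 1.0  # indicator for the first index
    # P(edge i -> j), obtained from derivatives of log det(L_bar)
    P_edge = A * ((1.0 - d0[None, :]) * np.diag(L_inv)[None, :]
                  - (1.0 - d0[:, None]) * L_inv.T)
    # P(sentence j is the root)
    P_root = r * L_inv[:, 0]
    return P_edge, P_root
```

For every sentence, the probabilities of all possible parents plus the probability of being the root sum to one, which is what lets these marginals serve as attention weights over a soft dependency tree.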

2.3 Explicit Structure (ES) Attention

Durrett et al. (2016) showed that modeling coreference knowledge through anaphora constraints leads to improved clarity and grammaticality in summaries. Taking inspiration from this, we choose coreference links across sentences as our explicit structure. First, we use an off-the-shelf coreference parser to identify coreferring mentions. We then build a coreference-based sentence graph by adding a link $z_{ij}$ between sentences $(s_i, s_j)$ if they have any coreferring mentions between them. This representation is then converted into a weighted graph by incorporating a weight on the edge between two sentences that is proportional to the number of unique coreferring mentions between them. We normalize these edge weights for every sentence, effectively building a weighted adjacency matrix $K$, where $K_{ij}$ is given by:

$$K_{ij} = \frac{|m_i \cap m_j| + \epsilon}{\sum_{k \neq i} \left( |m_i \cap m_k| + \epsilon \right)}$$

where $m_i$ denotes the set of unique mentions in sentence $s_i$, $m_i \cap m_j$ denotes the set of coreferring mentions between the two sentences, $z_{ij}$ is the variable representing a link in the coreference sentence graph, and $\epsilon$ is a smoothing hyperparameter.
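A minimal sketch of this graph construction is shown below, assuming a coreference parser has already assigned cluster ids to each sentence's mentions. The normalization used here is one plausible reading of the description above, and all names are illustrative.

```python
def coref_adjacency(sentence_mentions, eps=0.1):
    """Weighted sentence adjacency matrix from coreference clusters.

    `sentence_mentions[i]` is the set of coreference-cluster ids whose
    mentions appear in sentence i. The edge weight between sentences i
    and j counts shared clusters; each row is smoothed with `eps`
    (playing the role of the smoothing hyperparameter) and normalized.
    """
    n = len(sentence_mentions)
    K = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                # number of coreference clusters shared by the two sentences
                K[i][j] = float(len(sentence_mentions[i] & sentence_mentions[j]))
    for i in range(n):
        total = sum(K[i]) + eps * (n - 1)
        K[i] = [(K[i][j] + (eps if i != j else 0.0)) / total for j in range(n)]
    return K
```

Each row then forms a distribution over the other sentences, so sentences sharing many coreferring mentions receive proportionally more attention.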

Model                                                  ROUGE-1  ROUGE-2  ROUGE-L
Pointer-Generator (See et al., 2017)                     36.44    15.66    33.42
Pointer-Generator + Coverage (See et al., 2017)          39.53    17.28    36.38
Graph Attention (Tan et al., 2017)                       38.1     13.9     34.0
Pointer-Generator + DiffMask (Gehrmann et al., 2018)     38.45    16.88    35.81
Pointer-Generator (Re-Implementation)                    35.55    15.29    32.05
Pointer-Generator + Coverage (Re-Implementation)         39.07    16.97    35.87
Latent-Structure (LS) Attention                          39.52    16.94    36.71
Explicit-Structure (ES) Attention                        39.63    16.98    36.72
LS + ES Attention                                        39.62    17.00    36.95
Table 1: Results of abstractive summarizers on the CNN/DM dataset. The top part shows abstractive summarization baselines. The bottom part shows our re-implementations of See et al. (2017) and results from StructSum.

Incorporating explicit structure

Given the contextual sentence representations $u_i$ and our explicit coreference-based weighted adjacency matrix $K$, we learn an explicit-structure-aware representation as follows:

$$e_i = \sum_{j} K_{ij}\, F_k(u_j), \qquad v_i = F_v([u_i; e_i])$$

where $F_k$ and $F_v$ are linear projections and $v_i$ is an updated sentence representation which incorporates explicit structural information.

Finally, to combine the two structural representations, we concatenate the latent and explicit sentence vectors as $\bar{u}_i = [\hat{u}_i; v_i]$ to form the encoder sentence representations of the source document. To provide every token representation with context of the entire document, we keep the same formulation as pointer-generator networks, where each token $w_{ij}$ is mapped to its hidden representation $h_{ij}$ using a BiLSTM. Each token representation is concatenated with the corresponding structure-aware sentence representation, $\tilde{h}_{ij} = [h_{ij}; \bar{u}_i]$, where $s_i$ is the sentence to which the word $w_{ij}$ belongs. The resulting structure-aware token representations can directly replace the previous token representations as input to the decoder.

3 Experiments


We evaluate our approach on the CNN/Daily Mail corpus (Hermann et al., 2015; Nallapati et al., 2016) and use the same preprocessing steps as See et al. (2017). The CNN/DM summaries have an average of 66 tokens and 4.9 sentences. Differing from See et al. (2017), we truncate source documents to 700 tokens instead of 400 in the training and validation sets, to model longer documents with more sentences.


We choose the following baselines based on their relatedness to the task and wide applicability:
See et al. (2017): We re-implement the base pointer-generator model and the additional coverage mechanism. This forms the base model of our implementation, so our addition of document-structure modeling can be directly compared to it.
Tan et al. (2017): This is a graph-based attention model that is closest in spirit to the method we present in this work. They use a graph attention module to learn attention between sentences, but their model cannot easily be used to induce interpretable document structures, since the attention scores are not constrained to form a structure. In addition to learning latent, interpretable structured attention between sentences, StructSum also introduces an explicit-structure component to inject external document structure.
Gehrmann et al. (2018): We compare with the DiffMask variant of this work, which introduces a separate content selector that tags words and phrases to be copied. The DiffMask variant is end-to-end like ours and hence is included in our baselines. (Note that the best results from Gehrmann et al. (2018) are better than DiffMask, yielding 41.22, 18.68, and 38.34 for R1, R2, and RL respectively; however, those results use inference-time hard masking rather than an end-to-end model. The same masking mechanism can similarly augment our approach.)

Our baselines exclude Reinforcement Learning (RL) based systems as they are not directly comparable, but our approach can easily be introduced into any encoder-decoder based RL system. Since we do not incorporate any pretraining, we do not compare with recent contextual-representation based models (Liu and Lapata, 2019).


Our encoder uses 256 hidden states for each direction of the one-layer BiLSTM, and 512 for the single-layer decoder. We use the Adagrad optimizer (Duchi et al., 2011) with a learning rate of 0.15 and an initial accumulator value of 0.1. We do not use dropout and use gradient clipping with a maximum norm of 2. We selected the best model using early stopping with the ROUGE score on the validation set as our criterion. We also used the coverage penalty during inference as in Gehrmann et al. (2018). For decoding, we use beam search with a beam width of 3; we did not observe significant improvements with larger beam widths.

4 Results

Table 1 shows the results of our work on the CNN/DM dataset. We use the standard ROUGE-1, ROUGE-2 and ROUGE-L (Lin, 2004) F1 metrics to evaluate all summarization outputs. We first observe that introducing the capability to learn latent structures already improves performance on ROUGE-L. This suggests that modeling dependencies between sentences helps the model compose longer sequences that better match the reference. We do not see a significant improvement in ROUGE-1 and ROUGE-2, hinting that we retrieve similar content words as the baseline but compose them into better contiguous sequences.

We observe similar results when using only explicit structures with the ES attention module. This shows that adding inductive bias in the form of coreference-based sentence graphs helps compose long sequences. These results are close to those of the model using just LS attention, demonstrating that LS attention induces latent dependencies good enough to substitute for external coreference knowledge.

Finally, our combined model which uses both Latent and Explicit structure performs the best with a strong improvement of 1.08 points in ROUGE-L over our base pointer-generator model and 0.6 points in ROUGE-1. It shows that the latent and explicit information are complementary and a model can jointly leverage them to produce better summaries.

Modeling structure and adding inductive biases also helps the model converge faster: the combined LS+ES Attention model took 126K training iterations, compared to 230K iterations for the plain pointer-generator network plus an additional 3K iterations for the coverage loss (See et al., 2017).

           Copy Len   Coverage
PG+Cov       16.61     12.1%
StructSum     9.13     24.0%
Reference     5.07     16.7%
Table 2: Analysis of copying and coverage of source sentences on the CNN/DM test set. Copy Len denotes the average length of copied sequences; Coverage denotes the fraction of source sentences copied from.

5 Analysis

We present below an analysis of summary quality compared to our base model, the pointer-generator network with coverage (See et al., 2017), and the reference.

5.1 Analysis of Copying

Despite being an abstractive model, the pointer-generator model tends to copy very long sequences of words, including whole sentences, from the source document (also observed by Gehrmann et al. (2018)). Table 2 shows a comparison of the average length (Copy Len) of contiguous copied sequences longer than 3 tokens. We observe that the pointer-generator baseline copies on average 16.61 contiguous tokens from the source, which shows the extractive nature of the model. This indicates that pointer networks, aimed at combining the advantages of abstractive and extractive methods by allowing the model to copy content from the input document, tend to skew towards copying, particularly on this dataset. A consequence is that the model fails to interrupt copying at a desirable sequence length.

In contrast, modeling document structure through StructSum reduces the length of copied sequences to 9.13 words on average, reducing the bias towards copying sentences in their entirety. This average is much closer to that of the reference (5.07 words), without sacrificing task performance. StructSum learns to stop when needed, copying only enough content to generate a coherent summary.
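The Copy Len statistic can be computed roughly as follows. This is a sketch that assumes whitespace tokenization and greedy longest-match spans, which may differ in detail from the analysis script actually used.

```python
def copied_spans(summary, source, min_len=4):
    """Lengths of maximal contiguous token spans in `summary` that also
    appear verbatim in `source`, keeping only spans longer than 3 tokens."""
    def occurs(span):
        k = len(span)
        return any(source[p:p + k] == span for p in range(len(source) - k + 1))

    spans, i = [], 0
    while i < len(summary):
        # greedily extend the longest span starting at i found in the source
        j = i
        while j < len(summary) and occurs(summary[i:j + 1]):
            j += 1
        if j - i >= min_len:
            spans.append(j - i)
        i = j if j > i else i + 1
    return spans
```

Averaging the returned span lengths over a test set gives a Copy Len-style number; lower values indicate less wholesale copying.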

Figure 2: Comparison of % Novel N-grams between StructSum, Pointer-Generator+Coverage and the Reference. Here, “sent” indicates full novel sentences.

5.2 Content Selection and Abstraction

A direct outcome of copying shorter sequences is being able to cover more content from the source document within given length constraints, and we observe that this leads to better summarization performance. In our analysis, we compute coverage as the fraction of source sentences from which sequences longer than 3 tokens are copied into the summary. Table 2 shows a comparison of the coverage of source sentences in the summary content. While the baseline pointer-generator model copies from only 12.1% of the source sentences, we copy content from 24.0% of them. Additionally, the average length of the summaries produced by StructSum remains mostly unchanged, at 66 words compared to 61 for the baseline model. This indicates that StructSum produces summaries that draw from a wider selection of sentences in the original article than the baseline models.

Figure 3: Coverage of source sentences in summary. Here the x-axis is the sentence position in the source article and y-axis shows the normalized count of sentences in that position copied to the summary.

Kikuchi et al. (2014) show that copying more diverse content in isolation does not necessarily lead to better summaries for extractive summarization. Our analysis suggests that this observation might not extend to abstractive summarization methods. The proportion of novel n-grams generated has been used in the literature to measure the degree of abstraction of summarization models (See et al., 2017). Figure 2 compares the percentage of novel n-grams in StructSum and the baseline model. Our model produces novel trigrams 21.0% of the time and copies whole sentences only 21.7% of the time. In comparison, the pointer-generator network produces only 6.1% novel trigrams and copies entire sentences 51.7% of the time. On average, StructSum generates 14.7% more novel n-grams than the pointer-generator baseline.
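The novel n-gram measure can be sketched as below; this is an illustrative reimplementation of the metric popularized by See et al. (2017), not the paper's exact script.

```python
def novel_ngram_pct(summary, source, n=3):
    """Percentage of summary n-grams that never occur in the source,
    a common proxy for the degree of abstraction of a summarizer."""
    grams = [tuple(summary[i:i + n]) for i in range(len(summary) - n + 1)]
    if not grams:
        return 0.0
    src = {tuple(source[i:i + n]) for i in range(len(source) - n + 1)}
    # count summary n-grams absent from the source
    return 100.0 * sum(g not in src for g in grams) / len(grams)
```

A purely extractive output scores 0%, while a fully paraphrased one approaches 100%.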

Depth StructSum
2 29.3%
3 53.7%
4 14.4%
5+ 2.6%
Table 3: Distribution of latent tree depth.

5.3 Layout Bias

Neural abstractive summarization methods applied to news articles are typically biased towards selecting and generating summaries based on the first few sentences of the articles. This stems from the structure of news articles, which present the salient information of the article in the first few sentences and expand on it in subsequent ones. As a result, the LEAD-3 baseline, which selects the top three sentences of an article, is widely used in the literature as a strong baseline for evaluating summarization models in the news domain (Narayan et al., 2018). Kryscinski et al. (2019) observed that current summarization models learn to exploit the layout biases of current datasets and offer limited diversity in their outputs.

To analyze whether StructSum exhibits the same layout bias, we compute a distribution over the source sentence indices from which content is copied (considering copied sequences of length 3 or more). Figure 3 compares the coverage of sentence positions. The reference summaries copy a high proportion of the top 5 sentences of each article, but also have a smoother tail-end distribution, with relevant sentences at all positions being copied. This suggests that a smooth distribution over all sentences is a desirable feature. We notice that the sequence-to-sequence and pointer-generator frameworks (with and without coverage) have a stronger bias towards the beginning of the article, with a high concentration of copied sentences among the top 5. In contrast, StructSum improves coverage slightly, with a lower concentration on the top 5 sentences, and copies more tail-end sentences than the baselines. However, although modeling structure does help, a reasonable gap to the reference distribution remains. We see this as an area of improvement and a direction for future work.

            Coref   NER   Coref+NER
Precision    0.29   0.19     0.33
Recall       0.11   0.08     0.09
Table 4: Precision and recall of shared edges between the latent and explicit structures.

5.4 Document Structures

Figure 4: Examples of induced structures and generated summaries.

Similar to Liu and Lapata (2017), we also examine the quality of the intermediate structures learned by the model. We use the Chu-Liu-Edmonds algorithm (Chu and Liu, 1965; Edmonds, 1967) to extract the maximum spanning tree from the attention score matrix as our sentence structure. Table 3 shows the frequency of various tree depths. We find that the average tree depth is 2.9 and the average proportion of leaf nodes is 88%, consistent with results on tree induction in document classification (Ferracane et al., 2019). Further, we compare the latent trees extracted from StructSum with undirected graphs based on coreference and NER. These are constructed analogously to our explicit coreference-based sentence graphs in §2.3, by linking sentences with overlapping coreference mentions or named entities. We measure the similarity between the learned latent trees and the explicit graphs through precision and recall over edges, shown in Table 4. We observe that our latent graphs have low recall against the linguistic graphs, showing that they do not explicitly capture coreference or named-entity overlaps and suggesting that the latent and explicit structures capture complementary information.
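Given the extracted maximum spanning tree as a parent array, the depth and leaf statistics reported in Table 3 can be computed with a small helper like this (a sketch with illustrative names; the tree itself would first be obtained with the Chu-Liu-Edmonds algorithm):

```python
def tree_stats(parents):
    """Depth and leaf proportion of a dependency tree given as a parent
    array: `parents[i]` is the parent index of node i, and the root has
    parent -1."""
    n = len(parents)

    def depth(i):
        # walk up to the root, counting nodes on the path
        d = 1
        while parents[i] != -1:
            i = parents[i]
            d += 1
        return d

    max_depth = max(depth(i) for i in range(n))
    # a leaf is any node that never appears as someone's parent
    leaves = n - len({p for p in parents if p != -1})
    return max_depth, leaves / n
```

Aggregating these two numbers over the test set yields the depth distribution and the average leaf proportion discussed above.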

Figure 4 shows qualitative examples of our induced structures along with summaries generated by StructSum. The first example shows a tree with sentence 3 chosen as root, which was the key sentence mentioned in the reference. We notice that in both examples, sentences in the lower levels of the dependency tree contribute less to the generated summary; conversely, the source sentences used to generate the summary tend to be closer to the root node. In the first summary, all sentences from which content was drawn are either the root node or within depth 1 of it. Similarly, in the second example, 4 out of 5 source sentences were at depth 1 in the tree. In both examples, the generated summaries diverged from the reference by omitting certain sentences used in the reference; these sentences appear in the lower section of the tree, giving some insight into which sentences were preferred for summary generation. Further, in example 1, we notice that the latent structures cluster sentences by the main topics of the document: sentences 1-3 discuss a different topic from sentences 5-7, and our model clusters the two sets separately.

6 Related Work

Prior to neural models for summarization, document structure played a critical role in generating relevant, diverse and coherent summaries. Leskovec et al. (2004) formulated document summarization as constructing, from linguistic features, a semantic graph of the document and extracting a subgraph for the summary. Litvak and Last (2008) leverage language-independent syntactic graphs of the source document for unsupervised document summarization. Liu et al. (2015) parse the source text into a set of AMR graphs, transform them into summary graphs and then generate text from the summary graph. While such systems generate grammatical summaries and preserve linguistic quality (Durrett et al., 2016), they are often computationally demanding and do not generalize well (Kikuchi et al., 2014).

Data-driven neural models for summarization fall into extractive (Cheng et al., 2016; Zhang et al., 2018) and abstractive (Rush et al., 2015; See et al., 2017; Gehrmann et al., 2018; Chen and Bansal, 2018) approaches. See et al. (2017) proposed a pointer-generator framework that learns to either generate novel in-vocabulary words or copy words from the source. This model has been the foundation for much follow-up work on abstractive summarization (Gehrmann et al., 2018; Hsu et al., 2018; Song et al., 2018). Our model extends the pointer-generator model by incorporating latent-structure and explicit-structure knowledge, making our extension applicable to any of this follow-up work. Tan et al. (2017) present a graph-based attention system to improve the saliency of summaries. While their model learns attention between sentences, it does not induce interpretable intermediate structures.

A lot of recent work looks into incorporating structure into neural models. Song et al. (2018) infuse source-side syntactic structure into the copy mechanism of the pointer-generator model. They identify explicit word-level syntactic features based on dependency parses and part-of-speech tags, and augment the decoder copy mechanism to attend to them. In contrast, we model sentence-level dependency structures, in the form of latent induced structures and explicit coreference-based structures, and do not rely on heuristic or salient features beyond linking dependent sentences.

Li et al. (2018) propose structural compression and coverage regularizers that give neural models an objective for generating concise and informative content. They incorporate structural bias about the target summaries, whereas we choose to model the structure of the source document to produce rich document representations. Frermann and Klementiev (2019) induce latent document structure for aspect-based summarization. Cohan et al. (2018) present a long-document summarization model for scientific papers which attends to the discourse sections of a document, while Isonuma et al. (2019) propose an unsupervised model for review summarization which learns a latent discourse structure and uses it to summarize a review. Mithun and Kosseim (2011) use discourse structures to improve coherence in blog summarization. These are all directions complementary to our work. To our knowledge, we are the first to simultaneously incorporate latent and explicit document structure in a single framework for document summarization.

7 Conclusion and Future Work

To summarize, our contributions are three-fold. We propose a framework for incorporating latent and explicit document structure in neural abstractive summarization. We introduce a novel explicit-attention module which can incorporate external linguistic structures, and we show one such application where we use coreference to enhance summarization. We show quantitative improvements on the ROUGE metric over strong summarization baselines and demonstrate improvements in abstraction and coverage through extensive qualitative analysis.

StructSum demonstrates performance gains and higher-quality output summaries. One potential direction for future work is to study the role of latent structures in the interpretability of models. Another is to investigate whether structured representations allow better generalization for transfer learning and summarization in other domains with limited data.


  • P. Bhatia, Y. Ji, and J. Eisenstein (2015) Better document-level sentiment analysis from RST discourse parsing. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 2212–2218. External Links: Document Cited by: §1.
  • Y. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Melbourne, Australia, pp. 675–686. External Links: Document Cited by: §6.
  • J. Cheng, L. Dong, and M. Lapata (2016) Long short-term memory-networks for machine reading. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 551–561. External Links: Document Cited by: §1, §6.
  • Y. Chu and T. K. Liu (1965) On the shortest arborescence of a directed graph. Scientia Sinica. Cited by: §5.4.
  • A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian (2018) A discourse-aware attention model for abstractive summarization of long documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), New Orleans, Louisiana, pp. 615–621. External Links: Document Cited by: §6.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §3.
  • G. Durrett, T. Berg-Kirkpatrick, and D. Klein (2016) Learning-based single-document summarization with compression and anaphoricity constraints. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1998–2008. External Links: Document Cited by: §1, §2.3, §6.
  • J. Edmonds (1967) Optimum branchings. Journal of Research of the National Bureau of Standards B 71 (4), pp. 233–240. Cited by: §5.4.
  • G. Erkan and D. R. Radev (2004) LexRank: graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research 22, pp. 457–479. Cited by: §1.
  • E. Ferracane, G. Durrett, J. J. Li, and K. Erk (2019) Evaluating discourse in structured text representations. In ACL, Cited by: §5.4.
  • L. Frermann and A. Klementiev (2019) Inducing document structure for aspect-based summarization. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pp. 6263–6273. Cited by: §6.
  • S. Gehrmann, Y. Deng, and A. M. Rush (2018) Bottom-up abstractive summarization. In EMNLP, Cited by: §1, Table 1, §3, §3, §5.1, §6, footnote 4.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in neural information processing systems, pp. 1693–1701. Cited by: §1, §3.
  • W. T. Hsu, C. Lin, M. Lee, K. Min, J. Tang, and M. Sun (2018) A unified model for extractive and abstractive summarization using inconsistency loss. In ACL, Cited by: §6.
  • M. Isonuma, J. Mori, and I. Sakata (2019) Unsupervised neural single-document summarization of reviews via learning latent discourse structure and its ranking. In ACL, Cited by: §6.
  • Y. Ji and N. A. Smith (2017) Neural discourse structure for text categorization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 996–1005. External Links: Document Cited by: §1.
  • Y. Kikuchi, T. Hirao, H. Takamura, M. Okumura, and M. Nagata (2014) Single document summarization based on nested tree structure. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pp. 315–320. Cited by: §5.2, §6.
  • Y. Kim, C. Denton, L. Hoang, and A. M. Rush (2017) Structured attention networks. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Conference Track Proceedings, Cited by: §1, §1.
  • W. Kryscinski, N. S. Keskar, B. McCann, C. Xiong, and R. Socher (2019) Neural text summarization: a critical evaluation. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 540–551. External Links: Document Cited by: §1, §5.3.
  • J. Leskovec, M. Grobelnik, and N. Milic-Frayling (2004) Learning sub-structures of document semantic graphs for document summarization. In LinkKDD Workshop, pp. 133–138. Cited by: §6.
  • W. Li, X. Xiao, Y. Lyu, and Y. Wang (2018) Improving neural abstractive document summarization with explicit information selection modeling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1787–1796. External Links: Document Cited by: §1.
  • W. Li, X. Xiao, Y. Lyu, and Y. Wang (2018) Improving neural abstractive document summarization with structural regularization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4078–4087. Cited by: §6.
  • C. Lin (2004) ROUGE: a package for automatic evaluation of summaries. In Text Summarization Branches Out. Cited by: §4.
  • M. Litvak and M. Last (2008) Graph-based keyword extraction for single-document summarization. In Proceedings of the Workshop on Multi-source Multilingual Information Extraction and Summarization, pp. 17–24. Cited by: §6.
  • F. Liu, J. Flanigan, S. Thomson, N. Sadeh, and N. A. Smith (2015) Toward abstractive summarization using semantic representations. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, pp. 1077–1086. External Links: Document Cited by: §1, §6.
  • Y. Liu and M. Lapata (2017) Learning structured text representations. Transactions of the Association for Computational Linguistics 6, pp. 63–75. Cited by: §1, §1, §2.2, §5.4.
  • Y. Liu and M. Lapata (2019) Text summarization with pretrained encoders. In EMNLP-IJCNLP, abs/1908.08345. Cited by: §3.
  • S. Mithun and L. Kosseim (2011) Discourse structures to reduce discourse incoherence in blog summarization. In Proceedings of the International Conference Recent Advances in Natural Language Processing 2011, Hissar, Bulgaria, pp. 479–486. Cited by: §6.
  • R. Nallapati, B. Zhou, C. dos Santos, Ç. Gülçehre, and B. Xiang (2016) Abstractive text summarization using sequence-to-sequence RNNs and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Berlin, Germany, pp. 280–290. External Links: Document Cited by: §3.
  • S. Narayan, S. B. Cohen, and M. Lapata (2018) Don’t give me the details, just the summary! Topic-aware convolutional neural networks for extreme summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 1797–1807. External Links: Document Cited by: §5.3.
  • R. Paulus, C. Xiong, and R. Socher (2018) A deep reinforced model for abstractive summarization. In 6th International Conference on Learning Representations, ICLR 2018, Vancouver, BC, Canada, April 30 - May 3, 2018, Conference Track Proceedings, Cited by: §1.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015, pp. 379–389. Cited by: §6.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. External Links: Document Cited by: §1, §1, Table 1, §2, §3, §3, §4, §5.2, §5, §6.
  • K. Song, L. Zhao, and F. Liu (2018) Structure-infused copy mechanisms for abstractive summarization. In COLING, Cited by: §6, §6.
  • J. Tan, X. Wan, and J. Xiao (2017) Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 1171–1181. Cited by: §1, Table 1, §3, §6.
  • W. T. Tutte (1984) Graph theory. Vol. 21 of Encyclopedia of Mathematics and its Applications. Cited by: §1, §2.2.
  • Z. Yang, D. Yang, C. Dyer, X. He, A. J. Smola, and E. H. Hovy (2016) Hierarchical attention networks for document classification. In HLT-NAACL, Cited by: §2.
  • X. Zhang, M. Lapata, F. Wei, and M. Zhou (2018) Neural latent extractive document summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, pp. 779–784. External Links: Document Cited by: §6.