Discourse-Aware Neural Extractive Model for Text Summarization

by Jiacheng Xu, et al.

Recently BERT has been adopted in state-of-the-art text summarization models for document encoding. However, such BERT-based extractive models use the sentence as the minimal selection unit, which often results in redundant or uninformative phrases in the generated summaries. As BERT is pre-trained on sentence pairs, not documents, the long-range dependencies between sentences are not well captured. To address these issues, we present a graph-based discourse-aware neural summarization model - DiscoBert. By utilizing discourse segmentation to extract discourse units (instead of sentences) as candidates, DiscoBert provides a fine-grained granularity for extractive selection, which helps reduce redundancy in extracted summaries. Based on this, two discourse graphs are further proposed: (i) RST Graph based on RST discourse trees; and (ii) Coreference Graph based on coreference mentions in the document. DiscoBert first encodes the extracted discourse units with BERT, and then uses a graph convolutional network to capture the long-range dependencies among discourse units through the constructed graphs. Experimental results on two popular summarization datasets demonstrate that DiscoBert outperforms state-of-the-art methods by a significant margin.



1 Introduction

Neural networks have achieved great success in the task of text summarization [31]. There are two main lines of research: abstractive and extractive. While the abstractive paradigm [23, 25, 4, 26] focuses on generating a summary word-by-word after encoding the full document, the extractive approach [6, 1, 35, 21] directly selects sentences from the document to assemble into a summary. The abstractive approach is more flexible and generally produces less redundant summaries, while the extractive approach enjoys better factuality and efficiency [2].

Recently, some hybrid methods have been proposed to take advantage of both, by designing a two-stage pipeline to first select and then rewrite (or compress) candidate sentences [5, 11, 32, 30]. Compression or rewriting aims to discard uninformative phrases from the selected sentences, but most of such systems suffer from the disconnection of the two stages in the pipeline, which results in limited performance improvement.

Figure 1: Illustration of DiscoBert for text summarization. The sentence-based Bert model (baseline) selects whole sentences 1, 2 and 5. The proposed discourse-aware model DiscoBert selects EDUs {1-1, 2-1, 5-2, 20-1, 20-3, 22-1}, which avoids unnecessary details and generates a more concise summary. The right side of the figure shows the two discourse graphs we use: (i) Coreference Graph (with the mentions of ‘Pulitzer prizes’ annotated as an example); and (ii) RST Graph (induced by RST discourse trees).

Meanwhile, modeling long-range context for document summarization remains a challenging task. With the recent success of pre-trained language models (LMs) [7], the encoding of input documents has been greatly improved. However, since pre-trained LMs are mostly designed for sentence pairs or short paragraphs, they perform poorly at capturing long-range dependencies among sentences. Empirical observations [17] show that adding standard encoders such as LSTM or Transformer [28] on top of BERT to model inter-sentential relations does not bring much performance gain.

In this work, we present DiscoBert, a discourse-aware neural extractive summarization model built upon BERT. To perform compression simultaneously with extraction and reduce the redundancy across sentences, we take the Elementary Discourse Unit (EDU), a sub-sentence phrase unit originating from RST [18, 3], as the minimal selection unit (instead of the sentence) for extractive summarization. Figure 1 shows an example of discourse segmentation, with sentences broken down into EDUs (annotated with brackets). The Baseline Selection is produced by the sentence-based BERT model, and the EDU Selection is produced by our model. By discarding redundant details in the sentences, our model has room to include additional concepts or events, and therefore generates more concise and informative summaries.

Furthermore, to better capture document-level long-distance dependency, we also propose a graph-based approach to leverage intra-document discourse relations among EDUs. Two types of discourse graph are proposed: (i) a directed RST Graph, and (ii) an undirected Coreference Graph. The RST Graph is constructed from the RST parse tree over the EDUs of the document. Rhetorical relations of EDUs, such as contrast, elaboration, and attribution, are addressed in the RST Graph. On the other hand, the Coreference Graph connects the entities and their coreference clusters/mentions across the document. In Figure 1, we show part of the coreference mention cluster of ‘Pulitzer prize’. The path of coreference navigates the model from the core event to other occurrences of the event, and further explores its interactions with other concepts or events. After constructing the graphs, a Graph Convolutional Network (GCN) [15] is employed to capture the long-range interactions among EDUs.

The main contributions of this paper are summarized as follows. (i) We propose a discourse-aware extractive summarization model, DiscoBert, treating EDUs (instead of sentences) as the minimal selection unit to provide a fine-grained granularity for extractive selection, while preserving the grammaticality and fluency of generated summaries. (ii) We propose two discourse graphs, and use a graph-based approach to model the inter-sentential context based on discourse relations among EDUs. (iii) Experimental results show that DiscoBert achieves new state-of-the-art performance on two popular newswire text summarization datasets.

Figure 2: Example of discourse segmentation and RST tree conversion. The original sentence is segmented into 5 EDUs in the lower right box, and then parsed into an RST discourse tree in the left box. The converted dependency-based RST discourse tree is shown in the upper right box. Nucleus and Satellite nodes are denoted in solid lines and dashed lines, respectively. Non-terminal relation nodes with the start and the end of the span are shown on the right side. Relations are in italic. The second EDU [2] is the head of the whole tree (span [1-5]), while the third EDU [3] is the head of the span [3-5].

2 Discourse Graph Construction

We first introduce the concept of Rhetorical Structure Theory (RST) [18], a linguistic theory for discourse analysis, and then explain the methods to construct discourse graphs, which will be used in DiscoBert. Two discourse graphs are considered: RST Graph and Coreference Graph. Both graphs are initialized with all nodes disconnected. Connections are then added for a subset of nodes based on the RST discourse parse tree or the coreference mentions.

2.1 Discourse Analysis

Discourse analysis focuses on inter-sentential relations of text in a document or conversation. In RST, the discourse structure of text can be represented in a tree format. The whole document can be segmented into contiguous, adjacent and non-overlapping text spans called Elementary Discourse Units (EDUs). Each EDU is tagged as either Nucleus or Satellite, which characterizes its nuclearity or saliency. Nucleus nodes are generally more central, and Satellite nodes are more peripheral and less important in terms of their content and grammatical reliance. There are dependencies among the EDUs that represent their rhetorical relations. In this work, we treat the EDU as the minimal unit for content selection in text summarization. Figure 2 shows an example of discourse segmentation and the parse tree of a sentence. Among these EDUs, rhetorical relations represent the functions of different discourse units. For example, ‘elaboration’ shows that one EDU describes the detail of the other EDU. ‘Contrast’ means that two EDUs are rhetorically different or opposed to each other.

In text summarization, we expect a good model to select the most concise, relevant and central concept of the document with low redundancy. For example, in Figure 2, details such as the name of the suspected child in [3], the exact location of the photo in [5], and who was carrying the child in [4], are unlikely to be reflected in the final summary. However, in traditional extractive summarization methods, the model is required to select the whole sentence, even though some parts of the sentence are not necessary. In our approach, the model is allowed to select one or several fine-grained EDUs to make the generated summaries less redundant. This serves as the basis of our proposed DiscoBert model.

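Require: Coreference clusters $C = \{C_1, C_2, \dots, C_n\}$; mentions for each cluster $C_i = \{E_{i1}, \dots, E_{im}\}$.
  Initialize the graph $G_C$ without any edge: $G_C[*][*] = 0$.
  for $i = 1$ to $n$ do
     Collect the locations of all occurrences $\{E_{i1}, \dots, E_{im}\}$ into $L = \{l_1, \dots, l_m\}$.
     for $j = 1$ to $m$ do
         for $k = 1$ to $m$ do
             $G_C[l_j][l_k] = 1$
         end for
     end for
  end for
  return Constructed graph $G_C$.
Algorithm 1: Construction of the Coreference Graph $G_C$.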

2.2 RST Graph

As mentioned above, we adopt the EDU as the minimal selection unit for our summarization model. Here, we further utilize the discourse trees of the document to capture its rhetorical structure, and build a discourse-aware model for text summarization.

When selecting sentences as candidates for extractive summarization, we assume each sentence is grammatically self-contained. But for EDUs, some restrictions need to be considered to ensure grammaticality. For example, Figure 2 illustrates an RST discourse parse tree of a sentence, where [2] This iconic … series is a grammatical sentence but [3] and shows … 8 is not by itself. We need to understand the dependencies between EDUs to ensure the grammaticality of the selected combinations. There are two steps to learn the derivation of dependencies: head inheritance and tree conversion.

Head inheritance defines the head node for each valid non-terminal tree node. For each leaf node, the head is itself. We determine the head node(s) of non-terminal nodes based on their nuclearity. (Footnote 1: If both children are N(ucleus), then the head of the current node inherits the head of the left child. Otherwise, when one child is N and the other is S, the head of the current node inherits the head of the N child.) For example, in Figure 2, the heads of text spans [1-5], [2-5], [3-5] and [4-5] need to be grounded to a single EDU.

In this way, we propose a simple yet effective schema to convert the RST discourse tree to a dependency-based discourse tree. (Footnote 2: The proposed schema is summarized as follows: If one child node is N and the other is S, the head of the S node depends on the head of the N node. If both children are N and the right child does not contain a subject in the discourse, the head of the right N node depends on the head of the left N node.) We always consider dependency restrictions, such as the reliance of a Satellite on its Nucleus, both when we create oracle labels during pre-processing and when the model makes predictions. For the example shown in Figure 2, if the model selects [5] being carried … Liberia., we will force the model to also select [3] and shows … 8, and [2] This … series,. The example contains four dependencies in total, with [5] depending on [3] and [3] depending on [2] among them.

The construction of the RST Graph aims to provide not only local paragraph-level but also long-range document-level connections among the EDUs. We use the converted dependency version of the tree to build the RST Graph $G_R$, by initializing an empty graph and treating every discourse dependency from the $i$-th EDU to the $j$-th EDU as a directed edge, i.e., $G_R[i][j] = 1$.
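The head-inheritance rule, the tree-to-dependency conversion, and the resulting directed graph can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class and function names are hypothetical, EDU indices are 0-based, and in the Nucleus-Nucleus case the right head is always attached to the left head (the subject check from the schema is omitted).

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Binary RST tree node (hypothetical structure, for illustration only)."""
    nuclearity: str                  # "N" or "S": role of this node under its parent
    edu: Optional[int] = None        # EDU index for a leaf, None for an internal node
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def head(node: Node) -> int:
    """Head inheritance: a leaf heads itself; an internal node inherits the head
    of its Nucleus child, or of the left child when both children are Nuclei."""
    if node.edu is not None:
        return node.edu
    if node.left.nuclearity == "N":
        return head(node.left)
    return head(node.right)


def dependencies(node: Node) -> List[Tuple[int, int]]:
    """Collect (dependent, head) EDU pairs for the subtree rooted at `node`."""
    if node.edu is not None:
        return []
    deps = dependencies(node.left) + dependencies(node.right)
    left_head, right_head = head(node.left), head(node.right)
    if node.left.nuclearity == "N":
        # N-S, or N-N (simplification: no subject check): right depends on left.
        deps.append((right_head, left_head))
    else:
        # S-N: the Satellite on the left depends on the Nucleus on the right.
        deps.append((left_head, right_head))
    return deps


def rst_graph(n_edus: int, deps: List[Tuple[int, int]]) -> List[List[int]]:
    """Directed adjacency matrix: graph[i][j] = 1 if EDU i depends on EDU j."""
    graph = [[0] * n_edus for _ in range(n_edus)]
    for dependent, governor in deps:
        graph[dependent][governor] = 1
    return graph


# Toy tree: EDU 0 is the Nucleus; its Satellite is a span made of EDU 1 (N) and EDU 2 (S).
toy_tree = Node("N", left=Node("N", edu=0),
                right=Node("S", left=Node("N", edu=1), right=Node("S", edu=2)))
toy_deps = dependencies(toy_tree)
toy_graph = rst_graph(3, toy_deps)
```

In the toy tree, EDU 2 depends on EDU 1, which in turn depends on EDU 0, so selecting EDU 2 would require selecting the other two as well.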

Figure 3: Example of the adjacency matrices of the Coreference Graph $G_C$ and the RST Graph $G_R$.

2.3 Coreference Graph

Text summarization, especially news summarization, usually suffers from the well-known ‘position bias’ [14], where most of the crucial information is described at the very beginning of the document. However, there is still a decent amount of information spread in the middle or at the end of the document, which is often ignored by summarization models. We find that around 25% of oracle sentences appear after the first 10 sentences in the CNNDM dataset. Besides, in long news articles, there are often multiple core characters and events throughout the whole document. However, existing neural models are not good at modeling such long-range context, especially when there are multiple ambiguous coreferences to resolve.

Figure 4: (Left) Model architecture of DiscoBert. The Stacked Discourse Graph Encoders contain stacked DGE blocks. (Right) The architecture of each Discourse Graph Encoder (DGE) block.

To encourage and guide the model to capture the long-range context in the document, we propose a Coreference Graph built upon discourse units. Algorithm 1 describes how to construct the Coreference Graph. We first use Stanford CoreNLP [19] to detect all the coreference clusters in an article. For each coreference cluster, all the discourse units containing the mention of the same cluster will be connected. This process is iterated over all the coreference mention clusters to generate the final Coreference Graph.

Figure 1 provides an example, where ‘Pulitzer prizes’ is an important entity and occurs multiple times in multiple discourse units. The constructed Coreference Graph is shown on the right side of the document. We intentionally ignore other entities and mentions in this example for simplicity. When the graph is constructed, EDUs 1-1, 2-1, 20-1 and 22-1 are all connected to one another because they mention ‘Pulitzer prizes’. Figure 3 shows an example of the two constructed graphs. $G_C$ is symmetric, and a self-loop is added to every node to prevent the graph from being too sparse.
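The construction procedure can be sketched in a few lines. This is a minimal sketch rather than the authors' code: it assumes the coreference clusters have already been mapped to the indices of the EDUs that contain their mentions, and the function name `build_coref_graph` is hypothetical.

```python
from itertools import product
from typing import List


def build_coref_graph(n_edus: int, clusters: List[List[int]]) -> List[List[int]]:
    """Build a symmetric adjacency matrix over EDUs.

    `clusters[i]` lists the (0-based) indices of the EDUs that contain a
    mention belonging to coreference cluster i (assumed to be precomputed).
    """
    graph = [[0] * n_edus for _ in range(n_edus)]
    # Self-loops keep the graph from being too sparse.
    for i in range(n_edus):
        graph[i][i] = 1
    # Connect every pair of EDUs that mention the same cluster.
    for locations in clusters:
        for j, k in product(locations, repeat=2):
            graph[j][k] = 1
    return graph


# One cluster whose mentions fall in EDUs 0, 1 and 3 of a five-EDU document.
example_graph = build_coref_graph(5, [[0, 1, 3]])
```

Because every ordered pair within a cluster is visited, the resulting matrix is symmetric by construction, matching the undirected nature of the Coreference Graph.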

3 DiscoBERT Model

In this section, we present DiscoBert, a BERT-based extractive summarization model, which takes EDUs as the minimal selection unit for redundancy reduction and uses discourse graphs to capture long-range dependencies between EDUs.

3.1 Model Overview

Figure 4 provides an overview of the proposed model, consisting of a Document Encoder, and a Graph Encoder. For the Document Encoder, a pre-trained BERT model is first used to encode the whole document on token level, and then a self-attentive span extractor is designed to obtain the EDU representations from the corresponding text spans. The Graph Encoder takes the output of the Document Encoder as input and updates the EDU representations with Graph Convolutional Network based on the constructed discourse graphs, which are then used to predict the oracle labels.

Assume that document $D$ is segmented into $n$ EDUs in total, i.e., $D = \{d_1, d_2, \dots, d_n\}$, where $d_i$ denotes the $i$-th EDU. Following Liu [17], we formulate extractive summarization as a sequential labelling task, where each EDU is scored by the neural networks, and decisions are made based on the scores of all EDUs. The oracle labels are a sequence of binary labels, where 1 stands for being selected and 0 for not. We denote the labels as $Y = \{y_1^*, y_2^*, \dots, y_n^*\}$. During training, we aim to predict the sequence of labels $Y$ given the document $D$. During inference, we need to further consider discourse dependency to ensure the coherence and grammaticality of the output summary.

3.2 Document Encoder

BERT is a pre-trained deep bidirectional Transformer encoder [28, 7]. Following Liu [17], we encode the whole document with BERT, and fine-tune the BERT model for summarization.

BERT is originally trained to encode a single sentence or sentence pair. However, a news article typically contains more than 500 words, hence we need to make some adaptation to apply BERT for document encoding. Specifically, we insert [CLS] and [SEP] tokens at the beginning and the end of each sentence, respectively. (Footnote 3: We also tried inserting [CLS] and [SEP] at the beginning and the end of every EDU, and treating the corresponding [CLS] representation as the representation for each EDU, but the performance drops drastically.) In order to encode long documents such as news articles, we also extend the maximum sequence length that BERT can take from 512 to 768 in all our experiments.

The input document after tokenization is denoted as $D = \{d_1, \dots, d_n\}$, and $d_i = \{w_{i1}, \dots, w_{i\ell_i}\}$, where $\ell_i$ is the number of BPE tokens in the $i$-th EDU. If $d_i$ is the first EDU in a sentence, there is also a [CLS] token prepended to $d_i$; if $d_j$ is the last EDU in a sentence, there is a [SEP] token appended to $d_j$ (see Figure 4 for illustration). These two tokens are not shown in the equations for simplicity. The BERT model is then used to encode the document as:

$$\{h_{11}, \dots, h_{n\ell_n}\} = \mathrm{BERT}(\{w_{11}, \dots, w_{n\ell_n}\}),$$

where $\{h_{11}, \dots, h_{n\ell_n}\}$ is the BERT output of the whole document in the same length as the input sequence.
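The input layout described above can be illustrated with a short sketch that flattens a document into one token sequence and records where each EDU sits. The function name `build_input`, the half-open span convention, and the choice to exclude the [CLS] and [SEP] positions from the EDU spans are assumptions made for illustration; the tokens are assumed to be BPE pieces produced beforehand.

```python
from typing import List, Tuple


def build_input(sentences: List[List[List[str]]],
                max_len: int = 768) -> Tuple[List[str], List[Tuple[int, int]]]:
    """Flatten a document into one token sequence.

    `sentences` is a list of sentences, each a list of EDUs, each a list of
    tokens. Returns the token sequence and one half-open (start, end) span
    per EDU that survives truncation.
    """
    tokens: List[str] = []
    spans: List[Tuple[int, int]] = []
    for sentence in sentences:
        tokens.append("[CLS]")              # marks the start of the sentence
        for edu in sentence:
            start = len(tokens)
            tokens.extend(edu)
            spans.append((start, len(tokens)))
        tokens.append("[SEP]")              # marks the end of the sentence
    # Truncate to the maximum length and drop EDUs that fall outside it.
    tokens = tokens[:max_len]
    spans = [(s, min(e, max_len)) for s, e in spans if s < max_len]
    return tokens, spans


doc = [[["a", "b"], ["c"]], [["d"]]]        # two sentences, three EDUs
doc_tokens, doc_spans = build_input(doc)
```

The recorded spans are what a span extractor would later use to pool the encoder outputs for each EDU.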

After the BERT encoder, the representation of the [CLS] token can be used as the sentence representation. However, this approach does not work in our setting, since we need to extract the representation for EDUs instead. Therefore, we adopt a Self-Attentive Span Extractor (SpanExt), proposed by Lee et al. (2017), to learn the representation of EDUs.

For the $i$-th EDU with $\ell_i$ words, with the output from the BERT encoder $\{h_{i1}, h_{i2}, \dots, h_{i\ell_i}\}$, we obtain the EDU representation as follows:

$$\alpha_{ij} = W_2 \cdot \mathrm{ReLU}(W_1 h_{ij} + b_1) + b_2,$$
$$a_{ij} = \frac{\exp(\alpha_{ij})}{\sum_{k=1}^{\ell_i} \exp(\alpha_{ik})}, \qquad h_i^S = \sum_{j=1}^{\ell_i} a_{ij} \cdot h_{ij},$$

where $\alpha_{ij}$ is the score of the $j$-th word in the EDU, and $a_{ij}$ is the normalized attention of the $j$-th word w.r.t. all the words in the span. $h_i^S$ is a weighted sum of the BERT output hidden states. Throughout the paper, all the $W$ matrices and $b$ vectors are parameters to learn. We abstract the above Self-Attentive Span Extractor as $h_i^S = \mathrm{SpanExt}(h_{i1}, \dots, h_{i\ell_i})$.

After the span extraction step, the whole document is represented as a sequence of EDU representations: $h^S = \{h_1^S, \dots, h_n^S\}$, which will be sent into the graph encoder.
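The pooling step of the span extractor (score each token, normalize the scores with a softmax over the span, and take the weighted sum of the hidden states) can be sketched in plain Python. The scorer is passed in as a callable so that the sketch stays independent of how the learned scoring network is parameterized; `span_extract` and its arguments are hypothetical names, not an existing library API.

```python
import math
from typing import Callable, List

Vector = List[float]


def span_extract(hidden: List[Vector], score: Callable[[Vector], float]) -> Vector:
    """Pool the hidden states of one EDU into a single vector.

    `hidden` holds one vector per token in the span; `score` maps a hidden
    state to an unnormalized attention score (assumed to be learned elsewhere).
    """
    scores = [score(h) for h in hidden]
    # Softmax over the tokens of the span (shifted by the max for stability).
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    attention = [e / total for e in exps]
    # Attention-weighted sum of the hidden states.
    dim = len(hidden[0])
    return [sum(a * h[d] for a, h in zip(attention, hidden)) for d in range(dim)]


# With a constant scorer the attention is uniform, so the result is the mean.
pooled = span_extract([[1.0, 0.0], [3.0, 2.0]], lambda h: 0.0)
```

A sharply peaked scorer makes the pooled vector approach the hidden state of the highest-scoring token, which is how the extractor can focus on the most informative word of an EDU.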

3.3 Graph Encoder

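Given the constructed graph $G = (V, E)$, the nodes $V$ correspond to the EDUs in a document, and the edges $E$ correspond to either the RST discourse relations or the coreference mentions. We then use a Graph Convolutional Network to update the representations of all the EDUs, to capture long-range dependencies missed by BERT for better summarization. To modularize the architecture design, we present a single Discourse Graph Encoder (DGE) layer here. Multiple DGE layers are stacked in our experiments.

Assume that the input for the $k$-th DGE layer is denoted as $h^{(k)} = \{h_1^{(k)}, \dots, h_n^{(k)}\}$, and the corresponding output is denoted as $h^{(k+1)} = \{h_1^{(k+1)}, \dots, h_n^{(k+1)}\}$. The $k$-th DGE layer is designed as follows:

$$u_i^{(k)} = W_2^{(k)} \, \mathrm{ReLU}(W_1^{(k)} h_i^{(k)} + b_1^{(k)}) + b_2^{(k)},$$
$$v_i^{(k)} = \mathrm{LN}(h_i^{(k)} + \mathrm{Dropout}(u_i^{(k)})),$$
$$w_i^{(k)} = \mathrm{ReLU}\Big(\sum_{j \in \mathcal{N}_i} \frac{1}{|\mathcal{N}_i|} W_3^{(k)} v_j^{(k)} + b_3^{(k)}\Big),$$
$$h_i^{(k+1)} = \mathrm{LN}(\mathrm{Dropout}(w_i^{(k)}) + v_i^{(k)}),$$

where LN represents Layer Normalization, and $\mathcal{N}_i$ denotes the neighborhood of the $i$-th EDU node. $h_i^{(k+1)}$ is the output of the $i$-th EDU in the $k$-th DGE layer, and $h^{(1)} = h^S$, which is the output from the Document Encoder. After $K$ layers of graph propagation, we obtain $h^G = h^{(K+1)}$, which is the final representation of all the EDUs after the stacked DGE layers. For different graphs, the parameters of the DGEs are not shared. If we use both graphs, we concatenate the outputs of the two graphs:

$$h^G = \mathrm{ReLU}(W_4 [h^G_C ; h^G_R] + b_4).$$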
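One DGE layer can be sketched in plain Python as a position-wise feed-forward block followed by neighbor averaging, each wrapped in a residual connection and layer normalization. This is an illustrative sketch under assumptions, not the authors' implementation: dropout is omitted (as at inference time), layer normalization has no learned gain or bias, and the parameter names in the dictionary `p` are hypothetical.

```python
import math
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def linear(w: Matrix, b: Vector, x: Vector) -> Vector:
    """Affine map: w applied to x, plus bias b."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def relu(x: Vector) -> Vector:
    return [max(0.0, v) for v in x]


def add(a: Vector, b: Vector) -> Vector:
    return [x + y for x, y in zip(a, b)]


def layer_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Layer normalization without learned gain and bias (a simplification)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]


def dge_layer(h: List[Vector], adj: List[List[int]], p: Dict[str, list]) -> List[Vector]:
    """Apply one graph-encoder layer to the EDU representations `h`.

    `adj[i][j] == 1` means EDU j is a neighbor of EDU i.
    """
    # Step 1: feed-forward block with residual connection and LayerNorm.
    v = []
    for h_i in h:
        u_i = linear(p["W2"], p["b2"], relu(linear(p["W1"], p["b1"], h_i)))
        v.append(layer_norm(add(h_i, u_i)))
    # Step 2: average the neighbors, transform, then residual and LayerNorm.
    out = []
    for i, v_i in enumerate(v):
        neighbors = [v[j] for j, edge in enumerate(adj[i]) if edge]
        if neighbors:
            mean = [sum(col) / len(neighbors) for col in zip(*neighbors)]
        else:
            mean = [0.0] * len(v_i)          # isolated node: nothing to aggregate
        w_i = relu(linear(p["W3"], p["b3"], mean))
        out.append(layer_norm(add(w_i, v_i)))
    return out
```

Since the transform is linear, averaging the neighbors before applying it is equivalent to averaging the transformed neighbors. Stacking the layer simply means feeding the output list back in as `h` with a fresh parameter dictionary.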
3.4 Training & Inference

During training, $h^G$ is used for predicting the oracle labels. Specifically,

$$\hat{y}_i = \sigma(W_5 h_i^G + b_5),$$

where $\sigma(\cdot)$ represents the logistic function, and $\hat{y}_i$ is the prediction probability ranging from 0 to 1. The training loss of the model is the binary cross-entropy loss given the predictions and oracles:

$$\mathcal{L} = -\sum_{i=1}^{n} \big( y_i^* \log \hat{y}_i + (1 - y_i^*) \log (1 - \hat{y}_i) \big).$$

The above loss is summed over all the training samples. For DiscoBert without graphs, the output from the Document Encoder $h^S$ is used for prediction instead. Oracle labels are created at the EDU level: we greedily pick EDUs together with their necessary dependencies until the R-1 F1 score drops.
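The greedy oracle construction can be sketched as follows. This is an approximation under stated assumptions: a plain unigram-overlap F1 stands in for ROUGE-1 F1, each EDU has at most one head given by the hypothetical mapping `heads` (dependent to head), and all function names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Set


def unigram_f1(candidate: List[str], reference: List[str]) -> float:
    """Unigram-overlap F1, used here as a stand-in for ROUGE-1 F1."""
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def with_dependencies(edu: int, heads: Dict[int, int]) -> Set[int]:
    """The EDU itself plus every EDU it transitively depends on."""
    needed: Set[int] = set()
    current = edu
    while current is not None and current not in needed:
        needed.add(current)
        current = heads.get(current)
    return needed


def greedy_oracle(edus: List[List[str]], reference: List[str],
                  heads: Dict[int, int]) -> Set[int]:
    """Greedily add the EDU (with its dependencies) that most improves the score."""
    selected: Set[int] = set()
    best = 0.0
    while True:
        best_candidate = None
        for i in range(len(edus)):
            if i in selected:
                continue
            candidate = selected | with_dependencies(i, heads)
            tokens = [t for j in sorted(candidate) for t in edus[j]]
            score = unigram_f1(tokens, reference)
            if score > best:
                best, best_candidate = score, candidate
        if best_candidate is None:           # no addition improves the score
            return selected
        selected = best_candidate


toy_edus = [["the", "cat"], ["sat"], ["on", "mat"], ["dogs", "bark"]]
toy_oracle = greedy_oracle(toy_edus, ["the", "cat", "sat"], {1: 0})
```

In the toy example, EDU 1 depends on EDU 0, so the two are added together and the search stops once any further EDU would lower the score.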

During inference, for an input document, after obtaining the prediction probabilities of all the EDUs, i.e., $\hat{y} = \{\hat{y}_1, \dots, \hat{y}_n\}$, we sort $\hat{y}$ in descending order, and select EDUs accordingly. Note that the dependencies between the EDUs are also enforced in prediction to ensure the grammaticality of generated summaries.
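Dependency-aware selection at inference time can be sketched as below. The stopping rule is an assumption: the text does not state how many EDUs are selected, so a hypothetical `budget` parameter is used, and the mapping `heads` (dependent to head) is assumed to come from the converted RST dependency tree.

```python
from typing import Dict, List, Set


def select_edus(probs: List[float], heads: Dict[int, int], budget: int) -> List[int]:
    """Pick EDUs in descending score order, pulling in their heads as well.

    The budget is checked before each pick, so the result may exceed it when
    a pick brings several required EDUs with it.
    """
    chosen: Set[int] = set()
    ranked = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    for edu in ranked:
        if len(chosen) >= budget:
            break
        # Walk up the dependency chain so the selection stays grammatical.
        current = edu
        while current is not None and current not in chosen:
            chosen.add(current)
            current = heads.get(current)
    return sorted(chosen)                    # restore document order


# EDU 4 depends on EDU 2, which depends on EDU 1.
picked = select_edus([0.1, 0.2, 0.9, 0.3, 0.8], {4: 2, 2: 1}, budget=3)
```

Returning the indices in document order keeps the extracted EDUs in their original sequence, which matters for readability of the assembled summary.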

4 Experiments

In this section, we present experimental results on two popular news summarization datasets. We compare our proposed model with state-of-the-art baselines, and conduct detailed analysis to validate the effectiveness of DiscoBERT.

| Dataset | Document # sent. | Document # EDU | Document # tok. | Sum. # tok. | # edges in $G_R$ | # edges in $G_C$ |
|---|---|---|---|---|---|---|
| CNNDM | 24 | 67 | 541 | 54 | 66 | 233 |
| NYT | 22 | 66 | 591 | 87 | 65 | 143 |

Table 1: Statistics of the datasets. The first block shows the average number of sentences, EDUs and tokens in the documents. The second block shows the average number of tokens in the reference summaries. The third block shows the average number of edges in the constructed RST Graphs ($G_R$) and Coreference Graphs ($G_C$), respectively.

4.1 Datasets

We evaluate the proposed models on two datasets: New York Times (NYT) [24], and CNN and Dailymail (CNNDM) [12]. We use the script from See et al. (2017) to extract summaries from raw data. We use Stanford CoreNLP for sentence boundary detection, tokenization and parsing [19]. Due to the limitation of BERT, we only encode up to 768 BERT BPEs.

Table 1 shows the statistics of the datasets. The edges in $G_C$ are undirected, while the ones in $G_R$ are directional. For CNNDM, there are 287,226, 13,368 and 11,490 samples for training, validation and test, respectively. We use the un-anonymized version as in previous summarization work. NYT is licensed by LDC (https://catalog.ldc.upenn.edu/LDC2008T19), and following previous work [33, 30], there are 137,778, 17,222 and 17,223 samples for training, validation and test, respectively.

| Model | R-1 | R-2 | R-L |
|---|---|---|---|
| Lead3 | 40.42 | 17.62 | 36.67 |
| Oracle (Sentence) | 55.61 | 32.84 | 51.88 |
| Oracle (Discourse) | 61.61 | 37.82 | 59.27 |
| NeuSum [35] | 41.59 | 19.01 | 37.98 |
| BanditSum [8] | 41.50 | 18.70 | 37.60 |
| JECS [30] | 41.70 | 18.50 | 37.90 |
| PNBERT [34] | 42.39 | 19.51 | 38.69 |
| PNBERT w. RL | 42.69 | 19.60 | 38.85 |
| BERT [33] | 41.82 | 19.48 | 38.30 |
| HiBert variant [33] | 42.10 | 19.70 | 38.53 |
| HiBert variant [33] | 42.31 | 19.87 | 38.78 |
| HiBert variant [33] | 42.37 | 19.95 | 38.83 |
| BERTSUM [17] | 43.25 | 20.24 | 39.63 |
| T5-Base [22] | 42.05 | 20.34 | 39.40 |
| Bert | 43.07 | 19.94 | 39.44 |
| DiscoBert | 43.38 | 20.44 | 40.21 |
| DiscoBert w. $G_C$ | 43.58 | 20.64 | 40.42 |
| DiscoBert w. $G_R$ | 43.68 | 20.71 | 40.54 |
| DiscoBert w. $G_R$ & $G_C$ | 43.77 | 20.85 | 40.67 |

Table 2: Results on the test set of the CNNDM dataset. ROUGE-1, -2 and -L are reported. Models with the asterisk symbol (*) used extra data for pre-training. R-1 and R-2 are shorthands for unigram and bigram overlap; R-L is the longest common subsequence.

4.2 State-of-the-art Baselines

We compare the proposed model with the following state-of-the-art neural text summarization models.

Extractive Models. BanditSum treats extractive summarization as a contextual bandit problem, and is trained with policy gradient methods [8]. NeuSum is an extractive model with seq2seq architecture, where the attention mechanism scores the document and emits the index as the selection [35]. DeepChannel is an extractive model with salience estimation and a contrastive training strategy.

Compressive Models. JECS is a neural text-compression-based summarization model using BLSTM as the encoder [30]. The first stage selects sentences, and the second stage compresses them by pruning the constituency parse tree.

BERT-based Models. BERT-based models have achieved significant improvement on CNNDM and NYT, when compared with their LSTM counterparts. Specifically, BertSum is the first BERT-based extractive summarization model [17]. Our baseline model Bert is the re-implementation of BertSum. PNBert proposed a BERT-based model with various training strategies, including reinforcement learning and Pointer Networks [34]. HiBert is a hierarchical BERT-based model for document encoding, which is further pretrained with unlabeled data [33].

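| Model | R-1 | R-2 | R-L |
|---|---|---|---|
| Lead3 | 41.80 | 22.60 | 35.00 |
| Oracle (Sentence) | 64.22 | 44.57 | 57.27 |
| Oracle (Discourse) | 67.76 | 48.05 | 62.40 |
| JECS [30] | 45.50 | 25.30 | 38.20 |
| BERT [33] | 48.38 | 29.04 | 40.53 |
| HiBert variant [33] | 48.92 | 29.58 | 41.10 |
| HiBert variant [33] | 49.06 | 29.70 | 41.23 |
| HiBert variant [33] | 49.25 | 29.92 | 41.43 |
| HiBert variant [33] | 49.47 | 30.11 | 41.63 |
| Bert | 48.48 | 29.01 | 40.62 |
| DiscoBert | 49.78 | 30.30 | 42.44 |
| DiscoBert w. $G_C$ | 49.79 | 30.18 | 42.48 |
| DiscoBert w. $G_R$ | 49.86 | 30.25 | 42.55 |
| DiscoBert w. $G_R$ & $G_C$ | 50.00 | 30.38 | 42.70 |

Table 3: Results on the test set of the NYT dataset. Models with the asterisk symbol (*) used extra data for pre-training.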

4.3 Implementation Details

We use AllenNLP [10] as the code framework. Experiments are conducted on a single NVIDIA P100 card, and the mini-batch size is set to 6 due to GPU memory capacity. The length of each document is truncated to 768 BPEs. We use the ‘bert-base-uncased’ model for all experiments. We train all our models for up to 80,000 steps. ROUGE [16] is used as the evaluation metric, and ‘R-1’ is used as the validation criterion.

The realization of discourse units and structure is a critical part of EDU pre-processing, which requires two steps: discourse segmentation and RST parsing. In the segmentation phase, we use a neural discourse segmenter based on the BiLSTM CRF framework [29] (https://github.com/PKU-TANGENT/NeuralEDUSeg). The segmenter achieves a 94.3 F1 score on the RST-DT test set, on which human performance is 98.3. In the parsing phase, we use a shift-reduce discourse parser to extract relations and identify nuclearity [13] (https://github.com/jiyfeng/DPLP).

4.4 Experimental Results

Results on CNNDM. Table 2 shows the results on CNNDM. The first section includes the Lead3 baseline, the sentence-based oracle, and the discourse-based oracle. The second section lists the performance of baseline models, including non-BERT-based and BERT-based variants. The performance of our proposed model is listed in the third section. Bert is our implementation of the sentence-based BERT model. DiscoBert is our discourse-based BERT model without the Discourse Graph Encoder. DiscoBert w. $G_C$ and DiscoBert w. $G_R$ are the discourse-based BERT models with the Coreference Graph and the RST Graph, respectively. DiscoBert w. $G_R$ & $G_C$ is the fusion model encoding both graphs.

The proposed DiscoBert beats the sentence-based counterpart and all the competitor models. With the help of the Discourse Graph Encoder, the graph-based DiscoBert beats the state-of-the-art BERT model by a significant margin (0.52/0.61/1.04 on R-1/-2/-L F1). The ablation study with individual graphs shows that the RST Graph is slightly more helpful than the Coreference Graph, while the combination of both achieves the best performance overall.

Results on NYT. Results on the NYT dataset are summarized in Table 3. The proposed model surpasses the previous state-of-the-art BERT-based model by a significant margin. Two of the HiBert variants used extra data for pre-training the model. We notice that in the NYT dataset, most of the improvement comes from the use of EDUs as minimal selection units. DiscoBert provides a 1.30/1.29/1.82 gain on R-1/-2/-L over the Bert baseline. However, the use of discourse graphs does not help much in this case.

Grammaticality. Due to the segmentation and partial selection of the sentence, the output of our model might not be as grammatical as the original sentence. We manually examined and automatically evaluated the model output, and observed that overall, the generated summaries are still grammatical, given the RST dependency tree constraining the rhetorical relations among EDUs. A set of simple yet effective post-processing rules helps to complete the EDUs in some cases.

Table 4 shows automatic grammatical checking results using Grammarly, where the average number of errors in every 10,000 characters on CNNDM and NYT datasets is reported. We compare DiscoBert with the sentence-based Bert model. ‘All’ shows the summation of the number of errors in all categories. As shown in the table, the summaries generated by our model have retained the quality of the original text.

| Source | M | All | CR | PV | PT | O |
|---|---|---|---|---|---|---|
| CNNDM | Sent | 33.0 | 18.7 | 9.0 | 2.3 | 3.0 |
| CNNDM | Disco | 34.0 | 18.3 | 8.4 | 2.6 | 4.7 |
| NYT | Sent | 23.3 | 13.5 | 5.9 | 0.8 | 3.1 |
| NYT | Disco | 23.8 | 13.9 | 5.7 | 0.8 | 3.4 |

Table 4: Number of errors per 10,000 characters based on automatic grammaticality checking on CNNDM and NYT with Grammarly, where lower values are better. Detailed error categories, including correctness (CR), passive voice (PV) misuse, punctuation (PT) in compound/complex sentences and others (O), are listed from left to right.

We also conduct human evaluation on the model outputs. We sampled 200 documents from the test set of CNNDM, and for each sample we asked two Turkers to compare three summaries and grade them on a scale of 1 to 5. Results are shown in Table 5. The Sent-BERT model (the original BERTSum model) selects sentences from the document, hence it provides the best overall readability, coherence and grammaticality. In some cases reference summaries are just long phrases, so their scores are slightly lower than those of the sentence model. Our DiscoBERT model is slightly worse than the Sent-BERT model, but is fully comparable to the other two variants.

Error Analysis. Despite the success, we further conducted error analysis, and found that the errors mostly originated from punctuation and coherence. Common punctuation issues include extra or missing commas, as well as missing quotation marks. For example, if we only select the first EDU of the sentence [‘Johnny is believed to have drowned,] [but actually he is fine,’] [the police say.], the output ‘Johnny is believed to have drowned. does not look like a grammatical sentence due to the punctuation. The coherence issue originates from missing or improper pronoun resolution. In the above example, selecting only the second EDU yields the sentence actually he is fine, in which it is not clear who ‘he’ refers to.

Model All Coherence Grammaticality
Table 5: Human evaluation on model outputs. We ask Turkers to grade the overall preference, coherence and grammaticality on a scale of 1 to 5. The mean value along with the standard deviation is shown.

5 Related Work

Neural Extractive Summarization. Neural networks have been widely used in extractive summarization. Various decoding approaches, including ranking [21], index prediction [35] and sequential labelling [20, 32, 8], have been applied to content selection. Our model uses a configuration similar to that of Liu [17] to encode the document with BERT, but we use discourse graph structure and a graph encoder to handle the long-range dependency issue.

Neural Compressive Summarization. Text summarization with compression and deletion has been explored in some recent work. Xu and Durrett [30] presented a two-stage neural model for selection and compression based on constituency tree pruning. Dong et al. (2019) presented a neural sentence compression model with discrete operations including deletion and addition. Different from these studies, as we use EDUs as the minimal selection basis, sentence compression is achieved automatically in our model.

EDU for Summarization. The use of discourse theory for text summarization has been explored before. Louis et al. (2010) examined the benefit of graph structure provided by discourse relations for text summarization. Hirao et al. (2013) and Yoshida et al. (2014) formulated the summarization problem as the trimming of the document discourse tree. Durrett et al. (2016) presented a system of sentence extraction and compression with ILP methods using discourse structure. Li et al. (2016) demonstrated that using EDUs as units of content selection leads to stronger summarization performance. Compared with them, our proposed method is the first neural end-to-end summarization model using EDUs as the selection basis.

Graph-based Summarization Graph-based approaches have been explored in text summarization for decades. LexRank introduced a stochastic graph-based method for computing the relative importance of textual units [9]. Yasunaga et al. (2017) employed a GCN on relation graphs, with sentence embeddings obtained from an RNN. Tan et al. (2017) also proposed a graph-based attention mechanism for an abstractive summarization model.
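To make the graph-ranking idea concrete, the following is a minimal sketch of a LexRank-style centrality computation. It is a simplification under stated assumptions, not the original implementation: similarity is plain term-frequency cosine (no IDF weighting), edges are unweighted after thresholding, and the function names, the `threshold` of 0.1, and the `damping` of 0.85 are illustrative choices.

```python
import math
from collections import Counter


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def lexrank_scores(units, threshold=0.1, damping=0.85, iters=50):
    """Score textual units by centrality in a thresholded similarity graph.

    Assumption: every unit has a self-loop, so each node has degree >= 1
    and the scores remain a probability distribution.
    """
    bags = [Counter(u.lower().split()) for u in units]
    n = len(bags)
    # Connect two units when their similarity exceeds the threshold.
    adj = [[1.0 if i == j or cosine(bags[i], bags[j]) > threshold else 0.0
            for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in adj]
    scores = [1.0 / n] * n
    for _ in range(iters):
        # Power iteration: each node passes its score evenly to its neighbours.
        scores = [
            (1 - damping) / n
            + damping * sum(adj[j][i] * scores[j] / degree[j] for j in range(n))
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    example = [
        "graph model summarization",
        "graph networks",
        "summarization datasets",
        "weather today",
    ]
    for unit, score in zip(example, lexrank_scores(example)):
        print(f"{score:.3f}  {unit}")
```

In this toy input the first unit shares a word with two others, so it ends up with the highest score. A unit with no overlap keeps only its own self-loop mass.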

6 Conclusions

In this paper, we present DiscoBert for text summarization. DiscoBert uses discourse units as the minimal selection basis to reduce redundancy in summaries, and leverages two constructed discourse graphs as an inductive bias to capture long-range dependencies among discourse units. We validate our proposed approach on two popular datasets, and observe consistent improvements over baseline methods. For future work, we will explore better graph encoding methods, and apply discourse graphs to other tasks that require long document encoding.
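As an illustration of how a graph over discourse units can carry information between distant units, the sketch below implements one generic graph-convolution step in plain Python. It is a toy example under stated assumptions, not the DiscoBert graph encoder: the mean-over-neighbours normalization, the function name `gcn_layer`, and the example edge are all hypothetical.

```python
def gcn_layer(features, edges, weight):
    """One graph-convolution step: average neighbours (plus self), project, ReLU.

    features: list of n node vectors, each of length d_in
    edges:    iterable of undirected (i, j) index pairs
    weight:   d_in x d_out matrix given as a list of rows
    """
    n = len(features)
    neighbours = [{i} for i in range(n)]  # every node keeps a self-loop
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    d_in, d_out = len(weight), len(weight[0])
    out = []
    for i in range(n):
        # Mean of the feature vectors in node i's neighbourhood.
        agg = [sum(features[j][k] for j in neighbours[i]) / len(neighbours[i])
               for k in range(d_in)]
        # Linear projection followed by ReLU.
        out.append([max(0.0, sum(agg[k] * weight[k][c] for k in range(d_in)))
                    for c in range(d_out)])
    return out


if __name__ == "__main__":
    # Three units with one-hot features; a hypothetical edge links units 0 and 2.
    feats = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(gcn_layer(feats, [(0, 2)], identity))
```

After one step, units 0 and 2 share a blended representation even though unit 1 sits between them in the text. Unit 1, which has no edges, keeps its own features.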


References

  • [1] Z. Cao, W. Li, S. Li, F. Wei, and Y. Li (2016) AttSum: joint learning of focusing and summarization with neural attention. In COLING, Cited by: §1.
  • [2] Z. Cao, F. Wei, W. Li, and S. Li (2018) Faithful to the Original: Fact Aware Neural Abstractive Summarization. In AAAI, Cited by: §1.
  • [3] L. Carlson, D. Marcu, and M. E. Okurowski (2001) Building a discourse-tagged corpus in the framework of rhetorical structure theory. In SIGDIAL, Cited by: §1.
  • [4] A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi (2018) Deep communicating agents for abstractive summarization. In NAACL, Cited by: §1.
  • [5] Y. Chen and M. Bansal (2018) Fast Abstractive Summarization with Reinforce-Selected Sentence Rewriting. In ACL, Cited by: §1.
  • [6] J. Cheng and M. Lapata (2016) Neural summarization by extracting sentences and words. In ACL, Cited by: §1.
  • [7] J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In NAACL, Cited by: §1, §3.2.
  • [8] Y. Dong, Y. Shen, E. Crawford, H. van Hoof, and J. C. K. Cheung (2018) BanditSum: Extractive Summarization as a Contextual Bandit. In EMNLP, Cited by: §4.2, Table 2, §5.
  • [9] G. Erkan and D. R. Radev (2004) Lexrank: graph-based lexical centrality as salience in text summarization. JAIR. Cited by: §5.
  • [10] M. Gardner, J. Grus, M. Neumann, O. Tafjord, P. Dasigi, N. F. Liu, M. Peters, M. Schmitz, and L. Zettlemoyer (2018) AllenNLP: a deep semantic natural language processing platform. In Proceedings of Workshop for NLP Open Source Software (NLP-OSS), Cited by: §4.3.
  • [11] S. Gehrmann, Y. Deng, and A. Rush (2018) Bottom-Up Abstractive Summarization. In EMNLP, Cited by: §1.
  • [12] K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching Machines to Read and Comprehend. In NeurIPS, Cited by: §4.1.
  • [13] Y. Ji and J. Eisenstein (2014) Representation learning for text-level discourse parsing. In ACL, Cited by: §4.3.
  • [14] C. Kedzie, K. McKeown, and H. Daume III (2018) Content selection in deep learning models of summarization. In EMNLP, Cited by: §2.3.
  • [15] T. N. Kipf and M. Welling (2017) Semi-supervised classification with graph convolutional networks. In ICLR, Cited by: §1.
  • [16] C. Lin (2004) ROUGE: A Package for Automatic Evaluation of Summaries. In Text Summarization Branches Out, Cited by: §4.3.
  • [17] Y. Liu (2019) Fine-tune bert for extractive summarization. arXiv preprint arXiv:1903.10318. Cited by: §1, §4.2, Table 2.
  • [18] W. C. Mann and S. A. Thompson (1988) Rhetorical structure theory: toward a functional theory of text organization. Text-interdisciplinary Journal for the Study of Discourse. Cited by: §1, §2.
  • [19] C. Manning, M. Surdeanu, J. Bauer, J. Finkel, S. Bethard, and D. McClosky (2014) The Stanford CoreNLP Natural Language Processing Toolkit. In ACL, Cited by: §2.3, §4.1.
  • [20] R. Nallapati, F. Zhai, and B. Zhou (2017) SummaRuNNer: A Recurrent Neural Network Based Sequence Model for Extractive Summarization of Documents. In AAAI, Cited by: §5.
  • [21] S. Narayan, S. B. Cohen, and M. Lapata (2018) Ranking sentences for extractive summarization with reinforcement learning. In NAACL, Cited by: §1, §5.
  • [22] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683. Cited by: Table 2.
  • [23] A. M. Rush, S. Chopra, and J. Weston (2015) A Neural Attention Model for Abstractive Sentence Summarization. In EMNLP, Cited by: §1.
  • [24] E. Sandhaus (2008) The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia. Cited by: §4.1.
  • [25] A. See, P. J. Liu, and C. D. Manning (2017) Get To The Point: Summarization with Pointer-Generator Networks. In ACL, Cited by: §1.
  • [26] E. Sharma, L. Huang, Z. Hu, and L. Wang (2019) An entity-driven framework for abstractive summarization. In EMNLP, Cited by: §1.
  • [27] J. Shi, C. Liang, L. Hou, J. Li, Z. Liu, and H. Zhang (2019) DeepChannel: salience estimation by contrastive learning for extractive document summarization. In AAAI, Cited by: §4.2.
  • [28] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In NeurIPS, Cited by: §1, §3.2.
  • [29] Y. Wang, S. Li, and J. Yang (2018) Toward fast and accurate neural discourse segmentation. In EMNLP, Cited by: §4.3.
  • [30] J. Xu and G. Durrett (2019) Neural extractive text summarization with syntactic compression. In EMNLP, Cited by: §1, §4.1, §4.2, Table 2, Table 3.
  • [31] J. Yao, X. Wan, and J. Xiao (2017) Recent advances in document summarization. Knowledge and Information Systems 53 (2), pp. 297–336. Cited by: §1.
  • [32] X. Zhang, M. Lapata, F. Wei, and M. Zhou (2018) Neural Latent Extractive Document Summarization. In EMNLP, Cited by: §1, §5.
  • [33] X. Zhang, F. Wei, and M. Zhou (2019) HIBERT: document level pre-training of hierarchical bidirectional transformers for document summarization. In ACL, Cited by: §4.1, §4.2, Table 2, Table 3.
  • [34] M. Zhong, P. Liu, D. Wang, X. Qiu, and X. Huang (2019) Searching for effective neural extractive summarization: what works and what’s next. In ACL, Cited by: §4.2, Table 2.
  • [35] Q. Zhou, N. Yang, F. Wei, S. Huang, M. Zhou, and T. Zhao (2018) Neural Document Summarization by Jointly Learning to Score and Select Sentences. In ACL, Cited by: §1, §4.2, Table 2, §5.