A Divide-and-Conquer Approach to the Summarization of Academic Articles

by Alexios Gidiotis, et al.

We present a novel divide-and-conquer method for the summarization of long documents. Our method processes the input in parts and generates a corresponding summary for each part. These partial summaries are then combined in order to produce a final complete summary. Splitting the problem of long document summarization into smaller and simpler problems reduces the computational complexity of the summarization process and leads to more training examples that at the same time contain less noise in the target summaries compared to the standard approach of producing the whole summary at once. Using a fairly simple sequence-to-sequence architecture with a combination of LSTM units and Rotational Units of Memory (RUM), our approach leads to state-of-the-art results in two publicly available datasets of academic articles.





1 Introduction

Automatic summarization has recently been recognized as one of the most important, yet least solved, natural language processing (NLP) tasks Socher (2020). Closely following the advances in neural machine translation Sutskever et al. (2014); Bahdanau et al. (2015) and fueled by the increased availability of computational resources as well as large annotated datasets Sandhaus (2008); Napoles et al. (2012); Grusky et al. (2018), neural approaches to automatic summarization are nowadays gaining significant attention from researchers.

In previous years, automatic summarization approaches have mainly focused on short pieces of text that typically come from news articles Chopra et al. (2016); Nallapati et al. (2016a); See et al. (2017); Paulus et al. (2017). This is also reflected in the number of datasets that exist for this particular problem Hermann et al. (2015); Sandhaus (2008); Napoles et al. (2012); Grusky et al. (2018). Recently, summarization methods have started to focus on summarizing longer documents, such as academic articles Qazvinian et al. (2013); Collins et al. (2017); Cohan et al. (2018); Gidiotis and Tsoumakas (2019); Subramanian et al. (2019).

Summarizing long documents is a very different problem from newswire summarization. In academic articles for example, the input text can range from 2,000 to 7,000 words, while in the case of newswire articles it rarely exceeds 700 words Cohan et al. (2018). Similarly, the expected summary of a news article is less than 100 words long, while the abstract of an academic article can easily exceed 200 words. The increased input and output lengths lead to a much higher computational complexity for neural summarization methods, making it extremely hard to train models that have enough capacity to perform this task. Most importantly, long documents introduce a lot of noise to the summarization process. Indeed, one of the major difficulties in summarizing a long document is that large parts of the document are not really key to its narrative and thus should be ignored. Finally, long summaries typically contain a number of diverse key information points from a document, which are more difficult to produce, compared to the more focused information contained in short summaries.

Certain methods have tried to address this problem by limiting the size of the input document, either by selecting specific sections that are more informative Subramanian et al. (2019), or by first employing an extractive model that learns to identify and select the most important parts of the input Chen and Bansal (2018); Gehrmann et al. (2019). While this reduces the noise and the computational cost in processing a long document, there remain the computational cost and information diversity issues in producing a long summary.

In contrast to the above methods that produce a complete summary at once, we propose a novel method that processes the input in parts and generates a summary for each one of the different parts. These partial summaries are then combined in order to produce a final complete summary. This divide-and-conquer approach is a much more efficient and effective way to solve the problem of long document summarization by splitting it into smaller and simpler problems that are a lot easier to tackle. Using a three-year-old neural model See et al. (2017) for summarizing the parts, our method achieves state-of-the-art results in two publicly available datasets of academic articles, surpassing recent more advanced models Cohan et al. (2018); Subramanian et al. (2019).

This paper is based on past work Gidiotis and Tsoumakas (2019) that assumes the existence of structured summaries, in particular structured biomedical abstracts within PubMed. That approach exploits the underlying structure of biomedical articles with structured abstracts in order to split them into segments and then process each segment separately to produce summaries of different parts of the text. This work takes a different approach and instead uses sentence-level ROUGE similarities between the abstract and the full text in order to split and create source-target pairs for training. Such an approach makes this work applicable to any type of document, from academic articles to blog posts and financial documents. In addition, we extensively study different variations of our model setup and methodology in two datasets, namely arXiv and PubMed.

The rest of this work is structured as follows. Section 2 gives a brief overview of the related work. Section 3 describes in detail the proposed method. Section 4 presents the setup and Section 5 discusses the results of our experiments. Finally, Section 6 concludes this work and points to future work directions.

2 Related work

We review the state of the art in automatic text summarization. We first discuss related work on neural text summarization. Next we focus on the summarization of academic articles.

2.1 Neural text summarization

Most of the early literature on neural text summarization has been driven by advances in neural machine translation (NMT) Sutskever et al. (2014); Bahdanau et al. (2015). Several approaches extended the attentional sequence-to-sequence recurrent neural networks (RNNs), that achieved impressive results in NMT, towards text summarization. These approaches were mainly focused on summarizing short documents (e.g. news articles), in order to produce short summaries (e.g. headlines) Chopra et al. (2016); Nallapati et al. (2016a); See et al. (2017); Paulus et al. (2017). A number of publicly available datasets of short articles, such as CNN/Daily Mail Hermann et al. (2015) and Newsroom Grusky et al. (2018), are commonly used to benchmark such methods.

Summarization methods are generally classified into two categories. Extractive methods try to select the most important sentences from the input and combine them, typically by concatenation, in order to generate a summary Nallapati et al. (2016b); Collins et al. (2017). This is usually approached as a binary classification problem, where for each sentence the model decides whether it should be included in the summary or not. On the other hand, abstractive methods try to encode the input into a hidden representation and then use a decoder conditioned on that representation to generate the summary Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016a). In addition to these main categories, there are hybrid approaches that combine both extractive and abstractive methods either by using pointer-generators See et al. (2017); Paulus et al. (2017); Cohan et al. (2018); Celikyilmaz et al. (2018); Gidiotis and Tsoumakas (2019) or by fusing extractive and abstractive models Chen and Bansal (2018); Gehrmann et al. (2019).

The majority of sequence-to-sequence RNN models that are used for summarization are constructed with long short-term memory (LSTM) units or gated recurrent units (GRU). Models based on both LSTM and GRU units tend to suffer from recall and copy mistakes, especially when the input text gets longer. A different type of RNN unit called Rotational Unit of Memory (RUM) was proposed in Dangovski et al. (2018). RUM modifies its hidden state by rotating the hidden state vector in the corresponding Euclidean space. This results in better information flow and improves performance in a number of NLP tasks including summarization Dangovski et al. (2019).

In more recent work, there has been increasing interest in the use of transformer models Vaswani et al. (2017) to replace the RNNs for either the decoder, the encoder or both parts of the sequence-to-sequence model Subramanian et al. (2019). Using a pre-trained BERT model to replace the encoder part of the model has led to improved results over methods that are trained from scratch Liu and Lapata (2019).

In order to enhance performance and address some common shortcomings of neural summarization models, policy learning Rennie et al. (2017) was used in Paulus et al. (2017) in order to fine-tune the models. Other approaches have employed reinforcement learning objectives in order to further improve summarization performance Celikyilmaz et al. (2018); Keneshloo et al. (2019).

2.2 Summarizing academic articles

Summarization of academic articles has recently received a lot of attention. Existing approaches include extractive models that perform sentence selection Qazvinian et al. (2013); Cohan and Goharian (2015, 2018); Collins et al. (2017) and hybrid models that first select and then re-write sentences from the full text Cohan et al. (2018); Subramanian et al. (2019).

In order to train neural models for this task, several large scale datasets have been introduced. The arXiv and PubMed datasets Cohan et al. (2018) were created using open access articles from the corresponding popular repositories. Similar to the PubMed dataset, PMC-SA Gidiotis and Tsoumakas (2019) is a dataset of open access articles from PubMed Central, where the abstract of each article is structured into sections similar to the full text. Finally, the Science Daily dataset Dangovski et al. (2019) was created by crawling stories from the Science Daily web site (https://www.sciencedaily.com/). Each story is about a recent scientific paper and is also accompanied by a short summary that is used as target for training and evaluation.

In addition to the datasets mentioned above, there is also the TAC2014 biomedical summarization dataset (http://www.nist.gov/tac/2014). TAC2014 contains 20 topics, each consisting of one reference article and several articles citing it. Additionally, each reference article is accompanied by four scientific summaries that are written by domain experts. This dataset has been used multiple times in the earlier literature, but since it is rather small, it is not suitable for the training of neural summarization models.

3 Methods

We propose a divide-and-conquer approach for the summarization of long documents. In this section, we present a training algorithm for the partial summarization systems as well as the simple methodology we follow at prediction time. Finally, we discuss the basic model variants that we use in our experiments.

3.1 Divide-and-conquer summarization

We argue that a very efficient way of dealing with long documents is to train a summarization model that learns to summarize separately the different sections of the document. Our approach assumes that long documents are structured into discrete sections and exploits this discourse structure by working on each section separately. Each section of the document is treated as a different example during the training of the model by pairing it with a distinct summarization target.

A first idea for achieving this pairing would be to use the whole summary of the document as target for each different section. However, this approach would be problematic for a couple of reasons. First of all, having very long target sequences is very demanding in terms of computational resources. This problem would be even more apparent if instead of an RNN model we decided to use a transformer-based model, since the computational complexity and memory requirements of transformers explode for very long sequences. Secondly, the summary will most likely include information that is irrelevant to some sections of the document. For example, information about the conclusions of an academic article in its abstract will most likely be irrelevant to the section describing the methods. As a result, it would be impossible for the model to generate these parts of the target sequence and this may result in poor performance.

We introduce Divide-ANd-ConquER (DANCER) summarization, a method that automatically splits the summary of a document into sections and pairs each of these sections with the appropriate section of the document, in order to create distinct target summaries. Splitting a summary into sections is not straightforward, apart from the limited case of structured abstracts of academic articles Gidiotis and Tsoumakas (2019). In DANCER we employ ROUGE metrics Lin (2004) in order to match each part of the summary with a section of the document. Similar to Subramanian et al. (2019), a summary is represented as a list of $M$ sentences $A = (a_1, \dots, a_M)$. In addition, each document is represented as a list of $K$ sections $(s_1, \dots, s_K)$ and each section $s_k$ of the document as a list of sentences $(s^k_1, \dots, s^k_{N_k})$. We compute the ROUGE-L precision between each sentence of the summary $a_m$ and each sentence of the document $s^k_n$. Given two word sequences $u$ and $v$, the longest common sub-sequence (LCS) is the common sub-sequence with the maximum length. If $LCS(u, v)$ is the length of the longest common sub-sequence of $u$ and $v$, then the ROUGE-L precision between $u$ and $v$, with $v$ taken as the candidate sequence as in Lin (2004), is computed as follows:

$$P_{LCS}(u, v) = \frac{LCS(u, v)}{|v|}$$

where $|v|$ is the number of words in $v$.

In more detail, once we have computed the ROUGE-L precision between the summary sentence $a_m$ and all the sentences of the document, we find the full text sentence $s^k_n$ with the highest ROUGE-L precision score and we assign $a_m$ to be part of the summary of section $s_k$. We repeat this process until all sentences of the summary have been assigned to one document section. Then we group all summary sentences by section and concatenate the sentences corresponding to the same section in order to create the target summary for that section. We restrict the length of each target summary to the first 100 words for computational efficiency, although this limit is rarely exceeded.
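The matching procedure described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the whitespace tokenization, the lowercasing, the tie-breaking (first best match wins) and the choice of the document sentence as the candidate in the precision denominator are all assumptions made for illustration.

```python
from collections import defaultdict


def lcs_length(x, y):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(y) + 1)
    for xi in x:
        curr = [0]
        for j, yj in enumerate(y, start=1):
            if xi == yj:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_precision(reference, candidate):
    """LCS length divided by the candidate length (assumed convention)."""
    if not candidate:
        return 0.0
    return lcs_length(reference, candidate) / len(candidate)


def build_section_targets(summary_sentences, sections, max_words=100):
    """Assign each summary sentence to the section with its best-matching sentence.

    `sections` is a list of sections, each a list of sentence strings.
    Returns a dict: section index -> target summary (at most `max_words` words).
    """
    assigned = defaultdict(list)
    for summary_sentence in summary_sentences:
        summary_tokens = summary_sentence.lower().split()
        best_score, best_section = -1.0, None
        for k, section in enumerate(sections):
            for doc_sentence in section:
                score = rouge_l_precision(summary_tokens, doc_sentence.lower().split())
                if score > best_score:  # ties keep the earliest match (assumption)
                    best_score, best_section = score, k
        if best_section is not None:
            assigned[best_section].append(summary_sentence)
    # Concatenate the sentences per section and keep only the first max_words words.
    return {
        k: " ".join(" ".join(sentences).split()[:max_words])
        for k, sentences in assigned.items()
    }
```

Sections that receive no summary sentence are simply absent from the returned dictionary; how such sections are handled during training is not specified in the text and is left open here.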

During training, each section of the document is used as input text and the corresponding part of the summary is the target summary. The training itself is performed with simple teacher forcing Williams and Zipser (1995), where we minimize the negative log likelihood of the target summary sequence $y = (y_1, \dots, y_T)$ given the input sequence $x$:

$$\mathcal{L} = -\sum_{t=1}^{T} \log p(y_t \mid y_1, \dots, y_{t-1}, x)$$
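As a small numerical illustration of this objective, the sketch below sums the negative log probabilities that a decoder assigns to the gold tokens. The function name and the input format (one probability distribution per decoding step, obtained by feeding the gold previous tokens) are hypothetical and chosen only for illustration.

```python
import math


def teacher_forcing_nll(step_distributions, target_ids):
    """Negative log likelihood of the target tokens under teacher forcing.

    step_distributions[t] is the decoder's probability distribution over the
    vocabulary at step t, computed with the gold tokens before t as input.
    target_ids[t] is the index of the gold token at step t.
    """
    if len(step_distributions) != len(target_ids):
        raise ValueError("one distribution is required per target token")
    return -sum(
        math.log(dist[token_id])
        for dist, token_id in zip(step_distributions, target_ids)
    )
```

For example, with two decoding steps that assign probabilities 0.5 and 0.75 to the gold tokens, the loss is log(2) + log(4/3).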


We have found that this training strategy has several advantages over other methods proposed in the literature. Firstly, by breaking down the problem into multiple smaller problems we greatly reduce the complexity and make it much easier to solve. We believe that this is a very efficient way to approach the summarization of long documents, since it greatly reduces the length of both the input and more importantly the output sequences. Also, since the target summaries for each section are selected based on the ROUGE-L scores of each sentence, we create a better and more focused matching between the source and the target sequences and avoid having parts of the target summary that are irrelevant to the input sequence. This property prevents us from penalizing the model for not predicting information that was absent in the input text.

Secondly, by splitting each training document into multiple input-target pairs we create a lot more training examples. This is especially beneficial for neural summarization models because by splitting each document into multiple examples we can effectively make use of more training content. This becomes clearer if we think of a neural summarization decoder as a conditional Language Model that cannot process an unlimited amount of text from each training example. The way that we approach the training allows us to effectively distribute the source and target texts into more training examples and thus enable us to train our model on a larger amount of textual content which leads to improved output quality.

Finally, the method itself is simple and model agnostic and can employ different summarization models, from encoder-decoder RNNs to transformers. It can also be combined with other more sophisticated methods that perform sentence extraction before the main summarization process, since it has been observed that pointer neural networks sometimes struggle at selecting relevant parts of the input.

3.2 Section selection

When working with long structured documents it is usually the case that not all sections of the document are key to the document. If we take as an example an academic article, sections like literature review or background are not essential when trying to summarize the main points of the article. On the other hand, sections like the introduction and conclusion usually include quite a lot of the important information that we want to include in the summary. Another similar example would be financial reports that are also structured in sections. Some of those sections, usually referred to as “front-end” sections, include key information and reviews that are core to the narrative, while others consist mostly of financial statements and are less useful for producing a summary El-Haj (2019).

What’s more, by trying to include sections that are not really important to the overall summary we can possibly end up adding a lot of noise and overall reducing the quality of the generated summary. With that in mind we decided that by selecting specific section types and only including those into the summary we can improve the overall quality of the summarization results.

We follow the same approach described in Gidiotis and Tsoumakas (2019) in order to select the sections we want to use for summarization. First we classify each section into different section types like introduction, methods and conclusion based on a heuristic keyword matching of some common keywords in the section header. The specific keywords used for the classification are presented in Table 1. When generating the summary, we select and use only the sections of the full text that are classified as introduction, methods, results and conclusion, ignoring all other sections.

Section type | Keywords
introduction | introduction, case
literature | background, literature, related
methods | methods, method, techniques, methodology
results | result, results, experimental, experiments, experiment
discussion | discussion, limitations
conclusion | conclusion, conclusions, concluding
Table 1: Here we present the different section types and the common keywords that are used in order to classify them. If the header of a section includes any of the keywords associated with a specific section type it is classified in that section type. Sections that can’t be matched with any section type are ignored.
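The keyword heuristic of Table 1 can be sketched in a few lines of Python. The keyword lists are taken from the table; everything else is an assumption for illustration: the function names, matching on whole lowercase words, and resolving headers that match several types by taking the first type in table order.

```python
import re

# Keywords from Table 1, in table order.
SECTION_KEYWORDS = {
    "introduction": {"introduction", "case"},
    "literature": {"background", "literature", "related"},
    "methods": {"methods", "method", "techniques", "methodology"},
    "results": {"result", "results", "experimental", "experiments", "experiment"},
    "discussion": {"discussion", "limitations"},
    "conclusion": {"conclusion", "conclusions", "concluding"},
}

# Section types that are kept when generating the summary.
SELECTED_TYPES = {"introduction", "methods", "results", "conclusion"}


def classify_section(header):
    """Return the section type for a header, or None if no keyword matches."""
    words = set(re.findall(r"[a-z]+", header.lower()))
    for section_type, keywords in SECTION_KEYWORDS.items():
        if words & keywords:  # first matching type wins (assumption)
            return section_type
    return None


def select_sections(sections):
    """Keep only (header, text) pairs whose header maps to a selected type."""
    return [
        (header, text)
        for header, text in sections
        if classify_section(header) in SELECTED_TYPES
    ]
```

With this sketch a header such as "2 Related work" is classified as literature and dropped, while "Acknowledgements" matches no type and is ignored.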

3.3 Core model

Our core model is a fairly common sequence-to-sequence model with LSTM units, attention and a copying mechanism. This model was first proposed in See et al. (2017) and has been widely used ever since Paulus et al. (2017); Cohan et al. (2018); Celikyilmaz et al. (2018); Gidiotis and Tsoumakas (2019). Figure 1 illustrates the basic architecture of this model. One of the advantages of this model is the ability to both extract tokens from the input and generate new tokens with a language model. Token extraction is especially important when working with scientific articles that include a lot of out-of-vocabulary technical terms, which need to be copied directly. On the other hand, the language model has the ability to rewrite parts of the text and improve the fluency of the generated text. In this work we use only word tokens for the encoding of sequences, unlike other approaches that make use of subword units Subramanian et al. (2019); Liu and Lapata (2019).

Figure 1: Architecture of the core Pointer-generator model. For each decoder timestep the model has a probability to either generate words from a fixed vocabulary or copy words from the source text.

Incorporating rotational units of memory (RUM) into a sequence-to-sequence model can improve summarization results Dangovski et al. (2019). In particular, including RUM units in the model results in larger gradients during training, thus leading to more stable training and better convergence. In contrast, the gates of LSTM units typically have tanh activation functions and as a result the gradients very quickly become small despite using gradient clipping. We created a variant of our model, where we replaced the LSTM units of the decoder with RUM units. We decided to keep the LSTM units for the encoder, since it has been shown that a mixture of both unit types is usually advantageous Dangovski et al. (2019).

At test time, we autoregressively generate a summary for each section of the input text one word at a time using simple beam search decoding Graves (2012); Boulanger-Lewandowski et al. (2013). The generated summaries are then concatenated into a single summary for the whole article.

A common problem with this type of generative model is that the decoder in each decoding step has no way to remember the previously attended parts of the input and consequently may attend to the same inputs multiple times. During the prediction phase this is even worse, since at every decoding step we condition the decoder on the output of the previous step. As a result of this behavior the generated text may sometimes include repetitions and in certain situations the whole decoded sequence may end up as degenerate repetitive text. In order to deal with this issue multiple different approaches are used. We avoided using the coverage mechanism proposed in See et al. (2017), since this approach adds more complexity to the model and needs to be included during the training. Instead we opted for a simpler yet effective approach that tries to deal with repetition at the decoding phase and was proposed by Paulus et al. (2017). During beam search decoding we prevent the decoder from outputting the same trigram multiple times. In order to do this we set the output probability $p(y_t) = 0$ when outputting the token $y_t$ would create a trigram already existing in the generated hypothesis of the current beam.
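The trigram constraint can be sketched as a small filter applied to the candidate tokens of one beam hypothesis. This is an illustrative sketch rather than the decoding code used in the experiments; the function names and the representation of the output distribution as a token-to-probability dictionary are assumptions.

```python
def creates_repeated_trigram(hypothesis, candidate):
    """True if appending `candidate` repeats a trigram already in `hypothesis`."""
    if len(hypothesis) < 2:
        return False
    new_trigram = (hypothesis[-2], hypothesis[-1], candidate)
    seen = {tuple(hypothesis[i:i + 3]) for i in range(len(hypothesis) - 2)}
    return new_trigram in seen


def block_repeated_trigrams(hypothesis, probs):
    """Return a copy of `probs` with the probability of blocked tokens set to 0."""
    return {
        token: 0.0 if creates_repeated_trigram(hypothesis, token) else p
        for token, p in probs.items()
    }
```

The filter is applied independently to each beam, so a trigram blocked in one hypothesis can still be produced in another.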

Since the summarization of each section is independent of the other sections, our approach is highly parallelizable. At test time, we can very easily process all sections of the document in parallel and thus make the summary generation quite a lot faster. This can be particularly useful for systems that are trying to offer summarization as an online service for multiple users.
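Because the sections are independent, the prediction step maps naturally onto a worker pool. The sketch below uses Python's standard `concurrent.futures` module; `summarize_section` is a hypothetical stand-in for the trained model (here it just returns the first sentence), and the function names and the number of workers are illustrative assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def summarize_section(section_text):
    """Hypothetical stand-in for the trained model: returns the first sentence."""
    return section_text.split(". ")[0].rstrip(".") + "."


def summarize_document(sections, summarize=summarize_section, max_workers=4):
    """Summarize all sections in parallel and concatenate the partial summaries."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map returns results in the order of the input sections.
        partial_summaries = list(pool.map(summarize, sections))
    return " ".join(partial_summaries)
```

A thread pool is only one option; for a model served on a GPU, batching the sections of a document into a single forward pass would achieve a similar effect.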

4 Experimental setup

Here we describe the experiments we conducted using two different variants of our core model on two different datasets in order to demonstrate the effectiveness of our method. We first introduce the two datasets and present the details of the models we are using as well as the training and evaluation setup.

4.1 Data

We employed two large-scale publicly available summarization datasets that focus on scientific papers, namely arXiv and PubMed Cohan et al. (2018). The arXiv dataset was created directly from LaTeX files that were taken from the arXiv repository of electronic preprints (https://arxiv.org). The files were processed and converted to plain text using Pandoc (https://pandoc.org) to preserve section information. All citation markers and math formulas were replaced by special tokens. The resulting dataset includes approximately 215k documents with abstracts. The average full text length is 6,913 words and the average abstract length is 292 words. The PubMed dataset was created from the XML files that are part of the Open Access collection of the PubMed Central (PMC) repository. In contrast to the arXiv dataset, the citation markers were completely removed, while the math equations were converted to plain text. This dataset consists of approximately 133k documents with abstracts. The average full text length is 3,224 words and the average abstract length is 214 words.

In order to be comparable with published results on these datasets, we use the predefined training, validation and test set splits and do not perform any additional pre-processing steps. Both datasets are already processed in such a way that only the first level section headings are used as section information and all subsection headings are included as plain text. Also, all figures and tables have already been removed along with text styling options for both datasets. As discussed in Section 3, our method splits each document into multiple training examples based on the discourse structure of the document. As a result, we end up with a lot more training examples than documents. Detailed statistics for both datasets are presented in Table 2.

Statistic | arXiv | PubMed
# documents | 215k | 133k
# examples | 584,396 | 385,229
avg. document length (words) | 6,913 | 3,224
avg. summary length (words) | 292 | 214
avg. example length (words) | 1,018 | 639
avg. target length (words) | 69 | 69
Table 2: Statistics about the two datasets that are used in our experiments. Since we are creating multiple examples from each document the example and target lengths are much smaller than the document and summary lengths respectively.

4.2 Model details

Our core models are implemented in TensorFlow and are based on the implementation (https://github.com/abisee/pointer-generator) of See et al. (2017). The hyperparameter selection is similar to the setup suggested in See et al. (2017). Our first model employs a bidirectional LSTM layer of 256 units for the encoder and a unidirectional LSTM layer of 256 units for the decoder. For the second model we keep the encoder part the same but we replace the LSTM units of the decoder with RUM units. The RUM unit implementation is taken from the original code (https://github.com/rdangovs/rotational-unit-of-memory) of Dangovski et al. (2019). We restrict the vocabulary to 50,000 word tokens for both the input and output and use word embeddings of size 128. We do not use pretrained word embeddings, but rather learn them from scratch during training, as suggested in See et al. (2017).

Our models were trained on a single Nvidia 1080 GPU with a batch size of 16. We train all of our models using Adagrad Duchi et al. (2011) with a learning rate of 0.15 and initialize the accumulator to 0.1. We clip the gradients to have a maximum norm of 2, but avoid using any regularization. During training we regularly (every 3,000 steps) measure the loss and the ROUGE-1 F-score on the validation set in order to monitor the learning of our model. We end the training when the validation loss stops improving.

For the training, input sequences are truncated to 500 word tokens while padding the shorter ones with zeros to the same length. Similarly, the target sequences are truncated to 100 word tokens. We have found that it is preferable to train with the full length sequences from the beginning of the training rather than starting off with highly truncated sequences and then increasing the sequence length after convergence. This is in contrast to the common practice suggested in See et al. (2017). We believe one possible reason might be that training with very short and generic sequences first could lead the model to converge into a local optimum and have a hard time getting out of there once the sequence length is increased.
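The truncation and padding step can be sketched as follows. The limits of 500 and 100 tokens and the zero padding come from the text; the function names are hypothetical, and leaving the target unpadded is an assumption, since the text only states that targets are truncated.

```python
PAD_ID = 0  # padding with zeros, as described in the text


def truncate_and_pad(token_ids, max_len, pad_id=PAD_ID):
    """Clip a list of token ids to max_len and right-pad shorter lists."""
    clipped = token_ids[:max_len]
    return clipped + [pad_id] * (max_len - len(clipped))


def prepare_example(source_ids, target_ids, max_source_len=500, max_target_len=100):
    """Return the padded source and the truncated target for one training example."""
    source = truncate_and_pad(source_ids, max_source_len)
    target = target_ids[:max_target_len]  # target padding is left to the batching code
    return source, target
```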

For the prediction phase, we use beam search decoding with 4 beams and generate a maximum of 120 tokens per section. We are also using the mechanism described previously to avoid repeating the same trigrams. Once we have generated a summary for each section, we concatenate the generated summaries in order to get the final summary.

4.3 Baselines and state-of-the-art methods

We compare DANCER with several well known extractive and abstractive baselines as well as state-of-the-art methods. The baseline methods we are comparing against are a simple Lead-10 extractor, LexRank Erkan and Radev (2004), SumBasic Vanderwende et al. (2007), LSA Steinberger and Jezek (2004), Attention Seq2Seq Nallapati et al. (2016a); Chopra et al. (2016); Rush et al. (2015), Pointer-Generator Seq2Seq See et al. (2017) and Discourse-Aware Summarizer Cohan et al. (2018). Sent-CLF and Sent-PTR Subramanian et al. (2019) are state-of-the-art extractive methods and TLM-I+E Subramanian et al. (2019) is a state-of-the-art hybrid method that uses content selection and encoding combined with a transformer decoder.

5 Results and discussion

The results of our experiments on the arXiv and PubMed datasets are shown in Tables 3 and 4 respectively. We are reporting the full-length F-score of the ROUGE-1, ROUGE-2 and ROUGE-L metrics Lin (2004) computed using the official pyrouge package (https://pypi.org/project/pyrouge/0.1.3). All our reported ROUGE scores have a confidence interval of at most as reported by the official ROUGE script. The results of SumBasic, LexRank, LSA, Attention Seq2Seq, Pointer-Generator Seq2Seq and Discourse-Aware Summarizer are taken directly from Cohan et al. (2018), while the results of Sent-CLF, Sent-PTR and TLM-I+E come from Subramanian et al. (2019).

Model | Type | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1
SumBasic | Ext | 29.47 | 6.95 | 26.3
LexRank | Ext | 33.85 | 10.73 | 28.99
LSA | Ext | 29.91 | 7.42 | 25.67
Lead-10 | Ext | 35.52 | 10.33 | 31.44
Sent-CLF | Ext | 34.01 | 8.71 | 30.41
Sent-PTR | Ext | 42.32 | 15.63 | 38.06
Attention Seq2Seq | Abs | 29.3 | 6.00 | 25.56
Pointer-Generator | Mix | 32.06 | 9.04 | 25.16
Discourse-Aware | Mix | 35.8 | 11.05 | 31.8
TLM-I+E | Mix | 42.43 | 15.24 | 24.08
Our Models
DANCER LSTM | Mix | 41.87 | 15.92 | 37.61
DANCER RUM | Mix | 42.7 | 16.54 | 38.44
Table 3: ROUGE F1 results on arXiv test set. Underlined are the top performing models in each category (extractive and mixed) while bold is the overall top performing model. In this dataset DANCER RUM outperforms all other models including the top performing extractive models.

Model | Type | ROUGE-1 F1 | ROUGE-2 F1 | ROUGE-L F1
SumBasic | Ext | 37.15 | 11.36 | 33.43
LexRank | Ext | 39.19 | 13.89 | 34.59
LSA | Ext | 33.89 | 9.93 | 29.70
Lead-10 | Ext | 37.45 | 14.19 | 34.07
Sent-CLF | Ext | 45.01 | 19.91 | 41.16
Sent-PTR | Ext | 43.3 | 17.92 | 39.47
Attention Seq2Seq | Abs | 31.55 | 8.52 | 27.38
Pointer-Generator | Mix | 35.86 | 10.22 | 29.69
Discourse-Aware | Mix | 38.93 | 15.37 | 35.21
TLM-I+E | Mix | 41.43 | 15.89 | 24.32
Our Models
DANCER LSTM | Mix | 44.09 | 17.69 | 40.27
DANCER RUM | Mix | 43.98 | 17.65 | 40.25
Table 4: ROUGE F1 results on PubMed test set. Again underlined are the top performing models in each category (extractive and mixed) while bold is the overall top performing model. In this dataset the best extractive model outperforms all other models.

5.1 LSTM vs RUM

We first look at which of the two types of recurrent units, LSTM or RUM, leads DANCER to better results. We can see that the model that uses RUM units outperforms the one using LSTM units on the arXiv dataset, while performing slightly worse on PubMed. Given the observation that LSTM based models tend to copy more phrases from the source than RUM based models Dangovski et al. (2019), we hypothesize that the target abstracts in PubMed include a higher amount of text that is copied directly from the full text, compared to arXiv.

In order to validate this hypothesis we computed the percentage of n-grams in the target summaries that are copied from the source. In Figure 2 we show these percentages for both datasets. It is clear that the target abstracts in the PubMed dataset have a greater percentage of copied 2-grams, 3-grams and 4-grams compared to the arXiv dataset.

Figure 2: The percentage of n-grams that are copied directly from the source to the target summary for both datasets. The percentages are high for both datasets, but for the PubMed dataset we observe a higher percentage of copied 2-grams, 3-grams and 4-grams. This implies that the abstracts of the PubMed articles are in fact highly extractive and, as a result, this dataset favors extractive approaches more.
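The copy statistic described above can be sketched as follows. This is only an illustration under stated assumptions: tokens are obtained by lowercasing and splitting on whitespace, every n-gram occurrence in the summary is counted (rather than distinct n-grams), and the helper names `ngrams` and `copied_ngram_percentage` as well as the toy texts are hypothetical. The exact preprocessing behind Figure 2 is not specified here and may differ.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def copied_ngram_percentage(source, summary, n):
    """Percentage of summary n-grams that also occur somewhere in the source."""
    src_tokens = source.lower().split()
    sum_tokens = summary.lower().split()
    summary_ngrams = ngrams(sum_tokens, n)
    if not summary_ngrams:
        return 0.0
    source_ngrams = set(ngrams(src_tokens, n))
    copied = sum(1 for gram in summary_ngrams if gram in source_ngrams)
    return 100.0 * copied / len(summary_ngrams)


# Hypothetical toy example, not taken from either dataset.
source = "we propose a new model for long document summarization"
summary = "we propose a model for summarization"
for n in (1, 2, 3):
    print(n, copied_ngram_percentage(source, summary, n))
```

Averaging this value over all (article, abstract) pairs of a dataset, separately for each n, gives one bar per n-gram order, which is the kind of comparison Figure 2 reports.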

In addition, we found that when using a decoder with RUM units, the training is more stable than when using a decoder with LSTM units and converges steadily at a lower loss value. This is in line with the observation that RUM-based models exhibit larger gradients and as a result have more robust training compared to LSTM-based models Dangovski et al. (2019). On the other hand, we also found that models with a RUM-based decoder need more steps to converge to the final loss, compared to models with an LSTM-based decoder.

5.2 DANCER vs baselines and the state of the art

We can see that both DANCER variants outperform all baseline methods by a large margin in both datasets. This is an important achievement in itself, since the core model we are using is identical to the Pointer-Generator model. If DANCER is used alongside other more sophisticated models it may further improve results. We can also see that our models outperform the state-of-the-art mixed model TLM-I+E on almost every metric (the only exception being ROUGE-1 on arXiv, where the LSTM variant scores 41.87 against 42.43), despite using a much simpler model architecture.

The state-of-the-art extractive model Sent-CLF is worse than both variants of DANCER in arXiv, but better in PubMed. The state-of-the-art extractive model Sent-PTR is worse than the RUM variant of DANCER in both arXiv and PubMed, with the exception of the ROUGE-2 measure in PubMed. The LSTM variant of DANCER is better than Sent-PTR in the ROUGE-1 and ROUGE-L measures in PubMed and in the ROUGE-2 measure in arXiv. We therefore see a mixed picture when comparing DANCER with the state-of-the-art extractive models Sent-CLF and Sent-PTR.
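The per-metric comparison above can be recomputed directly from the scores in Tables 3 and 4. The short script below is a convenience sketch only: the scores are copied from the two tables, while the dictionary layout and the helper name `wins` are assumptions introduced for illustration.

```python
# ROUGE-1, ROUGE-2 and ROUGE-L F1 scores copied from Tables 3 and 4.
SCORES = {
    "arXiv": {
        "Sent-CLF": (34.01, 8.71, 30.41),
        "Sent-PTR": (42.32, 15.63, 38.06),
        "DANCER LSTM": (41.87, 15.92, 37.61),
        "DANCER RUM": (42.70, 16.54, 38.44),
    },
    "PubMed": {
        "Sent-CLF": (45.01, 19.91, 41.16),
        "Sent-PTR": (43.30, 17.92, 39.47),
        "DANCER LSTM": (44.09, 17.69, 40.27),
        "DANCER RUM": (43.98, 17.65, 40.25),
    },
}
METRICS = ("ROUGE-1", "ROUGE-2", "ROUGE-L")


def wins(dataset, model, baseline):
    """Return the metrics on which `model` scores strictly higher than `baseline`."""
    ours = SCORES[dataset][model]
    theirs = SCORES[dataset][baseline]
    return [m for m, a, b in zip(METRICS, ours, theirs) if a > b]


# Print, for every dataset and pairing, the metrics where DANCER is ahead.
for dataset in SCORES:
    for model in ("DANCER LSTM", "DANCER RUM"):
        for baseline in ("Sent-CLF", "Sent-PTR"):
            print(dataset, "|", model, "vs", baseline, "->", wins(dataset, model, baseline))
```

An empty list means the extractive model is ahead on all three metrics for that pairing.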

Going back to Figure 2, we notice that both datasets have a high percentage of text copied directly from the source, which explains the high performance of all extractive approaches, even simple ones like LexRank and Lead-10. It is usually much easier for extractive models to achieve higher ROUGE scores due to the way that ROUGE metrics are calculated. Since the metric is purely based on the overlap of the generated text with the target text, and in many cases the target summary includes parts that are copied from the source input, ROUGE scores clearly favor extractive summarization approaches.
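The effect can be seen with a simplified ROUGE-N F1 computed on a toy example. The sketch below is not the official ROUGE script: it assumes lowercased whitespace tokens with no stemming, the function names `ngram_counts` and `rouge_n_f1` are hypothetical, and the three sentences are invented for illustration. The "extractive" candidate copies wording that the reference also copies, while the "abstractive" candidate paraphrases the same content.

```python
from collections import Counter


def ngram_counts(text, n):
    """Multiset of n-grams over lowercased whitespace tokens."""
    tokens = text.lower().split()
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_f1(candidate, reference, n):
    """Simplified ROUGE-N F1 based on clipped n-gram overlap."""
    cand = ngram_counts(candidate, n)
    ref = ngram_counts(reference, n)
    # Counter intersection keeps the minimum count of each shared n-gram.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Invented sentences: the reference reuses the wording of a source sentence.
reference = "we present a novel method for the summarization of long documents"
extractive = "in this paper we present a novel method for the summarization of long documents"
abstractive = "this work introduces a new approach to summarize lengthy texts"

for name, candidate in (("extractive", extractive), ("abstractive", abstractive)):
    r1 = rouge_n_f1(candidate, reference, 1)
    r2 = rouge_n_f1(candidate, reference, 2)
    print(f"{name}: ROUGE-1 F1 = {r1:.3f}, ROUGE-2 F1 = {r2:.3f}")
```

Although both candidates convey roughly the same meaning, the paraphrase shares almost no n-grams with the reference and therefore scores far lower, which illustrates why overlap-based metrics reward copying.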

In the Appendix of this paper we present sample summaries for a couple of papers generated by our LSTM model trained on the arXiv dataset. These samples demonstrate the quality of the summaries we can produce using our proposed methods.

6 Conclusion

We presented DANCER, a novel summarization method for long documents. We focused on the summarization of academic articles, but the same method can easily be applied to different types of long documents, such as financial reports. We have demonstrated quantitatively through experiments on the arXiv and PubMed datasets that this method combined with a basic sequence-to-sequence model can outperform state-of-the-art summarization models.

We have also evaluated the advantages of using a combination of LSTM and RUM units inside the sequence-to-sequence model in terms of ROUGE F1 as well as training stability and convergence. We have found that including RUM units in the decoder of the model can lead to more stable training and better convergence, as well as improved ROUGE scores when the target sequence includes less text directly copied from the source sequence.

Overall, we have focused on the effectiveness of our proposed method regardless of the complexity of the core model. In future work we would like to combine DANCER with more complex summarization models that could potentially further improve summarization quality.

References


  • D. Bahdanau, K. H. Cho, and Y. Bengio (2015) Neural machine translation by jointly learning to align and translate. In 3rd International Conference on Learning Representations, ICLR 2015 - Conference Track Proceedings, Cited by: §1, §2.1.
  • N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent (2013) Audio chord recognition with recurrent neural networks. In Proceedings of the 14th International Society for Music Information Retrieval Conference, ISMIR 2013, pp. 335–340. External Links: ISBN 9780615900650 Cited by: §3.3.
  • A. Celikyilmaz, A. Bosselut, X. He, and Y. Choi (2018) Deep Communicating Agents for Abstractive Summarization. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), New Orleans, Louisiana, pp. 1662–1675. External Links: Document Cited by: §2.1, §2.1, §3.3.
  • Y. C. Chen and M. Bansal (2018) Fast abstractive summarization with reinforce-selected sentence rewriting. In ACL 2018 - 56th Annual Meeting of the Association for Computational Linguistics, Proceedings of the Conference (Long Papers), pp. 675–686. External Links: ISBN 9781948087322, Document Cited by: §1, §2.1.
  • S. Chopra, M. Auli, and A. M. Rush (2016) Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 93–98. Cited by: §1, §2.1, §2.1, §4.3.
  • A. Cohan and N. Goharian (2015) Scientific article summarization using citation-context and article’s discourse structure. In Conference Proceedings - EMNLP 2015: Conference on Empirical Methods in Natural Language Processing, pp. 390–400. External Links: ISBN 9781941643327, Document Cited by: §2.2.
  • A. Cohan, F. Dernoncourt, D. S. Kim, T. Bui, S. Kim, W. Chang, and N. Goharian (2018) A Discourse-Aware Attention Model for Abstractive Summarization of Long Documents. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 2 (Short Papers), pp. 615–621. External Links: Document Cited by: §1, §1, §1, §2.1, §2.2, §2.2, §3.3, §4.1, §4.3, §5.
  • A. Cohan and N. Goharian (2018) Scientific document summarization via citation contextualization and scientific discourse. International Journal on Digital Libraries. External Links: Document, ISSN 14321300 Cited by: §2.2.
  • E. Collins, I. Augenstein, and S. Riedel (2017) A Supervised Approach to Extractive Summarisation of Scientific Papers. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 195–205. External Links: Document Cited by: §1, §2.1, §2.2.
  • R. Dangovski, L. Jing, P. Nakov, M. Tatalović, and M. Soljačić (2019) Rotational Unit of Memory: A Novel Representation Unit for RNNs with Scalable Applications. Transactions of the Association for Computational Linguistics. External Links: Document, ISSN 2307-387X Cited by: §2.1, §2.2, §3.3, §4.2, §5.1, §5.1.
  • R. Dangovski, L. Jing, and M. Soljačić (2018) Rotational unit of memory. In 6th International Conference on Learning Representations, ICLR 2018 - Workshop Track Proceedings, Cited by: §2.1.
  • J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research 12 (Jul), pp. 2121–2159. Cited by: §4.2.
  • M. El-Haj (2019) MultiLing 2019: Financial Narrative Summarisation. Assoc. for Computational Linguistics Bulgaria, pp. 6–10. External Links: Document Cited by: §3.2.
  • G. Erkan and D. R. Radev (2004) LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research. External Links: Document, ISSN 10769757 Cited by: §4.3.
  • S. Gehrmann, Y. Deng, and A. Rush (2019) Bottom-Up Abstractive Summarization. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4098–4109. External Links: Document Cited by: §1, §2.1.
  • A. Gidiotis and G. Tsoumakas (2019) Structured Summarization of Academic Publications. arXiv preprint arXiv:1905.07695. Cited by: §1, §1, §2.1, §2.2, §3.1, §3.2, §3.3.
  • A. Graves (2012) Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711. Cited by: §3.3.
  • M. Grusky, M. Naaman, and Y. Artzi (2018) Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 708–719. Cited by: §1, §1, §2.1.
  • K. M. Hermann, T. Kocisky, E. Grefenstette, L. Espeholt, W. Kay, M. Suleyman, and P. Blunsom (2015) Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701. Cited by: §1, §2.1.
  • Y. Keneshloo, N. Ramakrishnan, and C. K. Reddy (2019) Deep transfer reinforcement learning for text summarization. In SIAM International Conference on Data Mining, SDM 2019, pp. 675–683. External Links: ISBN 9781611975673, Document Cited by: §2.1.
  • C. Lin (2004) Rouge: A package for automatic evaluation of summaries. Text Summarization Branches Out. Cited by: §3.1, §5.
  • Y. Liu and M. Lapata (2019) Text Summarization with Pretrained Encoders. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 3721–3731. External Links: Document Cited by: §2.1, §3.3.
  • R. Nallapati, B. Zhou, C. dos Santos, C. Gulcehre, and B. Xiang (2016a) Abstractive Text Summarization using Sequence-to-sequence RNNs and Beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, pp. 280–290. Cited by: §1, §2.1, §2.1, §4.3.
  • R. Nallapati, B. Zhou, and M. Ma (2016b) Classify or select: Neural architectures for extractive document summarization. arXiv preprint arXiv:1611.04244. Cited by: §2.1.
  • C. Napoles, M. Gormley, and B. Van Durme (2012) Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pp. 95–100. Cited by: §1, §1.
  • R. Paulus, C. Xiong, and R. Socher (2017) A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §1, §2.1, §2.1, §2.1, §3.3, §3.3.
  • V. Qazvinian, D. R. Radev, S. M. Mohammad, B. Dorr, D. Zajic, M. Whidby, and T. Moon (2013) Generating extractive summaries of scientific paradigms. Journal of Artificial Intelligence Research. External Links: Document, ISSN 10769757 Cited by: §1, §2.2.
  • S. J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, and V. Goel (2017) Self-critical sequence training for image captioning. In Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, pp. 7008–7024. External Links: ISBN 9781538604571, Document Cited by: §2.1.
  • A. M. Rush, S. Chopra, and J. Weston (2015) A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685. Cited by: §2.1, §4.3.
  • E. Sandhaus (2008) The new york times annotated corpus. Cited by: §1, §1.
  • A. See, P. J. Liu, and C. D. Manning (2017) Get To The Point: Summarization with Pointer-Generator Networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Stroudsburg, PA, USA, pp. 1073–1083. External Links: Document Cited by: §1, §1, §2.1, §2.1, §3.3, §3.3, §4.2, §4.2, §4.3.
  • R. Socher (2020) Boiling the Information Ocean. External Links: Link Cited by: §1.
  • J. Steinberger and K. Jezek (2004) Using latent semantic analysis in text summarization and summary evaluation. Proc. ISIM 4, pp. 93–100. Cited by: §4.3.
  • S. Subramanian, R. Li, J. Pilault, and C. Pal (2019) On Extractive and Abstractive Neural Document Summarization with Transformer Language Models. arXiv preprint arXiv:1909.03186. Cited by: §1, §1, §1, §2.1, §2.2, §3.1, §3.3, §4.3, §5.
  • I. Sutskever, O. Vinyals, and Q. V. Le (2014) Sequence to sequence learning with neural networks. In Advances in neural information processing systems, pp. 3104–3112. Cited by: §1, §2.1.
  • L. Vanderwende, H. Suzuki, C. Brockett, and A. Nenkova (2007) Beyond SumBasic: Task-focused summarization with sentence simplification and lexical expansion. Information Processing and Management. External Links: Document, ISSN 03064573 Cited by: §4.3.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008. Cited by: §2.1.
  • R. J. Williams and D. Zipser (1995) Gradient-based learning algorithms for recurrent networks and their computational complexity. Backpropagation: Theory, architectures, and applications 1, pp. 433–486. Cited by: §3.1.

7 Appendix

In order to demonstrate the good quality of the abstracts produced by our method, we generated summaries from a couple of notable papers in our field. The model used to generate those summaries is the LSTM based model trained on the arXiv dataset. We also provide the original abstract of each paper for reference and comparison purposes.

7.1 BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

Original abstract: We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers. Unlike recent language representation models (Peters et al., 2018a; Radford et al., 2018), BERT is designed to pre-train deep bidirectional representations from unlabeled text by jointly conditioning on both left and right context in all layers. As a result, the pre-trained BERT model can be fine-tuned with just one additional output layer to create state-of-the-art models for a wide range of tasks, such as question answering and language inference, without substantial task-specific architecture modifications. BERT is conceptually simple and empirically powerful. It obtains new state-of-the-art results on eleven natural language processing tasks, including pushing the GLUE score to 80.5% (7.7% point absolute improvement), MultiNLI accuracy to 86.7% (4.6% absolute improvement), SQuAD v1.1 question answering Test F1 to 93.2 (1.5 point absolute improvement) and SQuAD v2.0 Test F1 to 83.1 (5.1 point absolute improvement).

Generated summary: Language model pre-training has been shown to be effective for improving many natural language processing tasks (Dai and Le, 2015; Radford et al., 2018). The major limitation is that standard language models are unidirectional, and this limits the choice of architectures that can be used during pre-training. We argue that current techniques restrict the power of the pre-trained representations, especially for the fine-tuning approaches. The BERT model is trained on unlabeled data over different pre-training tasks. For finetuning, the BERT model is first initialized with the pre-trained parameters, and all of the parameters are fine-tuned using labeled data from the downstream tasks. These results enable the same pre-trained model to successfully tackle a broad set of NLP tasks. In particular, these findings enable even low-resource tasks to benefit from deep unidirectional architectures. Our major contribution is further generalizing these findings to deep bidirectional architectures.

7.2 Neural Machine Translation by Jointly Learning to Align and Translate

Original abstract: Neural machine translation is a recently proposed approach to machine translation. Unlike the traditional statistical machine translation, the neural machine translation aims at building a single neural network that can be jointly tuned to maximize the translation performance. The models proposed recently for neural machine translation often belong to a family of encoder–decoders and encode a source sentence into a fixed-length vector from which a decoder generates a translation. In this paper, we conjecture that the use of a fixed-length vector is a bottleneck in improving the performance of this basic encoder–decoder architecture, and propose to extend this by allowing a model to automatically (soft-)search for parts of a source sentence that are relevant to predicting a target word, without having to form these parts as a hard segment explicitly. With this new approach, we achieve a translation performance comparable to the existing state-of-the-art phrase-based system on the task of English-to-French translation. Furthermore, qualitative analysis reveals that the (soft-)alignments found by the model agree well with our intuition.

Generated summary: Neural machine translation is a newly emerging approach to machine translation, recently proposed by Kalchbrenner and Blunsom (2013), Sutskever et al. (2014) and Cho et al., (2003) which consists of the encoder and the decoder for a language pair that are jointly trained to maximize the probability of a correct translation given a source sentence. We show that the proposed approach of jointly learning to align and translate achieves significantly improved translation performance. The performance of the RNNsearch is as high as that of the conventional phrase-based translation system (Moses), when only the sentences consisting of known words are considered. This is a significant achievement, considering that Moses uses a separate monolingual corpus (418m words) in addition to the parallel corpora we used to train the RNNsearch and RNNencdec. One of the motivations behind the proposed approach was the basic encoder-decoder approach to underperform with long sentences. We show that the proposed approach provides an intuitive way to inspect the (soft-) alignment between the words in a generated translation and those in a source sentence. This is done by visualizing the annotation weights. In this paper, we propose a novel approach to neural machine translation, called an encoder-decoder approach, encodes a whole input sentence into a fixed-length vector from which a translation will be decoded. We conjectured that the proposed RNNsearch outperforms the conventional encoder-decoder model (RNNencdec) significantly, regardless of the sentence length and that it is much more robust to the length of a source sentence.