Self-Supervised Learning for Contextualized Extractive Summarization

Existing models for extractive summarization are usually trained from scratch with a cross-entropy loss, which does not explicitly capture the global context at the document level. In this paper, we aim to improve this task by introducing three auxiliary pre-training tasks that learn to capture document-level context in a self-supervised fashion. Experiments on the widely used CNN/DM dataset validate the effectiveness of the proposed auxiliary tasks. Furthermore, we show that after pre-training, a clean model with simple building blocks is able to outperform previous carefully designed state-of-the-art models.




1 Introduction

Extractive summarization aims at shortening the original article while retaining the key information by selecting sentences from the original article. This paradigm has been proven effective by many previous systems Carbonell and Goldstein (1998); Mihalcea and Tarau (2004); McDonald (2007); Cao et al. (2015). In order to decide whether to choose a particular sentence, the system should have a global view of the document context, e.g., the subject and structure of the document. However, previous works Nallapati et al. (2017); Al-Sabahi et al. (2018); Zhou et al. (2018); Zhang et al. (2018) usually directly build an end-to-end training system that learns to choose sentences without explicitly modeling the document context, relying on the assumption that the system can automatically learn the document-level context.

Figure 1: An example of the Mask pre-training task. A sentence is masked in the original paragraph, and the model is required to predict the missing sentence from the candidate sentences.

We argue that it is hard for these end-to-end systems to learn to leverage the document context from scratch due to the difficulty of this task, and that a well pre-trained embedding model that incorporates document context should help. In recent years, extensive work Pennington et al. (2014); Nie and Bansal (2017); Lin et al. (2017); Peters et al. (2018); Devlin et al. (2018); Subramanian et al. (2018); Cer et al. (2018); Logeswaran and Lee (2018); Pagliardini et al. (2018) has been done on learning word or sentence representations, but most of it uses only a sentence or a few sentences when learning the representation, so the document context can hardly be included in the representation. Hence, we introduce new pre-training methods that take the whole document into consideration to learn contextualized sentence representations with self-supervision.

Self-supervised learning Raina et al. (2007); Doersch et al. (2015); Agrawal et al. (2015); Wang and Gupta (2015) is a recently emerged paradigm that aims to learn from the intrinsic structure of the raw data. The general framework is to construct training signals directly from the structured raw data and use them to train the model. The structural information learned through this process can then be easily transferred to benefit other tasks. Thus self-supervised learning has been widely applied to structured data such as text Okanohara and Tsujii (2007); Collobert and Weston (2008); Peters et al. (2018); Devlin et al. (2018); Wu et al. (2019) and images Doersch et al. (2015); Agrawal et al. (2015); Wang and Gupta (2015); Lee et al. (2017). Since documents are well organized and structured, it is intuitive to employ the power of self-supervised learning to learn the intrinsic structure of the document and model the document-level context for the summarization task.

In this paper, we propose three self-supervised tasks (Mask, Replace, and Switch) through which the model is required to learn the document-level structure and context. The knowledge about the document learned during the pre-training process is then transferred to benefit the summarization task. In particular, the Mask task randomly masks some sentences and predicts the missing sentence from a candidate pool; the Replace task randomly replaces some sentences with sentences from other documents and predicts whether a sentence is replaced; the Switch task switches some sentences within the same document and predicts whether a sentence is switched. An illustrative example is shown in Figure 1, where the model is required to take the document context into account in order to predict the missing sentence. To verify the effectiveness of the proposed methods, we conduct experiments on the CNN/DM dataset Hermann et al. (2015); Nallapati et al. (2016) based on a hierarchical model. We demonstrate that all three pre-training tasks perform better and converge faster than the basic model, one of which even outperforms the state-of-the-art extractive method NeuSum Zhou et al. (2018).

The contributions of this work include:

To the best of our knowledge, we are the first to consider using the whole document to learn contextualized sentence representations with self-supervision and without any human annotations.

We introduce and experiment with various self-supervised approaches for extractive summarization, one of which achieves the new state-of-the-art results with a basic hierarchical model.

Benefiting from the self-supervised pre-training, the summarization model is more sample efficient and converges much faster than those trained from scratch.

Figure 2: The structure of the Basic Model. We use an LSTM and a self-attention module to encode each sentence and the document, respectively. For each sentence, the figure shows its word embeddings together with its independent and document-involved (contextualized) sentence embeddings.

2 Model and Pre-training Methods

2.1 Basic Model

As shown in Figure 2, our basic model for extractive summarization is mainly composed of two parts: a sentence encoder and a document-level self-attention module. The sentence encoder is a bidirectional LSTM Hochreiter and Schmidhuber (1997), which encodes each individual sentence (a sequence of words) and whose output vector at the last step is viewed as the sentence representation. Given the representations of all the sentences, a self-attention module Vaswani et al. (2017) is employed to incorporate document-level context and learn a contextualized sentence representation for each sentence. (We leave the combination of different architectures, such as replacing the self-attention module with an LSTM, for future work.) Finally, a linear layer is applied to predict whether to choose each sentence to form the summary.
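The scoring pipeline described above can be sketched in plain NumPy (a minimal illustration, not the paper's implementation: the bidirectional LSTM is replaced by mean pooling over word vectors, the multi-head Transformer layer by single-head attention, and all function and variable names are our own):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def encode_sentences(sentence_word_vecs):
    # Stand-in for the bidirectional LSTM: mean-pool each sentence's
    # word vectors into one independent sentence embedding.
    return np.stack([wv.mean(axis=0) for wv in sentence_word_vecs])

def self_attention(S, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the sentences of one
    # document, producing document-contextualized sentence embeddings.
    Q, K, V = S @ Wq, S @ Wk, S @ Wv
    A = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return A @ V

rng = np.random.default_rng(0)
d = 8                                       # embedding size (illustrative)
sentences = [rng.normal(size=(n, d)) for n in (5, 7, 4)]  # 3 sentences
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
w_out = rng.normal(size=d)                  # linear layer for selection scores

S = encode_sentences(sentences)             # (3, d) independent embeddings
H = self_attention(S, Wq, Wk, Wv)           # (3, d) contextualized embeddings
scores = H @ w_out                          # one selection score per sentence
print(scores.shape)
```

Each sentence thus receives a score that depends on every other sentence in the document through the attention weights, which is the property the pre-training tasks below exploit.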

2.2 Self-supervised Pre-training Methods

In this section, we describe three self-supervised pre-training approaches. By solving each pre-training task, the model is expected to learn a document-level contextualized sentence embedding model from the raw documents, which is then used to solve the downstream summarization task. Note that we only pre-train the sentence encoder and the document-level self-attention module of the basic model.


Similar to the task of predicting a missing word, the Mask task is to predict a masked sentence from a candidate pool. Specifically, we first mask each sentence within a document with a fixed probability and put the masked sentences into a candidate pool $C$. The model is required to predict the correct sentence from the pool for each masked position $i$. We replace the sentence in the masked position with a special token $\langle$unk$\rangle$ and compute its document-contextualized sentence embedding $\hat{d}_i$. We use the same sentence encoder as in the basic model to obtain the embedding $s_j$ of each candidate sentence in $C$, and score each candidate by the cosine similarity:

$$\mathrm{score}(i, j) = \cos(\hat{d}_i, s_j)$$

To train the model, we adopt a ranking loss that maximizes the margin between the gold sentence and the other candidates:

$$\mathcal{L}_{\mathrm{mask}} = \max\big(0,\; \gamma - \cos(\hat{d}_i, s_{j^*}) + \cos(\hat{d}_i, s_{j'})\big)$$

where $\gamma$ is a tuned hyper-parameter, $j^*$ indexes the gold sentence in $C$ for the masked position $i$, and $j'$ indexes another, non-target sentence in $C$.
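The cosine scoring and margin ranking loss described here can be sketched as follows (a minimal NumPy illustration with made-up embeddings; `margin` plays the role of the tuned hyper-parameter, and all names are our own):

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def mask_ranking_loss(context_emb, gold_emb, negative_emb, margin=0.5):
    # Hinge loss pushing the gold candidate's cosine score above a
    # non-target candidate's score by at least `margin`.
    return max(0.0, margin - cosine(context_emb, gold_emb)
                      + cosine(context_emb, negative_emb))

rng = np.random.default_rng(1)
ctx = rng.normal(size=16)                 # contextualized embedding at the mask
gold = ctx + 0.1 * rng.normal(size=16)    # gold candidate: close to ctx
neg = rng.normal(size=16)                 # negative candidate: unrelated

loss = mask_ranking_loss(ctx, gold, neg)
print(loss)
```

In training, the loss would be summed over all masked positions and all non-target candidates in the pool.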


The Replace task is to randomly replace some sentences in the document (each with a fixed probability) with sentences from other documents, and then predict whether each sentence has been replaced. In particular, we use sentences from randomly chosen documents to form a candidate pool $C_r$. Each sentence in the document is replaced, with the given probability, by a random sentence from $C_r$. Let $R$ be the set of positions whose sentences are replaced. We use a linear layer on the document-contextualized sentence embedding $\hat{d}_i$ to predict the probability $\hat{y}_i$ that the sentence in position $i$ has been replaced, and minimize the MSE loss:

$$\mathcal{L}_{\mathrm{replace}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $y_i = 1$ if $i \in R$ (i.e., the sentence in position $i$ has been replaced), and $y_i = 0$ otherwise.
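Constructing a Replace training example from raw documents can be sketched as follows (a simplified illustration under our own naming; the pool here stands in for sentences drawn from other documents):

```python
import random

def make_replace_example(doc, pool, p=0.25, seed=0):
    """Replace each sentence of `doc` with probability `p` by a random
    sentence from `pool` (sentences taken from other documents).
    Returns the corrupted document and 0/1 labels (1 = replaced)."""
    rng = random.Random(seed)
    corrupted, labels = [], []
    for sent in doc:
        if rng.random() < p:
            corrupted.append(rng.choice(pool))
            labels.append(1)
        else:
            corrupted.append(sent)
            labels.append(0)
    return corrupted, labels

doc = ["s1", "s2", "s3", "s4", "s5", "s6"]
pool = ["other-a", "other-b", "other-c"]
corrupted, labels = make_replace_example(doc, pool, p=0.5)
print(corrupted, labels)
```

The labels then serve as the regression targets for the per-position MSE loss.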


The Switch task is similar to the Replace task. Instead of filling the selected positions with sentences from outside the document, this task uses sentences from within the same document by switching the selected sentences, i.e., each selected sentence is moved to another position within the same document. Let $S$ be the set of positions whose sentences are switched. As before, we use a linear layer to predict the probability $\hat{y}_i$ that the sentence in position $i$ has been switched, and minimize the MSE loss:

$$\mathcal{L}_{\mathrm{switch}} = \frac{1}{n} \sum_{i=1}^{n} (\hat{y}_i - y_i)^2$$

where $y_i = 1$ if $i \in S$, and $y_i = 0$ otherwise.
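A Switch example can be built by pairing up the selected positions and swapping their sentences. The sketch below is one plausible construction with our own names; the paper does not specify the exact pairing scheme:

```python
import random

def make_switch_example(doc, p=0.25, seed=0):
    """Select positions with probability `p`, pair them up, and swap the
    sentences in each pair. Returns the shuffled document and 0/1 labels
    (1 = the sentence at this position was moved)."""
    rng = random.Random(seed)
    picked = [i for i in range(len(doc)) if rng.random() < p]
    if len(picked) % 2 == 1:        # need an even number to form pairs
        picked.pop()
    switched = list(doc)
    labels = [0] * len(doc)
    rng.shuffle(picked)
    for a, b in zip(picked[::2], picked[1::2]):
        switched[a], switched[b] = switched[b], switched[a]
        labels[a] = labels[b] = 1
    return switched, labels

doc = ["s1", "s2", "s3", "s4", "s5", "s6", "s7", "s8"]
switched, labels = make_switch_example(doc, p=0.5)
print(switched, labels)
```

Unlike Replace, the corrupted document contains exactly the original sentences, so the model must rely on ordering and discourse coherence, not on topical mismatch, to detect the corruption.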

3 Experiment

To show the effectiveness of the pre-training methods (Mask, Replace, and Switch), we conduct experiments on the commonly used CNN/DM dataset Hermann et al. (2015); Nallapati et al. (2016), and compare them with a popular baseline, Lead3 See et al. (2017), which selects the first three sentences as the summary, and with the state-of-the-art extractive summarization method NeuSum Zhou et al. (2018), which jointly scores and selects sentences using a pointer network.
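The Lead3 baseline is trivial to state in code (a minimal sketch operating on pre-split sentences; the function name is our own):

```python
def lead3(article_sentences):
    """Return the first three sentences of the article as the summary."""
    return article_sentences[:3]

article = ["First sentence.", "Second sentence.", "Third sentence.",
           "Fourth sentence."]
print(lead3(article))
```

Despite its simplicity, Lead3 is a strong baseline on CNN/DM because news articles tend to front-load their key information.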

Figure 3: The Rouge-2 score for each pre-training method and the basic model on the development set during the training process. Results for the Rouge-1 and Rouge-L scores are in Appendix A.2.

3.1 On CNN/DM Dataset

Model and training details

We use the rule-based system from Zhou et al. (2018) to label the sentences in a document, i.e., sentences to be extracted are labeled as 1. The Rouge score Lin (2004) is used to evaluate the performance of the model (we use PyRouge to compute it), and we report Rouge-1, Rouge-2, and Rouge-L as in prior work. Pre-trained GloVe embeddings Pennington et al. (2014) are used to initialize the word embeddings. A one-layer bidirectional LSTM Hochreiter and Schmidhuber (1997) is used as the sentence encoder, and a five-layer Transformer encoder Vaswani et al. (2017) is used as the document-level self-attention module. A linear classification layer is used to predict whether to choose each sentence.

The training process consists of two phases. First, we use a pre-training task to pre-train the basic model on the raw articles from the CNN/DM dataset, without labels. Second, we fine-tune the pre-trained model for the extractive summarization task using the sentence labels, with a smaller learning rate than in the pre-training phase. We train each pre-training task until it converges or the number of training epochs reaches an upper bound. We set the probability to mask, replace, or switch sentences to 0.25.

Method Rouge-1 Rouge-2 Rouge-L
Basic 41.07 18.95 37.56
LEAD3 39.93 17.62 36.21
NeuSum 41.18 18.84 37.61
Mask 41.15 19.06 37.65
Replace 41.21 19.08 37.73
Switch 41.36 19.20 37.86
SentEnc 41.17 19.04 37.69
Switch 0.15 41.35 19.18 37.85
Switch 0.35 41.27 19.12 37.77
Table 1: The Rouge Lin (2004) scores for the basic model, the baselines, the pre-training methods, and the analytic experiments. All of our Rouge scores are reported with the confidence intervals computed by the official ROUGE script. The best result for each score is marked in bold, and results that are not significantly worse than the best are also highlighted.


We show the Rouge scores on the development set during the training process in Figure 3, and present the best Rouge score for each method in Table 1. All pre-training methods improve the performance compared with the Basic model. In particular, the Switch method achieves the best result on all three evaluations among the pre-training methods, and is even better than the state-of-the-art extractive model NeuSum. (We use the authors' released code to train NeuSum and evaluate it using our evaluation script; results using their script, which only covers Rouge-1 and Rouge-2, are given in Appendix A.1.)

In terms of convergence, each of the Mask, Replace, and Switch tasks spends some epochs in the pre-training phase, but then reaches its best performance in fewer fine-tuning epochs than the basic model needs when trained from scratch. From Figure 3, we can see that the Switch task converges much faster than the basic model. Even counting the epochs spent in the pre-training phase, the Switch method takes roughly the same total number of epochs as the Basic model to achieve its best performance.

3.2 Ablation Study

Reuse only the sentence encoder

Our basic model mainly has two components: a sentence encoder and a document-level self-attention module. The sentence encoder focuses on each individual sentence, while the document-level self-attention module incorporates more document information. To investigate the role of the document-level self-attention module, we reuse only the sentence encoder of the pre-trained model and randomly initialize the document-level self-attention module. The result is shown in Table 1 as SentEnc. We can see that using the whole pre-trained model (Switch) achieves better performance, which indicates that the model learns some useful document-level information from the pre-training task. We also notice that reusing only the sentence encoder still yields some improvement over the basic model, which suggests that the pre-training task also helps to learn better independent sentence representations.

On the sensitivity of hyper-parameter

In this part, we investigate the sensitivity of the model to the important hyper-parameter of the Switch task, i.e., the probability of switching sentences. In the previous experiments, we switched sentences with probability 0.25. We further try probabilities of 0.15 and 0.35, and show the results in Table 1 as Switch 0.15 and Switch 0.35. We can see that Switch 0.15 achieves basically the same result as Switch, and Switch 0.35 is slightly worse. Thus the model is not very sensitive to this hyper-parameter, and a switching probability between 0.15 and 0.35 should work well.

4 Conclusion

In this paper, we propose three self-supervised tasks that force the model to learn about the document context, which benefits the downstream summarization task. Experiments on the CNN/DM dataset verify that by pre-training on our proposed tasks, the model performs better and converges faster when learning the summarization task. In particular, with the Switch pre-training task, the model even outperforms the state-of-the-art method NeuSum Zhou et al. (2018). Further analytic experiments show that the document context learned by the document-level self-attention module benefits the model in the summarization task, and that the model is not very sensitive to the probability of switching sentences.


Appendix A Appendix

(a) Rouge-1
(b) Rouge-L
Figure 4: The Rouge-1 and Rouge-L scores for each pre-training method and the basic model on the development set during the training process.

A.1 Evaluation results using scripts from NeuSum

Method Rouge-1 Rouge-2
Basic 41.13 18.97
Mask 41.21 19.07
Replace 41.27 19.09
Switch 41.41 19.22
LEAD3 39.98 17.63
NeuSum 41.23 18.85
Table 2: The Rouge Lin (2004) scores for the basic model, the pre-training methods, and the baselines, computed with the evaluation script from NeuSum. All of our Rouge scores are reported with the confidence intervals computed by the official ROUGE script. The best result for each score is marked in bold, and results that are not significantly worse than the best are also highlighted.

A.2 Rouge-1 and Rouge-L results

The Rouge-1 and Rouge-L results are shown in Figure 4, from which we can see that the Switch method achieves the best performance.