Abstractive Document Summarization without Parallel Data

by Nikola I. Nikolov, et al.
ETH Zurich

Abstractive summarization typically relies on large collections of paired articles and summaries. However, parallel data is scarce and costly to obtain. We develop an abstractive summarization system that relies only on access to large collections of example summaries and non-matching articles. Our approach consists of an unsupervised sentence extractor, which selects salient sentences to include in the final summary, and a sentence abstractor, trained on pseudo-parallel and synthetic data, which paraphrases each of the extracted sentences. We achieve promising results on the CNN/DailyMail benchmark without relying on any article-summary pairs.




1 Introduction

Text summarization aims to produce a shorter, informative version of an input text. While extractive summarization only selects important sentences from the input, abstractive summarization generates content without explicitly re-using whole sentences Nenkova et al. (2011), resulting in summaries that are more fluent. In recent years, a number of successful approaches have been proposed for both the extractive Nallapati et al. (2017); Narayan et al. (2018) and the abstractive See et al. (2017); Chen and Bansal (2018) summarization paradigms. State-of-the-art abstractive approaches are supervised, relying on large collections of paired articles and summaries. However, achieving competitive performance with abstractive systems remains a challenge when parallel data is limited, such as in low-resource domains or for languages other than English.

Even when parallel data is limited, we may have access to example summaries and to large collections of articles on similar topics. Examples are blog posts or scientific press releases, for which the original articles may be unavailable or behind a paywall.

In this paper, we develop a system for abstractive document summarization that only relies on having access to example summaries and non-matching articles, bypassing the need for large-scale parallel corpora. Our system consists of two components: An unsupervised sentence extractor first selects salient sentences. Each extracted sentence is subsequently paraphrased using a sentence abstractor. The abstractor is trained on pseudo-parallel data extracted from raw corpora, as well as on additional synthetic data generated through backtranslation. Our approach achieves promising results on the CNN/DailyMail single-document summarization benchmark (Section 5) without relying on any parallel document-summary pairs.

2 Background

Unsupervised summarization has a long history within the extractive summarization paradigm. Given an input article consisting of sentences, the goal of extractive summarization is to select the most salient sentences as the output summary, without employing any paraphrasing or fusion. A typical approach is to either weigh each sentence with respect to the document as a whole Radev et al. (2004) or through an adjacency-based measure of sentence importance Erkan and Radev (2004).

In recent years, there have been large advances in supervised abstractive summarization, for headline generation Rush et al. (2015); Nallapati et al. (2017) as well as for generation of multi-sentence summaries See et al. (2017). State-of-the-art approaches are typically trained to generate summaries either in a fully end-to-end fashion See et al. (2017), processing the entire article at once; or hierarchically, first extracting content and then paraphrasing it sentence-by-sentence Chen and Bansal (2018). Both approaches rely on large collections of article-summary pairs such as the annotated Gigaword Napoles et al. (2012) or the CNN/DailyMail Nallapati et al. (2016) dataset. The heavy reliance on manually created resources prohibits the use of abstractive summarization in domains other than news articles, or languages other than English, where parallel data may not be as abundantly available. In such areas, extractive summarization often remains the preferred choice.

Our work focuses on abstractive summarization using large-scale non-parallel resources, such as collections of summaries without matching articles. Recently, a number of methods have been proposed to reduce the need for parallel data: through harvesting pseudo-parallel data from raw corpora Nikolov and Hahnloser (2019) or by synthesizing data using backtranslation Sennrich et al. (2016a). Such methods have been shown to be viable for tasks such as unsupervised machine translation Lample et al. (2018), sentence compression Fevry and Phang (2018), and style transfer Lample et al. (2019). To the best of our knowledge, this work is the first to extend such methods to single-document summarization in order to generate multi-sentence abstractive summaries in a data-driven fashion.

3 Approach

Our system consists of two components: an extractor (Section 3.1) which picks salient sentences to include in the final summary; and an abstractor (Section 3.2) which subsequently paraphrases each of the extracted sentences, rewriting them to meet the target summary style.

Our approach is similar to Chen and Bansal (2018), except that they use parallel data to train their extractors and abstractors. In contrast, during training, we only assume access to example summaries without matching articles. During testing, given an input article, our system generates a multi-sentence abstractive summary consisting of K sentences (where K is a hyperparameter).
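The two-stage design can be sketched as a simple composition of an extractor and an abstractor. This is a minimal illustration, not the authors' implementation: the names `summarize`, `first_k`, and `toy_abstractor` are hypothetical, and the toy abstractor only truncates sentences where a trained model would paraphrase them.

```python
from typing import Callable, List

# Type aliases for the two components (names are illustrative assumptions).
Extractor = Callable[[List[str], int], List[str]]
Abstractor = Callable[[str], str]


def summarize(article: List[str], extractor: Extractor,
              abstractor: Abstractor, k: int) -> List[str]:
    """Select k salient sentences, then rewrite each one independently."""
    selected = extractor(article, k)
    return [abstractor(sentence) for sentence in selected]


def first_k(article: List[str], k: int) -> List[str]:
    """Stand-in extractor: keep the first k sentences of the article."""
    return article[:k]


def toy_abstractor(sentence: str) -> str:
    """Stand-in abstractor: a trained model would paraphrase the sentence;
    this placeholder only keeps its first five tokens."""
    return " ".join(sentence.split()[:5])


article = [
    "a tv legend returned to the stage on wednesday night .",
    "contestants were told to come on down .",
    "the weather was mild .",
]
summary = summarize(article, first_k, toy_abstractor, k=2)
```

Because the abstractor works one sentence at a time, either component can be swapped without retraining the other.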

3.1 Sentence extractor

The extractor selects the most salient article sentences to include in the summary. We consider two unsupervised variants for the extractor:


Lead picks the first K sentences from the article and returns them as the summary. For many datasets, such as CNN/DailyMail, Lead is a simple but tough baseline to beat, especially using abstractive methods See et al. (2017). Because Lead may not be the optimal choice for other domains or datasets, we experiment with another unsupervised extractive approach.


LexRank Erkan and Radev (2004) represents the input as a highly connected graph, in which vertices represent sentences, and edges between sentences are assigned weights equal to their TF-IDF similarity, provided that the similarity is higher than a predefined threshold. The centrality of a sentence is then computed using the PageRank algorithm.
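Both extractor variants can be sketched in a few lines of plain Python. The sketch below is an approximation under stated assumptions: whitespace tokenization, a smoothed IDF, and the default `threshold`, `damping`, and `iters` values are illustrative choices, not the settings used in the experiments.

```python
import math
from collections import Counter
from typing import Dict, List


def lead(sentences: List[str], k: int) -> List[str]:
    """Lead: return the first k sentences."""
    return sentences[:k]


def tfidf_vectors(sentences: List[str]) -> List[Dict[str, float]]:
    """Sparse TF-IDF vectors, treating each sentence as a document."""
    docs = [Counter(s.lower().split()) for s in sentences]
    df = Counter(word for doc in docs for word in doc)  # document frequency
    n = len(docs)
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}    # smoothed IDF (assumption)
    return [{w: tf * idf[w] for w, tf in doc.items()} for doc in docs]


def cosine(u: Dict[str, float], v: Dict[str, float]) -> float:
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def lexrank(sentences: List[str], k: int, threshold: float = 0.1,
            damping: float = 0.85, iters: int = 50) -> List[str]:
    """Rank sentences by PageRank over a thresholded similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = tfidf_vectors(sentences)
    # Edge weight = TF-IDF cosine similarity, kept only above the threshold.
    weights = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                sim = cosine(vecs[i], vecs[j])
                if sim > threshold:
                    weights[i][j] = sim
    out_weight = [sum(row) for row in weights]
    scores = [1.0 / n] * n
    for _ in range(iters):  # power iteration
        scores = [
            (1.0 - damping) / n + damping * sum(
                scores[i] * weights[i][j] / out_weight[i]
                for i in range(n) if out_weight[i] > 0.0)
            for j in range(n)
        ]
    top = sorted(range(n), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]  # restore article order


sentences = [
    "the cat sat on the mat",
    "the cat sat on the rug",
    "stocks fell sharply today",
]
selected = lexrank(sentences, k=2)
```

In this toy input the third sentence shares no words with the others, so it has no edges and receives only the baseline score. For simplicity, the sketch does not redistribute the score of such isolated sentences, as a full PageRank implementation would.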

3.2 Sentence abstractor

The sentence abstractor is trained to generate a paraphrase for every extracted article sentence, rewriting it to meet the target sentence style of the summaries. We implement the abstractor as an LSTM encoder-decoder with an attention mechanism Bahdanau et al. (2014). Instead of parallel examples of sentences from articles and summaries, the abstractor is trained on a synthetic dataset that is created in two steps:

Pseudo-parallel dataset.

The first step is to obtain an initial set of pseudo-parallel article-summary sentence pairs. Because we assume access to example summaries, our approach is to align summary sentences to an external collection of articles in the same format as our target summaries. Here, we apply the large-scale alignment method from Nikolov and Hahnloser (2019), which hierarchically aligns documents followed by sentences in the two datasets. The alignment is implemented through nearest neighbour search of document and sentence embeddings.
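The nearest neighbour step can be illustrated with a small sketch. It assumes embeddings have already been computed and shows only one level of the search (the same routine could be applied first to documents and then to sentences). The function name `align` and the similarity threshold are illustrative assumptions, not the interface of the alignment method cited above.

```python
import math
from typing import List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def align(summary_vecs: List[Sequence[float]],
          article_vecs: List[Sequence[float]],
          threshold: float) -> List[Tuple[int, int, float]]:
    """For each summary embedding, find its nearest article embedding and
    keep the pair only if the similarity reaches the threshold.

    Returns (article_index, summary_index, similarity) triples.
    """
    pairs = []
    if not article_vecs:
        return pairs
    for s_idx, s_vec in enumerate(summary_vecs):
        sims = [cosine(s_vec, a_vec) for a_vec in article_vecs]
        a_idx = max(range(len(sims)), key=sims.__getitem__)
        if sims[a_idx] >= threshold:
            pairs.append((a_idx, s_idx, sims[a_idx]))
    return pairs


# Toy 2-d embeddings; real embeddings would come from a sentence encoder.
summary_vecs = [[1.0, 0.0], [0.0, 1.0]]
article_vecs = [[0.9, 0.1], [1.0, 1.0], [-1.0, 0.0]]
pairs = align(summary_vecs, article_vecs, threshold=0.9)
```

Raising the threshold yields fewer but cleaner pairs, and lowering it yields more but noisier ones. This exhaustive search is quadratic; at the scale of millions of sentences an approximate nearest neighbour index would be needed.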

Backtranslated pairs.

We use the initial pseudo-parallel dataset to train a backtranslation model, following Sennrich et al. (2016a). The model learns to synthesize "fake" article sentences given a summary sentence. We use the backtranslation model to generate multiple synthetic article sentences for each summary sentence we have available, taking the top N hypotheses predicted by beam search (we also experimented with sampling Edunov et al. (2018) but found it to be too noisy in the current setting). To train our final sentence paraphrasing model, we combine all pseudo-parallel and backtranslated pairs into a single dataset of article-summary sentence pairs.
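The way the two data sources are combined can be sketched as follows. The `backtranslate` callable stands in for a trained backtranslation model returning its best beam hypotheses; it, `build_training_pairs`, and `toy_backtranslate` are hypothetical names used only for illustration.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (article sentence, summary sentence)


def build_training_pairs(
    pseudo_parallel: List[Pair],
    summary_sentences: List[str],
    backtranslate: Callable[[str, int], List[str]],
    n_best: int,
) -> List[Pair]:
    """Merge pseudo-parallel pairs with backtranslated pairs.

    For every summary sentence, the backtranslation model proposes up to
    n_best synthetic article sentences; each one becomes the source side
    of a new training pair whose target is the real summary sentence.
    """
    pairs = list(pseudo_parallel)
    for summary in summary_sentences:
        for fake_article in backtranslate(summary, n_best)[:n_best]:
            pairs.append((fake_article, summary))
    return pairs


def toy_backtranslate(summary: str, n_best: int) -> List[str]:
    """Placeholder for a trained model: pads the summary with filler text."""
    return [f"{summary} , officials said ({i})" for i in range(n_best)]


pseudo = [("the show returned on wednesday night .", "show returns .")]
summaries = ["show returns .", "host steps down ."]
dataset = build_training_pairs(pseudo, summaries, toy_backtranslate, n_best=5)
```

The target side of every pair is always a genuine summary sentence; only the source side is synthetic, so the abstractor still learns to produce real summary-style text.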

4 Experimental set-up


We use the CNN/DailyMail (CNN/DM) dataset Hermann et al. (2015) consisting of pairs of news articles from CNN and Daily Mail, along with summaries in the form of bullet points. We choose this dataset because it allows us to compare our approach to existing fully supervised methods and to measure the gap between unsupervised and supervised summarization. We follow the preprocessing pipeline of Chen and Bansal (2018), splitting the dataset into 287k/11k/11k pairs for training/validation/testing. Note that our method relies only on the bullet-point summaries from this training set.

Obtaining synthetic data.

To obtain training data for our sentence abstractor, we follow the procedure from Section 3.2. We align all summaries from the CNN/DM training set to 8.5M news articles from the Gigaword dataset Napoles et al. (2012), which contains no articles from CNN or Daily Mail. For the alignment, we follow the set-up from Nikolov and Hahnloser (2019), using the Sent2Vec embedding method Pagliardini et al. (2017) to compute document and sentence embeddings. After alignment, we obtained 1.2M pseudo-parallel pairs (the Abspp-0.63 setting in Table 3), which we use to train our backtranslation model. Using the backtranslation model, we synthesize 5 article sentences for each of the roughly 1M summary sentences by picking the top 5 beam hypotheses. Our best sentence paraphrasing dataset used to train our final abstractor contains 6.7 million pairs, 18% of which are pseudo-parallel pairs and 82% are backtranslated pairs.

Implementation details.

The abstractor and the backtranslation model are both implemented as bidirectional LSTM encoder-decoder models with 256 hidden units, embedding dimension 128, and an attention mechanism Bahdanau et al. (2014). We pick this model size to be comparable to recent work See et al. (2017); Chen and Bansal (2018). Our models are initialized and trained separately, but they share the same 50k byte pair encoding Sennrich et al. (2016b) vocabulary extracted from a joint collection of articles and summaries. We train both models until convergence with Adam Kingma and Ba (2015); the abstractor uses beam search with a beam of 5 during testing.

Because both of our extractors are unsupervised, we directly apply them to the CNN/DM articles to select salient sentences. We always set K, the number of sentences to be extracted, to the average number of summary sentences in the CNN/DM dataset. We additionally tune the similarity threshold of the LexRank extractor on 1k pairs from the CNN/DM validation set.

Approach                 R-1     R-2     R-L     MET     #
Oracle                   47.33   26.43   43.69   30.76   132
Unsupervised extractive baselines
Lead                     38.78   17.57   35.49   23.67   119
LexRank                  34.49   14.1    31.32   21.27   133
Supervised abstractive baselines (trained on parallel data)
LSTM                     35.61   15.04   32.7    16.24   58
Ext-Abs                  38.38   16.12   36.04   19.39   -
Ext-Abs-RL               40.88   17.8    38.54   20.38   73
Abstractive summarization without parallel data (this work)
Lead + Abspp+syn-5       32.98   11.13   30.88   13.51   50
LexRank + Abspp+syn-5    30.87   9.42    28.82   12.51   52

Table 1: Metric results on the CNN/Daily Mail test set. R-1/2/L is ROUGE-1/2/L; MET is METEOR, while # is the average number of tokens in the summaries. Results for Ext-Abs and Ext-Abs-RL are from Chen and Bansal (2018).

(1) Abspp-0.63: cnn is the first time in three years . the other contestants told the price of the price . the game show will be hosted by the tv game show . the game of the game is the first of the show .

(2) Abspp+syn-5: a tv legend has returned to the first time in eight years . contestants told the price of “ the price is right ” bob barker hosted the tv game show for 35 years . the game is the first of the show ’s “ lucky seven ”

(3) Abspar: a tv legend returned to doing what he does best . contestants told to “ come on down ! ” on april 1 edition . he hosted the tv game show for 35 years before stepping down in 2007 . barker handled the first price-guessing game of the show , the classic “ lucky seven ”

Table 2: Example outputs (Lead extractor): (1)/(2) are trained on pseudo-parallel/synthetic data, while the abstractor in (3) is trained on parallel data.

Approach (# pairs)       R-1     R-2     R-L     #
Abspp-0.60 (2M)          23.08   4.06    21.48   62
Abspp-0.63 (1.2M)        28.08   7.07    26.14   49
Abspp-0.67 (0.3M)        24.36   4.76    22.64   57
Abspp+syn-1 (2.4M)       31.92   10.2    29.9    51
Abspp+syn-5 (6.6M)       32.98   11.13   30.88   50
Abspp+syn-10 (12M)       32.8    11.2    30.71   49
Abspp-ub (575K)          38.42   15.98   35.8    65
Abspar (1M)              38.68   16.36   36.15   62

Table 3: Comparison of abstractors trained on parallel data (Abspar) vs. pseudo-parallel data (Abspp-*, using different sentence alignment thresholds; Abspp-ub is the upper bound for large-scale alignment) or on a mixture of pseudo-parallel and synthetic data (Abspp+syn-N, using the Abspp-0.63 dataset and backtranslated data from the top N beam hypotheses). We always use the Lead extractor.

Evaluation details.

We evaluate our systems on the CNN/DM test set, using the recall-based ROUGE-1/2/L metrics Lin (2004) and the precision-based METEOR Banerjee and Lavie (2005).
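To make the recall-based nature of ROUGE-N concrete, the sketch below computes the fraction of reference n-grams that also appear in a candidate summary. It is a simplified illustration under stated assumptions (whitespace tokenization, a single reference, no stemming) and will not reproduce the scores of the official ROUGE toolkit; ROUGE-L, which is based on the longest common subsequence, is not covered.

```python
from collections import Counter
from typing import List, Tuple


def ngrams(tokens: List[str], n: int) -> Counter:
    """Multiset of n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate: str, reference: str, n: int) -> float:
    """Clipped n-gram overlap divided by the number of reference n-grams."""
    cand = ngrams(candidate.split(), n)
    ref = ngrams(reference.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Each reference n-gram is matched at most as often as it occurs
    # in the candidate.
    overlap = sum(min(count, cand[gram]) for gram, count in ref.items())
    return overlap / total


candidate = "the cat sat"
reference = "the cat sat down"
r1 = rouge_n_recall(candidate, reference, 1)  # 3 of 4 unigrams matched
r2 = rouge_n_recall(candidate, reference, 2)  # 2 of 3 bigrams matched
```

Because the denominator counts reference n-grams only, a longer candidate can never be penalized by this measure, which is one reason to read it alongside the summary lengths reported in Table 1.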

5 Results


We compare our models with a number of supervised and unsupervised baselines. LSTM is a standard bidirectional LSTM model, trained to directly generate the CNN/DM summaries from the full CNN/DM articles. Ext-Abs is the hierarchical model from Chen and Bansal (2018), consisting of a supervised LSTM extractor and separate abstractor, both of which are individually trained on the CNN/DM dataset by aligning summary to article sentences. Our work best resembles Ext-Abs except that we do not rely on any parallel data. Ext-Abs-RL is a state-of-the-art summarization system that extends Ext-Abs by jointly tuning the two supervised components using reinforcement learning. We additionally report the performance of our unsupervised extractive baselines, Lead and LexRank, as well as the result of an oracle (Oracle) which computes an upper bound for extractive summarization.

Automatic evaluation.

Our best abstractive models trained on non-parallel data (Lead + Abspp+syn-5 and LexRank + Abspp+syn-5 in Table 1) performed worse than the baselines trained on parallel data. However, the results are promising: for example, the ROUGE-L gap between our Lead model and the LSTM model is only 1.82 points (30.88 vs. 32.7). Our models generated much shorter summaries than the other systems, indicating that they potentially summarize much more aggressively.

Model analysis.

Is the gap between our approach and fully supervised models due to poor sentence extraction or inadequate sentence abstraction? The CNN/DM dataset is a special case, in which, due to the strong performance of Lead, the main bottleneck is abstraction. For datasets in which the first sentences are less salient, alternative approaches to extractive summarization such as LexRank become increasingly important.

In Table 3, we compare the effect of training our abstractor on pseudo-parallel datasets of different sizes (Abspp-*) as well as on a mixture of pseudo-parallel and backtranslated data (Abspp+syn-*). For reference, we also include results from aligning the original CNN/DM articles and summaries directly. We construct a parallel dataset of sentence pairs by aligning the original CNN/DM document pairs (Abspar system); as well as a pseudo-parallel dataset, without using the CNN/DM document labels (Abspp-ub system). Abspp-ub is an upper bound for large-scale alignment, where the raw dataset of articles perfectly matches the summaries.

Our best pseudo-parallel abstractor performs poorly in comparison to the parallel abstractor. Adding synthetic data is helpful but insufficient to close the gap, and we observe diminishing improvements from adding more synthetic pairs. The large-scale alignment method is able to construct a pseudo-parallel upper bound that almost perfectly matches the parallel dataset, indicating that the main bottleneck in our system is potentially the domain difference between the articles in Gigaword and those in CNN/DM.

Example summaries.

In Table 2, we also provide example summaries produced by our system. Our final model, trained on additional backtranslated data, produced much more relevant and coherent sentences than the model trained on pseudo-parallel data only. Despite having seen no parallel examples, the system is capable of generating fluent, abstractive sentences. However, in comparison to the abstractor trained on parallel data, there is still room for further improvement.

6 Conclusion

We developed an abstractive summarization system that does not rely on any parallel resources, but can instead be trained using example summaries and a large collection of non-matching articles, making it particularly relevant to low-resource domains and languages. Perhaps surprisingly, our system performs competitively with fully supervised models. Future work will focus on developing novel unsupervised extractors, on decreasing the gap between abstractors trained on parallel and non-parallel data, and on developing methods for combining the abstractor and extractor into a single system.