Text summarization aims to produce a shorter, informative version of an input text. While extractive summarization only selects important sentences from the input, abstractive summarization generates new content without explicitly re-using whole sentences Nenkova et al. (2011), resulting in summaries that are more fluent. In recent years, a number of successful approaches have been proposed for both the extractive Nallapati et al. (2017); Narayan et al. (2018) and abstractive See et al. (2017); Chen and Bansal (2018) summarization paradigms. State-of-the-art abstractive approaches are supervised, relying on large collections of paired articles and summaries. However, achieving competitive performance with abstractive systems remains a challenge when the availability of parallel data is limited, such as in low-resource domains or for languages other than English.
Even when parallel data is limited, we may have access to example summaries and to large collections of articles on similar topics. Examples are blog posts or scientific press releases, for which the original articles may be unavailable or behind a paywall.
In this paper, we develop a system for abstractive document summarization that only relies on having access to example summaries and non-matching articles, bypassing the need for large-scale parallel corpora. Our system consists of two components: an unsupervised sentence extractor that first selects salient sentences, and a sentence abstractor that subsequently paraphrases each extracted sentence. The abstractor is trained on pseudo-parallel data extracted from raw corpora, as well as on additional synthetic data generated through backtranslation. Our approach achieves promising results on the CNN/DailyMail single-document summarization benchmark (Section 5) without relying on any parallel document-summary pairs.
Unsupervised summarization has a long history within the extractive summarization paradigm. Given an input article consisting of N sentences, the goal of extractive summarization is to select the K most salient sentences as the output summary, without employing any paraphrasing or fusion. A typical approach is to weigh each sentence either with respect to the document as a whole Radev et al. (2004) or through an adjacency-based measure of sentence importance Erkan and Radev (2004).
In recent years, there have been large advances in supervised abstractive summarization, for headline generation Rush et al. (2015); Nallapati et al. (2017) as well as for generation of multi-sentence summaries See et al. (2017). State-of-the-art approaches are typically trained to generate summaries either in a fully end-to-end fashion See et al. (2017), processing the entire article at once; or hierarchically, first extracting content and then paraphrasing it sentence-by-sentence Chen and Bansal (2018). Both approaches rely on large collections of article-summary pairs such as the annotated Gigaword Napoles et al. (2012) or the CNN/DailyMail Nallapati et al. (2016) dataset. The heavy reliance on manually created resources prohibits the use of abstractive summarization in domains other than news articles, or languages other than English, where parallel data may not be as abundantly available. In such areas, extractive summarization often remains the preferred choice.
Our work focuses on abstractive summarization using large-scale non-parallel resources, such as collections of summaries without matching articles. Recently, a number of methods have been proposed to reduce the need for parallel data: through harvesting pseudo-parallel data from raw corpora Nikolov and Hahnloser (2019) or by synthesizing data using backtranslation Sennrich et al. (2016a). Such methods have been shown to be viable for tasks such as unsupervised machine translation Lample et al. (2018), sentence compression Fevry and Phang (2018), and style transfer Lample et al. (2019). To the best of our knowledge, this work is the first to extend such methods to single-document summarization in order to generate multi-sentence abstractive summaries in a data-driven fashion.
Our system consists of two components: an extractor (Section 3.1) which picks salient sentences to include in the final summary; and an abstractor (Section 3.2) which subsequently paraphrases each of the extracted sentences, rewriting them to meet the target summary style.
Our approach is similar to Chen and Bansal (2018), except that they use parallel data to train their extractors and abstractors. In contrast, during training, we only assume access to example summaries without matching articles. During testing, given an input article consisting of N sentences, our system generates a multi-sentence abstractive summary consisting of K sentences (where K is a hyperparameter).
3.1 Sentence extractor
The extractor selects the most salient article sentences to include in the summary. We consider two unsupervised variants for the extractor:
Lead picks the first K sentences from the article and returns them as the summary. For many datasets, such as CNN/DailyMail, Lead is a simple but tough baseline to beat, especially using abstractive methods See et al. (2017). Because Lead may not be the optimal choice for other domains or datasets, we experiment with another unsupervised extractive approach.
LexRank Erkan and Radev (2004) represents the input as a highly connected graph, in which vertices represent sentences, and edges between sentences are assigned weights equal to their TF-IDF similarity, provided that the similarity is higher than a predefined threshold t. The centrality of each sentence is then computed using the PageRank algorithm.
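To make the extractors concrete, the following is a minimal Python sketch of a LexRank-style extractor in the spirit of the description above: sentences become graph vertices, thresholded TF-IDF similarities become edge weights, and PageRank scores rank the sentences. The function name and default values are illustrative, not the authors' implementation; the Lead baseline simply corresponds to sentences[:K].

```python
import networkx as nx
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def lexrank_extract(sentences, k=3, threshold=0.1):
    # TF-IDF representation and pairwise cosine similarities between sentences
    tfidf = TfidfVectorizer().fit_transform(sentences)
    sim = cosine_similarity(tfidf)

    # Build the sentence graph, keeping only edges above the similarity threshold t
    graph = nx.Graph()
    graph.add_nodes_from(range(len(sentences)))
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if sim[i, j] > threshold:
                graph.add_edge(i, j, weight=sim[i, j])

    # Sentence centrality via PageRank; return the k most central sentences in article order
    scores = nx.pagerank(graph, weight="weight")
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]
```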
3.2 Sentence abstractor
The sentence abstractor (abs) is trained to generate a paraphrase of each article sentence, rewriting it to match the target sentence style of the summaries. We implement abs as an LSTM encoder-decoder with an attention mechanism Bahdanau et al. (2014). Instead of relying on parallel examples of sentences from articles and summaries, the abstractor is trained on a synthetic dataset that is created in two steps:
The first step is to obtain an initial set of pseudo-parallel article-summary sentence pairs. Because we assume access to example summaries, our approach is to align summary sentences to an external collection of articles in the same format as our target summaries. Here, we apply the large-scale alignment method from Nikolov and Hahnloser (2019), which hierarchically aligns documents followed by sentences in the two datasets. The alignment is implemented through nearest neighbour search of document and sentence embeddings.
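Below is a simplified sketch of such a hierarchical alignment: documents are matched first via nearest-neighbour search over document embeddings, and sentences are then aligned within each matched pair. The embed function (e.g., a Sent2Vec-style sentence embedder), the function name, and the similarity threshold are assumptions for illustration; this does not reproduce the exact procedure of Nikolov and Hahnloser (2019).

```python
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def align(summaries, articles, embed, sent_threshold=0.5):
    """summaries/articles: lists of documents, each given as a list of sentences.
    embed: maps a string to a fixed-size numpy vector (e.g., a Sent2Vec model)."""
    art_doc_vecs = [embed(" ".join(art)) for art in articles]
    pairs = []
    for summ in summaries:
        # Document level: nearest-neighbour article for this summary
        summ_vec = embed(" ".join(summ))
        best = max(range(len(articles)), key=lambda i: cosine(summ_vec, art_doc_vecs[i]))
        art_sent_vecs = [embed(s) for s in articles[best]]
        # Sentence level: nearest article sentence for each summary sentence
        for summ_sent in summ:
            sv = embed(summ_sent)
            sims = [cosine(sv, av) for av in art_sent_vecs]
            j = int(np.argmax(sims))
            if sims[j] >= sent_threshold:  # keep only confident alignments
                pairs.append((articles[best][j], summ_sent))
    return pairs
```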
We use the initial pseudo-parallel dataset to train a backtranslation model bt, following Sennrich et al. (2016a). The model learns to synthesize "fake" article sentences given a summary sentence. We use bt to generate multiple synthetic article sentences for each summary sentence we have available, taking the top hypotheses predicted by beam search (we also experimented with sampling Edunov et al. (2018) but found it to be too noisy in this setting). To train our final sentence paraphrasing model abs, we combine all pseudo-parallel and backtranslated pairs into a single dataset of article-summary sentence pairs.
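A sketch of this two-step data construction is given below. The bt.beam_search interface is hypothetical (any seq2seq toolkit exposing the top-n beam hypotheses would do); the point is only to illustrate how pseudo-parallel and backtranslated pairs are merged into one training set for the abstractor.

```python
def build_abstractor_data(pseudo_pairs, summary_sentences, bt, n_hyps=5):
    """pseudo_pairs: list of (article_sentence, summary_sentence) tuples.
    bt: a trained backtranslation model with a hypothetical beam_search method."""
    data = list(pseudo_pairs)  # keep all pseudo-parallel pairs
    for summ_sent in summary_sentences:
        # The top-n beam hypotheses act as synthetic "fake" article sentences
        for fake_article_sent in bt.beam_search(summ_sent, beam_size=n_hyps):
            data.append((fake_article_sent, summ_sent))
    return data  # source side: article-style sentences, target side: summary-style sentences
```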
4 Experimental set-up
We use the CNN/DailyMail (CNN/DM) dataset Hermann et al. (2015) consisting of pairs of news articles from CNN and Daily Mail, along with summaries in the form of bullet points. We choose this dataset because it allows us to compare our approach to existing fully supervised methods and to measure the gap between unsupervised and supervised summarization. We follow the preprocessing pipeline of Chen and Bansal (2018), splitting the dataset into 287k/11k/11k pairs for training/validation/testing. Note that our method relies only on the bullet-point summaries from this training set.
Obtaining synthetic data.
To obtain training data for our sentence abstractor abs, we follow the procedure from Section 3.2. We align all summaries from the CNN/DM training set to 8.5M news articles from the Gigaword dataset Napoles et al. (2012), which contains no articles from CNN or Daily Mail. After alignment (following the set-up of Nikolov and Hahnloser (2019) and using the Sent2Vec embedding method Pagliardini et al. (2017) to compute document and sentence embeddings), we obtained 1.2M pseudo-parallel pairs, which we use to train our backtranslation model bt. Using bt, we synthesize 5 article sentences for each of the 1M summary sentences by picking the top 5 beam hypotheses. Our best sentence paraphrasing dataset, used to train our final abstractor, contains 6.7 million pairs, of which 18% are pseudo-parallel and 82% are backtranslated.
bt and abs are both implemented as bidirectional LSTM encoder-decoder models with 256 hidden units, embedding dimension 128, and an attention mechanism Bahdanau et al. (2014). We pick this model size to be comparable to recent work See et al. (2017); Chen and Bansal (2018). The models are initialized and trained separately, but share the same 50k byte pair encoding Sennrich et al. (2016b) vocabulary, extracted from a joint collection of articles and summaries. We train both models until convergence with Adam Kingma and Ba (2015); abs uses beam search with a beam of 5 during testing.
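For illustration, a minimal PyTorch sketch of this encoder-decoder configuration is shown below (bidirectional LSTM encoder, LSTM decoder with additive attention, 256 hidden units, 128-dimensional embeddings over a shared 50k BPE vocabulary). The module structure and the decoder state initialization are assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

VOCAB_SIZE, EMB_DIM, HID_DIM = 50_000, 128, 256  # shared BPE vocabulary, model sizes as above

class Encoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.lstm = nn.LSTM(EMB_DIM, HID_DIM, bidirectional=True, batch_first=True)

    def forward(self, src):                       # src: (batch, src_len) of token ids
        outputs, _ = self.lstm(self.embed(src))
        return outputs                            # (batch, src_len, 2 * HID_DIM)

class BahdanauAttention(nn.Module):
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(2 * HID_DIM + HID_DIM, HID_DIM)
        self.score = nn.Linear(HID_DIM, 1, bias=False)

    def forward(self, dec_state, enc_outputs):    # dec_state: (batch, HID_DIM)
        src_len = enc_outputs.size(1)
        dec = dec_state.unsqueeze(1).expand(-1, src_len, -1)
        energy = torch.tanh(self.proj(torch.cat([enc_outputs, dec], dim=-1)))
        weights = torch.softmax(self.score(energy).squeeze(-1), dim=-1)
        context = torch.bmm(weights.unsqueeze(1), enc_outputs).squeeze(1)
        return context                            # (batch, 2 * HID_DIM)

class Decoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, EMB_DIM)
        self.attn = BahdanauAttention()
        self.cell = nn.LSTMCell(EMB_DIM + 2 * HID_DIM, HID_DIM)
        self.out = nn.Linear(HID_DIM, VOCAB_SIZE)

    def forward(self, prev_token, state, enc_outputs):
        # One decoding step: attend over encoder outputs, update the LSTM state,
        # and predict a distribution over the next subword token.
        h, c = state                              # each (batch, HID_DIM); e.g., zero-initialized
        context = self.attn(h, enc_outputs)
        h, c = self.cell(torch.cat([self.embed(prev_token), context], dim=-1), (h, c))
        return self.out(h), (h, c)
```

Both bt and abs would instantiate this architecture and be trained with Adam on their respective sentence-pair datasets; decoding with abs would use beam search of width 5.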
Because both of our extractors are unsupervised, we directly apply them to the CNN/DM articles to select salient sentences. We always set K, the number of sentences to be extracted, to the average number of summary sentences in the CNN/DM dataset. We additionally tune the similarity threshold t of the LexRank extractor on 1k pairs from the CNN/DM validation set.
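One simple way to implement this tuning step, sketched below, is a grid search over candidate thresholds that maximizes ROUGE-L on the validation pairs, using the rouge-score package together with the lexrank_extract sketch from Section 3.1. The candidate grid and the use of ROUGE-L as the selection criterion are assumptions made for illustration.

```python
from rouge_score import rouge_scorer

def tune_threshold(val_articles, val_summaries, k, candidates=(0.05, 0.1, 0.2, 0.3)):
    """val_articles: list of articles (each a list of sentences); val_summaries: reference strings."""
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    best_t, best_score = None, -1.0
    for t in candidates:
        total = 0.0
        for sents, ref in zip(val_articles, val_summaries):
            pred = " ".join(lexrank_extract(sents, k=k, threshold=t))
            total += scorer.score(ref, pred)["rougeL"].fmeasure
        if total > best_score:  # keep the threshold with the highest validation ROUGE-L
            best_t, best_score = t, total
    return best_t
```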
| Approach (# pairs) | R-1 | R-2 | R-L | # |
|---|---|---|---|---|
| Abspp-0.60 (2M) | 23.08 | 4.06 | 21.48 | 62 |
| Abspp+syn-1 (2.4M) | 31.92 | 10.2 | 29.9 | 51 |
| Abspp-ub (575K) | 38.42 | 15.98 | 35.8 | 65 |
We compare our models with a number of supervised and unsupervised baselines. LSTM is a standard bidirectional LSTM model, trained to directly generate the CNN/DM summaries from the full CNN/DM articles. Ext-Abs is the hierarchical model from Chen and Bansal (2018), consisting of a supervised LSTM extractor and separate abstractor, both of which are individually trained on the CNN/DM dataset by aligning summary to article sentences. Our work best resembles Ext-Abs except that we do not rely on any parallel data. Ext-Abs-RL is a state-of-the-art summarization system that extends Ext-Abs
by jointly tuning the two supervised components using reinforcement learning. We additionally report the performance of our unsupervised extractive baselines, Lead and LexRank, as well as the result of an oracle (Oracle), which computes an upper bound for extractive summarization.
Our best abstractive models trained on non-parallel data (Lead + Abspp+syn-5 and LexRank + Abspp+syn-5 in Table 2) performed worse than the baselines trained on parallel data. However, the results are promising: for example, the ROUGE-L gap between our Lead model and the LSTM model is small. Our models generated much shorter summaries than the other systems, indicating that they potentially summarize more aggressively.
Is the gap between our approach and fully supervised models due to poor sentence extraction or inadequate sentence abstraction? The CNN/DM dataset is a special case, in which, due to the strong performance of Lead, the main bottleneck is abstraction. For datasets in which the first sentences are less salient, alternative approaches to extractive summarization such as LexRank become increasingly important.
In Table 3, we compare the effect of training our abstractor on pseudo-parallel datasets of different sizes (Abspp-*) as well as on a mixture of pseudo-parallel and backtranslated data (Abspp+syn-*). For reference, we also include results from aligning the original CNN/DM articles and summaries directly. We construct a parallel dataset of sentence pairs by aligning sentences within the original CNN/DM document pairs (the Abspar system), as well as a pseudo-parallel dataset built without using the CNN/DM document pairings (the Abspp-ub system). Abspp-ub is an upper bound for large-scale alignment, in which the raw dataset of articles perfectly matches the summaries.
Our best pseudo-parallel abstractor performs poorly in comparison to the parallel abstractor. Adding synthetic data is helpful but insufficient to close the gap, and we observe diminishing improvements from adding more synthetic pairs. The large-scale alignment method is able to construct a pseudo-parallel upper bound that almost perfectly matches the parallel dataset, indicating that the main bottleneck in our system is potentially the domain difference between the articles in Gigaword and those in CNN/DM.
In Table 2, we also provide example summaries produced by our system. Our final model, trained on additional backtranslated data, produced much more relevant and coherent sentences than the model trained on pseudo-parallel data only. Despite having seen no parallel examples, the system is capable of generating fluent, abstractive sentences. However, in comparison to the abstractor trained on parallel data, there is still room for further improvement.
We developed an abstractive summarization system that does not rely on any parallel resources, but can instead be trained using example summaries and a large collection of non-matching articles, making it particularly relevant to low-resource domains and languages. Perhaps surprisingly, our system performs competitively with fully supervised models. Future work will focus on developing novel unsupervised extractors; on decreasing the gap between abstractors trained on parallel and non-parallel data; and on developing methods for combining the abstractor and extractor into a single system.
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473.
- Banerjee and Lavie (2005) Satanjeev Banerjee and Alon Lavie. 2005. METEOR: An automatic metric for MT evaluation with improved correlation with human judgments. In Proc. of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization, pages 65–72.
- Chen and Bansal (2018) Yen-Chun Chen and Mohit Bansal. 2018. Fast abstractive summarization with reinforce-selected sentence rewriting. In Proceedings of ACL.
- Edunov et al. (2018) Sergey Edunov, Myle Ott, Michael Auli, and David Grangier. 2018. Understanding back-translation at scale. arXiv preprint arXiv:1808.09381.
- Erkan and Radev (2004) Günes Erkan and Dragomir R Radev. 2004. LexRank: Graph-based lexical centrality as salience in text summarization. Journal of Artificial Intelligence Research, 22:457–479.
- Fevry and Phang (2018) Thibault Fevry and Jason Phang. 2018. Unsupervised sentence compression using denoising auto-encoders. In Proceedings of the 22nd Conference on Computational Natural Language Learning, pages 413–422. Association for Computational Linguistics.
- Hermann et al. (2015) Karl Moritz Hermann, Tomas Kocisky, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pages 1693–1701.
- Kingma and Ba (2015) Diederick P Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR).
- Lample et al. (2018) Guillaume Lample, Myle Ott, Alexis Conneau, Ludovic Denoyer, and Marc’Aurelio Ranzato. 2018. Phrase-based & neural unsupervised machine translation. arXiv preprint arXiv:1804.07755.
- Lample et al. (2019) Guillaume Lample, Sandeep Subramanian, Eric Smith, Ludovic Denoyer, Marc’Aurelio Ranzato, and Y-Lan Boureau. 2019. Multiple-attribute text rewriting. In International Conference on Learning Representations.
- Lin (2004) Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out.
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. SummaRuNNer: A recurrent neural network based sequence model for extractive summarization of documents. In Thirty-First AAAI Conference on Artificial Intelligence.
- Nallapati et al. (2016) Ramesh Nallapati, Bowen Zhou, Caglar Gulcehre, Bing Xiang, et al. 2016. Abstractive text summarization using sequence-to-sequence RNNs and beyond. arXiv preprint arXiv:1602.06023.
- Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, pages 95–100. Association for Computational Linguistics.
- Narayan et al. (2018) Shashi Narayan, Shay B. Cohen, and Mirella Lapata. 2018. Ranking sentences for extractive summarization with reinforcement learning. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1747–1759. Association for Computational Linguistics.
- Nenkova et al. (2011) Ani Nenkova, Sameer Maskey, and Yang Liu. 2011. Automatic summarization. In Proc. of ACL, page 3. Association for Computational Linguistics.
- Nikolov and Hahnloser (2019) Nikola Nikolov and Richard Hahnloser. 2019. Large-scale hierarchical alignment for data-driven text rewriting. arXiv preprint arXiv:1810.08237.
- Pagliardini et al. (2017) Matteo Pagliardini, Prakhar Gupta, and Martin Jaggi. 2017. Unsupervised learning of sentence embeddings using compositional n-gram features. arXiv preprint arXiv:1703.02507.
- Radev et al. (2004) Dragomir R Radev, Hongyan Jing, Małgorzata Styś, and Daniel Tam. 2004. Centroid-based summarization of multiple documents. Information Processing & Management, 40(6):919–938.
- Rush et al. (2015) Alexander M Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. arXiv preprint arXiv:1509.00685.
- See et al. (2017) Abigail See, Peter J Liu, and Christopher D Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proc. of ACL, volume 1, pages 1073–1083.
- Sennrich et al. (2016a) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016a. Improving neural machine translation models with monolingual data. In Proc. of ACL, pages 86–96. Association for Computational Linguistics.
- Sennrich et al. (2016b) Rico Sennrich, Barry Haddow, and Alexandra Birch. 2016b. Neural machine translation of rare words with subword units. In Proc. of ACL, pages 1715–1725. Association for Computational Linguistics.