The area of automatic text summarization has received much attention recently Rush et al. (2015); Cheng and Lapata (2016); Nallapati et al. (2017); See et al. (2017); Zhou et al. (2017); Tan et al. (2017). Many recent summarization models work on two types of input: sentence-level summarization Rush et al. (2015); Chopra et al. (2016); Nallapati et al. (2016a); Zhou et al. (2017) and single-document-level summarization Cheng and Lapata (2016); See et al. (2017); Nallapati et al. (2017). The development of these neural network summarization systems requires relatively large datasets Rush et al. (2015); Hermann et al. (2015).
The sentence-level summarization dataset is constructed automatically by pairing the title with the first sentence of a news article Rush et al. (2015). The input sentence and the output title are extracted and cleaned heuristically from the Annotated English Gigaword corpus Napoles et al. (2012). The frequently used document-level datasets are newswire datasets such as CNN, Daily Mail and NY Times, which are usually used to produce several sentences as the summary. However, to the best of our knowledge, no prior work has discussed summarizing a text passage, which has potential uses in long document summarization, slides highlight generation Wang et al. (2017), language teaching Huang (2015) and so on. The above-mentioned datasets target either sentence or document summarization, ignoring the passage granularity.
In this paper, we introduce a new summarization dataset which aims to explore the passage-to-summary granularity of the text summarization task. We make the key observation that, in two temporally adjacent Wikipedia page revisions, a passage added to the article body and a sentence added to the introduction at the same time possibly form a passage-summary pair. Based on this assumption, we mine the English Wikipedia history dump to extract candidate pairs. After cleaning and filtering the extracted data, we obtain a new passage-to-summary (Psg2Sum) dataset which contains 100,118 examples. Quality analysis and comparison to other summarization datasets show that Psg2Sum is a promising training and evaluation dataset.
|PSG||The collision between trains 608 and 653 happened on kilometer 8.055 at 17:42 (some sources says at 17:44). The speed of the steam train 608 was about 55 km/h, train 653 about 60 km/h. Both drivers tried to slow in the loose , but it was too late.|
|SUM||A passenger steam train 608 at speed 55 km/h abreast collided with a diesel railcar 653 at speed 60 km/h.|
Our primary contributions are:
- A scalable, language-agnostic method to create a passage-to-summary dataset from the Wikipedia revision history.
- Filling the granularity vacancy of summarization datasets: we present the first open-domain, passage-to-summary corpus.
- The English version of the Psg2Sum dataset will be available online at https://res.qyzhou.me/.
- We validate the performance of various summarization methods on Psg2Sum.
2 The Psg2Sum Dataset
2.1 Dataset Creation
Wikipedia maintains the history of its pages, which contains a list of each page's previous revisions (https://en.wikipedia.org/wiki/Help:Page_history). Page revisions have been exploited for several NLP tasks, such as sentence splitting Botha et al. (2018), sentence compression Yamangil and Nelken (2008) and sentence simplification Woodsend and Lapata (2011); Yatskar et al. (2010).
Most Wikipedia articles have a lead section (https://en.wikipedia.org/wiki/Wikipedia:Manual_of_Style/Lead_section; also known as the lead or introduction, screenshot available in the Appendix), which serves as an introduction to the article and a summary of its most important contents. Therefore, we pair passages in the main body with sentences in the lead section to construct the Psg2Sum corpus. We make the assumption that, in a page revision, a sentence added to the lead section is possibly the summary of a passage added to the article at the same time. Based on this assumption, we compare two temporally adjacent revisions of a page to extract the additions.
We first extract and clean text by stripping the Wikipedia markup language (wikicode) using mwparserfromhell (https://github.com/earwig/mwparserfromhell). The text in the lead section is then split into sentences with the sentence splitting algorithm in Moses Koehn et al. (2007) (we use a Python implementation: https://github.com/berkmancenter/mediacloud-sentence-splitter). The sentences are then tokenized with the spaCy tokenizer. Finally, we compare the two processed page revisions using Python's difflib to extract the added sentences and passages.
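The revision-diffing step can be sketched with the standard library alone (the wikicode stripping, sentence splitting and tokenization steps are omitted here). The function below recovers the sentences added between two revisions, a minimal sketch of how Python's difflib can be used for this:

```python
import difflib

def extract_additions(old_sents, new_sents):
    """Return the sentences that appear in the new revision but not in the
    old one. We diff at the sentence level with difflib's SequenceMatcher;
    'replace' spans are treated as additions as well."""
    sm = difflib.SequenceMatcher(a=old_sents, b=new_sents, autojunk=False)
    added = []
    for tag, _i1, _i2, j1, j2 in sm.get_opcodes():
        if tag in ("insert", "replace"):
            added.extend(new_sents[j1:j2])
    return added
```

The same routine applies to both the lead section (yielding candidate summary sentences) and the article body (yielding candidate passages).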
Given all the added sentences in the lead section and the added passages in the article, we use heuristics to mine passage-summary pairs. Similar to Rush et al. (2015), we find possible candidates by calculating the unigram overlap to ensure the passage-summary relationship. Specifically, for a candidate passage-summary pair (p, s), we first remove the stopwords from both the sentence s and the passage p to get (p̃, s̃). The candidate score is defined as the overlap rate:

score(p, s) = |unigram(s̃) ∩ unigram(p̃)| / |unigram(s̃)|
For each candidate sentence s in the lead section, we choose the passage p with the maximum overlap score score(p, s). To filter out misaligned passage-summary pairs, we set a minimum overlap rate threshold τ. Specifically, if score(p, s) is less than τ, we discard the candidate pair (p, s).
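A minimal sketch of this scoring and filtering step follows. The stopword list is an illustrative subset, and normalizing by the sentence's unigram set is our reading of the scoring; both are assumptions of this sketch:

```python
STOPWORDS = {"the", "a", "an", "of", "in", "to", "at", "is", "was", "on"}  # illustrative subset

def overlap_score(sentence_tokens, passage_tokens, stopwords=STOPWORDS):
    """Unigram overlap rate after stopword removal (normalized by the
    sentence's unigram set -- an assumption of this sketch)."""
    s = {t.lower() for t in sentence_tokens} - stopwords
    p = {t.lower() for t in passage_tokens} - stopwords
    return len(s & p) / len(s) if s else 0.0

def align(sentence_tokens, candidate_passages, threshold=0.6):
    """Pick the passage with the maximum overlap score; discard the pair
    when the score falls below the threshold."""
    best = max(candidate_passages, key=lambda p: overlap_score(sentence_tokens, p))
    score = overlap_score(sentence_tokens, best)
    return (best, score) if score >= threshold else (None, score)
```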
|Dataset||Granularity||Domain||Corpus Size||avg. Input Length (sent. / words)||avg. Output Length (sent. / words)||References|
|DUC2002 (task 1)||Doc||News||567||27.37 / 629.64||10.19 / 215.09||1.96|
2.2 Quality and Statistics of Psg2Sum
As the heuristic method cannot guarantee that all the pairs are true passage-summary pairs, we manually check the quality of the constructed dataset. We randomly sample 50 examples and label each as one of the following:
Good: The sentence is a summary of the given passage.
Unsupported: The sentence is irrelevant to the passage, or some important content, such as dates and places, cannot be found in the passage, making the sentence not understandable on its own.
Furthermore, we apply the same labeling to 50 random examples from the English Gigaword sentence summarization dataset Rush et al. (2015), which is also created automatically and cleaned with heuristics.
As shown in Table 3, the overlap threshold τ = 0.6 is a good trade-off between the Good rate and the corpus size. For the 50 random examples, increasing the threshold from 0.5 to 0.6 leads to a 12% absolute Good rate improvement with only 6,808 examples filtered out. Increasing the threshold from 0.6 to 0.7 yields only a 2% Good rate improvement, but the corpus size drastically shrinks to 68,070. Since our Good rate already compares favorably with the 56% of the successful English Gigaword dataset, we choose the threshold value τ = 0.6.
After filtering and cleaning, the final Psg2Sum dataset contains 100,118 passage-summary pairs. We randomly split the dataset into training, validation and test sets, which contain 92,118, 4,000 and 4,000 passage-summary pairs respectively.
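The split can be reproduced with a simple shuffled partition (the random seed below is hypothetical; the paper does not report one):

```python
import random

def split_dataset(pairs, n_valid=4000, n_test=4000, seed=42):
    """Shuffle and partition passage-summary pairs into train/valid/test,
    matching the set sizes reported above."""
    pairs = list(pairs)
    random.Random(seed).shuffle(pairs)
    valid = pairs[:n_valid]
    test = pairs[n_valid:n_valid + n_test]
    train = pairs[n_valid + n_test:]
    return train, valid, test
```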
2.3 Comparison to Other Datasets
Since 2001, NIST has organized the DUC summarization tasks Over et al. (2007), providing high-quality, human-created document/multi-document summarization datasets. However, the DUC datasets are too small to train an abstractive summarization system based on artificial neural networks. For example, DUC 2002 task 1 contains only 567 documents, each associated with around 1.96 references. Therefore, large-scale datasets are necessary for training neural abstractive summarization systems.
Abstractive sentence summarization has attracted research focus in recent years Rush et al. (2015); Toutanova et al. (2016); Chopra et al. (2016); Nallapati et al. (2016a). Rush et al. (2015) propose constructing a sentence summarization (or headline generation) dataset by pairing the first sentence and the title of a news article. They use the Annotated English Gigaword Napoles et al. (2012) as the article source. As shown in Table 3, though the Gigaword corpus contains some noise, it is still useful as a training and evaluation dataset. Considering that the Good rate of Psg2Sum is about 10% higher than that of the English Gigaword dataset, it is promising that Psg2Sum can serve the same purpose.
Recently, newswire websites such as CNN, Daily Mail and NY Times have been used as sources for single document summarization. The NY Times is currently the largest summarization dataset, as shown in Table 2. However, it is biased toward extractive strategies, and limited work has used it for summarization Grusky et al. (2018). CNN and Daily Mail Hermann et al. (2015) have been frequently used in recent document summarization research. These datasets have been used for summarization either as-is See et al. (2017) or after pre-processing for entity anonymization Nallapati et al. (2017). Additionally, some systems mix CNN and Daily Mail as training data Nallapati et al. (2017); See et al. (2017); Paulus et al. (2017), whereas others use only Daily Mail articles Cheng and Lapata (2016); Nallapati et al. (2016b). Therefore, comparisons between systems are challenging, considering that previous works use different versions of the datasets.
Table 4: Rouge evaluation results of various summarization models. The scores with 95% confidence intervals are given by the official Rouge script. The best results are in bold.
All the above-mentioned datasets, including both the sentence-level and the document-level summarization datasets, are constructed or labeled from newswire sources, meaning they are all in the news domain. The proposed Psg2Sum is constructed from the open-domain Wikipedia Chen et al. (2017); Yang et al. (2015). As far as we know, this is the first open-domain text summarization dataset. Table 2 summarizes the key features of existing summarization datasets and Psg2Sum. To the best of our knowledge, Psg2Sum is the first passage-to-summary dataset, and it is of the same order of magnitude as the frequently used CNN and Daily Mail datasets. The average input length of Psg2Sum is 4.83 sentences (118.26 words), compared with an average length of 33.98 sentences (760.50 words) for CNN and 29.33 sentences (653.33 words) for the Daily Mail corpus.
We evaluate several summarization models on the Psg2Sum dataset; the detailed model configurations can be found in the Appendix:
LEAD1 extracts the first sentence as the summary. The leading-sentence baseline is also a strong baseline on newswire datasets such as CNN, Daily Mail and NY Times.
3.2 Evaluation Metric
We use Rouge (version 1.5.5) Lin (2004) as our evaluation metric. Rouge measures the quality of a summary by computing overlapping lexical units, such as unigrams, bigrams, trigrams, and the longest common subsequence (LCS). Following previous work, we report Rouge-1, Rouge-2 and Rouge-L in the experiments.
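For intuition, Rouge-N can be illustrated in a few lines of Python. This is a simplified single-reference version, not a replacement for the official Rouge-1.5.5 script (which additionally handles stemming, stopword options and confidence intervals):

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """Rouge-N precision, recall and F1 from clipped n-gram counts."""
    def ngrams(tokens):
        return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())  # clipped (multiset) intersection
    prec = overlap / max(sum(c.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    return prec, rec, f1
```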
We validate various models on the Psg2Sum dataset, including abstractive models (s2s), extractive models (LEAD1, TextRank, NN-SE) and mixed models (s2s+copy, PGN). Table 4 shows the Rouge evaluation results. We observe that extractive methods perform better in terms of Rouge recall. For example, the NN-SE model achieves the best recall performance among all the baseline models, i.e., 43.76 Rouge-1 recall and 23.19 Rouge-2 recall. Meanwhile, the abstractive models achieve better Rouge precision scores; the s2s+copy model has the best precision in Rouge-1, -2 and -L. Surprisingly, we find that using the coverage mechanism (PGN) leads to a precision drop but a higher recall score (with longer outputs), although s2s+copy and PGN are statistically indistinguishable in terms of F1 score.
In this paper, we presented a heuristic approach to automatically constructing a passage-to-summary dataset, Psg2Sum, by mining Wikipedia page revision histories. The quality analysis shows that, despite containing some noise, it is capable of serving as a training and evaluation corpus. Experiments on Psg2Sum show that extractive models tend to select longer sentences and achieve higher recall scores, compared with the abstractive and mixed models' tendency to generate high-precision outputs.
- Neural machine translation by jointly learning to align and translate. In Proceedings of 3rd International Conference for Learning Representations, San Diego. Cited by: §B.1, item s2s.
- Learning to split and rephrase from wikipedia edit history. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 732–737. Cited by: §2.1.
- Reading wikipedia to answer open-domain questions. arXiv preprint arXiv:1704.00051. Cited by: §2.3.
- Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 484–494. Cited by: §B.5, §1, §2.3, item NN-SE.
- Learning phrase representations using rnn encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, pp. 1724–1734. Cited by: §B.1.
- Abstractive sentence summarization with attentive recurrent neural networks. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, California, pp. 93–98. Cited by: §1, §2.3.
- NEWSROOM: a dataset of 1.3 million summaries with diverse extractive strategies. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, New Orleans, Louisiana, pp. 708–719. Cited by: §2.3.
- Incorporating copying mechanism in sequence-to-sequence learning. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 1631–1640. Cited by: §B.2, item s2s+copy.
- Pointing the unknown words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, pp. 140–149. Cited by: §B.2, item s2s+copy.
- Teaching machines to read and comprehend. In Advances in Neural Information Processing Systems, pp. 1693–1701. Cited by: §1, §2.3.
- A study on the application of task-based language teaching method in a comprehensive english class in china. Journal of Language Teaching and Research 7 (1), pp. 118–127. Cited by: §1.
- Adam: a method for stochastic optimization. In Proceedings of 3rd International Conference for Learning Representations, San Diego. Cited by: §B.1.
- Moses: open source toolkit for statistical machine translation. In Proceedings of the 45th annual meeting of the association for computational linguistics companion volume proceedings of the demo and poster sessions, pp. 177–180. Cited by: §2.1.
- Rouge: a package for automatic evaluation of summaries. In Text summarization branches out: Proceedings of the ACL-04 workshop, Vol. 8. Cited by: §3.2.
- Textrank: bringing order into text. In Proceedings of the 2004 conference on empirical methods in natural language processing, Cited by: item TextRank.
- SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents. In AAAI, pp. 3075–3081. Cited by: §1, §2.3.
- Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of The 20th SIGNLL Conference on Computational Natural Language Learning, Cited by: §1, §2.3.
- Classify or select: neural architectures for extractive document summarization. arXiv preprint arXiv:1611.04244. Cited by: §2.3.
- Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction, AKBC-WEKEX ’12, Stroudsburg, PA, USA, pp. 95–100. Cited by: §1, §2.3.
- DUC in context. Information Processing & Management 43 (6), pp. 1506–1520. Cited by: §2.3.
- A deep reinforced model for abstractive summarization. arXiv preprint arXiv:1705.04304. Cited by: §2.3.
- Software Framework for Topic Modelling with Large Corpora. In Proceedings of the LREC 2010 Workshop on New Challenges for NLP Frameworks, Valletta, Malta, pp. 45–50 (English). Cited by: §B.4, item TextRank.
- A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, Lisbon, Portugal, pp. 379–389. Cited by: §1, §1, §2.1, §2.2, §2.3.
- Get to the point: summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1073–1083. Cited by: §B.3, §1, §2.3, item PNG.
- Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §B.1.
- Abstractive document summarization with a graph-based attentional neural model. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vol. 1, pp. 1171–1181. Cited by: §1.
- A dataset and evaluation metrics for abstractive compression of sentences and short paragraphs. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, Texas, pp. 340–350. Cited by: §2.3.
- Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 76–85. Cited by: item PNG.
- Phrase-based presentation slides generation for academic papers. In Thirty-First AAAI Conference on Artificial Intelligence. Cited by: §1.
- Learning to simplify sentences with quasi-synchronous grammar and integer programming. In Proceedings of the conference on empirical methods in natural language processing, pp. 409–420. Cited by: §2.1.
- Mining wikipedia revision histories for improving sentence compression. In Proceedings of the 46th Annual Meeting of the Association for Computational Linguistics on Human Language Technologies: Short Papers, pp. 137–140. Cited by: §2.1.
- Wikiqa: a challenge dataset for open-domain question answering. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pp. 2013–2018. Cited by: §2.3.
- For the sake of simplicity: unsupervised extraction of lexical simplifications from wikipedia. In Human Language Technologies: The 2010 Annual Conference of the North American Chapter of the Association for Computational Linguistics, pp. 365–368. Cited by: §2.1.
- Selective encoding for abstractive sentence summarization. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1095–1104. Cited by: §1.
Appendix A Lead Section of Wikipedia
Figure 1 shows a screenshot example of the lead section of a Wikipedia article about Wikipedia.
Appendix B Model Configurations
We use the model architecture introduced in Bahdanau et al. (2015). The encoder and decoder are built with Gated Recurrent Units (GRU) Cho et al. (2014). The encoder is bidirectional, with a 256-dimensional forward and a 256-dimensional backward GRU. The decoder's hidden size is 512. The word vector size of both encoder and decoder is 300. We use a dropout Srivastava et al. (2014) rate of 0.5 to prevent overfitting. During training, we use the Adam Kingma and Ba (2015) optimizer with its default hyper-parameters to learn the model. The mini-batch size is set to 64. During testing, we use beam search with a beam size of 5.
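For reference, the hyper-parameters above collected into a single config dict (a sketch; the key names are our own):

```python
# Seq2seq hyper-parameters from the text above, gathered in one place.
S2S_CONFIG = {
    "encoder": {"cell": "GRU", "bidirectional": True, "hidden_size": 256},
    "decoder": {"cell": "GRU", "hidden_size": 512},
    "embedding_size": 300,
    "dropout": 0.5,
    "optimizer": "Adam",   # default hyper-parameters
    "batch_size": 64,
    "beam_size": 5,
}
```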
We use the open-source implementation of TextRank in the Gensim Řehůřek and Sojka (2010) toolkit. It refuses to summarize passages with fewer than three sentences; therefore, we randomly select one sentence as the summary for such short passages.
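The fallback logic can be sketched as follows; `textrank_fn` stands in for Gensim's summarizer (a placeholder, written against the interface rather than the library, since Gensim's summarization module was removed in Gensim 4.0):

```python
import random

def summarize_passage(sentences, textrank_fn, seed=0):
    """TextRank refuses passages with fewer than three sentences, so fall
    back to picking one sentence at random, as described above."""
    if len(sentences) < 3:
        return random.Random(seed).choice(sentences)
    return textrank_fn(" ".join(sentences))
```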
We implement the NN-SE model as described in Cheng and Lapata (2016). During testing, we select the sentence with the highest extraction score as the passage summary.
Appendix C Psg2Sum Data Samples
Table 5 shows 5 random examples in the Psg2Sum dataset.
|PSG||A recording of the musical with 19 tracks was issued in the U.S. on Scepter Records in 1971 . It was a reissue of the 1969 Decca UK album , capitalizing on the success of 1970 ’s Jesus Christ Superstar in the U.S. It featured David Daltrey as Joseph , Tim Rice as Pharaoh , Dr. William S. Lloyd Webber on the Hammond organ , Alan Doggett conducting , various solo vocalists and instrumentalists , and the Colet Court choir as the chorus.”Joseph And The Amazing Technicolor Dreamcoat Listing , Scepter Records , SPS-588X , 1971 ” discogs.com , accessed March 17 , 2011Q&A regarding the original Decca and Scepter albums|
|SUM||Joseph and the Amazing Technicolor Dreamcoat is a musical with lyrics by Tim Rice and music by Andrew Lloyd Webber .|
|PSG||In 1994 , Bush took a leave of absence from the Rangers to run for Governor of Texas against the popular incumbent , Democrat Ann Richards . On November 8 , 1994 , he defeated Richards , 53 % to 46 % . As Governor , Bush forged a legislative alliance with powerful Texas Lt . Governor Bob Bullock , a longtime Democrat . In 1998 Bush went on to win re - election in a landslide victory with nearly 69 % of the vote , becoming the first Texas governor to be elected for two consecutive four - year terms . During Bush ’s governorship , he undertook significant legislative changes in criminal justice , tort law , and school financing . Bush took a hard line on capital punishment and received much criticism from advocates wanting to abolish the death penalty . Under Bush , Texas ’ incarceration rate was 1014 inmates per 100,000 state population in 1999 , the second highest in the nation , owing mainly to strict enforcement of drug laws . In September 1999 , Bush signed the Texas Futile Care Law . Bush ’s transformative agenda and family pedigree now provided an opportunity to advance his political career to the national level .|
|SUM||Bush was elected 46th Governor of Texas in 1994 and re - elected in 1998 .|
|PSG||The group ’s first single , ” Saturday Night Party ( Read My Lips ) ” , was an immediate success , and became an Ibiza anthem during the summer of 1993 . It became their first Top 40 hit in the United Kingdom , peaking at # 29 . After introducing a singer to the group ( Shanie Campbell ) , they released the single ” Do n’t Give Me Your Life ” in 1994 , being an extended remix to the original ” Alex Party ” track . It reached # 2 in both Ireland and the United Kingdom ( their highest charting hit in those countries ) and # 13 in Australia , plus it topped the Club Record category at Music Week ’s 1995 Awards . It was included in many compilation albums all over the world , and remains their most famous release .|
|SUM||Their most famous single to date is ” Do n’t Give Me Your Life ” , a # 2 hit in both Ireland and the United Kingdom in early 1995.|
|PSG||Throughout the existence of medieval Livonia there was a constant struggle for superiority in the rule over the lands by the Church , the order , the secular nobles of German descent who ruled the fiefs and the citizens of the Hanseatic town of Riga . Two major civil wars were fought in 1296 - 1330 , 1313 - 1330 , and in 1343 - 1345 the Estonian revolt resulted in the annexation of the Danish Duchy of Estonia within the Teutonic Ordensstaat .|
|SUM||Throughout the existence of medieval Livonia there was a constant struggle over the supremacy of ruling the lands by the Church , the Order , the secular German nobility and the citizens of the Hanseatic towns of Riga and Reval .|
|PSG||Along with Matsumoto Castle and Kumamoto Castle , Himeji Castle is considered one of Japan ’s three premier castles . It is the most visited castle in Japan , receiving over 820,000 visitors annually . Starting in April 2010 , Himeji Castle underwent restoration work to preserve the castle buildings , and reopened to the public on 27 March 2015 .|
|SUM||In order to preserve the castle buildings , it underwent restoration work for several years and reopened to the public on March 27 , 2015 .|