1 Introduction
The development of learning methods for automatic summarization is constrained by the limited high-quality data available for training and evaluation. Large datasets have driven rapid improvement in other natural language generation tasks, such as machine translation, where data size and diversity have proven critical for modeling the alignment between source and target texts
Tiedemann (2012). Similar challenges exist in summarization, with the additional complications introduced by the length of source texts and the diversity of summarization strategies used by writers. Access to large-scale, high-quality data is an essential prerequisite for making substantial progress in summarization. In this paper, we present Newsroom, a dataset with 1.3 million news articles and human-written summaries.

Figure 1: Example summaries from Newsroom.

Abstractive Summary: South African photographer Anton Hammerl, missing in Libya since April 4th, was killed in Libya more than a month ago.

Mixed Summary: A major climate protest in New York on Sunday could mark a seminal shift in the politics of global warming, just ahead of the U.N. Climate Summit.

Extractive Summary: A person familiar with the search tells The Associated Press that Texas has offered its head coaching job to Louisville’s Charlie Strong and he is expected to accept.
Newsroom’s summaries were written by authors and editors in the newsrooms of news, sports, entertainment, financial, and other publications. The summaries were published with articles as HTML metadata for social media services and search engine page descriptions. Newsroom summaries are written by humans, for common readers, and with the explicit purpose of summarization. As a result, Newsroom is a nearly two decade-long snapshot representing how single-document summarization is used in practice across a variety of sources, writers, and topics.
Identifying large, high-quality resources for summarization has called for creative solutions in the past. These include using news headlines as summaries of article prefixes Napoles et al. (2012); Rush et al. (2015), concatenating bullet points as summaries Hermann et al. (2015); See et al. (2017), or using librarian archival summaries Sandhaus (2008). While these solutions provide large-scale data, they come at the cost of how well they reflect the summarization problem, or focus on very specific styles of summarization, as we discuss in Section 4. Newsroom is distinguished from these resources in its combination of size and diversity. Its summaries were written with the explicit goal of concisely summarizing news articles over almost two decades. Rather than relying on a single source, the dataset includes summaries from 38 major publishers. This diversity of sources and time span translates into a diversity of summarization styles.
We explore Newsroom to better understand the dataset and how summarization is used in practice by newsrooms. Our analysis focuses on a key dimension, extractiveness and abstractiveness: extractive summaries frequently borrow words and phrases from their source text, while abstractive summaries describe the contents of articles primarily using new language. We develop measures designed to quantify extractiveness and use these measures to subdivide the data into extractive, mixed, and abstractive subsets, as shown in Figure 1, displaying the broad set of summarization techniques practiced by different publishers.
Finally, we analyze the performance of three summarization models as baselines for Newsroom to better understand the challenges the dataset poses. In addition to automated ROUGE evaluation Lin (2004a, b), we design and execute a benchmark human evaluation protocol to quantify the output summaries’ relevance and quality. Our experiments demonstrate that Newsroom presents an open challenge for summarization systems, while providing a large resource to enable data-intensive learning methods. The dataset and evaluation protocol are available online at summari.es.
2 Existing Datasets
There are several frequently used summarization datasets. Listed in Figure 2 are examples from four datasets. The examples are chosen to be representative: they have scores within 5% of their dataset average across our analysis measures (Section 4). To illustrate the extractive and abstractive nature of summaries, we underline multi-word phrases shared between the article and summary, and italicize words used only in the summary.
DUC
Example Summary: Floods hit north Mozambique as aid to flooded south continues

Start of Article: MAPUTO, Mozambique (AP) — Just as aid agencies were making headway in feeding hundreds of thousands displaced by flooding in southern and central Mozambique, new floods hit a remote northern region Monday. The Messalo River overflowed […]

Gigaword

Example Summary: Seve gets invite to US Open

Start of Article: Seve Ballesteros will be playing in next month’s US Open after all. The USGA decided Tuesday to give the Spanish star a special exemption. American Ben Crenshaw was also given a special exemption by the United States Golf Association. Earlier this week […]

New York Times Corpus

Example Summary: Annual New York City Toy Fair opens in Manhattan; feud between Toy Manufacturers of America and its landlord at International Toy Center leads to confusion and turmoil as registration begins; dispute discussed.

Start of Article: There was toylock when the Toy Fair opened in Manhattan yesterday. The reason? A family feud between the Toy Manufacturers of America and its landlord at Fifth Avenue and 23d Street. Toy buyers and exhibitors arriving to attend the kickoff of the […]

CNN / Daily Mail

Example Summary:

Start of Article: Egyptian authorities have served Al Jazeera with a charge sheet that identifies eight of its staff on a list of 20 people – all believed to be journalists – for allegedly conspiring with a terrorist group, the network said Wednesday. The 20 are wanted by Egyptian […]
2.1 Document Understanding Conference
Datasets produced for the Document Understanding Conference (DUC, http://duc.nist.gov/) are small, high-quality datasets developed to evaluate summarization systems Harman and Over (2004); Dang (2006).
DUC data consist of newswire articles paired with human summaries written specifically for DUC. One distinctive feature of the DUC datasets is the availability of multiple reference summaries for each article. This is a major advantage of DUC compared to other datasets, especially when evaluating with ROUGE Lin (2004b, a), which was designed to be used with multiple references. However, DUC datasets are small, which makes it difficult to use them as training data.
DUC summaries are often used in conjunction with larger training datasets, including Gigaword Rush et al. (2015); Chopra et al. (2016), CNN / Daily Mail Nallapati et al. (2017); Paulus et al. (2017); See et al. (2017), or Daily Mail alone Nallapati et al. (2016b); Cheng and Lapata (2016). The data have also been used to evaluate unsupervised methods Dorr et al. (2003); Mihalcea and Tarau (2004); Barrios et al. (2016).
2.2 Gigaword
The Gigaword Corpus Napoles et al. (2012) contains nearly 10 million documents from seven newswire sources, including the Associated Press, New York Times Newswire Service, and Washington Post Newswire Service. Compared to other existing datasets used for summarization, the Gigaword corpus is the largest and most diverse in its sources. While Gigaword does not contain summaries, prior work uses Gigaword headlines as simulated summaries Rush et al. (2015); Chopra et al. (2016). These systems are trained on Gigaword to recreate headlines given the first sentence of an article. When used this way, Gigaword’s simulated summaries are shorter than most natural summary text. Gigaword, along with similar text-headline datasets Filippova and Altun (2013), is also used for the related sentence compression task Dorr et al. (2003); Filippova et al. (2015).
2.3 New York Times Corpus
The New York Times Annotated Corpus Sandhaus (2008) is the largest summarization dataset currently available. It consists of carefully curated articles from a single source, The New York Times. The corpus contains several hundred thousand articles written between 1987 and 2007 that have paired summaries. The summaries were written for the corpus by library scientists, rather than at the time of publication. Our analysis in Section 4 reveals that the data are somewhat biased toward extractive strategies, making the corpus particularly useful as an extractive summarization dataset. Despite this, limited work has used this dataset for summarization Hong and Nenkova (2014); Durrett et al. (2016); Paulus et al. (2017).
2.4 CNN / Daily Mail
The CNN / Daily Mail question answering dataset Hermann et al. (2015) is frequently used for summarization. The dataset includes CNN and Daily Mail articles, each associated with several bullet point descriptions. When used in summarization, the bullet points are typically concatenated into a single summary (https://github.com/abisee/cnn-dailymail). The dataset has been used for summarization as is See et al. (2017), or after pre-processing for entity anonymization Nallapati et al. (2017). This inconsistent usage makes comparisons between systems using these data challenging. Additionally, some systems use both CNN and Daily Mail for training Nallapati et al. (2017); Paulus et al. (2017); See et al. (2017), whereas others use only Daily Mail articles Nallapati et al. (2016b); Cheng and Lapata (2016). Our analysis shows that the CNN / Daily Mail summaries have a strong bias toward extraction (Section 4). Similar observations about the data were made by Chen et al. (2016) with respect to the question answering task.
3 Collecting Summaries
The Newsroom dataset was collected using social media and search engine metadata. To create the dataset, we performed a Web-scale crawl of over 100 million pages from a set of online publishers. We identified newswire articles and used the summaries provided in the HTML metadata. These summaries were created to be used in search engines and social media.
We collected HTML pages and metadata using the Internet Archive (Archive.org), accessing archived pages of a large number of popular news, sports, and entertainment sites. Using Archive.org provides two key benefits. First, the archive provides an API that allows for collection of data across time, not limited to recently available articles. Second, the archived URLs of the dataset articles are immutable, allowing distribution of this dataset using a thin, URL-only list.
The publisher sites we crawled were selected using a combination of Alexa.com top overall sites, as well as Alexa’s top news sites (Alexa removed the extended public list in 2017; an archived copy is available at https://web.archive.org/web/2016/https://www.alexa.com/topsites/category/News). We supplemented these lists with older lists published by Google of the highest-traffic sites on the Web (Google removed this list in 2013; an archived copy is available at https://web.archive.org/web/2012/http://www.google.com/adplanner/static/top1000).
We excluded sites such as Reddit that primarily aggregate rather than produce content, as well as publisher sites that proved to have few or no articles with summary metadata available, or have articles primarily in languages other than English.
This process resulted in a set of 38 publishers that were included in the dataset.
3.1 Content Scraping
We used two techniques to identify article pages from the selected publishers on Archive.org: the search API and index-page crawl. The API allows queries using URL pattern matching, which focuses article crawling on high-precision subdomains or paths. We used the API to search for content from the publisher domains, using specific patterns or post-processing filtering to ensure article content. In addition, we used Archive.org to retrieve the historical versions of the home page for all publisher domains. The archive has content from 1998 to 2017 with varying degrees of time resolution. We obtained at least one snapshot of each page for every available day. For each snapshot, we retrieved all articles listed on the page.
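As an illustration of the collection step, the sketch below lists archived captures of a publisher's article pages. It assumes the public Wayback Machine CDX endpoint and the parameter names shown; the paper does not specify which Archive.org API or query options were used, and the URL prefix is hypothetical.

```python
import requests

def archived_captures(url_prefix, start_year=1998, end_year=2017):
    """Sketch: list archived captures matching a publisher URL prefix,
    e.g. "example-publisher.com/news/" (hypothetical pattern)."""
    response = requests.get(
        "http://web.archive.org/cdx/search/cdx",
        params={
            "url": url_prefix,
            "matchType": "prefix",      # URL pattern matching
            "from": str(start_year),
            "to": str(end_year),
            "output": "json",
            "filter": "statuscode:200", # keep successful captures only
            "collapse": "urlkey",       # one capture per unique URL
        },
        timeout=30,
    )
    rows = response.json()
    if not rows:
        return []
    header, captures = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in captures]
```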
For both search and crawled URLs, we performed article de-duplication using URLs to control for varying URL fragments, query parameters, protocols, and ports. When performing the merge, we retained only the earliest article version available to prevent the collection of stale summaries that are not updated when articles are changed.
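A minimal sketch of this URL-based de-duplication, assuming capture records shaped like those returned by the previous sketch; the exact normalization rules are not given in the paper.

```python
from urllib.parse import urlsplit

def normalize_url(url):
    # Canonical form for de-duplication: ignore protocol, port,
    # query parameters, and fragments; keep only host and path.
    parts = urlsplit(url.lower())
    host = parts.hostname or ""
    return host + parts.path.rstrip("/")

def deduplicate(captures):
    # Keep only the earliest available capture of each normalized URL
    # (per the paper, to avoid stale summaries left unchanged after edits).
    earliest = {}
    for capture in sorted(captures, key=lambda c: c["timestamp"]):
        earliest.setdefault(normalize_url(capture["original"]), capture)
    return list(earliest.values())
```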
3.2 Content Extraction
Following identification and de-duplication, we extracted the article texts and summaries and further cleaned and filtered the dataset.
Article Text
We used Readability (https://pypi.org/project/readability-lxml/0.6.2/) to extract HTML body content. Readability uses HTML heuristics to extract the main content and title of a page, producing article text without extraneous HTML markup and images. Our preliminary testing, as well as the comparison by Peters (2015), found Readability to be one of the highest-accuracy content extraction algorithms available. To exclude inline advertising and image captions sometimes present in extractions, we applied additional filtering of paragraphs with fewer than five words. We excluded articles with no body text extracted.

Summary Metadata
We extracted the article summaries from the metadata available in the HTML pages of articles. These summaries are often written by newsroom editors and journalists to appear in social media distribution and search results. While there is no standard metadata format for summaries online, common fields are often present in the page’s HTML. Popular metadata field types include og:description, twitter:description, and description. In cases where multiple metadata summaries were available and differed, we used the first field available according to the order above. We excluded articles with no summary text of any type. We also removed article-summary pairs with a high amount of precisely overlapping text, in order to exclude rule-based, automatically generated summaries copied verbatim from the article (e.g., the first paragraph).
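A sketch of the metadata field priority described above. BeautifulSoup is used purely for illustration; the paper does not name the HTML parser used.

```python
from bs4 import BeautifulSoup

# Priority order of the metadata fields named in the text.
SUMMARY_FIELDS = ["og:description", "twitter:description", "description"]

def extract_summary(html):
    """Return the first available metadata summary, or None if absent."""
    soup = BeautifulSoup(html, "html.parser")
    for field in SUMMARY_FIELDS:
        # Some fields appear as <meta property=...>, others as <meta name=...>.
        tag = (soup.find("meta", attrs={"property": field})
               or soup.find("meta", attrs={"name": field}))
        if tag and tag.get("content", "").strip():
            return tag["content"].strip()
    return None
```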
3.3 Building the Dataset
Our scraping and extraction process resulted in a set of 1,321,995 article-summary pairs. Simple dataset statistics are shown in Table 1. The data are divided into training (76%), development (8%), test (8%), and unreleased test (8%) sets using a hash function of the article URL, as sketched below. We use the articles’ Archive.org URLs for lightweight distribution of the data. Archive.org is an ideal platform for distributing the data, as it encourages its users to scrape its resources. We provide the extraction and analysis scripts used during data collection for reproducing the full dataset from the URL list.
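A sketch of the hash-based split. The specific hash function and bucketing are illustrative assumptions; the paper states only that the splits are derived from a hash of the article URL.

```python
import hashlib

def assign_split(article_url):
    """Deterministically map an article URL to a split (76/8/8/8)."""
    bucket = int(hashlib.md5(article_url.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 76:
        return "train"
    if bucket < 84:
        return "dev"
    if bucket < 92:
        return "test"
    return "unreleased-test"
```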
Table 1: Dataset statistics.

Dataset Size | 1,321,995 articles
---|---
Training Set Size | 995,041 articles
Mean Article Length | 658.6 words
Mean Summary Length | 26.7 words
Total Vocabulary Size | 6,925,712 words
Words Occurring 10+ Times | 784,884 words
4 Data Analysis
Newsroom contains summaries from different topic domains, written by many authors, over a span of almost two decades. This diversity is an important aspect of the dataset. We analyze the data to quantify the differences in summarization styles and techniques between the different publications and to show the importance of reflecting this diversity. In Sections 6 and 7, we examine the effect of the dataset diversity on the performance of a variety of summarization systems.
4.1 Characterizing Summarization Strategies
We examine summarization strategies using three measures that capture the degree of text overlap between the summary and article, and the rate of compression of the information conveyed.
Given an article text $A = \langle a_1, a_2, \ldots, a_n \rangle$ consisting of a sequence of tokens and the corresponding article summary $S = \langle s_1, s_2, \ldots, s_m \rangle$, the set of extractive fragments $\mathcal{F}(A, S)$ is the set of shared sequences of tokens in $A$ and $S$. We identify the extractive fragments of an article-summary pair using a greedy process. We process the tokens in the summary in order. At each position, if there is a sequence of tokens in the source text that is a prefix of the remainder of the summary, we mark this prefix as extractive and continue, preferring the longest possible prefix at each step. Otherwise, we mark the current summary token as abstractive. The set $\mathcal{F}(A, S)$ includes all the token sequences identified as extractive. Figure 3 formally describes this procedure; a sketch appears below. Underlined phrases in Figures 1 and 2 are examples of fragments identified as extractive. Using $\mathcal{F}(A, S)$, we compute two measures: extractive fragment coverage and extractive fragment density.
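Below is a minimal Python sketch of this greedy matching. It follows the description above rather than reproducing Figure 3 exactly, and the function name is ours.

```python
def extractive_fragments(article_tokens, summary_tokens):
    """Greedily identify shared token sequences (extractive fragments)
    between an article and its summary, processing the summary in order."""
    fragments = []
    i = 0  # current position in the summary
    while i < len(summary_tokens):
        best = 0  # longest article match starting at summary position i
        for j in range(len(article_tokens)):
            length = 0
            while (i + length < len(summary_tokens)
                   and j + length < len(article_tokens)
                   and summary_tokens[i + length] == article_tokens[j + length]):
                length += 1
            best = max(best, length)
        if best > 0:
            fragments.append(summary_tokens[i:i + best])  # extractive prefix
            i += best
        else:
            i += 1  # abstractive token: no match in the article
    return fragments
```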
Extractive Fragment Coverage
The coverage measure quantifies the extent to which a summary is derivative of a text. $\mathrm{Coverage}(A, S)$ measures the percentage of words in the summary that are part of an extractive fragment with the article:

$$\mathrm{Coverage}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|$$

For example, a summary with 10 words that borrows 7 words from its article text and includes 3 new words will have $\mathrm{Coverage}(A, S) = 0.7$.
Extractive Fragment Density
The density measure quantifies how well the word sequence of a summary can be described as a series of extractions. For instance, a summary might contain many individual words from the article and therefore have a high coverage. However, if arranged in a new order, the words of the summary could still be used to convey ideas not present in the article. We define $\mathrm{Density}(A, S)$ as the average length of the extractive fragment to which each word in the summary belongs. The density formulation is similar to the coverage definition but uses the square of the fragment length:

$$\mathrm{Density}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|^{2}$$

For example, an article with a 10-word summary made of two extractive fragments of lengths 3 and 4 would have $\mathrm{Coverage}(A, S) = 0.7$ and $\mathrm{Density}(A, S) = 2.5$.
Compression Ratio
We use a simple dimension of summarization, compression ratio, to further characterize summarization strategies. We define compression as the word ratio between the article and the summary:

$$\mathrm{Compression}(A, S) = \frac{|A|}{|S|}$$
Summarizing with higher compression is challenging as it requires capturing more precisely the critical aspects of the article text.
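The three measures follow directly from the fragment computation. A minimal sketch, reusing the extractive_fragments function from the previous sketch:

```python
def coverage(article_tokens, summary_tokens):
    # Fraction of summary words that fall inside an extractive fragment.
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) for f in fragments) / len(summary_tokens)

def density(article_tokens, summary_tokens):
    # Average fragment length per summary word (sum of squared lengths).
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return sum(len(f) ** 2 for f in fragments) / len(summary_tokens)

def compression(article_tokens, summary_tokens):
    # Word ratio between the article and its summary.
    return len(article_tokens) / len(summary_tokens)

# Worked example from the text: a 10-word summary with two fragments of
# lengths 3 and 4 gives coverage (3 + 4) / 10 = 0.7 and
# density (9 + 16) / 10 = 2.5.
```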
4.2 Analysis of Dataset Diversity
We use density, coverage, and compression to understand the distribution of human summarization techniques across different sources. Figure 4 shows the distributions of summaries for different domains in the dataset, along with three major existing summarization datasets: DUC 2003-2004 (combined), CNN / Daily Mail, and the New York Times Corpus.
Publication Diversity
Each publication shows a unique distribution of summaries mixing extractive and abstractive strategies in varying amounts. For example, the third entry on the top row shows the summarization strategy used by BuzzFeed. The density (y-axis) is relatively low, meaning BuzzFeed summaries are unlikely to include long extractive fragments. While the coverage (x-axis) is more varied, BuzzFeed’s coverage tends to be lower, indicating that it frequently uses novel words in summaries. The publication plots in the figure are sorted by median compression ratio. We observe that publications with lower compression ratio (top-left of the figure) exhibit higher diversity along both dimensions of extractiveness. However, as the median compression ratio increases, the distributions become more concentrated, indicating that summarization strategies become more rigid.
Dataset Diversity
Figure 4 demonstrates how DUC, CNN / Daily Mail, and the New York Times exhibit different human summarization strategies. DUC summarization is fairly similar to the high-compression newsrooms shown in the lower publication plots in Figure 4. However, DUC’s median compression ratio is much higher than that of all other datasets and publications. The figure shows that CNN / Daily Mail and New York Times are skewed toward extractive summaries with lower compression ratios. CNN / Daily Mail shows higher coverage and density than all other datasets and publishers in our data. Compared to existing datasets, Newsroom covers a much larger range of summarization styles, from highly extractive to highly abstractive.
5 Performance of Existing Systems
We train and evaluate several summarization systems to understand the challenges Newsroom poses and its usefulness for training systems. We evaluate three systems, each using a different summarization strategy with respect to extractiveness: fully extractive (TextRank), fully abstractive (Seq2Seq), and mixed (pointer-generator). We further study the performance of the pointer-generator model on Newsroom by training three systems using different dataset configurations. We compare these systems to two rule-based systems that provide a baseline (Lede-3) and an extractive oracle (Fragments).
Extractive: TextRank
TextRank is a sentence-level extractive summarization system. The system was originally developed by Mihalcea and Tarau (2004) and was later further developed and improved by Barrios et al. (2016). TextRank uses an unsupervised sentence-ranking approach similar to Google PageRank Page et al. (1999). TextRank picks a sequence of sentences from a text for the summary up to a maximum allowable length. While this maximum length is typically preset by the user, we tune this parameter to optimize ROUGE-1 F1-score on the training data. We experimented with values between 1 and 200 words, and found the optimal value to be 50 words; a sketch of this tuning loop follows. We use the tuned TextRank in Tables 2 and 3 and in the supplementary material.
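A minimal sketch of this length tuning, written as a generic search loop; `summarize` and `rouge_1_f` are hypothetical stand-ins for a TextRank summarizer and a ROUGE-1 F1 scorer, neither of which is named in the paper.

```python
def tune_length_limit(pairs, summarize, rouge_1_f, limits=range(1, 201)):
    """Pick the word limit maximizing mean ROUGE-1 F1 over a list of
    (article, reference_summary) pairs. `summarize(article, limit)` and
    `rouge_1_f(candidate, reference)` are assumed, hypothetical callables."""
    best_limit, best_score = None, float("-inf")
    for limit in limits:
        score = sum(
            rouge_1_f(summarize(article, limit), reference)
            for article, reference in pairs
        ) / len(pairs)
        if score > best_score:
            best_limit, best_score = limit, score
    return best_limit  # the paper reports an optimum of 50 words on Newsroom
```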
Abstractive: Seq2Seq / Attention
Sequence-to-sequence models with attention Cho et al. (2014); Sutskever et al. (2014); Bahdanau et al. (2014) have been applied to various language tasks, including summarization Chopra et al. (2016); Nallapati et al. (2016a). The process by which the model produces tokens is abstractive, as there is no explicit mechanism to copy tokens from the input text. We train a TensorFlow implementation (https://github.com/tensorflow/models/tree/f87a58/research/textsum) of the Rush et al. (2015) model using Newsroom; this system appears as Abs-N in the tables.

Mixed: Pointer-Generator
The pointer-generator model See et al. (2017) uses abstractive token generation and extractive token copying using a pointer mechanism Vinyals et al. (2015); Gülçehre et al. (2016), keeping track of extractions using coverage Tu et al. (2016). We evaluate three instances of this model by varying the training data: (1) Pointer-C, trained on the CNN / Daily Mail dataset; (2) Pointer-N, trained on the Newsroom dataset; and (3) Pointer-S, trained on a random subset of the Newsroom training data of the same size as the CNN / Daily Mail training set. The last instance aims to understand the effects of dataset size and summary diversity.
Lower Bound: Lede-3
A common automatic summarization strategy of online publications is to copy the first sentence, first paragraph, or first words of the text and treat this as the summary. Following prior work See et al. (2017); Nallapati et al. (2017), we use the Lede-3 baseline, in which the first three sentences of the text are returned as the summary. Though simple, this baseline is competitive with state-of-the-art systems.
Extractive Oracle: Fragments
This system has access to the reference summary. Given an article $A$ and its summary $S$, the system computes $\mathcal{F}(A, S)$ (Section 4). Fragments concatenates the fragments in $\mathcal{F}(A, S)$ in the order they appear in the summary, representing the best possible performance of an ideal extractive system; a sketch follows. Only systems that are capable of abstractive reasoning can outperform the ROUGE scores of Fragments.
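A minimal sketch of this oracle, reusing the extractive_fragments function sketched in Section 4.

```python
def fragments_oracle(article_tokens, summary_tokens):
    """Concatenate the extractive fragments of an article-summary pair,
    in the order they appear in the summary (an upper bound for purely
    extractive systems under ROUGE)."""
    fragments = extractive_fragments(article_tokens, summary_tokens)
    return " ".join(token for fragment in fragments for token in fragment)
```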
6 Automatic Evaluation
Table 2: ROUGE F1 results on DUC 2003 & 2004, CNN / Daily Mail, and the Newsroom test (T) and unreleased test (U) sets.

DUC 2003 & 2004 | CNN / Daily Mail | Newsroom - T | Newsroom - U
R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | |
Lede-3 | 12.99 | 3.89 | 11.44 | 38.64 | 17.12 | 35.13 | 30.49 | 21.27 | 28.42 | 30.63 | 21.41 | 28.57 |
Fragments | 87.04 | 68.45 | 87.04 | 93.36 | 83.19 | 93.36 | 88.46 | 76.03 | 88.46 | 88.48 | 76.06 | 88.48 |
TextRank | 15.75 | 4.06 | 13.02 | 29.06 | 11.14 | 24.57 | 22.77 | 9.79 | 18.98 | 22.76 | 9.80 | 18.97 |
Abs-N | 2.44 | 0.04 | 2.37 | 5.07 | 0.16 | 4.80 | 5.88 | 0.39 | 5.32 | 5.90 | 0.43 | 5.36 |
Pointer-C | 12.40 | 2.88 | 10.74 | 32.51 | 11.90 | 28.95 | 20.25 | 7.32 | 17.30 | 20.29 | 7.33 | 17.31 |
Pointer-S | 15.10 | 4.55 | 12.42 | 34.33 | 13.79 | 28.42 | 24.50 | 12.60 | 20.33 | 24.48 | 12.52 | 20.30 |
Pointer-N | 17.29 | 5.01 | 14.53 | 31.61 | 11.70 | 27.23 | 26.02 | 13.25 | 22.43 | 26.04 | 13.24 | 22.45 |
Table 3: ROUGE F1 results on the Newsroom development set, by extractiveness subset (Extractive, Mixed, Abstractive) and on the full development set (Newsroom - D).

Extractive | Mixed | Abstractive | Newsroom - D
R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | |
Lede-3 | 53.05 | 49.01 | 52.37 | 25.15 | 12.88 | 22.08 | 13.69 | 2.42 | 11.24 | 30.72 | 21.53 | 28.65 |
Fragments | 98.95 | 97.89 | 98.95 | 92.68 | 82.09 | 92.68 | 73.43 | 47.66 | 73.43 | 88.46 | 76.07 | 88.46 |
TextRank | 32.43 | 19.68 | 28.68 | 22.30 | 7.87 | 17.75 | 13.54 | 1.88 | 10.46 | 22.82 | 9.85 | 19.02 |
Abs-N | 6.08 | 0.21 | 5.42 | 5.67 | 0.15 | 5.08 | 6.21 | 1.07 | 5.68 | 5.98 | 0.48 | 5.39 |
Pointer-C | 28.34 | 14.65 | 25.21 | 20.22 | 6.51 | 16.88 | 13.11 | 1.62 | 10.72 | 20.47 | 7.50 | 17.51 |
Pointer-S | 37.29 | 26.56 | 33.34 | 23.71 | 10.59 | 18.79 | 13.89 | 2.22 | 10.34 | 24.83 | 12.94 | 20.66 |
Pointer-N | 39.11 | 27.95 | 36.17 | 25.48 | 11.04 | 21.06 | 14.66 | 2.26 | 11.44 | 26.27 | 13.55 | 22.72 |
We study model performance on Newsroom, CNN / Daily Mail, and the combined DUC 2003 and 2004 datasets. We use the five systems described in Section 5, including the extractive oracle. We also evaluate the systems on subsets of Newsroom to characterize the sensitivity of systems to different levels of extractiveness in reference summaries. We use the F1-score variants of ROUGE-1, ROUGE-2, and ROUGE-L to account for different summary lengths. ROUGE scores are computed with the default configuration of the Lin (2004b) ROUGE v1.5.5 reference implementation. Input article text and reference summaries for all systems are tokenized using the Stanford CoreNLP tokenizer Manning et al. (2014).
Table 2 shows results for summarization systems on DUC, CNN / Daily Mail, and Newsroom. In nearly all cases, the fully extractive Lede-3 baseline produces the most successful summaries, with the exception of the highly compressed DUC data. Among the models, the Newsroom-trained Pointer-N performs best on all datasets other than CNN / Daily Mail, an out-of-domain dataset. Pointer-C, which is trained only on CNN / Daily Mail and therefore sees a narrower range of summarization styles, performs worse than Pointer-N on average. However, despite not being trained on CNN / Daily Mail, Pointer-S outperforms Pointer-C on its own data under ROUGE-N and is competitive under ROUGE-L. Finally, both Pointer-N and Pointer-S outperform other systems and baselines on DUC, whereas Pointer-C does not outperform Lede-3.
Table 3 shows development results on the Newsroom data for different levels of extractiveness. Pointer-N outperforms the remaining models across all extractive subsets of Newsroom and, in the case of the abstractive subset, exceeds the performance of Lede-3. The success of Pointer-N and Pointer-S in generalizing and outperforming models on DUC and CNN / Daily Mail indicates the usefulness of Newsroom for generalizing to out-of-domain data. Similar subset analyses for our other two measures, coverage and compression, are included in the supplementary material.
7 Human Evaluation
ROUGE scores systems using frequencies of shared n-grams. Evaluating systems with ROUGE alone biases scoring against abstractive systems, which rely more on paraphrasing. To overcome this limitation, we provide a human evaluation of the different systems on Newsroom. While human evaluation is still uncommon in summarization work, developing a benchmark dataset presents an opportunity for developing an accompanying protocol for human evaluation.
Our evaluation method is centered around three objectives: (1) distinguishing between syntactic and semantic summarization quality, (2) providing a reliable (consistent and replicable) measurement, and (3) allowing for portability such that the measure can be applied to other models or summarization datasets.
We select two semantic and two syntactic dimensions for evaluation based on experiments with evaluation tasks by Paulus et al. (2017) and Tan et al. (2017). The two semantic dimensions, summary informativeness (INF) and relevance (REL), measure whether the system-generated text is useful as a summary and appropriate for the source text, respectively. The two syntactic dimensions, fluency (FLU) and coherence (COH), measure whether individual sentences or phrases of the summary are well-written and whether the summary as a whole makes sense, respectively. Evaluation was performed on 60 summaries, 20 from each extractiveness subset. Each system-article pair was evaluated by three unique raters. Exact prompts given to raters for each dimension are shown in Table 4.
Dimension | Prompt |
---|---|
Informativeness | How well does the summary capture the key points of the article? |
Relevance | Are the details provided by the summary consistent with details in the article? |
Fluency | Are the individual sentences of the summary well-written and grammatical? |
Coherence | Do phrases and sentences of the summary fit together and make sense collectively? |
Semantic | Syntactic | ||||
INF | REL | FLU | COH | Avg. | |
Lede-3 | 3.98 | 4.13 | 4.13 | 4.08 | 4.08 |
Fragments | 2.91 | 3.26 | 3.09 | 3.06 | 3.08 |
TextRank | 3.61 | 3.92 | 3.87 | 3.86 | 3.81 |
Abs-N | 2.09 | 2.35 | 2.66 | 2.50 | 2.40 |
Pointer-C | 3.55 | 3.78 | 3.22 | 3.30 | 3.46 |
Pointer-S | 3.77 | 4.02 | 3.56 | 3.56 | 3.73 |
Pointer-N | 3.36 | 3.82 | 3.43 | 3.39 | 3.50 |
Table 5 shows the mean score given to each system under each of the four dimensions, as well as the mean overall score (rightmost column). No summarization system exceeded the scores given to the Lede-3 baseline. However, the extractive oracle designed to maximize n-gram based evaluation performed worse than the majority of systems under human evaluation. While the fully abstractive Abs-N model performed very poorly under automatic evaluation, it fared slightly better when scored by humans. Among the summarization systems, TextRank received the highest overall score. TextRank generates full sentences extracted from the article, and raters preferred TextRank primarily for its fluency and coherence. The pointer-generator models do not have this advantage, and raters did not find them to be as syntactically sound as TextRank. However, raters preferred the informativeness and relevance of the Pointer-S and Pointer-N models, though not the Pointer-C model, over TextRank.
8 Conclusion
We present Newsroom, a dataset of articles and their summaries written in the newsrooms of online publications. Newsroom is the largest summarization dataset available to date and exhibits a wide variety of human summarization strategies. Our proposed measures and our analysis of the strategies used by different publications and articles suggest new directions for evaluating the difficulty of summarization tasks and for developing future summarization models. We show that the dataset’s diversity of summaries presents a new challenge to summarization systems. Finally, we find that using Newsroom to train an existing state-of-the-art mixed-strategy summarization model results in performance improvements on out-of-domain data. The dataset is available online at summari.es.
Acknowledgements
This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback.
References
- Bahdanau et al. (2014) Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural machine translation by jointly learning to align and translate. CoRR abs/1409.0473. http://arxiv.org/abs/1409.0473.
- Barrios et al. (2016) Federico Barrios, Federico López, Luis Argerich, and Rosa Wachenchauzer. 2016. Variations of the similarity function of textrank for automated summarization. CoRR abs/1602.03606. http://arxiv.org/abs/1602.03606.
- Chen et al. (2016) Danqi Chen, Jason Bolton, and Christopher D. Manning. 2016. A thorough examination of the cnn/daily mail reading comprehension task. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. http://aclweb.org/anthology/P/P16/P16-1223.pdf.
- Cheng and Lapata (2016) Jianpeng Cheng and Mirella Lapata. 2016. Neural summarization by extracting sentences and words. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics, ACL 2016, August 7-12, 2016, Berlin, Germany, Volume 1: Long Papers. http://aclweb.org/anthology/P/P16/P16-1046.pdf.
- Cho et al. (2014) Kyunghyun Cho, Bart van Merrienboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder–decoder for statistical machine translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). https://doi.org/10.3115/v1/D14-1179.
- Chopra et al. (2016) Sumit Chopra, Michael Auli, and Alexander M. Rush. 2016. Abstractive sentence summarization with attentive recurrent neural networks. In NAACL HLT 2016, The 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego California, USA, June 12-17, 2016. pages 93–98. http://aclweb.org/anthology/N/N16/N16-1012.pdf.
- Dang (2006) Hoa Trang Dang. 2006. Duc 2005: Evaluation of question-focused summarization systems. In Proceedings of the Workshop on Task-Focused Summarization and Question Answering. Association for Computational Linguistics, Stroudsburg, PA, USA, SumQA ’06, pages 48–55. http://dl.acm.org/citation.cfm?id=1654679.1654689.
- Dorr et al. (2003) Bonnie Dorr, David Zajic, and Richard Schwartz. 2003. Hedge trimmer: A parse-and-trim approach to headline generation. In Proceedings of the HLT-NAACL 03 on Text Summarization Workshop - Volume 5. Association for Computational Linguistics, Stroudsburg, PA, USA, HLT-NAACL-DUC ’03, pages 1–8. https://doi.org/10.3115/1119467.1119468.
- Durrett et al. (2016) Greg Durrett, Taylor Berg-Kirkpatrick, and Dan Klein. 2016. Learning-based single-document summarization with compression and anaphoricity constraints. The Association for Computer Linguistics. http://www.aclweb.org/anthology/P16-1188.
- Filippova et al. (2015) Katja Filippova, Enrique Alfonseca, Carlos A. Colmenares, Lukasz Kaiser, and Oriol Vinyals. 2015. Sentence compression by deletion with lstms. In Lluís Màrquez, Chris Callison-Burch, Jian Su, Daniele Pighin, and Yuval Marton, editors, EMNLP. The Association for Computational Linguistics, pages 360–368. http://aclweb.org/anthology/D/D15/D15-1042.pdf.
- Filippova and Altun (2013) Katja Filippova and Yasemin Altun. 2013. Overcoming the lack of parallel data in sentence compression. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, EMNLP 2013, 18-21 October 2013, Grand Hyatt Seattle, Seattle, Washington, USA, A meeting of SIGDAT, a Special Interest Group of the ACL. pages 1481–1491. http://aclweb.org/anthology/D/D13/D13-1155.pdf.
- Gülçehre et al. (2016) Çaglar Gülçehre, Sungjin Ahn, Ramesh Nallapati, Bowen Zhou, and Yoshua Bengio. 2016. Pointing the unknown words http://aclweb.org/anthology/P/P16/P16-1014.pdf.
- Harman and Over (2004) Donna Harman and Paul Over. 2004. The effects of human variation in duc summarization evaluation. http://www.aclweb.org/anthology/W04-1003.
- Hermann et al. (2015) Karl Moritz Hermann, Tomáš Kočiský, Edward Grefenstette, Lasse Espeholt, Will Kay, Mustafa Suleyman, and Phil Blunsom. 2015. Teaching machines to read and comprehend. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1. MIT Press, Cambridge, MA, USA, NIPS’15, pages 1693–1701. http://dl.acm.org/citation.cfm?id=2969239.2969428.
- Hong and Nenkova (2014) Kai Hong and Ani Nenkova. 2014. Improving the estimation of word importance for news multi-document summarization. In Proceedings of the 14th Conference of the European Chapter of the Association for Computational Linguistics, EACL 2014, April 26-30, 2014, Gothenburg, Sweden. pages 712–721. http://aclweb.org/anthology/E/E14/E14-1075.pdf.
- Lin (2004a) C. Y. Lin. 2004a. Looking for a few good metrics: Automatic summarization evaluation - how many samples are enough? In Proceedings of the NTCIR Workshop 4.
- Lin (2004b) Chin-Yew Lin. 2004b. Rouge: A package for automatic evaluation of summaries. In Proc. ACL workshop on Text Summarization Branches Out. page 10. http://research.microsoft.com/~cyl/download/papers/WAS2004.pdf.
- Manning et al. (2014) Christopher D. Manning, Mihai Surdeanu, John Bauer, Jenny Finkel, Steven J. Bethard, and David McClosky. 2014. The Stanford CoreNLP natural language processing toolkit. In Association for Computational Linguistics (ACL) System Demonstrations. pages 55–60. http://www.aclweb.org/anthology/P/P14/P14-5010.
- Mihalcea and Tarau (2004) R. Mihalcea and P. Tarau. 2004. TextRank: Bringing order into texts. In Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing (EMNLP 2004). http://www.aclweb.org/anthology/W04-3252.
- Nallapati et al. (2017) Ramesh Nallapati, Feifei Zhai, and Bowen Zhou. 2017. Summarunner: A recurrent neural network based sequence model for extractive summarization of documents. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence, February 4-9, 2017, San Francisco, California, USA. pages 3075–3081. http://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14636.
- Nallapati et al. (2016a) Ramesh Nallapati, Bowen Zhou, Cícero Nogueira dos Santos, Çaglar Gülçehre, and Bing Xiang. 2016a. Abstractive text summarization using sequence-to-sequence rnns and beyond. In Proceedings of the 20th SIGNLL Conference on Computational Natural Language Learning, CoNLL 2016, Berlin, Germany, August 11-12, 2016. pages 280–290. http://aclweb.org/anthology/K/K16/K16-1028.pdf.
- Nallapati et al. (2016b) Ramesh Nallapati, Bowen Zhou, and Mingbo Ma. 2016b. Classify or select: Neural architectures for extractive document summarization. CoRR abs/1611.04244. http://arxiv.org/abs/1611.04244.
- Napoles et al. (2012) Courtney Napoles, Matthew Gormley, and Benjamin Van Durme. 2012. Annotated gigaword. In Proceedings of the Joint Workshop on Automatic Knowledge Base Construction and Web-scale Knowledge Extraction. Association for Computational Linguistics, Stroudsburg, PA, USA, AKBC-WEKEX ’12, pages 95–100. http://dl.acm.org/citation.cfm?id=2391200.2391218.
- Page et al. (1999) Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. 1999. The pagerank citation ranking: Bringing order to the web. Technical Report 1999-66, Stanford InfoLab. http://ilpubs.stanford.edu:8090/422/.
- Paulus et al. (2017) Romain Paulus, Caiming Xiong, and Richard Socher. 2017. A deep reinforced model for abstractive summarization. CoRR abs/1705.04304. http://arxiv.org/abs/1705.04304.
- Peters (2015) Matt Peters. 2015. Benchmarking python content extraction algorithms: Dragnet, readability, goose, and eatiht. https://moz.com/devblog/benchmarking-python-content-extraction-algorithms-dragnet-readability-goose-and-eatiht/.
- Rush et al. (2015) Alexander M. Rush, Sumit Chopra, and Jason Weston. 2015. A neural attention model for abstractive sentence summarization. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, EMNLP 2015, Lisbon, Portugal, September 17-21, 2015. pages 379–389. http://aclweb.org/anthology/D/D15/D15-1044.pdf.
- Sandhaus (2008) E. Sandhaus. 2008. The New York Times Annotated Corpus. Linguistic Data Consortium, Philadelphia 6(12).
- See et al. (2017) Abigail See, Peter J. Liu, and Christopher D. Manning. 2017. Get to the point: Summarization with pointer-generator networks. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Papers. pages 1073–1083. https://doi.org/10.18653/v1/P17-1099.
- Sutskever et al. (2014) Ilya Sutskever, Oriol Vinyals, and Quoc V. Le. 2014. Sequence to sequence learning with neural networks. In Neural Information Processing Systems.
- Tan et al. (2017) Jiwei Tan, Xiaojun Wan, and Jianguo Xiao. 2017. Abstractive document summarization with a graph-based attentional neural model. In ACL. http://www.aclweb.org/anthology/P17-1108.
- Tiedemann (2012) Jörg Tiedemann. 2012. Parallel data, tools and interfaces in opus. In LREC. volume 2012, pages 2214–2218.
- Tu et al. (2016) Zhaopeng Tu, Zhengdong Lu, Yang Liu, Xiaohua Liu, and Hang Li. 2016. Modeling coverage for neural machine translation. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, pages 76–85. https://doi.org/10.18653/v1/P16-1008.
- Vinyals et al. (2015) Oriol Vinyals, Meire Fortunato, and Navdeep Jaitly. 2015. Pointer networks. In C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett, editors, Advances in Neural Information Processing Systems 28. Curran Associates, Inc., pages 2692–2700. http://papers.nips.cc/paper/5866-pointer-networks.pdf.
Additional Evaluation
In Section 4, we discuss three measures of summarization diversity: coverage, density, and compression. In addition to quantifying the diversity of summarization strategies, these measures are helpful for system error analysis. We used the density measure to understand how system performance varies when compared against references using different extractive strategies, subdividing Newsroom into three subsets by extractiveness and evaluating with ROUGE on each. We show here a similar analysis using the remaining two measures, coverage and compression. Results for subsets based on coverage and compression are shown in Tables 6 and 7.
Low Coverage | Medium | High Coverage | |||||||
R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | |
Lede-3 | 15.07 | 4.02 | 12.66 | 29.66 | 18.69 | 26.98 | 46.89 | 41.25 | 45.77 |
Fragments | 72.45 | 46.16 | 72.45 | 93.41 | 83.08 | 93.41 | 99.13 | 98.16 | 99.13 |
TextRank | 14.43 | 2.80 | 11.36 | 23.62 | 9.48 | 19.27 | 30.15 | 17.04 | 26.18 |
Abs-N | 6.25 | 1.09 | 5.72 | 5.61 | 0.15 | 5.05 | 6.10 | 0.19 | 5.40 |
Pointer-C | 13.99 | 2.46 | 11.57 | 21.70 | 8.06 | 18.47 | 25.80 | 12.06 | 22.57 |
Pointer-S | 15.16 | 3.63 | 11.61 | 26.95 | 14.51 | 22.30 | 32.42 | 20.77 | 28.15 |
Pointer-N | 16.07 | 3.78 | 12.85 | 28.79 | 15.31 | 24.79 | 34.03 | 21.67 | 30.62 |
Low Compression | Medium | High Compression | |||||||
R-1 | R-2 | R-L | R-1 | R-2 | R-L | R-1 | R-2 | R-L | |
Lede-3 | 42.89 | 34.91 | 41.06 | 30.62 | 20.77 | 28.30 | 18.57 | 8.83 | 16.53 |
Fragments | 87.78 | 77.20 | 87.78 | 89.73 | 77.66 | 89.73 | 87.88 | 73.34 | 87.88 |
TextRank | 30.35 | 17.51 | 26.67 | 22.98 | 8.69 | 18.56 | 15.07 | 3.31 | 11.78 |
Abs-N | 6.27 | 0.75 | 5.65 | 6.22 | 0.52 | 5.60 | 5.48 | 0.18 | 4.93 |
Pointer-C | 27.47 | 13.49 | 24.18 | 20.05 | 6.25 | 16.76 | 14.07 | 2.89 | 11.76 |
Pointer-S | 35.42 | 23.43 | 30.89 | 24.11 | 11.28 | 19.45 | 15.31 | 4.46 | 11.98 |
Pointer-N | 36.96 | 24.52 | 33.43 | 25.56 | 11.68 | 21.47 | 16.57 | 4.72 | 13.52 |