Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

by Max Grusky, et al.

We present NEWSROOM, a summarization dataset of 1.3 million articles and summaries written by authors and editors in newsrooms of 38 major news publications. Extracted from search and social media metadata between 1998 and 2017, these high-quality summaries demonstrate high diversity of summarization styles. In particular, the summaries combine abstractive and extractive strategies, borrowing words and phrases from articles at varying rates. We analyze the extraction strategies used in NEWSROOM summaries against other datasets to quantify the diversity and difficulty of our new data, and train existing methods on the data to evaluate its utility and challenges. The dataset is available online at


1 Introduction

The development of learning methods for automatic summarization is constrained by the limited high-quality data available for training and evaluation. Large datasets have driven rapid improvement in other natural language generation tasks, such as machine translation, where data size and diversity have proven critical for modeling the alignment between source and target texts Tiedemann (2012). Similar challenges exist in summarization, with the additional complications introduced by the length of source texts and the diversity of summarization strategies used by writers. Access to large-scale high-quality data is an essential prerequisite for making substantial progress in summarization. In this paper, we present NEWSROOM, a dataset with 1.3 million news articles and human-written summaries.

Abstractive Summary: South African photographer Anton Hammerl, missing in Libya since April 4th, was killed in Libya more than a month ago.
Mixed Summary: A major climate protest in New York on Sunday could mark a seminal shift in the politics of global warming, just ahead of the U.N. Climate Summit.
Extractive Summary: A person familiar with the search tells The Associated Press that Texas has offered its head coaching job to Louisville's Charlie Strong and he is expected to accept.
Figure 1: NEWSROOM summaries showing different extraction strategies, from three different publications. Multi-word phrases shared between article and summary are underlined. Novel words used only in the summary are italicized.

NEWSROOM's summaries were written by authors and editors in the newsrooms of news, sports, entertainment, financial, and other publications. The summaries were published with articles as HTML metadata for social media services and search engine page descriptions. NEWSROOM summaries are written by humans, for common readers, and with the explicit purpose of summarization. As a result, NEWSROOM is a nearly two-decade-long snapshot representing how single-document summarization is used in practice across a variety of sources, writers, and topics.

Identifying large, high-quality resources for summarization has called for creative solutions in the past. These include using news headlines as summaries of article prefixes Napoles et al. (2012); Rush et al. (2015), concatenating bullet points as summaries Hermann et al. (2015); See et al. (2017), or using librarian archival summaries Sandhaus (2008). While these solutions provide large-scale data, they come at the cost of how well they reflect the summarization problem, or they focus on very specific styles of summarization, as we discuss in Section 4. NEWSROOM is distinguished from these resources by its combination of size and diversity. The summaries were written with the explicit goal of concisely summarizing news articles over almost two decades. Rather than rely on a single source, the dataset includes summaries from 38 major publishers. This diversity of sources and time span translates into a diversity of summarization styles.

We explore NEWSROOM to better understand the dataset and how summarization is used in practice by newsrooms. Our analysis focuses on a key dimension, extractiveness and abstractiveness: extractive summaries frequently borrow words and phrases from their source text, while abstractive summaries describe the contents of articles primarily using new language. We develop measures designed to quantify extractiveness and use these measures to subdivide the data into extractive, mixed, and abstractive subsets, as shown in Figure 1, displaying the broad set of summarization techniques practiced by different publishers.

Finally, we analyze the performance of three summarization models as baselines for NEWSROOM to better understand the challenges the dataset poses. In addition to automated ROUGE evaluation Lin (2004a, b), we design and execute a benchmark human evaluation protocol to quantify the output summaries' relevance and quality. Our experiments demonstrate that NEWSROOM presents an open challenge for summarization systems, while providing a large resource to enable data-intensive learning methods. The dataset and evaluation protocol are available online at

2 Existing Datasets

There are several frequently used summarization datasets. Figure 2 lists examples from four of them. The examples are chosen to be representative: they have scores within 5% of their dataset average across our analysis measures (Section 4). To illustrate the extractive and abstractive nature of summaries, we underline multi-word phrases shared between the article and summary, and italicize words used only in the summary.


DUC

Example Summary: Floods hit north Mozambique as aid to flooded south continues
Start of Article: MAPUTO, Mozambique (AP) — Just as aid agencies were making headway in feeding hundreds of thousands displaced by flooding in southern and central Mozambique, new floods hit a remote northern region Monday. The Messalo River overflowed […]


Gigaword

Example Summary: Seve gets invite to US Open
Start of Article: Seve Ballesteros will be playing in next month’s US Open after all. The USGA decided Tuesday to give the Spanish star a special exemption. American Ben Crenshaw was also given a special exemption by the United States Golf Association. Earlier this week […]

New York Times Corpus

Example Summary: Annual New York City Toy Fair opens in Manhattan; feud between Toy Manufacturers of America and its landlord at International Toy Center leads to confusion and turmoil as registration begins; dispute discussed.
Start of Article: There was toylock when the Toy Fair opened in Manhattan yesterday. The reason? A family feud between the Toy Manufacturers of America and its landlord at Fifth Avenue and 23d Street. Toy buyers and exhibitors arriving to attend the kickoff of the […]

CNN / Daily Mail

Example Summary:
  • Eight Al Jazeera journalists are named on an Egyptian charge sheet, the network says

  • The eight were among 20 people named “Most are not employees of Al Jazeera,” the network said

  • The eight include three journalists jailed in Egypt

Start of Article: Egyptian authorities have served Al Jazeera with a charge sheet that identifies eight of its staff on a list of 20 people – all believed to be journalists – for allegedly conspiring with a terrorist group, the network said Wednesday. The 20 are wanted by Egyptian […]
Figure 2: Example summaries for existing datasets.

2.1 Document Understanding Conference

Datasets produced for the Document Understanding Conference (DUC) are small, high-quality datasets developed to evaluate summarization systems Harman and Over (2004); Dang (2006).

DUC data consist of newswire articles paired with human summaries written specifically for DUC. One distinctive feature of the DUC datasets is the availability of multiple reference summaries for each article. This is a major advantage of DUC compared to other datasets, especially when evaluating with ROUGE Lin (2004b, a), which was designed to be used with multiple references. However, DUC datasets are small, which makes it difficult to use them as training data.

DUC summaries are often used in conjunction with larger training datasets, including Gigaword Rush et al. (2015); Chopra et al. (2016), CNN / Daily Mail Nallapati et al. (2017); Paulus et al. (2017); See et al. (2017), or Daily Mail alone Nallapati et al. (2016b); Cheng and Lapata (2016). The data have also been used to evaluate unsupervised methods Dorr et al. (2003); Mihalcea and Tarau (2004); Barrios et al. (2016).

2.2 Gigaword

The Gigaword Corpus Napoles et al. (2012) contains nearly 10 million documents from seven newswire sources, including the Associated Press, New York Times Newswire Service, and Washington Post Newswire Service. Compared to other existing datasets used for summarization, the Gigaword corpus is the largest and most diverse in its sources. While Gigaword does not contain summaries, prior work uses Gigaword headlines as simulated summaries Rush et al. (2015); Chopra et al. (2016). These systems are trained on Gigaword to recreate headlines given the first sentence of an article. When used this way, Gigaword's simulated summaries are shorter than most natural summary text. Gigaword, along with similar text-headline datasets Filippova and Altun (2013), is also used for the related sentence compression task Dorr et al. (2003); Filippova et al. (2015).

2.3 New York Times Corpus

The New York Times Annotated Corpus Sandhaus (2008) is the largest summarization dataset currently available. It consists of carefully curated articles from a single source, The New York Times. The corpus contains several hundred thousand articles written between 1987 and 2007 that have paired summaries. The summaries were written for the corpus by library scientists, rather than at the time of publication. Our analysis in Section 4 reveals that the data are somewhat biased toward extractive strategies, making the corpus particularly useful as an extractive summarization dataset. Despite this, limited work has used this dataset for summarization Hong and Nenkova (2014); Durrett et al. (2016); Paulus et al. (2017).

2.4 CNN / Daily Mail

The CNN / Daily Mail question answering dataset Hermann et al. (2015) is frequently used for summarization. The dataset includes CNN and Daily Mail articles, each associated with several bullet point descriptions. When used in summarization, the bullet points are typically concatenated into a single summary. The dataset has been used for summarization as is See et al. (2017), or after pre-processing for entity anonymization Nallapati et al. (2017). This difference in usage makes comparisons between systems using these data challenging. Additionally, some systems use both CNN and Daily Mail for training Nallapati et al. (2017); Paulus et al. (2017); See et al. (2017), whereas others use only Daily Mail articles Nallapati et al. (2016b); Cheng and Lapata (2016). Our analysis shows that the CNN / Daily Mail summaries have a strong bias toward extraction (Section 4). Similar observations about the data were made by Chen et al. (2016) with respect to the question answering task.

3 Collecting Summaries

The NEWSROOM dataset was collected using social media and search engine metadata. To create the dataset, we performed a Web-scale crawl of over 100 million pages from a set of online publishers. We identify newswire articles and use the summaries provided in the HTML metadata. These summaries were created to be used in search engines and social media.

We collected HTML pages and metadata using the Internet Archive, accessing archived pages of a large number of popular news, sports, and entertainment sites. Using the Internet Archive provides two key benefits. First, the archive provides an API that allows for collection of data across time, not limited to recently available articles. Second, the archived URLs of the dataset articles are immutable, allowing distribution of this dataset using a thin, URL-only list.

The publisher sites we crawled were selected using a combination of top overall sites, as well as Alexa's top news sites (Alexa removed the extended public list in 2017). We supplemented the lists with older lists published by Google of the highest-traffic sites on the Web (Google removed this list in 2013). We excluded sites such as Reddit that primarily aggregate rather than produce content, as well as publisher sites that proved to have few or no articles with summary metadata available, or have articles primarily in languages other than English. This process resulted in a set of 38 publishers that were included in the dataset.

3.1 Content Scraping

We used two techniques to identify article pages from the selected publishers in the Internet Archive: the search API and an index-page crawl. The API allows queries using URL pattern matching, which focuses article crawling on high-precision subdomains or paths. We used the API to search for content from the publisher domains, using specific patterns or post-processing filtering to ensure article content. In addition, we used the Internet Archive to retrieve the historical versions of the home page for all publisher domains. The archive has content from 1998 to 2017 with varying degrees of time resolution. We obtained at least one snapshot of each page for every available day. For each snapshot, we retrieved all articles listed on the page.

For both search and crawled URLs, we performed article de-duplication using URLs to control for varying URL fragments, query parameters, protocols, and ports. When performing the merge, we retained only the earliest article version available to prevent the collection of stale summaries that are not updated when articles are changed.
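The de-duplication step can be sketched as follows. This is a minimal illustration, not the authors' code: the helper names `canonical_url` and `deduplicate` are hypothetical, the `example.com` URLs are placeholders, and dropping the whole query string is an assumption that would be too aggressive for sites that keep an article identifier in a query parameter.

```python
from urllib.parse import urlsplit


def canonical_url(url):
    """Reduce a URL to host + path, ignoring protocol, port, query, and fragment."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()  # hostname excludes the port
    path = parts.path.rstrip("/") or "/"   # treat /story and /story/ alike
    return host + path


def deduplicate(records):
    """Keep the earliest (url, timestamp) record for each canonical URL.

    `records` is an iterable of (url, timestamp) pairs with comparable timestamps.
    """
    earliest = {}
    for url, timestamp in records:
        key = canonical_url(url)
        if key not in earliest or timestamp < earliest[key][1]:
            earliest[key] = (url, timestamp)
    return earliest


records = [
    ("https://Example.com:443/news/story?id=1#top", 20170301),
    ("http://example.com/news/story/", 20150612),
]
unique = deduplicate(records)
```

Here both records collapse to the key `example.com/news/story`, and the 2015 snapshot is the one retained.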

3.2 Content Extraction

Following identification and de-duplication, we extracted the article texts and summaries and further cleaned and filtered the dataset.

Article Text

We used Readability to extract HTML body content. Readability uses HTML heuristics to extract the main content and title of a page, producing article text without extraneous HTML markup and images. Our preliminary testing, as well as comparison by Peters (2015), found Readability to be one of the highest accuracy content extraction algorithms available. To exclude inline advertising and image captions sometimes present in extractions, we applied additional filtering of paragraphs with fewer than five words. We excluded articles with no body text extracted.
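The paragraph filter described above might look like the sketch below. The function name `filter_paragraphs` and the whitespace-based word count are assumptions made for illustration; the text does not say how words were counted.

```python
def filter_paragraphs(paragraphs, min_words=5):
    """Drop paragraphs with fewer than `min_words` whitespace-separated words."""
    return [p for p in paragraphs if len(p.split()) >= min_words]


paragraphs = [
    "ADVERTISEMENT",
    "Photo: Associated Press",
    "The city council voted on Tuesday to approve the new budget.",
]
body = filter_paragraphs(paragraphs)
```

An article whose filtered paragraph list is empty would then be excluded, matching the rule that articles with no body text are dropped.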

Summary Metadata

We extracted the article summaries from the metadata available in the HTML pages of articles. These summaries are often written by newsroom editors and journalists to appear in social media distribution and search results. While there is no standard metadata format for summaries online, common fields are often present in the page’s HTML. Popular metadata field types include: og:description, twitter:description, and description. In cases where different metadata summaries were available, and were different, we used the first field available according to the order above. We excluded articles with no summary text of any type. We also removed article-summary pairs with a high amount of precisely-overlapping text to remove rule-based automatically-generated summaries fully copied from the article (e.g., the first paragraph).
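The field-priority rule can be illustrated with Python's standard `html.parser` module. This is a sketch under stated assumptions rather than the extraction code used for the dataset: the class `MetaDescriptionParser` and function `extract_summary` are hypothetical names, and the sketch assumes each field appears as a `<meta>` tag keyed by either a `property` or a `name` attribute.

```python
from html.parser import HTMLParser

# Priority order taken from the text: the first available field wins.
FIELD_ORDER = ["og:description", "twitter:description", "description"]


class MetaDescriptionParser(HTMLParser):
    """Collect the first non-empty value seen for each summary field."""

    def __init__(self):
        super().__init__()
        self.found = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        key = attributes.get("property") or attributes.get("name")
        content = attributes.get("content")
        if key in FIELD_ORDER and content and content.strip() and key not in self.found:
            self.found[key] = content.strip()


def extract_summary(html):
    """Return the highest-priority summary field, or None if none is present."""
    parser = MetaDescriptionParser()
    parser.feed(html)
    for field in FIELD_ORDER:
        if field in parser.found:
            return parser.found[field]
    return None
```

Pages for which `extract_summary` returns `None` correspond to the articles excluded for having no summary text of any type.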

3.3 Building the Dataset

Our scraping and extraction process resulted in a set of 1,321,995 article-summary pairs. Simple dataset statistics are shown in Table 1. The data are divided into training (76%), development (8%), test (8%), and unreleased test (8%) datasets using a hash function of the article URL. We use the articles' URLs for lightweight distribution of the data. The Internet Archive is an ideal platform for distributing the data, as it encourages its users to scrape its resources. We provide the extraction and analysis scripts used during data collection for reproducing the full dataset from the URL list.

| Statistic | Value |
|---|---|
| Dataset Size | 1,321,995 articles |
| Training Set Size | 995,041 articles |
| Mean Article Length | 658.6 words |
| Mean Summary Length | 26.7 words |
| Total Vocabulary Size | 6,925,712 words |
| Occurring 10+ Times | 784,884 words |

Table 1: Dataset Statistics
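A deterministic URL-hash split with the 76/8/8/8 proportions described above could be sketched as follows. The specific hash function is not stated in the text, so SHA-256 here is an assumption, as are the function name `assign_split` and the split labels.

```python
import hashlib

# Cumulative upper bounds (in percent) for each split: 76 / 8 / 8 / 8.
SPLITS = [("train", 76), ("dev", 84), ("test", 92), ("unreleased_test", 100)]


def assign_split(url):
    """Map a URL to a split using a stable hash, so the assignment is reproducible."""
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) % 100  # integer in [0, 100)
    for name, upper in SPLITS:
        if bucket < upper:
            return name
```

Because the assignment depends only on the URL, anyone rebuilding the dataset from the URL list obtains the same partition without needing a separate split file.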

4 Data Analysis

NEWSROOM contains summaries from different topic domains, written by many authors, over the span of nearly two decades. This diversity is an important aspect of the dataset. We analyze the data to quantify the differences in summarization styles and techniques between the different publications to show the importance of reflecting this diversity. In Sections 6 and 7, we examine the effect of the dataset diversity on the performance of a variety of summarization systems.

4.1 Characterizing Summarization Strategies

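function F(A, S)
     F ← ∅, (i, j) ← (1, 1)
     while i ≤ |S| do
          f ← ()
          while j ≤ |A| do
               if s_i = a_j then
                    (i′, j′) ← (i, j)
                    while i′ ≤ |S| and j′ ≤ |A| and s_i′ = a_j′ do
                         (i′, j′) ← (i′ + 1, j′ + 1)
                    if |f| < (i′ − i) then
                         f ← (s_i, …, s_i′−1)
                    j ← j′
               else
                    j ← j + 1
          (i, j) ← (i + max{|f|, 1}, 1)
          F ← F ∪ {f}
     return F
Figure 3: Procedure to compute the set of extractive phrases F(A, S) in summary S extracted from article A. For each sequential token of the summary, s_i, the procedure iterates through tokens of the text, a_j. If tokens s_i and a_j match, the longest shared token sequence after s_i and a_j is marked as the extraction starting at s_i.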

We examine summarization strategies using three measures that capture the degree of text overlap between the summary and article, and the rate of compression of the information conveyed.

Given an article text $A = \langle a_1, a_2, \ldots, a_n \rangle$ consisting of a sequence of tokens $a_i$ and the corresponding article summary $S = \langle s_1, s_2, \ldots, s_m \rangle$ consisting of tokens $s_i$, the set of extractive fragments $\mathcal{F}(A, S)$ is the set of shared sequences of tokens in $A$ and $S$. We identify these extractive fragments of an article-summary pair using a greedy process. We process the tokens in the summary in order. At each position, if there is a sequence of tokens in the source text that is a prefix of the remainder of the summary, we mark this prefix as extractive and continue. We prefer to mark the longest prefix possible at each step. Otherwise, we mark the current summary token as abstractive. The set $\mathcal{F}(A, S)$ includes all the token sequences identified as extractive. Figure 3 formally describes this procedure. Underlined phrases of Figures 1 and 2 are examples of fragments identified as extractive. Using $\mathcal{F}(A, S)$, we compute two measures: extractive fragment coverage and extractive fragment density.
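The greedy process can be sketched in Python as below. This is an illustrative reading of the description, not the released analysis script: the function name `extractive_fragments` is hypothetical, inputs are assumed to be pre-tokenized lists, and the sketch scans every article position (advancing by one token) so that the longest match is always found, which is a simplification chosen for clarity over speed.

```python
def extractive_fragments(article, summary):
    """Return the extractive fragments of `summary` found in `article`.

    Both arguments are lists of tokens. At each summary position the longest
    token sequence shared with the article is taken as a fragment; a token
    with no match is treated as abstractive and skipped.
    """
    fragments = []
    i = 0
    while i < len(summary):
        best = 0  # length of the longest match starting at summary[i]
        for j in range(len(article)):
            if summary[i] != article[j]:
                continue
            k = 0
            while (i + k < len(summary) and j + k < len(article)
                   and summary[i + k] == article[j + k]):
                k += 1
            best = max(best, k)
        if best > 0:
            fragments.append(summary[i:i + best])
        i += max(best, 1)  # jump past the fragment, or past one abstractive token
    return fragments


article = "the quick brown fox jumps over the lazy dog".split()
summary = "a quick brown fox saw the lazy dog".split()
fragments = extractive_fragments(article, summary)
```

In this toy pair, "a" and "saw" are abstractive tokens, while "quick brown fox" and "the lazy dog" are the two extractive fragments.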

Extractive Fragment Coverage

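The coverage measure quantifies the extent to which a summary is derivative of a text. $\mathrm{COVERAGE}(A, S)$ measures the percentage of words in the summary that are part of an extractive fragment with the article:

$$\mathrm{COVERAGE}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|$$

where $|f|$ is the number of tokens in fragment $f$ and $|S|$ is the number of tokens in the summary. For example, a summary with 10 words that borrows 7 words from its article text and includes 3 new words will have $\mathrm{COVERAGE}(A, S) = 0.7$.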

Extractive Fragment Density

The density measure quantifies how well the word sequence of a summary can be described as a series of extractions. For instance, a summary might contain many individual words from the article and therefore have a high coverage. However, if arranged in a new order, the words of the summary could still be used to convey ideas not present in the article. We define $\mathrm{DENSITY}(A, S)$ as the average length of the extractive fragment to which each word in the summary belongs. The density formulation is similar to the coverage definition but uses a square of the fragment length:

$$\mathrm{DENSITY}(A, S) = \frac{1}{|S|} \sum_{f \in \mathcal{F}(A, S)} |f|^2$$

For example, an article with a 10-word summary made of two extractive fragments of lengths 3 and 4 would have $\mathrm{COVERAGE}(A, S) = 0.7$ and $\mathrm{DENSITY}(A, S) = (3^2 + 4^2)/10 = 2.5$.

Compression Ratio

We use a simple dimension of summarization, compression ratio, to further characterize summarization strategies. We define $\mathrm{COMPRESSION}(A, S)$ as the word ratio between the article and summary:

$$\mathrm{COMPRESSION}(A, S) = \frac{|A|}{|S|}$$

Summarizing with higher compression is challenging as it requires capturing more precisely the critical aspects of the article text.
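The three measures reduce to a few lines once the fragment lengths are known. The sketch below assumes the fragment lengths have already been computed by the greedy procedure; the function names are hypothetical and the article length of 200 words is an arbitrary illustrative value.

```python
def coverage(fragment_lengths, summary_length):
    """Fraction of summary words that belong to some extractive fragment."""
    return sum(fragment_lengths) / summary_length


def density(fragment_lengths, summary_length):
    """Average length of the fragment each summary word belongs to."""
    return sum(n * n for n in fragment_lengths) / summary_length


def compression(article_length, summary_length):
    """Word ratio between the article and its summary."""
    return article_length / summary_length


# Worked example from the text: a 10-word summary with fragments of lengths 3 and 4.
example_coverage = coverage([3, 4], 10)
example_density = density([3, 4], 10)
example_compression = compression(200, 10)
```

For the worked example in the text this gives a coverage of 0.7 and a density of 2.5. Squaring the lengths is what separates a summary built from a few long copied spans from one that reuses the same number of words as scattered single tokens.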

4.2 Analysis of Dataset Diversity

Figure 4: Density and coverage distributions across the different domains and existing datasets. NEWSROOM contains diverse summaries that exhibit a variety of summarization strategies. Each box is a normalized bivariate density plot of extractive fragment coverage (x-axis) and density (y-axis), the two measures of extraction described in Section 4.1. The top left corner of each plot shows the number of training set articles $n$ and the median compression ratio of the articles. For DUC and New York Times, which have no standard data splits, $n$ is the total number of articles. Above, top left to bottom right: Plots for each publication in the NEWSROOM dataset. We omit TMZ, Economist, and ABC for presentation. Below, left to right: Plots for each summarization dataset showing increasing diversity of summaries along both dimensions of extraction in NEWSROOM.

We use density, coverage, and compression to understand the distribution of human summarization techniques across different sources. Figure 4 shows the distributions of summaries for different domains in the dataset, along with three major existing summarization datasets: DUC 2003-2004 (combined), CNN / Daily Mail, and the New York Times Corpus.

Publication Diversity

Each publication shows a unique distribution of summaries mixing extractive and abstractive strategies in varying amounts. For example, the third entry on the top row shows the summarization strategy used by BuzzFeed. The density (y-axis) is relatively low, meaning BuzzFeed summaries are unlikely to include long extractive fragments. While the coverage (x-axis) is more varied, BuzzFeed’s coverage tends to be lower, indicating that it frequently uses novel words in summaries. The publication plots in the figure are sorted by median compression ratio. We observe that publications with lower compression ratio (top-left of the figure) exhibit higher diversity along both dimensions of extractiveness. However, as the median compression ratio increases, the distributions become more concentrated, indicating that summarization strategies become more rigid.

Dataset Diversity

Figure 4 demonstrates how DUC, CNN / Daily Mail, and the New York Times exhibit different human summarization strategies. DUC summarization is fairly similar to the high-compression newsrooms shown in the lower publication plots in Figure 4. However, DUC's median compression ratio is much higher than that of all other datasets and publications. The figure shows that CNN / Daily Mail and New York Times are skewed toward extractive summaries with lower compression ratios. CNN / Daily Mail shows higher coverage and density than all other datasets and publishers in our data. Compared to existing datasets, NEWSROOM covers a much larger range of summarization styles, ranging from highly extractive to highly abstractive.

5 Performance of Existing Systems

We train and evaluate several summarization systems to understand the challenges of NEWSROOM and its usefulness for training systems. We evaluate three systems, each using a different summarization strategy with respect to extractiveness: fully extractive (TextRank), fully abstractive (Seq2Seq), and mixed (pointer-generator). We further study the performance of the pointer-generator model on NEWSROOM by training three systems using different dataset configurations. We compare these systems to two rule-based systems that provide a baseline (Lede-3) and an extractive oracle (Fragments).

Extractive: TextRank

TextRank is a sentence-level extractive summarization system. The system was originally developed by Mihalcea and Tarau (2004) and was later further developed and improved by Barrios et al. (2016). TextRank uses an unsupervised sentence-ranking approach similar to Google PageRank Page et al. (1999). TextRank picks a sequence of sentences from a text for the summary up to a maximum allowable length. While this maximum length is typically preset by the user, we tune this parameter to optimize the ROUGE-1 $F_1$-score on the training data. We experimented with values between 1 and 200 words, and found the optimal value to be 50 words. We use the tuned TextRank in Tables 2 and 3 and in the supplementary material.

Abstractive: Seq2Seq / Attention

Sequence-to-sequence models with attention Cho et al. (2014); Sutskever et al. (2014); Bahdanau et al. (2014) have been applied to various language tasks, including summarization Chopra et al. (2016); Nallapati et al. (2016a). The process by which the model produces tokens is abstractive, as there is no explicit mechanism to copy tokens from the input text. We train a TensorFlow implementation of the Rush et al. (2015) model using NEWSROOM; we refer to this model as Abs-N.

Mixed: Pointer-Generator

The pointer-generator model See et al. (2017) uses abstractive token generation and extractive token copying using a pointer mechanism Vinyals et al. (2015); Gülçehre et al. (2016), keeping track of extractions using coverage Tu et al. (2016). We evaluate three instances of this model by varying the training data: (1) Pointer-C: trained on the CNN / Daily Mail dataset; (2) Pointer-N: trained on the NEWSROOM dataset; and (3) Pointer-S: trained on a random subset of NEWSROOM training data the same size as the CNN / Daily Mail training set. The last instance aims to understand the effects of dataset size and summary diversity.

Lower Bound: Lede-3

A common automatic summarization strategy of online publications is to copy the first sentence, first paragraph, or first $n$ words of the text and treat this as the summary. Following prior work See et al. (2017); Nallapati et al. (2017), we use the Lede-3 baseline, in which the first three sentences of the text are returned as the summary. Though simple, this baseline is competitive with state-of-the-art systems.
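A Lede-3 style baseline can be sketched in a few lines. The sentence splitter below is a naive regular expression chosen only to keep the example self-contained; it is an assumption, will mis-split abbreviations such as "U.S.", and is not the sentence segmentation used in the experiments.

```python
import re


def lede(text, num_sentences=3):
    """Return the first `num_sentences` sentences of `text` as the summary.

    Sentences are split naively after '.', '!' or '?' followed by whitespace.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return " ".join(sentences[:num_sentences])


article = "Floods hit the north. Aid continued in the south! Was it enough? Officials were unsure."
summary = lede(article)
```

With the default of three sentences, the fourth sentence of the toy article is dropped.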

Extractive Oracle: Fragments

This system has access to the reference summary. Given an article $A$ and its summary $S$, the system computes $\mathcal{F}(A, S)$ (Section 4). Fragments concatenates the fragments in $\mathcal{F}(A, S)$ in the order they appear in the summary, representing the best possible performance of an ideal extractive system. Only systems that are capable of abstractive reasoning can outperform the ROUGE scores of Fragments.

6 Automatic Evaluation

| System | DUC R-1 | DUC R-2 | DUC R-L | CNN/DM R-1 | CNN/DM R-2 | CNN/DM R-L | Newsroom-T R-1 | Newsroom-T R-2 | Newsroom-T R-L | Newsroom-U R-1 | Newsroom-U R-2 | Newsroom-U R-L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lede-3 | 12.99 | 3.89 | 11.44 | 38.64 | 17.12 | 35.13 | 30.49 | 21.27 | 28.42 | 30.63 | 21.41 | 28.57 |
| Fragments | 87.04 | 68.45 | 87.04 | 93.36 | 83.19 | 93.36 | 88.46 | 76.03 | 88.46 | 88.48 | 76.06 | 88.48 |
| TextRank | 15.75 | 4.06 | 13.02 | 29.06 | 11.14 | 24.57 | 22.77 | 9.79 | 18.98 | 22.76 | 9.80 | 18.97 |
| Abs-N | 2.44 | 0.04 | 2.37 | 5.07 | 0.16 | 4.80 | 5.88 | 0.39 | 5.32 | 5.90 | 0.43 | 5.36 |
| Pointer-C | 12.40 | 2.88 | 10.74 | 32.51 | 11.90 | 28.95 | 20.25 | 7.32 | 17.30 | 20.29 | 7.33 | 17.31 |
| Pointer-S | 15.10 | 4.55 | 12.42 | 34.33 | 13.79 | 28.42 | 24.50 | 12.60 | 20.33 | 24.48 | 12.52 | 20.30 |
| Pointer-N | 17.29 | 5.01 | 14.53 | 31.61 | 11.70 | 27.23 | 26.02 | 13.25 | 22.43 | 26.04 | 13.24 | 22.45 |

Table 2: ROUGE-1, ROUGE-2, and ROUGE-L scores for baselines and systems on two common existing datasets, the combined DUC 2003 & 2004 datasets (DUC) and the CNN / Daily Mail dataset (CNN/DM), and the released (T) and unreleased (U) test sets of NEWSROOM. The best results for non-baseline systems in the lower parts of the table are in bold.

| System | Extractive R-1 | Extractive R-2 | Extractive R-L | Mixed R-1 | Mixed R-2 | Mixed R-L | Abstractive R-1 | Abstractive R-2 | Abstractive R-L | Newsroom-D R-1 | Newsroom-D R-2 | Newsroom-D R-L |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lede-3 | 53.05 | 49.01 | 52.37 | 25.15 | 12.88 | 22.08 | 13.69 | 2.42 | 11.24 | 30.72 | 21.53 | 28.65 |
| Fragments | 98.95 | 97.89 | 98.95 | 92.68 | 82.09 | 92.68 | 73.43 | 47.66 | 73.43 | 88.46 | 76.07 | 88.46 |
| TextRank | 32.43 | 19.68 | 28.68 | 22.30 | 7.87 | 17.75 | 13.54 | 1.88 | 10.46 | 22.82 | 9.85 | 19.02 |
| Abs-N | 6.08 | 0.21 | 5.42 | 5.67 | 0.15 | 5.08 | 6.21 | 1.07 | 5.68 | 5.98 | 0.48 | 5.39 |
| Pointer-C | 28.34 | 14.65 | 25.21 | 20.22 | 6.51 | 16.88 | 13.11 | 1.62 | 10.72 | 20.47 | 7.50 | 17.51 |
| Pointer-S | 37.29 | 26.56 | 33.34 | 23.71 | 10.59 | 18.79 | 13.89 | 2.22 | 10.34 | 24.83 | 12.94 | 20.66 |
| Pointer-N | 39.11 | 27.95 | 36.17 | 25.48 | 11.04 | 21.06 | 14.66 | 2.26 | 11.44 | 26.27 | 13.55 | 22.72 |

Table 3: Performance of the baselines and systems on the three extractiveness subsets of the development set, and the overall scores of systems on the full development set (D). The best results for non-baseline systems in the lower parts of the table are in bold.

We study model performance on NEWSROOM, CNN / Daily Mail, and the combined DUC 2003 and 2004 datasets. We use the five systems described in Section 5, including the extractive oracle. We also evaluate the systems using subsets of NEWSROOM to characterize the sensitivity of systems to different levels of extractiveness in reference summaries. We use the $F_1$-score variants of ROUGE-1, ROUGE-2, and ROUGE-L to account for different summary lengths. ROUGE scores are computed with the default configuration of the Lin (2004b) ROUGE v1.5.5 reference implementation. Input article text and reference summaries for all systems are tokenized using the Stanford CoreNLP tokenizer Manning et al. (2014).
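To make the $F_1$ variant concrete, the sketch below computes a simplified unigram-overlap score in the spirit of ROUGE-1. It is not the ROUGE v1.5.5 reference implementation used for the reported numbers: it assumes pre-tokenized input, a single reference, and no stemming or stopword handling, and the function name `rouge_1_f1` is hypothetical.

```python
from collections import Counter


def rouge_1_f1(candidate_tokens, reference_tokens):
    """Simplified ROUGE-1 F1: harmonic mean of unigram precision and recall.

    Overlap uses clipped counts, so a repeated candidate word is credited
    at most as many times as it occurs in the reference.
    """
    if not candidate_tokens or not reference_tokens:
        return 0.0
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


reference = "the cat sat on the mat".split()
candidate = "the cat on the mat".split()
score = rouge_1_f1(candidate, reference)
```

Using the $F_1$-score rather than recall alone penalizes overly long outputs, which is why it suits a comparison across systems that produce summaries of different lengths.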

Table 2 shows results for summarization systems on DUC, CNN / Daily Mail, and NEWSROOM. In nearly all cases, the fully extractive Lede-3 baseline produces the most successful summaries, with the exception of the relatively extractive DUC. Among models, NEWSROOM-trained Pointer-N performs best on all datasets other than CNN / Daily Mail, an out-of-domain dataset. Pointer-S, which has access to only a limited subset of NEWSROOM, performs worse than Pointer-N on average. However, despite not being trained on CNN / Daily Mail, Pointer-S outperforms Pointer-C on its own data under ROUGE-N and is competitive under ROUGE-L. Finally, both Pointer-N and Pointer-S outperform other systems and baselines on DUC, whereas Pointer-C does not outperform Lede-3.

Table 3 shows development results on the NEWSROOM data for different levels of extractiveness. Pointer-N outperforms the remaining models across all extractive subsets of NEWSROOM and, in the case of the abstractive subset, exceeds the performance of Lede-3. The success of Pointer-N and Pointer-S in generalizing and outperforming models on DUC and CNN / Daily Mail indicates the usefulness of NEWSROOM in generalizing to out-of-domain data. Similar subset analyses for our other two measures, coverage and compression, are included in the supplementary material.

7 Human Evaluation

ROUGE scores systems using frequencies of shared $n$-grams. Evaluating systems with ROUGE alone biases scoring against abstractive systems, which rely more on paraphrasing. To overcome this limitation, we provide human evaluation of the different systems on NEWSROOM. While human evaluation is still uncommon in summarization work, developing a benchmark dataset presents an opportunity for developing an accompanying protocol for human evaluation.

Our evaluation method is centered around three objectives: (1) distinguishing between syntactic and semantic summarization quality, (2) providing a reliable (consistent and replicable) measurement, and (3) allowing for portability such that the measure can be applied to other models or summarization datasets.

We select two semantic and two syntactic dimensions for evaluation based on experiments with evaluation tasks by Paulus et al. (2017) and Tan et al. (2017). The two semantic dimensions, summary informativeness (INF) and relevance (REL), measure whether the system-generated text is useful as a summary and appropriate for the source text, respectively. The two syntactic dimensions, fluency (FLU) and coherence (COH), measure whether individual sentences or phrases of the summary are well-written and whether the summary as a whole makes sense, respectively. Evaluation was performed on 60 summaries, 20 from each extractive subset. Each system-article pair was evaluated by three unique raters. Exact prompts given to raters for each dimension are shown in Table 4.

| Dimension | Prompt |
| --- | --- |
| Informativeness | How well does the summary capture the key points of the article? |
| Relevance | Are the details provided by the summary consistent with details in the article? |
| Fluency | Are the individual sentences of the summary well-written and grammatical? |
| Coherence | Do phrases and sentences of the summary fit together and make sense collectively? |

Table 4: The prompts given to Amazon Mechanical Turk crowdworkers for evaluating each summary.

| System | INF (semantic) | REL (semantic) | FLU (syntactic) | COH (syntactic) | Avg. |
| --- | --- | --- | --- | --- | --- |
| Lede-3 | 3.98 | 4.13 | 4.13 | 4.08 | 4.08 |
| Fragments | 2.91 | 3.26 | 3.09 | 3.06 | 3.08 |
| TextRank | 3.61 | 3.92 | 3.87 | 3.86 | 3.81 |
| Abs-N | 2.09 | 2.35 | 2.66 | 2.50 | 2.40 |
| Pointer-C | 3.55 | 3.78 | 3.22 | 3.30 | 3.46 |
| Pointer-S | 3.77 | 4.02 | 3.56 | 3.56 | 3.73 |
| Pointer-N | 3.36 | 3.82 | 3.43 | 3.39 | 3.50 |

Table 5: Average performance of systems as scored by human evaluators. Each summary was scored by three different evaluators. Dimensions, from left to right: informativeness, relevance, fluency, and coherence, and a mean of the four dimensions for each system.
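
As a rough illustration of how the rightmost column of Table 5 relates to the four dimensions, the sketch below recomputes the overall mean for each system and separates the semantic (INF, REL) and syntactic (FLU, COH) averages. The per-dimension values are copied from Table 5. The variable and function names are hypothetical, and the semantic and syntactic averages are an added convenience rather than figures reported in the paper.

```python
# Per-dimension mean ratings copied from Table 5, in the order (INF, REL, FLU, COH).
scores = {
    "Lede-3": (3.98, 4.13, 4.13, 4.08),
    "Fragments": (2.91, 3.26, 3.09, 3.06),
    "TextRank": (3.61, 3.92, 3.87, 3.86),
    "Abs-N": (2.09, 2.35, 2.66, 2.50),
    "Pointer-C": (3.55, 3.78, 3.22, 3.30),
    "Pointer-S": (3.77, 4.02, 3.56, 3.56),
    "Pointer-N": (3.36, 3.82, 3.43, 3.39),
}


def mean(values):
    """Arithmetic mean of a non-empty sequence of numbers."""
    return sum(values) / len(values)


# Overall score: mean of the four dimensions (the rightmost column of Table 5).
overall = {system: mean(dims) for system, dims in scores.items()}
# Illustrative groupings: semantic = (INF, REL), syntactic = (FLU, COH).
semantic = {system: mean(dims[:2]) for system, dims in scores.items()}
syntactic = {system: mean(dims[2:]) for system, dims in scores.items()}

# Systems ordered from highest to lowest overall score.
ranking = sorted(overall, key=overall.get, reverse=True)

for system in ranking:
    print(f"{system:10s} overall={overall[system]:.2f} "
          f"semantic={semantic[system]:.2f} syntactic={syntactic[system]:.2f}")
```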

Table 5 shows the mean score given to each system under each of the four dimensions, as well as the mean overall score (rightmost column). No summarization system exceeded the scores given to the Lede-3 baseline. However, the extractive oracle (Fragments), designed to maximize n-gram-based evaluation, performed worse than the majority of systems under human evaluation. While the fully abstractive Abs-N model performed very poorly under automatic evaluation, it fared slightly better when scored by humans. Apart from the Lede-3 baseline, TextRank received the highest overall score. TextRank generates full sentences extracted from the article, and raters preferred TextRank primarily for its fluency and coherence. The pointer-generator models do not have this advantage, and raters did not find the pointer-generator models to be as syntactically sound as TextRank. However, raters preferred the informativeness and relevance of the Pointer-S and Pointer-N models, though not the Pointer-C model, over TextRank.

8 Conclusion

We present NEWSROOM, a dataset of articles and their summaries written in the newsrooms of online publications. NEWSROOM is the largest summarization dataset available to date, and exhibits a wide variety of human summarization strategies. Our proposed measures and the analysis of strategies used by different publications and articles suggest new directions for evaluating the difficulty of summarization tasks and for developing future summarization models. We show that the dataset’s diversity of summaries presents a new challenge to summarization systems. Finally, we find that using NEWSROOM to train an existing state-of-the-art mixed-strategy summarization model results in performance improvements on out-of-domain data. The dataset is available online at


This work is funded by Oath as part of the Connected Experiences Laboratory and by a Google Research Award. We thank the anonymous reviewers for their feedback.


Additional Evaluation

In Section 4, we discuss three measures of summarization diversity: coverage, density, and compression. In addition to quantifying diversity of summarization strategies, these measures are helpful for system error analysis. We use the density measurement to understand how system performance varies when compared against references using different extractive strategies, by subdividing NEWSROOM into three subsets by extractiveness and evaluating using ROUGE on each. We show here a similar analysis using the remaining two measures, coverage and compression. Results for subsets based on coverage and compression are shown in Tables 6 and 7.
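
The subset analysis can be sketched as follows for the compression measure. This is a minimal sketch under stated assumptions, not the procedure used to produce the tables: compression is taken to be the word-level ratio of article length to summary length, the three subsets are formed by sorting and cutting into near-equal thirds, and the function names and toy data are hypothetical.

```python
def compression(article, summary):
    """Assumed definition: ratio of article word count to summary word count."""
    return len(article.split()) / max(len(summary.split()), 1)


def split_into_thirds(pairs, measure):
    """Sort (article, summary) pairs by a measure and cut into three subsets.

    Assumption: near-equal-size low / medium / high bins.
    """
    ranked = sorted(pairs, key=lambda pair: measure(*pair))
    first_cut, second_cut = len(ranked) // 3, 2 * len(ranked) // 3
    return {
        "low": ranked[:first_cut],
        "medium": ranked[first_cut:second_cut],
        "high": ranked[second_cut:],
    }


# Toy data: 400-word articles paired with summaries of decreasing length.
summary_lengths = [200, 80, 40, 20, 10, 5]
pairs = [(" ".join(["word"] * 400), " ".join(["sum"] * k)) for k in summary_lengths]

subsets = split_into_thirds(pairs, compression)
for name, subset in subsets.items():
    ratios = [compression(article, summary) for article, summary in subset]
    print(name, ratios)
```

Each subset could then be scored separately with ROUGE, as in the tables that follow. The same split applies to coverage or density by passing a different measure function.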

| System | Low R-1 | Low R-2 | Low R-L | Medium R-1 | Medium R-2 | Medium R-L | High R-1 | High R-2 | High R-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lede-3 | 15.07 | 4.02 | 12.66 | 29.66 | 18.69 | 26.98 | 46.89 | 41.25 | 45.77 |
| Fragments | 72.45 | 46.16 | 72.45 | 93.41 | 83.08 | 93.41 | 99.13 | 98.16 | 99.13 |
| TextRank | 14.43 | 2.80 | 11.36 | 23.62 | 9.48 | 19.27 | 30.15 | 17.04 | 26.18 |
| Abs-N | 6.25 | 1.09 | 5.72 | 5.61 | 0.15 | 5.05 | 6.10 | 0.19 | 5.40 |
| Pointer-C | 13.99 | 2.46 | 11.57 | 21.70 | 8.06 | 18.47 | 25.80 | 12.06 | 22.57 |
| Pointer-S | 15.16 | 3.63 | 11.61 | 26.95 | 14.51 | 22.30 | 32.42 | 20.77 | 28.15 |
| Pointer-N | 16.07 | 3.78 | 12.85 | 28.79 | 15.31 | 24.79 | 34.03 | 21.67 | 30.62 |

Table 6: Performance of the baselines and systems on the three coverage subsets (low, medium, and high coverage) of the development set. Article-summary pairs with low coverage have reference summaries that borrow words less frequently from their texts and contain more novel words and phrases. Article-summary pairs with high coverage borrow more words from their text and include fewer novel words and phrases.

| System | Low R-1 | Low R-2 | Low R-L | Medium R-1 | Medium R-2 | Medium R-L | High R-1 | High R-2 | High R-L |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Lede-3 | 42.89 | 34.91 | 41.06 | 30.62 | 20.77 | 28.30 | 18.57 | 8.83 | 16.53 |
| Fragments | 87.78 | 77.20 | 87.78 | 89.73 | 77.66 | 89.73 | 87.88 | 73.34 | 87.88 |
| TextRank | 30.35 | 17.51 | 26.67 | 22.98 | 8.69 | 18.56 | 15.07 | 3.31 | 11.78 |
| Abs-N | 6.27 | 0.75 | 5.65 | 6.22 | 0.52 | 5.60 | 5.48 | 0.18 | 4.93 |
| Pointer-C | 27.47 | 13.49 | 24.18 | 20.05 | 6.25 | 16.76 | 14.07 | 2.89 | 11.76 |
| Pointer-S | 35.42 | 23.43 | 30.89 | 24.11 | 11.28 | 19.45 | 15.31 | 4.46 | 11.98 |
| Pointer-N | 36.96 | 24.52 | 33.43 | 25.56 | 11.68 | 21.47 | 16.57 | 4.72 | 13.52 |

Table 7: Performance of the baselines and systems on the three compression subsets (low, medium, and high compression) of the development set. Article-summary pairs with low compression have longer reference summaries with respect to their texts. Article-summary pairs with high compression have shorter reference summaries with respect to their texts.