A Large-Scale Multi-Document Summarization Dataset from the Wikipedia Current Events Portal

Multi-document summarization (MDS) aims to compress the content in large document collections into short summaries and has important applications in story clustering for newsfeeds, presentation of search results, and timeline generation. However, there is a lack of datasets that realistically address such use cases at a scale large enough for training supervised models for this task. This work presents a new dataset for MDS that is large both in the total number of document clusters and in the size of individual clusters. We build this dataset by leveraging the Wikipedia Current Events Portal (WCEP), which provides concise and neutral human-written summaries of news events, with links to external source articles. We also automatically extend these source articles by looking for related articles in the Common Crawl archive. We provide a quantitative analysis of the dataset and empirical results for several state-of-the-art MDS techniques.

research
06/04/2019

Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

Automatic generation of summaries from multiple news articles is a valua...
research
04/24/2018

Towards a Neural Network Approach to Abstractive Multi-Document Summarization

Till now, neural abstractive summarization methods have achieved great s...
research
06/04/2021

AgreeSum: Agreement-Oriented Multi-Document Summarization

We aim to renew interest in a particular multi-document summarization (M...
research
10/07/2021

HowSumm: A Multi-Document Summarization Dataset Derived from WikiHow Articles

We present HowSumm, a novel large-scale dataset for the task of query-fo...
research
10/18/2018

A Temporally Sensitive Submodularity Framework for Timeline Summarization

Timeline summarization (TLS) creates an overview of long-running events ...
research
12/16/2021

A Proposition-Level Clustering Approach for Multi-Document Summarization

Text clustering methods were traditionally incorporated into multi-docum...
research
05/24/2010

Distantly Labeling Data for Large Scale Cross-Document Coreference

Cross-document coreference, the problem of resolving entity mentions acr...
