BIGPATENT: A Large-Scale Dataset for Abstractive and Coherent Summarization

06/10/2019
by Eva Sharma, et al.

Most existing text summarization datasets are compiled from the news domain, where summaries have a flattened discourse structure. In such datasets, summary-worthy content often appears at the beginning of input articles. Moreover, large segments from input articles are present verbatim in their respective summaries. These issues impede the learning and evaluation of systems that can understand an article's global content structure as well as produce abstractive summaries with a high compression ratio. In this work, we present a novel dataset, BIGPATENT, consisting of 1.3 million records of U.S. patent documents along with human-written abstractive summaries. Compared to existing summarization datasets, BIGPATENT has the following properties: i) summaries contain a richer discourse structure with more recurring entities, ii) salient content is evenly distributed in the input, and iii) fewer and shorter extractive fragments are present in the summaries. Finally, we train and evaluate baselines and popular learning models on BIGPATENT to shed light on new challenges and motivate future directions for summarization research.
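
To make the notions of compression ratio and verbatim copying concrete, the following is a minimal sketch of two simple proxies: the ratio of document length to summary length, and the fraction of summary n-grams that never appear in the document. These are illustrative stand-ins and not necessarily the measures used in the paper; the whitespace tokenizer, the function names, and the example strings are all assumptions made for this sketch.

```python
from typing import List, Set, Tuple


def tokenize(text: str) -> List[str]:
    # Assumption: lowercased whitespace tokenization is enough for illustration.
    return text.lower().split()


def compression_ratio(document: str, summary: str) -> float:
    # Document length divided by summary length, both counted in tokens.
    summary_tokens = tokenize(summary)
    if not summary_tokens:
        raise ValueError("summary must contain at least one token")
    return len(tokenize(document)) / len(summary_tokens)


def ngrams(tokens: List[str], n: int) -> Set[Tuple[str, ...]]:
    # Set of distinct n-grams; empty if the sequence is shorter than n.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def novel_ngram_fraction(document: str, summary: str, n: int = 2) -> float:
    # Share of distinct summary n-grams that do not occur in the document.
    # Higher values suggest a more abstractive (less copied) summary.
    summary_ngrams = ngrams(tokenize(summary), n)
    if not summary_ngrams:
        return 0.0
    document_ngrams = ngrams(tokenize(document), n)
    return len(summary_ngrams - document_ngrams) / len(summary_ngrams)


if __name__ == "__main__":
    # Hypothetical toy example, not taken from the dataset.
    doc = ("the device includes a rotating shaft coupled to a housing "
           "and a sensor mounted on the shaft")
    summ = "a sensor monitors the rotating shaft"
    print(compression_ratio(doc, summ))
    print(novel_ngram_fraction(doc, summ, n=2))
```

Under these proxies, a dataset whose summaries copy long spans from the input would show a low novel n-gram fraction, while a higher fraction is consistent with the more abstractive summaries the abstract describes.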


