CLTS+: A New Chinese Long Text Summarization Dataset with Abstractive Summaries

06/09/2022
by   Xiaojun Liu, et al.
0

The abstractive methods lack of creative ability is particularly a problem in automatic text summarization. The summaries generated by models are mostly extracted from the source articles. One of the main causes for this problem is the lack of dataset with abstractiveness, especially for Chinese. In order to solve this problem, we paraphrase the reference summaries in CLTS, the Chinese Long Text Summarization dataset, correct errors of factual inconsistencies, and propose the first Chinese Long Text Summarization dataset with a high level of abstractiveness, CLTS+, which contains more than 180K article-summary pairs and is available online. Additionally, we introduce an intrinsic metric based on co-occurrence words to evaluate the dataset we constructed. We analyze the extraction strategies used in CLTS+ summaries against other datasets to quantify the abstractiveness and difficulty of our new data and train several baselines on CLTS+ to verify the utility of it for improving the creative ability of models.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/30/2018

Newsroom: A Dataset of 1.3 Million Summaries with Diverse Extractive Strategies

We present NEWSROOM, a summarization dataset of 1.3 million articles and...
research
06/19/2015

LCSTS: A Large Scale Chinese Short Text Summarization Dataset

Automatic text summarization is widely regarded as the highly difficult ...
research
10/21/2021

CNewSum: A Large-scale Chinese News Summarization Dataset with Human-annotated Adequacy and Deducibility Level

Automatic text summarization aims to produce a brief but crucial summary...
research
07/25/2019

Summary Refinement through Denoising

We propose a simple method for post-processing the outputs of a text sum...
research
05/18/2023

Counterfactual Debiasing for Generating Factually Consistent Text Summaries

Despite substantial progress in abstractive text summarization to genera...
research
11/15/2020

Open4Business(O4B): An Open Access Dataset for Summarizing Business Documents

A major challenge in fine-tuning deep learning models for automatic summ...
research
04/06/2020

Learning to Summarize Passages: Mining Passage-Summary Pairs from Wikipedia Revision Histories

In this paper, we propose a method for automatically constructing a pass...

Please sign up or login with your details

Forgot password? Click here to reset