SumeCzech: Large Czech News-Based Summarization Dataset

02/12/2021
by Vojtěch Hudeček, et al.

Document summarization is a well-studied NLP task. With the emergence of artificial neural network models, summarization performance is improving, and so are the requirements on training data. However, only a few datasets are available for Czech, none of them particularly large. Additionally, summarization has been evaluated predominantly on English, and the commonly used ROUGE metric is English-specific. In this paper, we try to address both issues. We present SumeCzech, a Czech news-based summarization dataset. It contains more than a million documents, each consisting of a headline, an abstract of several sentences, and a full text. The dataset can be downloaded using the scripts provided at http://hdl.handle.net/11234/1-2615. We evaluate several summarization baselines on the dataset, including a strong abstractive approach based on the Transformer neural network architecture. The evaluation is performed using a language-agnostic variant of ROUGE.
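
To illustrate what a language-agnostic ROUGE-style score could look like, here is a minimal sketch. It is not the paper's evaluation script: it assumes that "language-agnostic" means dropping the English-specific preprocessing (stemming, stopword lists) and comparing raw lowercased tokens. The function names `tokenize`, `ngrams`, and `rouge_n` are hypothetical and chosen only for this example.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase and split on Unicode word characters only.
    # No stemmer and no stopword list, so nothing here is English-specific.
    return re.findall(r"\w+", text.lower())


def ngrams(tokens, n):
    # Multiset of n-grams, so repeated n-grams are counted with multiplicity.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(reference, candidate, n=1):
    # Clipped n-gram overlap between a reference and a candidate summary.
    ref = ngrams(tokenize(reference), n)
    cand = ngrams(tokenize(candidate), n)
    overlap = sum((ref & cand).values())
    ref_total = sum(ref.values())
    cand_total = sum(cand.values())
    precision = overlap / cand_total if cand_total else 0.0
    recall = overlap / ref_total if ref_total else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up Czech sentences, used only to exercise the function.
    print(rouge_n("Praha je hlavní město", "Praha je město", n=1))
```

Because the tokenizer relies only on Unicode word boundaries, the same function applies unchanged to Czech text with diacritics. Morphologically rich languages would still be penalized for inflectional variants, which is a known trade-off of skipping stemming.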

research
06/19/2020

Dataset for Automatic Summarization of Russian News

Automatic text summarization has been studied in a variety of domains an...
research
07/18/2022

GOAL: Towards Benchmarking Few-Shot Sports Game Summarization

Sports game summarization aims to generate sports news based on real-tim...
research
02/01/2023

HunSum-1: an Abstractive Summarization Dataset for Hungarian

We introduce HunSum-1: a dataset for Hungarian abstractive summarization...
research
07/03/2020

Abstractive and mixed summarization for long-single documents

The lack of diversity in the datasets available for automatic summarizat...
research
01/18/2022

Klexikon: A German Dataset for Joint Summarization and Simplification

Traditionally, Text Simplification is treated as a monolingual translati...
research
11/12/2018

A Fine-Grained Approach for Automated Conversion of JUnit Assertions to English

Converting source or unit test code to English has been shown to improve...
research
05/18/2021

BookSum: A Collection of Datasets for Long-form Narrative Summarization

The majority of available text summarization datasets include short-form...
