SumeCzech: Large Czech News-Based Summarization Dataset

by Vojtěch Hudeček, et al.

Document summarization is a well-studied NLP task. With the emergence of artificial neural network models, summarization performance is improving, and so are the demands on training data. However, only a few datasets are available for Czech, none of them particularly large. Additionally, summarization has been evaluated predominantly on English, and the commonly used ROUGE metric is English-specific. In this paper, we try to address both issues. We present SumeCzech, a Czech news-based summarization dataset. It contains more than a million documents, each consisting of a headline, an abstract several sentences long, and a full text. The dataset can be downloaded using the provided scripts. We evaluate several summarization baselines on the dataset, including a strong abstractive approach based on the Transformer neural network architecture. The evaluation is performed using a language-agnostic variant of ROUGE.
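
To illustrate what a language-agnostic ROUGE could look like, the following is a minimal sketch of ROUGE-N computed on raw tokens, with no English stemmer, stopword list, or synonym table. It is not the metric implementation used in the paper: the function names `ngrams` and `rouge_n`, the whitespace tokenization, and the lowercasing are all assumptions made for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    """Count all contiguous n-grams in a token list (hypothetical helper)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of n-gram overlap on raw tokens.

    Assumption: tokens are lowercased, whitespace-separated strings.
    No language-specific preprocessing is applied.
    """
    cand = ngrams(candidate.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    # Clipped overlap: each n-gram counts at most as often as it
    # appears in both the candidate and the reference.
    overlap = sum((cand & ref).values())
    precision = overlap / sum(cand.values()) if cand else 0.0
    recall = overlap / sum(ref.values()) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Made-up Czech sentences, used only to exercise the function.
    system = "vláda schválila nový rozpočet"
    gold = "vláda dnes schválila rozpočet"
    print(rouge_n(system, gold, n=1))
    print(rouge_n(system, gold, n=2))
```

Because the sketch relies only on token identity, it applies unchanged to Czech or any other language that can be split into tokens. Morphologically rich languages will score lower than they would with stemming, which is the trade-off of dropping language-specific preprocessing.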



Dataset for Automatic Summarization of Russian News

Automatic text summarization has been studied in a variety of domains an...

GOAL: Towards Benchmarking Few-Shot Sports Game Summarization

Sports game summarization aims to generate sports news based on real-tim...

HunSum-1: an Abstractive Summarization Dataset for Hungarian

We introduce HunSum-1: a dataset for Hungarian abstractive summarization...

Abstractive and mixed summarization for long-single documents

The lack of diversity in the datasets available for automatic summarizat...

Klexikon: A German Dataset for Joint Summarization and Simplification

Traditionally, Text Simplification is treated as a monolingual translati...

A Fine-Grained Approach for Automated Conversion of JUnit Assertions to English

Converting source or unit test code to English has been shown to improve...

BookSum: A Collection of Datasets for Long-form Narrative Summarization

The majority of available text summarization datasets include short-form...