MLSUM: The Multilingual Summarization Corpus

04/30/2020
by Thomas Scialom, et al.

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five languages: French, German, Spanish, Russian, and Turkish. Together with the English newspaper articles from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset that can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These analyses highlight existing biases, which motivate the use of a multilingual dataset.

research
12/16/2021

CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs

We present CrossSum, a large-scale dataset comprising 1.65 million cross...
research
05/15/2023

PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

This paper introduces PMIndiaSum, a new multilingual and massively paral...
research
02/13/2023

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Significant developments in techniques such as encoder-decoder models ha...
research
11/28/2019

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

The lack of large-scale datasets has been a major hindrance to the devel...
research
10/07/2020

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

We introduce WikiLingua, a large-scale, multilingual dataset for the eva...
research
03/18/2020

X-Stance: A Multilingual Multi-Target Dataset for Stance Detection

We extract a large-scale stance detection dataset from comments written ...
research
03/31/2020

Multilingual Stance Detection: The Catalonia Independence Corpus

Stance detection aims to determine the attitude of a given text with res...
