MLSUM: The Multilingual Summarization Corpus

04/30/2020
by Thomas Scialom, et al.
reciTAL
Laboratoire d'Informatique de Paris 6

We present MLSUM, the first large-scale MultiLingual SUMmarization dataset. Obtained from online newspapers, it contains 1.5M+ article/summary pairs in five different languages: French, German, Spanish, Russian, and Turkish. Together with English newspapers from the popular CNN/Daily Mail dataset, the collected data form a large-scale multilingual dataset which can enable new research directions for the text summarization community. We report cross-lingual comparative analyses based on state-of-the-art systems. These highlight existing biases which motivate the use of a multilingual dataset.

12/16/2021

CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs

We present CrossSum, a large-scale dataset comprising 1.65 million cross...
02/13/2023

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Significant developments in techniques such as encoder-decoder models ha...
11/28/2019

GitHub Typo Corpus: A Large-Scale Multilingual Dataset of Misspellings and Grammatical Errors

The lack of large-scale datasets has been a major hindrance to the devel...
10/07/2020

WikiLingua: A New Benchmark Dataset for Cross-Lingual Abstractive Summarization

We introduce WikiLingua, a large-scale, multilingual dataset for the eva...
12/19/2022

LR-Sum: Summarization for Less-Resourced Languages

This preprint describes work in progress on LR-Sum, a new permissively-l...
03/31/2020

Multilingual Stance Detection: The Catalonia Independence Corpus

Stance detection aims to determine the attitude of a given text with res...
05/30/2022

X-SCITLDR: Cross-Lingual Extreme Summarization of Scholarly Documents

The number of scientific publications nowadays is rapidly increasing, ca...