XL-Sum: Large-Scale Multilingual Abstractive Summarization for 44 Languages

06/25/2021
by   Tahmid Hasan, et al.

Contemporary work on abstractive text summarization has focused primarily on high-resource languages like English, mostly due to the limited availability of datasets for low- and mid-resource ones. In this work, we present XL-Sum, a comprehensive and diverse dataset comprising 1 million professionally annotated article-summary pairs from BBC, extracted using a set of carefully designed heuristics. The dataset covers 44 languages ranging from low- to high-resource, for many of which no public dataset is currently available. XL-Sum is highly abstractive, concise, and of high quality, as indicated by human and intrinsic evaluation. We fine-tune mT5, a state-of-the-art pretrained multilingual model, on XL-Sum and experiment on multilingual and low-resource summarization tasks. XL-Sum yields results competitive with those obtained using similar monolingual datasets: with multilingual training, we obtain ROUGE-2 scores above 11 on the 10 languages we benchmark, with some exceeding 15. Additionally, training on low-resource languages individually also provides competitive performance. To the best of our knowledge, XL-Sum is the largest abstractive summarization dataset in terms of the number of samples collected from a single source and the number of languages covered. We are releasing our dataset and models to encourage future research on multilingual abstractive summarization. The resources can be found at <https://github.com/csebuetnlp/xl-sum>.
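
To make the ROUGE-2 figures above easier to interpret, here is a minimal sketch of the metric as bigram overlap between a reference summary and a generated one. It is an illustration only, not the evaluation code used by the authors: the whitespace tokenizer, the lowercasing, and the function names `bigrams` and `rouge2` are assumptions made for this example, and multilingual evaluation would normally need language-specific tokenization. It also assumes the common convention of reporting ROUGE scaled by 100, so a reported score of 11 would correspond to 0.11 here.

```python
from collections import Counter


def bigrams(text):
    """Count bigrams over lowercased, whitespace-split tokens (a simplifying assumption)."""
    tokens = text.lower().split()
    return Counter(zip(tokens, tokens[1:]))


def rouge2(reference, candidate):
    """Return ROUGE-2 precision, recall, and F1 on a 0-1 scale."""
    ref, cand = bigrams(reference), bigrams(candidate)
    # Clipped overlap: each bigram counts at most as often as it occurs in both texts.
    overlap = sum((ref & cand).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    scores = rouge2("the cat sat on the mat", "the cat lay on the mat")
    print(scores)
```

In this toy pair, three of the five bigrams in each sentence match, so precision, recall, and F1 all come out at about 0.6.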


