CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs

12/16/2021
by Tahmid Hasan, et al.

We present CrossSum, a large-scale dataset comprising 1.65 million cross-lingual article-summary samples in 1500+ language pairs spanning 45 languages. We build on the multilingual XL-Sum dataset and align identical articles written in different languages via cross-lingual retrieval using a language-agnostic representation model. We propose a multi-stage data sampling algorithm, fine-tune mT5, a multilingual pretrained model, on CrossSum with explicit cross-lingual supervision, and introduce a new metric for evaluating cross-lingual summarization. Results on established metrics and on our proposed metric indicate that models fine-tuned on CrossSum outperform summarization+translation baselines, even when the source and target languages are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that does not rely on English as the pivot language. We are releasing the dataset, the alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at <https://github.com/csebuetnlp/CrossSum>.
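To make the alignment step more concrete, the sketch below shows one common way to pair articles across two languages once each article has an embedding. It is an illustration under stated assumptions, not the released alignment script: the function name `align_by_similarity`, the mutual-nearest-neighbour rule, the 0.7 threshold, and the toy vectors (stand-ins for the output of a language-agnostic encoder) are all hypothetical.

```python
import numpy as np


def align_by_similarity(src_vecs, tgt_vecs, threshold=0.7):
    """Pair source and target articles by cosine similarity.

    Assumption: a pair is kept only if each article is the other's
    nearest neighbour and the similarity reaches `threshold`.
    """
    # L2-normalise rows so the dot product equals cosine similarity.
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T  # shape: (num_src, num_tgt)

    best_tgt = sim.argmax(axis=1)  # nearest target for each source
    best_src = sim.argmax(axis=0)  # nearest source for each target

    pairs = []
    for i, j in enumerate(best_tgt):
        if best_src[j] == i and sim[i, j] >= threshold:
            pairs.append((i, int(j), float(sim[i, j])))
    return pairs


# Toy embeddings standing in for encoder outputs (hypothetical values).
src_vecs = np.array([[1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])
tgt_vecs = np.array([[0.1, 0.9, 0.0],
                     [0.9, 0.1, 0.0],
                     [0.0, 0.0, 1.0]])

pairs = align_by_similarity(src_vecs, tgt_vecs)
for i, j, score in pairs:
    print(f"source {i} <-> target {j} (cosine similarity {score:.3f})")
```

In this toy setup the third target article has no counterpart and stays unaligned, which is the behaviour a similarity threshold is meant to provide when an article was not covered in the other language.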

