CrossSum: Beyond English-Centric Cross-Lingual Abstractive Text Summarization for 1500+ Language Pairs

by Tahmid Hasan, et al.

We present CrossSum, a large-scale dataset comprising 1.65 million cross-lingual article-summary samples in 1500+ language pairs spanning 45 languages. We build it from the multilingual XL-Sum dataset by aligning identical articles written in different languages via cross-lingual retrieval with a language-agnostic representation model. We propose a multi-stage data sampling algorithm, fine-tune mT5, a multilingual pretrained model, with explicit cross-lingual supervision on CrossSum, and introduce a new metric for evaluating cross-lingual summarization. Results on established metrics and on our proposed metric indicate that models fine-tuned on CrossSum outperform summarization+translation baselines, even when the source and target languages are linguistically distant. To the best of our knowledge, CrossSum is the largest cross-lingual summarization dataset and the first that does not rely on English as the pivot language. We are releasing the dataset, alignment and training scripts, and the models to spur future research on cross-lingual abstractive summarization. The resources can be found at <>.
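
The alignment step described in the abstract, pairing articles across languages by retrieval over language-agnostic embeddings, can be illustrated with a minimal sketch. This is not the authors' released alignment script. It assumes the article embeddings have already been computed by some language-agnostic encoder, and the function name `align_articles`, the mutual-nearest-neighbour rule, and the `threshold` value of 0.7 are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def align_articles(src_emb, tgt_emb, threshold=0.7):
    """Pair source and target articles by embedding similarity.

    Assumption (not from the paper): a pair is kept only if the two
    articles are each other's nearest neighbour and their cosine
    similarity is at least `threshold`.
    Returns a list of (src_index, tgt_index, score) tuples.
    """
    if not src_emb or not tgt_emb:
        return []
    # Full similarity matrix: rows are source articles, columns are targets.
    sim = [[cosine(s, t) for t in tgt_emb] for s in src_emb]
    pairs = []
    for i, row in enumerate(sim):
        # Best target for source article i.
        j = max(range(len(row)), key=row.__getitem__)
        # Best source for that target; keep the pair only if it is i again.
        best_i = max(range(len(sim)), key=lambda k: sim[k][j])
        if best_i == i and row[j] >= threshold:
            pairs.append((i, j, row[j]))
    return pairs


# Toy 3-dimensional "embeddings" standing in for real encoder output.
src = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
tgt = [[0.0, 0.9, 0.1], [0.9, 0.1, 0.0], [0.5, 0.5, 0.6]]
for i, j, score in align_articles(src, tgt):
    print(f"source {i} <-> target {j} (similarity {score:.3f})")
```

In the toy data, the third source and third target articles are mutual nearest neighbours, but their similarity falls below the threshold, so they are left unpaired. A threshold of this kind trades coverage for precision, and a real pipeline would tune it on held-out data.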


WikiMulti: a Corpus for Cross-Lingual Summarization

Cross-lingual summarization (CLS) is the task to produce a summary in on...

MLSUM: The Multilingual Summarization Corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization datas...

Cross-lingual Approach to Abstractive Summarization

Automatic text summarization extracts important information from texts a...

CroCoSum: A Benchmark Dataset for Cross-Lingual Code-Switched Summarization

Cross-lingual summarization (CLS) has attracted increasing interest in r...

Cross-lingual Hate Speech Detection using Transformer Models

Hate speech detection within a cross-lingual setting represents a paramo...

XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

Recent advances in machine learning have significantly improved the unde...

Large Scale Multi-Lingual Multi-Modal Summarization Dataset

Significant developments in techniques such as encoder-decoder models ha...