PMIndiaSum: Multilingual and Cross-lingual Headline Summarization for Languages in India

05/15/2023
by   Ashok Urlana, et al.
0

This paper introduces PMIndiaSum, a new multilingual and massively parallel headline summarization corpus focused on languages in India. Our corpus covers four language families, 14 languages, and the largest to date, 196 language pairs. It provides a testing ground for all cross-lingual pairs. We detail our workflow to construct the corpus, including data acquisition, processing, and quality assurance. Furthermore, we publish benchmarks for monolingual, cross-lingual, and multilingual summarization by fine-tuning, prompting, as well as translate-and-summarize. Experimental results confirm the crucial role of our data in aiding the summarization of Indian texts. Our dataset is publicly available and can be freely modified and re-distributed.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
04/23/2022

WikiMulti: a Corpus for Cross-Lingual Summarization

Cross-lingual summarization (CLS) is the task to produce a summary in on...
research
03/31/2020

Multilingual Stance Detection: The Catalonia Independence Corpus

Stance detection aims to determine the attitude of a given text with res...
research
04/30/2020

MLSUM: The Multilingual Summarization Corpus

We present MLSUM, the first large-scale MultiLingual SUMmarization datas...
research
10/24/2022

EUR-Lex-Sum: A Multi- and Cross-lingual Dataset for Long-form Summarization in the Legal Domain

Existing summarization datasets come with two main drawbacks: (1) They t...
research
05/23/2023

μPLAN: Summarizing using a Content Plan as Cross-Lingual Bridge

Cross-lingual summarization consists of generating a summary in one lang...
research
06/16/2022

XLCoST: A Benchmark Dataset for Cross-lingual Code Intelligence

Recent advances in machine learning have significantly improved the unde...
research
06/05/2023

Colexifications for Bootstrapping Cross-lingual Datasets: The Case of Phonology, Concreteness, and Affectiveness

Colexification refers to the linguistic phenomenon where a single lexica...

Please sign up or login with your details

Forgot password? Click here to reset