XWikiGen: Cross-lingual Summarization for Encyclopedic Text Generation in Low Resource Languages

03/22/2023
by Dhaval Taunk, et al.

A lack of encyclopedic text contributors, especially on Wikipedia, makes automated text generation for low-resource (LR) languages a critical problem. Existing work on Wikipedia text generation has focused only on English, where English reference articles are summarized to generate English Wikipedia pages. For low-resource languages, however, the scarcity of reference articles makes monolingual summarization ineffective. Hence, in this work, we propose XWikiGen, the task of cross-lingual multi-document summarization of text from multiple reference articles, written in various languages, to generate Wikipedia-style text. Accordingly, we contribute a benchmark dataset spanning ∼69K Wikipedia articles covering five domains and eight languages. We harness this dataset to train a two-stage system where the input is a set of citations and a section title and the output is a section-specific LR summary. The proposed system is based on a novel idea of neural unsupervised extractive summarization to coarsely identify salient information, followed by a neural abstractive model to generate the section-specific text. Extensive experiments show that multi-domain training is better than the multi-lingual setup on average.
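
The two-stage flow described in the abstract (extract salient sentences from the references, then hand them to an abstractive generator together with the section title) can be sketched roughly as follows. This is only an illustration of the data flow, not the authors' system: the `salience` function is a simple word-overlap heuristic standing in for the neural unsupervised extractive model, the `generator` argument is a hypothetical callable standing in for the neural abstractive model, and every function name and the prompt layout are assumptions made for this example.

```python
import re
from collections import Counter
from typing import Callable, List


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter on '.', '!' and '?' (illustrative only)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def tokenize(text: str) -> List[str]:
    """Lowercase word tokens."""
    return re.findall(r"\w+", text.lower())


def salience(sentence: str, title_tokens: set, corpus_counts: Counter) -> float:
    """Hypothetical stand-in for the neural extractive scorer.

    Counts tokens shared with the section title and adds a small
    centrality term (always <= 1) based on corpus word frequency.
    """
    tokens = tokenize(sentence)
    if not tokens:
        return 0.0
    title_hits = sum(1 for t in tokens if t in title_tokens)
    total = sum(corpus_counts.values())
    centrality = sum(corpus_counts[t] for t in tokens) / (len(tokens) * total)
    return title_hits + centrality


def extract_salient(references: List[str], section_title: str, k: int = 3) -> List[str]:
    """Stage 1: keep the k highest-scoring sentences, in original order."""
    sentences = [s for ref in references for s in split_sentences(ref)]
    corpus_counts = Counter(t for s in sentences for t in tokenize(s))
    title_tokens = set(tokenize(section_title))
    scores = [salience(s, title_tokens, corpus_counts) for s in sentences]
    top = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(top)]


def generate_section(
    references: List[str],
    section_title: str,
    target_lang: str,
    generator: Callable[[str], str],
    k: int = 3,
) -> str:
    """Stage 2: pass title, target language and salient text to a generator.

    `generator` is an assumed placeholder for an abstractive model.
    """
    salient = extract_salient(references, section_title, k)
    prompt = f"{target_lang} | {section_title} | " + " ".join(salient)
    return generator(prompt)
```

In practice the references may be written in several languages and the generator would be a multilingual sequence-to-sequence model; the heuristic scorer above only works when the title and references share vocabulary, which is exactly the limitation a neural extractive stage is meant to avoid.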

research
02/01/2022

XAlign: Cross-lingual Fact-to-Text Alignment and Generation for Low-Resource Languages

Multiple critical scenarios (like Wikipedia text generation given Englis...
research
01/30/2018

Generating Wikipedia by Summarizing Long Sequences

We show that generating English Wikipedia articles can be approached as ...
research
09/22/2022

XF2T: Cross-lingual Fact-to-Text Generation for Low-Resource Languages

Multiple business scenarios require an automated generation of descripti...
research
05/30/2023

SWiPE: A Dataset for Document-Level Simplification of Wikipedia Pages

Text simplification research has mostly focused on sentence-level simpli...
research
07/13/2023

MegaWika: Millions of reports and their sources across 50 diverse languages

To foster the development of new models for collaborative AI-assisted re...
research
11/03/2022

Time-aware Prompting for Text Generation

In this paper, we study the effects of incorporating timestamps, such as...
research
05/22/2020

A Generative Approach to Titling and Clustering Wikipedia Sections

We evaluate the performance of transformer encoders with various decoder...