Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies

by Angela Fan, et al.

Generating factual, long-form text such as Wikipedia articles raises three key challenges: how to gather relevant evidence, how to structure information into well-formed text, and how to ensure that the generated text is factually correct. We address these by developing a model for English text that uses a retrieval mechanism to identify relevant supporting information on the web and a cache-based pre-trained encoder-decoder to generate long-form biographies section by section, including citation information. To assess the impact of available web evidence on the output text, we compare the performance of our approach when generating biographies about women (for which less information is available on the web) vs. biographies generally. To this end, we curate a dataset of 1,500 biographies about women. We analyze our generated text to understand how differences in available web evidence data affect generation. We evaluate the factuality, fluency, and quality of the generated texts using automatic metrics and human evaluation. We hope that these techniques can be used as a starting point for human writers, to aid in reducing the complexity inherent in the creation of long-form, factual text.
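As a rough illustration of the retrieve-then-generate loop described above, the following minimal sketch ranks candidate web passages against a query built from the subject and a section heading, then assembles each section with citation markers. It is not the authors' model. The word-overlap scorer stands in for the retrieval mechanism, `generate_section` is a hypothetical placeholder that stitches evidence together instead of calling a pre-trained encoder-decoder, and the cache is simply a list of previously written sections. All function names and the sample data are assumptions made for illustration.

```python
from typing import List, Tuple


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of punctuation-stripped words."""
    words = (w.strip(".,;:()").lower() for w in text.split())
    return {w for w in words if w}


def retrieve(query: str, passages: List[str], k: int = 2) -> List[Tuple[int, str]]:
    """Return up to k (index, passage) pairs ranked by word overlap with the query.

    Assumption: word overlap is a stand-in for a real retrieval model.
    """
    q = tokenize(query)
    scored = [(len(q & tokenize(p)), i) for i, p in enumerate(passages)]
    # Highest overlap first; ties are broken by passage index for determinism.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(i, passages[i]) for score, i in scored[:k] if score > 0]


def generate_section(heading: str, evidence: List[Tuple[int, str]], cache: List[str]) -> str:
    """Hypothetical placeholder for the generator.

    A real system would condition a pre-trained encoder-decoder on the
    evidence and on `cache` (earlier sections). Here the evidence is only
    concatenated, each passage followed by a 1-based citation marker.
    """
    cited = " ".join(f"{text} [{idx + 1}]" for idx, text in evidence)
    return f"{heading}: {cited}" if cited else f"{heading}: (no evidence found)"


def write_biography(subject: str, headings: List[str], passages: List[str]) -> List[str]:
    """Write a biography section by section, retrieving evidence for each heading."""
    cache: List[str] = []  # earlier sections, made available to later ones
    for heading in headings:
        evidence = retrieve(f"{subject} {heading}", passages)
        cache.append(generate_section(heading, evidence, cache))
    return cache


if __name__ == "__main__":
    # Fictional sample data, used only to exercise the sketch.
    sample_passages = [
        "Jane Doe was born in Springfield and had an early life on a farm.",
        "Jane Doe began her career as an engineer.",
        "Unrelated text about weather.",
    ]
    for section in write_biography("Jane Doe", ["Early life", "Career"], sample_passages):
        print(section)
```

In this sketch, a subject with fewer matching passages yields sections with fewer citations or none at all, which mirrors the abstract's point that the amount of available web evidence shapes the generated text.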


Generating Wikipedia by Summarizing Long Sequences

We show that generating English Wikipedia articles can be approached as ...

A Generative Approach to Titling and Clustering Wikipedia Sections

We evaluate the performance of transformer encoders with various decoder...

WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus

In this paper, we introduce a new NLP task – generating short factual ar...

Human and Automatic Detection of Generated Text

With the advent of generative models with a billion parameters or more, ...

Abstractive Text Summarization using Attentive GRU based Encoder-Decoder

In today's era a huge volume of information exists everywhere. Therefore, i...

Improving Human Text Comprehension through Semi-Markov CRF-based Neural Section Title Generation

Titles of short sections within long documents support readers by guidin...

Enabling Large Language Models to Generate Text with Citations

Large language models (LLMs) have emerged as a widely-used tool for info...
