WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus

04/10/2023
by Hongjing Qian, et al.

In this paper, we introduce a new NLP task: generating short factual articles with references for queries by mining supporting evidence from the Web. In this task, called WebBrain, the ultimate goal is to generate a fluent, informative, and factually correct short article (e.g., a Wikipedia article) for a factual query unseen in Wikipedia. To enable experiments on WebBrain, we construct a large-scale dataset, WebBrain-Raw, by extracting English Wikipedia articles and their crawlable Wikipedia references. WebBrain-Raw is ten times larger than the previous largest peer dataset and can greatly benefit the research community. From WebBrain-Raw, we construct two task-specific datasets, WebBrain-R and WebBrain-G, which are used to train an in-domain retriever and generator, respectively. In addition, we empirically analyze the performance of current state-of-the-art NLP techniques on WebBrain and introduce a new framework, ReGen, which improves the factualness of generation through improved evidence retrieval and task-specific pre-training for generation. Experimental results show that ReGen outperforms all baselines in both automatic and human evaluations.
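To make the shape of the task concrete, the following is a minimal Python sketch of a retrieve-then-generate pipeline that returns an article with numbered references. It is not ReGen or the paper's code: the term-overlap scorer stands in for a trained retriever, the "generator" only stitches retrieved passages together, and the function names (`retrieve`, `generate_with_references`) and the toy passages are assumptions made for illustration.

```python
from collections import Counter
import math
import re


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def retrieve(query, passages, k=2):
    """Rank passages by IDF-weighted term overlap with the query.

    Hypothetical stand-in for a trained retriever; not the paper's method.
    """
    docs = [Counter(tokenize(p)) for p in passages]
    n = len(docs)
    scores = []
    for i, doc in enumerate(docs):
        score = 0.0
        for term in set(tokenize(query)):
            df = sum(1 for d in docs if term in d)  # document frequency
            if df and term in doc:
                score += doc[term] * math.log(1 + n / df)
        scores.append((score, i))
    # Highest score first; ties broken by original passage order.
    ranked = sorted(scores, key=lambda s: (-s[0], s[1]))
    return [i for score, i in ranked[:k] if score > 0]


def generate_with_references(passages, evidence_ids):
    """Placeholder generator: emit each evidence passage with a [n] marker."""
    body = " ".join(
        f"{passages[i]} [{n}]" for n, i in enumerate(evidence_ids, 1)
    )
    refs = [f"[{n}] passage {i}" for n, i in enumerate(evidence_ids, 1)]
    return body, refs


# Toy corpus, invented for this example.
passages = [
    "The Eiffel Tower is a wrought-iron lattice tower in Paris.",
    "Bananas are rich in potassium.",
    "The Eiffel Tower was completed in 1889.",
]
evidence = retrieve("Eiffel Tower", passages, k=2)
article, references = generate_with_references(passages, evidence)
print(article)
print(references)
```

In a real system the scorer would be replaced by a learned retriever and the stitching step by a trained generator; the sketch only shows how retrieved evidence can be carried through to the output as citable references.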


research
02/24/2021

References in Wikipedia: The Editors' Perspective

References are an essential part of Wikipedia. Each statement in Wikiped...
research
04/12/2022

Generating Full Length Wikipedia Biographies: The Impact of Gender Bias on the Retrieval-Based Generation of Women Biographies

Generating factual, long-form text such as Wikipedia articles raises thr...
research
05/10/2021

Wiki-Reliability: A Large Scale Dataset for Content Reliability on Wikipedia

Wikipedia is the largest online encyclopedia, used by algorithms and web...
research
03/07/2016

A matter of words: NLP for quality evaluation of Wikipedia medical articles

Automatic quality evaluation of Web information is a task with many fiel...
research
05/20/2022

Descartes: Generating Short Descriptions of Wikipedia Articles

We introduce and tackle the problem of automatically generating short de...
research
03/16/2022

C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References

We consider the problem of pretraining a two-stage open-domain question ...
research
12/18/2021

The Web Is Your Oyster – Knowledge-Intensive NLP against a Very Large Web Corpus

In order to address the increasing demands of real-world applications, t...