C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References

03/16/2022
by Xiang Yue, et al.

We consider the problem of pretraining a two-stage open-domain question answering (QA) system (retriever + reader) with strong transfer capabilities. The key challenge is how to construct a large amount of high-quality question-answer-context triplets without task-specific annotations. Specifically, the triplets should align well with downstream tasks by: (i) covering a wide range of domains (for open-domain applications), (ii) linking a question to its semantically relevant context with supporting evidence (for training the retriever), and (iii) identifying the correct answer in the context (for training the reader). Previous pretraining approaches generally fall short of one or more of these requirements. In this work, we automatically construct a large-scale corpus that meets all three criteria by consulting millions of references cited within Wikipedia. The well-aligned pretraining signals benefit both the retriever and the reader significantly. Our pretrained retriever leads to 2%-10% absolute gains in top-20 retrieval accuracy, and with our pretrained reader, the entire system improves by up to 4% in exact match.
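
To make the pretraining signal concrete, here is a minimal Python sketch of the question-answer-context triplet format the abstract describes, together with the filtering step implied by criterion (iii): an example is kept only if its answer can be located verbatim in the context. The `QACTriplet` schema and the `build_triplet` helper are illustrative assumptions, not the paper's actual data pipeline.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QACTriplet:
    """One pretraining example: a question, a supporting context passage
    (a retrieval target for the retriever), and an answer span located
    inside that context (a span-extraction target for the reader)."""
    question: str
    context: str       # passage drawn from a cited reference
    answer: str        # answer string; must appear verbatim in `context`
    answer_start: int  # character offset of `answer` within `context`

def build_triplet(question: str, context: str, answer: str) -> Optional[QACTriplet]:
    """Hypothetical construction step: discard the example if the answer
    cannot be found in the context, so every kept triplet gives the reader
    a valid span and the retriever a genuinely supporting passage."""
    start = context.find(answer)
    if start == -1:
        return None  # context does not support the answer; drop the example
    return QACTriplet(question, context, answer, start)

# Toy usage in the spirit of a Wikipedia-reference corpus.
triplet = build_triplet(
    question="What year was the Eiffel Tower completed?",
    context="Construction of the Eiffel Tower finished in 1889, in time "
            "for the Exposition Universelle.",
    answer="1889",
)
print(triplet)
```

Filtering on answer presence is a common weak-supervision heuristic in machine reading; it approximates criterion (iii) without requiring human annotation.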

Related research

12/20/2022 · Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization
In long document controllable summarization, where labeled data is scarc...

09/21/2021 · Relation-Guided Pre-Training for Open-Domain Question Answering
Answering complex open-domain questions requires understanding the laten...

05/24/2023 · UniChart: A Universal Vision-language Pretrained Model for Chart Comprehension and Reasoning
Charts are very popular for analyzing data, visualizing key insights and...

05/25/2022 · ORCA: Interpreting Prompted Language Models via Locating Supporting Data Evidence in the Ocean of Pretraining Data
Large pretrained language models have been performing increasingly well ...

04/30/2020 · Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering
To extract answers from a large corpus, open-domain question answering (...

10/09/2021 · The Inductive Bias of In-Context Learning: Rethinking Pretraining Example Design
Pretraining Neural Language Models (NLMs) over a large corpus involves c...

04/10/2023 · WebBrain: Learning to Generate Factually Correct Articles for Queries by Grounding on Large Web Corpus
In this paper, we introduce a new NLP task – generating short factual ar...
