Progressively Pretrained Dense Corpus Index for Open-Domain Question Answering

04/30/2020
by   Wenhan Xiong, et al.

To extract answers from a large corpus, open-domain question answering (QA) systems usually rely on information retrieval (IR) techniques to narrow the search space. Standard inverted index methods such as TF-IDF are commonly used thanks to their efficiency. However, their retrieval performance is limited because they rely only on shallow, sparse lexical features. To break the IR bottleneck, recent studies show that stronger retrieval performance can be achieved by pretraining an effective paragraph encoder that indexes paragraphs into dense vectors. Once the encoder is trained, the corpus can be pre-encoded into low-dimensional vectors and stored within an index structure where retrieval can be efficiently implemented as maximum inner product search. Despite the promising results, pretraining such a dense index is expensive and often requires a very large batch size. In this work, we propose a simple and resource-efficient method to pretrain the paragraph encoder. First, instead of using heuristically created pseudo question-paragraph pairs for pretraining, we utilize an existing pretrained sequence-to-sequence model to build a strong question generator that creates high-quality pretraining data. Second, we propose a progressive pretraining algorithm to ensure the existence of effective negative samples in each batch. Across three datasets, our method outperforms an existing dense retrieval method that uses 7 times more computational resources for pretraining.
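
As a rough illustration of two ideas mentioned in the abstract, the sketch below shows retrieval over pre-encoded paragraph vectors as exhaustive maximum inner product search, and a contrastive loss in which the other paragraphs in a batch act as negatives. It is a minimal sketch under stated assumptions, not the paper's implementation: the abstract does not specify the training objective, so the softmax cross-entropy loss here is an assumed, commonly used choice, and the function names (`mips_top_k`, `in_batch_negative_loss`) and toy vectors are hypothetical. It does not implement the progressive pretraining algorithm itself.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def inner_product(a: Vector, b: Vector) -> float:
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mips_top_k(query: Vector, index: Sequence[Vector], k: int) -> List[Tuple[int, float]]:
    """Exhaustive maximum inner product search.

    Returns the k (paragraph_id, score) pairs with the highest inner product.
    Real systems would use an approximate index instead of a full scan.
    """
    scored = [(i, inner_product(query, p)) for i, p in enumerate(index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def in_batch_negative_loss(questions: Sequence[Vector], paragraphs: Sequence[Vector]) -> float:
    """Mean softmax cross-entropy over a batch (assumed objective).

    paragraphs[i] is treated as the positive for questions[i]; every other
    paragraph in the batch serves as a negative for that question.
    """
    total = 0.0
    for i, q in enumerate(questions):
        scores = [inner_product(q, p) for p in paragraphs]
        m = max(scores)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        total += log_z - scores[i]
    return total / len(questions)


# Toy "pre-encoded" corpus: hypothetical 2-dimensional paragraph vectors.
paragraph_index = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
question_vec = [0.9, 0.1]
top = mips_top_k(question_vec, paragraph_index, k=2)

# Loss is lower when each question is paired with its matching paragraph.
batch_questions = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = in_batch_negative_loss(batch_questions, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = in_batch_negative_loss(batch_questions, [[0.0, 1.0], [1.0, 0.0]])
```

In this formulation, negatives that score close to the positive contribute most to the loss, which may be why the presence of effective negatives in each batch matters for pretraining.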

research
09/28/2020

SPARTA: Efficient Open-Domain Question Answering via Sparse Transformer Matching Retrieval

We introduce SPARTA, a novel neural retrieval method that shows great pr...
research
08/09/2023

Building Interpretable and Reliable Open Information Retriever for New Domains Overnight

Information retrieval (IR) or knowledge retrieval, is a critical compone...
research
10/11/2022

Task-Aware Specialization for Efficient and Robust Dense Retrieval for Open-Domain Question Answering

Given its effectiveness on knowledge-intensive natural language processi...
research
12/14/2021

Learning to Retrieve Passages without Supervision

Dense retrievers for open-domain question answering (ODQA) have been sho...
research
09/16/2023

Bridging Dense and Sparse Maximum Inner Product Search

Maximum inner product search (MIPS) over dense and sparse vectors have p...
research
03/16/2022

C-MORE: Pretraining to Answer Open-Domain Questions by Consulting Millions of References

We consider the problem of pretraining a two-stage open-domain question ...
research
03/11/2022

Tevatron: An Efficient and Flexible Toolkit for Dense Retrieval

Recent rapid advancements in deep pre-trained language models and the in...
