A Neural Corpus Indexer for Document Retrieval

06/06/2022
by Yujing Wang, et al.

Current state-of-the-art document retrieval solutions mainly follow an index-retrieve paradigm, in which the index is hard to optimize directly for the final retrieval target. In this paper, we aim to show that an end-to-end deep neural network unifying the training and indexing stages can significantly improve the recall performance of traditional methods. To this end, we propose Neural Corpus Indexer (NCI), a sequence-to-sequence network that generates relevant document identifiers directly for a given query. To optimize the recall performance of NCI, we invent a prefix-aware weight-adaptive decoder architecture and leverage tailored techniques including query generation, semantic document identifiers, and consistency-based regularization. Empirical studies demonstrate the superiority of NCI on a commonly used academic benchmark, achieving a +51.9 improvement over the best baseline.
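To make the idea of "generating document identifiers directly" more concrete, the following is a minimal sketch of prefix-constrained beam search over a trie of identifiers. It is an illustration under stated assumptions, not the NCI implementation: the abstract does not say how decoding is constrained, identifiers are assumed to be fixed-length token tuples, and `build_trie`, `constrained_beam_search`, and the toy score table are hypothetical names standing in for a trained sequence-to-sequence decoder.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# A trie node maps the next identifier token to its child node.
Trie = Dict[int, "Trie"]


def build_trie(docids: Sequence[Sequence[int]]) -> Trie:
    """Index every valid document identifier by its token prefixes."""
    root: Trie = {}
    for docid in docids:
        node = root
        for token in docid:
            node = node.setdefault(token, {})
    return root


def constrained_beam_search(
    score_fn: Callable[[Tuple[int, ...], int], float],
    trie: Trie,
    docid_len: int,
    beam_size: int,
) -> List[Tuple[Tuple[int, ...], float]]:
    """Return up to `beam_size` identifiers with the highest summed scores.

    `score_fn(prefix, token)` is a stand-in for the decoder's log-probability
    of `token` given the query and the already generated `prefix`.
    Only tokens that extend a prefix present in the trie are considered,
    so every returned identifier exists in the corpus.
    """
    beams = [((), 0.0, trie)]
    for _ in range(docid_len):
        candidates = []
        for prefix, score, node in beams:
            for token, child in node.items():
                candidates.append(
                    (prefix + (token,), score + score_fn(prefix, token), child)
                )
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return [(prefix, score) for prefix, score, _ in beams]


# Toy corpus of three documents with two-token identifiers (made-up values).
docids = [(0, 1), (0, 2), (1, 0)]

# Made-up log-probabilities for a single query; a real model would compute these.
toy_scores = {
    (): {0: -0.1, 1: -2.0},
    (0,): {1: -1.5, 2: -0.2},
    (1,): {0: -0.1},
}

results = constrained_beam_search(
    lambda prefix, token: toy_scores[prefix][token],
    build_trie(docids),
    docid_len=2,
    beam_size=2,
)
ranked_ids = [prefix for prefix, _ in results]
```

With these toy scores, `ranked_ids` lists identifier `(0, 2)` first and `(0, 1)` second. Because the search only follows trie edges, the decoder cannot emit an identifier that belongs to no document, which is one reason identifiers that share prefixes among related documents are attractive in this setting.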


research
08/19/2022

Ultron: An Ultimate Retriever on Corpus with a Model-based Indexer

Document retrieval has been extensively studied within the index-retriev...
research
04/09/2023

Learning to Tokenize for Generative Retrieval

Conventional document retrieval techniques are mainly based on the index...
research
05/24/2023

Semantic-Enhanced Differentiable Search Index Inspired by Learning Strategies

Recently, a new paradigm called Differentiable Search Index (DSI) has be...
research
05/19/2023

How Does Generative Retrieval Scale to Millions of Passages?

Popularized by the Differentiable Search Index, the emerging paradigm of...
research
04/24/2020

Learning Term Discrimination

Document indexing is a key component for efficient information retrieval...
research
01/09/2023

Doc2Query--: When Less is More

Doc2Query – the process of expanding the content of a document before in...
