Large Language Models are Built-in Autoregressive Search Engines

by Noah Ziems, et al.

Document retrieval is a key stage of standard Web search engines. Existing dual-encoder dense retrievers obtain representations for questions and documents independently, allowing for only shallow interactions between them. To overcome this limitation, recent autoregressive search engines replace the dual-encoder architecture by directly generating identifiers for relevant documents in the candidate pool. However, the training cost of such autoregressive search engines rises sharply as the number of candidate documents increases. In this paper, we find that large language models (LLMs) can follow human instructions to directly generate URLs for document retrieval. Surprisingly, when provided with a few Query-URL pairs as in-context demonstrations, LLMs can generate Web URLs where nearly 90% of the corresponding documents contain correct answers to open-domain questions. In this way, LLMs can be thought of as built-in search engines, since they have not been explicitly trained to map questions to document identifiers. Experiments demonstrate that our method consistently achieves better retrieval performance than existing retrieval approaches by a significant margin on three open-domain question answering benchmarks, under both zero-shot and few-shot settings. The code for this work can be found at <>.
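
The few-shot setup described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the instruction wording, the `Question:`/`URL:` layout, and the helper names `build_prompt` and `extract_urls` are all assumptions made for illustration. The sketch only assembles a prompt from Query-URL demonstration pairs and pulls URLs out of whatever text a model returns; the call to an actual LLM is left out.

```python
import re

# Hypothetical prompt layout: the instruction text and the "Question:"/"URL:"
# labels are illustrative assumptions, not the wording used in the paper.
def build_prompt(question, demos):
    """Assemble a few-shot prompt from (query, url) demonstration pairs."""
    lines = ["Which URL would contain the answer to each question?", ""]
    for query, url in demos:
        lines.append(f"Question: {query}")
        lines.append(f"URL: {url}")
        lines.append("")
    # The final question is left open so the model completes the URL.
    lines.append(f"Question: {question}")
    lines.append("URL:")
    return "\n".join(lines)

# Matches http(s) URLs up to the next whitespace, quote, or bracket.
URL_PATTERN = re.compile(r"https?://[^\s<>\"')\]]+")

def extract_urls(generated_text):
    """Return the unique URLs found in model output, in order of appearance."""
    urls = []
    for match in URL_PATTERN.findall(generated_text):
        url = match.rstrip(".,;")  # drop trailing sentence punctuation
        if url not in urls:
            urls.append(url)
    return urls
```

With an empty `demos` list the same prompt builder covers the zero-shot case; the extracted URLs would then be fetched and their documents checked for the answer.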




Autoregressive Search Engines: Generating Substrings as Document Identifiers

Knowledge-intensive language tasks require NLP systems to both provide t...

ReadProbe: A Demo of Retrieval-Enhanced Large Language Models to Support Lateral Reading

With the rapid growth and spread of online misinformation, people need t...

BeamSearchQA: Large Language Models are Strong Zero-Shot QA Solver

Open-domain question answering is a crucial task that often requires acc...

Term-Sets Can Be Strong Document Identifiers For Auto-Regressive Search Engines

Auto-regressive search engines emerge as a promising paradigm for next-g...

Generating a Common Question from Multiple Documents using Multi-source Encoder-Decoder Models

Ambiguous user queries in search engines result in the retrieval of docu...

Knowledge Refinement via Interaction Between Search Engines and Large Language Models

Information retrieval (IR) plays a crucial role in locating relevant res...

Zero-Shot Retrieval with Search Agents and Hybrid Environments

Learning to search is the task of building artificial agents that learn ...
