Improving language models by retrieving from trillions of tokens

by Sebastian Borgeaud et al.

We enhance auto-regressive language models by conditioning on document chunks retrieved from a large corpus, based on local similarity with preceding tokens. With a 2 trillion token database, our Retrieval-Enhanced Transformer (RETRO) obtains performance comparable to GPT-3 and Jurassic-1 on the Pile, despite using 25× fewer parameters. After fine-tuning, RETRO's performance translates to downstream knowledge-intensive tasks such as question answering. RETRO combines a frozen BERT retriever, a differentiable encoder, and a chunked cross-attention mechanism to predict tokens based on an order of magnitude more data than is typically consumed during training. We typically train RETRO from scratch, yet can also rapidly RETROfit pre-trained transformers with retrieval and still achieve good performance. Our work opens up new avenues for improving language models through explicit memory at unprecedented scale.
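To make the retrieval step concrete, here is a minimal sketch of RETRO-style chunked retrieval. It is a toy stand-in, not the paper's implementation: a seeded random projection replaces the frozen BERT embedder, a brute-force cosine-similarity scan replaces the approximate nearest-neighbour index over the 2T-token database, and the chunk size is shrunk from the paper's 64 tokens to 4 for illustration.

```python
# Toy sketch of RETRO-style chunked retrieval (assumptions: random
# vectors stand in for frozen BERT embeddings; brute-force search
# stands in for an approximate nearest-neighbour index).
import numpy as np

CHUNK = 4  # tokens per chunk (the paper uses 64)


def embed(chunk, dim=16):
    # Hypothetical stand-in for a frozen BERT chunk embedding:
    # mean of deterministic per-token random vectors, unit-normalised.
    vecs = [np.random.default_rng(t).standard_normal(dim) for t in chunk]
    v = np.mean(vecs, axis=0)
    return v / np.linalg.norm(v)


def build_db(corpus_tokens):
    # Split the retrieval corpus into fixed-size chunks and embed each.
    chunks = [corpus_tokens[i:i + CHUNK]
              for i in range(0, len(corpus_tokens) - CHUNK + 1, CHUNK)]
    return chunks, np.stack([embed(c) for c in chunks])


def retrieve(input_tokens, chunks, db_emb, k=2):
    # For each input chunk, fetch the k most similar database chunks.
    # RETRO would feed these neighbours through its encoder and attend
    # to them via chunked cross-attention when predicting the next chunk.
    neighbours = []
    for i in range(0, len(input_tokens) - CHUNK + 1, CHUNK):
        q = embed(input_tokens[i:i + CHUNK])
        scores = db_emb @ q  # cosine similarity (all vectors unit-norm)
        top = np.argsort(-scores)[:k]
        neighbours.append([chunks[j] for j in top])
    return neighbours


corpus = list(range(100))          # token ids standing in for real text
chunks, emb = build_db(corpus)
nn = retrieve([0, 1, 2, 3, 40, 41, 42, 43], chunks, emb)
```

Because each input chunk here also appears verbatim in the corpus, its top neighbour is itself; in practice, neighbours from training data are filtered to avoid leakage, and retrieval runs over an index scaled to trillions of tokens.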


Related papers:
- Characterizing Verbatim Short-Term Memory in Neural Language Models
- On the Generalization Ability of Retrieval-Enhanced Transformers
- Trusting Language Models in Education
- Memorizing Transformers
- RetroMAE v2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models
- Cure the headache of Transformers via Collinear Constrained Attention
