Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks

Patrick Lewis et al.

Large pre-trained language models have been shown to store factual knowledge in their parameters, and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, and hence on knowledge-intensive tasks, their performance lags behind task-specific architectures. Additionally, providing provenance for their decisions and updating their world knowledge remain open research problems. Pre-trained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue, but have so far only been investigated for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG) – models which combine pre-trained parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pre-trained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pre-trained neural retriever. We compare two RAG formulations, one which conditions on the same retrieved passages across the whole generated sequence, and another which can use different passages per token. We fine-tune and evaluate our models on a wide range of knowledge-intensive NLP tasks and set the state-of-the-art on three open domain QA tasks, outperforming parametric seq2seq models and task-specific retrieve-and-extract architectures. For language generation tasks, we find that RAG models generate more specific, diverse and factual language than a state-of-the-art parametric-only seq2seq baseline.


1 Introduction

Pre-trained neural language models have been shown to learn a substantial amount of in-depth knowledge from data Petroni et al. (2019). They can do so without any access to an external memory, as a parameterized implicit knowledge base Raffel et al. (2019); Roberts et al. (2020). While this development is exciting, such models do have downsides: They cannot easily expand or revise their memory, can’t straightforwardly provide insight into their predictions, and may produce “hallucinations” Marcus (2020). Hybrid models that combine parametric memory with non-parametric (i.e., retrieval-based) memories Guu et al. (2020); Karpukhin et al. (2020); Petroni et al. (2020) can address some of these issues because knowledge can be directly revised and expanded, and its access can be inspected and interpreted. REALM Guu et al. (2020) and ORQA Lee et al. (2019), two recently introduced models that combine masked language models Devlin et al. (2019) with a differentiable retriever, have shown promising results, but have only explored open-domain extractive question answering. Here, we bring hybrid parametric and non-parametric memory to the “workhorse of NLP,” i.e. sequence-to-sequence (seq2seq) models.

We endow pre-trained, parametric-memory generation models with a non-parametric memory through a general-purpose fine-tuning approach which we refer to as retrieval-augmented generation (RAG). We build RAG models where the parametric memory is a pre-trained generative seq2seq transformer, and the non-parametric memory is a dense vector index of Wikipedia, accessed using a pre-trained neural retriever. We combine these components in an end-to-end probabilistic model; the document retriever (Dense Passage Retriever Karpukhin et al. (2020), henceforth DPR) provides latent documents conditioned on the input, and the seq2seq model (BART Lewis et al. (2019)) then conditions on both these latent documents and the input to generate the output. We marginalize the latent variables through a top-K approximation, either on a per answer basis (assuming the same document is responsible for all tokens) or a per answer token basis (assuming different documents can be responsible for different tokens). Just like T5 Raffel et al. (2019) or BART, RAG can be fine-tuned on any seq2seq task, whereby both the sequence generator and retriever are jointly learned.

There has been extensive previous work proposing architectures to enrich systems with non-parametric memory which are trained from scratch for specific tasks—e.g. in memory networks Weston et al. (2015); Sukhbaatar et al. (2015a), stack-augmented networks Joulin and Mikolov (2015) and memory layers for transformers Lample et al. (2019). In contrast, we explore a setting where both parametric and non-parametric memory components are pre-trained and pre-loaded with extensive knowledge. Crucially, by using pre-trained knowledge-access mechanisms, the ability to access knowledge is present without additional training.

Our results highlight the benefits of combining parametric and non-parametric memory with generation for knowledge-intensive tasks. Our RAG models achieve state-of-the-art results on open Natural Questions Kwiatkowski et al. (2019), WebQuestions Berant et al. (2013) and CuratedTrec Baudiš and Šedivỳ (2015) and strongly outperform recent approaches that use specialised pre-training objectives on TriviaQA Joshi et al. (2017). Despite these being extractive tasks, we find that unconstrained generation outperforms previous extractive approaches. For knowledge-intensive generation, we experiment with MS-MARCO Bajaj et al. (2016) and Jeopardy question generation, and we find that our models generate responses that are more factual, specific, and diverse than a BART baseline. For the FEVER Thorne et al. (2018) fact verification task, we achieve results within 4% of sophisticated, state-of-the-art pipeline models which use strong supervision. Finally, we show that the non-parametric memory can be replaced in order to control generation, demonstrating a simple mechanism to update the knowledge that the model uses as facts about the world change.

Figure 1: An overview of retrieval-augmented generation (RAG). We combine a pre-trained retriever (Query Encoder + Document Index) with a pre-trained encoder-decoder (Generator) and fine-tune end-to-end. For a query x, we use Maximum Inner Product Search (MIPS) to find the top-K most relevant of all documents z. To make the final prediction y, we treat z as a latent variable and marginalize over the encoder-decoder predictions given different documents.

2 Methods

We explore RAG models which use the input sequence x to retrieve text passages z and use these passages as additional context when generating the target sequence y. As shown in Figure 1, our models leverage two components: (i) a retriever p_η(z|x) with parameters η that returns (top-K truncated) distributions over text passages given a query x, and (ii) a generator p_θ(y_i | x, z, y_{1:i-1}) parametrized by θ that generates the current token based on a context of the previous tokens y_{1:i-1}, the original input x, and a retrieved passage z.

To train the retriever and generator end-to-end, we treat the retrieved document as a latent variable. We propose two models that marginalize over the latent documents in different ways to produce a distribution over generated text. In one approach,

RAG-Sequence, the model uses the same document to predict each target token. In the other approach, RAG-Token, the model can predict each target token based on a different document. In what follows, we formally introduce both models and then describe the retriever p_η and generator p_θ components, as well as the training and decoding procedures, in more detail.

2.1 Models

RAG-Sequence Model

The RAG-Sequence model uses the same retrieved document to generate the complete sequence y. Technically, it treats the retrieved passage as a single latent variable that is marginalized to obtain the seq2seq probability p(y|x) via a top-K approximation:

p_RAG-Sequence(y|x) ≈ Σ_{z ∈ top-k(p_η(·|x))} p_η(z|x) p_θ(y|x, z) = Σ_{z ∈ top-k(p_η(·|x))} p_η(z|x) ∏_{i=1}^{N} p_θ(y_i | x, z, y_{1:i-1})

RAG-Token Model

In the RAG-Token model we can draw a different latent passage for each target token and marginalize accordingly. This allows the generator to choose content from several documents when producing an answer. Formally, we define:

p_RAG-Token(y|x) ≈ ∏_{i=1}^{N} Σ_{z ∈ top-k(p_η(·|x))} p_η(z|x) p_θ(y_i | x, z, y_{1:i-1})

Finally, we note that RAG can be used for sequence classification tasks by considering the target class as a target sequence of length one, in which case RAG-Sequence and RAG-Token are equivalent.
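The difference between the two marginalizations can be sketched in plain Python (with toy, hand-picked retriever and generator probabilities for a two-document, two-token example; all numbers are purely illustrative):

```python
# Hypothetical top-k truncated retriever distribution p(z|x) for k=2 documents,
# and per-token generator probabilities p(y_i | x, z, y_{<i}), indexed [doc][token].
p_z = [0.7, 0.3]
p_tok = [
    [0.9, 0.8],  # token probabilities conditioned on document z1
    [0.2, 0.6],  # token probabilities conditioned on document z2
]

def rag_sequence(p_z, p_tok):
    # RAG-Sequence: marginalize whole sequences,
    # sum_z p(z|x) * prod_i p(y_i | x, z, y_{<i})
    total = 0.0
    for pz, toks in zip(p_z, p_tok):
        seq_prob = 1.0
        for pt in toks:
            seq_prob *= pt
        total += pz * seq_prob
    return total

def rag_token(p_z, p_tok):
    # RAG-Token: marginalize per token,
    # prod_i sum_z p(z|x) * p(y_i | x, z, y_{<i})
    total = 1.0
    for i in range(len(p_tok[0])):
        total *= sum(pz * toks[i] for pz, toks in zip(p_z, p_tok))
    return total
```

The two quantities differ whenever different documents best support different tokens, which is exactly the case RAG-Token is designed for.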

2.2 Retriever: DPR

The retrieval component p_η(z|x) is based on DPR Karpukhin et al. (2020). DPR follows a bi-encoder architecture:

p_η(z|x) ∝ exp( d(z)ᵀ q(x) ),  d(z) = BERT_d(z),  q(x) = BERT_q(x),

where d(z) is a dense representation of the document produced by a BERT_BASE document encoder Devlin et al. (2019), and q(x) a representation of the query produced by another BERT_BASE encoder with a different set of parameters.

To efficiently compute top-k(p_η(·|x)), the list of k documents z with the highest prior probability p_η(z|x), DPR employs a Maximum Inner Product Search (MIPS) index provided by the FAISS library Johnson et al. (2017).

For non-parametric pre-trained memory, we use a pre-trained bi-encoder from Karpukhin et al. (2020) to both initialize our retriever and to build the document index. This retriever was trained to retrieve documents which contain answers to TriviaQA Joshi et al. (2017) questions and Natural Questions Kwiatkowski et al. (2019).
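The retrieval step itself can be sketched as follows (a toy example with hypothetical 3-dimensional embeddings standing in for BERT outputs, and exact inner-product search standing in for FAISS's approximate MIPS over millions of vectors):

```python
# Hypothetical pre-computed document embeddings d(z); in the real system
# these are produced once by the fixed BERT document encoder.
doc_embs = {
    "doc_a": [0.1, 0.9, 0.0],
    "doc_b": [0.8, 0.1, 0.1],
    "doc_c": [0.4, 0.4, 0.2],
}

def top_k_mips(query_emb, doc_embs, k):
    # Exact MIPS: score every document by the inner product d(z)^T q(x),
    # then keep the k highest-scoring documents.
    scores = {
        doc_id: sum(d * q for d, q in zip(emb, query_emb))
        for doc_id, emb in doc_embs.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

FAISS trades the exactness of this linear scan for sub-linear approximate search, which is what makes retrieval over the full Wikipedia index practical.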

2.3 Generator: BART

The generator component could be modelled using any encoder-decoder. We use BART-large Lewis et al. (2019), a pre-trained seq2seq transformer Vaswani et al. (2017) with 400M parameters. To combine the input with the retrieved content when generating from BART, we simply concatenate them.

BART was pre-trained using a denoising objective and a variety of different noising functions. It has obtained state-of-the-art results on a diverse set of generation tasks and outperforms comparably-sized T5 models Lewis et al. (2019). We refer to the BART generator parameters as the parametric memory henceforth.

2.4 Training

We jointly train the retriever and generator components without any direct supervision on what document should be retrieved. Given a fine-tuning training corpus of input/output pairs (x_j, y_j), we minimize the negative marginal log-likelihood of each target, Σ_j −log p(y_j | x_j), using stochastic gradient descent with Adam Kingma and Ba (2015). Updating the document encoder during training is costly, as it requires the document index to be periodically refreshed, as REALM does during pre-training Guu et al. (2020). We do not find this step necessary for strong performance, and we keep the document encoder (and index) fixed, fine-tuning only the query encoder and the generator.
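The per-example objective can be sketched as follows (toy probabilities; in practice the values of p(z|x_j) come from the retriever and p(y_j|x_j, z) from the generator, and gradients flow into the query encoder and generator only):

```python
import math

def marginal_nll(p_z, p_y_given_z):
    # Negative marginal log-likelihood for one training pair (x_j, y_j):
    # -log sum_z p(z|x_j) * p(y_j | x_j, z)
    return -math.log(sum(pz * py for pz, py in zip(p_z, p_y_given_z)))
```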

2.5 Decoding

At test/decoding time, RAG-Sequence and RAG-Token require different ways to approximate arg max_y p(y|x).


The RAG-Token model can be seen as a standard, autoregressive seq2seq generator with transition probability:

p′_θ(y_i | x, y_{1:i-1}) = Σ_{z ∈ top-k(p_η(·|x))} p_η(z|x) p_θ(y_i | x, z, y_{1:i-1})

To decode, we can plug p′_θ(y_i | x, y_{1:i-1}) into a standard beam decoder.


The likelihood p(y|x) does not break into a conventional per-token likelihood for RAG-Sequence, and hence we cannot solve it with a single beam search pass. Instead, we run beam search for each candidate document z, scoring each hypothesis using p_θ(y_i | x, z, y_{1:i-1}). This yields a set of hypotheses Y, some of which might not have appeared in the beams of all documents. To estimate the probability of a hypothesis y across all beams, we run an additional forward pass for each document z for which y does not appear in the beam, multiply the generator score by p_η(z|x), and then sum the probabilities across beams for the marginals. We refer to this decoding procedure as “Thorough Decoding.”

For longer output sequences, |Y| can become large, requiring many forward passes. For more efficient decoding, we can make the further approximation that p_θ(y|x, z_i) ≈ 0 where y was not generated during beam search from x, z_i. This avoids the need to run additional forward passes once the candidate set has been generated. We refer to this decoding procedure as “Fast Decoding.”
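Thorough Decoding can be sketched as follows (a toy example in which hypothetical pre-computed generator scores stand in for both beam search and the additional forward passes):

```python
# Hypothetical retriever distribution and per-document beam results.
p_z = {"z1": 0.6, "z2": 0.4}
beams = {                      # hypotheses found by beam search per document
    "z1": {"paris": 0.7},
    "z2": {"london": 0.5},
}
full_scores = {                # p_theta(y | x, z) for every (doc, hypothesis) pair
    ("z1", "paris"): 0.7, ("z1", "london"): 0.1,
    ("z2", "paris"): 0.2, ("z2", "london"): 0.5,
}

def thorough_decode(p_z, beams, full_scores):
    # Pool hypotheses from all per-document beams.
    candidates = {y for beam in beams.values() for y in beam}
    marginals = {}
    for y in candidates:
        # Marginalize: sum_z p(z|x) * p_theta(y|x,z). The full_scores lookup
        # stands in for the additional forward pass when y is absent from
        # document z's beam.
        marginals[y] = sum(pz * full_scores[(z, y)] for z, pz in p_z.items())
    return max(marginals, key=marginals.get)
```

Fast Decoding would instead treat the missing (doc, hypothesis) scores as zero, skipping the extra forward passes.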

3 Experiments

We experiment with RAG on a wide range of knowledge-intensive tasks. For all experiments, we use a single Wikipedia dump as our non-parametric knowledge source. Following Lee et al. (2019) and Karpukhin et al. (2020), we use the December 2018 dump. Each Wikipedia article is split into disjoint 100-word chunks, for a total of 21,015,324 documents.¹ We use the DPR document encoder to compute an embedding for each document, and we build a single MIPS index using FAISS Johnson et al. (2017) with the Hierarchical Navigable Small World approximation for efficient retrieval Malkov and Yashunin (2016), which is then used for all experiments. During training, we retrieve the top k documents for each query, where we consider k ∈ {5, 10}. We set k for test time using validation data. In the remainder of this section, we discuss the experimental details for each task setting.

¹ The reader is referred to Karpukhin et al. (2020) for further details on how Wikipedia is pre-processed.
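The chunking step can be sketched as follows (using simple whitespace tokenization; see Karpukhin et al. (2020) for the actual pre-processing details):

```python
def chunk_article(text, chunk_size=100):
    # Split an article into disjoint chunks of at most `chunk_size` words,
    # mirroring the 100-word Wikipedia passages used as retrieval units.
    words = text.split()
    return [" ".join(words[i:i + chunk_size])
            for i in range(0, len(words), chunk_size)]
```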

3.1 Open-domain Question Answering

Open-domain QA is an important real-world NLP application and is often used as a test-bed for knowledge-intensive tasks Guu et al. (2020). We tackle open-domain QA by treating questions and answers as simple input-output text pairs, and we train RAG by directly minimizing the negative log-likelihood of answers. We compare our results to the popular extractive QA paradigm Chen et al. (2017); Clark and Gardner (2017); Lee et al. (2019); Karpukhin et al. (2020), where answers are extracted as spans from retrieved documents, relying primarily on non-parametric knowledge. In addition, we also compare to “Closed-Book QA” approaches Roberts et al. (2020), which, like RAG, generate answers, but do not exploit latent retrieval, instead relying purely on parametric knowledge.

We consider four popular open-domain QA datasets: Natural Questions (NQ) Kwiatkowski et al. (2019), TriviaQA (TQA) Joshi et al. (2017), WebQuestions (WQ) Berant et al. (2013) and CuratedTrec (CT) Baudiš and Šedivỳ (2015). The answers for CuratedTrec are given in the form of regular expressions, which has been cited as a reason why it is unsuitable for answer-generation models Guu et al. (2020). To overcome this, we use a pre-processing step where we first retrieve the top 1000 documents for each query and use the answer that most frequently matches the regex pattern as the supervision target. If no matches are found, we resort to a simple heuristic: we generate all possible permutations for each regex, replacing non-deterministic symbols in the regex’s nested tree structure with whitespace. As CuratedTrec and WebQuestions are small datasets, we follow DPR Karpukhin et al. (2020) in initializing the CuratedTrec and WebQuestions models with our Natural Questions RAG model.

We use the same train/dev/test splits as in previous work Lee et al. (2019); Karpukhin et al. (2020) and report the standard Exact Match (EM) metric. For TriviaQA, in order to compare to T5 Roberts et al. (2020), we perform an additional test evaluation on the TriviaQA Wiki test set.

3.2 Abstractive Question Answering

Because RAG leverages an encoder-decoder model, it can go beyond extractive question-answering and answer questions with free-form, abstractive text generation. To test RAG’s ability to generate natural language responses in a knowledge-intensive setting, we use the MS-MARCO Natural Language Generation task v2.1 Nguyen et al. (2016). This task consists of natural language questions submitted to a search engine, ten snippets retrieved from a search engine for each question, and a full sentence natural language answer annotated from these retrieved passages.

As we are interested in models that can perform their own latent retrieval, we do not use the supplied passages, only the questions and answers, thus treating MS-MARCO as an open-domain abstractive question answering task. MS-MARCO does contain some questions that cannot be answered in a way that matches the reference answer without access to the context passages, such as “What is the weather in Volcano, CA?”, so we note that performance on Open MS-MARCO will be lower than for models that do use these gold context passages.

We further note that there are questions in MS-MARCO that cannot be answered using a Wikipedia knowledge source alone. In these cases, RAG can rely on the parametric implicit knowledge in its BART parameters in order to generate commonsense responses.

3.3 Jeopardy Question Generation

In order to further evaluate RAG’s generation abilities in a non-question-answering setting, we propose to study open-domain question generation. Rather than repurposing questions from standard open-domain QA tasks, which typically consist of short and simple questions, we instead propose the more demanding task of generating Jeopardy questions. Jeopardy has an unusual format that consists of trying to guess an entity from a fact about that entity. For example, “The World Cup” is the answer to the Jeopardy question “In 1986 Mexico scored as the first country to host this international sports competition twice.” As Jeopardy “questions” are precise, factual statements, generating Jeopardy-style questions conditioned on the answer entities they refer to constitutes a challenging knowledge-intensive generation task.

We use the raw Jeopardy data and splits from SearchQA Dunn et al. (2017), consisting of 97,391 training, 13,713 development, and 26,848 test datapoints. As this is a new task, we also train a BART system to compare RAG to. Following Zhang and Bansal (2019), we evaluate generations using the SQuAD-tuned Q-BLEU-1 metric Nema and Khapra (2018). Q-BLEU-1 is a variant of BLEU-1 which puts a higher weight on matching entities, and has higher correlation with human judgment for question generation than standard word overlap metrics.

As automatic metrics can be unreliable, especially on such open-ended tasks, we also perform a human evaluation of generations. We run two evaluations, one to assess the factuality of generations, and one to assess specificity. We follow the recent best-practice of performing a pairwise comparative evaluation between two systems Li et al. (2019). Assessors are shown an answer entity and two generated questions about that entity, one from BART and one from RAG. They are then asked to pick one of four possible options—Sentence A is better, Sentence B is better, both are correct or neither is good.

3.4 Fact Verification

FEVER Thorne et al. (2018) is a fact verification dataset that involves classifying whether a natural language claim is supported or refuted by Wikipedia, or whether there is not enough information to decide. The task requires retrieving evidence from Wikipedia relating to the claim and then reasoning over the retrieved evidence to classify whether the claim is true, false, or unverifiable from Wikipedia alone. FEVER is thus a retrieval problem coupled with an entailment reasoning task. It also provides a good test bed for exploring the RAG models’ ability to handle classification rather than generation.

We map FEVER class labels (supports, refutes, or not enough info) to single output tokens and directly train with claim-class pairs. Crucially, unlike most other approaches to FEVER, we do not use supervision on retrieved evidence. We explore two different FEVER variants: the standard 3-way classification task (supports/refutes/not enough info) and the 2-way FEVER (supports/refutes) task studied in Thorne and Vlachos (2020). In both cases we report label accuracy.

3.5 Implementation Details

For open-domain QA, we report test numbers using 15 retrieved documents for RAG-Token models. For RAG-Sequence models, we report test results using 50 retrieved documents, and we use the Thorough Decoding approach since answers are generally short. We use greedy decoding for QA, as we did not find beam search to improve results. For Open MS-MARCO and Jeopardy question generation, we report test numbers using ten retrieved documents for both RAG-Token and RAG-Sequence, and we also train a BART-large model as a baseline. We use a beam size of four, and use the Fast Decoding approach for RAG-Sequence models, as Thorough Decoding did not improve performance.

4 Results

Model NQ TQA WQ CT
Closed-Book T5-11B Roberts et al. (2020) 34.5 - / 50.1 37.4 -
T5-11B + SSM Roberts et al. (2020) 36.6 - / 60.5 44.7 -
Open-Book REALM Guu et al. (2020) 40.4 - / - 40.7 46.8
DPR Karpukhin et al. (2020) 41.5 57.9 / - 41.1 50.6
RAG-Token 44.1 55.2 / 66.1 45.5 50.0
RAG-Sequence 44.5 56.1 / 68.0 45.2 52.2
Table 1: Open-Domain QA Test Scores. For TQA, the left column uses the test split commonly used in Open-Domain QA. The right column uses the hidden TQA Wiki test split. See Appendix B for further information.
Model Jeopardy QGen MS-MARCO FEVER-3 FEVER-2
B-1 QB-1 R-L B-1 Label Accuracy
SotA - - 49.8* 49.9* 76.8 92.2*
BART 15.1 19.7 38.2 41.6 64.0 81.1
RAG-Token 17.3 22.2 40.1 41.5 72.5 89.5
RAG-Sequence 14.7 21.4 40.8 44.2
Table 2: Generation and classification task test scores. SotA for MS-MARCO is Bi et al. (2020), for FEVER-3 is Zhong et al. (2019), and for FEVER-2 is Thorne and Vlachos (2020). * Uses gold context/evidence; the best-performing model without gold access is underlined. As FEVER is a classification dataset, RAG-Token and RAG-Sequence are equivalent.

4.1 Open-domain Question Answering

Table 1 shows results for RAG along with recent state-of-the-art models. On all four open-domain QA tasks, RAG sets a new state-of-the-art (in the case of TQA only on the T5-comparable split).

RAG combines the generation flexibility of the “closed-book” (parametric-only) approaches and the performance of “open-book” retrieval-based approaches. Unlike REALM and T5+SSM, RAG achieves strong results without expensive, specialized “salient span masking” pre-training Guu et al. (2020), relying instead on off-the-shelf components. It is worth noting that RAG’s retriever is initialized from DPR’s retriever, which does use retrieval supervision on Natural Questions and TriviaQA. RAG compares favourably to the DPR QA system on open-domain QA, which uses a BERT-based cross-encoder to re-rank documents, along with an extractive reader. RAG demonstrates that neither a re-ranker nor an extractive reader is necessary for state-of-the-art machine reading performance. Generating answers even when it is possible to extract them has a number of advantages: documents which contain clues about the correct answer but do not contain it verbatim can still contribute probability mass towards the correct answer being generated, which is not possible with standard extractive approaches, leading to more effective marginalization across documents. Furthermore, RAG can generate correct answers even when the correct answer is not present in any of the retrieved documents, achieving an accuracy of 11.8% in such cases for Natural Questions, whereas an extractive model would score 0%.

4.2 Abstractive Question Answering

As shown in Table 2, RAG-Sequence outperforms BART on Open MS-MARCO generation by 2.6 Bleu points and 2.6 Rouge-L points. It approaches the performance of state-of-the-art models, which is impressive considering that (i) these models have access to passages that contain the specific information required to generate the reference answer, (ii) many questions are unanswerable without access to gold passages, and (iii) other questions are unanswerable from Wikipedia alone. Table 4 shows some generated answers from our models. Qualitatively, we find that RAG models hallucinate less and generate factually correct text more often than BART. Later we also show that RAG generations are more diverse than BART generations (see Section 4.6).

4.3 Jeopardy Question Generation

Table 2 shows automatic metric results on the Jeopardy question generation task. We find that RAG-Token performs better than the RAG-Sequence model in this setting, with both models outperforming BART using the Q-BLEU-1 metric.

Table 3 shows the results of the human evaluation, which was carried out on 452 pairs of generations from BART and RAG-Token. Annotators indicated that BART was more factual than RAG in only 7.1% of cases, while RAG was more factual in 42.7% of cases, and both RAG and BART were factual in a further 11.7% of cases, clearly demonstrating the effectiveness of RAG on this task over a state-of-the-art conditional generation model. Annotators also strongly preferred RAG generations in terms of specificity.

BART better RAG-Token better Both good Both poor No Majority
Factuality 7.1% 42.7% 11.7% 17.7% 20.8%
Specificity 16.8% 37.4% 18.8% 6.9% 20.1%
Table 3: Human assessments for the Jeopardy Question Generation Task.

Typical examples of generations from each model are shown in Table 4. BART generates a more generic response (which is incorrect), whereas the RAG models generate specific and correct facts about Washington state.

We hypothesise that RAG-Token performs best for this task because Jeopardy questions often contain two separate pieces of information about the entity, and RAG-Token can synthesize a response that combines disparate information from different retrieved documents in one generation. Figure 2 shows an example where content from two documents has been combined to produce the generated question. Document 2 contains information about Hemingway’s “The Sun Also Rises,” and the contribution of document 2 is very high when “Sun” is generated. Similarly, “A Farewell to Arms” is mentioned in Document 1, which dominates the posterior when this title is generated. Intriguingly, after the first token of each book title is generated, the distribution over documents flattens again. This observation suggests that the generator can complete the book titles without depending on specific documents; in other words, the model’s parametric knowledge is sufficient to complete the titles.

We show evidence for this interpretation by feeding the BART-only baseline the partial decoding "The Sun. BART completes it as "The Sun Also Rises" is a novel by this author of "The Sun Also Rises", indicating that the title “The Sun Also Rises” is stored in BART’s parameters. Similarly, feeding the partial decoding "The Sun Also Rises" is a novel by this author of "A results in BART completing the generation as "The Sun Also Rises" is a novel by this author of "A Farewell to Arms. This example shows how the parametric and non-parametric memories work together: the non-parametric component helps to guide the generation in a particular direction, drawing out specific knowledge stored in the parametric memory.

Figure 2: RAG-Token document posterior for each generated token for input “Hemingway" for Jeopardy generation with 5 retrieved documents. The posterior for document 1 is high when generating “A Farewell to Arms" and for document 2 when generating “The Sun Also Rises"
Task Input Model Generation
MS-MARCO define middle ear BART ?The middle ear is the part of the ear between the middle ear and the nose.
RAG-T The middle ear is the portion of the ear internal to the eardrum.
RAG-S The middle ear includes the tympanic cavity and the three ossicles.
what currency needed in scotland BART The currency needed in Scotland is Pound sterling.
RAG-T Pound is the currency needed in Scotland.
RAG-S The currency needed in Scotland is the pound sterling.
Jeopardy Question Gener -ation Washington BART ?This state has the largest number of counties in the U.S.
RAG-T It’s the only U.S. state named for a U.S. president
RAG-S It’s the state where you’ll find Mount Rainier National Park
The Divine Comedy BART *This epic poem by Dante is divided into 3 parts: the Inferno, the Purgatorio & the Purgatorio
RAG-T Dante’s "Inferno" is the first part of this epic poem
RAG-S This 14th century work is divided into 3 sections: "Inferno", "Purgatorio" & "Paradiso"
Table 4: Example generations for MS-MARCO and Jeopardy question generation. RAG models generate more specific and factually accurate responses, whereas BART generates more generic responses that are factually incorrect (marked by ‘?’) or only partially correct (marked by ‘*’).

4.4 Fact Verification

Table 2 shows our results on the FEVER 3-way and 2-way classification task. For 3-way classification, RAG achieves accuracies that are within 4.3% of state-of-the-art models, which are complex pipeline systems with domain-specific architectures and substantial engineering, trained using intermediate supervision, which RAG does not require.

For 2-way classification, we compare against the model from Thorne and Vlachos (2020), which trains RoBERTa Liu et al. (2019) to classify the claim as true or false given the gold evidence sentence. RAG achieves an accuracy within 2.7% of this model, despite being supplied with only the claim and retrieving its own evidence.

We also analyze whether the documents retrieved by RAG correspond to the documents annotated as gold evidence in FEVER. We analyze the overlap in Wikipedia articles between the top documents retrieved by RAG and the gold, annotated evidence documents. We find that the top article retrieved by RAG is a gold document for the claim in 71% of cases, and a gold article is present in the top 10 retrieved articles in 90% of cases.

4.5 Ablations

Model NQ TQA WQ CT Jeopardy-QGen MSMarco FVR-3 FVR-2
Exact Match B-1 QB-1 R-L B-1 Label Accuracy
RAG-Token-BM25 29.7 41.5 32.1 33.1 17.5 22.3 55.5 48.4 75.1 91.6
RAG-Seq-BM25 31.8 44.1 36.6 33.8 11.1 19.5 56.5 46.9
RAG-Token-Frozen 37.8 50.1 37.1 51.1 16.7 21.7 55.9 49.4 72.9 89.4
RAG-Seq-Frozen 41.2 52.1 41.8 52.6 11.8 19.6 56.7 47.3
RAG-Token 43.5 54.8 46.5 51.9 17.9 22.6 56.2 49.4 74.5 90.6
RAG-Seq 44.0 55.8 44.9 53.4 15.3 21.5 57.2 47.5
Table 5: Ablations on the development set. As FEVER is a classification dataset, RAG-Token and RAG-Sequence are equivalent.

To gain a better understanding of what factors affect RAG’s performance, we perform a number of ablation experiments for our tasks on their respective development sets.

Using more documents

Models are trained with either 5 or 10 retrieved latent documents, and we do not observe significant differences in performance between them. We also have the flexibility to adjust the number of retrieved documents at test time, which does affect performance. Figure 3 (left) shows that retrieving more documents at test time monotonically improves Open-domain QA results for RAG-Sequence, but performance peaks for RAG-Token at 10 retrieved documents. Figure 3 (right) shows that retrieving more documents leads to higher Rouge-L for RAG-Token at the expense of Bleu-1, but the effect is less pronounced for RAG-Sequence.

Figure 3: Left: NQ performance as more documents are retrieved. Center: Fraction of answers in NQ where the answer occurs somewhere in the top K documents. Right: MS-MARCO Bleu-1 and Rouge-L as more documents are retrieved.


Retrieval Ablations

A key feature of RAG is the ability to learn to retrieve relevant information for the task at hand. To assess the effectiveness of the retrieval mechanism, we run ablations on RAG where we prevent gradients from propagating into the retriever. Table 5 shows the results across all tasks. In each case, learned retrieval improves results, with the largest improvements for question answering. Figure 3 (center) shows that the learned retriever achieves higher recall for gold documents than the fixed retriever. The improvements on TriviaQA and Natural Questions are notable, as we initialize the retriever from DPR, which is trained with strong, document-level supervision to perform well on these tasks. We also compare RAG’s dense embedding-based retrieval mechanism to a word overlap-based BM25 retriever Robertson and Zaragoza (2009). Here, we replace RAG’s differentiable retriever with a fixed BM25 system and use the BM25 retrieval scores as logits when calculating p(z|x). Table 5 and Figure 3 show the results. For FEVER, we find that BM25 performs best, perhaps because FEVER claims are heavily entity-centric and thus well-suited to word overlap-based retrieval. On all other tasks, we find differentiable retrieval to be helpful, especially for question answering, where it is crucial.
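A minimal sketch of this ablation setup, with a hand-rolled Okapi BM25 scorer (following Robertson and Zaragoza, 2009) whose scores are then treated as logits for the document posterior p(z|x):

```python
import math
from collections import Counter

def bm25_scores(query_tokens, docs_tokens, k1=1.2, b=0.75):
    """Minimal Okapi BM25 over pre-tokenized documents."""
    N = len(docs_tokens)
    avgdl = sum(len(d) for d in docs_tokens) / N
    df = Counter()  # document frequency of each term
    for d in docs_tokens:
        df.update(set(d))
    scores = []
    for d in docs_tokens:
        tf = Counter(d)
        s = 0.0
        for t in query_tokens:
            if t not in tf:
                continue
            idf = math.log((N - df[t] + 0.5) / (df[t] + 0.5) + 1)
            s += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(d) / avgdl))
        scores.append(s)
    return scores

def doc_posterior(scores):
    """Treat retrieval scores as logits for p(z|x), as in the BM25 ablation."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

This is a sketch of the substitution, not the paper's implementation; the fixed-BM25 variant simply never backpropagates into these scores.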

4.6 Generation Diversity

Dataset Gold BART RAG-Token RAG-Sequence
MSMARCO 89.6% 70.7% 77.8% 83.5%
Jeopardy Generation 90.0% 32.4% 46.8% 53.8%
Table 6: Ratio of distinct tri-grams to total tri-grams in the development set generations for MSMARCO and Jeopardy Question Generation.

Section 4.3 established that RAG models generate more factual and specific responses than BART for Jeopardy question generation. Similar to Li et al. (2016), Vijayakumar et al. (2018) and Massarelli et al. (2019), we also investigate the diversity of generations by calculating the ratio of distinct n-grams to total n-grams generated by different models. Table 6 shows that RAG-Sequence generations are more diverse than RAG-Token generations, and both are significantly more diverse than BART's, without requiring any diversity-promoting decoding strategy.
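The diversity metric is straightforward to compute; a minimal implementation of the distinct-n-gram ratio, applied here to a single tokenized generation (in the experiments it would be aggregated over all development-set generations), is:

```python
def distinct_ngram_ratio(tokens, n=3):
    """Ratio of distinct n-grams to total n-grams, as reported in Table 6."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

A degenerate, repetitive generation scores low, while a generation with no repeated trigrams scores 1.0.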

4.7 Hot-swapping indices

An advantage of non-parametric knowledge models such as RAG is that the knowledge base can be easily updated at test time. Parametric-only models such as T5 or BART require further training to update their behavior as facts about the world change. As a demonstration, we build an index using the DrQA Wikipedia dump Chen et al. (2017) (dated December 21st, 2016) and compare generations from RAG using this index to those from the newer index used in our main results (dated December 20th, 2018). We prepared a list of 82 heads of state who had changed between these dates and used a template “Who is {position}?” (e.g., “Who is the prime minister of the UK?”) to query our Natural Questions-fine-tuned RAG model with each index. RAG achieved an accuracy of 70% using the 2016 index for 2016 world leaders and an accuracy of 68% using the 2018 index for 2018 world leaders. Only 21% of the model’s predictions were the same using the two indices, and accuracy with mismatched indices is very low (12% using the 2018 index for 2016 leaders and 4% using the 2016 index for 2018 leaders). This shows that we can effectively update RAG’s behavior with new world knowledge simply by replacing its non-parametric memory.
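The mechanics of hot-swapping can be illustrated with a toy retriever whose index is a plain Python object. The word-overlap scoring and the example documents below are stand-ins for the dense MIPS index and the Wikipedia dumps used in the paper; only the swap itself is the point.

```python
class SwappableRetriever:
    """Toy retriever whose non-parametric memory can be replaced at test time."""

    def __init__(self, index):
        self.index = index  # list of (title, text) pairs

    def swap_index(self, new_index):
        """Replace the knowledge base; no model parameters change."""
        self.index = new_index

    def retrieve(self, query, k=1):
        # Word overlap stands in for dense inner-product scoring.
        q = set(query.lower().split())
        scored = sorted(self.index,
                        key=lambda d: len(q & set(d[1].lower().split())),
                        reverse=True)
        return scored[:k]

# Hypothetical snapshots of the two dumps (illustrative content only).
wiki_2016 = [("US president", "the president of the united states is barack obama")]
wiki_2018 = [("US president", "the president of the united states is donald trump")]

retriever = SwappableRetriever(wiki_2016)
# ... answer queries against the 2016 dump ...
retriever.swap_index(wiki_2018)  # updated world knowledge, no retraining
```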

5 Related Work

Single-Task Retrieval

Prior work has shown that retrieval improves performance across a variety of NLP tasks when considered in isolation. Such tasks include open-domain question answering Chen et al. (2017); Kwiatkowski et al. (2019), fact checking Thorne et al. (2018), fact completion Petroni et al. (2020), long-form question answering Fan et al. (2019), Wikipedia article generation Liu* et al. (2018), dialogue Moghe et al. (2018); Weston et al. (2018); Dinan et al. (2019); Fan et al. (2020), translation Gu et al. (2018), and language modeling Guu et al. (2018); Khandelwal et al. (2020). Our work unifies previous successes in incorporating retrieval into individual tasks, showing that a single retrieval-based architecture is capable of achieving strong performance across several tasks.

General-Purpose Architectures for NLP

Prior work on general-purpose architectures for NLP tasks has shown great success without the use of retrieval. A single, pre-trained language model has been shown to achieve strong performance on various classification tasks in the GLUE benchmarks wang-etal-2018-glue; Wang et al. (2019) after fine-tuning Radford (2018); Devlin et al. (2019). GPT-2 (Radford et al., 2019) later showed that a single, left-to-right, pre-trained language model could achieve strong performance across both discriminative and generative tasks. For further improvement, BART Lewis et al. (2019) and T5 Raffel et al. (2019); Roberts et al. (2020) propose a single, pre-trained encoder-decoder model that leverages bi-directional attention to achieve stronger performance on discriminative and generative tasks. Our work aims to expand the space of possible tasks with a single, unified architecture, by learning a retrieval module to augment pre-trained, generative language models.

Learned Retrieval

There is significant work on learning to retrieve documents in information retrieval, more recently with pre-trained, neural language models Nogueira and Cho (2019); Karpukhin et al. (2020) similar to ours. Some work optimizes the retrieval module to aid a specific downstream task such as question answering, using search Perez et al. (2019), reinforcement learning Choi et al. (2017); Wang et al. (2018b, a), or a latent variable approach Lee et al. (2019); Guu et al. (2020), as in our work. These successes leverage different retrieval-based architectures and optimization techniques to achieve strong performance on a single task, while we show that a single retrieval-based architecture can be fine-tuned for strong performance on a variety of tasks.

Memory-based Architectures

Our document index can be seen as a large external memory for neural networks to attend to, analogous to memory networks Weston et al. (2015); Sukhbaatar et al. (2015b). Concurrent work Févry et al. (2020) learns to retrieve a trained embedding for each entity in the input, rather than retrieving raw text as in our work. Other work improves the ability of dialogue models to generate factual text by attending over fact embeddings Dinan et al. (2019); Fan et al. (2020) or, closer to our work, over retrieved text directly Ghazvininejad et al. (2018). A key feature of our memory is that it is comprised of raw text rather than distributed representations, which makes the memory both (i) human-readable, lending a form of interpretability to our model, and (ii) human-writable, enabling us to dynamically update the model’s memory by editing the document index.

6 Discussion

In this work, we presented hybrid generation models with access to both parametric and non-parametric (retrieval-based) external memory, in the form of Wikipedia. We showed that our RAG models obtain state-of-the-art performance on open-domain question answering. We found that people prefer RAG’s generations over those of a purely parametric BART model and judge RAG to be more factual, and we conducted a detailed investigation of the learned retrieval component, validating its effectiveness. We also showed that the model’s grounding in external data leads it to generate more diverse outputs, and illustrated how the retrieval index can be hot-swapped on the fly without retraining the model. In future work, it would be interesting to investigate whether the two components can be jointly pre-trained from scratch, either with a denoising objective similar to BART or with some other objective. Our work opens up new research directions on how parametric and non-parametric memories interact and how to most effectively combine them, showing promise for application to a wide variety of NLP tasks.


EP gratefully acknowledges support from the NSF Graduate Research Fellowship.


  • P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang (2016) MS MARCO: A Human Generated MAchine Reading COmprehension Dataset. arXiv:1611.09268 [cs]. Note: arXiv: 1611.09268 External Links: Link Cited by: §1.
  • P. Baudiš and J. Šedivỳ (2015) Modeling of the question answering task in the yodaqa system. In International Conference of the Cross-Language Evaluation Forum for European Languages, pp. 222–228. External Links: Link Cited by: §1, §3.1.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic Parsing on Freebase from Question-Answer Pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1533–1544. External Links: Link Cited by: §1, §3.1.
  • B. Bi, C. Li, C. Wu, M. Yan, and W. Wang (2020) PALM: pre-training an autoencoding&autoregressive language model for context-conditioned generation. ArXiv abs/2004.07159. External Links: Link Cited by: Table 2.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1870–1879. External Links: Link, Document Cited by: §3.1, §4.7, §5.
  • C. Clark and M. Gardner (2017) Simple and Effective Multi-Paragraph Reading Comprehension. arXiv:1710.10723 [cs]. Note: arXiv: 1710.10723 External Links: Link Cited by: §3.1.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186. External Links: Link, Document Cited by: §1, §2.2, §5.
  • E. Dinan, S. Roller, K. Shuster, A. Fan, M. Auli, and J. Weston (2019) Wizard of wikipedia: knowledge-powered conversational agents. In International Conference on Learning Representations, External Links: Link Cited by: §5, §5.
  • M. Dunn, L. Sagun, M. Higgins, V. U. Guney, V. Cirik, and K. Cho (2017) SearchQA: A New Q&A Dataset Augmented with Context from a Search Engine. arXiv:1704.05179 [cs] (en). Note: arXiv: 1704.05179 External Links: Link Cited by: §3.3.
  • A. Fan, C. Gardent, C. Braud, and A. Bordes (2020) Augmenting transformers with KNN-based composite memory. External Links: Link Cited by: §5, §5.
  • T. Févry, L. B. Soares, N. FitzGerald, E. Choi, and T. Kwiatkowski (2020) Entities as experts: sparse memory access with entity supervision. ArXiv abs/2004.07202. External Links: Link Cited by: Appendix B, §5.
  • M. Ghazvininejad, C. Brockett, M. Chang, B. Dolan, J. Gao, W. Yih, and M. Galley (2018) A knowledge-grounded neural conversation model. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §5.
  • J. Gu, Y. Wang, K. Cho, and V. O.K. Li (2018) Search engine guided neural machine translation. In AAAI Conference on Artificial Intelligence, External Links: Link Cited by: §5.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. ArXiv abs/2002.08909. External Links: Link Cited by: Appendix D, §1, §2.4, §3.1, §3.1, §4.1, Table 1, §5.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with gpus. arXiv preprint arXiv:1702.08734. External Links: Link Cited by: §2.2, §3.
  • M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer (2017) TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1601–1611. External Links: Link, Document Cited by: §1, §2.2, §3.1.
  • A. Joulin and T. Mikolov (2015) Inferring algorithmic patterns with stack-augmented recurrent nets. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 1, NIPS’15, Cambridge, MA, USA, pp. 190–198. External Links: Link Cited by: §1.
  • V. Karpukhin, B. Oguz, S. Min, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv preprint arXiv:2004.04906. External Links: Link Cited by: Appendix B, §1, §1, §2.2, §2.2, §3.1, §3.1, §3.1, §3, Table 1, §5, footnote 1.
  • U. Khandelwal, O. Levy, D. Jurafsky, L. Zettlemoyer, and M. Lewis (2020) Generalization through memorization: nearest neighbor language models. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • D. P. Kingma and J. Ba (2015) Adam: A method for stochastic optimization. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §2.4.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, M. Kelcey, J. Devlin, K. Lee, K. N. Toutanova, L. Jones, M. Chang, A. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association of Computational Linguistics. External Links: Link Cited by: §1, §2.2, §3.1, §5.
  • G. Lample, A. Sablayrolles, M. A. Ranzato, L. Denoyer, and H. Jegou (2019) Large memory layers with product keys. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’ Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8548–8559. External Links: Link Cited by: §1.
  • M. Lewis, Y. Liu, N. Goyal, M. Ghazvininejad, A. Mohamed, O. Levy, V. Stoyanov, and L. Zettlemoyer (2019) BART: denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461. External Links: Link Cited by: Appendix C, §1, §2.3, §2.3, §5.
  • M. Li, J. Weston, and S. Roller (2019) ACUTE-eval: improved dialogue evaluation with optimized questions and multi-turn comparisons. ArXiv abs/1909.03087. External Links: Link Cited by: §3.3.
  • P. J. Liu*, M. Saleh*, E. Pot, B. Goodrich, R. Sepassi, L. Kaiser, and N. Shazeer (2018) Generating wikipedia by summarizing long sequences. In International Conference on Learning Representations, External Links: Link Cited by: §5.
  • Y. A. Malkov and D. A. Yashunin (2016) Efficient and robust approximate nearest neighbor search using hierarchical navigable small world graphs. IEEE Transactions on Pattern Analysis and Machine Intelligence 42, pp. 824–836. External Links: Link Cited by: §3.
  • G. Marcus (2020) The next decade in ai: four steps towards robust artificial intelligence. arXiv preprint arXiv:2002.06177. External Links: Link Cited by: §1.
  • L. Massarelli, F. Petroni, A. Piktus, M. Ott, T. Rocktäschel, V. Plachouras, F. Silvestri, and S. Riedel (2019) How decoding strategies affect the verifiability of generated text. arXiv preprint arXiv:1911.03587. External Links: Link Cited by: §4.6.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain, December 9, 2016, T. R. Besold, A. Bordes, A. S. d’Avila Garcez, and G. Wayne (Eds.), CEUR Workshop Proceedings, Vol. 1773. External Links: Link Cited by: §3.2.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085. External Links: Link Cited by: §5.
  • F. Petroni, P. Lewis, A. Piktus, T. Rocktäschel, Y. Wu, A. H. Miller, and S. Riedel (2020) How context affects language models’ factual predictions. In Automated Knowledge Base Construction, External Links: Link Cited by: §1, §5.
  • A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, and I. Sutskever (2019) Language models are unsupervised multitask learners. External Links: Link Cited by: §5.
  • A. Radford (2018) Improving Language Understanding by Generative Pre-Training. External Links: Link Cited by: §5.
  • C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu (2019) Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv e-prints. External Links: 1910.10683, Link Cited by: §1, §1, §5.
  • A. Roberts, C. Raffel, and N. Shazeer (2020) How much knowledge can you pack into the parameters of a language model?. arXiv e-prints. External Links: 2002.08910, Link Cited by: Appendix B, Appendix E, §1, §3.1, §3.1, Table 1, §5.
  • S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: bm25 and beyond. Found. Trends Inf. Retr. 3 (4), pp. 333–389. External Links: ISSN 1554-0669, Link, Document Cited by: §4.5.
  • S. Sukhbaatar, a. szlam, J. Weston, and R. Fergus (2015a) End-To-End Memory Networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2440–2448. External Links: Link Cited by: §1.
  • S. Sukhbaatar, a. szlam, J. Weston, and R. Fergus (2015b) End-to-end memory networks. In Advances in Neural Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama, and R. Garnett (Eds.), pp. 2440–2448. External Links: Link Cited by: §5.
  • J. H. Thorne and A. Vlachos (2020) Avoiding catastrophic forgetting in mitigating model biases in sentence-pair classification with elastic weight consolidation. ArXiv abs/2004.14366. External Links: Link Cited by: §3.4, §4.4, Table 2.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), pp. 5998–6008. External Links: Link Cited by: §2.3.
  • A. Vijayakumar, M. Cogswell, R. Selvaraju, Q. Sun, S. Lee, D. Crandall, and D. Batra (2018) Diverse beam search for improved description of complex scenes. External Links: Link Cited by: §4.6.
  • A. Wang, Y. Pruksachatkun, N. Nangia, A. Singh, J. Michael, F. Hill, O. Levy, and S. Bowman (2019) SuperGLUE: A Stickier Benchmark for General-Purpose Language Understanding Systems. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d. Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 3261–3275. External Links: Link Cited by: §5.
  • S. Wang, M. Yu, X. Guo, Z. Wang, T. Klinger, W. Zhang, S. Chang, G. Tesauro, B. Zhou, and J. Jiang (2018a) R: reinforced ranker-reader for open-domain question answering. In Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence, (AAAI-18), the 30th innovative Applications of Artificial Intelligence (IAAI-18), and the 8th AAAI Symposium on Educational Advances in Artificial Intelligence (EAAI-18), New Orleans, Louisiana, USA, February 2-7, 2018, S. A. McIlraith and K. Q. Weinberger (Eds.), pp. 5981–5988. External Links: Link Cited by: §5.
  • S. Wang, M. Yu, J. Jiang, W. Zhang, X. Guo, S. Chang, Z. Wang, T. Klinger, G. Tesauro, and M. Campbell (2018b) Evidence aggregation for answer re-ranking in open-domain question answering. In ICLR, External Links: Link Cited by: §5.
  • J. Weston, S. Chopra, and A. Bordes (2015) Memory networks. In 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, Y. Bengio and Y. LeCun (Eds.), External Links: Link Cited by: §1, §5.
  • W. Zhong, J. Xu, D. Tang, Z. Xu, N. Duan, M. Zhou, J. Wang, and J. Yin (2019) Reasoning over semantic-level graph for fact checking. ArXiv abs/1909.03745. External Links: Link Cited by: Table 2.

Appendix A Human evaluation

Figure 4: Annotation interface for human evaluation of factuality. A pop-out for detailed instructions and a worked example appear when clicking "view tool guide".

Figure 4 shows the user interface for human evaluation. To avoid any biases for screen position, which model corresponded to sentence A and sentence B was randomly selected for each example. Annotators were encouraged to research the topic using the internet, and were given detailed instructions and worked examples in a full instructions tab. We included some gold sentences in order to assess the accuracy of the annotators. Two annotators did not perform well on these examples and their annotations were removed from the results.

Appendix B Further details on Open-Domain QA

For open-domain QA, multiple answer annotations are often available for a given question. These annotations are exploited by extractive models during training, as typically all of them are used to find matches within documents when preparing training data. For RAG, we also make use of multiple annotations for Natural Questions and WebQuestions by training the model on each (question, answer) pair separately, which leads to a small increase in accuracy. For TriviaQA, there are often many valid answers to a given question, some of which, such as emoji or spelling variants, are not suitable training targets. Here, we filter out answer candidates that do not occur in the top 1000 documents for the query.
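The filtering step can be sketched as follows; the simple lowercase substring matching below is an illustrative stand-in, not necessarily how the actual preprocessing matched answers against passages.

```python
def filter_answer_aliases(aliases, retrieved_docs):
    """Keep only answer aliases that appear in the retrieved documents.

    Mirrors the TriviaQA preprocessing described above: aliases (e.g. emoji
    or unusual spelling variants) that never occur in the top documents for
    the query are dropped as training targets. `retrieved_docs` stands in
    for the top-1000 passages.
    """
    corpus = " ".join(d.lower() for d in retrieved_docs)
    return [a for a in aliases if a.lower() in corpus]
```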

TriviaQA Evaluation setups

The open-domain QA community customarily uses public development datasets as test datasets, since test data for QA datasets is often restricted and dedicated to reading comprehension purposes. We report our results using the dataset splits used in DPR Karpukhin et al. [2020], which are consistent with common practice in open-domain QA. For TriviaQA, this test dataset is the public TriviaQA Web Development split. Roberts et al. [2020] used the TriviaQA official Wikipedia test set instead. Févry et al. [2020] follow this convention in order to compare with Roberts et al. [2020] (see the appendix of Févry et al. [2020]). We report results on both test sets to enable fair comparison to both approaches. We find that our performance is much higher on the official Wiki test set than on the more conventional open-domain test set, which we attribute to the official Wiki test set questions being easier to answer from Wikipedia.

Appendix C Further details on FEVER

For FEVER classification, we follow the practice of Lewis et al. [2019]: we first re-generate the claim, then classify using the representation of the final hidden state, and finally marginalize across documents to obtain the class probabilities. The FEVER task traditionally has two sub-tasks. The first is to classify a claim as "Supported", "Refuted" or "Not Enough Info"; this is the task we explore in the main paper. FEVER's other sub-task involves extracting sentences from Wikipedia as evidence supporting the classification prediction. As FEVER uses a different Wikipedia dump from ours, directly tackling this sub-task is not straightforward, and we hope to address it in future work.

Appendix D "Null document" Probabilities

We experimented with adding a "Null document" mechanism to RAG, similar to REALM Guu et al. [2020], in order to model cases where no useful information can be retrieved for a given input. Here, in addition to the retrieved documents, we would "retrieve" an empty document and predict a logit for it before marginalizing over predictions. We explored modelling this null-document logit with (i) a learned document embedding for the null document, (ii) a static learned bias term, or (iii) a neural network that predicts the logit. We did not find that these improved performance, so in the interest of simplicity we omit them. For Open MS-MARCO, where useful documents cannot always be retrieved, we observe that the model learns to always retrieve a particular set of documents for questions that are less likely to benefit from retrieval, suggesting that null-document mechanisms may not be necessary for RAG.
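The marginalization with a null-document logit can be sketched as follows; for concreteness this corresponds to option (ii), a bias term, shown here as a plain argument rather than a learned parameter.

```python
import math

def marginalize_with_null(doc_logits, doc_token_probs, null_logit, null_token_prob):
    """Marginalize a prediction over retrieved documents plus a null document.

    doc_logits:      retrieval logits for the retrieved documents
    doc_token_probs: p(y|x,z) for each retrieved document
    null_logit:      logit for the empty document (a learned bias in option ii)
    null_token_prob: p(y|x, null) from generating with no retrieved text
    """
    logits = doc_logits + [null_logit]
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    priors = [e / z for e in exps]  # p(z|x) including the null document
    probs = doc_token_probs + [null_token_prob]
    return sum(p * q for p, q in zip(priors, probs))
```

When the null logit is very negative, the mechanism reduces to ordinary marginalization over retrieved documents; when it dominates, the model falls back on its parametric memory alone.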

Appendix E Parameters

Our RAG models contain the trainable parameters of the BERT-base query and document encoders from DPR, with 110M parameters each (although we do not train the document encoder ourselves), and 406M trainable parameters from BART-large, making a total of 626M trainable parameters. The best performing "closed-book" (parametric-only) open-domain QA model is T5-11B, with 11 billion trainable parameters. The T5 model with the closest number of parameters to ours is T5-large (770M parameters), which achieves a score of 28.9 EM on Natural Questions Roberts et al. [2020], substantially below the 44.5 EM that RAG-Sequence achieves, indicating that hybrid parametric/non-parametric models require far fewer trainable parameters for strong open-domain QA performance. The non-parametric memory index does not consist of trainable parameters, but does consist of 21M 728-dimensional vectors, a total of 15.3B values.
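The counts above can be checked with a few lines of arithmetic:

```python
# Parameter counts as reported above.
bert_base = 110_000_000    # DPR query encoder and document encoder, each
bart_large = 406_000_000

# The paper's 626M total counts both encoders, though the document
# encoder is frozen during fine-tuning.
total_params = 2 * bert_base + bart_large      # 626M

# Non-parametric memory: 21M vectors of 728 dimensions each,
# stored values rather than trainable parameters.
index_values = 21_000_000 * 728                # ~15.3B values
```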

Appendix F Retrieval Collapse

In preliminary experiments, we observed that for some tasks such as story generation Fan et al. (2018), the retrieval component would “collapse” and learn to retrieve the same documents regardless of the input. In these cases, once retrieval had collapsed, the generator would learn to ignore the documents, and the RAG model would perform equivalently to BART. The collapse could be due to a less explicit requirement for factual knowledge in some tasks, or to the longer target sequences, which could result in less informative gradients for the retriever. Perez et al. (2019) also found spurious retrieval results when optimizing a retrieval component to improve performance on downstream tasks.
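One hypothetical diagnostic for such collapse (our illustration, not from the paper) is to check how often the retriever returns the same top-k document set across distinct inputs:

```python
from collections import Counter

def retrieval_collapse_score(topk_ids_per_query):
    """Fraction of queries whose top-k document set equals the modal set.

    A value near 1.0 across many distinct inputs is a symptom of the
    collapse described above: the retriever returns the same documents
    regardless of the input, so the generator can learn to ignore them.
    """
    keys = [frozenset(ids) for ids in topk_ids_per_query]
    modal_count = Counter(keys).most_common(1)[0][1]
    return modal_count / len(keys)
```

Monitoring a statistic like this during fine-tuning would flag collapse early, before the generator has fully learned to ignore the retrieved text.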