Towards Universal Dense Retrieval for Open-domain Question Answering

09/23/2021 ∙ by Christopher Sciavolino, et al. ∙ 0

In open-domain question answering, a model receives a text question as input and searches for the correct answer using a large evidence corpus. The retrieval step is especially difficult as typical evidence corpora have millions of documents, each of which may or may not have the correct answer to the question. Very recently, dense models have replaced sparse methods as the de facto retrieval method. Rather than focusing on lexical overlap to determine similarity, dense methods build an encoding function that captures semantic similarity by learning from a small collection of question-answer or question-context pairs. In this paper, we investigate dense retrieval models in the context of open-domain question answering across different input distributions. To do this, first we introduce an entity-rich question answering dataset constructed from Wikidata facts and demonstrate dense models are unable to generalize to unseen input question distributions. Second, we perform analyses aimed at better understanding the source of the problem and propose new training techniques to improve out-of-domain performance on a wide variety of datasets. We encourage the field to further investigate the creation of a single, universal dense retrieval model that generalizes well across all input distributions.




2.1 Open-domain Question Answering

Open-domain question answering (QA) is a challenging task that takes in a text question $q$ and a large, unstructured text corpus $\mathcal{C}$ and predicts the correct text answer $a$. Formally, we denote the question and answer as sequences of tokens $q = (q_1, \dots, q_{|q|})$ and $a = (a_1, \dots, a_{|a|})$, where $|q|$ and $|a|$ denote the length of the question and answer sequences respectively. Each document in the text corpus can be segmented into passages, which we denote as the set $\mathcal{P} = \{p_1, \dots, p_M\}$, where each passage $p_i$ is a sequence of tokens $(p_i^{(1)}, \dots, p_i^{(l)})$. We follow previous work (Karpukhin et al., 2020) and assume each passage has the same length $l$.

While there are many ways to model this problem, by far the most dominant and widely-used models rely on the retriever-reader architecture, first popularized by DrQA (Chen et al., 2017). Formally, we consider a probability distribution $P(a \mid q, \mathcal{C})$ and decompose it into $\sum_{p \in \mathcal{P}} P(a \mid q, p)\, P(p \mid q, \mathcal{C})$. Using this decomposition, the predicted answer to the question would just be:

$$\hat{a} = \arg\max_{a} \sum_{p \in \mathcal{P}} P(a \mid q, p)\, P(p \mid q, \mathcal{C})$$

where $p$ is a latent variable denoting the evidence passage retrieved as a basis for the answer. In this formulation, systems split the problem into two parts: retrieving relevant evidence, $P(p \mid q, \mathcal{C})$, and reading the passages, $P(a \mid q, p)$.

The retrieval step considers all passages in the corpus and returns a small number of relevant candidates for further processing. The reading step considers the small set of candidates and performs more expensive, but also more expressive, neural reading comprehension to identify the correct answer. Note that the retrieval step acts as a strict upper bound on the overall system performance, as questions with irrelevant retrieved candidates are impossible to answer correctly.

Passage Retrieval

This paper primarily focuses on the retrieval stage. Formally, we define a retriever $R(q, \mathcal{P})$ that takes as input the question $q$ and passage corpus $\mathcal{P}$ and returns a small set of candidates $\mathcal{P}_k \subset \mathcal{P}$, where $|\mathcal{P}_k| = k \ll |\mathcal{P}|$. We deem a passage relevant to the question if the passage contains the answer token sequence.

Retrievers encode the question and each passage into a vector space using an encoding function $E: x \mapsto \mathbb{R}^d$, where $x$ denotes an arbitrary input token sequence and $d$ denotes the encoded dimension. For a particular question, we approximate the probability distribution $P(p \mid q, \mathcal{C}) \propto \exp(\text{sim}(q, p))$, where $\text{sim}(q, p)$ denotes the inner product between the encoded question and the encoded passage. Specifically:

$$\text{sim}(q, p) = E^{Q}(q)^\top E^{P}(p)$$

where $E^{Q}$ and $E^{P}$ are potentially distinct encoding functions for the query and passage respectively. The retriever collects the candidate set $\mathcal{P}_k$ by selecting the top-$k$ highest-scoring passages in the knowledge source. For efficient computation, the passages are typically preprocessed and indexed offline, while the question encoding and search take place online.
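The scoring and top-k selection described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the vectors are toy two-dimensional values, the function names `inner_product` and `retrieve_top_k` are hypothetical, and an exhaustive scan stands in for an approximate index.

```python
from typing import List, Sequence, Tuple


def inner_product(u: Sequence[float], v: Sequence[float]) -> float:
    """Inner product between two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def retrieve_top_k(
    query_vec: Sequence[float],
    passage_vecs: Sequence[Sequence[float]],
    k: int,
) -> List[Tuple[int, float]]:
    """Return (passage index, score) pairs for the k highest-scoring passages."""
    scored = [(i, inner_product(query_vec, p)) for i, p in enumerate(passage_vecs)]
    # Sort by descending score; break ties by the lower passage index.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


# Offline: passage vectors are computed once and stored (toy values, assumed).
passage_index = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0], [-1.0, 0.0]]

# Online: encode the question, then search the stored vectors.
query_vec = [0.9, 0.1]
top2 = retrieve_top_k(query_vec, passage_index, k=2)
print([i for i, _ in top2])
```

The exhaustive scan is linear in the number of passages; a real system over millions of passages would replace it with a dedicated maximum inner product search index.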

Evaluation Metric

In general, the goal of a good retriever is to maximize the number of input questions where at least one returned passage is relevant. In our results, we consider recall-at-$k$, denoted R@$k$, which evaluates the percentage of examples that retrieve at least one passage with the correct answer within the first $k$ candidate results. We optimize this metric in our retrieval step in order to maximize the number of examples the reader model can answer correctly.
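As a rough sketch of how this metric can be computed, the function below counts a question as a hit when any of its first k retrieved passages contains a gold answer string. Case-insensitive substring matching is an assumption made here for brevity (the text defines relevance on answer token sequences), and the toy passages and the name `recall_at_k` are illustrative only.

```python
from typing import List


def recall_at_k(
    retrieved: List[List[str]],
    answers: List[List[str]],
    k: int,
) -> float:
    """Percentage of questions with an answer-bearing passage in the top k.

    retrieved[i] is the ranked passage list for question i;
    answers[i] is the list of acceptable answer strings for question i.
    """
    hits = 0
    for passages, golds in zip(retrieved, answers):
        top_k = passages[:k]
        # Assumption: a passage is relevant if it contains any gold answer
        # as a case-insensitive substring.
        if any(g.lower() in p.lower() for p in top_k for g in golds):
            hits += 1
    return 100.0 * hits / len(retrieved)


retrieved = [
    ["Paris is the capital of France.", "Lyon is a city in France."],
    ["Berlin is in Germany.", "Madrid is the capital of Spain."],
]
answers = [["Paris"], ["Madrid"]]

r_at_1 = recall_at_k(retrieved, answers, k=1)
r_at_2 = recall_at_k(retrieved, answers, k=2)
print(r_at_1, r_at_2)
```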

2.2 Sparse Retrieval

Many retrieval encoding functions are based on sparse bag-of-words representations. Formally, we define a sparse encoding function as $E_{\text{sparse}}: x \mapsto \mathbb{R}^{|V|}$, where $x$ denotes an arbitrary input token sequence and $|V|$ is the size of the unigram vocabulary $V$. In our formulation, we consider a unigram vocabulary; however, equivalent formulations exist for larger vocabularies such as bigrams or arbitrary $n$-grams.

An example of a sparse encoding scheme is TF-IDF, which can be decomposed into an element-wise product of two vectors: a term frequency vector tf and an inverse document frequency vector idf (hence the name). The term frequency vector considers each term $t \in V$ and sets the corresponding value for $t$ in the vector to be proportional to the number of times the term occurs in the input sequence. Similarly, the idf vector considers each term $t \in V$ and sets the corresponding value to be inversely proportional to the number of unique passages the term occurs in. The entire encoding scheme can be written as follows:

$$E_{\text{tf-idf}}(x) = \text{tf}(x) \odot \text{idf}$$

where $x$ denotes an arbitrary input token sequence and $\odot$ denotes the element-wise product. Using this formulation, the TF-IDF score for a particular query $q$ and passage $p$ would be:

$$\text{sim}(q, p) = E_{\text{tf-idf}}(q)^\top E_{\text{tf-idf}}(p)$$
In our experiments, we use BM25, which can be interpreted as the above TF-IDF model with an additional weighting term.
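To make the TF-IDF formulation concrete, here is a small sketch (of plain TF-IDF, not BM25). Several choices are assumptions rather than details from the text: whitespace tokenization, raw counts for tf, and log(N / df) as the idf weighting. Sparse vectors are stored as dictionaries keyed by term, which is equivalent to a vector over the vocabulary with zeros omitted.

```python
import math
from collections import Counter
from typing import Dict, List


def tokenize(text: str) -> List[str]:
    """Lowercase whitespace tokenization (a simplifying assumption)."""
    return text.lower().split()


def build_idf(passages: List[str]) -> Dict[str, float]:
    """idf[t] = log(N / number of passages containing t)."""
    n = len(passages)
    doc_freq: Counter = Counter()
    for passage in passages:
        doc_freq.update(set(tokenize(passage)))
    return {term: math.log(n / df) for term, df in doc_freq.items()}


def encode(text: str, idf: Dict[str, float]) -> Dict[str, float]:
    """Sparse tf-idf vector: element-wise product of term counts and idf."""
    tf = Counter(tokenize(text))
    return {term: count * idf[term] for term, count in tf.items() if term in idf}


def score(q_vec: Dict[str, float], p_vec: Dict[str, float]) -> float:
    """Inner product between two sparse vectors."""
    return sum(w * p_vec.get(term, 0.0) for term, w in q_vec.items())


passages = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "a bird sang",
]
idf = build_idf(passages)
query_vec = encode("cat mat", idf)
scores = [score(query_vec, encode(p, idf)) for p in passages]
print(scores)
```

The first passage shares both query terms and scores highest; the third shares none and scores zero, which illustrates the purely lexical nature of sparse matching.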

2.3 Dense Retrieval

Most modern retrievers rely on dense representations to encode queries and documents into a low-dimensional vector space describing semantic meaning. Dense models are built on top of innovations from other areas of the natural language processing community, such as large pre-trained language models.


Dense retrievers today use Bidirectional Encoder Representations from Transformers, commonly referred to as BERT (Devlin et al., 2019), as the backbone to obtain dense representations. BERT consists of a stack of encoders based on the Transformer architecture (Vaswani et al., 2017) which use multi-head self-attention to learn powerful representations over the input sequence.

BERT performs large-scale pre-training using the Masked Language Modeling (MLM) objective. The task replaces 15% of tokens in the input sequence with a special [MASK] token, and the goal is for the model to predict the original token given the surrounding context. BERT also pre-trains on the Next Sentence Prediction (NSP) objective, where the model is given two sentences and needs to determine if one sentence follows the other.

BERT tokenizes input sequences by adding a special [CLS] token to the start of the sequence and a special [SEP] token to the end. BERT encodes the input sequence $x = (\text{[CLS]}, x_1, \dots, x_n, \text{[SEP]})$ into contextualized hidden representations $(h_{\text{[CLS]}}, h_1, \dots, h_n, h_{\text{[SEP]}})$. The [CLS] token encodes a summary representation of the entire sequence while the [SEP] token is used to separate sequences, or denote the end of a sequence.

Dual Encoder Models

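Concretely, a dense encoding function is defined as $E_{\text{dense}}: x \mapsto \mathbb{R}^d$ for an input token sequence $x$ and dimension $d$, where $d \ll |V|$. For a token sequence $x = (x_1, \dots, x_n)$, retrievers use BERT to obtain corresponding hidden representations $(h_{\text{[CLS]}}, h_1, \dots, h_n, h_{\text{[SEP]}})$.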

Rather than use all hidden representations to represent the sequence, most retrievers will compress the information using a reduction function reduce that outputs a single hidden vector, usually the representation of the [CLS] token. A general dense encoding function can be implemented as follows:

$$E_{\text{dense}}(x) = \text{reduce}(\text{BERT}(x))$$

where $x$ is an input token sequence. Most models follow a dual encoder architecture, where one BERT model encodes the query and a separate BERT model encodes the passage. For a particular query, we can calculate the dense retrieval score for a query $q$ and passage $p$ as follows:

$$\text{sim}(q, p) = E^{Q}_{\text{dense}}(q)^\top E^{P}_{\text{dense}}(p)$$

where the superscripts $Q$ and $P$ denote distinct BERT encoder models. Note that in most dense models, the passage is tokenized as the concatenation of the title of the article it comes from and its contents, separated by a [SEP] token.

Dense Passage Retriever (DPR)

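Karpukhin et al. (2020) use the dual encoder architecture described above for dense retrieval and post impressive results in the space of open-domain QA. The authors consider a training dataset where each question $q_i$ has one positive passage $p_i^+$ and $n$ negative passages $p_{i,1}^-, \dots, p_{i,n}^-$. The loss function optimizes the negative log-likelihood of the positive passage $p_i^+$, specifically:

$$\mathcal{L}(q_i, p_i^+, p_{i,1}^-, \dots, p_{i,n}^-) = -\log \frac{\exp(\text{sim}(q_i, p_i^+))}{\exp(\text{sim}(q_i, p_i^+)) + \sum_{j=1}^{n} \exp(\text{sim}(q_i, p_{i,j}^-))}$$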
Positive passages come from annotated open-domain QA datasets like Natural Questions (NQ) (Kwiatkowski et al., 2019), which contain (question, answer, context) triples. Their best performing model considers negative passages from two sources: in-batch negatives and BM25 hard negatives.

In-batch negatives mean that for each question, all of the positive passages for other questions in the same training minibatch are treated as negatives. BM25 hard negatives are high-scoring passages retrieved using BM25 that do not contain the correct answer. In practice, the authors use a batch size of 128 and sample 1 BM25 hard negative per question, leading to 255 effective negatives per question (the 127 positive passages of the other questions plus the 128 BM25 hard negatives in the batch).
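The following sketch shows how a negative log-likelihood loss with in-batch negatives can be computed, and how the negative count follows from the batch size. It assumes, as in the description above, that every hard negative in the batch serves as a negative for every question; the toy vectors and the names `nll_in_batch` and `effective_negatives` are hypothetical.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def nll_in_batch(
    queries: List[List[float]],
    positives: List[List[float]],
    hard_negatives: List[List[float]],
) -> float:
    """Mean negative log-likelihood of each question's own positive passage.

    Candidates for every question are all positives plus all hard negatives
    in the batch; the positive for question i sits at candidate index i.
    """
    candidates = positives + hard_negatives
    losses = []
    for i, q in enumerate(queries):
        scores = [dot(q, c) for c in candidates]
        # Numerically stable log-sum-exp over all candidates.
        m = max(scores)
        log_z = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_z - scores[i])
    return sum(losses) / len(losses)


def effective_negatives(batch_size: int, hard_per_question: int = 1) -> int:
    """Other questions' positives plus every hard negative in the batch."""
    return (batch_size - 1) + batch_size * hard_per_question


queries = [[1.0, 0.0], [0.0, 1.0]]
positives = [[1.0, 0.0], [0.0, 1.0]]
hard_negatives = [[0.5, 0.0], [0.0, 0.5]]
loss = nll_in_batch(queries, positives, hard_negatives)
print(loss, effective_negatives(128))
```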

DPR segments Wikipedia into 100 token passages and filters out semi-structured data like tables and lists. After training, the model pre-processes a dense document index using FAISS (Johnson et al., 2017) for efficient maximum inner product search (MIPS).

After its release, DPR became the primary retriever for many future open-domain QA systems like RAG (Lewis et al., 2021a) and the current state-of-the-art model, fusion-in-decoder (FiD) (Izacard and Grave, 2021). There has also been a lot of investigation into mining harder negatives (Xiong et al., 2020), which further improve the performance of DPR. Most of these new models incorporate techniques like generative readers or asynchronously updating the document index, which are orthogonal to our investigation here.

3.1 Datasets

To evaluate out-of-domain generalization, we consider a wide variety of datasets sourced from different places. A summary of the datasets can be found in Table 6.3.

Natural Questions (NQ)

Kwiatkowski et al. (2019) built the Natural Questions dataset using anonymized Google search data. As with previous works, we follow Lee et al. (2019) and use their Natural Questions Open dataset, which filters out questions without short answers and questions with short answers longer than 5 tokens.


TriviaQA

Joshi et al. (2017) introduced TriviaQA, a dataset of trivia questions scraped from the web. We follow previous work and consider only question-answer pairs, discarding their evidence documents.

Web Questions (WQ)

Berant et al. (2013) gathered questions from the Google Suggest API into the Web Questions dataset, where answers are entities in Freebase.

CuratedTREC (TREC)

Baudiš and Šedivý (2015) built the CuratedTREC dataset, which is based on the TREC QA tracks. The authors source their queries from numerous online entities like AskJeeves or MSNSearch.


SQuAD

Rajpurkar et al. (2016) introduced the widely-used SQuAD reading comprehension dataset. It was constructed from crowdsourced workers asking questions about Wikipedia passages presented to them. Following previous work, we consider the SQuAD Open variant, which ignores context passages during evaluation.

3.2 T-REx QA Dataset Evaluation

To test how well models are able to adapt to new settings, we create a QA dataset based on facts from Wikidata (Vrandečić and Krötzsch, 2014), a large collection of (subject, relation, object) triples mined from Wikipedia. We sample from the T-REx (Elsahar et al., 2018) dataset, which is a subset of 11M Wikidata triples with aligned sentences.

The full T-REx dataset considers 43 relations; however, we sample 14. We use hand-crafted query templates to rewrite each (subject, relation, object) triple into a question where the subject is part of the question and the object is the answer. Since the relations are very simple (e.g., “Where was [X] born?”), but the subjects are specific entities (e.g., “Nikolai Arnoldovich Petrov”), we consider this a lexically rich evaluation set.

We further segment the 14 relations into two sets of 7 relations, one that can be seen during training and one that cannot, which we denote as seen and unseen respectively. For the seen relations, we perform an 80/10/10 split for train/dev/test sets, sampled equally from each relation. For the unseen relations, we only construct a test set using 10% of examples, again sampled evenly from each relation. More details of the query templates, sampled relations, and sizes can be found in Table 3.1.

| Rel. | Label | Query template | Size | UN |
|---|---|---|---|---|
| P19 | place of birth | Where was [X] born? | 10,000 | |
| P159 | headquarters location | Where is the headquarter of [X]? | 10,000 | |
| P176 | manufacturer | Which company is [X] produced by? | 10,000 | |
| P264 | record label | What music label is [X] represented by? | 10,000 | |
| P407 | language of work or name | Which language was [X] written in? | 6,722 | |
| P413 | position played on team / speciality | What position does [X] play? | 10,000 | |
| P740 | location of formation | Where was [X] founded? | 9,415 | |
| P17 | country | Which country is [X] located in? | 10,000 | |
| P20 | place of death | Where did [X] die? | 10,000 | |
| P30 | continent | Which continent is [X] located? | 10,000 | |
| P127 | owned by | Who owns [X]? | 10,000 | |
| P136 | genre | What type of music does [X] play? | 10,000 | |
| P276 | location | Where is [X] located? | 10,000 | |
| P495 | country of origin | Which country was [X] created in? | 10,000 | |
| All | | | 136,137 | |

Table 3.1: T-REx QA Dataset Overview. Relations and query templates used to construct the T-REx QA dataset along with the number of examples per relation. UN denotes whether the relation is included in the unseen test set.
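The template rewriting and the per-relation 80/10/10 split can be sketched as below. The two templates are copied from Table 3.1; the entities and objects are synthetic placeholders, and the function names, seed, and rounding of split sizes are assumptions made for this illustration.

```python
import random
from typing import Dict, List, Tuple

# Two of the query templates from Table 3.1.
TEMPLATES: Dict[str, str] = {
    "P19": "Where was [X] born?",
    "P176": "Which company is [X] produced by?",
}


def triple_to_qa(subject: str, relation: str, obj: str) -> Dict[str, str]:
    """Rewrite a (subject, relation, object) triple into a QA pair."""
    question = TEMPLATES[relation].replace("[X]", subject)
    return {"question": question, "answer": obj, "relation": relation}


def split_per_relation(
    examples: List[Dict[str, str]], seed: int = 0
) -> Tuple[List[Dict[str, str]], List[Dict[str, str]], List[Dict[str, str]]]:
    """Shuffle and split 80/10/10 within each relation, then pool the splits."""
    rng = random.Random(seed)
    by_relation: Dict[str, List[Dict[str, str]]] = {}
    for ex in examples:
        by_relation.setdefault(ex["relation"], []).append(ex)
    train, dev, test = [], [], []
    for relation in sorted(by_relation):
        group = by_relation[relation][:]
        rng.shuffle(group)
        n_dev = len(group) // 10
        n_test = len(group) // 10
        n_train = len(group) - n_dev - n_test
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    return train, dev, test


# Synthetic placeholder triples (not real Wikidata facts).
triples = [(f"Entity {i}", rel, f"Object {i}") for rel in TEMPLATES for i in range(10)]
examples = [triple_to_qa(*t) for t in triples]
train, dev, test = split_per_relation(examples)
print(examples[0]["question"], len(train), len(dev), len(test))
```

Splitting inside each relation keeps the relation proportions identical across the train, dev, and test sets.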

We consider 3 dense models: DPR (pt, NQ) is a pre-trained DPR model trained on only NQ; DPR (pt, Multi) is a pre-trained DPR model trained on NQ, TriviaQA, WebQ, and CuratedTREC in a multi-dataset fashion; and REALM (pt, NQ) is another dual encoder model from Guu et al. (2020) with intermediate pre-training tasks and joint fine-tuning of the reader and retriever models on NQ. Note that REALM retrieves 288 BPE token “blocks,” whereas DPR retrieves 100 word passages, so REALM retrieves more content per passage. As a sparse model, we consider the Pyserini (Lin et al., 2021) implementation of BM25. We adopt all default parameters and we build the index using DPR passage splits.

We also include two additional baselines where we take the DPR (pt, NQ) model and fine-tune it for 10 additional epochs. The (ft, T-REx) model fine-tunes using only the T-REx training set. The (ft, NQ+T-REx) model fine-tunes on the union of the NQ and T-REx training sets in a multi-dataset training setup. Details on the fine-tuning setup can be found in Table 6.1. We report results in Table 3.2.

| Model | NaturalQ (NQ) R@5 | NaturalQ (NQ) R@20 | TriviaQA R@5 | TriviaQA R@20 | T-REx (se) R@5 | T-REx (se) R@20 | T-REx (un) R@5 | T-REx (un) R@20 |
|---|---|---|---|---|---|---|---|---|
| DPR (pt, NQ) | 68.3 | 80.1 | 57.0 | 69.0 | 34.2 | 48.2 | 43.9 | 59.0 |
| DPR (pt, Multi) | 67.1 | 79.5 | 71.3 | 80.0 | 42.9 | 56.4 | 50.3 | 63.6 |
| REALM (pt, NQ)* | 70.1 | 79.0 | 69.6 | 77.8 | 41.5 | 54.8 | 57.5 | 70.4 |
| Init: DPR (pt, NQ) | | | | | | | | |
| + (ft, T-REx) | 45.5 | 62.3 | 50.6 | 64.8 | 72.8 | 82.3 | 52.4 | 65.3 |
| + (ft, NQ+T-REx) | 63.7 | 76.3 | 53.4 | 66.2 | 62.8 | 74.9 | 45.3 | 60.7 |
| BM25 | 45.3 | 64.5 | 69.4 | 78.6 | 54.4 | 64.4 | 62.7 | 73.8 |

Table 3.2: Baseline Results on T-REx QA Dataset. se and un denote the seen relation evaluation set and unseen relation evaluation set respectively. *: REALM considers 288 BPE token blocks whereas DPR and our BM25 index use 100 word passages.

While dense models perform well on NQ and TriviaQA, they significantly underperform on the T-REx QA subsets. It’s also notable that REALM still underperforms BM25, even though it retrieves more tokens per passage and incorporates expensive intermediate pre-training regimes. This demonstrates that dense models miss key information that sparse models are able to pick up in order to answer these questions.

Looking at the fine-tuned baselines, adding T-REx examples to training improves performance on the seen relation subset enormously, even outperforming the sparse model; however, if we only fine-tune using T-REx, accuracy on NQ and TriviaQA degrades heavily. When fine-tuning on both NQ and T-REx, we avoid much of the degradation on NQ and TriviaQA (for example, NQ R@20 is 76.3, versus 62.3 when fine-tuning on T-REx alone) while keeping most, but not all, of the improvements on the seen relation subset. In both cases, very few of the performance gains on the seen relation subset translate to the unseen relation subset, which means the knowledge learned does not transfer to new relations. These results indicate that current data augmentation techniques or multi-dataset training setups are not enough to close the out-of-domain generalization gap.

3.3 Entities vs. Relations

The questions in the T-REx QA dataset have two distinct dimensions: the subject entities referenced and the specific relations tested. We aim to decouple these two aspects in order to see whether dense models struggle to generalize on unseen relations or on unseen entities.

We construct 4 different subsets: (seen entities, seen relations), (seen entities, unseen relations), (unseen entities, seen relations), and (unseen entities, unseen relations). For each subset, we consider either the 7 seen relations or the 7 unseen relations and sample 300 QA pairs per relation. (Footnote 1: For the (seen entities, unseen relations) subset, two relations did not have enough overlapping entities, causing this subset to be slightly smaller. Results are still clear and significant.)

We consider 3 models: DPR (rt, NQ) is a re-trained version of DPR trained on NQ that serves as a baseline; DPR (rt, NQ+T-REx) is a re-trained DPR trained on the union of the NQ and T-REx training sets; and BM25. Hyperparameters for the re-trained model variants are included in Table 6.1. We present R@5 and R@20 results in Table 3.3.

| Model | (E: ✓, R: ✓) R@5 | (E: ✓, R: ✓) R@20 | (E: ✓, R: ✗) R@5 | (E: ✓, R: ✗) R@20 | (E: ✗, R: ✓) R@5 | (E: ✗, R: ✓) R@20 | (E: ✗, R: ✗) R@5 | (E: ✗, R: ✗) R@20 |
|---|---|---|---|---|---|---|---|---|
| DPR (rt, NQ) | 31.9 | 45.8 | 31.5 | 43.2 | 32.7 | 46.8 | 41.2 | 55.3 |
| DPR (rt, NQ+T-REx) | 69.1 | 79.5 | 40.1 | 52.6 | 64.8 | 75.9 | 44.4 | 60.0 |
| BM25 | 54.6 | 64.1 | 48.1 | 58.5 | 55.9 | 66.1 | 62.2 | 73.8 |

Table 3.3: T-REx Entity/Relation Analysis. In column headers, “E:” denotes whether the entities are seen during training and “R:” denotes whether the relations are seen during training. Bold indicates highest performing model in column.

It’s clear that observing the entities and relations during training significantly improves performance of dense models, even outperforming sparse models. Looking at the (E: ✗, R: ✗) column, it’s also clear that training on the T-REx training data does not generalize to unseen entities or unseen relations.

When observing entities during training but not relations, accuracy improves only modestly over the baseline (R@5 rises from 31.5 to 40.1); however, when observing relations during training and not entities, accuracy improves substantially (R@5 rises from 32.7 to 64.8), almost to the level of observing both relations and entities (69.1). This indicates that dense models are able to generalize to unseen entities well using the same relations, but they struggle to generalize on unseen relations, even if these relations include entities seen during training.

4.1 Removing Positional Biases

One difference between dense and sparse models is the bag-of-words modeling assumption. Sparse models treat all words in the sequence independently and only consider statistics based on term and document frequencies. This completely removes the interactions between words (outside co-occurrence) as well as word compositionality. Dense models, on the other hand, consider the sequence as a whole using BERT and encode word order using positional embeddings.

We investigate whether this bag-of-words modeling assumption, specifically the lack of positional information, helps sparse models generalize better to new distributions. One consideration is that questions from one dataset are written completely differently than questions from a different dataset. Compare the examples in Table 6.2, specifically between TriviaQA and NQ. Questions in TriviaQA are typically very long, robust, and detailed. On the other hand, questions in the NQ dataset are short, fragmented, and occasionally ungrammatical. Training a model that only sees one type of question would likely have trouble generalizing to the other.

To test this, we consider sequence shuffling, where we split each sequence by spaces and randomly order the words. Note that this removes the word compositionality and may even break the meaning of the question. We consider shuffling the question tokens in the training dataset, denoted as models with shuffleQ. We also consider shuffling the passage tokens in the training dataset, denoted as models with shuffleP. All models are based on the re-trained DPR model trained on the NQ dataset, denoted DPR (rt, NQ), and we report R@5 and R@20 on NQ, TriviaQA, WebQ, TREC, and SQuAD in Table 4.1.

| Model | Metric | NQ | TriviaQA | WebQ | TREC | SQuAD |
|---|---|---|---|---|---|---|
| DPR (rt, NQ) | R@5 | 62.1 | 49.6 | 49.3 | 69.7 | 27.4 |
| DPR (rt, NQ, shuffleQ) | R@5 | 62.2 | 48.6 | 49.7 | 67.6 | 26.8 |
| DPR (rt, NQ, shuffleQ, shuffleP) | R@5 | 3.6 | 7.9 | 3.8 | 11.4 | 2.1 |
| DPR (rt, NQ) | R@20 | 75.0 | 63.2 | 63.8 | 82.0 | 43.9 |
| DPR (rt, NQ, shuffleQ) | R@20 | 75.2 | 62.6 | 69.3 | 80.8 | 42.5 |
| DPR (rt, NQ, shuffleQ, shuffleP) | R@20 | 7.9 | 12.8 | 9.6 | 23.5 | 5.9 |

Table 4.1: Sequence Shuffling Results. shuffleQ denotes shuffling training questions and shuffleP denotes shuffling training passages.
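The shuffling operation behind shuffleQ and shuffleP is simple enough to sketch directly: split on spaces, permute the words, and rejoin. The example question and the fixed seed below are illustrative assumptions, not data from the experiments.

```python
import random


def shuffle_sequence(text: str, rng: random.Random) -> str:
    """Split a sequence on spaces and return its words in random order."""
    words = text.split(" ")
    rng.shuffle(words)
    return " ".join(words)


rng = random.Random(0)  # fixed seed so the sketch is reproducible
question = "who sang the song i want to be with you"  # hypothetical example
shuffled = shuffle_sequence(question, rng)
print(shuffled)
```

The bag of words is preserved exactly; only the ordering information is destroyed, which is what isolates the effect of positional information during training.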

Shuffling the question tokens during training doesn’t hurt accuracy, which means that the model uses very little word composition and essentially ignores positional information altogether. This is notable as word order often changes the meaning or intention of the question, especially around words like “not” or when considering multi-word entities.

On the other hand, shuffling the question tokens during training doesn’t help accuracy, which means the model is not overfitting to the phrasing or formatting of a particular dataset. From these results, the question format differences between NQ and TriviaQA do not affect the model’s ability to retrieve relevant information.

Once the positive/negative passage tokens are shuffled during training, performance degrades significantly. This follows intuition since passages are 100 tokens, likely spanning multiple sentences. Breaking the ordering in the passages destroys most of their meaning, and capturing that meaning is what makes BERT so strong. BERT builds a vector space based on semantics, which is much more difficult to construct without word ordering.

We use these results as a basis to conclude that the positional information in passages during training is very important for BERT to build a semantic vector space; however, positional information in questions is generally unimportant, neither helping nor hurting model generalization.

4.2 Freeze One Encoder During Fine-tuning

We analyze the typical dual encoder architecture to determine what’s more important: fine-tuning the question encoder or fine-tuning the passage encoder. To do this, we again consider fine-tuning on top of the pre-trained DPR model trained on NQ, denoted DPR (pt, NQ).

We consider fine-tuning under three conditions: (ft, T-REx) serves as a baseline and denotes fine-tuning both encoders normally; (ft, T-REx, fixP) denotes freezing the weights of the passage encoder during fine-tuning, only applying updates to the query encoder; (ft, T-REx, fixQ) denotes freezing the weights of the query encoder during fine-tuning, only applying updates to the passage encoder. Fine-tuning settings can be found in Table 6.1. We report R@5 and R@20 results on NQ, TriviaQA, and both T-REx evaluation subsets in Table 4.2.

| Model | NaturalQ (NQ) R@5 | NaturalQ (NQ) R@20 | TriviaQA R@5 | TriviaQA R@20 | T-REx (se) R@5 | T-REx (se) R@20 | T-REx (un) R@5 | T-REx (un) R@20 |
|---|---|---|---|---|---|---|---|---|
| Init: DPR (pt, NQ) | 68.3 | 80.1 | 57.0 | 69.0 | 34.2 | 48.2 | 43.9 | 59.0 |
| + (ft, T-REx) | 45.5 | 62.3 | 50.6 | 64.8 | 72.8 | 82.3 | 52.4 | 65.3 |
| + (ft, T-REx, fixP) | 60.4 | 75.0 | 53.7 | 66.6 | 50.1 | 63.5 | 46.5 | 59.9 |
| + (ft, T-REx, fixQ) | 51.9 | 68.3 | 51.8 | 65.4 | 71.5 | 81.5 | 53.8 | 67.5 |

Table 4.2: Freeze One Encoder During Fine-tuning Results. All models initialized from pre-trained DPR trained on NQ only. fixQ denotes fixing weights in the query encoder and fixP denotes fixing weights in the passage encoder. se denotes seen relation evaluation set and un denotes unseen relation evaluation set. Bold indicates highest accuracy model in column.

We notice that there is a discrepancy between training both encoders, training only the passage encoder, and training only the query encoder. When fine-tuning on T-REx, freezing the passage encoder and training only the query encoder improves performance meaningfully on the T-REx seen relation subset while only degrading slightly on NQ. When freezing the query encoder and only training the passage encoder, accuracy on the T-REx subsets matches that of training both encoders. Interestingly, NQ performance does not degrade as significantly on this model compared to training both encoders, even though T-REx performance is almost identical. We also note improvement in the unseen relation subset compared to training both encoders. Based on these results, we conclude that the passage encoder is particularly important to better answer questions from the T-REx QA dataset.

5.1 Modified Training Techniques

We consider DPR and modify the proposed training regime to further investigate how the training objective affects generalization. All training hyperparameters can be found in Table 6.1.

Single Model Training

We modify DPR’s dual encoder architecture by tying the weights of the query encoder and the passage encoder, effectively creating a single model architecture. The core idea here is a single model that encodes both queries and passages can mimic a “query-aware” passage encoder and a “passage-aware” query encoder, whereas the dual encoder architecture considers each independently. We compare results between the dual encoder architecture and the single encoder architecture. Models using this technique are denoted with 1enc.

Stop-gradient Training

Inspired by Chen and He (2020), we investigate using a new loss function based on the idea of stop-gradient training. Specifically, we define a new loss function over the query representation and the passage representation that uses a stop-gradient operation, i.e., detachment from the computational graph. We compare this loss with the unmodified contrastive loss. Models using this technique are denoted with stopG.

PAQ Training

Many open-domain QA systems base their models on the Natural Questions dataset, which sources queries from anonymized Google search data; however, this distribution bakes in its own set of biases on the task. For example, many Google searches ask questions about topics common in pop culture like notable movies, trending celebrities, and famous musicians. Oftentimes, systems trained on this dataset favor certain information over others. We consider training a model using the Probably Asked Questions (PAQ) (Lewis et al., 2021b) dataset, a large-scale collection of (question, answer, passage) triples built using a question generation model.

First, the authors train a classifier to identify passages likely to be asked about based on the NQ dataset. The authors identify probable answer spans in the 10 million most-likely passages using named entity recognition (NER) tools and a learned answer span model trained on NQ. Next, the authors train a query generation model on NQ, TriviaQA, and SQuAD in a multi-dataset fashion. The query generation model takes (answer, passage) pairs as input and outputs questions about the passage with the corresponding answer. Finally, the authors filter questions by ensuring that the state-of-the-art open-domain QA model, fusion-in-decoder (FiD) trained on NQ, is able to generate the correct answer given the question, without the associated passage.

For our study, we group all of the questions asked about a particular passage and filter out any passages that have fewer than 3 generated questions. We then sample 100K such passages and sample one question asked about each. We split this dataset into 70K/15K/15K for train/dev/test splits, although we do not evaluate on this dataset.
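The grouping, filtering, sampling, and splitting steps described above can be sketched as follows. The input format (question, answer, passage id), the function names, the seed, and the tiny synthetic data are all assumptions made for illustration; the real pipeline operates on the PAQ release with 100K sampled passages.

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, int]  # (question, answer, passage id), assumed layout


def subsample(
    triples: List[Triple],
    min_questions: int = 3,
    n_passages: int = 100_000,
    seed: int = 0,
) -> List[Tuple[int, str, str]]:
    """Keep passages with enough questions, sample passages, pick one question each."""
    by_passage: Dict[int, List[Tuple[str, str]]] = defaultdict(list)
    for question, answer, passage_id in triples:
        by_passage[passage_id].append((question, answer))
    # Drop passages with fewer than `min_questions` generated questions.
    eligible = sorted(pid for pid, qs in by_passage.items() if len(qs) >= min_questions)
    rng = random.Random(seed)
    chosen = rng.sample(eligible, min(n_passages, len(eligible)))
    return [(pid, *rng.choice(by_passage[pid])) for pid in chosen]


def split_70_15_15(examples: list) -> Tuple[list, list, list]:
    """Split an already-shuffled list into 70% / 15% / 15% portions."""
    n_train = len(examples) * 70 // 100
    n_dev = len(examples) * 15 // 100
    return (
        examples[:n_train],
        examples[n_train:n_train + n_dev],
        examples[n_train + n_dev:],
    )


# Synthetic data: every third passage has only 2 questions and is filtered out.
triples: List[Triple] = []
for pid in range(30):
    n_questions = 2 if pid % 3 == 0 else 3
    for j in range(n_questions):
        triples.append((f"question {j} about passage {pid}", f"answer {pid}-{j}", pid))

sampled = subsample(triples, n_passages=20)
train, dev, test = split_70_15_15(sampled)
print(len(sampled), len(train), len(dev), len(test))
```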

We hypothesize that the PAQ dataset has some important benefits for generalization compared to normal open-domain QA datasets. First, the passage distribution is based on 10M passages, which is about half of Wikipedia, as opposed to popular or trendy topics in Natural Questions. Second, the answer distribution considers both named-entity recognition and an answer span model trained on NQ, which is much more robust than just considering one or the other. Third, we argue that the PAQ dataset is similar to a multi-task learning setup because the query generation model is trained on multiple datasets. This allows us to simulate multi-dataset training while still training on a single, reasonably-sized dataset. We investigate the difference between models trained on Natural Questions and models trained on PAQ. Models trained on PAQ are denoted with PAQ.

Flipped Training

We modify the original training objective to consider positive and negative questions for a given passage. While the original training objective likely encourages a successful question discriminator, we hope to encourage a good passage discriminator to improve the passage vector space. We use the PAQ training dataset, where we consider the 70K passages and use randomly sampled questions generated for that passage as positives. For negatives, we use randomly sampled questions from other passages in the training set. Note that we do not incorporate hard negative mining for this training dataset. Models using the flipped training objective are denoted with flip.


We re-train models according to the parameters noted in Table 6.1 and report combined results in Table 5.1. Using a single encoder during training improves performance across the board, with most of the gains on out-of-domain datasets. We hypothesize this stems from two reasons. First, a query-aware passage encoder is better able to encode relevant information than the dual encoder architecture. Second, using a single encoder helps align the passage vector space and query vector space compared to a dual encoder by using a single model instead of trying to align two distinct models.

| Models | NaturalQ (NQ) R@5 | NaturalQ (NQ) R@20 | TriviaQA R@5 | TriviaQA R@20 | T-REx (se) R@5 | T-REx (se) R@20 | T-REx (un) R@5 | T-REx (un) R@20 |
|---|---|---|---|---|---|---|---|---|
| Re-trained Dense Models | | | | | | | | |
| DPR (rt, NQ) | 62.1 | 75.0 | 49.6 | 63.2 | 31.1 | 45.0 | 40.9 | 55.0 |
| DPR (rt, NQ, 1enc) | 65.9 | 77.8 | 57.5 | 69.4 | 34.0 | 47.2 | 43.2 | 58.8 |
| DPR (rt, NQ, stopG) | 65.4 | 77.4 | 54.3 | 66.6 | 30.3 | 44.1 | 41.2 | 57.0 |
| DPR (rt, NQ, 1enc, stopG) | 65.9 | 78.0 | 57.2 | 69.5 | 32.7 | 46.4 | 39.8 | 56.3 |
| DPR (rt, PAQ) | 47.8 | 66.8 | 57.2 | 70.4 | 42.4 | 56.8 | 49.0 | 63.0 |
| DPR (rt, PAQ, flip) | 41.6 | 62.6 | 51.2 | 65.2 | 39.7 | 53.4 | 44.9 | 60.2 |
| DPR (rt, PAQ, 1enc) | 50.1 | 68.7 | 59.8 | 72.8 | 44.8 | 57.9 | 53.2 | 67.9 |
| DPR (rt, PAQ, stopG) | 47.5 | 67.6 | 56.0 | 69.9 | 44.8 | 58.6 | 48.5 | 63.7 |
| DPR (rt, PAQ, 1enc, stopG) | 50.0 | 69.9 | 60.9 | 72.6 | 50.4 | 63.1 | 55.7 | 69.0 |
| BM25 | 45.3 | 64.5 | 69.4 | 78.6 | 54.4 | 64.4 | 62.7 | 73.8 |

Table 5.1: Re-trained Models with Modifications. 1enc denotes single encoder training. stopG denotes loss function inspired by stop-gradient. flip denotes using positive/negative questions for a passage instead of positive/negative passages for a question. Bold indicates highest performing model in column. Underline denotes highest performing model with same training data.

Using the loss function inspired by stop-gradient has very mixed results. On one hand, the dual encoder architecture when trained on NQ improves performance on NQ and TriviaQA, with slight variation on the T-REx QA dataset. The models trained on PAQ also have mixed results, generally showing marginal differences compared to the dual encoder baseline. On the other hand, when combined with the single encoder architecture, results improve slightly, but meaningfully, when trained on PAQ, but do not improve when trained on NQ. In general, we conclude this loss function has minimal effects compared to single encoder training.

Models trained on the PAQ dataset have some interesting characteristics. They generally perform much worse (typically by 9-10% absolute) on the NQ dataset than models trained on NQ, even though the query generation model is trained partly on NQ. PAQ models also perform much better on TriviaQA and T-REx QA than NQ models. These characteristics may be influenced by the PAQ answer distribution, which may be more entity-heavy than that of other datasets due to the NER answer extraction step.

Finally, the flipped training objective underperforms normal DPR training, failing to match its performance on any of the datasets for either training setting. This could be due to the choice of negatives, since Karpukhin et al. (2020) show that harder negatives improve retrieval performance compared to randomly sampled negatives.
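The difference between the two objectives can be sketched on a single matrix of similarity scores. The example below is illustrative only. It assumes in-batch negatives, so that entry `S[i][j]` scores question `i` against passage `j` and the gold pairs sit on the diagonal; how negative questions were actually sampled is not specified here. The function names are hypothetical.

```python
import math


def log_softmax_at(scores, gold):
    """Log-probability assigned to position `gold` under a softmax over `scores`."""
    m = max(scores)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[gold] - log_z


def standard_loss(S):
    """Standard objective: for each question (row), identify its passage
    among all passages in the batch."""
    n = len(S)
    return -sum(log_softmax_at(S[i], i) for i in range(n)) / n


def flipped_loss(S):
    """Flipped objective: for each passage (column), identify its question
    among all questions in the batch."""
    n = len(S)
    columns = [[S[i][j] for i in range(n)] for j in range(n)]
    return -sum(log_softmax_at(columns[j], j) for j in range(n)) / n


# Assumed toy scores: passage 1 scores fairly high for every question,
# so the row-wise and column-wise views disagree.
S = [[2.0, 1.5, 0.0],
     [0.0, 1.0, 0.0],
     [0.0, 1.5, 1.0]]

standard = standard_loss(S)
flipped = flipped_loss(S)
```

The two losses coincide when the score matrix is symmetric and otherwise differ, since one normalizes over competing passages and the other over competing questions.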

5.2 Query-side Fine-tuning

First introduced in Lee et al. (2021), query-side fine-tuning fixes the passage encoder and trains only the query encoder, using the true retrieval objective in open-domain QA. For each example, the query model encodes the question and performs the full maximum inner product search (MIPS) over the document index to retrieve the top candidate results. The loss reinforces positive passages (those that contain the answer) and can be defined as:

$$\mathcal{L}_{\text{query}} = -\log \frac{\sum_{p \in \mathcal{P}_k(q)} y(p) \, \exp\big(s(q, p)\big)}{\sum_{p \in \mathcal{P}_k(q)} \exp\big(s(q, p)\big)}$$

where $\mathcal{P}_k(q)$ denotes the top $k$ document candidates, $y(p) \in \{0, 1\}$ denotes whether the candidate $p$ has the answer to the question, and $s(q, p)$ denotes the MIPS retrieval score. Query-side fine-tuning helps close the gap between training and testing tasks by performing full retrieval over the document index instead of using static positives and negatives. We consider query-side fine-tuning on all models presented previously, using the hyperparameters in Table 6.1. We report R@5 and R@20 results in Table 5.2.
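As a rough illustration of the procedure described above, the sketch below retrieves the top candidates for one question by exhaustive inner product search over a fixed index and computes the negative log of the softmax mass that falls on answer-bearing candidates, the marginal form used by Lee et al. (2021). It is a sketch under stated assumptions: the vectors are toy data, "has the answer" is approximated by case-insensitive string containment, and `top_k_mips` and `query_side_loss` are hypothetical names.

```python
import math


def top_k_mips(query_vec, passage_index, k):
    """Exhaustive maximum inner product search over a fixed passage index.

    Returns the k highest-scoring (passage_id, score) pairs, best first.
    """
    scored = [(pid, sum(a * b for a, b in zip(query_vec, vec)))
              for pid, vec in enumerate(passage_index)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


def query_side_loss(candidates, passages, answer):
    """Negative log of the softmax mass on candidates that contain the answer.

    Returns None when no retrieved candidate contains the answer, in which
    case the example provides no positive signal.
    """
    scores = [score for _, score in candidates]
    m = max(scores)  # subtract the max for numerical stability
    exp_scores = [math.exp(s - m) for s in scores]
    positive_mass = sum(
        e for (pid, _), e in zip(candidates, exp_scores)
        if answer.lower() in passages[pid].lower()  # assumed answer check
    )
    if positive_mass == 0.0:
        return None
    return -math.log(positive_mass / sum(exp_scores))


# Assumed toy corpus and precomputed (frozen) passage vectors.
passages = [
    "Paris is the capital of France.",
    "Berlin is the capital of Germany.",
    "The Eiffel Tower is in Paris.",
    "Rome is in Italy.",
]
passage_index = [[1.0, 0.0], [0.0, 1.0], [0.8, 0.2], [-1.0, 0.0]]

# Output of the trainable query encoder for one question (assumed values).
query_vec = [0.9, 0.3]

candidates = top_k_mips(query_vec, passage_index, k=3)
loss = query_side_loss(candidates, passages, answer="Paris")
```

Only `query_vec` would change during training; `passage_index` is never updated, so the candidate set shifts solely because the query representation moves.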

Models                              NaturalQ       TriviaQA       T-REx (se)     T-REx (un)
                                    R@5    R@20    R@5    R@20    R@5    R@20    R@5    R@20
Re-trained Dense Models
DPR (rt, NQ) + qsft                 67.3   78.9    59.4   71.3    39.2   53.6    54.8   67.4
DPR (rt, NQ, 1enc) + qsft           68.3   79.0    60.2   72.1    37.7   52.0    54.9   67.1
DPR (rt, NQ, stopG) + qsft          68.5   78.9    60.4   72.1    37.9   51.5    53.7   67.2
DPR (rt, NQ, 1enc, stopG) + qsft    68.0   79.4    60.4   72.0    38.6   53.0    55.3   68.1
DPR (rt, PAQ) + qsft                63.9   76.7    62.3   73.8    46.1   59.9    56.3   69.2
DPR (rt, PAQ, flip) + qsft          64.1   76.6    61.7   73.5    44.3   58.9    53.9   68.3
DPR (rt, PAQ, 1enc) + qsft          61.1   74.9    61.2   73.1    43.2   58.1    54.0   68.2
DPR (rt, PAQ, stopG) + qsft         63.4   76.5    61.7   73.2    44.8   60.0    56.9   70.4
DPR (rt, PAQ, 1enc, stopG) + qsft   62.6   75.5    61.0   73.1    44.9   61.3    59.6   71.7
BM25                                45.3   64.5    69.4   78.6    54.4   64.4    62.7   73.8
Table 5.2: Query-side Fine-tuning Results. 1enc denotes single encoder training. stopG denotes loss function inspired by stop-gradient. flip denotes using positive/negative questions for a passage instead of positive/negative passages for a question. qsft denotes query-side fine-tuning on the Natural Questions dataset. Bold indicates highest performing model in column. Underline denotes highest performing model with same training data.

Query-side fine-tuning improves accuracy for nearly all models and datasets considered; the main exception is T-REx (se), where the single encoder models trained on PAQ do not improve. Surprisingly, improvements are especially pronounced on dual encoder architectures, closing the gap in performance with their single encoder counterparts. We also note that the flipped training objective, which underperformed the regular DPR model, now roughly matches its performance.

We hypothesize two reasons for this improvement. First, query-side fine-tuning performs retrieval over the passage index instead of on static positives/negatives. This is closer to the actual task performed at inference time, as noted in Lee et al. (2021). Second, query-side fine-tuning helps align the vector space of the query encoder with the vector space of the passage encoder. Since the training signal propagates only through the query encoder, models are better able to shift the query vector space to match the representations in the fixed dense passage index.


References

  • P. Baudiš and J. Šedivý (2015) Modeling of the question answering task in the YodaQA system. In Proceedings of the 6th International Conference on Experimental IR Meets Multilinguality, Multimodality, and Interaction - Volume 9283, CLEF’15, Berlin, Heidelberg, pp. 222–228. ISBN 9783319240268.
  • J. Berant, A. Chou, R. Frostig, and P. Liang (2013) Semantic parsing on Freebase from question-answer pairs. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, Seattle, Washington, USA, pp. 1533–1544.
  • D. Chen, A. Fisch, J. Weston, and A. Bordes (2017) Reading Wikipedia to answer open-domain questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, Canada, pp. 1870–1879.
  • X. Chen and K. He (2020) Exploring simple siamese representation learning. arXiv:2011.10566.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2019) BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, pp. 4171–4186.
  • H. Elsahar, P. Vougiouklis, A. Remaci, C. Gravier, J. Hare, F. Laforest, and E. Simperl (2018) T-REx: a large scale alignment of natural language with knowledge base triples. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. arXiv:2002.08909.
  • G. Izacard and E. Grave (2021) Leveraging passage retrieval with generative models for open domain question answering. arXiv:2007.01282.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with GPUs. arXiv:1702.08734.
  • M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. In Association for Computational Linguistics (ACL), pp. 1601–1611.
  • V. Karpukhin, B. Oğuz, S. Min, P. Lewis, L. Wu, S. Edunov, D. Chen, and W. Yih (2020) Dense passage retrieval for open-domain question answering. arXiv:2004.04906.
  • T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, et al. (2019) Natural questions: a benchmark for question answering research. Transactions of the ACL (TACL).
  • J. Lee, M. Sung, J. Kang, and D. Chen (2021) Learning dense representations of phrases at scale. arXiv:2012.12624.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, Florence, Italy, pp. 6086–6096.
  • P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela (2021a) Retrieval-augmented generation for knowledge-intensive NLP tasks. arXiv:2005.11401.
  • P. Lewis, P. Stenetorp, and S. Riedel (2020) Question and answer test-train overlap in open-domain question answering datasets. arXiv:2008.02637.
  • P. Lewis, Y. Wu, L. Liu, P. Minervini, H. Küttler, A. Piktus, P. Stenetorp, and S. Riedel (2021b) PAQ: 65 million probably-asked questions and what you can do with them. arXiv:2102.07033.
  • J. Lin, X. Ma, S. Lin, J. Yang, R. Pradeep, and R. Nogueira (2021) Pyserini: an easy-to-use Python toolkit to support replicable IR research with sparse and dense representations. arXiv:2102.10073.
  • P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250.
  • S. Robertson and H. Zaragoza (2009) The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3 (4), pp. 333–389.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems, I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (Eds.), Vol. 30.
  • D. Vrandečić and M. Krötzsch (2014) Wikidata: a free collaborative knowledgebase. Commun. ACM 57 (10), pp. 78–85. ISSN 0001-0782.
  • L. Xiong, C. Xiong, Y. Li, K. Tang, J. Liu, P. Bennett, J. Ahmed, and A. Overwijk (2020) Approximate nearest neighbor negative contrastive learning for dense text retrieval. arXiv:2007.00808.