Many language systems rely on text retrieval as their first step to find relevant information. For example, search ranking nogueira2019passage, open domain question answering chen2017reading, and fact verification yang2018hotpotqa; thorne2018fact all first retrieve relevant documents as the input to their later-stage reranking, machine reading, and reasoning models. All these later-stage models enjoy the advancements of deep learning techniques rajpurkar2016squad; wang2018glue, while, in contrast, the first-stage retrieval still mainly relies on matching discrete bag-of-words nogueira2019passage; chen2017reading; yang2018hotpotqa; zhaotransxh2020. Due to intrinsic challenges such as vocabulary mismatch croft2010search, sparse retrieval inevitably introduces noisy information and often becomes the bottleneck of many systems yang2018hotpotqa; luan2020sparsedense.
Dense Retrieval (DR) using learned distributed representations is a promising direction to overcome this sparse retrieval bottleneck luan2020sparsedense; lee2019latent; chang2020pre; gao2020complementing; ma2020zero; dhingra2020differentiable; karpukhin2020dense: the representation space is fully learnable and can leverage the strength of pretraining, while the retrieval operation is sufficiently efficient thanks to recent progress in Approximate Nearest Neighbor (ANN) search johnson2019billion. With these intriguing properties, one would expect dense retrieval to revolutionize first-stage retrieval, as deep learning has done in almost all language tasks. However, this is not yet the case: recent studies found that dense retrieval often underperforms BM25, especially on documents luan2020sparsedense; lee2019latent. DR has more often been found effective when combined with sparse retrieval than when replacing it gao2020complementing; ma2020zero.
In this paper, we identify that the underwhelming performance of dense retrieval stems from its learning mechanisms: there is a severe mismatch between the negatives used to train DR representations and those seen in testing. A t-SNE maaten2008visualizing plot of an example DR representation space is shown in Fig. 1.
As expected, the negatives dense retrieval models need to handle in testing (DR Neg) are quite close to the relevant documents. However, the negatives used to train DR models, sampled from sparse retrieval (BM25 Neg) or randomly from the corpus (Rand Neg), are rather separated from the relevant or the negative documents in testing. Training with those negatives may never guide the model to learn a proper representation space that separates relevant documents from the actual negatives in dense retrieval.
We fundamentally eliminate this discrepancy by developing Approximate nearest neighbor Negative Contrastive Estimation (ANCE), which constructs more realistic training negatives for dense retrieval in exactly the way DR is performed. During training, we maintain an ANN index of document encodings, produced by the same representation model being optimized for DR, which we update in parallel and refresh asynchronously as learning proceeds. The top dense-retrieved documents from the ANN index are used as negatives for each training query; they are retrieved by the same function, in the same representation space, and thus belong to the same distribution as the irrelevant documents to be discriminated during testing.
In TREC Deep Learning Track’s text retrieval benchmarks craswell2020overview, ANCE significantly boosts the accuracy of dense retrieval. With ANCE training, BERT-Siamese, the DR architecture used in several concurrent studies luan2020sparsedense; gao2020complementing; ma2020zero, significantly outperforms all sparse retrieval baselines. Impressively, a simple dot product in the ANCE-learned representation space is nearly as effective as the sparse retrieval and BERT reranking cascade pipeline while being 100 times more efficient.
Our analyses further confirm that the negatives from sparse retrieval or other sampling methods differ drastically from the actual negatives in DR, and that ANCE fundamentally resolves this mismatch. We also show the influence of the asynchronous ANN refreshing on learning convergence and demonstrate that the efficiency bottleneck is in the encoding update, not in the ANN part during ANCE training. These quantifications demonstrate the advantages, and perhaps also the necessity, of our asynchronous ANCE learning in dense retrieval. (Footnote 1: Code, trained models, and pre-computed embeddings are available at https://github.com/microsoft/ANCE.)
In this section, we discuss the background of sparse, cascade information retrieval, and dense retrieval.
Sparse Retrieval and Cascade IR: Given a query $q$ and a corpus $C$, the text retrieval task is to find a set of documents in $C$ and rank them based on relevance to the query. Because the corpus is often at the scale of millions or billions of documents, efficient retrieval often requires cascade pipelines. These systems first use an efficient sparse retrieval to zoom in on a small set of candidate documents and then feed them to one or several more sophisticated reranking steps croft2010search. The sparse retrieval (e.g., BM25) usually performs an exact match between query and document in the bag-of-words space using frequency-based statistics. The reranking step often applies BERT on top of the sparse-retrieved documents, i.e., by concatenating them with the query and feeding the result into a fine-tuned BERT reranker nogueira2019passage; nogueira2019multi.
The quality of the first stage retrieval defines the upper bound of many language systems: if a relevant document is not retrieved, for example, because of no overlap between query and document’s bag-of-words, then its information is never available to later-stage models. Addressing this vocabulary mismatch is a core research topic in IR croft2010search; lavrenko2017relevance; xiong2015query; dai2019context.
Dense Retrieval aims to fundamentally redesign the first stage text retrieval with representation learning. Instead of retrofitting to sparse retrieval, recent approaches in dense retrieval first learn a distributed representation space of the query and documents, in which the relevance function can be a simple similarity calculation luan2020sparsedense; lee2019latent; chang2020pre; gao2020complementing; ma2020zero; karpukhin2020dense; ahmad2019reqa.
A standard formulation of dense retrieval first uses the Siamese/dual-encoder architecture with BERT to encode the query and document individually, and then matches them using their dense encodings luan2020sparsedense; gao2020complementing; karpukhin2020dense:
$$f(q, d) = \text{sim}\big(g(q; \theta),\, g(d; \theta)\big).$$
The encoder $g()$ uses a layer normalized projection on the last layer’s “[CLS]”, and its weights $\theta$ can be shared between $q$ and $d$ luan2020sparsedense. The similarity metric $\text{sim}()$ in BERT-Siamese is often as simple as dot product or cosine similarity. The dense retrieval is then performed using efficient ANN search with the learned encoder.
We use $\text{ANN}_{f(q,d)}$ to refer to the documents retrieved by dense retrieval for query $q$, which come from the ANN index built with the learned model $f()$.
This leads to several intriguing properties of dense retrieval:
Learnability: Compared to bag-of-words, the representation in dense retrieval is fully learned following the advancement of representation learning.
Efficiency: Compared to the costly reranking in cascade pipelines, in dense retrieval, the document representation can be pre-computed offline. Moreover, only the query needs to be encoded online and retrieval from the ANN index has many efficient solutions johnson2019billion.
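The retrieval step described above can be sketched in a few lines. This is a minimal illustration, assuming the document encodings have already been computed offline and that the similarity is a dot product; the function name `dense_retrieve` and the toy vectors are hypothetical, and the exhaustive search stands in for whatever ANN library would be used in practice.

```python
import numpy as np

def dense_retrieve(query_vec, doc_matrix, k):
    """Return ids and scores of the k documents with the largest dot product."""
    scores = doc_matrix @ query_vec                 # f(q, d) for every document d
    top = np.argsort(-scores, kind="stable")[:k]    # exhaustive search, highest first
    return top.tolist(), scores[top].tolist()

# Toy pre-computed document encodings (one row per document) and one query encoding.
doc_matrix = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([0.6, 0.8])

top_ids, top_scores = dense_retrieve(query_vec, doc_matrix, k=2)
print(top_ids, top_scores)
```

Only the query encoding and the search run online; `doc_matrix` is fixed once the encoder is fixed, which is the source of the efficiency property listed above.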
Representation Learning for Dense Retrieval: The effectiveness of DR depends on learning a representation space that aligns a query with its relevant documents $D^+$, and separates it from irrelevant ones $D^-$. This is often done using the following learning objective:
$$\theta^* = \arg\min_{\theta} \sum_{q} \sum_{d^+ \in D^+} \sum_{d^- \in D^-} l\big(f(q, d^+),\, f(q, d^-)\big),$$
where we used the negative log likelihood (NLL) loss $l()$ luan2020sparsedense on positive and negative documents for each query. Other similar loss functions have also been explored chen2020simple. The positive documents ($d^+$) are from those labeled relevant ($D^+$) for the query. The construction of negative documents ($D^-$), however, is not as straightforward. For reranking models, their negatives in both training and inference are the irrelevant ones in their candidate set, for example, top documents retrieved by BM25:
$$D^-_{\text{BM25}} = \text{BM25}(q) \setminus D^+.$$
However, in dense retrieval, the optimal training negatives are different from those in reranking. To address this concern, several recent works enrich the BM25 negatives with random sampling from the corpus:
$$D^- = D^-_{\text{BM25}} \cup D^-_{\text{rand}},$$
where $D^-_{\text{rand}}$ is sampled from the entire corpus luan2020sparsedense or in batch karpukhin2020dense.
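To make the role of the negatives concrete, the following sketch computes one common form of the NLL loss for a single (positive, negative) pair: the negative log of the softmax probability assigned to the positive. The function name `nll_loss` and the example scores are illustrative assumptions, not taken from the released code.

```python
import math

def nll_loss(pos_score, neg_score):
    """-log( exp(s+) / (exp(s+) + exp(s-)) ), computed in a numerically stable way."""
    m = max(pos_score, neg_score)
    log_z = m + math.log(math.exp(pos_score - m) + math.exp(neg_score - m))
    return log_z - pos_score

# An easy negative (scored far below the positive) contributes almost no loss,
# while a hard negative (scored close to or above the positive) dominates it.
easy = nll_loss(pos_score=10.0, neg_score=-10.0)
tie = nll_loss(pos_score=0.0, neg_score=0.0)
hard = nll_loss(pos_score=0.0, neg_score=5.0)
print(easy, tie, hard)
```

Under this form of the loss, negatives that the model already separates well yield a loss near zero, so they provide little training signal regardless of how many are sampled.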
3 Approximate Nearest Neighbor Noise Contrastive Estimation
Intuitively, the strong negatives close to the relevant documents in an effective dense retrieval representation space should be different from those from sparse retrieval, as the goal of DR is to find documents beyond those retrieved by sparse retrieval. Random sampling from a large corpus is also unlikely to hit those strong negatives as most documents are not relevant to the query.
In this section, we present how to align, in a principled way, the negatives used in DR representation learning with those seen in inference. We first describe a conceptually simple approach, Approximate nearest neighbor Negative Contrastive Estimation (ANCE), which pairs each query and relevant document with negatives retrieved from the ANN index, the same way the learned representations are used in DR inference. Then we discuss the challenge in updating negative representations in the ANN index during training and how we address it using asynchronous learning.
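ANCE: We use the standard dense retrieval model and loss function described in the last section:
$$f(q, d) = \text{sim}\big(g(q; \theta),\, g(d; \theta)\big), \qquad \theta^* = \arg\min_{\theta} \sum_{q} \sum_{d^+ \in D^+} \sum_{d^- \in D^-_{\text{ANCE}}} l\big(f(q, d^+),\, f(q, d^-)\big).$$
The only difference is the negatives used in training:
$$D^-_{\text{ANCE}} = \text{ANN}_{f(q,d)} \setminus D^+,$$
which are the top documents retrieved from the ANN index using the learned representation model $f()$, exactly the same as the inference from the learned DR model. This eliminates the gap between the learning and the application of the representation space.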
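The negative construction described above amounts to running retrieval with the current encodings and discarding the labeled positives. The sketch below assumes dot-product similarity and uses exhaustive search in place of an ANN library; `ance_negatives` and the toy data are hypothetical names for illustration.

```python
import numpy as np

def ance_negatives(query_vec, doc_matrix, positive_ids, top_k):
    """Top-k retrieved document ids for the query, minus the labeled positives."""
    scores = doc_matrix @ query_vec
    ranked = np.argsort(-scores, kind="stable")[:top_k]
    return [int(i) for i in ranked if int(i) not in positive_ids]

# Toy document encodings from the current model; document 1 is labeled relevant.
doc_matrix = np.array([[1.0, 0.0], [0.8, 0.6], [0.0, 1.0], [-1.0, 0.0]])
query_vec = np.array([0.6, 0.8])

negatives = ance_negatives(query_vec, doc_matrix, positive_ids={1}, top_k=3)
print(negatives)
```

Because the same scoring function ranks documents at training and at test time, the negatives returned here are, by construction, the ones the model currently confuses with the positives.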
Asynchronous Training: Since the training is almost always stochastic, the encoder in $f()$ is updated in each training batch. To update the representations used to construct ANCE negatives ($D^-_{\text{ANCE}}$), the following two steps are needed:
Inference: refresh the representations of all documents in the corpus with the new encoder,
Index: rebuild the ANN index using updated representations.
Although rebuilding the ANN index is efficiently implemented in recent libraries johnson2019billion, Inference is costly as it re-encodes the entire corpus. Doing so after every training batch is unrealistic in stochastic settings where the corpus is at a much bigger scale than the training batch size.
To overcome this, we propose Asynchronous ANCE training, which refreshes the ANN index used to construct $D^-_{\text{ANCE}}$ only after each checkpoint $f_k$, where consecutive checkpoints are $m$ training batches apart. As illustrated in Fig. 2, besides the Trainer job, we also maintain a parallel Inferencer job, which
takes the latest checkpoint of the representation model, e.g., $f_k$ at the ($mk$)-th training step,
encodes the entire corpus in parallel using $f_k$, while the Trainer keeps optimizing with $D^-_{\text{ANCE}}$ from the $\text{ANN}_{f_{k-1}}$ index at the last checkpoint;
reconstructs the ANN index ($\text{ANN}_{f_k}$) once the parallel inference finishes, and connects it with the Trainer to provide more up-to-date $D^-_{\text{ANCE}}$.
In this parallel process, the ANCE negatives ($D^-_{\text{ANCE}}$) are asynchronously updated to “catch up” with the stochastic training as soon as the Inferencer refreshes the ANN index. The asynchronous gap between the training and the negative construction depends on the allocation of computing resources between the Trainer and the Inferencer: one can choose to refresh the ANN index after every back-propagation step to get synchronous ANCE negatives, never refresh the ANN index to save compute, or settle somewhere in between. In experiments, we analyze this efficiency-effectiveness trade-off and its influence on training stability and retrieval accuracy.
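The staleness introduced by this schedule can be illustrated with a small single-process simulation. It is a sketch under simplifying assumptions, not the multi-GPU implementation: a checkpoint is taken every `m` steps, the Inferencer needs `encode_steps` trainer steps to re-encode the corpus and rebuild the index (with `encode_steps <= m`), and the initial index comes from checkpoint 0. All names are hypothetical.

```python
def index_checkpoint_per_step(total_steps, m, encode_steps):
    """For each training step, return the checkpoint whose encodings back the ANN index."""
    if encode_steps > m:
        raise ValueError("this sketch assumes the Inferencer finishes within m steps")
    history = []
    for step in range(total_steps):
        k = step // m                               # latest checkpoint taken so far
        ready = step >= m * k + encode_steps        # has the index for checkpoint k been built?
        history.append(k if (k == 0 or ready) else k - 1)
    return history

def max_staleness(history, m):
    """Largest gap, in steps, between the current step and the checkpoint behind the index."""
    return max(step - m * k for step, k in enumerate(history))

history = index_checkpoint_per_step(total_steps=8, m=4, encode_steps=2)
print(history, max_staleness(history, m=4))
```

In this toy setting the negatives lag the model by at most `m + encode_steps - 1` steps, which is the gap that a faster Inferencer or a shorter checkpoint interval would shrink.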
4 Experimental Methodologies
Benchmarks: Our experiments are mainly conducted on the TREC 2019 Deep Learning (DL) Track benchmark craswell2020overview. It includes the most recent, realistic, and standard large scale text retrieval datasets. The training and dev sets are passage relevance labels for one million Bing queries from MSMARCO bajaj2016ms. The testing sets are labeled by NIST assessors on the top 10 ranked results from past Track participants craswell2020overview. Our experiments follow the official settings of TREC DL Track and use both the passage and the document task. We mainly evaluate dense retrieval in the retrieval setting but also show the results of DR models as rerankers on the top 100 candidates from BM25. TREC DL official metrics include NDCG@10 on test and MRR@10 on MARCO Passage Dev. MARCO Document Dev is noisy and the recall on the DL Track testing is less meaningful due to low label coverage on DR results (more in Appendix A.1 and A.2).
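For readers less familiar with the two official metrics, the sketch below shows one common formulation of each for a single query. It assumes linear gain and a log2 rank discount for NDCG; official evaluation tools may differ in such details, and the function names are illustrative.

```python
import math

def mrr_at_k(ranked_ids, relevant_ids, k=10):
    """Reciprocal rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_ids:
            return 1.0 / rank
    return 0.0

def ndcg_at_k(ranked_ids, gains, k=10):
    """NDCG@k with linear gain; documents without a label count as gain 0."""
    dcg = sum(gains.get(d, 0) / math.log2(i + 2) for i, d in enumerate(ranked_ids[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

print(mrr_at_k(["a", "b", "c"], {"b"}))
print(ndcg_at_k(["x", "a"], {"a": 1}))
```

Note that unlabeled documents are treated as irrelevant in `ndcg_at_k`, which is why label coverage matters when a new retriever surfaces documents outside the judged pool.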
We also evaluate ANCE on the OpenQA benchmark used in a parallel work (DPR) karpukhin2020dense. It includes five OpenQA tasks: Natural Questions (NQ) kwiatkowski2019natural, TriviaQA joshi2017triviaqa, WebQuestions (WQ) berant2013semantic, CuratedTREC baudivs2015modeling, and SQuAD rajpurkar2016squad. At the time of our experiments, only the pre-processed NQ and TriviaQA data had been released (Footnote 2: https://github.com/facebookresearch/DPR). Our experiments use the two released tasks and inherit their retriever evaluation. The evaluation uses Coverage@20/100, which is whether the Top-20/100 retrieved passages include the answer karpukhin2020dense.
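Coverage@k as described above can be sketched as follows. The answer matching here is a plain case-insensitive substring test, which is an assumption made for brevity; the benchmark's own matching rules may normalize text differently. The function name and toy data are hypothetical.

```python
def coverage_at_k(passages_per_query, answers_per_query, k):
    """Fraction of queries whose top-k retrieved passages contain any gold answer."""
    hits = 0
    for passages, answers in zip(passages_per_query, answers_per_query):
        top = [p.lower() for p in passages[:k]]
        if any(ans.lower() in p for p in top for ans in answers):
            hits += 1
    return hits / len(passages_per_query)

retrieved = [
    ["Paris is the capital of France.", "Berlin is in Germany."],
    ["Cats are mammals.", "Dogs bark."],
]
answers = [["berlin"], ["dogs"]]
print(coverage_at_k(retrieved, answers, k=1), coverage_at_k(retrieved, answers, k=2))
```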
Sparse Retrieval Baselines: By keeping the settings consistent with TREC DL Track, our methods are directly comparable with all the TREC participating runs. We list the results of several runs that are most representative in this paper. The detailed descriptions of these runs and many other systems’ results can be found in Appendix A.1 and the Track overview paper craswell2020overview.
Dense Retrieval Baselines:
All DR baselines use the same BERT-Siamese (base) model as used in various parallel research luan2020sparsedense; gao2020complementing; ma2020zero; karpukhin2020dense. The DR baselines only vary in their mechanisms to construct the negative instances: random samples from the entire corpus or in batch (Rand Neg), random samples from BM25 top 100 (BM25 Neg) gao2020complementing, Noise Contrastive Estimation, which uses the highest scored negatives in batch (NCE Neg) gutmann2010noise, and the 1:1 combination of BM25 and Random negatives (BM25 + Rand Neg) luan2020sparsedense; karpukhin2020dense.
Participants in TREC DL found that the passage training labels are cleaner than the post-constructed document labels and lead to better results on the document task yanidst. Recent DR research also finds that including BM25 Negatives helps training convergence by providing stronger contrast for the representation learning luan2020sparsedense; karpukhin2020dense. In all our experiments on TREC DL, we include the “BM25 Warm Up” setting (BM25 → *), in which the representation model is first trained using MARCO official passage training triples from BM25 Negatives.
Our Methods and Implementation Details: ANCE uses the same BERT-Siamese model and only differs from the DR baselines in the training mechanism. To fit long documents in BERT-Siamese, we use the two settings from Dai et al. dai2019deeper: FirstP, where only the first 512 tokens of the document are used, and MaxP, where the document is split into 512-token passages (maximum 4) and the scores on these passages are max-pooled. The max-pooling operation is natively supported by ANN luan2020sparsedense with an overhead of four times more vectors in the index.
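The two document settings can be sketched as below. This is an illustration under assumptions: tokens are represented as a plain list, passage vectors are taken as given, and retrieval hits arrive as `(doc_id, score)` pairs, one per indexed passage. The helper names are hypothetical.

```python
import numpy as np

def firstp_tokens(tokens, passage_len=512):
    """FirstP: keep only the first passage_len tokens of the document."""
    return tokens[:passage_len]

def split_passages(tokens, passage_len=512, max_passages=4):
    """MaxP: split the document into consecutive passages, keeping at most max_passages."""
    chunks = [tokens[i:i + passage_len] for i in range(0, len(tokens), passage_len)]
    return chunks[:max_passages]

def maxp_score(query_vec, passage_vecs):
    """Document score under MaxP: the best dot product over its passages."""
    return float(np.max(passage_vecs @ query_vec))

def maxp_by_document(hits):
    """Collapse passage-level hits (doc_id, score) to one max score per document."""
    best = {}
    for doc_id, score in hits:
        best[doc_id] = max(score, best.get(doc_id, float("-inf")))
    return best

lengths = [len(p) for p in split_passages(list(range(1300)))]
print(lengths, maxp_by_document([("d1", 0.2), ("d2", 0.5), ("d1", 0.7)]))
```

Indexing every passage vector and keeping the best hit per document is what makes max-pooling compatible with nearest-neighbor search, at the price of up to four vectors per document.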
Our ANN search uses the Faiss IndexFlatIP index johnson2019billion. We implemented the parallel training and ANCE index refreshing upon Faiss and plan to include it in our code release. To reduce the computing cost required to navigate from randomly initialized representations, we first warm up all BERT-Siamese models using the standard RoBERTa (base) and then continue ANCE training on TREC DL using the BM25 Warm Up setting. On OpenQA, we start ANCE from the released DPR checkpoints karpukhin2020dense.
Our main ANCE setting uses 1:1 training:index refreshing GPU allocation, 1:1 positive-negative with the negative documents sampled from ANN top 200, index refreshing at every 10k training batches, batch size 8, and gradient accumulation step 2 on 4 GPUs. We measured ANCE efficiency in Table 4 using a single 32GB V100 GPU, on an Azure VM containing an Intel(R) Xeon(R) Platinum 8168 CPU and 650GB of RAM. More details of our implementation can be found in Appendix A.1 and our upcoming code release.
5 Evaluation Results
| Method | MARCO Dev Passage MRR@10 | MARCO Dev Passage Recall@1k | TREC DL Passage NDCG@10 (Rerank) | TREC DL Passage NDCG@10 (Retrieval) | TREC DL Document NDCG@10 (Rerank) | TREC DL Document NDCG@10 (Retrieval) |
| Sparse & Cascade IR |
| Best DeepCT dai2019context | 0.243 | n.a. | – | n.a. | – | 0.554 |
| Best TREC Trad Retrieval | 0.240 | n.a. | – | 0.554 | – | 0.549 |
| Best TREC Trad LeToR | – | – | 0.556 | – | 0.561 | – |
| BERT Reranker nogueira2019passage | – | – | 0.742 | – | 0.646 | – |
| NCE Neg gutmann2010noise | 0.256 | 0.943 | 0.602 | 0.539 | 0.618 | 0.542 |
| BM25 Neg gao2020complementing | 0.299 | 0.928 | 0.664 | 0.591 | 0.626 | 0.529 |
| BM25 + Rand Neg karpukhin2020dense; luan2020sparsedense | 0.311 | 0.952 | 0.653 | 0.600 | 0.629 | 0.557 |
| BM25 → NCE Neg | 0.279 | 0.942 | 0.608 | 0.571 | 0.638 | 0.564 |
| BM25 → BM25 + Rand | 0.306 | 0.939 | 0.648 | 0.591 | 0.626 | 0.540 |
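| Retriever | Single Task NQ Top-20 | Single Task NQ Top-100 | Single Task TriviaQA Top-20 | Single Task TriviaQA Top-100 | Multi Task NQ Top-20 | Multi Task NQ Top-100 | Multi Task TriviaQA Top-20 | Multi Task TriviaQA Top-100 |
| BM25 + DPR | 76.6 | 83.8 | 79.8 | 84.5 | 78.0 | 83.9 | 79.9 | 84.4 |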
This section first presents the evaluations on ANCE effectiveness and efficiency. Then we study the influences of the asynchronous learning. More evaluations can be found in the Appendix.
5.1 Effectiveness of ANCE Dense Retrieval
The results on the TREC Deep Learning Track benchmarks are presented in Table 1. ANCE empowers dense retrieval to significantly outperform all sparse retrieval baselines in all evaluation metrics. Without using any sparse bag-of-words in retrieval, ANCE leads to 20%+ relative NDCG gains over BM25 and significantly outperforms DeepCT, which uses BERT to optimize sparse retrieval dai2019transformer.
Among the learning mechanisms used in DR, the contemporary method that uses the combination of BM25 + Random Negatives luan2020sparsedense; gao2020complementing; ma2020zero; karpukhin2020dense outperforms sparse retrieval in passage retrieval. However, as also observed in several concurrent studies luan2020sparsedense; ma2020zero, the resulting DR models are no better than tuned traditional retrieval (Best TREC Trad Retrieval) on long documents, where the term frequency signals are more robust. ANCE is the only method that elevates the same BERT-Siamese architecture to robustly exceed the sparse methods in document retrieval. It also convincingly surpasses the concurrent DR models in passage retrieval on OpenQA benchmarks, as shown in Table 2.
When reranking documents, ANCE-learned BERT-Siamese outperforms the interaction-based BERT Reranker (0.671 NDCG versus 0.646). This overthrows a previously-held belief that it is necessary to capture the interactions between the discrete query and document terms xiong2017knrm; qiao2019understanding. With ANCE, it is now feasible to learn a representation space that captures the finesse in search relevance. Solely using the first-stage retrieval, ANCE nearly matches the accuracy of the cascade retrieval-and-reranking pipeline (BERT Reranker) – with effective representation learning, dot product is all you need.
| Operation | Offline | Online |
| Sparse & Cascade IR |
| BM25 Index Build | 3h | – |
| Cascade Total (BM25 + BERT) | – | 1.42s |
| ANN Dense IR |
| Per Document Encoding | 4.5ms | – |
| ANN Retrieval (batched q) | – | 9ms |
| Dense Retrieval Total | – | 11.6ms |
| Encoding of the Training Corpus | 10h | – |
| ANN Index Build | 10s | – |
| ANCE Neg Construction Per Batch | 72ms | – |
| Back Propagation Per Batch | 19ms | – |
5.2 Efficiency of ANCE Retrieval and Learning
Table 4 measures the efficiency of sparse retrieval and ANN dense retrieval. The latter uses the ANN (FirstP) setting on TREC DL Track document retrieval. These numbers may vary in different environments.
Impressively, ANCE DR with standard batching takes only 11.6 ms per query, a 100x speed-up with nearly on-par accuracy compared to BERT Rerank. This is a natural advantage of dense retrieval: only the Query Encoding and ANN Retrieval need to be performed online. Encoding one short query is efficient, while ANN Retrieval enjoys the advantages of fast approximate search johnson2019billion. The document encoding can be done offline (e.g., at the crawling or indexing phase) and takes only 4.5ms per document. This leads to a remarkable return on investment (ROI) in computing resources and engineering: the 1.42s latency of BERT Rerank is prohibitive in many production systems and makes distillation or complicated caching necessary, while ANCE is just a dot product.
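As a rough check of the quoted speed-up, the two online totals reported in the efficiency table (1.42s for the cascade pipeline and 11.6ms for dense retrieval) give:

```latex
\frac{1.42\,\text{s}}{11.6\,\text{ms}} = \frac{1420\,\text{ms}}{11.6\,\text{ms}} \approx 122
```

which is consistent with the rounded 100x figure, keeping in mind that such latencies may vary across environments.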
The quantification of the ANCE training time reveals the main efficiency bottleneck is the encoding of the training corpus, which is to refresh the encoding of the entire corpus with the newly updated representation model. In general, it is not feasible to refresh the representation of the entire corpus to select perfectly up-to-date negatives after each training batch, because the corpus is orders of magnitude larger than one training batch, and a forward pass in the neural network is only linearly more efficient than backward. We address this efficiency bottleneck using asynchronous Trainer and Inferencer updates. The next experiment studies the influence of it.
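For a sense of scale, the two training-time entries in the efficiency table can be compared directly, under the assumption that both were measured on the same hardware:

```latex
\frac{10\,\text{h}}{19\,\text{ms}} = \frac{3.6 \times 10^{7}\,\text{ms}}{19\,\text{ms}} \approx 1.9 \times 10^{6}
```

That is, one full re-encoding of the training corpus costs roughly as much wall-clock time as 1.9 million back-propagation steps, which illustrates why refreshing after every batch is impractical.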
5.3 Representation Learning with ANCE
In this experiment, we first demonstrate the main advantage of ANCE in providing realistic training negatives. Then we study the influence of delayed updates in the asynchronous learning.
Fig. 4 shows the overlap of negatives used in training versus those seen in final testing. We measure the overlap throughout the learning process using the same set of sampled dev queries. As in Figure 1, which illustrates the ANCE-learned representations for the query “what is the most popular food in Switzerland”, there is very low overlap (<20%) between the BM25 negatives or Random negatives and the negatives from their corresponding trained DR models. The discrepancy between the training and testing candidate distributions risks optimizing DR models to undesired local minima.
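One simple way to quantify such an overlap for a single query is sketched below. The exact definition used for the figure is not spelled out in the text, so this set-based ratio, along with the function name and toy ids, is an assumption for illustration.

```python
def negative_overlap(train_negatives, test_negatives):
    """Fraction of test-time negatives that also appear among the training negatives."""
    test = set(test_negatives)
    if not test:
        return 0.0
    return len(test & set(train_negatives)) / len(test)

# Hypothetical ids: negatives sampled for training vs. top-ranked irrelevant documents at test time.
bm25_train_negatives = ["d1", "d2", "d3", "d4"]
dr_test_negatives = ["d2", "d9", "d7", "d8"]
print(negative_overlap(bm25_train_negatives, dr_test_negatives))
```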
ANCE eliminates this discrepancy. The non-perfect overlap at the beginning is merely because the representation is still being learned. The retrieval of training negatives and testing documents are equivalent, subject to a small delay from the async Inferencer. By simply aligning the training distribution with testing, ANCE unleashes the power of representation learning in dense retrieval.
Fig. 5 illustrates the behavior of asynchronous learning with different configurations. A large learning rate or a low refreshing rate (Figure 5(a) and 5(b)) leads to fluctuations, as the async gap of the ANN index may drive the representation learning to undesired local optima. Refreshing as often as every 5k batches yields a smooth convergence (Figure 5(c)), but requires twice as many GPUs allocated to the Inferencer. We found that a 1:1 allocation of Trainer and Inferencer GPUs, at an appropriate learning rate, leads to an asynchronous learning process adequate to train effective representations for dense retrieval.
More ablation studies, retrieval results, and case studies are included in the Appendix.
6 Related Work
In neural information retrieval, neural ranking models are categorized into representation-based and interaction-based, depending on whether they represent query and document separately, or model the interactions between discrete term pairs guo2016deep. BERT Reranker is interaction-based as the self-attention is applied on all term pairs, while BERT-Siamese is representation-based. Previous research found interaction-based models more effective as they capture the relevance match between all query-document terms dai2019deeper; xiong2017knrm; qiao2019understanding; guo2016deep. However, the effectiveness of interaction-based models is only available at the reranking stage, as the model needs to go through each query and candidate document pair ahmad2019reqa. Their efficiency also becomes a concern when pretrained models are used humeau2020poly; macavaney2020efficient.
Recently, researchers revisited the representation-based model with BERT for dense retrieval. Progress includes the BERT dual-encoder latent retrieval model lee2019latent and customized pretraining chang2020pre, among others. Promising effectiveness has been achieved on OpenQA passage retrieval tasks, where passages are shorter and questions are cleaner karpukhin2020dense; ahmad2019reqa. On documents, the effectiveness of dense retrieval was more underwhelming, and it was considered more as an add-on to sparse retrieval luan2020sparsedense; gao2020complementing; ma2020zero.
Constructing stronger training negatives is a rapidly growing topic in representation learning. Especially in contrastive learning for visual representations oord2018representation, remarkable progress has been made in the past year, for example, SimCLR chen2020simple, MoCo he2019momentum, and MoCo V2 chen2020improved. These methods are also rooted in Noise Contrastive Estimation gutmann2010noise; mnih2013learning, but their technical choices are different from ANCE, as visual representation learning does not have a natural query and sparse retrieval to start with.
Technically, maintaining a parallelly updated ANN index during learning is also done in REALM, but there it is used to retrieve background information in language model pretraining guu2020realm. Our open-source solution can also be used by the community to conduct REALM-style pretraining.
ANCE fundamentally eliminates the discrepancy between the representation learning of texts and their usage in dense retrieval. Our ANCE-trained dense retrieval model, the vanilla BERT-Siamese, convincingly outperforms all dense retrieval and sparse retrieval baselines in our large scale document retrieval and passage retrieval experiments. It nearly matches the ranking accuracy of the state-of-the-art cascade sparse retrieval and BERT reranking pipeline. More importantly, all these advantages are achieved with a standard transformer encoder at 1% of the online inference latency, using a simple dot product in the ANCE-learned representation space.
For the past decades, we in the academic community have joked that every year we made 10% progress upon BM25, but it had always been 10% upon the same BM25; the techniques developed require more and more IR domain knowledge that might be unfamiliar to researchers in other related fields. For example, in OpenQA, document retrieval was often done with vanilla BM25 instead of the well-tuned BM25F, query expansion, or SDM. In industry, many places build their search solutions upon open-source solutions, such as Lucene and ElasticSearch, where BM25, a technique invented in the 1970s and 1980s, was incorporated as late as 2015 elasticbm25; the required expertise, complex infrastructure, and computing resources cause many to miss out on the benefits of Neu-IR.
With their effectiveness, efficiency, and simplicity, ANCE and dense retrieval have the potential to redefine the next stage of information systems and provide broader impacts in many fronts.
Empower User with Better Information Access: The effectiveness of DR is particularly prominent for exploratory or knowledge acquisition information needs. Formulating good queries that have term overlap with the target documents often requires certain domain knowledge, which is a barrier for users trying to learn new information. A medical expert trying to learn how to build a small search functionality on her patient’s medical records may not be aware of the terminology “BM25” and “Dense Retrieval”. By matching user’s information need and the target information in a learned representation space, ANCE has the potential to overcome this language barrier and empower users to achieve more in their daily interactions with search engines.
Reduce Computing Cost and Energy Consumption in Neural Search Stack: The nature of dense retrieval makes it straightforward to conduct most of the costly operations offline and reuse the pre-computed document vectors. This leads to 100x better efficiency and will significantly reduce the hardware cost and energy consumption needed when serving deep pretrained models online. We consider this a solid step towards carbon neutrality in the search stack.
Democratize the Benefit of Neural Techniques: Building, maintaining, and serving a cascade IR pipeline with advanced pretrained models is daunting and may not lead to good ROI for many companies not in the web search business. In comparison, the simple dot product operation in a mostly pre-computed representation space is much more accessible. Faiss and many other libraries provide easy-to-access solutions for efficient ANN retrieval; our (to be) released pretrained encoders and ANCE open-source solution will fill in the effectiveness part. Together we will democratize the recent revolutions in neural information retrieval to a much broader audience and end-users.
Appendix A
A.1 More Implementation Details
More Details on TREC Deep Learning Benchmarks: There are two tasks in the Track: document retrieval and passage retrieval. The training and development sets are from MS MARCO, which includes passage level relevance labels for one million Bing queries [bajaj2016ms]. The document corpus was post-constructed by back-filling the body texts of the passage’s URLs and their labels were inherited from its passages [craswell2020overview].
There is a two-year gap between the construction of the passage training data and the back-filling of their full document content. Some original documents were no longer available. There is also a decent amount of content changes in those documents during the two-year gap, and many no longer contain the passages. This back-filling perhaps is the reason why many Track participants found the passage training data is more effective than the inherited document labels. Note that the TREC testing labels are not influenced as the annotators were provided the same document contents when judging.
All the TREC DL runs are trained using these training data. Their inference results on the testing queries of the document and the passage retrieval tasks were evaluated by NIST assessors in the standard TREC-style pooling technique [voorhees2000variations]. The pooling depth is set to 10, that is, the top 10 ranked results from all participated runs are evaluated, and these evaluated labels are released as the official TREC DL benchmarks for passage and document retrieval tasks.
More Details on Baselines: The most representative sparse retrieval baselines in TREC DL include the standard BM25 (“bm25base” or “bm25base_p”), Best TREC Sparse Retrieval (“bm25tuned_rm3” or “bm25tuned_prf_p”) with tuned query expansion [lavrenko2017relevance], and Best DeepCT (“dct_tp_bm25e2”, doc only), which uses BERT to estimate the term importance for BM25 [dai2019context]. These three runs represent the standard sparse retrieval, best classical sparse retrieval, and the recent progress of using BERT to improve sparse retrieval.
We also include two cascade retrieval-and-reranking systems: Best TREC LeToR (“srchvrs_run1” or “srchvrs_ps_run3”), which is the best feature-based learning to rank in the Track, and BERT Reranker (“bm25exp_marcomb” or “p_exp_rm3_bert”), which is the best run using standard BERT on top of query/doc expansion, from the groups with multiple top MARCO runs [nogueira2019passage, nogueira2019document].
BERT-Siamese Configurations: We follow the network configurations in Luan et al. [luan2020sparsedense] in all Dense Retrieval methods, which we found provides the most stable results. More specifically, we initialize the BERT-Siamese model with RoBERTa base [liu2019roberta] and add a projection layer on top of the last layer’s “[CLS]” token, followed by a layer norm.
Training Details: Training often takes about 1-2 hours per ANCE epoch. An ANCE epoch ends whenever a new set of ANCE negatives is ready: the new negatives immediately replace the existing ones in training, without waiting. Training converges in about 10 epochs, similar to other DR baselines. The optimization uses the LAMB optimizer, a learning rate of 5e-6 for document retrieval and 1e-6 for passage retrieval, and linear warm-up and decay after 5000 steps. More detailed hyperparameter settings can be found in our code release.
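One plausible reading of the learning rate schedule is sketched below: a linear warm-up over the first 5000 steps, followed by a linear decay to zero. This reading is an assumption, as are the function name `linear_warmup_decay` and the `total_steps` value, which is not stated in the text; the code release remains the reference for the actual settings.

```python
def linear_warmup_decay(step, base_lr, warmup_steps=5000, total_steps=100_000):
    """Learning rate at `step` under linear warm-up then linear decay.

    `total_steps` is a hypothetical training length, not a reported value.
    """
    if step < warmup_steps:
        # Ramp linearly from 0 to base_lr during warm-up.
        return base_lr * step / warmup_steps
    # Decay linearly from base_lr to 0 over the remaining steps.
    remaining = max(total_steps - step, 0)
    return base_lr * remaining / (total_steps - warmup_steps)


# Passage retrieval uses a base learning rate of 1e-6 in the text above.
lr_mid_warmup = linear_warmup_decay(2500, base_lr=1e-6)
lr_peak = linear_warmup_decay(5000, base_lr=1e-6)
lr_end = linear_warmup_decay(100_000, base_lr=1e-6)
```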
a.2 Coverage of TREC 2019 DL Track Labels on Dense Retrieval Results
By the nature of TREC-style pooling evaluation, only the results ranked in the top 10 by the 2019 TREC participating systems were labeled. As a result, documents not in the pool, and thus not labeled, are all considered irrelevant, even though there may be relevant ones among them. When reusing TREC-style relevance labels, it is very important to keep track of the “hole rate” of the evaluated systems, i.e., the fraction of the top K ranked results without TREC labels (not in the pool). A larger hole rate shows that the evaluated method is very different from the systems that participated in the Track and contributed to the pool, and thus that the evaluation results are less reliable. Note that the hole rate does not necessarily reflect the accuracy of a system, only how much it differs from the pooled systems.
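The hole rate defined above can be computed per query as in the following minimal sketch. The function name `hole_rate` and the toy document ids are illustrative assumptions; `judged` stands for the set of documents that received a TREC label for the query.

```python
def hole_rate(ranked, judged, k=10):
    """Fraction of the top-k ranked documents that have no relevance label."""
    top_k = ranked[:k]
    if not top_k:
        return 0.0
    # A "hole" is a retrieved document that was never in the judging pool.
    holes = sum(1 for doc_id in top_k if doc_id not in judged)
    return holes / len(top_k)


# Toy example: two of the top five results were never judged.
ranked = ["d7", "d2", "d9", "d4", "d1"]
judged = {"d1", "d2", "d3", "d4"}
rate = hole_rate(ranked, judged, k=5)
```

Averaging this value over all evaluation queries gives a Hole@K figure for a system.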
In the TREC 2019 Deep Learning Track, all the participating systems are based on sparse retrieval. Dense retrieval methods often differ considerably from sparse retrieval and in general retrieve many new documents. This is confirmed in Table 3. All DR methods have very low overlap with the official BM25 in their top 100 retrieved documents: at most, only 25% of the documents retrieved by DR are also retrieved by BM25. This makes the hole rate quite high and the recall metric not very informative. It also suggests that DR methods might benefit more in this year’s TREC 2020 Deep Learning Track if participants contribute DR-based systems.
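The overlap statistic can be illustrated with a short sketch. The exact definition used for Table 3 is not spelled out in the text, so the version below, the share of a system's top-k documents that also appear in the reference system's top-k, is an assumption, as are the function name `overlap_at_k` and the toy rankings.

```python
def overlap_at_k(system_ranked, reference_ranked, k=100):
    """Share of the system's top-k documents also in the reference top-k."""
    system_top = set(system_ranked[:k])
    reference_top = set(reference_ranked[:k])
    if not system_top:
        return 0.0
    return len(system_top & reference_top) / len(system_top)


# Toy example: one of four dense results is also retrieved by BM25.
dense_ranked = ["d1", "d2", "d3", "d4"]
bm25_ranked = ["d3", "d9", "d8", "d7"]
overlap = overlap_at_k(dense_ranked, bm25_ranked, k=4)
```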
The MS MARCO ranking labels were not constructed by pooling sparse retrieval results but were derived from Bing [bajaj2016ms], which includes many signals beyond term overlap. This makes the recall metric in MS MARCO more robust, as it reflects how well a single model can recover a complex online system.
|TREC DL Passage||TREC DL Document|
|Method||Recall@1K||Hole@10||Overlap w. BM25||Recall@100||Hole@10||Overlap w. BM25|
|BM25 + Rand Neg||0.662||20.2%||16.4%||0.240||21.4%||21.0%|
a.3 Hyperparameter Studies
|Hyperparameter||MARCO Dev Passage||TREC DL Document|
|Learning rate||Top K Neg||Refresh (step)||Retrieval MRR@10||Retrieval NDCG@10|
We show the results of some hyperparameter configurations in Table 4. The cost of training with BERT makes it difficult to conduct a more detailed hyperparameter exploration, and a failed configuration often leads to divergence in training loss. Our DR model architecture is kept consistent with recent parallel work, and the learning configurations in Table 4 are about all the explorations we did. Most of the hyperparameter choices are decided solely using the training loss curve and otherwise by the loss on the MARCO Dev set. We found that the training loss, validation NDCG, and testing performance align well in our (limited) hyperparameter explorations.
a.4 Case Studies
In this section, we show Win/Loss case studies between ANCE and BM25. Among the 43 TREC 2019 DL Track evaluation queries in the document task, ANCE outperforms BM25 on 29 queries, loses on 13 queries, and ties on the remaining one. The winning examples are shown in Table 5 and the losing ones are in Table 6. Their corresponding ANCE-learned (FirstP) representations are illustrated by t-SNE in Fig. 6 and Fig. 7.
In general, we found ANCE better captures the semantics in the documents and their relevance to the query. The winning cases show the intrinsic limitations of sparse retrieval. For example, BM25 exactly matches “most popular food” in the query “what is the most popular food in Switzerland”, but the document is about Mexico. The term “Switzerland” only appears in the related questions section of the web page.
The losing cases in Table 6 are also quite interesting. In many cases, DR does not fail completely by retrieving documents unrelated to the query’s information needs, which was a big concern when we started research in DR. Instead, the errors ANCE made include retrieving documents that are related but not exactly relevant to the query, for example, “yoga pose” for “bow in yoga”. In other cases, ANCE retrieved wrong documents due to the lack of domain knowledge: the pretrained language model may not know that “active margin” is a geographical term, not a financial one (which we did not know ourselves and took some time to figure out when conducting this case study). There are also some cases where the dense retrieved documents do make sense but were labeled irrelevant due to noise in the labels.
The t-SNE plots in Fig. 6 and Fig. 7 also show many interesting patterns of the learned representation space. The ANCE winning cases often correspond to clear separations of different document groups, while the losing cases are those where the representation space is more mixed, or where there are too few relevant documents, which may cause variance in model performance. There are also many different patterns in the ANCE-learned representation space, which we found quite interesting.
We include the t-SNE plots for all 43 TREC DL Track queries in our open-source repository (attached in the supplementary material).
Further analyses of the learned patterns in the representation space may help provide more insights into dense retrieval.
| ||ANCE||BM25|
|Query:||qid (104861): Cost of interior concrete flooring|
|Title:||Concrete network: Concrete Floor Cost||Pinterest: Types of Flooring|
|Snippet:||For a concrete floor with a basic finish, you can expect to pay $2 to $12 per square foot…||Know About Hardwood Flooring And Its Types White Oak Floors Oak Flooring Laminate Flooring In Bathroom …|
|TREC Label:||3 (Very Relevant)||0 (Irrelevant)|
|Query:||qid (833860): What is the most popular food in Switzerland|
|Title:||Wikipedia: Swiss cuisine||Answers.com: Most popular traditional food dishes of Mexico|
|Snippet:||Swiss cuisine bears witness to many regional influences, … Switzerland was historically a country of farmers, so traditional Swiss dishes tend not to be…||One of the most popular traditional Mexican deserts is a spongy cake … (in the related questions section) What is the most popular food dish in Switzerland?…|
|TREC Label:||3 (Very Relevant)||0 (Irrelevant)|
|Query:||qid (1106007): Define visceral|
|Title:||Vocabulary.com: Visceral||Quizlet.com: A&P EX3 autonomic 9-10|
|Snippet:||When something’s visceral, you feel it in your guts. A visceral feeling is intuitive — there might not be a rational explanation, but you feel that you know what’s best…||Acetylcholine A neurotransmitter liberated by many peripheral nervous system neurons and some central nervous system neurons…|
|TREC Label:||3 (Very Relevant)||0 (Irrelevant)|
|Query:||qid (182539): Example of monotonic function|
|Title:||Wikipedia: Monotonic function||Explain Extended: Things SQL needs: sargability of monotonic functions|
|Snippet:||In mathematics, a monotonic function (or monotone function) is a function between ordered sets that preserves or reverses the given order… For example, if y=g(x) is strictly monotonic on the range [a,b] …||I’m going to write a series of articles about the things SQL needs to work faster and more efficienly…|
|TREC Label:||0 (Irrelevant)||2 (Relevant)|
|Query:||qid (1117099): What is a active margin|
|Title:||Wikipedia: Margin (finance)||Yahoo Answer: What is the difference between passive and active continental margins|
|Snippet:||In finance, margin is collateral that the holder of a financial instrument …||An active continental margin is found on the leading edge of the continent where …|
|TREC Label:||0 (Irrelevant)||3 (Very Relevant)|
|Query:||qid (1132213): How long to hold bow in yoga|
|Title:||Yahoo Answer: How long should you hold a yoga pose for||yogaoutlet.com: How to do bow pose in yoga|
|Snippet:||so i’ve been doing yoga for a few weeks now and already notice that my flexiablity has increased drastically. …That depends on the posture itself …||Bow Pose is an intermediate yoga backbend that deeply opens the chest and the front of the body…Hold for up to 30 seconds …|
|TREC Label:||0 (Irrelevant)||3 (Very Relevant)|