Dense retrieval systems first encode queries and documents into a dense embedding space and then perform information retrieval by finding a query’s nearest neighbors in the embedding space (Lee et al., 2019; Reimers and Gurevych, 2020; Karpukhin et al., 2020; Xiong et al., 2021). With the advancement of pre-trained language models (Devlin et al., 2019; Lu et al., 2021), dedicated training strategies (Karpukhin et al., 2020; Xiong et al., 2021), and efficient nearest neighbor search (Johnson et al., 2021; Guo et al., 2020), dense retrieval systems have shown effectiveness in a wide range of tasks, including web search (Nguyen et al., 2016), open domain question answering (Kwiatkowski et al., 2019), and zero-shot IR (Thakur et al., 2021).
Retrieval with dense, fully-learned representations has the potential to address some fundamental challenges in sparse retrieval. For example, vocabulary mismatch can be solved if the embeddings accurately capture the information need behind a query and map it to relevant documents. However, decades of IR research demonstrate that inferring a user’s search intent from a concise and often ambiguous search query is challenging (Croft et al., 2010). Even with powerful pre-trained language models, it is unrealistic to expect an encoder to perfectly embed the underlying information need from a few query terms.
A common technique to improve query understanding in sparse retrieval systems is pseudo relevance feedback (PRF) (Croft et al., 2010; Xu and Croft, 1996; Lavrenko and Croft, 2001), which uses the top retrieved documents from an initial search as additional information to enrich the query representation. Whether PRF information is used via query expansion (Xu and Croft, 1996; Jaleel et al., 2004) or query term reweighting (Bendersky et al., 2011), its efficacy has been consistently observed across various search scenarios, rendering PRF a standard practice in many sparse retrieval systems.
This work leverages PRF information to improve query representations in dense retrieval. Given the top retrieved documents from a dense retrieval model, e.g., ANCE (Xiong et al., 2021), we build a PRF query encoder, ANCE-PRF, that uses a BERT encoder (Devlin et al., 2019) to consume the query and the PRF documents to refine the query representation. ANCE-PRF is trained end-to-end using relevance labels and learns to optimize the query embeddings using the rich information from PRF documents. It reuses the document index from ANCE to avoid duplicating index storage.
In experiments on MS MARCO and TREC Deep Learning (DL) Track passage ranking benchmarks, ANCE-PRF is consistently more accurate than ANCE and several recent dense retrieval systems that use more sophisticated models and training strategies (Zhan et al., 2020; Luan et al., 2021). We also observe large improvements on DL-HARD (Mackie et al., 2021) queries, a set curated to include complex search intents that are challenging for neural systems. To the best of our knowledge, ANCE-PRF is among the best performing first-stage retrieval systems on the highly competitive MARCO passage ranking leaderboard.
Our studies confirm that the advantages of ANCE-PRF reside in its ability to leverage the useful information from the PRF documents while ignoring the noise from irrelevant PRF documents.
The PRF encoder allocates substantially more attention to terms from the relevant PRF documents, compared to those from the irrelevant documents.
A case study shows that the encoder focuses more on PRF terms that are complementary to the query terms in representing search intents.
These help ANCE-PRF learn better query embeddings that are closer to the relevant documents and improve the majority of testing queries. Our code, checkpoints, and ranking results are open-sourced at https://github.com/yuhongqian/ANCE-PRF.
2. Related Work
In dense retrieval systems, queries and documents are encoded by a dual-encoder, often BERT-based, into a shared embedding space (Karpukhin et al., 2020; Xiong et al., 2021; Luan et al., 2021; Khattab and Zaharia, 2020). Recent research in dense retrieval mainly focuses on improving the training strategies, especially negative sampling, including random in-batch sampling (Xiong et al., 2021), sampling from BM25 top negatives (Lee et al., 2019; Gao et al., 2021), sampling from an asynchronously updated hard negatives index (Xiong et al., 2021), and constructing hard negatives using the document index from an existing dense retrieval model (Zhan et al., 2020) or reranking models (Hofstätter et al., 2021; Lin et al., 2020). Most dense retrieval systems encode a document using a constant number of embedding vectors (Karpukhin et al., 2020; Xiong et al., 2021; Luan et al., 2021), often one per document. There are also approaches using one vector per document token (Khattab and Zaharia, 2020), similar to the interaction-based neural IR approaches (Guo et al., 2016; Xiong et al., 2017). In this work, we focus on models that only use one vector per document, whose retrieval efficiency is necessary for real production systems (Xiong et al., 2021).
In recent research, PRF information has been leveraged by neural networks to combine feedback relevance scores (Li et al., 2018), modify query-document interaction using encoded feedback documents (Ai et al., 2018; Chen et al., 2021), or learn contextualized query-document interactions (Yu et al., 2021). A parallel work (Wang et al., 2021) expands multi-vector query representations with feedback embeddings extracted using a clustering technique.
A typical dense retrieval system encodes the query and the document using a BERT-style encoder and then calculates the matching score using simple similarity metrics:

$$f(q, d) = \text{BERT}_{q}(q) \cdot \text{BERT}_{d}(d), \tag{1}$$

where $\text{BERT}_{q}$ and $\text{BERT}_{d}$ respectively output their final layer [CLS] embeddings as the query and the document embeddings. The model in Eq. (1) is fine-tuned using standard ranking losses and with various negative sampling techniques (Karpukhin et al., 2020; Luan et al., 2021). The initial retrieval system this work uses, ANCE, conducts negative sampling from an asynchronously updated document index (Xiong et al., 2021).
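As an illustration of the scoring and retrieval step around Eq. (1), the following minimal Python sketch ranks documents by the dot product between a query embedding and pre-computed document embeddings. The dot product is one assumed choice of simple similarity metric, the exhaustive scan stands in for an efficient nearest neighbor index, and the toy vectors and the names `dot` and `retrieve` are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product similarity between two embeddings."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_emb: Vector, doc_embs: List[Vector], top_k: int) -> List[Tuple[int, float]]:
    """Exhaustive nearest neighbor search: score every document, keep the top_k."""
    scored = [(doc_id, dot(query_emb, emb)) for doc_id, emb in enumerate(doc_embs)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional vectors standing in for [CLS] embeddings (hypothetical values).
doc_embs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
query_emb = [0.9, 0.1, 0.0]
ranking = retrieve(query_emb, doc_embs, top_k=2)
print(ranking)
```

Because documents are encoded independently of the query, `doc_embs` can be computed once offline; only the query embedding is produced at search time.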
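ANCE-PRF leverages PRF documents retrieved by ANCE to enrich query representations. Given the top $k$ documents $d_1, \dots, d_k$ from ANCE, ANCE-PRF trains a new PRF query encoder $\text{BERT}_{\text{prf}}$ to output the query embedding $\mathbf{q}^{\text{prf}}$:

$$\mathbf{q}^{\text{prf}} = \text{BERT}_{\text{prf}}\left(\text{[CLS]} \circ q \circ \text{[SEP]} \circ d_1 \circ \text{[SEP]} \circ \cdots \circ d_k \circ \text{[SEP]}\right). \tag{2}$$

It then conducts another retrieval with PRF embeddings, scoring each document embedding $\mathbf{d}$ as:

$$f^{\text{prf}}(q, d) = \mathbf{q}^{\text{prf}} \cdot \mathbf{d}. \tag{3}$$

The training uses the standard negative log-likelihood loss:

$$\mathcal{L} = -\log \frac{\exp\left(\mathbf{q}^{\text{prf}} \cdot \mathbf{d}^{+}\right)}{\exp\left(\mathbf{q}^{\text{prf}} \cdot \mathbf{d}^{+}\right) + \sum_{\mathbf{d}^{-}} \exp\left(\mathbf{q}^{\text{prf}} \cdot \mathbf{d}^{-}\right)}, \tag{4}$$

where $\mathbf{d}^{+}$ and $\mathbf{d}^{-}$ are embeddings of relevant and irrelevant documents. ANCE-PRF uses document embeddings from the initial dense retrieval model to avoid maintaining a separate document index for PRF. Therefore, only $\text{BERT}_{\text{prf}}$ is newly learned.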
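The training objective can be sketched in a few lines of Python. This is a minimal illustration, assuming the negative log-likelihood is taken over a softmax of dot-product scores between the PRF query embedding, one relevant document, and a set of sampled irrelevant documents; the function name `prf_nll_loss` and the toy vectors are hypothetical.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def prf_nll_loss(q_prf: Vector, d_pos: Vector, d_negs: List[Vector]) -> float:
    """Negative log-likelihood of the relevant document under a softmax
    over the scores of one relevant and several irrelevant documents."""
    pos_score = dot(q_prf, d_pos)
    scores = [pos_score] + [dot(q_prf, d) for d in d_negs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    max_score = max(scores)
    log_normalizer = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
    return log_normalizer - pos_score


# Document embeddings stay fixed; only the PRF query embedding changes.
d_pos, d_negs = [1.0, 0.0], [[0.0, 1.0]]
loss_uninformed = prf_nll_loss([0.0, 0.0], d_pos, d_negs)  # equals log(2)
loss_aligned = prf_nll_loss([2.0, 0.0], d_pos, d_negs)     # query moved toward d_pos
print(loss_uninformed, loss_aligned)
```

Since the document embeddings are treated as constants, lowering this loss can only be achieved by moving the query embedding, which matches the statement that only the PRF query encoder is newly learned.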
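Eq. (4) trains the query encoder to identify the relevant PRF information using its Transformer attention. Specifically, the attention from the [CLS] embedding in the last layer of Eq. (2) to the $i$-th token of the input sequence is:

$$a_i = \sum_{h} \mathrm{softmax}_i\left(\mathbf{q}^{h}_{\text{[CLS]}} \cdot \mathbf{k}^{h}_{i}\right), \tag{5}$$

where $\mathbf{q}^{h}_{\text{[CLS]}}$ and $\mathbf{k}^{h}_{i}$ are the “query” vector and the $i$-th input token’s “key” vector of the $h$-th attention head (Vaswani et al., 2017). Ideally, the PRF encoder should learn to yield

$$a_i > a_j, \quad \forall\, i \in R,\ j \in \bar{R}, \tag{6}$$

where $R$ are indexes of the meaningful tokens from the PRF documents, and $\bar{R}$ are those of the irrelevant PRF tokens.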
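The attention quantity discussed above can be computed as in the following Python sketch. It assumes scaled dot-product attention as in the standard Transformer and sums the per-head attention weights over heads; both choices, along with the name `cls_attention` and the toy vectors, are assumptions of this sketch rather than details confirmed by the text.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: List[float]) -> List[float]:
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cls_attention(cls_queries: List[Vector], keys: List[List[Vector]]) -> List[float]:
    """cls_queries[h] is the [CLS] "query" vector of head h and
    keys[h][i] is the "key" vector of input token i in head h.
    Returns one attention value per token, summed over heads."""
    attention = [0.0] * len(keys[0])
    for q_h, keys_h in zip(cls_queries, keys):
        scale = math.sqrt(len(q_h))  # standard 1/sqrt(d_k) scaling (assumed)
        logits = [sum(a * b for a, b in zip(q_h, k)) / scale for k in keys_h]
        for i, weight in enumerate(softmax(logits)):
            attention[i] += weight
    return attention


# Two heads, three tokens, 2-dimensional head vectors (hypothetical values).
cls_queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [
    [[2.0, 0.0], [0.0, 0.0], [0.0, 0.0]],  # head 0: token 0 aligns with the query
    [[0.0, 2.0], [0.0, 0.0], [0.0, 0.0]],  # head 1: token 0 aligns with the query
]
attention = cls_attention(cls_queries, keys)
print(attention)
```

In this toy input, token 0 plays the role of a meaningful PRF token and receives more attention than tokens 1 and 2, which is the ordering the encoder is ideally expected to learn.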
ANCE-PRF can be easily integrated with any dense retrieval model. With the document embeddings and index unchanged, the only computational overhead is one more query encoder forward pass (Eq. (2)) and one more nearest neighbor search (Eq. (3)), a minor addition to the dense retrieval process (Xiong et al., 2021).
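The two-pass procedure can be sketched as follows. This is a minimal Python illustration of the control flow only: the bag-of-words `bow` embedding and the `toy_prf_encoder` (query vector plus half the mean feedback vector) are hypothetical stand-ins for the learned BERT encoders, and all names and data are invented for the example.

```python
from typing import Callable, List

Vector = List[float]
VOCAB = ["car", "automobile", "dealer", "banana"]  # toy vocabulary (hypothetical)


def bow(text: str) -> Vector:
    """Toy bag-of-words embedding; a stand-in for a learned encoder."""
    tokens = text.split()
    return [float(tokens.count(term)) for term in VOCAB]


def search(query_emb: Vector, doc_embs: List[Vector], top_k: int) -> List[int]:
    """Exhaustive dot-product search over a fixed document index."""
    scores = [sum(a * b for a, b in zip(query_emb, d)) for d in doc_embs]
    order = sorted(range(len(doc_embs)), key=lambda i: scores[i], reverse=True)
    return order[:top_k]


def toy_prf_encoder(query: str, feedback_docs: List[str]) -> Vector:
    """Stand-in PRF encoder; the model in the text uses a BERT encoder instead."""
    q = bow(query)
    if not feedback_docs:
        return q
    doc_vecs = [bow(d) for d in feedback_docs]
    mean = [sum(col) / len(doc_vecs) for col in zip(*doc_vecs)]
    return [qi + 0.5 * mi for qi, mi in zip(q, mean)]


def prf_retrieve(
    query: str,
    docs: List[str],
    doc_embs: List[Vector],
    encode_query: Callable[[str], Vector],
    encode_prf_query: Callable[[str, List[str]], Vector],
    k_feedback: int,
    top_k: int,
) -> List[int]:
    first_pass = search(encode_query(query), doc_embs, k_feedback)  # initial retrieval
    feedback_docs = [docs[i] for i in first_pass]
    q_prf = encode_prf_query(query, feedback_docs)  # one extra encoder forward pass
    return search(q_prf, doc_embs, top_k)           # one extra search, same index


docs = ["car automobile", "banana", "automobile dealer"]
doc_embs = [bow(d) for d in docs]  # computed once and reused by both passes
initial = search(bow("car"), doc_embs, top_k=2)
refined = prf_retrieve("car", docs, doc_embs, bow, toy_prf_encoder, k_feedback=1, top_k=2)
print(initial, refined)
```

In the toy data, the feedback document introduces the term "automobile", so the second pass promotes the "automobile dealer" document over the unrelated one while `doc_embs` is left untouched.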
| Model | MARCO Dev NDCG@10 | MARCO Dev MRR@10 | MARCO Dev R@1K | MARCO Eval MRR@10 | TREC DL 2019 NDCG@10 | TREC DL 2019 R@1K | TREC DL 2019 HOLE@10 | TREC DL 2020 NDCG@10 | TREC DL 2020 R@1K | TREC DL 2020 HOLE@10 |
|---|---|---|---|---|---|---|---|---|---|---|
| DPR (Karpukhin et al., 2020; Xiong et al., 2021) | - | 0.311 (-5.8%) | 0.952 (-0.1%) | - | 0.600 (-7.4%) | - | - | 0.557 (-13.8%) | - | - |
| DE-BERT (Luan et al., 2021) | 0.358 (-7.7%) | 0.302 (-8.5%) | - | 0.302 (-4.7%) | 0.639 (-1.4%) | - | 0.165 | - | - | - |
| ME-BERT (Luan et al., 2021) | 0.394 (+1.5%) | 0.334 (+1.2%) | 0.855 (-10.8%) | 0.323 (+1.9%) | 0.687 (+6.0%) | - | 0.109 | - | - | - |
| LTRe (Zhan et al., 2020) | - | 0.341 (+3.3%) | 0.962 (+0.0%) | - | 0.675 (+4.2%) | - | - | - | - | - |
| ANCE (Xiong et al., 2021) | 0.388 (0.0%) | 0.330 (0.0%) | 0.959 (0.0%) | 0.317 (0.0%) | (0.0%) | 0.755 (0.0%) | 0.149 | (0.0%) | 0.776 (0.0%) | 0.135 |

in t-test. Per-query results of those underlined are not available for significance tests.
|MARCO Dev (Binary Label)||TREC DL 2019 (0-3 Scale Label)||TREC DL 2020 (0-3 Scale Label)|
4. Experimental Setup
Next, we discuss the datasets, baselines, and implementation details.
Datasets. We use MS MARCO passage training data (Nguyen et al., 2016) which includes 530K training queries. We first evaluate on its dev set with 7k queries and also obtain the testing results by submitting to its leaderboard. MARCO’s official metric is MRR@10.
TREC DL includes 43 labeled queries from 2019 and 54 from 2020 for the MARCO corpus. The official metrics are NDCG@10 and Recall@1K, the latter with labels binarized at relevance point 2. Following Xiong et al. (2021), we also report HOLE@10, the unjudged fraction of top 10 retrieved documents, to reflect the coverage of pooled labels on dense retrieval systems. DL-HARD (Mackie et al., 2021) contains 50 queries from TREC DL that were curated to challenge neural systems in a prior TREC DL track. Its official metric is NDCG@10.
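For reference, the per-query form of these metrics can be sketched in Python as below. This is an illustrative implementation, not the official evaluation tooling: it assumes linear gains for NDCG, treats any document absent from the judgments as unjudged for HOLE@10, and uses hypothetical function names and toy judgments. MRR@10 is the mean of the reciprocal rank over all queries.

```python
import math
from typing import Dict, List


def reciprocal_rank_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10, min_rel: int = 1) -> float:
    """1/rank of the first relevant document within the top k, else 0."""
    for rank, doc_id in enumerate(ranked[:k], start=1):
        if qrels.get(doc_id, 0) >= min_rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """NDCG with linear gains (an assumption) and log2 rank discounts."""
    def dcg(gains: List[int]) -> float:
        return sum(g / math.log2(rank + 1) for rank, g in enumerate(gains, start=1))

    ideal_dcg = dcg(sorted(qrels.values(), reverse=True)[:k])
    if ideal_dcg == 0:
        return 0.0
    return dcg([qrels.get(d, 0) for d in ranked[:k]]) / ideal_dcg


def recall_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 1000, min_rel: int = 2) -> float:
    """Recall with graded labels binarized at min_rel."""
    relevant = {d for d, rel in qrels.items() if rel >= min_rel}
    if not relevant:
        return 0.0
    return len(relevant.intersection(ranked[:k])) / len(relevant)


def hole_at_k(ranked: List[str], qrels: Dict[str, int], k: int = 10) -> float:
    """Fraction of the top-k retrieved documents that have no judgment."""
    top = ranked[:k]
    if not top:
        return 0.0
    return sum(1 for d in top if d not in qrels) / len(top)


# Toy judgments on a 0-3 scale; "d9" is retrieved but unjudged (hypothetical data).
qrels = {"d1": 3, "d2": 0, "d3": 1, "d4": 2}
ranked = ["d9", "d1", "d3", "d2"]
print(reciprocal_rank_at_k(ranked, qrels), ndcg_at_k(ranked, qrels),
      recall_at_k(ranked, qrels), hole_at_k(ranked, qrels))
```

A high HOLE@10 signals that many top-ranked documents were never judged, so the other metrics may understate the quality of a system that did not contribute to the judgment pool.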
Baselines include BM25 (Robertson and Zaragoza, 2009) and RM3 (Jaleel et al., 2004; Lavrenko and Croft, 2001), a classical PRF framework in sparse retrieval. We also compare with several recent dense retrievers. ME-BERT (Luan et al., 2021) was trained with hard-negative mining (Gillick et al., 2019), and is the only one that uses multi-vector document encoding. DE-BERT (Luan et al., 2021) is the single-vector version of ME-BERT. DPR (Karpukhin et al., 2020) is trained with in-batch negatives. LTRe (Zhan et al., 2020) generates hard negatives using document embeddings from an existing dense retrieval model. ANCE (Xiong et al., 2021) uses hard negatives from an asynchronously updated dense retrieval index built with the latest model checkpoint.
and kept the document embeddings from ANCE (and thus also the ANCE negative index) unchanged. All hyperparameters used in ANCE training are inherited in ANCE-PRF. All models are trained on two RTX 2080 Ti GPUs with a per-GPU batch size of 4 and 8 gradient accumulation steps for 450K steps. We keep the model checkpoint with the best MRR@10 score on the MS MARCO dev set.
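The batch configuration above implies an effective batch size, worked out in the short Python sketch below. It assumes the common convention that gradients from all GPUs and all accumulation steps are combined into a single optimizer update; whether the 450K steps count optimizer updates or individual forward passes is not stated in the text.

```python
num_gpus = 2              # two RTX 2080 Ti GPUs
per_gpu_batch_size = 4    # examples per GPU per forward/backward pass
grad_accum_steps = 8      # passes accumulated before one optimizer update

# Examples contributing to each optimizer update under the assumed convention.
effective_batch_size = num_gpus * per_gpu_batch_size * grad_accum_steps
print(effective_batch_size)
```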
5. Experimental Results
In this section, we discuss our experimental results and studies.
5.1. Overall Results
Table 1 includes overall retrieval accuracy on MS MARCO and TREC DL datasets. ANCE-PRF outperforms ANCE, its base retrieval system, on all datasets. On the challenging DL-HARD (Table 3), ANCE-PRF improves NDCG@10 by 9.3% over ANCE, indicating ANCE-PRF’s advantage on queries challenging for neural systems. These results suggest that ANCE-PRF effectively leverages PRF information to produce better query embeddings. ANCE-PRF also helps retrieve relevant documents not recognized by ANCE, improving R@1K by about 5% on both TREC DL sets.
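The percentages reported alongside the scores in Table 1 appear to be relative changes with respect to ANCE, whose own entries are listed as 0.0%. The Python sketch below reproduces two of them from the tabulated values; the helper name `relative_change_percent` is hypothetical, and reading the percentages as relative to ANCE is an inference from the table.

```python
def relative_change_percent(value: float, baseline: float) -> float:
    """Relative change of value over baseline, in percent."""
    return 100.0 * (value - baseline) / baseline


# Table 1 entries: DPR 0.311 vs. ANCE 0.330, and ME-BERT 0.394 vs. ANCE 0.388.
dpr_vs_ance = round(relative_change_percent(0.311, 0.330), 1)
me_bert_vs_ance = round(relative_change_percent(0.394, 0.388), 1)
print(dpr_vs_ance, me_bert_vs_ance)
```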
ANCE-PRF rankings are significantly more precise than the sparse retrieval baselines with large margins across all datasets. RM3 achieves the best R@1K on both TREC DL sets, but its improvement is not as significant on DL-HARD.
ANCE-PRF also outperforms several strong dense retrieval baselines and produces the most accurate rankings on almost all datasets. While Luan et al. (Luan et al., 2021) discuss the theoretical benefits of higher dimensional dense retrieval as in ME-BERT, our empirical results show that a well-informed query encoder can achieve comparable results, while avoiding the computational and spatial overhead caused by using multiple vectors per document.
5.2. Ablation on PRF Depths
To understand the number of feedback documents ($k$) needed for effective learning, we trained models using different $k$ and report the results in Table 2. We trained $k=0$ as a controlled experiment, which is equivalent to training ANCE for an extra 450K steps with fixed negatives.
Overall, we observe that models with $k \geq 1$ are consistently better than both ANCE and the $k=0$ control, showing that ANCE-PRF effectively utilizes the given feedback relevance information. The Avg_Rel indicates that PRF documents at larger $k$ contain noisy relevance information, which is a known challenge for traditional PRF approaches (Collins-Thompson, 2009). Nevertheless, ANCE-PRF yields stable improvements over ANCE for $k$ up to 5, demonstrating the model’s robustness against noisy feedback from deeper $k$.
5.3. Analyses of Embedding Space & Attention
In this group of experiments, we analyze the learned embeddings and attention in ANCE-PRF.
Embedding Space. Fig. 1(a) shows the distance during training between the ANCE-PRF query embedding and the embeddings of the original ANCE query, the relevant documents, and the irrelevant documents. We use MARCO dev in this study, in which about one out of the three PRF documents is relevant. In the embedding space, ANCE-PRF queries are closest to the original query and then the relevant documents, while further away from the irrelevant documents. ANCE-PRF’s query embeddings effectively encode both the query and the feedback relevance information.
Learned Attention. We also analyze the learned attention on the relevant and the irrelevant PRF documents during training. We use TREC DL 2020 for this study as its dense relevance labels provide more stable observations. We calculate the average attention from the [CLS] token to each group (“relevant”, “irrelevant”, and “all”) of PRF documents (Eq. (5) & (6)), and plot them in Fig. 1(b)-1(d).
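The group-level averaging can be sketched as follows in Python. This is a minimal illustration that assumes a per-token average within each group and a token-level group label derived from the relevance label of the document each token belongs to; the exact aggregation used for the figures is not specified in the text, and the function name and toy values are hypothetical.

```python
from typing import Dict, List


def average_attention_by_group(attention: List[float], groups: List[str]) -> Dict[str, float]:
    """attention[i] is the [CLS] attention to token i and groups[i] is one of
    "query", "relevant", or "irrelevant". Returns the mean attention per group,
    plus "all" for every PRF document token regardless of relevance."""
    totals: Dict[str, float] = {}
    counts: Dict[str, int] = {}
    for weight, group in zip(attention, groups):
        totals[group] = totals.get(group, 0.0) + weight
        counts[group] = counts.get(group, 0) + 1
    averages = {group: totals[group] / counts[group] for group in totals}
    prf_weights = [w for w, g in zip(attention, groups) if g in ("relevant", "irrelevant")]
    if prf_weights:
        averages["all"] = sum(prf_weights) / len(prf_weights)
    return averages


# One query token followed by four PRF document tokens (hypothetical values).
attention = [0.4, 0.2, 0.2, 0.1, 0.1]
groups = ["query", "relevant", "relevant", "irrelevant", "irrelevant"]
averages = average_attention_by_group(attention, groups)
print(averages)
```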
As training proceeds, ANCE-PRF pays more and more attention to the relevant PRF documents than the irrelevant ones, showing the effectiveness of its learning. Note that the original query always attracts the most attention from the PRF encoder, which is intuitive, as the majority of the search intent is to be determined by the query. The PRF information is to refine the query representation with extra information but not to invalidate it.
5.4. Case Study
Fig. 2 plots the per query win/loss of ANCE-PRF versus ANCE on TREC DL 2020 and shows one example each.
ANCE-PRF wins on more queries and with larger margins. We also notice the PRF query encoder focuses more on terms that are complementary to the query. In the winning example, ANCE-PRF picks up terms explaining what “un fao” is and does not mistake “un” as “uno”. On the other hand, ANCE-PRF may be misled by information appearing in multiple feedback documents. This is a known challenge for PRF because the correctness of information from multiple feedback documents is its core assumption (Lavrenko and Croft, 2001). In the losing example, “pattern their shotguns” occurs in multiple PRF documents, attracting too much attention to allow ANCE-PRF to make a better choice.
Existing dense retrievers learn query representations from short and ambiguous user queries; thus, a query representation may not precisely reflect the underlying information need. ANCE-PRF addresses this problem with a new query encoder that learns better query representations from the original query and the top-ranked documents from a state-of-the-art dense retriever, ANCE.
Our experiments demonstrate ANCE-PRF’s effectiveness in refining query understanding and its robustness against noise from imperfect feedback. Our studies reveal that ANCE-PRF learns to distinguish between relevant and irrelevant documents. We show that ANCE-PRF successfully learns to identify relevance information with its attention mechanism. Its query encoder pays more attention to the relevant portion of the PRF documents, especially the PRF terms that complement the query terms in expressing the information need.
ANCE-PRF provides a straightforward way to leverage the PRF information in dense retrieval and can be used as a plug-in in embedding-based retrieval systems. We observe that simply leveraging the classic PRF information in the new neural-based retrieval regime leads to significant accuracy improvements, suggesting that more future research can be done in this direction.
- Learning a deep listwise context model for ranking refinement. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, pp. 135–144.
- Parameterized concept weighting in verbose queries. In Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 605–614.
- Co-BERT: A context-aware BERT retrieval model incorporating local and query-specific context. arXiv preprint arXiv:2104.08523.
- Reducing the risk of query expansion via robust constrained optimization. In Proceedings of the 18th ACM Conference on Information and Knowledge Management, pp. 837–846.
- Overview of the TREC 2019 deep learning track. In Proceedings of the Twenty-Eighth Text REtrieval Conference, NIST Special Publication.
- Overview of the TREC 2020 deep learning track. In Proceedings of the Twenty-Ninth Text REtrieval Conference, NIST Special Publication.
- Search engines: information retrieval in practice. Vol. 520, Addison-Wesley Reading.
- BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. 1, pp. 4171–4186.
- Complement lexical retrieval model with semantic residual embeddings. In Advances in Information Retrieval - 43rd European Conference on IR Research, Lecture Notes in Computer Science, Vol. 12656, pp. 146–160.
- Learning dense representations for entity retrieval. In Proceedings of the 23rd Conference on Computational Natural Language Learning, pp. 528–537.
- A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management, pp. 55–64.
- Accelerating large-scale inference with anisotropic vector quantization. In Proceedings of the 37th International Conference on Machine Learning, Proceedings of Machine Learning Research, Vol. 119, pp. 3887–3896.
- Efficiently teaching an effective dense retriever with balanced topic aware sampling. In The 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 113–122.
- UMass at TREC 2004: novelty and HARD. In Proceedings of the Thirteenth Text REtrieval Conference, NIST Special Publication, Vol. 500-261.
- Billion-scale similarity search with GPUs. IEEE Trans. Big Data 7 (3), pp. 535–547.
- Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 6769–6781.
- ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48.
- Natural questions: a benchmark for question answering research. Trans. Assoc. Comput. Linguistics 7, pp. 452–466.
- Relevance-based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 120–127.
- Latent retrieval for weakly supervised open domain question answering. In Proceedings of the 57th Conference of the Association for Computational Linguistics, Vol. 1, pp. 6086–6096.
- NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 4482–4491.
- Distilling dense representations for ranking using tightly-coupled teachers. arXiv preprint arXiv:2010.11386.
- Less is more: pre-training a strong siamese encoder using a weak decoder. arXiv preprint arXiv:2102.09206.
- Sparse, dense, and attentional representations for text retrieval. Trans. Assoc. Comput. Linguistics 9, pp. 329–345.
- How deep is your learning: the DL-HARD annotated deep learning dataset. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 2335–2341.
- MS MARCO: A human generated machine reading comprehension dataset. In Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 co-located with the 30th Annual Conference on Neural Information Processing Systems, CEUR Workshop Proceedings, Vol. 1773.
- Making monolingual sentence embeddings multilingual using knowledge distillation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, pp. 4512–4525.
- The probabilistic relevance framework: BM25 and beyond. Foundations and Trends in Information Retrieval 3 (4), pp. 333–389.
- BEIR: A heterogenous benchmark for zero-shot evaluation of information retrieval models. arXiv preprint arXiv:2104.08663.
- Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems, pp. 5998–6008.
- Pseudo-relevance feedback for multiple representation dense retrieval.
- End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 55–64.
- Approximate nearest neighbor negative contrastive learning for dense text retrieval. In 9th International Conference on Learning Representations.
- Query expansion using local and global document analysis. In Proceedings of the 19th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 4–11.
- PGT: pseudo relevance feedback using a graph-based transformer. In Advances in Information Retrieval - 43rd European Conference on IR Research, Lecture Notes in Computer Science, Vol. 12657, pp. 440–447.
- Learning to retrieve: how to train a dense retrieval model effectively and efficiently. arXiv preprint arXiv:2010.10469.