Complementing Lexical Retrieval with Semantic Residual Embedding

04/29/2020 ∙ by Luyu Gao, et al. ∙ Carnegie Mellon University

Information retrieval traditionally has relied on lexical matching signals, but lexical matching cannot handle vocabulary mismatch or topic-level matching. Neural embedding based retrieval models can match queries and documents in a latent semantic space, but they lose token-level matching information that is critical to IR. This paper presents CLEAR, a deep retrieval model that seeks to complement lexical retrieval with semantic embedding retrieval. Importantly, CLEAR uses a residual-based embedding learning framework, which focuses the embedding on the deep language structures and semantics that the lexical retrieval fails to capture. Empirical evaluation demonstrates the advantages of CLEAR over classic bag-of-words retrieval models, recent BERT-enhanced lexical retrieval models, as well as a BERT-based embedding retrieval model. A full-collection retrieval with CLEAR can be as effective as a BERT-based reranking system, substantially narrowing the gap between full-collection retrieval and cost-prohibitive reranking systems.




1 Introduction

State-of-the-art search engines adopt a pipelined retrieval system: an efficient first-stage retriever that uses a query to fetch a set of documents from the entire document collection, and subsequently one or more reranking algorithms that refine ranking within the retrieved set. Since the retrieval stage is performed with respect to all documents in the collection, the ranking algorithms need to run efficiently. With recent deep neural models like BERT-based rerankers pushing reranking accuracy to new levels, the first-stage retrieval is gradually becoming the performance bottleneck in modern search engines.

Typically, the first-stage ranker is a Boolean, probabilistic, or vector space bag-of-words retrieval model that computes the relevance score with heuristics defined over the lexical overlap between query and document. Lexical retrieval models such as BM25 Robertson and Walker (1994) have remained state-of-the-art for decades, and are still the most widely used first-stage retrieval algorithms today. Though successful, lexical retrieval models face a critical limitation: they disregard the semantics in the query and the document. Lexical retrieval fails when the query and the document mention the same concept using different words, which is known as the vocabulary mismatch problem. Besides vocabulary mismatch, lexical retrieval also fails to capture high-level properties of the text, e.g., topics, sentence structures, language styles, etc.

Recent advances in deep learning provide a powerful new tool to model semantics for IR. With the use of distributed text representations (embeddings), neural networks can compare texts at the semantic level even if they use different vocabularies Xiong et al. (2017). However, state-of-the-art neural rankers, e.g., BERT-based rankers Nogueira and Cho (2019); Dai and Callan (2019b), are cost-prohibitive for first-stage retrieval as they need to compute the interactions between every possible pair of tokens in the input. Recent research attempts to address the efficiency challenge using embedding retrieval that collapses all tokens in the query or the document into a single embedding vector. With these embeddings, retrieval can be done efficiently with maximum inner product search in the embedding space. However, a single low-dimensional embedding has limited representation capacity, and tends to lose specific word-level information which is critical to IR Salton and McGill (1984); Guo et al. (2016).

This paper aims to combine the best of both worlds from deep embedding representations and explicit lexical representations. We argue that the embeddings should focus on encoding semantics that the lexical retrieval fails to capture to make the best use of its limited representation capacity. We propose CLEAR, a novel deep retrieval framework that attempts to complement lexical-match with semantic embeddings acquired from the residual of a lexical retrieval model’s errors. CLEAR incorporates a Siamese framework that uses BERT Devlin et al. (2018) to encode the query and the document separately, as well as a BM25 lexical retrieval model. Unlike existing embedding learning techniques that directly optimize the distances between the embeddings, CLEAR is trained to infer a residual score that adjusts a lexical-based retrieval model score to account for vocabulary mismatch. In such a residual training framework, we lift the burden of learning lexical match from the embedding-based retrieval model and focus it on higher semantic level matching.

During inference, CLEAR runs two retrieval models in parallel: 1) lexical retrieval from the inverted index using the surface form of the query/document text, and 2) semantic retrieval that uses the query embedding to find the nearest neighbours among the document embeddings. As the embeddings are learned with the lexical retrieval’s residual, the two types of retrieval scores are complementary, and can provide additive gains.

Our experimental results on two distinct query sets show substantial and consistent advantages of CLEAR over widely-used bag-of-words retrieval models, recent deep lexical retrieval models, and a strong BERT-based embedding retrieval model. Furthermore, CLEAR can be as effective as a state-of-the-art ranking system that uses multiple retrieval stages and computationally expensive BERT rankers. An ablation study shows that the key to CLEAR’s advantages is the residual-based learning, without which CLEAR’s performance drops substantially.

In the rest of the paper, Section 2 reviews related work. Section 3 describes our CLEAR retrieval model. Sections 4 and 5 present the experimental methodologies and results. Section 6 concludes the paper.

2 Related Work

Traditionally, information retrieval has relied on lexical retrieval models such as BM25 Robertson and Walker (1994) and query likelihood Lafferty and Zhai (2001) for their efficiency and effectiveness. A long-standing challenge in this type of retrieval model is the vocabulary mismatch problem. One successful approach to bridge the vocabulary gap between queries and documents is query expansion Lavrenko and Croft (2001). Several recent studies explored using deep language models such as BERT Devlin et al. (2018) to improve lexical retrieval models, by adjusting existing terms’ weights Dai and Callan (2019a) or adding new terms to the document Nogueira et al. (2019). However, they may still fail at capturing high-level concepts that are not explicitly mentioned in the text.

Deep neural networks excel at semantic matching with the use of distributed text representations (embeddings). Neural network models for IR in previous studies can be classified into two groups: representation-based models and interaction-based models. Representation-based models learn embedding representations of queries and documents and use a simple scoring function (e.g., cosine) to measure the relevance between them. Interaction-based approaches, on the other hand, model the interactions between pairs of words in queries and documents and use the rich word-level matching signals for ranking. Interaction-based approaches were shown to be more effective, but it is prohibitive to apply them for first-stage retrieval as the document-query interactions must be inferred online. For first-stage retrieval, recent research focuses on representation-based models.

Representation-based models for retrieval can be traced back three decades, to models such as LSI Deerwester et al. (1990), Siamese networks Bromley et al. (1993), and MatchPlus Caid et al. (1995). More recently, several studies investigated using modern deep learning techniques to build the query/document representations. For example, Aumüller et al. (2018) learn the embedding representations with a shallow neural network. The model was shown effective on small-scale retrieval datasets but failed to scale to larger collections. Lee et al. (2019) used a BERT-based embedding retrieval to find candidate passages for question answering. Guu et al. (2020) extended Lee et al. (2019) by making the embedding retrieval module trainable along with the rest of the question answering pipeline in an end-to-end manner. Chang et al. (2020) propose a set of pre-training tasks for sentence retrieval. The majority of embedding-based retrieval methods use dense embedding representations, where retrieval turns into a K-nearest neighbour (KNN) search in the embedding space. An alternative is to convert the dense embeddings into sparse ones and effectively represent queries and documents by a set of “latent words” which can be retrieved using inverted indices (Salakhutdinov and Hinton, 2009; Zamani et al., 2018).

Although latent embedding based retrieval has achieved great success on several NLP tasks, its effectiveness for standard ad-hoc search is mixed Guo et al. (2016); Zamani et al. (2018). All representation-based neural retrieval models share the same limitation: they use a fixed number of dimensions, which incurs the specificity vs. exhaustiveness trade-off found in all controlled vocabularies (Salton and McGill, 1984). While capturing high-level semantics, they collapse all words into a single vector, losing granular-level information that has been fundamental to modern search engines. There exist a few studies that consider combining semantic matching with lexical matching Guo et al. (2016); Xiong et al. (2017); Mitra et al. (2017), but they all use complex models and are exclusive to the reranking stage. To the best of our knowledge, this is the first work that investigates jointly training latent embeddings and lexical retrieval for first-stage retrieval.

3 Proposed Method

CLEAR consists of two retrieval models, a lexical retrieval model and an embedding retrieval model. One model’s weakness is the other model’s strength: lexical retrieval performs exact token matching but cannot handle the vocabulary mismatch problem or topic-level matching; embedding retrieval does semantic matching but collapses all sentence tokens into a single vector, losing granular-level information. We hypothesize that an effective ranking system can be built by having the two types of models complement each other. In the ideal case, each of the two models produces one score, and the two scores combined capture both lexical and topic-level matching. To achieve this goal, we propose a residual-based learning framework that teaches the two models to be complementary to each other at training time.

3.1 Lexical Retrieval Model

The lexical retrieval model uses token overlap information to score query-document pairs. This work uses BM25, a current state-of-the-art bag-of-words (BOW) retrieval model, but it can also take other lexical retrieval models such as Indri Strohman et al. (2005), vector space models, or recently proposed machine-learned ones Dai and Callan (2019a); Nogueira et al. (2019).

Given a query-document pair, for each overlapping word, BM25 generates a score with a simple scoring function based on document statistics and corpus statistics. Adding all the scores together, BM25 produces a lexical matching score for the pair. We denote the input query as $q$, the document as $d$, and the lexical matching score as $s_{lex}$; then,

$$s_{lex}(q, d) = \mathrm{BM25}(q, d) \tag{1}$$
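As an illustration of the per-term scoring and summation described above, the following is a minimal sketch of one common BM25 variant. The function name `bm25_score`, the smoothed IDF form, and the default values of `k1` and `b` are assumptions made for this example; they are not taken from the paper, which tunes its BM25 parameters.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, doc_freq, num_docs, avg_doc_len,
               k1=1.2, b=0.75):
    """Sum per-term BM25 contributions over query terms that occur in the document.

    doc_freq maps a term to the number of documents containing it.
    k1 and b are illustrative textbook defaults, not the tuned values.
    """
    term_freq = Counter(doc_terms)
    score = 0.0
    for term in set(query_terms):
        tf = term_freq[term]
        if tf == 0:
            continue  # only overlapping words contribute to the score
        df = doc_freq.get(term, 0)
        # Smoothed inverse document frequency (corpus statistic).
        idf = math.log(1.0 + (num_docs - df + 0.5) / (df + 0.5))
        # Term frequency saturated and normalised by document length.
        norm = tf + k1 * (1.0 - b + b * len(doc_terms) / avg_doc_len)
        score += idf * tf * (k1 + 1.0) / norm
    return score
```

A document that shares no term with the query receives a score of exactly zero under this sketch, which is the vocabulary mismatch failure case the embedding model is meant to cover.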


3.2 Embedding Retrieval Model

The embedding retrieval model encodes an input query or document sequence into a dense vector. Between a query vector and a document vector, we compute a dot product between them as a similarity measure. The embedding retrieval model can take various architectures that encode natural language sequences, such as LSTM and Transformer Vaswani et al. (2017). Importantly, we require the model to output a single dense vector for an input sequence.

This work adopts the Transformer encoder. In particular, we use a fine-tuned BERT Devlin et al. (2018) to map an input document $d$ of length $n_d$ into a sequence of contextualized word representation vectors of dimension $h$. In matrix form,

$$D = \mathrm{BERT}(d) \tag{2}$$

where $D$ is the representation of the entire document, of dimension $n_d \times h$. Similarly, for a query $q$ of length $n_q$, we have,

$$Q = \mathrm{BERT}(q) \tag{3}$$

where $Q$ is the representation of the entire query, of dimension $n_q \times h$. We tie the query and document BERT model parameters to reduce the training-time memory footprint and storage size. To help the model differentiate between query and document, we prepend the special token QRY to the query and DOC to the document before feeding them to the BERT model. We pool the query and document representations along the first dimension to derive the final embedding vectors of dimension $h$:

$$v_q = \mathrm{Pool}(Q) \tag{4}$$

$$v_d = \mathrm{Pool}(D) \tag{5}$$

In this work, we use average pooling. The final neural model score is computed as the similarity between $v_q$ and $v_d$. We use the dot product as our similarity measure:

$$s_{emb}(q, d) = v_q^{\top} v_d \tag{6}$$
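The pooling and scoring steps can be sketched in a few lines. The example below does not call BERT: small hand-written matrices stand in for the contextualized token vectors, and the function names are hypothetical.

```python
def average_pool(token_vectors):
    """Average a (length x h) list of token vectors into one h-dimensional vector."""
    length = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / length for j in range(dim)]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def embedding_score(query_token_vectors, doc_token_vectors):
    """Pool each side along the token dimension, then compare with a dot product."""
    query_vec = average_pool(query_token_vectors)
    doc_vec = average_pool(doc_token_vectors)
    return dot(query_vec, doc_vec)


# Toy stand-ins for encoder outputs: 2 query tokens and 3 document tokens, h = 2.
Q = [[1.0, 0.0], [3.0, 2.0]]
D = [[0.0, 1.0], [2.0, 1.0], [4.0, 1.0]]
score = embedding_score(Q, D)
```

Because each side is pooled independently, document vectors can be computed once offline and only the query needs to be encoded at search time.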


3.3 Residual-based Learning

To make the two retrieval models complement each other, we need to teach them to do so at training time. In CLEAR, the lexical retrieval model is static by its design with no learnable parameter. On the other hand, the embedding retrieval model is flexible. Therefore, we propose to keep the lexical model as is and optimize the embedding model to complement the lexical matching model.

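The training loss for the neural embedding model is defined over a triplet: a query $q$, a relevant document $d^+$ serving as the positive example, and an irrelevant document $d^-$ serving as the negative example. The loss takes the form of a hinge loss with a margin $m$:

$$\mathcal{L} = \left[\, m - s_{emb}(q, d^+) + s_{emb}(q, d^-) \,\right]_+ \tag{7}$$

where $s_{emb}(q, d)$ is the embedding model score and $[x]_+ = \max(0, x)$.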


In order to train embeddings that complement the lexical retrieval, we propose two techniques: sampling negative examples from the lexical retrieval’s errors, and adjusting the margin based on the lexical retrieval’s residuals.

Error-based Negative Sampling: We sample negative examples ($d^-$ in Eq. 7) from those documents mistakenly retrieved by the lexical retrieval model. For each positive document, we uniformly sample $d^-$ from the top $N$ documents returned by lexical retrieval with a probability of $p$. With such negative samples, the embedding model needs to differentiate relevant documents from confusing ones that are lexically similar to the query but semantically irrelevant.
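A minimal sketch of this sampling step is given below. The text leaves the exact procedure open, so the sketch adopts one possible reading as an assumption: every non-relevant document among the top results of the lexical ranker is kept as a negative independently with a fixed probability. The function and parameter names are hypothetical.

```python
import random


def sample_error_negatives(ranked_doc_ids, relevant_ids, top_n, keep_prob, rng):
    """Draw negatives from the lexical ranker's mistakes.

    Assumed reading: each non-relevant document in the lexical top-N is kept
    independently with probability keep_prob.
    """
    # Lexical "errors": highly ranked documents that are not judged relevant.
    candidates = [doc_id for doc_id in ranked_doc_ids[:top_n]
                  if doc_id not in relevant_ids]
    return [doc_id for doc_id in candidates if rng.random() < keep_prob]


# Toy usage with a seeded generator so runs are repeatable.
lexical_ranking = ["d7", "d2", "d9", "d4", "d1"]
negatives = sample_error_negatives(lexical_ranking, {"d2"}, top_n=4,
                                   keep_prob=0.5, rng=random.Random(0))
```

Whatever the exact sampling rule, the negatives come from documents that already match the query lexically, so the embedding model cannot separate them from the positive by term overlap alone.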

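Residual-based Margin: Intuitively, different query-document pairs require different levels of semantic information for matching. Our negative sampling strategy does not tell the model the degree of the errors made by the lexical retrieval. To address this challenge, we propose a new residual margin for the loss function. In particular, the margin in the hinge loss (Eq. 7) is a residual margin $m_r$ defined on the lexical retrieval scores $s_{lex}$ of the positive document $d^+$ and the negative document $d^-$:

$$m_r\big(s_{lex}(q, d^+),\, s_{lex}(q, d^-)\big) = \xi - \lambda_{train}\,\big(s_{lex}(q, d^+) - s_{lex}(q, d^-)\big) \tag{8}$$

where $\xi$ is a constant non-negative value, $s_{lex}(q, d^+) - s_{lex}(q, d^-)$ is the residual of the lexical retrieval, and $\lambda_{train}$ is a scaling factor that adjusts the residual.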

When the lexical retrieval model ranks the documents correctly, the residual margin (Eq. 8) will be small or even negative, and the neural embedding model receives a small or zero gradient update. On the other hand, when vocabulary mismatch or topic differences cause the lexical model to fail, the residual margin remains high and the embedding model is trained to capture this type of matching.
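The two cases can be checked numerically with a small sketch of a hinge loss whose margin shrinks by the scaled lexical residual. The function names and the values chosen for the constant (`xi`) and the scaling factor (`lam`) are illustrative assumptions, not the settings used in the experiments.

```python
def residual_margin(s_lex_pos, s_lex_neg, xi=1.0, lam=0.1):
    """Constant margin minus the scaled lexical residual (positive minus negative score)."""
    return xi - lam * (s_lex_pos - s_lex_neg)


def residual_hinge_loss(s_emb_pos, s_emb_neg, s_lex_pos, s_lex_neg,
                        xi=1.0, lam=0.1):
    """Hinge loss on the embedding scores, using the residual margin."""
    margin = residual_margin(s_lex_pos, s_lex_neg, xi, lam)
    return max(0.0, margin - s_emb_pos + s_emb_neg)


# Lexical ranker is right (positive scores 20, negative 5): margin is negative, loss is 0.
easy = residual_hinge_loss(0.2, 0.1, s_lex_pos=20.0, s_lex_neg=5.0)
# Lexical ranker is wrong (positive scores 5, negative 20): margin grows, loss is large.
hard = residual_hinge_loss(0.2, 0.1, s_lex_pos=5.0, s_lex_neg=20.0)
```

With identical embedding scores, only the second triplet produces a non-zero loss, so gradient updates concentrate on the pairs the lexical model gets wrong.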

With the proposed training paradigm, the embedding model learns to produce a score that adjusts the lexical matching score to inject semantic level match/mismatch information into the final ranking score. The embedding model only needs to amend lexical matching scores rather than reproducing them, so that it can focus on encoding the deeper language structures and semantic patterns underlying the surface form of the text.
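To see why the embedding score acts as an adjustment to the lexical score, the residual margin can be substituted into the hinge loss and the terms regrouped. The notation here is chosen for illustration: $s_{lex}$ and $s_{emb}$ are the lexical and embedding scores, $d^+$ and $d^-$ the positive and negative documents, $\xi$ the constant margin, and $\lambda_{train}$ the scaling factor.

$$
\begin{aligned}
\mathcal{L} &= \Big[\, \xi - \lambda_{train}\big(s_{lex}(q,d^+) - s_{lex}(q,d^-)\big) - s_{emb}(q,d^+) + s_{emb}(q,d^-) \,\Big]_+ \\
&= \Big[\, \xi - \big(\lambda_{train}\, s_{lex}(q,d^+) + s_{emb}(q,d^+)\big) + \big(\lambda_{train}\, s_{lex}(q,d^-) + s_{emb}(q,d^-)\big) \,\Big]_+
\end{aligned}
$$

Under these assumptions, the loss is an ordinary constant-margin hinge loss on the combined score $\lambda_{train}\, s_{lex} + s_{emb}$, in which the lexical part is fixed and only $s_{emb}$ is trainable.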

3.4 Retrieval with CLEAR

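The final retrieval score in CLEAR is a weighted sum of the lexical matching score and the neural embedding score,

$$s_{clear}(q, d) = \lambda_{test}\, s_{lex}(q, d) + s_{emb}(q, d) \tag{9}$$

where $\lambda_{test}$ is the weight on the lexical matching score.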


In CLEAR, the lexical matching model runs fast by taking advantage of the inverted index data structure. The embedding model can also scale to millions of candidates on a modern GPU, and potentially billions with the help of approximate nearest-neighbor libraries such as FAISS Johnson et al. (2017). As a result, CLEAR is able to serve as the first-stage, full-collection retriever.
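The score combination can be sketched as a brute-force ranking over a toy collection. This stands in for the inverted index and nearest-neighbor search and is not how a full-scale system would be built. Placing the weight on the lexical score, treating a missing lexical score as zero, and all names below are assumptions for illustration.

```python
def clear_rank(lexical_scores, query_vec, doc_vecs, lexical_weight, k):
    """Rank all documents by lexical_weight * lexical score + embedding dot product."""
    scored = []
    for doc_id, doc_vec in doc_vecs.items():
        s_emb = sum(a * b for a, b in zip(query_vec, doc_vec))
        # Assumption: a document with no term overlap has lexical score 0.
        s_lex = lexical_scores.get(doc_id, 0.0)
        scored.append((lexical_weight * s_lex + s_emb, doc_id))
    # Highest combined score first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc_id for _, doc_id in scored[:k]]


# Toy collection: d3 shares no words with the query but is close in embedding space.
lexical_scores = {"d1": 4.0, "d2": 1.0}
doc_vecs = {"d1": [0.0, 1.0], "d2": [1.0, 0.0], "d3": [3.0, 0.0]}
top = clear_rank(lexical_scores, [1.0, 0.0], doc_vecs, lexical_weight=0.5, k=2)
```

In this toy case the document with no lexical overlap is still ranked first because of its embedding score, while the lexically strong document remains in the top two.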

4 Experimental Methodologies

This section discusses the experimental methodologies used in this work, including datasets and evaluation, baselines, experimental methods, and implementation details.

Dataset and Evaluation. The current implementation of BERT supports texts of up to 512 tokens, thus we selected a dataset that consists primarily of passages: the MS MARCO passage ranking dataset Nguyen et al. (2016). It is a question-to-passage retrieval dataset with 8.8M passages. The training set contains approximately 0.5M pairs of queries and relevant passages, where each query on average has one relevant passage. Two evaluation query sets are used in this work to test the effectiveness of CLEAR.

  • MS MARCO Dev Queries: this evaluation query set contains 6980 queries from MS MARCO dataset’s development set, which has been widely used in prior research Nogueira and Cho (2019); Dai and Callan (2019a). Most of the queries have only one document judged relevant; the relevance labels are binary. Following Nguyen et al. (2016), we used MRR@10 to evaluate the ranking accuracy on this query set.

  • TREC2019 DL Queries: this evaluation query set is the official evaluation query set used in the TREC 2019 Deep Learning Track Craswell et al. (2019), a shared passage retrieval task. It contains 43 queries that have multiple relevant documents manually judged by NIST assessors with graded relevance labels. On average, a query has 95 relevant documents. TREC2019 DL Queries allow us to understand the models’ behavior on queries with multiple, graded relevance judgments. Following Craswell et al. (2019), we used MRR, NDCG@10, and MAP@1000 to evaluate the ranking accuracy on this query set.
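For reference, two of the metrics used in this paper can be computed as sketched below: MRR@10 as named above, and recall at a cutoff, which is reported with the results. The sketch assumes binary relevance and simple dictionary inputs; the function names and data layout are hypothetical, and official evaluations would normally rely on standard tooling instead.

```python
def mrr_at_k(rankings, relevant, k=10):
    """Mean over queries of 1/rank of the first relevant document in the top k (0 if none)."""
    total = 0.0
    for query_id, ranked in rankings.items():
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant[query_id]:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings, relevant, k=1000):
    """Mean over queries of the fraction of relevant documents found in the top k."""
    total = 0.0
    for query_id, ranked in rankings.items():
        found = len(set(ranked[:k]) & relevant[query_id])
        total += found / len(relevant[query_id])
    return total / len(rankings)


# Toy run: q1 has its relevant document at rank 2; q2 finds one of two at rank 3.
rankings = {"q1": ["a", "b", "c"], "q2": ["x", "y", "z"]}
relevant = {"q1": {"b"}, "q2": {"w", "z"}}
mrr = mrr_at_k(rankings, relevant, k=10)
recall = recall_at_k(rankings, relevant, k=3)
```

Since MRR averages reciprocal ranks, its inverse gives a rough sense of where the first relevant document tends to appear in the ranking.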

Baselines: Experiments were done with four first-stage retrieval baselines as well as a state-of-the-art BERT-based reranking pipeline, as described below.

  • BM25: The BM25 retrieval model  Robertson and Walker (1994) is a widely-used well-performing lexical-based retrieval model.

  • DeepCT: DeepCT Dai and Callan (2019a) is a state-of-the-art deep lexical retrieval model that uses BERT Devlin et al. (2018) to estimate the semantic importance of words in a document. The BERT-generated term weights are used to replace standard term frequency signals in BM25, helping the retrieval model to focus on essential concepts of documents.

  • BM25+RM3: The relevance model RM3 Lavrenko and Croft (2001) is a popular query expansion technique. It adds related terms to the original query to compensate for the vocabulary gap between queries and documents. BM25+RM3 has been proven to be a strong IR baseline Lin (2019).

  • DeepCT+RM3: Prior research Dai and Callan (2020) shows that using DeepCT weights with RM3 can further improve the BM25+RM3 baseline. Therefore this work also includes DeepCT+RM3 following the method described in Dai and Callan (2020).

  • BM25 + BERT Reranker: this is a pipelined retrieval system that has achieved state-of-the-art performance in various retrieval benchmarks. It uses BM25 for first-stage retrieval, and reranks the top documents using a BERT Reranker Nogueira and Cho (2019). Note that the BERT Reranker uses cross attention between query tokens and document tokens which is slow, therefore it is limited to be used in the reranking stage.

Experimental Methods: We compare the baselines to three experimental retrieval models that all involve neural embeddings.

  • BERT-Siamese: Our first experimental method is a BERT-based embedding retrieval model, as described in Section 3.2. It maps an input query or document into a fixed-size dense vector, and uses the dot product of embeddings for ranking. This method does not use any lexical matching signals. Note that although BERT-based embedding models have been tested on several sentence-level tasks Reimers and Gurevych (2019); Chang et al. (2020), their effectiveness for passage retrieval remains to be studied.

  • CLEAR: The second experimental method is the proposed CLEAR retrieval model.

  • CLEAR + BERT Reranker: this is a pipelined retrieval system that uses CLEAR for first-stage retrieval, and a BERT Reranker Nogueira and Cho (2019) for reranking.

Implementation Details: Lexical retrieval baselines, including BM25, BM25+RM3 and DeepCT, used the Anserini Yang et al. implementation. We tuned the parameters of these retrieval models on the evaluation query sets through 2-fold cross-validation. The parameters include the $k_1$ and $b$ parameters in BM25 and DeepCT, and the number of feedback documents, the number of feedback terms, and the feedback coefficient in BM25+RM3 and DeepCT+RM3.

Experimental Methods (BERT-Siamese and CLEAR models) were implemented in PyTorch Paszke et al. (2019) based on the Hugging Face re-implementation of BERT Wolf et al. (2019). We trained the models with the Adam optimizer, using a learning rate of 2e-5 and a batch size of 28, for one epoch over the training set. We fixed the scaling factor of the residual margin and the weight in the final retrieval score in the experiments for CLEAR.

5 Results and Discussion

Three experiments study CLEAR’s retrieval effectiveness, its impacts on the end-to-end reranking pipeline, and the contributions of different model components.

5.1 Retrieval Accuracy of CLEAR

| Model | Dev MRR@10 | Dev Recall@1000 | DL MRR | DL NDCG@10 | DL MAP@1000 | DL Recall@1000 |
| BM25 | 0.191 | 86.4% | 0.825 | 0.506 | 0.377 | 73.8% |
| BM25+RM3 | 0.166 | 86.1% | 0.818 | 0.555 | 0.452 | 78.9% |
| DeepCT | 0.243 | 91.3% | 0.858 | 0.551 | 0.422 | 75.6% |
| DeepCT+RM3 | 0.232 | 91.4% | 0.924 | 0.601 | 0.481 | 79.4% |
| BERT-Siamese | 0.308 | 92.8% | 0.842 | 0.594 | 0.307 | 58.4% |
| CLEAR | 0.338 | 96.9% | 0.979 | 0.699 | 0.511 | 81.2% |
Table 1: The first-stage retrieval effectiveness of CLEAR and baseline models on the MS MARCO passage ranking dataset, evaluated using two evaluation sets with different characteristics. "Dev" columns refer to the MS MARCO Dev queries and "DL" columns to the TREC2019 DL queries.

The first experiment examines whether CLEAR improves first-stage retrieval accuracy over baseline retrieval models. Table 1 shows the results on the MS MARCO passage ranking set evaluated with two distinct query sets.

CLEAR vs. Classic Lexical Retrieval. BM25 and BM25+RM3 are among the most widely-used first-stage retrieval models in state-of-the-art search engines Lin (2019). On the MS MARCO Dev queries, CLEAR surpassed BM25 and BM25+RM3 by over 12% on Recall@1000, which means that 12% more queries were able to retrieve their relevant documents using CLEAR. CLEAR also almost doubled the MRR scores, meaning that the average ranking of the relevant documents moved from rank 5 to rank 3. On the TREC2019 DL queries, CLEAR also substantially improved over our classic lexical retrieval baselines. Unlike the MS MARCO Dev queries, each TREC2019 DL query has around 95 relevant documents with multiple grades of relevance. The results demonstrate that CLEAR is also effective when the retrieval model needs to find all relevant documents with different levels of relevance.

CLEAR vs. BERT-enhanced Lexical Retrieval. DeepCT and DeepCT+RM3 are two recently-proposed lexical retrieval models enhanced by BERT. These models use BERT to estimate term importance based on the document context; the context-aware term weights are used by a lexical retrieval model and significantly improve retrieval accuracy. Table 1 confirms their advantages over classic lexical retrieval models. However, these models still rely solely on lexical matching of words, and therefore fail to match different vocabularies or higher-level concepts. CLEAR overcomes this drawback by injecting semantic-level match/mismatch information into ranking with embeddings, and therefore achieves better performance in both recall and precision.

CLEAR vs. BERT-Siamese Embedding Retrieval. Although embedding-based retrieval has received much attention recently, its effectiveness has not been established on standard IR benchmark datasets. In this work, we developed BERT-Siamese to study how embedding retrieval works in different IR settings. As shown in Table 1, BERT-Siamese is effective on the MS MARCO Dev queries; but on the TREC2019 DL queries, it cannot even beat the classic, unsupervised BM25 retrieval model in terms of MAP@1000 and recall. Note that the main difference between the two query sets is that MS MARCO Dev queries have only one relevant document per query, while each TREC2019 DL query has multiple relevant documents with multiple levels of relevance. The results indicate that the embeddings learned by BERT-Siamese are focused on finding the most relevant document for a query, but fail to capture the more diverse, weaker relevance patterns required by the TREC2019 DL queries. In other words, retrieval that relies solely on embedding similarities is not sufficient.

CLEAR’s lexical retrieval compensates for the disadvantages of embedding retrieval. In CLEAR, the lexical retrieval model finds a diverse set of documents that are weakly related to the query – documents that mention the query words. Meanwhile, the embedding retrieval model bridges the vocabulary gap between queries and documents, and complements the weaker lexical match with deeper, more complex semantic patterns encoded in the embeddings.

5.2 Accuracy of CLEAR in Pipelined Retrieval Systems

Figure 1: Impacts of CLEAR on a pipelined retrieval system. The system uses the BERT reranker to rerank the top K documents retrieved by BM25 or CLEAR. Panels: (a) Recall on MS MARCO Dev Queries; (b) Recall on TREC2019 DL Queries; (c) Reranking on MS MARCO Dev Queries; (d) Reranking on TREC2019 DL Queries.

| Model | Dev MRR@10 | DL MRR | DL NDCG@10 |
| BM25 + BERT Reranker (K=1,000) | 0.345 | 0.924 | 0.707 |
| CLEAR w/o Reranking | 0.338 | 0.979 | 0.699 |
| CLEAR + BERT Reranker (K=20) | 0.360 | 0.952 | 0.719 |
Table 2: Comparing CLEAR and the state-of-the-art BM25 + BERT Reranker pipeline on the MS MARCO passage ranking dataset. Evaluation used two evaluation sets with different characteristics. "Dev" columns refer to the MS MARCO Dev queries and "DL" columns to the TREC2019 DL queries.

State-of-the-art retrieval pipelines use lexical retrieval such as BM25 as the first-stage ranker to fetch an initial set of documents from the document collection, after which a BERT-based reranker is used to improve the ranking Nogueira and Cho (2019); Craswell et al. (2019). The second experiment investigates the impact of replacing BM25 with CLEAR in such a pipelined retrieval system.

Figure 1 (a)-(b) compare the recall of the top K documents retrieved by BM25 and CLEAR. Reranking at a shallower depth (smaller K) has higher efficiency but may miss more relevant passages. CLEAR had higher recall at all depths, meaning a ranking from CLEAR provided more relevant passages in the candidates passed to the reranker.

Figure 1 (c)-(d) show the performance of a BERT reranker (Nogueira and Cho, 2019) applied to the top K documents retrieved by BM25 and CLEAR. They also report the performance of a single-stage CLEAR retrieval without reranking. When applied to BM25, the accuracy of the BERT reranker improved as K increased, which is expected. In contrast, when applied to CLEAR, the BERT reranker’s performance was relatively insensitive to the reranking depth K. The reranking accuracy was already high with small K; it reached its top performance at around K=20, and then started to decrease slightly. As shown in these figures, a ranking from CLEAR without reranking was already almost as accurate as the BERT reranking pipeline. Adjusting CLEAR’s ranking using the BERT reranker does not make much difference, and may even hurt the originally correct rankings from CLEAR. With CLEAR’s strong initial rankings, one only needs to rerank a few documents to achieve state-of-the-art performance.

Table 2 further illustrates the advantages of CLEAR in a pipelined ranking system. It reports BM25+BERT Reranker’s best reranking accuracy, which was achieved at K=1,000; CLEAR’s ranking accuracy without reranking; and CLEAR+BERT Reranker’s best ranking accuracy, which was achieved at K=20. Similar to the observation from Figure 1, the accuracy of CLEAR w/o Reranking was already close to that of a state-of-the-art BERT-based reranking pipeline. When adding a BERT reranker, our CLEAR+BERT Reranker pipeline can outperform the baseline. Importantly, the required reranking depth decreased from 1,000 to 20, reducing the computational cost of reranking by a factor of 50. In other words, CLEAR generates strong initial rankings that can help SOTA rerankers to achieve higher ranking accuracy with lower computational costs.

In previous research, it is rare to see a full-collection retrieval model outperform a sophisticated reranker. CLEAR shows that a combination of two simple text representations (bag-of-words and embeddings) is sufficient for capturing complex relevance patterns which previously needed to be modeled by interaction-based models that use dozens of layers of attention among query and document words. With a stronger initial ranking, current state-of-the-art BERT rerankers are no longer sufficient: the reranker can even weaken the initial ranking of CLEAR. This provides new challenges and opportunities for researchers to explore new reranking approaches that are different from the current BERT-based paradigm.

5.3 Effects of Residual-Based Embedding Learning

| Model | Dev MRR@10 | Dev Recall@1000 | DL MRR | DL NDCG@10 | DL MAP@1000 | DL Recall@1000 |
| CLEAR | 0.338 | 96.9% | 0.979 | 0.699 | 0.511 | 81.2% |
| Error-based Sampling → Random | 0.241 | 92.6% | 0.850 | 0.553 | 0.409 | 77.9% |
| Residual Margin → Constant Margin | 0.314 | 95.5% | 0.928 | 0.664 | 0.455 | 79.4% |
Table 3: Ablation study on the error-based negative sampling and the residual margin. "Dev" columns refer to the MS MARCO Dev queries and "DL" columns to the TREC2019 DL queries.

The last experiment seeks to understand the effects of our residual-based embedding learning. As described in Section 3, CLEAR attempts to make the lexical retrieval and embedding retrieval complement each other using two techniques: error-based negative sampling, and residual-based margin in the loss function. This experiment studies their impacts on CLEAR through an ablation study.

As shown in Table 3, we first replace the error-based negative samples with random negative samples. This leads to a substantial drop in CLEAR’s retrieval accuracy. The embedding model trained on random negative samples is not aware of the lexical retrieval module. Consequently, combining it with the lexical retrieval does not bring much additive gain. Next, we replace the residual margin in the loss function with a constant margin. The resulting model’s performance is also significantly lower than that of the original CLEAR model. In this case, the embedding model is aware of the errors made by the lexical retrieval by seeing the negative samples, but does not know the degree of the error. Our residual margin explicitly lets the model know how much the embedding retrieval needs to compensate for the lexical retrieval, so that the embedding model can better fit the lexical retrieval model. In summary, results from this experiment demonstrate that the error-based negative sampling and the residual-based margin are both important to the effectiveness of CLEAR.

6 Conclusion

Traditionally, information retrieval has relied on exact lexical matching signals. Neural embedding based retrieval models, on the other hand, are able to match queries and documents at the semantic level in the latent embedding space, but they lose granular word-level matching information. This paper recognizes that one model’s weakness can be another model’s strength, and hypothesizes that they have the potential to complement each other. We then present CLEAR, a deep retrieval framework that attempts to complement lexical retrieval with semantic embedding retrieval. CLEAR trains the embedding retrieval module to focus on the residual of the lexical retrieval model, encoding the language structures and semantics that the lexical retrieval fails to capture.

Experimental results show that CLEAR achieves new state-of-the-art first-stage retrieval effectiveness on two distinct evaluation sets, outperforming classic bag-of-words retrieval models, recent deep lexical retrieval models, and a BERT-based embedding retrieval model. An ablation study shows that our residual-based embedding learning is the key to CLEAR’s advantages. The error-based negative sampling allows the embedding retrieval to be aware of the mistakes of the lexical retrieval, and the residual margin further allows the embeddings to focus on the harder errors. Without these techniques, CLEAR’s performance drops substantially.

There has been increasing attention on embedding-based retrieval. This work finds that such embeddings are indeed effective when the goal is to find the few documents most relevant to the query; however, they may fail to capture weaker and more diverse relevance patterns. CLEAR shows that it is beneficial to use the lexical retrieval model to capture weaker relevance patterns using lexical clues, and to complement it with the stronger, more complex semantic patterns learned by the embeddings.

A full-collection retrieval from CLEAR can be as effective as a BERT reranker. This indicates that the relevance patterns modeled by complex and slow Transformers can be largely captured by a combination of two simple text representations: bag-of-words and embeddings. We view this as an encouraging step towards building deep and efficient retrieval systems.
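As an illustration of how the two representations might be combined at query time, the sketch below fuses per-document scores from a bag-of-words retriever and an embedding retriever with a weighted sum. The weighted-sum form, the weight `lam`, the choice to treat a missing score as 0.0, and the document identifiers are all assumptions made for this example, not a description of the system evaluated in this paper.

```python
from typing import Dict, List, Tuple


def fuse_scores(lexical: Dict[str, float],
                embedding: Dict[str, float],
                lam: float = 0.1,
                top_k: int = 10) -> List[Tuple[str, float]]:
    """Rank documents by lam * lexical_score + embedding_score.

    A document returned by only one retriever gets 0.0 for the missing
    score (an assumption of this sketch).
    """
    candidates = set(lexical) | set(embedding)
    fused = {
        doc: lam * lexical.get(doc, 0.0) + embedding.get(doc, 0.0)
        for doc in candidates
    }
    # Sort by descending fused score; break ties by document id.
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))[:top_k]


# Hypothetical scores from the two retrievers for one query.
lexical_scores = {"d1": 10.0, "d2": 4.0}
embedding_scores = {"d2": 0.9, "d3": 0.7}
ranking = fuse_scores(lexical_scores, embedding_scores, lam=0.1)
ranked_ids = [doc for doc, _ in ranking]
```

In this toy example, `d2` has only a moderate lexical score but ranks first once its embedding score is added, which is the kind of complementary behavior the combination is meant to capture.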


  • M. Aumüller, T. Christiani, R. Pagh, and F. Silvestri (2018) Distance-sensitive hashing. In Proceedings of the 37th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems.
  • J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah (1993) Signature verification using a Siamese time delay neural network. In Advances in Neural Information Processing Systems.
  • W. R. Caid, S. T. Dumais, and S. I. Gallant (1995) Learned vector-space models for document retrieval. Information Processing & Management.
  • W. Chang, F. X. Yu, Y. Chang, Y. Yang, and S. Kumar (2020) Pre-training tasks for embedding-based large-scale retrieval. arXiv preprint arXiv:2002.03932.
  • N. Craswell, B. Mitra, E. Yilmaz, and D. Campos (2019) Overview of the TREC 2019 deep learning track. In TREC (to appear).
  • Z. Dai and J. Callan (2019a) Context-aware sentence/passage term importance estimation for first stage retrieval. arXiv preprint arXiv:1910.10687.
  • Z. Dai and J. Callan (2019b) Deeper text understanding for IR with contextual neural language modeling. In The 42nd International ACM SIGIR Conference on Research & Development in Information Retrieval.
  • Z. Dai and J. Callan (2020) Context-aware document term weighting for ad-hoc search. In The Web Conference 2020.
  • S. C. Deerwester, S. T. Dumais, T. K. Landauer, G. W. Furnas, and R. A. Harshman (1990) Indexing by latent semantic analysis. Journal of the American Society for Information Science.
  • J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • J. Guo, Y. Fan, Q. Ai, and W. B. Croft (2016) A deep relevance matching model for ad-hoc retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management.
  • K. Guu, K. Lee, Z. Tung, P. Pasupat, and M. Chang (2020) REALM: retrieval-augmented language model pre-training. arXiv preprint arXiv:2002.08909.
  • J. Johnson, M. Douze, and H. Jégou (2017) Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
  • J. Lafferty and C. Zhai (2001) Document language models, query models, and risk minimization for information retrieval. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • V. Lavrenko and W. B. Croft (2001) Relevance based language models. In Proceedings of the 24th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • K. Lee, M. Chang, and K. Toutanova (2019) Latent retrieval for weakly supervised open domain question answering. arXiv preprint arXiv:1906.00300.
  • J. Lin (2019) The neural hype and comparisons against weak baselines. In ACM SIGIR Forum, Vol. 52, pp. 40–51.
  • B. Mitra, F. Diaz, and N. Craswell (2017) Learning to match using local and distributed representations of text for web search. In Proceedings of the 26th International Conference on World Wide Web.
  • T. Nguyen, M. Rosenberg, X. Song, J. Gao, S. Tiwary, R. Majumder, and L. Deng (2016) MS MARCO: a human generated machine reading comprehension dataset. arXiv preprint arXiv:1611.09268.
  • R. Nogueira, W. Yang, J. Lin, and K. Cho (2019) Document expansion by query prediction. arXiv preprint arXiv:1904.08375.
  • R. Nogueira and K. Cho (2019) Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, et al. (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems, pp. 8024–8035.
  • N. Reimers and I. Gurevych (2019) Sentence-BERT: sentence embeddings using Siamese BERT-networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing.
  • S. E. Robertson and S. Walker (1994) Some simple effective approximations to the 2-Poisson model for probabilistic weighted retrieval. In SIGIR’94.
  • R. Salakhutdinov and G. E. Hinton (2009) Semantic hashing. International Journal of Approximate Reasoning.
  • G. Salton and M. McGill (1984) Introduction to modern information retrieval. McGraw-Hill Book Company.
  • T. Strohman, D. Metzler, H. Turtle, and W. B. Croft (2005) Indri: a language model-based search engine for complex queries. In Proceedings of the International Conference on Intelligent Analysis.
  • A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin (2017) Attention is all you need. In Advances in Neural Information Processing Systems 30.
  • T. Wolf, L. Debut, V. Sanh, J. Chaumond, C. Delangue, A. Moi, P. Cistac, T. Rault, R. Louf, M. Funtowicz, and J. Brew (2019) HuggingFace’s transformers: state-of-the-art natural language processing. arXiv preprint arXiv:1910.03771.
  • C. Xiong, Z. Dai, J. Callan, Z. Liu, and R. Power (2017) End-to-end neural ad-hoc ranking with kernel pooling. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval.
  • P. Yang, H. Fang, and J. Lin (2017) Anserini: enabling the use of Lucene for information retrieval research. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval, N. Kando, T. Sakai, H. Joho, H. Li, A. P. de Vries, and R. W. White (Eds.).
  • H. Zamani, M. Dehghani, W. B. Croft, E. Learned-Miller, and J. Kamps (2018) From neural re-ranking to neural ranking: learning a sparse representation for inverted indexing. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management.