RepBERT: Contextualized Text Embeddings for First-Stage Retrieval

06/28/2020 ∙ by Jingtao Zhan, et al. ∙ Tsinghua University

Although exact term match between queries and documents is the dominant method for first-stage retrieval, we propose a different approach, called RepBERT, which represents documents and queries with fixed-length contextualized embeddings. The inner products of query and document embeddings are regarded as relevance scores. On the MS MARCO Passage Ranking task, RepBERT achieves state-of-the-art results among all initial retrieval techniques, and its efficiency is comparable to that of bag-of-words methods.




1 Introduction

Ranking pipelines are widely used in search engines. Typically, efficient bag-of-words models are adopted for initial retrieval, and neural ranking models are used for reranking. Although some recent works [DeepCT, Doc2query, docTTTTTquery] adopt deep language models to improve bag-of-words approaches, they still rely on exact term match signals and can hardly retrieve documents at the semantic level. This paper tackles this challenge by directly using deep neural models for first-stage retrieval.

Most neural approaches are time-consuming, especially the well-performing deep language models [devlin2018bert, T5]. However, efficiency is a critical criterion for initial retrieval techniques because each query has millions of candidate documents. To address this, we encode documents into fixed-length embeddings offline and save them to disk, which greatly improves online efficiency. During online retrieval, the model encodes the query and regards the inner products between the query embedding and the document embeddings as relevance scores. The selection of the most relevant documents can then be formulated as Maximum Inner Product Search (MIPS), for which many algorithms with sub-linear computational complexity have been proposed [shrivastava2014asymmetric, ram2012maximum, shen2015learning].
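The offline/online split described above can be illustrated with a minimal sketch. It assumes the document embeddings have already been computed and are held in a plain dictionary; the names `inner_product`, `retrieve`, and `doc_index`, and the toy three-dimensional vectors, are hypothetical and not taken from the released code. The search is exhaustive, not an optimized MIPS algorithm.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]


def inner_product(u: Vector, v: Vector) -> float:
    """Relevance score: inner product of two fixed-length embeddings."""
    return sum(a * b for a, b in zip(u, v))


def retrieve(query_emb: Vector, doc_embs: Dict[str, Vector], k: int) -> List[Tuple[str, float]]:
    """Score every precomputed document embedding and keep the k best."""
    scored = ((doc_id, inner_product(query_emb, emb)) for doc_id, emb in doc_embs.items())
    return heapq.nlargest(k, scored, key=lambda pair: pair[1])


# Offline step (assumed already done): embeddings stored per document id.
doc_index = {
    "d1": [0.9, 0.1, 0.0],
    "d2": [0.1, 0.8, 0.1],
    "d3": [0.4, 0.4, 0.2],
}

# Online step: only the query is encoded, then scored against the index.
query = [1.0, 0.2, 0.0]
top2 = retrieve(query, doc_index, k=2)
print(top2)
```

Because the documents are encoded once, the online cost per query is one encoder pass plus the inner products, which is what makes the approach competitive with bag-of-words retrieval.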

BERT [devlin2018bert] is currently one of the state-of-the-art models in NLP and IR. We adopt it to represent queries and documents. Because our model can be categorized as a representation-focused model [guo2019deep] in the IR community, we call the proposed model RepBERT.

This paper adopts the MS MARCO Passage Ranking dataset [MSMARCO], which is a benchmark dataset for information retrieval. In the following, we describe in detail how we achieve state-of-the-art results for first-stage retrieval. The code and data are released at

2 Related Work

Neural retrievers have proved effective in Open QA tasks [Realm, karpukhin2020dense], where they significantly outperform bag-of-words models such as BM25 [BM25]. However, bag-of-words models are still the dominant first-stage retrieval approaches in the IR community. For example, according to the MS MARCO Passage Ranking leaderboard, almost all public methods use bag-of-words models for initial retrieval. This phenomenon may result from lessons learned during the early years of neural networks. Prior work [ARCI] found that encoding text into a fixed-length embedding risks losing details and that the interactions between terms are essential for superior ranking performance. We believe this problem can be solved with powerful language models, such as BERT [devlin2018bert].

Despite the lack of neural models for initial retrieval, several works have substantially improved bag-of-words models with the help of deep language models. doc2query [Doc2query] uses transformers [vaswani2017attention] to predict queries that might be issued for a given document and then expands the document with those predictions. docTTTTTquery further improves this approach by using T5 [T5] as the expansion model. DeepCT [DeepCT] uses BERT to compute term weights that replace the term frequency field in BM25 [BM25].


3.1 Model Architectures

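Following BERT’s [devlin2018bert] input style, we apply wordpiece tokenization to the input text, and then add a [CLS] token at the beginning and a [SEP] token at the end, where $t_1, \dots, t_n$ denote the wordpiece tokens:

$$\text{[CLS]},\ t_1,\ t_2,\ \dots,\ t_n,\ \text{[SEP]} \tag{1}$$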


Then, we pass the tokens into BERT (see footnote 1), which outputs one contextualized vector for each token. The vectors are averaged to produce the contextualized text embedding. In other words, we propose an encoder to represent the input text. Intuitively, representing queries and documents requires similar text understanding ability. Thus, RepBERT shares the weights of the query encoder and the document encoder. With $t_1, \dots, t_n$ denoting the wordpiece tokens of the input text, the encoder can be formulated as follows:

$$\text{Rep}(\text{text}) = \text{Mean}\big(\text{BERT}(\text{[CLS]},\ t_1,\ \dots,\ t_n,\ \text{[SEP]})\big) \tag{2}$$

Footnote 1: Note that BERT has two segment embeddings, which are added to the embeddings of input tokens in the Embedding Module. In our implementation, we assign the two segment embeddings to the query tokens and the document tokens, respectively.


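After acquiring the embeddings of queries and documents, we regard their inner products as relevance scores. This simple design is mainly based on efficiency considerations. Writing $\text{Rep}(\cdot)$ for the text embedding produced by the encoder, $q$ for a query, and $d$ for a document, it can be formulated as follows:

$$\text{Rel}(q, d) = \text{Rep}(q)^{\top} \cdot \text{Rep}(d) \tag{3}$$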
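To make the architecture concrete, the following toy sketch mimics the two steps described in this section: averaging per-token vectors into one text embedding, and scoring a query-document pair by inner product. The token vectors here are hand-written stand-ins for BERT outputs, and the function names `mean_pool` and `relevance` are illustrative assumptions, not names from the authors' implementation.

```python
from typing import List, Sequence


def mean_pool(token_vectors: Sequence[Sequence[float]]) -> List[float]:
    """Average the per-token vectors into one fixed-length embedding."""
    if not token_vectors:
        raise ValueError("need at least one token vector")
    count = len(token_vectors)
    dim = len(token_vectors[0])
    return [sum(vec[j] for vec in token_vectors) / count for j in range(dim)]


def relevance(query_rep: Sequence[float], doc_rep: Sequence[float]) -> float:
    """Inner product of the query and document embeddings."""
    return sum(a * b for a, b in zip(query_rep, doc_rep))


# Stand-ins for contextualized vectors (one per token, including [CLS] and [SEP]).
query_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
doc_tokens = [[2.0, 0.0], [0.0, 2.0]]

q_rep = mean_pool(query_tokens)
d_rep = mean_pool(doc_tokens)
score = relevance(q_rep, d_rep)
print(q_rep, d_rep, score)
```

In a real system the same encoder weights would produce both sets of token vectors, since the query and document encoders are shared.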


3.2 Training

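Loss Function: The goal of training is to make the embedding inner products of relevant query-document pairs larger than those of irrelevant pairs. Let $\{q, d_1^+, \dots, d_m^+, d_{m+1}^-, \dots, d_n^-\}$ be one instance of the input training batch. The instance contains one query $q$, $m$ relevant (positive) documents, and $n-m$ irrelevant (negative) documents. We adopt MultiLabelMarginLoss [paszke2017automatic] as the loss function, where $\text{Rel}(q, d)$ is the inner-product relevance score defined in Section 3.1:

$$\mathcal{L} = \frac{1}{n} \sum_{1 \le i \le m,\ m < j \le n} \max\Big(0,\ 1 - \big(\text{Rel}(q, d_i^+) - \text{Rel}(q, d_j^-)\big)\Big) \tag{4}$$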


In-batch Negatives: During training, it is computationally expensive to sample many negative documents for each query. The trick of in-batch negatives is to use the documents from other query-document pairs in the same mini-batch as negative examples. For instance, suppose there are $N$ query-document pairs in the mini-batch. Then, most of the time, each query has one positive example and $N-1$ negative examples. In rare cases, for a given query, some documents from other query-document pairs (the usual negatives) may be relevant and are thus regarded as positives in Equation 4. This trick has been used in prior works [karpukhin2020dense, gillick2019learning] for training a siamese neural network.
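The interaction between the margin loss and in-batch negatives can be sketched in plain Python. This is an illustrative re-implementation under stated assumptions, not the authors' training code: it follows the pairwise hinge form of PyTorch's `torch.nn.MultiLabelMarginLoss` (sum over positive/negative pairs, divided by the number of candidates, then averaged over the batch), and the names `multilabel_margin_loss`, `in_batch_loss`, and `extra_positives` are hypothetical.

```python
from typing import Dict, Optional, Sequence, Set


def multilabel_margin_loss(scores: Sequence[float], positives: Set[int]) -> float:
    """Hinge loss over every (positive, negative) pair for one query,
    divided by the number of candidate documents."""
    total = 0.0
    for i in positives:
        for j in range(len(scores)):
            if j not in positives:
                total += max(0.0, 1.0 - (scores[i] - scores[j]))
    return total / len(scores)


def in_batch_loss(
    score_matrix: Sequence[Sequence[float]],
    extra_positives: Optional[Dict[int, Set[int]]] = None,
) -> float:
    """score_matrix[q][d] is the inner product of query q with the document
    paired with query d. The diagonal holds the positives; every other
    document in the batch acts as a negative unless listed in extra_positives."""
    losses = []
    for q, row in enumerate(score_matrix):
        positives = {q}
        if extra_positives:
            positives |= extra_positives.get(q, set())
        losses.append(multilabel_margin_loss(row, positives))
    return sum(losses) / len(losses)


# Three query-document pairs in one mini-batch (toy scores).
scores = [
    [3.0, 1.0, 2.5],
    [0.0, 2.0, 0.5],
    [1.0, 1.0, 4.0],
]
loss = in_batch_loss(scores)
print(loss)
```

Only the first query contributes here: its positive beats the third document by 0.5, which is inside the margin of 1, so that pair is penalized by 0.5.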

4 Experiment

4.1 Dataset

MS MARCO Passage Ranking Dataset [MSMARCO] (MS MARCO) is a benchmark English dataset for ad-hoc retrieval. It has approximately 8.8 million passages, 0.5 million training queries, and 6.9 thousand development queries. A blind, held-out evaluation set with about 6.8 thousand queries is also available; its results are provided by the organizers upon submission to the online leaderboard. To keep the terminology consistent throughout this paper, we refer to the passages, the basic units of retrieval, as "documents".

4.2 Baselines

We compare with four initial retrieval techniques that are public on the MS MARCO leaderboard: BM25(Anserini) [yang2018anserini], doc2query [Doc2query], DeepCT [DeepCT], and docTTTTTquery [docTTTTTquery]. The last three methods use deep language models to improve BM25 and are very competitive. They are briefly introduced in Section 2.

We also show performances of two-stage retrieval techniques. BiLSTM + Co-Attention + self attention based document scorer [Alaparthi2019MicrosoftAC] is the best non-ensemble, non-BERT method from the leaderboard with an associated paper. It uses BM25 for initial retrieval and deep attention networks for reranking. Another technique is proposed by Nogueira et al. [nogueira2019passage], which uses BM25 for initial retrieval and BERT Large for reranking.

4.3 First-Stage Retrieval

This section compares RepBERT with other retrieval techniques based on the performance of first-stage retrieval.

| Method | MRR@10 (Dev) | MRR@10 (Test) | R@1000 (Dev) | Latency (ms/query) |
|---|---|---|---|---|
| BM25(Anserini) [Doc2query] | 0.184 | 0.186 | 0.853 | 50 |
| doc2query [Doc2query] | 0.215 | 0.218 | 0.893 | 90 |
| DeepCT [DeepCT] | 0.243 | 0.239 | 0.913 | 55 |
| docTTTTTquery [docTTTTTquery] | 0.277 | 0.272 | 0.947 | 64 |
| Ours (RepBERT) | 0.304 | 0.294 | 0.943 | 80 |
| Best non-ensemble, non-BERT [Alaparthi2019MicrosoftAC] | 0.298 | 0.291 | 0.814 | - |
| BM25 + BERT Large [nogueira2019passage] | 0.365 | 0.358 | 0.814 | 3,400 |

Table 1: Performances of first-stage retrieval and two-stage retrieval models on MS MARCO Passage Ranking dataset

4.3.1 Settings

We adopt the "Train Triples" data provided by MS MARCO [MSMARCO] for training. Due to the limitation of computational resources, we adopt the BERT base model in our experiment, which consists of encoder layers with vector dimension of . The maximum query length and document length are set to and tokens, respectively. We fine-tune the model using one Titan XP GPU with a batch size of and gradient accumulation steps of for steps, which corresponds to training on () query-document pairs. We could not observe any improvement based on a small dev set when training for another steps.

Our implementation is based on a public transformer library [Wolf2019HuggingFacesTS]. We follow the hyperparameter settings in Nogueira et al. [nogueira2019passage]. Specifically, we use ADAM [Adam] with the initial learning rate set to , , , L2 weight decay of , learning rate warmup over the first steps, and linear decay of the learning rate. We use a dropout probability of on all layers.

The latency of different models is also provided. The latency of baselines are copied from prior works [Doc2query, docTTTTTquery]. As for our models, because the document embeddings consume and thus are impossible to load into a single GPU, we utilize Titan XP and GeForce GTX 1080ti to retrieve top-1000 documents for each query. We report the average latency to retrieve queries in the dev set. The efficiency can be further improved using more advanced GPUs or TPUs.
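The text does not spell out how the search is split across the two GPUs. One common arrangement, shown here purely as an assumption, is to shard the document embeddings, take a local top-k on each shard, and merge the partial results. The sketch below runs on the CPU with toy data; `shard_top_k` and `merged_top_k` are hypothetical names.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Scored = Tuple[float, str]  # (score, doc_id)


def shard_top_k(query: Sequence[float], shard: Dict[str, Sequence[float]], k: int) -> List[Scored]:
    """Exhaustively score one shard and keep its k best documents."""
    scored = (
        (sum(a * b for a, b in zip(query, emb)), doc_id)
        for doc_id, emb in shard.items()
    )
    return heapq.nlargest(k, scored)


def merged_top_k(query: Sequence[float], shards: List[Dict[str, Sequence[float]]], k: int) -> List[Scored]:
    """Merge the per-shard winners; the global top-k is always among them."""
    candidates: List[Scored] = []
    for shard in shards:
        candidates.extend(shard_top_k(query, shard, k))
    return heapq.nlargest(k, candidates)


# Two shards standing in for the embeddings held on two devices.
shard_a = {"d1": [1.0, 0.0], "d2": [0.0, 1.0]}
shard_b = {"d3": [0.5, 0.5], "d4": [2.0, 0.0]}
result = merged_top_k([1.0, 0.5], [shard_a, shard_b], k=2)
print(result)
```

The merge is exact because any document in the global top-k must also be in the top-k of its own shard.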

4.3.2 Discussion

The results are shown in Table 1.

RepBERT can represent text to retrieve documents at the semantic level with high accuracy. In terms of MRR@10, our model substantially outperforms the other first-stage retrieval techniques. In particular, it is better than the best non-ensemble, non-BERT two-stage retrieval method.

RepBERT also achieves high recall, so its ranking results can be used by subsequent reranking models. In terms of Recall@1000, our model is very close to the best result, achieved by docTTTTTquery [docTTTTTquery], which uses the more powerful T5 [T5] language model, and it clearly outperforms the other baselines. We believe using more advanced language models to represent text can further improve RepBERT, just as docTTTTTquery improves on doc2query.
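For readers unfamiliar with the two metrics used in this discussion, the sketch below computes MRR@k and Recall@k under their usual definitions. It is an illustration only; the official MS MARCO evaluation script may differ in details such as how queries without relevant documents are handled, and the names `mrr_at_k`, `recall_at_k`, `rankings`, and `qrels` are assumptions.

```python
from typing import Dict, List, Set


def mrr_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 10) -> float:
    """Mean reciprocal rank of the first relevant document within the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        for rank, doc_id in enumerate(ranked[:k], start=1):
            if doc_id in relevant:
                total += 1.0 / rank
                break
    return total / len(rankings)


def recall_at_k(rankings: Dict[str, List[str]], qrels: Dict[str, Set[str]], k: int = 1000) -> float:
    """Average fraction of a query's relevant documents found in the top k."""
    total = 0.0
    for qid, ranked in rankings.items():
        relevant = qrels.get(qid, set())
        if relevant:
            total += len(relevant & set(ranked[:k])) / len(relevant)
    return total / len(rankings)


rankings = {"q1": ["d3", "d1", "d7"], "q2": ["d2", "d9", "d4"]}
qrels = {"q1": {"d1"}, "q2": {"d4", "d5"}}
mrr = mrr_at_k(rankings, qrels, k=10)
rec = recall_at_k(rankings, qrels, k=3)
print(mrr, rec)
```

MRR@10 rewards placing a relevant document near the top, while Recall@1000 measures how much relevant material is handed to a downstream reranker.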

In terms of efficiency, RepBERT is comparable to bag-of-words models. It shows that it is practical to represent documents offline and compute inner products online for first-stage retrieval. Note that in our current retrieval implementation, we have not adopted optimized MIPS algorithms [shrivastava2014asymmetric, ram2012maximum, shen2015learning] and simply compute relevance scores between the given query and each document. We plan to investigate them in the future.

In summary, we propose a method that represents text with fixed-length embeddings and efficiently retrieves documents with high accuracy and recall. The model outperforms both the original and the improved bag-of-words models, which highlights the possibility of replacing them for initial retrieval.

4.4 Rerank based on RepBERT

This section investigates the performance of a reranking model when using RepBERT as the first-stage retriever.

Figure 1: At different depths, the recall of the first-stage retrieval method and the reranking accuracy of BERT Large. Panels: (a) recall at different depths; (b) reranking performance at different depths. Dataset: MS MARCO dev.

| Depth | BM25(Anserini) | doc2query | DeepCT | docTTTTTquery | RepBERT |
|---|---|---|---|---|---|
| 5 | 0.232 | 0.265 (14%) | 0.279 (20%) | 0.314 (36%) | 0.319 (38%) |
| 10 | 0.276 | 0.307 (11%) | 0.320 (16%) | 0.351 (27%) | 0.344 (25%) |
| 50 | 0.336 | 0.354 (5%) | 0.361 (8%) | 0.375 (12%) | 0.370 (10%) |
| 500 | 0.366 | 0.373 (2%) | 0.374 (2%) | 0.380 (4%) | 0.377 (3%) |
| 1000 | 0.371 | 0.376 (1%) | 0.376 (1%) | 0.380 (2%) | 0.376 (1%) |

Table 2: Reranking accuracy (MRR@10) of BERT Large [nogueira2019passage] using different first-stage retrieval techniques at different reranking depths. Dataset: MS MARCO dev. The improvement in parentheses is relative to the reranking performance using the BM25(Anserini) index.
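The percentages in Table 2 are relative improvements over the BM25(Anserini) column. As a small worked check, the sketch below recomputes a few of them from the rounded table values for RepBERT. The helper name `relative_improvement` is hypothetical, and because the paper presumably used unrounded scores, recomputing from rounded values may differ by a point in some cells.

```python
def relative_improvement(value: float, baseline: float) -> int:
    """Relative gain over the baseline, as a whole-number percentage."""
    return round(100.0 * (value - baseline) / baseline)


# MRR@10 after reranking, copied from Table 2 (depth -> score).
bm25 = {10: 0.276, 50: 0.336, 500: 0.366, 1000: 0.371}
repbert = {10: 0.344, 50: 0.370, 500: 0.377, 1000: 0.376}

gains = {depth: relative_improvement(repbert[depth], bm25[depth]) for depth in bm25}
print(gains)
```

The shrinking gains at larger depths reflect the narrowing gap between first-stage retrievers once the reranker sees enough candidates.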

4.4.1 Settings

Intuitively, the recall rate is an important factor for reranking performance. Thus, we compute it for different first-stage retrieval techniques at different depths. The results are shown in Figure 1(a).

Following prior works [DeepCT, Doc2query], we directly utilize the public BERT Large model [nogueira2019passage] finetuned on MS MARCO to rerank the documents retrieved by different models, except doc2query [Doc2query] which already made the reranking run file public. The overall performances on dev set at different depths are shown in Table 2 and Figure 1(b).
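Reranking at a given depth can be sketched as follows: only the first `depth` candidates from the first-stage retriever are rescored and reordered. The scorer below is a toy word-overlap function standing in for BERT Large, and all names (`rerank`, `toy_scorer`) are illustrative assumptions.

```python
from typing import Callable, List


def rerank(
    query: str,
    candidates: List[str],
    scorer: Callable[[str, str], float],
    depth: int,
) -> List[str]:
    """Rescore the top `depth` first-stage candidates and sort them by the new score."""
    head = candidates[:depth]
    return sorted(head, key=lambda doc: scorer(query, doc), reverse=True)


def toy_scorer(query: str, doc: str) -> float:
    """Stand-in for a neural reranker: counts shared words."""
    return float(len(set(query.split()) & set(doc.split())))


candidates = [
    "cats sleep a lot",
    "dogs bark",
    "why do cats purr",
    "cats purr when content",
]
top = rerank("why cats purr", candidates, toy_scorer, depth=3)
print(top)
```

The fourth candidate is never seen by the scorer at depth 3, which is why first-stage recall at the chosen depth bounds what the reranker can achieve.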

4.4.2 Discussion

According to Figure 1(a), our proposed RepBERT achieves the best recall at small depths, partly due to the high retrieval accuracy of our model. At large depths, RepBERT and docTTTTTquery are the two best-performing models. Thus, one would expect RepBERT’s reranking performance to be the best at all depths.

According to Table 2 and Figure 1(b), using RepBERT achieves the best results at small depths. At moderate depths, such as 50, docTTTTTquery performs best, but RepBERT still outperforms the other baselines. At larger depths, such as 500 or 1000, the performance gap between models becomes smaller.

However, there is some inconsistency between the recall and the reranking performance. Although RepBERT is still as good as docTTTTTquery in terms of recall at large depths, its reranking performance is worse than that of docTTTTTquery. We believe this inconsistency is due to the mismatch between the training and testing data distributions of the reranking model. It is elaborated in the next section.

4.4.3 Mismatch

Figure 2: For a certain depth, the average proportion of retrieved documents that are also in the official top-1000 candidates provided by MS MARCO. Dataset: MS MARCO dev.

In the following, we present our speculation that the mismatch leads to performance loss. The reranking model [nogueira2019passage] used in prior works [DeepCT, Doc2query] and in ours is trained on the "Train Triples" data provided by MS MARCO. This data was generated by pairing positive documents in the qrel file with negative documents in the top-1000 file retrieved by the official BM25. However, during testing, the model is used to rerank the documents retrieved by another method, such as RepBERT. This can cause a severe mismatch in the input data distribution if the retrievers used during training and testing are very different.
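To illustrate why the training data is tied to the official retriever, the sketch below builds (query, positive, negative) triples by pairing judged positives with non-relevant documents drawn from a retriever's candidate list. The exact sampling procedure used by MS MARCO is not described here, so the sampling scheme, the function name `build_triples`, and its parameters are assumptions.

```python
import random
from typing import Dict, List, Set, Tuple


def build_triples(
    qrels: Dict[str, Set[str]],
    top_candidates: Dict[str, List[str]],
    negatives_per_positive: int = 1,
    seed: int = 0,
) -> List[Tuple[str, str, str]]:
    """Pair each positive with negatives sampled from the retriever's candidates."""
    rng = random.Random(seed)
    triples = []
    for qid, positives in qrels.items():
        # Negatives can only come from what the first-stage retriever returned.
        negatives = [d for d in top_candidates.get(qid, []) if d not in positives]
        if not negatives:
            continue
        for pos in sorted(positives):
            sample_size = min(negatives_per_positive, len(negatives))
            for neg in rng.sample(negatives, sample_size):
                triples.append((qid, pos, neg))
    return triples


qrels = {"q1": {"d1"}}
top_candidates = {"q1": ["d1", "d2", "d3"]}
triples = build_triples(qrels, top_candidates)
print(triples)
```

Every negative the reranker sees during training was surfaced by the training-time retriever, so documents favored by a very different retriever fall outside the distribution it learned from.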

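Before further elaboration, we introduce some notation. We use $M$, $q$, and $n$ ($n \le 1000$) to denote a retrieval technique, a query, and a depth, respectively. Specifically, the new technique used in testing is denoted as $M_{new}$ and the official retriever used in training is denoted as $M_{off}$. We use $\text{Top}_n^{M}(q)$ to denote the top-$n$ documents retrieved by $M$ for a given query $q$. Note that MS MARCO’s "Train Triples" data is generated using $\text{Top}_{1000}^{M_{off}}(q)$.

We use a simple method to quantify the mismatch, based on the following intuition. If $\text{Top}_n^{M_{new}}(q) \subseteq \text{Top}_{1000}^{M_{off}}(q)$, there is no mismatch for query $q$ at depth $n$. But if $\text{Top}_n^{M_{new}}(q) \cap \text{Top}_{1000}^{M_{off}}(q) = \emptyset$, the mismatch for query $q$ at depth $n$ is the biggest. Therefore, we define the consistency factor of $M_{new}$ at depth $n$, called $CF_n(M_{new})$, as the average proportion of documents in $\text{Top}_n^{M_{new}}(q)$ that are also in $\text{Top}_{1000}^{M_{off}}(q)$. Thus, $CF_n(M_{new}) \in [0, 1]$, where $1$ represents no mismatch and $0$ represents the biggest mismatch. It can be formulated as follows ($Q$ is the set of queries and $|S|$ is the cardinality of set $S$):

$$CF_n(M_{new}) = \frac{1}{|Q|} \sum_{q \in Q} \frac{\left|\text{Top}_n^{M_{new}}(q) \cap \text{Top}_{1000}^{M_{off}}(q)\right|}{\left|\text{Top}_n^{M_{new}}(q)\right|} \tag{5}$$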
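The consistency factor can be computed directly from two run files. The following sketch assumes each run is a mapping from query id to a ranked list of document ids; the function name `consistency_factor` and the toy runs are hypothetical.

```python
from typing import Dict, List


def consistency_factor(
    new_runs: Dict[str, List[str]],
    official_runs: Dict[str, List[str]],
    n: int,
) -> float:
    """Average share of the new retriever's top-n documents that also
    appear among the official retriever's top-1000 candidates."""
    total = 0.0
    counted = 0
    for qid, ranked in new_runs.items():
        top_n = ranked[:n]
        if not top_n:
            continue  # skip queries with no retrieved documents
        official = set(official_runs.get(qid, [])[:1000])
        total += sum(doc in official for doc in top_n) / len(top_n)
        counted += 1
    return total / counted


new_runs = {"q1": ["a", "b", "c", "d"], "q2": ["x", "y", "z", "w"]}
official_runs = {"q1": ["a", "c", "e"], "q2": ["p", "q", "x"]}
cf2 = consistency_factor(new_runs, official_runs, n=2)
cf4 = consistency_factor(new_runs, official_runs, n=4)
print(cf2, cf4)
```

A value near 1 means the reranker sees candidates much like those it was trained on, while a value near 0 signals the largest distribution shift.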


We compute the consistency factor for different first-stage retrieval techniques, and the results are shown in Figure 2. The consistency factor of RepBERT is considerably lower than that of the other methods, especially at large depths. This means that a major proportion of the documents retrieved by RepBERT are not considered as candidates by the official BM25. Such results agree with the design of the different techniques. The prior works, though also using deep language models, still rely on exact match signals to retrieve, while our proposed model uses semantic match signals. The distributions of documents retrieved by RepBERT and by BM25 are therefore very different. Thus, the model trained to rerank documents retrieved by BM25 may not work well when reranking documents retrieved by RepBERT. We believe training BERT Large with negatives sampled from the top-1000 documents retrieved by RepBERT can solve this issue.

4.5 Combination of Semantic Match and Exact Match

As introduced in the previous section, RepBERT utilizes semantic match signals, which is very different from BM25 and its improved versions using exact match signals. Thus, it is a straightforward idea to investigate whether the combination of the two signals can achieve better retrieval performance.

4.5.1 Method

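Before we present a simple method to combine two retrieval techniques, we introduce some notation. We use $M$ to denote a retrieval technique, and $D_i^{M}(q)$ to denote the $i$-th document retrieved by $M$ for a given query $q$. Thus, the top-$n$ documents retrieved by $M$ for a given query $q$, denoted as $\text{Top}_n^{M}(q)$, are equal to $[D_1^{M}(q), D_2^{M}(q), \dots, D_n^{M}(q)]$.

Our method is as follows. We use $M_1$ and $M_2$ to refer to the two techniques that will be combined. First, we merge the two retrieved document lists, namely $\text{Top}_{1000}^{M_1}(q)$ and $\text{Top}_{1000}^{M_2}(q)$, in an alternating fashion to acquire a preliminary ranking list. As an illustrative example with lists of length three, if $\text{Top}_3^{M_1}(q) = [d_a, d_b, d_c]$ and $\text{Top}_3^{M_2}(q) = [d_b, d_d, d_a]$, then the merged list is $[d_a, d_b, d_b, d_d, d_c, d_a]$. Such an operation usually results in duplicated documents at different ranking positions. Thus, for each document we keep only its first (highest-ranked) occurrence and filter out the later ones. In our example, the filtered list is $[d_a, d_b, d_d, d_c]$. Finally, we truncate the filtered list to contain only the first 1000 documents. Writing $\text{Unique}(\cdot)$ for the duplicate filter and $\text{Truncate}_{1000}(\cdot)$ for the truncation, the whole process can be formulated as follows:

$$\text{Top}_{1000}^{M_1 + M_2}(q) = \text{Truncate}_{1000}\Big(\text{Unique}\big([D_1^{M_1}(q), D_1^{M_2}(q), D_2^{M_1}(q), D_2^{M_2}(q), \dots, D_{1000}^{M_1}(q), D_{1000}^{M_2}(q)]\big)\Big) \tag{6}$$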
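The merge, filter, and truncate steps can be sketched in a few lines of Python. This is a minimal illustration that assumes duplicates are resolved by keeping the first occurrence; the function name `combine` and the toy document ids are hypothetical.

```python
from itertools import chain, zip_longest
from typing import List


def combine(first: List[str], second: List[str], limit: int = 1000) -> List[str]:
    """Interleave two ranked lists, keep each document's first occurrence,
    and truncate the result to `limit` documents."""
    missing = object()  # padding marker for lists of unequal length
    interleaved = chain.from_iterable(zip_longest(first, second, fillvalue=missing))
    seen = set()
    merged: List[str] = []
    for doc in interleaved:
        if doc is missing or doc in seen:
            continue
        seen.add(doc)
        merged.append(doc)
        if len(merged) == limit:
            break
    return merged


run_a = ["d1", "d2", "d3"]
run_b = ["d2", "d4", "d1"]
a_then_b = combine(run_a, run_b)
b_then_a = combine(run_b, run_a)
print(a_then_b, b_then_a)
```

Swapping the argument order changes which list supplies the top-ranked document, which is why the order of the two techniques matters for MRR@10.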


We present the retrieval accuracy and recall of different combinations in Tables 3 and 4, respectively. The cell in row $M_1$ and column $+M_2$ shows the performance of the combination $M_1 + M_2$ and the improvement compared with $M_1$ alone. Note that in our method, $M_1 + M_2$ and $M_2 + M_1$ are different combinations, which is clearly reflected in the MRR@10 metric.

It is worth pointing out that there is much room for improvement. For example, without truncation in Equation 6, Recall@2000 is 0.980 for RepBERT+docTTTTTquery, compared with 0.967 after truncation. There may be other methods to achieve better combining performance.

| MRR@10 | +BM25(Anserini) | +doc2query | +DeepCT | +docTTTTTquery | +RepBERT |
|---|---|---|---|---|---|
| BM25(Anserini) | 0.187 | 0.203 () | 0.213 () | 0.227 () | 0.245 (31%) |
| doc2query | 0.217 () | 0.222 | 0.236 () | 0.247 () | 0.263 (19%) |
| DeepCT | 0.236 (-3%) | 0.246 (1%) | 0.243 | 0.263 (8%) | 0.276 (14%) |
| docTTTTTquery | 0.263 (-5%) | 0.270 (-3%) | 0.275 (-1%) | 0.277 | 0.298 (8%) |
| RepBERT | 0.296 (-2%) | 0.302 (-1%) | 0.306 (1%) | 0.315 (4%) | 0.304 |

Table 3: The retrieval accuracy (MRR@10) of different technique combinations. Dataset: MS MARCO dev. The cell in row $M_1$ and column $+M_2$ shows the ranking accuracy of the combination $M_1 + M_2$ and the improvement compared with the model corresponding to the row.

| Recall@1000 | +BM25(Anserini) | +doc2query | +DeepCT | +docTTTTTquery | +RepBERT |
|---|---|---|---|---|---|
| BM25(Anserini) | 0.857 | 0.888 (4%) | 0.909 (6%) | 0.937 (9%) | 0.957 (12%) |
| doc2query | 0.888 (0%) | 0.892 | 0.919 (3%) | 0.941 (6%) | 0.961 (8%) |
| DeepCT | 0.909 (1%) | 0.919 (2%) | 0.904 | 0.949 (5%) | 0.957 (6%) |
| docTTTTTquery | 0.937 (-1%) | 0.942 (-1%) | 0.949 (0%) | 0.947 | 0.967 (2%) |
| RepBERT | 0.957 (1%) | 0.961 (2%) | 0.957 (1%) | 0.967 (3%) | 0.943 |

Table 4: The retrieval recall of different technique combinations. Dataset: MS MARCO dev. The cell in row $M_1$ and column $+M_2$ shows the recall of the combination $M_1 + M_2$ and the improvement compared with the model corresponding to the row.

4.5.2 Discussion

As shown in Tables 3 and 4, BM25 and its improved versions achieve the best ranking accuracy and recall when combined with RepBERT. In terms of recall in particular, although docTTTTTquery is as good as (or slightly better than) RepBERT according to Table 1, RepBERT better boosts the recall of the other baselines. We believe this is because RepBERT supplies the semantic matching ability that these baselines lack.

For RepBERT, combination with an exact match retriever can improve its recall and may also improve its ranking accuracy. According to Tables 1, 3, and 4, RepBERT+docTTTTTquery is the best first-stage retriever in this paper. The results suggest that exact match signals are also helpful for semantic matching retrievers.

5 Conclusion

This paper proposes RepBERT to represent text with contextualized embeddings for first-stage retrieval. It achieves state-of-the-art initial retrieval performance on the MS MARCO Passage Ranking dataset. We highlight the possibility of using representation-focused neural models to replace the widely adopted bag-of-words models in first-stage retrieval. In the future, we plan to test the model’s generalization ability on different datasets and investigate its performance in retrieving long text.