
Passage Re-ranking with BERT

by Rodrigo Nogueira, et al.
New York University

Recently, neural models pretrained on a language modeling task, such as ELMo (Peters et al., 2017), OpenAI GPT (Radford et al., 2018), and BERT (Devlin et al., 2018), have achieved impressive results on various natural language processing tasks such as question answering and natural language inference. In this paper, we describe a simple re-implementation of BERT for query-based passage re-ranking. Our system is the state of the art on the TREC-CAR dataset and the top entry in the leaderboard of the MS MARCO passage retrieval task, outperforming the previous state of the art by 27% (relative) in MRR@10. The code to reproduce our submission is publicly available.



1 Introduction

We have seen rapid progress in machine reading comprehension in recent years with the introduction of large-scale datasets, such as SQuAD (Rajpurkar et al., 2016), MS MARCO (Nguyen et al., 2016), SearchQA (Dunn et al., 2017), TriviaQA (Joshi et al., 2017), and QUASAR-T (Dhingra et al., 2017), and the broad adoption of neural models, such as BiDAF (Seo et al., 2016), DrQA (Chen et al., 2017), DocumentQA (Clark & Gardner, 2017), and QANet (Yu et al., 2018).

The information retrieval (IR) community has also experienced a flourishing development of neural ranking models, such as DRMM (Guo et al., 2016), KNRM (Xiong et al., 2017), Co-PACRR (Hui et al., 2018), and DUET (Mitra et al., 2017). However, until recently, there were only a few large datasets for passage ranking, with the notable exception of TREC-CAR (Dietz et al., 2017). This, at least in part, prevented neural ranking models from being successful when compared to more classical IR techniques (Lin, 2019).

We argue that the same two ingredients that enabled much of the progress on reading comprehension are now available for the passage ranking task: the MS MARCO passage ranking dataset, which contains one million queries from real users and their respective relevant passages annotated by humans, and BERT, a powerful general-purpose natural language processing model.

In this paper, we describe in detail how we have re-purposed BERT as a passage re-ranker and achieved state-of-the-art results on the MS MARCO passage re-ranking task.

2 Passage Re-Ranking with BERT


A simple question-answering pipeline consists of three main stages. First, a large number of documents (for example, a thousand) that are possibly relevant to a given question are retrieved from a corpus by a standard mechanism, such as BM25. In the second stage, passage re-ranking, each of these documents is scored and re-ranked by a more computationally intensive method. Finally, the top ten or fifty of these documents become the source of candidate answers for an answer generation module. In this paper, we describe how we implemented the second stage of this pipeline, passage re-ranking.
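As a rough illustration, the three stages above can be sketched in a few lines of Python. The function names (bm25_retrieve, rerank, answer_candidates) are illustrative placeholders, not the authors' implementation, and the term-overlap "retrieval" merely stands in for a real BM25 index and a neural scorer:

```python
def bm25_retrieve(query, corpus, k=1000):
    """Stage 1: cheap lexical retrieval. Here passages are scored by naive
    term overlap with the query, just to make the sketch runnable."""
    q_terms = set(query.lower().split())
    scored = [(sum(t in q_terms for t in p.lower().split()), p) for p in corpus]
    scored.sort(key=lambda x: x[0], reverse=True)
    return [p for _, p in scored[:k]]

def rerank(query, passages, scorer):
    """Stage 2: re-score each candidate with a more expensive model and sort."""
    return sorted(passages, key=lambda p: scorer(query, p), reverse=True)

def answer_candidates(query, passages, top=10):
    """Stage 3: the top re-ranked passages feed an answer-generation module."""
    return passages[:top]

corpus = ["bert is a language model", "bm25 is a ranking function", "cats sleep a lot"]
candidates = bm25_retrieve("what is bert", corpus, k=2)
ranked = rerank("what is bert", candidates,
                scorer=lambda q, p: len(set(q.split()) & set(p.split())))
print(answer_candidates("what is bert", ranked, top=1))
# -> ['bert is a language model']
```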


The job of the re-ranker is to estimate a score s_i of how relevant a candidate passage d_i is to a query q. We use BERT as our re-ranker. Using the same notation as Devlin et al. (2018), we feed the query as sentence A and the passage text as sentence B. We truncate the query to have at most 64 tokens. We also truncate the passage text such that the concatenation of query, passage, and separator tokens has a maximum length of 512 tokens. We use BERT as a binary classification model; that is, we use the [CLS] vector as input to a single-layer neural network to obtain the probability of the passage being relevant. We compute this probability for each passage independently and obtain the final list of passages by ranking them with respect to these probabilities.
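The input packing described above can be sketched as follows. The whitespace "tokens" stand in for BERT's WordPiece tokenizer, and build_input is a hypothetical helper, but the truncation budget follows the limits in the text: at most 64 query tokens, 512 tokens total, with a [CLS] token and two [SEP] separators:

```python
MAX_QUERY_TOKENS = 64
MAX_SEQ_LEN = 512

def build_input(query_tokens, passage_tokens,
                max_query=MAX_QUERY_TOKENS, max_len=MAX_SEQ_LEN):
    """Pack (query, passage) into one BERT input sequence:
    [CLS] query [SEP] passage [SEP]. The query keeps at most 64 tokens,
    and the passage is cut so the whole sequence fits in 512 tokens."""
    q = query_tokens[:max_query]
    budget = max_len - len(q) - 3   # 3 special tokens: [CLS] + two [SEP]
    p = passage_tokens[:budget]
    tokens = ["[CLS]"] + q + ["[SEP]"] + p + ["[SEP]"]
    # Segment (sentence A/B) ids: 0 for [CLS]+query+[SEP], 1 for passage+[SEP]
    segment_ids = [0] * (len(q) + 2) + [1] * (len(p) + 1)
    return tokens, segment_ids

tokens, segs = build_input(["q"] * 100, ["p"] * 600)
print(len(tokens))  # 512
```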

We start training from a pre-trained BERT model and fine-tune it to our re-ranking task using the cross-entropy loss:

L = - sum_{j in J_pos} log(s_j) - sum_{j in J_neg} log(1 - s_j),

where J_pos is the set of indexes of the relevant passages and J_neg is the set of indexes of the non-relevant passages in the top-1,000 documents retrieved with BM25.
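A minimal sketch of this loss, assuming s_j is the model's predicted relevance probability for passage j (reranker_loss is an illustrative helper, not the authors' code):

```python
import math

def reranker_loss(probs, positive_idx, negative_idx):
    """Cross-entropy loss over independently scored passages:
    L = - sum_{j in J_pos} log(s_j) - sum_{j in J_neg} log(1 - s_j)."""
    loss = 0.0
    for j in positive_idx:     # relevant passages: push s_j toward 1
        loss -= math.log(probs[j])
    for j in negative_idx:     # non-relevant passages: push s_j toward 0
        loss -= math.log(1.0 - probs[j])
    return loss
```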

3 Experiments

We train and evaluate our models on two passage-ranking datasets, MS MARCO and TREC-CAR.

3.1 MS MARCO

The training set contains approximately 400M tuples of a query, relevant and non-relevant passages. The development set contains approximately 6,900 queries, each paired with the top 1,000 passages retrieved with BM25 from the MS MARCO corpus. On average, each query has one relevant passage. However, some have no relevant passage because the corpus was initially constructed by retrieving the top-10 passages from the Bing search engine and then annotated. Hence, some of the relevant passages might not be retrieved by BM25.

An evaluation set with approximately 6,800 queries and their top 1,000 retrieved passages without relevance annotations is also provided.
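Rankings on MS MARCO are evaluated with MRR@10 (the metric reported in Table 1). A minimal sketch of that metric, with mrr_at_10 as an illustrative helper:

```python
def mrr_at_10(ranked_lists, relevant_sets):
    """Mean Reciprocal Rank at cutoff 10: for each query, take 1/rank of the
    first relevant passage among the top 10 (0 if none appears), then average
    over all queries."""
    total = 0.0
    for ranking, relevant in zip(ranked_lists, relevant_sets):
        for rank, pid in enumerate(ranking[:10], start=1):
            if pid in relevant:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)

# First query: relevant passage at rank 2; second query: at rank 1.
print(mrr_at_10([["p3", "p1"], ["p2"]], [{"p1"}, {"p2"}]))  # -> 0.75
```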


We fine-tune the model using TPUs with a batch size of 32 (32 sequences * 512 tokens = 16,384 tokens/batch) for 400k iterations, which takes approximately 70 hours. This corresponds to training on 12.8M (400k * 32) query-passage pairs, or less than 2% of the full training set. We did not see any improvement on the dev set when training for another 10 days, which is equivalent to seeing 50M pairs in total.

We use ADAM (Kingma & Ba, 2014) with the initial learning rate set to 3e-6, beta_1 = 0.9, beta_2 = 0.999, L2 weight decay of 0.01, learning rate warmup over the first 10,000 steps, and linear decay of the learning rate. We use a dropout probability of 0.1 on all layers.

Method                              Dev    Eval   Test
BM25 (Lucene, no tuning)            16.7   16.5   12.3
BM25 (Anserini, tuned)              -      -      15.3
Co-PACRR (MacAvaney et al., 2017)*  -      -      14.8
KNRM (Xiong et al., 2017)           21.8   19.8   -
Conv-KNRM (Dai et al., 2018)        29.0   27.1   -
IRNet**                             27.8   28.1   -
BERT Base                           34.7   -      31.0
BERT Large                          36.5   35.8   33.5

Table 1: Main results on the passage re-ranking datasets (Dev and Eval: MRR@10 on MS MARCO; Test: MAP on TREC-CAR). *Best entry in TREC-CAR 2017. **Previous SOTA on the MS MARCO leaderboard as of 01/04/2019; unpublished work.

3.2 TREC-CAR

Introduced by Dietz et al. (2017), in this dataset the input query is the concatenation of a Wikipedia article title with the title of one of its sections. The relevant passages are the paragraphs within that section. The corpus consists of all of the English Wikipedia paragraphs, except the abstracts. The released dataset has five predefined folds; we use the first four as a training set (approximately 3M queries) and the remaining one as a validation set (approximately 700k queries). The test set is the same one used to evaluate the submissions to TREC-CAR 2017 (approx. 1,800 queries).

Although TREC-CAR 2017 organizers provide manual annotations for the test set, only the top five passages retrieved by the systems submitted to the competition have manual annotations. This means that true relevant passages are not annotated if they rank low. Hence, we evaluate using the automatic annotations, which provide relevance scores for all possible query-passage pairs.


We follow the same procedure described for the MS MARCO dataset to fine-tune our models on TREC-CAR. However, there is an important difference. The official pre-trained BERT models were pre-trained on the full Wikipedia, and therefore they have seen, although in an unsupervised way, Wikipedia documents that are used in the test set of TREC-CAR. Thus, to avoid this leak of test data into training, we pre-trained the BERT re-ranker only on the half of Wikipedia used by TREC-CAR's training set.

For the fine-tuning data, we generate our query-passage pairs by retrieving the top ten passages from the entire TREC-CAR corpus using BM25. (We use the Anserini toolkit (Yang et al., 2018) to index and retrieve the passages.) This means that we end up with 30M example pairs (3M queries * 10 passages/query) to train our model. We train it for 400k iterations, or 12.8M examples (400k iterations * 32 pairs/batch), which corresponds to only 40% of the training set. Similarly to the MS MARCO experiments, we did not see any gain on the dev set by training the models longer.
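The pair construction above can be sketched as follows. Here retrieve_top10 stands in for the Anserini/BM25 retrieval step, and relevant maps each query to its set of relevant passages; both are assumed interfaces, not the authors' code:

```python
def make_training_pairs(queries, retrieve_top10, relevant):
    """Build (query, passage, label) fine-tuning examples: each query is
    paired with its top-10 retrieved passages, labeled 1 if the passage is
    in the query's relevant set, else 0."""
    pairs = []
    for q in queries:
        for passage in retrieve_top10(q):
            pairs.append((q, passage, 1 if passage in relevant[q] else 0))
    return pairs

pairs = make_training_pairs(["q1"], lambda q: ["a", "b"], {"q1": {"a"}})
print(pairs)  # -> [('q1', 'a', 1), ('q1', 'b', 0)]
```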

3.3 Results

We show the main result in Table 1. Despite training on a fraction of the data available, the proposed BERT-based models surpass the previous state-of-the-art models by a large margin on both of the tasks.

Training size vs. performance:

We found that the pretrained models used in this work require few training examples from the end task to achieve good performance (Figure 1). For example, a model trained on 640k question-passage pairs (2% of the MS MARCO training data) is only 2 MRR@10 points below a model trained on 12.8M pairs (40% of the training data).

Figure 1: Number of MS MARCO examples seen during training vs. MRR@10 performance (note the log scale).

4 Conclusion

We have described a simple adaptation of BERT as a passage re-ranker that has become the state of the art on two different tasks, TREC-CAR and MS MARCO. We have made the code to reproduce our MS MARCO entry publicly available.