
An Updated Duet Model for Passage Re-ranking

by Bhaskar Mitra, et al.

We propose several small modifications to Duet, a deep neural ranking model, and evaluate the updated model on the MS MARCO passage ranking task. We report significant improvements from the proposed changes based on an ablation study.


1 Introduction

In information retrieval (IR), traditional learning to rank (Liu, 2009) models estimate the relevance of a document to a query based on hand-engineered features. The input to these models typically includes, among others, features based on patterns of exact matches of query terms in the document. Recently proposed deep neural IR models (Mitra and Craswell, 2018), in contrast, accept the raw query and document text as input. The input text is represented as a one-hot encoding of words (or sub-word components (Kim et al., 2016; Jozefowicz et al., 2016; Sennrich et al., 2015)), and the deep neural models focus primarily on learning latent representations of text that are effective for matching query and document. Mitra et al. (2017) posit that deep neural ranking models should focus on both representation learning for text matching and feature learning based on patterns of exact matches of query terms in the document. They demonstrate that a neural ranking model called Duet¹, with two distinct sub-models that consider matches in the term space (the local sub-model) and in the learned latent space (the distributed sub-model), is more effective at estimating query-document relevance. In this work, we evaluate a duet model on the MS MARCO passage ranking task (Bajaj et al., 2016). We propose several simple modifications to the original Duet architecture and demonstrate through an ablation study that incorporating these changes results in significant improvements on the passage ranking task.

¹ While Mitra et al. (2017) propose a specific neural architecture, they refer more broadly to the family of neural architectures that operate on both term space and learned latent space as duet. We refer to the specific architecture proposed by Mitra et al. (2017) as Duet, to distinguish it from the general family of such architectures that we refer to as duet (note the difference in capitalization).

2 Passage re-ranking on MS MARCO

The MS MARCO passage ranking task (Bajaj et al., 2016) requires a model to rank approximately one thousand passages for each query. The queries are sampled from Bing’s search logs, and then manually annotated to restrict them to questions with specific answers. A BM25 (Robertson et al., 2009) model is employed to retrieve the top thousand candidate passages for each query from the collection. For each query, zero or more candidate passages are deemed relevant based on manual annotations. The ranking model is evaluated on this passage re-ranking task using the mean reciprocal rank (MRR) metric (Craswell, 2009). Participants are required to submit the ranked list of passages per query for a development (dev) set and a held-out (eval) set. The ground truth annotations for the development set are publicly available, while the corresponding annotations for the evaluation set are held out to avoid overfitting. A public leaderboard presents all submitted runs from different participants on this task.
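As an illustration of the evaluation metric, the following is a minimal sketch of MRR with a rank cut-off, not the official MS MARCO evaluation script. The function names, the toy identifiers, and the choice to average over every query in the run (including queries with no relevant passage) are assumptions made for this example.

```python
def reciprocal_rank_at_k(ranked_ids, relevant_ids, k=10):
    """Return 1/rank of the first relevant passage in the top k, or 0.0 if none."""
    for rank, passage_id in enumerate(ranked_ids[:k], start=1):
        if passage_id in relevant_ids:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs, qrels, k=10):
    """Average the reciprocal rank over all queries in `runs`.

    runs:  dict mapping query id -> list of passage ids, best first
    qrels: dict mapping query id -> set of relevant passage ids
    """
    if not runs:
        return 0.0
    total = sum(
        reciprocal_rank_at_k(ranked, qrels.get(query_id, set()), k)
        for query_id, ranked in runs.items()
    )
    return total / len(runs)


# Toy data: q1 and q2 have their first relevant passage at rank 2,
# q3 has no relevant passage at all.
runs = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4"], "q3": ["p5"]}
qrels = {"q1": {"p1"}, "q2": {"p4", "p8"}, "q3": set()}
mrr = mean_reciprocal_rank(runs, qrels)
```

Only the first relevant passage in each ranked list contributes, so the metric rewards placing a single correct answer as high as possible.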

3 The updated Duet model

In this section, we briefly describe several modifications to the Duet model. A public implementation of the updated Duet model using PyTorch (Paszke et al., 2017) is available online.

Word embeddings

We replace the character-level $n$-graph encoding in the input of the distributed model with word embeddings. We see a significant reduction in training time given a fixed number of minibatches and a fixed minibatch size. This change primarily helps us to train on a significantly larger amount of data under fixed training time constraints. We initialize the word embeddings using pre-trained GloVe (Pennington et al., 2014) embeddings before training the Duet model.

Inverse document frequency weighting

In contrast to some of the other datasets on which the Duet model has been previously evaluated (Mitra et al., 2017; Nanni et al., 2017), the MS MARCO dataset contains a relatively larger percentage of natural language queries and the queries are considerably longer on average. In traditional IR models, the inverse document frequency (IDF) (Robertson, 2004) of a query term provides an effective mechanism for weighting the query terms by their discriminative power. In the original Duet model, the input to the local sub-model corresponding to a query $q$ and a document $d$ is a binary interaction matrix $X$ defined as follows:

$$X_{ij} = \begin{cases} 1, & \text{if } q_i = d_j \\ 0, & \text{otherwise} \end{cases} \tag{1}$$

We incorporate IDF in the Duet model by weighting the interaction matrix by the IDF of the matched terms. We adopt the Robertson-Walker definition of IDF (Jones et al., 2000) normalized to the range $[0, 1]$.

$$X'_{ij} = \begin{cases} \text{IDF}(q_i), & \text{if } q_i = d_j \\ 0, & \text{otherwise} \end{cases} \tag{2}$$

$$\text{IDF}(t) = \frac{\log(N / n_t)}{\log(N)} \tag{3}$$

Where $N$ is the total number of passages in the collection and $n_t$ is the number of passages in which the term $t$ appears at least once.
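The IDF weighting of the interaction matrix can be sketched as follows. This is an illustrative sketch and not the public implementation: it assumes the normalized IDF of a term is the log of (total passages / passages containing the term) divided by the log of the total number of passages, which lies in [0, 1]. The function names and the toy document frequencies are hypothetical, and the fixed-size padding of the matrix used by a real model is omitted.

```python
import math


def normalized_idf(term, doc_freq, num_passages):
    """IDF scaled to [0, 1]: log(N / n_t) / log(N) (assumed form)."""
    n_t = doc_freq.get(term, 0)
    if n_t == 0 or num_passages <= 1:
        # A term seen in no passage can never match one; also avoids log(1) = 0
        # in the denominator for degenerate collections.
        return 0.0
    return math.log(num_passages / n_t) / math.log(num_passages)


def interaction_matrix(query_terms, passage_terms, doc_freq=None, num_passages=None):
    """Exact-match matrix with one row per query term, one column per passage term.

    Without document frequencies the entries are binary (1.0 on a match).
    With them, each match is weighted by the IDF of the matched query term.
    """
    use_idf = doc_freq is not None and num_passages is not None
    matrix = []
    for q in query_terms:
        weight = normalized_idf(q, doc_freq, num_passages) if use_idf else 1.0
        matrix.append([weight if q == d else 0.0 for d in passage_terms])
    return matrix


# Toy collection statistics (hypothetical numbers).
doc_freq = {"the": 1000, "duet": 10, "model": 100}
query = ["the", "duet", "model"]
passage = ["duet", "is", "a", "model", "the"]

binary = interaction_matrix(query, passage)
weighted = interaction_matrix(query, passage, doc_freq, num_passages=1000)
```

In this toy example the match on "the" is weighted down to zero because the term occurs in every passage, while the match on the rarer term "duet" keeps most of its weight.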

Non-linear combination of local and distributed models

Zamani et al. (2018) show that when combining different sub-models in a neural ranking model, it is more effective if each sub-model produces a vector output and these vectors are then combined by additional multi-layer perceptrons (MLP). In the original Duet model, the local and the distributed sub-models each produce a single score, and the two scores are linearly combined. In our updated architecture, both sub-models produce a vector, and the two vectors are combined by an MLP with two hidden layers to generate the estimated relevance score.
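To make the idea concrete, here is a dependency-free sketch of combining two sub-model output vectors with an MLP that has two hidden layers. It is only an illustration: merging the two vectors by concatenation, the layer sizes, and the hand-set weights are all assumptions for this example, and the public PyTorch implementation may merge the vectors differently.

```python
def relu(vector):
    """Element-wise rectifier."""
    return [max(0.0, x) for x in vector]


def linear(vector, weights, bias):
    """Dense layer: one row of `weights` per output unit."""
    return [
        sum(w * x for w, x in zip(row, vector)) + b
        for row, b in zip(weights, bias)
    ]


def combine(local_vec, dist_vec, layers):
    """Score a query-passage pair from the two sub-model vectors.

    layers: list of (weights, bias) tuples; all but the last are hidden
    layers followed by ReLU, and the last maps to a single score.
    """
    hidden = local_vec + dist_vec  # list concatenation (assumed merge step)
    for weights, bias in layers[:-1]:
        hidden = relu(linear(hidden, weights, bias))
    out_weights, out_bias = layers[-1]
    return linear(hidden, out_weights, out_bias)[0]


# Hand-set toy parameters: two hidden layers of size 2, then one output unit.
layers = [
    ([[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]], [0.0, 0.0]),
    ([[1.0, 1.0], [1.0, -1.0]], [0.5, 0.0]),
    ([[2.0, 5.0]], [-1.0]),
]
score = combine([1.0, -2.0], [0.5, 3.0], layers)
```

Unlike a linear combination of two scalar scores, the hidden layers let the final score depend on interactions between the local and distributed signals.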

Rectified Linear Units (ReLU)

We replace the Tanh non-linearities in the original Duet model with ReLU (Glorot et al., 2011) activations.


Bagging

We observe some additional improvements from combining multiple Duet models, trained with different random seeds and on different random samples of the training data, using bagging (Breiman, 1996).
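A generic bagging loop can be sketched as follows. The text does not state how the training samples are drawn or how the member scores are merged, so sampling with replacement and averaging the scores are assumptions taken from the standard bagging recipe. `toy_train` is a hypothetical stand-in for training a ranking model.

```python
import random


def bootstrap_sample(examples, rng):
    """Draw len(examples) items with replacement (standard bagging resample)."""
    return [rng.choice(examples) for _ in examples]


def train_bagged(examples, train_fn, num_models, base_seed=0):
    """Train `num_models` models, each with its own seed and resampled data."""
    models = []
    for i in range(num_models):
        seed = base_seed + i
        sample = bootstrap_sample(examples, random.Random(seed))
        models.append(train_fn(sample, seed))
    return models


def ensemble_score(models, query, passage):
    """Average the relevance scores of all ensemble members (assumed merge rule)."""
    return sum(model(query, passage) for model in models) / len(models)


def toy_train(sample, seed):
    """Hypothetical trainer: returns a scorer that depends on sample and seed."""
    offset = random.Random(seed).random()
    bias = sum(sample) / len(sample)
    return lambda query, passage: bias + offset


models = train_bagged([1.0, 2.0, 3.0, 4.0], toy_train, num_models=8)
score = ensemble_score(models, "example query", "example passage")
```

Because each member sees a different resample and seed, their errors are partly independent, and averaging tends to reduce the variance of the final score.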

4 Experiments

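The MS MARCO task provides a pre-processed training dataset, called “triples.train.full.tsv”, where each training sample consists of a triple $\langle q, p^+, p^- \rangle$, where $q$ is a query and $p^+$ and $p^-$ are a pair of passages, with $p^+$ being more relevant to $q$ than $p^-$. Similar to the original Duet model, we employ the cross-entropy with softmax loss to learn the parameters of our model $\mathcal{M}$:

$$\mathcal{L} = \mathbb{E}_{q, p^+, p^-}\left[\ell\left(\mathcal{M}_{q,p^+} - \mathcal{M}_{q,p^-}\right)\right] \tag{4}$$

$$\ell(\Delta) = \log\left(1 + e^{-\sigma \cdot \Delta}\right) \tag{5}$$

Where $\mathcal{M}_{q,p}$ is the relevance score for the pair $\langle q, p \rangle$ as estimated by the model $\mathcal{M}$. Note that by considering a single negative passage per sample, our loss is equivalent to the RankNet loss (Burges et al., 2005).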
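The equivalence between the two-candidate softmax cross-entropy and a RankNet-style pairwise loss can be checked numerically. The sketch below is illustrative only: the function names are hypothetical, and the scaling parameter `sigma` is an assumed hyperparameter whose value is not taken from the text.

```python
import math


def pairwise_logistic_loss(score_pos, score_neg, sigma=1.0):
    """RankNet-style loss log(1 + exp(-sigma * (s_pos - s_neg))), computed stably."""
    delta = sigma * (score_pos - score_neg)
    if delta >= 0:
        return math.log1p(math.exp(-delta))
    # For large negative delta, exp(-delta) would overflow; rewrite instead.
    return -delta + math.log1p(math.exp(delta))


def softmax_cross_entropy(score_pos, score_neg):
    """Cross-entropy of a softmax over the two passages, with the positive as target."""
    top = max(score_pos, score_neg)
    log_normalizer = top + math.log(
        math.exp(score_pos - top) + math.exp(score_neg - top)
    )
    return log_normalizer - score_pos


loss_pairwise = pairwise_logistic_loss(2.0, 0.5)
loss_softmax = softmax_cross_entropy(2.0, 0.5)
```

With `sigma` set to 1 the two functions agree up to floating-point error, which is the sense in which a single negative passage per sample reduces the softmax loss to the pairwise form.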

We use the Adam optimizer with default parameters and a learning rate of . We set in Equation 5 to and dropout rate for the model to . We trim all queries and passages to their first and words, respectively. We restrict our input vocabulary to the most frequent terms in the collection and set the size of all hidden layers to . We use minibatches of size 1024 and train the model for 1024 minibatches. Finally, for bagging we train eight different Duet models with different random seeds and on different samples of the training data. We train and evaluate our models using a Tesla K40 GPU—on which it takes a total of only hours to train each single Duet model and to evaluate it on both dev and eval sets.
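The input preparation described above (a frequency-capped vocabulary and length-trimmed text) might look like the following sketch. The helper names, the reserved padding and unknown ids, and the tiny cut-offs in the usage lines are placeholders chosen for illustration and do not reflect the settings used in the experiments.

```python
from collections import Counter

PAD_ID = 0  # assumed id for padding positions
UNK_ID = 1  # assumed id for out-of-vocabulary terms


def build_vocab(tokenized_texts, max_terms):
    """Map the `max_terms` most frequent terms to ids starting at 2."""
    counts = Counter(term for text in tokenized_texts for term in text)
    return {
        term: index + 2
        for index, (term, _) in enumerate(counts.most_common(max_terms))
    }


def encode(tokens, vocab, max_len):
    """Trim to the first `max_len` tokens, map to ids, and right-pad."""
    ids = [vocab.get(term, UNK_ID) for term in tokens[:max_len]]
    return ids + [PAD_ID] * (max_len - len(ids))


texts = [["what", "is", "duet"], ["duet", "is", "a", "model"]]
vocab = build_vocab(texts, max_terms=2)   # keeps only "is" and "duet"
query_ids = encode(["what", "is", "duet", "model"], vocab, max_len=3)
short_ids = encode(["what"], vocab, max_len=3)
```

Trimming bounds the size of the interaction matrix and of the embedded inputs, while the vocabulary cap bounds the size of the embedding table.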

5 Results

Model MRR@10
Dev Eval
Other approaches
Single CKNRM (Dai et al., 2018) model
Ensemble of 8 CKNRM (Dai et al., 2018) models
IRNet (a proprietary deep neural model)
BERT (Nogueira and Cho, 2019)
Duet variants
Single Duet v2 w/o IDF weighting for interaction matrix -
Single Duet v2 w/ Tanh non-linearity (instead of ReLU) -
Single Duet v2 w/o MLP to combine local and distributed scores -
Single Duet v2 model
Ensemble of 8 Duet v2 models
Table 1: Comparison of the different Duet variants and other state-of-the-art approaches from the public MS MARCO leaderboard. The updated Duet model, referred to as Duet v2, benefits significantly from the modifications proposed in this paper.

Table 1 presents the MRR@10 corresponding to all the Duet variants we evaluated on the dev set. The updated Duet model with all the modifications described in Section 3, referred to hereafter as Duet v2, achieves an MRR@10 of . We perform an ablation study by leaving out one of the three modifications (IDF weighting for the interaction matrix, ReLU non-linearity instead of Tanh, and an MLP to combine local and distributed scores) at a time. We observe a degradation in MRR by not incorporating the IDF weighting alone. It is interesting to note that the GitHub implementations of the KNRM (Xiong et al., 2017) and CKNRM (Dai et al., 2018) models also indicate that their MS MARCO submissions incorporated IDF term-weighting, potentially indicating the value of IDF weighting across multiple architectures. Similarly, we also observe a degradation in MRR by using Tanh non-linearity instead of ReLU. Using a linear combination of scores from the local and the distributed model instead of combining their vector outputs using an MLP results in a degradation in MRR. Finally, we observe an improvement in MRR by ensembling eight Duet v2 models using bagging. We also submit the individual Duet v2 model and the ensemble of eight Duet v2 models for evaluation on the held-out set and observe similar numbers.

We include the MRR numbers for other non-Duet based approaches that are available on the public leaderboard in Table 1. As of writing this paper, BERT (Devlin et al., 2018) based approaches, e.g., (Nogueira and Cho, 2019), are outperforming other approaches by a significant margin. Among the non-BERT based approaches, a proprietary deep neural model called IRNet currently demonstrates the best performance on the held-out evaluation set. This is followed, among others, by an ensemble of CKNRM (Dai et al., 2018) models and the single CKNRM model. The single Duet v2 model achieves comparable MRR to the single CKNRM model on the eval set. The ensemble of Duet v2 models, however, performs slightly worse than the ensemble of the CKNRM models on the same set.

6 Discussion and conclusion

In this paper, we describe several simple modifications to the original Duet model that result in significant improvements over the original architecture on the MS MARCO task. The updated architecture, which we call Duet v2, achieves comparable performance to other non-BERT based top performing approaches, as listed on the public MS MARCO leaderboard. We note that the Duet v2 model we evaluate contains significantly fewer learnable parameters (approximately million) compared to other top performing approaches, such as BERT based models (Nogueira and Cho, 2019) and the single CKNRM model (Dai et al., 2018), both of which contain a few hundred million learnable parameters. Comparing the models based on the exact number of learnable parameters, however, may not be meaningful, as most of these parameters are due to the large vocabulary size in the input embedding layers. It is not clear how significantly the vocabulary size impacts model performance, an aspect we may want to analyse in the future. It is worth emphasizing that, compared to other top performing approaches, training the Duet v2 model takes significantly less time and fewer resources ( hours to train a single Duet model and to evaluate it on both dev and eval sets using a Tesla K40 GPU), which may make the model an attractive starting point for new MS MARCO participants. The model performance on the MS MARCO task may be further improved by adding more depth and / or more careful hyperparameter tuning.