ARSM Gradient Estimator for Supervised Learning to Rank

11/01/2019 ∙ by Siamak Zamani Dadaneh, et al.

We propose a new model for supervised learning to rank. In our model, the relevance labels are assumed to follow a categorical distribution whose probabilities are constructed from a scoring function. We optimize the training objective with respect to the multivariate categorical variables using an unbiased and low-variance gradient estimator. Learning-to-rank methods can generally be categorized into pointwise, pairwise, and listwise approaches. Our approach belongs to the class of pointwise methods. Although it has previously been reported that pointwise methods cannot perform as well as pairwise or listwise approaches, we show that the proposed method achieves better or comparable results on two datasets compared with pairwise and listwise methods.




1 Introduction

Learning to rank is fundamental to information retrieval, e-commerce, and many other applications that require ranking items [14]. In this work we focus on document retrieval without loss of generality. Document retrieval (i.e., document ranking) has applications in large-scale item search engines, which can generally be described as follows. There is a collection of documents (items). Given a query (e.g., a query entered by a user in the search engine), the ranking function assigns a score to each document, quantifying the relevance of the document to the query. The documents are ranked in descending order of these scores and the top-ranked ones are returned.

Traditional approaches rank documents based on unsupervised models of words appearing in the documents and query, and do not need any training [6]. The rise of machine learning for learning ranking models has been due to the availability of more signals related to the relevance of documents, such as click items or search log data.


The bulk of machine learning methods for learning to rank can roughly be categorized as pointwise, pairwise, and listwise methods. Pointwise methods cast the ranking problem as a regression problem for predicting relevance scores [5] or as a multiple ordinal classification problem for predicting categorical relevance levels [13]. Pairwise approaches take document pairs as instances in learning, and formalize the learning-to-rank problem as one of classification. More precisely, they collect document pairs to query the relative ranking from the underlying unknown ranking lists. They then train a classification model with the labeled data and make use of the classification model in ranking [10]. Finally, listwise methods use ranked document lists instead of document pairs as instances in learning and define an optimization loss function over the entire ranked list(s).


In this paper, we propose a new framework for supervised learning to rank. Specifically, we define a scoring function that maps the input vector of features for a document to the probability parameters of a categorical distribution, where each category represents the relative relevance of the input document to the query. We then define the objective function of learning to rank as the expectation of a loss function, which determines the distance between the predicted and true relevance labels of the input document, with respect to the categorical distribution induced by the scoring function. To achieve a rich family of ranking algorithms, we employ neural networks as scoring functions.

Because the model has a discrete structure, we use stochastic gradient based optimization to learn the parameters of the scoring function. The main difficulty arises when backpropagating the gradients through categorical variables. The recently proposed augment-REINFORCE-swap-merge (ARSM) gradient estimator [20] provides a natural solution, yielding low-variance, unbiased gradient updates during the training of our proposed learning-to-rank framework. ARSM first uses variable augmentation, REINFORCE [18], and Rao-Blackwellization [4] to re-express the gradient as an expectation under the Dirichlet distribution, then uses variable swapping to construct differently expressed but equivalent expectations, and finally shares common random numbers between these expectations to achieve significant variance reduction.

The proposed framework, hereafter referred to as ARSM-L2R, has one main advantage over existing learning-to-rank methods. Because it uses the ARSM gradient estimator, the loss function assessing the distance between the predicted and true document relevance labels need not be differentiable. This significantly increases the choice of loss functions that can be employed. Specifically, in our experiments, we optimize the truncated normalized discounted cumulative gain (NDCG) [9].

Experiments conducted on benchmark datasets demonstrate that our proposed ARSM-L2R method achieves results better than or comparable to those of pairwise and listwise approaches in terms of common ranking metrics such as truncated NDCG and mean average precision (MAP).

The remainder of this paper is organized as follows. In Section 2, we present the methodology, including the new formulation of ARSM-L2R for supervised learning to rank, and its parameter estimation using Monte Carlo gradient estimates. Section 3 provides comprehensive experimental results for comparison with several existing learning-to-rank methods. The paper is concluded in Section 4.

2 ARSM-L2R

2.1 Supervised learning to rank

In the supervised learning-to-rank setting, a set of queries $Q = \{q^{(1)}, \ldots, q^{(m)}\}$ is given. Each query $q^{(i)}$ is associated with a list of documents $d^{(i)} = (d^{(i)}_1, \ldots, d^{(i)}_{n^{(i)}})$, where $d^{(i)}_k$ and $n^{(i)}$ denote the $k$th document and the size of $d^{(i)}$, respectively. In addition, a list of scores $y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_{n^{(i)}})$ is available for each list of documents $d^{(i)}$. The score $y^{(i)}_k$ represents the relevance degree of document $d^{(i)}_k$ to query $q^{(i)}$, and can be a judgment score explicitly or implicitly given by humans [3]. Higher scores imply more relevant documents.

For each query-document pair $(q^{(i)}, d^{(i)}_k)$, a $p$-dimensional vector of features $x^{(i)}_k$ is constructed. The training set is represented as $\{(x^{(i)}, y^{(i)})\}_{i=1}^{m}$, where $x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_{n^{(i)}})$. The objective of learning is to create ranking functions that map the input query-document features to scores resembling the true relevance scores. In the following discussions, we drop the query index $i$ to avoid cluttering the notation.

In this paper, we formulate the supervised learning-to-rank problem as optimizing an objective expressed as an expectation over multivariate categorical variables. More specifically, given $n$ documents for a query, let $z_k \in \{1, \ldots, C\}$ denote the relevance label for the $k$th document, where $C$ is the number of possible levels of relevance for each document. In our proposed generative model, each $z_k$ is distributed according to a categorical distribution whose probabilities are constructed based on a scoring function $s_\theta(\cdot)$ parameterized by $\theta$:

$$z_k \sim \mathrm{Cat}\big(\sigma(\phi_k)\big), \qquad \phi_k = s_\theta(x_k) \in \mathbb{R}^{C}, \qquad k = 1, \ldots, n, \tag{1}$$

where $\sigma(\cdot)$ is the softmax function. We use multi-layer perceptrons (MLPs) as scoring functions, thus $\theta$ corresponds to the collection of weight matrices of the MLPs. For each realization of the categorical variables $z = (z_1, \ldots, z_n)$, we employ a loss function $\ell(z, y)$ to determine their distance from the true labels $y = (y_1, \ldots, y_n)$. We then define the learning-to-rank optimization problem as finding

$$\hat{\theta} = \arg\min_{\theta} \, \mathcal{E}(\phi), \qquad \mathcal{E}(\phi) = \mathbb{E}_{z \sim \prod_{k=1}^{n} \mathrm{Cat}(z_k;\, \sigma(\phi_k))}\big[\ell(z, y)\big], \tag{2}$$

where $\ell(\cdot, \cdot)$ can be any loss function measuring the dissimilarity of two vectors of ordinal labels. We resort to stochastic gradient based methods to solve the optimization problem in (2). Backpropagating the gradient through discrete latent variables has recently been studied extensively [17, 20, 8]. For optimizing (2), the challenge lies in developing a low-variance and unbiased estimator for the gradient of its objective with respect to $\phi = (\phi_1, \ldots, \phi_n)$, which is denoted by $\nabla_{\phi} \mathcal{E}(\phi)$.
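As a minimal sketch of the categorical label model and the expected-loss objective described above, the following Python snippet samples relevance labels from softmax probabilities and estimates the expected loss by plain Monte Carlo. The function names, the toy logits, and the absolute-error loss are illustrative assumptions and are not taken from the paper; labels are 0-based here for convenience.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_labels(logit_rows, rng):
    """Draw one relevance label per document from Cat(softmax(logits))."""
    labels = []
    for logits in logit_rows:
        probs = softmax(logits)
        labels.append(rng.choices(range(len(probs)), weights=probs, k=1)[0])
    return labels

def expected_loss(logit_rows, true_labels, loss, rng, num_samples=2000):
    """Monte Carlo estimate of E_z[loss(z, y)] under the categorical model."""
    total = 0.0
    for _ in range(num_samples):
        total += loss(sample_labels(logit_rows, rng), true_labels)
    return total / num_samples

def abs_label_error(z, y):
    """Illustrative stand-in loss: mean absolute difference of ordinal labels."""
    return sum(abs(a - b) for a, b in zip(z, y)) / len(y)

rng = random.Random(0)
# Hypothetical scoring-function outputs for 3 documents and 3 relevance levels.
logit_rows = [[2.0, 0.0, -1.0], [0.0, 1.5, 0.0], [-1.0, 0.0, 2.5]]
true_labels = [0, 1, 2]
estimate = expected_loss(logit_rows, true_labels, abs_label_error, rng)
print(round(estimate, 3))
```

The loss only needs to be evaluated on sampled labels, so it does not have to be differentiable; the difficulty lies in differentiating this expectation with respect to the logits.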

2.2 ARSM gradient estimator

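We employ the augment-REINFORCE-swap-merge (ARSM) gradient estimator for training the scoring functions described in the previous section. To describe this algorithm, we start with the simple objective function $\mathcal{E}(\phi) = \mathbb{E}_{z \sim \mathrm{Cat}(\sigma(\phi))}[f(z)]$ with respect to a univariate categorical variable, where $f(z)$ is the reward function and $\phi = (\phi_1, \ldots, \phi_C)$. In the augmentation step, the gradient of $\mathcal{E}(\phi)$ can be expressed as an expectation under a Dirichlet distribution as

$$\nabla_{\phi_c} \mathcal{E}(\phi) = \mathbb{E}_{\pi \sim \mathrm{Dir}(\mathbf{1}_C)}\big[f(z)\,(1 - C\pi_c)\big], \qquad z := \arg\min_{l \in \{1, \ldots, C\}} \pi_l e^{-\phi_l}. \tag{3}$$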
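The augmentation step can be checked numerically. The sketch below assumes the augment-REINFORCE form from [20], in which a label is obtained as the argmin of $\pi_l e^{-\phi_l}$ with $\pi$ drawn from a flat Dirichlet, and the gradient component for category $c$ is the expectation of $f(z)(1 - C\pi_c)$. The helper names and toy reward are hypothetical; categories are 0-based.

```python
import math
import random

def sample_dirichlet_ones(num_categories, rng):
    """Draw pi ~ Dir(1_C) by normalizing independent Exp(1) variables."""
    draws = [rng.expovariate(1.0) for _ in range(num_categories)]
    total = sum(draws)
    return [d / total for d in draws]

def augmented_label(pi, phi):
    """Return z = argmin_l pi_l * exp(-phi_l) as a 0-based category index."""
    return min(range(len(phi)), key=lambda l: pi[l] * math.exp(-phi[l]))

def ar_gradient(phi, reward, rng, num_samples=50000):
    """Monte Carlo average of reward(z) * (1 - C * pi_c) over Dirichlet draws."""
    C = len(phi)
    grad = [0.0] * C
    for _ in range(num_samples):
        pi = sample_dirichlet_ones(C, rng)
        r = reward(augmented_label(pi, phi))
        for c in range(C):
            grad[c] += r * (1.0 - C * pi[c]) / num_samples
    return grad

def exact_gradient(phi, reward):
    """Closed form of d/dphi_c sum_z softmax(phi)_z * reward(z)."""
    m = max(phi)
    exps = [math.exp(v - m) for v in phi]
    total = sum(exps)
    probs = [e / total for e in exps]
    mean_reward = sum(p * reward(z) for z, p in enumerate(probs))
    return [probs[c] * (reward(c) - mean_reward) for c in range(len(phi))]

rng = random.Random(0)
phi = [0.5, -0.3, 1.0]
reward_values = [1.0, 0.0, 2.0]  # hypothetical reward per category
reward = lambda z: reward_values[z]
estimated = ar_gradient(phi, reward, rng)
exact = exact_gradient(phi, reward)
print([round(g, 3) for g in estimated])
print([round(g, 3) for g in exact])
```

With enough samples the two printed vectors should agree closely, which illustrates that the estimator is unbiased even though the reward is only evaluated at sampled labels.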


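Given the vector $\pi = (\pi_1, \ldots, \pi_C)$, we denote the vector obtained after swapping the $m$th and $j$th elements of $\pi$ as $\pi^{m \rightleftharpoons j}$, where $\pi^{m \rightleftharpoons j}_m = \pi_j$, $\pi^{m \rightleftharpoons j}_j = \pi_m$, and for $l \notin \{m, j\}$ we have $\pi^{m \rightleftharpoons j}_l = \pi_l$. Exploiting the symmetry of $\mathrm{Dir}(\mathbf{1}_C)$ under such swaps (if $\pi \sim \mathrm{Dir}(\mathbf{1}_C)$, then so is $\pi^{m \rightleftharpoons j}$), and sharing common random numbers between different expectations to potentially significantly reduce the Monte Carlo integration variance, leads to another unbiased estimator, referred to as the ARS estimator:

$$g_{\mathrm{ARS}}(\pi, j)_c = \Big[f\big(z^{c \rightleftharpoons j}\big) - \frac{1}{C}\sum_{m=1}^{C} f\big(z^{m \rightleftharpoons j}\big)\Big](1 - C\pi_j), \tag{4}$$

where $z^{c \rightleftharpoons j} := \arg\min_{l \in \{1, \ldots, C\}} \pi^{c \rightleftharpoons j}_l e^{-\phi_l}$ and $j$ is the reference category. Finally, the ARS estimator can be further improved by considering all swap operations, and adding a merge step to construct the ARSM estimator as

$$g_{\mathrm{ARSM}}(\pi)_c = \frac{1}{C}\sum_{j=1}^{C} g_{\mathrm{ARS}}(\pi, j)_c = \sum_{j=1}^{C}\Big[f\big(z^{c \rightleftharpoons j}\big) - \frac{1}{C}\sum_{m=1}^{C} f\big(z^{m \rightleftharpoons j}\big)\Big]\Big(\frac{1}{C} - \pi_j\Big). \tag{5}$$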
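The swap and merge steps described above can be sketched for a single categorical variable as follows. This is an illustrative reading of the ARSM estimator of [20], not the authors' code: `swapped_label`, `arsm_gradient`, and the toy reward are hypothetical names and values, and categories are 0-based.

```python
import math
import random

def swapped_label(pi, phi, c, j):
    """Label obtained by the argmin rule after swapping entries c and j of pi."""
    swapped = list(pi)
    swapped[c], swapped[j] = swapped[j], swapped[c]
    return min(range(len(phi)), key=lambda l: swapped[l] * math.exp(-phi[l]))

def arsm_gradient(pi, phi, reward):
    """Single-sample ARSM gradient estimate with respect to the logits phi."""
    C = len(phi)
    # F[c][j] holds the reward of the label under the (c, j) swap.
    # F is symmetric, and its diagonal uses the unswapped label.
    F = [[0.0] * C for _ in range(C)]
    for c in range(C):
        for j in range(c, C):
            F[c][j] = F[j][c] = reward(swapped_label(pi, phi, c, j))
    # Merge step: subtract the column mean as a baseline, then average over
    # all reference categories j.
    col_mean = [sum(F[m][j] for m in range(C)) / C for j in range(C)]
    return [
        sum((F[c][j] - col_mean[j]) * (1.0 / C - pi[j]) for j in range(C))
        for c in range(C)
    ]

def sample_dirichlet_ones(num_categories, rng):
    """Draw pi ~ Dir(1_C) by normalizing independent Exp(1) variables."""
    draws = [rng.expovariate(1.0) for _ in range(num_categories)]
    total = sum(draws)
    return [d / total for d in draws]

rng = random.Random(1)
phi = [0.5, -0.3, 1.0]
reward_values = [1.0, 0.0, 2.0]  # hypothetical reward per category
num_samples = 5000
average = [0.0] * len(phi)
for _ in range(num_samples):
    pi = sample_dirichlet_ones(len(phi), rng)
    g = arsm_gradient(pi, phi, lambda z: reward_values[z])
    average = [a + gi / num_samples for a, gi in zip(average, g)]
print([round(a, 3) for a in average])
```

Each estimate needs at most $C(C-1)/2 + 1$ reward evaluations because the swap is symmetric in its two indices, which is why the sketch fills only the upper triangle of `F` and mirrors it.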


2.3 ARSM for learning to rank

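To employ ARSM for learning to rank, we need to consider the optimization problem with respect to the multivariate categorical variables $z = (z_1, \ldots, z_n)$. Let $z^{c \rightleftharpoons j} = (z^{c \rightleftharpoons j}_1, \ldots, z^{c \rightleftharpoons j}_n)$ denote the multivariate swapping whose elements are defined, similar to those in (4) and (5), as $z^{c \rightleftharpoons j}_k := \arg\min_{l \in \{1, \ldots, C\}} \pi^{c \rightleftharpoons j}_{kl} e^{-\phi_{kl}}$. Then the multivariate extension of the ARSM gradient estimator for the learning-to-rank objective in (2) can be expressed as [20]:

$$\nabla_{\phi_{kc}} \mathcal{E}(\phi) = \mathbb{E}_{\pi}\Big[\sum_{j=1}^{C}\Big(f\big(z^{c \rightleftharpoons j}\big) - \frac{1}{C}\sum_{m=1}^{C} f\big(z^{m \rightleftharpoons j}\big)\Big)\Big(\frac{1}{C} - \pi_{kj}\Big)\Big], \tag{6}$$

where $\pi_k \sim \mathrm{Dir}(\mathbf{1}_C)$ independently for $k = 1, \ldots, n$, and $f(\cdot)$ denotes the loss in (2) evaluated against the true labels. Since we define the categorical distribution parameters $\phi_k$ in terms of a neural network with parameters $\theta$, the final gradients are computed using the chain rule as

$$\nabla_{\theta} \mathcal{E}(\phi) = \sum_{k=1}^{n}\sum_{c=1}^{C} \frac{\partial \mathcal{E}(\phi)}{\partial \phi_{kc}}\, \frac{\partial \phi_{kc}}{\partial \theta}. \tag{7}$$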
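To make the chain-rule step concrete, here is a small sketch for the simplest possible scoring function, a linear map from features to logits. This is an assumption made purely for illustration, since the paper uses an MLP whose gradients would come from backpropagation. Given per-document gradient estimates with respect to the logits, the gradient with respect to the weight matrix is a sum of outer products.

```python
def weight_gradient(phi_grads, features):
    """Chain rule for linear logits phi_k = W x_k.

    phi_grads[k][c] is the estimated gradient for logit c of document k,
    features[k][d] is feature d of document k, and the result has the
    shape of W (C rows, p columns): sum_k phi_grads[k][c] * features[k][d].
    """
    C, p = len(phi_grads[0]), len(features[0])
    grad = [[0.0] * p for _ in range(C)]
    for g_k, x_k in zip(phi_grads, features):
        for c in range(C):
            for d in range(p):
                grad[c][d] += g_k[c] * x_k[d]
    return grad

# Hypothetical logit gradients for 2 documents with 2 levels and 2 features.
phi_grads = [[1.0, 0.0], [0.0, 2.0]]
features = [[1.0, 2.0], [3.0, 4.0]]
print(weight_gradient(phi_grads, features))  # [[1.0, 2.0], [6.0, 8.0]]
```

In a framework with automatic differentiation, the same effect is typically obtained by treating the logit gradients as constants and backpropagating them through the network.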


The estimated gradients are then utilized in a stochastic optimization process to learn the model parameters. Algorithm 1 summarizes the parameter learning for ARSM-L2R.

2.4 Loss function and rank prediction

The loss function $\ell(z, y)$ in (2) measures the dissimilarity between the predicted categorical labels $z$ and the true labels $y$. In this work, we utilize the negative truncated NDCG as the loss function of ARSM-L2R. The calculation of NDCG relies only on the sorting of the predicted labels $z$ and on the true labels $y$. Furthermore, our experiments show that setting the number of possible levels of relevance $C$ to be higher than the number of true levels in $y$ improves the performance of ARSM-L2R. Hence, for all experiments in this paper we set .
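A loss of this kind can be sketched as follows. The snippet sorts documents by their predicted labels and scores the resulting order against the true labels. The gain $2^{y} - 1$ and the discount $\log_2(1 + \text{position})$ are the common NDCG convention and are assumed here, as are the function names and the tie-breaking rule (ties keep the original document order).

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain of a ranked list of true relevance labels."""
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)
        for pos, rel in enumerate(relevances[:k])
    )

def neg_ndcg_at_k(predicted_labels, true_labels, k):
    """Negative NDCG@k of the ordering induced by the predicted labels."""
    # Rank documents by predicted label, highest first (stable for ties).
    order = sorted(range(len(true_labels)), key=lambda i: -predicted_labels[i])
    ranked_truth = [true_labels[i] for i in order]
    ideal_dcg = dcg_at_k(sorted(true_labels, reverse=True), k)
    if ideal_dcg == 0:
        return 0.0  # no relevant documents: define the loss as zero
    return -dcg_at_k(ranked_truth, k) / ideal_dcg

# Hypothetical sampled labels for four documents and their true labels.
print(neg_ndcg_at_k([1, 0, 2, 0], [0, 1, 2, 0], k=3))
```

Because the value depends on the predicted labels only through a sort, it is piecewise constant in the labels, which is exactly the situation where a gradient estimator that does not differentiate the loss is useful.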

After the parameters of the scoring function are learned in the training phase, the probability of different levels of relevance for the test documents can be calculated by simply passing the documents features through the scoring function. We then construct the final scores of the test documents by a weighted combination of these probabilities, and sort the documents based on these scores. More precisely, given the probability of different labels for a test document, we calculate its overall ranking score as , where and higher values of correspond to more relevant levels. Our experiments show that the performance of ARSM-L2R is not sensitive to the choice of the weight combination scheme.
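The prediction step can be illustrated with a short sketch. The weights below (each level weighted by its 0-based index, so the score is the expected relevance level) are a hypothetical choice made only for illustration, and the function names are likewise assumptions.

```python
def ranking_scores(prob_rows, weights):
    """Weighted combination of per-level probabilities for each document."""
    return [sum(w * p for w, p in zip(weights, probs)) for probs in prob_rows]

def rank_documents(prob_rows, weights):
    """Document indices sorted by descending ranking score."""
    scores = ranking_scores(prob_rows, weights)
    return sorted(range(len(scores)), key=lambda i: -scores[i])

# Hypothetical predicted probabilities over 3 relevance levels for 3 documents.
prob_rows = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6], [0.2, 0.6, 0.2]]
weights = [0, 1, 2]  # assumed: higher weight for more relevant levels
print(rank_documents(prob_rows, weights))  # [1, 2, 0]
```

Any weight vector that increases with the relevance level keeps the same qualitative behavior, which is consistent with the reported insensitivity to the weighting scheme.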

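input :  Document labels $y$ and query-document features $x$
output :  Categorical distribution parameters $\phi$ and parameters $\theta$ of the scoring function
Initialize $\phi$ and $\theta$ randomly;
while not converged do
       Sample $\pi_k \sim \mathrm{Dir}(\mathbf{1}_C)$ for $k = 1, \ldots, n$;
       Let $z_k = \arg\min_{l} \pi_{kl} e^{-\phi_{kl}}$ for $k = 1, \ldots, n$, to obtain the unswapped categorical labels $z$;
       Initialize the diagonal of the loss matrix $F \in \mathbb{R}^{C \times C}$ with $f(z)$;
       for $(c, j)$ with $c < j$ do
             Let $z^{c \rightleftharpoons j}_k = \arg\min_{l} \pi^{c \rightleftharpoons j}_{kl} e^{-\phi_{kl}}$ for $k = 1, \ldots, n$;
             Denote $F_{cj} = F_{jc} = f(z^{c \rightleftharpoons j})$;
       end for
       Let $\bar{F}_{\cdot j} = \frac{1}{C}\sum_{m=1}^{C} F_{mj}$ for $j = 1, \ldots, C$;
       Let $g_{\phi_{kc}} = \sum_{j=1}^{C} (F_{cj} - \bar{F}_{\cdot j})(\frac{1}{C} - \pi_{kj})$ for all $(k, c)$;
       Update $\phi$ using the gradient estimates $\{g_{\phi_{kc}}\}$, with the chosen step size;
       Update $\theta$ by backpropagating $\{g_{\phi_{kc}}\}$ through the scoring function as in (7), with the chosen step size
end while
Algorithm 1 Parameter inference in ARSM-L2R.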
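The inner loop of Algorithm 1 can be sketched in Python as below. For brevity the sketch optimizes the logits directly instead of an MLP, uses a simple mean absolute label error as the loss, and picks an arbitrary step size; all of these, along with the function names, are assumptions for illustration and not the authors' implementation.

```python
import math
import random

def sample_dirichlet_ones(num_categories, rng):
    """Draw pi ~ Dir(1_C) by normalizing independent Exp(1) variables."""
    draws = [rng.expovariate(1.0) for _ in range(num_categories)]
    total = sum(draws)
    return [d / total for d in draws]

def argmin_label(pi_row, phi_row):
    """0-based label argmin_l pi_l * exp(-phi_l) for one document."""
    return min(range(len(phi_row)), key=lambda l: pi_row[l] * math.exp(-phi_row[l]))

def arsm_l2r_gradient(phi, loss, rng):
    """One ARSM gradient estimate for logits phi (n documents x C levels)."""
    n, C = len(phi), len(phi[0])
    pi = [sample_dirichlet_ones(C, rng) for _ in range(n)]
    F = [[0.0] * C for _ in range(C)]  # loss matrix; diagonal = unswapped labels
    for c in range(C):
        for j in range(c, C):
            z = []
            for k in range(n):
                row = list(pi[k])
                row[c], row[j] = row[j], row[c]  # same swap for every document
                z.append(argmin_label(row, phi[k]))
            F[c][j] = F[j][c] = loss(z)
    col_mean = [sum(F[m][j] for m in range(C)) / C for j in range(C)]
    return [
        [sum((F[c][j] - col_mean[j]) * (1.0 / C - pi[k][j]) for j in range(C))
         for c in range(C)]
        for k in range(n)
    ]

# Toy run: descend on the expected label error for three documents.
true_labels = [0, 2, 1]
loss = lambda z: sum(abs(a - b) for a, b in zip(z, true_labels)) / len(true_labels)
rng = random.Random(0)
phi = [[0.0] * 3 for _ in true_labels]
step_size = 0.5  # hypothetical value
for _ in range(300):
    grad = arsm_l2r_gradient(phi, loss, rng)
    phi = [[p - step_size * g for p, g in zip(p_row, g_row)]
           for p_row, g_row in zip(phi, grad)]
predicted = [max(range(3), key=lambda c: row[c]) for row in phi]
print(predicted, true_labels)
```

In the full method the logit gradients would be passed back through the scoring network rather than applied to the logits themselves.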

3 Experiments

3.1 Datasets

We evaluate the performance of ARSM-L2R on two widely tested benchmark datasets, including a query set from Million Query track of TREC 2007, denoted as MQ2007 [16], as well as the OHSUMED dataset [15]. Each dataset consists of queries, corresponding retrieved documents and labels provided by human experts. The possible relevance labels for each document are relevant, partially relevant, and not relevant. We use the 5-fold partitions provided in the original dataset for 5-fold cross validation in the experiments. In each fold, there are three subsets for learning: training set, validation set and testing set. The properties of these learning to rank datasets are presented in Table 1.

dataset    #queries    #documents    #features
MQ2007     1700        25,000,000    46
OHSUMED    106         350,000       45
Table 1: Properties of learning-to-rank datasets used in the experiments.

method          NDCG@1    NDCG@3    NDCG@5    NDCG@10    MAP
RankSVM         0.4045    0.4019    0.4072    0.4383     0.4644
ListNet         0.4002    0.4091    0.4170    0.4440     0.4652
AdaRank-MAP     0.3821    0.3984    0.4070    0.4335     0.4577
AdaRank-NDCG    0.3876    0.4044    0.4102    0.4369     0.4602
ARSM-L2R        0.4051    0.4112    0.4159    0.4432     0.4608
Table 2: Performance of different learning-to-rank methods on MQ2007 dataset.

method          NDCG@1    NDCG@3    NDCG@5    NDCG@10    MAP
RankSVM         0.4958    0.4207    0.4164    0.4140     0.4468
ListNet         0.5326    0.4732    0.4432    0.4410     0.4495
AdaRank-MAP     0.5388    0.4682    0.4613    0.4429     0.4418
AdaRank-NDCG    0.5330    0.4790    0.4673    0.4496     0.4424
ARSM-L2R        0.5601    0.4642    0.4546    0.4460     0.4503
Table 3: Performance of different learning-to-rank methods on OHSUMED dataset.

3.2 Baselines

We compare the performance of our ARSM-L2R with several state-of-the-art baselines, including the pairwise method RankSVM [12], the listwise method ListNet [3], and two other listwise methods that directly optimize evaluation measures: AdaRank-MAP and AdaRank-NDCG [19].

3.3 Evaluation metrics

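We use two popular learning-to-rank evaluation measures to compare the predicted rankings of the test documents with their true rankings: truncated Normalized Discounted Cumulative Gain (NDCG@$k$) [9] and Mean Average Precision (MAP) [2]. NDCG has the effect of giving high scores to the ranking lists in which relevant documents are ranked high. Average Precision (AP) represents the averaged precision over all the positions of documents with a relevant label for query $q$. Denoting by $\tau$ the ranking list on $x$, with $\tau(j)$ the document placed at position $j$, AP is defined as

$$\mathrm{AP} = \frac{\sum_{j=1}^{n} P@j \cdot \mathbb{1}[y_{\tau(j)} \text{ is relevant}]}{\sum_{j=1}^{n} \mathbb{1}[y_{\tau(j)} \text{ is relevant}]},$$

where $P@j = \frac{1}{j}\sum_{t=1}^{j} \mathbb{1}[y_{\tau(t)} \text{ is relevant}]$, and MAP is the mean of AP over all queries. NDCG@$k$ is calculated by

$$\mathrm{NDCG@}k = N_k^{-1} \sum_{j=1}^{k} \frac{2^{y_{\tau(j)}} - 1}{\log_2(1 + j)},$$

where, if $\tau^{*}$ represents the true ranking list of $x$, then $N_k = \sum_{j=1}^{k} \frac{2^{y_{\tau^{*}(j)}} - 1}{\log_2(1 + j)}$. Here $k$ represents the truncation level.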
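As a small worked example of the precision-based metric described above, the sketch below computes AP for a ranked list of binary relevance flags and averages it over queries to obtain MAP. Treating a document as relevant whenever its flag is truthy is an assumption of this sketch, as are the function names.

```python
def average_precision(ranked_relevance):
    """AP of one ranked list; entries are truthy for relevant documents."""
    hits = 0
    precision_sum = 0.0
    for position, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precision_sum += hits / position  # precision at this position
    return precision_sum / hits if hits else 0.0

def mean_average_precision(ranked_lists):
    """MAP: the mean of AP over the ranked lists of all queries."""
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)

# Two hypothetical queries; 1 marks a relevant document at that rank.
queries = [[1, 0, 1, 0], [0, 1]]
print(mean_average_precision(queries))
```

For the first list the relevant documents sit at ranks 1 and 3, giving precisions 1/1 and 2/3, so its AP is their mean; the second list has a single relevant document at rank 2, giving an AP of 1/2.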

3.4 Implementation details

For the scoring function, we employ a fully connected neural network with one hidden layer of 500 units and the tanh nonlinear activation function. We initialize the weights of the neural network by the Glorot method [7], and train ARSM-L2R using the Adam optimizer [11] with a learning rate of . The algorithm is run for a total of 2000 epochs, and the ranking metrics on the validation sets are monitored for choosing the best-performing neural network weights. ARSM-L2R is implemented in TensorFlow [1].
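The architecture described above can be sketched without any deep learning framework. The snippet below builds a one-hidden-layer tanh network with Glorot initialization and computes the logits for one document. It assumes the uniform variant of Glorot initialization, zero biases, 46 input features (as for MQ2007), and 3 output levels; these choices and all function names are illustrative assumptions, and the optimizer is omitted.

```python
import math
import random

def glorot_uniform(fan_in, fan_out, rng):
    """Weight matrix (fan_out x fan_in) drawn from U(-limit, limit)."""
    limit = math.sqrt(6.0 / (fan_in + fan_out))
    return [[rng.uniform(-limit, limit) for _ in range(fan_in)]
            for _ in range(fan_out)]

def init_params(num_features, num_hidden, num_levels, rng):
    """Parameters of a one-hidden-layer network with zero biases."""
    return {
        "w1": glorot_uniform(num_features, num_hidden, rng),
        "b1": [0.0] * num_hidden,
        "w2": glorot_uniform(num_hidden, num_levels, rng),
        "b2": [0.0] * num_levels,
    }

def dense(weights, bias, inputs):
    """Affine map: one output per weight row."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]

def mlp_logits(params, features):
    """Forward pass: tanh hidden layer followed by a linear output layer."""
    hidden = [math.tanh(v) for v in dense(params["w1"], params["b1"], features)]
    return dense(params["w2"], params["b2"], hidden)

rng = random.Random(0)
params = init_params(num_features=46, num_hidden=500, num_levels=3, rng=rng)
document_features = [0.1] * 46  # hypothetical feature vector
print(mlp_logits(params, document_features))
```

The output logits would be passed through a softmax to obtain the categorical probabilities over relevance levels.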

3.5 Results and discussions

We compare the performance of the different methods based on NDCG@1, NDCG@3, NDCG@5, NDCG@10, and MAP. The results for the MQ2007 and OHSUMED datasets are provided in Tables 2 and 3, respectively. Our ARSM-L2R achieves the highest NDCG@1 and NDCG@3 on the MQ2007 dataset. On the OHSUMED dataset, ARSM-L2R has a higher NDCG@1 (0.5601) than all the other methods tested (at most 0.5388). It also shows the best MAP on this dataset. It is worth mentioning that NDCG@1 is one of the most important metrics for ranking systems, since it quantifies the relevance of the top-ranked item. It is interesting to note that our method only optimizes a rough approximation of the evaluation metric NDCG, yet it shows the best performance on two metrics for each dataset and comparable results on the remaining metrics. Previous works have shown that pointwise approaches cannot perform as well as listwise approaches. Our proposed method nevertheless achieves better or comparable performance, which we attribute to utilizing a loss function more directly related to ranking performance and to taking advantage of unbiased and low-variance gradient estimation.

4 Conclusions

We have developed a new supervised learning-to-rank model, ARSM-L2R, that generates relevance labels based on a categorical model with probabilities estimated by an MLP. The training objective is optimized with respect to the multivariate categorical variables with an unbiased and low-variance gradient estimator, ARSM. The experimental results show that ARSM-L2R achieves results better than or comparable to those of pairwise and listwise approaches.


References

  • [1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. (2016) TensorFlow: a system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16), pp. 265–283. ISBN 978-1-931971-33-1.
  • [2] R. Baeza-Yates, B. Ribeiro-Neto, et al. (1999) Modern information retrieval. Vol. 463, ACM Press, New York.
  • [3] Z. Cao, T. Qin, T. Liu, M. Tsai, and H. Li (2007) Learning to rank: from pairwise approach to listwise approach. In Proceedings of the 24th International Conference on Machine Learning, pp. 129–136.
  • [4] G. Casella and C. P. Robert (1996) Rao-Blackwellisation of sampling schemes. Biometrika 83 (1), pp. 81–94.
  • [5] D. Cossock and T. Zhang (2006) Subset ranking using regression. In International Conference on Computational Learning Theory, pp. 605–619.
  • [6] W. B. Croft, D. Metzler, and T. Strohman (2010) Search engines: information retrieval in practice. Vol. 520, Addison-Wesley, Reading.
  • [7] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, pp. 249–256.
  • [8] W. Grathwohl, D. Choi, Y. Wu, G. Roeder, and D. Duvenaud (2017) Backpropagation through the void: optimizing control variates for black-box gradient estimation. arXiv preprint arXiv:1711.00123.
  • [9] K. Järvelin and J. Kekäläinen (2000) IR evaluation methods for retrieving highly relevant documents. In Proceedings of the 23rd Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 41–48.
  • [10] T. Joachims (2002) Optimizing search engines using clickthrough data. In Proceedings of the Eighth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 133–142.
  • [11] D. P. Kingma and J. Ba (2014) Adam: a method for stochastic optimization. arXiv preprint arXiv:1412.6980.
  • [12] C. Lee and C. Lin (2014) Large-scale linear RankSVM. Neural Computation 26 (4), pp. 781–817.
  • [13] P. Li, Q. Wu, and C. J. Burges (2008) McRank: learning to rank using multiple classification and gradient boosting. In Advances in Neural Information Processing Systems, pp. 897–904.
  • [14] T. Liu et al. (2009) Learning to rank for information retrieval. Foundations and Trends in Information Retrieval 3 (3), pp. 225–331.
  • [15] T. Qin, T. Liu, J. Xu, and H. Li (2010) LETOR: a benchmark collection for research on learning to rank for information retrieval. Information Retrieval 13 (4), pp. 346–374.
  • [16] T. Qin and T. Liu (2013) Introducing LETOR 4.0 datasets. CoRR abs/1306.2597.
  • [17] G. Tucker, A. Mnih, C. J. Maddison, J. Lawson, and J. Sohl-Dickstein (2017) REBAR: low-variance, unbiased gradient estimates for discrete latent variable models. In Advances in Neural Information Processing Systems, pp. 2627–2636.
  • [18] R. J. Williams (1992) Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning 8 (3-4), pp. 229–256.
  • [19] J. Xu and H. Li (2007) AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 391–398.
  • [20] M. Yin, Y. Yue, and M. Zhou (2019) ARSM: augment-REINFORCE-swap-merge estimator for gradient backpropagation through categorical variables. arXiv preprint arXiv:1905.01413.