1 Introduction
In various information retrieval applications, a system must provide a ranking of candidate items that satisfies given criteria. For instance, a search engine must produce a list of results ranked by their relevance to a user query. The relationship between items (e.g. documents) represented as feature vectors and their rankings (e.g. based on relevance scores) is often complex, so machine learning is used to learn a function that generates a ranking given a list of items.
The ranking system is evaluated using metrics that reflect certain goals for the system. The choice of metric, as well as its relative importance, varies by application area. For instance, a search engine may evaluate its ranking system with Normalized Discounted Cumulative Gain (NDCG), while a question-answering system evaluates its ranking using precision at 3; a high NDCG score indicates results that are relevant to a user's query, while a high precision shows that many correct answers were ranked highly. Other common metrics include Recall@k, Mean Average Precision (MAP), and Area Under the ROC Curve (AUC).
Ranking algorithms may optimize error rate as a proxy for improving metrics such as AUC, or may optimize the metrics directly. However, typical metrics such as NDCG and AUC are either flat everywhere or non-differentiable with respect to model parameters, making direct optimization with gradient descent difficult.
LambdaMART [2] is a ranking algorithm that avoids this issue and directly optimizes non-smooth metrics. It uses a gradient-boosted tree model and forms an approximation to the gradient whose value is derived from the evaluation metric. LambdaMART has been empirically shown to find a local optimum of NDCG, Mean Reciprocal Rank, and Mean Average Precision [6]. An additional attractive property of LambdaMART is that the evaluation metric it optimizes is easily changed; the algorithm can therefore be adjusted for a given application area. This flexibility makes the algorithm a good candidate for a production system for general ranking, as using a single algorithm for multiple applications can reduce overall system complexity. However, to our knowledge LambdaMART's ability to optimize AUC has not been explored and empirically verified in the literature. In this paper, we propose extensions to LambdaMART to optimize AUC and multiclass AUC, and show that the extensions can be computed efficiently. To evaluate the system, we conduct experiments on several binary-class and multiclass benchmark datasets. We find that LambdaMART with the AUC extension performs similarly to an SVM baseline on binary-class datasets, and LambdaMART with the multiclass AUC extension outperforms the SVM baseline on multiclass datasets.
2 Related Work
This work relates to two areas: LambdaMART and AUC optimization in ranking. LambdaMART was originally proposed in [15] and is overviewed in [2]. The LambdaRank algorithm, upon which LambdaMART is based, was shown to find a locally optimal model for the IR metrics NDCG@10, mean NDCG, MAP, and MRR [6]. Svore et al. [14] propose a modification to LambdaMART that allows for simultaneous optimization of NDCG and a measure based on clickthrough rate.
Various approaches have been developed for optimizing AUC in binaryclass settings. Cortes and Mohri [5] show that minimum error rate training may be insufficient for optimizing AUC, and demonstrate that the RankBoost algorithm globally optimizes AUC. Calders and Jaroszewicz [3] propose a smooth polynomial approximation of AUC that can be optimized with a gradient descent method. Joachims [9] proposes an SVM method for various IR measures including AUC, and evaluates the system on text classification datasets. The SVM method is used as the comparison baseline in this paper.
3 Ranking Metrics
We first provide a review of the metrics used in this paper. Using document retrieval as an example, consider queries $q_1, \dots, q_Q$, and let $n_q$ denote the number of documents in query $q$. Let $d_i^q$ denote document $i$ in query $q$, where $i \in \{1, \dots, n_q\}$ and $q \in \{1, \dots, Q\}$.
3.1 Contingency Table Metrics
Several IR metrics are derived from a model's contingency table, which contains the four entries True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN):

                    $y = c_+$    $y = c_-$
$\hat{y} = c_+$        TP           FP
$\hat{y} = c_-$        FN           TN

where $y$ denotes an example's label, $\hat{y}$ denotes the predicted label, $c_+$ denotes the class label considered positive, and $c_-$ denotes the class label considered negative.
Measuring the precision of the first $k$ ranked documents, Precision@k, is often important in ranking applications. For instance, Precision@1 is important for question answering systems to evaluate whether the system's top ranked item is a correct answer. Although precision is a metric for binary class labels, many ranking applications and standard datasets have multiple class labels. To evaluate precision in the multiclass context we use Micro-averaged Precision and Macro-averaged Precision, which summarize precision performance on multiple classes [10].
3.1.1 Micro-averaged Precision
Micro-averaged Precision pools the contingency tables across classes, then computes precision using the pooled values:

$$\mathrm{Precision}_{micro} = \frac{\sum_{c=1}^{C} TP_c}{\sum_{c=1}^{C} \left( TP_c + FP_c \right)} \qquad (1)$$

where $C$ denotes the number of classes, $TP_c$ is the number of true positives for class $c$, and $FP_c$ is the number of false positives for class $c$.
Micro-averaged Precision@k is measured by using only the first $k$ ranked documents in each query:

$$\mathrm{Precision}_{micro}@k = \frac{\sum_{c=1}^{C} TP_c@k}{\sum_{c=1}^{C} \left( TP_c@k + FP_c@k \right)} \qquad (2)$$
Micro-averaged precision indicates performance on prevalent classes, since prevalent classes will contribute the most to the $TP_c$ and $FP_c$ sums.
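As a concrete illustration, equations 1 and 2 amount to pooling per-class counts before dividing. The following minimal Python sketch is our own (the function name and the dict-based counts are assumptions, not part of any ranking library); the counts would be computed over the top-$k$ documents of each query for the @k variant:

```python
def micro_precision(tp, fp):
    """Micro-averaged precision (eq. 1): pool per-class TP/FP counts, then divide.

    tp, fp: dicts mapping class label -> count, e.g. computed over the
    top-k documents of each query for Precision_micro@k (eq. 2).
    """
    total_tp = sum(tp.values())
    total_fp = sum(fp.values())
    return total_tp / (total_tp + total_fp)

# A prevalent class (class 1, with 100 predictions vs. 10 for class 2)
# dominates the pooled sums, as noted above.
print(micro_precision({1: 90, 2: 1}, {1: 10, 2: 9}))  # 91/110 ≈ 0.827
```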
3.1.2 Macro-averaged Precision
Macro-averaged Precision is a simple average of per-class precision values:

$$\mathrm{Precision}_{macro} = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c}{TP_c + FP_c} \qquad (3)$$
Restricting each query's ranked list to the first $k$ documents gives:

$$\mathrm{Precision}_{macro}@k = \frac{1}{C} \sum_{c=1}^{C} \frac{TP_c@k}{TP_c@k + FP_c@k} \qquad (4)$$
Macro-averaged precision indicates performance across all classes regardless of prevalence, since each class's precision value is given equal weight.
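For contrast with the pooled variant, equation 3 can be sketched as follows (again our own illustrative helper, not library code); with a prevalent class of precision 0.9 and a rare class of precision 0.1, each class contributes equally:

```python
def macro_precision(tp, fp):
    """Macro-averaged precision (eq. 3): unweighted mean of per-class precision.

    tp, fp: dicts mapping class label -> true/false positive counts.
    """
    return sum(tp[c] / (tp[c] + fp[c]) for c in tp) / len(tp)

# Class 1 has precision 0.9 and class 2 has precision 0.1; unlike the
# micro-averaged variant, prevalence does not matter here.
print(macro_precision({1: 90, 2: 1}, {1: 10, 2: 9}))  # (0.9 + 0.1) / 2 = 0.5
```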
3.1.3 AUC
AUC refers to the area under the ROC curve. The ROC curve plots True Positive Rate $TPR = \frac{TP}{TP + FN}$ versus False Positive Rate $FPR = \frac{FP}{FP + TN}$, with $TPR$ appearing on the y-axis, and $FPR$ appearing on the x-axis.
Each point on the ROC curve corresponds to a contingency table for a given model. In the ranking context, the contingency table is computed at a ranking cutoff $k$; the curve shows the $TPR$ and $FPR$ as $k$ changes. A model is considered to have better performance as its ROC curve shifts towards the upper left corner. The AUC measures the area under this curve, providing a single metric that summarizes a model's ROC curve and allowing for easy comparison.
3.1.4 Multiclass AUC
The standard AUC formulation is defined for binary classification. To evaluate a model using AUC on a dataset with multiple class labels, AUC can be extended to multiclass AUC (MAUC).
We define the class reference AUC value $AUC_c$ as the AUC when class label $c$ is viewed as positive and all other labels are viewed as negative. The multiclass AUC is then the weighted sum of class reference AUC values, where each class reference AUC is weighted by the proportion $p_c$ of the dataset examples with that class label [7]:

$$MAUC = \sum_{c=1}^{C} p_c \cdot AUC_c \qquad (5)$$
Note that the class-reference AUC of a prevalent class will therefore impact the MAUC score more than the class-reference AUC of a rare class.
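Equation 5 is straightforward to compute once the class-reference AUC values are available. A small sketch (function and argument names are our own) that makes the prevalence weighting visible:

```python
def mauc(class_auc, class_counts):
    """Multiclass AUC (eq. 5): class-reference AUCs weighted by prevalence p_c.

    class_auc: dict mapping class label -> class-reference AUC.
    class_counts: dict mapping class label -> number of examples with that label.
    """
    total = sum(class_counts.values())
    return sum((class_counts[c] / total) * class_auc[c] for c in class_auc)

# Class 1 is three times as prevalent as class 2, so its class-reference
# AUC dominates the MAUC score.
print(mauc({1: 0.9, 2: 0.5}, {1: 3, 2: 1}))  # 0.75*0.9 + 0.25*0.5 = 0.8
```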
4 Gradient Optimization of the MAUC function
We briefly describe LambdaMART's optimization procedure here and refer the reader to [2] for a more extensive treatment. LambdaMART uses a gradient descent optimization procedure that only requires the gradient, rather than the objective function, to be defined. The objective function can in principle be left undefined, since only the gradient is required to perform gradient descent. Each gradient approximation, known as a λ-gradient, focuses on document pairs of conflicting relevance values (document $d_i$ more or less relevant than document $d_j$):
$$\lambda_{ij} = S_{ij} \left| \Delta M_{ij} \cdot \frac{1}{1 + e^{s_i - s_j}} \right| \qquad (6)$$

with $S_{ij} = 1$ when $l_i > l_j$ and $S_{ij} = -1$ when $l_i < l_j$, where $l_i$ denotes the relevance label of document $d_i$ and $s_i$ denotes its model score.
The λ-gradient includes the change in IR metric, $\Delta M_{ij}$, from swapping the rank positions of the two documents, discounted by a function of the score difference between the documents.
For a given sorted order of the documents, the objective function is simply a weighted version of the RankNet [1] cost function. The RankNet cost is a pairwise cross-entropy cost applied to the logistic of the difference of the model scores. If document $d_i$, with score $s_i$, is to be ranked higher than document $d_j$, with score $s_j$, then the RankNet cost can be written as follows:

$$C_{ij} = \log\left( 1 + e^{-o_{ij}} \right) \qquad (7)$$

where $o_{ij} = s_i - s_j$ is the score difference of a pair of documents in a query. The derivative of the RankNet cost with respect to the difference in score is

$$\frac{\partial C_{ij}}{\partial o_{ij}} = \frac{-1}{1 + e^{o_{ij}}} \qquad (8)$$
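The cost and its derivative can be checked numerically. The sketch below is our own verification code (not the paper's implementation); it compares equation 8 against a central finite difference of equation 7:

```python
import math

def ranknet_cost(o):
    """RankNet pairwise cost (eq. 7), where o = s_i - s_j."""
    return math.log(1.0 + math.exp(-o))

def ranknet_grad(o):
    """Derivative of the RankNet cost w.r.t. the score difference (eq. 8)."""
    return -1.0 / (1.0 + math.exp(o))

# Central finite-difference check of eq. 8 at o = 0.3.
o, h = 0.3, 1e-6
numeric = (ranknet_cost(o + h) - ranknet_cost(o - h)) / (2 * h)
assert abs(numeric - ranknet_grad(o)) < 1e-8
```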
The optimization procedure using λ-gradients was originally defined using $\Delta NDCG$ as the $\Delta M$ term in order to optimize NDCG. $\Delta MAP$ and $\Delta MRR$ were also used to define effective λ-gradients for MAP and MRR, respectively. In this work, we adopt the approach of replacing the $\Delta M$ term to define λ-gradients for AUC and multiclass AUC.
4.1 λ-gradients for AUC and multiclass AUC
4.1.1 $\Delta AUC$
Defining the λ-gradient for AUC requires deriving a formula for $\Delta AUC_{ij}$ that can be efficiently computed. Efficiency is important since in every iteration, the $\Delta AUC_{ij}$ term is computed for up to $O(n_q^2)$ document pairs in each query $q$.
To derive the $\Delta AUC_{ij}$ term, we begin with the fact that AUC is equivalent to the Wilcoxon-Mann-Whitney statistic [5]. For $m$ documents $x_1^+, \dots, x_m^+$ with positive labels and $n$ documents $x_1^-, \dots, x_n^-$ with negative labels, we have:

$$AUC = \frac{\sum_{u=1}^{m} \sum_{v=1}^{n} \mathbf{1}\left[ s(x_u^+) > s(x_v^-) \right]}{mn} \qquad (9)$$
The indicator function $\mathbf{1}[\cdot]$ is 1 when the ranker assigns a score to a document with a positive label that is higher than the score assigned to a document with a negative label. Hence the numerator is the number of correctly ordered pairs, and we can write [9]:

$$AUC = \frac{C}{mn} \qquad (10)$$
where

$$C = \left| \left\{ (d_i, d_j) : l_i > l_j \text{ and } s_i > s_j \right\} \right| \qquad (11)$$
Note that a pair with equal labels is not considered a correct pair, since a document pair $(d_i, d_j)$ contributes to $C$ if and only if $d_i$ is ranked higher than $d_j$ in the ranked list induced by the current model scores, and $l_i > l_j$.
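In code, equations 9-11 amount to counting correctly ordered (positive, negative) pairs. A brute-force $O(mn)$ sketch of this statistic (our own illustrative code; ties in score count as incorrect, matching the strict inequality):

```python
def auc_wmw(labels, scores):
    """AUC as the Wilcoxon-Mann-Whitney statistic (eqs. 9-10).

    labels: binary labels (1 positive, 0 negative); scores: model scores.
    """
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    correct = sum(1 for p in pos for n in neg if p > n)  # eq. 11
    return correct / (len(pos) * len(neg))

print(auc_wmw([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.2]))  # 3 of 4 pairs correct: 0.75
```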
We now derive a formula for computing the $\Delta AUC_{ij}$ term in $O(1)$ time, given the ranked list and labels. This avoids the brute-force approach of counting the number of correct pairs before and after the swap, in turn providing an efficient way to compute a λ-gradient for AUC. Specifically, we have:
Theorem 4.1
Let $D = (d_1, \dots, d_r)$ be a list of documents sorted by model score, with $m$ positive labels and $n$ negative labels, where $l_i \in \{0, 1\}$. For each document pair $(d_i, d_j)$ with $i < j$,

$$\Delta AUC_{ij} = \frac{(l_j - l_i)(j - i)}{mn} \qquad (12)$$
Proof
To derive this formula, we start with

$$\Delta AUC_{ij} = AUC' - AUC = \frac{C' - C}{mn} \qquad (13)$$

where $AUC'$ (with correct-pair count $C'$) is the value of $AUC$ after swapping the scores assigned to documents $d_i$ and $d_j$, and $AUC$ (with correct-pair count $C$) is the value prior to the swap. Note that the swap corresponds to swapping the positions of documents $d_i$ and $d_j$ in the ranked list. The numerator $C' - C$ of $\Delta AUC_{ij}$ is the change in the number of correct pairs due to the swap. The following lemma shows that we only need to compute the change in the number of correct pairs for the pairs of documents within the interval $[i, j]$ in the ranked list.
Lemma 1
Let $(u, v)$ be a pair of rank positions where at least one of $u, v \notin [i, j]$. Then after swapping documents $d_i$ and $d_j$, the pair correctness of the documents at positions $(u, v)$ will be left unchanged, or its change will be negated by the change of another pair.
Proof
Without loss of generality, assume $u < v$. There are five cases to consider.

case $u \notin [i, j]$, $v \notin [i, j]$: The documents at positions $u$ and $v$ do not change due to the swap, therefore the pair correctness does not change.

Note that unless one of $u$ or $v$ is an endpoint $i$ or $j$, the pair does not change. Hence we now assume that one of $u$ or $v$ is an endpoint $i$ or $j$.

case $u < i$, $v = i$: After the swap, position $i$ holds document $d_j$, so the pair correctness of positions $(u, i)$ will change if and only if the correctness of $(u, i)$ and $(u, j)$ differed prior to the swap. But then the pair correctness of $(u, j)$ will change in the opposite direction, e.g. from correct to not correct, canceling out the change (see Fig. 1).

case $u < i$, $v = j$: Then the pair correctness of $(u, j)$ will change if and only if the correctness of $(u, j)$ and $(u, i)$ differed prior to the swap. But then the pair correctness of $(u, i)$ will change in the opposite direction, canceling out the change.

case $u = i$, $v > j$: Then the pair correctness of $(i, v)$ will change if and only if the correctness of $(i, v)$ and $(j, v)$ differed prior to the swap. But then the pair correctness of $(j, v)$ will change in the opposite direction, canceling out the change.

case $u = j$, $v > j$: Then the pair correctness of $(j, v)$ will change if and only if the correctness of $(j, v)$ and $(i, v)$ differed prior to the swap. But then the pair correctness of $(i, v)$ will change in the opposite direction, canceling out the change.
Lemma 1 shows that the difference in correct pairs $C' - C$ is equivalent to the change in the number of correct pairs within the interval $[i, j]$. Lemma 2 tells us that the magnitude of this change is simply $j - i$, the length of the interval.
Lemma 2
Assume $i < j$. Then

$$C' - C = (l_j - l_i)(j - i) \qquad (14)$$
Proof
There are three cases to consider.
case $l_i = l_j$: The number of correct pairs will not change since no rank position changes its label due to the swap. Hence $C' - C = 0$.

case $l_i = 1$, $l_j = 0$: Before swapping, each pair $(d_i, d_k)$ with $i < k \le j$ such that $l_k = 0$ is a correct pair. After the swap, each of these pairs is not a correct pair. There are $n_{[i,j]}$ such pairs, namely the number of documents in the interval with label 0.

Each pair $(d_k, d_j)$ with $i \le k < j$ such that $l_k = 1$ is a correct pair before swapping, and not correct after swapping. There are $m_{[i,j]}$ such pairs, namely the number of documents in the interval with label 1.

Every other pair remains unchanged, and the pair $(d_i, d_j)$ is counted once in each group, therefore

$$m_{[i,j]} + n_{[i,j]} - 1 = (j - i + 1) - 1 = j - i \qquad (15)$$

pairs changed from correct to not correct, corresponding to a decrease in the number of correct pairs. Hence we have $C' - C = -(j - i)$.
case $l_i = 0$, $l_j = 1$: Before swapping, each pair $(d_i, d_k)$ with $i < k \le j$ such that $l_k = 1$ is not a correct pair. After the swap, each of these pairs is a correct pair. There are $m_{[i,j]}$ such pairs, namely the number of documents in the interval with label 1.

Each pair $(d_k, d_j)$ with $i \le k < j$ such that $l_k = 0$ is not a correct pair before swapping, and is correct after swapping. There are $n_{[i,j]}$ such pairs, namely the number of documents in the interval with label 0.

Each pair $(d_i, d_k)$ such that $l_k = 0$ remains not correct. Each pair $(d_k, d_j)$ such that $l_k = 1$ remains not correct. Every other pair remains unchanged, and the pair $(d_i, d_j)$ is counted once in each group. Therefore

$$m_{[i,j]} + n_{[i,j]} - 1 = (j - i + 1) - 1 = j - i \qquad (16)$$

pairs changed from not correct to correct, corresponding to an increase in the number of correct pairs. Hence we have $C' - C = j - i$. ∎
Therefore by Lemmas 1 and 2, we have:

$$\Delta AUC_{ij} = \frac{C' - C}{mn} = \frac{(l_j - l_i)(j - i)}{mn}$$

completing the proof of Theorem 4.1. ∎
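Theorem 4.1 can also be sanity-checked numerically: the $O(1)$ expression should agree with a brute-force recomputation of AUC after each swap. The sketch below is our own verification code (function names are assumptions, not the paper's implementation), assuming binary labels given in ranked order:

```python
def auc_from_ranked_labels(labels):
    """AUC of a ranking given binary labels in ranked order (best first)."""
    m = sum(labels)              # number of positives
    n = len(labels) - m          # number of negatives
    correct, negs_below = 0, 0
    for l in reversed(labels):   # walk up from the bottom of the ranking
        if l == 0:
            negs_below += 1
        else:
            correct += negs_below  # this positive outranks all negatives below it
    return correct / (m * n)

def delta_auc(labels, i, j):
    """O(1) change in AUC from swapping ranks i < j (Theorem 4.1, eq. 12)."""
    m = sum(labels)
    n = len(labels) - m
    return (labels[j] - labels[i]) * (j - i) / (m * n)

# Verify the closed form against brute-force recomputation for every pair.
labels = [1, 0, 0, 1, 0]
for i in range(len(labels)):
    for j in range(i + 1, len(labels)):
        swapped = labels[:]
        swapped[i], swapped[j] = swapped[j], swapped[i]
        brute = auc_from_ranked_labels(swapped) - auc_from_ranked_labels(labels)
        assert abs(brute - delta_auc(labels, i, j)) < 1e-12
```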
Applying the formula from Theorem 4.1 to the list of documents sorted by the current model scores, we define the λ-gradient for AUC as:

$$\lambda_{ij}^{AUC} = S_{ij} \left| \Delta AUC_{ij} \cdot \frac{1}{1 + e^{s_i - s_j}} \right| \qquad (17)$$

where $S_{ij}$, $s_i$, and $s_j$ are as defined previously, and $\Delta AUC_{ij}$ is computed with equation 12.
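Combining the closed-form swap delta with the logistic discount, a single pairwise λ-gradient contribution might be computed as follows (a sketch in our notation, not the actual JForests implementation; documents are indexed by their current rank):

```python
import math

def lambda_auc(labels, scores, i, j):
    """Pairwise lambda-gradient for AUC (eq. 17) for the documents at ranks i < j.

    labels: binary labels in ranked order; scores: model scores in ranked order.
    """
    if labels[i] == labels[j]:
        return 0.0                  # pairs with equal labels contribute nothing
    m = sum(labels)
    n = len(labels) - m
    delta = (labels[j] - labels[i]) * (j - i) / (m * n)   # eq. 12
    s_ij = 1.0 if labels[i] > labels[j] else -1.0
    return s_ij * abs(delta / (1.0 + math.exp(scores[i] - scores[j])))
```

In LambdaMART the per-document pseudo-gradient that the regression trees are fit to would then be the sum of these pairwise terms over all pairs containing the document.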
4.1.2 $\Delta MAUC$
To extend the λ-gradient for AUC to a multiclass setting, we consider the multiclass AUC definition found in equation 5. Since MAUC is a linear combination of class-reference AUC values, to compute $\Delta MAUC_{ij}$ we can compute the change in each class-reference AUC value separately using equation 12 and weight each value by the proportion $p_c$, giving:

$$\Delta MAUC_{ij} = \sum_{c=1}^{C} p_c \cdot \Delta AUC_{ij}^{(c)} \qquad (18)$$

where $\Delta AUC_{ij}^{(c)}$ denotes the change in the class-reference AUC for class $c$.
Using this term and the previously defined terms $S_{ij}$, $s_i$, and $s_j$, we define the λ-gradient for MAUC as:

$$\lambda_{ij}^{MAUC} = S_{ij} \left| \Delta MAUC_{ij} \cdot \frac{1}{1 + e^{s_i - s_j}} \right| \qquad (19)$$
5 Experiments
Experiments were conducted on binary-class datasets to compare the AUC performance of LambdaMART trained with the AUC λ-gradient, referred to as LambdaMART-AUC, against a baseline model. Similar experiments were conducted on multiclass datasets to compare LambdaMART trained with the MAUC λ-gradient, referred to as LambdaMART-MAUC, against a baseline in terms of MAUC. Differences in precision on the predicted rankings were also investigated.
The LambdaMART implementation used in the experiments was a modified version of the JForests learning to rank library [8]. This library showed the best NDCG performance out of the available Java ranking libraries in preliminary experiments. We then implemented the extensions required to compute the AUC and multiclass AUC λ-gradients. For parameter tuning, a learning rate was chosen for each dataset by searching over a set of candidate values and choosing the value that resulted in the best performance on a validation set.
As the comparison baseline, we used a Support Vector Machine (SVM) formulated for optimizing AUC. The SVM implementation was provided by the SVMPerf [9] library. The ROC-Area loss function was used, and the regularization parameter was chosen by searching over a set of candidate values and choosing the value that resulted in the best performance on a validation set. For the multiclass setting, a binary classifier was trained for each individual relevance class. Prediction scores for a document were then generated by combining the outputs of the per-class binary classifiers. These scores were used to induce a ranking of documents for each query.

5.1 Datasets
For evaluating LambdaMART-AUC, we used six binary-class web-search datasets from the LETOR 3.0 [13] Gov dataset collection, named td2003, td2004, np2003, np2004, hp2003, and hp2004. Each dataset is divided into five folds and contains feature vectors representing query-document pairs and binary relevance labels.
For evaluating LambdaMART-MAUC, we used four multiclass web-search datasets: versions 1.0 and 2.0 of the Yahoo! Learning to Rank Challenge [4] dataset, and the mq2007 and mq2008 datasets from the LETOR 4.0 [12] collection. The Yahoo! and LETOR datasets are divided into two and five folds, respectively. Each Yahoo! dataset has integer relevance scores ranging from 0 (not relevant) to 4 (very relevant), while the LETOR datasets have integer relevance scores ranging from 0 to 2. The LETOR datasets have 1700 and 800 queries, respectively, while the larger Yahoo! datasets have approximately 20,000 queries.
5.2 Results
5.2.1 AUC
On the binary-class datasets, LambdaMART-AUC and SVMPerf performed similarly in terms of AUC and Mean Average Precision. The results did not definitively show that either algorithm was superior on all datasets; LambdaMART-AUC had higher AUC scores on 2 datasets (td2003 and td2004), lower AUC scores on 3 datasets (hp2003, hp2004, np2004), and a similar score on np2003. In terms of MAP, LambdaMART-AUC was higher on 2 datasets (td2003 and td2004), lower on 2 datasets (np2004, hp2004), and similar on 2 datasets (np2003, hp2003). The results confirm that LambdaMART-AUC is an effective option for optimizing AUC on binary datasets, since the SVM model has previously been shown to perform effectively.
5.2.2 MAUC
Table 1 shows the MAUC scores on held-out test sets for the four multiclass datasets. The reported value is the average MAUC across all dataset folds. The results indicate that in terms of optimizing multiclass AUC, LambdaMART-MAUC is as effective as SVMPerf on the LETOR datasets, and more effective on the larger Yahoo! datasets.
Table 1. MAUC scores:

                   Yahoo V1   Yahoo V2   mq2007   mq2008
LambdaMART-MAUC     0.594      0.592     0.662    0.734
SVMPerf             0.576      0.576     0.659    0.737
Table 2. Mean Average Precision scores:

                   Yahoo V1   Yahoo V2   mq2007   mq2008
LambdaMART-MAUC     0.862      0.858     0.466    0.474
SVMPerf             0.837      0.837     0.450    0.458
Additionally, the experiments found that LambdaMART-MAUC outperformed SVMPerf in terms of precision in all cases. Table 2 shows the Mean Average Precision scores for the four datasets. LambdaMART-MAUC also had higher $\mathrm{Precision}_{micro}@k$ and $\mathrm{Precision}_{macro}@k$ on all datasets, for each cutoff $k$ evaluated. For instance, Figure 2 shows the values of $\mathrm{Precision}_{micro}@k$ and $\mathrm{Precision}_{macro}@k$ for the Yahoo! V1 dataset.
The class-reference AUC scores indicate that LambdaMART-MAUC and SVMPerf arrive at their MAUC scores in different ways. LambdaMART-MAUC focuses on the most prevalent class; each $\Delta AUC_{ij}^{(c)}$ term for a prevalent class receives a higher weighting than that for a rare class due to the $p_c$ term in the $\Delta MAUC$ computation. As a result the λ-gradients in LambdaMART-MAUC place more emphasis on achieving a high class-reference AUC for prevalent classes than for rare classes. Table 3 shows the class-reference AUC scores for the Yahoo! V1 dataset. We observe that LambdaMART-MAUC produces a better class-reference AUC for class 1 than SVMPerf, but a worse one for class 4, since class 1 is much more prevalent than class 4; 48% of the documents in the training set with a positive label have a label of class 1, while only 2.5% have a label of class 4.
Finally, we note that on the large-scale Microsoft Learning to Rank dataset MSLR-WEB10K [11], the SVMPerf training failed to converge on a single fold after 12 hours. Therefore training a model for each class for every fold was impractical using SVMPerf, while LambdaMART-MAUC was able to train on all five folds in less than 5 hours. This further suggests that LambdaMART-MAUC is preferable to SVMPerf for optimizing MAUC on large ranking datasets.
Table 3. Class-reference AUC scores on the Yahoo! V1 dataset:

                   Class 1   Class 2   Class 3   Class 4
LambdaMART-MAUC     0.503     0.690     0.757     0.831
SVMPerf             0.474     0.682     0.796     0.920
6 Conclusions
We have introduced a method for optimizing AUC on ranking datasets using a gradient-boosting framework. Specifically, we have derived λ-gradient approximations for optimizing AUC with LambdaMART in binary and multiclass settings, and shown that the λ-gradients are efficient to compute. The experiments show that the method performs as well as, or better than, a baseline SVM method, and performs especially well on large, multiclass datasets. In addition to adding LambdaMART to the portfolio of algorithms that can be used to optimize AUC, our extensions expand the set of IR metrics for which LambdaMART can be used.
There are several possible future directions. One is investigating local optimality of the solution produced by LambdaMART-AUC using Monte Carlo methods. Other directions include exploring LambdaMART with multiple objective functions to optimize AUC, and creating an extension to optimize the area under a Precision-Recall curve rather than an ROC curve.
Acknowledgements
Thank you to Dwi Sianto Mansjur for giving helpful guidance and providing valuable comments about this paper.
References
 [1] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 89–96. ICML ’05, ACM, New York, NY, USA (2005)
 [2] Burges, C.J.: From RankNet to LambdaRank to LambdaMART: An overview. Learning 11, 23–581 (2010)
 [3] Calders, T., Jaroszewicz, S.: Efficient AUC optimization for classification. Knowledge Discovery in Databases: PKDD 2007 pp. 42–53 (2007)
 [4] Chapelle, O., Chang, Y.: Yahoo! learning to rank challenge overview (2011)
 [5] Cortes, C., Mohri, M.: AUC optimization vs. error rate minimization. Advances in Neural Information Processing Systems 16(16), 313–320 (2004)
 [6] Donmez, P., Svore, K., Burges, C.J.: On the optimality of LambdaRank. Tech. Rep. MSR-TR-2008-179, Microsoft Research (November 2008), http://research.microsoft.com/apps/pubs/default.aspx?id=76530
 [7] Fawcett, T.: An introduction to ROC analysis. Pattern Recogn. Lett. 27(8), 861–874 (Jun 2006)

 [8] Ganjisaffar, Y., Caruana, R., Lopes, C.V.: Bagging gradient-boosted trees for high precision, low variance ranking models. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. pp. 85–94. ACM (2011)
 [9] Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22nd International Conference on Machine Learning. pp. 377–384. ICML ’05, ACM, New York, NY, USA (2005)
 [10] Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
 [11] Microsoft learning to rank datasets, http://research.microsoft.com/en-us/projects/mslr/
 [12] Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. CoRR abs/1306.2597 (2013), http://arxiv.org/abs/1306.2597
 [13] Qin, T., Liu, T.Y., Xu, J., Li, H.: LETOR: A benchmark collection for research on learning to rank for information retrieval. Inf. Retr. 13(4), 346–374 (Aug 2010), http://dx.doi.org/10.1007/s10791-009-9123-y
 [14] Svore, K.M., Volkovs, M.N., Burges, C.J.: Learning to rank with multiple objective functions. In: Proceedings of the 20th international conference on World wide web. pp. 367–376. ACM (2011)
 [15] Wu, Q., Burges, C.J., Svore, K.M., Gao, J.: Adapting boosting for information retrieval measures. Information Retrieval 13(3), 254–270 (2010)