Efficient AUC Optimization for Information Ranking Applications

11/16/2015 ∙ by Sean J. Welleck, et al. ∙ ibm 0

Adequate evaluation of an information retrieval system to estimate future performance is a crucial task. Area under the ROC curve (AUC) is widely used to evaluate the generalization of a retrieval system. However, the objective function optimized in many retrieval systems is the error rate and not the AUC value. This paper provides an efficient and effective non-linear approach to optimize AUC using additive regression trees, with a special emphasis on the use of multi-class AUC (MAUC) because multiple relevance levels are widely used in many ranking applications. Compared to a conventional linear approach, the performance of the non-linear approach is comparable on binary-relevance benchmark datasets and is better on multi-relevance benchmark datasets.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

In various information retrieval applications, a system may need to provide a ranking of candidate items that satisfies a criteria. For instance, a search engine must produce a list of results, ranked by their relevance to a user query. The relationship between items (e.g. documents) represented as feature vectors and their rankings (e.g. based on relevance scores) is often complex, so machine learning is used to learn a function that generates a ranking given a list of items.

The ranking system is evaluated using metrics that reflect certain goals for the system. The choice of metric, as well as its relative importance, varies by application area. For instance, a search engine may evaluate its ranking system with Normalized Discounted Cumulative Gain (NDCG), while a question-answering system evaluates its ranking using precision at 3; a high NDCG score is meant to indicate results that are relevant to a user’s query, while a high precision shows that a favorable amount of correct answers were ranked highly. Other common metrics include Recall @ k, Mean Average Precision (MAP), and Area Under the ROC Curve (AUC).

Ranking algorithms may optimize error rate as a proxy for improving metrics such as AUC, or may optimize the metrics directly. However, typical metrics such as NDCG and AUC are either flat everywhere or non-differentiable with respect to model parameters, making direct optimization with gradient descent difficult.


is a ranking algorithm that is able to avoid this issue and directly optimize non-smooth metrics. It uses a gradient-boosted tree model and forms an approximation to the gradient whose value is derived from the evaluation metric. LambdaMART has been empirically shown to find a local optimum of NDCG, Mean Reciprocal Rank, and Mean Average Precision

[6]. An additional attractive property of LambdaMART is that the evaluation metric that LambdaMART optimizes is easily changed; the algorithm can therefore be adjusted for a given application area. This flexibility makes the algorithm a good candidate for a production system for general ranking, as using a single algorithm for multiple applications can reduce overall system complexity.

However, to our knowledge LambdaMART’s ability to optimize AUC has not been explored and empirically verified in the literature. In this paper, we propose extensions to LambdaMART to optimize AUC and multi-class AUC, and show that the extensions can be computed efficiently. To evaluate the system, we conduct experiments on several binary-class and multi-class benchmark datasets. We find that LambdaMART with the AUC extension performs similarly to an SVM baseline on binary-class datasets, and LambdaMART with the multi-class AUC extension outperforms the SVM baseline on multi-class datasets.

2 Related Work

This work relates to two areas: LambdaMART and AUC optimization in ranking. LambdaMART was originally proposed in [15] and is overviewed in [2]. The LambdaRank algorithm, upon which LambdaMART is based, was shown to find a locally optimal model for the IR metrics NDCG@10, mean NDCG, MAP, and MRR [6]. Svore et. al [14] propose a modification to LambdaMART that allows for simultaneous optimization of NDCG and a measure based on click-through rate.

Various approaches have been developed for optimizing AUC in binary-class settings. Cortes and Mohri [5] show that minimum error rate training may be insufficient for optimizing AUC, and demonstrate that the RankBoost algorithm globally optimizes AUC. Calders and Jaroszewicz [3] propose a smooth polynomial approximation of AUC that can be optimized with a gradient descent method. Joachims [9] proposes an SVM method for various IR measures including AUC, and evaluates the system on text classification datasets. The SVM method is used as the comparison baseline in this paper.

3 Ranking Metrics

We will first provide a review of the metrics used in this paper. Using document retrieval as an example, consider queries , and let denote the number of documents in query . Let denote document in query , where , .

3.1 Contingency Table Metrics

Several IR metrics are derived from a model’s contingency table, which contains the four entries True Positive (TP), False Positive (FP), False Negative (FN), and True Negative (TN):


where denotes an example’s label, denotes the predicted label, denotes the class label considered positive, and denotes the class label considered negative.

Measuring the precision of the first ranked documents is often important in ranking applications. For instance, is important for question answering systems to evaluate whether the system’s top ranked item is a correct answer. Although precision is a metric for binary class labels, many ranking applications and standard datasets have multiple class labels. To evaluate precision in the multi-class context we use Micro-averaged Precision and Macro-averaged Precision, which summarize precision performance on multiple classes [10].

3.1.1 Micro-averaged Precision

Micro-averaged Precision pools the contingency tables across classes, then computes precision using the pooled values:


where denotes the number of classes, is the number of true positives for class , and is the number of false positives for class .

is measured by using only the first ranked documents in each query:


Micro-averaged precision indicates performance on prevalent classes, since prevalent classes will contribute the most to the and sums.

3.1.2 Macro-averaged Precision

Macro-averaged Precision is a simple average of per-class precision values:


Restricting each query’s ranked list to the first documents gives:


Macro-averaged precision indicates performance across all classes regardless of prevalence, since each class’s precision value is given equal weight.

3.1.3 Auc

AUC refers to the area under the ROC curve. The ROC curve plots True Positive Rate versus False Positive Rate , with appearing on the y-axis, and appearing on the x-axis.

Each point on the ROC curve corresponds to a contingency table for a given model. In the ranking context, the contingency table is for the ranking cutoff ; the curve shows the and as changes. A model is considered to have better performance as its ROC curve shifts towards the upper left quadrant. The AUC measures the area under this curve, providing a single metric that summarizes a model’s ROC curve and allowing for easy comparison.

We also note that the AUC is equivalent to the Wilcoxon-Mann-Whitney statistic [5] and can therefore be computed using the number of correctly ordered document pairs. Fawcett [7] provides an efficient algorithm for computing AUC.

3.1.4 Multi-Class AUC

The standard AUC formulation is defined for binary classification. To evaluate a model using AUC on a dataset with multiple class labels, AUC can be extended to multi-class AUC (MAUC).

We define the class reference AUC value as the AUC when class label is viewed as positive and all other labels as negative. The multi-class AUC is then the weighted sum of class reference AUC values, where each class reference AUC is weighted by the proportion of the dataset examples with that class label, denoted [7]:


Note that the class-reference AUC of a prevalent class will therefore impact the MAUC score more than the class-reference AUC of a rare class.

4 -Gradient Optimization of the MAUC function

We briefly describe LambdaMART’s optimization procedure here and refer the reader to [2] for a more extensive treatment. LambdaMART uses a gradient descent optimization procedure that only requires the gradient, rather than the objective function, to be defined. The objective function can in principal be left undefined, since only the gradient is required to perform gradient descent. Each gradient approximation, known as a -gradient, focuses on document pairs of conflicting relevance values (document more or less relevant than document ):


with when and when .

The -gradient includes the change in IR metric, , from swapping the rank positions of the two documents, discounted by a function of the score difference between the documents.

For a given sorted order of the documents, the objective function is simply a weighted version of the RankNet [1] cost function. The RankNet cost is a pairwise cross-entropy cost applied to the logistic of the difference of the model scores. If document , with score , is to be ranked higher than document , with score , then the RankNet cost can be written as follows:


where is the score difference of a pair of documents in a query. The derivative of the RankNet cost according to the difference in score is


The optimization procedure using -gradients was originally defined using as the term in order to optimize NDCG. and were also used to define effective -gradients for MAP and MRR, respectively. In this work, we adopt the approach of replacing the term to define -gradients for AUC and multi-class AUC.

4.1 -gradients for AUC and multi-class AUC

4.1.1 -

Defining the -gradient for AUC requires deriving a formula for that can be efficiently computed. Efficiency is important since in every iteration, the term is computed for document pairs for each query .

To derive the term, we begin with the fact that AUC is equivalent to the Wilcoxon-Mann-Whitney statistic [5]. For documents with positive labels and documents with negative labels, we have:


The indicator function is

when the ranker assigns a score to a document with a positive label that is higher than the score assigned to a document with a negative label. Hence the numerator is the number of correctly ordered pairs, and we can write





Note that a pair with equal labels is not considered a correct pair, since a document pair contributes to if and only if is ranked higher than in the ranked list induced by the current model scores, and .

We now derive a formula for computing the term in time, given the ranked list and labels. This avoids the brute-force approach of counting the number of correct pairs before and after the swap, in turn providing an efficient way to compute a -gradient for . Specifically, we have:

Theorem 4.1

Let be a list of documents with positive labels and negative labels, denoted , with . For each document pair ,


To derive this formula, we start with


where is the value of after swapping the scores assigned to documents and , and is the value of prior to the swap. Note that the swap corresponds to swapping positions of documents and in the ranked list. The numerator of is the change in the number of correct pairs due to the swap. The following lemma shows that we only need to compute the change in the number of correct pairs for the pairs of documents within the interval [i,j] in the ranked list.

Lemma 1

Let be a document pair where at least one of . Then after swapping documents , the pair correctness of will be left unchanged or negated by another pair.


Without loss of generality, assume . There are five cases to consider.

case , : Then the pair does not change due to the swap, therefore its pair correctness does not change.

Note that unless one of or is an endpoint or , the pair does not change. Hence we now assume that one of or is an endpoint or .

case , : The pair correctness of will change if and only if prior to the swap. But then the pair correctness of will change from correct to not correct, canceling out the change (see Fig. 1).















Figure 1: Document swap for case , , with

case , : Then the pair correctness of will change if and only if prior to the swap. But then the pair correctness of will change from correct to not correct, canceling out the change.

case , : Then pair correctness of will change if and only if prior to the swap. But then the pair correctness of will change from correct to not correct, canceling out the change.

case , : Then pair correctness of will change if and only if prior to the swap. But then the pair correctness of will change from correct to not correct, canceling out the change.

Hence in all cases, either the pair correctness stays the same, or the pair changes from not correct to correct and an additional pair changes from correct to not correct, thus canceling out the change with respect to the total number of correct pairs after the swap. ∎

Lemma 1 shows that the difference in correct pairs is equivalent to , namely the change in the number of correct pairs within the interval [i,j]. Lemma 2 tells us that this value is simply the length of the interval [i,j].

Lemma 2

Assume . Then


There are three cases to consider.

case : The number of correct pairs will not change since no document labels change due to the swap. Hence .

case , : Before swapping, each pair , such that is a correct pair. After the swap, each of these pairs is not a correct pair. There are such pairs, namely the number of documents in the interval with label .

Each pair , such that is a correct pair before swapping, and not correct after swapping. There are such pairs, namely the number of documents in the interval with label .

Every other pair remains unchanged, therefore


pairs changed from correct to not correct, corresponding to a decrease in the number of correct pairs. Hence we have:


case , : Before swapping, each pair , such that is not a correct pair. After the swap, each of these pairs is a correct pair. There are such pairs, namely the number of documents in the interval with label .

Each pair , such that is not a correct pair before swapping, and is correct after swapping. There are such pairs, namely the number of documents in the interval with label .

Each pair , such that remains not correct. Each pair , such that remains not correct. Every other pair remains unchanged. Therefore


pairs changed from not correct to correct, corresponding to an increase in the number of correct pairs. Hence we have:

Therefore by Lemmas 1 and 2, we have:

completing the proof of Theorem 1. ∎

Applying the formula from Theorem 1 to the list of documents sorted by the current model scores, we define the -gradient for AUC as:


where and are as defined previously, and .

4.1.2 -

To extend the -gradient for AUC to a multi-class setting, we consider the multi-class AUC definition found in equation 5. Since MAUC is a linear combination of class-reference AUC values, to compute we can compute the change in each class-reference AUC value separately using equation 12 and weight each value by the proportion , giving:


Using this term and the previously defined terms and , we define the -gradient for as:


5 Experiments

Experiments were conducted on binary-class datasets to compare the AUC performance of LambdaMART trained with the AUC -gradient, referred to as LambdaMART-AUC, against a baseline model. Similar experiments were conducted on multi-class datasets to compare LambdaMART trained with the MAUC -gradient, referred to as LambdaMART-MAUC, against a baseline in terms of MAUC. Differences in precision on the predicted rankings were also investigated.

The LambdaMART implementation used in the experiments was a modified version of the JForests learning to rank library [8]. This library showed the best NDCG performance out of the available Java ranking libraries in preliminary experiments. We then implemented extensions required to compute the AUC and multi-class AUC -gradients. For parameter tuning, a learning rate was chosen for each dataset by searching over the values and choosing the value that resulted in the best performance on a validation set.

As the comparison baseline, we used a Support Vector Machine (SVM) formulated for optimizing AUC. The SVM implementation was provided by the SVM-Perf


library. The ROCArea loss function was used, and the regularization parameter

was chosen by searching over the values

and choosing the value that resulted in the best performance on a validation set. For the multi-class setting, a binary classifier was trained for each individual relevance class. Prediction scores for a document

were then generated by computing the quantity , where denotes the number of classes, and denotes the binary classifier for relevance class . These scores were used to induce a ranking of documents for each query.

5.1 Datasets

For evaluating LambdaMART-AUC, we used six binary-class web-search datasets from the LETOR 3.0 [13] Gov dataset collection, named td2003, td2004, np2003, np2004, hp2003, and hp2004. Each dataset is divided into five folds and contains feature vectors representing query-document pairs and binary relevance labels.

For evaluating LambdaMART-MAUC, we used four multi-class web-search datasets: versions 1.0 and 2.0 of the Yahoo! Learning to Rank Challenge [4] dataset, and the mq2007 and mq2008 datasets from the LETOR 4.0 [12] collection. The Yahoo! and LETOR datasets are divided into two and five folds, respectively. Each Yahoo! dataset has integer relevance scores ranging from 0 (not relevant) to 4 (very relevant), while the LETOR datasets have integer relevance scores ranging from 0 to 2. The LETOR datasets have 1700 and 800 queries, respectively, while the larger Yahoo! datasets have approximately 20,000 queries.

5.2 Results

5.2.1 Auc

On the binary-class datasets, LambdaMART-AUC and SVM-Perf performed similarly in terms of AUC and Mean-Average Precision. The results did not definitively show that either algorithm was superior on all datasets; LambdaMART-AUC had higher AUC scores on 2 datasets (td2003 and td2004), lower AUC scores on 3 datasets (hp2003, hp2004, np2004), and a similar score on np2003. In terms of MAP, LambdaMART-AUC was higher on 2 datasets (td2003 and td2004), lower on 2 datasets (np2004, hp2004), and similar on 2 datasets (np2003, hp2003). The results confirm that LambdaMART-AUC is an effective option for optimizing AUC on binary datasets, since the SVM model has previously been shown to perform effectively.

5.2.2 Mauc

Table 1 shows the MAUC scores on held out test sets for the four multi-class datasets. The reported value is the average MAUC across all dataset folds. The results indicate that in terms of optimizing Multi-class AUC, LambdaMART-MAUC is as effective as SVM-Perf on the LETOR datasets, and more effective on the larger Yahoo! datasets.

Yahoo V1 Yahoo V2 mq2007 mq2008
LambdaMART-MAUC 0.594 0.592 0.662 0.734
SVM-Perf 0.576 0.576 0.659 0.737
Table 1: Summary of Multi-class AUC on test folds
Yahoo V1 Yahoo V2 mq2007 mq2008
LambdaMART-MAUC 0.862 0.858 0.466 0.474
SVM-Perf 0.837 0.837 0.450 0.458
Table 2: Summary of Mean Average Precision on test folds

Additionally, the experiments found that LambdaMART-MAUC outperformed SVM-Perf in terms of precision in all cases. Table 2 shows the Mean Average Precision scores for the four datasets. LambdaMART-MAUC also had higher and on all datasets, for . For instance, Figure 2 shows the values of and for the Yahoo! V1 dataset.

The class-reference AUC scores indicate that LambdaMART-MAUC and SVM-Perf arrive at their MAUC scores in different ways. LambdaMART-MAUC focuses on the most prevalent class; each term for a prevalent class receives a higher weighting than for a rare class due to the term in the computation. As a result the -gradients in LambdaMART-MAUC place more emphasis on achieving a high than a high . Table 3 shows the class-reference AUC scores for the Yahoo! V1 dataset. We observe that LambdaMART-MAUC produces better than SVM-Perf, but worse , since class 1 is much more prevalent than class 4; 48% of the documents in the training set with a positive label have a label of class 1, while only 2.5% have a label of class 4.

Finally, we note that on the large-scale Microsoft Learning to Rank Dataset MSLR-WEB10k [11], the SVM-Perf training failed to converge on a single fold after 12 hours. Therefore training a model for each class for every fold was impractical using SVM-Perf, while LambdaMART-MAUC was able to train on all five folds in less than 5 hours. This further suggests that LambdaMART-MAUC is preferable to SVM-Perf for optimizing MAUC on large ranking datasets.

Figure 2: Micro and Macro Precision@1-10 on the Yahoo! V1 test folds
LambdaMART-MAUC 0.503 0.690 0.757 0.831
SVM-Perf 0.474 0.682 0.796 0.920
Table 3: Summary of class-reference AUC scores on the Yahoo! V1 test folds

6 Conclusions

We have introduced a method for optimizing AUC on ranking datasets using a gradient-boosting framework. Specifically, we have derived gradient approximations for optimizing AUC with LambdaMART in binary and multi-class settings, and shown that the gradients are efficient to compute. The experiments show that the method performs as well as, or better than, a baseline SVM method, and performs especially well on large, multi-class datasets. In addition to adding LambdaMART to the portfolio of algorithms that can be used to optimize AUC, our extensions expand the set of IR metrics for which LambdaMART can be used.

There are several possible future directions. One is investigating local optimality of the solution produced by LambdaMART-AUC using Monte Carlo methods. Other directions include exploring LambdaMART with multiple objective functions to optimize AUC, and creating an extension to optimize area under a Precision-Recall curve rather than an ROC curve.


Thank you to Dwi Sianto Mansjur for giving helpful guidance and providing valuable comments about this paper.


  • [1] Burges, C., Shaked, T., Renshaw, E., Lazier, A., Deeds, M., Hamilton, N., Hullender, G.: Learning to rank using gradient descent. In: Proceedings of the 22Nd International Conference on Machine Learning. pp. 89–96. ICML ’05, ACM, New York, NY, USA (2005)
  • [2] Burges, C.J.: From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23–581 (2010)
  • [3] Calders, T., Jaroszewicz, S.: Efficient auc optimization for classification. Knowledge Discovery in Databases: PKDD 2007 pp. 42–53 (2007)
  • [4] Chapelle, O., Chang, Y.: Yahoo! learning to rank challenge overview (2011)
  • [5] Cortes, C., Mohri, M.: Auc optimization vs. error rate minimization. Advances in neural information processing systems 16(16), 313–320 (2004)
  • [6] Donmez, P., Svore, K., Burges, C.J.: On the optimality of lambdarank. Tech. Rep. MSR-TR-2008-179, Microsoft Research (November 2008), http://research.microsoft.com/apps/pubs/default.aspx?id=76530
  • [7] Fawcett, T.: An introduction to roc analysis. Pattern Recogn. Lett. 27(8), 861–874 (Jun 2006)
  • [8]

    Ganjisaffar, Y., Caruana, R., Lopes, C.V.: Bagging gradient-boosted trees for high precision, low variance ranking models. In: Proceedings of the 34th international ACM SIGIR conference on Research and development in Information Retrieval. pp. 85–94. ACM (2011)

  • [9] Joachims, T.: A support vector method for multivariate performance measures. In: Proceedings of the 22Nd International Conference on Machine Learning. pp. 377–384. ICML ’05, ACM, New York, NY, USA (2005)
  • [10] Manning, C., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press (2008)
  • [11] Microsoft learning to rank datasets, http://research.microsoft.com/en-us/projects/mslr/
  • [12] Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. CoRR abs/1306.2597 (2013), http://arxiv.org/abs/1306.2597
  • [13] Qin, T., Liu, T.Y., Xu, J., Li, H.: Letor: A benchmark collection for research on learning to rank for information retrieval. Inf. Retr. 13(4), 346–374 (Aug 2010), http://dx.doi.org/10.1007/s10791-009-9123-y
  • [14] Svore, K.M., Volkovs, M.N., Burges, C.J.: Learning to rank with multiple objective functions. In: Proceedings of the 20th international conference on World wide web. pp. 367–376. ACM (2011)
  • [15] Wu, Q., Burges, C.J., Svore, K.M., Gao, J.: Adapting boosting for information retrieval measures. Information Retrieval 13(3), 254–270 (2010)