Direct Learning to Rank and Rerank

Learning-to-rank techniques have proven to be extremely useful for prioritization problems, where we rank items in order of their estimated probabilities, and dedicate our limited resources to the top-ranked items. This work exposes a serious problem with the state of learning-to-rank algorithms, which is that they are based on convex proxies that lead to poor approximations. We then discuss the possibility of "exact" reranking algorithms based on mathematical programming. We prove that a relaxed version of the "exact" problem has the same optimal solution, and provide an empirical analysis.

1 Introduction

We are often faced with prioritization problems – how can we rank aircraft in order of vulnerability to failure? How can we rank patients in order of priority for treatment? When we have limited resources and need to make decisions on how to allocate them, these ranking problems become important. The quality of a ranked list is often evaluated in terms of rank statistics. The area under the receiver operator characteristic curve (AUC, Metz, 1978; Bradley, 1997), which counts pairwise comparisons, is a rank statistic, but it does not focus on the top of a ranked list, and is not a good evaluation measure if we care about prioritization problems. For prioritization problems, we would use rank statistics that focus on the top of the ranked list, such as a weighted area under the curve that focuses on the left part of the curve. Then, since we evaluate our models using these rank statistics, we should aim to optimize them out-of-sample by optimizing them in-sample. The learning-to-rank field (also called supervised ranking) is built from this fundamental idea. Learning-to-rank is a natural fit for many prioritization problems. If we are able to improve the quality of a prioritization policy by even a small amount, it can have an important practical impact. Learning-to-rank can be used to prioritize mechanical equipment for repair (e.g., airplanes, as considered by Oza et al, 2009), it could be useful for prioritizing maintenance on the power grid (Rudin et al, 2012, 2010), it could be used for ranking medical workers in order of likelihood that they accessed medical records inappropriately (as considered by Menon et al, 2013), prioritizing safety inspections or lead paint inspections in dwellings (Potash et al, 2015), ranking companies in order of likeliness of committing tax violations (see Kong and Saar-Tsechansky, 2013), ranking water pipes in order of vulnerability (as considered by Li et al, 2013), other areas of information retrieval (Xu, 2007; Cao et al, 2007; Matveeva et al, 2006; Lafferty and Zhai, 2001; Li et al, 2007) and in almost any domain where one measures the quality of results by rank statistics. Learning-to-rank algorithms have been used also in sentiment analysis (Kessler and Nicolov, 2009) (Ji et al, 2006; Collins and Koo, 2005) (Jain and Varma, 2011; Kang et al, 2011), and reverse-engineering product quality rating systems (Chang et al, 2012).

This work exposes a serious problem with the state of learning-to-rank algorithms, which is that they are based on convex proxies for rank statistics, and when these convex proxies are used, computation is faster but the quality of the solution can be poor.

We then discuss the possibility of more direct optimization of rank statistics for predictive learning-to-rank problems. In particular, we consider a strategy of ranking with a simple ranker (logistic regression for instance) which is computationally efficient, and then reranking only the candidates near the top of the ranked list with an “exact” method. The exact method does not have the shortcoming that we discussed earlier for convex proxies.

For most ranking applications, we care only about the top of the ranked list; thus, as long as we rerank enough items with the exact method, the re-ranked list is (for practical purposes) just as useful as a full ranked list would be (if we could compute it with the exact method, which would be computationally prohibitive).

The best known theoretical guarantee on ranking methods is obtained by directly optimizing the rank statistic of interest (as shown by theoretical bounds of Clemençon and Vayatis, 2008; Rudin and Schapire, 2009, for instance), hence our choice of methodology – mixed-integer programming (MIP) – for reranking in this work. Our general formulation can optimize any member of a large class of rank statistics using a single mixed-integer linear program. Specifically, we can handle (a generalization of) the large class of conditional linear rank statistics, which includes the Wilcoxon-Mann-Whitney U statistic, or equivalently the Area Under the ROC Curve, the Winner-Take-All statistic, the Discounted Cumulative Gain used in information retrieval (Järvelin and Kekäläinen, 2000), and the Mean Reciprocal Rank.

Exact learning-to-rank computations need to be performed carefully; we should not refrain from solving hard problems, but certain problems are harder than others. We provide two MIP formulations aimed at the same ranking problems. The first one works no matter what the properties of the data are. The second formulation is much faster, and is theoretically shown to produce the same quality of result as the first formulation when there are no duplicated observations. Note that if the observations are chosen from a continuous distribution then duplicated observations do not occur, with probability one.

One challenge in the exact learning-to-rank formulation is the way of handling ties in score. As it turns out, the original definition of conditional linear rank statistics can be used for the purpose of evaluation but not optimization. We show that a small change to the definition can be used for optimization.

This paper differs from our earlier technical report and non-archival conference paper (Chang et al, 2011, 2010), which were focused on solving full problems to optimality, and did not consider reranking or regularization; our exposition for the formulations closely follows this past work. The technique was used by Chang et al (2012) for the purpose of reverse engineering product rankings from rating companies that do not reveal their secret rating formula.

Section 2 of this paper introduces ranking and reranking, introduces the class of conditional linear rank statistics that we work with, and provides background on some current approximate algorithms for learning-to-rank. It also provides an example to show how rank statistics can be “washed out” when they are approximated by convex substitutes. Section 2 also discusses a major difference between approximation methods and exact methods for optimizing rank statistics, which is how to handle ties in rank. As it turns out, we cannot optimize conditional linear rank statistics without changing their definition: a tie in score needs to be counted as a mistake. Section 3 provides the two MIP formulations for ranking, and Section 4 contains a proof that the second formulation is sufficient to solve the ranking problem provided that no observations are duplicates of each other. Then follows an empirical discussion in Section 5, designed to highlight the tradeoffs in the quality of the solution outlined above. Appendix A.1 contains a MIP formulation for regularized AUC maximization, and Appendix A.2 contains a MIP formulation for a general (non-bipartite) ranking problem.

The recent works most related to ours are possibly those of Ataman et al (2006), who proposed a ranking algorithm to maximize the AUC using linear programming, and Brooks (2010), who uses a ramp loss and hard margin loss rather than a conventional hinge loss, making their method robust to outliers, within a mixed-integer programming framework. The work of Tan et al (2013) uses a non-mathematical-programming coordinate ascent approach, aiming to approximately optimize the exact ranking measures, for large scale problems. There are also algorithms for ordinal regression, which is a related but different learning problem (Li et al, 2007; Crammer et al, 2001; Herbrich et al, 1999), and listwise approaches to ranking (Cao et al, 2007; Xia et al, 2008; Xu and Li, 2007; Yue et al, 2007).

2 Learning-to-Rank and Learning-To-Rerank

We first introduce learning-to-rank, or supervised bipartite ranking. The training data are labeled observations $\{(x_i, y_i)\}_{i=1}^n$, with observations $x_i \in \mathbb{R}^d$ and labels $y_i \in \{0, 1\}$ for all $i$. The observations labeled “1” are called “positive observations,” and the observations labeled “0” are “negative observations.” There are $n_+$ positive observations and $n_-$ negative observations, with index sets $S_+$ and $S_-$. A ranking algorithm uses the training data to produce a scoring function $f: \mathbb{R}^d \to \mathbb{R}$ that assigns each observation a real-valued score. Ideally, for a set of test observations drawn from the same (unknown) distribution as the training data, $f$ should rank the observations in order of $P(y = 1 \mid x)$, and we measure the quality of the solution using “rank statistics,” or functions of the observations relative to each other. Note that bipartite ranking and binary classification are fundamentally different, and there are many works that explain the differences (e.g., Ertekin and Rudin, 2011). Briefly, classification algorithms consider a statistic of the observations relative to a decision boundary (on the order of $n$ comparisons) whereas ranking algorithms consider observations relative to each other (on the order of $n^2$ comparisons for pairwise rank statistics).

Since the evaluation of test observations uses a chosen rank statistic, the same rank statistic (or a convexified version of it) is optimized on the training set to produce $f$. Regularization is added to help with generalization. Thus, a ranking algorithm looks like:

$$\min_{f \in \mathcal{F}} \ \text{RankStatistic}(f, \{x_i, y_i\}_i) + C \cdot \text{Regularizer}(f).$$

This is the form of algorithm we consider for the reranking step.

2.1 Reranking

We are considering reranking methods, which have two ranking steps. In the first ranking step, a base algorithm is run over the training set to produce a base scoring function, and observations are rank-ordered by their base scores. A threshold is chosen, and all observations with base scores above the threshold are reranked by another ranking algorithm, which produces a second (reranking) scoring function. To evaluate the quality of the solution on the test set, each test observation is first scored by the base scoring function. Observations with base scores above the threshold are then reranked according to the reranking scoring function. The full ranking of test observations places the reranked observations at the top of the list, followed by the test observations scored only by the base scoring function.
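The two-step procedure can be sketched in a few lines of Python. This is only an illustration under assumptions: the function name `rerank`, the toy observations, and the two scoring functions are hypothetical and do not come from the paper.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]


def rerank(
    observations: List[Vector],
    base_score: Callable[[Vector], float],
    rerank_score: Callable[[Vector], float],
    threshold: float,
) -> List[int]:
    """Return observation indices ordered from the top of the list to the bottom."""
    base = [base_score(x) for x in observations]
    top = [i for i, s in enumerate(base) if s > threshold]
    rest = [i for i, s in enumerate(base) if s <= threshold]
    # Candidates above the threshold are ordered by the second scoring function.
    top.sort(key=lambda i: rerank_score(observations[i]), reverse=True)
    # Everything else keeps the order given by the base scoring function.
    rest.sort(key=lambda i: base[i], reverse=True)
    return top + rest


# Hypothetical toy data: the base scorer reads feature 0, the reranker feature 1.
obs = [(0.9, 0.1), (0.8, 0.7), (0.2, 0.9), (0.1, 0.0)]
order = rerank(obs, lambda x: x[0], lambda x: x[1], threshold=0.5)
print(order)
```

Only the two observations whose base score exceeds the threshold are touched by the second scorer, which is what keeps the expensive step small.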

2.2 Rank Statistics

We will extend the definition of conditional linear rank statistics (Clemençon and Vayatis, 2008; Clemençon et al, 2008) to include various definitions of rank. For now, we assume that there are no ties in score for any pair of observations, but we will heavily discuss ties later, and extend this definition to include rank definitions when there are ties. For the purpose of this section, the rank is currently defined so that the top of the list has the highest ranks, and all ranks are unique. The rank of an observation is the number of observations with scores at or beneath it:

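$$\text{Rank}(f(x_i)) = \sum_{t=1}^{n} \mathbf{1}[f(x_t) \le f(x_i)].$$

Thus, ranks can range from 1 at the bottom to $n$ at the top. A conditional linear rank statistic (CLRS) created from scoring function $f$ is of the form

$$\text{CLRS}(f) = \sum_{i=1}^{n} \mathbf{1}[y_i = 1]\, \phi(\text{Rank}(f(x_i))).$$

Here $\phi$ is a non-decreasing function producing only non-negative values. Without loss of generality, we define $a_\ell := \phi(\ell)$, the contribution to the score if the observation with rank $\ell$ has label +1. By properties of $\phi$, we know $0 \le a_1 \le a_2 \le \cdots \le a_n$. Then

$$\text{CLRS}(f) = \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{Rank}(f(x_i)) = \ell] \cdot a_\ell. \tag{1}$$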

This class captures a broad collection of rank statistics, including the following well-known rank statistics:

• $a_\ell = \ell$: Wilcoxon Rank Sum (WRS) statistic, which is an affine function of the Area Under the Receiver Operator Characteristic Curve (AUC) when there are no ties in rank (that is, $f$ such that $f(x_i) \ne f(x_k)$ for all $i \ne k$).

$$\text{WRS}(f) = \sum_{i \in S_+} \text{Rank}(f(x_i)) = n_+ n_- \cdot \text{AUC}(f) + \frac{n_+(n_+ + 1)}{2}.$$

If ties are present, we would subtract the number of ties within the positive class from the right side of the equation above. The AUC is the fraction of correctly ranked positive-negative pairs:

$$\text{AUC}(f) = \frac{1}{n_+ n_-} \sum_{i \in S_+} \sum_{k \in S_-} \mathbf{1}[f(x_k) < f(x_i)].$$

The AUC, when multiplied by the constant $n_+ n_-$, is the Mann-Whitney U statistic. The AUC has an affine relationship with the pairwise misranking error (the fraction of positive-negative pairs in which a positive is ranked at or below a negative):

$$\text{PairwiseMisrankingError}(f) = 1 - \text{AUC}(f) = \frac{1}{n_+ n_-} \sum_{i \in S_+} \sum_{k \in S_-} \mathbf{1}[f(x_k) \ge f(x_i)]. \tag{2}$$

Some ranking algorithms are designed to approximately minimize the pairwise misranking error, e.g., RankBoost (Freund et al, 2003).

• $a_\ell = \ell \cdot \mathbf{1}[\ell \ge \theta]$ for predetermined threshold $\theta$: Related to the local AUC or partial AUC, which looks at the area under the leftmost part of the ROC curve only (Clemençon and Vayatis, 2007, 2008; Dodd and Pepe, 2003). The leftmost part of the ROC curve is the top portion of the ranked list. The top of the list is the most important in applications such as information retrieval and maintenance.

• $a_\ell = \mathbf{1}[\ell = n]$: Winner Takes All (WTA), which is 1 when the top observation in the list is positively-labeled and 0 otherwise (Burges et al, 2006).

• $a_\ell = \frac{1}{n - \ell + 1}$: Mean Reciprocal Rank (MRR) (Burges et al, 2006).

• $a_\ell = \frac{1}{\log_2(n - \ell + 2)}$: Discounted Cumulative Gain (DCG), which is used in information retrieval (Järvelin and Kekäläinen, 2000).

• $a_\ell = \frac{1}{\log_2(n - \ell + 2)} \cdot \mathbf{1}[\ell \ge n - N + 1]$: DCG@N, which cuts off the DCG after the top N. (See, for instance, Le et al, 2010).

• $a_\ell = \ell^p$ for some $p > 0$: Similar to the $P$-Norm Push, which uses $\ell_p$ norms to focus on the top of the list, the same way as an $\ell_p$ norm focuses on the largest elements of a vector (Rudin, 2009a).
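To make the list above concrete, here is a minimal sketch that evaluates a few of these statistics on a toy ranked list and checks the affine relationship between the WRS and the AUC. The weight functions use the standard forms under the convention that rank $n$ is the top of the list; they are assumptions wherever the text does not spell out $a_\ell$, and the data and function names are hypothetical.

```python
import math
from typing import Callable, Dict, List


def ranks(scores: List[float]) -> List[int]:
    """Rank = number of observations scoring at or below this one (1 = bottom, n = top)."""
    return [sum(1 for t in scores if t <= s) for s in scores]


def clrs(scores: List[float], labels: List[int], a: Callable[[int, int], float]) -> float:
    """Sum a(rank, n) over positively labeled observations; assumes no tied scores."""
    n = len(scores)
    return sum(a(r, n) for r, y in zip(ranks(scores), labels) if y == 1)


def auc(scores: List[float], labels: List[int]) -> float:
    """Fraction of positive-negative pairs with the positive scored strictly higher."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return sum(1 for p in pos for q in neg if q < p) / (len(pos) * len(neg))


# Assumed weight functions a(l, n), with rank n at the top of the list.
WEIGHTS: Dict[str, Callable[[int, int], float]] = {
    "WRS": lambda l, n: float(l),
    "WTA": lambda l, n: 1.0 if l == n else 0.0,
    "MRR": lambda l, n: 1.0 / (n - l + 1),
    "DCG": lambda l, n: 1.0 / math.log2(n - l + 2),
}

scores = [2.5, 1.7, 0.3, -0.4, -1.2]
labels = [1, 0, 1, 1, 0]
stats = {name: clrs(scores, labels, a) for name, a in WEIGHTS.items()}
n_pos, n_neg = labels.count(1), labels.count(0)
wrs_from_auc = n_pos * n_neg * auc(scores, labels) + n_pos * (n_pos + 1) / 2
print(stats, wrs_from_auc)
```

The same `clrs` routine serves every statistic; only the weight function changes, which is the property the single MIP formulation later relies on.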

Rank statistics have been studied in several theoretical papers (e.g., Wang et al, 2013).

2.3 Some Known Methods for Learning-To-Rank

Current methods for learning-to-rank optimize convex proxies for the rank statistics discussed above. RankBoost (Freund et al, 2003) uses the exponential loss function as an upper bound for the 0-1 loss within the misranking error, $\mathbf{1}[z \le 0] \le e^{-z}$, and minimizes

$$\sum_{i \in S_+} \sum_{k \in S_-} e^{-(f(x_i) - f(x_k))}, \tag{3}$$

whereas support vector machine ranking algorithms (e.g., Joachims, 2002; Herbrich et al, 2000; Shen and Joshi, 2003) use the hinge loss $\max\{0, 1 - z\}$, that is:

$$\sum_{i \in S_+} \sum_{k \in S_-} \max\{0, 1 - (f(x_i) - f(x_k))\} + C \|f\|_K^2, \tag{4}$$

where the regularization term is a reproducing kernel Hilbert space norm. Other ranking algorithms include RankProp and RankNet (Caruana et al, 1996; Burges et al, 2005).
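As a small illustration of how the surrogates in (3) and (4) relate to the pairwise misranking count, the sketch below sums all three quantities over positive-negative pairs. The scores are hypothetical, the regularizer of (4) is left out, and the function name is an assumption.

```python
import math
from typing import List, Tuple


def pairwise_losses(pos: List[float], neg: List[float]) -> Tuple[float, float, float]:
    """Return (misranks, exponential loss, hinge loss) summed over positive-negative pairs."""
    misranks = exp_loss = hinge = 0.0
    for p in pos:
        for q in neg:
            z = p - q                        # margin f(x_i) - f(x_k)
            misranks += 1.0 if z <= 0 else 0.0
            exp_loss += math.exp(-z)         # summand of (3)
            hinge += max(0.0, 1.0 - z)       # summand of (4), without the regularizer
    return misranks, exp_loss, hinge


pos_scores = [2.0, 0.5, -1.0]
neg_scores = [1.0, 0.0]
m, e, h = pairwise_losses(pos_scores, neg_scores)
print(m, e, h)
```

Both surrogates upper-bound the misrank count pair by pair, but they weight a badly misranked pair very differently (exponentially versus linearly in the margin), which is the source of the disagreement discussed in the next subsection.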

We note that the class of CLRS includes a very wide range of rank statistics, some of which concentrate on the top of the list (e.g., DCG) and some that do not (e.g., WRS), and it is not clear which conditional linear rank statistics (if any) from the CLRS are close to the convexified loss functions (3) and (4).

Since the convexified loss functions do not necessarily represent the rank statistics of interest, it is not even necessarily true that an algorithm for ranking will perform better for ranking than an algorithm designed for classification; in fact, AdaBoost and RankBoost provably perform equally well for ranking under fairly general circumstances (Rudin and Schapire, 2009). Ertekin and Rudin (2011) provide a discussion and comparison of classification versus ranking methods. Ranking algorithms ultimately aim to put the observations in order of $P(y = 1 \mid x)$, and so do some classification algorithms such as logistic regression. Thus, one might consider using logistic regression for ranking (e.g., Cooper et al, 1994; Fine et al, 1997; Perlich et al, 2003). Logistic regression minimizes:

$$\sum_{i=1}^{n} \ln\left(1 + e^{-y_i f(x_i)}\right). \tag{5}$$

This loss function does not closely resemble the AUC. On the other hand, it is surprising how common it is within the literature to use logistic regression to produce a predictive model, and yet evaluate the quality of the learned model using AUC.

Since RankBoost, RankProp, RankNet, etc., do not directly optimize any CLRS, they do not have the problem with ties in score that we will find when we directly try to optimize a CLRS.

2.4 Why Learning-To-Rank Methods Can Fail

We prove that the exponential loss and other common loss functions may yield poor results for some rank statistics.

Theorem 2.1

There is a simple one-dimensional dataset for which there exist two ranked lists (called Solution 1 and Solution 2) that are completely reversed from each other (the top of one list is the bottom of the other and vice versa) such that the WRS (the AUC), partial AUC@100, DCG, MRR and hinge loss prefer Solution 1, whereas the DCG@100, partialAUC@10 and exponential loss all prefer Solution 2.

The proof is by construction. Along the single dimension $x$, the dataset has 10 negatives near $x=3$, then 3000 positives near $x=1$, then 3000 negatives near $x=0$, and finally 80 positives in a fourth clump that lies below the negatives near $x=0$. We generated each of the four clumps of points with a standard deviation of 0.05 just so that there would not be ties in score. Figure 1 shows data drawn from the distribution, where for display purposes we spread the points along the horizontal axis, but the vertical axis is the only one that matters: one ranked list goes from top to bottom (Solution 1) and the other goes from bottom to top (Solution 2).

The bigger clumps are designed to dominate rank statistics that do not decay (or decay slowly) down the list, like the WRS. The smaller clumps are designed to dominate rank statistics that concentrate on the top of the list, like the partial WRS or partial DCG.

This theorem means that using the exponential loss to approximate the AUC, as RankBoost does, could give the complete opposite of the desired result. It also means that using the hinge loss to approximate the partial DCG or partial AUC could yield a completely wrong result. Further, the fact that the exponential loss and hinge loss behave differently also suggests that convex losses can behave quite differently from the underlying rank statistics that they are meant to approximate. Another way to say this is that the convexification “washes out” the differences between rank statistics. If we were to optimize the rank statistic of interest directly, the problem discussed above would vanish.
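The disagreement in Theorem 2.1 can be checked numerically at the level of clumps. The sketch below is not the authors' experiment: it drops the within-clump noise, scores each clump at its center, and, most importantly, assumes that the fourth clump sits at $x = -10$, a value that is not stated in the text above. Partial AUC is not reproduced here.

```python
import math

# (center on the x axis, count, label); within-clump noise is omitted.
# ASSUMPTION: the center of the fourth clump (-10.0) is not given in the text.
CLUMPS = [(3.0, 10, 0), (1.0, 3000, 1), (0.0, 3000, 0), (-10.0, 80, 1)]


def pair_stats(sign: float):
    """Score clumps by f(x) = sign * x; return (AUC, exponential loss, hinge loss)."""
    correct = exp_loss = hinge = 0.0
    for xp, cp, yp in CLUMPS:
        for xn, cn, yn in CLUMPS:
            if yp == 1 and yn == 0:
                z = sign * (xp - xn)          # margin of a positive over a negative
                pairs = cp * cn
                correct += pairs if z > 0 else 0
                exp_loss += pairs * math.exp(-z)
                hinge += pairs * max(0.0, 1.0 - z)
    n_pos = sum(c for _, c, y in CLUMPS if y == 1)
    n_neg = sum(c for _, c, y in CLUMPS if y == 0)
    return correct / (n_pos * n_neg), exp_loss, hinge


def dcg_at(sign: float, cutoff: int) -> float:
    """DCG over the top `cutoff` positions of the list induced by f(x) = sign * x."""
    ordered = sorted(CLUMPS, key=lambda c: sign * c[0], reverse=True)
    total, position = 0.0, 1
    for _, count, label in ordered:
        for _ in range(count):
            if position > cutoff:
                return total
            if label == 1:
                total += 1.0 / math.log2(position + 1)
            position += 1
    return total


auc1, exp1, hinge1 = pair_stats(+1.0)   # Solution 1: larger x ranked higher
auc2, exp2, hinge2 = pair_stats(-1.0)   # Solution 2: the reversed list
n_total = sum(c for _, c, _ in CLUMPS)
print(auc1, auc2, exp1, exp2, hinge1, hinge2)
print(dcg_at(+1.0, 100), dcg_at(-1.0, 100), dcg_at(+1.0, n_total), dcg_at(-1.0, n_total))
```

Under these assumptions the AUC, hinge loss, and full DCG favor Solution 1, while the exponential loss and DCG@100 favor Solution 2, in line with the statement of the theorem.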

It is not surprising that rank statistics can behave quite differently on the same dataset. Rank statistics are very different from classification statistics. Rank statistics consider every pair of observations relative to each other, so even small changes in a scoring function $f$ can lead to large changes in a rank statistic. Classification is different – observations are considered relative only to a decision boundary.

The example considered in this section also illustrates why arguments about consistency (or lack thereof) of ranking methods (e.g., Kotlowski et al, 2011) are not generally relevant for practice. Sometimes these arguments rely on incorrect assumptions about the class of models used for ranking with respect to the underlying distribution of the data. These arguments also depend on how the modeler is assumed to “change” this class as the sample size increases to infinity. The tightest bounds available for limited function classes and for finite data are those from statistical learning theory. Those bounds support optimizing rank statistics.

To optimize rank statistics, there is a need for more refined models; however, this refinement comes at a computational cost of solving a harder problem. This thought has been considered in several previous works on learning-to-rank (Le et al, 2010; Ertekin and Rudin, 2011; Tan et al, 2013; Chakrabarti et al, 2008; Qin et al, 2013).

2.5 Most Learning-To-Rank Methods Have The Problem Discussed Above

The class of CLRS includes a very wide range of rank statistics, some of which concentrate on the top of the list (e.g., DCG) and some that do not (e.g., WRS), and it is not clear which conditional linear rank statistics (if any) from the CLRS are close to the convexified loss functions of the ranking algorithms. RankBoost is not the only algorithm where problems can occur; they can also occur for support vector machine ranking algorithms (e.g., Joachims, 2002; Herbrich et al, 2000) and algorithms like RankProp and RankNet (Caruana et al, 1996; Burges et al, 2005). The methods of Ataman et al (2006), Brooks (2010), and Tan et al (2013) use linear relaxations or greedy methods for learning to rank, rather than exact reranking, and will have similar issues; if one optimizes the wrong rank statistic, one may not achieve the correct answer. Logistic regression is commonly used for ranking, yet the logistic loss (5) that it minimizes does not closely resemble the AUC. On the other hand, it is surprising how common it is to use logistic regression to produce a predictive model, and yet evaluate the quality of the model using AUC.

The fundamental premise of learning-to-rank is that better test performance can be achieved by optimizing the performance measure (a rank statistic) on the training set. This means that one should choose to optimize differently for each rank statistic. However, when the same convex substitute is used in practice to approximate a variety of rank statistics, it directly undermines this fundamental premise, and could compromise the quality of the solution. If convexified rank statistics are a reasonable substitute for rank statistics, we would expect to see that (i) the rank statistics are reasonably approximated by their convexified versions, and (ii) if we consider several convex proxies for the same rank statistic (in this case AUC), then they should all behave very similarly to each other, and similarly to the true (non-convexified) AUC. However, as we discussed, neither of these is true.

2.6 Ties Are Problematic, Thus Use ResolvedRank and Subrank

Dealing with ties in rank is critical when directly optimizing rank statistics. If a tie in rank between a positive and negative is considered as correct, then an optimal learning algorithm would produce the trivial scoring function $f(x) = \text{constant}$ for all $x$; this solution would unfortunately attain the highest possible score when optimizing any pairwise rank statistic. This problem happens, for instance, with the definition of Clemençon and Vayatis (2008), that is:

$$\text{Rank}_{CV}(f(x_i)) = \sum_{k=1}^{n} \mathbf{1}[f(x_k) \le f(x_i)],$$

which counts ties in score as correct. Using this definition for rank in the CLRS:

$$\text{CLRS}_{CV}(f) = \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{Rank}_{CV}(f(x_i)) = \ell] \cdot a_\ell, \tag{6}$$

we find that optimizing CLRS directly yields the trivial solution that all observations get the same score. So this definition of rank should not be used.
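A tiny numerical check makes the failure visible. In this sketch (toy data and function names are illustrative assumptions), a constant scoring function attains a strictly higher value of (6) than a scoring function that places every positive above every negative.

```python
from typing import List, Sequence


def rank_cv(scores: Sequence[float]) -> List[int]:
    """Rank that counts ties as correct: number of observations scoring at or below."""
    return [sum(1 for t in scores if t <= s) for s in scores]


def clrs_cv(scores: Sequence[float], labels: Sequence[int], a: Sequence[float]) -> float:
    """CLRS as in (6); a[l - 1] holds a_l, and a is non-decreasing."""
    return sum(a[r - 1] for r, y in zip(rank_cv(scores), labels) if y == 1)


labels = [1, 0, 1, 0, 1]
a = [1, 2, 3, 4, 5]                      # a_l = l, the WRS weights
perfect = [5.0, 1.0, 4.0, 0.0, 3.0]      # every positive above every negative
constant = [0.0] * 5                     # the trivial scoring function

value_perfect = clrs_cv(perfect, labels, a)
value_constant = clrs_cv(constant, labels, a)
print(value_perfect, value_constant)
```

With all scores tied, every observation receives rank $n$, so each positive collects the largest weight $a_n$.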

We need to encourage our ranking algorithm not to produce ties in score, and thus in rank. To do this, we pessimistically consider a tie between a positive and a negative as a misrank. We will use two definitions of rank within the CLRS – ResolvedRanks and Subranks. For ResolvedRanks, when negatives are tied with positives, we force the negatives to be higher ranked. For Subranks, we do not force this, but when we optimize the CLRS, we will prove that ties are resolved this way anyway.

The assignment of ResolvedRanks and Subranks is not unique; there can be multiple ways to assign ResolvedRanks or Subranks for a set of observations.

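We define the Subrank by the following formula:

$$\text{Subrank}(f(x_i)) = \sum_{k=1}^{n} \mathbf{1}[f(x_k) < f(x_i)].$$

The Subrank of observation $i$ is the number of observations that score strictly below it. Subranks range from 0 to $n-1$ and the CLRS becomes:

$$\text{CLRS}_{\text{Subrank}}(f) = \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{Subrank}(f(x_i)) = \ell - 1] \cdot a_\ell. \tag{7}$$

Observations with equal score have tied Subranks.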

ResolvedRanks are defined as follows, where the tied ranks are resolved pessimistically. ResolvedRanks are assigned so that:

1. The ResolvedRank of an observation is greater than or equal to its Subrank.

2. If a positive observation and a negative observation have the same score, then the negative observation gets a higher ResolvedRank.

3. Each possible ResolvedRank, 0 through $n-1$, is assigned to exactly one observation.

The Subranks and ResolvedRanks are equal to each other when there are no ties in score. We provide one possible assignment of Subranks and ResolvedRanks in Figure 2 to demonstrate the treatment of ties. We then have the CLRS with ResolvedRanks as:

$$\text{CLRS}_{\text{ResolvedRank}}(f) = \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{ResolvedRank}(f(x_i)) = \ell - 1] \cdot a_\ell. \tag{8}$$

The ResolvedRanks are the quantity of interest, as optimizing them will provide a scoring function with minimal misranks and minimal ties between positives and negatives.
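The two rank definitions can be sketched as follows. This is one possible assignment, not the only valid one: ties between a positive and a negative are broken pessimistically as required, while ties within a class are broken by index, which is an assumption made here for determinism. The data are hypothetical.

```python
from typing import List, Sequence


def subranks(scores: Sequence[float]) -> List[int]:
    """Number of observations scoring strictly below each observation."""
    return [sum(1 for t in scores if t < s) for s in scores]


def resolved_ranks(scores: Sequence[float], labels: Sequence[int]) -> List[int]:
    """One valid pessimistic assignment of the ranks 0..n-1.

    Among tied scores, positives are placed below negatives; within a class,
    the observation with the smaller index gets the lower rank.
    """
    n = len(scores)
    # Sort from bottom to top: lower score first, then positives before negatives.
    order = sorted(range(n), key=lambda i: (scores[i], -labels[i], i))
    result = [0] * n
    for rank, i in enumerate(order):
        result[i] = rank
    return result


scores = [0.7, 0.7, 0.2, 0.7, 0.2]
labels = [1, 0, 0, 1, 1]
sub = subranks(scores)
res = resolved_ranks(scores, labels)
print(sub, res)
```

In the output, every ResolvedRank is at least the corresponding Subrank, each rank from 0 to $n-1$ is used once, and a negative tied with a positive always ends up higher.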

Note that ties are not fundamental in other statistical uses of rank statistics, such as hypothesis testing. Ties are usually addressed by fixing them, or assigning the tied observations a (possibly fractional) rank that is the average (e.g., tied observations would get ranks 7.5 rather than 7 and 8) (see Tamhane and Dunlop, 2000; Wackerly et al, 2002). Ties are not treated uniformly across statistical applications (Savage, 1957), and there has been comparative work on treatment of ties (e.g., Putter, 1955). This differs from when we optimize rank statistics, where ties are of central importance as we discussed.

3 Reranking Formulations Using ResolvedRanks and Subranks

Here we produce the two formulations – one for optimizing the regularized CLRS with ResolvedRanks, and the other for optimizing the regularized CLRS with Subranks.

3.1 Maximize the Regularized CLRS with ResolvedRanks

We would like to optimize the general CLRS, for any choices of the $a_\ell$’s, where we want to penalize ties in rank between positives and negatives, and we would also like a full ranking of observations. Thus, we will directly optimize $\text{CLRS}_{\text{ResolvedRank}}$ for our reranking algorithm. Our hypothesis space is the space of linear scoring functions $f_w(x) = w^T x$, where $w \in \mathbb{R}^d$.

$$\max_{w \in \mathbb{R}^d} \ \text{CLRS}_{\text{ResolvedRank}}(w) - C\|w\|_0 = \max_{w \in \mathbb{R}^d} \ \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{ResolvedRank}(f_w(x_i)) = \ell - 1] \cdot a_\ell - C\|w\|_0.$$

Nonlinearities can be incorporated as usual by including additional variables, such as indicator variables or nonlinear functions of the original variables. We optimize over choices for vector $w$.

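Building up to the formulation, we will create the binary variable $t_{i\ell}$ so that it is 1 for $\ell \le \text{ResolvedRank}(f(x_i)) + 1$ and 0 otherwise. That is, if observation $x_i$ has ResolvedRank equal to 5, then $t_{i1}, \dots, t_{i6}$ are all 1 and $t_{i7}, \dots, t_{in}$ are 0. Then

$$\sum_{\ell=1}^{n} (a_\ell - a_{\ell-1})\, t_{i\ell} \tag{9}$$

is a telescoping sum for ResolvedRank. When we define $a_0 := 0$, the sum (9) becomes simply $a_{\text{ResolvedRank}(f(x_i)) + 1}$, or equivalently, the term from (8):

$$\sum_{\ell=1}^{n} \mathbf{1}[\text{ResolvedRank}(f(x_i)) = \ell - 1] \cdot a_\ell.$$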

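As in (8) we multiply by $y_i$ and sum over observations to produce the CLRS. Doing this to (9), CLRS becomes:

$$\sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} (a_\ell - a_{\ell-1})\, t_{i\ell}, \quad \text{where } a_0 = 0.$$

By definition $t_{i1} = 1$ for all $i$, so we can simplify the CLRS function above to:

$$\sum_{i \in S_+} \left( \sum_{\ell=2}^{n} (a_\ell - a_{\ell-1})\, t_{i\ell} + a_1 \right) = |S_+|\, a_1 + \sum_{i \in S_+} \sum_{\ell=2}^{n} (a_\ell - a_{\ell-1})\, t_{i\ell}.$$

Note that the differences $a_\ell - a_{\ell-1}$ are all nonnegative. When they are zero they cannot contribute to the CLRS function. When they are strictly positive there can be a contribution made to the CLRS function. Thus, we introduce notation $S_r = \{\ell \ge 2 : a_\ell - a_{\ell-1} > 0\}$ and $\tilde{a}_\ell = a_\ell - a_{\ell-1}$, which are used in both formulations below. The CLRS becomes:

$$|S_+|\, a_1 + \sum_{i \in S_+} \sum_{\ell \in S_r} \tilde{a}_\ell\, t_{i\ell}. \tag{10}$$

We will maximize this, which means that the $t_{i\ell}$’s will be set to 1 when possible, because the $\tilde{a}_\ell$’s in the sum are all positive. When we maximize, we do not need the constant term.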
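The telescoping argument is easy to verify numerically. The following sketch assumes the indicator convention that $t_{i\ell} = 1$ exactly when $\ell$ is at most the ResolvedRank plus one; the weights and helper names are hypothetical.

```python
from typing import List, Sequence


def t_indicators(resolved_rank: int, n: int) -> List[int]:
    """t[l - 1] = 1 when l <= resolved_rank + 1, else 0, for l = 1..n."""
    return [1 if l <= resolved_rank + 1 else 0 for l in range(1, n + 1)]


def telescoped(a: Sequence[float], t: Sequence[int]) -> float:
    """Evaluate sum_l (a_l - a_{l-1}) * t_l with a_0 = 0; a[l - 1] holds a_l."""
    padded = [0.0] + list(a)
    return sum((padded[l] - padded[l - 1]) * t[l - 1] for l in range(1, len(padded)))


n = 8
a = [0.0, 0.0, 0.5, 0.5, 1.0, 2.0, 2.0, 4.0]   # non-decreasing, non-negative weights
t = t_indicators(5, n)                          # observation with ResolvedRank 5
value = telescoped(a, t)

# Indices l >= 2 where the weight strictly increases; only these can contribute.
s_r = [l for l in range(2, n + 1) if a[l - 1] > a[l - 2]]
print(t, value, s_r)
```

Here the sum collapses to $a_6$, the weight attached to ResolvedRank 5, and only the indices in `s_r` carry nonzero coefficients, which is why the formulations need $t_{i\ell}$ only for those $\ell$.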

We define integer variables $r_i$ to represent the ResolvedRanks of the observations. Variables $r_i$ and $t_{i\ell}$ are related in that $t_{i\ell}$ can only be 1 when $\ell \le r_i + 1$, implying $t_{i\ell} \le \frac{r_i}{\ell - 1}$.

We use linear scoring functions, so the score of instance $x_i$ is $w^T x_i$. Variables $z_{ik}$ are indicators of whether the score of observation $x_i$ is above the score of observation $x_k$. Thus we want to have $z_{ik} = 1$ if $w^T x_i > w^T x_k$ and $z_{ik} = 0$ otherwise. Beyond this we want to ensure no ties in score, so we want all scores to be at least $\varepsilon$ apart. This will be discussed further momentarily.

Our first ranking algorithm is below, which maximizes the regularized CLRS using ResolvedRanks.

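$$
\begin{aligned}
&\operatorname*{argmax}_{w,\,\gamma_j,\,z_{ik},\,t_{i\ell},\,r_i\ \forall i,k,\ell,j} \ \sum_{i \in S_+} \sum_{\ell \in S_r} \tilde{a}_\ell\, t_{i\ell} - C \sum_j \gamma_j \quad \text{subject to} && (11)\\
&z_{ik} \le w^T(x_i - x_k) + 1 - \varepsilon, \quad \forall i,k = 1,\dots,n, && (12)\\
&z_{ik} \ge w^T(x_i - x_k), \quad \forall i,k = 1,\dots,n, && (13)\\
&\gamma_j \ge w_j, && (14)\\
&\gamma_j \ge -w_j, && (15)\\
&r_i - r_k \ge 1 + n(z_{ik} - 1), \quad \forall i,k = 1,\dots,n, && (16)\\
&r_k - r_i \ge 1 - n z_{ik}, \quad \forall i \in S_+,\ k \in S_-, && (17)\\
&r_k - r_i \ge 1 - n z_{ik}, \quad \forall i,k \in S_+,\ i < k, && (18)\\
&r_k - r_i \ge 1 - n z_{ik}, \quad \forall i,k \in S_-,\ i < k, && (19)\\
&t_{i\ell} \le \frac{r_i}{\ell - 1}, \quad \forall i \in S_+,\ \ell \in S_r, && (20)\\
&-1 \le w_j \le 1, \quad \forall j = 1,\dots,d, \\
&r_i \in \{0, \dots, n-1\}, \quad t_{i\ell}, z_{ik}, \gamma_j \in \{0, 1\}.
\end{aligned}
$$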

To ensure that solutions with scores that are close together are not feasible, Constraint (12) forces $z_{ik} = 0$ if $w^T(x_i - x_k) < \varepsilon$, and Constraint (13) forces $z_{ik} = 1$ if $w^T(x_i - x_k) > 0$. Thus, a solution where any two observations have a score difference above 0 and less than $\varepsilon$ is not feasible. (Note that these constraints alone do not prevent a score difference of exactly 0; for that we need the constraints that follow.) Constraints (14) and (15) define the $\gamma_j$’s to be indicators of nonzero coefficients $w_j$.

Constraints (16)-(19) are the “tie resolution” equations. Constraint (16) says that for any pair $(x_i, x_k)$, if the score of $x_i$ is larger than that of $x_k$ so that $z_{ik} = 1$, then $r_i \ge r_k + 1$. That handles the assignment of ranks when there are no ties, so now we need only to resolve ties in the score. We have Constraint (17) that applies to positive-negative pairs: when the pair is tied, this constraint forces the negative observation to have higher rank. Similarly, Constraints (18) and (19) apply to positive-positive pairs and negative-negative pairs respectively, and state that ties are broken lexicographically, that is, according to their index in the dataset.

We discussed Constraint (20) earlier, which provides the definition of $t_{i\ell}$ so that $t_{i\ell} = 0$ whenever $\ell > r_i + 1$. Also we force the $w_j$’s to be between -1 and 1 so their values do not go to infinity and so that the $\varepsilon$ values are meaningful, in that they can be considered relative to the maximum possible range of $w_j$.

3.2 Maximize the Regularized CLRS with Subranks

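We are solving:

$$\max_{w \in \mathbb{R}^d} \ \text{CLRS}_{\text{Subrank}}(w) - C\|w\|_0 = \max_{w \in \mathbb{R}^d} \ \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbf{1}[\text{Subrank}(f_w(x_i)) = \ell - 1] \cdot a_\ell - C\|w\|_0. \tag{21}$$

Maximizing the Subrank problem is much easier, since we do not want to force a unique assignment of ranks. This means the “tie resolution” equations are no longer present. We can directly assign a Subrank for observation $x_i$ by $\sum_{k=1}^{n} z_{ik}$ because it is exactly the count of observations ranked beneath observation $x_i$; that way the $r_i$ variables do not even need to appear in the formulation.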

Here is the formulation:

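$$
\begin{aligned}
&\operatorname*{argmax}_{w,\,\gamma_j,\,z_{ik},\,t_{i\ell}\ \forall i,k,\ell,j} \ \sum_{i \in S_+} \sum_{\ell \in S_r} \tilde{a}_\ell\, t_{i\ell} - C \sum_j \gamma_j \quad \text{subject to} && (22)\\
&t_{i\ell} \le \frac{1}{\ell - 1} \sum_{k=1}^{n} z_{ik}, \quad \forall i \in S_+,\ \ell \in S_r, && (23)\\
&z_{ik} \le w^T(x_i - x_k) + 1 - \varepsilon, \quad \forall i \in S_+,\ k = 1,\dots,n, && (24)\\
&\gamma_j \ge w_j, && (25)\\
&\gamma_j \ge -w_j, && (26)\\
&z_{ik} + z_{ki} = \mathbf{1}[x_i \ne x_k], \quad \forall i,k \in S_+, && (27)\\
&t_{i\ell} \ge t_{i,\ell+1}, \quad \forall i \in S_+,\ \ell \in S_r \setminus \{\max S_r\}, && (28)\\
&\sum_{i \in S_+} \sum_{\ell \in S_r} \tilde{a}_\ell\, t_{i\ell} \le \sum_{\ell=1}^{n} a_\ell, && (29)\\
&z_{ik} = 0, \quad \forall i \in S_+,\ k = 1,\dots,n,\ x_i = x_k, && (30)\\
&-1 \le w_j \le 1, \quad \forall j = 1,\dots,d, \\
&t_{i\ell}, z_{ik}, \gamma_j \in \{0, 1\}, \quad \forall i \in S_+,\ \ell \in S_r,\ k = 1,\dots,n,\ j \in \{1,\dots,d\}.
\end{aligned}
$$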

Constraint (23) is similar to Constraint (20) from the ResolvedRank formulation. Since we are maximizing with respect to the $t_{i\ell}$’s, the $z_{ik}$’s will naturally be maximized by Constraint (23). Thus we need to again force the $z_{ik}$’s down to 0 when $w^T(x_i - x_k) < \varepsilon$, which is done via Constraint (24). Constraints (25) and (26) define the $\gamma_j$’s to be indicators of nonzero coefficients $w_j$. It is not necessary to include Constraints (27) through (30); they are there only to speed up computation, by helping to make the linear relaxation of the integer program closer to the set of feasible integer points. For the experiments in this paper they did not substantially speed up computation and we chose not to use them.
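To see what the constraints of the Subrank formulation imply, it can help to fix $w$ and compute the best objective value that the remaining binary variables allow. The sketch below does this directly, without a MIP solver; it is an illustration under assumptions (the function name, toy data, and the values of $C$ and $\varepsilon$ are hypothetical), and it is not a substitute for solving the program over $w$.

```python
from typing import Sequence


def subrank_objective(
    w: Sequence[float],
    X: Sequence[Sequence[float]],
    y: Sequence[int],
    a: Sequence[float],
    C: float,
    eps: float = 1e-6,
) -> float:
    """Objective (22) at the best feasible z, t, gamma for a fixed weight vector w."""
    n = len(X)
    score = [sum(wj * xj for wj, xj in zip(w, x)) for x in X]
    # Differences a_l - a_{l-1} for l >= 2, kept only where strictly positive (S_r).
    a_tilde = {l: a[l - 1] - a[l - 2] for l in range(2, n + 1) if a[l - 1] > a[l - 2]}
    total = 0.0
    for i in range(n):
        if y[i] != 1:
            continue
        # Constraint (24): z_ik may be 1 only if the scores differ by at least eps.
        z_sum = sum(1 for k in range(n) if score[i] - score[k] >= eps)
        # Constraint (23): t_il may be 1 only if l - 1 <= sum_k z_ik.
        total += sum(d for l, d in a_tilde.items() if l - 1 <= z_sum)
    # Constraints (25)-(26): gamma_j must be 1 whenever w_j is nonzero.
    return total - C * sum(1 for wj in w if wj != 0)


X = [(1.0, 0.0), (0.5, 1.0), (0.2, 0.3), (0.0, 0.8)]
y = [1, 0, 1, 0]
a = [1.0, 2.0, 3.0, 4.0]                 # a_l = l
obj_first = subrank_objective((1.0, 0.0), X, y, a, C=0.1)
obj_second = subrank_objective((0.0, 1.0), X, y, a, C=0.1)
print(obj_first, obj_second)
```

For the first weight vector the positives have Subranks 3 and 1, so the CLRS with Subranks is $a_4 + a_2 = 6$; the objective drops the constant $|S_+| a_1 = 2$ and subtracts the penalty for one nonzero coefficient.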

Beyond the formulations presented here, we have placed a formulation for optimizing the regularized AUC in the Appendix A.1 and another formulation for optimizing the general pairwise rank statistic that inspired RankBoost (Freund et al, 2003) in Appendix A.2.

4 Why Subranks Are Often Sufficient

The ResolvedRank formulation above has variables, which is the total number of , , , , and variables. The Subrank formulation on the other hand has only variables, since we only have $w$, $\gamma_j$, $z_{ik}$, and $t_{i\ell}$. This difference in the number of variables can heavily influence the speed at which we are able to find a solution. We would ultimately like to get away with solving the Subrank problem rather than the ResolvedRank problem. This would allow us to scale up our reranking problem substantially. In this section we will show why this is generally possible.

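Denote the objectives as follows, where we have $f(x) = w^T x$.

$$
\begin{aligned}
G_{RR}(f) &:= \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbb{1}[\text{ResolvedRank}(f(x_i)) = \ell - 1] \cdot a_\ell - C \|w\|_0,\\
G_{Sub}(f) &:= \sum_{i=1}^{n} y_i \sum_{\ell=1}^{n} \mathbb{1}[\text{Subrank}(f(x_i)) = \ell - 1] \cdot a_\ell - C \|w\|_0.
\end{aligned}
$$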

In this section, we will ultimately prove that any maximizer of $G_{Sub}$ also maximizes $G_{RR}$. This is true under a very general condition, which is that there are no exactly duplicated observations. The reason for this condition is not completely obvious. In the Subrank formulation, if two observations are exactly the same, they will always get the same score and Subrank; there is no mechanism to resolve ties and assign ranks. This causes problems when approximating the ResolvedRank with the Subrank. We remark, however, that this should not be a problem in practice. First, we can check in advance whether any of our observations are exact copies of each other, so we know whether it is likely to be a problem. Second, if we do have duplicated observations, we can always slightly perturb the values of the duplicated observations so they are not identical. Third, if the data are drawn from a continuous distribution, then with probability 1 the observations will all be distinct anyway. We have found that in practice the Subrank formulation does not have problems even when there are ties.

In the first part of the section, we consider whether there are maximizers of $G_{RR}$ that have no ties in score, in other words, solutions where $f(x_i) \neq f(x_k)$ for any two observations $x_i$ and $x_k$. Assuming such solutions exist, we then show that any maximizer of $G_{Sub}$ is also a maximizer of $G_{RR}$. This is the result within Theorem 4.1. In the second part of the section, we show that the assumption we made for Theorem 4.1 is always satisfied, assuming no duplicated observations. That is, a maximizer of $G_{RR}$ with no ties in score exists. Our technical report (Chang et al, 2011) follows a similar outline but does not include regularization.

The following lemma establishes basic facts about the two objectives:

Lemma 1

$G_{Sub}(f) \le G_{RR}(f)$ for all $f$. Further, $G_{Sub}(f) = G_{RR}(f)$ for all $f$ with no ties.

Proof

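Choose any function $f$. Since by definition $\text{Subrank}(f(x_i)) \le \text{ResolvedRank}(f(x_i))$, and since the $a_\ell$ are nondecreasing,

$$
\begin{aligned}
\sum_{\ell=1}^{n} \mathbb{1}[\text{Subrank}(f(x_i)) = \ell - 1] \cdot a_\ell &= a_{(\text{Subrank}(f(x_i)) + 1)}\\
&\le a_{(\text{ResolvedRank}(f(x_i)) + 1)}\\
&= \sum_{\ell=1}^{n} \mathbb{1}[\text{ResolvedRank}(f(x_i)) = \ell - 1] \cdot a_\ell \quad \forall i. \qquad (31)
\end{aligned}
$$

Multiplying both sides by the nonnegative label $y_i$, summing over $i$, and subtracting the regularization term $C\|w\|_0$ from both sides yields $G_{Sub}(f) \le G_{RR}(f)$. When no ties are present (that is, $f(x_i) \neq f(x_k)$ for all $i \neq k$), Subranks and ResolvedRanks are equal, the inequality above becomes an equality, and in that case $G_{Sub}(f) = G_{RR}(f)$.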
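The inequality in Lemma 1 can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: ties in the resolved ranking are broken by observation index (the lemma does not depend on how ties are resolved), the list `a` stores $a_1, \dots, a_n$ at positions $0, \dots, n-1$, and all names and numbers are hypothetical.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def subranks(scores):
    """Number of observations with a strictly lower score."""
    return [sum(1 for other in scores if other < s) for s in scores]


def resolved_ranks(scores):
    """Unique ranks 0..n-1; ties broken by index (an assumption of this sketch)."""
    order = sorted(range(len(scores)), key=lambda i: (scores[i], i))
    ranks = [0] * len(scores)
    for position, i in enumerate(order):
        ranks[i] = position
    return ranks


def objective(ranks, y, a, w, C):
    """sum_i y_i * a_(rank_i + 1) - C * ||w||_0, with a[r] holding a_(r+1)."""
    reward = sum(yi * a[r] for yi, r in zip(y, ranks))
    l0_norm = sum(1 for wj in w if wj != 0)
    return reward - C * l0_norm


# Toy data (assumed): rows 1 and 2 are distinct but tie in score under w.
X = [[3.0, 1.0], [2.0, 5.0], [2.0, 7.0], [1.0, 0.0]]
y = [1, 0, 1, 0]
a = [0, 1, 2, 4]          # nondecreasing rank rewards
w, C = [1.0, 0.0], 0.5

scores = [dot(w, x) for x in X]
g_sub = objective(subranks(scores), y, a, w, C)
g_rr = objective(resolved_ranks(scores), y, a, w, C)
print(g_sub, g_rr)  # 4.5 5.5
```

Here the tied positive observation is credited with rank 1 under Subranks and rank 2 under ResolvedRanks, so the Subrank objective is strictly smaller, as the lemma permits.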

This lemma will be used within the following theorem, which says that maximizers of $G_{Sub}$ are maximizers of $G_{RR}$.

Theorem 4.1

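Assume that the set $\operatorname*{argmax}_f G_{RR}(f)$ contains at least one function $\bar{f}$ having no ties in score. Then any $f^\star$ such that $f^\star \in \operatorname*{argmax}_f G_{Sub}(f)$ also obeys $f^\star \in \operatorname*{argmax}_f G_{RR}(f)$.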

Proof

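Assume there exists $\bar{f} \in \operatorname*{argmax}_f G_{RR}(f)$ such that there are no ties in score. Since $\bar{f}$ is a maximizer of $G_{RR}$ and does not have ties, it is also a maximizer of $G_{Sub}$ by Lemma 1:

$$
G_{Sub}(\bar{f}) = G_{RR}(\bar{f}) = \max_f G_{RR}(f) \ge \max_f G_{Sub}(f), \quad \text{thus} \quad G_{Sub}(\bar{f}) = \max_f G_{Sub}(f).
$$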

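Let $f^\star$ be an arbitrary maximizer of $G_{Sub}$ (not necessarily tie-free). We claim that $f^\star$ is also a maximizer of $G_{RR}$. Otherwise,

$$
G_{RR}(f^\star) < G_{RR}(\bar{f}) \overset{(a)}{=} G_{Sub}(\bar{f}) \overset{(b)}{=} G_{Sub}(f^\star) \overset{(c)}{\le} G_{RR}(f^\star),
$$

which is a contradiction. Equation (a) comes from Lemma 1 applied to $\bar{f}$. Equation (b) comes from the fact that both $\bar{f}$ and $f^\star$ are maximizers of $G_{Sub}$. Inequality (c) comes from Lemma 1 applied to $f^\star$.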

Interestingly enough, it is true that if $\bar{f}$ maximizes $G_{RR}$ and it has no ties, then $\bar{f}$ also maximizes $G_{Sub}$. In particular,

$$
\max_f G_{Sub}(f) \le \max_f G_{RR}(f) \le G_{RR}(\bar{f}) = G_{Sub}(\bar{f}).
$$

Note that so far, the results about $G_{RR}$ and $G_{Sub}$ hold for functions from any arbitrary set; we did not need to have $f(x) = w^T x$ in the preceding computations. In what follows we take advantage of the fact that $f$ is a linear combination of features in order to perturb the function away from ties in score. With this method we will be able to achieve the same maximal value of $G_{RR}$ but with no ties.

Define $M$ to be the maximum absolute value of the features, so that for all $i, j$, we have $|x_{ij}| \le M$.

Lemma 2

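If we are given $\bar{f}(x) = \bar{w}^T x$ that yields a scoring function with ties, it is possible to construct a perturbed scoring function $\hat{f}(x) = \hat{w}^T x$ that:

(i) preserves all pairwise orderings: $\bar{f}(x_i) > \bar{f}(x_k) \Rightarrow \hat{f}(x_i) > \hat{f}(x_k)$,

(ii) has no ties: $\hat{f}(x_i) \neq \hat{f}(x_k)$ for all $i \neq k$,

(iii) has $\|\hat{w}\|_0 = \|\bar{w}\|_0$.

This result holds whenever no observations are duplicates of each other, that is, $x_i \neq x_k$ for all $i \neq k$.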

Proof

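We will construct $\hat{f}$ using the following procedure:

Step 1

Find the nonzero indices of $\bar{w}$: let $\bar{J} := \{j : \bar{w}_j \neq 0\}$. Choose a unit vector $v$ in $\mathbb{R}^{|\bar{J}|}$ uniformly at random. Construct vector $u \in \mathbb{R}^d$ to equal $v$ on the dimensions in $\bar{J}$ and 0 otherwise.

Step 2

Choose real number $\delta$ to be strictly between 0 and $\eta$, where

$$
\eta = \min\left\{ \frac{\text{margin}_{\bar{w}}}{2M\sqrt{d}},\ \min_{j \in \bar{J}} |\bar{w}_j| \right\}
$$

where in the above expression

$$
\text{margin}_{\bar{w}} = \min_{\{i,k:\ \bar{f}(x_i) > \bar{f}(x_k)\}} \left( \bar{f}(x_i) - \bar{f}(x_k) \right).
$$

Step 3

Construct $\hat{w}$ as follows: $\hat{w} = \bar{w} + \delta u$.

We will show that, with probability one, $\hat{f}(x) = \hat{w}^T x$ preserves the pairwise orderings of $\bar{f}$ but has no ties.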
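The three steps can be sketched in code as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `perturb` and the toy data are hypothetical, the unit vector is drawn by normalizing independent Gaussians (a standard way to sample uniformly from the sphere), $\delta$ is fixed at $\eta/2$, and the inputs are assumed to have a nonzero coefficient vector, at least one nonzero feature, and at least one strictly ordered pair.

```python
import math
import random


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def perturb(w_bar, X, rng):
    """Return w_hat = w_bar + delta * u following Steps 1 to 3."""
    d = len(w_bar)
    scores = [dot(w_bar, x) for x in X]

    # Step 1: random unit vector supported on the nonzero coordinates of w_bar.
    support = [j for j in range(d) if w_bar[j] != 0]
    gauss = [rng.gauss(0.0, 1.0) for _ in support]
    norm = math.sqrt(sum(g * g for g in gauss))
    u = [0.0] * d
    for j, g in zip(support, gauss):
        u[j] = g / norm

    # Step 2: choose delta strictly between 0 and eta.
    M = max(abs(v) for x in X for v in x)
    margin = min(si - sk for si in scores for sk in scores if si > sk)
    eta = min(margin / (2 * M * math.sqrt(d)),
              min(abs(w_bar[j]) for j in support))
    delta = eta / 2  # any value in (0, eta) would do; eta / 2 is an arbitrary choice

    # Step 3: perturb the coefficients.
    return [wj + delta * uj for wj, uj in zip(w_bar, u)]


# Toy data (assumed): rows 0 and 1 tie under w_bar and differ on its support.
X = [[2.0, 0.0, 5.0], [1.0, 1.0, 3.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
w_bar = [1.0, 1.0, 0.0]
w_hat = perturb(w_bar, X, random.Random(0))
print([dot(w_bar, x) for x in X])  # [2.0, 2.0, 1.0, 0.0]
print([dot(w_hat, x) for x in X])  # same ordering, with the tie broken
```

In this example the tied rows differ in coordinates where the coefficient vector is nonzero, so the perturbation separates their scores while the zero coefficient stays zero.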

We will prove each part of the lemma separately.

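Proof of (i) We choose any two observations $x_i$ and $x_k$ where $\bar{f}(x_i) > \bar{f}(x_k)$, and we need to show that $\hat{f}(x_i) > \hat{f}(x_k)$.

$$
\begin{aligned}
\hat{f}(x_i) - \hat{f}(x_k) &= (\bar{w} + \delta u)^T (x_i - x_k) = \bar{w}^T (x_i - x_k) + \delta u^T (x_i - x_k)\\
&= \bar{f}(x_i) - \bar{f}(x_k) + \delta u^T (x_i - x_k) \ge \text{margin}_{\bar{w}} + \delta u^T (x_i - x_k). \qquad (32)
\end{aligned}
$$

In order to bound the right hand side away from zero we will use that:

$$
\|x_i - x_k\|_2 = \left( \sum_{j=1}^{d} (x_{ij} - x_{kj})^2 \right)^{1/2} \le \left( \sum_{j=1}^{d} (2M)^2 \right)^{1/2} = 2M\sqrt{d}. \qquad (33)
$$

Now,

$$
\left| \delta u^T (x_i - x_k) \right| \overset{(a)}{\le} \delta \|u\|_2 \|x_i - x_k\|_2 \overset{(b)}{\le} \delta \cdot 2M\sqrt{d} \overset{(c)}{<} \text{margin}_{\bar{w}}.
$$

Here, inequality (a) follows from the Cauchy-Schwarz inequality, (b) follows from (33) and that $\|u\|_2 = 1$, and (c) follows from the bound on $\delta$ from Step 2 of the procedure for constructing $\hat{f}$ above. Thus $\delta u^T (x_i - x_k) > -\text{margin}_{\bar{w}}$, which combined with (32) yields

$$
\hat{f}(x_i) - \hat{f}(x_k) \ge \text{margin}_{\bar{w}} + \delta u^T (x_i - x_k) > \text{margin}_{\bar{w}} - \text{margin}_{\bar{w}} = 0.
$$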

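Proof of (ii) We show that $\hat{f}$ has no ties, $\hat{f}(x_i) \neq \hat{f}(x_k)$ for all $i \neq k$. This must be true with probability 1 over the choice of random vector $u$.

Since we know that all pairwise inequalities are preserved, we need to ensure only that ties become untied through the perturbation $\delta u$. Thus, let us consider tied observations $x_i$ and $x_k$, so $\bar{f}(x_i) = \bar{f}(x_k)$. We need to show that they become untied: we need to show $|\hat{f}(x_i) - \hat{f}(x_k)| > 0$. Consider $|\hat{f}(x_i) - \hat{f}(x_k)|$:

$$
\begin{aligned}
|\hat{f}(x_i) - \hat{f}(x_k)| &= \left| (\bar{w} + \delta u)^T (x_i - x_k) \right| = \left| \bar{w}^T (x_i - x_k) + \delta u^T (x_i - x_k) \right|\\
&= |\delta| \left| u^T (x_i - x_k) \right|.
\end{aligned}
$$

We now use the key assumption that no two observations are duplicates; this implies that at least one entry of vector $x_i - x_k$ is nonzero. Further, since $u$ is a random vector, the probability that it is orthogonal to vector $x_i - x_k$ is zero. So, with probability one with respect to the choice of $u$, we have $|u^T (x_i - x_k)| > 0$. From the expression above,

$$
|\hat{f}(x_i) - \hat{f}(x_k)| = |\delta| \left| u^T (x_i - x_k) \right| > 0.
$$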

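Proof of (iii) By our definitions, $\hat{w} = \bar{w} + \delta u$, $0 < \delta < \eta \le \min_{j \in \bar{J}} |\bar{w}_j|$, and $u$ is only nonzero in the components where $\bar{w}$ is not 0. Each component of $u$ in $\bar{J}$ is nonzero with probability 1. For a component $j$ where $\bar{w}_j \neq 0$, we have $|\delta u_j| < |\bar{w}_j|$ (since $|u_j| \le 1$), which means $\hat{w}_j \neq 0$. So, for all components where $\bar{w}$ is nonzero, we also have $\hat{w}$ nonzero in those components. Further, for all components where $\bar{w}$ is zero, we also have $\hat{w}$ zero in those components. Thus $\|\hat{w}\|_0 = \|\bar{w}\|_0$.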

The result below establishes the main result of the section, which is that if we optimize $G_{Sub}$, we get an optimizer of $G_{RR}$, even though it is a much more complex optimization problem to optimize $G_{RR}$ directly.

Theorem 4.2

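Given $f^\star \in \operatorname*{argmax}_f G_{Sub}(f)$, then $f^\star \in \operatorname*{argmax}_f G_{RR}(f)$.
This holds when there are no duplicated observations, that is, when $x_i \neq x_k$ for all $i \neq k$.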

Proof

We will show that the assumption of Theorem 4.1, which says that $G_{RR}$ has a maximizer with no ties, is always true. This will give us the desired result. Let $\bar{f} \in \operatorname*{argmax}_f G_{RR}(f)$. Either $\bar{f}$ has no ties already, in which case there is nothing to prove, or it does have ties. If so, we can take its vector $\bar{w}$ and perturb it using Lemma 2. The resulting vector $\hat{w}$ yields a scoring function $\hat{f}$ with no ties. We need only to show that $\hat{f}$ also maximizes $G_{RR}$. To do this we will show $G_{RR}(\hat{f}) = G_{RR}(\bar{f})$.

We know that

 GRR(¯f) =n∑i=1yin∑ℓ=11[ResolvedRank(¯f(xi))=ℓ−1]⋅aℓ−c∥¯w∥0 =∑i∈S+a(ResolvedRank(¯f