## 1 Introduction

Bipartite ranking aims to learn a real-valued ranking function that places positive instances above negative instances. It has attracted much attention because of its applications in areas such as information retrieval and recommender systems (Rendle:2009:LOR; Liu11). Over the past decades, many methods have been developed for bipartite ranking, and most of them are essentially based on pairwise ranking. These algorithms reduce the ranking problem to a binary classification problem by treating each positive-negative instance pair as a single object to be classified (Herbrich00; Freund-JMLR03; Burges-ICML05; ValizadeganJZM09; Usunier09; Rudin:2009; Agarwal11; BoydCMR12). Since the number of instance pairs can grow quadratically in the number of training instances, one limitation of these methods is their high computational cost, which makes them not scalable to large datasets.

Because in applications such as document retrieval and recommender systems only the top-ranked instances will be examined by users, there has been a growing interest in learning ranking functions that perform especially well at the top of the ranked list (Clemencon07; BoydCMR12). Most existing methods of this kind fall into two groups. The first group maximizes the ranking accuracy at the top of the ranked list by optimizing task-specific metrics (Joachims05; Le07; Li:13; xu13), such as average precision (AP) (Yue:2007), NDCG (ValizadeganJZM09) and partial AUC (NarasimhanA-ICML13; NarasimhanA-KDD13). The main limitation of these methods is that they often result in non-convex optimization problems that are difficult to solve efficiently. Structural SVM (Tsochantaridis05) addresses this issue by translating the non-convexity into an exponential number of constraints, but it can still be computationally challenging because it usually requires searching for the most violated constraint at each iteration of optimization. In addition, these methods are statistically inconsistent (Tewari:2007; Le07), and thus often lead to suboptimal solutions. The second group of methods is based on pairwise ranking: they design special convex loss functions that place larger penalties on the ranking errors related to the top-ranked instances, for example by weighting (Usunier09) or by exploiting special functions such as the $p$-norm (Rudin:2009) and the infinity norm (Agarwal11). Since these methods are essentially based on pairwise ranking, their computational costs are usually proportional to the number of positive-negative instance pairs, making them unattractive for large datasets.

In this paper, we address the computational challenge of bipartite ranking by designing a ranking algorithm, named TopPush, that can efficiently optimize the ranking accuracy at the top. The key feature of the proposed TopPush algorithm is that its time complexity is only *linear* in the number of training instances. This is in contrast to most existing methods for bipartite ranking, whose computational costs depend on the number of instance pairs. Moreover, we develop a novel analysis for bipartite ranking. One shortcoming of the existing theoretical studies (Rudin:2009; Agarwal11) on bipartite ranking is that they try to bound the probability for a positive instance to be ranked before *any* negative instance, leading to relatively pessimistic bounds. We overcome this limitation by bounding the probability of ranking a positive instance before *most* negative instances, and show that TopPush is effective in placing positive instances at the top of a ranked list. An extensive empirical study shows that TopPush is computationally more efficient than most ranking algorithms, and yields performance comparable to the state-of-the-art approaches that maximize the ranking accuracy at the top.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of bipartite ranking and discusses the difference between AUC optimization and maximizing accuracy at the top. Section 3 presents the proposed TopPush algorithm and its key theoretical properties. Section 4 gives proofs and technical details. Section 5 summarizes the empirical study, and Section 6 concludes this work with future directions.

## 2 Bipartite Ranking: AUC vs Accuracy at the Top

Let $\mathcal{X}$ be the instance space. Let $S = S_+ \cup S_-$ be a set of training instances, where $S_+ = \{\mathbf{x}_i^+\}_{i=1}^{m}$ and $S_- = \{\mathbf{x}_j^-\}_{j=1}^{n}$ include $m$ positive instances and $n$ negative instances independently sampled from distributions $\mathcal{P}_+$ and $\mathcal{P}_-$, respectively. The goal of bipartite ranking is to learn a ranking function $f : \mathcal{X} \mapsto \mathbb{R}$ that is likely to place a positive instance before most negative ones. In the literature, bipartite ranking has found applications in many domains, and its theoretical properties have been examined by several studies (for example, Agarwal-JMLR05; Clemencon08; KotlowskiDH11; Narasimhan-NIPS13).

AUC is a commonly used evaluation metric for bipartite ranking (Hanley82; CortesNIPS03). By exploiting its equivalence to the Wilcoxon-Mann-Whitney statistic (Hanley82), many ranking algorithms have been developed to optimize AUC by minimizing the ranking loss defined as

$$L_{\mathrm{rank}}(f; S) \;=\; \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \mathbb{I}\big(f(\mathbf{x}_i^+) \le f(\mathbf{x}_j^-)\big), \qquad (1)$$

where $\mathbb{I}(\cdot)$ is the indicator function, which equals $1$ if its argument holds and $0$ otherwise. Other than for a few special loss functions such as the exponential and logistic losses (Rudin:2009; KotlowskiDH11), most of these methods need to enumerate all the positive-negative instance pairs, making them unattractive for large datasets. Various methods have been developed to address this computational challenge; for example, in recent years, ZhaoHJY11 and Gao13 studied online and one-pass AUC optimization, respectively.
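The quadratic cost of pair enumeration can be made concrete with a short sketch (using hypothetical score arrays) that evaluates the ranking loss in (1) directly over all $m \times n$ pairs:

```python
import numpy as np

def ranking_loss(pos_scores, neg_scores):
    """Empirical ranking loss (1): fraction of positive-negative pairs
    that are mis-ordered, i.e., f(x+) <= f(x-)."""
    pos = np.asarray(pos_scores, dtype=float)
    neg = np.asarray(neg_scores, dtype=float)
    # Enumerate all m*n pairs via broadcasting -- O(mn) time and memory.
    mis_ordered = pos[:, None] <= neg[None, :]
    return mis_ordered.mean()

# Hypothetical scores: the positive with score 0.2 falls below both negatives.
print(ranking_loss([0.9, 0.2], [0.5, 0.3]))  # 2 of 4 pairs mis-ordered -> 0.5
```

The broadcasting step materializes the full $m \times n$ comparison matrix, which is exactly the cost that the methods discussed above have to pay.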

In the recent literature, there is a growing interest in optimizing accuracy at the top of the ranked list (Clemencon07; BoydCMR12). Maximizing AUC is not suitable for this goal, as indicated by the analysis in (Clemencon07). To address this challenge, we propose to maximize the number of positive instances that are ranked before the first negative instance, which is known as *positives at the top* (Rudin:2009; Agarwal11; BoydCMR12). We can translate this objective into the minimization of the following loss

$$L(f; S) \;=\; \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\Big(f(\mathbf{x}_i^+) \le \max_{1\le j\le n} f(\mathbf{x}_j^-)\Big), \qquad (2)$$

which computes the fraction of positive instances ranked below the top-ranked negative instance. By minimizing the loss in (2), we essentially push negative instances away from the top of the ranked list, leading to more positive ones placed at the top. We note that (2) is fundamentally different from AUC optimization, as AUC does not focus on the ranking accuracy at the top. This can be seen from the relationship between the loss functions (1) and (2), as summarized below. Let $S$ be a dataset consisting of $m$ positive instances and $n$ negative instances, and let $f$ be a ranking function; we have

$$L_{\mathrm{rank}}(f; S) \;\le\; L(f; S) \;\le\; n \, L_{\mathrm{rank}}(f; S). \qquad (3)$$

The proof of this proposition is deferred to Section 4.1. According to Proposition 2, we can see that if the ranking loss $L_{\mathrm{rank}}(f;S)$ is greater than $1/n$, which is common in practice, the loss $L(f;S)$ can be as large as one, implying that no positive instance is ranked above all of the negative instances. Surely this is not what we want; it also indicates that our goal of maximizing positives at the top cannot be achieved by AUC optimization, consistent with the theoretical analysis in (Clemencon07). Meanwhile, we can see that $L(f;S)$ is an upper bound on the ranking loss $L_{\mathrm{rank}}(f;S)$; thus, by minimizing $L(f;S)$, a small ranking loss can be expected, which also benefits AUC optimization. This constitutes the main motivation of the current work.
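The two-sided relationship in (3) is easy to check numerically; the sketch below (again with hypothetical scores) computes the positives-at-the-top loss in (2) next to the pairwise ranking loss in (1):

```python
import numpy as np

def ranking_loss(pos, neg):
    """Pairwise ranking loss (1)."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    return (pos[:, None] <= neg[None, :]).mean()

def top_loss(pos, neg):
    """Positives-at-the-top loss (2): fraction of positives ranked
    below the top-ranked negative instance."""
    pos, neg = np.asarray(pos, float), np.asarray(neg, float)
    return (pos <= neg.max()).mean()

pos, neg = [0.9, 0.6, 0.2], [0.7, 0.1]
l_rank, l_top, n = ranking_loss(pos, neg), top_loss(pos, neg), len(neg)
assert l_rank <= l_top <= n * l_rank   # the two-sided bound in (3)
print(l_rank, l_top)
```

Here two of the six pairs are mis-ordered ($L_{\mathrm{rank}} = 1/3$), while two of the three positives fall below the top negative ($L = 2/3$), so both inequalities in (3) hold, the right one with equality.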

To design practical learning algorithms, we replace the indicator function in (2) with a convex surrogate, leading to the following loss function

$$\mathcal{L}(f; S) \;=\; \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} f(\mathbf{x}_j^-) - f(\mathbf{x}_i^+)\Big), \qquad (4)$$

where $\ell(\cdot)$ is a convex surrogate loss function that is non-decreasing¹ and differentiable. Examples of such loss functions include the truncated quadratic loss $\ell(z) = [1+z]_+^2$, the exponential loss $\ell(z) = e^z$, and the logistic loss $\ell(z) = \log(1+e^z)$. In the discussion below, we restrict ourselves to the truncated quadratic loss, even though most of our analysis applies to the other loss functions as well.

¹ In this paper, we let $\ell(\cdot)$ be non-decreasing for the simplicity of formulating the dual problem.
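For concreteness, the three surrogates can be written down and checked for the properties required above; the explicit formulas below are the standard forms of these named losses, and the checks verify monotonicity and (discrete) convexity on a grid:

```python
import numpy as np

# Standard forms of the three surrogate losses named in the text.
truncated_quadratic = lambda z: np.maximum(0.0, 1.0 + z) ** 2
exponential = lambda z: np.exp(z)
logistic = lambda z: np.log1p(np.exp(z))

z = np.linspace(-3.0, 3.0, 601)
for loss in (truncated_quadratic, exponential, logistic):
    v = loss(z)
    assert np.all(np.diff(v) >= 0)           # non-decreasing in z
    assert np.all(np.diff(v, 2) >= -1e-9)    # discrete convexity check
print("all three surrogates are convex and non-decreasing")
```

All three are differentiable as well, which is what the dual formulation in the next section relies on.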

It is easy to verify that the loss function in (4) is equivalent to the loss used in InfinitePush (Agarwal11) (a special case of the $p$-norm Push (Rudin:2009))

$$\mathcal{L}_\infty(f; S) \;=\; \max_{1\le j\le n} \; \frac{1}{m}\sum_{i=1}^{m} \ell\big(f(\mathbf{x}_j^-) - f(\mathbf{x}_i^+)\big). \qquad (5)$$

The apparent advantage of employing $\mathcal{L}$ instead of $\mathcal{L}_\infty$ is that the former only needs to evaluate the surrogate on $m$ score differences, whereas the latter needs to enumerate all the instance pairs. As a result, the number of dual variables induced by $\mathcal{L}$ is $m+n$, linear in the number of training instances, which is significantly smaller than $mn$, the number of dual variables induced by $\mathcal{L}_\infty$ (see Agarwal11; Rakotomamonjy12). It is this difference that allows the proposed algorithm to achieve a computational complexity linear in the number of training instances, and therefore to be more efficient than most state-of-the-art algorithms for bipartite ranking.

## 3 TopPush for Optimizing Top Accuracy

In this section, we first present a learning algorithm that minimizes the loss function in (4), and then analyze the computational complexity and performance guarantees of the proposed algorithm.

### 3.1 Dual Formulation

We consider a linear ranking function, that is, $f(\mathbf{x}) = \mathbf{w}^\top \mathbf{x}$, where $\mathbf{w}$ is the weight vector to be learned. For nonlinear ranking functions, we can use kernel methods; the Nyström method and random Fourier features can transform the kernelized problem into a linear one, see (YangLMJZ12) for more discussion of this topic. As a result, the learning problem is given by the following optimization problem

$$\min_{\mathbf{w}} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} \mathbf{w}^\top\mathbf{x}_j^- - \mathbf{w}^\top\mathbf{x}_i^+\Big), \qquad (6)$$

where $\lambda > 0$ is a regularization parameter.

Directly minimizing the objective in (6) can be challenging because of the max operator in the loss function. We address this challenge by developing a dual formulation of (6). Specifically, given a convex and differentiable function $\ell(\cdot)$, we can rewrite it in its convex conjugate form as

$$\ell(z) = \max_{\alpha \in \Omega_\ell} \; \alpha z - \ell^*(\alpha),$$

where $\ell^*(\cdot)$ is the convex conjugate of $\ell(\cdot)$ and $\Omega_\ell$ is the domain of the dual variable (bv-cvx). For example, the convex conjugate of the truncated quadratic loss $\ell(z) = [1+z]_+^2$ is $\ell^*(\alpha) = -\alpha + \alpha^2/4$ with domain $\alpha \ge 0$.

We note that the dual form has been widely used to improve computational efficiency (Sun:2010) and to connect different styles of learning algorithms (Kanamori:2013). Here, we exploit this technique to overcome the difficulty caused by the max operator. The dual form of (6) is given in the following theorem, whose detailed proof is deferred to Section 4.2.
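As a numerical sanity check on the conjugate representation, maximizing $\alpha z - \ell^*(\alpha)$ over $\alpha \ge 0$ on a dense grid should recover the truncated quadratic loss; the standard forms of both the loss and its conjugate are assumed below:

```python
import numpy as np

loss = lambda z: np.maximum(0.0, 1.0 + z) ** 2   # truncated quadratic loss
conj = lambda a: -a + a ** 2 / 4.0               # its convex conjugate, a >= 0
alphas = np.linspace(0.0, 50.0, 200001)          # dense grid over the dual domain

for z in (-2.0, -0.5, 0.0, 1.5):
    # max over alpha of alpha*z - conj(alpha) should equal loss(z)
    recovered = np.max(alphas * z - conj(alphas))
    assert abs(recovered - loss(z)) < 1e-4
print("conjugate representation verified")
```

For $z \le -1$ the maximum is attained at $\alpha = 0$ (recovering $\ell(z) = 0$), and otherwise at $\alpha = 2(1+z)$, which mirrors how the dual variables encode the active score differences.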

###### Theorem 1

The key feature of the dual problem in (7) is that the number of dual variables is $m+n$. This is in contrast to the InfinitePush algorithm (Agarwal11), which introduces $mn$ dual variables. In addition, the objective function in (7) is smooth if the convex conjugate $\ell^*(\cdot)$ is smooth, which is true for many common loss functions (e.g., the truncated quadratic, exponential and logistic losses). It is well known in the optimization literature that an $O(1/T^2)$ convergence rate can be achieved if the objective function is smooth, where $T$ is the number of iterations; this also helps in designing an efficient learning algorithm.

### 3.2 Linear Time Bipartite Ranking Algorithm

According to Theorem 1, to learn a ranking function, it is sufficient to learn the dual variables $\alpha$ and $\beta$ by solving the problem in (7). For this purpose, we adopt an accelerated gradient method because of its light computation per iteration. Since we are pushing positive instances above the top-ranked negative instance, we refer to the obtained algorithm as TopPush.

#### 3.2.1 Efficient Optimization

We choose Nesterov’s method (Nesterov03; Nemirovski94), which achieves the optimal convergence rate for smooth objective functions. One of the key features of Nesterov’s method is that, besides the solution sequence, it maintains a second sequence of auxiliary solutions, which is introduced to exploit the smoothness of the objective function and achieve a faster convergence rate. Its step size depends on the smoothness of the objective function; in the current work, we adopt Nemirovski’s line search scheme (Nemirovski94) to estimate the smoothness parameter. Of course, other schemes such as the one developed in (Liu:2009) can also be used.

Algorithm 1 summarizes the steps of the TopPush algorithm. At each iteration, the gradients of the objective function can be efficiently computed as

(10)

where $\ell^{*\prime}(\cdot)$ denotes the derivative of the convex conjugate $\ell^*(\cdot)$. It should be noted that the problem in (7) is a constrained optimization problem; therefore, at each gradient mapping step, we have to project the dual solution onto the feasible domain (that is, in step 9) to keep it feasible. Below, we discuss how to solve this projection step efficiently.
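The overall optimization scheme can be sketched generically as follows; this is a minimal accelerated projected-gradient loop in the spirit of the description above, with a fixed step size in place of Nemirovski's line search and a toy box-constrained problem in place of (7), both simplifications:

```python
import numpy as np

def accelerated_projected_gradient(grad, project, x0, step, n_iters=100):
    """Nesterov-style accelerated gradient with a projection step.

    grad    : callable returning the gradient at a point
    project : callable mapping a point back onto the feasible domain
    step    : fixed step size (the paper estimates it by line search)
    """
    x = x_prev = np.asarray(x0, dtype=float)
    for t in range(1, n_iters + 1):
        # Auxiliary (extrapolated) point -- the second sequence that
        # yields the accelerated rate on smooth objectives.
        y = x + (t - 1) / (t + 2) * (x - x_prev)
        x_prev = x
        x = project(y - step * grad(y))   # gradient step + projection
    return x

# Toy usage: minimize ||x - c||^2 over the nonnegative orthant.
c = np.array([1.0, -2.0, 3.0])
sol = accelerated_projected_gradient(
    grad=lambda x: 2 * (x - c),
    project=lambda x: np.maximum(x, 0.0),
    x0=np.zeros(3), step=0.25, n_iters=200)
print(sol)  # approaches [1, 0, 3]
```

In TopPush the `project` callable is the nontrivial part: it is the exact projection onto the dual domain discussed in the next subsection.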

#### 3.2.2 Projection Step

For clarity of notation, we expand the projection step into the problem

$$\min_{\alpha, \beta} \;\; \frac{1}{2}\|\alpha - \mathbf{a}\|^2 + \frac{1}{2}\|\beta - \mathbf{b}\|^2 \qquad (11)$$

$$\text{s.t.} \quad \alpha \ge 0, \quad \beta \ge 0, \quad \mathbf{1}^\top\alpha = \mathbf{1}^\top\beta,$$

where $\mathbf{a}$ and $\mathbf{b}$ are the solutions to be projected. We note that similar projection problems have been studied in (Shalev-Shwartz:2006; Liu-ICML09), but the methods there either have superlinear time complexity or only provide approximate solutions. Instead, based on the following proposition, we provide a method that finds the exact solution to (11) in expected linear time.

The optimal solution to the projection problem in (11) is given by

$$\alpha_i = [a_i - \nu^*]_+ \quad\text{and}\quad \beta_j = [b_j + \nu^*]_+,$$

where $\nu^*$ is the unique root of the function

$$g(\nu) = \sum_{i=1}^{m} [a_i - \nu]_+ \; - \; \sum_{j=1}^{n} [b_j + \nu]_+. \qquad (12)$$

The proof of this proposition is similar to that of (Liu-ICML09, Theorem 2) and is thus omitted here. According to Proposition 3.2.2, the key to solving the projection problem is to find the root of $g(\nu)$. Instead of approximating the root via bisection as in (Liu-ICML09), we develop a different scheme that obtains the exact solution, as follows.

For a given value of $\nu$, define the two index sets

$$A(\nu) = \{\, i : a_i - \nu > 0 \,\} \quad\text{and}\quad B(\nu) = \{\, j : b_j + \nu > 0 \,\};$$

then the function in (12) can be rewritten as

$$g(\nu) = \sum_{i \in A(\nu)} (a_i - \nu) \; - \sum_{j \in B(\nu)} (b_j + \nu). \qquad (13)$$

Also, define the set of breakpoints

$$C = \{a_1, \ldots, a_m\} \cup \{-b_1, \ldots, -b_n\},$$

and let $c_{(k)}$ denote its $k$-th order statistic, that is, $c_{(1)} \le c_{(2)} \le \cdots \le c_{(m+n)}$. It can be seen that, for a given $k$ and any $\nu$ in the interval $[c_{(k)}, c_{(k+1)}]$, the index sets $A(\nu)$ and $B(\nu)$ remain unchanged, so $g(\nu)$ is linear on this interval. Thus, from (13), if the interval $[c_{(k)}, c_{(k+1)}]$ contains the root of $g(\nu)$, the root can be exactly computed as

$$\nu^* = \frac{\sum_{i \in A(\nu)} a_i - \sum_{j \in B(\nu)} b_j}{|A(\nu)| + |B(\nu)|}. \qquad (14)$$

Consequently, since $g(\nu)$ is non-increasing, the task reduces to finding the index $k$ such that $g(c_{(k)}) \ge 0$ and $g(c_{(k+1)}) \le 0$.

Inspired by (DuchiSSC08), we devise a divide-and-conquer procedure based on a modification of the randomized median finding algorithm (Cormen01, Chapter 9); it is summarized in Algorithm 2.
In particular, it maintains a set² of unprocessed elements of $C$ whose relationship to the root we do not yet know. On each round, we partition this set into two subsets, which respectively contain the elements greater than and less than an element picked at random from the set.
Then, by evaluating the function in (13), we update the set to the subset containing the needed element and discard the other. The process ends when the set is empty. Afterwards, we compute the exact optimum $\nu^*$ as in (14) and perform the projection as described in Proposition 3.2.2.
In addition, for efficiency, along the way we keep track of the partial sums in (13) so that they are not recalculated.
Based on an analysis similar to that of the randomized median finding algorithm, we can show that Algorithm 2 runs in expected linear time.

² To make the updating of partial sums efficient, in practice two sets are maintained, for the elements originating from $\{a_i\}$ and from $\{-b_j\}$ respectively, and the set of unprocessed elements is their union; the two subsets produced by each partition are handled in a similar manner.
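To make the projection step concrete, the sketch below solves (11) exactly through the root characterization in (12)-(14); for simplicity it sorts the breakpoints and scans them, an $O(N \log N)$ variant of the expected-linear-time divide-and-conquer in Algorithm 2:

```python
import numpy as np

def project_onto_domain(a, b):
    """Exact solution of problem (11):
        min 0.5*||alpha - a||^2 + 0.5*||beta - b||^2
        s.t. alpha >= 0, beta >= 0, sum(alpha) == sum(beta),
    via alpha = [a - nu]_+, beta = [b + nu]_+ with nu the root of
        g(nu) = sum([a - nu]_+) - sum([b + nu]_+).        (eq. 12)
    """
    a, b = np.asarray(a, float), np.asarray(b, float)
    g = lambda nu: np.maximum(a - nu, 0).sum() - np.maximum(b + nu, 0).sum()
    bp = np.sort(np.concatenate([a, -b]))    # breakpoints of piecewise-linear g
    vals = np.array([g(nu) for nu in bp])    # g is non-increasing in nu
    k = np.searchsorted(-vals, 0, side='right') - 1   # last index with g >= 0
    if k == len(bp) - 1:
        nu = bp[k]                           # g already vanishes at the last breakpoint
    else:
        mid = 0.5 * (bp[k] + bp[k + 1])      # active sets are fixed inside the interval
        A, B = a > mid, b + mid > 0
        denom = A.sum() + B.sum()
        nu = bp[k] if denom == 0 else (a[A].sum() - b[B].sum()) / denom  # eq. (14)
    return np.maximum(a - nu, 0), np.maximum(b + nu, 0)

alpha, beta = project_onto_domain([0.5, -0.2, 1.0], [0.3, -0.8])
print(alpha.sum(), beta.sum())  # equal by construction
```

Replacing the sort and linear scan with the randomized pruning described above removes the logarithmic factor and gives the expected $O(m+n)$ cost of Algorithm 2.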

### 3.3 Convergence and Computational Complexity

The theorem below states the convergence of the TopPush algorithm; it follows immediately from the convergence results for Nesterov’s method (Nemirovski94).

###### Theorem 2

Let $\alpha_T$ and $\beta_T$ be the solutions output by the TopPush algorithm after $T$ iterations; we have

provided .

Finally, the computational cost of each iteration is dominated by the gradient evaluation and the projection step. Since the complexity of the projection step is $O(m+n)$ and the cost of computing the gradient is $O((m+n)d)$, where $d$ is the dimensionality of the data, the time complexity of each iteration is $O((m+n)d)$. Combining this result with Theorem 2, we conclude that, to find an $\epsilon$-suboptimal solution, the total computational complexity of the TopPush algorithm is $O((m+n)d/\sqrt{\epsilon})$, which is linear in the number of training instances.

| Algorithm | Reference | Computational Complexity |
|---|---|---|
| SVM^Rank | (Joachims:2006) | |
| SVM^MAP | (Yue:2007) | |
| OWPC | (Usunier09) | |
| SVM^pAUC | (NarasimhanA-ICML13; NarasimhanA-KDD13) | |
| InfinitePush | (Agarwal11) | |
| L1SVIP | (Rakotomamonjy12) | |
| TopPush | this paper | |

Table 1 compares the computational complexity of TopPush with that of several state-of-the-art ranking algorithms. It is easy to see that TopPush is asymptotically more efficient than these algorithms.³ For instance, it is much more efficient than InfinitePush and its sparse extension L1SVIP, whose complexities depend on the number of positive-negative instance pairs; compared with SVM^Rank, SVM^MAP and SVM^pAUC, which handle specific performance metrics via structural SVM, the linear dependence on the number of training instances makes the proposed TopPush algorithm more appealing, especially for large datasets.

³ In Table 1, we report the complexity of SVM^pAUC from (NarasimhanA-KDD13), which is more efficient than the variant in (NarasimhanA-ICML13). In addition, this variant is used in our experiments, and we do not distinguish between the two in this paper.

### 3.4 Theoretical Guarantee

We now develop a theoretical guarantee for the ranking performance of TopPush. In (Rudin:2009; Agarwal11), the authors developed margin-based generalization bounds for the loss function $\mathcal{L}_\infty$. One limitation of the analysis in (Rudin:2009; Agarwal11) is that it tries to bound the probability for a positive instance to be ranked before *any* negative instance, leading to relatively pessimistic bounds. For instance, for the bounds in (Rudin:2009, Theorems 2 and 3), the failure probability can be as large as 1 if the parameter $p$ is large. Our analysis avoids this pitfall by considering the probability of ranking a positive instance before *most* negative instances.

To this end, we first define the probability for a negative instance to be ranked above a given instance by the ranking function $f$, as

Since we are interested in whether positive instances are ranked above most negative instances, we measure the quality of $f$ by the probability for a positive instance to be ranked below a given fraction of the negative instances, that is

Clearly, if a ranking function achieves a high ranking accuracy at the top, it should have a large percentage of positive instances with ranking scores higher than most of the negative instances, leading to a small value of this probability even when the fraction is small. The following theorem bounds this probability for TopPush; its proof can be found in the supplementary document.

###### Theorem 3

Given training data consisting of independent samples from and independent samples from , let be the optimal solution to the problem in (6). Assume and , we have, with a probability at least ,

where and

is the empirical loss.

Theorem 3 implies that if the empirical loss is small, then for most positive instances the fraction of negative instances ranked above them is correspondingly small. We observe that $m$ and $n$ play different roles in the bound. Since the empirical loss compares the positive instances to the negative instance with the largest score, it usually grows significantly more slowly with increasing $n$; for instance, the largest absolute value of $n$ Gaussian random samples grows only as $O(\sqrt{\log n})$. Thus, we believe that the main effect of increasing $n$ in our bound is to reduce the fraction of negative instances allowed above a positive instance, especially when $n$ is large. Meanwhile, by increasing the number of positive instances $m$, we reduce the bound itself, and consequently increase the chance of finding positive instances at the top.
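The slow growth of the largest Gaussian sample can be illustrated with a quick simulation (the sample sizes and repetition count below are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = (100, 10_000, 1_000_000)
# Average the maximum over a few repetitions to reduce noise.
maxima = [np.mean([rng.standard_normal(n).max() for _ in range(20)])
          for n in sizes]
for n, m in zip(sizes, maxima):
    # Compare the empirical maximum with the sqrt(2 log n) growth rate.
    print(n, round(m, 2), round(np.sqrt(2 * np.log(n)), 2))
```

Multiplying the sample size by a factor of 10,000 only increases the typical maximum by roughly a factor of two, which tracks the $\sqrt{2\log n}$ rate and supports the claim that the empirical loss grows slowly with $n$.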

## 4 Proofs and Technical Details

In this section, we give all the detailed proofs missing from the main text, along with ancillary remarks and comments.

### 4.1 AUC vs. Accuracy at the Top

### 4.2 Proof of Theorem 1

Since $\ell(\cdot)$ is a convex loss function that is non-decreasing and differentiable, it can be rewritten in its convex conjugate form, that is,

$$\ell(z) = \max_{\alpha \in \Omega_\ell} \; \alpha z - \ell^*(\alpha),$$

where $\ell^*(\cdot)$ is the convex conjugate of $\ell(\cdot)$; hence, the problem in (6) can be rewritten as

$$\min_{\mathbf{w}} \; \max_{\alpha \in \Omega_\ell^m} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^{m} \Big[ \alpha_i \Big( \max_{1\le j\le n} \mathbf{w}^\top\mathbf{x}_j^- - \mathbf{w}^\top\mathbf{x}_i^+ \Big) - \ell^*(\alpha_i) \Big], \qquad (15)$$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ are the dual variables.

Let $\Delta_n = \{\, \beta \in \mathbb{R}^n : \beta \ge 0, \; \mathbf{1}^\top\beta = 1 \,\}$ be the standard $n$-simplex; we have

$$\max_{1\le j\le n} \mathbf{w}^\top\mathbf{x}_j^- \;=\; \max_{\beta \in \Delta_n} \; \sum_{j=1}^{n} \beta_j \, \mathbf{w}^\top\mathbf{x}_j^-. \qquad (16)$$

By substituting (16) into (15), the optimization problem becomes

$$\min_{\mathbf{w}} \; \max_{\alpha \in \Omega_\ell^m, \; \beta \in \Delta_n} \;\; \frac{\lambda}{2}\|\mathbf{w}\|^2 + \frac{1}{m}\sum_{i=1}^{m} \Big[ \alpha_i \Big( \sum_{j=1}^{n} \beta_j \, \mathbf{w}^\top\mathbf{x}_j^- - \mathbf{w}^\top\mathbf{x}_i^+ \Big) - \ell^*(\alpha_i) \Big]. \qquad (17)$$

By a suitable rescaling of the dual variables, (17) can be equivalently rewritten as

s.t. | (18)

where rescaled dual variables replace the originals, the simplex constraint on $\beta$ is replaced with a nonnegativity constraint, and the equality constraint $\mathbf{1}^\top\alpha = \mathbf{1}^\top\beta$ is added to keep the two problems equivalent.

Since the objective of (18) is convex in $\mathbf{w}$ and jointly concave in $\alpha$ and $\beta$, and its feasible domain is convex, it satisfies the strong max-min property (bv-cvx), so the min and max can be swapped. After swapping min and max, we first consider the inner minimization subproblem over $\mathbf{w}$, that is,

where the term that does not depend on $\mathbf{w}$ is omitted. This is an unconstrained quadratic programming problem, whose solution is

and whose minimal value is given as

Then, by considering the maximization over $\alpha$ and $\beta$, we obtain the conclusion of Theorem 1 (after negating the objective function so that the maximization becomes a minimization).

### 4.3 Proof of Theorem 3

For the convenience of analysis, we consider the constrained version of the optimization problem in (6), that is,

(19)

where the feasible domain bounds the norm of $\mathbf{w}$, and its size plays a role similar to that of the regularization parameter $\lambda$ in (6).

First, we introduce the Lipschitz constant of the truncated quadratic loss on this domain, and define the following two functions based on it:

The lemma below relates the empirical counterpart of with the loss .

###### Lemma 1

With probability at least $1-\delta$, for any $\mathbf{w}$, we have

where

(20) |

###### Proof.

For any , we define two instance sets by splitting , that is

For , we define

Using Talagrand’s inequality, in particular its variant (specifically, the Bousquet bound) with improved constants derived in (Bousquet024) (see also Koltchinskii11, Chapter 2), we have, with probability at least $1-\delta$,

(21) |

We now bound each term on the right-hand side of (21). First, we bound the first term as

(22)

where the $\sigma_i$’s are Rademacher random variables; the first inequality utilizes the contraction property of Rademacher complexity, and the last follows from the Cauchy-Schwarz inequality and Jensen’s inequality. Next, we bound the second term, that is,

(23)

By putting (22) and (23) into (21) and using the fact that

we thus have, with probability $1-\delta$,

Using the union bound, we obtain

where the complexity term is given in (20). Thus, with probability $1-\delta$, it follows that

Therefore, we can obtain the conclusion based on this fact. ∎

###### Proof of Theorem 3.

Let $\mathcal{W}_\epsilon$ be a proper $\epsilon$-net of the feasible domain, and let $N(\epsilon)$ be the corresponding covering number. According to a standard result, we have

By using a concentration inequality and the union bound over the net, we have, with probability at least $1-\delta$,

(24)

For any feasible $\mathbf{w}$, there exists an element of the net within distance $\epsilon$ of it; it holds that
