Top Rank Optimization in Linear Time

Bipartite ranking aims to learn a real-valued ranking function that orders positive instances before negative instances. Recent work on bipartite ranking focuses on optimizing ranking accuracy at the top of the ranked list. Most existing approaches either optimize task-specific metrics or extend the ranking loss by placing more emphasis on errors among the top-ranked instances, leading to a high computational cost that is super-linear in the number of training instances. We propose a highly efficient approach, titled TopPush, for optimizing accuracy at the top that has computational complexity linear in the number of training instances. We present a novel analysis that bounds the generalization error for the top-ranked instances for the proposed approach. An empirical study shows that the proposed approach is highly competitive with state-of-the-art approaches and is 10-100 times faster.

1 Introduction

Bipartite ranking aims to learn a real-valued ranking function that places positive instances above negative instances. It has attracted much attention because of its applications in areas such as information retrieval and recommender systems (Rendle:2009:LOR; Liu11). In the past decades, many ranking methods have been developed for bipartite ranking, and most of them are essentially based on pairwise ranking. These algorithms reduce the ranking problem to a binary classification problem by treating each positive-negative instance pair as a single object to be classified (Herbrich00; Freund-JMLR03; Burges-ICML05; ValizadeganJZM09; Usunier09; Rudin:2009; Agarwal11; BoydCMR12). Since the number of instance pairs can grow quadratically in the number of training instances, these methods incur high computational costs, which prevents them from scaling to large datasets.

Since in applications such as document retrieval and recommender systems only the top-ranked instances will be examined by users, there has been a growing interest in learning ranking functions that perform especially well at the top of the ranked list (Clemencon07; BoydCMR12). In the literature, most of the existing methods can be classified into two groups. The first group maximizes the ranking accuracy at the top of the ranked list by optimizing task-specific metrics (Joachims05; Le07; Li:13; xu13), such as average precision (AP) (Yue:2007), NDCG (ValizadeganJZM09) and partial AUC (NarasimhanA-ICML13; NarasimhanA-KDD13). The main limitation of these methods is that they often result in non-convex optimization problems that are difficult to solve efficiently. Structural SVM (Tsochantaridis05) addresses this issue by translating the non-convexity into an exponential number of constraints. It can still be computationally challenging because it usually requires searching for the most violated constraint at each iteration of optimization. In addition, these methods are statistically inconsistent (Tewari:2007; Le07), and thus often lead to suboptimal solutions. The second group of methods is based on pairwise ranking. They design special convex loss functions that place more penalty on the ranking errors related to the top-ranked instances, for example, by weighting (Usunier09) or by exploiting special functions such as the $p$-norm (Rudin:2009) and the infinity norm (Agarwal11). Since these methods are essentially based on pairwise ranking, their computational costs are usually proportional to the number of positive-negative instance pairs, making them unattractive for large datasets.

In this paper, we address the computational challenge of bipartite ranking by designing a ranking algorithm, named TopPush, that can efficiently optimize the ranking accuracy at the top. The key feature of the proposed TopPush algorithm is that its time complexity is only linear in the number of training instances. This is in contrast to most existing methods for bipartite ranking, whose computational costs depend on the number of instance pairs. Moreover, we develop a novel analysis for bipartite ranking. One shortcoming of the existing theoretical studies (Rudin:2009; Agarwal11) on bipartite ranking is that they try to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds. We overcome this limitation by bounding the probability of ranking a positive instance before most negative instances, and show that TopPush is effective in placing positive instances at the top of a ranked list. An extensive empirical study shows that TopPush is computationally more efficient than most ranking algorithms, and yields performance comparable to the state-of-the-art approaches that maximize the ranking accuracy at the top.

The rest of this paper is organized as follows. Section 2 introduces the preliminaries of bipartite ranking, and addresses the difference between AUC optimization and maximizing accuracy at the top. Section 3 presents the proposed TopPush algorithm and its key theoretical properties. Section 4 gives proofs and technical details. Section 5 summarizes the empirical study, and Section 6 concludes this work with future directions.

2 Bipartite Ranking: AUC vs Accuracy at the Top

Let $\mathcal{X} \subseteq \mathbb{R}^d$ be the instance space. Let $S = S_+ \cup S_-$ be a set of training instances, where $S_+ = \{x_i^+\}_{i=1}^{m}$ and $S_- = \{x_j^-\}_{j=1}^{n}$ include $m$ positive instances and $n$ negative instances independently sampled from distributions $P_+$ and $P_-$, respectively. The goal of bipartite ranking is to learn a ranking function $f: \mathcal{X} \to \mathbb{R}$ that is likely to place a positive instance before most negative ones. In the literature, bipartite ranking has found applications in many domains, and its theoretical properties have been examined by several studies (for example, Agarwal-JMLR05; Clemencon08; KotlowskiDH11; Narasimhan-NIPS13).

AUC is a commonly used evaluation metric for bipartite ranking (Hanley82; CortesNIPS03). By exploiting its equivalence to the Wilcoxon-Mann-Whitney statistic (Hanley82), many ranking algorithms have been developed to optimize AUC by minimizing the ranking loss defined as

$$ L_{\mathrm{rank}}(f;S) = \frac{1}{mn}\sum_{i=1}^{m}\sum_{j=1}^{n} \mathbb{I}\big(f(x_i^+) \le f(x_j^-)\big), \tag{1} $$

where $\mathbb{I}(\cdot)$ is the indicator function with $\mathbb{I}(\text{true}) = 1$ and $0$ otherwise. Other than a few special loss functions such as the exponential and logistic loss (Rudin:2009; KotlowskiDH11), most of these methods need to enumerate all the positive-negative instance pairs, making them unattractive for large datasets. Various methods have been developed to address this computational challenge. For example, in recent years, ZhaoHJY11 and Gao13 respectively studied online and one-pass AUC optimization.

In recent literature, there is a growing interest in optimizing accuracy at the top of the ranked list (Clemencon07; BoydCMR12). Maximizing AUC is not suitable for this goal, as indicated by the analysis in (Clemencon07). To address this challenge, we propose to maximize the number of positive instances that are ranked before the first negative instance, which is known as positives at the top (Rudin:2009; Agarwal11; BoydCMR12). We can translate this objective into the minimization of the following loss

$$ L(f;S) = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\Big(f(x_i^+) \le \max_{1\le j\le n} f(x_j^-)\Big), \tag{2} $$

which computes the fraction of positive instances ranked below the top-ranked negative instance. By minimizing the loss in (2), we essentially push negative instances away from the top of the ranked list, leading to more positive ones placed at the top. We note that (2) is fundamentally different from AUC optimization, as AUC does not focus on the ranking accuracy at the top. This can be seen from the relationship between the loss functions (1) and (2), as summarized below.

Proposition 2. Let $S$ be a dataset consisting of $m$ positive instances and $n$ negative instances, and $f: \mathcal{X} \to \mathbb{R}$ be a ranking function. Then we have

$$ L_{\mathrm{rank}}(f;S) \le L(f;S) \le \min\big(n\,L_{\mathrm{rank}}(f;S),\, 1\big). \tag{3} $$

The proof of this proposition is deferred to Section 4.1. According to Proposition 2, if the ranking loss $L_{\mathrm{rank}}(f;S)$ is greater than $1/n$, which is common in practice, the loss $L(f;S)$ can be as large as one, implying that no positive instance is ranked above the top-ranked negative instance. This is clearly not what we want, and it indicates that our goal of maximizing positives at the top cannot be achieved by AUC optimization, consistent with the theoretical analysis in (Clemencon07). Meanwhile, $L(f;S)$ is an upper bound on the ranking loss $L_{\mathrm{rank}}(f;S)$; thus by minimizing $L(f;S)$, a small ranking loss can be expected, which also benefits AUC optimization. This constitutes the main motivation of the current work.
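As a small illustration of the gap between the two losses, the following sketch evaluates (1) and (2) on a hand-made score list and checks the inequality (3). The function names and the toy scores are assumptions made for this example only; they do not come from the paper.

```python
def ranking_loss(pos_scores, neg_scores):
    """Loss (1): fraction of positive-negative pairs with the positive scored
    at or below the negative."""
    m, n = len(pos_scores), len(neg_scores)
    errors = sum(1 for p in pos_scores for q in neg_scores if p <= q)
    return errors / (m * n)


def top_loss(pos_scores, neg_scores):
    """Loss (2): fraction of positives scored at or below the top negative."""
    top_negative = max(neg_scores)
    return sum(1 for p in pos_scores if p <= top_negative) / len(pos_scores)


# Hypothetical scores: a single high-scoring negative sits near the top.
pos = [0.9, 0.8, 0.4, 0.3]
neg = [0.85, 0.2, 0.1, 0.05, 0.0]

l_rank = ranking_loss(pos, neg)
l_top = top_loss(pos, neg)
print(l_rank, l_top)
```

Here one badly placed negative costs only three of the twenty pairs in (1), but it puts three of the four positives below the top negative in (2), so a small ranking loss does not imply good accuracy at the top.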

To design practical learning algorithms, we replace the indicator function in (2) with its convex surrogate, leading to the following loss function

$$ L_\ell(f;S) = \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} f(x_j^-) - f(x_i^+)\Big), \tag{4} $$

where $\ell(\cdot)$ is a convex surrogate loss function that is non-decreasing and differentiable. (In this paper, we let $\ell(\cdot)$ be non-decreasing to simplify the formulation of the dual problem.) Examples of such loss functions include the truncated quadratic loss $\ell(z) = [1+z]_+^2$, the exponential loss $\ell(z) = e^{z}$, and the logistic loss $\ell(z) = \log(1+e^{z})$. In the discussion below, we restrict ourselves to the truncated quadratic loss, even though most of our analysis applies to other loss functions.

It is easy to verify that the loss function in (4) is equivalent to the loss used in InfinitePush (Agarwal11) (a special case of the $p$-norm Push (Rudin:2009))

$$ L_\ell^\infty(f;S) = \max_{1\le j\le n} \frac{1}{m}\sum_{i=1}^{m} \ell\big(f(x_j^-) - f(x_i^+)\big). \tag{5} $$

The apparent advantage of employing $L_\ell(f;S)$ instead of $L_\ell^\infty(f;S)$ is that it only needs to evaluate $\ell$ on $m$ positive-negative instance pairs, whereas the latter needs to enumerate all $mn$ instance pairs. As a result, the number of dual variables induced by $L_\ell(f;S)$ is $m+n$, linear in the number of training instances, which is significantly smaller than $mn$, the number of dual variables induced by $L_\ell^\infty(f;S)$ (see Agarwal11; Rakotomamonjy12). It is this difference that lets the proposed algorithm achieve a computational complexity linear in the number of training instances and therefore be more efficient than most state-of-the-art algorithms for bipartite ranking.
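The equivalence of (4) and (5) for a non-decreasing loss can be checked numerically. The sketch below is illustrative only: the function names and the toy scores are assumptions, and the three losses are written in their standard forms with a non-decreasing argument.

```python
import math


def truncated_quadratic(z):
    return max(0.0, 1.0 + z) ** 2


def exponential(z):
    return math.exp(z)


def logistic(z):
    return math.log1p(math.exp(z))


def top_push_loss(pos_scores, neg_scores, loss=truncated_quadratic):
    """Loss (4): m loss evaluations, each against the top-scored negative."""
    top_negative = max(neg_scores)
    return sum(loss(top_negative - p) for p in pos_scores) / len(pos_scores)


def infinite_push_loss(pos_scores, neg_scores, loss=truncated_quadratic):
    """Loss (5): maximum over negatives of the average loss (m*n evaluations)."""
    m = len(pos_scores)
    return max(sum(loss(q - p) for p in pos_scores) / m for q in neg_scores)


# Hypothetical scores used only to compare the two formulas.
pos = [1.2, 0.4, -0.3]
neg = [0.9, 0.1, -0.5, -1.0]
for loss in (truncated_quadratic, exponential, logistic):
    print(loss.__name__, top_push_loss(pos, neg, loss), infinite_push_loss(pos, neg, loss))
```

Because each loss is non-decreasing, the negative with the largest score attains the maximum in (5) for every positive at once, which is why the two values coincide while (4) needs far fewer evaluations.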

3 TopPush for Optimizing Top Accuracy

In this section, we first present a learning algorithm to minimize the loss function in (4), and then analyze the computational complexity and performance guarantee of the proposed algorithm.

3.1 Dual Formulation

We consider linear ranking functions, that is, $f(x) = w^\top x$, where $w \in \mathbb{R}^d$ is the weight vector to be learned. For nonlinear ranking functions, we can use kernel methods; the Nyström method and random Fourier features can transform the kernelized problem into a linear one (see YangLMJZ12 for more discussion on this topic). As a result, the learning problem is given by the following optimization problem

$$ \min_{w} \ \frac{\lambda}{2}\|w\|^2 + \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} w^\top x_j^- - w^\top x_i^+\Big), \tag{6} $$

where $\lambda > 0$ is a regularization parameter.

Directly minimizing the objective in (6) can be challenging because of the max operator in the loss function. We address this challenge by developing a dual formulation for (6). Specifically, given a convex and differentiable function $\ell(z)$, we can rewrite it in its convex conjugate form as

$$ \ell(z) = \max_{\alpha \in \Omega} \ \alpha z - \ell_*(\alpha), $$

where $\ell_*(\alpha)$ is the convex conjugate of $\ell(z)$ and $\Omega$ is the domain of the dual variable (bv-cvx). For example, the convex conjugate of the truncated quadratic loss is

$$ \ell_*(\alpha) = -\alpha + \alpha^2/4 \quad \text{with} \quad \Omega = \mathbb{R}_+ . $$
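The conjugate form can be verified directly for the truncated quadratic loss $\ell(z) = [1+z]_+^2$: the maximizer over $\alpha \ge 0$ is $\alpha = 2[1+z]_+$. The following sketch is only a numerical sanity check; the function names are assumptions for this example.

```python
def truncated_quadratic(z):
    return max(0.0, 1.0 + z) ** 2


def conjugate(alpha):
    """Convex conjugate of the truncated quadratic loss, valid for alpha >= 0."""
    return -alpha + alpha ** 2 / 4.0


def loss_via_conjugate(z):
    """Evaluate max over alpha >= 0 of alpha*z - conjugate(alpha) in closed form."""
    alpha = 2.0 * max(0.0, 1.0 + z)  # stationary point, clipped to alpha >= 0
    return alpha * z - conjugate(alpha)


for z in (-3.0, -1.0, -0.5, 0.0, 0.7, 2.0):
    print(z, truncated_quadratic(z), loss_via_conjugate(z))
```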

We note that the dual form has been widely used to improve computational efficiency (Sun:2010) and to connect different styles of learning algorithms (Kanamori:2013). Here we exploit this technique to overcome the difficulty caused by the max operator. The dual form of (6) is given in the following theorem, whose detailed proof is deferred to Section 4.2.

Theorem 1

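Define $X^+ = (x_1^+, \ldots, x_m^+)^\top$ and $X^- = (x_1^-, \ldots, x_n^-)^\top$. The dual problem of the problem in (6) is

$$ \min_{(\alpha,\beta)\in\Xi} \ g(\alpha,\beta) = \frac{1}{2\lambda m}\big\|\alpha^\top X^+ - \beta^\top X^-\big\|^2 + \sum_{i=1}^{m} \ell_*(\alpha_i), \tag{7} $$

where $\alpha$ and $\beta$ are dual variables, and the domain $\Xi$ is defined as

$$ \Xi = \big\{\alpha \in \mathbb{R}_+^m,\ \beta \in \mathbb{R}_+^n :\ \mathbf{1}_m^\top \alpha = \mathbf{1}_n^\top \beta \big\}. \tag{8} $$

Let $\alpha^*$ and $\beta^*$ be the optimal solution to the dual problem in (7). Then, the optimal solution to the primal problem in (6) is given by

$$ w^* = \frac{1}{\lambda m}\big(\alpha^{*\top} X^+ - \beta^{*\top} X^-\big). \tag{9} $$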

The key feature of the dual problem in (7) is that the number of dual variables is $m+n$. This is in contrast to the InfinitePush algorithm (Agarwal11), which introduces $mn$ dual variables. In addition, the objective function in (7) is smooth if the convex conjugate $\ell_*(\cdot)$ is smooth, which is true for many common loss functions (e.g., truncated quadratic loss, exponential loss and logistic loss). It is well known in the optimization literature that an $O(1/T^2)$ convergence rate can be achieved if the objective function is smooth, where $T$ is the number of iterations. This smoothness is what allows us to design an efficient learning algorithm.
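To make the objects in Theorem 1 concrete, the sketch below evaluates the dual objective (7), maps a dual point to a weight vector through (9), and evaluates the primal objective (6) for the truncated quadratic loss. All function names and the toy data are assumptions for illustration. Following the derivation in Section 4.2, for any feasible $(\alpha, \beta)$ and any $w$ the primal value is at least $-g(\alpha,\beta)/m$, which the example checks.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nu_vector(alpha, beta, X_pos, X_neg):
    """The d-dimensional vector alpha^T X+ - beta^T X-."""
    d = len(X_pos[0])
    return [sum(a * x[k] for a, x in zip(alpha, X_pos))
            - sum(b * x[k] for b, x in zip(beta, X_neg))
            for k in range(d)]


def dual_objective(alpha, beta, X_pos, X_neg, lam):
    """Objective (7) with the truncated quadratic conjugate -a + a^2/4."""
    nu = nu_vector(alpha, beta, X_pos, X_neg)
    return dot(nu, nu) / (2.0 * lam * len(X_pos)) + sum(-a + a * a / 4.0 for a in alpha)


def primal_from_dual(alpha, beta, X_pos, X_neg, lam):
    """Mapping (9) from dual variables to a weight vector."""
    nu = nu_vector(alpha, beta, X_pos, X_neg)
    return [v / (lam * len(X_pos)) for v in nu]


def primal_objective(w, X_pos, X_neg, lam):
    """Objective (6) with the truncated quadratic loss."""
    top = max(dot(w, x) for x in X_neg)
    loss = sum(max(0.0, 1.0 + top - dot(w, x)) ** 2 for x in X_pos) / len(X_pos)
    return 0.5 * lam * dot(w, w) + loss


# Hypothetical toy data and a feasible dual point (sum(alpha) == sum(beta)).
X_pos = [[1.0, 0.2], [0.8, -0.1]]
X_neg = [[-1.0, 0.1], [0.1, 0.0], [-0.5, 0.3]]
alpha, beta, lam = [1.0, 0.5], [0.5, 1.0, 0.0], 0.5

g = dual_objective(alpha, beta, X_pos, X_neg, lam)
w = primal_from_dual(alpha, beta, X_pos, X_neg, lam)
print(g, w, primal_objective(w, X_pos, X_neg, lam))
```

Note that the dual point has $m+n = 5$ entries here, regardless of how many positive-negative pairs exist.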

3.2 Linear Time Bipartite Ranking Algorithm

According to Theorem 1, to learn a ranking function $f(x) = w^\top x$, it is sufficient to learn the dual variables $\alpha$ and $\beta$ by solving the problem in (7). For this purpose, we adopt the accelerated gradient method because of its light computation per iteration. Since we are pushing positive instances before the top-ranked negative, we refer to the resulting algorithm as TopPush.

3.2.1 Efficient Optimization

We choose Nesterov's method (Nesterov03; Nemirovski94), which achieves an optimal convergence rate for smooth objective functions. One of the key features of Nesterov's method is that, besides the solution sequence $\{(\alpha_k, \beta_k)\}$, it also maintains a sequence of auxiliary solutions, which is introduced to exploit the smoothness of the objective function and achieve a faster convergence rate. Its step size depends on the smoothness of the objective function; in the current work, we adopt Nemirovski's line search scheme (Nemirovski94) to estimate the smoothness parameter. Other schemes, such as the one developed in (Liu:2009), can also be used.

Algorithm 1 summarizes the steps of the TopPush algorithm. At each iteration, the gradients of the objective function can be efficiently computed as

$$ \nabla_\alpha g(\alpha,\beta) = \frac{X^+ \nu^\top}{\lambda m} + \ell_*'(\alpha), \qquad \nabla_\beta g(\alpha,\beta) = -\frac{X^- \nu^\top}{\lambda m}, \tag{10} $$

where $\nu = \alpha^\top X^+ - \beta^\top X^-$ and $\ell_*'(\cdot)$ is the derivative of $\ell_*(\cdot)$. Note that the problem in (7) is a constrained optimization problem; therefore, at each step of gradient mapping, we have to project the dual solution onto the domain $\Xi$ (that is, in step 9) to keep it feasible. Below, we discuss how to solve this projection step efficiently.

3.2.2 Projection Step

For clarity of notation, we expand the projection step into the problem

$$ \min_{\alpha \ge 0,\ \beta \ge 0} \ \frac{1}{2}\|\alpha - \alpha^0\|^2 + \frac{1}{2}\|\beta - \beta^0\|^2 \quad \text{s.t.} \quad \mathbf{1}_m^\top \alpha = \mathbf{1}_n^\top \beta, \tag{11} $$

where $\alpha^0$ and $\beta^0$ are the solutions to be projected. We note that similar projection problems have been studied in (Shalev-Shwartz:2006; Liu-ICML09), but these approaches either have $O((m+n)\log(m+n))$ time complexity or only provide approximate solutions. Instead, based on the following proposition, we provide a method that finds the exact solution to (11) in $O(m+n)$ time.

Proposition 3.2.2. The optimal solution to the projection problem in (11) is given by

$$ \alpha^* = [\alpha^0 - \gamma^*]_+ \quad \text{and} \quad \beta^* = [\beta^0 + \gamma^*]_+ , $$

where $\gamma^*$ is the unique root of the function

$$ \rho(\gamma) = \sum_{i=1}^{m} [\alpha_i^0 - \gamma]_+ - \sum_{j=1}^{n} [\beta_j^0 + \gamma]_+ . \tag{12} $$

The proof of this proposition is similar to that of (Liu-ICML09, Theorem 2), and is thus omitted here. According to Proposition 3.2.2, the key to solving the projection problem is to find the root of $\rho(\gamma)$. Instead of approximating the solution via bisection as in (Liu-ICML09), we develop a different scheme to get the exact solution, as follows.

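For a given value of $\gamma$, define two index sets

$$ I(\gamma) = \{ i : \alpha_i^0 > \gamma \} \quad \text{and} \quad J(\gamma) = \{ j : \beta_j^0 \ge -\gamma \}, $$

then the function in (12) can be rewritten as

$$ \rho(\gamma) = \sum_{i \in I(\gamma)} \alpha_i^0 - \sum_{j \in J(\gamma)} \beta_j^0 - \big(|I(\gamma)| + |J(\gamma)|\big)\gamma . \tag{13} $$

Also, define

$$ U = \{\alpha_i^0 : 1 \le i \le m\} \cup \{-\beta_j^0 : 1 \le j \le n\}, $$

and let $u_{(k)}$ denote its $k$-th order statistic, that is, $u_{(1)} \le u_{(2)} \le \cdots \le u_{(m+n)}$. It can be seen that for a given $k$ and any $\gamma$ in the interval $[u_{(k)}, u_{(k+1)})$, it holds that

$$ I(\gamma) = I(u_{(k)}) \quad \text{and} \quad J(\gamma) = J(u_{(k)}) . $$

Thus, from (13), if the interval $[u_{(k)}, u_{(k+1)})$ contains the root of $\rho(\gamma)$, the root can be exactly computed as

$$ \gamma^* = \frac{\sum_{i \in I(u_{(k)})} \alpha_i^0 - \sum_{j \in J(u_{(k)})} \beta_j^0}{|I(u_{(k)})| + |J(u_{(k)})|} . \tag{14} $$

Consequently, since $\rho$ is non-increasing, the task can be reduced to finding $k$ such that $\rho(u_{(k)}) \ge 0$ and $\rho(u_{(k+1)}) < 0$.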

Inspired by (DuchiSSC08), we devise a divide-and-conquer procedure based on a modification of the randomized median finding algorithm (Cormen01, Chapter 9), which is summarized in Algorithm 2. In particular, it maintains a set of unprocessed elements of $U$, namely those whose position relative to the needed element is not yet known. (To make the updating of partial sums efficient, in practice two separate sets are maintained for the elements that come from $\alpha^0$ and from $\beta^0$, and the unprocessed set is their union; the subsets produced by partitioning are handled in a similar manner.) On each round, we pick an element at random from the unprocessed set and partition the set into two subsets, which contain the elements that are respectively greater than and less than the picked element. Then, by evaluating the function $\rho$ in (13), we replace the unprocessed set with the subset that contains the needed element and discard the other. The process ends when the unprocessed set is empty. Afterwards, we compute the exact optimal $\gamma^*$ as in (14) and perform the projection as described in Proposition 3.2.2. In addition, for efficiency, we keep track of the partial sums in (13) along the process so that they are not recalculated. By an analysis similar to that of the randomized median finding algorithm, Algorithm 2 has expected linear time complexity.
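The projection can be sketched with a simpler exact method than the randomized procedure described above. The version below is an assumption-laden illustration, not Algorithm 2: it sorts the breakpoints and uses binary search, so it runs in $O((m+n)\log(m+n))$ time rather than expected linear time. It locates the interval between consecutive breakpoints that contains the root of $\rho$, then solves the linear piece exactly using the index sets evaluated at the interval midpoint (which sidesteps ties at the endpoints). The names `rho` and `project` are hypothetical.

```python
def rho(gamma, alpha0, beta0):
    """Function (12); continuous, piecewise linear and non-increasing in gamma."""
    return (sum(max(0.0, a - gamma) for a in alpha0)
            - sum(max(0.0, b + gamma) for b in beta0))


def project(alpha0, beta0):
    """Exact Euclidean projection of (alpha0, beta0) onto the domain (8)."""
    u = sorted(list(alpha0) + [-b for b in beta0])  # breakpoints of rho
    lo, hi = 0, len(u) - 1                          # rho(u[0]) >= 0 always holds
    while lo < hi:                                  # largest k with rho(u[k]) >= 0
        mid = (lo + hi + 1) // 2
        if rho(u[mid], alpha0, beta0) >= 0.0:
            lo = mid
        else:
            hi = mid - 1
    gamma = u[lo]                                   # root if lo is the last breakpoint
    if lo < len(u) - 1:
        # rho is linear on [u[lo], u[lo + 1]]; take the active sets at the midpoint.
        c = 0.5 * (u[lo] + u[lo + 1])
        active_a = [a for a in alpha0 if a > c]
        active_b = [b for b in beta0 if b > -c]
        gamma = (sum(active_a) - sum(active_b)) / (len(active_a) + len(active_b))
    alpha = [max(0.0, a - gamma) for a in alpha0]
    beta = [max(0.0, b + gamma) for b in beta0]
    return alpha, beta, gamma


alpha, beta, gamma = project([0.5, 2.0, -1.0], [0.3, 0.0])
print(alpha, beta, gamma)
```

The returned vectors are nonnegative and have equal sums, and a point that is already feasible is returned unchanged (the root is then $\gamma^* = 0$).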

3.3 Convergence and Computational Complexity

The theorem below states the convergence of the TopPush algorithm, which follows immediately from the convergence result for Nesterov's method (Nemirovski94).

Theorem 2

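Let $\alpha_T$ and $\beta_T$ be the solution output by the TopPush algorithm after $T$ iterations. Then we have

$$ g(\alpha_T, \beta_T) \le \min_{(\alpha,\beta)\in\Xi} g(\alpha,\beta) + \epsilon $$

provided $T \ge O(1/\sqrt{\epsilon})$.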

Finally, the computational cost of each iteration is dominated by the gradient evaluation and the projection step. Since the complexity of the projection step is $O(m+n)$ and the cost of computing the gradient is $O((m+n)d)$, the time complexity of each iteration is $O((m+n)d)$. Combining this result with Theorem 2, the total computational complexity of the TopPush algorithm for finding an $\epsilon$-suboptimal solution is $O((m+n)d/\sqrt{\epsilon})$, which is linear in the number of training instances.
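Putting the pieces together, a compact end-to-end sketch of a TopPush-style solver might look as follows. It is not the paper's Algorithm 1: it assumes the truncated quadratic loss, replaces Nemirovski's line search with a fixed step $1/L$ (where $L$ is a simple upper bound on the smoothness constant of $g$), uses a FISTA-style momentum update, and uses a sort-based exact projection instead of the expected linear-time one. All names and the toy data are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def rho(gamma, a0, b0):
    return sum(max(0.0, a - gamma) for a in a0) - sum(max(0.0, b + gamma) for b in b0)


def project(a0, b0):
    """Exact projection onto the domain (8) (sort plus binary search)."""
    u = sorted(list(a0) + [-b for b in b0])
    lo, hi = 0, len(u) - 1
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if rho(u[mid], a0, b0) >= 0.0:
            lo = mid
        else:
            hi = mid - 1
    gamma = u[lo]
    if lo < len(u) - 1:
        c = 0.5 * (u[lo] + u[lo + 1])
        act_a = [a for a in a0 if a > c]
        act_b = [b for b in b0 if b > -c]
        gamma = (sum(act_a) - sum(act_b)) / (len(act_a) + len(act_b))
    return [max(0.0, a - gamma) for a in a0], [max(0.0, b + gamma) for b in b0]


def nu_vector(alpha, beta, X_pos, X_neg):
    return [sum(a * x[k] for a, x in zip(alpha, X_pos))
            - sum(b * x[k] for b, x in zip(beta, X_neg))
            for k in range(len(X_pos[0]))]


def dual_objective(alpha, beta, X_pos, X_neg, lam):
    nu = nu_vector(alpha, beta, X_pos, X_neg)
    return dot(nu, nu) / (2.0 * lam * len(X_pos)) + sum(-a + a * a / 4.0 for a in alpha)


def primal_objective(w, X_pos, X_neg, lam):
    top = max(dot(w, x) for x in X_neg)
    loss = sum(max(0.0, 1.0 + top - dot(w, x)) ** 2 for x in X_pos) / len(X_pos)
    return 0.5 * lam * dot(w, w) + loss


def top_push(X_pos, X_neg, lam=1.0, iterations=500):
    m, n = len(X_pos), len(X_neg)
    # Upper bound on the Lipschitz constant of the gradient of g.
    L = sum(dot(x, x) for x in X_pos + X_neg) / (lam * m) + 0.5
    alpha, beta = [0.0] * m, [0.0] * n
    s_a, s_b, t = alpha[:], beta[:], 1.0   # auxiliary point and momentum scalar
    for _ in range(iterations):
        nu = nu_vector(s_a, s_b, X_pos, X_neg)
        grad_a = [dot(x, nu) / (lam * m) + a / 2.0 - 1.0 for x, a in zip(X_pos, s_a)]
        grad_b = [-dot(x, nu) / (lam * m) for x in X_neg]
        new_a, new_b = project([a - g / L for a, g in zip(s_a, grad_a)],
                               [b - g / L for b, g in zip(s_b, grad_b)])
        t_next = (1.0 + (1.0 + 4.0 * t * t) ** 0.5) / 2.0
        mom = (t - 1.0) / t_next
        s_a = [p + mom * (p - q) for p, q in zip(new_a, alpha)]
        s_b = [p + mom * (p - q) for p, q in zip(new_b, beta)]
        alpha, beta, t = new_a, new_b, t_next
    w = [v / (lam * m) for v in nu_vector(alpha, beta, X_pos, X_neg)]
    return w, alpha, beta


# Hypothetical toy data: positives lie mostly along the first coordinate.
X_pos = [[1.0, 0.2], [0.8, -0.1], [0.9, 0.4]]
X_neg = [[-1.0, 0.1], [-0.7, -0.3], [-0.9, 0.5], [0.1, 0.0]]
w, alpha, beta = top_push(X_pos, X_neg, lam=1.0)
print(w, primal_objective(w, X_pos, X_neg, 1.0))
```

Each iteration touches every training instance a constant number of times, so the per-iteration cost in this sketch is $O((m+n)d)$ plus the cost of the projection; only the sort inside `project` keeps it from matching the linear-time bound stated above.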

Table 1 compares the computational complexity of TopPush with that of some state-of-the-art ranking algorithms. It is easy to see that TopPush is asymptotically more efficient than these algorithms. (In Table 1, we report the complexity of the partial-AUC SVM of (NarasimhanA-KDD13), which is more efficient than the one in (NarasimhanA-ICML13). In addition, the former is used in experiments, and we do not distinguish between them in this paper.) For instance, it is much more efficient than InfinitePush and its sparse extension L1SVIP, whose complexity depends on the number of positive-negative instance pairs; compared with the SVM variants that handle specific performance metrics via structural SVM, the linear dependence on the number of training instances makes our proposed TopPush algorithm more appealing, especially for large datasets.

3.4 Theoretical Guarantee

We develop a theoretical guarantee for the ranking performance of TopPush. In (Rudin:2009; Agarwal11), the authors have developed margin-based generalization bounds for the loss function $L_\ell^\infty$. One limitation of the analysis in (Rudin:2009; Agarwal11) is that it tries to bound the probability for a positive instance to be ranked before any negative instance, leading to relatively pessimistic bounds. For instance, for the bounds in (Rudin:2009, Theorems 2 and 3), the failure probability can be as large as 1 if the parameter $p$ is large. Our analysis avoids this pitfall by considering the probability of ranking a positive instance before most negative instances.

To this end, we first define $h_b(x, w)$, the probability for any negative instance to be ranked above $x$ using the ranking function $f(x) = w^\top x$, as

$$ h_b(x, w) = \mathbb{E}_{x^- \sim P_-}\big[\mathbb{I}(w^\top x \le w^\top x^-)\big] . $$

Since we are interested in whether positive instances are ranked above most negative instances, we will measure the quality of $f(x) = w^\top x$ by the probability for any positive instance to be ranked below $\delta$ percent of negative instances, that is

$$ P_b(w, \delta) = \Pr_{x^+ \sim P_+}\big(h_b(x^+, w) \ge \delta\big) . $$

Clearly, if a ranking function achieves a high ranking accuracy at the top, it should have a large percentage of positive instances with ranking scores higher than most of the negative instances, leading to a small value of $P_b(w, \delta)$ for a small $\delta$. The following theorem bounds $P_b(w, \delta)$ for TopPush; its proof is given in Section 4.3.

Theorem 3

Given training data consisting of independent samples from and independent samples from , let be the optimal solution to the problem in (6). Assume and , we have, with a probability at least ,

where and

$$ L_\ell(w^*, S) = \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} w^{*\top} x_j^- - w^{*\top} x_i^+\Big) $$

is the empirical loss.

Theorem 3 implies that if the empirical loss , for most positive instance (i.e., ), the percentage of negative instances ranked above is upper bounded by . We observe that and play different roles in the bound. That is, since the empirical loss compares the positive instances to the negative instance with the largest score, it usually grows significantly slower with increasing . For instance, the largest absolute value of Gaussian random samples grows in . Thus, we believe that the main effect of increasing in our bound is to reduce (decrease at the rate of ), especially when is large. Meanwhile, by increasing the number of positive instances , we will reduce the bound for , and consequently increase the chance of finding positive instances at the top.

4 Proofs and Technical Details

In this section, we give all the detailed proofs missing from the main text, along with ancillary remarks and comments.

4.1 AUC vs. Accuracy at the Top

We investigate the relationship between AUC and accuracy at the top by their corresponding loss functions, i.e. the ranking loss in (1) and our loss in (2).

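Proof of Proposition 2.

It is easy to verify that the loss in (2) is equivalent to

$$ L^\infty(f;S) = \max_{1\le j\le n} \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\big(f(x_i^+) \le f(x_j^-)\big) . $$

Define $\kappa_j = \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\big(f(x_i^+) \le f(x_j^-)\big)$, thus we have $0 \le \kappa_j \le 1$, and

$$ L(f;S) = L^\infty(f;S) = \max_{1\le j\le n} \kappa_j , \qquad L_{\mathrm{rank}}(f;S) = \frac{1}{n}\sum_{j=1}^{n} \kappa_j . $$

Since the $\kappa_j$ are nonnegative, their mean is at most their maximum, and their maximum is at most their sum, which is $n$ times the mean. Together with $\max_j \kappa_j \le 1$, this gives (3). ∎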

4.2 Proof of Theorem 1

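Since $\ell(\cdot)$ is a convex loss function that is non-decreasing and differentiable, it can be rewritten in its convex conjugate form, that is

$$ \ell(z) = \max_{\alpha \ge 0} \ \alpha z - \ell_*(\alpha), $$

where $\ell_*(\alpha)$ is the convex conjugate of $\ell(z)$. Hence the problem in (6) can be rewritten as

$$ \min_{w} \max_{\alpha \ge 0} \ \frac{1}{m}\sum_{i=1}^{m} \alpha_i \Big(\max_{1\le j\le n} w^\top x_j^- - w^\top x_i^+\Big) - \frac{1}{m}\sum_{i=1}^{m} \ell_*(\alpha_i) + \frac{\lambda}{2}\|w\|^2 , \tag{15} $$

where $\alpha = (\alpha_1, \ldots, \alpha_m)^\top$ are dual variables.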

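Let $p = (p_1, \ldots, p_n)^\top$ and let $\Delta = \{p \in \mathbb{R}_+^n : \mathbf{1}_n^\top p = 1\}$ be the standard simplex. We have

$$ \max_{1\le j\le n} w^\top x_j^- = \max_{p \in \Delta} \sum_{j=1}^{n} p_j\, w^\top x_j^- . \tag{16} $$

By substituting (16) into (15), the optimization problem becomes

$$ \min_{w} \max_{\alpha \ge 0,\ p \in \Delta} \ \frac{1}{m}\sum_{j=1}^{n} p_j \sum_{i=1}^{m} \alpha_i\, w^\top x_j^- - \frac{1}{m}\sum_{i=1}^{m} \alpha_i\, w^\top x_i^+ - \frac{1}{m}\sum_{i=1}^{m} \ell_*(\alpha_i) + \frac{\lambda}{2}\|w\|^2 . \tag{17} $$

By defining $\beta_j = p_j \sum_{i=1}^{m} \alpha_i$ and then using variable replacement, (17) can be equivalently rewritten as

$$ \min_{w} \max_{\alpha \ge 0,\ \beta \ge 0} \ \frac{1}{m}\Big(\sum_{j=1}^{n} \beta_j\, w^\top x_j^- - \sum_{i=1}^{m} \alpha_i\, w^\top x_i^+\Big) - \frac{1}{m}\sum_{i=1}^{m} \ell_*(\alpha_i) + \frac{\lambda}{2}\|w\|^2 \quad \text{s.t.} \quad \mathbf{1}_m^\top \alpha = \mathbf{1}_n^\top \beta , \tag{18} $$

where $\beta = (\beta_1, \ldots, \beta_n)^\top$ are new variables, and the constraint $p \in \Delta$ is replaced with $\beta \ge 0$ and the equality constraint $\mathbf{1}_m^\top \alpha = \mathbf{1}_n^\top \beta$ to keep the two problems equivalent.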

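The objective of (18) is convex in $w$ and jointly concave in $\alpha$ and $\beta$, and its feasible domain is convex; hence it satisfies the strong max-min property (bv-cvx), and the min and max can be swapped. After swapping min and max, we first consider the inner minimization subproblem over $w$, that is

$$ \min_{w} \ \frac{1}{m}\sum_{j=1}^{n} \beta_j\, w^\top x_j^- - \frac{1}{m}\sum_{i=1}^{m} \alpha_i\, w^\top x_i^+ + \frac{\lambda}{2}\|w\|^2 , $$

where $\frac{1}{m}\sum_{i=1}^{m} \ell_*(\alpha_i)$ is omitted since it does not depend on $w$. This is an unconstrained quadratic programming problem, whose solution is

$$ w^* = \frac{1}{\lambda m}\big(\alpha^\top X^+ - \beta^\top X^-\big) , $$

and the minimal value is given as

$$ -\frac{1}{2\lambda m^2}\big\|\alpha^\top X^+ - \beta^\top X^-\big\|^2 . $$

Then, by considering the maximization over $\alpha$ and $\beta$, we obtain the conclusion of Theorem 1 (after multiplying the objective function by $-m$, which turns the maximization into the minimization in (7)).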

4.3 Proof of Theorem 3

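For the convenience of analysis, we consider the constrained version of the optimization problem in (6), that is

$$ \min_{w \in \mathcal{W}} \ L_\ell(w;S) = \frac{1}{m}\sum_{i=1}^{m} \ell\Big(\max_{1\le j\le n} w^\top x_j^- - w^\top x_i^+\Big) , \tag{19} $$

where $\mathcal{W} = \{w \in \mathbb{R}^d : \|w\| \le \rho\}$ is a domain and $\rho > 0$ specifies the size of the domain, which plays a role similar to that of the regularization parameter $\lambda$ in (6).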

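First, we denote by $G$ the Lipschitz constant of the truncated quadratic loss $\ell(\cdot)$ over the range of values taken by its argument for $w \in \mathcal{W}$, and define the following two functions based on $\ell(\cdot)$, i.e.,

$$ h_\ell(x, w) = \mathbb{E}_{x^- \sim P_-}\big[\ell(w^\top x^- - w^\top x)\big] \quad \text{and} \quad P_\ell(w, \delta) = \Pr_{x^+ \sim P_+}\big(h_\ell(x^+, w) \ge \delta\big) . $$

The lemma below relates the empirical counterpart of $P_\ell$ with the loss $L_\ell(w;S)$.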

Lemma 1

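With probability at least $1 - e^{-t}$, for any $w \in \mathcal{W}$, we have

$$ \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\big(h_\ell(x_i^+, w) \ge \delta\big) \le L_\ell(w, S) , $$

where

$$ \delta = \frac{4G(\rho+1)}{\sqrt{n}} + \frac{5\rho(t + \log m)}{3n} + 2G\rho\sqrt{\frac{2(t + \log m)}{n}} . \tag{20} $$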
Proof.

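For any $w \in \mathcal{W}$, we define two instance sets by splitting $S_+$, that is

$$ A(w) = \Big\{x_i^+ : w^\top x_i^+ > \max_{j\in[n]} w^\top x_j^- + 1\Big\} , \qquad B(w) = \Big\{x_i^+ : w^\top x_i^+ \le \max_{j\in[n]} w^\top x_j^- + 1\Big\} . $$

For $x_i^+ \in A(w)$, we define

$$ \|P - P_n\|_{\mathcal{W}} = \sup_{\|w\| \le \rho} \bigg| h_\ell(x_i^+, w) - \frac{1}{n}\sum_{j=1}^{n} \ell\big(w^\top x_j^- - w^\top x_i^+\big) \bigg| . $$

Using Talagrand's inequality, and in particular its variant (specifically, Bousquet's bound) with improved constants derived in (Bousquet024) (see also Koltchinskii11, Chapter 2), we have, with probability at least $1 - e^{-t}$,

$$ \|P - P_n\|_{\mathcal{W}} \le \mathbb{E}\|P - P_n\|_{\mathcal{W}} + \frac{2t\rho}{3n} + \sqrt{\frac{2t}{n}\Big(\sigma_P^2(\mathcal{W}) + 2\,\mathbb{E}\|P - P_n\|_{\mathcal{W}}\Big)} . \tag{21} $$

We now bound each term on the right-hand side of (21). First, we bound $\mathbb{E}\|P - P_n\|_{\mathcal{W}}$ as

$$ \mathbb{E}\|P - P_n\|_{\mathcal{W}} = \frac{2}{n}\,\mathbb{E}\bigg[\sup_{\|w\|\le\rho} \sum_{j=1}^{n} \sigma_j\, \ell\big(w^\top (x_j^- - x_i^+)\big)\bigg] \le \frac{4G}{n}\,\mathbb{E}\bigg[\sup_{\|w\|\le\rho} \sum_{j=1}^{n} \sigma_j \big(w^\top (x_j^- - x_i^+)\big)\bigg] \le \frac{4G\rho}{\sqrt{n}} , \tag{22} $$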

where the $\sigma_j$'s are Rademacher random variables, the first inequality utilizes the contraction property of Rademacher complexity, and the last follows from the Cauchy-Schwarz inequality and Jensen's inequality. Next, we bound $\sigma_P^2(\mathcal{W})$, that is,

$$ \sigma_P^2(\mathcal{W}) = \sup_{\|w\|\le\rho} h_\ell^2(x, w) \le 4G^2\rho^2 . \tag{23} $$

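By putting (22) and (23) into (21) and using the fact that

$$ \frac{1}{n}\sum_{j=1}^{n} \ell\big(w^\top (x_j^- - x_i^+)\big) = 0 \quad \text{for} \quad x_i^+ \in A(w), $$

we thus have, with probability $1 - e^{-t}$,

$$ |h_\ell(x_i^+, w)| \le \|P - P_n\|_{\mathcal{W}} \le \frac{4G\rho}{\sqrt{n}} + \frac{2t\rho}{3n} + \sqrt{\frac{2t}{n}\Big(4G^2\rho^2 + \frac{8G\rho}{\sqrt{n}}\Big)} \le \frac{4G\rho}{\sqrt{n}} + \frac{2t\rho}{3n} + 2G\rho\sqrt{\frac{2t}{n}} + \frac{4G}{\sqrt{n}} + \frac{t\rho}{n} \le \frac{4G(\rho+1)}{\sqrt{n}} + \frac{5t\rho}{3n} + 2G\rho\sqrt{\frac{2t}{n}} . $$

Using the union bound over all $x_i^+$'s, which replaces $t$ with $t + \log m$, we obtain

$$ \max_{x_i^+ \in A(w)} h_\ell(x_i^+, w) \le \delta , $$

where $\delta$ is given in (20). Thus, with probability $1 - e^{-t}$, it follows that

$$ \sum_{x_i^+ \in A(w)} \mathbb{I}\big(h_\ell(x_i^+, w) \ge \delta\big) = 0 . $$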

Therefore, we can obtain the conclusion based on the fact . ∎

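Based on Lemma 1, we are in a position to prove Theorem 3.

Proof of Theorem 3.

Let $S(\mathcal{W}, \varepsilon)$ be a proper $\varepsilon$-net of $\mathcal{W}$ and $N(\rho, \varepsilon)$ be the corresponding covering number. According to a standard result, we have

$$ \log N(\rho, \varepsilon) \le d \log(9\rho/\varepsilon) . $$

By using a concentration inequality and the union bound over $w' \in S(\mathcal{W}, \varepsilon)$, we have, with probability at least $1 - e^{-t}$,

$$ \sup_{w' \in S(\mathcal{W}, \varepsilon)} P_\ell(w', \delta) - \frac{1}{m}\sum_{i=1}^{m} \mathbb{I}\big(h_\ell(x_i^+, w') \ge \delta\big) \le \sqrt{\frac{2\big(t + d\log(9\rho/\varepsilon)\big)}{m}} . \tag{24} $$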

Let and . For , there exists such that , it holds that

 I(