Variance Reduced Stochastic Proximal Algorithm for AUC Maximization

Stochastic gradient descent has been widely studied with classification accuracy as a performance measure. However, these stochastic algorithms cannot be used directly with non-decomposable pairwise performance measures such as the Area under the ROC curve (AUC), a common metric when the classes are imbalanced. Several algorithms have been proposed for optimizing AUC, one of the most recent being a stochastic proximal gradient algorithm (SPAM). The downside of stochastic methods, however, is that they suffer from high variance, leading to slower convergence. To combat this issue, several variance reduced methods with faster convergence guarantees than vanilla stochastic gradient descent have been proposed. Again, these variance reduced methods are not directly applicable when non-decomposable performance measures are used. In this paper, we develop a Variance Reduced Stochastic Proximal algorithm for AUC Maximization (VRSPAM) and perform a theoretical as well as an empirical analysis showing that our algorithm converges faster than SPAM, the previous state-of-the-art for the AUC maximization problem.


1 Introduction

Classification accuracy is a commonly used performance measure to evaluate a classifier. However, this measure is not suitable in the presence of class imbalance, i.e., when one class occurs much more frequently than the other [elkan2001foundations]. To overcome this drawback, the Area under the ROC curve (AUC) [hanley1982meaning, bradley1997use, fawcett2006introduction] is used as a standard metric for quantifying the performance of a classifier. AUC measures the ability of a family of classifiers to correctly rank a randomly selected positive example above a randomly selected negative example.

There have been several algorithms for AUC maximization in the batch setting, where all the training data is assumed to be available at the beginning [rakotomamonjy2004optimizing, herschtal2004optimising, zhang2012smoothing, joachims2005support]. However, this assumption is unrealistic in several cases, especially for streaming data analysis. Several online algorithms have been proposed for such settings, where the per-iteration complexity is low [bottou2004large, srebro2010stochastic, shalev2012online, hazan2012projection, rakhlin2011making, orabona2014simultaneous]. Although online algorithms have been thoroughly explored for classification accuracy, where the loss decomposes over individual examples, the case of maximizing AUC as a performance measure has been studied only recently [zhao2011online, wang2012generalization, kar2013generalization]. The main challenge in the AUC maximization framework is that at each step the algorithm needs to pair the current datapoint with all previously seen datapoints, leading to $O(td)$ space and time complexity at step $t$, where $d$ is the dimension of the instance space. The problem is only slightly alleviated by the technique of buffering [zhao2011online, wang2012generalization, kar2013generalization], as good generalization performance depends on having a large buffer size. Recently, [palaniappan2016stochastic] provided a primal-dual algorithm by extending stochastic variance reduced algorithms (SVRG, SAGA) to handle non-decomposable losses or regularizers (in the form of a convex-concave saddle point problem) and thereby obtained a linear convergence rate. Although this can be applied to AUC optimization with the least-squares loss, their algorithm needs to assume strong convexity of both the primal and dual variables. Their algorithm also has an expensive per-iteration complexity of $O(nd)$, where $n$ is the number of data points and $d$ is the dimension.

Recent works take a different approach by reformulating the AUC maximization problem with the least-squares loss. [ying2016stochastic] reformulated it as a saddle point problem and gave an algorithm with a convergence rate of $O(1/\sqrt{T})$. However, they only consider smooth penalty terms such as the Frobenius norm, and their convergence rate is still sub-optimal compared to the $O(1/T)$ rate that stochastic gradient descent (SGD) achieves with classification accuracy as a performance measure. [natole2018stochastic] then proposed a stochastic proximal algorithm for AUC maximization (SPAM) which, under the assumption of strong convexity, achieves a convergence rate of $O(1/T)$, has a per-iteration complexity of $O(d)$ (i.e., one datapoint per iteration), and is applicable to general non-smooth regularization terms. However, due to the inherent variance of random sampling, SGD needs a decaying step size of $O(1/t)$, which leads to a slower sublinear convergence rate. Thus, SGD has low per-iteration complexity but slow convergence, versus high per-iteration complexity and fast convergence for full gradient descent. As a result, SGD might take long to reach a good approximation of the solution of the optimization problem.

In the context of classification accuracy, several popular methods have been proposed to reduce the variance of SGD, such as SAG [roux2012stochastic], SDCA [shalev2013stochastic], and SVRG [johnson2013accelerating]. One issue with SAG and SDCA is that they require the storage of all gradients and all dual variables, respectively. On the other hand, SVRG enjoys the same fast convergence rates as SDCA and SAG but has a much simpler analysis and does not require storage of gradients. This allows SVRG to be applicable to complex problems where the storage of all gradients would be infeasible, unlike SAG and SDCA.

Since SVRG is applicable only to smooth strongly convex functions, several works have explored ways to handle a regularizer term in addition to the average of smooth component functions. Two simple strategies are the Proximal Full Gradient and the Proximal Stochastic Gradient methods. While the Proximal Stochastic Gradient method is much cheaper per iteration, since it computes only the gradient of a single component function, it converges much more slowly than the Proximal Full Gradient method. Proximal gradient methods can be viewed as a special case of splitting methods [lions1979splitting, chen1997convergence, bauschke2011convex, tseng2000modified, beck2008fast]. However, neither proximal method fully exploits the problem structure. Proximal SVRG [xiao2014proximal] is an extension of the SVRG [johnson2013accelerating] technique and can be used whenever the objective function is composed of two terms: the first is an average of smooth functions (it decomposes across the individual instances) and the second admits a simple proximal mapping. Prox-SVRG needs far fewer iterations than the proximal full and stochastic gradient methods to achieve the same approximation ratio. However, there is an important gap that has not been addressed yet: existing techniques that guarantee faster convergence by controlling the variance are not directly applicable to non-decomposable loss functions, as in the problem of AUC optimization, and this is the gap that we close in this paper.

In this paper, we present the Variance Reduced Stochastic Proximal algorithm for AUC Maximization (VRSPAM). VRSPAM applies the standard SVRG variance reduction technique to the SPAM algorithm, which is a proximal stochastic gradient descent applied to a convex surrogate of the AUC maximization problem. We provide a theoretical analysis of VRSPAM showing that it achieves a linear convergence rate with a fixed step size, improving on SPAM, which has a sub-linear convergence rate and a constantly decreasing step size. Moreover, the theoretical analysis in this paper is much simpler than that of SPAM. We also perform numerical experiments showing that VRSPAM converges faster than SPAM.

The organization of the remainder of the paper is as follows. In Section 2, we briefly state the AUC optimization problem and state the equivalent formulation that is necessary for our algorithmic analysis. In Section 3, we discuss our algorithm for faster AUC optimization with variance reduction and do a thorough convergence analysis of it in Section 4. In Section 5, we perform experiments on a suite of UCI datasets to show our proposed algorithm indeed converges faster than the state-of-the-art algorithms for AUC optimization. We conclude in Section 6 with some potential avenues for future research.

2 AUC formulation

The AUC score associated with a linear scoring function $w$ is defined as the probability that the score of a randomly chosen positive example is higher than that of a randomly chosen negative example [hanley1982meaning, clemenccon2008ranking], and is denoted by $\mathrm{AUC}(w)$. If $z = (x, y)$ and $z' = (x', y')$ are drawn independently from an unknown distribution, then

$$\mathrm{AUC}(w) = \Pr\big(w^\top x \ge w^\top x' \mid y=1,\, y'=-1\big) = \mathbb{E}\big[\,\mathbb{I}_{[w^\top(x-x')\ge 0]} \mid y=1,\, y'=-1\big]$$
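On a finite sample, the population quantity above is estimated by counting correctly ordered positive-negative pairs. A minimal sketch in Python (the function name and data are illustrative, not from the paper's MATLAB code):

```python
import numpy as np

def empirical_auc(w, X, y):
    """Fraction of (positive, negative) pairs ranked correctly by the score w^T x."""
    scores = X @ w
    pos = scores[y == 1]
    neg = scores[y == -1]
    # Compare every positive score against every negative score.
    return float(np.mean(pos[:, None] >= neg[None, :]))
```

A scoring direction that separates the classes perfectly attains an AUC of 1, while random scores give roughly 0.5.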

Since $\mathrm{AUC}(w)$ in the above form is not convex because of the 0-1 loss inside the expectation, it is common practice to replace it with a convex surrogate loss. In this paper, we focus on the least-squares loss, which is consistent, unlike some other choices such as the hinge loss. The objective for AUC maximization is:

$$\min_{w\in\mathbb{R}^d}\;\Big\{\, p(1-p)\,\mathbb{E}\big[(1 - w^\top(x-x'))^2 \mid y=1,\, y'=-1\big] + \Omega(w) \Big\} \tag{1}$$

Let $f(w) = p(1-p)\,\mathbb{E}\big[(1 - w^\top(x-x'))^2 \mid y=1, y'=-1\big]$, so that the minimization problem above can be written as $\min_w f(w) + \Omega(w)$. Here, $p = \Pr(y=1)$ and $1-p$ are the class priors and $\Omega$ is the convex regularizer. Throughout this paper we assume:

• $\Omega$ is $\beta$-strongly convex, i.e., for any $w, w'$: $\Omega(w) \ge \Omega(w') + \langle \partial\Omega(w'),\, w - w' \rangle + \frac{\beta}{2}\|w - w'\|^2$.

• There exists $M > 0$ such that $\|x\| \le M$ for any $x$.

In this paper we use the Frobenius norm $\Omega(w) = \frac{\beta}{2}\|w\|^2$ and the elastic net $\Omega(w) = \frac{\beta}{2}\|w\|^2 + \nu\|w\|_1$ as the convex regularizers, where $\beta, \nu$ are regularization parameters.

The minimization problem in equation (1) can be reformulated so that stochastic gradient descent can be performed to find the optimum. Below is the equivalent formulation from Theorem 1 in [natole2018stochastic]:

$$\min_{w,a,b}\;\max_{\zeta\in\mathbb{R}}\;\mathbb{E}\big[F(w,a,b,\zeta;\, z)\big] + \Omega(w)$$

where the expectation is with respect to $z = (x, y)$ and

$$F(w,a,b,\zeta;\, z) = (1-p)(w^\top x - a)^2\,\mathbb{I}[y=1] + p(w^\top x - b)^2\,\mathbb{I}[y=-1] + 2(1+\zeta)\,w^\top x\,\big(p\,\mathbb{I}[y=-1] - (1-p)\,\mathbb{I}[y=1]\big) - p(1-p)\zeta^2$$

Thus, $f(w) = \mathbb{E}_z\big[F(w, a(w), b(w), \zeta(w);\, z)\big]$. [natole2018stochastic] also show that the optimal choices for $a$, $b$, and $\zeta$ given $w$ are

$$a(w) = w^\top\mathbb{E}[x \mid y=1], \qquad b(w) = w^\top\mathbb{E}[x \mid y=-1], \qquad \zeta(w) = w^\top\big(\mathbb{E}[x' \mid y'=-1] - \mathbb{E}[x \mid y=1]\big)$$

An important point is that we differentiate the objective only with respect to $w$ (as is also the case for SPAM) and do not compute gradients with respect to the other parameters, which themselves depend on $w$ and are updated in closed form. This is the reason why existing variance reduction methods cannot be applied directly.
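On a finite sample, these conditional expectations reduce to class-wise means, so the auxiliary variables can be refreshed in closed form whenever $w$ changes. A small sketch (function name hypothetical):

```python
import numpy as np

def saddle_params(w, X, y):
    """Closed-form optimal a(w), b(w), zeta(w) for the current w,
    using empirical class-conditional means of x."""
    mean_pos = X[y == 1].mean(axis=0)   # estimate of E[x | y = +1]
    mean_neg = X[y == -1].mean(axis=0)  # estimate of E[x' | y' = -1]
    a = float(w @ mean_pos)
    b = float(w @ mean_neg)
    zeta = float(w @ (mean_neg - mean_pos))
    return a, b, zeta
```

By construction $\zeta(w) = b(w) - a(w)$, which is a handy sanity check.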

3 Method

The major issue that slows down convergence for SGD is the decay of the step size towards zero as the iterations increase; this is necessary to mitigate the variance introduced by random sampling. We follow the method of Prox-SVRG closely on the reformulation of AUC to derive the proximal SVRG algorithm for AUC maximization given in Algorithm 1. We store a snapshot $\tilde{w}$ after every $m$ Prox-SGD iterations that is progressively closer to the optimum $w^*$ (essentially an estimate of $w^*$). The full gradient

$$\tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n} G(\tilde{w}, z_i)$$

is computed whenever $\tilde{w}$ gets updated, i.e., after every $m$ Prox-SGD iterations, and is used in the subsequent gradient updates. The next $m$ iterations are initialized with $w_0 = \tilde{w}$. At each iteration, we randomly pick $i_t \in \{1, \ldots, n\}$ and compute

$$\hat{w}_t = w_{t-1} - \eta\, v_t$$

where $v_t = G(w_{t-1}, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu}$ and $\eta$ is the step size, and then the proximal step is taken:

$$w_t = \mathrm{prox}_{\eta\Omega}(\hat{w}_t)$$

Notice that taking the expectation of $G(\tilde{w}, z_{i_t})$ with respect to $i_t$ gives $\tilde{\mu}$. Now, taking the expectation of $v_t$ with respect to $i_t$ conditioned on $w_{t-1}$, we get:

$$\mathbb{E}[v_t] = \mathbb{E}\big[G(w_{t-1}, z_{i_t})\big] - \mathbb{E}\big[G(\tilde{w}, z_{i_t})\big] + \tilde{\mu} = \frac{1}{n}\sum_{i=1}^{n} G(w_{t-1}, z_i)$$

Hence the modified direction $v_t$ is a stochastic gradient of $f$ at $w_{t-1}$, just like $G(w_{t-1}, z_{i_t})$. However, its variance $\mathbb{E}\|v_t - \mathbb{E}[v_t]\|^2$ can be much smaller. We will show in Section 4 that the following inequality holds:

$$\mathbb{E}\|v_t - \mathbb{E}[v_t]\|^2 \le 2(8M^2)^2\,\|w_t - w^*\|^2 + 2(8M^2)^2\,\|\tilde{w} - w^*\|^2$$

From the above, when both $w_t$ and $\tilde{w}$ converge to $w^*$, the variance goes to 0. Therefore, by using a constant step size $\eta$ we can achieve a better convergence rate. This is thus a multi-stage scheme that explicitly reduces the variance of the modified proximal gradient.
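The scheme above can be sketched as a generic prox-SVRG loop. Here `grad_fn(w, z)` stands in for the AUC gradient $G(w, z)$ (which also refreshes $a$, $b$, $\zeta$ in closed form), and the proximal step shown is for the Frobenius-norm regularizer $\Omega(w) = \frac{\beta}{2}\|w\|^2$; all names and defaults are illustrative, not the paper's released code:

```python
import numpy as np

def vrspam_sketch(grad_fn, data, w0, eta, beta, m, n_stages, seed=0):
    """Multi-stage variance-reduced proximal loop (prox-SVRG skeleton).
    grad_fn(w, z) plays the role of G(w, z); the prox shown is for
    Omega(w) = (beta/2)*||w||^2, i.e. simple shrinkage by 1/(1 + eta*beta)."""
    rng = np.random.default_rng(seed)
    n = len(data)
    prox = lambda u: u / (1.0 + eta * beta)
    w_tilde = w0
    for _ in range(n_stages):                      # outer loop: snapshot stages
        # Full gradient at the snapshot (computed once per stage).
        mu = np.mean([grad_fn(w_tilde, z) for z in data], axis=0)
        w = w_tilde
        for _ in range(m):                         # inner loop: cheap stochastic steps
            z = data[rng.integers(n)]
            v = grad_fn(w, z) - grad_fn(w_tilde, z) + mu  # variance-reduced direction
            w = prox(w - eta * v)
        w_tilde = w                                # new snapshot
    return w_tilde
```

With a constant step size, the correction term drives the variance of $v_t$ to zero as both the iterate and the snapshot approach the optimum, which is what permits the geometric rate shown in Section 4.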

4 Convergence Analysis

In this section we formally analyse the convergence rate of VRSPAM. We first state some lemmas used in proving Theorem 1, the main theorem establishing the geometric convergence of Algorithm 1. The first is Lemma 1 from [natole2018stochastic], which states that $G(w_t, z_t)$ is an unbiased estimator of the true gradient. As we are not calculating the true gradient in VRSPAM, we need the following lemma to prove the convergence result.

Lemma 1 ([natole2018stochastic]).

Let $\{w_t\}$ be given by VRSPAM in Algorithm 1. Then, we have

$$\partial f(w_t) = \mathbb{E}_{z_t}\big[\partial_w F(w_t, a(w_t), b(w_t), \zeta(w_t);\, z_t)\big]$$

This lemma is directly applicable to VRSPAM since its proof hinges on the formulation of the objective function and not on the specifics of the algorithm.

The next lemma, also from [natole2018stochastic], provides an upper bound on the norm of the difference of the gradients at different iterates.

Lemma 2 ([natole2018stochastic]).

Let $G$ be as described above. Then, we have

$$\|G(w_{t'};\, z) - G(w_t;\, z)\| \le 8M^2\,\|w_{t'} - w_t\|$$

Proof.

$$\|G(w_{t'};\, z) - G(w_t;\, z)\| \le 4M^2 p\,\|w_{t'} - w_t\|\,\mathbb{I}[y=-1] + 4M^2(1-p)\,\|w_{t'} - w_t\|\,\mathbb{I}[y=1] + \cdots \le 8M^2\,\|w_{t'} - w_t\|$$

The proof follows directly by writing out the difference and using the second assumption, the boundedness of $\|x\|$ by $M$. ∎

We now present and prove a result that will be needed to show convergence in Theorem 1.

Lemma 3.

Let $C = \frac{1 + 128M^4\eta^2}{(1+\eta\beta)^2}$ and $D = \frac{128M^4\eta^2}{(1+\eta\beta)^2}$. If $\eta \le \frac{\beta}{128M^4}$, then $C^m + DC\,\frac{C^m - 1}{C - 1} \le 1$ holds true.

Proof.

$$C^m + DC\,\frac{C^m - 1}{C - 1} \le 1 \;\Rightarrow\; DC\,\frac{C^m - 1}{C - 1} \le 1 - C^m \;\Rightarrow\; D \le \frac{1 - C}{C}$$

Substituting the values of $C$ and $D$ and using the condition that $(1+\eta\beta)^2 \ge 1$, we get

$$128M^4\eta^2 \le \frac{(1+\eta\beta)^2 - (1 + 128M^4\eta^2)}{1 + 128M^4\eta^2} \;\Rightarrow\; 128M^4\eta^2 + (128M^4\eta^2)^2 \le (\eta\beta)^2 + 2\eta\beta - 128M^4\eta^2$$
$$\Rightarrow\; 128M^4\eta^2\,(2 + 128M^4\eta^2) \le \eta\beta\,(2 + \eta\beta) \;\Rightarrow\; 128M^4\eta^2 \le \eta\beta \;\Rightarrow\; \eta \le \frac{\beta}{128M^4}$$
∎

The following is the main theorem of this paper, stating the convergence rate of Algorithm 1.

Theorem 1.

Consider VRSPAM (Algorithm 1) and let $C$ and $D$ be as in Lemma 3. If $\eta \le \frac{\beta}{128M^4}$, then the following holds:

$$\alpha = C^m + DC\,\frac{C^m - 1}{C - 1} < 1$$

and we have geometric convergence in expectation:

$$\mathbb{E}\big[\|w_s - w^*\|^2\big] \le \alpha^s\,\mathbb{E}\big[\|w_0 - w^*\|^2\big]$$

To prove this theorem, we first upper bound the variance of the gradient step and show that it approaches zero as $w_t$ approaches $w^*$.

Bounding the variance

The bound on the variance of the modified gradient is given by the following theorem:

Theorem 2.

Consider VRSPAM (Algorithm 1). Then the variance of the update $v_t$ is upper bounded as:

$$\mathbb{E}\big[\|v_t - \mathbb{E}[v_t]\|^2\big] \le 2(8M^2)^2\,\|w_t - w^*\|^2 + 2(8M^2)^2\,\|\tilde{w} - w^*\|^2$$
Proof.

Let the variance-reduced update be denoted $v_t = G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu}$. Using $\|a+b\|^2 \le 2\|a\|^2 + 2\|b\|^2$, the variance of $v_t$ can be bounded as below:

$$\mathbb{E}\big[\|G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\|^2\big] \le 2\,\mathbb{E}\big[\|G(w_t, z_{i_t}) - G(w^*, z_{i_t})\|^2\big] + 2\,\mathbb{E}\big[\|G(w^*, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\|^2\big]$$

Also, from Lemma 1 and using the property that $\mathbb{E}\|\xi - \mathbb{E}[\xi]\|^2 \le \mathbb{E}\|\xi\|^2$, we get

$$\mathbb{E}\big[\|G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\|^2\big] \le 2\,\mathbb{E}\big[\|G(w_t, z_{i_t}) - G(w^*, z_{i_t})\|^2\big] + 2\,\mathbb{E}\big[\|G(w^*, z_{i_t}) - G(\tilde{w}, z_{i_t})\|^2\big]$$

From Lemma 2, we have $\|G(w_t, z_{i_t}) - G(w^*, z_{i_t})\| \le 8M^2\|w_t - w^*\|$ and $\|G(w^*, z_{i_t}) - G(\tilde{w}, z_{i_t})\| \le 8M^2\|\tilde{w} - w^*\|$. Using these, we can upper bound the variance of the gradient step as:

$$\mathbb{E}\big[\|v_t - \mathbb{E}[v_t]\|^2\big] \le 2(8M^2)^2\,\|w_t - w^*\|^2 + 2(8M^2)^2\,\|\tilde{w} - w^*\|^2 \tag{2}$$

This is the desired result. ∎

At convergence, $w_t \to w^*$ and $\tilde{w} \to w^*$. Thus the variance of the updates is bounded and goes to zero as the algorithm converges, whereas in the SPAM algorithm the variance of the gradient does not go to zero, as it is a stochastic gradient descent based algorithm.
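The contrast can be seen numerically with a stand-in quadratic objective (not the AUC gradient): at the optimum, plain stochastic gradients still fluctuate across samples, while the variance-reduced direction is exactly constant once the iterate and the snapshot coincide with $w^*$:

```python
import numpy as np

rng = np.random.default_rng(3)
z = rng.uniform(0.5, 2.0, size=100)
grad = lambda w, i: 2 * z[i] * (z[i] * w - 1)   # gradient of (z_i * w - 1)^2
w_star = z.sum() / (z ** 2).sum()               # minimizer of the averaged loss

mu = np.mean([grad(w_star, i) for i in range(100)])        # full gradient at snapshot
sgd_dirs = np.array([grad(w_star, i) for i in range(100)])
vr_dirs = np.array([grad(w_star, i) - grad(w_star, i) + mu for i in range(100)])
print(sgd_dirs.var() > 0.0, vr_dirs.var() == 0.0)  # True True
```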

We now present the proof of Theorem 1.

Proof of Theorem 1

From the first-order optimality condition, we can directly write

$$w^* = \mathrm{prox}_{\eta\Omega}\big(w^* - \eta\,\partial f(w^*)\big)$$

Using the above, we can write

$$\|w_{t+1} - w^*\|^2 = \big\|\mathrm{prox}_{\eta\Omega}(\hat{w}_{t+1}) - \mathrm{prox}_{\eta\Omega}\big(w^* - \eta\,\partial f(w^*)\big)\big\|^2$$

By Proposition 23.11 from [bauschke2011convex], $\mathrm{prox}_{\eta\Omega}$ is firmly non-expansive; since $\Omega$ is $\beta$-strongly convex, for any $u$ and $w$ (using the Cauchy-Schwarz inequality) we obtain

$$\|\mathrm{prox}_{\eta\Omega}(u) - \mathrm{prox}_{\eta\Omega}(w)\| \le \frac{1}{1+\eta\beta}\,\|u - w\|$$

From the above we get

$$\|w_{t+1} - w^*\|^2 \le \frac{1}{(1+\eta\beta)^2}\,\big\|\hat{w}_{t+1} - \big(w^* - \eta\,\partial f(w^*)\big)\big\|^2 \le \frac{1}{(1+\eta\beta)^2}\,\big\|(w_t - w^*) - \eta\big(G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\big)\big\|^2$$

Taking expectation on both sides, we get

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \frac{1}{(1+\eta\beta)^2}\Big(\eta^2\,\mathbb{E}\big[\|G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\|^2\big] + \mathbb{E}\big[\|w_t - w^*\|^2\big] - 2\eta\,\mathbb{E}\big[\langle w_t - w^*,\; G(w_t, z_{i_t}) - G(\tilde{w}, z_{i_t}) + \tilde{\mu} - \partial f(w^*)\rangle\big]\Big) \tag{3}$$

We first bound the last term in equation (3). Using Lemma 1, we can write

$$T = \mathbb{E}\big[\langle w_t - w^*,\; \mathbb{E}_{z_t}[G(w_t, z_{i_t})] - \mathbb{E}_{z_t}[G(\tilde{w}, z_{i_t})] + \tilde{\mu} - \partial f(w^*)\rangle\big] = \mathbb{E}\big[\langle w_t - w^*,\; \partial f(w_t) - \partial f(w^*)\rangle\big] \ge 0$$

Now, $\mathbb{E}\|w_{t+1} - w^*\|^2$ can be bounded using the above bound and Theorem 2 as below:

$$\mathbb{E}\|w_{t+1} - w^*\|^2 \le \frac{1}{(1+\eta\beta)^2}\Big(\mathbb{E}\big[\|w_t - w^*\|^2\big] + 2(8M^2)^2\eta^2\,\mathbb{E}\big[\|w_t - w^*\|^2\big] + 2(8M^2)^2\eta^2\,\mathbb{E}\big[\|\tilde{w} - w^*\|^2\big]\Big) \le \frac{1 + 128M^4\eta^2}{(1+\eta\beta)^2}\,\mathbb{E}\big[\|w_t - w^*\|^2\big] + \frac{128M^4\eta^2}{(1+\eta\beta)^2}\,\mathbb{E}\big[\|\tilde{w} - w^*\|^2\big]$$

Let $C = \frac{1 + 128M^4\eta^2}{(1+\eta\beta)^2}$ and $D = \frac{128M^4\eta^2}{(1+\eta\beta)^2}$. Then, after $m$ inner iterations, with $w_0 = \tilde{w} = w_{s-1}$ and $w_s = w_m$,

$$\mathbb{E}\|w_s - w^*\|^2 \le C^m\,\mathbb{E}\|w_{s-1} - w^*\|^2 + \sum_{i=1}^{m} D\,C^i\,\mathbb{E}\|w_{s-1} - w^*\|^2 \le \Big(C^m + DC\,\frac{C^m - 1}{C - 1}\Big)\,\mathbb{E}\|w_{s-1} - w^*\|^2 = \alpha\,\mathbb{E}\|w_{s-1} - w^*\|^2$$

where $\alpha = C^m + DC\,\frac{C^m - 1}{C - 1}$ is the decay parameter and $\alpha < 1$ by Lemma 3. After $s$ steps of the outer loop of Algorithm 1, we get $\mathbb{E}\|w_s - w^*\|^2 \le \alpha^s\,\mathbb{E}\|w_0 - w^*\|^2$ with $\alpha < 1$. Hence, we obtain geometric convergence, which is much stronger than the sublinear convergence obtained in [natole2018stochastic]. In the next section we derive the time complexity of the algorithm and investigate the dependence of $\alpha$ on the problem parameters.
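The contraction property of the proximal map used in the proof can be checked directly for the Frobenius-norm regularizer, whose prox is closed-form shrinkage (the numeric values are illustrative):

```python
import numpy as np

eta, beta = 0.1, 2.0
prox = lambda u: u / (1.0 + eta * beta)  # prox of Omega(w) = (beta/2)*||w||^2

rng = np.random.default_rng(1)
u, v = rng.normal(size=5), rng.normal(size=5)
lhs = np.linalg.norm(prox(u) - prox(v))
bound = np.linalg.norm(u - v) / (1.0 + eta * beta)
print(lhs <= bound + 1e-12)  # True: the map is a 1/(1+eta*beta)-contraction
```

For this quadratic regularizer the contraction bound is attained with equality; for general strongly convex $\Omega$ it holds as an inequality.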

Complexity analysis

To get $\mathbb{E}\|w_s - w^*\|^2 \le \epsilon$, the number of outer iterations required is

$$s \ge \frac{\log\big(\mathbb{E}\|w_0 - w^*\|^2 / \epsilon\big)}{\log(1/\alpha)}$$

At each stage, the number of gradient evaluations is $n + m$, where $n$ is the number of samples and $m$ is the number of iterations in the inner loop, so the overall complexity is $O\big((n+m)\log(1/\epsilon)\big)$, i.e., Algorithm 1 takes $O\big((n+m)\log(1/\epsilon)\big)$ gradient evaluations to achieve an accuracy of $\epsilon$. Here, the complexity depends on $M$ and $\beta$, as $\alpha$ itself depends on $M$ and $\beta$.
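The stage count implied by the geometric rate can be read off directly; a small helper (name and arguments hypothetical):

```python
import math

def stages_needed(alpha, init_gap, eps):
    """Smallest s with alpha**s * init_gap <= eps, from
    E||w_s - w*||^2 <= alpha^s * E||w_0 - w*||^2."""
    return math.ceil(math.log(init_gap / eps) / math.log(1.0 / alpha))
```

For example, with $\alpha = 0.1$ it takes only 3 stages to shrink an initial squared distance of 1 below $10^{-3}$.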

Now we find the dependence of $\alpha$ and $m$ on $M$ and $\beta$. Let $\eta = \frac{\theta\beta}{128M^4}$ where $\theta \in (0, 1]$. Then

$$C = \frac{1 + 128M^4\eta^2}{(1+\eta\beta)^2} = \frac{1 + \frac{\theta^2\beta^2}{128M^4}}{\big(1 + \frac{\theta\beta^2}{128M^4}\big)^2} \le \frac{1 + \frac{\theta\beta^2}{128M^4}}{\big(1 + \frac{\theta\beta^2}{128M^4}\big)^2} = \frac{1}{1 + \frac{\theta\beta^2}{128M^4}} = E$$

therefore $C \le E < 1$ and $C^m \le E^m$. Using the above, we can simplify $\alpha$ as

$$\alpha = C^m + DC\,\frac{1 - C^m}{1 - C}$$

In the above equation, only $C^m$ depends on $m$; if we choose $m$ sufficiently large, then $C^m \approx 0$ and $\alpha \approx \frac{DC}{1-C}$. Since $1 - C \ge 1 - E = \frac{\theta\beta^2/(128M^4)}{1 + \theta\beta^2/(128M^4)}$, choosing $m = O(M^4/\beta^2)$ suffices to make $\alpha$ a constant independent of $m$. Thus the time complexity of the algorithm is $O\big((n + M^4/\beta^2)\log(1/\epsilon)\big)$. As this order has an inverse dependency on $\beta$ and grows with $M$, an increase in $M$ results in an increase in the number of iterations, i.e., as the maximum norm $M$ of the training samples increases, a larger $m$ is required to reach $\epsilon$ accuracy.

Now we compare the time complexity of our algorithm with that of SPAM. First, we find the time complexity of SPAM. We use Theorem 3 from [natole2018stochastic], which states that SPAM achieves the following:

$$\mathbb{E}\big[\|w_{T+1} - w^*\|^2\big] \le \frac{t_0}{T}\,\mathbb{E}\big[\|w_{t_0} - w^*\|^2\big] + c\,\frac{\log T}{T}$$

where $t_0$ is a fixed initial index, $T$ is the number of iterations and $c$ is a constant. Through the averaging scheme developed by [lacoste2012simpler], the following can be obtained:

$$\mathbb{E}\big[\|\bar{w}_{T+1} - w^*\|^2\big] \le \frac{t_0}{T}\,\mathbb{E}\big[\|w_{t_0} - w^*\|^2\big] + \frac{c'}{T} \tag{4}$$

where $\bar{w}_{T+1}$ is the averaged iterate and $c'$ is a constant. Using equation (4), the time complexity of SPAM can be written as $O(1/\epsilon)$, i.e., SPAM takes $O(1/\epsilon)$ iterations to achieve $\epsilon$ accuracy. Thus, SPAM has lower per-iteration complexity but a slower convergence rate than VRSPAM. Therefore, VRSPAM takes less time to reach a good approximation of the solution.

5 Experiment

Here we empirically compare VRSPAM with other existing algorithms for AUC maximization. We use two variants of our proposed algorithm depending on the regularizer used:

• VRSPAM-L2 (Frobenius norm regularizer $\Omega(w) = \frac{\beta}{2}\|w\|^2$)

• VRSPAM-NET (elastic net regularizer [zou2005regularization] $\Omega(w) = \frac{\beta}{2}\|w\|^2 + \nu\|w\|_1$). The proximal step for the elastic net is given by $\mathrm{prox}_{\eta\Omega}(u) = \frac{1}{1+\eta\beta}\,\mathrm{sign}(u)\max(|u| - \eta\nu,\, 0)$, applied coordinate-wise.

VRSPAM is compared with SPAM, SOLAM [ying2016stochastic] and the one-pass AUC optimization algorithm (OPAUC) [gao2013one]. SOLAM was modified to use the Frobenius norm regularizer (as in [natole2018stochastic]). VRSPAM is compared against OPAUC with the least-squares loss.
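For the elastic net, the proximal step decomposes coordinate-wise into soft-thresholding followed by shrinkage; a minimal sketch:

```python
import numpy as np

def prox_elastic_net(u, eta, beta, nu):
    """prox_{eta*Omega}(u) for Omega(w) = (beta/2)*||w||^2 + nu*||w||_1:
    soft-threshold each coordinate by eta*nu, then shrink by 1/(1 + eta*beta)."""
    return np.sign(u) * np.maximum(np.abs(u) - eta * nu, 0.0) / (1.0 + eta * beta)
```

The l1 part zeroes out small coordinates (sparsity), while the quadratic part keeps the map a contraction, which is what the convergence analysis relies on.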

All datasets are publicly available from [chang2011libsvm] and [frank2010uci]. Some of the datasets are multiclass; we convert them to binary by numbering the classes and assigning all even labels to one class and all odd labels to the other. The results are the mean AUC score and standard deviation over 20 runs on each dataset. Each dataset was divided into training and test sets with 80% and 20% of the data, respectively. The regularization parameters for VRSPAM-L2 and VRSPAM-NET are chosen by 5-fold cross validation on the training set. All the code is implemented in MATLAB and will be released upon publication. We measured the computational time of the algorithms using an Intel i7 CPU with a clock speed of 3538 MHz.

• Variance results: In the left column of Figure 1, we show the variance of the VRSPAM update ($v_t$) in comparison with the variance of the SPAM update ($G(w_t, z_t)$). We observe that the variance of VRSPAM is lower than that of SPAM and decreases to its minimum value faster, in line with Theorem 2.

• Convergence results: In the right column of Figure 1, we show the performance of our algorithm compared to existing methods for AUC maximization. We observe that VRSPAM converges to the maximum value faster than the other methods, and in some cases this maximum value itself is higher for VRSPAM.

Note that the initial weights of VRSPAM are set to the output generated by SPAM after 1 iteration, a practice similar to [johnson2013accelerating].

Table 1 summarizes the AUC evaluation for the different algorithms. AUC values for SPAM-L2, SPAM-NET, SOLAM and OPAUC were taken from [natole2018stochastic].

6 Conclusion

In this paper, we propose a variance reduced stochastic proximal algorithm for AUC maximization (VRSPAM). We theoretically analyze the proposed algorithm and derive a much faster convergence rate of $O(\alpha^s)$ with $\alpha < 1$ (linear convergence), improving upon the state-of-the-art [natole2018stochastic], which has a sub-linear convergence rate, for strongly convex objective functions with a per-iteration complexity of one datapoint. We also showed empirically that VRSPAM converges faster than other methods for AUC maximization.
