1 Introduction
Classification accuracy is a commonly used performance measure to evaluate a classifier. However, this measure is not suitable in the presence of class imbalance, i.e., when one class occurs much more frequently than the other
[elkan2001foundations]. To overcome this drawback, the area under the ROC curve (AUC) [hanley1982meaning, bradley1997use, fawcett2006introduction] is used as a standard metric for quantifying the performance of a classifier. AUC measures the ability of a family of classifiers to correctly rank a positive example above a randomly selected negative example. There have been several algorithms for AUC maximization in the batch setting, where all the training data is assumed to be available at the outset [rakotomamonjy2004optimizing, herschtal2004optimising, zhang2012smoothing, joachims2005support]. However, this assumption is unrealistic in several cases, especially for streaming data analysis. Several online algorithms with low per-iteration complexity have been proposed for such settings [bottou2004large, srebro2010stochastic, shalev2012online, hazan2012projection, rakhlin2011making, orabona2014simultaneous]. Although online algorithms have been thoroughly explored for classification accuracy, where the loss decomposes over individual examples, maximizing AUC as the performance measure has been studied only recently [zhao2011online, wang2012generalization, kar2013generalization]. The main challenge in the AUC maximization framework is that at each step the algorithm needs to pair the current datapoint with all previously seen datapoints, leading to $O(td)$ space and time complexity at step $t$, where $d$ is the dimension of the instance space. This problem is only slightly alleviated by the technique of buffering [zhao2011online, wang2012generalization, kar2013generalization], as good generalization performance depends on having a large buffer size. Recently, [palaniappan2016stochastic] provided a primal-dual algorithm by extending stochastic variance reduced algorithms (SVRG, SAGA) to handle non-decomposable losses or regularizers (in the form of a convex-concave saddle point problem) and thereby obtained a linear convergence rate.
Although this can be applied to AUC optimization with the least-squared loss, their algorithm needs to assume strong convexity with respect to both the primal and dual variables. Their algorithm also has an expensive per-iteration complexity of $O(nd)$, where $n$ is the number of data points and $d$ is the dimension.
Recent works take a different approach by reformulating the AUC maximization problem with the least square loss. [ying2016stochastic] reformulated it as a saddle point problem and gave an algorithm with a convergence rate of $O(1/\sqrt{T})$. However, they only consider smooth penalty terms such as the Frobenius norm, and their convergence rate is still suboptimal compared to the $O(1/T)$ rate that stochastic gradient descent (SGD) achieves with classification accuracy as the performance measure. [natole2018stochastic] then proposed a stochastic proximal algorithm for AUC maximization (SPAM) which, under a strong convexity assumption, achieves a convergence rate of $O(1/T)$, has a per-iteration complexity of $O(d)$, i.e., one datapoint, and is applicable to general non-smooth regularization terms. However, due to the inherent variance of random sampling, SGD requires a decaying step size, which leads to a slower, sublinear convergence rate. Thus, SGD offers low per-iteration complexity and slow convergence, versus high per-iteration complexity and fast convergence for full gradient descent; SGD might therefore take long to reach a good approximation of the solution of the optimization problem.
In the context of classification accuracy, several popular methods have been proposed to reduce the variance of SGD, such as SAG [roux2012stochastic], SDCA [shalev2013stochastic], and SVRG [johnson2013accelerating]. One issue with SAG and SDCA is that they require the storage of all the gradients and dual variables, respectively. On the other hand, SVRG enjoys the same fast convergence rates as SDCA and SAG but has a much simpler analysis and does not require storage of gradients. This allows SVRG to be applied to complex problems where storing all the gradients would be infeasible, unlike SAG and SDCA.
Since SVRG is applicable only to smooth strongly convex functions, several works have explored ways to handle a regularizer term in addition to the average of smooth component functions. Two simple strategies are the Proximal Full Gradient and the Proximal Stochastic Gradient methods. While the Proximal Stochastic Gradient method is much cheaper per iteration, since it computes only the gradient of a single component function, it converges much more slowly than the Proximal Full Gradient method. Proximal gradient methods can be viewed as a special case of splitting methods [lions1979splitting, chen1997convergence, bauschke2011convex, tseng2000modified, beck2008fast]. However, neither proximal method fully exploits the problem structure. Proximal SVRG [xiao2014proximal] is an extension of the SVRG [johnson2013accelerating]
technique and can be used whenever the objective function is composed of two terms: the first is an average of smooth functions (it decomposes across the individual instances) and the second admits a simple proximal mapping. Prox-SVRG needs far fewer iterations than the proximal full and stochastic gradient descent methods to achieve the same approximation accuracy. However, an important gap has not been addressed yet: existing techniques that guarantee faster convergence by controlling the variance are not directly applicable to non-decomposable loss functions, as in the problem of AUC optimization, and this is the gap we close in this paper.
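To make the "simple proximal mapping" requirement concrete, here is a minimal sketch in Python. The names `soft_threshold` and `prox_grad_step` are ours, and the L1 norm is chosen only as an example of a regularizer whose proximal map is cheap; a proximal full-gradient step and a proximal stochastic step share this form, differing only in whether `grad` is the full gradient or the gradient of one component function.

```python
def soft_threshold(u, tau):
    """Proximal map of tau * ||.||_1, applied coordinate-wise."""
    return [max(abs(ui) - tau, 0.0) * (1 if ui > 0 else -1) for ui in u]

def prox_grad_step(w, grad, eta, tau):
    """One proximal gradient step: w <- prox_{eta*tau*||.||_1}(w - eta*grad)."""
    return soft_threshold([wi - eta * gi for wi, gi in zip(w, grad)], eta * tau)
```

The same skeleton works for any regularizer with a closed-form proximal map; only `soft_threshold` changes.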
In this paper, we present the Variance Reduced Stochastic Proximal algorithm for AUC Maximization (VRSPAM). VRSPAM applies the standard SVRG variance reduction technique to SPAM, a proximal stochastic gradient descent method applied to a convex surrogate of the AUC maximization problem. We provide a theoretical analysis showing that VRSPAM achieves a linear convergence rate with a fixed step size, improving on SPAM, which has a sublinear convergence rate and a constantly decreasing step size. The theoretical analysis in this paper is also much simpler than that of SPAM. We further perform numerical experiments showing that VRSPAM converges faster than SPAM.
The remainder of the paper is organized as follows. In Section 2, we briefly state the AUC optimization problem and give the equivalent formulation needed for our algorithmic analysis. In Section 3, we present our algorithm for faster AUC optimization with variance reduction, and in Section 4 we give a thorough convergence analysis of it. In Section 5, we perform experiments on a suite of UCI datasets to show that our proposed algorithm indeed converges faster than state-of-the-art algorithms for AUC optimization. We conclude in Section 6 with some potential avenues for future research.
2 AUC formulation
The AUC score associated with a linear scoring function $w$ is defined as the probability that the score of a randomly chosen positive example is higher than that of a randomly chosen negative example
[hanley1982meaning, clemenccon2008ranking]. If $z = (x, y)$ and $z' = (x', y')$ are drawn independently from an unknown distribution, then the AUC of $w$ is the probability that $w^\top x \ge w^\top x'$ given $y = 1$ and $y' = -1$. Since the AUC in this form is not convex because of the $0$-$1$ loss, it is common practice to replace it by a convex surrogate loss. In this paper, we focus on the least square loss, which is consistent, unlike some other choices such as the hinge loss. The following is the objective for AUC maximization:
(1)  $\min_{w} \; p(1-p)\,\mathbb{E}\big[(1 - w^\top(x - x'))^2 \,\big|\, y = 1,\, y' = -1\big] + \Omega(w)$
Let $p$ denote the probability of the positive class, so that $p$ and $1-p$ are the class priors, and let $\Omega(w)$ be the convex regularizer; the function in the above minimization problem can then be written as $f(w)$. Throughout this paper we assume that $\Omega$ is $\beta$-strongly convex, i.e., for any $u, v$,
$$\Omega(u) \ge \Omega(v) + \langle \partial\Omega(v),\, u - v\rangle + \frac{\beta}{2}\|u - v\|^2.$$
In this paper we use the Frobenius norm and the elastic net as the convex regularizers, with corresponding regularization parameters.
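For concreteness, the empirical (unregularized) version of this pairwise least-squares surrogate on a finite sample can be sketched as follows; the function name is ours, and labels are assumed to be $+1$/$-1$.

```python
def pairwise_ls_auc_loss(w, X, y):
    """Empirical least-squares AUC surrogate: the average of
    (1 - w.(x_i - x_j))^2 over all positive/negative pairs (i, j)."""
    dot = lambda a, b: sum(ai * bi for ai, bi in zip(a, b))
    pos = [x for x, yi in zip(X, y) if yi == 1]
    neg = [x for x, yi in zip(X, y) if yi == -1]
    total = 0.0
    for xp in pos:
        for xn in neg:
            margin = dot(w, xp) - dot(w, xn)
            total += (1.0 - margin) ** 2
    return total / (len(pos) * len(neg))
```

Note the double loop over positive/negative pairs: the cost grows with the product of the class sizes, which is exactly the non-decomposability that the reformulation below removes.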
The minimization problem in equation 1 can be reformulated so that stochastic gradient descent can be performed to find the optimum. Below is the equivalent formulation from a theorem in [natole2018stochastic]:
where the expectation is taken with respect to the data distribution, and
Thus, the objective reduces to an expectation over individual datapoints. [natole2018stochastic] also derive the optimal choices for the auxiliary variables in closed form.
An important point is that we differentiate the objective function only with respect to $w$ (as is also the case for SPAM) and do not compute gradients with respect to the other parameters, which themselves depend on $w$. This is why existing methods cannot be applied directly: those other parameters, which also depend on $w$, are instead updated in closed form.
3 Method
The major issue that slows down convergence for SGD is the decay of the step size to zero as the iterations increase, which is necessary to mitigate the variance introduced by random sampling. We closely follow the method of Prox-SVRG on the reformulation of AUC to derive the proximal SVRG algorithm for AUC maximization given in Algorithm 1. After every $m$ ProxSGD iterations we store a snapshot $\tilde{w}$ that is progressively closer to the optimum (essentially an estimate of $w^*$). The full gradient at $\tilde{w}$, denoted $\tilde{\mu}$, is computed whenever $\tilde{w}$ is updated, i.e., after every $m$ ProxSGD iterations, and is used in the subsequent gradient updates; the next $m$ iterations are initialized at $\tilde{w}$. At each inner iteration, we randomly pick an index $i_t$ and compute
the direction $v_t = g_{i_t}(w_{t-1}) - g_{i_t}(\tilde{w}) + \tilde{\mu}$, where $g_i(\cdot)$ denotes the stochastic gradient on datapoint $i$ and $\tilde{\mu}$ is the full gradient at the snapshot $\tilde{w}$; the proximal step is then taken along $v_t$.
Notice that the expectation of $g_{i_t}(\tilde{w})$ with respect to $i_t$ is $\tilde{\mu}$. Hence, taking the expectation of $v_t$ with respect to $i_t$ conditioned on $w_{t-1}$ shows that $v_t$ is an unbiased estimate of the gradient at $w_{t-1}$, just like the plain stochastic gradient $g_{i_t}(w_{t-1})$. However, its variance can be much smaller. We will show in Section 4 that the variance of $v_t$ is bounded by a term that vanishes as $w_{t-1}$ and $\tilde{w}$ approach the optimum $w^*$; the variance therefore goes to 0 at convergence, and by using a constant step size we can achieve a better convergence rate. This is thus a multi-stage scheme that explicitly reduces the variance of the modified proximal gradient.
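The multi-stage scheme described above follows the generic Prox-SVRG template. The sketch below shows that template under our own naming (`grad_i`, `prox`, and the loop parameters are placeholders supplied by the caller, not the paper's exact Algorithm 1):

```python
import random

def prox_svrg(w0, grad_i, n, prox, eta, m, stages):
    """Generic Prox-SVRG skeleton: at each stage, compute the full gradient
    mu at a snapshot w_tilde, then run m inner steps along the
    variance-reduced direction  v = grad_i(w) - grad_i(w_tilde) + mu."""
    w = list(w0)
    for _ in range(stages):
        w_tilde = list(w)
        # full gradient at the snapshot (average over all n datapoints)
        mu = [sum(g) / n for g in zip(*(grad_i(w_tilde, i) for i in range(n)))]
        for _ in range(m):
            i = random.randrange(n)
            gi_w = grad_i(w, i)
            gi_s = grad_i(w_tilde, i)
            v = [a - b + c for a, b, c in zip(gi_w, gi_s, mu)]
            # proximal step with a CONSTANT step size eta
            w = prox([wj - eta * vj for wj, vj in zip(w, v)], eta)
    return w
```

On a one-dimensional quadratic, the correction terms cancel the sampling noise exactly, which illustrates why a constant step size suffices here.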
4 Convergence Analysis
In this section we formally analyse the convergence rate of VRSPAM. We first state some lemmas which will be used in proving Theorem 1, the main theorem establishing the geometric convergence of Algorithm 1. The first is Lemma 1 from [natole2018stochastic], which states that the stochastic gradient is an unbiased estimator of the true gradient. As we are not calculating the true gradient in VRSPAM, we need the following lemma to prove the convergence result.

Lemma 1 ([natole2018stochastic]).
Let be given by VRSPAM in Algorithm 1. Then, we have
This lemma is directly applicable to VRSPAM since its proof hinges on the formulation of the objective function and not on the specifics of the algorithm.
The next lemma, from [natole2018stochastic], provides an upper bound on the norm of the difference of gradients at different time steps.
Lemma 2 ([natole2018stochastic]).
Let be described as above. Then, we have
Proof.
The proof follows directly by writing out the difference and using the second assumption, on boundedness. ∎
We now present and prove a result that will be needed to show convergence in Theorem 1.
Lemma 3.
Let and , if then holds true
Proof.
We start with:
Substituting values of and and using the condition that , we get
∎
The following is the main theorem of this paper stating the convergence rate of Algorithm 1 and its analysis.
Theorem 1.
Consider VRSPAM (Algorithm 1) and let , if , then the following inequality holds true
and we have the geometric convergence in expectation:
To prove the above theorem, we first upper bound the variance of the gradient step and show that it approaches zero as the iterates approach the optimum $w^*$.
Bounding the variance
The bound on the variance of the modified gradient is given by the following theorem:
Theorem 2.
Consider VRSPAM (Algorithm 1); then the variance of the modified gradient is upper bounded as:
Proof.
Let the variance reduced update be denoted as . As we know , the variance of can be written as below
Also, from Lemma 1 and using the property that we get
From Lemma 2, we have and . Using this, we can upper bound the variance of gradient step as:
(2)  
We have the desired result. ∎
At convergence, $w_t \to w^*$ and $\tilde{w} \to w^*$. Thus the variance of the updates is bounded and goes to zero as the algorithm converges. In contrast, for the SPAM algorithm the variance of the gradient does not go to zero, as it is a plain stochastic gradient descent based algorithm.
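To illustrate numerically why the variance of the modified direction vanishes at the optimum while that of the plain stochastic gradient does not, consider the following toy check on one-dimensional quadratics (all names are ours; this mirrors the argument above, not an experiment from the paper):

```python
def gradients_at(w, w_tilde, xs):
    """For f_i(w) = (w - x_i)^2 / 2: plain stochastic gradients, and
    SVRG-corrected gradients g_i(w) - g_i(w_tilde) + full_grad(w_tilde)."""
    mu = sum(w_tilde - x for x in xs) / len(xs)  # full gradient at snapshot
    plain = [w - x for x in xs]
    reduced = [(w - x) - (w_tilde - x) + mu for x in xs]
    return plain, reduced

def variance(vs):
    m = sum(vs) / len(vs)
    return sum((v - m) ** 2 for v in vs) / len(vs)
```

Evaluating both at $w = \tilde{w} = w^*$ (the mean of the data), the plain gradients still scatter around zero while every corrected gradient is exactly zero.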
We now present the proof of Theorem 1.
Proof of Theorem 1
From the first order optimality condition, we can directly write
Using the above we can write
Using Proposition 23.11 from [bauschke2011convex], the proximal mapping is cocoercive; then, for any pair of points, using the Cauchy–Schwarz inequality we can get the following inequality
From above we get
Taking expectation on both sides we get
(3)  
Now, we first bound the last term in equation 3. Using Lemma 1 we can write
This term can now be bounded using the above bound and Theorem 2 as below:
Let and , then after iterations and
where the decay parameter is as defined above, using Lemma 1. After the outer-loop steps of Algorithm 1, we obtain geometric convergence of the iterates, which is much stronger than the convergence obtained in [natole2018stochastic]. In the next section we derive the time complexity of the algorithm and investigate the dependence of the decay parameter on the problem parameters.
Complexity analysis
To reach an accuracy of $\epsilon$, the number of iterations required is
At each stage, the number of gradient evaluations is $O(n+m)$, where $n$ is the number of samples and $m$ is the number of iterations in the inner loop, so the overall complexity is $O((n+m)\log(1/\epsilon))$, i.e., Algorithm 1 takes $O((n+m)\log(1/\epsilon))$ gradient evaluations to achieve an accuracy of $\epsilon$. The complexity depends on $m$ and the decay parameter, as the decay parameter itself depends on the problem parameters.
Now we find the dependence of and on and . Let where , then
therefore and , using the above equations we can simplify as
In the above equation, only depends on ; if we choose to be sufficiently large, then . An important thing to note here is that ; if we choose , then , which is independent of . Thus the time complexity of the algorithm is when . As the order has an inverse dependency on , an increase in will result in an increase in the number of iterations, i.e., as the maximum norm of the training samples increases, a larger number of iterations is required to reach the desired accuracy.
Now we compare the time complexity of our algorithm with that of SPAM. First, we find the time complexity of SPAM using Theorem 3 from [natole2018stochastic], which states that SPAM achieves the following:
where , is the number of iterations, and is a constant. Through the averaging scheme developed by [lacoste2012simpler], the following can be obtained:
(4) 
where , and . Using equation 4, the time complexity of the SPAM algorithm can be written as $O(d/\epsilon)$, i.e., SPAM takes $O(1/\epsilon)$ iterations to achieve an accuracy of $\epsilon$. Thus, SPAM has a lower per-iteration complexity but a slower convergence rate than VRSPAM. Therefore, VRSPAM takes less time to reach a good approximation of the solution.
N  VRSPAM  VRSPAM-NET  SPAM  SPAM-NET  SOLAM  OPAUC

1  .8299±.0323  .8305±.0319  .8272±.0277  .8085±.0431  .8128±.0304  .8309±.0350
2  .7902±.0386  .7845±.0398  .7942±.0388  .7937±.0386  .7778±.0373  .7978±.0347
3  .9640±.0156  .9699±.0139  .9263±.0091  .9267±.0090  .9246±.0087  .9232±.0099
4  .8552±.006  .8549±.0059  .8542±.0388  .8537±.0386  .8395±.0061  .8114±.0065
5  .9834±.0023  .9804±.0032  .9868±.0032  .9855±.0029  .9822±.0036  .9620±.0040
6  .9003±.0045  .8981±.0046  .8998±.0046  .8980±.0047  .8966±.0043  .9002±.0047
7  .9876±.0008  .9787±.0013  .9682±.0020  .9604±.0020  .9817±.0015  .9633±.0035
8  .9465±.0014  .9351±.0014  .9254±.0025  .9132±.0026  .9118±.0029  .9242±.0021
9  .8093±.0033  .8052±.033  .8120±.0030  .8109±.0028  .8099±.0036  .8192±.0032
10  .9750±.001  .9745±.002  .9174±.0024  .9155±.0024  .9129±.0030  .9269±.0021
5 Experiment
N  Name  Instances  Features

1  DIABETES  768  8
2  GERMAN  1,000  24
3  SPLICE  3,175  60
4  USPS  9,298  256
5  LETTER  20,000  16
6  A9A  32,561  123
7  W8A  64,700  300
8  MNIST  60,000  780
9  ACOUSTIC  78,823  50
10  IJCNN1  141,691  22
Here we empirically compare VRSPAM with other existing algorithms used for AUC maximization. We use two variants of our proposed algorithm depending on the regularizer used: VRSPAM with the Frobenius norm regularizer, and VRSPAM-NET with the elastic net regularizer [zou2005regularization]. The proximal step for the elastic net admits a simple closed form.
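A sketch of this proximal step, assuming the regularizer takes the standard elastic net form (an L1 term with weight `beta1` plus a squared-L2 term with weight `beta2`; the symbols and the function name are ours): soft-thresholding followed by a multiplicative shrinkage.

```python
def prox_elastic_net(u, eta, beta1, beta2):
    """Closed-form proximal map of eta * (beta1*||.||_1 + (beta2/2)*||.||^2):
    soft-threshold by eta*beta1, then shrink by 1/(1 + eta*beta2)."""
    out = []
    for uj in u:
        s = max(abs(uj) - eta * beta1, 0.0)
        out.append((s if uj >= 0 else -s) / (1.0 + eta * beta2))
    return out
```

With `beta1 = 0` this reduces to the shrinkage used for the pure squared-norm regularizer, and with `beta2 = 0` to plain soft-thresholding.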
VRSPAM is compared with SPAM, SOLAM [ying2016stochastic], and the one-pass AUC optimization algorithm (OPAUC) [gao2013one]. SOLAM was modified to use the Frobenius norm regularizer (as in [natole2018stochastic]). VRSPAM is compared against OPAUC with the least square loss.
All datasets are publicly available from [chang2011libsvm] and [frank2010uci]. Some of the datasets are multi-class, and we convert them to binary by numbering the classes and assigning all the even labels to one class and all the odd labels to the other. The results are the mean AUC score and standard deviation over 20 runs on each dataset. Each dataset was divided into training and test data with 80% and 20% of the data, respectively. The regularization parameters are chosen by 5-fold cross validation on the training set. All the code is implemented in MATLAB and will be released upon publication. We measured the computational time of the algorithm using an Intel i7 CPU with a clock speed of 3538 MHz.
Variance results: In the left column of Figure 1, we show the variance of the VRSPAM update in comparison with the variance of the SPAM update. We observe that the variance of VRSPAM is lower than that of SPAM and decreases to its minimum value faster, which is in line with Theorem 2.

Convergence results: In the right column of Figure 1, we show the performance of our algorithm compared to existing methods for AUC maximization. We observe that VRSPAM converges to the maximum value faster than the other methods, and in some cases this maximum value itself is higher for VRSPAM.
Note that the initial weights of VRSPAM are set to the output generated by SPAM after one iteration, a practice similar to [johnson2013accelerating].
Table 1 summarizes the AUC evaluation for the different algorithms. AUC values for SPAM, SPAM-NET, SOLAM and OPAUC are taken from [natole2018stochastic].
6 Conclusion
In this paper, we propose a variance reduced stochastic proximal algorithm for AUC maximization (VRSPAM). We theoretically analyze the proposed algorithm and derive a much faster, linear convergence rate of the form $O(\theta^k)$ with $\theta \in (0,1)$, improving upon state-of-the-art methods [natole2018stochastic], which have a sublinear convergence rate, for strongly convex objective functions with a per-iteration complexity of one datapoint. We gave a theoretical analysis of this and showed empirically that VRSPAM converges faster than other methods for AUC maximization.