# Accelerated Stochastic Gradient Descent for Minimizing Finite Sums

We propose an optimization method for minimizing finite sums of smooth convex functions. Our method incorporates accelerated gradient descent (AGD) and the stochastic variance reduction gradient (SVRG) in a mini-batch setting. Unlike SVRG, our method can be directly applied to both non-strongly and strongly convex problems. We show that our method achieves a lower overall complexity than recently proposed methods that support non-strongly convex problems. Moreover, this method has a fast rate of convergence for strongly convex problems. Our experiments show the effectiveness of our method.


## 1 Introduction

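We consider the minimization problem:

$$\underset{x\in\mathbb{R}^d}{\text{minimize}}\quad f(x)\overset{\mathrm{def}}{=}\frac{1}{n}\sum_{i=1}^{n}f_i(x), \tag{1}$$

where $f_1,\ldots,f_n$ are smooth convex functions from $\mathbb{R}^d$ to $\mathbb{R}$. In machine learning, we often encounter optimization problems of this type, i.e., empirical risk minimization. For example, suppose we are given a sequence of training examples $(a_1,b_1),\ldots,(a_n,b_n)$, where $a_i\in\mathbb{R}^d$ and $b_i\in\mathbb{R}$. If we set $f_i(x)=\frac{1}{2}(a_i^\top x-b_i)^2$, then we obtain linear regression. If we set $f_i(x)=\log(1+\exp(-b_i x^\top a_i))$ (with $b_i\in\{-1,1\}$), then we obtain logistic regression. Each $f_i$ may include smooth regularization terms. In this paper we make the following assumption.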

###### Assumption 1.

Each convex function $f_i$ is $L$-smooth, i.e., there exists $L>0$ such that for all $x,y\in\mathbb{R}^d$,

$$\|\nabla f_i(x)-\nabla f_i(y)\|\le L\|x-y\|.$$

In part of this paper (the latter half of Section 4), we also assume that $f$ is $\mu$-strongly convex.

###### Assumption 2.

$f(x)$ is $\mu$-strongly convex, i.e., there exists $\mu>0$ such that for all $x,y\in\mathbb{R}^d$,

$$f(x)\ge f(y)+(\nabla f(y),x-y)+\frac{\mu}{2}\|x-y\|^2.$$

Note that $L\ge\mu$ always holds.
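To make the setting concrete, the following minimal sketch builds the finite-sum objective for the least-squares case and computes one valid pair of constants for Assumptions 1 and 2. The data (`A`, `b`), the problem sizes, and all function names are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 5
A = rng.standard_normal((n, d))   # rows play the role of the feature vectors a_i
b = rng.standard_normal(n)        # targets b_i


def f_i(x, i):
    """Component loss f_i(x) = 0.5 * (a_i^T x - b_i)^2."""
    return 0.5 * (A[i] @ x - b[i]) ** 2


def grad_f_i(x, i):
    """Gradient of the i-th component."""
    return (A[i] @ x - b[i]) * A[i]


def f(x):
    """Finite-sum objective f(x) = (1/n) * sum_i f_i(x)."""
    return 0.5 * np.mean((A @ x - b) ** 2)


# Each f_i has Hessian a_i a_i^T, so it is smooth with constant ||a_i||^2.
L = np.max(np.sum(A ** 2, axis=1))
# f has Hessian (1/n) A^T A; its smallest eigenvalue is a valid mu.
mu = np.linalg.eigvalsh(A.T @ A / n)[0]
```

For this example `L >= mu` holds because the smallest eigenvalue of the averaged Hessian cannot exceed the largest per-component curvature.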

Several papers recently proposed effective methods (SAG [1, 2], SDCA [3, 4], SVRG [5], S2GD [6], Acc-Prox-SDCA [7], Prox-SVRG [8], MISO [9], SAGA [10], Acc-Prox-SVRG [11], mS2GD [12]) for solving problem (1). These methods attempt to reduce the variance of the stochastic gradient and achieve linear convergence rates, like deterministic gradient descent, when $f$ is strongly convex. Moreover, because each iteration is computationally cheap, the overall complexities (total number of component gradient evaluations needed to find an $\epsilon$-accurate solution in expectation) of these methods are lower than those of the deterministic and stochastic gradient descent methods.

An advantage of SAG and SAGA is that they support non-strongly convex problems. Although we can apply any of these methods to non-strongly convex functions by adding a slight $L_2$-regularization, this modification increases the difficulty of model selection. In the non-strongly convex case, the overall complexities of SAG and SAGA are $O((n+L)/\epsilon)$. This complexity is lower than that of deterministic gradient descent, which has a complexity of $O(nL/\epsilon)$, and is a trade-off with $O(n\sqrt{L/\epsilon})$, which is the complexity of AGD.

In this paper we propose a new method that incorporates AGD and SVRG in a mini-batch setting, like Acc-Prox-SVRG [11]. The difference between our method and Acc-Prox-SVRG is that our method incorporates the accelerated method of [13], which is similar to Nesterov's acceleration [14], whereas Acc-Prox-SVRG incorporates [15]. Unlike SVRG and Acc-Prox-SVRG, our method is directly applicable to non-strongly convex problems and achieves an overall complexity of

$$\tilde{O}\left(n+\min\left\{\frac{L}{\epsilon},\,n\sqrt{\frac{L}{\epsilon}}\right\}\right),$$

where the notation $\tilde{O}$ hides constant and logarithmic terms. This complexity is lower than that of SAG, SAGA, and AGD. Moreover, in the strongly convex case, our method achieves a complexity of

$$\tilde{O}\left(n+\min\left\{\kappa,\,n\sqrt{\kappa}\right\}\right),$$

where $\kappa$ is the condition number $L/\mu$. This complexity is the same as that of Acc-Prox-SVRG. Thus, our method converges quickly for both non-strongly and strongly convex problems.

In Sections 2 and 3, we review the recently proposed accelerated gradient method [13] and the stochastic variance reduction gradient [5]. In Section 4, we describe the general scheme of our method and prove an important lemma that gives a novel insight for constructing specific algorithms. Moreover, we derive an algorithm that is applicable to non-strongly and strongly convex problems and bound its overall complexity. Our method is a multi-stage scheme like SVRG, but it can be difficult to decide when to restart a stage. Thus, in Section 5, we introduce some heuristics for determining the restart time. In Section 6, we present experiments that show the effectiveness of our method.

## 2 Accelerated Gradient Descent

We first introduce some notation. In this section, $\|\cdot\|$ denotes a general norm on $\mathbb{R}^d$. Let $d(x)$ be a distance generating function (i.e., a 1-strongly convex smooth function with respect to $\|\cdot\|$). Accordingly, we define the Bregman divergence by

$$V_x(y)=d(y)-\big(d(x)+(\nabla d(x),y-x)\big),\quad\forall x,\forall y\in\mathbb{R}^d,$$

where $(\cdot,\cdot)$ is the Euclidean inner product. The accelerated method proposed in [13] uses a gradient step and a mirror descent step and takes a convex combination of the resulting points. That is,

$$\begin{aligned}
\text{(Convex Combination)}\quad & x_{k+1}\leftarrow\tau_k z_k+(1-\tau_k)y_k,\\
\text{(Gradient Descent)}\quad & y_{k+1}\leftarrow\arg\min_{y\in\mathbb{R}^d}\left\{(\nabla f(x_{k+1}),y-x_{k+1})+\frac{L}{2}\|y-x_{k+1}\|^2\right\},\\
\text{(Mirror Descent)}\quad & z_{k+1}\leftarrow\arg\min_{z\in\mathbb{R}^d}\left\{\alpha_{k+1}(\nabla f(x_{k+1}),z-z_k)+V_{z_k}(z)\right\}.
\end{aligned}$$

Then, with appropriate parameters, $f(y_k)$ converges to the optimal value as fast as Nesterov's accelerated methods [14, 15] for non-strongly convex problems. Moreover, in the strongly convex case, we obtain the same fast convergence as Nesterov's methods by restarting this entire procedure.

In the rest of the paper, we only consider the Euclidean norm, i.e., $\|\cdot\|=\|\cdot\|_2$.
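In the Euclidean case both subproblems of the three-step scheme have closed forms: the gradient step is $y_{k+1}=x_{k+1}-\frac{1}{L}\nabla f(x_{k+1})$ and the mirror step is $z_{k+1}=z_k-\alpha_{k+1}\nabla f(x_{k+1})$. The sketch below implements the deterministic scheme under that assumption. The parameter choice $\alpha_{k+1}=(k+2)/(2L)$, $\tau_k=1/(\alpha_{k+1}L)$ is one schedule commonly used with this kind of coupling and is an assumption here, not something stated in this section; the quadratic test problem and all names are hypothetical.

```python
import numpy as np


def agd_linear_coupling(grad, L, x0, num_iters):
    """Euclidean version of the convex-combination / gradient / mirror scheme."""
    y = x0.copy()
    z = x0.copy()
    for k in range(num_iters):
        alpha = (k + 2) / (2.0 * L)     # assumed step size alpha_{k+1}
        tau = 1.0 / (alpha * L)         # assumed coupling weight tau_k = 2 / (k + 2)
        x = tau * z + (1.0 - tau) * y   # convex combination
        g = grad(x)
        y = x - g / L                   # gradient descent step (closed form)
        z = z - alpha * g               # mirror step with V_x(y) = 0.5 * ||x - y||^2
    return y


# Hypothetical test problem: f(x) = 0.5 * x^T Q x - c^T x with diagonal Q.
q = np.linspace(0.1, 1.0, 10)           # eigenvalues of Q, so L = 1.0
c = np.ones(10)


def f(x):
    return 0.5 * np.sum(q * x * x) - c @ x


def grad(x):
    return q * x - c


x_star = c / q
y_T = agd_linear_coupling(grad, L=1.0, x0=np.zeros(10), num_iters=200)
```

With this schedule the known guarantee for the deterministic scheme is a gap of order $L\|x_0-x_*\|^2/T^2$ after $T$ iterations, which is the behavior the stochastic variant in Section 4 aims to retain.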

## 3 Stochastic Variance Reduction Gradient

To ensure the convergence of stochastic gradient descent (SGD), the learning rate must decay to zero so that the variance of the stochastic gradient is controlled. This slows down the convergence. Variance reduction techniques [5, 8, 6, 12] such as SVRG have been proposed to solve this problem. We review SVRG in a mini-batch setting [11, 12]. SVRG is a multi-stage scheme. During each stage, this method performs SGD iterations using the following direction,

$$v_k=\nabla f_{I_k}(x_k)-\nabla f_{I_k}(\tilde{x})+\nabla f(\tilde{x}),$$

where $\tilde{x}$ is the starting point of the stage, $k$ is an iteration index, $I_k$ is a uniformly randomly chosen size-$b$ subset of $\{1,2,\ldots,n\}$, and $f_{I_k}=\frac{1}{b}\sum_{i\in I_k}f_i$. Note that $v_k$ is an unbiased estimator of the gradient $\nabla f(x_k)$: $\mathbb{E}_{I_k}[v_k]=\nabla f(x_k)$, where $\mathbb{E}_{I_k}$ denotes the expectation with respect to $I_k$. A bound on the variance of $v_k$ is given in the following lemma, which is proved in the Supplementary Material.

###### Lemma 1.

Suppose Assumption 1 holds, and let $x_*\in\arg\min_{x\in\mathbb{R}^d}f(x)$. Conditioned on $x_k$, we have

$$\mathbb{E}_{I_k}\|v_k-\nabla f(x_k)\|^2\le 4L\frac{n-b}{b(n-1)}\big(f(x_k)-f(x_*)+f(\tilde{x})-f(x_*)\big). \tag{2}$$

Due to this lemma, SVRG with $b=1$ achieves a complexity of $O((n+\kappa)\log\frac{1}{\epsilon})$.
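The unbiasedness of the mini-batch SVRG direction can be checked exactly on a tiny problem by averaging the direction over every size-$b$ subset. The sketch below does this for least-squares components; the data, sizes, and function names are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(1)
n, d, b = 6, 3, 2
A = rng.standard_normal((n, d))
t = rng.standard_normal(n)


def grad_i(x, i):
    """Gradient of f_i(x) = 0.5 * (a_i^T x - t_i)^2."""
    return (A[i] @ x - t[i]) * A[i]


def full_grad(x):
    """Gradient of f, the average of all component gradients."""
    return np.mean([grad_i(x, i) for i in range(n)], axis=0)


def svrg_direction(x, x_tilde, batch, g_tilde):
    """v = grad f_I(x) - grad f_I(x_tilde) + grad f(x_tilde) for a batch I."""
    gx = np.mean([grad_i(x, i) for i in batch], axis=0)
    gt = np.mean([grad_i(x_tilde, i) for i in batch], axis=0)
    return gx - gt + g_tilde


x = rng.standard_normal(d)
x_tilde = rng.standard_normal(d)
g_tilde = full_grad(x_tilde)

# Enumerate every size-b subset, so the average below is the exact expectation.
subsets = list(itertools.combinations(range(n), b))
mean_v = np.mean([svrg_direction(x, x_tilde, S, g_tilde) for S in subsets], axis=0)
```

Here `mean_v` coincides with `full_grad(x)` up to floating-point error, because the two batch terms that involve `x_tilde` cancel in expectation against the snapshot gradient.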

## 4 Algorithms

We now introduce our Accelerated efficient Mini-batch SVRG (AMSVRG), which incorporates AGD and SVRG in a mini-batch setting. Our method is a multi-stage scheme similar to SVRG. During each stage, this method performs several APG-like [13] iterations and uses the SVRG direction in a mini-batch setting. Each stage of AMSVRG is described in Figure 1.

### 4.1 Convergence analysis of the single stage of AMSVRG

Before we introduce the multi-stage scheme, we show the convergence of Algorithm 1. The following lemma is the key to the analysis of our method and gives us an insight on how to construct algorithms.

###### Lemma 2.

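Consider Algorithm 1 in Figure 1 under Assumption 1. We set $\delta_k=\frac{n-b_k}{b_k(n-1)}$. Let $x_*\in\arg\min_{x\in\mathbb{R}^d}f(x)$. If $\eta=\frac{1}{L}$, then we have

$$\begin{aligned}
&\sum_{k=0}^{m}\alpha_{k+1}\left(\frac{1}{\tau_k}-(1+4\delta_{k+1})L\alpha_{k+1}\right)\mathbb{E}[f(x_{k+1})-f(x_*)]+L\alpha_{m+1}^2\,\mathbb{E}[f(y_{m+1})-f(x_*)]\\
&\quad\le V_{z_0}(x_*)+\sum_{k=1}^{m}\left(\alpha_{k+1}\frac{1-\tau_k}{\tau_k}-L\alpha_k^2\right)\mathbb{E}[f(y_k)-f(x_*)]\\
&\qquad+\left(\alpha_1\frac{1-\tau_0}{\tau_0}+4L\sum_{k=0}^{m}\alpha_{k+1}^2\delta_{k+1}\right)\big(f(y_0)-f(x_*)\big).
\end{aligned}$$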

To prove Lemma 2, additional lemmas are required, which are proved in the Supplementary Material.

###### Lemma 3.

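(Stochastic Gradient Descent). Suppose Assumption 1 holds, and let $\eta=\frac{1}{L}$. Conditioned on $x_k$, it follows that for $k\ge 1$,

$$\mathbb{E}_{I_k}[f(y_k)]\le f(x_k)-\frac{1}{2L}\|\nabla f(x_k)\|^2+\frac{1}{2L}\mathbb{E}_{I_k}\|v_k-\nabla f(x_k)\|^2. \tag{3}$$
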
###### Lemma 4.

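(Stochastic Mirror Descent). Conditioned on $x_k$, we have that for arbitrary $u\in\mathbb{R}^d$,

$$\alpha_k(\nabla f(x_k),z_{k-1}-u)\le V_{z_{k-1}}(u)-\mathbb{E}_{I_k}[V_{z_k}(u)]+\frac{1}{2}\alpha_k^2\|\nabla f(x_k)\|^2+\frac{1}{2}\alpha_k^2\,\mathbb{E}_{I_k}\|v_k-\nabla f(x_k)\|^2. \tag{4}$$
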
###### Proof of Lemma 2.

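We denote $V_{z_k}(x_*)$ by $V_k$ for simplicity. From Lemmas 1, 3, and 4 with $u=x_*$,

$$\begin{aligned}
\alpha_{k+1}(\nabla f(x_{k+1}),z_k-x_*)
&\overset{(3),(4)}{\le}V_k-\mathbb{E}_{I_{k+1}}[V_{k+1}]+L\alpha_{k+1}^2\big(f(x_{k+1})-\mathbb{E}_{I_{k+1}}[f(y_{k+1})]\big)+\alpha_{k+1}^2\,\mathbb{E}_{I_{k+1}}\|v_{k+1}-\nabla f(x_{k+1})\|^2\\
&\overset{(2)}{\le}V_k-\mathbb{E}_{I_{k+1}}[V_{k+1}]+L\alpha_{k+1}^2\big(f(x_{k+1})-\mathbb{E}_{I_{k+1}}[f(y_{k+1})]\big)\\
&\qquad+4L\alpha_{k+1}^2\delta_{k+1}\big(f(x_{k+1})-f(x_*)+f(y_0)-f(x_*)\big)\\
&=V_k-\mathbb{E}_{I_{k+1}}[V_{k+1}]+(1+4\delta_{k+1})L\alpha_{k+1}^2\big(f(x_{k+1})-f(x_*)\big)-L\alpha_{k+1}^2\,\mathbb{E}_{I_{k+1}}[f(y_{k+1})-f(x_*)]\\
&\qquad+4L\alpha_{k+1}^2\delta_{k+1}\big(f(y_0)-f(x_*)\big).
\end{aligned}$$

By taking the expectation with respect to the history of random variables $I_1,I_2,\ldots$, we have

$$\begin{aligned}
\alpha_{k+1}\mathbb{E}[(\nabla f(x_{k+1}),z_k-x_*)]\le{}&\mathbb{E}[V_k-V_{k+1}]+(1+4\delta_{k+1})L\alpha_{k+1}^2\,\mathbb{E}[f(x_{k+1})-f(x_*)]\\
&-L\alpha_{k+1}^2\,\mathbb{E}[f(y_{k+1})-f(x_*)]+4L\alpha_{k+1}^2\delta_{k+1}\big(f(y_0)-f(x_*)\big),
\end{aligned} \tag{5}$$

and, using the convexity of $f$ and the identity $x_{k+1}-z_k=\frac{1-\tau_k}{\tau_k}(y_k-x_{k+1})$, we get

$$\begin{aligned}
\sum_{k=0}^{m}\alpha_{k+1}\mathbb{E}[f(x_{k+1})-f(x_*)]
&\le\sum_{k=0}^{m}\alpha_{k+1}\mathbb{E}[(\nabla f(x_{k+1}),x_{k+1}-x_*)]\\
&=\sum_{k=0}^{m}\alpha_{k+1}\Big(\mathbb{E}[(\nabla f(x_{k+1}),x_{k+1}-z_k)]+\mathbb{E}[(\nabla f(x_{k+1}),z_k-x_*)]\Big)\\
&=\sum_{k=0}^{m}\alpha_{k+1}\left(\frac{1-\tau_k}{\tau_k}\mathbb{E}[(\nabla f(x_{k+1}),y_k-x_{k+1})]+\mathbb{E}[(\nabla f(x_{k+1}),z_k-x_*)]\right)\\
&\le\sum_{k=0}^{m}\left(\alpha_{k+1}\frac{1-\tau_k}{\tau_k}\mathbb{E}[f(y_k)-f(x_{k+1})]+\alpha_{k+1}\mathbb{E}[(\nabla f(x_{k+1}),z_k-x_*)]\right).
\end{aligned} \tag{6}$$

Using (5), (6), and $V_{m+1}\ge 0$, we have

$$\begin{aligned}
&\sum_{k=0}^{m}\alpha_{k+1}\left(1+\frac{1-\tau_k}{\tau_k}-(1+4\delta_{k+1})L\alpha_{k+1}\right)\mathbb{E}[f(x_{k+1})-f(x_*)]\\
&\quad\le V_0+\sum_{k=0}^{m}\alpha_{k+1}\frac{1-\tau_k}{\tau_k}\mathbb{E}[f(y_k)-f(x_*)]-L\sum_{k=0}^{m}\alpha_{k+1}^2\,\mathbb{E}[f(y_{k+1})-f(x_*)]\\
&\qquad+4L\sum_{k=0}^{m}\alpha_{k+1}^2\delta_{k+1}\big(f(y_0)-f(x_*)\big).
\end{aligned}$$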

This completes the proof of Lemma 2. ∎

From now on we consider Algorithm 1 with option 1 and set

$$\eta=\frac{1}{L},\quad\alpha_{k+1}=\frac{1}{4L}(k+2),\quad\frac{1}{\tau_k}=L\alpha_{k+1}+\frac{1}{2},\quad\text{for }k=0,1,\ldots. \tag{7}$$

###### Theorem 1.

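Consider Algorithm 1 with option 1 under Assumption 1. For $p\in\left(0,\frac{1}{2}\right]$, we choose $b_{k+1}\in\mathbb{Z}_+$ such that $4L\delta_{k+1}\alpha_{k+1}\le p$. Then, we have

$$\mathbb{E}[f(y_{m+1})-f(x_*)]\le\frac{16L}{(m+2)^2}V_{z_0}(x_*)+\frac{5}{2}p\big(f(y_0)-f(x_*)\big).$$

Moreover, if $m\ge 4\sqrt{\frac{LV_{z_0}(x_*)}{q(f(y_0)-f(x_*))}}$ for $q>0$, then it follows that

$$\mathbb{E}[f(y_{m+1})-f(x_*)]\le\left(q+\frac{5}{2}p\right)\big(f(y_0)-f(x_*)\big).$$
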
###### Proof.

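Using Lemma 2 and

$$\tau_0=1,\quad\frac{1}{\tau_k}-(1+4\delta_{k+1})L\alpha_{k+1}\ge 0,\quad\alpha_{k+1}\frac{1-\tau_k}{\tau_k}-L\alpha_k^2=L\alpha_{k+1}^2-\frac{1}{2}\alpha_{k+1}-L\alpha_k^2=-\frac{1}{16L}<0,$$

we have

$$L\alpha_{m+1}^2\,\mathbb{E}[f(y_{m+1})-f(x_*)]\le V_{z_0}(x_*)+4L\sum_{k=0}^{m}\alpha_{k+1}^2\delta_{k+1}\big(f(y_0)-f(x_*)\big).$$

This proves the theorem because $4L\alpha_{k+1}\delta_{k+1}\le p$, $L\alpha_{m+1}^2=\frac{(m+2)^2}{16L}$, and $\sum_{k=0}^{m}\alpha_{k+1}=\frac{(m+1)(m+4)}{8L}\le\frac{5}{2}L\alpha_{m+1}^2$. ∎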
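The two algebraic facts about the schedule (7) that this proof relies on, $\tau_0=1$ and $\alpha_{k+1}\frac{1-\tau_k}{\tau_k}-L\alpha_k^2=-\frac{1}{16L}$, are easy to confirm numerically. The following small check uses a hypothetical value of $L$; the helper name is illustrative only.

```python
def amsvrg_parameters(L, k):
    """Return (alpha_{k+1}, tau_k) from schedule (7) for smoothness constant L."""
    alpha_next = (k + 2) / (4.0 * L)
    tau = 1.0 / (L * alpha_next + 0.5)
    return alpha_next, tau


L = 3.0  # hypothetical smoothness constant
tau_0 = amsvrg_parameters(L, 0)[1]

coeffs = []
for k in range(1, 6):
    alpha_k, _ = amsvrg_parameters(L, k - 1)      # alpha_k
    alpha_next, tau_k = amsvrg_parameters(L, k)   # alpha_{k+1} and tau_k
    # Coefficient of E[f(y_k) - f(x_*)] on the right-hand side of Lemma 2.
    coeffs.append(alpha_next * (1.0 - tau_k) / tau_k - L * alpha_k ** 2)
```

Every entry of `coeffs` equals $-1/(16L)$ up to rounding, which is why the sum over $k=1,\ldots,m$ in Lemma 2 can be dropped.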

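Let $b_{k+1}$ and $m$ be the minimum values satisfying the assumptions of Theorem 1 for $p=q=\epsilon$, i.e., $b_{k+1}=\left\lceil\frac{n(k+2)}{\epsilon(n-1)+(k+2)}\right\rceil$ and $m=\left\lceil 4\sqrt{\frac{LV_{z_0}(x_*)}{\epsilon(f(y_0)-f(x_*))}}\right\rceil$. Then, from Theorem 1, we have an upper bound on the overall complexity (total number of component gradient evaluations needed to obtain an $\epsilon$-accurate solution in expectation):

$$O\left(n+\sum_{k=0}^{m}b_{k+1}\right)\le O\left(n+m\frac{nm}{\epsilon n+m}\right)=O\left(n+\frac{nL}{\epsilon^2 n+\sqrt{\epsilon L}}\right),$$

where we used the monotonicity of $b_{k+1}$ with respect to $k$ for the first inequality. Note that the notation $O$ also hides $V_{z_0}(x_*)$ and $f(y_0)-f(x_*)$.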
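The batch-size schedule can be sketched as follows, assuming $\delta_k=\frac{n-b_k}{b_k(n-1)}$ and the step sizes in (7), under which the requirement $4L\delta_{k+1}\alpha_{k+1}\le p$ reduces to $(k+2)\delta_{k+1}\le p$ and $L$ cancels. Solving for the batch size gives the ceiling formula used below. Function names and the sample values of `n` and `p` are hypothetical.

```python
import math


def delta(n, b):
    """Variance factor (n - b) / (b * (n - 1)) for a size-b batch out of n."""
    return (n - b) / (b * (n - 1))


def batch_size(n, k, p):
    """Batch size b_{k+1} chosen so that (k + 2) * delta(n, b_{k+1}) <= p."""
    return min(n, math.ceil(n * (k + 2) / (p * (n - 1) + (k + 2))))


# Batch sizes for the first 200 iterations of one stage.
sizes = [batch_size(1000, k, 0.1) for k in range(200)]
```

The schedule grows with $k$ and saturates at $n$, which is the monotonicity used in the complexity bound: the batch is small while the iterate is far from the optimum and approaches the full gradient as the stage proceeds.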

### 4.2 Multi-Stage Scheme

In this subsection, we introduce AMSVRG, as described in Figure 2.

We consider the convergence of AMSVRG under the following boundedness assumption, which has been used in several papers to analyze incremental and stochastic methods (e.g., [16, 17]).

###### Assumption 3.

(Boundedness) There is a compact subset $\Omega\subset\mathbb{R}^d$ such that the sequence $\{w_s\}$ generated by AMSVRG is contained in $\Omega$.

Note that, if we change the initialization $z_0\leftarrow w_s$ to $z_0\leftarrow z$ for a constant $z$, the modified method achieves the same convergence for general convex problems without the boundedness assumption (cf. the supplementary materials). However, for the strongly convex case, this modified version is slower than the above scheme. Therefore, we consider the version described in Figure 2.

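From Theorem 1, we can see that for small $p$ and $q$ (e.g., $p=1/10$, $q=1/4$), the expected value of the objective gap is halved at every stage under the assumptions of Theorem 1. Hence, running AMSVRG for $O(\log(1/\epsilon))$ outer iterations achieves an $\epsilon$-accurate solution in expectation. Here, we consider the complexity at stage $s$ to halve the expected objective gap. Let $b_{k+1}$ and $m_s$ be the minimum values satisfying the assumptions of Theorem 1, i.e., $b_{k+1}=\left\lceil\frac{n(k+2)}{p(n-1)+(k+2)}\right\rceil$ and $m_s=\left\lceil 4\sqrt{\frac{LV_{w_s}(x_*)}{q(f(w_s)-f(x_*))}}\right\rceil$. If the initial objective gap $f(w_s)-f(x_*)$ in stage $s$ is larger than $\epsilon$, then the complexity at this stage is

$$\begin{aligned}
O\left(n+\sum_{k=0}^{m_s}b_{k+1}\right)&\le O\left(n+\frac{nm_s^2}{n+m_s}\right)\\
&=O\left(n+\frac{nL}{n(f(w_s)-f(x_*))+\sqrt{(f(w_s)-f(x_*))L}}\right)\le O\left(n+\frac{nL}{\epsilon n+\sqrt{\epsilon L}}\right),
\end{aligned}$$

where we used the monotonicity of $b_{k+1}$ with respect to $k$ for the first inequality. Note that by Assumption 3, $\{V_{w_s}(x_*)\}_{s}$ are uniformly bounded, and the notation $O$ also hides $V_{w_s}(x_*)$. The above analysis implies the following theorem.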

###### Theorem 2.

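Consider AMSVRG under Assumptions 1 and 3. We set $\eta$, $\alpha_{k+1}$, and $\tau_k$ as in (7). Let $b_{k+1}=\left\lceil\frac{n(k+2)}{p(n-1)+(k+2)}\right\rceil$ and $m_s=\left\lceil 4\sqrt{\frac{LV_{w_s}(x_*)}{q(f(w_s)-f(x_*))}}\right\rceil$, where $p$ and $q$ are small values described above. Then, the overall complexity to run AMSVRG for $O(\log(1/\epsilon))$ outer iterations or to obtain an $\epsilon$-accurate solution is

$$O\left(\left(n+\frac{nL}{\epsilon n+\sqrt{\epsilon L}}\right)\log\left(\frac{1}{\epsilon}\right)\right).$$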
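Figures 1 and 2 are not reproduced in this text, so the following is only one plausible reading of the multi-stage scheme, pieced together from Sections 2 to 4: each stage starts from $y_0=z_0=w_s$, computes the full gradient at $y_0$, and then alternates the convex combination, an SGD step with $\eta=1/L$, and a Euclidean mirror step, both driven by the mini-batch SVRG direction with the parameters in (7). Using the same fixed `m` in every stage, the least-squares test problem, and all function names are assumptions made for illustration; the theoretical $m_s$ depends on quantities that are unknown in practice.

```python
import math

import numpy as np


def amsvrg_stage(grad_batch, full_grad, w, n, L, m, p, rng):
    """One stage: AGD-style coupling driven by mini-batch SVRG directions."""
    y, z = w.copy(), w.copy()
    g_tilde = full_grad(w)                      # snapshot gradient at y_0 = w
    for k in range(m + 1):
        alpha = (k + 2) / (4.0 * L)             # alpha_{k+1} from (7)
        tau = 1.0 / (L * alpha + 0.5)           # tau_k from (7)
        # Increasing batch size so that (k + 2) * delta_{k+1} <= p.
        b = min(n, math.ceil(n * (k + 2) / (p * (n - 1) + (k + 2))))
        x = tau * z + (1.0 - tau) * y           # convex combination
        idx = rng.choice(n, size=b, replace=False)
        v = grad_batch(x, idx) - grad_batch(w, idx) + g_tilde
        y = x - v / L                           # SGD step with eta = 1 / L
        z = z - alpha * v                       # mirror step, V_x(y) = 0.5 * ||x - y||^2
    return y                                    # option 1: return y_{m+1}


def amsvrg(grad_batch, full_grad, w0, n, L, m, p, num_stages, seed=0):
    """Restart the single stage from its own output num_stages times."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_stages):
        w = amsvrg_stage(grad_batch, full_grad, w, n, L, m, p, rng)
    return w


# Hypothetical least-squares problem with f_i(x) = 0.5 * (a_i^T x - t_i)^2.
data_rng = np.random.default_rng(0)
n, d = 200, 10
A = data_rng.standard_normal((n, d))
t = data_rng.standard_normal(n)


def grad_batch(x, idx):
    """Average gradient of the components indexed by idx."""
    return A[idx].T @ (A[idx] @ x - t[idx]) / len(idx)


def full_grad(x):
    return A.T @ (A @ x - t) / n


def objective(x):
    return 0.5 * np.mean((A @ x - t) ** 2)


L = np.max(np.sum(A ** 2, axis=1))              # each f_i is ||a_i||^2-smooth
w0 = np.zeros(d)
w = amsvrg(grad_batch, full_grad, w0, n, L, m=50, p=0.1, num_stages=30)
f_star = objective(np.linalg.lstsq(A, t, rcond=None)[0])
```

Comparing `objective(w) - f_star` with `objective(w0) - f_star` gives a quick sanity check that the stages contract the objective gap on this toy problem.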

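Next, we consider the strongly convex case. We assume that $f$ is a $\mu$-strongly convex function. In this case, we choose the distance generating function $d(x)=\frac{1}{2}\|x\|^2$, so that the Bregman divergence becomes $V_x(y)=\frac{1}{2}\|x-y\|^2$. Let the parameters be the same as in Theorem 2. Then, the expected value of the objective gap is halved at every stage. Because strong convexity gives $V_{w_s}(x_*)\le\frac{1}{\mu}(f(w_s)-f(x_*))$, we have $m_s\le\left\lceil 4\sqrt{\kappa/q}\right\rceil$, where $\kappa$ is the condition number $L/\mu$, and the complexity at each stage is

$$O\left(n+\sum_{k=0}^{m_s}b_{k+1}\right)\le O\left(n+\frac{nm_s^2}{n+m_s}\right)\le O\left(n+\frac{n\kappa}{n+\sqrt{\kappa}}\right).$$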

Thus, we have the following theorem.

###### Theorem 3.

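Consider AMSVRG under Assumptions 1 and 2. Let the parameters $\eta$, $\alpha_{k+1}$, $\tau_k$, $m_s$, and $b_{k+1}$ be the same as those in Theorem 2. Then the overall complexity for obtaining an $\epsilon$-accurate solution in expectation is

$$O\left(\left(n+\frac{n\kappa}{n+\sqrt{\kappa}}\right)\log\left(\frac{1}{\epsilon}\right)\right).$$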

This complexity is the same as that of Acc-Prox-SVRG. Note that for the strongly convex case, we do not need the boundedness assumption.

Table 1 lists the overall complexities of AGD, SAG, SVRG, SAGA, Acc-Prox-SVRG, and AMSVRG. The notation $\tilde{O}$ hides constant and logarithmic terms. By simple calculations, we see that

$$\frac{n\kappa}{n+\sqrt{\kappa}}=\frac{1}{2}H\left(\kappa,\,n\sqrt{\kappa}\right),\qquad\frac{nL}{\epsilon n+\sqrt{\epsilon L}}=\frac{1}{2}H\left(\frac{L}{\epsilon},\,n\sqrt{\frac{L}{\epsilon}}\right),$$

where $H(\cdot,\cdot)$ is the harmonic mean, whose order is the same as $\min\{\cdot,\cdot\}$. Thus, as shown in Table 1, the complexity of AMSVRG is less than or equal to that of the other methods in any situation. In particular, for non-strongly convex problems, our method potentially outperforms the others.
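The harmonic-mean identity and the claim that $H$ has the same order as the minimum can be checked numerically. The sketch below assumes the usual definition $H(a,b)=\frac{2ab}{a+b}$; the sample values of $\kappa$ and $n$ are hypothetical.

```python
import math


def harmonic_mean(a, b):
    """Harmonic mean H(a, b) = 2ab / (a + b) of two positive numbers."""
    return 2.0 * a * b / (a + b)


def stage_cost(n, kappa):
    """Per-stage complexity term n * kappa / (n + sqrt(kappa))."""
    return n * kappa / (n + math.sqrt(kappa))


# Hypothetical (kappa, n) pairs covering small and large condition numbers.
cases = [(10.0, 100), (1e4, 100), (1e6, 1000), (50.0, 10 ** 6)]
```

Since $\min\{a,b\}\le H(a,b)\le 2\min\{a,b\}$, the per-stage term is always within a factor of two of $\frac{1}{2}\min\{\kappa,n\sqrt{\kappa}\}$.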

## 5 Restart Scheme

The parameters of AMSVRG are essentially $\eta$ and $m_s$, because the appropriate values of both $\alpha_{k+1}$ and $\tau_k$ can be expressed by $\eta$ as in (7). It may be difficult to choose an appropriate $m_s$, which is the restart time for Algorithm 1. So, we propose heuristics for determining the restart time.

First, we suppose that the number of components $n$ is sufficiently large such that the complexity of our method becomes $O(n)$. That is, for appropriate $m_s$, $O(n)$ is an upper bound on $\sum_{k=0}^{m_s}b_{k+1}$ (which is the complexity term). Therefore, we estimate the restart time as the minimum index $m\in\mathbb{N}$ that satisfies $\sum_{k=0}^{m}b_{k+1}\ge n$. This estimated value is an upper bound on $m_s$ (in terms of the order). In this paper, we call this restart method R1.

Second, we propose an adaptive restart method using SVRG. In the strongly convex case, we can easily see that if we restart the AGD for general convex problems every $O(\sqrt{\kappa})$ iterations, then the method achieves linear convergence similar to that of methods designed for strongly convex problems. The drawback of this restart method is that the restart time depends on an unknown parameter $\kappa$, so several papers [18, 19, 20] have proposed effective adaptive restart methods. Moreover, [19] showed that this technique also performs well for general convex problems. Inspired by their study, we propose an SVRG-based adaptive restart method called R2. That is, if

$$(v_{k+1},\,y_{k+1}-y_k)>0,$$

then we return $y_k$ and start the next stage.

Third, we propose the restart method R3, which is a combination of the above two ideas. When $\sum_{k=0}^{m}b_{k+1}$ exceeds $10n$, we restart Algorithm 1, and when

$$(v_{k+1},\,y_{k+1}-y_k)>0\ \wedge\ \sum_{k=0}^{m}b_{k+1}>n,$$

we return $y_k$ and restart Algorithm 1.
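The three heuristics can be expressed as a single predicate evaluated after each inner iteration. In the sketch below, the function name, the `grads_used` counter (the running sum of batch sizes in the current stage), the use of `n` as the R1 threshold, and the `cap_factor` parameter for R3 are assumptions made for illustration.

```python
import numpy as np


def should_restart(v_next, y_next, y, grads_used, n, rule, cap_factor=10.0):
    """Return True when the current stage should stop under rule R1, R2 or R3."""
    # R1: the stage has used roughly one pass over the data.
    reached_pass = grads_used >= n
    # R2: the SVRG direction has a positive inner product with the last step,
    # a sign that the step from y to y_next moved uphill.
    increasing = float(np.dot(v_next, y_next - y)) > 0.0
    if rule == "R1":
        return reached_pass
    if rule == "R2":
        return increasing
    if rule == "R3":
        # Hard cap on the stage length, or the R2 test once a pass is exceeded.
        return grads_used > cap_factor * n or (increasing and grads_used > n)
    raise ValueError(f"unknown rule: {rule}")
```

A stage loop would call this after computing the next SGD iterate and return the previous one when the R2 condition fires.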

## 6 Numerical Experiments

In this section, we compare AMSVRG with SVRG and SAGA. We ran $L_2$-regularized multi-class logistic regression on mnist and covtype and $L_2$-regularized binary logistic regression on rcv1. The datasets and their descriptions can be found at the LIBSVM website (http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/). In these experiments, we varied the regularization parameter over several values. We ran AMSVRG with several candidate values for each of its tuning parameters and then chose the best combination.

The results are shown in Figure 3. The horizontal axis is the number of single-component gradient evaluations. Our methods performed well and outperformed the other methods in some cases. For mnist and covtype, AMSVRG R1 and R3 converged quickly, and for rcv1, AMSVRG R2 worked very well. This tendency was more remarkable when the regularization parameter was small.

Note that the gradient evaluations for the mini-batch can be parallelized [22, 21, 23], so AMSVRG may be further accelerated in a parallel framework such as GPU computing.

## 7 Conclusion

We proposed a method that incorporates the accelerated gradient method and SVRG in an increasing mini-batch setting. We showed that our method achieves a low overall complexity for both non-strongly and strongly convex problems.

## A Proof of the Lemma 1

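To prove Lemma 1, the following lemma is required, which is also shown in [1].

###### Lemma A.

Let $\{\xi_i\}_{i=1}^{n}$ be a set of vectors in $\mathbb{R}^d$ and let $\mu$ denote the average of $\{\xi_i\}_{i=1}^{n}$. Let $I$ denote a uniform random variable representing a size-$b$ subset of $\{1,2,\ldots,n\}$. Then, it follows that

$$\mathbb{E}_I\left\|\frac{1}{b}\sum_{i\in I}\xi_i-\mu\right\|^2=\frac{n-b}{b(n-1)}\mathbb{E}_i\|\xi_i-\mu\|^2.$$
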
###### Proof.

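We denote a size-$b$ subset of $\{1,2,\ldots,n\}$ by $S=\{i_1,\ldots,i_b\}$ and denote $\xi_i-\mu$ by $\tilde{\xi}_i$. Then,

$$\begin{aligned}
\mathbb{E}_I\left\|\frac{1}{b}\sum_{i\in I}\xi_i-\mu\right\|^2
&=\frac{1}{C(n,b)}\sum_{S}\left\|\frac{1}{b}\sum_{j=1}^{b}\xi_{i_j}-\mu\right\|^2
=\frac{1}{b^2C(n,b)}\sum_{S}\left\|\sum_{j=1}^{b}\tilde{\xi}_{i_j}\right\|^2\\
&=\frac{1}{b^2C(n,b)}\sum_{S}\left(\sum_{j=1}^{b}\|\tilde{\xi}_{i_j}\|^2+2\sum_{j<k}\tilde{\xi}_{i_j}^\top\tilde{\xi}_{i_k}\right),
\end{aligned}$$

where $C(\cdot,\cdot)$ denotes the number of combinations. By symmetry, each $\tilde{\xi}_i$ appears $\frac{bC(n,b)}{n}$ times and each pair $\tilde{\xi}_i^\top\tilde{\xi}_j$ for $i<j$ appears $\frac{C(b,2)C(n,b)}{C(n,2)}$ times in $\sum_S$. Therefore, we have

$$\begin{aligned}
\mathbb{E}_I\left\|\frac{1}{b}\sum_{i\in I}\xi_i-\mu\right\|^2
&=\frac{1}{b^2C(n,b)}\left(\frac{bC(n,b)}{n}\sum_{i=1}^{n}\|\tilde{\xi}_i\|^2+2\frac{C(b,2)C(n,b)}{C(n,2)}\sum_{i<j}\tilde{\xi}_i^\top\tilde{\xi}_j\right)\\
&=\frac{1}{bn}\sum_{i=1}^{n}\|\tilde{\xi}_i\|^2+\frac{2(b-1)}{bn(n-1)}\sum_{i<j}\tilde{\xi}_i^\top\tilde{\xi}_j.
\end{aligned}$$

Since $0=\left\|\sum_{i=1}^{n}\tilde{\xi}_i\right\|^2=\sum_{i=1}^{n}\|\tilde{\xi}_i\|^2+2\sum_{i<j}\tilde{\xi}_i^\top\tilde{\xi}_j$, we have

$$\mathbb{E}_I\left\|\frac{1}{b}\sum_{i\in I}\xi_i-\mu\right\|^2=\left(\frac{1}{bn}-\frac{b-1}{bn(n-1)}\right)\sum_{i=1}^{n}\|\tilde{\xi}_i\|^2=\frac{n-b}{b(n-1)}\cdot\frac{1}{n}\sum_{i=1}^{n}\|\tilde{\xi}_i\|^2.$$

This finishes the proof of Lemma A. ∎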
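Lemma A is the classical variance formula for sampling without replacement, and it can be verified exactly on a small instance by enumerating every size-$b$ subset. The vectors and sizes below are hypothetical.

```python
import itertools

import numpy as np

rng = np.random.default_rng(2)
n, d, b = 7, 4, 3
xi = rng.standard_normal((n, d))   # the vectors xi_1, ..., xi_n as rows
mu = xi.mean(axis=0)               # their average

# Left-hand side: exact expectation over all size-b subsets I.
lhs = np.mean([
    np.sum((xi[list(S)].mean(axis=0) - mu) ** 2)
    for S in itertools.combinations(range(n), b)
])

# Right-hand side: (n - b) / (b (n - 1)) times the average squared deviation.
rhs = (n - b) / (b * (n - 1)) * np.mean(np.sum((xi - mu) ** 2, axis=1))
```

The two sides agree up to rounding; setting `b = n` makes both vanish, and `b = 1` recovers the plain per-sample variance.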

We now prove Lemma 1.

###### Proof of Lemma 1.

We set $v_j^1=\nabla f_j(x_k)-\nabla f_j(\tilde{x})+\nabla f(\tilde{x})$. Using Lemma A and

$$v_k=\frac{1}{b}\sum_{j\in I_k}v_j^1,$$

the conditional variance of $v_k$ is as follows:

 EIk∥vk−∇f(xk)∥2=1bn−bn−1Ej∥v1j−∇f(