# ASVRG: Accelerated Proximal SVRG

This paper proposes an accelerated proximal stochastic variance reduced gradient (ASVRG) method, in which we design a simple and effective momentum acceleration trick. Unlike most existing accelerated stochastic variance reduction methods such as Katyusha, ASVRG has only one additional variable and one momentum parameter. Thus, ASVRG is much simpler than those methods, and has much lower per-iteration complexity. We prove that ASVRG achieves the best known oracle complexities for both strongly convex and non-strongly convex objectives. In addition, we extend ASVRG to mini-batch and non-smooth settings. We also empirically verify our theoretical results and show that the performance of ASVRG is comparable with, and sometimes even better than that of the state-of-the-art stochastic methods.


## 1 Introduction

Consider the following composite convex minimization:

$$\min_{x\in\mathbb{R}^d} F(x) := f(x) + g(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x) + g(x), \qquad (1)$$

where f(x) := (1/n)∑_{i=1}^n f_i(x) is a convex function that is a finite average of n convex component functions f_i(x), and g(x) is a "simple" possibly non-smooth convex function. This formulation naturally arises in many problems in machine learning, optimization and signal processing, such as regularized empirical risk minimization (ERM) and eigenvector computation (Shamir, 2015; Garber et al., 2016). To solve Problem (1) with a large sum of component functions, computing the full gradient of f(x) in first-order methods is expensive, and hence stochastic gradient descent (SGD) has been widely applied to many large-scale problems (Zhang, 2004; Krizhevsky et al., 2012). The update rule of proximal SGD is

$$x_t = \arg\min_{y\in\mathbb{R}^d}\Big\{\frac{1}{2\eta_t}\|y-x_{t-1}\|^2 + y^{T}\nabla f_{i_t}(x_{t-1}) + g(y)\Big\}, \qquad (2)$$

where η_t is the step size, and the index i_t is chosen uniformly at random from {1, …, n}. When g(x) ≡ 0, the update rule in (2) becomes x_t = x_{t−1} − η_t ∇f_{i_t}(x_{t−1}). The standard SGD estimates the gradient from just one example (or a mini-batch), and thus it enjoys a low per-iteration cost as opposed to full gradient methods. The stochastic gradient ∇f_{i_t}(x_{t−1}) is an unbiased estimate of the full gradient, i.e., E[∇f_{i_t}(x_{t−1})] = ∇f(x_{t−1}). However, the variance of the stochastic gradient estimator may be large, which leads to slow convergence (Johnson and Zhang, 2013). Even under the strongly convex (SC) and smooth conditions, standard SGD attains only a sub-linear rate of convergence (Rakhlin et al., 2012; Shamir and Zhang, 2013).
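The proximal SGD update in (2) can be sketched as follows; this minimal NumPy example uses ℓ1 regularization, whose proximal operator is soft-thresholding, and least-squares components (all names and values are illustrative):

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1 (soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def prox_sgd_step(x, grad_fi, eta, lam):
    """One proximal SGD step as in (2): gradient step on f_{i_t}, then prox of g."""
    return soft_threshold(x - eta * grad_fi, eta * lam)

# Toy least-squares components: f_i(x) = 0.5 * (a_i^T x - b_i)^2
rng = np.random.default_rng(0)
A, b = rng.standard_normal((100, 5)), rng.standard_normal(100)
x = np.zeros(5)
for t in range(1000):
    i = rng.integers(100)                 # i_t drawn uniformly at random
    g = (A[i] @ x - b[i]) * A[i]          # stochastic gradient of f_{i_t}
    x = prox_sgd_step(x, g, eta=0.01, lam=0.1)
```

With a decaying step size η_t the iterates converge only sub-linearly, which is the behavior the variance reduced methods below improve upon.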

Recently, the convergence rate of SGD has been improved by many variance reduced SGD methods (Roux et al., 2012; Shalev-Shwartz and Zhang, 2013; Johnson and Zhang, 2013; Defazio et al., 2014a; Mairal, 2015) and their proximal variants (Schmidt et al., 2013; Xiao and Zhang, 2014; Shalev-Shwartz and Zhang, 2016). These methods use past gradients to progressively reduce the variance of stochastic gradient estimators, so that a constant step size can be used. In particular, these variance reduced SGD methods converge linearly for SC and Lipschitz-smooth problems. SVRG (Johnson and Zhang, 2013) and its proximal variant, Prox-SVRG (Xiao and Zhang, 2014), are particularly attractive because of their low storage requirement compared with the methods in (Roux et al., 2012; Defazio et al., 2014a; Shalev-Shwartz and Zhang, 2016), which need to store all the gradients of the n component functions (or dual variables), so that O(n) storage is required in general problems. At the beginning of each epoch of SVRG and Prox-SVRG, the full gradient ∇f(x̃) is computed at the past estimate x̃. Then the key update rules are given by

$$\widetilde{\nabla} f_{i_t}(x_{t-1}) = \nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\widetilde{x}) + \nabla f(\widetilde{x}),\quad
x_t = \arg\min_{y\in\mathbb{R}^d}\Big\{\frac{1}{2\eta}\|y-x_{t-1}\|^2 + y^{T}\widetilde{\nabla} f_{i_t}(x_{t-1}) + g(y)\Big\}. \qquad (3)$$

For SC problems, the oracle complexity (total number of component gradient evaluations to find an ε-suboptimal solution) of most variance reduced SGD methods is O((n + L/μ) log(1/ε)), when each f_i(x) is L-smooth and F(x) is μ-strongly convex. Thus, there still exists a gap between this oracle complexity and the upper bound O((n + √(nL/μ)) log(1/ε)) in (Woodworth and Srebro, 2016). In theory, they also converge slower than accelerated deterministic algorithms (e.g., FISTA (Beck and Teboulle, 2009)) for non-strongly convex (non-SC) problems, namely O(1/T) vs. O(1/T²).
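The variance reduced estimator in (3) can be sketched as follows; averaging it over all indices i recovers the exact full gradient, which confirms unbiasedness (illustrative NumPy code with least-squares components):

```python
import numpy as np

def svrg_gradient(grad_i, x, x_snap, full_grad_snap, i):
    """SVRG estimator (3): grad at x, corrected by the snapshot gradients."""
    return grad_i(x, i) - grad_i(x_snap, i) + full_grad_snap

rng = np.random.default_rng(0)
A, b = rng.standard_normal((50, 3)), rng.standard_normal(50)
grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i]       # component gradient
full_grad = lambda x: A.T @ (A @ x - b) / len(b)     # full gradient

x, x_snap = rng.standard_normal(3), rng.standard_normal(3)
mu = full_grad(x_snap)                               # computed once per epoch
# Averaging the estimator over all i recovers the exact full gradient.
est = np.mean([svrg_gradient(grad_i, x, x_snap, mu, i) for i in range(50)], axis=0)
assert np.allclose(est, full_grad(x))
```

As x and x_snap approach the optimum, the correction terms shrink, so the estimator's variance vanishes and a constant step size can be used.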

Very recently, several advanced techniques were proposed to further speed up the variance reduced SGD methods mentioned above. These techniques mainly include the Nesterov acceleration technique (Nitanda, 2014; Lin et al., 2015; Murata and Suzuki, 2017; Lan and Zhou, 2018), the projection-free property of the conditional gradient method (Hazan and Luo, 2016), reducing the number of gradient calculations in the early iterations (Babanezhad et al., 2015; Allen-Zhu and Yuan, 2016; Shang et al., 2017), and the momentum acceleration trick (Hien et al., 2018; Allen-Zhu, 2018; Zhou et al., 2018; Shang et al., 2018b). Lin et al. (2015) and Frostig et al. (2015) proposed two accelerated algorithms with an improved oracle complexity of Õ((n + √(nL/μ)) log(1/ε)) for SC problems. In particular, Katyusha (Allen-Zhu, 2018) attains the optimal oracle complexities of O((n + √(nL/μ)) log(1/ε)) and O(n√(1/ε) + √(nL/ε)) for SC and non-SC problems, respectively. The main update rules of Katyusha are formulated as follows:

$$\begin{aligned}
x_{t} &= \omega_1 z_{t-1} + \omega_2 \widetilde{x} + (1-\omega_1-\omega_2)\, y_{t-1},\\
y_{t} &= \arg\min_{y\in\mathbb{R}^d}\Big\{\frac{1}{2\eta}\|y-x_{t}\|^2 + y^{T}\widetilde{\nabla} f_{i_t}(x_{t}) + g(y)\Big\},\\
z_{t} &= \arg\min_{z\in\mathbb{R}^d}\Big\{\frac{1}{2\alpha}\|z-z_{t-1}\|^2 + z^{T}\widetilde{\nabla} f_{i_t}(x_{t}) + g(z)\Big\},
\end{aligned} \qquad (4)$$

where ω1, ω2 ∈ [0, 1] are the two parameters for the key momentum terms, η and α are step sizes, and ∇̃f_{i_t}(x_t) is the variance reduced gradient estimator in (3). Note that the parameter ω2 is fixed to 1/2 in (Allen-Zhu, 2018) to avoid parameter tuning.

Our Contributions. In spite of the success of momentum acceleration tricks, most existing accelerated methods, including Katyusha, require at least two auxiliary variables and two corresponding momentum parameters (e.g., ω1 for the Nesterov momentum and ω2 for the Katyusha momentum in (4)), which lead to complicated algorithm design and high per-iteration complexity. We address these weaknesses with a simpler accelerated proximal stochastic variance reduced gradient (ASVRG) method, which requires only one auxiliary variable and one momentum parameter. Thus, ASVRG leads to much simpler algorithm design and is more efficient than the existing accelerated methods. Impressively, ASVRG attains the same low oracle complexities as Katyusha for both SC and non-SC objectives. We summarize our main contributions as follows.

• We design a simple momentum acceleration trick to accelerate the original SVRG. Different from most accelerated algorithms such as Katyusha, which require the two momentum terms mentioned above, our update rule has only one momentum acceleration term.

• We prove that ASVRG converges to an ε-minimizer with the oracle complexity of O((n + √(nL̃/μ)) log(1/ε)) for SC problems, which is the same as that in (Defazio, 2016; Allen-Zhu, 2018), and matches the upper bound in (Woodworth and Srebro, 2016).

• We also prove that ASVRG achieves the optimal convergence rate of O(1/T²) and the oracle complexity of O(n√(1/ε) + √(nL̃/ε)) for non-SC problems, which is identical to the best known result in (Hien et al., 2018; Allen-Zhu, 2018).

• Finally, we introduce mini-batching, adaptive regularization and smoothing techniques into our algorithms, further analyze their convergence properties, and summarize the oracle complexities of ASVRG for the four cases of Problem (1) in Table 2.

## 2 Preliminaries

Throughout this paper, we use ‖·‖ to denote the standard Euclidean norm. ∇f(x) denotes the full gradient of f(x) if it is differentiable, or a subgradient if f(x) is only Lipschitz continuous. We mostly focus on the case of Problem (1) when each component function f_i(x) is L_i-smooth (in the following, we mainly consider the more general class of Problem (1) in which every f_i(x) can have a different degree of smoothness L_i, rather than the gradients of all component functions having the same Lipschitz constant L), and F(x) is μ-strongly convex.

###### Assumption 1

Each convex component function f_i(x) is L_i-smooth if there exists a constant L_i > 0 such that for all x, y ∈ R^d, ‖∇f_i(x) − ∇f_i(y)‖ ≤ L_i ‖x − y‖.

###### Assumption 2

A function g(x) is μ-strongly convex (μ-SC) if there exists a constant μ > 0 such that for any x, y ∈ R^d and any (sub)gradient ξ of g at x (i.e., ξ = ∇g(x) if g is differentiable, or ξ ∈ ∂g(x) otherwise),

$$ g(y) \geq g(x) + \xi^{T}(y-x) + \frac{\mu}{2}\|x-y\|^2. $$

For a non-strongly convex function, the above inequality can always be satisfied with μ = 0. As summarized in (Allen-Zhu and Hazan, 2016), there are mainly four interesting cases of Problem (1):

• Case 1: Each f_i(x) is L_i-smooth and F(x) is μ-SC, e.g., ridge regression, logistic regression, and elastic net regularized logistic regression.

• Case 2: Each f_i(x) is L_i-smooth and F(x) is non-SC, e.g., Lasso and ℓ1-norm regularized logistic regression.

• Case 3: Each f_i(x) is non-smooth (but Lipschitz continuous) and F(x) is μ-SC, e.g., linear support vector machine (SVM).

• Case 4: Each f_i(x) is non-smooth (but Lipschitz continuous) and F(x) is non-SC, e.g., ℓ1-norm SVM.
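The strong convexity inequality of Assumption 2 can be checked numerically for a concrete μ-SC function: with g(x) = (λ/2)‖x‖² it holds (in fact with equality) for μ = λ. A minimal NumPy check, with all values illustrative:

```python
import numpy as np

lam = 0.5
g = lambda x: 0.5 * lam * x @ x          # lam-strongly convex regularizer
grad_g = lambda x: lam * x

rng = np.random.default_rng(0)
for _ in range(100):
    x, y = rng.standard_normal(4), rng.standard_normal(4)
    # Assumption 2: g(y) >= g(x) + <xi, y - x> + (mu/2)||x - y||^2 with mu = lam
    lhs = g(y)
    rhs = g(x) + grad_g(x) @ (y - x) + 0.5 * lam * np.sum((x - y) ** 2)
    assert lhs >= rhs - 1e-10
```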

## 3 Accelerated Proximal SVRG

In this section, we propose an accelerated proximal stochastic variance reduced gradient (ASVRG) method with momentum acceleration for solving both strongly convex and non-strongly convex objectives (e.g., Cases 1 and 2). Moreover, ASVRG incorporates a weighted sampling strategy as in (Xiao and Zhang, 2014; Zhao and Zhang, 2015; Shamir, 2016) to randomly pick i_t based on a general distribution P = {p_1, …, p_n} rather than the uniform distribution.

### 3.1 Iterate Averaging for Snapshot

Like SVRG, our algorithms are also divided into epochs, and each epoch consists of m stochastic updates, where m is usually chosen to be Θ(n), e.g., m = 2n as in Johnson and Zhang (2013). In practice, it was reported that reducing the number of gradient calculations in early iterations can lead to faster convergence (Babanezhad et al., 2015; Allen-Zhu and Yuan, 2016; Shang et al., 2017); thus we use fewer stochastic updates in the early epochs of our algorithms without increasing the difficulty of parameter tuning. Within each epoch, a full gradient ∇f(x̃^{s−1}) is calculated at the snapshot x̃^{s−1}. Note that we choose x̃^{s−1} to be the average of the past stochastic iterates rather than the last iterate because the former has been reported to work better in practice (Xiao and Zhang, 2014; Flammarion and Bach, 2015; Allen-Zhu and Yuan, 2016; Liu et al., 2017; Allen-Zhu, 2018; Shang et al., 2018b). In particular, one of the effects of this choice, i.e., x̃^s = (1/m)∑_{t=1}^m x_t^s, is to allow taking larger step sizes, e.g., 1/(3L) for ASVRG vs. 1/(10L) for SVRG.

### 3.2 ASVRG in Strongly Convex Case

We first consider the case of Problem (1) when each f_i(x) is L_i-smooth and F(x) is μ-SC. Different from existing accelerated methods such as Katyusha (Allen-Zhu, 2018), we propose a much simpler accelerated stochastic algorithm with momentum, as outlined in Algorithm 1. Compared with the initialization in Option I of Algorithm 1, the initialization in Option II also works well in practice.

#### 3.2.1 Momentum Acceleration

The update rule of y in our proximal stochastic gradient method is formulated as follows:

$$ y_t^s = \arg\min_{y\in\mathbb{R}^d}\Big\{\big\langle \widetilde{\nabla} f_{i_t}(x_{t-1}^s),\, y - y_{t-1}^s\big\rangle + \frac{\omega}{2\eta}\|y - y_{t-1}^s\|^2 + g(y)\Big\}, \qquad (5)$$

where ω ∈ (0, 1] is the momentum parameter. Note that the gradient estimator used in this paper is the SVRG estimator in (3). Besides, the algorithms and convergence results of this paper can be generalized to the SAGA estimator in (Defazio et al., 2014a). When ω = 1, the proximal update rule in (5) degenerates to the Prox-SVRG update in (3).

Inspired by the Nesterov momentum in (Nesterov, 1983, 2004; Nitanda, 2014; Shang et al., 2018a) and the Katyusha momentum in (Allen-Zhu, 2018), we design an update rule for x as follows:

$$ x_t^s = \widetilde{x}^{s-1} + \omega\,(y_t^s - \widetilde{x}^{s-1}). \qquad (6)$$

The second term on the right-hand side of (6) is the proposed momentum, similar to the Katyusha momentum in (Allen-Zhu, 2018). It is clear that there is only one momentum parameter ω in our algorithm, compared with the two parameters ω1 and ω2 in Katyusha. (Although Acc-Prox-SVRG (Nitanda, 2014) also has a momentum parameter, its oracle complexity is no faster than that of SVRG when the mini-batch size is small, as discussed in (Allen-Zhu, 2018).)

The per-iteration complexity of ASVRG is dominated by the computation of the two component gradients ∇f_{i_t}(x_{t−1}^s) and ∇f_{i_t}(x̃^{s−1}) and the proximal update in (5), which is as low as that of SVRG (Johnson and Zhang, 2013) and Prox-SVRG (Xiao and Zhang, 2014). (For some regularized ERM problems, we can save the intermediate gradients computed at the snapshot, which requires O(n) storage in general as in (Defazio et al., 2014b).) In other words, ASVRG has a much lower per-iteration complexity than most accelerated stochastic methods (Murata and Suzuki, 2017; Allen-Zhu, 2018) such as Katyusha (Allen-Zhu, 2018), which has one more proximal update per iteration in general.
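One inner iteration of ASVRG, combining the proximal update (5) with the momentum step (6), can be sketched as follows. This is a minimal sketch assuming ℓ1 regularization g = λ‖·‖₁ (so the proximal step is soft-thresholding); function names are illustrative:

```python
import numpy as np

def soft_threshold(v, t):
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def asvrg_inner_step(y_prev, x_snap, vr_grad, eta, omega, lam):
    """One ASVRG inner iteration for g = lam * ||.||_1.

    (5): y_t = prox_{(eta/omega) g}( y_{t-1} - (eta/omega) * vr_grad )
    (6): x_t = x_snap + omega * (y_t - x_snap)
    """
    step = eta / omega
    y = soft_threshold(y_prev - step * vr_grad, step * lam)
    x = x_snap + omega * (y - x_snap)
    return y, x

# With omega = 1 the momentum step vanishes and x_t = y_t (Prox-SVRG update).
y1, x1 = asvrg_inner_step(np.zeros(3), np.zeros(3),
                          np.array([1.0, -2.0, 0.1]), eta=0.1, omega=1.0, lam=0.5)
assert np.allclose(x1, y1)
```

Note that only the single sequence y and the single parameter ω appear, in contrast to the two auxiliary sequences of Katyusha in (4).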

#### 3.2.2 Momentum Parameter

Next we give a selection scheme for the stochastic momentum parameter ω. With the learning rate η given, ω can be a constant, and it must satisfy the inequality ρ < 1, where ρ := 1 − ω + ω²/(mμη). As shown in Theorem 4.1 below, it is desirable to have a small convergence factor ρ. The following proposition gives the optimal ω, which yields the smallest value of ρ.

Given a suitable learning rate η, the optimal parameter is ω⋆ = mμη/2, which minimizes the convergence factor ρ := 1 − ω + ω²/(mμη).

In fact, we can also fix ω to a constant, e.g., ω = 0.9, which works well in practice as in (Ruder, 2017). When ω = 1 and g(·) is smooth, Algorithm 1 degenerates to Algorithm 3 in the supplementary material, which is almost identical to SVRG (Johnson and Zhang, 2013); the only differences between them are the choice of the snapshot x̃^s and the initialization of y_0^s.
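The convergence factor ρ = 1 − ω + ω²/(mμη) from Theorem 4.1 can also be minimized numerically over ω; a short sketch with illustrative example values:

```python
import numpy as np

# Example values (illustrative); rho is the convergence factor of Theorem 4.1.
m, mu, eta = 200, 0.05, 0.05
rho = lambda w: 1.0 - w + w ** 2 / (m * mu * eta)

ws = np.linspace(1e-3, 1.0, 100000)
w_best = ws[np.argmin(rho(ws))]

# The unconstrained minimizer of rho over omega is m*mu*eta/2.
assert abs(w_best - m * mu * eta / 2) < 1e-3
assert rho(w_best) < 1.0   # a valid geometric convergence factor
```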

### 3.3 ASVRG in Non-Strongly Convex Case

We also develop an efficient algorithm for solving non-SC problems, as outlined in Algorithm 2. The main difference between Algorithms 1 and 2 is the setting of the momentum parameter. That is, the momentum parameter ω_s in Algorithm 2 is decreasing, while that of Algorithm 1 can be a constant. Different from Algorithm 1, ω_s in Algorithm 2 needs to satisfy the following inequalities:

 (7)

It is clear that the condition (7) allows the stochastic momentum parameter ω_s to decrease, but not too fast, similar to the requirement on the step size η_t in classical SGD. Unlike deterministic acceleration methods, where ω_s is only required to satisfy the first inequality in (7), the momentum parameter in Algorithm 2 must satisfy both inequalities. Inspired by the momentum acceleration techniques in (Tseng, 2010; Su et al., 2014) for deterministic optimization, the update rule for ω_s is defined as follows: ω_0 is initialized as in Algorithm 2, and for any s ≥ 1, ω_s = (√(ω_{s−1}⁴ + 4ω_{s−1}²) − ω_{s−1}²)/2.
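As a sketch, assuming the classical recursion ω_s = (√(ω_{s−1}⁴ + 4ω_{s−1}²) − ω_{s−1}²)/2 from the deterministic acceleration literature (Tseng, 2010; Su et al., 2014), the decrease condition (1 − ω_s)/ω_s² ≤ 1/ω_{s−1}² holds with equality, which is easy to verify numerically:

```python
import numpy as np

def next_omega(w):
    """Solves w_new^2 = (1 - w_new) * w^2, i.e. (1 - w_new)/w_new^2 = 1/w^2."""
    return (np.sqrt(w ** 4 + 4 * w ** 2) - w ** 2) / 2

w = 0.9
for s in range(50):
    w_new = next_omega(w)
    # The decrease condition holds with equality under this recursion.
    assert abs((1 - w_new) / w_new ** 2 - 1 / w ** 2) < 1e-6
    w = w_new
# omega_s decreases roughly like O(1/s), which is the scaling behind O(1/S^2).
assert w < 0.1
```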

## 4 Convergence Analysis

In this section, we provide the convergence analysis of ASVRG for both strongly convex and non-strongly convex objectives. We first give the following key intermediate result (the proofs of all theoretical results in this paper are given in the supplementary material).

Suppose Assumption 1 holds. Let x⋆ be an optimal solution of Problem (1), and let {x̃^s} be the sequence generated by Algorithms 1 and 2. (Note that the momentum parameter ω_{s−1} in Algorithm 1 is a constant, that is, ω_{s−1} ≡ ω. In addition, if the length of the early epochs is not sufficiently large, those epochs can be viewed as an initialization step.) Then for all s ≥ 1,

$$ \mathbb{E}\big[F(\widetilde{x}^s) - F(x^\star)\big] \leq (1-\omega_{s-1})\,\mathbb{E}\big[F(\widetilde{x}^{s-1}) - F(x^\star)\big] + \frac{\omega_{s-1}^2}{2m\eta}\,\mathbb{E}\big[\|x^\star - y_0^s\|^2 - \|x^\star - y_m^s\|^2\big]. $$

### 4.1 Analysis for Strongly Convex Objectives

For strongly convex objectives, our first main result is the following theorem, which gives the convergence rate and oracle complexity of Algorithm 1.

Suppose Assumptions 1 and 2 hold. Given the same notation as in the lemma above, suppose m is sufficiently large so that

$$ \rho := 1 - \omega + \frac{\omega^2}{m\mu\eta} < 1. $$

Then Algorithm 1 with Option I has the following geometric convergence in expectation:

$$ \mathbb{E}\big[F(\widetilde{x}^s) - F(x^\star)\big] \leq \rho^s\,\big[F(\widetilde{x}^0) - F(x^\star)\big]. $$

Theorem 4.1 shows that Algorithm 1 with Option I achieves linear convergence for strongly convex problems. We can easily obtain a similar result for Algorithm 1 with Option II. The following results give the oracle complexities of Algorithm 1 with Option I or Option II, as shown in Figure 1, where κ := L̃/μ is the condition number.

The oracle complexity of Algorithm 1 with Option I to achieve an ε-suboptimal solution (i.e., E[F(x̃^s) − F(x⋆)] ≤ ε) is

$$ \begin{cases} O\Big(\sqrt{n\widetilde{L}/\mu}\,\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon}\Big), & \text{if } m\mu/\widetilde{L} \in [0.68623,\,145.72],\\[4pt] O\Big((n+\widetilde{L}/\mu)\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon}\Big), & \text{otherwise}. \end{cases} $$

The oracle complexity of Algorithm 1 with Option II and periodic restarts to achieve an ε-suboptimal solution (i.e., E[F(x̃^s) − F(x⋆)] ≤ ε) is as follows. (For each restart, the new initial point is set to the latest snapshot. If we choose the snapshot to be the weighted average as in (Allen-Zhu, 2018) rather than the uniform average, our algorithm without restarts can also achieve this tightest possible result.)

$$ O\Big(\big(n+\sqrt{n\widetilde{L}/\mu}\,\big)\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon}\Big). $$

For the most commonly used uniform random sampling (i.e., sampling probabilities p_i = 1/n for all i), the oracle complexity of ASVRG becomes O((n + √(nL/μ)) log(1/ε)), which is identical to that of Katyusha (Allen-Zhu, 2018) and Point-SAGA (Defazio, 2016), and better than those of non-accelerated methods (e.g., the O((n + L/μ) log(1/ε)) complexity of SVRG).

For uniform random sampling (p_i = 1/n), L̃ equals the largest smoothness constant max_i L_i, and the above oracle bound can be rewritten accordingly. As each f_i(x) generally has a different degree of smoothness L_i, picking the random index i_t from a non-uniform distribution can be a much better choice than simple uniform random sampling (Zhao and Zhang, 2015; Needell et al., 2016). For instance, the sampling probabilities can be chosen proportional to the Lipschitz constants, i.e., p_i = L_i / ∑_{j=1}^n L_j. In this case, the oracle complexity becomes O((n + √(nL_avg/μ)) log(1/ε)), where L_avg := (1/n)∑_{i=1}^n L_i ≤ max_i L_i. In other words, the statement of Corollary 4.1 can be revised by simply replacing L̃ with L_avg, and the proof only needs some minor changes accordingly. In fact, all statements in this section can be revised by replacing L̃ with L_avg.
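The weighted sampling scheme discussed above, with probabilities proportional to the component Lipschitz constants, can be sketched as follows (the constants L_i here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
L = rng.uniform(1.0, 10.0, size=1000)   # per-component Lipschitz constants L_i
p = L / L.sum()                         # p_i proportional to L_i

# Draw indices i_t from the non-uniform distribution p.
idx = rng.choice(len(L), size=200000, p=p)

# Empirical frequencies track p; the 1/(n p_i) reweighting in the estimator
# keeps the stochastic gradient unbiased under this sampling.
freq = np.bincount(idx, minlength=len(L)) / len(idx)
assert np.max(np.abs(freq - p)) < 5e-3
```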

### 4.2 Analysis for Non-Strongly Convex Objectives

For non-strongly convex objectives, our second main result is the following theorem, which gives the convergence rate and oracle complexity of Algorithm 2.

Suppose Assumption 1 holds. Then the following result holds,

$$ \mathbb{E}\big[F(\widetilde{x}^S) - F(x^\star)\big] \leq \frac{4(\alpha-1)\,\big[F(\widetilde{x}^0)-F(x^\star)\big]}{(\alpha-2)^2 (S+1)^2} + \frac{2\|x^\star - \widetilde{x}^0\|^2}{\eta m (S+1)^2}, $$

where α > 2 is a constant determined by the momentum parameters. Furthermore, choosing m = Θ(n), Algorithm 2 achieves an ε-suboptimal solution, i.e., E[F(x̃^S)] − F(x⋆) ≤ ε, using at most O(n√(1/ε) + √(nL̃/ε)‖x̃^0 − x⋆‖) stochastic iterations. One can see that the oracle complexity of ASVRG is consistent with the best known result in (Hien et al., 2018; Allen-Zhu, 2018), and all these methods attain the optimal convergence rate O(1/S²). Moreover, we can apply the adaptive regularization technique in (Allen-Zhu and Hazan, 2016) to the original non-SC problem, and obtain a SC objective with a gradually decreasing regularization parameter, e.g., halved at each stage as in (Allen-Zhu and Hazan, 2016). Then we have the following result.

Suppose Assumption 1 holds, and F(x) is non-SC. By applying the adaptive regularization technique in (Allen-Zhu and Hazan, 2016) to Algorithm 1, we obtain an ε-suboptimal solution using at most the following oracle complexity:

$$ O\Big(n\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon} + \frac{\sqrt{n\widetilde{L}}\;\|\widetilde{x}^0 - x^\star\|}{\sqrt{\varepsilon}}\Big). $$

Corollary 4.2 implies that ASVRG has a low oracle complexity, i.e., O(n log(1/ε) + √(nL̃/ε)), which is the same as that in (Allen-Zhu, 2018). Both ASVRG and Katyusha have a much faster rate than SAGA (Defazio et al., 2014a), whose oracle complexity is O(n/ε).

Although ASVRG is much simpler than Katyusha, all the theoretical results show that ASVRG achieves the same convergence rates and oracle complexities as Katyusha for both SC and non-SC cases. Similar to (Babanezhad et al., 2015; Allen-Zhu and Yuan, 2016), we can reduce the number of gradient calculations in early iterations to further speed up ASVRG in practice.

## 5 Extensions of ASVRG

In this section, we first extend ASVRG to the mini-batch setting. Then we extend Algorithm 1 and its convergence results to the non-smooth setting (e.g., the problems in Cases 3 and 4).

### 5.1 Mini-Batch

In this part, we extend ASVRG and its convergence results to the mini-batch setting. Suppose that the mini-batch size is b; then the stochastic gradient estimator with variance reduction becomes

$$ \widetilde{\nabla} f_{I_t}(x_{t-1}^s) = \frac{1}{b}\sum_{i\in I_t}\frac{1}{np_i}\big[\nabla f_i(x_{t-1}^s) - \nabla f_i(\widetilde{x}^{s-1})\big] + \nabla f(\widetilde{x}^{s-1}), $$

where I_t ⊂ {1, …, n} is a mini-batch of size b. Consequently, the conditions on the momentum parameters ω and ω_s for the SC and non-SC cases, respectively, are adjusted by the factor τ(b) := (n−b)/(b(n−1)). Moreover, the upper bound on the variance of ∇̃f_{I_t}(x_{t−1}^s) can be extended to the mini-batch setting as follows.

$$ \mathbb{E}\Big[\big\|\widetilde{\nabla} f_{I_t}(x_{t-1}^s) - \nabla f(x_{t-1}^s)\big\|^2\Big] \leq \frac{2\widetilde{L}(n-b)}{b(n-1)}\Big[f(\widetilde{x}^{s-1}) - f(x_{t-1}^s) + \nabla f(x_{t-1}^s)^{T}(x_{t-1}^s - \widetilde{x}^{s-1})\Big]. $$
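The mini-batch estimator remains unbiased; the following sketch checks this numerically under uniform sampling (p_i = 1/n, so the reweighting factor 1/(np_i) = 1; the least-squares components are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, b = 60, 4, 8
A, c = rng.standard_normal((n, d)), rng.standard_normal(n)
grad_i = lambda x, i: (A[i] @ x - c[i]) * A[i]       # component gradient
full_grad = lambda x: A.T @ (A @ x - c) / n          # full gradient

x, x_snap = rng.standard_normal(d), rng.standard_normal(d)
mu_snap = full_grad(x_snap)                          # snapshot full gradient

def minibatch_vr_grad(batch):
    """Mini-batch variance reduced estimator with uniform sampling (p_i = 1/n)."""
    corr = np.mean([grad_i(x, i) - grad_i(x_snap, i) for i in batch], axis=0)
    return corr + mu_snap

# Averaging over many random mini-batches approaches the exact full gradient.
est = np.mean([minibatch_vr_grad(rng.choice(n, b, replace=False))
               for _ in range(20000)], axis=0)
assert np.max(np.abs(est - full_grad(x))) < 0.08
```

Increasing b toward n drives the variance factor (n−b)/(b(n−1)) in the bound above to zero, recovering the deterministic setting.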

In the SC case, the convergence result of the mini-batch variant of ASVRG is identical to Theorem 4.1 and Corollary 4.1. (Note that in the mini-batch setting, the number of stochastic iterations of the inner loop in Algorithms 1 and 2 is reduced from m to m/b.) For the non-SC case, the initial momentum parameter ω_0 is adjusted accordingly. Then Theorem 4.2 can be extended to the mini-batch setting as follows.

Suppose Assumption 1 holds. Given the same notation as in Theorem 4.2 and τ(b) := (n−b)/(b(n−1)), the following inequality holds:

$$ \mathbb{E}\big[F(\widetilde{x}^s) - F(x^\star)\big] \leq \frac{4(\alpha-1)\,\tau(b)\,\big[F(\widetilde{x}^0)-F(x^\star)\big]}{[\alpha-1-\tau(b)]^2 (s+1)^2} + \frac{2\widetilde{L}\alpha\,\|x^\star - \widetilde{x}^0\|^2}{m (s+1)^2}. \qquad (8)$$

For the special case of b = 1, we have τ(1) = 1, and then Theorem 5.1 degenerates to Theorem 4.2. When b = n (i.e., the batch setting), then τ(n) = 0, and the first term on the right-hand side of (8) vanishes. In this case, our algorithm degenerates to an accelerated deterministic method with the optimal convergence rate of O(1/T²), where T is the number of iterations.

### 5.2 Non-Smooth Settings

In addition to the application of the regularization reduction technique in (Allen-Zhu and Hazan, 2016) to the class of smooth and non-SC problems (i.e., Case 2 of Problem (1)), as shown in Section 4.2, ASVRG can also be extended to solve the problems in Cases 3 and 4, when each component f_i(x) is G-Lipschitz continuous, which is defined as follows. A function f_i(x) is G-Lipschitz continuous if there exists a constant G > 0 such that, for any x, y ∈ R^d, |f_i(x) − f_i(y)| ≤ G‖x − y‖.

The key technique is to use a proximal operator to obtain gradients of the δ_s-Moreau envelope of a non-smooth function f_i, defined as

$$ f_i^{\delta_s}(x) = \inf_{y\in\mathbb{R}^d}\Big\{ f_i(y) + \frac{\delta_s}{2}\|x-y\|^2 \Big\}, \qquad (9)$$

where δ_s is an increasing smoothing parameter as in (Allen-Zhu and Hazan, 2016). That is, we use the proximal operator to smooth each component function, and optimize the new, smooth function that approximates the original problem. This technique has been used in Katyusha (Allen-Zhu, 2018) and accelerated SDCA (Shalev-Shwartz and Zhang, 2016) for non-smooth objectives.

###### Property 1 (Nesterov (2005); Bauschke and Combettes (2011); Orabona et al. (2012))

Let each f_i be convex and G-Lipschitz continuous. For any δ > 0, the following results hold:
(a) f_i^{δ} is δ-smooth;
(b) f_i^{δ}(x) ≤ f_i(x) ≤ f_i^{δ}(x) + G²/(2δ) for all x ∈ R^d;
(c) ∇f_i^{δ}(x) = δ(x − prox_{f_i/δ}(x)).
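As a concrete one-dimensional example of (9), the Moreau envelope of f_i(x) = |x| is the Huber function, which is computed in closed form via the soft-thresholding proximal operator; the sketch below numerically checks the G²/(2δ) approximation bound with G = 1 (all values illustrative):

```python
import numpy as np

def moreau_envelope_abs(x, delta):
    """(9) for f_i(x) = |x|: inf_y |y| + (delta/2)(x - y)^2 (the Huber function)."""
    y = np.sign(x) * np.maximum(np.abs(x) - 1.0 / delta, 0.0)  # proximal point
    return np.abs(y) + 0.5 * delta * (x - y) ** 2

def moreau_grad_abs(x, delta):
    """Gradient delta * (x - prox): a clipped, i.e. smoothed, sign function."""
    return np.clip(delta * x, -1.0, 1.0)

delta = 10.0
xs = np.linspace(-2.0, 2.0, 401)
env = moreau_envelope_abs(xs, delta)

# The envelope lower-bounds |x| and stays within G^2/(2*delta) of it (G = 1).
assert np.all(env <= np.abs(xs) + 1e-12)
assert np.max(np.abs(xs) - env) <= 0.5 / delta + 1e-12
```

Increasing δ tightens the approximation while increasing the smoothness constant, which is the trade-off exploited by the adaptive smoothing schedule.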

By using similar techniques to those in (Allen-Zhu and Hazan, 2016), we can apply ASVRG to solve the smoothed problem. It is easy to verify that ASVRG satisfies the homogenous objective decrease (HOOD) property in (Allen-Zhu and Hazan, 2016) (see (Allen-Zhu and Hazan, 2016) for the details of HOOD), as shown below.

Algorithm 1 applied to the problem in Case 1 satisfies the HOOD property. That is, for every starting point x_0, Algorithm 1 produces an output x′ satisfying E[F(x′) − F(x⋆)] ≤ [F(x_0) − F(x⋆)]/4 in at most O(n + √(nL̃/μ)) iterations.

In the following, we extend the result in Theorem 4.1 to the non-smooth setting. Let each f_i(x) be G-Lipschitz continuous, and F(x) be μ-strongly convex. By applying the adaptive smoothing technique in (Allen-Zhu and Hazan, 2016) to ASVRG, we obtain an ε-suboptimal solution using at most the following oracle complexity:

$$ O\Big(n\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon} + \frac{\sqrt{n}\,G}{\sqrt{\mu\varepsilon}}\Big). $$

Let each f_i(x) be G-Lipschitz continuous and F(x) be not necessarily strongly convex. By applying both the adaptive regularization and smoothing techniques in (Allen-Zhu and Hazan, 2016) to ASVRG, we obtain an ε-suboptimal solution using at most the following oracle complexity:

$$ O\Big(n\log\frac{F(\widetilde{x}^0)-F(x^\star)}{\varepsilon} + \frac{\sqrt{n}\,G\,\|\widetilde{x}^0 - x^\star\|}{\varepsilon}\Big). $$

From the two corollaries above, one can see that ASVRG converges to an ε-accurate solution for Case 3 of Problem (1) in O(n log(1/ε) + √n G/√(με)) iterations and for Case 4 in O(n log(1/ε) + √n G‖x̃^0 − x⋆‖/ε) iterations. That is, ASVRG achieves the same low oracle complexities as the accelerated stochastic variance reduction method Katyusha for these two classes of non-smooth problems (i.e., Cases 3 and 4 of Problem (1)).

## 6 Experiments

In this section, we evaluate the performance of ASVRG. All experiments were performed on a PC with an Intel i5-2400 CPU and 16GB RAM. We used two publicly available data sets in our experiments, Covtype and RCV1, which can be downloaded from the LIBSVM Data website.

### 6.1 Effectiveness of Our Momentum

Figure 2 shows the performance of ASVRG without momentum (i.e., Algorithm 3) and ASVRG in order to illustrate the importance and effectiveness of our momentum. Note that the same epoch length is set for the two algorithms, as well as for SVRG (Johnson and Zhang, 2013). It is clear that Algorithm 3 is ASVRG without our momentum acceleration technique. The main difference between Algorithm 3 and SVRG is that the snapshot and starting points of the former are set to the uniform average and the last iterate of the previous epoch, respectively, while both points of the latter are set to the last iterate. The results show that Algorithm 3 outperforms SVRG, suggesting that iterate averaging can work better in practice, as discussed in (Shang et al., 2018b). In particular, ASVRG converges significantly faster than both ASVRG without momentum and SVRG, meaning that our momentum acceleration technique can accelerate the convergence of ASVRG.

### 6.2 Comparison with Stochastic Methods

For fair comparison, ASVRG and the compared algorithms (including SVRG (Johnson and Zhang, 2013), SAGA (Defazio et al., 2014a), Acc-Prox-SVRG (Nitanda, 2014), Catalyst (Lin et al., 2015), and Katyusha (Allen-Zhu, 2018)) were implemented in C++ with a Matlab interface. There is only one parameter (i.e., the learning rate) to tune for all these methods except Catalyst and Acc-Prox-SVRG. In particular, we compare their performance in terms of both the number of effective passes over the data and running time (seconds). As in (Xiao and Zhang, 2014), each feature vector has been normalized to unit norm.

Figure 3 shows how the objective gap, i.e., F(x^s) − F(x⋆), of all these methods decreases on elastic net regularized logistic regression as time goes on. It is clear that ASVRG converges significantly faster than the other methods in terms of both oracle calls and running time, while Catalyst and Katyusha achieve comparable and sometimes even better performance than SVRG and SAGA in terms of running time (seconds). The main reason is that ASVRG not only takes advantage of the momentum acceleration trick, but also can use a much larger step size, e.g., 1/(3L) for ASVRG vs. 1/(10L) for SVRG. This empirically verifies our theoretical result in Corollary 4.1 that ASVRG has the same low oracle complexity as Katyusha. ASVRG significantly outperforms Katyusha in terms of running time, which implies that ASVRG has a much lower per-iteration cost than Katyusha.

## 7 Conclusions

We proposed an efficient ASVRG method, which integrates both the momentum acceleration trick and variance reduction technique. We first designed a simple momentum acceleration technique. Then we theoretically analyzed the convergence properties of ASVRG, which show that ASVRG achieves the same low oracle complexities for both SC and non-SC objectives as accelerated methods, e.g., Katyusha (Allen-Zhu, 2018). Moreover, we also extended ASVRG and its convergence results to both mini-batch settings and non-smooth settings.

It would be interesting to consider other classes of settings, e.g., the non-Euclidean norm setting. In practice, ASVRG is much simpler than the existing accelerated methods, and usually converges much faster, as verified in our experiments. Due to its simplicity, it is also more amenable to asynchronous parallel and distributed implementation for large-scale machine learning problems, similar to (Reddi et al., 2015; Sra et al., 2016; Mania et al., 2017; Zhou et al., 2018; Lee et al., 2017; Wang et al., 2017). One natural open problem is whether the best oracle complexities can be obtained by ASVRG in the asynchronous and distributed settings.

#### Acknowledgments

We thank the reviewers for their valuable comments. This work was supported in part by Grants (CUHK 14206715 & 14222816) from the Hong Kong RGC, the Project supported by the Foundation for Innovative Research Groups of the National Natural Science Foundation of China (No. 61621005), the Major Research Plan of the National Natural Science Foundation of China (Nos. 91438201 and 91438103), the National Natural Science Foundation of China (Nos. 61876220, 61876221, 61836009, U1701267, 61871310, 61573267, 61502369 and 61473215), the Program for Cheung Kong Scholars and Innovative Research Team in University (No. IRT_15R53), the Fund for Foreign Scholars in University Research and Teaching Programs (the 111 Project) (No. B07048), and the Science Foundation of Xidian University (No. 10251180018).

## References

• Allen-Zhu (2018) Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. J. Mach. Learn. Res., 18:1–51, 2018.
• Allen-Zhu and Hazan (2016) Z. Allen-Zhu and E. Hazan. Optimal black-box reductions between optimization objectives. In NIPS, pages 1606–1614, 2016.
• Allen-Zhu and Yuan (2016) Z. Allen-Zhu and Y. Yuan. Improved SVRG for non-strongly-convex or sum-of-non-convex objectives. In ICML, pages 1080–1089, 2016.
• Babanezhad et al. (2015) R. Babanezhad, M. O. Ahmed, A. Virani, M. Schmidt, J. Konecny, and S. Sallinen. Stop wasting my gradients: Practical SVRG. In NIPS, pages 2242–2250, 2015.
• Bauschke and Combettes (2011) H. H. Bauschke and P. L. Combettes. Convex analysis and monotone operator theory in Hilbert spaces. CMS Books in Mathematics, Springer, 2011.
• Beck and Teboulle (2009) A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM J. Imaging Sci., 2(1):183–202, 2009.
• Defazio (2016) A. Defazio. A simple practical accelerated method for finite sums. In NIPS, pages 676–684, 2016.
• Defazio et al. (2014a) A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In NIPS, pages 1646–1654, 2014a.
• Defazio et al. (2014b) A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In ICML, pages 1125–1133, 2014b.
• Flammarion and Bach (2015) N. Flammarion and F. Bach. From averaging to acceleration, there is only a step-size. In COLT, pages 658–695, 2015.
• Frostig et al. (2015) R. Frostig, R. Ge, S. M. Kakade, and A. Sidford. Un-regularizing: approximate proximal point and faster stochastic algorithms for empirical risk minimization. In ICML, pages 2540–2548, 2015.
• Garber et al. (2016) D. Garber, E. Hazan, C. Jin, S. M. Kakade, C. Musco, P. Netrapalli, and A. Sidford. Faster eigenvector computation via shift-and-invert preconditioning. In ICML, pages 2626–2634, 2016.
• Hazan and Luo (2016) E. Hazan and H. Luo. Variance-reduced and projection-free stochastic optimization. In ICML, pages 1263–1271, 2016.
• Hien et al. (2018) L. T. K. Hien, C. Lu, H. Xu, and J. Feng. Accelerated stochastic mirror descent algorithms for composite non-strongly convex optimization. arXiv:1605.06892v5, 2018.
• Johnson and Zhang (2013) R. Johnson and T. Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In NIPS, pages 315–323, 2013.
• Konečný et al. (2016) J. Konečný, J. Liu, P. Richtárik, and M. Takáč. Mini-batch semi-stochastic gradient descent in the proximal setting. IEEE J. Sel. Top. Sign. Proces., 10(2):242–255, 2016.
• Krizhevsky et al. (2012) A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097–1105, 2012.
• Lan (2012) G. Lan. An optimal method for stochastic composite optimization. Math. Program., 133:365–397, 2012.
• Lan and Zhou (2018) G. Lan and Y. Zhou. An optimal randomized incremental gradient method. Math. Program., 171:167–215, 2018.
• Lee et al. (2017) J. D. Lee, Q. Lin, T. Ma, and T. Yang. Distributed stochastic variance reduced gradient methods by sampling extra data with replacement. J. Mach. Learn. Res., 18:1–43, 2017.
• Lin et al. (2015) H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In NIPS, pages 3366–3374, 2015.
• Liu et al. (2017) Y. Liu, F. Shang, and J. Cheng. Accelerated variance reduced stochastic ADMM. In AAAI, pages 2287–2293, 2017.
• Mahdavi et al. (2013) M. Mahdavi, L. Zhang, and R. Jin. Mixed optimization for smooth functions. In NIPS, pages 674–682, 2013.
• Mairal (2015) J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM J. Optim., 25(2):829–855, 2015.
• Mania et al. (2017) H. Mania, X. Pan, D. Papailiopoulos, B. Recht, K. Ramchandran, and M. I. Jordan. Perturbed iterate analysis for asynchronous stochastic optimization. SIAM J. Optim., 27(4):2202–2229, 2017.
• Murata and Suzuki (2017) T. Murata and T. Suzuki. Doubly accelerated stochastic variance reduced dual averaging method for regularized empirical risk minimization. In NIPS, pages 608–617, 2017.
• Needell et al. (2016) D. Needell, N. Srebro, and R. Ward. Stochastic gradient descent, weighted sampling, and the randomized Kaczmarz algorithm. Math. Program., 155:549–573, 2016.
• Nesterov (1983) Y. Nesterov. A method of solving a convex programming problem with convergence rate $O(1/k^{2})$. Soviet Math. Doklady, 27:372–376, 1983.
• Nesterov (2004) Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publ., Boston, 2004.
• Nesterov (2005) Y. Nesterov. Smooth minimization of non-smooth functions. Math. Program., 103:127–152, 2005.
• Nitanda (2014) A. Nitanda. Stochastic proximal gradient descent with acceleration techniques. In NIPS, pages 1574–1582, 2014.
• Orabona et al. (2012) F. Orabona, A. Argyriou, and N. Srebro. PRISMA: Proximal iterative smoothing algorithm. arXiv:1206.2372v2, 2012.
• Rakhlin et al. (2012) A. Rakhlin, O. Shamir, and K. Sridharan. Making gradient descent optimal for strongly convex stochastic optimization. In ICML, pages 449–456, 2012.
• Reddi et al. (2015) S. Reddi, A. Hefny, S. Sra, B. Poczos, and A. Smola. On variance reduction in stochastic gradient descent and its asynchronous variants. In NIPS, pages 2629–2637, 2015.
• Roux et al. (2012) N. Le Roux, M. Schmidt, and F. Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In NIPS, pages 2672–2680, 2012.
• Ruder (2017) S. Ruder. An overview of gradient descent optimization algorithms. arXiv:1609.04747v2, 2017.
• Schmidt et al. (2013) M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. Technical report, INRIA, Paris, 2013.
• Shalev-Shwartz and Zhang (2013) S. Shalev-Shwartz and T. Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. J. Mach. Learn. Res., 14:567–599, 2013.
• Shalev-Shwartz and Zhang (2016) S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Math. Program., 155:105–145, 2016.
• Shamir (2015) O. Shamir. A stochastic PCA and SVD algorithm with an exponential convergence rate. In ICML, pages 144–152, 2015.
• Shamir (2016) O. Shamir. Without-replacement sampling for stochastic gradient methods. In NIPS, pages 46–54, 2016.
• Shamir and Zhang (2013) O. Shamir and T. Zhang. Stochastic gradient descent for non-smooth optimization: Convergence results and optimal averaging schemes. In ICML, pages 71–79, 2013.
• Shang et al. (2017) F. Shang, Y. Liu, J. Cheng, and J. Zhuo. Fast stochastic variance reduced gradient method with momentum acceleration for machine learning. arXiv:1703.07948, 2017.
• Shang et al. (2018a) F. Shang, Y. Liu, K. Zhou, J. Cheng, K. W. Ng, and Y. Yoshida. Guaranteed sufficient decrease for stochastic variance reduced gradient optimization. In AISTATS, pages 1027–1036, 2018a.
• Shang et al. (2018b) F. Shang, K. Zhou, H. Liu, J. Cheng, I. W. Tsang, L. Zhang, D. Tao, and L. Jiao. VR-SGD: A simple stochastic variance reduction method for machine learning. arXiv:1802.09932, 2018b.
• Sra et al. (2016) S. Sra, A. W. Yu, M. Li, and A. J. Smola. AdaDelay: Delay adaptive distributed stochastic optimization. In AISTATS, pages 957–965, 2016.
• Su et al. (2014) W. Su, S. P. Boyd, and E. J. Candes. A differential equation for modeling Nesterov’s accelerated gradient method: Theory and insights. In NIPS, pages 2510–2518, 2014.
• Tseng (2010) P. Tseng. Approximation accuracy, gradient methods, and error bound for structured convex optimization. Math. Program., 125:263–295, 2010.
• Wang et al. (2017) J. Wang, W. Wang, and N. Srebro. Memory and communication efficient distributed stochastic optimization with minibatch-prox. In COLT, pages 1882–1919, 2017.
• Woodworth and Srebro (2016) B. Woodworth and N. Srebro. Tight complexity bounds for optimizing composite objectives. In NIPS, pages 3639–3647, 2016.
• Xiao and Zhang (2014) L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM J. Optim., 24(4):2057–2075, 2014.
• Zhang et al. (2013) L. Zhang, M. Mahdavi, and R. Jin. Linear convergence with condition number independent access of full gradients. In NIPS, pages 980–988, 2013.
• Zhang (2004) T. Zhang. Solving large scale linear prediction problems using stochastic gradient descent algorithms. In ICML, pages 919–926, 2004.
• Zhao and Zhang (2015) P. Zhao and T. Zhang. Stochastic optimization with importance sampling for regularized loss minimization. In ICML, pages 1–9, 2015.
• Zhou et al. (2018) K. Zhou, F. Shang, and J. Cheng. A simple stochastic variance reduced algorithm with fast convergence rates. In ICML, pages 5975–5984, 2018.

## Appendix A

### Appendix A1: Proof of Proposition 1

Using Theorem 1, we have

$$\rho(\omega)=1-\omega+\frac{\omega^{2}}{m\mu\eta}.$$

Obviously, it is desirable to have a small convergence factor $\rho(\omega)$. So, we minimize $\rho(\omega)$ with respect to $\omega$ for a given $\eta$. Setting the derivative of $\rho(\omega)$ to zero, we have

$$\omega^{\star}=\frac{m\mu\eta}{2}\leq 1-\frac{\tilde{L}\eta}{1-\tilde{L}\eta},$$

where the inequality is the constraint on $\omega$ from Theorem 1, and

$$\rho(\omega^{\star})=1-\frac{m\mu\eta}{4}>0.$$

The above two inequalities imply that

$$\eta\leq\frac{1+4c_{1}-\sqrt{1+16c_{1}^{2}}}{2\tilde{L}}=\frac{1+4c_{1}-\sqrt{1+16c_{1}^{2}}}{2c_{1}m\mu}\quad\text{and}\quad\eta<\frac{4}{m\mu},$$

where $c_{1}=\tilde{L}/(m\mu)$. This completes the proof.
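As a numerical sanity check of this derivation, the short script below (illustrative only; the constants $\tilde{L}$, $\mu$, $m$ are arbitrary example values, not from the paper) verifies that $\omega^{\star}=m\mu\eta/2$ attains $\rho(\omega^{\star})=1-m\mu\eta/4$ and satisfies the Theorem 1 constraint whenever $\eta$ obeys the bound above.

```python
import math

# Hypothetical problem constants chosen for illustration only.
L_tilde, mu, m = 10.0, 0.1, 100           # smoothness, strong convexity, epoch length

c1 = L_tilde / (m * mu)                   # c_1 = L~/(m*mu)
# Largest step size allowed by the bound in the proof of Proposition 1.
eta_max = (1 + 4 * c1 - math.sqrt(1 + 16 * c1**2)) / (2 * L_tilde)
eta = 0.99 * min(eta_max, 4 / (m * mu))   # satisfy both bounds strictly

omega_star = m * mu * eta / 2             # minimizer of rho(omega)
rho = lambda w: 1 - w + w**2 / (m * mu * eta)

# rho attains its minimum 1 - m*mu*eta/4 at omega_star ...
assert abs(rho(omega_star) - (1 - m * mu * eta / 4)) < 1e-12
# ... omega_star satisfies the constraint omega <= 1 - L~eta/(1 - L~eta) ...
assert omega_star <= 1 - L_tilde * eta / (1 - L_tilde * eta) + 1e-12
# ... and the convergence factor lies in (0, 1).
assert 0 < rho(omega_star) < 1
```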

### Appendix A2: ASVRG Pseudo-Codes

We first give the details of Algorithm 1 in the smooth setting (i.e., without the proximal step) for optimizing smooth objective functions such as ℓ₂-norm regularized logistic regression, as shown in Algorithm 3, which is almost identical to the regularized SVRG in (Babanezhad et al., 2015) and the original SVRG in (Johnson and Zhang, 2013). The main differences between Algorithm 3 and the latter two methods are the initialization at the start of each epoch and the choice of the snapshot point. Moreover, we can use the doubling-epoch technique in (Mahdavi et al., 2013; Allen-Zhu and Yuan, 2016) to further speed up our ASVRG method in both the SC and non-SC cases. In addition, all the proposed algorithms can be extended to the mini-batch setting as in (Nitanda, 2014; Konečný et al., 2016). In particular, our ASVRG method can be extended to an accelerated incremental aggregated gradient method with the SAGA estimator of (Defazio et al., 2014a).
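To make the epoch structure concrete, the following is a minimal Python sketch of an SVRG-style epoch with a single momentum parameter ω and one auxiliary variable, in the spirit of Algorithm 3. It is an illustrative reconstruction under our own naming (`asvrg_smooth_sketch`, `grad_i`, `full_grad` are all hypothetical), not the authors' exact pseudo-code; the precise coupling and parameter choices in the paper may differ.

```python
import numpy as np

def asvrg_smooth_sketch(grad_i, full_grad, x0, n, m, eta, omega, epochs, seed=None):
    """One possible ASVRG-style loop for smooth objectives (illustrative sketch).

    grad_i(x, i): gradient of the i-th component function at x.
    full_grad(x): full gradient, computed once per epoch at the snapshot.
    """
    rng = np.random.default_rng(seed)
    x_tilde = x0.copy()                # snapshot point
    y = x0.copy()                      # the single auxiliary (momentum) variable
    for _ in range(epochs):
        g_tilde = full_grad(x_tilde)   # full gradient at the snapshot
        x_sum = np.zeros_like(x0)
        for _ in range(m):
            x = x_tilde + omega * (y - x_tilde)               # momentum-coupled iterate
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(x_tilde, i) + g_tilde   # SVRG gradient estimator
            y = y - (eta / omega) * v                         # update the auxiliary variable
            x_sum += x_tilde + omega * (y - x_tilde)          # accumulate the new iterate
        x_tilde = x_sum / m            # next snapshot: average of the epoch's iterates
    return x_tilde
```

With `omega = 1` this collapses to plain SVRG with an averaged snapshot; a smaller `omega` injects momentum, since the auxiliary variable `y` carries information across epoch boundaries instead of being reset to the snapshot.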