# A unified variance-reduced accelerated gradient method for convex optimization

We propose a novel randomized incremental gradient algorithm, namely, VAriance-Reduced Accelerated Gradient (Varag), for finite-sum optimization. Equipped with a unified step-size policy that adjusts itself to the value of the condition number, Varag exhibits the unified optimal rates of convergence for solving smooth convex finite-sum problems directly, regardless of their strong convexity. Moreover, Varag is the first of its kind that benefits from the strong convexity of the data-fidelity term, and solves a wide class of problems only satisfying an error bound condition rather than strong convexity, both resulting in the optimal linear rate of convergence. Varag can also be extended to solve stochastic finite-sum problems.

## 1 Introduction

The problem of interest in this paper is the convex programming (CP) problem given in the form of

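$$\psi^* := \min_{x \in X}\Big\{\psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) + h(x)\Big\}. \tag{1.1}$$

Here, $X \subseteq \mathbb{R}^n$ is a closed convex set, the component functions $f_i : X \to \mathbb{R}$, $i = 1, \dots, m$, are smooth convex functions with $L_i$-Lipschitz continuous gradients over $X$, i.e., there exists $L_i \ge 0$ such that

$$\|\nabla f_i(x_1) - \nabla f_i(x_2)\|_* \le L_i \|x_1 - x_2\|, \quad \forall x_1, x_2 \in X, \tag{1.2}$$

and $h : X \to \mathbb{R}$ is a relatively simple but possibly nonsmooth convex function. For notational convenience, we denote $f(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)$ and $L := \frac{1}{m}\sum_{i=1}^m L_i$. It is easy to see that $f$ has $L_f$-Lipschitz continuous gradients, i.e., for some $L_f \ge 0$, $\|\nabla f(x_1) - \nabla f(x_2)\|_* \le L_f \|x_1 - x_2\| \le L \|x_1 - x_2\|$ for all $x_1, x_2 \in X$. It should be pointed out that it is not necessary to assume that $h$ is strongly convex. Instead, we assume that $f$ is possibly strongly convex with modulus $\mu \ge 0$.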

Finite-sum optimization given in the form of (1.1) has recently found a wide range of applications in machine learning (ML), statistical inference, and image processing, and hence has become the subject of intensive study during the past few years. In centralized ML, $f_i$ usually denotes the loss generated by a single data point, while in distributed ML, it may correspond to the loss function for an agent $i$, which is connected to other agents in a distributed network.

Recently, randomized incremental gradient (RIG) methods have emerged as an important class of first-order methods for finite-sum optimization (e.g., (Blatt et al., 2007; Johnson and Zhang, 2013; Xiao and Zhang, 2014; Defazio et al., 2014; Schmidt et al., 2017; Lan and Zhou, 2015; Allen-Zhu, 2016; Allen-Zhu and Yuan, 2016; Hazan and Luo, 2016; Lin et al., 2015; Lan and Zhou, 2017)). In an important work, Schmidt et al. (2017) (see (Blatt et al., 2007) for a precursor) showed that by incorporating new gradient estimators into stochastic gradient descent (SGD) one can possibly achieve a linear rate of convergence for smooth and strongly convex finite-sum optimization. Inspired by this work, Johnson and Zhang (2013) proposed a stochastic variance reduced gradient (SVRG) method which incorporates a novel stochastic estimator of $\nabla f(x_{t-1})$. More specifically, each epoch of SVRG starts with the computation of the exact gradient $\tilde{g} = \nabla f(\tilde{x})$ for a given $\tilde{x} \in \mathbb{R}^n$ and then runs SGD for a fixed number of steps using the gradient estimator $G_t = (\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x})) + \tilde{g}$, where $i_t$ is a random variable with support on $\{1, \dots, m\}$. They show that the variance of $G_t$ vanishes as the algorithm proceeds, and hence SVRG exhibits an improved linear rate of convergence, i.e., $O\{(m + L/\mu)\log(1/\epsilon)\}$, for smooth and strongly convex finite-sum problems. See (Xiao and Zhang, 2014; Defazio et al., 2014) for the same complexity result. Moreover, Allen-Zhu and Yuan (2016) show that by doubling the epoch length SVRG obtains an $O\{m\log(1/\epsilon) + L/\epsilon\}$ complexity bound for smooth convex finite-sum optimization.
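The SVRG epoch structure described above can be sketched in a few lines. This is a minimal illustration only, assuming least-squares components $f_i(x) = \frac{1}{2}(a_i^T x - b_i)^2$, uniform sampling, and a constant step size; the function name `svrg_least_squares` and all of its parameters are hypothetical and are not taken from the cited papers.

```python
import random


def svrg_least_squares(A, b, step, epochs, inner_steps, seed=0):
    """SVRG sketch for f(x) = (1/m) * sum_i 0.5 * (a_i^T x - b_i)^2.

    Assumptions (not from the source): uniform sampling, constant step
    size, and the last inner iterate is used as the next snapshot.
    """
    rng = random.Random(seed)
    m, n = len(A), len(A[0])

    def grad_i(i, x):
        # Gradient of the i-th component: (a_i^T x - b_i) * a_i.
        r = sum(a * xj for a, xj in zip(A[i], x)) - b[i]
        return [r * a for a in A[i]]

    def full_grad(x):
        g = [0.0] * n
        for i in range(m):
            gi = grad_i(i, x)
            g = [gj + gij / m for gj, gij in zip(g, gi)]
        return g

    x_tilde = [0.0] * n
    for _ in range(epochs):
        g_tilde = full_grad(x_tilde)  # exact gradient at the snapshot
        x = list(x_tilde)
        for _ in range(inner_steps):
            i = rng.randrange(m)  # random index with support {0, ..., m-1}
            gi_x = grad_i(i, x)
            gi_snap = grad_i(i, x_tilde)
            # Variance-reduced estimator G_t = grad_i(x) - grad_i(x_tilde) + g_tilde.
            G = [u - v + w for u, v, w in zip(gi_x, gi_snap, g_tilde)]
            x = [xj - step * Gj for xj, Gj in zip(x, G)]
        x_tilde = x
    return x_tilde
```

The estimator is unbiased because the two component gradients share the same random index, so their difference averages to $\nabla f(x) - \nabla f(\tilde{x})$, and adding $\tilde{g}$ recovers $\nabla f(x)$.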

Observe that the aforementioned variance reduction methods are not accelerated and hence they are not optimal even when the number of components $m = 1$. Therefore, much recent research effort has been devoted to the design of optimal RIG methods. In fact, Lan and Zhou (2015) established a lower complexity bound for RIG methods by showing that whenever the dimension is large enough, the number of gradient evaluations required by any RIG method to find an $\epsilon$-solution of a smooth and strongly convex finite-sum problem, i.e., a point $\bar{x} \in X$ s.t. $E[\|\bar{x} - x^*\|_2^2] \le \epsilon$, cannot be smaller than

$$\Omega\Big(\Big(m + \sqrt{\tfrac{mL}{\mu}}\Big)\log\tfrac{1}{\epsilon}\Big). \tag{1.3}$$

As can be seen from Table 1, existing accelerated RIG methods are optimal for solving smooth and strongly convex finite-sum problems, since their complexity matches the lower bound in (1.3).

Notwithstanding this recent progress, there still remain a few significant issues in the development of accelerated RIG methods. Firstly, as pointed out by (Tang et al., 2018), existing RIG methods can only establish accelerated linear convergence based on the assumption that the regularizer $h$ is strongly convex, and fail to benefit from the strong convexity of the data-fidelity term (Wang and Xiao, 2017). This restrictive assumption does not apply to many important applications (e.g., Lasso models) where the loss function, rather than the regularization term, may be strongly convex. Specifically, for the case when only $f$ (but not $h$) is strongly convex, one may not be able to shift the strong convexity of $f$ to construct a simple strongly convex term $h$ in the objective function. In fact, even if $f$ is strongly convex, some of the component functions $f_i$ may only be convex, and hence may become nonconvex after subtracting a strongly convex term. Secondly, if the strongly convex modulus $\mu$ becomes very small, the complexity bounds of all existing RIG methods will go to $+\infty$ (see column 2 of Table 1), indicating that they are not robust against problem ill-conditioning. Thirdly, for solving smooth problems without strong convexity, one has to add a strongly convex perturbation to the objective function in order to gain up to a factor of $\sqrt{m}$ over Nesterov's accelerated gradient method in terms of gradient computation (see column 3 of Table 1). One significant difficulty for this indirect approach is that we do not know how to choose the perturbation parameter properly, especially for problems with an unbounded feasible region (see (Allen-Zhu and Yuan, 2016) for a discussion of a similar issue related to SVRG applied to non-strongly convex problems). However, if one chose not to add the strongly convex perturbation term, the best-known complexity would be given by Katyusha$^{ns}$ (Allen-Zhu, 2016), which is not more advantageous than Nesterov's original method. In other words, it does not gain much from randomization in terms of computational complexity. Finally, it should be pointed out that only a few existing RIG methods, e.g., RGEM (Lan and Zhou, 2017), can be applied to solve stochastic finite-sum optimization problems, where one can only access the stochastic gradient of $f_i$ via a stochastic first-order oracle (SFO).

### 1.1 Our contributions.

In this paper, we propose a novel accelerated variance reduction type method, namely the variance-reduced accelerated gradient (Varag) method, to solve smooth finite-sum optimization problems given in the form of (1.1). Table 2 summarizes the main convergence results achieved by our Varag algorithm.

Firstly, for smooth convex finite-sum optimization, our proposed method exploits a direct acceleration scheme instead of employing any perturbation or restarting techniques to obtain the desired optimal convergence results. As shown in the first two rows of Table 2, Varag achieves the optimal rate of convergence if the number of component functions $m$ is relatively small and/or the required accuracy is high, while it exhibits a fast linear rate of convergence when the number of component functions $m$ is relatively large and/or the required accuracy is low, without requiring any strong convexity assumptions. To the best of our knowledge, this is the first time that these complexity bounds have been obtained through a direct acceleration scheme for smooth convex finite-sum optimization in the literature. In comparison with existing methods using perturbation techniques, Varag does not need to know the target accuracy or the diameter of the feasible region a priori, and thus can be used to solve a much wider class of smooth convex problems, e.g., those with unbounded feasible sets.

Secondly, we equip Varag with a unified step-size policy for smooth convex optimization regardless of whether (1.1) is strongly convex or not, i.e., the strongly convex modulus $\mu \ge 0$. With this step-size policy, Varag can adjust to different classes of problems to achieve the best convergence results, without knowing the target accuracy and/or fixing the number of epochs. In particular, as shown in the last column of Table 2, when $\mu$ is relatively large, Varag achieves the well-known optimal linear rate of convergence. If $\mu$ is relatively small, it obtains accelerated convergence rates that are independent of the condition number $L/\mu$. Therefore, Varag is robust against ill-conditioning of problem (1.1). Moreover, our assumptions on the objective function are more general than those used by other RIG methods, such as RPDG and Katyusha. Specifically, Varag does not require keeping a strongly convex regularization term in the projection, and so we can assume that the strong convexity is associated with the smooth function $f$ instead of the simple proximal function $h$. Some other advantages of Varag over existing accelerated SVRG methods, e.g., Katyusha, include that it only requires the solution of one, rather than two, subproblems, and that it allows the application of a non-Euclidean Bregman distance for solving all different classes of problems.

Finally, we extend Varag to solve two more general classes of finite-sum optimization problems. We demonstrate that Varag is the first randomized method that achieves the accelerated linear rate of convergence when solving the class of problems that satisfies a certain error-bound condition rather than strong convexity. We then show that Varag can also be applied to solve stochastic smooth finite-sum optimization problems, resulting in a sublinear rate of convergence.

This paper is organized as follows. In Section 2, we present our proposed algorithm Varag and its convergence results for solving (1.1) under different problem settings. In Section 3 we provide extensive experimental results to demonstrate the advantages of Varag over several state-of-the-art methods for solving some well-known ML models, e.g., logistic regression, Lasso, etc. We defer the proofs of the main results to Appendix A.

### 1.2 Notation and terminology.

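We use $\|\cdot\|$ to denote a general norm in $\mathbb{R}^n$ without specific mention, and $\|\cdot\|_*$ to denote the conjugate norm of $\|\cdot\|$. For any $p \ge 1$, $\|\cdot\|_p$ denotes the standard $p$-norm in $\mathbb{R}^n$, i.e., $\|x\|_p^p = \sum_{i=1}^n |x_i|^p$ for any $x \in \mathbb{R}^n$. For a given strongly convex function $w : X \to \mathbb{R}$ with modulus $1$ w.r.t. an arbitrary norm $\|\cdot\|$, we define a prox-function associated with $w$ as

$$V(x_0, x) \equiv V_w(x_0, x) := w(x) - \big[w(x_0) + \langle w'(x_0), x - x_0\rangle\big], \tag{1.4}$$

where $w'(x_0) \in \partial w(x_0)$ is any subgradient of $w$ at $x_0$. By the strong convexity of $w$, we have

$$V(x_0, x) \ge \tfrac{1}{2}\|x - x_0\|^2, \quad \forall x, x_0 \in X. \tag{1.5}$$

Notice that $V(\cdot,\cdot)$ described above is different from the standard definition of the Bregman distance (Bregman, 1967; Auslender and Teboulle, 2006; Bauschke et al., 2003; Kiwiel, 1997; Censor and Lent, 1981) in the sense that $w$ is not necessarily differentiable. Throughout this paper, we assume that the prox-mapping associated with $X$ and $h$, given by

$$\arg\min_{x \in X}\big\{\gamma\big[\langle g, x\rangle + h(x) + \mu V(\underline{x}_0, x)\big] + V(x_0, x)\big\}, \tag{1.6}$$

can be easily computed for any $\underline{x}_0, x_0 \in X$, $g \in \mathbb{R}^n$, $\mu \ge 0$, $\gamma > 0$. We denote the logarithm with base $2$ as $\log$. For any real number $r$, $\lceil r \rceil$ and $\lfloor r \rfloor$ denote the nearest integer to $r$ from above and below.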
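To make the prox-function in (1.4) concrete, the following sketch evaluates it for two common choices of the distance-generating function: the squared Euclidean norm, and the negative entropy on the probability simplex. Both choices and all function names are illustrative assumptions; the text above only requires a strongly convex function with modulus 1.

```python
import math


def bregman(w, grad_w, x0, x):
    """V(x0, x) = w(x) - [w(x0) + <w'(x0), x - x0>], following (1.4)."""
    inner = sum(g * (xi - x0i) for g, xi, x0i in zip(grad_w(x0), x, x0))
    return w(x) - (w(x0) + inner)


def sq_norm(x):
    # w(x) = 0.5 * ||x||_2^2, strongly convex with modulus 1 w.r.t. the l2 norm.
    return 0.5 * sum(xi * xi for xi in x)


def sq_norm_grad(x):
    return list(x)


def neg_entropy(x):
    # w(x) = sum_i x_i log x_i, strongly convex with modulus 1 w.r.t. the
    # l1 norm on the probability simplex (entries must be positive here).
    return sum(xi * math.log(xi) for xi in x)


def neg_entropy_grad(x):
    return [math.log(xi) + 1.0 for xi in x]
```

With `sq_norm`, the prox-function is half the squared Euclidean distance, so (1.5) holds with equality. With `neg_entropy` and two points on the simplex, it coincides with the Kullback-Leibler divergence, which is one way a non-Euclidean geometry enters the prox-mapping.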

## 2 Algorithms and main results

This section contains two subsections. We first present in Subsection 2.1 a unified optimal Varag for solving the finite-sum problem given in (1.1) as well as its optimal convergence results. Subsection 2.2 is devoted to the discussion of several extensions of Varag. Throughout this section, we assume that each component function $f_i$ is smooth with $L_i$-Lipschitz continuous gradients over $X$, i.e., (1.2) holds for all component functions. Moreover, we assume that the objective function $\psi(x)$ is possibly strongly convex, in particular, for $f(x) = \frac{1}{m}\sum_{i=1}^m f_i(x)$, there exists $\mu \ge 0$ s.t.

$$f(y) \ge f(x) + \langle \nabla f(x), y - x\rangle + \mu V(x, y), \quad \forall x, y \in X. \tag{2.1}$$

Note that we assume the strong convexity of $\psi$ comes from $f$, and the simple function $h$ is not necessarily strongly convex. Clearly the strong convexity of $h$, if any, can be shifted to $f$ since $h$ is assumed to be simple and its structural information is transparent to us. Also observe that (2.1) is defined based on a generalized Bregman distance, and together with (1.5) it implies the standard definition of strong convexity w.r.t. the Euclidean norm.

### 2.1 Varag for convex finite-sum optimization

The basic scheme of Varag is formally described in Algorithm 1. In each epoch (or outer loop), it first computes the full gradient $\nabla f(\tilde{x})$ at the point $\tilde{x}$ (cf. Line 3), which will then be repeatedly used to define a gradient estimator $G_t$ at each iteration of the inner loop (cf. Line 8). This is the well-known variance reduction technique employed by many algorithms (e.g., (Johnson and Zhang, 2013; Xiao and Zhang, 2014; Allen-Zhu, 2016; Hazan and Luo, 2016)). The inner loop has a similar algorithmic scheme to the accelerated stochastic approximation algorithm (Lan, 2012; Ghadimi and Lan, 2012, 2013) with a constant step-size policy. Indeed, the parameters used in the inner loop, i.e., $\{\gamma_s\}$, $\{\alpha_s\}$, and $\{p_s\}$, only depend on the index of the epoch $s$. Each iteration of the inner loop requires the gradient information of only one randomly selected component function $f_{i_t}$, and maintains three primal sequences, $\{\underline{x}_t\}$, $\{x_t\}$ and $\{\bar{x}_t\}$, which play an important role in the acceleration scheme.

Note that Varag is closely related to the stochastic mirror descent method (Nemirovski et al., 2009; Nemirovsky and Yudin, 1983) and SVRG (Johnson and Zhang, 2013; Xiao and Zhang, 2014). By setting $\alpha_s = 1$ and $p_s = 0$, Algorithm 1 simply combines the variance reduction technique with stochastic mirror descent. In this case, the algorithm only maintains one primal sequence $\{x_t\}$ and possesses the non-accelerated rate of convergence $O\{(m + L/\mu)\log(1/\epsilon)\}$ for solving (1.1). Interestingly, if we use the Euclidean distance instead of the prox-function $V(\cdot,\cdot)$ to update $x_t$ and set $X = \mathbb{R}^n$, Algorithm 1 will further reduce to prox-SVRG proposed in (Xiao and Zhang, 2014).

It is also interesting to observe the difference between Varag and Katyusha (Allen-Zhu, 2016) because both are accelerated variance reduction methods. Firstly, while Katyusha needs to assume that the strongly convex term is specified in the form of a simple proximal function, e.g., an $\ell_1$/$\ell_2$-regularizer, Varag assumes that $f$ is possibly strongly convex, which solves an open issue of the existing accelerated RIG methods pointed out by (Tang et al., 2018). Therefore, the momentum steps in Lines 7 and 10 are different from those of Katyusha. Secondly, Varag has a computationally less expensive algorithmic scheme. In particular, Varag only needs to solve one proximal mapping (cf. Line 9) per iteration even if $f$ is strongly convex, while Katyusha requires solving two proximal mappings per iteration. Thirdly, Varag incorporates a prox-function $V$ defined in (1.4) rather than the Euclidean distance in the proximal mapping to update $x_t$. This allows the algorithm to take advantage of the geometry of the constraint set $X$ when performing projections. However, Katyusha cannot be fully adapted to the non-Euclidean setting because its second proximal mapping must be defined using the Euclidean distance regardless of the strong convexity of $\psi$. Finally, we will show in this section that Varag can achieve a much better rate of convergence than Katyusha for smooth convex finite-sum optimization by using a novel approach to specify the step size and to schedule the epoch length.

We first discuss the case when $f$ is not necessarily strongly convex, i.e., $\mu = 0$ in (2.1). In Theorem 1, we suggest one way to specify the algorithmic parameters for Varag to solve smooth convex problems given in the form of (1.1), and discuss the convergence properties of the resulting algorithm. We defer the proof of this result to Appendix A.1.

###### Theorem 1 (Smooth finite-sum optimization)

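Suppose that the probabilities $q_i$'s are set to $L_i/\sum_{i=1}^m L_i$ for $i = 1, \dots, m$, and weights $\{\theta_t\}$ are set as

$$\theta_t = \begin{cases} \frac{\gamma_s}{\alpha_s}(\alpha_s + p_s), & 1 \le t \le T_s - 1,\\ \frac{\gamma_s}{\alpha_s}, & t = T_s. \end{cases} \tag{2.2}$$

Moreover, let us denote $s_0 := \lfloor \log m \rfloor + 1$ and set parameters $\{T_s\}$, $\{\gamma_s\}$ and $\{p_s\}$ as

$$T_s = \begin{cases} 2^{s-1}, & s \le s_0,\\ T_{s_0}, & s > s_0, \end{cases} \quad \gamma_s = \frac{1}{3L\alpha_s}, \quad \text{and} \quad p_s = \frac{1}{2}, \quad \text{with} \tag{2.3}$$

$$\alpha_s = \begin{cases} \frac{1}{2}, & s \le s_0,\\ \frac{2}{s - s_0 + 4}, & s > s_0. \end{cases} \tag{2.4}$$

Then the total number of gradient evaluations of $f_i$ performed by Algorithm 1 to find a stochastic $\epsilon$-solution of (1.1), i.e., a point $\bar{x} \in X$ s.t. $E[\psi(\bar{x}) - \psi^*] \le \epsilon$, can be bounded by

$$\bar{N} := \begin{cases} O\big\{m \log\frac{D_0}{\epsilon}\big\}, & m \ge D_0/\epsilon,\\ O\big\{m \log m + \sqrt{\frac{m D_0}{\epsilon}}\big\}, & m < D_0/\epsilon, \end{cases} \tag{2.5}$$

where $D_0$ is defined as

$$D_0 := 2[\psi(x^0) - \psi(x^*)] + 3L V(x^0, x^*). \tag{2.6}$$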

We now make a few observations regarding the results obtained in Theorem 1. Firstly, as mentioned earlier, whenever the required accuracy $\epsilon$ is low and/or the number of components $m$ is large, Varag can achieve a fast linear rate of convergence even under the assumption that the objective function is not strongly convex. Otherwise, Varag achieves an optimal sublinear rate of convergence with complexity bounded by $O\{\sqrt{m D_0/\epsilon} + m \log m\}$. Secondly, whenever $\sqrt{m D_0/\epsilon}$ is dominating in the second case of (2.5), Varag can save up to a factor of $O(\sqrt{m})$ gradient evaluations of $f_i$ compared with the optimal deterministic first-order methods for solving (1.1). To the best of our knowledge, Varag is the first accelerated RIG method in the literature to obtain such convergence results by directly solving (1.1). Other existing accelerated RIG methods, such as RPDG (Lan and Zhou, 2015) and Katyusha (Allen-Zhu, 2016), require the application of perturbation and restarting techniques to obtain such convergence results.

Next we consider the case when $f$ is possibly strongly convex, including the situation when the problem is almost not strongly convex, i.e., $\mu \approx 0$. In the latter case, the term $\sqrt{mL/\mu}\log(1/\epsilon)$ will be dominating in the complexity of existing accelerated RIG methods (e.g., (Lan and Zhou, 2015, 2017; Allen-Zhu, 2016; Lin et al., 2015)) and will tend to $\infty$ as $\mu$ decreases. Therefore, these complexity bounds are significantly worse than (2.5), obtained by simply treating (1.1) as a smooth convex problem. Moreover, $\mu \approx 0$ is very common in ML applications. In Theorem 2, we provide a unified step-size policy which allows Varag to achieve the optimal rate of convergence for finite-sum optimization in (1.1) regardless of its strong convexity, and hence it can achieve a stronger rate of convergence than existing accelerated RIG methods if the condition number $L/\mu$ is very large. The proof of this result can be found in Appendix A.2.
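The parameter policy in (2.2)-(2.4) is simple enough to tabulate directly. The sketch below is only an illustration of those formulas: it assumes that $s_0 = \lfloor \log_2 m \rfloor + 1$ (the base-2 logarithm is an assumption of this sketch), and the function names `varag_schedule` and `theta_weight` are hypothetical.

```python
def varag_schedule(s, m, L):
    """Return (T_s, alpha_s, gamma_s, p_s) for epoch s >= 1, following (2.3)-(2.4).

    Assumption of this sketch: s0 = floor(log2(m)) + 1, which for a positive
    integer m equals m.bit_length().
    """
    s0 = m.bit_length()
    T = 2 ** (min(s, s0) - 1)  # epoch length doubles until s0, then stays at T_{s0}
    alpha = 0.5 if s <= s0 else 2.0 / (s - s0 + 4)
    p = 0.5
    gamma = 1.0 / (3.0 * L * alpha)
    return T, alpha, gamma, p


def theta_weight(t, T, alpha, gamma, p):
    """Averaging weight theta_t for inner iteration 1 <= t <= T, following (2.2)."""
    if t < T:
        return gamma / alpha * (alpha + p)
    return gamma / alpha
```

For instance, with `m = 8` the epoch length doubles over the first four epochs (1, 2, 4, 8) and is then frozen at 8, while `alpha` starts to decay as `2 / (s - s0 + 4)` from the fifth epoch onward, so the step size `gamma` grows accordingly.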

###### Theorem 2 (A unified result for convex finite-sum optimization)

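Suppose that the probabilities $q_i$'s are set to $L_i/\sum_{i=1}^m L_i$ for $i = 1, \dots, m$. Moreover, let us denote $s_0 := \lfloor \log m \rfloor + 1$ and assume that the weights $\{\theta_t\}$ are set to (2.2) if $1 \le s \le s_0$ or $s_0 < s \le s_0 + \sqrt{\frac{12L}{m\mu}} - 4$, $m < \frac{3L}{4\mu}$. Otherwise, they are set to

$$\theta_t = \begin{cases} \Gamma_{t-1} - (1 - \alpha_s - p_s)\Gamma_t, & 1 \le t \le T_s - 1,\\ \Gamma_{t-1}, & t = T_s, \end{cases} \tag{2.7}$$

where $\Gamma_t = (1 + \mu\gamma_s)^t$. If the parameters $\{T_s\}$, $\{\gamma_s\}$ and $\{p_s\}$ are set to (2.3) with

$$\alpha_s = \begin{cases} \frac{1}{2}, & s \le s_0,\\ \max\Big\{\frac{2}{s - s_0 + 4}, \min\Big\{\sqrt{\frac{m\mu}{3L}}, \frac{1}{2}\Big\}\Big\}, & s > s_0, \end{cases} \tag{2.8}$$

then the total number of gradient evaluations of $f_i$ performed by Algorithm 1 to find a stochastic $\epsilon$-solution of (1.1) can be bounded by

$$\bar{N} := \begin{cases} O\big\{m \log\frac{D_0}{\epsilon}\big\}, & m \ge \frac{D_0}{\epsilon} \text{ or } m \ge \frac{3L}{4\mu},\\ O\big\{m \log m + \sqrt{\frac{m D_0}{\epsilon}}\big\}, & m < \dots\\ \dots \end{cases} \tag{2.9}$$

where $D_0$ is defined as in (2.6).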

Observe that the complexity bound (2.9) is a unified convergence result for Varag to solve deterministic smooth finite-sum optimization problems (1.1). When the strongly convex modulus $\mu$ of the objective function is large enough, Varag exhibits an optimal linear rate of convergence since the third case of (2.9) matches the lower bound (1.3) for RIG methods. If $\mu$ is relatively small, Varag treats the finite-sum problem (1.1) as a smooth problem without strong convexity, which leads to the same complexity bounds as in Theorem 1. It should be pointed out that the parameter setting proposed in Theorem 2 does not require the values of $\epsilon$ and $D_0$ to be given a priori.
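The way the unified policy (2.8) switches between the sublinear and the linear regime can be seen by evaluating it numerically. This is a sketch of the formula only, under the same assumption as before that $s_0 = \lfloor \log_2 m \rfloor + 1$; the function name `varag_alpha_unified` is hypothetical.

```python
import math


def varag_alpha_unified(s, m, L, mu):
    """alpha_s from (2.8) for epoch s >= 1.

    Assumption of this sketch: s0 = floor(log2(m)) + 1 = m.bit_length().
    """
    s0 = m.bit_length()
    if s <= s0:
        return 0.5
    sublinear = 2.0 / (s - s0 + 4)                    # decaying value, as in (2.4)
    linear = min(math.sqrt(m * mu / (3.0 * L)), 0.5)  # constant floor driven by mu
    return max(sublinear, linear)
```

With `mu = 0` the floor vanishes and the policy coincides with (2.4). With a positive `mu`, `alpha_s` stops decaying once `2 / (s - s0 + 4)` drops below `sqrt(m * mu / (3 * L))`, which is the point at which the constant, linearly convergent regime takes over.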

### 2.2 Generalization of Varag

In this subsection, we extend Varag to solve two general classes of finite-sum optimization problems and establish its convergence properties for these problems.

Finite-sum problems under error bound condition. We investigate a class of weakly strongly convex problems, i.e., $\psi(x)$ is smooth convex and satisfies the error bound condition given by

$$V(x, X^*) \le \frac{1}{\bar{\mu}}\big(\psi(x) - \psi^*\big), \quad \forall x \in X, \tag{2.10}$$

where $X^*$ denotes the set of optimal solutions of (1.1). Many optimization problems satisfy (2.10), for instance, linear systems, quadratic programs, linear matrix inequalities and composite problems (outer: strongly convex, inner: polyhedron functions); see also Section 6 of (Necoara et al., 2018) for more examples. Although these problems are not strongly convex, by properly restarting Varag we can solve them with an accelerated optimal linear rate of convergence, the best-known complexity result for this class of problems so far. We formally present the result in Theorem 3, whose proof is given in Appendix A.3.

###### Theorem 3 (Convex finite-sum optimization under error bound)

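Assume that the probabilities $q_i$'s are set to $L_i/\sum_{i=1}^m L_i$ for $i = 1, \dots, m$, and $\theta_t$ are defined as (2.2). Moreover, let us set parameters $\{\gamma_s\}$, $\{p_s\}$ and $\{\alpha_s\}$ as in (2.3) and (2.4) with $\{T_s\}$ being set as

$$T_s = \begin{cases} T_1 2^{s-1}, & s \le 4,\\ 8T_1, & s > 4, \end{cases} \tag{2.11}$$

where $T_1 = \min\{m, L/\bar{\mu}\}$. Then under condition (2.10), for any $x^* \in X^*$, $s = 4 + 4\sqrt{\frac{L}{\bar{\mu} m}}$,

$$E[\psi(\tilde{x}^s) - \psi(x^*)] \le \frac{5}{16}[\psi(x^0) - \psi(x^*)]. \tag{2.12}$$

Moreover, if we restart Varag every time it runs $s$ iterations for $k = \log\frac{\psi(x^0) - \psi(x^*)}{\epsilon}$ times, the total number of gradient evaluations of $f_i$ to find a stochastic $\epsilon$-solution of (1.1) can be bounded by

$$\bar{N} := k\Big(\sum_s (m + T_s)\Big) = O\Big\{\Big(m + \sqrt{\tfrac{mL}{\bar{\mu}}}\Big)\log\frac{\psi(x^0) - \psi(x^*)}{\epsilon}\Big\}. \tag{2.13}$$

###### Remark 1

Note that Varag can also be extended to obtain a unified result, as shown in Theorem 2, for solving finite-sum problems under the error bound condition. In particular, if the condition number is very large, Varag will never be restarted, and the resulting complexity bounds will reduce to those for solving smooth convex problems provided in Theorem 1.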

Stochastic finite-sum optimization. We now consider stochastic smooth convex finite-sum optimization and online learning problems, where only noisy gradient information of $f_i$ can be accessed via an SFO oracle. In particular, for any $x \in X$, the SFO oracle outputs a vector $G_i(x, \xi_j)$ such that

$$E_{\xi_j}[G_i(x, \xi_j)] = \nabla f_i(x), \quad i = 1, \dots, m, \tag{2.14}$$

$$E_{\xi_j}\big[\|G_i(x, \xi_j) - \nabla f_i(x)\|_*^2\big] \le \sigma^2, \quad i = 1, \dots, m. \tag{2.15}$$

We present the variant of Varag for stochastic finite-sum optimization in Algorithm 2 as well as its convergence results in Theorem 4, whose proof can be found in Appendix B.

###### Theorem 4 (Stochastic smooth finite-sum optimization)

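Assume that $\theta_t$ are defined as in (2.2), and the probabilities $q_i$'s are set to $L_i/\sum_{i=1}^m L_i$ for $i = 1, \dots, m$. Moreover, let us denote $s_0 := \lfloor \log m \rfloor + 1$ and set $T_s$, $\alpha_s$, $\gamma_s$ and $p_s$ as in (2.3) and (2.4). Then the number of calls to the SFO oracle required by Algorithm 2 to find a stochastic $\epsilon$-solution of (1.1) can be bounded by

$$N_{SFO} = \sum_s (m B_s + T_s b_s) = \begin{cases} O\big\{\frac{m C \sigma^2}{L\epsilon}\big\}, & m \ge D_0/\epsilon,\\ O\big\{\frac{C \sigma^2 D_0}{L\epsilon^2}\big\}, & m < D_0/\epsilon, \end{cases} \tag{2.18}$$

where $D_0$ is given in (2.6).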

###### Remark 2

Note that the constant $C$ in (2.18) can be easily upper bounded by (recall that ), and if . To the best of our knowledge, among the few existing RIG methods that can be applied to solve the class of stochastic finite-sum problems, Varag is the first to achieve such complexity results as in (2.18) for smooth convex problems. RGEM (Lan and Zhou, 2017) obtains a nearly optimal rate of convergence for the strongly convex case, but cannot solve stochastic smooth problems directly, and Kulunchakov and Mairal (2019) required a specific initial point, i.e., an exact solution to a proximal mapping depending on the variance $\sigma^2$, to achieve rate of convergence for smooth convex problems.

## 3 Numerical experiments

In this section, we demonstrate the advantages of our proposed algorithm, Varag, over several state-of-the-art algorithms, e.g., SVRG++ (Allen-Zhu and Yuan, 2016) and Katyusha (Allen-Zhu, 2016), by solving several well-known machine learning models. For all experiments, we use public real datasets downloaded from the UCI Machine Learning Repository (Dua and Graff, 2017).

Unconstrained smooth convex problems. We first investigate unconstrained logistic models which cannot be solved via the perturbation approach due to the unboundedness of the feasible set. More specifically, we applied Varag, SVRG++ and Katyusha$^{ns}$ to solve a logistic regression problem,

$$\min_{x \in \mathbb{R}^n}\Big\{\psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)\Big\} \quad \text{where } f_i(x) := \log\big(1 + \exp(-b_i a_i^T x)\big). \tag{3.1}$$

Here $(a_i, b_i) \in \mathbb{R}^n \times \{-1, 1\}$ is a training data point and $m$ is the sample size, and hence $f_i$ now corresponds to the loss generated by a single training data point. As we can see from Figure 1, Varag converges much faster than SVRG++ and Katyusha$^{ns}$ in terms of training loss.
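For reference, the component loss in (3.1) and its gradient can be written as follows. This is a plain-Python sketch with hypothetical function names, assuming labels $b_i \in \{-1, 1\}$; it is not the code used for the experiments.

```python
import math


def logistic_loss(x, a, b):
    """f_i(x) = log(1 + exp(-b * a^T x)) for one data point (a, b)."""
    z = -b * sum(ai * xi for ai, xi in zip(a, x))
    # Stable evaluation of log(1 + exp(z)) that avoids overflow for large |z|.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def logistic_grad(x, a, b):
    """Gradient of f_i: -b * sigmoid(-b * a^T x) * a."""
    z = -b * sum(ai * xi for ai, xi in zip(a, x))
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return [-b * s * ai for ai in a]
```

Each component gradient costs one inner product with $a_i$, which is the unit of work counted by the gradient-evaluation complexity bounds above.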

Strongly convex loss with simple convex regularizer. We now study the class of Lasso regression problems with $\lambda$ as the regularizer coefficient, given in the following form

$$\min_{x \in \mathbb{R}^n}\Big\{\psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) + h(x)\Big\} \quad \text{where } f_i(x) := \frac{1}{2}(a_i^T x - b_i)^2, \ h(x) := \lambda\|x\|_1. \tag{3.2}$$

Because SVRG++ and Katyusha assume that strong convexity can only be associated with the regularizer, these methods always view Lasso as a smooth problem (Tang et al., 2018), while Varag can treat Lasso as a strongly convex problem. As can be seen from Figure 2, Varag outperforms SVRG++ and Katyusha$^{ns}$ in terms of training loss.
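For the Lasso regularizer, the prox-mapping (1.6) has a closed form in the simplest setting. The sketch below assumes the Euclidean prox-function $V(x_0, x) = \frac{1}{2}\|x - x_0\|_2^2$, $X = \mathbb{R}^n$ and $\mu = 0$; under these assumptions the minimizer is a soft-thresholding of $x_0 - \gamma g$. The function names are hypothetical.

```python
import math


def soft_threshold(v, tau):
    """Shrink each entry of v toward zero by tau (entries within tau become 0)."""
    return [math.copysign(max(abs(vi) - tau, 0.0), vi) for vi in v]


def prox_step_l1(x0, g, gamma, lam):
    """argmin_x { gamma * (<g, x> + lam * ||x||_1) + 0.5 * ||x - x0||_2^2 }.

    Special case of (1.6) with mu = 0, X = R^n and the Euclidean prox-function
    (assumptions of this sketch).
    """
    shifted = [x0i - gamma * gi for x0i, gi in zip(x0, g)]
    return soft_threshold(shifted, gamma * lam)
```

For example, `prox_step_l1([1.0, 1.0], [-1.0, 0.5], 1.0, 0.5)` first moves to `[2.0, 0.5]` and then shrinks by `0.5`, giving `[1.5, 0.0]`; the second coordinate is set exactly to zero, which is the sparsity-inducing effect of the $\ell_1$ term.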

Weakly strongly convex problems satisfying error bound condition. Let us consider a special class of finite-sum convex quadratic problems given in the following form

$$\min_{x \in \mathbb{R}^n}\Big\{\psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x)\Big\} \quad \text{where } f_i(x) := \frac{1}{2}x^T Q_i x + q_i^T x. \tag{3.3}$$

Here $q_i = -Q_i x_s$ and $x_s$ is a solution to the symmetric linear system $Q_i x + q_i = 0$ with $Q_i \succeq 0$. Dang et al. (2017)[Section 6] and Necoara et al. (2018)[Section 6.1] proved that (3.3) belongs to the class of weakly strongly convex problems satisfying the error bound condition (2.10). For a given solution $x_s$, we use the following real datasets to generate $Q_i$ and $q_i$. We then compare the performance of Varag with the fast gradient method (FGM) proposed in (Necoara et al., 2018). As shown in Figure 3, Varag outperforms FGM in all cases. Moreover, as the number of component functions $m$ increases, Varag demonstrates more advantages over FGM. These numerical results are consistent with the theoretical complexity bound (2.13), which suggests that Varag can save up to a factor of $O\{\sqrt{m}\}$ gradient computations compared with deterministic algorithms, e.g., FGM.

Strongly convex problems with small strongly convex modulus. We consider ridge regression models with a small regularizer coefficient $\lambda$, given in the following form,

$$\min_{x \in \mathbb{R}^n}\Big\{\psi(x) := \frac{1}{m}\sum_{i=1}^m f_i(x) + h(x)\Big\} \quad \text{where } f_i(x) := \frac{1}{2}(a_i^T x - b_i)^2, \ h(x) := \lambda\|x\|_2^2. \tag{3.4}$$

Since the above problem is strongly convex, we compare the performance of Varag with those of Prox-SVRG (Xiao and Zhang, 2014) and Katyusha (Allen-Zhu, 2016). As we can see from Figure 4, Varag and Katyusha converge much faster than Prox-SVRG in terms of training loss. Although Varag and Katyusha perform similarly in terms of training loss per gradient call, Varag may require less CPU time to perform one epoch than Katyusha. In fact, Varag only needs to solve one proximal mapping per inner iteration while Katyusha requires solving two for strongly convex problems.

## Appendix A Convergence analysis of Varag for deterministic finite-sum optimization

Our main goal in this section is to establish the convergence results stated in Theorems 1 and 2 for the Varag method applied to the finite-sum optimization problem in (1.1).

Before proving Theorems 1 and 2, we first need to present some basic properties of smooth convex functions and then provide some important technical results.

###### Lemma 1

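If $f : X \to \mathbb{R}$ has Lipschitz continuous gradients with Lipschitz constant $L$, then

$$\frac{1}{2L}\|\nabla f(x) - \nabla f(z)\|_*^2 \le f(x) - f(z) - \langle \nabla f(z), x - z\rangle, \quad \forall x, z \in X.$$

Proof: Denote $\phi(x) = f(x) - f(z) - \langle \nabla f(z), x - z\rangle$. Clearly $\phi$ also has $L$-Lipschitz continuous gradients. It is easy to check that $\nabla\phi(z) = 0$, and hence that $\min_x \phi(x) = \phi(z) = 0$, which implies

$$\begin{aligned}
\phi(z) &\le \phi\big(x - \tfrac{1}{L}\nabla\phi(x)\big)\\
&= \phi(x) + \int_0^1 \big\langle \nabla\phi\big(x - \tfrac{\tau}{L}\nabla\phi(x)\big), -\tfrac{1}{L}\nabla\phi(x)\big\rangle\, d\tau\\
&= \phi(x) + \big\langle \nabla\phi(x), -\tfrac{1}{L}\nabla\phi(x)\big\rangle + \int_0^1 \big\langle \nabla\phi\big(x - \tfrac{\tau}{L}\nabla\phi(x)\big) - \nabla\phi(x), -\tfrac{1}{L}\nabla\phi(x)\big\rangle\, d\tau\\
&\le \phi(x) - \tfrac{1}{L}\|\nabla\phi(x)\|_*^2 + \int_0^1 L\,\big\|\tfrac{\tau}{L}\nabla\phi(x)\big\|_*\,\big\|\tfrac{1}{L}\nabla\phi(x)\big\|_*\, d\tau\\
&= \phi(x) - \tfrac{1}{2L}\|\nabla\phi(x)\|_*^2.
\end{aligned}$$

Therefore, we have $\frac{1}{2L}\|\nabla\phi(x)\|_*^2 \le \phi(x) - \phi(z) = \phi(x)$, and the result follows immediately from this relation.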
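The inequality in Lemma 1 can be checked numerically on a simple instance. The sketch below assumes the Euclidean norm and a diagonal quadratic $f(x) = \frac{1}{2}\sum_j q_j x_j^2$ with $q = (1, 4)$, so that $L = 4$; the instance and the function name `check_lemma1` are illustrative only.

```python
import random


def check_lemma1(trials=1000, seed=0):
    """Return the smallest observed gap (right side minus left side) of Lemma 1.

    Assumptions of this sketch: Euclidean norm, f(x) = 0.5 * sum_j q_j * x_j^2
    with q = (1, 4), hence Lipschitz constant L = 4.
    """
    rng = random.Random(seed)
    q = [1.0, 4.0]
    L = max(q)

    def f(x):
        return 0.5 * sum(qj * xj * xj for qj, xj in zip(q, x))

    def grad(x):
        return [qj * xj for qj, xj in zip(q, x)]

    worst = float("inf")
    for _ in range(trials):
        x = [rng.uniform(-5.0, 5.0) for _ in q]
        z = [rng.uniform(-5.0, 5.0) for _ in q]
        gx, gz = grad(x), grad(z)
        lhs = sum((u - v) ** 2 for u, v in zip(gx, gz)) / (2.0 * L)
        rhs = f(x) - f(z) - sum(g * (xj - zj) for g, xj, zj in zip(gz, x, z))
        worst = min(worst, rhs - lhs)
    return worst
```

For this quadratic the gap equals $\frac{1}{2}\sum_j q_j(1 - q_j/L)(x_j - z_j)^2$, which is nonnegative and vanishes along the coordinate with $q_j = L$, so the bound is tight in that direction.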

The following result follows as a consequence of Lemma 1.

###### Lemma 2

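Let $x^*$ be an optimal solution of (1.1). Then we have

$$\frac{1}{m}\sum_{i=1}^m \frac{1}{m q_i}\|\nabla f_i(x) - \nabla f_i(x^*)\|_*^2 \le 2L_Q[\psi(x) - \psi(x^*)], \quad \forall x \in X, \tag{A.1}$$

where

$$L_Q = \frac{1}{m}\max_{i=1,\dots,m}\frac{L_i}{q_i}. \tag{A.2}$$

Proof: By Lemma 1 (with $f = f_i$), we have

$$\|\nabla f_i(x) - \nabla f_i(x^*)\|_*^2 \le 2L_i\big[f_i(x) - f_i(x^*) - \langle \nabla f_i(x^*), x - x^*\rangle\big].$$

Dividing this inequality by $m^2 q_i$, and summing over $i = 1, \dots, m$, we obtain

$$\frac{1}{m}\sum_{i=1}^m \frac{1}{m q_i}\|\nabla f_i(x) - \nabla f_i(x^*)\|_*^2 \le 2L_Q\big[f(x) - f(x^*) - \langle \nabla f(x^*), x - x^*\rangle\big]. \tag{A.3}$$

By the optimality of $x^*$, we have $\langle \nabla f(x^*) + h'(x^*), x - x^*\rangle \ge 0$ for any $x \in X$, which in view of the convexity of $h$, implies that $\langle \nabla f(x^*), x - x^*\rangle \ge h(x^*) - h(x)$ for any $x \in X$. The result then follows by combining the previous two conclusions.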
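The constant $L_Q$ in (A.2) depends on the sampling probabilities. The short sketch below evaluates it for two choices, assuming (as an interpretation of the theorems above) sampling proportional to $L_i$ and, for comparison, uniform sampling; the helper name `l_q` is hypothetical.

```python
def l_q(L_list, q_list):
    """L_Q = (1/m) * max_i (L_i / q_i), following (A.2)."""
    m = len(L_list)
    return max(Li / qi for Li, qi in zip(L_list, q_list)) / m


# Example Lipschitz constants (illustrative values only).
L_list = [1.0, 2.0, 5.0]
total = sum(L_list)

# Sampling proportional to L_i: every ratio L_i / q_i equals sum_j L_j,
# so L_Q is the average Lipschitz constant.
q_weighted = [Li / total for Li in L_list]

# Uniform sampling: L_Q is the largest Lipschitz constant.
q_uniform = [1.0 / len(L_list)] * len(L_list)
```

Here `l_q(L_list, q_weighted)` is the average of the $L_i$, while `l_q(L_list, q_uniform)` is their maximum, which illustrates why non-uniform sampling can give a smaller variance constant when the $L_i$ differ widely.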

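In the sequel, let us define some important notation that helps us to simplify the convergence analysis of Varag.

$$l_f(z, x) := f(z) + \langle \nabla f(z), x - z\rangle, \tag{A.4}$$

$$\delta_t := G_t - \nabla f(\underline{x}_t), \tag{A.5}$$

$$x_{t-1}^+ := \frac{1}{1 + \mu\gamma_s}\big(x_{t-1} + \mu\gamma_s \underline{x}_t\big), \tag{A.6}$$

where $\underline{x}_t$, $x_{t-1}$ and $G_t$ are generated as in Algorithm 1. Lemma 3 below shows that $G_t$ is an unbiased estimator of $\nabla f(\underline{x}_t)$ and provides a tight upper bound for its variance.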

###### Lemma 3

Conditionally on ,

 E[δt] =0, (A.7) E[∥δt∥2∗] ≤2LQ[f(~x)−f