1 Introduction
In this paper, we focus on the following composite convex optimization problem:
(1) $\min_{x\in\mathbb{R}^{d}}\ F(x):=f(x)+g(x),$
where $f(x)=\frac{1}{n}\sum^{n}_{i=1}f_{i}(x)$, the $f_{i}(x):\mathbb{R}^{d}\rightarrow\mathbb{R}$ are smooth convex functions, and $g(x)$ is a relatively simple (but possibly nondifferentiable) convex function (referred to as a regularizer). The formulation (1) arises in many places in machine learning, signal processing, data science, statistics and operations research, such as regularized empirical risk minimization (ERM). For instance, one popular choice of the component function $f_{i}(\cdot)$ in binary classification problems is the logistic loss, i.e., $f_{i}(x)=\log(1+\exp(-b_{i}a^{T}_{i}x))$, where $\{(a_{1},b_{1}),\ldots,(a_{n},b_{n})\}$ is a collection of training examples and $b_{i}\in\{\pm 1\}$. Some popular choices for the regularizer include the $\ell_{1}$-norm regularizer (i.e., $g(x)=\lambda_{1}\|x\|_{1}$), the $\ell_{2}$-norm regularizer (i.e., $g(x)=\frac{\lambda_{2}}{2}\|x\|^{2}$), and the elastic-net regularizer (i.e., $g(x)=\frac{\lambda_{2}}{2}\|x\|^{2}+\lambda_{1}\|x\|_{1}$), where $\lambda_{1}$ and $\lambda_{2}$ are two regularization parameters. Examples of other applications include deep neural networks
[1, 4, 5, 6], group Lasso [7, 8], sparse learning and coding [9, 10], phase retrieval [11], matrix completion [12, 13], conditional random fields [14], eigenvector computation [15, 16] such as principal component analysis (PCA) and singular value decomposition (SVD), and generalized eigendecomposition and canonical correlation analysis (CCA) [17].

1.1 Stochastic Gradient Descent
In this paper, we are especially interested in developing efficient algorithms to solve regularized ERM problems involving a large sum of component functions. The standard and effective method for solving Problem (1) is the (proximal) gradient descent (GD) method, including accelerated proximal gradient (APG) [18, 19, 20, 21]. For smooth objective functions, the update rule of GD is
(2) $x_{k+1}=x_{k}-\eta_{k}\nabla f(x_{k})$
for $k=0,1,\ldots$, where $\eta_{k}>0$ is commonly referred to as the step size in optimization or the learning rate in machine learning. When the regularizer $g(x)$ is nonsmooth, e.g., the $\ell_{1}$-norm regularizer, we need to introduce the following proximal operator into (2),
(3) $x_{k+1}=\operatorname{prox}^{\eta_{k}}_{g}\!\left(x_{k}-\eta_{k}\nabla f(x_{k})\right),$
where $\operatorname{prox}^{\eta}_{g}(y):=\arg\min_{x\in\mathbb{R}^{d}}\left\{\frac{1}{2\eta}\|x-y\|^{2}+g(x)\right\}$. The GD methods mentioned above have been proven to achieve linear convergence for strongly convex problems, and APG attains the optimal convergence rate of $\mathcal{O}(1/T^{2})$ for non-strongly convex problems, where $T$ denotes the number of iterations. However, the per-iteration cost of all these batch (or deterministic) methods is $\mathcal{O}(nd)$, which is expensive when $n$ is very large.
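As a concrete illustration of the proximal gradient step (3), the following sketch applies one update with the $\ell_{2}$-norm regularizer, whose proximal operator has a closed form (the quadratic loss, step size, and regularization weight here are illustrative assumptions, not choices made in this paper):

```python
import numpy as np

def prox_l2(y, eta, lam):
    # Closed-form proximal operator of g(x) = (lam/2)*||x||^2:
    # argmin_x { (1/(2*eta))*||x - y||^2 + (lam/2)*||x||^2 } = y / (1 + eta*lam)
    return y / (1.0 + eta * lam)

def proximal_gradient_step(x, grad_f, eta, lam):
    # One update of (3): x_{k+1} = prox_g^eta(x_k - eta * grad f(x_k))
    return prox_l2(x - eta * grad_f(x), eta, lam)

# Example with f(x) = 0.5*||x||^2, so grad f(x) = x (an illustrative choice)
x = np.array([2.0, -4.0])
x_next = proximal_gradient_step(x, lambda v: v, eta=0.5, lam=0.1)
```

For the $\ell_{2}$-norm regularizer the proximal operator reduces to a simple elementwise shrinkage; for nondifferentiable regularizers such as the $\ell_{1}$-norm, it is still cheap to evaluate but no longer a linear rescaling.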
Instead of evaluating the full gradient of $f(\cdot)$ at each iteration, an effective alternative is the stochastic (or incremental) gradient descent (SGD) method [22]. SGD only evaluates the gradient of a single component function at each iteration, thus it has a much lower per-iteration cost of $\mathcal{O}(d)$, and has been successfully applied to many large-scale learning problems [4, 23, 24, 25]. The update rule of SGD is formulated as follows:
(4) $x_{k+1}=x_{k}-\eta_{k}\nabla f_{i_{k}}(x_{k}),$
where the step size $\eta_{k}>0$ is diminishing (e.g., $\eta_{k}\propto 1/k$), and the index $i_{k}$ is chosen uniformly at random from $\{1,\ldots,n\}$. Although the stochastic gradient $\nabla f_{i_{k}}(x_{k})$ is an unbiased estimate of the full gradient, i.e., $\mathbb{E}[\nabla f_{i_{k}}(x_{k})]=\nabla f(x_{k})$, its variance may be large due to random sampling [1]. Thus, stochastic gradients are also called "noisy gradients", and the step size has to be gradually reduced, leading to slow convergence. In particular, even under the strongly convex condition, standard SGD attains only a sublinear convergence rate of $\mathcal{O}(1/T)$ [26, 27].
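A minimal NumPy sketch of the SGD update (4) with a diminishing step size (the least-squares loss, the $1/k$ schedule, and all constants are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d))
b = A @ np.ones(d) + 0.01 * rng.standard_normal(n)  # synthetic targets

x = np.zeros(d)
eta0 = 0.1
for k in range(1, 2001):
    i = rng.integers(n)                    # index i_k drawn uniformly from {1,...,n}
    grad_i = (A[i] @ x - b[i]) * A[i]      # gradient of f_i(x) = 0.5*(a_i^T x - b_i)^2
    x -= (eta0 / k) * grad_i               # diminishing step size eta_k = eta0 / k
```

The shrinking step size is what tames the sampling noise here, and it is also what limits SGD to the sublinear rate discussed above.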
1.2 Accelerated SGD
Recently, many SGD methods with variance reduction techniques were proposed, such as stochastic average gradient (SAG) [28], stochastic variance reduced gradient (SVRG) [1], stochastic dual coordinate ascent (SDCA) [29], SAGA [30], stochastic primal-dual coordinate (SPDC) [31], and their proximal variants, such as ProxSAG [32], ProxSVRG [2] and ProxSDCA [33]. All these accelerated SGD methods can use a constant step size instead of the diminishing step sizes of SGD, and fall into the following three categories: primal methods such as SVRG and SAGA, dual methods such as SDCA, and primal-dual methods such as SPDC. In essence, the primal methods, as well as the dual and primal-dual methods, use the full gradient at the snapshot or average gradients to progressively reduce the variance of the stochastic gradient estimators, which has led to a revolution in the area of first-order methods [34]. Thus, they are also known as hybrid gradient descent methods [35] or semi-stochastic gradient descent methods [36]. In particular, under the strongly convex condition, the accelerated SGD methods enjoy linear convergence rates and an overall complexity of $\mathcal{O}((n+L/\mu)\log(1/\epsilon))$ to obtain an $\epsilon$-suboptimal solution, where each $f_{i}(\cdot)$ is $L$-smooth and $F(x)$ is $\mu$-strongly convex. This complexity bound shows that they converge significantly faster than deterministic APG methods, whose complexity is $\mathcal{O}(n\sqrt{L/\mu}\log(1/\epsilon))$ [36].
SVRG [1] and its proximal variant, ProxSVRG [2], are particularly attractive because of their low storage requirement compared with other stochastic methods such as SAG, SAGA and SDCA, which need to store all the gradients of the $n$ component functions or the dual variables. At the beginning of each epoch of SVRG, the full gradient $\nabla f(\tilde{x})$ is computed at the snapshot point $\tilde{x}$, which is updated periodically. The update rule for the smooth optimization problem (1) is given by
(5a) $\tilde{\nabla} f_{i_{k}}(x_{k-1})=\nabla f_{i_{k}}(x_{k-1})-\nabla f_{i_{k}}(\tilde{x})+\nabla f(\tilde{x}),$
(5b) $x_{k}=x_{k-1}-\eta\left(\tilde{\nabla} f_{i_{k}}(x_{k-1})+\nabla g(x_{k-1})\right).$
When $g(x)\equiv 0$, the update rule in (5b) becomes the original one in [1], i.e., $x_{k}=x_{k-1}-\eta\,\tilde{\nabla} f_{i_{k}}(x_{k-1})$. It is not hard to verify that the variance of the SVRG estimator $\tilde{\nabla} f_{i_{k}}(x_{k-1})$, i.e., $\mathbb{E}\|\tilde{\nabla} f_{i_{k}}(x_{k-1})-\nabla f(x_{k-1})\|^{2}$, can be much smaller than that of the SGD estimator $\nabla f_{i_{k}}(x_{k-1})$, i.e., $\mathbb{E}\|\nabla f_{i_{k}}(x_{k-1})-\nabla f(x_{k-1})\|^{2}$. However, for non-strongly convex problems, the accelerated SGD methods mentioned above converge much slower than batch APG methods such as FISTA [21], namely, $\mathcal{O}(1/T)$ vs. $\mathcal{O}(1/T^{2})$.
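The variance contrast between the plain SGD estimator and the SVRG estimator in (5a) can be checked numerically; the sketch below compares the two on a synthetic least-squares instance (an illustrative assumption, not an experiment from this paper):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

grad_i = lambda z, i: (A[i] @ z - b[i]) * A[i]   # gradient of component f_i
grad_full = lambda z: A.T @ (A @ z - b) / n      # full gradient of f

x = rng.standard_normal(d)                       # current iterate x_{k-1}
x_snap = x + 0.01 * rng.standard_normal(d)       # snapshot close to the iterate

g = grad_full(x)
sgd_var = np.mean([np.sum((grad_i(x, i) - g) ** 2) for i in range(n)])
svrg_var = np.mean([
    np.sum((grad_i(x, i) - grad_i(x_snap, i) + grad_full(x_snap) - g) ** 2)
    for i in range(n)
])
# Near the snapshot, the SVRG estimator (5a) has far smaller variance than SGD's.
```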
More recently, many acceleration techniques were proposed to further speed up the variance-reduced stochastic methods mentioned above. These techniques mainly include the Nesterov's acceleration technique used in [24, 37, 38, 39, 40], reducing the number of gradient calculations in the early iterations [34, 41, 42], the projection-free property of the conditional gradient method (also known as the Frank-Wolfe algorithm [43]) as in [44], the stochastic sufficient decrease technique [45], and the momentum acceleration trick in [3, 34, 46]. More specifically, [40] proposed an accelerating Catalyst framework, which attains the complexity of $\mathcal{O}((n+\sqrt{nL/\mu})\log(L/\mu)\log(1/\epsilon))$ for strongly convex problems. [3] and [46] proved that their accelerated methods attain the best known complexity of $\mathcal{O}(n\log(1/\epsilon)+\sqrt{nL/\epsilon})$ for non-strongly convex problems, which matches the theoretical upper bound provided in [47]. Katyusha [3] and point-SAGA [48] achieve the best-known complexity of $\mathcal{O}((n+\sqrt{nL/\mu})\log(1/\epsilon))$ for strongly convex problems, which is identical to the upper complexity bound in [47]. That is, Katyusha is the best known stochastic optimization method for both strongly convex and non-strongly convex problems. Its proximal gradient update rules are formulated as follows:
(6a) $x_{k}=w_{1}y_{k-1}+w_{2}\tilde{x}+(1-w_{1}-w_{2})z_{k-1},$
(6b) $y_{k}=\arg\min_{y}\left\{\frac{1}{2\eta}\|y-y_{k-1}\|^{2}+\langle\tilde{\nabla} f_{i_{k}}(x_{k}),\,y\rangle+g(y)\right\},$
(6c) $z_{k}=\arg\min_{z}\left\{\frac{3L}{2}\|z-x_{k}\|^{2}+\langle\tilde{\nabla} f_{i_{k}}(x_{k}),\,z\rangle+g(z)\right\},$
where $w_{1},w_{2}\in[0,1]$ are two momentum parameters. To eliminate the need for parameter tuning, $w_{2}$ is set to $1/2$, and $\eta$ is fixed to $1/(3w_{1}L)$ in [3]. Unfortunately, most of the accelerated methods mentioned above, including Katyusha, require at least two auxiliary variables and two momentum parameters, which lead to complicated algorithm design and high per-iteration complexity [34].
1.3 Our Contributions
From the above discussion, one can see that most accelerated stochastic variance reduction methods such as [3, 34, 37, 42, 44, 45] and applications such as [9, 10, 13, 15, 16, 49] are based on the stochastic variance reduced gradient (SVRG) method [1]. Thus, any key improvement on SVRG is very important for research on stochastic optimization. In this paper, we propose a simple variant of the original SVRG [1], referred to as variance reduced stochastic gradient descent (VRSGD). The snapshot point and starting point of each epoch in VRSGD are set to the average and the last iterate of the previous epoch, respectively. Different from the settings of SVRG and ProxSVRG [2] (i.e., the last iterate for both points in the former, and the average of the previous epoch for both points in the latter), the two points in VRSGD are distinct, which makes our convergence analysis more challenging than that of SVRG and ProxSVRG. Our empirical results show that the performance of VRSGD is significantly better than its counterparts, SVRG and ProxSVRG. Impressively, VRSGD with a sufficiently large learning rate performs much better than the best known stochastic method, Katyusha [3]. The main contributions of this paper are summarized below.

The snapshot point and starting point of VRSGD are set to two different vectors. That is, for all epochs except the first one, the snapshot point $\tilde{x}^{s}$ is set to an average of the iterates of the previous epoch (with two averaging choices, denoted by Option I and Option II), and the starting point is set to the last iterate of the previous epoch, i.e., $x^{s+1}_{0}=x^{s}_{m}$. In particular, we find that this setting allows VRSGD to take much larger learning rates or step sizes than SVRG, and thus significantly speeds up the convergence of SVRG and ProxSVRG in practice. Moreover, VRSGD has an advantage over SVRG in terms of robustness of learning rate selection.

Different from proximal stochastic gradient methods, e.g., ProxSVRG and Katyusha, which have a unified update rule for the two cases of smooth and nonsmooth objectives (see Section 2.2 for details), VRSGD employs two different update rules for the two cases, as in (12) and (13) below. Empirical results show that gradient update rules as in (12) are a better choice for smooth optimization problems than proximal update formulas as in (10).

Finally, we theoretically analyze the convergence properties of VRSGD with Option I or Option II for strongly convex problems, which show that VRSGD attains a linear convergence rate. We also give the convergence guarantees of VRSGD with Option I or Option II for nonstrongly convex objective functions.
2 Preliminary and Related Work
Throughout this paper, we use $\|\cdot\|$ to denote the $\ell_{2}$-norm (also known as the standard Euclidean norm), and $\|\cdot\|_{1}$ is the $\ell_{1}$-norm, i.e., $\|x\|_{1}=\sum^{d}_{i=1}|x_{i}|$. $\nabla f(\cdot)$ denotes the full gradient of $f(\cdot)$ if it is differentiable, or a subgradient if $f(\cdot)$ is only Lipschitz continuous. For each epoch $s$ and inner iteration $k$, $i^{s}_{k}$ denotes the randomly chosen index. We mostly focus on the case of Problem (1) when each component function $f_{i}(\cdot)$ is $L$-smooth^{1}, and $F(x)$ is $\mu$-strongly convex. The two common assumptions are defined as follows.

^{1}Actually, we can extend all the theoretical results in this paper from the case when the gradients of all component functions have the same Lipschitz constant $L$ to the more general case when some component functions have different degrees of smoothness.
2.1 Basic Assumptions
Assumption 1 (Smoothness).
Each convex function $f_{i}(\cdot)$ is $L$-smooth, that is, there exists a constant $L>0$ such that for all $x,y\in\mathbb{R}^{d}$,
(7) $\|\nabla f_{i}(x)-\nabla f_{i}(y)\|\leq L\|x-y\|.$
Assumption 2 (Strong Convexity).
The convex function $F(x)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu>0$ such that for all $x,y\in\mathbb{R}^{d}$ and any subgradient $\xi\in\partial F(x)$,
(8) $F(y)\geq F(x)+\langle\xi,\,y-x\rangle+\frac{\mu}{2}\|y-x\|^{2}.$
2.2 Related Work
To speed up standard and proximal SGD methods, many variance-reduced stochastic methods [28, 29, 30, 35] have been proposed for some special cases of Problem (1). In the case when each $f_{i}(x)$ is $L$-smooth, $f(x)$ is $\mu$-strongly convex, and $g(x)\equiv 0$, Roux et al. [28] proposed a stochastic average gradient (SAG) method, which attains a linear convergence rate. However, SAG, as well as other incremental aggregated gradient methods such as SAGA [30], needs to store all the gradients, so that $\mathcal{O}(nd)$ storage is required in general [41]. Similarly, SDCA [29] requires storage of all dual variables [1], which scales as $\mathcal{O}(n)$. In contrast, SVRG [1], as well as its proximal variant, ProxSVRG [2], has a convergence rate similar to those of SAG and SDCA, but without the memory requirement for all gradients and dual variables. In particular, the SVRG estimator in (5a) (independently introduced in [1, 35]) may be the most popular choice of stochastic gradient estimator. Other stochastic gradient estimators include the SAGA estimator in [30] and the stochastic recursive gradient estimator in [50]. Although the original SVRG in [1] only has convergence guarantees for the special case of Problem (1) when each $f_{i}(x)$ is $L$-smooth, $f(x)$ is $\mu$-strongly convex, and $g(x)\equiv 0$, one can extend SVRG to the proximal setting by introducing the proximal operator in (3), as shown in Line 7 of Algorithm 1. In other words, when $g(x)$ is nonsmooth, the update rule of SVRG becomes
(9) $x_{k}=\operatorname{prox}^{\eta}_{g}\!\left(x_{k-1}-\eta\,\tilde{\nabla} f_{i_{k}}(x_{k-1})\right).$
Some researchers [8, 51, 52] have incorporated variance reduction techniques into ADMM for minimizing convex composite objective functions subject to an equality constraint. [15, 16, 17] applied efficient stochastic solvers to compute the leading eigenvectors of a symmetric matrix or generalized eigenvectors of two symmetric matrices. The first such method is VR-PCA by Shamir [15], and the convergence properties of VR-PCA for this nonconvex problem were also provided. Garber et al. [16] analyzed the convergence rate of SVRG when $f(x)$ is a convex function that is a sum of nonconvex component functions. Moreover, [6] and [53] proved that SVRG and SAGA with minor modifications can converge asymptotically to a stationary point of nonconvex finite-sum problems. Some distributed variants [54, 55] of accelerated SGD methods have also been proposed.
An important class of stochastic methods is the proximal stochastic gradient (ProxSG) method, such as ProxSVRG [2], SAGA [30], and Katyusha [3]. Different from standard variance-reduced SGD methods such as SVRG, which use a stochastic gradient update as in (5b), the ProxSG method has a unified update rule for both smooth and nonsmooth cases of $g(x)$. For instance, the update rule of ProxSVRG [2] is formulated as follows:
(10) $x_{k}=\operatorname{prox}^{\eta}_{g}\!\left(x_{k-1}-\eta\,\tilde{\nabla} f_{i_{k}}(x_{k-1})\right).$
For the sake of completeness, the details of ProxSVRG [2] are shown in Algorithm 1 with Option II. When $g(x)$ is the widely used $\ell_{1}$-norm regularizer, i.e., $g(x)=\lambda_{1}\|x\|_{1}$, the proximal update formula in (10) becomes
(11) $x_{k}=\operatorname{sign}(y_{k})\odot\max\{|y_{k}|-\eta\lambda_{1},\,0\},\quad y_{k}=x_{k-1}-\eta\,\tilde{\nabla} f_{i_{k}}(x_{k-1}),$ where the $\operatorname{sign}$, $|\cdot|$ and $\max$ operations are applied elementwise.
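The $\ell_{1}$-norm proximal update in (11) is the elementwise soft-thresholding operator; a minimal sketch (the specific vectors and constants are illustrative assumptions):

```python
import numpy as np

def soft_threshold(y, t):
    # Proximal operator of t*||.||_1: sign(y) * max(|y| - t, 0), elementwise
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

# One proximal step as in (11): form y_k, then soft-threshold with eta*lambda_1
x_prev = np.array([0.8, -0.3, 0.05])
v = np.array([1.0, -2.0, 0.1])        # a variance-reduced stochastic gradient
eta, lam1 = 0.1, 0.5
x_new = soft_threshold(x_prev - eta * v, eta * lam1)
```

Note how the third coordinate is driven exactly to zero, which is the mechanism by which the $\ell_{1}$-norm regularizer produces sparse iterates.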
3 VarianceReduced Stochastic Gradient Descent
In this section, we propose an efficient variance reduced stochastic gradient descent (VRSGD) method with iterate averaging. Different from the choices of the snapshot and starting points in SVRG [1] and ProxSVRG [2], the two vectors of each epoch in VRSGD are set to the average and last iterate of the previous epoch, respectively. Unlike common stochastic gradient methods such as SVRG and proximal stochastic gradient methods such as ProxSVRG, we design two different update rules for smooth and nonsmooth objective functions, respectively.
3.1 Iterate Averaging
Like SVRG and Katyusha, VRSGD is also divided into epochs, and each epoch consists of $m$ stochastic gradient steps, where $m$ is usually chosen to be $\Theta(n)$ as in [1, 2, 3]. Within each epoch, we need to compute the full gradient $\nabla f(\tilde{x})$ at the snapshot $\tilde{x}$ and use it to define the variance-reduced stochastic gradient estimator as in [1]. Unlike SVRG, whose snapshot point is set to the last iterate of the previous epoch, the snapshot of each epoch in VRSGD is set to the average of the previous epoch, e.g., $\tilde{x}^{s}=\frac{1}{m}\sum^{m}_{k=1}x^{s}_{k}$ in Option I of Algorithm 2, which leads to better robustness to gradient noise^{2}, as also suggested in [34, 45, 59]. In fact, the choice of Option II in Algorithm 2 also works well in practice, as shown in Fig. 1. Therefore, we provide the convergence guarantees for our algorithms (including Algorithm 2) with Option I and with Option II in the next section. In particular, we find that one effect of the choice in Option I or Option II of Algorithm 2 is to allow much larger learning rates or step sizes than SVRG in practice (see Fig. 2 and Section 5.3 for details). This is the main reason why VRSGD converges significantly faster than SVRG. Indeed, the larger learning rate enjoyed by VRSGD means that the variance of its stochastic gradient estimator goes asymptotically to zero faster.

Unlike ProxSVRG [2], whose starting point is initialized to the average of the previous epoch, the starting point of each epoch in VRSGD is set to the last iterate of the previous epoch. That is, the last iterate of the previous epoch becomes the new starting point in VRSGD, while the two points of ProxSVRG are completely different, thereby leading to relatively slow convergence in general. It is clear that both the starting point and snapshot point of each epoch in the original SVRG [1] are set to the last iterate of the previous epoch^{3}, while the two points of ProxSVRG [2] are both set to the average of the previous epoch (also suggested in [1]). Different from the settings in SVRG and ProxSVRG, the starting and snapshot points in VRSGD are set to the two different vectors mentioned above, which makes the convergence analysis of VRSGD more challenging than that of SVRG and ProxSVRG, as shown in Section 4.

^{2}It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down the convergence speed in a sense. However, SGD and its variants are probably the most used optimization algorithms for deep learning [56]. In particular, [57] has shown that by adding noise at each step, noisy gradient descent can escape saddle points efficiently and converge to a local minimum of nonconvex optimization problems, as in the application to deep neural networks in [58].

^{3}Note that the theoretical convergence of the original SVRG [1] relies on its Option II, i.e., both $\tilde{x}^{s}$ and $x^{s+1}_{0}$ are set to $x^{s}_{k}$, where $k$ is randomly chosen from $\{1,\ldots,m\}$. However, the empirical results in [1] suggest that Option I is a better choice than its Option II, and the convergence guarantee of SVRG with Option I for strongly convex objective functions is provided in [60].
3.2 The Algorithm for Strongly Convex Objectives
In this part, we propose an efficient VRSGD algorithm to solve strongly convex objective functions, as outlined in Algorithm 2. It is well known that the original SVRG [1] only works for the case of smooth and strongly convex objective functions. However, in many machine learning applications, e.g., elastic-net regularized logistic regression, the strongly convex objective function is nonsmooth. To solve this class of problems, the proximal variant of SVRG, ProxSVRG [2], was subsequently proposed. Unlike SVRG and ProxSVRG, VRSGD can not only solve smooth objective functions, but also directly tackle nonsmooth ones. That is, when the regularizer $g(x)$ is smooth, e.g., the $\ell_{2}$-norm regularizer, the update rule of VRSGD is
(12) $x^{s}_{k}=x^{s}_{k-1}-\eta\left(\tilde{\nabla} f_{i^{s}_{k}}(x^{s}_{k-1})+\nabla g(x^{s}_{k-1})\right).$
When $g(x)$ is nonsmooth, e.g., the $\ell_{1}$-norm regularizer, the update rule of VRSGD becomes
(13) $x^{s}_{k}=\operatorname{prox}^{\eta}_{g}\!\left(x^{s}_{k-1}-\eta\,\tilde{\nabla} f_{i^{s}_{k}}(x^{s}_{k-1})\right).$
Different from the proximal stochastic gradient methods such as ProxSVRG [2], all of which have a unified update rule as in (10) for both the smooth and nonsmooth cases of $g(x)$, VRSGD has two different update rules for the two cases, as stated in (12) and (13). This leads to the following advantage over the ProxSG methods: the stochastic gradient update rule in (12) usually outperforms the proximal stochastic gradient update rule in (11), as well as the two classes of update rules of Katyusha [3] (see Section 5.4 for details).
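Putting the pieces together, one epoch of the scheme described above can be sketched as follows (a smooth $\ell_{2}$-regularized least-squares objective, plain-average snapshots, and all constants are illustrative assumptions, not the paper's exact Algorithm 2):

```python
import numpy as np

def vr_sgd_epoch(x_start, x_snap, A, b, lam, eta, m, rng):
    """One epoch: gradient update (12) with the variance-reduced estimator (5a)."""
    n = A.shape[0]
    full_grad = A.T @ (A @ x_snap - b) / n        # full gradient of f at snapshot
    x = x_start.copy()
    iterates = []
    for _ in range(m):
        i = rng.integers(n)
        gi = lambda z: (A[i] @ z - b[i]) * A[i]   # gradient of component f_i
        v = gi(x) - gi(x_snap) + full_grad        # estimator as in (5a)
        x = x - eta * (v + lam * x)               # update (12); grad of (lam/2)||x||^2
        iterates.append(x.copy())
    # New snapshot <- average of the epoch; new start <- last iterate
    return x, np.mean(iterates, axis=0)

rng = np.random.default_rng(0)
n, d = 100, 5
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
x, x_snap = np.zeros(d), np.zeros(d)
for s in range(15):
    x, x_snap = vr_sgd_epoch(x, x_snap, A, b, lam=0.1, eta=0.05, m=2 * n, rng=rng)
```

The only bookkeeping beyond plain SVRG is the running average of the epoch's iterates, which becomes the next snapshot while the last iterate carries over as the next starting point.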
Fig. 2 demonstrates that VRSGD has a significant advantage over SVRG in terms of robustness of learning rate selection. That is, VRSGD yields good performance over a wide range of learning rates, whereas the performance of SVRG is very sensitive to the selection of learning rates. This makes VRSGD convenient to apply to various real-world problems of large-scale machine learning. In fact, VRSGD can use much larger learning rates than SVRG for logistic regression problems in practice, as shown in Fig. 2(a).


3.3 The Algorithm for NonStrongly Convex Objectives
Although many variance-reduced stochastic methods have been proposed, most of them, including SVRG and ProxSVRG, only have convergence guarantees for the case of Problem (1) when the objective function $F(x)$ is strongly convex. However, $F(x)$ may be non-strongly convex in many machine learning applications, such as Lasso and $\ell_{1}$-norm regularized logistic regression. As suggested in [3, 61], this class of problems can be transformed into strongly convex ones by adding a proximal term $\frac{\sigma}{2}\|x-x_{0}\|^{2}$, so that they can be efficiently solved by Algorithm 2. However, this reduction technique may degrade the performance of the involved algorithms both in theory and in practice [42]. Thus, we present an efficient VRSGD algorithm for directly solving the non-strongly convex problem (1), as outlined in Algorithm 3.
The main difference between Algorithm 2 and Algorithm 3 is the setting of the learning rate. Similar to Algorithm 2, the learning rate of Algorithm 3 can also be fixed to a constant. Inspired by existing accelerated stochastic algorithms [3, 34], the learning rate in Algorithm 3 can instead be gradually increased, which in principle leads to faster convergence (see Section 5.5 for details). Different from existing stochastic methods such as Katyusha [3], the update rule of the learning rate for Algorithm 3 is defined as follows: $\eta_{s}=\eta_{1}/\alpha_{s}$, where $\eta_{1}$ is an initial learning rate, and for any $s\geq 2$,
(14) 
where $\alpha$ is a given constant.
3.4 Complexity Analysis
From Algorithms 2 and 3, we can see that the per-iteration cost of VRSGD is dominated by the computation of $\nabla f_{i^{s}_{k}}(x^{s}_{k-1})$, $\nabla f_{i^{s}_{k}}(\tilde{x}^{s-1})$, and $\nabla g(x^{s}_{k-1})$ or the proximal update in (13). Thus, the per-iteration complexity is $\mathcal{O}(d)$, as low as that of SVRG [1] and ProxSVRG [2]. In fact, for some ERM problems, we can save the intermediate gradients $\nabla f_{i}(\tilde{x}^{s-1})$ during the computation of the full gradient, which generally requires $\mathcal{O}(n)$ additional storage. As a result, each epoch only requires $(n+m)$ component gradient evaluations. In addition, for extremely sparse data, we can introduce the lazy update tricks in [36, 62, 63] into our algorithms, and perform the update steps in (12) and (13) only for the nonzero dimensions of each example, rather than all dimensions. In other words, the per-iteration complexity of VRSGD can be improved from $\mathcal{O}(d)$ to $\mathcal{O}(d')$, where $d'\leq d$ is the sparsity of the feature vectors. Moreover, VRSGD has a much lower per-iteration complexity than existing accelerated stochastic variance reduction methods such as Katyusha [3], which have at least two more update rules for additional variables, as shown in (6a)-(6c).
3.5 Extensions of VRSGD
It has been shown in [36, 37] that mini-batching can effectively decrease the variance of stochastic gradient estimates. In this part, we first extend the proposed VRSGD method to the mini-batch setting, as well as its convergence results below. Here, we denote by $b$ the mini-batch size and by $I^{s}_{k}$ the selected random index set for each outer iteration $s$ and inner iteration $k$. The variance-reduced stochastic gradient estimator in (5a) becomes

$\tilde{\nabla} f_{I^{s}_{k}}(x^{s}_{k-1})=\frac{1}{b}\sum_{i\in I^{s}_{k}}\left[\nabla f_{i}(x^{s}_{k-1})-\nabla f_{i}(\tilde{x}^{s-1})\right]+\nabla f(\tilde{x}^{s-1}),$

where $I^{s}_{k}\subseteq\{1,\ldots,n\}$ is a mini-batch of size $b$. If some component functions are nonsmooth, we can use the proximal operator oracle [61], or the Nesterov's smoothing [64] and homotopy smoothing [65] techniques, to smooth them, and thereby obtain smoothed approximations of the functions $f_{i}(x)$. In addition, we can directly extend our algorithms to the nonsmooth setting as in [34], e.g., Algorithm 3 in [34].
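A minimal sketch of the mini-batch variance-reduced estimator described above (a least-squares instance and all constants are illustrative assumptions):

```python
import numpy as np

def minibatch_vr_grad(x, x_snap, full_grad_snap, A, b, batch, rng):
    """Mini-batch estimator: average the per-component correction over a random set."""
    n = A.shape[0]
    idx = rng.choice(n, size=batch, replace=False)  # random index set of size b
    r_x = A[idx] @ x - b[idx]                       # residuals at current iterate
    r_s = A[idx] @ x_snap - b[idx]                  # residuals at snapshot
    grad_diff = A[idx].T @ (r_x - r_s) / batch      # (1/b) sum_i [grad f_i(x) - grad f_i(x_snap)]
    return grad_diff + full_grad_snap

rng = np.random.default_rng(0)
n, d = 50, 4
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
x = rng.standard_normal(d); x_snap = x.copy()
fg = A.T @ (A @ x_snap - b) / n
v = minibatch_vr_grad(x, x_snap, fg, A, b, batch=8, rng=rng)
# When x == x_snap, the estimator reduces exactly to the full gradient.
```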
Considering that each component function may have different degrees of smoothness, picking the random index $i^{s}_{k}$ from a nonuniform distribution is a much better choice than the commonly used uniform random sampling [66, 67], as is without-replacement sampling compared with with-replacement sampling [27]. This can be done using the same techniques as in [2, 3], i.e., the sampling probabilities for all $f_{i}(\cdot)$ are proportional to their Lipschitz constants $L_{i}$, i.e., $p_{i}=L_{i}/\sum^{n}_{j=1}L_{j}$. Moreover, our VRSGD method can also be combined with other acceleration techniques proposed for SVRG. For instance, the epoch length of VRSGD can be automatically determined by the techniques in [42, 68], instead of being fixed. We can reduce the number of gradient calculations in the early iterations as in [34, 41, 42], which leads to faster convergence in general. Moreover, we can also introduce the Nesterov's acceleration technique as in [24, 37, 38, 39, 40] and the momentum acceleration trick as in [3, 34] to further improve the performance of VRSGD.

4 Convergence Analysis
In this section, we provide the convergence guarantees of VRSGD for solving both smooth and nonsmooth general convex problems, and extend these results to the mini-batch setting. Moreover, we analyze the convergence properties of VRSGD for solving both smooth and nonsmooth strongly convex objective functions. We first introduce the following lemma, which is useful in our analysis.
Lemma 1 (3point property, [69]).
Let $z^{\star}$ be the optimal solution of the following problem:

$\min_{z\in\mathbb{R}^{d}}\ \frac{1}{2\eta}\|z-z_{0}\|^{2}+\psi(z),$

where $\psi(z)$ is a convex function (but possibly nondifferentiable), and $z_{0}\in\mathbb{R}^{d}$. Then for any $z\in\mathbb{R}^{d}$, the following inequality holds:

$\psi(z)+\frac{1}{2\eta}\|z-z_{0}\|^{2}\ \geq\ \psi(z^{\star})+\frac{1}{2\eta}\|z^{\star}-z_{0}\|^{2}+\frac{1}{2\eta}\|z-z^{\star}\|^{2}.$
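The three-point property can be verified numerically for a simple choice of the nondifferentiable function (the $\ell_{1}$-norm objective and the specific points are illustrative assumptions):

```python
import numpy as np

# Three-point property with psi(z) = ||z||_1 and eta = 0.5. The minimizer
# z* = argmin_z { psi(z) + (1/(2*eta)) * ||z - z0||^2 } is soft-thresholding of z0.
eta = 0.5
z0 = np.array([1.3, -0.2, 0.8])
z_star = np.sign(z0) * np.maximum(np.abs(z0) - eta, 0.0)

psi = lambda z: np.sum(np.abs(z))
quad = lambda u, v: np.sum((u - v) ** 2) / (2 * eta)

# Check: psi(z) + quad(z, z0) >= psi(z*) + quad(z*, z0) + quad(z, z*) for any z
rng = np.random.default_rng(0)
for _ in range(100):
    z = rng.standard_normal(3)
    lhs = psi(z) + quad(z, z0)
    rhs = psi(z_star) + quad(z_star, z0) + quad(z, z_star)
    assert lhs >= rhs - 1e-9
```

The extra $\frac{1}{2\eta}\|z-z^{\star}\|^{2}$ term comes from the $\frac{1}{\eta}$-strong convexity of the objective being minimized, which is exactly what the proofs below exploit.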
4.1 Convergence Properties for Nonstrongly Convex Problems
In this part, we analyze the convergence properties of VRSGD for solving the more general nonstrongly convex problems. Considering that the two proposed algorithms (i.e., Algorithms 2 and 3) have two different update rules for the smooth and nonsmooth problems, we give the convergence guarantees of VRSGD for the two cases as follows.
4.1.1 Smooth Objectives
We first give the convergence guarantee for Problem (1) when the objective function is smooth and non-strongly convex. In order to simplify the analysis, we fold the smooth regularizer into each component function, i.e., we redefine $f_{i}(x):=f_{i}(x)+g(x)$, so that $F(x)=\frac{1}{n}\sum^{n}_{i=1}f_{i}(x)$ for all $x\in\mathbb{R}^{d}$.
Lemma 2 (Variance bound of smooth objectives).
The proof of this lemma is included in Appendix A. Lemma 2 provides an upper bound on the expected variance of the variance-reduced gradient estimator in (5a), i.e., the SVRG estimator independently introduced in [1, 35]. For Algorithm 3 with Option I and a fixed learning rate, we give the following key result for our analysis.
Lemma 3 (Option I and smooth objectives).
Let $x^{\star}$ be an optimal solution of Problem (1). If each $f_{i}(\cdot)$ is convex and $L$-smooth, then the following inequality holds for all $s\geq 1$:
Proof.
In order to simplify notation, we denote the stochastic gradient estimator by $v^{s}_{k}:=\tilde{\nabla} f_{i^{s}_{k}}(x^{s}_{k-1})$. Since each component function $f_{i}(\cdot)$ is $L$-smooth, the gradient of the average function $f(\cdot)=\frac{1}{n}\sum^{n}_{i=1}f_{i}(\cdot)$ is also $L$-Lipschitz continuous, i.e., for all $x,y\in\mathbb{R}^{d}$,

$\|\nabla f(x)-\nabla f(y)\|\leq L\|x-y\|,$

whose equivalent form is

$f(y)\leq f(x)+\langle\nabla f(x),\,y-x\rangle+\frac{L}{2}\|y-x\|^{2}.$

Applying the above smoothness inequality, we have
(15) 
where $\beta>0$ is a constant. Using Lemma 2, we then get
(16) 
where the first inequality holds due to Young's inequality (i.e., $\langle a,b\rangle\leq\frac{1}{2\beta}\|a\|^{2}+\frac{\beta}{2}\|b\|^{2}$ for all $a,b\in\mathbb{R}^{d}$ and $\beta>0$), and the second inequality follows from Lemma 2.
Substituting the inequality in (16) into the inequality in (15), and taking the expectation with respect to the random choice of $i^{s}_{k}$, we have
Here, the first inequality holds due to the inequalities in (15) and (16); the second inequality follows from Lemma 1; the third inequality holds due to the fact that $\mathbb{E}\big[\tilde{\nabla} f_{i^{s}_{k}}(x^{s}_{k-1})\big]=\nabla f(x^{s}_{k-1})$; and the last inequality follows from the convexity of the smooth function $f(\cdot)$, i.e.,