In this paper, we focus on the following composite convex optimization problem:
where , are the smooth convex functions, and is a relatively simple (but possibly non-differentiable) convex function (referred to as a regularizer). The formulation (1in binary classification problems is the logistic loss, i.e., , where is a collection of training examples, and . Some popular choices for the regularizer include the -norm regularizer (i.e., ), the -norm regularizer (i.e., ), and the elastic-net regularizer (i.e., ), where and
are two regularization parameters. So far examples of some other applications include deep neural networks[1, 4, 5, 6], group Lasso [7, 8], sparse learning and coding [9, 10], phase retrieval , matrix completion [12, 13], conditional random fields 
, eigenvector computation[15, 16]17].
1.1 Stochastic Gradient Descent
In this paper, we are especially interested in developing efficient algorithms to solve regularized ERM problems involving a large sum of component functions. The standard and effective method for solving Problem (1) is the (proximal) gradient descent (GD) method, including accelerated proximal gradient (APG) [18, 19, 20, 21]. For smooth objective functions, the update rule of GD is
for , where is commonly referred to as the step-size in optimization or the learning rate in machine learning. When the regularizer is non-smooth, e.g., the -norm regularizer, we need to introduce the following proximal operator into (2),
where The GD methods mentioned above have been proven to achieve linear convergence for strongly convex problems, and APG attains the optimal convergence rate of for non-strongly convex problems, where denotes the number of iterations. However, the per-iteration cost of all the batch (or deterministic) methods is , which is expensive.
Instead of evaluating the full gradient of at each iteration, an effective alternative is the stochastic (or incremental) gradient descent (SGD) method . SGD only evaluates the gradient of a single component function at each iteration, thus it has much lower per-iteration cost, , and has been successfully applied to many large-scale learning problems [4, 23, 24, 25]. The update rule of SGD is formulated as follows:
where , and the index is chosen uniformly at random from . Although the expectation of is an unbiased estimation for , i.e., , the variance of the stochastic gradient estimator may be large due to the variance of random sampling . Thus, stochastic gradient estimators are also called “noisy gradients”, and we need to gradually reduce its step size, leading to slow convergence. In particular, even under the strongly convex condition, standard SGD attains a slower sub-linear convergence rate of [26, 27].
1.2 Accelerated SGD
Recently, many SGD methods with variance reduction techniques were proposed, such as stochastic average gradient (SAG) , stochastic variance reduced gradient (SVRG) , stochastic dual coordinate ascent (SDCA) , SAGA , stochastic primal-dual coordinate (SPDC) , and their proximal variants, such as Prox-SAG , Prox-SVRG  and Prox-SDCA . All these accelerated SGD methods can use a constant step size instead of diminishing step sizes for SGD, and fall into the following three categories: primal methods such as SVRG and SAGA, dual methods such as SDCA, and primal-dual methods such as SPDC. In essence, many of primal methods use the full gradient at the snapshot or average gradients to progressively reduce the variance of stochastic gradient estimators, as well as dual and primal-dual methods, which leads to a revolution in the area of first-order methods . Thus, they are also known as the hybrid gradient descent method  or semi-stochastic gradient descent method . In particular, under the strongly convex condition, the accelerated SGD methods enjoy linear convergence rates and the overall complexity of to obtain an -suboptimal solution, where each is -smooth and is -strongly convex. The complexity bound shows that they converge significantly faster than deterministic APG methods, whose complexity is .
SVRG  and its proximal variant, Prox-SVRG , are particularly attractive because of their low storage requirement compared with other stochastic methods such as SAG, SAGA and SDCA, which require to store all the gradients of the component functions or dual variables. At the beginning of each epoch in SVRG, the full gradient is computed at the snapshot point , which is updated periodically. The update rule for the smooth optimization problem (1) is given by
When , the update rule in (5b) becomes the original one in , i.e., . It is not hard to verify that the variance of the SVRG estimator , i.e., , can be much smaller than that of the SGD estimator , i.e., . However, for non-strongly convex problems, the accelerated SGD methods mentioned above converge much slower than batch APG methods such as FISTA , namely, vs. .
More recently, many acceleration techniques were proposed to further speed up those variance-reduced stochastic methods mentioned above. These techniques mainly include the Nesterov’s acceleration technique used in [24, 37, 38, 39, 40], reducing the number of gradient calculations in the early iterations [34, 41, 42], the projection-free property of the conditional gradient method (also known as the Frank-Wolfe algorithm ) as in , the stochastic sufficient decrease technique , and the momentum acceleration trick in [3, 34, 46]. More specifically,  proposed an accelerating Catalyst framework and achieved the complexity of for strongly convex problems.  and  proved that their accelerated methods can attain the best known complexity of for non-strongly convex problems. The overall complexity matches the theoretical upper bound provided in . Katyusha  and point-SAGA  achieve the best-known complexity of for strongly convex problems, which is identical to the upper complexity bound in . That is, Katyusha is the best known stochastic optimization method for both strongly convex and non-strongly convex problems. Its proximal gradient update rules are formulated as follows:
where are two momentum parameters. To eliminate the need for parameter tuning, is set to , and is fixed to in . Unfortunately, most of the accelerated methods mentioned above, including Katyusha, require at least two auxiliary variables and two momentum parameters, which lead to complicated algorithm design and high per-iteration complexity .
1.3 Our Contributions
From the above discussion, one can see that most of accelerated stochastic variance reduction methods such as [3, 34, 37, 42, 44, 45] and applications such as [9, 10, 13, 15, 16, 49] are based on the stochastic variance reduced gradient (SVRG) method . Thus, any key improvement on SVRG is very important for the research of stochastic optimization. In this paper, we propose a simple variant of the original SVRG , which is referred to as the variance reduced stochastic gradient descent (VR-SGD). The snapshot point and starting point of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. Different from the settings of SVRG and Prox-SVRG  (i.e., the last iterate for the two points of the former, while the average of the previous epoch for those of the latter), the two points in VR-SGD are different, which makes our convergence analysis more challenging than SVRG and Prox-SVRG. Our empirical results show that the performance of VR-SGD is significantly better than its counterparts, SVRG and Prox-SVRG. Impressively, VR-SGD with a sufficiently large learning rate performs much better than the best known stochastic method, Katyusha . The main contributions of this paper are summarized below.
The snapshot point and starting point of VR-SGD are set to two different vectors. That is, for all epochs, except the first one, (denoted by Option I) or (denoted by Option II), and . In particular, we find that the setting of VR-SGD allows us take much larger learning rates or step sizes than SVRG, e.g., vs. , and thus significantly speeds up the convergence of SVRG and Prox-SVRG in practice. Moreover, VR-SGD has an advantage over SVRG in terms of robustness of learning rate selection.
Different from proximal stochastic gradient methods, e.g., Prox-SVRG and Katyusha, which have a unified update rule for the two cases of smooth and non-smooth objectives (see Section 2.2 for details), VR-SGD employs two different update rules for the two cases, respectively, as in (12) and (13) below. Empirical results show that gradient update rules as in (12) for smooth optimization problems are better choices than proximal update formulas as in (10).
Finally, we theoretically analyze the convergence properties of VR-SGD with Option I or Option II for strongly convex problems, which show that VR-SGD attains a linear convergence rate. We also give the convergence guarantees of VR-SGD with Option I or Option II for non-strongly convex objective functions.
2 Preliminary and Related Work
Throughout this paper, we use to denote the -norm (also known as the standard Euclidean norm), and is the -norm, i.e., . denotes the full gradient of if it is differentiable, or the subgradient if is only Lipschitz continuous. For each epoch and inner iteration , is the random chosen index. We mostly focus on the case of Problem (1) when each component function is -smooth111Actually, we can extend all the theoretical results in this paper for the case, when the gradients of all component functions have the same Lipschitz constant , to the more general case, when some have different degrees of smoothness., and is -strongly convex. The two common assumptions are defined as follows.
2.1 Basic Assumptions
Assumption 1 (Smoothness).
Each convex function is -smooth, that is, there exists a constant such that for all ,
Assumption 2 (Strong Convexity).
The convex function is -strongly convex, i.e., there exists a constant such that for all ,
2.2 Related Work
To speed up standard and proximal SGD methods, many variance reduced stochastic methods [28, 29, 30, 35] have been proposed for some special cases of Problem (1). In the case when each is -smooth, is -strongly convex, and , Roux et al.  proposed a stochastic average gradient (SAG) method, which attains a linear convergence rate. However, SAG needs to store all gradients as well as other incremental aggregated gradient methods such as SAGA , so that storage is required in general problems . Similarly, SDCA  requires storage of all dual variables , which scales as . In contrast, SVRG , as well as its proximal variant, Prox-SVRG , has the similar convergence rate to SAG and SDCA but without the memory requirements of all gradients and dual variables. In particular, the SVRG estimator in (5a) (independently introduced in [1, 35]) may be the most popular choice for stochastic gradient estimators. Besides, other stochastic gradient estimators include the SAGA estimator in  and the stochastic recursive gradient estimator in . Although the original SVRG in  only has convergence guarantees for a special case of Problem (1), when each is -smooth, is -strongly convex, and , one can extend SVRG to the proximal setting by introducing the proximal operator in (3), as shown in Line 7 of Algorithm 1. In other words, when is non-smooth, the update rule of SVRG becomes
Some researchers [8, 51, 52] have borrowed some variance reduction techniques into ADMM for minimizing convex composite objective functions subject to an equality constraint. [15, 16, 17] applied efficient stochastic solvers to compute leading eigenvectors of a symmetric matrix or generalized eigenvectors of two symmetric matrices. The first such method is VR-PCA by Shamir , and the convergence properties of the VR-PCA algorithm for such a non-convex problem are also provided. Garber et al.  analyzed the convergence rate of SVRG when is a convex function that is a sum of non-convex component functions. Moreover,  and  proved that SVRG and SAGA with minor modifications can converge asymptotically to a stationary point of non-convex finite-sum problems. Some distributed variants [54, 55] of accelerated SGD methods have also been proposed.
An important class of stochastic methods is the proximal stochastic gradient (Prox-SG) method, such as Prox-SVRG , SAGA , and Katyusha . Different from standard variance reduction SGD methods such as SVRG, which have a stochastic gradient update as in (5b), the Prox-SG method has a unified update rule for both smooth and non-smooth cases of . For instance, the update rule of Prox-SVRG  is formulated as follows:
3 Variance-Reduced Stochastic Gradient Descent
In this section, we propose an efficient variance reduced stochastic gradient descent (VR-SGD) method with iterate averaging. Different from the choices of the snapshot and starting points in SVRG  and Prox-SVRG , the two vectors of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. Unlike common stochastic gradient methods such as SVRG and proximal stochastic gradient methods such as Prox-SVRG, we design two different update rules for smooth and non-smooth objective functions, respectively.
3.1 Iterate Averaging
Like SVRG and Katyusha, VR-SGD is also divided into epochs, and each epoch consists of stochastic gradient steps, where is usually chosen to be as in [1, 2, 3]. Within each epoch, we need to compute the full gradient at the snapshot and use it to define the variance reduced stochastic gradient estimator as in . Unlike SVRG whose snapshot point is set to the last iterate of the previous epoch, the snapshot of each epoch in VR-SGD is set to the average of the previous epoch, e.g., in Option I of Algorithm 2, which leads to better robustness to gradient noise222 It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down the convergence speed in a sense. However, SGD and its variants are probably the most used optimization algorithms for deep learning
It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down the convergence speed in a sense. However, SGD and its variants are probably the most used optimization algorithms for deep learning. In particular,  has shown that by adding noise at each step, noisy gradient descent can escape the saddle points efficiently and converge to a local minimum of non-convex optimization problems, as well as the application of deep neural networks in ., as also suggested in [34, 45, 59]. In fact, the choice of Option II in Algorithm 2, i.e., , also works well in practice, as shown in Fig. 1. Therefore, we provide the convergence guarantees for both our algorithms (including Algorithm 2) with Option I and our algorithms with Option II in the next section. In particular, we find that the one of the effects of the choice in Option I or Option II of Algorithm 2 is to allow taking much larger learning rates or step sizes than SVRG in practice, e.g., for VR-SGD vs. for SVRG (see Fig. 2 and Section 5.3 for details). This is the main reason why VR-SGD converges significantly faster than SVRG. Actually, a larger learning rate enjoyed by VR-SGD means that the variance of its stochastic gradient estimator goes asymptotically to zero faster.
Unlike Prox-SVRG  whose starting point is initialized to the average of the previous epoch, the starting point of each epoch in VR-SGD is set to the last iterate of the previous epoch. That is, the last iterate of the previous epoch becomes the new starting point in VR-SGD, while those of Prox-SVRG are completely different, thereby leading to relatively slow convergence in general. It is clear that both the starting point and snapshot point of each epoch in the original SVRG  are set to the last iterate of the previous epoch333Note that the theoretical convergence of the original SVRG  relies on its Option II, i.e., both and are set to , where is randomly chosen from . However, the empirical results in  suggest that Option I is a better choice than its Option II, and the convergence guarantee of SVRG with Option I for strongly convex objective functions is provided in ., while the two points of Prox-SVRG  are set to the average of the previous epoch (also suggested in ). Different from the settings in SVRG and Prox-SVRG, the starting and snapshot points in VR-SGD are set to the two different vectors mentioned above, which makes the convergence analysis of VR-SGD more challenging than SVRG and Prox-SVRG, as shown in Section 4.
3.2 The Algorithm for Strongly Convex Objectives
In this part, we propose an efficient VR-SGD algorithm to solve strongly convex objective functions, as outlined in Algorithm 2. It is well known that the original SVRG  only works for the case of smooth and strongly convex objective functions. However, in many machine learning applications, e.g., elastic net regularized logistic regression, the strongly convex objective function is non-smooth. To solve this class of problems, the proximal variant of SVRG, Prox-SVRG , was subsequently proposed. Unlike SVRG and Prox-SVRG, VR-SGD can not only solve smooth objective functions, but directly tackle non-smooth ones. That is, when the regularizer is smooth, e.g., the -norm regularizer, the update rule of VR-SGD is
When is non-smooth, e.g., the -norm regularizer, the update rule of VR-SGD becomes
Different from the proximal stochastic gradient methods such as Prox-SVRG , all of which have a unified update rule as in (10) for both the smooth and non-smooth cases of , VR-SGD has two different update rules for the two cases, as stated in (12) and (13). This leads to the following advantage over the Prox-SG methods: the stochastic gradient update rule in (12) usually outperforms the proximal stochastic gradient update rule in (11), as well as the two classes of update rules for Katyusha  (see Section 5.4 for details).
Fig. 2 demonstrates that VR-SGD has a significant advantage over SVRG in terms of robustness of learning rate selection. That is, VR-SGD yields good performance within the range of the learning rate between and , whereas the performance of SVRG is very sensitive to the selection of learning rates. Thus, VR-SGD is convenient to apply in various real-world problems of large-scale machine learning. In fact, VR-SGD can use much larger learning rates than SVRG for logistic regression problems in practice, e.g., for VR-SGD vs. for SVRG, as shown in Fig. 2(a).
3.3 The Algorithm for Non-Strongly Convex Objectives
Although many variance reduced stochastic methods have been proposed, most of them, including SVRG and Prox-SVRG, only have convergence guarantees for the case of Problem (1) when the objective function is strongly convex. However, may be non-strongly convex in many machine learning applications such as Lasso and -norm regularized logistic regression. As suggested in [3, 61], this class of problems can be transformed into strongly convex ones by adding a proximal term , which can be efficiently solved by Algorithm 2. However, the reduction technique may degrade the performance of the involved algorithms both in theory and in practice . Thus, we present an efficient VR-SGD algorithm for directly solving the non-strongly convex problem (1), as outlined in Algorithm 3.
The main difference between Algorithm 2 and Algorithm 3 is the setting for the learning rate. Similar to Algorithm 2, the learning rate of Algorithm 3 can also be fixed to a constant. Inspired by existing accelerated stochastic algorithms [3, 34], the learning rate in Algorithm 3 can be gradually increased, which in principle leads to faster convergence (see Section 5.5 for details). Different from existing stochastic methods such as Katyusha , the update rule of the learning rate for Algorithm 3 is defined as follows: , where is an initial learning rate, and for any ,
where is a given constant, e.g., .
3.4 Complexity Analysis
From Algorithms 2 and 3, we can see that the per-iteration cost of VR-SGD is dominated by the computation of , , and or the proximal update in (13). Thus, the complexity is , as low as that of SVRG  and Prox-SVRG . In fact, for some ERM problems, we can save the intermediate gradients in the computation of , which generally requires additional storage. As a result, each epoch only requires component gradient evaluations. In addition, for extremely sparse data, we can introduce the lazy update tricks in [36, 62, 63] to our algorithms, and perform the update steps in (12) and (13) only for the non-zero dimensions of each example, rather than all dimensions. In other words, the per-iteration complexity of VR-SGD can be improved from to , where is the sparsity of feature vectors. Moreover, VR-SGD has a much lower per-iteration complexity than existing accelerated stochastic variance reduction methods such as Katyusha , which have at least two more update rules for additional variables, as shown in (6a)-(6c).
3.5 Extensions of VR-SGD
It has been shown in [36, 37] that mini-batching can effectively decrease the variance of stochastic gradient estimates. In this part, we first extend the proposed VR-SGD method to the mini-batch setting, as well as its convergence results below. Here, we denote by the mini-batch size and the selected random index set for each outer-iteration and inner-iteration . The variance reduced stochastic gradient estimator in (5a) becomes
where is a mini-batch of size . If some component functions are non-smooth, we can use the proximal operator oracle  or the Nesterov’s smoothing  and homotopy smoothing  techniques to smoothen them, and thereby obtain the smoothed approximations of the functions . In addition, we can directly extend our algorithms to the non-smooth setting as in , e.g., Algorithm 3 in .
Considering that each component function maybe have different degrees of smoothness, picking the random index
from a non-uniform distribution is a much better choice than the commonly used uniform random sampling[66, 67], as well as without-replacement sampling vs. with-replacement sampling . This can be done using the same techniques in [2, 3], i.e., the sampling probabilities for all are proportional to their Lipschitz constants, i.e., . Moreover, our VR-SGD method can also be combined with other accelerated techniques proposed for SVRG. For instance, the epoch length of VR-SGD can be automatically determined by the techniques in [42, 68] instead of a fixed epoch length. We can reduce the number of gradient calculations in the early iterations as in [34, 41, 42], which leads to faster convergence in general. Moreover, we can also introduce the Nesterov’s acceleration technique as in [24, 37, 38, 39, 40] and momentum acceleration trick as in [3, 34] to further improve the performance of VR-SGD.
4 Convergence Analysis
In this section, we provide the convergence guarantees of VR-SGD for solving both smooth and non-smooth general convex problems. We also extend these results to the mini-batch setting. Moreover, we analyze the convergence properties of VR-SGD for solving both smooth and non-smooth strongly convex objective functions. We first introduce the following lemma which is useful in our analysis.
Lemma 1 (3-point property, ).
Let be the optimal solution of the following problem,
where is a convex function (but possibly non-differentiable), and . Then for any , the following inequality holds
4.1 Convergence Properties for Non-strongly Convex Problems
In this part, we analyze the convergence properties of VR-SGD for solving the more general non-strongly convex problems. Considering that the two proposed algorithms (i.e., Algorithms 2 and 3) have two different update rules for the smooth and non-smooth problems, we give the convergence guarantees of VR-SGD for the two cases as follows.
4.1.1 Smooth Objectives
We first give the convergence guarantee of Problem (1) when the objective function is smooth and non-strongly convex. In order to simplify analysis, we denote by , that is, for all .
Lemma 2 (Variance bound of smooth objectives).
The proof of this lemma is included in APPENDIX A. Lemma 2 provides the upper bound on the expected variance of the variance reduced gradient estimator in (5a), i.e., the SVRG estimator independently introduced in [1, 35]. For Algorithm 3 with Option I and a fixed learning rate, we give the following key result for our analysis.
Lemma 3 (Option I and smooth objectives).
Let . If each is convex and -smooth, then the following inequality holds for all ,
In order to simplify notation, the stochastic gradient estimator is defined as: . Since each component function is -smooth, which implies that the gradient of the average function is also -smooth, i.e., for all ,
whose equivalent form is
Applying the above smoothness inequality, we have
where is a constant. Using Lemma 2, then we get
where the first inequality holds due to the Young’s inequality (i.e., for all and ), and the second inequality follows from Lemma 2.
Here, the first inequality holds due to the inequality in (15) and the inequality in (16); the second inequality follows from Lemma 1 with , , , , and ; the third inequality holds due to the fact that ; and the last inequality follows from the convexity of the smooth function , i.e.,