1 Introduction
In this paper, we focus on the following composite optimization problem:
(1)  $\min_{x\in\mathbb{R}^d} \; F(x) := f(x) + g(x), \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$
where $f_i(\cdot)$, $i=1,\ldots,n$, are smooth functions, and $g(\cdot)$ is a relatively simple (but possibly nondifferentiable) convex function (referred to as a regularizer). The formulation (1) arises in many places in machine learning, signal processing, data science, statistics and operations research, such as regularized empirical risk minimization (ERM). For instance, one popular choice of the component function in binary classification problems is the logistic loss, i.e., $f_i(x)=\log(1+\exp(-b_i a_i^{T} x))$, where $\{(a_1,b_1),\ldots,(a_n,b_n)\}$ is a collection of training examples, and $b_i\in\{\pm 1\}$. Some popular choices for the regularizer include the $\ell_1$-norm regularizer (i.e., $g(x)=\lambda_1\|x\|_1$), the $\ell_2$-norm regularizer (i.e., $g(x)=\frac{\lambda_2}{2}\|x\|^2$), and the elastic-net regularizer (i.e., $g(x)=\lambda_1\|x\|_1+\frac{\lambda_2}{2}\|x\|^2$). Some other applications include deep neural networks [1, 2, 3, 4, 5], group Lasso [6], sparse learning and coding [7, 8, 9, 10], nonnegative matrix factorization [11], phase retrieval [12], matrix completion [13, 14], conditional random fields [15], generalized eigendecomposition and canonical correlation analysis [16], and eigenvector computation [17, 18] such as principal component analysis (PCA) and singular value decomposition (SVD).
1.1 Stochastic Gradient Descent
We are especially interested in developing efficient algorithms to solve Problem (1) involving the sum of a large number of component functions. The standard and effective method for solving (1) is the (proximal) gradient descent (GD) method, including Nesterov's accelerated gradient descent (AGD) [19, 20] and the accelerated proximal gradient (APG) method [21, 22]. For the smooth case of Problem (1), GD takes the following update rule: starting with an initial point $x_0$, for any $k\geq 1$,
(2)  $x_k = x_{k-1} - \eta_k \nabla F(x_{k-1}),$
where $\eta_k>0$ is commonly referred to as the learning rate in machine learning or the step size in optimization. When $g(\cdot)$ is nonsmooth (e.g., the $\ell_1$-norm regularizer), we typically introduce the following proximal operator to replace (2),
(3)  $x_k = \mathrm{prox}_{\eta_k g}\!\big(x_{k-1} - \eta_k \nabla f(x_{k-1})\big), \quad \text{where } \mathrm{prox}_{\eta g}(y) := \arg\min_{x}\Big\{\tfrac{1}{2\eta}\|x-y\|^2 + g(x)\Big\}.$
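As a concrete illustration of the proximal gradient step (3), the following minimal NumPy sketch applies one forward gradient step on the smooth part and one backward (proximal) step on the regularizer; the function names and the choice $g=\lambda\|\cdot\|_1$ are our own illustration, not code from any referenced library.

```python
import numpy as np

def soft_threshold(y, t):
    # Closed-form proximal operator of t * ||.||_1, applied entrywise.
    return np.sign(y) * np.maximum(np.abs(y) - t, 0.0)

def proximal_gd_step(x, grad_f, prox_g, eta):
    # One proximal gradient step (3): a forward gradient step on the
    # smooth part f, followed by a backward (proximal) step on g.
    return prox_g(x - eta * grad_f(x), eta)
```

For instance, with the quadratic $f(x)=\frac{1}{2}\|x\|^2$ and $g=\|\cdot\|_1$, a single step shrinks small coordinates exactly to zero, which is why the proximal step is preferred over subgradient steps for sparse regularizers.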
GD has been proven to achieve linear convergence for strongly convex problems, and both AGD and APG attain the optimal $O(1/T^2)$ convergence rate for non-strongly convex problems, where $T$ denotes the number of iterations. However, the per-iteration cost of all such batch (or deterministic) methods is $O(nd)$, which is expensive for very large $n$.
Instead of evaluating the full gradient of $f(\cdot)$ at each iteration, an efficient alternative is the stochastic (or incremental) gradient descent (SGD) method [23]. SGD only evaluates the gradient of a single component function at each iteration, and thus has a much lower per-iteration cost of $O(d)$. SGD has therefore been successfully applied to many large-scale learning problems [24, 25, 26], especially the training of deep learning models [2, 3, 27], and its update rule is
(4)  $x_k = x_{k-1} - \eta_k \nabla f_{i_k}(x_{k-1}),$
where $\eta_k>0$, and the index $i_k$ can be chosen uniformly at random from $\{1,2,\ldots,n\}$. Although the stochastic gradient estimator $\nabla f_{i_k}(x_{k-1})$ is an unbiased estimate of the full gradient, i.e., $\mathbb{E}[\nabla f_{i_k}(x_{k-1})]=\nabla f(x_{k-1})$, its variance may be large due to the randomness of sampling [1]. Thus, stochastic gradient estimators are also called “noisy gradients”, and the step size must be gradually reduced, which leads to slow convergence. In particular, even under the strongly convex (SC) condition, standard SGD attains only a sublinear convergence rate of $O(1/T)$ [28].
1.2 Accelerated SGD
Recently, many SGD methods with variance reduction have been proposed, such as stochastic average gradient (SAG) [29], stochastic variance reduced gradient (SVRG) [1], stochastic dual coordinate ascent (SDCA) [30], SAGA [31], stochastic primal-dual coordinate (SPDC) [32], and their proximal variants, such as ProxSAG [33], ProxSVRG [34] and ProxSDCA [35]. These accelerated SGD methods can use a constant learning rate instead of the diminishing step sizes of SGD, and fall into three categories: primal methods such as SVRG and SAGA, dual methods such as SDCA, and primal-dual methods such as SPDC. In essence, many of the primal methods use the full gradient at the snapshot or the average gradient to progressively reduce the variance of the stochastic gradient estimators, as do the dual and primal-dual methods, which has led to a revolution in the area of first-order optimization [36]. Thus, they are also known as hybrid gradient descent methods [37] or semi-stochastic gradient descent methods [38]. In particular, under the strongly convex condition, most of the accelerated SGD methods enjoy a linear convergence rate (also known as a geometric or exponential rate) and an oracle complexity of $O((n+L/\mu)\log(1/\epsilon))$ to obtain an $\epsilon$-suboptimal solution, where each $f_i(\cdot)$ is $L$-smooth and $F(\cdot)$ is $\mu$-strongly convex. This complexity bound shows that they converge faster than accelerated deterministic methods, whose oracle complexity is $O(n\sqrt{L/\mu}\log(1/\epsilon))$ [39, 40].
SVRG [1] and its proximal variant, ProxSVRG [34], are particularly attractive because of their low storage requirement compared with other methods such as SAG, SAGA and SDCA, which require storage of all the gradients of the component functions or of the dual variables. At the beginning of the $s$-th epoch in SVRG, the full gradient is computed at the snapshot point $\widetilde{x}$, which is updated periodically.
Definition 1 (SVRG estimator).
The stochastic variance reduced gradient estimator is defined as
(5)  $\widetilde{\nabla} f_{i_k}(x_{k-1}) := \nabla f_{i_k}(x_{k-1}) - \nabla f_{i_k}(\widetilde{x}) + \nabla f(\widetilde{x}),$
where $\widetilde{x}$ is the snapshot point.
It is not hard to verify that the variance of the SVRG estimator (i.e., $\mathbb{E}\|\widetilde{\nabla} f_{i_k}(x_{k-1})-\nabla f(x_{k-1})\|^2$) can be much smaller than that of the SGD estimator (i.e., $\mathbb{E}\|\nabla f_{i_k}(x_{k-1})-\nabla f(x_{k-1})\|^2$). Theoretically, however, for non-strongly convex (non-SC) problems, the variance reduced methods converge more slowly than accelerated batch methods such as FISTA [22], i.e., $O(1/T)$ vs. $O(1/T^2)$.
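To make the estimator in (5) concrete, here is a minimal NumPy sketch for a least-squares loss (the loss and variable names are our own stand-in, not the paper's code). Averaged over all indices, the estimator recovers the full gradient, confirming unbiasedness; when the snapshot equals the current iterate, the estimator is exactly the full gradient, i.e., its variance vanishes, unlike the plain SGD estimator.

```python
import numpy as np

def svrg_estimator(A, b, x, x_snap, mu, i):
    # SVRG estimator (5) for f_i(x) = 0.5 * (a_i^T x - b_i)^2:
    #   v_i = grad f_i(x) - grad f_i(x_snap) + grad f(x_snap),
    # where mu = grad f(x_snap) is the full gradient at the snapshot.
    g = A[i] * (A[i] @ x - b[i])
    g_snap = A[i] * (A[i] @ x_snap - b[i])
    return g - g_snap + mu
```

The correction term costs one extra component gradient per step plus one full gradient per epoch, which is the price paid for the reduced variance.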
More recently, many acceleration techniques have been proposed to further speed up the stochastic variance reduced methods mentioned above. These techniques mainly include Nesterov's acceleration techniques [25, 39, 40, 41, 42], reducing the number of gradient calculations in early iterations [36, 43, 44], the projection-free property of the conditional gradient method (also known as the Frank-Wolfe algorithm [45]) as in [46], the stochastic sufficient decrease technique [47], and momentum acceleration tricks [36, 48, 49]. [40] proposed the accelerating Catalyst framework, which achieves an oracle complexity of $O((n+\sqrt{nL/\mu})\log(L/\mu)\log(1/\epsilon))$ for strongly convex problems. [48] and [50] proved that accelerated methods can attain an oracle complexity of $O(n\log(1/\epsilon)+\sqrt{nL/\epsilon})$ for non-strongly convex problems, which matches the theoretical upper bound provided in [51]. Katyusha [48], pointSAGA [52] and MiG [50] achieve the best-known oracle complexity of $O((n+\sqrt{nL/\mu})\log(1/\epsilon))$ for strongly convex problems, which is identical to the upper complexity bound in [51]. Hence, Katyusha and MiG are the best-known stochastic optimization methods for both SC and non-SC problems, as pointed out in [51]. However, selecting the best values for the parameters of the accelerated methods (e.g., the momentum parameter) is still an open problem. In particular, most accelerated stochastic variance reduction methods, including Katyusha, require at least one auxiliary variable and one momentum parameter, which leads to complicated algorithm design and high per-iteration complexity, especially for very high-dimensional and sparse data.
1.3 Our Contributions
From the above discussion, we can see that most accelerated stochastic variance reduction methods, such as [36, 39, 44, 46, 47, 48, 53, 54], and applications, such as [7, 9, 10, 14, 17, 18, 55, 56, 57], are based on the SVRG method [1]. Thus, any key improvement on SVRG is very important for research on stochastic optimization. In this paper, we propose a simple variant of the original SVRG [1], called variance reduced stochastic gradient descent (VRSGD). The snapshot point and starting point of each epoch in VRSGD are set to the average and the last iterate of the previous epoch, respectively. This differs from the settings of SVRG and ProxSVRG [34], where both points of the former are set to the last iterate, and both points of the latter are set to the average of the previous epoch. This difference makes the convergence analysis of VRSGD significantly more challenging than that of SVRG and ProxSVRG. Our empirical results show that the performance of VRSGD is significantly better than that of its counterparts, SVRG and ProxSVRG. Impressively, VRSGD with varying learning rates achieves performance better than or at least comparable to that of accelerated methods, such as Catalyst [40] and Katyusha [48]. The main contributions of this paper are summarized below.

The snapshot and starting points of VRSGD are set to two different vectors: the snapshot is set to an average of the iterates of the previous epoch (computed in slightly different ways in Option I and Option II), and the starting point is set to the last iterate of the previous epoch. In particular, we find that these settings allow VRSGD to take much larger learning rates than SVRG, and thus significantly speed up its convergence in practice. Moreover, VRSGD has an advantage over SVRG in terms of the robustness of learning rate selection.

Unlike proximal stochastic gradient methods, e.g., ProxSVRG and Katyusha, which use a unified update rule for the two cases of smooth and nonsmooth objectives (see Section 2.2 for details), VRSGD employs two different update rules for the two cases, as in (12) and (13) below. Empirical results show that gradient update rules as in (12) are better choices for smooth optimization problems than proximal update formulas as in (10).

We provide convergence guarantees for VRSGD for solving smooth/nonsmooth and non-strongly convex (or general convex) functions. In comparison, SVRG and ProxSVRG do not have any such convergence guarantees, as shown in Table III.

Moreover, we also present a momentum accelerated variant of VRSGD, discuss the equivalence relationship between them, and empirically verify that it achieves performance similar to that of the variant attaining the optimal convergence rate $O(1/T^2)$.

Finally, we theoretically analyze the convergence properties of VRSGD with Option I or Option II for smooth/nonsmooth and strongly convex functions, showing that VRSGD attains linear convergence.
2 Preliminary and Related Work
Throughout this paper, we use $\|\cdot\|$ to denote the $\ell_2$-norm (also known as the standard Euclidean norm), and $\|\cdot\|_1$ to denote the $\ell_1$-norm, i.e., $\|x\|_1=\sum_{i=1}^{d}|x_i|$. $\nabla f(\cdot)$ denotes the full gradient of $f(\cdot)$ if it is differentiable, or a subgradient if $f(\cdot)$ is only Lipschitz continuous. For each epoch $s$ and inner iteration $k$, $i^s_k$ denotes the randomly chosen index. We mostly focus on the case of Problem (1) in which each $f_i(\cdot)$ is $L$-smooth (in fact, the theoretical results for the case when the gradients of all component functions have the same Lipschitz constant $L$ can be extended to the more general case in which some component functions have different degrees of smoothness), and $F(\cdot)$ is $\mu$-strongly convex. The two common assumptions are defined as follows.
2.1 Basic Assumptions
Assumption 1 (Smoothness).
Each $f_i(\cdot)$ is $L$-smooth, that is, there exists a constant $L>0$ such that for all $x,y\in\mathbb{R}^d$,
(6)  $\|\nabla f_i(x) - \nabla f_i(y)\| \leq L\|x - y\|.$
Assumption 2 (Strong Convexity).
$F(\cdot)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu>0$ such that for all $x,y\in\mathbb{R}^d$,
(7)  $F(y) \geq F(x) + \langle \xi,\, y - x\rangle + \frac{\mu}{2}\|y - x\|^2,$
where $\xi\in\partial F(x)$ is a subgradient of $F(\cdot)$ at $x$ (when $F(\cdot)$ is smooth, $\xi$ is simply $\nabla F(x)$).
2.2 Related Work
To speed up standard and proximal SGD, many stochastic variance reduced methods [29, 30, 31, 37] have been proposed for some special cases of Problem (1). In the case when each $f_i(\cdot)$ is smooth, $F(\cdot)$ is strongly convex, and $g(\cdot)\equiv 0$, Roux et al. [29] proposed the stochastic average gradient (SAG) method, which attains linear convergence. However, SAG, as well as other incremental aggregated gradient methods such as SAGA [31], needs to store all gradients, so that $O(nd)$ memory is required in general [43]. Similarly, SDCA [30] requires storage of all dual variables [1], which uses $O(n)$ memory. In contrast, SVRG, proposed by Johnson and Zhang [1], as well as ProxSVRG [34], enjoys a convergence rate similar to that of SAG and SDCA, but without the memory requirement for all gradients or dual variables. In particular, the SVRG estimator in (5) may be the most popular choice of stochastic gradient estimator. The update rule of SVRG for the case of Problem (1) with $g(\cdot)\equiv 0$ is
(8)  $x_k = x_{k-1} - \eta\big[\nabla f_{i_k}(x_{k-1}) - \nabla f_{i_k}(\widetilde{x}) + \nabla f(\widetilde{x})\big].$
When the regularizer $g(\cdot)$ is smooth and nonzero, the update rule in (8) becomes $x_k = x_{k-1} - \eta\big[\nabla f_{i_k}(x_{k-1}) - \nabla f_{i_k}(\widetilde{x}) + \nabla f(\widetilde{x}) + \nabla g(x_{k-1})\big]$. Although the original SVRG in [1] only has convergence guarantees for the special case of Problem (1) in which each $f_i(\cdot)$ is smooth, $F(\cdot)$ is strongly convex, and $g(\cdot)\equiv 0$, we can extend SVRG to the proximal setting by introducing the proximal operator in (3), as shown in Line 7 of Algorithm 1.
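One full SVRG epoch built around the update (8) can be sketched as follows; the least-squares objective is our own stand-in ($g\equiv 0$), chosen only so the sketch is self-contained and runnable.

```python
import numpy as np

def svrg_epoch(x_snap, A, b, eta, m, rng):
    # One SVRG epoch for f(x) = (1/2n) * ||A x - b||^2 with g = 0.
    n = A.shape[0]
    mu = A.T @ (A @ x_snap - b) / n      # full gradient at the snapshot
    x = x_snap.copy()                    # SVRG starts each epoch at the snapshot
    for _ in range(m):
        i = rng.integers(n)
        v = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_snap - b[i]) + mu
        x = x - eta * v                  # update rule (8)
    return x                             # last iterate, reused as the next snapshot
```

Each epoch costs one full-gradient pass ($n$ component gradients) plus $2m$ component gradient evaluations, which is the accounting behind the oracle complexities quoted above.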
Based on the SVRG estimator in (5), some accelerated algorithms [39, 40, 48] have been proposed. The proximal update rules of Katyusha [48] are formulated as follows:
(9a)  
(9b)  
(9c) 
where $\tau_1$ and $\tau_2$ are two momentum parameters. To eliminate the need for parameter tuning, $\tau_2$ is fixed to $1/2$, and $\tau_1$ is set as a function of the problem parameters in [48]. In addition, [16, 17, 18] applied efficient stochastic solvers to compute the leading eigenvectors of a symmetric matrix or the generalized eigenvectors of two symmetric matrices. The first such method is VRPCA, proposed by Shamir [17], and the convergence properties of VRPCA for this nonconvex problem are also provided. Garber et al. [18] analyzed the convergence rate of SVRG when $f(\cdot)$ is a convex function that is a sum of nonconvex component functions. Moreover, [4, 5] and [58] proved that SVRG and SAGA with minor modifications can converge asymptotically to a stationary point of nonconvex functions. Some sparse approximation, parallel and distributed variants [50, 59, 60, 61, 62] of accelerated SGD methods have also been proposed.
An important class of stochastic methods is the proximal stochastic gradient (ProxSG) methods, such as ProxSVRG [34], SAGA [31], and Katyusha [48]. Different from standard variance reduced SGD methods such as SVRG, a ProxSG method uses a unified update rule for both the smooth and nonsmooth cases of $g(\cdot)$. For instance, the update rule of ProxSVRG [34] is formulated as follows:
(10)  $x_k = \mathrm{prox}_{\eta g}\big(x_{k-1} - \eta\,\widetilde{\nabla} f_{i_k}(x_{k-1})\big).$
For the sake of completeness, the details of ProxSVRG [34] are shown in Algorithm 1 with Option II. When $g(\cdot)$ is the widely used $\ell_1$-norm regularizer, i.e., $g(x)=\lambda\|x\|_1$, the proximal update formula in (10) becomes
(11)  $x_k = \mathrm{sign}(y_k)\cdot\max\{|y_k| - \eta\lambda,\ 0\}, \quad \text{where } y_k = x_{k-1} - \eta\,\widetilde{\nabla} f_{i_k}(x_{k-1}),$
with the sign, absolute value and maximum taken entrywise.
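A single ProxSVRG inner step combining the variance reduced gradient with the soft-thresholding form of (11) can be sketched as follows (the least-squares component loss is our own stand-in, and the function name is hypothetical):

```python
import numpy as np

def prox_svrg_step(A, b, x, x_snap, mu, i, eta, lam):
    # One inner step of ProxSVRG with g(x) = lam * ||x||_1:
    # move along the variance reduced gradient, then apply the
    # entrywise soft-thresholding operator (11).
    v = A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_snap - b[i]) + mu
    y = x - eta * v
    return np.sign(y) * np.maximum(np.abs(y) - eta * lam, 0.0)
```

Note that the threshold is $\eta\lambda$, so a larger learning rate also produces stronger shrinkage per step.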
3 Variance Reduced SGD
In this section, we propose an efficient variance reduced stochastic gradient descent (VRSGD) algorithm, as shown in Algorithm 2. Different from the choices of the snapshot and starting points in SVRG [1] and ProxSVRG [34], the two vectors of each epoch in VRSGD are set to the average and last iterate of the previous epoch, respectively. Moreover, unlike existing proximal stochastic methods, we design two different update rules for smooth and nonsmooth objective functions, respectively.
3.1 Snapshot and Starting Points
Like SVRG, VRSGD is also divided into epochs, and each epoch consists of $m$ stochastic gradient steps, where the epoch length $m$ is usually chosen to be $m=2n$, as suggested in [1, 34, 48]. Within each epoch, we need to compute the full gradient at the snapshot point and use it to define the variance reduced stochastic gradient estimator in (5). Unlike SVRG, whose snapshot is set to the last iterate of the previous epoch, the snapshot of VRSGD is set to the average of the previous epoch, e.g., the uniform average of the iterates in Option I of Algorithm 2, which leads to better robustness to gradient noise, as also suggested in [36, 47, 66]. (It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down convergence. Nevertheless, SGD and its variants are probably the most widely used optimization algorithms for deep learning [63]. In particular, [64] has shown that by adding gradient noise at each step, noisy gradient descent can escape saddle points efficiently and converge to a local minimum of a nonconvex minimization problem, e.g., in the application to deep neural networks in [65].) In fact, the choice of Option II in Algorithm 2 also works well in practice, as shown in Fig. 2 in the Supplementary Material. Therefore, we provide convergence guarantees for our algorithm with either Option I or Option II in the next section. In particular, we find that one effect of the choice in Option I or Option II of Algorithm 2 is to allow much larger learning rates or step sizes than SVRG in practice (see Fig. 11). Indeed, the larger learning rate enjoyed by VRSGD means that the variance of its stochastic gradient estimator goes asymptotically to zero faster.
Unlike ProxSVRG [34], whose starting point is initialized to the average of the previous epoch, the starting point of VRSGD is set to the last iterate of the previous epoch. That is, in VRSGD the last iterate of the previous epoch becomes the new starting point, whereas the starting point of ProxSVRG is completely different from the last iterate, thereby leading to relatively slow convergence in general. Both the starting and snapshot points of SVRG [1] are set to the last iterate of the previous epoch (note that the theoretical convergence of the original SVRG relies on its Option II, in which both points are set to an iterate chosen uniformly at random from the previous epoch; however, the empirical results in [1] suggest that its Option I is a better choice, and the convergence guarantee of SVRG with Option I for strongly convex objective functions is provided in [67]), while the two points of ProxSVRG [34] are set to the average of the previous epoch (also suggested in [1]).
By setting the starting and snapshot points in VRSGD to the two different vectors mentioned above, the convergence analysis of VRSGD becomes significantly more challenging than that of SVRG and ProxSVRG, as shown in Section 4.
3.2 The VRSGD Algorithm
In this part, we propose an efficient VRSGD algorithm to solve Problem (1), as outlined in Algorithm 2 for the case of smooth objective functions. It is well known that the original SVRG [1] only works for smooth minimization problems. However, in many machine learning applications, e.g., elastic-net regularized logistic regression, the strongly convex objective function is nonsmooth. To solve this class of problems, the proximal variant of SVRG, ProxSVRG [34], was subsequently proposed. Unlike the original SVRG, VRSGD can not only solve smooth objective functions, but also directly tackle nonsmooth ones. That is, when the regularizer $g(\cdot)$ is smooth (e.g., the $\ell_2$-norm regularizer), the key update rule of VRSGD is
(12)  $x_k = x_{k-1} - \eta\big[\nabla f_{i_k}(x_{k-1}) - \nabla f_{i_k}(\widetilde{x}) + \nabla f(\widetilde{x}) + \nabla g(x_{k-1})\big].$
When $g(\cdot)$ is nonsmooth (e.g., the $\ell_1$-norm regularizer), the key update rule of VRSGD in Algorithm 2 becomes
(13)  $x_k = \mathrm{prox}_{\eta g}\big(x_{k-1} - \eta\,\widetilde{\nabla} f_{i_k}(x_{k-1})\big).$
Unlike proximal stochastic methods such as ProxSVRG [34], which use a unified update rule as in (10) for both the smooth and nonsmooth cases of $g(\cdot)$, VRSGD has two different update rules for the two cases, as in (12) and (13). Fig. 11 demonstrates that VRSGD has a significant advantage over SVRG in terms of the robustness of learning rate selection: VRSGD yields good performance over a wide range of learning rates, whereas the performance of SVRG is very sensitive to the choice of learning rate. This makes VRSGD convenient to apply to various real-world machine learning problems. In fact, VRSGD can use much larger learning rates than SVRG for ridge regression problems in practice, as shown in Fig. 1(b).
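Putting the pieces together, the following sketch runs one epoch of the smooth-case update (12) with Option I bookkeeping (snapshot = average of the epoch, starting point = last iterate). The ridge-regression loss is a stand-in of our own; this is an illustration of the scheme, not the authors' reference implementation.

```python
import numpy as np

def vrsgd_epoch(x_start, x_snap, A, b, lam, eta, m, rng):
    # One VRSGD epoch (smooth case) for f_i(x) = 0.5*(a_i^T x - b_i)^2
    # with smooth regularizer g(x) = 0.5 * lam * ||x||^2.
    n = A.shape[0]
    mu = A.T @ (A @ x_snap - b) / n + lam * x_snap   # full gradient at the snapshot
    x = x_start.copy()                               # start from the last iterate
    running_sum = np.zeros_like(x)
    for _ in range(m):
        i = rng.integers(n)
        gi = A[i] * (A[i] @ x - b[i]) + lam * x
        gi_snap = A[i] * (A[i] @ x_snap - b[i]) + lam * x_snap
        x = x - eta * (gi - gi_snap + mu)            # gradient update (12)
        running_sum += x
    # Option I: next snapshot is the epoch average; next start is the last iterate.
    return x, running_sum / m
```

The only difference from the SVRG epoch is the pair of vectors handed to the next epoch, which is exactly the design choice discussed above.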


3.3 VRSGD for Non-Strongly Convex Objectives
Although many stochastic variance reduced methods have been proposed, most of them, including SVRG and ProxSVRG, only have convergence guarantees for the case of Problem (1) in which $F(\cdot)$ is strongly convex. However, $F(\cdot)$ may be non-strongly convex in many machine learning applications, such as Lasso and $\ell_1$-norm regularized logistic regression. As suggested in [48, 68], this class of problems can be transformed into strongly convex ones by adding a proximal term $\frac{\sigma}{2}\|x-x_0\|^2$, and the resulting problems can then be efficiently solved by Algorithm 2. However, this reduction technique may degrade the performance of the involved algorithms both in theory and in practice [44]. Thus, we use VRSGD to directly solve non-strongly convex problems.
The learning rate of Algorithm 2 can be fixed to a constant. Inspired by existing accelerated stochastic algorithms [36, 48], the learning rate in Algorithm 2 can also be gradually increased in early iterations, for both strongly convex and non-strongly convex problems, which leads to faster convergence (see Fig. 3 in the Supplementary Material). Different from SGD, whose learning rate must be gradually decayed, and Katyusha [48], whose learning rate needs to be gradually increased, the update rule of the learning rate $\eta_s$ in Algorithm 2 is defined as follows: $\eta_0$ is an initial learning rate, and for any epoch $s\geq 1$,
(14)  $\eta_s = \eta_0 \big/ \max\{\alpha,\ 2/(s+1)\},$
where $\alpha\in(0,1]$ is a given constant, e.g., $\alpha=2/3$.
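Such a schedule can be sketched as below: the rate grows over the first few epochs and then saturates at a constant multiple of the initial rate. The functional form and constants here are our illustrative choices in the spirit of (14), not necessarily the paper's exact rule.

```python
def vrsgd_learning_rate(eta0, s, alpha=2.0 / 3):
    # Learning rate for epoch s = 1, 2, ...: increases in early epochs
    # and saturates at eta0 / alpha once 2 / (s + 1) <= alpha.
    # (alpha is an assumed constant for illustration.)
    return eta0 / max(alpha, 2.0 / (s + 1))
```

In contrast to a decaying SGD schedule, the early small values here act as a warm-up while the snapshot is still far from the optimum.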
3.4 Extensions of VRSGD
It has been shown in [38, 39] that mini-batching can effectively decrease the variance of stochastic gradient estimates. Therefore, we first extend the proposed VRSGD method to the mini-batch setting, together with its convergence results below. Here, we denote by $b$ the mini-batch size and by $I^s_k$ the selected random index set for each outer iteration $s$ and inner iteration $k$.
Definition 2.
The stochastic variance reduced gradient estimator in the mini-batch setting is defined as
(15)  $\widetilde{\nabla} f_{I^s_k}(x_{k-1}) := \frac{1}{b}\sum_{i\in I^s_k}\big[\nabla f_i(x_{k-1}) - \nabla f_i(\widetilde{x})\big] + \nabla f(\widetilde{x}),$
where $I^s_k\subseteq\{1,2,\ldots,n\}$ is a mini-batch of size $b$.
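In code, the mini-batch estimator (15) simply averages the per-sample corrections before adding back the full snapshot gradient; least-squares gradients again serve as our own stand-in.

```python
import numpy as np

def minibatch_estimator(A, b, x, x_snap, mu, batch):
    # Mini-batch variance reduced estimator (15): average the corrections
    # grad f_i(x) - grad f_i(x_snap) over the batch, then add
    # mu = grad f(x_snap).
    diffs = [A[i] * (A[i] @ x - b[i]) - A[i] * (A[i] @ x_snap - b[i])
             for i in batch]
    return np.mean(diffs, axis=0) + mu
```

When the batch is the whole index set, the estimator reduces to the exact full gradient at $x$, which is a quick sanity check on any implementation.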
If some component functions are nonsmooth, we can use the proximal operator oracle [68], or Nesterov's smoothing [69] and homotopy smoothing [70] techniques, to smooth them and thereby obtain their smoothed approximations. In addition, we can directly extend our VRSGD method to the nonsmooth setting as in [36] (e.g., Algorithm 3 in [36]) without using any smoothing techniques.
Considering that each component function may have a different degree of smoothness, picking the random index $i^s_k$ from a non-uniform distribution can be a much better choice than the commonly used uniform random sampling [71, 72], just as without-replacement sampling can outperform with-replacement sampling [73]. This can be done using the same techniques as in [34, 48], i.e., the sampling probabilities for all $f_i(\cdot)$ are proportional to their Lipschitz constants: $p_i = L_i/\sum_{j=1}^{n} L_j$. VRSGD can also be combined with other acceleration techniques used for SVRG. For instance, the epoch length of VRSGD can be determined automatically by the techniques in [44, 74], instead of being fixed. We can reduce the number of gradient calculations in early iterations as in [43, 44], which leads to faster convergence in general (see the section on the impact of increasing learning rates for details). Moreover, we can introduce Nesterov's acceleration techniques [25, 39, 40, 41, 42] and momentum acceleration tricks [36, 48, 75] to further improve the performance of VRSGD.
4 Algorithm Analysis
In this section, we provide the convergence guarantees of VRSGD for solving both smooth and nonsmooth general convex problems, and extend the results to the minibatch setting. We also study the convergence properties of VRSGD for solving both smooth and nonsmooth strongly convex objective functions. Moreover, we discuss the equivalent relationship between VRSGD and its momentum accelerated variant, as well as some of its extensions.
4.1 Convergence Properties: Non-Strongly Convex
In this part, we analyze the convergence properties of VRSGD for solving the more general non-strongly convex problems. Considering that the proposed algorithm (i.e., Algorithm 2) has two different update rules for the smooth and nonsmooth cases, we give the convergence guarantees of VRSGD for the two cases as follows.
4.1.1 Smooth Objective Functions
We first provide the convergence guarantee of our algorithm for solving Problem (1) when $F(\cdot)$ is smooth. To simplify the analysis, we assume that $g(\cdot)\equiv 0$, that is, $g(x)=0$ for all $x$, and then $F(x)=f(x)$.
Lemma 1 (Variance bound).
The proofs of this lemma and of the lemmas and theorems below are all included in the Supplementary Material. Lemma 1 provides an upper bound on the expected variance of the variance reduced gradient estimator in (5), i.e., the SVRG estimator. For Algorithm 2 with Option II and a fixed learning rate $\eta$, we have the following result.
Theorem 1 (Smooth objectives).
From Theorem 1 and its proof, one can see that our convergence analysis is very different from that of existing stochastic methods such as SVRG [1], ProxSVRG [34], and SVRG++ [44]. Similarly, the convergence of Algorithm 2 with Option I and a fixed learning rate can be guaranteed, as stated in Theorem 6 in the Supplementary Material. All these results show that VRSGD attains a convergence rate of $O(1/T)$ for non-strongly convex functions.
4.1.2 Nonsmooth Objective Functions
We also provide the convergence guarantee of Algorithm 2 with Option I and (13) for solving Problem (1) when $F(\cdot)$ is nonsmooth and non-strongly convex, as shown below.
Theorem 2 (Nonsmooth objectives).
Suppose Assumption 1 holds. Then the following inequality holds:
Similarly, the convergence of Algorithm 2 with Option II and a fixed learning rate can be guaranteed, as stated in Corollary 4 in the Supplementary Material.
4.1.3 Mini-Batch Settings
The upper bound on the variance of the SVRG estimator is extended to the mini-batch setting as follows.
Corollary 1 (Variance bound, mini-batch).
If each $f_i(\cdot)$ is convex and $L$-smooth, then the following inequality holds:
where .
This corollary is essentially identical to Theorem 4 in [38], and hence its proof is omitted. It is not hard to verify that the bound decreases as the mini-batch size $b$ increases. Based on this variance upper bound, we further analyze the convergence properties of VRSGD in the mini-batch setting, as shown below.
Theorem 3 (Mini-batch).
If each $f_i(\cdot)$ is convex and $L$-smooth, then the following inequality holds:
where .
4.2 Convergence Properties: Strongly Convex
We also analyze the convergence properties of VRSGD for solving strongly convex problems. We first give the following convergence result for Algorithm 2 with Option II.
Theorem 4 (Strongly convex).
We also provide linear convergence guarantees for Algorithm 2 with Option I for solving nonsmooth and strongly convex functions, as stated in Theorem 7 in the Supplementary Material. Similarly, the linear convergence of Algorithm 2 can be guaranteed for the smooth, strongly convex case. All these theoretical results show that VRSGD attains a linear convergence rate and at most an oracle complexity of $O((n+L/\mu)\log(1/\epsilon))$ for both smooth and nonsmooth strongly convex functions. In contrast, the convergence of SVRG [1] is only guaranteed for smooth and strongly convex problems.
Although the learning rate in Theorem 4 is required to be smaller than a constant fraction of $1/L$, we can use much larger learning rates in practice. It can be easily verified that the theoretical learning rate of SVRG must satisfy an even tighter restriction, and adopting a larger learning rate for SVRG is not always helpful in practice, which means that VRSGD can use much larger learning rates than SVRG both in theory and in practice. In other words, although they have the same theoretical convergence rate, VRSGD converges significantly faster than SVRG in practice, as shown by our experiments. Note that, similar to the convergence analyses in [4, 5, 58], the convergence of VRSGD for some nonconvex problems can also be guaranteed.
4.3 Equivalence to Its Momentum Accelerated Variant
Inspired by the success of the momentum technique in our previous work [6, 50, 75], we present a momentum accelerated variant of Algorithm 2, as shown in Algorithm 3. Unlike existing momentum techniques, e.g., [19, 22, 25, 39, 40, 48], we use a convex combination of the snapshot $\widetilde{x}$ and the latest iterate $x_k$ for acceleration, i.e., $y_k = \theta\,\widetilde{x} + (1-\theta)\,x_k$ with momentum parameter $\theta\in[0,1]$. It is not hard to verify that Algorithm 2 with Option I is equivalent to its variant (i.e., Algorithm 3 with Option I) for a suitable choice of the momentum parameter (see the Supplementary Material for the equivalence analysis). We emphasize that the only difference between Options I and II in Algorithm 3 is the initialization of the snapshot and starting points.
Theorem 5.
This theorem shows that the oracle complexity of Algorithm 3 with Option II is consistent with that of Katyusha [48], and is better than that of accelerated deterministic methods (e.g., AGD [20]), which is also verified by the experimental results in Fig. 2. Our algorithm also achieves the optimal convergence rate $O(1/T^2)$ for non-strongly convex functions, as in [48, 49]. Fig. 2 shows that Katyusha and Algorithm 3 with Option II have performance similar to that of Algorithms 2 and 3 with Option I in terms of the number of effective passes. Clearly, Algorithm 3 and Katyusha have a higher per-iteration complexity than Algorithm 2. Thus, we only report the results of VRSGD (i.e., Algorithm 2) in Section 5.
4.4 Complexity Analysis
From Algorithm 2, we can see that the per-iteration cost of VRSGD is dominated by the computation of the stochastic gradients $\nabla f_{i_k}(x_{k-1})$ and $\nabla f_{i_k}(\widetilde{x})$, together with the update in (12) or the proximal update in (13). Thus, the per-iteration complexity is $O(d)$, which is as low as that of SVRG [1] and ProxSVRG [34]. In fact, for some ERM problems, we can save the intermediate gradients in the computation of the full gradient at the snapshot, which generally requires $O(n)$ additional storage. As a result, each epoch only requires $O(m+n)$ component gradient evaluations. In addition, for extremely sparse data, we can introduce the lazy update tricks of [38, 76, 77] into our algorithm, and perform the update steps in (12) and (13) only for the nonzero dimensions of each sample, rather than for all dimensions. In other words, the per-iteration complexity of VRSGD can be improved from $O(d)$ to $O(d')$, where $d'\leq d$ is the sparsity of the feature vectors. Moreover, VRSGD has a much lower per-iteration complexity than existing accelerated stochastic variance reduction methods such as Katyusha [48], which involve more update steps for additional variables, as shown in (9a)-(9c).
5 Experimental Results
In this section, we evaluate the performance of VRSGD for solving a number of convex and nonconvex ERM problems (such as logistic regression, Lasso and ridge regression), and compare it with several state-of-the-art stochastic variance reduced methods (including SVRG [1], ProxSVRG [34], and SAGA [31]) and accelerated methods such as Catalyst [40] and Katyusha [48]. Moreover, we apply VRSGD to other machine learning problems, such as ERM with nonconvex loss and leading eigenvalue computation.


[Table: settings of the snapshot and starting points under Option I, Option II, and Option III.]


5.1 Experimental Setup
We used several publicly available data sets in our experiments: Adult (also called a9a), Covtype, Epsilon, MNIST, and RCV1, all of which can be downloaded from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). It should be noted that each sample of these data sets was normalized to unit length as in [34, 36], which leads to the same upper bound on the Lipschitz constants $L_i$ of all the component functions. As suggested in [1, 34, 48], the epoch length is set to $m=2n$ for the stochastic variance reduced methods, SVRG [1], ProxSVRG [34], Catalyst [40], and Katyusha [48], as well as VRSGD. Then the only parameter we have to tune by hand is the learning rate $\eta$. More specifically, we select learning rates from a small grid of candidate values. Since Katyusha has a much higher per-iteration complexity than SVRG and VRSGD, we compare their performance in terms of both the number of effective passes and running time (seconds), where computing a single full gradient or evaluating $n$ component gradients is counted as one effective pass over the data. For a fair comparison, we implemented SVRG, ProxSVRG, SAGA, Catalyst, Katyusha, and VRSGD in C++ with a Matlab interface, as well as their sparse versions with lazy update tricks, and performed all the experiments on a PC with an Intel i5-4570 CPU and 16GB RAM. The source code of all the methods is available at https://github.com/jnhujnhu/VRSGD.
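The normalization step mentioned above can be sketched as follows; scaling each sample to unit length makes the smoothness constants of all component losses share one bound (e.g., $L_i = \|a_i\|^2/4 = 1/4$ for the logistic loss). The function name is our own.

```python
import numpy as np

def normalize_samples(A, eps=1e-12):
    # Scale every row (sample) of the data matrix to unit l2 length,
    # so all component Lipschitz constants share one upper bound.
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    return A / np.maximum(norms, eps)
```

This also makes a single learning rate grid meaningful across data sets, since the rates no longer need to be rescaled by per-sample magnitudes.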


5.2 Deterministic Methods vs. Stochastic Methods
In this subsection, we compare the performance of stochastic methods (including SGD, SVRG, Katyusha, and VRSGD) with that of deterministic methods such as AGD [19, 20] and APG [22] for solving strongly and non-strongly convex problems. Note that the important momentum parameter of AGD is set as in [78], while that of APG is defined as follows: for all [22], where