VR-SGD: A Simple Stochastic Variance Reduction Method for Machine Learning

02/26/2018 ∙ by Fanhua Shang, et al. ∙ Xidian University

In this paper, we propose a simple variant of the original SVRG, called variance reduced stochastic gradient descent (VR-SGD). Unlike the choices of snapshot and starting points in SVRG and its proximal variant, Prox-SVRG, the two vectors of VR-SGD are set to the average and the last iterate of the previous epoch, respectively. These settings allow us to use much larger learning rates, and also make our convergence analysis more challenging. We also design two different update rules for smooth and non-smooth objective functions, respectively, which means that VR-SGD can tackle non-smooth and/or non-strongly convex problems directly without any reduction techniques. Moreover, we analyze the convergence properties of VR-SGD for strongly convex problems, which show that VR-SGD attains linear convergence. Different from its counterparts, which have no convergence guarantees for non-strongly convex problems, we also provide convergence guarantees for VR-SGD in this case, and empirically verify that VR-SGD with varying learning rates achieves performance similar to that of its momentum accelerated variant, which has the optimal convergence rate O(1/T^2). Finally, we apply VR-SGD to solve various machine learning problems, such as convex and non-convex empirical risk minimization, leading eigenvalue computation, and neural networks. Experimental results show that VR-SGD converges significantly faster than SVRG and Prox-SVRG, and usually outperforms state-of-the-art accelerated methods, e.g., Katyusha.

1 Introduction

In this paper, we focus on the following composite optimization problem:

(1)    min_{x∈R^d} F(x) := f(x) + g(x),   with   f(x) = (1/n) Σ_{i=1}^{n} f_i(x),

where the f_i(x): R^d → R, i = 1, …, n, are smooth functions, and g(x) is a relatively simple (but possibly non-differentiable) convex function (referred to as a regularizer). The formulation (1) arises in many places in machine learning, signal processing, data science, statistics and operations research, such as regularized empirical risk minimization (ERM). For instance, one popular choice of the component function f_i(·) in binary classification problems is the logistic loss, i.e., f_i(x) = log(1 + exp(−b_i a_i^T x)), where {(a_1, b_1), …, (a_n, b_n)} is a collection of training examples with a_i ∈ R^d and b_i ∈ {±1}. Some popular choices for the regularizer include the ℓ2-norm regularizer (i.e., g(x) = (λ/2)‖x‖²), the ℓ1-norm regularizer (i.e., g(x) = λ‖x‖₁), and the elastic-net regularizer (i.e., g(x) = (λ1/2)‖x‖² + λ2‖x‖₁). Some other applications include deep neural networks [1, 2, 3, 4, 5], group Lasso [6], sparse learning and coding [7, 8, 9, 10], non-negative matrix factorization [11], phase retrieval [12], matrix completion [13, 14], conditional random fields [15], generalized eigen-decomposition and canonical correlation analysis [16], and eigenvector computation [17, 18] such as principal component analysis (PCA) and singular value decomposition (SVD).

1.1 Stochastic Gradient Descent

We are especially interested in developing efficient algorithms to solve Problem (1) involving the sum of a large number of component functions. The standard and effective method for solving (1) is the (proximal) gradient descent (GD) method, including Nesterov's accelerated gradient descent (AGD) [19, 20] and accelerated proximal gradient (APG) [21, 22]. For the smooth problem (1), GD takes the following update rule: starting with x_0, for any k ≥ 0,

(2)    x_{k+1} = x_k − η_k ∇F(x_k),

where η_k > 0 is commonly referred to as the learning rate in machine learning or the step-size in optimization. When g(·) is non-smooth (e.g., the ℓ1-norm regularizer), we typically introduce the following proximal operator to replace (2):

(3)    x_{k+1} = prox_{η_k g}(x_k − η_k ∇f(x_k)) := argmin_{y∈R^d} { (1/(2η_k))‖y − (x_k − η_k ∇f(x_k))‖² + g(y) }.

GD has been proven to achieve linear convergence for strongly convex problems, and both AGD and APG attain the optimal convergence rate O(1/T²) for non-strongly convex problems, where T denotes the number of iterations. However, the per-iteration cost of all such batch (or deterministic) methods is O(nd), which is expensive for very large n.
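As a concrete illustration, the following is a minimal Python sketch of the proximal gradient step (3) for the ℓ1-norm regularizer g(x) = λ‖x‖₁, whose proximal operator is the well-known soft-thresholding operator; the quadratic loss and the synthetic data are our own illustrative choices, not taken from the paper.

import numpy as np

def soft_threshold(z, t):
    # Proximal operator of t * ||.||_1: shrinks each coordinate toward zero.
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def proximal_gd(A, b, lam, eta, T=200):
    # Proximal gradient descent for min_x (1/2n)||Ax - b||^2 + lam * ||x||_1.
    n, d = A.shape
    x = np.zeros(d)
    for _ in range(T):
        grad = A.T @ (A @ x - b) / n                   # full gradient of the smooth part
        x = soft_threshold(x - eta * grad, eta * lam)  # proximal update rule (3)
    return x

# Tiny usage example with synthetic data.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10))
b = A @ (rng.standard_normal(10) * (rng.random(10) < 0.3))
x_hat = proximal_gd(A, b, lam=0.1, eta=0.1)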

Instead of evaluating the full gradient ∇f(x) at each iteration, an efficient alternative is the stochastic (or incremental) gradient descent (SGD) method [23]. SGD only evaluates the gradient of a single component function at each iteration, and thus has a much lower per-iteration cost, O(d). SGD has therefore been successfully applied to many large-scale learning problems [24, 25, 26], especially for training deep learning models [2, 3, 27], and its update rule is

(4)    x_{k+1} = x_k − η_k ∇f_{i_k}(x_k),

where η_k > 0 is the step size, and the index i_k can be chosen uniformly at random from [n] := {1, 2, …, n}. Although the stochastic gradient estimator ∇f_{i_k}(x_k) is an unbiased estimate of the full gradient, i.e., E[∇f_{i_k}(x_k)] = ∇f(x_k), its variance may be large due to the randomness of the sampling [1]. Thus, stochastic gradient estimators are also called "noisy gradients", and the step size must be gradually reduced, which leads to slow convergence. In particular, even under the strongly convex (SC) condition, standard SGD attains only a sub-linear convergence rate of O(1/T) [28].
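As a quick sketch (our own illustration, with an assumed O(1/k) step-size decay), the SGD update (4) on the same quadratic loss looks as follows.

import numpy as np

def sgd(A, b, eta0=0.5, T=5000, seed=0):
    # Plain SGD on f(x) = (1/2n) sum_i (a_i^T x - b_i)^2 with the
    # decaying step size eta_k = eta0 / (1 + k); cf. update rule (4).
    n, d = A.shape
    x = np.zeros(d)
    rng = np.random.default_rng(seed)
    for k in range(T):
        i = rng.integers(n)                   # pick i_k uniformly at random
        grad_i = A[i] * (A[i] @ x - b[i])     # gradient of one component f_i
        x -= eta0 / (1.0 + k) * grad_i        # "noisy gradient" step
    return x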

1.2 Accelerated SGD

Recently, many SGD methods with variance reduction have been proposed, such as stochastic average gradient (SAG) [29], stochastic variance reduced gradient (SVRG) [1], stochastic dual coordinate ascent (SDCA) [30], SAGA [31], stochastic primal-dual coordinate (SPDC) [32], and their proximal variants, such as Prox-SAG [33], Prox-SVRG [34] and Prox-SDCA [35]. These accelerated SGD methods can use a constant learning rate instead of the diminishing step sizes of SGD, and fall into the following three categories: primal methods such as SVRG and SAGA, dual methods such as SDCA, and primal-dual methods such as SPDC. In essence, many of the primal methods use the full gradient at the snapshot or the average gradient to progressively reduce the variance of the stochastic gradient estimators, as do the dual and primal-dual methods, which has led to a revolution in the area of first-order optimization [36]. Thus, they are also known as hybrid gradient descent methods [37] or semi-stochastic gradient descent methods [38]. In particular, under the strongly convex condition, most of the accelerated SGD methods enjoy a linear convergence rate (also known as a geometric or exponential rate) and an oracle complexity of O((n + L/μ) log(1/ε)) to obtain an ε-suboptimal solution, where each f_i(·) is L-smooth and F(·) is μ-strongly convex. This complexity bound shows that they converge faster than accelerated deterministic methods, whose oracle complexity is O(n√(L/μ) log(1/ε)) [39, 40].

SVRG [1] and its proximal variant, Prox-SVRG [34], are particularly attractive because of their low storage requirements compared with other methods such as SAG, SAGA and SDCA, which require storage of all the gradients of the component functions or of all the dual variables. At the beginning of the s-th epoch in SVRG, the full gradient ∇f(x̃^{s−1}) is computed at the snapshot point x̃^{s−1}, which is updated periodically.

Definition 1.

The stochastic variance reduced gradient estimator, independently introduced in [1, 37], is defined as follows:

(5)    ∇̃f_{i_k}(x_k^s) = ∇f_{i_k}(x_k^s) − ∇f_{i_k}(x̃^{s−1}) + μ̃,   where μ̃ = (1/n) Σ_{i=1}^{n} ∇f_i(x̃^{s−1}),

and s is the epoch that iteration k belongs to.

It is not hard to verify that the variance of the SVRG estimator (5) (i.e., E[‖∇̃f_{i_k}(x_k^s) − ∇f(x_k^s)‖²]) can be much smaller than that of the SGD estimator (i.e., E[‖∇f_{i_k}(x_k) − ∇f(x_k)‖²]). Theoretically, for non-strongly convex (Non-SC) problems, the variance reduced methods converge slower than accelerated batch methods such as FISTA [22], i.e., O(1/T) vs. O(1/T²).
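To make the variance reduction concrete, the short sketch below (our own illustration on a quadratic objective) empirically compares the sample variance of the SGD estimator with that of the SVRG estimator (5) at a fixed point close to the snapshot.

import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

def grad_i(x, i):
    # Gradient of the i-th component f_i(x) = (1/2)(a_i^T x - b_i)^2.
    return A[i] * (A[i] @ x - b[i])

full_grad = lambda x: A.T @ (A @ x - b) / n

x_snap = rng.standard_normal(d)              # snapshot point
x = x_snap + 0.01 * rng.standard_normal(d)   # current iterate, near the snapshot
mu = full_grad(x_snap)                       # full gradient at the snapshot

sgd_est  = np.array([grad_i(x, i) for i in range(n)])
svrg_est = np.array([grad_i(x, i) - grad_i(x_snap, i) + mu for i in range(n)])

# Both estimators are unbiased; the SVRG estimator has far smaller variance
# when the iterate is close to the snapshot.
print(np.mean(np.sum((sgd_est  - full_grad(x))**2, axis=1)))
print(np.mean(np.sum((svrg_est - full_grad(x))**2, axis=1)))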

More recently, many acceleration techniques have been proposed to further speed up the stochastic variance reduced methods mentioned above. These techniques mainly include Nesterov's acceleration techniques in [25, 39, 40, 41, 42], reducing the number of gradient calculations in early iterations [36, 43, 44], the projection-free property of the conditional gradient method (also known as the Frank-Wolfe algorithm [45]) as in [46], the stochastic sufficient decrease technique [47], and the momentum acceleration tricks in [36, 48, 49]. [40] proposed the accelerating Catalyst framework, which achieves an oracle complexity of O((n + √(nL/μ)) log(L/μ) log(1/ε)) for strongly convex problems. [48] and [50] proved that accelerated methods can attain an oracle complexity of O(n log(1/ε) + √(nL/ε)) for non-strongly convex problems; this overall complexity matches the theoretical bound provided in [51]. Katyusha [48], point-SAGA [52] and MiG [50] achieve the best-known oracle complexity of O((n + √(nL/μ)) log(1/ε)) for strongly convex problems, which is identical to the corresponding complexity bound in [51]. Hence, Katyusha and MiG are the best-known stochastic optimization methods for both SC and Non-SC problems, as pointed out in [51]. However, selecting the best values for the parameters of the accelerated methods (e.g., the momentum parameter) remains an open problem. In particular, most accelerated stochastic variance reduction methods, including Katyusha, require at least one auxiliary variable and one momentum parameter, which leads to complicated algorithm design and high per-iteration complexity, especially for very high-dimensional and sparse data.

1.3 Our Contributions

From the above discussion, we can see that most accelerated stochastic variance reduction methods, such as [36, 39, 44, 46, 47, 48, 53, 54], and applications, such as [7, 9, 10, 14, 17, 18, 55, 56, 57], are based on the SVRG method [1]. Thus, any key improvement of SVRG is very important for research on stochastic optimization. In this paper, we propose a simple variant of the original SVRG [1], called variance reduced stochastic gradient descent (VR-SGD). The snapshot point and starting point of each epoch in VR-SGD are set to the average and the last iterate of the previous epoch, respectively. This differs from the settings of SVRG and Prox-SVRG [34], where both points of the former are set to the last iterate, and both points of the latter are set to the average of the previous epoch. This difference makes the convergence analysis of VR-SGD significantly more challenging than that of SVRG and Prox-SVRG. Our empirical results show that the performance of VR-SGD is significantly better than that of its counterparts, SVRG and Prox-SVRG. Impressively, VR-SGD with varying learning rates achieves performance better than or at least comparable to that of accelerated methods, such as Catalyst [40] and Katyusha [48]. The main contributions of this paper are summarized below.

  • The snapshot and starting points of VR-SGD are set to two different vectors, i.e., x̃^s = (1/m) Σ_{k=1}^{m} x_k^s (Option I) or x̃^s = (1/(m+1)) Σ_{k=0}^{m} x_k^s (Option II), and x_0^{s+1} = x_m^s. In particular, we find that these settings allow us to take much larger learning rates than SVRG, and thus significantly speed up convergence in practice. Moreover, VR-SGD has an advantage over SVRG in terms of robustness of learning rate selection.

  • Unlike proximal stochastic gradient methods, e.g., Prox-SVRG and Katyusha, which have a unified update rule for both the smooth and non-smooth cases of the objective (see Section 2.2 for details), VR-SGD employs two different update rules for the two cases, as in (12) and (13) below. Empirical results show that the gradient update rule (12) is a better choice for smooth optimization problems than proximal update formulas such as (10).

  • We provide convergence guarantees for VR-SGD for solving both smooth and non-smooth non-strongly convex (or general convex) functions. In comparison, SVRG and Prox-SVRG have no such convergence guarantees, as shown in Table I.

  • Moreover, we also present a momentum accelerated variant of VR-SGD, discuss the equivalence relationship between the two algorithms, and empirically verify that VR-SGD achieves performance similar to that of its momentum accelerated variant, which attains the optimal convergence rate O(1/T²).

  • Finally, we theoretically analyze the convergence properties of VR-SGD with Option I or Option II for both smooth and non-smooth strongly convex functions, which show that VR-SGD attains linear convergence.

                        SVRG [1]       Prox-SVRG [34]   VR-SGD
SC, smooth              linear rate    unknown          linear rate
SC, non-smooth          unknown        linear rate      linear rate
Non-SC, smooth          unknown        unknown          O(1/T)
Non-SC, non-smooth      unknown        unknown          O(1/T)
TABLE I: Comparison of the convergence rates of VR-SGD and its counterparts.

2 Preliminary and Related Work

Throughout this paper, we use ‖·‖ to denote the ℓ2-norm (also known as the standard Euclidean norm), and ‖·‖₁ is the ℓ1-norm, i.e., ‖x‖₁ = Σ_{i=1}^{d} |x_i|. ∇f(·) denotes the full gradient of f(·) if it is differentiable, or a subgradient if f(·) is only Lipschitz continuous. For each epoch s and inner iteration k, i_k ∈ [n] denotes the randomly chosen index. We mostly focus on the case of Problem (1) when each f_i(·) is L-smooth (in fact, the theoretical results for the case when the gradients of all component functions have the same Lipschitz constant L can be extended to the more general case in which some component functions f_i(·) have different degrees of smoothness), and F(·) is μ-strongly convex. These two common assumptions are defined as follows.

2.1 Basic Assumptions

Assumption 1 (Smoothness).

Each f_i(·) is L-smooth, that is, there exists a constant L > 0 such that for all x, y ∈ R^d,

(6)    ‖∇f_i(x) − ∇f_i(y)‖ ≤ L‖x − y‖.

Assumption 2 (Strong Convexity).

F(x) is μ-strongly convex, i.e., there exists a constant μ > 0 such that for all x, y ∈ R^d,

(7)    F(y) ≥ F(x) + ⟨∇F(x), y − x⟩ + (μ/2)‖y − x‖².

Note that when F(·) is non-smooth, the inequality in (7) needs to be revised by simply replacing the gradient ∇F(x) with an arbitrary subgradient ξ ∈ ∂F(x) at x. In contrast, for a non-strongly convex or general convex function, the inequality in (7) can always be satisfied with μ = 0.

0:  Input: the number of epochs S, the number of iterations m per epoch, and the learning rate η.
0:  Initialize: x̃^0.
1:  for s = 1, 2, …, S do
2:     μ̃ = (1/n) Σ_{i=1}^{n} ∇f_i(x̃^{s−1}),  x_0^s = x̃^{s−1};
3:     for k = 0, 1, …, m−1 do
4:        Pick i_k uniformly at random from [n];
5:        ∇̃f_{i_k}(x_k^s) = ∇f_{i_k}(x_k^s) − ∇f_{i_k}(x̃^{s−1}) + μ̃;
6:        Option I: x_{k+1}^s = x_k^s − η[∇̃f_{i_k}(x_k^s) + ∇g(x_k^s)],  or x_{k+1}^s = x_k^s − η∇̃f_{i_k}(x_k^s) if g ≡ 0;
7:        Option II: x_{k+1}^s = prox_{ηg}(x_k^s − η∇̃f_{i_k}(x_k^s));
8:     end for
9:     Option I:  x̃^s = x_m^s; Last iterate for snapshot
10:     Option II: x̃^s = (1/m) Σ_{k=1}^{m} x_k^s; Iterate averaging for snapshot
11:  end for
Output: x̃^S
Algorithm 1 SVRG (Option I) and Prox-SVRG (Option II)
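A compact Python sketch of Algorithm 1 with Option I for the smooth case (g ≡ 0) might look as follows; the least-squares loss and all parameter values are our own illustrative choices.

import numpy as np

def svrg(grad_fn, n, d, eta, S=30, m=None, seed=0):
    # Algorithm 1 (Option I, g == 0): full gradient at the snapshot,
    # variance reduced inner updates, last iterate as the next snapshot.
    m = 2 * n if m is None else m
    rng = np.random.default_rng(seed)
    x_snap = np.zeros(d)
    for _ in range(S):
        mu = np.mean([grad_fn(x_snap, i) for i in range(n)], axis=0)  # full gradient
        x = x_snap.copy()                 # Option I start: x_0^s = snapshot
        for _ in range(m):
            i = rng.integers(n)
            v = grad_fn(x, i) - grad_fn(x_snap, i) + mu   # SVRG estimator (5)
            x = x - eta * v
        x_snap = x                        # Option I: last iterate for snapshot
    return x_snap

# Usage on least squares: f_i(x) = (1/2)(a_i^T x - b_i)^2.
rng = np.random.default_rng(1)
n, d = 500, 20
A = rng.standard_normal((n, d)); b = rng.standard_normal(n)
grad_fn = lambda x, i: A[i] * (A[i] @ x - b[i])
x_star = svrg(grad_fn, n, d, eta=0.005)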

2.2 Related Work

To speed up standard and proximal SGD, many stochastic variance reduced methods [29, 30, 31, 37] have been proposed for some special cases of Problem (1). For the case when each f_i(·) is L-smooth, F(·) is μ-strongly convex, and g(x) ≡ 0, Roux et al. [29] proposed the stochastic average gradient (SAG) method, which attains linear convergence. However, SAG, as well as other incremental aggregated gradient methods such as SAGA [31], needs to store all the gradients, so that O(nd) memory is required in general [43]. Similarly, SDCA [30] requires storage of all the dual variables [1], which uses O(n) memory. In contrast, SVRG, proposed by Johnson and Zhang [1], as well as Prox-SVRG [34], enjoys a similar convergence rate to SAG and SDCA, but without the memory requirements for all gradients and dual variables. In particular, the SVRG estimator in (5) may be the most popular choice of stochastic gradient estimator. The update rule of SVRG for the case of Problem (1) with g(x) ≡ 0 is

(8)    x_{k+1}^s = x_k^s − η ∇̃f_{i_k}(x_k^s).

When the regularizer g(x) is smooth and nonzero, the update rule in (8) becomes: x_{k+1}^s = x_k^s − η[∇̃f_{i_k}(x_k^s) + ∇g(x_k^s)]. Although the original SVRG in [1] only has convergence guarantees for the special case of Problem (1) in which each f_i(·) is L-smooth, F(·) is μ-strongly convex, and g(x) ≡ 0, we can extend SVRG to the proximal setting by introducing the proximal operator in (3), as shown in Line 7 of Algorithm 1.

Based on the SVRG estimator in (5), some accelerated algorithms [39, 40, 48] have been proposed. The proximal update rules of Katyusha [48] are formulated as follows:

(9a)    x_{k+1}^s = θ1 z_k^s + θ2 x̃^{s−1} + (1 − θ1 − θ2) y_k^s,
(9b)    z_{k+1}^s = argmin_{z} { (1/(2η))‖z − z_k^s‖² + ⟨∇̃f_{i_k}(x_{k+1}^s), z⟩ + g(z) },
(9c)    y_{k+1}^s = x_{k+1}^s + θ1 (z_{k+1}^s − z_k^s),

where θ1, θ2 ∈ [0, 1] are two parameters. To eliminate the need for parameter tuning, the learning rate is set to η = 1/(3θ1 L), and θ2 is fixed to 1/2 in [48]. In addition, [16, 17, 18] applied efficient stochastic solvers to compute the leading eigenvectors of a symmetric matrix or the generalized eigenvectors of two symmetric matrices. The first such method is VR-PCA, proposed by Shamir [17], whose convergence properties for this non-convex problem are also provided. Garber et al. [18] analyzed the convergence rate of SVRG when f(·) is a convex function that is a sum of non-convex component functions. Moreover, [4, 5] and [58] proved that SVRG and SAGA with minor modifications can converge asymptotically to a stationary point of non-convex functions. Some sparse approximation, parallel and distributed variants [50, 59, 60, 61, 62] of accelerated SGD methods have also been proposed.

An important class of stochastic methods are the proximal stochastic gradient (Prox-SG) methods, such as Prox-SVRG [34], SAGA [31], and Katyusha [48]. Different from standard variance reduced SGD methods such as SVRG, a Prox-SG method has a unified update rule for both the smooth and non-smooth cases of g(·). For instance, the update rule of Prox-SVRG [34] is formulated as follows:

(10)    x_{k+1}^s = prox_{ηg}( x_k^s − η ∇̃f_{i_k}(x_k^s) ).

For the sake of completeness, the details of Prox-SVRG [34] are shown in Algorithm 1 with Option II. When g(·) is the widely used ℓ1-norm regularizer, i.e., g(x) = λ‖x‖₁, the proximal update formula in (10) becomes the soft-thresholding update

(11)    x_{k+1}^s = sign(y_{k+1}^s) ⊙ max{ |y_{k+1}^s| − λη, 0 },   with  y_{k+1}^s = x_k^s − η ∇̃f_{i_k}(x_k^s).
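As a small sketch of one Prox-SVRG inner iteration with the ℓ1-norm regularizer, i.e., the SVRG gradient step followed by the soft-thresholding update (11) (a minimal illustration in the spirit of (10)/(11), with grad_fn a user-supplied per-component gradient):

import numpy as np

def prox_svrg_step(x, x_snap, mu, i, eta, lam, grad_fn):
    # One inner iteration of Prox-SVRG for g(x) = lam * ||x||_1:
    # an SVRG gradient step followed by soft-thresholding, cf. (10)/(11).
    v = grad_fn(x, i) - grad_fn(x_snap, i) + mu       # SVRG estimator (5)
    z = x - eta * v
    return np.sign(z) * np.maximum(np.abs(z) - eta * lam, 0.0)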

3 Variance Reduced SGD

In this section, we propose an efficient variance reduced stochastic gradient descent (VR-SGD) algorithm, as shown in Algorithm 2. Different from the choices of the snapshot and starting points in SVRG [1] and Prox-SVRG [34], the two vectors of each epoch in VR-SGD are set to the average and last iterate of the previous epoch, respectively. Moreover, unlike existing proximal stochastic methods, we design two different update rules for smooth and non-smooth objective functions, respectively.

3.1 Snapshot and Starting Points

Like SVRG, VR-SGD is also divided into S epochs, and each epoch consists of m stochastic gradient steps, where m is usually chosen to be Θ(n), as suggested in [1, 34, 48]. Within each epoch, we need to compute the full gradient ∇f(x̃^{s−1}) at the snapshot x̃^{s−1} and use it to define the variance reduced stochastic gradient estimator in (5). Unlike SVRG, whose snapshot is set to the last iterate of the previous epoch, the snapshot of VR-SGD is set to the average of the previous epoch, e.g., x̃^s = (1/m) Σ_{k=1}^{m} x_k^s in Option I of Algorithm 2, which leads to better robustness to gradient noise, as also suggested in [36, 47, 66]. (It should be emphasized that the noise introduced by random sampling is inevitable, and generally slows down the convergence speed in this sense. However, SGD and its variants are probably the most widely used optimization algorithms for deep learning [63]. In particular, [64] has shown that by adding gradient noise at each step, noisy gradient descent can escape saddle points efficiently and converge to a local minimum of a non-convex minimization problem, e.g., the application to deep neural networks in [65].) In fact, the choice of Option II in Algorithm 2, i.e., x̃^s = (1/(m+1)) Σ_{k=0}^{m} x_k^s, also works well in practice, as shown in Fig. 2 in the Supplementary Material. Therefore, we provide convergence guarantees for our algorithm with either Option I or Option II in the next section. In particular, we find that one effect of the choice in Option I or Option II of Algorithm 2 is to allow much larger learning rates or step sizes than SVRG in practice (see Fig. 1). Indeed, the larger learning rate enjoyed by VR-SGD means that the variance of its stochastic gradient estimator goes asymptotically to zero faster.

Unlike Prox-SVRG [34], whose starting point is initialized to the average of the previous epoch, the starting point of VR-SGD is set to the last iterate of the previous epoch. That is, in VR-SGD, the last iterate of the previous epoch becomes the new starting point, while the two points of Prox-SVRG are completely different, thereby leading to relatively slow convergence in general. Both the starting and snapshot points of SVRG [1] are set to the last iterate of the previous epoch. (Note that the theoretical convergence of the original SVRG [1] relies on its Option II, i.e., both x̃^s and x_0^{s+1} are set to x_k^s, where k is randomly chosen from {1, …, m}. However, the empirical results in [1] suggest that Option I, i.e., the last iterate, is a better choice, and the convergence guarantee of SVRG with Option I for strongly convex objective functions is provided in [67].) In contrast, the two points of Prox-SVRG [34] are set to the average of the previous epoch (as also suggested in [1]). By setting the starting and snapshot points of VR-SGD to the two different vectors mentioned above, the convergence analysis of VR-SGD becomes significantly more challenging than that of SVRG and Prox-SVRG, as shown in Section 4.

0:  Input: the number of epochs S, and the number of iterations m per epoch.
0:  Initialize: x_0^1 = x̃^0, and the learning rates {η_s}.
1:  for s = 1, 2, …, S do
2:     μ̃ = (1/n) Σ_{i=1}^{n} ∇f_i(x̃^{s−1}); Compute the full gradient
3:     for k = 0, 1, …, m−1 do
4:        Pick i_k uniformly at random from [n];
5:        ∇̃f_{i_k}(x_k^s) = ∇f_{i_k}(x_k^s) − ∇f_{i_k}(x̃^{s−1}) + μ̃;
6:        x_{k+1}^s = x_k^s − η_s[∇̃f_{i_k}(x_k^s) + ∇g(x_k^s)];
7:     end for
8:     Option I: x̃^s = (1/m) Σ_{k=1}^{m} x_k^s; Iterate averaging for snapshot
9:     Option II: x̃^s = (1/(m+1)) Σ_{k=0}^{m} x_k^s; Iterate averaging for snapshot
10:     x_0^{s+1} = x_m^s; Initialize the starting point of the next epoch
11:  end for
Output: x̂ = x_m^S if F(x_m^S) < F(x̃^S), and x̂ = x̃^S otherwise.
Algorithm 2 VR-SGD for solving smooth problems
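To make Algorithm 2 concrete, here is a minimal runnable Python sketch of VR-SGD (Option I) for smooth problems; the ℓ2-regularized logistic regression objective and all parameter values are our own illustrative choices.

import numpy as np

def vr_sgd(grad_fn, n, d, eta, S=30, m=None, seed=0):
    # Algorithm 2 (Option I): snapshot = average of the previous epoch,
    # starting point = last iterate of the previous epoch.
    m = 2 * n if m is None else m
    rng = np.random.default_rng(seed)
    x = np.zeros(d)           # starting point x_0^1
    x_snap = x.copy()         # snapshot tilde-x^0
    for _ in range(S):
        mu = np.mean([grad_fn(x_snap, i) for i in range(n)], axis=0)  # full gradient
        x_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            v = grad_fn(x, i) - grad_fn(x_snap, i) + mu   # SVRG estimator (5)
            x = x - eta * v        # update rule (12); smooth g folded into grad_fn
            x_sum += x
        x_snap = x_sum / m    # Option I: iterate averaging for the snapshot
        # x is kept as-is: the last iterate becomes the next starting point
    return x

# Usage: l2-regularized logistic regression,
# f_i(x) = log(1 + exp(-b_i a_i^T x)) + (lam/2)||x||^2.
rng = np.random.default_rng(1)
n, d, lam = 500, 20, 1e-3
A = rng.standard_normal((n, d))
b = np.sign(A @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n))
grad_fn = lambda x, i: -b[i] * A[i] / (1.0 + np.exp(b[i] * (A[i] @ x))) + lam * x
x_opt = vr_sgd(grad_fn, n, d, eta=0.1)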

3.2 The VR-SGD Algorithm

In this part, we propose an efficient VR-SGD algorithm to solve Problem (1), as outlined in Algorithm 2 for the case of smooth objective functions. It is well known that the original SVRG [1] only works for smooth minimization problems. However, in many machine learning applications, e.g., elastic-net regularized logistic regression, the strongly convex objective function F(x) is non-smooth. To solve this class of problems, the proximal variant of SVRG, Prox-SVRG [34], was subsequently proposed. Unlike the original SVRG, VR-SGD can not only solve smooth objective functions, but also directly tackle non-smooth ones. That is, when the regularizer g(·) is smooth (e.g., the ℓ2-norm regularizer), the key update rule of VR-SGD is

(12)    x_{k+1}^s = x_k^s − η_s [∇̃f_{i_k}(x_k^s) + ∇g(x_k^s)].

When g(·) is non-smooth (e.g., the ℓ1-norm regularizer), the key update rule of VR-SGD in Algorithm 2 becomes

(13)    x_{k+1}^s = prox_{η_s g}( x_k^s − η_s ∇̃f_{i_k}(x_k^s) ).

Unlike proximal stochastic methods such as Prox-SVRG [34], all of which have a unified update rule as in (10) for both the smooth and non-smooth cases of g(·), VR-SGD has two different update rules for the two cases, as in (12) and (13). Fig. 1 demonstrates that VR-SGD has a significant advantage over SVRG in terms of robustness of learning rate selection: VR-SGD yields good performance over a wide range of learning rates, whereas the performance of SVRG is very sensitive to the selection of the learning rate. This makes VR-SGD convenient to apply to various real-world machine learning problems. In fact, VR-SGD can use much larger learning rates than SVRG for ridge regression problems in practice, as shown in Fig. 1(b).

(a) Logistic regression
(b) Ridge regression
Fig. 1: Comparison of SVRG [1] and VR-SGD with different learning rates for solving ℓ2-norm regularized logistic regression and ridge regression on Covtype. Note that the blue lines stand for the results of SVRG, while the red lines correspond to the results of VR-SGD (best viewed in color).

3.3 VR-SGD for Non-Strongly Convex Objectives

Although many stochastic variance reduced methods have been proposed, most of them, including SVRG and Prox-SVRG, only have convergence guarantees for the case of Problem (1) in which F(x) is strongly convex. However, F(x) may be non-strongly convex in many machine learning applications, such as Lasso and ℓ1-norm regularized logistic regression. As suggested in [48, 68], this class of problems can be transformed into strongly convex problems by adding a small proximal term (σ/2)‖x − x_0‖², after which they can be efficiently solved by Algorithm 2. However, this reduction technique may degrade the performance of the involved algorithms both in theory and in practice [44]. Thus, we use VR-SGD to directly solve non-strongly convex problems.

The learning rate η_s of Algorithm 2 can be fixed to a constant. Inspired by existing accelerated stochastic algorithms [36, 48], the learning rate in Algorithm 2 can also be gradually increased in early iterations, for both strongly convex and non-strongly convex problems, which leads to faster convergence (see Fig. 3 in the Supplementary Material). Different from SGD and Katyusha [48], where the learning rate of the former needs to be gradually decayed and that of the latter needs to be gradually increased, the update rule of η_s in Algorithm 2 is defined as follows: η_1 is an initial learning rate, and for any s ≥ 2,

(14)    η_s = η_1 / max{ α, 2/(s+1) },

where 0 < α < 1 is a given constant, e.g., α = 0.2.
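A one-line Python sketch of the schedule in (14); the value α = 0.2 is just the example constant mentioned above.

def learning_rate(s, eta1, alpha=0.2):
    # Schedule (14): the rate grows in early epochs and plateaus at eta1/alpha.
    return eta1 if s == 1 else eta1 / max(alpha, 2.0 / (s + 1))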

3.4 Extensions of VR-SGD

It has been shown in [38, 39] that mini-batching can effectively decrease the variance of stochastic gradient estimates. Therefore, we first extend the proposed VR-SGD method to the mini-batch setting, together with its convergence results below. Here, we denote by b the mini-batch size and by I_k the randomly selected index set for each outer iteration s and inner iteration k.

Definition 2.

The stochastic variance reduced gradient estimator in the mini-batch setting is defined as

(15)    ∇̃f_{I_k}(x_k^s) = (1/b) Σ_{i∈I_k} [ ∇f_i(x_k^s) − ∇f_i(x̃^{s−1}) ] + μ̃,

where I_k ⊂ [n] is a mini-batch of size b.
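A small Python sketch of the mini-batch estimator (15), reusing a per-component gradient function grad_fn in the style of the earlier sketches:

import numpy as np

def minibatch_svrg_grad(grad_fn, x, x_snap, mu, n, b, rng):
    # Mini-batch SVRG estimator (15): average the variance reduced
    # differences over a random index set I_k of size b, then add mu.
    idx = rng.choice(n, size=b, replace=False)
    diffs = [grad_fn(x, i) - grad_fn(x_snap, i) for i in idx]
    return np.mean(diffs, axis=0) + mu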

If some of the component functions f_i(·) are non-smooth, we can use the proximal operator oracle [68] or Nesterov's smoothing [69] and homotopy smoothing [70] techniques to smooth them, and thereby obtain their smoothed approximations. In addition, we can directly extend our VR-SGD method to the non-smooth setting as in [36] (e.g., Algorithm 3 in [36]) without using any smoothing techniques.

Considering that each component function f_i(·) may have a different degree of smoothness, picking the random index i_k from a non-uniform distribution is a much better choice than the commonly used uniform random sampling [71, 72], as is without-replacement sampling compared with with-replacement sampling [73]. This can be done using the same techniques as in [34, 48], i.e., the sampling probabilities for all f_i(·) are proportional to their Lipschitz constants, p_i = L_i / Σ_{j=1}^{n} L_j; a minimal sketch of this sampling step is given after this paragraph. VR-SGD can also be combined with other acceleration techniques used for SVRG. For instance, the epoch length of VR-SGD can be automatically determined by the techniques in [44, 74], instead of being fixed in advance. We can reduce the number of gradient calculations in early iterations as in [43, 44], which leads to faster convergence in general (see the section on the impact of increasing learning rates for details). Moreover, we can introduce the Nesterov's acceleration techniques in [25, 39, 40, 41, 42] and the momentum acceleration tricks in [36, 48, 75] to further improve the performance of VR-SGD.
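The sketch below illustrates Lipschitz-proportional sampling; the Lipschitz constants L_i are assumed to be known or precomputed (the placeholder values are our own).

import numpy as np

# Assume L[i] holds the smoothness constant of f_i, e.g., ||a_i||^2 / 4
# for the logistic loss. Sampling probabilities proportional to L_i:
rng = np.random.default_rng(0)
L = rng.random(500) + 0.5            # placeholder Lipschitz constants
p = L / L.sum()
i_k = rng.choice(len(L), p=p)        # non-uniform index used in the inner loop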

4 Algorithm Analysis

In this section, we provide the convergence guarantees of VR-SGD for solving both smooth and non-smooth general convex problems, and extend the results to the mini-batch setting. We also study the convergence properties of VR-SGD for solving both smooth and non-smooth strongly convex objective functions. Moreover, we discuss the equivalent relationship between VR-SGD and its momentum accelerated variant, as well as some of its extensions.

4.1 Convergence Properties: Non-strongly Convex

In this part, we analyze the convergence properties of VR-SGD for solving more general non-strongly convex problems. Considering that the proposed algorithm (i.e., Algorithm 2) has two different update rules for smooth and non-smooth cases, we give the convergence guarantees of VR-SGD for the two cases as follows.

4.1.1 Smooth Objective Functions

We first provide the convergence guarantee of our algorithm for solving Problem (1) when F(x) is smooth. In order to simplify the analysis, we denote F(x) by f(x), that is, we take g(x) ≡ 0 for all x ∈ R^d, and then ∇F(x) = ∇f(x).

Lemma 1 (Variance bound).

Let x* be the optimal solution of Problem (1), and suppose Assumption 1 holds. Then the following inequality holds:

E[‖∇̃f_{i_k}(x_k^s) − ∇f(x_k^s)‖²] ≤ 4L [ f(x_k^s) − f(x*) + f(x̃^{s−1}) − f(x*) ].

The proofs of this lemma and of the lemmas and theorems below are all included in the Supplementary Material. Lemma 1 provides an upper bound on the expected variance of the variance reduced gradient estimator in (5), i.e., the SVRG estimator. For Algorithm 2 with Option II and a fixed learning rate η_s ≡ η, we have the following result.

Theorem 1 (Smooth objectives).

Suppose Assumption 1 holds. Then Algorithm 2 with Option II and a fixed learning rate η satisfies E[f(x̂)] − f(x*) = O(1/T), where T = Sm is the total number of stochastic iterations.

From Theorem 1 and its proof, one can see that our convergence analysis is very different from that of existing stochastic methods, such as SVRG [1], Prox-SVRG [34], and SVRG++ [44]. Similarly, the convergence of Algorithm 2 with Option I and a fixed learning rate can be guaranteed, as stated in Theorem 6 in the Supplementary Material. All these results show that VR-SGD attains a convergence rate of O(1/T) for non-strongly convex functions.

4.1.2 Non-Smooth Objective Functions

We also provide the convergence guarantee of Algorithm 2 with Option I and the update rule (13) for solving Problem (1) when F(x) is non-smooth and non-strongly convex, as shown below.

Theorem 2 (Non-smooth objectives).

Suppose Assumption 1 holds. Then Algorithm 2 with Option I and the update rule (13) satisfies E[F(x̂)] − F(x*) = O(1/T).

Similarly, the convergence of Algorithm 2 with Option II and a fixed learning rate can be guaranteed, as stated in Corollary 4 in the Supplementary Material.

4.1.3 Mini-Batch Settings

The upper bound on the variance of the estimator ∇̃f_{I_k}(x_k^s) extends to the mini-batch setting as follows.

Corollary 1 (Variance bound of mini-batch).

If each f_i(·) is convex and L-smooth, then the following inequality holds:

E[‖∇̃f_{I_k}(x_k^s) − ∇f(x_k^s)‖²] ≤ 4Lδ(b) [ f(x_k^s) − f(x*) + f(x̃^{s−1}) − f(x*) ],

where δ(b) = (n − b)/(b(n − 1)).

This corollary is essentially identical to Theorem 4 in [38], and hence its proof is omitted. It is not hard to verify that 0 ≤ δ(b) ≤ 1, with δ(1) = 1 and δ(n) = 0. Based on this variance upper bound, we further analyze the convergence properties of VR-SGD in the mini-batch setting, as shown below.

Theorem 3 (Mini-batch).

If each f_i(·) is convex and L-smooth, then Algorithm 2 in the mini-batch setting attains a convergence rate of O(1/T), where the variance-induced term in the bound is scaled by the factor δ(b) defined above.

From Theorem 3, one can see that when b = n (i.e., the batch setting), we have δ(b) = 0, and the variance-induced term on the right-hand side of the bound vanishes, i.e., VR-SGD degenerates to a batch method. When b = 1, we have δ(b) = 1, and Theorem 3 degenerates to Theorem 2.

4.2 Convergence Properties: Strongly Convex

We also analyze the convergence properties of VR-SGD for solving strongly convex problems. We first give the following convergence result for Algorithm 2 with Option II.

Theorem 4 (Strongly convex).

Suppose Assumptions 1 and 2 above and Assumption 3 in the Supplementary Material hold, and m is sufficiently large so that the contraction factor ρ < 1, where ρ is a constant depending on η, L, μ and m. Then Algorithm 2 with Option II has the following geometric convergence in expectation:

E[F(x̂)] − F(x*) ≤ ρ^S [ F(x̃^0) − F(x*) ].

We also provide linear convergence guarantees for Algorithm 2 with Option I for solving non-smooth and strongly convex functions, as stated in Theorem 7 in the Supplementary Material. Similarly, the linear convergence of Algorithm 2 can be guaranteed for the smooth, strongly convex case. All these theoretical results show that VR-SGD attains linear convergence with an oracle complexity of at most O((n + L/μ) log(1/ε)) for both smooth and non-smooth strongly convex functions. In contrast, the convergence of SVRG [1] is only guaranteed for smooth and strongly convex problems.

Although the learning rate in Theorem 4 needs to stay below a problem-dependent threshold, we can use much larger learning rates in practice. Moreover, it can be easily verified that the theoretical restriction on the learning rate of SVRG is tighter, and adopting a larger learning rate for SVRG is not always helpful in practice, which means that VR-SGD can use much larger learning rates than SVRG both in theory and in practice. In other words, although they have the same theoretical convergence rate, VR-SGD converges significantly faster than SVRG in practice, as shown by our experiments. Note that, similar to the convergence analysis in [4, 5, 58], the convergence of VR-SGD for some non-convex problems can also be guaranteed.

4.3 Equivalent to Its Momentum Accelerated Variant

Inspired by the success of the momentum technique in our previous work [6, 50, 75], we present a momentum accelerated variant of Algorithm 2, as shown in Algorithm 3. Unlike existing momentum techniques, e.g., [19, 22, 25, 39, 40, 48], we use a convex combination of the snapshot x̃^{s−1} and the latest iterate y_k^s for acceleration, i.e., x_k^s = θ_s y_k^s + (1 − θ_s) x̃^{s−1}. It is not hard to verify that Algorithm 2 with Option I is equivalent to its momentum accelerated variant (i.e., Algorithm 3 with Option I) when θ_s ≡ 1 and the learning rate is sufficiently small (see the Supplementary Material for the equivalence analysis). We emphasize that the only difference between Options I and II in Algorithm 3 lies in the setting of the momentum parameter θ_s and the initialization of the next epoch.

Theorem 5.

Suppose Assumption 1 holds. Then Algorithm 3 with Option II achieves an ε-suboptimal solution x̂ (i.e., E[F(x̂)] − F(x*) ≤ ε) using at most O(n log(1/ε) + √(nL/ε)) iterations.

This theorem shows that the oracle complexity of Algorithm 3 with Option II is consistent with that of Katyusha [48], and is better than that of accelerated deterministic methods (e.g., AGD [20]), i.e., O(n√(L/ε)). This is also verified by the experimental results in Fig. 2. Our algorithm thus achieves the optimal convergence rate O(1/T²) for non-strongly convex functions, as in [48, 49]. Fig. 2 shows that Katyusha and Algorithm 3 with Option II achieve performance similar to that of Algorithms 2 and 3 with Option I in terms of the number of effective passes. Clearly, Algorithm 3 and Katyusha have a higher per-iteration complexity than Algorithm 2. Thus, we only report the results of VR-SGD (i.e., Algorithm 2) in Section 5.

0:  Input: the number of epochs S, and the number of iterations m per epoch.
0:  Initialize: y_0^1 = x̃^0, the learning rate η, and the momentum parameters {θ_s}.
1:  for s = 1, 2, …, S do
2:     μ̃ = (1/n) Σ_{i=1}^{n} ∇f_i(x̃^{s−1}); Compute the full gradient
3:     Option I: θ_s = θ (a constant), or Option II: θ_s = 2/(s+1);
4:     for k = 0, 1, …, m−1 do
5:        Pick i_k uniformly at random from [n];
6:        x_k^s = θ_s y_k^s + (1 − θ_s) x̃^{s−1}; Momentum coupling of snapshot and latest iterate
7:        ∇̃f_{i_k}(x_k^s) = ∇f_{i_k}(x_k^s) − ∇f_{i_k}(x̃^{s−1}) + μ̃;
8:        y_{k+1}^s = y_k^s − η[∇̃f_{i_k}(x_k^s) + ∇g(x_k^s)];
9:     end for
10:     x̃^s = (1/m) Σ_{k=1}^{m} [θ_s y_k^s + (1 − θ_s) x̃^{s−1}]; Average the coupled iterates for the snapshot
11:     Option I: y_0^{s+1} = y_m^s, or Option II: y_0^{s+1} = x̃^s;
12:  end for
Output: x̂, chosen as the better (in objective value) of the snapshot x̃^S and the last coupled iterate, as in Algorithm 2.
Algorithm 3 The momentum accelerated algorithm
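The following is a rough Python sketch of the momentum coupling described above; the θ value and the averaged-snapshot update are our own assumptions, so this is only a sketch of the idea rather than a faithful implementation of Algorithm 3.

import numpy as np

def momentum_vr_sgd(grad_fn, n, d, eta, theta=0.5, S=30, m=None, seed=0):
    # Momentum accelerated variant: couple the latest iterate y with the
    # snapshot via x = theta*y + (1-theta)*x_snap, then step on y.
    m = 2 * n if m is None else m
    rng = np.random.default_rng(seed)
    y = np.zeros(d)
    x_snap = y.copy()
    for _ in range(S):
        mu = np.mean([grad_fn(x_snap, i) for i in range(n)], axis=0)
        x_sum = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            x = theta * y + (1 - theta) * x_snap          # convex-combination coupling
            v = grad_fn(x, i) - grad_fn(x_snap, i) + mu   # SVRG estimator (5)
            y = y - eta * v                               # gradient step on y
            x_sum += theta * y + (1 - theta) * x_snap
        x_snap = x_sum / m    # assumed: snapshot = average of the coupled iterates
    return x_snap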
(a) Small dataset: MNIST
(b) Large dataset: Epsilon
Fig. 2: Comparison of AGD [20], Katyusha [48], Algorithm 3 with Options I and II, and VR-SGD for solving logistic regression.

4.4 Complexity Analysis

From Algorithm 2, we can see that the per-iteration cost of VR-SGD is dominated by the computation of ∇f_{i_k}(x_k^s), ∇f_{i_k}(x̃^{s−1}), and ∇g(x_k^s) or the proximal update in (13). Thus, the per-iteration complexity is O(d), which is as low as that of SVRG [1] and Prox-SVRG [34]. In fact, for some ERM problems, we can save the intermediate gradients ∇f_i(x̃^{s−1}) in the computation of μ̃, which generally requires O(n) additional storage. As a result, each epoch only requires (m + n) component gradient evaluations. In addition, for extremely sparse data, we can introduce the lazy update tricks in [38, 76, 77] into our algorithm, and perform the update steps in (12) and (13) only for the non-zero dimensions of each sample, rather than for all dimensions. In other words, the per-iteration complexity of VR-SGD can be improved from O(d) to O(d′), where d′ ≤ d is the sparsity of the feature vectors. Moreover, VR-SGD has a much lower per-iteration complexity than existing accelerated stochastic variance reduction methods such as Katyusha [48], which have more update steps for additional variables, as shown in (9a)-(9c).

5 Experimental Results

In this section, we evaluate the performance of VR-SGD for solving a number of convex and non-convex ERM problems (such as logistic regression, Lasso and ridge regression), and compare its performance with several state-of-the-art stochastic variance reduced methods (including SVRG [1], Prox-SVRG [34], SAGA [31]) and accelerated methods, such as Catalyst [40] and Katyusha [48]. Moreover, we apply VR-SGD to solve other machine learning problems, such as ERM with non-convex loss and leading eigenvalue computation.

(a) ℓ2-norm regularized logistic regression
(b) ℓ1-norm regularized logistic regression
Fig. 3: Comparison of deterministic and stochastic methods on Adult.
Option I:    x̃^s = x_m^s   and   x_0^{s+1} = x_m^s
Option II:   x̃^s = (1/m) Σ_{k=1}^{m} x_k^s   and   x_0^{s+1} = x̃^s
Option III:  x̃^s = (1/m) Σ_{k=1}^{m} x_k^s   and   x_0^{s+1} = x_m^s
TABLE II: The three choices of snapshot and starting points for stochastic variance reduction optimization.
Fig. 4: Comparison of the algorithms with Options I, II, and III for solving (a) ridge regression and (b) Lasso on Covtype. In each plot, the vertical axis shows the objective value minus the minimum, and the horizontal axis denotes the number of effective passes.

5.1 Experimental Setup

We used several publicly available data sets in the experiments: Adult (also called a9a), Covtype, Epsilon, MNIST, and RCV1, all of which can be downloaded from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/). It should be noted that each sample of these data sets was normalized to have unit length, as in [34, 36], which leads to the same upper bound on the Lipschitz constants, i.e., L_i ≡ L for all f_i(·). As suggested in [1, 34, 48], the epoch length is set to m = 2n for the stochastic variance reduced methods SVRG [1], Prox-SVRG [34], Catalyst [40], and Katyusha [48], as well as VR-SGD. Then the only parameter we have to tune by hand is the learning rate η; more specifically, we select the learning rate from a small grid of candidate values. Since Katyusha has a much higher per-iteration complexity than SVRG and VR-SGD, we compare their performance in terms of both the number of effective passes and the running time (seconds), where computing a single full gradient or evaluating n component gradients is counted as one effective pass over the data. For a fair comparison, we implemented SVRG, Prox-SVRG, SAGA, Catalyst, Katyusha, and VR-SGD in C++ with a Matlab interface, as well as their sparse versions with lazy update tricks, and performed all the experiments on a PC with an Intel i5-4570 CPU and 16GB RAM. The source code of all the methods is available at https://github.com/jnhujnhu/VR-SGD.

(a) Adult
(b) Covtype
Fig. 5: Comparison of SVRG [1], Katyusha [48], VR-SGD and their proximal versions for solving ridge regression problems. In each plot, the vertical axis shows the objective value minus the minimum, and the horizontal axis is the number of effective passes over the data.

5.2 Deterministic Methods vs. Stochastic Methods

In this subsection, we compare the performance of stochastic methods (including SGD, SVRG, Katyusha, and VR-SGD) with that of deterministic methods such as AGD [19, 20] and APG [22] for solving strongly convex and non-strongly convex problems. The momentum parameter of AGD is set as in [78], while that of APG is set to (t_{k−1} − 1)/t_k for all k ≥ 1 [22], where t_0 = 1 and t_k = (1 + √(1 + 4 t_{k−1}²))/2.