Guaranteed Sufficient Decrease for Variance Reduced Stochastic Gradient Descent

03/20/2017, by Fanhua Shang, et al.

In this paper, we propose a novel sufficient decrease technique for variance reduced stochastic gradient descent methods such as SAG, SVRG and SAGA. To achieve sufficient decrease in stochastic optimization, we design a new sufficient decrease criterion, which yields sufficient decrease versions of variance reduction algorithms, such as SVRG-SD and SAGA-SD, as a byproduct. We introduce a coefficient that scales the current iterate so as to satisfy the sufficient decrease property, and that decides whether to shrink, expand, or move in the opposite direction; we then give two specific update rules of this coefficient for Lasso and ridge regression. Moreover, we analyze the convergence properties of our algorithms for strongly convex problems, showing that both algorithms attain linear convergence rates. We also provide convergence guarantees for non-strongly convex problems. Our experimental results further verify that our algorithms achieve significantly better performance than their counterparts.


1 Introduction

Stochastic gradient descent (SGD) has been successfully applied to many large-scale machine learning problems [15, 36], by virtue of its low per-iteration cost. However, standard SGD estimates the gradient from only one or a few samples, and thus the variance of the stochastic gradient estimator may be large [13, 37], which leads to slow convergence and poor performance. In particular, even under the strongly convex (SC) condition, the convergence rate of standard SGD is only sub-linear. Recently, the convergence rate of SGD has been improved by various variance reduction methods, such as SAG [28], SDCA [30], SVRG [13], SAGA [7], Finito [8], MISO [21], and their proximal variants, such as [29], [31] and [34]. Under the SC condition, these variance reduced SGD (VR-SGD) algorithms achieve linear convergence rates.
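To make the variance issue concrete, here is a minimal Python sketch (not from the cited works; the least-squares loss and all names are illustrative) contrasting a single-sample SGD step, whose gradient estimate is unbiased but noisy, with a deterministic full-gradient step:

```python
import numpy as np

def sgd_step(x, A, b, eta, rng):
    """One plain SGD step for least squares using a single random sample:
    the per-sample gradient is an unbiased but high-variance estimate."""
    i = rng.integers(A.shape[0])
    g = (A[i] @ x - b[i]) * A[i]          # stochastic gradient from one sample
    return x - eta * g

def full_gradient_step(x, A, b, eta):
    """One deterministic gradient step on the same least-squares objective."""
    g = A.T @ (A @ x - b) / A.shape[0]    # exact full gradient
    return x - eta * g
```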

Very recently, many techniques have been proposed to further speed up the VR-SGD methods mentioned above. These techniques include importance sampling [37], exploiting neighborhood structure in the training data to share and re-use information about past stochastic gradients [12], incorporating Nesterov’s acceleration techniques [19, 25] or momentum acceleration tricks [1], reducing the number of gradient computations in the early iterations [3, 4, 35], and exploiting the projection-free property of the conditional gradient method [11]. [2] and [27] proved that SVRG and SAGA with minor modifications can asymptotically converge to a stationary point of non-convex problems.

So far, the two most popular stochastic gradient estimators are the SVRG estimator, independently introduced by [13, 35], and the SAGA estimator [7]. Both estimators may be very different from their full gradient counterparts, so moving along such a direction may no longer decrease the objective function, as stated in [1]. To address this problem, inspired by the success of sufficient decrease methods for deterministic optimization such as [18, 33], we propose a novel sufficient decrease technique for a class of VR-SGD methods, including the widely used SVRG and SAGA methods. Notably, our method with partial sufficient decrease achieves an average per-iteration time complexity as low as that of the original SVRG and SAGA methods. We summarize our main contributions below.

  • To achieve sufficient decrease in stochastic optimization, we design a sufficient decrease strategy that further reduces the cost function, in which we also introduce a coefficient that decides whether to shrink, expand, or move in the opposite direction.

  • We incorporate our sufficient decrease technique, together with momentum acceleration, into two representative algorithms, SVRG and SAGA, which leads to SVRG-SD and SAGA-SD. Moreover, we give two specific update rules of the coefficient for Lasso and ridge regression problems as notable examples.

  • Moreover, we analyze the convergence properties of SVRG-SD and SAGA-SD, which show that SVRG-SD and SAGA-SD converge linearly for SC objective functions. Unlike most of the VR-SGD methods, we also provide the convergence guarantees of SVRG-SD and SAGA-SD for non-strongly convex (NSC) problems.

  • Finally, we show by experiments that SVRG-SD and SAGA-SD achieve significantly better performance than SVRG [13] and SAGA [7]. Compared with Katyusha [1], one of the best-known accelerated stochastic methods, our methods also perform much better in most cases.

2 Preliminary and Related Work

In this paper, we consider the following composite convex optimization problem:

(1)    $\min_{x\in\mathbb{R}^d} F(x) := f(x) + g(x), \quad f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x),$

where the $f_i(x)$ are the smooth convex component functions, and $g(x)$ is a relatively simple convex (but possibly non-differentiable) function. Recently, many VR-SGD methods [13, 28, 34, 35] have been proposed for special cases of (1). Under smoothness and SC assumptions, SAG [28] achieves a linear convergence rate. A recent line of work, such as [13, 34], attains convergence rates similar to SAG but without the memory requirement for all gradients. SVRG [13] begins with an initial estimate $\tilde{x}$, sets $x_0 = \tilde{x}$, and then generates a sequence of iterates $x_t$ ($t = 1, \dots, m$, where $m$ is usually set to a small multiple of $n$) using

(2)    $x_t = x_{t-1} - \eta\left(\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x}) + \tilde{\mu}\right),$

where $\eta$ is the step size, $\tilde{\mu} = \nabla f(\tilde{x})$ is the full gradient at $\tilde{x}$, and $i_t$ is chosen uniformly at random from $\{1, \dots, n\}$. After every $m$ stochastic iterations, we set $\tilde{x} = x_m$ and reset $t = 0$ and $x_0 = \tilde{x}$. Unfortunately, most of the VR-SGD methods [8, 30, 34], including SVRG, only have convergence guarantees for smooth and SC problems. However, $F(x)$ may be NSC in many machine learning applications, such as Lasso. [7] proposed SAGA, a fast incremental gradient method in the spirit of SAG and SVRG, which works for both SC and NSC objective functions, as well as in proximal settings. Its main update rule is formulated as follows:

(3)    $x_t = \mathrm{prox}_{\eta g}\!\Big(x_{t-1} - \eta\Big[\nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\phi^{t-1}_{i_t}) + \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\nabla f_i(\phi^{t-1}_i)\Big]\Big),$

where $\phi^{t}_{i}$ is updated for all $i$ as follows: $\phi^{t}_{i} = x_{t-1}$ if $i = i_t$, and $\phi^{t}_{i} = \phi^{t-1}_{i}$ otherwise, and the proximal operator is defined as $\mathrm{prox}_{\eta g}(y) := \arg\min_{x}\ \tfrac{1}{2\eta}\|x - y\|^2 + g(x)$.
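To make the SVRG update (2) concrete, the following minimal Python sketch implements plain SVRG for ridge regression; it is an illustrative reimplementation of the standard method, not of SVRG-SD, and the default step size and epoch length are placeholder values.

```python
import numpy as np

def svrg_ridge(A, b, lam, eta=0.1, epochs=20, m=None, rng=None):
    """Minimal SVRG for ridge regression:
    min_x (1/2n)||Ax - b||^2 + (lam/2)||x||^2 (illustrative, not SVRG-SD)."""
    n, d = A.shape
    m = m or 2 * n                                             # epoch length; 2n is a common choice
    rng = rng or np.random.default_rng(0)
    grad_i = lambda x, i: (A[i] @ x - b[i]) * A[i] + lam * x   # per-sample gradient
    x = np.zeros(d)
    for _ in range(epochs):
        snap = x.copy()
        mu = A.T @ (A @ snap - b) / n + lam * snap             # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snap, i) + mu            # SVRG variance-reduced gradient
            x = x - eta * v
    return x
```

For instance, `svrg_ridge(A, b, lam=0.1)` returns an approximate minimizer for a given data matrix `A` and target vector `b`.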

The technique of sufficient decrease (e.g., the well-known line search technique [22]) has been studied for deterministic optimization [18, 33]. For example, [18] proposed the following sufficient decrease condition for deterministic optimization:

(4)

where the leading coefficient is a small positive constant. Similar to this strategy for deterministic optimization, in this paper we design a novel sufficient decrease technique for stochastic optimization, which is used to further reduce the cost function and speed up convergence.
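For readers unfamiliar with such conditions, one common Armijo-type form is shown below purely as an illustration; the precise form and constants of condition (4) are those of [18] and may differ.

```latex
% A generic sufficient decrease (Armijo-type) condition: an illustrative form only,
% not claimed to be identical to condition (4) of [18].
f(x_{k+1}) \;\le\; f(x_k) \;-\; \frac{c}{2}\,\| x_{k+1} - x_k \|^2, \qquad c > 0 .
```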

3 Variance Reduced SGD with Sufficient Decrease

In this section, we propose a novel sufficient decrease technique for VR-SGD methods, including the widely used SVRG and SAGA methods. To achieve sufficient decrease in stochastic optimization, we design a sufficient decrease strategy to further reduce the cost function. A coefficient is then introduced to satisfy the sufficient decrease condition; it decides whether to shrink, expand, or move in the opposite direction. Moreover, we present two sufficient decrease VR-SGD algorithms with momentum acceleration, SVRG-SD and SAGA-SD, and give two specific schemes to compute the coefficient for Lasso and ridge regression.

3.1 Our Sufficient Decrease Technique

Consider the iterate produced at a given inner-iteration of a given outer-iteration. Unlike in the full gradient method, the stochastic gradient estimator is somewhat inaccurate (i.e., it may be very different from the full gradient), so further moving along the update direction may not decrease the objective value anymore [1]. That is, the objective at the new point may be larger than at the current iterate even for a very small step length. Motivated by this observation, we design a factor to scale the current iterate so as to decrease the objective function. For SVRG-SD, the cost function with respect to this factor is formulated as follows:

(5)

where one parameter trades off the two terms and a small constant is set to 0.1. The second term in (5) involves the norm of the residual of stochastic gradients, and plays the same role as the second term on the right-hand side of (4). Different from existing sufficient decrease techniques, including (4), a varying factor rather than a constant is introduced to scale the iterate and the coefficient of the second term of (5); it plays a similar role to the step-size parameter optimized via line search in deterministic optimization. However, line search techniques generally have a high computational cost, which limits their applicability to stochastic optimization [20].

For SAGA-SD, the cost function can be revised by simply replacing the gradient estimator with the one defined below. Note that the coefficient is a scalar and decides whether to shrink, expand, or move in the opposite direction of the current iterate. The detailed schemes to calculate it for Lasso and ridge regression are given in Section 3.3. We first present the following sufficient decrease condition, in the statistical sense, for stochastic optimization.

Property 1.

Given the current iterate and the solution of (5), the following inequality holds:

(6)

where for SVRG-SD.

It is not hard to verify that the objective can be further decreased via our sufficient decrease technique when the current iterate is scaled by the coefficient. Indeed, the inequality in (6) is still satisfied in the degenerate special case. Moreover, Property 1 can be extended to SAGA-SD, as well as to other VR-SGD algorithms such as SAG and SDCA. Unlike in the sufficient decrease conditions for deterministic optimization [18, 33], the coefficient may be negative, which means moving in the opposite direction of the current iterate.

3.2 Momentum Acceleration

In this part, we first design the update rule for the key variable with the coefficient as follows:

(7)

where the weighting constant can simply be set to a fixed value, which also works well in practice. In fact, the second term on the right-hand side of (7) plays a momentum acceleration role, as in batch and stochastic optimization [1, 23, 25]. That is, by introducing this term, we can utilize previous gradient information in the update. In addition, the update rule of the iterate is given by

(8)

where $L$ is a Lipschitz constant (see Assumption 1 below), another quantity denotes a constant, and the stochastic gradient estimator can be either of the two most popular choices: the SVRG estimator [13, 35] for SVRG-SD and the SAGA estimator [7] for SAGA-SD, defined as follows:

(9)    $\tilde{\nabla} f_{i_t}(x_{t-1}) = \nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\tilde{x}) + \tilde{\mu}$   or   $\tilde{\nabla} f_{i_t}(x_{t-1}) = \nabla f_{i_t}(x_{t-1}) - \nabla f_{i_t}(\phi_{i_t}) + \tfrac{1}{n}\textstyle\sum_{i=1}^{n}\nabla f_i(\phi_i),$

respectively. For SAGA-SD, we need to set $\phi_{i_t} = x_{t-1}$ and store $\nabla f_{i_t}(\phi_{i_t})$ in the gradient table, similar to [7]; all other entries in the table remain unchanged, and the last term of the SAGA estimator is the table average. From (8), it is clear that our algorithms can tackle non-smooth problems directly, as in [7].
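As a complement, the following minimal Python sketch maintains a SAGA-style gradient table as described above; the names `grad_table` and `table_avg` are illustrative, and the table is assumed to be initialized with the per-sample gradients at the starting point (with `table_avg` their mean).

```python
import numpy as np

def saga_vr_gradient(x, i, grad_i, grad_table, table_avg, n):
    """Return the SAGA variance-reduced gradient at x for sample i, and update
    the gradient table and its running average in place (only entry i changes)."""
    g_new = grad_i(x, i)
    g_old = grad_table[i].copy()
    v = g_new - g_old + table_avg          # SAGA estimator
    table_avg += (g_new - g_old) / n       # keep the table average consistent
    grad_table[i] = g_new
    return v
```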

Input: the number of epochs, the number of iterations per epoch, and the step size.
Initialize: the starting point (Case of SC), or the starting point and an auxiliary variable (Case of NSC).
1:  for each epoch do
2:     Case of SC: reset the snapshot-related variables;  or  Case of NSC: reset the snapshot- and momentum-related variables;
3:     Compute the full gradient at the snapshot point;
4:     for each inner iteration do
5:        Pick an index uniformly at random from {1, …, n};
6:        Compute the stochastic gradient estimator and the iterate by (9) and (8);
7:        Update the sufficient decrease coefficient and the momentum variable by (5) and (7);
8:     end for
9:     Update the snapshot point;  Case of NSC: also update the auxiliary variable;
10:  end for
Output: the final snapshot (SC), or an appropriately chosen or averaged iterate (NSC).
Algorithm 1 SVRG-SD

In summary, we propose a novel variant of SVRG with sufficient decrease (SVRG-SD) to solve both SC and NSC problems, as outlined in Algorithm 1; the SC and NSC cases use different parameter settings, initializations, and outputs (see Algorithm 1). Similarly, we also present a novel variant of SAGA with sufficient decrease (SAGA-SD), as shown in the Supplementary Material. The main differences between them are the stochastic gradient estimators in (9) and the update rule of the sufficient decrease coefficient in (5).
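The following Python skeleton mirrors the control flow of Algorithm 1. The helpers `prox_step`, `solve_theta` and `momentum_update` are hypothetical placeholders for the updates in (8), (5) and (7), whose exact forms are given in the paper; only the loop structure is intended to be faithful.

```python
import numpy as np

def svrg_sd_skeleton(grad_i, full_grad, prox_step, solve_theta, momentum_update,
                     x0, n, epochs, m, eta, rng=None):
    """Control-flow skeleton of Algorithm 1 (SVRG-SD). prox_step, solve_theta and
    momentum_update stand in for the paper's updates (8), (5) and (7)."""
    rng = rng or np.random.default_rng(0)
    x = np.asarray(x0, dtype=float).copy()
    for _ in range(epochs):
        snap = x.copy()                               # snapshot point for this epoch
        mu = full_grad(snap)                          # full gradient at the snapshot
        for _ in range(m):
            i = rng.integers(n)
            v = grad_i(x, i) - grad_i(snap, i) + mu   # SVRG estimator, cf. (9)
            y = prox_step(x, v, eta)                  # proximal/gradient update, cf. (8)
            theta = solve_theta(y, v)                 # sufficient decrease coefficient, cf. (5)
            x = momentum_update(theta, y, x)          # scaling + momentum, cf. (7)
        # the snapshot is refreshed at the top of the next epoch
    return x
```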

Note that when the sufficient decrease and momentum steps are switched off (e.g., the scaling coefficient is fixed to 1 and the momentum weight to 0), the proposed SVRG-SD and SAGA-SD degenerate to the original SVRG (or its proximal variant, Prox-SVRG [34]) and SAGA [7], respectively. In this sense, SVRG, Prox-SVRG and SAGA can be seen as special cases of the proposed algorithms. Like SVRG and SVRG-SD, SAGA-SD is also a multi-stage algorithm, whereas SAGA is a single-stage algorithm.

3.3 Coefficients for Lasso and Ridge Regression

In this part, we give the closed-form solutions of the coefficient for Lasso and ridge regression problems. For Lasso problems, the regularizer is the $\ell_1$-norm, and the closed-form solution of (5) for SVRG-SD can be obtained as follows:

(10)

where $A \in \mathbb{R}^{n\times d}$ is the data matrix containing the $n$ data samples, and the operator involved is the so-called soft-thresholding operator [9], applied with an appropriate threshold.

For ridge regression problems, with the squared $\ell_2$-norm regularizer, the closed-form solution of (5) for SVRG-SD is given by

(11)

In the same way as in (10) and (11), we can compute the coefficient of SAGA-SD for Lasso and ridge regression problems, and revise the update rules in (10) and (11) by simply replacing the SVRG estimator with the SAGA estimator. We can also derive the update rule of the coefficient for other loss functions using their approximations, e.g., [5] for logistic regression.
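Since (10) is expressed through the soft-thresholding operator, the following minimal sketch shows the standard operator; the threshold value `tau` is a stand-in for the paper's threshold, which depends on the quantities appearing in (10).

```python
import numpy as np

def soft_threshold(z, tau):
    """Standard soft-thresholding operator: shrink z toward zero by tau."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)
```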

3.4 Efficient Implementation

Both (10) and (11) require computations involving the data matrix, so the corresponding quantities are precomputed and saved in the initialization stage. To further reduce the computational complexity of these computations in (10) and (11), we use a fast partial singular value decomposition to obtain the best low-rank approximation to the data matrix and save its factors. In practice, e.g., in our experiments, the rank can be set to a small number that captures 99.5% of the spectral energy of the data matrix (e.g., for the Covtype data set), similar to inexact line search methods [22] for deterministic optimization.

The time complexity of each inner-iteration in the proposed SVRG-SD and SAGA-SD with full sufficient decrease is slightly higher than that of SVRG and SAGA. In fact, we can randomly select only a small fraction of the stochastic gradient iterations in each epoch to update with sufficient decrease, and run the remaining iterations without it. Let the number of iterations with our sufficient decrease technique in each epoch be fixed; then, without increasing the difficulty of parameter tuning, SVRG-SD and SAGA-SD (note that SVRG-SD and SAGA-SD with partial sufficient decrease possess convergence properties similar to those of their full sufficient decrease versions, because Property 1 still holds in this setting) always converge much faster than their counterparts, SVRG and SAGA, as shown in Figure 1. It is easy to see that our algorithms are very robust with respect to this choice, and achieve an average per-iteration time complexity as low as that of the original SVRG and SAGA. Thus, we mainly consider SVRG-SD and SAGA-SD with partial sufficient decrease.
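As an illustration of the precomputation step described above, the following sketch selects the smallest rank that captures 99.5% of the spectral energy and returns the corresponding low-rank factors; the function name and the use of a full SVD are illustrative choices, not the paper's implementation.

```python
import numpy as np

def low_rank_factors(A, energy=0.995):
    """Choose the smallest rank r whose singular values capture `energy` of the
    spectral energy of A, and return the rank-r factors (U_r * s_r, V_r^T).
    A full SVD is used here for clarity; a partial/randomized SVD
    (e.g., scipy.sparse.linalg.svds) avoids the full decomposition."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    cum = np.cumsum(s ** 2) / np.sum(s ** 2)
    r = int(np.searchsorted(cum, energy)) + 1
    return U[:, :r] * s[:r], Vt[:r]
```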

Figure 1: Comparison of SVRG-SD and SAGA-SD with different numbers of sufficient decrease iterations per epoch, and of their counterparts, for ridge regression on the Covtype data set.

4 Convergence Guarantees

In this section, we provide the convergence analysis of SVRG-SD and SAGA-SD for both SC and NSC cases. In this paper, we consider the problem (1) under the following standard assumptions.

Assumption 1.

Each convex component function $f_i(x)$ is $L$-smooth, i.e., there exists a constant $L > 0$ such that for any $x, y \in \mathbb{R}^d$, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\,\|x - y\|$.

Assumption 2.

$F(x)$ is $\mu$-strongly convex, i.e., there exists a constant $\mu > 0$ such that for any $x, y \in \mathbb{R}^d$,

(12)    $F(y) \ge F(x) + \xi^{T}(y - x) + \frac{\mu}{2}\|y - x\|^{2}, \quad \xi \in \partial F(x),$

where $\partial F(x)$ is the subdifferential of $F(\cdot)$ at $x$. If $F(\cdot)$ is smooth, we can revise the inequality (12) by simply replacing the subgradient $\xi$ with the gradient $\nabla F(x)$.

4.1 Convergence Analysis of SVRG-SD

In this part, we analyze the convergence property of SVRG-SD for both SC and NSC cases. The first main result is the following theorem, which provides the convergence rate of SVRG-SD.

Theorem 1.

Suppose Assumption 1 holds. Let $x^{\star}$ be the optimal solution of Problem (1), and let the iterate sequence be produced by SVRG-SD with the parameter settings specified in the paper; then the per-epoch contraction bound stated in the paper holds, where the involved constants depend on the problem and algorithm parameters.

The proof of Theorem 1 and the definitions of these constants are given in the Supplementary Material. The linear convergence of SVRG-SD follows immediately.

Corollary 1 (SC).

Suppose each $f_i(x)$ is $L$-smooth and $F(x)$ is $\mu$-strongly convex. With a suitable step size and momentum weight, and with the epoch length sufficiently large so that the resulting contraction factor is smaller than one,

SVRG-SD attains geometric convergence in expectation.

The proof of Corollary 1 is given in the Supplementary Material. From Corollary 1, one can see that SVRG-SD has a linear convergence rate for SC problems. A corresponding contraction factor is derived for the proximal variant of SVRG in [34]. For a fair comparison, we use the same parameter settings for SVRG and SVRG-SD; one can then verify that the contraction factor of SVRG-SD is smaller than that of SVRG. Thus, SVRG-SD can significantly improve the convergence rate of SVRG in practice, which is confirmed by the experimental results below.

Unlike most VR-SGD methods [13, 34], including SVRG, we also provide a convergence result for SVRG-SD in the NSC case, as shown below.

Corollary 2 (NSC).

Suppose each $f_i(x)$ is $L$-smooth. With a suitable step size and momentum weight, and a sufficiently large epoch length, the convergence bound stated in the paper holds.

The proof of Corollary 2 is provided in the Supplementary Material. The constant in the bound arises from the sufficient decrease strategy, which implies that the bound in Corollary 2 can be further improved by using our sufficient decrease strategy with an even larger coefficient.

4.2 Convergence Analysis of SAGA-SD

In this part, we analyze the convergence property of SAGA-SD for both SC and NSC cases. The following lemma provides the upper bound on the expected variance of the gradient estimator in (9) (i.e., the SAGA estimator [7]), and its proof is given in the Supplementary Material.

Lemma 1.

Suppose Assumption 1 holds. Then the following inequality holds

Theorem 2 (SC).

Suppose $F(x)$ is $\mu$-strongly convex and each $f_i(x)$ is $L$-smooth. With the same notation as in Theorem 1, a suitable step size and momentum weight, and a sufficiently large epoch length so that the resulting contraction factor is smaller than one,

SAGA-SD attains geometric convergence in expectation.

The proof of Theorem 2 is provided in the Supplementary Material. Theorem 2 shows that SAGA-SD also attains linear convergence similar to SVRG-SD. Like Corollary 2, we also provide the convergence guarantee of SAGA-SD for NSC problems, as shown below.

Corollary 3 (NSC).

Suppose each $f_i(x)$ is $L$-smooth. With the same notation as in Theorem 2 and suitably chosen parameters, the convergence bound stated in the paper holds.

The proof of Corollary 3 is provided in the Supplementary Material. Theorem 2 and Corollary 3 imply that SAGA-SD can significantly improve the convergence rate of SAGA [7] in both the SC and NSC cases, which is confirmed by our experimental results.

As suggested in [10] and [19], one can add a proximal term to a non-strongly convex objective function as follows: $F_{\sigma}(x) := F(x) + \frac{\sigma}{2}\|x - y\|^{2}$, where $\sigma$ is a constant that can be determined as in [10, 19], and $y$ is a proximal point. The condition number of this proximal function can then be much smaller than that of the original function $F(x)$ if $\sigma$ is sufficiently large. However, adding the proximal term may degrade the performance of the involved algorithms both in theory and in practice [3]. Therefore, we directly use SVRG-SD and SAGA-SD to solve non-strongly convex objectives.

Figure 2: Comparison of different VR-SGD methods for solving ridge regression problems on Ijcnn1 (left), Covtype (center), and SUSY (right). The vertical axis is the objective value minus the minimum, and the horizontal axis denotes the number of effective passes over the data.

5 Experimental Results

In this section, we evaluate the performance of SVRG-SD and SAGA-SD, and compare them with their counterparts, including SVRG [13], its proximal variant (Prox-SVRG) [34], and SAGA [7]. Moreover, we also report the performance of the well-known accelerated VR-SGD methods Catalyst [19] and Katyusha [1]. For a fair comparison, we implemented all the methods in C++ with a Matlab interface (all code is made available; see the link in the Supplementary Material), and performed all the experiments on a PC with an Intel i5-2400 CPU and 16GB RAM.

5.1 Ridge Regression

Our experiments were conducted on three popular data sets, Covtype, Ijcnn1 and SUSY, all of which were obtained from the LIBSVM Data website (https://www.csie.ntu.edu.tw/~cjlin/libsvm/); more details and the regularization parameters are given in the Supplementary Material. Following [34], each feature vector in these data sets has been normalized to unit length, which leads to the same upper bound on the Lipschitz constants of the component functions. This step is for comparison only and is not necessary in practice. We focus on ridge regression as the SC example. For SVRG-SD and SAGA-SD, we used the same parameter setting on the three data sets. In addition, unlike SAGA [7], we fixed the epoch length for SAGA-SD. For SVRG-SD, Catalyst, Katyusha, SVRG and its proximal variant, we set the epoch size as suggested in [1, 13, 34]. Each method had its step size chosen so as to give the fastest convergence.
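A minimal sketch of the normalization step described above, assuming the samples are the rows of the data matrix `A`:

```python
import numpy as np

def normalize_rows(A):
    """Scale each sample (row of A) to unit Euclidean norm, so that all
    per-sample Lipschitz constants share the same upper bound."""
    norms = np.linalg.norm(A, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0          # leave all-zero rows unchanged
    return A / norms
```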

Figure 2 shows how the objective gap, i.e., $F(x) - F(x^{\star})$, of all these algorithms decreases for ridge regression problems with the chosen regularization parameter (more results are given in the Supplementary Material). Note that the horizontal axis denotes the number of effective passes over the data. As seen in these figures, SVRG-SD and SAGA-SD achieve consistent speedups on all the data sets, and significantly outperform their counterparts, SVRG and SAGA, in all settings. This confirms that our sufficient decrease technique is able to accelerate SVRG and SAGA. Impressively, SVRG-SD and SAGA-SD usually converge much faster than the well-known accelerated VR-SGD methods Catalyst and Katyusha, which further justifies the effectiveness of our sufficient decrease stochastic optimization method.

(a) Lasso. (b) Elastic-net regularized Lasso.
Figure 3: Comparison of different VR-SGD methods for solving Lasso and elastic-net regularized Lasso problems on the three data sets: Ijcnn1 (left), Covtype (center), and SUSY (right).

5.2 Lasso and Elastic-Net Regularized Lasso

We also conducted experiments on Lasso and elastic-net regularized Lasso problems. We plot some representative results in Figure 3 (see Figures 3 and 4 in the Supplementary Material for more results), which show that SVRG-SD and SAGA-SD significantly outperform their counterparts (i.e., Prox-SVRG and SAGA) in all settings, as well as Catalyst, and are considerably better than Katyusha in most cases. This empirically verifies that our sufficient decrease technique can accelerate SVRG and SAGA for solving both SC and NSC objectives.

6 Conclusion & Future Work

To the best of our knowledge, this is the first work to design an efficient sufficient decrease technique for stochastic optimization. Moreover, we proposed two different schemes, for Lasso and ridge regression, to efficiently update the coefficient, which decides whether to shrink, expand, or move in the opposite direction. This is very different from adaptive learning rate methods, e.g., [14], and line search methods, e.g., [20], none of which can address the issue described in Section 3.1, whatever the step size is. Unlike most VR-SGD methods [13, 30, 34], which only have convergence guarantees for SC problems, we provided convergence guarantees for our algorithms in both the SC and NSC cases. Experimental results verified the effectiveness of our sufficient decrease technique for stochastic optimization. Naturally, it can also be used to further speed up accelerated VR-SGD methods such as [1, 3, 19].

As each component function can have a different degree of smoothness, selecting the random index from a non-uniform distribution is a much better choice than simple uniform random sampling [37], as is without-replacement sampling over with-replacement sampling [32]. On the practical side, both of our algorithms tackle NSC and non-smooth problems directly, without using any quadratic regularizer as in [1, 19] or resorting to proximal reformulations. Note that some asynchronous parallel and distributed variants [17, 26] of VR-SGD methods have also been proposed for such stochastic settings. We leave these variants out of our comparison and consider similar extensions of our stochastic sufficient decrease method as future work.

Supplementary Materials for “Guaranteed Sufficient Decrease for Variance Reduced Stochastic Gradient Descent”

In this supplementary material, we give the detailed proofs for some lemmas, theorems and corollaries stated in the main paper. Moreover, we also report more experimental results for both of our algorithms.

Notations

Throughout this paper, $\|\cdot\|$ denotes the standard Euclidean norm, and $\|\cdot\|_{1}$ is the $\ell_{1}$-norm, i.e., $\|x\|_{1} = \sum_{i}|x_{i}|$. We denote by $\partial F(x)$ the full gradient of $F(x)$ if it is differentiable, or its subdifferential at $x$ if $F(x)$ is only Lipschitz continuous (strictly speaking, when $F(x)$ is non-smooth, $\partial F(x)$ is its subdifferential, while when $F(x)$ is smooth, $\partial F(x) = \{\nabla F(x)\}$). Note that Assumption 2 is the general form covering the two cases in which $F(x)$ is smooth or non-smooth. That is, if $F(x)$ is smooth, the inequality (12) in Assumption 2 becomes the following form: $F(y) \ge F(x) + \nabla F(x)^{T}(y - x) + \frac{\mu}{2}\|y - x\|^{2}$.

Appendix A: Proof of Theorem 1

Although the proposed SVRG-SD is a variant of SVRG, it is non-trivial to analyze its convergence property, as well as that of SAGA-SD. Before proving Theorem 1, we first give the following lemma.

Lemma 2.

Let $x^{\star}$ be the optimal solution of Problem (1); then the following inequality holds.

Lemma 2 provides an upper bound on the expected variance of the variance reduced gradient estimator in (9) (i.e., the SVRG estimator independently introduced in [13, 35]), which is an unbiased estimate of the full gradient. This lemma is essentially identical to Corollary 3.5 in [34]. From Lemma 2, we immediately obtain the following result, which is useful in our convergence analysis.

Corollary 4.

For any $x \in \mathbb{R}^{d}$, the following inequality holds.

Proof.

where the second equality holds due to the fact that ; the second inequality holds due to the fact that ; and the last inequality follows from Lemma 3.4 in [34] (i.e., ). ∎

Moreover, we also introduce the following lemmas [6, 16], which are useful in our convergence analysis.

Lemma 3.

Let be the linear approximation of at with respect to , i.e.,

Then

Lemma 4.

Assume that is an optimal solution of the following problem,

where is a convex function (but possibly non-differentiable). Then the following inequality holds for all :

Proof of Theorem 1:

Proof.

Let and . Using Lemma 3, we have

(13)

Then

(14)

where the inequality follows from Young's inequality. Substituting inequality (14) into inequality (13), we have

(15)

where , and . The second inequality follows from Lemma 4 with ,