SSRGD: Simple Stochastic Recursive Gradient Descent for Escaping Saddle Points

04/19/2019 · by Zhize Li, et al. · Tsinghua University

We analyze stochastic gradient algorithms for optimizing nonconvex problems. In particular, our goal is to find local minima (second-order stationary points) instead of just first-order stationary points, which may be bad unstable saddle points. We show that a simple perturbed version of the stochastic recursive gradient descent algorithm (called SSRGD) can find an $(\epsilon,\delta)$-second-order stationary point with $O(\sqrt{n}/\epsilon^2 + \sqrt{n}/\delta^4 + n/\delta^3)$ stochastic gradient complexity for nonconvex finite-sum problems. As a by-product, SSRGD finds an $\epsilon$-first-order stationary point with $O(n + \sqrt{n}/\epsilon^2)$ stochastic gradients. These results are almost optimal, since Fang et al. [2018] provided the lower bound $\Omega(\sqrt{n}/\epsilon^2)$ for finding even just an $\epsilon$-first-order stationary point. We emphasize that the SSRGD algorithm for finding second-order stationary points is as simple as the one for finding first-order stationary points: it merely adds a uniform perturbation occasionally, while all other algorithms with similar gradient complexity for finding second-order stationary points need to be combined with a negative-curvature search subroutine (e.g., Neon2 [Allen-Zhu and Li, 2018]). Moreover, the simple SSRGD algorithm admits a simpler analysis. Finally, we extend our results from nonconvex finite-sum problems to nonconvex online (expectation) problems and prove the corresponding convergence results.


1 Introduction

Nonconvex optimization is ubiquitous in machine learning applications, especially for deep neural networks. For convex optimization, every local minimum is a global minimum, and it is attained at any first-order stationary point, i.e., a point $x$ with $\nabla f(x) = 0$. However, for nonconvex problems, a point with zero gradient can be a local minimum, a local maximum or a saddle point. To avoid converging to bad saddle points (including local maxima), we want to find a second-order stationary point, i.e., a point $x$ with $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$ (this is a necessary condition for $x$ to be a local minimum). All second-order stationary points are indeed local minima if the function $f$ satisfies the strict saddle property (Ge et al., 2015). Note that finding the global minimum of a nonconvex problem is NP-hard in general. Also note that it has been shown that all local minima are global minima for some nonconvex problems, e.g., matrix sensing (Bhojanapalli et al., 2016), matrix completion (Ge et al., 2016), and some neural networks (Ge et al., 2017). Thus, our goal in this paper is to find an approximate second-order stationary point (local minimum) with a provable convergence guarantee.

There has been extensive research on finding an $\epsilon$-first-order stationary point (i.e., $\|\nabla f(x)\| \le \epsilon$), e.g., GD, SGD and SVRG. See Table 1 for an overview. Xu et al. (2018) and Allen-Zhu and Li (2018) independently proposed the reduction algorithms Neon/Neon2, which can be combined with previous $\epsilon$-first-order stationary point finding algorithms to find an $(\epsilon,\delta)$-second-order stationary point (i.e., $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$). However, algorithms obtained by this reduction are very complicated in practice: they need to extract negative-curvature directions from the Hessian to escape saddle points by using a negative-curvature search subroutine (given a point $x$, find an approximate smallest eigenvector of $\nabla^2 f(x)$), and this also involves a more complicated analysis. Note that in practice, standard first-order stationary point finding algorithms often do escape bad saddle points in the nonconvex setting without any negative-curvature search subroutine; the reason may be that saddle points are usually not very stable. So there is a natural question: "Is there any simple modification that allows first-order stationary point finding algorithms to obtain a theoretical second-order guarantee?" For gradient descent (GD), Jin et al. (2017) showed that a simple perturbation step is enough to escape saddle points and find a second-order stationary point, and such a perturbation is necessary (Du et al., 2017). Very recently, Ge et al. (2019) showed that a simple perturbation step is also enough to find a second-order stationary point for the SVRG algorithm (Li and Li, 2018). Moreover, Ge et al. (2019) also developed a stabilizing trick to further improve the dependence on the Hessian Lipschitz parameter $\rho$.

Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine
GD (Nesterov, 2004) | $O(n/\epsilon^2)$ | 1st-order | No
SVRG (Reddi et al., 2016; Allen-Zhu and Hazan, 2016); SCSG (Lei et al., 2017); SVRG+ (Li and Li, 2018) | $O(n + n^{2/3}/\epsilon^2)$ | 1st-order | No
SNVRG (Zhou et al., 2018b); SPIDER (Fang et al., 2018); SpiderBoost (Wang et al., 2018); SARAH (Pham et al., 2019) | $O(n + \sqrt{n}/\epsilon^2)$ | 1st-order | No
SSRGD (this paper) | $O(n + \sqrt{n}/\epsilon^2)$ | 1st-order | No
PGD (Jin et al., 2017) | – | 2nd-order | No
Neon2+FastCubic/CDHS (Agarwal et al., 2016; Carmon et al., 2016) | – | 2nd-order | Needed
Neon2+SVRG (Allen-Zhu and Li, 2018) | – | 2nd-order | Needed
Stabilized SVRG (Ge et al., 2019) | – | 2nd-order | No
SNVRG+Neon2 (Zhou et al., 2018a) | – | 2nd-order | Needed
SPIDER-SFO(+Neon2) (Fang et al., 2018) | – | 2nd-order | Needed
SSRGD (this paper) | $\tilde{O}(\sqrt{n}/\epsilon^2 + \sqrt{n}/\delta^4 + n/\delta^3)$ | 2nd-order | No
Table 1: Stochastic gradient complexity of optimization algorithms for the nonconvex finite-sum problem (1)
Algorithm | Stochastic gradient complexity | Guarantee | Negative-curvature search subroutine
SGD (Ghadimi et al., 2016) | $O(1/\epsilon^4)$ | 1st-order | No
SCSG (Lei et al., 2017); SVRG+ (Li and Li, 2018) | $O(1/\epsilon^{10/3})$ | 1st-order | No
SNVRG (Zhou et al., 2018b); SPIDER (Fang et al., 2018); SpiderBoost (Wang et al., 2018); SARAH (Pham et al., 2019) | $O(1/\epsilon^3)$ | 1st-order | No
SSRGD (this paper) | $O(1/\epsilon^3)$ | 1st-order | No
Perturbed SGD (Ge et al., 2015) | poly$(d, 1/\epsilon, 1/\delta)$ | 2nd-order | No
CNC-SGD (Daneshmand et al., 2018) | – | 2nd-order | No
Neon2+SCSG (Allen-Zhu and Li, 2018) | – | 2nd-order | Needed
Neon2+Natasha2 (Allen-Zhu, 2018) | – | 2nd-order | Needed
SNVRG+Neon2 (Zhou et al., 2018a) | – | 2nd-order | Needed
SPIDER-SFO(+Neon2) (Fang et al., 2018) | – | 2nd-order | Needed
SSRGD (this paper) | $\tilde{O}(1/\epsilon^3 + 1/(\epsilon^2\delta^3) + 1/(\epsilon\delta^4))$ | 2nd-order | No
Table 2: Stochastic gradient complexity of optimization algorithms for the nonconvex online (expectation) problem (2)

Note: 1. Guarantee (see Definition 1): $\epsilon$-first-order stationary point: $\|\nabla f(x)\| \le \epsilon$; $(\epsilon,\delta)$-second-order stationary point: $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$.

2. In the classical setting where $\delta = \sqrt{\rho\epsilon}$ (Nesterov and Polyak, 2006; Jin et al., 2017), our simple SSRGD is always (no matter what $n$ and $\epsilon$ are) not worse than all other algorithms (in both Tables 1 and 2) except FastCubic/CDHS (which need to compute Hessian-vector products) and SPIDER-SFO. Moreover, our simple SSRGD is not worse than FastCubic/CDHS if $n \ge 1/\epsilon$ and is better than SPIDER-SFO if $\epsilon$ is very small in Table 1.

0:  Input: initial point $x_0$, epoch length $m$, minibatch size $b$, step size $\eta$, perturbation radius $r$, threshold gradient $g_{thres}$
1:  for $s = 0, 1, 2, \ldots$ do
2:     if not currently in a super epoch and $\|\nabla f(x_{sm})\| \le g_{thres}$ then
3:        $x_{sm} \leftarrow x_{sm} + \xi$, where $\xi$ uniformly $\sim \mathbb{B}_0(r)$; start a super epoch   // we use super epochs since we do not want to add the perturbation too often near a saddle point
4:     end if
5:     $v_{sm} \leftarrow \nabla f(x_{sm})$
6:     for $j = 1, 2, \ldots, m$ do
7:        $t \leftarrow sm + j$
8:        $x_t \leftarrow x_{t-1} - \eta v_{t-1}$
9:        $v_t \leftarrow \frac{1}{b}\sum_{i \in I_b}\big(\nabla f_i(x_t) - \nabla f_i(x_{t-1})\big) + v_{t-1}$   // $I_b$ are i.i.d. uniform samples with $|I_b| = b$
10:       if $x_t$ meets the stop condition then stop the super epoch
11:    end for
12: end for
Algorithm 1 Simple Stochastic Recursive Gradient Descent (SSRGD)
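To make the recursive update concrete, here is a minimal runnable sketch of the inner loop (Lines 5–11) on a least-squares toy problem; it omits the perturbation/super-epoch logic, and all names (grad, ssrgd_epoch) and constants are ours for illustration, not from the paper:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d, b, m, eta = 1000, 10, 32, 32, 0.05
    A, y = rng.normal(size=(n, d)), rng.normal(size=n)

    def grad(x, idx=None):
        """Average gradient of f_i(x) = 0.5*(a_i^T x - y_i)^2 over idx (all i if None)."""
        Ai, yi = (A, y) if idx is None else (A[idx], y[idx])
        return Ai.T @ (Ai @ x - yi) / len(yi)

    def ssrgd_epoch(x):
        """One SSRGD epoch: full gradient at the snapshot, then m recursive steps."""
        v = grad(x)                                   # Line 5: v_{sm} <- grad f(x_{sm})
        for _ in range(m):                            # Lines 6-11
            x_prev, x = x, x - eta * v                # Line 8: x_t <- x_{t-1} - eta v_{t-1}
            idx = rng.integers(n, size=b)             # i.i.d. uniform minibatch I_b, |I_b| = b
            v = grad(x, idx) - grad(x_prev, idx) + v  # Line 9: recursive estimator (10)
        return x

    x = np.zeros(d)
    for _ in range(30):
        x = ssrgd_epoch(x)
    print(np.linalg.norm(grad(x)))                    # gradient norm shrinks across epochs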

1.1 Our Contributions

In this paper, we propose a simple SSRGD algorithm (described in Algorithm 1) and show that a simple perturbation step is enough to find a second-order stationary point for the stochastic recursive gradient descent algorithm. Our results and previous results are summarized in Tables 1 and 2. We would like to highlight the following points:

  • We improve the result in (Ge et al., 2019) to an almost optimal one (i.e., from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$), since Fang et al. (2018) provided the lower bound $\Omega(\sqrt{n}/\epsilon^2)$ for finding even just an $\epsilon$-first-order stationary point. Note that the other two algorithms with comparable complexity (i.e., SNVRG+Neon2 and SPIDER-SFO(+Neon2)) both need the negative-curvature search subroutine and thus are more complicated in practice and in analysis than their first-order counterparts (SNVRG and SPIDER), while our SSRGD is as simple as its first-order counterpart.

  • For the more general nonconvex online (expectation) problem (2), we obtain the first algorithm for finding a second-order stationary point that is as simple as a first-order stationary point finder, with a similar state-of-the-art convergence result. See the last column of Table 2.

  • Our simple SSRGD algorithm admits a simpler analysis. Also, the result for finding a first-order stationary point is a by-product of our analysis. In Section 5.1 we give a clear interpretation of why our analysis of SSRGD improves the original SVRG from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$, which we believe is very useful for better understanding these two algorithms.

2 Preliminaries

Notation: Let $[n]$ denote the set $\{1, 2, \ldots, n\}$ and $\|\cdot\|$ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let $\langle u, v\rangle$ denote the inner product of two vectors $u$ and $v$. Let $\lambda_{\min}(A)$ denote the smallest eigenvalue of a symmetric matrix $A$. Let $\mathbb{B}_x(r)$ denote a Euclidean ball with center $x$ and radius $r$. We use $O(\cdot)$ to hide absolute constants and $\tilde{O}(\cdot)$ to hide polylogarithmic factors.

In this paper, we consider two types of nonconvex problems. The finite-sum problem has the form

$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where $f$ and all individual $f_i$ are possibly nonconvex. This form usually models empirical risk minimization in machine learning problems.

The online (expectation) problem has the form

$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\zeta \sim D}[F(x, \zeta)], \qquad (2)$$

where $f$ and $F(\cdot,\zeta)$ are possibly nonconvex, and $\zeta$ is a random variable with distribution $D$. This form usually models population risk minimization in machine learning problems.

Now, we make standard smoothness assumptions for these two problems.

Assumption 1 (Gradient Lipschitz)
  1. For finite-sum problem (1), each $f_i$ is differentiable and has an $L$-Lipschitz continuous gradient, i.e.,

    $$\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|, \quad \forall x, y. \qquad (3)$$

  2. For online problem (2), $F(\cdot,\zeta)$ is differentiable and has an $L$-Lipschitz continuous gradient, i.e.,

    $$\|\nabla F(x,\zeta) - \nabla F(y,\zeta)\| \le L\|x - y\|, \quad \forall x, y. \qquad (4)$$

Assumption 2 (Hessian Lipschitz)
  1. For finite-sum problem (1), each $f_i$ is twice-differentiable and has a $\rho$-Lipschitz continuous Hessian, i.e.,

    $$\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \le \rho\|x - y\|, \quad \forall x, y. \qquad (5)$$

  2. For online problem (2), $F(\cdot,\zeta)$ is twice-differentiable and has a $\rho$-Lipschitz continuous Hessian, i.e.,

    $$\|\nabla^2 F(x,\zeta) - \nabla^2 F(y,\zeta)\| \le \rho\|x - y\|, \quad \forall x, y. \qquad (6)$$

These two assumptions are standard for finding first-order stationary points (Assumption 1) and second-order stationary points (Assumptions 1 and 2) for all algorithms in both Tables 1 and 2.

Now we define approximate first-order stationary points and approximate second-order stationary points.

Definition 1

A point $x$ is an $\epsilon$-first-order stationary point for a differentiable function $f$ if

$$\|\nabla f(x)\| \le \epsilon. \qquad (7)$$

A point $x$ is an $(\epsilon,\delta)$-second-order stationary point for a twice-differentiable function $f$ if

$$\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\delta. \qquad (8)$$

The definition of an $(\epsilon,\delta)$-second-order stationary point is the same as in (Allen-Zhu and Li, 2018; Daneshmand et al., 2018; Zhou et al., 2018a; Fang et al., 2018), and it generalizes the classical version with $\delta = \sqrt{\rho\epsilon}$ used in (Nesterov and Polyak, 2006; Jin et al., 2017; Ge et al., 2019).
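In low dimensions one can check Definition 1 directly by forming the Hessian explicitly; a small illustrative sketch (our own toy example, not the paper's procedure):

    import numpy as np

    def is_second_order_stationary(grad, hess, x, eps, delta):
        """Check condition (8): ||grad f(x)|| <= eps and lambda_min(hess f(x)) >= -delta."""
        lam_min = np.linalg.eigvalsh(hess(x)).min()   # smallest eigenvalue of the Hessian
        return np.linalg.norm(grad(x)) <= eps and lam_min >= -delta

    # f(x) = x1^2 - x2^2 has zero gradient at the origin but a strict saddle there.
    grad = lambda x: np.array([2 * x[0], -2 * x[1]])
    hess = lambda x: np.diag([2.0, -2.0])
    print(is_second_order_stationary(grad, hess, np.zeros(2), eps=1e-3, delta=0.1))  # False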

3 Simple Stochastic Recursive Gradient Descent

In this section, we propose the simple stochastic recursive gradient descent algorithm, called SSRGD. A high-level description (which omits the stop condition details in Line 10) is given in Algorithm 1, and the full algorithm (containing the stop conditions) is described in Algorithm 2. Note that we call each outer loop (i.e., Lines 2–11 of Algorithm 1) an epoch, i.e., iterations $t$ from $sm + 1$ to $(s+1)m$ for epoch $s$. We call the iterations between the beginning and the end of a perturbation a super epoch.

The SSRGD algorithm is based on stochastic recursive gradient descent, which was introduced in (Nguyen et al., 2017) for convex optimization. In particular, Nguyen et al. (2017) used the recursive gradient to avoid storing past gradients as in SAGA (Defazio et al., 2014). This stochastic recursive gradient estimator has since been widely used in recent work on nonconvex optimization, such as SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018) and some variants of SARAH (e.g., ProxSARAH (Pham et al., 2019)).

Recall that in the well-known SVRG, Johnson and Zhang (2013) reuse a fixed snapshot full gradient $\nabla f(\tilde{x})$ (computed at the beginning of each epoch) in the update step:

$$v_t := \frac{1}{b}\sum_{i \in I_b}\big(\nabla f_i(x_t) - \nabla f_i(\tilde{x})\big) + \nabla f(\tilde{x}), \qquad (9)$$

while stochastic recursive gradient descent uses the recursive update step:

$$v_t := \frac{1}{b}\sum_{i \in I_b}\big(\nabla f_i(x_t) - \nabla f_i(x_{t-1})\big) + v_{t-1}. \qquad (10)$$
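The two estimators differ only in the reference point of the correction: (9) corrects toward a fixed snapshot, while (10) corrects toward the previous iterate and accumulates. A tiny numeric comparison (our own sketch; $v_{t-1}$ is taken to be the exact gradient here just to isolate the one-step error):

    import numpy as np

    rng = np.random.default_rng(1)
    n, d, b = 500, 5, 10
    A = rng.normal(size=(n, d))

    def grad(x, idx):
        """Minibatch-averaged gradient of f_i(x) = 0.5*(a_i^T x)^2."""
        Ai = A[idx]
        return Ai.T @ (Ai @ x) / len(idx)

    full = np.arange(n)
    x_snap = rng.normal(size=d)                   # snapshot (start of the epoch)
    x_prev = x_snap + 0.2 * rng.normal(size=d)    # drifted iterate within the epoch
    x_t = x_prev + 0.01 * rng.normal(size=d)      # small last step
    idx = rng.integers(n, size=b)

    v_svrg = grad(x_t, idx) - grad(x_snap, idx) + grad(x_snap, full)  # estimator (9)
    v_rec = grad(x_t, idx) - grad(x_prev, idx) + grad(x_prev, full)   # estimator (10), exact v_{t-1}
    g = grad(x_t, full)
    print(np.linalg.norm(v_svrg - g), np.linalg.norm(v_rec - g))
    # (9)'s error scales with ||x_t - x_snap||; (10)'s only with the last step ||x_t - x_{t-1}||.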
0:  Input: initial point $x_0$, epoch length $m$, minibatch size $b$, step size $\eta$, perturbation radius $r$, threshold gradient $g_{thres}$, threshold function value $f_{thres}$, super epoch length $t_{thres}$
1:  super_epoch $\leftarrow 0$
2:  for $s = 0, 1, 2, \ldots$ do
3:     if super_epoch $= 0$ and $\|\nabla f(x_{sm})\| \le g_{thres}$ then
4:        super_epoch $\leftarrow 1$
5:        $\tilde{x} \leftarrow x_{sm}$, $t_{init} \leftarrow sm$
6:        $x_{sm} \leftarrow \tilde{x} + \xi$, where $\xi$ uniformly $\sim \mathbb{B}_0(r)$
7:     end if
8:     $v_{sm} \leftarrow \nabla f(x_{sm})$
9:     for $j = 1, 2, \ldots, m$ do
10:       $t \leftarrow sm + j$
11:       $x_t \leftarrow x_{t-1} - \eta v_{t-1}$
12:       $v_t \leftarrow \frac{1}{b}\sum_{i \in I_b}\big(\nabla f_i(x_t) - \nabla f_i(x_{t-1})\big) + v_{t-1}$   // $I_b$ are i.i.d. uniform samples with $|I_b| = b$
13:       if super_epoch $= 1$ and ($f(\tilde{x}) - f(x_t) \ge f_{thres}$ or $t - t_{init} \ge t_{thres}$) then
14:          super_epoch $\leftarrow 0$; break
15:       else if super_epoch $= 0$ then
16:          break with probability $\frac{1}{m - j + 1}$   // we use a random stop since we want to randomly choose a point as the starting point of the next epoch
17:       end if
18:    end for
19:    $x_{(s+1)m} \leftarrow x_t$
20: end for
Algorithm 2 Simple Stochastic Recursive Gradient Descent (SSRGD)
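Stopping with probability $1/(m - j + 1)$ at inner step $j$ (Line 16) makes the stopping index uniform on $\{1, \ldots, m\}$, which is exactly the uniform sampling required by Lemma 1 in Section 5.2; a quick check:

$$\Pr[\text{stop at } j] = \frac{1}{m-j+1}\prod_{i=1}^{j-1}\Big(1 - \frac{1}{m-i+1}\Big) = \frac{1}{m-j+1}\prod_{i=1}^{j-1}\frac{m-i}{m-i+1} = \frac{1}{m-j+1}\cdot\frac{m-j+1}{m} = \frac{1}{m}.$$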

4 Convergence Results

Similar to perturbed GD (Jin et al., 2017) and perturbed SVRG (Ge et al., 2019), we add simple perturbations to the stochastic recursive gradient descent algorithm to escape saddle points efficiently. We also consider the more general online case. In the following theorems, we provide the convergence results of SSRGD for finding an $\epsilon$-first-order stationary point and an $(\epsilon,\delta)$-second-order stationary point for the nonconvex finite-sum problem (1) and online problem (2). The proofs are provided in Appendix B; we give an overview of the proofs in Section 5.

4.1 Nonconvex Finite-sum Problem

Theorem 1

Under Assumption 1 (i.e., (3)), let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta = O(1/L)$, epoch length $m = \sqrt{n}$ and minibatch size $b = \sqrt{n}$, SSRGD will find an $\epsilon$-first-order stationary point in expectation using

$$O\Big(n + \frac{L\Delta_f\sqrt{n}}{\epsilon^2}\Big)$$

stochastic gradients for nonconvex finite-sum problem (1).

Theorem 2

Under Assumptions 1 and 2 (i.e., (3) and (5)), let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta = \tilde{O}(1/L)$, epoch length $m = \sqrt{n}$, minibatch size $b = \sqrt{n}$, and choosing the perturbation radius $r$, threshold gradient $g_{thres} = O(\epsilon)$, threshold function value $f_{thres}$ and super epoch length $t_{thres}$ appropriately (see Appendix B for the precise settings), SSRGD will at least once get to an $(\epsilon,\delta)$-second-order stationary point with high probability using

$$\tilde{O}\Big(\frac{\sqrt{n}}{\epsilon^2} + \frac{\sqrt{n}}{\delta^4} + \frac{n}{\delta^3}\Big)$$

stochastic gradients for nonconvex finite-sum problem (1), where $\tilde{O}(\cdot)$ hides polylogarithmic factors and the dependence on $L$, $\rho$ and $\Delta_f$.

4.2 Nonconvex Online (Expectation) Problem

In the online case, the following bounded-variance assumption is needed. To simplify the presentation, let $\nabla F(x,\zeta)$ denote a stochastic gradient for online problem (2).

Assumption 3 (Bounded Variance)

For any $x$, $\mathbb{E}_\zeta\big[\|\nabla F(x,\zeta) - \nabla f(x)\|^2\big] \le \sigma^2$, where $\sigma > 0$ is a constant.

Note that this assumption is standard and necessary for the online case, since the full gradients are not available (see e.g., (Ghadimi et al., 2016; Lei et al., 2017; Li and Li, 2018; Zhou et al., 2018b; Fang et al., 2018; Wang et al., 2018; Pham et al., 2019)). Moreover, we need to modify the full gradient computation step at the beginning of each epoch to a large-batch stochastic gradient computation step (similar to (Lei et al., 2017; Li and Li, 2018)), i.e., change $v_{sm} \leftarrow \nabla f(x_{sm})$ (Line 8 of Algorithm 2) to

$$v_{sm} \leftarrow \frac{1}{B}\sum_{j \in I_B} \nabla F(x_{sm}, \zeta_j), \qquad (11)$$

where the $\zeta_j$'s are i.i.d. samples with $|I_B| = B$. We call $B$ the batch size and $b$ the minibatch size. Also, we need to change the gradient condition $\|\nabla f(x_{sm})\| \le g_{thres}$ (Line 3 of Algorithm 2) to $\|v_{sm}\| \le g_{thres}$.
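In code, the online modification only changes the epoch-start step: average a fresh batch of $B$ stochastic gradients instead of computing $\nabla f$ exactly. A minimal sketch (the oracle grad_F and the noise model are ours, for illustration):

    import numpy as np

    rng = np.random.default_rng(2)

    def grad_F(x, zeta):
        """Stochastic gradient of f(x) = 0.5*||x||^2, with additive noise zeta."""
        return x + zeta

    def epoch_start_gradient(x, B):
        """Eq. (11): v_{sm} <- average of B i.i.d. stochastic gradients."""
        return sum(grad_F(x, rng.normal(size=x.shape)) for _ in range(B)) / B

    x = np.ones(3)
    err = np.linalg.norm(epoch_start_gradient(x, B=10_000) - x)
    print(err)   # error ~ sqrt(d/B): the averaged batch concentrates as in Assumption 3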

Theorem 3

Under Assumption 1 (i.e., (4)) and Assumption 3, let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta = O(1/L)$, batch size $B = O(\sigma^2/\epsilon^2)$, minibatch size $b = \sqrt{B}$ and epoch length $m = \sqrt{B}$, SSRGD will find an $\epsilon$-first-order stationary point in expectation using

$$O\Big(\frac{\sigma^2}{\epsilon^2} + \frac{L\Delta_f\sigma}{\epsilon^3}\Big)$$

stochastic gradients for nonconvex online problem (2).

To achieve a high-probability guarantee for finding second-order stationary points (i.e., Theorem 4), we need a stronger version of Assumption 3, stated as the following Assumption 4.

Assumption 4 (Bounded Variance)

For any $x$, $\|\nabla F(x,\zeta) - \nabla f(x)\| \le \sigma$ for all $\zeta$, where $\sigma > 0$ is a constant.

We want to point out that Assumption 4 can be relaxed so that $\|\nabla F(x,\zeta) - \nabla f(x)\|$ only has a sub-Gaussian tail, i.e., $\mathbb{E}_\zeta\big[\exp\big(\|\nabla F(x,\zeta) - \nabla f(x)\|^2/\sigma^2\big)\big] \le \exp(1)$. It is then sufficient to obtain the high-probability bound by applying a Hoeffding-type inequality to these sub-Gaussian variables. Note that Assumption 4 (or the relaxed sub-Gaussian version) is also standard in second-order stationary point finding algorithms (see e.g., (Allen-Zhu and Li, 2018; Zhou et al., 2018a; Fang et al., 2018)).

Theorem 4

Under Assumptions 1 and 2 (i.e., (4) and (6)) and Assumption 4, let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By choosing the step size $\eta$, batch size $B$, minibatch size $b$, epoch length $m$, perturbation radius $r$, threshold gradient $g_{thres}$, threshold function value $f_{thres}$ and super epoch length $t_{thres}$ appropriately (see Appendix B for the precise settings), SSRGD will at least once get to an $(\epsilon,\delta)$-second-order stationary point with high probability using

$$\tilde{O}\Big(\frac{1}{\epsilon^3} + \frac{1}{\epsilon^2\delta^3} + \frac{1}{\epsilon\delta^4}\Big)$$

stochastic gradients for nonconvex online problem (2), where $\tilde{O}(\cdot)$ hides polylogarithmic factors and the dependence on $\sigma$, $L$, $\rho$ and $\Delta_f$.

5 Overview of the Proofs

5.1 Finding First-order Stationary Points

In this section, we first show why stochastic recursive gradient descent improves previous SVRG-type algorithms (see e.g., (Li and Li, 2018; Ge et al., 2019)) from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$. Then we give a simple high-level proof for achieving the convergence result (i.e., Theorem 1).

Why it can be improved from $n^{2/3}$ to $\sqrt{n}$: First, we need a key relation between $f(x_{t+1})$ and $f(x_t)$, where $x_{t+1} := x_t - \eta v_t$:

$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 - \Big(\frac{1}{2\eta} - \frac{L}{2}\Big)\|x_{t+1} - x_t\|^2 + \frac{\eta}{2}\|\nabla f(x_t) - v_t\|^2, \qquad (12)$$

where (12) holds since $f$ has an $L$-Lipschitz continuous gradient (Assumption 1). The details for obtaining (12) can be found in Appendix B.1 (see (25)).

Note that (12) is very meaningful and also very important for the proofs. The first term indicates that the function value decreases a lot if the gradient is large. The second term indicates that the function value also decreases a lot if the moving distance is large (note that here we require the step size $\eta \le 1/L$). The additional third term exists since we use $v_t$ as an estimator of the actual gradient $\nabla f(x_t)$ (i.e., $v_t \neq \nabla f(x_t)$). So it may increase the function value if $v_t$ is a bad direction in this step.
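For completeness, (12) follows from the $L$-smoothness of $f$ in a few lines; using $x_{t+1} - x_t = -\eta v_t$ and the identity $2\langle u, w\rangle = \|u\|^2 + \|w\|^2 - \|u - w\|^2$:

$$\begin{aligned} f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t),\, x_{t+1} - x_t\rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &= f(x_t) - \frac{\eta}{2}\big(\|\nabla f(x_t)\|^2 + \|v_t\|^2 - \|\nabla f(x_t) - v_t\|^2\big) + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\ &= f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 - \Big(\frac{1}{2\eta} - \frac{L}{2}\Big)\|x_{t+1} - x_t\|^2 + \frac{\eta}{2}\|\nabla f(x_t) - v_t\|^2, \end{aligned}$$

where the last step uses $\|v_t\|^2 = \|x_{t+1} - x_t\|^2/\eta^2$.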

To get an $\epsilon$-first-order stationary point, we want to cancel the last two terms in (12). First, we bound the variance term. Recall the variance bound (see Equation (29) in (Li and Li, 2018)) for the SVRG estimator (9):

$$\mathbb{E}\big[\|\nabla f(x_t) - v_t\|^2\big] \le \frac{L^2}{b}\,\mathbb{E}\big[\|x_t - \tilde{x}\|^2\big], \qquad (13)$$

where $\tilde{x}$ is the snapshot point of the current epoch. In order to connect the last two terms in (12), we use Young's inequality for the second term $\|x_{t+1} - x_t\|^2$, i.e., $\|u + w\|^2 \le (1 + \alpha)\|u\|^2 + (1 + 1/\alpha)\|w\|^2$ (for any $\alpha > 0$). By plugging this Young's inequality and (13) into (12), we can cancel the last two terms in (12) by summing up (12) over each epoch (i.e., iterations $t = sm, \ldots, (s+1)m - 1$), i.e., for each epoch $s$, we have (see Equation (35) in (Li and Li, 2018))

$$\mathbb{E}\big[f(x_{(s+1)m})\big] \le \mathbb{E}\Big[f(x_{sm}) - \frac{\eta}{2}\sum_{t=sm}^{(s+1)m-1}\|\nabla f(x_t)\|^2\Big]. \qquad (14)$$

However, due to the Young's inequalities, we need to let $m \le \sqrt{b}$ to cancel the last two terms in (12) for obtaining (14), where $b$ denotes the minibatch size and $m$ denotes the epoch length. According to (14), it is not hard to see that a point $\hat{x}$ chosen uniformly at random from the iterates is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|] \le \epsilon$) if the number of iterations is $T = O(L\Delta_f/\epsilon^2)$. Note that for each iteration we need to compute $b + n/m$ stochastic gradients, where we amortize the full gradient computation at the beginning of each epoch ($n$ stochastic gradients) over the iterations of its epoch (i.e., $n/m$ per iteration) for simple presentation. Thus, the convergence result is $O\big((b + n/m)/\epsilon^2\big) = O(n^{2/3}/\epsilon^2)$, since $m \le \sqrt{b}$ forces $b + n/m \ge b + n/\sqrt{b} = \Omega(n^{2/3})$, where equality holds if $b = n^{2/3}$ and $m = n^{1/3}$. Note that here we ignore the $\Delta_f$ and $L$.
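The $n^{2/3}$ rate is exactly the optimum of this amortized per-iteration cost under the Young's-inequality constraint $m \le \sqrt{b}$:

$$\min_{m \le \sqrt{b}}\ \Big(b + \frac{n}{m}\Big) \;\ge\; \min_{b}\ \Big(b + \frac{n}{\sqrt{b}}\Big) \;=\; \Theta(n^{2/3}), \quad \text{attained at } b = n^{2/3},\ m = n^{1/3},$$

which is why SVRG-type rates stall at $n^{2/3}/\epsilon^2$.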

However, for the stochastic recursive gradient estimator (10), we can bound the last variance term in (12) as (see Equation (31) in Appendix B.1):

$$\mathbb{E}\big[\|\nabla f(x_t) - v_t\|^2\big] \le \frac{L^2}{b}\sum_{j=sm+1}^{t}\mathbb{E}\big[\|x_j - x_{j-1}\|^2\big]. \qquad (15)$$

Now, the advantage of (15) compared with (13) is that it is already connected to the second term in (12), i.e., the moving distances $\|x_j - x_{j-1}\|^2$. Thus we do not need an additional Young's inequality to transform the second term as before, which makes the function value decrease bound tighter. Similarly, we plug (15) into (12) and sum it up over each epoch to cancel the last two terms in (12), i.e., for each epoch $s$, we have (see Equation (33) in Appendix B.1)

$$\mathbb{E}\big[f(x_{(s+1)m})\big] \le \mathbb{E}\Big[f(x_{sm}) - \frac{\eta}{2}\sum_{t=sm}^{(s+1)m-1}\|\nabla f(x_t)\|^2\Big]. \qquad (16)$$

Compared with (14) (which requires $m \le \sqrt{b}$), here (16) only requires $m \le b$, thanks to the tighter function value decrease bound that avoids the additional Young's inequalities.

High-level proof for achieving the $\sqrt{n}/\epsilon^2$ result: Now, according to (16), we can use the same SVRG-style argument as above to show the convergence result, i.e., $\hat{x}$ is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|] \le \epsilon$) if $\hat{x}$ is chosen uniformly at random from the iterates and the number of iterations is $T = O(L\Delta_f/\epsilon^2)$. Also, for each iteration, we compute $b + n/m$ stochastic gradients. The only difference is that now the convergence result is $O\big((b + n/m)/\epsilon^2\big) = O(\sqrt{n}/\epsilon^2)$, since $m \le b$ only forces $b + n/m \ge 2\sqrt{n}$ (rather than $\Omega(n^{2/3})$), where equality holds if $b = m = \sqrt{n}$. Here we ignore the $\Delta_f$ and $L$. Thus our Theorem 1 is obtained.
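The improvement is just the relaxed constraint in the same optimization: with $m \le b$ allowed,

$$b + \frac{n}{m} \;\ge\; b + \frac{n}{b} \;\ge\; 2\sqrt{n} \quad \text{(AM--GM)}, \quad \text{with equality iff } b = m = \sqrt{n},$$

so the amortized per-iteration cost drops from $\Theta(n^{2/3})$ to $\Theta(\sqrt{n})$.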

5.2 Finding Second-order Stationary Points

In this section, we give the high-level proof ideas for finding a second-order stationary point with high probability. Note that our proof differs from that in (Ge et al., 2019) due to the different estimators (9) and (10): Ge et al. (2019) build on the estimator (9) and its first-order analysis in (Li and Li, 2018), while our SSRGD uses the estimator (10). The difference in the first-order analysis between estimator (9) ((Li and Li, 2018)) and estimator (10) (this paper) was already discussed in Section 5.1. For the second-order analysis, since the estimator (10) in our SSRGD is more correlated across iterations than (9), we use martingales to handle it. Moreover, these different relations incur further differences in the detailed second-order guarantee analysis beyond those in the first-order analysis.

We divide the proof into two situations: large gradients and around saddle points. According to (16), a natural way to prove the convergence result is to show that the function value decreases at a desired rate with high probability. Note that the total amount of function value decrease is at most $\Delta_f := f(x_0) - f^*$.

Large gradients:
In this situation, due to the large gradients, it is sufficient to adapt the first-order analysis to show that the function value decreases a lot within an epoch. Concretely, we want to show that the function value decrease bound (16) holds with high probability. It is not hard to see that the desired rate of function value decrease is $\tilde{O}(\eta\epsilon^2)$ per iteration (recall the parameters $\eta = \tilde{O}(1/L)$ and $g_{thres} = O(\epsilon)$ in our Theorem 2). Also note that we compute $b + n/m = 2\sqrt{n}$ stochastic gradients at each iteration (recall $b = m = \sqrt{n}$ in our Theorem 2); here we amortize the full gradient computation at the beginning of each epoch ($n$ stochastic gradients) over the iterations of its epoch (i.e., $n/m$ per iteration) for simple presentation (we analyze this more rigorously in the detailed proofs in the appendices). Thus the number of stochastic gradient computations is at most $\tilde{O}(\sqrt{n}L\Delta_f/\epsilon^2)$ for this large-gradients situation.

For the proof, to show that the function value decrease bound (16) holds with high probability, we need to show that the bound on the variance term $\|\nabla f(x_t) - v_t\|^2$ holds with high probability. Note that the estimator $v_t$ defined in (10) is correlated with the previous $v_{t-1}$. Fortunately, letting $y_t := v_t - \nabla f(x_t)$, it is not hard to see that $\{y_t\}$ is a vector-valued martingale with respect to a filtration $\{\mathcal{F}_t\}$, i.e., $\mathbb{E}[y_t \mid \mathcal{F}_{t-1}] = y_{t-1}$. Moreover, let $\{z_t\}$ denote the associated martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}$, i.e., $z_t := y_t - y_{t-1}$ with $\mathbb{E}[z_t \mid \mathcal{F}_{t-1}] = 0$. Thus, to bound the variance term with high probability, it is sufficient to bound the martingale sequence $\{y_t\}$. This can be done by using the martingale Azuma–Hoeffding inequality. Note that in order to apply the Azuma–Hoeffding inequality, we first need to use the Bernstein inequality to bound the associated difference sequence $\{z_t\}$. In sum, we get the high-probability function value decrease bound by applying these two inequalities (see (42) in Appendix B.1).
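For reference, the classical (scalar) Azuma–Hoeffding inequality has the following form; the proof uses a vector-valued analogue of the same flavor. If $\{Y_t\}$ is a martingale whose differences satisfy $|Y_k - Y_{k-1}| \le c_k$, then

$$\Pr\big[|Y_t - Y_0| \ge \theta\big] \le 2\exp\Big(-\frac{\theta^2}{2\sum_{k=1}^{t} c_k^2}\Big).$$

The role of the Bernstein inequality is precisely to supply the per-step bounds $c_k$ on the differences $z_k$ with high probability.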

Note that (42) only guarantees a function value decrease when the summation of the gradient norms in an epoch is large. However, in order to connect the guarantees of the first situation (large gradients) and the second situation (around saddle points), we need guarantees related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). Similar to (Ge et al., 2019), we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2). We use the following lemma to connect the two situations (large gradients and around saddle points):

Lemma 1 (Connection of Two Situations)

For any epoch $s$, let $x$ be a point uniformly sampled from this epoch. Moreover, let the step size $\eta = \tilde{O}(1/L)$ and the minibatch size $b = \sqrt{n}$. Then there are two cases:

  1. If at least half of the points in this epoch have gradient norm no larger than $g_{thres}$, then $\|\nabla f(x)\| \le g_{thres}$ holds with probability at least 1/2;

  2. Otherwise, we know $f(x_{sm}) - f(x) \ge \tilde{O}(\eta m g_{thres}^2)$ holds with at least constant probability.

Moreover, $f(x) \le f(x_{sm})$ holds with high probability no matter which case happens.

Note that if Case 2 happens, the function value already decreases a lot in this epoch (as we already discussed at the beginning of this situation). Otherwise, Case 1 happens, and $x$ becomes the starting point of the next epoch (i.e., Line 19 of Algorithm 2), so we know $\|\nabla f(x_{(s+1)m})\| \le g_{thres}$. Then we will start a super epoch (see Line 3 of Algorithm 2). This corresponds to the following second situation, around saddle points. Note that if moreover $\lambda_{\min}(\nabla^2 f(x_{(s+1)m})) \ge -\delta$, this point is already an $(\epsilon,\delta)$-second-order stationary point (recall $g_{thres} = O(\epsilon)$ in our Theorem 2).

Around saddle points: $\|\nabla f(\tilde{x})\| \le g_{thres}$ and $\lambda_{\min}(\nabla^2 f(\tilde{x})) \le -\delta$ at the initial point $\tilde{x}$ of a certain super epoch.
In this situation, we want to show that the function value decreases a lot in a super epoch (instead of in an epoch, as in the first situation) with high probability, by adding a random perturbation at the initial point $\tilde{x}$. To simplify the presentation, we use $x_0 := \tilde{x} + \xi$ to denote the starting point of the super epoch after the perturbation, where $\xi$ uniformly $\sim \mathbb{B}_0(r)$ with perturbation radius $r$ (see Line 6 in Algorithm 2). Following the classical, widely used two-point analysis developed in (Jin et al., 2017), we consider two coupled points $x_0$ and $x_0'$ with $x_0' = x_0 + r_0 e_1$, where $r_0$ is a small scalar and $e_1$ denotes the smallest-eigenvector direction of the Hessian $\nabla^2 f(\tilde{x})$. Then we get two coupled sequences $\{x_t\}$ and $\{x_t'\}$ by running SSRGD update steps (Lines 8–12 of Algorithm 2) with the same choice of minibatches (i.e., the $I_b$'s in Line 12 of Algorithm 2) for a super epoch. We will show that at least one of these two coupled sequences decreases the function value a lot (escapes the saddle point) with high probability, i.e.,

$$\max\{f(x_0) - f(x_T),\; f(x_0') - f(x_T')\} \ge 2f_{thres}, \qquad (17)$$

where $T$ denotes the length of the super epoch.
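The mechanism behind (17) is a power-iteration effect along the escaping direction; heuristically (suppressing the Hessian-Lipschitz and variance error terms, which the actual proof controls), with $w_t := x_t - x_t'$ and $\mathcal{H} := \nabla^2 f(\tilde{x})$,

$$w_{t+1} = w_t - \eta(v_t - v_t') \approx (I - \eta\mathcal{H})\, w_t, \qquad \|(I - \eta\mathcal{H})^t w_0\| \ge (1 + \eta\delta)^t\, r_0,$$

since $w_0 = r_0 e_1$ and $\lambda_{\min}(\mathcal{H}) \le -\delta$. This exponential growth contradicts the localization bound (Lemma 2 below) after $\tilde{O}(1/(\eta\delta))$ iterations unless at least one of the two sequences decreases the function value a lot, which yields (17).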

Similar to the classical argument in (Jin et al., 2017), according to (17) we know that within the random perturbation ball, the stuck points can only lie in a short interval in the $e_1$ direction, i.e., at least one of any two points whose distance in the $e_1$ direction is larger than $r_0$ will escape the saddle point. Thus, the probability that the starting point $x_0 = \tilde{x} + \xi$ (where $\xi$ uniformly $\sim \mathbb{B}_0(r)$) falls in the stuck region is proportional to $r_0$ and hence small (see (48) in Appendix B.1). By a union bound ($x_0$ is not in the stuck region, and (17) holds), with high probability we have

$$f(x_0) - f(x_T) \ge 2f_{thres}. \qquad (18)$$

Note that the initial point of this super epoch is $\tilde{x}$, i.e., the point before the perturbation (see Line 6 of Algorithm 2); thus we also need to show that the perturbation step $x_0 = \tilde{x} + \xi$ (where $\xi$ uniformly $\sim \mathbb{B}_0(r)$) does not increase the function value much, i.e.,

$$f(x_0) \le f(\tilde{x}) + \langle\nabla f(\tilde{x}), \xi\rangle + \frac{L}{2}\|\xi\|^2 \le f(\tilde{x}) + g_{thres}\,r + \frac{L}{2}r^2 \le f(\tilde{x}) + f_{thres}, \qquad (19)$$

where the second inequality holds since the initial point satisfies $\|\nabla f(\tilde{x})\| \le g_{thres}$ and the perturbation radius is $r$, and the last inequality holds by choosing the perturbation radius $r$ small enough. By combining (18) and (19), we obtain, with high probability,

$$f(\tilde{x}) - f(x_T) \ge 2f_{thres} - f_{thres} = f_{thres}. \qquad (20)$$

Now, the desired rate of function value decrease in this situation is $f_{thres}/t_{thres}$ per iteration (recall the parameters $f_{thres}$ and $t_{thres}$ in our Theorem 2). Same as before, we compute $b + n/m = 2\sqrt{n}$ stochastic gradients at each iteration (recall $b = m = \sqrt{n}$ in our Theorem 2). Thus the number of stochastic gradient computations is at most $\tilde{O}(\sqrt{n}/\delta^4 + n/\delta^3)$ (again ignoring $L$, $\rho$ and $\Delta_f$) for this around-saddle-points situation.

Now, the remaining thing is to prove (17). It can be proved by contradiction. Assume the contrary, i.e., $f(x_0) - f(x_T) < 2f_{thres}$ and $f(x_0') - f(x_T') < 2f_{thres}$. First, we show that if the function value does not decrease a lot, then all iteration points remain not far from the starting point, with high probability.

Lemma 2 (Localization)

Let $\{x_t\}$ denote the sequence obtained by running SSRGD update steps (Lines 8–12 of Algorithm 2) from $x_0$. Moreover, let the step size $\eta = \tilde{O}(1/L)$ and the minibatch size $b = \sqrt{n}$. Then, with high probability, we have, for all $t$ in the super epoch,

$$\|x_t - x_0\| \le O\Big(\sqrt{\eta\, t\,\big(f(x_0) - f(x_t)\big)}\Big), \qquad (21)$$

where the $O(\cdot)$ hides polylogarithmic factors.

Then we will show that the stuck region is relatively small within the random perturbation ball, i.e., at least one of $\{x_t\}$ and $\{x_t'\}$ will go far away from its starting point $x_0$ or $x_0'$ with high probability.

Lemma 3 (Small Stuck Region)

If the initial point $\tilde{x}$ satisfies $\lambda_{\min}(\nabla^2 f(\tilde{x})) \le -\delta$, then let $\{x_t\}$ and $\{x_t'\}$ be two coupled sequences obtained by running SSRGD update steps (Lines 8–12