1 Introduction
Nonconvex optimization is ubiquitous in machine learning applications, especially for deep neural networks. For convex optimization, every local minimum is a global minimum, and it is achieved by any first-order stationary point, i.e., a point $x$ with $\nabla f(x) = 0$. However, for nonconvex problems, a point with zero gradient can be a local minimum, a local maximum or a saddle point. To avoid converging to bad saddle points (including local maxima), we want to find a second-order stationary point, i.e., a point $x$ with $\nabla f(x) = 0$ and $\lambda_{\min}(\nabla^2 f(x)) \ge 0$ (this is a necessary condition for $x$ to be a local minimum). All second-order stationary points indeed are local minima if the function $f$ satisfies the strict saddle property (Ge et al., 2015). Note that finding the global minimum in nonconvex problems is NP-hard in general. Also note that all local minima have been shown to be global minima for some nonconvex problems, e.g., matrix sensing (Bhojanapalli et al., 2016), matrix completion (Ge et al., 2016), and some neural networks (Ge et al., 2017). Thus, our goal in this paper is to find an approximate second-order stationary point (local minimum) with proved convergence.

There has been extensive research on finding first-order stationary points (i.e., points with $\|\nabla f(x)\| \le \epsilon$), e.g., GD, SGD and SVRG. See Table 1 for an overview. Xu et al. (2018) and Allen-Zhu and Li (2018) independently proposed the reduction algorithms Neon/Neon2, which can be combined with previous first-order stationary point finding algorithms to find an $(\epsilon, \delta)$-second-order stationary point (i.e., $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$). However, algorithms obtained by this reduction are very complicated in practice, and they need to extract negative curvature directions from the Hessian to escape saddle points by using a negative curvature search subroutine: given a point $x$, find an approximate smallest eigenvector of $\nabla^2 f(x)$. This also involves a more complicated analysis. Note that in practice, standard first-order stationary point finding algorithms can often work (escape bad saddle points) in the nonconvex setting without a negative curvature search subroutine; the reason may be that saddle points are usually not very stable. So there is a natural question: "Is there any simple modification that allows first-order stationary point finding algorithms to obtain a theoretical second-order guarantee?" For gradient descent (GD), Jin et al. (2017) showed that a simple perturbation step is enough to escape saddle points and find a second-order stationary point, and this step is necessary (Du et al., 2017). Very recently, Ge et al. (2019) showed that a simple perturbation step is also enough to find a second-order stationary point with the SVRG algorithm (Li and Li, 2018). Moreover, Ge et al. (2019) also developed a stabilizing trick to further improve the dependency on the Hessian Lipschitz parameter.
Table 1: Comparison of algorithms for the nonconvex finite-sum problem (1).

| Algorithm | Guarantee | Negative-curvature search needed |
| --- | --- | --- |
| GD (Nesterov, 2004) | 1st-order | No |
| SVRG (Li and Li, 2018) | 1st-order | No |
| SNVRG (Zhou et al., 2018b) / SPIDER (Fang et al., 2018) / SpiderBoost (Wang et al., 2018) | 1st-order | No |
| SSRGD (this paper) | 1st-order | No |
| PGD (Jin et al., 2017) | 2nd-order | No |
| FastCubic/CDHS | 2nd-order | Needed (Hessian-vector products) |
| Neon2+SVRG (Allen-Zhu and Li, 2018) | 2nd-order | Needed |
| Stabilized SVRG (Ge et al., 2019) | 2nd-order | No |
| SNVRG+Neon2 (Zhou et al., 2018a) | 2nd-order | Needed |
| SPIDER-SFO(+Neon2) (Fang et al., 2018) | 2nd-order | Needed |
| SSRGD (this paper) | 2nd-order | No |
Table 2: Comparison of algorithms for the nonconvex online (expectation) problem (2).

| Algorithm | Guarantee | Negative-curvature search needed |
| --- | --- | --- |
| SGD (Ghadimi et al., 2016) | 1st-order | No |
| SCSG (Lei et al., 2017) | 1st-order | No |
| SNVRG (Zhou et al., 2018b) / SPIDER (Fang et al., 2018) / SpiderBoost (Wang et al., 2018) | 1st-order | No |
| SSRGD (this paper) | 1st-order | No |
| Perturbed SGD (Ge et al., 2015) (complexity involves poly($d$) factors) | 2nd-order | No |
| CNC-SGD (Daneshmand et al., 2018) | 2nd-order | No |
| Neon2+SCSG (Allen-Zhu and Li, 2018) | 2nd-order | Needed |
| Neon2+Natasha2 (Allen-Zhu, 2018) | 2nd-order | Needed |
| SNVRG+Neon2 (Zhou et al., 2018a) | 2nd-order | Needed |
| SPIDER-SFO(+Neon2) (Fang et al., 2018) | 2nd-order | Needed |
| SSRGD (this paper) | 2nd-order | No |
Note: 1. Guarantee (see Definition 1): $\epsilon$-first-order stationary point, $\|\nabla f(x)\| \le \epsilon$; $(\epsilon, \delta)$-second-order stationary point, $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$.
2. In the classical setting where $\delta = \sqrt{\rho\epsilon}$ (Nesterov and Polyak, 2006; Jin et al., 2017), our simple SSRGD is always (no matter what $n$ and $\epsilon$ are) not worse than all other algorithms (in both Tables 1 and 2) except FastCubic/CDHS (which needs to compute Hessian-vector products) and SPIDER-SFO. Moreover, our simple SSRGD is not worse than FastCubic/CDHS when $n$ is large, and is better than SPIDER-SFO when $\epsilon$ is very small in Table 1.

1.1 Our Contributions
In this paper, we propose a simple SSRGD algorithm (described in Algorithm 1) and show that a simple perturbation step is enough to find a second-order stationary point with the stochastic recursive gradient descent algorithm. Our results and previous results are summarized in Tables 1 and 2. We would like to highlight the following points:
- We improve the result in (Ge et al., 2019) to an almost optimal one (i.e., from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$), since Fang et al. (2018) provided an $\Omega(\sqrt{n}/\epsilon^2)$ lower bound for finding even just an $\epsilon$-first-order stationary point. Note that the other two algorithms with matching second-order guarantees (i.e., SNVRG+Neon2 and SPIDER-SFO+Neon2) both need the negative curvature search subroutine and thus are more complicated, in practice and in analysis, than their first-order-guarantee counterparts (SNVRG and SPIDER), while our SSRGD is as simple as its first-order-guarantee version.

- Our simple SSRGD algorithm admits a simpler analysis. Moreover, the result for finding a first-order stationary point is a byproduct of our analysis. We also give a clear interpretation of why our analysis for the SSRGD algorithm improves the original SVRG bound from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$ in Section 5.1. We believe it is very useful for better understanding these two algorithms.
2 Preliminaries
Notation: Let $[n]$ denote the set $\{1, 2, \ldots, n\}$ and $\|\cdot\|$ denote the Euclidean norm for a vector and the spectral norm for a matrix. Let $\langle u, v \rangle$ denote the inner product of two vectors $u$ and $v$. Let $\lambda_{\min}(A)$ denote the smallest eigenvalue of a symmetric matrix $A$. Let $\mathbb{B}_x(r)$ denote a Euclidean ball with center $x$ and radius $r$. We use $O(\cdot)$ to hide constants and $\tilde{O}(\cdot)$ to additionally hide polylogarithmic factors.

In this paper, we consider two types of nonconvex problems. The finite-sum problem has the form
$$\min_{x \in \mathbb{R}^d} f(x) := \frac{1}{n} \sum_{i=1}^{n} f_i(x), \qquad (1)$$
where $f: \mathbb{R}^d \to \mathbb{R}$ and all the individual components $f_i$ are possibly nonconvex. This form usually models empirical risk minimization in machine learning problems.
The online (expectation) problem has the form
$$\min_{x \in \mathbb{R}^d} f(x) := \mathbb{E}_{\zeta \sim \mathcal{D}}[F(x, \zeta)], \qquad (2)$$
where $f: \mathbb{R}^d \to \mathbb{R}$ and the stochastic components $F(x, \zeta)$, indexed by the random variable $\zeta \sim \mathcal{D}$, are possibly nonconvex. This form usually models population risk minimization in machine learning problems.
Now, we make standard smoothness assumptions for these two problems.
Assumption 1 (Gradient Lipschitz)
For finite-sum problem (1), each $f_i$ is $L$-gradient Lipschitz, i.e.,
$$\|\nabla f_i(x_1) - \nabla f_i(x_2)\| \le L \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d. \qquad (3)$$
For online problem (2), each $F(\cdot, \zeta)$ is $L$-gradient Lipschitz, i.e.,
$$\|\nabla F(x_1, \zeta) - \nabla F(x_2, \zeta)\| \le L \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d. \qquad (4)$$

Assumption 2 (Hessian Lipschitz)
For finite-sum problem (1), each $f_i$ is $\rho$-Hessian Lipschitz, i.e.,
$$\|\nabla^2 f_i(x_1) - \nabla^2 f_i(x_2)\| \le \rho \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d. \qquad (5)$$
For online problem (2), each $F(\cdot, \zeta)$ is $\rho$-Hessian Lipschitz, i.e.,
$$\|\nabla^2 F(x_1, \zeta) - \nabla^2 F(x_2, \zeta)\| \le \rho \|x_1 - x_2\|, \quad \forall x_1, x_2 \in \mathbb{R}^d. \qquad (6)$$
These two assumptions are standard for finding first-order stationary points (Assumption 1) and second-order stationary points (Assumptions 1 and 2), and they are made by all algorithms in both Tables 1 and 2.
Now we define the approximate first-order stationary points and approximate second-order stationary points.
Definition 1
$x$ is an $\epsilon$-first-order stationary point for a differentiable function $f$ if
$$\|\nabla f(x)\| \le \epsilon. \qquad (7)$$
$x$ is an $(\epsilon, \delta)$-second-order stationary point for a twice-differentiable function $f$ if
$$\|\nabla f(x)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}(\nabla^2 f(x)) \ge -\delta. \qquad (8)$$
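To make Definition 1 concrete, the following sketch checks both conditions numerically for a small problem; the function names and the use of NumPy are our own illustration, not part of the SSRGD algorithm itself.

```python
import numpy as np

def is_first_order_stationary(grad, eps):
    # epsilon-first-order stationary point: ||grad f(x)|| <= eps, condition (7)
    return np.linalg.norm(grad) <= eps

def is_second_order_stationary(grad, hess, eps, delta):
    # (eps, delta)-second-order stationary point additionally requires
    # lambda_min(Hessian) >= -delta, condition (8)
    lam_min = np.linalg.eigvalsh(hess).min()  # eigenvalues of a symmetric matrix
    return np.linalg.norm(grad) <= eps and lam_min >= -delta

# Example: f(x, y) = x^2 - y^2 has a saddle point at the origin.
grad = np.zeros(2)                    # gradient vanishes at (0, 0)
hess = np.diag([2.0, -2.0])           # Hessian has a negative eigenvalue -2
print(is_first_order_stationary(grad, eps=1e-3))           # True
print(is_second_order_stationary(grad, hess, 1e-3, 1e-3))  # False: saddle point
```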
3 Simple Stochastic Recursive Gradient Descent
In this section, we propose the simple stochastic recursive gradient descent algorithm called SSRGD. The high-level description (which omits the stopping condition details of Line 10) of this algorithm is given in Algorithm 1, and the full algorithm (containing the stopping condition) is described in Algorithm 2. Note that we call each outer loop (i.e., Lines 2–11 of Algorithm 1) an epoch, i.e., the iterations between two consecutive snapshot points. We call the iterations between the beginning of a perturbation and the end of that perturbation a super epoch.
The SSRGD algorithm is based on stochastic recursive gradient descent, which was introduced in (Nguyen et al., 2017) for convex optimization. In particular, Nguyen et al. (2017) wanted to avoid storing past gradients as in SAGA (Defazio et al., 2014) by using the recursive gradient. This stochastic recursive gradient is also widely used in recent work on nonconvex optimization, such as SPIDER (Fang et al., 2018), SpiderBoost (Wang et al., 2018) and some variants of SARAH (e.g., ProxSARAH (Pham et al., 2019)).
Recall that in the well-known SVRG, Johnson and Zhang (2013) reuse a fixed snapshot full gradient $\nabla f(\tilde{x})$ (which is computed at the beginning of each epoch) in the update step:
$$v_t := \frac{1}{b} \sum_{i \in I_b} \big(\nabla f_i(x_t) - \nabla f_i(\tilde{x})\big) + \nabla f(\tilde{x}), \qquad x_{t+1} := x_t - \eta v_t, \qquad (9)$$
while stochastic recursive gradient descent uses the recursive update step:
$$v_t := \frac{1}{b} \sum_{i \in I_b} \big(\nabla f_i(x_t) - \nabla f_i(x_{t-1})\big) + v_{t-1}, \qquad x_{t+1} := x_t - \eta v_t, \qquad (10)$$
where $I_b$ denotes a minibatch of size $b$ sampled uniformly from $[n]$.
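To make the epoch structure and the recursive estimator (10) concrete, here is a minimal sketch of one SSRGD epoch on a hypothetical quadratic finite-sum objective; the objective, parameter values and variable names are ours for illustration, and the perturbation/super-epoch logic of Algorithm 2 is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy finite-sum objective: f_i(x) = 0.5 * sum(H[i] * x**2) - A[i] @ x,
# so nabla f_i(x) = H[i] * x - A[i]. Chosen only so gradients are cheap to form.
n, d = 1024, 10
H = rng.uniform(0.5, 1.5, size=(n, d))      # per-component curvatures of each f_i
A = rng.normal(size=(n, d))
grad_batch = lambda idx, x: (H[idx] * x - A[idx]).mean(axis=0)
full_grad = lambda x: (H * x - A).mean(axis=0)

b = m = int(np.sqrt(n))   # minibatch size and epoch length, as in Theorem 1
eta = 0.1                 # placeholder step size for illustration

x_prev = x = rng.normal(size=d)
v = full_grad(x)                       # snapshot full gradient at the epoch start
for j in range(m - 1):
    x_prev, x = x, x - eta * v         # gradient step using the current estimator
    I = rng.integers(n, size=b)        # minibatch sampled with replacement
    # Recursive estimator (10): correct the previous estimator by the minibatch
    # gradient difference between the two most recent iterates.
    v = grad_batch(I, x) - grad_batch(I, x_prev) + v

print("estimator error:", np.linalg.norm(v - full_grad(x)))
print("gradient norm:  ", np.linalg.norm(full_grad(x)))
```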
4 Convergence Results
Similar to perturbed GD (Jin et al., 2017) and perturbed SVRG (Ge et al., 2019), we add simple perturbations to the stochastic recursive gradient descent algorithm to escape saddle points efficiently. Besides, we also consider the more general online case. In the following theorems, we provide the convergence results of SSRGD for finding an $\epsilon$-first-order stationary point and an $(\epsilon, \delta)$-second-order stationary point for the nonconvex finite-sum problem (1) and online problem (2). The proofs are provided in Appendix B. We give an overview of the proofs in Section 5.
4.1 Nonconvex Finite-sum Problem
Theorem 1
Under Assumption 1 (i.e. (3)), let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta \le \frac{1}{2L}$, epoch length $m = \sqrt{n}$ and minibatch size $b = \sqrt{n}$, SSRGD will find an $\epsilon$-first-order stationary point in expectation using
$$O\Big(\frac{\sqrt{n}}{\epsilon^2}\Big)$$
stochastic gradients (ignoring the dependence on $L$ and $\Delta_f$) for nonconvex finite-sum problem (1).
Theorem 2
Under Assumptions 1 and 2 (i.e. (3) and (5)), let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta = \tilde{O}(\frac{1}{L})$, epoch length $m = \sqrt{n}$, minibatch size $b = \sqrt{n}$, and choosing the perturbation radius $r$, threshold gradient $g_{\mathrm{thres}}$, threshold function value $f_{\mathrm{thres}}$ and super epoch length $t_{\mathrm{thres}}$ appropriately (see Appendix B), SSRGD will at least once get to an $(\epsilon, \delta)$-second-order stationary point with high probability using
$$\tilde{O}\Big(\frac{\sqrt{n}}{\epsilon^2} + \frac{\sqrt{n}}{\delta^4} + \frac{n}{\delta^3}\Big)$$
stochastic gradients (ignoring the dependence on $L$, $\rho$ and $\Delta_f$) for nonconvex finite-sum problem (1).
4.2 Nonconvex Online (Expectation) Problem
In this online case, the following bounded variance assumption is needed. To simplify the presentation, let $\nabla f_i(x) := \nabla F(x, \zeta_i)$ denote a stochastic gradient for online problem (2).

Assumption 3 (Bounded Variance)
For any $x$, $\mathbb{E}_i[\|\nabla f_i(x) - \nabla f(x)\|^2] \le \sigma^2$, where $\sigma > 0$ is a constant.
Note that this assumption is standard and necessary for the online case since the full gradients are not available (see e.g., (Ghadimi et al., 2016; Lei et al., 2017; Li and Li, 2018; Zhou et al., 2018b; Fang et al., 2018; Wang et al., 2018; Pham et al., 2019)). Moreover, we need to modify the full gradient computation step at the beginning of each epoch to a large-batch stochastic gradient computation step (similar to (Lei et al., 2017; Li and Li, 2018)), i.e., change the full gradient $\nabla f(x_t)$ (Line 8 of Algorithm 2) to
$$v_t := \frac{1}{B} \sum_{j \in I_B} \nabla f_j(x_t), \qquad (11)$$
where the $\nabla f_j$'s correspond to i.i.d. samples with $|I_B| = B$. We call $B$ the batch size and $b$ the minibatch size. Also, we need to change the full gradient computation in the stopping condition (Line 3 of Algorithm 2) to the same large-batch stochastic gradient estimate.
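The following lines sketch the large-batch replacement (11) for the snapshot gradient in the online case; `sample_gradient` is a hypothetical stochastic-gradient oracle and the batch size `B` below is a placeholder, not the value prescribed by Theorems 3 and 4.

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_gradient(x, zeta):
    # Hypothetical stochastic-gradient oracle nabla F(x, zeta) for the online
    # problem (2); here a noisy gradient of the toy objective 0.5 * ||x||^2.
    return x + zeta

d, B = 10, 4096                        # B: batch size for the snapshot step
x = rng.normal(size=d)

# Large-batch estimate (11) replacing the exact full gradient at the epoch start:
# averaging B i.i.d. stochastic gradients makes the snapshot error O(sigma / sqrt(B)).
zetas = rng.normal(size=(B, d))        # i.i.d. samples zeta_1, ..., zeta_B
v_snapshot = np.mean([sample_gradient(x, z) for z in zetas], axis=0)

print("snapshot error:", np.linalg.norm(v_snapshot - x))  # true gradient is x
```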
Theorem 3
Under Assumption 1 (i.e. (4)) and Assumption 3, let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta \le \frac{1}{2L}$, batch size $B = O(\frac{\sigma^2}{\epsilon^2})$, minibatch size $b = \sqrt{B}$ and epoch length $m = \sqrt{B}$, SSRGD will find an $\epsilon$-first-order stationary point in expectation using
$$O\Big(\frac{1}{\epsilon^3}\Big)$$
stochastic gradients (ignoring the dependence on $L$, $\sigma$ and $\Delta_f$) for nonconvex online problem (2).
For achieving a high-probability result for finding second-order stationary points (i.e., Theorem 4), we need a stronger version of Assumption 3, given as the following Assumption 4.
Assumption 4 (Bounded Variance)
For any $x$ and any $i$, $\|\nabla f_i(x) - \nabla f(x)\|^2 \le \sigma^2$, where $\sigma > 0$ is a constant.
We want to point out that Assumption 4 can be relaxed such that $\|\nabla f_i(x) - \nabla f(x)\|^2$ has a sub-Gaussian tail, i.e., $\mathbb{E}\big[\exp\big(\|\nabla f_i(x) - \nabla f(x)\|^2 / \sigma^2\big)\big] \le e$ for all $x$. Then it is sufficient for us to get a high probability bound by using a Hoeffding bound on these sub-Gaussian variables. Note that Assumption 4 (or the relaxed sub-Gaussian version) is also standard in second-order stationary point finding algorithms (see e.g., (Allen-Zhu and Li, 2018; Zhou et al., 2018a; Fang et al., 2018)).
Theorem 4
Under Assumptions 1 and 2 (i.e. (4) and (6)) and Assumption 4, let $\Delta_f := f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. By letting step size $\eta = \tilde{O}(\frac{1}{L})$, batch size $B = \tilde{O}(\frac{\sigma^2}{\epsilon^2})$, minibatch size $b = \sqrt{B}$, epoch length $m = \sqrt{B}$, and choosing the perturbation radius $r$, threshold gradient $g_{\mathrm{thres}}$, threshold function value $f_{\mathrm{thres}}$ and super epoch length $t_{\mathrm{thres}}$ appropriately (see Appendix B), SSRGD will at least once get to an $(\epsilon, \delta)$-second-order stationary point with high probability using
$$\tilde{O}\Big(\frac{1}{\epsilon^3} + \frac{1}{\epsilon^2 \delta^3} + \frac{1}{\epsilon \delta^4}\Big)$$
stochastic gradients (ignoring the dependence on $L$, $\rho$, $\sigma$ and $\Delta_f$) for nonconvex online problem (2).
5 Overview of the Proofs
5.1 Finding First-order Stationary Points
In this section, we first explain why this stochastic recursive gradient descent algorithm can improve the previous SVRG-type algorithms (see e.g., (Li and Li, 2018; Ge et al., 2019)) from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$. Then we give a simple high-level proof for achieving the convergence result (i.e., Theorem 1).
Why it can be improved from $n^{2/3}/\epsilon^2$ to $\sqrt{n}/\epsilon^2$: First, we need a key relation between $f(x_t)$ and $f(x_{t+1})$, where $x_{t+1} := x_t - \eta v_t$:
$$f(x_{t+1}) \le f(x_t) - \frac{\eta}{2} \|\nabla f(x_t)\|^2 - \Big(\frac{1}{2\eta} - \frac{L}{2}\Big) \|x_{t+1} - x_t\|^2 + \frac{\eta}{2} \|\nabla f(x_t) - v_t\|^2, \qquad (12)$$
where (12) holds since $f$ has Lipschitz continuous gradient (Assumption 1). The details for obtaining (12) can be found in Appendix B.1 (see (25)).
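As a quick sanity check, (12) can be reproduced from the standard descent lemma; the following short derivation is our own reconstruction of the appendix computation:
$$
\begin{aligned}
f(x_{t+1}) &\le f(x_t) + \langle \nabla f(x_t), x_{t+1} - x_t \rangle + \frac{L}{2}\|x_{t+1} - x_t\|^2 \\
&= f(x_t) - \eta \langle \nabla f(x_t), v_t \rangle + \frac{L\eta^2}{2}\|v_t\|^2 \\
&= f(x_t) - \frac{\eta}{2}\|\nabla f(x_t)\|^2 - \frac{\eta}{2}\|v_t\|^2 + \frac{\eta}{2}\|\nabla f(x_t) - v_t\|^2 + \frac{L\eta^2}{2}\|v_t\|^2,
\end{aligned}
$$
where the last step uses $2\langle a, b \rangle = \|a\|^2 + \|b\|^2 - \|a - b\|^2$; substituting $\|x_{t+1} - x_t\| = \eta \|v_t\|$ into the two $\|v_t\|^2$ terms gives exactly (12).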
Note that (12) is very meaningful and also very important for the proofs. The first term indicates that the function value decreases a lot if the gradient $\|\nabla f(x_t)\|$ is large. The second term indicates that the function value also decreases a lot if the moving distance $\|x_{t+1} - x_t\|$ is large (note that here we require the step size $\eta \le \frac{1}{2L}$). The additional third term exists since we use $v_t$ as an estimator of the actual gradient $\nabla f(x_t)$ (i.e., $x_{t+1} := x_t - \eta v_t$), so it may increase the function value if $v_t$ is a bad direction in this step.

To get an $\epsilon$-first-order stationary point, we want to cancel the last two terms in (12). Firstly, we want to bound the last variance term. Recall the variance bound (see Equation (29) in (Li and Li, 2018)) for the SVRG estimator (9):
$$\mathbb{E}[\|\nabla f(x_t) - v_t\|^2] \le \frac{L^2}{b} \mathbb{E}[\|x_t - \tilde{x}\|^2], \qquad (13)$$
where $\tilde{x}$ denotes the snapshot point of the current epoch.
In order to connect the last two terms in (12), we use Young's inequality for the second term $\|x_t - \tilde{x}\|^2$, i.e., $\|a + c\|^2 \le (1 + 1/\alpha)\|a\|^2 + (1 + \alpha)\|c\|^2$ (for any $\alpha > 0$). By plugging this Young's inequality and (13) into (12), we can cancel the last two terms in (12) by summing up (12) over each epoch (i.e., the iterations $t$ within the epoch), i.e., for each epoch $s$, we have (see Equation (35) in (Li and Li, 2018))
$$\mathbb{E}[f(\tilde{x}^{s+1})] \le \mathbb{E}[f(\tilde{x}^{s})] - \frac{\eta}{2} \sum_{t \in \text{epoch } s} \mathbb{E}[\|\nabla f(x_t)\|^2], \qquad (14)$$
where $\tilde{x}^{s}$ and $\tilde{x}^{s+1}$ denote the snapshot points of epochs $s$ and $s+1$.
However, due to the Young's inequalities, we need to let $b = m^2$ to cancel the last two terms in (12) for obtaining (14), where $b$ denotes the minibatch size and $m$ denotes the epoch length. According to (14), it is not hard to see that $\hat{x}$ is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|] \le \epsilon$) if $\hat{x}$ is chosen uniformly at random from the iterates $\{x_t\}$ and the number of iterations $T = O(\Delta_f / (\eta \epsilon^2))$. Note that for each iteration we need to compute $b + n/m$ stochastic gradients, where we amortize the full gradient computation at the beginning point of each epoch ($n$ stochastic gradients) over the $m$ iterations of its epoch for simple presentation. Thus, the convergence result is $n^{2/3}/\epsilon^2$, since under $b = m^2$ the per-iteration cost $b + n/m$ is minimized (up to a constant factor) at $m = n^{1/3}$, i.e., $b = n^{2/3}$. Note that here we ignore the $L$ and $\Delta_f$ factors.
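Spelling out this calibration (a back-of-the-envelope check using the amortized per-iteration cost $b + n/m$ and the SVRG constraint $b = m^2$):
$$
\min_{m}\Big(m^2 + \frac{n}{m}\Big) = \Theta\big(n^{2/3}\big), \quad \text{attained at } m = \Theta(n^{1/3}),\ b = m^2 = \Theta(n^{2/3}),
$$
so the total cost is $T \cdot \big(b + \tfrac{n}{m}\big) = O\big(\tfrac{1}{\epsilon^2}\big) \cdot \Theta(n^{2/3}) = O\big(\tfrac{n^{2/3}}{\epsilon^2}\big)$.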
However, for the stochastic recursive gradient estimator (10), we can bound the last variance term in (12) as (see Equation (31) in Appendix B.1):
$$\mathbb{E}[\|\nabla f(x_t) - v_t\|^2] \le \frac{L^2}{b} \sum_{j} \mathbb{E}[\|x_j - x_{j-1}\|^2], \qquad (15)$$
where the sum runs over the iterations $j$ of the current epoch up to $t$.
Now, the advantage of (15) compared with (13) is that it is already connected to the second term in (12), i.e., the moving distances $\|x_j - x_{j-1}\|^2$. Thus we do not need an additional Young's inequality to transform the second term as before. This makes the function value decrease bound tighter. Similarly, we plug (15) into (12) and sum it up over each epoch to cancel the last two terms in (12), i.e., for each epoch $s$, we have (see Equation (33) in Appendix B.1)
$$\mathbb{E}[f(\tilde{x}^{s+1})] \le \mathbb{E}[f(\tilde{x}^{s})] - \frac{\eta}{2} \sum_{t \in \text{epoch } s} \mathbb{E}[\|\nabla f(x_t)\|^2]. \qquad (16)$$
Compared with (14) (which requires $b = m^2$), here (16) only requires $b = m$ due to the tighter function value decrease bound, since it does not involve the additional Young's inequalities.
High-level proof for achieving the $\sqrt{n}/\epsilon^2$ result: Now, according to (16), we can use the same SVRG arguments as above to show the convergence result, i.e., $\hat{x}$ is an $\epsilon$-first-order stationary point in expectation (i.e., $\mathbb{E}[\|\nabla f(\hat{x})\|] \le \epsilon$) if $\hat{x}$ is chosen uniformly at random from the iterates $\{x_t\}$ and the number of iterations $T = O(\Delta_f / (\eta \epsilon^2))$. Also, for each iteration, we compute $b + n/m$ stochastic gradients. The only difference is that now the convergence result is $\sqrt{n}/\epsilon^2$ since $b + n/m \ge 2\sqrt{n}$ under $b = m$ (rather than $b = m^2$), where equality holds if $b = m = \sqrt{n}$. Here we ignore the $L$ and $\Delta_f$ factors. Thus our Theorem 1 is obtained.
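The improvement is exactly the arithmetic-geometric mean inequality applied under the relaxed constraint $b = m$:
$$
b + \frac{n}{m}\,\Big|_{b = m} = m + \frac{n}{m} \ \ge\ 2\sqrt{m \cdot \frac{n}{m}} = 2\sqrt{n}, \quad \text{with equality iff } m = \sqrt{n},
$$
so the total cost becomes $O\big(\tfrac{1}{\epsilon^2}\big) \cdot O(\sqrt{n}) = O\big(\tfrac{\sqrt{n}}{\epsilon^2}\big)$.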
5.2 Finding Second-order Stationary Points
In this section, we give the high-level proof ideas for finding a second-order stationary point with high probability. Note that our proof is different from that in (Ge et al., 2019) due to the different estimators (9) and (10). Ge et al. (2019) rely on the estimator (9) and its first-order analysis in (Li and Li, 2018), while our SSRGD uses the estimator (10). The difference between the first-order analyses of estimator (9) ((Li and Li, 2018)) and estimator (10) (this paper) was already discussed in Section 5.1. For the second-order analysis, since the estimator (10) in our SSRGD is more correlated across iterations than (9), we will use martingales to handle it. Besides, these different relations incur more differences in the detailed second-order guarantee analysis than in the first-order guarantee analysis.
We divide the proof into two situations, i.e., large gradients and around saddle points. According to (16), a natural way to prove the convergence result is to show that the function value decreases at a desired rate with high probability. Note that the total amount of function value decrease is at most $\Delta_f := f(x_0) - f^*$.
Large gradients:
In this situation, due to the large gradients, it is sufficient to adjust the first-order analysis to show that the function value will decrease a lot in an epoch.
Concretely, we want to show the function value decrease bound (16) holds with high probability.
It is not hard to see that the function value decreases at the desired rate per iteration in this situation (recall the parameters $\eta$ and $g_{\mathrm{thres}}$ in our Theorem 2).
Also note that we compute $b + n/m = O(\sqrt{n})$ stochastic gradients at each iteration (recall $b = m = \sqrt{n}$ in our Theorem 2).
Here we amortize the full gradient computation at the beginning point of each epoch ($n$ stochastic gradients) over the $m$ iterations of its epoch for simple presentation (we analyze this more rigorously in the detailed proofs in the appendices).
Thus the number of stochastic gradient computations is at most $\tilde{O}(\frac{\sqrt{n}}{\epsilon^2})$ for this large gradients situation.
For the proof, to show that the function value decrease bound (16) holds with high probability, we need to show that the bound for the variance term ($\|\nabla f(x_t) - v_t\|^2$) holds with high probability. Note that the estimator $v_t$ defined in (10) is correlated with the previous $v_{t-1}$. Fortunately, let $y_t := v_t - \nabla f(x_t)$; then it is not hard to see that $\{y_t\}$ is a martingale vector sequence with respect to a filtration $\{\mathcal{F}_t\}$ such that $\mathbb{E}[y_t \mid \mathcal{F}_{t-1}] = y_{t-1}$. Moreover, let $\{z_t\}$ denote the associated martingale difference sequence with respect to the filtration $\{\mathcal{F}_t\}$, i.e., $z_t := y_t - y_{t-1}$ and $\mathbb{E}[z_t \mid \mathcal{F}_{t-1}] = 0$. Thus, to bound the variance term with high probability, it is sufficient to bound the martingale sequence $\{y_t\}$. This can be done with high probability by using the martingale Azuma-Hoeffding inequality. Note that in order to apply the Azuma-Hoeffding inequality, we first need to use the Bernstein inequality to bound the associated difference sequence $\{z_t\}$. In sum, we get the high probability function value decrease bound by applying these two inequalities (see (42) in Appendix B.1).
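For completeness, we record the scalar Azuma-Hoeffding inequality invoked here (the proof actually uses a vector-norm variant, but this standard scalar form, which we state from the textbook literature, conveys the shape of the bound):
$$
\Pr\big(|y_T - y_0| \ge \beta\big) \le 2 \exp\Big(-\frac{\beta^2}{2 \sum_{t=1}^{T} c_t^2}\Big) \quad \text{whenever } |z_t| = |y_t - y_{t-1}| \le c_t \text{ for all } t.
$$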
Note that (42) only guarantees a function value decrease when the summation of gradients in this epoch is large. However, in order to connect the guarantees of the first situation (large gradients) and the second situation (around saddle points), we need to show guarantees that are related to the gradient of the starting point of each epoch (see Line 3 of Algorithm 2). Similar to (Ge et al., 2019), we achieve this by stopping the epoch at a uniformly random point (see Line 16 of Algorithm 2). We use the following lemma to connect the two situations (large gradients and around saddle points):
Lemma 1 (Connection of Two Situations)
For any epoch $s$, let $x_t$ be a point uniformly sampled from this epoch, and let the step size $\eta = \tilde{O}(\frac{1}{L})$ and the minibatch size $b = m$ be chosen as in Theorem 2. Then there are two cases:

1. If at least half of the points in this epoch have gradient norm no larger than $g_{\mathrm{thres}}$, then $\|\nabla f(x_t)\| \le g_{\mathrm{thres}}$ holds with probability at least $1/2$;

2. Otherwise, we know that the function value decreases a lot in this epoch, i.e., $f(x_{s_0}) - f(x_t) \ge f_{\mathrm{thres}}$ holds with constant probability, where $x_{s_0}$ denotes the starting point of this epoch.

Moreover, the function value does not increase much in this epoch with high probability, no matter which case happens.
Note that if Case 2 happens, the function value already decreases a lot in this epoch (as we already discussed at the beginning of this situation). Otherwise, Case 1 happens, and the starting point of the next epoch (i.e., Line 19 of Algorithm 2) satisfies $\|\nabla f(x)\| \le g_{\mathrm{thres}}$; then we will start a super epoch (see Line 3 of Algorithm 2). This corresponds to the following second situation, around saddle points. Note that if moreover $\lambda_{\min}(\nabla^2 f(x)) \ge -\delta$, this point is already an $(\epsilon, \delta)$-second-order stationary point (recall the choice of $g_{\mathrm{thres}}$ in our Theorem 2).
Around saddle points: $\|\nabla f(\tilde{x})\| \le g_{\mathrm{thres}}$ and $\lambda_{\min}(\nabla^2 f(\tilde{x})) \le -\delta$ at the initial point $\tilde{x}$ of a certain super epoch
In this situation, we want to show that the function value decreases a lot in a super epoch (instead of in an epoch as in the first situation) with high probability, by adding a random perturbation at the initial point $\tilde{x}$. To simplify the presentation, we use $x_0 := \tilde{x} + \xi$ to denote the starting point of the super epoch after the perturbation, where $\xi$ is sampled uniformly from the perturbation ball $\mathbb{B}_0(r)$ with perturbation radius $r$ (see Line 6 in Algorithm 2).
Following the classical and widely used two-point analysis developed in (Jin et al., 2017), we consider two coupled points $x_0$ and $x_0'$ with $x_0' := x_0 + r_0 e_1$, where $r_0$ is a small scalar and $e_1$ denotes the smallest eigenvector direction of the Hessian $\nabla^2 f(\tilde{x})$. Then we get two coupled sequences $\{x_t\}$ and $\{x_t'\}$ by running SSRGD update steps (Lines 8–12 of Algorithm 2) with the same choice of minibatches (i.e., the $I_b$'s in Line 12 of Algorithm 2) for a super epoch.
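As a self-contained illustration of this coupling argument (not the actual SSRGD estimator: we use plain noisy gradient steps on a hypothetical two-dimensional saddle, with shared noise standing in for the shared minibatches):

```python
import numpy as np

rng = np.random.default_rng(2)

# Saddle f(x) = 0.5 * x[0]**2 - 0.5 * x[1]**2; the Hessian's smallest
# eigenvector is e1 = (0, 1), the escaping direction.
grad = lambda x: np.array([x[0], -x[1]])

eta, r0, T = 0.1, 1e-6, 60
x = rng.normal(scale=1e-3, size=2)       # perturbed starting point x_0
x_prime = x + r0 * np.array([0.0, 1.0])  # coupled point x_0' = x_0 + r0 * e1

for t in range(T):
    noise = rng.normal(scale=1e-4, size=2)  # SAME randomness for both sequences
    x = x - eta * (grad(x) + noise)
    x_prime = x_prime - eta * (grad(x_prime) + noise)

# The difference w_t = x_t' - x_t follows w_{t+1} = (I - eta * H) w_t here, so
# its e1-component grows like (1 + eta)^T: the two runs separate exponentially
# and at least one of them escapes the saddle.
print("separation:", np.linalg.norm(x_prime - x), "vs initial", r0)
```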
We will show that at least one of these two coupled sequences decreases the function value a lot (i.e., escapes the saddle point) with high probability:
$$\max\big\{f(x_0) - f(x_{t_{\mathrm{thres}}}),\ f(x_0') - f(x_{t_{\mathrm{thres}}}')\big\} \ge 2 f_{\mathrm{thres}}. \qquad (17)$$
Similar to the classical argument in (Jin et al., 2017), according to (17), we know that within the random perturbation ball, the stuck points can only lie on a short interval in the $e_1$ direction, i.e., at least one of any two points along the $e_1$ direction will escape the saddle point if their distance is larger than $r_0$. Thus, the probability that the starting point $x_0 = \tilde{x} + \xi$ (where $\xi$ is uniform in $\mathbb{B}_0(r)$) is located in the stuck region is small (see (48) in Appendix B.1). By a union bound ($x_0$ is not in the stuck region, and (17) holds), with high probability we have
$$f(x_0) - f(x_{t_{\mathrm{thres}}}) \ge 2 f_{\mathrm{thres}}. \qquad (18)$$
Note that the initial point of this super epoch is $\tilde{x}$ before the perturbation (see Line 6 of Algorithm 2), thus we also need to show that the perturbation step $x_0 = \tilde{x} + \xi$ (where $\xi$ is uniform in $\mathbb{B}_0(r)$) does not increase the function value much, i.e.,
$$f(x_0) \le f(\tilde{x}) + \langle \nabla f(\tilde{x}), \xi \rangle + \frac{L}{2}\|\xi\|^2 \le f(\tilde{x}) + g_{\mathrm{thres}}\, r + \frac{L}{2} r^2 \le f(\tilde{x}) + f_{\mathrm{thres}}, \qquad (19)$$
where the second inequality holds since the initial point satisfies $\|\nabla f(\tilde{x})\| \le g_{\mathrm{thres}}$ and the perturbation radius is $r$, and the last inequality holds by letting the perturbation radius $r$ be small enough. By combining (18) and (19), we obtain, with high probability,
$$f(\tilde{x}) - f(x_{t_{\mathrm{thres}}}) \ge f_{\mathrm{thres}}. \qquad (20)$$
Now we can obtain the desired rate of function value decrease in this situation, namely $f_{\mathrm{thres}} / t_{\mathrm{thres}}$ per iteration (recall the parameters $f_{\mathrm{thres}}$, $t_{\mathrm{thres}}$ and $\eta$ in our Theorem 2). Same as before, we compute $b + n/m$ stochastic gradients at each iteration (recall $b = m = \sqrt{n}$ in our Theorem 2). Thus the number of stochastic gradient computations is at most $\tilde{O}(\frac{\sqrt{n}}{\delta^4} + \frac{n}{\delta^3})$ for this around-saddle-points situation.
Now, the remaining thing is to prove (17). It can be proved by contradiction. Assume the contrary, i.e., $f(x_0) - f(x_{t_{\mathrm{thres}}}) < 2 f_{\mathrm{thres}}$ and $f(x_0') - f(x_{t_{\mathrm{thres}}}') < 2 f_{\mathrm{thres}}$. First, we show that if the function value does not decrease a lot, then all iteration points are not far from the starting point, with high probability.
Lemma 2 (Localization)
Then we will show that the stuck region is relatively small within the random perturbation ball, i.e., at least one of $x_t$ and $x_t'$ will go far away from its starting point $x_0$ or $x_0'$ with high probability.