1 Introduction
Nonconvex optimization is widely used in machine learning. Recently, for problems like matrix sensing
(Bhojanapalli et al., 2016), matrix completion (Ge et al., 2016), and certain objectives for neural networks
(Ge et al., 2017b), it was shown that all local minima are also globally optimal, so simple local search algorithms can be used to solve these problems. For a convex function $f$, a local and global minimum is achieved whenever the point has zero gradient: $\nabla f(x) = 0$. However, for nonconvex functions, a point with zero gradient can also be a saddle point. To avoid converging to saddle points, recent results (Ge et al., 2015; Jin et al., 2017a, b) prove stronger guarantees showing that local search algorithms converge to approximate second-order stationary points, that is, points with small gradients and almost positive semidefinite Hessians (see Definition 1).
In theory, Xu et al. (2018) and Allen-Zhu and Li (2017) independently showed that finding a second-order stationary point is not much harder than finding a first-order stationary point: they give reduction algorithms Neon/Neon2 that can converge to second-order stationary points when combined with algorithms that find first-order stationary points. Algorithms obtained by such reductions are complicated, and they require a negative curvature search subroutine: given a point $x$, find an approximate smallest eigenvector of the Hessian $\nabla^2 f(x)$. In practice, standard algorithms for convex optimization work in the nonconvex setting without a negative curvature search subroutine. What algorithms can be directly adapted to the nonconvex setting, and what are the simplest modifications that allow a theoretical analysis? For gradient descent, Jin et al. (2017a) showed that a simple perturbation step is enough to find a second-order stationary point, and this was later shown to be necessary (Du et al., 2017). For accelerated gradient descent, Jin et al. (2017b) showed that a simple modification allows the algorithm to work in the nonconvex setting and to escape saddle points faster than gradient descent. In this paper, we show that there is also a simple modification to the Stochastic Variance Reduced Gradient (SVRG) algorithm (Johnson and Zhang, 2013) that is guaranteed to find a second-order stationary point.
SVRG is designed to optimize a finite-sum objective of the form $f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x)$, where evaluating the full gradient $\nabla f(x)$ requires evaluating every $\nabla f_i(x)$. In the original result, Johnson and Zhang (2013) showed that when the $f_i$'s are smooth and $f$ is strongly convex, SVRG finds a point with $\epsilon$ error using $O\big((n + L/\mu)\log(1/\epsilon)\big)$ gradient computations, where $L$ is the smoothness and $\mu$ the strong convexity parameter. The same guarantees were also achieved by algorithms like SAG (Roux et al., 2012), SDCA (Shalev-Shwartz and Zhang, 2013) and SAGA (Defazio et al., 2014), but SVRG is much cleaner both in terms of implementation and analysis.
SVRG has also been analyzed in the nonconvex regime: Reddi et al. (2016) and Allen-Zhu and Hazan (2016) showed that SVRG can find an $\epsilon$-first-order stationary point using $O(n + n^{2/3}/\epsilon^2)$ stochastic gradients. Li and Li (2018) analyzed a batched-gradient version of SVRG and achieved the same guarantee with a much simpler analysis. These results can then be combined with the reductions (Allen-Zhu and Li, 2017; Xu et al., 2018) to give complicated algorithms for finding second-order stationary points. Using more complicated optimization techniques, it is possible to design faster algorithms for finding first-order stationary points, including FastCubic (Agarwal et al., 2016), SNVRG (Zhou et al., 2018b), and SPIDER-SFO (Fang et al., 2018). These algorithms can also be combined with procedures like Neon2 to give second-order guarantees.
In this paper, we give a variant of SVRG called Stabilized SVRG that is able to find second-order stationary points while maintaining the simplicity of the SVRG algorithm. See Table 1 for a comparison between our algorithm and existing results. The main term in the running time of our algorithm matches the analyses with first-order guarantees. All other algorithms that achieve second-order guarantees require negative curvature search subroutines like Neon2, and many are more complicated than SVRG even without this subroutine.
Algorithm | Stochastic Gradients | Guarantee | Simple
| | 1st-Order | ✓
Minibatch-SVRG (Li and Li, 2018) | | 1st-Order | ✓
Neon2+SVRG (Allen-Zhu and Li, 2017) | | 2nd-Order |
| | 2nd-Order |
SNVRG+Neon2 (Zhou et al., 2018a, b) | | 2nd-Order |
SPIDER-SFO (Fang et al., 2018) | | 2nd-Order |
Stabilized SVRG (this paper) | | 2nd-Order | ✓
2 Preliminaries
2.1 Notations
We use $\mathbb{N}$ and $\mathbb{R}$ to denote the set of natural numbers and the set of real numbers respectively. We use $[n]$ to denote the set $\{1, 2, \ldots, n\}$. Let $I$ be a multiset of size $b$ whose $j$-th element ($j \in [b]$) is chosen i.i.d. from $[n]$ uniformly at random ($I$ is used to denote the samples in a mini-batch of the algorithm). For vectors we use $\langle \cdot, \cdot \rangle$ to denote the inner product, and for matrices we use $\mathrm{tr}(\cdot)$ to denote the trace. We use $\|\cdot\|$ to denote the Euclidean norm of a vector and the spectral norm of a matrix, and $\lambda_{\max}(\cdot)$ and $\lambda_{\min}(\cdot)$ to denote the largest and the smallest eigenvalue of a real symmetric matrix.
Throughout the paper, we use $\tilde{O}(\cdot)$ and $\tilde{\Omega}(\cdot)$ to hide poly-logarithmic factors in the relevant parameters. We did not try to optimize the poly-logarithmic factors in the proofs.
2.2 Finite-Sum Objective and Stationary Points
Now we define the objective that we try to optimize. A finite-sum objective has the form

$$f(x) = \frac{1}{n}\sum_{i=1}^{n} f_i(x), \qquad (1)$$

where each $f_i$ maps a $d$-dimensional vector to a scalar and $n$ is finite. In our model, both $f$ and the individual functions $f_i$ can be nonconvex. We make standard smoothness assumptions as follows:
Assumption 1.
Each individual function $f_i$ has $L$-Lipschitz gradient, that is, $\|\nabla f_i(x) - \nabla f_i(y)\| \le L\|x - y\|$ for all $x, y$.
This implies that the average function $f$ also has $L$-Lipschitz gradient. We also assume that the average function and the individual functions have Lipschitz Hessians. That is,
Assumption 2.
The average function $f$ has $\rho$-Lipschitz Hessian, which means $\|\nabla^2 f(x) - \nabla^2 f(y)\| \le \rho\|x - y\|$ for all $x, y$;
each individual function $f_i$ has $\rho'$-Lipschitz Hessian, which means $\|\nabla^2 f_i(x) - \nabla^2 f_i(y)\| \le \rho'\|x - y\|$ for all $x, y$.
These two assumptions are standard in the literature on finding second-order stationary points
(Ge et al., 2015; Jin et al., 2017a, b; AllenZhu and Li, 2017).
The goal of nonconvex optimization algorithms is to converge to an approximate second-order stationary point.
Definition 1.
For a differentiable function $f$, $x$ is a first-order stationary point if $\nabla f(x) = 0$; $x$ is an $\epsilon$-first-order stationary point if $\|\nabla f(x)\| \le \epsilon$.
For a twice-differentiable function $f$, $x$ is a second-order stationary point if $\nabla f(x) = 0$ and $\nabla^2 f(x) \succeq 0$.
If $f$ is $\rho$-Hessian Lipschitz, $x$ is an $\epsilon$-second-order stationary point if $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}\big(\nabla^2 f(x)\big) \ge -\sqrt{\rho\epsilon}$.
This definition of an $\epsilon$-second-order stationary point is standard in previous literature (Ge et al., 2015; Jin et al., 2017a, b). Note that the definition uses the Hessian Lipschitz parameter $\rho$ of the average function (instead of the parameter $\rho'$ of the individual functions). It is easy to check that $\rho \le \rho'$. In Appendix F we show that there are natural applications where $\rho'$ is much larger than $\rho$, so in general algorithms that do not depend heavily on $\rho'$ are preferred.
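As a small illustration of Definition 1 (our own sketch, not part of the algorithm, and only feasible when the full gradient and Hessian can be formed explicitly), the following hypothetical helper checks the two conditions for a given point:

```python
import numpy as np

def is_eps_second_order_stationary(grad, hess, eps, rho):
    """Check Definition 1: small gradient and almost positive semidefinite Hessian."""
    small_gradient = np.linalg.norm(grad) <= eps
    almost_psd = np.linalg.eigvalsh(hess).min() >= -np.sqrt(rho * eps)
    return small_gradient and almost_psd
```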
2.3 SVRG Algorithm
In this section we give a brief overview of the SVRG algorithm. In particular, we follow the mini-batch version of Li and Li (2018), which we use in our analysis for simplicity.
The SVRG algorithm has an outer loop; we call each iteration of the outer loop an epoch. At the beginning of each epoch, the snapshot point is set to the current iterate and its full gradient is computed. Each epoch of SVRG consists of $m$ iterations. In each iteration, the SVRG algorithm picks $b$ random samples (with replacement) from $[n]$ to form a multiset $I$, and then estimates the gradient as

$$v_t = \frac{1}{b}\sum_{i \in I}\big(\nabla f_i(x_t) - \nabla f_i(x_{s(t)})\big) + \nabla f(x_{s(t)}),$$

where $x_{s(t)}$ denotes the snapshot point of the current epoch. After estimating the gradient, the SVRG algorithm performs the update $x_{t+1} = x_t - \eta v_t$, where $\eta$ is the step size. This gradient estimate is unbiased for the true gradient $\nabla f(x_t)$ and often has much smaller variance than the estimate used by stochastic gradient descent. The pseudocode for mini-batch SVRG is given in Algorithm 1.
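For concreteness, here is a minimal Python sketch of one mini-batch SVRG epoch (an illustration under our own assumptions, not the paper's implementation: grad_i(i, x) returns $\nabla f_i(x)$, full_grad(x) returns $\nabla f(x)$, and n, m, b, eta are the number of functions, epoch length, mini-batch size, and step size):

```python
import numpy as np

def svrg_epoch(x0, grad_i, full_grad, n, m, b, eta, rng):
    """One SVRG epoch of m iterations starting from the snapshot point x0."""
    snapshot = x0.copy()
    g_snapshot = full_grad(snapshot)              # full gradient at the snapshot point
    x = x0.copy()
    for _ in range(m):
        I = rng.integers(0, n, size=b)            # b indices sampled with replacement
        # variance-reduced gradient estimate v_t
        v = np.mean([grad_i(i, x) - grad_i(i, snapshot) for i in I], axis=0) + g_snapshot
        x = x - eta * v                           # gradient step with step size eta
    return x
```

In Algorithm 1 these epochs are chained, with the output of one epoch becoming the snapshot point of the next.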
3 Our Algorithms: Perturbed SVRG and Stabilized SVRG
In this paper we give two simple modifications of the original SVRG algorithm. First, similar to perturbed gradient descent (Jin et al., 2017a), we add perturbations to the SVRG algorithm to make it escape from saddle points efficiently. We will show that this algorithm finds an $\epsilon$-second-order stationary point in a running time that depends polynomially on $n$, $1/\epsilon$, the smoothness parameters (including the individual Hessian Lipschitz parameter $\rho'$), and the gap $\Delta_f$ between the initial function value and the optimal function value. This algorithm is efficient as long as $\rho'$ is comparable to $\rho$, but can be slower if $\rho'$ is much larger (see Appendix F for an example where $\rho' \gg \rho$).[1] To achieve stronger guarantees, we introduce Stabilized SVRG, which is another simple modification on top of Perturbed SVRG that improves the dependency on $\rho'$.

[1] Existing algorithms like Neon2+SVRG try to estimate the Hessian at a single point, so they do not depend heavily on $\rho'$ (in particular, they do not depend on $\rho'$ given access to a Hessian-vector product oracle, and depend only logarithmically on $\rho'$ with a gradient oracle). However, for our algorithm the iterates keep moving, so it is more difficult to get the correct dependency on $\rho'$.
3.1 Perturbed SVRG
Similar to gradient descent, if one starts SVRG exactly at a saddle point, it is easy to check that the algorithm will not move. To avoid this problem, we propose Perturbed SVRG; a high-level description is given in Algorithm 2. Intuitively, since the gradient of the function is computed at the beginning of each epoch of SVRG, we can add a small perturbation to the current point if this gradient turns out to be small (which means we are either near a saddle point or already at a second-order stationary point). Similar to perturbed gradient descent in Jin et al. (2017a), we also make sure that the algorithm does not add perturbations too often: the next perturbation can only happen either after many iterations or after the iterate has traveled a sufficiently large distance. The full algorithm is a bit more technical and is given in Algorithm 4 in the appendix.
Later, we will call the steps between the beginning and the end of a perturbation a super epoch. When the algorithm is not in a super epoch, for technical reasons we also use a version of SVRG that stops at a random iteration (this is not reflected in Algorithm 2, but is included in Algorithm 4). A rough sketch of this super-epoch logic is given below.
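The following sketch is a simplification of Algorithms 2 and 4 (illustration only; it reuses the hypothetical svrg_epoch helper from the previous sketch, and G_thresh, L_thresh, T_max, and r stand for the gradient threshold, distance threshold, super-epoch length, and perturbation radius):

```python
import numpy as np

def perturbed_svrg_super_epoch(x, grad_i, full_grad, n, m, b, eta,
                               G_thresh, L_thresh, T_max, r, rng):
    """If the full gradient is small, perturb and run SVRG until escape or timeout."""
    if np.linalg.norm(full_grad(x)) > G_thresh:
        return x, False                            # gradient is large: no perturbation needed
    x_saddle = x.copy()
    direction = rng.normal(size=x.shape)
    direction /= np.linalg.norm(direction)
    x = x + direction * r * rng.uniform() ** (1.0 / x.size)  # uniform point in a ball of radius r
    steps = 0
    # super epoch: stop after T_max steps or once the iterate travels distance L_thresh
    while steps < T_max and np.linalg.norm(x - x_saddle) <= L_thresh:
        x = svrg_epoch(x, grad_i, full_grad, n, m, b, eta, rng)
        steps += m
    return x, True
```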
For perturbed SVRG, we have the following guarantee:
Theorem 1.
Assume the function $f$ is $\rho$-Hessian Lipschitz, and each individual function $f_i$ is $L$-smooth and $\rho'$-Hessian Lipschitz. Let $\Delta_f = f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. There exist choices of the mini-batch size, epoch length, step size, perturbation radius, super-epoch length, gradient threshold, and distance threshold such that Perturbed SVRG (Algorithm 4) will at least once get to an $\epsilon$-second-order stationary point with high probability using
stochastic gradients.
3.2 Stabilized SVRG
In order to relax the dependency on $\rho'$, we further introduce stabilization into the algorithm. Basically, if we encounter a saddle point $\tilde{x}$, we run SVRG iterations on the shifted function $f(x) - \langle \nabla f(\tilde{x}), x\rangle$, whose gradient at $\tilde{x}$ is exactly zero. Another minor (but important) modification is to perturb the point within a ball of much smaller radius compared to Algorithm 2. We will give more intuition on why these modifications are necessary in Section 4.3.
The high-level idea of Stabilized SVRG is given in Algorithm 3. In the pseudocode, the key observation is that the gradient estimate on the shifted function equals the gradient estimate on the original function plus a stabilizing term; a sketch is given below. The detailed implementation of Stabilized SVRG is deferred to Algorithm 5. For Stabilized SVRG, the time complexity in the following theorem has only a poly-logarithmic dependency on $\rho'$, which is hidden in the $\tilde{O}$ notation.
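A minimal sketch of this observation (same hypothetical helpers as before; g_saddle is the full gradient computed at the saddle point when the super epoch starts, and g_snapshot is the full gradient of the original function $f$ at the current snapshot point):

```python
import numpy as np

def stabilized_estimate(x, snapshot, g_snapshot, I, grad_i, g_saddle):
    """SVRG gradient estimate for the shifted function f(x) - <grad f(saddle), x>."""
    # ordinary SVRG estimate of the gradient of f at x ...
    v = np.mean([grad_i(i, x) - grad_i(i, snapshot) for i in I], axis=0) + g_snapshot
    # ... plus the constant stabilizing term -g_saddle; at the saddle point the
    # shifted gradient is exactly zero, so the first iterates are not pushed far away
    return v - g_saddle
```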
Theorem 2.
Assume the function $f$ is $\rho$-Hessian Lipschitz, and each individual function $f_i$ is $L$-smooth and $\rho'$-Hessian Lipschitz. Let $\Delta_f = f(x_0) - f^*$, where $x_0$ is the initial point and $f^*$ is the optimal value of $f$. There exist choices of the mini-batch size, epoch length, step size, perturbation radius, super-epoch length, gradient threshold, and distance threshold such that Stabilized SVRG (Algorithm 5) will at least once get to an $\epsilon$-second-order stationary point with high probability using
stochastic gradients.
In previous work (Allen-Zhu and Li, 2017), it was shown that Neon2+SVRG has a similar time complexity for finding a second-order stationary point. Our result achieves a slightly better convergence rate using a much simpler variant of SVRG.
4 Overview of Proof Techniques
In this section, we illustrate the main ideas in the proofs of Theorems 1 and 2. Similar to many existing proofs for escaping saddle points, we will show that Algorithms 2 and 3 decrease the function value efficiently when the current point either has a large gradient or has large negative curvature (a very negative smallest eigenvalue of the Hessian). Since the function value cannot decrease below the global optimum $f^*$, the algorithms will be able to find a second-order stationary point within the desired number of iterations.
In the proof, we use notation similar to previous work (Jin et al., 2017a). We use $\mathscr{G}$ to denote the threshold for the gradient norm, and we show that the function value decreases if the average norm of the gradients in an epoch is at least $\mathscr{G}$. Starting from a saddle point, the super epoch ends if the number of steps exceeds the super-epoch length $T_{\max}$ or the distance to the saddle point exceeds the threshold distance $\mathscr{L}$. Both algorithms use the same form of gradient threshold and super-epoch length, while the distance threshold $\mathscr{L}$ is chosen differently for Perturbed SVRG and for Stabilized SVRG.
Throughout the analysis, we use $s(t)$ to denote the index of the snapshot point of iterate $x_t$. More precisely, $x_{s(t)}$ is the first iterate of the epoch that contains iteration $t$.
4.1 Exploiting Large Gradients
There have already been several proofs showing that SVRG can converge to a first-order stationary point, and our proof here is very similar. First, we show that the gradient estimate is accurate as long as the current point is close to the snapshot point.
Lemma 1.
For any point $x_t$, let the gradient estimate be $v_t = \frac{1}{b}\sum_{i \in I}\big(\nabla f_i(x_t) - \nabla f_i(x_{s(t)})\big) + \nabla f(x_{s(t)})$, where $x_{s(t)}$ is the snapshot point of the current epoch. Then, with high probability, the estimation error $\|v_t - \nabla f(x_t)\|$ is bounded by a quantity proportional to the distance $\|x_t - x_{s(t)}\|$ to the snapshot point and inversely proportional to $\sqrt{b}$.
This lemma is standard and the version for expected square error was proved in Li and Li (2018). Here we only applied simple concentration inequalities to get a high probability bound.
Next, we show that the decrease in function value is lower bounded by the sum of the squared gradient norms. The proof of the following lemma is adapted from Li and Li (2018) with minor modifications.
Lemma 2.
For any epoch, suppose the initial point is $x_0$, which is also the snapshot point for this epoch. Assume the gradient-estimate error bound of Lemma 1 holds for every iterate in the epoch. Then, for an appropriate choice of step size and mini-batch size, the decrease in function value $f(x_0) - f(x_t)$ is lower bounded by a quantity proportional to the sum of the squared gradient norms $\sum_{\tau < t} \|\nabla f(x_\tau)\|^2$, for any $t \le m$.
Using this fact, we can now state the guarantee for exploiting large gradients.
Lemma 3.
For any epoch, suppose the initial point is $x_0$, and let $x$ be a point sampled uniformly at random from the iterates of this epoch. Then, for appropriate parameter choices and any value of the gradient threshold $\mathscr{G}$, we have two cases:

1. if at least half of the points in the epoch have gradient norm no larger than $\mathscr{G}$, then with probability at least $1/2$ the sampled point $x$ also satisfies $\|\nabla f(x)\| \le \mathscr{G}$;

2. otherwise, with constant probability the function value decreases significantly, that is, $f(x)$ is smaller than $f(x_0)$ by an amount determined by $\mathscr{G}$ and the epoch parameters.

Further, no matter which case happens, with high probability the function value does not increase by much.
As this lemma suggests, our algorithm stops at a random iterate when it is not in a super epoch (this is reflected in the detailed Algorithms 4 and 5). In the first case, since at least half of the points have small gradients, by uniform sampling the sampled point has a small gradient with probability at least one half. In the second case, the function value decreases significantly. Proofs for the lemmas in this section are deferred to Appendix B.
4.2 Exploiting Large Negative Curvature – Perturbed SVRG
Section 4.1 already showed that if the algorithm is not in a super epoch, with constant probability every epoch of SVRG will either decrease the function value significantly or end at a point with a small gradient. In the latter case, if the point with a small gradient also has an almost positive semidefinite Hessian, then we have found an approximate second-order stationary point. Otherwise, the algorithm enters a super epoch, and we will show that with reasonable probability Algorithm 2 decreases the function value significantly within the super epoch.
For simplicity, we reset the indices of the iterates in the super epoch. Let the initial point be $\tilde{x}$, the point after the perturbation be $x_0$, and the iterates in this super epoch be $x_1, x_2, \ldots$.
The proof for Perturbed SVRG is very similar to the proof for perturbed gradient descent in Jin et al. (2017a). In particular, we perform a two-point analysis. That is, we consider two coupled samples of the perturbed point. Let $e_1$ be the smallest eigendirection of the Hessian $\nabla^2 f(\tilde{x})$. The two perturbed points $x_0$ and $x_0'$ differ only in the $e_1$ direction. We couple the two trajectories from $x_0$ and $x_0'$ by choosing the same mini-batches for both of them. The iterates of the two sequences are denoted by $x_t$ and $x_t'$ respectively. Our goal is to show that with good probability at least one of these two sequences escapes the saddle point; a sketch of the coupling is given below.
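The coupling can be made concrete with the following sketch (illustration only; e1 is an assumed unit vector along the escaping direction, delta is the initial separation of the two runs, and for simplicity the common part of the perturbation is dropped):

```python
import numpy as np

def coupled_svrg(x_saddle, e1, delta, grad_i, full_grad, n, m, b, eta, rng):
    """Two coupled SVRG runs (one epoch each) that share every mini-batch."""
    x  = x_saddle + 0.5 * delta * e1               # first perturbed point
    xp = x_saddle - 0.5 * delta * e1               # second point, differs only along e1
    snap, snap_p = x.copy(), xp.copy()
    g_snap, g_snap_p = full_grad(snap), full_grad(snap_p)
    for _ in range(m):
        I = rng.integers(0, n, size=b)             # the SAME mini-batch indices for both runs
        v  = np.mean([grad_i(i, x) - grad_i(i, snap) for i in I], axis=0) + g_snap
        vp = np.mean([grad_i(i, xp) - grad_i(i, snap_p) for i in I], axis=0) + g_snap_p
        x, xp = x - eta * v, xp - eta * vp
    return x, xp                                   # the analysis tracks w = x - xp
```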
To do that, we keep track of the difference between the two sequences, $w_t = x_t - x_t'$. The key lemma in this section uses the Hessian Lipschitz condition to show that the variance of $w_t$ (introduced by the random choice of mini-batches) can actually be much smaller than the variance we observed in Lemma 1. More precisely,
Lemma 4.
Let $\{x_t\}$ and $\{x_t'\}$ be two SVRG sequences running on $f$ that use the same choice of mini-batches. Let $x_{s(t)}$ and $x_{s(t)}'$ be the snapshot points for iterates $x_t$ and $x_t'$, and let $w_t = x_t - x_t'$. Then, with high probability, the variance introduced into $w_t$ at step $t$ is bounded by a term proportional to $\|w_t - w_{s(t)}\|$ plus an additional term proportional to $\rho'$ and to the maximum distance of the iterates from the initial point.
This variance is often much smaller than before: in the extreme case where the individual functions are quadratics, the variance is proportional to $\|w_t - w_{s(t)}\|$ alone. In the proof we will show that $w_t$ cannot change very quickly within a single epoch, so $\|w_t - w_{s(t)}\|$ is much smaller than $\|w_t\|$ or $\|w_{s(t)}\|$. Using this new variance bound we can prove:
Lemma 5 (informal).
Let $\{x_t\}$ and $\{x_t'\}$ be two SVRG sequences running on $f$ that use the same choice of mini-batches. Assume $x_0 - x_0'$ aligns with the $e_1$ direction and is not too small. Setting the parameters appropriately, we know that with high probability $\max\{\|x_t - \tilde{x}\|, \|x_t' - \tilde{x}\|\} \ge \mathscr{L}$ for some $t$ within the super-epoch length.
Intuitively, this lemma is true because at every iteration we expect $w_t$ to be multiplied by a factor slightly larger than one (determined by the most negative eigenvalue of the Hessian and the step size) if the iterate follows the exact gradient, and the variance bound from Lemma 4 is tight enough to preserve this growth. The precise statement of the lemma is given in Lemma 16 in Appendix C. The lemma shows that one of the two points can escape from a local neighborhood, which by the following lemma is enough to guarantee a decrease in function value:
Lemma 6.
Let $x_0$ be the initial point, which is also the snapshot point of the current epoch. Let $x_1, \ldots, x_t$ be the iterates of SVRG running on $f$ starting from $x_0$. Fix any $t \le m$ and suppose the gradient-estimate error bound of Lemma 1 holds for every iterate up to $x_t$. Then, for an appropriate choice of step size, the decrease in function value $f(x_0) - f(x_t)$ is lower bounded in terms of the distance $\|x_t - x_0\|$ from the initial point.
4.3 Exploiting Large Negative Curvature – Stabilized SVRG
The main problem with the previous analysis is that when $\rho'$ is large, the variance estimate in Lemma 4 is no longer very strong. To solve this problem, note that the additional term is proportional to the maximum distance of the iterates to the initial point. If we can make sure that the iterates stay very close to the initial point for long enough, we will still be able to use Lemma 4 to get a good variance estimate.
However, in Perturbed SVRG the iterates are not going to stay close to the starting point $\tilde{x}$, as the initial point can have a non-negligible gradient that makes the iterates travel a significant distance (see Figure 1(a)). To fix this problem, we make a simple change to the function so that its gradient at $\tilde{x}$ equals 0. More precisely, we define the stabilized function $\tilde f(x) = f(x) - \langle \nabla f(\tilde{x}), x \rangle$. After this stabilization, at least the first few iterates will not travel very far (see Figure 1(b)). Our algorithm applies SVRG to this stabilized function.
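As a short sanity check (our own calculation, writing $\tilde{x}$ for the saddle point at which the super epoch starts), the shift removes exactly the gradient at $\tilde{x}$ and leaves the Hessian unchanged:

$$\tilde f(x) = f(x) - \langle \nabla f(\tilde{x}), x \rangle
\;\;\Longrightarrow\;\;
\nabla \tilde f(x) = \nabla f(x) - \nabla f(\tilde{x}),
\qquad
\nabla \tilde f(\tilde{x}) = 0,
\qquad
\nabla^2 \tilde f(x) = \nabla^2 f(x).$$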
For the stabilized function $\tilde f$, we have $\nabla \tilde f(\tilde{x}) = 0$, so $\tilde{x}$ is an exact first-order stationary point. In this case, if the initial perturbation radius is small, we will show that the behavior of the algorithm has two phases. In Phase 1, the iterates remain in a small ball around $\tilde{x}$, which allows us to obtain very tight bounds on the variance and on the potential changes of the Hessian. By the end of Phase 1, we show that the projection of the iterate onto the very negative eigendirections of $\nabla^2 f(\tilde{x})$ is already large. This means that Phase 1 has essentially performed a negative curvature search without a separate subroutine! Using the last point of Phase 1 as a good initialization, in Phase 2 we show that the point will eventually escape. See Figure 2 for an illustration of the two phases.
The rest of this subsection describes the two phases in more detail in order to prove the following main lemma:
Lemma 7 (informal).
Let $\tilde{x}$ be the initial point of the super epoch, with a small gradient and with $\lambda_{\min}\big(\nabla^2 f(\tilde{x})\big) \le -\sqrt{\rho\epsilon}$. Let $x_1, x_2, \ldots$ be the iterates of SVRG running on the stabilized function $\tilde f$ starting from $x_0$, which is the perturbed point of $\tilde{x}$. Let $T$ be the length of the current super epoch. Setting the parameters appropriately, we know that with at least constant probability $f(x_T) \le f(\tilde{x}) - \mathscr{F}$, and with high probability the function value does not increase by more than a small fraction of $\mathscr{F}$, where $\mathscr{F}$ is a threshold determined by the parameter choices.
Basically, this lemma shows that starting from a saddle point, with constant probability the function value decreases by at least $\mathscr{F}$ after a super epoch; with high probability, the function value does not increase by more than a small fraction of $\mathscr{F}$. The precise statement of this lemma is given in Lemma 24 in Appendix D. Proofs for the lemmas in this section are deferred to Appendix D.
4.3.1 Analysis of Phase 1
Let $S$ be the subspace spanned by all the eigenvectors of $\nabla^2 f(\tilde{x})$ whose eigenvalues are below a fixed negative threshold. Our goal is to show that by the end of Phase 1, the projection of $x_t - \tilde{x}$ onto the subspace $S$ becomes large while the total movement is still bounded. To prove this, we use the following conditions to define Phase 1:
Stopping Condition:
An iterate $x_t$ is in Phase 1 if (1) the number of iterations so far is below a fixed threshold, or (2) the projection of $x_t - \tilde{x}$ onto the subspace $S$ is still small.
Once both conditions fail, Phase 1 has ended. Intuitively, the second condition guarantees that the projection of $x_t - \tilde{x}$ onto the subspace $S$ is large at the end of Phase 1. The first condition makes sure that Phase 1 is long enough that the projection of $x_t - \tilde{x}$ along the positive eigendirections of $\nabla^2 f(\tilde{x})$ has shrunk significantly, which will be crucial in the analysis of Phase 2.
With the above two conditions, the length of Phase 1 can be defined as the first iteration at which both conditions fail:

$$T_1 \;=\; \min\{\, t : \text{neither condition (1) nor condition (2) holds for } x_t \,\}. \qquad (2)$$
The main lemma for Phase 1 gives the following guarantee:
Lemma 8 (informal).
By choosing the step size, the perturbation radius, and the mini-batch size appropriately, with constant probability the length of the first phase is bounded and the projection of the final Phase 1 iterate onto the subspace $S$ is large.
We will first show that the iterates in Phase 1 cannot go very far from the initial point:
Lemma 9 (informal).
Let $T_1$ be the length of Phase 1. Setting parameters appropriately, we know that with high probability, for every $t < T_1$, the step $\|x_{t+1} - x_t\|$ taken by the algorithm is small.
The formal version of the above lemma is Lemma 20. Taking the sum over all $t < T_1$ and noting that $T_1$ is bounded, this implies that the iterates are constrained in a ball around $\tilde{x}$ whose radius is not much larger than the perturbation radius. If we choose the perturbation radius to be small enough, within this ball Lemma 4 gives very sharp bounds on the variance of the gradient estimates. This allows us to repeat the two-point analysis of Section 4.2 and prove that at least one sequence must have a large projection onto the subspace $S$ within $T_1$ steps. Recall that in the two-point analysis, we consider two coupled samples of the perturbed point, $x_0$ and $x_0'$, which differ only in the $e_1$ direction, and the two sequences $\{x_t\}$ and $\{x_t'\}$ share the same choice of mini-batches at each step. Basically, we prove that after Phase 1 the difference between the two sequences along the $e_1$ direction becomes large, which implies that at least one sequence must have a large distance to $\tilde{x}$ within the subspace $S$. The formal version of the following lemma is Lemma 21.
Lemma 10 (informal).
Let $\{x_t\}$ and $\{x_t'\}$ be two SVRG sequences running on $\tilde f$ that use the same choice of mini-batches. Assume $x_0 - x_0'$ aligns with the $e_1$ direction and is not too small. Let $T_1$ and $T_1'$ be the lengths of Phase 1 for the two sequences respectively. Setting parameters appropriately, with high probability the difference between the two sequences along the $e_1$ direction becomes large by the end of Phase 1. W.l.o.g., suppose the first sequence is the one with the larger movement; then we further have that its projection onto the subspace $S$ at the end of its Phase 1 is large.
Remark:
We note that the guarantee of Lemma 10 for Phase 1 is very similar to the guarantee of a negative curvature search subroutine: we find a direction that has a large projection onto the subspace $S$, which is spanned only by the very negative eigendirections of $\nabla^2 f(\tilde{x})$.
4.3.2 Analysis of Phase 2
By the guarantee of Phase 1, we know that if it is successful, the last Phase 1 iterate has a large projection onto the subspace $S$ of very negative eigenvalues. Starting from such a point, in Phase 2 we show that the projection of $x_t - \tilde{x}$ onto $S$ grows exponentially and exceeds the threshold distance within a bounded number of steps. In order to prove this, we use the following expansion of the iterates (writing $\mathcal{H} = \nabla^2 f(\tilde{x})$):

$$x_{t+1} - \tilde{x} \;=\; (I - \eta \mathcal{H})(x_t - \tilde{x}) \;-\; \eta\,\Delta_t(x_t - \tilde{x}) \;-\; \eta\,\xi_t,$$

where $\Delta_t$ is the Hessian-change term, capturing the difference between the Hessian along the segment from $\tilde{x}$ to $x_t$ and $\mathcal{H}$, and $\xi_t$ is the variance term, the error of the SVRG gradient estimate at step $t$. Intuitively, if we only had the first term, the norm of the projection of $x_t - \tilde{x}$ onto $S$ would increase exponentially and the iterate would become very far from $\tilde{x}$ in a small number of iterations. Our proof bounds the Hessian-change term and the variance term separately to show that they do not affect this exponential increase. The main lemma that we will prove for Phase 2 is:
Lemma 11 (informal).
Assume Phase 1 is successful, in the sense that its length is bounded and the final Phase 1 iterate has a large projection onto the subspace $S$. Setting parameters appropriately, with high probability we know that there exists some $t$ within the super-epoch length such that $\|x_t - \tilde{x}\| \ge \mathscr{L}$.
The precise version of the above lemma is in Lemma 23 in Appendix D. Similar to Lemma 5, the lemma above shows that the iterates will escape from a local neighborhood if Phase 1 was successful (which happens with at least constant probability). We can then use Lemma 6 to bound the function value decrease.
4.4 Proof of Main Theorems
Finally, we are ready to sketch the proof of Theorem 2. For each epoch, if the gradients are large, then by Lemma 3 with constant probability the function value decreases significantly. For each super epoch, if the starting point has significant negative curvature, then by Lemma 7 with constant probability the function value decreases by at least $\mathscr{F}$. We also know that each epoch uses $O(n + mb)$ stochastic gradients (the full gradient at the snapshot plus $m$ mini-batches of size $b$), and each super epoch consists of at most $T_{\max}/m$ such epochs. Thus, after sufficiently many epochs and super epochs, the function value would decrease below the global optimum $f^*$ with high probability unless we have already reached an $\epsilon$-second-order stationary point. Therefore, we will at least once get to an $\epsilon$-second-order stationary point within the claimed number of stochastic gradients. The formal proof of Theorem 2 is deferred to Appendix E. The proof of Theorem 1 is almost the same, except that it uses Lemma 5 instead of Lemma 7 for the guarantee of the super epoch. A schematic version of this accounting is given below.
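To make the accounting schematic, suppose every large-gradient epoch decreases the function value by at least $\mathscr{F}_1$ and every super epoch decreases it by at least $\mathscr{F}$ (symbols introduced here purely for illustration; the paper's actual bound is organized through the formal lemmas). Then, up to constants, logarithmic factors, and the constant success probabilities,

$$\text{total stochastic gradients} \;\approx\; \underbrace{\frac{\Delta_f}{\mathscr{F}_1}}_{\text{number of epochs}} \cdot \big(n + mb\big) \;+\; \underbrace{\frac{\Delta_f}{\mathscr{F}}}_{\text{number of super epochs}} \cdot \frac{T_{\max}}{m}\,\big(n + mb\big).$$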
5 Conclusion
This paper gives a new algorithm, Stabilized SVRG, that is able to find an $\epsilon$-second-order stationary point using a number of stochastic gradients whose main term matches the best first-order analyses of SVRG. To the best of our knowledge, this is the first such algorithm that does not rely on a separate negative curvature search subroutine, and it is much simpler than all existing algorithms with similar guarantees. In our proof, we developed the new technique of stabilization (Section 4.3), where we showed that if the initial point has exactly zero gradient and the initial perturbation is small, then the first phase of the algorithm achieves the guarantee of a negative curvature search subroutine. We believe the stabilization technique can be useful for analyzing other optimization algorithms in nonconvex settings without using an explicit negative curvature search. We hope techniques like this will allow us to develop nonconvex optimization algorithms that are as simple as their convex counterparts.
Acknowledgement
This work was supported by NSF CCF-1704656.
References
 Agarwal et al. (2016) Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima for nonconvex optimization in linear time. arXiv preprint arXiv:1611.01146, 2016.
 Allen-Zhu (2017) Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 Allen-Zhu and Hazan (2016) Zeyuan Allen-Zhu and Elad Hazan. Variance reduction for faster non-convex optimization. In International Conference on Machine Learning, pages 699–707, 2016.
 Allen-Zhu and Li (2017) Zeyuan Allen-Zhu and Yuanzhi Li. Neon2: Finding local minima via first-order oracles. arXiv preprint arXiv:1711.06673, 2017.
 Bai and Yin (1988) Zhi-Dong Bai and Yong-Qua Yin. Necessary and sufficient conditions for almost sure convergence of the largest eigenvalue of a Wigner matrix. The Annals of Probability, pages 1729–1741, 1988.
 Bhojanapalli et al. (2016) Srinadh Bhojanapalli, Behnam Neyshabur, and Nati Srebro. Global optimality of local search for low rank matrix recovery. In Advances in Neural Information Processing Systems, pages 3873–3881, 2016.
 Candes and Plan (2011) Emmanuel J Candes and Yaniv Plan. Tight oracle inequalities for low-rank matrix recovery from a minimal number of noisy random measurements. IEEE Transactions on Information Theory, 57(4):2342–2359, 2011.
 Carmon et al. (2016) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. arXiv preprint arXiv:1611.00756, 2016.
 Defazio et al. (2014) Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems, pages 1646–1654, 2014.
 Du et al. (2017) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pages 1067–1077, 2017.
 Fang et al. (2018) Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. SPIDER: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems, pages 687–697, 2018.

 Ge et al. (2015) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points – online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
 Ge et al. (2016) Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
 Ge et al. (2017a) Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. arXiv preprint arXiv:1704.00708, 2017a.
 Ge et al. (2017b) Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks with landscape design. arXiv preprint arXiv:1711.00501, 2017b.
 Jin et al. (2017a) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. arXiv preprint arXiv:1703.00887, 2017a.
 Jin et al. (2017b) Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
 Johnson and Zhang (2013) Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in neural information processing systems, pages 315–323, 2013.
 Lei et al. (2017) Lihua Lei, Cheng Ju, Jianbo Chen, and Michael I Jordan. Non-convex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems, pages 2345–2355, 2017.
 Li and Li (2018) Zhize Li and Jian Li. A simple proximal stochastic gradient method for nonsmooth nonconvex optimization. arXiv preprint arXiv:1802.04477, 2018.
 Recht et al. (2010) Benjamin Recht, Maryam Fazel, and Pablo A Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. SIAM Review, 52(3):471–501, 2010.
 Reddi et al. (2016) Sashank J Reddi, Ahmed Hefny, Suvrit Sra, Barnabas Poczos, and Alex Smola. Stochastic variance reduction for nonconvex optimization. In International conference on machine learning, pages 314–323, 2016.
 Roux et al. (2012) Nicolas L Roux, Mark Schmidt, and Francis R Bach. A stochastic gradient method with an exponential convergence rate for finite training sets. In Advances in Neural Information Processing Systems, pages 2663–2671, 2012.
 Shalev-Shwartz and Zhang (2013) Shai Shalev-Shwartz and Tong Zhang. Stochastic dual coordinate ascent methods for regularized loss minimization. Journal of Machine Learning Research, 14(Feb):567–599, 2013.

 Tao (2012) Terence Tao. Topics in Random Matrix Theory, volume 132. American Mathematical Soc., 2012.
 Tripuraneni et al. (2018) Nilesh Tripuraneni, Mitchell Stern, Chi Jin, Jeffrey Regier, and Michael I Jordan. Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems, pages 2904–2913, 2018.
 Tropp (2012) Joel A Tropp. User-friendly tail bounds for sums of random matrices. Foundations of Computational Mathematics, 12(4):389–434, 2012.
 Xu et al. (2018) Yi Xu, Rong Jin, and Tianbao Yang. First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems, pages 5535–5545, 2018.
 Zhou et al. (2018a) Dongruo Zhou, Pan Xu, and Quanquan Gu. Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782, 2018a.
 Zhou et al. (2018b) Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. arXiv preprint arXiv:1806.07811, 2018b.
Appendix A Detailed Descriptions of Our Algorithm
In this section, we give the complete descriptions of the Perturbed SVRG and Stabilized SVRG algorithms.
Perturbed SVRG
Perturbed SVRG is given in Algorithm 4. The only difference between this algorithm and the high-level description in Algorithm 2 is that we have now stated the stopping condition explicitly, and when the algorithm is not running a super epoch, we choose a random iterate as the starting point of the next epoch (this is necessary because of the guarantee in Lemma 2).
In the algorithm, the break probability in Step 16 is used to implement the random stopping. Breaking the loop with this probability is exactly equivalent to finishing the whole loop and then sampling the output iterate uniformly at random.
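As a small illustration of this equivalence (a sketch, not necessarily the exact constant used in Step 16): breaking at inner step $t$ of an epoch of length $m$ with probability $1/(m-t+1)$ makes the stopping index exactly uniform over $\{1, \ldots, m\}$.

```python
import collections
import numpy as np

def random_stop(m, rng):
    """Break at step t with probability 1/(m - t + 1); the last step always breaks."""
    for t in range(1, m + 1):
        if rng.random() < 1.0 / (m - t + 1):
            return t
    return m  # unreachable: at t = m the break probability is 1

rng = np.random.default_rng(0)
counts = collections.Counter(random_stop(10, rng) for _ in range(100000))
# each stopping index 1..10 appears roughly 10% of the time
```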
Stabilized SVRG
Stabilized SVRG is given in Algorithm 5. The only difference between Stabilized SVRG and Perturbed SVRG is that Stabilized SVRG adds an additional shift, equal to the negative of the gradient at the saddle point where the super epoch started, to every gradient estimate while it is in a super epoch (the stabilizing term in the algorithm).