1 Introduction
Nonconvex stochastic optimization is crucial in machine learning and has attracted tremendous attention and unprecedented popularity. Many modern tasks, including low-rank matrix factorization/completion and principal component analysis
(Candès & Recht, 2009; Jolliffe, 2011), dictionary learning (Sun et al., 2017; Reynolds et al., 2000), and notably deep neural networks
(Hinton & Salakhutdinov, 2006), are formulated as nonconvex stochastic optimization problems. In this paper, we concentrate on finding an approximate solution to the following minimization problem: (1.1)
Here,
denotes a family of stochastic functions indexed by some random variable
that obeys some prescribed distribution , and we consider the general case where and have Lipschitz-continuous gradients and Hessians and might be nonconvex. In empirical risk minimization tasks, is a uniform discrete distribution over the set of training sample indices, and the stochastic function corresponds to the nonconvex loss associated with such a sample. One of the classical algorithms for optimizing (1.1) is the Stochastic Gradient Descent (SGD) method, which performs descent updates iteratively via the inexpensive stochastic gradient
that serves as an unbiased estimator of (the inaccessible) gradient
(Robbins & Monro, 1951; Bottou & Bousquet, 2008), i.e. . Let denote the positive step size; then at step , the iteration performs the following update: (1.2)
where is randomly sampled at iteration . SGD admits perhaps the simplest update rule among stochastic first-order methods; see Algorithm 1 for a formal illustration of the meta-algorithm. It has gained tremendous popularity among researchers and engineers due to its exceptional effectiveness and performance in practice. Taking the example of training deep neural networks, the dominant algorithm at present is SGD (Abadi et al., 2016)
, where the stochastic gradient is computed via one backpropagation step. Superior characteristics of SGD have been observed in many empirical studies, including but
not limited to fast convergence, desirable solutions of low training loss, and good generalization ability. Turning to the theoretical side, relatively mature and concrete analyses in the existing literature (Rakhlin et al., 2012; Agarwal et al., 2009) show that SGD achieves an optimal rate of convergence for convex objective functions under standard regimes. Specifically, the convergence rate of in terms of the function optimality gap matches the algorithmic lower bound for an appropriate class of strongly convex functions (Agarwal et al., 2009).
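To make the update rule (1.2) concrete, here is a minimal sketch of the SGD iteration in Python/NumPy. The oracle `grad_sample` and the toy quadratic objective are our own illustrative assumptions, not part of the paper's setup:

```python
import numpy as np

def sgd(grad_sample, x0, eta, num_steps, rng):
    """Plain SGD as in (1.2): x_{k+1} = x_k - eta * g_k, where g_k is an
    unbiased stochastic gradient drawn at step k.  `grad_sample(x, rng)` is
    a hypothetical oracle returning one stochastic gradient at x."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_steps):
        x = x - eta * grad_sample(x, rng)
    return x

# Toy instance: f(x) = 0.5 * ||x||^2 with additive zero-mean gradient noise,
# so grad_sample is an unbiased estimator of the true gradient x.
rng = np.random.default_rng(0)
noisy_grad = lambda x, rng: x + 0.1 * rng.standard_normal(x.shape)
x_final = sgd(noisy_grad, x0=np.ones(5), eta=0.1, num_steps=2000, rng=rng)
print(np.linalg.norm(x_final))  # small: iterates hover near the minimizer 0
```

With a constant step size the iterates do not converge exactly but settle into a noise ball around the minimizer, which is the behavior the later analysis controls.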
Despite the optimal convex optimization rates that SGD achieves, provable nonconvex convergence results for SGD have long been confined to finding an approximate first-order stationary point: with high probability, SGD finds an
such that in stochastic gradient computational costs, under the gradient Lipschitz condition on (Nesterov, 2004). In contrast, our goal in this paper is to find an approximate second-order stationary point such that and the least eigenvalue of the Hessian matrix
is , where denotes the so-called Hessian-Lipschitz parameter to be specified later (Nesterov & Polyak, 2006; Tripuraneni et al., 2018; Carmon et al., 2018; Agarwal et al., 2017). Put differently, we need to escape from all first-order stationary points that admit a strongly negative Hessian eigenvalue (a.k.a. saddle points) (Dauphin et al., 2014) and land at a point that quantitatively resembles a local minimizer in terms of the gradient norm and the least Hessian eigenvalue. Results on the convergence rate of SGD for finding an approximate second-order stationary point have been scarce until very recently.^{1} ^{1}Some authors work with stationary points; we ignore such an expression due to the natural choice in the optimization literature (Nesterov & Polyak, 2006; Jin et al., 2017). To the best of our knowledge, Ge et al. (2015) provided the first theoretical result showing that SGD with artificially injected spherical noise can escape from all saddle points in polynomial time. Moreover, Ge et al. (2015) showed that SGD finds an approximate second-order stationary point at a stochastic gradient computational cost of . A recent follow-up work by Daneshmand et al. (2018) derived a convergence rate of stochastic gradient computations. These milestone works (Ge et al., 2015; Daneshmand et al., 2018) showed that SGD can always escape from saddle points and can find an approximate local solution of (1.1) at a stochastic gradient computational cost that is polynomially dependent on problem-specific parameters. Motivated by these recent works, the current paper tries to answer the following questions:

Is it possible to sharpen the analysis of SGD algorithm and obtain a reduced stochastic gradient computational cost for finding an approximate secondorder stationary point?

Is artificial noise injection absolutely necessary for SGD to find an approximate secondorder stationary point with an almost dimensionfree stochastic gradient computational cost?
To answer question (i) above, we provide a sharp analysis and prove that SGD finds an approximate stationary point at a remarkable stochastic gradient computational cost for solving (1.1). This is an unexpected result, because it has been conjectured by many (Xu et al., 2018; Allen-Zhu & Li, 2018; Tripuraneni et al., 2018) that an cost is required to find an approximate second-order stationary point. Our result on SGD negates this conjecture and improves upon the sharpest stochastic gradient computational cost for SGD known prior to this work. To answer question (ii) above, we propose a novel dispersive noise assumption and prove that under such an assumption, SGD requires no artificial noise injection in order to achieve the aforementioned sharp stochastic gradient computational cost. Such a noise assumption is satisfied in the cases of infinite online samples and zeroth-order optimization with Gaussian sampling, and can be satisfied automatically by injecting artificial ball-shaped, spherical uniform, or Gaussian noise.
We emphasize that the stochastic gradient computational cost is, however, not the complexity lower bound for finding an approximate second-order stationary point of optimization problem (1.1). Recently, Fang et al. (2018)
applied a novel variance reduction technique named
Spider tracking and proposed the SpiderSFO^{+} algorithm, which provably achieves a stochastic gradient computational cost of for finding an approximate second-order stationary point. It is our belief that variance reduction techniques are necessary to achieve a stochastic gradient computational cost that is strictly sharper than .
1.1 Our Contributions
In this work, we theoretically study the SGD algorithm for minimizing a nonconvex function . Specifically, this work contributes the following:

We propose a sharp convergence analysis for the classical and simple SGD algorithm and prove that the total stochastic gradient computational cost to find a second-order stationary point is at most , under Lipschitz-continuous gradient and Hessian assumptions on the objective function. This convergence rate matches the best accelerated nonconvex stochastic optimization results, such as those based on Nesterov's momentum acceleration, negative curvature search, and quadratic and cubic regularization tricks.

We propose the dispersive noise assumption and prove that under such an assumption, SGD is guaranteed to escape all saddle points that have a strongly negative Hessian eigenvalue. This type of noise generalizes artificial ball-shaped noise and is widely applicable to many tasks.

Our novel analytic tools for proving saddle escaping and fast convergence of SGD are of independent interest, and they shed light on developing and analyzing new stochastic optimization algorithms.
Organization
The rest of the paper is organized as follows. §2 provides the SGD algorithm and the main convergence rate theorem for finding an approximate second-order stationary point. In §3, we sketch the proof of our convergence rate theorem by providing and discussing three core propositions. We conclude our paper in §5 with proposed future directions. Missing proofs are detailed in order in the Appendix.
Notation
Let
denote the Euclidean norm of a vector or the spectral norm of a square matrix. Denote
for a sequence of vectors and positive scalars if there is a global constant such that , and such hides a polylogarithmic factor of and . Denote if there is a global constant which hides a polylogarithmic factor such that . We denote if there is a global constant which hides a polylogarithmic factor of and such that . Further, we denote the linear transformation of a set
as . Let denote the least eigenvalue of a real symmetric matrix . We denote as the neighborhood of , i.e. the set .
2 Algorithm and Main Result
In this section, we formally state SGD and the corresponding convergence rate theorem. In §2.1, we present the key assumptions on the objective functions and noise distributions. In §2.2, we detail SGD in Algorithm 2 and present the main convergence rate theorem.
2.1 Assumptions and Definitions
We first present the following smoothness assumption on the objective adopted in this paper:
Assumption 1 (Smoothness).
Let be twice-differentiable with Lipschitz-continuous gradients and Lipschitz-continuous Hessians, i.e. for all ,
(2.1) 
and
(2.2) 
With the Hessian-Lipschitz parameter prescribed in (2.2), we formally define the approximate second-order stationary point. To the best of our knowledge, this concept first appeared in Nesterov & Polyak (2006):
Definition 1 (Second-order Stationary Point).
We call an approximate second-order stationary point if
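As an illustration of Definition 1, the sketch below numerically checks the two conditions (small gradient norm and a nearly nonnegative Hessian spectrum). The function names and the tolerance parameters `eps_g`, `eps_h` are hypothetical stand-ins for the thresholds in the definition:

```python
import numpy as np

def is_second_order_stationary(grad, hess, x, eps_g, eps_h):
    """Numerically check approximate second-order stationarity of x:
    the gradient norm is at most eps_g AND the least Hessian eigenvalue
    is at least -eps_h.  `grad` and `hess` are user-supplied callables."""
    g_ok = np.linalg.norm(grad(x)) <= eps_g
    lam_min = np.linalg.eigvalsh(hess(x)).min()  # least eigenvalue
    h_ok = lam_min >= -eps_h
    return g_ok and h_ok

# Saddle of f(x, y) = x^2 - y^2 at the origin: the gradient vanishes, but
# the Hessian has eigenvalue -2, so the origin is first-order stationary
# yet NOT second-order stationary.
grad = lambda v: np.array([2 * v[0], -2 * v[1]])
hess = lambda v: np.diag([2.0, -2.0])
print(is_second_order_stationary(grad, hess, np.zeros(2), 1e-3, 1e-3))  # False
```

This is precisely the distinction the paper targets: points like this saddle pass the first-order test but fail the second-order one.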
Let the starting point of our SGD algorithm be . We make the following boundedness assumption:
Assumption 2 (Boundedness).
The initial suboptimality gap is finite, where is the global infimum value of .
Turning to the assumptions on noise, we first assume the following:
Assumption 3 (Bounded Noise).
For any , the stochastic gradient satisfies:
(2.3) 
An alternative (slightly weaker) assumption that also works is that the norm of the noise follows a sub-Gaussian distribution, i.e. for any ,
(2.4) 
Assumptions 1, 2 and 3 are standard in the nonconvex optimization literature (Ge et al., 2015; Xu et al., 2018; Allen-Zhu & Li, 2018; Fang et al., 2018). We treat the parameters , , , and as global constants, and focus on the dependence of the stochastic gradient complexity on and .
To enable fast escape from saddle points, we need an extra noise shape assumption. Let be a positive real number, and let be a unit vector. We define a set property as follows:
Definition 2 (narrow property).
We say that a Borel set satisfies the narrow property if for any and , holds, where denotes the complement of .
It is easy to verify that the first parameter in the narrow property is linearly scalable and translation invariant, i.e. if satisfies the narrow property, then for any , also satisfies the narrow property. Next, we introduce the dispersive property as follows:
Definition 3 (dispersive property).
Obviously, if satisfies the dispersive property, then for any fixed vector , also satisfies the dispersive property. We then present the dispersive noise assumption as follows:
Assumption 4 (Dispersive Noise).
For an arbitrary point , admits the dispersive property (as in Definition 3) for any unit vector .
Assumption 4 is motivated by the key lemma for escaping from saddle points in Jin et al. (2017), which obtains a sharp rate for gradient descent escaping from saddle points. Such an assumption enables SGD to move out of a stuck region with probability in its first step, and hence to escape from saddle points (by repeating a logarithmic number of rounds). We would like to emphasize that dispersive noises include many canonical examples; see the following.
Examples of Dispersive Noises
Here we exemplify a few noise distributions that satisfy the dispersive property, that is, for an arbitrary set with the narrow property, where . We prove the following proposition:
Proposition 1.
For the following noise distributions, (2.5) in Definition 3 is satisfied:

Gaussian noise: , where is standard Gaussian noise with covariance matrix ;

Uniform ball-shaped noise or spherical noise: , where is uniformly sampled from the unit ball centered at ;

Artificial noise injection: , where is some independent artificial noise that is dispersive for any .
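The canonical distributions in Proposition 1 can be sampled in a few lines. The sketch below is illustrative (the function names are ours); it uses the standard tricks of normalizing a Gaussian vector for uniform spherical sampling and applying a radial power transform for uniform ball sampling:

```python
import numpy as np

def gaussian_noise(dim, sigma, rng):
    """Isotropic Gaussian noise, one of the canonical dispersive examples."""
    return sigma * rng.standard_normal(dim)

def ball_noise(dim, radius, rng):
    """Uniform sample from the ball of the given radius: pick a direction
    uniformly on the sphere, then a radius with density proportional to
    r^{d-1} via the inverse-CDF transform u^{1/d}."""
    direction = rng.standard_normal(dim)
    direction /= np.linalg.norm(direction)
    r = radius * rng.uniform() ** (1.0 / dim)
    return r * direction

def sphere_noise(dim, radius, rng):
    """Uniform sample from the sphere of the given radius."""
    direction = rng.standard_normal(dim)
    return radius * direction / np.linalg.norm(direction)

rng = np.random.default_rng(0)
print(np.linalg.norm(ball_noise(10, 1.0, rng)) <= 1.0)  # True by construction
```

Each of these is rotation-invariant, which is the intuition for why no thin slab (a set with the narrow property) can capture much of its mass.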
2.2 SGD and Main Theorem
Our SGD algorithm for analysis purposes is detailed in Algorithm 2. It differs from classical SGD algorithms only in its stopping criterion. Distinct from the classical variants that simply terminate after a certain number of steps and output the final iterate or a randomly drawn iterate, the SGD we consider here introduces a ball-controlled mechanism as the stopping criterion: if exits a small neighborhood within iterations (Lines 2 to 6), one starts over and does the next round of SGD; if exiting does not occur within iterations, then the algorithm simply outputs an arithmetic average of the last iterates within the neighborhood, which in turn is an approximate second-order stationary point with high probability. In contrast with the stopping criterion in the deterministic setting, which checks the descent in function values (Jin et al., 2017), the function value in the stochastic setting is fairly costly to approximate (it costs stochastic gradient computations), and the error plateaus might be hard to observe theoretically.
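One plausible reading of this ball-controlled mechanism can be sketched as follows. All names, the restart policy, and the toy gradient oracle are our own assumptions rather than a verbatim transcription of Algorithm 2:

```python
import numpy as np

def sgd_ball_controlled(grad_sample, x0, eta, radius, K, max_rounds, rng):
    """Ball-controlled SGD sketch: run SGD from an anchor; if the iterate
    exits the radius-ball around the anchor within K steps, restart the
    next round from the current iterate; if it stays inside for all K
    steps, return the average of the in-ball iterates."""
    x0 = np.asarray(x0, dtype=float)
    for _ in range(max_rounds):
        anchor, x, trace = x0.copy(), x0.copy(), []
        for _ in range(K):
            x = x - eta * grad_sample(x, rng)
            if np.linalg.norm(x - anchor) > radius:  # exited the ball
                break
            trace.append(x.copy())
        else:  # never exited: output the average of the in-ball iterates
            return np.mean(trace, axis=0)
        x0 = x  # start the next round from where we exited
    return x0  # fallback if no round stabilized

rng = np.random.default_rng(1)
noisy_grad = lambda x, rng: x + 0.05 * rng.standard_normal(x.shape)
out = sgd_ball_controlled(noisy_grad, np.ones(3), 0.1, 0.5, 200, 50, rng)
```

On this toy quadratic, early rounds exit the ball (the function value is still decreasing fast), and the final round stabilizes near the minimizer, mirroring the dichotomy the analysis exploits: either the iterate escapes the ball and makes progress, or it stays and its average is nearly stationary.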
Parameter Setting
We set the hyperparameters^{2} for Algorithm 2 as follows: (^{2}Set . Because in (2.2) involves logarithmic factors of and , a simple choice is to set the step size as .)
(2.6) 
For brevity of analysis, we assume , and . In other words, we assume the accuracy .
Now we are ready to present our main theorem on SGD.
Theorem 1 (SGD Rate).
Strikingly, Theorem 1 indicates that SGD in Algorithm 2 achieves a stochastic gradient computational cost of to find an approximate second-order stationary point. Compared with existing algorithms that achieve an convergence rate, SGD is comparatively the simplest to implement and does not invoke any additional techniques or iterations such as momentum acceleration (Jin et al., 2018b), cubic regularization (Tripuraneni et al., 2018), regularization (Allen-Zhu, 2018a), or Neon negative curvature search (Xu et al., 2018; Allen-Zhu & Li, 2018).
Admittedly, the best-known SGD theoretical guarantee in Theorem 1 relies on a dispersive noise assumption. To remove such an assumption, we argue that one needs to run an SGD step with dispersive noise in only steps of each round to enable efficient escaping. We propose a variant of SGD called Noise-Scheduled SGD, which requires artificial noise injection but does not rely on a dispersive noise assumption. The algorithm is shown in Algorithm 3, and one can obtain its convergence property straightforwardly.
Remark 1.
For the function class that admits the strict-saddle property (Carmon et al., 2018; Ge et al., 2015; Jin et al., 2017), an approximate second-order stationary point is guaranteed to be an approximate local minimizer. For example, for optimizing a strict-saddle function, one can first find an approximate second-order stationary point with , which is guaranteed to be an approximate local minimizer due to the strict-saddle property. Our SGD convergence rate is independent of the target accuracy , and one can apply standard convex optimization theory to obtain an convergence rate in terms of the optimality gap. We omit the details due to space limitations.
3 Proof Sketches
For convenience, when we study Algorithm 2 in each inner loop from Line 2 to Line 8, we override the definition of as its initial vector. We briefly introduce our proof techniques in this section; the detailed rigorous proofs are deferred to the Appendix. Our proof consists of two main ingredients. The first is to prove that SGD can efficiently escape saddle points: with high probability, if , moves out of in iterations. The second is to show that SGD converges at a faster rate of , rather than . We further separate the second goal into two parts:

Throughout the execution of the algorithm, each time moves out of , with high probability the function value decreases by at least .

If does not move out of within iterations, then with high probability we have found a desired approximate second-order stationary point.
Let be the filtration containing the full information of all previous iterations, where denotes the sigma-field, and let be the first time (mathematically, a stopping time) that exits the neighborhood of , i.e.
(3.1) 
Both and are measurable with respect to , where denotes the indicator function.
3.1 Part I: Escaping Saddles
Our goal is to prove the following proposition:
Proposition 2.
Proposition 2 essentially says that if the function has a negative Hessian eigenvalue at , then the iteration exits the neighborhood of in steps with high probability.
To prove Proposition 2, we let , , be the SGD iteration starting from a fixed point and using the same stochastic samples as the iteration , i.e.
(3.3) 
Obviously, we have . Let be the first step number (a stopping time) such that exits the neighborhood of . Formally,
(3.4) 
It is easy to see from (3.1) that . Inspired by Jin et al. (2017), we cope with the stochasticity of gradients and define the so-called bad initialization region as the set of points from which the iteration exits the neighborhood of with probability :
(3.5) 
We will show that the bad initialization region enjoys the narrow property, where . Since the first step provides continuous noise, as assumed in Assumption 3, a properly selected moves the iteration out of the bad initialization region in its first step with probability . Repeating such an argument over a logarithmic number of rounds ensures that escaping occurs with high probability.
The idea is to prove the following lemma:
Lemma 1.
Let the assumptions of Proposition 2 hold, and let (WLOG) be an arbitrary eigenvector of corresponding to its smallest eigenvalue , which satisfies . Then for any fixed and any pair of points , we have (3.6)
Lemma 1 is inspired by Lemma 15 in Jin et al. (2017). Nevertheless, due to the noise introduced at each update step, the analysis of stochastic gradient descent differs from that of gradient descent in many aspects. For example, instead of showing the decrease of the function value, we need to show that with positive probability, at least one of the two iterations, or , exits the neighborhood of . Our proof is also more intuitive compared with that of Lemma 15 in Jin et al. (2017). The core idea is to focus on analyzing the difference trajectory between and , and to show that the rotation speed of the difference trajectory is the same as its expansion speed. A detailed proof is provided in §C.1.
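The coupling idea behind Lemma 1 can be illustrated numerically: run two SGD trajectories from points separated along the negative-curvature eigenvector, feed both the same noise so that the noise cancels in their difference, and watch the difference expand geometrically. The quadratic saddle below is our own toy example, not one from the paper:

```python
import numpy as np

# Quadratic saddle f(x) = 0.5 * x^T H x with one negative eigenvalue.
H = np.diag([1.0, -0.5])            # eigenvalue -0.5: negative curvature
eta, T = 0.1, 80
rng = np.random.default_rng(0)

r = 1e-6
x = np.array([0.3, 0.0])
y = x + np.array([0.0, r])          # shifted along the bottom eigenvector

for _ in range(T):
    xi = 0.01 * rng.standard_normal(2)   # SAME noise for both trajectories
    x = x - eta * (H @ x + xi)
    y = y - eta * (H @ y + xi)

# The shared noise cancels in the difference, which obeys the deterministic
# recurrence d <- (I - eta * H) d and grows like (1 + eta * 0.5)^T along
# the negative-curvature direction.
print(np.linalg.norm(y - x) / r)    # ~ 1.05^80, roughly 49.6
```

Since at least one of two exponentially separating trajectories must leave any fixed-radius ball, the coupled pair certifies escape, which is the mechanism the lemma formalizes.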
3.2 Part II: Faster Descent
The goal of Part II is to prove the following proposition:
Proposition 3 (Faster Descent).
Proposition 3 is the key for SGD to achieve the reduced stochastic computation costs. It shows that no matter what the local landscape of looks like, once moves out of the ball within iterations, the function value decreases by at least . Put differently, on average the function value decreases by at least per iteration during the execution of Algorithm 2. We present the basic argument below.
We start by reviewing the more traditional approach for proving sufficient descent of SGD, and then we discuss how to improve it, as done in this work. The previous approaches are all based on the idea of Nesterov (2004), which mainly takes advantage of the gradient-smoothness condition of the objective. The proof can be briefly described as follows:
(3.8)  
From the above derivation, in order to guarantee monotone descent of the function value in expectation, the step size needs to be
(3.9) 
where the last equality uses . Plugging (3.9) into (3.8), and using , we have that the function value decreases by at least per iteration. This result indicates that, in the worst case, SGD takes stochastic oracle calls to find an approximate first-order stationary point. This simple argument is the reason why previous works conjectured that the complexity of SGD is .
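The algebra behind the bound (3.8) can be sanity-checked numerically. On the toy quadratic below (our own example, whose Hessian is exactly L times the identity), the smoothness expansion holds pointwise with equality; taking expectations and using the unbiasedness of the stochastic gradient then recovers (3.8):

```python
import numpy as np

# Check the expansion behind the classical descent bound (3.8) on the
# quadratic f(z) = 0.5 * L * ||z||^2, whose Hessian is exactly L * I:
#   f(x - eta*g) = f(x) - eta*<grad f(x), g> + (L * eta^2 / 2) * ||g||^2
# holds pointwise, and E[g] = grad f(x) turns it into the expected bound.
L, sigma, eta = 2.0, 1.0, 0.05
x = np.array([1.0, -2.0])
g_true = L * x  # the exact gradient of f at x

rng = np.random.default_rng(0)
g = g_true + sigma * rng.standard_normal((10_000, 2))  # stochastic gradients
f = lambda z: 0.5 * L * np.sum(z * z, axis=-1)

lhs = f(x - eta * g)  # f evaluated at each perturbed point
rhs = f(x) - eta * (g @ g_true) + 0.5 * L * eta**2 * np.sum(g * g, axis=1)
print(np.allclose(lhs, rhs))  # True: the expansion is exact for this quadratic
```

For general smooth (non-quadratic) objectives the right-hand side is only an upper bound, which is exactly how the descent lemma is used in the text.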
However, in this paper we show that the above analysis can be further improved by using the Hessian-smoothness condition of the objective, considering the decomposition of the objective function , and treating the components and separately as follows:

(Case 1) The component is locally near-convex, in the sense that for all . In this case, by using techniques for near-convex problems, it is possible to take a larger step size and prove a faster convergence rate.

(Case 2) The component is locally near-concave, in the sense that for all . In this case, it can be shown that the last term on the right-hand side of (3.8) can be reduced to . Therefore the step size can be chosen as , leading to a fast reduction of the function value.
To formalize the above observations into a rigorous proof, we introduce the quadratic approximation of at the point , defined as
(3.10) 
We let be the subspace spanned by all eigenvectors of whose eigenvalues are greater than , and let denote the complementary subspace. Also let and be the projection matrices onto and , respectively, and let the full SVD of be . We introduce and , respectively, and define the following two auxiliary functions and :
(3.11) 
and
(3.12) 
For the previously mentioned decomposition of , one may simply take and . It can be checked that and . It follows that we only need to analyze the two quadratic approximations and separately. We then bound the difference between and as .
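The subspace splitting can be sketched as follows. The eigenvalue threshold `delta` and the function names are our illustrative assumptions; the construction itself is the standard spectral projection:

```python
import numpy as np

def split_projections(H, delta):
    """Project onto the span S of eigenvectors of H with eigenvalue
    greater than -delta (the 'near-convex' part) and onto its complement
    (the 'near-concave' part).  Returns the projection matrices."""
    lam, U = np.linalg.eigh(H)           # eigendecomposition of symmetric H
    mask = lam > -delta
    P_S = U[:, mask] @ U[:, mask].T
    return P_S, np.eye(H.shape[0]) - P_S

H = np.diag([2.0, 0.5, -1.0])
P, P_perp = split_projections(H, delta=0.1)
# P + P_perp is the identity, and both projectors commute with H, so the
# quadratic form x^T H x splits exactly into the two subspace contributions.
x = np.array([1.0, 2.0, 3.0])
split_sum = (P @ x) @ H @ (P @ x) + (P_perp @ x) @ H @ (P_perp @ x)
print(np.allclose(x @ H @ x, split_sum))  # True
```

This exact splitting of the quadratic form is what allows the two auxiliary functions to be analyzed independently and then recombined.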
The analysis for can be obtained via the standard analysis informally described above in Case 2 (see Lemma 7).
Our proof technique for dealing with is to introduce an auxiliary trajectory with the following deterministic updates for as:
(3.13) 
and . We then track and analyze the difference trajectory between and (see Lemma 6). Since simply performs gradient descent, we can arrive at our final result for (see Lemma 5), which leads to a rigorous statement of Case 1.
Finally, via the fact that moves out of the ball within iterations throughout the execution of Algorithm 2, we prove that with high probability the sum of the gradient norms can be lower-bounded as:
(3.14) 
which ensures sufficient descent of the function value. Putting the above arguments together, we obtain Proposition 3.
3.3 Part III: Finding SSP
Part III proves the following proposition:
Proposition 4.
With probability at least , if has not moved out of the ball within iterations, then, letting , we have
(3.15) 
Proposition 4 can be obtained via the same idea as Part II. We first study the quadratic approximation function and then bound the difference between and .
4 Discussions on Related Works
With the recent surge of deep learning, many researchers in the machine learning community have studied the nonconvex SGD method from various perspectives. We compare our results with concurrent theoretical works on nonconvex SGD in the following discussions. For clarity, we also compare the convergence rates of the works most related to ours in Table 1.
Pioneer SGD: The first work on SGD escaping from saddle points, Ge et al. (2015), obtained a stochastic gradient computational cost of .^{3} ^{3}The analysis in (Ge et al., 2015) indicates a factor of at least . We are aware that the authors of Ge et al. (2015) have claimed a different proof obtaining a stochastic computational cost of using the technique of Jin et al. (2017), but we have not found it online as of the initial release of this work. Later, Jin et al. (2017; 2018b) proposed noise-perturbed GD and AGD and achieved sharp gradient computational costs, which suggests the possibility of a sharper SGD rate for escaping saddles. Our analysis in this work is partially motivated by Jin et al. (2017) for escaping from saddle points, but it generalizes the noise condition and needs no deliberate noise injection (which, strictly speaking, makes the perturbed method no longer the original GD/SGD algorithm).

Concurrent SGD: A recent result by Daneshmand et al. (2018) obtains a stochastic computational cost of for finding an approximate second-order stationary point. The highlight of their work is that they need no injection of artificial noise. Nevertheless, in their work the Correlated Negative Curvature parameter cannot be treated as a constant: in the case of injected spherical or Gaussian noise, it can be at most linearly dependent on [Assumption 4], so their result is again not (almost) dimension-free, and their worst-case convergence rate should be interpreted as .

NC search + SGD: The Neon+SGD methods (Xu et al., 2018; Allen-Zhu & Li, 2018) achieve a dimension-free convergence rate of for the general problem of the form (1.1) to reach an approximate second-order stationary point. Prior to this, classical nonconvex GD/SGD only achieved such a rate for finding an approximate first-order stationary point (Nesterov, 2004); with the help of the Neon method, it successfully escapes from saddles via a Negative Curvature (NC) search iteration.

Regularization + SGD: Very recently, Allen-Zhu (2018a) took a quadratic regularization approach and equipped it with the negative-curvature search iteration Neon2 (Allen-Zhu & Li, 2018), which successfully improves the rate to . In comparison, our method achieves essentially the same rate without using regularization methods. Tripuraneni et al. (2018) proposed a stochastic variant of the cubic regularization method (Nesterov & Polyak, 2006; Agarwal et al., 2017) and achieves the same convergence rate, being the first to achieve such a rate without invoking variance-reduced gradient techniques.^{4} ^{4}Note that in the convergence rate here, we also include the number of stochastic Hessian-vector product evaluations, each of which takes about the same time as one stochastic gradient evaluation.

NC search + VR: Allen-Zhu (2018b) converted an NC search method to the online stochastic setting (Carmon et al., 2018) and achieved a convergence rate of for finding an approximate second-order stationary point. For finding a relaxed approximate second-order stationary point, Allen-Zhu (2018b) obtains a lower stochastic gradient computational cost of . With a recently proposed optimal variance-reduced gradient technique applied, Spider achieves the state-of-the-art stochastic gradient computational cost (Fang et al., 2018).^{5} ^{5}The independent work Zhou et al. (2018a) achieves a similar convergence rate for finding an approximate second-order stationary point by imposing a third-order smoothness condition on the objective.
Algorithm  |  SG Comp. Cost
Neon+SGD (Xu et al., 2018)  |
Neon2+SGD (Allen-Zhu & Li, 2018)  |
Stochastic Cubic (Tripuraneni et al., 2018)  |
RSGD5 (Allen-Zhu, 2018a)  |
Natasha2 (Allen-Zhu, 2018b)  |
Neon2+SNVRG (Zhou et al., 2018a)  |
SpiderSFO^{+} (Fang et al., 2018)  |
(Ge et al., 2015)  |
(Daneshmand et al., 2018)  |
(this work)  |
Orange-boxed: Spider, reported in the orange box, is the only existing stochastic algorithm variant that achieves a provably faster rate (by an order) than simple SGD.
: Allen-Zhu (2018b) also obtains a stochastic gradient computational cost of for finding a relaxed approximate second-order stationary point.
: With additional third-order smoothness assumptions, SNVRG (Zhou et al., 2018a) achieves a stochastic gradient cost of .
4.1 More Related Works
VR Methods
In the past two years, sharper convergence rates for nonconvex stochastic optimization have been achieved using variance-reduced gradient techniques (Schmidt et al., 2017; Johnson & Zhang, 2013; Xiao & Zhang, 2014; Defazio et al., 2014). SVRG/SCSG (Lei et al., 2017) adopts the technique of Johnson & Zhang (2013), introduces a novel random stopping criterion for its inner loops, and achieves a stochastic gradient cost of . Very recently, two independent works, namely SPIDER (Fang et al., 2018) and SVRC (Zhou et al., 2018b), designed sharper variance-reduced gradient methods and obtained a stochastic gradient computational cost of , which is state-of-the-art and near-optimal in the sense that they achieve the algorithmic lower bound in the finite-sum setting.
Escaping Saddles in SingleFunction Case
Recently, many theoretical works have studied convergence to an approximate second-order stationary point, or escaping from saddle points, in the case of a single function (Carmon & Duchi, 2016; Jin et al., 2017; Carmon et al., 2018, 2017; Agarwal et al., 2017; Jin et al., 2018b; Lee et al., 2017; Du et al., 2017). Among them, Jin et al. (2017) proposed a variant of gradient descent perturbed by ball-shaped noise, which can efficiently escape saddle points and achieves a sharp gradient computational cost of , also achieved by Neon+GD (Xu et al., 2018; Allen-Zhu & Li, 2018). Another line of works applies momentum acceleration techniques (Agarwal et al., 2017; Carmon et al., 2017; Jin et al., 2018b) and achieves a rate of for a general optimization problem.
Escaping Saddles in FiniteSum Case
For the finite-sum setting, many works have applied variance-reduced gradient methods (Agarwal et al., 2017; Carmon et al., 2018; Fang et al., 2018; Zhou et al., 2018a) and further reduced the stochastic gradient computational cost to (Agarwal et al., 2017; Allen-Zhu & Li, 2018). Reddi et al. (2018) proposed a simpler algorithm that obtains a stochastic gradient cost of . With the recursive gradient method applied (Fang et al., 2018; Zhou et al., 2018a), the stochastic gradient cost is further reduced to , which is the state-of-the-art.
Miscellaneous
It is well-known that for the general nonconvex optimization problem in the form of (1.1), finding an approximate global minimizer is NP-hard in the worst case (Hillar & Lim, 2013). In light of this, many works turn to studying convergence properties for specific models. Faster convergence to local or even global minimizers can be guaranteed for many statistical learning tasks such as principal component analysis (Li et al., 2018a; Jain & Kar, 2017), matrix completion (Jain et al., 2013; Ge et al., 2016; Sun & Luo, 2016), dictionary learning (Sun et al., 2015, 2017), as well as linear and nonlinear neural networks (Zhong et al., 2017; Li & Yuan, 2017; Li et al., 2018b).
5 Conclusions and Future Direction
In this paper, we presented a sharp convergence analysis of the classical SGD algorithm. We showed that, equipped with a ball-controlled stopping criterion, SGD achieves a stochastic gradient computational cost of for finding an approximate second-order stationary point, which improves upon the best-known SGD convergence rate prior to our work. While this work focuses on a sharpened convergence rate, several important questions remain:

It is still unknown whether SGD can achieve a rate faster than , or whether is exactly the lower bound for SGD on the general problem of the form (1.1). As mentioned in §1, we conjecture that variance reduction methods are necessary to reach an approximate second-order stationary point in fewer than steps.

We have not considered several important extensions in this work, such as the convergence rate of SGD for constrained optimization problems, and how to extend the analysis in this paper to the proximal case.

It will also be interesting to study the stochastic version of Nesterov’s accelerated gradient descent (AGD) (Jin et al., 2018b).
Acknowledgement
The authors would like to thank Chris Junchi Li for providing us with a proof that SGD escapes saddle points in computational cost and for carefully revising our paper. The authors would also like to thank Haishan Ye for very helpful discussions, and Huan Li, Zebang Shen, and Li Shen for very helpful comments.
References
 Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., & Zheng, X. (2016). Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).
 Agarwal et al. (2009) Agarwal, A., Wainwright, M. J., Bartlett, P. L., & Ravikumar, P. K. (2009). Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems (pp. 1–9).

 Agarwal et al. (2017) Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., & Ma, T. (2017). Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (pp. 1195–1199). ACM.
 Allen-Zhu (2018a) Allen-Zhu, Z. (2018a). How to make the gradients small stochastically: Even faster convex and nonconvex SGD. In Advances in Neural Information Processing Systems (pp. 1165–1175).
 Allen-Zhu (2018b) Allen-Zhu, Z. (2018b). Natasha 2: Faster nonconvex optimization than SGD. In Advances in Neural Information Processing Systems (pp. 2676–2687).
 Allen-Zhu & Li (2018) Allen-Zhu, Z. & Li, Y. (2018). Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems (pp. 3720–3730).
 Bartlett et al. (2008) Bartlett, P. L., Dani, V., Hayes, T. P., Kakade, S. M., Rakhlin, A., & Tewari, A. (2008). High-probability regret bounds for bandit online linear optimization. In Proceedings of the 31st Conference On Learning Theory.
 Bottou & Bousquet (2008) Bottou, L. & Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (pp. 161–168).
 Candès & Recht (2009) Candès, E. J. & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational Mathematics, 9(6), 717.
 Carmon & Duchi (2016) Carmon, Y. & Duchi, J. C. (2016). Gradient descent efficiently finds the cubic-regularized nonconvex Newton step. arXiv preprint arXiv:1612.00547.
 Carmon et al. (2017) Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2017). “Convex Until Proven Guilty”: Dimension-free acceleration of gradient descent on nonconvex functions. In International Conference on Machine Learning (pp. 654–663).
 Carmon et al. (2018) Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2018). Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2), 1751–1772.
 Daneshmand et al. (2018) Daneshmand, H., Kohler, J., Lucchi, A., & Hofmann, T. (2018). Escaping saddles with stochastic gradients. In International Conference on Machine Learning (pp. 1155–1164).
 Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in highdimensional nonconvex optimization. In Advances in Neural Information Processing Systems (pp. 2933–2941).
 Defazio et al. (2014) Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (pp. 1646–1654).
 Du et al. (2017) Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., & Poczos, B. (2017). Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems (pp. 1067–1077).
 Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). SPIDER: Near-optimal nonconvex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (pp. 686–696).
 Freedman (1975) Freedman, D. A. (1975). On tail probabilities for martingales. Annals of Probability, 3(1), 100–118.

 Ge et al. (2015) Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory (pp. 797–842).
 Ge et al. (2016) Ge, R., Lee, J. D., & Ma, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems (pp. 2973–2981).
 Hillar & Lim (2013) Hillar, C. J. & Lim, L.-H. (2013). Most tensor problems are NP-hard. Journal of the ACM (JACM), 60(6), 45.
 Hinton & Salakhutdinov (2006) Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. Science, 313(5786), 504–507.
 Jain & Kar (2017) Jain, P. & Kar, P. (2017). Nonconvex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3–4), 142–336.
 Jain et al. (2013) Jain, P., Netrapalli, P., & Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the Forty-Fifth Annual ACM Symposium on Theory of Computing (pp. 665–674). ACM.
 Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., & Jordan, M. I. (2017). How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning (pp. 1724–1732).
 Jin et al. (2018a) Jin, C., Liu, L. T., Ge, R., & Jordan, M. I. (2018a). On the local minima of the empirical risk. In Advances in Neural Information Processing Systems (pp. 4901–4910).
 Jin et al. (2018b) Jin, C., Netrapalli, P., & Jordan, M. I. (2018b). Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory (pp. 1042–1085).
 Johnson & Zhang (2013) Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (pp. 315–323).
 Jolliffe (2011) Jolliffe, I. (2011). Principal component analysis. In International encyclopedia of statistical science (pp. 1094–1096). Springer.
 Kallenberg & Sztencel (1991) Kallenberg, O. & Sztencel, R. (1991). Some dimension-free features of vector-valued martingales. Probability Theory and Related Fields, 88(2), 215–247.
 Lee et al. (2017) Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., & Recht, B. (2017). Firstorder methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406.
 Lei et al. (2017) Lei, L., Ju, C., Chen, J., & Jordan, M. I. (2017). Nonconvex finite-sum optimization via SCSG methods. In Advances in Neural Information Processing Systems (pp. 2345–2355).
 Li et al. (2018a) Li, C. J., Wang, M., Liu, H., & Zhang, T. (2018a). Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming, 167(1), 75–97.
 Li et al. (2018b) Li, Y., Ma, T., & Zhang, H. (2018b). Algorithmic regularization in overparameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory (pp. 2–47).

 Li & Yuan (2017) Li, Y. & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with ReLU activation. In Advances in Neural Information Processing Systems (pp. 597–607).
 Nesterov (2004) Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, volume 87. Springer.
 Nesterov & Polyak (2006) Nesterov, Y. & Polyak, B. T. (2006). Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1), 177–205.
 Pinelis (1994) Pinelis, I. (1994). Optimum bounds for the distributions of martingales in Banach spaces. The Annals of Probability, (pp. 1679–1706).
 Rakhlin et al. (2012) Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (pp. 449–456).

 Reddi et al. (2018) Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., & Smola, A. (2018). A generic approach for escaping saddle points. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (pp. 1233–1242).
 Reynolds et al. (2000) Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted Gaussian mixture models. Digital Signal Processing, 10(1–3), 19–41.
 Robbins & Monro (1951) Robbins, H. & Monro, S. (1951). A stochastic approximation method. The Annals of Mathematical Statistics, (pp. 400–407).
 Schmidt et al. (2017) Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1–2), 83–112.
 Sun et al. (2015) Sun, J., Qu, Q., & Wright, J. (2015). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.
 Sun et al. (2017) Sun, J., Qu, Q., & Wright, J. (2017). Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2), 853–884.
 Sun & Luo (2016) Sun, R. & Luo, Z.-Q. (2016). Guaranteed matrix completion via nonconvex factorization. IEEE Transactions on Information Theory, 62(11), 6535–6579.
 Tripuraneni et al. (2018) Tripuraneni, N., Stern, M., Jin, C., Regier, J., & Jordan, M. I. (2018). Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems (pp. 2904–2913).
 Xiao & Zhang (2014) Xiao, L. & Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075.
 Xu et al. (2018) Xu, Y., Rong, J., & Yang, T. (2018). First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems (pp. 5531–5541).
 Zhang (2005) Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9), 2077–2098.
 Zhang et al. (2017) Zhang, Y., Liang, P., & Charikar, M. (2017). A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory.
 Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L., & Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.
 Zhou et al. (2018a) Zhou, D., Xu, P., & Gu, Q. (2018a). Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782.
 Zhou et al. (2018b) Zhou, D., Xu, P., & Gu, Q. (2018b). Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems (pp. 3922–3933).
Appendix A Proof of Proposition 1 and Algorithm 3
Proof of Proposition 1.

Recall the multivariate Gaussian noise , where . We show that it satisfies (2.5); clearly, it satisfies (2.4).
Let be an arbitrary unit vector; due to symmetry, in what follows we assume WLOG that . Recall that we have a set satisfying the narrow property in Definition 2. Then
If the set contains no points of for each , then is a subset of and has Lebesgue measure . This is because for any given there exists an such that , and we pick to be the infimum of all such . It is then easy to conclude that for any , and that
Therefore, for any admitting the narrow property with , we have for any given ,
where is of Lebesgue measure . Taking expectation again gives
which completes the proof that is disperse for any .
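As a numerical sanity check of this Gaussian case (not part of the proof), the following sketch estimates the mass that isotropic Gaussian noise places in a thin slab around a hyperplane, a prototypical narrow set in the sense of Definition 2, and compares it against a peak-density upper bound. The dimension, noise radius, and covariance scaling below are assumed values for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative constants (assumed, not the paper's): dimension d, noise
# scale r, and slab half-width delta.
d, r, delta = 50, 1.0, 0.01
n = 200_000

# Isotropic Gaussian noise xi ~ N(0, (r^2/d) I_d).  By symmetry, only the
# one-dimensional projection <xi, e_1> matters, so we sample it directly.
proj = rng.normal(scale=r / np.sqrt(d), size=n)

# Empirical mass in the narrow slab {x : |<x, e_1>| <= delta}, versus the
# bound (slab width) * (peak density) = 2*delta / (sqrt(2*pi) * r/sqrt(d)).
mass = float(np.mean(np.abs(proj) <= delta))
bound = 2 * delta / (np.sqrt(2 * np.pi) * r / np.sqrt(d))
print(mass, bound)
```

The bound shrinks linearly in delta, which matches the qualitative claim above: Gaussian noise places only a small amount of probability mass in any sufficiently narrow set.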

For example, recall the uniform ball-shaped noise , where is uniformly sampled from , the unit ball centered at . We prove that (2.5) holds in this case as well. Assume once again WLOG that , by symmetry. Using classical results from multivariate calculus (or see Jin et al. (2017)) and the narrow property in Definition 2 of the set , we have