1 Introduction
We consider the problem of nonconvex optimization in the stochastic approximation framework (Robbins and Monro, 1951):
(1) 
In this setting, we only have access to the stochastic function
, where the random variable
is sampled from an underlying distribution . The task is to optimize the expected function, which in general may be nonconvex. This framework covers a wide range of problems, including the offline setting where we minimize the empirical loss over a fixed amount of data, and the online setting where data arrives sequentially. One of the most prominent applications of stochastic optimization has been in largescale statistics and machine learning problems, such as the optimization of deep neural networks.
Classical analysis in nonconvex optimization only guarantees convergence to a firstorder stationary point (i.e., a point satisfying ), which can be a local minimum, a local maximum, or a saddle point. This paper goes further, proposing an algorithm that escapes saddle points and converges to a local minimum. A local minimum is defined as a point satisfying and . Finding such a point is of special interest for a large class of statistical learning problems where local minima are global or nearglobal solutions (e.g. Choromanska et al. (2015); Sun et al. (2016a, b); Ge et al. (2017)).
Among firstorder stochastic optimization algorithms, stochastic gradient descent (SGD) is perhaps the simplest and most versatile. While SGD is computationally inexpensive, the best current guarantee for finding an approximate local minimum (see Definition 1) requires iterations (Ge et al., 2015), which is inefficient in the highdimensional regime.
In contrast, secondorder methods which have access to the Hessian of can exploit curvature to more effectively escape saddles and arrive at local minima. Empirically, secondorder methods are known to perform better than firstorder methods on a variety of nonconvex problems (Rattray et al., 1998; Martens, 2010; Regier et al., 2017). However, many classical secondorder algorithms need to construct the full Hessian, which can be prohibitively expensive when working with large models. Recent work has therefore explored the use of Hessianvector products , which can be computed as efficiently as gradients in many cases including neural networks (Pearlmutter, 1994).
Several algorithms incorporating Hessianvector products (Carmon et al., 2016; Agarwal et al., 2017) have been shown to achieve faster convergence rates than gradient descent in the nonstochastic setting. However, in the stochastic setting where we only have access to stochastic Hessianvector products, significantly less progress has been made. This leads us to ask the central question of this paper: Can stochastic Hessianvector products help to speed up nonconvex optimization?
In this work, we answer in the affirmative, proposing a stochastic optimization method that utilizes stochastic gradients and Hessianvector products to find an secondorder stationary point using only oracle evaluations, where hides polylogarithmic factors. Our rate improves upon the rate of stochastic gradient descent, and matches the bestknown result for finding local minima without the need for any delicate acceleration or variance reduction techniques (see Section 1.1 for details).
The proposed algorithm in this paper is based on a classic algorithm in the nonstochastic setting—the cubic regularized Newton method (or cubic regularization) (Nesterov and Polyak, 2006). This algorithm is a natural extension of gradient descent that incorporates Hessian information. Whereas gradient descent finds the minimizer of a local secondorder Taylor expansion,
the cubic regularized Newton method finds the minimizer of a local thirdorder Taylor expansion,
We provide a stochastic variant of this classic algorithm, bridging the gap between its use in the nonstochastic and stochastic settings.
In contrast to prior work, we present a fully stochastic cubicregularized Newton method: both gradients and Hessianvector products are observed with noise. Additionally, we provide a nonasymptotic analysis of its complexity.
1.1 Related Work
There has been a recent surge of interest in optimization methods that can escape saddle points and find approximate local minima (see Definition 1) in various settings. We provide a brief summary of these results. All iteration complexities in this section are stated in terms of finding approximate local minima, and only highlight the dependency on and .
1.1.1 General Case
This line of work optimizes over a general function without any special structural assumptions. In this setting, the optimization algorithm has direct access to the gradient or Hessian oracles at each iteration. The work of Nesterov and Polyak (2006) first proposed the cubicregularized Newton method, which requires gradient and Hessian oracle calls to find an secondorder stationary point. Later, the ARC algorithm (Cartis et al., 2011) and trustregion methods (Curtis et al., 2017) were also shown to achieve the same guarantee with similar Hessian oracle access. However, these algorithms rely on having access to the full Hessian at each iteration, which is prohibitive in high dimensions.
Recently, instead of using the full Hessian, Carmon and Duchi (2016) showed that using a gradient descent solver for the cubic regularization subproblem allows their algorithm to find secondorder stationary points in Hessianvector product evaluations. With acceleration techniques, the number of Hessianvector products can be reduced to (Carmon et al., 2016; Agarwal et al., 2017; Royer and Wright, 2017).
Meanwhile, in the realm of entirely Hessianfree results, Jin et al. (2017) showed that a simple variant of gradient descent can find second stationary points in gradient evaluations.
1.1.2 FiniteSum Setting
In the finitesum setting (also known as the offline setting) where , one assumes that algorithms have access to the individual functions . In this setting, variance reduction techniques can be exploited (Johnson and Zhang, 2013). Agarwal et al. (2017) give an algorithm requiring Hessianvector oracle calls to find an approximate local minimum. A similar result is achieved by the algorithm proposed by Reddi et al. (2017).
1.1.3 Stochastic Approximation
The framework of stochastic approximation where only assumes access to a stochastic gradient and Hessian via . In this setting, Ge et al. (2015) showed that the total gradient iteration complexity for SGD to find an secondorder stationary point was . More recently, Kohler and Lucchi (2017) consider a subsampled version of the cubic regularization algorithm, but do not provide a nonasymptotic analysis for their algorithm to find an approximate local minimum; they also assume access to exact (expected) function values at each iteration which are not available in the fully stochastic setting. Xu et al. (2017) consider the case of stochastic Hessians, but also require access to exact gradients and function values at each iteration. Very recently, AllenZhu (2017) proposed an algorithm with a mechanism exploiting variance reduction that finds a secondorder stationary point with ^{2}^{2}2The original paper reports a rate of due to a different definition of secondorder stationary point, , which is weaker than the standard definition as in Definition 1. Hessianvector product evaluations. We note that our result matches this best result so far using a simpler approach without any delicate acceleration or variancereduction techniques. See Table 1 for a brief comparison.
Method  Runtime  Variance Reduction 

Stochastic Gradient Descent (Ge et al., 2015)  not needed  
Natasha 2 (AllenZhu, 2017)  needed  
Stochastic Cubic Regularization (this paper)  not needed 
2 Preliminaries
We are interested in stochastic optimization problems of the form
(2) 
where is a random variable with distribution . In general, the function may be nonconvex. This formulation covers both the standard offline setting where the objective function can be expressed as a finite sum of individual functions , as well as the online setting where data arrives sequentially.
Our goal is to minimize the function using only stochastic gradients and stochastic Hessianvector products , where is a vector of our choosing. Although it is expensive and often intractable in practice to form the entire Hessian, computing a Hessianvector product is as cheap as computing a gradient when our function is represented as an arithmetic circuit (Pearlmutter, 1994), as is the case for neural networks.
Notation: We use bold uppercase letters to denote matrices and bold lowercase letters to denote vectors. For vectors we use to denote the norm, and for matrices we use to denote the spectral norm and
to denote the minimum eigenvalue. Unless otherwise specified, we use the notation
to hide only absolute constants which do not depend on any problem parameter, and the notation to hide only absolute constants and logarithmic factors.2.1 Assumptions
Throughout the paper, we assume that the function is bounded below by some optimal value . We also make following assumptions about function smoothness:
Assumption 1.
The function has

[itemsep=.5em, nosep]

Lipschitz gradients: for all and ,

Lipschitz Hessians: for all and ,
The above assumptions state that the gradient and the Hessian cannot change dramatically in a small local area, and are standard in prior work on escaping saddle points and finding local minima.
Next, we make the following variance assumptions about stochastic gradients and stochastic Hessians:
Assumption 2.
The function has

[itemsep=.5em, nosep]

for all , and a.s.;

for all , and a.s.
We note that the above assumptions are not essential to our result, and can be replaced by any conditions that give rise to concentration. Moreover, the a.s. bounded Hessian assumption can be removed if one further assumes has Lipschitz gradients for all , which is stronger than Assumption 1. Assuming has Lipschitz gradients immediately implies the stochastic Hessians are a.s. bounded with parameter which alone is sufficient to guarantee concentration without any further assumptions on the variance of (Tropp et al., 2015).
2.2 Cubic Regularization
Our target in this paper is to find an secondorder stationary point, which we define as follows:
Definition 1.
For a Hessian Lipschitz function , we say that is an secondorder stationary point (or approximate local minimum) if
(3) 
An secondorder stationary point not only has a small gradient, but also has a Hessian which is close to positive semidefinite. Thus it is often also referred to as an approximate local minimum.
In the deterministic setting, cubic regularization (Nesterov and Polyak, 2006) is a classic algorithm for finding a secondorder stationary point of a HessianLipschitz function . In each iteration, it first forms a local upper bound on the function using a thirdorder Taylor expansion around the current iterate :
This is called the cubic submodel. Then, cubic regularization minimizes this cubic submodel to obtain the next iterate: . When the cubic submodel can be solved exactly, cubic regularization requires iterations to find an secondorder stationary point.
To apply this algorithm in the stochastic setting, three issues need to be addressed: (1) we only have access to stochastic gradients and Hessians, not the true gradient and Hessian; (2) our only means of interaction with the Hessian is through Hessianvector products; (3) the cubic submodel cannot be solved exactly in practice, only up to some tolerance. We discuss how to overcome each of these obstacles in our paper.
3 Main Results
We begin with a generalpurpose stochastic cubic regularization metaalgorithm in Algorithm 1, which employs a blackbox subroutine to solve stochastic cubic subproblems. At a high level, in order to deal with stochastic gradients and Hessians, we sample two independent minibatches and at each iteration. Denoting the average gradient by
(4) 
and the average Hessian by
(5) 
this implies a stochastic cubic submodel:
(6) 
Although the subproblem depends on , we note that our metaalgorithm never explicitly formulates this matrix, only providing the subsolver access to through Hessianvector products, which we denote by . We hence assume that the subsolver performs gradientbased optimization to solve the subproblem, as depends on only via .
After sampling minibatches for the gradient and the Hessian, Algorithm 1 makes a call to a blackbox cubic subsolver to optimize the stochastic submodel . The subsolver returns a parameter change , i.e., an approximate minimizer of the submodel, along with the corresponding change in submodel value, . The algorithm then updates the parameters by adding to the current iterate, and checks whether satisfies a stopping condition.
In more detail, the CubicSubsolver subroutine takes a vector and a function for computing Hessianvector products , then optimizes the thirdorder polynomial
(7) 
Let denote the minimizer of this polynomial. In general, the subsolver cannot return the exact solution . We hence tolerate a certain amount of suboptimality:
Condition 1.
For any fixed, small constant , terminates within gradient iterations (which may depend on ), and returns a satisfying at least one of the following:

. (Case 1)

and, if , then . (Case 2)
The first condition is satisfied if the parameter change results in submodel and function decreases that are both sufficiently large (Case 1). If that fails to hold, the second condition ensures that is not too large relative to the true solution , and that the cubic submodel is solved to precision when is large (Case 2).
As mentioned above, we assume the subsolver uses gradientbased optimization to solve the subproblem so that it will only access the Hessian only through Hessianvector products. Accordingly, it can be any standard firstorder algorithm such as gradient descent, Nesterov’s accelerated gradient descent, etc. Gradient descent is of particular interest as it can be shown to satisfy Condition 1 (see Lemma 2).
Having given an overview of our metaalgorithm and verified the existence of a suitable subsolver, we are ready to present our main theorem:
Theorem 1.
There exists an absolute constant such that if satisfies Assumptions 1, 2, CubicSubsolver satisfies Condition 1 with , , and , then for all and , Algorithm 1 will output an secondorder stationary point of
with probability at least
within(8) 
total stochastic gradient and Hessianvector product evaluations.
In the limit where is small the result simplifies:
Remark 1.
Theorem 1 states that after iterations, stochastic cubic regularization (Algorithm 1) will have found an secondorder stationary point w.h.p. – see Equation (26) in the Appendix for an exact outer iteration complexity that guarantees early termination of Algorithm 1 w.h.p. Within each iteration, we require samples for gradient averaging and samples for Hessianvector product averaging to ensure small variance when is small. Recall that the averaged gradient is fixed for a given cubic submodel, while the averaged Hessianvector product needs to be recalculated every time the cubic subsolver queries the gradient . At most such queries will be made by definition. Therefore, each iteration takes stochastic gradient and Hessianvector product evaluations in the limit where is small (see Remark 1). Multiplying these two quantities together gives the full result.
Finally, we note that lines 811 of Algorithm 1 give the termination condition of our metaalgorithm. When the decrease in submodel value is too small, our theory guarantees is an secondorder stationary point, where is the optimal solution of the cubic submodel. However, CubicSubsolver may only produce an inexact solution satisfying Condition 1, which is not sufficient for to be an secondorder stationary point. We therefore call CubicFinalsolver to solve the subproblem with higher precision. Since CubicFinalsolver is invoked only once at the end of the algorithm, we can just use gradient descent, and its runtime will always be dominated by the rest of the algorithm.
3.1 Gradient Descent as a CubicSubsolver
One concrete example of a cubic subsolver is a simple variant of gradient descent (Algorithm 3) as studied in Carmon and Duchi (2016). The two main differences relative to standard gradient descent are: (1) lines 1–3: when is large, the submodel (Equation 7) may be illconditioned, so instead of doing gradient descent, the iterate only moves one step in the direction, which already guarantees sufficient descent; (2) line 6: the algorithm adds a small perturbation to to avoid a certain “hard” case for the cubic submodel. We refer readers to Carmon and Duchi (2016) for more details about Algorithm 3.
Adapting their result for our setting, we obtain the following lemma, which states that the gradient descent subsolver satisfies our Condition 1.
Lemma 2.
Our next corollary applies gradient descent (Algorithm 3) as the approximate cubic subsolver in our metaalgorithm (Algorithm 1), which immediately gives the total number of gradient and Hessianvector evaluations for the full algorithm.
Corollary 3.
Remark 2.
From Corollary 3, we observe that the dominant term in solving the submodel is when is sufficiently small, giving a total iteration complexity of when other problemdependent parameters are constant. This improves on the complexity attained by SGD.
It is reasonable to believe there may be another cubic subsolver which is faster than gradient descent and which satisfies Condition 1 with a smaller , for instance a variant of Nesterov’s accelerated gradient descent. However, since the dominating term in our subsolver complexity is due to gradient averaging, and this is independent of , a faster cubic subsolver cannot improve the overall number of gradient and Hessianvector product evaluations. This means that the gradient descent subsolver already achieves the optimal asymptotic rate for finding an secondorder stationary point under our stochastic cubic regularization framework.
4 Proof Sketch
This section sketches the crucial steps needed to understand and prove our main theorem (Theorem 1). We begin by describing our highlevel approach, then show how to instantiate this highlevel approach in the stochastic setting, assuming oracle access to an exact subsolver. For the case of an inexact subsolver and other proof details, we defer to the Appendix.
Recall that at iteration of Algorithm 1, a stochastic cubic submodel is constructed around the current iterate with the form given in Equation (6):
where and are the averaged stochastic gradients and Hessians. At a high level, we will show that for each iteration, the following two claims hold:
Claim 1. If is not an secondorder stationary point of , the cubic submodel has large descent .
Claim 2. If the cubic submodel has large descent , then the true function also has large descent .
Given these claims, it is straightforward to argue for the correctness of our approach. We know that if we observe a large decrease in the cubic submodel value during Algorithm 1, then by Claim 2 the function will also have large descent. But since is bounded below, this cannot happen indefinitely, so we must eventually encounter an iteration with small cubic submodel descent. When that happens, we can conclude by Claim 1 that is an secondorder stationary point.
We note that Claim 2 is especially important in the stochastic setting, as we no longer have access to the true function but only the submodel. Claim 2 ensures that progress in still indicates progress in , allowing the algorithm to terminate at the correct time.
In the remaining parts of this section, we discuss why the above two claims hold for an exact solver.
4.1 Stochastic Setting with Exact Subsolver
In this setting, and are the averaged gradient and Hessian with sample sizes and , respectively. To ensure the stochastic cubic submodel approximates the exact cubic submodel well, we need large enough sample sizes so that both and are close to the exact gradient and Hessian at up to some tolerance:
Lemma 4.
For any fixed small constants , we can pick gradient and Hessian minibatch sizes and so that with probability ,
(12) 
(13) 
We need to ensure that the random vectors/matrices concentrate along an arbitrary direction (depending on and ). In order to guarantee the uniform concentration in Lemma 4, we can directly apply results from matrix concentration to obtain the desired result (Tropp et al., 2015).
Let , i.e. is a global minimizer of the cubic submodel . If we use an exact oracle solver, we have . In order to show Claim 1 and Claim 2, one important quantity to study is the decrease in the cubic submodel :
Lemma 5.
Let and be defined as above. Then for all ,
Lemma 5 implies that in order to prove submodel has sufficient function value decrease, we only need to lower bound the norm of optimal solution, i.e. .
Proof sketch for claim 1: Our strategy is to lower bound the norm of when is not an secondorder stationary point. In the nonstochastic setting, Nesterov and Polyak (2006) prove
which gives the desired result. In the stochastic setting, a similar statement holds up to some tolerance:
Lemma 6.
Under the setting of Lemma 4 with sufficiently small constants ,
That is, when is not an secondorder stationary point, we have . In other words, we have sufficient movement. It follows by Lemma 5 that we have sufficient cubic submodel descent.
Proof sketch for claim 2: In the nonstochastic case, is by construction an upper bound on . Together with the fact , we have:
showing Claim 2 is always true. For the stochastic case, this inequality may no longer be true. Instead, under the setting of Lemma 4, via Lemma 5, we can upper bound the function decrease with an additional error term:
for some sufficiently small constant . Then when , we have , which proves Claim 2.
Finally, for an approximate cubic subsolver, the story becomes more elaborate. Claim 1 is only “approximately” true, while Claim 2 still holds but for more complicated reasons. We defer to the Appendix for the full proof.
5 Experiments
In this section, we provide empirical results on synthetic and realworld data sets to demonstrate the efficacy of our approach. All experiments are implemented using TensorFlow
(Abadi et al., 2016), which allows for efficient computation of Hessianvector products using the method described by Pearlmutter (1994).5.1 Synthetic Nonconvex Problem
We begin by constructing a nonconvex problem with a saddle point to compare our proposed approach against stochastic gradient descent. Let be the Wshaped scalar function depicted in Figure 1, with a local maximum at the origin and two local minima on either side. While we defer the exact form of to Appendix C, we note here that it has small negative curvature at the origin, , and that it has a 1Lipschitz second derivative. We aim to solve the problem
with independent noise drawn from added separately to each component of every gradient and Hessianvector product evaluation. By construction, the objective function has a saddle point at the origin with Hessian eigenvalues of 0.2 and 20, providing a simple but challenging test case where the negative curvature is two orders of magnitude smaller than the positive curvature and is comparable in magnitude to the noise.
We ran our method and SGD on this problem, plotting the objective value versus the number of oracle calls in Figure 3. The batch sizes and learning rates for each method are tuned separately to ensure a fair comparison; see Appendix C for details. We observe that our method is able to escape the saddle point at the origin and converge to one of the global minima faster than SGD, offering empirical evidence in support of our method’s theoretical advantage.
5.2 Deep Autoencoder
In addition to the synthetic problem above, we also investigate the performance of our approach on a more practical problem from deep learning, namely training a deep autoencoder on MNIST. The MNIST data set consists of grayscale images of handwritten digits, divided into 60,000 training images and 10,000 test images
(LeCun and Cortes, 2010). Each image is pixels, and pixel intensities are normalized to lie between 0 and 1. Our architecture consists of a fully connected encoder with dimensions together with a symmetric decoder. We use a softplus nonlinearity (defined as ) for each hidden layer to ensure continuous first and second derivatives, and we use a pixelwiseloss between the input and the reconstructed output as our objective function. We apply an elementwise sigmoid function to the final layer to guarantee that the output pixel intensities lie between 0 and 1.
Results on this autoencoding task are presented in Figure 3. In addition to training the model with our method and SGD, we also include results using AdaGrad, a popular adaptive firstorder method with strong empirical performance (Duchi et al., 2011)
. Since the standard MNIST split does not include a validation set, we separate the original training set into 55,000 training images and 5,000 validation images, plotting training error on the former and using the latter to select hyperparameters for each method. More details about our experimental setup can be found in Appendix
C.We observe that stochastic cubic regularization quickly escapes two saddle points and descends toward a local optimum, while AdaGrad takes two to three times longer to escape each saddle point, and SGD is slower still. This demonstrates that incorporating curvature information via Hessianvector products can assist in escaping saddle points in practice. However, it is worth noting that AdaGrad makes slightly faster progress than our approach after reaching a basin around a local optimum, indicating that adaptivity may provide complementary benefits to secondorder information. We leave the investigation of a hybrid method combining both techniques as an exciting direction for future work.
6 Conclusion
In this paper, we presented a stochastic algorithm based on the classic cubicregularized Newton method for nonconvex optimization. Our algorithm provably finds approximate local minima in total gradient and Hessianvector product evaluations, improving upon the rate of SGD. Our experiments show the favorable performance of our method relative to SGD on both a synthetic and a realworld problem.
References
 Abadi et al. [2016] Martin Abadi, Paul Barham, et al. Tensorflow: A system for largescale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.

Agarwal et al. [2017]
Naman Agarwal, Zeyuan AllenZhu, Brian Bullins, Elad Hazan, and Tengyu Ma.
Finding approximate local minima faster than gradient descent.
In
Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing
, 2017.  AllenZhu [2017] Zeyuan AllenZhu. Natasha 2: Faster nonconvex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
 Carmon and Duchi [2016] Yair Carmon and John Duchi. Gradient descent efficiently finds the cubicregularized nonconvex Newton step. arXiv preprint arXiv:1612.00547, 2016.
 Carmon et al. [2016] Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for nonconvex optimization. arXiv preprint arXiv:1611.00756, 2016.
 Cartis et al. [2011] Coralia Cartis, Nicholas Gould, and Philippe Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worstcase functionand derivativeevaluation complexity. Mathematical Programming, 130(2):295–319, 2011.

Choromanska et al. [2015]
Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann
LeCun.
The loss surfaces of multilayer networks.
In
Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics
, 2015.  Curtis et al. [2017] Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worstcase iteration complexity of for nonconvex optimization. Mathematical Programming, 162(12):1–32, 2017.
 Duchi et al. [2011] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.

Ge et al. [2015]
Rong Ge, Furong Huang, Chi Jin, and Yang Yuan.
Escaping from saddle points — online stochastic gradient for tensor decomposition.
In Proceedings of The 28th Conference on Learning Theory, 2015.  Ge et al. [2017] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Jin et al. [2017] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 Johnson and Zhang [2013] Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
 Kohler and Lucchi [2017] Jonas Moritz Kohler and Aurelien Lucchi. Subsampled cubic regularization for nonconvex optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.
 LeCun and Cortes [2010] Yann LeCun and Corinna Cortes. MNIST Handwritten Digit Database, 2010. URL http://yann.lecun.com/exdb/mnist/.
 Martens [2010] James Martens. Deep learning via Hessianfree optimization. In Proceedings of the 27th International Conference on Machine Learning, 2010.
 Nesterov [2013] Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
 Nesterov and Polyak [2006] Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
 Pearlmutter [1994] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
 Rattray et al. [1998] Magnus Rattray, David Saad, and Shunichi Amari. Natural gradient descent for online learning. Physical review letters, 81(24):5461, 1998.
 Reddi et al. [2017] Sashank J Reddi, Manzil Zaheer, et al. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017.
 Regier et al. [2017] Jeffrey Regier, Michael I Jordan, and Jon McAuliffe. Fast blackbox variational inference through stochastic trustregion optimization. In Advances in Neural Information Processing Systems, 2017.
 Robbins and Monro [1951] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
 Royer and Wright [2017] Clément W Royer and Stephen J Wright. Complexity analysis of secondorder linesearch algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.
 Sun et al. [2016a] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 2016a.
 Sun et al. [2016b] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In IEEE International Symposium on Information Theory. IEEE, 2016b.
 Tropp et al. [2015] Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(12):1–230, 2015.
 Xu et al. [2017] Peng Xu, Farbod RoostaKhorasani, and Michael W Mahoney. Newtontype methods for nonconvex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.
Appendix A Proof of Main Results
In this section, we give formal proofs of Theorems 1 and 3. We start by providing proofs of several useful auxiliary lemmas.
Remark 3.
It suffices to assume that for the following analysis, since otherwise every point satisfies the secondorder condition trivially by the Lipschitzgradient assumption.
a.1 SetUp and Notation
Here we remind the reader of the relevant notation and provide further background from Nesterov and Polyak [2006] on the cubicregularized Newton method. We denote the stochastic gradient as
and the stochastic Hessian as
both for iteration . We draw a sufficient number of samples and so that the concentration conditions
are satisfied for sufficiently small (see Lemma 4 for more details). The cubicregularized Newton subproblem is to minimize
(14) 
We denote the global optimizer of as , that is .
As shown in Nesterov and Polyak [2006] a global optima of Equation (14) satisfies:
(15)  
(16) 
Equation (15) is the firstorder stationary condition, while Equation (16) follows from a duality argument. In practice, we will not be able to directly compute so will instead use a CubicSubsolver routine which must satisfy:
Condition 1.
For any fixed, small constant , terminates within gradient iterations (which may depend on ), and returns a satisfying at least one of the following:

. (Case 1)

and, if , then . (Case 2)
a.2 Auxiliary Lemmas
We begin by providing the proof of several useful auxiliary lemmas. First we provide the proof of Lemma 4 which characterize the concentration conditions.
See 4
Proof.
We can use the matrix Bernstein inequality from Tropp et al. [2015] to control both the fluctuations in the stochastic gradients and stochastic Hessians under Assumption 2.
Recall that the spectral norm of a vector is equivalent to its vector norm. So the matrix variance of the centered gradients is:
using the triangle inequality and Jensens inequality. A direct application of the matrix Bernstein inequality gives:
Taking gives the result.
The matrix variance of the centered Hessians , which are symmetric, is:
(17) 
once again using the triangle inequality and Jensens inequality. Another application of the matrix Bernstein inequality gives that:
Taking ensures that the stochastic Hessianvector products are controlled uniformly over :
using samples with probability .
∎
Proof.
Using the global optimality conditions for Equation (14) from Nesterov and Polyak [2006], we have the global optima , satisfies:
Together these conditions also imply that:
Now immediately from the definition of stochastic cubic submodel model and the aforementioned conditions we have that:
An identical statement appears as Lemma 10 in Nesterov and Polyak [2006], so this is merely restated here for completeness. ∎
Thus to guarantee sufficient descent it suffices to lower bound the