 # Stochastic Cubic Regularization for Fast Nonconvex Optimization

This paper proposes a stochastic variant of a classic algorithm---the cubic-regularized Newton method [Nesterov and Polyak 2006]. The proposed algorithm efficiently escapes saddle points and finds approximate local minima for general smooth, nonconvex functions in only Õ(ϵ^-3.5) stochastic gradient and stochastic Hessian-vector product evaluations. The latter can be computed as efficiently as stochastic gradients. This improves upon the Õ(ϵ^-4) rate of stochastic gradient descent. Our rate matches the best-known result for finding local minima without requiring any delicate acceleration or variance-reduction techniques.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

We consider the problem of nonconvex optimization in the stochastic approximation framework (Robbins and Monro, 1951):

 minx∈Rdf(x)\coloneqqEξ∼D[f(x;ξ)]. (1)

, where the random variable

is sampled from an underlying distribution . The task is to optimize the expected function

, which in general may be nonconvex. This framework covers a wide range of problems, including the offline setting where we minimize the empirical loss over a fixed amount of data, and the online setting where data arrives sequentially. One of the most prominent applications of stochastic optimization has been in large-scale statistics and machine learning problems, such as the optimization of deep neural networks.

Classical analysis in nonconvex optimization only guarantees convergence to a first-order stationary point (i.e., a point satisfying ), which can be a local minimum, a local maximum, or a saddle point. This paper goes further, proposing an algorithm that escapes saddle points and converges to a local minimum. A local minimum is defined as a point satisfying and . Finding such a point is of special interest for a large class of statistical learning problems where local minima are global or near-global solutions (e.g. Choromanska et al. (2015); Sun et al. (2016a, b); Ge et al. (2017)).

Among first-order stochastic optimization algorithms, stochastic gradient descent (SGD) is perhaps the simplest and most versatile. While SGD is computationally inexpensive, the best current guarantee for finding an -approximate local minimum (see Definition 1) requires iterations (Ge et al., 2015), which is inefficient in the high-dimensional regime.

In contrast, second-order methods which have access to the Hessian of can exploit curvature to more effectively escape saddles and arrive at local minima. Empirically, second-order methods are known to perform better than first-order methods on a variety of nonconvex problems (Rattray et al., 1998; Martens, 2010; Regier et al., 2017). However, many classical second-order algorithms need to construct the full Hessian, which can be prohibitively expensive when working with large models. Recent work has therefore explored the use of Hessian-vector products , which can be computed as efficiently as gradients in many cases including neural networks (Pearlmutter, 1994).

Several algorithms incorporating Hessian-vector products (Carmon et al., 2016; Agarwal et al., 2017) have been shown to achieve faster convergence rates than gradient descent in the non-stochastic setting. However, in the stochastic setting where we only have access to stochastic Hessian-vector products, significantly less progress has been made. This leads us to ask the central question of this paper: Can stochastic Hessian-vector products help to speed up nonconvex optimization?

In this work, we answer in the affirmative, proposing a stochastic optimization method that utilizes stochastic gradients and Hessian-vector products to find an -second-order stationary point using only oracle evaluations, where hides poly-logarithmic factors. Our rate improves upon the rate of stochastic gradient descent, and matches the best-known result for finding local minima without the need for any delicate acceleration or variance reduction techniques (see Section 1.1 for details).

The proposed algorithm in this paper is based on a classic algorithm in the non-stochastic setting—the cubic regularized Newton method (or cubic regularization) (Nesterov and Polyak, 2006). This algorithm is a natural extension of gradient descent that incorporates Hessian information. Whereas gradient descent finds the minimizer of a local second-order Taylor expansion,

 xGDt+1=argminx[f(xt)+∇f(xt)⊤(x−xt)+ℓ2\normx−xt2],

the cubic regularized Newton method finds the minimizer of a local third-order Taylor expansion,

 xCubict+1= argminx[f(xt)+∇f(xt)⊤(x−xt)+12(x−xt)⊤∇2f(xt)(x−xt)+ρ6\normx−xt3].

We provide a stochastic variant of this classic algorithm, bridging the gap between its use in the non-stochastic and stochastic settings.

In contrast to prior work, we present a fully stochastic cubic-regularized Newton method: both gradients and Hessian-vector products are observed with noise. Additionally, we provide a non-asymptotic analysis of its complexity.

### 1.1 Related Work

There has been a recent surge of interest in optimization methods that can escape saddle points and find -approximate local minima (see Definition 1) in various settings. We provide a brief summary of these results. All iteration complexities in this section are stated in terms of finding approximate local minima, and only highlight the dependency on and .

#### 1.1.1 General Case

This line of work optimizes over a general function without any special structural assumptions. In this setting, the optimization algorithm has direct access to the gradient or Hessian oracles at each iteration. The work of Nesterov and Polyak (2006) first proposed the cubic-regularized Newton method, which requires gradient and Hessian oracle calls to find an -second-order stationary point. Later, the ARC algorithm (Cartis et al., 2011) and trust-region methods (Curtis et al., 2017) were also shown to achieve the same guarantee with similar Hessian oracle access. However, these algorithms rely on having access to the full Hessian at each iteration, which is prohibitive in high dimensions.

Recently, instead of using the full Hessian, Carmon and Duchi (2016) showed that using a gradient descent solver for the cubic regularization subproblem allows their algorithm to find -second-order stationary points in Hessian-vector product evaluations. With acceleration techniques, the number of Hessian-vector products can be reduced to (Carmon et al., 2016; Agarwal et al., 2017; Royer and Wright, 2017).

Meanwhile, in the realm of entirely Hessian-free results, Jin et al. (2017) showed that a simple variant of gradient descent can find -second stationary points in gradient evaluations.

#### 1.1.2 Finite-Sum Setting

In the finite-sum setting (also known as the offline setting) where , one assumes that algorithms have access to the individual functions . In this setting, variance reduction techniques can be exploited (Johnson and Zhang, 2013). Agarwal et al. (2017) give an algorithm requiring Hessian-vector oracle calls to find an -approximate local minimum. A similar result is achieved by the algorithm proposed by Reddi et al. (2017).

#### 1.1.3 Stochastic Approximation

The framework of stochastic approximation where only assumes access to a stochastic gradient and Hessian via . In this setting, Ge et al. (2015) showed that the total gradient iteration complexity for SGD to find an -second-order stationary point was . More recently, Kohler and Lucchi (2017) consider a subsampled version of the cubic regularization algorithm, but do not provide a non-asymptotic analysis for their algorithm to find an approximate local minimum; they also assume access to exact (expected) function values at each iteration which are not available in the fully stochastic setting. Xu et al. (2017) consider the case of stochastic Hessians, but also require access to exact gradients and function values at each iteration. Very recently, Allen-Zhu (2017) proposed an algorithm with a mechanism exploiting variance reduction that finds a second-order stationary point with 222The original paper reports a rate of due to a different definition of -second-order stationary point, , which is weaker than the standard definition as in Definition 1. Hessian-vector product evaluations. We note that our result matches this best result so far using a simpler approach without any delicate acceleration or variance-reduction techniques. See Table 1 for a brief comparison.

## 2 Preliminaries

We are interested in stochastic optimization problems of the form

 minx∈Rdf(x)\coloneqqEξ∼D[f(x;ξ)], (2)

where is a random variable with distribution . In general, the function may be nonconvex. This formulation covers both the standard offline setting where the objective function can be expressed as a finite sum of individual functions , as well as the online setting where data arrives sequentially.

Our goal is to minimize the function using only stochastic gradients and stochastic Hessian-vector products , where is a vector of our choosing. Although it is expensive and often intractable in practice to form the entire Hessian, computing a Hessian-vector product is as cheap as computing a gradient when our function is represented as an arithmetic circuit (Pearlmutter, 1994), as is the case for neural networks.

Notation: We use bold uppercase letters to denote matrices and bold lowercase letters to denote vectors. For vectors we use to denote the -norm, and for matrices we use to denote the spectral norm and

to denote the minimum eigenvalue. Unless otherwise specified, we use the notation

to hide only absolute constants which do not depend on any problem parameter, and the notation to hide only absolute constants and logarithmic factors.

### 2.1 Assumptions

Throughout the paper, we assume that the function is bounded below by some optimal value . We also make following assumptions about function smoothness:

###### Assumption 1.

The function has

• [itemsep=.5em, nosep]

• -Lipschitz gradients: for all and ,

 \norm∇f(x1)−∇f(x2)]≤ℓ\normx1−x2;
• -Lipschitz Hessians: for all and ,

 \norm∇2f(x1)−∇2f(x2)≤ρ\normx1−x2.

The above assumptions state that the gradient and the Hessian cannot change dramatically in a small local area, and are standard in prior work on escaping saddle points and finding local minima.

Next, we make the following variance assumptions about stochastic gradients and stochastic Hessians:

###### Assumption 2.

The function has

• [itemsep=.5em, nosep]

• for all , and a.s.;

• for all , and a.s.

We note that the above assumptions are not essential to our result, and can be replaced by any conditions that give rise to concentration. Moreover, the a.s. bounded Hessian assumption can be removed if one further assumes has -Lipschitz gradients for all , which is stronger than Assumption 1. Assuming has -Lipschitz gradients immediately implies the stochastic Hessians are a.s. bounded with parameter which alone is sufficient to guarantee concentration without any further assumptions on the variance of (Tropp et al., 2015).

### 2.2 Cubic Regularization

Our target in this paper is to find an -second-order stationary point, which we define as follows:

###### Definition 1.

For a -Hessian Lipschitz function , we say that is an -second-order stationary point (or -approximate local minimum) if

 \norm∇f(x)≤ϵandλmin(∇2f(x))≥−√ρϵ. (3)

An -second-order stationary point not only has a small gradient, but also has a Hessian which is close to positive semi-definite. Thus it is often also referred to as an -approximate local minimum.

In the deterministic setting, cubic regularization (Nesterov and Polyak, 2006) is a classic algorithm for finding a second-order stationary point of a -Hessian-Lipschitz function . In each iteration, it first forms a local upper bound on the function using a third-order Taylor expansion around the current iterate :

 mt(x)

This is called the cubic submodel. Then, cubic regularization minimizes this cubic submodel to obtain the next iterate: . When the cubic submodel can be solved exactly, cubic regularization requires iterations to find an -second-order stationary point.

To apply this algorithm in the stochastic setting, three issues need to be addressed: (1) we only have access to stochastic gradients and Hessians, not the true gradient and Hessian; (2) our only means of interaction with the Hessian is through Hessian-vector products; (3) the cubic submodel cannot be solved exactly in practice, only up to some tolerance. We discuss how to overcome each of these obstacles in our paper.

## 3 Main Results

We begin with a general-purpose stochastic cubic regularization meta-algorithm in Algorithm 1, which employs a black-box subroutine to solve stochastic cubic subproblems. At a high level, in order to deal with stochastic gradients and Hessians, we sample two independent minibatches and at each iteration. Denoting the average gradient by

 gt=1|S1|∑ξi∈S1∇f(xt,ξi) (4)

and the average Hessian by

 Bt=1|S2|∑ξi∈S2∇2f(xt,ξi), (5)

this implies a stochastic cubic submodel:

 mt(x) =f(xt)+(x−xt)⊤gt+12(x−xt)⊤Bt(x−xt)+ρ6\normx−xt3. (6)

Although the subproblem depends on , we note that our meta-algorithm never explicitly formulates this matrix, only providing the subsolver access to through Hessian-vector products, which we denote by . We hence assume that the subsolver performs gradient-based optimization to solve the subproblem, as depends on only via .

After sampling minibatches for the gradient and the Hessian, Algorithm 1 makes a call to a black-box cubic subsolver to optimize the stochastic submodel . The subsolver returns a parameter change , i.e., an approximate minimizer of the submodel, along with the corresponding change in submodel value, . The algorithm then updates the parameters by adding to the current iterate, and checks whether satisfies a stopping condition.

In more detail, the Cubic-Subsolver subroutine takes a vector and a function for computing Hessian-vector products , then optimizes the third-order polynomial

 (7)

Let denote the minimizer of this polynomial. In general, the subsolver cannot return the exact solution . We hence tolerate a certain amount of suboptimality:

###### Condition 1.

For any fixed, small constant , terminates within gradient iterations (which may depend on ), and returns a satisfying at least one of the following:

1. . (Case 1)

2. and, if , then . (Case 2)

The first condition is satisfied if the parameter change results in submodel and function decreases that are both sufficiently large (Case 1). If that fails to hold, the second condition ensures that is not too large relative to the true solution , and that the cubic submodel is solved to precision when is large (Case 2).

As mentioned above, we assume the subsolver uses gradient-based optimization to solve the subproblem so that it will only access the Hessian only through Hessian-vector products. Accordingly, it can be any standard first-order algorithm such as gradient descent, Nesterov’s accelerated gradient descent, etc. Gradient descent is of particular interest as it can be shown to satisfy Condition 1 (see Lemma 2).

Having given an overview of our meta-algorithm and verified the existence of a suitable subsolver, we are ready to present our main theorem:

###### Theorem 1.

There exists an absolute constant such that if satisfies Assumptions 1, 2, CubicSubsolver satisfies Condition 1 with , , and , then for all and , Algorithm 1 will output an -second-order stationary point of

with probability at least

within

 ~O(√ρΔfϵ1.5(max(M1ϵ,σ21ϵ2)+max(M2√ρϵ,σ22ρϵ)⋅T(ϵ))) (8)

total stochastic gradient and Hessian-vector product evaluations.

In the limit where is small the result simplifies:

###### Remark 1.

If , then under the settings of Theorem 1 we can conclude that Algorithm 1 will output an -second-order stationary point of with probability at least within

 ~O(√ρΔfϵ1.5(σ21ϵ2+σ22ρϵ⋅T(ϵ))) (9)

total stochastic gradient and Hessian-vector product evaluations.

Theorem 1 states that after iterations, stochastic cubic regularization (Algorithm 1) will have found an -second-order stationary point w.h.p. – see Equation (26) in the Appendix for an exact outer iteration complexity that guarantees early termination of Algorithm 1 w.h.p. Within each iteration, we require samples for gradient averaging and samples for Hessian-vector product averaging to ensure small variance when is small. Recall that the averaged gradient is fixed for a given cubic submodel, while the averaged Hessian-vector product needs to be recalculated every time the cubic subsolver queries the gradient . At most such queries will be made by definition. Therefore, each iteration takes stochastic gradient and Hessian-vector product evaluations in the limit where is small (see Remark 1). Multiplying these two quantities together gives the full result.

Finally, we note that lines 8-11 of Algorithm 1 give the termination condition of our meta-algorithm. When the decrease in submodel value is too small, our theory guarantees is an -second-order stationary point, where is the optimal solution of the cubic submodel. However, Cubic-Subsolver may only produce an inexact solution satisfying Condition 1, which is not sufficient for to be an -second-order stationary point. We therefore call Cubic-Finalsolver to solve the subproblem with higher precision. Since Cubic-Finalsolver is invoked only once at the end of the algorithm, we can just use gradient descent, and its runtime will always be dominated by the rest of the algorithm.

### 3.1 Gradient Descent as a Cubic-Subsolver

One concrete example of a cubic subsolver is a simple variant of gradient descent (Algorithm 3) as studied in Carmon and Duchi (2016). The two main differences relative to standard gradient descent are: (1) lines 1–3: when is large, the submodel (Equation 7) may be ill-conditioned, so instead of doing gradient descent, the iterate only moves one step in the direction, which already guarantees sufficient descent; (2) line 6: the algorithm adds a small perturbation to to avoid a certain “hard” case for the cubic submodel. We refer readers to Carmon and Duchi (2016) for more details about Algorithm 3.

Adapting their result for our setting, we obtain the following lemma, which states that the gradient descent subsolver satisfies our Condition 1.

###### Lemma 2.

There exists an absolute constant , such that under the same assumptions on and the same choice of parameters as in Theorem 1, Algorithm 3 satisfies Condition 1 with probability at least with

 T(ϵ)≤~O(ℓ√ρϵ). (10)

Our next corollary applies gradient descent (Algorithm 3) as the approximate cubic subsolver in our meta-algorithm (Algorithm 1), which immediately gives the total number of gradient and Hessian-vector evaluations for the full algorithm.

###### Corollary 3.

Under the same settings as Theorem 1, if , and if we instantiate the Cubic-Subsolver subroutine with Algorithm 3, then with probability greater than , Algorithm 1 will output an -second-order stationary point of within

 ~O(√ρΔfϵ1.5(σ21ϵ2+σ22ρϵ⋅ℓ√ρϵ)) (11)

total stochastic gradient and Hessian-vector product evaluations.

###### Remark 2.

The overall runtime of Algorithm 1 in Corollary 3 is the number of stochastic gradient and Hessian-vector product evaluations multiplied by the time to compute a gradient or Hessian-vector product. For neural networks, the latter takes time.

From Corollary 3, we observe that the dominant term in solving the submodel is when is sufficiently small, giving a total iteration complexity of when other problem-dependent parameters are constant. This improves on the complexity attained by SGD.

It is reasonable to believe there may be another cubic subsolver which is faster than gradient descent and which satisfies Condition 1 with a smaller , for instance a variant of Nesterov’s accelerated gradient descent. However, since the dominating term in our subsolver complexity is due to gradient averaging, and this is independent of , a faster cubic subsolver cannot improve the overall number of gradient and Hessian-vector product evaluations. This means that the gradient descent subsolver already achieves the optimal asymptotic rate for finding an -second-order stationary point under our stochastic cubic regularization framework.

## 4 Proof Sketch

This section sketches the crucial steps needed to understand and prove our main theorem (Theorem 1). We begin by describing our high-level approach, then show how to instantiate this high-level approach in the stochastic setting, assuming oracle access to an exact subsolver. For the case of an inexact subsolver and other proof details, we defer to the Appendix.

Recall that at iteration of Algorithm 1, a stochastic cubic submodel is constructed around the current iterate with the form given in Equation (6):

 mt(x) =f(xt)+(x−xt)⊤gt+12(x−xt)⊤Bt(x−xt)+ρ6\normx−xt3,

where and are the averaged stochastic gradients and Hessians. At a high level, we will show that for each iteration, the following two claims hold:

Claim 1. If is not an -second-order stationary point of , the cubic submodel has large descent .

Claim 2. If the cubic submodel has large descent , then the true function also has large descent .

Given these claims, it is straightforward to argue for the correctness of our approach. We know that if we observe a large decrease in the cubic submodel value during Algorithm 1, then by Claim 2 the function will also have large descent. But since is bounded below, this cannot happen indefinitely, so we must eventually encounter an iteration with small cubic submodel descent. When that happens, we can conclude by Claim 1 that is an -second-order stationary point.

We note that Claim 2 is especially important in the stochastic setting, as we no longer have access to the true function but only the submodel. Claim 2 ensures that progress in still indicates progress in , allowing the algorithm to terminate at the correct time.

In the remaining parts of this section, we discuss why the above two claims hold for an exact solver.

### 4.1 Stochastic Setting with Exact Subsolver

In this setting, and are the averaged gradient and Hessian with sample sizes and , respectively. To ensure the stochastic cubic submodel approximates the exact cubic submodel well, we need large enough sample sizes so that both and are close to the exact gradient and Hessian at up to some tolerance:

###### Lemma 4.

For any fixed small constants , we can pick gradient and Hessian mini-batch sizes and so that with probability ,

 ∥gt−∇f(xt)∥≤c1⋅ϵ, (12)
 ∀v,∥(Bt−∇2f(xt))v∥≤c2⋅√ρϵ\normv. (13)

We need to ensure that the random vectors/matrices concentrate along an arbitrary direction (depending on and ). In order to guarantee the uniform concentration in Lemma 4, we can directly apply results from matrix concentration to obtain the desired result (Tropp et al., 2015).

Let , i.e.  is a global minimizer of the cubic submodel . If we use an exact oracle solver, we have . In order to show Claim 1 and Claim 2, one important quantity to study is the decrease in the cubic submodel :

###### Lemma 5.

Let and be defined as above. Then for all ,

 mt(xt+Δ⋆t)−mt(xt)≤−112ρ\normΔ⋆t3.

Lemma 5 implies that in order to prove submodel has sufficient function value decrease, we only need to lower bound the norm of optimal solution, i.e. .

Proof sketch for claim 1: Our strategy is to lower bound the norm of when is not an -second-order stationary point. In the non-stochastic setting, Nesterov and Polyak (2006) prove

 \normΔ⋆t≥12max{√1ρ\norm∇f(xt+1),1ρλmin(∇2f(xt+1))},

which gives the desired result. In the stochastic setting, a similar statement holds up to some tolerance:

###### Lemma 6.

Under the setting of Lemma 4 with sufficiently small constants ,

 \normΔ⋆t≥12max{ √1ρ(\norm∇f(xt+Δ⋆t)−ϵ4),1ρ(λmin(∇2f(xt+Δ⋆t))−√ρϵ4)}.

That is, when is not an -second-order stationary point, we have . In other words, we have sufficient movement. It follows by Lemma 5 that we have sufficient cubic submodel descent.

Proof sketch for claim 2: In the non-stochastic case, is by construction an upper bound on . Together with the fact , we have:

 f(xt+1)−f(xt)≤mt(xt+1)−mt(xt),

showing Claim 2 is always true. For the stochastic case, this inequality may no longer be true. Instead, under the setting of Lemma 4, via Lemma 5, we can upper bound the function decrease with an additional error term:

 f(xt+1)− f(xt)≤12[mt(xt+1)−mt(xt)]+c√ϵ3ρ,

for some sufficiently small constant . Then when , we have , which proves Claim 2.

Finally, for an approximate cubic subsolver, the story becomes more elaborate. Claim 1 is only “approximately” true, while Claim 2 still holds but for more complicated reasons. We defer to the Appendix for the full proof.

## 5 Experiments

In this section, we provide empirical results on synthetic and real-world data sets to demonstrate the efficacy of our approach. All experiments are implemented using TensorFlow

(Abadi et al., 2016), which allows for efficient computation of Hessian-vector products using the method described by Pearlmutter (1994).

### 5.1 Synthetic Nonconvex Problem Figure 1: The piecewise cubic function w(x) used along one of the dimensions in the synthetic experiment. The other dimension uses a scaled quadratic.

We begin by constructing a nonconvex problem with a saddle point to compare our proposed approach against stochastic gradient descent. Let be the W-shaped scalar function depicted in Figure 1, with a local maximum at the origin and two local minima on either side. While we defer the exact form of to Appendix C, we note here that it has small negative curvature at the origin, , and that it has a 1-Lipschitz second derivative. We aim to solve the problem

 minx∈R2[w(x1)+10x22],

with independent noise drawn from added separately to each component of every gradient and Hessian-vector product evaluation. By construction, the objective function has a saddle point at the origin with Hessian eigenvalues of -0.2 and 20, providing a simple but challenging test case where the negative curvature is two orders of magnitude smaller than the positive curvature and is comparable in magnitude to the noise.

We ran our method and SGD on this problem, plotting the objective value versus the number of oracle calls in Figure 3. The batch sizes and learning rates for each method are tuned separately to ensure a fair comparison; see Appendix C for details. We observe that our method is able to escape the saddle point at the origin and converge to one of the global minima faster than SGD, offering empirical evidence in support of our method’s theoretical advantage.

### 5.2 Deep Autoencoder Figure 2: Results on our synthetic nonconvex optimization problem. Stochastic cubic regularization escapes the saddle point at the origin and converges to a global optimum faster than SGD.

In addition to the synthetic problem above, we also investigate the performance of our approach on a more practical problem from deep learning, namely training a deep autoencoder on MNIST. The MNIST data set consists of grayscale images of handwritten digits, divided into 60,000 training images and 10,000 test images

(LeCun and Cortes, 2010). Each image is pixels, and pixel intensities are normalized to lie between 0 and 1. Our architecture consists of a fully connected encoder with dimensions together with a symmetric decoder. We use a softplus nonlinearity (defined as ) for each hidden layer to ensure continuous first and second derivatives, and we use a pixelwise

loss between the input and the reconstructed output as our objective function. We apply an elementwise sigmoid function to the final layer to guarantee that the output pixel intensities lie between 0 and 1.

Results on this autoencoding task are presented in Figure 3. In addition to training the model with our method and SGD, we also include results using AdaGrad, a popular adaptive first-order method with strong empirical performance (Duchi et al., 2011)

. Since the standard MNIST split does not include a validation set, we separate the original training set into 55,000 training images and 5,000 validation images, plotting training error on the former and using the latter to select hyperparameters for each method. More details about our experimental setup can be found in Appendix

C.

We observe that stochastic cubic regularization quickly escapes two saddle points and descends toward a local optimum, while AdaGrad takes two to three times longer to escape each saddle point, and SGD is slower still. This demonstrates that incorporating curvature information via Hessian-vector products can assist in escaping saddle points in practice. However, it is worth noting that AdaGrad makes slightly faster progress than our approach after reaching a basin around a local optimum, indicating that adaptivity may provide complementary benefits to second-order information. We leave the investigation of a hybrid method combining both techniques as an exciting direction for future work.

## 6 Conclusion

In this paper, we presented a stochastic algorithm based on the classic cubic-regularized Newton method for nonconvex optimization. Our algorithm provably finds -approximate local minima in total gradient and Hessian-vector product evaluations, improving upon the rate of SGD. Our experiments show the favorable performance of our method relative to SGD on both a synthetic and a real-world problem.

## References

• Abadi et al.  Martin Abadi, Paul Barham, et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation, pages 265–283, 2016.
• Agarwal et al.  Naman Agarwal, Zeyuan Allen-Zhu, Brian Bullins, Elad Hazan, and Tengyu Ma. Finding approximate local minima faster than gradient descent. In

Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing

, 2017.
• Allen-Zhu  Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. arXiv preprint arXiv:1708.08694, 2017.
• Carmon and Duchi  Yair Carmon and John Duchi. Gradient descent efficiently finds the cubic-regularized non-convex Newton step. arXiv preprint arXiv:1612.00547, 2016.
• Carmon et al.  Yair Carmon, John Duchi, Oliver Hinder, and Aaron Sidford. Accelerated methods for non-convex optimization. arXiv preprint arXiv:1611.00756, 2016.
• Cartis et al.  Coralia Cartis, Nicholas Gould, and Philippe Toint. Adaptive cubic regularisation methods for unconstrained optimization. Part II: worst-case function-and derivative-evaluation complexity. Mathematical Programming, 130(2):295–319, 2011.
• Choromanska et al.  Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In

Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics

, 2015.
• Curtis et al.  Frank E Curtis, Daniel P Robinson, and Mohammadreza Samadi. A trust region algorithm with a worst-case iteration complexity of for nonconvex optimization. Mathematical Programming, 162(1-2):1–32, 2017.
• Duchi et al.  John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12:2121–2159, 2011.
• Ge et al.  Rong Ge, Furong Huang, Chi Jin, and Yang Yuan.

In Proceedings of The 28th Conference on Learning Theory, 2015.
• Ge et al.  Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank problems: A unified geometric analysis. In Proceedings of the 34th International Conference on Machine Learning, 2017.
• Jin et al.  Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M. Kakade, and Michael I. Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning, 2017.
• Johnson and Zhang  Rie Johnson and Tong Zhang. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems, 2013.
• Kohler and Lucchi  Jonas Moritz Kohler and Aurelien Lucchi. Sub-sampled cubic regularization for non-convex optimization. In Proceedings of the 34th International Conference on Machine Learning, 2017.
• LeCun and Cortes  Yann LeCun and Corinna Cortes. MNIST Handwritten Digit Database, 2010.
• Martens  James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, 2010.
• Nesterov  Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course, volume 87. Springer Science & Business Media, 2013.
• Nesterov and Polyak  Yurii Nesterov and Boris T Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
• Pearlmutter  Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147–160, 1994.
• Rattray et al.  Magnus Rattray, David Saad, and Shun-ichi Amari. Natural gradient descent for on-line learning. Physical review letters, 81(24):5461, 1998.
• Reddi et al.  Sashank J Reddi, Manzil Zaheer, et al. A generic approach for escaping saddle points. arXiv preprint arXiv:1709.01434, 2017.
• Regier et al.  Jeffrey Regier, Michael I Jordan, and Jon McAuliffe. Fast black-box variational inference through stochastic trust-region optimization. In Advances in Neural Information Processing Systems, 2017.
• Robbins and Monro  Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.
• Royer and Wright  Clément W Royer and Stephen J Wright. Complexity analysis of second-order line-search algorithms for smooth nonconvex optimization. arXiv preprint arXiv:1706.03131, 2017.
• Sun et al. [2016a] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture. IEEE Transactions on Information Theory, 2016a.
• Sun et al. [2016b] Ju Sun, Qing Qu, and John Wright. A geometric analysis of phase retrieval. In IEEE International Symposium on Information Theory. IEEE, 2016b.
• Tropp et al.  Joel A Tropp et al. An introduction to matrix concentration inequalities. Foundations and Trends® in Machine Learning, 8(1-2):1–230, 2015.
• Xu et al.  Peng Xu, Farbod Roosta-Khorasani, and Michael W Mahoney. Newton-type methods for non-convex optimization under inexact Hessian information. arXiv preprint arXiv:1708.07164, 2017.

## Appendix A Proof of Main Results

In this section, we give formal proofs of Theorems 1 and 3. We start by providing proofs of several useful auxiliary lemmas.

###### Remark 3.

It suffices to assume that for the following analysis, since otherwise every point satisfies the second-order condition trivially by the Lipschitz-gradient assumption.

### a.1 Set-Up and Notation

Here we remind the reader of the relevant notation and provide further background from Nesterov and Polyak  on the cubic-regularized Newton method. We denote the stochastic gradient as

 gt=1|S1|∑ξi∈S1∇f(xt,ξi)

and the stochastic Hessian as

 Bt=1|S2|∑ξi∈S2∇2f(xt,ξi),

both for iteration . We draw a sufficient number of samples and so that the concentration conditions

 ∥gt−∇f(xt)∥≤c1⋅ϵ,
 ∀v,∥(Bt−∇2f(xt))v∥≤c2⋅√ρϵ\normv.

are satisfied for sufficiently small (see Lemma 4 for more details). The cubic-regularized Newton subproblem is to minimize

 mt(y) =f(xt)+(y−xt)⊤gt+12(y−xt)⊤Bt(y−xt)+ρ6\normy−xt3. (14)

We denote the global optimizer of as , that is .

As shown in Nesterov and Polyak  a global optima of Equation (14) satisfies:

 gt+BtΔ⋆t+ρ2\normΔ⋆tΔ⋆t=0. (15) Bt+ρ2\normΔ⋆tI⪰0. (16)

Equation (15) is the first-order stationary condition, while Equation (16) follows from a duality argument. In practice, we will not be able to directly compute so will instead use a Cubic-Subsolver routine which must satisfy:

###### Condition 1.

For any fixed, small constant , terminates within gradient iterations (which may depend on ), and returns a satisfying at least one of the following:

1. . (Case 1)

2. and, if , then . (Case 2)

### a.2 Auxiliary Lemmas

We begin by providing the proof of several useful auxiliary lemmas. First we provide the proof of Lemma 4 which characterize the concentration conditions.

See 4

###### Proof.

We can use the matrix Bernstein inequality from Tropp et al.  to control both the fluctuations in the stochastic gradients and stochastic Hessians under Assumption 2.

Recall that the spectral norm of a vector is equivalent to its vector norm. So the matrix variance of the centered gradients is:

 v[~g]=1n21max{\normE[n1∑i=1~∇f(x,ξi)~∇f(x,ξi)⊤],\normE[n1∑i=1~∇f(x,ξi)⊤~∇f(x,ξi)]}≤σ21n1

using the triangle inequality and Jensens inequality. A direct application of the matrix Bernstein inequality gives:

 P[\normg−∇f(x)≥t]≤2dexp(−t2/2v[~g]+M1/(3n1))≤2dexp(−3n18min{tM1,t2σ21})⟹ \normg−∇f(x)≤t with % probability 1−δ′ for n1≥max(M1t,σ21t2)83log2dδ′

Taking gives the result.

The matrix variance of the centered Hessians , which are symmetric, is:

 v[~B]=1n22\normn2∑i=1E[(~∇2f(x,ξi))2]≤σ22n2 (17)

once again using the triangle inequality and Jensens inequality. Another application of the matrix Bernstein inequality gives that:

 P[\normB−∇2f(x))≥t]≤2dexp(−3n28min{tM2,t2σ22})⟹ \normB−∇2f(x))≤t with % probability 1−δ′ for n2≥max(M2t,σ22t2)83log2dδ′

Taking ensures that the stochastic Hessian-vector products are controlled uniformly over :

 \norm(B−∇2f(x))v≤c2⋅√ρϵ\normv

using samples with probability .

Next we show Lemma 5 which will relate the change in the cubic function value to the norm . See 5

###### Proof.

Using the global optimality conditions for Equation (14) from Nesterov and Polyak , we have the global optima , satisfies:

 gt+Bt(Δ⋆t)+ρ2\normΔ⋆t(Δ⋆t)=0 Bt+ρ2\normΔ⋆tI⪰0.

Together these conditions also imply that:

 (Δ⋆t)⊤gt+(Δ⋆t)⊤Bt(Δ⋆t)+ρ2\normΔ⋆t3=0 (Δ⋆t)⊤Bt(Δ⋆t)+ρ2\normΔ⋆t3≥0.

Now immediately from the definition of stochastic cubic submodel model and the aforementioned conditions we have that:

 f(xt)−mt(xt+Δ⋆t) =−(Δ⋆t)⊤gt−12(Δ⋆t)⊤Bt(Δ⋆t)−ρ6\normxt+Δ⋆t3 =12(Δ⋆t)⊤Bt(Δ⋆t)+13ρ\normΔ⋆t3 ≥112<