Sharp Analysis for Nonconvex SGD Escaping from Saddle Points

02/01/2019 · by Cong Fang, et al. · Peking University

In this paper, we prove that the simplest Stochastic Gradient Descent (SGD) algorithm is able to efficiently escape from saddle points and find an (ϵ, O(ϵ^0.5))-approximate second-order stationary point in Õ(ϵ^-3.5) stochastic gradient computations for generic nonconvex optimization problems, under both gradient-Lipschitz and Hessian-Lipschitz assumptions. This unexpected result subverts the classical belief that SGD requires at least O(ϵ^-4) stochastic gradient computations for obtaining an (ϵ, O(ϵ ^0.5))-approximate second-order stationary point. Such SGD rate matches, up to a polylogarithmic factor of problem-dependent parameters, the rate of most accelerated nonconvex stochastic optimization algorithms that adopt additional techniques, such as Nesterov's momentum acceleration, negative curvature search, as well as quadratic and cubic regularization tricks. Our novel analysis gives new insights into nonconvex SGD and can be potentially generalized to a broad class of stochastic optimization algorithms.


1 Introduction

Nonconvex stochastic optimization is crucial in machine learning and has attracted tremendous attention and unprecedented popularity. Many modern tasks, including low-rank matrix factorization/completion and principal component analysis (Candès & Recht, 2009; Jolliffe, 2011), dictionary learning (Sun et al., 2017), Gaussian mixture models (Reynolds et al., 2000), as well as notably deep neural networks (Hinton & Salakhutdinov, 2006), are formulated as nonconvex stochastic optimization problems. In this paper, we concentrate on finding an approximate solution to the following minimization problem:

minimize_{x ∈ ℝ^d} f(x) := E_ζ[F(x; ζ)].    (1.1)

Here, F(x; ζ) denotes a family of stochastic functions indexed by some random variable ζ that obeys a prescribed distribution 𝒟, and we consider the general case where f and F(·; ζ) have Lipschitz-continuous gradients and Hessians and might be nonconvex. In empirical risk minimization tasks, 𝒟 is a uniform discrete distribution over the set of training sample indices, and the stochastic function F(x; i) corresponds to the nonconvex loss associated with the i-th sample.

One of the classical algorithms for optimizing (1.1) is the Stochastic Gradient Descent (SGD) method, which performs descent updates iteratively via the inexpensive stochastic gradient ∇F(x; ζ) that serves as an unbiased estimator of the (inaccessible) full gradient ∇f(x), i.e. E_ζ[∇F(x; ζ)] = ∇f(x) (Robbins & Monro, 1951; Bottou & Bousquet, 2008). Let η > 0 denote the stepsize; then at step k, the iteration performs the following update:

x^{k+1} = x^k − η ∇F(x^k; ζ^k),    (1.2)

where ζ^k is randomly sampled at iteration k. SGD admits perhaps the simplest update rule among stochastic first-order methods; see Algorithm 1 for a formal illustration of the meta algorithm. It has gained tremendous popularity among researchers and engineers due to its exceptional effectiveness and performance in practice. Taking the example of training deep neural networks, the dominant algorithm at the present time is SGD (Abadi et al., 2016), where the stochastic gradient is computed via one backpropagation step. Superior characteristics of SGD have been observed in many empirical studies, including but not limited to fast convergence, desirable solutions of low training loss, as well as good generalization ability.

Turning to the theoretical side, relatively mature and concrete analyses in the existing literature (Rakhlin et al., 2012; Agarwal et al., 2009) show that SGD achieves an optimal rate of convergence for convex objective functions under standard regimes. Specifically, the O(1/T) convergence rate in terms of the function optimality gap matches the algorithmic lower bound for an appropriate class of strongly convex functions (Agarwal et al., 2009).

Despite the optimal convex optimization rates that SGD achieves, the provable nonconvex SGD convergence result has long remained at finding an ϵ-approximate first-order stationary point: with high probability, SGD finds an x such that ‖∇f(x)‖ ≤ ϵ in O(ϵ^-4) stochastic gradient computations under the gradient-Lipschitz condition on f (Nesterov, 2004). In contrast, our goal in this paper is to find an (ϵ, √(ρϵ))-approximate second-order stationary point, i.e. an x such that ‖∇f(x)‖ ≤ ϵ and the least eigenvalue of the Hessian matrix ∇²f(x) is at least −√(ρϵ), where ρ denotes the so-called Hessian-Lipschitz parameter to be specified later (Nesterov & Polyak, 2006; Tripuraneni et al., 2018; Carmon et al., 2018; Agarwal et al., 2017). Put differently, we need to escape from all first-order stationary points that admit a strongly negative Hessian eigenvalue (a.k.a. saddle points) (Dauphin et al., 2014) and land at a point that quantitatively resembles a local minimizer in terms of its gradient norm and least Hessian eigenvalue.
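To make the target condition concrete, here is a minimal numerical sketch (our own illustration, not part of the paper) that checks the (ϵ, √(ρϵ)) condition for a given gradient and Hessian; numpy's `eigvalsh` returns eigenvalues in ascending order, so the first entry is the least eigenvalue.

```python
import numpy as np

def is_approx_sosp(grad, hess, eps, rho):
    """Check the (eps, sqrt(rho*eps))-approximate second-order
    stationarity condition: ||grad|| <= eps and
    lambda_min(hess) >= -sqrt(rho * eps)."""
    small_gradient = np.linalg.norm(grad) <= eps
    lam_min = np.linalg.eigvalsh(hess)[0]  # ascending order
    mild_curvature = lam_min >= -np.sqrt(rho * eps)
    return small_gradient and mild_curvature

# The origin is a first-order stationary point of f(x) = x1^2 - x2^2
# but a saddle: lambda_min = -2 violates the curvature condition.
print(is_approx_sosp(np.zeros(2), np.diag([2.0, -2.0]), eps=1e-2, rho=1.0))  # False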

1:  for k = 0, 1, 2, … do
2:      Draw an independent ζ^k and set x^{k+1} ← x^k − η ∇F(x^k; ζ^k)    ▷ SGD step
3:      if the stopping criterion is satisfied then
4:          break
Algorithm 1 SGD (meta version)
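For concreteness, a minimal Python rendering of the meta algorithm might look as follows; the sampling oracle and the stopping rule are deliberately left abstract here, mirroring Algorithm 1 (the function names are ours).

```python
import numpy as np

def sgd_meta(stoch_grad, x0, eta, max_iters, stop=None, seed=0):
    """Meta SGD (Algorithm 1): x_{k+1} = x_k - eta * grad_F(x_k; zeta_k)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    for k in range(max_iters):
        x = x - eta * stoch_grad(x, rng)  # one SGD step with a fresh sample
        if stop is not None and stop(x, k):
            break
    return x

# Toy usage: a noisy gradient oracle for f(x) = ||x||^2.
x_out = sgd_meta(lambda x, rng: 2 * x + rng.normal(size=x.shape),
                 x0=np.ones(5), eta=0.05, max_iters=1000)
```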

Results on the convergence rate of SGD for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point have been scarce until very recently.¹ To the best of our knowledge, Ge et al. (2015) provided the first theoretical result that SGD with artificially injected spherical noise can escape from all saddle points in polynomial time. Moreover, Ge et al. (2015) showed that such SGD finds an approximate second-order stationary point at a stochastic gradient computational cost of O(poly(d)·ϵ^-4). A recent follow-up work by Daneshmand et al. (2018) derived a convergence guarantee for SGD without artificial noise injection, under an additional Correlated Negative Curvature assumption. These milestone works (Ge et al., 2015; Daneshmand et al., 2018) showed that SGD can always escape from saddle points and can find an approximate local solution of (1.1) at a stochastic gradient computational cost that is polynomially dependent on problem-specific parameters. Motivated by these recent works, the current paper tries to answer the following questions:

¹ Some authors work with an alternative (ϵ, ϵ_H) notion of stationarity; we ignore such expressions due to the natural (ϵ, √(ρϵ)) choice in the optimization literature (Nesterov & Polyak, 2006; Jin et al., 2017).

  1. Is it possible to sharpen the analysis of the SGD algorithm and obtain a reduced stochastic gradient computational cost for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point?

  2. Is artificial noise injection absolutely necessary for SGD to find an approximate second-order stationary point at an almost dimension-free stochastic gradient computational cost?

To answer question 1, we provide a sharp analysis and prove that SGD finds an (ϵ, O(ϵ^0.5))-approximate second-order stationary point at a remarkable stochastic gradient computational cost of Õ(ϵ^-3.5) for solving (1.1). This is a very unexpected result, because it has been conjectured by many (Xu et al., 2018; Allen-Zhu & Li, 2018; Tripuraneni et al., 2018) that an O(ϵ^-4) cost is required to find an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. Our result negates this conjecture and improves upon the sharpest stochastic gradient computational cost known for SGD prior to this work. To answer question 2, we propose a novel dispersive noise assumption and prove that under such an assumption, SGD requires no artificial noise injection in order to achieve the aforementioned sharp stochastic gradient computational cost. Such a noise assumption is satisfied in the case of infinite online samples and of Gaussian-sampling zeroth-order optimization, and can be satisfied automatically by injecting artificial ball-shaped, spherical uniform, or Gaussian noise.

We emphasize that the Õ(ϵ^-3.5) stochastic gradient computational cost is, however, not the lower-bound complexity for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point of problem (1.1). Recently, Fang et al. (2018) applied a novel variance reduction technique named the Stochastic Path-Integrated Differential Estimator (Spider) and proposed the Spider-SFO+ algorithm, which provably achieves a stochastic gradient computational cost of Õ(ϵ^-3) for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. It is our belief that variance reduction techniques are necessary to achieve a stochastic gradient computational cost strictly sharper than Õ(ϵ^-3.5).

1.1 Our Contributions

In this work we theoretically study the SGD algorithm for minimizing a nonconvex function f. Specifically, this work contributes the following:

  1. We propose a sharp convergence analysis for the classical and simple SGD and prove that the total stochastic gradient computational cost to find an (ϵ, O(ϵ^0.5))-approximate second-order stationary point is at most Õ(ϵ^-3.5) under both Lipschitz-continuous gradient and Hessian assumptions on the objective function. This convergence rate matches those of the most accelerated nonconvex stochastic optimization algorithms, which adopt additional techniques such as Nesterov's momentum acceleration, negative curvature search, and quadratic and cubic regularization tricks.

  2. We propose the dispersive noise assumption and prove that under such an assumption, SGD is guaranteed to escape all saddle points that have a strongly negative Hessian eigenvalue. This type of noise generalizes the artificial ball-shaped noise and is widely applicable to many tasks.

  3. Our novel analytic tools for proving saddle escape and fast convergence of SGD are of independent interest, and they shed light on the development and analysis of new stochastic optimization algorithms.

Organization

The rest of the paper is organized as follows. §2 provides the SGD algorithm and the main convergence rate theorem for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. In §3, we sketch the proof of our convergence rate theorem by providing and discussing three core propositions. §4 compares our result with related works. We conclude our paper in §5 with proposed future directions. Missing proofs are detailed in order in the appendix sections.

Notation

Let ‖·‖ denote the Euclidean norm of a vector or the spectral norm of a square matrix. For a sequence of vectors x_k and positive scalars a_k, denote x_k = O(a_k) if there is a global constant C such that ‖x_k‖ ≤ C a_k, and x_k = Õ(a_k) if such C hides a poly-logarithmic factor of d and ϵ. Denote x_k = Ω(a_k) if there is a global constant C such that ‖x_k‖ ≥ C a_k, and x_k = Ω̃(a_k) if such C hides a poly-logarithmic factor of d and ϵ. We denote x_k = Θ̃(a_k) if both x_k = Õ(a_k) and x_k = Ω̃(a_k) hold. Further, we denote the linear transformation of a set S by a matrix A as A S = {Ax : x ∈ S}. Let λ_min(A) denote the least eigenvalue of a real symmetric matrix A. We denote B(x, r) as the r-neighborhood of x, i.e. the set {y : ‖y − x‖ ≤ r}.

2 Algorithm and Main Result

In this section, we formally state SGD and the corresponding convergence rate theorem. In §2.1, we propose the key assumptions for the objective functions and noise distributions. In §2.2, we detail SGD in Algorithm 2 and present the main convergence rate theorem.

2.1 Assumptions and Definitions

We first present the following smoothness assumption on the objective adopted in this paper:

Assumption 1 (Smoothness).

Let f be twice-differentiable with L-Lipschitz continuous gradients and ρ-Lipschitz continuous Hessians, i.e. for all x, y ∈ ℝ^d,

‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖,    (2.1)

and

‖∇²f(x) − ∇²f(y)‖ ≤ ρ ‖x − y‖.    (2.2)

With the Hessian-Lipschitz parameter ρ prescribed in (2.2), we formally define the (ϵ, √(ρϵ))-approximate second-order stationary point. To the best of our knowledge, this concept first appeared in Nesterov & Polyak (2006):

Definition 1 (Second-order Stationary Point).

We call x an (ϵ, √(ρϵ))-approximate second-order stationary point if ‖∇f(x)‖ ≤ ϵ and λ_min(∇²f(x)) ≥ −√(ρϵ).

Let the starting point of our SGD algorithm be x⁰. We make the following boundedness assumption:

Assumption 2 (Boundedness).

The initial optimality gap satisfies Δ := f(x⁰) − f* < ∞, where f* is the global infimum value of f.

Turning to the assumptions on noise, we first assume the following:

Assumption 3 (Bounded Noise).

For any x ∈ ℝ^d, the stochastic gradient satisfies, almost surely,

‖∇F(x; ζ) − ∇f(x)‖ ≤ σ.    (2.3)

An alternative (slightly weaker) assumption that also works is that the norm of the noise is sub-Gaussian, i.e. for any x ∈ ℝ^d and t ≥ 0,

P(‖∇F(x; ζ) − ∇f(x)‖ ≥ t) ≤ 2 exp(−t²/(2σ²)).    (2.4)

Assumptions 1, 2, and 3 are standard in the nonconvex optimization literature (Ge et al., 2015; Xu et al., 2018; Allen-Zhu & Li, 2018; Fang et al., 2018). We treat the parameters L, ρ, σ, and Δ as global constants, and focus on the dependency of the stochastic gradient complexity on ϵ and the dimension d.

For the purpose of fast escape from saddles, we need an extra assumption on the shape of the noise. Let ν be a positive real and let v be a unit vector. We define a set property as follows:

Definition 2 ((ν, v)-narrow property).

We say that a Borel set S ⊆ ℝ^d satisfies the (ν, v)-narrow property if for any x ∈ S and t ≥ ν, we have x + t·v ∈ S^c, where S^c denotes the complement of S.

It is easy to verify that the first parameter of the narrow property is linearly scalable and translation invariant with respect to sets, i.e. if S satisfies the (ν, v)-narrow property, then for any α > 0 and z ∈ ℝ^d, the set αS + z satisfies the (αν, v)-narrow property. Next, we introduce the dispersive property as follows:

Definition 3 ((δ, ν)-dispersive property).

Let ξ be a random vector satisfying Assumption 3. We say that ξ has the (δ, ν)-dispersive property if, for an arbitrary set S that satisfies the (ν, v)-narrow property for some unit vector v (as in Definition 2), the following holds:

P(ξ ∈ S) ≤ δ.    (2.5)

Obviously, if ξ satisfies the (δ, ν)-dispersive property, then for any fixed vector c, the shifted noise ξ + c also satisfies the (δ, ν)-dispersive property. We then present the dispersive noise assumption as follows:

Assumption 4 (Dispersive Noise).

For an arbitrary point x, the stochastic gradient noise ∇F(x; ζ) − ∇f(x) admits the (δ, ν)-dispersive property (as in Definition 3) with respect to every unit vector v.

Assumption 4 is motivated by the key lemma for escaping from saddle points in Jin et al. (2017), which obtains a sharp rate for gradient descent escaping from saddle points. Such an assumption enables SGD to move out of a stuck region with a constant probability in its first step, and thereby to escape from saddle points (by repeating a logarithmic number of rounds). We would like to emphasize that dispersive noises contain many canonical examples; see the following.

Examples of Dispersive Noises

Here we exemplify a few noise distributions that satisfy the (δ, ν)-dispersive property for appropriate parameters. We prove the following proposition:

Proposition 1.

For the following noise distributions, (2.5) in Definition 3 is satisfied:

  1. Gaussian noise: ξ = z, where z is a Gaussian vector with covariance matrix (σ²/d)·I_d;

  2. Uniform ball-shaped noise or spherical noise: ξ = σ·e, where e is uniformly sampled from the unit ball (or unit sphere) centered at the origin;

  3. Artificial noise injection: ξ = ξ₀ + e, where ξ₀ is any noise satisfying Assumption 3 and e is some independent artificial noise that is itself (δ, ν)-dispersive.
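As a sanity check on why isotropic noise is dispersive, the following Monte Carlo sketch (toy values of σ, ν, and d are our own choices) estimates the probability that a Gaussian noise vector lands in a slab of width ν orthogonal to a direction v; such a slab satisfies the (ν, v)-narrow property of Definition 2, and the estimated mass is small because the noise spreads along v at scale σ/√d ≫ ν.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, nu = 200, 1.0, 1e-3        # toy values (ours)
v = np.zeros(d); v[0] = 1.0          # unit direction of narrowness

# Gaussian noise with covariance (sigma^2/d) * I_d, so ||xi|| ~ sigma.
xi = rng.normal(scale=sigma / np.sqrt(d), size=(20_000, d))

# The slab {x : |<x, v>| <= nu/2} is (nu, v)-narrow: shifting any of its
# points by t >= nu along v always exits it.
p_hat = np.mean(np.abs(xi @ v) <= nu / 2)
print(f"P(xi in nu-slab) ~= {p_hat:.4f}")   # on the order of nu*sqrt(d)/sigma
```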

2.2 SGD and Main Theorem

Our SGD algorithm for analysis purposes is detailed in Algorithm 2. It differs from classical SGD algorithms only in the stopping criterion. Distinct from the classical versions that simply terminate after a fixed number of steps and output the final iterate or a randomly drawn iterate, the SGD we consider here introduces a ball-controlled mechanism as the stopping criterion: if x_k exits a small neighborhood B(x̃, B) within K₀ iterations (Line 2 to 6), one starts over and runs the next round of SGD; if exiting does not occur within K₀ iterations, the algorithm simply outputs an arithmetic average of the last iterates within the neighborhood, which in turn is an (ϵ, O(ϵ^0.5))-approximate second-order stationary point with high probability. In contrast with the stopping criterion in the deterministic setting, which checks the descent of function values (Jin et al., 2017), the function value in the stochastic setting is rather costly to approximate (estimating it to sufficient accuracy requires many stochastic gradient computations), and error plateaus might be hard to observe theoretically.

1:  Set k ← 0, x̃ ← x⁰    ▷ Anchor of the current round
2:  while k < K₀ do
3:     Draw an independent ζ^k and set x^{k+1} ← x^k − η ∇F(x^k; ζ^k)    ▷ SGD step
4:     k ← k + 1    ▷ Counter of SGD steps in the current round
5:     if ‖x^k − x̃‖ > B then
6:        x̃ ← x^k, k ← 0    ▷ Ball exited: start the next round
7:     end if
8:  end while
9:  x̄ ← arithmetic average of the last iterates inside B(x̃, B)    ▷ Reach this line within Õ(ϵ^-3.5) SGD steps w.h.p.
10:  return x̄    ▷ Return an (ϵ, O(ϵ^0.5))-approximate second-order stationary point
Algorithm 2 SGD (for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point): Input x⁰, η, B, K₀, and error probability δ
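A minimal Python sketch of the ball-controlled mechanism follows; the restart bookkeeping reflects our reading of Algorithm 2, and all inputs are placeholders rather than the tuned parameters of (2.6).

```python
import numpy as np

def sgd_ball_controlled(stoch_grad, x0, eta, B, K0, max_steps, seed=0):
    """Sketch of Algorithm 2: restart a round whenever the iterate leaves
    the B-ball around the round's anchor; once a round survives K0 steps
    inside the ball, return the average of its iterates."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float).copy()
    anchor, history, k = x.copy(), [x.copy()], 0
    for _ in range(max_steps):
        if k >= K0:                          # stayed inside for K0 steps
            break
        x = x - eta * stoch_grad(x, rng)     # SGD step with a fresh sample
        k += 1
        history.append(x.copy())
        if np.linalg.norm(x - anchor) > B:   # ball exited: next round
            anchor, history, k = x.copy(), [x.copy()], 0
    return np.mean(history, axis=0)          # candidate stationary point
```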

Parameter Setting

We set the hyper-parameters for Algorithm 2 as follows:²

η = Θ̃(ϵ^1.5),  B = Θ̃(ϵ^0.5),  K₀ = Θ̃(ϵ^-2),    (2.6)

where Θ̃ hides poly-logarithmic factors of d, 1/ϵ, and the error probability δ. For brevity of analysis, we treat L, ρ, σ, and Δ as O(1) quantities; in other words, we assume the accuracy ϵ is sufficiently small relative to these problem parameters.

² Because the setting in (2.6) involves logarithmic factors of d and ϵ, a simple practical choice is to tune the step size as a constant multiple of ϵ^1.5.

We are now ready to present our main theorem on SGD.

Theorem 1 (SGD Rate).

Let Assumptions 1, 2, 3, and 4 hold, let the parameters η, B, and K₀ be set as in (2.6) with δ ∈ (0, 1) being the error probability, and run Algorithm 2 for Õ(ϵ^-3.5) stochastic gradient steps. Then with probability at least 1 − δ, SGD outputs an x̄ satisfying

‖∇f(x̄)‖ ≤ ϵ  and  λ_min(∇²f(x̄)) ≥ −√(ρϵ).    (2.7)

Treating L, ρ, σ, and Δ as global constants, the stochastic gradient computational cost is Õ(ϵ^-3.5).

Strikingly, Theorem 1 indicates that SGD in Algorithm 2 achieves a stochastic gradient computational cost of Õ(ϵ^-3.5) to find an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. Compared with existing algorithms that achieve an Õ(ϵ^-3.5) convergence rate, SGD is comparatively the simplest to implement and does not invoke any additional techniques or iterations such as momentum acceleration (Jin et al., 2018b), cubic regularization (Tripuraneni et al., 2018), quadratic regularization (Allen-Zhu, 2018a), or Neon negative curvature search (Xu et al., 2018; Allen-Zhu & Li, 2018).

Admittedly, the best-known SGD theoretical guarantee in Theorem 1 relies on the dispersive noise assumption. To remove such an assumption, we argue that only in a few initial steps of each round does one need to run an SGD step with dispersive noise to enable efficient escaping. We propose a variant of SGD called Noise-Scheduled SGD, which requires artificial noise injection but does not rely on the dispersive noise assumption. The algorithm is shown in Algorithm 3 in the appendix, and one can obtain its convergence properties straightforwardly.

Remark 1.

For the function class that admits the strict-saddle property (Carmon et al., 2018; Ge et al., 2015; Jin et al., 2017), an approximate second-order stationary point is guaranteed to be an approximate local minimizer. For example, when optimizing a strict-saddle function, one can first find an approximate second-order stationary point with a sufficiently small ϵ, which is guaranteed to be an approximate local minimizer due to the strict-saddle property. Our SGD convergence rate is independent of the target accuracy of this subsequent local phase, where one can invoke standard convex optimization theory to obtain a convergence rate in terms of the optimality gap. Limited by space, we omit the details.

3 Proof Sketches

For convenience, when we study Algorithm 2, in each inner loop from Line 2 to Line 8 we override the definition of x⁰ as the round's initial vector. We briefly introduce our proof techniques in this section; the details of the rigorous proofs are deferred to the appendix. Our proof basically consists of two ingredients. The first is to prove that SGD can efficiently escape saddles: with high probability, if λ_min(∇²f(x⁰)) ≤ −√(ρϵ), the iterate x_k moves out of B(x⁰, B) within K₀ iterations. The second is to show that SGD converges at the faster rate of Õ(ϵ^-3.5), rather than O(ϵ^-4). We further separate the second goal into two parts:

  1. Throughout the execution of the algorithm, each time x_k moves out of B(x̃, B), with high probability the function value decreases by a magnitude of at least Ω̃(ϵ^1.5).

  2. Once x_k does not move out of B(x̃, B) for K₀ consecutive iterations, with high probability we have found a desired approximate second-order stationary point.

Let ℱ_k = σ(ζ⁰, …, ζ^{k−1}) be the filtration carrying the full information of the first k iterations, where σ(·) denotes the sigma-field. And let 𝒦 be the first time (mathematically, a stopping time) that x_k exits the B-neighborhood of x⁰, i.e.

𝒦 := min{k ≥ 0 : ‖x^k − x⁰‖ > B}.    (3.1)

Both x^{min(k, 𝒦)} and 1(𝒦 ≤ k) are measurable with respect to ℱ_k, where 1(·) denotes the indicator function.

3.1 Part I: Escaping Saddles

Our goal is to prove the following proposition:

Proposition 2.

Assume λ_min(∇²f(x⁰)) ≤ −√(ρϵ), and recall the parameters η, B, K₀ set in (2.6). Initialized at x⁰ and running Line 2 to Line 8 of Algorithm 2, with probability at least 1 − δ we have

𝒦 ≤ K₀,    (3.2)

where 𝒦 is the stopping time defined in (3.1).

Proposition 2 essentially says that if the function has a strongly negative Hessian eigenvalue at x⁰, then with high probability the iteration exits the B-neighborhood of x⁰ within K₀ steps.

To prove Proposition 2, we let y^k, k ≥ 0, be the iteration of SGD starting from a fixed y⁰ ∈ B(x⁰, B) and using the same stochastic samples as the iteration x^k, i.e.

y^{k+1} = y^k − η ∇F(y^k; ζ^k).    (3.3)

Obviously, we have y^k = x^k for all k when y⁰ = x⁰. Let 𝒦' be the first step number (a stopping time) at which y^k exits the B-neighborhood of x⁰. Formally,

𝒦' := min{k ≥ 0 : ‖y^k − x⁰‖ > B}.    (3.4)

It is easy to see from (3.1) that 𝒦' coincides with 𝒦 when y⁰ = x⁰. Inspired by Jin et al. (2017), we cope with the stochasticity of gradients and define the so-called bad initialization region 𝒳_stuck as the set of points, initialized from which, the iteration fails to exit the B-neighborhood of x⁰ within K₀ steps with nonnegligible probability:

𝒳_stuck := { y⁰ ∈ B(x⁰, B) : P(𝒦'(y⁰) > K₀) ≥ p₀ }  for a constant p₀ ∈ (0, 1).    (3.5)

We will show that the bad initialization region enjoys the (ν, v)-narrow property for an appropriate ν, with v an eigenvector corresponding to the least eigenvalue of ∇²f(x⁰). Since the first step provides a dispersive noise, as posited by Assumption 4, with a properly selected η it moves the iteration out of the bad initialization region in its first step with a constant probability. Repeating such an argument over a logarithmic number of rounds enables escaping to occur with high probability.

The idea is to prove the following lemma:

Lemma 1.

Let the assumptions of Proposition 2 hold, and let v (WLOG) be an arbitrary unit eigenvector of ∇²f(x⁰) corresponding to its smallest eigenvalue −γ, which satisfies γ ≥ √(ρϵ). Then for any fixed t ≥ ν and any pair of points y⁰, y'⁰ ∈ B(x⁰, B) with y'⁰ = y⁰ + t·v, we have

P( 𝒦'(y⁰) > K₀ and 𝒦'(y'⁰) > K₀ ) ≤ δ₀    (3.6)

for a constant δ₀ < 1; that is, with positive probability at least one of the two coupled iterations exits the B-neighborhood within K₀ steps.

Lemma 1 is inspired by Lemma 15 in Jin et al. (2017). Nevertheless, due to the noise brought in at each update step, the analysis for stochastic gradients differs from that for gradient descent in many aspects. For example, instead of showing a decrease in function value, we need to show that with a positive probability at least one of the two iterations, y^k or y'^k, exits the B-neighborhood of x⁰. Our proof is also more intuitive than that of Lemma 15 in Jin et al. (2017). The core idea is to focus on analyzing the difference trajectory w^k := y'^k − y^k of the pair, and to show that the rotation speed of the difference trajectory is of the same order as its expansion speed. The detailed proof is provided in §C.1.
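The expansion effect is easy to see numerically on a toy quadratic saddle. The sketch below (our own construction, not the paper's experiment) couples two SGD trajectories that share noise samples and start ν apart along the negative-curvature direction; on a quadratic, the shared noise cancels exactly in the difference, which then expands geometrically.

```python
import numpy as np

rng = np.random.default_rng(1)
H = np.diag([1.0, -0.1])            # toy Hessian: saddle with gamma = 0.1
eta, nu, steps = 0.1, 1e-6, 200
v = np.array([0.0, 1.0])            # eigenvector of the negative eigenvalue

y1 = np.zeros(2)
y2 = y1 + nu * v                    # coupled starts, nu apart along v
for _ in range(steps):
    xi = rng.normal(scale=0.1, size=2)   # the SAME noise for both copies
    y1 = y1 - eta * (H @ y1 + xi)
    y2 = y2 - eta * (H @ y2 + xi)

# Shared noise cancels in the difference, so w_{k+1} = (I - eta*H) w_k and
# its component along v expands by (1 + eta*gamma) per step.
w = y2 - y1
print(np.linalg.norm(w), nu * (1 + eta * 0.1) ** steps)  # both ~7.3e-6
```

Since any ball has finite radius, this geometric expansion forces at least one of the two coupled trajectories to eventually leave the B-neighborhood.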

3.2 Part II: Faster Descent

The goal of Part II is to prove the following proposition:

Proposition 3 (Faster Descent).

For Algorithm 2 with parameters set as in (2.6), with probability at least 1 − δ, if x_k moves out of B(x̃, B) within K₀ iterations, we have

f(x̃) − f(x^𝒦) ≥ Ω̃(ϵ^1.5).    (3.7)

Proposition 3 is the key for SGD to achieve the reduced stochastic computational cost. It shows that no matter what the local surface of f looks like, once x_k moves out of the ball B(x̃, B) within K₀ iterations, the function value decreases by a magnitude of at least Ω̃(ϵ^1.5). Put differently, on average the function value decreases by at least Ω̃(ϵ^1.5/K₀) per iteration during the execution of Algorithm 2. We present the basic argument below.

We start by reviewing the traditional approach for proving sufficient descent of SGD, and then discuss how this work improves upon it. Previous approaches are all based on the idea of Nesterov (2004), which mainly takes advantage of the gradient-smoothness condition of the objective. The proof can be briefly described as follows:

E[f(x^{k+1}) | ℱ_k] ≤ f(x^k) − η ‖∇f(x^k)‖² + (Lη²/2) E[‖∇F(x^k; ζ^k)‖² | ℱ_k] ≤ f(x^k) − η ‖∇f(x^k)‖² + (Lη²/2) (‖∇f(x^k)‖² + σ²).    (3.8)

From the above derivation, in order to guarantee monotone descent of the function value in expectation, the step size needs to satisfy

η = O( ‖∇f(x^k)‖² / (Lσ²) ) = O( ϵ² / (Lσ²) ),    (3.9)

where the last equality uses ‖∇f(x^k)‖ = Θ(ϵ). Plugging (3.9) into (3.8) and using ‖∇f(x^k)‖ ≥ ϵ, we have that the function value descends per iteration by a magnitude of at least Ω(ϵ⁴/(Lσ²)). This indicates that, in the worst case, SGD takes O(ϵ^-4) stochastic oracle calls to find an ϵ-approximate first-order stationary point. This simple argument is the reason why previous works conjectured that the complexity of SGD is O(ϵ^-4).
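As a quick numerical sanity check of (3.8) (on a toy quadratic of our choosing, not the paper's setting), one can compare the Monte Carlo estimate of the expected one-step decrease with the right-hand side bound:

```python
import numpy as np

rng = np.random.default_rng(2)
L_, sigma, eta, d = 1.0, 1.0, 0.01, 10
x = np.ones(d)
g = L_ * x                                   # exact gradient of 0.5*L*||x||^2

# Monte Carlo estimate of E[f(x) - f(x - eta*(g + xi))] with E||xi||^2 = sigma^2.
xi = rng.normal(scale=sigma / np.sqrt(d), size=(200_000, d))
steps = x - eta * (g + xi)
realized = 0.5 * L_ * x @ x - np.mean(0.5 * L_ * np.sum(steps**2, axis=1))

# Right-hand side of (3.8): eta*||g||^2 - (L*eta^2/2)*(||g||^2 + sigma^2).
bound = eta * g @ g - 0.5 * L_ * eta**2 * (g @ g + sigma**2)
print(f"realized decrease {realized:.5f} vs bound {bound:.5f}")
# For a quadratic, (3.8) holds with equality in expectation, so the two match.
```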

However, in this paper we show that the above analysis can be further improved by using the Hessian-smoothness condition of the objective: we decompose the objective function locally into two components and treat them separately as follows (see the numerical sketch after this list):

  • (Case 1) One component is nearly convex locally, in the sense that all eigenvalues of its Hessian are above a mildly negative threshold on B(x̃, B). In this case, by using techniques for nearly convex problems, it is possible for us to take a larger step size and prove a faster convergence rate.

  • (Case 2) The other component is nearly concave locally, in the sense that all eigenvalues of its Hessian lie below that threshold. In this case, it can be shown that the last term on the right-hand side of (3.8) shrinks considerably, so the step size can again be chosen larger than (3.9), leading to fast function value reduction.
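The split underlying the two cases can be made concrete with a few lines of numpy (the toy Hessian and the threshold value are our illustrative assumptions): eigenvectors of the Hessian at the anchor point are partitioned by a threshold, and the resulting projections decompose the quadratic form exactly.

```python
import numpy as np

rho, eps = 1.0, 1e-2
H = np.diag([2.0, 0.05, -0.5])        # toy Hessian at the anchor point
thresh = -np.sqrt(rho * eps)          # illustrative split threshold

lam, V = np.linalg.eigh(H)            # eigenvalues in ascending order
plus = lam > thresh                   # nearly convex directions
P_plus = V[:, plus] @ V[:, plus].T    # projection onto the subspace S
P_minus = np.eye(len(lam)) - P_plus   # projection onto the complement

# The projections commute with H, so the quadratic form splits exactly:
x = np.array([1.0, -2.0, 0.5])
lhs = x @ H @ x
rhs = (P_plus @ x) @ H @ (P_plus @ x) + (P_minus @ x) @ H @ (P_minus @ x)
print(np.isclose(lhs, rhs))           # True
```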

To formalize the above observations into a rigorous proof, we introduce the quadratic approximation of f at the point x⁰, defined as

f̃(x) := f(x⁰) + ⟨∇f(x⁰), x − x⁰⟩ + ½ (x − x⁰)ᵀ ∇²f(x⁰) (x − x⁰).    (3.10)

We let 𝕊 be the subspace spanned by all eigenvectors of ∇²f(x⁰) whose eigenvalues are greater than a threshold of order −√(ρϵ), and let 𝕊^⊥ denote the complement subspace. Let P₊ and P₋ be the projection matrices onto 𝕊 and 𝕊^⊥, respectively, and let the full eigendecomposition of ∇²f(x⁰) be UΛUᵀ, with U₊ and U₋ denoting the columns of U spanning 𝕊 and 𝕊^⊥ respectively. We define the following two auxiliary functions f̃₁ and f̃₂:

f̃₁(x) := ⟨P₊ ∇f(x⁰), x − x⁰⟩ + ½ (x − x⁰)ᵀ P₊ ∇²f(x⁰) P₊ (x − x⁰),    (3.11)

and

f̃₂(x) := ⟨P₋ ∇f(x⁰), x − x⁰⟩ + ½ (x − x⁰)ᵀ P₋ ∇²f(x⁰) P₋ (x − x⁰).    (3.12)

For the previously mentioned decomposition, one may simply take these as the nearly convex and nearly concave components. It can be checked that f̃(x) = f(x⁰) + f̃₁(x) + f̃₂(x), since the cross terms vanish. It follows that we only need to separately analyze the two quadratic functions f̃₁ and f̃₂. We then bound the difference between f and f̃ as |f(x) − f̃(x)| ≤ (ρ/6) ‖x − x⁰‖³, using the Hessian-Lipschitz condition.

The analysis for f̃₂ proceeds via the standard argument informally described above in Case 2 (see Lemma 7).

Our proof technique for dealing with f̃₁ is to introduce an auxiliary trajectory {z^k} with the following deterministic updates for k ≥ 0:

z^{k+1} = z^k − η ∇f̃₁(z^k),    (3.13)

initialized at z⁰ = x⁰. We then track and analyze the difference trajectory between z^k and x^k (see Lemma 6). Since z^k simply performs gradient descent on the nearly convex quadratic f̃₁, we can arrive at our final results for f̃₁ (see Lemma 5), which leads to a rigorous statement of Case 1.

Finally, via the fact that x_k moves out of the ball B(x̃, B) within K₀ iterations throughout the execution of Algorithm 2, we prove that with high probability the sum of the squared gradient norms along the trajectory can be lower bounded as in

(3.14)

which ensures sufficient descent of the function value. Putting the above arguments together, we obtain Proposition 3.

3.3 Part III: Finding a Second-Order Stationary Point

Part III proves the following proposition:

Proposition 4.

With probability at least 1 − δ, if x_k has not moved out of the ball B(x̃, B) within K₀ iterations, then letting x̄ be the arithmetic average returned in Line 9 of Algorithm 2, we have

‖∇f(x̄)‖ ≤ ϵ  and  λ_min(∇²f(x̄)) ≥ −√(ρϵ).    (3.15)

Proposition 4 can be obtained via the same idea as Part II: we first study the quadratic approximation f̃ and then bound the difference between f and f̃.

Finally, integrating Propositions 2, 3, and 4, and using the boundedness of the function value from Assumption 2, we conclude that with probability at least 1 − δ, Algorithm 2 stops within Õ(ϵ^-3.5) steps and outputs an approximate second-order stationary point satisfying (2.7), which immediately yields Theorem 1.
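To see how the pieces combine quantitatively, here is a back-of-the-envelope accounting under the parameter scalings assumed in (2.6), with problem constants suppressed: each round takes at most K₀ = Θ̃(ϵ^-2) steps, and by Proposition 3 each round that exits the ball decreases f by at least Ω̃(ϵ^1.5), so

#rounds ≤ Δ / Ω̃(ϵ^1.5) = Õ(ϵ^-1.5),  total steps ≤ K₀ · #rounds = Θ̃(ϵ^-2) · Õ(ϵ^-1.5) = Õ(ϵ^-3.5),

matching the cost claimed in Theorem 1.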

4 Discussions on Related Works

Due to the recent surge of deep learning, many researchers in the machine learning community have studied the nonconvex SGD method from various perspectives. We compare our results with concurrent theoretical works on nonconvex SGD in the following discussions. For clarity, we also compare the convergence rates of the works most related to ours in Table 1.

  1. Pioneer SGD: The first work on SGD escaping from saddle points (Ge et al., 2015) obtained a stochastic gradient computational cost of O(poly(d)·ϵ^-4).³ Later, Jin et al. (2017, 2018b) noise-perturbed GD and AGD and achieved sharp gradient computational costs, which suggests the possibility of a sharper SGD rate for escaping saddles. Our analysis in this work is partially motivated by Jin et al. (2017) for escaping from saddle points, but it generalizes the noise condition and needs no deliberate noise injection, which would not be the original GD/SGD algorithm in a strict sense.

³ The analysis in (Ge et al., 2015) indicates a factor of at least d⁴. We are aware that the group of authors of Ge et al. (2015) claimed to have a different proof obtaining a sharper stochastic computational cost using the technique of Jin et al. (2017), but we have not found it online up to the initial release of this work.

  2. Concurrent SGD: A recent result by Daneshmand et al. (2018) finds an approximate second-order stationary point at a polynomial stochastic computational cost. The highlight of their work is that they need no injection of artificial noise. Nevertheless, in their work the Correlated Negative Curvature parameter cannot be treated as an Ω(1) constant: in the case of injected spherical or Gaussian noise, it can be at most linearly dependent on 1/d (their Assumption 4), so their result is again not (almost) dimension-free, and their worst-case convergence rate carries additional polynomial factors of d.

  3. NC search + SGD: The Neon+SGD methods (Xu et al., 2018; Allen-Zhu & Li, 2018) achieve an almost dimension-free convergence rate of Õ(ϵ^-4) for the general problem of form (1.1) to reach an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. Prior to this, classical nonconvex SGD only achieves such an O(ϵ^-4) rate for finding an ϵ-approximate first-order stationary point (Nesterov, 2004); with the help of the Neon method, SGD successfully escapes from saddles via a Negative Curvature (NC) search iteration.

  4. Regularization + SGD: Very recently, Allen-Zhu (2018a) took a quadratic regularization approach and equipped it with the negative-curvature search iteration Neon2 (Allen-Zhu & Li, 2018), successfully improving the rate to Õ(ϵ^-3.5). In comparison, our method achieves essentially the same rate without using regularization methods. Tripuraneni et al. (2018) proposed a stochastic variant of the cubic regularization method (Nesterov & Polyak, 2006; Agarwal et al., 2017) and achieved the same Õ(ϵ^-3.5) convergence rate, the first to achieve such a rate without invoking variance-reduced gradient techniques.⁴

⁴ Note that in the convergence rate here we also include the number of stochastic Hessian-vector product evaluations, each of which takes about the same magnitude of time as a stochastic gradient evaluation.

  5. NC search + VR: Allen-Zhu (2018b) converted an NC search method to the online stochastic setting (Carmon et al., 2018) and achieved a convergence rate of Õ(ϵ^-3.5) for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point. For finding a relaxed (ϵ, O(ϵ^0.25))-approximate second-order stationary point, Allen-Zhu (2018b) obtained a lower stochastic gradient computational cost of Õ(ϵ^-3.25). With recently proposed optimal variance-reduced gradient techniques applied, Spider achieves the state-of-the-art stochastic gradient computational cost of Õ(ϵ^-3) (Fang et al., 2018).⁵

⁵ The independent work Zhou et al. (2018a) achieves a similar convergence rate for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point by imposing a third-order smoothness condition on the objective.

Algorithm | SG Comp. Cost
SGD Variants:
Neon+SGD (Xu et al., 2018) | Õ(ϵ^-4)
Neon2+SGD (Allen-Zhu & Li, 2018) | Õ(ϵ^-4)
Stochastic Cubic (Tripuraneni et al., 2018) | Õ(ϵ^-3.5)
RSGD5 (Allen-Zhu, 2018a) | Õ(ϵ^-3.5)
Natasha2 (Allen-Zhu, 2018b) | Õ(ϵ^-3.5)†
Neon2+SNVRG (Zhou et al., 2018a) | Õ(ϵ^-3.5)‡
Spider-SFO+ (Fang et al., 2018) | Õ(ϵ^-3)*
Original SGD:
SGD (Ge et al., 2015) | O(poly(d)·ϵ^-4)
SGD (Daneshmand et al., 2018) | O(poly(d, ϵ^-1))
SGD (this work) | Õ(ϵ^-3.5)

Table 1: Comparable results on the stochastic gradient computational cost of nonconvex optimization algorithms for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point of problem (1.1) under standard assumptions. Each stochastic gradient computational cost may hide poly-logarithmic factors of d, ϵ, and δ.
*: Spider is the only existing variant stochastic algorithm that achieves a provably faster rate, by a polynomial order in ϵ, than simple SGD.
†: Allen-Zhu (2018b) also obtains a stochastic gradient computational cost of Õ(ϵ^-3.25) for finding a relaxed (ϵ, O(ϵ^0.25))-approximate second-order stationary point.
‡: With additional third-order smoothness assumptions, SNVRG (Zhou et al., 2018a) achieves a stochastic gradient cost of Õ(ϵ^-3).

4.1 More Related Works

VR Methods

In the recent two years, sharper convergence rates for nonconvex stochastic optimization have been achieved using variance-reduced gradient techniques (Schmidt et al., 2017; Johnson & Zhang, 2013; Xiao & Zhang, 2014; Defazio et al., 2014). SVRG/SCSG (Lei et al., 2017) adopts the technique from Johnson & Zhang (2013), novelly introduces a random stopping criterion for its inner loops, and achieves a stochastic gradient cost of O(ϵ^-10/3) for finding an ϵ-approximate first-order stationary point. Very recently, two independent works, namely Spider (Fang et al., 2018) and SNVRG (Zhou et al., 2018b), designed sharper variance-reduced gradient methods and obtained a stochastic gradient computational cost of O(ϵ^-3), which is state-of-the-art and near-optimal in the sense that it matches the algorithmic lower bound in the finite-sum setting.

Escaping Saddles in Single-Function Case

Recently, many theoretical works have addressed convergence to an approximate second-order stationary point, or escaping from saddles, for the case of a single deterministic function (Carmon & Duchi, 2016; Jin et al., 2017; Carmon et al., 2018, 2017; Agarwal et al., 2017; Jin et al., 2018b; Lee et al., 2017; Du et al., 2017). Among them, Jin et al. (2017) proposed a ball-shaped-noise-perturbed variant of gradient descent which can efficiently escape saddle points and achieves a sharp gradient computational cost of Õ(ϵ^-2), a rate also achieved by Neon+GD (Xu et al., 2018; Allen-Zhu & Li, 2018). Another line of works applies momentum acceleration techniques (Agarwal et al., 2017; Carmon et al., 2017; Jin et al., 2018b) and achieves a rate of Õ(ϵ^-1.75) for the general optimization problem.

Escaping Saddles in Finite-Sum Case

For the finite-sum setting, many works have applied variance-reduced gradient methods (Agarwal et al., 2017; Carmon et al., 2018; Fang et al., 2018; Zhou et al., 2018a) to further reduce the stochastic gradient computational cost (Agarwal et al., 2017; Allen-Zhu & Li, 2018). Reddi et al. (2018) proposed a simpler algorithm that obtains a comparable stochastic gradient cost. With recursive gradient methods applied (Fang et al., 2018; Zhou et al., 2018a), the stochastic gradient cost is further reduced, which is the current state of the art.

Miscellaneous

It is well-known that for a general nonconvex optimization problem of the form (1.1), finding an approximate global minimizer is NP-hard in the worst case (Hillar & Lim, 2013). In light of this, many works turn to studying convergence properties for specific models. Faster convergence to local or even global minimizers can be guaranteed for many statistical learning tasks such as principal component analysis (Li et al., 2018a; Jain & Kar, 2017), matrix completion (Jain et al., 2013; Ge et al., 2016; Sun & Luo, 2016), dictionary learning (Sun et al., 2015, 2017), as well as linear and nonlinear neural networks (Zhong et al., 2017; Li & Yuan, 2017; Li et al., 2018b).

In retrospect, our focus in this paper is on escaping from saddles, and we refer the readers to recent inspiring works studying how to escape from local minimizers, for instance, Zhang et al. (2017); Jin et al. (2018a).

5 Conclusions and Future Direction

In this paper, we presented a sharp convergence analysis of the classical SGD algorithm. We showed that, equipped with a ball-controlled stopping criterion, SGD achieves a stochastic gradient computational cost of Õ(ϵ^-3.5) for finding an (ϵ, O(ϵ^0.5))-approximate second-order stationary point, which improves over the best-known SGD convergence rate prior to our work. While this work focuses on a sharpened convergence rate, some important questions remain:

  1. It is still unknown whether SGD achieves a rate faster than Õ(ϵ^-3.5), or whether Õ(ϵ^-3.5) is exactly the lower bound for SGD on the general problem of form (1.1). As we mentioned in §1, it is our conjecture that variance reduction methods are necessary to achieve an (ϵ, O(ϵ^0.5))-approximate second-order stationary point in fewer than O(ϵ^-3.5) steps.

  2. We have not considered several important extensions in this work, such as the convergence rate of SGD for constrained optimization problems, and how to extend the analysis in this paper to the proximal case.

  3. It would also be interesting to study the stochastic version of Nesterov's accelerated gradient descent (AGD) (Jin et al., 2018b).

Acknowledgement

The authors would like to greatly thank Chris Junchi Li for providing us with a proof that SGD escapes saddle points at the claimed computational cost and for carefully revising our paper. The authors would also like to thank Haishan Ye for very helpful discussions, and Huan Li, Zebang Shen, and Li Shen for very helpful comments.

References

  • Abadi et al. (2016) Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., Kudlur, M., Levenberg, J., Monga, R., Moore, S., Murray, D. G., Steiner, B., Tucker, P., Vasudevan, V., Warden, P., Wicke, M., Yu, Y., & Zheng, X. (2016). Tensorflow: A system for large-scale machine learning. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (pp. 265–283).
  • Agarwal et al. (2009) Agarwal, A., Wainwright, M. J., Bartlett, P. L., & Ravikumar, P. K. (2009). Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems (pp. 1–9).
  • Agarwal et al. (2017) Agarwal, N., Allen-Zhu, Z., Bullins, B., Hazan, E., & Ma, T. (2017). Finding approximate local minima faster than gradient descent. In Proceedings of the 49th Annual ACM SIGACT Symposium on Theory of Computing (pp. 1195–1199). ACM.
  • Allen-Zhu (2018a) Allen-Zhu, Z. (2018a). How to make the gradients small stochastically: Even faster convex and nonconvex sgd. In Advances in Neural Information Processing Systems (pp. 1165–1175).
  • Allen-Zhu (2018b) Allen-Zhu, Z. (2018b). Natasha 2: Faster non-convex optimization than sgd. In Advances in Neural Information Processing Systems (pp. 2676–2687).
  • Allen-Zhu & Li (2018) Allen-Zhu, Z. & Li, Y. (2018). Neon2: Finding local minima via first-order oracles. In Advances in Neural Information Processing Systems (pp. 3720–3730).
  • Bartlett et al. (2008) Bartlett, P. L., Dani, V., Hayes, T. P., Kakade, S. M., Rakhlin, A., & Tewari, A. (2008). High-probability regret bounds for bandit online linear optimization. In Proceedings of the 31st Conference On Learning Theory.
  • Bottou & Bousquet (2008) Bottou, L. & Bousquet, O. (2008). The tradeoffs of large scale learning. In Advances in Neural Information Processing Systems (pp. 161–168).
  • Candès & Recht (2009) Candès, E. J. & Recht, B. (2009). Exact matrix completion via convex optimization. Foundations of Computational mathematics, 9(6), 717.
  • Carmon & Duchi (2016) Carmon, Y. & Duchi, J. C. (2016). Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547.
  • Carmon et al. (2017) Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2017). “Convex Until Proven Guilty”: Dimension-free acceleration of gradient descent on non-convex functions. In International Conference on Machine Learning (pp. 654–663).
  • Carmon et al. (2018) Carmon, Y., Duchi, J. C., Hinder, O., & Sidford, A. (2018). Accelerated methods for nonconvex optimization. SIAM Journal on Optimization, 28(2), 1751–1772.
  • Daneshmand et al. (2018) Daneshmand, H., Kohler, J., Lucchi, A., & Hofmann, T. (2018). Escaping saddles with stochastic gradients. In International Conference on Machine Learning (pp. 1155–1164).
  • Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., & Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems (pp. 2933–2941).
  • Defazio et al. (2014) Defazio, A., Bach, F., & Lacoste-Julien, S. (2014). SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (pp. 1646–1654).
  • Du et al. (2017) Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., & Poczos, B. (2017). Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems (pp. 1067–1077).
  • Fang et al. (2018) Fang, C., Li, C. J., Lin, Z., & Zhang, T. (2018). Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. In Advances in Neural Information Processing Systems (pp. 686–696).
  • Freedman (1975) Freedman, D. A. (1975). On tail probabilities for martingales. Annals of Probability, 3(1), 100–118.
  • Ge et al. (2015) Ge, R., Huang, F., Jin, C., & Yuan, Y. (2015). Escaping from saddle points – online stochastic gradient for tensor decomposition. In Proceedings of The 28th Conference on Learning Theory (pp. 797–842).
  • Ge et al. (2016) Ge, R., Lee, J. D., & Ma, T. (2016). Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems (pp. 2973–2981).
  • Hillar & Lim (2013) Hillar, C. J. & Lim, L.-H. (2013). Most tensor problems are np-hard. Journal of the ACM (JACM), 60(6), 45.
  • Hinton & Salakhutdinov (2006) Hinton, G. E. & Salakhutdinov, R. R. (2006). Reducing the dimensionality of data with neural networks. science, 313(5786), 504–507.
  • Jain & Kar (2017) Jain, P. & Kar, P. (2017). Non-convex optimization for machine learning. Foundations and Trends® in Machine Learning, 10(3-4), 142–336.
  • Jain et al. (2013) Jain, P., Netrapalli, P., & Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of computing (pp. 665–674).: ACM.
  • Jin et al. (2017) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., & Jordan, M. I. (2017). How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning (pp. 1724–1732).
  • Jin et al. (2018a) Jin, C., Liu, L. T., Ge, R., & Jordan, M. I. (2018a). On the local minima of the empirical risk. In Advances in Neural Information Processing Systems (pp. 4901–4910).
  • Jin et al. (2018b) Jin, C., Netrapalli, P., & Jordan, M. I. (2018b). Accelerated gradient descent escapes saddle points faster than gradient descent. In Proceedings of the 31st Conference On Learning Theory (pp. 1042–1085).
  • Johnson & Zhang (2013) Johnson, R. & Zhang, T. (2013). Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems (pp. 315–323).
  • Jolliffe (2011) Jolliffe, I. (2011). Principal component analysis. In International encyclopedia of statistical science (pp. 1094–1096). Springer.
  • Kallenberg & Sztencel (1991) Kallenberg, O. & Sztencel, R. (1991). Some dimension-free features of vector-valued martingales. Probability Theory and Related Fields, 88(2), 215–247.
  • Lee et al. (2017) Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., & Recht, B. (2017). First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406.
  • Lei et al. (2017) Lei, L., Ju, C., Chen, J., & Jordan, M. I. (2017). Non-convex finite-sum optimization via scsg methods. In Advances in Neural Information Processing Systems (pp. 2345–2355).
  • Li et al. (2018a) Li, C. J., Wang, M., Liu, H., & Zhang, T. (2018a). Near-optimal stochastic approximation for online principal component estimation. Mathematical Programming, 167(1), 75–97.
  • Li et al. (2018b) Li, Y., Ma, T., & Zhang, H. (2018b). Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory (pp. 2–47).
  • Li & Yuan (2017) Li, Y. & Yuan, Y. (2017). Convergence analysis of two-layer neural networks with relu activation. In Advances in Neural Information Processing Systems (pp. 597–607).
  • Nesterov (2004) Nesterov, Y. (2004). Introductory lectures on convex optimization: A basic course, volume 87. Springer.
  • Nesterov & Polyak (2006) Nesterov, Y. & Polyak, B. T. (2006). Cubic regularization of newton method and its global performance. Mathematical Programming, 108(1), 177–205.
  • Pinelis (1994) Pinelis, I. (1994). Optimum bounds for the distributions of martingales in banach spaces. The Annals of Probability, (pp. 1679–1706).
  • Rakhlin et al. (2012) Rakhlin, A., Shamir, O., & Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In Proceedings of the 29th International Conference on Machine Learning (pp. 449–456).
  • Reddi et al. (2018) Reddi, S., Zaheer, M., Sra, S., Poczos, B., Bach, F., Salakhutdinov, R., & Smola, A. (2018). A generic approach for escaping saddle points. In Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics (pp. 1233–1242).
  • Reynolds et al. (2000) Reynolds, D. A., Quatieri, T. F., & Dunn, R. B. (2000). Speaker verification using adapted gaussian mixture models. Digital signal processing, 10(1-3), 19–41.
  • Robbins & Monro (1951) Robbins, H. & Monro, S. (1951). A stochastic approximation method. The annals of mathematical statistics, (pp. 400–407).
  • Schmidt et al. (2017) Schmidt, M., Le Roux, N., & Bach, F. (2017). Minimizing finite sums with the stochastic average gradient. Mathematical Programming, 162(1-2), 83–112.
  • Sun et al. (2015) Sun, J., Qu, Q., & Wright, J. (2015). When are nonconvex problems not scary? arXiv preprint arXiv:1510.06096.
  • Sun et al. (2017) Sun, J., Qu, Q., & Wright, J. (2017). Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2), 853–884.
  • Sun & Luo (2016) Sun, R. & Luo, Z.-Q. (2016). Guaranteed matrix completion via non-convex factorization. IEEE Transactions on Information Theory, 62(11), 6535–6579.
  • Tripuraneni et al. (2018) Tripuraneni, N., Stern, M., Jin, C., Regier, J., & Jordan, M. I. (2018). Stochastic cubic regularization for fast nonconvex optimization. In Advances in Neural Information Processing Systems (pp. 2904–2913).
  • Xiao & Zhang (2014) Xiao, L. & Zhang, T. (2014). A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2057–2075.
  • Xu et al. (2018) Xu, Y., Rong, J., & Yang, T. (2018). First-order stochastic algorithms for escaping from saddle points in almost linear time. In Advances in Neural Information Processing Systems (pp. 5531–5541).
  • Zhang (2005) Zhang, T. (2005). Learning bounds for kernel regression using effective data dimensionality. Neural Computation, 17(9), 2077–2098.
  • Zhang et al. (2017) Zhang, Y., Liang, P., & Charikar, M. (2017). A hitting time analysis of stochastic gradient langevin dynamics. In Conference on Learning Theory.
  • Zhong et al. (2017) Zhong, K., Song, Z., Jain, P., Bartlett, P. L., & Dhillon, I. S. (2017). Recovery guarantees for one-hidden-layer neural networks. arXiv preprint arXiv:1706.03175.
  • Zhou et al. (2018a) Zhou, D., Xu, P., & Gu, Q. (2018a). Finding local minima via stochastic nested variance reduction. arXiv preprint arXiv:1806.08782.
  • Zhou et al. (2018b) Zhou, D., Xu, P., & Gu, Q. (2018b). Stochastic nested variance reduced gradient descent for nonconvex optimization. In Advances in Neural Information Processing Systems (pp. 3922–3933).

Appendix A Proof of Proposition 1 and Algorithm 3

Proof of Proposition 1.
  1. Recall the multivariate Gaussian noise ξ = z with z ~ N(0, (σ²/d)·I_d). We show that it satisfies (2.5); clearly, it satisfies (2.4).

    Let v be an arbitrary unit vector; due to symmetry, below we assume WLOG v = e₁. Recall we have a set S satisfying the (ν, e₁)-narrow property in Definition 2. Consider any line parallel to e₁.

    The intersection of S with such a line is contained in an interval of length ν and hence has Lebesgue measure at most ν. This is because, for any given point of the line belonging to S, we may pick t₀ to be the infimum of all t such that the point shifted by t·e₁ lies in S; it is then easy to conclude from the narrow property that the shifted point is outside S for any t ≥ t₀ + ν.

    Therefore, for any S admitting the (ν, e₁)-narrow property and any fixed values of the last d − 1 coordinates, the conditional probability that ξ lies in S is bounded by ν times the maximal density of the first coordinate, since the corresponding slice of S is of Lebesgue measure at most ν. Taking expectation over the remaining coordinates gives

    P(ξ ∈ S) ≤ O(ν √d / σ),

    and we complete the proof that ξ is (O(ν√d/σ), ν)-dispersive for any ν > 0.

  2. Recall next the uniform ball-shaped noise ξ = σ·e, where e is uniformly sampled from B(0, 1), the unit ball centered at the origin. We prove that (2.5) holds in this case as well. Assume once again that v = e₁ by symmetry. Using classical results in multivariate calculus (or see Jin et al. (2017)) together with the (ν, e₁)-narrow property of the set S in Definition 2, we have