 # Escaping From Saddle Points --- Online Stochastic Gradient for Tensor Decomposition

We analyze stochastic gradient descent for optimizing non-convex functions. For many non-convex functions, the goal is to find a reasonable local minimum, and the main concern is that gradient updates get trapped at saddle points. In this paper we identify a strict saddle property for non-convex problems that allows for efficient optimization. Using this property we show that stochastic gradient descent converges to a local minimum in a polynomial number of iterations. To the best of our knowledge this is the first work that gives global convergence guarantees for stochastic gradient descent on non-convex functions with exponentially many local minima and saddle points. Our analysis can be applied to orthogonal tensor decomposition, which is widely used in learning a rich class of latent variable models. We propose a new optimization formulation for the tensor decomposition problem that has the strict saddle property. As a result we obtain the first online algorithm for orthogonal tensor decomposition with global convergence guarantee.


## 1 Introduction

Stochastic gradient descent is one of the basic algorithms in optimization. It is often used to solve the following stochastic optimization problem

$$w = \operatorname*{arg\,min}_{w\in\mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{x\sim\mathcal{D}}[\phi(w,x)] \tag{1}$$

Here $x$ is a data point drawn from some unknown distribution $\mathcal{D}$, and $\phi(w,x)$ is a loss function defined for a pair $(w,x)$. We hope to minimize the expected loss $\mathbb{E}_{x\sim\mathcal{D}}[\phi(w,x)]$.

When the function $f(w)$ is convex, the convergence of stochastic gradient descent is well understood (Rakhlin et al., 2012; Shalev-Shwartz et al., 2009). However, stochastic gradient descent is not limited to convex functions. In particular, in the context of neural networks, stochastic gradient descent is known as the "backpropagation" algorithm (Rumelhart et al., 1988), and has been the main algorithm underlying the success of deep learning (Bengio, 2009). However, the guarantees in the convex setting do not transfer to the non-convex setting.

Optimizing a non-convex function is NP-hard in general. The difficulty comes from two aspects. First, a non-convex function may have many local minima, and it might be hard to find the best one (global minimum) among them. Second, even finding a local minimum might be hard, as there can be many saddle points, which have zero gradient but are not local minima (see Section 3 for the definition of saddle points). In the most general case, there is no known algorithm that is guaranteed to find a local minimum in a polynomial number of steps. The discrete analog (finding a local minimum in domains like $\{0,1\}^d$) has been studied in complexity theory and is PLS-complete (Johnson et al., 1988).

In many cases, especially those related to deep neural networks (Dauphin et al., 2014; Choromanska et al., 2014), the main bottleneck in optimization is not due to local minima, but to the existence of many saddle points. Gradient-based algorithms are particularly susceptible to the saddle point problem as they rely only on gradient information. The saddle point problem is alleviated for second-order methods that also use Hessian information (Dauphin et al., 2014).

However, using Hessian information usually increases the memory requirement and the computation time per iteration. As a result, many applications still use stochastic gradient descent and empirically get reasonable results. In this paper we investigate why stochastic gradient methods can be effective even in the presence of saddle points; in particular we answer the following question:

Question: Given a non-convex function $f$ with many saddle points, what properties of $f$ will guarantee that stochastic gradient descent converges to a local minimum efficiently?

We identify a property of non-convex functions which we call strict saddle. Intuitively, this property guarantees local progress if we have access to Hessian information. Surprisingly, we show that with only first-order (gradient) information, stochastic gradient descent can escape saddle points efficiently. We give a framework for analyzing stochastic gradient in both the unconstrained and the equality-constrained case using this property.

We apply our framework to orthogonal tensor decomposition, which is a core problem in learning many latent variable models (see the discussion in Section 2.2). The tensor decomposition problem is inherently susceptible to saddle point issues, as the problem asks to find $d$ different components and any permutation of the true components yields a valid solution. Such symmetry creates exponentially many local minima and saddle points in the optimization problem. Using our new analysis of stochastic gradient, we give the first online algorithm for orthogonal tensor decomposition with global convergence guarantee. This is a key step towards making tensor decomposition algorithms more scalable.

### 1.1 Summary of Results

Given a function $f$ that is twice differentiable, we call $w$ a stationary point if $\nabla f(w) = 0$. A stationary point can be a local minimum, a local maximum or a saddle point. We identify an interesting class of non-convex functions which we call strict saddle. For these functions the Hessian of every saddle point has a negative eigenvalue. In particular, this means that local second-order algorithms similar to the ones in (Dauphin et al., 2014) can always make some progress.

It may seem counter-intuitive why stochastic gradient can work in these cases: in particular if we run the basic gradient descent starting from a stationary point then it will not move. However, we show that the saddle points are not stable and that the randomness in stochastic gradient helps the algorithm to escape from the saddle points.
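This instability can be seen on a toy function (an illustration only, not the paper's Algorithm 1): for $f(x,y) = x^2 - y^2$, exact gradient descent started at the saddle point $(0,0)$ never moves, while a small amount of gradient noise kicks the iterate onto the escape direction.

```python
import numpy as np

# Toy saddle: f(x, y) = x^2 - y^2 has a saddle at the origin with
# Hessian eigenvalues (2, -2); the y-axis is the escape direction.
def grad(w):
    x, y = w
    return np.array([2.0 * x, -2.0 * y])

def descend(w0, eta=0.1, steps=200, noise=0.0, rng=None):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        g = grad(w)
        if noise > 0:                 # noisy gradient oracle
            g = g + noise * rng.standard_normal(2)
        w = w - eta * g
    return w

rng = np.random.default_rng(0)
w_plain = descend([0.0, 0.0])                        # exact gradient: stuck
w_noisy = descend([0.0, 0.0], noise=0.01, rng=rng)   # noise triggers escape
f = lambda w: w[0] ** 2 - w[1] ** 2
```

The negative-curvature coordinate is amplified by a factor $(1+2\eta)$ per step, so even tiny noise eventually dominates and the function value drops far below the saddle value.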

###### Theorem 1 (informal).

Suppose $f$ is strict saddle (see Definition 5). Then Noisy Gradient Descent (Algorithm 1) outputs a point that is close to a local minimum in a polynomial number of steps.

#### Online tensor decomposition

Requiring all saddle points to have a negative eigenvalue may seem strong, but it already allows non-trivial applications to natural non-convex optimization problems. As an example, we consider the orthogonal tensor decomposition problem. This problem is the key step in spectral learning for many latent variable models (see more discussions in Section 2.2).

We design a new objective function for tensor decomposition that is strict saddle.

###### Theorem 2.

Given random samples such that $T$ is an orthogonal 4-th order tensor (see Section 2.2), there is an objective function $f$ such that every local minimum of $f$ corresponds to a valid decomposition of $T$. Further, the function $f$ is strict saddle.

Combining this new objective with our framework for analyzing stochastic gradient in non-convex setting, we get the first online algorithm for orthogonal tensor decomposition with global convergence guarantee.

### 1.2 Related Works

#### Relaxed notions of convexity

In optimization theory and economics, there is extensive work on understanding functions that behave similarly to convex functions (and in particular can be optimized efficiently). Such notions include pseudo-convexity (Mangasarian, 1965), quasi-convexity (Kiwiel, 2001), invexity (Hanson, 1999) and their variants. More recently there are also works that consider classes admitting more efficient optimization procedures, such as RSC (restricted strong convexity) (Agarwal et al., 2010). Although these classes contain functions that are non-convex, the function (or at least the function restricted to the region of analysis) still has a unique stationary point that is the desired local/global minimum. Therefore these works cannot be used to prove global convergence for problems like tensor decomposition, where by symmetry there are multiple local minima and saddle points.

#### Second-order algorithms

The most popular second-order method is Newton's method. Although Newton's method converges fast near a local minimum, its global convergence properties are less understood in the more general case. For non-convex functions, Frieze et al. (1996) gave a concrete example where a second-order method converges to the desired local minimum in a polynomial number of steps (interestingly, the function of interest tries to find one component of a 4-th order orthogonal tensor, which is a simpler case of our application). As Newton's method often converges to saddle points as well, trust-region algorithms are applied to avoid this behavior (Dauphin et al., 2014).

The tensor decomposition problem we consider in this paper has the following symmetry: the solution is a set of $d$ vectors $u_1,\dots,u_d$. If $(u_1,\dots,u_d)$ is a solution, then for any permutation $\pi$ and any sign flips $\kappa \in \{\pm 1\}^d$, $(\kappa_i u_{\pi(i)})$ is also a valid solution. In general, symmetry is known to generate saddle points, and variants of gradient descent often perform reasonably in these cases (see Saad and Solla (1995); Rattray et al. (1998); Inoue et al. (2003)). The settings in these works are different from ours, and none of them bound the number of steps required for convergence.

There are many other problems that have the same symmetric structure as the tensor decomposition problem, including the sparse coding problem (Olshausen and Field, 1997) and many deep learning applications (Bengio, 2009). In these problems the goal is to learn multiple "features" and the solution is invariant under permutation. Note that there are many recent papers on iterative/gradient-based algorithms for problems related to matrix factorization (Jain et al., 2013; Saxe et al., 2013). These problems often have a very different symmetry: if $M = AB$ then for any invertible matrix $C$ we know $M = (AC)(C^{-1}B)$. In this case all the equivalent solutions lie in a connected low-dimensional manifold and there need not be saddle points between them.

## 2 Preliminaries

#### Notation

Throughout the paper we use $[d]$ to denote the set $\{1,2,\dots,d\}$. We use $\|\cdot\|$ to denote the $\ell_2$ norm of vectors and the spectral norm of matrices. For a matrix $M$ we use $\lambda_{\min}(M)$ to denote its smallest eigenvalue. For a function $f:\mathbb{R}^d\to\mathbb{R}$, $\nabla f$ and $\nabla^2 f$ denote its gradient vector and Hessian matrix.

The stochastic gradient aims to solve the stochastic optimization problem (1), which we restate here:

$$w = \operatorname*{arg\,min}_{w\in\mathbb{R}^d} f(w), \quad \text{where } f(w) = \mathbb{E}_{x\sim\mathcal{D}}[\phi(w,x)].$$

Recall $\phi(w,x)$ denotes the loss function evaluated for sample $x$ at point $w$. The algorithm follows a stochastic gradient

$$w_{t+1} = w_t - \eta \nabla_{w}\phi(w_t, x_t), \tag{2}$$

where $x_t$ is a random sample drawn from the distribution $\mathcal{D}$ and $\eta$ is the learning rate.

In the more general setting, stochastic gradient descent can be viewed as optimizing an arbitrary function given a stochastic gradient oracle.

###### Definition 3.

For a function $f(w):\mathbb{R}^d\to\mathbb{R}$, a function $SG(w)$ that maps a point $w$ to a random vector in $\mathbb{R}^d$ is a stochastic gradient oracle if $\mathbb{E}[SG(w)] = \nabla f(w)$ and $\|SG(w) - \nabla f(w)\| \le Q$.

In this case the update step of the algorithm becomes $w_{t+1} = w_t - \eta\, SG(w_t)$.
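As a minimal sketch of this update loop (with an illustrative, hand-rolled oracle; none of these names come from the paper):

```python
import numpy as np

# Sketch of the update w_{t+1} = w_t - eta * SG(w_t), where SG is a
# stochastic gradient oracle: E[SG(w)] = grad f(w), bounded noise.
def sgd(oracle, w0, eta=0.05, steps=500):
    w = np.array(w0, dtype=float)
    for _ in range(steps):
        w = w - eta * oracle(w)
    return w

# Example oracle for f(w) = 0.5 * ||w||^2 (true gradient is w),
# perturbed by bounded uniform noise.
rng = np.random.default_rng(1)
oracle = lambda w: w + rng.uniform(-0.1, 0.1, size=w.shape)
w_final = sgd(oracle, [5.0, -3.0])
```

For this strongly convex example the iterates contract towards the minimizer at the origin, up to a noise floor set by $\eta$ and the oracle's noise radius.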

#### Smoothness and Strong Convexity

Traditional analyses of stochastic gradient often assume the function is smooth and strongly convex. A function is $\beta$-smooth if for any two points $w_1, w_2$,

$$\|\nabla f(w_1) - \nabla f(w_2)\| \le \beta \|w_1 - w_2\|. \tag{3}$$

When $f$ is twice differentiable this is equivalent to assuming that the spectral norm of the Hessian matrix is bounded by $\beta$. We say a function is $\alpha$-strongly convex if the Hessian at any point has smallest eigenvalue at least $\alpha$ ($\lambda_{\min}(\nabla^2 f(w)) \ge \alpha$).

Using these two properties, previous work (Rakhlin et al., 2012) shows that stochastic gradient descent converges at a rate of $O(1/t)$. In this paper we consider non-convex functions, which can still be $\beta$-smooth but cannot be strongly convex.

#### Smoothness of Hessians

We also require the Hessian of the function $f$ to be smooth. We say a function $f$ has $\rho$-Lipschitz Hessian if for any two points $w_1, w_2$ we have

$$\|\nabla^2 f(w_1) - \nabla^2 f(w_2)\| \le \rho \|w_1 - w_2\|. \tag{4}$$

This is a third-order condition that holds whenever the third-order derivative exists and is bounded.

### 2.2 Tensor decomposition

A $p$-th order tensor is a $p$-dimensional array. In this paper we mostly consider 4-th order tensors. If $T$ is a 4-th order tensor, we use $T_{i_1,i_2,i_3,i_4}$ ($i_1,\dots,i_4\in[d]$) to denote its $(i_1,i_2,i_3,i_4)$-th entry.

Tensors can be constructed from tensor products. We use $u \otimes v$ to denote a 2nd order tensor where $[u\otimes v]_{i,j} = u_i v_j$. This generalizes to higher order, and we use $u^{\otimes 4}$ to denote the 4-th order tensor

$$[u^{\otimes 4}]_{i_1,i_2,i_3,i_4} = u_{i_1} u_{i_2} u_{i_3} u_{i_4}.$$
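A quick check of this entry formula with NumPy's `einsum` (illustrative code, not from the paper):

```python
import numpy as np

# Build u^{⊗4} and spot-check the entry formula
# [u^{⊗4}]_{i1 i2 i3 i4} = u_{i1} u_{i2} u_{i3} u_{i4}.
u = np.array([1.0, -2.0, 0.5])
T4 = np.einsum('i,j,k,l->ijkl', u, u, u, u)
```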

We say a 4-th order tensor $T$ has an orthogonal decomposition if it can be written as

$$T = \sum_{i=1}^{d} a_i^{\otimes 4}, \tag{5}$$

where the $a_i$'s are orthonormal vectors that satisfy $\|a_i\| = 1$ and $a_i^T a_j = 0$ for $i \ne j$. We call the vectors $a_i$'s the components of this decomposition. Such a decomposition is unique up to permutation of the $a_i$'s and sign flips.

A tensor also defines a multilinear form (just as a matrix defines a bilinear form): for a $p$-th order tensor $T$ and matrices $M_t \in \mathbb{R}^{n_t\times d}$, $t\in[p]$, we define

$$[T(M_1, M_2, \dots, M_p)]_{i_1,i_2,\dots,i_p} = \sum_{j_1,j_2,\dots,j_p \in [d]} T_{j_1,j_2,\dots,j_p} \prod_{t\in[p]} M_t[i_t, j_t].$$

That is, the result of the multilinear form is another tensor in $\mathbb{R}^{n_1\times\cdots\times n_p}$. We will most often use vectors or identity matrices in the multilinear form. In particular, for a 4-th order tensor $T$ we know $T(u,u,u,I)$ is a vector and $T(u,u,I,I)$ is a matrix. Moreover, if $T$ has the orthogonal decomposition in (5), we know $T(u,u,u,I) = \sum_i (u^T a_i)^3 a_i$ and $T(u,u,I,I) = \sum_i (u^T a_i)^2 a_i a_i^T$.
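These multilinear-form identities can be verified numerically on a small synthetic orthogonally decomposable $T$ (an illustrative sketch; the orthonormal components here are random):

```python
import numpy as np

# Build T = sum_i a_i^{⊗4} with random orthonormal a_i, and check
# T(u, u, u, I) = sum_i (u^T a_i)^3 a_i.
rng = np.random.default_rng(0)
d = 4
A, _ = np.linalg.qr(rng.standard_normal((d, d)))   # columns a_i orthonormal
T = sum(np.einsum('i,j,k,l->ijkl', A[:, i], A[:, i], A[:, i], A[:, i])
        for i in range(d))
u = rng.standard_normal(d)
lhs = np.einsum('ijkl,i,j,k->l', T, u, u, u)       # T(u, u, u, I)
rhs = sum((u @ A[:, i]) ** 3 * A[:, i] for i in range(d))
```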

Given a tensor $T$ with an orthogonal decomposition, the orthogonal tensor decomposition problem asks to find the individual components $a_1,\dots,a_d$. This is a central problem in learning many latent variable models, including Hidden Markov Models, multi-view models, topic models, mixtures of Gaussians and Independent Component Analysis (ICA); see the discussion and citations in Anandkumar et al. (2014). The orthogonal tensor decomposition problem can be solved by many algorithms even when the input is a noisy estimate (Harshman, 1970; Kolda, 2001; Anandkumar et al., 2014). In practice this approach has been successfully applied to ICA (Comon, 2002), topic models (Zou et al., 2013) and community detection (Huang et al., 2013).

## 3 Stochastic Gradient Descent for Strict Saddle Functions

In this section we discuss the properties of saddle points, and show that if all the saddle points are well-behaved then stochastic gradient descent finds a local minimum of a non-convex function in polynomial time.

For a twice differentiable function $f$, we call $w$ a stationary point if its gradient $\nabla f(w)$ equals $0$. Stationary points can be local minima, local maxima or saddle points. By local optimality conditions (Wright and Nocedal, 1999), in many cases we can tell what type a point is by looking at its Hessian: if $\nabla^2 f(w)$ is positive definite then $w$ is a local minimum; if $\nabla^2 f(w)$ is negative definite then $w$ is a local maximum; if $\nabla^2 f(w)$ has both positive and negative eigenvalues then $w$ is a saddle point. These criteria do not cover all the cases, as there can be degenerate scenarios: $\nabla^2 f(w)$ can be positive semidefinite with an eigenvalue equal to 0, in which case the point could be a local minimum or a saddle point.

If a function does not have these degenerate cases, then we say the function is strict saddle:

###### Definition 4.

A twice differentiable function $f$ is strict saddle if all its local minima have $\nabla^2 f(w) \succ 0$ and all its other stationary points satisfy $\lambda_{\min}(\nabla^2 f(w)) < 0$.

Intuitively, if we are not at a stationary point, then we can always follow the gradient and reduce the value of the function. If we are at a saddle point, we need to consider a second-order Taylor expansion (the first-order term vanishes at a stationary point):

$$f(w+\Delta w) = f(w) + \tfrac{1}{2}(\Delta w)^T \nabla^2 f(w) (\Delta w) + O(\|\Delta w\|^3).$$

Since the strict saddle property guarantees that $\nabla^2 f(w)$ has a negative eigenvalue, there is always a point near $w$ with strictly smaller function value. It is thus possible to make local improvements as long as we have access to second-order information. However, it is not clear whether the more efficient stochastic gradient updates can work in this setting.

To make sure the local improvements are significant, we use a robust version of the strict saddle property:

###### Definition 5.

A twice differentiable function $f(w)$ is $(\alpha,\gamma,\epsilon,\delta)$-strict saddle if for any point $w$ at least one of the following is true:

1. $\|\nabla f(w)\| \ge \epsilon$.

2. $\lambda_{\min}(\nabla^2 f(w)) \le -\gamma$.

3. There is a local minimum $w^\star$ such that $\|w - w^\star\| \le \delta$, and the function restricted to a $2\delta$ neighborhood of $w^\star$ is $\alpha$-strongly convex.

Intuitively, this condition says for any point whose gradient is small, it is either close to a robust local minimum, or is a saddle point (or local maximum) with a significant negative eigenvalue.
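A point can be classified against these three conditions numerically, given gradient and Hessian callables (an illustrative sketch with arbitrary thresholds; condition 3's closeness to $w^\star$ is not actually checked here):

```python
import numpy as np

# Classify a point per the three strict-saddle conditions, given callables
# for the gradient and Hessian. eps and gamma are illustrative thresholds.
def strict_saddle_case(grad, hess, w, eps=1e-3, gamma=1e-3):
    g = np.linalg.norm(grad(w))
    lam_min = np.linalg.eigvalsh(hess(w)).min()
    if g >= eps:
        return "large gradient"           # case 1: progress along -grad
    if lam_min <= -gamma:
        return "saddle/escape direction"  # case 2: negative eigenvalue
    return "near local minimum"           # case 3 (closeness not verified)

# f(x, y) = x^2 - y^2: the origin is a strict saddle.
grad = lambda w: np.array([2 * w[0], -2 * w[1]])
hess = lambda w: np.diag([2.0, -2.0])
case = strict_saddle_case(grad, hess, np.zeros(2))
```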

###### Theorem 6 (Main Theorem).

Suppose a function $f(w):\mathbb{R}^d\to\mathbb{R}$ is $(\alpha,\gamma,\epsilon,\delta)$-strict saddle, and has a stochastic gradient oracle with radius at most $Q$. Further, suppose the function is bounded by $|f(w)| \le B$, is $\beta$-smooth and has $\rho$-Lipschitz Hessian. Then there exists a threshold $\eta_{\max} = \tilde{\Theta}(1)$ such that for any $\zeta > 0$ and any $\eta \le \eta_{\max}/\max\{1,\log(1/\zeta)\}$, with probability at least $1-\zeta$, in $t = \tilde{O}(\eta^{-2}\log(1/\zeta))$ iterations Algorithm 1 (Noisy Gradient Descent) outputs a point $w_t$ that is $\tilde{O}(\sqrt{\eta\log(1/\eta\zeta)})$-close to some local minimum $w^\star$.

Here (and throughout the rest of the paper) $\tilde{O}(\cdot)$ ($\tilde{\Theta}(\cdot)$, $\tilde{\Omega}(\cdot)$) hides factors that are polynomially dependent on all other parameters (including $\alpha$, $\gamma$, $\epsilon$, $\delta$, $Q$, $B$, $\beta$, $\rho$, and $d$) but independent of $\eta$ and $\zeta$, so it focuses on the dependency on $\eta$ and $\zeta$. Our proof technique can give explicit dependencies on these parameters; we hide them for simplicity of presentation.

###### Remark (Decreasing learning rate).

Analyses of stochastic gradient descent often use decreasing learning rates, under which the algorithm converges to a local (or global) minimum. Since the function is strongly convex in a small region close to the local minimum, we can use Theorem 6 to first find a point that is close to a local minimum, and then apply the standard analysis of SGD in the strongly convex case (where we decrease the learning rate as $1/t$ and get a $1/t$ convergence rate).

In the next part we sketch the proof of the main theorem. Details are deferred to Appendix A.

### 3.2 Proof sketch

In order to prove Theorem 6, we analyze the three cases in Definition 5. When the gradient is large, we show the function value decreases in one step (see Lemma 7); when the point is close to a local minimum, we show that with high probability it cannot escape in the next polynomially many iterations (see Lemma 8).

###### Lemma 7 (Gradient).

Under the assumptions of Theorem 6, for any point $w_t$ with sufficiently large gradient $\|\nabla f(w_t)\|$ and small enough learning rate $\eta$, after one iteration we have $\mathbb{E}[f(w_{t+1})] \le f(w_t) - \tilde{\Omega}(\eta^2)$.

The proof of this lemma is a simple application of the smoothness property.

###### Lemma 8 (Local minimum).

Under the assumptions of Theorem 6, for any point $w_t$ that is close to a local minimum $w^\star$, with high probability all future iterates in the next polynomially many steps remain close to $w^\star$.

The proof of this lemma is similar to the standard analysis (Rakhlin et al., 2012) of stochastic gradient descent in the smooth and strongly convex setting, except we only have local strongly convexity. The proof appears in Appendix A.

The hardest case is when the point is "close" to a saddle point: its gradient is smaller than $\epsilon$ and the smallest eigenvalue of its Hessian is at most $-\gamma$. In this case we show the noise in our algorithm helps the algorithm to escape:

###### Lemma 9 (Saddle point).

Under the assumptions of Theorem 6, for any point $w_t$ whose gradient is small (as in Lemma 7) and whose Hessian satisfies $\lambda_{\min}(\nabla^2 f(w_t)) \le -\gamma$, there is a number of steps $T$ that depends on $w_t$ such that $\mathbb{E}[f(w_{t+T})] \le f(w_t) - \tilde{\Omega}(\eta^2)$. The number of steps $T$ has a fixed upper bound $T_{\max}$ that is independent of $w_t$, i.e. $T \le T_{\max}$.

Intuitively, at the point $w_t$ there is a good direction hiding in the Hessian. The hope of the algorithm is that the additional (or inherent) noise in the update step makes a small step towards the correct direction, and then the gradient information reinforces this small perturbation so that future updates "slide" down along the correct direction.

To make this more formal, we consider a coupled sequence of updates $\tilde{w}_t$ such that the function to minimize is just the local second-order approximation

$$\tilde{f}(w) = f(w_t) + \nabla f(w_t)^T(w - w_t) + \tfrac{1}{2}(w - w_t)^T \nabla^2 f(w_t)(w - w_t).$$

The dynamics of stochastic gradient descent for this quadratic function are easy to analyze, as $\tilde{w}_t$ can be calculated analytically. Indeed, we show the expectation of $\tilde{f}(\tilde{w}_t)$ will decrease. We then use the smoothness of the function to show that, as long as the points do not go very far from $w_t$, the two update sequences $w_t$ and $\tilde{w}_t$ will remain close to each other, and thus the function value of $w_t$ also decreases. Finally we prove that the future iterates (in the next $T$ steps) remain close to $w_t$ with high probability via martingale bounds (Azuma, 1967). The detailed proof appears in Appendix A.

With these three lemmas it is easy to prove the main theorem. Intuitively, as long as the current point is not close to a local minimum, we can apply Lemma 7 or Lemma 9 to decrease the expected function value by a fixed amount within a bounded number of iterations. This cannot go on for too many iterations, because then the expected function value would decrease by more than $2B$, while $|f(w)| \le B$ by our assumption. Therefore, within polynomially many steps, with at least constant probability $w_t$ becomes close to a local minimum. By Lemma 8 we know that once it is close it will almost always stay close, so we can repeat the argument to obtain the high-probability result. More details appear in Appendix A.

### 3.3 Constrained Problems

In many cases, the problems we face are constrained optimization problems. In this part we briefly describe how to adapt the analysis to problems with equality constraints (which suffices for the tensor application). Dealing with general inequality constraints is left as future work.

For a constrained optimization problem:

$$\min_{w\in\mathbb{R}^d} f(w) \quad \text{s.t.} \quad c_i(w) = 0, \; i\in[m] \tag{6}$$

in general we need to consider the set of points in the low-dimensional manifold defined by the constraints. In particular, after every step the algorithm needs to project back to this manifold (see Algorithm 2, where $\Pi$ denotes the projection to this manifold).

For constrained optimization it is common to consider the Lagrangian:

$$\mathcal{L}(w,\lambda) = f(w) - \sum_{i=1}^{m} \lambda_i c_i(w). \tag{7}$$

Under common regularity conditions, it is possible to compute the value of the Lagrange multipliers:

$$\lambda^*(w) = \operatorname*{arg\,min}_{\lambda} \|\nabla_w \mathcal{L}(w,\lambda)\|.$$

We can also define the tangent space, which contains all directions orthogonal to the gradients of the constraints: $\mathcal{T}(w) = \{v : v^T\nabla c_i(w) = 0,\; i\in[m]\}$. The corresponding gradient and Hessian we consider are the first-order and second-order partial derivatives of the Lagrangian at the point $(w, \lambda^*(w))$:

$$\chi(w) = \nabla_w \mathcal{L}(w,\lambda)\big|_{(w,\lambda^*(w))} = \nabla f(w) - \sum_{i=1}^{m} \lambda_i^*(w)\nabla c_i(w) \tag{8}$$

$$\mathfrak{M}(w) = \nabla^2_{ww} \mathcal{L}(w,\lambda)\big|_{(w,\lambda^*(w))} = \nabla^2 f(w) - \sum_{i=1}^{m} \lambda_i^*(w)\nabla^2 c_i(w) \tag{9}$$

We replace the gradient and Hessian with $\chi(w)$ and $\mathfrak{M}(w)$, and when computing eigenvectors of $\mathfrak{M}(w)$ we focus on its projection onto the tangent space. In this way, we get a similar definition of strict saddle (see Appendix B), and the following theorem.

###### Theorem 10.

(informal) Under regularity and smoothness conditions, if a constrained optimization problem satisfies the strict saddle property, then for a small enough learning rate $\eta$, in polynomially many iterations Projected Noisy Gradient Descent (Algorithm 2) outputs a point that is close to a local minimum with probability at least $1-\zeta$.

Detailed discussions and formal version of this theorem are deferred to Appendix B.
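For the tensor application the equality constraints are simply $\|u_i\| = 1$, so projecting back to the manifold amounts to renormalizing each vector. A minimal sketch of one projected noisy step (an illustration, not the paper's Algorithm 2):

```python
import numpy as np

# One step of projected noisy gradient descent on the product of unit
# spheres ||u_i|| = 1: noisy gradient step, then row-wise renormalization.
def project(U):
    return U / np.linalg.norm(U, axis=1, keepdims=True)

def projected_noisy_step(U, grad_U, eta, noise, rng):
    G = grad_U + noise * rng.standard_normal(U.shape)
    return project(U - eta * G)

rng = np.random.default_rng(2)
U = project(rng.standard_normal((4, 4)))   # 4 unit vectors in R^4
U_next = projected_noisy_step(U, np.zeros_like(U), eta=0.1, noise=0.1, rng=rng)
```

Renormalization is the exact Euclidean projection onto each sphere, so the iterates always satisfy the constraints.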

## 4 Online Tensor Decomposition

In this section we describe how to apply our stochastic gradient descent analysis to tensor decomposition problems. We first give a new formulation of tensor decomposition as an optimization problem, and show that it satisfies the strict saddle property. Then we explain how to compute stochastic gradient in a simple example of Independent Component Analysis (ICA) (Hyvärinen et al., 2004).

### 4.1 Optimization problem for tensor decomposition

Given a tensor that has an orthogonal decomposition

$$T = \sum_{i=1}^{d} a_i^{\otimes 4}, \tag{10}$$

where the components $a_i$'s are orthonormal vectors ($\|a_i\| = 1$, $a_i^T a_j = 0$ for $i \ne j$), the goal of orthogonal tensor decomposition is to find the components $a_i$'s.

This problem has an inherent symmetry: for any permutation $\pi$ and any set of $\kappa_i \in \{\pm 1\}$, $i\in[d]$, we know that $\{\kappa_i a_{\pi(i)}\}$ is also a valid solution. This symmetry makes the natural optimization problems non-convex.

In this section we will give a new formulation of orthogonal tensor decomposition as an optimization problem, and show that this new problem satisfies the strict saddle property.

Previously, Frieze et al. (1996) solved the problem of finding one component with the following objective function:

$$\max_{\|u\|^2=1} T(u,u,u,u). \tag{11}$$

In Appendix C.1, as a warm-up example we show this function is indeed strict saddle, and we can apply Theorem 10 to prove global convergence of stochastic gradient descent algorithm.

It is possible to find all components of a tensor by iteratively finding one component and doing careful deflation, as described in Anandkumar et al. (2014) or Arora et al. (2012). However, in practice the most popular approaches, like Alternating Least Squares (Comon et al., 2009) or FastICA (Hyvarinen, 1999), use a single optimization problem to find all the components. Empirically these algorithms are often more robust to noise and model misspecification.

The most straightforward formulation of the problem aims to minimize the reconstruction error:

$$\min_{\forall i, \|u_i\|^2=1} \Big\|T - \sum_{i=1}^{d} u_i^{\otimes 4}\Big\|_F^2. \tag{12}$$

Here $\|\cdot\|_F$ is the Frobenius norm of the tensor, which is equal to the $\ell_2$ norm when we view the tensor as a $d^4$-dimensional vector. However, it is not clear whether this function satisfies the strict saddle property, and empirically stochastic gradient descent is unstable for this objective.

We propose a new objective that aims to minimize the correlation between different components:

$$\min_{\forall i, \|u_i\|^2=1} \sum_{i\ne j} T(u_i, u_i, u_j, u_j). \tag{13}$$

To understand this objective intuitively, we first expand the vectors $u_i$ in the orthogonal basis formed by the $a_k$'s. That is, we can write $u_i = \sum_{k=1}^{d} c_{ik} a_k$, where the $c_{ik}$'s are scalars corresponding to the coordinates in this basis. In this way we can rewrite $T(u_i,u_i,u_j,u_j) = \sum_{k=1}^{d} c_{ik}^2 c_{jk}^2$. From this form it is clear that $T(u_i,u_i,u_j,u_j)$ is always nonnegative, and equals $0$ only when the supports of $c_i$ and $c_j$ do not intersect. Hence, for the objective function to equal $0$, the $c_i$'s must have disjoint supports. Therefore, we claim that $\{u_1,\dots,u_d\}$ is equivalent to $\{a_1,\dots,a_d\}$ up to permutation and sign flips when the global minimum (which is $0$) is achieved.
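The expansion above is easy to verify numerically (an illustrative sketch with random orthonormal components; $c_i$ denotes the coordinates of $u_i$ in the basis of components):

```python
import numpy as np

# Check T(u_i, u_i, u_j, u_j) = sum_k c_{ik}^2 c_{jk}^2 >= 0 for an
# orthogonally decomposable T, where c_{ik} = <u_i, a_k>.
rng = np.random.default_rng(3)
d = 3
A, _ = np.linalg.qr(rng.standard_normal((d, d)))    # columns a_k orthonormal
T = sum(np.einsum('i,j,k,l->ijkl', A[:, k], A[:, k], A[:, k], A[:, k])
        for k in range(d))
ui, uj = rng.standard_normal(d), rng.standard_normal(d)
form = np.einsum('ijkl,i,j,k,l->', T, ui, ui, uj, uj)
c_i, c_j = A.T @ ui, A.T @ uj                       # coordinates in the a_k basis
expansion = np.sum(c_i ** 2 * c_j ** 2)
```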

We further show that this optimization problem satisfies the strict saddle property and that all its local minima in fact achieve the global minimum value. The proof is deferred to Appendix C.2.

###### Theorem 11.

The optimization problem (13) is $(\alpha,\gamma,\epsilon,\delta)$-strict saddle for polynomially bounded parameters. Moreover, all its local minima have the form $u_i = \kappa_i a_{\pi(i)}$ for some $\kappa_i = \pm 1$ and permutation $\pi$.

### 4.2 Implementing stochastic gradient oracle

To design an online algorithm based on objective function (13), we need to give an implementation for the stochastic gradient oracle.

In applications, the tensor $T$ is oftentimes the expectation of a multilinear operation $g(x)$ applied to samples $x$ generated from some distribution $\mathcal{D}$; in other words, $T = \mathbb{E}_{x\sim\mathcal{D}}[g(x)]$. Using the linearity of the multilinear map, we know $T(u_i,u_i,u_j,u_j) = \mathbb{E}[g(x)(u_i,u_i,u_j,u_j)]$. Therefore we can define the loss function $\phi(u,x) = \sum_{i\ne j} g(x)(u_i,u_i,u_j,u_j)$ and the stochastic gradient oracle $SG(u) = \nabla_u \phi(u,x)$.

For concreteness, we look at a simple ICA example. In this simple setting we consider an unknown signal $x$ that is uniform over $\{\pm 1\}^d$ (in general ICA the entries of $x$ are independent, non-Gaussian variables), and an unknown orthonormal linear transformation $A$ ($A^T A = I$). In general (under-complete) ICA this could be an arbitrary linear transformation; however, after the "whitening" step (see Cardoso (1989)) the linear transformation usually becomes orthonormal. The sample we observe is $y := Ax$. Using standard techniques (see Cardoso (1989)), we know the 4-th order cumulant of the observed sample is a tensor that has an orthogonal decomposition. For simplicity we do not define the 4-th order cumulant here; instead we give the result directly.

Define the tensor $Z$ as follows:

$$Z_{i,i,i,i} = 3, \;\forall i\in[d]; \qquad Z_{i,i,j,j} = Z_{i,j,i,j} = Z_{i,j,j,i} = 1, \;\forall i\ne j\in[d],$$

where all other entries of $Z$ are equal to $0$. The tensor $T$ can then be written as a function of the auxiliary tensor $Z$ and a multilinear form of the sample $y$.

###### Lemma 12.

The expectation $\mathbb{E}\left[\tfrac{1}{2}(Z - y^{\otimes 4})\right] = \sum_{i=1}^{d} a_i^{\otimes 4}$, where the $a_i$'s are the columns of the unknown orthonormal matrix $A$.
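A numerical sanity check of this identity for small $d$, taking the expectation exactly over all $2^d$ sign patterns (assuming, as in the ICA setting above, $x$ uniform in $\{\pm 1\}^d$ and $y = Ax$):

```python
import numpy as np
from itertools import product

d = 3
rng = np.random.default_rng(4)
A, _ = np.linalg.qr(rng.standard_normal((d, d)))   # random orthonormal A

# Z as defined above: 3 on the diagonal, 1 on "paired" entries, 0 elsewhere.
I = np.eye(d)
Z = (np.einsum('ij,kl->ijkl', I, I) + np.einsum('ik,jl->ijkl', I, I)
     + np.einsum('il,jk->ijkl', I, I))

# Enumerate all 2^d sign vectors to take the expectation of y^{⊗4} exactly.
Ey4 = np.zeros((d, d, d, d))
for signs in product([-1.0, 1.0], repeat=d):
    y = A @ np.array(signs)
    Ey4 += np.einsum('i,j,k,l->ijkl', y, y, y, y)
Ey4 /= 2 ** d

lhs = 0.5 * (Z - Ey4)
rhs = sum(np.einsum('i,j,k,l->ijkl', A[:, i], A[:, i], A[:, i], A[:, i])
          for i in range(d))
```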

This lemma is easy to verify and is closely related to cumulants (Cardoso, 1989). Recall that $\phi(u,y)$ denotes the loss (objective) function evaluated at sample $y$ for point $u$. Let $u = (u_1,\dots,u_d)$. By Lemma 12, we know that $\mathbb{E}[\phi(u,y)]$ is equal to the objective function in Equation (13). Therefore we rewrite objective (13) as the following stochastic optimization problem:

$$\min_{\forall i, \|u_i\|^2=1} \mathbb{E}[\phi(u,y)], \quad \text{where } \phi(u,y) = \sum_{i\ne j} \tfrac{1}{2}(Z - y^{\otimes 4})(u_i, u_i, u_j, u_j).$$

The stochastic gradient oracle is then

$$\nabla_{u_i}\phi(u,y) = \sum_{j\ne i}\left(\langle u_j, u_j\rangle u_i + 2\langle u_i, u_j\rangle u_j - \langle u_j, y\rangle^2 \langle u_i, y\rangle y\right). \tag{14}$$

Notice that computing this stochastic gradient does not require constructing the 4-th order tensor $Z - y^{\otimes 4}$. In particular, this stochastic gradient can be computed very efficiently:

###### Remark.

The stochastic gradient (14) can be computed in $O(d^3)$ time for one sample, or in $O(md^2 + d^3)$ time for the average of $m$ samples.

###### Proof.

The proof is straightforward: the first two terms take $O(d^3)$ time and are shared by all samples. The third term can be computed efficiently once the inner products between all the $u_i$'s and all the $y$'s are computed (which takes $O(md^2)$ time). ∎
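A sketch of this computation (a naive double loop next to a vectorized version whose cost is dominated by a few $d\times d$ matrix products; variable names are ours, not the paper's):

```python
import numpy as np

# Gradient (14) for one sample y, with rows of U holding u_1, ..., u_d.
def grad_naive(U, y):
    d = U.shape[0]
    G = np.zeros_like(U)
    for i in range(d):
        for j in range(d):
            if j != i:
                G[i] += (U[j] @ U[j]) * U[i] + 2 * (U[i] @ U[j]) * U[j] \
                        - (U[j] @ y) ** 2 * (U[i] @ y) * y
    return G

def grad_fast(U, y):
    Gm = U @ U.T                        # all inner products <u_i, u_j>
    n2 = np.diag(Gm)                    # ||u_i||^2
    p = U @ y                           # <u_i, y>
    t1 = (n2.sum() - n2)[:, None] * U
    t2 = 2 * (Gm @ U - n2[:, None] * U)  # drop the j = i term
    t3 = (((p ** 2).sum() - p ** 2) * p)[:, None] * y[None, :]
    return t1 + t2 - t3

rng = np.random.default_rng(5)
d = 5
U = rng.standard_normal((d, d))
y = rng.standard_normal(d)
```

Both functions compute the same gradient; the vectorized one shares the Gram-matrix work across all $i$, matching the remark's complexity claim.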

## 5 Experiments

We run simulations for Projected Noisy Gradient Descent (Algorithm 2) applied to orthogonal tensor decomposition. The results show that the algorithm converges from random initial points efficiently (as predicted by the theorems), and our new formulation (13) performs better than reconstruction error (12) based formulation.

#### Settings

We set the dimension $d$; the input tensor $T$ is a random tensor in $\mathbb{R}^{d\times d\times d\times d}$ that has an orthogonal decomposition (5). The step size is chosen carefully for the respective objective functions. The performance is measured by the normalized reconstruction error.

We use two ways to generate samples and compute stochastic gradients. In the first case we generate a sample $x$ by setting it equal to $d^{1/4} a_i$ with probability $1/d$ for each $i$. It is easy to see that $\mathbb{E}[x^{\otimes 4}] = T$. This is a very simple way of generating samples, and we use it as a sanity check for the objective functions.
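This sanity check can itself be verified numerically, assuming the samples are scaled as $d^{1/4} a_i$ (the scaling that makes the averaged fourth powers equal $T$, since $(1/d)\,(d^{1/4})^4 = 1$):

```python
import numpy as np

rng = np.random.default_rng(6)
d = 4
A, _ = np.linalg.qr(rng.standard_normal((d, d)))   # columns a_i orthonormal
T = sum(np.einsum('i,j,k,l->ijkl', A[:, i], A[:, i], A[:, i], A[:, i])
        for i in range(d))

# E[x^{⊗4}] over the d equally likely outcomes x = d^{1/4} a_i.
Ex4 = sum((1.0 / d) * np.einsum('i,j,k,l->ijkl', x, x, x, x)
          for x in (d ** 0.25 * A[:, i] for i in range(d)))
```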

In the second case we consider the ICA example introduced in Section 4.2, and use Equation (14) to compute a stochastic gradient. In this case the stochastic gradient has a large variance, so we use mini-batch of size 100 to reduce the variance.

#### Comparison of objective functions

We use the simple way of generating samples for both our new objective function (13) and the reconstruction error objective (12). The result is shown in Figure 1. Our new objective function is empirically more stable (it always converges within 10000 iterations); the reconstruction error objective does not always converge within the same number of iterations and often exhibits long periods with little improvement (likely caused by saddle points that do not have a significantly negative eigenvalue).

#### Simple ICA example

As shown in Figure 2, our new algorithm also works in the ICA setting. When the learning rate is constant the error stays at a fixed small value. When we decrease the learning rate the error converges to 0.

## 6 Conclusion

In this paper we identify the strict saddle property and show that stochastic gradient descent converges to a local minimum under this assumption. This leads to a new online algorithm for orthogonal tensor decomposition. We hope this is a first step towards understanding stochastic gradient descent for more classes of non-convex functions. We believe the strict saddle property can be extended to handle more functions, especially those with similar symmetry properties.

## References

• Agarwal et al. (2010) Agarwal, A., Negahban, S., and Wainwright, M. J. (2010).

Fast global convergence rates of gradient methods for high-dimensional statistical recovery.

In Advances in Neural Information Processing Systems, pages 37–45.
• Anandkumar et al. (2014) Anandkumar, A., Ge, R., Hsu, D., Kakade, S. M., and Telgarsky, M. (2014). Tensor decompositions for learning latent variable models.

Journal of Machine Learning Research

, 15:2773–2832.
• Arora et al. (2012) Arora, S., Ge, R., Moitra, A., and Sachdeva, S. (2012).

Provable ICA with unknown gaussian noise, with implications for gaussian mixtures and autoencoders.

In Advances in Neural Information Processing Systems, pages 2375–2383.
• Azuma (1967) Azuma, K. (1967).

Weighted sums of certain dependent random variables.

Tohoku Mathematical Journal, Second Series, 19(3):357–367.
• Bengio (2009) Bengio, Y. (2009). Learning deep architectures for AI. Foundations and trends® in Machine Learning, 2(1):1–127.
• Cardoso (1989) Cardoso, J.-F. (1989).

Source separation using higher order moments.

In Acoustics, Speech, and Signal Processing, pages 2109–2112. IEEE.
• Choromanska et al. (2014) Choromanska, A., Henaff, M., Mathieu, M., Arous, G. B., and LeCun, Y. (2014). The loss surface of multilayer networks. arXiv:1412.0233.
• Comon (2002) Comon, P. (2002). Tensor decompositions. Mathematics in Signal Processing V, pages 1–24.
• Comon et al. (2009) Comon, P., Luciani, X., and De Almeida, A. L. (2009). Tensor decompositions, alternating least squares and other tales. Journal of Chemometrics, 23(7-8):393–405.
• Dauphin et al. (2014) Dauphin, Y. N., Pascanu, R., Gulcehre, C., Cho, K., Ganguli, S., and Bengio, Y. (2014). Identifying and attacking the saddle point problem in high-dimensional non-convex optimization. In Advances in Neural Information Processing Systems, pages 2933–2941.
• Frieze et al. (1996) Frieze, A., Jerrum, M., and Kannan, R. (1996). Learning linear transformations. In 37th Annual Symposium on Foundations of Computer Science, pages 359–368. IEEE.
• Hanson (1999) Hanson, M. A. (1999). Invexity and the Kuhn–Tucker theorem. Journal of Mathematical Analysis and Applications, 236(2):594–604.
• Harshman (1970) Harshman, R. A. (1970). Foundations of the PARAFAC procedure: Models and conditions for an “explanatory” multi-modal factor analysis. UCLA Working Papers in Phonetics, 16(1):84.
• Huang et al. (2013) Huang, F., Niranjan, U., Hakeem, M. U., and Anandkumar, A. (2013). Fast detection of overlapping communities via online tensor methods. arXiv:1309.0787.
• Hyvarinen (1999) Hyvarinen, A. (1999). Fast ICA for noisy data using Gaussian moments. In Circuits and Systems, volume 5, pages 57–61.
• Hyvärinen et al. (2004) Hyvärinen, A., Karhunen, J., and Oja, E. (2004). Independent component analysis, volume 46. John Wiley & Sons.
• Inoue et al. (2003) Inoue, M., Park, H., and Okada, M. (2003). On-line learning theory of soft committee machines with correlated hidden units–steepest gradient descent and natural gradient descent–. Journal of the Physical Society of Japan, 72(4):805–810.
• Jain et al. (2013) Jain, P., Netrapalli, P., and Sanghavi, S. (2013). Low-rank matrix completion using alternating minimization. In Proceedings of the forty-fifth annual ACM symposium on Theory of Computing, pages 665–674.
• Johnson et al. (1988) Johnson, D. S., Papadimitriou, C. H., and Yannakakis, M. (1988). How easy is local search? Journal of computer and system sciences, 37(1):79–100.
• Kiwiel (2001) Kiwiel, K. C. (2001). Convergence and efficiency of subgradient methods for quasiconvex minimization. Mathematical programming, 90(1):1–25.
• Kolda (2001) Kolda, T. G. (2001). Orthogonal tensor decompositions. SIAM Journal on Matrix Analysis and Applications, 23(1):243–255.
• Mangasarian (1965) Mangasarian, O. L. (1965). Pseudo-convex functions. Journal of the Society for Industrial & Applied Mathematics, Series A: Control, 3(2):281–290.
• Olshausen and Field (1997) Olshausen, B. A. and Field, D. J. (1997). Sparse coding with an overcomplete basis set: A strategy employed by V1? Vision research, 37(23):3311–3325.
• Rakhlin et al. (2012) Rakhlin, A., Shamir, O., and Sridharan, K. (2012). Making gradient descent optimal for strongly convex stochastic optimization. In ICML, pages 449–456.
• Rattray et al. (1998) Rattray, M., Saad, D., and Amari, S.-i. (1998). Natural gradient descent for on-line learning. Physical review letters, 81(24):5461.
• Rumelhart et al. (1988) Rumelhart, D. E., Hinton, G. E., and Williams, R. J. (1988). Learning representations by back-propagating errors. Cognitive modeling, 5.
• Saad and Solla (1995) Saad, D. and Solla, S. A. (1995). On-line learning in soft committee machines. Physical Review E, 52(4):4225.
• Saxe et al. (2013) Saxe, A. M., McClelland, J. L., and Ganguli, S. (2013). Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. arXiv:1312.6120.
• Shalev-Shwartz et al. (2009) Shalev-Shwartz, S., Shamir, O., Sridharan, K., and Srebro, N. (2009). Stochastic convex optimization. In Proceedings of The 22nd Conference on Learning Theory.
• Wright and Nocedal (1999) Wright, S. J. and Nocedal, J. (1999). Numerical optimization, volume 2. Springer New York.
• Zou et al. (2013) Zou, J. Y., Hsu, D., Parkes, D. C., and Adams, R. P. (2013). Contrastive learning using spectral methods. In Advances in Neural Information Processing Systems, pages 2238–2246.

## Appendix A Detailed Analysis for Section 3 in Unconstrained Case

In this section we give a detailed analysis for noisy gradient descent, under the assumption that the unconstrained problem satisfies the $(\alpha,\gamma,\epsilon,\delta)$-strict saddle property.
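Before diving into the analysis, it may help to see the regimes of the strict saddle property on a toy function (our own illustration, not one from the paper): the separable function $f(w) = \sum_i (w_i^2-1)^2/4$ has $2^d$ isolated local minima at $w \in \{-1,+1\}^d$ and exponentially many saddle points, and every stationary point either has a Hessian eigenvalue bounded away from zero below, or lies in a strongly convex neighborhood of a local minimum:

```python
import numpy as np

# Hypothetical illustration (not from the paper): f(w) = sum_i (w_i^2 - 1)^2 / 4.
# Stationary points have every coordinate in {-1, 0, +1}; any point with a zero
# coordinate is a saddle (or local max), so saddles are exponentially numerous.

def grad(w):
    return (w ** 2 - 1.0) * w          # gradient of the separable quartic

def hessian(w):
    return np.diag(3.0 * w ** 2 - 1.0)  # diagonal Hessian

# "Saddle" regime: at w = (0, 1, 1) the gradient vanishes but the Hessian
# has eigenvalue -1 <= -gamma, so noise can push the iterate off the saddle.
w_saddle = np.array([0.0, 1.0, 1.0])
assert np.linalg.norm(grad(w_saddle)) == 0.0
assert np.linalg.eigvalsh(hessian(w_saddle)).min() <= -0.9

# "Near a local minimum" regime: at w* = (1, 1, 1) the Hessian is 2I,
# so f is locally 2-strongly convex there.
w_star = np.ones(3)
assert np.linalg.eigvalsh(hessian(w_star)).min() >= 1.9
print("strict-saddle regimes verified")
```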

The algorithm we investigate is Algorithm 1. We can combine the randomness in the stochastic gradient oracle and the artificial noise, and rewrite the update equation in the form:

\[
w_t = w_{t-1} - \eta\bigl(\nabla f(w_{t-1}) + \xi_{t-1}\bigr) \tag{15}
\]

where $\eta$ is the step size and $\xi_{t-1}$ is the combination of the two sources of noise: the deviation of the stochastic gradient from $\nabla f(w_{t-1})$, plus the artificial noise $n$ (recall $n$ is a random vector on the unit sphere).

By assumption, we know the $\xi_t$'s are independent and satisfy $\mathbb{E}\,\xi_t = 0$, $\|\xi_t\| \le Q = \tilde{O}(1)$. Due to the explicitly added noise in Algorithm 1, we further have $\mathbb{E}\|\xi_t\|^2 \ge 1$. For simplicity, we assume $\mathbb{E}\,\xi_t\xi_t^\top = \sigma^2 I$ for some constant $\sigma$; then the algorithm we are running is exactly the same as Stochastic Gradient Descent (SGD). Our proof can easily be extended to the case when $1 \le \mathbb{E}\|\xi_t\|^2 \le \tilde{O}(1)$ because both the upper and lower bounds are $\tilde{O}(1)$.
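As a minimal executable sketch of the update Eq. (15) (the quadratic objective, noise scale, and function names here are our own choices, not the paper's), combining the oracle noise with the artificial sphere noise:

```python
import numpy as np

rng = np.random.default_rng(0)

def sphere_noise(d):
    """Artificial noise n: a uniformly random vector on the unit sphere."""
    n = rng.standard_normal(d)
    return n / np.linalg.norm(n)

def sgd_step(w, grad_f, stochastic_gradient, eta):
    """One update w_t = w_{t-1} - eta * (grad f(w_{t-1}) + xi_{t-1}), where
    xi combines the oracle deviation SG(w) - grad f(w) and the added n."""
    xi = (stochastic_gradient(w) - grad_f(w)) + sphere_noise(len(w))
    return w - eta * (grad_f(w) + xi)

# Hypothetical objective f(w) = 0.5 ||w||^2 with a noisy gradient oracle.
grad_f = lambda w: w
sg = lambda w: w + 0.1 * rng.standard_normal(len(w))  # E[SG(w)] = grad f(w)

w = np.ones(4)
for _ in range(200):
    w = sgd_step(w, grad_f, sg, eta=0.05)
print(np.linalg.norm(w))  # hovers near 0, at a scale set by eta and the noise
```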

We first restate the main theorem in the context of stochastic gradient descent.

###### Theorem 13 (Main Theorem).

Suppose a function $f(w): \mathbb{R}^d \to \mathbb{R}$ that is $(\alpha,\gamma,\epsilon,\delta)$-strict saddle, and has a stochastic gradient oracle where the noise satisfies $\mathbb{E}\,\xi = 0$ and $\mathbb{E}\,\xi\xi^\top = \sigma^2 I$. Further, suppose the function is bounded by $|f(w)| \le B$, is $\beta$-smooth, and has $\rho$-Lipschitz Hessian. Then there exists a threshold $\eta_{\max} = \tilde{\Theta}(1)$, so that for any $\zeta > 0$ and for any $\eta \le \eta_{\max}/\max\{1, \log(1/\zeta)\}$, with probability at least $1-\zeta$ in $t = \tilde{O}(\eta^{-2}\log(1/\zeta))$ iterations, SGD outputs a point $w_t$ that is $\tilde{O}(\sqrt{\eta\log(1/\eta\zeta)})$-close to some local minimum $w^\star$.

Recall that $\tilde{O}(\cdot)$ (and similarly $\tilde{\Omega}(\cdot)$, $\tilde{\Theta}(\cdot)$) hides factors that are polynomially dependent on all other parameters but independent of $\eta$ and $\zeta$, so it focuses on the dependency on $\eta$ and $\zeta$. Throughout the proof, we interchangeably use $\mathcal{H}(w)$ and $\nabla^2 f(w)$ to represent the Hessian matrix of $f$.

As we discussed in the proof sketch in Section 3, we analyze the behavior of the algorithm in three different cases. The first case is when the gradient is large.

###### Lemma 14.

Under the assumptions of Theorem 13, for any point with $\|\nabla f(w_0)\| \ge \sqrt{2\eta\sigma^2\beta d}$ where $\sqrt{2\eta\sigma^2\beta d} \le \epsilon$, after one iteration we have:

\[
\mathbb{E} f(w_1) - f(w_0) \le -\tilde{\Omega}(\eta^2) \tag{16}
\]
###### Proof.

Choose $\eta \le \frac{1}{\beta}$ (so that $\eta - \frac{\beta\eta^2}{2} \ge \frac{\eta}{2}$); then by the update equation Eq. (15), we have:

\[
\begin{aligned}
\mathbb{E} f(w_1) - f(w_0)
&\le \nabla f(w_0)^\top \mathbb{E}(w_1 - w_0) + \frac{\beta}{2}\mathbb{E}\|w_1 - w_0\|^2 \\
&= -\Bigl(\eta - \frac{\beta\eta^2}{2}\Bigr)\|\nabla f(w_0)\|^2 + \frac{\eta^2\sigma^2\beta d}{2} \\
&\le -\frac{\eta}{2}\|\nabla f(w_0)\|^2 + \frac{\eta^2\sigma^2\beta d}{2}
\le -\frac{\eta^2\sigma^2\beta d}{2}
\end{aligned} \tag{17}
\]

which finishes the proof. ∎
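The expected one-step decrease of Lemma 14 can be sanity-checked numerically on a toy $\beta$-smooth quadratic (our own setup, not the paper's; for a quadratic the smoothness bound in the first line of (17) holds with equality):

```python
import numpy as np

rng = np.random.default_rng(1)

beta, eta, d = 2.0, 0.01, 5
f = lambda w: 0.5 * beta * np.dot(w, w)   # beta-smooth quadratic
grad_f = lambda w: beta * w

w0 = np.ones(d)                           # ||grad f(w0)|| = beta * sqrt(d), large vs sqrt(eta)
decrease = []
for _ in range(20000):
    xi = rng.standard_normal(d)           # noise with E[xi] = 0, E[xi xi^T] = I (sigma = 1)
    w1 = w0 - eta * (grad_f(w0) + xi)     # one step of Eq. (15)
    decrease.append(f(w1) - f(w0))
mean_dec = np.mean(decrease)

# Lemma 14's bound: E f(w1) - f(w0) <= -(eta - beta*eta^2/2)*||grad f(w0)||^2
#                                      + eta^2 * sigma^2 * beta * d / 2  < 0 here.
bound = (-(eta - beta * eta**2 / 2) * np.dot(grad_f(w0), grad_f(w0))
         + eta**2 * beta * d / 2)
print(mean_dec, bound)
assert mean_dec < 0
```

For this quadratic the Monte Carlo average should land very close to the bound, since no inequality in (17) is loose.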

###### Lemma 15.

Under the assumptions of Theorem 13, for any initial point $w_0$ that is $\tilde{O}(\sqrt{\eta}) < \delta$ close to a local minimum $w^\star$, with probability at least $1-\zeta/2$, the following holds simultaneously:

\[
\forall t \le \tilde{O}\Bigl(\frac{1}{\eta^2}\log\frac{1}{\zeta}\Bigr), \quad
\|w_t - w^\star\| \le \tilde{O}\Bigl(\sqrt{\eta\log\frac{1}{\eta\zeta}}\Bigr) < \delta \tag{18}
\]

where $w^\star$ is the locally optimal point.

###### Proof.

We shall construct a supermartingale and use Azuma’s inequality (Azuma, 1967) to prove this result.

Let the filtration $\mathfrak{F}_t = \sigma\{\xi_0, \dots, \xi_{t-1}\}$, and note that $w_t$ is measurable with respect to $\mathfrak{F}_t$, where $\sigma\{\cdot\}$ denotes the sigma field. Let the event $\mathfrak{E}_t = \{\forall \tau \le t,\ \|w_\tau - w^\star\| \le \mu\sqrt{\eta\log(1/\eta\zeta)} < \delta\}$, where $\mu$ is independent of $(\eta, \zeta)$ and will be specified later. To ensure the correctness of the proof, the $\tilde{O}$ notation in this proof will never hide any dependence on $\mu$. Clearly there is always a small enough choice of $\eta_{\max}$ to make $\mu\sqrt{\eta\log(1/\eta\zeta)} < \delta$ hold as long as $\eta \le \eta_{\max}/\max\{1, \log(1/\zeta)\}$. Also note $\mathfrak{E}_t \subset \mathfrak{E}_{t-1}$, that is, $1_{\mathfrak{E}_t} \le 1_{\mathfrak{E}_{t-1}}$.

By Definition 5 of $(\alpha,\gamma,\epsilon,\delta)$-strict saddle, we know $f$ is locally $\alpha$-strongly convex in the $2\delta$-neighborhood of $w^\star$. Since $\|w_t - w^\star\| < \delta$ conditioned on $\mathfrak{E}_t$, and $\nabla f(w^\star) = 0$, we have

\[
\nabla f(w_t)^\top (w_t - w^\star)\, 1_{\mathfrak{E}_t} \ge \alpha \|w_t - w^\star\|^2\, 1_{\mathfrak{E}_t} \tag{19}
\]

Furthermore, when $\eta \le \frac{\alpha}{\beta^2}$, using $\beta$-smoothness, we have:

\[
\begin{aligned}
\mathbb{E}\bigl[\|w_t - w^\star\|^2 1_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\bigr]
&= \mathbb{E}\bigl[\|w_{t-1} - \eta(\nabla f(w_{t-1}) + \xi_{t-1}) - w^\star\|^2 \mid \mathfrak{F}_{t-1}\bigr] 1_{\mathfrak{E}_{t-1}} \\
&= \bigl[\|w_{t-1} - w^\star\|^2 - 2\eta\nabla f(w_{t-1})^\top(w_{t-1} - w^\star) + \eta^2\|\nabla f(w_{t-1})\|^2 + \eta^2\sigma^2 d\bigr] 1_{\mathfrak{E}_{t-1}} \\
&\le \bigl[(1 - 2\eta\alpha + \eta^2\beta^2)\|w_{t-1} - w^\star\|^2 + \eta^2\sigma^2 d\bigr] 1_{\mathfrak{E}_{t-1}} \\
&\le \bigl[(1 - \eta\alpha)\|w_{t-1} - w^\star\|^2 + \eta^2\sigma^2 d\bigr] 1_{\mathfrak{E}_{t-1}}
\end{aligned} \tag{20}
\]

Therefore, we have:

\[
\Bigl[\mathbb{E}\bigl[\|w_t - w^\star\|^2 \mid \mathfrak{F}_{t-1}\bigr] - \frac{\eta\sigma^2 d}{\alpha}\Bigr] 1_{\mathfrak{E}_{t-1}}
\le (1 - \eta\alpha)\Bigl[\|w_{t-1} - w^\star\|^2 - \frac{\eta\sigma^2 d}{\alpha}\Bigr] 1_{\mathfrak{E}_{t-1}} \tag{21}
\]

Then, letting $G_t = (1-\eta\alpha)^{-t}\bigl[\|w_t - w^\star\|^2 - \frac{\eta\sigma^2 d}{\alpha}\bigr]$, we have:

\[
\mathbb{E}\bigl[G_t 1_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}\bigr] \le G_{t-1} 1_{\mathfrak{E}_{t-1}} \le G_{t-1} 1_{\mathfrak{E}_{t-2}} \tag{22}
\]

which means $G_t 1_{\mathfrak{E}_{t-1}}$ is a supermartingale.
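The supermartingale construction can be checked numerically on a one-dimensional strongly convex quadratic (our own toy setup; the closed-form conditional expectation stands in for the general bound (20)):

```python
import numpy as np

# Our own 1-D sanity check: take f(w) = 0.5 * alpha * w^2 (so w* = 0 and
# grad f(w) = alpha * w) with noise satisfying E[xi] = 0, E[xi^2] = sigma^2.
# Conditioned on w_{t-1}:
#   E[w_t^2] = (1 - eta*alpha)^2 * w_{t-1}^2 + eta^2 * sigma^2,
# so G_t = (1 - eta*alpha)^(-t) * (w_t^2 - eta*sigma^2/alpha) satisfies
# E[G_t | F_{t-1}] <= G_{t-1}, i.e. it is a supermartingale.

alpha, sigma, eta = 1.0, 1.0, 0.01
c = eta * sigma**2 / alpha                    # fixed point subtracted in G_t

def G(w, t):
    return (1 - eta * alpha) ** (-t) * (w * w - c)

rng = np.random.default_rng(2)
w, ok = 0.5, True
for t in range(1, 200):
    w_prev = w
    # closed-form conditional expectation of G_t given w_{t-1}
    cond_E_wt2 = (1 - eta * alpha) ** 2 * w_prev**2 + eta**2 * sigma**2
    cond_E_Gt = (1 - eta * alpha) ** (-t) * (cond_E_wt2 - c)
    ok = ok and (cond_E_Gt <= G(w_prev, t - 1) + 1e-12)
    # take the actual noisy SGD step
    w = w - eta * (alpha * w + sigma * rng.standard_normal())
assert ok
print("supermartingale inequality held at every step")
```

The inequality holds at every step because $(1-\eta\alpha)^2 \le (1-\eta\alpha)$, regardless of the noise realization.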

Therefore, with probability 1, we have:

\[
\begin{aligned}
\bigl|G_t 1_{\mathfrak{E}_{t-1}} - \mathbb{E}[G_t 1_{\mathfrak{E}_{t-1}} \mid \mathfrak{F}_{t-1}]\bigr|
&\le (1-\eta\alpha)^{-t}\bigl[\,2\|w_{t-1} - \eta\nabla f(w_{t-1}) - w^\star\| \cdot \eta\|\xi_{t-1}\| + \eta^2\|\xi_{t-1}\|^2 + \eta^2\sigma^2 d\,\bigr] 1_{\mathfrak{E}_{t-1}} \\
&\le (1-\eta\alpha)^{-t} \cdot \tilde{O}\Bigl(\mu\eta^{1.5}\log^{\frac12}\frac{1}{\eta\zeta}\Bigr) = d_t
\end{aligned} \tag{23}
\]

Let

\[
c_t = \sqrt{\sum_{\tau=1}^{t} d_\tau^2}
= \tilde{O}\Bigl(\mu\eta^{1.5}\log^{\frac12}\frac{1}{\eta\zeta}\Bigr)\sqrt{\sum_{\tau=1}^{t}(1-\eta\alpha)^{-2\tau}} \tag{24}
\]

By Azuma’s inequality, with probability less than $\tilde{O}(\eta^2)\zeta$, we have:

\[
G_t 1_{\mathfrak{E}_{t-1}} > \tilde{O}(1)\, c_t \log^{\frac12}\Bigl(\frac{1}{\eta\zeta}\Bigr) + G_0 \tag{25}
\]

We know $G_t 1_{\mathfrak{E}_{t-1}} > \tilde{O}(1)\, c_t \log^{\frac12}(\frac{1}{\eta\zeta}) + G_0$ is equivalent to:

\[
\|w_t - w^\star\|^2 > \tilde{O}(\eta) + \tilde{O}(1)(1-\eta\alpha)^t c_t \log^{\frac12}\Bigl(\frac{1}{\eta\zeta}\Bigr) \tag{26}
\]

We know:

\[
\begin{aligned}
(1-\eta\alpha)^t c_t \log^{\frac12}\Bigl(\frac{1}{\eta\zeta}\Bigr)
&= \mu \cdot \tilde{O}\Bigl(\eta^{1.5}\log\frac{1}{\eta\zeta}\Bigr)\sqrt{\sum_{\tau=1}^{t}(1-\eta\alpha)^{2(t-\tau)}}
= \mu \cdot \tilde{O}\Bigl(\eta^{1.5}\log\frac{1}{\eta\zeta}\Bigr)\sqrt{\sum_{\tau=0}^{t-1}(1-\eta\alpha)^{2\tau}} \\
&\le \mu \cdot \tilde{O}\Bigl(\eta^{1.5}\log\frac{1}{\eta\zeta}\Bigr)\sqrt{\frac{1}{1-(1-\eta\alpha)^2}}
= \mu \cdot \tilde{O}\Bigl(\eta\log\frac{1}{\eta\zeta}\Bigr)
\end{aligned} \tag{27}
\]
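The geometric-series step in (27) is what converts $\eta^{1.5}$ into $\eta$, since $1-(1-\eta\alpha)^2 = \Theta(\eta\alpha)$. A quick numeric check of the bound (sample values of $\eta$ and $\alpha$ are our own):

```python
# Check that the partial geometric series is dominated by its infinite sum:
#   sum_{tau=0}^{t-1} (1 - eta*alpha)^(2*tau)  <=  1 / (1 - (1-eta*alpha)^2)
for eta, alpha in [(0.1, 0.5), (0.01, 1.0), (0.001, 2.0)]:
    r = (1 - eta * alpha) ** 2            # common ratio, 0 < r < 1
    for t in [1, 10, 1000]:
        partial = sum(r ** tau for tau in range(t))
        assert partial <= 1.0 / (1.0 - r) + 1e-9
print("geometric sum bound verified")
```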

This means Azuma’s inequality implies that there exists some $\tilde{C} = \tilde{O}(1)$ so that:

 P(Et−1∩{∥wt−w⋆∥2>μ⋅~Cηlog1