Heavy-ball Algorithms Always Escape Saddle Points

by   Tao Sun, et al.
NetEase, Inc

Nonconvex optimization algorithms with random initialization have attracted increasing attention recently. It has been shown that many first-order methods always avoid saddle points with random starting points. In this paper, we answer a question: can the nonconvex heavy-ball algorithms with random initialization avoid saddle points? The answer is yes! Directly using the existing proof technique for the heavy-ball algorithms is hard because each iteration of the heavy-ball algorithm involves both the current and the last points. It is impossible to formulate the algorithms as an iteration like $x^{k+1} = g(x^k)$ under some mapping $g$. To this end, we design a new mapping on a new space. With some transformations, the heavy-ball algorithm can be interpreted as iterations of this mapping. Theoretically, we prove that heavy-ball gradient descent enjoys a larger stepsize than gradient descent to escape saddle points. The heavy-ball proximal point algorithm is also considered; we prove that this algorithm can also always escape saddle points.



1 Introduction

We consider the smooth nonconvex optimization problem

$$\min_{x \in \mathbb{R}^N} f(x), \tag{1}$$

where $f$ is a differentiable closed function whose gradient is $L$-Lipschitz continuous ($f$ may be nonconvex). We further assume that $f$ is $C^2$ over $\mathbb{R}^N$, that is, $f$ is twice differentiable and $\nabla^2 f$ is continuous. This paper is devoted to the study of two heavy-ball algorithms for minimizing $f$ with random initializations. Specifically, we focus on whether these two algorithms escape saddle points.

1.1 Heavy-ball Algorithms

We consider the Heavy-Ball Gradient Descent (HBGD) with random initializations presented as

$$x^{k+1} = x^k - \gamma \nabla f(x^k) + \beta (x^k - x^{k-1}), \tag{2}$$

where $\gamma$ is the stepsize and $\beta$ is the inertial parameter. If $\beta = 0$, the algorithm reduces to basic gradient descent. The heavy-ball method is also known as the momentum method and has been widely used to accelerate the gradient descent iteration. Theoretically, it has been proved that HBGD enjoys a better convergence factor than both gradient descent and Nesterov's accelerated gradient method, with linear convergence rates, under the condition that the objective function is twice continuously differentiable, strongly convex, and has a Lipschitz continuous gradient. Under the convex and smooth assumption on the objective function, the ergodic $O(1/k)$ rate in terms of the objective value, i.e., the rate of the objective gap at the averaged iterates, is proved by [Ghadimi et al.2015]. HBGD was also proved to converge linearly in the strongly convex case by [Ghadimi et al.2015], but the authors used a somewhat restrictive assumption on the parameter $\beta$, which leaves only a small range of admissible values for $\beta$ when the strong convexity constant is tiny. Due to the inertial term $\beta (x^k - x^{k-1})$, HBGD breaks the Fejér monotonicity that gradient descent obeys. Consequently, it is difficult to prove its non-ergodic convergence rates. To this end, [Sun et al.2019] designed a novel Lyapunov analysis and derived better convergence results, such as a non-ergodic sublinear convergence rate and a larger stepsize for linear convergence. [Combettes and Glaudin2017] developed and studied heavy-ball methods for operator research. In the nonconvex case, if the objective function is nonconvex but Lipschitz differentiable, [Zavriev and Kostyuk1993] proved that the sequence generated by heavy-ball gradient descent converges to a critical point, but without specifying the related convergence rates. Under a semi-algebraic assumption on the objective function, [Ochs et al.2014] proved that the sequence generated by HBGD converges to some critical point. Actually, [Ochs et al.2014] also extended HBGD to a nonsmooth composite optimization problem. [Xu and Yin2013] developed and analyzed the heavy-ball algorithm for tensor minimization problems. [Loizou and Richtárik2017a, Loizou and Richtárik2017b] introduced stochastic versions of heavy-ball algorithms and their relations with existing methods.
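As an illustration of HBGD with random initialization, the following minimal sketch runs the update $x^{k+1} = x^k - \gamma \nabla f(x^k) + \beta (x^k - x^{k-1})$ (assuming this standard heavy-ball form) on a toy objective. The test function, the parameter values, and the names `grad_f` and `hbgd` are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def grad_f(x):
    # Gradient of the illustrative objective (an assumption, not from the paper):
    #   f(x) = 0.5 * x[0]**2 + 0.25 * x[1]**4 - 0.5 * x[1]**2
    # It has a strict saddle at (0, 0) and minimizers at (0, 1) and (0, -1).
    return np.array([x[0], x[1] ** 3 - x[1]])

def hbgd(grad, x_prev, x_curr, gamma, beta, num_iters):
    """Heavy-ball gradient descent started from two given points."""
    for _ in range(num_iters):
        x_next = x_curr - gamma * grad(x_curr) + beta * (x_curr - x_prev)
        x_prev, x_curr = x_curr, x_next
    return x_curr

# Two random starting points drawn near the saddle at the origin.
rng = np.random.default_rng(0)
x_prev = rng.uniform(-0.5, 0.5, size=2)
x_curr = rng.uniform(-0.5, 0.5, size=2)

x_final = hbgd(grad_f, x_prev, x_curr, gamma=0.1, beta=0.5, num_iters=5000)
print("final iterate:", x_final)
```

In this toy run the iterates leave the neighborhood of the saddle and settle at one of the two minimizers; which one is reached depends on the random draw.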

On the other hand, [Alvarez2000] studied a second-order ODE whose implicit discretization yields the Heavy-Ball Proximal Point Algorithm (HBPPA), mathematically described as

$$x^{k+1} = \mathrm{Prox}_{\gamma f}\big(x^k + \beta (x^k - x^{k-1})\big), \tag{3}$$

which accelerates the proximal point algorithm. In fact, the setting $\beta = 0$ reduces the scheme to the basic proximal point algorithm. The convergence of HBPPA is studied by [Alvarez2000] under several assumptions on the parameters $\gamma$ and $\beta$ when $f$ is convex. Noticing that $\nabla f$ is actually a maximal monotone operator in the convex case, [Alvarez and Attouch2001] extended HBPPA to solve the inclusion problem. The inexact HBPPA was proposed and studied by [Moudafi and Elizabeth2003]. Existing works on HBPPA are all based on the convexity assumption on $f$. In this paper, we present the convergence of nonconvex HBPPA, including the gradient convergence under the smoothness assumption and the sequence convergence under the semi-algebraic assumption.
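A minimal sketch of the HBPPA iteration is given below, assuming the update $x^{k+1} = \mathrm{Prox}_{\gamma f}(x^k + \beta (x^k - x^{k-1}))$ and the usual definition of the proximal map as the minimizer of $f(y) + \|y - v\|^2 / (2\gamma)$. The toy objective, the Newton-based proximal solver, and all names and parameter values are illustrative assumptions rather than the authors' implementation. For this particular objective and $\gamma < 1$, the proximal subproblem is strongly convex, so solving the optimality condition $y + \gamma \nabla f(y) = v$ gives the proximal point.

```python
import numpy as np

def grad_f(x):
    # Toy objective (assumption): f(x) = 0.5*x[0]**2 + 0.25*x[1]**4 - 0.5*x[1]**2,
    # with a strict saddle at the origin and minimizers at (0, 1) and (0, -1).
    return np.array([x[0], x[1] ** 3 - x[1]])

def hess_diag_f(x):
    # The Hessian of the toy objective is diagonal; return its diagonal.
    return np.array([1.0, 3.0 * x[1] ** 2 - 1.0])

def prox(v, gamma, newton_steps=50):
    """Solve y + gamma * grad_f(y) = v coordinate-wise with Newton's method."""
    y = v.copy()
    for _ in range(newton_steps):
        residual = y + gamma * grad_f(y) - v
        y = y - residual / (1.0 + gamma * hess_diag_f(y))
    return y

def hbppa(x_prev, x_curr, gamma, beta, num_iters):
    """Heavy-ball proximal point iteration started from two given points."""
    for _ in range(num_iters):
        x_next = prox(x_curr + beta * (x_curr - x_prev), gamma)
        x_prev, x_curr = x_curr, x_next
    return x_curr

rng = np.random.default_rng(0)
x_prev = rng.uniform(-0.5, 0.5, size=2)
x_curr = rng.uniform(-0.5, 0.5, size=2)

x_final = hbppa(x_prev, x_curr, gamma=0.5, beta=0.3, num_iters=200)
print("final iterate:", x_final)
```

Setting `beta=0.0` in this sketch gives the plain proximal point iteration on the same toy problem.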

The performance of HBGD and HBPPA in convex settings is outside the scope of this paper. What we care about is whether these two algorithms can escape saddle points when $f$ is nonconvex. This paper answers this question.

1.2 First-order Algorithms Escape Saddle Points

In the community of nonconvex continuous optimization, saddle points, which are usually unavoidable and outnumber local minima, have long been regarded as a major obstacle. In many applications like tensor decomposition [Ge et al.2015], matrix completion [Ge et al.2016] and dictionary learning [Sun et al.2017], all local optima are also globally optimal. In some networks, under several conditions, local optima are close to the global optima [Chaudhari et al.2016]. Although these conditions are barely satisfied by many networks, the conclusion coincides with existing experimental observations. Thus, if all local minimizers are relatively good, we just need to push the iterations to escape the saddle points. How to escape saddle points has recently been a hot topic for first-order optimization algorithms. [Pemantle1990] established the convergence of the Robbins-Monro stochastic approximation to local minimizers by introducing sufficiently large unbiased noise. For tensor problems, [Ge et al.2015] provided quantitative rates on the convergence of noisy gradient descent. For general smooth nonconvex functions, the analysis of noisy gradient descent is established by [Jin et al.2017a]. In a later paper, [Jin et al.2017b] considered noisy accelerated gradient descent and established the related convergence theory. Another technique to escape saddle points is to use random initializations. It has been shown that, even without any random perturbations, a good initialization can make gradient descent converge to the global minimum for various nonconvex problems like matrix factorization [Keshavan et al.2010], phase retrieval [Cai et al.2016], and dictionary learning [Arora et al.2015]. [Lee et al.2016, Lee et al.2017] proved that plenty of first-order nonconvex optimization algorithms with random initializations converge to local minima with probability 1. From the mathematical perspective, the core techniques consist of two parts: 1) the algorithms are reformulated as mapping iterations; 2) the stable manifold theorem [Shub2013] is employed. In this paper, we establish theoretical guarantees for heavy-ball algorithms escaping saddle points.

1.3 Difficulty in Analyzing Heavy-ball Iterations

The difficulty of studying the performance of heavy-ball algorithms with random initializations lies in how to reformulate the algorithm as iterations under a mapping. Taking gradient descent as an example, one can use $g(x) := x - \gamma \nabla f(x)$ with $\gamma$ being the stepsize. Gradient descent can then be represented as $x^{k+1} = g(x^k)$. Constructing the mapping for gradient descent is obvious and easy. But for heavy-ball algorithms, the construction is quite different due to the inertial term $\beta (x^k - x^{k-1})$. The point $x^{k+1}$ is generated by both the current variable $x^k$ and the last one $x^{k-1}$. Thus, it is impossible to find a mapping $g$ that rewrites the heavy-ball iteration as $x^{k+1} = g(x^k)$.

1.4 Contribution

Because the analysis of nonconvex HBPPA is still missing, we establish the related theory in this paper. With recently developed techniques in variational analysis, the sequence convergence of nonconvex HBPPA is also studied. Our main interest, however, is the analysis of HBGD and HBPPA with random initializations. We prove guarantees for both algorithms to escape saddle points under mild assumptions. The existing method cannot be applied directly to heavy-ball algorithms. Thus, we propose a novel construction technique. It can be proved that heavy-ball gradient descent enjoys a larger stepsize than gradient descent to escape saddle points with random starting points, which to some extent indicates the advantage of heavy-ball methods.

2 Preliminaries

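In this part, we introduce several basic definitions and tool lemmas. For a closed function $f$ (not necessarily convex), the proximal map is defined as

$$\mathrm{Prox}_{\gamma f}(x) \in \arg\min_{y} \Big\{ f(y) + \frac{\|y - x\|^2}{2\gamma} \Big\}. \tag{4}$$

Under the condition that $f$ is differentiable, if $y \in \mathrm{Prox}_{\gamma f}(x)$, the K.K.T. condition yields that $\gamma \nabla f(y) + y - x = 0$. Given another arbitrary point $z$, it holds that

$$f(y) + \frac{\|y - x\|^2}{2\gamma} \le f(z) + \frac{\|z - x\|^2}{2\gamma}. \tag{5}$$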
For matrices and , denotes that there exist perturbation matrices such that . We use and to denote the

unit and zero matrix, respectively.

is determinant of matrix . It is easy to see if , .

Definition 1 (Strict Saddle)

We present the following two definitions,

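  1. A point $x^*$ is a critical point of $f$ if $\nabla f(x^*) = 0$. $x^*$ is isolated if there is a neighborhood $U$ around $x^*$, and $x^*$ is the only critical point in $U$.

  2. A point $x^*$ is a strict saddle point of $f$ if $x^*$ is a critical point and $\lambda_{\min}(\nabla^2 f(x^*)) < 0$. Let $\chi_f$ denote the set of strict saddle points of function $f$.

Let $g: \mathcal{X} \to \mathcal{X}$ be a mapping; its domain is defined as $\mathrm{dom}(g) := \{x : g(x) \neq \emptyset\}$. And $g^k$ denotes the $k$-fold composition of $g$. Our interest is the sequence generated by

$$x^k = g(x^{k-1}) = g^k(x^0). \tag{6}$$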

Definition 2 (Global Stable Set)

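The global stable set of the strict saddles is the set of initial conditions where iteration of the mapping $g$ converges to a strict saddle, i.e., $W_g := \{x^0 : \lim_{k} g^k(x^0) \in \chi_f\}$, where $\chi_f$ is the set of strict saddle points of $f$. For an iterative algorithm, $W_g = \{x^0 : \lim_{k} x^k \in \chi_f\}$, where $(x^k)_{k \ge 0}$ is generated by $x^k = g(x^{k-1})$ with starting point $x^0$.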

Definition 3 (Unstable fixed point)

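The differential of the mapping $g$, denoted as $Dg(x)$, is a linear operator from $T(x)$ to $T(g(x))$, where $T(x)$ is the tangent space of the domain $\mathcal{X}$ of $g$ at point $x$. Given a curve $c$ in $\mathcal{X}$ with $c(0) = x$ and $\frac{dc}{dt}(0) = v \in T(x)$, the linear operator is defined as $Dg(x) v = \frac{d (g \circ c)}{dt}(0) \in T(g(x))$. Let

$$A_g^* := \big\{ x : g(x) = x, \ \max_i |\lambda_i(Dg(x))| > 1 \big\}$$

be the set of fixed points where the differential has at least a single real eigenvalue with magnitude greater than one. These are the unstable fixed points.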

With previous definitions, the authors in [Lee et al.2017] presented the following result. The core technique is employing the stable manifold theorem given in [Shub2013].

Lemma 1

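Let $g$ be a $C^1$ mapping and $\det(Dg(x)) \neq 0$ for all $x$. Assume the sequence $(x^k)_{k \ge 0}$ is generated by scheme (6). Then the set of initial points that converge to an unstable fixed point has measure zero, $\mu(\{x^0 : \lim_k x^k \in A_g^*\}) = 0$, where $A_g^*$ is the set of unstable fixed points of $g$. Further, if the set of strict saddle points satisfies $\chi_f \subseteq A_g^*$, the global stable set $W_g$ satisfies $\mu(W_g) = 0$.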

3 Main Results

In this section, we present the theoretical guarantees to escape the saddle points for the heavy-ball algorithms presented before. We need a smooth assumption on the objective function.

Assumption 1

The objective function $f$ is twice differentiable and $\|\nabla^2 f(x)\| \le L$, where $\|\cdot\|$ denotes the spectral norm of a matrix. Moreover, the objective function satisfies $\inf_x f(x) > -\infty$.

The smooth assumption is used to derive the differential of the designed mapping. And the lower bounded assumption is used to provide a lower bound for the descent variables.

Lemma 2

Let $(x^k)_{k \ge 0}$ be generated by the HBGD (2) with $0 < \beta < 1$. If the stepsize is selected as

$$0 < \gamma < \frac{2(1 - \beta)}{L},$$
we have and .

It is easy to see that if $0 < \beta < 1/2$, the upper bound of the stepsize is larger than $1/L$ for HBGD with random initializations. And the upper bound is close to $2/L$ provided that $\beta$ is sufficiently small. [Lee et al.2016] proved that the stepsize for GD with random initializations is required to be smaller than $1/L$. Lemma 2 means that a larger stepsize can be used under the heavy-ball scheme, which to some extent demonstrates the advantage of the heavy-ball method.
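The stepsize comparison can be made concrete with a few lines of arithmetic. The sketch below assumes that the HBGD bound of Lemma 2 reads $\gamma < 2(1 - \beta)/L$, which is the reading consistent with the remarks above, and uses the $1/L$ bound for gradient descent from [Lee et al.2016]. The function names and the value of $L$ are hypothetical.

```python
def gd_stepsize_bound(lipschitz):
    """Upper bound 1/L on the GD stepsize used in [Lee et al.2016]."""
    return 1.0 / lipschitz

def hbgd_stepsize_bound(lipschitz, beta):
    """Assumed HBGD upper bound 2(1 - beta)/L for an inertial parameter in (0, 1)."""
    if not 0.0 < beta < 1.0:
        raise ValueError("beta must lie strictly between 0 and 1")
    return 2.0 * (1.0 - beta) / lipschitz

lipschitz = 4.0  # hypothetical Lipschitz constant of the gradient
for beta in (0.05, 0.25, 0.5, 0.75):
    hb = hbgd_stepsize_bound(lipschitz, beta)
    gd = gd_stepsize_bound(lipschitz)
    print(f"beta={beta:.2f}: HBGD bound={hb:.4f}, GD bound={gd:.4f}, larger={hb > gd}")
```

Under this assumed bound, the heavy-ball stepsize range is wider than that of gradient descent exactly when $\beta < 1/2$, and it shrinks below the GD range for larger $\beta$.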

The difficulty of the analysis for HBGD has been presented in Sec. 1: it is impossible to find a mapping $g$ such that $x^{k+1} = g(x^k)$ directly. Noticing that the $k$th iteration involves $x^k$ and $x^{k-1}$, we turn to combining these two points together, i.e., $z^k := (x^k, x^{k-1})$. The task then turns to finding a mapping $F$ such that $z^{k+1} = F(z^k)$. The mapping $F$ is from $\mathbb{R}^{2N}$ to $\mathbb{R}^{2N}$. After building the relations between the unstable fixed points of $F$ and the strict saddle points, the theorem can then be proved. Different from the existing results in [Lemma 1, [Sun et al.2019]], $\beta \neq 0$ is required here. This is to make sure that the differential of the constructed mapping in the proofs is invertible. Note that the differential is non-symmetric.
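The bundling idea can be sketched numerically. The code below assumes the natural choice of stacked map $F(x, y) = (x - \gamma \nabla f(x) + \beta (x - y), \, x)$ on pairs (current point, last point); the exact form used in the proofs may differ in notation. It checks that iterating this single map reproduces the two-point HBGD recursion. The toy gradient and all names are illustrative assumptions.

```python
import numpy as np

def grad_f(x):
    # Toy gradient (assumption): f(x) = 0.5*x[0]**2 + 0.25*x[1]**4 - 0.5*x[1]**2.
    return np.array([x[0], x[1] ** 3 - x[1]])

def hbgd_step(x_prev, x_curr, gamma, beta):
    """One heavy-ball step written with the current and the last point."""
    return x_curr - gamma * grad_f(x_curr) + beta * (x_curr - x_prev)

def stacked_map(z, gamma, beta):
    """Assumed map on the enlarged space: (x, y) -> (x - gamma*grad + beta*(x - y), x)."""
    n = z.size // 2
    x, y = z[:n], z[n:]
    return np.concatenate([x - gamma * grad_f(x) + beta * (x - y), x])

gamma, beta = 0.1, 0.5
rng = np.random.default_rng(1)
x_prev = rng.uniform(-1.0, 1.0, size=2)
x_curr = rng.uniform(-1.0, 1.0, size=2)
z = np.concatenate([x_curr, x_prev])  # stacked point (current, last)

for _ in range(50):
    x_prev, x_curr = x_curr, hbgd_step(x_prev, x_curr, gamma, beta)
    z = stacked_map(z, gamma, beta)

# The first block of z tracks the current iterate, the second block the last one.
same_iterates = np.allclose(z[:2], x_curr) and np.allclose(z[2:], x_prev)
print("stacked map reproduces HBGD:", same_iterates)
```

Once the algorithm is a plain fixed-point iteration of one map on the doubled space, tools stated for iterations of the form $x^{k+1} = g(x^k)$ become applicable.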

Lemma 2 actually means that, with random initialization, the probability that $(x^k)_{k \ge 0}$ converges to a strict saddle point is zero, that is, HBGD always escapes saddle points provided the stepsize is set properly. Lemma 2 only says that $(x^k)_{k \ge 0}$ will not fall into a saddle point; whether the sequence converges or not is out of the scope of this lemma. In the nonconvex optimization community, such a problem is answered by imposing the semi-algebraicity assumption on the objective function [Łojasiewicz1993, Kurdyka1998] or the isolation assumption on the critical points [Lange2013]. Thus, we can derive the following result.

Theorem 1

Assume that the conditions of Lemma 2 hold and $f$ is coercive (we say a function $f$ is coercive if $f(x) \to +\infty$ as $\|x\| \to +\infty$). Suppose the two starting points are totally random and one of the following conditions is satisfied.

  1. $f$ is semi-algebraic.

  2. All critical points of $f$ are isolated points.

Then, for HBGD, $\lim_{k} x^k$ exists and that limit is a local minimizer of $f$ with probability 1.

Now, we study the HBPPA. First, we provide the convergence in the nonconvex settings.

Lemma 3

Let be generated by scheme (3) with and . Then, we have . Further more, if the function is semi-algebraic and coercive, is convergent to some critical point .

To obtain the sequence convergence of HBPPA, coercivity is needed besides semi-algebraicity; it guarantees the boundedness of the generated sequence. In Lemma 3, $f$ is assumed to be differentiable. Actually, the smoothness assumption can be removed. In that case, we need the notion of subdifferentials of closed functions [Rockafellar and Wets2009], which is frequently used in variational analysis; in the smooth case, the subdifferential reduces to the gradient. With the same parameters, the sequence of HBPPA can still be proved to converge to a critical point of $f$.

We are prepared to present the guarantee for HBPPA to escape the saddle point.

Lemma 4

Let be generated by scheme (3) with , if the stepsize is selected as

we have and .

For HBPPA, the stepsize requirement coincides with the existing result in [Lee et al.2017]. The main idea of the proof of Lemma 4 is similar to that of Lemma 2, but the details are more complicated.

Theorem 2

Assume that the conditions of Lemma 4 hold and $f$ is coercive. Suppose the two starting points are totally random and one of the following conditions is satisfied.

  1. $f$ is semi-algebraic.

  2. All critical points of $f$ are isolated points.

Then, for HBPPA, $\lim_{k} x^k$ exists and that limit is a local minimizer of $f$ with probability 1.

4 Proofs

This section collects the proofs for this paper.

4.1 Proof of Lemma 2

To guarantee the convergence of HBGD, from [Lemma 1,[Sun et al.2019]], we need and . Let and , denote a map as

It is easy to check that if . We further denote It can be easily verified that the Heavy-ball methods can be reformulated as


To use Lemma 1, we turn to the proofs of two facts:

  1. for all .

  2. .

Proof of fact 1: Direct calculations give us

We use the short-hand notation We can obtain the following relation

Therefore, we can derive for any .

Proof of fact 2: For any being a strict saddle, with the symmetry of , then

For and   333We have proved that is nonsingular., denoting , we consider

After simplifications, we can get

That indicates Denote function . All the eigenvalues of are the roots of the equation above. We exploit the two roots of It is easy to check that which means this equation has two real roots denoted as . Noting , that is We can see that is increasing on . With the fact , is increasing on . Thus, there exists some real number such that . Thus, we get i.e., . That is also

Using Lemma 1, and then

On the other hand, due to the nonnegativity of the measurement,
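The two facts can be checked numerically on a small example. The sketch below assumes that the differential of the stacked heavy-ball map has the block form $\begin{pmatrix} (1+\beta) I - \gamma \nabla^2 f & -\beta I \\ I & 0 \end{pmatrix}$, which follows from differentiating $(x, y) \mapsto (x - \gamma \nabla f(x) + \beta (x - y), \, x)$; the Hessian, the parameter values, and the function name are illustrative assumptions.

```python
import numpy as np

def stacked_jacobian(hessian, gamma, beta):
    """Assumed differential of (x, y) -> (x - gamma*grad_f(x) + beta*(x - y), x)."""
    n = hessian.shape[0]
    identity = np.eye(n)
    top = np.hstack([(1.0 + beta) * identity - gamma * hessian, -beta * identity])
    bottom = np.hstack([identity, np.zeros((n, n))])
    return np.vstack([top, bottom])

# Hessian at a strict saddle: one positive and one negative eigenvalue, so L = 1.
hessian = np.diag([1.0, -1.0])
gamma, beta = 0.5, 0.3  # satisfies gamma < 2 * (1 - beta) / L = 1.4

jacobian = stacked_jacobian(hessian, gamma, beta)
eigenvalues = np.linalg.eigvals(jacobian)
spectral_radius = np.max(np.abs(eigenvalues))
determinant = np.linalg.det(jacobian)

print("eigenvalues:", eigenvalues)
print("largest magnitude:", spectral_radius)
print("determinant:", determinant, "beta**N:", beta ** hessian.shape[0])
```

In this example the determinant equals $\beta^N$, so it is nonzero precisely because $\beta \neq 0$, and the negative Hessian eigenvalue produces an eigenvalue of the differential with magnitude greater than one, which makes the saddle an unstable fixed point of the stacked map.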

4.2 Proof of Theorem 1

The first case has been proved in [Ochs et al.2014], in which $(x^k)_{k \ge 0}$ is convergent to some critical point $x^*$. For the second case, [Proposition 12.4.1, [Lange2013]] and equation (10) mean that the stationary set of $(x^k)_{k \ge 0}$ is finite. Otherwise, the stationary set is connected, which contradicts the fact that all critical points are isolated. Using [Proposition 12.4.1, [Lange2013]] once more, the stationary set has only one element $x^*$; and then, $\lim_{k} x^k = x^*$.

Therefore, in both cases, it can be proved that $(x^k)_{k \ge 0}$ converges to some critical point $x^*$. With Lemma 2, $x^*$ is a local minimizer with probability 1 if the starting points are random. The theorem is then proved.

4.3 Proof of Lemma 3

This subsection contains two parts. Part 1 is to prove the gradient convergence, while Part 2 contains the sequence convergence proof.

Part 1. In (5), by substituting , , , ,

After simplifications, we then get


Denote a novel function as

and sequence . Thus, inequality (4.3) offers the descent as


With the fact , . The setting gives


On the other hand, from the property of the proximal map,


which can derive


Part 2. We prove that is coercive provided is coercive. If this claim fails to hold, there exists some such that

With the coercivity of , there exists some such that . On the other hand, is bounded due to the lower boundedness of ; thus, is bounded, which contradicts the fact .

From (9), we see that is bounded; and then, is bounded. The descent property of and the continuity of function directly give that


That means the sequence has a stationary point ; (4.3) means admits the critical point of . It is easy to see is the critical point of . Without loss of generality, we assume as .

Noting polynomial functions are semi-algebraic and the fact that sum of semi-algebraic functions is still semi-algebraic, is then semi-algebraic. Denote the set of all stationary points of as . From [Lemma 3.6, [Bolte et al.2014]], there exist and a continuous concave function such that

  1. ; is on ; for all , .

  2. for , it holds that


For the given above, as is large enough, will fall into the set . The concavity of gives


where , and uses 9, and comes from (14). The gradient of satisfies


where depends on (11) and, we used the fact . Combining (4.3) and (4.3),


where employs the Schwarz inequality with , and . Summing both sides of (4.3) from sufficiently large to , we have


Letting , from (10), (13), and the continuity of , (4.3) then yields We are then led to That means is convergent to some point. With the fact, is a stationary point, .

4.4 Proof of Lemma 4

In this proof, we denote the mapping as

To calculate the differential of , we denote

It is easy to see that


The definition of the proximal map then gives


By implicit differentiation on variable , we can get

Due to that , is invertible. We use a shorthand notation

For , by the same procedure, then we can derive


With (19) and (21), we are then led to


That means for any , . On the other hand, if is a strict saddle, , For , denoting , we consider

After simplifications, we can derive

With direct computations, we turn to

We consider the following equation

Direct calculations give

Thus, the equation enjoys two real roots denoted by . It is easy to see that . Noticing that and , we have . That means thus, . Consequently, the proof is proved by Lemma 1.

4.5 Proof of Theorem 2

This proof is similar to the one of Theorem 1 and will not be reproduced.

5 Conclusion

In this paper, we proved that HBGD and HBPPA always escape saddle points with random initializations. This paper also established the convergence of nonconvex HBPPA. The core part of the proofs is bundling the current and the last points as one point in an enlarged space. The heavy-ball algorithms can then be represented as an iteration of a mapping. An interesting finding is that HBGD can enjoy a larger stepsize than gradient descent to escape saddle points.


This work is sponsored in part by National Key R&D Program of China (2018YFB0204300), and Major State Research Development Program (2016YFB0201305), and National Natural Science Foundation of Hunan Province in China (2018JJ3616), and National Natural Science Foundation for the Youth of China (61602166), and Natural Science Foundation of Hunan (806175290082), and Natural Science Foundation of NUDT (ZK18-03-01), and National Natural Science Foundation of China (11401580).


  • [Alvarez and Attouch2001] Felipe Alvarez and Hedy Attouch. An inertial proximal method for maximal monotone operators via discretization of a nonlinear oscillator with damping. Set-Valued Analysis, 9(1-2):3–11, 2001.
  • [Alvarez2000] Felipe Alvarez. On the minimizing property of a second order dissipative system in hilbert spaces. SIAM Journal on Control and Optimization, 38(4):1102–1119, 2000.
  • [Arora et al.2015] Sanjeev Arora, Rong Ge, Tengyu Ma, and Ankur Moitra. Simple, efficient, and neural algorithms for sparse coding. 2015.
  • [Bolte et al.2014] Jérôme Bolte, Shoham Sabach, and Marc Teboulle. Proximal alternating linearized minimization for nonconvex and nonsmooth problems. Mathematical Programming, 146(1-2):459–494, 2014.
  • [Cai et al.2016] T Tony Cai, Xiaodong Li, and Zongming Ma. Optimal rates of convergence for noisy sparse phase retrieval via thresholded wirtinger flow. The Annals of Statistics, 44(5):2221–2251, 2016.
  • [Chaudhari et al.2016] Pratik Chaudhari, Anna Choromanska, Stefano Soatto, Yann LeCun, Carlo Baldassi, Christian Borgs, Jennifer Chayes, Levent Sagun, and Riccardo Zecchina. Entropy-sgd: Biasing gradient descent into wide valleys. arXiv preprint arXiv:1611.01838, 2016.
  • [Combettes and Glaudin2017] Patrick L. Combettes and Lilian E. Glaudin. Quasinonexpansive iterations on the affine hull of orbits: From mann’s mean value algorithm to inertial methods. Siam Journal on Optimization, 27(4), 2017.
  • [Ge et al.2015] Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points-online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
  • [Ge et al.2016] Rong Ge, Jason D Lee, and Tengyu Ma. Matrix completion has no spurious local minimum. In Advances in Neural Information Processing Systems, pages 2973–2981, 2016.
  • [Ghadimi et al.2015] Euhanna Ghadimi, Hamid Reza Feyzmahdavian, and Mikael Johansson. Global convergence of the heavy-ball method for convex optimization. In Control Conference (ECC), 2015 European, pages 310–315. IEEE, 2015.
  • [Jin et al.2017a] Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR. org, 2017.
  • [Jin et al.2017b] Chi Jin, Praneeth Netrapalli, and Michael I Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017.
  • [Keshavan et al.2010] Raghunandan H Keshavan, Andrea Montanari, and Sewoong Oh. Matrix completion from a few entries. IEEE transactions on information theory, 56(6):2980–2998, 2010.
  • [Kurdyka1998] Krzysztof Kurdyka. On gradients of functions definable in o-minimal structures. In Annales de l’institut Fourier, volume 48, pages 769–784. Chartres: L’Institut, 1950-, 1998.
  • [Lange2013] Kenneth Lange. Elementary optimization. In Optimization, pages 1–21. Springer, 2013.
  • [Lee et al.2016] Jason D Lee, Max Simchowitz, Michael I Jordan, and Benjamin Recht. Gradient descent only converges to minimizers. In Conference on Learning Theory, pages 1246–1257, 2016.
  • [Lee et al.2017] Jason D Lee, Ioannis Panageas, Georgios Piliouras, Max Simchowitz, Michael I Jordan, and Benjamin Recht. First-order methods almost always avoid saddle points. To appear in Mathematical Programming, 2017.
  • [Loizou and Richtárik2017a] Nicolas Loizou and Peter Richtárik. Linearly convergent stochastic heavy ball method for minimizing generalization error. arXiv preprint arXiv:1710.10737, 2017.
  • [Loizou and Richtárik2017b] Nicolas Loizou and Peter Richtárik. Momentum and stochastic momentum for stochastic gradient, newton, proximal point and subspace descent methods. arXiv preprint arXiv:1712.09677, 2017.
  • [Łojasiewicz1993] Stanislas Łojasiewicz. Sur la géométrie semi-et sous-analytique. Ann. Inst. Fourier, 43(5):1575–1595, 1993.
  • [Moudafi and Elizabeth2003] Abdellatif Moudafi and E Elizabeth. Approximate inertial proximal methods using the enlargement of maximal monotone operators. International Journal of Pure and Applied Mathematics, 5(3):283–299, 2003.
  • [Ochs et al.2014] Peter Ochs, Yunjin Chen, Thomas Brox, and Thomas Pock. ipiano: Inertial proximal algorithm for nonconvex optimization. SIAM Journal on Imaging Sciences, 7(2):1388–1419, 2014.
  • [Pemantle1990] Robin Pemantle. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, 18(2):698–712, 1990.
  • [Rockafellar and Wets2009] R Tyrrell Rockafellar and Roger J-B Wets. Variational analysis, volume 317. Springer Science & Business Media, 2009.
  • [Shub2013] Michael Shub. Global stability of dynamical systems. Springer Science & Business Media, 2013.
  • [Sun et al.2017] Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere i: Overview and the geometric picture. IEEE Transactions on Information Theory, 63(2):853–884, 2017.
  • [Sun et al.2019] Tao Sun, Penghang Yin, Dongsheng Li, Chun Huang, Lei Guan, and Hao Jiang. Non-ergodic convergence analysis of heavy-ball algorithms. AAAI, 2019.
  • [Xu and Yin2013] Yangyang Xu and Wotao Yin. A block coordinate descent method for regularized multiconvex optimization with applications to nonnegative tensor factorization and completion. SIAM Journal on imaging sciences, 6(3):1758–1789, 2013.
  • [Zavriev and Kostyuk1993] SK Zavriev and FV Kostyuk. Heavy-ball method in nonconvex optimization problems. Computational Mathematics and Modeling, 4(4):336–341, 1993.