Machine learning has stimulated interest in obtaining global convergence rates in non-convex optimization. Consider a possibly non-convex objective function $f \colon \mathbb{R}^d \to \mathbb{R}$. We want to solve
$$\min_{x \in \mathbb{R}^d} f(x). \qquad (1)$$
This is hard in general. Instead, we usually settle for approximate first-order critical (or stationary) points where the gradient is small, or second-order critical (or stationary) points where the gradient is small and the Hessian is nearly positive semidefinite.
One of the simplest algorithms for solving (1) is gradient descent (GD): given $x_0 \in \mathbb{R}^d$, iterate
$$x_{t+1} = x_t - \eta \nabla f(x_t). \qquad (2)$$
It is well known that if $\nabla f$ is Lipschitz continuous, then with an appropriate step size $\eta$, GD converges to first-order critical points. However, it may take exponential time to escape saddle points, that is, to reach an approximate second-order critical point du2017gradient . There is an increasing amount of evidence that saddle points are a serious obstacle to the practical success of local optimization algorithms such as GD Pascanu2014 ; Ge2015 . This calls for algorithms which provably escape saddle points efficiently. We focus on methods which access $f$ and $\nabla f$ only through a black-box model.
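For concreteness, the GD iteration (2) can be sketched in a few lines of NumPy; the test function and step size below are illustrative choices of ours, not the tuned parameters discussed later.

```python
import numpy as np

def gradient_descent(grad_f, x0, step_size, num_iters):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad f(x_t)."""
    x = np.asarray(x0, dtype=float)
    for _ in range(num_iters):
        x = x - step_size * grad_f(x)
    return x

# On f(x) = ||x||^2 / 2 (so grad f(x) = x), GD contracts toward the minimizer 0.
x_final = gradient_descent(lambda x: x, np.array([1.0, -2.0]), 0.5, 60)
```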
Several methods add noise to GD iterates in order to escape saddle points faster, under the assumption that $f$ has $\ell$-Lipschitz continuous gradient and $\rho$-Lipschitz continuous Hessian. In this setting, an $\epsilon$-second-order critical point is a point $x$ satisfying $\|\nabla f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 f(x)) \ge -\sqrt{\rho \epsilon}$. Under the strict saddle assumption, with $\epsilon$ small enough, such points are near (local) minimizers Ge2015 ; Jina2017 .
In 2015, Ge et al. Ge2015 gave a variant of stochastic gradient descent (SGD) which adds isotropic noise to iterates, showing it produces an $\epsilon$-second-order critical point with high probability in a number of stochastic gradient queries polynomial in the dimension $d$. In 2017, Jin et al. Jina2017 presented a variant of GD, perturbed gradient descent (PGD), which reduces this complexity to $\tilde O(\log^4(d)/\epsilon^2)$ full gradient queries. Recently, Jin et al. Jin2019 simplified their own analysis of PGD, and extended it to stochastic gradient descent.
Jin et al.’s PGD (Jin2019, Alg. 4) works as follows: If the gradient is large at iterate $x_t$, then perform a gradient descent step: $x_{t+1} = x_t - \eta \nabla f(x_t)$. If the gradient is small at iterate $x_t$, perturb $x_t$ by $\xi$, with $\xi$ sampled uniformly from a ball of fixed radius centered at zero. Starting from this new point $x_t + \xi$, perform $\mathscr{T}$ gradient descent steps, arriving at iterate $x_{t+\mathscr{T}}$. From here, repeat this procedure starting at $x_{t+\mathscr{T}}$. Crucially, Jin et al. Jin2019 show that, if $x_t$ is not an $\epsilon$-second-order critical point, then the function decreases enough from $x_t$ to $x_{t+\mathscr{T}}$ with high probability, leading to an escape.
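The structure just described can be sketched as follows. This is a simplified illustration, not Jin et al.'s exact algorithm: the parameter names (`g_thresh`, `radius`, `escape_steps`) are ours, and the carefully balanced parameter values from their analysis are omitted.

```python
import numpy as np

def perturbed_gd(grad_f, x0, eta, g_thresh, radius, escape_steps, max_iters, rng=None):
    """Simplified sketch of perturbed gradient descent (PGD).

    Large gradient: take a plain GD step. Small gradient: add a perturbation
    sampled uniformly from a ball, then take `escape_steps` GD steps."""
    rng = np.random.default_rng(rng)
    x = np.asarray(x0, dtype=float)
    d = x.size
    t = 0
    while t < max_iters:
        g = grad_f(x)
        if np.linalg.norm(g) > g_thresh:
            x = x - eta * g                      # ordinary GD step
            t += 1
        else:
            u = rng.standard_normal(d)           # uniform sample from a ball
            u *= radius * rng.random() ** (1.0 / d) / np.linalg.norm(u)
            x = x + u                            # perturb, then try to escape
            for _ in range(escape_steps):
                x = x - eta * grad_f(x)
            t += escape_steps
    return x
```

On a toy saddle such as $f(x, y) = x^2 - y^2 + y^4$, plain GD started on the $x$-axis stalls at the saddle at the origin, while the perturbation lets the iterates fall into one of the two minimizers at $(0, \pm 1/\sqrt{2})$.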
In this paper we generalize PGD to optimization problems on manifolds, i.e., problems of the form
$$\min_{x \in \mathcal{M}} f(x), \qquad (3)$$
where $\mathcal{M}$ is a Riemannian manifold and $f \colon \mathcal{M} \to \mathbb{R}$ is suitably smooth. Such problems arise in many applications, including computer vision (e.g., vision ) and signal processing (e.g., signal )—see apps for more. See saddle1 and saddle2 for examples of the strict saddle property on manifolds.
Given $x \in \mathcal{M}$, the gradient of $f$ at $x$, denoted $\operatorname{grad} f(x)$, is a vector in the tangent space $\mathrm{T}_x\mathcal{M}$. To perform gradient descent on a manifold, we need a way to move on the manifold along the direction of the gradient at $x$. This is provided by a retraction $R$: a smooth map from the tangent bundle to $\mathcal{M}$. Riemannian gradient descent (RGD) performs steps on $\mathcal{M}$ of the form
$$x_{t+1} = R_{x_t}(-\eta \operatorname{grad} f(x_t)). \qquad (4)$$
For Euclidean space, $\mathcal{M} = \mathbb{R}^d$, the standard retraction is $R_x(s) = x + s$, in which case (4) reduces to (2). For the sphere embedded in Euclidean space, $\mathcal{M} = \{x \in \mathbb{R}^d : \|x\| = 1\}$, a natural retraction is given by orthogonal projection to the sphere: $R_x(s) = (x + s)/\|x + s\|$.
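These two retractions are easy to implement. The sketch below uses the sphere retraction inside RGD (4); minimizing the Rayleigh quotient $x^\top A x$ over the sphere is an illustrative test problem of ours, not an example from the text.

```python
import numpy as np

def sphere_retraction(x, s):
    """Metric-projection retraction on the unit sphere: R_x(s) = (x + s) / ||x + s||."""
    y = x + s
    return y / np.linalg.norm(y)

def rgd_sphere(grad_fbar, x0, eta, num_iters):
    """Riemannian gradient descent on the sphere. The Riemannian gradient is the
    projection of the ambient (Euclidean) gradient onto the tangent space
    T_x = {v : <v, x> = 0}; each step then retracts back to the sphere."""
    x = np.asarray(x0, dtype=float)
    x = x / np.linalg.norm(x)
    for _ in range(num_iters):
        g = grad_fbar(x)
        g_tan = g - np.dot(g, x) * x       # Riemannian gradient at x
        x = sphere_retraction(x, -eta * g_tan)
    return x

# Minimizing f(x) = x^T A x over the sphere converges to an eigenvector
# associated with the smallest eigenvalue of A (here: the first basis vector).
A = np.diag([1.0, 2.0, 3.0])
x_min = rgd_sphere(lambda x: 2 * A @ x, np.ones(3), eta=0.1, num_iters=300)
```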
For $x \in \mathcal{M}$, define the pullback $\hat f_x = f \circ R_x \colon \mathrm{T}_x\mathcal{M} \to \mathbb{R}$. If $R$ is nice enough (details below), the gradient and Hessian of $\hat f_x$ at the origin of $\mathrm{T}_x\mathcal{M}$ equal the gradient and Hessian of $f$ at $x$. Since $\mathrm{T}_x\mathcal{M}$ is a vector space, if we perform GD on $\hat f_x$, we can almost directly apply Jin et al.’s analysis Jin2019 . This motivates the two-phase structure of our perturbed Riemannian gradient descent (PRGD), listed as Algorithm 1.
Our PRGD is a variant of RGD (4) and a generalization of PGD. It works as follows: If the gradient is large at iterate $x_t$, perform an RGD step: $x_{t+1} = R_{x_t}(-\eta \operatorname{grad} f(x_t))$. We call this a “step on the manifold.” If the gradient at iterate $x_t$ is small, then perturb in the tangent space $\mathrm{T}_{x_t}\mathcal{M}$. After this perturbation, execute at most $\mathscr{T}$ gradient descent steps on the pullback $\hat f_{x_t}$, in the tangent space. We call these “tangent space steps.” We denote this sequence of tangent space steps by $s_0, s_1, s_2, \ldots$ This sequence of steps is performed by TangentSpaceSteps: a deterministic, vector-space procedure—see Algorithm 1.
By distinguishing between gradient descent steps on the manifold and those in a tangent space, we can apply Jin et al.’s analysis Jin2019 almost directly, allowing us to prove that PRGD reaches an $\epsilon$-second-order critical point on $\mathcal{M}$ in a number of gradient queries that matches the Euclidean rate of PGD, polylogarithmic in the dimension $d$. The notion of approximate second-order critical point is here defined with respect to a notion of Lipschitz-type continuity of the Riemannian gradient and Hessian detailed below, as advocated in trustMan ; arcMan . The analysis is technically far simpler than if one runs all steps on the manifold. We expect that this two-phase approach may prove useful for the generalization of other algorithms and analyses from the Euclidean to the Riemannian realm.
Recently, Sun and Fazel Fazel2018 provided the first generalization of PGD to certain manifolds with a polylogarithmic complexity in the dimension. This improves on the earlier results by Ge et al. (Ge2015, App. B), which had a polynomial complexity. Both of these works focus on submanifolds of a Euclidean space, with the algorithm in Fazel2018 depending on the equality constraints chosen to describe this submanifold.
Concurrently with the present paper, Sun et al. Sun2019prgd improved their analysis to cover any complete Riemannian manifold with bounded sectional curvature. In contrast to ours, their algorithm executes all steps on the manifold. Their analysis requires the retraction to be the Riemannian exponential map (i.e., geodesics). Our regularity assumptions are similar but different: while we assume Lipschitz-type conditions on the pullbacks in small balls around the origins of tangent spaces, Sun et al. make Lipschitz assumptions on the cost function directly, using parallel transport and Riemannian distance. As a result, curvature appears in their results. We make no explicit assumptions on $\mathcal{M}$ regarding curvature or completeness, though these may be implicitly included in our regularity assumptions.
1.1 Main result
Here we state our result informally. Formal results are stated in subsequent sections.
Theorem 1.1 (Informal).
Let $\mathcal{M}$ be a Riemannian manifold of dimension $d$ equipped with a retraction $R$. Assume $f \colon \mathcal{M} \to \mathbb{R}$ is twice continuously differentiable, and furthermore:
$f$ is lower bounded.
The gradients of the pullbacks $f \circ R_x$ uniformly satisfy a Lipschitz-type condition.
The Hessians of the pullbacks $f \circ R_x$ uniformly satisfy a Lipschitz-type condition.
The retraction $R$ uniformly satisfies a second-order condition.
Then, with parameters set as prescribed below, PRGD visits several points with gradient norm smaller than $\epsilon$ and, with high probability, at least two-thirds of those points are $\epsilon$-second-order critical (Definition 3.1).
PRGD uses gradient queries only; crucially, no Hessian queries are needed. The algorithm requires knowledge of the Lipschitz-type constants defined below, which makes this a mostly theoretical algorithm.
1.2 Related work
Algorithms which efficiently escape saddle points can be classified into two families: first-order and second-order methods. First-order methods only use function value and gradient information. SGD and PGD are first-order methods. Second-order methods also access Hessian information. Newton’s method, trust regions trust ; trustMan and adaptive cubic regularization arc ; arcMan are second-order methods.
As noted above, Ge et al. Ge2015 and Jin et al. Jina2017 escape saddle points (in Euclidean space) by exploiting noise in iterations. There has also been similar work for normalized gradient descent Levy2016 . Expanding on Jina2017 , Jin et al. Jinb2017 give an accelerated PGD algorithm (PAGD) which reaches an $\epsilon$-second-order critical point of a non-convex function with high probability in $\tilde O(1/\epsilon^{7/4})$ iterations. In Jin2019 , Jin et al. show that a stochastic version of PGD reaches an $\epsilon$-second-order critical point in $\tilde O(d/\epsilon^4)$ stochastic gradient queries; only $\tilde O(1/\epsilon^4)$ queries are needed if the stochastic gradients are well behaved. For an analysis of PGD under convex constraints, see mokhtari2018escaping .
There is another line of research, inspired by Langevin dynamics, in which judiciously scaled Gaussian noise is added at every iteration. We note that although this differs from the first incarnation of PGD in Jina2017 , it resembles a simplified version of PGD in Jin2019 . Sang and Liu Sang2018 develop an algorithm (adaptive stochastic gradient Langevin dynamics, ASGLD) which provably reaches an $\epsilon$-second-order critical point with high probability, both with stochastic gradients and, at lower query complexity, with full gradients.
One might hope that the noise inherent in vanilla SGD would help it escape saddle points without noise injection. Daneshmand et al. Daneshmand2018 propose the correlated negative curvature assumption (CNC), under which they prove that SGD reaches an $\epsilon$-second-order critical point with high probability. They also show that, under the CNC assumption, a variant of GD (in which iterates are perturbed only by SGD steps) efficiently escapes saddle points. Importantly, these guarantees are completely dimension-free.
A first-order method can include approximations of the Hessian (e.g., with a difference of gradients). For example, Allen-Zhu’s Natasha 2 algorithm AllenZhua2017 uses first-order information (function value and stochastic gradients) to search for directions of negative curvature of the Hessian. Natasha 2 reaches an $\epsilon$-second-order critical point in $\tilde O(1/\epsilon^{3.25})$ iterations.
Many classical optimization algorithms have been generalized to optimization on manifolds, including gradient descent, Newton’s method, trust regions and adaptive cubic regularization edelman1998geometry ; AbsilBook ; genrtr ; newton ; trustMan ; arcMan ; bento2017iterationcomplexity . Bonnabel bonnabel extends stochastic gradient descent to Riemannian manifolds and proves that Riemannian SGD converges to critical points of the cost function. Zhang et al. speedup and Sato et al. speedup2 both use variance reduction to speed up SGD on Riemannian manifolds.
2 Preliminaries: Optimization on manifolds
We review the key definitions and tools for optimization on manifolds. For more information, see AbsilBook . Let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold: a real, smooth $d$-manifold equipped with a Riemannian metric. We associate with each $x \in \mathcal{M}$ a $d$-dimensional real vector space $\mathrm{T}_x\mathcal{M}$, called the tangent space at $x$. For embedded submanifolds of $\mathbb{R}^n$, we often visualize the tangent space as being tangent to the manifold at $x$. The Riemannian metric defines an inner product $\langle \cdot, \cdot \rangle_x$ on the tangent space $\mathrm{T}_x\mathcal{M}$, with associated norm $\|\cdot\|_x$. We denote these by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ when $x$ is clear from context. A vector in the tangent space is a tangent vector. The set of pairs $(x, s)$ for $x \in \mathcal{M}$ and $s \in \mathrm{T}_x\mathcal{M}$ is called the tangent bundle $\mathrm{T}\mathcal{M}$. Define $B_x(r) = \{s \in \mathrm{T}_x\mathcal{M} : \|s\| \le r\}$: the closed ball of radius $r$ centered at the origin of $\mathrm{T}_x\mathcal{M}$. We occasionally denote it by $B(r)$ when $x$ is clear from context. Let $\mathrm{Uniform}(B_x(r))$ denote the uniform distribution over this ball.
The Riemannian gradient of a differentiable function $f \colon \mathcal{M} \to \mathbb{R}$ at $x$ is the unique vector $\operatorname{grad} f(x)$ in $\mathrm{T}_x\mathcal{M}$ satisfying $\mathrm{D} f(x)[s] = \langle \operatorname{grad} f(x), s \rangle$ for all $s \in \mathrm{T}_x\mathcal{M}$, where $\mathrm{D} f(x)[s]$ is the directional derivative of $f$ at $x$ along $s$. The Riemannian metric also gives rise to a well-defined notion of derivative of vector fields, called the Riemannian connection or Levi–Civita connection $\nabla$. The Hessian of $f$ is the derivative of the gradient vector field: $\operatorname{Hess} f(x)[s] = \nabla_s \operatorname{grad} f$. The Hessian describes how the gradient changes. $\operatorname{Hess} f(x)$ is a symmetric linear operator on $\mathrm{T}_x\mathcal{M}$. If the manifold is a Euclidean space, $\mathcal{M} = \mathbb{R}^d$ with the standard metric $\langle u, v \rangle = u^\top v$, the Riemannian gradient and Hessian coincide with the standard gradient $\nabla f$ and Hessian $\nabla^2 f$.
As discussed in Section 1, a retraction is a mapping which allows us to move along the manifold from a point $x$ in the direction of a tangent vector $s$. Formally:
Definition 2.1 (Retraction, from AbsilBook ).
A retraction on a manifold $\mathcal{M}$ is a smooth mapping $R$ from the tangent bundle $\mathrm{T}\mathcal{M}$ to $\mathcal{M}$ satisfying properties 1 and 2 below. Let $R_x$ denote the restriction of $R$ to $\mathrm{T}_x\mathcal{M}$.
$R_x(0_x) = x$, where $0_x$ is the zero vector in $\mathrm{T}_x\mathcal{M}$.
The differential of $R_x$ at $0_x$, $\mathrm{D} R_x(0_x)$, is the identity map.
(Our algorithm and theory only require $R$ to be defined in balls of a fixed radius around the origins of tangent spaces.) Recall these special retractions, which are good to keep in mind for intuition: on $\mathbb{R}^d$, we typically use $R_x(s) = x + s$, and on the unit sphere we typically use $R_x(s) = (x + s)/\|x + s\|$.
For $x$ in $\mathcal{M}$, define the pullback of $f$ from the manifold to the tangent space by
$$\hat f_x = f \circ R_x \colon \mathrm{T}_x\mathcal{M} \to \mathbb{R}.$$
This is a real function on a vector space. Furthermore, for $x \in \mathcal{M}$ and $s \in \mathrm{T}_x\mathcal{M}$, let
$$T_s = \mathrm{D} R_x(s) \colon \mathrm{T}_x\mathcal{M} \to \mathrm{T}_{R_x(s)}\mathcal{M}$$
denote the differential of $R_x$ at $s$ (a linear operator). The gradient and Hessian of the pullback admit the following nice expressions in terms of those of $f$, and the retraction.
Lemma 2.2 (Lemma 5.2 of arcMan ).
For $f$ twice continuously differentiable, $x \in \mathcal{M}$ and $s \in \mathrm{T}_x\mathcal{M}$, with $T_s^*$ denoting the adjoint of $T_s$,
$$\nabla \hat f_x(s) = T_s^* \operatorname{grad} f(R_x(s)) \quad \text{and} \quad \nabla^2 \hat f_x(s) = T_s^* \circ \operatorname{Hess} f(R_x(s)) \circ T_s + W_s,$$
where $W_s$ is a symmetric linear operator on $\mathrm{T}_x\mathcal{M}$ defined through polarization by
$$\langle W_s[\dot s], \dot s \rangle = \langle \operatorname{grad} f(R_x(s)), c''(0) \rangle,$$
with $c''(0)$ the intrinsic acceleration on $\mathcal{M}$ of the curve $c(t) = R_x(s + t \dot s)$ at $t = 0$.
The velocity of a curve $c$ on $\mathcal{M}$ is $c'(t)$. The intrinsic acceleration $c''$ of $c$ is the covariant derivative (induced by the Levi–Civita connection) of the velocity of $c$. When $\mathcal{M}$ is a Riemannian submanifold of $\mathbb{R}^n$, the intrinsic acceleration $c''(t)$ does not necessarily coincide with the classical acceleration $\ddot c(t)$ in $\mathbb{R}^n$. In this case, $c''(t)$ is the orthogonal projection of $\ddot c(t)$ onto $\mathrm{T}_{c(t)}\mathcal{M}$.
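For a Riemannian submanifold, this projection characterization is easy to check numerically. The sketch below (our illustration) verifies that a great circle on the unit sphere has vanishing intrinsic acceleration even though its ambient acceleration does not vanish.

```python
import numpy as np

def intrinsic_accel_sphere(c, t, h=1e-5):
    """Intrinsic acceleration of a curve c(t) on the unit sphere, viewed as a
    Riemannian submanifold of R^n: project the ambient second derivative
    (here estimated by central finite differences) onto the tangent space at c(t)."""
    ambient = (c(t + h) - 2.0 * c(t) + c(t - h)) / h**2
    x = c(t)
    return ambient - np.dot(ambient, x) * x  # orthogonal projection onto T_x

# A great circle c(t) = cos(t) x + sin(t) v (with unit-norm x perpendicular to v)
# is a geodesic: its ambient acceleration is -c(t), normal to the sphere, so the
# intrinsic acceleration is zero.
x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, 1.0, 0.0])
circle = lambda t: np.cos(t) * x + np.sin(t) * v
```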
3 PRGD efficiently escapes saddle points
We now precisely state the assumptions, the main result, and some important parts of the proof of the main result, including the main obstacles faced in generalizing PGD to manifolds. A full proof of all results is provided in the appendix.
The first assumption, namely, that $f$ is lower bounded, ensures that there are points on the manifold where the gradient is arbitrarily small.
$f$ is lower bounded: $f(x) \ge f_{\mathrm{low}} > -\infty$ for all $x \in \mathcal{M}$.
Generalizing from the Euclidean case, we assume Lipschitz-type conditions on the gradients and Hessians of the pullbacks $\hat f_x = f \circ R_x$. For the special case of $\mathcal{M} = \mathbb{R}^d$ and $R_x(s) = x + s$, these assumptions hold if the gradient and Hessian of $f$ are each Lipschitz continuous, as in (Jin2019, A1) (with the same constants). The Lipschitz-type assumptions below are similar to assumption A2 of arcMan . Notice that these assumptions involve both the cost function and the retraction: this dependency is further discussed in trustMan ; arcMan for a similar setting.
There exist $\ell > 0$ and $b_1 > 0$ such that, for all $x \in \mathcal{M}$ and all $s_1, s_2 \in B_x(b_1)$,
$$\|\nabla \hat f_x(s_1) - \nabla \hat f_x(s_2)\| \le \ell \|s_1 - s_2\|.$$
There exist $\rho > 0$ and $b_2 > 0$ such that, for all $x \in \mathcal{M}$ and all $s_1, s_2 \in B_x(b_2)$,
$$\|\nabla^2 \hat f_x(s_1) - \nabla^2 \hat f_x(s_2)\| \le \rho \|s_1 - s_2\|,$$
where on the left-hand side we use the operator norm.
More precisely, we only need these assumptions to hold at the iterates of the algorithm. Let $b = \min\{b_1, b_2\}$. (We do this to reduce the number of parameters in Algorithm 1.) The next assumption requires the chosen retraction to be well behaved, in the sense that the (intrinsic) acceleration of curves on the manifold, defined in Section 2, must remain bounded—compare with Lemma 2.2.
There exists $\beta \ge 0$ such that, for all $x \in \mathcal{M}$ and $s \in \mathrm{T}_x\mathcal{M}$ satisfying $\|s\| = 1$, the curve $c(t) = R_x(ts)$ has initial acceleration bounded by $\beta$: $\|c''(0)\| \le \beta$.
If Assumption 4 holds with $\beta = 0$, $R$ is said to be second order (AbsilBook, p. 107). Second-order retractions include the so-called exponential map and the standard retractions on $\mathbb{R}^d$ and the unit sphere mentioned earlier—see malick for a large class of such retractions on relevant manifolds.
For compact manifolds, all of these assumptions hold (Lemma 3.2; all proofs are in the appendix):
3.2 Main results
Recall that PRGD (Algorithm 1) works as follows. If $\|\operatorname{grad} f(x_t)\|$ is large, perform a Riemannian gradient descent step, $x_{t+1} = R_{x_t}(-\eta \operatorname{grad} f(x_t))$. If $\|\operatorname{grad} f(x_t)\|$ is small, then perturb: sample $\xi \sim \mathrm{Uniform}(B_{x_t}(r))$ and let $s_0 = \xi$. After this perturbation, remain in the tangent space $\mathrm{T}_{x_t}\mathcal{M}$ and do (at most) $\mathscr{T}$ gradient descent steps on the pullback $\hat f_{x_t}$, starting from $s_0$. We denote this sequence of tangent space steps by $s_0, s_1, s_2, \ldots$ This sequence of gradient descent steps is performed by TangentSpaceSteps: a deterministic procedure in the (linear) tangent space.
One difficulty with this approach is that, under our assumptions, for some $x$, the gradient $\nabla \hat f_x$ may not be Lipschitz continuous in all of $\mathrm{T}_x\mathcal{M}$. However, it is easy to show that $\nabla \hat f_x$ is Lipschitz continuous in the ball of radius $b$ around the origin, uniformly in $x$. This is why we limit our algorithm to these balls. If the sequence of iterates escapes the ball for some $j$, TangentSpaceSteps returns the point between $s_j$ and $s_{j+1}$ on the boundary of that ball.
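The clipping rule just described can be sketched as follows; the parameter names are ours, and the gradient fields in the accompanying checks are toy examples.

```python
import numpy as np

def tangent_space_steps(grad_pullback, s0, eta, ball_radius, max_steps):
    """Sketch of TangentSpaceSteps: GD on the pullback, run in the (linear)
    tangent space and restricted to the ball of radius b around the origin.
    If an iterate leaves the ball, return the point where the segment from
    s_j to s_{j+1} crosses the boundary."""
    s = np.asarray(s0, dtype=float)
    for _ in range(max_steps):
        s_next = s - eta * grad_pullback(s)
        if np.linalg.norm(s_next) > ball_radius:
            # solve ||s + t (s_next - s)|| = ball_radius for t in (0, 1]
            d = s_next - s
            qa = d @ d
            qb = 2.0 * (s @ d)
            qc = s @ s - ball_radius**2
            t = (-qb + np.sqrt(qb**2 - 4.0 * qa * qc)) / (2.0 * qa)
            return s + t * d
        s = s_next
    return s
```

With a repulsive toy gradient field the iterates exit the ball and the boundary point is returned; with an attractive one they stay inside and plain GD output is returned.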
Following Jin2019 , we use a set of carefully balanced parameters. Parameters $\epsilon$ and $\delta$ are user-defined. The claim in Theorem 3.4 below holds with probability at least $1 - \delta$. Assumption 1 provides parameter $f_{\mathrm{low}}$. Assumptions 2 and 3 provide parameters $\ell$, $\rho$ and $b$. As announced, the latter two assumptions further ensure Lipschitz continuity of the gradients of the pullbacks in balls of the tangent spaces, uniformly: this defines an additional Lipschitz parameter, as prescribed below.
Then, choose (preferably small) such that
and set algorithm parameters
where is such that is an integer. We also use this notation in the proofs:
visits at least two iterates satisfying . With probability at least , at least two-thirds of those iterates satisfy
The algorithm uses at most gradient queries (and no function or Hessian queries).
Assume satisfies Assumptions 1, 2, 3 and 4. For an arbitrary , with , , and , choose as in (9). Then, setting as in (11), visits at least two iterates satisfying . With probability at least , at least two-thirds of those iterates are -second-order points. If (that is, the retraction is second order), then the same claim holds for -second-order points instead of . The algorithm uses at most gradient queries.
Assume $\mathcal{M} = \mathbb{R}^d$ with the standard inner product and the standard retraction $R_x(s) = x + s$. As in Jin2019 , assume $f$ is lower bounded, $\nabla f$ is $\ell$-Lipschitz, and $\nabla^2 f$ is $\rho$-Lipschitz. Then, Assumptions 1, 2 and 3 hold with the same constants, for any ball radius. Furthermore, Assumption 4 holds with $\beta = 0$, so that $W_s = 0$ for all $s$ (Lemma 2.2). For all $x$, $\nabla \hat f_x$ has Lipschitz constant $\ell$ since $\hat f_x(s) = f(x + s)$. Therefore, using these constants and choosing parameters as in (9), PRGD reduces to PGD, and Theorem 3.4 recovers the result of Jin et al. Jin2019 : this confirms that the present result is a bona fide generalization.
PRGD, like PGD (Algorithm 4 in Jin2019 ), does not specify which iterate is an $\epsilon$-second-order critical point. However, it is straightforward to include a termination condition in PRGD which halts the algorithm and returns a suspected $\epsilon$-second-order critical point. Indeed, Jin et al. include such a termination condition in their original PGD algorithm Jina2017 , which here would go as follows: After performing a perturbation and $\mathscr{T}$ (tangent space) steps in $\mathrm{T}_{x_t}\mathcal{M}$, return $x_t$ if the function value has not decreased enough. The termination condition requires a threshold which is balanced like the other parameters of PRGD in (9).
3.3 Main proof ideas
Theorem 3.4 follows from the following two lemmas, which we prove in the appendix. These lemmas state that, in each round of the while-loop in PRGD, if the current iterate is not an $\epsilon$-second-order critical point, PRGD makes progress, that is, decreases the cost function value (the first lemma is deterministic, the second one is probabilistic). Yet, the value of $f$ at the iterates can only decrease so much, because $f$ is bounded below by $f_{\mathrm{low}}$. Therefore, the probability that PRGD does not visit an $\epsilon$-second-order critical point is low.
Lemma 3.8 states that we are guaranteed to make progress if the gradient is large. This follows from the sufficient decrease of RGD steps. Lemma 3.9 states that, with perturbation, GD on the pullback escapes a saddle point with high probability. Lemma 3.9 is analogous to Lemma 11 in Jin2019 .
Consider the set of tangent vectors in the perturbation ball for which GD on the pullback starting from the perturbed point does not escape the saddle point, i.e., for which the function value does not decrease enough after $\mathscr{T}$ iterations. Following Jin et al.’s analysis Jin2019 , we bound the width of this “stuck region” in the direction of the eigenvector associated with the minimum eigenvalue of the Hessian of the pullback, $\nabla^2 \hat f_{x_t}(0)$. Like Jin et al., we do this with a coupling argument, showing that given two GD sequences with starting points sufficiently far apart, one of these sequences must escape. This is formalized in Lemma C.4 of the appendix. A crucial observation to prove Lemma C.4 is that, if the function value of GD iterates does not decrease much, then these iterates must be localized; this is formalized in Lemma C.3 of the appendix, which Jin et al. call “improve or localize.”
We stress that the stuck region concept, coupling argument, improve-or-localize paradigm, and details of the analysis are due to Jin et al. Jin2019 : our main contribution is to show a clean way to generalize the algorithm to manifolds in such a way that the analysis extends with little friction. We believe that the general idea of separating iterations between the manifold and the tangent spaces to achieve different objectives may prove useful to generalize other algorithms as well.
To perform PGD (Algorithm 4 of Jin2019 ), one must know the step size $\eta$, the perturbation radius $r$ and the number of steps $\mathscr{T}$ to perform after a perturbation. These parameters are carefully balanced, and their values depend on the smoothness parameters $\ell$ and $\rho$. In practice, we do not know $\ell$ or $\rho$. An algorithm which does not require knowledge of $\ell$ or $\rho$ but still has the same guarantees as PGD would be useful.
GD equipped with a backtracking line-search method achieves an $\epsilon$-first-order critical point in $O(1/\epsilon^2)$ gradient queries without knowledge of the Lipschitz constant $\ell$. At each iterate of GD, backtracking line-search essentially uses function and gradient queries to estimate the gradient Lipschitz parameter near that iterate. Perhaps PGD can perform some kind of line-search to locally estimate $\ell$ and $\rho$. We note that even if $\rho$ is known and line-search-like methods are used to estimate $\ell$, there are still difficulties in applying Jin et al.’s coupling argument.
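For reference, one step of the line-search idea mentioned above can be sketched as a standard Armijo backtracking step. This is a textbook sketch, not a component of PGD or PRGD; the constants `eta0`, `beta` and `c` are conventional defaults, not values from the text.

```python
import numpy as np

def backtracking_gd_step(f, grad_f, x, eta0=1.0, beta=0.5, c=1e-4):
    """One GD step with Armijo backtracking line-search: shrink the trial step
    size until the sufficient-decrease condition
        f(x - eta * g) <= f(x) - c * eta * ||g||^2
    holds. This adapts to the local gradient-Lipschitz constant without
    knowing it in advance."""
    g = grad_f(x)
    gg = float(np.dot(g, g))
    eta = eta0
    while f(x - eta * g) > f(x) - c * eta * gg:
        eta *= beta
    return x - eta * g, eta
```

On a stiff quadratic, the accepted step size settles near the inverse of the (unknown) gradient-Lipschitz constant.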
Jin et al. Jin2019 develop a stochastic version of PGD known as PSGD. Instead of perturbing when the gradient is small and performing GD steps, PSGD simply performs a stochastic gradient step and perturbation at each step. Distinguishing between manifold steps and tangent space steps, we suspect it is possible to develop a Riemannian version of perturbed stochastic gradient descent which achieves an $\epsilon$-second-order critical point with the same stochastic gradient query complexity as PSGD. However, this Riemannian version would still perform a certain number of steps in the tangent space when the gradient is small, like PRGD.
- (1) A. Edelman, T.A. Arias, and S.T. Smith. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
- (2) R. Adler, J. Dedieu, J. Margulies, M. Martens, and M. Shub. Newton’s method on Riemannian manifolds and a geometric model for the human spine. IMA Journal of Numerical Analysis, 22(3):359–390, 2002.
- (3) P.-A. Absil and K. A. Gallivan. Joint diagonalization on the oblique manifold for independent component analysis. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 5, pages 945–948, 2006.
- (4) Y. Nesterov and B. T. Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177–205, 2006.
- (5) P.-A. Absil, C. G. Baker, and K. A. Gallivan. Trust-region methods on Riemannian manifolds. Foundations of Computational Mathematics, 7(3):303–330, 2007.
- (6) Pavan Turaga, Ashok Veeraraghavan, and Rama Chellappa. Statistical analysis on Stiefel and Grassmann manifolds with applications in computer vision. In IEEE Conference on Computer Vision and Pattern Recognition, 2008.
- (7) P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization on manifolds: Methods and applications. In Recent Advances in Optimization and its Applications in Engineering, pages 125–144. Springer, 2010.
- (8) P.-A. Absil and J. Malick. Projection-like retractions on matrix manifolds. SIAM Journal on Optimization, 22(1):135–158, 2012.
- (9) S. Bonnabel. Stochastic gradient descent on Riemannian manifolds. Automatic Control, IEEE Transactions on, 58(9):2217–2229, 2013.
- (10) Razvan Pascanu, Yann N. Dauphin, Surya Ganguli, and Yoshua Bengio. On the saddle point problem for non-convex optimization. 2014, arXiv:1405.4604.
- (11) Rong Ge, Furong Huang, Chi Jin, and Yang Yuan. Escaping from saddle points–online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pages 797–842, 2015.
- (12) Leopold Cambier and P.-A. Absil. Robust low-rank matrix completion by Riemannian optimization. SIAM Journal on Scientific Computing, 38(5):S440–S460, 2016.
- (13) Kfir Y. Levy. The power of normalization: Faster evasion of saddle points. 2016, arXiv:1611.04831.
- (14) Ju Sun, Qing Qu, and John Wright. Complete dictionary recovery over the sphere I: Overview and the geometric picture, 2016.
- (15) Hongyi Zhang, Sashank J. Reddi, and Suvrit Sra. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4592–4600. Curran Associates, Inc., 2016.
- (16) G.C. Bento, O.P. Ferreira, and J.G. Melo. Iteration-complexity of gradient, subgradient and proximal point methods on Riemannian manifolds. Journal of Optimization Theory and Applications, 173(2):548–562, 2017.
- (17) Simon S Du, Chi Jin, Jason D Lee, Michael I Jordan, Aarti Singh, and Barnabas Poczos. Gradient descent can take exponential time to escape saddle points. In Advances in neural information processing systems, pages 1067–1077, 2017.
- (18) Chi Jin, Rong Ge, Praneeth Netrapalli, Sham M Kakade, and Michael I Jordan. How to escape saddle points efficiently. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pages 1724–1732. JMLR.org, 2017.
- (19) Chi Jin, Praneeth Netrapalli, and Michael I. Jordan. Accelerated gradient descent escapes saddle points faster than gradient descent. 2017, arXiv:1711.10456.
- (20) Hiroyuki Sato, Hiroyuki Kasai, and Bamdev Mishra. Riemannian stochastic variance reduced gradient, 2017, arXiv:1702.05594.
- (21) Zeyuan Allen-Zhu. Natasha 2: Faster non-convex optimization than SGD. In Advances in Neural Information Processing Systems, pages 2675–2686, 2018.
- (22) N. Boumal, P.-A. Absil, and C. Cartis. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2018.
- (23) Hadi Daneshmand, Jonas Kohler, Aurelien Lucchi, and Thomas Hofmann. Escaping saddles with stochastic gradients. 2018, arXiv:1803.05999.
- (24) Aryan Mokhtari, Asuman Ozdaglar, and Ali Jadbabaie. Escaping saddle points in constrained optimization. In Advances in Neural Information Processing Systems, pages 3629–3639, 2018.
- (25) T. Pumir, S. Jelassi, and N. Boumal. Smoothed analysis of the low-rank approach for smooth semidefinite programs. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2283–2292. Curran Associates, Inc., 2018.
- (26) Hejian Sang and Jia Liu. Adaptive stochastic gradient langevin dynamics: Taming convergence and saddle point escape time. 2018, arXiv:1805.09416.
- (27) Yue Sun and Maryam Fazel. Escaping saddle points efficiently in equality-constrained optimization problems. In International Conference on Machine Learning (ICML), 2018.
- (28) Teng Zhang and Yi Yang. Robust PCA by manifold optimization. Journal of Machine Learning Research, 19:1–39, 2018.
- (29) N. Agarwal, N. Boumal, B. Bullins, and C. Cartis. Adaptive regularization with cubics on manifolds. arXiv preprint arXiv:1806.00065, 2019.
- (30) Chi Jin, Praneeth Netrapalli, Rong Ge, Sham M. Kakade, and Michael I. Jordan. Stochastic gradient descent escapes saddle points efficiently. 2019, arXiv:1902.04811.
- (31) Yue Sun, Nicolas Flammarion, and Maryam Fazel. Escaping from saddle points on Riemannian manifolds. 2019, arXiv:1906.07355.
- (32) P.-A. Absil, R. Mahony, and R. Sepulchre. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2008.
- (33) J. Nocedal and S. Wright. Numerical Optimization. Springer Verlag, 1999.
Appendix A Proof that assumptions hold for compact manifolds
Proof of Lemma 3.2.
Since $\mathcal{M}$ is compact and $f$ is continuous, $f$ is lower bounded by some $f_{\mathrm{low}} \in \mathbb{R}$.
Recall $\hat f_x = f \circ R_x$. Since $f$ is three times continuously differentiable and $R$ is smooth, the operator norms of the second and third derivatives of $\hat f_x$ at $s$ are each continuous as functions on the tangent bundle $\mathrm{T}\mathcal{M}$. The set $\{(x, s) \in \mathrm{T}\mathcal{M} : \|s\| \le b\}$ is a compact subset of the tangent bundle since $\mathcal{M}$ is compact. Thus, we may define $\ell$ and $\rho$ as the maxima of these norms over this set.
Using the notation from Assumption 4, the map given by $(x, s) \mapsto \|c''(0)\|$, with $c(t) = R_x(ts)$, is continuous since $R$ is smooth. The set $\{(x, s) \in \mathrm{T}\mathcal{M} : \|s\| = 1\}$ is also compact in $\mathrm{T}\mathcal{M}$. Hence, taking $\beta$ to be the maximum of this map over this set is a valid choice. ∎
Appendix B Proofs for the main results
The proof follows that of Jin et al. Jin2019 closely, reusing many of their key lemmas: we repeat some here for convenience, while highlighting the specificities of the manifold case. We consider it a contribution of this paper that, as a result of our distinction between manifold and tangent space steps, there is limited extra friction, despite the significantly extended generality. In this section and the next, all parameters are chosen as in (9) and (10).
We assume $\epsilon \le \ell^2/\rho$, because otherwise we can reach a point satisfying $\|\operatorname{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\sqrt{\rho \epsilon}$ simply using RGD. Indeed, RGD always finds a point satisfying $\|\operatorname{grad} f(x)\| \le \epsilon$, and Assumption 2 implies $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\ell$, so that if $\epsilon > \ell^2/\rho$ then $\sqrt{\rho \epsilon} > \ell$. Thus, if $\epsilon > \ell^2/\rho$, every point $x$ satisfies $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\sqrt{\rho \epsilon}$.
We want to prove Theorem 3.4. This theorem follows from the following two lemmas (repeated from Lemmas 3.8 and 3.9 for convenience), which we prove in Appendix C below. Lemma B.1 is deterministic: it is a statement about the cost decrease produced by a single Riemannian gradient step, with bounded step size. Lemma B.2 is probabilistic, and is analogous to Lemma 11 in Jin2019 .
Proof of Theorem 3.4.
This proof is similar to Jin et al.’s proof of Theorem 9 in Jin2019 .
Recall that we set
PRGD performs two types of steps: (1) if , an RGD step on the manifold, and (2) if , a perturbation in the tangent space followed by GD steps in the tangent space.
The variable $t$ in Algorithm 1 is an upper bound on the number of gradient queries issued so far. For each RGD step on the manifold, $t$ increases by exactly 1. PRGD does not terminate before $t$ exceeds $T$, and for every perturbation the counter increases by exactly $\mathscr{T}$. Therefore, there are at least a prescribed number of iterates satisfying the small-gradient condition; this number is controlled through the definition of $T$ in (12).
Suppose PRGD visits more than points satisfying and . Each of these iterates is followed by a perturbation and at most tangent space steps . For at least one such , the sequence of tangent space steps does not escape the saddle point (that is, ), for otherwise by the definition of (12). Yet, by Lemma B.2 and a union bound, the probability that one or more of these sequences does not escape is at most . Indeed, factoring out the third term in the max,
where we used . Now using
for all , and , we find
Hence, with probability at least , PRGD visits at most points satisfying and . Using that there are at least iterates with , we conclude that at least two-thirds of the iterates with also satisfy , with probability at least . ∎