# Efficiently escaping saddle points on manifolds

Smooth, non-convex optimization problems on Riemannian manifolds occur in machine learning as a result of orthonormality, rank or positivity constraints. First- and second-order necessary optimality conditions state that the Riemannian gradient must be zero, and the Riemannian Hessian must be positive semidefinite. Generalizing Jin et al.'s recent work on perturbed gradient descent (PGD) for optimization on linear spaces [How to Escape Saddle Points Efficiently (2017), Stochastic Gradient Descent Escapes Saddle Points Efficiently (2019)], we propose a version of perturbed Riemannian gradient descent (PRGD) to show that necessary optimality conditions can be met approximately with high probability, without evaluating the Hessian. Specifically, for an arbitrary Riemannian manifold M of dimension d, a sufficiently smooth (possibly non-convex) objective function f, and under weak conditions on the retraction chosen to move on the manifold, with high probability, our version of PRGD produces a point with gradient smaller than ϵ and Hessian within √(ϵ) of being positive semidefinite in O((log d)^4 / ϵ^2) gradient queries. This matches the complexity of PGD in the Euclidean case. Crucially, the dependence on dimension is low, which matters for large-scale applications including PCA and low-rank matrix completion, which both admit natural formulations on manifolds. The key technical idea is to generalize PRGD with a distinction between two types of gradient steps: “steps on the manifold” and “perturbed steps in a tangent space of the manifold.” Ultimately, this distinction makes it possible to extend Jin et al.'s analysis seamlessly.


## 1 Introduction

Machine learning has stimulated interest in obtaining global convergence rates in non-convex optimization. Consider a possibly non-convex objective function $f\colon \mathbb{R}^d \to \mathbb{R}$. We want to solve

$$\min_{x \in \mathbb{R}^d} f(x). \tag{1}$$

This is hard in general. Instead, we usually settle for approximate first-order critical (or stationary) points where the gradient is small, or second-order critical (or stationary) points where the gradient is small and the Hessian is nearly positive semidefinite.

One of the simplest algorithms for solving (1) is gradient descent (GD): given $x_0 \in \mathbb{R}^d$, iterate

$$x_{t+1} = x_t - \eta \nabla f(x_t). \tag{2}$$

It is well known that if $\nabla f$ is Lipschitz continuous, with appropriate step-size $\eta$, GD converges to first-order critical points. However, it may take an exponential time to escape saddle points, that is, to reach an approximate second-order critical point du2017gradient. There is an increasing amount of evidence that saddle points are a serious obstacle to the practical success of local optimization algorithms such as GD Pascanu2014 ; Ge2015. This calls for algorithms which provably escape saddle points efficiently. We focus on methods which only have access to $f$ and $\nabla f$ through a black-box model.

Several methods add noise to GD iterates in order to escape saddle points faster, under the assumption that $f$ has $L$-Lipschitz continuous gradient and $\rho$-Lipschitz continuous Hessian. In this setting, an $\epsilon$-second-order critical point is a point $x$ satisfying $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\rho\epsilon}\, I$. Under the strict saddle assumption, with $\epsilon$ small enough, such points are near (local) minimizers Ge2015 ; Jina2017.

In 2015, Ge et al. Ge2015 gave a variant of stochastic gradient descent (SGD) which adds isotropic noise to iterates, showing it produces an $\epsilon$-second-order critical point with high probability in a number of stochastic gradient queries that is polynomial in the dimension $d$. In 2017, Jin et al. Jina2017 presented a variant of GD, perturbed gradient descent (PGD), which reduces this complexity to $O((\log d)^4/\epsilon^2)$ full gradient queries. Recently, Jin et al. Jin2019 simplified their own analysis of PGD, and extended it to stochastic gradient descent.

Jin et al.’s PGD (Jin2019, Alg. 4) works as follows. If the gradient is large at iterate $x_t$, $\|\nabla f(x_t)\| > \epsilon$, then perform a gradient descent step: $x_{t+1} = x_t - \eta\nabla f(x_t)$. If the gradient is small at iterate $x_t$, $\|\nabla f(x_t)\| \le \epsilon$, perturb $x_t$ by $\eta\xi$, with $\xi$ sampled uniformly from a ball of fixed radius centered at zero. Starting from this new point $x_t + \eta\xi$, perform $\mathscr{T}$ gradient descent steps, arriving at iterate $x_{t+\mathscr{T}}$. From here, repeat this procedure starting at $x_{t+\mathscr{T}}$. Crucially, Jin et al. Jin2019 show that, if $x_t$ is not an $\epsilon$-second-order critical point, then the function decreases enough from $x_t$ to $x_{t+\mathscr{T}}$ with high probability, leading to an escape.
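The two-branch structure of PGD described above can be sketched in plain Python. This is only an illustration under stated assumptions: the toy objective (a saddle at the origin, minimizers at $(0, \pm 1)$), the helper names `sample_ball` and `pgd`, and the hand-picked parameter values are all hypothetical, and they are not the balanced parameters of Jin et al.

```python
import math
import random

def f(p):
    # Toy objective (assumption): saddle at (0, 0), minimizers at (0, +1) and (0, -1).
    x, y = p
    return 0.5 * x * x + 0.25 * y ** 4 - 0.5 * y * y

def grad_f(p):
    x, y = p
    return [x, y ** 3 - y]

def sample_ball(dim, radius, rng):
    # Uniform sample from the Euclidean ball: Gaussian direction, radius ~ radius * U^(1/dim).
    g = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(v * v for v in g))
    scale = radius * rng.random() ** (1.0 / dim) / norm
    return [scale * v for v in g]

def pgd(x0, eta, r, inner_steps, eps, total_steps, seed=0):
    rng = random.Random(seed)
    x, t = list(x0), 0
    while t < total_steps:
        g = grad_f(x)
        if math.sqrt(sum(v * v for v in g)) > eps:
            # Large gradient: plain gradient descent step.
            x = [a - eta * b for a, b in zip(x, g)]
            t += 1
        else:
            # Small gradient: perturb by eta * xi, then run a fixed number of GD steps.
            xi = sample_ball(len(x), r, rng)
            x = [a + eta * b for a, b in zip(x, xi)]
            for _ in range(inner_steps):
                g = grad_f(x)
                x = [a - eta * b for a, b in zip(x, g)]
            t += inner_steps
    return x
```

Started exactly at the saddle, for example with `pgd([0.0, 0.0], eta=0.1, r=0.1, inner_steps=200, eps=1e-3, total_steps=1000)`, plain GD would never move, whereas the perturbation lets the iterates leave the saddle and settle near one of the two minimizers.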

In this paper we generalize PGD to optimization problems on manifolds, i.e., problems of the form

$$\min_{x \in \mathcal{M}} f(x) \tag{3}$$

where $\mathcal{M}$ is an arbitrary Riemannian manifold and $f\colon \mathcal{M} \to \mathbb{R}$ is sufficiently smooth AbsilBook. Optimization on manifolds notably occurs in machine learning (e.g., PCA PCA1, low-rank matrix completion PCA2), computer vision (e.g., vision) and signal processing (e.g., signal); see apps for more. See saddle1 and saddle2 for examples of the strict saddle property on manifolds.

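Given $x \in \mathcal{M}$, the gradient of $f$ at $x$, $\mathrm{grad} f(x)$, is a vector in the tangent space at $x$, $T_x\mathcal{M}$. To perform gradient descent on a manifold, we need a way to move on the manifold along the direction of the gradient at $x$. This is provided by a retraction $\mathrm{Retr}_x$: a smooth map from $T_x\mathcal{M}$ to $\mathcal{M}$. Riemannian gradient descent (RGD) performs steps on $\mathcal{M}$ of the form

$$x_{t+1} = \mathrm{Retr}_{x_t}\big(-\eta\, \mathrm{grad} f(x_t)\big). \tag{4}$$

For Euclidean space, $\mathcal{M} = \mathbb{R}^d$, the standard retraction is $\mathrm{Retr}_x(s) = x + s$, in which case (4) reduces to (2). For the sphere embedded in Euclidean space, $\mathcal{M} = S^d$, a natural retraction is given by orthogonal projection to the sphere: $\mathrm{Retr}_x(s) = (x + s)/\|x + s\|$.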
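As a concrete illustration of a retraction-based step, the following sketch implements the projection retraction on the unit sphere and one RGD step. The quadratic toy cost, the constant `DIAG`, and the function names are assumptions made for this example; the Riemannian gradient is obtained by projecting the Euclidean gradient onto the tangent space, which is valid for the sphere viewed as a Riemannian submanifold of Euclidean space.

```python
import math

DIAG = [1.0, 2.0, 3.0]  # toy cost f(x) = 0.5 * sum_i DIAG[i] * x_i^2 (assumption)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def f(x):
    return 0.5 * sum(a * v * v for a, v in zip(DIAG, x))

def egrad_f(x):
    # Euclidean gradient of the toy cost.
    return [a * v for a, v in zip(DIAG, x)]

def riemannian_grad_sphere(egrad, x):
    # Project the Euclidean gradient onto the tangent space {s : <x, s> = 0}.
    c = dot(egrad, x)
    return [g - c * a for g, a in zip(egrad, x)]

def retract_sphere(x, s):
    # Projection retraction: Retr_x(s) = (x + s) / ||x + s||.
    y = [a + b for a, b in zip(x, s)]
    n = norm(y)
    return [a / n for a in y]

def rgd_step(x, eta):
    # One Riemannian gradient descent step: Retr_x(-eta * grad f(x)).
    g = riemannian_grad_sphere(egrad_f(x), x)
    return retract_sphere(x, [-eta * v for v in g])
```

Calling `rgd_step(x, 0.1)` on a unit vector `x` returns another unit vector, so the iterates stay on the sphere without any separate re-normalization step.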

For $x \in \mathcal{M}$, define the pullback $\hat f_x = f \circ \mathrm{Retr}_x\colon T_x\mathcal{M} \to \mathbb{R}$. If $\mathrm{Retr}$ is nice enough (details below), the gradient and Hessian of $f$ at $x$ equal the gradient and Hessian of $\hat f_x$ at the origin of $T_x\mathcal{M}$. Since $T_x\mathcal{M}$ is a vector space, if we perform GD on $\hat f_x$, we can almost directly apply Jin et al.’s analysis Jin2019. This motivates the two-phase structure of our perturbed Riemannian gradient descent (PRGD), listed as Algorithm 1.

Our PRGD is a variant of RGD (4) and a generalization of PGD. It works as follows. If the gradient is large at iterate $x_t$, $\|\mathrm{grad} f(x_t)\| > \epsilon$, perform an RGD step: $x_{t+1} = \mathrm{Retr}_{x_t}(-\eta\, \mathrm{grad} f(x_t))$. We call this a “step on the manifold.” If the gradient at iterate $x_t$ is small, $\|\mathrm{grad} f(x_t)\| \le \epsilon$, then perturb in the tangent space $T_{x_t}\mathcal{M}$. After this perturbation, execute at most $\mathscr{T}$ gradient descent steps on the pullback $\hat f_{x_t}$, in the tangent space. We call these “tangent space steps.” We denote this sequence of tangent space steps by $\{s_j\}_{j \ge 0}$. This sequence of steps is performed by TangentSpaceSteps: a deterministic, vector-space procedure (see Algorithm 1).

By distinguishing between gradient descent steps on the manifold and those in a tangent space, we can apply Jin et al.’s analysis almost directly Jin2019, allowing us to prove PRGD reaches an $\epsilon$-second-order critical point on $\mathcal{M}$ in $O((\log d)^4/\epsilon^2)$ gradient queries. The notion of approximate second-order critical point is here defined with respect to a notion of Lipschitz-type continuity of the Riemannian gradient and Hessian detailed below, as advocated in trustMan ; arcMan. The analysis is technically far simpler than if one runs all steps on the manifold. We expect that this two-phase approach may prove useful for the generalization of other algorithms and analyses from the Euclidean to the Riemannian realm.

Recently, Sun and Fazel Fazel2018 provided the first generalization of PGD to certain manifolds with a polylogarithmic complexity in the dimension. This improves on the earlier results by Ge et al. (Ge2015, App. B) which had a polynomial complexity. Both of these works focus on submanifolds of a Euclidean space, with the algorithm in Fazel2018 depending on the equality constraints chosen to describe this submanifold.

At the same time as the present paper, Sun et al. Sun2019prgd improved their analysis to cover any complete Riemannian manifold with bounded sectional curvature. In contrast to ours, their algorithm executes all steps on the manifold. Their analysis requires the retraction to be the Riemannian exponential map (i.e., geodesics). Our regularity assumptions are similar but different: while we assume Lipschitz-type conditions on the pullbacks in small balls around the origins of tangent spaces, Sun et al. make Lipschitz assumptions on the cost function directly, using parallel transport and Riemannian distance. As a result, curvature appears in their results. We make no explicit assumptions on $\mathcal{M}$ regarding curvature or completeness, though these may be implicitly included in our regularity assumptions.

### 1.1 Main result

Here we state our result informally. Formal results are stated in subsequent sections.

###### Theorem 1.1 (Informal).

Let $\mathcal{M}$ be a Riemannian manifold of dimension $d$ equipped with a retraction $\mathrm{Retr}$. Assume $f\colon \mathcal{M} \to \mathbb{R}$ is twice continuously differentiable, and furthermore:

1. $f$ is lower bounded.

2. The gradients of the pullbacks uniformly satisfy a Lipschitz-type condition.

3. The Hessians of the pullbacks uniformly satisfy a Lipschitz-type condition.

4. The retraction uniformly satisfies a second-order condition.

Then, setting $T = O((\log d)^4/\epsilon^2)$, PRGD visits several points with gradient smaller than $\epsilon$ and, with high probability, at least two-thirds of those points are $\epsilon$-second-order critical (Definition 3.1).

PRGD uses $O(T)$ gradient queries, and crucially no Hessian queries. The algorithm requires knowledge of the Lipschitz constants defined below, which makes this a mostly theoretical algorithm.

### 1.2 Related work

Algorithms which efficiently escape saddle points can be classified into two families: first-order and second-order methods. First-order methods only use function value and gradient information. SGD and PGD are first-order methods. Second-order methods also access Hessian information. Newton’s method, trust regions trust ; trustMan and adaptive cubic regularization arc ; arcMan are second-order methods.

As noted above, Ge et al. Ge2015 and Jin et al. Jina2017 escape saddle points (in Euclidean space) by exploiting noise in iterations. There has also been similar work for normalized gradient descent Levy2016. Expanding on Jina2017, Jin et al. Jinb2017 give an accelerated PGD algorithm (PAGD) which reaches an $\epsilon$-second-order critical point of a non-convex function with high probability in $\tilde{O}(1/\epsilon^{7/4})$ iterations. In Jin2019, Jin et al. show that a stochastic version of PGD reaches an $\epsilon$-second-order critical point in $\tilde{O}(d/\epsilon^4)$ stochastic gradient queries; only $\tilde{O}(1/\epsilon^4)$ queries are needed if the stochastic gradients are well behaved. For an analysis of PGD under convex constraints, see mokhtari2018escaping.

There is another line of research, inspired by Langevin dynamics, in which judiciously scaled Gaussian noise is added at every iteration. We note that although this differs from the first incarnation of PGD in Jina2017, this resembles a simplified version of PGD in Jin2019. Sang and Liu Sang2018 develop an algorithm (adaptive stochastic gradient Langevin dynamics, ASGLD), which provably reaches an $\epsilon$-second-order critical point in with high probability. With full gradients, ASGLD reaches an $\epsilon$-second-order critical point in queries with high probability.

One might hope that the noise inherent in vanilla SGD would help it escape saddle points without noise injection. Daneshmand et al. Daneshmand2018 propose the correlated negative curvature assumption (CNC), under which they prove that SGD reaches an -second-order critical point in queries with high probability. They also show that, under the CNC assumption, a variant of GD (in which iterates are perturbed only by SGD steps) efficiently escapes saddle points. Importantly, these guarantees are completely dimension-free.

A first-order method can include approximations of the Hessian (e.g., with a difference of gradients). For example, Allen-Zhu’s Natasha 2 algorithm AllenZhua2017 uses first-order information (function value and stochastic gradients) to search for directions of negative curvature of the Hessian. Natasha 2 reaches an -second-order critical point in iterations.

Many classical optimization algorithms have been generalized to optimization on manifolds, including gradient descent, Newton’s method, trust regions and adaptive cubic regularization edelman1998geometry ; AbsilBook ; genrtr ; newton ; trustMan ; arcMan ; bento2017iterationcomplexity. Bonnabel bonnabel extends stochastic gradient descent to Riemannian manifolds and proves that Riemannian SGD converges to critical points of the cost function. Zhang et al. speedup and Sato et al. speedup2 both use variance reduction to speed up SGD on Riemannian manifolds.

## 2 Preliminaries: Optimization on manifolds

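We review the key definitions and tools for optimization on manifolds. For more information, see AbsilBook. Let $\mathcal{M}$ be a $d$-dimensional Riemannian manifold: a real, smooth $d$-manifold equipped with a Riemannian metric. We associate with each $x \in \mathcal{M}$ a $d$-dimensional real vector space $T_x\mathcal{M}$, called the tangent space at $x$. For embedded submanifolds of $\mathbb{R}^n$, we often visualize the tangent space as being tangent to the manifold at $x$. The Riemannian metric defines an inner product $\langle \cdot, \cdot \rangle_x$ on the tangent space $T_x\mathcal{M}$, with associated norm $\|\cdot\|_x$. We denote these by $\langle \cdot, \cdot \rangle$ and $\|\cdot\|$ when $x$ is clear from context. A vector in the tangent space is a tangent vector. The set of pairs $(x, s_x)$ for $x \in \mathcal{M}$, $s_x \in T_x\mathcal{M}$ is called the tangent bundle $T\mathcal{M}$. Define $B_{x,r}(s) = \{\dot s \in T_x\mathcal{M} : \|\dot s - s\|_x \le r\}$: the closed ball of radius $r$ centered at $s \in T_x\mathcal{M}$. We occasionally denote $B_{x,r}(s)$ by $B_r(s)$ when $x$ is clear from context. Let $\mathrm{Uniform}(B_{x,r}(s))$ denote the uniform distribution over the ball $B_{x,r}(s)$.

The Riemannian gradient $\mathrm{grad} f(x)$ of a differentiable function $f$ at $x$ is the unique vector in $T_x\mathcal{M}$ satisfying $\mathrm{D}f(x)[s] = \langle \mathrm{grad} f(x), s \rangle_x$ for all $s \in T_x\mathcal{M}$, where $\mathrm{D}f(x)[s]$ is the directional derivative of $f$ at $x$ along $s$. The Riemannian metric also gives rise to a well-defined notion of derivative of vector fields called the Riemannian connection or Levi–Civita connection $\nabla$. The Hessian of $f$ is the derivative of the gradient vector field: $\mathrm{Hess} f(x)[u] = \nabla_u \mathrm{grad} f(x)$. The Hessian describes how the gradient changes. $\mathrm{Hess} f(x)$ is a symmetric linear operator on $T_x\mathcal{M}$. If the manifold is a Euclidean space, $\mathcal{M} = \mathbb{R}^d$, with the standard metric $\langle x, y \rangle = x^\top y$, the Riemannian gradient $\mathrm{grad} f$ and Hessian $\mathrm{Hess} f$ coincide with the standard gradient $\nabla f$ and Hessian $\nabla^2 f$.

As discussed in Section 1, the retraction is a mapping which allows us to move along the manifold from a point $x$ in the direction of a tangent vector $s \in T_x\mathcal{M}$. Formally: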

###### Definition 2.1 (Retraction, from AbsilBook ).

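A retraction on a manifold $\mathcal{M}$ is a smooth mapping $\mathrm{Retr}$ from the tangent bundle $T\mathcal{M}$ to $\mathcal{M}$ satisfying properties 1 and 2 below. Let $\mathrm{Retr}_x\colon T_x\mathcal{M} \to \mathcal{M}$ denote the restriction of $\mathrm{Retr}$ to $T_x\mathcal{M}$.

1. $\mathrm{Retr}_x(0_x) = x$, where $0_x$ is the zero vector in $T_x\mathcal{M}$.

2. The differential of $\mathrm{Retr}_x$ at $0_x$, $\mathrm{D}\,\mathrm{Retr}_x(0_x)$, is the identity map.

(Our algorithm and theory only require $\mathrm{Retr}$ to be defined in balls of a fixed radius around the origins of tangent spaces.) Recall these special retractions, which are good to keep in mind for intuition: on $\mathbb{R}^d$, we typically use $\mathrm{Retr}_x(s) = x + s$, and on the unit sphere we typically use $\mathrm{Retr}_x(s) = (x + s)/\|x + s\|$.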

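For $x$ in $\mathcal{M}$, define the pullback of $f$ from the manifold to the tangent space by

$$\hat f_x = f \circ \mathrm{Retr}_x\colon T_x\mathcal{M} \to \mathbb{R}.$$

This is a real function on a vector space. Furthermore, for $x \in \mathcal{M}$ and $s \in T_x\mathcal{M}$, let

$$T_{x,s} = \mathrm{D}\,\mathrm{Retr}_x(s)\colon T_x\mathcal{M} \to T_{\mathrm{Retr}_x(s)}\mathcal{M}$$

denote the differential of $\mathrm{Retr}_x$ at $s$ (a linear operator). The gradient and Hessian of the pullback admit the following nice expressions in terms of those of $f$, and the retraction.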
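To make the pullback concrete, the sketch below builds $\hat f_x = f \circ \mathrm{Retr}_x$ on the unit sphere and compares a finite-difference directional derivative of the pullback at the origin with the inner product between the Riemannian gradient and the direction. The toy cost, the constant `DIAG`, the function names, and the finite-difference step `h` are assumptions chosen for illustration.

```python
import math

DIAG = [1.0, 2.0, 3.0]  # toy cost f(x) = 0.5 * sum_i DIAG[i] * x_i^2 (assumption)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def f(x):
    return 0.5 * sum(a * v * v for a, v in zip(DIAG, x))

def retract(x, s):
    # Projection retraction on the unit sphere.
    y = [a + b for a, b in zip(x, s)]
    n = math.sqrt(dot(y, y))
    return [a / n for a in y]

def pullback(x, s):
    # Pullback of f to the tangent space at x: s -> f(Retr_x(s)).
    return f(retract(x, s))

def rgrad(x):
    # Riemannian gradient on the sphere: tangent projection of the Euclidean gradient.
    eg = [a * v for a, v in zip(DIAG, x)]
    c = dot(eg, x)
    return [g - c * v for g, v in zip(eg, x)]

def directional_derivative(x, s, h=1e-6):
    # Central finite difference of t -> pullback(x, t * s) at t = 0.
    plus = pullback(x, [h * v for v in s])
    minus = pullback(x, [-h * v for v in s])
    return (plus - minus) / (2.0 * h)
```

For a tangent direction `s` at `x`, `directional_derivative(x, s)` and `dot(rgrad(x), s)` agree up to finite-difference error, which reflects the fact that the gradient of the pullback at the origin is the Riemannian gradient at $x$.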

###### Lemma 2.2 (Lemma 5.2 of arcMan ).

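For $f\colon \mathcal{M} \to \mathbb{R}$ twice continuously differentiable, $x \in \mathcal{M}$ and $s \in T_x\mathcal{M}$, with $T_{x,s}^*$ denoting the adjoint of $T_{x,s}$,

$$\nabla \hat f_x(s) = T_{x,s}^*\, \mathrm{grad} f(\mathrm{Retr}_x(s)), \qquad \nabla^2 \hat f_x(s) = T_{x,s}^* \circ \mathrm{Hess} f(\mathrm{Retr}_x(s)) \circ T_{x,s} + W_s,$$

where $W_s$ is a symmetric linear operator on $T_x\mathcal{M}$ defined through polarization by

$$\langle W_s[\dot s], \dot s \rangle = \langle \mathrm{grad} f(\mathrm{Retr}_x(s)), \gamma''(0) \rangle,$$

with $\gamma''(0) \in T_{\mathrm{Retr}_x(s)}\mathcal{M}$ the intrinsic acceleration on $\mathcal{M}$ of $\gamma(t) = \mathrm{Retr}_x(s + t\dot s)$ at $t = 0$.

The velocity of a curve $\gamma\colon \mathbb{R} \to \mathcal{M}$ is $\gamma'(t) = \frac{\mathrm{d}\gamma}{\mathrm{d}t}(t)$. The intrinsic acceleration $\gamma''$ of $\gamma$ is the covariant derivative (induced by the Levi–Civita connection) of the velocity of $\gamma$: $\gamma'' = \frac{\mathrm{D}}{\mathrm{d}t}\gamma'$. When $\mathcal{M}$ is a Riemannian submanifold of $\mathbb{R}^n$, $\gamma''(t)$ does not necessarily coincide with $\frac{\mathrm{d}^2\gamma}{\mathrm{d}t^2}(t)$. In this case, $\gamma''(t)$ is the orthogonal projection of $\frac{\mathrm{d}^2\gamma}{\mathrm{d}t^2}(t)$ onto $T_{\gamma(t)}\mathcal{M}$.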

## 3 PRGD efficiently escapes saddle points

We now precisely state the assumptions, the main result, and some important parts of the proof of the main result, including the main obstacles faced in generalizing PGD to manifolds. A full proof of all results is provided in the appendix.

### 3.1 Assumptions

The first assumption, namely, that $f$ is lower bounded, ensures that there are points on the manifold where the gradient is arbitrarily small.

###### Assumption 1.

$f$ is lower bounded: $f(x) \ge f^*$ for all $x \in \mathcal{M}$.

Generalizing from the Euclidean case, we assume Lipschitz-type conditions on the gradients and Hessians of the pullbacks $\hat f_x$. For the special case of $\mathcal{M} = \mathbb{R}^d$ and $\mathrm{Retr}_x(s) = x + s$, these assumptions hold if the gradient and Hessian are each Lipschitz continuous, as in (Jin2019, A1) (with the same constants). The Lipschitz-type assumptions below are similar to assumption A2 of arcMan. Notice that these assumptions involve both the cost function and the retraction: this dependency is further discussed in trustMan ; arcMan for a similar setting.

###### Assumption 2.

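There exist $b_1 > 0$ and $L > 0$ such that, for all $x \in \mathcal{M}$ and all $s \in T_x\mathcal{M}$ with $\|s\| \le b_1$,

$$\left\| \nabla \hat f_x(s) - \nabla \hat f_x(0) \right\| \le L \|s\|.$$

###### Assumption 3.

There exist $b_2 > 0$ and $\rho > 0$ such that, for all $x \in \mathcal{M}$ and all $s \in T_x\mathcal{M}$ with $\|s\| \le b_2$,

$$\left\| \nabla^2 \hat f_x(s) - \nabla^2 \hat f_x(0) \right\| \le \rho \|s\|,$$

where on the left-hand side we use the operator norm.

More precisely, we only need these assumptions to hold at the iterates $x_0, x_1, \ldots$ Let $b = \min\{b_1, b_2\}$. (We do this to reduce the number of parameters in Algorithm 1.) The next assumption requires the chosen retraction to be well behaved, in the sense that the (intrinsic) acceleration of curves on the manifold, defined below, must remain bounded (compare with Lemma 2.2).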

###### Assumption 4.

There exists $\beta \ge 0$ such that, for all $x \in \mathcal{M}$ and $s \in T_x\mathcal{M}$ satisfying $\|s\| = 1$, the curve $\gamma(t) = \mathrm{Retr}_x(ts)$ has initial acceleration bounded by $\beta$: $\|\gamma''(0)\| \le \beta$.

If Assumption 4 holds with $\beta = 0$, $\mathrm{Retr}$ is said to be second order (AbsilBook, p107). Second-order retractions include the so-called exponential map and the standard retractions on $\mathbb{R}^d$ and the unit sphere mentioned earlier; see malick for a large class of such retractions on relevant manifolds.
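As a short worked check of the second-order property for the projection retraction on the unit sphere (a sketch using only the definitions above, with $x$ a unit vector and $s$ a unit tangent vector, so $\langle x, s \rangle = 0$), consider $\gamma(t) = \mathrm{Retr}_x(ts) = (x + ts)/\sqrt{1 + t^2}$. Expanding to second order in $t$:

$$\gamma(t) = (x + ts)\left(1 - \tfrac{1}{2}t^2 + O(t^4)\right) = x + ts - \tfrac{1}{2}t^2 x + O(t^3), \qquad \frac{\mathrm{d}^2\gamma}{\mathrm{d}t^2}(0) = -x.$$

The intrinsic acceleration is the orthogonal projection of this vector onto the tangent space at $\gamma(0) = x$:

$$\gamma''(0) = \left(I - x x^\top\right)(-x) = -x + \langle x, x \rangle x = 0.$$

Hence $\|\gamma''(0)\| = 0$ for every such $x$ and $s$, which is the condition in Assumption 4 with the bound equal to zero.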

###### Definition 3.1.

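A point $x \in \mathcal{M}$ is an $\epsilon$-second-order critical point of the twice-differentiable function $f\colon \mathcal{M} \to \mathbb{R}$ satisfying Assumption 3 if

$$\|\mathrm{grad} f(x)\| \le \epsilon, \quad \text{and} \quad \lambda_{\min}(\mathrm{Hess} f(x)) \ge -\sqrt{\rho\epsilon}, \tag{7}$$

where $\lambda_{\min}(H)$ denotes the smallest eigenvalue of the symmetric operator $H$.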
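The two conditions in (7) can be checked mechanically once the gradient norm and the smallest Hessian eigenvalue are available. The sketch below does this for a toy quadratic cost $f(x) = \tfrac{1}{2} x^\top A x$ on the unit sphere with $A$ diagonal, evaluated at the coordinate vectors $e_i$, where the Riemannian gradient vanishes and the Riemannian Hessian has eigenvalues $a_j - a_i$ for $j \ne i$. The cost, the value used for $\rho$, and the function names are assumptions for this example.

```python
import math

def is_eps_second_order_critical(grad_norm, lambda_min, eps, rho):
    # Conditions (7): small gradient and almost positive semidefinite Hessian.
    return grad_norm <= eps and lambda_min >= -math.sqrt(rho * eps)

def hessian_lambda_min_at_axis(diag, i):
    # For f(x) = 0.5 * x^T diag(a) x on the unit sphere, at x = e_i the
    # Riemannian Hessian has eigenvalues a_j - a_i for j != i.
    return min(a - diag[i] for j, a in enumerate(diag) if j != i)

diag = [1.0, 2.0, 3.0]
eps, rho = 1e-2, 1.0  # hypothetical tolerance and Lipschitz-type constant

at_minimizer = is_eps_second_order_critical(0.0, hessian_lambda_min_at_axis(diag, 0), eps, rho)
at_saddle = is_eps_second_order_critical(0.0, hessian_lambda_min_at_axis(diag, 1), eps, rho)
```

Here `at_minimizer` is true because both remaining eigenvalues are positive at $e_1$, while `at_saddle` is false because the Hessian at $e_2$ has an eigenvalue equal to $-1$, well below $-\sqrt{\rho\epsilon}$.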

For compact manifolds, all of these assumptions hold (all proofs are in the appendix):

###### Lemma 3.2.

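Let $\mathcal{M}$ be a compact Riemannian manifold equipped with a retraction $\mathrm{Retr}$. Assume $f\colon \mathcal{M} \to \mathbb{R}$ is three times continuously differentiable. Pick an arbitrary $b > 0$. Then, there exist $f^*$, $L > 0$, $\rho > 0$ and $\beta \ge 0$ such that Assumptions 1, 2, 3 and 4 are satisfied.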

### 3.2 Main results

Recall that PRGD (Algorithm 1) works as follows. If $\|\mathrm{grad} f(x_t)\| > \epsilon$, perform a Riemannian gradient descent step, $x_{t+1} = \mathrm{Retr}_{x_t}(-\eta\, \mathrm{grad} f(x_t))$. If $\|\mathrm{grad} f(x_t)\| \le \epsilon$, then perturb, i.e., sample $\xi \sim \mathrm{Uniform}(B_{x_t,r}(0))$ and let $s_0 = \eta\xi$. After this perturbation, remain in the tangent space $T_{x_t}\mathcal{M}$ and do (at most) $\mathscr{T}$ gradient descent steps on the pullback $\hat f_{x_t}$, starting from $s_0$. We denote this sequence of tangent space steps by $\{s_j\}_{j \ge 0}$. This sequence of gradient descent steps is performed by TangentSpaceSteps: a deterministic procedure in the (linear) tangent space.

One difficulty with this approach is that, under our assumptions, for some $x$, $\nabla \hat f_x$ may not be Lipschitz continuous in all of $T_x\mathcal{M}$. However, it is easy to show that $\nabla \hat f_x$ is Lipschitz continuous in the ball of radius $b$ by compactness, uniformly in $x$. This is why we limit our algorithm to these balls. If the sequence of iterates $\{s_j\}$ escapes the ball $B_{x,b}(0)$ for some $s_j$, TangentSpaceSteps returns the point between $s_{j-1}$ and $s_j$ on the boundary of that ball.
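The procedure described in the two paragraphs above can be sketched as follows for the unit sphere with the projection retraction. This is a hedged reading of the prose, not a transcription of Algorithm 1: the toy cost, the closed-form pullback gradient (specific to this retraction), the function names, the loop condition, and all parameter values are assumptions.

```python
import math
import random

DIAG = [1.0, 2.0, 3.0]  # toy cost f(x) = 0.5 * sum_i DIAG[i] * x_i^2 (assumption)

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def norm(u):
    return math.sqrt(dot(u, u))

def f(x):
    return 0.5 * sum(a * v * v for a, v in zip(DIAG, x))

def project(x, v):
    # Orthogonal projection of v onto the tangent space of the sphere at x.
    c = dot(x, v)
    return [b - c * a for a, b in zip(x, v)]

def rgrad(x):
    return project(x, [a * v for a, v in zip(DIAG, x)])

def retract(x, s):
    y = [a + b for a, b in zip(x, s)]
    n = norm(y)
    return [a / n for a in y]

def pullback_grad(x, s):
    # Gradient of f o Retr_x at s for the projection retraction on the sphere:
    # project grad f(Retr_x(s)) back to the tangent space at x and divide by ||x + s||.
    n = norm([a + b for a, b in zip(x, s)])
    return [g / n for g in project(x, rgrad(retract(x, s)))]

def tangent_space_steps(x, s0, eta, b, steps):
    s = list(s0)
    for _ in range(steps):
        g = pullback_grad(x, s)
        s_next = [a - eta * c for a, c in zip(s, g)]
        if norm(s_next) >= b:
            # Leaving the ball: return the boundary point on the segment [s, s_next].
            d = [c - a for a, c in zip(s, s_next)]
            qa, qb, qc = dot(d, d), 2.0 * dot(s, d), dot(s, s) - b * b
            alpha = (-qb + math.sqrt(qb * qb - 4.0 * qa * qc)) / (2.0 * qa)
            return retract(x, [a + alpha * c for a, c in zip(s, d)])
        s = s_next
    return retract(x, s)

def sample_tangent_ball(x, radius, rng):
    # Uniform sample from the ball of the given radius in the tangent space at x.
    g = project(x, [rng.gauss(0.0, 1.0) for _ in x])
    scale = radius * rng.random() ** (1.0 / (len(x) - 1)) / norm(g)
    return [scale * v for v in g]

def prgd(x0, eta, r, inner_steps, eps, total, b, seed=0):
    rng = random.Random(seed)
    x, t = list(x0), 0
    while t <= total:
        if norm(rgrad(x)) > eps:
            # Step on the manifold (a single tangent space step from the origin).
            x = tangent_space_steps(x, [0.0] * len(x), eta, b, 1)
            t += 1
        else:
            # Perturb in the tangent space, then run tangent space steps.
            xi = sample_tangent_ball(x, r, rng)
            x = tangent_space_steps(x, [eta * v for v in xi], eta, b, inner_steps)
            t += inner_steps
    return x
```

Starting at the saddle point `[0.0, 1.0, 0.0]` of this toy cost, a call such as `prgd([0.0, 1.0, 0.0], eta=0.2, r=0.5, inner_steps=100, eps=1e-3, total=1000, b=0.5)` leaves the saddle through the tangent space phase and then converges to a minimizer through steps on the manifold. Keeping the perturbed phase in one fixed tangent space is what makes `tangent_space_steps` an ordinary vector-space gradient descent loop.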

Following Jin2019, we use a set of carefully balanced parameters. Parameters $\epsilon$ and $\delta$ are user-defined. The claim in Theorem 3.4 below holds with probability at least $1 - \delta$. Assumption 1 provides parameter $f^*$. Assumptions 2 and 3 provide parameters $L$, $\rho$ and $b$. As announced, the latter two assumptions further ensure Lipschitz continuity of the gradients of the pullbacks in balls of the tangent spaces, uniformly: this defines the parameter $\ell$, as prescribed below.

###### Lemma 3.3.

Under Assumptions 2 and 3, there exists $\ell \in [L, L + \rho b]$ such that, for all $x \in \mathcal{M}$, the gradient of the pullback, $\nabla \hat f_x$, is $\ell$-Lipschitz continuous in the ball $B_{x,b}(0)$.

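Then, choose $\chi$ (preferably small) such that

$$\chi \ge 4 \log_2\!\left( \frac{2^{31}\, \ell^2 \sqrt{d}\, (f(x_0) - f^*)}{\delta \sqrt{\rho}\, \epsilon^{5/2}} \right), \tag{8}$$

and set algorithm parameters

$$\eta = \frac{1}{\ell}, \qquad r = \frac{\epsilon}{400 \chi^3}, \qquad \mathscr{T} = \frac{\ell \chi}{\sqrt{\rho\epsilon}}, \tag{9}$$

where $\chi$ is such that $\mathscr{T}$ is an integer. We also use this notation in the proofs:

$$\mathscr{F} = \frac{1}{50 \chi^3} \sqrt{\frac{\epsilon^3}{\rho}}, \qquad \mathscr{L} = \frac{1}{4\chi} \sqrt{\frac{\epsilon}{\rho}}. \tag{10}$$

###### Theorem 3.4.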

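Assume $f$ satisfies Assumptions 1, 2 and 3. For an arbitrary $x_0 \in \mathcal{M}$, with $0 < \epsilon \le b^2\rho$, $d = \dim \mathcal{M}$, and $\delta \in (0, 1)$, choose $\eta, r, \mathscr{T}$ as in (9). Then, setting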

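$$T = 8 \max\left\{ \frac{\mathscr{T}}{3},\ \frac{(f(x_0) - f^*)\, \mathscr{T}}{\mathscr{F}},\ \frac{f(x_0) - f^*}{\eta \epsilon^2} \right\} = O\!\left( \frac{\ell (f(x_0) - f^*)}{\epsilon^2} (\log d)^4 \right), \tag{11}$$

PRGD visits at least two iterates $x_t \in \mathcal{M}$ satisfying $\|\mathrm{grad} f(x_t)\| \le \epsilon$. With probability at least $1 - \delta$, at least two-thirds of those iterates satisfy

$$\|\mathrm{grad} f(x_t)\| \le \epsilon \quad \text{and} \quad \lambda_{\min}\!\left(\nabla^2 \hat f_{x_t}(0)\right) \ge -\sqrt{\rho\epsilon}.$$

The algorithm uses at most $T + \mathscr{T}$ gradient queries (and no function or Hessian queries).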
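To see how the parameters in (8)–(11) fit together numerically, the following sketch computes them from given constants. The function name, the dictionary keys, and the example constants are hypothetical. Making $\mathscr{T}$ an integer by rounding it up and then enlarging $\chi$ to match is one possible choice, assumed here, which keeps (8) satisfied.

```python
import math

def prgd_parameters(ell, rho, eps, delta, dim, gap):
    """Compute the quantities in (8)-(11); gap stands for f(x0) - f*."""
    sqrt_rho_eps = math.sqrt(rho * eps)
    # Smallest chi allowed by (8).
    chi_min = 4.0 * math.log2(
        2.0 ** 31 * ell ** 2 * math.sqrt(dim) * gap / (delta * math.sqrt(rho) * eps ** 2.5)
    )
    # Round the number of tangent space steps up, then adjust chi so that (9) holds exactly.
    inner_steps = math.ceil(ell * chi_min / sqrt_rho_eps)
    chi = inner_steps * sqrt_rho_eps / ell
    eta = 1.0 / ell
    r = eps / (400.0 * chi ** 3)
    script_f = math.sqrt(eps ** 3 / rho) / (50.0 * chi ** 3)
    script_l = math.sqrt(eps / rho) / (4.0 * chi)
    # Total budget T from (11).
    total = 8.0 * max(inner_steps / 3.0, gap * inner_steps / script_f, gap / (eta * eps ** 2))
    return {
        "chi_min": chi_min, "chi": chi, "eta": eta, "r": r,
        "inner_steps": inner_steps, "script_f": script_f,
        "script_l": script_l, "total": total,
    }

params = prgd_parameters(ell=1.0, rho=1.0, eps=0.01, delta=0.1, dim=100, gap=1.0)
```

With these example constants the budget `params["total"]` is very large, consistent with the remark in Section 1.1 that this is a mostly theoretical algorithm.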

By Assumption 4 and Lemma 2.2, $\nabla^2 \hat f_x(0)$ is close to $\mathrm{Hess} f(x)$, which allows us to conclude:

###### Corollary 3.5.

Assume satisfies Assumptions 1, 2, 3 and 4. For an arbitrary , with , , and , choose as in (9). Then, setting as in (11), visits at least two iterates satisfying . With probability at least , at least two-thirds of those iterates are -second-order points. If (that is, the retraction is second order), then the same claim holds for -second-order points instead of . The algorithm uses at most gradient queries.

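Assume $\mathcal{M} = \mathbb{R}^d$ with standard inner product and standard retraction $\mathrm{Retr}_x(s) = x + s$. As in Jin2019, assume $f$ is lower bounded, $\nabla f$ is $L$-Lipschitz in $\mathbb{R}^d$, and $\nabla^2 f$ is $\rho$-Lipschitz in $\mathbb{R}^d$. Then, Assumptions 1, 2 and 3 hold with $b = +\infty$. Furthermore, Assumption 4 holds with $\beta = 0$ so that $\mathrm{Hess} f(x) = \nabla^2 \hat f_x(0)$ (Lemma 2.2). For all $x \in \mathcal{M}$, $\nabla \hat f_x$ has Lipschitz constant $\ell = L$ since $\hat f_x(s) = f(x + s)$. Therefore, using $\ell = L$, and choosing $\eta, r, \mathscr{T}$ as in (9), PRGD reduces to PGD, and Theorem 3.4 recovers the result of Jin et al. Jin2019: this confirms that the present result is a bona fide generalization.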

For the important special case of compact manifolds, Lemmas 3.2 and 3.3 yield:

###### Corollary 3.6.

Assume $\mathcal{M}$ is a compact Riemannian manifold equipped with a retraction $\mathrm{Retr}$, and $f\colon \mathcal{M} \to \mathbb{R}$ is three times continuously differentiable. Pick an arbitrary $b > 0$. Then, Assumptions 1, 2, 3, 4 hold for some $f^*$, $L$, $\rho$ and $\beta$, so that Corollary 3.5 applies with some $\ell$.

###### Remark 3.7.

PRGD, like PGD (Algorithm 4 in Jin2019), does not specify which iterate is an $\epsilon$-second-order critical point. However, it is straightforward to include a termination condition in PRGD which halts the algorithm and returns a suspected $\epsilon$-second-order critical point. Indeed, Jin et al. include such a termination condition in their original PGD algorithm Jina2017, which here would go as follows: after performing a perturbation and $\mathscr{T}$ (tangent space) steps in $T_{x_t}\mathcal{M}$, return $x_t$ if $f(x_{t+\mathscr{T}}) - f(x_t) > -f_{\mathrm{thres}}$, i.e., the function value does not decrease enough. The termination condition requires a threshold $f_{\mathrm{thres}}$ which is balanced like the other parameters of PRGD in (9).

### 3.3 Main proof ideas

Theorem 3.4 follows from the following two lemmas which we prove in the appendix. These lemmas state that, in each round of the while-loop in PRGD, if $x_t$ is not at an $\epsilon$-second-order critical point, PRGD makes progress, that is, decreases the cost function value (the first lemma is deterministic, the second one is probabilistic). Yet, the value of $f$ on the iterates can only decrease so much because $f$ is bounded below by $f^*$. Therefore, the probability that PRGD does not visit an $\epsilon$-second-order critical point is low.

###### Lemma 3.8.

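Under Assumptions 2 and 3, set $\eta = 1/\ell$ for some $\ell \ge L$. If $x \in \mathcal{M}$ satisfies $\|\mathrm{grad} f(x)\| > \epsilon$ with $\epsilon \le b^2\rho$ and $\ell \ge \sqrt{\rho\epsilon}$, then,

$$f(\textsc{TangentSpaceSteps}(x, 0, \eta, b, 1)) - f(x) \le -\eta\epsilon^2/2.$$

###### Lemma 3.9.

Under Assumptions 2 and 3, let $x \in \mathcal{M}$ satisfy both $\|\mathrm{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_x(0)) \le -\sqrt{\rho\epsilon}$ with $\epsilon \le b^2\rho$ and $\ell \ge \sqrt{\rho\epsilon}$. Set $\eta, r, \mathscr{T}, \mathscr{F}$ as in (9) and (10). Let $s_0 = \eta\xi$ with $\xi \sim \mathrm{Uniform}(B_{x,r}(0))$. Then,

$$\mathbb{P}\big[ f(\textsc{TangentSpaceSteps}(x, s_0, \eta, b, \mathscr{T})) - f(x) \le -\mathscr{F}/2 \big] \ge 1 - \frac{\ell\sqrt{d}}{\sqrt{\rho\epsilon}}\, 2^{10 - \chi/2}.$$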

Lemma 3.8 states that we are guaranteed to make progress if the gradient is large. This follows from the sufficient decrease of RGD steps. Lemma 3.9 states that, with perturbation, GD on the pullback escapes a saddle point with high probability. Lemma 3.9 is analogous to Lemma 11 in Jin2019 .

Let $\mathcal{X}_{\mathrm{stuck}}$ be the set of tangent vectors $s_0$ in $B_{x,\eta r}(0)$ for which GD on the pullback starting from $s_0$ does not escape the saddle point, i.e., the function value does not decrease enough after $\mathscr{T}$ iterations. Following Jin et al.’s analysis Jin2019, we bound the width of this “stuck region” (in the direction of the eigenvector $e_1$ associated with the minimum eigenvalue of the Hessian of the pullback, $\nabla^2 \hat f_x(0)$). Like Jin et al., we do this with a coupling argument, showing that given two GD sequences with starting points sufficiently far apart, one of these sequences must escape. This is formalized in Lemma C.4 of the appendix. A crucial observation to prove Lemma C.4 is that, if the function value of GD iterates does not decrease much, then these iterates must be localized; this is formalized in Lemma C.3 of the appendix, which Jin et al. call “improve or localize.”

We stress that the stuck region concept, coupling argument, improve or localize paradigm, and details of the analysis are due to Jin et al. Jin2019: our main contribution is to show a clean way to generalize the algorithm to manifolds in such a way that the analysis extends with little friction. We believe that the general idea of separating iterations between the manifold and the tangent spaces to achieve different objectives may prove useful to generalize other algorithms as well.

## 4 Perspectives

To perform PGD (Algorithm 4 of Jin2019), one must know the step size $\eta$, perturbation radius $r$ and the number of steps $\mathscr{T}$ to perform after perturbation. These parameters are carefully balanced, and their values depend on the smoothness parameters $\ell$ and $\rho$. In practice, we do not know $\ell$ or $\rho$. An algorithm which does not require knowledge of $\ell$ or $\rho$ but still has the same guarantees as PGD would be useful.

GD equipped with a backtracking line-search method achieves an $\epsilon$-first-order critical point in $O(1/\epsilon^2)$ gradient queries without knowledge of the Lipschitz constant $\ell$. At each iterate $x_t$ of GD, backtracking line-search essentially uses function and gradient queries to estimate the gradient Lipschitz parameter near $x_t$. Perhaps PGD can perform some kind of line-search to locally estimate $\ell$ and $\rho$. We note that if $\rho$ is known and we use line-search-like methods to estimate $\ell$, there are still difficulties applying Jin et al.’s coupling argument.
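For reference, a standard backtracking (Armijo) line-search step for Euclidean GD can be sketched as below. This is a generic textbook scheme, not a method proposed in this paper; the function name, the initial step size, the shrink factor and the sufficient decrease constant are assumptions.

```python
def backtracking_gd_step(f, grad, x, eta0=1.0, shrink=0.5, c=0.5):
    # Shrink the step size until the sufficient decrease condition
    # f(x - eta * g) <= f(x) - c * eta * ||g||^2 holds.
    g = grad(x)
    g2 = sum(v * v for v in g)
    fx = f(x)
    eta = eta0
    while f([a - eta * b for a, b in zip(x, g)]) > fx - c * eta * g2:
        eta *= shrink
    return [a - eta * b for a, b in zip(x, g)], eta

def quadratic(x):
    # Toy objective with gradient Lipschitz constant 10.
    return 5.0 * x[0] * x[0]

def quadratic_grad(x):
    return [10.0 * x[0]]

x_new, eta_used = backtracking_gd_step(quadratic, quadratic_grad, [1.0])
```

On this quadratic, the condition with `c=0.5` holds exactly when the step size is at most the inverse of the gradient Lipschitz constant, so the accepted step size acts as a local estimate of that inverse, obtained from function and gradient queries only.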

Jin et al. Jin2019 develop a stochastic version of PGD known as PSGD. Instead of perturbing when the gradient is small and performing $\mathscr{T}$ GD steps, PSGD simply performs a stochastic gradient step and perturbation at each step. Distinguishing between manifold steps and tangent space steps, we suspect it is possible to develop a Riemannian version of perturbed stochastic gradient descent which achieves an $\epsilon$-second-order critical point in $\tilde{O}(d/\epsilon^4)$ stochastic gradient queries, like PSGD. However, this Riemannian version still performs a certain number of steps in the tangent space when the gradient is small, like PRGD.

## Appendix A Proof that assumptions hold for compact manifolds

###### Proof of Lemma 3.2.

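Since $\mathcal{M}$ is compact and $f$ is continuous, $f$ is lower bounded by some $f^*$.

Recall $\hat f_x = f \circ \mathrm{Retr}_x$. Define $\phi, \psi\colon T\mathcal{M} \to \mathbb{R}$ using operator norms by

$$\phi(x, s) = \left\| \nabla^2 \hat f_x(s) \right\| = \left\| \nabla_s^2 \big(f \circ \mathrm{Retr}(x, s)\big) \right\|,$$

$$\psi(x, s) = \left\| \nabla^3 \hat f_x(s) \right\| = \left\| \nabla_s^3 \big(f \circ \mathrm{Retr}(x, s)\big) \right\|.$$

Since $f$ is three times continuously differentiable and $\mathrm{Retr}$ is smooth, $\phi$ and $\psi$ are each continuous on the tangent bundle $T\mathcal{M}$. The set

$$S_b = \{(x, s) : x \in \mathcal{M},\ s \in T_x\mathcal{M} \text{ with } \|s\| \le b\}$$

is a compact subset of the tangent bundle since $\mathcal{M}$ is compact. Thus, we may define

$$L = \max_{(x,s) \in S_b} \phi(x, s), \quad \text{and} \quad \rho = \max_{(x,s) \in S_b} \psi(x, s),$$

so that $\|\nabla^2 \hat f_x(s)\| \le L$ and $\|\nabla^3 \hat f_x(s)\| \le \rho$ for all $x \in \mathcal{M}$ and $s \in T_x\mathcal{M}$ with $\|s\| \le b$. From here, it is clear that Assumptions 2 and 3 are satisfied, for we can just integrate as in eq. (13) below.

Using the notation from Assumption 4, the map given by $(x, s) \mapsto \|\gamma''(0)\|$, with $\gamma(t) = \mathrm{Retr}_x(ts)$, is continuous since $\mathrm{Retr}$ is smooth. The set

$$V_b = \{(x, s) : x \in \mathcal{M},\ s \in T_x\mathcal{M} \text{ with } \|s\| = 1\}$$

is also compact in $T\mathcal{M}$. Hence, $\beta = \max_{(x,s) \in V_b} \|\gamma''(0)\|$ is a valid choice. ∎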

## Appendix B Proofs for the main results

The proof follows that of Jin et al. Jin2019 closely, reusing many of their key lemmas: we repeat some here for convenience, while highlighting the specificities of the manifold case. We consider it a contribution of this paper that, as a result of our distinction between manifold and tangent space steps, there is limited extra friction, despite the significantly extended generality. In this section and the next, all parameters are chosen as in (9) and (10).

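We assume $\epsilon \le b^2\rho$. We also assume $\ell \ge \sqrt{\rho\epsilon}$ because otherwise we can reach a point satisfying $\|\mathrm{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\sqrt{\rho\epsilon}$ simply using RGD. Indeed, RGD always finds a point satisfying $\|\mathrm{grad} f(x)\| \le \epsilon$, and Assumption 2 implies $\|\nabla^2 \hat f_x(0)\| \le L \le \ell$ so that $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\ell$. Thus, if $\ell < \sqrt{\rho\epsilon}$, every point satisfies $\lambda_{\min}(\nabla^2 \hat f_x(0)) \ge -\sqrt{\rho\epsilon}$.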

We want to prove Theorem 3.4. This theorem follows from the following two lemmas (repeated from Lemmas 3.8 and 3.9 for convenience), which we prove in Appendix C below. Lemma B.1 is deterministic: it is a statement about the cost decrease produced by a single Riemannian gradient step, with bounded step size. Lemma B.2 is probabilistic, and is analogous to Lemma 11 in Jin2019 .

###### Lemma B.1.

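Under Assumptions 2 and 3, set $\eta = 1/\ell$ for some $\ell \ge L$. If $x \in \mathcal{M}$ satisfies $\|\mathrm{grad} f(x)\| > \epsilon$ with $\epsilon \le b^2\rho$ and $\ell \ge \sqrt{\rho\epsilon}$, then,

$$f(\textsc{TangentSpaceSteps}(x, 0, \eta, b, 1)) - f(x) \le -\eta\epsilon^2/2.$$

###### Lemma B.2.

Under Assumptions 2 and 3, let $x \in \mathcal{M}$ satisfy both $\|\mathrm{grad} f(x)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_x(0)) \le -\sqrt{\rho\epsilon}$ with $\epsilon \le b^2\rho$ and $\ell \ge \sqrt{\rho\epsilon}$. Set $\eta, r, \mathscr{T}, \mathscr{F}$ as in (9) and (10). Let $s_0 = \eta\xi$ with $\xi \sim \mathrm{Uniform}(B_{x,r}(0))$. Then,

$$\mathbb{P}\big[ f(\textsc{TangentSpaceSteps}(x, s_0, \eta, b, \mathscr{T})) - f(x) \le -\mathscr{F}/2 \big] \ge 1 - \frac{\ell\sqrt{d}}{\sqrt{\rho\epsilon}}\, 2^{10 - \chi/2}.$$

###### Proof of Theorem 3.4.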

This proof is similar to Jin et al.’s proof of Theorem 9 in Jin2019 .

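Recall that we set

$$T = 8 \max\left\{ \frac{\mathscr{T}}{3},\ \frac{(f(x_0) - f^*)\, \mathscr{T}}{\mathscr{F}},\ \frac{f(x_0) - f^*}{\eta \epsilon^2} \right\}. \tag{12}$$

PRGD performs two types of steps: (1) if $\|\mathrm{grad} f(x_t)\| > \epsilon$, an RGD step on the manifold, and (2) if $\|\mathrm{grad} f(x_t)\| \le \epsilon$, a perturbation in the tangent space followed by GD steps in the tangent space.

There are at most $T/4$ iterates satisfying $\|\mathrm{grad} f(x_t)\| > \epsilon$ (i.e., iterates where an RGD step is performed), for otherwise Lemma B.1 and the definition of $T$ (12) would imply a total decrease of more than $\frac{T}{4} \cdot \frac{\eta\epsilon^2}{2} \ge f(x_0) - f^*$, which contradicts Assumption 1.

The variable $t$ in Algorithm 1 is an upper bound on the number of gradient queries issued so far. For each RGD step on the manifold, $t$ increases by exactly 1. PRGD does not terminate before $t$ exceeds $T$, and for every perturbation the counter increases by exactly $\mathscr{T}$. Therefore, there are at least $\frac{3T}{4\mathscr{T}}$ iterates satisfying $\|\mathrm{grad} f(x_t)\| \le \epsilon$. By the definition of $T$ (12), $\frac{3T}{4\mathscr{T}} \ge 2$.

Suppose PRGD visits more than $\frac{T}{4\mathscr{T}}$ points satisfying $\|\mathrm{grad} f(x_t)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_{x_t}(0)) \le -\sqrt{\rho\epsilon}$. Each of these iterates is followed by a perturbation and at most $\mathscr{T}$ tangent space steps $\{s_j\}$. For at least one such $x_t$, the sequence of tangent space steps does not escape the saddle point (that is, $f(x_{t+\mathscr{T}}) - f(x_t) > -\mathscr{F}/2$), for otherwise the total decrease would exceed $\frac{T}{4\mathscr{T}} \cdot \frac{\mathscr{F}}{2} \ge f(x_0) - f^*$ by the definition of $T$ (12). Yet, by Lemma B.2 and a union bound, the probability that one or more of these sequences does not escape is at most $T \cdot \frac{\ell\sqrt{d}}{\sqrt{\rho\epsilon}}\, 2^{10 - \chi/2}$. Indeed, factoring out the third term in the max,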

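$$T = \frac{8\ell (f(x_0) - f^*)}{\epsilon^2} \max\left\{ \frac{1}{3} \frac{\chi}{\sqrt{\rho\epsilon}} \frac{\epsilon^2}{f(x_0) - f^*},\ 50\chi^4,\ 1 \right\} \le \frac{8\ell (f(x_0) - f^*)}{\epsilon^2} \max\left\{ \chi,\ 50\chi^4,\ 1 \right\} = O\!\left( \frac{\ell (f(x_0) - f^*)}{\epsilon^2} \chi^4 \right),$$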

where we used . Now using

$$\max\left\{ \chi,\ 50\chi^4,\ 1 \right\} \le 2^{18 + \chi/4}$$

for all , and , we find

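$$T \cdot \frac{\ell\sqrt{d}}{\sqrt{\rho\epsilon}}\, 2^{10 - \chi/2} \le \frac{\ell^2\sqrt{d}}{\sqrt{\rho\epsilon}} \cdot \frac{f(x_0) - f^*}{\epsilon^2}\, 2^{31 - \chi/4} \le \delta,$$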

as announced.

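Hence, with probability at least $1 - \delta$, PRGD visits at most $\frac{T}{4\mathscr{T}}$ points satisfying $\|\mathrm{grad} f(x_t)\| \le \epsilon$ and $\lambda_{\min}(\nabla^2 \hat f_{x_t}(0)) \le -\sqrt{\rho\epsilon}$. Using that there are at least $\frac{3T}{4\mathscr{T}}$ iterates with $\|\mathrm{grad} f(x_t)\| \le \epsilon$, we conclude that at least two-thirds of the iterates with $\|\mathrm{grad} f(x_t)\| \le \epsilon$ also satisfy $\lambda_{\min}(\nabla^2 \hat f_{x_t}(0)) \ge -\sqrt{\rho\epsilon}$, with probability at least $1 - \delta$. ∎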

Corollary 3.5 follows directly from Theorem 3.4 and the following lemma.

###### Lemma B.3.

For some (which would typically come from Assumption 3), under Assumption 4 on the retraction, let satisfy and . Then, . In particular, if