Escaping from saddle points on Riemannian manifolds

06/18/2019 ∙ by Yue Sun, et al. ∙ University of Washington

We consider minimizing a nonconvex, smooth function f on a Riemannian manifold M. We show that a perturbed version of Riemannian gradient descent algorithm converges to a second-order stationary point (and hence is able to escape saddle points on the manifold). The rate of convergence depends as 1/ϵ^2 on the accuracy ϵ, which matches a rate known only for unconstrained smooth minimization. The convergence rate depends polylogarithmically on the manifold dimension d, hence is almost dimension-free. The rate also has a polynomial dependence on the parameters describing the curvature of the manifold and the smoothness of the function. While the unconstrained problem (Euclidean setting) is well-studied, our result is the first to prove such a rate for nonconvex, manifold-constrained problems.


1 Introduction

We consider minimizing a non-convex smooth function f on a smooth manifold M,

min_{x ∈ M} f(x),     (1)

where M is a d-dimensional smooth manifold (here d is the dimension of the manifold itself; we do not consider M as a submanifold of a higher-dimensional space: for instance, if M is a 2-dimensional sphere embedded in R^3, its dimension is 2), and f is twice differentiable, with a Hessian that is ρ-Lipschitz (assumptions are formalized in Section 4). This framework includes a wide range of fundamental problems (often non-convex), such as PCA (Edelman et al., 1998), dictionary learning (Sun et al., 2017), low-rank matrix completion (Boumal & Absil, 2011), and tensor factorization (Ishteva et al., 2011). Finding the global minimum of Eq. (1) is in general NP-hard; our goal is to find an approximate second-order stationary point with first-order optimization methods. We are interested in first-order methods because they are extremely prevalent in machine learning, partly because computing Hessians is often too costly. It is then important to understand how first-order methods fare when applied to nonconvex problems, and there has been a wave of recent interest in this topic since Ge et al. (2015), as reviewed below.

In the Euclidean space, it is known that with random initialization, gradient descent avoids saddle points asymptotically (Pemantle, 1990; Lee et al., 2016). Lee et al. (2017) (section 5.5) show that this is also true on smooth manifolds, although the result is expressed in terms of nonstandard manifold smoothness measures. Also, importantly, this line of work does not give quantitative rates for the algorithm’s behaviour near saddle points.

Du et al. (2017) show that gradient descent can be exponentially slow in the presence of saddle points. To alleviate this phenomenon, it has been shown that for a β-gradient-Lipschitz, ρ-Hessian-Lipschitz function, cubic regularization (Carmon & Duchi, 2017) and perturbed gradient descent (Ge et al., 2015; Jin et al., 2017a) converge in polynomial time to an approximate local minimum (defined as a point x satisfying ‖∇f(x)‖ ≤ ε and λ_min(∇²f(x)) ≥ −√(ρε)), and momentum-based methods accelerate this convergence (Jin et al., 2017b). Much less is known about inequality constraints: Nouiehed et al. (2018) and Mokhtari et al. (2018) discuss second-order convergence for general inequality-constrained problems, where they need an NP-hard subproblem (checking the copositivity of a matrix) to admit a polynomial-time approximation algorithm. However, such an approximation exists only under very restrictive assumptions.
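To make the perturbation idea concrete, here is a minimal Euclidean sketch in the spirit of Ge et al. (2015) and Jin et al. (2017a). It is not their exact algorithm; the step size, noise radius, and thresholds below are illustrative choices of ours.

```python
import numpy as np

def perturbed_gd(grad, x0, eta=1e-2, eps=1e-3, radius=1e-2,
                 t_thresh=50, max_iter=1000, rng=None):
    """Euclidean gradient descent with occasional random perturbations.

    When the gradient is small, add noise sampled uniformly from a small ball,
    so iterates can drift off (strict) saddle points.  All parameter values
    here are illustrative, not the tuned constants from the literature.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    x = np.asarray(x0, dtype=float)
    last_perturb = -(t_thresh + 1)          # allow an immediate perturbation
    for t in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= eps and t - last_perturb > t_thresh:
            xi = rng.normal(size=x.shape)
            xi *= radius * rng.uniform() ** (1 / x.size) / np.linalg.norm(xi)
            x = x + xi                      # uniform sample from a ball of radius `radius`
            last_perturb = t
        else:
            x = x - eta * g                 # plain gradient step
    return x

# f(x, y) = x^2 - y^2 has a strict saddle at the origin; plain GD started there
# never moves, while the perturbed version escapes along the y-axis.
saddle_grad = lambda z: np.array([2.0 * z[0], -2.0 * z[1]])
print(perturbed_gd(saddle_grad, np.zeros(2)))
```

The random ball sampling is what provides escape with high probability; the rest is ordinary gradient descent.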

An orthogonal line of work is optimization on Riemannian manifolds. Absil et al. (2009) provide comprehensive background, showing how algorithms such as gradient descent, Newton, and trust-region methods can be implemented on Riemannian manifolds, together with asymptotic convergence guarantees to first-order stationary points. Zhang & Sra (2016) provide global convergence guarantees for first-order methods when optimizing geodesically convex functions. Bonnabel (2013) obtains the first asymptotic convergence result for stochastic gradient descent in this setting, which is further extended by Tripuraneni et al. (2018); Zhang et al. (2016); Khuzani & Li (2017). If the problem is non-convex, or the Riemannian Hessian is not positive definite, one can use second-order methods to escape from saddle points. Boumal et al. (2016a) show that the Riemannian trust-region method converges to a second-order stationary point in polynomial time (see also Kasai & Mishra, 2018; Hu et al., 2018; Zhang & Zhang, 2018). But this method requires a Hessian oracle, whose complexity is d times that of computing the gradient. In Euclidean space, the trust-region subproblem can sometimes be solved via a Hessian-vector-product oracle, whose complexity is about the same as computing gradients. Agarwal et al. (2018) discuss its implementation on Riemannian manifolds, but the complexity and sensitivity of the Hessian-vector-product oracle on a manifold remain unclear.

Previous analyses of gradient descent for non-convex Riemannian problems have worked in the ambient Euclidean space by modeling the manifold with equality constraints. Ge et al. (2015, Appendix B) prove that stochastic projected gradient descent converges to second-order stationary points in polynomial time (the analysis is not geometric, and depends on the algebraic representation of the equality constraints). Sun & Fazel (2018) prove that perturbed projected gradient descent converges at a rate comparable to the unconstrained setting (Jin et al., 2017a), polylogarithmic in the dimension. That paper applies projections from the ambient Euclidean space to the manifold and analyzes the iterations under the Euclidean metric. This approach loses the geometric perspective enabled by Riemannian optimization, and cannot express convergence rates in terms of intrinsic quantities such as the sectional curvature of the manifold.

After finishing this work, we found the recent and independent paper of Criscitiello & Boumal (2019), which gives a similar convergence analysis for a related perturbed Riemannian gradient method. We point out a few differences: (1) Criscitiello & Boumal (2019) make Lipschitz assumptions on the pullback of f to the tangent space. While this makes the analysis simpler, it lumps the properties of the function and the manifold together, and the role of the manifold's curvature is not explicit. In contrast, our rates are expressed in terms of the function's smoothness parameters and the sectional curvature of the manifold separately, capturing the geometry more clearly. (2) The algorithm in Criscitiello & Boumal (2019) uses two types of iterates (some on the manifold and some on a tangent space), whereas all of our algorithm's steps are taken directly on the manifold, which is more natural. (3) To connect our iterations with intrinsic parameters of the manifold, we use the exponential map instead of the more general retraction used in Criscitiello & Boumal (2019).

Contributions. We provide convergence guarantees for a perturbed first-order Riemannian optimization method to second-order stationary points (approximate local minima). We prove that as long as the function is appropriately smooth and the manifold has bounded sectional curvature, a perturbed Riemannian gradient descent algorithm escapes (approximate) saddle points at a rate of order 1/ε² in the accuracy ε, with a polylogarithmic dependence on the dimension of the manifold (hence almost dimension-free), and a polynomial dependence on the smoothness and curvature parameters. This is the first result showing such a rate for Riemannian optimization, and the first to relate the rate to geometric parameters of the manifold.

Despite analogies with the unconstrained (Euclidean) analysis and with the Riemannian optimization literature, the technical challenge in our proof goes beyond combining the two lines of work: we need to analyze the interaction between the first-order method and the second-order structure of the manifold to obtain second-order convergence guarantees that depend on the manifold curvature. Unlike in Euclidean space, the curvature affects the Taylor approximation of gradient steps. On the other hand, unlike the local rate analysis in first-order Riemannian optimization, our second-order analysis requires more refined properties of the manifold structure (in prior work, a first-order oracle makes enough progress for a local convergence rate proof; see Lemma 1), and second-order algorithms such as Boumal et al. (2016a) rely on second-order oracles (Hessian evaluations). See Section 4 for further discussion.

2 Notation and Background

We consider a complete (since our results are local, completeness is not necessary and our results can be easily generalized, with extra assumptions on the injectivity radius), smooth, d-dimensional Riemannian manifold M, equipped with a Riemannian metric g, and we denote by T_x M its tangent space at x ∈ M (which is a vector space of dimension d). We also denote by B_x(r) the ball of radius r in T_x M centered at 0. At any point x ∈ M, the metric induces a natural inner product on the tangent space, denoted by ⟨·, ·⟩_x. We also consider the Levi-Civita connection ∇ (Absil et al., 2009, Theorem 5.3.1). The Riemannian curvature tensor is denoted by R(x)[u, v], where u, v ∈ T_x M, and is defined in terms of the connection ∇ (Absil et al., 2009, Theorem 5.3.1). The sectional curvature K(x)[u, v], for x ∈ M and u, v ∈ T_x M, is then defined in Lee (1997, Prop. 8.8).

Denote the distance (induced by the Riemannian metric) between two points x, y ∈ M by d(x, y). A geodesic γ : [0, 1] → M is a constant-speed curve whose length equals d(γ(0), γ(1)), so it is the shortest path on the manifold linking γ(0) and γ(1). We write γ_{x→y} for the geodesic from x to y (thus γ_{x→y}(0) = x and γ_{x→y}(1) = y).

The exponential map Exp_x(v) maps v ∈ T_x M to a point y ∈ M such that there exists a geodesic γ with γ(0) = x, γ(1) = y, and γ'(0) = v. The injectivity radius at a point x ∈ M is the maximal radius r for which the exponential map is a diffeomorphism on B_x(r). The injectivity radius of the manifold, denoted by I, is the infimum of the injectivity radii over all points. Since the manifold is complete, we have I > 0. When x and y satisfy d(x, y) ≤ I, the exponential map admits an inverse Exp_x^{-1}(y), which satisfies d(x, y) = ‖Exp_x^{-1}(y)‖. Parallel translation Γ_x^y denotes the map which transports a vector v ∈ T_x M to Γ_x^y v ∈ T_y M along γ_{x→y}, such that the vector stays "constant" by satisfying a zero-acceleration condition (Lee, 1997, equation (4.13)).

For a smooth function f : M → R, grad f(x) denotes the Riemannian gradient of f at x ∈ M, which satisfies (d/dt) f(γ(t)) = ⟨γ'(t), grad f(γ(t))⟩ (see Absil et al., 2009, Sec. 3.5.1 and (3.31)). The Hessian of f is defined jointly with the Riemannian structure of the manifold: the (directional) Hessian is H(x)[ξ_x] := ∇_{ξ_x} grad f, and we use H(x) as a shorthand. We call x an ε-approximate saddle point when ‖grad f(x)‖ ≤ ε and λ_min(H(x)) ≤ −√(ρε). We refer the interested reader to Do Carmo (2016) and Lee (1997), which provide a thorough review of these important concepts of Riemannian geometry.
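As a concrete instance of these objects (a standard example, not taken from the paper): for the unit sphere S^{d-1} embedded in R^d and f the restriction of a smooth function f̄ defined on R^d, the Riemannian gradient and Hessian can be written using the tangent-space projector.

```latex
% Riemannian gradient and Hessian on the unit sphere S^{d-1} (standard formulas,
% given here only as an illustration of the notation above).
\[
  P_x = I - x x^\top \quad \text{(orthogonal projector onto } T_x S^{d-1}\text{)},
\]
\[
  \operatorname{grad} f(x) = P_x \nabla \bar f(x),
  \qquad
  H(x)[v] = P_x \nabla^2 \bar f(x)\, v - \big(x^\top \nabla \bar f(x)\big)\, v ,
  \qquad v \in T_x S^{d-1}.
\]
```

Here ∇f̄ and ∇²f̄ denote the Euclidean gradient and Hessian of the extension f̄.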

3 Perturbed Riemannian gradient algorithm

Our main Algorithm 1 runs as follows:

  1. Check the norm of the gradient: if it is large, take one Riemannian gradient descent step; consequently the function value decreases.

  2. If the norm of the gradient is small, the iterate is either an approximate saddle point or an approximate local minimum. Perturb the variable by adding an appropriate level of noise in its tangent space, map it back to the manifold, and run a few more iterations.

    1. If the function value decreases, the iterates are escaping from the approximate saddle point (and the algorithm continues).

    2. If the function value does not decrease, the iterate is an approximate local minimum (and the algorithm terminates).

Input: initial point x_0, parameters β, ρ, K, accuracy ε, probability of success 1 − δ (parameters defined in Assumptions 1, 2, 3 and the assumption of Theorem 1).
Set constants as defined in Lemma 2 and the proof of Lemma 8.
Set threshold values: g_thres (gradient norm), f_thres (function decrease), t_thres (number of steps run after a perturbation), r (perturbation radius).
Set stepsize: η = 1/β.
  while 1 do
     if ‖grad f(x_t)‖ ≤ g_thres and t − t_noise > t_thres then
        t_noise ← t,  x̃_{t_noise} ← x_t,  x_t ← Exp_{x_t}(ξ_t),  with ξ_t uniformly sampled from B_{x_t}(r).
     end if
     if t − t_noise = t_thres and f(x_t) − f(x̃_{t_noise}) > −f_thres then
        output x̃_{t_noise}
     end if
     x_{t+1} ← Exp_{x_t}(−η grad f(x_t)).
     t ← t + 1.
  end while
Algorithm 1 Perturbed Riemannian gradient algorithm

Algorithm 1 relies on the manifold's exponential map, and is useful for cases where this map is easy to compute (which is true for many common manifolds). We refer readers to Lee (1997, pp. 81-86) for the exponential map of the sphere and hyperbolic manifolds, and to Absil et al. (2009, Examples 5.4.2, 5.4.3) for the Stiefel and Grassmann manifolds. If the exponential map is not computable, the algorithm can use a retraction (a first-order approximation of the exponential map which is often easier to compute) instead; however, our current analysis only covers the case of the exponential map. In Figure 1, we illustrate a function with a saddle point on the sphere, and plot the trajectory of Algorithm 1 when it is initialized at a saddle point.
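For the unit sphere these maps have simple closed forms; the following sketch (our own helper functions, with names chosen for illustration) implements the exponential map, its inverse, and parallel transport along the connecting geodesic.

```python
import numpy as np

def exp_sphere(x, v):
    """Exponential map on the unit sphere: move from x along tangent vector v."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def log_sphere(x, y):
    """Inverse exponential map Exp_x^{-1}(y) (valid for y not antipodal to x)."""
    p = y - np.dot(x, y) * x                    # project y onto T_x
    theta = np.arccos(np.clip(np.dot(x, y), -1.0, 1.0))
    norm_p = np.linalg.norm(p)
    return theta * p / norm_p if norm_p > 1e-12 else np.zeros_like(x)

def transport_sphere(x, y, v):
    """Parallel transport of v in T_x to T_y along the connecting geodesic."""
    u = log_sphere(x, y)
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return v
    e = u / nu                                  # unit direction of the geodesic at x
    we = np.dot(v, e)
    return v + we * (np.cos(nu) - 1.0) * e - we * np.sin(nu) * x
```

These closed forms follow from the fact that great circles are the geodesics of the sphere.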

Figure 1: A function with a saddle point on the sphere. We plot the contours of this function on the unit sphere. Algorithm 1 is initialized at a saddle point, perturbs the iterate, runs Riemannian gradient descent, and terminates at a local minimum. We amplify the first iteration to make the saddle perturbation visible.
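Putting the pieces together, here is a compact sketch of the perturbed Riemannian gradient iteration on the unit sphere, in the spirit of Algorithm 1 but with illustrative thresholds and without the termination test. The quadratic test function is our own example, not the function plotted in Figure 1.

```python
import numpy as np

def exp_sphere(x, v):
    """Exponential map on the unit sphere."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def riem_grad(x, egrad):
    """Project a Euclidean gradient onto the tangent space T_x of the sphere."""
    return egrad - np.dot(x, egrad) * x

def perturbed_rgd_sphere(egrad_f, x0, eta=0.05, eps=1e-4, radius=1e-3,
                         t_thresh=30, iters=2000, seed=0):
    rng = np.random.default_rng(seed)
    x, last = np.asarray(x0, dtype=float), -(t_thresh + 1)
    for t in range(iters):
        g = riem_grad(x, egrad_f(x))
        if np.linalg.norm(g) <= eps and t - last > t_thresh:
            xi = riem_grad(x, rng.normal(size=x.shape))     # noise in the tangent space
            xi *= radius / (np.linalg.norm(xi) + 1e-18)
            x, last = exp_sphere(x, xi), t
        else:
            x = exp_sphere(x, -eta * g)                     # Riemannian gradient step
    return x

# Illustrative objective f(x) = x^T A x on the sphere: critical points are eigenvectors
# of A, and the minimizers are the eigenvectors of the smallest eigenvalue.
A = np.diag([3.0, 1.0, -2.0])
egrad = lambda x: 2 * A @ x
x_star = perturbed_rgd_sphere(egrad, np.array([0.0, 1.0, 0.0]))  # start at a saddle point
print(x_star)   # approximately (0, 0, +-1), the minimizer direction
```

The helpers exp_sphere and riem_grad duplicate the sphere utilities above so that the snippet runs on its own.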

4 Main theorem: escape rate for perturbed Riemannian gradient descent

We now turn to our main results, beginning with our assumptions and a statement of our main theorem. We then develop a brief proof sketch.

Our main result involves two smoothness conditions on the function f and one condition on the curvature of the manifold M.

Assumption 1 (Lipschitz gradient).

There is a finite constant β such that

‖grad f(y) − Γ_x^y grad f(x)‖ ≤ β d(x, y) for all x, y ∈ M.

Assumption 2 (Lipschitz Hessian).

There is a finite constant ρ such that

‖H(y) − Γ_x^y H(x) Γ_y^x‖ ≤ ρ d(x, y) for all x, y ∈ M, where ‖·‖ denotes the operator norm.

Assumption 3 (Bounded sectional curvature).

There is a finite constant K such that

|K(x)[u, v]| ≤ K for all x ∈ M and all linearly independent u, v ∈ T_x M.

K is an intrinsic parameter of the manifold capturing the curvature. We list a few examples here: (i) a sphere of radius r has constant sectional curvature 1/r² (Lee, 1997, Theorem 1.9); if the radius is bigger, the curvature is smaller, which means the sphere is less curved; (ii) a hyperbolic space of radius r has sectional curvature −1/r² (Lee, 1997, Theorem 1.9); (iii) for the sectional curvature of the Stiefel and Grassmann manifolds, we refer readers to Rapcsák (2008, Section 5) and Wong (1968), respectively.

Note that the constant K is not directly related to the RLICQ parameter defined by Ge et al. (2015), which first requires describing the manifold by equality constraints. Different representations of the same manifold can lead to different curvature bounds, while the sectional curvature is an intrinsic property of the manifold. If the manifold is a sphere, the two quantities are directly related, but more generally there is no simple connection. The smoothness parameters we assume are natural compared to quantities derived from complicated compositions (Lee et al., 2017, Section 5.5) or pullbacks (Zhang & Zhang, 2018; Criscitiello & Boumal, 2019). With these assumptions, the main result of this paper is the following:

Theorem 1.

Under Assumptions 1, 2, and 3, let b(·) be the function defined in Lemma 2, set the step size to η = 1/β, and suppose the accuracy ε satisfies

(2)

where the constants are defined in Lemma 4. Then, with probability 1 − δ, perturbed Riemannian gradient descent converges to an ε-second-order stationary point of Eq. (1) in a number of

iterations that is of order 1/ε², polylogarithmic in the manifold dimension d, and polynomial in β, ρ, and K.

Proof roadmap. For a function satisfying the smoothness conditions (Assumptions 1 and 2), we use a local upper bound of the objective based on the third-order Taylor expansion (see Supplementary Material Section A for a review),

f(Exp_x(v)) ≤ f(x) + ⟨grad f(x), v⟩ + (1/2)⟨H(x)[v], v⟩ + (ρ/6)‖v‖³.

When the norm of the gradient is large (not near a saddle), the following lemma guarantees the decrease of the objective function in one iteration.

Lemma 1.

(Boumal et al., 2018) Under Assumption 1, by choosing the step size η = 1/β, the Riemannian gradient descent algorithm is monotonically descending: f(x_{t+1}) ≤ f(x_t) − (1/(2β))‖grad f(x_t)‖².
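For intuition, here is the standard one-line derivation of such a descent inequality (a sketch under Assumption 1; the exact constants in Boumal et al., 2018 may be stated differently).

```latex
% One-step decrease of Riemannian gradient descent (sketch).
% Assumption 1 gives a quadratic upper bound along the exponential map:
\[
  f\big(\mathrm{Exp}_x(v)\big) \;\le\; f(x) + \langle \operatorname{grad} f(x), v \rangle
  + \tfrac{\beta}{2}\,\|v\|^2 .
\]
% With the gradient step v = -\tfrac{1}{\beta}\operatorname{grad} f(x_t):
\[
  f(x_{t+1}) \;\le\; f(x_t) - \tfrac{1}{\beta}\,\|\operatorname{grad} f(x_t)\|^2
  + \tfrac{1}{2\beta}\,\|\operatorname{grad} f(x_t)\|^2
  \;=\; f(x_t) - \tfrac{1}{2\beta}\,\|\operatorname{grad} f(x_t)\|^2 .
\]
```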

Thus our main challenge in proving the main theorem is analyzing the behaviour of Riemannian gradient descent near an approximate saddle point:

1. Similar to the Euclidean case studied by Jin et al. (2017a), we need to bound the "thickness" of the "stuck region" where the perturbation fails. We again use a pair of hypothetical auxiliary sequences and study these coupled sequences. When the two perturbations are coupled along the thinnest direction of the stuck region, their distance grows and one of them escapes from the saddle point.

2. However, our iterates evolve on a manifold rather than in a Euclidean space, so our strategy is to map the iterates back to an appropriate fixed tangent space where we can use the Euclidean analysis. This is done using the inverse of the exponential map and various parallel transports.

3. Several key challenges arise in doing this. Unlike Jin et al. (2017a), the structure of the manifold interacts with the local approximation of the objective function in a complicated way. On the other hand, unlike recent work on Riemannian optimization by Boumal et al. (2016a), we do not have access to a second order oracle and we need to understand how the sectional curvature and the injectivity radius (which both capture intrinsic manifold properties) affect the behavior of the first order iterates.

4. Our main contribution is to carefully investigate how the various approximation errors arising from (a) the linearization of the coupled iterations and (b) their mapping to a common tangent space can be controlled on manifolds with bounded sectional curvature. We address these challenges in a sequence of lemmas (Lemmas 3 through 6), which we combine to linearize the coupled iterations in a common tangent space and precisely control the approximation error; the Euclidean analogue is sketched below, and the Riemannian result is formally stated in Lemma 2.
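To make the target of Lemma 2 concrete, recall the Euclidean coupling recursion (our paraphrase of Jin et al., 2017a, with our notation): if x_t and w_t denote the two coupled gradient-descent sequences and x̃ the approximate saddle point, then

```latex
% Euclidean coupling recursion (paraphrase of Jin et al., 2017a); eta is the step size,
% rho the Hessian Lipschitz constant, and delta_t collects the linearization error.
\[
  w_{t+1} - x_{t+1}
  = \big(I - \eta \nabla^2 f(\tilde x)\big)\,(w_t - x_t) + \eta\,\delta_t,
  \qquad
  \|\delta_t\| \le \rho \big(\|x_t - \tilde x\| + \|w_t - \tilde x\|\big)\,\|w_t - x_t\| .
\]
```

Lemma 2 establishes the analogue of this recursion in the tangent space at the saddle point, with additional error terms controlled by the curvature bound K.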

Lemma 2.

Let x be a saddle point, and map the two coupled iterate sequences to the tangent space T_x M via the inverse exponential map and parallel transport. Under Assumptions 1, 2, and 3, if all pairwise distances between the points involved are small enough, then, for an explicit constant depending only on the curvature bound, the mapped iterates follow a linear recursion up to a controlled error:

(3)

The proof of this lemma includes novel contributions, both by strengthening a known result (Lemma 3) and by combining known inequalities in novel ways (Lemmas 4 to 6), which allow us to control all the approximation errors and arrive at a tight rate of escape for the algorithm.

5 Proof of Lemma 2

Lemma 2 controls the error of the linear approximation of the iterates when they are mapped to the tangent space at the saddle point. In this section, we assume that all points lie within a region whose diameter is bounded as required by Eq. (2); that is, the distance between any two points appearing in the following lemmas is less than this bound. The proof of Lemma 2 is based on the following sequence of lemmas.

Lemma 3.

Let x ∈ M and u, v ∈ T_x M, and denote y = Exp_x(u). Then, under Assumption 3,

(4)

This lemma tightens the result of Karcher (1977, C2.3), which only shows a weaker upper bound. We prove the tighter upper bound in the supplement.

Figure 2: (a) Illustration of Eq. (5): map the two iterates to tangent vectors via the inverse exponential map, parallel-transport the two vectors to a common tangent space, and relate them. (b) Lemma 3 bounds the difference between two ways of taking two steps from x: (1) take the combined step in T_x M and map it to the manifold, and (2) take the first step in T_x M, map it to the manifold to obtain y, take the (transported) second step in T_y M, and map it to the manifold. The two resulting points are close.

We also need the following lemma showing that both the exponential map and its inverse are Lipschitz.

Lemma 4.

Let the points under consideration have pairwise distances no larger than the required bound. Then, under Assumption 3,

Intuitively, this lemma relates the norm of the difference of two tangent vectors to the distance between the corresponding points on the manifold, and it follows from bounds on the Hessian of the squared-distance function (Sakai, 1996, Ex. 4, p. 154). The upper bound is proven directly by Karcher (1977, Proof of Cor. 1.6), and we prove the lower bound via Lemma 3 in the supplement.

The following contraction result is fairly classical and is proven using the Rauch comparison theorem from differential geometry (Cheeger & Ebin, 2008).

Lemma 5.

(Mangoubi et al., 2018, Lemma 1) Under Assumption 3, for and ,

Finally we need the following corollary of the Ambrose-Singer theorem (Ambrose & Singer, 1953).

Lemma 6.

(Karcher, 1977, Section 6) Under Assumption 3, for and ,

Lemmas 3 through 6 are mostly proven in the literature; we fill in the missing parts in Supplementary Material Section B, where we also prove Lemma 2.

The spirit of the proof is to linearize the manifold using the exponential map and its inverse, and to carefully bound the various error terms caused by this approximation.

1. We first show, using Lemma 3 twice and Lemma 5, that

2. We use Lemma 4 to linearize this iteration in the tangent space as

3. Using the Hessian Lipschitzness (Assumption 2) and Lemma 6, we map the resulting quantities to the tangent space at the saddle point and compare them there. This is an important intermediate result (see Lemma 9 in Supplementary Material Section B).

(5)

4. We use Lemmas 3 and 4 to approximate two successive iteration updates in the tangent space at the saddle point.

(6)

The same approximation holds for the coupled sequence, with the corresponding iterates substituted.

5. Combining Eq. (5) and Eq. (6), we obtain

Now note that both iterate sequences of the algorithm live on the manifold. We use the inverse exponential map to bring them to the same tangent space at the saddle point.

We have therefore linearized the two coupled trajectories in a common tangent space, and we can adapt the Euclidean escaping-saddle analysis thanks to the error bound proved in Lemma 2.

6 Proof of main theorem

In this section we suppose that all assumptions in Section 4 hold. The proof strategy is to show that, with high probability, the function value decreases sufficiently within a bounded number of iterations after the perturbation at an approximate saddle point. Lemma 7 states that if, after a perturbation and the prescribed number of steps, the iterate is far from the approximate saddle point, then the function value decreases. If the iterates do not move far, the perturbation has fallen into a "stuck region". Lemma 8 uses a coupling strategy and shows that the width of the stuck region is small in the direction of the negative eigenvector of the Riemannian Hessian.

Define

At an approximate saddle point, let the iterate be in its neighborhood and define the quantities below. We consider two iterate sequences generated by the algorithm from two perturbations of the saddle point.

Lemma 7.

Assume that Assumptions 1, 2, 3 and Eq. (2) hold. There exists a constant such that, for any point in the prescribed neighborhood of the approximate saddle point, if after the prescribed number of iterations following the perturbation the iterate has moved far from the saddle point, then the function value has decreased by the required threshold.

Lemma 8.

Assume that Assumptions 1, 2, 3 and Eq. (2) hold. Take two points that are perturbed from an approximate saddle point and whose difference is aligned with the smallest eigenvector ("smallest eigenvector" means the eigenvector corresponding to the smallest eigenvalue) of the Riemannian Hessian at that point, and let the algorithm run two sequences starting from these two points. Then, within the prescribed number of iterations, at least one of the two sequences achieves the required decrease of the function value; in other words, the stuck region is thin along the smallest-eigenvector direction.

We prove Lemmas 7 and 8 in Supplementary Material Section C. In the same section, we also prove the main theorem using the coupling strategy of Jin et al. (2017a), with the additional difficulty of taking into account the effect of the Riemannian geometry (Lemma 2) and the injectivity radius.

7 Examples

kPCA.

We consider the kPCA problem as an example (Tripuraneni et al., 2018): we want to find the k principal eigenvectors of a symmetric matrix H, which corresponds to

minimizing −tr(XᵀHX) over matrices X with orthonormal columns, an optimization problem on the Grassmann manifold defined by the constraint XᵀX = I (the objective depends only on the subspace spanned by X). If the eigenvalues of H are distinct, denote by v_1, ..., v_d the eigenvectors of H, ordered by decreasing eigenvalue. Let V be the matrix whose columns are the top k eigenvectors of H; then the local minimizers of the objective are VR for all unitary matrices R. Denote also by V_S a matrix whose columns are any k distinct eigenvectors; then the first-order stationary points of the objective (with Riemannian gradient equal to 0) are V_S R for all unitary matrices R. In our numerical experiment, we choose H to be a diagonal matrix, so the Euclidean basis vectors form an eigenbasis of H: the first-order stationary points are spanned by k distinct basis vectors, and the local minimizers are those spanned by the top k basis vectors. We start the iteration at a saddle point and see in Fig. 3 that the algorithm converges to a local minimum.
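As a concrete illustration (our own sketch, not the code used for Fig. 3), the following runs a perturbed Riemannian gradient iteration for this objective using orthonormal representatives of subspaces and a QR retraction in place of the exponential map; the diagonal H, step size, and thresholds are illustrative choices.

```python
import numpy as np

def qr_retract(X, V):
    """First-order retraction on the Stiefel manifold via QR factorization."""
    Q, R = np.linalg.qr(X + V)
    return Q * np.sign(np.diag(R))            # fix column signs for continuity

def riem_grad_stiefel(X, G):
    """Project a Euclidean gradient G onto the tangent space at X (X^T X = I)."""
    return G - X @ ((X.T @ G + G.T @ X) / 2)

def kpca_rgd(H, X0, eta=0.05, iters=3000, eps=1e-6, radius=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    X = X0.copy()
    for _ in range(iters):
        G = riem_grad_stiefel(X, -2 * H @ X)  # Euclidean gradient of -tr(X^T H X) is -2 H X
        if np.linalg.norm(G) <= eps:          # near a stationary point: perturb in the tangent space
            Xi = riem_grad_stiefel(X, rng.normal(size=X.shape))
            X = qr_retract(X, radius * Xi / (np.linalg.norm(Xi) + 1e-18))
        else:
            X = qr_retract(X, -eta * G)       # Riemannian gradient step (retraction instead of Exp)
    return X

d, k = 6, 2
H = np.diag(np.arange(d, 0, -1.0))            # eigenvalues 6, 5, ..., 1
X0 = np.eye(d)[:, [2, 3]]                     # span of two non-top eigenvectors: a saddle point
X = kpca_rgd(H, X0)
print(np.trace(X.T @ H @ X))                  # ~ 11 = 6 + 5, the sum of the top-2 eigenvalues
```

The QR-based retraction is used purely for simplicity; the analysis in the paper is stated for the exponential map.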

Burer-Monteiro approach for certain low rank problems.

Following Boumal et al. (2016b), we consider a semidefinite program with a linear objective over positive semidefinite matrices X subject to smooth equality constraints. We factorize X = YYᵀ with an overparametrized factor Y, which turns the feasible set into a smooth manifold. Then any local minimum of the factorized problem

is a global minimum, provided the factorization rank is large enough (Boumal et al., 2016b). In the experiment, we take the data matrix to be sparse, with only the upper-left block random and all other entries 0, and we choose an initial point Y_0 that is a saddle point of the factorized problem. We see in Fig. 3 that the algorithm converges to the global optimum.
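A minimal sketch of such a run (ours, with assumed details): we take the constraint to be diag(X) = 1, so that the factor Y lives on the oblique manifold of unit-norm rows; the data, step size, and the simplified perturbation rule below are illustrative, not the exact experimental setup of the paper.

```python
import numpy as np

def proj_oblique(Y, G):
    """Project G onto the tangent space of the oblique manifold (unit-norm rows) at Y."""
    return G - np.sum(Y * G, axis=1, keepdims=True) * Y

def retract_oblique(Y):
    """Retraction: renormalize every row back to the unit sphere."""
    return Y / np.linalg.norm(Y, axis=1, keepdims=True)

def bm_perturbed_rgd(A, Y0, eta=0.01, iters=5000, eps=1e-6, radius=1e-3, seed=0):
    rng = np.random.default_rng(seed)
    Y = Y0.copy()
    for _ in range(iters):
        G = proj_oblique(Y, 2 * A @ Y)        # Riemannian gradient of <A, Y Y^T>
        if np.linalg.norm(G) <= eps:          # stationary point: perturb in the tangent space
            Xi = proj_oblique(Y, rng.normal(size=Y.shape))
            Y = retract_oblique(Y + radius * Xi / (np.linalg.norm(Xi) + 1e-18))
        else:
            Y = retract_oblique(Y - eta * G)  # gradient step followed by the retraction
    return Y

n, p = 20, 4
rng = np.random.default_rng(1)
A = np.zeros((n, n))
B = rng.normal(size=(5, 5))
A[:5, :5] = (B + B.T) / 2                     # sparse data: only the upper-left block is nonzero
Y0 = np.zeros((n, p)); Y0[:, 0] = 1.0         # all rows equal: the Riemannian gradient vanishes here
Y = bm_perturbed_rgd(A, Y0)
print(np.trace(A @ (Y @ Y.T)))                # objective <A, Y Y^T> after the perturbed descent
```

Because the row-wise projection of the gradient vanishes at the all-equal-rows initialization, plain Riemannian gradient descent started there would not move; the tangent-space noise is what allows the iterates to leave it.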

Figure 3: (a) kPCA problem: we start from an approximate saddle point, and the iterates converge to a local minimum (which is also a global minimum). (b) Burer-Monteiro approach: progress of the iterates plotted against the iteration count; we start from a saddle point, and the iterates converge to a local minimum (which is also a global minimum).

Summary

We have shown that for the constrained problem of minimizing f subject to a manifold constraint, as long as the function and the manifold are appropriately smooth, a perturbed Riemannian gradient descent algorithm escapes saddle points at a rate of order 1/ε² in the accuracy ε, polylogarithmic in the manifold dimension d, and polynomial in the curvature and smoothness parameters.

A natural extension of our result is to consider other variants of gradient descent, such as the heavy-ball method, Nesterov's acceleration, and the stochastic setting. The question is whether these algorithms, appropriately modified to respect the manifold constraint, converge quickly to second-order stationary points (not just first-order stationary points, as studied in the recent literature), and whether the convergence rate can be related to the smoothness of the manifold.

References

Appendix

Organization of the Appendix

In Appendix A we review classical results on Taylor expansions for functions on a Riemannian manifold. In Appendix B we provide the proof of Lemma 2, which requires expanding the iterates in the tangent space at the saddle point. Finally, in Appendix C, we provide the proofs of Lemma 7 and Lemma 8, which enable us to prove the main theorem of the paper.

Throughout the paper we assume that the objective function and the manifold are smooth. Here we list the assumptions that are used in the following lemmas.

Assumption 1 (Lipschitz gradient).

There is a finite constant β such that

‖grad f(y) − Γ_x^y grad f(x)‖ ≤ β d(x, y) for all x, y ∈ M.

Assumption 2 (Lipschitz Hessian).

There is a finite constant ρ such that

‖H(y) − Γ_x^y H(x) Γ_y^x‖ ≤ ρ d(x, y) for all x, y ∈ M, where ‖·‖ denotes the operator norm.

Assumption 3 (Bounded sectional curvature).

There is a finite constant K such that

|K(x)[u, v]| ≤ K for all x ∈ M and all linearly independent u, v ∈ T_x M.

Appendix A Taylor expansions on Riemannian manifolds

We provide here the Taylor expansion for functions and gradients of functions defined on a Riemannian manifold.

A.1 Taylor expansion for the gradient

For any point x ∈ M, let y = Exp_x(v) be a point in the neighborhood of x where the geodesic γ(t) = Exp_x(tv), t ∈ [0, 1], is defined. Then

Γ_{γ(t)}^{x} grad f(γ(t)) = grad f(x) + t H(x)[v] + Δ(t),     (7)

where Δ(t) ∈ T_x M is the remainder term, which satisfies ‖Δ(t)‖ ≤ (ρ/2) t² ‖v‖² under Assumption 2. The Taylor approximation in Eq. (7) is proven by Absil et al. (2009, Lemma 7.4.7).

A.2 Taylor expansion for the function

The Taylor expansion of the gradient enables us to approximate the iterations of the main algorithm, but obtaining the convergence rate of the algorithm requires proving that the function value decreases along the iterations. We therefore need a Taylor expansion of f in terms of the parallel-translated gradient appearing on the left-hand side of Eq. (7). To simplify notation, let γ(t) denote the geodesic Exp_x(tv) for t ∈ [0, 1].

f(Exp_x(v)) = f(x) + ∫_0^1 (d/dt) f(γ(t)) dt     (8a)
            = f(x) + ∫_0^1 ⟨grad f(γ(t)), γ'(t)⟩ dt     (8b)
            = f(x) + ∫_0^1 ⟨Γ_{γ(t)}^{x} grad f(γ(t)), Γ_{γ(t)}^{x} γ'(t)⟩ dt     (8c)
            = f(x) + ∫_0^1 ⟨Γ_{γ(t)}^{x} grad f(γ(t)), v⟩ dt     (8d)
            = f(x) + ∫_0^1 ⟨grad f(x) + t H(x)[v] + Δ(t), v⟩ dt     (8e)
            = f(x) + ⟨grad f(x), v⟩ + (1/2)⟨H(x)[v], v⟩ + ∫_0^1 ⟨Δ(t), v⟩ dt.     (8f)

Here Δ(t) is defined in Eq. (7) and γ(t) = Exp_x(tv). Eq. (8b) simply rewrites the derivative using the definition of the Riemannian gradient. Eq. (8c) holds because parallel translation preserves the inner product (Tu, 2017, Prop. 14.16). Eq. (8d) uses Γ_{γ(t)}^{x} γ'(t) = γ'(0) = v, meaning that the velocity stays constant along a geodesic (Absil et al., 2009, (5.23)). Eq. (8e) uses Eq. (7). In Euclidean space, the corresponding Taylor expansion is

f(y) = f(x) + ⟨∇f(x), y − x⟩ + (1/2)(y − x)ᵀ∇²f(x)(y − x) + o(‖y − x‖²).     (9)

Comparing Eq. (8) and Eq. (9), the displacement y − x is replaced by v = Exp_x^{-1}(y), and the Euclidean gradient ∇f(x) and Hessian ∇²f(x) are replaced by the Riemannian gradient grad f(x) and Hessian H(x).

Now, bounding the remainder in Eq. (8f) by ∫_0^1 ‖Δ(t)‖ ‖v‖ dt ≤ (ρ/6)‖v‖³, we have

f(Exp_x(v)) ≤ f(x) + ⟨grad f(x), v⟩ + (1/2)⟨H(x)[v], v⟩ + (ρ/6)‖v‖³.
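As a quick numerical sanity check of this expansion (our own snippet, using the standard embedded-sphere formulas for the Riemannian gradient and Hessian), the error of the second-order model along Exp_x(tv) decays like t³, as the ρ/6 term predicts.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
A = rng.normal(size=(d, d)); A = (A + A.T) / 2
f  = lambda x: x @ A @ x                              # \bar f restricted to the sphere
eg = lambda x: 2 * A @ x                              # Euclidean gradient
eh = lambda x: 2 * A                                  # Euclidean Hessian (constant here)

x = rng.normal(size=d); x /= np.linalg.norm(x)
v = rng.normal(size=d); v -= (x @ v) * x; v /= np.linalg.norm(v)   # unit tangent vector

P = np.eye(d) - np.outer(x, x)
grad = P @ eg(x)
hess_v = lambda u: P @ (eh(x) @ u) - (x @ eg(x)) * u  # Riemannian Hessian on the sphere

exp_x = lambda w: (np.cos(np.linalg.norm(w)) * x
                   + np.sin(np.linalg.norm(w)) * w / np.linalg.norm(w))

for t in [1e-1, 1e-2, 1e-3]:
    taylor = f(x) + t * (grad @ v) + 0.5 * t**2 * (hess_v(v) @ v)
    err = abs(f(exp_x(t * v)) - taylor)
    print(f"t={t:.0e}  error={err:.2e}")              # error shrinks roughly like t^3
```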

Appendix B Linearization of the iterates in a fixed tangent space

In this section we linearize the progress of the iterates of our algorithm in a fixed tangent space. We always assume here that all points lie within a region of bounded diameter, as in Section 5. In the course of the proof we need several auxiliary lemmas, which are stated in the last two subsections of this section.

B.1 Evolution of the iterates in a fixed tangent space

Figure 4: Illustration of Lemma 9: map the two iterates to tangent vectors via the inverse exponential map, parallel-transport the two vectors to a common tangent space, and relate them.

We first consider the evolution of the iterates in a fixed tangent space. We show in the following lemma that it approximately follows a linear recursion.

Lemma 9.

Let x be a saddle point, and map the two coupled iterate sequences to the tangent space T_x M via the inverse exponential map and parallel transport. Under Assumptions 1, 2, and 3, if all pairwise distances between the points involved are small enough, then, for an explicit constant depending only on the curvature bound and an explicit function quantifying the error, the mapped iterates follow a linear recursion up to that error.

This lemma is illustrated in Fig. 4.

Proof.

Denote the relevant quantities as in the statement of the lemma; the map under consideration is smooth. We first prove the following claim.

Claim 1.

where .

To show this, note that

and using Lemma 5 with ,

Using Lemma 5,