We consider minimizing a non-convex smooth function $f$ on a smooth manifold $\mathcal{M}$,

$\min_{x \in \mathcal{M}} f(x)$, (1)

where $\mathcal{M}$ is a $d$-dimensional smooth manifold (here $d$ is the dimension of the manifold itself; we do not consider $\mathcal{M}$ as a submanifold of a higher-dimensional space: for instance, if $\mathcal{M}$ is a 2-dimensional sphere embedded in $\mathbb{R}^3$, its dimension is $d = 2$), and $f$ is twice differentiable, with a Hessian that is $\rho$-Lipschitz (assumptions are formalized in Section 4). This framework includes a wide range of fundamental problems (often non-convex), such as PCA (Edelman et al., 1998), dictionary learning (Sun et al., 2017), low-rank matrix completion (Boumal & Absil, 2011), and tensor factorization (Ishteva et al., 2011). Finding the global minimum of Eq. (1) is in general NP-hard; our goal is to find an approximate second-order stationary point with first-order optimization methods. We are interested in first-order methods because they are extremely prevalent in machine learning, partly because computing Hessians is often too costly. It is then important to understand how first-order methods fare when applied to non-convex problems, and there has been a wave of recent interest in this topic since Ge et al. (2015), as reviewed below.
In the Euclidean space, it is known that with random initialization, gradient descent avoids saddle points asymptotically (Pemantle, 1990; Lee et al., 2016). Lee et al. (2017) (section 5.5) show that this is also true on smooth manifolds, although the result is expressed in terms of nonstandard manifold smoothness measures. Also, importantly, this line of work does not give quantitative rates for the algorithm’s behaviour near saddle points.
Du et al. (2017) show that gradient descent can be exponentially slow in the presence of saddle points. To alleviate this phenomenon, it has been shown that, for a $\beta$-gradient-Lipschitz, $\rho$-Hessian-Lipschitz function, cubic regularization (Carmon & Duchi, 2017) and perturbed gradient descent (Ge et al., 2015; Jin et al., 2017a) converge to a local minimum (defined as a point $x$ satisfying $\|\nabla f(x)\| \le \epsilon$ and $\nabla^2 f(x) \succeq -\sqrt{\rho\epsilon}\, I$) in polynomial time, and momentum-based methods accelerate this convergence (Jin et al., 2017b). Much less is known about inequality constraints: Nouiehed et al. (2018) and Mokhtari et al. (2018) discuss second-order convergence for general inequality-constrained problems, where they need an NP-hard subproblem (checking the copositivity of a matrix) to admit a polynomial-time approximation algorithm. However, such an approximation exists only under very restrictive assumptions.
An orthogonal line of work is optimization on Riemannian manifolds. Absil et al. (2009) provide comprehensive background, showing how algorithms such as gradient descent, Newton, and trust-region methods can be implemented on Riemannian manifolds, together with asymptotic convergence guarantees to first-order stationary points. Zhang & Sra (2016) provide global convergence guarantees for first-order methods when optimizing geodesically convex functions. Bonnabel (2013) obtains the first asymptotic convergence result for stochastic gradient descent in this setting, which is further extended by Tripuraneni et al. (2018); Zhang et al. (2016); Khuzani & Li (2017). If the problem is non-convex, or the Riemannian Hessian is not positive definite, one can use second-order methods to escape from saddle points. Boumal et al. (2016a) show that the Riemannian trust-region method converges to a second-order stationary point in polynomial time (see also Kasai & Mishra, 2018; Hu et al., 2018; Zhang & Zhang, 2018). But this method requires a Hessian oracle, whose complexity is $d$ times that of computing the gradient. In Euclidean space, the trust-region subproblem can sometimes be solved via a Hessian-vector product oracle, whose complexity is about the same as computing a gradient. Agarwal et al. (2018) discuss its implementation on Riemannian manifolds, but are not explicit about the complexity and sensitivity of the Hessian-vector product oracle on a manifold.
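To make the oracle-cost contrast concrete: in Euclidean space the product $\nabla^2 f(x)\, v$ can be approximated with two gradient calls via a central finite difference, so its cost is on the order of one gradient evaluation. The sketch below is our own illustration (the quadratic test function, step size, and names are not from the paper).

```python
import numpy as np

def hvp_finite_diff(grad, x, v, eps=1e-5):
    """Approximate the Hessian-vector product H(x) @ v using two gradient
    evaluations: (grad(x + eps*v) - grad(x - eps*v)) / (2*eps)."""
    return (grad(x + eps * v) - grad(x - eps * v)) / (2 * eps)

# Illustrative quadratic f(x) = 0.5 * x^T A x, whose Hessian is exactly A.
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
grad = lambda x: A @ x

x = np.array([0.5, -1.0])
v = np.array([1.0, 2.0])
print(hvp_finite_diff(grad, x, v))  # close to A @ v = [4., 7.]
```

For a quadratic the central difference is exact up to floating-point rounding; for general smooth functions the error is $O(\epsilon^2)$ in the step size.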
The convergence of gradient descent for non-convex Riemannian problems has previously been studied only in the Euclidean space, by modeling the manifold with equality constraints. Ge et al. (2015, Appendix B) prove that stochastic projected gradient descent converges to second-order stationary points in polynomial time (their analysis is not geometric, and depends on the algebraic representation of the equality constraints). Sun & Fazel (2018) prove that perturbed projected gradient descent converges with a rate comparable to the unconstrained setting (Jin et al., 2017a) (polylog in dimension). That paper applies projections from the ambient Euclidean space to the manifold and analyzes the iterations under the Euclidean metric. This approach loses the geometric perspective enabled by Riemannian optimization, and cannot explain convergence rates in terms of intrinsic quantities such as the sectional curvature of the manifold.
After finishing this work, we found the recent and independent paper of Criscitiello & Boumal (2019), which gives a similar convergence analysis for a related perturbed Riemannian gradient method. We point out a few differences: (1) In Criscitiello & Boumal (2019), Lipschitz assumptions are made on the pullback map $f \circ \mathrm{Exp}_x$. While this makes the analysis simpler, it lumps the properties of the function and the manifold together, and the role of the manifold's curvature is not explicit. In contrast, our rates are expressed in terms of the function's smoothness parameters and the sectional curvature of the manifold separately, capturing the geometry more clearly. (2) The algorithm in Criscitiello & Boumal (2019) uses two types of iterates (some on the manifold, some taken in a tangent space), whereas all our algorithm's steps are directly on the manifold, which is more natural. (3) To connect our iterations with intrinsic parameters of the manifold, we use the exponential map instead of the more general retraction used in Criscitiello & Boumal (2019).
Contributions. We provide convergence guarantees for a perturbed first-order Riemannian optimization method to second-order stationary points (approximate local minima). We prove that, as long as the function is appropriately smooth and the manifold has bounded sectional curvature, a perturbed Riemannian gradient descent algorithm escapes approximate saddle points with a rate that has only polylog dependence on the dimension of the manifold (hence is almost dimension-free) and polynomial dependence on the smoothness and curvature parameters. This is the first result showing such a rate for Riemannian optimization, and the first to relate the rate to geometric parameters of the manifold.
Despite analogies with the unconstrained (Euclidean) analysis and with the Riemannian optimization literature, the technical challenge in our proof goes beyond combining two lines of work: we need to analyze the interaction between the first-order method and the second-order structure of the manifold to obtain second-order convergence guarantees that depend on the manifold curvature. Unlike in Euclidean space, the curvature affects the Taylor approximation of gradient steps. On the other hand, unlike the local rate analysis in first-order Riemannian optimization, our second-order analysis requires more refined properties of the manifold structure (in prior work, the first-order oracle makes enough progress for a local convergence rate proof, see Lemma 1), and second-order algorithms such as that of Boumal et al. (2016a) use second-order oracles (Hessian evaluations). See Section 4 for further discussion.
2 Notation and Background
We consider a complete (since our results are local, completeness is not necessary, and our results can easily be generalized under extra assumptions on the injectivity radius), smooth, $d$-dimensional Riemannian manifold $\mathcal{M}$, equipped with a Riemannian metric, and we denote by $T_x\mathcal{M}$ its tangent space at $x \in \mathcal{M}$ (which is a vector space of dimension $d$). We also denote by $B_x(r) \subset T_x\mathcal{M}$ the ball of radius $r$ centered at $0$. At any point $x \in \mathcal{M}$, the metric induces a natural inner product on the tangent space, denoted by $\langle \cdot, \cdot \rangle$. We also consider the Levi-Civita connection $\nabla$ (Absil et al., 2009, Theorem 5.3.1). The Riemannian curvature tensor is denoted by $R(x)[u, v]$, where $u, v \in T_x\mathcal{M}$, and is defined in terms of the connection $\nabla$ (Absil et al., 2009, Theorem 5.3.1). The sectional curvature $K(x)[u, v]$ for $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$ is then defined in Lee (1997, Prop. 8.8) as $K(x)[u, v] = \langle R(x)[u, v]v, u\rangle / \big(\langle u, u\rangle \langle v, v\rangle - \langle u, v\rangle^2\big)$.
Denote by $d(x, y)$ the distance (induced by the Riemannian metric) between two points $x, y \in \mathcal{M}$. A geodesic $\gamma : [0, 1] \to \mathcal{M}$ is a constant-speed curve whose length equals $d(x, y)$, so it is the shortest path on the manifold linking $x$ and $y$. We write $\gamma_{x \to y}$ for the geodesic from $x$ to $y$ (thus $\gamma_{x \to y}(0) = x$ and $\gamma_{x \to y}(1) = y$).
The exponential map $\mathrm{Exp}_x(v)$ maps a vector $v \in T_x\mathcal{M}$ to a point $y \in \mathcal{M}$ such that there exists a geodesic $\gamma$ with $\gamma(0) = x$, $\gamma(1) = y$, and $\dot{\gamma}(0) = v$. The injectivity radius at a point $x$ is the maximal radius $r$ for which the exponential map is a diffeomorphism on $B_x(r)$. The injectivity radius of the manifold, denoted by $\mathcal{I}$, is the infimum of the injectivity radii over all points. When $x$ and $y$ satisfy $d(x, y) < \mathcal{I}$, the exponential map admits an inverse $\mathrm{Exp}_x^{-1}(y)$, which satisfies $\|\mathrm{Exp}_x^{-1}(y)\| = d(x, y)$. Parallel translation $\Gamma_x^y$ denotes the map which transports a vector $v \in T_x\mathcal{M}$ to $\Gamma_x^y v \in T_y\mathcal{M}$ along $\gamma_{x \to y}$, such that the vector stays "constant" by satisfying a zero-acceleration condition (Lee, 1997, equation (4.13)).
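On the unit sphere these maps have well-known closed forms; the sketch below (function names and tolerances are our own) implements the exponential map and its inverse and checks that they invert each other.

```python
import numpy as np

def sphere_exp(x, v):
    """Exponential map on the unit sphere: follow the great circle from x
    with initial (tangent) velocity v, i.e. <x, v> = 0."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return x
    return np.cos(nv) * x + np.sin(nv) * (v / nv)

def sphere_log(x, y):
    """Inverse exponential map: the tangent vector at x pointing toward y,
    with norm equal to the geodesic distance arccos(<x, y>)."""
    u = y - np.dot(x, y) * x          # project y onto the tangent space at x
    nu = np.linalg.norm(u)
    if nu < 1e-12:
        return np.zeros_like(x)
    return np.arccos(np.clip(np.dot(x, y), -1.0, 1.0)) * (u / nu)

x = np.array([1.0, 0.0, 0.0])
v = np.array([0.0, np.pi / 2, 0.0])   # tangent at x, length pi/2
y = sphere_exp(x, v)                  # a quarter great circle: lands at e_2
print(np.round(y, 6))                 # -> [0. 1. 0.]
print(np.round(sphere_log(x, y), 6))  # recovers v
```

Here the geodesic distance is $d(x, y) = \arccos(\langle x, y\rangle)$, consistent with $\|\mathrm{Exp}_x^{-1}(y)\| = d(x, y)$ above.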
For a smooth function $f : \mathcal{M} \to \mathbb{R}$, $\mathrm{grad} f(x)$ denotes the Riemannian gradient of $f$ at $x \in \mathcal{M}$, which satisfies $\mathrm{d} f(x)[\xi] = \langle \xi, \mathrm{grad} f(x) \rangle$ for all $\xi \in T_x\mathcal{M}$ (see Absil et al., 2009, Sec. 3.5.1 and (3.31)). The Hessian of $f$ is defined jointly with the Riemannian structure of the manifold: the (directional) Hessian is $\mathcal{H}(x)[\xi] = \nabla_{\xi}\, \mathrm{grad} f$, and we use $\mathcal{H}$ as a shorthand for $\mathcal{H}(x)$. We call $x$ an approximate saddle point when the gradient norm is small and $\lambda_{\min}(\mathcal{H}(x))$ is sufficiently negative. We refer the interested reader to Do Carmo (2016) and Lee (1997), which provide a thorough review of these important concepts of Riemannian geometry.
3 Perturbed Riemannian gradient algorithm
Our main Algorithm 1 runs as follows:
Check the norm of the gradient: if it is large, do one step of Riemannian gradient descent; consequently, the function value decreases.
If the norm of the gradient is small, the iterate is near either an approximate saddle point or a local minimum. Perturb the variable by adding an appropriate level of noise in its tangent space, map it back to the manifold, and run a few iterations.
If the function value decreases, the iterates are escaping from the approximate saddle point (and the algorithm continues).
If the function value does not decrease, then the iterate is an approximate local minimum (and the algorithm terminates).
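As a hedged illustration of the steps above (not the paper's exact pseudocode, constants, or termination rule), here is a minimal sketch on the unit sphere, where the exponential map is available in closed form; all function names and parameter values are our own choices.

```python
import numpy as np

def exp_map(x, v):
    """Exponential map on the unit sphere (v tangent at x)."""
    nv = np.linalg.norm(v)
    return x if nv < 1e-12 else np.cos(nv) * x + np.sin(nv) * v / nv

def f(A, x):
    return x @ A @ x

def riem_grad(A, x):
    """Riemannian gradient of f(x) = x^T A x on the sphere: the Euclidean
    gradient projected onto the tangent space at x."""
    g = 2 * A @ x
    return g - np.dot(g, x) * x

def perturbed_rgd(A, x, eta=0.1, g_tol=1e-4, radius=1e-2,
                  inner=50, outer=500, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(outer):
        g = riem_grad(A, x)
        if np.linalg.norm(g) > g_tol:       # large gradient: one descent step
            x = exp_map(x, -eta * g)
            continue
        x_old = x                           # small gradient: perturb in T_x M
        xi = rng.standard_normal(x.shape)
        xi -= np.dot(xi, x) * x             # project the noise onto T_x M
        x = exp_map(x, radius * xi / np.linalg.norm(xi))
        for _ in range(inner):              # then run a few descent steps
            x = exp_map(x, -eta * riem_grad(A, x))
        if f(A, x) > f(A, x_old) - 1e-8:    # no decrease: approx. local minimum
            return x_old
    return x

A = np.diag([1.0, -1.0, 0.0])
x = perturbed_rgd(A, np.array([1.0, 0.0, 0.0]))  # start exactly at a saddle
print(f(A, x))                                   # close to the minimum value -1
```

Started at the stationary point $e_1$ (where the plain gradient step would not move), the perturbation lets the iterates escape along the negative-curvature direction and converge toward $\pm e_2$.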
Algorithm 1 relies on the manifold's exponential map, and is useful for cases where this map is easy to compute (true for many common manifolds). We refer readers to Lee (1997, pp. 81-86) for the exponential map of the sphere and hyperbolic manifolds, and to Absil et al. (2009, Examples 5.4.2, 5.4.3) for the Stiefel and Grassmann manifolds. If the exponential map is not computable, the algorithm can instead use a retraction (a first-order approximation of the exponential map, which is often easier to compute); however, our current analysis only covers the case of the exponential map. In Figure 1, we illustrate a function with a saddle point on the sphere, and plot the trajectory of Algorithm 1 when it is initialized at the saddle point.
4 Main theorem: escape rate for perturbed Riemannian gradient descent
We now turn to our main results, beginning with our assumptions and a statement of our main theorem. We then develop a brief proof sketch.
Our main result involves two smoothness conditions on the function $f$ and one condition on the curvature of the manifold $\mathcal{M}$.
Assumption 1 (Lipschitz gradient).
There is a finite constant $\beta$ such that $\|\mathrm{grad} f(y) - \Gamma_x^y\, \mathrm{grad} f(x)\| \le \beta\, d(x, y)$ for all $x, y \in \mathcal{M}$.
Assumption 2 (Lipschitz Hessian).
There is a finite constant $\rho$ such that $\|\mathcal{H}(y) - \Gamma_x^y\, \mathcal{H}(x)\, \Gamma_y^x\| \le \rho\, d(x, y)$ for all $x, y \in \mathcal{M}$.
Assumption 3 (Bounded sectional curvature).
There is a finite constant $K$ such that $|K(x)[u, v]| \le K$ for all $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$.
$K$ is an intrinsic parameter of the manifold capturing the curvature. We list a few examples here: (i) a sphere of radius $R$ has constant sectional curvature $1/R^2$ (Lee, 1997, Theorem 1.9); if the radius is bigger, the curvature is smaller, which means the sphere is less curved; (ii) a hyperbolic space of radius $R$ has sectional curvature $-1/R^2$ (Lee, 1997, Theorem 1.9); (iii) for the sectional curvature of the Stiefel and Grassmann manifolds, we refer readers to Rapcsák (2008, Section 5) and Wong (1968), respectively.
Note that the constant $K$ is not directly related to the RLICQ parameter defined by Ge et al. (2015), which first requires describing the manifold by equality constraints. Different representations of the same manifold can lead to different curvature bounds, while the sectional curvature is an intrinsic property of the manifold. If the manifold is a sphere of radius $R$, then $K = 1/R^2$, but more generally there is no simple connection. The smoothness parameters we assume are natural, compared to quantities derived from complicated compositions (Lee et al., 2017, Section 5.5) or from pullbacks (Zhang & Zhang, 2018; Criscitiello & Boumal, 2019). With these assumptions, the main result of this paper is the following:
Proof roadmap. For a function satisfying the smoothness conditions (Assumptions 1 and 2), we use a local upper bound on the objective based on the third-order Taylor expansion (see supplementary material, Section A, for a review).
When the norm of the gradient is large (not near a saddle), the following lemma guarantees the decrease of the objective function in one iteration.
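For reference, a standard form of such a descent guarantee, written in our notation under Assumption 1 (a sketch; the paper's Lemma 1 may differ in constants), is that one step $x_{t+1} = \mathrm{Exp}_{x_t}(-\eta\, \mathrm{grad} f(x_t))$ with step size $\eta \le 1/\beta$ satisfies

```latex
f(x_{t+1}) \;\le\; f(x_t) - \eta\Bigl(1 - \tfrac{\beta\eta}{2}\Bigr)\,\|\mathrm{grad} f(x_t)\|^2
\;\le\; f(x_t) - \tfrac{\eta}{2}\,\|\mathrm{grad} f(x_t)\|^2 .
```

so a large gradient norm forces a proportional decrease of the objective.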
Thus our main challenge in proving the main theorem lies in the behaviour of the Riemannian gradient iterates near an approximate saddle point:
1. Similar to the Euclidean case studied by Jin et al. (2017a), we need to bound the "thickness" of the "stuck region" where the perturbation fails. We again use a pair of hypothetical auxiliary sequences and study these coupled sequences. When two perturbations are coupled in the thinnest direction of the stuck region, their distance grows and one of them escapes from the saddle point.
2. However, our iterates evolve on a manifold rather than in a Euclidean space, so our strategy is to map the iterates back to an appropriate fixed tangent space where we can use the Euclidean analysis. This is done using the inverse of the exponential map and various parallel transports.
3. Several key challenges arise in doing this. Unlike Jin et al. (2017a), the structure of the manifold interacts with the local approximation of the objective function in a complicated way. On the other hand, unlike recent work on Riemannian optimization by Boumal et al. (2016a), we do not have access to a second-order oracle, and we need to understand how the sectional curvature and the injectivity radius (which both capture intrinsic manifold properties) affect the behavior of the first-order iterates.
4. Our main contribution is to carefully investigate how the various approximation errors arising from (a) the linearization of the iteration couplings and (b) their mappings to a common tangent space can be handled on manifolds with bounded sectional curvature. We address these challenges in a sequence of lemmas (Lemmas 3 through 6), which we combine to linearize the coupled iterations in a common tangent space and precisely control the approximation error. This result is formally stated in the following lemma.
5 Proof of Lemma 2
Lemma 2 controls the error of the linear approximation of the iterates when they are mapped into the tangent space at the saddle point. In this section, we assume that all points are within a region of bounded diameter (this follows from Eq. (2)), i.e., the distances between the points appearing in the following lemmas are bounded accordingly. The proof of Lemma 2 is based on the following sequence of lemmas.
Let and . Let us denote by ; then, under Assumption 3,
This lemma tightens the result of Karcher (1977, C2.3), which only shows an upper bound. We prove the upper bound in the supplement.
We also need the following lemma showing that both the exponential map and its inverse are Lipschitz.
Let , and suppose the distance between any two of the points is no bigger than . Then, under Assumption 3,
Intuitively, this lemma relates the norm of the difference of two tangent vectors to the distance between the corresponding points on the manifold, and it follows from bounds on the Hessian of the squared-distance function (Sakai, 1996, Ex. 4, p. 154). The upper bound is directly proven by Karcher (1977, Proof of Cor. 1.6), and we prove the lower bound via Lemma 3 in the supplement.
The following contraction result is fairly classical and is proven using the Rauch comparison theorem from differential geometry (Cheeger & Ebin, 2008).
Finally we need the following corollary of the Ambrose-Singer theorem (Ambrose & Singer, 1953).
The spirit of the proof is to linearize the manifold using the exponential map and its inverse, and to carefully bound the various error terms caused by this approximation. Let us denote by .
2. We use Lemma 4 to linearize this iteration in the fixed tangent space as
3. Using the Hessian Lipschitzness,
4. We use Lemma 6 to map to the common tangent space, and the Hessian Lipschitzness to compare to . This is an important intermediate result (see Lemma 1 in supplementary material, Section B).
We proceed in the same way for the second pair of iterates.
Now note that the iterates of the algorithm both lie on the manifold. We use the inverse exponential map to transport them to the same tangent space at the saddle point.
Therefore, we have linearized the two coupled trajectories in a common tangent space, and we can adapt the Euclidean escaping-saddle analysis thanks to the error bound proved in Lemma 2.
6 Proof of main theorem
In this section we suppose that all the assumptions of Section 4 hold. The proof strategy is to show that, with high probability, the function value decreases sufficiently within a bounded number of iterations at an approximate saddle point. Lemma 7 says that if, after a perturbation and a certain number of steps, the iterate is far from the approximate saddle point, then the function value decreases. If the iterates do not move far, the perturbation fell in a stuck region. Lemma 8 uses a coupling strategy, and shows that the width of the stuck region is small in the direction of the eigenvector associated with the negative eigenvalue of the Riemannian Hessian.
At an approximate saddle point , let be in the neighborhood of where , denote
Let and . We consider two iterate sequences, and where are two perturbations at .
Assume Assumptions 1, 2, 3 and Eq. (2) hold. Take two points and which are perturbed from an approximate saddle point, where , is the smallest eigenvector (i.e., the eigenvector corresponding to the smallest eigenvalue) of , , and the algorithm runs two sequences and starting from and . Denote
then, if , , we have .
We prove Lemmas 7 and 8 in supplementary material, Section C. In the same section, we also prove the main theorem using the coupling strategy of Jin et al. (2017a), but with the additional difficulty of taking into account the effect of the Riemannian geometry (Lemma 2) and the injectivity radius.
As an example, we consider the kPCA problem, where we want to find the $k$ principal eigenvectors of a symmetric matrix (Tripuraneni et al., 2018). This corresponds to
which is an optimization problem on the Grassmann manifold defined by the constraint . If the eigenvalues of are distinct, we denote by , …, the eigenvectors of , corresponding to the eigenvalues in decreasing order. Let be the matrix whose columns are the top eigenvectors of ; then the local minimizers of the objective function are for all unitary matrices . Denote also by the matrix whose columns are distinct eigenvectors; then the first-order stationary points of the objective function (with Riemannian gradient ) are for all unitary matrices . In our numerical experiment, we choose to be a diagonal matrix and let . The Euclidean basis vectors form an eigenbasis of , and the first-order stationary points of the objective function are with distinct basis vectors and unitary. The local minimizers are . We start the iteration at and see in Fig. 3 that the algorithm converges to a local minimum.
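The stationary-point structure described above can be checked numerically in the simplest case $k = 1$, where the Grassmann manifold reduces to the sphere (up to sign). The sketch below (with an illustrative diagonal matrix of our own choosing, not the paper's experimental setup) verifies that every eigenvector is a first-order stationary point of $f(x) = -x^\top H x$ on the sphere, and that only the top eigenvector has a nonnegative Riemannian Hessian spectrum.

```python
import numpy as np

H = np.diag([4.0, 3.0, 2.0, 1.0])   # illustrative spectrum; e_0 is the top eigenvector

def riem_grad(x):
    """Riemannian gradient of f(x) = -x^T H x on the unit sphere:
    the Euclidean gradient projected onto the tangent space at x."""
    g = -2 * H @ x
    return g - np.dot(g, x) * x

def riem_hess_eigs(x):
    """Eigenvalues of the Riemannian Hessian of f at a unit vector x,
    P_x (-2H + 2 (x^T H x) I) P_x, with P_x the tangent-space projector.
    The list includes one zero for the normal direction x itself."""
    P = np.eye(4) - np.outer(x, x)
    Hr = P @ (-2 * H + 2 * (x @ H @ x) * np.eye(4)) @ P
    return np.linalg.eigvalsh(Hr)

for i in range(4):
    e = np.eye(4)[i]
    # the gradient vanishes at every eigenvector; only at e_0 are all
    # tangent Hessian eigenvalues nonnegative (a local minimum)
    print(i, np.linalg.norm(riem_grad(e)), np.min(riem_hess_eigs(e)))
```

At the eigenvector with eigenvalue $\lambda_i$, the tangent Hessian eigenvalues are $2(\lambda_i - \lambda_j)$ for $j \ne i$, so every non-top eigenvector is a saddle, matching the discussion above.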
Burer-Monteiro approach for certain low-rank problems.
Following Boumal et al. (2016b), we consider, for and , the problem
We factorize by with an overparametrized and . Then any local minimum of
is a global minimum when (Boumal et al., 2016b). Let . In the experiment, we take to be a sparse matrix in which only the upper-left block is random and all other entries are zero. We choose the initial point such that for and otherwise; then is a saddle point. We see in Fig. 3 that the algorithm converges to the global optimum.
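A hedged sketch of this experiment's flavor (the dimensions, step size, iteration count, and the use of plain Euclidean gradient descent on the factorized objective are all our own choices, not the paper's setup): starting near the saddle $U = 0$, a small random perturbation lets gradient descent on $\tfrac{1}{2}\|M - UU^\top\|_F^2$ reach a global minimum when the factor is overparametrized.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r_true, r = 8, 2, 3                     # overparametrize: factor rank r > r_true
B = rng.standard_normal((n, r_true))
M = B @ B.T                                # low-rank PSD matrix to recover

def f(U):
    return 0.5 * np.linalg.norm(M - U @ U.T) ** 2

def grad(U):
    # gradient of f with respect to U: -2 (M - U U^T) U
    return -2.0 * (M - U @ U.T) @ U

U = np.zeros((n, r))                       # U = 0 is a stationary point (a saddle)
U += 1e-3 * rng.standard_normal((n, r))    # small random perturbation to escape it
eta = 0.002
for _ in range(5000):
    U -= eta * grad(U)
print(f(U))                                # driven near zero: a global minimum
```

The overparametrization ($r > r_{\mathrm{true}}$) is what makes every local minimum of the factorized problem global in the regime of Boumal et al. (2016b); the perturbation is needed because the gradient vanishes exactly at $U = 0$.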
We have shown that, for the constrained problem of minimizing a function subject to a manifold constraint, as long as the function and the manifold are appropriately smooth, a perturbed Riemannian gradient descent algorithm escapes saddle points with a rate that is polynomial in the accuracy, polylog in the manifold dimension, and polynomial in the curvature and smoothness parameters.
A natural extension of our result is to consider other variants of gradient descent, such as the heavy-ball method, Nesterov's acceleration, and the stochastic setting. The question is whether these algorithms, appropriately modified for manifold constraints, converge quickly to second-order stationary points (not just first-order stationary points, as studied in the recent literature), and whether one can relate their convergence rates to the smoothness of the manifold.
- Absil et al. (2009) Absil, P.-A., Mahony, R., and Sepulchre, R. Optimization Algorithms on Matrix Manifolds. Princeton University Press, 2009.
- Agarwal et al. (2018) Agarwal, N., Boumal, N., Bullins, B., and Cartis, C. Adaptive regularization with cubics on manifolds with a first-order analysis. arXiv preprint arXiv:1806.00065, 2018.
- Ambrose & Singer (1953) Ambrose, W. and Singer, I. M. A theorem on holonomy. Transactions of the American Mathematical Society, 75(3):428–443, 1953.
- Bonnabel (2013) Bonnabel, S. Stochastic gradient descent on Riemannian manifolds. IEEE Transactions on Automatic Control, 58(9):2217–2229, 2013.
- Boumal & Absil (2011) Boumal, N. and Absil, P.-A. RTRMC: A Riemannian trust-region method for low-rank matrix completion. In Advances in Neural Information Processing Systems, pp. 406–414, 2011.
- Boumal et al. (2016a) Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, 2016a.
- Boumal et al. (2016b) Boumal, N., Voroninski, V., and Bandeira, A. The non-convex Burer-Monteiro approach works on smooth semidefinite programs. In Advances in Neural Information Processing Systems, pp. 2757–2765, 2016b.
- Boumal et al. (2018) Boumal, N., Absil, P.-A., and Cartis, C. Global rates of convergence for nonconvex optimization on manifolds. IMA Journal of Numerical Analysis, pp. drx080, 2018. doi: 10.1093/imanum/drx080. URL http://dx.doi.org/10.1093/imanum/drx080.
- Carmon & Duchi (2017) Carmon, Y. and Duchi, J. C. Gradient descent efficiently finds the cubic-regularized non-convex newton step. arXiv preprint arXiv:1612.00547, 2017.
- Cheeger & Ebin (2008) Cheeger, J. and Ebin, D. G. Comparison Theorems in Riemannian Geometry. AMS Chelsea Publishing, Providence, RI, 2008.
- Criscitiello & Boumal (2019) Criscitiello, C. and Boumal, N. Efficiently escaping saddle points on manifolds. arXiv preprint arXiv:1906.04321, 2019.
- Do Carmo (2016) Do Carmo, M. P. Differential Geometry of Curves and Surfaces. Courier Dover Publications, 2016.
- Du et al. (2017) Du, S. S., Jin, C., Lee, J. D., Jordan, M. I., Singh, A., and Poczos, B. Gradient descent can take exponential time to escape saddle points. In Advances in Neural Information Processing Systems, pp. 1067–1077, 2017.
- Edelman et al. (1998) Edelman, A., Arias, T. A., and Smith, S. T. The geometry of algorithms with orthogonality constraints. SIAM journal on Matrix Analysis and Applications, 20(2):303–353, 1998.
- Ge et al. (2015) Ge, R., Huang, F., Jin, C., and Yuan, Y. Escaping from saddle points – online stochastic gradient for tensor decomposition. In Conference on Learning Theory, pp. 797–842, 2015.
- Hu et al. (2018) Hu, J., Milzarek, A., Wen, Z., and Yuan, Y. Adaptive quadratically regularized Newton method for Riemannian optimization. SIAM J. Matrix Anal. Appl., 39(3):1181–1207, 2018.
- Ishteva et al. (2011) Ishteva, M., Absil, P.-A., Van Huffel, S., and De Lathauwer, L. Best low multilinear rank approximation of higher-order tensors, based on the Riemannian trust-region scheme. SIAM Journal on Matrix Analysis and Applications, 32(1):115–135, 2011.
- Jin et al. (2017a) Jin, C., Ge, R., Netrapalli, P., Kakade, S. M., and Jordan, M. I. How to escape saddle points efficiently. In International Conference on Machine Learning, pp. 1724–1732, 2017a.
- Jin et al. (2017b) Jin, C., Netrapalli, P., and Jordan, M. I. Accelerated gradient descent escapes saddle points faster than gradient descent. arXiv preprint arXiv:1711.10456, 2017b.
- Karcher (1977) Karcher, H. Riemannian center of mass and mollifier smoothing. Communications on pure and applied mathematics, 30(5):509–541, 1977.
- Kasai & Mishra (2018) Kasai, H. and Mishra, B. Inexact trust-region algorithms on Riemannian manifolds. In Advances in Neural Information Processing Systems 31, pp. 4254–4265. 2018.
- Khuzani & Li (2017) Khuzani, M. B. and Li, N. Stochastic primal-dual method on Riemannian manifolds with bounded sectional curvature. arXiv preprint arXiv:1703.08167, 2017.
- Lee et al. (2016) Lee, J. D., Simchowitz, M., Jordan, M. I., and Recht, B. Gradient descent only converges to minimizers. Conference on Learning Theory, pp. 1246–1257, 2016.
- Lee et al. (2017) Lee, J. D., Panageas, I., Piliouras, G., Simchowitz, M., Jordan, M. I., and Recht, B. First-order methods almost always avoid saddle points. arXiv preprint arXiv:1710.07406, 2017.
- Lee (1997) Lee, J. M. Riemannian Manifolds: An Introduction to Curvature. Graduate Texts in Mathematics 176. Springer, New York, 1997. ISBN 9780387227269.
- Mangoubi et al. (2018) Mangoubi, O., Smith, A., et al. Rapid mixing of geodesic walks on manifolds with positive curvature. The Annals of Applied Probability, 28(4):2501–2543, 2018.
- Mokhtari et al. (2018) Mokhtari, A., Ozdaglar, A., and Jadbabaie, A. Escaping saddle points in constrained optimization. arXiv preprint arXiv:1809.02162, 2018.
- Nouiehed et al. (2018) Nouiehed, M., Lee, J. D., and Razaviyayn, M. Convergence to second-order stationarity for constrained non-convex optimization. arXiv preprint arXiv:1810.02024, 2018.
- Pemantle (1990) Pemantle, R. Nonconvergence to unstable points in urn models and stochastic approximations. The Annals of Probability, pp. 698–712, 1990.
- Rapcsák (2008) Rapcsák, T. Sectional curvatures in nonlinear optimization. Journal of Global Optimization, 40(1-3):375–388, 2008.
- Sakai (1996) Sakai, T. Riemannian Geometry, volume 149 of Translations of Mathematical Monographs. American Mathematical Society, 1996.
- Sun et al. (2017) Sun, J., Qu, Q., and Wright, J. Complete dictionary recovery over the sphere II: Recovery by Riemannian trust-region method. IEEE Transactions on Information Theory, 63(2):885–914, 2017.
- Sun & Fazel (2018) Sun, Y. and Fazel, M. Escaping saddle points efficiently in equality-constrained optimization problems. In Workshop on Modern Trends in Nonconvex Optimization for Machine Learning, International Conference on Machine Learning, 2018.
- Tripuraneni et al. (2018) Tripuraneni, N., Flammarion, N., Bach, F., and Jordan, M. I. Averaging Stochastic Gradient Descent on Riemannian Manifolds. arXiv preprint arXiv:1802.09128, 2018.
- Tu (2017) Tu, L. W. Differential Geometry: Connections, Curvature, and Characteristic Classes. Graduate Texts in Mathematics 275. Springer, Cham, Switzerland, 2017. ISBN 9783319550848.
- Wong (1968) Wong, Y.-c. Sectional curvatures of Grassmann manifolds. Proc. Nat. Acad. Sci. U.S.A., 60:75–79, 1968.
- Zhang & Sra (2016) Zhang, H. and Sra, S. First-order methods for geodesically convex optimization. arXiv:1602.06053, 2016. Preprint.
- Zhang et al. (2016) Zhang, H., Reddi, S. J., and Sra, S. Riemannian SVRG: Fast stochastic optimization on Riemannian manifolds. In Advances in Neural Information Processing Systems, pp. 4592–4600, 2016.
- Zhang & Zhang (2018) Zhang, J. and Zhang, S. A cubic regularized Newton's method over Riemannian manifolds. arXiv preprint arXiv:1805.05565, 2018.
Organization of the Appendix
In Appendix A we review classical results on Taylor expansions for functions on a Riemannian manifold. In Appendix B we provide the proof of Lemma 2, which requires expanding the iterates on the tangent space at the saddle point. Finally, in Appendix C, we provide the proofs of Lemma 7 and Lemma 8, which enable us to prove the main theorem of the paper.
Throughout the paper we assume that the objective function and the manifold are smooth. Here we list the assumptions that are used in the following lemmas.
Assumption 1 (Lipschitz gradient).
There is a finite constant $\beta$ such that $\|\mathrm{grad} f(y) - \Gamma_x^y\, \mathrm{grad} f(x)\| \le \beta\, d(x, y)$ for all $x, y \in \mathcal{M}$.
Assumption 2 (Lipschitz Hessian).
There is a finite constant $\rho$ such that $\|\mathcal{H}(y) - \Gamma_x^y\, \mathcal{H}(x)\, \Gamma_y^x\| \le \rho\, d(x, y)$ for all $x, y \in \mathcal{M}$.
Assumption 3 (Bounded sectional curvature).
There is a finite constant $K$ such that $|K(x)[u, v]| \le K$ for all $x \in \mathcal{M}$ and $u, v \in T_x\mathcal{M}$.
Appendix A Taylor expansions on Riemannian manifold
We provide here the Taylor expansion for functions and gradients of functions defined on a Riemannian manifold.
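As a reference point, the second-order Taylor expansion of $f$ along a geodesic takes the following standard form in our notation (a sketch; the paper's precise statements, with parallel transports, are the equations of this appendix):

```latex
f(\mathrm{Exp}_x(s)) \;=\; f(x) + \langle \mathrm{grad} f(x),\, s \rangle
+ \tfrac{1}{2}\, \langle \mathcal{H}(x)[s],\, s \rangle + O(\|s\|^3),
\qquad s \in T_x\mathcal{M},
```

and under Assumption 2 the cubic remainder is bounded by $\tfrac{\rho}{6}\|s\|^3$.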
A.1 Taylor expansion for the gradient
A.2 Taylor expansion for the function
The Taylor expansion of the gradient enables us to approximate the iterates of the main algorithm, but obtaining the convergence rate of the algorithm requires proving that the function value decreases along the iterations. We need to give the Taylor expansion of the function with the parallel-translated gradient on the LHS of Eq. (7). To simplify the notation, let denote the .
is defined in Eq. (7). The second line is just a rewriting of the definition. Eq. (8c) holds because parallel translation preserves the inner product (Tu, 2017, Prop. 14.16). Eq. (8d) uses the fact that the velocity stays constant along a geodesic (Absil et al., 2009, (5.23)). Eq. (8e) uses Eq. (7). In Euclidean space, the corresponding Taylor expansion is
Now we have
Appendix B Linearization of the iterates in a fixed tangent space
In this section we linearize the progress of the iterates of our algorithm in a fixed tangent space. We always assume here that all points are within a region of bounded diameter. In the course of the proof we need several auxiliary lemmas, which are stated in the last two subsections of this section.
B.1 Evolution of
We first consider the evolution of the iterates in a fixed tangent space. We show in the following lemma that it approximately follows a linear recursion.
This lemma is illustrated in Fig. 4.