Consider a function , for which 0 is a regular value. By the regular level set theorem (see e.g. [Hirsch1976, Thm 3.2]), the zero set is a properly embedded submanifold of with codimension and dimension . We call the constraint function and the constraint manifold. Depending on the context, may also be called an implicitly-defined manifold, a solution manifold, or an equilibrium manifold. A geodesic in the manifold is a curve with zero intrinsic acceleration, and all the geodesics are encoded in the exponential map such that , where are the initial location and velocity.
The exponential map is crucial for analysis and computation on manifolds. Application to problems on manifolds include optimization [Adler2002, Absil2008, GeR2015, ZhangHY2016b, Boumal2018], differential equations [Hairer2006]Sander2015], sampling [Brubaker2012, Byrne2013, LiuC2016, Leimkuhler2016, Zappa2018, Mangoubi2018, Lelievre2019, Goyal2019, ZhangW2020], and many other problems in statistics [ChenYC2020]
. If the exponential map is not known in an analytic form or is computationally intractable, it can be approximated by numerically integrating the geodesic trajectory, i.e. solving the ordinary differential equationwith initial conditions , . For submanifolds, this can also be done by projecting to , which requires solving a constrained minimization problem. In general, an approximation to the exponential map is called a retraction [Adler2002]. As is widely acknowledged (see e.g. [Absil2008, ZhangHY2016b, Boumal2018]), retractions are often far less difficult to compute than the exponential map, which allows for practical and efficient algorithms.
In this article, we introduce a class of second-order retractions on submanifolds, which move points along manifolds orthogonal to the constraint manifold. Of particular interest among this class is Newton retraction (NR), which we show to have better convergence property and lower computational cost than the popular approach known as oblique projection or orthographic retraction. Table 1 provides an overview of methods for computing geodesics, where we summarize some of our results.
|numerical geodesics||level set||variable||high||tiny||[Leimkuhler2016]|
|projective retraction||level set||2nd||high||almost any|
|orthographic retraction||level set||2nd||low||small||[Zappa2018]|
|Newton retraction*||level set||2nd (1)||lower (1)||large (3)||[Adler2002]|
Cross-referenced are results in this paper.
1.1 Related Literature
Retraction is widely useful for optimization on manifolds. Retraction was first defined in [Adler2002] for Newton’s method for root finding on submanifolds of Euclidean spaces, which was applied to a constrained optimization problem over a configuration space of the human spine. In particular, they proposed a retraction for constraint manifolds using Newton’s method, which we study in this paper. Noisy stochastic gradient method [GeR2015] escapes saddle points efficiently, which uses projective retraction for constrained problems. For geodesically convex optimization, [ZhangHY2016b] studied several first-order methods using the exponential map and a projection oracle, while acknowledging the value of retractions. The Riemannian trust region method has a sharp global rate of convergence to an approximate second-order critical point, where any second-order retraction may be used [Boumal2018].
Sampling on manifolds often involves simulating a diffusion process, which is usually done by a Markov Chain Monte Carlo (MCMC) method.[Brubaker2012] generalized Hamiltonian Monte Carlo (HMC) methods to distributions on constraint manifolds, where the Hamiltonian dynamics is simulated using RATTLE [Andersen1983], in which orthographic retraction is used to maintain state and momentum constraints. [Byrne2013] proposed geodesic Monte Carlo (gMC), an HMC method for submanifolds with known geodesics. [LiuC2016] proposed two stochastic gradient MCMC methods for manifolds with known geodesics, including a variant of gMC. To sample on configuration manifolds of molecular systems, [Leimkuhler2016] proposed a scheme for constrained Langevin dynamics. Criticizing the stability and accuracy limitations in the SHAKE method and its RATTLE variant—both of which use orthographic retraction—their scheme approximated geodesics by numerical integration. [Zappa2018] proposed reversible Metropolis random walks on constraint manifolds, which use orthographic retraction. [Lelievre2019] generalized the previous work to constrained generalized HMC, allowing for gradient forces in proposal; it uses RATTLE. In this line of work, it is necessary to check that the proposal is actually reversible, because large timesteps can lead to bias in the invariant measure. The authors pointed out that numerical efficiency in these algorithms requires a balance between stepsize and the proportion of reversible proposed moves. More recently, [ZhangW2020]
proposed a family of ergodic diffusion processes for sampling on constraint manifolds, which use retractions defined by differential equations. To sample the uniform distribution on compact Riemannian manifolds,[Mangoubi2018] proposed geodesic walk, which uses the exponential map. [Goyal2019] used a similar approach to sample the uniform distribution on compact, convex subsets of Riemannian manifolds, which can be adapted to sample an arbitrary density using a Metropolis filter; both use the exponential map.
Many other statistical problems on constraint manifolds are discussed in [ChenYC2020], where Newton retraction can be very useful. In probabilistic learning on manifolds [ZhangRD2020EnvEcon], a retraction based on constrained gradient flow is used for inference and data augmentation on density ridge [ZhangRD2020nbb].
2 Newton Retraction
Retractions [Adler2002] are mappings that preserve the correct initial location and velocity of the geodesics; they approximate the exponential map to the first order. Second-order retractions [Absil2008, Prop 5.33] are retractions with zero initial acceleration; they approximate the exponential map to the second order. In general, we define retraction of an arbitrary order as follows.
Retraction of order on a manifold, , is a
mapping to the manifold from an open subset of the tangent bundle containing all the zero tangent vectors, such that at every zero tangent vector it agrees with the exponential map in Riemannian distance to the-th order: , , , , , in the sense that, .
[Absil2012] defined a class of retractions on submanifolds of Euclidean spaces, called projection-like retractions. This includes projective and orthographic retractions, both of which are second-order.
Definition 2 ([Absil2012], Def 14)
Retractor of a submanifold of with codimension is a mapping from tangent vectors to linear -subspaces of the ambient space, such that for every zero tangent vector, affine space of the form intersects the submanifold transversely: , , , , . Here is the Grassmann manifold. Projection-like retraction induced by a retractor is a correspondence that takes a tangent vector to the set of points closest to the origin of the affine space that intersects the submanifold: , . In particular, it is a mapping if the tangent vector is small enough: , , : .
Projective retraction , where projection , and can be seen as the projection-like retraction induced by retractor , where . Orthographic retraction is the projection-like retraction induced by retractor .
Lemma 1 ([Absil2012], Thm 15, 22)
Projection-like retraction is a (first-order) retraction. It is second-order if the retractor maps every zero tangent to the normal space: , .
2.2 Main Results
Here we define another class of second-order retractions on submanifolds, based on foliations that intersect the submanifold orthogonally. Such foliations may be generated by a dynamical system, where each leaf is a stable invariant manifold. In particular, we are interested in Newton retraction, generated by a discrete-time dynamical system with quadratic local convergence. Such retractions can also be generated by flows: [ZhangW2020] used gradient flows of squared 2-norm of the constraint, [ZhangRD2020nbb] used constrained gradient flows of log density function; both have linear local convergence and, with sufficiently small steps, global convergence.
Normal foliation of a neighborhood of a codimension- submanifold of a Riemannian -manifold is a partition of the neighborhood into connected -submanifolds (called the leaves of the foliation) which intersect the submanifold orthogonally: , , , , . Retraction induced by a normal foliation is the map , where is a retraction on the ambient manifold and is the canonical projection: , . If , let be the Euclidean exponential map , we have , .
Recall that for the under-determined system of nonlinear equations , Newton’s minimum-norm step is , where Jacobian and denotes the Moore-Penrose inverse. If has full row rank, then . Newton map, or Newton’s least-change update, is . Newton limit map is the mapping defined by , where is a neighborhood of . [Adler2002, Ex 4] showed that given any retraction on the ambient manifold, the mapping is a retraction on . We call this map Newton retraction.
Newton retraction is the map such that , and can be seen as the retraction induced by the normal foliation determined by the Newton map: , .
To show that the retractions we defined are second-order, we give two lemmas. First, projective retractions form a class of retractions of order two and not of any higher order. Second, the retraction induced by a normal foliation matches the projective retraction at zero tangent vectors to the third order.
For every submanifold , , the projective retraction satisfies: , , . There exists a submanifold, , such that the previous condition does not hold if is replaced by .
Given a submanifold and a normal foliation of a neighborhood of , , .
Retraction induced by a normal foliation, which includes Newton retraction , is a second-order retraction. This characterization of order is sharp.
Although and are both second-order retractions, they have different domain sizes. Notice that the projection from a Euclidean space onto a compact subset is uniquely defined except for a subset of measure zero, so the projective retraction on a compact submanifold is defined almost everywhere on the tangent bundle. On the other hand, may be undefined for large tangent vectors as the affine space fails to intersect the submanifold, see Fig. 1. This precludes the domain of to a relatively small subset of the tangent bundle, regardless of implementation. Since matches to the third order while and can differ on the third order, it is easy to see that can have a larger domain than .
Now we characterize the domain of relative to that of in their usual implementation. Algorithm 1 gives an implementation of , which solves , by Newton’s method, with initial value . In comparison, is usually implemented by solving , by Newton’s method, with initial value . This means replacing line 5 with and line 6 with , where is evaluated at the input . It can be seen as an augmented Jacobian algorithm for solving under-determined systems of nonlinear equations: denote the Stiefel manifold , given , step is defined by , . For , the algorithm starts with and satisfies . Kantorovich-type convergence theorems for Newton’s method and augmented Jacobian algorithms are given in [Walker1990], which provide sufficient conditions for immediately superlinear convergence.
Let and . Let be open and convex, , , and . We say function satisfies the normal flow hypothesis, , if , (1) ; (2) ; (3) . Given , we say function satisfies the augmented Jacobian hypothesis, if , (1) ; (2’) ; (3’) .
Theorem 2 ([Walker1990], Thm 2.1, 3.2)
If then , , Newton’s method satisfies: , , ; in particular, , , . If , then the previous statement holds for the augmented Jacobian algorithm.
Per the previous Kantorovich-type convergence theorem, it follows immediately from the next lemma that Newton retraction is always stabler than orthographic retraction. Recall that has domain , where is the domain of , i.e. the convergence region of Newton’s method, and has domain . Let be the convergence region of the usual implementation of .
For any , if , then .
With the usual implementation of and , for any , the order- convergence region of guaranteed by Theorem 2 is a subset of that of : let and , where , , then , .
Our next result shows that Newton retraction is always faster than orthographic retraction.
With the usual implementation of and , for any , for any , the number of operations required for to converge is no greater than that for .
The computational complexity per iteration in Algorithm 1 is dominated by the evaluation of the Jacobian which are real-valued functions, and the linear solver in use, which invokes algebraic operations, . Since convergence is immediately superlinear (typically quadratic), it is reasonable to assume that the algorithm stops after a fixed number of iterations. Therefore the overall cost should suffice for most problems.
In case Jacobian evaluation is expensive and high numerical accuracy is unnecessary, one may consider a modified Newton retraction (mNR): run line 4 only for the first iteration, denote the outcome as , and replace line 5 with . As a corollary of Lemma 1, mNR is a second-order retraction. In this context, a natural implementation of is to use a chord method: remove line 4, and replace line 5 with . Both methods have linear convergence, but mNR has a faster rate: let , then exists such that , where for and for mNR, see e.g. [Allgower1990, Eq 6.2.15]. By Lemma 4, mNR is no more stable than NR.
Proof (Proof of Lemma 2)
[Absil2012, Ex 18] showed that projective retractions are second-order retractions, so we only need to show that the projective retraction of some manifold is exactly second-order. Consider the circle as a submanifold of the Euclidean plane, identified with the set . Its exponential map is , and its projective retraction is . Without loss of generality, consider the point and tangent vectors . Now we can write the exponential map as and the projective retraction as . So the distance between them is , which has a Taylor expansion at zero as . We can see that, for the unit circle, the projective retraction matches the exponential map up to the second order at zero tangent vectors.
Proof (Proof of Lemma 3)
For all such that is in the neighborhood of that is partitioned by , define to be the unique point such that . Note that , so if then and are orthogonal complements. Assume is small enough such that affine spaces and intersect transversely, define the unique point . These constructs are illustrated in Fig. 2. Furthermore, define and . Because , then , , that is, . To prove the theorem, we only need to show .
First we show that . Parameterize at as the graph of a function , , such that , . We see that . Let , because , we have . Thus, .
Second, we show that . Parameterize at as the graph of a function , , such that , . We see that . For all , if , define angle , otherwise define . The first principal angle between linear subspaces is defined as . Because , we have . Let , since , , we have . Thus, and . Because , we have , that is, . We conclude that .
Combining the previous two results gives . Because , we have . This means .
Proof (Proof of Theorem 1)
Combining Lemmas 3 and 2, we have , i.e. is a second-order retraction. [Beyn1993, Thm 3.1] showed that induces a foliation of a neighborhood of into -submanifolds, which intersect orthogonally. So fits Definition 4 as a retraction induced by a normal foliation, and thus it is second-order. Recall the circle example in the proof of Lemma 2, if is defined as the zero set of , , then , which means in this case is only a second-order retraction.
Proof (Proof of Lemma 4)
By the definitions of the hypotheses, part (1) are identical, part (2) follows immediately from part (2’), so we only need to show part (3). For the rest of the proof, is an arbitrary point in . To simplify notation, we will ignore explicit dependence on , such that refers to , and so on. Since
has full rank, we have QR decomposition, where and is a upper triangular matrix of order with positive diagonal entries. Let
be an orthogonal matrix of order, whose first columns matches . Since , let , such that .
First we show that is non-singular and its spectral norm is no greater than 1. By (2’), is non-singular. Since is invertible, this means is non-singular, and thus is non-singular. Note that , , so is non-singular, which means is non-singular. Moreover, let , then , which means .
Now we prove (3). By (3’), , that is, , . This means, , . Equivalently, , . Because , , so the previous inequality holds for . Note that , the inequality becomes . We have shown that is non-singular and non-expansive, so , , or equivalently, , . Define , then , . Define , then , . The left-hand side equals , that is and in short . We conclude that .
Proof (Proof of Proposition 1)
Since , the update sequences in and both satisfy . Because , the in will be remain closer to than that in after the same number of iterations, and thus converges in no more iterations than . Moreover, at each iteration, and both evaluate and , and solve an order- system of linear equations, see line 5. But the coefficient matrix for is a generic matrix , while that for is a symmetric matrix , which admits a faster linear solver. Therefore, the overall number of operations in is no greater than that in .