1. Introduction and Outline
The (global) Riemannian center of mass (a.k.a. Riemannian mean or average or Fréchet mean)111Throughout this paper we are mainly interested in the “global” Riemannian center of mass, hence in reference to it very often we drop the term “global” and simply use “Riemannian center of mass,” etc. If need arises we explicitly use the term “local” in reference to a center which is not global (see Definition 2.5). of a set data points in a Riemannian manifold is defined as a point which minimizes the sum of squares of geodesic distances to the data points. This notion and its variants have a long history with several applications in pure mathematics (see e.g., , , ,  and also 
for a brief history and some new results). More recently, statistical analysis on Riemannian manifolds and, in particular, the Riemannian center of mass have found applications in many applied fields. These include fields such as computer vision (see e.g., and ), statistical analysis of shapes (see e.g., , , , , and ), medical imaging (see e.g.,  and ), and sensor networks (see e.g.,  and ), and many other general data analysis applications (see e.g., , , , and ). In these applied settings one often needs to numerically locate or compute the Riemannian center of mass of a set of data points lying on a Riemannian manifold, and the so-called (intrinsic) gradient descent algorithm has been a popular choice for this purpose. The constant step-size version of the gradient descent for finding the Riemannian center of mass is the easiest one to implement, and hence is the most popular one.
The main challenge in analyzing as well as working with the constant step-size gradient descent algorithm for finding the Riemannian center of mass is the fact that the underlying cost function is usually neither globally differentiable 222For us global differentiability means differentiability everywhere on the manifold; however, we use the term “global” to remind ourselves that our functions of interest (e.g., the Riemannian distance from a point) lose differentiability at far away distances. nor globally convex on the manifold.333In fact, it is well known that on a compact Riemannian manifold the only globally continuous convex functions are constant functions (see Theorem 2.2 and ). More specifically, the cost function is not differentiable at any cut point of the data points. Moreover, as it can be shown by simple examples, the cost function can have local minimizers, which are not of interest and should be avoided. Nevertheless, we expect and hope that if the algorithm is initialized close enough to the (unknown) global Riemannian center of mass and the step-size is small enough, then the algorithm should converge to the center. One would like to have the step-size small enough such that the cost is reduced at each step and at the same time the iterates do not leave a neighborhood around in which
is the only zero of the gradient of the cost function (recall that a gradient descent algorithm, at best, can only locate a zero of the gradient vector field of the cost function). On the other hand, one would like to have large enough step-size so that the convergence is fast. The interplay between these three constraints is important in determining the conditions guaranteeing convergence as well as the speed of convergence.
The goal of this paper is to give accurate conditions that guarantee convergence of the constant step-size gradient algorithm to the global Riemannian center of mass of a set of data points. In Section 2, we first briefly give the necessary backgrounds on the Riemannian center of mass and the gradient descent algorithm for finding it, these include the notions of convex functions and sets in Subsection 2.1.2, differentiability and convexity properties of the Riemannian distance function and bounds on its Hessian in Subsection 2.1.3, Riemannian center of mass and its properties in Subsection 2.1.4, a general convergence theorem for gradient descent in Subsection 2.1.5
, and a convergence theorem estimating the speed of convergence and the best step-size in Subsection2.1.6. Following that, in Subsection 2.2, we state Conjecture 2.15 in which we specify the best “condition for convergence” one can hope for (in a sense to be described). Specifically, we specify a bound on the radius of any ball around the data points in which the algorithm can be initialized together with an interval of allowable step-size so that the algorithm converges to the global center of mass . The significant point is that for convergence the radius of the ball does not need to be any smaller than what ensures existence and uniqueness of the center. Moreover, the step-size can be chosen equal to the best (in a specific described sense) step-size under the extra assumption that the iterates stay in that ball; 444This step-size, in general, depends on an upper bound on the sectional curvatures of the manifold and the radius of the mentioned ball. However, interestingly, for a manifold of non-negative curvature it is simply (see Conjecture 2.15 and Remark 2.16). and it is conjectured that, indeed, the iterates stay in the ball. 555The main challenge in proving this conjecture is to prove that the iterates stay in the ball containing the data points. Knowing the conjecture helps us to compare and evaluate the existing results mentioned in Subsection 2.3 as well as the results derived in this paper. In Section 3 (Theorem 3.6), we prove Conjecture 2.15 for the case of manifolds of constant nonnegative curvature. In our developments in this section, we first prove comparison Theorem 3.1 in Subsection 3.1. This comparison result (which differs from standard comparison theorems in some aspects) most likely has been known among geometers, but we could find neither its statement or a proof for it in the literature. In Subsection 3.2 we make sense of an intuitive notion of Riemannian affine convex combination of a set of data points in a manifold of nonnegative constant curvature and we explore its relation to the convex hull of the data points. These prepare us to prove the main theorem of the section, Theorem 3.6, in Subsection 3.3. Although limited in scope, this result covers some very important manifolds of practical interest: the unit sphere in , the group of rotations in , and the real projective space in . In Section 4, for manifolds of arbitrary (i.e., non-constant) curvature, we derive two classes of sub-optimal convergence conditions (in comparison with Conjecture 2.15): In Subsection 4.1 we give a result (Theorem 4.1) in which convergence is guaranteed at the expense of smaller spread of data points, whereas in Subsection 4.2 (Theorem 4.2) the allowable step-size is compromised to guarantee convergence. Finally, in Section 5, we qualitatively identify data configurations for which the convergence is very fast or very slow. We conclude the paper in Section 6 by giving some remarks about further research.
2. Preliminaries, a conjecture, and prior work
2.1. Preliminaries on the Riemannian Center of Mass and the Gradient Descent Algorithm
Let be an -dimensional complete666Our results are local in nature, but they hold in relatively large regions that are determined explicitly. Therefore, completeness of the manifold is not necessary and our results could be easily adapted to e.g., non-singular regions of a singular manifold. However, for this purpose the statements of our results could become rather cumbersome. Riemannian manifold with distance function .777In our definitions relating to Riemannian manifolds we mainly follow . We denote the tangent space at by and by and we mean the Riemannian structure and the corresponding norm, respectively (dependence on the base point is implicit and clear from the context). By a function in a subset of we mean a continuous function which is order continuously differentiable in the subset as commonly understood in differential geometry. For a function , denotes its gradient vector field with respect to . We assume that the sectional curvatures of are bounded from above and below by and , respectively. The exponential map of at is denoted by and its inverse (wherever defined) is denoted by . The injectivity radius of is denoted by . An open ball with center and radius is denoted by and its closure by .
2.1.2. Convex functions and sets in Riemannian manifolds
Convexity plays an important rule in our developments and we have the following definition:
Let be an open subset of such that every two points in can be connected by at least one geodesic of such that this geodesic lies entirely in . Assume that is a continuous function. Then is called (strictly) convex if the composition is (strictly) convex for any geodesic . We say that is globally (strictly) convex if it is (strictly) convex in .
If is in , then convexity (strict convexity) of is equivalent to (), where is any geodesic in .
An insightful fact is the following :
The only globally convex function in a compact Riemannian manifold is a constant function.
It is more convenient to limit ourselves to strongly convex subsets of because they are quite similar to standard convex sets in :
A set is called strongly convex if any two points in can be connected by a unique minimizing geodesic in and the geodesic segment lies entirely in .
with the convention that for . An open ball with is strongly convex in ; the same holds for any closed ball if (see e.g., [32, p. 404] and [11, pp. 168-9]). In fact, with is even more similar to a convex set in Euclidean space: for any the minimal geodesic between from to is the only geodesic connecting them which lies entirely in .
2.1.3. Differentiability and convexity properties of the distance function and estimates on its Hessian
Now, we briefly give some facts about the Riemannian distance function which will be used throughout the paper. Let . The function is continuous for every . However, it is differentiable (in fact ) in , where is the cut locus of (see e.g., [32, pp. 108-110]). We recall the notion of cut locus: Let be the largest domain containing the origin of in which is a diffeomorphism, and let be the boundary of . Then is called the cut locus of and one has (see e.g., [11, pp. 117-118] or [32, p. 104]). The distance between and is called the injectivity radius of denoted by , and by definition . It is well known that has measure zero in . On the unit sphere the cut locus of every point is its antipode. In a general manifold the differentiability property of at is similar to the case where equipped with the standard Euclidean metric; in particular, is a function in . However, the behavior at far away points (e.g., the cut locus) is of substantially different nature and depends on the topology and curvature of (recall that in Euclidean space the cut locus of every point is empty).
Next, we recall some useful estimates about the Hessian of the Riemannian distance function. We adopt the following definitions:
Assume that is such that .888Instead of this condition, it is often more convenient to require , which is a more conservative yet global version of the condition. Furthermore, assume that with is a unit speed geodesic making, at , an angle with the minimal geodesic from to . It can be proved that (see e.g., [32, pp. 152-154])
where is defined in (2.2). Based on the above one can verify that
and more generally that
Notice that the above confirms the intuition that the differentiability behavior of at is the same in and in . We will use these two relations very often; in doing so it is useful to have in mind that and , respectively, are decreasing and increasing in .
We emphasize that the requirement is essential and cannot be compromised, as at a cut point the distance function becomes non-differentiable. Although in the left hand side of the above bounds only appears explicitly both the curvature and topology of determine the convexity properties of the distance function. Notice that
gives (some) information about the Riemannian curvature tensor ofand (or ) gives (some) information about the topology of . For example, and the unit circle both have zero sectional curvature, while and the injectivity radius of is infinity. Now obviously in is globally convex, while in it is not (as seen directly or as a consequence of Theorem 2.2). The interesting fact is that has positive definite Hessian in , where is the antipodal point of . At , is not differentiable. This non-differentiability is so severe that in spite of the positivity of its Hessian at all other points is far from a globally convex function on . As an example of a less severe non-differentiability notice that in is also non-differentiable at , yet this does not affect its convexity.
2.1.4. Riemannian center of mass
We start by the following definition:
The (global) the Riemannian center of mass or mean (a.k.a. Fréchet mean) of the data set with respect to weights () is defined as the minimizer(s) of
in . We denote the center by . We call a local minimizer of (which is not global) a local center of mass of the data set with respect to the weights.999A local minimizer of in is sometimes called a Karcher mean, although this definition bears little relation with the way Grove and Karcher  or Karcher  originally defined what they called the Riemannian center of mass (see also  for more details). Given with small enough those authors defined the center of mass as a zero of in or alternatively as a local minimizer of in (Notice that given that is not differentiable at the cut locus of each data point it is not a-priori clear that, in general, a local minimizer of in should coincide with a zero of , although there are some evidence in the literature that this situation might not happen, see e.g,. , ). Overall, there is no consensus among authors about the terminology and we find it more convenient to use “local” and “global” Riemannian center of mass as defined here.
The reader is referred to  for details and other related definitions. As a convention when referring to the center of mass of some data points we usually do not refer to explicit weights unless needed. As another convention when is not specified we assume , which is the most commonly used case. Although and are also used often, in this paper our focus is limited to .101010Going from to does not introduce any major technical challenge, so our analysis can be applied almost uniformly to . However, since the results for are easier to state and presumably used more often, we usually state a result for and then give the more general version for . The only exception is Theorem 4.2 for which we do not give a version. The reason is that in our analysis we require
to be twice-continuously differentiable and we determine the constant step-size of the gradient algorithm in terms of the upper bounds on the eigenvalues Hessian of. As in Euclidean case, in the more general Riemannian case also one can see from (2.6) that for the Hessian of might be unbounded. It is well known that Lipschitz continuous Hessian (in particular bounded Hessian) is necessary for the convergence of a gradient descent method with constant-step size .
Although is a globally convex function when is a Euclidean space (or more generally an Hadamard manifold), is not globally convex for an arbitrary . In particular, the center of mass might not be unique; however, if the data points are close enough, then the center is unique. The following theorem gives sufficient conditions for existence and uniqueness of the Riemannian center of mass.
Let and assume with . For , if , then the Riemannian center of mass is unique and is inside . For it is the unique zero of the gradient vector field in , and moreover if no data point has weight , then is a non-degenerate critical point of (i.e., the Hessian of at is positive-definite). 111111In Theorem 2.1 of , the condition on is stated as , but since we have finite number of data points the current version follows immediately. Also from the statement of Theorem 2.1, is the only zero of in , but from the proof of the theorem it can be seen that the vector field is inward-pointing on the boundary of and hence is the only zero in entire . In addition, the fact about non-degeneracy is not present in the statement of Theorem 2.1 of , however it is proved in the proof of the theorem.
For a proof see . Also for see  and . In this paper, we are mainly interested in , because the bounds we derive on the step-size of the gradient descent algorithm depend on an upper bound on the eigenvalues of the Hessian of , and for (and ) the Hessian is unbounded. We refer the reader to  and  for algorithms in the case of and .
A rather subtle issue is that according to Theorem 2.6, if , then has a unique minimizer in . Nevertheless, , the restriction of to , might not be a convex function. More accurately, if (cf. (2.1)), then is a convex function (this can be seen easily from (2.5) by noting that for ). However, if and , then might not be a convex function, as can be seen by very simple examples. While this fact does not harm implementation and convergence of a gradient descent algorithm for finding greatly, it has some serious implications on implementation and applicability of Newton’s method (see Remarks 2.14 and 2.19).
2.1.5. Gradient descent algorithm for finding the Riemannian center of mass
For later reference we derive the intrinsic gradient descent algorithm for locating the Riemannian center of mass (see  or  for introduction to optimization on Riemannian manifolds). One can check that
for any as long as it is not in the cut locus of any of the data points. In particular, if , where , then for any the above expression is well-defined in the classical sense (i.e., it is uniquely defined). Notice that the above expression is well-defined for almost every because the set at which is not differentiable has measure zero (for this set is and for it is ). As we will see this non-differentiability has severe implications on the behavior of the constant step-size gradient descent.
|Algorithm 1: Gradient Descent for finding the Riemannian center of mass|
Besides practical considerations (e.g., stopping criterion), at least two important issues are left unspecified in Algorithm 1, namely, how to choose and how to choose for each . The most natural choice for is one point in , say one of the data points. Note that in practice and the exact value of might not be known.
The choice of is more complicated. The next general proposition gives a prescription for a step-size interval which ensures reducing the cost function at an iteration of a gradient descent provided one knows an upper bound on the eigenvalues of the Hessian of the cost function in a region which iterates live. The proof of this proposition follows from the second order Taylor series expansion (with remainder).
Let and consider an open neighborhood containing . Let be a function whose restriction to is twice-continuously differentiable and let the real number be an upper bound on the eigenvalues of the Hessian of in . There exists such that for all the curve does not leave and
For , with the convention that for , we have with equality only if is a critical point of . Moreover, when the right hand side of (2.10) is minimized for .
Notice that the fact that for the curve stays in is crucial in enabling us to use the upper bound and derive (2.10). This concept appears frequently in our analysis and it useful to have the following definition:
Obviously, continuously staying in is stronger than staying in . However, they are equivalent under some conditions (which hold in some, but not all, of the cases we study, see Remark 2.18):
If is a strongly convex set and for every , then for the iterates of Algorithm 1 staying in implies (and hence is equivalent to) continuously staying in .
Assume that and both belong to . Recall that for is a minimizing geodesic if . Therefore, by strong convexity of , for must be the only minimizing geodesic between and and must lie in entirely. ∎
As mentioned before we are only interested in constant step-size gradient descent. The following convergence result is standard for this type of algorithms when the cost is (or at least has Lipschitz gradient) and is globally convex; however, our version is adapted to which, in general, is neither globally nor convex. The assumption of the theorem that each iterate of the algorithm continuously stay in a neighborhood of , in which is is the only zero of , is a crucial enabling ingredient of the proof. In fact, our goal in Sections 3 and 4 is essentially to identify such a neighborhood (under certain conditions).
Let and assume that is the center of mass of , where . Let be a bounded open neighborhood of such that is in and in , the closure of . Furthermore, assume that is the only zero of the vector field in . Let be an upper bound on the eigenvalues of the Hessian of in . In Algorithm 1 choose . If starting from , each iterate of Algorithm 1 continuously stays in , then for with equality only if and converges to .
Since by assumption for and is compact, there is a subsequence converging to a point . By Proposition 2.7 we have unless and furthermore
for every . Since is a bounded sequence, the above implies that ; hence, by continuity of we have , that is, is a zero of in . But by the assumption about this means that coincides with and therefore . ∎
Next, we give a very simple but insightful example.
Example 2.11 (Finding the mean of two points on the unit circle ).
Let be the unit circle centered at the origin and equipped with the standard arc length distance . Recall that and . Let denote the point (see Figure 1). We consider two data points represented as () where and . We specify the weights and later. Under the mentioned assumption that , Theorem 2.6 guarantees that the center of mass is unique and in fact one can see that where . More importantly, it also follows that is the unique zero of in (as well as ). Notice that is smooth within (it does not follow from Theorem 2.6 but it is an easily verifiable fact that is the unique zero of in ). On the other hand, is not differentiable at and , the antipodal points of and , respectively. Furthermore, in the Hessian of is defined and is equal to , hence the largest possible range of the constant step-size is the interval . Next, we see under what conditions Theorem 2.10 applies. It is easy to check that, independent of the weights, with and step-size the iterates of Algorithm 1 continuously stay in . Therefore, an acceptable is the ball and Theorem 2.10 ensures convergence to the global center provided step-size is in the interval . This result is essentially not different from what we have in . However, the situation for is rather subtle since with a large step-size the iterates might leave the ball or even and enter a region in which there is another zero of . To be specific notice that can be parameterized with as
Now, let us fix (and ). First, let and . The solid curve in the right panel in Figure 1 shows the graph of . The two cranks in the curve at and are due to the non-differentiability of at antipodal points of and . If we run Algorithm 1 with and step-size we have , thus coincides with a non-differentiable critical point of
at which the algorithm is, in fact, not well-defined. In the generic setting the probability of this happening is zero; however, for larger, will leave . It can be seen that for this specific pair of weights has only one zero in . Consequently, in practice, Algorithm 1 for almost every initial condition in and step-size in the interval will find the global center of mass , where (this fact does not follow from Theorem 2.10 but is not difficult to verify in this special case, see also Remark 2.12). But we might not be this lucky always! For example, let and . The dashed curve in Figure 1 show for this pair of weights. One can verify that in addition to the global minimizer , this time, has a local minimizer at . Now if we run Algorithm 1 with and constant step-size where , then we have , i.e., the next iterate coincides with the local center and the algorithm gets stuck at the wrong center! For values of slightly smaller or larger than the algorithm still converges to . Notice that this happens despite the fact that the cost is reduced at each iteration.121212It would be interesting to see whether an example (in or another manifold) exists in which due to the non-differentiability of , we have if does not continuously stay in . Such a phenomenon could lead to an oscillatory behavior (see Remark 2.12). Although this simple example does not show the effect of the curvature of the manifold, still it makes it clear that in order for Algorithm 1 to have a predictable behavior that is as data-independent as possible it is important to identify conditions under which the assumptions of Theorem 2.10 are satisfied (mainly that the iterates continuously stay in a candidate ). Our efforts in the following sections are primarily directed toward finding a set , the radius of a ball containing the data points, and the step-size range which guarantee that the iterates continuously stay in .
Remark 2.12 (Local vs.global behavior).
Our focus in this paper is on finding the global center of mass using a constant step-size gradient descent algorithm; hence, we only study the local behavior of the algorithm (albeit in relatively large domains), and we leave the global behavior analysis of the algorithm to further research. In this regard, one particular important question is: “How (if possible at all) could one guarantee that for almost every initial condition in the algorithm converges to a local Riemannian center of mass?” Notice that this is a relevant question, in particular, since the underling cost function is not differentiable globally and e.g., a-priori one could not rule out a possible oscillatory behavior.
2.1.6. Speed of convergence and the best step-size
In Proposition 2.7, is the best step-size in the sense that in each iteration it gives the most reduction in the upper bound of described in the right hand side of (2.10). The following theorem relates this choice to the speed of convergence of the algorithm. The proof of the theorem is adopted from [37, p. 266, Theorem 4.2] where a proof is given for a globally convex function. Here, we adapt that proof to constant step-size gradient descent for minimizing (which is only locally and convex).131313It should be clear that nothing is special about and in Theorems 2.10 and 2.13: Both theorems hold true if is replaced with any function which is twice-continuously differentiable in satisfying the respective assumptions of the theorems and is replaced with a non-degenerate local minimizer of (in fact, for Theorem 2.10 an isolated local minimizer suffices).
Set . Let be the center of mass of , where . Suppose that is a strongly convex neighborhood around in which is twice-continuously differentiable, and let and , respectively, denote a lower and upper bound on the eigenvalues of the Hessian of in . Furthermore, assume that is small enough such that one can choose . In Algorithm 1 choose a constant step-size . If after a finite number of iterations each iterate continuously stays in , then for we have
In the above and are defined as
where . In particular, and as .
Let be the minimal geodesic from to . Note that for due to strong convexity of . After writing the second order Taylor’s series of around and using the bounds on the Hessian of one gets
Similarly by expansion of around one gets
Also notice that by the first order Taylor series expansion of around we have
Combining this with the left inequality in (2.17) results in
Therefore, we have
This theorem predicts a lower bound on the speed of convergence (i.e., the actual convergence is not worst than what the theorem predicts). The accuracy of this prediction, in part, depends on the accuracy of our estimates of the lower and upper bounds on the eigenvalues of the Hessian. Observe that when we are only given , (i.e., ) gives the smallest a-priori . We call the best a-priori step-size given .
Remark 2.14 (Asymptotic ).
Notice that a strongly convex which works in Theorem 2.13 might not work in Theorem 2.10 (since the assumption might not hold true) and a smaller might be needed for this theorem. Nevertheless, if we start with an (and a corresponding ) for which Theorem 2.10 holds, then there exists a smaller strongly convex for which and Theorem 2.13 holds. is still an upper bound on the eigenvalues of the Hessian of in and in lack of any knowledge about still is the best a-priori step-size. The actual asymptotic speed of convergence is determined by in a very small neighborhood of . In fact, in the limit and converge, respectively, to and the smallest and largest eigenvalues of the Hessian of at . Therefore, for any step-size , we define the associated asymptotic , denoted by , where in (2.14), , , and are replaced, respectively, by , , and . Notice that if , then and the smaller the the smaller the will be. In general, “smaller ” means that either we make smaller or we choose a more accurate upper bound on the eigenvalues of the Hessian of in .
2.2. A Conjecture: The Best Convergence Condition
As mentioned before, reduction of the cost at each iteration is not enough to guarantee the convergence of Algorithm 1 to the global center of mass. Nevertheless, we conjecture that if the constant step-size is chosen not too large and the initial condition is not far from (as specified next), then the cost can be reduced at each iteration, the iterates stay close to and converge to it .