 # The Riemannian barycentre as a proxy for global optimisation

Let M be a simply-connected compact Riemannian symmetric space, and U a twice-differentiable function on M, with unique global minimum at x^* ∈ M. The idea of the present work is to replace the problem of searching for the global minimum of U by the problem of finding the Riemannian barycentre of the Gibbs distribution P_T ∝ exp(−U/T). In other words, instead of minimising the function U itself, to minimise E_T(x) = 1/2 ∫ d^2(x,z) P_T(dz), where d(·,·) denotes Riemannian distance. The following original result is proved: if U is invariant by geodesic symmetry about x^*, then for each δ < 1/2 r_cx (r_cx the convexity radius of M), there exists T_δ such that T ≤ T_δ implies E_T is strongly convex on the geodesic ball B(x^*,δ), and x^* is the unique global minimum of E_T. Moreover, this T_δ can be computed explicitly. This result gives rise to a general algorithm for black-box optimisation, which is briefly described, and will be further explored in future work.


## 1 Concentration of the barycentre

Let P be a probability distribution on a complete Riemannian manifold M. A (Riemannian) barycentre of P is any global minimiser of the function

 E(x) = 1/2 ∫_M d^2(x,z) P(dz) for x ∈ M (1)

The following statement is due to Karcher, and was improved upon by Afsari: if P is supported inside a geodesic ball B(x^*,δ), where δ < 1/2 r_cx (r_cx the convexity radius of M), then E is strongly convex on B(x^*,δ), and P has a unique barycentre x̄ ∈ B(x^*,δ).
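As a concrete illustration of this definition, the barycentre of a finite set of points concentrated in a small geodesic ball can be computed by following the negative gradient of E. The sketch below does this on the unit sphere S^2; the function names, step size, and iteration count are illustrative choices, not taken from the paper.

```python
import numpy as np

def sphere_log(x, z):
    """Riemannian logarithm on the unit sphere: tangent vector at x pointing to z,
    with length equal to the geodesic distance d(x, z)."""
    c = np.clip(np.dot(x, z), -1.0, 1.0)
    v = z - c * x                      # component of z orthogonal to x
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return np.zeros_like(x)
    return np.arccos(c) * v / nv

def sphere_exp(x, v):
    """Riemannian exponential on the unit sphere."""
    t = np.linalg.norm(v)
    if t < 1e-12:
        return x
    return np.cos(t) * x + np.sin(t) * v / t

def barycentre(points, iters=100, step=1.0):
    """Karcher flow: follow the negative gradient of E(x) = (1/2) mean d^2(x, z_i).
    The gradient of (1/2) d^2(x, z) with respect to x is -log_x(z)."""
    x = points[0] / np.linalg.norm(points[0])
    for _ in range(iters):
        grad = np.mean([sphere_log(x, z) for z in points], axis=0)
        x = sphere_exp(x, step * grad)
    return x
```

For points clustered in a small ball, as in Karcher's theorem, this flow converges rapidly to the unique barycentre.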

On the other hand, the present work considers a setting where P is not supported inside B(x^*,δ), but merely concentrated on this ball. Precisely, assume P is equal to the Gibbs distribution

 P_T(dz) = (Z(T))^{−1} exp[−U(z)/T] vol(dz) ; T > 0 (2)

where Z(T) is a normalising constant, U is a function with unique global minimum at x^* ∈ M, and vol is the Riemannian volume of M. Then, let E_T denote the function in (1), and let x̄_T denote any barycentre of P_T.

In this new setting, it is not clear whether E_T is differentiable or not. Therefore, statements about the convexity of E_T and the uniqueness of x̄_T are postponed to the following Section 2. For now, it is possible to state the following Proposition 1. In this proposition, d(·,·) denotes Riemannian distance, and W denotes the Kantorovich (L^1-Wasserstein) distance. Moreover, (μ_min, μ_max) is any open interval which contains the spectrum of the Hessian ∇^2 U(x^*), considered as a linear mapping of the tangent space T_{x^*}M.

###### Proposition 1

Assume M is an n-dimensional compact Riemannian manifold with non-negative sectional curvature. Denote δ_{x^*} the Dirac distribution at x^*. The following hold,
(i) for any η > 0,

 W(P_T, δ_{x^*}) < η^2/(4 diam M) ⟹ d(x̄_T, x^*) < η (3)

(ii) for T ≤ T_o (which can be computed explicitly)

 W(P_T, δ_{x^*}) ≤ √(2π) (π/2)^{n−1} B_n^{−1} (μ_max/μ_min)^{n/2} (T/μ_min)^{1/2} (4)

where B_n is a constant expressed in terms of the Beta function.

Proposition 1 is motivated by the idea of using x̄_T as an approximation of x^*. Intuitively, this requires choosing T so small that P_T is sufficiently close to δ_{x^*}. Just how small a T may be required is indicated by the inequality in (4). This inequality is optimal and explicit, in the following sense.

It is optimal because the dependence on T in its right-hand side cannot be improved. Indeed, by the multi-dimensional Laplace approximation, the left-hand side is equivalent to a constant multiple of T^{1/2} (in the limit T → 0). While this constant is not tractable, the constants appearing in Inequality (4) depend explicitly on the manifold M and the function U. In fact, this inequality does not follow from the multi-dimensional Laplace approximation, but rather from volume comparison theorems of Riemannian geometry.

In spite of these nice properties, Inequality (4) does not escape the curse of dimensionality. Indeed, for fixed T, its right-hand side increases exponentially with the dimension n (note that B_n decreases as n grows). On the other hand, although T_o also depends on n, it is typically much less affected by dimensionality, and decreases much more slowly as n increases.

## 2 Convexity and uniqueness

Assume now that M is a simply-connected, compact Riemannian symmetric space. In this case, for any T > 0, the function E_T turns out to be C^2 throughout M. This results from the following lemma.

###### Lemma 1

Let M be a simply-connected compact Riemannian symmetric space. Let γ : I → M be a geodesic defined on a compact interval I. Denote Cut(γ) the union of all cut loci Cut(γ(t)) for t ∈ I. Then, the topological dimension of Cut(γ) is strictly less than dim M. In particular, Cut(γ) is a set with volume equal to zero.

Remark: the assumption that M is simply-connected cannot be removed, as the conclusion does not hold if M is a real projective space.

The proof of Lemma 1 uses the structure of Riemannian symmetric spaces, as well as some results from topological dimension theory (Chapter VII). The notion of topological dimension arises because it is possible that Cut(γ) is not a manifold. The lemma immediately implies, for all t ∈ I,

 E_T(γ(t)) = 1/2 ∫_M d^2(γ(t),z) P_T(dz) = 1/2 ∫_{M−Cut(γ)} d^2(γ(t),z) P_T(dz)

Then, since the domain of integration avoids the cut loci of all the γ(t), it becomes possible to differentiate under the integral. This is used in obtaining the following (the assumptions are the same as in Lemma 1).

###### Corollary 1

For x ∈ M, let G_x(z) = ∇f_x(z) and H_x(z) = ∇^2 f_x(z), where f_x is the function f_x(z) = 1/2 d^2(x,z), and the gradient and Hessian are taken with respect to x. The following integrals converge for any x ∈ M

 G_x = ∫_{M−Cut(x)} G_x(z) P_T(dz) ; H_x = ∫_{M−Cut(x)} H_x(z) P_T(dz)

and both depend continuously on x. Moreover,

 ∇E_T(x) = G_x and ∇^2 E_T(x) = H_x (5)

so that E_T is C^2 throughout M.

With Corollary 1 at hand, it is possible to obtain Proposition 2, which is concerned with the convexity of E_T and the uniqueness of x̄_T. In this proposition, the following notation is used

 U_ρ = inf { U(x) − U(x^*) : d(x,x^*) ≥ ρ } (6)

where ρ is positive. The reader may wish to note the fact that U_ρ decreases to 0 as ρ decreases to 0.

###### Proposition 2

Let M be a simply-connected compact Riemannian symmetric space. Let κ be the maximum sectional curvature of M, and r_cx its convexity radius. If T ≤ T_o (see (ii) of Proposition 1), then the following hold for any δ < 1/2 r_cx.
(i) for all x in the geodesic ball B(x^*,δ),

 ∇^2 E_T(x) ≥ Ct(2δ)(1 − vol(M) f(T)) − π A_M f(T) (7)

where Ct(2δ) is a constant determined by the curvature κ and by δ, f(T) is a quantity of the form defined in Section 3, and A_M is a constant given by the structure of the symmetric space M.
(ii) there exists T_δ ≤ T_o (which can be computed explicitly), such that T ≤ T_δ implies E_T is strongly convex on B(x^*,δ), and has a unique global minimum x̄_T ∈ B(x^*,δ). In particular, this means x̄_T is the unique barycentre of P_T.

Note that (ii) of Proposition 2 generalises the statement due to Karcher, which was recalled in Section 1.

## 3 Finding T_o and T_δ

Propositions 1 and 2 claim that T_o and T_δ can be computed explicitly. This means that, with some knowledge of the Riemannian manifold M and the function U, T_o and T_δ can be found by solving scalar equations. The current section gives the definitions of T_o and T_δ.

In the notation of Proposition 1, let ρ > 0 be small enough, so that,

 μ_min d^2(x,x^*) ≤ 2(U(x) − U(x^*)) ≤ μ_max d^2(x,x^*)

whenever d(x,x^*) ≤ ρ, and consider the quantity

 f(T,m,ρ) = (2/π)^{1/2} (μ_max/T)^{m/2} exp(−U_ρ/T)

where U_ρ is defined as in (6). Note that f(T,m,ρ) decreases to 0 as T decreases to 0, for fixed m and ρ. Now, it is possible to define T_o as

 T_o = min{T_o^1, T_o^2} where (8)
 T_o^1 = inf{ T > 0 : f(T,n−2,ρ) > ρ^{2−n} A_{n−1} } ; T_o^2 = inf{ T > 0 : f(T,n+1,ρ) > (μ_max/μ_min)^{n/2} C_n }

Here, A_n = E|X|^n for X ∼ N(0,1), and C_n is a constant expressed in terms of ω_n and A_n, where ω_n is the surface area of the unit sphere S^{n−1}.
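Since f(T,m,ρ) decreases to 0 as T decreases to 0, each of the scalar equations above can be solved numerically by bisection on the range of small T where f is increasing. The sketch below is a minimal illustration, assuming the constants μ_max, U_ρ and the threshold are given as inputs; the helper name `smallest_crossing` and all numerical values are hypothetical.

```python
import math

def f(T, m, mu_max, U_rho):
    """The quantity f(T, m, rho) = (2/pi)^(1/2) (mu_max/T)^(m/2) exp(-U_rho/T).
    The dependence on rho enters only through the constant U_rho."""
    return math.sqrt(2.0 / math.pi) * (mu_max / T) ** (m / 2.0) * math.exp(-U_rho / T)

def smallest_crossing(g, threshold, T_hi, tol=1e-12):
    """Bisect for inf{T in (0, T_hi] : g(T) > threshold}, assuming g increases
    from 0 on this interval (true for small T, since exp(-U_rho/T) dominates)."""
    lo, hi = tol, T_hi
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if g(mid) > threshold:
            hi = mid
        else:
            lo = mid
    return hi
```

In this notation, T_o would be obtained as the minimum of two such crossings, one for each of the definitions of T_o^1 and T_o^2.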

With regard to Proposition 2, define T_δ as follows,

 T_δ = min{T_δ^1, T_δ^2} − ε (9)

for some arbitrary ε > 0. Here, in the notation of (4), (6) and (7),

 T_δ^1 = inf{ T ≤ T_o : √(2π)(T/μ_min)^{1/2} > δ^2 (μ_min/μ_max)^{n/2} D_n } ; T_δ^2 = inf{ T ≤ T_o : f(T) > Ct(2δ)(Ct(2δ) vol(M) + π A_M)^{−1} }

where D_n is a constant expressed in terms of the constants of Inequality (4), and f(T) is the same quantity as in (7).

## 4 Black-box optimisation

Consider the problem of searching for the unique global minimum x^* of U. In black-box optimisation, it is only possible to evaluate U(x) for given x ∈ M, and the cost of this evaluation precludes numerical approximation of derivatives. Then, the problem is to find x^* using successive evaluations of U (hopefully, as few of these evaluations as possible).

Here, a new algorithm for solving this problem is described. The idea of this algorithm is to find x̄_T using successive evaluations of U, in the hope that x̄_T will provide a good approximation of x^*. While the quality of this approximation is controlled by Inequalities (3) and (4) of Proposition 1, in some cases of interest, x̄_T is exactly equal to x^*, for correctly chosen T, as in the following Proposition 3.

To state this proposition, let s_{x^*} denote geodesic symmetry about x^* (see ). This is the transformation of M which leaves x^* fixed, and reverses the direction of geodesics passing through x^*.

###### Proposition 3

Assume that U is invariant by geodesic symmetry about x^*, in the sense that U ∘ s_{x^*} = U. If T ≤ T_δ (see (ii) of Proposition 2), then x^* is the unique barycentre of P_T.

Proposition 3 follows rather directly from Proposition 2. Precisely, by (ii) of Proposition 2, the condition T ≤ T_δ implies E_T is strongly convex on B(x^*,δ), and x̄_T ∈ B(x^*,δ). Thus, x̄_T is the unique stationary point of E_T in B(x^*,δ). But, using the fact that U is invariant by geodesic symmetry about x^*, it is possible to prove that x^* is a stationary point of E_T, and this implies x̄_T = x^*.
The following two examples verify the conditions of Proposition 3.
Example 1: assume M is a complex Grassmann manifold. In particular, M is a simply-connected, compact Riemannian symmetric space. Identify M with the set of Hermitian projectors x of rank k, the rank being recovered as k = tr(x), where tr denotes the trace. Then, define U(x) = −tr(xΣ) for x ∈ M, where Σ is a Hermitian positive-definite matrix with distinct eigenvalues. Now, the unique global minimum of U occurs at x^*, the projector onto the principal k-dimensional subspace of Σ. Also, the geodesic symmetry s_{x^*} is given by s_{x^*}(x) = rxr, where r denotes reflection through the image space of x^*. It is elementary to verify that U is invariant by this geodesic symmetry.
Example 2: let M be a simply-connected, compact Riemannian symmetric space, and V a function on M with unique global minimum at x_o. Assume moreover that V is invariant by geodesic symmetry about x_o. For each x^* ∈ M, there exists an isometry g of M, such that x^* = g · x_o. Then, U = V ∘ g^{−1} has unique global minimum at x^*, and is invariant by geodesic symmetry about x^*.
Example 1 describes the standard problem of finding the principal subspace of the covariance matrix Σ. In Example 2, the function V is a known template, which undergoes an unknown transformation g, leading to the observed pattern U. This is a typical situation in pattern recognition problems.
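Example 1 can be checked numerically. The sketch below builds a random Hermitian positive-definite Σ, forms the principal projector x^* and the reflection r = 2x^* − I through its image space, and is set up so that U(rxr) = U(x) (since r commutes with Σ and r² = I) and U(x^*) ≤ U(x) for every rank-k projector x. The dimensions and random seed are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_projector(n, k):
    """Random rank-k Hermitian projector (a point of the Grassmannian)."""
    q, _ = np.linalg.qr(rng.standard_normal((n, k)) + 1j * rng.standard_normal((n, k)))
    return q @ q.conj().T

def U(x, sigma):
    """Cost U(x) = -tr(x Sigma), minimised by the principal rank-k projector."""
    return -np.trace(x @ sigma).real

n, k = 5, 2
a = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))
sigma = a @ a.conj().T                      # Hermitian positive-definite
w, v = np.linalg.eigh(sigma)                # eigenvalues in increasing order
x_star = v[:, -k:] @ v[:, -k:].conj().T     # projector onto the principal k-subspace
r = 2.0 * x_star - np.eye(n)                # reflection through the image of x_star
```

The invariance U(rxr) = U(x) holds because x^* is a spectral projector of Σ, so r commutes with Σ.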

Of course, from a mathematical point of view, Example 2 is not really an example, since it describes the completely general setting where the conditions of Proposition 3 are verified. In this setting, consider the following algorithm.
Description of the algorithm:
– input: T (to find such T, see Section 3); q (symmetric Markov kernel); x̂_0 (initial guess for x^*)
– iterate: for n = 1, 2, …
(1) sample a candidate point from the kernel q
(2) compute the acceptance probability, using the evaluations of U
(3) reject the candidate with the complementary probability; then, denote the retained sample z_n
(4) set x̂_n = x̂_{n−1} #_{1/n} z_n (see definition (10) below)
– until: x̂_n does not change sensibly
– output: x̂_n (approximation of x^*)
The above algorithm recursively computes the Riemannian barycentre of the samples z_n generated by a symmetric Metropolis-Hastings algorithm (see ). Here, the Metropolis-Hastings algorithm is implemented in lines (1)-(3). On the other hand, line (4) takes care of the Riemannian barycentre. Precisely, if γ is a length-minimising geodesic connecting x̂_{n−1} to z_n, let

 x̂_{n−1} #_{1/n} z_n = γ(1/n) (10)

This geodesic need not be unique.
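On the unit sphere, where minimising geodesics are great-circle arcs, the update (10) can be written down explicitly; the sketch below is a minimal illustration, with illustrative function names.

```python
import numpy as np

def geodesic_point(x, z, t):
    """Point gamma(t) on a length-minimising great-circle arc from x to z
    (for nearly antipodal x and z this arc is not unique)."""
    theta = np.arccos(np.clip(np.dot(x, z), -1.0, 1.0))
    if theta < 1e-12:
        return x
    return (np.sin((1 - t) * theta) * x + np.sin(t * theta) * z) / np.sin(theta)

def update(x_prev, z_n, n):
    """Recursive barycentre step (10): x_n = x_{n-1} #_{1/n} z_n = gamma(1/n)."""
    return geodesic_point(x_prev, z_n, 1.0 / n)
```

For n = 1 the update returns z_1 itself, and for n = 2 it returns the geodesic midpoint, mirroring the recursive arithmetic mean in Euclidean space.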

The point of using the Metropolis-Hastings algorithm is that the generated z_n eventually sample from the Gibbs distribution P_T. The convergence of the distribution P_n of z_n to P_T takes place exponentially fast. Indeed, it may be inferred from (see Theorem 8, Page 36) that

 ‖P_n − P_T‖_{TV} ≤ (1 − p_T)^n (11)

where ‖·‖_{TV} is the total variation norm, and p_T verifies

 p_T ≤ (vol M) inf_{x,z} q(x,z) exp(−sup_x U(x)/T)

so the rate of convergence is degraded when T is small.
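The geometric decay in (11) is easy to observe in a toy discrete setting. The sketch below (an illustration, not from the paper) runs a Metropolis chain on three states with a uniform symmetric proposal, and tracks the total variation distance between the law of the chain and the Gibbs distribution; the energies and temperature are arbitrary choices.

```python
import numpy as np

# Three states with energies U_vals; Metropolis chain with uniform proposal.
U_vals = np.array([0.0, 0.5, 1.0])
T = 0.7
q = np.ones((3, 3)) / 3.0                      # symmetric proposal kernel
P = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        if i != j:
            P[i, j] = q[i, j] * min(1.0, np.exp(-(U_vals[j] - U_vals[i]) / T))
    P[i, i] = 1.0 - P[i].sum()                 # rejection mass stays in place

gibbs = np.exp(-U_vals / T)
gibbs /= gibbs.sum()                           # stationary Gibbs distribution

def tv(mu, nu):
    """Total variation distance between two probability vectors."""
    return 0.5 * np.abs(mu - nu).sum()

dist = np.array([1.0, 0.0, 0.0])               # start from the first state
tvs = []
for _ in range(30):
    dist = dist @ P
    tvs.append(tv(dist, gibbs))                # decays geometrically, as in (11)
```

Lowering T shrinks the off-diagonal acceptance probabilities, which visibly slows this decay, in line with the bound on p_T.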

Accordingly, the intuitive justification of the above algorithm is the following. Since the z_n eventually sample from the Gibbs distribution P_T, and the desired global minimum x^* of U is equal to the barycentre of P_T (by Proposition 3), then the barycentre x̂_n of the z_n is expected to converge to x^*.

It should be emphasised that, in the present state of the literature, there is no rigorous result which confirms this convergence. It is therefore an open problem, to be confronted in future work.

For a basic computer experiment, consider M = S^2 and let

 U(x) = −P_9(x^3) for x = (x^1, x^2, x^3) ∈ S^2 (12)

where P_9 is the Legendre polynomial of degree 9. The unique global minimiser of U is x^* = (0,0,1), and the conditions of Proposition 3 are verified, since U is invariant by reflection in the x^3 axis, which is geodesic symmetry about x^*.

Figure 1 shows the dependence of U on x^3, displaying multiple local minima and maxima. Figure 2 shows the algorithm overcoming these local minima and maxima, and converging to the global minimum x^*. The experiment used the Markov kernel q obtained from the von Mises-Fisher distribution (see ). The initial guess x̂_0 is not shown in Figure 2.

In comparison, a standard simulated annealing method offered less robust performance, which varied considerably with the choice of annealing schedule.
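A minimal version of this experiment can be sketched as follows. Two simplifications relative to the text are assumed: the von Mises-Fisher kernel is replaced by an approximately symmetric proposal mixing small tangent steps with occasional uniform resampling of the sphere, and the temperature is an illustrative value, not the one computed from Section 3.

```python
import numpy as np
from numpy.polynomial import legendre

rng = np.random.default_rng(1)

def U(x):
    """U(x) = -P9(x^3), with P9 the Legendre polynomial of degree 9 (Eq. 12)."""
    return -legendre.legval(x[2], [0.0] * 9 + [1.0])

def propose(x):
    """Approximately symmetric proposal: usually a small step, occasionally a
    uniform point on the sphere (a stand-in for the von Mises-Fisher kernel)."""
    if rng.uniform() < 0.1:
        y = rng.standard_normal(3)
    else:
        y = x + 0.2 * rng.standard_normal(3)
    return y / np.linalg.norm(y)

def slerp(x, z, t):
    """Point gamma(t) on a minimising great-circle arc from x to z."""
    theta = np.arccos(np.clip(np.dot(x, z), -1.0, 1.0))
    if theta < 1e-12:
        return x
    return (np.sin((1 - t) * theta) * x + np.sin(t * theta) * z) / np.sin(theta)

T = 0.05                                  # illustrative temperature
z = x_hat = np.array([1.0, 0.0, 0.0])     # initial guess, far from x* = (0, 0, 1)
for n in range(1, 20001):
    z_new = propose(z)                                   # lines (1)-(3): Metropolis-Hastings
    if np.log(rng.uniform()) < (U(z) - U(z_new)) / T:
        z = z_new
    x_hat = slerp(x_hat, z, 1.0 / n)                     # line (4): barycentre step (10)
```

With these choices, the running barycentre x̂_n drifts away from the local minima and towards the polar region around x^*.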

## 5 Proofs

This section is devoted to the proofs of the results stated in previous sections.

As of now, assume that U(x^*) = 0. There is no loss of generality in making this assumption.

### 5.1 Proof of Proposition 1

Proof of (i): denote f_x(z) = 1/2 d^2(x,z). By the definition of E_T,

 E_T(x) = ∫_M f_x(z) P_T(dz) (13a)

Moreover, let E_0 be the function

 E_0(x) = ∫_M f_x(z) δ_{x^*}(dz) = 1/2 d^2(x,x^*) (13b)

For any x, it is elementary that f_x(z) is Lipschitz continuous, with respect to z, with Lipschitz constant diam M. Then, from the Kantorovich-Rubinshtein formula,

 |E_T(x) − E_0(x)| ≤ (diam M) W(P_T, δ_{x^*}) (13c)

a uniform bound in x ∈ M. It now follows that

 inf_{x∈B(x^*,η)} E_T(x) − inf_{x∈B(x^*,η)} E_0(x) ≤ (diam M) W(P_T, δ_{x^*}) (13d)

and

 inf_{x∉B(x^*,η)} E_0(x) − inf_{x∉B(x^*,η)} E_T(x) ≤ (diam M) W(P_T, δ_{x^*}) (13e)

However, from (13b), it is clear that

 inf_{x∈B(x^*,η)} E_0(x) = 0 and inf_{x∉B(x^*,η)} E_0(x) = η^2/2

To complete the proof, replace this into (13d) and (13e). Then, assuming the condition in (3) is verified,

 inf_{x∈B(x^*,η)} E_T(x) < η^2/4 < inf_{x∉B(x^*,η)} E_T(x)

so any barycentre x̄_T of P_T must lie in B(x^*,η), which is the conclusion of (3). ∎

Proof of (ii): let ρ be smaller than both the injectivity radius of M at x^* and π/(2κ), where κ^2 is an upper bound on the sectional curvature of M. Assume, in addition, ρ is small enough so

 μ_min d^2(x,x^*) ≤ 2(U(x) − U(x^*)) ≤ μ_max d^2(x,x^*) (14a)

whenever d(x,x^*) ≤ ρ. Further, consider the truncated distribution

 P_T^ρ(dz) = (1_{B_ρ}(z)/P_T(B_ρ)) P_T(dz) (14b)

where 1 denotes the indicator function, and B_ρ stands for the open ball B(x^*,ρ). Of course, by the triangle inequality,

 W(P_T, δ_{x^*}) ≤ W(P_T, P_T^ρ) + W(P_T^ρ, δ_{x^*}) (14c)

The proof relies on the following estimates, which use the notation of Section 3.

First estimate: if T ≤ T_o^1, then

 W(P_T, P_T^ρ) ≤ (diam M)(vol M)(2/π)(π/8)^{n/2}(μ_max/T)^{n/2} exp(−U_ρ/T) (14d)

Second estimate: if T ≤ T_o^1, then

 W(P_T^ρ, δ_{x^*}) ≤ 2√(2π)(π/2)^{n−1} B_n^{−1} (μ_max/μ_min)^{n/2} (T/μ_min)^{1/2} (14e)

These two estimates are proved below. Assume now they hold true, and T ≤ T_o. In particular, since T ≤ T_o^2, the definition of T_o^2 implies

 f(T,n+1,ρ) ≤ (μ_max/μ_min)^{n/2} C_n

Recall the definition of C_n, and express ω_n and A_n in terms of the Gamma function. The last inequality becomes

 (diam M × vol M) f(T,n+1,ρ) ≤ 2(2π)^{n/2} B_n^{−1} (μ_max/μ_min)^{n/2}

This is the same as

 (diam M × vol M)(1/π)(π/8)^{n/2} f(T,n+1,ρ) ≤ (π/2)^{n−1} B_n^{−1} (μ_max/μ_min)^{n/2}

By the definition of f(T,n+1,ρ), it now follows that the right-hand side of (14d) is less than half the right-hand side of (14e). In this case, (4) follows from the triangle inequality (14c). ∎
Proof of first estimate: consider the coupling of P_T and P_T^ρ, provided by the probability distribution K on M × M,

 K(dz_1 × dz_2) = P_T^ρ(dz_1)[P_T(B_ρ) δ_{z_1}(dz_2) + 1_{B_ρ^c}(z_2) P_T(dz_2)] (15a)

where B_ρ^c denotes the complement of B_ρ. Recall the definition of the Kantorovich distance (see ). Replacing (15a) into this definition, it follows that

 W(P_T, P_T^ρ) ≤ (diam M) P_T(B_ρ^c) (15b)

Then, from the definition (2) of P_T,

 P_T(B_ρ^c) ≤ (Z(T))^{−1}(vol M) exp(−U_ρ/T) (15c)

Now, (14d) follows directly from (15b) and (15c), if the following lower bound on Z(T) can be proved,

 Z(T) ≥ (π/2)(8/π)^{n/2}(T/μ_max)^{n/2} for T ≤ T_o^1 (15d)

To prove this lower bound, note that

 Z(T) = ∫_M e^{−U(z)/T} vol(dz) ≥ ∫_{B_ρ} e^{−U(z)/T} vol(dz)

Using this last inequality and (14a), it is possible to write

 Z(T) ≥ ∫_{B_ρ} e^{−U(z)/T} vol(dz) ≥ ∫_{B_ρ} e^{−μ_max d^2(z,x^*)/2T} vol(dz) (15e)

Writing this last integral in Riemannian spherical coordinates,

 ∫_{B_ρ} e^{−μ_max d^2(z,x^*)/2T} vol(dz) = ∫_0^ρ ∫_{S^{n−1}} e^{−μ_max r^2/2T} λ(r,s) dr ω_n(ds) (15f)

where λ(r,s) is the volume density in the Riemannian spherical coordinates, r ≥ 0 and s ∈ S^{n−1}, and where ω_n(ds) is the area element of S^{n−1}. From the volume comparison theorem in (see Page 129),

 λ(r,s) ≥ (κ^{−1} sin(κ r))^{n−1} ≥ ((2/π) r)^{n−1} (15g)

where the second inequality follows since x ↦ sin(x) is concave for x ∈ (0,π). Now, it follows from (15e) and (15f),

 Z(T) ≥ ω_n (2/π)^{n−1} ∫_0^ρ e^{−μ_max r^2/2T} r^{n−1} dr (15h)

where ω_n is the surface area of S^{n−1}. Thus, the required lower bound (15d) follows by noting that

 ∫_0^ρ e^{−μ_max r^2/2T} r^{n−1} dr = (π/2)^{1/2}(T/μ_max)^{n/2} A_{n−1} − ∫_ρ^∞ e^{−μ_max r^2/2T} r^{n−1} dr

where A_n = E|X|^n for X ∼ N(0,1), and that

 ∫_ρ^∞ e^{−μ_max r^2/2T} r^{n−1} dr ≤ ρ^{n−2}(T/μ_max) e^{−μ_max ρ^2/2T} ≤ ρ^{n−2}(T/μ_max) e^{−U_ρ/T}

Indeed, taken together, these give

 Z(T) ≥ ω_n (2/π)^{n−1} [(π/2)^{1/2}(T/μ_max)^{n/2} A_{n−1} − ρ^{n−2}(T/μ_max) e^{−U_ρ/T}]

Finally, (15d) can be obtained by noting the second term in square brackets is negligible compared to the first, as T decreases to 0, and by expressing ω_n and A_{n−1} in terms of the Gamma function. ∎

Proof of second estimate: the Kantorovich distance between P_T^ρ and the Dirac distribution δ_{x^*} is equal to the expectation of the distance to x^*, with respect to P_T^ρ. Precisely,

 W(P_T^ρ, δ_{x^*}) = ∫_M d(x^*,z) P_T^ρ(dz)

According to (2) and (14b), and using (2) to express the probability P_T(B_ρ), this becomes

 W(P_T^ρ, δ_{x^*}) = ∫_{B_ρ} d(x^*,z) e^{−U(z)/T} vol(dz) / ∫_{B_ρ} e^{−U(z)/T} vol(dz) (16a)

A lower bound on the denominator can be found from (15e) and subsequent inequalities, which were used to prove (15d). Precisely, these inequalities provide

 ∫_{B_ρ} e^{−U(z)/T} vol(dz) ≥ (1/2) ω_n (2/π)^{n−1} (π/2)^{1/2} A_{n−1} (T/μ_max)^{n/2} (16b)

whenever T ≤ T_o^1. For the numerator in (16a), it will be shown that, for any T,

 ∫_{B_ρ} d(x^*,z) e^{−U(z)/T} vol(dz) ≤ ω_n (π/2)^{1/2} A_n (T/μ_min)^{(n+1)/2} (16c)

Then, (14e) follows by dividing (16c) by (16b), and replacing in (16a), after noting that A_n/A_{n−1} = √(2π) B_n^{−1}. Thus, it only remains to prove (16c). Using (14a), it is seen that

 ∫_{B_ρ} d(x^*,z) e^{−U(z)/T} vol(dz) ≤ ∫_{B_ρ} d(x^*,z) e^{−μ_min d^2(z,x^*)/2T} vol(dz)

By expressing this last integral in Riemannian spherical coordinates, as in (15f),

 ∫_{B_ρ} d(x^*,z) e^{−U(z)/T} vol(dz) ≤ ∫_0^ρ ∫_{S^{n−1}} r e^{−μ_min r^2/2T} λ(r,s) dr ω_n(ds) (16d)

From the volume comparison theorem in (see Page 130), λ(r,s) ≤ r^{n−1}. Therefore, (16d) becomes

 ∫_{B_ρ} d(x^*,z) e^{−U(z)/T} vol(dz) ≤ ω_n ∫_0^ρ e^{−μ_min r^2/2T} r^n dr ≤ ω_n ∫_0^∞ e^{−μ_min r^2/2T} r^n dr

The right-hand side is half the nth absolute moment of a normal distribution. Expressing this in terms of A_n, and replacing in (16d), gives (16c). ∎

### 5.2 Proof of Lemma 1

Denote G the connected component at the identity of the group of isometries of M. It will be assumed that G is simply-connected and semisimple. Any geodesic γ : I → M is of the form,

 γ(t) = exp(tY) · x (17a)

for some x ∈ M and Y ∈ 𝔤, the Lie algebra of G, where exp : 𝔤 → G denotes the Lie group exponential mapping, and the dot denotes the action of G on M. For each t ∈ I, the cut locus Cut(γ(t)) of γ(t) is given by

 Cut(γ(t)) = exp(tY) · Cut(x) (17b)

This is due to a more general result: let M be a Riemannian manifold and g : M → M be an isometry of M. Then, Cut(g·x) = g·Cut(x) for all x ∈ M. This is because y ∈ Cut(x) if and only if y is conjugate to x along some geodesic, or there exist two different length-minimising geodesics connecting x to y. Both of these properties are preserved by the isometry g.

In order to describe the set Cut(x), denote K the isotropy group of x in G, and 𝔨 the Lie algebra of K. Let 𝔤 = 𝔨 ⊕ 𝔭 be an orthogonal decomposition, with respect to the Killing form of 𝔤, and let 𝔞 be a maximal Abelian subspace of 𝔭. Define M(𝔞) the centraliser of 𝔞 in K, and consider the mapping