 # Note on the geodesic Monte Carlo

Geodesic Monte Carlo (gMC) comprises a powerful class of algorithms for Bayesian inference on non-Euclidean manifolds. The original gMC algorithm was cleverly derived in terms of its progenitor, the Riemannian manifold Hamiltonian Monte Carlo (RMHMC). Here, it is shown that an alternative, conceptually simpler derivation is available which clarifies the algorithm when applied to manifolds embedded in Euclidean space.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1. Introduction

Bayesian inference is hard. Bayesian inference on non-Euclidean manifolds is harder. Prior to the publication of byrne2013geodesic, a statistician required great ingenuity to compute the posterior distribution for any model with non-Euclidean parameter space, and the algorithmic details might change significantly depending on the prior, the likelihood, and the constraints implied by the non-Euclidean geometry. A good example of this approach is found in hoff2009simulation, where the posterior distribution over the Stiefel manifold of orthonormal matrices is computed by way of column-at-a-time Gibbs updates that rely on model specifications.

It is preferable, rather, that the same algorithm work for many different kinds of models. This is one of the strengths of Hamiltonian Monte Carlo duane1987hybrid and its Riemannian extension, RMHMC girolami2011riemann, which augments the posterior distribution by the random Gaussian momentum , where

is the metric tensor pertaining to the Riemannian manifold over which the model is defined. RMHMC simulates from the posterior distribution by simulating the augmented canonical distribution with Hamiltonian

 (1.1) H(q,p)=U(q)+K(q,p)∝−logπ(q)+12log|G(q)|+12pTG(q)−1p,

i.e., is the negative log-posterior and

is the negative logarithm of the probability density function of Gaussian momentum

. Since the kinetic energy is not separable in and , the system is not integrable using Euler’s method, so, in most cases, implicit integration methods are required girolami2011riemann. However, byrne2013geodesic point out that, for certain manifolds with known geodesics, it is beneficial to split the Hamiltonian into two parts and simulate the two systems iteratively. Here, the first Hamiltonian renders the equations

 ˙q =0 ˙p =∇q(logπ(q)−12log|G(q)|),

and, crucially, the second Hamiltonian renders the geodesic dynamics for the Riemannian metric’s Levi-Civita connection. Thus, the entire system may be simulated by iterating between perturbing the momentum and advancing along the manifold geodesics.

## 2. gMC on embedded manifolds

byrne2013geodesic extends the RMHMC formalism to posterior inference on manifolds embedded in Euclidean space. In the following, this extension is referred to as the embedding geodesic Monte Carlo (egMC). To maintain the RMHMC formalism, the authors begin by considering the inference problem on the intrinsic manifold, where the Hausdorff measure

 Hd(dq)=√|G(q)|λd(dq),

and not the Lebesgue measure , is the base measure with respect to which the posterior distribution is defined111Whereas the ensuing derivation is extremely clever, it is unfortunate that it relies on an intrinsic conception of the inference problem, which, we will argue, causes confusion when the object of interest is a priori defined using the Euclidean embedding coordinates.. Here, the RMHMC Hamiltonian (1.1) may be written

 H(q,p)=−logπH(q)+12pTG(q)−1p,

for

 logπH(q)=logπ(q)−12log|G(q)|

the log-posterior with respect to the Hausdorff base measure. Now, a clever change of variables occurs using an isometric embedding as a tool. An isometric embedding of a manifold into Euclidean space is a map satisfying

 Gij(q)=d∑l=1∂xl∂qi(q)∂xl∂qj(q),orG(q)=Jx(q)TJx(q)

for the Jacobian of the map evaluated at . byrne2013geodesic use the isometric embedding to make gMC practical on certain manifolds. This is accomplished by the change of variables , with

 M=Jx(q)(Jx(q)TJx(q))−1=Jx(q)G(q)−1.

If , then the Hamiltonian becomes ([byrne2013geodesic], Equation (9))

 (2.1) H(x,v) =−logπH(x)+12vTΠqv =−logπH(x)+12vTv

for the projection matrix of the tangent space of the embedded manifold (at point ) conceived of as a subspace of the ambient Euclidean space. The authors point out that “the target density is still defined with respect to the Hausdorff measure of the manifold, and so no additional log-Jacobian term is introduced,” and invite the reader to

[n]ote that by working entirely in the embedded space, we completely avoid the coordinate system and the related problems where no single global coordinate system exists. The Riemannian metric only appears in the Jacobian determinant term of the density: in certain examples, this can also be removed, for example by specifying the prior distribution as uniform with respect to the Hausdorff measure…

But it is not immediately clear how one should approach the common scenario where the prior is defined a priori using the embedding coordinates, i.e. those of the ambient Euclidean space. On the sphere, for example, such priors include the Von Mises-Fisher distribution. On the Stiefel manifold, such priors include the matrix Bingham-Von Mises-Fisher distribution hoff2009simulation. Contrary to the above statement, one suspects that the log-Jacobian term should never be necessary, and this turns out to be the case.

## 3. Alternative derivation I

Let denote a target posterior density defined directly using embedding coordinates. For the unit sphere, this means that ; for the Stiefel manifold of orthonormal matrices, this means that , for

the identity matrix of the given dimension

. Let be the the orthogonal projection onto the tangent space of the embedded manifold at point . For example, for the sphere, this projection is given by

 Πx=I−xxT;

for the Stiefel manifold, the matrix is (see Appendix B)

 Πx=Ids−12(Is2⊗x)(P+Ids)(Is2⊗xT),

for the Kronecker product and the permutation matrix for which for any matrix . For simplicity, we take the sphere as our prime example and leave the Stiefel manifold case for the appendix.

Let momentum

follow a degenerate Gaussian distribution on the tangent space to the sphere at

, i.e. , where is some positive semi-definite matrix. Then at any point , the density of is proportional to

 Det−1/2(ΠxMΠx)exp(−12pT(ΠxMΠx)+p),

where is the pseudo determinant and is the pseudo inverse of matrix . Then the Hamiltonian is given by

 (3.1) H(x,p)=−logπ(x)+12logDet(ΠxMΠx)+12pT(ΠxMΠx)+p,

for any pair and . Similar to the original gMC algorithm, we split into two Hamiltonians

 H(x,p)=−logπ(x)+12logDet(ΠxMΠx)

and

 H(x,p)=12pT(ΠxMΠx)+p.

Using some matrix calculus (Appendix C) and the fact that holbrook2018differentiating, the first system gives the equations

 (3.2) ˙x =0 ˙p =∇xlogπ(x)−(ΠxMΠx)+ΠxMx.

Since the gradient does not necessarily belong to the tangent space, we perform the change of variables . The equations now read

 (3.3) ˙x =0 ˙v =(ΠxMΠx)+(∇xlogπ(x)−(ΠxMΠx)+ΠxMx).

Velocity stays on the tangent space at because in general. The second system may also be rewritten:

 H(x,p) =12pT(ΠxMΠx)+p =12pT(ΠxMΠx)+(ΠxMΠx)(ΠxMΠx)+p =12vT(ΠxMΠx)v =12~vT~v:=H(x,~v),

where . The system corresponding to is solved by the geodesic with initial conditions . Thus the system corresponding to may be integrated by iteratively advancing according to (3.3) and spherical geodesics, alternating between and between steps. The general algorithm is given in Appendix A.1.

Accounting for the deterministic maps and within the accept/reject step yields a surprisingly simple acceptance probability. For the trajectory beginning at point , is mapped to before the geodesic flow, but is mapped to afterward, where we have used the shorthand . But before the next geodesic flow, we apply the inverse map . In this way, all internal deterministic maps cancel out, and one must only account for the first and last. Thus, for a trajectory consisting of steps, the Jacobian correction is

 Det((Πx0MΠx0)1/2(ΠxTMΠxT)−1/2),

and the resulting log acceptance probability is the minimum of 0 and

 (3.4) α =−logπ(x0)+logDet(Πx0MΠx0)+12vT0(Πx0MΠx0)v0+ logπ(xT)−logDet(ΠxTMΠxT)−12vTT(ΠxTMΠxT)vT.

See Appendix A.1 for details.

## 4. Alternative derivation II

But why begin with the momentum at all? By beginning with velocity, one may derive yet another class of algorithms that nonetheless reduces to the original geodesic Monte Carlo algorithm. The following approach is similar to that of holbrook2017geodesic and is related to the Lagrangian formulation found in lan2015markov. We let the velocity have the same distribution as before, i.e. , and define the non-canonical (cf. beskos2011hybrid) Hamiltonian

 H(x,v)=−logπ(x)−12logDet(ΠxMΠx)+12vT(ΠxMΠx)v.

Note that the sign of the log pseudo determinant differs from that from Equation (3.1), but the quadratic terms are equal. Again, split the Hamiltonian in two:

 H(x,v)=−logπ(x)−12logDet(ΠxMΠx),H(x,v)=12vT(ΠxMΠx)v.

The first yields the equations

 ˙x =0 ˙v =(ΠxMΠx)+(∇xlogπ(x)+(ΠxMΠx)+ΠxMx),

where the only difference with Equation (3.3) is the sign of . The second Hamiltonian is handled in the exact same way as above. Map , advance along the geodesics, and map back to . As above, the same Jacobian correction appears in the accept/reject step, and this time the log acceptance probability simplifies even further (see Appendix A.2) to

 (4.1) α =−logπ(x0)+12vT0(Πx0MΠx0)v0+logπ(xT)−12vTT(ΠxTMΠxT)vT,

i.e., the log pseudo determinants cancel. See Appendix A.2 for algorithmic details.

## 5. Obtaining the original algorithm

For both alternative derivations, the formulas greatly simplify when is the identity matrix, and the original geodesic Monte Carlo algorithm is obtained. Because the pseudo determinant of a projection matrix is unity, the Hamiltonians reduce to

 H(x,v)=−logπ(x)±12logDet(Πx)+12vTΠxv=−logπ(x)+12vTv.

The simplified Hamiltonian is the same as Formula (2.1), but with replacing , the posterior with respect to the Hausdorff measure. As established above, the two are equivalent, but by working completely with embedding coordinates, we are able to avoid any notion of intrinsic geometry whatsoever and thus require less mathematical machinery.

Similarly, there is no need for the Jacobian correction within the accept/reject step. Concretely, this is because . Theoretically, this is because the geodesic Monte Carlo algorithm is not symplectic for general but is symplectic for the identity. Finally, the two derivations may be viewed as constructing random walks on the cotangent and tangent bundles, respectively. The upshot is that the original geodesic Monte Carlo algorithm may be interpreted either way.

## 6. Discussion

We have proposed two alternative derivations of the geodesic Monte Carlo for embedded manifolds byrne2013geodesic. These derivations are conceptually simpler, as they do not rely on a notion of intrinsic manifold geometry. They clarify the original algorithm by showing that the inclusion of the log-Jacobian of the embedding in the Hamiltonian is unnecessary in any case where the target distribution is defined using embedding coordinates. This claim goes beyond the statement of the original paper.

Here, the original geodesic Monte Carlo algorithm was presented as a special case of two general classes of algorithms with non-trivial mass matrices. As a result, the new derivations emphasized the role played by the degenerate Gaussian distribution. Finally, the exposition hinted how Metropolis adjustments may be incorporated into geometric Langevin algorithms such as leimkuhler2016efficient.

## Appendix A Acceptance probabilities and generalized algorithms

### a.1. First alternative derivation

Let be the trajectory’s starting position and be its end point. Also let and denote the energies at the beginning and end of the trajectory, respectively. Then the log acceptance probability is min, where

 α =h0−hT+12logDet(Πx0MΠx0)−12logDet(ΠxTMΠxT) =−logπ(x0)+12logDet(Πx0MΠx0)+12vT0(Πx0MΠx0)v0+ logπ(xT)−12logDet(ΠxTMΠxT)−12vTT(ΠxTMΠxT)vT+ 12logDet(Πx0MΠx0)−12logDet(ΠxTMΠxT) =−logπ(x0)+logDet(Πx0MΠx0)+12vT0(Πx0MΠx0)v0+ logπ(xT)−logDet(ΠxTMΠxT)−12vTT(ΠxTMΠxT)vT :=e0−eT.

In the final line, and denote the terms collected into those featuring the initial and final positions, respectively.

### a.2. Second alternative derivation

Again let be the trajectory’s starting position and be its end point. Let and denote the energies at the beginning and end of the trajectory, respectively. Then the log acceptance probability is min, where

 α =h0−hT+12logDet(Πx0MΠx0)−12logDet(ΠxTMΠxT) =−logπ(x0)−12logDet(Πx0MΠx0)+12vT0(Πx0MΠx0)v0+ logπ(xT)+12logDet(ΠxTMΠxT)−12vTT(ΠxTMΠxT)vT+ 12logDet(Πx0MΠx0)−12logDet(ΠxTMΠxT) =−logπ(x0)+12vT0(Πx0MΠx0)v0+logπ(xT)−12vTT(ΠxTMΠxT)vT :=e0−eT.

In the final line, and denote the terms collected into those featuring the initial and final positions, respectively.

## Appendix B Projection matrix for the Stiefel manifold

When modeling an element of the Stiefel manifold, for momentum matrix we write the degenerate Gaussian distribution

 Det−1/2(ΠxMΠx)exp(−12vec(p)T(ΠxMΠx)+vec(p)),

and are matrices. To get the form for , we note that the orthogonal projection of a matrix onto the tangent space at is

 Πx(v)=v−12x(vTx+xTv).

Applying the vec operator gives

 vec(Πx(v)) =vec(v)−12vec(x(vTx+xTv)) =vec(v)−12(Is2⊗x)vec(vTx+xTv) =vec(v)−12(Is2⊗x)vec(vTx)+vec(xTv) =vec(v)−12(Is2⊗x)Pvec(xT)+xTv) =vec(v)−12(Is2⊗x)(P+Ids)vec(xTv) =vec(v)−12(Is2⊗x)(P+Ids)(Is2⊗xT)vec(v) =(Ids−12(Is2⊗x)(P+Ids)(Is2⊗xT))vec(v) =Πxv

Hence

 Πx=Ids−12(Is2⊗x)(P+Ids)(Is2⊗xT).

## Appendix C Deriving the first system of equations

To obtain Equation (3.2), we need to calculate

 ∇xlogDet(ΠxMΠx).

This may be done using the differential and Theorem 2.20 from holbrook2018differentiating, namely

 dDet(A)=Det(A)tr(A+(dA)).

Thus

 dlogDet(ΠxMΠx) =tr((ΠxMΠx)+(d(ΠxMΠx))) =tr((ΠxMΠx)+((dΠx)MΠx+ΠxM(dΠx)))).

is given by

 dΠx =d(I−xxT)=−d(xxT) =−(dx)xT−x(dx)T,

so we have

 dlogDet(ΠxMΠx) =−tr((ΠxMΠx)+(((dx)xT+x(dx)T)MΠx+ΠxM((dx)xT+x(dx)T))).

Distributing the leading and rearranging terms gives

 dlogDet(ΠxMΠx) =−2((dx)T(MΠx(ΠxMΠx)+x+(ΠxMΠx)+ΠxMx)),

but the first term of the inner parenthesis is equal to zero because and so

 MΠx(ΠxMΠx)+x =MΠx(ΠxMΠx)+Πxx =MΠx(ΠxMΠx)+0=0.

Hence,

 dlogDet(ΠxMΠx) =−2((dx)T(ΠxMΠx)+ΠxMx),

and it follows immediately that

 ∇xlogDet(ΠxMΠx)=−2(ΠxMΠx)+ΠxMx.