 # Posterior Integration on an Embedded Riemannian Manifold

This note extends the posterior integration method of Oates et al. (2016, 2017) to the case where the posterior is supported on an embedded Riemannian manifold. In contrast to the original Euclidean case, where certain boundary conditions are required, no boundary conditions are needed for a closed manifold. The important case of the 2-sphere is explored in detail.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This work considers numerical approximation of an integral

 ∫MfdP (1)

where is a -dimensional Riemannian manifold, is a distribution suitably defined on and is a -measurable integrand. This fundamental problem is well-studied in applied mathematics and existing methods include Gaussian cubatures (Atkinson, 1982; Filbir & Mhaskar, 2010), cubatures based on uniformly-weighted quasi Monte Carlo points (Kuo & Sloan, 2005; Gräf, 2013) and cubatures based on optimally-weighted Monte Carlo points (Brandolini et al., 2014; Ehler & Gräf, 2017). These numerical integration methods assume that a closed form for is provided. However, this is not the case for many important integrals that occur in the applied statistical context. In particular, in the Bayesian framework the distribution can represent posterior belief: i.e.

where is an expert-elicited prior distribution on and a function , known as a likelihood, determines how the expert’s belief should be updated on the basis of data obtained in an experiment; see Bernardo & Smith (2001) for the statistical background. The left hand side of this equation is to be interpreted as the Radon-Nikodym derivative of the posterior with respect to the prior (Stuart, 2010). Outside of conjugate exponential families, posterior distributions are not easily characterised, as the normalisation constant (or marginal likelihood)

 Z = ∫ML(x)dP0(x)

is itself an intractable integral. Several methods for approximation of have been developed, but this problem is considered difficult – even in the case of the Euclidean manifold (see the survey in Friel & Wyse, 2012).

Integrals on manifolds arise in many important applications of Bayesian statistics, most notably directional statistics

(Mardia & Jupp, 2000) and modelling of functional data on the sphere (Porcu et al., 2016). The canonical scenario is that and admit densities and with respect to the natural volume element on the manifold, i.e. and , and that a function proportional to can be provided. Specifically, in the context of Bayesian statistics on a Riemannian manifold, we are provided with . To facilitate approximation of Eqn. 1 in this context, Markov chain Monte Carlo (MCMC) methods have been developed to sample from distributions defined on a manifold (Byrne & Girolami, 2013; Lan et al., 2014; Holbrook et al., 2016). Their output is a realisation of an ergodic Markov process that leaves invariant, so that the integral may be approximated by an ergodic average

. Convergence of MCMC estimators is well-understood

(Meyn & Tweedie, 2012).

A drawback of MCMC, of practical significance for regular problems of modest dimension, is that the convergence is gated at . This rate is inferior to the rates obtained by the aforementioned methods that apply when is explicit; a consequence of the fact that the ergodic average does not exploit smoothness of the integrand. In recent years, several alternatives to MCMC have been developed to address this convergence bottleneck. These include transport maps (Marzouk et al., 2016), Riemann sums (Philippe & Robert, 2001), quasi Monte Carlo ratio estimators (Schwab & Stuart, 2012) and estimators based on Stein’s method (Liu & Wang, 2016; Oates et al., 2017, 2018). However, these methods have so far focused on the case of the Euclidean manifold .

In this paper we generalise one of these methods – the method proposed in Oates et al. (2017) – for computation of posterior integrals on a Riemannian manifold. Inspired by classical integration methods, our approach is based on approximation of the integrand: In the first step, the un-normalised density is exploited to construct a class of functions, defined on the manifold, that can be exactly integrated with respect to . Next, the integrand is approximated with a suitably chosen element from . Finally, an approximation to Eqn. 1 is provided by . The main technical contribution occurs in the first step, where we must elucidate a class of functions that can be integrated without access to , the normalisation constant.

The main properties of the proposed method, which hold also for the Euclidean case Oates et al. (2017, 2018), are as follows:

• The convergence rate is empirically verified at , under regularity assumptions on the integrand. On the other hand, the computational cost associated with the estimator is up to .

• The points at which the integrand is evaluated do not need to form an approximation to .

• A computable upper bound on (relative) integration error – a kernel Stein discrepancy (Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017) – is obtained as a by-product of approximating the integral.

Moreover, in this paper we demonstrate that, compared to the case of the Euclidean manifold, non-trivial boundary conditions are not required for the method to be applied on a closed manifold.

The paper proceeds as follows: In Sec. 2 we provide a brief mathematical background. In Sec. 3 we present the proposed method. The method is empirically assessed in Sec. 4. Further discussion of the approach is provided in Sec. 5.

## 2 Mathematical Background

The aim of this section is to present an informal and accessible introduction to some of the mathematical tools that are needed for our development. For a formal treatment, several references to textbooks are provided.

#### Embedded Riemannian Manifolds

Our presentation focuses on embedded manifolds, as any abstract manifold may be embedded in for some by the Whitney embedding theorem (Skopenkov, 2008). However, the method itself will not require an embedding to be explicit. Recall the space may be equipped with global coordinates and the natural inner product. An -dimensional manifold embedded in is a subset of such that any point has a neighbourhood which can be parameterised by local coordinates ; i.e. there exists a smooth map with smooth inverse .

A tangent vector to

at a point is defined as the tangent vector at to a curve on . Since a curve on may be locally parametrised as , its tangent vector is

 γ––′=m∑i=1dqi(t)dt∂νx(q)∂qi∂qi.

Thus the vectors form a basis for the tangent space, denoted . The tangent space is equipped with the inner product , where , that arises as the restriction of the usual inner product on . The pair is called a Riemannian manifold.

Example: The sphere is a Riemannian manifold. The coordinate patch , with local coordinates , , holds for almost all111It does not cover the half great circle that passes through both poles and the point . . The tangent space is spanned by and . Taking the Euclidean inner product of these vectors produces , , .

#### Geometric Measure Theory

Any oriented Riemannian manifold has a natural measure over its Borel algebra, called the Riemannian volume form, which represent an infinitesimal volume element. In a coordinate patch , this measure can be expressed in terms of the Lebesgue measure: . In particular when is the Euclidean space, this is just the Lebesgue measure, and when is an embedded manifold in , is the surface area (or Hausdorff) measure (Federer, 1969).

Example: For the sphere , , where is the area of the parallelogram spanned by .

A technical point is that we restrict attention to Riemannian manifolds that are oriented. This is equivalent to assuming that the volume form is coordinate independent. It will also be required that is either closed or is a manifold with boundary (see p25 of Lee, 2013).

#### Calculus on a Riemannian Manifold

To present a natural, coordinate-independent construction of differential operators on manifolds would require either exterior calculus or the concept of a covariant derivative. To limit scope, we present two important differential operators in local coordinates and merely comment that the associated operators are in fact coordinate-independent; full details can be found in Bachman (2006). To this end, denote the gradient of a function , assumed to exist, as

 ∇ϕ=m∑i,j=1[G−1]i,j∂ϕ∂qj∂qi

Likewise, define the divergence of a vector field with , assumed to exist, as

 ∇⋅s–=m∑i=1∂si∂qi+si∂∂qilog√det(G).

These two differential operators are sufficient for our work; for instance, they can be combined to obtain the Laplacian .

#### Divergence Theorem on a Manifold

The divergence theorem on an oriented Riemannian manifold with boundary states that:

 ∫M∇⋅s–dV = ∫∂M⟨s–,n––⟩Gin––dV

where is the volume form on the boundary and is the unit normal vector pointing outward (Bachman, 2006). To define one uses the fact that, if is a Riemannian submanifold of , then for each , the metric of splits the tangent space into and its orthogonal complement ; i.e. . Elements of are normal vectors to . To define , note that is a submanifold of and the restriction of the metric induces a Riemannian mainfold . Then can be seen as the natural volume form on the induced manifold.

Thus if is a closed manifold, then . This fact will allow for considerable simplification of the proposed approach, compared to the Euclidean case studied in Oates et al. (2017, 2018), where non-trivial boundary conditions were required.

#### Reproducing Kernel Hilbert Spaces

The definition of a (real-valued) reproducing kernel Hilbert space (RKHS) on the manifold is identical to the usual definition on the Euclidean manifold. Namely, an RKHS is a Hilbert space of functions equipped with a kernel, , that satisfies: (a) for all ; (b) for all ; (c) for all , . For standard manifolds, such as the sphere , several function spaces and their reproducing kernels have been studied (e.g. Porcu et al., 2016). For more general manifolds, an extrinsic kernel can be induced from restriction under embedding into an ambient space (Lin et al., 2017), or the stochastic partial differential approach (Fasshauer & Ye, 2011; Lindgren et al., 2011; Niu et al., 2017) can be used to numerically approximate a suitable intrinsic kernel.

Three important facts will be used later: First, the kernel characterises the inner product and the set consists of functions with finite norm . Second, if and are two RKHS on with reproducing kernels and , then can be defined as the RKHS whose elements can be written as , , , with reproducing kernel . Third, if is a linear operator and is an RKHS, then can be defined as the set endowed with the reproducing kernel , where denotes the adjoint of the operator , which acts on the second argument rather than the first argument . See Berlinet & Thomas-Agnan (2011) for several examples of RKHS and additional technical background.

This completes our brief tour of the mathematical prerequisites; the next section describes the proposed posterior integration method.

## 3 Posterior Integration on a Manifold

In this section we present the proposed numerical integration method, which proceeds in three steps:

1. Construct a flexible class of functions such that the integrals with respect to can be exactly computed.

2. Approximate the integrand with a suitably chosen element from .

3. Approximate the integral of interest as .

It is clear that step 1 is non-trivial, since availability of the normalisation constant in Eqn. 1 cannot be assumed. This will be our focus next.

#### Step # 1: Constructing an Approximating Class H

To proceed, we generalise the method of Oates et al. (2017) to the case of an oriented Riemannian manifold.

Let be a RKHS of twice differentiable functions whose reproducing kernel is denoted . Then denotes a gradient field on and we may consider its divergence on . In particular, consider the linear differential operator

 Lπ(ϕ):=∇⋅(π∇ϕ)π.

The proposed method rests on the following result, which is proven in Sec. A.1 of the Supplement:

###### Theorem 1.

For it holds that

 ∫MLπ(ϕ)dP = 1Z∫∂M⟨π∇ϕ,n––⟩Gin––dV

Thus, if is a closed manifold, the right hand side vanishes and can be trivially integrated.

Note that the same conclusion holds even when is not closed, provided that vanishes everywhere on the boundary ; this is similar to the non-trivial assumption made in Oates et al. (2017) for the case of the Euclidean manifold. Note that is not the only differential operator that could be used; others are suggested in Sec. A.2 of the Supplement.

The RKHS , whose elements are functions of the form and whose kernel is , is not quite flexible enough for our purposes, since these cannot approximate the constant function . Thus we augment with the RKHS of constant functions, denoted and equipped with constant kernel , to obtain the function class . Of course, the integral of the constant function with respect to is trivially computed as

is a probability distribution. It follows that

is a RKHS with kernel . To ensure the elements of integrate to zero under , from Theorem 1, we therefore require that vanishes on for each fixed whenever is not closed.

Under certain regularity assumptions, and additional technical details to deal with the fact that a slightly different differential operator was used, the set can be shown to be dense in in the case of the Euclidean manifold (c.f. Lemma 4 of Oates et al., 2018).

#### Step #2: Approximating the Integrand

Now that we have a class of functions that can be exactly integrated, we must attempt to approximate with an element from this set. Following Oates et al. (2017), the estimator that we consider is

 ^f=argminh∈Hπ,σ∥h∥Hπ,σs.t.h(xi)=f(xi),i∈{1,…,n}.

From the representer theorem (see e.g. Schölkopf et al., 2001) it follows that has a closed form expression in terms of the kernel :

 ^f(⋅) =

This has the form of a weighted combination of functions in :

 ^f(⋅)=n∑i=1wikπ,σ(⋅,xi),wi=[K−1π,σf]i

The form of the estimator is rather standard and can be characterised in several ways, e.g. as a Bayes rule for an regression problem or as a posterior mean under a suitable Gaussian process regression model. Alternative kernel estimators, such as estimators that enforce non-negativity of the weights , could be considered (c.f. Liu & Lee, 2017; Ehler & Gräf, 2017).

#### Step #3: Approximating the Integral

The approximation can be exactly integrated by construction:

 ∫M^fdP = n∑i=1wi∫Mkπ,σ(⋅,xi)dP (2) = n∑i=1wiσ2=σ21⊤K−1π,σf

The estimate in Eqn. 2 is recognised as a kernel quadrature method and, as such, it carries a Bayesian interpretation (Briol et al., 2016). Namely, from the Bayesian perspective, Eqn. 2 is the posterior mean for when is modelled a priori as a centred Gaussian process with covariance function (see Rasmussen & Williams, 2006, for background on Gaussian process models). In this light, the parameter

can be considered as a prior standard deviation for the value of the integral

. Thus, since we may not know the size of the values taken by in advance, we consider a weakly informative prior corresponding to the limit . To this end, we have the following, proven in Sec. A.1 of the Supplement:

###### Theorem 2.

Let denote the kernel matrix with entries . Then

 = (K−1π11⊤K−1π1)⊤f. (3)

The estimator in Eqn. 3 is the one that is experimentally tested in Section 4. An interesting observation is that the weights automatically sum to unity in this approach. Moreover, the expression is exactly the worst case error of the weighted point set in the unit ball of :

 (1⊤K−1π1)−1/2 = sup{∣∣∣∫MfdP−limσ→∞∫M^fdP∣∣∣:∥f∥Hπ≤1} (4)

This quantity is also known as the kernel Stein discrepancy associated with the weighted point set (Chwialkowski et al., 2016; Liu et al., 2016; Gorham & Mackey, 2017). Thus a measure of (relative) integration error comes for free when the estimator is computed. From standard duality, this expression is also the posterior standard deviation for the integral. Of course, in practice the linear system need only be solved once, at a cost of at most .

#### Related Work

The original work of Oates et al. (2017) considered an arbitrary vector field in place of the gradient field , and thus required only a first order differential operator. This was possible since the coordinates of the vector field could be dealt with independently in the case of the Euclidean manifold, but this will not be possible in the case of a general manifold. Interestingly, the second order differential operator considered here is the manifold generalisation of the operator used in the earlier work of Assaraf & Caffarel (1999); Mira et al. (2013) and recently rediscovered in the context of Riemannian Stein variational gradient descent in Liu & Zhu (2017). The latter reference is most similar to our work, but focused on construction of a point set as opposed to the question of how to construct an estimator based on a given point set. Other Stein operators for the Euclidean manifold were proposed in Gorham et al. (2016). The divergence theorem was recently also used in order to generalise the score matching method for parameter estimation on a Riemannian manifold in Mardia et al. (2016).

## 4 Numerical Assessment

In this section we report experiments designed to assess the performance of the proposed numerical method. In Sec. 4.1 we return to the standard case of the Euclidean manifold , then in Sec. 4.2 we present experiments performed on the sphere . Last, in Sec. 4.3 we applied the proposed method to an Australian Mesozoic paleomagnetic pole model where the manifold was .

### 4.1 Euclidean Manifold

First, we considered the Euclidean manifold . This was for two reasons; first, to expose the proposed construction in a familiar context, and second, to determine whether the use of a second order differential operator leads to any substantive differences relative to earlier work.

#### Differential Operator

For , we have a global parametrisation and the natural volume form is the Lebesgue measure; . For simplicity, suppose that either vanishes on or . Then our method involves the second order differential operator

 Lπ(ϕ)=∇⋅(π∇ϕ)π=∇ππ⋅∇ϕ+Δϕ

where is the familiar gradient. For the case where is bounded, let denote the unit normal to . Then from the Euclidean version of the divergence theorem:

 ∫MLπ(ϕ)dP = ∫∂Mπ(x)∇ϕ(x)⋅n(x)dλd(x) = ∫∂M0dλd(x)=0.

The equality is immediate for the case .

This is to be contrasted with the earlier work of Oates et al. (2017), which considered a general vector field and the first order differential operator . From there, Oates et al. (2017) proceed as we have already described, with

the tensor product of

copies of the RKHS . Note that implicitly relies on the Euclidean structure of the manifold and cannot be general.

#### Choice of Kernel

If then the Matérn kernel

 k(x,y) = λ2exp(−√2α∥x−y∥ℓ)Γ(α+12)Γ(2α)α−12∑i=0(α−12+i)!i!(α−12−i)!(√8α∥x−y∥ℓ)α−12−i

with parameters reproduces the Sobolev space (see e.g. Fasshauer, 2007). In order for to be well-defined, we require that elements of are twice (weakly) differentiable; hence we require that . In contrast, for to be well defined we have the weaker requirement that .

#### Experimental Results

For a transparent and reproducible test bed, let be the standard Gaussian in and suppose that we are told . That is, we pretend that the normalisation constant is unknown and proceed as described. For the kernel we fixed222All kernel parameters were fixed at sensible default values in this work, but optimisation of these parameters can be expected to offer improvement. , and considered values . For the points we consider three scenarios:

1. independent, unbiased draws

2. independent, biased draws

3. (for ) stratified points the th percentile of

For each point set we computed the worst case error in Eqn. 4. (For scenarios 1 and 2 we report the mean worst case error obtained over 100 independent realisations of the random points.) Both differential operators and were considered, although the former is incompatible with . All results were obtained using MATLAB R2017b, with symbolic differentiation exploited to compute all kernels . Further details are provided in Sec. A.8 of the Supplement. Figure 1: Results for the standard Gaussian on the Euclidean manifold in dimension d=1. The worst case error of the proposed method (Eqn. 4) was plotted for various α, controlling the smoothness of the kernel, and various n, the number of evaluations of the integrand. [Top row: The points xi∼N(0,1) were drawn from the target. Middle row: The points xi∼N(1,3) were drawn from an incorrect distribution. Bottom row: The points xi were stratified on the percentiles of the target. Left column: The differential operator Lπ considered in this work. Right column: The differential operator L1π considered in earlier work.]

Results, in Fig. 1 for and in Fig. 5 in the Supplement for , showed that larger smoothness leads to faster decay of worst case error. In particular, we empirically observe convergence rates of . In the case of the operator studied on the Euclidean manifold in Oates et al. (2018) (right column), it was proven (under some additional assumptions) that the worst case error decreases at when is smooth, essentially because the kernel has derivatives of order . This is consistent with the experimental results in Fig. 1. A small extension of the theoretical methods used in Oates et al. (2018) gives a corresponding rate for the operator of , since is a second order differential operator. The results in the left column of Fig. 1 bear out this conjecture, with slower convergence of the worst case error for fixed compared to the right column.

There was only a small difference between the worst case error in the first scenario (, top row) compared to the second scenario (, middle row). This clearly illustrates the property that the points need not form an approximation to in the proposed method. On the other hand, the stratified points (bottom row) appeared to mitigate the transient phase before the linear asymptotics kicks in, compared to the use of Monte Carlo points, and should be preferred.

In dimension the worst case error decays more slowly, consistent with the rates just conjectured. Moreover, the asymptotic advantage of larger is not clearly seen for so that the transient phase appears to last for longer. This is consistent with the well-known curse of dimension for isotropic kernel methods.

To investigate the robustness of the proposed method when the integrand is not well-approximated by elements in , we also report absolute integration errors in Fig. 6 of the Supplement. These additional experiments are brief, since they closely mirror the experiments conducted in Oates et al. (2018).

### 4.2 The Sphere S2

Next we considered arguably the most important non-Euclidean manifold; the sphere .

#### Differential Operator

Recall that the coordinate map from Sec. 2 can be used to compute the metric tensor

 G=(sin2q2001)

and a natural volume element . It follows that, for a function , we have the gradient differential operator

 ∇ϕ = 1sin2q2∂ϕ∂q1∂q1+∂ϕ∂q2∂q2.

Similarly, for a vector field , we have the divergence operator

 ∇⋅s– = ∂s1∂q1+∂s2∂q2+cosq2sinq2s2.

Thus the linear operator that we consider is:

 Lπ(ϕ) = cosq2sinq2∂ϕ∂q2+1sin2q2{1π∂π∂q1∂ϕ∂q1+∂2ϕ∂q21}+{1π∂π∂q2∂ϕ∂q2+∂2ϕ∂q22}. (5)

Turning this into expressions in terms of requires that we notice

 cosq2sinq2=x3√1−x23,1sin2q2=11−x23

and use chain rule for partial differentiation (see Sec.

A.4 in the Supplement). This manifold-specific portion of MATLAB code is presented in Fig. 9 in the Supplement.

#### Choice of Kernel

To proceed we require a reproducing kernel defined on . To this end, Proposition 5 of Brauchart & Dick (2013) establishes that, for , the kernel

 k(x,y) = C(1)3F2[32−α,1−α,32−α2−α,3−m2−2α;1−x⋅y2]+C(2)∥x−y∥2α−2, (6)

defined for , reproduces the Sobolev space . Here is the generalised hypergeometric function. The expressions for the constants and are given in Sec. A.5 of the Supplement. This kernel was used in our experiments, with values considered.

#### Experimental Results Figure 2: The first 12 eigenfunctions of the kernel kπ for a von Mises-Fisher distribution on S2.

To illustrate the method, considered the von Mises-Fisher distribution whose density with respect to is

 πP(x)=∥c∥24πsinh(∥c∥2)exp(c⊤x).

For illustration we suppose that the normalisation constant is unknown and we have access only to . Thus we proceed to construct the differential operator operator as previously described. To gain intuition as to the reasonableness of the resulting , we fixed and plotted the first few eigenfunctions of for the target distribution defined by . These are shown in Fig. 2. Note that, in simple visual terms, these seem like a reasonable basis in which to perform function approximation on .

Next, we considered the performance of the proposed integration method. For various values of , we obtained points

that were quasi-uniformly distributed on

, being obtained by minimising a generalised electrostatic potential energy (Reisz’s-energy; Semechko, 2015). Note that these points, being uniform, do not form an approximation to . For each point set we computed the worst case error in Eqn. 4.

Results in Fig. 3 (left) showed that the worst case error decays more rapidly with larger . Again, we empirically observe convergence rates of . Although it is not possible to accurately read off asymptotic rates from these results, they are somewhat consistent with a convergence rate of where .

To investigate the robustness of the proposed method when the integrand is not itself an element of , we also report absolute integration errors in Sec. A.6 of the Supplement. Figure 3: Results for non-Euclidean manifolds. Left: The von Mises-Fisher distribution on S2. [The worst case error of the proposed method (Eqn. 4) was plotted for various α, controlling the smoothness of the kernel, and various n, the number of evaluations of the integrand. The points xi were quasi-uniform over S2 and the differential operator Lπ was used.] Right: Results for the paleomagnetic model. [The worst case error of the proposed method (Eqn. 4) was plotted for various n, the number of evaluations of the integrand. Black: The ergodic average from MCMC. Blue: The proposed method based on the same MCMC point set. Red: The proposed method based on a heuristic, whereby points were approximately stratified on the manifold.]

### 4.3 Australian Mesozoic Paleomagnetic Pole Model

The data that we considered consists of independent estimates for the position of the Earth’s historic magnetic pole (Table 2 of Schmidt, 1976). The task is to aggregate these estimates; for details see Paine et al. (2017). The statistical model considered uses the von Mises-Fisher distribution as a likelihood:

 L(μ,κ) ∝ n∏i=1κ1/2I12(κ)exp(κμ⊤yi)

where is the modified Bessel function of the first kind of order , is a mean parameter and is a concentration parameter. Here the density is given with respect to the natural volume element on

. A Bayesian analysis proceeds under a conjugate prior

 π0(μ,κ) ∝ ⎛⎜⎝κ1/2I12(κ)⎞⎟⎠cexp(R0κμ⊤0μ)

for hyper-parameters , , given with respect to the natural volume element on the manifold . The task is to compute expectations under the posterior

 πP(μ,κ|{yi}ni=1) ∝ ⎛⎜⎝κ1/2I12(κ)⎞⎟⎠c+nexp(Rnκμ⊤nμ),

where

 Rn=∥∥ ∥∥R0μ0+n∑i=1yi∥∥ ∥∥,μn=R−1n(R0μ0+n∑i=1yi).

In particular, we attempted to estimate the first and second moments of

and , so that for , or , in each case for . The most default prior with , was employed, following Nunez-Antonio & Gutiérrez-Pena (2005). Owing to the smoothness of these test functions, a version of the exponentiated-quadratic kernel was employed; see the Supplement for full detail.

Points were sampled from the posterior via random-walk MCMC. The usual ergodic average was used to estimate each integral and the result was compared to the estimate provided by the proposed method based on the same point set. Results in Figs. 3 (right) showed that the proposed method strongly outperformed the ergodic average for integrands in the unit ball of . In Fig. 4 the proposed method was seen to performed well for the test functions , but delivered similar performance to the ergodic average for the test functions . This may be because the latter were not well-approximated by elements of .

For interest, we also considered a heuristic approach where points were approximately uniformly stratified, as described in Sec. A.7 of the Supplement. Note that these points do not form an empirical approximation to the posterior; however, for the proposed method this is not required. Results in Fig. 3 (right) showed similar performance to MCMC. Figure 4: Results for the paleomagnetic model. Integral estimates were obtained and plotted for various n, the number of evaluations of the integrand. [Red: The ergodic average from MCMC. Blue: The proposed method based on the same MCMC point set. Results from 3 independent chains are shown; in most panels the proposed method has indistinguishable variation, so the blue curves cannot be distinguished.]

## 5 Discussion

This paper generalised of the method of Oates et al. (2017) to a general oriented Riemannian manifold. The method was illustrated for regular integrals of modest dimension; as usual, the case of high-dimensional manifolds (i.e. large) is likely to challenge any regression-based method unless strong assumptions can be made on the integrand. Three open theoretical questions are: (1) How expressive is the function class in terms of approximation properties? (2) How quickly do these estimators converge under particular regularity assumptions on the integrand? (3) How robust are these estimators when the function space assumptions are violated? The ultimate success of this method will hinge on the extent to which these important questions can be addressed.

## Acknowledgements

CJO and MG were supported by the Lloyd’s Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. AB was supported by a Roth scholarship from the Department of Mathematics at Imperial College London, UK. MG was supported by the EPSRC grants [EP/K034154/1, EP/R018413/1, EP/P020720/1, EP/L014165/1], and an EPSRC Established Career Fellowship, [EP/J016934/1]. CJO is grateful for discussions with Chang Liu.

## References

• Assaraf & Caffarel (1999) Assaraf, R. and Caffarel, M.

Zero-variance principle for Monte Carlo algorithms.

Physical Review Letters, 83(23):4682, 1999.
• Atkinson (1982) Atkinson, K. Numerical integration on the sphere. The ANZIAM Journal, 23(3):332–347, 1982.
• Bachman (2006) Bachman, D. A Geometric Approach to Differential Forms. Birkhäuser, 2006.
• Berlinet & Thomas-Agnan (2011) Berlinet, A. and Thomas-Agnan, C. Reproducing Kernel Hilbert Spaces in Probability and Statistics. Springer Science & Business Media, 2011.
• Bernardo & Smith (2001) Bernardo, J. M. and Smith, A. F. M. Bayesian Theory. IOP Publishing, 2001.
• Brandolini et al. (2014) Brandolini, L., Choirat, C., Colzani, L., Gigante, G., Seri, R., and Travaglini, G. Quadrature rules and distribution of points on manifolds. Annali della Scuola Normale Superiore di Pisa - Classe di Scienze, 5:889–923, 2014.
• Brauchart & Dick (2013) Brauchart, J. S. and Dick, J. A characterization of Sobolev spaces on the sphere and an extension of Stolarsky’s invariance principle to arbitrary smoothness. Constructive Approximation, 38(3):397–445, 2013.
• Briol et al. (2016) Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., and Sejdinovic, D. Probabilistic integration: A role in statistical computation? arXiv:1512.00933, 2016.
• Byrne & Girolami (2013) Byrne, S. and Girolami, M. Geodesic Monte Carlo on embedded manifolds. Scandinavian Journal of Statistics, 40(4):825–845, 2013.
• Chwialkowski et al. (2016) Chwialkowski, K., Strathmann, H., and Gretton, A. A kernel test of goodness of fit. In ICML, 2016.
• Ehler & Gräf (2017) Ehler, M. and Gräf, M. Optimal Monte Carlo integration on closed manifolds. arXiv:1707.04723, 2017.
• Fasshauer (2007) Fasshauer, G. Meshfree Approximation Methods with MATLAB. World Scientific, 2007.
• Fasshauer & Ye (2011) Fasshauer, G. E. and Ye, Q. Reproducing kernels of generalized Sobolev spaces via a Green function approach with distributional operators. Numerische Mathematik, 119(3):585–611, 2011.
• Federer (1969) Federer, H. Geometric Measure Theory. Springer, 1969.
• Filbir & Mhaskar (2010) Filbir, F. and Mhaskar, H. N. A quadrature formula for diffusion polynomials corresponding to a generalized heat kernel. Journal of Fourier Analysis and Applications, 16(5):629–657, 2010.
• Friel & Wyse (2012) Friel, N. and Wyse, J. Estimating the evidence - a review. Statistica Neerlandica, 66(3):288–308, 2012.
• Gorham & Mackey (2017) Gorham, J. and Mackey, L. Measuring sample quality with kernels. In ICML, 2017.
• Gorham et al. (2016) Gorham, J., Duncan, A. B., Vollmer, S. J., and Mackey, L. Measuring sample quality with diffusions. arXiv:1611.06972, 2016.
• Gräf (2013) Gräf, M. Efficient algorithms for the computation of optimal quadrature points on Riemannian manifolds. PhD thesis, Universitätsverlag der Technischen Universität Chemnitz, 2013.
• Holbrook et al. (2016) Holbrook, A., Lan, S., Vandenberg-Rodes, A., and Shahbaba, B. Geodesic Lagrangian Monte Carlo over the space of positive definite matrices: with application to Bayesian spectral density estimation. arXiv:1612.08224, 2016.
• Kuo & Sloan (2005) Kuo, F. Y. and Sloan, I. H. Quasi-Monte Carlo methods can be efficient for integration over products of spheres. Journal of Complexity, 21(2):196–210, 2005.
• Lan et al. (2014) Lan, S., Zhou, B., and Shahbaba, B. Spherical Hamiltonian Monte Carlo for constrained target distributions. In ICML, pp. 629–637, 2014.
• Lee (2013) Lee, J. M. Introduction to Smooth Manifolds. Springer-Verlag, 2013.
• Lin et al. (2017) Lin, L., Niu, M., Cheung, P., and Dunson, D. Extrinsic Gaussian processes for regression and classification on manifolds. arXiv:1706.08757, 2017.
• Lindgren et al. (2011) Lindgren, F., Rue, H., and Lindström, J.

An explicit link between Gaussian fields and Gaussian Markov random fields: The stochastic partial differential equation approach.

Journal of the Royal Statistical Society: Series B, 73(4):423–498, 2011.
• Liu & Zhu (2017) Liu, C. and Zhu, J. Riemannian Stein Variational Gradient Descent for Bayesian Inference. arXiv:1711.11216, 2017.
• Liu & Lee (2017) Liu, Q. and Lee, J. D. Black-box importance sampling. In AISTATS, 2017.
• Liu & Wang (2016) Liu, Q. and Wang, D.

Stein variational gradient descent: A general purpose Bayesian inference algorithm.

In NIPS, pp. 2378–2386, 2016.
• Liu et al. (2016) Liu, Q., Lee, J., and Jordan, M. A Kernelized Stein Discrepancy for Goodness-of-fit Tests and Model Evaluation. In ICML, 2016.
• Mardia & Jupp (2000) Mardia, K. V. and Jupp, P. E. Directional Statistics. John Wiley & Sons, 2000.
• Mardia et al. (2016) Mardia, K. V., Kent, J. T., and Laha, A. K. Score matching estimators for directional distributions. arXiv preprint arXiv:1604.08470, 2016.
• Marzouk et al. (2016) Marzouk, Y., Moselhy, T., Parno, M., and Spantini, A. An introduction to sampling via measure transport. arXiv:1602.05023, 2016.
• Meyn & Tweedie (2012) Meyn, S. P. and Tweedie, R. L. Markov Chains and Stochastic Stability. Springer Science & Business Media, 2012.
• Mira et al. (2013) Mira, A., Solgi, R., and Imparato, D. Zero variance Markov chain Monte Carlo for Bayesian estimators. Statistics and Computing, 23(5):653–662, 2013.
• Niu et al. (2017) Niu, M., Cheung, P., Lin, L., Dai, Z., Lawrence, N., and Dunson, D. Intrinsic Gaussian processes on complex constrained domains. arXiv:1801.01061, 2017.
• Nunez-Antonio & Gutiérrez-Pena (2005) Nunez-Antonio, G. and Gutiérrez-Pena, E. A Bayesian analysis of directional data using the von Mises-Fisher distribution. Communications in Statistics - Simulation and Computation, 34(4):989–999, 2005.
• Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. Control functionals for Monte Carlo integration. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(3):695–718, 2017.
• Oates et al. (2018) Oates, C. J., Cockayne, J., Briol, F.-X., and Girolami, M. Convergence rates for a class of estimators based on Stein’s identity. Bernoulli, 2018. To appear.
• Paine et al. (2017) Paine, P. J., Preston, S. P., Tsagris, M., and Wood, A. T. A.

An elliptically symmetric angular Gaussian distribution.

Statistics and Computing, pp. 1–9, 2017.
• Philippe & Robert (2001) Philippe, A. and Robert, C. P. Riemann sums for MCMC estimation and convergence monitoring. Statistics and Computing, 11(2):103–115, 2001.
• Porcu et al. (2016) Porcu, E., Bevilacqua, M., and Genton, M. G. Spatio-temporal covariance and cross-covariance functions of the great circle distance on a sphere. Journal of the American Statistical Association, 111(514):888–898, 2016.
• Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K.

Gaussian Processes for Machine Learning

.
MIT Press, 2006.
• Schmidt (1976) Schmidt, P. W. The non-uniqueness of the Australian Mesozoic palaeomagnetic pole position. Geophysical Journal International, 47(2):285–300, 1976.
• Schölkopf et al. (2001) Schölkopf, B., Herbrich, R., and Smola, A. A generalized representer theorem. In COLT, 2001.
• Schwab & Stuart (2012) Schwab, C. and Stuart, A. M. Sparse deterministic approximation of Bayesian inverse problems. Inverse Problems, 28(4):045003, 2012.
• Semechko (2015) Semechko, A. Suite of functions to perform uniform sampling of a sphere. MATLAB File Exchange Server, 2015.
• Skopenkov (2008) Skopenkov, A. B. Embedding and knotting of manifolds in Euclidean spaces. London Mathematical Society Lecture Note Series, 347:248, 2008.
• Stuart (2010) Stuart, A. M. Inverse problems: A Bayesian perspective. Acta Numerica, 19:451–559, 2010.

## Appendix A Supplement

This electronic supplement contains several theoretical details and empirical results that were referenced in the paper Posterior Integration on a Riemannian Manifold.

### a.1 Proofs

This section contains proofs of Theorems 1 and 2 from the main text.

###### Proof of Theorem 1.

By definition, and so we have that . Thus

 ∫MLπ(ϕ)dP = ∫MLπ(ϕ)πZdV = ∫M∇⋅(π∇ϕ)ππZdV = 1Z∫M∇⋅(π∇ϕ)dV

where is the natural volume form on . From the divergence theorem we have that:

 ∫M∇⋅(π∇ϕ)dV = ∫∂M⟨π∇ϕ,n––⟩Gin––dV

which establishes the result that was claimed. ∎

###### Proof of Theorem 2.

Note that . The proof is then an application of the Woodbury matrix inversion formula, which can be used to deduce that

 limσ→∞∫M^fdP = limσ→∞σ21⊤(σ211⊤+Kπ)−1f = = (K−1π11⊤K−1π1)⊤f

as required. ∎

### a.2 Alternative Differential Operators

Alternatives to the differential operator can also be considered. To this end, recall vector fields are differential operators, so that we can consider the directional derivative of a function in the direction , denoted . Now, note that . In particular, if , then .

From the above identities we have that, for a closed manifold , vector field and function ,

 ∫Ms–(f)+f∇⋅s–dV=0.

The operator in the main text is thus the special case with and . Another possibility is and :

 ∫M(∇ϕ)(π)+πΔϕdV=0.

Similarly, we also have Green’s identity

 ∫MfΔg−gΔfdV=0

In particular, if satisfies the Poisson equation , then

 ∫MfΔg−gρdV=0

and if is harmonic (), then

 ∫MfΔgdV=0. (7)

This suggests other possible differential operators, for example if we take in Eqn. 7 we obtain a differential operator for any harmonic . Of course, any linear combination of the above operators integrates to as well.

In this work we do not exhibit a criteria under which one operator may be preferred to another, although we note that the related discussion in Gorham et al. (2016) may be relevant. However, a computational preference should clearly be afforded to differential operators that are of lower order; for this reason we would prefer in the main text to above, as the former requires only first order derivatives of .

### a.3 Additional Results for the Euclidean Manifold

The experiment presented in the main text was repeated in dimension and the results are presented in Fig. 5. Figure 5: Results for the standard Gaussian on the Euclidean manifold in dimension d=2. The worst case error of the proposed method (Eqn. 4) was plotted for various α, controlling the smoothness of the kernel, and various n, the number of evaluations of the integrand. [Left: The points xi∼N(0,1) were drawn from the target. Right: The points xi∼N(1,3) were drawn from an incorrect distribution. In each case the differential operator Lπ was used.]

To assess the robustness of the proposed method when the integrand need not lie close to the set , we considered integration of explicit test functions . In Fig. 6 we fixed , and generated independent samples from . The proposed method based on was employed to integrate for and the integral estimates so-obtained were compared to those provided by the importance sampling Monte Carlo estimator:

 1nn∑i=1f(xi)πP(xi)πQ(xi)

Of course, the true integrals in this case are, respectively, and . It is seen that the proposed method provided lower variance estimation than the Monte Carlo method. Figure 6: Additional results for M=R. Integral estimates were obtained and plotted for various n, the number of evaluations of the integrand. The true integral for f(x)=x is 0 and for f(x)=x2 is 1. Several realisations of the point set were obtained and the associated estimates were aggregated into a boxplot. [Black: The importance sampling Monte Carlo method. Blue: The proposed method based on the same point set.]

### a.4 Chain Rule for Partial Differentiation

The following identities are useful, converting differential operators in local coordinates into differential operators in global coordinates :

 ∂∂qi = m∑j=1∂xj∂qi∂∂xj (8) ∂2∂q2i = m∑j,k=1∂xj∂qi∂xk∂qi∂2∂xj∂xk+3∑j=1∂2xj∂q2i∂∂xj (9)

For example, an application of these identities to Eqn. 5 leads to the differential operator used in the code snippet in Fig. 9.

### a.5 Choice of Kernel on S2

The constant terms in the kernel in Eqn. 6 are as follows:

 C(1) = 22α−22α−2(m2)2α−2(m)2α−2 C(2) = (−1)α−1221−2αΓ(m+12)Γ(α−12)Γ(α−12)√πΓ(m2)(12)α−12(m2)α−12.

Here is the Pochhammer symbol. See Brauchart & Dick (2013) for full detail.

### a.6 Additional Results for S2

To assess the robustness of the proposed method when the integrand need not lie close to the set , we considered integration of explicit test functions . In Fig. 7 we fixed and generated a quasi-uniform point set of size , as described in the main text. The proposed method based on was employed to integrate for and the integral estimates so-obtained were compared to those provided by the importance sampling Monte Carlo estimator:

 ∑ni=1f(xi)π(xi)∑ni=1π(xi)

From symmetry, the true integrals of and