# Sampling from manifold-restricted distributions using tangent bundle projections

A common problem in Bayesian inference is the sampling of target probability distributions at sufficient resolution and accuracy to estimate the probability density, and to compute credible regions. Often by construction, many target distributions can be expressed as some higher-dimensional closed-form distribution with parametrically constrained variables; i.e. one that is restricted to a smooth submanifold of Euclidean space. I propose a derivative-based importance sampling framework for such distributions. A base set of n samples from the target distribution is used to map out the tangent bundle of the manifold, and to seed nm additional points that are projected onto the tangent bundle and weighted appropriately. The method can act as a multiplicative complement to any standard sampling algorithm, and is designed for the efficient production of approximate high-resolution histograms from manifold-restricted Gaussian distributions.

## Authors

• 7 publications
10/21/2021

### Imprecise Subset Simulation

The objective of this work is to quantify the uncertainty in probability...
01/10/2021

### Randomised maximum likelihood based posterior sampling

Minimization of a stochastic cost function is commonly used for approxim...
05/18/2021

### Parametrization invariant interpretation of priors and posteriors

In this paper we leverage on probability over Riemannian manifolds to re...
09/01/2020

### Adaptive Path Sampling in Metastable Posterior Distributions

The normalizing constant plays an important role in Bayesian computation...
11/21/2019

### TMI: Thermodynamic inference of data manifolds

The Gibbs-Boltzmann distribution offers a physically interpretable way t...
07/13/2021

### IID Sampling from Intractable Distributions

We propose a novel methodology for drawing iid realizations from any tar...
06/28/2017

### Approximation of probability density functions on the Euclidean group parametrized by dual quaternions

Perception is fundamental to many robot application areas especially in ...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Bayesian inference is a standard approach to analyzing data in the physical sciences. The process involves fitting a probabilistic physical model to data, and summarizing the result as a probability distribution on the parameters of the model [Gelman et al. (2013)

]. In continuous problems, it entails the mapping out of a probability density function over the model parameter space. If the target distribution is nontrivial and the space has modest dimensionality, this must typically be performed with stochastic sampling algorithms, the most ubiquitous of which are the well-known Markov chain Monte Carlo (MCMC) class of methods [

Robert and Casella (2005); Gamerman and Lopes (2006)]. A histogram of the samples may then be used to estimate the probability density of the target distribution, and to compute credible regions in parameter space. However, the histogram bins must be small enough to resolve key features of the target density, which in turn calls for a large number of samples to ensure each bin gives an accurate estimate of the local density. Such a requirement can be computationally intensive, if not prohibitive.

I attempt to address this problem for any target distribution that can be cast as a multivariate distribution with parametrized constraints on its variables. More precisely, the target density must be expressible as the composition of (i) a smooth map from parameter space to an ambient Euclidean space, with (ii) the closed-form density of a common (e.g. multivariate normal) distribution on the ambient space. Many Bayesian likelihood or posterior densities “factor” naturally into such a composite form, with the map being a deterministic model for some set of observables, and the density describing the statistical uncertainty in their actual observed values. The present work includes some results for a general ambient density, but mostly focuses on the case where it is Gaussian. In spite of this, the smooth map is generally nonlinear, which endows its image with nontrivial curvature, and the full target density with non-Gaussianity.

In this paper, I propose a method that is designed for efficient sampling from distributions of the above form. It is original to the best of my knowledge, and combines a standard sampling algorithm (Metropolis–Hastings [Hastings (1970)], in the provided examples) with the derivative(s) of the smooth map, to perform importance sampling [Ripley (1987)] on a discretized and linearized approximation of the map image in the ambient space. The standard algorithm is used to explore the image space, which is a smooth submanifold of Euclidean space, and to populate a “base chain” of samples from the target distribution. Each base sample then seeds a “mini-distribution” of additional samples, where is specified; this is done by drawing points from a suitably compact Gaussian centered on the mapped base sample, projecting them onto the associated tangent space, and assigning approximate weights to the pullback of these points onto parameter space. The result is a set of weighted samples, from just calls to the map and derivative(s).

Provided first-derivative information is available, the proposed method then serves as a runtime or post-hoc “multiplier” for any algorithm employed to generate samples from a manifold-restricted Gaussian distribution. It is particularly useful when (i) the map is computationally expensive, (ii) the first derivative is not significantly more expensive than the map itself, and (iii) the base chain is near convergence. In this case, the computational savings scale as (where is effectively arbitrary), and the method can produce a high-resolution set of samples with great efficiency. Although these samples are approximate, the loss of accuracy is mitigated by the weights, and controlled by tuning the overall “compactness” of the mini-distributions; it can also be further reduced with the second derivative of the map, by accounting for the local curvature of the manifold in the compactness of each mini-distribution.

While the proposed method involves derivative-based sampling, it is actually more akin to kernel density estimation and other non-parametric approaches [

Hjort et al. (2010)] in terms of its results and applications, and is conceptually distinct from MCMC samplers that use derivatives to inform the chain dynamics. The method is ideally combined with such algorithms, however, due to their shared computation of derivatives at each iteration. There is a whole host of derivative-based MCMC samplers in the literature; more prominent examples include the Metropolis-adjusted Langevin algorithm (MALA) [Roberts and Rosenthal (1998)], Hamiltonian Monte Carlo [Duane et al. (1987)], and a family of stochastic-gradient variants [Ma et al. (2015)]. In addition, despite their common usage of techniques and nomenclature from differential geometry, the proposed method is only loosely related to the premise and concepts of information geometry [Amari and Nagaoka (2007)], and even more tangentially to the computer science field of manifold learning [Lee and Verleysen (2007)].

Section 2

details the theoretical and practical aspects of the method, which is then showcased through a handful of heuristic examples in Section

3; these range from the abstract to the applied (with a notable example in the presently high-profile field of gravitational-wave astronomy), while spanning a variety of scenarios with different dimensionality setups and curvature profiles.

## 2 Method

In Sections 2.1 and 2.2, I lay out some geometric preliminaries and present a formal derivation of the method. Readers who are more interested in practical implementation might benefit from the algorithmic summary provided in Section 2.3. To the best of my ability, notation in this paper is chosen for intuitiveness, concordance with standard conventions, and prevention of symbolic overload. The Latin indices are used as set labels, while the Greek indices

are reserved for local coordinates and the corresponding components of tensors.

### 2.1 General formalism

Let us consider the problem of generating samples in proportion to a probability density with general form

 p(θ)∝f(α(θ)), (1)

where the ambient density is a non-negative Lebesgue-integrable function, and the map is a smooth embedding of the sample-space manifold (with ). The image is then an -dimensional submanifold of , and the pullback of the Euclidean metric by induces a Riemannian metric on . (The notation is historically tied to the special case of the first fundamental form, but is chosen here for aesthetic consistency.) In local coordinates , the components of this pullback metric are given by

 FI:=[I(∂μ,∂μ′)]=JTJ, (2)

where is the Jacobian matrix . There is a natural inclusion of the tangent bundle into , and the columns of form a basis in for the corresponding tangent space .

In the proposed method, we assume that (1) can be sampled through other means, and that there exists a derivative-augmented base chain with members:

 B:={(θi,αi,Ji)|i=1,2,…,n}, (3)

where the distribution of the set is proportional to (1), and . For each triple in , we first generate a local mini-distribution of points , where . These are normally distributed with mean and a covariance that depends on , i.e.

 βij∼N(αi,C−1i), (4)

where is used here (and henceforth) to denote points that are not necessarily restricted to the manifold . The matrix determines the compactness of the points , and is left unspecified for now; a suitable prescription for choosing it is suggested in Section 2.2.

If the second derivative of is available, a natural choice for the compactness matrix will also account for the normal curvature of at the point , i.e. should depend on the (generalized) second fundamental form

. This is a vector-valued form in the case of codimension

, and is given by the projection of the Euclidean directional derivative onto the normal space [Spivak (1970)]. We will find it convenient to treat as a vector in rather than , and to work with the Euclidean norm of . Such a representation is valid for general codimension, and obviates the need to construct a canonical basis for the normal bundle. Although the normal curvature is described more accurately by a signed norm with respect to the orientation of this basis, or even better by the full Riemann tensor, our approach is practically motivated and will typically be a conservative choice (see discussion around (28)). This treatment of defines an symmetric matrix

 FII:=[|II(∂μ,∂μ′)|]=[∣∣(I−P⊥)∇∂μ∂μ′∣∣], (5)

where

is the identity matrix,

is the orthogonal projection onto , and denotes the directional derivative of along at for .

Eq. (5) can be cast directly in terms of and , with the directional derivatives of the basis vectors given by the components of the third-order Hessian tensor

 Hνμμ′:=∂μ∂μ′αν=(∇∂μ∂μ′)ν. (6)

The projection is related to the Jacobian by [Meyer (2000)]

 P⊥=JJ+, (7)

where the left pseudoinverse

 J+:=F−1IJT (8)

of also acts as the projection–pullback onto .

We now use the well-known fact that for a general multivariate Gaussian distribution (on Euclidean space)

 (xμ,yν)∼N(E[(xμ,yν)],cov(xμ,yν)), (9)

the marginal distribution for any proper subset of variables is also Gaussian with unchanged covariance, i.e.

 xμ∼N(E[xμ],cov(xμ)). (10)

Hence the marginalization of (9) over is precisely its restriction to the subspace (which is flat with the Euclidean metric), and this restricted Gaussian may be sampled by orthogonally projecting a set of samples from (9) onto the subspace. The equivalence between flat-space restriction and orthogonal projection allows us to trivially generate samples from (4) that are confined to the tangent space ; these will then be taken as proxies for points on in the neighborhood of .

For each Gaussian-distributed point centered on , the projection of onto gives the point

 β⊥ij:=αi+P⊥i[βij−αi], (11)

where . Via pullback by , the associated point is given by

 θij=θi+J+i[βij−αi], (12)

where . In the neighborhood of , we have

 αij:=α(θij)≈αi+Ji[θij−θi]=β⊥ij, (13)

with equality in the case where is a flat manifold, i.e. when for all .

As the samples have not been generated in proportion to the target density (1), the final step is to reduce their correlation with the corresponding , through the appropriate assignment of weights . From (1) and (4), we see that the probability density for is proportional to the “manifold convolution” of with a Gaussian, which requires knowledge of the probability measure on the manifold to begin with. If is flat, however, the manifold convolution is proportional to the standard convolution on . Drawing motivation from this, let us set

 wij=f(β⊥ij)(f∗gi)(β⊥ij), (14)

where is given by

 gi(β)∝exp(−12βTCiβ), (15)

and denotes the convolution on :

 (f∗g)(β)=∫dβ′f(β′)g(β−β′). (16)

The numerator of (14) is an approximation to (1), as it is evaluated at and not . Even if this is not the case, the weights are still only approximate for curved manifolds, as they do not use the actual probability density for . Regardless, the fact that (14) does not depend on is the essence of the proposed method, since only calls to and its derivative(s) are used to generate a set of weighted samples:

 S:={(θij,wij)|i=1,2,…,n,j=0,1,…,m}, (17)

where we have included with evaluated at . If is flat, then for all , and can be taken as constant for all . It follows that the distribution of the set is proportional to

 q(θij):=((f∗gi)∘α)(θij), (18)

with by construction. Sampling error will be present in if (1) is supported on regions of high curvature; however, in the limit of high compactness as for all , we have and for all , such that the distribution of the base chain is recovered (although samples will be uninformative).

### 2.2 Manifold-restricted Gaussian

The proposed method is applicable in principle to the manifold restriction of any continuous multivariate probability distribution on Euclidean space (see Kotz et al. (2004) for examples), provided the ambient density in (1) is such that its convolution with in (15) can be derived or approximated analytically. Strategies for obtaining

might include inverting the product of their characteristic functions [

Shephard (1991)], or adopting the formal approach used for path integrals in quantum field theory if is near-Gaussian [Peskin and Schroeder (1995)]. However, further treatment of the ambient density in full generality is beyond the scope of the present work.

For the rest of this paper, we will consider only the case where is proportional to the probability density of a Gaussian distribution on , i.e. we set

 f(β)∝exp(−12[β−β∗]TΣ−1[β−β∗]). (19)

This is not an overly limiting assumption, since manifold-restricted Gaussians appear in a broad range of practical applications. For example, might be a Bayesian likelihood function where is some modeled observable with additive Gaussian noise ; samples from can then be used to map out the likelihood hypersurface for , and to find the maximum-likelihood estimate . In Section 3.5, we will touch on reparametrization as a possible avenue towards generalizing the method for exponential families of distributions [Brown et al. (1986)].

As is the case with Gaussians, the density (19) has several desirable properties. For one, the derivatives of itself are quite tractable; in particular, the stationary points of occur where the vector is orthogonal to . (From , we see that .) The Fisher information matrix for also takes on the simple form

 Γ:=[E[(∂μlnp)(∂μ′lnp)]]=JTΣ−1J. (20)

Finally, and most crucially, the convolution of the two Gaussians and is trivial:

 βij∼N(β∗,Σ+C−1i). (21)

Using (21), the weights (14) simplify to

 wij=Niexp(−12[β⊥ij−β∗]TM−1i[β⊥ij−β∗]), (22)

with and given by

 Mi=Σ+ΣCiΣ, (23)
 Ni=√det(Σ+C−1i), (24)

where (23) follows from Hua’s identity [Artin (1988)], and common prefactors in each have been dropped from (24).

We may now construct a compactness matrix that depends on the minimal length scales describing (i) the ambient density , (ii) the normal curvature of the manifold , and (iii) the extent of the sample space (or some sampling region in

). The square of the first is simply the smallest eigenvalue

of the ambient covariance , while the reciprocal of the second is the largest eigenvalue of a curvature-related matrix

 K:=QTFIIQ. (25)

Here defines a change of basis such that , and is given via the standard eigenvalue decomposition

 F−1I=UDUT, (26)
 Q=UD1/2, (27)

with orthogonal and diagonal, positive-definite .

The eigenvalues of are the unsigned norms of with respect to some tangent space eigenbasis, and reduce by design to more familiar expressions of curvature in various special cases. When , is a curve in Euclidean space, and the single eigenvalue is its curvature , with chosen such that [Struik (1961)]. In the case of a surface in with positive curvature (both eigenvalues have the same sign), has the same determinant and eigenvalues as the Weingarten map, i.e. the Gaussian and principal curvatures respectively [Struik (1961)]. More generally, is proportional to an upper bound for the maximal eigenvalue of (5) with a signed norm (assuming the condition number of varies slowly across the manifold):

 κ′/ι2max≤uTF′IIu≤vTFIIv≤wTFIIw≤κ/ι2min, (28)

where is the condition number of , is (5) evaluated with a signed norm,

are unit eigenvectors of

respectively, and the unit vector has components . In other words, an appropriately scaled is conservative in regions of negative curvature, as it might overestimate the maximal principal curvature when the local geometry of the manifold is hyperbolic.

For simplicity, let us assume a hyperrectangular sampling region in with sides of half-length . These determine the associated matrix

 L:=diag(ℓ−21,…,ℓ−2s), (29)

its pushforward by :

 Λ:=(J+)TLJ+, (30)

and the largest eigenvalue of . A practical choice for the compactness is then , with given by

 ci=1ϵ(max{λ2i,κiσ}), (31)

where the dimensionless sampling parameter is tuned at runtime, and .

The term in (31) is derived by constraining the curvature associated with the length scale under projection and pullback, i.e. by requiring that the largest distance between corresponding points on and is proportional to the smallest covariance length of the ambient density (which is constant). Consequently, it might be useful to think of as an approximate upper bound on the error , for at one (mini-distribution) sigma. On the other hand, the metric-only term safeguards against cases where sampling is confined to regions in which the Fisher information is small; it is typically less stringent than the curvature term, but might be useful if the Hessian is not readily available (since the metric contributes to and can be strongly correlated with the normal curvature). The term is also crucial for the example in Section 3.4, where has vanishing normal curvature but the metric on has singular points.

It is convenient to cast (19

) in “whitened” form via the linear transformation

(where for some preferred choice of [Kessy et al. (2018))], such that the length scales of the ambient density are encoded in the transformed embedding. The compactness (31) and the Gaussian weights (22) reduce respectively to

 ci=1ϵ(max{λ2i,κi}), (32)
 wij=(1+cici)d2exp⎛⎜⎝−12∣∣β⊥ij−β∗∣∣2(1+ci)⎞⎟⎠. (33)

Note that the projection of a mini-distribution point is then conceptually equivalent to directly drawing a Gaussian sample with mean and covariance , but with a covariance scaling of such that its pushforward lies at a controlled distance from the manifold .

### 2.3 Algorithmic summary

Starting with a base chain of samples from some manifold-restricted Gaussian distribution, the proposed method generates a set of weighted samples from the same distribution, where is a desired multiplier for the sampling resolution. Any standard algorithm (e.g. Metropolis–Hastings [Hastings (1970)]) may be employed to draw the base samples from the target density , and it will also provide the corresponding points on the manifold as a by-product of computing . The first derivatives (and the second derivatives , if available) can be evaluated either concurrently with or post hoc; in the former case, they might be used to propose new base points in the standard algorithm (e.g. by taking the inverse Fisher matrix as the proposal covariance, or by adding a deterministic drift term as in manifold MALA [Girolami and Calderhead (2011)]).

Let us assume that the Gaussian density (19) is in whitened form such that , and that the base chain is obtained through a derivative-free sampling algorithm. For each pair in the base chain, we:

1. Evaluate the Jacobian (and the Hessian ).

2. Compute the metric , the projection and the pseudoinverse from Eqs (2) and (7)–(8).

3. Compute the term in (32) from Eqs (29)–(30). (Also compute the second fundamental form from Eqs (5)–(6), and the term in (32) from Eqs (25)–(27).) Determine the compactness .

4. Draw mini-distribution samples from Eq. (4).

5. Compute from Eqs (11)–(12) the projected pair for each .

6. Compute from Eq. (33) the weight for each , including for .

The proposed method is efficient when the computational complexity of each iteration is dominated by the evaluations of and its derivative(s), and particularly when these evaluations are of comparable cost. In such cases, Step 1 is the rate-determining step of each iteration; the method then has the same complexity as standard derivative-based algorithms (and is ideally used in conjunction with them). The remaining steps are straightforward tensor operations, but scale with some power of the ambient-space dimensionality , which might be much higher than the dimensionality of the manifold. Step 3 involves eigenvalue decompositions and is thus an operation (if ), while the projection-related Steps 2 and 5 are .

It is worth emphasizing that while the proposed method might improve local sample convergence for a near-flat manifold, it is not designed to address global convergence; indeed, the assumption throughout Sections 2.1 and 2.2 is that the distribution of the base chain is already proportional to the target density. As a result, the performance of the method is largely independent of , and is the sole tuning parameter. If is set small, convergence of the weighted samples is guaranteed as , but mixing will be slow for a base chain with low resolution. Conversely, large results in rapid mixing at the potential cost of reduced sampling accuracy. A conservative strategy is to choose for a segment of the base chain; this can then be increased if the mini-distributions are too tightly clustered around their base points, or decreased if the error for an examined sample is over some threshold. In practice, since acts as an overall scale for the compactness, computing the curvature term is less important for a manifold with a nearly constant Hessian in the region of interest (). Furthermore, if the Jacobian also varies slowly across the sampling region, we may simply take and do away with Step 3 altogether.

One additional step is required if the target distribution has support at the boundary of the sampling region (e.g. the examples in Sections 3.2 and 3.3

). In this case, a base sample near the boundary is likely to produce some projected points that fall outside the sampling region. Simply discarding such points will give rise to a deficit of samples in that mini-distribution, which will in turn skew the overall distribution; hence it is necessary to replace these points or reweight the rest of the mini-distribution. The most direct solution is to replace any affected point

with a copy of its associated base sample . This might potentially reduce sampling resolution near the boundary, but is a valid approach that preserves the overall distribution, and is also straightforward to do during runtime or in post-processing.

## 3 Examples

The efficacy of the proposed method is now demonstrated through a number of heuristic examples: a parabola (Section 3.1); a family of Klein bottles (Section 3.2); a space of gravitational-wave signals (Section 3.3); a reparametrized parabola (Section 3.4

); and a reparametrized beta distribution (Section

3.5). High-resolution histograms are constructed for each set of samples and its target distribution (or an accurate reference chain). As a standard metric for the concurrence of two histograms with identical binning, we will use the Hellinger distance between two discrete probability distributions . This is given by [Pollard (2002)]

 d2H=12∑i(√pi−√qi)2, (34)

with the maximum distance achieved when there is no bin such that . The Hellinger distance comes with ease of interpretation as it is symmetric and satisfies . Note that any statistical distance between histograms will naturally be dependent on binning, and hence less meaningful for comparison across problems.

### 3.1 Parabola

We begin with a simple but instructive example: a two-dimensional Gaussian restricted to a parabola in the plane, such that . With Cartesian coordinates on , we set

 α(x)=(x,x2), (35)
 β∗=(1,2) (36)

in (1) and (19). We also take here (and in all of the other examples), such that the target density is

 p(x)∝exp(−12(x4−3x2−2x+5)). (37)

The support of (37) lies comfortably within the interval , which we use as our sampling region; the metric and curvature eigenvalues in (32) then satisfy

 λ2i=19(1+4x2i)<2(1+4x2i)32=κi (38)

for all , such that . Note that as the second derivative of (35) is constant, the -dependence in the curvature is due to the changing metric, and so is correlated with (and can be used in lieu of) .

A set of weighted samples is generated from a base chain of length and mini-distributions of size ; their pushforward points on the tangent bundle are displayed in the top plot of Figure 1, together with , the base points , and the manifold itself. In the bottom plot, the empirical probability density of the samples is represented by a histogram with bins, and is overlaid with the (normalized) target density (37) for comparison. The Hellinger distance between the distribution of the samples and the target distribution is .

Figure 1 uses a tuning parameter of , which corresponds to about of projected points falling within a distance of from the manifold (see discussion after (31)). Although this error can be reduced with a choice of smaller , mixing will become poor in the tail of the distribution, due to the sparsity of the base chain and the high curvature in that region. Some experimentation is needed to choose for a balance between mixing and accuracy, but finding the optimal range generally does not require fine-tuning. Figure 2 shows the extent of changes to the histogram distribution that result from varying across two orders of magnitude. The Hellinger distance from the target distribution remains small at ; in comparison, the base chain has .

### 3.2 Klein bottles

Let us now consider a slightly more involved example. Our starting point is the classical embedding of a Klein bottle into [do Carmo (1992)]; we may promote this to a family of variable-size Klein bottles in by setting

 α1(r,ψ,ϕ) =12(1+rcosψ)cosϕ, α2(r,ψ,ϕ) =12(1+rcosψ)sinϕ, α3(r,ψ,ϕ) =r2sinψcosϕ2, α4(r,ψ,ϕ) =r2sinψsinϕ2, α5(r,ψ,ϕ) =0, (39)

and applying an arbitrary rotation to bring it out of the subspace. The sampling region is taken to be , which gives an orientable three-manifold with boundary. (This is then not a Klein-like manifold at all, but we are really more interested in the curvature induced by the embedding (3.2).) Finally, we choose

 β∗=Rα(3,π4,π2). (40)

The manifold has non-elliptic geometry throughout, and its matrix from (25) has rank two. As the maximal curvature eigenvalue is greater than the metric eigenvalues across the sampling region, we once again have . Three Metropolis–Hastings chains are generated for this example: a base chain of length , a base chain of length (by extending ), and a reference chain of length (by extending ). Mini-distributions of size are used for respectively, such that the corresponding projected sets also have samples each. The tuning parameter is for both sets of samples. Any projected points that fall outside the sampling region are replaced with their base points, as suggested in Section 2.3.

Probability density contours and curves (i.e. marginalized over some combination of one or two parameters) are computed for and the reference chain, and are consolidated in Figure 3. The Hellinger distance from the reference chain is larger for than it is for ; with bins per parameter (as used in the figure), it is compared to . This improved convergence is quite clear from visual inspection of the plots. Although it is not presented, the distribution obtained by combining and has a further reduced distance of , as expected. Figure 3 also includes contours for the base chains , which are severely under-resolved and require a smoothing kernel for visualization. The Hellinger distances for are and respectively. (Note that for this example is inflated by the large number of bins. If bins per parameter are used, the values fall to for , for the combined distribution, and for .)

### 3.3 Gravitational-wave signal space

For an example of a “real-world” application where the map does not admit a closed-form expression (or is analytically nontrivial), we turn to the burgeoning field of gravitational-wave astronomy [Abbott et al. (2016a, 2017)]. A Bayesian framework is used to conduct astrophysical inference on source signals buried in noisy detector data; in this setting, the likelihood function takes the standard form [Cutler and Flanagan (1994); Abbott et al. (2016b)]

 p(θ)∝exp(−12⟨h(θ)−d∗|h(θ)−d∗⟩), (41)

where is a parametrized waveform model for the source signal, is data comprising some “true” signal and detector noise , and is a noise-weighted inner product that incorporates a power spectral density model for (assumed to be Gaussian and additive).

Eq. (41) is manifestly a Gaussian probability density restricted to the waveform manifold , and may be cast in whitened form such that reduces to the Euclidean inner product on . However, the computation of waveform derivatives must typically be done numerically, and is compounded by the fact that

for most signals in a time- or frequency-series representation. Both of these difficulties are alleviated in recent work that uses deep-learning techniques to construct fully analytic interpolants for waveforms in a reduced-basis representation [

Field et al. (2011); Chua et al. (2018)]. Fast and reliable waveform derivatives are a key feature of this approach, in which (41) simplifies to

 p(θ)∝exp(−12|α(θ)−β∗|2), (42)

where are obtained by projecting onto a compact () basis for .

We now apply the proposed method to the target density (42), where is a reduced-representation interpolant for a two-parameter waveform model [Arun et al. (2009)]. This simple model describes the signal from the late inspiral stage of a non-spinning black-hole binary, with component masses in the interval ; more realistic waveform models generally have . The ambient-space dimensionality for this example is , while the sampling region (in the parametrization of chirp mass and symmetric mass ratio ) is confined to . We assume that the data contains a weak signal (to exaggerate the features in (42)), that the signal parameters lie at the centroid of the sampling region, and that the projection of the detector noise onto the reduced basis vanishes, i.e.

 β∗=α(8.1M⊙,0.18). (43)

The Metropolis–Hastings algorithm is used to draw base samples in proportion to the density (42). Both and in (32) turn out to vary by less than a factor of five over all points in the base chain. It is instructive to examine the robustness of the method when the compactness is approximated as constant in such cases, even though the second derivative of is available. Setting and , we compare the projected samples against a Metropolis–Hastings reference chain of the same length. With 30 bins per parameter, the Hellinger distance from the reference chain is for the projected samples, and for the base chain. The visual agreement between the projected samples and the reference chain (especially in the high-resolution histograms of Figure 4) is also serviceable for most intents and purposes, notwithstanding the constant compactness and the increased complexity of the probability surface.

### 3.4 Parabola redux

It is tempting to investigate, at least for simple problems, whether a map can be constructed such that (i) the image manifold has vanishing normal curvature, but (ii) the target density on is preserved. As a cost of fulfilling these two conditions, we will give up the requirement that is an embedding or even an immersion, i.e. the pushforward of is no longer assumed to be injective for all . Then the sample-space manifold is not generally Riemannian, as the pullback metric might not be positive-definite and can vanish at certain singular points on . From a practical standpoint, however, a reparametrization of the problem with vanishing allows the compactness to be computed from the metric term alone, and the existence of singular points is not necessarily an issue if the set of all such points has Lebesgue measure zero.

Let us return to the parabola-restricted Gaussian in Section 3.1, and find some alternative such that in (32) and has the same functional form as (37). One possible choice is

 α(ξ)=(√ξ4−3ξ2−2ξ+5,0), (44)
 β∗=(0,0). (45)

The image of the map (44) is then a ray on the -axis:

 α1(ξ)∈[12√11−6√3,∞), (46)

while the single metric component

 FI(ξ)=J11(ξ)2=(2ξ3−3ξ−1)2α(ξ)2 (47)

vanishes at . With as before, in (32) is given by

 λ2i=19FI(ξi). (48)

As seen from the top plot of Figure 5, the singular points are precisely the stationary points of , and the compactness diverges at the corresponding points , such that for base points near . This high compactness compensates for the wide dispersion of pullback points due to the vanishing metric. Note that for the singular point corresponding to the global minimum of , the -coordinate of a projected point near can fall outside the interval in (46), but both the pullback point and the weight remain well defined in such cases. The metric-only compactness is also less effective near this singular point since the metric changes rapidly there, and hence a deficit of samples in that region might occur for a sparse base chain (e.g. with as in Section 3.1). Regardless, the bottom plot of Figure 5 shows that the proposed method works as expected for a longer base chain (with , , and ); using a bin width of , the Hellinger distance from the target distribution is for the base chain, and decreases to for the projected samples.

### 3.5 Beta distribution

As a final example, let us consider a broader class of target densities that are not manifestly restricted Gaussians, but might be cast as such through the reparametrization approach taken in Section 3.4. We begin, rather ambitiously, with a probability distribution from a general exponential family [Brown et al. (1986)]. This covers any distribution whose density function can be written as

 p(θ)=exp(T(θ)+V(θ)TW(ζ)+Z(ζ)), (49)

where are scalar-valued, are vector-valued with some dimensionality , and the vector parametrizes the particular exponential family defined by the functional forms of . For fixed , we may cast (49) in the restricted-Gaussian form , through a crude but direct choice of (with ):

 α1(θ) =√−2T(θ), α2(θ) =√−2W1V1(θ), ⋮ αd(θ) =√−2WkVk(θ), (50)

where . By definition, we have for a full exponential family and for a curved one, such that as required. It is clear that such a map will in general be complex-valued (which then requires lifting the whole framework to complex space), and might also give rise to an induced metric with pathologies (e.g. vanishing or divergent points, non-smoothness, etc.)

Many common probability distributions are exponential families; continuous examples include the normal, gamma and chi-squared distributions. Our focus in this section will be restricted to the beta distribution, which is defined on the interval

and has two positive shape parameters . Following the map (3.5), we set

 α1(ξ) =√2ln(ξ(1−ξ)), α2(ξ) =√−2alnξ, α3(ξ) =√−2bln(1−ξ), (51)

such that the target density is (as desired)

 p(ξ)∝ξa−1(1−ξ)b−1. (52)

Observe that is imaginary and are real for all , with singular points only at the boundary. We may then set and replace the Euclidean inner product with the Hermitian form throughout, but continue treating the ambient space as instead of (by associating the imaginary -axes with the real line). This keeps the framework in Section 2 largely unchanged, without having to promote the normal densities (15) and (19) to their complex counterparts.

With these adjustments, the proposed method is used to draw projected samples from two qualitatively different beta distributions: one with shape parameters , and the other with . Each set of samples is generated from a base chain of length and mini-distributions of size . The tuning parameter is for , and for . Figure 6 shows the histogram distributions for both sets of projected samples, with the corresponding target and base distributions overlaid for comparison (all histograms have a bin width of ). For , the Hellinger distance from the target distribution is for the base chain, and falls to for the projected samples; for , it decreases from to . Although the projected samples appear to be slightly under-mixed (i.e. over-correlated with their base chains), cannot be raised much further in this example without a significant penalty to sampling accuracy, and so a longer base chain is necessary if increased smoothing is desired.

## 4 Conclusion

In this paper, I have proposed a derivative-based framework for sampling approximately and efficiently from continuous probability distributions that admit a density of the form (1). These are essentially standard closed-form distributions on Euclidean space, but restricted to a parametrized submanifold of their domain. Much of the present work deals with the specific but important case where the ambient-space distribution is a multivariate Gaussian. Examples in Sections 3.4 and 3.5, while somewhat academic, indicate that reparametrization might be a viable strategy for generalizing the existing results to a broader class of distributions (and their manifold restrictions, by extension).

The envisioned utility of the proposed method lies primarily in facilitating density estimation for Bayesian inference problems. It has a multiplicative effect on the sampling resolution of any standard algorithm that might be used in such applications, and is particularly synergistic with derivative-based samplers. Only the first derivative is required for reconstruction of the tangent bundle, although second-derivative information can be incorporated naturally and to good effect. The method is robust, with only a single tuning parameter that sets the overall scale of the mini-distributions. Its end product is a high-resolution set of weighted samples that can be used to build histograms or, more commonly, to enable kernel density estimation with a smaller optimal bandwidth.

## Acknowledgments

For pivotal discussions, as well as his scrutiny of the manuscript, I am indebted to Michele Vallisneri. I am also grateful for Christopher Moore’s insight and detailed comments on the method. Final thanks go out to Chad Galley, Jonathan Gair and Leo Stein for helpful conversations and suggestions. This work was supported by the Jet Propulsion Laboratory (JPL) Research and Technology Development program, and was carried out at JPL, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. © 2018 California Institute of Technology. U.S. Government sponsorship acknowledged.

## References

• Abbott et al. (2016a) Abbott, B. P. et al. (2016a, Feb). Observation of gravitational waves from a binary black hole merger. Phys. Rev. Lett. 116, 061102.
• Abbott et al. (2016b) Abbott, B. P. et al. (2016b, Jun). Properties of the binary black hole merger gw150914. Phys. Rev. Lett. 116, 241102.
• Abbott et al. (2017) Abbott, B. P. et al. (2017, Oct). Gw170817: Observation of gravitational waves from a binary neutron star inspiral. Phys. Rev. Lett. 119, 161101.
• Amari and Nagaoka (2007) Amari, S. and H. Nagaoka (2007). Methods of Information Geometry. Translations of mathematical monographs. American Mathematical Society.
• Artin (1988) Artin, E. (1988). Geometric Algebra. Interscience tracts in pure and applied mathematics. Wiley.
• Arun et al. (2009) Arun, K. G., A. Buonanno, G. Faye, and E. Ochsner (2009, May). Higher-order spin effects in the amplitude and phase of gravitational waveforms emitted by inspiraling compact binaries: Ready-to-use gravitational waveforms. Phys. Rev. D 79, 104023.
• Brown et al. (1986) Brown, L., I. of Mathematical Statistics, and J. (Organization) (1986). Fundamentals of Statistical Exponential Families: With Applications in Statistical Decision Theory. IMS Lecture Notes. Institute of Mathematical Statistics.
• Chua et al. (2018) Chua, A. J. K., C. R. Galley, and M. Vallisneri (2018).

ROMAN: Reduced-Order Modeling with Artificial Neurons.

In review.
• Cutler and Flanagan (1994) Cutler, C. and E. E. Flanagan (1994, Mar). Gravitational waves from merging compact binaries: How accurately can one extract the binary’s parameters from the inspiral waveform? Phys. Rev. D 49, 2658–2697.
• do Carmo (1992) do Carmo, M. (1992). Riemannian Geometry. Mathematics (Boston, Mass.). Birkhäuser.
• Duane et al. (1987) Duane, S., A. D. Kennedy, B. J. Pendleton, and D. Roweth (1987). Hybrid monte carlo. Physics Letters B 195(2), 216 – 222.
• Field et al. (2011) Field, S. E., C. R. Galley, F. Herrmann, J. S. Hesthaven, E. Ochsner, and M. Tiglio (2011, Jun). Reduced basis catalogs for gravitational wave templates. Phys. Rev. Lett. 106, 221102.
• Gamerman and Lopes (2006) Gamerman, D. and H. Lopes (2006). Markov Chain Monte Carlo: Stochastic Simulation for Bayesian Inference, Second Edition. Chapman & Hall/CRC Texts in Statistical Science. Taylor & Francis.
• Gelman et al. (2013) Gelman, A., J. Carlin, H. Stern, D. Dunson, A. Vehtari, and D. Rubin (2013). Bayesian Data Analysis, Third Edition. Chapman & Hall/CRC Texts in Statistical Science. CRC Press.
• Girolami and Calderhead (2011) Girolami, M. and B. Calderhead (2011). Riemann manifold langevin and hamiltonian monte carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 73(2), 123–214.
• Hastings (1970) Hastings, W. K. (1970). Monte carlo sampling methods using markov chains and their applications. Biometrika 57(1), 97–109.
• Hjort et al. (2010) Hjort, N., C. Holmes, P. Müller, and S. Walker (2010). Bayesian Nonparametrics. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
• Kessy et al. (2018) Kessy, A., A. Lewin, and K. Strimmer (2018). Optimal whitening and decorrelation. The American Statistician 0(0), 1–6.
• Kotz et al. (2004) Kotz, S., N. Balakrishnan, and N. Johnson (2004). Continuous Multivariate Distributions, Volume 1: Models and Applications. Continuous Multivariate Distributions. Wiley.
• Lee and Verleysen (2007) Lee, J. and M. Verleysen (2007). Nonlinear Dimensionality Reduction. Information Science and Statistics. Springer New York.
• Ma et al. (2015) Ma, Y.-A., T. Chen, and E. B. Fox (2015). A complete recipe for stochastic gradient mcmc. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS’15, Cambridge, MA, USA, pp. 2917–2925. MIT Press.
• Meyer (2000) Meyer, C. (2000). Matrix Analysis and Applied Linear Algebra:. Other Titles in Applied Mathematics. Society for Industrial and Applied Mathematics.
• Peskin and Schroeder (1995) Peskin, M. and D. Schroeder (1995). An Introduction To Quantum Field Theory. Frontiers in Physics. Avalon Publishing.
• Pollard (2002) Pollard, D. (2002). A User’s Guide to Measure Theoretic Probability. Cambridge Series in Statistical and Probabilistic Mathematics. Cambridge University Press.
• Ripley (1987) Ripley, B. (1987). Stochastic simulation. Wiley Series in Probability and Statistics. J. Wiley.
• Robert and Casella (2005) Robert, C. and G. Casella (2005). Monte Carlo Statistical Methods. Springer Texts in Statistics. Springer New York.
• Roberts and Rosenthal (1998) Roberts, G. O. and J. S. Rosenthal (1998). Optimal scaling of discrete approximations to langevin diffusions. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 60(1), 255–268.
• Shephard (1991) Shephard, N. (1991). From characteristic function to distribution function: A simple framework for the theory. Econometric Theory 7(4), 519–529.
• Spivak (1970) Spivak, M. (1970). A comprehensive introduction to differential geometry. Number v. 1 in A Comprehensive Introduction to Differential Geometry. Brandeis University.
• Struik (1961) Struik, D. (1961). Lectures on Classical Differential Geometry. Addison-Wesley series in mathematics. Addison-Wesley Publishing Company.