Adaptive Dimension Reduction to Accelerate Infinite-Dimensional Geometric Markov Chain Monte Carlo

07/15/2018 ∙ by Shiwei Lan, et al. ∙ University of Illinois at Urbana-Champaign 0

Bayesian inverse problems highly rely on efficient and effective inference methods for uncertainty quantification (UQ). Infinite-dimensional MCMC algorithms, directly defined on function spaces, are robust under refinement of physical models. Recent development of this class of algorithms has started to incorporate the geometry of the posterior informed by data so that they are capable of exploring complex probability structures. However, the required geometric quantities are usually expensive to obtain in high dimensions. On the other hand, most geometric information of the unknown parameter space in this setting is concentrated in an intrinsic finite-dimensional subspace. To mitigate the computational intensity and scale up the applications of infinite-dimensional geometric MCMC (∞-GMC), we apply geometry-informed algorithms to the intrinsic subspace to probe its complex structure, and simpler methods like preconditioned Crank-Nicolson (pCN) to its geometry-flat complementary subspace. In this work, we take advantage of dimension reduction techniques to accelerate the original ∞-GMC algorithms. More specifically, partial spectral decomposition of the (prior or posterior) covariance operator is used to identify certain number of principal eigen-directions as a basis for the intrinsic subspace. The combination of dimension-independent algorithms, geometric information, and dimension reduction yields more efficient implementation, (adaptive) dimension-reduced infinite-dimensional geometric MCMC. With a small amount of computational overhead, we can achieve over 70 times speed-up compared to pCN using a simulated elliptic inverse problem and an inverse problem involving turbulent combustion. A number of error bounds comparing various MCMC proposals are presented to predict the asymptotic behavior of the proposed dimension-reduced algorithms.



There are no comments yet.


page 12

page 13

page 14

page 16

page 18

page 24

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Sampling from posterior distributions in the context of PDE-constrained inverse problems is typically a challenging task due to the high-dimensionality of the target, the non-Gaussianity of the posterior and the intensive computation of repeated PDE solutions for evaluating the likelihood function at different parameters. Traditional Metropolis-Hastings algorithms are characterized by deteriorating mixing times upon mesh-refinement in the finite-dimensional projection of parameter . This has prompted the recent development of a class of ‘dimension-independent’ MCMC methods (1; 2; 3; 4; 5; 6; 7; 8; 9; 10) that overcome this deficiency. Compared to traditional Metropolis-Hastings defined on the finite-dimensional space, the new algorithms are well-defined on the infinite-dimensional Hilbert space, thus yield the important computational benefit of mesh-independent mixing times for the practical finite-dimensional algorithms run on the computer.

Among those works, (5; 8; 10) incorporate geometry of the posterior informed by data to empower the MCMC to become more capable of exploring complicated distributions that deviate significantly from Gaussian. In particular, infinite-dimensional geometric MCMC (-GMC) (10) put a series of ‘dimension-independent’ MCMC algorithms in the context of increasingly adopting geometry (gradient, Hessian). With the help of such geometric information, (10) show that with the prior-based splitting strategy,

-GMC algorithms can achieve up to two orders of magnitude speed up in sampling efficiency compared to vanilla pCN. However, fully computing the required geometric quantities is prohibitive in the discretized parameter space with thousands of dimensions. Therefore, it is natural to consider approximations to the gradient vector and Hessian (Fisher) matrix and compute them in a subspace with reduced dimensions. The key to the dimension reduction in this setting is to identify an intrinsic low-dimensional subspace and apply geometric methods to effectively explore its complex structure; while simpler methods can be used on its complementary subspace with larger step sizes.

(8) seek the intrinsic subspace, known as likelihood-informed subspace (LIS) (11), by detecting the eigen-subspace of some globalized Hessian; (9) investigate the active subspace (AS) (12; 13) by probing the principal eigen-directions of a prior-averaged empirical Fisher matrix. More recently, (14) follow the same spirit to exploit the low-dimensional structure, in which the posterior changes the most from the prior. Their approach is based on approximating the likelihood function with a ridge function that depends non-trivially only on a few linear combinations of the parameters. Such ridge approximation is obtained by minimizing an upper bound on the Kullback-Leibler distance between the posterior distribution and its approximation.

In this paper, we propose dimension reduction directly based on partial (generalized) spectral decomposition of the prior covariance or the covariance of local Gaussian approximation to the posterior (GAP). The intrinsic low-dimensional subspace is identified by leading eigen-functions, which can be efficiently obtained by randomized linear algebraic algorithms (15; 16; 17). Unlike (8), the posterior covariance projected in the subspace is not empirically updated, but rather approximated in a diagonal form which can still capture the most variation of the projected posterior. The resulting GAP covariance has a low-rank structure. Such approximation can be either adopted position-wise or adapted towards a global LIS within the burn-in stage of MCMC. The latter yields a much simpler yet comparable or even more efficient MCMC algorithm compared to dimension-independent likelihood Informed MCMC (DILI) (8). The former can demonstrate advantage in sampling complicated posteriors where a globalized pre-conditioner in DILI does not work well universally. We apply the same dimension reduction to ‘Hamiltonian Monte Carlo (HMC)’ type algorithms so that they can further suppress the diffusive behavior of ‘MALA’ type algorithms and generalize DILI. We also provide theoretical bounds for comparing dimension-reduced MCMC proposals with their full versions to help understand their asymptotic behavior as well as their difference.

The contributions of this paper are multi-fold. First, we accelerate the original -GMC methods with dimension reduction techniques to scale their applications in PDE constrained inverse problems up to thousands of dimensions. Second, based directly on partial spectral decomposition, we propose more efficient methods that simplify and generalize DILI (8). We also establish interesting connections between our adaptive algorithm and DILI. Third, we derive theoretic bounds comparing several dimension-independent MCMC proposals to describe their asymptotic behavior. Lastly, we demonstrate the numerical advantage of our proposed algorithms in the high-dimensional setting by over 70 times speed-up compared to pCN method using a simulated elliptic inverse problem and an inverse problem of turbulent jet.

The rest of the paper is organized as follows. Section 2 reviews the background of Bayesian inverse problems, infinite-dimensional geometric MCMC (-GMC) (10) and dimension-independent likelihood Informed MCMC (DILI) (8). Section 3 describes the details of dimension reduction we adopt, based on the prior and the GAP posterior respectively. Section 4 applies these dimension reduction techniques to -GMC to achieve acceleration, establishes the validity of the proposed methods, and also provides error bounds for comparing various algorithms. In Section 5 we show the numerical advantage of our algorithms using a simulated elliptic inverse problem and an inverse problem involving turbulent combustion. Finally we make some discussion and conclude with a few future directions in Section 6.

2 Review of Background

2.1 Bayesian Inverse Problems

In Bayesian inverse problems, the objective is to identify an unknown parameter function which is assumed to be in a separable Hilbert space . Given finite-dimensional observations , for , is connected to through the following mapping:

where is the forward operator that maps unknown parameter onto the data space , and is the distribution of noise . If we assume the density of noise distribution, still denoted as , to exist with respect to the Lebesgue measure, then we can define the negative log-likelihood, a.k.a. potential function, as:

with indicating the density function for a given . The noise distribution could be simple, but the forward mapping is usually non-linear thus the potential function can be complicated and computationally expensive to evaluate. For example, if we assume Gaussian noise , for some symmetric, positive-definite , then the potential function can be written as


where we have considered the scaled inner product .

In the Bayesian setting, a prior measure is imposed on . In this paper we assume a Gaussian prior with the covariance being a positive, self-adjoint and trace-class operator on . Now we can get the posterior of , denoted as

, using Bayes’ theorem

(18; 19):

Notice that the posterior can exhibit strongly non-Gaussian behavior, with finite-dimensional projections having complex non-elliptic contours.

For simplicity we drop from terms involved, so we denote the posterior as and the potential function as . For the target and many proposal kernels in the sequel, we define the bivariate law:

Following the theory of Metropolis-Hastings on general spaces (20), the acceptance probability is non-trivial when

where denotes mutual absolute continuity, that is, and . The acceptance probability is:


where denotes the minimum of . We first review -GMC (10).

2.2 Infinite-Dimensional Geometric MCMC (-Gmc)

We start with the preconditioned Crank-Nicolson (pCN) method (21; 1; 3), whose proposal does not use any data information. It modifies standard random-walk Metropolis (RWM) to make a proposal movement from the current position towards a random point, with its size controlled by a free parameter :

PCN is well-defined on the Hilbert space with the proposal that preserves the prior when , whereas standard RWM can only be defined on finite-dimensional discretization and has diminishing acceptance probability for fixed step-size and increasing resolution (22). Thus, pCN mixes faster than RWM in high-enough dimensions and the disparity in mixing rates becomes greater upon mesh-refinement (3).

One approach for developing data-informed methods is to take advantage of gradient information in a steepest-descent setting. Consider the Langevin SDE on , preconditioned by some operator :


with denoting the Fréchet derivative of and being the cylindrical Wiener process. If we let , the scales of these dynamics are tuned to the prior. In this setting, SDE (3) preserves the posterior and can be used as effective MCMC proposals (1; 3). (1) use a semi-implicit Euler scheme to discretize the above SDE and develop infinite-dimensional MALA (-MALA) with the following proposal for an algorithmic parameter and some small step-size :


Following (1), under the assumption that , -a.s. in , one can use Theorem 2.21 of (23) on translations of Gaussian measures on separable Hilbert spaces, to obtain the Radon-Nikodym derivative in the acceptance probability (2).

To further incorporate local geometric information of the target distribution, one can consider a location-specific pre-conditioner as the covariance of a local Gaussian approximation to the posterior, hence named Gaussian-approximate posterior (GAP) covariance, defined through


where can be chosen as Hessian, Gauss-Newton Hessian (GNH), or Fisher information operator. With the Gaussian likelihood (1), we note that they are connected as follows

In the following, unless stated otherwise, we will use GNH for . Then can be viewed as the Gauss-Newton Hessian approximation to the log-posterior. In general defines a Riemannian metric on the parameter space which thus can be viewed as a Riemannian manifold (24). Notice that for the resulting dynamics do not, in general, preserve the target as they omit the higher order (and computationally expensive) Christofell symbol terms, see e.g. (24) and the discussion in (25). However, the dynamics in (3) can still capture an important part of the local curvature structure of the target and can provide an effective balance between mixing and computational cost (24). (10) develop infinite-dimensional manifold MALA (-mMALA; the name comes from the fact that the Langevin SDE (3) is defined on the manifold .) with the following proposal obtained by the similar semi-implicit scheme as in (1)


for defined as in (4) and we have:


With the assumptions 3.1-3.3 in (10), one can use the Feldman-Hajek theorem (see e.g. Theorem 2.23 in (23)) to derive the acceptance probability (2). See more details in (10). It is interesting to notice that when (), -mMALA coincides with the stochastic Newton (SN) MCMC method (26), with .

One can generalize ‘MALA’ type algorithms to multiple steps. This is equivalent to investigating the following continuous-time Hamiltonian dynamics:


HMC algorithm (21) makes use of the Strang splitting scheme to develop the following Störmer-Verlet symplectic integrator (27), for as defined in (7):


Equation (9) gives rise to the leapfrog map . Given a time horizon and current position , the MCMC mechanism proceeds by concatenating steps of leapfrog map consecutively:

where denotes the projection onto the -argument. For , this yields infinite-dimensional HMC (-HMC) (2) with , and infinite-dimensional manifold HMC (-mHMC) (10) with respectively. The well-posedness of these ‘HMC’ type algorithms can be established under the same assumptions 3.1-3.3 of (10). The following diagram illustrates graphically the connections between the various algorithms.

2.3 Dimension-Independent Likelihood Informed MCMC (DILI)

Now we review the MCMC algorithm DILI (8), which is much relevant to our proposed methods. The idea of DILI is to separate a low-dimensional LIS where likelihood-informed methods are applied to make inhomogeneous proposal to exploit the posterior structure that deviates from the prior structure; while the complementary space can be efficiently explored with simpler prior-based methods.

Inspired by the low-rank approximations to the Hessian of log-posterior in (26; 28), DILI (8) obtains the intrinsic low-dimensional LIS by comparing the Hessian of log-likelihood with prior covariance to identify directions in parameter space along which the posterior distribution differs most strongly from the prior. DILI also uses the Langevin equation (3) preconditioned by an operator as the proposal kernel. However, strictly speaking, the preconditioner is not the same as the location-specific , but rather globalized to aggregate the local geometry informed by data. More specifically, DILI considers the following prior-preconditioned Gauss-Newton Hessian (ppGNH):

where under the assumption (1), coincides with Fisher metric.

ppGNH stems from the local Rayleigh ratio that quantifies how strongly the likelihood constrains variation in the direction relative to the prior, and can be converted to GNH w.r.t the whitened parameter

Therefore by transforming and applying on both sides it helps to simplify (3) with into


where and . Note the whitened variable has the prior , where the identity covariance operator is not a trace-class on . However, random draws from are square-integrable in the weighted space . (10) can still serve as a well-defined function space proposal for parameter after inverting the transformation.

The intrinsic low-dimensional subspace is obtained through a low-rank approximation of the globalized (expected) GNH . Suppose the operator has eigen-pairs on . Then by thresholding

largest eigenvalues one can define


Note provides the basis for the LIS and one can have the following decomposition for :

where is the projection of into LIS and lies in the complementary space dominated by the prior ; and they are independent under the approximated posterior . Therefore one can approximate the posterior covariance (for the parameter ) as follows


where is computed empirically and has eigendecomposition ; . We can associate the complement of in

with a set of eigenfunctions

. Define . By applying and respectively to (10) with replaced by in (12) we obtain the splitting proposal as follows:

where and ; and are algorithmic parameters to indicate whether (set to 1) or not (set to 0) to include gradient information in the intrinsic subspace and its complement respectively.

Finally, we apply the semi-implicit Euler scheme as in (1) to the above SDE to get the discrete proposal in and rewrite it to the following proposal in (See more details in (8) or in Section 4.1)


with the following parameters in the above equation:


In the next section, we will derive similar intrinsic subspaces and splitting proposals directly from partial spectral decomposition. Based on prior or GAP covariance operators, different dimension reduction strategies can be achieved, applicable to different scenarios depending on how informative the data are.

3 Dimension Reduction

We focus on dimension reduction through partial spectral decomposition. Suppose we have eigen-pairs of some operator (prior covariance or GAP covariance) with . The intrinsic low-dimensional subspace can be defined through principal eigen-functions, that is, . Let . Then we define the following generic projection operator , e.g. in (11):

For example, if we truncate on the -dimensional subspace

where is the restriction of on , then we can approximate by replacing with in (5). In this section we investigate two types of dimension reduction based on prior and likelihood respectively, with particular connection to DILI (8).

3.1 Prior-Based Dimension Reduction

Let , be the eigenvalues and eigenfunctions of the prior covariance operator such that , . Assume is a sequence of positive reals with (this enforces the trace-class condition for ), and is an orthonormal basis of . We make the usual correspondence between an element and its coordinates w.r.t. the basis , that is . Using the Karhunen-Loève expansion of a Gaussian measure (29; 30; 19) we have the representation:


Define the Sobolev spaces corresponding to the basis :

so that and if . Typically, we will have the following decay assumption:

Assumption 3.1.

for some in the sense that there exist constants such that for all .

Thus, the prior (so also the posterior) concentrate on for any . Notice also that: By thresholding largest eigenvalues , eigen-basis can also serve as a basis for low-dimensional subspace. One can define the projection based on the eigen-pairs . Such prior based dimension reduction can work well in a class of inverse problems, especially when they are prior dominated (See 10, for more details).

With the eigen-pairs , the prior covariance operator can be written and approximated as

with and defined by similarly as (11). Then we can approximate

where , and . By the Sherman-Morrison-Woodbury formula,

where . This implies . By applying and to (3) respectively and using the above approximations we get

where and .

Remark 3.2.

The eigen-pairs can be pre-computed/pre-specified. For each step we only need to calculate -dimensional matrix at , which can be efficiently obtained by Hessian action through adjoint methods in PDE.

3.2 Likelihood-Based Dimension Reduction

When the data are more informative, the above prior based dimension reduction may not perform well and thus we need reduction techniques that are likelihood based, like DILI (8) or AS (9). We could consider the following generalized eigen-problem , to find the eigen-pairs such that


which can be shown equivalent to the eigen-problem of ppGNH for eigen-pairs


with ; it can be written as where

For the convenience of exposition and comparison with DILI, we work with the whitened coordinates in the following. To simplify the notation, we also drop some dependence of where there is no ambiguity, but readers should be reminded that , and are defined and approximated point wise .

In the whitened coordinates , we can rewrite , the GNH of the log-posterior, as follows

which has a direct -dimensional low-rank approximation


Note, in the sense of using the low-rank approximation to the Hessian of log-posterior, this approach is more faithful to (26; 28) than DILI. Thus can be approximated using the Sherman-Morrison-Woodbury formula


where .

Remark 3.3.

With the approximation (22) applied to the following , we directly have


Therefore we can rewrite the approximation (22) analogous to the approximation (12) in DILI (8)

In this sense, the approximation (22) is consistent with the low-rank approximation by DILI. However, we have avoided the empirical computation of . And since is already in the diagonal form (23), we have also avoided the rotation in DILI (8).

By applying and to (10) respectively and using the approximation (22), we have the following proposal split on the low-dimensional subspace and its complement :


where and .

4 Dimension Reduced Algorithms

In this section we apply the dimension reduction techniques discussed in Section 3 to two -GMC algorithms, -mMALA and -mHMC. Since the prior-based dimension reduction has been implemented in (10), we focus on the likelihood-based dimension reduction. We derive new efficient algorithms and compare them to DILI.

4.1 Dimension-Reduced -mMALA

We can apply the similar semi-implicit Euler scheme as in (1) to (24), or equivalently use the approximation (22) in the following whitened proposal, which is a reformulation of manifold MALA proposal (6) by the transformation


for defined as in (4) and setting , and replacing with in (7) we have:


where , with and chosen to be or to indicate whether or not to include gradient information on the low-dimensional subspace and its complement respectively. This proposal (25) can be reformulated as follows


where , , and .

With the approximation (22), it is straightforward to verify


Substituting the approximations (22) (28) into (27) yields the following proposal


With and , the proposal (29) can be rewritten as

Remark 4.1.

If we choose as DILI (8), then the proposal (29) can further be simplified as


Comparing with the following operator-weighted proposal as in DILI (14) (also Equations (17) (36) of (8); note that when is already in the diagonal form.)