# Accelerating Metropolis-within-Gibbs sampler with localized computations of differential equations

Inverse problems are ubiquitous in science and engineering, and Bayesian methodologies are often used to infer the underlying parameters. For high-dimensional temporal-spatial models, classical Markov chain Monte Carlo (MCMC) methods are often slow to converge, and it is necessary to apply Metropolis-within-Gibbs (MwG) sampling on parameter blocks. However, the computation cost of each MwG iteration is typically O(n^2), where n is the model dimension. This can be too expensive in practice. This paper introduces a new reduced computation method that brings the cost down to O(n) for the inverse initial value problem of a stochastic differential equation (SDE) with local interactions. The key observation is that each MwG proposal differs from the current iterate at only one parameter block, and this difference propagates only within a local domain in the SDE computation. We can therefore approximate the global SDE computation with a surrogate updated only within that local domain, at reduced computation cost. Both theoretically and numerically, we show that the approximation error can be controlled by the local domain size. We discuss how to implement the local computation scheme using the Euler-Maruyama and 4th-order Runge-Kutta methods, and we numerically demonstrate the performance of the proposed method on the Lorenz 96 model and a linear stochastic flow model.


## 1 Introduction

### 1.1 Inverse problem and MCMC

Inverse problems are ubiquitous in various fields of science and engineering [24, 2, 11]. They concern how to infer model parameters from partial, delayed, and noisy observations. Typical examples include using measurements of seismic waves to determine the location of an earthquake epicenter, and recovering images that are as close to natural ones as possible from blurry observations [20]. The general formulation of an inverse problem can be written as

 y=h(x,ζ). (1)

Here x denotes the parameters to be estimated, y denotes the observation data, h is a physical model describing the data collection process, and ζ is the possible random factor involved.

Often, it is of interest to quantify the uncertainty of x, which can be used to infer estimation accuracy and regulate risk. The Bayesian approach is more appropriate for this purpose [11, 4, 22]. It involves modeling x with a prior distribution p0 and finding the observation distribution pl(y|x) from (1). Then Bayes' formula indicates that the posterior distribution of x is given by

 p(x|y)∝p0(x)⋅pl(y|x). (2)

Markov chain Monte Carlo (MCMC) is a large class of stochastic algorithms designed to sample the posterior density iteratively [9, 13]. In each iteration, a new proposal x^p is generated from the current iterate x^k, which can be described by a transition density q(x^p|x^k). The Metropolis-Hastings (MH) step then accepts this proposal with probability

 α(x^p, x^k) = min{1, (p(x^p|y)q(x^k|x^p))/(p(x^k|y)q(x^p|x^k))}. (3)

Some popular choices of q include the random walk transition in random walk Metropolis (RWM), and the Langevin dynamics transition in the Metropolis-adjusted Langevin algorithm (MALA) [17, 10, 21].

While these MCMC algorithms perform well for classical problems, they become very slow when applied to modern data science problems, where the parameter dimension n is very large. The main issue is that the proposal in RWM or MALA in general differs from x^k at all components, and the MH acceptance probability is often exponentially small in n. This is extremely small when n is more than a few thousand. The proposals will mostly be rejected, and the MCMC chain is essentially stuck. One way to alleviate this issue is to choose a small step size in the proposals, so the average acceptance probability is O(1). But then consecutive MCMC iterates are close to each other, so the overall movement of the chain can still be slow.

### 1.2 Spatial localization and MwG

The curse of dimensionality can often be lifted if there are exploitable statistical structures. Examples include conditional Gaussianity and low effective dimension. In geophysical applications, the components of x usually describe the status at different locations. Because the underlying physical law is often short-ranged, faraway components of x are nearly independent. Consequently, the associated covariance matrix will have a clear banded structure. Such a phenomenon is called spatial localization, and it widely exists in problems involving vast spatial domains. In the statistics literature, such banded structure can be exploited by tapering techniques, which significantly improve covariance estimation. And in numerical weather prediction (NWP), localization techniques are designed to utilize this structure, so algorithms such as the ensemble Kalman filter can provide stable estimation for planetary-scale models of very high dimension with merely 100 samples.

A recent work [18] investigates the possibility of exploiting spatial localization with MCMC. It is found that Gibbs sampling [8, 7] is a natural framework for this purpose. To do so, one first partitions the model components into m blocks x = (x_1,…,x_m), where each block contains only nearby components. When running the MCMC, one generates a proposal x^p for the i-th block by applying the Gibbs sampler to the prior. This proposal differs from x^k only at the i-th block, which is of dimension O(1). Therefore, the corresponding MH acceptance probability in (3) will be of scale O(1) and does not degenerate even if the overall dimension is large. A completely new iterate can be generated by repeating this procedure for all blocks. A more detailed description of this Metropolis-within-Gibbs (MwG) sampler can be found in Section 2.2. Numerical tests and rigorous analysis in Gaussian settings have revealed that MwG has a dimension-independent MCMC convergence rate when the underlying distributions are spatially localized.

### 1.3 Acceleration with local computation

While MwG takes only a constant number of iterations to sample the posterior distribution, the computation cost of each iterate can be expensive. This is because when a Gibbs block proposal x^p is processed by the MH step (3), one often needs to evaluate pl(y|x^p), which typically involves an O(n) computational cost. The proposal procedure is repeated for all m blocks. So, to generate a new MwG iterate, the computation cost is O(n^2). This is much more expensive than RWM and MALA iterates, which in general cost O(n).

However, it is possible to reduce the cost of evaluating pl(y|x^p) from O(n) to O(1). The main observation is that x^p and x^k differ only at one block, say the i-th block, and the computation of pl(y|x^k) was already done in the previous MH step. So if one can reuse the unchanged part in the computation of pl(y|x^p), its value can be obtained cheaply. As a simple example, suppose in (1) the observation model is y = x + ζ with ζ ∼ N(0, I_n), an n-dimensional normal distribution with zero mean and identity covariance matrix. Then, we have

 −log pl(y|x) ∝ ∑_{k=1}^m ∥y_k − x_k∥².

Suppose this value is already available for x^k; then when computing pl(y|x^p), one only needs to update the i-th term in the summation from ∥y_i − x^k_i∥² to ∥y_i − x^p_i∥², which only costs O(1). This example can be easily generalized to cases where pl(y|x) relies on multiple blocks of x. See a detailed discussion of this and the possibility of parallelization in [18].
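As a minimal illustration of this incremental update (a sketch with hypothetical helper names; blocks are stored as a list of numpy arrays), the O(n) full evaluation and the O(1)-per-block update can be written as:

```python
import numpy as np

def neg_log_lik(y, x):
    # Full computation: sum of squared block residuals, O(n) overall.
    return sum(np.sum((yk - xk) ** 2) for yk, xk in zip(y, x))

def neg_log_lik_update(current_value, y, x_old, x_new, i):
    # Incremental update when only block i changed: O(d) instead of O(n).
    return (current_value
            - np.sum((y[i] - x_old[i]) ** 2)
            + np.sum((y[i] - x_new[i]) ** 2))
```

Only the i-th summand is recomputed, so the cost per proposal no longer grows with the number of blocks.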

In this paper, we explore the possibility of reducing the computational cost of MwG from O(n^2) to O(n). We assume the observation model (1) is given by a high-dimensional stochastic differential equation (SDE) with short-range interactions, where x is the initial condition of the SDE, and y consists of noisy partial observations of the SDE solution at a fixed time T. Such an inverse initial value problem is practically important. It can be interpreted as a one-step smoothing problem in signal processing. Data assimilation problems such as NWP can also be formulated as sequential applications of it [5, 15].

Finding pl(y|x) is equivalent to solving the SDE and finding the solution map Φ(x,T,W) in the smoothing context. A standard Euler-Maruyama scheme would require a cost of O(n). When the Gibbs sampler proposes x^p, one needs to use it as a new SDE initial condition and compute Φ(x^p,T,W). But because x^p differs from x only at the i⋆-th block, we show in Proposition 1 that x^p_j(t) is not much different from x_j(t) if d(j,i⋆) is larger than a radius L. Therefore it is not necessary to redo the computation in full. In other words, instead of applying Euler-Maruyama to compute x^p_j(t) for all j, we only compute it locally for the blocks with d(j,i⋆) ≤ L. Since the radius L can often be chosen as a constant independent of n, updating x(t) to x^p(t) only needs a cost of O(1). This procedure is repeated through all m blocks, so the overall cost of finding a new MwG iterate is O(n). This achieves the aforementioned computation-reduction objective. We use a-MwG to denote this accelerated version of MwG.

Since MwG only requires constantly many iterates to sample a spatially localized distribution, a-MwG is expected to solve the Bayesian inverse problem with a total computation cost of O(n). This is the optimal dimension scaling one can obtain for any numerical method. The main drawback of a-MwG is that its local computation scheme is subject to approximation errors. However, these errors can be controlled by choosing a large enough radius L. We demonstrate this through rigorous analysis and numerical tests.

### 1.4 Organization and Preliminaries

This paper is organized in the following way. In Section 2, we consider the inverse problem of how to infer the initial conditions of an SDE with local interactions. We review the MwG sampler with a discussion of its computational complexity. We derive an accelerated algorithm, called the a-MwG sampler, in Section 3, and analyze its approximation errors. We also discuss how to implement a-MwG with the Euler-Maruyama and 4th-order Runge-Kutta schemes, as well as its adaptation to parallelization. In Section 4, two numerical experiments, on the Lorenz 96 model and a linearized stochastic model, are studied. The paper is concluded in Section 5. All proofs are postponed to the Appendix.

Throughout this paper, we will use the following notation. When applying MwG, we need to partition a high-dimensional vector x into m blocks, which we write as x = (x_1,…,x_m). For simplicity of the discussion, we assume each block shares the same length d, so the total dimension is n = md. We remark that our results generalize easily to non-constant block sizes. In practice, each block often represents information at a location on a torus, so it is natural to introduce a measure of distance between indices as d(i,j) = min{|i−j|, m−|i−j|}.
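This periodic distance between block indices can be sketched as a one-line helper (a hypothetical function name, 0-based indices on a ring of m blocks):

```python
def torus_distance(i, j, m):
    # Distance between block indices on a ring of m blocks:
    # the shorter of the two arc lengths between i and j.
    return min(abs(i - j), m - abs(i - j))
```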

When a matrix A is given, its (i,j)-th entry is denoted as A_{i,j}. A^T and A^{−1} denote the transpose and the inverse of the matrix, respectively. We adopt ∥·∥ and ∥·∥_∞ to denote the l2 norm and l∞ norm of a vector; namely, for a vector v with elements v_1,…,v_n, we have ∥v∥² = ∑_i v_i² and ∥v∥_∞ = max_i |v_i|. For an n×n matrix A, the operator norm is also written as ∥A∥. For two matrices A and B (including vectors as a special case), we say A ⪯ B if A_{i,j} ≤ B_{i,j} entry-wise. 0 and 1 denote matrices whose entries are all 0 and all 1, respectively. N(μ,Σ) is a multidimensional normal distribution with mean vector μ and covariance matrix Σ. We use I_n to denote the n×n identity matrix. C is a generic positive constant that may vary from line to line.

## 2 Problem setup

### 2.1 Inverse initial value problem for SDE

We consider a spatial-temporal model with local interactions

 dx_j(t) = f_j(t, x_{j−1}(t), x_j(t), x_{j+1}(t))dt + σ_j(t, x_j(t))dW_j(t), for j = 1,…,m,
 x_0(t) = x_m(t), x_{m+1}(t) = x_1(t), t ∈ [0,T], (4)

where each σ_j is a matrix-valued adapted locally bounded process, and W(t) = (W_1(t),…,W_m(t)) is an n-dimensional standard Brownian motion. We assume the following Lipschitz continuity conditions are satisfied by the coefficient processes f_j and σ_j:

###### Assumption 1

Given vectors x_1, x_2, x_3 and y_1, y_2, y_3, for t ∈ [0,T], there exist constants C_f and C_σ such that

 ∥f_j(t,x_1,x_2,x_3) − f_j(t,y_1,y_2,y_3)∥² ≤ C_f(∥x_1−y_1∥² + ∥x_2−y_2∥² + ∥x_3−y_3∥²),

and

 ∥σ_j(t,x_1) − σ_j(t,y_1)∥² ≤ C_σ∥x_1−y_1∥².

Assumption 1 is widely used to guarantee that the solution to (4) exists and is unique when the initial condition x and the realization of the Brownian motion W are given. We write the solution as x(t) = Φ(x, t, W). In the scenario where there is no stochastic forcing, namely σ_j ≡ 0, (4) is an ordinary differential equation (ODE). Its solution can be simply written as x(t) = Φ(x, t).

One key feature of (4) is that the drift term driving x_j relies only on its neighboring blocks. This describes general short-range interactions that are typical in spatial-temporal models. The formulation of (4) can be naturally derived if one applies a finite difference discretization to a stochastic partial differential equation, such as a reaction-diffusion equation. Details of such a derivation can be found in [3].

We assume the data y is generated from a noisy observation of (4) at time T, that is

 y=H⋅Φ(x,T,W)+ξ. (5)

Here H is an observation matrix that selects the observable components, and ξ represents the associated observation noise, which is distributed as N(0,R).

The Bayesian inverse problem this paper tries to solve is finding the posterior distribution of the initial condition x given the data y. For simplicity, we assume the prior distribution of x is Gaussian with a given mean and covariance matrix. Then the observation likelihood is given by

 pl(y|x) ∝ E exp(−½∥y − H·Φ(x,T,W)∥²_R), (6)

where for a vector v, we define ∥v∥²_R = v^T R^{−1} v. The expectation in (6) is an average over all realizations of W. In computation, it can be approximated by a Monte Carlo sampled version

 ^p_l(y|x) := (1/C) ∑_{c=1}^C exp(−½∥y − H·Φ(x,T,W^{(c)})∥²_R), (7)

where each W^{(c)} is an independent Brownian motion realization. When (4) is an ODE, (6) simplifies to

 pl(y|x) ∝ exp(−½∥y − H·Φ(x,T)∥²_R). (8)
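A minimal sketch of the Monte Carlo estimator (7), assuming a hypothetical solver `forward(x, W)` standing in for Φ(x,T,W), and the fixed Brownian realizations passed in as `brownian_paths`:

```python
import numpy as np

def likelihood_estimate(y, x, H, R_inv, forward, brownian_paths):
    # Monte Carlo estimate of p_l(y|x): average over C fixed Brownian
    # realizations of exp(-0.5 * ||y - H Phi(x, T, W)||_R^2),
    # where ||v||_R^2 = v^T R^{-1} v.
    total = 0.0
    for W in brownian_paths:
        r = y - H @ forward(x, W)   # residual for one realization
        total += np.exp(-0.5 * r @ R_inv @ r)
    return total / len(brownian_paths)
```

Keeping the realizations fixed across the MwG loop (rather than redrawing them) is what later allows the local surrogate to reuse stored trajectories.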

### 2.2 MwG and its computational complexity

The Metropolis-within-Gibbs sampler is an MCMC algorithm. For each block, it generates a proposal by applying Gibbs sampling to the prior, and uses a Metropolis step to incorporate the data information. Specifically, it generates iterates x^1, x^2, … through the following steps, where we start with k = 0 and draw x^0 randomly from the prior p_0:

1. Repeat steps 2-4 for all block indices j = 1,…,m.

2. Sample x^p_j from

 p_0(x_j ∈ · | x^k_1,…,x^k_{j−1}, x^k_{j+1},…,x^k_m).

3. Let x^p = (x^k_1,…,x^k_{j−1}, x^p_j, x^k_{j+1},…,x^k_m).

4. Let x^k = x^p with probability

 α(x^p, x^k) = min{1, pl(y|x^p)/pl(y|x^k)}. (9)

 pl can also be replaced by the sample version ^p_l in (7).

5. When the loop in step 1) is finished, let x^{k+1} = x^k, and increase the time index from k to k+1.
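One sweep of the steps above can be sketched generically as follows (a sketch with hypothetical callbacks `sample_block_prior` and `log_lik`; the acceptance ratio (9) is evaluated in log space for numerical stability):

```python
import numpy as np

def mwg_sweep(x, sample_block_prior, log_lik, rng):
    # One MwG sweep over all m blocks. `x` is a list of per-block arrays,
    # `sample_block_prior(x, j)` draws block j from the prior conditional,
    # and `log_lik(x)` returns log p_l(y|x).
    x = [b.copy() for b in x]
    ll = log_lik(x)
    for j in range(len(x)):
        proposal = [b.copy() for b in x]
        proposal[j] = sample_block_prior(x, j)
        ll_p = log_lik(proposal)
        # MH acceptance (9): accept with prob min{1, p_l(y|x^p)/p_l(y|x^k)}
        if np.log(rng.uniform()) < ll_p - ll:
            x, ll = proposal, ll_p
    return x
```

In this form the expensive call is `log_lik(proposal)`; the rest of the paper is about making that call cheap when only one block has changed.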

In [18], it is shown that MwG has dimension-independent performance if 1) the prior p_0 is a Gaussian distribution and its covariance or precision matrix is close to banded; 2) each component of y has significant dependence on only a few components of x. It is also discussed how to truncate the far-off-diagonal entries of the prior covariance or precision matrix. This so-called "localization" technique can simplify the computation of the proposal probability in MwG step 2), bringing its cost down to O(1).

When applying MwG directly to the inverse problem described in Section 2, the computational cost for fully updating all blocks is O(n^2). This is because in step 4), we need to evaluate pl(y|x^p) through either (7) or (8). This involves finding the numerical solution to (4), which requires executing the Euler-Maruyama or 4th-order Runge-Kutta method. Both methods have O(n) computational complexity when the dimension of the differential equation (4) is n. Since step 4) is repeated for all m blocks, the total computation cost is O(n^2).

In summary, although the vanilla MwG only requires O(1) iterates to converge to the posterior distribution, each individual iterate can cost O(n^2). This can be less appealing than standard MCMC algorithms with optimal tuning. For example, optimally tuned RWM can converge to the posterior distribution in O(n) iterations, while each iterate costs O(n), so the total complexity is O(n^2). In this paper, we demonstrate how to bring the computational cost of each MwG iterate down to O(n). This leads to the optimal MCMC computational scalability.

## 3 Acceleration with local computation

### 3.1 Spatial propagation of local changes

Based on our previous discussion, the main computational cost of MwG takes place at step 4), when (9) is evaluated. We will discuss how to reduce this cost to O(1). In what follows, we fix the time index as k and the block index as i⋆; the same procedure applies to all time indexes and blocks.

First of all, it should be noted that when executing step 4), the value of pl(y|x^k) is already available from the previous step 4). Likewise, we already have the values of Φ(x^k, t, W). It is only necessary to find pl(y|x^p), or equivalently Φ(x^p, t, W). Note that we will often write Φ(x^p, t, W) as x^p(t), since it is the information we try to recover, and similarly write Φ(x^k, t, W) as x^o(t), with the superscript "o" standing for the original iterate.

Our main observation here is that x^p differs from x^o only at the i⋆-th block. Since components in the SDE (4) exchange information only through local interactions, the differences between x^o(t) and x^p(t) are likely to be significant only for blocks that are close to i⋆. In fact, we have the following proposition:

###### Proposition 1

Under Assumption 1 and with any given positive constant C_d, for t ∈ [0,T], we have

 E[∥x^o_j(t) − x^p_j(t)∥²] ≤ ∥x^o_{i⋆} − x^p_{i⋆}∥² e^{C_1(f,σ)t} e^{−C_d·d(j,i⋆)} (10)

with

 C_1(f,σ) = (e^{C_d} + e^{−C_d} + 1)(C_f + C_σ + 1). (11)

In particular, for any fixed small enough threshold ϵ and time range T, we can find a radius L such that

 E[∥x^o_j(t) − x^p_j(t)∥²] ≤ ϵ, ∀ t ≤ T, d(i⋆,j) > L.

We define the local domain centered at index i⋆ as

 B_{i⋆} := {j : d(i⋆,j) ≤ L},

and use B^c_{i⋆} to denote its complement. Then x^o_j(t) is already a good approximation of x^p_j(t) for j ∈ B^c_{i⋆}. In other words, it is no longer necessary to recompute x^p_j(t) for j ∈ B^c_{i⋆}; we only need to compute it for j ∈ B_{i⋆}. Since B_{i⋆} contains at most 2L+1 elements, and L is likely to be independent of the problem dimension n, this provides us with a way to accelerate the overall computation of MwG.

### 3.2 Local surrogate

Next we consider how to use the existing values of x^o(t) to reduce the computation of x^p(t). For this purpose, we introduce a local surrogate model given by

 x^l_j(t) = x^o_j(t), for j ∈ B^c_{i⋆},
 dx^l_j(t) = f_j(t, x^l_{j−1}(t), x^l_j(t), x^l_{j+1}(t))dt + σ_j(t, x^l_j(t))dW_j(t), for j ∈ B_{i⋆},
 x^l_0(t) = x^l_m(t), x^l_{m+1}(t) = x^l_1(t). (12)

Its initial condition is set to be x^p. We write its solution as

 x^l(t) = Φ^l(x^p, t, x^o(t≤T), W).

Note that the local surrogate within B_{i⋆} depends on x^o(t) only through the boundary blocks of B_{i⋆}, whose indices are usually just i⋆−L−1 and i⋆+L+1. Such dependence can be viewed as using x^o(t) as spatial boundary conditions for the computation of x^l(t).

We will use the local surrogate x^l(t) as an approximation of x^p(t). It can be computed cheaply, because it differs from x^o(t) only for blocks inside the local domain B_{i⋆}. The details of how to achieve this can be found in the subsequent parts. First, we investigate the approximation error through the following theorem.

###### Theorem 1

Under Assumption 1 and with any given positive constant C_d, for t ∈ [0,T] and j ∈ B_{i⋆}, we have

 E[∥x^l_j(t) − x^p_j(t)∥²] ≤ C_2(f,σ) ∥x^o_{i⋆} − x^p_{i⋆}∥² e^{2C_1(f,σ)t} e^{−C_d·(L+1)}, (13)

with C_1(f,σ) defined in (11) and

 C_2(f,σ) = max{2C_f/C_1(f,σ), 1}. (14)

In particular, for any given ϵ > 0, if the local domain radius satisfies

 L ≥ log(ϵ/(C_2(f,σ)∥x^o_{i⋆} − x^p_{i⋆}∥²))/(−C_d) + 2C_1(f,σ)T/C_d, (15)

then E[∥x^l_j(t) − x^p_j(t)∥²] ≤ ϵ for all t ≤ T and j ∈ B_{i⋆}.

Theorem 1 indicates that we can use x^l(t) to approximate x^p(t) with controlled accuracy. So the likelihood pl(y|x^p) appearing in (9) can be substituted by its local-surrogate counterpart. When such a scheme for calculating (9) is plugged into MwG, it leads to acceleration in terms of computational complexity. We call the resulting algorithm accelerated Metropolis-within-Gibbs (a-MwG) and present its pseudocode in Algorithm 1. Next, we discuss how to compute the local surrogate (12) with numerical methods, and how to implement parallelization.

### 3.3 Local computation with Euler-Maruyama

When the stochastic forcing is nonzero, our model (4) is a bona fide SDE. Euler-Maruyama is the standard numerical method for its computation.

In the vanilla MwG, if one applies it directly to the full model (4) to obtain x^p(t), a small step size h is chosen, and the value of x^p(t) is approximated through ~x^p(ih) at the grid points ih ∈ [0,T]. The numerical solution is generated by the following iterations, starting from ~x^p(0) = x^p:

 ~x^p_j((i+1)h) = ~x^p_j(ih) + σ_j(ih, ~x^p_j(ih))√h W_{i,j} + f_j(ih, ~x^p_{j−1}(ih), ~x^p_j(ih), ~x^p_{j+1}(ih))h. (16)

Here, the W_{i,j} are independent samples from N(0, I_d). One can view

 W_j(kh) = ∑_{i=1}^k √h W_{i,j}

as a realization of the Brownian motion in (4). As (16) is repeated for all block indices j, obtaining ~x^p has an O(n) computational complexity.
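A minimal sketch of this full Euler-Maruyama scheme for the ring-coupled model (4), with blocks stored as a list of numpy arrays and hypothetical drift/diffusion callbacks `f(j, t, left, center, right)` and `sigma(j, t, center)`:

```python
import numpy as np

def euler_maruyama(x0, f, sigma, T, h, rng):
    # Full Euler-Maruyama for the ring-coupled SDE (4): block j is driven
    # by blocks j-1, j, j+1 with periodic indexing, cost O(m) per step.
    m = len(x0)
    x = [b.astype(float).copy() for b in x0]
    n_steps = int(round(T / h))
    for i in range(n_steps):
        t = i * h
        noise = [rng.normal(size=b.shape) for b in x]  # W_{i,j} samples
        x = [x[j]
             + f(j, t, x[(j - 1) % m], x[j], x[(j + 1) % m]) * h
             + sigma(j, t, x[j]) * np.sqrt(h) * noise[j]
             for j in range(m)]
    return x
```

The list comprehension updates every block from the previous time slice, matching the explicit scheme (16); for the diagonal-noise sketch here, `sigma` returns an array multiplied elementwise rather than a full matrix.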

When applying Euler-Maruyama to the local surrogate (12), we assume numerical approximations of x^o(ih) are available as ~x^o(ih). These come either from the previous a-MwG iteration or from an initial computation before the first a-MwG iteration. Then, if the proposal x^p differs from x^o at the i⋆-th block, we set the local domain as B_{i⋆}, and denote its complement as B^c_{i⋆}. For blocks outside B_{i⋆}, we directly use ~x^o_j(ih) as the numerical solution, that is, we let

 ~x^l_j(ih) = ~x^o_j(ih), j ∈ B^c_{i⋆}, i = 1,…,T/h. (17)

New computation is needed only for ~x^l_j(ih) with j ∈ B_{i⋆}, which is obtained through the following

 ~x^l_j((i+1)h) = ~x^l_j(ih) + σ_j(ih, ~x^l_j(ih))√h W_{i,j} + f_j(ih, ~x^l_{j−1}(ih), ~x^l_j(ih), ~x^l_{j+1}(ih))h, (18)

with initial condition ~x^l_j(0) = x^p_j. This procedure is how we obtain the step

 x^l(t,c) = Φ^l(x^p, t, x^o(t≤T,c), W^{(c)})

in a-MwG Algorithm 1. Note that in a-MwG, the W^{(c)} are Brownian motion realizations fixed during the loops. So when implementing (18), one uses the fixed samples W_{i,j} associated with W^{(c)}, instead of generating new independent samples.

The reason we need to do the copying step (17) before the local Euler-Maruyama is that the values of ~x^o_j(ih) at the boundary blocks are needed in the computation of (18). This means that in a real implementation, (17) only needs to be executed for the boundary blocks of B_{i⋆}, which can save additional computation time and storage in practice. We choose not to formulate (17) and (18) in this fashion only to simplify the notation.
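The copy step (17) and the local update (18) can be sketched together as follows (a simplified sketch with scalar blocks, hypothetical callbacks, and the fixed noise increments passed in explicitly; `traj_old[i][j]` stores block j of the stored trajectory at time ih):

```python
import numpy as np

def local_euler_maruyama(traj_old, x_p, i_star, L, f, sigma, h, noises):
    # Local EM update: recompute only blocks within torus distance L of
    # i_star, reusing the stored trajectory traj_old elsewhere (copy step
    # (17)) and as the spatial boundary condition. noises[i][j] are the
    # FIXED Brownian increments reused from the stored computation.
    m = len(traj_old[0])
    in_B = lambda j: min(abs(j - i_star), m - abs(j - i_star)) <= L
    n_steps = len(traj_old) - 1
    traj = [row[:] for row in traj_old]
    # Initial condition: proposal inside B, stored values outside.
    traj[0] = [x_p[j] if in_B(j) else traj_old[0][j] for j in range(m)]
    for i in range(n_steps):
        t = i * h
        cur = traj[i]
        nxt = list(traj_old[i + 1])        # copy step (17) outside B
        for j in range(m):
            if in_B(j):                    # renewed computation (18)
                nxt[j] = (cur[j]
                          + f(j, t, cur[(j - 1) % m], cur[j],
                              cur[(j + 1) % m]) * h
                          + sigma(j, t, cur[j]) * np.sqrt(h) * noises[i][j])
        traj[i + 1] = nxt
    return traj
```

Per time step the inner loop touches at most 2L+1 blocks, so the total cost is independent of m.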

Because (18) only requires execution for j ∈ B_{i⋆}, the total cost of obtaining ~x^l is independent of n, which is one order cheaper than obtaining ~x^p directly with Euler-Maruyama. This partial computation naturally introduces an error. But similar to Theorem 1, this error can be controlled through the local domain radius L, according to the following theorem:

###### Theorem 2

Under Assumption 1, for any given positive constant C_d, i = 1,…,T/h, and j ∈ B_{i⋆}, we have

 E[∥~x^l_j(ih) − ~x^p_j(ih)∥²] ≤ C_2(f,σ) e^{2C_1(f,σ)(1+h)ih} e^{−C_d(L+1)} ∥x^o_{i⋆} − x^p_{i⋆}∥², (19)

with C_1(f,σ) and C_2(f,σ) defined in (11) and (14). For any given constant ϵ > 0, if the local domain radius satisfies

 L ≥ log(ϵ/(C_2(f,σ)∥x^o_{i⋆} − x^p_{i⋆}∥²))/(−C_d) + 2C_1(f,σ)(1+h)T/C_d, (20)

then E[∥~x^l_j(ih) − ~x^p_j(ih)∥²] ≤ ϵ.

Note that as h → 0, (19) and (20) converge to the corresponding theoretical results in Theorem 1. Also note that based on (20), the choice of L is independent of the dimension n.

### 3.4 Local computation with Runge-Kutta

When the stochastic forcing is zero, our model (4) is an ODE. This means we can use C = 1 in Algorithm 1, and it is not necessary to sample the Brownian motion. The 4th-order Runge-Kutta method (RK4) is the standard numerical method for ODE computation.

When applying RK4 to the full model for obtaining ~x^o as a start, one runs the following

 ~x^o_j((i+1)h) = ~x^o_j(ih) + (1/6)(k^{1,o}_j(i) + 2k^{2,o}_j(i) + 2k^{3,o}_j(i) + k^{4,o}_j(i)),

with

 k^{1,o}_j(i) = h f_j(ih, ~x^o_{j−1}(ih), ~x^o_j(ih), ~x^o_{j+1}(ih)),
 k^{2,o}_j(i) = h f_j(ih + h/2, ~x^o_{j−1}(ih) + k^{1,o}_{j−1}(i)/2, ~x^o_j(ih) + k^{1,o}_j(i)/2, ~x^o_{j+1}(ih) + k^{1,o}_{j+1}(i)/2),
 k^{3,o}_j(i) = h f_j(ih + h/2, ~x^o_{j−1}(ih) + k^{2,o}_{j−1}(i)/2, ~x^o_j(ih) + k^{2,o}_j(i)/2, ~x^o_{j+1}(ih) + k^{2,o}_{j+1}(i)/2),
 k^{4,o}_j(i) = h f_j(ih + h, ~x^o_{j−1}(ih) + k^{3,o}_{j−1}(i), ~x^o_j(ih) + k^{3,o}_j(i), ~x^o_{j+1}(ih) + k^{3,o}_{j+1}(i)). (21)

This step is repeated for all time grid indices i.
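A single RK4 step with periodic neighbor coupling can be sketched as follows (a standard RK4 formulation equivalent to (21), with stages k = f(...) and the factor h applied in the final combination; `f(j, t, left, center, right)` is a hypothetical callback):

```python
def rk4_step(x, f, t, h):
    # One RK4 step for the ring-coupled ODE: each stage evaluates the
    # drift of block j using neighbors j-1, j, j+1 with periodic indexing.
    m = len(x)
    def F(z, s):
        return [f(j, s, z[(j - 1) % m], z[j], z[(j + 1) % m]) for j in range(m)]
    k1 = F(x, t)
    k2 = F([x[j] + 0.5 * h * k1[j] for j in range(m)], t + 0.5 * h)
    k3 = F([x[j] + 0.5 * h * k2[j] for j in range(m)], t + 0.5 * h)
    k4 = F([x[j] + h * k3[j] for j in range(m)], t + h)
    return [x[j] + h / 6 * (k1[j] + 2 * k2[j] + 2 * k3[j] + k4[j])
            for j in range(m)]
```

Note that because each stage looks one block to each side, one RK4 step effectively propagates information up to four blocks away, which is why the intermediate stage values must also be stored for the local computation below.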

When applying RK4 to compute the local surrogate ~x^l, we need not only the numerical approximations ~x^o(ih), but also the intermediate values k^{s,o}_j(i) for s = 1,…,4. These values will be directly taken as ~x^l_j(ih) and its RK4 intermediate values for j ∈ B^c_{i⋆}:

 ~x^l_j(ih) = ~x^o_j(ih), k^{s,l}_j(i) = k^{s,o}_j(i). (22)

New RK4 computation is only needed for blocks inside the local domain, and ~x^l_j((i+1)h) is obtained through iterating the following step for j ∈ B_{i⋆} and all time grid indices i:

 ~x^l_j((i+1)h) = ~x^l_j(ih) + (1/6)(k^{1,l}_j(i) + 2k^{2,l}_j(i) + 2k^{3,l}_j(i) + k^{4,l}_j(i)). (23)

Here the intermediate values k^{s,l}_j(i) are defined in the same way as in (21); that is, every instance of ~x^o is replaced by ~x^l, and every instance of k^{s,o} is replaced by k^{s,l}. As with the local Euler-Maruyama computation, we note that (22) in reality is only needed for the boundary blocks, which can save both computation time and storage.

By completing the procedures described above, we obtain the local surrogate ~x^l. If the proposal is accepted in the MH step, ~x^l will be used as ~x^o in the next iteration. Since (23) is repeated only for blocks within the local domain, the overall cost is independent of n.

While an approximation error analysis like Theorem 2 in principle exists, we do not provide a detailed proof. This is partly because RK4 is much more complicated than Euler-Maruyama. Moreover, being a 4th-order accurate method, the RK4 solution is very close to the exact solution of (12), so the approximation error can be inferred directly from Theorem 1.

### 3.5 Parallelization

The a-MwG algorithm we propose can easily be adapted for parallelization to shorten wall-clock computation time.

First of all, in Algorithm 1, every repetition among the C Brownian motion realizations can be computed in parallel. This is the case because we assume the realizations W^{(1)},…,W^{(C)} are independent of each other.

Next, recall that in the local surrogate model (12), needs renewed computation only for , which depends on only at the boundary blocks . Now consider another block index such that . Then its local domain and boundary blocks have no intersection with the ones of . This means that if there are two Gibbs proposals and , and they are different from the current iterate at the -th block and -th block respectively, then the computation of