1.1 Inverse problem and MCMC
Inverse problems are ubiquitous in various fields of science and engineering [24, 2, 11]. They concern how to infer model parameters from partial, delayed, and noisy observations. Typical examples include using measurements of seismic waves to determine the location of an earthquake epicenter, and recovering images that are as close to natural ones as possible from blurry observations. The general formulation of an inverse problem can be written as
Here $u$ denotes the parameters to be estimated, $y$ denotes the observation data, $\mathcal{G}$ is a physical model describing the data collection process, and $\xi$ is the possible random factor involved.
Often, it is of interest to quantify the uncertainty of $u$, which can be used to assess estimation accuracy and regulate risk. The Bayesian approach is particularly appropriate for this purpose [11, 4, 22]. It involves modeling $u$ with a prior distribution and finding the observation distribution from (1). Bayes' formula then indicates that the posterior distribution of $u$ is given by
Markov chain Monte Carlo (MCMC) refers to a large class of stochastic algorithms designed to sample the posterior density iteratively [9, 13]. In each iteration, a new proposal $u'$ is generated from the current iterate $u$, which can be described by a transition density
Then the Metropolis–Hastings (MH) step accepts this proposal with probability
While these MCMC algorithms perform well for classical problems, they become very slow when applied to modern data science problems, where the parameter dimension $d$ is very large. The main issue is that a proposal in RWM or MALA in general differs from the current iterate in all components, and the MH acceptance probability is then often exponentially small in $d$. This is extremely small when $d$ is more than a few thousand. The proposals will mostly be rejected, and the MCMC chain is essentially stuck. One way to alleviate this issue is to choose a small, dimension-dependent step size in the proposals, so that the average acceptance probability is of order one. But then consecutive MCMC iterates are close to each other, so the overall movement of the chain can still be slow.
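The dimension degeneracy can be seen in a small numerical sketch (our own illustration, not from the paper: random-walk Metropolis targeting a standard Gaussian, with a step size that does not shrink with the dimension; the function name `avg_acceptance` is ours):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_acceptance(d, step=0.5, n_trials=2000):
    """Average MH acceptance rate of random-walk Metropolis with a
    dimension-independent step size, targeting N(0, I_d).
    Each trial starts from a sample of the target (stationarity)."""
    u = rng.standard_normal((n_trials, d))               # current iterates
    v = u + step * rng.standard_normal((n_trials, d))    # RWM proposals
    # log acceptance ratio for the isotropic Gaussian target
    log_alpha = 0.5 * (np.sum(u**2, axis=1) - np.sum(v**2, axis=1))
    # min(1, exp(x)) = exp(min(x, 0)); clip to avoid overflow warnings
    return np.mean(np.exp(np.minimum(log_alpha, 0.0)))

rates = {d: avg_acceptance(d) for d in [1, 10, 100, 1000]}
```

The acceptance rate decays rapidly as the dimension grows, which is the degeneracy described above.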
1.2 Spatial localization and MwG
The curse of dimensionality can often be lifted if there are exploitable statistical structures. Examples include conditional Gaussianity and low effective dimension. In geophysical applications, the components of $u$ usually describe the state at different locations. Because the underlying physical law is often short ranged, faraway components of $u$ are nearly independent. Consequently, the associated covariance matrix will have a clear banded structure. Such a phenomenon is called spatial localization, and it widely exists in problems involving vast spatial domains. In the statistics literature, such banded structure can be exploited by tapering techniques, which significantly improve covariance estimation. In numerical weather prediction (NWP), localization techniques are designed to utilize this structure, so that algorithms such as the ensemble Kalman filter can provide stable estimates for planetary models of very high dimension with merely 100 samples.
A recent work investigates the possibility of exploiting spatial localization with MCMC. It is found that Gibbs sampling [8, 7] is a natural framework for this purpose. To do so, one first partitions the model components into $m$ blocks, where each block contains only nearby components. When running the MCMC, one generates a proposal for the $i$-th block by applying the Gibbs sampler to the prior. This proposal differs from the current iterate only at the $i$-th block, which is of dimension $d/m$. Therefore, the corresponding MH acceptance probability in (3) is of scale $O(1)$ and does not degenerate even if the overall dimension is large. A completely new iterate can be generated by repeating this procedure for all blocks. A more detailed description of this Metropolis-within-Gibbs (MwG) sampler can be found in Section 2.2. Numerical tests and rigorous analysis in Gaussian settings have revealed that MwG has a dimension-independent MCMC convergence rate when the underlying distributions are spatially localized.
1.3 Acceleration with local computation
While MwG takes only a constant number of iterations to sample the posterior distribution, the computational cost of each iterate can be expensive. This is because, when a Gibbs block proposal is processed by the MH step (3), one often needs to evaluate the likelihood $p(y\,|\,u')$, which in general involves an $O(d)$ computational cost. The proposal procedure is repeated for all blocks, so generating a new MwG iterate costs $O(d^2)$ in total. This is much more expensive than RWM and MALA iterates, which in general cost $O(d)$.
However, it is possible to reduce the cost of each block update from $O(d)$ to $O(1)$. The main observation is that the proposal $u'$ and the current iterate $u$ differ only at one block, say the $i$-th block, and the computation of $p(y\,|\,u)$ was already done in the previous MH step. So if one can reuse that computation when evaluating $p(y\,|\,u')$, the new value can be obtained cheaply. As a simple example, suppose in (1) the observation model is $y = \mathcal{G}(u) + \xi$, where each observation block $\mathcal{G}_j(u)$ depends only on $u_j$, and $\xi \sim \mathcal{N}(0, I)$, a normal distribution with zero mean vector and the covariance matrix being an identity matrix. Then we have
\[
\log p(y\,|\,u) = -\tfrac{1}{2}\sum_{j=1}^{m}\|y_j - \mathcal{G}_j(u_j)\|^2 + \text{const}.
\]
Suppose this value is already available. Then, when computing $\log p(y\,|\,u')$, one only needs to update the $i$-th term in the summation from $\|y_i - \mathcal{G}_i(u_i)\|^2$ to $\|y_i - \mathcal{G}_i(u'_i)\|^2$, which only costs $O(1)$. This example can be easily generalized to cases where each $\mathcal{G}_j$ relies on multiple blocks of $u$. A detailed discussion of this idea and of the possibility of parallelization can be found in the work mentioned above.
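The bookkeeping can be made concrete with a minimal Python sketch (our own illustration; `G_block` is a hypothetical blockwise forward map, not the paper's model). Caching the per-block terms makes a one-block proposal cost $O(q)$ instead of $O(d)$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q = 50, 4                       # number of blocks, block size
u = rng.standard_normal((m, q))    # current iterate, stored blockwise
y = rng.standard_normal((m, q))    # observations, one block per state block

def G_block(j, uj):
    """Hypothetical blockwise forward map: block j depends only on u_j."""
    return np.tanh(uj) + 0.1 * j * uj

def full_loglik(u):
    """O(d) evaluation: sum over all m blocks."""
    return -0.5 * sum(np.sum((y[j] - G_block(j, u[j]))**2) for j in range(m))

# Cache the per-block terms once, then update a single block in O(q).
terms = np.array([-0.5 * np.sum((y[j] - G_block(j, u[j]))**2) for j in range(m)])
loglik = terms.sum()

def update_block(i, ui_new):
    """Replace only the i-th term of the cached summation."""
    global loglik
    new_term = -0.5 * np.sum((y[i] - G_block(i, ui_new))**2)
    loglik += new_term - terms[i]
    terms[i] = new_term
    u[i] = ui_new

update_block(7, rng.standard_normal(q))
```

After the cheap update, the cached value agrees with a full recomputation.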
In this paper, we explore the possibility of reducing the computational cost of MwG from $O(d^2)$ to $O(d)$. We assume the observation model (1) is given by a high-dimensional stochastic differential equation (SDE) with short-range interactions, where $u$ is the initial condition of the SDE, and $y$ consists of noisy partial observations of the SDE solution in a fixed time interval. Such an inverse initial value problem is practically important. It can be interpreted as a one-step smoothing problem in signal processing. Data assimilation problems such as NWP can also be formulated as sequential applications of it [5, 15].
Finding $p(y\,|\,u)$ is equivalent to solving the SDE and finding the terminal state $X_T(u)$ in the smoothing context. The standard Euler–Maruyama scheme would require a cost of $O(d)$ for each solve. When the Gibbs sampler proposes $u'$, one needs to use it as a new SDE initial condition and compute $X_t(u')$. But because $u'$ differs from $u$ only at the $i$-th block, we show in Proposition 1 that $X_{j,t}(u')$ is not much different from $X_{j,t}(u)$ if the block distance between $j$ and $i$ is larger than a radius $r$. Therefore it is not necessary to redo the computation in full. In other words, instead of applying Euler–Maruyama to compute $X_{j,t}(u')$ for all $j$, we only compute it locally for the blocks within distance $r$ of $i$. Since the radius $r$ can often be chosen as a constant independent of $d$, updating $X_t(u)$ to $X_t(u')$ only needs a cost of $O(1)$. This procedure is repeated through all blocks, so the overall cost of finding a new MwG iterate is $O(d)$. This achieves the aforementioned computation reduction objective. We use a-MwG to denote this accelerated version of MwG.
Since MwG only requires constantly many iterates to sample a spatially localized distribution, a-MwG is expected to solve the Bayesian inverse problem with a computational cost of only $O(d)$. This is the optimal dimension scaling one can obtain for any numerical method. The main drawback of a-MwG is that its local computation scheme is subject to approximation errors. However, these errors can be controlled by choosing a large enough radius $r$. We demonstrate this through rigorous analysis and numerical tests.
1.4 Organization and Preliminaries
This paper is organized in the following way. In Section 2, we consider the inverse problem of how to infer the initial conditions of an SDE with local interactions. We review the MwG sampler with a discussion of its computational complexity. We derive an accelerated algorithm, called the a-MwG sampler, in Section 3, and analyze its approximation errors. We also discuss how to implement a-MwG with Euler–Maruyama and 4th-order Runge–Kutta schemes, as well as its adaptation to parallelization. In Section 4, two numerical experiments, on the Lorenz 96 model and a linearized stochastic model, are studied. The paper is concluded in Section 5. All proofs are postponed to the Appendix.
Throughout this paper, we will use the following notation. When applying MwG, we need to partition a high-dimensional vector into $m$ blocks, which we write as $u = [u_1, \dots, u_m]$. For simplicity of the discussion, we assume each block has the same length $q$, so the total dimension is $d = mq$. We remark that our results are easily generalizable to non-constant block sizes. In practice, each block often represents information at a location on a torus; it is therefore natural to introduce a measure of distance between indices as $\mathrm{dist}(i,j) = \min\{|i - j|,\; m - |i - j|\}$ with $i, j \in \{1, \dots, m\}$.
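This periodic distance can be computed directly (a small sketch; we use 0-based indices and the function name is ours):

```python
def block_dist(i, j, m):
    """Distance between block indices i and j on a torus of m blocks
    (indices assumed to be 0, ..., m-1)."""
    k = abs(i - j) % m
    return min(k, m - k)
```

For example, on a torus of 10 blocks the first and last blocks are neighbors.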
When a matrix $A$ is given, its $(i,j)$-th entry is denoted as $A_{i,j}$; $A^T$ and $A^{-1}$ denote the transpose and the inverse of the matrix, respectively. We adopt $\|\cdot\|$ and $\|\cdot\|_\infty$ to denote the $l_2$ norm and $l_\infty$ norm of a vector; namely, for a vector $v$ with elements $v_i$, we have $\|v\| = (\sum_i v_i^2)^{1/2}$ and $\|v\|_\infty = \max_i |v_i|$. For a matrix $A$, the operator norm is written as $\|A\|$. For two matrices $A$ and $B$ (including vectors as a special case), we say $A \le B$ if $A_{i,j} \le B_{i,j}$ for all $i, j$, entry-wise. $\mathbf{0}$ and $\mathbf{1}$ are matrices whose entries are all 0 and 1, respectively. $\mathcal{N}(\mu, \Sigma)$ is a multidimensional normal distribution with mean vector $\mu$ and covariance matrix $\Sigma$. We use $I_n$ to denote the $n \times n$ identity matrix. $c$ is a generic positive constant that may vary from line to line.
2 Problem setup
2.1 Inverse initial value problem for SDE
We consider a spatial-temporal model with local interaction
\[
dX_{j,t} = f_j(X_{j-1,t}, X_{j,t}, X_{j+1,t})\,dt + \sigma_t\,dW_{j,t}, \qquad j = 1, \dots, m, \tag{4}
\]
where the block indices are interpreted periodically, $\sigma_t$ is a matrix-valued, adapted, locally bounded process, and $W_t$ is a $d$-dimensional standard Brownian motion. We assume the following Lipschitz continuity conditions are satisfied for the drift coefficients $f_j$:
Assumption 1. Given vectors $a, a', b, b'$, and $c, c'$, for each $j$, there exist constants $L_1$ and $L_2$ such that
\[
\|f_j(a, b, c) - f_j(a', b', c')\| \le L_1 \|b - b'\| + L_2\big(\|a - a'\| + \|c - c'\|\big).
\]
Assumption 1 is widely used to guarantee that the solution to (4) exists and is unique when the initial condition and the realization of the Brownian motion are given. We write the solution as $X_t(u, W)$. In the scenario where there is no stochastic forcing, namely $\sigma_t \equiv 0$, (4) is an ordinary differential equation (ODE), and its solution can simply be written as $X_t(u)$.
One key feature of (4) is that the drift term driving $X_{j,t}$ relies only on its neighboring blocks. This describes general short-range interactions that are typical in spatial-temporal models. The formulation of (4) can be naturally derived if one considers applying a finite difference discretization to a stochastic partial differential equation, such as a reaction-diffusion equation. Details of such a derivation can be found in the literature.
We assume an $n$-dimensional data vector $y$ is generated from a noisy observation of (4) at time $T$, that is,
\[
y = H X_T + \xi.
\]
Here $H$ is an $n$ by $d$ observation matrix that serves to select the observable components, and $\xi$ represents the associated observation noise, which is distributed as $\mathcal{N}(0, \Sigma_\xi)$.
The Bayesian inverse problem this paper tries to solve is finding the posterior distribution of the initial condition $u = X_0$ with a given data vector $y$. For simplicity, we assume the prior distribution of $u$ is Gaussian with mean $\mu_0$ and covariance matrix $\Sigma_0$. Then the observation likelihood is given by
\[
p(y\,|\,u) \propto \mathbb{E}_W \exp\!\Big(-\tfrac{1}{2}\big\|y - H X_T(u, W)\big\|_{\Sigma_\xi}^2\Big),
\]
where for a vector $v$ we define $\|v\|_{A}^2 = v^T A^{-1} v$. The expectation averages over all realizations of the Brownian motion $W$. In computation, it can be approximated by a Monte Carlo sampled version with i.i.d. Brownian realizations.
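The sampled version can be sketched as follows (our own minimal illustration; the drift, noise scales, and function names `em_solve` and `mc_log_likelihood` are hypothetical, with the Brownian average replaced by a finite Monte Carlo mean):

```python
import numpy as np

rng = np.random.default_rng(2)

def em_solve(u, dW, f, dt):
    """Euler-Maruyama solve of dX = f(X) dt + dW over len(dW) steps."""
    x = u.copy()
    for inc in dW:
        x = x + f(x) * dt + inc
    return x

def mc_log_likelihood(u, y, H, f, dt, n_steps, n_samples, sig=0.1):
    """Log of the Monte Carlo average of the Gaussian observation
    likelihood over independent Brownian realizations."""
    d = len(u)
    vals = []
    for _ in range(n_samples):
        # illustrative diffusion amplitude 0.05 folded into the increments
        dW = 0.05 * np.sqrt(dt) * rng.standard_normal((n_steps, d))
        xT = em_solve(u, dW, f, dt)
        r = y - H @ xT
        vals.append(np.exp(-0.5 * np.dot(r, r) / sig**2))
    return np.log(np.mean(vals))
```

Each Monte Carlo sample requires one full SDE solve, which is the cost that the local computation of Section 3 targets.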
2.2 MwG and its computational complexity
The Metropolis-within-Gibbs sampler is an MCMC algorithm. For each block, it generates a proposal by applying Gibbs sampling to the prior, and uses the Metropolis step to incorporate data information. Specifically, it generates iterates $u^{(t)}$ through the following steps, where we start with $t = 0$ and draw $u^{(0)}$ randomly from the prior:
1) Repeat steps 2)–4) for all block indices $i = 1, \dots, m$.
2) Sample a new $i$-th block from the prior distribution conditioned on the other blocks of the current iterate.
3) Form the proposal by replacing the $i$-th block of the current iterate with this sample.
4) Let the iterate adopt the block proposal with the MH acceptance probability. The likelihood there can also be replaced by the sample version in (7).
5) When the loop in step 1) is finished, let the result be the new iterate, and increase the time index from $t$ to $t+1$.
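The sweep structure above can be sketched in a few lines (a toy illustration, not the paper's algorithm: an independent standard-normal prior and the linear-Gaussian observation $y = u + \xi$, so that prior block proposals cancel the prior terms in the MH ratio and only the likelihood ratio remains):

```python
import numpy as np

rng = np.random.default_rng(3)
m, q = 2, 2                           # blocks and block size (d = 4)
d = m * q
y = np.array([1.0, -0.5, 0.3, 0.8])   # synthetic data, y = u + N(0, I)

def loglik(u):
    """log p(y|u) up to a constant for the Gaussian observation model."""
    return -0.5 * np.sum((y - u)**2)

def mwg_sweep(u):
    """One MwG iterate: loop over blocks, propose from the N(0, I) prior,
    accept with the MH probability (prior terms cancel for prior proposals)."""
    for i in range(m):
        v = u.copy()
        v[i*q:(i+1)*q] = rng.standard_normal(q)   # Gibbs proposal from prior
        if np.log(rng.uniform()) < loglik(v) - loglik(u):
            u = v
    return u

u = rng.standard_normal(d)
samples = []
for t in range(4000):
    u = mwg_sweep(u)
    samples.append(u.copy())
post_mean = np.mean(samples[500:], axis=0)
```

For this conjugate setup the exact posterior is $\mathcal{N}(y/2, I/2)$, so the empirical mean of the chain should settle near $y/2$.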
In the work mentioned above, it is shown that MwG has dimension-independent performance if 1) the target is a Gaussian distribution, and its covariance or precision matrix is close to banded; 2) each component of $y$ has significant dependence only on a few components of $u$. It is also discussed there how to truncate the far-off-diagonal entries of the prior covariance or precision matrix. This so-called "localization" technique can simplify the computation of the proposal probability in MwG step 2), making its cost $O(1)$.
When applying MwG directly to the inverse problem described in Section 2, the computational cost for fully updating all blocks is $O(d^2)$. This is because in step 4), we need to evaluate the likelihood through either (7) or (8). This involves finding the numerical solution to (4), which requires executing the Euler–Maruyama or 4th-order Runge–Kutta method. Both methods require $O(d)$ computational complexity when the dimension of the differential equation (4) is $d$. Then, because step 4) is repeated for all blocks, the total computational cost is $O(d^2)$.
In summary, although the vanilla MwG only requires $O(1)$ iterates to converge to the posterior distribution, each individual iterate can cost $O(d^2)$. This can be less appealing than standard MCMC algorithms with optimal tuning. For example, RWM can converge to the posterior distribution within $O(d)$ iterations, while each iterate costs $O(d)$, so the total complexity is $O(d^2)$. In this paper, we demonstrate how to bring the computational cost of each MwG iterate down to $O(d)$. This leads to the optimal MCMC computational scalability.
3 Acceleration with local computation
3.1 Spatial propagation of local changes
Based on our previous discussion, the main computational cost of MwG takes place at step 4), when (9) is evaluated. We will discuss how to reduce this cost to $O(1)$ per block. In what follows, we fix the time index and the block index $i$, while the same procedure applies to all time indices and blocks.
First of all, it should be noted that when executing step 4), the value of $p(y\,|\,u)$ for the current iterate is already available from the previous execution of step 4). Likewise, we already have the values of the corresponding solution $X_t(u)$. It is only necessary to find $p(y\,|\,u')$, or equivalently $X_t(u')$. Note that we will often write $X_t(u)$ simply as $X_t$, since $u$ is the information we try to recover.
Our main observation here is that $u'$ differs from $u$ only at the $i$-th block. Since components in the SDE (4) exchange information only through local interactions, the differences between $X_t(u')$ and $X_t(u)$ are likely to be significant only for blocks that are close to the $i$-th block. In fact, we have the following proposition:
Under Assumption 1 and with any given positive time horizon $T$, for $t \in [0, T]$, we have
In particular, for any fixed small enough threshold $\epsilon$ and time range $T$, we can find a radius $r$ such that
We define the local domain centered at index $i$ as
\[
D_i = \{\, j : \mathrm{dist}(i, j) \le r \,\},
\]
and use $D_i^c$ to denote its complement. Then $X_{j,t}(u)$ is already a good approximation of $X_{j,t}(u')$ for $j \in D_i^c$. In other words, it is no longer necessary to recompute the solution for $j \in D_i^c$, and we only need to compute $X_{j,t}(u')$ for $j \in D_i$. Since $D_i$ contains at most $2r+1$ elements, and $r$ is likely to be independent of the problem dimension $d$, this provides us with a way to accelerate the overall computation of MwG.
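With 0-based indexing, the local domain can be formed directly (a small sketch; the function name `local_domain` is ours):

```python
def local_domain(i, r, m):
    """Block indices within distance r of block i on the m-torus,
    i.e. the index set D_i = { j : dist(i, j) <= r }."""
    return sorted({(i + k) % m for k in range(-r, r + 1)})
```

The set wraps around the torus and always contains at most 2r + 1 indices.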
3.2 Local surrogate
Next, we consider how to use the existing values of $X_t(u)$ to reduce the computation of $X_t(u')$. For this purpose, we introduce a local surrogate model given by
\[
d\widetilde X_{j,t} = f_j(\widetilde X_{j-1,t}, \widetilde X_{j,t}, \widetilde X_{j+1,t})\,dt + \sigma_t\,dW_{j,t}, \quad j \in D_i; \qquad \widetilde X_{j,t} = X_{j,t}(u), \quad j \in D_i^c.
\]
Its initial condition is set to be $\widetilde X_0 = u'$. We write its solution as $\widetilde X_t$.
Note that the local surrogate within $D_i$ depends on $X_t(u)$ only through the boundary blocks of $D_i$, of which the indices are usually just $i - r - 1$ and $i + r + 1$. Such dependence can be viewed as using $X_t(u)$ as spatial boundary conditions for the computation of $\widetilde X_t$.
We will use the local surrogate as an approximation of $X_t(u')$. It can be computed cheaply because it differs from $X_t(u)$ only for blocks inside the local domain $D_i$. The details of how to achieve this can be found in the subsequent parts. First of all, we investigate the approximation error through the following theorem:
Theorem 1 indicates that we can use $\widetilde X_t$ to approximate $X_t(u')$ with controlled accuracy. So, the quantity defined in (9) can be computed with $\widetilde X_t$ substituted for $X_t(u')$. When such a scheme for calculating (9) is plugged into MwG, it leads to acceleration in terms of computational complexity. We call the resulting method accelerated Metropolis-within-Gibbs (a-MwG) and present its pseudocode in Algorithm 1. Next, we discuss how to compute the local surrogate (12) with numerical methods, and how to implement parallelization.
3.3 Local computation with Euler-Maruyama
When the stochastic forcing is nonzero, our model (4) is a bona fide SDE. Euler–Maruyama is the standard numerical method for its computation.
In the vanilla MwG, if one applies Euler–Maruyama directly to the full model (4) to obtain $X_T(u)$, a small step size $h$ is chosen, and the value of $X_t$ is approximated through $\hat X_{t_k}$ at the grid points $t_k = kh$. The numerical solution is generated by the following iterations, starting from $\hat X_{t_0} = u$:
\[
\hat X_{t_{k+1}} = \hat X_{t_k} + f(\hat X_{t_k})\,h + \sigma_{t_k}\,\Delta W_k.
\]
Here, the increments $\Delta W_k$ are independent samples from $\mathcal{N}(0, h I_d)$. One can view the resulting grid values as a realization of the solution driven by a fixed Brownian path.
When applying Euler–Maruyama to the local surrogate (12), we assume numerical approximations of $X_{t_k}(u)$ are available as $\hat X_{t_k}$. These come either from the previous a-MwG iteration or from an initial computation before the first a-MwG iteration. Then, if the proposal $u'$ differs from $u$ at the $i$-th block, we set the local domain as $D_i$, and denote its complement as $D_i^c$. For blocks outside $D_i$, we directly use $\hat X_{t_k}$ as the numerical solution; that is, we let
\[
\widetilde X_{j, t_k} = \hat X_{j, t_k}, \qquad j \in D_i^c. \tag{17}
\]
New computation is needed for $\widetilde X_{j, t_k}$ with $j \in D_i$, which is obtained through the following iteration
\[
\widetilde X_{j, t_{k+1}} = \widetilde X_{j, t_k} + f_j(\widetilde X_{j-1, t_k}, \widetilde X_{j, t_k}, \widetilde X_{j+1, t_k})\,h + \sigma_{t_k}\,\Delta W_{j,k}, \qquad j \in D_i, \tag{18}
\]
with initial condition $\widetilde X_{t_0} = u'$. This procedure is how we obtain the corresponding update step in the a-MwG Algorithm 1. Note that in a-MwG, the Brownian motion realizations are fixed during the loops. So when implementing (18), one uses these fixed increment samples, instead of generating new independent samples for $\widetilde X$.
The reason we need to do the copying step (17) before the local Euler–Maruyama is that the values of $\widetilde X_{j, t_k}$ at the boundary blocks are needed in the computation of (18). This means that in a real implementation, (17) only needs to be executed for the boundary blocks $j = i - r - 1$ and $j = i + r + 1$. This can save additional computation time and storage in practice. We chose not to formulate (17) and (18) in this fashion only to simplify the notation.
Because (18) only requires execution for $j \in D_i$, the total cost of obtaining $\widetilde X_{t_k}$ is only $O(1)$ with respect to the dimension, which is one order cheaper than obtaining $X_T(u')$ directly with Euler–Maruyama. This partial computation naturally introduces an error. But similarly to Theorem 1, this error can be controlled through the local domain radius $r$, according to the following theorem:
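The copy-then-recompute pattern of (17)–(18) can be sketched as follows (our own minimal illustration with scalar blocks, a hypothetical linear nearest-neighbor drift, and fixed Brownian increments; `em_full` and `em_local` are our names). The stored path supplies the frozen boundary values, and only indices within distance $r$ of the perturbed block are recomputed:

```python
import numpy as np

rng = np.random.default_rng(4)
m, r = 40, 8                     # components (block size 1) and local radius
n_steps, dt = 20, 0.02
dW = 0.1 * np.sqrt(dt) * rng.standard_normal((n_steps, m))  # fixed increments

def drift(x):
    """Nearest-neighbor drift: component j interacts with j-1, j+1 (periodic)."""
    return -x + 0.1 * (np.roll(x, 1) + np.roll(x, -1))

def em_full(u):
    """Full Euler-Maruyama path, O(m) work per time step."""
    xs = np.empty((n_steps + 1, m))
    xs[0] = u
    for t in range(n_steps):
        xs[t + 1] = xs[t] + drift(xs[t]) * dt + dW[t]
    return xs

def em_local(xs_old, u_new, i):
    """Local update: recompute only indices within distance r of i,
    reusing the stored path xs_old as frozen boundary values."""
    D = [(i + k) % m for k in range(-r, r + 1)]
    xs = xs_old.copy()
    xs[0] = u_new                      # differs from xs_old[0] only at index i
    for t in range(n_steps):
        for j in D:
            jm, jp = (j - 1) % m, (j + 1) % m
            dft = -xs[t, j] + 0.1 * (xs[t, jm] + xs[t, jp])
            xs[t + 1, j] = xs[t, j] + dft * dt + dW[t, j]
    return xs

u = rng.standard_normal(m)
xs_old = em_full(u)
u_new = u.copy(); u_new[10] += 1.0     # one-block Gibbs-style perturbation
xs_loc = em_local(xs_old, u_new, 10)
err = np.max(np.abs(xs_loc - em_full(u_new)))   # perturbation decays past r
```

With this weak coupling the influence of the perturbed block decays rapidly with distance, so the local path is numerically indistinguishable from a full recomputation, while blocks far from the perturbation are copied verbatim.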
3.4 Local computation with Runge-Kutta
When the stochastic forcing is zero, our model (4) is an ODE. This means we can simply set the stochastic terms to zero in Algorithm 1, and it is not necessary to sample the Brownian motion. The 4th-order Runge–Kutta method (RK4) is the standard numerical method for ODE computation.
When applying RK4 to the full model for obtaining $\hat X_T(u)$ as a start, one runs the following iteration from $\hat X_{t_0} = u$:
\[
\begin{aligned}
k_1 &= f(\hat X_{t_k}), \qquad & k_2 &= f\big(\hat X_{t_k} + \tfrac{h}{2} k_1\big),\\
k_3 &= f\big(\hat X_{t_k} + \tfrac{h}{2} k_2\big), \qquad & k_4 &= f\big(\hat X_{t_k} + h\, k_3\big),\\
\hat X_{t_{k+1}} &= \hat X_{t_k} + \tfrac{h}{6}\big(k_1 + 2k_2 + 2k_3 + k_4\big). & &
\end{aligned} \tag{21}
\]
This step is repeated for all grid points $t_k$.
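For reference, one classical RK4 step in code (a generic sketch, not tied to the paper's notation), checked against the exact solution of $x' = -x$:

```python
def rk4_step(f, x, h):
    """One classical 4th-order Runge-Kutta step for x' = f(x)."""
    k1 = f(x)
    k2 = f(x + 0.5 * h * k1)
    k3 = f(x + 0.5 * h * k2)
    k4 = f(x + h * k3)
    return x + (h / 6.0) * (k1 + 2 * k2 + 2 * k3 + k4)

# Accuracy check on x' = -x, x(0) = 1: exact solution is exp(-t).
x, h = 1.0, 0.1
for _ in range(10):
    x = rk4_step(lambda s: -s, x, h)
```

After ten steps of size 0.1 the numerical value agrees with $e^{-1}$ to well within the 4th-order error.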
When applying RK4 to compute the local surrogate $\widetilde X_t$, we need not only the numerical approximations $\hat X_{t_k}$, but also the intermediate values $k_1, \dots, k_4$ at each grid point. For blocks outside the local domain, these values will be directly taken as $\hat X_{t_k}$ and its RK4 intermediate values:
New RK4 computation is only needed for blocks inside the local domain, and the surrogate is obtained through iterating the following step for all grid points $t_k$ and all blocks $j \in D_i$:
Here the intermediate values are defined in the same way as in (21); that is, every instance of the full-model solution is replaced by the local surrogate, and every instance of the full-model intermediate values is replaced by the surrogate intermediate values. As with the local computation for the Euler–Maruyama method, we note that (22) in reality is only needed for the boundary blocks, which can save both computation time and storage.
By completing the procedures described above, we obtain the local surrogate $\widetilde X_{t_k}$. If the proposal is accepted in the MH step, the surrogate will be used as the stored numerical solution in the next iteration. Since (23) is repeated only for blocks within the local domain, the overall cost is $O(1)$ with respect to the dimension.
While an approximation error analysis like Theorem 2 exists in principle, we do not provide a detailed proof. This is partly because RK4 is much more complicated than Euler–Maruyama. Moreover, being a 4th-order accurate method, RK4 produces solutions very close to the exact solution of (12), so the approximation error can be inferred directly from Theorem 1.
The a-MwG algorithm we propose can easily be adapted for parallelization to shorten the wall-clock computation time.
First of all, in Algorithm 1, every repetition over the Brownian motion realizations can be computed in parallel. This is the case because the realizations are assumed to be independent of each other.
Next, recall that in the local surrogate model (12), renewed computation is needed only for blocks in $D_i$, and it depends on $X_t(u)$ only at the boundary blocks $i - r - 1$ and $i + r + 1$. Now consider another block index $i^*$ such that $\mathrm{dist}(i, i^*) > 2r + 2$. Then its local domain and boundary blocks have no intersection with those of $i$. This means that if there are two Gibbs proposals, differing from the current iterate at the $i$-th block and the $i^*$-th block respectively, then the computation of