# Adaptive Approximation Error Models for Efficient Uncertainty Quantification with Application to Multiphase Subsurface Fluid Flow

Sample-based Bayesian inference provides a route to uncertainty quantification in the geosciences, though is very computationally demanding in the naïve form that requires simulating an accurate computer model at each iteration. We present a new approach that adaptively builds a stochastic model for the error induced by a reduced model. This enables sampling from the correct target distribution at reduced computational cost, while avoiding appreciable loss of statistical efficiency. We build on recent simplified conditions for adaptive Markov chain Monte Carlo algorithms to give practical approximation schemes and algorithms with guaranteed convergence. We demonstrate the efficacy of our new approach on two computational examples, including calibration of a large-scale numerical model of a real geothermal reservoir, that show good computational and statistical efficiencies on both synthetic and measured data sets.

There are no comments yet.

## Authors

• 14 publications
• 6 publications
• 2 publications
• ### Randomized Reduced Forward Models for Efficient Metropolis–Hastings MCMC, with Application to Subsurface Fluid Flow and Capacitance Tomography

Bayesian modelling and computational inference by Markov chain Monte Car...
09/17/2020 ∙ by Colin Fox, et al. ∙ 0

• ### SPUX: Scalable Particle Markov Chain Monte Carlo for uncertainty quantification in stochastic ecological models

Calibration of individual based models (IBMs), successful in modeling co...
11/04/2017 ∙ by Jonas Šukys, et al. ∙ 0

• ### Seismic Bayesian evidential learning: Estimation and uncertainty quantification of sub-resolution reservoir properties

We present a framework that enables estimation of low-dimensional sub-re...
05/14/2019 ∙ by Anshuman Pradhan, et al. ∙ 0

• ### A Metamodel of the Telemac Errors

A Telemac study is a computationally intensive application for the real ...
10/18/2019 ∙ by Fabrice Zaoui, et al. ∙ 0

• ### Large scale in transit computation of quantiles for ensemble runs

The classical approach for quantiles computation requires availability o...
05/10/2019 ∙ by Alejandro Ribes, et al. ∙ 0

• ### Probabilistic Integration: A Role in Statistical Computation?

A research frontier has emerged in scientific computation, wherein numer...
12/03/2015 ∙ by Francois-Xavier Briol, et al. ∙ 0

• ### Method G: Uncertainty Quantification for Distributed Data Problems using Generalized Fiducial Inference

It is not unusual for a data analyst to encounter data sets distributed ...
05/18/2018 ∙ by Randy C. S. Lai, et al. ∙ 0

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Characterizing subsurface flow in aquifers, geothermal and petroleum reservoirs often requires inferring unobserved subsurface properties from indirect observations made on the underlying system. This is a typical inverse problem. There are several fundamental difficulties associated with this process: data are corrupted by noise in the measurement process and often sparsely measured, the forward model only has a limited range of accuracy in representing the underlying system, and the parameters of interest are usually spatially distributed and highly heterogeneous, e.g., permeabilities and conductivities. These features make the inverse problem ill-posed [Hadamard, 1902, Kaipio and Somersalo, 2004]. This implies that there exists a range of feasible parameters that are consistent with the data, and hence a range of possible model predictions. Ill-posedness also causes the correlations between parameters to be extremely high, and the model output to occupy some low dimensional manifold in data space. These properties make the traditional, deterministic solution to the inverse problem very sensitive to measurement error and model error.

In this paper, we develop the ‘solution’ of the inverse problem in the framework of Bayesian inference, and present algorithmic advances for computing inferential solutions. Solution of inverse problems under the Bayesian framework is well established, with comprehensive examples including characterization of subsurface properties of aquifers and petroleum reservoirs such as [Oliver et al., 1997, Higdon et al., 2002, Higdon et al., 2003, Mondal et al., 2010], remote sensing such as [Cornford et al., 2004, Haario et al., 2004] and impedance imaging such as [Andersen et al., 2004, Watzenig and Fox, 2009]. By incorporating prior information and modelling uncertainties in the data observation process, all information regarding the solution of the inverse problem is coded into the posterior distribution over unknown parameters conditioned on the measured data.

Sample-based inference proceeds by computing Monte Carlo estimates of posterior statistics, to give ‘solutions’ and quantified uncertainties, using samples drawn from the posterior distribution via Markov chain Monte Carlo (MCMC) sampling. However, the applicability of this approach is fundamentally limited by the computational cost of the forward map and errors associated with the data modelling process. Difficulties arise from the high dimensionality of the model parameters, and computational demands of the forward map, which is usually dictated by numerical solutions of nonlinear partial differential equations (PDEs). In practice, it is computationally prohibitive to run standard MCMC for many iterations in realistic geophysical inverse problems.

Computational efficiency can be significantly improved by exploiting reduced models and recent advances in model errors. We present a novel adaptive delayed acceptance (ADA) algorithm that provides a computationally feasible route for sample-based inference in large-scale inverse problems. To speed up the computation, ADA implements the delayed acceptance (DA) MCMC Alg. [Christen and Fox, 2005] that takes advantage of a reduced model, coupled with recent advances in Bayesian inference for model errors [Kennedy and O’Hagan, 2001, Kaipio and Somersalo, 2007], and state-of-the-art techniques in adaptive MCMC algorithms [Roberts and Rosenthal, 2007].

Computational efficiency of a sampling algorithm is measured by the (smaller the) compute time required to evaluate Monte Carlo estimates to within a desired tolerance. Total compute time equals the compute cost per iteration multiplied by the number of samples required to achieve the desired tolerance; different MCMC algorithms will differ in both these measures, so it is necessary to consider both these measures, as detailed in Section 2.3.

We validate ADA on two models of geothermal reservoirs. The first is a 1D homogeneous model with 7 unknown parameters, and uses synthetic transient data. In the second example we predict the hot plume of a 3D multi-phase geothermal reservoir model by estimating the heterogeneous and anisotropic permeability distribution and the heterogeneous boundary conditions [Cui et al., 2011]. This model has unknown parameters, and each model simulation requires about 30 to 50 minutes of CPU time. ADA shows superior computational performance on both study cases.

This paper is structured as follows: Section 2 sets out the Bayesian formulation of inverse problems, and reviews existing sample-based inference including delayed-acceptance and adaptive algorithms. Section 3 presents deterministic and stochastic approximations to the forward map and posterior distribution that are used to reduce computational cost per iteration. Section 4 presents the ADA algorithm that utilizes the approximations developed in Section 3, and gives a proof of convergence. Section 5 presents two case studies of ADA on calibrating geothermal reservoir models. Section 6 offers some conclusions and discussion.

## 2 Bayesian Formulation and Posterior Exploration

In this section we review existing sample-based Bayesian methods for inverse problems that are relevant to the ADA algorithm developed in Section 4.

### 2.1 Bayesian Formulation

Suppose a physical system is simulated by a forward model and the unknowns of the physical system are parametrized by model parameters . The measurable features of the system are the image of the forward map at the true but unknown parameters , , called the true, noise-free data. Practical measurements are a noisy version of subject to measurement errors and other sources of imprecision. In an inverse problem, we wish to recover the unknown from measurements , and to make predictions about properties of the physical system, such as future, unobserved data.

Uncertainty in measured data and in the model leads to uncertain estimates of parameters

, and in subsequent predictions. What then is the range of permissible values, that is, the distribution over resulting estimates? The Bayesian formulation assigns probability distributions to each of the sources of error, and tracks the resulting distributions over quantities of interest. In our experience, this is the essential feature of the Bayesian method.

The Bayesian framework quantifies the distribution over parameters, consistent with the measured data, by the (unnormalized) posterior distribution

 π(x,γ,δ|~d)∝L(~d|x,γ)π(x,γ,δ), (1)

where is the likelihood function, giving the probability density of measuring data in identical measurement setups defined by , and

is the prior probability density function for model parameters

and (hyper)parameters and that represent uncertainties in the likelihood function and stochastic modelling of , respectively. We typically use hierarchical modelling to determine the functional forms of the likelihood function and prior distribution, writing in the usual case where uncertainties on measured data and model parameters are independent.

The likelihood function is determined by modelling the stochastic process that results in measured data for a given state of the system. Measured data is commonly assumed to be related to noise-free data by the additive error model

 ~d=F(x)+e, (2)

where the random vector

captures the measurement noise and other uncertainties such as model error. When

follows a zero mean multivariate Gaussian distribution the resulting likelihood function has the form

 L(~d|x,Σe)∝exp[−12{F(x)−~d}TΣ−1e{F(x)−~d}], (3)

where the hyperparameter

is the covariance matrix of the noise vector , with uncertainty in the covariance being modelled by the (hyper)prior distribution . Note that evaluating the likelihood function for a particular requires simulating the forward model , which is a computationally taxing numerical simulation of a mathematical model of the physical system.

The contribution of model error to the noise vector is usually non-negligible, in practice. This may be caused by discretization error in the computer implementation of the mathematical forward model and/or wrong assumptions in the mathematical model. We follow [Higdon et al., 2003] who observe that it may not possible to separate the measurement noise and model error in the case that only single set of data is available. Thus, it is necessary to incorporate the modeller’s judgments about the appropriate size and nature of the noise term . We also adopt the assumption of [Higdon et al., 2003] that the noise follows a zero mean Gaussian distribution, giving the likelihood function in Eqn. 3.

The primary unknowns of a physical system can often be represented by spatially distributed parameters, for example, the coefficients in a partial differential equation modelling energy and/or mass propagation. Correspondingly, representations and prior models are usefully drawn from spatial statistics, often in the form of hierarchical models [Banerjee et al., 2003]; see [Hurn et al., 2003] for a review. These include: (i) Gaussian Markov random field models [Besag, 1986, Higdon, 1998, Rue and Held, 2005] and Gaussian process models [Cressie, 1993] for low-level, pixel-based representations; (ii) Intermediate-level modelling that use a continuous random partition of the plane [Arak et al., 1993, Moller and Waagepetersen, 1998, Nicholls, 1998, Moller and Skare, 2001]; (iii) Object based high-level modelling such as the deformable template approaches of [Amit et al., 1991, Baddeley and van Lieshout, 1993, Grenander and Miller, 1994, Jain et al., 1998, Rue and Hurn, 1999]. Since the particulars of prior modelling are problem specific, we delay presenting detailed functional forms to the examples in Section 5.

Because the interaction between the primary parameter and hyperparameters and introduces significant computational difficulties [Rue and Held, 2005], here we set the value of hyperparameters based on expert opinion and field measurements. This yields the the posterior distribution

 π(x|~d)∝exp[−12{F(x)−~d}TΣ−1e{F(x)−~d}]π(x), (4)

that is used throughout the remainder of this paper.

Estimates of parameters and model predictions can be calculated as Monte Carlo estimates of the expected value of those quantities over the posterior distribution. For quantity , the estimate, denoted , is defined by

 E[g]=∫g(x)π(x|~d)dx≈¯¯¯gN=1NN∑i=1g(xi), (5)

using samples drawn from the posterior distribution, i.e., . Thus, the task of estimating parameters or predictive values is reduced to the task of drawing samples from the posterior distribution; this defines sample-based Bayesian inference. We present algorithms for sampling from in Sections 2.2 and 2.4, and novel efficient methods in Section 4.

### 2.2 Metropolis-Hastings Dynamics

The basis of the sampling methods we develop is the Metropolis Hasting (MH) algorithm [Metropolis et al., 1953, Hastings, 1970]

. This algorithm simulates a Markov chain of random variables, that converge in distribution to the posterior distribution

as the number of iterations . One initializes the chain at some starting state then iterate as in Alg. 1.

After a burn-in period, in which the chain effectively loses dependency on the starting state, the MH algorithm produces a sequence of correlated samples distributed as . Samples from the chain, typically after burn-in is discarded, may be substituted directly into the Monte Carlo estimate in Eqn. 5 to produce the estimate of quantity . The rate of distributional convergence, of to , depends on the degree of correlation [Sokal, 1989, Geyer, 1992]; chains that are fast to converge have lower correlation between adjacent samples.

The only choice one has in Alg. 1 is the choice of proposal distribution ; the choice is largely arbitrary, though has a significant influence on the rate of convergence. Traditionally, the proposal distribution is chosen from some simple family of distributions, then manually tuned in a “trial and error” manner to optimize the rate of convergence. We present automatic, adaptive methods for tuning the proposal in Sections 2.5 and 4.

### 2.3 Convergence and Statistical Efficiency

Sample-based inference returns in Eqn. 5, which is an estimate of the quantity of interest with some Monte Carlo error due to being finite. We consider convergence and computational efficiency in terms of the total computing cost required to evaluate the estimate to within a given error tolerance.

Convergence of the Monte Carlo estimate , in Eqn. 5

, is guaranteed by a central limit theorem (CLT)

[Kipnis and Varadhan, 1986] that gives as . Here, denotes the normal (or Gaussian) distribution with mean . Hence, asymptotic in sample size , almost surely, with accuracy when samples are used. When the samples are independent, it follows that

 Var(¯¯¯gN)=Var(g)N.

Since depends only on the quantity being estimated, this sets the fewest number of iterations that a practical MCMC algorithm will require to achieve a given accuracy.

When the are samples from a correlated Markov chain, as generated by any of the sampling algorithms discussed in this paper, instead (for large )

 Var(¯¯¯gN)=Var(g)N(1+2∞∑i=1ρgg(i)),

where is the autocorrelation coefficient for the chain in at lag

. Thus, the rate of variance reduction, compared to independent samples, is reduced by the factor

 τ=(1+2∞∑i=1ρgg(i)),

which is the integrated autocorrelation time (IACT) for the statistic  [Sokal, 1989]. We can think of being the length of the correlated chain that produces the same variance reduction as one independent sample. We call the quantity the statistical efficiency, while gives the number of effective (independent) samples for a Markov chain of length .

### 2.4 Delayed Acceptance and Computational Efficiency

We define the computational efficiency of MH to be the effective sample size per unit CPU time. Hence, as shown in the previous section, this is proportional to the rate of variance reduction in estimates per CPU time.

In our setting, applying standard MH can be computationally costly as the cost is dominated by the evaluation of the posterior density, which involves simulating of the forward map at proposed parameters . Instead, we consider using an approximated posterior, obtained by using some reduced model of the forward model, to improve the computational efficiency. We embed the approximation in the DA algorithm shown in Alg. 2 to obtain asymptotically unbiased MCMC estimates, while we can show that an appropriate approximation can increase the computational efficiency of MH.

The DA algorithm allows the approximation to depend on the state of the MCMC. Hence DA generalizes the surrogate transition method of [Liu, 2001] that uses a fixed approximate target distribution, or ‘surrogate’, that was reintroduced by [Efendiev et al., 2006]. The ability to use a state-dependent approximation turns out to be critical for improving computational efficiency in the applications that we consider.

DA uses two accept-reject steps. The first uses an approximation to the target distribution that can be relatively arbitrary, while the second accept-reject step ensures that the Markov chain correctly targets the desired distribution (see [Christen and Fox, 2005] for details). Note that there is never need to evaluate the integral in the definition of . The computational cost per iteration is reduced because only those proposals that are accepted using the approximation go on to evaluation of the posterior distribution , that requires evaluating the full, expensive forward map.

The DA Alg. 2 necessarily has lower statistical efficiency than the unmodified counterpart in Alg. 1, because of the additional acceptance/rejection in Step 2 [Christen and Fox, 2005]. That is, for any quantity . Fortunately, DA may still be more computationally efficient that the standard MH, as rejections in Step 1, which are based on an approximate posterior, can avoid computing costly posterior density evaluations on those proposed parameters that ought to be rejected.

Here we analyze the potential speed-up factor in computational efficiency of DA compared to the standard MH. Let the CPU time to evaluate the approximate posterior density and the exact posterior density be and , respectively. Suppose that the average acceptance probability in Step 1 of DA is . Since evaluation of the posterior density dominates the CPU time of MH, simulating iterations of the standard MH asymptotically costs CPU time. Using the same CPU time, DA can be simulated for iterations. This way, the effective sample size of DA is given by

 ESSDA=NtτDA(¯αt+t∗)=NτDA(¯α+t∗/t),

whereas the effective sample size of the standard MH is

 ESS=Nτ.

Thus, the speed-up factor of DA compared to the standard MH is

 ESSDAESS=ττDA1¯α+t∗/t. (7)

The constant is set by the choice of proposal. The factor counts the decrease in statistical efficiency by using DA instead of MH, while the factor gives the decrease in average compute cost per iteration. It is necessary to address both factors if one is to improve computational efficiency. That is, we want to be close to one and to be as small as possible.

Several forms of reduced models have been used in DA to approximate the posterior. For example, local linearized models have been used in [Christen and Fox, 2005] and model coarsening based on multiscale finite element is used in [Efendiev et al., 2006]. It can be challenging to balance the reduction in CPU time against accuracy of the reduced model. Using a lower accuracy reduced model, which runs faster compared to a more accurate one, will reduce the CPU time of MCMC per iteration, but at the risk of lower statistical efficiency since DA is more likely to reject the proposal in Step 2.

Here we present a systematic way to modify the statistical model of the likelihood function without changing the reduced model. By including the statistics of the numerical error of the reduced model in the likelihood, our modified approximate posterior can potentially improve the acceptance rate in step 2. This way, the statistical efficiency, and hence the speed-up factor, of the DA can be improved by using the same reduced model, at no extra cost.

### 2.5 Adaptive MCMC

A key ingredient of our modified likelihood is the use of adaptation to estimate the error statistics of the reduced model that is consistent with the posterior. Towards this goal, we design an adaptive delayed acceptance algorithm that both adaptively adjusts the proposal distribution, as in existing adaptive MCMC, and also adapts the error statistics in the likelihood using MCMC sample history. The latter is significantly different to existing adaptive MCMC algorithms, but follows the same regularity conditions established by [Roberts and Rosenthal, 2007].

Here, we review the classical adaptive Metropolis (AM) algorithm introduced by [Haario et al., 2001], shown in Alg. 3, and our modification that suits our case studies.

AM employs a random-walk Gaussian proposal , where the covariance is estimated adaptively from the history of the Markov chain. AM uses where is the empirical posterior covariance evaluated over iterations, with the scale chosen for an optimality criterion under certain technical conditions [Roberts et al., 1997, Roberts and Rosenthal, 2001]. The scaled empirical posterior covariance is mixed with a fixed Gaussian distribution, to avoid the situation that is singular.

AM has subsequently been extended in delayed rejection adaptive Metropolis (DRAM) [Haario et al., 2006], and the adaptive Metropolis within Gibbs (AMWG) [Roberts and Rosenthal, 2009]. A general class of adaptive algorithms was established by [Roberts and Rosenthal, 2007] with simplified regularity conditions required for ergodicity, namely simultaneous uniform ergodicity and diminishing adaptation. The ergodicity of many adaptive MCMC algorithms can be established using these simplified conditions.

We also present the grouped components adaptive Metropolis (GCAM) proposal of [Cui, 2010], which will be used in numerical examples.

In practice, the proposal (8) used by AM does not ensure sufficient statistical efficiency in the problems we consider. As shown by the numerical experiments in [Cui, 2010], the scaling used by AM does not produce the optimal acceptance rate of 0.234, and gives suboptimal statistical efficiency. However, the covariance matrix estimated from past samples provides useful information about the orientation of the target distribution. A more flexible scheme is to estimate the covariance matrix online and tune the scale variable separately as in AMWG, giving the GCAM in Alg. 4, that is effective for high dimensional problems. Suppose we have groups of components , with each group associated with a scale variable . Let be the number of elements of group , and .

In the above proposal distribution (9), gives the inverse of the largest variance among parameters in the group . This factor approximately normalizes the covariance matrix , and hence avoids the interaction of and during the adaptation. Such numerical treatment makes the scale variable stabilize faster. The algorithm is guaranteed to converge to the correct target distribution without this treatment, because it satisfies diminishing adaptation.

## 3 Approximations to the Forward Map and Posterior Distribution

Apart from the high dimensionality and complex nature of the posterior, a significant computational challenge arises from the high computational cost of evaluating the posterior density that entails computationally demanding numerical schemes used by the forward model . For example, the 3D geothermal reservoir model presented in Section 5 has about ten thousand parameters, and each model evaluation takes about 30 to 50 minutes CPU time. Hence, it is computationally infeasible to use the traditional MCMC Alg. 1 for this type of problem.

To improve the computational efficiency of MCMC as discussed previously, we consider to approximate the likelihood, and hence the posterior, by employing reduced models. The backbone of the approximation is a given reduced model built using existing techniques, including grid coarsening, e.g., [Christie and Blunt, 2001, Kaipio and Somersalo, 2007, Efendiev et al., 2006, Cui, 2010], linearization of the forward model e.g., [Christen and Fox, 2005], and projection-based methods e.g., [Benner et al., 2015, Ghasemi et al., 2015, Lieberman et al., 2015, Cui et al., 2015, Andrea et al., 2016]. By considering the statistics of the numerical error of the reduced model, we are able to develop deterministic and stochastic approximations to the forward map, which lead to modified approximated posteriors, for use in the DA Alg. 2.

Let denote the reduced model. We start with the common approximation to the posterior distribution that simply uses in place of the true forward map :

###### Approximation 1.

Approximate posterior distribution using in place of :

 π∗(x|~d)∝exp[−12{F∗(x)−~d}TΣ−1e{F∗(x)−~d}]π(x). (10)

The reduced model usually has a non-negligible discrepancy with the forward model, and hence the approximation in (10), by itself, can result in biased estimates while producing uncertainty intervals that are too small. This effect was investigated by [Kaipio and Somersalo, 2004, Kaipio and Somersalo, 2007], who noted that this is one way to perform an “inverse crime”. This indicates that the approximate posterior has displaced support and is too narrow to include the support of the true posterior.

Given a reduced model , Eqn. (2) can be expressed as

 ~d = F∗(x)+{F(x)−F∗(x)}+e (11) = F∗(x)+B(x)+e,

where

is the model reduction error between the true forward model and the reduced model. By assuming that the model reduction error is independent of the model parameters and normally distributed,

[Kaipio and Somersalo, 2007] introduced the approximation error model (AEM)

 ~d=F∗(x)+B+e,

with the further simplifying assumption that , and showed that this significantly improves the approximation of the posterior distribution compared to Approximation 1.

###### Approximation 2.

Approximate posterior distribution using the reduced model and AEM:

 π∗(x|~d)∝exp[−12{F∗(x)+μB−~d}T(ΣB+Σe)−1{F∗(x)+μB−~d}]π(x). (12)

### 3.1 Prior and Posterior Approximation Error Models

An a priori construction of the AEM is used in [Kaipio and Somersalo, 2007]. That is, before utilizing data and solving the inverse problem, the AEM is empirically estimated over the prior distribution. The resulting estimates for mean and covariance of are

 μB = ∫XB(x)π(x)dx (13) ≈ 1LL∑i=1B(xi), ΣB = ∫X{B(x)−μB}{B(x)−μB}Tπ(x)dx (14) ≈ 1L−1L∑i=1{B(xi)−μB}{B(xi)−μB}T,

where is defined by (11), and , are samples drawn from the prior distribution.

A drawback of evaluating and this way arises because the location of th bulk of probability of the prior distribution is typically quite different to that of the posterior distribution. Hence, the AEM may be reasonable over the prior distribution but the

samples may not include any samples with appreciable posterior probability so the AEM could be far from optimal over the support of the posterior distribution. Note also that this

a priori approach requires appreciable pre-computation to construct the AEM.

We improve the AEM, and hence make a better approximate posterior distribution, by empirically estimating the AEM over the posterior distribution. That is, we use

 μB = ∫XB(x)π(x∣~d)dx, (15) ΣB = ∫X{B(x)−μB}{B(x)−μB}Tπ(x∣~d)dx. (16)

This is achieved by evaluating within the DA Alg. 2, and applying adaptive MCMC methods to ensure that the a posteriori estimates of and converge to the true values given in Eqns 15 and 16. We also use the AEM within DA to define an approximate posterior distribution to reduce computational cost per iteration, while ensuring that the samples are drawn from the exact posterior distribution (4). A further advantage of this adaptive approach is that it does not require any pre-computation to estimate the AEM before setting up a MCMC simulation. Indeed, in all computational experiments we find that building the AEM over the posterior leads to better statistical efficiency in the MCMC, than when the AEM is estimated over the prior, and so gives a method that both does not require precompution and also gives a more computationally efficient MCMC.

### 3.2 State-dependent Approximation Error Models

The work of [Christen and Fox, 2005] demonstrated DA using a local linearization of the forward map as the approximate forward map, which is a local reduced model that depends on the current state of the MCMC. For the applications we present in Section 5, we utilize the existing Fortran code TOUGH2 [Pruess, 1991] to simulate the forward map. This package does not give access to derivatives (nor adjoints, etc) and so we form a reduced model by using coarsened discretizations. This reduced model depends only on the point at which it is evaluated, but not the state of the MCMC.

In many cases, including in the applications we consider here, both the true forward map and reduced model satisfy a smoothness criterion of the type for some and . Then, when using DA, a local improvement to the state-independent reduced model can be made, for little additional computational cost, by using the values of and for points that are accepted, and hence become the state of the chain. We define a deterministic state-dependent reduced model using a zeroth-order correction to the reduced model, as follows.

###### Approximation 3.

State-dependent reduced model and approximate posterior distribution: Suppose that at iteration , the Markov chain has state . For a proposed state , the state-dependent reduced model is

 F∗x(y)=F∗(y)+{F(x)−F∗(x)}. (17)

The resulting approximate posterior distribution is

 π∗x(y|~d)∝exp[−12{F∗x(y)−~d}TΣ−1e{F∗x(y)−~d}]π(y). (18)

It is worth mentioning that the state-dependent reduced model (17) has the desirable property that . More accurate state-dependent reduced models with this property are possible when higher order derivatives are available. The error structure of the state-dependent reduced model (17) can also be estimated by employing the AEM. In particular, Approximation 2 and 3 can be combined, at no significant increase in computational cost, to give a more accurate approximation to the posterior distribution. The state-dependent model reduction error, for the zeroth-order correction (17), is ; note that .

###### Approximation 4.

AEM built over the posterior distribution with a state-dependent reduced model: Suppose that at iteration the Markov chain is at state and a proposed state . The state-dependent approximate posterior distribution is given by

 π∗x(y|~d)∝exp[−12{F∗x(y)+μB−~d}T(ΣB+Σe)−1{F∗x(y)+μB−~d}]π(y). (19)

The mean and covariance of the AEM in Approximation 4 with reduced model (17) are

 μB=Eπ(x|~d){∫XBx(y)K(x,y)dy} (20)

and

 ΣB=Covπ(x|~d){∫XBx(y)K(x,y)dy}, (21)

respectively, where denotes the transition kernel implemented by the MCMC iteration.

The mean of the AEM (20) for reduced model (17) can be shown to be , as follows.

 Eπ(x|~d)[∫Bx(y)K(x,y)dy] =∫X∫X{F(y)−F∗(y)}π(x|~d)K(x,y)dydx −∫X∫X{F(x)−F∗(x)}π(x|~d)K(x,y)dydx

with the two terms on the right canceling because the kernel satisfies the detailed balance condition . Even though empirical estimates of , defined inductively by

 μB,n=1n{(n−1)μB,n−1+Bxn−1(xn)}, (22)

will not equal zero, we find that the estimates are close to zero and it is always advantageous to set in (19).

The covariance of the model reduction error can be computed adaptively at iteration by the inductive formula

 ΣB,n=1n−1{(n−2)ΣB,n−1+Bxn−1(xn)Bxn−1(xn)T}. (23)

The approximate posterior distribution (19) after steps of adaptive updating is then

 π∗n,x(y|~d)∝exp[−12{F∗x(y)−~d}T(ΣB,n+Σe)−1{F∗x(y)−~d}]π(x). (24)

## 4 Adaptive Delayed Acceptance Algorithm

The ADA algorithm combines the adaptive AEM built over the posterior, suggested in Section 3, and the adaptation of the proposal distribution as in existing adaptive algorithms in Section 2.5. Adaptive estimates of the AEM are made possible by using the DA algorithm, which provides the basic structure of ADA, and also the mechanism for reduction in compute cost per iteration. The use of adaptively improved stochastic approximations means the reduction in statistical efficiency, which is an unavoidable consequence of using DA, may be minimized so the reduction of compute cost per iteration is transferred directly into computational efficiency of the resulting algorithm.

In this algorithm, the proposal in step 1 and its adaptive update in step 4 may have the form of any of the adaptive algorithms, such as the AM Alg. 3. When Approximation 4 is used in step 1, updating of the approximate target distribution in step 3 uses the updating rules for and in Eqns 22 and 23, respectively.

### 4.1 Ergodicity Conditions and Main Result

We follow the notation in [Roberts and Rosenthal, 2007] to formalize ADA. In particular, we index distributions by adaptation indices, rather than iteration number, as in Alg. 5, since the former provides a unique notation for functions. To simplify the notation, let denote the exact posterior distribution.

Suppose is a fixed target distribution, defined on state space with -algebra . Let be a family of Markov chain transition kernels (associated with MH) on , and suppose that for all , is the unique stationary distribution. Let be a family of state-dependent approximations to the exact target distribution for all .

The adaptation indices and are associated with adaptation of the proposal and approximate target, respectively. For example, in Approximation 4 the adaptation indices are that define the AEM, while in the GCAM Alg. 4, is the set of scale variables. At each step , ADA updates and by a -valued random variable and a -valued random variable , respectively. The transition kernel of ADA is denoted by .

We prove the following theorem, that ergodicity of ADA can be guaranteed by imposing certain regularity conditions.

###### Theorem 1.

Consider an ADA algorithm, with target distribution defined on a state space , with -valued proposal adaptation index and -valued approximation adaptation index.

Suppose that for each , is the kernel of a MH algorithm targeting with proposal kernel having a density with respect to some finite reference measure , with corresponding density for so that . Similarly, for each , the state-dependent approximation has density such that, . Let be the transition kernel of the corresponding ADA algorithm using the approximation , proposal , and targeting . Suppose further that the following conditions hold:

1. The spaces , , and are compact.

2. Each transition kernel is ergodic for .

3. For all , is uniformly bounded, and the mapping is continuous.

4. The proposal distribution satisfies diminishing adaptation, that is,
in probability, where is the total variational norm.

5. The mapping is continuous.

6. The approximation adaptation index satisfies diminishing adaptation, that is, in probability.

Then ADA is ergodic for .

Proof: We prove this theorem by establishing the conditions of Theorem 1 of [Roberts and Rosenthal, 2007] for the composite adaptation index and proposal . First note that, since , , and are compact, all product spaces are compact in the product topology.

By Theorem 1 of [Christen and Fox, 2005], conditions (1), (2), and (5) imply that, for all the transition kernel is ergodic for .

The effective proposal in step 2 of ADA has density

 q∗γ,ξ(x,y)=αγ,ξ(x,y)qγ(x,y)+{1−rγ,ξ(x)}δx(y),

where

 αγ,ξ(x,y)=min{1,h∗ξ,x(y)qγ(y,x)h∗ξ,x(x)qγ(x,y)},

and is the probability of accepting a proposal from in step 1 of ADA, given by

 rγ,ξ(x)=∫Xαγ,ξ(x,y)qγ(x,y)λ(dy).

It follows from conditions (3) and (5) that is continuous, and that is continuous as in Corollary 5 of [Roberts and Rosenthal, 2007]111Note that continuity in is required in condition (3); the conditions in Corollary 5 of [Roberts and Rosenthal, 2007] are not quite sufficient for general proposal distributions..

Hence the probability of accepting a proposal from in both steps 1 and 2 of ADA is

 ργ,ξ(x)=∫Xβγ,ξ(x,y)αγ,ξ(x,y)qγ(x,y)λ(dy)

where the acceptance probability in step 2 of ADA,

 βγ,ξ(x,y)=min{1,h(y)αγ,ξ(y,x)qγ(y,x)h(x)αγ,ξ(x,y)qγ(x,y)},

is jointly continuous in , , and . It follows, as in Corollary 5 of [Roberts and Rosenthal, 2007], that satisfies the simultaneous uniform ergodicity condition in Theorem 1 of [Roberts and Rosenthal, 2007].

Diminishing adaptation of the overall transition kernel follows, as in Lemma 4.21 in [Łatuszyński et al., 2013], from diminishing adaptation of the proposal , which can be established using the triangle inequality treating adaptation indices and in separate steps. Diminishing adaptation of with respect to adaptation in follows directly from condition 3, again as Lemma 4.21 in [Łatuszyński et al., 2013]. Diminishing adaptation of with respect to adaptation in may be established using the inequality

 ∥q∗γ,ξn+1(x,⋅) −q∗γ,ξn(x,⋅)∥TV ≤ 2∫Xmin⎧⎨⎩qγ(x,y),∣∣ ∣∣h∗γ,ξn+1(y)h∗γ,ξn+1(x)−h∗γ,ξn(y)h∗γ,ξn(x)∣∣ ∣∣qγ(y,x)⎫⎬⎭λ(dy).

From conditions 1, 3 and 5 it follows that the RHS , uniformly, as . Diminishing adaptation of of with respect to adaptation in follows from Condition 6.

Thus the conditions in Theorem 1 of [Roberts and Rosenthal, 2007] are satisfied, and the result follows.

Then, we can use Theorem 1 to establish ergodicity of ADA using the approximation schemes in Section 3.

###### Corollary 1.

Suppose that the forward map and its reduced model are continuous functions, then the ADA Alg. 5 with either of the adaptive proposals in Section 2.5, and using any of Approximation 1 to 4 in Section 3 is ergodic for .

Proof: We treat Approximations 2 and 4, since Approximations 1 and 3 can be considered as special cases with zero AEM covariance.

Without loss of generality, we may assume that the parameter space is compact, as we can always define bounds on the input parameter to the computer model. Since and are continuous, it follows that the model reduction error or is compact. Thus, the space of possible and is compact, that is, is compact. Compactness of follows from the form of proposal adaptation in Algs 3 and 4, and existing results for such proposals, such as Corollary 6 of [Roberts and Rosenthal, 2007]. This establishes Condition (1) of Theorem 1.

Since and are continuous, Theorem 1 condition 5 follows from the form of Equations 12 and 19 when is nonsingular. Since is positive definite and is positive semi-definite, it follows that is actually positive definite.

The AEM updating rules (22) and (23) satisfies the diminishing adaptation condition (Condition (6) of Theorem 1), because the empirical estimates change , in probability, at the th iteration. Similarly, the proposal diminishing adaptation condition (Condition (3) of Theorem 1) follows from the form of adaptation in Algs 3 and 4.

The conditions in Theorem 1 hold, and so the result follows.

Note that ergodicity of ADA implies that the mean and covariance matrix, given in the updating rules (22) and (23), actually converge to the posterior expected values in Eqns (20) and (21), respectively.

## 5 Applications in geothermal reservoir models

In this section we apply ADA to two calibration problems for geothermal reservoir models. First, a one dimensional radial symmetry model of the feedzone of a geothermal reservoir with synthetic data is presented. This example is small enough that extensive statistics can be computed to validate the algorithm. Then ADA is applied to sample a 3D model with measured data. We begin with a description of the governing equations of the geothermal reservoir and its numerical simulator.

### 5.1 Data Simulation

Consider a two phase geothermal reservoir (water and vapour) governed by the general mass balance and energy balance equations:

 ddt∫ΩMαdV=∫∂ΩQα⋅ndΓ+∫ΩqαdV, (25)

where is the control volume and is its boundary. The accumulation term represents the mass () and heat sources or sinks in , and denotes the mass () or energy () flux through . The mass and energy within are represented by and .

A complex set of nonlinear partial differential equations such as multiphase Darcy’s law are used to model and . For brevity, we express these terms by the simplified functional relationship between the typical parameters and state of the system:

 (MmMe)=fM(ϕ;p,T),

and

 (QmQe)=fQ(k,krl,krv;p,T),

where is porosity,

is a diagonal second order permeability tensor in 3-dimensions, and

and are the relative permeabilities. These are the unknown spatially distributed parameters that we wish to determine. The state of system is represented by the spatially distributed quantities for a single phase system, or for a two phase system; these are only partially observable through wells. The subscripts and represent the liquid and vapour phases, respectively. Relative permeabilities and are introduced to account for the interference between liquid and vapour phases as they move through the rock matrix in the geothermal reservoir. There are several empirically derived curves available to model relative permeabilities as functions of vapour saturation . We use the van Genuchten-Mualem model [van Genuchten, 1980]:

 (krl,krv)=fvGM(m,Srl,Sls)

where , , and are hyper-parameters. For a complete description of the governing equations of multiphase geothermal reservoirs we refer to [Grant et al., 1982, O’Sullivan, 1985]. Spatial discretization of (25) is based on an integrated finite difference or finite volume method, implemented in the existing Fortran code TOUGH2 [Pruess, 1991].

### 5.2 Well Discharge Test Analysis

Well discharge analysis is usually used to interpret the near-well properties of the reservoir from pressure and enthalpy data measured during a short period of field production. Based on the typical assumption that all flows into the well come through a single layer feedzone, we build an one-dimensional radially-symmetric forward model with 640 blocks as shown in Fig. 1 (a). A high resolution grid is used immediately outside the well and the thickness increases exponentially outside this region. The reduced model is built by coarsening the grid of the forward model to a coarse grid with 40 blocks, see Figure 1 (b). Based on 1,000 simulations with different set of parameters on a DELL T3400 workstation, we estimate the CPU time of evaluating the forward model is 2.60 seconds. The computing time of the reduced model is 0.15 seconds, which is of that for the forward model.

The parameters of interest are the porosity, permeability, the hyperparameters in the van Genuchten-Mualem relative permeability model, as well as the initial vapour saturation () and initial pressure () that are used to represent the initial thermodynamic state of the two-phase system. These make up the seven unknowns required for data simulation:

 x={ϕ,log10(k),p0,Sv0,m,S%rl,Sls}.

Note that the permeability is represented on a base 10 logarithmic scale. These parameters are assumed to be independent and follow non-informative prior distributions with bounds given in Table 1. Table 1 also gives “true” values of model parameters used to simulate the synthetic data. The model is simulated over 80 days with production rates varying smoothly from about 4 kg/second to about 6 kg/second (see Figure 1 (c)). The corresponding model outputs are pressure and flowing enthalpy , which has the form of

 (dhdp)=F(x).

We assume the measurement noise follows i.i.d. zero mean Gaussian distribution with standard deviations for pressure and for the flowing enthalpy. The noise corrupted pressure and flowing enthalpy data are plotted in Figure 1 (d) and (e), respectively. This yields the posterior distribution

 π(x|~dh,~dp) ∝ exp⎡⎢⎣−12{F(x)−(~dh~dp)}T(σ2hIn00σ2pIn)−1{F(x)−(~dh~dp)}⎤⎥⎦ ×χ(x)

where is number of measurements, and is the prior distribution which we take to be the indicator function that implements the bounds in Table 1.

#### 5.2.1 MCMC sampling

To benchmark various approximate posterior distributions, we first run GCAM with one group of all 7 parameters and targeting the exact posterior distribution. Based on several short runs (not reported here), we find that an acceptance rate of gives optimal efficiency for this problem. This configuration of GCAM is run for iterations (after discarding burn-in steps) giving an IACT of the log-likelihood function estimated as . Then we then run ADA using Approximation 1, Approximation 2 with the AEM calculated a priori using Eqns (13) and (14), Approximation 2 with the AEM calculated adaptively over the posterior distribution converging to Eqns (15) and (16), and Approximation 4 with the AEM calculated adaptively over the posterior distribution. All cases used GCAM for the proposal with the target acceptance rate of . The acceptance rate in step 2 of ADA, , and the IACT of the likelihood function are shown in Table 2. We note that when Approximation 1 is used in ADA, the method is the equivalent method in [Efendiev et al., 2006], except that an adaptive proposal is also used.