Modelling of groundwater flow and transport is an important decision support tool when, for example, estimating the sustainable yield of an aquifer or remediating groundwater pollution. However, the input parameters for mathematical models of groundwater flow (such as subsurface transmissivity and boundary conditions) are often impossible to determine fully or accurately, and are hence subject to various uncertainties. In order to make informed decisions, it is of critical importance to decision makers to obtain robust and unbiased estimates of the total model uncertainty, which in turn is a product of the uncertainty of these input parameters. A popular way to achieve this, in relation to groundwater flow or any inverse problem in general, is stochastic or Bayesian modelling [2, 3, 4]. In this context, a probability distribution, the prior, is assigned to the input parameters, in accordance with any readily available information. Given some real-world measurements corresponding to the model outputs (e.g. sparse spatial measurements of hydraulic head, Darcy flow or concentration of pollutants), it is possible to reduce the overall uncertainty and obtain a better representation of the model by conditioning the prior distribution on this data. The result is a distribution of the model input parameters given data, which is also referred to as the posterior.
Obtaining samples from the posterior distribution directly is not possible for all but the simplest of problems. A popular approach for generating samples is the Metropolis–Hastings type Markov Chain Monte Carlo (MCMC) method. Samples are generated by a sequential process. First, given a current sample, a new proposal for the input parameters is made using a so-called proposal distribution. Evaluating the model with this new set of parameters, a likelihood is computed: a measure of misfit between the model outputs and the data. The likelihoods of the proposed and current samples are then compared. Based on this comparison, the proposal is either accepted or rejected, and the whole process is repeated, generating a Markov chain of probabilistically feasible input parameters. The key point is that the distribution of samples in the chain converges to the posterior – the distribution of input parameters given the data. This relatively simple algorithm can lead to extremely expensive Bayesian computations for three key reasons. First, each step of the chain requires the evaluation of an (often) expensive mathematical model. Second, the sequential nature of the algorithm means subsequent samples are often highly correlated – even repeated, if a step is rejected. Therefore the chains must often be very long to obtain good statistics on the distribution of outputs of the model. Third, without special care, the approach does not generally scale well to large numbers of uncertain input parameters: the so-called curse of dimensionality. Addressing these scientific challenges is at the heart of modern research in MCMC algorithms. As in this paper, there is a particular focus on developing novel and innovative proposal distributions, which seek to decorrelate adjacent samples and limit the computational burden of evaluating expensive models.
Broadly in the literature, simple Darcy-type models and other variants of the diffusion equation have long been popular toy problems for demonstrating MCMC methodologies in the applied mathematics community (see e.g. [6, 7, 8]). There appears to be much less interest in MCMC in the applied groundwater modelling community. This may be because of the computational cost of running MCMC on highly parametrised, expensive models, or the lack of an easy-to-use MCMC software framework akin to the parameter estimation toolbox PEST .
An exciting approach to significantly reducing the computational cost has been proposed in multi-level, multi-fidelity and Delayed Acceptance (DA) MCMC methods. In each case, to alleviate the computational cost, a hierarchy of models is established, consisting of a fine model and (possibly multiple) coarse, computationally cheap approximations. Typically, the coarser models are finite element solutions of the PDE on a mesh with a coarser resolution but, as we show in this paper, they can be taken to be any general approximation, similar to the multi-fidelity philosophy . Independent of the approach, the central idea is the same: to obtain significant efficiency gains by exploiting approximate coarse models to generate ‘good’ proposals cheaply, using additional accept/reject steps to filter out highly unlikely proposals before evaluating the fine, expensive model. Previous studies of two-stage approaches include , who modelled multi-phase flow with coarse level proposals evaluated by a coarse-mesh single-phase flow model (an idea that was developed further in ), as well as  and . We note that the latter, instead of simply using a coarser discretisation, implemented a data-driven polynomial chaos expansion as a surrogate model. We intend to demonstrate how the development of novel techniques in MCMC and machine learning can be combined to help realise the potential of MCMC in this field.
In this work, we propose a combination of multiple cutting-edge MCMC techniques to allow for efficient inversion and uncertainty quantification of groundwater flow. We propose an improved delayed acceptance (DA) MCMC algorithm, adapted from the approach proposed in . In our case, similarly to multi-level MCMC , proposals are generated by computing a subchain using a Deep Neural Network (DNN) as an approximate model, leading to cheaply computed, decorrelated proposals passed on to the fine model. For our first example, the subchain is driven by the preconditioned Crank-Nicolson (pCN) proposal distribution , to ensure the proposed Metropolis-Hastings algorithm is robust with respect to the dimension of the uncertain parameter space. For our second example, proposals for the subchains are generated using the Adaptive Metropolis (AM) proposal , since the posterior distribution in this case is highly non-spherical and multiple parameters are correlated. Finally, we propose an enhanced error model, in which the DNN is trained by sampling the prior distribution, yet the bias of the approximation is adaptively estimated and corrected on-the-fly by testing the approximations against the full model in an adaptive delayed acceptance setting .
In this section we briefly introduce the forward model, defining the governing equations underpinning groundwater flow and their corresponding weak form, enabling us to solve the equations using the finite element method (FEM). We then formulate our model as a Bayesian inverse problem with random input parameters, effectively resulting in a stochastic model, which can be accurately characterised by sampling from the posterior distribution of parameters using MCMC. The simple Metropolis-Hastings MCMC algorithm is then introduced and extended with the preconditioned Crank-Nicolson (pCN) and Adaptive Metropolis (AM) transition kernels.
2.1 Governing equations for groundwater flow
Consider steady groundwater flow in a confined, inhomogeneous aquifer which occupies the domain $\Omega$ with boundary $\partial\Omega$. Assuming that water is incompressible, the governing equations for groundwater flow can be written as the scalar elliptic partial differential equation
$$-\nabla \cdot \big( T \nabla h \big) = f \quad \text{in } \Omega,$$
subject to boundary conditions on $\partial\Omega$ defined by the constraint equations
$$h = h_D \ \text{on } \Gamma_D, \qquad q \cdot n = q_N \ \text{on } \Gamma_N.$$
Here $T$ is the heterogeneous, depth-integrated transmissivity, $h$ is the hydraulic head, $h_D$ is the fixed hydraulic head on boundaries with Dirichlet constraints, $f$ represents fluid sources and sinks, $q = -T \nabla h$ is the Darcy velocity, $q_N$ is the Darcy velocity across boundaries with Neumann constraints, and $\Gamma_D$ and $\Gamma_N$ define the boundaries comprising Dirichlet and Neumann conditions, respectively. Following standard FEM practice (see e.g. ), eq. (1) is converted into weak form by multiplying by an appropriate test function $v$ and integrating by parts, so that
$$\int_\Omega T\, \nabla h \cdot \nabla v\ \mathrm{d}x = \int_\Omega f\, v\ \mathrm{d}x - \int_{\Gamma_N} q_N\, v\ \mathrm{d}s,$$
where $v \in H_0^1(\Omega)$, and $H^1(\Omega)$ is the Hilbert space of weakly differentiable functions on $\Omega$. To approximate the hydraulic head solution $h$, we introduce a finite element space $V_h \subset H^1(\Omega)$ on a finite element mesh $\mathcal{T}_h$. This is defined by a basis of piecewise linear Lagrange polynomials $\{\phi_j\}_{j=1}^{n}$, associated with each of the $n$ finite element nodes. As a result, (3) can be rewritten as a system of sparse linear equations
$$A \mathbf{h} = \mathbf{b},$$
where $A$ and $\mathbf{b}$ are the global stiffness matrix and load vector, respectively. The vector $\mathbf{h}$ is the solution vector of hydraulic head at each node within the finite element mesh, so that
$$h(x) \approx \sum_{j=1}^{n} h_j\, \phi_j(x).$$
In our numerical experiments, these equations are solved using the open source general-purpose FEM framework FEniCS . While there are well-established groundwater simulation software packages available, such as MODFLOW  and FEFLOW , FEniCS was chosen because of its flexibility and ease of integration with other software and analysis codes.
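The paper's experiments use FEniCS in two dimensions; purely as a self-contained illustration of the assembly described above, the following sketch assembles and solves a simplified one-dimensional analogue of eq. (1) with piecewise linear elements (all function and variable names here are our own, not from the paper):

```python
import numpy as np

def solve_darcy_1d(transmissivity, h_left=1.0, h_right=0.0, n_elements=50):
    """Minimal 1D analogue of the Darcy problem -(T h')' = 0 with
    piecewise-linear finite elements and Dirichlet ends.
    `transmissivity` is a callable T(x), evaluated at element midpoints."""
    n_nodes = n_elements + 1
    x = np.linspace(0.0, 1.0, n_nodes)
    dx = x[1] - x[0]
    A = np.zeros((n_nodes, n_nodes))
    b = np.zeros(n_nodes)
    # Assemble the global stiffness matrix from 2x2 element matrices
    for e in range(n_elements):
        T_e = transmissivity(0.5 * (x[e] + x[e + 1]))
        k_e = (T_e / dx) * np.array([[1.0, -1.0], [-1.0, 1.0]])
        A[e:e + 2, e:e + 2] += k_e
    # Impose Dirichlet boundary conditions by row replacement
    for node, value in ((0, h_left), (n_nodes - 1, h_right)):
        A[node, :] = 0.0
        A[node, node] = 1.0
        b[node] = value
    return x, np.linalg.solve(A, b)
```

With a constant transmissivity and no sources, the discrete solution reproduces the exact linear head profile between the two Dirichlet values, which makes a convenient sanity check for the assembly.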
2.2 Aquifer transmissivity
The aquifer transmissivity is not known everywhere on the domain; a typical approach is therefore to model it as a log-Gaussian random field. There exists extensive literature on modelling groundwater flow transmissivity using log-Gaussian random fields (see e.g. [22, 23, 14]). This may not always prove a good model, particularly in cases with highly correlated extreme values and/or preferential flow paths [24, 25], as seen when considering faults and other discontinuities [26, 27]. However, the log-Gaussian distribution remains relevant for modelling transmissivity in a range of aquifers [28, 29, 14].
Our starting point is a covariance operator with kernel $k(x, x')$, which defines the correlation structure of the uncertain transmissivity field. For our numerical experiments, we consider the ARD (Automatic Relevance Determination) squared exponential kernel, a generalisation of the ‘classic’ squared exponential kernel, which allows for handling directional anisotropy:
$$k(x, x') = \exp\left( -\sum_{i=1}^{d} \frac{(x_i - x'_i)^2}{2 \ell_i^2} \right),$$
where $d$ is the spatial dimensionality of the problem and $\ell = (\ell_1, \dots, \ell_d)$ is a vector of length scales corresponding to each spatial dimension. We emphasise that the covariance kernel is a modelling choice, and that different options are available, such as the Matérn kernel, which offers additional control over the smoothness of the field.
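The ARD kernel above can be evaluated on a set of points with a few lines of NumPy (the function name and vectorised form are our own):

```python
import numpy as np

def ard_sq_exp_kernel(X, lengthscales):
    """ARD squared exponential kernel matrix for points X (n x d).
    `lengthscales` is a length-d vector; equal entries recover the
    isotropic 'classic' squared exponential kernel."""
    Z = X / np.asarray(lengthscales)   # scale each spatial dimension
    sq_norms = np.sum(Z**2, axis=1)
    # Pairwise squared distances via the expansion |a-b|^2 = |a|^2 + |b|^2 - 2ab
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * Z @ Z.T
    return np.exp(-0.5 * np.maximum(sq_dists, 0.0))
```

The `np.maximum` guards against tiny negative distances from floating-point cancellation, so the diagonal is exactly one.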
In our work, transmissivity was modelled as a discrete log-Gaussian random field, expanded in an orthogonal eigenbasis of Karhunen–Loève (KL) eigenmodes. To achieve this we construct a covariance matrix $C$, with entries given by $C_{ij} = k(x_i, x_j)$ for each pair of nodal coordinates $x_i, x_j$ within the finite element mesh $\mathcal{T}_h$. Once constructed, the largest $R$ eigenvalues $\{\lambda_r\}_{r=1}^{R}$ and associated eigenvectors $\{\psi_r\}_{r=1}^{R}$ of $C$ can be computed. The log-transmissivity at the nodes is then given by
$$\log T = \mu + \sigma \sum_{r=1}^{R} \sqrt{\lambda_r}\, \psi_r\, \theta_r,$$
where the vector $\mu$ defines the log of the mean transmissivity field, $\sigma$ is a scalar parameterising the variance, and $\theta = (\theta_1, \dots, \theta_R)$ is a vector of Gaussian random variables such that $\theta_r \sim \mathcal{N}(0, 1)$, as in . The random field can be interpolated from the nodal values across $\Omega$ using the shape functions $\phi_j$, so that
$$\log T(x) \approx \sum_{j=1}^{n} (\log T)_j\, \phi_j(x).$$
Truncating the KL expansion at the $R$th mode and interpolating the field both have a smoothing effect on the recovered transmissivity fields, which may or may not be desirable, depending on the application. Figure 1 shows some examples of realisations of Gaussian random fields with a squared exponential kernel, which illustrates the effect of the covariance length scale and the number of admitted KL eigenmodes.
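The KL construction above can be sketched as follows, assuming an isotropic squared exponential kernel for brevity (all names are illustrative, not from the paper):

```python
import numpy as np

def kl_modes(coords, lengthscale, n_modes):
    """Leading Karhunen-Loeve modes of a squared exponential covariance
    evaluated at the given nodal coordinates (n x d array)."""
    d2 = np.sum((coords[:, None, :] - coords[None, :, :])**2, axis=-1)
    C = np.exp(-0.5 * d2 / lengthscale**2)
    eigvals, eigvecs = np.linalg.eigh(C)          # ascending order
    idx = np.argsort(eigvals)[::-1][:n_modes]     # keep the largest modes
    return np.maximum(eigvals[idx], 0.0), eigvecs[:, idx]

def sample_log_transmissivity(mu, sigma, eigvals, eigvecs, theta):
    """log T = mu + sigma * sum_r sqrt(lambda_r) * psi_r * theta_r."""
    return mu + sigma * eigvecs @ (np.sqrt(eigvals) * theta)
```

Drawing `theta` from a standard normal yields one realisation of the nodal log-transmissivity field; truncating `n_modes` controls the smoothness, as discussed above.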
2.3 The Bayesian inverse problem
To set up the Bayesian inverse problem and thereby quantify the uncertainty in the transmissivity field, the starting point is to define a statistical model which describes the distribution of the mismatch between observations and model predictions. The observations are expressed in a single vector $d_{obs}$, and for a given set of model input parameters $\theta$, the model’s prediction of the data is defined by the forward map $F(\theta)$. The statistical model assumes the connection between model and observations through the relationship
$$d_{obs} = F(\theta) + \epsilon,$$
where we take $\epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)$, representing the uncertainty in the connection between model and data and capturing both model mis-specification and measurement noise.
The backbone of a Bayesian approach is Bayes’ theorem, which allows for computing posterior beliefs of model parameters using both prior beliefs and observations. Bayes’ theorem states that the posterior probability of a parameter realisation $\theta$ given data $d_{obs}$ can be computed as
$$\pi(\theta \mid d_{obs}) = \frac{\mathcal{L}(d_{obs} \mid \theta)\, \pi(\theta)}{\pi(d_{obs})},$$
where $\pi(\theta \mid d_{obs})$ is referred to as the posterior distribution, $\mathcal{L}(d_{obs} \mid \theta)$ is called the likelihood, $\pi(\theta)$ the prior distribution, and
$\pi(d_{obs}) = \int \mathcal{L}(d_{obs} \mid \theta)\, \pi(\theta)\, \mathrm{d}\theta$ is a normalising constant, sometimes referred to as the evidence. In most cases this integral does not have a closed-form solution and is infeasible to estimate numerically in most real-world applications, particularly when the dimension of the unknown parameter space is large and the evaluation of the model (required to compute $\mathcal{L}$) is computationally expensive. A family of methods called Markov Chain Monte Carlo (MCMC) is often employed to approximate the posterior. Importantly, MCMC, whilst computationally expensive, allows indirect sampling from the posterior distribution and avoids the explicit need to estimate (10). Moreover, it can be designed to be independent of the dimension of the parameter space and has no embedded unquantifiable bias. In this paper we consider a subclass of MCMC methods, the Metropolis-Hastings [32, 33, 5] algorithm, which is described in Algorithm 1. The algorithm generates a Markov chain whose distribution converges to the posterior. It is difficult (often impossible) to sample directly from the posterior; hence at each step, at position $i$ in the chain, a proposal $\theta'$ is made from a simpler, known (proposal) distribution $q(\theta' \mid \theta_i)$. An accept/reject step then determines whether the proposal (probabilistically) comes from the posterior distribution or not. This accept/reject step is achieved by computing the ratio of the posterior densities of the proposal and the current state. To do this we exploit Bayes’ theorem. The key observation in MCMC is that the normalising constant $\pi(d_{obs})$ is independent of $\theta$, and so
$$\frac{\pi(\theta' \mid d_{obs})}{\pi(\theta_i \mid d_{obs})} = \frac{\mathcal{L}(d_{obs} \mid \theta')\, \pi(\theta')}{\mathcal{L}(d_{obs} \mid \theta_i)\, \pi(\theta_i)}.$$
Therefore, when comparing the ratio of the densities, the normalising constant cancels, since it is independent of $\theta$.
Algorithm 1: Metropolis-Hastings Algorithm
Given a parameter realisation $\theta_i$ and a transition kernel $q(\theta' \mid \theta_i)$, generate a proposal $\theta'$.
Compute the likelihood ratio between the proposal and the previous realisation:
$$r = \frac{\mathcal{L}(d_{obs} \mid \theta')\, \pi(\theta')\, q(\theta_i \mid \theta')}{\mathcal{L}(d_{obs} \mid \theta_i)\, \pi(\theta_i)\, q(\theta' \mid \theta_i)}.$$
Set $\theta_{i+1} = \theta'$ with probability $\alpha = \min(1, r)$; otherwise set $\theta_{i+1} = \theta_i$.
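Algorithm 1 can be sketched on a hypothetical one-dimensional target with a symmetric random-walk proposal, in which case the kernel ratio cancels (all names below are illustrative, not from the paper):

```python
import numpy as np

def metropolis_hastings(log_post, theta0, n_steps, step=0.5, seed=0):
    """Random-walk Metropolis-Hastings for a toy 1D log-posterior.
    The Gaussian proposal is symmetric, so q cancels in the ratio."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_steps)
    theta, lp = theta0, log_post(theta0)
    for i in range(n_steps):
        prop = theta + step * rng.standard_normal()
        lp_prop = log_post(prop)
        # Accept with probability min(1, exp(lp_prop - lp))
        if np.log(rng.uniform()) < lp_prop - lp:
            theta, lp = prop, lp_prop
        chain[i] = theta       # rejected steps repeat the current state
    return chain
```

Running this on a standard normal log-posterior, the chain's long-run mean and standard deviation approach 0 and 1, illustrating convergence to the target distribution.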
In our model problem, the prior density $\pi(\theta)$ represents the available a priori knowledge about the transmissivity of the aquifer. From our statistical model (8) we see that $d_{obs} - F(\theta) = \epsilon \sim \mathcal{N}(0, \Sigma_\epsilon)$, hence
$$\mathcal{L}(d_{obs} \mid \theta) \propto \exp\left( -\tfrac{1}{2} \big( d_{obs} - F(\theta) \big)^{\top} \Sigma_\epsilon^{-1} \big( d_{obs} - F(\theta) \big) \right).$$
Importantly, we note that each step of the Metropolis-Hastings algorithm requires us to compute $\mathcal{L}(d_{obs} \mid \theta')$. This requires the evaluation of the forward map $F(\theta')$, which can be computationally expensive. Moreover, due to the sequential nature of MCMC-based approaches, consecutive samples are correlated, and hence many samples are required to obtain good statistics on the outputs.
The proposal distribution is the key element which drives the Metropolis-Hastings algorithm and controls its effectiveness. A common choice is a simple random walk, for which $q(\theta' \mid \theta_i) = \mathcal{N}(\theta_i, \beta^2 I)$, yet as shown in , the basic random walk does not lead to convergence that is independent of the input dimension. Better choices are the preconditioned Crank-Nicolson proposal (pCN, ), which has a dimension-independent acceptance probability, and the Adaptive Metropolis algorithm (AM, ), which adaptively aligns the proposal distribution to the posterior during sampling. Moreover, unlike the Metropolis-Adjusted Langevin Algorithm (MALA), the No-U-Turn Sampler (NUTS) and Hamiltonian Monte Carlo, none of these proposals rely on gradient information, which can be infeasible to compute for expensive forward models.
To generate a proposal using the pCN transition kernel, one computes
$$\theta' = \sqrt{1 - \beta^2}\, \theta_i + \beta\, \xi,$$
where $\xi \sim \mathcal{N}(0, C)$ is a random sample from the prior distribution and $\beta \in (0, 1]$ is a tunable step size. This expression corresponds to the transition kernel $q(\theta' \mid \theta_i) = \mathcal{N}\big( \sqrt{1 - \beta^2}\, \theta_i,\ \beta^2 C \big)$. Moreover, for the pCN transition kernel, the acceptance probability simplifies to
$$\alpha = \min\left( 1,\ \frac{\mathcal{L}(d_{obs} \mid \theta')}{\mathcal{L}(d_{obs} \mid \theta_i)} \right).$$
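A single pCN step for a standard normal prior can be sketched as follows; since the prior terms cancel for this kernel, only the likelihood ratio enters the accept test (the helper name and log-likelihood interface are our own):

```python
import numpy as np

def pcn_step(theta, log_lik, lp, beta, rng):
    """One pCN Metropolis step, assuming a standard normal prior.
    Returns (state, log-likelihood, accepted-flag)."""
    xi = rng.standard_normal(theta.shape)            # prior draw, N(0, I)
    prop = np.sqrt(1.0 - beta**2) * theta + beta * xi
    lp_prop = log_lik(prop)
    # Accept with probability min(1, L(prop)/L(theta)); prior cancels
    if np.log(rng.uniform()) < lp_prop - lp:
        return prop, lp_prop, True
    return theta, lp, False
```

Note that with `beta = 1` the proposal is an independent draw from the prior, while small `beta` yields small, highly correlated moves; tuning `beta` trades acceptance rate against step size.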
Similarly, to generate a proposal using the AM transition kernel, we draw a random sample
$$\theta' \sim \mathcal{N}(\theta_i, C_i),$$
where $C_i$ is an iteratively updated covariance structure
$$C_i = \begin{cases} C_0 & i \le i_0, \\ s_d\, \mathrm{Cov}(\theta_0, \dots, \theta_i) + s_d\, \epsilon\, I_d & i > i_0. \end{cases}$$
Hence, proposals are drawn from a distribution with an initial covariance $C_0$ for a given period $i_0$, after which adaptivity is ‘switched on’, and $C_i$ is used for the remaining samples. The adaptive covariance can be constructed iteratively during sampling using the following recursive formula:
$$C_{i+1} = \frac{i-1}{i}\, C_i + \frac{s_d}{i} \left( i\, \bar\theta_{i-1} \bar\theta_{i-1}^{\top} - (i+1)\, \bar\theta_i \bar\theta_i^{\top} + \theta_i \theta_i^{\top} + \epsilon\, I_d \right),$$
where $\bar\theta_i$ is the arithmetic mean of $\theta_0, \dots, \theta_i$, $s_d = 2.4^2/d$ is a scaling parameter, $d$ is the dimension of the proposal distribution and $\epsilon > 0$ is a parameter which prevents $C_i$ from becoming singular . This, on the other hand, corresponds to the transition kernel $q(\theta' \mid \theta_i) = \mathcal{N}(\theta_i, C_i)$, which is not guaranteed to be ergodic, since it depends on the history of the chain. However, the Diminishing Adaptation condition  holds, as adaptation naturally decreases as sampling progresses.
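The adaptive covariance can be maintained online; the sketch below tracks a running empirical mean and covariance (a numerically stable equivalent of the recursion, with the scaling and regularisation applied when the proposal covariance is requested; class and method names are our own):

```python
import numpy as np

class AdaptiveCovariance:
    """Online mean/covariance tracker for an AM-style proposal,
    using a Welford-type update of the sum of outer products."""
    def __init__(self, d, s_d=None, eps=1e-6):
        self.d = d
        self.s_d = s_d if s_d is not None else 2.4**2 / d
        self.eps = eps
        self.n = 0
        self.mean = np.zeros(d)
        self.M2 = np.zeros((d, d))   # sum of outer products of deviations

    def update(self, theta):
        """Fold one chain state into the running statistics."""
        self.n += 1
        delta = theta - self.mean
        self.mean += delta / self.n
        self.M2 += np.outer(delta, theta - self.mean)

    def proposal_cov(self):
        """Scaled, regularised covariance for the AM proposal."""
        emp = self.M2 / max(self.n - 1, 1)
        return self.s_d * (emp + self.eps * np.eye(self.d))
```

The running estimate matches the batch sample covariance exactly, so the proposal covariance converges as the chain grows, which is what makes the Diminishing Adaptation condition plausible here.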
2.4 Deep Neural Network
The approximate/surrogate model in our experiments is a feed-forward deep neural network (DNN), a type of artificial neural network with multiple hidden layers, as implemented in the open-source neural-network library Keras  utilizing the Theano backend .
The DNN approximates the forward map, accepting a vector of KL coefficients $\theta$ and returning an approximation of the vector of model outputs – in this paper, a vector of hydraulic heads at the given sampling points, i.e. $\tilde F(\theta) \approx F(\theta)$. Figure 2 shows the graph of one particular DNN employed in our experiments.
Each edge in Figure 2 is equipped with a weight $w^{(l)}_{jk}$, where $l$ is the index of the layer that the weight feeds into, $j$ is the index of the node in that layer and $k$ is the index of the node in the previous layer. These weights can be arranged in matrices $W^{(l)}$ for each layer $l$. Similarly, each node is equipped with a bias $b^{(l)}_j$, where $l$ is the index of its layer and $j$ is the index of the node, and these biases can be arranged in vectors $b^{(l)}$. Data is propagated through the network such that the output of a layer $l$ with activation function $g^{(l)}$ is
$$z^{(l)} = g^{(l)}\big( W^{(l)} z^{(l-1)} + b^{(l)} \big).$$
Activation functions are applied element-wise on their input vectors.
Many different activation functions are available for artificial neural networks; here we give a short description of the ones employed in our experiments: the sigmoid and the rectified linear unit (‘ReLU’). The transfer function of the nodes in the first layer of each DNN was the sigmoid,
$$g(x) = \frac{1}{1 + e^{-x}},$$
squashing the input into the interval $(0, 1)$, effectively resulting in a strictly positive output from the first hidden layer. The remaining hidden layers consisted of nodes with the de facto standard hidden layer activation function for deep neural networks, the rectified linear unit (‘ReLU’):
$$g(x) = \max(0, x).$$
To fit an artificial neural network to a given set of data, the network is initially compiled using random weights and biases and then trained using a dataset of known inputs and their corresponding outputs. The weights and biases are updated iteratively during training by way of an appropriate optimisation algorithm and a loss function, and, if appropriately set up, will converge towards a set of optimal values, allowing the network to predict the response of the forward model to some level of accuracy. Our particular DNNs were trained using the mean squared error (MSE) loss function
$$\mathrm{MSE} = \frac{1}{m} \sum_{i=1}^{m} \big( y_i - \tilde y_i \big)^2,$$
where $m$ is the number of output variables, and the RMSprop optimiser, a stochastic, gradient-based and adaptive algorithm, suggested by  and widely used for training DNNs.
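The paper trains its networks in Keras; purely to make the forward propagation equations concrete, here is a NumPy sketch of the described architecture (sigmoid first hidden layer, ReLU thereafter, linear output; the layer sizes in the test are hypothetical):

```python
import numpy as np

def sigmoid(x):
    """Element-wise logistic sigmoid, squashing into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def relu(x):
    """Element-wise rectified linear unit."""
    return np.maximum(0.0, x)

def dnn_forward(theta, weights, biases):
    """Forward pass z_l = g_l(W_l z_{l-1} + b_l): sigmoid on the first
    hidden layer, ReLU on the remaining hidden layers, linear output."""
    z = sigmoid(weights[0] @ theta + biases[0])
    for W, b in zip(weights[1:-1], biases[1:-1]):
        z = relu(W @ z + b)
    return weights[-1] @ z + biases[-1]
```

Training (the iterative update of `weights` and `biases` against the MSE loss) is what Keras and RMSprop provide; this sketch only covers the prediction step used inside the MCMC subchains.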
3 Adaptive Delayed Acceptance Proposal using a Deep Neural Network
In this section we describe a modified adaptive delayed acceptance proposal for MCMC, using ideas from multi-level MCMC . The general approach generates proposals by running Markov subchains driven by an approximate model. In our case, this approximation is constructed from a DNN of the forward map, trained offline on samples of the prior distribution. Finally, we show how the approximate map can be corrected online, by adaptively learning a simple multivariate Gaussian correction to the outputs of the neural network.
3.1 Modified Delayed Acceptance MCMC
Delayed Acceptance (DA)  is a technique that exploits a model hierarchy consisting of an expensive fine model and a relatively inexpensive coarse approximation. The idea is simple: a proposal is first evaluated (pre-screened) by the approximate model and immediately discarded if it is rejected. Only if accepted is it subjected to a second accept/reject step using the fine model. In this context, the likelihood of observations given a parameter set is henceforth denoted $\tilde{\mathcal{L}}(d_{obs} \mid \theta)$ when evaluated on the approximate model and remains $\mathcal{L}(d_{obs} \mid \theta)$ when evaluated on the fine model. This simple screening mechanism cheaply filters out poor proposals, wasting minimal time evaluating unlikely proposals on the expensive, fine model. In this paper we extend this approach by not evaluating every accepted approximate proposal with the fine model. Instead, a proposal for the fine model is generated by running an approximate subchain until a given number of approximate proposals have been accepted, and only then evaluating it using the fine model. This modified Delayed Acceptance MCMC algorithm is described in Algorithm 2 and an illustration of the process is given in Figure 3.
Algorithm 2: Modified Delayed Acceptance MCMC
Given a realisation of the approximation parameters $\theta_i$ and the transition kernel $q(\theta' \mid \theta_i)$, generate a proposal for the approximation $\theta'$.
Compute the likelihood ratio on the approximate model between the proposal and the previous realisation:
$$\tilde r = \frac{\tilde{\mathcal{L}}(d_{obs} \mid \theta')\, \pi(\theta')\, q(\theta_i \mid \theta')}{\tilde{\mathcal{L}}(d_{obs} \mid \theta_i)\, \pi(\theta_i)\, q(\theta' \mid \theta_i)}.$$
This way, the autocorrelation of the fine chain is reduced, since proposals are ‘more independent’. This approach is strongly related to a two-level version of multi-level MCMC. Since the fine model likelihood ratio is corrected by the inverse of the approximate likelihood ratio in step 6 of Algorithm 2, detailed balance is satisfied, the resulting Markov chain is guaranteed to converge to the true posterior, and there is no loss of accuracy, even if the approximate model is severely wrong . To demonstrate that this approach does indeed decrease the autocorrelation of our fine chain MCMC samples, we compute the Effective Sample Size of each MCMC simulation according to the procedures described in .
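Algorithm 2 can be sketched on a hypothetical one-dimensional problem with a deliberately biased approximate likelihood; note the correcting factor in the fine-level accept test, which is what preserves detailed balance (all names are our own):

```python
import numpy as np

def da_mcmc(log_lik_fine, log_lik_approx, n_outer, subchain_len=5,
            step=0.5, seed=0):
    """Modified delayed acceptance on a 1D toy problem: a random-walk
    subchain is screened by the approximate likelihood, and its end
    state is promoted to the fine chain with the corrected ratio
    [Lf(prop)/Lf(cur)] * [La(cur)/La(prop)]."""
    rng = np.random.default_rng(seed)
    chain = np.empty(n_outer)
    theta = 0.0
    lf, la = log_lik_fine(theta), log_lik_approx(theta)
    for i in range(n_outer):
        # Subchain driven purely by the cheap approximate likelihood
        sub, sub_la = theta, la
        for _ in range(subchain_len):
            prop = sub + step * rng.standard_normal()
            la_prop = log_lik_approx(prop)
            if np.log(rng.uniform()) < la_prop - sub_la:
                sub, sub_la = prop, la_prop
        # Second accept/reject on the fine model with correction term
        lf_prop = log_lik_fine(sub)
        if np.log(rng.uniform()) < (lf_prop - lf) + (la - sub_la):
            theta, lf, la = sub, lf_prop, sub_la
        chain[i] = theta
    return chain
```

Even though the approximate likelihood below is shifted and inflated relative to the fine one, the fine chain still targets the fine posterior, mirroring the no-loss-of-accuracy property discussed above.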
3.2 Adaptive correction of the approximate posterior
Whilst in theory the modified delayed acceptance proposal described in Section 3.1 provides a convergent Metropolis-Hastings algorithm, there are cases in which the rate of convergence will be extremely slow. To demonstrate this, the left-hand contour plot in Fig. 4 shows an artificially bad example. In this case the approximate model (red isolines) poorly captures the target likelihood distribution (blue density); there is a clear offset between the distributions, and the scale, shape and orientation of the approximate likelihood are incorrect. If using the modified delayed acceptance algorithm without alteration, it is easy to see that the proposal mechanism would struggle to traverse the whole of the target distribution, since much of it lies in the tails of the approximate likelihood distribution. As a result, we would observe extremely slow convergence to the true posterior; in practice, at finite computational times, results would contain a significant bias.
An ad hoc way to overcome this is to apply so-called tempering to the statistical model which drives the subchain. In this technique, the variance of the misfit on the subchain is artificially inflated to capture the uncertainty in the approximate model. The issue in adopting this approach is the difficulty of selecting a robust inflation factor for tempering, particularly in higher dimensions. Furthermore, an isotropic inflation of the approximate posterior will in general be sub-optimal.
In this paper we instead implement an adaptive enhanced error model (EEM), which overcomes many of these challenges. Moreover, it is easy to implement and has negligible additional computational cost. Let $\tilde F$ denote the approximate forward map of the fine/target model $F$. Then, following [41, 18], we apply a trick to the statistical model (8), where we add and subtract the coarse map $\tilde F(\theta)$. With some rearrangement we obtain the expression
$$d_{obs} = \tilde F(\theta) + \underbrace{\big( F(\theta) - \tilde F(\theta) \big)}_{B(\theta)} + \epsilon.$$
Here $B(\theta)$ is the bias associated with the approximation at given parameter values $\theta$. We approximate this bias using a multivariate Gaussian distribution, i.e. $B \sim \mathcal{N}(\mu_B, \Sigma_B)$, and therefore the likelihood function (12) can be rewritten as
$$\tilde{\mathcal{L}}(d_{obs} \mid \theta) \propto \exp\left( -\tfrac{1}{2} \big( d_{obs} - \tilde F(\theta) - \mu_B \big)^{\top} \big( \Sigma_\epsilon + \Sigma_B \big)^{-1} \big( d_{obs} - \tilde F(\theta) - \mu_B \big) \right).$$
The influence of redefining the likelihood is best demonstrated geometrically, as shown in Fig. 4 (middle and right). Firstly, as shown in Fig. 4 (middle), we can make a better approximation by simply adding a shift of the mean bias $\mu_B$ to the original approximate model. This has the effect of aligning the ‘centre of mass’ of each of the distributions. Secondly, we can learn the covariance structure $\Sigma_B$ of the bias. This has the effect of stretching and rotating the approximate distribution to give an even better overall approximation, as shown in Fig. 4 (right). The final mismatch between the approximate and target distributions will be driven by the assumption that the bias can be represented by a multivariate Gaussian, although more complex distributions could be constructed using, for example, Gaussian process regression. Whilst this is an avenue to explore in the future, any such approach would surrender the simplicity of the present approach, which from the results appears particularly effective.
The idea of using an EEM when dealing with model hierarchies originates from , who suggested using samples from the prior distribution of parameters to construct the EEM prior to Bayesian inversion. The estimates for $\mu_B$ and $\Sigma_B$ could thus be obtained by sampling the prior distribution and comparing the approximate forward map against the target forward map. However, since the structure of the bias in the prior and the posterior could be significantly different, this approach could result in a suboptimal EEM, and it requires additional fine model evaluations. Furthermore, in our examples, where the approximate model is built from samples from the prior, such an approach would be expected to underestimate the bias, since the neural network has been explicitly trained to minimise the error with respect to samples from the prior. Instead of estimating the bias using the prior, the posterior bias can be constructed on-line, by iteratively updating its mean $\mu_B$ and covariance $\Sigma_B$ using coarse/fine solution pairs from the MCMC samples, as suggested by . In this case, $\mu_B$ and $\Sigma_B$ are updated recursively from the bias samples $B_i = F(\theta_i) - \tilde F(\theta_i)$ observed during sampling.
While this approach does not in theory guarantee ergodicity of the chain, the bias distribution will converge as the chain progresses and adaptation diminishes, resulting in a de facto ergodic process after an initial period of high adaptivity. This is a common feature of adaptive MCMC algorithms, as discussed in the classic paper on adaptive Metropolis . Our experiments showed that the bias distribution did indeed converge for every simulation, and that repeated experiments converged towards the same posterior bias distribution. Admitting a bias term in the inverse problem further introduces an issue of identifiability, as highlighted in . Since observations are now modelled as a sum of the coarse model output and multiple stochastic terms, the stochastic terms $B$ and $\epsilon$ are generally unidentifiable in the coarse model formulation.
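The on-line bias estimate can be sketched as a running mean and covariance over coarse/fine output pairs (class and method names are our own, not from the paper):

```python
import numpy as np

class OnlineBiasModel:
    """Adaptive enhanced error model: tracks a running Gaussian
    N(mu_B, Sigma_B) for the bias B = F(theta) - F_approx(theta),
    updated from coarse/fine output pairs seen during sampling."""
    def __init__(self, m):
        self.n = 0
        self.mu = np.zeros(m)
        self.M2 = np.zeros((m, m))

    def update(self, fine_out, approx_out):
        """Fold one bias observation into the running statistics."""
        bias = fine_out - approx_out
        self.n += 1
        delta = bias - self.mu
        self.mu += delta / self.n
        self.M2 += np.outer(delta, bias - self.mu)

    def cov(self):
        """Current estimate of Sigma_B."""
        return self.M2 / max(self.n - 1, 1)
```

The current `mu` and `cov()` are then substituted into the EEM likelihood at each subchain step; since each update is a rank-one operation, the additional cost per fine model evaluation is negligible, consistent with the claim above.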
In this section, we examine the effectiveness of our proposed strategy on two synthetic groundwater flow problems: a two-dimensional problem with an isotropic covariance kernel and a three-dimensional problem with an anisotropic covariance kernel. For both examples, we begin by outlining the model setup, for which we select a ‘true’ transmissivity field and a number of fixed observation points. For the first example, the influence of training size for the DNNs is examined, and the total cost of uncertainty quantification using a selection of DNNs is computed. For the second example we use a single DNN setup and analyse the resulting posterior marginal distributions and the quantity of interest. The first example was completed on commodity hardware – an HP Elitebook 840 G5 with an Intel Xeon E3-1200 quad-core processor, while the second example was completed on a TYAN Thunder FT48T-B7105 GPU server with two Intel Xeon Gold 6252 processors and an NVIDIA RTX 2080Ti GPU.
4.1 Example 1: 2D Unit Square
4.1.1 Model Setup
This example was conducted on a unit square domain, meshed using an unstructured triangular grid comprising 2,601 degrees of freedom. Dirichlet boundary conditions were imposed on the left and right boundaries, with hydraulic heads of 1 and 0, respectively. The top and bottom edges impose homogeneous no-flow Neumann boundary conditions. For the forward model, the covariance length scales of the ARD squared exponential kernel were set equal in both directions, effectively resulting in an isotropic covariance kernel, equal to the ‘classic’ squared exponential kernel. This resulted in a KL decomposition in which the first 32 modes captured the large majority of the total signal energy; accordingly, the first 32 modes were included in the approximate model, whilst a larger number of modes was included in the fine model.
Figure 4(a) shows the ‘true’ transmissivity field that we attempt to recover through our MCMC methodology and the corresponding modelled hydraulic head. Synthetic samples for the likelihood function were extracted at 25 points on a regular grid with a horizontal and vertical spacing of 0.2 m (Figure 4(b)).
4.1.2 Deep Neural Network Design, Training and Evaluation
We evaluated a selection of different DNNs to investigate the impact of various network depths and activation functions on DNN performance. Table 1 shows the layers of the employed DNNs, the number of nodes in each layer and their corresponding activation functions. DNN1 and DNN2 had three hidden layers, while DNN3 and DNN4 had only two, as the final ReLU layer was not included in these networks. The output layer of DNN1 and DNN3 consisted of nodes with an exponential activation function, resulting in a strictly positive output. The DNNs with an exponential activation function in the final layer tended overall to lead to the best performance.
|Layer||# Nodes||Activation Functions|
Each DNN was trained on a set of samples from the prior distribution of parameters, in advance of running the MCMC. The DNN samples were drawn from a Latin Hypercube  in the unit interval and transformed to the standard normal distribution using the probit function, such that $\theta = \Phi^{-1}(u)$. The FEM model was then run for every parameter sample, obtaining for each a vector of model outputs at the sampling points. We trained and tested each DNN on a range of different sample sizes, with a 9:1 training/test splitting ratio. Each DNN was then trained for 200 epochs with a batch size of 50 using the RMSprop optimiser .
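The training-sample generation can be sketched using SciPy's Latin Hypercube sampler followed by the probit transform (the function name is illustrative):

```python
import numpy as np
from scipy.stats import norm, qmc

def prior_training_samples(n_samples, n_params, seed=0):
    """Draw a Latin Hypercube on (0,1)^d and map it through the probit
    function (inverse normal CDF) to standard normal parameters."""
    sampler = qmc.LatinHypercube(d=n_params, seed=seed)
    u = sampler.random(n_samples)   # stratified uniforms in (0, 1)
    return norm.ppf(u)              # probit transform: theta ~ N(0, 1)
```

The stratification of the Latin Hypercube carries over through the monotone probit map, so the training set covers the tails of the prior more evenly than plain Monte Carlo sampling would.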
Deep Neural Network performance was compared using the RMSE over the respective testing dataset,
$$\mathrm{RMSE} = \sqrt{ \frac{1}{N_{test}} \sum_{i=1}^{N_{test}} \big\| F(\theta_i) - \tilde F(\theta_i) \big\|_2^2 },$$
where $N_{test}$ is the number of test samples.
The residual RMSE (24) of each DNN was computed to compare the network designs described in Table 1 and to investigate the influence of training dataset size on DNN performance (Figure 6). As expected, each DNN performed better as the number of samples in the training dataset was increased. In comparison, the DNN design had much less influence on the testing performance, suggesting that the main driver for constructing an accurate surrogate model, within the bounds of the examined DNN designs, was the number of training samples. For the remaining analysis, we chose the overall best performing network, namely DNN1 trained on the largest sample set.
Further performance analysis consisted of analysing the DNN errors between true and predicted heads ($y$ and $\tilde y$, respectively) at sampling points 0, 8, 16 and 24 (Figure 7). All error distributions were approximately Gaussian, with the errors exhibiting some right skew at sampling point 24. For all DNNs, the sampling points closer to the boundaries (sampling points 0 and 24) had lower errors than those further away, since the heads close to the boundaries were more constrained by the model.
4.1.3 Uncertainty Quantification
For inversion and uncertainty quantification, we chose a multivariate standard normal distribution as the prior parameter distribution, with a fixed error covariance $\Sigma_\epsilon$. While computationally convenient, the zero-centred prior in practice favours transmissivity field realisations capable of reproducing the observed heads with as little variation as possible. In total, seven different sampling strategies were investigated, namely single level ‘vanilla’ MCMC, DA using three different DNNs trained and tested on different numbers of samples, and DA with an enhanced error model (DA/EEM) using the same three DNNs. Every simulation was completed using the pCN transition kernel. Each MCMC sampling strategy was repeated using randomly generated seeds, to ensure that every starting point would converge towards the same stationary distribution and to allow for cross-chain statistics to be computed. Results given in this section pertain to these multi-chain samples rather than individual MCMC realisations, unless otherwise stated.
Our sampling strategies recovered the ground truth with good accuracy. Figure 8 shows the mean and variance of the recovered field from the DA/EEM MCMC using the DNN with . All recovered fields exhibit higher smoothness than the ground truth, which could be attributed to the relatively low number of sampling points and their regular distribution on the domain. None of the chains recovered the local peak in transmissivity on the right side of the domain, since there was too little data to discover this particular feature.
While the recovered fields indicate that every MCMC sampling strategy converged towards the desired stationary distribution, they do not reveal the relative efficiency of each strategy. Hence, the Effective Sample Size (ESS) was computed for each MCMC realisation. Every DA sampling strategy produced a higher ESS, relative to the simulation time, than the Vanilla pCN sampler, with a clear correlation between DNN testing performance and ESS. This was mainly because the better performing DNNs allowed for a longer coarse chain offset without diverging. Moreover, utilising the EEM produced an even higher ESS for every DA chain (Table 2).
| Strategy | NDNN | Offset | Coarse / fine samples | Acc. Rate | Time (min) | ESS |
|---|---|---|---|---|---|---|
| Vanilla | — | — | — / 80000 | 0.23 | 63.3 | 84.8 |
| DA | 4000 | 2 | 85461.9 / 20000 | 0.27 | 16.2 | 64.5 |
| DA/EEM | 4000 | 2 | 78853.4 / 20000 | 0.31 | 15.2 | 79.0 |
| DA | 16000 | 4 | 172383.1 / 20000 | 0.27 | 18.2 | 116.3 |
| DA/EEM | 16000 | 4 | 178978.4 / 20000 | 0.30 | 18.4 | 143.6 |
| DA | 64000 | 8 | 336447.5 / 20000 | 0.24 | 30.1 | 196.5 |
| DA/EEM | 64000 | 8 | 377524.4 / 20000 | 0.30 | 29.9 | 235.7 |
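The delayed acceptance mechanism underlying these results can be sketched as follows. This is a simplified one-dimensional illustration, not the paper's implementation: `log_like_coarse` and `log_like_fine` are placeholder Gaussian misfits standing in for the trained DNN surrogate and the FEM solver, and the subchain uses a pCN proposal that leaves a standard normal prior invariant.

```python
import math
import random

random.seed(0)

def log_like_coarse(theta):
    # Placeholder surrogate log-likelihood (stand-in for the trained DNN)
    return -0.5 * theta * theta / 1.1

def log_like_fine(theta):
    # Placeholder fine log-likelihood (stand-in for the FEM solver)
    return -0.5 * theta * theta

def da_step(theta, offset, beta=0.5):
    """One delayed-acceptance step: run a subchain of `offset` coarse
    (surrogate) pCN steps, then screen its endpoint with the fine model."""
    phi = theta
    for _ in range(offset):
        # pCN proposal, invariant for a standard normal prior
        prop = math.sqrt(1.0 - beta ** 2) * phi + beta * random.gauss(0.0, 1.0)
        if math.log(random.random()) < log_like_coarse(prop) - log_like_coarse(phi):
            phi = prop
    # Second-stage acceptance: fine likelihood ratio corrected by the
    # coarse likelihood ratio, which the subchain already accounted for
    log_alpha = (log_like_fine(phi) - log_like_fine(theta)) \
              - (log_like_coarse(phi) - log_like_coarse(theta))
    return phi if math.log(random.random()) < log_alpha else theta

theta, chain = 0.0, []
for _ in range(2000):
    theta = da_step(theta, offset=4)
    chain.append(theta)
```

A longer `offset` amortises more fine-model evaluations per accepted state, but, as observed above, only a well-performing surrogate permits long subchains without the coarse chain diverging from the fine posterior.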
4.1.4 Total cost
Since the DA chains required computation of a significant number of fine model solutions and training of a DNN in advance of running the chain, the total cost of each strategy was computed as
\[ \mathrm{Cost} = \frac{t_{\mathrm{pre}} + t_{\mathrm{train}} + t_{\mathrm{chain}}}{\mathrm{ESS}} \]
where \(t_{\mathrm{pre}}\) was the time spent on precomputing fine model solutions, \(t_{\mathrm{train}}\) the time spent on training the respective DNN, \(t_{\mathrm{chain}}\) the time taken to run the chain, and ESS the resulting effective sample size (Fig. 9).
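As an illustrative sketch, the ESS can be estimated from the chain autocorrelation, and the total cost per effective sample then follows directly. The AR(1) chain and the timings below are made-up stand-ins for real MCMC output and measured runtimes, not values from the paper.

```python
import math
import random

random.seed(1)

def effective_sample_size(chain):
    """Estimate ESS = N / (1 + 2 * sum of autocorrelations), truncating
    the sum at the first non-positive autocorrelation."""
    n = len(chain)
    mu = sum(chain) / n
    var = sum((x - mu) ** 2 for x in chain) / n
    tau = 1.0
    for lag in range(1, n // 2):
        rho = sum((chain[i] - mu) * (chain[i + lag] - mu)
                  for i in range(n - lag)) / ((n - lag) * var)
        if rho <= 0:
            break
        tau += 2.0 * rho
    return n / tau

# AR(1) process mimicking correlated MCMC output
a = 0.9
x, chain = 0.0, []
for _ in range(5000):
    x = a * x + math.sqrt(1.0 - a * a) * random.gauss(0.0, 1.0)
    chain.append(x)

ess = effective_sample_size(chain)

# Hypothetical timings (minutes) for pre-computation, training and sampling
t_pre, t_train, t_chain = 10.0, 5.0, 16.0
cost_per_effective_sample = (t_pre + t_train + t_chain) / ess
```

For this AR(1) example the theoretical ESS is roughly N(1 − a)/(1 + a), so strong autocorrelation shrinks the effective sample count by more than an order of magnitude even though the chain length is fixed.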
The mean cost of every DA chain was lower than that of the Vanilla pCN chain, with the chains using the EEM consistently cheaper than their non-EEM counterparts. Moreover, using the EEM reduced the variance of the cost in repeated experiments, allowing each repetition to produce a consistently high ESS. The overall cheapest inversion was completed using the DNN trained on 16,000 samples with the EEM, reducing the total cost by 75% relative to the Vanilla pCN MCMC. Notice that these results are extremely conservative in the sense that the entire cost of evaluating every DNN training sample and training the DNN in serial on a CPU was factored into the cost of every repetition, even though the same DNN was used for all repetitions within each sampling strategy. The precomputation cost can be dramatically reduced by evaluating the DNN training samples in parallel and utilising high-performance hardware, such as GPUs, for training the DNN.
4.2 Example 2: 3D Rectangular Cuboid
4.2.1 Model Setup
This example was conducted on a rectangular cuboid domain meshed using an unstructured tetrahedral grid with 10,416 degrees of freedom (Figure 10). Dirichlet boundary conditions of and were imposed at and , respectively. No-flow Neumann conditions were imposed on all remaining boundaries.
We drew sampling well locations randomly using the Maximin Latin Hypercube Design , and samples of hydraulic head were extracted at each well at datums , measured from the bottom of the domain, resulting in datapoints in total (Figure 9(b)). The covariance length scales for the ARD squared exponential covariance kernel were set to , resulting in a highly anisotropic random process with high variation in the direction to simulate geological stratification, some variation in the direction and little variation in the direction (Figure 9(a)). As in the first model, the random process was truncated at 64 KL eigenmodes for the fine model and 32 KL eigenmodes for the coarse model, embodying and of the total signal energy, respectively.
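The quoted "signal energy" fractions correspond to the ratio of retained to total KL eigenvalues. A minimal sketch, using a synthetic geometrically decaying spectrum rather than the actual eigenvalues of the ARD kernel:

```python
def kl_energy_fraction(eigvals, n_modes):
    """Fraction of total signal energy captured by the first n_modes
    KL eigenmodes (eigvals assumed sorted in descending order)."""
    return sum(eigvals[:n_modes]) / sum(eigvals)

# Synthetic, geometrically decaying spectrum standing in for the
# eigenvalues of the squared exponential covariance kernel
eigvals = [0.5 ** k for k in range(128)]

coarse = kl_energy_fraction(eigvals, 32)   # coarse model truncation
fine = kl_energy_fraction(eigvals, 64)     # fine model truncation
```

The faster the spectrum decays (i.e. the smoother and more anisotropic the field), the fewer modes are needed to retain almost all of the energy, which is what makes the 32-mode coarse parameterisation viable.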
For this example, we first converged the conductivity parameters to the maximum a posteriori (MAP) estimate using gradient descent, since initial MCMC experiments struggled to converge to the posterior distribution from random initial parameter sets.
4.2.2 Deep Neural Network Design, Training and Evaluation
Training a DNN to accurately emulate the model response for this setup was challenging, and we found no single combination of neural network layers and activation functions that would predict the head at every datapoint with sufficient accuracy. We hypothesise that this limitation could be caused by strong ill-posedness of the learning problem: for a single neural network, the output dimension greatly exceeded the input dimension.
When we instead predicted the heads at each datapoint datum using a separate DNN, we found that we could utilise largely the same DNN design as had been employed in the first example. Hence, to predict the head at all datapoints, we utilised five identically designed but independent DNNs (Figure 11), each with four hidden layers and activation functions as indicated in Table 3.
| Layer | # Nodes | Activation Functions |
|---|---|---|
Each DNN was trained and tested on a dataset of samples with KL coefficients drawn from a Latin Hypercube  in the interval and transformed to a normal distribution centred on the MAP estimate of the parameters. This was done to increase the density of samples, and thus improve the DNN prediction, at and around the MAP point, which ideally equals the mode of the posterior distribution. The DNNs were then trained for 200 epochs using a batch size of 50, the MSE loss function and the rmsprop optimiser . Figure 12 shows performance plots of each DNN for both the training (top) and the testing (bottom) datasets. While every DNN is moderately biased towards the training data, they all performed adequately with respect to the testing data.
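The construction of the training inputs (a Latin hypercube in the unit cube, pushed through the normal inverse CDF and re-centred on the MAP estimate) can be sketched as below. The MAP values and sample count are illustrative, and no maximin optimisation of the design is attempted.

```python
import random
from statistics import NormalDist

random.seed(2)

def latin_hypercube(n_samples, n_dims):
    """Basic Latin hypercube on [0, 1]^d: one sample per stratum in each
    dimension, with the strata shuffled independently per dimension."""
    cols = []
    for _ in range(n_dims):
        col = [(i + random.random()) / n_samples for i in range(n_samples)]
        random.shuffle(col)
        cols.append(col)
    return [list(row) for row in zip(*cols)]

theta_map = [0.3, -1.2, 0.7, 0.0]      # illustrative MAP estimate
unit = latin_hypercube(1000, len(theta_map))

# Map each coordinate through the standard normal inverse CDF and
# centre the resulting cloud on the MAP point
norm = NormalDist(0.0, 1.0)
training = [[m + norm.inv_cdf(u) for m, u in zip(theta_map, row)]
            for row in unit]
```

Because the inverse normal CDF concentrates the stratified uniform design around its median, the resulting training points cluster near the MAP estimate, exactly where surrogate accuracy matters most for the sampler.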
4.2.3 Uncertainty Quantification
Similarly to the first example, we chose a multivariate standard normal distribution as the prior distribution of the parameters, and set the error covariance to . In this example, we utilised the Adaptive Metropolis (AM) transition kernel  for generating proposals. We completed independent simulations, each initialised from a random initial point close to the MAP point, with a burn-in of 1000 and a final sample size of . The subchains were run with an acceptance delay of , since longer subchains tended to diverge, leading to sub-optimal acceptance rates on the fine level. The simulations had a mean acceptance rate of 0.26, a mean effective sample size (ESS) of 55.2 and a mean autocorrelation length of 181.0. The samples of each independent simulation were pruned according to the respective autocorrelation length, and the remaining samples were pooled to yield 443 statistically independent samples that were then analysed further.
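The AM kernel adapts its proposal width to the empirical covariance of the chain history, following the Haario et al. construction. A one-dimensional sketch, with a standard normal target standing in for the posterior and an illustrative 100-step non-adaptive initial phase:

```python
import math
import random

random.seed(3)

def log_target(theta):
    # Stand-in log-posterior: standard normal
    return -0.5 * theta * theta

theta, chain = 2.5, []
s_d, eps = 2.4 ** 2, 1e-6        # Haario scaling for d = 1, plus regulariser
for step in range(2000):
    if step < 100:
        width = 1.0              # fixed proposal width during the initial phase
    else:
        # Adapt the proposal width to the empirical variance of the history
        mu = sum(chain) / len(chain)
        var = sum((x - mu) ** 2 for x in chain) / len(chain)
        width = math.sqrt(s_d * var + eps)
    prop = random.gauss(theta, width)
    if math.log(random.random()) < log_target(prop) - log_target(theta):
        theta = prop
    chain.append(theta)
```

In higher dimensions the scalar variance becomes the empirical covariance matrix scaled by 2.38²/d, which is what lets the kernel capture the posterior correlations noted below.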
Figure 13 shows the marginal distributions of the six coarsest KL coefficients along with a scatterplot matrix of all the samples remaining after pruning. All the marginal distributions are approximately Gaussian, and the two-parameter marginal distributions are mostly elliptical. It is evident that some of these parameters are correlated. It is worth mentioning that the AM proposal kernel managed to capture these correlations in every independent simulation.
Moreover, we analysed the hydraulic head as a function of datum along a line in the centre of the domain. Figure 14 shows the ground truth, the MAP point, the mean of the independent simulations, and all the samples remaining after pruning. We observe that both the MAP point and the sample mean are fairly close to the ground truth, albeit exhibiting higher smoothness, particularly between the observation depths, where the head is essentially allowed to vary freely. It is also clear that the individual samples encapsulate the ground truth, indicating that the ground truth is indeed contained in the posterior distribution.
In this paper, we have demonstrated the use of a novel Markov Chain Monte Carlo methodology which employs a delayed acceptance (DA) model hierarchy with a deep neural network (DNN) as an approximate model and a FEM solver as a fine model, and generates proposals using the pCN and AM transition kernels. Results from the first example clearly indicate that the use of a carefully designed DNN as a model approximation can significantly reduce the cost of uncertainty quantification, even for DNNs trained on relatively small sample sizes. We have established that offsetting fine model evaluations in the DA algorithm reduces the autocorrelation of the fine chain, resulting in a higher effective sample size which, in turn, improves the statistical validity of the results. In this context, the performance of the DNN is a critical driver when determining a feasible offset length to avoid divergence of the coarse chain. Hence, if a high effective sample size is required, it may be desirable to invest in a well-performing DNN. Moreover, we have shown that an enhanced error model, which introduces an iteratively-constructed bias distribution in the coarse chain likelihood function, further increases the effective sample size and decreases the variance of the cost in repeated experiments. Finally, we observed that for the second example, even when employing a relatively well-performing model approximation, we had to constrain the offset length of the subchains rather strongly to achieve optimal acceptance rates. This can be attributed in part to an apparent non-spherical and correlated posterior distribution, causing the employed proposal kernels to struggle to discover areas of high posterior probability.
We have demonstrated that relatively simple inverse hydrogeological problems can be solved in reasonable time on a commonly available personal computer with no GPU acceleration. This opens the opportunity to apply robust uncertainty quantification during fieldwork and as a decision-support tool for groundwater surveying campaigns. We have also demonstrated the applicability of our approach on a larger scale three-dimensional problem, utilising a GPU-accelerated high-performance computer (HPC). Aside from the benefit of using an HPC for accelerating the fine model evaluations, utilising the GPU allowed for rapidly training and testing multiple different DNN designs to efficiently establish a well-performing model approximation. There are other obvious ways to further increase the efficiency of the proposed methodology. For example, construction of the DNNs used as coarse models comes with the cost of evaluating multiple models from the prior distribution; unlike the MCMC samples, however, these prior samples are independent, so the fine model evaluations can be massively parallelised.
Our methodology was demonstrated in the context of two relatively simple groundwater flow problems with log-Gaussian transmissivity fields parameterised by Karhunen-Loève decompositions. While this model provides a convenient computational structure for our purposes, it may not reflect the full-scale transmissivity of real-world aquifers, particularly in the presence of geological faults and other heterogeneities, as discussed in . Future research could address this problem through geological layer stratification using the universal cokriging interpolation method suggested in , potentially utilising the open-source geological modelling tool GemPy , which allows for simple parametric representation of geological strata. Spatially heterogeneous parameters within each stratum could then be modelled hierarchically using a low order log-Gaussian random field to account for within-stratum variation, as demonstrated in .
This work was funded as part of the Water Informatics Science and Engineering Centre for Doctoral Training (WISE CDT) under a grant from the Engineering and Physical Sciences Research Council (EPSRC), grant number EP/L016214/1. TD was funded by a Turing AI Fellowship (2TAFFP\100007). DM acknowledges support from the EPSRC Platform Grant PRISM (EP/R029423/1). The authors have no competing interests. Data supporting the findings in this study are under preparation and will be made available in the Open Research Exeter (ORE, https://ore.exeter.ac.uk/repository/) data repository.
-  M. P. Anderson, W. W. Woessner, R. J. Hunt, Applied Groundwater Modeling: Simulation of Flow and Advective Transport, 2nd Edition, Academic Press, London; San Diego, CA, 2015.
-  A. D. Woodbury, T. J. Ulrych, A full-Bayesian approach to the groundwater inverse problem for steady state flow, Water Resources Research 36 (8) (2000) 2081–2093.
-  G. Mariethoz, P. Renard, J. Caers, Bayesian inverse problem and optimization with iterative spatial resampling, Water Resources Research 46 (11) (2010).
-  M. de la Varga, J. F. Wellmann, Structural geologic modeling as an inference problem: A Bayesian perspective, Interpretation 4 (3) (2016) SM1–SM16.
-  C. P. Robert, G. Casella, Monte Carlo Statistical Methods, 2nd Edition, Springer Texts in Statistics, Springer, New York, NY, 2010.
-  D. Higdon, H. Lee, C. Holloman, Markov chain Monte Carlo-based approaches for inference in computationally intensive inverse problems, in: Bayesian Statistics, Vol. 7, Oxford University Press, 2003, pp. 181–197.
-  T. J. Dodwell, C. Ketelsen, R. Scheichl, A. L. Teckentrup, A hierarchical multilevel Markov chain Monte Carlo algorithm with applications to uncertainty quantification in subsurface flow, SIAM/ASA Journal on Uncertainty Quantification 3 (1) (2015) 1075–1108.
-  G. Detommaso, T. Dodwell, R. Scheichl, Continuous level Monte Carlo and sample-adaptive model hierarchies, arXiv:1802.07539 (2018).
-  J. Doherty, Calibration and Uncertainty Analysis for Complex Environmental Models, 2015.
-  B. Peherstorfer, K. Willcox, M. Gunzburger, Survey of multifidelity methods in uncertainty propagation, inference, and optimization, SIAM Review 60 (3) (2018) 550–591.
-  Y. Efendiev, A. Datta-Gupta, V. Ginting, X. Ma, B. Mallick, An efficient two-stage Markov chain Monte Carlo method for dynamic data integration, Water Resources Research 41 (12) (2005).
-  A. Mondal, Y. Efendiev, B. Mallick, A. Datta-Gupta, Bayesian uncertainty quantification for flows in heterogeneous porous media using reversible jump Markov chain Monte Carlo methods, Advances in Water Resources 33 (3) (2010) 241–256.
-  P. Dostert, Y. Efendiev, B. Mohanty, Efficient uncertainty quantification techniques in inverse problems for Richards’ equation using coarse-scale simulation models, Advances in Water Resources 32 (3) (2009) 329–339.
-  E. Laloy, B. Rogiers, J. A. Vrugt, D. Mallants, D. Jacques, Efficient posterior exploration of a high-dimensional groundwater model from two-stage Markov chain Monte Carlo simulation and polynomial chaos expansion, Water Resources Research 49 (5) (2013) 2664–2682.
-  J. A. Christen, C. Fox, Markov chain Monte Carlo using an approximation, Journal of Computational and Graphical Statistics 14 (4) (2005) 795–810.
-  S. L. Cotter, G. O. Roberts, A. M. Stuart, D. White, MCMC methods for functions: Modifying old algorithms to make them faster, Statistical Science 28 (3) (2013) 424–446.
-  H. Haario, E. Saksman, J. Tamminen, An adaptive Metropolis algorithm, Bernoulli 7 (2) (2001) 223–242.
-  T. Cui, C. Fox, M. J. O’Sullivan, A posteriori stochastic correction of reduced models in delayed acceptance MCMC, with application to multiphase subsurface inverse problems, arXiv:1809.03176 (2018).
-  H.-J. G. Diersch, FEFLOW: Finite Element Modeling of Flow, Mass and Heat Transport in Porous and Fractured Media, Springer, Berlin, Heidelberg, 2014.
-  H. P. Langtangen, A. Logg, Solving PDEs in Python – The FEniCS Tutorial Volume I, 2017.
-  A. W. Harbaugh, MODFLOW-2005, the U.S. Geological Survey modular ground-water model – the ground-water flow process, Report, 2005.
-  R. A. Freeze, A stochastic-conceptual analysis of one-dimensional groundwater flow in nonuniform homogeneous media, Water Resources Research 11 (5) (1975).
-  N.-O. Kitterrød, L. Gottschalk, Simulation of normal distributed smooth fields by Karhunen-Loève expansion in combination with kriging, Stochastic Hydrology and Hydraulics 11 (6) (1997) 459–482.
-  J. Gómez-Hernández, X.-H. Wen, To be or not to be multi-Gaussian? A reflection on stochastic hydrogeology, Advances in Water Resources 21 (1) (1998) 47–61.
-  J. Kerrou, P. Renard, H.-J. Hendricks Franssen, I. Lunati, Issues in characterizing heterogeneity and connectivity in non-multiGaussian media, Advances in Water Resources 31 (1) (2008) 147–159.
-  D. Russo, M. Bouton, Statistical analysis of spatial variability in unsaturated flow parameters, Water Resources Research 28 (7) (1992) 1911–1925.
-  R. J. Hoeksema, P. K. Kitanidis, Analysis of the spatial structure of properties of selected aquifers, Water Resources Research 21 (4) (1985) 563–572.
-  P. Dostert, Y. Efendiev, T. Hou, W. Luo, Coarse-gradient Langevin algorithms for dynamic data integration and uncertainty quantification, Journal of Computational Physics 217 (1) (2006) 123–142.
-  Y. M. Marzouk, H. N. Najm, Dimensionality reduction and polynomial chaos acceleration of Bayesian inference in inverse problems, Journal of Computational Physics 228 (6) (2009).
-  C. Scarth, S. Adhikari, P. H. Cabral, G. H. Silva, A. P. d. Prado, Random field simulation over curved surfaces: Applications to computational structural mechanics, Computer Methods in Applied Mechanics and Engineering 345 (2019) 283–301.
-  A. Gelman (Ed.), Bayesian Data Analysis, 2nd Edition, Texts in Statistical Science, Chapman & Hall/CRC, Boca Raton, FL, 2004.
-  N. Metropolis, A. W. Rosenbluth, M. N. Rosenbluth, A. H. Teller, E. Teller, Equation of state calculations by fast computing machines, The Journal of Chemical Physics 21 (6) (1953) 1087–1092.
-  W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications, Biometrika 57 (1) (1970) 97–109.
-  L. Katafygiotis, K. Zuev, Geometric insight into the challenges of solving high-dimensional reliability problems, Probabilistic Engineering Mechanics 23 (2–3) (2008) 208–218.
-  G. O. Roberts, J. S. Rosenthal, Examples of adaptive MCMC, Journal of Computational and Graphical Statistics 18 (2) (2009) 349–367.
-  F. Chollet, et al., Keras, https://github.com/fchollet/keras (2015).
-  Theano Development Team, Theano: A Python framework for fast computation of mathematical expressions, arXiv:1605.02688 (2016).
-  T. Hastie, R. Tibshirani, J. H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd Edition, Springer Series in Statistics, Springer, New York, NY, 2009.
-  G. Hinton, N. Srivastava, K. Swersky, Neural networks for machine learning, lecture 6a: Overview of mini-batch gradient descent, Coursera, University of Toronto (2012).
-  U. Wolff, Monte Carlo errors with less errors, Computer Physics Communications 176 (5) (2007) 383.
-  J. Kaipio, E. Somersalo, Statistical inverse problems: Discretization, model reduction and inverse crimes, Journal of Computational and Applied Mathematics 198 (2) (2007) 493–504.
-  T. Cui, C. Fox, M. J. O’Sullivan, Bayesian calibration of a large-scale geothermal reservoir model by a new adaptive delayed acceptance Metropolis–Hastings algorithm, Water Resources Research 47 (10) (2011).
-  J. Brynjarsdóttir, A. O’Hagan, Learning about physical parameters: the importance of model discrepancy, Inverse Problems 30 (11) (2014) 114007.
-  M. D. McKay, R. J. Beckman, W. J. Conover, A comparison of three methods for selecting values of input variables in the analysis of output from a computer code, Technometrics 21 (2) (1979) 239–245.
-  M. D. Morris, T. J. Mitchell, Exploratory designs for computational experiments, Journal of Statistical Planning and Inference 43 (3) (1995) 381–402.
-  C. Lajaunie, G. Courrioux, L. Manuel, Foliation fields and 3D cartography in geology: Principles of a method based on potential interpolation, Mathematical Geology 29 (4) (1997) 571–584.
-  M. de la Varga, A. Schaaf, F. Wellmann, GemPy 1.0: open-source stochastic geological modeling and inversion, Geoscientific Model Development 12 (1) (2019) 1–32.
Appendix A Preconditioned Crank-Nicolson
The preconditioned Crank-Nicolson (pCN) proposal was developed in  and is based on the following Stochastic Partial Differential Equation (SPDE):
\[ \frac{d\theta}{ds} = -\mathcal{K}\,\mathcal{L}\,\theta + \sqrt{2\mathcal{K}}\,\frac{db}{ds} \]
where \(\mathcal{L} = C^{-1}\) is the precision operator for the prior distribution \(N(0, C)\), \(b\) is Brownian motion with covariance operator the identity, and \(\mathcal{K}\) is a positive operator. This equation can be discretised using the Crank-Nicolson approach to yield
\[ \theta' = \theta - \tfrac{1}{2}\,\delta\,\mathcal{K}\,\mathcal{L}\,(\theta + \theta') + \sqrt{2\,\delta\,\mathcal{K}}\,\xi \]
for white noise \(\xi \sim N(0, I)\) and a weight \(\delta \in [0, 2]\). If we choose \(\mathcal{K} = I\), we get the plain Crank-Nicolson (CN) proposal:
\[ \theta' = (2C + \delta)^{-1}\left((2C - \delta)\,\theta + \sqrt{8\delta}\,C\,\xi\right) \]
where \(\xi \sim N(0, I)\) as above. If we instead choose \(\mathcal{K} = C\), we get the pCN proposal:
\[ \theta' = \frac{2 - \delta}{2 + \delta}\,\theta + \frac{\sqrt{8\delta}}{2 + \delta}\,w, \qquad w \sim N(0, C). \]
This is rewritten, conforming to our previous notation, by setting \(\beta = \sqrt{8\delta}/(2 + \delta)\), so that \((2 - \delta)/(2 + \delta) = \sqrt{1 - \beta^2}\):
\[ \theta' = \sqrt{1 - \beta^2}\,\theta + \beta\,w, \qquad w \sim N(0, C). \]
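A minimal sketch of the resulting pCN transition for a standard normal prior, with a placeholder Gaussian log-likelihood standing in for the groundwater forward model:

```python
import math
import random

random.seed(4)

def log_likelihood(theta):
    # Placeholder data misfit; in the paper this would involve the fine
    # (FEM) or coarse (DNN) forward model and the observed heads
    return -0.5 * sum((t - 0.5) ** 2 for t in theta)

def pcn_step(theta, beta=0.3):
    """One pCN step: the proposal leaves the prior N(0, I) invariant,
    so the acceptance ratio involves only the likelihood."""
    prop = [math.sqrt(1.0 - beta ** 2) * t + beta * random.gauss(0.0, 1.0)
            for t in theta]
    if math.log(random.random()) < log_likelihood(prop) - log_likelihood(theta):
        return prop
    return theta

theta = [0.0] * 8        # e.g. 8 KL coefficients
chain = []
for _ in range(4000):
    theta = pcn_step(theta)
    chain.append(theta[0])
```

Because the prior terms cancel exactly in the acceptance ratio, the acceptance probability stays well behaved as the number of KL coefficients grows, which is the property that makes pCN attractive for the high-dimensional parameterisations used in this paper.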