The increased availability of measurements from engineering systems allows for the development of new and the improvement of existing computational models, which are usually formulated as partial differential equations. Inferring model parameters from observations of the physical system is termed the inverse problem (tarantolaInverseProblemTheory2005; kaipioStatisticalComputationalInverse2005; stuartInverseProblemsBayesian2010). In this work, we consider the inverse problem where the quantities of interest (for example, some material properties) and the observations (e.g., the displacement field) are related through elliptic PDEs. Most inverse problems are non-linear and ill-posed, meaning that the existence, uniqueness, and/or stability (continuous dependence on the parameters) of the solution are violated (stuartInverseProblemsBayesian2010; tarantolaInverseProblemTheory2005; kaipioStatisticalComputationalInverse2005). These issues are often alleviated through some regularisation, like Tikhonov regularisation (tikhonovSolutionsIllposedProblems1977), that imposes assumptions on the regularity of the solution. Alternatively, the specification of the prior in the Bayesian formulation of inverse problems provides a natural choice for regularisation, and any given regularisation can be interpreted as a specific choice of priors in the Bayesian setting (bishopPatternRecognitionMachine2006)
. Furthermore, the Bayesian formulation provides not only a qualitative but also a quantitative estimate of both epistemic and aleatoric uncertainty in the solution. In particular, the mean of the posterior probability distribution corresponds to the point estimate of the solution while the credible intervals capture the range of the parameters consistent with the observed measurements and prior assumptions. For these reasons, Bayesian methods have gained popularity in computational mechanics for experimental design and inverse problems with uncertainty quantification; see, e.g., the recent papers byabdulleProbabilisticFiniteElement2021; panditaSurrogatebasedSequentialBayesian2021; pyrialakosNeuralNetworkaidedBayesian2021; niProbabilisticModelUpdating2021; sabaterBayesianApproachQuantile2021; huangSequentialSparseBayesian2021; ibrahimbegovicReducedModelMacroscale2020; tarakanovOptimalBayesianExperimental2020; michelenstroferEnforcingBoundaryConditions2020; carlonNesterovaidedStochasticGradient2020; wuBayesianInferenceNonlinear2020; uribeBayesianInferenceRandom2020; rizziBayesianModelingInconsistent2019; arnstIdentificationSamplingBayesian2019; beckFastBayesianExperimental2018; betzBayesianInferenceSubset2018; chenHessianbasedAdaptiveSparse2017; asaadiComputationalFrameworkBayesian2017; huangBayesianSystemIdentification2017; karathanasopoulosBayesianIdentificationTendon2017; babuskaBayesianInferenceModel2016; girolamiStatisticalFiniteElement2021.
The Bayesian formulation of inverse problems is also the focal point of probabilistic machine learning, and in recent years significant progress has been made in adapting and scaling machine learning approaches to complex large-scale problems(luSurpassingHumanLevelFace2015; solinPIVOProbabilisticInertialVisual2018). One of the leading models for Bayesian inverse problems are Gaussian processes (GPs) which define probability distributions over functions and allow for incorporating observed data to obtain posterior distributions. Given that most posterior distributions in Bayesian inference are analytically intractable, approximation methods need to be resorted to. Two classical approximation schemes are the Markov Chain Monte Carlo (MCMC) and the Laplace approximation (LA). The MCMC algorithm proceeds by creating a Markov Chain whose stationary distribution is the desired posterior distribution. Although MCMC provides asymptotic convergence in distribution, devising an efficient, finite-time sampling scheme is challenging, especially in higher dimensions (gelmanBayesianDataAnalysis2013). Application-specific techniques such as parameter space reduction and state space reduction have been proposed in the literature to help scale up MCMC methods, but these low-rank approximations are not specific to MCMC methods only (cuiScalablePosteriorApproximations2016). Due to the asymptotic correctness of MCMC, we use it as a benchmark for the experimental studies in this paper. Meanwhile, the Laplace approximation finds a Gaussian density centred around the mode of the true posterior, utilizing the negative Hessian of the unnormalised posterior log-density (bishopPatternRecognitionMachine2006). The Hessian is a large dense matrix, where forming each column requires multiple PDE solves; to make such calculations feasible, low-rank approximations are typically used (villaHIPPYlibExtensibleSoftware2021; bui-thanhComputationalFrameworkInfiniteDimensional2013)
. Evidently, the Laplace approximation is not suitable for multi-modal posterior distributions due to the uni-modality of the Gaussian distribution.
1.1 Related work
In recent years, advances in variational Bayes (VB) methods have allowed for Bayesian inference to be successfully applied to large data sets. Variational Bayes translates a sampling problem that arises from applying the Bayes rule into an optimisation problem (jordanIntroductionVariationalMethods1999; bleiVariationalInferenceReview2017; jordanGraphicalModelsExponential2007). The method finds a solution that minimizes the Kullback-Leibler (KL) divergence between the true posterior distribution and a trial distribution from a chosen family of distributions, for instance, multivariate Gaussian distributions with a specific covariance structure. The strong appeal of VB is that one can explicitly choose the complexity of the trial distribution, i.e., its number of free parameters, such that the resulting optimisation problem is computationally tractable, and the approximate posterior adequately captures important aspects of the true posterior.
Further scalability of VB methods is due to advancements in sparse approximations and approximate inference. For instance, sparse GP methods such as Nyström approximation or fully independent training conditional method (FITC) rely on lower-dimensional representations that are defined by a smaller set of so-called inducing points to represent the full GP (williamsUsingNystromMethod2001; csatoSparseOnlineGaussian2002; seegerFastForwardSelection2003; quinonero-candelaUnifyingViewSparse2005; snelsonSparseGaussianProcesses2006; titsiasVariationalLearningInducing2009; titsiasVariationalModelSelection2008). Using this approximation for a data set of size , algorithmic complexity is reduced from to , while storage demands go down from to , where is a user selected number of inducing variables. To widen the applicability of VB to large datasets and non-conjugate models (combinations of prior distributions and likelihoods that do not result in a closed-form solution), stochastic variational inference (SVI) was proposed (hensmanFastVariationalInference2012; hoffmanStochasticVariationalInference2013; hensmanGaussianProcessesBig2013). Sub-sampling the original data and Monte Carlo estimation of the optimisation objective and its gradients, allows for calibrating complex models using large amounts of data. Multiple further extensions to the sparse SVI framework were proposed, leveraging the Hilbert space formulation of VB (chengVariationalInferenceGaussian2017), introducing parametric approximations (jankowiakParametricGaussianProcess2020), applying the Lanczos algorithm to efficiently factorize the covariance matrix (pleissConstanttimePredictiveDistributions2018), transforming to an orthogonal basis (salimbeniOrthogonallyDecoupledVariational2018; shiSparseOrthogonalVariational2020), and adapting to compositional models (salimbeniDoublyStochasticVariational2017).
The choice of prior is a central task in designing Bayesian models. If the prior is obtained from a domain expert, it is not necessarily less valuable than the data itself; one way of thinking about a prior is by considering how many observations one would be prepared to trade for a prior from an expert – if the expert is very knowledgeable, then one might be prepared to exchange a large part of a dataset to get access to that prior. Translating the expert knowledge into a prior probability distribution is a challenging task, and due to practical considerations, certain choices of priors are preferred for their simplicity and analytic tractability. When inferring values of parameters over a spatial domain, as is typically the case in finite elements, GP priors offer a natural way to incorporate the information about the smoothness and other known properties of the solution. We note that while other Bayesian models, such as Bayesian neural networks are gaining interest, it is very difficult to impose functional priors in such models, challenging the effective use of expert knowledge and leading to unrealistic uncertainty estimates(sunFunctionalVariationalBayesian2019; burtUnderstandingVariationalInference2021).
In this work, we advocate for the use of GP priors with stochastic variational inference as a principled and scalable way to solve the inverse problems arising in computational mechanics. We show, through an extensive empirical study, that variational Bayes methods provide a flexible, fast, and scalable alternative to MCMC methods in the context of Bayesian inverse problems based on elliptic PDEs while retaining the ability to quantify uncertainty. While similar directions have been explored in previous work, the focus there is on specific applications, such as parameter estimation problems in models of contamination (tsilifisComputationallyEfficientVariational2016) or proof-of-concept on particular 1D inverse problems (barajas-solanoApproximateBayesianModel2019).
We extend the previous works in multiple aspects, focusing on improving the utility of VB in inverse problems arising from elliptic PDEs and providing a thorough discussion of the empirical results that can be used by practitioners to guide their use of VB in applications. Specifically, we argue that the efficiency of the VB algorithms for PDE based inverse problems can be improved by taking into account the structure of the problem, as encoded in the FEM discretisation of the PDE. Motivated by previous uses of precision matrices as a way of describing conditional independence (tanGaussianVariationalApproximation2018; durrandeBandedMatrixOperators2019), we leverage the sparse structure of the problems to impose conditional independence in the approximating posterior distribution. This choice of parametrisation results in sparse matrices, which improve the computational and the memory cost of the resulting algorithms. Such parametrisation, combined with stochastic optimisation techniques, allows the method to be scaled up to large problems on 2D domains. Through extensive empirical comparisons, we demonstrate that VB provides high quality point estimates and uncertainty quantification comparable to the estimates attained by MCMC algorithms but with significant computational gains. Finally, we describe how the proposed framework can be seamlessly combined with existing solvers and optimisation algorithms in the finite element implementations.
The main concern related to VB in statistics stems from the fact that it is constrained by the chosen family of trial distributions, which may not approximate the true posterior distribution well. If the choice of the trial distributions is too restrictive, the estimate of the posterior mean is biased while the uncertainty may be underestimated (mackayInformationTheoryInference2003; wangInadequacyIntervalEstimates2005; turnerTwoProblemsVariational2011). Furthermore, as noted in previous work, the commonly used mean-field factorisation of the trial distributions does not come with general guarantees on accuracy (giordanoCovariancesRobustnessVariational2018). However, VB has been demonstrated to work well in practice in a variety of settings (kingmaAutoencodingVariationalBayes2014; damianouVariationalInferenceLatent2016; bleiVariationalInferenceReview2017; zhangAdvancesVariationalInference2019). Recent work on VB has provided some tools for assessing the robustness of the VB estimates (giordanoCovariancesRobustnessVariational2018) .
The rest of the paper is structured as follows. In Sec. 2, we define Bayesian inverse problems and detail some inference challenges related to their ill-posedness. In Sec. 3, we give a presentation of the variational Bayes framework, with strong focus on sparse parametrisation resulting from conditional independence. We give details of the experiments and the evaluation criteria, and discuss obtained results for each experiment in Sec. 4. Lastly, Sec. 5 concludes the paper and discusses some promising directions for future work.
2 Bayesian Formulation of Inverse Problems
In this section, we review the Bayesian formulation of inverse problems by closely following stuartInverseProblemsBayesian2010.
2.1 Forward map and observation model
We are interested in finding , an input to a model, given , a noisy observation of the solution of the model, where are Banach spaces333Respective norms for Banach spaces , are and .. The mapping is given by
where , is additive observational noise. We focus on problems where maps solutions of elliptic partial differential equations with input into the observation space . For a suitable Hilbert space , which we make concrete later, let be a possibly non-linear solution operator of the PDE. For a particular , the solution is
To obtain observations , we define a projection operator . Consequently, (1) can be written out in full as
We solve the inverse problem (1) for by finding such that the data misfit, , is minimised. As already mentioned in the introduction, this is typically an ill-posed problem: there may be no solution, it may not be unique, there may exist a dimensionality mismatch between the observations and the quantity being inferred, and it may depend sensitively on . To proceed, we choose the Bayesian framework for regularising the problem to make it amenable to analysis and practical implementation. We describe our prior knowledge about in terms of a prior probability measure on the subspace of and use Bayes’ formula to calculate the posterior probability measure, , for given . The relationship between the posterior and prior is expressed as
where is the Radon-Nikodym derivative of with respect to , and is the potential function which is determined by the forward problem (1), specifically and . To ensure that is a valid probability measure, we have .
From here on, we assume that , where is the Euclidean norm, and we treat data and
as vectors, i.e.and . We specify the additive noise vector as Gaussian such that
is the standard deviation of the measurement noise and
is the identity matrix. We can writeconveniently as
where is the norm induced by the weighted inner product444For any self-adjoint positive operator , weighted inner product is , and the induced norm is .
We restrict the space of solutions to be a Hilbert space and place a Gaussian prior measure on with mean and covariance operator such that
For detailed assumptions on , , and that are required for deriving the posterior probability measure, we refer the reader to stuartInverseProblemsBayesian2010.
The objective is to find the posterior measure conditioned on the observations, as dictated by Bayes’s rule. The forward map (1) and the respective functions must be discretised. In Bayesian inference there are two possible approaches for discretisation: 1) apply the Bayesian methodology first, discretize afterwards, or 2) discretize first, then apply the Bayesian methodology (stuartInverseProblemsBayesian2010).
The first approach develops the solution of the inference problem in the function space before discretising it. A widely used algorithm of this form is the pre-conditioned Crank-Nicholson (pCN) MCMC scheme, where proposals are based on the prior measure and the current state of the Markov chain. The pCN method is a standard choice for high-dimensional sampling problems, as its implementation is well-defined and is invariant to mesh refinement (cotterMCMCMethodsFunctions2013; pinskiAlgorithmsKullbackLeibler2015). Since we will use this algorithm as one of the baselines, a summary of the algorithm is provided in C.1. Recently, infinite-dimensional MCMC schemes that leverage the geometry of the posterior to improve the efficiency have been proposed, see beskosGeometricMCMCInfinitedimensional2017. Other than MCMC schemes, some variational Bayes formulations in function space have been proposed (for example, minhInfinitedimensionalLogDeterminantDivergences2017; burtUnderstandingVariationalInference2021), though currently they do not offer a viable computational alternative to the finite-dimensional formulation of variational inference.
The second approach proceeds by first discretising the problem and then deriving the inference method. This approach forms the basis of almost all inference procedures developed in engineering: MCMC algorithms such as Metropolis-Hastings (metropolisEquationStateCalculations1953; hastingsMonteCarloSampling1970) or Hamiltonian Monte Carlo (HMC) (duaneHybridMonteCarlo1987), the Laplace approximation, or variational Bayes (jordanVariationalFormulationFokkerplanck1998; jordanIntroductionVariationalMethods1999) are used to approximate the posterior. In the discretised formulation, HMC has achieved recognition as the gold standard for its good convergence properties, favourable performance on high-dimensional and poorly conditioned problems, and universality of implementation that enables its generic use in many applications through probabilistic programming languages (e.g., Stan (carpenterStanProbabilisticProgramming2017)). Therefore, along with the pCN scheme mentioned above, our baseline for inference methods includes the HMC method, and we provide a summary of the HMC scheme in Sec. C.2.
For the rest of the exposition in this paper, we will focus on algorithms in the finite-dimensional case, where we discretise to yield a vector . In finite dimensions, probability densities with respect to the Lebesgue measure can be defined, thus leading to a more familiar form of the Bayes’s rule:
where is the posterior density, is the likelihood of the observed data for a given discretised and is determined by the discretised forward problem (1) and noise . The prior density for , which itself may depend on some (hyper-) parameters , is denoted by . Next two sections focus on discussing and , respectively.
2.3 Poisson Equation and likelihood
Let us consider a specific forward problem where is the solution to the Poisson problem:
where , with , is the log-diffusion coefficient, is the unknown, and is a deterministic forcing term. The boundary conditions have been omitted for brevity. We are given noisy observations of the solution at a finite set of points, . The observation points are collected in the matrix . Although this PDE is linear in for a given , the methodology in this paper applies to non-linear cases and can be extended for time-dependent cases such as the inverse problem of inferring initial conditions of a system given observations of the system at a later time.
We discretise the weak form of the Poisson problem (8) with a standard finite element approach. Specifically, the domain of interest is subdivided into a set of non-overlapping elements of size such that:
The unknown field is approximated with Lagrange basis functions and the respective nodal coefficients of the non-Dirichlet boundary mesh nodes by
The discretisation of the weak form of the Poisson equation yields the linear system of equations
where is the stiffness matrix, is the vector of log-diffusion coefficients, is the nodal source vector. The stiffness matrix of an element with label is given by
where the log-diffusion coefficient of the element is assumed to be constant within the element. The source vector is discretised as:
Hence, according to the observation model (5) the likelihood is given by
where the matrix represents the discretisation of the observation operator .
Then the mapping from the coefficients to the solution is . The unconditional distribution of is given by:
where is deterministic as defined in (11) but appears in it non-linearly, implying that the inference is not analytically tractable.
Throughout the experiments in the later sections, we either set Dirichlet (essential) boundary conditions everywhere (for example ), or assume Neumann (natural) boundary conditions on parts of the boundary. The choice will be made explicit in each experiment. To compute the likelihood, we solve the Poisson problem (8) for using the finite element method (FEM).
As discussed above, we place a Gaussian measure on , . Properties of samples from the measure depend on mean and on the spectral properties of the covariance operator . We restrict the space of prior functions to . Then, operator can be constructed from the covariance function, as:
for any . This formulation is what is commonly referred to as a Gaussian process (GP) with mean function , which we assume to be zero, and covariance function such that
Even though the process is infinite-dimensional, an instantiation of the process is finite and reduces to a multivariate Gaussian distribution by definition. The covariance function is typically parametrised by a set of hyperparameters. One popular option, which satisfies assumptions about as per stuartInverseProblemsBayesian2010
, is the squared exponential kernel (also called the exponentiated quadratic or the radial basis function (RBF) kernel):
where is the Euclidean distance between the inputs. It depends on two hyper-parameters , the scaling parameter , and the length-scale . Note that, is an infinitely smooth function, which implies that so is . The RBF kernel imposes smoothness and stationarity assumptions on the solution; in addition, such choice of kernel offers a way to regularise the resulting optimisation problem. However, depending on the expert knowledge of the true solution, other kernels may be used to impose other assumptions such as periodicity.
Both conditioning and marginalisation of the GP can be done in closed form. In particular, consider the joint model of the values at training locations and the unknown test values at test locations :
where is the matrix resulting from evaluating at all pairs of training and test points. The conditional distribution of the function values given the values at is:
The marginal distribution can be recovered by finding the relevant part of the covariance matrix; for example, the marginal of given is .
In this work, we place a zero-mean Gaussian process prior on and assume the squared exponential kernel with length-scale
and fixed variance. As mentioned in the previous section, we assume that is constant on each element of the mesh (we use the same mesh as for discretising and ). We place the prior on so that the centroids of the elements are the training points of the GP:
3 Variational Bayes Approximation
3.1 Variational Bayes
We assume that any hyper-parameters of the prior are fixed, and are only interested in the posterior distribution of . The variational approach proceeds by approximating the true posterior according to (7) with a trial density , which is the minimiser of the discrepancy between a chosen family of trial densities and the true posterior distribution (jordanIntroductionVariationalMethods1999; jordanGraphicalModelsExponential2007). A typical choice for the measure of discrepancy between distributions is the Kullback-Leibler (KL) divergence (which due to the lack of symmetry is not a metric). To find the approximate posterior distribution we have:
Expanding the KL divergence term we obtain
The last term of the divergence, the log-marginal likelihood , is usually not known. However, we use the fact that the divergence is non-negative to obtain the bound
This inequality becomes an equality when the trial density and the posterior are equal. To minimize the divergence, it is sufficient to maximise , which is commonly referred to as the evidence lower bound (ELBO). The ELBO term can be rewritten as
To summarize, the task now becomes:
To maximize the with a gradient-based optimiser, we need to evaluate it and its gradients with respect to the parameters of . Although the divergence term of the is often available in closed form, involving the likelihood is generally not available. It can be approximated using a Monte Carlo approximation with samples from the trial density as follows:
where is the -th sample from . This is done through a reparameterisation trick, as described in B.1. Our empirical tests show that the value of in the range of 2–5 provides fast convergence of the optimisation, agreeing with previous literature (kingmaAutoencodingVariationalBayes2014). This approach is often referred to as stochastic variational inference (SVI). The Monte Carlo approximation is in line with the work in barajas-solanoApproximateBayesianModel2019 but in contrast with the analytic approximation based on the Hessian calculations proposed in tsilifisComputationallyEfficientVariational2016.
3.2 Specification of trial distribution
The specification of the approximating family of distributions determines how much structure of the true posterior distribution is captured by the variational approximation. To model complex relationships between the components of the posterior, a more complex approximating family of distributions is needed. As the richer family of distributions is likely to require more parameters, the optimisation of the usually non-convex becomes harder. A balance must be struck in this trade-off: the family should be rich enough, but the optimisation task should still be computationally tractable.
A practical and widely used variational family is the multivariate Gaussian distribution, parametrised by the mean vector and the covariance matrix. One of the key benefits of this choice is that the KL divergence term of the ELBO in (26) is available in closed form for a GP prior. The choice of the parametrisation of the covariance matrix determines how much structure, other than the mean estimate, is captured by the variational family. We discuss this in more detail in the next section.
Numerous approaches have been proposed to extend the trial distribution beyond the Gaussian family. A standard approach in situations when the true posterior distribution is likely to be multimodal is to consider mixtures of variational densities (bishopApproximatingPosteriorDistributions1998). A more recent development is embedding parameters of a mean-field approximation in a hierarchical model to induce variational dependencies between latent variables (tranCopulaVariationalInference2015; ranganathHierarchicalVariationalModels2016).
3.2.1 Gaussian trial distribution
Choosing the trial distribution as a multivariate Gaussian requires optimisation over the mean and the covariance matrix . The flexibility in choosing how we specify both of these parameters, especially the covariance matrix, enables us to balance the trade-off between the expressiveness of the approximating distribution and the computational efficiency.
The richest specification corresponds to parametrising the covariance matrix using its full Cholesky factor , i.e.,
This choice results in a dense covariance matrix that may be able to capture the full covariance structure between the inputs (i.e. each input may be correlated with every other input). Parametrising the components of automatically ensures that the covariance matrix is positive definite as necessary. The number of parameters to optimise grows as and this leads to a difficult optimisation task that needs to be carefully initialised and parametrised. We refer to this parametrisation as full-covariance variational Bayes (FCVB).
A much more efficient choice is a diagonal covariance matrix, which is often referred to as mean-field variational Bayes (MFVB). By limiting the number of parameters that need to be optimised, the optimisation task becomes simpler and the number of parameters grows only as . While more computationally efficient and easier to initialize, MFVB ignores much of the dependence structure of the posterior distribution.
3.3 Conditional independence and sparse precision matrices
Instead of parametrising the covariance matrix , or its Cholesky decomposition , in physical systems it is often advantageous to parametrise the precision matrix, , where . While a component of the covariance matrix expresses marginal
dependence between the two corresponding random variables, the elements of the precision matrix reflect theirconditional independence (rueApproximateBayesianInference2009). Or, more specifically, for two components and of the random vector we note
where denotes the respective component of . Furthermore, defining the vector from the random vector by removing its -th and -th component, we note
That is, if and only if is independent from , conditional on all other components of .
A succinct way to represent conditional independence is using an undirected graph whose nodes correspond to the random variables (bishopPatternRecognitionMachine2006). A graph edge is present between two graph vertices and if the corresponding random variables are not conditionally independent from each other, given all the other random variables. Or, expressed differently, the edges between the graph vertices correspond to non-zeros in the precision matrix. In our context, each graph vertex represents a finite element and graph edges are introduced according to geometric adjacency of the finite elements as determined by the mesh. To this end, we define the 1-neighbourhood of a finite element as the union of the element itself and of elements sharing a node with the element. The -neighbourhood is defined recursively as the union of all 1-neighbourhoods of all the elements in the -neighbourhood. We introduce an edge between two graph vertices when the respective elements are in the same -neighbourhood.
Figure 1 shows examples of adjacency graphs and the structure of the corresponding precision matrices for 5 random variables resulting from a discretisation of a 1D domain with 5 finite elements. In the considered examples the random variables represent the constant log-diffusion coefficient in the elements. As shown in Figures 0(b) and 0(c) choosing a larger -neighbourhood for graph construction leads to a denser precision matrix. For instance, from the structure of the precision matrix in Figure 0(b), which assumes a 1-neighbourhood structure, we can read for the log-diffusion coefficient of element the following conditional independence relationship:
When the coefficient of element is given, the coefficient of the neighbouring element is independent from all the remaining coefficients. This is intuitively plausible and in line with physical observations. Clearly, the covariance matrices corresponding to the given sparse precision matrices are dense. Hence, in the considered case the coefficient of element may still be correlated to the coefficient of element , i.e. . This correlation will most likely be relatively weak given the large distance between the two elements, but knowing the coefficient of element will certainly restrict the range of possible values for the coefficient of element .
After obtaining the structure of the precision matrix, which is sparse but, in general, not banded, one can reorder the numbering of the elements in the finite element mesh to reduce its bandwidth. This allows for efficient linear algebra operations. See cuthillReducingBandwidthSparse1969 for an example of a reordering algorithm. Once a minimum bandwidth ordering with has been established, we use the property that the bandwidth of the Cholesky factor of matrix is less than or equal to the bandwidth of (rueGaussianMarkovRandom2005). Finally, the parameters we optimise are the components of the lower band of size of matrix , so that the approximating distribution reads
This process of devising a parametrisation for the precision matrix for a more complex mesh in 2D is illustrated in Fig. 2. This approach is computationally efficient – the number of parameters grows as – and is able to capture dependencies between all the random variables.
3.4 Stochastic optimisation
To maximize the in (27), we use the ADAM algorithm (kingmaAdamMethodStochastic2015)
. ADAM is a member of a larger class of stochastic optimisation methods that have become popular as tools for maximizing non-convex cost functions. These methods construct a stochastic estimate of the gradient to perform gradient descent-based optimisation. ADAM, a stochastic gradient descent algorithm with an adaptive step size is one popular algorithm that exhibits a stable behaviour on many problems and is easy to use without significant tuning. The algorithm uses a per-parameter step size, which is based on the first two moments of the estimate of the gradient for each parameter. Specifically, the step size is proportional to the ratio of the exponential moving average of the 1st moment to the square root of the exponential moving average of the non-centred 2nd moment. At any point, the exponential moving average is computed with decay parametersand for the 1st and 2nd moment, respectively. We adopt the parameter values suggested in kingmaAdamMethodStochastic2015: and . The speed of convergence is further controlled by the learning parameter which is used to regulate the step size for all parameters in the same way. In our experiments, we set it to 0.01 and let it decay exponentially every 2,500 steps (1,000 for MFVB), with the decay rate of 0.96. While the ADAM algorithm performs well on a variety of problems, it has been shown that the convergence of this algorithm is poor on some problems (reddiConvergenceAdam2018). We discuss alternative approaches as potential future work in Sec. 5.
To monitor convergence, we use a rule that tracks an exponentially weighted moving average of the decrease in the loss values between successive steps, and stops when that average drops below a threshold. The use of such an adaptive rule gives us a way to track the convergence of the algorithm and provides a conservative estimate for the time it takes for the optimisation to converge. This rule can be adapted based on the available computational budget.
3.5 The algorithm
The maximisation of the ELBO in (26) involves finding the parameters of the trial distribution , i.e. its mean and Cholesky factor , that minimize between and the posterior . Algorithm 1 shows the required steps to compute the ELBO and its gradients with respect to the parameters of the trial distribution. Different from the discussion so far, in Algorithm 1 it is assumed that there are multiple independent observation vectors with .
We evaluate the efficacy of variational inference first for 1D and 2D Poisson equation examples and then a benchmark proposed by aristoffBenchmarkBayesianInversion2021. We discretize the examples with a standard finite element method using linear Lagrange basis functions. We compare against two sampling-based inference schemes, Hamiltonian Monte Carlo (HMC) and pre-conditioned Crank-Nicholson Markov Chain Monte Carlo (pCN); both are known to be asymptotically correct as the number of samples increases. The evaluation criteria we use focus on three aspects of an inference scheme: the accuracy with respect to capturing the mean and the variance of the solution; propagation of uncertainty in derived quantities of interest; and the time until convergence of the solution.
To assess the propagation of uncertainty in derived quantities of interest, we consider a summary quantity for which a point estimate alone may not be informative enough for downstream tasks. In particular, we compute the log of total boundary flux through the boundary :
where is a unit vector normal to the boundary .
To quantitatively assess the inference of , we obtain samples from the posterior distribution of , . For synthetic experiments, where we know the true which generated the observations, we compute the mean error norm. The computation is the Euclidean norm of the error between the true value, , and the mean of the obtained samples:
Further, we compute the expected error in the solution space. This measures how close the solutions corresponding to the samples of are to the true solution . Specifically, we compute
4.1 Poisson 1D
For this experiment, we assume the unit-line domain, which is discretised into 32 equal-length elements. We impose Dirichlet boundary conditions on both boundaries, specifically we set ; the forcing is constant everywhere . Unless specified otherwise, all experiments in this section use observations per sensor and the sensor noise . Sensors are located on each of the discretisation nodes. For the prior on , we choose a zero-mean Gaussian process with squared exponential kernel (see Sec. 2.4 for details). We compare the results for three specifications of the prior length-scale, . The length-scale used to generate the data is . For inferences made using data generated by a shorter length-scale, see A.
4.1.1 VB performs competitively based on error norms
Fig. 3 shows the mean error norm (35) and the expected solution error norm (36) obtained from 10,000 posterior samples of from Hamiltonian Monte Carlo (HMC), pre-conditioned Crank-Nicholson MCMC (pCN), as well as VB inference with different parametrisations of the covariance/precision matrix. It is evident that for prior length-scales , the mean error norms computed by the variational Bayes methods are very close to the estimates from HMC and pCN. For prior , the mean error norm computed by MFVB is lower than other VB methods and MCMC methods. This is most likely due to MFVB being a much easier optimisation task compared to other VB methods with more optimisation parameters that capture dependencies. For the expected solution error norm, MFVB posterior consistently underestimates the uncertainty in , thus ignoring possible values of which are consistent with the data. This is further confirmed in the qualitative assessment of uncertainty in the next section. While MCMC methods are asymptotically correct, in practice, devising efficient samplers for high-dimensional problems within a limited computational budget is still a challenging task and requires substantial hand-tuning. To affirm that all the VB methods provide a good estimate of the mean of , as compared to MCMC methods, is better demonstrated by inspecting Fig. 4 which shows not only the mean but also the posterior uncertainty, which we discuss next.
4.1.2 VB adequately estimates posterior variance
Fig. 4 shows the true values of (red), the posterior means (black) and plus and minus two times the standard deviation (blue shaded regions) estimated by HMC, pCN, and variational inference with mean-field (MFVB), full covariance (FCVB), and precision matrix (PMVB) parametrisations for different values of prior length-scales.
We observe that the posterior variance estimates computed by HMC, pCN, and full covariance VB are qualitatively very similar, with the estimated uncertainty increasing with increasing distance from the fixed boundary. However, the MFVB solution greatly underestimates posterior variance while computing a reasonable estimate of the posterior mean. The over-confidence of MFVB means that values of that are consistent with the observed data are ignored; this may lead to poor calibration if the MFVB posterior is used as the true in downstream tasks or in other contexts. For the PMVB parametrisation, the uncertainty is underestimated to a much lesser extent.
The observations above are further confirmed by the density plot of our quantity of interest: the log of the total flux on the boundary, shown in Fig. 5.
For this example, we compute the flux on the left boundary at and show the posterior distribution of this quantity. For longer prior length-scales, FCVB and PMVB agree with the estimates obtained from pCN and HMC, whereas mean-field VB underestimates the uncertainty. For the short prior length-scale (), both PMVB and MFVB underestimate the uncertainty as compared with HMC, pCN, and FCVB schemes. The posterior distribution of FCVB approximately agrees with the MCMC schemes.
For the results obtained using the PMVB scheme, we used the 10-neighborhood structure to define the adjacency matrix and the non-zero elements of the precision matrix, (see Sec. 3.3). The order of the neighbourhood structure, which corresponds to the precision matrix bandwidth, determines how much dependence within is captured by the approximating posterior distribution. In Fig. 6, we show how the estimate of the mean and the variance of changes for different orders of neighbourhood structure. As expected, with the increasing bandwidth, the posterior estimate of gets closer to the estimate of FCVB, HMC, and pCN (shown in Fig. 4). While there is a significant change in the uncertainty estimate when we increase the bandwidth from 2 to 10, it is less pronounced when we change it from 10 to 20. For this reason, we choose the value of 10 for the PMVB parametrisation in 1D.
4.1.3 VB estimates improve with more observations and decreasing observational noise
The consistency of the posterior refers to the contraction of the posterior distribution to the truth as the data quality increases, i.e. either the number of observations increases or observation noise tends to zero. A recent line of work (abrahamStatisticalCalderonProblems2020; monardStatisticalGuaranteesBayesian2020; giordanoConsistencyBayesianInference2020) showed the posterior consistency for the estimates obtained using popular MCMC schemes such as pCN or unadjusted discretised Langevin algorithm for Bayesian inverse problems based on PDE forward mappings. While similar results are not available for VB methods in infinite-dimensional case, consistency and Bernstein-von Mises type results have been shown for the finite-dimensional case, including Bayesian inverse problems (wangFrequentistConsistencyVariational2019; luGaussianApproximationsProbability2017). Empirically, our experiments show that for the given family of trial distributions the VB posterior distribution contracts to the true .
Firstly, we show that increasing the number of observations, , results in a more accurate estimate. Given that the observations, , are independent of each other, the likelihood term of the ELBO (see Eq. (26)) is the product of the individual likelihood terms:
Secondly, by decreasing the observational noise we expect the posterior distribution to get closer to the ground truth and with lower uncertainty. Fig. 7 shows the true values of (red), the posterior mean estimates (black) and plus and minus two times the standard deviation (blue shaded regions) obtained by different variants of variational Bayes for varying numbers of observations (top panel) and different values of observational noise (bottom panel). We can see that MFVB underestimates the posterior variances and these estimates do not depend on the number of observations (top panel in Fig. 7) or the amount of observational noise (bottom panel in Fig. 7). However, the FCVB and PMVB uncertainty estimates get narrower with increasing number of observations and with decreasing observational noise, which is a desirable behaviour that should be exhibited by any consistent uncertainty estimation method. We can also see that the true solution is contained within the uncertainty bounds for all numbers of observations and noise levels for the full covariance parametrisation. This is not the case for the mean-field VB, providing another indication of uncertainty underestimation for this parametrisation.
4.1.4 VB is an order of magnitude faster than HMC
For HMC estimates, we obtain 200,000 samples out of which the first 100,000 are used to calibrate the sampling scheme and are subsequently discarded. Table 1 provides the run-times for HMC, MFVB, FCVB, and PMVB. For the HMC column, we also report (shown in brackets) the range of effective sample sizes (ESS) across different components of . For details on ESS, we refer the reader to (gelmanBayesianDataAnalysis2013, Ch. 11). Even with conservative convergence criteria (described in Sec. 3.4), the computational cost of VB algorithms is up to 25 times lower than that of HMC.
To emphasize the computational efficiency of the variational inference, we show the posterior estimates for different number of Monte Carlo samples in the estimation of ELBO. Fig. 8 shows that on a qualitative level, a low number of samples is sufficient to obtain a good estimate.
In particular, even with 2 Monte Carlo samples, the estimates are very similar to the case where . However, a lower number of samples may result in slower convergence of the optimisation scheme. Fig. 9 shows that for the FCVB and PMVB parametrisations, where the number of optimised parameters is larger than for MFVB, increasing the number of SVI samples may speed up the convergence of the optimisation. The effect is not as strong for the MFVB parametrisation.
4.2 Poisson 2D
We consider a 2D Poisson problem on the unit-square domain with a circular hole as shown in Fig. 10, with boundary conditions as indicated in the same figure. The problem is discretised with 208 linear triangular elements and 125 nodes. The forcing term is assumed to be constant throughout the domain, . Unless specified otherwise, all experiments in this section use observations per sensor and the sensor noise (note that for the 1D example we used ). The sensors are located at each node of the mesh. As in the 1D example, we assume a zero-mean GP prior on with square exponential kernel with varying length-scale, , as discussed in Sec. 2.4.
Firstly, the results in Fig. 11 show that the mean error of VB methods is very similar to the sampling methods (pCN and HMC). Similarly to the 1D case, the expected solution error norm is highest for MFVB estimate, indicating the lack of capturing the possible values of for which the solutions, , are consistent with the observed data. The results also show that both errors are lowest when the prior matches the length-scale used to generate the data.
Fig. 12-14 show the results for the posterior mean and the standard deviation of , and the solution corresponding to the mean of the posterior. We consider three configurations with prior length-scale , where the length-scale used to generate the data is . In all cases, the estimates of the posterior mean of and the corresponding solutions are very close to the true values. Similarly to the 1D case discussed in Sec. 4.1, the variance estimates between HMC and FCVB are consistent, especially for longer prior length-scales. There seems to be a discrepancy between the estimates obtained using MFVB and those obtained by other methods. The estimates obtained using precision-matrix parametrisation are qualitatively very close to the FCVB and MCMC estimates.
For the quantity of interest, we compute the log of the total flux along the right boundary of the domain (), and the results are shown in Fig. 15. Unlike the 1D case, the posterior estimates of the boundary flux are approximately the same for all the considered methods, except for the mean-field estimate when prior , where the MFVB estimate is biased as compared to the other methods.
The empirical computational cost for these experiments is given in Table 2. For the HMC experiments, we obtained 250,000 samples, out of which the first 125,000 were used to calibrate the sampling scheme and discarded afterwards. The timing results show that HMC takes an order of magnitude longer than variational Bayes, with some variation that depends on the parametrisation.
4.3 Inverse Problem Benchmark
We evaluate the effectiveness of VB methods on a recently proposed benchmark for Bayesian inverse problems (aristoffBenchmarkBayesianInversion2021). The benchmark aims to provide a test case that reflects practical applications, but at the same time is easy to replicate. Like above, the test case is a Poisson inverse problem where the task is to recover log-diffusion, , from a finite set of noisy observations. The problem domain is a unit square, the forcing function is constant throughout the domain, and the solution of the PDE is imposed to be zero on all four boundaries.
The benchmark discretizes using 64 quadrilateral elements, such that is constant for each individual element as shown in Fig. 16. The forward solution of the PDE is obtained after discretizing using bilinear quadrilateral elements. The locations where the solution is observed are placed on a uniform grid of 169 points (). The measurements are corrupted by the Gaussian noise with standard deviation . The authors of the benchmark provide the measurements as well as the true log-diffusion coefficient which generated the observations. The true log-diffusion coefficient, shown in Fig. 16, is zero throughout the domain, except two regions, where the value is and . It is these two jumps that make it a non-trivial test case.
Unlike in the previous examples, we place a prior on which does not induce any spatial correlation between any of the coefficients. The role of the prior is to express our belief about the ranges of the coefficients, rather than any dependencies. Although authors place for each component of independently, we choose as most of the coefficients of the true are at the baseline level equal to zero, and the fact that the corresponds to the diffusion parameter on the log-scale, a priori we do not expect such high variance.
We performed the inference using HMC, MFVB, FCVB, and PMVB. The means and standard deviations of inferred log-diffusion coefficients, together with the PDE solutions corresponding to the inferred means, are shown in Fig. 16. The results suggest that the mean estimates of all three methods do capture the jumps and the overall structure of . Specifically, the FCVB estimate of the mean of is closest to the true value. As for uncertainty quantification, the MFVB and PMVB estimates are closer to the HMC estimate (our assumed ground truth for the uncertainty) than the FCVB estimate. The FCVB estimate seems to overestimate the uncertainty at a few locations. This is potentially due to being stuck in a local optimum during the optimisation procedure, which for FCVB involves high-dimensional exploration.
In this paper, we have presented the variational inference framework for Bayesian inverse problems and investigated its efficacy on problems based on elliptic PDEs. Computationally, variational Bayes offers a tractable alternative to the intractable MCMC methods, and provides consistent mean and uncertainty estimates on the problems inspired by questions in computational mechanics. VB recasts the integration problem associated with Bayesian inference into an optimisation problem. As such, it is naturally integrated with existing FEM solvers, using the gradient calculations from the FEM solvers to optimise the ELBO in VB. Furthermore, the geometry of the problem encoded in the FEM mesh is utilised through the use of a sparse precision matrix that defines the conditional independence structure of the problem. Our results on the 1D and 2D Poisson problems support the claims of accuracy and scalability of VB. We note that the inferred variance is important in uncertainty quantification with a probabilistic forward model (for a different load case).
More specifically, our results show that
the mean of the variational posterior provides an accurate point estimate irrespective of the choice of the parametrisation of the covariance structure,
the variational approximation with a full-covariance or precision matrix structure adequately estimates posterior uncertainty when compared to HMC and pCN which are known to be asymptotically correct,
parametrising the multivariate Gaussian distribution using a sparse precision matrix provides a way to balance the trade-off between computational complexity and the ability to capture dependencies in the posterior distribution,
variational Bayes provides a good estimate for the mean and the variance of the posterior distribution in a time that is an order of magnitude faster than HMC or pCN,
the multivariate Gaussian variational family is flexible enough to capture the true posterior distribution with high accuracy,
the VB estimates may be used effectively in downstream tasks to estimate various quantities of interest, and
variational Bayes method is flexible enough to model multimodal posteriors, as illustrated on a preliminary truss example, see D.
Our work may be extended in a number of natural ways that allows for greater adaptivity to the specific problems encountered in applications and integration within existing frameworks. Firstly, taking advantage of fast implementations of sparse linear algebra routines would further improve the scalability of VB with the structured precision matrix, as proposed in our work. Secondly, casting the inverse problem in a multi-level setting and taking advantage of low-dimensional projections has potential to further improve computational efficiency (nagelUnifiedFrameworkMultilevel2016a; ghattasLearningPhysicsbasedModels2021). Thirdly, the results provided in this paper use standard off-the-shelf optimisation routines; further computational improvements may be achieved using customised algorithms. As a further extension, in some applications it may be informative to consider the uncertainty in the forcing function so that the forward mapping is stochastic, as discussed in (girolamiStatisticalFiniteElement2021). Finally, one of the aims of our work is to take advantage of the advances in Bayesian inference and adapt the novel algorithms to inverse problems in computational mechanics. As such, any further developments in VB as applied to machine learning and computational statistics problems may be directly applied using the framework proposed in this paper.
Codes for performing all forms of variational Bayes inference presented in this paper are available on Github at https://github.com/jp2011/bip-pde-vi. The user must provide their own PDE solver which accepts as input parameter and computes , together with its gradient with respect to .
We would like to thank William Baldwin for providing us the necessary background for and the implementation of the bimodal example; Garoe Dorta for his help with Tensorflow debugging. JP, IK, and MG were supported by the EPSRC grant EP/P020720/2 for Inference, COmputation and Numerics for Insights into Cities (ICONIC, https://iconicmath.org/). JP was also supported by EPSRC (EP/L015129/1). MG was supported by EPSRC grants EP/T000414/1, EP/R018413/2, EP/R034710/1, EP/R004889/1, and a Royal Academy of Engineering Research Chair in Data Centric Engineering.
Appendix A Short length-scale results
Appendix B Variational Inference
b.1 Reparametrisation trick
Reparametrisation trick allows computing the gradients of quantities derived from samples from a probability distribution with respect to the parameters of that probability distribution. This holds for probability distributions where samples can be obtained by a deterministic mapping, parametrised by , of other random variables.
Let be a set of random variables. We assume that samples of are given by a deterministic mapping
The KL divergence between approximating distribution and the prior is often available in closed form and so are its gradients with respect to . To estimate the gradients of the Monte Carlo estimate of the log-likelihood of the data,
we can use the chain rule of differentiation to obtain
Appendix C Markov Chain Monte Carlo
c.1 Pre-Conditioned Crank-Nicholson Scheme
We consider the pre-conditioned Crank-Nicholson scheme proposed by cotterMCMCMethodsFunctions2013. We summarise the procedure in Algorithm 2.
c.2 Hamiltonian Monte Carlo
Hamiltonian Monte Carlo (duaneHybridMonteCarlo1987) is a variant of Metropolis-Hastings (metropolisEquationStateCalculations1953; hastingsMonteCarloSampling1970) which takes advantage of the gradients of the target distribution in the proposal, allowing for a more rapid exploration of the sample space, even in a high-dimensional target space. For each component of the target space, the scheme adds a ‘momentum’ variable (note that this is different from used in B.1). Subsequently, and are updated jointly in a series of updates to propose a new sample that is then accepted/rejected.
The proposal is largely driven by the momentum variable. The proposal step starts with drawing a new value of from which needs to be specified. Then in a series of user-specified steps, , the momentum variable is updated based on the gradient of the of the target density, and is moved based on the momentum. Usually, the distribution of the momentum variable is , where is the so called ‘mass’ matrix. A diagonal matrix is often chosen to be able to efficiently sample from the momentum distribution. The full steps of the procedure are given in Algorithm 3.
The performance of the algorithm can be tuned in three ways: (i) choice of the momentum distribution , which in the version above requires specifying the mass matrix, (ii) adjusting the scaling factor of the leapfrog step, , and (iii) the number of leapfrog steps, . gelmanBayesianDataAnalysis2013 suggest setting and so that . They suggest tuning these so that the acceptance rate is about . As for the mass matrix, the authors suggest that it should approximately scale with the inverse covariance matrix of the posterior distribution, . This can be achieved by a pre-run from which the empirical covariance matrix can be computed.
Appendix D Bimodal example
One of the advantages of VB over Laplace approximation is the flexibility of the approximating distribution. To illustrate this, we consider an example with a one-dimensional truss which is fixed at one node and contains three degrees of freedom that correspond to the horizontal motion of the three nodes as shown in the left panel of Fig.19. We assume that the stiffness of each member is constant within each member and, furthermore, the stiffness of the member 1 is the average of the stiffness of the members at the ends, i.e. . The inverse problem is then defined as follows. Given the displacement vector and boundary conditions, find the unknown stiffness parameters and . To prevent negative or small stiffness, constraints are imposed on the stiffness of each member, , and Neumann boundary conditions are set to . Due to the constraint on the stiffness of member 1, the image of the forward problem is a manifold with dimension 2 embedded in displacement space . Due to the symmetry in this problem, the likelihood function, shown in centre panel of Fig. 19, is bimodal.
We place a multivariate Gaussian as prior on the stiffness parameters, and use a bimodal trial distribution to infer the posterior distribution of the parameters and given observed displacement . Specifically, we consider a mixture of two multivariate Gaussians with equal fixed mixture weights. As there is no closed form expression for the KL divergence between a mixture of Gaussians and a single Gaussian, we estimate the KL divergence term in the ELBO using Monte Carlo sampling. As shown in the right panel of Fig. 19, the resulting posterior distribution is bimodal and recovers the two modes present in the likelihood function. This illustrative example shows that when a proposed model exhibits multi-modality, the flexibility of variational Bayes methodology allows for specifying a family of trial distributions that can capture that property of the model.
=0mu plus 1mu