Random forward models and log-likelihoods in Bayesian inverse problems

by H. C. Lie et al.
Freie Universität Berlin

We consider the use of randomised forward models and log-likelihoods within the Bayesian approach to inverse problems. Such random approximations to the exact forward model or log-likelihood arise naturally when a computationally expensive model is approximated using a cheaper stochastic surrogate, as in Gaussian process emulation (kriging), or in the field of probabilistic numerical methods. We show that the Hellinger distance between the exact and approximate Bayesian posteriors is bounded by moments of the difference between the true and approximate log-likelihoods. Example applications of these stability results are given for randomised misfit models in large data applications and the probabilistic solution of ordinary differential equations.








1 Introduction

Inverse problems are ubiquitous in the applied sciences and in recent years renewed attention has been paid to their mathematical and statistical foundations (Evans and Stark, 2002; Kaipio and Somersalo, 2005; Stuart, 2010). Questions of well-posedness — i.e. the existence, uniqueness, and stability of solutions — have been of particular interest for infinite-dimensional/non-parametric inverse problems because of the need to ensure stable and discretisation-independent inferences (Lassas and Siltanen, 2004) and develop algorithms that scale well with respect to high discretisation dimension (Cotter et al., 2013).

This paper considers the stability of the posterior distribution in a Bayesian inverse problem (BIP) when an accurate but computationally intractable forward model or likelihood is replaced by a random surrogate or emulator. Such stochastic surrogates arise often in practice. For example, an expensive forward model such as the solution of a PDE may be replaced by a kriging/Gaussian process (GP) model (Stuart and Teckentrup, 2017). In the realm of “big data”, a residual vector of prohibitively high dimension may be randomly subsampled or orthogonally projected onto a randomly-chosen low-dimensional subspace (Le et al., 2017; Nemirovski et al., 2008). In the field of probabilistic numerical methods (Hennig et al., 2015), a deterministic dynamical system may be solved stochastically, with the stochasticity representing epistemic uncertainty about the behaviour of the system below the temporal or spatial grid scale (Conrad et al., 2016; Lie et al., 2017).

In each of the above-mentioned settings, the stochasticity in the forward model propagates to associated inverse problems, so that the Bayesian posterior becomes a random measure, $\mu_N^{\mathrm{sample}}$, which we define precisely in (3.1). Alternatively, one may choose to average over the randomness to obtain a marginal posterior, $\mu_N^{\mathrm{marg}}$, which we define precisely in (3.2). It is natural to ask in which sense the approximate posterior (either the random or the marginal version) is close to the ideal posterior of interest, $\mu$.

In earlier work, Stuart and Teckentrup (2017) examined the case in which the random surrogate was a GP. More precisely, the object subjected to GP emulation was either the forward model (i.e. the parameter-to-observation map) or the negative log-likelihood. The prior GP was assumed to be continuous, and was then conditioned upon finitely many observations (i.e. pointwise evaluations) of the parameter-to-observation map or negative log-likelihood as appropriate. That paper provided error bounds on the Hellinger distance between the BIP’s exact posterior distribution and various approximations based on the GP emulator, namely approximations based on the mean of the predictive (i.e. conditioned) GP, as well as approximations based on the full GP emulator. Those results showed that the Hellinger distance between the exact BIP posterior and its approximations can be bounded by moments of the error in the emulator.

In this paper, we extend the analysis of Stuart and Teckentrup (2017) to consider more general (i.e. non-Gaussian) random approximations to forward models and log-likelihoods, and quantify the impact upon the posterior measure in a BIP. After establishing some notation in Section 2, we state the main approximation theorems in Section 3. Section 4 gives an application of the general theory to random misfit models, in which high-dimensional data are rendered tractable by projection into a randomly-chosen low-dimensional subspace. Section 5 gives an application to the stochastic numerical solution of deterministic dynamical systems, in which the stochasticity is a device used to represent the impact of numerical discretisation uncertainty. The proofs of all theorems are deferred to an appendix located after the bibliographic references.

2 Setup and notation

2.1 Spaces of probability measures


$(\Omega, \mathcal{F}, \mathbb{P})$ is a fixed probability space that is rich enough to serve as a common domain for all random variables of interest.

The space of probability measures on the Borel $\sigma$-algebra of a topological space $\mathcal{X}$ will be denoted by $\mathcal{M}_1(\mathcal{X})$; in practice, $\mathcal{X}$ will be a separable Banach space.

When $\mu \in \mathcal{M}_1(\mathcal{X})$, integration of a measurable function (random variable) $f$ with respect to $\mu$ will also be denoted by expectation, i.e. $\mathbb{E}_\mu[f] := \int_{\mathcal{X}} f \, \mathrm{d}\mu$.

The space $\mathcal{M}_1(\mathcal{X})$ will be endowed with the Hellinger metric $d_H$: for $\mu, \nu \in \mathcal{M}_1(\mathcal{X})$ that are both absolutely continuous with respect to a reference measure $\pi$,

$$ d_H(\mu, \nu) := \left( \frac{1}{2} \int_{\mathcal{X}} \left( \sqrt{\frac{\mathrm{d}\mu}{\mathrm{d}\pi}} - \sqrt{\frac{\mathrm{d}\nu}{\mathrm{d}\pi}} \right)^{2} \mathrm{d}\pi \right)^{1/2} . \tag{2.1} $$

The Hellinger distance is in fact independent of the choice of reference measure $\pi$ and defines a metric on $\mathcal{M}_1(\mathcal{X})$ (Bogachev, 2007, Lemma 4.7.35–36), with respect to which $\mathcal{M}_1(\mathcal{X})$ evidently has diameter at most $1$. The Hellinger topology coincides with the total variation topology (Kraft, 1955) and is strictly weaker than the Kullback–Leibler (relative entropy) topology (Pinsker, 1964); all these topologies are strictly stronger than the topology of weak convergence of measures.

As used in Sections 3–5, the Hellinger metric is useful for uncertainty quantification when assessing the similarity of Bayesian posterior probability distributions, since expected values of square-integrable functions are Lipschitz continuous with respect to the Hellinger metric:

$$ \big| \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] \big| \le 2 \big( \mathbb{E}_\mu[|f|^2] + \mathbb{E}_\nu[|f|^2] \big)^{1/2} \, d_H(\mu, \nu) \tag{2.2} $$

when $f \in L^2(\mu) \cap L^2(\nu)$. In particular, for bounded $f$, $\big| \mathbb{E}_\mu[f] - \mathbb{E}_\nu[f] \big| \le 2\sqrt{2} \, \|f\|_\infty \, d_H(\mu, \nu)$.
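As a concrete numerical illustration of this Lipschitz property, one can check the inequality for two discretised distributions; the grid, the two Gaussians, and the quantity of interest `f` below are all invented for illustration, not taken from the paper:

```python
import numpy as np

def hellinger(p, q):
    """Hellinger distance (2.1) between two discrete probability vectors
    on a common finite grid (counting measure as reference measure)."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Two discretised Gaussians on the same grid.
x = np.linspace(-3.0, 3.0, 200)
p = np.exp(-0.5 * x ** 2);         p /= p.sum()
q = np.exp(-0.5 * (x - 0.5) ** 2); q /= q.sum()

dH = hellinger(p, q)

# Lipschitz bound: |E_p[f] - E_q[f]| <= 2 (E_p[f^2] + E_q[f^2])^{1/2} d_H(p, q).
f = np.tanh(x)                      # a bounded quantity of interest
lhs = abs(np.sum(f * p) - np.sum(f * q))
rhs = 2.0 * np.sqrt(np.sum(f ** 2 * p) + np.sum(f ** 2 * q)) * dH
```

The inequality holds exactly for the discrete distributions as computed, so discretisation and truncation of the Gaussians do not affect its validity.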

2.2 Bayesian inverse problems

By an inverse problem we mean the recovery of $u \in \mathcal{U}$ from an imperfect observation $y$ of $G(u)$, for a known forward operator $G \colon \mathcal{U} \to \mathcal{Y}$. In practice, the operator $G$ may arise as the composition of the solution operator of a system of ordinary or partial differential equations with an observation operator, and it is typically the case that $\mathcal{Y} = \mathbb{R}^d$ for some $d \in \mathbb{N}$, whereas $\mathcal{U}$ and the state space of the differential equation can have infinite dimension. For simplicity, we assume an additive noise model

$$ y = G(u) + \eta , \tag{2.3} $$

where the statistics but not the realisation of the noise $\eta$ are known. In the strict sense, this inverse problem is ill-posed: there may be no element $u \in \mathcal{U}$ for which $G(u) = y$, or there may be multiple such $u$ that are highly sensitive to the observed data $y$.

The Bayesian perspective eases these problems by interpreting $u$, $y$, and $\eta$ all as random variables or fields. Through knowledge of the distribution of $\eta$, (2.3) defines the conditional distribution of $y \mid u$. After positing a prior probability distribution $\mu_0$ for $u$, the Bayesian solution to the inverse problem is nothing other than the distribution of the conditioned random variable $u \mid y$. This posterior measure, which we denote $\mu^y$, is from the Bayesian point of view the proper synthesis of the prior information in $\mu_0$ with the observed data $y$. The same posterior can also be arrived at via the minimisation of penalised Kullback–Leibler or Dirichlet energies (Dupuis and Ellis, 1997; Jordan and Kinderlehrer, 1996; Ohta and Takatsu, 2011), where the penalisation again expresses a compromise between fidelity to the prior and fidelity to the data.

The rigorous formulation of Bayes’ formula for this context requires careful treatment and some further notation (Stuart, 2010). The pair $(u, y)$ is assumed to be a well-defined random variable with values in $\mathcal{U} \times \mathcal{Y}$. The marginal distribution of $u$ is the Bayesian prior $\mu_0$. The observational noise $\eta$ is distributed according to $\mathbb{Q}_0$, independently of $u$. The random variable $y \mid u$ is distributed according to $\mathbb{Q}_u$, the translate of $\mathbb{Q}_0$ by $G(u)$, which is assumed to be absolutely continuous with respect to $\mathbb{Q}_0$, with

$$ \frac{\mathrm{d}\mathbb{Q}_u}{\mathrm{d}\mathbb{Q}_0}(y) = \exp(-\Phi(u; y)) . $$

The function $\Phi$ is called the negative log-likelihood or simply potential. In the elementary setting of centred Gaussian noise, $\eta \sim \mathcal{N}(0, \Gamma)$ on $\mathcal{Y} = \mathbb{R}^d$, the potential is the non-negative quadratic misfit $\Phi(u; y) = \frac{1}{2} \big\| \Gamma^{-1/2} (y - G(u)) \big\|^2$. However, particularly for cases in which $\dim \mathcal{Y} = \infty$, it may be necessary to allow $\Phi$ to take negative values and even to be unbounded below (Stuart, 2010, Remark 3.8).

With this notation, Bayes’ theorem is then as follows (Dashti and Stuart, 2016, Theorem 3.4):

Theorem 2.1 (Generalised Bayesian formula).

Suppose that $\Phi \colon \mathcal{U} \times \mathcal{Y} \to \mathbb{R}$ is $\mu_0 \otimes \mathbb{Q}_0$-measurable and that

$$ Z(y) := \int_{\mathcal{U}} \exp(-\Phi(u; y)) \, \mathrm{d}\mu_0(u) $$

satisfies $Z(y) > 0$ for $\mathbb{Q}_0$-almost all $y$. Then, for such $y$, the conditional distribution $\mu^y$ of $u \mid y$ exists and is absolutely continuous with respect to $\mu_0$ with density

$$ \frac{\mathrm{d}\mu^y}{\mathrm{d}\mu_0}(u) = \frac{\exp(-\Phi(u; y))}{Z(y)} . \tag{2.4} $$
Note that, for (2.4) to make sense, it is essential to check that $Z(y) > 0$. Hereafter, to save space, we regard the data $y$ as fixed, and hence write $\mu$ in place of $\mu^y$, $Z$ in place of $Z(y)$, and $\Phi(u)$ in place of $\Phi(u; y)$. In particular, we shall redefine the negative log-likelihood as a function $\Phi \colon \mathcal{U} \to \mathbb{R}$, instead of a function $\Phi \colon \mathcal{U} \times \mathcal{Y} \to \mathbb{R}$ as in Theorem 2.1 above.
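To make the roles of the potential, the normalisation constant, and the density (2.4) concrete, here is a minimal grid-based sketch of a one-dimensional BIP; the forward map `G(u) = u**2`, the noise level, and the grid are invented for illustration only:

```python
import numpy as np

# Toy 1D Bayesian inverse problem: recover u from y = G(u) + eta.
u = np.linspace(-2.0, 2.0, 400)
prior = np.exp(-0.5 * u ** 2)
prior /= prior.sum()                       # discretised N(0, 1) prior mu_0

G = lambda v: v ** 2                       # illustrative forward operator
y, sigma = 1.0, 0.1                        # fixed data and noise standard deviation
Phi = 0.5 * ((y - G(u)) / sigma) ** 2      # quadratic potential (negative log-likelihood)

Z = np.sum(np.exp(-Phi) * prior)           # normalisation constant Z = E_{mu_0}[exp(-Phi)]
posterior = np.exp(-Phi) * prior / Z       # discretised Bayes' formula (2.4)
```

Because the datum $y = 1$ has two preimages under this forward map, the resulting posterior is bimodal, concentrating near $u \approx \pm 1$.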

From the perspective of numerical analysis, it is natural to ask about the well-posedness of the Bayesian posterior $\mu$: is it stable when the prior $\mu_0$, the potential $\Phi$, or the observed data $y$ are slightly perturbed, e.g. due to discretisation, truncation, or other numerical errors? For example, what is the impact of using an approximate numerical forward operator $G_N$ in place of $G$, and hence an approximate potential $\Phi_N$ in place of $\Phi$? Here, we quantify stability in the Hellinger metric $d_H$ from (2.1).

Stability of the posterior with respect to the observed data and the log-likelihood was established for Gaussian priors by Stuart (2010) and for more general priors by many later contributions (Dashti et al., 2012; Hosseini, 2017; Hosseini and Nigam, 2017; Sullivan, 2017). (We note in passing that the stability of BIPs with respect to perturbation of the prior is possible but much harder to establish, particularly when the data are highly informative and the normalisation constant $Z$ is close to zero; see e.g. the “brittleness” phenomenon of (Owhadi and Scovel, 2017; Owhadi et al., 2015).) Typical approximation theorems for the replacement of the potential $\Phi$ by a deterministic approximate potential $\Phi_N$, leading to an approximate posterior $\mu_N$, aim to transfer the convergence rate of the forward problem to the inverse problem, i.e. to prove an implication of the form

$$ \| \Phi - \Phi_N \| \le C \, \psi(N) \implies d_H(\mu, \mu_N) \le C' \, \psi(N) , $$

where $\|\cdot\|$ is a suitably well-behaved norm, $\psi(N) \to 0$ quantifies the convergence rate of the forward problem, and $C' > 0$ is a constant. Following Stuart and Teckentrup (2017), the purpose of this article is to extend this paradigm and these approximation results to the case in which the approximation $\Phi_N$ is a random object.

3 Well-posed Bayesian inverse problems with random likelihoods

In many practical applications, the negative log-likelihood $\Phi$ is computationally too expensive or impossible to evaluate exactly; one therefore often uses an approximation $\Phi_N$ of $\Phi$. This leads to an approximation $\mu_N$ of the exact posterior $\mu$, and a key desideratum is convergence, in a suitable sense, of $\mu_N$ to $\mu$ as the approximation error in the potential tends to zero.

The focus of this work is on random approximations $\Phi_N$ of $\Phi$. One particular example of such random approximations is the GP emulators analysed in Stuart and Teckentrup (2017); other examples include the randomised misfit models in Section 4 and the probabilistic numerical methods in Section 5. The present section extends the analysis of Stuart and Teckentrup (2017) from the case of GP approximations of forward models or log-likelihoods to more general non-Gaussian approximations. In doing so, more precise conditions are obtained for the exact Bayesian posterior to be well approximated by its random counterpart.

Let now $\Phi_N \colon \mathcal{U} \times \Omega \to \mathbb{R}$ be a measurable function that provides a random approximation to $\Phi$, where we recall that we have fixed the data $y$. Let $\nu$ be a probability measure such that the distribution of the random inputs of $\Phi_N$ is given by $\nu$; we sometimes abuse notation and think of $\Phi_N$ itself as being $\nu$-distributed. We assume throughout that the randomness in the approximation $\Phi_N$ of $\Phi$ is independent of the randomness in the parameters $u$ being inferred.

Replacing $\Phi$ by $\Phi_N$ in (2.4), we obtain the sample approximation $\mu_N^{\mathrm{sample}}$, the random measure given by

$$ \frac{\mathrm{d}\mu_N^{\mathrm{sample}}}{\mathrm{d}\mu_0}(u) := \frac{\exp(-\Phi_N(u))}{Z_N^{\mathrm{sample}}} , \qquad Z_N^{\mathrm{sample}} := \int_{\mathcal{U}} \exp(-\Phi_N(u)) \, \mathrm{d}\mu_0(u) . \tag{3.1} $$

(Henceforth, we will omit the random argument $\omega \in \Omega$ for brevity.) Thus, the measure $\mu$ is approximated by the random measure $\mu_N^{\mathrm{sample}}$, and the normalisation constant $Z_N^{\mathrm{sample}}$ is a random variable. A deterministic approximation of the posterior distribution can now be obtained either by fixing $\omega$, i.e. by taking one particular realisation of the random posterior $\mu_N^{\mathrm{sample}}$, or by taking the expected value of the random likelihood $\exp(-\Phi_N)$, i.e. by averaging over different realisations of $\Phi_N$. This yields the marginal approximation $\mu_N^{\mathrm{marg}}$ defined by

$$ \frac{\mathrm{d}\mu_N^{\mathrm{marg}}}{\mathrm{d}\mu_0}(u) := \frac{\mathbb{E}_\nu[\exp(-\Phi_N(u))]}{Z_N^{\mathrm{marg}}} , \tag{3.2} $$

where $Z_N^{\mathrm{marg}} := \int_{\mathcal{U}} \mathbb{E}_\nu[\exp(-\Phi_N(u))] \, \mathrm{d}\mu_0(u)$. We note that an alternative averaged, deterministic approximation can be obtained by taking the expected value of the density in (3.1) as a whole, i.e. by taking the expected value of the ratio rather than the ratio of expected values. A result very similar to Theorem 3.1, with slightly modified assumptions, holds also in this case, with the proof following the same steps. However, the marginal approximation presented here appears more intuitive and more amenable to applications. Firstly, the marginal approximation provides a clear interpretation as the posterior distribution obtained by the approximation of the true data likelihood $\exp(-\Phi)$ by $\mathbb{E}_\nu[\exp(-\Phi_N)]$. Secondly, the marginal approximation is more amenable to sampling methods such as Markov chain Monte Carlo, with clear connections to the pseudo-marginal approach (Andrieu and Roberts, 2009; Beaumont, 2003).
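The distinction between the sample and marginal approximations can be sketched numerically on a grid. In the sketch below, the randomly perturbed potential is an invented stand-in for a stochastic surrogate, not a construction from this paper; note that the marginal approximation averages the likelihood and normalises once (a ratio of expectations), whereas each sample approximation is normalised per realisation:

```python
import numpy as np

rng = np.random.default_rng(0)
u = np.linspace(-2.0, 2.0, 200)
prior = np.exp(-0.5 * u ** 2); prior /= prior.sum()   # discretised N(0,1) prior

y, sigma = 0.5, 0.3
Phi = 0.5 * ((y - u) / sigma) ** 2      # exact potential, identity forward map

def Phi_N(xi):
    """One realisation of a random potential: the exact potential plus a
    small random perturbation (an invented stand-in for a stochastic surrogate)."""
    return Phi + 0.1 * xi * np.sin(3.0 * u)

def posterior(pot):
    w = np.exp(-pot) * prior
    return w / w.sum()

# Sample approximation (3.1): one random posterior per realisation.
samples = [posterior(Phi_N(rng.standard_normal())) for _ in range(500)]

# Marginal approximation (3.2): average the *likelihood* over the randomness,
# then normalise once.
mean_lik = np.mean([np.exp(-Phi_N(rng.standard_normal())) for _ in range(500)], axis=0)
marginal = mean_lik * prior / np.sum(mean_lik * prior)
```

For this small perturbation the marginal approximation lies very close to the exact posterior, while individual sample posteriors fluctuate around it.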

3.1 Random misfit models

This section considers the general setting in which the deterministic potential $\Phi$ is approximated by a random potential $\Phi_N$. Recall from (2.4) that $Z$ is the normalisation constant of $\mu$, and that for $\mu$ to be well-defined, we must have $Z > 0$. The following two results, Theorems 3.1 and 3.2, extend Theorems 4.9 and 4.11 respectively of Stuart and Teckentrup (2017), in which the approximation $\Phi_N$ is a GP model:

Theorem 3.1 (Deterministic convergence of the marginal posterior).

Suppose that there exist scalars , independent of , such that, for the Hölder-conjugate exponent pairs , , and , we have

  1. ;

  2. ;

  3. .

Then there exists , independent of , such that


In the proof of Theorem 3.1, we show that hypothesis (a) arises as an upper bound on the quantity . In order for the conclusion of Theorem 3.1 to hold, we need the latter to be finite. Thus, hypothesis (a) is an exponential decay condition on the positive tails of either or , with respect to the appropriate measures. Alternatively, by applying Jensen’s inequality to , one can strengthen hypothesis (a) into the hypothesis of exponential integrability of either with respect to or with respect to ; this yields the same interpretation. Thus, the parameter quantifies the exponential decay of the positive tail of either or .

By comparing the quantity from hypothesis (a) with the quantity in hypothesis (b), it follows that hypothesis (b) is an exponential decay condition on the negative tails of both and . The two new parameters in this decay condition arise because we apply Hölder’s inequality twice in order to develop the desired bound (3.3) on . The key desideratum here is that the bound is multiplicative in some -norm of . The two new parameters and quantify the decay with respect to and respectively. Note that the interaction between the hypotheses (a) and (b) as described by the conjugate exponent pair implies that one can trade off faster exponential decay of one tail with slower exponential decay of the other.

The two-sided condition on $\Phi_N$ in hypothesis (c) ensures that both tails of $\Phi_N$ decay sufficiently quickly. This hypothesis ensures that the Radon–Nikodym derivative in (3.2) is well-defined.

Finally, we note that the quantity on the right hand side of (3.3a) depends directly on the conjugate exponents of and appearing in hypotheses (a) and (b). The more well behaved the quantities in these hypotheses are, the weaker the norm we can choose on the right hand side of (3.3a).

Theorem 3.2 (Mean-square convergence of the sample posterior).

Suppose that there exist scalars , independent of , such that, for Hölder-conjugate exponent pairs and , we have

  1. ;

  2. .



Hypothesis (a) of Theorem 3.2 arises during the proof as a result of developing an upper bound on the Hellinger distance between the exact posterior and the sample approximation that is multiplicative in some norm of $\Phi - \Phi_N$. Thus, it describes an exponential decay condition on the negative tails of both $\Phi$ and $\Phi_N$; in particular, hypothesis (a) is always satisfied when the potentials $\Phi$ and $\Phi_N$ are non-negative, as is usually the case for finite-dimensional data. The conjugate exponent pairs arise from one application of Hölder’s inequality used to fulfil the desideratum of multiplicativity, and they quantify the decay with respect to $\mu_0$ and $\nu$ respectively.

Hypothesis (b) of Theorem 3.2 arises as a result of developing an upper bound that fulfils the desideratum of multiplicativity mentioned above. The presence of both $\exp(-\Phi_N)$ and its reciprocal indicates that hypothesis (b) is analogous to hypothesis (c) of Theorem 3.1, in that hypothesis (b) is a condition on the tails of $\Phi_N$. The difference between hypothesis (b) of Theorem 3.2 and hypothesis (c) of Theorem 3.1 arises due to the fact that the Radon–Nikodym derivative in (3.1) features $\exp(-\Phi_N)$ instead of its expected value $\mathbb{E}_\nu[\exp(-\Phi_N)]$.

We now show that the assumptions of Theorems 3.1 and 3.2 are satisfied when the exact potential $\Phi$ and the approximation quality of the $\Phi_N$ are suitably well behaved. Since $Z > 0$, it follows that $Z > c$ for some $c > 0$.

Assumption 3.3.

There exists that does not depend on , such that, for all ,


and for any with the property that , there exists such that, for all ,


The lower bound conditions in (3.5) ensure that the hypothesised exponential decay conditions on the negative tails of the true likelihood and the random likelihoods from Theorems 3.1 and 3.2 are satisfied. The uniform lower bound on $\Phi$ translates into a uniform upper bound on the Radon–Nikodym derivative of the posterior with respect to the prior, and is a very mild condition that is satisfied in many, if not most, BIPs. Given this fact, it is reasonable to demand that the $\Phi_N$ satisfy the same uniform lower bound, $\nu$-almost surely and for all $N$; this is the content of the second condition in (3.5). Condition (3.6) expresses the requirement that, by choosing $N$ sufficiently large, one can approximate $\Phi$ arbitrarily well using the random $\Phi_N$, with respect to the appropriate topology. This assumption ensures that the stated aims of this work are reasonable.

Lemma 3.4.

Suppose that Assumption 3.3 holds with as in (3.5) and and as in (3.6), that for some with conjugate exponent , and there exists some that does not depend on , such that, for all ,


Then the hypotheses of Theorem 3.1 hold, with

and as above. Moreover, the hypotheses of Theorem 3.2 hold, with

The uniform upper bound condition on $\Phi_N$ in (3.7) is rather strong; we use it to ensure that the normalisation constant of the sample approximation is bounded away from zero, uniformly with respect to $N$ and the randomness. Together with the condition on $\Phi_N$ in (3.5), this translates to uniform lower and upper bounds on the random likelihood $\exp(-\Phi_N)$; the latter implies that hypothesis (b) in Theorem 3.2 holds with the stated values of the exponents. A sufficient condition for (3.7) is that the $\Phi_N$ are themselves uniformly bounded. This condition is of interest when the misfit is associated to a bounded forward model and the data take values in a bounded subset.

Lemma 3.5.

Suppose that Assumption 3.3 holds with as in (3.5) and and as in (3.6), and that there exists some such that . Then the hypotheses of Theorem 3.1 hold, with

and as above. Moreover, the hypotheses of Theorem 3.2 hold, with

By comparing the hypotheses and conclusions of Lemma 3.4 and Lemma 3.5, we observe that, by reducing the exponent of integrability, we can replace the strong uniform upper bound condition (3.7) on $\Phi_N$ from Lemma 3.4 with the weaker exponential integrability condition of Lemma 3.5, and thus increase the scope of applicability of the conclusion.

In Lemmas 3.4 and 3.5 above, we have specified the largest possible values of the exponents that are compatible with the hypotheses. This is because later, in Theorem 3.9, we will want to use the smallest possible values of the corresponding conjugate exponents in the resulting inequalities (3.3a) and (3.4).

3.2 Random forward models in quadratic potentials

In many settings, the potentials $\Phi$ and $\Phi_N$ have a common form and differ only in the parameter-to-observable map. In this section we shall assume that $\Phi$ and $\Phi_N$ are quadratic misfits of the form

$$ \Phi(u) = \frac{1}{2} \big\| \Gamma^{-1/2} (y - G(u)) \big\|^2 , \qquad \Phi_N(u) = \frac{1}{2} \big\| \Gamma^{-1/2} (y - G_N(u)) \big\|^2 , \tag{3.8} $$

corresponding to centred Gaussian observational noise $\eta \sim \mathcal{N}(0, \Gamma)$ with symmetric positive-definite covariance $\Gamma$. Again, we assume that $G$ is deterministic while $G_N$ is random. In this section, for this setting, we show how the quality of the approximation $G_N \approx G$ transfers to the approximation $\Phi_N \approx \Phi$, and hence to the approximation of the posterior (for either the sample or marginal approximate posterior).

Pointwise in the parameter $u$ and in the randomness, the errors in the misfit and the forward model are related according to the following proposition.

Proposition 3.6.

Let $\Phi$ and $\Phi_N$ be defined as in (3.8), where $\mathcal{Y} = \mathbb{R}^d$ for some $d \in \mathbb{N}$ and the eigenvalues of the operator $\Gamma$ are bounded away from zero. Then, for some $C > 0$, for all $u \in \mathcal{U}$, and $\nu$-almost surely,

$$ \big| \Phi(u) - \Phi_N(u) \big| \le C \big( \|y\| + \|G(u)\| + \|G_N(u)\| \big) \, \big\| G(u) - G_N(u) \big\| . \tag{3.9} $$
Hence, for and all ,


By assuming that $\mathcal{Y} = \mathbb{R}^d$, we assume that the data live in a finite-dimensional space. This is a standard assumption in the area, and implies that the operator $\Gamma$ is simply a matrix. The assumption that the eigenvalues of $\Gamma$ are bounded away from zero is equivalent to assuming that $\Gamma$ is invertible, which follows immediately from the assumption stated earlier that $\Gamma$ is a symmetric and positive-definite covariance matrix.
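The pointwise bound of Proposition 3.6 can be checked numerically at a single parameter value. In the sketch below the covariance, data vector, and forward-model outputs are all invented, and the constant used is one admissible choice derived from the spectral norm of $\Gamma^{-1}$ (a sketch, not the constant from the paper's proof):

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
A = rng.standard_normal((d, d))
Gamma = A @ A.T + d * np.eye(d)        # symmetric positive-definite noise covariance
Gi = np.linalg.inv(Gamma)

y = rng.standard_normal(d)             # fixed data vector
Gu = rng.standard_normal(d)            # forward-model output G(u) at a fixed u
GNu = Gu + 0.01 * rng.standard_normal(d)   # randomised approximation G_N(u)

def misfit(g):
    """Quadratic potential (3.8): 0.5 * (y - g)^T Gamma^{-1} (y - g)."""
    r = y - g
    return 0.5 * r @ Gi @ r

# |Phi - Phi_N| <= 0.5 * ||Gamma^{-1}|| * (2||y|| + ||G|| + ||G_N||) * ||G - G_N||,
# which follows from a^T Gi a - b^T Gi b = (a - b)^T Gi (a + b) with a = y - G, b = y - G_N.
err = abs(misfit(Gu) - misfit(GNu))
C = 0.5 * np.linalg.norm(Gi, 2)        # spectral norm of Gamma^{-1}
bound = C * (2 * np.linalg.norm(y) + np.linalg.norm(Gu) + np.linalg.norm(GNu)) \
        * np.linalg.norm(Gu - GNu)
```

The misfit error is thus controlled linearly by the forward-model error, with a prefactor that grows with the sizes of the data and model outputs.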

Corollary 3.7.

Let , and suppose that . If there exists an such that, for all ,

then, there exists some that does not depend on such that for all ,

where and is as in Proposition 3.6.

The hypotheses ensure that the integrability of the misfit determines the highest degree of integrability of the forward operators $G$ and $G_N$, and that, for sufficiently large $N$, we may make the norm of the difference $G - G_N$ in an appropriate topology small enough. The constraint (3.7) is used to combine the $G$ and $G_N$ terms in (3.9). The resulting simplification ensures that we may apply Lemma 3.8.

Lemma 3.8.

Let and be as in (3.8). If, for some ,


then Assumption 3.3 holds.

The lemma states that if the random forward model $G_N$ converges to the true forward model $G$ in the appropriate topology, then the conditions in Assumption 3.3 are satisfied by the corresponding random misfits $\Phi_N$. Since the misfits were assumed to be quadratic in (3.8), the key contribution of Lemma 3.8 is to ensure that the approximation quality condition (3.6) is satisfied.

We shall use the preceding results to obtain bounds on the Hellinger distance in terms of errors in the forward model, of the following form: for and that do not depend on ,


For brevity and simplicity, the following result uses one pair of exponents in (3.11) in order to obtain convergence statements for both the marginal and the sample posterior. If one is interested in only one of these measures, then one may optimise the exponents accordingly.

Theorem 3.9 (Convergence of posteriors for randomised forward models in quadratic potentials).

Let and be as in (3.8).

  1. Suppose there exists some with Hölder conjugate such that , and suppose that (3.7) holds for some . If as in (3.11) with and , then the following hold:

    1. there exists some that does not depend on , for which (3.12) holds with and , and

    2. there exists some that does not depend on , for which (3.13) holds with and .

  2. Suppose there exists some such that . If as in (3.11) with and , then the following hold:

    1. there exists some that does not depend on , for which (3.12) holds with and , and

    2. there exists some that does not depend on , for which (3.13) holds with and .

In both cases, and converge to in the appropriate metrics given in (3.12) and (3.13) respectively.

The proof of Theorem 3.9 consists of tracking the dependence of the parameters over the sequential application of the preceding results, all of which are used.

Case (a) applies in the situation where the random approximations $\Phi_N$ are uniformly bounded from above; as discussed earlier, this condition is satisfied in the case that the misfit is associated to a bounded forward model and the data take values in a bounded subset of $\mathcal{Y}$. Note that the topology of the convergence of $G_N$ to $G$ is quantified by the exponents in (3.11), one of which depends on the parameter that quantifies the exponential integrability of the misfit $\Phi$. In particular, the faster the exponential decay of the positive tail of $\Phi$ (i.e. the larger the value of this parameter), the stronger the topology of convergence of $G_N$ to $G$.

In contrast to case (a), case (b) does not assume that the misfit $\Phi$ is exponentially integrable or that the random approximations $\Phi_N$ are uniformly bounded from above $\nu$-almost surely. Instead, exponential integrability of the random misfit $\Phi_N$ is required. Another difference is that the exponential integrability parameter determines the strength of the topology of convergence of the random forward models, not only with respect to the $\nu$-topology but also with respect to the $\mu_0$-topology.

4 Application: randomised misfit models

This section considers a particular Monte Carlo approximation of a quadratic potential $\Phi$, proposed by Nemirovski et al. (2008) and Shapiro et al. (2009), and further applied and analysed in the context of BIPs by Le et al. (2017). This approximation is particularly useful when the data has very high dimension $d$, so that one does not wish to interrogate every component of the data vector $y$, or evaluate every component of the model prediction $G(u)$ and compare it with the corresponding component of $y$.

Let $\sigma$ be an $\mathbb{R}^d$-valued random vector with mean zero and identity covariance, and let $\sigma_1, \dots, \sigma_N$ be independent and identically distributed copies (samples) of $\sigma$. Since $\mathbb{E}\big[ \langle \sigma, r \rangle^2 \big] = \|r\|^2$ for any fixed $r \in \mathbb{R}^d$, we then have the following approximation:

$$ \Phi(u) = \frac{1}{2} \big\| \Gamma^{-1/2} (y - G(u)) \big\|^2 \approx \frac{1}{2N} \sum_{j=1}^{N} \big\langle \sigma_j , \, \Gamma^{-1/2} (y - G(u)) \big\rangle^2 =: \Phi_N(u) .$$
The analysis and numerical studies in Le et al. (2017, Sections 3–4) suggest that a good choice for the random vector $\sigma$ would be one with independent and identically distributed (i.i.d.) entries from a sub-Gaussian probability distribution on $\mathbb{R}$. Examples of sub-Gaussian distributions considered include

  1. the standard Gaussian distribution: $\sigma_i \sim \mathcal{N}(0, 1)$, for $i = 1, \dots, d$; and

  2. the $\epsilon$-sparse distribution: for $\epsilon \in (0, 1]$, let $s := 1/\sqrt{\epsilon}$ and set, for $i = 1, \dots, d$,

$$ \sigma_i = \begin{cases} +s & \text{with probability } \epsilon/2 , \\ 0 & \text{with probability } 1 - \epsilon , \\ -s & \text{with probability } \epsilon/2 . \end{cases} $$

The randomised misfit can provide computational benefits in two ways. Firstly, a single evaluation of $\Phi_N$ can be made cheap by choosing the $\epsilon$-sparse distribution for $\sigma$, with large sparsity parameter $s$. This choice ensures that a large proportion of the entries of each sample $\sigma_j$ will be zero, significantly reducing the cost to compute the required inner products in $\Phi_N$, since there is no need to compute the components of the data or model vector that will be eliminated by the sparsity pattern. The value of $N$ of course also influences the computational cost. It is observed by Le et al. (2017) that, for large $d$ and moderate $N$, the random potential $\Phi_N$ and the original potential $\Phi$ are already very similar, in particular having approximately the same minimisers and minimum values. Statistically, these correspond to the maximum likelihood estimators under $\Phi_N$ and $\Phi$ being very similar; after weighting by a prior, this corresponds to similarity of maximum a posteriori (MAP) estimators.
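The randomised misfit construction can be sketched as follows; the residual vector, the dimensions, and the parameter values are invented for illustration, with the probe vectors drawn from the $\epsilon$-sparse distribution defined above:

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_sample(d, eps, rng):
    """One epsilon-sparse vector: entries are +-1/sqrt(eps) with probability
    eps/2 each and 0 otherwise, giving mean zero and unit variance."""
    v = rng.random(d)
    s = np.zeros(d)
    s[v < eps / 2] = 1.0 / np.sqrt(eps)
    s[(v >= eps / 2) & (v < eps)] = -1.0 / np.sqrt(eps)
    return s

d, eps, N = 10_000, 0.05, 200
r = rng.standard_normal(d)            # whitened residual Gamma^{-1/2}(y - G(u))
Phi = 0.5 * r @ r                     # true quadratic misfit

# Randomised misfit: an unbiased Monte Carlo estimate using N sparse probes,
# each of which touches only about eps*d components of the residual.
Phi_N = 0.5 * np.mean([(sparse_sample(d, eps, rng) @ r) ** 2 for _ in range(N)])

rel_err = abs(Phi_N - Phi) / Phi
```

With $\epsilon = 0.05$, each probe interrogates roughly 5% of the residual's components, yet the estimate remains unbiased because each probe vector has unit variance per entry.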

The second benefit of the randomised misfit approach, and the main motivation for its use in Le et al. (2017), is the reduction in computational effort needed to compute the MAP estimate. This task involves the solution of a large-scale optimisation problem with the misfit in the objective function, which is typically done using inexact Newton methods. It is shown by Le et al. (2017) that the required number of evaluations of the forward model and its adjoint is drastically reduced when using the randomised misfit $\Phi_N$ as opposed to using the true misfit $\Phi$.

The aim of this section is to show that the use of the randomised misfit leads to good approximation not only of the MAP estimate but in fact of the whole Bayesian posterior distribution. Thus, the corresponding conjecture is that the ideal and deterministic posterior $\mu$ is well approximated by the random posterior $\mu_N^{\mathrm{sample}}$. Indeed, via Theorem 3.2, we have the following convergence result for the case of a sparsifying distribution:

Proposition 4.1.

Suppose that the entries of $\sigma$ are i.i.d. $\epsilon$-sparse, for some $\epsilon \in (0, 1]$, and that the misfit $\Phi$ is suitably integrable with respect to the prior $\mu_0$. Then there exists a constant $C > 0$, independent of $N$, such that

$$ \mathbb{E}_\nu \big[ d_H(\mu, \mu_N^{\mathrm{sample}})^2 \big]^{1/2} \le \frac{C}{\sqrt{N}} . \tag{4.1} $$
(In this section, plays the role of the distribution of .) As the proof reveals, a valid choice of the constant in (4.1) is


where the constant is as in Theorem 3.2. Thus, as one would expect, the accuracy of the approximation decreases as approaches the complete sparsification case or as the data dimension increases, but always with the same convergence rate
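The $N^{-1/2}$ rate can be observed empirically at the level of the misfit itself. The sketch below uses an invented toy residual and standard Gaussian probe vectors; the point is the halving of the root-mean-square error under each fourfold increase in $N$, not the constants:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 1_000
r = rng.standard_normal(d)            # fixed whitened residual
Phi = 0.5 * r @ r                     # true misfit

def phi_N(N):
    """Randomised misfit with N standard-Gaussian probe vectors sigma_j."""
    S = rng.standard_normal((N, d))
    return 0.5 * np.mean((S @ r) ** 2)

def rmse(N, reps=100):
    """Root-mean-square error of the randomised misfit over repeated experiments."""
    return np.sqrt(np.mean([(phi_N(N) - Phi) ** 2 for _ in range(reps)]))

# Quadrupling N should roughly halve the RMSE, i.e. the rate is N^{-1/2}.
errs = [rmse(N) for N in (10, 40, 160)]
```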