Bayesian UQ in a nutshell is Bayesian inference on a (possibly infinite dimensional) parameter with data such that, for example,
The regressor , or Forward Map (FM), is commonly a complex non-linear map arising from unknown parameters in a system of ODEs or PDEs. Then to evaluate we require to solve a system of (O,P)DEs. Not only that, but this commonly involves a numerical solution with some error , which is the actual regressor we can work with in our computer. A prior is stated for and a numerical posterior distribution is obtained. represents a discretization used to approximate the FM and as increases the discretization becomes finer and the approximation becomes tighter.
In this paper we are concerned with the numerical error induced in this posterior in comparison to the theoretical posterior (when considering the exact theoretical FM ) and also on the error introduced in the numerical posterior when using a truncated, finite dimensional prior . Moreover, we dicuss practical guidelines to choose the numerical discretization refinement and the priori truncation in order to have correct posterior numerical error control. We consider a general, not necessarily Gaussian, model for the data s.
use a posterior bound (once the data is seen) and requires the estimation of normalizations constants. Here we use a prior(predictive) bound that results in a global bound for the FM to control the numerical error; a brief review ofCapistrán et al. (2016) and its shortcomings is given below in section 1.2.
Undoubtedly, the first step is to define the posterior distribution in a general setting including infinite dimensional spaces. In the context of Bayesian inverse problems Stuart (2010) did several advances and found regularity conditions for the posterior to exists in a fairly general setting (see Stuart, 2010, and references therein)
. The normalization constant in the posterior is proven to be finite and positive and thus the posterior is indeed a probability measure using boundedness assumptions on the likelihood and considering Gaussian priors(Stuart, 2010, assumption 2.7(i,ii) and theorem 4.1). Recently Hosseini and Nigam (2017) generalized the latter now considering priors with exponentially decaying tails, using the same regularity conditions.
However, to our surprise, in studying the mentioned results we found out that in other contexts defining the posterior distribution in general spaces is a very well known task; a nice example is contained in the text book Schervish (1997). A very powerful tool that can be used here is disintengration, although it is not essential. The principal remark here is that the existence of the posterior distribution can be established in a far more general sense than what Stuart (2010) establishes and these results are well known in the general Bayesian literarure. Below we discuss the existence of the posterior distribution in this perspective.
1.1 Existence of the posterior distribution in infinite dimensional spaces
It always puzzle us that in any other context of Bayesain inference we need not worry for, for example, the prior tail behaviour (Stuart, 2010) or, in fact, any other condition for the posterior to exists
. The usual practice is to define a parametric model for data, , a prior for the parameter and without guilt and further protection we declare
to be a joint distribution on; the usual argument being that, it is indeed positive and . But, when does define a joint distribution? when, to start with, the latter integrals exist and swap? But in any case, we depart from the construction of a joint probability measure for both .
What we call modern Bayesian statistics, in its foundations, requires exactly that: a joint probability measureon the whole measurable space of uncertain events, both observable, , or not, . The existence of such measure is proven by assuming a set of axioms on a preference relationship on events on based on a system of bets performed by an agent. Conditional on the chosen space and on the agent preferred system of bets quantifies the agent’s “uncertainty” on (namely a system of bets comprising the axioms), and this is the basis for the epistemic or conditional probabilistic or Bayesian (or which some also like to call, lightly or pejorative, “subjective”; Christen, 2006) approach to Uncertainty Quantification. Our preferred axiomatic development is that of DeGroot (1970).
In the same axiomatic development, if then an event is observed, a new system of bets is precluded in which bets on events are only relevant in terms of the intersection of those events with , ie. anything outside ceases to be relevant. The existence of a new measure on is guaranteed, which coincides with the new updated system of bets after has been observed and it turns out that
for all . That is, given the set of axioms, the updated measure is precisely the conditional probability conditional on . All inferences, given that we observed , stem from the conditional probability , namely the posterior or a posteriori probability measure. The way we perform any necessary calculations to obtain
, exactly or approximately, is up to us, and certainly Bayes theorem is used in most cases (not always, eg. when calculating a predictive posterior only total probability is used). Note therefore that Bayes theorem is not the fundamental issue in modern Bayesian statistics, nor its interpretations give meaning to modern Bayesian UQ.
However, a problem arises when modeling data with continuos distributions, since realized data have and the above simple calculation of cannot be used. Fortunately, this is a classical problem in probability, since conditioning on events of zero probability is a necessity well beyond Bayesaian statistics. Kolmogorov studied the problem but the modern approach, for very many technical reasons, is called disintegration. A very nice review may be found in Chang and Pollard (1997), specifically example 9 discusses the definition and existence of the posterior distribution. Leao Jr. et al. (2004) also present a nice review.
Disintegration has the correct properties as a conditional distribution, now generalized to events of probability zero. In particular, becomes irrelevant. The bottom line is the same as in Stuart (2010): the posterior measure has as density the likelihood function w.r.t the prior measure. However, the posterior may be proven to exists in a very general setting without any regard to tail behaviour of the prior etc. For completeness, all these results are presented in detail in section 2.
As it turns out, a good enough regularity setting is this: is continuos in and the joint measure space is Polish, leading to a Radon joint measure , see lemmas 2.1 and 2.2. As far as section 2 is concerned, we stress the fact that only the former we consider a relevant observation on our part (continuity of the likelihood), the rest in that section is based on classical probability results and are well known in other areas of Bayesian statistics.
1.2 Consistency, convergence and EABF
As mentioned above, we are interested in establishing guidelines for choosing a discretization level for the FM and a truncation for the prior . The problem is addressed in Capistrán et al. (2016) in the finite dimensional case and here we generalize their results for parameters in infinite dimensional Banach spaces and a truncation in the prior distribution.
Capistrán et al. (2016)
present an approach to address the above problem using Bayes factors (BF; the odds in favor) of the numerical model vs the theoretical model (further details will be given in section2). In an ODE framework, these odds are proved in Capistrán et al. (2016) to converge to 1, that is, both models would be equal, in the same order as the numerical solver used. For high order solvers Capistrán et al. (2016) illustrates, by reducing the step size in the numerical solver, that there should exist a point at which the BF is basically 1, but for fixed discretization (step size) greater than zero. This is the main point made by Capistrán et al. (2016): it could be possible to calculate a threshold for the tolerance such that the numerical posterior is basically equal to the theoretical posterior so, although we are using an approximate FM, the resulting posterior is nearly error free. Capistrán et al. (2016) illustrate, with some examples, that such optimal solver discretization leads to basically no differences in the numerical and the theoretical posterior, since the BF is basically 1; potential saving CPU time by choosing a corser solver.
However, Capistrán et al. (2016) still has a number of shortcomings. First, it depends crucially on estimating the normalizing constants from Monte Carlo samples of the unnormalized posterior, for a range of discretizations . This is a very complex estimation problem, subject of current research, and is in fact very difficult to reliably estimate these normalizing constants in mid to high dimension problems. Second, Capistrán et al. (2016) approach is as yet incomplete since one would need to decrease systematically, calculating the normalization constant of the corresponding numerical posterior to eventually estimate the normalization constant of the theoretical posterior (see figure 2 of Capistrán et al., 2016), which in turn will pin point a discretization at which both models are indistinguishable. Being this a second complex estimation problem, the main difficulty here is that one has already calculated the posterior for small step sizes and therefore it renders useless the selection of the optimal step size.
To improve on Capistrán et al. (2016), the idea of this paper is to consider the expected value of the BFs, before data is observed. We will try to bound this expected BF to find general guidelines to establish error bounds on the numerical solver, depending on the specific problem at hand and the sample design used, but not on particular data. These guidelines will be solely regarding the forward map and, although perhaps conservative, represent useful bounds to be used in practice. Moreover, as already mention, we generalize Capistrán et al. (2016) to an infinite dimensional setting and also considering a truncation in the prior.
The basic idea then is to establish the relative merit of the numeric model vs. the theoretical model using Bayesian model selection.
We first prove that the approximations are consistent. That is, that the numerical posterior converges to the theoretical posterior. This has been proved, and discussed extensively, using the Hellinger distance (eg. Stuart, 2010). Also, rates of convergence have been discussed elsewhere (Stuart, 2010; Bui-Thanh and Ghattas, 2014). Here in section 3, in the more general setting considered in this paper and for completeness, we use weak convergence. Then to establish the consistency in the rate of convergence in section 3.2 we use the Total Variation norm.
Having this we prove our main result, for Banach spaces, for the Expected Absolute difference of the BF to 1 (EABF), considering any location-scale family for the distribution of the data; the main results of the paper are found in section 4. In section 5 we consider the prior truncation and in section 6 a series of examples.
For the moment we finish this introduction with a brief discussion on the use of weak convergence and the Total Variation (TV) norm.
1.3 Weak convergence and the Total Variation norm
In probability theory, the basic convergence criterion is weak convergence. Other convergence criteria (in probability, in TV, in Lp etc.) are commonly generalized from weak convergence(Billingsley, 1968). Probability measures weakly converge to if the Lebesgue integrals converge to for all measurable, non-negative, continuous bounded functions . We write .
In oder to have a clear concept of rates of convergence we require a metric to measure distance between the involved objects. Total variation (TV) is one of the most common for many reasons (Gibbs and Su, 2002). The TV distance between two measures and on the same measure space is defined as
where measurable. Note that, if approximates the posterior distribution then
is the upper bound for the difference in any posterior probability we wish to calculate and/or on the error in any bounded posterior expectation we need to calculate. Moreover, note that utility functions are bounded and with correct units belong to(DeGroot, 1970). Then is the maximum error incurred in calculating expected utilities when using instead of . As far as Bayesian theory is concerned, TV is quite well suited for what is required.
Indeed, the Hellinger distance could be used as well, as has been the tradition in the Bayesian UQ context. Note however that TV is equivalent to Hellinger (Gibbs and Su, 2002); convergence in TV implies convergence in Hellinger and viceversa. It bounds perhaps to facility in proofs and direct interpretation and that is why we choose TV.
2 Setting and existence lemmas
Let be the data at hand and be a family of probability models for . We assume that the family of probability models for the observables have a density w.r.t a -finite measure , namely a product of the Lebesgue and counting measures in to accommodate, possibly, discrete and continuous observations. That is
for all measurable . For example, is a product space of subsets of or , leading to discrete and/or continuos data. This is the usual setting in parametric inference.
In any case, with the usual topological considerations we assume is a Polish space. Polish spaces include complete metric spaces that have a countable dense subset. should be viewed as a Polish space with the standard metric in and the discrete metric in , and then results in a Borel -finite measure on . is then a Radon measure for all , since any Borel probability measure on a Polish space is Radon. We use this last fact in the proof of lemma 2.2 below.
Until now the parameter space is arbitrary. We need to define a measurable space to be able to define a probability measure on , namely a prior distribution. So far cannot be considered a conditional distribution but due to the next two lemmas we adopt the more common notation .
Let be any -measurable function. If is a -measurable function then
defines a joint probability measure on the product space .
Since and are -finite ( is finite) then by Tonelli’s theorem is measurable, is (non-negative) -integrable and the above integrals swap. Moreover, using we have . See for example Schervish (1997), p. 16.
Lemma 2.2 (Bayes’ theorem)
If is a separable Banach space and is continuos for all then:
The joint measure exists, as defined in Lemma 2.1.
The -disintegration of exists, may be seen as such -desintegration and therefore may be seen as the conditional density of given .
The -disintegration of exists and is the general definition of the conditional measure on given , namely, the posterior distribution.
Moreover, for any measurable we have
that is , for all .
The -measurability of was proven in Gowrisankaran (1972). Since is also -measurable, from Lemma 2.1, 1. above follows. Moreover, any separable Banach space is a Polish space and the product space is also Polish, therefore the joint probability measure is a Radon measure and the prior is also Radon. The rest follows from standard results in disintegration with Radon probability measures, see example 9 of Chang and Pollard (1997). This is also proven in, for example, Schervish (1997), p. 16, although not using the disintegration argument.
Generality: The combination of lemas 2.1 and 2.2 state the existence of the posterior measure, which are based on standard results in probability and integration. Note that we do not require any restriction on the tail behavior on the likelihood nor on the prior. This is a far more general result than Stuart (2010) or Hosseini and Nigam (2017). Existence of the posterior measure in the parametric setting is guaranteed with the continuity of the likelihood and regularity of the underlying space, namely a Polish space.
Continuous likelihood: Note that for each is a -measurable function. With continuity on it follows that is -measurable. This is indeed a profound result in measure theory that puzzled topologists for many years (eg. Sierpiński, 1920). The reference we use (Gowrisankaran, 1972) made his prove for when is a Suslin space, which is a generalization of Polish spaces. Counterexamples showing that a measurable function on each variable separately is not measurable in the product space show that the continuity requirement on may not be relaxed without further provisions. That is, the likelihood is required to be continuous.
Cromwell’s rule: If an event has zero a priori probability then it will have zero posterior probability; indeed since . We will adopt the notation for the posterior measure to make the dependance explicit both on the data and on the prior . In this respect may be seen as an operator that transforms (updates) the prior measure into the posterior measure , which represents the inference process of learning from the data .
Likelihood principle: As usual, Bayesian inference follows the likelihood principle since the posterior measure depends on the data only through the likelihood. “Well-Posedness” as studied by Stuart (2010) or Hosseini and Nigam (2017), in which close enough data and will lead to similar posteriors, is interesting but we believe is a wrong concept. Two very different data sets should lead to the same inferences (eg. and having the same mean) and even two alternative models should lead to the same conclusions, when following the likelihood principle (eg. binomial vs. negative binomial sampling); see for example Berger and Wolpert (Berger and Wolpert).
now viewed as a function of is in fact the marginal density, w.r.t , of the joint measure . That is, is a density for not yet observed data , namely the prior predictive measure. Defining the posterior through Radon-Nikodym derivatives does not preclude directly the existence of such measure.
In the next section we discuss how to ensure that when substituting the likelihood with a numeric approximation , the corresponding posterior is close enough to the theoretical posterior . Also we will discuss the analogous when using an alternative prior instead of and combining both, leading to the approximate posteriors and .
3 The inverse problems setting and discretization consistency
We follow the general setting of Scheichl et al. (2017) for the statistical inverse problem. Let and be separable Banach spaces, let be the Borel measurable forward map (FM) and the Borel measurable observation operator. The composition defines a Borel measurable mapping from the parameter space to the data sample space in , plus possibly additional parameters. Going beyond Gaussian noise assume that is a density for data w.r.t. for all . The parametric family of sample models, as in section 2, is defined with the family of -densities
To fix ideas we elaborate the usual independent Gaussian noise case,
and , ie. . If is also unknown we may take
and include it as a parameter. The same if we had and unknown variance-covariance matrix etc. We do not discuss this case further in the main part of the paper. Some notes are added in section7 regarding the case when is unknown.
Let be a discretized version of the forward map , for some discretization that depends on an integer refinement . For example, a time step size, FEM discretization, etc. This is the actual numerical version of the forward map defined in our computers. Let be the resulting discretized numerical likelihood. Moreover, suppose there are approximate or alternative prior measures also defined in . In the rest of the paper we take the following assumption.
Assume that, for all the observation model is uniformly Lipschitz continuous for each , and for -a.s. is bounded. Moreover, the FM maps and are continuous.
If and are continuous then and are continuous and all requirements are met for lemmas 2.1 and 2.2 and the posterior measures are well defined and exist as probability measures when using the theoretical likelihood and exact prior and also when using the numerical likelihood or/and an alternative prior, namely , and . Also let , and be the corresponding partition functions in each case. In the usual setting of Stuart (2010), Scheichl et al. (2017) and others it is also assumed that is continuous; here we require nothing further.
Note that if we consider independent data with a location-scale model as
where is uniformly Lipschitz continuous and known, the first part of assumption 3.1 is met and we only require to establish that and are continuous. Indeed the former is true if is Gaussian.
Assume a global error control of this numeric FM as
for some functional . Note that this is a global bound, valid for all and includes already the observational operator. That is, it is a global bound (for all ) but is only a statement at the locations s where each is observed.
Usually the error control global bounds are proven for the FM but these are easily inherited to the composition by ensuring, for example, that is Lipschitz continuous as we next explain. From assumption 3.1 is uniform Lipschitz continuous for any given . Then since we have
which is also a global error bound, now for the numeric likelihood, where the constant is independent of .
The next step is to prove the consistency of using the discretization and the prior truncation (the term will be clear in section 5), that is, how and tend to the theoretical posterior measure . We first prove the latter in weak convergence. Rates of convergence are proven in the then Total Variation norm in the following section. As mentioned before, we stress the fact that similar consistency results have proved before in this Bayesian inverse setting, in a more particular setting. We present weak convergence and TV rates of convergence results since our setting is more general basically only requiring assumption 3.1.
3.1 Weak convergence
The following theorem presents our discretization consistency results.
Theorem 3.2 (discretization consistency)
1. From (3) we have that for all , then by bounded convergence
since is finite (Swartz, 1994, chap. 3). Since we also have for all . Now, since and have the latter as densities w.r.t this implies by Scheffé’s lemma. The prove for is analogous.
2. Note that is bounded, real, non-negative, continuos function, therefore
Let be any bounded, real, non-negative, continuos function, then since and then
which implies . The prove for is analogous.
3.2 Total variation and rates of convergence
As previously mentioned we use TV to establish rates of convergence in our discretizations.
This is proven in lemma A.2.
With assumption 3.1, if then
for big enough , where maximize and .
For measurable with we have
Let and , the above implies and
since , and we obtain the result. The prove involving and is analogous.
Theorem 3.5 (Consistent rate of convergence)
The “posterior operator” is Lipschitz continuos, that is
If the rate of convergence of the truncated prior to the complete prior is then, since ,
(with ). That is, the discretized version of the posterior converges in total variation to the theoretical posterior at the same rate as the FM and the prior truncation.
In many cases of PDE discretization schemes, the number of parameters or dimension of the prior increases (linearly, quadratically etc.) with the discretization size as it is the case in some inverse problems using the Finite Element Method (eg. Bui-Thanh et al., 2013; Petra et al., 2014). In principle this should not represent an additional problem and the consistency result in (5) still holds for big enough as far as .
3.4 Posterior Estimates
In modern Bayesian theory all inference problems are viewed in a perspective of a decision under uncertainty, ultimately needing to maximize posterior expected utility, which is in fact the Bayesian paradigm. Moreover, all utility functions are bounded and by convention normalized to (DeGroot, 1970). If one wants to calculate the posterior expectation of an utility function, or any other bounded functional, note that
where and . That is, controlling will bound the error in any estimation required and the rates of convergence are transferred. (In passing, note from the prove of theorem 3.3, that is lemma A.1, that is the bound for .)
Traditionally we are used to working with the posterior mean and/or variance. In that case, is not bounded. However, if is continuos and the are uniformly integrable then exists and . This can be verified if
for some positive (Billingsley, 1968, chap. 2). For example if the tails of the finte dimensional posterior decay exponentially then , needing only to verify that these are bounded.
4 Expected a priori bounds and Bayes Factors
As in Capistrán et al. (2016) in order to find reasonable guidelines to choose a discretization level and a suitable prior truncation , we compare the numeric posterior with the theoretical posterior
using Bayesian model selection, namely Bayes Factors (BF). Assuming an equal prior probability for both models, the BF is the posterior odds of one model against the other, that iswhere , the posterior probability of the numerical model. That is, the BF is the ratio of the normalization constants . In terms of model equivalence an alternative expression conveying the same odds is
We now try to control the Bayes Factor between the discretized model and the theoretical model, , through the use of the Absolute BF (ABF). In order to do that, independently of the specific data at hand, we try to bound the expected ABF (the EABF),
in terms of estimates on the error in the numeric forward map, as in (2). The idea is to keep the EABF below a small threshold (eg. ) so that the BF is close to 1 and the difference between the numeric and the theoretical model is “not worth more than a bare mention” (Kass and Raftery, 1995; Jeffreys, 1961).
As seen in the proof of theorem 3.4 we have
and therefore . Therefore
To bound the last integral, note that
For close enough to , the a likelihood ratio is near to 1 and
With the first order Taylor approximation of around we have
Ignoring the higher order terms in the residual and using the error bound in (2) we have
since for any two vectors
since for any two vectorswith and we obtain the result.
We may attempt to calculate the remaining double integral by changing the order of integration letting and
This in general is difficult to achieve, however it is possible if it happens that does not depend on .
In the usual case of independent Gaussian errors with known variance , and since . This result may be generalized to any location-scale family and we present it next.
With the setting of theorem 4.1, assuming independent data arising from a location-scale family, namely
with a bounded symmetric Lebesgue density in with then
From (8) note that