This paper presents a statistical perspective on the theoretical and methodological issues pertinent to probabilistic numerical methods. Our aim is to stimulate what we feel is an important discussion about these methods for use in contemporary and emerging scientific and statistical applications.
Numerical methods, for tasks such as approximate solution of a linear system, integration, global optimisation and discretisation schemes to approximate the solution of differential equations, are core building blocks in modern scientific and statistical computation. These are typically considered as computational black-boxes that return a point estimate for a deterministic quantity of interest whose numerical error is then neglected. Numerical methods are thus one part of statistical analysis for which uncertainty is not routinely accounted (although analysis of errors and bounds on these are often available and highly developed). In many situations numerical error will be negligible and no further action is required. However, if numerical errors are propagated through a computational pipeline and allowed to accumulate, then failure to properly account for such errors could potentially have drastic consequences on subsequent statistical inferences(Mosbach and Turner, 2009; Oates et al., 2017c).
The study of numerical algorithms from a statistical point of view, where uncertainty is formally due to discretisation, is known as probabilistic numerics. The philosophical foundations of probabilistic numerics were, to the best of our knowledge, first clearly exposed in the work of Larkin (1972); Kadane (1985); Diaconis (1988) and O’Hagan (1992). Theoretical support comes from the field of information-based complexity (Traub et al., 1988)
, where continuous mathematical operations are approximated by discrete and finite operations to achieve a prescribed accuracy level. Proponents claim that this approach provides three important benefits: Firstly, it provides a principled approach to quantify and propagate numerical uncertainty through computation, allowing for the possibility of errors with complex statistical structure. Secondly, it enables the user to uncover key contributors to numerical error, using established statistical techniques such as analysis of variance, in order to better target computational resources. Thirdly, this dual perspective on numerical analysis as an inference task enables new insights, as well as the potential to critique and refine existing numerical methods. On this final point, recent interest has led to several new and effective numerical algorithms in many areas, including differential equations, linear algebra and optimisation. For an extensive bibliography, the reader is referred to the recent expositions ofHennig et al. (2015) and Cockayne et al. (2017).
Our aim is to stimulate a discussion on the suitability of probabilistic numerical methods in statistical computation. A decision was made to focus on numerical integration due to its central role in computational statistics, including frequentist approaches such as bootstrap estimators (Efron and Tibshirani, 1994) and Bayesian approaches, such as computing marginal distributions (Robert and Casella, 2013). In particular we focus on numerical integrals where the cost of evaluating the integrand forms a computational bottleneck. To this end, let be a distribution on a state space . The task is to compute (or, rather, to estimate) integrals of the form
where the integrand is a function of interest. Our motivation comes from settings where does not possess a convenient closed form so that, until the function is actually evaluated at an input , there is epistemic uncertainty over the actual value attained by at . The use of a probabilistic model for this epistemic uncertainty has been advocated as far back as Larkin (1972). The probabilistic integration method that we focus on is known as Bayesian cubature (BC). The method operates by evaluating the integrand at a set of states , so-called discretisation, and returns a distribution over that expresses belief about the true value of . The computational cost associated with BQ is in general . As the name suggests, this distribution will be based on a prior that captures certain properties of , and that is updated, via Bayes’ rule, on the basis of evaluations of the integrand. The maximum a posteriori (MAP) value acts as a point estimate of the integral, while the rest of the distribution captures uncertainty due to the fact that we can only evaluate the integrand at a finite number of inputs. However, a theoretical investigation of this posterior111in contrast to the MAP estimator, which has been well-studied. is, to the best of our knowledge, non-existent.
Our first contribution is therefore to investigate the claim that the BC posterior provides a coherent and honest assessment of the uncertainty due to discretisation of the integrand. This claim is shown to be substantiated by rigorous mathematical analysis of BC, building on analogous results from reproducing kernel Hilbert spaces, if the prior is well-specified. In particular, rates of posterior contraction to a point mass centred on the true value are established. However, to check that a prior is well-specified for a given integration problem can be non-trivial.
Our second contribution is to explore the potential for the use of probabilistic integrators in the contemporary statistical context.
In doing so, we have developed strategies for (i) model evidence evaluation via thermodynamic integration, where a large number of candidate models are to be compared, (ii) inverse problems arising in partial differential equation models for oil reservoirs, (iii) logistic regression models involving high-dimensional latent random effects, and (iv) spherical integration, as used in the rendering of virtual objects in prescribed visual environments.In each case results are presented “as they are” and the relative advantages and disadvantages of the probabilistic approach to integration are presented for critical assessment.
The paper is structured as follows. Sec. 2 provides background on BC and outlines an analytic framework in which the method can be studied. Sec. 3 describes our novel theoretical results. Sec. 4 is devoted to a discussion of practical issues, including the important issue of prior elicitation. Sec. 5 presents several novel applications of probabilistic integration for critical assessment222computer code to reproduce experiments reported in this paper can be downloaded from http://www.warwick.ac.uk/fxbriol/probabilistic_integration.. Sec. 6 concludes with an appraisal of the suitability of probabilistic numerical methods in the applied statistical context.
First we provide the reader with the relevant background. Sec. 2.1 provides a formal description of BC. Secs. 2.2 and 2.3 explain how the analysis of BC is dual to minimax analysis in nonparametric regression, and Sec. 2.4 relates these ideas to established sampling methods.
Let be a measurable space, where will either be a subspace of or a more general manifold (e.g. the sphere ), in each case equipped with the Borel -algebra . Let be a distribution on . Our integrand is assumed to be an integrable function whose integral, , is the object of interest.
For functional arguments write ,
and for vector arguments denote. For vector-valued functions we write for the vector whose th element is . The notation will be used. The relation is taken to mean that there exist such that .
A cubature rule describes any functional of the form
for some states and weights . The term quadrature rule is sometimes preferred when the domain of integration is one-dimensional (i.e. ). The notation is motivated by the fact that this expression can be re-written as the integral of with respect to an empirical measure , where is an atomic measure (i.e. for all , if , if ). The weights can be negative and need not satisfy .
2.1 Bayesian Cubature
Probabilistic integration begins by defining a probability space and an associated stochastic process , such that for each , belongs to a linear topological space .
For BC, Larkin (1972) considered a Gaussian process (GP); this is a stochastic process such that the random variables
considered a Gaussian process (GP); this is a stochastic process such that the random variablesare Gaussian for all , where is the topological dual of (Bogachev, 1998). In this paper, to avoid technical obfuscation, it is assumed that contains only continuous functions. Let denote expectation taken over . A GP can be characterised by its mean function , and its covariance function and we write . In this paper we assume without loss of generality that . Note that other priors could also be used (e.g. a Student-t process affords heavier tails for values assumed by the integrand).
The next step is to consider the restriction of to the set to induce a posterior measure over . The fact that contains only continuous functions ensures that is well-defined333this would not have been the case if instead .. Moreover the restriction to a -null set is also well-defined444since the canonical space of continuous processes is a Polish space and all Polish spaces are Borel spaces and thus admit regular conditional laws (c.f. Theorem A1.2 and Theorem 6.3 of Kallenberg, 2002).. Then, for BC, can be shown to be a GP, denoted (see Chap. 2 of Rasmussen and Williams, 2006).
The final step is to produce a distribution on by projecting the posterior defined on through the integration operator. A sketch of the procedure is presented in Figure 1 and the relevant formulae are now provided. Denote by , the expectation and variance taken with respect to . Write for the vector of values, and for the vector whose th entry is and for the matrix with entries .
The induced distribution of is Gaussian with mean and variance
Here, denotes the integral of with respect to each argument. All proofs in this paper are reserved for Supplement A. It can be seen that the computational cost of obtaining this full posterior is much higher than that of obtaining a point estimate for the integral, at . However, certain combinations of point sets and covariance functions can reduce this cost by several orders of magnitude (see e.g. Karvonen and Särkkä (2017)).
BC formally associates the stochastic process with a prior model for the integrand . This in turn provides a probabilistic model for epistemic uncertainty over the value of the integral . Without loss of generality we assume for the remainder of the paper. Then Eqn. 2 takes the form of a cubature rule
where . Furthermore, Eqn. 3 does not depend on function values , but only on the location of the states and the choice of covariance function . This is useful as it allows state locations and weights to be pre-computed and re-used. However, it also means that the variance is endogeneous, being driven by the choice of prior. A valid quantification of uncertainty thus relies on a well-specified prior; we consider this issue further in Sec. 4.1.
The BC mean (Eqn. 4) coincides with classical cubature rules for specific choices of covariance function . For example, in one dimension a Brownian covariance function leads to a posterior mean
that is a piecewise linear interpolant ofbetween the states , i.e. the trapezium rule (Suldin, 1959). Similarly, Särkkä et al. (2016) constructed a covariance function for which Gauss-Hermite cubature is recovered, and Karvonen and Särkkä (2017) showed how other polynomial-based cubature rules can be recovered. Clearly the point estimator in Eqn. 4 is a natural object; it has also received attention in both the kernel quadrature literature (Sommariva and Vianello, 2006) and empirical interpolation literature (Kristoffersen, 2013). Recent work with a computational focus includes Kennedy (1998); Minka (2000); Rasmussen and Ghahramani (2002); Huszar and Duvenaud (2012); Gunter et al. (2014); Briol et al. (2015); Karvonen and Särkkä (2017); Oettershagen (2017). The present paper focuses on the full posterior, as opposed to just the point estimator that these papers studied.
2.2 Cubature Rules in Hilbert Spaces
Next we review how analysis of the approximation properties of the cubature rule can be carried out in terms of reproducing kernel Hilbert spaces (RKHS; Berlinet and Thomas-Agnan, 2004).
Consider a Hilbert space with inner product and associated norm . is said to be an RKHS if there exists a symmetric, positive definite function , called a kernel, that satisfies two properties: (i) for all and; (ii) for all and (the reproducing property). It can be shown that every kernel defines a RKHS and every RKHS admits a unique reproducing kernel (Berlinet and Thomas-Agnan, 2004, Sec. 1.3). In this paper all kernels are assumed to satisfy . In particular this guarantees for all . Define the kernel mean as . This exists in as a consequence of (Smola et al., 2007). The name is justified by the fact555the integral and inner product commute due to the existence of as a Bochner integral (Steinwart and Christmann, 2008, p510). that
The reproducing property permits an elegant theoretical analysis, with many quantities of interest tractable in . In the language of kernel means, cubature rules of the form in Eqn. 1 can be written as where is the approximation to the kernel mean given by . For fixed , the integration error associated with can then be expressed as
A tight upper bound for the error is obtained by Cauchy-Schwarz:
The expression above666sometimes called the Koksma-Hlawka inequality (Hickernell, 1998). decouples the magnitude (in ) of the integrand from the kernel mean approximation error. The following sections discuss how cubature rules can be tailored to target the second term in this upper bound.
2.3 Optimality of Cubature Weights
Denote the dual space of as and denote its corresponding norm . The performance of a cubature rule can be quantified by its worst-case error (WCE) in the RKHS:
The WCE is characterised as the error in estimating the kernel mean:
Minimisation of the WCE is natural and corresponds to solving a least-squares problem in the feature space induced by the kernel: Let denote the vector of weights , be a vector such that , and be the matrix with entries . Then we obtain the following:
Several optimality properties for integration in RKHS were collated in Sec. 4.2 of Novak and Woźniakowski (2008). Relevant to this work is that an optimal estimate can, without loss of generality, take the form of a cubature rule (i.e. of the form in Eqn. 1). To be more precise, any non-linear and/or adaptive estimator can be matched777of course, adaptive cubature may provide superior performance for a single fixed function , and the minimax result is not true in general outside the RKHS framework. in terms of asymptotic WCE by a cubature rule as we have defined.
To relate these ideas to BC, consider the challenge of deriving an optimal cubature rule, conditional on fixed states , that minimises the WCE (in the RKHS ) over weights . From Fact 2, the solution to this convex problem is . This shows that if the reproducing kernel is equal to the covariance function of the GP, then the MAP from BC is identical to the optimal cubature rule in the RKHS (Kadane and Wasilkowski, 1985). Furthermore, with , the expression for the WCE in Fact 2 shows that where is any other cubature rule based on the same states . Regarding optimality, the problem is thus reduced to selection of states .
2.4 Selection of States
In earlier work, O’Hagan (1991) considered states that are employed in Gaussian cubature methods. Rasmussen and Ghahramani (2002) generated states using Monte Carlo (MC), calling the approach Bayesian MC (BMC). Recent work by Gunter et al. (2014); Briol et al. (2015) selected states using experimental design to target the variance . These approaches are now briefly recalled.
2.4.1 Monte Carlo Methods
An MC method is a cubature rule based on uniform weights and random states . The simplest of those methods consists of sampling states independently from
. For un-normalised densities, Markov chain Monte Carlo (MCMC) methods proceed similarly but induce a dependence structure among the. We denote these (random) estimators by (when ) and (when ). Uniformly weighted estimators are well-suited to many challenging integration problems since they provide a dimension-independent convergence rate for the WCE of
. They are widely applicable and straight-forward to analyse; for instance the central limit theorem (CLT) gives thatwhere and the convergence is in distribution. However, the CLT may not be well-suited as a measure of epistemic uncertainty (i.e. as an explicit model for numerical error) since (i) it is only valid asymptotically, and (ii) is unknown, depending on the integral being estimated.
Quasi Monte Carlo (QMC) methods exploit knowledge of the RKHS to spread the states in an efficient, deterministic way over the domain (Hickernell, 1998). QMC also approximates integrals using a cubature rule that has uniform weights . The (in some cases) optimal convergence rates, as well as sound statistical properties, of QMC have recently led to interest within statistics (e.g. Gerber and Chopin, 2015; Buchholz and Chopin, 2017). A related method with non-uniform weights was explored in Stein (1995a, b).
2.4.2 Experimental Design Methods
An Optimal BC (OBC) rule selects states to globally minimise the variance . OBC corresponds to classical cubature rules (e.g. Gauss-Hermite) for specific choices of kernels (Karvonen and Särkkä, 2017). However OBC cannot in general be implemented; the problem of optimising states is in general NP-hard (Schölkopf and Smola, 2002, Sec. 10.2.3).
A more pragmatic approach to select states is to use experimental design methods, such as the greedy algorithm that sequentially minimises . This method, called sequential BC (SBC), is straightforward to implement, e.g. using general-purpose numerical optimisation, and is a probabilistic integration method that is often used (Osborne et al., 2012; Gunter et al., 2014). More sophisticated optimisation algorithms have also been used: For example, in the empirical interpolation literature, Eftang and Stamm (2012) proposed adaptive procedures to iteratively divide the domain of integration into sub-domains. In the BC literature, Briol et al. (2015) used conditional gradient algorithms for this task. A similar approach was recently considered in Oettershagen (2017).
At present, experimental design schemes do not possess the computational efficiency that we have come to expect from MCMC and QMC. Moreover, they do not scale well to high-dimensional settings due to the need to repeatedly solve high-dimensional optimisation problems and have few established theoretical guarantees. For these reasons we will focus next on MC, MCMC and QMC.
This section presents novel theoretical results on probabilistic integration methods in which the states are generated with MCMC and QMC. Sec. 3.1 provides formal definitions, while Sec. 3.2 establishes theoretical results.
3.1 Probabilistic Integration
The sampling methods of MCMC and, to a lesser extent, QMC are widely used in statistical computation. Here we pursue the idea of using MCMC and QMC to generate states for BC, with the aim to exploit BC to account for the possible impact of numerical integration error on inferences made in statistical applications. In MCMC it is possible that two states are identical. To prevent the kernel matrix from becoming singular, duplicate states should be discarded888this is justified since the information contained in function evaluations is not lost. This does not introduce additional bias into BC methods, in contrast to MC methods.. Then we define and . This two-step procedure requires no modification to existing MCMC or QMC sampling methods. Each estimator is associated with a full posterior distribution, described in Sec. 2.1.
A moment is taken to emphasise that the apparently simple act of re-weighting MCMC samples can have a dramatic improvement on convergence rates for integration of a sufficiently smooth integrand.
A moment is taken to emphasise that the apparently simple act of re-weighting MCMC samples can have a dramatic improvement on convergence rates for integration of a sufficiently smooth integrand.Whilst our main interest is in the suitability of BC as a statistical model for discretisation of an integral, we highlight the efficient point estimation which comes out as a by-product.
To date we are not aware of any previous use of BMCMC, presumably due to analytic intractability of the kernel mean when is un-normalised. BQMC has been described by Hickernell et al. (2005); Marques et al. (2013); Särkkä et al. (2016). To the best of our knowledge there has been no theoretical analysis of the posterior distributions associated with either method. The goal of the next section is to establish these fundamental results.
3.2 Theoretical Properties
In this section we present novel theoretical results for BMC, BMCMC and BQMC. The setting we consider assumes that the true integrand belongs to a RKHS and that the GP prior is based on a covariance function which is identical to the kernel of . That the GP is not supported on , but rather on a Hilbert scale of , is viewed as a technical detail: Indeed, a GP can be constructed on via and a theoretical analysis similar to ours could be carried out (Lemma 2.2 of Cialenco et al., 2012).
3.2.1 Bayesian Markov Chain Monte Carlo
As a baseline, we begin by noting a general result for MC estimation. This requires a slight strengthening of the assumption on the kernel: . This implies that all are bounded on . For MC estimators, Lemma 33 of Song (2008) show that, when , the WCE converges in probability at the classical rate .
Turning now to BMCMC (and BMC as a special case), we consider the compact manifold . Below the distribution will be assumed to admit a density with respect to Lebesgue measure, denoted by . Define the Sobolev space to consist of all measurable functions such that . Here is the order of and is a RKHS. Derivative counting can hence be a principled approach for practitioners to choose a suitable RKHS. All results below apply to RKHS that are norm-equivalent999two norms , on a vector space are equivalent when there exists constants such that for all we have . to , permitting flexibility in the choice of kernel. Specific examples of kernels are provided in Sec. 4.2.
Our analysis below is based on the scattered data approximation literature (Wendland, 2005). A minor technical assumption, that enables us to simplify the presentation of results below, is that the set may be augmented with a finite, pre-determined set where does not increase with . Clearly this has no bearing on asymptotics. For measurable we write where is the indicator function of the event .
Theorem 1 (BMCMC in ).
Suppose is bounded away from zero on . Let be norm-equivalent to where , . Suppose states are generated by a reversible, uniformly ergodic Markov chain that targets . Then and moreover, if and ,
where depends on and can be arbitrarily small.
This result shows the posterior distribution is well-behaved; the posterior distribution of concentrates in any open neighbourhood of the true integral . This result does not address the frequentist coverage of the posterior, which is assessed empirically in Sec. 5.
Although we do not focus on point estimation, a brief comment is warranted: A lower bound on the WCE that can be attained by randomised algorithms in this setting is (Novak and Woźniakowski, 2010). Thus our result shows that the point estimate is at most one MC rate away from being optimal101010the control variate trick of Bakhvalov (1959) can be used to achieve the optimal randomised WCE, but this steps outside of the Bayesian framework.. Bach (2015) obtained a similar result for fixed and a specific importance sampling distribution; his analysis does not directly imply our asymptotic results and vice versa. After completion of this work, similar results on point estimation appeared in Oettershagen (2017); Bauer et al. (2017).
Thm. 1 can be generalised in several directions. Firstly, we can consider more general domains . Specifically, the scattered data approximation bounds that are used in our proof apply to any compact domain that satisfies an interior cone condition (Wendland, 2005, p.28). Technical results in this direction were established in Oates et al. (2016a). Second, we can consider other spaces . For example, a slight extension of Thm. 1 shows that certain infinitely differentiable kernels lead to exponential rates for the WCE and super-exponential rates for posterior contraction. For brevity, details are omitted.
3.2.2 Bayesian Quasi Monte Carlo
The previous section focused on BMCMC in the Sobolev space . To avoid repetition, here we consider more interesting spaces of functions whose mixed partial derivatives exist, for which even faster convergence rates can be obtained using BQMC. To formulate BQMC we must posit an RKHS a priori and consider collections of states that constitute a QMC point set tailored to the RKHS.
Consider with uniform on . Define the Sobolev space of dominating mixed smoothness to consist of functions for which . Here is the order of the space and is a RKHS. To build intuition, note that
is norm-equivalent to the RKHS generated by a tensor product of Matérn kernels(Sickel and Ullrich, 2009), or indeed a tensor product of any other univariate Sobolev space -generating kernel.
For a specific space such as , we seek an appropriate QMC point set. The higher-order digital net construction is an example of a QMC point set for ; for details we refer the reader to Dick and Pillichshammer (2010) for details.
Theorem 2 (BQMC in ).
Let be norm-equivalent to , where , . Suppose states are chosen according to a higher-order digital net over for some prime where . Then and , if and ,
where depends on and can be arbitrarily small.
This result shows that the posterior is again well-behaved. Indeed, the rate of contraction is much faster in compared to . In terms of point estimation, this is the optimal rate for any deterministic algorithm for integration of functions in (Novak and Woźniakowski, 2010). These results should be understood to hold on the sub-sequence , as QMC methods do not in general give guarantees for all . It is not clear how far this result can be generalised, in terms of and , compared to the result for BMCMC, since this would require the use of different QMC point sets.
In this section we established rates of posterior contraction for BMC, BMCMC and BQMC in a general Sobolev space context. These results are essential since they establish the sound properties of the posterior, which is shown to contract to the truth as more evaluations are made of the integrand. Of course, the higher computational cost of up to may restrict the applicability of the method in large- regimes. However, we emphasise that the motivation is to quantify the uncertainty induced from numerical integration, an important task which often justifies the higher computational cost.
So far we have established sound theoretical properties for BMCMC and BQMC under the assumption that the prior is well-specified. Unfortunately, prior specification complicates the situation in practice since, given a test function , there are an infinitude of RKHS to which belongs and the specific choice of this space will impact upon the performance of the method. In particular, the scale of the posterior is driven by the scale of the prior, so that the uncertainty quantification being provided is endogenous and, if the prior is not well-specified, this could mitigate the advantages of the probabilistic numerical framework. This important point is now discussed.
It is important to highlight a distinction between B(MC)MC and BQMC; for the former the choice of states does not depend on the RKHS. For B(MC)MC this allows for the possibility of off-line specification of the kernel after evaluations of the integrand have been obtained, whereas for alternative methods the kernel must be stated up-front. Our discussion below therefore centres on prior specification in relation to B(MC)MC, where several statistical techniques can be applied.
4.1 Prior Specification
The above theoretical results do not address the important issue of whether the scale of the posterior uncertainty provides an accurate reflection of the actual numerical error. This is closely related to the well-studied problem of prior specification in the kriging literature (Stein, 1991; Xu and Stein, 2017).
Consider a parametric kernel , with a distinction drawn here between scale parameters and smoothness parameters . The former are defined as parametrising the norm on , whereas the latter affect the set itself. Selection of based on data can only be successful in the absence of acute sensitivity to these parameters. For scale parameters, a wide body of evidence demonstrates that this is usually not a concern (Stein, 1991). However, selection of smoothness parameters is an active area of theoretical research (e.g. Szabó et al., 2015). In some cases it is possible to elicit a smoothness parameter from physical or mathematical considerations, such as a known number of derivatives of the integrand. Our attention below is instead restricted to scale parameters, where several approaches are discussed in relation to their suitability for BC:
A natural approach, from a Bayesian perspective, is to set a prior on parameters and then to marginalise over to obtain a posterior over . Recent results for a certain infinitely differentiable kernel establish minimax optimal rates for this approach, including in the practically relevant setting where is supported on a low-dimensional sub-mainfold of the ambient space (Yang and Dunson, 2016). However, the act of marginalisation itself involves an intractable integral. While the computational cost of evaluating this integral will often be dwarfed by that of the integral of interest, marginalisation nevertheless introduces an additional undesirable computational challenge that might require several approximations (e.g. Osborne, 2010). It is however possible to analytically marginalise certain types of scale parameters, such as amplitude parameters:
Suppose our covariance function takes the form where is itself a reproducing kernel and is an amplitude parameter. Consider the improper prior . Then the posterior marginal for is a Student-t distribution with mean and variance
and degrees of freedom. Here , , .
Another approach to kernel choice is cross-validation. However, this can perform poorly when the number of data is small, since the data needs to be further reduced into training and test sets. The performance estimates are also known to have large variance in those cases (Chap. 5 of Rasmussen and Williams, 2006). Since the small scenario is one of our primary settings of interest for BC, we felt that cross-validation was unsuitable for use in applications below.
4.1.3 Empirical Bayes
An alternative to the above approaches is empirical Bayes (EB) selection of scale parameters, choosing to maximise the log-marginal likelihood of the data , (Sec. 5.4.1 of Rasmussen and Williams, 2006). EB has the advantage of providing an objective function that is easier to optimise relative to cross-validation. However, we also note that EB can lead to over-confidence when is very small, since the full irregularity of the integrand has yet to be uncovered (Szabó et al., 2015). In addition, it can be shown that EB estimates need not converge as when the GP is supported on infinitely differentiable functions (Xu and Stein, 2017).
For the remainder, we chose to focus on a combination of the marginalisation approach for amplitude parameters and the EB approach for remaining scale parameters. Empirical results support the use of this approach, though we do not claim that this strategy is optimal.
4.2 Tractable and Intractable Kernel Means
BC requires that the kernel mean is available in closed-form. This is the case for several kernel-distribution pairs and a subset of these pairs are recorded in Table 1. In the event that the kernel-distribution pair of interest does not lead to a closed-form kernel mean, it is sometimes possible to determine another kernel-density pair for which is available and such that (i) is absolutely continuous with respect to , so that the Radon-Nikodym derivative exists, and (ii) . Then one can construct an importance sampling estimator
and proceed as above (O’Hagan, 1991).
|Unif||Wendland TP||Oates et al. (2016b)|
|Unif||Matérn Weighted TP||Sec. 5.4|
|Unif||Exponentiated Quadratic||Use of error function|
|Mixt. of Gaussians||Exponentiated Quadratic||Kennedy (1998)|
|Arbitrary||Unif / Mixt. of Gauss.||Trigonometric||Integration by parts|
|Arbitrary||Known moments||Polynomial TP||Briol et al. (2015)|
|Arbitrary||Known||Gradient-based Kernel||Oates et al. (2016a, 2017a)|
The aims of the following section are two-fold; (i) to validate the preceding theoretical analysis and (ii) to explore the use of probabilistic integrators in a range of problems arising in contemporary statistical applications.
5.1 Assessment of Uncertainty Quantification
Our focus below is on the uncertainty quantification provided by BC and, in particular, the performance of the hybrid marginalisation/EB approach to kernel parameters. To be clear, we are not concerned with accurate point estimation at low computational cost. This is a well-studied problem that reaches far beyond the methods of this paper. Rather, we are aiming to assess the suitability of the probabilistic description for integration error that is provided by BC. Our motivation is expensive integrands, but to perform assessment in a controlled environment we considered inexpensive test functions of varying degrees of irregularity, whose integrals can be accurately approximated. These included a non-isotropic test function with an “easy” setting and a “hard” setting . The hard test function is more variable and will hence be more difficult to approximate (see Fig. 2). One realisation of states , generated independently and uniformly over (initially ), was used to estimate the . We work in an RKHS characterised by tensor products of Matérn kernels
where is the modified Bessel function of the second kind. Closed-form kernel means exist in this case for whenever . In this set-up, EB was used to select the length-scale parameters of the kernel, while the amplitude parameter was marginalised as in Prop. 2. The smoothness parameter was fixed at . Note that all test functions will be in the space for any and there is a degree of arbitrariness in this choice of prior.
. The upper-left quadrant represents conservative credible intervals whilst the lower-right quadrant represents over-confident intervals.Left: “Easy” test function . Right: “Hard” test function .
Results are shown in Fig. 2. Error-bars are used to denote the 95% posterior credible regions for the value of the integral and we also display the values of the length scale selected by EB111111the term “credible” is used loosely since the are estimated rather than marginalised.. The appear to converge rapidly as ; this is encouraging but we emphasise that we do not provide theoretical guarantees for EB in this work. On the negative side, over-confidence is possible at small values of . Indeed, the BC posterior is liable to be over-confident under EB, since in the absence of evidence to the contrary, EB selects large values for that correspond to more regular functions; this is most evident in the “hard” case.
Next we computed coverage frequencies for credible regions. For each sample size , the process was repeated over many realisations of the states , shown in Fig. 3. It may be seen that (for large enough) the uncertainty quantification provided by EB is over-cautious for the easier function , whilst being well-calibrated for the more complicated functions such as . As expected, we observed that the coverage was over-confident for small values of . Performance was subsequently investigated with selected by EB. In general this performed worse than when was marginalised; results are contained in Supplement C.
Finally, to understand whether theoretical results on asymptotic behaviour are realised in practice, we note (in the absence of EB) that the variance is independent of the integrand and may be plotted as a function of . Results in Supplement C demonstrate that theoretical rates are observed in practice for for BQMC; however, at large values of , more data are required to achieve accurate estimation and increased numerical instability was observed.
The results on test functions provided in this section illustrate the extent to which uncertainty quantification in possible using BC. In particular, for our examples, we observed reasonable frequentist coverage if the number of samples was not too small.
For the remainder we explore possible roles for BMCMC and BQMC in statistical applications. Four case studies, carefully chosen to highlight both the strengths and the weaknesses of BC are presented. Brief critiques of each study are contained below, the full details of which can be found in Supplement D.
5.2 Case Study #1: Model Selection via Thermodynamic Integration
Consider the problem of selecting a single best model among a set , based on data assumed to arise from a true model in this set. The Bayesian solution, assuming a uniform prior over models, is to select the MAP model. We focus on the case with uniform prior on models , and this problem hence reduces to finding the largest marginal likelihood . The are usually intractable integrals over the parameters associated with model . One widely-used approach to model selection is to estimate each in turn, say by , then to take the maximum of the over . In particular, thermodynamic integration is one approach to approximation of marginal likelihoods for individual models (Gelman and Meng, 1998; Friel and Pettitt, 2008).
In many contemporary applications the MAP model is not well-identified, for example in variable selection where there are very many candidate models. Then, the MAP becomes sensitive to numerical error in the , since an incorrect model , can be assigned an overly large value of due to numerical error, in which case it could be selected in place of the true MAP model. Below we explore the potential to exploit probabilistic integration to surmount this problem.
5.2.1 Thermodynamic Integration
To simplify notation below we consider computation of a single and suppress dependence on the index corresponding to model . Denote the parameter space by . For (an inverse temperature) define the power posterior , a distribution over with density . The thermodynamic identity is formulated as a double integral:
The thermodynamic integral can be re-expressed as , , where . Standard practice is to discretise the outer integral and estimate the inner integral using MCMC: Letting denote a fixed temperature schedule, we thus have (e.g. using the trapezium rule)
where are MCMC samples from . Several improvements have been proposed, including the use of higher-order numerical quadrature for the outer integral (Friel et al., 2014; Hug et al., 2016) and the use of control variates for the inner integral (Oates et al., 2017a, 2016b). To date, probabilistic integration has not been explored in this context.
5.2.2 Probabilistic Thermodynamic Integration
Our proposal is to apply BC to both the inner and outer integrals. This is instructive, since nested integrals are prone to propagation and accumulation of numerical error. Several features of the method are highlighted:
Transfer Learning: In the probabilistic approach, the two integrands and
are each assigned prior probability models. For the inner integral we assign a prior. Our data here are the vector where . For estimating with BC we have times as much data as for the MC estimator , in Eqn. 7, which makes use of only function evaluations. Here, information transfer across temperatures is made possible by the explicit model for underpinning BC.
In the posterior, is a Gaussian random vector with where the mean and covariance are defined, in the obvious notation, by
where and is an kernel matrix defined by .
Inclusion of Prior Information: For the outer integral, it is known that discretisation error can be substantial; Friel et al. (2014) proposed a second-order correction to the trapezium rule to mitigate this bias, while Hug et al. (2016) pursued the use of Simpson’s rule. Attacking this problem from the probabilistic perspective, we do not want to place a stationary prior on , since it is known from extensive empirical work that will vary more at smaller values of . Indeed the rule-of-thumb is commonly used (Calderhead and Girolami, 2009). We would like to encode this information into our prior. To do this, we proceed with an importance sampling step . The rule-of-thumb implies an importance distribution for some small , which renders the function approximately stationary (made precise in Supplement D.1). A stationary GP prior on the transformed integrand provides the encoding of this prior knowledge that was used.
Propagation of Uncertainty: Under this construction, in the posterior is Gaussian with mean and covariance defined as
where and is an kernel matrix defined by . The term arises from BC on the outer integral, while the term arises from propagating numerical uncertainty from the inner integral through to the outer integral.
Probabilistic integration, where a model for numerical error on each integral was propagated through into the posterior over models. The probabilistic approach produces a “probability distribution over a probability distribution”, where the numerical uncertainty is modelled on top of the usual uncertainty associated with model selection.
5.2.3 Simulation Study
An experiment was conducted to elicit the MAP model from a collection of 56 candidate logistic regression models in a variable selection setting. This could be achieved in many ways; our aim was not to compare accuracy of point estimates, but rather to explore the probability model that, unlike in standard methods, is provided by BC. Full details are in Supplement D.1.
Results are shown in Fig. 4. Here we compared approximations to the model posterior obtained using the standard method versus the probabilistic method, over two realisations of the MCMC (the data were fixed). We make some observations: (i) The probabilistic approach models numerical uncertainty on top of the usual statistical uncertainty. (ii) The computation associated with BC required less time, in total, than the time taken afforded to MCMC. (iii) The same model was not always selected as the MAP when numerical error was ignored and depended on the MCMC random seed. In contrast, under the probabilistic approach, either or could feasibly be the MAP under any of the MCMC realisations, up to numerical uncertainty. (iv) The top row of Fig. 4 shows a large posterior uncertainty over the marginal likelihood for . This could be used as an indicator that more computational effort should be expended on this particular integral. (v) The posterior variance was dominated by uncertainty due to discretisation error in the outer integral, rather than the inner integral. This suggests that numerical uncertainty could be reduced by allocating more computational resources to the outer integral rather than the inner integral.
5.3 Case Study #2: Uncertainty Quantification for Computer Experiments
Here we consider an industrial scale computer model for the Teal South oil field, New Orleans (Hajizadeh et al., 2011). Conditional on field data, posterior inference was facilitated using state-of-the-art MCMC (Lan et al., 2016). Oil reservoir models are generally challenging for MCMC: First, simulating from those models can be time-consuming, making the cost of individual MCMC samples a few minutes to several hours. Second, the posterior distribution will often exhibit strongly non-linear concentration of measure. Here we computed statistics of interest using BMCMC, where the uncertainty quantification afforded by BC aims to enable valid inferences in the presence of relatively few MCMC samples. Full details are provided in Supplement D.2.
Quantification of the uncertainty associated with predictions is a major topic of ongoing research in this field (Mohamed et al., 2010; Hajizadeh et al., 2011; Park et al., 2013) due to the economic consequences associated with inaccurate predictions of quantities such as future oil production rate. A probabilistic model for numerical error in integrals associated with prediction could provide a more complete uncertainty assessment.
The particular integrals that we considered are posterior means for each model parameter, and we compared against an empirical benchmark obtained with brute force MCMC. BMCMC was employed with a Matérn kernel whose lengthscale-parameter was selected using EB. Estimates for posterior means were obtained using both standard MCMC and BMCMC, shown in Fig. 5. For this example the posterior distribution provides sensible uncertainty quantification for integrals 1, 3, 6-9, but was over-confident for integrals 2, 4, 5. The point accuracy of the BMCMC estimator matched that of the standard MCMC estimator. The lack of faster convergence for BMCMC appears to be due to inaccurate estimation of the kernel mean and we conjecture that alternative exact approaches, such as Oates et al. (2017a), may provide improved performance in this context. However, standard confidence intervals obtained from the CLT for MCMC with a plug-in estimate for the asymptotic variance were over-confident for parameters 2-9.
5.4 Case Study #3: High-Dimensional Random Effects
Our aim here was to explore whether more flexible representations afforded by weighted combinations of Hilbert spaces enable probabilistic integration when is high-dimensional. The focus was BQMC, but the methodology could be applied to other probabilistic integrators.
5.4.1 Weighted Spaces
The formulation of high (and infinite) -dimensional QMC can be achieved with a construction known as a weighted Hilbert space. These spaces, defined below, are motivated by the observation that many integrands encountered in applications seem to vary more in lower dimensional projections compared to higher dimensional projections. Our presentation below follows Sec. 2.5.4 and 12.2 of Dick and Pillichshammer (2010), but the idea goes back at least to Wahba (1990, Chap. 10).
As usual with QMC, we work in and uniform over . Let . For each subset , define a weight and denote the collection of all weights by . Consider the space of functions of the form , where belongs to an RKHS with kernel and denotes the components of that are indexed by . This is not restrictive, since any function can be written in this form by considering only . We turn into a Hilbert space by defining an inner product where . Constructed in this way, is an RKHS with kernel . Intuitively, the weights can be taken to be small whenever the function does not depend heavily on the -way interaction of the states . Thus, most of the will be small for a function that is effectively low-dimensional. A measure of the effective dimension of the function is given by ; in an extreme case could even be infinite provided that this sum remains bounded (Dick et al., 2013).
The (canonical) weighted Sobolev space of dominating mixed smoothness is defined by taking each of the component spaces to be . In finite dimensions , BQMC rules based on a higher-order digital net attain optimal WCE rates for this RKHS; see Supplement D.3 for full details.
5.4.2 Semi-Parametric Random Effects Regression
For illustration we considered generalised linear models, and focus on a Poisson semi-parametric random effects regression model studied by Kuo et al. (2008, Example 2). The context is inference for the parameters of the following model
Here , and where are pre-determined knots. We took equally spaced knots in . Inference for requires multiple evaluations of the observed data likelihood and therefore is a candidate for probabilistic integration methods, in order to model the cumulative uncertainty of estimating multiple numerical integrals.
In order to transform this integration problem to the unit cube we perform the change of variables so that we wish to evaluate . Here denotes the standard Gaussian inverse CDF applied to each component of . Probabilistic integration here proceeds under the hypothesis that the integrand belongs to (or at least can be well approximated by functions in) for some smoothness parameter and some weights . Intuitively, the integrand is such that an increase in the value of at the knot can be compensated for by a decrease in the value of at a neighbouring knot , but not by changing values of at more remote knots. Therefore we expect to exhibit strong individual and pairwise dependence on the , but expect higher-order dependency to be weaker. This motivates the weighted space assumption. Sinescu et al. (2012) provides theoretical analysis for the choice of weights . Here, weights of order two were used; for , , otherwise, which corresponds to an assumption of low-order interaction terms (though can still depend on all of its arguments). Full details are provided in Supplement D.3.
Results in Fig. 6 showed that the posterior credible regions more-or-less cover the truth for this problem, suggesting that the uncertainty estimates are appropriate. On the negative side, the BQMC method does not encode non-negativity of the integrand and, consequently, some posterior mass is placed on negative values for the integral, which is not meaningful. To understand the effect of the weighted space construction here, we compared against the BQMC point estimate with -way interactions (