Numerical integration, or quadrature/cubature, is a fundamental task in many areas of science and engineering. This includes machine learning and statistics, where such problems arise when computing marginals and conditionals in probabilistic inference problems. In particular in hierarchical Bayesian inference, quadrature is generally required for the computation of themarginal likelihood, the key quantity for model selection, and for prediction, for which latent variables are to be marginalized out.
To describe the problem, let be a compact metric space (such as a bounded and closed subset in ), be a known density function, and be an integrand, a known function such that the function value can be obtained for any given query . The task of quadrature is to numerically compute the integral (assumed to be intractable analytically)
This is done by evaluating the function values at design points and using them to approximate and the integral. The points should be “good” in the sense that provide useful information for computing the the integral.
Monte Carlo methods are the classic alternative, where are randomly generated from a proposal distribution and the integral is approximated as , with being importance weights. Monte Carlo is consistent, but may make inefficient use of computational resources (e.g. 10.2307/2348519, )
, requiring a huge number of points to give an accurate estimate of the integral.111To give an example, Wenliang et al. (WenSutStrGre19, , Fig. 3) used Monte Carlo samples to estimate the the normalizing constant of their model. The inefficiency of Monte Carlo stems partly from the fact that it covers a wide class of functions, such as that of bounded continuous functions. Its focus is to obtain sample points that approximate well the distribution of in a weak sense. If, on the contrary, the purpose is to compute the integral for one specific target integrand , then there may be room for drastic improvement by adaptively generating informative design points. For instance, if the function value changes more rapidly in a certain region than the rest of the domain, it would be more efficient to allocate a larger number of points in that region, rather than uniformly allocating design points in the entire domain.
Adaptive Bayesian quadrature (ABQ) is a recent approach from machine learning that actively, sequentially and deterministiclaly selects design points to adapt to the target integrand osborne_active_2012 ; osborne_bayesian_2012 ; gunter_sampling_2014 ; Ace18 ; ChaGar19 . It is an extension of Bayesian quadrature (BQ) Oha91 ; GhaZou03 ; BriOatGirOsbSej19 ; KarOatSar18 , a probabilistic numerical method for quadrature that makes use of prior knowledge about the integrand, such as smoothness and structure, via a Gaussian process (GP) prior. A drawback of vanilla BQ is that the Gaussian process model prevents the use of certain kinds of relevant knowledge about the integrand, such as it being positive
(or non-negative), because they can not be encoded in a Gaussian distribution. Positive integrands are ubiquitous in machine learning and statistics, where integration tasks emerge in the marginalization and conditioning of probability density functions, which are positive by definition. In ABQ such prior knowledge is modelled by describing integrand as given by a certaintransformation (or warping) of a GP — for instance, an exponentiated GP osborne_bayesian_2012 ; osborne_active_2012 ; ChaGar19 or a squared GP (gunter_sampling_2014, ). ABQ methods with such transformations have empirically been shown to improve upon both standard BQ and Monte Carlo, leading to state-of-the-art wall-clock time performance on problems of medium dimensionality.
If the transformation is nonlinear, as in the examples above, the transformed GP no longer allows an analytic expression for its posterior process, and thus approximations are used to obtain a tractable acquisition function. In contrast to the posterior covariance of GPs, these acquisition functions then become dependent on previous observations, making the algorithm adaptive. This twist seems to be critical for ABQ methods’ superior empirical performance, but it complicates analysis. Thus, there has been no theoretical guarantee for their convergence, rendering them heuristics in practice. This is problematic since integration is usually an intermediate computational step in a larger system, and thus must be reliable. This paper provides the first convergence analysis for ABQ methods.
. Our convergence analysis is done for this class. We also derive an upper-bound on the quadrature error using a transformed integrand, which is applicable to any design points and given in terms of the GP posterior variance (Prop.2.1). In Sec. 3, we establish a connection between ABQ and certain weak greedy algorithms (Thm. 3.3). This is based on a new result that the scaled GP posterior variance can be interpreted in terms of a certain projection in a Hilbert space (Lemma 3.1). Using this connection, we derive convergence rates of ABQ methods in Sec. 4. We provide sufficient conditions for ABQ methods to converge.
To motivate the relevance of our results, consider the parallel situation in Monte Carlo methods: Exact sampling is also hard and inefficient, so Markov Chain Monte Carlo methods are a popular approximate alternative. Not every MCMC method is consistent though. The notions of detailed balance and ergodicity (which go back to Boltzmann) are the crucial tool (cf. §6.5 & 6.6 in ChrCas2004, ): MCMC chains that satisfy these properties (and a few other technical constraints) are consistent. It is comparably straightforward to design such Markov Chains, so researchers have been able to build an ecosystem of efficient MCMC methods within it. Typically such methods have little additional theoretical guarantees and are mostly motivated empirically and intuitively. Analogously, we here develop a relatively general notion for active exploration that we term weak adaptivity. An ABQ method that satisfies weak adaptivity (and a few additional technical constraints) is consistent, and the conceptual space of weakly adaptive BQ methods is large and flexible. We hope that our results spark a practical interest in the design of empirically efficient acquisition functions, to extend the reach of quadrature to problems of higher and higher dimensionalities.
For standard BQ methods, and the corresponding kernel quadrature rules, convergence properties have been studied extensively (e.g. BriOatGirOsb15, ; KanSriFuk16, ; Bac17, ; Xi_ICML2018, ; KarOatSar18, ; Chen_ICML2018, ; BriOatGirOsbSej19, ; OatCocBriGir19, ; KanSriFuk17, ). In particular, there are methods that deterministically generate design points CheWelSmo10 ; BacJulObo12 ; huszar2012herdingbmcuai ; BriOatGirOsb15 ; Chen_ICML2018 . These methods are not adaptive, as design points are generated independently to the function values of the target integrand.
Our analysis is technically related to the work by Santin and Haasdonk SanHaa17
on solvers for partial differential equations, which analyzed the so-calledP-greedy algorithm, an algorithm to sequentially obtain design points using the GP posterior variance as an acquisition function. Our results can be regarded as a generalization of their result so that the acquisition function can include i) a scaling and a transformation of the GP posterior variance and ii) a data-dependent term that takes care of adaptation; see (4) for details.
Adaptive methods have also been theoretically studied in the information-based complexity literature Nov95_nonsymmetric ; Nov95_width ; Nov96_power ; Nov16 . The key result is that optimal points for quadrature can be obtained without observing actual function values, if the hypothesis class of functions is symmetric and convex (e.g. the unit ball in a Hilbert space): in this case adaptation does not help improve the performance. On the other hand, it the hypothesis class is either asymmetric or nonconvex, then adaptation may be helpful. For instance, a class of positive functions is assymetric if only one of or can be positive. These results thus support the choice of acquisition functions of existing ABQ methods, where the adaptivity to function values is motivated by modeling the positivity of the integrand.
denotes the set of positive integers, the real line, and the -dimensional Euclidean space for . for is the Banach space of -integrable functions, and is that of essentially bounded functions.
2 Adaptive Bayesian Quadrature (ABQ)
We describe here ABQ methods, and present a generic form of acquisition functions that we analyze. We also derive an upper-bound on the quadrature error using a transformed integrand in terms of the GP posterior variance, motivating our analysis in the later sections. Throughout the paper we assume that the domain is a compact metric space.
2.1 Bayesian Quadrature with Transformation
ABQ methods deal with an integrand that is a priori known to satisfy a certain constraint, for example . Such a constraint is modeled by considering a certain transformation , and assuming that there exists a latent function such that the integrand is given as the transformation of , i.e., . Examples of for modeling the positivity include i) the square transformation , whereand is a small constant such that , assuming that is bounded away from gunter_sampling_2014 ; and ii) the exponential transformation osborne_bayesian_2012 ; osborne_active_2012 ; ChaGar19 . Note that the identity map recovers standard Bayesian quadrature (BQ) methods Oha91 ; GhaZou03 ; BriOatGirOsb15 ; KarOatSar18 . To model the latent function , a Gaussian process (GP) prior RasWil06 is placed over :
where is a mean function and is a covariance kernel. Both and should be chosen to capture as much prior knowledge or belief about (or its transformation ) as possible, such as smoothness and correlation structure; see e.g. (RasWil06, , Chap. 4).
Assume that a set of points are given, such that the kernel matrix is invertible. Given the function values , define such that for . Treating as “observed data without noise,” the posterior distribution of under the GP prior (1) is again given as a GP
where is the posterior mean function and is the posterior covariance kernel given by (see e.g. RasWil06 )
where , and . Then a quadrature estimate222The point is that, in contrast to the integral over , this estimate should be analytically tractable. This depends on the choices for , and . For instance, for or with and Gaussian, the estimate can be obtained analytically gunter_sampling_2014 , while for one needs approximations; (cf. ChaGar19, ). for the integral is given as the integral of the transformed posterior mean function . The posterior covariance for is given similarly; see ChaGar19 for details.
2.2 A Generic Form of Acquisition Functions
The key remaining question is how to select good design points . ABQ methods sequentially and deterministically generate using an acquisition function. Many of the acquisition functions can be formulated in the following generic form:
where , is an increasing function such that , and is a function that may change at each iteration , e.g., it may depend on the function values of the target integrand . We analyse ABQ with this generic form (4), aiming for results with wide applicability. Here are some representative choices.
Warped Sequential Active Bayesian Integration (WSABI) gunter_sampling_2014 : Gunter et al. gunter_sampling_2014 employ the square transformation with two acquisition functions: i) WSABI-L (gunter_sampling_2014, , Eq. 15), which is based on linearization of and recovered with , and ; and ii) WSABI-M (gunter_sampling_2014, , Eq. 14)
, the one based on moment matching given by, and .
Variational Bayesian Monte Carlo (VBMC) Ace18 ; Ace19_acquisition : Acerbi (Ace19_acquisition, , Eq. 2) uses the identity with the acquisition function given by , and , where is the variational posterior at the -th iteration and are constants: setting recovers the original acquisition function (Ace18, , Eq. 9). Acerbi (Ace18, , Sec. 2.1) considers an integrand that is defined as the logarithm of a joint density, while is an intractable posterior that is gradually approximated by the variational posteriors .
For the WSABI and MMLT, the acquisition function (4) is obtained by a certain approximation for the posterior variance of the integral ; thus this is a form of uncertainty sampling. Such an approximation is needed because the posterior variance of the integral is not available in closed form, due to the nonlinear transformation . The resulting acquisition function includes the data-dependent term , which encourages exploration in regions where the value of is expected to be large. This makes ABQ methods adaptive to the target integrand. Alas, it also complicates analysis. Thus there has been no convergence guarantee for these ABQ methods; which is what we aim to remedy in this paper.
2.3 Bounding the Quadrature Error with Transformation
Our first result, which may be of independent interest, is an upper-bound on the error for the quadrature estimate based on a transformation described in Sec. 2.1. It is applicable to any point set , and the bound is given in terms of the posterior variance . This gives us a motivation to study the behavior of this quantity for generated by ABQ (4) in the later sections. To state the result, we need to introduce the Reproducing Kernel Hilbert Space (RKHS) of the covariance kernel of the GP prior. See e.g. SchSmo02 ; SteChr2008 for details of RKHS’s, and Berlinet2004 ; KanHenSejSri18 for discussions of their close but subtle relation to the GP notion.
Let be the RKHS associated with the covariance kernel of the GP prior (1), with and being its inner-product and norm, respectively. is a Hilbert space consisting of functions on , such that i) for all , and ii) for all and (the reproducing property), where denotes the function of the first argument such that , with being fixed. As a set of functions, is given as the closure of the linear span of such functions , i.e., , meaning that any can be written as for some and such that . We are now ready state our assumption:
is continuously differentiable. For , there exists such that and that . It holds that and .
The assumption is common in theoretical analysis of standard BQ methods, where and (see e.g. BriOatGirOsb15, ; Xi_ICML2018, ; BriOatGirOsbSej19, , and references therein). This assumption may be weakened by using proof techniques developed for standard BQ in the misspecifid setting KanSriFuk16 ; KanSriFuk17 , but we leave it for a future work. The other conditions on , and are weak.
3 Connections to Weak Greedy Algorithms in Hilbert Spaces
To analyze the quantity for points generated from ABQ (4), we show here that the ABQ can be interpreted as a certain weak greedy algorithm studied by DeVore et al. DevPetWoj13 . To describe this, let be a (generic) Hilbert space and be a compact subset. To define some notation, let be given. Denote by the linear subspace spanned by . For a given , let be the distance between and defined by
where denotes the norm of . Geometrically, this is the distance between and its orthogonal projection onto the subspace . The task considered in DevPetWoj13 is to select such that the worst case error in defined by
becomes as small as possible: are to be chosen to approximate well the set .
The following weak greedy algorithm is considered in DeVore et al. DevPetWoj13 . Let be a constant such that , and let . First select such that . For , suppose that have already been generated, and let . Then select a next element such that
In this paper we refer to such as a -weak greedy approximation of in because, recovers the standard greedy algorithm, while weakens the “greediness” of this rule. DeVore et al. DevPetWoj13 derived convergence rates of the worst case error (5) as for generated from this weak greedy algorithm.
Weak Greedy Algorithms in the RKHS.
Note that is the image of the mapping with being the domain. Therefore is compact, if and are continuous and is compact; this is because in this case the mapping becomes continuous, and in general the image of a continuous mapping from a compact domain is compact. Thus, we make the following assumption:
is a compact metric space, is continuous with for all , and is continuous.
(proof in Appendix B.1) Let be such that the kernel matrix is invertible. Define for any , and let . Assume that holds for all . Then for all we have
where is the GP posterior variance function given by (3). Moreover, we have
where is the worst case error defined by (5) with and defined here.
Adaptive Bayesian Quadrature as a Weak Greedy Algorithm.
We now show that the ABQ (4) gives a weak greedy approximation of the compact set in the RKHS in the sense of (6). We summarize required conditions in Assumptions 3 and 4. As mentioned above, Assumption 3 is the crucial one: its implications for certain specific ABQ methods will be discussed in Sec. 4.2.
Assumption 3 (Weak Adaptivity Condition).
There are constants such that holds for all and for all .
is increasing and continuous, and . For any , there is a constant such that holds for all .
For instance, if for then and thus we have for ; is the case for the WSABI gunter_sampling_2014 , and for the VBMC Ace18 ; Ace19_acquisition . If as in the MMLT ChaGar19 , we have and it can be shown that for ; see Appendix B.2. Note that in Assumption 4, the inverse is well-defined since is increasing and continuous.
The following lemma guarantees we can assume without loss of generality that the kernel matrix for the points generated from the ABQ (4) is invertible under the assumptions above, since otherwise holds, implying that the quadrature error is from Prop. 2.1. This guarantees the applicability of Lemma 3.1 for points generated from the ABQ (4).
Lemma 3.1 leads to the following theorem, which establishes a connection between ABQ and weak greedy algorithms.
4 Convergence Rates of Adaptive Bayesian Quadrature
We use the connection established in the previous section to derive convergence rates of ABQ. To this end we introduce a quantity called Kolmogorov -width, which is defined (for a Hilbert space and a compact subset ) by
where the infimum is taken over all -dimensional subspaces of . This is the worst case error for the best possible solution using elements in ; thus holds for any choice of that defines the worst case error in (5). The following result by DeVore et al. (DevPetWoj13, , Corollary 3.3) relates the Kolmogorov -width with the worst case error of a weak greedy algorithm.
Let be a Hilbert space and be a compact subset. For , let be a -weak greedy approximation of in for , and let be the worst case error (5) for the subspace . Then we have:
- – Exponential decay:
Assume that there exist constants , and such that holds for all . Then holds for all with .
- – Polynomial decay:
Assume that there exist constants and such that holds for all . Then holds for all with .
Thus, the key is how to upper-bound the Kolmogorov -width for the RKHS associated with the covariance kernel . Given such an upper bound, one can then derive convergence rates for ABQ using Thm. 3.3. Below we demonstrate such results for kernels with infinite smoothness on , such as Gaussian and (inverse) multiquadric kernels. In a similar way one can also derive rates for kernels with finite smoothness, such as Matérn and Wendland kernels. These additional results are presented in Appendix C.4. We emphasize that one can also analyze other cases (e.g. kernels on a sphere) by deriving upper-bounds on the Kolmogorov -width and using Thm. 3.3.
4.1 Convergence Rates for Kernels with Infinite Smoothness
We consider kernels with infinite smoothness, such as square-exponential kernels with , multiquadric kernels with such that , where denotes the smallest integer greater than , and inverse multiquadric kernels with . We have the following bound on the Kolmogorov -width of the for these kernels; the proof is in Appendix C.2.
Let be a cube, and suppose that Assumption 2 is satisfied. Let be a square-exponential kernel or an (inverse) multiquadric kernel. Then there exist constants such that holds for all .
The requirement for to be a cube stems from the use of Wendland (Wen05, , Thm. 11.22) in our proof, which requires this condition. In fact, this can be weakened to being a compact set satisfying an interior cone condition, but the resulting rate weakens to (note that this is still exponential); see (Wen05, , Sec. 11.4). This also applies to the following results. Combining Prop. 4.2 with Lemma 3.1, Thm. 3.3 and Lemma 4.1, we now obtain a bound on .
4.2 Discussions of the Weak Adaptivity Condition (Assumption 3)
We discuss consequences of our results to individual ABQ methods reviewed in Sec. 2.2. We do this in particular by discussing the weak adaptivity condition (Assumption 3), which requires that the data-dependent term in (4) is uniformly bounded away from zero and infinity. (A discussion for VBMC by Acerbi Ace18 ; Ace19_acquisition is given in Appendix C.8. To summarize, Assumption 3 holds if the densities of the variational distributions are bounded away uniformly from zero and infinity.)
We first consider the WSABI-L approach by Gunter et al. gunter_sampling_2014 , for which ; a similar result is presented for the WSABI-M in Appendix C.7. The following bounds for follow from Lemma C.5 in Appendix C.5.
Lemma 4.5 implies that WSABI-L may not be consistent when, e.g., one uses the zero prior mean function , since in this case the condition is not satisfied. Intuitively, the inconsistency may happen because the posterior mean for inputs in regions distant from the current design points would become close to , since the prior mean function is ; and such regions will never be explored in the subsequent iterations, because of the form . One simple way to guarantee the consistency is to make a modification like ; then we can guarantee that , encouraging exploration in the whole region . This then makes the algorithm consistent.
We next consider the MMLT method by Chai and Garnett ChaGar19 , for which . Lemma 4.6 below shows that the weak adaptivity condition holds for the MMLT as long as Assumption 1 is satisfied. Therefore different from the WSABI, the MMLT is consistent without requiring a further assumption.
5 Conclusion and Outlook
Extending efficient numerical integration beyond the low-dimensional domain remains both a formidable challenge and a crucial desideratum for many areas. In machine learning, efficient numerical integration in the high-dimensional domain would be a game-changer for Bayesian learning. Developed by, and used in, the machine learning community, adaptive Bayesian quadrature is a promising new direction for progress in this fundamental problem class. So far, it has been hindered by the absence of theoretical guarantees.
In this work, we have provided the first known convergence guarantees for ABQ methods, by analyzing a generic form of their acquisition functions. Of central importance is the notion of weak adaptivity which, speaking vaguely, ensures that the algorithm asymptotically does not “overly focus” on some evaluations. It is conceptually related to ideas like detailed balance and ergodicity, which play a similar role for Markov Chain Monte Carlo methods (where, speaking equally vaguely, they guard against the same kind of locality). Like those of MCMC, our sufficient conditions for consistency span a flexible class of design options, and can thus act as a guideline for the design of novel acquisition functions for ABQ, guided by practical and intuitive considerations. Based on the results presented herein, novel ABQ methods may be proposed for novel domains other than only positive integrands, for example integrands with discontinuities PlaWas09_singularities and those with spatially inhomogeneous smoothness.
An important theoretical question, however, remains to be addressed: While our results provide convergence guarantees for ABQ methods, they do not provide a theoretical explanation for why, how and when ABQ methods should be fundamentally better than non-adaptive methods. In fact, little is known about theoretical properties of adaptive quadrature methods in general. In applied mathematics, they remain an open problem Nov95_nonsymmetric ; Nov95_width ; Nov96_power ; Nov16 . While we have to leave this question of ABQ’s potential advantages over standard BQ for future research, we consider this area to be highly promising on account of the fundamental role of high-dimensional integrals of structured functions in probabilistic machine learning.
We thank Hans Kersting and Alexandra Gessner for fruitful discussions. This work has been supported by the ERC action StG 757275 / PANAMA.
Appendix A Appendices for Section 2
a.1 Proof of Prop. 2.1
In the proof we use the following notation: and .
It is known that (see e.g. (KanHenSejSri18, , Prop. 3.10)
) the GP posterior standard deviation can be written as
where . Note that for any , we have , since . Therefore by , and (9) we have
On the other hand, by Taylor’s theorem, there exists such that for we have
where denotes the derivative of at . From this and (10) we have
Note that is uniformly bounded over all and , since is continuous by assumption and is bound uniformly over all and ; the latter can be shown as
where we used and . This implies that