Log In Sign Up

On the Sampling Problem for Kernel Quadrature

by   Francois-Xavier Briol, et al.

The standard Kernel Quadrature method for numerical integration with random point sets (also called Bayesian Monte Carlo) is known to converge in root mean square error at a rate determined by the ratio s/d, where s and d encode the smoothness and dimension of the integrand. However, an empirical investigation reveals that the rate constant C is highly sensitive to the distribution of the random points. In contrast to standard Monte Carlo integration, for which optimal importance sampling is well-understood, the sampling distribution that minimises C for Kernel Quadrature does not admit a closed form. This paper argues that the practical choice of sampling distribution is an important open problem. One solution is considered; a novel automatic approach based on adaptive tempering and sequential Monte Carlo. Empirical results demonstrate a dramatic reduction in integration error of up to 4 orders of magnitude can be achieved with the proposed method.


page 1

page 2

page 3

page 4


Multiple Importance Sampling for Efficient Symbol Error Rate Estimation

Digital constellations formed by hexagonal or other non-square two-dimen...

Pushing the Limits of Importance Sampling through Iterative Moment Matching

The accuracy of an integral approximation via Monte Carlo sampling depen...

Numerical Integration Method for Training Neural Network

We propose a new numerical integration method for training a shallow neu...

Efficient Programmable Random Variate Generation Accelerator from Sensor Noise

We introduce a method for non-uniform random number generation based on ...

Distribution Compression in Near-linear Time

In distribution compression, one aims to accurately summarize a probabil...

Two-Stage Monte Carlo Denoising with Adaptive Sampling and Kernel Pool

Monte Carlo path tracer renders noisy image sequences at low sampling co...

1 Introduction

Consider approximation of the Lebesgue integral


where is a Borel measure defined over and is Borel measurable. Define to be the set of Borel measures such that , meaning that , and assume . In situations where

does not admit a closed-form, Monte Carlo (MC) methods can be used to estimate the numerical value of Eqn.


. A classical research problem in computational statistics is to reduce the MC estimation error in this context, where the integral can, for example, represent an expectation or marginalisation over a random variable of interest.

The default MC estimator comprises of

where are sampled identically and independently (i.i.d.) from . Then we have a root mean square error (RMSE) bound


and the expectation is with respect to the joint distribution of the

. For settings where the Lebesgue density of

is only known up to normalising constant, Markov chain Monte Carlo (MCMC) methods can be used; the rate-constant

is then related to the asymptotic variance of

under the Markov chain sample path.

Considerations of computational cost place emphasis on methods to reduce the rate constant . For the MC estimator, this rate constant can be made smaller via importance sampling (IS): where an optimal choice , that minimises , is available in explicit closed-form (see Robert and Casella, 2013, Thm. 3.3.4). However, the RMSE remains asymptotically gated at .

The default Kernel Quadrature (KQ) estimate comprises of


where the are independent (or arise from a Markov chain) and . In contrast to MC, the weights in KQ are in general non-uniform, real-valued and depend on . The KQ nomenclature derives from the (symmetric, positive-definite) kernel

that is used to construct an interpolant

such that for . The weights in Eqn. 2 are implicitly defined via the equation . The KQ estimator is identical to the posterior mean in Bayesian Monte Carlo (O’Hagan, 1991; Rasmussen and Ghahramani, 2002), and its relationship with classical numerical quadrature rules has been studied (Diaconis, 1988; Särkkä et al., 2015).

Under regularity conditions, Briol et al. (2015b) established the following RMSE bound for KQ:

where both the integrand and each argument of the kernel admit continuous mixed weak derivatives of order and can be arbitrarily small. An information-theoretic lower bound on the RMSE is (Bakhvalov, 1959). The faster convergence of the RMSE, relative to MC, can lead to improved precision in applications. Akin to IS, the samples need not be draws from in order for KQ to provide consistent estimation (since is encoded in the weights ). Importantly, KQ can be viewed as post-processing of MC samples; the kernel can be reverse-engineered (e.g. via cross-validation) and does not need to be specified up-front.

One notable disadvantage of KQ methods is that little is known about how the rate constant depends on the choice of sampling distribution . In contrast to IS, no general closed-form expression has been established for an optimal distribution for KQ (the technical meaning of ‘optimal’ is defined below). Moreover, limited practical guidance is available on the selection of the sampling distribution (an exception is Bach, 2015, as explained in Sec. 2.4) and in applications it is usual to take .
This choice is convenient but leads to estimators that are not efficient, as we demonstrate in dramatic empirical examples in Sec. 2.3.

The main contributions of this paper are twofold. First, we formalise the problem of optimal sampling for KQ as an important and open challenge in computational statistics. To be precise, our target is an optimal sampling distribution for KQ, defined as


for some functional class to be specified. In general a (possibly non-unique) optimal will depend on and, unlike for IS, also on the kernel and the number of samples .

Second, we propose a novel and automatic method for selection of that is rooted in approximation of the unavailable . In brief, our method considers candidate sampling distributions of the form for and a reference distribution on . The exponent is chosen such that minimises an empirical upper bound on the RMSE. The overall approach is facilitated with an efficient sequential MC (SMC) sampler and called SMC-KQ. In particular, the approach (i) provides practical guidance for selection of for KQ, (ii) offers robustness to kernel mis-specification, and (iii) extends recent work on computing posterior expectations with kernels obtained using Stein’s method (Oates et al., 2017).

The paper proceeds as follows: Empirical results in Sec. 2 reveal that the RMSE for KQ is highly sensitive to the choice of . The proposed approach to selection of is contained in Sec. 3. Numerical experiments, presented in Sec. 4, demonstrate that dramatic reductions in integration error (up to 4 orders of magnitude) can be achieved with SMC-KQ. Lastly, a discussion is provided in Sec. 5.

2 Background

This section presents an overview of KQ (Sec. 2.1 and 2.2), empirical (Secs. 2.3) and theoretical (Sec. 2.4) results on the choice of sampling distribution, and discusses kernel learning for KQ (Sec. 2.5).

2.1 Overview of Kernel Quadrature

We now proceed to describe KQ: Recall the approximation to ; an explicit form for the coefficients is given as , where and . It is assumed that exists almost surely; for non-degenerate kernels, this corresponds to having no atoms. From the above definition of KQ,

Defining leads to the estimate in Eqn. 2 with weights . Pairs for which the have closed form are reported in Table 1 of Briol et al. (2015b). Computation of these weights incurs a computational cost of at most and can be justified when either (i) evaluation of forms the computational bottleneck, or (ii) the gain in estimator precision (as a function in ) dominates this cost (i.e. whenever ).

Notable contributions on KQ include Diaconis (1988); O’Hagan (1991); Rasmussen and Ghahramani (2002) who introduced the method and Huszar and Duvenaud (2012); Osborne et al. (2012a, b); Gunter et al. (2014); Bach (2015); Briol et al. (2015a, b); Särkkä et al. (2015); Kanagawa et al. (2016); Liu and Lee (2017) who provided consequent methodological extensions. KQ has been applied to a wide range of problems including probabilistic ODE solvers (Kersting and Hennig, 2016)

, reinforcement learning

(Paul et al., 2016), filtering (Prüher and Šimandl, 2015) and design of experiments (Ma et al., 2014).

Several characterisations of the KQ estimator are known and detailed below. Let denote the Hilbert space characterised by the reproducing kernel , and denote its norm as (Berlinet and Thomas-Agnan, 2011). Then we have the following: (a) The function is the minimiser of over subject to for all . (b) The function is the posterior mean for under the Gaussian process prior conditioned on data and is the mean of the implied posterior marginal over . (c) The weights are characterised as the minimiser over of

the maximal error in the unit ball of . These characterisations connect KQ to (a) non-parametric regression, (b) probabilistic integration and (c) quasi-Monte Carlo (QMC) methods (Dick and Pillichshammer, 2010). The scattered data approximation literature (Sommariva and Vianello, 2006) and the numerical analysis literature (where KQ is known as the ‘empirical interpolation method’; Eftang and Stamm, 2012; Kristoffersen, 2013) can also be connected to KQ. However, our search of all of these literatures did not yield guidance on the optimal selection of the sampling distribution (with the exception of Bach (2015) reported in Sec. 2.4).

2.2 Over-Reliance on the Kernel

In Osborne et al. (2012a); Huszar and Duvenaud (2012); Gunter et al. (2014); Briol et al. (2015a), the selection of was approached as a greedy optimisation problem, wherein the maximal integration error was minimised, given the location of the previous . This approach has demonstrated considerable success in applications. However, the error criterion is strongly dependant on the choice of kernel and the sequential optimisation approach is vulnerable to kernel misspecification. In particular, if the intrinsic length scale of is “too small” then the all cluster around the mode of , leading to poor integral estimation (see Fig. 5 in the Appendix). Related work on sub-sample selection, such as leverage scores (Bach, 2013), can also be non-robust to mis-specified kernels. The partial solution of online kernel learning requires a sufficient number of data and is not always practicable in small- regimes that motivate KQ.

This paper considers sampling methods as a robust alternative to optimisation methods. Although our method also makes use of to select , it reverts to in the limit as the length scale of is made small. In this sense, sampling offers more robustness to kernel mis-specification than optimisation methods, at the expense of a possible (non-asymptotic) decrease in precision in the case of a well-specified kernel. This line of research is thus complementary to existing work. However, we emphasise that robustness is an important consideration for general applications of KQ in which kernel specification may be a non-trivial task.

2.3 Sensitivity to the Sampling Distribution

To date, we are not aware of a clear demonstration of the acute dependence of the performance of the KQ estimator on the choice of distribution . It is therefore important to illustrate this phenomenon in order to build intuition.

Consider the toy problem with state space , target distribution , a single test function and kernel . For this problem, consider a range of sampling distributions of the form for . Fig. 1 plots

an empirical estimate for the RMSE where is the th of independent KQ estimates for based on samples drawn from the distribution

with standard deviation

(). In this case is available in closed-form. It is seen that the ‘obvious’ choice of , i.e. , is sub-optimal. The intuition here is that ‘extreme’ samples from the tails of are rather informative for building the interpolant underlying KQ; we should therefore over-sample these values via a heavier-tailed . The same intuition is used for column sampling and to construct leverage scores (Mahoney, 2011; Drineas et al., 2012).

Figure 1: The performance of kernel quadrature is sensitive to the choice of sampling distribution. Here the test function was , the target measure was , while samples were generated from . The kernel was used. Notice that the values of that minimise the root mean square error (RMSE) are uniformly greater than (dashed line) and depend on the number of samples in general.

2.4 Established Results

Here we recall the main convergence results to-date on KQ and discuss how these relate to choices of sampling distribution. To reduce the level of detail below, we make several assumptions at the outset:

Assumption on the domain: The domain will either be itself or a compact subset of that satisfies an ‘interior cone condition’, meaning that there exists an angle and a radius such that for every there exists such that the cone is contained in (see Wendland, 2004, for background).

Assumption on the kernel: Consider the integral operator , with defined as the Bochner integral . Assume that , so that is self-adjoint, positive semi-definite and trace-class (Simon, 1979). Then, from an extension of Mercer’s theorem (König, 1986) we have a decomposition , where and

are the eigenvalues and eigenfunctions of

. Further assume that is dense in .

The first result is adapted and extended from Thm. 1 in Oates et al. (2016).

Theorem 1.

Assume that admits a density defined on a compact domain . Assume that for some . Let be fixed and define the Euclidean fill distance

Let be independent draws from . Assume gives rise to a Sobolev space . Then there exists such that, for ,

for all . Here for some constant independent of and .

All proofs are reserved for the Appendix. The main contribution of Thm. 1 is to establish a convergence rate for KQ when using importance sampling distributions. A similar result appeared in Thm. 1 of Briol et al. (2015b) for samples from (see the Appendix) and was extended to MCMC samples in Oates et al. (2016). An extension to the case of a mis-specified kernel was considered in Kanagawa et al. (2016). However a limitation of this direction of research is that it does not address the question of how to select .

Figure 2: The performance of kernel quadrature is sensitive to the choice of kernel. Here the same set-up as Fig. 1 was used with . The kernel was used for various choices of parameter . The root mean square error (RMSE) is sensitive to choice of for all choices of , suggesting that online kernel learning could be used to improve over the default choice of and (dashed lines).

The second result that we present is a consequence of the recent work of Bach (2015), who considered a particular choice of , depending on a fixed , via the density The following is adapted from Prop. 1 in Bach (2015):

Theorem 2.

Let be independent and . For and , we have that

with probability greater than


Some remarks are in order: (i) Bach (2015, Prop. 3) showed that, for , integration error scales at an optimal rate in up to logarithmic terms and, after samples, is of size . (ii) The distribution is obtained from minimising an upper bound on the integration error, rather than the error itself. It is unclear to us how well approximates an optimal sampling distribution for KQ. (iii) In general is hard to compute. For the specific case , equal to and uniform, the distribution is also uniform (and hence independent of ; see Sec. 4.4 of Bach (2015)). However, even for the simple example of Sec. 2.3, does not appear to have a closed form (details in Appendix). An approximation scheme was proposed in Sec. 4.2 of Bach (2015) but the error of this scheme was not studied.

Optimal sampling for approximation in with weighted least squares (not in the kernel setting) was considered in Hampton and Doostan (2015); Cohen and Migliorati (2016).

2.5 Goals

Our first goal was to formalise the sampling problem for KQ; this is now completed. Our second goal was to develop a novel automatic approach to selection of , called SMC-KQ; full details are provided in Sec. 3.

Also, observe that the integrand will in general belong to an infinitude of Hilbert spaces, while for KQ a single kernel must be selected. This choice will affect the performance of the KQ estimator; for example, in Fig. 2, the problem of Sec. 2.3 was reconsidered based on a class of kernels parametrised by . Results showed that, for all choices of parameter, the RMSE of KQ is sensitive to choice of . In particular, the default choice of is not optimal. For this reason, an extension that includes kernel learning, called SMC-KQ-KL, is proposed in Sec. 3.

3 Methods

In this section the SMC-KQ and SMC-KQ-KL methods are presented. Our aim is to explain in detail the main components (SMC, temp, crit) of Alg. 1. To this end, Secs. 3.1 and 3.2 set up our SMC sampler to target tempered distributions, while Sec. 3.3

presents a heuristic for the choice of temperature schedule. Sec.

3.4 extends the approach to kernel learning and Sec. 3.5 proposes a novel criterion to determine when a desired error tolerance is reached.

3.1 Thermodynamic Ansatz

To begin, consider , and as fixed. The following ansatz is central to our proposed SMC-KQ method: An optimal distribution (in the sense of Eqn. 3) can be well-approximated by a distribution of the form


for a specific (but unknown) ‘inverse temperature’ parameter . Here is a reference distribution to be specified and which should be chosen to be un-informative in practice. It is assumed that all exist (i.e. can be normalised). The motivation for this ansatz stems from Sec. 2.3, where and can be cast in this form with and

an (improper) uniform distribution on

. In general, tempering generates a class of distributions which over-represent extreme events relative to (i.e. have heavier tails). This property has the potential to improve performance for KQ, as demonstrated in Sec. 2.3.

The ansatz of Eqn. 4 reduces the non-parametric sampling problem for KQ to the one-dimensional parametric problem of selecting a suitable . The problem can be further simplified by focusing on a discrete temperature ladder such that , and . Discussion of the choice of ladder is deferred to Sec. 3.3. This reduced problem, where we seek an optimal index , is still non-trivial as no closed-form expression is available for the RMSE at each candidate . To overcome this impasse a novel approach to estimate the RMSE is presented in Sec. 3.5.

3.2 Convex Ansatz (Smc)

The proposed SMC-KQ algorithm requires a second ansatz, namely that the RMSE is convex in and possesses a global minimum in the range . This second ansatz (borne out in numerical results in Fig. 1) motivates an algorithm that begins at and tracks the RMSE until an increase is detected, say at ; at which point the index is taken for KQ.

To realise such an algorithm, this paper exploited SMC methods (Chopin, 2002; Del Moral et al., 2006). Here, a particle approximation to is first obtained where are independent draws from , and . Then, at iteration , the particle approximation to is re-weighted, re-sampled and subject to a Markov transition, to deliver a particle approximation to . This ‘re-sample-move’ algorithm, denoted SMC, is standard but, for completeness, pseudo-code is provided as Alg. 2 in the Appendix.

At iteration , a subset of size is drawn from the unique111This ensures that kernel matrices have full rank. It does not introduce bias into KQ, since in general need not equal . However, to keep notation clear, we do not make this operation explicit. elements in , from the particle approximation to , and proposed for use in KQ. A criterion crit, defined in Sec. 3.5, is used to determine whether the resultant KQ error has increased relative to . If this is the case, then the distribution from the previous iteration is taken for use in KQ. Otherwise the algorithm proceeds to and the process repeats. In the degenerate case where the RMSE has a minimum at , the algorithm defaults to standard KQ with .

Both ansatz of the SMC-KQ algorithm are justified through the strong empirical results presented in Sec. 4.

3.3 Choice of Temperature Schedule (temp)

The choice of temperature schedule influences several aspects of SMC-KQ: (i) The SMC approximation to is governed by the “distance” (in some appropriate metric) between and . (ii) The speed at which the minimum can be reached is linear in the number of temperatures between and . (iii) The precision of KQ depends on the approximation . Factors (i,iii) motivate the use of a fine schedule with large, while (ii) motivates a coarse schedule with small.

For this work, a temperature schedule was used that is well suited to both (i) and (ii), while a strict constraint was imposed on the grid spacing to acknowledge (iii). The specific schedule used in this work was determined based on the conditional effective sample size of the current particle population, as proposed in the recent work of Zhou et al. (2016). Full details are presented in Algs. 4 and 5 in the Appendix.

3.4 Kernel Learning

In Sec. 2.5 we demonstrated the benefit of kernel learning for KQ. From the Gaussian process characterisation of KQ from Sec. 2.1, it follows that kernel parameters

can be estimated, conditional on a vector of function evaluations

, via maximum marginal likelihood:

In SMC-KQ-KL, the function evaluations are obtained at the first222This is a notational convention and is without loss of generality. In this paper these states were a random sample (without replacement) of size , though stratified sampling among the states could be used. More sophisticated alternatives that also involve the kernel , such as leverage scores, were not considered, since in general these (i) introduce a vulnerability to mis-specified kernels and (ii) require manipulation of a kernel matrix (Patel et al., 2015). (of ) states and the parameters are updated in each iteration of the SMC. This demands repeated function evaluation; this burden can be reduced with less frequent parameter updates and caching of all previous function evaluations. The experiments in Sec. 4 assessed both SMC-KQ and SMC-KQ-KL in terms of precision per total number of function evaluations, so that the additional cost of kernel learning was taken into account.

  input (integrand)
  input (target disn.)
  input (kernel)
  input (reference disn.)
  input (re-sample threshold)
  input (num. func. evaluations)
  input (num. particles)
  ; ;
   (initialise states )
   (initialise weights )
   (est’d error)
  while  and  do
      (next temp.)
      (next particle approx.)
      (est’d error)
  end while
   (function eval. )
   (kernel mean eval. )
   (kernel eval. )
   (eval. KQ estimator)
Algorithm 1 SMC Algorithm for KQ

3.5 Termination Criterion (crit)

The SMC-KQ-KL algorithm is designed to track the RMSE as is increased. However, the RMSE is not available in closed form. In this section we derive a tight upper bound on the RMSE that is used for the crit component in Alg. 1.

From the worst-case characterisation of KQ presented in Sec. 2.1, we have an upper bound


The term , denoted henceforth as (since depends on ), can be computed in closed form (see the Appendix). This motivates the following upper bound on MSE:


The term can be estimated with the bootstrap approximation

where are independent draws from . In SMC-KQ the term is an unknown constant and the statistic , an empirical proxy for the RMSE, is monitored at each iteration. The algorithm terminates once an increase in this statistic occurs. For SMC-KQ-KL the term is non-constant as it depends on the kernel hyper-parameters; then can in addition be estimated as and we monitor the product of and , with termination when an increase is observed (c.f. test, defined in the Appendix).

Full pseudo-code for SMC-KQ is provided as Alg. 1, while SMC-KQ-KL is Alg. 9 in the Appendix. To summarise, we have developed a novel procedure, SMC-KQ (and an extension SMC-KQ-KL), designed to approximate the optimal KQ estimator based on the unavailable optimal distribution in Eqn. 3 where is the unit ball of . Earlier empirical results in Sec. 2.3 suggest that SMC-KQ has potential to provide a powerful and general algorithm for numerical integration. The additional computational cost of optimising the sampling distribution does however have to be counterbalanced with the potential gain in error, and so this method will mainly be of practical interest for problems with expensive integrands or complex target distributions. The following section reports experiments designed to test this claim.

4 Results

Here we compared SMC-KQ (and SMC-KQ-KL) against the corresponding default approaches KQ (and KQ-KL) that are based on . Sec. 4.1 below reports an assessment in which the true value of integrals is known by design, while in Sec. 4.2 the methods were deployed to solve a parameter estimation problem involving differential equations.

4.1 Simulation Study

To continue our illustration from Sec. 2, we investigated the performance of SMC-KQ and SMC-KQ-KL for integration of against the distribution . Here the reference distribution was taken to be . All experiments employed SMC with particles, random walk Metropolis transitions (Alg. 3), the re-sample threshold and a maximum grid size . Dependence of the subsequent results on the choice of was investigated in Fig. 10 in the Appendix.

Fig. 3 (top) reports results for SMC-KQ against KQ, for fixed length-scale . Corresponding results for SMC-KQ-KL against KQ-KL are shown in the bottom plot. It was observed that SMC-KQ (resp. SMC-KQ-KL) out-performed KQ (resp. KQ-KL) in the sense that, on a per-function-evaluation basis, the MSE achieved by the proposed method was lower than for the standard method. The largest reduction in MSE achieved was about 8 orders of magnitude (correspondingly 4 orders of magnitude in RMSE). A fair approximation to the method, which is approximately optimal for (c.f. results in Fig. 1), was observed. The termination criterion in Sec. 3.5 was observed to be a good approximation to the optimal temperature (Fig. 9 in Appendix). As an aside, we note that the MSE was gated at for all methods due to numerical condition of the kernel matrix (a known feature of the Gaussian kernel used in this experiment).

The investigation was extended to larger dimensions ( and ) and more complex integrands in the Appendix. In all cases, considerable improvements were obtained using SMC-KQ over KQ.

Figure 3: Performance on for the running illustration of Figs. 1 and 2. The top plot shows SMC-KQ against KQ, whilst the bottom plot illustrates the versions with kernel learning.

4.2 Inference for Differential Equations

Consider the model given by with solution depending on unknown parameters . Suppose we can obtain observations through the following noise model (likelihood): at times where we assume for known . Our goal is to estimate for a fixed (potentially large) . To do so, we will use a Bayesian approach and specify a prior , then obtain samples from the posterior

using MCMC. The posterior predictive mean is then defined as:

, and this can be estimated using an empirical average from the posterior samples. This type of integration problem is particularly challenging as the integrand requires simulating from the differential equation at each iteration. Furthermore, the larger or the smaller the grid, the longer the simulation will be and the higher the computational cost.

For a tractable test-bed, we considered Hooke’s law, given by the following second order homogeneous ODE given by

with initial conditions and . This equation represents the evolution of a mass on a spring with friction (Robinson, 2004, Chapter 13). More precisely, denotes the spring constant, the damping coefficient representing friction and the mass of the object. Since this differential equation is an overdetermined system we fixed . In this case, if , we get a damped oscillatory behaviour as presented in Fig. 4 (top). Data were generated with , . with log-normal priors with scale equal to for all parameters.

To implement KQ under an unknown normalisation constant for , we followed Oates et al. (2017) and made use of a Gaussian kernel that was adapted with Stein’s method (see the Appendix for details). The reference distribution was an wide uniform prior on the hypercube . Brute force computation was used to obtain a benchmark value for the integral. For the SMC algorithm, an independent lognormal transition kernel was used at each iteration with parameters automatically tuned to the current set of particles. Results in Fig. 4 demonstrate that SMC-KQ outperforms KQ for these integration problems. These results improve upon those reported in Oates et al. (2016) for a similar integration problem based on parameter estimation for differential equations.

Figure 4: Comparison of SMC-KQ and KQ on the ODE inverse problem. The top plot illustrates the physical system, the middle plot shows observations of the ODE, whilst the bottom plot illustrates the superior performance of SMC-KQ against KQ.

5 Discussion

In this paper we formalised the optimal sampling problem for KQ. A general, practical solution was proposed, based on novel use of SMC methods. Initial empirical results demonstrate performance gains relative to standard approach of KQ with . A more challenging example based on parameter estimation for differential equations was used to illustrate the potential of SMC-KQ for Bayesian computation in combination with Stein’s method.

Our methods were general but required user-specified choice of an initial distribution . For compact state spaces we recommend taking

to be uniform. For non-compact spaces, however, there is a degree of flexibility here and default solutions, such as wide Gaussian distributions, necessarily require user input. However, the choice of

is easier than the choice of itself, since is not required to be optimal. In our examples, improved performance (relative to standard KQ) was observed for a range of reference distributions .

A main motivation for this research was to provide an alternative to optimisation-based KQ that alleviates strong dependence on the choice of kernel (Sec. 2.2). This paper provides essential groundwork toward that goal, in developing sampling-based methods for KQ in the case of complex and expensive integration problems. An empirical comparison of sampling-based and optimisation-based methods is reserved for future work.

Two extensions of this research are identified: First, the curse of dimension that is intrinsic to standard Sobolev spaces can be alleviated by demanding ‘dominating mixed smoothness’; our methods are compatible with these (essentially tensor product) kernels

(Dick et al., 2013). Second, the use of sequential QMC (Gerber and Chopin, 2015) can be considered, motivated by further orders of magnitude reduction in numerical error observed for deterministic point sets (see Fig. 13 in the Appendix).


FXB was supported by the EPSRC grant [EP/L016710/1]. CJO & MG we supported by the Lloyds Register Foundation Programme on Data-Centric Engineering. WYC was supported by the ARC Centre of Excellence in Mathematical and Statistical Frontiers. MG was supported by the EPSRC grants [EP/J016934/3, EP/K034154/1, EP/P020720/1], an EPSRC Established Career Fellowship, the EU grant [EU/259348], a Royal Society Wolfson Research Merit Award. FXB, CJO, JC & MG were also supported by the SAMSI working group on Probabilistic Numerics.



  • Bach [2013] F. Bach. Sharp analysis of low-rank kernel matrix approximations. In Proc. I. Conf. Learn. Theory, 2013.
  • Bach [2015] F. Bach. On the equivalence between kernel quadrature rules and random features. arXiv:1502.06800, 2015.
  • Bakhvalov [1959] N. S. Bakhvalov. On approximate computation of integrals. Vestnik MGU, Ser. Math. Mech. Astron. Phys. Chem., 4:3–18, 1959. In Russian.
  • Berlinet and Thomas-Agnan [2011] A. Berlinet and C. Thomas-Agnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
  • Briol et al. [2015a] F-X. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Adv. Neur. Inf. Proc. Sys., 2015a.
  • Briol et al. [2015b] F-X. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic integration: A role for statisticians in numerical analysis? arXiv:1512.00933, 2015b.
  • Chopin [2002] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.
  • Cohen and Migliorati [2016] A. Cohen and G. Migliorati. Optimal weighted least-squares methods. arXiv:1608.00512, 2016.
  • Del Moral et al. [2006] P. Del Moral, A. Doucet, and A. Jasra. Sequential monte carlo samplers. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68:411–436, 2006.
  • Diaconis [1988] P. Diaconis. Bayesian Numerical Analysis, volume IV of Statistical Decision Theory and Related Topics, pages 163–175. Springer-Verlag, New York, 1988.
  • Dick and Pillichshammer [2010] J. Dick and F. Pillichshammer. Digital nets and sequences: Discrepancy Theory and Quasi–Monte Carlo Integration. Cambridge University Press, 2010.
  • Dick et al. [2013] J. Dick, F. Y. Kuo, and I. H. Sloan. High-dimensional integration: The quasi-Monte Carlo way. Acta Numerica, 22:133–288, 2013.
  • Drineas et al. [2012] P. Drineas, M. Magdon-Ismail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13:3475–3506, 2012.
  • Eftang and Stamm [2012] J. L. Eftang and B. Stamm. Parameter multi-domain ‘hp’ empirical interpolation. I. J. Numer. Methods in Eng., 90(4):412–428, 2012.
  • Gerber and Chopin [2015] M. Gerber and N. Chopin. Sequential quasi Monte Carlo. J. R. Statist. Soc. B, 77(3):509–579, 2015.
  • Gunter et al. [2014] T. Gunter, R. Garnett, M. Osborne, P. Hennig, and S. Roberts. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Adv. Neur. Inf. Proc. Sys., 2014.
  • Hampton and Doostan [2015] J. Hampton and A. Doostan. Coherence motivated sampling and convergence analysis of least squares polynomial Chaos regression. Comput. Methods Appl. Mech. Engrg., 290:73–97, 2015.
  • Hinrichs [2010] A. Hinrichs. Optimal importance sampling for the approximation of integrals. J. Complexity, 26(2):125–134, 2010.
  • Huszar and Duvenaud [2012] F. Huszar and D. Duvenaud. Optimally-weighted herding is Bayesian quadrature. In Uncert. Artif. Intell., 2012.
  • Kanagawa et al. [2016] M. Kanagawa, B. Sriperumbudur, and K. Fukumizu. Convergence guarantees for kernel-based quadrature rules in misspecified settings. In Adv. Neur. Inf. Proc. Sys., 2016.
  • Kersting and Hennig [2016] H. Kersting and P. Hennig. Active uncertainty calibration in bayesian ode solvers. In Proc. Conf. Uncert. Artif. Intell., 2016.
  • König [1986] H. König. Eigenvalues of compact operators with applications to integral operators. Linear Algebra Appl., 84:111–122, 1986.
  • Kristoffersen [2013] S. Kristoffersen. The empirical interpolation method. Master’s thesis, Department of Mathematical Sciences, Norwegian University of Science and Technology, 2013.
  • Liu and Lee [2017] Q. Liu and J. D. Lee. Black-Box Importance Sampling. I. Conf. Artif. Intell. Stat., 2017.
  • Ma et al. [2014] Y. Ma, R. Garnett, and J. Schneider. Active Area Search via Bayesian Quadrature. I. Conf Artif. Intell. Stat., 33, 2014.
  • Mahoney [2011] M. W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123–224, 2011.
  • Oates et al. [2016] C. J. Oates, J. Cockayne, F-X. Briol, and M. Girolami. Convergence Rates for a Class of Estimators Based on Stein’s Identity. arXiv:1603.03220, 2016.
  • Oates et al. [2017] C. J. Oates, M. Girolami, and N. Chopin. Control Functionals for Monte Carlo Integration. J. R. Stat. Soc. Ser. B. Stat. Methodol., 2017. To appear.
  • O’Hagan [1991] A. O’Hagan. Bayes-Hermite quadrature. J. Statist. Plann. Inference, 29:245–260, 1991.
  • Osborne et al. [2012a] M. A. Osborne, D. Duvenaud, R. Garnett, C. E. Rasmussen, S. Roberts, and Z. Ghahramani. Active learning of model evidence using Bayesian quadrature. In Adv. Neur. Inf. Proc. Sys., 2012a.
  • Osborne et al. [2012b] M. A. Osborne, R. Garnett, S. Roberts, C. Hart, S. Aigrain, and N. Gibson. Bayesian quadrature for ratios. In Proc. I. Conf. Artif. Intell. Stat., 2012b.
  • Patel et al. [2015] R. Patel, T. A. Goldstein, E. L. Dyer, A.Mirhoseini, and R. G. Baraniuk. OASIS: Adaptive Column Sampling for Kernel Matrix Approximation. arXiv:1505.05208, 2015.
  • Paul et al. [2016] S. Paul, K. Ciosek, M. A. Osborne, and S. Whiteson. Alternating Optimisation and Quadrature for Robust Reinforcement Learning. arXiv:1605.07496, 2016.
  • Plaskota et al. [2009] L. Plaskota, G.W. Wasilkowski, and Y. Zhao. New averaging technique for approximating weighted integrals. J. Complexity, 25(3):268–291, 2009.
  • Prüher and Šimandl [2015] J. Prüher and M. Šimandl.

    Bayesian Quadrature in Nonlinear Filtering.

    In 12th I. Conf. Inform. Control Autom. Robot., 2015.
  • Rasmussen and Ghahramani [2002] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Adv. Neur. Inf. Proc. Sys., 2002.
  • Robert and Casella [2013] C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
  • Robinson [2004] J. C. Robinson.

    An introduction to ordinary differential equations

    Cambridge University Press, 2004.
  • Särkkä et al. [2015] S. Särkkä, J. Hartikainen, L. Svensson, and F. Sandblom. On the relation between Gaussian process quadratures and sigma-point methods. arXiv:1504.05994, 2015.
  • Shi et al. [2009] T. Shi, M. Belkin, and B. Yu.

    Data spectroscopy: Eigenspaces of convolution operators and clustering.

    Ann. Statist., 37(6):3960–3984, 2009.
  • Simon [1979] B. Simon. Trace Ideals and Their Applications. Cambridge University Press, 1979.
  • Smola et al. [2007] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learn. Theor., pages 13–31, 2007.
  • Sommariva and Vianello [2006] A. Sommariva and M. Vianello.

    Numerical cubature on scattered data by radial basis functions.

    Computing, 76(3-4):295–310, 2006.
  • Temme [1996] N. M. Temme. Special Functions: An Introduction to the Classical Functions of Mathematical Physics. Wiley, New York, 1996.
  • Wendland [2004] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2004.
  • Zhou et al. [2016] Y. Zhou, A. M. Johansen, and J. A. D. Aston. Towards Automatic Model Comparison: An Adaptive Sequential Monte Carlo Approach. J. Comput. Graph. Statist., 25(3):701–726, 2016.

Appendix A Appendix

This appendix complements the paper “On the sampling problem for kernel quadrature”. Section A.1 discusses the potential lack of robustness of greedy optimization methods, which motivated the development of SMC-KQ. Sections A.2 and A.3 discuss some of the theoretical aspects of KQ, whilst Section A.4 and A.5 presents additional numerical experiments and details for implementation. Finally, Section A.6 provides detailed pseudo-code for all algorithms used in this paper.

a.1 Lack of Robustness of Optimisation Methods

To demonstrate the non-robustness to mis-specified kernels, that is a feature of optimisation-based methods, we considered integration against for functions that can be approximated by the kernel . An initial state was fixed at the origin and then for the state was chosen to minimise the error criterion given the location of the . This is known as ‘sequential Bayesian quadrature’ [SBQ; Huszar and Duvenaud, 2012, Gunter et al., 2014, Briol et al., 2015a]. The kernel length scale was fixed at and we consider (as a thought experiment, since it does not enter into our selection of points) a more regular integrand, such as that shown in Fig. 5 (top). The location of the states obtained in this manner are shown in Fig. 5 (bottom). It is clear that SBQ is not an efficient use of computation for integration of the integrand against . Of course, a bad choice of kernel length scale parameter can in principle be alleviated by kernel learning, but this will not be robust the case where is very small.

This example motivates sampling-based methods as an alternative to optimisation-based methods. Future work will be required to better understand when methods such as SBQ can be reliable in the presence of unknown kernel parameters, but this was beyond the scope of this work.

Figure 5: Sequential minimisation of the error criterion , denoted SBQ, does not lead to adequate placement of points when the kernel is mis-specified. [Here the kernel length scale was fixed to . Selected points are represented as red. For comparison, a collection of draws from , as used in KQ, are shown as blue points.]

a.2 Additional Definitions

The space is defined to be the set of -measurable functions such that the Lebesgue integral

exists and is finite.

For a multi-index define . The (standard) Sobolev space of order is denoted

This space is equipped with norm

Two normed spaces and are said to be ‘norm equivalent’ if there exists such that

for all .

a.3 Theoretical Results

a.3.1 Proof of Theorem 1


From Thm. 11.13 in Wendland [2004] we have that there exist constants , such that


for all , provided , where

Under the hypotheses, we can suppose that the deterministic states ensure . Then Eqn. 7 holds for all , where the are independent draws from . It follows that

Next, Lem. 1 in Oates et al. [2016] establishes that, under the present hypotheses on and , there exists such that

for all , where is independent of .

Combining the above results produces

as required, with . ∎

a.3.2 Proof of Theorem 2


The Cauchy-Schwarz result for kernel mean embeddings [Smola et al., 2007] gives

Consider the first term above. Since is dense in , it follows that (the unique positive self-adjoint square root of ) is an isometry from to . Now, since , there exists a unique element such that . Then we have that

For , we have if and only if


for some , in which case is equal to the infimum of under all such representations . In particular, it follows that for the particular choice with for all .

Under the hypothesis on , Prop. 1 of Bach [2015] established that when are independent, then

with probability at least . Fixing the function in Eqn. 9 leads to the statement that

is at most with probability at least . The infimum over can be replaced with an unconstrained infimum over to obtain the weaker statement that

is at most with probability at least . Now, recall from Sec. 2.1 that the KQ weights are characterised through the solution to this optimisation problem as . It follows that

with probability at least . Combining this fact with Eqn. A.3.2 completes the proof. ∎

a.3.3 for the Example of Figure 1

In this section we consider scope to derive in closed-form for the example of Fig. 1. The following will be used:

Proposition 1 (Prop. 1 in Shi et al. [2009]).

Let , and . Define and denote the th Hermite polynomial as . Then the eigenvalues and corresponding eigenfunctions of the integral operator are