1 Introduction
Consider approximation of the Lebesgue integral
(1) 
where is a Borel measure defined over and is Borel measurable. Define to be the set of Borel measures such that , meaning that , and assume . In situations where
does not admit a closedform, Monte Carlo (MC) methods can be used to estimate the numerical value of Eqn.
1. A classical research problem in computational statistics is to reduce the MC estimation error in this context, where the integral can, for example, represent an expectation or marginalisation over a random variable of interest.
The default MC estimator comprises of
where are sampled identically and independently (i.i.d.) from . Then we have a root mean square error (RMSE) bound
where
and the expectation is with respect to the joint distribution of the
. For settings where the Lebesgue density ofis only known up to normalising constant, Markov chain Monte Carlo (MCMC) methods can be used; the rateconstant
is then related to the asymptotic variance of
under the Markov chain sample path.Considerations of computational cost place emphasis on methods to reduce the rate constant . For the MC estimator, this rate constant can be made smaller via importance sampling (IS): where an optimal choice , that minimises , is available in explicit closedform (see Robert and Casella, 2013, Thm. 3.3.4). However, the RMSE remains asymptotically gated at .
The default Kernel Quadrature (KQ) estimate comprises of
(2) 
where the are independent (or arise from a Markov chain) and . In contrast to MC, the weights in KQ are in general nonuniform, realvalued and depend on . The KQ nomenclature derives from the (symmetric, positivedefinite) kernel
that is used to construct an interpolant
such that for . The weights in Eqn. 2 are implicitly defined via the equation . The KQ estimator is identical to the posterior mean in Bayesian Monte Carlo (O’Hagan, 1991; Rasmussen and Ghahramani, 2002), and its relationship with classical numerical quadrature rules has been studied (Diaconis, 1988; Särkkä et al., 2015).Under regularity conditions, Briol et al. (2015b) established the following RMSE bound for KQ:
where both the integrand and each argument of the kernel admit continuous mixed weak derivatives of order and can be arbitrarily small. An informationtheoretic lower bound on the RMSE is (Bakhvalov, 1959). The faster convergence of the RMSE, relative to MC, can lead to improved precision in applications. Akin to IS, the samples need not be draws from in order for KQ to provide consistent estimation (since is encoded in the weights ). Importantly, KQ can be viewed as postprocessing of MC samples; the kernel can be reverseengineered (e.g. via crossvalidation) and does not need to be specified upfront.
One notable disadvantage of KQ methods is that little is known about how the rate constant depends on the choice of sampling distribution .
In contrast to IS, no general closedform expression has been established for an optimal distribution for KQ (the technical meaning of ‘optimal’ is defined below).
Moreover, limited practical guidance is available on the selection of the sampling distribution (an exception is Bach, 2015, as explained in Sec. 2.4) and in applications it is usual to take .
This choice is convenient but leads to estimators that are not efficient, as we demonstrate in dramatic empirical examples in Sec. 2.3.
The main contributions of this paper are twofold. First, we formalise the problem of optimal sampling for KQ as an important and open challenge in computational statistics. To be precise, our target is an optimal sampling distribution for KQ, defined as
(3) 
for some functional class to be specified. In general a (possibly nonunique) optimal will depend on and, unlike for IS, also on the kernel and the number of samples .
Second, we propose a novel and automatic method for selection of that is rooted in approximation of the unavailable . In brief, our method considers candidate sampling distributions of the form for and a reference distribution on . The exponent is chosen such that minimises an empirical upper bound on the RMSE. The overall approach is facilitated with an efficient sequential MC (SMC) sampler and called SMCKQ. In particular, the approach (i) provides practical guidance for selection of for KQ, (ii) offers robustness to kernel misspecification, and (iii) extends recent work on computing posterior expectations with kernels obtained using Stein’s method (Oates et al., 2017).
The paper proceeds as follows: Empirical results in Sec. 2 reveal that the RMSE for KQ is highly sensitive to the choice of . The proposed approach to selection of is contained in Sec. 3. Numerical experiments, presented in Sec. 4, demonstrate that dramatic reductions in integration error (up to 4 orders of magnitude) can be achieved with SMCKQ. Lastly, a discussion is provided in Sec. 5.
2 Background
This section presents an overview of KQ (Sec. 2.1 and 2.2), empirical (Secs. 2.3) and theoretical (Sec. 2.4) results on the choice of sampling distribution, and discusses kernel learning for KQ (Sec. 2.5).
2.1 Overview of Kernel Quadrature
We now proceed to describe KQ: Recall the approximation to ; an explicit form for the coefficients is given as , where and . It is assumed that exists almost surely; for nondegenerate kernels, this corresponds to having no atoms. From the above definition of KQ,
Defining leads to the estimate in Eqn. 2 with weights . Pairs for which the have closed form are reported in Table 1 of Briol et al. (2015b). Computation of these weights incurs a computational cost of at most and can be justified when either (i) evaluation of forms the computational bottleneck, or (ii) the gain in estimator precision (as a function in ) dominates this cost (i.e. whenever ).
Notable contributions on KQ include Diaconis (1988); O’Hagan (1991); Rasmussen and Ghahramani (2002) who introduced the method and Huszar and Duvenaud (2012); Osborne et al. (2012a, b); Gunter et al. (2014); Bach (2015); Briol et al. (2015a, b); Särkkä et al. (2015); Kanagawa et al. (2016); Liu and Lee (2017) who provided consequent methodological extensions. KQ has been applied to a wide range of problems including probabilistic ODE solvers (Kersting and Hennig, 2016)
(Paul et al., 2016), filtering (Prüher and Šimandl, 2015) and design of experiments (Ma et al., 2014).Several characterisations of the KQ estimator are known and detailed below. Let denote the Hilbert space characterised by the reproducing kernel , and denote its norm as (Berlinet and ThomasAgnan, 2011). Then we have the following: (a) The function is the minimiser of over subject to for all . (b) The function is the posterior mean for under the Gaussian process prior conditioned on data and is the mean of the implied posterior marginal over . (c) The weights are characterised as the minimiser over of
the maximal error in the unit ball of . These characterisations connect KQ to (a) nonparametric regression, (b) probabilistic integration and (c) quasiMonte Carlo (QMC) methods (Dick and Pillichshammer, 2010). The scattered data approximation literature (Sommariva and Vianello, 2006) and the numerical analysis literature (where KQ is known as the ‘empirical interpolation method’; Eftang and Stamm, 2012; Kristoffersen, 2013) can also be connected to KQ. However, our search of all of these literatures did not yield guidance on the optimal selection of the sampling distribution (with the exception of Bach (2015) reported in Sec. 2.4).
2.2 OverReliance on the Kernel
In Osborne et al. (2012a); Huszar and Duvenaud (2012); Gunter et al. (2014); Briol et al. (2015a), the selection of was approached as a greedy optimisation problem, wherein the maximal integration error was minimised, given the location of the previous . This approach has demonstrated considerable success in applications. However, the error criterion is strongly dependant on the choice of kernel and the sequential optimisation approach is vulnerable to kernel misspecification. In particular, if the intrinsic length scale of is “too small” then the all cluster around the mode of , leading to poor integral estimation (see Fig. 5 in the Appendix). Related work on subsample selection, such as leverage scores (Bach, 2013), can also be nonrobust to misspecified kernels. The partial solution of online kernel learning requires a sufficient number of data and is not always practicable in small regimes that motivate KQ.
This paper considers sampling methods as a robust alternative to optimisation methods. Although our method also makes use of to select , it reverts to in the limit as the length scale of is made small. In this sense, sampling offers more robustness to kernel misspecification than optimisation methods, at the expense of a possible (nonasymptotic) decrease in precision in the case of a wellspecified kernel. This line of research is thus complementary to existing work. However, we emphasise that robustness is an important consideration for general applications of KQ in which kernel specification may be a nontrivial task.
2.3 Sensitivity to the Sampling Distribution
To date, we are not aware of a clear demonstration of the acute dependence of the performance of the KQ estimator on the choice of distribution . It is therefore important to illustrate this phenomenon in order to build intuition.
Consider the toy problem with state space , target distribution , a single test function and kernel . For this problem, consider a range of sampling distributions of the form for . Fig. 1 plots
an empirical estimate for the RMSE where is the th of independent KQ estimates for based on samples drawn from the distribution
with standard deviation
(). In this case is available in closedform. It is seen that the ‘obvious’ choice of , i.e. , is suboptimal. The intuition here is that ‘extreme’ samples from the tails of are rather informative for building the interpolant underlying KQ; we should therefore oversample these values via a heaviertailed . The same intuition is used for column sampling and to construct leverage scores (Mahoney, 2011; Drineas et al., 2012).2.4 Established Results
Here we recall the main convergence results todate on KQ and discuss how these relate to choices of sampling distribution. To reduce the level of detail below, we make several assumptions at the outset:
Assumption on the domain: The domain will either be itself or a compact subset of that satisfies an ‘interior cone condition’, meaning that there exists an angle and a radius such that for every there exists such that the cone is contained in (see Wendland, 2004, for background).
Assumption on the kernel: Consider the integral operator , with defined as the Bochner integral . Assume that , so that is selfadjoint, positive semidefinite and traceclass (Simon, 1979). Then, from an extension of Mercer’s theorem (König, 1986) we have a decomposition , where and
are the eigenvalues and eigenfunctions of
. Further assume that is dense in .The first result is adapted and extended from Thm. 1 in Oates et al. (2016).
Theorem 1.
Assume that admits a density defined on a compact domain . Assume that for some . Let be fixed and define the Euclidean fill distance
Let be independent draws from . Assume gives rise to a Sobolev space . Then there exists such that, for ,
for all . Here for some constant independent of and .
All proofs are reserved for the Appendix. The main contribution of Thm. 1 is to establish a convergence rate for KQ when using importance sampling distributions. A similar result appeared in Thm. 1 of Briol et al. (2015b) for samples from (see the Appendix) and was extended to MCMC samples in Oates et al. (2016). An extension to the case of a misspecified kernel was considered in Kanagawa et al. (2016). However a limitation of this direction of research is that it does not address the question of how to select .
The second result that we present is a consequence of the recent work of Bach (2015), who considered a particular choice of , depending on a fixed , via the density The following is adapted from Prop. 1 in Bach (2015):
Theorem 2.
Some remarks are in order: (i) Bach (2015, Prop. 3) showed that, for , integration error scales at an optimal rate in up to logarithmic terms and, after samples, is of size . (ii) The distribution is obtained from minimising an upper bound on the integration error, rather than the error itself. It is unclear to us how well approximates an optimal sampling distribution for KQ. (iii) In general is hard to compute. For the specific case , equal to and uniform, the distribution is also uniform (and hence independent of ; see Sec. 4.4 of Bach (2015)). However, even for the simple example of Sec. 2.3, does not appear to have a closed form (details in Appendix). An approximation scheme was proposed in Sec. 4.2 of Bach (2015) but the error of this scheme was not studied.
2.5 Goals
Our first goal was to formalise the sampling problem for KQ; this is now completed. Our second goal was to develop a novel automatic approach to selection of , called SMCKQ; full details are provided in Sec. 3.
Also, observe that the integrand will in general belong to an infinitude of Hilbert spaces, while for KQ a single kernel must be selected. This choice will affect the performance of the KQ estimator; for example, in Fig. 2, the problem of Sec. 2.3 was reconsidered based on a class of kernels parametrised by . Results showed that, for all choices of parameter, the RMSE of KQ is sensitive to choice of . In particular, the default choice of is not optimal. For this reason, an extension that includes kernel learning, called SMCKQKL, is proposed in Sec. 3.
3 Methods
In this section the SMCKQ and SMCKQKL methods are presented. Our aim is to explain in detail the main components (SMC, temp, crit) of Alg. 1. To this end, Secs. 3.1 and 3.2 set up our SMC sampler to target tempered distributions, while Sec. 3.3
presents a heuristic for the choice of temperature schedule. Sec.
3.4 extends the approach to kernel learning and Sec. 3.5 proposes a novel criterion to determine when a desired error tolerance is reached.3.1 Thermodynamic Ansatz
To begin, consider , and as fixed. The following ansatz is central to our proposed SMCKQ method: An optimal distribution (in the sense of Eqn. 3) can be wellapproximated by a distribution of the form
(4) 
for a specific (but unknown) ‘inverse temperature’ parameter . Here is a reference distribution to be specified and which should be chosen to be uninformative in practice. It is assumed that all exist (i.e. can be normalised). The motivation for this ansatz stems from Sec. 2.3, where and can be cast in this form with and
an (improper) uniform distribution on
. In general, tempering generates a class of distributions which overrepresent extreme events relative to (i.e. have heavier tails). This property has the potential to improve performance for KQ, as demonstrated in Sec. 2.3.The ansatz of Eqn. 4 reduces the nonparametric sampling problem for KQ to the onedimensional parametric problem of selecting a suitable . The problem can be further simplified by focusing on a discrete temperature ladder such that , and . Discussion of the choice of ladder is deferred to Sec. 3.3. This reduced problem, where we seek an optimal index , is still nontrivial as no closedform expression is available for the RMSE at each candidate . To overcome this impasse a novel approach to estimate the RMSE is presented in Sec. 3.5.
3.2 Convex Ansatz (Smc)
The proposed SMCKQ algorithm requires a second ansatz, namely that the RMSE is convex in and possesses a global minimum in the range . This second ansatz (borne out in numerical results in Fig. 1) motivates an algorithm that begins at and tracks the RMSE until an increase is detected, say at ; at which point the index is taken for KQ.
To realise such an algorithm, this paper exploited SMC methods (Chopin, 2002; Del Moral et al., 2006). Here, a particle approximation to is first obtained where are independent draws from , and . Then, at iteration , the particle approximation to is reweighted, resampled and subject to a Markov transition, to deliver a particle approximation to . This ‘resamplemove’ algorithm, denoted SMC, is standard but, for completeness, pseudocode is provided as Alg. 2 in the Appendix.
At iteration , a subset of size is drawn from the unique^{1}^{1}1This ensures that kernel matrices have full rank. It does not introduce bias into KQ, since in general need not equal . However, to keep notation clear, we do not make this operation explicit. elements in , from the particle approximation to , and proposed for use in KQ. A criterion crit, defined in Sec. 3.5, is used to determine whether the resultant KQ error has increased relative to . If this is the case, then the distribution from the previous iteration is taken for use in KQ. Otherwise the algorithm proceeds to and the process repeats. In the degenerate case where the RMSE has a minimum at , the algorithm defaults to standard KQ with .
Both ansatz of the SMCKQ algorithm are justified through the strong empirical results presented in Sec. 4.
3.3 Choice of Temperature Schedule (temp)
The choice of temperature schedule influences several aspects of SMCKQ: (i) The SMC approximation to is governed by the “distance” (in some appropriate metric) between and . (ii) The speed at which the minimum can be reached is linear in the number of temperatures between and . (iii) The precision of KQ depends on the approximation . Factors (i,iii) motivate the use of a fine schedule with large, while (ii) motivates a coarse schedule with small.
For this work, a temperature schedule was used that is well suited to both (i) and (ii), while a strict constraint was imposed on the grid spacing to acknowledge (iii). The specific schedule used in this work was determined based on the conditional effective sample size of the current particle population, as proposed in the recent work of Zhou et al. (2016). Full details are presented in Algs. 4 and 5 in the Appendix.
3.4 Kernel Learning
In Sec. 2.5 we demonstrated the benefit of kernel learning for KQ. From the Gaussian process characterisation of KQ from Sec. 2.1, it follows that kernel parameters
can be estimated, conditional on a vector of function evaluations
, via maximum marginal likelihood:In SMCKQKL, the function evaluations are obtained at the first^{2}^{2}2This is a notational convention and is without loss of generality. In this paper these states were a random sample (without replacement) of size , though stratified sampling among the states could be used. More sophisticated alternatives that also involve the kernel , such as leverage scores, were not considered, since in general these (i) introduce a vulnerability to misspecified kernels and (ii) require manipulation of a kernel matrix (Patel et al., 2015). (of ) states and the parameters are updated in each iteration of the SMC. This demands repeated function evaluation; this burden can be reduced with less frequent parameter updates and caching of all previous function evaluations. The experiments in Sec. 4 assessed both SMCKQ and SMCKQKL in terms of precision per total number of function evaluations, so that the additional cost of kernel learning was taken into account.
3.5 Termination Criterion (crit)
The SMCKQKL algorithm is designed to track the RMSE as is increased. However, the RMSE is not available in closed form. In this section we derive a tight upper bound on the RMSE that is used for the crit component in Alg. 1.
From the worstcase characterisation of KQ presented in Sec. 2.1, we have an upper bound
(5) 
The term , denoted henceforth as (since depends on ), can be computed in closed form (see the Appendix). This motivates the following upper bound on MSE:
(6) 
The term can be estimated with the bootstrap approximation
where are independent draws from . In SMCKQ the term is an unknown constant and the statistic , an empirical proxy for the RMSE, is monitored at each iteration. The algorithm terminates once an increase in this statistic occurs. For SMCKQKL the term is nonconstant as it depends on the kernel hyperparameters; then can in addition be estimated as and we monitor the product of and , with termination when an increase is observed (c.f. test, defined in the Appendix).
Full pseudocode for SMCKQ is provided as Alg. 1, while SMCKQKL is Alg. 9 in the Appendix. To summarise, we have developed a novel procedure, SMCKQ (and an extension SMCKQKL), designed to approximate the optimal KQ estimator based on the unavailable optimal distribution in Eqn. 3 where is the unit ball of . Earlier empirical results in Sec. 2.3 suggest that SMCKQ has potential to provide a powerful and general algorithm for numerical integration. The additional computational cost of optimising the sampling distribution does however have to be counterbalanced with the potential gain in error, and so this method will mainly be of practical interest for problems with expensive integrands or complex target distributions. The following section reports experiments designed to test this claim.
4 Results
Here we compared SMCKQ (and SMCKQKL) against the corresponding default approaches KQ (and KQKL) that are based on . Sec. 4.1 below reports an assessment in which the true value of integrals is known by design, while in Sec. 4.2 the methods were deployed to solve a parameter estimation problem involving differential equations.
4.1 Simulation Study
To continue our illustration from Sec. 2, we investigated the performance of SMCKQ and SMCKQKL for integration of against the distribution . Here the reference distribution was taken to be . All experiments employed SMC with particles, random walk Metropolis transitions (Alg. 3), the resample threshold and a maximum grid size . Dependence of the subsequent results on the choice of was investigated in Fig. 10 in the Appendix.
Fig. 3 (top) reports results for SMCKQ against KQ, for fixed lengthscale . Corresponding results for SMCKQKL against KQKL are shown in the bottom plot. It was observed that SMCKQ (resp. SMCKQKL) outperformed KQ (resp. KQKL) in the sense that, on a perfunctionevaluation basis, the MSE achieved by the proposed method was lower than for the standard method. The largest reduction in MSE achieved was about 8 orders of magnitude (correspondingly 4 orders of magnitude in RMSE). A fair approximation to the method, which is approximately optimal for (c.f. results in Fig. 1), was observed. The termination criterion in Sec. 3.5 was observed to be a good approximation to the optimal temperature (Fig. 9 in Appendix). As an aside, we note that the MSE was gated at for all methods due to numerical condition of the kernel matrix (a known feature of the Gaussian kernel used in this experiment).
The investigation was extended to larger dimensions ( and ) and more complex integrands in the Appendix. In all cases, considerable improvements were obtained using SMCKQ over KQ.
4.2 Inference for Differential Equations
Consider the model given by with solution depending on unknown parameters . Suppose we can obtain observations through the following noise model (likelihood): at times where we assume for known . Our goal is to estimate for a fixed (potentially large) . To do so, we will use a Bayesian approach and specify a prior , then obtain samples from the posterior
using MCMC. The posterior predictive mean is then defined as:
, and this can be estimated using an empirical average from the posterior samples. This type of integration problem is particularly challenging as the integrand requires simulating from the differential equation at each iteration. Furthermore, the larger or the smaller the grid, the longer the simulation will be and the higher the computational cost.For a tractable testbed, we considered Hooke’s law, given by the following second order homogeneous ODE given by
with initial conditions and . This equation represents the evolution of a mass on a spring with friction (Robinson, 2004, Chapter 13). More precisely, denotes the spring constant, the damping coefficient representing friction and the mass of the object. Since this differential equation is an overdetermined system we fixed . In this case, if , we get a damped oscillatory behaviour as presented in Fig. 4 (top). Data were generated with , . with lognormal priors with scale equal to for all parameters.
To implement KQ under an unknown normalisation constant for , we followed Oates et al. (2017) and made use of a Gaussian kernel that was adapted with Stein’s method (see the Appendix for details). The reference distribution was an wide uniform prior on the hypercube . Brute force computation was used to obtain a benchmark value for the integral. For the SMC algorithm, an independent lognormal transition kernel was used at each iteration with parameters automatically tuned to the current set of particles. Results in Fig. 4 demonstrate that SMCKQ outperforms KQ for these integration problems. These results improve upon those reported in Oates et al. (2016) for a similar integration problem based on parameter estimation for differential equations.
5 Discussion
In this paper we formalised the optimal sampling problem for KQ. A general, practical solution was proposed, based on novel use of SMC methods. Initial empirical results demonstrate performance gains relative to standard approach of KQ with . A more challenging example based on parameter estimation for differential equations was used to illustrate the potential of SMCKQ for Bayesian computation in combination with Stein’s method.
Our methods were general but required userspecified choice of an initial distribution . For compact state spaces we recommend taking
to be uniform. For noncompact spaces, however, there is a degree of flexibility here and default solutions, such as wide Gaussian distributions, necessarily require user input. However, the choice of
is easier than the choice of itself, since is not required to be optimal. In our examples, improved performance (relative to standard KQ) was observed for a range of reference distributions .A main motivation for this research was to provide an alternative to optimisationbased KQ that alleviates strong dependence on the choice of kernel (Sec. 2.2). This paper provides essential groundwork toward that goal, in developing samplingbased methods for KQ in the case of complex and expensive integration problems. An empirical comparison of samplingbased and optimisationbased methods is reserved for future work.
Two extensions of this research are identified: First, the curse of dimension that is intrinsic to standard Sobolev spaces can be alleviated by demanding ‘dominating mixed smoothness’; our methods are compatible with these (essentially tensor product) kernels
(Dick et al., 2013). Second, the use of sequential QMC (Gerber and Chopin, 2015) can be considered, motivated by further orders of magnitude reduction in numerical error observed for deterministic point sets (see Fig. 13 in the Appendix).Acknowledgements
FXB was supported by the EPSRC grant [EP/L016710/1]. CJO & MG we supported by the Lloyds Register Foundation Programme on DataCentric Engineering. WYC was supported by the ARC Centre of Excellence in Mathematical and Statistical Frontiers. MG was supported by the EPSRC grants [EP/J016934/3, EP/K034154/1, EP/P020720/1], an EPSRC Established Career Fellowship, the EU grant [EU/259348], a Royal Society Wolfson Research Merit Award. FXB, CJO, JC & MG were also supported by the SAMSI working group on Probabilistic Numerics.
References
References
 Bach [2013] F. Bach. Sharp analysis of lowrank kernel matrix approximations. In Proc. I. Conf. Learn. Theory, 2013.
 Bach [2015] F. Bach. On the equivalence between kernel quadrature rules and random features. arXiv:1502.06800, 2015.
 Bakhvalov [1959] N. S. Bakhvalov. On approximate computation of integrals. Vestnik MGU, Ser. Math. Mech. Astron. Phys. Chem., 4:3–18, 1959. In Russian.
 Berlinet and ThomasAgnan [2011] A. Berlinet and C. ThomasAgnan. Reproducing kernel Hilbert spaces in probability and statistics. Springer Science & Business Media, 2011.
 Briol et al. [2015a] FX. Briol, C. J. Oates, M. Girolami, and M. A. Osborne. FrankWolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. In Adv. Neur. Inf. Proc. Sys., 2015a.
 Briol et al. [2015b] FX. Briol, C. J. Oates, M. Girolami, M. A. Osborne, and D. Sejdinovic. Probabilistic integration: A role for statisticians in numerical analysis? arXiv:1512.00933, 2015b.
 Chopin [2002] N. Chopin. A sequential particle filter method for static models. Biometrika, 89(3):539–552, 2002.
 Cohen and Migliorati [2016] A. Cohen and G. Migliorati. Optimal weighted leastsquares methods. arXiv:1608.00512, 2016.
 Del Moral et al. [2006] P. Del Moral, A. Doucet, and A. Jasra. Sequential monte carlo samplers. J. R. Stat. Soc. Ser. B. Stat. Methodol., 68:411–436, 2006.
 Diaconis [1988] P. Diaconis. Bayesian Numerical Analysis, volume IV of Statistical Decision Theory and Related Topics, pages 163–175. SpringerVerlag, New York, 1988.
 Dick and Pillichshammer [2010] J. Dick and F. Pillichshammer. Digital nets and sequences: Discrepancy Theory and Quasi–Monte Carlo Integration. Cambridge University Press, 2010.
 Dick et al. [2013] J. Dick, F. Y. Kuo, and I. H. Sloan. Highdimensional integration: The quasiMonte Carlo way. Acta Numerica, 22:133–288, 2013.
 Drineas et al. [2012] P. Drineas, M. MagdonIsmail, M. W. Mahoney, and D. P. Woodruff. Fast approximation of matrix coherence and statistical leverage. J. Mach. Learn. Res., 13:3475–3506, 2012.
 Eftang and Stamm [2012] J. L. Eftang and B. Stamm. Parameter multidomain ‘hp’ empirical interpolation. I. J. Numer. Methods in Eng., 90(4):412–428, 2012.
 Gerber and Chopin [2015] M. Gerber and N. Chopin. Sequential quasi Monte Carlo. J. R. Statist. Soc. B, 77(3):509–579, 2015.
 Gunter et al. [2014] T. Gunter, R. Garnett, M. Osborne, P. Hennig, and S. Roberts. Sampling for inference in probabilistic models with fast Bayesian quadrature. In Adv. Neur. Inf. Proc. Sys., 2014.
 Hampton and Doostan [2015] J. Hampton and A. Doostan. Coherence motivated sampling and convergence analysis of least squares polynomial Chaos regression. Comput. Methods Appl. Mech. Engrg., 290:73–97, 2015.
 Hinrichs [2010] A. Hinrichs. Optimal importance sampling for the approximation of integrals. J. Complexity, 26(2):125–134, 2010.
 Huszar and Duvenaud [2012] F. Huszar and D. Duvenaud. Optimallyweighted herding is Bayesian quadrature. In Uncert. Artif. Intell., 2012.
 Kanagawa et al. [2016] M. Kanagawa, B. Sriperumbudur, and K. Fukumizu. Convergence guarantees for kernelbased quadrature rules in misspecified settings. In Adv. Neur. Inf. Proc. Sys., 2016.
 Kersting and Hennig [2016] H. Kersting and P. Hennig. Active uncertainty calibration in bayesian ode solvers. In Proc. Conf. Uncert. Artif. Intell., 2016.
 König [1986] H. König. Eigenvalues of compact operators with applications to integral operators. Linear Algebra Appl., 84:111–122, 1986.
 Kristoffersen [2013] S. Kristoffersen. The empirical interpolation method. Master’s thesis, Department of Mathematical Sciences, Norwegian University of Science and Technology, 2013.
 Liu and Lee [2017] Q. Liu and J. D. Lee. BlackBox Importance Sampling. I. Conf. Artif. Intell. Stat., 2017.
 Ma et al. [2014] Y. Ma, R. Garnett, and J. Schneider. Active Area Search via Bayesian Quadrature. I. Conf Artif. Intell. Stat., 33, 2014.
 Mahoney [2011] M. W. Mahoney. Randomized algorithms for matrices and data. Found. Trends Mach. Learn., 3(2):123–224, 2011.
 Oates et al. [2016] C. J. Oates, J. Cockayne, FX. Briol, and M. Girolami. Convergence Rates for a Class of Estimators Based on Stein’s Identity. arXiv:1603.03220, 2016.
 Oates et al. [2017] C. J. Oates, M. Girolami, and N. Chopin. Control Functionals for Monte Carlo Integration. J. R. Stat. Soc. Ser. B. Stat. Methodol., 2017. To appear.
 O’Hagan [1991] A. O’Hagan. BayesHermite quadrature. J. Statist. Plann. Inference, 29:245–260, 1991.
 Osborne et al. [2012a] M. A. Osborne, D. Duvenaud, R. Garnett, C. E. Rasmussen, S. Roberts, and Z. Ghahramani. Active learning of model evidence using Bayesian quadrature. In Adv. Neur. Inf. Proc. Sys., 2012a.
 Osborne et al. [2012b] M. A. Osborne, R. Garnett, S. Roberts, C. Hart, S. Aigrain, and N. Gibson. Bayesian quadrature for ratios. In Proc. I. Conf. Artif. Intell. Stat., 2012b.
 Patel et al. [2015] R. Patel, T. A. Goldstein, E. L. Dyer, A.Mirhoseini, and R. G. Baraniuk. OASIS: Adaptive Column Sampling for Kernel Matrix Approximation. arXiv:1505.05208, 2015.
 Paul et al. [2016] S. Paul, K. Ciosek, M. A. Osborne, and S. Whiteson. Alternating Optimisation and Quadrature for Robust Reinforcement Learning. arXiv:1605.07496, 2016.
 Plaskota et al. [2009] L. Plaskota, G.W. Wasilkowski, and Y. Zhao. New averaging technique for approximating weighted integrals. J. Complexity, 25(3):268–291, 2009.

Prüher and Šimandl [2015]
J. Prüher and M. Šimandl.
Bayesian Quadrature in Nonlinear Filtering.
In 12th I. Conf. Inform. Control Autom. Robot., 2015.  Rasmussen and Ghahramani [2002] C. E. Rasmussen and Z. Ghahramani. Bayesian Monte Carlo. In Adv. Neur. Inf. Proc. Sys., 2002.
 Robert and Casella [2013] C. Robert and G. Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.

Robinson [2004]
J. C. Robinson.
An introduction to ordinary differential equations
. Cambridge University Press, 2004.  Särkkä et al. [2015] S. Särkkä, J. Hartikainen, L. Svensson, and F. Sandblom. On the relation between Gaussian process quadratures and sigmapoint methods. arXiv:1504.05994, 2015.

Shi et al. [2009]
T. Shi, M. Belkin, and B. Yu.
Data spectroscopy: Eigenspaces of convolution operators and clustering.
Ann. Statist., 37(6):3960–3984, 2009.  Simon [1979] B. Simon. Trace Ideals and Their Applications. Cambridge University Press, 1979.
 Smola et al. [2007] A. Smola, A. Gretton, L. Song, and B. Schölkopf. A Hilbert space embedding for distributions. In Algorithmic Learn. Theor., pages 13–31, 2007.

Sommariva and Vianello [2006]
A. Sommariva and M. Vianello.
Numerical cubature on scattered data by radial basis functions.
Computing, 76(34):295–310, 2006.  Temme [1996] N. M. Temme. Special Functions: An Introduction to the Classical Functions of Mathematical Physics. Wiley, New York, 1996.
 Wendland [2004] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2004.
 Zhou et al. [2016] Y. Zhou, A. M. Johansen, and J. A. D. Aston. Towards Automatic Model Comparison: An Adaptive Sequential Monte Carlo Approach. J. Comput. Graph. Statist., 25(3):701–726, 2016.
Appendix A Appendix
This appendix complements the paper “On the sampling problem for kernel quadrature”. Section A.1 discusses the potential lack of robustness of greedy optimization methods, which motivated the development of SMCKQ. Sections A.2 and A.3 discuss some of the theoretical aspects of KQ, whilst Section A.4 and A.5 presents additional numerical experiments and details for implementation. Finally, Section A.6 provides detailed pseudocode for all algorithms used in this paper.
a.1 Lack of Robustness of Optimisation Methods
To demonstrate the nonrobustness to misspecified kernels, that is a feature of optimisationbased methods, we considered integration against for functions that can be approximated by the kernel . An initial state was fixed at the origin and then for the state was chosen to minimise the error criterion given the location of the . This is known as ‘sequential Bayesian quadrature’ [SBQ; Huszar and Duvenaud, 2012, Gunter et al., 2014, Briol et al., 2015a]. The kernel length scale was fixed at and we consider (as a thought experiment, since it does not enter into our selection of points) a more regular integrand, such as that shown in Fig. 5 (top). The location of the states obtained in this manner are shown in Fig. 5 (bottom). It is clear that SBQ is not an efficient use of computation for integration of the integrand against . Of course, a bad choice of kernel length scale parameter can in principle be alleviated by kernel learning, but this will not be robust the case where is very small.
This example motivates samplingbased methods as an alternative to optimisationbased methods. Future work will be required to better understand when methods such as SBQ can be reliable in the presence of unknown kernel parameters, but this was beyond the scope of this work.
a.2 Additional Definitions
The space is defined to be the set of measurable functions such that the Lebesgue integral
exists and is finite.
For a multiindex define . The (standard) Sobolev space of order is denoted
This space is equipped with norm
Two normed spaces and are said to be ‘norm equivalent’ if there exists such that
for all .
a.3 Theoretical Results
a.3.1 Proof of Theorem 1
Proof.
From Thm. 11.13 in Wendland [2004] we have that there exist constants , such that
(7) 
for all , provided , where
Under the hypotheses, we can suppose that the deterministic states ensure . Then Eqn. 7 holds for all , where the are independent draws from . It follows that
Next, Lem. 1 in Oates et al. [2016] establishes that, under the present hypotheses on and , there exists such that
for all , where is independent of .
Combining the above results produces
as required, with . ∎
a.3.2 Proof of Theorem 2
Proof.
The CauchySchwarz result for kernel mean embeddings [Smola et al., 2007] gives
Consider the first term above. Since is dense in , it follows that (the unique positive selfadjoint square root of ) is an isometry from to . Now, since , there exists a unique element such that . Then we have that
For , we have if and only if
(9) 
for some , in which case is equal to the infimum of under all such representations . In particular, it follows that for the particular choice with for all .
Under the hypothesis on , Prop. 1 of Bach [2015] established that when are independent, then
with probability at least . Fixing the function in Eqn. 9 leads to the statement that
is at most with probability at least . The infimum over can be replaced with an unconstrained infimum over to obtain the weaker statement that
is at most with probability at least . Now, recall from Sec. 2.1 that the KQ weights are characterised through the solution to this optimisation problem as . It follows that
with probability at least . Combining this fact with Eqn. A.3.2 completes the proof. ∎
a.3.3 for the Example of Figure 1
In this section we consider scope to derive in closedform for the example of Fig. 1. The following will be used:
Proposition 1 (Prop. 1 in Shi et al. [2009]).
Let , and . Define and denote the th Hermite polynomial as . Then the eigenvalues and corresponding eigenfunctions of the integral operator are
and