1 Introduction
This paper considers the approximation of a probability distribution
, defined on a set , by an discrete distribution , for some representative points , where denotes a point mass located at . This quantisation task arises in many areas including numerical cubature (Karvonen, 2019), experimental design (Chaloner and Verdinelli, 1995) and Bayesian computation (Riabiz et al., 2020). To solve the quantisation task one first identifies an optimality criterion, typically a notion of discrepancy between and , and then develops an algorithm to approximately minimise it. Classical optimal quantisation picks the to minimise a Wasserstein distance between the empirical measure and the target , which leads to an elegant connection with Voronoi partitions whose centres are the (Graf and Luschgy, 2007). Several other discrepancies exist but are less wellstudied for the quantisation task. In this paper we study quantisation with the maximum mean discrepancy (MMD), as well as a specific version called the kernel Stein discrepancy (KSD), which each admit a closedform expression for a wide class of target distributions (e.g. Rustamov, 2019).Despite several interesting results, optimal quantisation with MMD remains largely unsolved. Quasi Monte Carlo (QMC) provides representative point sets that asymptotically minimise MMD (Hickernell, 1998; Dick and Pillichshammer, 2010); however, these results are typically limited to specific instances of and MMD.^{1}^{1}1In Section 2.1 we explain how MMD is parametrised by a kernel; the QMC literature typically focuses on uniform on and
dimensional tensor products of univariate kernels defined on
.The use of greedy sequential algorithms, in which the point is selected conditional on the points already chosen, has received some attention in the context of MMD—see the recent surveys in Oettershagen (2017) and Pronzato and Zhigljavsky (2018). Greedy sequential algorithms have also been proposed for KSD (Chen et al., 2018, 2019), as well as a nongreedy sequential algorithm for minimising MMD, called kernel herding (Chen et al., 2010).
Empirical studies have shown that greedy algorithms tend to outperform kernel herding (Briol et al., 2015; Chen et al., 2018), but theory for herding is better established due to its interpretation as a Frank–Wolfe algorithm (Bach et al., 2012; LacosteJulien et al., 2015). Informationtheoretic lower bounds on MMD have been derived in the literature on informationbased complexity (Novak and Woźniakowski, 2008) and in Mak et al. (2018), who studied representative points that minimise an energy distance; the relationship between energy distances and MMD is clarified in Sejdinovic et al. (2013).
The aforementioned sequential algorithms are of limited practical use due to the fact that, to select the next point , one has to search over the whole set . This is often impractical, since
will typically be an infinite set and may not have useful structure (e.g. a vector space) that can be exploited by a numerical optimisation method. Moreover, the global nonconvex optimisation problem that must be solved to find the next point
becomes increasingly difficult as more and more points are selected.This paper studies sequential minimisation of MMD over a finite candidate set, instead of over the whole of . This obviates the need to use a numerical optimisation routine, requiring only that a suitable candidate set can be produced. Such an approach was recently described using KSD in Riabiz et al. (2020), where an algorithm termed Stein thinning was proposed. The use of a discrete candidate set in the context of kernel herding was briefly discussed in LacosteJulien et al. (2015, Appx. G). Mak et al. (2018) proposed sequential selection from a discrete candidate set to approximately minimise energy distance, but theoretical analysis of this algorithm was not attempted. To the best of our knowledge, algorithms for optimal quantisation with general MMD, based on sequential selection from a discrete candidate set, have yet to be studied.
1.1 Contributions
The novel contributions of this paper are as follows:

The consistency of greedy algorithms for minimisation of MMD is established in the setting where the candidate points are sampled from the target, and a finitesamplesize error bound is provided.

Nonmyopic extensions of the greedy algorithm are proposed, in which multiple points are simultaneously selected at each step, and these are combined with minibatching of the candidate set. We show how nonmyopic algorithms can be cast as integer quadratic programmes (IQP) that can be exactly solved using standard libraries for discrete optimisation. Consistency is established and a finitesamplesize error bound is provided.

A detailed empirical assessment is presented, including a comparison with a nonsequential algorithm that can be seen as the limiting case of the nonmyopic algorithm, in which all representative points are simultaneously selected. Such nonsequential algorithms require high computational cost and so a semidefinite relaxation of the IQP is considered.
The remainder of the paper is structured thus. In Section 2 we provide background on MMD and KSD. In Section 3 our novel methods for optimal quantisation are presented. Our empirical assessment is contained in Section 4 and our theoretical assessment is contained in Section 5. The paper concludes with a discussion in Section 6.
2 Background
Let be a measurable space and let denote the set of probability distributions on . First we introduce a notion of discrepancy between two measures , and then specialise this definition to MMD (Section 2.1) and KSD (Section 2.2).
For any and set consisting of realvalued measurable functions on , we define a discrepancy to be a quantity of the form
(1) 
assuming was chosen so that all integrals in (1) exist. The set is called measuredetermining if implies , and in this case is called an integral probability metric (Müller, 1997). An example is the Wasserstein metric—induced by choosing as the set of Lipschitz functions defined on —that is used in classical quantisation (Dudley, 2018, Thm. 11.8.2). Next we describe how MMD and KSD are induced from specific choices of .
2.1 Maximum Mean Discrepancy
Consider a symmetric and positivedefinite function , which we call a kernel. A kernel reproduces a Hilbert space of functions from if (i) for all we have , and (ii) for all and we have , where denotes the inner product in .
By the Moore–Aronszajn theorem (Aronszajn, 1950), there is a onetoone mapping between the kernel and the reproducing kernel Hilbert space (RKHS) , which we make explicit by writing . A prototypical example of a kernel on is the squaredexponential kernel , where in this paper denotes the Euclidean norm and is a positive scaling constant.
Choosing the set in (1) to be the unit ball of the RKHS enables the supremum in (1) to be written in closed form and defines the MMD (Song, 2008):
(2) 
Our notation emphasises as the variable of interest, since in this paper we aim to minimise MMD over possible for a fixed kernel and a fixed target . Under suitable conditions on and it can be shown that MMD is a metric on (in which case the kernel is called characteristic); for sufficient conditions see Section 3 of Sriperumbudur et al. (2010). Furthermore, under stronger conditions on , MMD metrises the weak topology on (Sriperumbudur et al., 2010, Thms. 23, 24). This provides theoretical justification for minimisation of MMD: if then , where denotes weak convergence in .
Evaluation of MMD requires that and are either explicit or can be easily approximated (e.g. by sampling), so as to compute the integrals appearing in (2). This is the case in many applications and MMD has been widely used (Arbel et al., 2019; Briol et al., 2019a; ChériefAbdellatif and Alquier, 2020). In cases where is not explicit, such as when it arises as an intractable posterior in a Bayesian context, KSD can be a useful specialisation of MMD that circumvents integration with respect to . We describe this next.
2.2 Kernel Stein Discrepancy
While originally proposed as a means of proving distributional convergence, Stein’s method (Stein, 1972) can be used to circumvent the integration against required in (2) to calculate MMD. Suppose we have an operator defined on a set of functions such that holds for all . Choosing in (1), we would then have , an expression which no longer involves integrals with respect to . Appropriate choices for and were studied in Gorham and Mackey (2015), who termed the Stein discrepancy, and these will now be described.
Let admit a positive and continuously differentiable density on ; let and denote the gradient and divergence operators respectively; and let be a kernel that is continuously differentiable in each argument. Then take
Note that is the unit ball in the dimensional tensor product of . Under mild conditions on and (Gorham and Mackey, 2017, Prop. 1), it holds that for all . The set can then be shown (Oates et al., 2017) to be the unit ball in a different RKHS with reproducing kernel
(3) 
where subscripts are used to denote the argument upon which a differential operator acts. Since , it follows that and from (2) we arrive at the kernel Stein discrepancy (KSD)
Under stronger conditions on and it can be shown that KSD controls weak convergence to in , meaning that if then (Gorham and Mackey, 2017, Thm. 8).
The description of KSD here is limited to , but constructions also exist for discrete spaces (Yang et al., 2018) and more general Riemannian manifolds (Barp et al., 2018; Xu and Matsuda, 2020; Le et al., 2020). Extensions that use other operators (Gorham et al., 2019; Barp et al., 2019) have also been studied.
3 Methods
In this section we propose novel algorithms for minimisation of MMD over a finite candidate set. The simplest algorithm is described in Section 3.1, and from this we generalise to consider both nonmyopic selection of representative points and minibatching in Section 3.2. A discussion of nonsequential algorithms, as the limit of nonmyopic algorithms where all points are simultaneously selected, is given in Section 3.3.
3.1 A Simple Algorithm for Quantisation
In what follows we assume that we are provided with a finite candidate set from which representative points are to be selected. Ideally, these candidates should be in regions where is supported, but we defer making any assumptions on this set until the theoretical analysis in Section 5.
The simplest algorithm that we consider greedily minimises MMD over the candidate set; for each , pick^{2}^{2}2The convention is used.
to obtain, after steps, an index sequence and associated empirical distribution . Explicit formulae are contained in Algorithm 1. The computational complexity of selecting points in this manner is , provided that the integrals appearing in Algorithm 1 can be evaluated in . Note that candidate points can be selected more than once.
As explained in Section 1, aspects of Algorithm 1 have appeared in earlier work, but we are not aware of any existing theoretical analysis that applies to Algorithm 1 in general. Theorems 1, 2 and 4 in Section 5 provide novel finitesamplesize error bounds for Algorithm 1 (as a special case of Algorithm 2).
The two main shortcomings of Algorithm 1 are that (i) the myopic nature of the optimisation may be statistically inefficient, and (ii) the requirement to scan through a large candidate set during each iteration may lead to unacceptable computational cost. In Section 3.2 we propose nonmyopic and minibatch extensions of Algorithm 1 that aim to address these issues.
3.2 Generalised Sequential Algorithms
In Section 3.2.1 we describe a nonmyopic extension of Algorithm 1, where multiple points are simultaneously selected at each step. The use of nonmyopic optimisation is impractical when a large candidate set is used, and therefore we explain how minibatches from the candidate set can be employed in Section 3.2.2.
3.2.1 NonMyopic Minimisation
Now we consider the simultaneous selection of representative points from the candidate set at each step. This leads to the nonmyopic algorithm
whose output is a bivariate index , together with the associated empirical distribution . Explicit formulae are contained in Algorithm 2. The computational complexity of selecting points in this manner is , which is larger than Algorithm 1 when . Theorems 1, 2 and 4 in Section 5 provide novel finitesamplesize error bounds for Algorithm 2.
Despite its daunting computational complexity, we have found that it is practical to exactly implement Algorithm 2 for moderate values of and by casting each iteration of the algorithm as an instance of a constrained integer quadratic programme (IQP), so that stateoftheart discrete optimisation methods can be employed.
To this end, we represent the indices of the points to be selected at iteration of Algorithm 2 as a vector whose th element indicates the number of copies of that are selected, and where is constrained to satisfy . It is then an algebraic exercise to recast an optimal subset as the solution to a constrained IQP:
Remark 1.
If one further imposes the constraint for all , so that each candidate may be selected at most once, then the resulting binary quadratic programme (BQP) is equivalent to the cardinality constrained partition problem from discrete optimisation, which is known to be NPhard (Rendl, 2016). (The results we present do not impose this constraint.)
3.2.2 MiniBatching
The exact solution of (3.2.1) is practical only for moderate values of and . This motivates the idea of considering only a subset of the candidates at each iteration, a procedure we call minibatching and inspired by the similar approach used in stochastic optimisation. There are several ways that minibatching can be performed, but here we simply state that candidates denoted are considered during the th iteration, with the minibatch size denoted by . The nonmyopic algorithm for minimisation of MMD with minibatching is then
Explicit formulae are contained in Algorithm 3. The complexity of selecting points in this manner is , which is smaller than Algorithm 2 when . As with Algorithm 2, an exact IQP formulation can be employed. Theorem 3 provides a novel finitesamplesize error bound for Algorithm 3.
3.3 NonSequential Algorithms
Finally we consider the limit of the nonmyopic Algorithm 2, in which all representative points are simultaneously selected in a single step:
(4) 
The index set can again be recast as the solution to an IQP and the associated empirical measure provides, by definition, a value for that is at least as small as any of the methods so far described (thus satisfying the same error bounds derived in Theorems 1–4).
However, it is only practical to exactly solve (4) for small and thus, to arrive at a practical algorithm, we consider approximation of (4). There are at least two natural ways to do this. Firstly, one could run a numerical solver for the IQP formulation of (4) and terminate after a fixed computational limit is reached; the solver will return a feasible, but not necessarily optimal, solution to the IQP. An advantage of this approach is that no further methodological work is required. Alternatively, one could employ a convex relaxation of the intractable IQP, which introduces an approximation error that is hard to quantify but leads to a convex problem that may be exactly soluble at reasonable cost. We expand on the latter approach in Appendix B, with preliminary empirical comparisons.
4 Empirical Assessment
[width=0.36]images/fourplots_1 [width=0.36]images/fourplots_2
[width=0.36]images/fourplots_3 [width=0.36]images/fourplots_4
Quantisation of a Gaussian mixture model using MMD. A candidate set of 1000 independent samples (top left), from which 12 representative points were selected using: the myopic method (top right); nonmyopic selection, picking 4 points at a time (bottom left), and by simultaneous selection of all 12 points (bottom right). Simulations were conducted using Algorithms 1 and 2, with a Gaussian kernel whose lengthscale was
.This section presents an empirical assessment^{3}^{3}3Our code is written in Python and is available at [blinded]. All optimisation is performed using Gurobi, via the interface GurobiPy. of Algorithms 1–3. Two regimes are considered, corresponding to high compression (small ; Section 4.1) and low compression (large ; Section 4.2) of the target. These occur, respectively, in applications to Bayesian cubature and thinning of Markov chain output. For details of the kernels used and a sensitivity analysis for the kernel parameters, see Appendix C.
Figure 1 illustrates how a nonmyopic algorithm may outperform a myopic one. Here a candidate set was constructed using independent samples from the target , and 12 representatives were selected from this set using the myopic (Algorithm 1 with ) and nonmyopic (Algorithm 2 with and ) methods, and by nonsequential selection of all 12 representatives simultaneously (Algorithm 2 with and ).
Having selected the first three samples close to the three modes, the myopic method subsequently selects points that temporarily worsen the overall approximation; note in particular the placement of the fourth point. The nonmyopic methods do not suffer to the same extent: choosing 4 points simultaneously gives better approximations relative to the myopic approach after each of 4, 8 and 12 samples have been chosen; was chosen deliberately so as to be coprime to the number of mixture components, 3. Choosing all 12 points at once gives an even better approximation of the true distribution.
[width=0.35height=0.34]images/twentygaussian.png [width=0.4height=0.38]images/MMD_batch_size.png
4.1 Bayesian Cubature
Larkin (1972) and subsequent authors proposed to cast numerical cubature in the Bayesian framework, such that an integrand is a priori modelled as a Gaussian process with covariance , then conditioned on data
where the integrand is evaluated. The posterior standard deviation is
(Briol et al., 2019b)(5) 
The selection of to minimise (5) is impractical since evaluation of (5) has a complexity of . Huszár and Duvenaud (2012) and Briol et al. (2015) noted that (5) can be bounded above by fixing , and that quantisation methods provide a practical means to minimise this bound. All convergence results derived in Section 5 can therefore be applied and, moreover, the bound is expected to be quite tight.^{4}^{4}4 suggests that the point was not optimally placed. Thus for optimal we anticipate . Earlier authors studied herding; here we study nonmyopic^{5}^{5}5Nonmyopic algorithms have been proposed for Bayesian cubature in Jiang et al. (2019) and Jarvenpaa et al. (2020), but the corresponding theory was not developed. Moreover, our IQP approach is novel in this context. greedy algorithms based on independent samples.^{6}^{6}6That independent samples are sufficient to obtain optimal convergence rates for Bayesian cubature was shown in Corollary 1 of Ehler et al. (2019).
Bayesian cubature is most often used when evaluation of is associated with a high computational cost and one is prepared to expend resources in the optimisation of the point set . Our focus in this section is therefore on the quality of the point set obtained, irrespective of computational cost. (The computational cost will, however, be assessed in Section 4.2.)
Figure 1 suggests that the approximation quality of nonmyopic approaches is dependent on the number of simultaneously selected points . Figure 2 compares the selection of 60 from 1000 independently sampled points in a model consisting of a mixture of 20 Gaussians, while varying . More myopic selections are seen to be outperformed by less myopic ones for a given quantity of selected samples. The simultaneous selection of samples explains the presentation of the plot as step functions. Note in particular that MMD of the myopic method () is observed to decrease nonmonotonically. This is a manifestation of the phenomenon also seen in Figure 1, where a particular sample choice may temporarily worsen the quality of the overall approximation.
Next we consider applications in Bayesian statistics, where both approximation quality and computation time are important. In contrast to
Section 4.1, in what follows the density will be available only up to an unknown normalisation constant and thus KSD—which requires only that can be evaluated—will be used.4.2 Thinning of Markov Chain Output
The use of quantisation to ‘thin’ Markov chain output was proposed in Riabiz et al. (2020), where the authors studied greedy myopic algorithms based on KSD. In this section we revisit an application from that work to determine whether our proposed nonmyopic and minibatching algorithms offer a performance improvement. Unlike Section 4.1, the computational cost of our algorithms must now be assessed, since their runtime may be comparable to the time required to produce Markov chain output itself.
The datasets we use consist of (1) samples from a 38parameter intracellular calcium signalling model introduced by Hinch et al. (2004), and (2) samples from a 4parameter LotkaVolterra predatorprey model. Both were previously considered in Riabiz et al. (2020). The MCMC chains are both highly autocorrelated and start far from any mode. The greedy KSD approach used by Riabiz et al. (2020) was found to slow down dramatically after selecting the first samples, due to the requirement to compute the kernel on all pairs of values involving the selected points and all points in the candidate set at each iteration. We employ minibatching () to ameliorate this, and investigate the effectiveness of nonmyopic selection () in that context.
[width=0.37height=0.33]images/cardiac_comparison [width=0.37height=0.33]images/cardiac_comparison_2
[width=0.37height=0.33]images/lv_comparison [width=0.37height=0.33]images/lv_comparison_2
time vs. number of selected samples, shown for the 38dim calcium signalling model (top two panes) and 4dim LotkaVolterra model (bottom two panes). The kernel lengthscale in each case was set using the median heuristic
(Garreau et al., 2017), and estimated in practice using a uniform subsample of 1000 points for each model. The myopic algorithm of
Riabiz et al. (2020) is included in the LotkaVolterra plots—see main text for details.Figure 3 plots KSD against time, with the number of collected samples fixed at 1000, as well as timeadjusted KSD against for both models. This acknowledges that both approximation quality and computational time are important. In both cases, larger minibatches were able to perform better provided that the number of states selected was large enough to realise their potential.
Nonmyopic selection shows a significant improvement over batchmyopic in the LotkaVolterra models, and a less significant (though still visible) improvement in the calcium signalling model. The practical upper limit on for nonmyopic methods (due to the requirement to optimise over all points) may make performance for larger poorer relatively; in a larger and more complex model, there may be fewer than ‘good’ samples to choose from given moderate . This suggests that control of the ratio may be important for best performance; the best results that we observed occurred when .
A comparison to the original myopic algorithm (i.e. ) of Riabiz et al. (2020) is presented for the Lotka–Volterra model. This is implemented using the same code and on the same machine as the batch and nonmyopic simulations. The cyan line shown in the bottom two panes of Figure 3 represents only 50 points not 1000; collecting just these took 42 minutes. This algorithm is slower still for the calcium signalling model, so we omit the comparison.
5 Theoretical Assessment
This section presents a theoretical assessment of Algorithms 1–3. Once stated, a standing assumption is understood to hold for the remainder of the main text.
Standing Assumption 1.
Let be a measurable space equipped with a probability measure . Let be symmetric positive definite and satisfy .
For the kernels described in Section 2.2, and the assumption is trivially satisfied. Our first result is a finitesamplesize error bound for nonmyopic algorithms when the candidate set is fixed:
Theorem 1.
Let be fixed and let . Consider an index sequence of length and with selection size produced by Algorithm 2. Then for all it holds that
with an independent constant.
The proof is provided in Section A.1. Aside from providing an explicit error bound, we see that the output of Algorithm 2 converges in MMD to the optimal (weighted) quantisation of that is achievable using the candidate point set. Interestingly, all bounds we present are independent of .
Remark 2.
Our remaining results explore the cases where the candidate points are randomly sampled. Independent and dependent sampling is considered and, in each case, the following moment bound will be assumed:
Standing Assumption 2.
For some , , where the expectation is taken over .
In the first randomised setting, the are independently sampled from , as would typically be possible when is explicit:
Theorem 2.
Let be independently sampled from . Consider an index sequence of length produced by Algorithm 2. Then for all and all it holds that
The proof is provided in Section A.2. It is seen that must be asymptotically increased with in order for the approximation provided by Algorithm 2 to be consistent. No smoothness assumptions were placed on ; such assumptions can be used to improve the term (as in Thm. 1 of Ehler et al., 2019), but we did not consider this useful since the term is the bottleneck in the bound.
An analogous, but more technically involved, argument leads to a finitsamplesize error bound when minibatches are used:
Theorem 3.
Let each minibatch be independently sampled from . Consider an index sequence of length produced by Algorithm 3. Then
The proof is provided in Section A.3. The minibatch size plays an analogous role to in Theorem 2 and must be asymptotically increased with in order for Algorithm 3 to be consistent.
In our second randomised setting the candidate set arises as a Markov chain sample path. Let be a function and, for a function and a measure on , let
A irreducible and aperiodic Markov chain with step transition kernel is uniformly ergodic if and only if there exists and such that
(6) 
for all initial states and all (see Thm. 16.0.1 of Meyn and Tweedie, 2012).
Theorem 4.
Assume that for all . Consider a invariant, timehomogeneous, reversible Markov chain generated using a uniformly ergodic transition kernel, such that (6) is satisfied with for all . Suppose that . Consider an index sequence of length and selection subset size produced by Algorithm 2. Then, with , we have that
The proof is provided in Section A.4. Analysis of minibatching in the dependent sampling context appears to be more challenging and was not attempted.
6 Discussion
This paper focused on quantisation using MMD, proposing and analysing novel algorithms for this task, but other integral probability metrics could be considered. More generally, if one is interested in compression by means other than quantisation then other approaches may be useful, such as Gaussian mixture models and related approaches from the literature on density estimation (Silverman, 1986).
Some avenues for further research include: (a) extending symmetric structure in the target to the set of representative points (Karvonen et al., 2019); (b) characterising an optimal sampling distribution from which elements of the candidate set can be obtained (Bach, 2017)
; (c) further applications of our method, for example to Bayesian neural networks, where quantisation of the posterior provides a promising route to reduce the cost of predicting each label in the test dataset. •
Acknowledgements:
This work was supported by the Lloyd’s Register Foundation programme on datacentric engineering at the Alan Turing Institute, UK. MR was supported by the British Heart FoundationAlan Turing Institute cardiovascular datascience award (BHF; SP/18/6/33805).
References
 Arbel et al. (2019) Arbel, M., Korba, A., Salim, A., and Gretton, A. (2019). Maximum mean discrepancy gradient flow. NeurIPS 32, pages 6484–6494.
 Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Am. Math. Soc., 68(3):337–404.
 Bach (2017) Bach, F. (2017). On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Res., 18(1):714–751.
 Bach et al. (2012) Bach, F., LacosteJulien, S., and Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. ICML 29.
 Barp et al. (2019) Barp, A., Briol, F.X., Duncan, A., Girolami, M., and Mackey, L. (2019). Minimum Stein discrepancy estimators. NeurIPS 32, pages 12964–12976.
 Barp et al. (2018) Barp, A., Oates, C., Porcu, E., Girolami, M., et al. (2018). A RiemannianStein kernel method. arXiv:1810.04946.
 Briol et al. (2019a) Briol, F.X., Barp, A., Duncan, A. B., and Girolami, M. (2019a). Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944.
 Briol et al. (2015) Briol, F.X., Oates, C., Girolami, M., and Osborne, M. A. (2015). FrankWolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. NeurIPS 28, pages 1162–1170.
 Briol et al. (2019b) Briol, F.X., Oates, C. J., Girolami, M., Osborne, M. A., Sejdinovic, D., et al. (2019b). Probabilistic integration: A role in statistical computation? Stat. Sci., 34(1):1–22.
 Chaloner and Verdinelli (1995) Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: a review. Stat. Sci., 10(3):273–304.
 Chen et al. (2019) Chen, W. Y., Barp, A., Briol, F.X., Gorham, J., Girolami, M., Mackey, L., and Oates, C. J. (2019). Stein point Markov chain Monte Carlo. ICML, 36.
 Chen et al. (2018) Chen, W. Y., Mackey, L., Gorham, J., Briol, F.X., and Oates, C. J. (2018). Stein points. ICML, 35.
 Chen et al. (2010) Chen, Y., Welling, M., and Smola, A. (2010). Supersamples from kernel herding. UAI, 10.
 ChériefAbdellatif and Alquier (2020) ChériefAbdellatif, B.E. and Alquier, P. (2020). MMDBayes: robust Bayesian estimation via maximum mean discrepancy. Proc. Mach. Learn. Res., 18:1–21.
 Dick and Pillichshammer (2010) Dick, J. and Pillichshammer, F. (2010). Digital nets and sequences: discrepancy theory and quasi–Monte Carlo integration. Cambridge University Press.
 Dudley (2018) Dudley, R. M. (2018). Real analysis and probability. CRC Press.
 Ehler et al. (2019) Ehler, M., Gräf, M., and Oates, C. J. (2019). Optimal Monte Carlo integration on closed manifolds. Stat. Comput., 29(6):1203–1214.
 Garreau et al. (2017) Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv:1707.07269.
 Goemans and Williamson (1995) Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145.
 Gorham et al. (2019) Gorham, J., Duncan, A., Mackey, L., and Vollmer, S. (2019). Measuring sample quality with diffusions. Ann. Appl. Probab., 29(5):2884–2928.
 Gorham and Mackey (2015) Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein’s method. NeurIPS 28, pages 226–234.
 Gorham and Mackey (2017) Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. ICML 34, 70:1292–1301.
 Graf and Luschgy (2007) Graf, S. and Luschgy, H. (2007). Foundations of quantization for probability distributions. Springer.
 Hickernell (1998) Hickernell, F. (1998). A generalized discrepancy and quadrature error bound. Math. Comput, 67(221):299–322.
 Hinch et al. (2004) Hinch, R., Greenstein, J., Tanskanen, A., Xu, L., and Winslow, R. (2004). A simplified local control model of calciuminduced calcium release in cardiac ventricular myocytes. Biophys. J, 87(6):3723–3736.
 Huszár and Duvenaud (2012) Huszár, F. and Duvenaud, D. (2012). Optimallyweighted herding is Bayesian quadrature. UAI, 12.
 Jarvenpaa et al. (2020) Jarvenpaa, M., Vehtari, A., and Marttinen, P. (2020). Batch simulations and uncertainty quantification in Gaussian process surrogate approximate Bayesian computation. Proc. Mach. Learn. Res., 124:779–788.
 Jiang et al. (2019) Jiang, S., Chai, H., Gonzalez, J., and Garnett, R. (2019). BINOCULARS for efficient, nonmyopic sequential experimental design. ICML 37.
 Karvonen (2019) Karvonen, T. (2019). Kernelbased and Bayesian methods for numerical integration. PhD Thesis, Aalto University.
 Karvonen et al. (2019) Karvonen, T., Särkkä, S., and Oates, C. (2019). Symmetry exploits for Bayesian cubature methods. Stat. Comput., 29:1231–1248.
 LacosteJulien et al. (2015) LacosteJulien, S., Lindsten, F., and Bach, F. (2015). Sequential kernel herding: FrankWolfe optimization for particle filtering. Proc. Mach. Learn. Res., 38:544–552.
 Larkin (1972) Larkin, F. M. (1972). Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mt. J. Math., 3:379–421.
 Le et al. (2020) Le, H., Lewis, A., Bharath, K., and Fallaize, C. (2020). A diffusion approach to Stein’s method on Riemannian manifolds. arXiv:2003.11497.
 Mak et al. (2018) Mak, S., Joseph, V. R., et al. (2018). Support points. Ann. Stat., 46(6A):2562–2592.
 Meyn and Tweedie (2012) Meyn, S. P. and Tweedie, R. L. (2012). Markov chains and stochastic stability. Springer.
 Müller (1997) Müller, A. (1997). Integral probability metrics and their generating classes of functions. Adv. App. Prob., 29(2):429–443.
 Novak and Woźniakowski (2008) Novak, E. and Woźniakowski, H. (2008). Tractability of Multivariate Problems: Standard information for functionals. European Mathematical Society.
 Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. J. R. Statist. Soc. B, 3(79):695–718.
 Oettershagen (2017) Oettershagen, J. (2017). Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. PhD thesis, University of Bonn.
 Pronzato and Zhigljavsky (2018) Pronzato, L. and Zhigljavsky, A. (2018). Bayesian quadrature, energy minimization, and spacefilling design. SIAMASA J. Uncert. Quant., 8(3):959–1011.
 Rendl (2016) Rendl, F. (2016). Semidefinite relaxations for partitioning, assignment and ordering problems. Ann. Oper. Res., 240:119–140.
 Riabiz et al. (2020) Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., and Oates, C. (2020). Optimal thinning of MCMC output. arXiv:2005.03952.
 Rustamov (2019) Rustamov, R. M. (2019). Closedform expressions for maximum mean discrepancy with applications to Wasserstein autoencoders. arXiv:1901.03227.
 Sejdinovic et al. (2013) Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distancebased and RKHSbased statistics in hypothesis testing. Ann. Stat., 41(5):2263–2291.
 Silverman (1986) Silverman, B. W. (1986). Density estimation for statistics and data analysis. CRC Press.
 Song (2008) Song, L. (2008). Learning via Hilbert space embedding of distributions. PhD thesis, University of Sydney.
 Sriperumbudur et al. (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561.

Stein (1972)
Stein, C. (1972).
A bound for the error in the normal approximation to the distribution of a sum of dependent random variables.
In Proc. Sixth Berkeley Symp. on Math. Statist. and Prob., Vol. 2, pages 583–602.  Xu and Matsuda (2020) Xu, W. and Matsuda, T. (2020). A Stein goodnessoffit test for directional distributions. Proc. Mach. Learn Res., 108:320–330.
 Yang et al. (2018) Yang, J., Liu, Q., Rao, V., and Neville, J. (2018). Goodnessoffit testing for discrete distributions via Stein discrepancy. Proc. Mach. Learn. Res., 80:5561–5570.
Supplemental Material
This supplement is structured as follows: In Appendix A we present proofs for all novel theoretical results stated in Section 5 of the main text. In Appendices B and C we provide additional experimental results to support the discussion in Section 4 of the main text.
Appendix A Proof of Theoretical Results
In what follows we let denote the reproducing kernel Hilbert space reproduced by the kernel and let denote the induced norm in .
a.1 Proof of Theorem 1
To start the proof, define
and note immediately that . Then we can write a recursive relation
We will first derive an upper bound for , then one for .
Bounding :
Noting that the algorithm chooses the that minimises
we therefore have that
(7)  
(8)  
(9)  
(10) 
In (7) we used the reproducing property, while in (8) we used the Cauchy–Schwarz inequality and in (9) we used Jensen’s inequality. To bound the third term in (10), we write
Define as the convex hull in of . Since the extreme points of correspond to the vertices we have that
Then we have, for any ,
This linear combination is clearly minimised by taking each of the equal to a candidate point that minimises , and taking the corresponding , and all other . Now consider an element for which the weights minimise subject to and . Clearly . Thus
Combining this with (10) provides an overall bound on .
Bounding :
To upper bound we can in fact just use an equality;
where .
Bound on the Iterates:
Combining our bounds on and , we obtain
Comments
There are no comments yet.