# Optimal quantisation of probability measures using maximum mean discrepancy

Several researchers have proposed minimisation of maximum mean discrepancy (MMD) as a method to quantise probability measures, i.e., to approximate a target distribution by a representative point set. We consider sequential algorithms that greedily minimise MMD over a discrete candidate set. We propose a novel non-myopic algorithm and, in order to both improve statistical efficiency and reduce computational cost, we investigate a variant that applies this technique to a mini-batch of the candidate set at each iteration. When the candidate points are sampled from the target, the consistency of these new algorithm - and their mini-batch variants - is established. We demonstrate the algorithms on a range of important computational problems, including optimisation of nodes in Bayesian cubature and the thinning of Markov chain output.

## Authors

• 7 publications
• 8 publications
• 6 publications
• 37 publications
05/09/2019

### Stein Point Markov Chain Monte Carlo

An important task in machine learning and statistics is the approximatio...
02/23/2020

### Improve SGD Training via Aligning Mini-batches

Deep neural networks (DNNs) for supervised learning can be viewed as a p...
02/09/2016

### Nested Mini-Batch K-Means

A new algorithm is proposed which accelerates the mini-batch k-means alg...
01/13/2022

### Recursive Least Squares Policy Control with Echo State Network

The echo state network (ESN) is a special type of recurrent neural netwo...
02/23/2020

### Improve SGD Training via Aligning Min-batches

Deep neural networks (DNNs) for supervised learning can be viewed as a p...
08/22/2021

### An Efficient Mini-batch Method via Partial Transportation

Mini-batch optimal transport (m-OT) has been widely used recently to dea...
10/27/2015

### Statistically efficient thinning of a Markov chain sampler

It is common to subsample Markov chain output to reduce the storage burd...
##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

This paper considers the approximation of a probability distribution

, defined on a set , by an discrete distribution , for some representative points , where denotes a point mass located at . This quantisation task arises in many areas including numerical cubature (Karvonen, 2019), experimental design (Chaloner and Verdinelli, 1995) and Bayesian computation (Riabiz et al., 2020). To solve the quantisation task one first identifies an optimality criterion, typically a notion of discrepancy between and , and then develops an algorithm to approximately minimise it. Classical optimal quantisation picks the to minimise a Wasserstein distance between the empirical measure and the target , which leads to an elegant connection with Voronoi partitions whose centres are the (Graf and Luschgy, 2007). Several other discrepancies exist but are less well-studied for the quantisation task. In this paper we study quantisation with the maximum mean discrepancy (MMD), as well as a specific version called the kernel Stein discrepancy (KSD), which each admit a closed-form expression for a wide class of target distributions (e.g. Rustamov, 2019).

Despite several interesting results, optimal quantisation with MMD remains largely unsolved. Quasi Monte Carlo (QMC) provides representative point sets that asymptotically minimise MMD (Hickernell, 1998; Dick and Pillichshammer, 2010); however, these results are typically limited to specific instances of and MMD.111In Section 2.1 we explain how MMD is parametrised by a kernel; the QMC literature typically focuses on uniform on and

-dimensional tensor products of univariate kernels defined on

.

The use of greedy sequential algorithms, in which the point is selected conditional on the points already chosen, has received some attention in the context of MMD—see the recent surveys in Oettershagen (2017) and Pronzato and Zhigljavsky (2018). Greedy sequential algorithms have also been proposed for KSD (Chen et al., 2018, 2019), as well as a non-greedy sequential algorithm for minimising MMD, called kernel herding (Chen et al., 2010).

Empirical studies have shown that greedy algorithms tend to out-perform kernel herding (Briol et al., 2015; Chen et al., 2018), but theory for herding is better established due to its interpretation as a Frank–Wolfe algorithm (Bach et al., 2012; Lacoste-Julien et al., 2015). Information-theoretic lower bounds on MMD have been derived in the literature on information-based complexity (Novak and Woźniakowski, 2008) and in Mak et al. (2018), who studied representative points that minimise an energy distance; the relationship between energy distances and MMD is clarified in Sejdinovic et al. (2013).

The aforementioned sequential algorithms are of limited practical use due to the fact that, to select the next point , one has to search over the whole set . This is often impractical, since

will typically be an infinite set and may not have useful structure (e.g. a vector space) that can be exploited by a numerical optimisation method. Moreover, the global non-convex optimisation problem that must be solved to find the next point

becomes increasingly difficult as more and more points are selected.

This paper studies sequential minimisation of MMD over a finite candidate set, instead of over the whole of . This obviates the need to use a numerical optimisation routine, requiring only that a suitable candidate set can be produced. Such an approach was recently described using KSD in Riabiz et al. (2020), where an algorithm termed Stein thinning was proposed. The use of a discrete candidate set in the context of kernel herding was briefly discussed in Lacoste-Julien et al. (2015, Appx. G). Mak et al. (2018) proposed sequential selection from a discrete candidate set to approximately minimise energy distance, but theoretical analysis of this algorithm was not attempted. To the best of our knowledge, algorithms for optimal quantisation with general MMD, based on sequential selection from a discrete candidate set, have yet to be studied.

### 1.1 Contributions

The novel contributions of this paper are as follows:

• The consistency of greedy algorithms for minimisation of MMD is established in the setting where the candidate points are sampled from the target, and a finite-sample-size error bound is provided.

• Non-myopic extensions of the greedy algorithm are proposed, in which multiple points are simultaneously selected at each step, and these are combined with mini-batching of the candidate set. We show how non-myopic algorithms can be cast as integer quadratic programmes (IQP) that can be exactly solved using standard libraries for discrete optimisation. Consistency is established and a finite-sample-size error bound is provided.

• A detailed empirical assessment is presented, including a comparison with a non-sequential algorithm that can be seen as the limiting case of the non-myopic algorithm, in which all representative points are simultaneously selected. Such non-sequential algorithms require high computational cost and so a semi-definite relaxation of the IQP is considered.

The remainder of the paper is structured thus. In Section 2 we provide background on MMD and KSD. In Section 3 our novel methods for optimal quantisation are presented. Our empirical assessment is contained in Section 4 and our theoretical assessment is contained in Section 5. The paper concludes with a discussion in Section 6.

## 2 Background

Let be a measurable space and let denote the set of probability distributions on . First we introduce a notion of discrepancy between two measures , and then specialise this definition to MMD (Section 2.1) and KSD (Section 2.2).

For any and set consisting of real-valued measurable functions on , we define a discrepancy to be a quantity of the form

 DF(μ,ν)=supf∈F∣∣∫fdμ−∫fdν∣∣, (1)

assuming was chosen so that all integrals in (1) exist. The set is called measure-determining if implies , and in this case is called an integral probability metric (Müller, 1997). An example is the Wasserstein metric—induced by choosing as the set of -Lipschitz functions defined on —that is used in classical quantisation (Dudley, 2018, Thm. 11.8.2). Next we describe how MMD and KSD are induced from specific choices of .

### 2.1 Maximum Mean Discrepancy

Consider a symmetric and positive-definite function , which we call a kernel. A kernel reproduces a Hilbert space of functions from if (i) for all we have , and (ii) for all and we have , where denotes the inner product in .

By the Moore–Aronszajn theorem (Aronszajn, 1950), there is a one-to-one mapping between the kernel and the reproducing kernel Hilbert space (RKHS) , which we make explicit by writing . A prototypical example of a kernel on is the squared-exponential kernel , where in this paper denotes the Euclidean norm and is a positive scaling constant.

Choosing the set in (1) to be the unit ball of the RKHS enables the supremum in (1) to be written in closed form and defines the MMD (Song, 2008):

 MMDμ,k(ν)2 :=DB(k)(μ,ν) =∬k(x,y)dν(x)dν(y)−2∬k(x,y)dν(x)dμ(y)+∬k(x,y)dμ(x)dμ(y) (2)

Our notation emphasises as the variable of interest, since in this paper we aim to minimise MMD over possible for a fixed kernel and a fixed target . Under suitable conditions on and it can be shown that MMD is a metric on (in which case the kernel is called characteristic); for sufficient conditions see Section 3 of Sriperumbudur et al. (2010). Furthermore, under stronger conditions on , MMD metrises the weak topology on (Sriperumbudur et al., 2010, Thms. 23, 24). This provides theoretical justification for minimisation of MMD: if then , where denotes weak convergence in .

Evaluation of MMD requires that and are either explicit or can be easily approximated (e.g. by sampling), so as to compute the integrals appearing in (2). This is the case in many applications and MMD has been widely used (Arbel et al., 2019; Briol et al., 2019a; Chérief-Abdellatif and Alquier, 2020). In cases where is not explicit, such as when it arises as an intractable posterior in a Bayesian context, KSD can be a useful specialisation of MMD that circumvents integration with respect to . We describe this next.

### 2.2 Kernel Stein Discrepancy

While originally proposed as a means of proving distributional convergence, Stein’s method (Stein, 1972) can be used to circumvent the integration against required in (2) to calculate MMD. Suppose we have an operator defined on a set of functions such that holds for all . Choosing in (1), we would then have , an expression which no longer involves integrals with respect to . Appropriate choices for and were studied in Gorham and Mackey (2015), who termed the Stein discrepancy, and these will now be described.

Let admit a positive and continuously differentiable density on ; let and denote the gradient and divergence operators respectively; and let be a kernel that is continuously differentiable in each argument. Then take

 Aμg :=∇⋅g+uμ⋅g, uμ :=∇logpμ, G :={g:Rd→Rd:d∑i=1⟨gi,gi⟩H(k)≤1}.

Note that is the unit ball in the -dimensional tensor product of . Under mild conditions on and (Gorham and Mackey, 2017, Prop. 1), it holds that for all . The set can then be shown (Oates et al., 2017) to be the unit ball in a different RKHS with reproducing kernel

 kμ(x,y):=∇x⋅∇yk(x,y)+∇xk(x,y)⋅uμ(y)+∇yk(x,y)⋅uμ(x)+k(x,y)uμ(x)⋅uμ(y), (3)

where subscripts are used to denote the argument upon which a differential operator acts. Since , it follows that and from (2) we arrive at the kernel Stein discrepancy (KSD)

 MMDμ,kμ(ν)2=∬kμ(x,y)dν(x)dν(y).

Under stronger conditions on and it can be shown that KSD controls weak convergence to in , meaning that if then (Gorham and Mackey, 2017, Thm. 8).

The description of KSD here is limited to , but constructions also exist for discrete spaces (Yang et al., 2018) and more general Riemannian manifolds (Barp et al., 2018; Xu and Matsuda, 2020; Le et al., 2020). Extensions that use other operators (Gorham et al., 2019; Barp et al., 2019) have also been studied.

## 3 Methods

In this section we propose novel algorithms for minimisation of MMD over a finite candidate set. The simplest algorithm is described in Section 3.1, and from this we generalise to consider both non-myopic selection of representative points and mini-batching in Section 3.2. A discussion of non-sequential algorithms, as the limit of non-myopic algorithms where all points are simultaneously selected, is given in Section 3.3.

### 3.1 A Simple Algorithm for Quantisation

In what follows we assume that we are provided with a finite candidate set from which representative points are to be selected. Ideally, these candidates should be in regions where is supported, but we defer making any assumptions on this set until the theoretical analysis in Section 5.

The simplest algorithm that we consider greedily minimises MMD over the candidate set; for each , pick222The convention is used.

 π(i)∈argminj∈{1,…,n} MMDμ,k(1ii−1∑i′=1δ(xπ(i′))+1iδ(xj)),

to obtain, after steps, an index sequence and associated empirical distribution . Explicit formulae are contained in Algorithm 1. The computational complexity of selecting points in this manner is , provided that the integrals appearing in Algorithm 1 can be evaluated in . Note that candidate points can be selected more than once.

As explained in Section 1, aspects of Algorithm 1 have appeared in earlier work, but we are not aware of any existing theoretical analysis that applies to Algorithm 1 in general. Theorems 1, 2 and 4 in Section 5 provide novel finite-sample-size error bounds for Algorithm 1 (as a special case of Algorithm 2).

The two main shortcomings of Algorithm 1 are that (i) the myopic nature of the optimisation may be statistically inefficient, and (ii) the requirement to scan through a large candidate set during each iteration may lead to unacceptable computational cost. In Section 3.2 we propose non-myopic and mini-batch extensions of Algorithm 1 that aim to address these issues.

### 3.2 Generalised Sequential Algorithms

In Section 3.2.1 we describe a non-myopic extension of Algorithm 1, where multiple points are simultaneously selected at each step. The use of non-myopic optimisation is impractical when a large candidate set is used, and therefore we explain how mini-batches from the candidate set can be employed in Section 3.2.2.

#### 3.2.1 Non-Myopic Minimisation

Now we consider the simultaneous selection of representative points from the candidate set at each step. This leads to the non-myopic algorithm

 π(i,⋅)∈argminS∈{1,…,n}sMMDμ,k(1isi−1∑i′=1s∑j=1δ(xπ(i′,j))+1is∑j∈Sδ(xj)),

whose output is a bivariate index , together with the associated empirical distribution . Explicit formulae are contained in Algorithm 2. The computational complexity of selecting points in this manner is , which is larger than Algorithm 1 when . Theorems 1, 2 and 4 in Section 5 provide novel finite-sample-size error bounds for Algorithm 2.

Despite its daunting computational complexity, we have found that it is practical to exactly implement Algorithm 2 for moderate values of and by casting each iteration of the algorithm as an instance of a constrained integer quadratic programme (IQP), so that state-of-the-art discrete optimisation methods can be employed.

To this end, we represent the indices of the points to be selected at iteration of Algorithm 2 as a vector whose th element indicates the number of copies of that are selected, and where is constrained to satisfy . It is then an algebraic exercise to recast an optimal subset as the solution to a constrained IQP:

 argminv∈Ns0 12v⊤Kv+ci⊤vs.t.1⊤v=sKj,j′:=k(xj,xj′),    1j:=1 for j=1,…,n,cij:=i−1∑i′=1s∑j′=1k(xπ(i′,j′),xj)−is∫k(x,xj)dμ(x)
###### Remark 1.

If one further imposes the constraint for all , so that each candidate may be selected at most once, then the resulting binary quadratic programme (BQP) is equivalent to the cardinality constrained -partition problem from discrete optimisation, which is known to be NP-hard (Rendl, 2016). (The results we present do not impose this constraint.)

#### 3.2.2 Mini-Batching

The exact solution of (3.2.1) is practical only for moderate values of and . This motivates the idea of considering only a subset of the candidates at each iteration, a procedure we call mini-batching and inspired by the similar approach used in stochastic optimisation. There are several ways that mini-batching can be performed, but here we simply state that candidates denoted are considered during the th iteration, with the mini-batch size denoted by . The non-myopic algorithm for minimisation of MMD with mini-batching is then

 π(i,⋅) ∈argminS∈{1,…,b}sMMDμ,k(1isi−1∑i′=1s∑j=1δ(xi′π(i′,j))+1is∑j∈Sδ(xij))

Explicit formulae are contained in Algorithm 3. The complexity of selecting points in this manner is , which is smaller than Algorithm 2 when . As with Algorithm 2, an exact IQP formulation can be employed. Theorem 3 provides a novel finite-sample-size error bound for Algorithm 3.

### 3.3 Non-Sequential Algorithms

Finally we consider the limit of the non-myopic Algorithm 2, in which all representative points are simultaneously selected in a single step:

 π∈argminS∈{1,…,n}m% MMDμ,k(1m∑i∈Sδ(xπ(i))) (4)

The index set can again be recast as the solution to an IQP and the associated empirical measure provides, by definition, a value for that is at least as small as any of the methods so far described (thus satisfying the same error bounds derived in Theorems 14).

However, it is only practical to exactly solve (4) for small and thus, to arrive at a practical algorithm, we consider approximation of (4). There are at least two natural ways to do this. Firstly, one could run a numerical solver for the IQP formulation of (4) and terminate after a fixed computational limit is reached; the solver will return a feasible, but not necessarily optimal, solution to the IQP. An advantage of this approach is that no further methodological work is required. Alternatively, one could employ a convex relaxation of the intractable IQP, which introduces an approximation error that is hard to quantify but leads to a convex problem that may be exactly soluble at reasonable cost. We expand on the latter approach in Appendix B, with preliminary empirical comparisons.

## 4 Empirical Assessment

This section presents an empirical assessment333Our code is written in Python and is available at [blinded]. All optimisation is performed using Gurobi, via the interface GurobiPy. of Algorithms 13. Two regimes are considered, corresponding to high compression (small ; Section 4.1) and low compression (large ; Section 4.2) of the target. These occur, respectively, in applications to Bayesian cubature and thinning of Markov chain output. For details of the kernels used and a sensitivity analysis for the kernel parameters, see Appendix C.

Figure 1 illustrates how a non-myopic algorithm may outperform a myopic one. Here a candidate set was constructed using independent samples from the target , and 12 representatives were selected from this set using the myopic (Algorithm 1 with ) and non-myopic (Algorithm 2 with and ) methods, and by non-sequential selection of all 12 representatives simultaneously (Algorithm 2 with and ).

Having selected the first three samples close to the three modes, the myopic method subsequently selects points that temporarily worsen the overall approximation; note in particular the placement of the fourth point. The non-myopic methods do not suffer to the same extent: choosing 4 points simultaneously gives better approximations relative to the myopic approach after each of 4, 8 and 12 samples have been chosen; was chosen deliberately so as to be co-prime to the number of mixture components, 3. Choosing all 12 points at once gives an even better approximation of the true distribution.

### 4.1 Bayesian Cubature

Larkin (1972) and subsequent authors proposed to cast numerical cubature in the Bayesian framework, such that an integrand is a priori modelled as a Gaussian process with covariance , then conditioned on data

where the integrand is evaluated. The posterior standard deviation is

(Briol et al., 2019b)

 Std[∫fdμ|D]=minw1,…,wm∈RMMDμ,k(m∑i=1wiδ(xi)). (5)

The selection of to minimise (5) is impractical since evaluation of (5) has a complexity of . Huszár and Duvenaud (2012) and Briol et al. (2015) noted that (5) can be bounded above by fixing , and that quantisation methods provide a practical means to minimise this bound. All convergence results derived in Section 5 can therefore be applied and, moreover, the bound is expected to be quite tight.444 suggests that the point was not optimally placed. Thus for optimal we anticipate . Earlier authors studied herding; here we study non-myopic555Non-myopic algorithms have been proposed for Bayesian cubature in Jiang et al. (2019) and Jarvenpaa et al. (2020), but the corresponding theory was not developed. Moreover, our IQP approach is novel in this context. greedy algorithms based on independent samples.666That independent samples are sufficient to obtain optimal convergence rates for Bayesian cubature was shown in Corollary 1 of Ehler et al. (2019).

Bayesian cubature is most often used when evaluation of is associated with a high computational cost and one is prepared to expend resources in the optimisation of the point set . Our focus in this section is therefore on the quality of the point set obtained, irrespective of computational cost. (The computational cost will, however, be assessed in Section 4.2.)

Figure 1 suggests that the approximation quality of non-myopic approaches is dependent on the number of simultaneously selected points . Figure 2 compares the selection of 60 from 1000 independently sampled points in a model consisting of a mixture of 20 Gaussians, while varying . More myopic selections are seen to be outperformed by less myopic ones for a given quantity of selected samples. The simultaneous selection of samples explains the presentation of the plot as step functions. Note in particular that MMD of the myopic method () is observed to decrease non-monotonically. This is a manifestation of the phenomenon also seen in Figure 1, where a particular sample choice may temporarily worsen the quality of the overall approximation.

Next we consider applications in Bayesian statistics, where both approximation quality and computation time are important. In contrast to

Section 4.1, in what follows the density will be available only up to an unknown normalisation constant and thus KSD—which requires only that can be evaluated—will be used.

### 4.2 Thinning of Markov Chain Output

The use of quantisation to ‘thin’ Markov chain output was proposed in Riabiz et al. (2020), where the authors studied greedy myopic algorithms based on KSD. In this section we revisit an application from that work to determine whether our proposed non-myopic and mini-batching algorithms offer a performance improvement. Unlike Section 4.1, the computational cost of our algorithms must now be assessed, since their runtime may be comparable to the time required to produce Markov chain output itself.

The datasets we use consist of (1) samples from a 38-parameter intracellular calcium signalling model introduced by Hinch et al. (2004), and (2) samples from a 4-parameter Lotka-Volterra predator-prey model. Both were previously considered in Riabiz et al. (2020). The MCMC chains are both highly auto-correlated and start far from any mode. The greedy KSD approach used by Riabiz et al. (2020) was found to slow down dramatically after selecting the first samples, due to the requirement to compute the kernel on all pairs of values involving the selected points and all points in the candidate set at each iteration. We employ mini-batching () to ameliorate this, and investigate the effectiveness of non-myopic selection () in that context.

Figure 3 plots KSD against time, with the number of collected samples fixed at 1000, as well as time-adjusted KSD against for both models. This acknowledges that both approximation quality and computational time are important. In both cases, larger mini-batches were able to perform better provided that the number of states selected was large enough to realise their potential.

Non-myopic selection shows a significant improvement over batch-myopic in the Lotka-Volterra models, and a less significant (though still visible) improvement in the calcium signalling model. The practical upper limit on for non-myopic methods (due to the requirement to optimise over all points) may make performance for larger poorer relatively; in a larger and more complex model, there may be fewer than ‘good’ samples to choose from given moderate . This suggests that control of the ratio may be important for best performance; the best results that we observed occurred when .

A comparison to the original myopic algorithm (i.e. ) of Riabiz et al. (2020) is presented for the Lotka–Volterra model. This is implemented using the same code and on the same machine as the batch and non-myopic simulations. The cyan line shown in the bottom two panes of Figure 3 represents only 50 points not 1000; collecting just these took 42 minutes. This algorithm is slower still for the calcium signalling model, so we omit the comparison.

## 5 Theoretical Assessment

This section presents a theoretical assessment of Algorithms 13. Once stated, a standing assumption is understood to hold for the remainder of the main text.

###### Standing Assumption 1.

Let be a measurable space equipped with a probability measure . Let be symmetric positive definite and satisfy .

For the kernels described in Section 2.2, and the assumption is trivially satisfied. Our first result is a finite-sample-size error bound for non-myopic algorithms when the candidate set is fixed:

###### Theorem 1.

Let be fixed and let . Consider an index sequence of length and with selection size produced by Algorithm 2. Then for all it holds that

 MMDμ,k(1msm∑i=1s∑j=1δ(xπ(i,j)))2≤min1⊤w=1wi≥0 MMDμ,k(n∑i=1wiδ(xi))2+C2(1+logmm),

with an -independent constant.

The proof is provided in Section A.1. Aside from providing an explicit error bound, we see that the output of Algorithm 2 converges in MMD to the optimal (weighted) quantisation of that is achievable using the candidate point set. Interestingly, all bounds we present are independent of .

###### Remark 2.

Theorems 14 are stated for general MMD and apply in particular to KSD, for which we set , . Specialised versions of Theorems 1, 3 and 4 appeared in Chen et al. (2019) and Riabiz et al. (2020), but these authors considered only KSD and myopic algorithms; i.e. .

Our remaining results explore the cases where the candidate points are randomly sampled. Independent and dependent sampling is considered and, in each case, the following moment bound will be assumed:

###### Standing Assumption 2.

For some , , where the expectation is taken over .

In the first randomised setting, the are independently sampled from , as would typically be possible when is explicit:

###### Theorem 2.

Let be independently sampled from . Consider an index sequence of length produced by Algorithm 2. Then for all and all it holds that

 E⎡⎣MMDμ,k(1msm∑i=1s∑j=1δ(xπ(i,j)))2⎤⎦≤log(C1)nγ+2(C2μ,k+log(nC1)γ)(1+logmm).

The proof is provided in Section A.2. It is seen that must be asymptotically increased with in order for the approximation provided by Algorithm 2 to be consistent. No smoothness assumptions were placed on ; such assumptions can be used to improve the term (as in Thm. 1 of Ehler et al., 2019), but we did not consider this useful since the term is the bottleneck in the bound.

An analogous, but more technically involved, argument leads to a finit-sample-size error bound when mini-batches are used:

###### Theorem 3.

Let each mini-batch be independently sampled from . Consider an index sequence of length produced by Algorithm 3. Then

 E⎡⎣MMDμ,k(1msm∑i=1s∑j=1δ(xiπ(i,j)))2⎤⎦≤log(C1)bγ+2(C2μ,k+log(bC1)γ)(1+logmm).

The proof is provided in Section A.3. The mini-batch size plays an analogous role to in Theorem 2 and must be asymptotically increased with in order for Algorithm 3 to be consistent.

In our second randomised setting the candidate set arises as a Markov chain sample path. Let be a function and, for a function and a measure on , let

 ∥f∥V:=supx∈X|f(x)|V(x),∥μ∥V:=sup∥f∥V≤1∣∣∫fdμ∣∣.

A -irreducible and aperiodic Markov chain with step transition kernel is -uniformly ergodic if and only if there exists and such that

 ∥Pn(x,⋅)−P∥V≤RV(x)ρn (6)

for all initial states and all (see Thm. 16.0.1 of Meyn and Tweedie, 2012).

###### Theorem 4.

Assume that for all . Consider a -invariant, time-homogeneous, reversible Markov chain generated using a -uniformly ergodic transition kernel, such that (6) is satisfied with for all . Suppose that . Consider an index sequence of length and selection subset size produced by Algorithm 2. Then, with , we have that

 E⎡⎣MMDμ,k(1msm∑i=1s∑j=1δ(xπ(i,j)))2⎤⎦≤log(C1)nγ+C2C3n+2(C2μ,k+log(nC1)γ)(1+logmm)

The proof is provided in Section A.4. Analysis of mini-batching in the dependent sampling context appears to be more challenging and was not attempted.

## 6 Discussion

This paper focused on quantisation using MMD, proposing and analysing novel algorithms for this task, but other integral probability metrics could be considered. More generally, if one is interested in compression by means other than quantisation then other approaches may be useful, such as Gaussian mixture models and related approaches from the literature on density estimation (Silverman, 1986).

Some avenues for further research include: (a) extending symmetric structure in the target to the set of representative points (Karvonen et al., 2019); (b) characterising an optimal sampling distribution from which elements of the candidate set can be obtained (Bach, 2017)

; (c) further applications of our method, for example to Bayesian neural networks, where quantisation of the posterior provides a promising route to reduce the cost of predicting each label in the test dataset. •

Acknowledgements:

This work was supported by the Lloyd’s Register Foundation programme on data-centric engineering at the Alan Turing Institute, UK. MR was supported by the British Heart Foundation-Alan Turing Institute cardiovascular data-science award (BHF; SP/18/6/33805).

## References

• Arbel et al. (2019) Arbel, M., Korba, A., Salim, A., and Gretton, A. (2019). Maximum mean discrepancy gradient flow. NeurIPS 32, pages 6484–6494.
• Aronszajn (1950) Aronszajn, N. (1950). Theory of reproducing kernels. Trans. Am. Math. Soc., 68(3):337–404.
• Bach (2017) Bach, F. (2017). On the equivalence between kernel quadrature rules and random feature expansions. J. Mach. Learn. Res., 18(1):714–751.
• Bach et al. (2012) Bach, F., Lacoste-Julien, S., and Obozinski, G. (2012). On the equivalence between herding and conditional gradient algorithms. ICML 29.
• Barp et al. (2019) Barp, A., Briol, F.-X., Duncan, A., Girolami, M., and Mackey, L. (2019). Minimum Stein discrepancy estimators. NeurIPS 32, pages 12964–12976.
• Barp et al. (2018) Barp, A., Oates, C., Porcu, E., Girolami, M., et al. (2018). A Riemannian-Stein kernel method. arXiv:1810.04946.
• Briol et al. (2019a) Briol, F.-X., Barp, A., Duncan, A. B., and Girolami, M. (2019a). Statistical inference for generative models with maximum mean discrepancy. arXiv:1906.05944.
• Briol et al. (2015) Briol, F.-X., Oates, C., Girolami, M., and Osborne, M. A. (2015). Frank-Wolfe Bayesian quadrature: Probabilistic integration with theoretical guarantees. NeurIPS 28, pages 1162–1170.
• Briol et al. (2019b) Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A., Sejdinovic, D., et al. (2019b). Probabilistic integration: A role in statistical computation? Stat. Sci., 34(1):1–22.
• Chaloner and Verdinelli (1995) Chaloner, K. and Verdinelli, I. (1995). Bayesian experimental design: a review. Stat. Sci., 10(3):273–304.
• Chen et al. (2019) Chen, W. Y., Barp, A., Briol, F.-X., Gorham, J., Girolami, M., Mackey, L., and Oates, C. J. (2019). Stein point Markov chain Monte Carlo. ICML, 36.
• Chen et al. (2018) Chen, W. Y., Mackey, L., Gorham, J., Briol, F.-X., and Oates, C. J. (2018). Stein points. ICML, 35.
• Chen et al. (2010) Chen, Y., Welling, M., and Smola, A. (2010). Super-samples from kernel herding. UAI, 10.
• Chérief-Abdellatif and Alquier (2020) Chérief-Abdellatif, B.-E. and Alquier, P. (2020). MMD-Bayes: robust Bayesian estimation via maximum mean discrepancy. Proc. Mach. Learn. Res., 18:1–21.
• Dick and Pillichshammer (2010) Dick, J. and Pillichshammer, F. (2010). Digital nets and sequences: discrepancy theory and quasi–Monte Carlo integration. Cambridge University Press.
• Dudley (2018) Dudley, R. M. (2018). Real analysis and probability. CRC Press.
• Ehler et al. (2019) Ehler, M., Gräf, M., and Oates, C. J. (2019). Optimal Monte Carlo integration on closed manifolds. Stat. Comput., 29(6):1203–1214.
• Garreau et al. (2017) Garreau, D., Jitkrittum, W., and Kanagawa, M. (2017). Large sample analysis of the median heuristic. arXiv:1707.07269.
• Goemans and Williamson (1995) Goemans, M. X. and Williamson, D. P. (1995). Improved approximation algorithms for maximum cut and satisfiability problems using semidefinite programming. J. ACM, 42(6):1115–1145.
• Gorham et al. (2019) Gorham, J., Duncan, A., Mackey, L., and Vollmer, S. (2019). Measuring sample quality with diffusions. Ann. Appl. Probab., 29(5):2884–2928.
• Gorham and Mackey (2015) Gorham, J. and Mackey, L. (2015). Measuring sample quality with Stein’s method. NeurIPS 28, pages 226–234.
• Gorham and Mackey (2017) Gorham, J. and Mackey, L. (2017). Measuring sample quality with kernels. ICML 34, 70:1292–1301.
• Graf and Luschgy (2007) Graf, S. and Luschgy, H. (2007). Foundations of quantization for probability distributions. Springer.
• Hickernell (1998) Hickernell, F. (1998). A generalized discrepancy and quadrature error bound. Math. Comput, 67(221):299–322.
• Hinch et al. (2004) Hinch, R., Greenstein, J., Tanskanen, A., Xu, L., and Winslow, R. (2004). A simplified local control model of calcium-induced calcium release in cardiac ventricular myocytes. Biophys. J, 87(6):3723–3736.
• Huszár and Duvenaud (2012) Huszár, F. and Duvenaud, D. (2012). Optimally-weighted herding is Bayesian quadrature. UAI, 12.
• Jarvenpaa et al. (2020) Jarvenpaa, M., Vehtari, A., and Marttinen, P. (2020). Batch simulations and uncertainty quantification in Gaussian process surrogate approximate Bayesian computation. Proc. Mach. Learn. Res., 124:779–788.
• Jiang et al. (2019) Jiang, S., Chai, H., Gonzalez, J., and Garnett, R. (2019). BINOCULARS for efficient, nonmyopic sequential experimental design. ICML 37.
• Karvonen (2019) Karvonen, T. (2019). Kernel-based and Bayesian methods for numerical integration. PhD Thesis, Aalto University.
• Karvonen et al. (2019) Karvonen, T., Särkkä, S., and Oates, C. (2019). Symmetry exploits for Bayesian cubature methods. Stat. Comput., 29:1231–1248.
• Lacoste-Julien et al. (2015) Lacoste-Julien, S., Lindsten, F., and Bach, F. (2015). Sequential kernel herding: Frank-Wolfe optimization for particle filtering. Proc. Mach. Learn. Res., 38:544–552.
• Larkin (1972) Larkin, F. M. (1972). Gaussian measure in Hilbert space and applications in numerical analysis. Rocky Mt. J. Math., 3:379–421.
• Le et al. (2020) Le, H., Lewis, A., Bharath, K., and Fallaize, C. (2020). A diffusion approach to Stein’s method on Riemannian manifolds. arXiv:2003.11497.
• Mak et al. (2018) Mak, S., Joseph, V. R., et al. (2018). Support points. Ann. Stat., 46(6A):2562–2592.
• Meyn and Tweedie (2012) Meyn, S. P. and Tweedie, R. L. (2012). Markov chains and stochastic stability. Springer.
• Müller (1997) Müller, A. (1997). Integral probability metrics and their generating classes of functions. Adv. App. Prob., 29(2):429–443.
• Novak and Woźniakowski (2008) Novak, E. and Woźniakowski, H. (2008). Tractability of Multivariate Problems: Standard information for functionals. European Mathematical Society.
• Oates et al. (2017) Oates, C. J., Girolami, M., and Chopin, N. (2017). Control functionals for Monte Carlo integration. J. R. Statist. Soc. B, 3(79):695–718.
• Oettershagen (2017) Oettershagen, J. (2017). Construction of optimal cubature algorithms with applications to econometrics and uncertainty quantification. PhD thesis, University of Bonn.
• Pronzato and Zhigljavsky (2018) Pronzato, L. and Zhigljavsky, A. (2018). Bayesian quadrature, energy minimization, and space-filling design. SIAM-ASA J. Uncert. Quant., 8(3):959–1011.
• Rendl (2016) Rendl, F. (2016). Semidefinite relaxations for partitioning, assignment and ordering problems. Ann. Oper. Res., 240:119–140.
• Riabiz et al. (2020) Riabiz, M., Chen, W., Cockayne, J., Swietach, P., Niederer, S. A., Mackey, L., and Oates, C. (2020). Optimal thinning of MCMC output. arXiv:2005.03952.
• Rustamov (2019) Rustamov, R. M. (2019). Closed-form expressions for maximum mean discrepancy with applications to Wasserstein auto-encoders. arXiv:1901.03227.
• Sejdinovic et al. (2013) Sejdinovic, D., Sriperumbudur, B., Gretton, A., and Fukumizu, K. (2013). Equivalence of distance-based and RKHS-based statistics in hypothesis testing. Ann. Stat., 41(5):2263–2291.
• Silverman (1986) Silverman, B. W. (1986). Density estimation for statistics and data analysis. CRC Press.
• Song (2008) Song, L. (2008). Learning via Hilbert space embedding of distributions. PhD thesis, University of Sydney.
• Sriperumbudur et al. (2010) Sriperumbudur, B. K., Gretton, A., Fukumizu, K., Schölkopf, B., and Lanckriet, G. R. (2010). Hilbert space embeddings and metrics on probability measures. J. Mach. Learn. Res., 11:1517–1561.
• Stein (1972) Stein, C. (1972).

A bound for the error in the normal approximation to the distribution of a sum of dependent random variables.

In Proc. Sixth Berkeley Symp. on Math. Statist. and Prob., Vol. 2, pages 583–602.
• Xu and Matsuda (2020) Xu, W. and Matsuda, T. (2020). A Stein goodness-of-fit test for directional distributions. Proc. Mach. Learn Res., 108:320–330.
• Yang et al. (2018) Yang, J., Liu, Q., Rao, V., and Neville, J. (2018). Goodness-of-fit testing for discrete distributions via Stein discrepancy. Proc. Mach. Learn. Res., 80:5561–5570.

## Supplemental Material

This supplement is structured as follows: In Appendix A we present proofs for all novel theoretical results stated in Section 5 of the main text. In Appendices B and C we provide additional experimental results to support the discussion in Section 4 of the main text.

## Appendix A Proof of Theoretical Results

In what follows we let denote the reproducing kernel Hilbert space reproduced by the kernel and let denote the induced norm in .

### a.1 Proof of Theorem 1

To start the proof, define

 am :=(ms)2MMDμ,k(1msm∑i=1s∑j=1δ(xπ(i,j)))2 =m∑i=1m∑i′=1s∑j=1s∑j′=1k(xπ(i,j),xπ(i′,j′))−2msm∑i=1s∑j=1∫k(xπ(i,j),x)dμ(x)+(ms)2∬k(x,x′)dμ(x)dμ(x′) fm(⋅) :=m∑i=1s∑j=1k(xπ(i,j),⋅)−ms∫k(⋅,x)dμ(x)

and note immediately that . Then we can write a recursive relation

 am =am−1+\underbracket∑j=1s∑j=1s∑j′=1k(xπ(m,j),xπ(m,j′))+2m−1∑i=1s∑j=1s∑j′=1k(xπ(m,j),xπ(i,j′))−∑j=12mss∑j=1∫k(xπ(m,j),x)dμ(x)(∗) \underbracket−2sm−1∑i=1s∑j=1∫k(xπ(i,j),x)dμ(x)+∑j=1s2(2m−1)∬k(x,x′)dμ(x)dμ(x′)(∗∗)

We will first derive an upper bound for , then one for .

##### Bounding (∗):

Noting that the algorithm chooses the that minimises

 ∑j∈S∑j′∈Sk(xj,xj′)+2∑j∈Ss∑j′=1m−1∑i=1k(xj,xπ(i,j′))−2ms∑j∈S∫k(xj,x)dμ(x)=∑j∈S∑j′∈Sk(xj,xj′)−2s∑j∈S∫k(xj,x)dμ(x)+2∑j∈Sfm−1(xj),

we therefore have that

 (∗) =minS∈{1,…,n}s⎡⎣∑j∈S∑j′∈Sk(xj,xj′)−2s∑j∈S∫k(xj,x)dμ(x)+2∑j∈Sfm−1(xj)⎤⎦ ≤maxS∈{1,…,n}s⎡⎣∑j∈S∑j′∈Sk(xj,xj′)−2s∑j∈S∫k(xj,x)dμ(x)⎤⎦+2minS∈{1,…,n}s∑j∈Sfm−1(xj) (7) ≤maxS∈{1,…,n}s⎡⎣∑j∈S∑j′∈Sk(xj,xj′)+2s∑j∈S∥∥XXXk(xj,⋅)∥∥H⋅∫∥∥XXXk(x,⋅)∥∥Hdμ(x)⎤⎦+2minS∈{1,…,n}s∑j∈Sfm−1(xj) (8) ≤s2maxj∈{1,…,n}k(xj,xj)+2s2maxj∈{1,…,n}√k(xj,xj)⋅∫√k(x,x)dμ(x)+2minS∈{1,…,n}s∑j∈Sfm−1(xj) ≤s2C2n,k+2s2Cn,k(∫k(x,x)dμ(x))1/2+2minS∈{1,…,n}s∑j∈Sfm−1(xj) (9) =s2C2n,k+2s2Cn,kCμ,k+2minS∈{1,…,n}s∑j∈Sfm−1(xj) (10)

In (7) we used the reproducing property, while in (8) we used the Cauchy–Schwarz inequality and in (9) we used Jensen’s inequality. To bound the third term in (10), we write

 minS∈{1,…,n}s∑j∈Sfm−1(xj)=minS∈{1,…,n}s⟨fm−1,∑j∈Sk(⋅,xj)⟩H

Define as the convex hull in of . Since the extreme points of correspond to the vertices we have that

 M={n∑i=1cik(⋅,xi):ci≥0,n∑i=1ci=1}.

Then we have, for any ,

 ⟨fm−1,h⟩H=⟨fm−1,n∑i=1cik(⋅,xi)⟩H=n∑i=1cifm−1(xi).

This linear combination is clearly minimised by taking each of the equal to a candidate point that minimises , and taking the corresponding , and all other . Now consider an element for which the weights minimise subject to and . Clearly . Thus

 minS∈{1,…,n}s∑j∈Sfm−1(xj)=s⋅infh∈M⟨fm−1,h⟩H≤s⋅⟨fm−1,hw⟩H.

Combining this with (10) provides an overall bound on .

##### Bounding (∗∗):

To upper bound we can in fact just use an equality;

 (∗∗) +s2∬k(x,x′)dμ(x)dμ(x′) =−2s⟨fm−1,hμ⟩H+s2∥hμ∥2H

where .

##### Bound on the Iterates:

Combining our bounds on and , we obtain

 am ≤am−1+s2C2n,k+2s2Cn,kCμ,k+2s⟨fm−1,hw⟩H