Kernel Thinning

We introduce kernel thinning, a new procedure for compressing a distribution ℙ more effectively than i.i.d. sampling or standard thinning. Given a suitable reproducing kernel 𝐤 and 𝒪(n^2) time, kernel thinning compresses an n-point approximation to ℙ into a √(n)-point approximation with comparable worst-case integration error across the associated reproducing kernel Hilbert space. With high probability, the maximum discrepancy in integration error is 𝒪_d(n^-1/2√(log n)) for compactly supported ℙ and 𝒪_d(n^-1/2√((log n)^d+1loglog n)) for sub-exponential ℙ on ℝ^d. In contrast, an equal-sized i.i.d. sample from ℙ suffers Ω(n^-1/4) integration error. Our sub-exponential guarantees resemble the classical quasi-Monte Carlo error rates for uniform ℙ on [0,1]^d but apply to general distributions on ℝ^d and a wide range of common kernels. We use our results to derive explicit non-asymptotic maximum mean discrepancy bounds for Gaussian, Matérn, and B-spline kernels and present two vignettes illustrating the practical benefits of kernel thinning over i.i.d. sampling and standard Markov chain Monte Carlo thinning, in dimensions d=2 through 100.


1 Introduction

Markov chain Monte Carlo (MCMC) methods (brooks2011handbook) are commonly used to approximate intractable target expectations with asymptotically exact averages based on points generated from a Markov chain. A standard practice, to minimize the expense of downstream function evaluation, is to thin the Markov chain output by retaining only every t-th sample point (owen2017statistically). Unfortunately, standard thinning also leads to a significant reduction in accuracy: thinning a length-n chain down to √n sample points, for example, increases integration error from order n^-1/2 to order n^-1/4. In this work, we introduce a more effective thinning strategy, providing integration error close to n^-1/2 when √n points are returned.
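To make the baseline concrete, here is a minimal NumPy sketch (ours, not from the paper) of standard thinning: keep every t-th point so that a length-n chain is reduced to roughly √n points. The Gaussian chain array merely stands in for real MCMC output.

    import numpy as np

    def standard_thin(chain: np.ndarray, n_out: int) -> np.ndarray:
        """Keep every t-th point of a chain so that about n_out points remain."""
        n = len(chain)
        t = max(n // n_out, 1)          # thinning interval
        return chain[::t][:n_out]

    # Example: thin a length-n chain down to sqrt(n) points.
    rng = np.random.default_rng(0)
    chain = rng.standard_normal((10_000, 2))    # stand-in for MCMC output
    thinned = standard_thin(chain, n_out=int(np.sqrt(len(chain))))
    print(thinned.shape)                        # (100, 2)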

We call our strategy kernel thinning as we focus on integration in a reproducing kernel Hilbert space (RKHS, berlinet2011reproducing) ℋ_𝐤 with kernel 𝐤 and norm ‖·‖_𝐤. The worst-case error between sample and target expectations over the RKHS unit ball is given by the kernel maximum mean discrepancy (MMD, JMLR:v13:gretton12a),

  MMD_𝐤(ℙ, ℚ) ≜ sup_{‖f‖_𝐤 ≤ 1} |𝔼_{X∼ℙ} f(X) − 𝔼_{Z∼ℚ} f(Z)|,    (1)

and we call a sequence of points an ε-MMD coreset for ℙ if the MMD between ℙ and the empirical distribution of the points is at most ε. For a bounded kernel 𝐤, n i.i.d. draws from ℙ yield an order n^-1/2-MMD coreset with high probability, and comparable guarantees hold for a thinned geometrically ergodic Markov chain (see Prop. 1). A benchmark for improvement is provided by the online Haar strategy of dwivedi2019power, which generates a higher-quality MMD coreset from n i.i.d. sample points when ℙ is specifically the uniform distribution on the unit cube [0,1]^d (dwivedi2019power specifically control the star discrepancy, a quantity which in turn upper bounds a Sobolev space MMD; hickernell1998generalized; novak2010tractability). Our aim is to develop thinned coresets of improved quality for any target ℙ with sufficiently fast tail decay.
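For intuition, the following sketch (our illustration; NumPy and the Gaussian kernel are arbitrary choices, not the paper's prescribed setup) computes the MMD between two empirical distributions via the standard expansion MMD^2 = mean 𝐤(x, x') + mean 𝐤(y, y') − 2 mean 𝐤(x, y), with a large reference sample standing in for ℙ.

    import numpy as np

    def gaussian_gram(X, Y, sigma=1.0):
        """Kernel matrix k(x, y) = exp(-||x - y||^2 / (2 sigma^2))."""
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def mmd(X, Y, sigma=1.0):
        """Kernel MMD between the empirical distributions of X and Y."""
        val = (gaussian_gram(X, X, sigma).mean()
               + gaussian_gram(Y, Y, sigma).mean()
               - 2 * gaussian_gram(X, Y, sigma).mean())
        return float(np.sqrt(max(val, 0.0)))

    # A large reference sample stands in for P; a small i.i.d. subsample plays
    # the role of a size-sqrt(n) coreset whose MMD we would like to beat.
    rng = np.random.default_rng(1)
    reference = rng.standard_normal((2_000, 2))
    subsample = reference[rng.choice(len(reference), size=100, replace=False)]
    print(mmd(subsample, reference))   # roughly of order 100 ** (-1/2)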

Our contributions

After describing our problem setup in Sec. 2, we introduce in LABEL:sec:kernel_thinning a practical procedure, kernel thinning, for compressing an input point sequence into a provably high-quality coreset. Kernel thinning uses non-uniform randomness and a less smooth square-root kernel 𝐤_rt (see LABEL:def:square_root_kernel) to partition the input into subsets of comparable quality and then refines the best of these subsets. Given n input points sampled i.i.d. or from a fast-mixing Markov chain, the result is, with high probability, an 𝒪_d(n^-1/2√(log n))-MMD coreset of size √n for 𝐤 and ℙ with bounded support, an 𝒪_d(n^-1/2√((log n)^(d+1) log log n))-MMD coreset for 𝐤 and ℙ with light tails, and an MMD coreset with correspondingly weaker guarantees for 𝐤 and ℙ with only a bounded number of moments. For compactly supported or light-tailed ℙ and 𝐤, these results compare favorably with the lower bounds of Sec. 1.1. Our guarantees extend more generally to any predetermined input point sequence, even deterministic ones based on quadrature or kernel herding (chen2012super), and give rise to explicit, non-asymptotic error bounds for a wide variety of popular kernels including Gaussian, Matérn, and B-spline kernels. Moreover, our algorithm runs in 𝒪(n²) time.
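The following high-level sketch (our own outline, not the authors' implementation) mirrors the compress-then-refine structure described above: repeated halving produces candidate subsets, and the candidate with the smallest MMD to the full input is returned. The halve routine here is a uniformly random placeholder, whereas kernel thinning halves with a self-balancing walk driven by the square-root kernel (sketched after the discussion of that walk below); the final refinement step is omitted.

    import numpy as np

    def gaussian_gram(X, Y, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def mmd(X, Y, sigma=1.0):
        val = (gaussian_gram(X, X, sigma).mean()
               + gaussian_gram(Y, Y, sigma).mean()
               - 2 * gaussian_gram(X, Y, sigma).mean())
        return float(np.sqrt(max(val, 0.0)))

    def halve(points, rng):
        """Placeholder halving: a uniformly random equal split. Kernel thinning
        instead halves with a square-root-kernel self-balancing walk."""
        idx = rng.permutation(len(points))
        h = len(points) // 2
        return points[idx[:h]], points[idx[h:]]

    def thinning_pipeline(points, m, rng, sigma=1.0):
        """Outline only: m rounds of halving yield 2^m candidate coresets of
        size n / 2^m; return the candidate closest to the input in MMD
        (the refinement step of kernel thinning is omitted here)."""
        candidates = [points]
        for _ in range(m):
            candidates = [half for c in candidates for half in halve(c, rng)]
        return min(candidates, key=lambda c: mmd(c, points, sigma))

    rng = np.random.default_rng(3)
    pts = rng.standard_normal((1_024, 2))
    coreset = thinning_pipeline(pts, m=5, rng=rng)   # 1024 -> 32 = sqrt(1024) points
    print(coreset.shape, mmd(coreset, pts))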

Our results rely on three principal contributions. First, we establish an important link between MMD coresets for 𝐤 and L^∞ coresets for the square-root kernel 𝐤_rt. We say a sequence of points (x_i)_{i=1}^n is an ε-L^∞ coreset for ℙ if

  sup_{y ∈ ℝ^d} | 𝔼_{X∼ℙ} 𝐤_rt(X, y) − (1/n) ∑_{i=1}^n 𝐤_rt(x_i, y) | ≤ ε.    (2)

In LABEL:sec:a_general_recipe_for_bounding_kernel_mmd, we show that any L^∞ coreset for 𝐤_rt is also an MMD coreset for 𝐤 with quality depending on the tail decay of 𝐤_rt and ℙ.
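As a worked stand-in for (2), the sketch below (ours; the Gaussian choice of 𝐤_rt and the finite grid are illustrative assumptions) estimates the L^∞ discrepancy by replacing the expectation under ℙ with a large reference sample and the supremum with a maximum over a grid of evaluation points.

    import numpy as np

    def gaussian_gram(X, Y, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def linf_discrepancy(coreset, reference, eval_points, sigma=1.0):
        """max over the grid of |mean_i k_rt(x_i, y) - mean_j k_rt(z_j, y)|,
        a finite-sample, finite-grid stand-in for the supremum in (2)."""
        emb_core = gaussian_gram(eval_points, coreset, sigma).mean(axis=1)
        emb_ref = gaussian_gram(eval_points, reference, sigma).mean(axis=1)
        return float(np.max(np.abs(emb_core - emb_ref)))

    rng = np.random.default_rng(4)
    reference = rng.standard_normal((20_000, 1))      # stands in for P
    coreset = reference[rng.choice(len(reference), 100, replace=False)]
    grid = np.linspace(-4, 4, 200)[:, None]           # evaluation points y
    print(linf_discrepancy(coreset, reference, grid))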

Second, we prove in LABEL:sub:kernel_thinning_is_recursive_kernel_halving that, with high probability and for a wide range of kernels, kernel thinning compresses any input L^∞ coreset for 𝐤_rt into an L^∞ coreset of √n points of comparable quality, with even tighter guarantees for lighter-tailed targets. Third, as a building block for kernel thinning, we introduce and analyze a Hilbert space generalization of the self-balancing walk of alweiss2020discrepancy to partition a sequence of functions (like the kernel functions 𝐤_rt(x_i, ·)) into nearly equal halves. Our analysis of this self-balancing Hilbert walk in LABEL:sec:from_self_balancing_walk_to_kernel_thinning may be of independent interest for solving the online vector balancing problem (spencer1977balancing) in Hilbert spaces. We conclude our work with two vignettes illustrating the practical benefits of kernel thinning in LABEL:sec:vignettes and a discussion in LABEL:sec:discussion.
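To illustrate the halving step, here is a simplified sketch (our reading of the idea, with an arbitrary fixed threshold; the paper's exact thresholds, pairing scheme, and failure handling differ) of a self-balancing walk in the RKHS of a square-root kernel: points are processed in consecutive pairs, and a sign chosen to keep the running signed-sum function small in the RKHS decides which member of each pair goes to which half, so that the two halves have nearly matching kernel embeddings.

    import numpy as np

    def gaussian_kernel(x, y, sigma=1.0):
        return float(np.exp(-np.sum((x - y) ** 2) / (2 * sigma**2)))

    def self_balancing_halve(points, kernel, threshold, rng):
        """Split points into two equal-size halves via a self-balancing walk on
        psi = sum_j eps_j * (k(x_j, .) - k(y_j, .)) over consecutive pairs
        (x_j, y_j); the random sign eps_j is biased to keep psi small."""
        signs, pairs = [], []
        half_a, half_b = [], []

        def psi(z):
            # psi(z) = sum_j eps_j * (k(x_j, z) - k(y_j, z))
            return sum(s * (kernel(x, z) - kernel(y, z))
                       for s, (x, y) in zip(signs, pairs))

        for i in range(0, len(points) - 1, 2):   # any odd final point is dropped
            x, y = points[i], points[i + 1]
            # RKHS inner product <psi, k(x,.) - k(y,.)> equals psi(x) - psi(y)
            alignment = np.clip(psi(x) - psi(y), -threshold, threshold)
            prob_plus = 0.5 * (1 - alignment / threshold)
            sign = 1 if rng.random() < prob_plus else -1
            signs.append(sign)
            pairs.append((x, y))
            (half_a if sign == 1 else half_b).append(x)
            (half_b if sign == 1 else half_a).append(y)
        return np.array(half_a), np.array(half_b)

    rng = np.random.default_rng(2)
    pts = rng.standard_normal((256, 2))
    a, b = self_balancing_halve(pts, gaussian_kernel, threshold=10.0, rng=rng)
    print(a.shape, b.shape)   # (128, 2) (128, 2)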

Notation

Throughout, we make frequent use of the Euclidean norm ‖·‖_2 and the supremum norm ‖·‖_∞. The set A^c denotes the complement of a set A, and 𝟙[x ∈ A] equals 1 if x ∈ A and 0 otherwise. We say f is of order g and write f = 𝒪(g) or f ≲ g to denote that |f| ≤ C|g| for some universal constant C. Throughout, we view the success probability as a fixed constant. We use f = Ω(g) or f ≳ g to denote |f| ≥ c|g| for some universal constant c. We write f = Θ(g) when f = 𝒪(g) and g = 𝒪(f). Moreover, we use 𝒪_d to indicate dependence of the implicit constant on d. We also write order ε-MMD (or L^∞) coreset to mean an 𝒪(ε)-MMD (or L^∞) coreset. For point sequences 𝒮_1, 𝒮_2 with empirical distributions ℚ_1, ℚ_2, we overload the MMD notation to write MMD_𝐤(ℙ, 𝒮_1) ≜ MMD_𝐤(ℙ, ℚ_1) and MMD_𝐤(𝒮_1, 𝒮_2) ≜ MMD_𝐤(ℚ_1, ℚ_2).

1.1 Related work on MMD coresets

Here we review lower bounds and prior strategies for generating coresets with small MMD and defer discussion of prior L^∞ coreset constructions to LABEL:sub:prior_work_on_linf_coresets.

Lower bounds

For any bounded and radial (i.e., 𝐤(x, y) = κ(‖x − y‖_2) for some function κ) kernel satisfying mild decay and smoothness conditions, phillips2020near showed that any procedure outputting coresets of size n must suffer Ω_d(n^-1) MMD for some (discrete) target distribution ℙ. This lower bound applies, for example, to Matérn kernels and to infinitely smooth Gaussian kernels. For any continuous and shift-invariant (i.e., 𝐤(x, y) = κ(x − y)) kernel taking on at least two values, tolstikhin2017minimax showed that any estimator of ℙ (even non-coreset estimators) based only on n i.i.d. draws from ℙ must suffer Ω(n^-1/2) MMD with at least constant probability for some discrete target ℙ. If, in addition, 𝐤 is characteristic (i.e., MMD_𝐤(ℙ, ℚ) = 0 only when ℙ = ℚ), then tolstikhin2017minimax establishes the same lower bound for some continuous target ℙ with infinitely differentiable density. These last two lower bounds hold, for example, for Gaussian, Matérn, and B-spline kernels and apply in particular to any thinning algorithm that compresses i.i.d. sample points without additional knowledge of ℙ. For light-tailed ℙ and 𝐤, the kernel thinning guarantees of LABEL:theorem:main_result_all_in_one will match each of these lower bounds up to polylogarithmic factors in n and constants depending on d.

Order n^-1/2-MMD coresets for general ℙ

By Prop. A.1 of tolstikhin2017minimax, an i.i.d. sample of size n from ℙ yields an order n^-1/2-MMD coreset with high probability. chen2012super showed that kernel herding with a finite-dimensional kernel (like the linear kernel ⟨x, y⟩) finds an order n^-1-MMD coreset with a constant depending on an inexplicit parameter r. However, bach2012equivalence showed that their analysis does not apply to any infinite-dimensional kernel (like the Gaussian, Matérn, and B-spline kernels studied in this work), as r would necessarily equal 0. The best known rate for kernel herding with bounded infinite-dimensional kernels (lacoste2015sequential, Thm. G.1) guarantees an order n^-1/2-MMD coreset, matching the i.i.d. guarantee. For bounded kernels, the same guarantee is available for Stein Points MCMC (chen2019stein, Thm. 1), which greedily minimizes MMD over random draws from ℙ, and for a variant of the greedy sign selection algorithm described in karnin2019discrepancy (the stated result of karnin2019discrepancy bounds a related quantity, but its proof bounds the MMD).

Finite-dimensional kernels

harvey2014near construct MMD coresets with near-n^-1 error (up to dimension-dependent and logarithmic factors) for finite-dimensional linear kernels but do not address infinite-dimensional kernels.

Uniform distribution on [0,1]^d

hickernell1998generalized; novak2010tractability show that low discrepancy quasi-Monte Carlo (QMC) methods generate size-n MMD coresets with error decaying nearly as fast as n^-1 when ℙ is the uniform distribution on the unit cube (see dick2013high for a contemporary QMC overview). For the same target, the online Haar strategy of dwivedi2019power yields MMD coresets of comparable quality from i.i.d. sample points. These constructions satisfy our quality criteria but are tailored specifically to the uniform distribution on the unit cube.

Unknown coreset quality

paige2016super analyze the impact of approximating a kernel in super-sampling with a reservoir but do not analyze the quality of the constructed MMD coreset. For the conditionally positive definite energy distance kernel, mak2018support establish that an optimal coreset of size n has MMD decaying faster than the i.i.d. n^-1/2 rate but do not provide a construction attaining this rate; in addition, mak2018support propose two support points convex-concave procedures for constructing MMD coresets but do not establish their optimality and do not analyze their quality.

1.2 Related work on weighted MMD coresets

While coresets satisfy a number of valuable constraints that are critical for some downstream applications – exact integration of constants, automatic preservation of convex integrand constraints, compatibility with unweighted downstream tasks, easy visualization, straightforward sampling, and increased numerical stability (karvonen2019positivity) – some applications also support weighted coreset approximations of ℙ of the form ∑_{i=1}^n w_i δ_{x_i} for weights (w_i)_{i=1}^n that need not be equal, need not be nonnegative, or need not sum to 1. Notably, weighted coresets that depend on ℙ only through an i.i.d. sample of size n are subject to the same MMD lower bounds of tolstikhin2017minimax described in Sec. 1.1. Any constructions that violate these bounds do so only by exploiting additional information about ℙ (for example, exact knowledge of certain expectations under ℙ) that is not generally available and not required for our kernel thinning guarantees.

With this context, we now review known weighted MMD coreset guarantees. We highlight that, unlike kernel thinning and despite allowing for more general weights, none of the procedures below is known to yield a weighted MMD coreset that improves on the n^-1/2 i.i.d. rate when ℙ is unknown. Moreover, when ℙ is known, nearly all of the improved weighted MMD guarantees apply only to ℙ with bounded support and do not cover the unbounded distributions addressed in this work. In other words, our unweighted kernel thinning construction provides MMD improvements for practical (𝐤, ℙ) pairings not covered by prior weighted constructions.

ℙ with bounded support

If ℙ has bounded density and bounded, regular support and 𝐤 is a Gaussian or Matérn kernel, then Bayesian quadrature (o1991bayes) and Bayes-Sard cubature (karvonen2018bayes) with quasi-uniform unisolvent point sets yield weighted MMD coresets with better-than-n^-1/2 rates by wendland2004scattered. If ℙ has bounded support and 𝐤 has sufficiently many continuous derivatives, then the P-greedy algorithm (de2005near) also yields weighted MMD coresets with better-than-n^-1/2 rates by santin2017convergence. For (𝐤, ℙ) pairs with compact support and sufficiently rapid eigenvalue decay, approximate continuous volume sampling kernel quadrature (belhadji2020kernel) using the Gibbs sampler of rezaei2019polynomial yields weighted coresets with root mean squared MMD governed by the kernel eigenvalue decay.

Finite-dimensional kernels with compactly supported ℙ

For compactly supported ℙ, briol2015frank and bach2012equivalence showed that Frank-Wolfe Bayesian quadrature and weighted variants of kernel herding, respectively, yield weighted MMD coresets with better-than-n^-1/2 rates for continuous finite-dimensional kernels, but, by bach2012equivalence, these analyses do not extend to infinite-dimensional kernels, like the Gaussian, Matérn, and B-spline kernels studied in this work.

Eigenfunction restrictions

For (𝐤, ℙ) pairs with known Mercer eigenfunctions, belhadji2019kernel bound the expected squared MMD of determinantal point process (DPP) kernel quadrature in terms of kernel eigenvalue decay and provide explicit rates for univariate Gaussian ℙ and for uniform ℙ on an interval. Their construction makes explicit use of the kernel eigenfunctions, which are not available for most (𝐤, ℙ) pairings. For (𝐤, ℙ) pairs with uniformly bounded eigenfunctions and rapidly decaying eigenvalues, liu2016black prove that black-box importance sampling generates a probability-weighted MMD coreset with a better-than-n^-1/2 rate but do not provide any examples verifying their assumptions. The uniformly bounded eigenfunction condition is considered particularly difficult to check (steinwart2012mercer), does not hold for Gaussian kernels with Gaussian ℙ (minh2010some, Thm. 1), and need not hold even for infinitely smooth kernels (zhou2002covering, Ex. 1).

Unknown coreset quality

khanna2019linear prove that weighted kernel herding yields a weighted MMD coreset with error controlled by a problem-dependent term. However, this term in khanna2019linear is at least as large as the condition number of an n × n kernel matrix, which for typical kernels (including the Gaussian and Matérn kernels) grows without bound in n (koltchinskii2000random; el2010spectrum); the resulting MMD error bound therefore does not decay with the coreset size. campbell2019automated prove that Hilbert coresets via Frank-Wolfe with n input points yield weighted order ν^m-MMD coresets of size m for some ν < 1 but do not analyze the dependence of ν on n.

Non-MMD guarantees

For ℙ with continuously differentiable Lebesgue density and a bounded Langevin Stein kernel 𝐤_ℙ (which has mean zero under ℙ), Thm. 2 of oates2017control does not bound MMD but does prove that a randomized control functionals weighted coreset of size n integrates each function in the RKHS of 𝐤_ℙ with error decaying faster than n^-1/2, with an unspecified constant. This bound is asymptotically better than the n^-1/2 guarantee for unweighted i.i.d. coresets but worse than the unweighted kernel thinning guarantees of LABEL:theorem:main_result_all_in_one. On compact domains, Thm. 1 of oates2019convergence establishes improved rates for the same weighted coreset when both ℙ and 𝐤 are sufficiently smooth. bardenet2020monte establish an asymptotic error decay of order n^-(1+1/d)/2 for DPP kernel quadrature with ℙ supported on [-1, 1]^d and each integrand in the RKHS of a particular kernel.

2 Problem Setup

Given a target distribution ℙ on ℝ^d, a reproducing kernel 𝐤, and a sequence of n input points in ℝ^d, our goal is to identify a thinned MMD coreset: a subsequence of length √n whose MMD to ℙ is comparable to that of the full input sequence and hence substantially smaller than the Ω(n^-1/4) MMD of √n i.i.d. points.

2.1 Input sequence requirements

To achieve our goal, we will require the input sequence to have order n^-1/2 MMD quality and to be oblivious, that is, generated independently of any randomness in the thinning algorithm. By Sec. 1.1, the input sequences generated by i.i.d. sampling, kernel herding, Stein Points MCMC, and greedy sign selection all satisfy these properties with high probability. Moreover, we prove in LABEL:proof_of_mcmc_mmd that an analogous guarantee holds for the iterates of a fast-mixing Markov chain:

Proposition 1 (MMD guarantee for MCMC)

By meyn2012markov, if (x_i)_{i=1}^n are the iterates of a homogeneous geometrically ergodic Markov chain with stationary distribution ℙ, then, for some constant ρ ∈ (0, 1) and function M,

  ‖P^t(x, ·) − ℙ‖_TV ≤ M(x) ρ^t for all starting points x and all t ≥ 1,    (3)

where P^t(x, ·) denotes the distribution of the chain after t steps when started at x. If 𝐤 is bounded, then, with high probability, MMD_𝐤(ℙ, (x_i)_{i=1}^n) = 𝒪(n^-1/2), with the implicit constant depending only on the Markov transition probabilities.
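As a numerical illustration of this kind of guarantee (our example; the Gaussian AR(1) chain and Gaussian kernel are arbitrary choices, and the finite reference sample introduces its own error), the sketch below estimates the MMD between the iterates of a geometrically ergodic AR(1) chain and a large sample from its stationary distribution; the estimates should shrink at roughly the n^-1/2 rate.

    import numpy as np

    def gaussian_gram(X, Y, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(Y**2, 1)[None, :] - 2 * X @ Y.T
        return np.exp(-sq / (2 * sigma**2))

    def mmd(X, Y, sigma=1.0):
        val = (gaussian_gram(X, X, sigma).mean()
               + gaussian_gram(Y, Y, sigma).mean()
               - 2 * gaussian_gram(X, Y, sigma).mean())
        return float(np.sqrt(max(val, 0.0)))

    def ar1_chain(n, phi=0.5, rng=None):
        """Gaussian AR(1) chain x_t = phi * x_{t-1} + noise: a standard example
        of a geometrically ergodic chain, stationary at N(0, 1 / (1 - phi^2))."""
        rng = rng or np.random.default_rng()
        x = np.zeros(n)
        for t in range(1, n):
            x[t] = phi * x[t - 1] + rng.standard_normal()
        return x[:, None]

    rng = np.random.default_rng(5)
    stationary = rng.standard_normal((2_000, 1)) / np.sqrt(1 - 0.5**2)
    for n in (250, 1_000, 4_000):
        # decay roughly like n**(-1/2), up to the reference-sample error
        print(n, mmd(ar1_chain(n, rng=rng), stationary))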

2.2 Kernel requirements

We will use the terms reproducing kernel and kernel interchangeably to indicate that 𝐤 is symmetric and positive definite, i.e., that the kernel matrix (𝐤(x_i, x_j))_{i,j=1}^m is symmetric and positive semidefinite for any m evaluation points x_1, …, x_m ∈ ℝ^d (berlinet2011reproducing). In addition to 𝐤, our algorithm will take as input a square-root kernel 𝐤_rt for 𝐤:
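Returning to the positive semidefiniteness requirement, it can be checked numerically on any finite set of evaluation points; here is a minimal sketch (ours, with a Gaussian kernel chosen purely for illustration).

    import numpy as np

    def gaussian_gram(X, sigma=1.0):
        sq = np.sum(X**2, 1)[:, None] + np.sum(X**2, 1)[None, :] - 2 * X @ X.T
        return np.exp(-sq / (2 * sigma**2))

    rng = np.random.default_rng(6)
    X = rng.standard_normal((50, 3))              # arbitrary evaluation points
    K = gaussian_gram(X)
    print(np.allclose(K, K.T))                    # symmetric
    print(np.linalg.eigvalsh(K).min() >= -1e-10)  # eigenvalues nonnegative (PSD)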