The reproducing Stein kernel approach for post-hoc corrected sampling

01/25/2020 ∙ by Liam Hodgkinson, et al. ∙ UNSW ∙ UC Berkeley ∙ The University of Queensland

Stein importance sampling is a widely applicable technique based on kernelized Stein discrepancy, which corrects the output of approximate sampling algorithms by reweighting the empirical distribution of the samples. A general analysis of this technique is conducted for the previously unconsidered setting where samples are obtained via the simulation of a Markov chain, and applies to an arbitrary underlying Polish space. We prove that Stein importance sampling yields consistent estimators for quantities related to a target distribution of interest by using samples obtained from a geometrically ergodic Markov chain with a possibly unknown invariant measure that differs from the desired target. The approach is shown to be valid under conditions that are satisfied for a large number of unadjusted samplers, and is capable of retaining consistency when data subsampling is used. Along the way, a universal theory of reproducing Stein kernels is established, which enables the construction of kernelized Stein discrepancy on general Polish spaces, and provides sufficient conditions for kernels to be convergence-determining on such spaces. These results are of independent interest for the development of future methodology based on kernelized Stein discrepancies.


1 Introduction

Our problem of interest is the efficient computation of integrals with respect to some target probability measure $\mu$. Adopting the Monte Carlo approach, $\mu$ is approximated by an empirical distribution formed from samples drawn according to $\mu$. However, in many problems of interest, it is not possible to simulate according to $\mu$ exactly, and so further approximate methods must be used. Arguably the most widely employed and general approach is Markov chain Monte Carlo (MCMC): successively drawing samples as a realization of a Markov chain. The dominant approach to MCMC involves the simulation of a process that is $\mu$-ergodic, often constructed via the Metropolis-Hastings algorithm from an underlying irreducible and aperiodic Markov chain [58]. However, there has been significant recent interest in so-called unadjusted MCMC approaches [14, 19, 45, 29]. A common strategy with these methods is the approximate numerical simulation of Langevin diffusions, which are $\mu$-ergodic [45]. For the same computational effort, one can achieve substantially lower variance of estimates at the cost of incurring additional (asymptotic) bias. Despite poorer asymptotic guarantees [21], the ensuing Markov chains are often rapidly mixing, and perform particularly well in high-dimensional settings [20]. However, despite significant recent empirical success [29] and the presence of explicit error estimates [14, 19], such algorithms have not been widely adopted. Statisticians and practitioners alike remain skeptical, as consistency is often considered a basic theoretical necessity for any sampling algorithm.

The aim of this work is to construct a universal theoretical framework for Stein importance sampling. In this way, one can achieve the best of both worlds by generating approximate samples from a rapidly mixing Markov chain, and then subsequently 'correcting' them, without pre-existing knowledge of the chain, to obtain consistent estimators of quantities involving an unnormalized probability distribution. This allows us to obtain both a low-variance estimator with good practical performance, and a theoretical guarantee that increasing the number of samples will provide an estimator with diminishing error. The approach is viable for all sampling algorithms, but is most effective for biased samplers. Stein importance sampling is not new, having been introduced as a form of black-box importance sampling in [42], but it has lacked sufficient theoretical justification to support its use as a universal post-hoc correction tool, particularly as it applies to Markov chains. It belongs to a large class of algorithms derived from the kernelized Stein discrepancy (KSD), first introduced in [43] and motivated by ideas in [53]. Interpreted as an integral probability metric with respect to a fixed target distribution $\mu$, the KSD combines ideas from Stein's method [64, 60] and reproducing kernel Hilbert spaces [2, 65], and has the advantage of being explicitly computable. KSD has been applied to a myriad of problems, including measurement of sample quality [27], goodness-of-fit tests [12, 43, 70, 33], variance reduction [53, 52, 42], variational inference [41, 56], and the construction of other sampling algorithms [44, 67, 15, 11, 10]. Stein importance sampling stands out as one of the most straightforward techniques involving KSD: for elements $x_1, \dots, x_n$, weights $w_1, \dots, w_n$ are selected such that the resulting weighted empirical distribution $\hat{\mu}_{\boldsymbol{w}}$, with integrals

$\int f \,\mathrm{d}\hat{\mu}_{\boldsymbol{w}} = \sum_{i=1}^{n} w_i f(x_i), \qquad (1)$

is as close as possible to a target distribution $\mu$ under a corresponding KSD $\mathbb{S}(\hat{\mu}_{\boldsymbol{w}}, \mu)$. As all weights must be non-negative and sum to one for $\hat{\mu}_{\boldsymbol{w}}$ to be a valid probability distribution, the resulting optimization problem reduces to the constrained quadratic program

$\min_{\boldsymbol{w} \in \mathbb{R}^n} \; \boldsymbol{w}^\top K_0\, \boldsymbol{w} \quad \text{subject to} \quad \boldsymbol{w} \geq 0, \;\; \textstyle\sum_{i=1}^{n} w_i = 1, \qquad (2)$

where $K_0 = (k_0(x_i, x_j))_{i,j=1}^{n}$ is the Gram matrix corresponding to a reproducing Stein kernel $k_0$ for $\mu$, and the inequality $\boldsymbol{w} \geq 0$ is to be interpreted element-wise. Stein importance sampling requires no modification to existing methods; it is a post-hoc correction applicable to any arbitrary collection $x_1, \dots, x_n$. The program (2) is a well-studied problem for which many solution techniques exist, for example, interior point methods or the alternating direction method of multipliers (see [51] for a comprehensive survey).
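As an illustrative sketch (assuming only a precomputed Stein kernel Gram matrix K0; a general-purpose SLSQP routine stands in here for the dedicated solvers surveyed in [51]), the weights solving (2) can be obtained as follows.

    import numpy as np
    from scipy.optimize import minimize

    def stein_weights(K0):
        """Importance weights from the quadratic program (2):
        minimize w' K0 w subject to w >= 0 and sum(w) = 1."""
        n = K0.shape[0]
        result = minimize(
            fun=lambda w: w @ K0 @ w,           # squared KSD of the weighted measure
            x0=np.full(n, 1.0 / n),             # start from uniform weights
            jac=lambda w: 2.0 * K0 @ w,
            method="SLSQP",
            bounds=[(0.0, None)] * n,           # element-wise non-negativity
            constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        )
        return result.x

The returned weights then define the corrected estimator of integrals via (1).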

1.1 Contribution

In this work, we make several theoretical contributions to Stein importance sampling and reproducing Stein kernels:

  1. Our main result is presented in §3 as Theorem 1, where we establish sufficient conditions for Stein importance sampling — when applied to samples generated by a Markov chain — to yield consistent estimators for integrals of test functions under $\mu$.

The result is particularly interesting for two reasons. Firstly, somewhat surprisingly, we do not require that the Markov chain is $\mu$-ergodic. In fact, the technical conditions for the result that pertain to the ergodicity and convergence of the chain are easily verified for many unadjusted samplers, which are of increasing interest to practitioners due to their rapid mixing in high dimension. Secondly, the result is established for a generalized form of Stein importance sampling where certain terms are estimated unbiasedly, for example by data subsampling, which provides justification for the use of the method in the large-data settings common in twenty-first-century statistical computing. To establish the above result for general underlying spaces, we first develop in §2 — following some background exposition — new theory for Stein kernels and KSD that is of independent interest:

  2. We address the construction of reproducing Stein kernels, and therefore of KSD, with respect to probability measures on arbitrary Polish spaces.

The construction subsumes existing approaches pertaining to Euclidean space [53, 43], discrete spaces [70], and Riemannian manifolds [40, 4], and further enables the construction of Stein kernels and KSD on any Hilbert space and on other separable function spaces. It builds on the generator approach for Stein's method [3]; Proposition 1 shows that reproducing Stein kernels can always be constructed from the generators of ergodic processes with strongly continuous semigroups, and Proposition 2 deals with the construction of reproducing Stein kernels on product spaces. Finally,

  3. We establish general sufficient conditions for reproducing Stein kernels to yield a convergence-determining KSD.

Previously, specialized direct proofs have been used to establish the convergence-determining property for specific KSDs; see [27, 11]. In Propositions 3 and 4, we show that in the locally compact setting, separability and divergence of some function in the image of the RKHS under the Stein operator yield the convergence-determining property.

1.2 Notation

In the sequel, $\mathcal{X}$ denotes an arbitrary Polish space, with $C(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) the set of arbitrary (resp. bounded) continuous functions from $\mathcal{X}$ to $\mathbb{R}$, and $\mathcal{B}$ a Banach subspace of $C_b(\mathcal{X})$. For locally compact $\mathcal{X}$, $C_0(\mathcal{X})$ denotes the closure of the space of compactly supported continuous functions from $\mathcal{X}$ to $\mathbb{R}$ under the uniform norm $\|f\|_\infty = \sup_{x \in \mathcal{X}} |f(x)|$. For $k \in \mathbb{N}$, $C^k(\mathcal{X})$ denotes the set of $k$-times continuously-differentiable functions from $\mathcal{X}$ to $\mathbb{R}$. The space of probability measures on $\mathcal{X}$ is denoted by $\mathcal{P}(\mathcal{X})$. The law of a random element $X$ is denoted $\mathcal{L}(X)$. Let $\mathcal{M}(\mathcal{X})$ denote the space of $\sigma$-finite measures on $\mathcal{X}$, equipped with the vague topology. For any measure $\mu$, the space of $p$-integrable functions with respect to $\mu$ is denoted $L^p(\mu)$. For each $f \in L^1(\mu)$, we let $\mu f = \int f \,\mathrm{d}\mu$. For measures $\nu$ and $\mu$, we write $\nu \ll \mu$ if $\nu$ is absolutely continuous with respect to $\mu$, that is, if $\mu(A) = 0$ implies $\nu(A) = 0$ for any Borel set $A$ — the corresponding Radon-Nikodym derivative in such cases is denoted $\mathrm{d}\nu/\mathrm{d}\mu$. We write $\nu \equiv \mu$ when the measures $\nu$ and $\mu$ are equivalent, that is, if $\nu(A) = 0 \iff \mu(A) = 0$ for any Borel set $A$. We also adopt big-$O$ notation in probability: random variables $X_n$ satisfy $X_n = O_p(a_n)$ if for any $\epsilon > 0$, there exists a finite $M > 0$ such that $\sup_n \Pr(|X_n| > M a_n) < \epsilon$. For any operator $\mathcal{A}$ and function $f$ in the domain of $\mathcal{A}$ (denoted $\mathcal{D}(\mathcal{A})$), $\mathcal{A}f(x)$ denotes the evaluation of the function $\mathcal{A}f$ at $x$. For operators $\mathcal{A}_1$ and $\mathcal{A}_2$ acting on elements of a function space of scalar-valued functions on $\mathcal{X}$, $\mathcal{A}_1 \otimes \mathcal{A}_2$ denotes the tensor product of $\mathcal{A}_1$ and $\mathcal{A}_2$, acting on functions $g: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$ by applying $\mathcal{A}_1$ to $g(\cdot, y)$ in the first argument and $\mathcal{A}_2$ to the result in the second argument. For function spaces $\mathcal{F}_1$ and $\mathcal{F}_2$, $\mathcal{F}_1 \otimes \mathcal{F}_2$ denotes the tensor product space of $\mathcal{F}_1$ and $\mathcal{F}_2$. Finally, collections of elements are denoted in bold, and for any function $f$ on $\mathcal{X}$ and collection $\boldsymbol{x} = (x_1, \dots, x_n)$, we denote $f(\boldsymbol{x}) = (f(x_1), \dots, f(x_n))$.
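For instance, a simple illustration of the tensor-product notation: if $\mathcal{X} = \mathbb{R}$ and $\mathcal{A}_1 = \mathcal{A}_2 = \frac{\mathrm{d}}{\mathrm{d}x}$, then $(\mathcal{A}_1 \otimes \mathcal{A}_2)\, k(x, y) = \partial_x \partial_y k(x, y)$ for any sufficiently smooth $k: \mathbb{R} \times \mathbb{R} \to \mathbb{R}$; this is the pattern that appears in the Stein kernels of §2.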

2 Reproducing Stein Kernels

Discrepancy measures are a fundamental concept in many areas of probability and statistics. However, the vast majority of discrepancies cannot be estimated unbiasedly, let alone computed exactly. The majority of useful discrepancies valid for arbitrary probability measures lie in the class of integral probability metrics [50], which compare two probability measures $\nu$ and $\mu$ in terms of their integrals with respect to some class of test functions $\mathcal{F}$:

$\mathcal{D}_{\mathcal{F}}(\nu, \mu) = \sup_{f \in \mathcal{F}} \left| \int f \,\mathrm{d}\nu - \int f \,\mathrm{d}\mu \right|.$

For example, by taking $\mathcal{F}$ to be the unit ball in $C_b(\mathcal{X})$, $\mathcal{D}_{\mathcal{F}}$ becomes the total variation metric. Choosing $\mathcal{F}$ to be the set of Fréchet-differentiable functions with derivative bounded uniformly by one yields the Wasserstein-1 metric [66, Remark 6.5]. Conditions that guarantee $\mathcal{D}_{\mathcal{F}}$ is a metric on the space of probability measures, and that $\mathcal{D}_{\mathcal{F}}(\mu_n, \mu) \to 0$ implies $\mu_n f \to \mu f$ for any bounded continuous function $f$ (the convergence-determining property), can be found in [22, Theorem 3.4.5]. There are now two obstacles encountered when computing a metric between an empirical measure $\mu_n$ and a target probability measure $\mu$. Firstly, while the expectation $\mu_n f$ is easy enough to evaluate, $\mu f$ is not. In fact, the empirical measure is usually designed to approximate $\mu$, and computing $\mu f$ is often the objective! One of the first universal methods for solving this problem is Stein's method, an ingenious technique which encodes details about a target probability measure $\mu$ into a so-called Stein operator $\mathcal{A}_\mu$. Nearly twenty years after the method's introduction in [64], Barbour [3] developed a popular framework for constructing Stein operators known as the generator approach, which also serves as intuition behind the method. We briefly outline the approach: let $\mu$ be an arbitrary probability measure on $\mathcal{X}$, and let $Z = (Z_t)_{t \geq 0}$ be a $\mu$-ergodic Markov process that induces a strongly continuous semigroup $(P_t)_{t \geq 0}$ on $\mathcal{B}$, defined by $P_t f(x) = \mathbb{E}[f(Z_t) \mid Z_0 = x]$ for $f \in \mathcal{B}$. Arguably the most important case is when $\mathcal{X}$ is locally compact and $Z$ is Feller, in which case $(P_t)_{t \geq 0}$ is strongly continuous on $C_0(\mathcal{X})$ [32, Theorem 19.6]. Let $\mathcal{A}_\mu$ denote the Markov generator of $Z$, that is, the linear operator satisfying $\mathcal{A}_\mu f = \lim_{t \downarrow 0} t^{-1}(P_t f - f)$ for functions $f$ where the limit exists with respect to the topology of $\mathcal{B}$. Let $\mathcal{F}$ be a dense subset of the domain of $\mathcal{A}_\mu$ with respect to the topology of $\mathcal{B}$. By [22, Proposition 1.1.5(c)], if $Z_0$ is distributed according to $\nu$, then for any $f \in \mathcal{F}$ and $t \geq 0$,

$\mathbb{E}\, f(Z_t) - \nu f = \int_0^t \mathbb{E}\, \mathcal{A}_\mu f(Z_s)\,\mathrm{d}s,$

and by taking $t \to \infty$, since $\mathbb{E}\, f(Z_t) \to \mu f$ by ergodicity,

$\mu f - \nu f = \int_0^\infty \mathbb{E}\, \mathcal{A}_\mu f(Z_s)\,\mathrm{d}s. \qquad (3)$

For example, let $\nu$ and $\mu$ be Dirac measures at points $a$ and $b$ (respectively), and let $Z$ be a deterministic process satisfying $Z_0 = a$ and $Z_t \to b$ as $t \to \infty$. The infinitesimal generator of $Z$ is then a directional derivative along the path, and so (3) becomes $f(b) - f(a) = \int_0^\infty \frac{\mathrm{d}}{\mathrm{d}t} f(Z_t)\,\mathrm{d}t$, which is precisely the fundamental theorem of calculus for line integrals, along the path parameterized by $Z$. Therefore, (3) is a stochastic generalization of the fundamental theorem of calculus for line integrals, where the infinitesimal generator plays the role of the gradient, and the stochastic process the curve joining two probability measures. Keeping with this analogy, the Stein equation is a stochastic analogue of the mean value theorem. Let $\mathbb{E}_x$ denote the expectation operator subject to the event $Z_0 = x$. Under the same assumptions, by [22, Proposition 1.1.5], the function $\int_0^t P_s h\,\mathrm{d}s$ lies in the domain of $\mathcal{A}_\mu$ for any $h \in \mathcal{B}$ and $t \geq 0$, with image $P_t h - h$ (in other words, the generator and the integral may be exchanged). Because $\mathcal{A}_\mu$ is closed [22, Proposition 1.1.6], assuming strong continuity of $t \mapsto P_t h$, one may derive the Stein equation

$\mathcal{A}_\mu f_h = h - \mu h, \qquad (4)$

where $f_h = -\int_0^\infty (P_t h - \mu h)\,\mathrm{d}t$ (see the proof of [26, Theorem 5], for example). Denoting by $\mathcal{A}_\mu \mathcal{F}$ the image of $\mathcal{F}$ under the operation $f \mapsto \mathcal{A}_\mu f$, note that

$\mathcal{D}_{\mathcal{A}_\mu \mathcal{F}}(\nu, \mu) = \sup_{f \in \mathcal{F}} \left| \int \mathcal{A}_\mu f \,\mathrm{d}\nu \right| \qquad (5)$

no longer involves the expectation $\mu h$, and therefore sidesteps the challenge in its computation. An important observation of Stein is that for any operator $\mathcal{A}_\mu$ with the property

$\mathbb{E}_{X \sim \mu}\, \mathcal{A}_\mu f(X) = 0 \quad \text{for all } f \in \mathcal{F}, \qquad (6)$

we may proceed to formulate (4), solve for $f_h$ accordingly, and arrive at (5), even if $\mathcal{A}_\mu$ is not a Markov generator. This procedure describes Stein's method at its highest level of generality. Hence, operators satisfying (6) are often called Stein operators. The second issue with computing $\mathcal{D}_{\mathcal{F}}(\mu_n, \mu)$ is the non-trivial optimization problem of obtaining the supremum over $\mathcal{F}$. In the context of (5), this is even more difficult due to the complex relationship between $h$ and $f_h$. The trick, as outlined in [43], is to choose $\mathcal{F}$ to be the unit ball of a reproducing kernel Hilbert space. Recall that a Hilbert space $\mathcal{H}$ of functions $f: \mathcal{X} \to \mathbb{R}$ is a reproducing kernel Hilbert space (RKHS) if for all $x \in \mathcal{X}$, the evaluation functional $f \mapsto f(x)$ is bounded [2]. By the Riesz representation theorem, for any RKHS $\mathcal{H}$, there exists a unique symmetric, positive-definite function $k: \mathcal{X} \times \mathcal{X} \to \mathbb{R}$, called the reproducing kernel of $\mathcal{H}$, such that $k(x, \cdot) \in \mathcal{H}$ for any $x \in \mathcal{X}$, and

$f(x) = \langle f, k(x, \cdot) \rangle_{\mathcal{H}} \quad \text{for all } f \in \mathcal{H}, \; x \in \mathcal{X}. \qquad (7)$

The relation (7) is often called the reproducing property of the kernel. Conversely, the Moore-Aronszajn theorem [65, Theorem 4.21] states that any positive-definite kernel induces a unique corresponding RKHS. There exists a significant theory of RKHS; see, for example, [2] and [65, §4]. Reproducing kernel Hilbert spaces arise in machine learning as they can be used to reduce infinite-dimensional optimization problems over function spaces to problems over finite-dimensional spaces. Such a simplification occurs in (5) when $\mathcal{F}$ is the unit ball of a RKHS $\mathcal{H}$, due to the existence of a reproducing Stein kernel, which is guaranteed when the Stein operator is the generator of a Markov process (Proposition 1). For brevity, we will often drop the term 'reproducing' as it is clear in our context, but stress that these Stein kernels should not be confused with those seen in [13], for example.
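As a small numerical illustration (our own toy example, not drawn from the references), positive-definiteness of a kernel can be checked on any finite point set through the eigenvalues of its Gram matrix:

    import numpy as np

    def gaussian_kernel(x, y, lengthscale=1.0):
        """Gaussian (RBF) reproducing kernel k(x, y) = exp(-||x - y||^2 / (2 l^2))."""
        return np.exp(-np.sum((x - y) ** 2) / (2.0 * lengthscale ** 2))

    rng = np.random.default_rng(0)
    points = rng.normal(size=(50, 3))                      # 50 points in R^3
    gram = np.array([[gaussian_kernel(a, b) for b in points] for a in points])

    # Positive-definiteness of the kernel implies the Gram matrix is positive semi-definite.
    print(np.linalg.eigvalsh(gram).min() >= -1e-10)        # True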

Proposition 1 (Stein Kernels from Generators).

Let $\mu$ be a probability measure on $\mathcal{X}$, and $Z$ a $\mu$-ergodic process with strongly continuous semigroup over $\mathcal{B}$ and Markov generator $\mathcal{A}_\mu$. Further, let $\mathcal{H}$ be a RKHS with reproducing kernel $k$. Assuming that $k(x, \cdot) \in \mathcal{D}(\mathcal{A}_\mu)$ for any $x \in \mathcal{X}$, $\mathcal{A}_\mu \mathcal{H}$ is a RKHS with reproducing kernel

$k_0(x, y) = (\mathcal{A}_\mu \otimes \mathcal{A}_\mu)\, k(x, y). \qquad (8)$
Proof.

Since the Markov generator is the time derivative of its corresponding semigroup, we can proceed in a similar fashion to [65, Lemma 4.34] concerning the interchangeability of derivatives and inner products. Let $(P_t)_{t \geq 0}$ denote the semigroup of $Z$, recalling that for any $f \in \mathcal{B}$ and $x \in \mathcal{X}$,

$P_t f(x) = \int_{\mathcal{X}} f(y)\, p_t(x, \mathrm{d}y), \qquad (9)$

where $(p_t)_{t \geq 0}$ is the family of transition probability kernels of $Z$. Interpreting (9) as a Bochner integral, for any $f \in \mathcal{H}$, since the inner product with $f$ is a continuous linear operator, the reproducing property together with properties of the Bochner integral imply

Therefore, by letting for each , where is the identity operator,

(10)

To show that exists, it suffices to show that for any sequence of positive real numbers converging to zero and any , is a Cauchy sequence in . Since is complete, by definition of the generator ,

(11)

For any , let . Choosing any and ,

(12)

Now, for any , by multiple applications of (10),

The Kolmogorov equations [22, Proposition 1.5(b)] state that for any and in the domain of , , which, by the mean value theorem, implies for some . Since and are both assumed to be in the domain of (due to the symmetry of ), for some and ,

The strong continuity of the semigroup implies that for any , there is an such that for ,

The above inequality combined with (12) implies that is a Cauchy sequence, and so (11) holds. Together with (10),

The result now follows from [65, Theorem 4.21]. ∎

Example 1 (Riemannian Stein kernels).

We rederive the construction of Stein kernels on Riemannian manifolds outlined in [4, 40]. Let $M$ be a complete connected $d$-dimensional Riemannian manifold with basis coordinates $x^1, \dots, x^d$. The gradient of a smooth function $f$ is defined on $M$ by $(\nabla f)^i = \sum_{j=1}^d g^{ij}\, \partial_j f$, where $g^{ij}$ is the $(i,j)$-th element of the inverse of the metric matrix with elements $g_{ij}$ for $i, j = 1, \dots, d$. The Laplace-Beltrami operator $\Delta_M$ on $M$ is the divergence of the gradient of $f$, given by

$\Delta_M f = \frac{1}{\sqrt{|g|}} \sum_{i,j=1}^d \partial_i\!\left(\sqrt{|g|}\, g^{ij}\, \partial_j f\right),$

where $|g|$ is the determinant of the metric matrix. For any smooth vector field $v$ on $M$, there exists a unique Feller process $Z$ on $M$ with generator $\Delta_M + \langle v, \nabla \cdot \rangle$ [68, §2.1]. If $\mu$ is absolutely continuous with respect to the Riemannian volume measure on $M$ with density $p$, the choice $v = \nabla \log p$ yields a process with invariant measure $\mu$ [68, pg. 72]. Furthermore, under the Bakry-Émery curvature condition, the diffusion is ergodic, indeed, geometrically ergodic in the Wasserstein metric [68, Theorem 2.3.3]. Therefore, letting $\mathcal{A}_\mu$ be the generator of this diffusion, by Proposition 1, $\mathcal{A}_\mu$ induces a Stein kernel for any base reproducing kernel on $M$, with target measure $\mu$. To complete the construction of Stein kernels on manifolds requires an underlying reproducing kernel. Any reproducing kernel on $\mathbb{R}^{2d}$ admits an (extrinsic) reproducing kernel on $M$ by the strong Whitney embedding theorem [62, Theorem 2.2.a] (obtained by composing the Euclidean kernel with an embedding of $M$ into $\mathbb{R}^{2d}$). Intrinsic kernels are known for special manifolds, such as the $d$-dimensional sphere [1].

As a consequence of the proof of Proposition 1, we require only that $(\mathcal{A}_\mu \otimes \mathcal{A}_\mu)\, k(x, y)$ exists for any $x$ and $y$ to infer that $\mathcal{A}_\mu \mathcal{H}$ is a RKHS with reproducing kernel (8). Therefore, for more general Stein operators, we rely on the following definition for constructing reproducing Stein kernels.

Definition 1.

A Stein operator $\mathcal{T}_\mu$ is said to induce a Stein kernel if, for any RKHS $\mathcal{H}$ of scalar-valued functions on $\mathcal{X}$ with reproducing kernel $k$, the image $\mathcal{T}_\mu \mathcal{H}$ is a RKHS with reproducing kernel $(\mathcal{T}_\mu \otimes \mathcal{T}_\mu)\, k$.

For any RKHS $\mathcal{H}$ of functions on $\mathcal{X}$, target measure $\mu$, and operator $\mathcal{T}_\mu$ inducing a Stein kernel $k_0$, the kernelized Stein discrepancy (KSD) between $\nu$ and $\mu$, denoted $\mathbb{S}(\nu, \mu)$, satisfies

$\mathbb{S}(\nu, \mu) = \sup_{\|f\|_{\mathcal{H}} \leq 1} \left| \int \mathcal{T}_\mu f \,\mathrm{d}\nu \right| = \sqrt{\mathbb{E}_{X, X' \overset{\mathrm{iid}}{\sim} \nu}\, k_0(X, X')}, \qquad (13)$

where, as in Proposition 1, $k_0$ is the reproducing kernel of the image $\mathcal{T}_\mu \mathcal{H}$, and the second equality follows by the property $\mathcal{T}_\mu f(x) = \langle f, \mathcal{T}_\mu k(x, \cdot) \rangle_{\mathcal{H}}$ for any $f \in \mathcal{H}$. This construction (13) resembles maximum mean discrepancy (MMD); see [49, §3.5] for further details. Now, if $\mu_n = \frac{1}{n} \sum_{i=1}^n \delta_{x_i}$ is the empirical measure of samples $x_1, \dots, x_n$, then

$\mathbb{S}(\mu_n, \mu)^2 = \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} k_0(x_i, x_j). \qquad (14)$

Both of the issues concerning computation of discrepancies for empirical distributions are now addressed, with an explicit and often easily computable formula for the evaluation of a particular integral probability metric. However, the kernelized Stein discrepancy is not limited to applications with empirical distributions, since, for any distribution $\nu$ on $\mathcal{X}$, we can readily estimate $\mathbb{E}\, k_0(X, X')$, where $X$ and $X'$ are independently distributed according to $\nu$, using crude Monte Carlo.
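As a quick sketch of (14) (assuming a Stein kernel is available as a function of two points; a canonical construction on Euclidean space is given in Example 2 below), the squared KSD of an empirical measure is simply the mean of its Stein kernel Gram matrix:

    import numpy as np

    def ksd_squared(k0, samples):
        """Squared KSD (14) of the empirical measure of `samples`, where `k0` is a
        reproducing Stein kernel for the target, evaluated pairwise on the samples."""
        gram = np.array([[k0(x, y) for y in samples] for x in samples])
        return gram.mean()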

2.1 Efficient construction of Stein kernels on product spaces

Building upon Barbour’s generator approach to Stein’s method [3], using Proposition 1 one can, in principle, construct Stein kernels over any desired space. Unfortunately, Stein kernels obtained in this way are not always efficient to compute, and do not always coincide with Stein kernels currently in the literature. Consider the two following univariate cases:

  • $\mathcal{X} = \mathbb{R}$: If $\mu$ is an absolutely continuous probability measure on $\mathbb{R}$ with smooth density $p$ supported on a connected subset of $\mathbb{R}$, the overdamped Langevin stochastic differential equation

    $\mathrm{d}X_t = (\log p)'(X_t)\,\mathrm{d}t + \sqrt{2}\,\mathrm{d}W_t \qquad (15)$

    is an ergodic Feller process with stationary measure $\mu$. By Itô's lemma, its Markov generator acts on functions $f$ by

    $\mathcal{A}_\mu f = f'' + (\log p)'\, f'. \qquad (16)$

    Because $f$ is arbitrary, the extra derivative on $f$ is redundant for the purposes of obtaining a Stein operator, and one may instead consider

    $\mathcal{T}_\mu f = f' + (\log p)'\, f, \qquad (17)$

    which coincides with the canonical Stein operator on $\mathbb{R}$ [39, 53, 64]. The operator $\mathcal{T}_\mu$ in (17) is no longer the generator of a Markov process; however, it still induces a Stein kernel, as a consequence of [65, Lemma 4.34] (see also the remark after Proposition 1). Alternatively, in place of the Langevin diffusion (15), one could start with any other $\mu$-ergodic diffusion equation. See [26] for other alternatives, all of which induce Stein kernels under the reduction of derivatives, as in (17). (A worked Gaussian instance of (17) appears after this list.)

  • $\mathcal{X} = \mathbb{Z}$: If $p$ is a probability mass function on $\mathbb{Z}$ (or some other countable set, in which case one need only apply an injection into $\mathbb{Z}$), any birth-death chain on the integers with birth and death rates $b$ and $d$ respectively will have stationary measure $p$ provided it satisfies the detailed balance relations

    $b(x)\, p(x) = d(x+1)\, p(x+1), \quad x \in \mathbb{Z}.$

    An immediate choice is

    $b(x) = p(x+1), \qquad d(x) = p(x-1). \qquad (18)$

    Notice that the proposed solution (18) does not require that $p$ be normalized. The generator of this chain acts on functions $f$ by

    $\mathcal{A}_p f(x) = b(x)\,[f(x+1) - f(x)] + d(x)\,[f(x-1) - f(x)].$

    Analogously to how the generator in (16) acted on the derivatives of $f$, $\mathcal{A}_p$ acts on the (backward) differences of $f$. A similar reduction yields

    $\mathcal{T}_p f(x) = b(x)\, f(x+1) - d(x)\, f(x). \qquad (19)$

    It can be readily shown that (19) induces a Stein kernel. The Stein operator (19) is the most common in discrete scenarios, as seen in [8], and coincides with the Stein-Chen operator for the Poisson distribution when $b(x) = \lambda$ and $d(x) = x$, for example. This was also the approach taken in [70] for the construction of reproducing Stein kernels on discrete spaces.
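For instance, as a minimal worked instance of (17): for the standard normal target with density $p(x) \propto e^{-x^2/2}$, $(\log p)'(x) = -x$, so (17) reads $\mathcal{T}_\mu f(x) = f'(x) - x f(x)$, and the identity $\mathbb{E}[\mathcal{T}_\mu f(Z)] = 0$ for $Z \sim \mathcal{N}(0, 1)$ is exactly Stein's lemma: $\mathbb{E}[f'(Z)] = \mathbb{E}[Z f(Z)]$ for all absolutely continuous $f$ with $\mathbb{E}|f'(Z)| < \infty$.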

With (17) and (19) as building blocks, one can easily construct Stein kernels on higher-dimensional spaces that are products of $\mathbb{R}$ and $\mathbb{Z}$. The method for doing so is a decomposition of the state space into a product of subspaces, and it extends to products of arbitrary Polish spaces as well. Let $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$ and let $\mu$ be a distribution on $\mathcal{X}$. For each $i = 1, \dots, m$, let $\mu_i$ denote the conditional distribution of $\mu$ over its $i$-th subspace $\mathcal{X}_i$. If $\mathcal{T}_i$ is a Stein operator for $\mu_i$ for each $i$, then we may consider the operator

$\mathcal{T}_\mu = \sum_{i=1}^m \mathcal{T}_i \Pi_i,$

where $\Pi_i$ is the projection operator mapping $f$ to the function over $\mathcal{X}_i$ obtained by holding the remaining coordinates fixed. By linearity, if each $\mathcal{T}_i$ induces a Stein kernel, then $\mathcal{T}_\mu$ also induces a Stein kernel. This is formalized in Proposition 2, the proof of which builds upon the arguments of [53, Theorem 1].

Proposition 2 (Stein Kernels on Product Spaces).

Suppose that $\mathcal{X} = \mathcal{X}_1 \times \cdots \times \mathcal{X}_m$, and let $\mu$ be a probability measure on $\mathcal{X}$. For each $i$, let $\mu_i$ denote the conditional distribution of $\mu$ over its $i$-th subspace $\mathcal{X}_i$. Furthermore, for each $i = 1, \dots, m$, let $\mathcal{T}_i$ be a Stein operator for $\mu_i$ acting on $\mathcal{H}_i$, where $\mathcal{H}_i$ is a RKHS of scalar-valued functions with reproducing kernel $k_i$, that also induces a Stein kernel. Then, the operator $\mathcal{T}_\mu$ acting on functions as

$\mathcal{T}_\mu f = \sum_{i=1}^m \mathcal{T}_i \Pi_i f$

is a Stein operator for $\mu$ on $\mathcal{X}$, and its image under the corresponding tensor product space is a RKHS of scalar-valued functions on $\mathcal{X}$ with reproducing kernel

$k_0(x, y) = \sum_{i=1}^m (\mathcal{T}_i \otimes \mathcal{T}_i)\, k_i(x, y). \qquad (20)$
Proof.

Since the zero function is in and is a Stein operator for for each , for all if and only if the conditional distributions of are equivalent to those of . Therefore, is a Stein operator for on . For each and , let , and let . By the hypothesis, for any ,

Defining to be the space of functions such that the norm

is finite, [65, Theorem 4.21] implies that is the reproducing kernel Hilbert space with reproducing kernel

where the last line follows since each commutes with the inner product. Since can be seen to be the image of under , the result follows. ∎

Example 2 (Canonical Stein kernel on $\mathbb{R}^m$).

As an especially important example, if $p$ is a smooth density on $\mathbb{R}^m$, then applying (17) in each coordinate, we can take $\mathcal{T}_i f = \partial_{x_i} f + \partial_{x_i}(\log p)\, f$, and so, for $f = (f_1, \dots, f_m)$,

$\mathcal{T}_p f = \sum_{i=1}^m \left[ \partial_{x_i} f_i + \partial_{x_i}(\log p)\, f_i \right] = \nabla \cdot f + \langle \nabla \log p, f \rangle, \qquad (21)$

which is the well-known Stein operator for densities on $\mathbb{R}^m$. To build the corresponding Stein kernel, expanding out

$(\mathcal{T}_i \otimes \mathcal{T}_i)\, k(x, y) = \partial_{x_i}\partial_{y_i} k(x, y) + \partial_{x_i} k(x, y)\, \partial_{y_i} \log p(y) + \partial_{y_i} k(x, y)\, \partial_{x_i} \log p(x) + k(x, y)\, \partial_{x_i} \log p(x)\, \partial_{y_i} \log p(y).$

Therefore, the Stein kernel as given by (20) becomes

$k_0(x, y) = \nabla_x \cdot \nabla_y k(x, y) + \langle \nabla_x k(x, y), \nabla_y \log p(y) \rangle + \langle \nabla_y k(x, y), \nabla_x \log p(x) \rangle + k(x, y)\, \langle \nabla_x \log p(x), \nabla_y \log p(y) \rangle, \qquad (22)$

which is precisely the canonical Stein kernel of [53, 43, 39, 27]. There is an extensive literature available for choosing reproducing kernels on $\mathbb{R}^m$. The most common choices are radial basis function kernels of the form

$k(x, y) = \phi(\|x - y\|^2). \qquad (23)$

By the Schoenberg construction, such a kernel has the required positive-definiteness property (and therefore induces a RKHS) over any dimension $m$, if and only if $\phi$ is completely monotone, that is, for every non-negative integer $n$, the sign of the $n$-th derivative of $\phi$ is $(-1)^n$ for any $r > 0$ [69, Theorem 7.13]. Examples of completely monotonic functions include $\phi(r) = (1 + r)^{-\beta}$ (inducing the inverse multiquadric kernel) and $\phi(r) = e^{-\beta r}$ (inducing the Gaussian kernel) for any $\beta > 0$. Furthermore, for smooth functions $\phi$ and $\psi$, if $\phi$ and $\psi'$ are completely monotonic and $\psi \geq 0$, then $\phi \circ \psi$ is completely monotonic [48, Theorem 2]. Therefore, the function inducing the inverse-log kernel, a composition of this form, is also completely monotone.
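To make (22) and (23) concrete, the following sketch (our own illustration: the standard normal target, the inverse multiquadric parameters, and the sample sets are all assumptions made here for demonstration) evaluates the canonical Stein kernel with an inverse multiquadric base kernel and computes the squared KSD (14) of two sample sets.

    import numpy as np

    def grad_log_p(x):
        """Score function of the target; here a standard normal on R^m (illustrative)."""
        return -x

    def stein_kernel(x, y, c=1.0, beta=0.5):
        """Canonical Stein kernel (22) built from the inverse multiquadric base
        kernel k(x, y) = (c^2 + ||x - y||^2)^(-beta), a kernel of the form (23)."""
        diff = x - y
        base = c ** 2 + diff @ diff
        k = base ** (-beta)
        dk_dx = -2.0 * beta * base ** (-beta - 1) * diff    # gradient of k in x
        dk_dy = -dk_dx                                      # gradient of k in y
        # trace of mixed second derivatives, sum_i d^2 k / (dx_i dy_i)
        trace = (2.0 * beta * x.size * base ** (-beta - 1)
                 - 4.0 * beta * (beta + 1) * base ** (-beta - 2) * (diff @ diff))
        sx, sy = grad_log_p(x), grad_log_p(y)
        return trace + dk_dx @ sy + dk_dy @ sx + k * (sx @ sy)

    rng = np.random.default_rng(1)
    good = rng.normal(size=(200, 2))                # samples from the target itself
    off = rng.normal(loc=2.0, size=(200, 2))        # samples from a shifted distribution
    for samples in (good, off):
        gram = np.array([[stein_kernel(a, b) for b in samples] for a in samples])
        print(gram.mean())                          # squared KSD (14); larger for `off`

The same Gram matrix, computed on the output of an approximate sampler, is the input to the weight program (2).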

Example 3 (Complete conditional Stein kernel on $\mathbb{R}^m$).

Letting denote a reproducing kernel on , for a RKHS of functions , we can extend to an RKHS of functions by for . Since and are isomorphic, is a RKHS on with reproducing kernel satisfying for . The Stein operator (21) applied to (20) with these choices of underlying kernels yields the complete conditional Stein kernel introduced in [61, Theorem 1]:

(24)

The kernel (24) enjoys an increased sensitivity to single-dimensional perturbations over (22), which becomes especially apparent when the dimension is large. For further discussion, see [61].

Example 4 (Stein kernel on $\mathbb{Z}^m$).

Let $p$ be a probability mass function on $\mathbb{Z}^m$. Any reproducing kernel on $\mathbb{R}^m$ is also a reproducing kernel on $\mathbb{Z}^m$. Applying (19) with the choice of transition rates (18) to (20) immediately yields

where $e_i$ denotes the $i$-th unit coordinate vector. In cases where $p$ decreases rapidly in $x$, this choice of Stein kernel can succumb to significant numerical error. Instead, one would prefer to involve ratios of $p$ between neighbouring points. Observe that, if $k$ is a positive-definite kernel on $\mathbb{Z}^m$, then the rescaled kernel

$\tilde{k}(x, y) = \frac{k(x, y)}{p(x)\, p(y)} \qquad (25)$

is also a positive-definite (and therefore, reproducing) kernel on $\mathbb{Z}^m$. Applying (19) with the choice (18), taking $\tilde{k}$ as the base kernel, the Stein kernel becomes