 # A stochastic version of Stein Variational Gradient Descent for efficient sampling

We propose in this work RBM-SVGD, a stochastic version of Stein Variational Gradient Descent (SVGD) method for efficiently sampling from a given probability measure and thus useful for Bayesian inference. The method is to apply the Random Batch Method (RBM) for interacting particle systems proposed by Jin et al to the interacting particle systems in SVGD. While keeping the behaviors of SVGD, it reduces the computational cost, especially when the interacting kernel has long range. Numerical examples verify the efficiency of this new version of SVGD.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The empirical measure with samples from some probability measure (which might be known up to a multiplicative factor) has many applications in Bayesian inference [1, 2] and data assimilation 

. A class of widely used sampling methods is the Markov Chain Monte Carlo (MCMC) methods, where the trajectory of a particle is given by some constructed Markov chain with the desired distribution invariant. The trajectory of the particle is clearly stochastic, and the Monte Carlo methods take effect slowly for small number of samples. Unlike MCMC, the Stein variational Gradient method (proposed by Liu and Wang in

) belongs to particle based variational inference sampling methods (see also [5, 6]

). These methods update particles by solving optimization problems, and each iteration is expected to make progress. As a non-parametric variational inference method, SVGD gives a deterministic way to generate points that approximate the desired probability distribution by solving an ODE system. Suppose that we are interested in some target probability distribution with density

(). In SVGD, one sets and solve the following ODE system for given initial points (see [4, 7]):

 ˙Xi=1NN∑j=1∇yK(Xi,Xj)−1NN∑j=1K(Xi,Xj)∇V(Xj), (1.1)

where is a symmetric positive definite kernel. When is large enough, the empirical measures constructed using is expected to be close to .

SVGD seems to be more efficient in the particle level for approximating the desired measure and interestingly, it reduces to the maximum a posterior (MAP) method when 

. It provides consistent estimation for generic distributions as Monte Carlo methods do, but with fewer samples. Theoretic understanding of (

1.1) is limited. For example, the convergence of the particle system (1.1) is still open. Recently, there are a few attempts for the understanding of the limiting mean field PDE [7, 8]. In particular, Lu et al  showed the convergence of the PDE to the desired measure.

Though (1.1) behaves well when the particle number is not very big, one sometimes still needs efficient algorithm to simulate (1.1). For example, in a typical MCMC method while in SVGD, one may have . Though is not large, simulating (1.1) needs work to compute the interactions for each iteration, especially for interaction kernels that are not super localized (such as kernels with algebraic decaying rate, like ). The computation cost of SVGD for these cases is therefore comparable with MCMC with larger number of particles. Hence, it is highly motivated to develop a cheap version of SVGD.

In this work, we propose RBM-SVGD, a stochastic version of SVGD for sampling from a given probability measure. The idea is very natural: we apply the random batch method in  to the interacting particle system (1.1

). Note that in the random batch method, the ’batch’ refers to the set for computing the interaction forces, not to be confused with the ’batch’ of samples for computing gradient as in stochastic gradient descent (SGD). Of course, if

is the loss function corresponding to many samples, or the probability density in Bayesian inference corresponding to many observed data, the mini-batch idea can be used to compute

in SVGD as well (see ). With the random batch idea for computing interaction, the complexity for each iteration now is only . Moreover, it inherits the advantages of SVGD (i.e. efficient for sampling when the number of particles is not large) since the random batch method is designed to approximate the particle system directly. In fact, we will prove that the one marginal of the random batch method converges to the one marginal of the interacting particle systems under Wasserstein-2 distance on fixed time interval . Note that the behavior of randomness in RBM-SVGD is different from that in MCMC. In MCMC, the randomness is required to ensure that the desired probability is invariant under the transition. The randomness in RBM-SVGD is simply due to the batch for computing the interaction forces, which is mainly for speeding the computation. Though this randomness is not essential for sampling from the invariant measure, it may have other benefits. For example, it may lead to better ergodic properties for particle system.

## 2 Background

We now give a brief introduction to the SVGD proposed in  and make some discussion. The derivation here is a continuous counterpart of that in .

Assume that random variable

has density . Consider some mapping and we denote the distribution of by , which is called the push-forward of under . The goal is to make closer to in some sense. The way to measure the closeness of measures in  is taken to be the Kullback-Leibler (KL) divergence, which is also known as the relative entropy, defined by

 KL(μ||ν)=EY∼μlog(dμdν(Y)), (2.1)

where is the well-known Radon-Nikodym derivative. In [4, Theorem 3.1], it is shown that the Frechet differential of is given by

 ⟨δGδT,ϕ⟩=−EY∼pSπϕ(Y),  ∀ϕ∈C∞c(Rd;Rd) (2.2)

where associated with a probability density is called the Stein operator given by

 Sqϕ(x)=∇(logq(x))⋅ϕ(x)+∇⋅ϕ(x). (2.3)

In fact, using the formula

 ddϵ(T+ϵϕ∘T)#p0|ϵ=0=ddϵ(I+ϵϕ)#p|ϵ=0=−pSpϕ=−∇⋅(pϕ), (2.4)

and , one finds

 ⟨δGδT,ϕ⟩=⟨δKL(p||π)δp,−∇⋅(pϕ)⟩=−∫RdpSπϕdx. (2.5)

The quantity can be understood as the directional derivative of in the direction given by .

Based on this calculation, we now consider a continuously varying family of mappings and

 ddτTτ=ϕτ∘Tτ.

Here, ’’ means composition, i.e. for any given , . In this sense is the trajectory of under this mapping; can be viewed as the so-called Lagrangian coordinate as in fluid mechanics while is the flow field. We denote

 pτ:=(Tτ)#p0. (2.6)

The idea is then to choose such that the functional decays as fast as possible. Note that to optimize the direction, we must impose the field to have bounded magnitude , where is some subspace of the functions defined on . The optimized curve is a constant speed curve (in some manifold). Hence, the problem is reduced to the following optimization problem

 sup{EY∼pSπϕ(Y)|∥ϕ∥H≤1}. (2.7)

It is observed in  that this optimization problem can be solved by a convenient closed formula if

is the so-called (vector) reproducing kernel Hilbert space (RKHS)

[10, 11]. A (scalar) RKHS is a Hilbert space, denoted by , consisting of functions defined on some space (in our case ) such that the evaluation function is continuous for all . There thus exists such that . Then the kernel is symmetric and positive definite, meaning that for any and . Reversely, given any positive definite kernel, one can construct a RKHS consisting of functions of the form where is some suitably given measure on . For example, if is the counting measure, choosing () can recover the form of RKHS in . All such constructions yield isomorphic RKHS as guaranteed by Moore-Aronszajn theorem . Now, consider a given and to be the vector RKHS:

 H={f=∫RdK(⋅,y)ψ(y)dμ(y)∣∣ψ:Rd→Rd,∬Rd×RdK(x,y)ψ(x)⋅ψ(y)dμ(x)dμ(y)<∞}.

The inner product is defined as

 ⟨f(1),f(2)⟩H=∬Rd×RdK(x,y)ψ(1)(x)⋅ψ(2)(y)dμ(x)dμ(y)=d∑j=1∬Rd×RdK(x,y)ψ(1)j(x)ψ(2)j(y)dμ(x)dμ(y). (2.8)

This inner product therefore induces a norm . Clearly, consists of functions with to be finite. The optimization problem (2.7) can be solved by the Lagrange multiplier method

 L=∫Rd(Sπϕ)pτ(y)dy−λ∬Rd×RdK(x,y)ψ(x)⋅ψ(y)dμ(x)dμ(y),

where means Lebesgue measure and . Using , we find

 2λϕ=∫RdK(x,y)(S∗πpt)(y)dy=:V(pt), (2.9)

where is given by

 S∗π(f)=f(y)∇(logπ)−∇f(y)=−f(y)∇V(y)−∇f(y). (2.10)

The ODE flow

 ddτTτ=12λ(τ)V(pτ)∘Tτ,

gives the constant speed optimal curve, so that the velocity is the unit vector in along the gradient of . Re-parametrizing the curve so that , and we denote , then

 ddtTt=V(ρt)∘Tt. (2.11)

Clearly, the curve of is not changed by this reparametrization. Using (2.4), one finds that satisfies the following equation

 ∂tρ=−∇⋅(V(ρt)ρ)=∇⋅(ρtK∗(ρ∇V+∇ρ)). (2.12)

Here, . It is easy to see that is invariant under this PDE. According to the explanation here, the right hand side gives the optimal decreasing direction of KL divergence if the transport flow is measured by RKHS. Hence, one expects it to be the negation of gradient of KL divergence in the manifold of probability densities with metric defined through RKHS. Indeed, Liu made the first attempt to justify this in [7, Sec. 3.4].

While everything looks great for continuous probability densities, the above theory does not work for empirical measures because the KL divergence is simply infinity. For empirical measure, must be in the distributional sense. However, the good thing for RKHS is that we can move the gradient from onto the kernel so that the flow (2.11) becomes (1.1), which makes perfect sense. In fact, if (1.1), holds, the empirical measure is a measure solution to (2.12) (by testing on smooth function ) [8, Proposition 2.5]. Hence, one expects that (1.1) will give approximation for the desired density. The numerical tests in  indeed justify this expectation. In this sense, the ODE system is formally a gradient flow of KL divergence, though the KL divergence functional is infinity for empirical measures.

Typical examples of include , Gaussian kernel for , and for 1D space . By Bochner’s theorem , if a function

has a positive Fourier transform, then

 K(x,y)=K(x−y) (2.13)

is a positive definite kernel. With this kernel, (1.1) becomes

 ˙Xi=−1NN∑j=1∇K(Xi−Xj)−1NN∑j=1K(Xi−Xj)∇V(Xj), (2.14)

as used in . Both Gaussians and with have positive Fourier transforms. The difference is that Gaussian has short range of interaction while the latter has long range of interaction. One can smoothen out by mollifying with Gaussian kernels, resulting in positive definite smooth kernels but with long range interaction. Choosing localized kernels like Gaussians may have some issues in very high dimensional spaces [13, 14]. Due to its simplicity, when the dimension is not very high, we choose Gaussian kernels in section 4.

As a further comment, one may consider other metric to gauge the closeness of probability measures, such as Wasserstein distances. Also, one can consider other norms for and get gradient flows in different spaces. These variants have been explored by some authors already [15, 16]. In general, computing the Frechet derivatives in closed form for these variants seems not that easy.

###### Remark 1.

If we optimize (2.7) for in spaces, the flow is then given by

 ddtT=(S∗πρ)∘T. (2.15)

The corresponding PDE is . This is in fact the case when we choose . This PDE, however, will not make sense for empirical measures since is hard to justify (Clearly, the corresponding ODE system has the same trouble.) By using RKHS, the derivative on can be moved onto the kernel and then the ODE system makes sense.

## 3 The new sampling algorithm: RBM-SVGD

We consider in general the particle system of the following form.

 ˙Xi=1NN∑j=1F(Xi,Xj)=1NF(Xi,Xi)+1N∑j:j≠iF(Xi,Xj). (3.1)

Here, does not have to be symmetric, and also is not necessarily zero.

### 3.1 The algorithms

We apply the random batch method in  to this particle system. In particular, choose a time step . We define time grid points

 tm=mη. (3.2)

The idea of random batch method is to form some random batches at and then turn on interactions inside batches only. As indicated in , the random division of the particles into batches takes operations (we can for example use random permutation). Depending on whether we do batches without or with replacement, we can have different versions (see Algorithm 1 and 2). For the ODEs in the algorithms, one can apply any suitable ODE solver. For example, one can use the forward Euler discretization if is smooth like Gaussian kernels. If is singular, one may take and apply the splitting strategy in .

For the Stein Variational Gradient Descent (1.1), the kernel takes the following form.

 F(x,y)=∇yK(x,y)−K(x,y)∇V(y). (3.5)

Applying the random batch method to this special kernel and using any suitable ODE solvers, we get a class of sampling algorithms, which we will call RBM-SVGD. In this work, we will mainly focus on the ones without replacement. Some discussion for RBM-SVGD with or without replacement will be made in section 4.2. The one with forward Euler discretization (with possible variant step size) is shown in Algorithm 3. Clearly, the complexity is for each iteration.

Here, is the number of iterations and is the sequence of time steps, which play the same role as learning rate in stochastic gradient descent (SGD). There are often two choices of the time steps: (i) ; (ii) .

### 3.2 Theoretical results

We now give convergence analysis regarding the time continuous version of RBM-SVGD on torus (i.e. choosing the particular force (3.5) for Algorithm 1 and ). The derivation of SVGD clearly stays unchanged for torus. The reason we consider torus is that (1.1) is challenging to analyze in because of the nonlocal effect of the external force. On torus, all functions are smooth and bounded. Moreover, using bounded domains with periodic boundary condition can always approximate the problem in in practice.

Consider the random force for defined by

 fi(z):=(1−1N)1p−1∑j:j∈CF(xi,xj), (3.7)

where is the random batch that contains in the random batch method. Correspondingly, the exact force is given by . Define the ’noise’ by

 χi(z):=1N∑j:j≠iF(xi,xj)−fi(z). (3.8)

We have the following consistency result regarding the random batch.

###### Theorem 1.

For given (or ), it holds that

 Eχi(z)=0 (3.9)

Moreover, the second moment is given by

 E|χi(z)|2=(1−1N)2(1p−1−1N−1)Λi(z), (3.10)

where

 Λi(z)=1N−2∑j:j≠i∣∣F(xi,xj)−1N−1∑k:k≠iF(xi,xk)∣∣2. (3.11)

The proof is similar as in , but we also attach it in the appendix for convenience.

We now state the convergence result for the time continuous version of RBM-SVGD, where . We use to denote the process generated by the random algorithm while is the process by (1.1).

###### Theorem 2.

Assume and are smooth on torus . The initial data are drawn independently from the same initial distribution. Given , there exists , such that . Consequently, the one marginals and are close under Wasserstein- distance:

 W2(μ(1)N,~μ(1)N)≤C(T)√η.

We recall that the Wasserstein- distance is given by 

 W2(μ,ν)=(infγ∈Π(μ,ν)∫Td×Td|x−y|2dγ)1/2, (3.12)

where

is called the transport plan, consisting of all the joint distributions whose marginal distributions are

and respectively: i.e. for any Borel set , and .

As can be seen in the proof (in appendix), the main contribution in the local truncation error comes from the variance of the the noise

. We believe the error bound here can be made independent of due to the intrinsic structure of SVGD discussed above in section 2. Often, such long time estimates are established by some contracting properties, so one may want to find the intrinsic converging structure of (1.1). However, rigorously establishing such results seems nontrivial due to the nonlocal effects of the external forces (the terms).

## 4 Experiments

We consider some test examples in 

to validate RBM-SVGD algorithm and compare with the original SVGD algorithm. In particular, in a toy example for 1D Gaussian mixture, RBM-SVGD was proved to be effective in the sense that the particle system converges to the expected distribution with less running time than the original SVGD method. A more practical example, namely Bayesian logistic regression, is also considered to verify the effectiveness of RBM-SVGD on large datasets in high dimension. Competitive prediction accuracy is presented by RBM-SVGD, and less time is needed. Hence, RBM-SVGD seems to be a more efficient method when large number of particles are required.

### 4.1 1D Gaussian Mixture

As a first example, we use the Gaussian mixture probability in  for RBM-SVGD. The initial distribution is , Gaussian with mean and variance 1. The target measure is given by the following Gaussian mixture

 π(x)=13⋅1√2πe−(x+2)2/2+23⋅1√2πe−(x−2)2/2. (4.1)

The kernel for the RKHS is the following Gaussian kernel

 K(x)=1√4πe−x2/4. (4.2)

As a preliminary comparison, we show in Figure 1 a numerical result of RBM-SVGD and the figure of SVGD cut from . The red curve is the desired density while green curve is the empirical density by simple bin counting (using ’histcounts’ in MATLAB to obtain necessary data). The RBM-SVGD used here is Algorithm 3, i.e., random batch method without replacement. The parameters are , , , in Figure 1. Clearly, RBM-SVGD works for this toy example and shows similar result as the original SVGD. As stated in , the difficulty lies in the strong disagreement between the initial density function and . RBM-SVGD inherits the advantage of SVGD in the sense that it can conquer the challenge and also show compelling result with SVGD.

To further investigate RBM-SVGD, we conduct more simulations using the Gaussian kernel (4.2) and the same parameters (, , terminal time ). The only varying parameter is the batch size , which is taken to be: . Figure 2 presents the results with the same initial values, sampling from . A distribution close to the expected one could be developed for all batch sizes considered. Figure 2: Result of different batch sizes. Red curve is the desired density while green curves are empirical densities generated by bin counting.

Provided the good performance of RBM-SVGD, we also check its computational cost. Table 1 displays the average running time of each different size over 50 simulations. We can tell from the chart that less time is needed in general by RBM-SVGD to achieve competitive performance compared with SVGD (the one with ).

Interestingly, in this example, the running time for is larger than the time for SVGD (), and a minimum of running time is attained at . The reason is that for small , the time required to update particles in each batch is roughly the same due to some optimization technique (in MATLAB). Dividing them into batches may hinder the efficiency and thus the time is related to number of batches. For example, requires double number of batches compared with . The ratio of running time of 8 and 16 is also close to 2 (). When the batch size becomes big, the computational cost for one batch then increases. Hence the ’decrease first, increase then’ of running time is reasonable. Of course, when computing gradient and evaluating is expensive so that the computational cost for each batch is proportional to , using smaller batch then takes smaller time. Hence, in practice, a proper selection of batch size when using RBM-SVGD might be an important factor to consider.

To check the sampling performance of RBM-SVGD, we use test function and compute the estimated expectation . The sampling accuracy is measured by computing the Minimum Square Error (MSE) over experiments:

 MSE=15050∑j=1(¯hj−EX∼ph(X))2.

Figure 3 shows the MSE when are chosen as test functions. Small batch sizes seem to have good performance already (large batch size seems only slightly better for high moments like ).

### 4.2 Bayesian Logistic Regression

In this experiment, we apply RBM-SVGD to complete Bayesian logistic regression for binary classification, using the Covertype dataset with 581012 data points and 54 features . Under the same setting as Gershman [18, 4], the regression weights are assigned with a Gaussian prior , and the variance satisfies where

represents the density of Gamma distribution. The inference is applied on posterior

with . The kernel is taken again to be the same Gaussian kernel as in the previous example.

Although  only took particles in the training process, we chose a larger number of particles to leave more space for the selection of batch size. The training is done on 80% of the whole dataset, and the other 20% of dataset is left as the test set. For particle system (1.1), computing is very expensive. In fact, we use the same strategy as mentioned in [4, section 3.2], i.e. using mini-batches of the samples to form a stochastic version of . Each time the function for computing is called, some sampling process needs to be done and this is time consuming. Hence, for each time interval, we call this function only once and compute for all particles, and then store them in an array. Then, for each batch, we only use the computed to update . Due to such reason, RBM with replacement seems unsuitable for this type of problems since we need to call the function for each batch, which is clearly time consuming. Below, all results are based on 50 simulations with 6000 iterations in each one.

Figure 4 presents the results of accuracy for different batch sizes (accuracy is computed as the average of simulations). RBM-SVGD functions as the original SVGD () in the context of real world data by providing close accuracy of prediction on binary classification on large amount of data in high dimension. No significant difference between different batch sizes was observed except when is very small (). Even if the batch size was chosen as , only 6% of decreasing of accuracy was presented.

The average running times (over experiments) are compared in table 2. The minimum running time is obtained when batch size is 32, due to similar reason as in the previous example. This optimal time is merely about 1/7 the time for SVGD (batch size of 512). Hence, RBM-SVGD can significantly reduce training time while preserving model performance in the real world problem.

## 5 Conclusion

We have applied the random batch method for interacting particle systems to SVGD, resulting in RBM-SVGD, which turns out to be a cheap sampling algorithm and inherits the efficiency of the original SVGD algorithm. Theory and Numerical experiments have validated the algorithm and hence, it can potentially have many applications, like Bayesian inference.

The choice of batch size for RBM-SVGD is a practical issue because large number of batches many hinder some optimization for computing interaction within batches. Moreover, as a hybrid strategy, one may increase the batch size as time goes on to increase the accuracy.

## Acknowledgement

This work is supported by KI-Net NSF RNMS11-07444 . The work of J.-G. Liu was partially supported by NSF DMS-1812573, and the work of J. Lu was supported in part by NSF DMS-1454939.

## Appendix A Proof of Theorem 1

###### Proof of Theorem 1.

The proof is pretty like the one in . We use the random variable to indicate whether and are in a common batch. In particular, if and are in a common batch while if otherwise. Then, it is not hard to compute (see )

 E1I(i,j)=1=p−1N−1,P(I(i,j)I(j,k)=1)=(p−1)(p−2)(N−1)(N−2). (A.1)

We note

 χi(x)=1N∑j:j≠i(1−N−1p−1I(i,j))F(xi,xj). (A.2)

The first equation in (A.1) clearly implies that . Using (A.1), we can compute directly that

 E|χi(x)|2=1N2(∑j:j≠i(N−1p−1−1)|F(xi,xj)|2+∑j,k:j≠i,k≠i,j≠k((N−1)(p−2)(N−2)(p−1)−1)F(xi,xk)⋅F(xi,xj))

Rearranging this, we get the claimed expression. ∎

## Appendix B Proof of Theorem 2

###### Proof of Theorem 2.

In the proof below, the constant will represent a general constant independent of and , but its concrete meaning can change for every occurrence.

Consider the corresponding two processes:

 ddt~Xi=1N(∇yK(~Xi,~Xi)−K(~Xi,~Xi)∇V(~Xi))+1−1/Np−1∑j:j∈C(∇yK(~Xi,~Xj)−K(~Xi,~Xj)∇V(~Xj)),ddtXi=1N(∇yK(Xi,Xi)−K(Xi,Xi)∇V(Xi))+1N∑j:j≠i(∇yK(Xi,Xj)−K(Xi,Xj)∇V(Xj)). (B.1)

Taking the difference and dotting with , one has

 (~Xi−Xi)⋅ddt(~Xi(t)−Xi(t))≤CN|~Xi(t)−Xi(t)|2+(~Xi(t)−Xi(t))⋅(I1+I2)

where

 I1=1−1/Np−1(∑j:j∈C(∇yK(~Xi,~Xj)−K(~Xi,~Xj)∇V(~Xj))−∑j:j∈C(∇yK(Xi,Xj)−K(Xi,Xj)∇V(Xj))),
 I2=1−1/Np−1∑j:j∈C(∇yK(Xi,Xj)−K(Xi,Xj)∇V(Xj))−1N∑j:j≠i(∇yK(Xi,Xj)−K(Xi,Xj)∇V(Xj)).

Hence, introducing

 u(t)=E|Xi(t)−~Xi(t)|2=E|X1(t)−~X1(t)|2,

we have

 ddtu≤CNu(t)+E(Xi−~Xi)⋅I1+E(Xi−~Xi)⋅I2.

Due to the smoothness of and on torus, we easily find

 |I1|≤C1p−1∑j∈C,j≠i(|Xi−~Xi|+|Xj−~Xj|),

where is independent of . Hence,

 E(Xi−~Xi)⋅I1≤Cu(t),

where is independent of .

Letting , one sees easily that . Define the error process

 Yi(t)=~Xi(t)−Xi(t).

Then, we find

 Yi(t)⋅I2(t)=(Yi(t)−Yi(tm))⋅χi(Z(t))+Yi(tm)⋅χi(Z(t))=J1+J2.

In , is independent of the random batch division at . Then, Theorem 1 tells us that

 EJ2=0.

Using (B.1), we have

 Yi(t)−Yi(tm)=∫ttmfi(~Z(s))−Fi(Z(s))ds=−∫ttmχi(Z(s))ds+∫ttmfi(~Z(s))−fi(Z(s))ds

Clearly,

 ∣∣∣E∫ttmχi(Z(s))⋅χi(Z(t))ds∣∣∣≤Cη,

where is related to the infinity norm of the variance of . This is the main term in the local truncation error. Just as we did for ,

 |fi(~Z(s))−fi(Z(s))|≤C1p−1∑j∈C,j≠i(|Xi−~Xi|+|Xj−~Xj|).

Hence, we find

 EJ1≤C(√uη+η)

Eventually,

 ddtu≤C(u+√uη+η)≤Cu+Cη.

Applying Grönwall’s inequality, we find

 supt≤Tu(t)≤C(T)η.

The last claim for distance follows from the definition of . ∎