1 Introduction
Bayesian methods have played an important role in modern machine learning, especially in unsupervised-learning settings. When facing big data, two lines of research have been developed to scale up Bayesian methods: variational-Bayes-based and sampling-based methods. Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a family of scalable Bayesian learning algorithms designed to efficiently sample from a target distribution such as a posterior distribution WellingT:ICML11 ; ChenFG:ICML14 ; DingFBCSN:NIPS14 ; ChenDC:NIPS15 . In principle, SGMCMC generates samples from a Markov chain, which are used to approximate a target distribution. Under a standard setting, samples from SGMCMC match a target distribution exactly in the infinite-sample regime TehTV:arxiv14 ; ChenDC:NIPS15 . However, this case never occurs in practice, as only a finite number of samples is available. Although non-asymptotic bounds w.r.t. the number of samples have been investigated TehTV:arxiv14 ; VollmerZT:arxiv15 ; ChenDC:NIPS15 , there is no theory or algorithm to guide learning an optimal set of fixed-size samples/particles. This is an undesirable property of SGMCMC, because given a fixed number of samples, one often wants to learn the optimal samples that best approximate a target distribution.
A remedy for this issue is to adopt the idea of particle-based sampling methods, where a set of particles (or samples) is initialized from some simple distribution and updated iteratively so that the particles approximate a target distribution better and better. The updating procedure is usually done by optimizing some objective function. There is not much work in this direction for Bayesian sampling, with an outstanding representative being Stein variational gradient descent (SVGD) LiuW:NIPS16 . In SVGD, the update of the particles is done by minimizing the KL divergence between the empirical particle distribution and a target distribution, so the samples are guaranteed to be optimal in each update. Because of this property, SVGD has been found to perform better than SGMCMC when the number of samples used to approximate a target distribution is limited LiuW:NIPS16 .
Little research has investigated the particle-optimization idea in SGMCMC. Inspired by SVGD, we develop a similar particle-optimization procedure for SGMCMC for more efficient sampling. To achieve this goal, we propose a novel technique to directly optimize particles based on a variational reformulation of the corresponding Fokker-Planck equation of an SGMCMC algorithm, adapted from JordanKO:MA98 . In this way, instead of sampling from a Markov chain sequentially, we evolve particles through an optimization procedure, obtaining both optimal particles and faster convergence compared to standard SGMCMC. Furthermore, under some relaxations, we show that particle optimization in SGMCMC can be regarded as an extension of SVGD with momentum. To the best of our knowledge, this is the first time particles can be optimized in SGMCMC algorithms.
2 Preliminaries
2.1 Stochastic gradient MCMC
Diffusion-based sampling methods
Generating random samples from a posterior distribution is a pervasive problem in Bayesian statistics with many important applications in machine learning. The Markov chain Monte Carlo (MCMC) method, proposed by Metropolis et al. Metropolis:53 , produces unbiased samples from a desired distribution when the density function is known up to a normalizing constant. However, traditional MCMC methods are based on random-walk proposals, which often lead to highly correlated samples. On the other hand, dynamics-based sampling methods, e.g., Hybrid Monte Carlo (HMC) Duane:87 ; Horowitz:91 , avoid this high degree of correlation by combining dynamical systems with a Metropolis step. The dynamical system uses information from the gradient of the log density to reduce the random-walk effect, and the Metropolis step serves as a correction of the discretization error introduced by the numerical integration of the dynamical system. In fact, these dynamical systems are derived from a more general mathematical construct called a diffusion process (or, more specifically, an Itô diffusion) Oksendal:85 .
Specifically, our objective is to generate random samples from a posterior distribution $p(\theta | \mathcal{D})$, where $\theta \in \mathbb{R}^r$ represents the model parameter and $\mathcal{D} = \{d_1, \ldots, d_N\}$ represents the data. The canonical form is $p(\theta | \mathcal{D}) = \frac{1}{Z} \exp(-U(\theta))$, where
$$U(\theta) = -\sum_{i=1}^{N} \log p(d_i | \theta) - \log p(\theta) \qquad (1)$$
is referred to as the potential energy, based on an i.i.d. assumption of the model, and $Z$ is the normalizing constant. In general, the posterior distribution corresponds to the (marginal) stationary distribution of a (continuous-time) Itô diffusion, defined as a stochastic differential equation of the form:
$$\mathrm{d}X_t = F(X_t)\,\mathrm{d}t + g(X_t)\,\mathrm{d}W_t \qquad (2)$$
where $t$ is the time index; $X_t \in \mathbb{R}^n$ represents the full set of variables in the dynamical system, and $X_t$ (thus $n \geq r$) is potentially an augmentation of the model parameter $\theta$; $W_t$ is $n$-dimensional Brownian motion. The functions $F: \mathbb{R}^n \to \mathbb{R}^n$ and $g: \mathbb{R}^n \to \mathbb{R}^{n \times n}$ are assumed to satisfy the Lipschitz continuity condition Ghosh:book11 . By the Fokker-Planck equation (or the forward Kolmogorov equation) Kolmogoroff:MA31 ; Risken:FPE89 , when the diffusion-coefficient functions $F$ and $g$ are appropriately designed, the stationary distribution of the corresponding Itô diffusion equals the posterior distribution of interest, $p(\theta | \mathcal{D})$. For example, the 1st-order Langevin dynamic defines $X_t = \theta$, $F(X_t) = -\nabla_\theta U(\theta)$, and $g(X_t) = \sqrt{2}\,\mathrm{I}_r$; the 2nd-order Langevin diffusion defines $X_t = (\theta, q)$, $F(X_t) = \begin{pmatrix} q \\ -B q - \nabla_\theta U(\theta) \end{pmatrix}$, and $g(X_t) = \sqrt{2B} \begin{pmatrix} 0 & 0 \\ 0 & \mathrm{I}_r \end{pmatrix}$ for a scalar $B > 0$; $q$ is an auxiliary variable known as the momentum ChenFG:ICML14 ; DingFBCSN:NIPS14 .
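As a concrete illustration (our own sketch, not code from the paper), the 1st-order Langevin dynamic above can be simulated with an Euler-Maruyama discretization. For a standard-normal target, $U(x) = x^2/2$ and $\nabla U(x) = x$:

```python
import numpy as np

def langevin_samples(grad_U, x0, step=0.01, n_steps=50000, seed=0):
    """Euler-Maruyama discretization of 1st-order Langevin dynamics:
    x_{k+1} = x_k - step * grad_U(x_k) + sqrt(2 * step) * N(0, 1)."""
    rng = np.random.default_rng(seed)
    x = float(x0)
    samples = []
    for _ in range(n_steps):
        x = x - step * grad_U(x) + np.sqrt(2 * step) * rng.standard_normal()
        samples.append(x)
    return np.array(samples)

# Target: standard normal, U(x) = x^2 / 2, so grad_U(x) = x.
samples = langevin_samples(lambda x: x, x0=3.0)
burn = samples[10000:]  # discard burn-in
print(burn.mean(), burn.var())  # mean ~ 0, variance ~ 1
```

With a fixed stepsize (and no Metropolis correction) the samples carry a small discretization bias, which is the usual trade-off accepted by SGMCMC methods.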
Denoting the distribution of $X_t$ as $\rho_t$, it is well known Risken:FPE89 that $\rho_t$ is characterized by the Fokker-Planck (FP) equation:
$$\partial_t \rho_t = -\nabla_x \cdot \big(\rho_t F(X_t)\big) + \nabla_x \nabla_x^\top : \Big(\tfrac{1}{2} \rho_t\, g(X_t) g(X_t)^\top\Big) \qquad (3)$$
where $\nabla_x \cdot (\rho F) \triangleq \sum_i \frac{\partial (\rho F_i)}{\partial x_i}$ for vectors $x$ and $F$, and $\nabla_x \nabla_x^\top : A \triangleq \sum_{i,j} \frac{\partial^2 A_{ij}}{\partial x_i \partial x_j}$ for matrices $\nabla_x \nabla_x^\top$ and $A$. The FP equation is the key to developing our particle-optimization framework for SGMCMC.
Stochastic gradient MCMC
SGMCMC algorithms are discretized numerical approximations of Itô diffusions. They mitigate the slow-mixing and non-scalability issues encountered by traditional MCMC algorithms by adopting gradient information of the posterior distribution, and by using minibatches of the data in each iteration to generate samples. To make the algorithms scalable in a big-data setting, three developments are applied to the Itô diffusion: (i) define appropriate functions $F$ and $g$ in the Itô-diffusion formula (2) so that the (marginal) stationary distribution coincides with the target posterior distribution $p(\theta | \mathcal{D})$; (ii) replace $F$ or $g$ with unbiased stochastic approximations to reduce the computational complexity, e.g., approximating $\nabla_\theta U(\theta)$ with a random subset of the data instead of the full data. For example, in the 1st-order Langevin dynamics, $\nabla_\theta U(\theta)$ can be approximated by an unbiased estimator based on a subset of the data:
$$\nabla_\theta \tilde{U}(\theta) = -\frac{N}{n} \sum_{i \in \mathcal{S}} \nabla_\theta \log p(d_i | \theta) - \nabla_\theta \log p(\theta) \qquad (4)$$
where $\mathcal{S}$ is a size-$n$ random subset of $\{1, \ldots, N\}$, leading to the first SGMCMC algorithm in machine learning, stochastic gradient Langevin dynamics (SGLD) WellingT:ICML11 ; and (iii) solve the generally intractable continuous-time Itô diffusion with a numerical method, e.g., the Euler method ChenDC:NIPS15 . For example, this leads to the following update in SGLD:
$$\theta_{l+1} = \theta_l - h \nabla_\theta \tilde{U}(\theta_l) + \sqrt{2h}\, \zeta_l$$
where $h$ is the stepsize, $l$ indexes the samples, and $\zeta_l \sim \mathcal{N}(0, \mathrm{I}_r)$ is a random sample from an isotropic normal distribution. After running the algorithm for $L$ steps, the collection of samples $\{\theta_l\}_{l=1}^{L}$, which are collected from a Markov chain, is used to approximate the unknown posterior distribution $p(\theta | \mathcal{D})$.
2.2 Stein variational gradient descent
Different from SGMCMC, SVGD initializes a set of particles and iteratively updates them so that the empirical particle distribution approximates the posterior distribution. Specifically, consider a set of particles $\{\theta_i\}_{i=1}^{M}$ drawn from a distribution $q(\theta)$. SVGD updates these particles by performing gradient descent on the space of probability distributions via
$$\theta_i \leftarrow \theta_i + \epsilon\, \phi(\theta_i) \qquad (5)$$
where $\phi$ is a perturbation direction chosen from a function class $\mathcal{F}$ to minimize the KL divergence between the updated empirical distribution $q_{[\epsilon \phi]}$ and the posterior $p(\theta | \mathcal{D})$, denoted $p$ for short. Since $\mathrm{KL}(q \| p)$ is convex in $q$, a global optimum of $q$ can be guaranteed. SVGD considers $\mathcal{F}$ as the unit ball of a vector-valued reproducing kernel Hilbert space (RKHS) associated with a kernel $\kappa(\cdot, \cdot)$. In such a setting, it was shown in (liu2016stein, ) that:
$$-\frac{\mathrm{d}}{\mathrm{d}\epsilon} \mathrm{KL}(q_{[\epsilon \phi]} \,\|\, p) \Big|_{\epsilon = 0} = \mathbb{E}_{\theta \sim q} \left[ \mathrm{trace}(\mathcal{A}_p \phi(\theta)) \right] \qquad (6)$$
where $\mathcal{A}_p \phi(\theta) = \nabla_\theta \log p(\theta | \mathcal{D})^\top \phi(\theta) + \nabla_\theta \cdot \phi(\theta)$ is called the Stein operator. Assuming that the update function $\phi$ is in an RKHS with kernel $\kappa(\cdot, \cdot)$, it was shown in (liu2016stein, ) that (6) is maximized with:
$$\phi^*(\theta) = \mathbb{E}_{\vartheta \sim q} \left[ \kappa(\vartheta, \theta) \nabla_\vartheta \log p(\vartheta | \mathcal{D}) + \nabla_\vartheta \kappa(\vartheta, \theta) \right] \qquad (7)$$
Approximating the expectation with the empirical particle distribution, we arrive at the following update for the particles at the $l$-th iteration:
$$\theta_i^{l+1} = \theta_i^{l} + \frac{\epsilon}{M} \sum_{j=1}^{M} \left[ \kappa(\theta_j^{l}, \theta_i^{l}) \nabla_{\theta_j^{l}} \log p(\theta_j^{l} | \mathcal{D}) + \nabla_{\theta_j^{l}} \kappa(\theta_j^{l}, \theta_i^{l}) \right] \qquad (8)$$
SVGD applies the updates in (8) repeatedly, and the samples move closer to the target distribution in each iteration.
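As an illustrative sketch (our own, not code from the paper), the update (8) for a one-dimensional standard-normal target with an RBF kernel $\kappa(x, y) = \exp(-(x-y)^2 / 2h^2)$ can be implemented as:

```python
import numpy as np

def svgd_update(theta, grad_logp, bandwidth=1.0):
    """One SVGD direction (Eq. 8): phi(x_i) = (1/M) sum_j [k(x_j, x_i) grad_logp(x_j)
    + d k(x_j, x_i) / d x_j], with an RBF kernel of fixed bandwidth."""
    M = len(theta)
    diff = theta[:, None] - theta[None, :]           # diff[j, i] = x_j - x_i
    K = np.exp(-diff**2 / (2 * bandwidth**2))        # K[j, i] = k(x_j, x_i)
    grad_K = -diff / bandwidth**2 * K                # d k(x_j, x_i) / d x_j
    phi = (K * grad_logp(theta)[:, None] + grad_K).sum(axis=0) / M
    return phi

# Target: standard normal, grad log p(x) = -x; particles start far from the target.
rng = np.random.default_rng(0)
theta = rng.normal(loc=5.0, scale=0.5, size=50)
for _ in range(2000):
    theta = theta + 0.1 * svgd_update(theta, lambda x: -x)
print(theta.mean(), theta.std())  # particles approach N(0, 1)
```

The first term drags each particle toward high-density regions, while the kernel-gradient term acts as a repulsive force that keeps the particles spread out; the original SVGD paper uses a median-heuristic bandwidth rather than the fixed one assumed here.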
2.3 Comparing SGMCMC with SVGD
SGMCMC is a Markov-chain-based sampling method, in the sense that samples are generated from a Markov chain and are potentially highly correlated. Furthermore, it often requires a large number of samples in order to approximate a target distribution reasonably well ChenDC:NIPS15 . In contrast, SVGD directly updates the particles to their optimum guided by an objective function, and thus requires far fewer samples to approximate a target distribution.
On the other hand, SVGD has been explained as a gradient flow whose gradient operator is defined on the RKHS liu2017stein_flow ; whereas SGMCMC corresponds to a flow whose operator is defined on $L^2$, the space of square-integrable functions. Since the RKHS is smaller than $L^2$, SGMCMC can potentially obtain better asymptotic properties than SVGD in theory liu2017stein_flow .
The above arguments motivate us to combine the advantages of both sides, i.e., we aim to develop a particle-based SGMCMC algorithm similar in spirit to SVGD.
3 Particle Optimization in SGMCMC
To develop our particle-optimization framework, we first introduce the following lemma, adapted from ChenLCWPC:arxiv17 , which views SGMCMC from an optimization perspective.
Lemma 1
Assume that $U$ is infinitely differentiable and $\|\nabla U(x) - \nabla U(y)\| \leq C \|x - y\|$ for some constant $C$. Let $h = T/K$ with $K$ being an integer, let $\rho^{(0)}$ be an arbitrary distribution with the same support as $p(\theta | \mathcal{D})$, and let $\{\rho^{(k)}\}_{k=1}^{K}$ be the solution of the sequence of functional optimization problems:
$$\rho^{(k)} = \arg\min_{\rho \in \mathcal{P}_2} \; \mathrm{KL}(\rho \,\|\, p) + \frac{1}{2h} W_2^2(\rho^{(k-1)}, \rho) \qquad (9)$$
where $W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|x - y\|_2^2 \, \mathrm{d}\gamma(x, y)$ is the 2nd-order Wasserstein distance, with $\Gamma(\mu, \nu)$ being the space of joint distributions with marginals $(\mu, \nu)$; $\mathcal{P}_2$ is the space of probability distributions with finite 2nd-order moment. Then $\rho^{(K)}$ converges to $\rho_T$ in the limit of $h \to 0$, i.e., $\lim_{h \to 0} \rho^{(K)} = \rho_T$, where $\rho_T$ is the solution of the FP equation (3) at time $T$.
Lemma 1 reveals an interesting way to compute $\rho_T$ via a sequence of functional optimization problems. Comparing it with the objective of SVGD, which minimizes the KL divergence between $\rho$ and $p$, each sub-optimization problem in Lemma 1 minimizes the KL divergence plus a regularization term, the Wasserstein distance between $\rho$ and the previous solution $\rho^{(k-1)}$. The extra Wasserstein-distance term arises naturally from the fact that the corresponding diffusion is a gradient flow equipped with a geometry associated with the Wasserstein distance (Otto:ARMA98, ). From another point of view, it is known that the Wasserstein distance is a better metric for probability distributions than the KL divergence, especially in the case of non-overlapping supports (ArjovskyB:ICLR17, ; ArjovskyCB:arxiv17, ).
According to Lemma 1, SGMCMC can be realized by iteratively solving the optimization problem in (9) for each iteration.
3.1 Optimizing on the space of probability distributions
Our idea of particle optimization is inspired by Lemma 1, in that we can obtain the optimal distribution for each iteration (which will be approximated by particles) by optimizing (9). Consequently, instead of running simulations based on the original Itô diffusion, we propose to directly solve (9) on the space of probability distributions $\mathcal{P}_2$. As will be shown, this allows us to derive algorithms which directly optimize particles in SGMCMC. However, it also brings challenges for the optimization, as the probability-distribution space is too flexible to derive exact solutions. In the following, we propose techniques to approximate the corresponding terms in (9). We denote $\theta^{(k)}$ as a sample from $\rho^{(k)}$, i.e., $\theta^{(k)} \sim \rho^{(k)}$.
3.1.1 Approximating the KLdivergence
Given $\rho^{(k-1)}$, the solution of (9) can be considered as an unknown transformation $\mathcal{T}$ from $\theta^{(k-1)}$ to $\theta^{(k)}$, i.e.,
$$\theta^{(k)} = \mathcal{T}(\theta^{(k-1)}) \qquad (10)$$
under the constraint that $\rho^{(k)}$ still lies in $\mathcal{P}_2$ after the transformation. Directly solving for the unknown transformation is challenging, so we propose two methods to solve it approximately, detailed below.
Optimizing with adversarial learning
Optimizing the KL divergence with an unknown transformation is generally infeasible. Instead, we approximate it with the Jensen-Shannon divergence, which shares the same optimal solution and is thus equivalent at the optimum. Formally, we first introduce the following lemma.
Lemma 2
Let $\rho$ and $p$ be probability distributions with the same support. Then the KL divergence $\mathrm{KL}(\rho \| p)$ is equivalent to the Jensen-Shannon divergence (JSD) $\mathrm{JSD}(\rho \| p)$ in the sense that they are both convex in $\rho$ and achieve the same minimum value of zero at $\rho = p$.
Based on Lemma 2, we can replace the KL term in (9) with the JSD, resulting in
$$\rho^{(k)} = \arg\min_{\rho \in \mathcal{P}_2} \; \lambda \, \mathrm{JSD}(\rho \,\|\, p) + \frac{1}{2h} W_2^2(\rho^{(k-1)}, \rho) \qquad (11)$$
where $\lambda$ is introduced to balance the difference between the scales of the KL divergence and the JSD.
It is well known that the JSD is the metric used for measuring the distance between two probability distributions in generative adversarial networks (GANs) GoodfellowAMXWOCB:NIPS14 . According to the properties of GANs, it is easy to see that the unknown transformation $\mathcal{T}$ plays the role of the generator network in a GAN. The only difference is that in our case, the latent space of the GAN is the same as the data space, which does not impact the learning algorithm. Consequently, we can update the transformation in each iteration by running a GAN update on its generator, which is then used to generate samples from $\rho^{(k)}$.
Optimizing via kernelized Stein operator
Optimizing with adversarial learning as described above incurs some computational overhead to update the transformation. Here we propose a more efficient method based on the kernelized Stein operator LiuW:NIPS16 . We first introduce a theoretical result from LiuW:NIPS16 in Lemma 3.
Lemma 3 (LiuW:NIPS16 )
Let $\mathcal{H}$ denote the reproducing kernel Hilbert space (RKHS) with kernel $\kappa(\cdot, \cdot)$, and $\mathcal{H}^n$ the space of vector functions $\phi$ with each component $\phi_i \in \mathcal{H}$. Let $\mathcal{T}(\theta) = \theta + \epsilon \phi(\theta)$, where $\phi \in \mathcal{H}^n$. Denote $\rho_{[\mathcal{T}]}$ to be the density of $\mathcal{T}(\theta)$ with $\theta \sim \rho$; then we have
$$\nabla_\epsilon \mathrm{KL}(\rho_{[\mathcal{T}]} \,\|\, p) \big|_{\epsilon = 0} = -\mathbb{E}_{\theta \sim \rho} \left[ \mathrm{trace}(\mathcal{A}_p \phi(\theta)) \right],$$
where $\mathcal{A}_p \phi(\theta) = \nabla_\theta \log p(\theta | \mathcal{D})^\top \phi(\theta) + \nabla_\theta \cdot \phi(\theta)$.
Lemma 3 essentially says that the functional gradient of the KL divergence is $\phi^*(\theta) = \mathbb{E}_{\vartheta \sim \rho} \left[ \kappa(\vartheta, \theta) \nabla_\vartheta \log p(\vartheta | \mathcal{D}) + \nabla_\vartheta \kappa(\vartheta, \theta) \right]$ when the transformation is of the form $\mathcal{T}(\theta) = \theta + \epsilon \phi(\theta)$ with $\phi$ restricted to $\mathcal{H}^n$. This result seems applicable to our problem (9), except that we require $\phi$ to lie in the larger space $L^2$ instead of $\mathcal{H}^n$. To compensate, we propose to inject noise into the functional gradient, leading to the following approximation:
$$\phi(\theta) = \mathbb{E}_{\vartheta \sim \rho} \left[ \kappa(\vartheta, \theta) \nabla_\vartheta \log p(\vartheta | \mathcal{D}) + \nabla_\vartheta \kappa(\vartheta, \theta) \right] + \sigma_k \xi \qquad (12)$$
where $\xi \sim \mathcal{N}(0, \mathrm{I}_n)$ and $\sigma_k$ controls the variance of the injected noise, which is typically decreased over the course of the algorithm to ensure convergence. Note that this resembles stochastic gradient descent, where full gradients are replaced with noisy gradients, so convergence can still be guaranteed.
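As a rough sketch of (12) (our own illustration; the $\sigma_k = \sigma_0 / k$ decay schedule is a hypothetical choice, not specified by the paper), the noise-injected direction is the kernelized Stein direction plus decaying Gaussian noise:

```python
import numpy as np

def noisy_svgd_direction(theta, grad_logp, k, sigma0=1.0, bandwidth=1.0, rng=None):
    """Noise-injected functional gradient (Eq. 12): the RBF-kernelized Stein
    direction plus Gaussian noise scaled by sigma_k = sigma0 / k, where k is
    the iteration counter (a hypothetical decay schedule for illustration)."""
    rng = rng or np.random.default_rng()
    diff = theta[:, None] - theta[None, :]           # diff[j, i] = x_j - x_i
    K = np.exp(-diff**2 / (2 * bandwidth**2))        # k(x_j, x_i)
    grad_K = -diff / bandwidth**2 * K                # d k(x_j, x_i) / d x_j
    phi = (K * grad_logp(theta)[:, None] + grad_K).sum(axis=0) / len(theta)
    return phi + (sigma0 / k) * rng.standard_normal(theta.shape)
```

As $k$ grows the injected noise vanishes and the direction reduces to the plain SVGD direction (7), mirroring the decreasing-stepsize argument used for SGLD.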
3.1.2 Approximating the Wasserstein distance
In order to calculate the Wasserstein term on the RHS of (9), we adapt results from optimal transport theory Villani:08 ; FeiziSXT:arxiv17 to rewrite $W_2^2(\rho^{(k-1)}, \rho)$ in its dual form:
$$W_2^2(\mu, \nu) = \mathbb{E}_{x \sim \mu} \|x\|_2^2 + \mathbb{E}_{y \sim \nu} \|y\|_2^2 - 2 \inf_{f \,\text{convex}} \left\{ \mathbb{E}_{x \sim \mu} f(x) + \mathbb{E}_{y \sim \nu} f^*(y) \right\} \qquad (13)$$
where $f^*$ is the convex conjugate of the convex function $f$.
Optimizing over $f$ is in general infeasible due to the need to search over the space of convex functions. However, we can approximate it by restricting $f$ to a family of simple convex functions. We describe some preliminary results below.
Restricting $f$ to be quadratic
We can define $f$ to be of the form $f(x) = \frac{1}{2} x^\top A x + b^\top x$ with parameters $(A \succ 0, b)$. In this case, the convex conjugate takes the closed form $f^*(y) = \frac{1}{2} (y - b)^\top A^{-1} (y - b)$. Substituting these forms into the inner-optimization part of (13) and maximizing w.r.t. $(A, b)$ by setting the derivatives to zero, we obtain the optimal parameters in closed form. Substituting the resulting formula into (13) and simplifying, we obtain an approximation of $W_2^2(\rho^{(k-1)}, \rho)$:
(14)
Restricting $f$ to other forms
It is also interesting to parameterize $f$ with other convex functions. We are currently investigating this direction.
3.2 Particle optimization
The sections above describe how to optimize the distribution $\rho^{(k)}$ directly from (9). In practice, one must adopt some representation of $\rho^{(k)}$ in order to perform the optimization. A simple representation uses particles for approximation, i.e., $\rho^{(k)}(\theta) = \frac{1}{M} \sum_{i=1}^{M} \delta_{\theta_i^{(k)}}(\theta)$, where $\delta_{\theta_i^{(k)}}(\theta)$ is a point mass at $\theta_i^{(k)}$ and $\{\theta_i^{(k)}\}_{i=1}^{M}$ is the set of particles at the $k$-th iteration.
Denote our objective as $J(\{\theta_i^{(k)}\})$, where $k$ indexes the iteration. We can update the particles by calculating their gradients and performing gradient descent. For example, when adopting the approximations (12) and (14), we have
$$\nabla_{\theta_i^{(k)}} J = -\phi(\theta_i^{(k)}) + \frac{1}{h} \left( \theta_i^{(k)} - \theta_i^{(k-1)} \right) \qquad (15)$$
where $\phi$ is the noisy functional gradient defined in (12).
SGMCMC vs SVGD with Polyak’s momentum gradient descent
Let $f(x)$ be the objective function to be minimized. Polyak's momentum gradient-descent update Polyak:CMMP64 is given by
$$x_{l+1} = x_l - \epsilon \nabla f(x_l) + \alpha \left( x_l - x_{l-1} \right) \qquad (16)$$
where $\epsilon$ is the stepsize and $\alpha$ is the momentum coefficient.
To see the relation of particle optimization in SGMCMC via (15) with SVGD, first note that because (13) is an upper bound for the 2nd-order Wasserstein distance, we should scale it by some constant $c$ in the implementation to approximate the true $W_2^2$. Based on the gradient formula in (15), the update equation for $\theta_i^{(k)}$ then becomes
$$\theta_i^{(k+1)} = \theta_i^{(k)} + \epsilon\, \phi(\theta_i^{(k)}) - \frac{c\epsilon}{h} \left( \theta_i^{(k)} - \theta_i^{(k-1)} \right),$$
which is in the same form as (16), with the gradient replaced by a noisy gradient. Thus particle optimization in SGMCMC with (15) can be regarded as SVGD with Polyak's momentum method.
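As a minimal sketch of the update form (16) (our own illustration on a scalar quadratic, not the particle update itself), Polyak's heavy-ball iteration can be written as:

```python
def polyak_momentum(grad_f, x0, lr=0.1, momentum=0.8, n_steps=200):
    """Polyak's momentum update (Eq. 16):
    x_{l+1} = x_l - lr * grad_f(x_l) + momentum * (x_l - x_{l-1})."""
    x_prev, x = x0, x0
    for _ in range(n_steps):
        x_new = x - lr * grad_f(x) + momentum * (x - x_prev)
        x_prev, x = x, x_new
    return x

# Minimize f(x) = (x - 3)^2, so grad_f(x) = 2 * (x - 3).
x_star = polyak_momentum(lambda x: 2 * (x - 3.0), x0=0.0)
print(x_star)  # converges close to the minimizer x = 3
```

In the particle-optimization view above, the role of $-\nabla f$ is played by the noisy SVGD direction (12), and the momentum coefficient is induced by the Wasserstein regularizer rather than chosen by hand.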
4 Empirical Verification
We test our algorithm against SVGD on a simple logistic-regression task. We use the same model and data as LiuW:NIPS16 . For SVGD, we adopt the same settings as in LiuW:NIPS16 . Note that we use the authors' implementation of SVGD LiuW:NIPS16 , which uses AdaGrad for optimization, so it is not strictly the standard SVGD (and should perform better). For our particle optimization in SGMCMC, we simply set the noise to be standard isotropic normal. We set the scaling constant of the Wasserstein term to 0.1, and use (15) to update the particles in SGMCMC. Figure 1 plots both test accuracies and test log-likelihoods w.r.t. the number of training iterations. It is clear that, without tuning, SGMCMC already obtains slightly faster convergence than SVGD in terms of both test accuracy and test log-likelihood.
5 Conclusion
We propose novel methods to directly optimize particles in SGMCMC, a paradigm that improves sample quality in standard SGMCMC under a limited number of samples. We develop our techniques by solving the Fokker-Planck equation of the corresponding SGMCMC on the space of probability distributions. A particular approximate solution of our framework results in SVGD with noisy gradients and momentum updates. To the best of our knowledge, this is the first time the relation between SGMCMC and SVGD has been established. A simple empirical result shows the improvement of our method over standard SVGD. More extensive experiments are to be performed.
References
 [1] M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
 [2] T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning (ICML), 2014.
 [3] N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Neural Information Processing Systems (NIPS), 2014.
 [4] C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient MCMC algorithms with highorder integrators. In Neural Information Processing Systems (NIPS), 2015.
 [5] Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. JMLR, 17(1):193–225, 2016.
 [6] S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non)asymptotic bias and variance of stochastic gradient Langevin dynamics. JMLR, 1:1–48, 2016.

 [7] Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Neural Information Processing Systems (NIPS), 2016.
 [8] R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
 [9] N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1092, 1953.
 [10] S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1987.
 [11] A. M. Horowitz. A Generalized Guided Monte Carlo Algorithm. Phys. Lett. B, 268:247–252, 1991.
 [12] B. Øksendal, editor. Stochastic Differential Equations. SpringerVerlag, Berlin, 1985.
 [13] A. P. Ghosh. Backward and Forward Equations for Diffusion Processes. Wiley Encyclopedia of Operations Research and Management Science, 2011.
 [14] A. Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104(1):415–458, 1931.
 [15] H. Risken. The FokkerPlanck equation. SpringerVerlag, New York, 1989.
 [16] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.
 [17] Qiang Liu. Stein variational gradient descent as gradient flow. In NIPS, 2017.
 [18] C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin. Continuoustime flows for deep generative models. In arXiv:1709.01179, 2017.
 [19] F. Otto. Dynamics of Labyrinthine pattern formation in magnetic fluids: A meanfield theory. Arch. Rational Mech. Anal., pages 63–103, 1998.
 [20] M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
 [21] M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Technical Report arXiv:1701.07875, March 2017.
 [22] I. Goodfellow, J. PougetAbadie, M. Mirza, B. Xu, D. WardeFarley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), 2014.
 [23] C. Villani. Optimal transport: old and new. Springer Science & Business Media, 2008.
 [24] S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. In arXiv:1710.10793, 2017.
 [25] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.