Bayesian methods play an important role in modern machine learning, especially in unsupervised-learning settings. When facing big data, two lines of research have been developed to scale up Bayesian methods, namely variational-Bayes-based and sampling-based methods. Stochastic gradient Markov chain Monte Carlo (SG-MCMC) is a family of scalable Bayesian learning algorithms designed to efficiently sample from a target distribution such as a posterior distribution WellingT:ICML11 ; ChenFG:ICML14 ; DingFBCSN:NIPS14 ; ChenDC:NIPS15 . In principle, SG-MCMC generates samples from a Markov chain, which are used to approximate a target distribution. Under a standard setting, samples from SG-MCMC match a target distribution exactly in the infinite-sample regime TehTV:arxiv14 ; ChenDC:NIPS15 . In practice, however, this case never occurs, as only a finite number of samples is available. Although nonasymptotic bounds w.r.t. the number of samples have been investigated TehTV:arxiv14 ; VollmerZT:arxiv15 ; ChenDC:NIPS15 , there is no theory or algorithm to guide learning an optimal set of fixed-size samples/particles. This is an undesirable property of SG-MCMC, because given a fixed number of samples, one often wants to learn the optimal samples that best approximate a target distribution.
A remedy for this issue is to adopt the idea of particle-based sampling methods, where a set of particles (or samples) is initialized from some simple distribution and updated iteratively so that the particles approximate a target distribution increasingly well. The updating procedure is usually done by optimizing some objective function. There is little work in this direction for Bayesian sampling, an outstanding representative being Stein variational gradient descent (SVGD) LiuW:NIPS16 . In SVGD, the update of the particles is done by optimizing the KL divergence between the empirical particle distribution and a target distribution, so the samples are guaranteed to be optimal in each update. Because of this property, SVGD has been found to perform better than SG-MCMC when the number of samples used to approximate a target distribution is limited LiuW:NIPS16 .
Little research has been done on the particle-optimization idea in SG-MCMC. Inspired by SVGD, we develop a similar particle-optimization procedure for SG-MCMC for more efficient sampling. To achieve this goal, we propose a novel technique to directly optimize particles based on a variational reformulation of the Fokker-Planck equation of an SG-MCMC algorithm, adapted from JordanKO:MA98 . In this way, instead of sampling from a Markov chain sequentially, we evolve particles through an optimization procedure, obtaining both optimal particles and faster convergence compared to standard SG-MCMC. Furthermore, under some relaxations, we show that particle optimization in SG-MCMC can be regarded as an extension of SVGD with momentum. To the best of our knowledge, this is the first time particles can be optimized in SG-MCMC algorithms.
2.1 Stochastic gradient MCMC
Diffusion-based sampling methods
Generating random samples from a posterior distribution is a pervasive problem in Bayesian statistics with many important applications in machine learning. The Markov chain Monte Carlo (MCMC) method, proposed by Metropolis et al. Metropolis:53 , produces unbiased samples from a desired distribution when the density function is known up to a normalizing constant. However, traditional MCMC methods are based on random-walk proposals, which often lead to highly correlated samples. On the other hand, dynamics-based sampling methods, e.g., Hybrid Monte Carlo (HMC) Duane:87 ; Horowitz:91 , avoid this high degree of correlation by combining dynamical systems with the Metropolis step. The dynamical system uses information from the gradient of the log density to reduce the random-walk effect, and the Metropolis step serves as a correction of the discretization error introduced by the numerical integration of the dynamical system. In fact, these dynamical systems are derived from a more general mathematical object called a diffusion process (or, more specifically, an Itô diffusion) Oksendal:85 .
Specifically, our objective is to generate random samples from a posterior distribution $p(\theta | \mathcal{D})$, where $\theta$ represents the model parameter and $\mathcal{D} = \{d_1, \ldots, d_N\}$ represents the data. The canonical form is $p(\theta | \mathcal{D}) = \frac{1}{Z} \exp(-U(\theta))$, where
$U(\theta) = -\log p(\mathcal{D} | \theta) - \log p(\theta) = -\sum_{i=1}^N \log p(d_i | \theta) - \log p(\theta)$
is referred to as the potential energy based on an i.i.d. assumption of the model, and $Z$ is the normalizing constant. In general, the posterior distribution corresponds to the (marginal) stationary distribution of a (continuous-time) Itô diffusion, defined as a stochastic differential equation of the form:
$\mathrm{d}\Theta_t = F(\Theta_t)\,\mathrm{d}t + g(\Theta_t)\,\mathrm{d}\mathcal{W}_t,$
where $t$ is the time index; $\Theta_t \in \mathbb{R}^p$ represents the full variables in the dynamical system, and is thus potentially an augmentation of the model parameter $\theta$; $\mathcal{W}_t$ is $p$-dimensional Brownian motion. The functions $F$ and $g$ are assumed to satisfy the Lipschitz continuity condition Ghosh:book11 . By the Fokker-Planck equation (or the forward Kolmogorov equation) Kolmogoroff:MA31 ; Risken:FPE89 , when the diffusion-coefficient functions $F$ and $g$ are appropriately designed, the stationary distribution of the corresponding Itô diffusion equals the posterior distribution of interest, $p(\theta | \mathcal{D})$. For example, the 1st-order Langevin dynamic defines $\Theta_t = \theta_t$, $F(\Theta_t) = -\nabla_\theta U(\theta_t)$, and $g(\Theta_t) = \sqrt{2}\,\mathrm{I}$; the 2nd-order Langevin diffusion defines $\Theta_t = (\theta_t, q_t)$, $F(\Theta_t) = \big(q_t, \; -B q_t - \nabla_\theta U(\theta_t)\big)$, and $g(\Theta_t) = \sqrt{2 B}\,\mathrm{diag}(\mathbf{0}, \mathrm{I})$ for a scalar $B > 0$; $q$ is an auxiliary variable known as the momentum ChenFG:ICML14 ; DingFBCSN:NIPS14 .
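To make the diffusion concrete, here is a minimal sketch (ours, not code from the paper) that simulates the 1st-order Langevin dynamic with an Euler-Maruyama discretization for a standard-normal target; the function name `langevin_samples` and all step sizes are illustrative choices:

```python
import numpy as np

def langevin_samples(grad_log_p, theta0, step=1e-2, n_steps=5000, seed=0):
    """Euler-Maruyama discretization of the 1st-order Langevin diffusion
    d theta_t = grad log p(theta_t) dt + sqrt(2) dW_t."""
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    out = []
    for _ in range(n_steps):
        theta = theta + step * grad_log_p(theta) \
                + np.sqrt(2.0 * step) * rng.standard_normal(theta.shape)
        out.append(theta.copy())
    return np.array(out)

# Target N(0, 1): F(theta) = -grad U(theta) = -theta.
samples = langevin_samples(lambda th: -th, theta0=[3.0])
print(samples[1000:].mean(), samples[1000:].std())  # roughly 0 and 1 after burn-in
```

The discretization error shrinks with the stepsize; the Metropolis correction mentioned above is omitted here, as in SG-MCMC.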
Stochastic gradient MCMC
SG-MCMC algorithms are discretized numerical approximations of Itô diffusions. They mitigate the slow mixing and non-scalability issues of traditional MCMC algorithms by adopting gradient information of the posterior distribution and by using minibatches of the data in each iteration to generate samples. To make the algorithms scalable in a big-data setting, three developments are applied to the Itô diffusion: (i) define appropriate functions $F$ and $g$ in the Itô-diffusion formula so that the (marginal) stationary distribution coincides with the target posterior distribution $p(\theta | \mathcal{D})$; (ii) replace $F$ or $g$ with unbiased stochastic approximations to reduce the computational complexity, e.g., approximating $F$ with a random subset of the data instead of the full data. For example, in the 1st-order Langevin dynamic, $\nabla_\theta U(\theta)$ can be approximated
by an unbiased estimator based on a subset of the data:
$\nabla_\theta \tilde{U}(\theta) = -\frac{N}{n} \sum_{i \in \mathcal{S}} \nabla_\theta \log p(d_i | \theta) - \nabla_\theta \log p(\theta),$
where $\mathcal{S} \subset \{1, \ldots, N\}$ is a size-$n$ random subset of the data indices, leading to the first SG-MCMC algorithm in machine learning, stochastic gradient Langevin dynamics (SGLD) WellingT:ICML11 ; and (iii) solve the generally intractable continuous-time Itô diffusion with a numerical method, e.g., the Euler method ChenDC:NIPS15 . For example, this leads to the following update in SGLD:
$\theta_{l+1} = \theta_l - h \nabla_\theta \tilde{U}(\theta_l) + \sqrt{2 h}\,\zeta_l,$
where $h$ is the stepsize, $l$ indexes the samples, and
$\zeta_l \sim \mathcal{N}(\mathbf{0}, \mathrm{I})$ is a random sample from an isotropic normal distribution. After running the algorithm for $L$ steps, the collection of samples $\{\theta_l\}_{l=1}^L$, which are collected from a Markov chain, is used to approximate the unknown posterior distribution $p(\theta | \mathcal{D})$.
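The SGLD recipe above can be sketched end-to-end on a toy model, estimating a Gaussian mean under a Gaussian prior; the data size, minibatch size, stepsize, and prior scale below are our assumptions, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
N, n, h = 1000, 100, 1e-4              # data size, minibatch size, stepsize (assumed)
data = rng.normal(2.0, 1.0, size=N)    # observations with unknown mean theta

def stoch_grad_U(theta, batch):
    """Unbiased estimate of grad U(theta): rescale the minibatch likelihood term."""
    lik = -(N / n) * np.sum(batch - theta)  # -(N/n) * sum_i grad log N(d_i; theta, 1)
    prior = theta / 100.0                   # -grad log N(theta; 0, 10^2)
    return lik + prior

theta, samples = 0.0, []
for l in range(5000):
    batch = rng.choice(data, size=n, replace=False)
    theta = theta - h * stoch_grad_U(theta, batch) + np.sqrt(2 * h) * rng.standard_normal()
    samples.append(theta)
print(np.mean(samples[1000:]))  # close to the posterior mean, i.e. near 2.0
```

Only $n$ of the $N$ likelihood gradients are touched per iteration, which is what makes the method scalable.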
2.2 Stein variational gradient descent
Different from SG-MCMC, SVGD initializes a set of particles and iteratively updates them so that the empirical particle distribution approximates the posterior distribution. Specifically, SVGD considers a set of particles $\{\theta_i\}_{i=1}^M$ drawn from a distribution $q(\theta)$, and updates the particles by performing gradient descent on the space of probability distributions via
$\theta_i \leftarrow \theta_i + \epsilon\,\phi(\theta_i),$
where $\phi$ is a perturbation direction chosen to minimize the KL divergence between the updated empirical distribution $q$ and the posterior $p(\theta | \mathcal{D})$, denoted $p$ for short. Since $\mathrm{KL}(q \,\|\, p)$ is convex in $q$, a global optimum can be guaranteed. SVGD restricts $\phi$ to the unit ball of a vector-valued reproducing kernel Hilbert space (RKHS) associated with a kernel $\kappa(\cdot, \cdot)$. In such a setting, it was shown in (liu2016stein, ) that:
$\phi^*(\theta) \propto \mathbb{E}_{\vartheta \sim q}\big[\kappa(\vartheta, \theta) \nabla_\vartheta \log p(\vartheta) + \nabla_\vartheta \kappa(\vartheta, \theta)\big].$
Approximating the expectation with the empirical particle distribution, we arrive at the following update for the particles at the $l$-th iteration:
$\theta_i^{l+1} = \theta_i^{l} + \frac{\epsilon}{M} \sum_{j=1}^M \big[\kappa(\theta_j^{l}, \theta_i^{l}) \nabla_{\theta_j^{l}} \log p(\theta_j^{l}) + \nabla_{\theta_j^{l}} \kappa(\theta_j^{l}, \theta_i^{l})\big].$
SVGD applies the update in (8) repeatedly, moving the samples closer to the target distribution in each iteration.
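A compact sketch of the SVGD update with an RBF kernel and the common median-heuristic bandwidth (both are our illustrative choices, not mandated by the text) is:

```python
import numpy as np

def svgd_step(theta, grad_log_p, eps=0.1):
    """One SVGD update for particles theta (M x d) with an RBF kernel."""
    M = theta.shape[0]
    sq = np.sum((theta[:, None, :] - theta[None, :, :]) ** 2, axis=-1)
    h = np.median(sq) / np.log(M + 1) + 1e-8   # median-heuristic bandwidth
    K = np.exp(-sq / h)                        # kernel matrix K_ij = kappa(theta_i, theta_j)
    grad = np.array([grad_log_p(t) for t in theta])
    # phi(theta_i) = (1/M) sum_j [K_ji grad_j + grad_{theta_j} K_ji]  (driving + repulsive terms)
    phi = (K @ grad + (2.0 / h) * (K.sum(1)[:, None] * theta - K @ theta)) / M
    return theta + eps * phi

theta = np.random.default_rng(0).normal(5.0, 1.0, size=(50, 1))
for _ in range(1000):
    theta = svgd_step(theta, lambda t: -t)     # target N(0, 1)
print(theta.mean(), theta.std())
```

The second term in `phi` is the kernel-gradient "repulsive force" that keeps the particles from collapsing onto a single mode.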
2.3 Comparing SG-MCMC with SVGD
SG-MCMC is a Markov-chain-based sampling method, in the sense that samples are generated from a Markov chain and are potentially highly correlated. Furthermore, it often requires a large number of samples to approximate a target distribution reasonably well ChenDC:NIPS15 . In contrast, SVGD directly updates the particles toward their optimum guided by an objective function, and thus requires far fewer samples to approximate a target distribution.
On the other hand, SVGD has been interpreted as a gradient flow whose gradient operator is defined on the RKHS liu2017stein_flow , whereas SG-MCMC corresponds to a flow whose operator is defined on $L^2$, the space of square-integrable functions. Since the RKHS is smaller than $L^2$, SG-MCMC can potentially obtain better asymptotic properties than SVGD in theory liu2017stein_flow .
The above arguments motivate us to combine the advantages of both sides, i.e., we aim to develop a particle-based SG-MCMC algorithm in the spirit of SVGD.
3 Particle Optimization in SG-MCMC
To develop our particle-optimization framework, we first introduce the following lemma adapted from ChenLCWPC:arxiv17 , viewing SG-MCMC from an optimization perspective.
Lemma 1
Assume that $U$ is infinitely differentiable and $\|\nabla U(\theta)\| \leq C (1 + \|\theta\|)$ for some constant $C$. Let $h = T / K$ with $K$ an integer, let $\rho_0$ be an arbitrary distribution with the same support as $p(\theta | \mathcal{D})$, and let $\{\rho_k\}_{k=1}^K$ be the solution of the sequence of functional optimization problems:
$\rho_k = \operatorname*{arg\,min}_{\rho \in \mathcal{P}_2(\mathbb{R}^r)} \; \mathrm{KL}(\rho \,\|\, p) + \frac{1}{2 h} W_2^2(\rho_{k-1}, \rho),$
where $W_2^2(\mu, \nu) = \inf_{\gamma \in \Gamma(\mu, \nu)} \int \|\theta - \theta'\|_2^2 \,\mathrm{d}\gamma(\theta, \theta')$ is the 2nd-order Wasserstein distance, $\Gamma(\mu, \nu)$ is the space of joint distributions of $(\mu, \nu)$, and $\mathcal{P}_2(\mathbb{R}^r)$ is the space of probability distributions with finite 2nd-order moment. Then $\rho_K$ converges to $\rho_T$ in the limit $h \rightarrow 0$, i.e., $\lim_{h \rightarrow 0} \rho_K = \rho_T$, where $\rho_T$ is the solution of the FP equation (3) at time $T$.
Lemma 1 reveals an interesting way to compute $\rho_T$ via a sequence of functional optimization problems. Comparing it with the objective of SVGD, which minimizes the KL divergence between $\rho$ and $p$, each sub-optimization problem in Lemma 1 minimizes the KL divergence plus a regularization term, the Wasserstein distance between $\rho$ and $\rho_{k-1}$. The extra Wasserstein-distance term arises naturally because the corresponding diffusion is a gradient flow equipped with a geometry associated with the Wasserstein distance (Otto:ARMA98, ). From another point of view, the Wasserstein distance is known to be a better metric for probability distributions than the KL divergence, especially in the case of non-overlapping supports (ArjovskyB:ICLR17, ; ArjovskyCB:arxiv17, ).
3.1 Optimizing on the space of probability distributions
Our idea of particle optimization is inspired by Lemma 1: we can obtain the optimal distribution $\rho_k$ for each iteration (which will be approximated by particles) by optimizing (9). Consequently, instead of simulating the original Itô diffusion, we propose to solve (9) directly on the space of probability distributions $\mathcal{P}_2(\mathbb{R}^r)$. As will be shown, this allows us to derive algorithms that directly optimize particles in SG-MCMC. However, it also brings challenges for the optimization, as the probability-distribution space is too flexible to derive exact solutions. In the following, we propose techniques to approximate the corresponding terms in (9). We denote by $\theta_k$ a sample from $\rho_k$, i.e., $\theta_k \sim \rho_k$.
3.1.1 Approximating the KL-divergence
Given $\rho_{k-1}$, the solution of (9) can be considered as an unknown transformation $T$ from $\rho_{k-1}$ to $\rho_k$, i.e., $\theta_k = T(\theta_{k-1})$,
under the constraint that $\rho_k$ still lies in $\mathcal{P}_2(\mathbb{R}^r)$ after the transformation. Directly solving for the unknown transformation is challenging; we propose two approximate methods, detailed below.
Optimizing with adversarial learning
Optimizing the KL divergence with an unknown transformation is generally infeasible. Instead, we approximate it with the Jensen-Shannon divergence, which attains the same optimum; in this sense the two are equivalent. Formally, we first introduce the following lemma.
Lemma 2
Let $\rho$ and $p$ be probability distributions on $\mathbb{R}^r$ with the same support. Then the KL divergence $\mathrm{KL}(\rho \,\|\, p)$ is equivalent to the Jensen-Shannon divergence (JSD) $\mathrm{JSD}(\rho \,\|\, p)$ in the sense that they are both convex in $\rho$ and achieve the same minimum value of zero at $\rho = p$.
where $\lambda$ is introduced to balance the difference between the scales of the KL divergence and the JSD.
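As a quick numerical sanity check of the KL/JSD equivalence in Lemma 2 on discrete distributions (illustrative only; the distributions below are ours):

```python
import numpy as np

def kl(p, q):
    """KL divergence between discrete distributions with full support."""
    return float(np.sum(p * np.log(p / q)))

def jsd(p, q):
    """Jensen-Shannon divergence: symmetrized KL against the mixture m."""
    m = 0.5 * (p + q)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.5, 0.3, 0.2])
q = np.array([0.4, 0.4, 0.2])
print(kl(p, p), jsd(p, p))          # both divergences vanish exactly at q = p
print(kl(p, q) > 0, jsd(p, q) > 0)  # and both are positive away from it
```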
It is well known that the JSD is the metric used to measure the distance between two probability distributions in generative adversarial networks (GANs) GoodfellowAMXWOCB:NIPS14 . According to the properties of GANs, the unknown transformation $T$ plays the role of the generator network in a GAN. The only difference is that in our case the latent space of the GAN is the same as the data space, which does not impact the learning algorithm. Consequently, we can update the transformation in each iteration by running a GAN update on its generator, which is then used to generate samples from $\rho_k$.
Optimizing via kernelized Stein operator
The adversarial-learning approach described above introduces extra computational overhead to update the transformation. Here we propose a more efficient method based on the kernelized Stein operator LiuW:NIPS16 . We first introduce a theoretical result from LiuW:NIPS16 in Lemma 3.
Lemma 3 (LiuW:NIPS16 )
Let $\mathcal{H}$ denote the reproducing kernel Hilbert space (RKHS) induced by a kernel $\kappa(\cdot, \cdot)$, and let $\mathcal{H}^r$ be the space of vector-valued functions $\boldsymbol{\phi} = (\phi_1, \ldots, \phi_r)$ with $\phi_i \in \mathcal{H}$. Let $T(\theta) = \theta + \epsilon \boldsymbol{\phi}(\theta)$, where $\boldsymbol{\phi} \in \mathcal{H}^r$. Denote by $\rho_{[T]}$ the density of $T(\theta)$ when $\theta \sim \rho$; then we have
$\nabla_\epsilon \mathrm{KL}(\rho_{[T]} \,\|\, p) \big|_{\epsilon = 0} = -\mathbb{E}_{\theta \sim \rho}\big[\nabla_\theta \log p(\theta)^{\top} \boldsymbol{\phi}(\theta) + \nabla_\theta \cdot \boldsymbol{\phi}(\theta)\big].$
Lemma 3 essentially says that the functional gradient of the KL divergence is
$\boldsymbol{\phi}^*(\theta) = \mathbb{E}_{\vartheta \sim \rho}\big[\kappa(\vartheta, \theta) \nabla_\vartheta \log p(\vartheta) + \nabla_\vartheta \kappa(\vartheta, \theta)\big]$
when the transformation takes the form $T(\theta) = \theta + \epsilon \boldsymbol{\phi}(\theta)$ with $\boldsymbol{\phi}$ restricted to $\mathcal{H}^r$. The result seems applicable to our problem (9), except that we require the transformation to lie in a larger space than $\mathcal{H}^r$. To compensate for this, we propose to inject noise into the functional gradient, leading to the following approximation:
where $\zeta \sim \mathcal{N}(\mathbf{0}, \mathrm{I})$ is an isotropic normal random variable, and $\sigma$
controls the variance of the injected noise, which is typically decreased over iterations to ensure convergence. Note that this resembles stochastic gradient descent, where full gradients are replaced with noisy gradients; convergence can thus still be guaranteed.
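A minimal sketch of the noise-injection idea follows; the decreasing schedule for $\sigma$, the stepsize, and the stand-in gradient (the plain score of a $\mathcal{N}(0, 1)$ target, without the kernel repulsion term, for brevity) are our assumptions, not the paper's exact choices:

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.normal(5.0, 1.0, size=(20, 1))    # particles; target is N(0, 1)
for k in range(500):
    phi = -theta                              # stand-in functional gradient:
                                              # the score grad log p(theta) of N(0, 1)
    sigma = 1.0 / (1.0 + k) ** 0.55           # decreasing noise scale (assumed schedule)
    theta = theta + 0.1 * (phi + sigma * rng.standard_normal(theta.shape))
print(theta.mean(), theta.std())
```

Because the noise scale shrinks over iterations, the perturbed updates behave like a stochastic-gradient scheme and the particles still settle near the target mean.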
3.1.2 Approximating the Wasserstein distance
where $u^*(\theta') \triangleq \sup_{\theta} \{\theta^{\top} \theta' - u(\theta)\}$ is the convex conjugate of the convex function $u$.
Optimizing over all convex functions $u$ is in general infeasible, due to the need to search the whole space of convex functions. However, we can approximate the optimization by restricting $u$ to a class of convex functions with nice structure. We describe some preliminary results below.
Restricting to be quadratic
We can define $u$ to be of the quadratic form
$u(\theta) = \frac{a}{2} \|\theta\|_2^2 + b^{\top} \theta + c$
with parameters $(a > 0, b, c)$. In this case, the convex conjugate takes the nice form
$u^*(\theta') = \frac{1}{2 a} \|\theta' - b\|_2^2 - c.$
Substituting these forms into the supremum and maximizing w.r.t. the parameters by setting the derivatives to zero, we have
Substituting the above formula into (13) and simplifying, we have
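For completeness, the conjugate used above can be derived directly (assuming the quadratic form $u(\theta) = \frac{a}{2}\|\theta\|_2^2 + b^{\top}\theta + c$ with $a > 0$); this is a standard convex-analysis computation:

```latex
u^{*}(\theta') = \sup_{\theta}\big\{ \theta'^{\top}\theta - \tfrac{a}{2}\|\theta\|_2^2 - b^{\top}\theta - c \big\}.
% Setting the gradient to zero gives the maximizer \theta = (\theta' - b)/a, so
u^{*}(\theta') = \frac{1}{a}\,\|\theta' - b\|_2^2 - \frac{1}{2a}\,\|\theta' - b\|_2^2 - c
               = \frac{1}{2a}\,\|\theta' - b\|_2^2 - c .
```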
Restricting to other forms
It is also interesting to parameterize $u$ with other convex functions. We are currently investigating this direction.
3.2 Particle optimization
The sections above describe how to optimize the distribution $\rho_k$ directly from (9). In practice, one must adopt a representation of $\rho_k$ in order to perform the optimization. A simple representation uses particles, i.e., $\rho_k(\theta) \approx \frac{1}{M} \sum_{i=1}^M \delta_{\theta_i^{(k)}}(\theta)$, where $\delta_{\theta_i^{(k)}}(\theta)$ is a point mass at $\theta_i^{(k)}$, and $\{\theta_i^{(k)}\}_{i=1}^M$ is the set of particles at the $k$-th iteration.
Denote the objective in (9), evaluated on the particles, by $\mathcal{F}$, where $k$ indexes the iteration. We can update the particles by calculating their gradients and performing gradient descent. For example, when adopting the approximations (12) and (3.1.2), we have
SG-MCMC vs SVGD with Polyak’s momentum gradient descent
Let $f(\theta)$ be the objective function to be optimized. Polyak's momentum gradient descent update Polyak:CMMP64 is given by
$\theta_{l+1} = \theta_l - \epsilon \nabla f(\theta_l) + \mu (\theta_l - \theta_{l-1}),$
where $\epsilon$ is the stepsize and $\mu \in [0, 1)$ is the momentum coefficient.
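Polyak's heavy-ball update $\theta_{l+1} = \theta_l - \epsilon \nabla f(\theta_l) + \mu (\theta_l - \theta_{l-1})$ can be sketched as follows (the quadratic test objective and the parameter values are our illustrative choices):

```python
import numpy as np

def polyak_momentum(grad_f, x0, eps=0.01, mu=0.9, n_steps=200):
    """Polyak's heavy-ball method: x_{t+1} = x_t - eps * grad_f(x_t) + mu * (x_t - x_{t-1})."""
    x_prev = x = np.array(x0, dtype=float)
    for _ in range(n_steps):
        x_next = x - eps * grad_f(x) + mu * (x - x_prev)
        x_prev, x = x, x_next
    return x

# Minimize f(x) = 0.5 * x^T A x with an ill-conditioned A; the minimizer is 0.
A = np.diag([1.0, 100.0])
x_star = polyak_momentum(lambda x: A @ x, x0=[1.0, 1.0])
print(x_star)
```

The momentum term lets the iterate keep moving through the flat direction of the ill-conditioned quadratic, which is what makes the comparison with the Wasserstein term in (15) informative.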
To see the relation between particle optimization in SG-MCMC, i.e., (15), and SVGD, first note that because (13) is an upper bound of the 2nd-order Wasserstein distance, we scale it by some constant in the implementation to approximate the true $W_2^2$. Based on the gradient formula in (15), the update equation for the particles then becomes
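As a rough sketch of such a momentum-style particle update (the kernelized SVGD direction, the injected noise, and all coefficients below are our assumptions, not the exact update in (15)):

```python
import numpy as np

def particle_step(theta, theta_prev, grad_log_p, eps=0.05, mu=0.5, sigma=0.0, rng=None):
    """One sketched particle-optimization step: kernelized SVGD direction plus
    injected noise and a momentum-like term (coefficients eps, mu, sigma assumed)."""
    rng = rng or np.random.default_rng(0)
    M = theta.shape[0]
    sq = np.sum((theta[:, None, :] - theta[None, :, :]) ** 2, axis=-1)
    h = np.median(sq) / np.log(M + 1) + 1e-8          # median-heuristic bandwidth
    K = np.exp(-sq / h)
    grad = np.array([grad_log_p(t) for t in theta])
    phi = (K @ grad + (2.0 / h) * (K.sum(1)[:, None] * theta - K @ theta)) / M
    noise = sigma * rng.standard_normal(theta.shape)  # injected, typically decreasing
    return theta + eps * (phi + noise) + mu * (theta - theta_prev)

rng = np.random.default_rng(1)
theta = rng.normal(5.0, 1.0, size=(30, 1))            # particles far from the target N(0, 1)
prev = theta.copy()
for _ in range(1000):
    theta, prev = particle_step(theta, prev, lambda t: -t), theta
print(theta.mean(), theta.std())
```

With $\mu = 0$ and $\sigma = 0$ this reduces to plain SVGD, which is exactly the relation the text describes.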
4 Empirical Verification
We test our algorithm against SVGD on a simple logistic-regression task, using the same model and data as LiuW:NIPS16 . For SVGD, we adopt the same setting as in LiuW:NIPS16 . Note that we use the authors' implementation of SVGD LiuW:NIPS16 , which uses AdaGrad for optimization; it is therefore not strictly the standard SVGD (and should perform better). For our particle optimization in SG-MCMC, we simply set the injected noise to be standard isotropic normal, scale the Wasserstein term by 0.1, and use (15) to update the particles. Figure 1 plots both test accuracies and test log-likelihoods w.r.t. the number of training iterations. Clearly, even without tuning, SG-MCMC obtains slightly faster convergence than SVGD in terms of both test accuracy and test log-likelihood.
We propose novel methods to directly optimize particles in SG-MCMC, a paradigm that improves sample quality in standard SG-MCMC under limited samples. We develop our techniques by solving the Fokker-Planck equation of the corresponding SG-MCMC on the space of probability distributions. A particular approximate solution of our framework results in SVGD with noisy gradients and momentum updates. To the best of our knowledge, this is the first time the relation between SG-MCMC and SVGD has been established. A simple empirical result shows the improvement of our method over standard SVGD. More extensive experiments are to be performed.
-  M. Welling and Y. W. Teh. Bayesian learning via stochastic gradient Langevin dynamics. In ICML, 2011.
-  T. Chen, E. B. Fox, and C. Guestrin. Stochastic gradient Hamiltonian Monte Carlo. In International Conference on Machine Learning (ICML), 2014.
-  N. Ding, Y. Fang, R. Babbush, C. Chen, R. D. Skeel, and H. Neven. Bayesian sampling using stochastic gradient thermostats. In Neural Information Processing Systems (NIPS), 2014.
-  C. Chen, N. Ding, and L. Carin. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In Neural Information Processing Systems (NIPS), 2015.
-  Y. W. Teh, A. H. Thiery, and S. J. Vollmer. Consistency and fluctuations for stochastic gradient Langevin dynamics. JMLR, 17(1):193–225, 2016.
-  S. J. Vollmer, K. C. Zygalakis, and Y. W. Teh. Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics. JMLR, 1:1–48, 2016.
-  Q. Liu and D. Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In Neural Information Processing Systems (NIPS), 2016.
-  R. Jordan, D. Kinderlehrer, and F. Otto. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis, 29(1):1–17, 1998.
-  N. Metropolis, A. Rosenbluth, M. Rosenbluth, A. Teller, and E. Teller. Equation of State Calculations by Fast Computing Machines. Journal of Chemical Physics, 21:1087–1092, 1953.
-  S. Duane, A. D. Kennedy, B. J. Pendleton, and D. Roweth. Hybrid Monte Carlo. Physics Letters B, 195:216–222, 1987.
-  A. M. Horowitz. A Generalized Guided Monte-Carlo Algorithm. Phys. Lett. B, 268:247–252, 1991.
-  B. Øksendal, editor. Stochastic Differential Equations. Springer-Verlag, Berlin, 1985.
-  A. P. Ghosh. Backward and Forward Equations for Diffusion Processes. Wiley Encyclopedia of Operations Research and Management Science, 2011.
-  A. Kolmogoroff. Über die analytischen Methoden in der Wahrscheinlichkeitsrechnung. Mathematische Annalen, 104(1):415–458, 1931.
-  H. Risken. The Fokker-Planck equation. Springer-Verlag, New York, 1989.
-  Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.
-  Qiang Liu. Stein variational gradient descent as gradient flow. In NIPS, 2017.
-  C. Chen, C. Li, L. Chen, W. Wang, Y. Pu, and L. Carin. Continuous-time flows for deep generative models. In arXiv:1709.01179, 2017.
-  F. Otto. Dynamics of Labyrinthine pattern formation in magnetic fluids: A mean-field theory. Arch. Rational Mech. Anal., pages 63–103, 1998.
-  M. Arjovsky and L. Bottou. Towards principled methods for training generative adversarial networks. In ICLR, 2017.
-  M. Arjovsky, S. Chintala, and L. Bottou. Wasserstein GAN. Technical Report arXiv:1701.07875, March 2017.
-  I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio. Generative adversarial nets. In Neural Information Processing Systems (NIPS), 2014.
-  C. Villani. Optimal transport: old and new. Springer Science & Business Media, 2008.
-  S. Feizi, C. Suh, F. Xia, and D. Tse. Understanding GANs: the LQG setting. In arXiv:1710.10793, 2017.
-  B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964.