Self-Adversarially Learned Bayesian Sampling

11/21/2018 ∙ by Yang Zhao, et al. ∙ University at Buffalo

Scalable Bayesian sampling plays an important role in modern machine learning, especially in fast-developing unsupervised (deep) learning models. While tremendous progress has been achieved via scalable Bayesian sampling methods such as stochastic gradient MCMC (SG-MCMC) and Stein variational gradient descent (SVGD), the generated samples are typically highly correlated. Moreover, their sample-generation processes are often criticized as inefficient. In this paper, we propose a novel self-adversarial learning framework that automatically learns a conditional generator to mimic the behavior of a Markov kernel (transition kernel). High-quality samples can be efficiently generated by direct forward passes through a learned generator. Most importantly, the learning process adopts a self-learning paradigm, requiring no information on existing Markov kernels, e.g., knowledge of how to draw samples from them. Specifically, our framework learns to use current samples, either from the generator or from pre-provided training data, to update the generator such that the generated samples progressively approach a target distribution; hence the name self-learning. Experiments on both synthetic and real datasets verify the advantages of our framework, which outperforms related methods in terms of both sampling efficiency and sample quality.




1 Introduction

With the abundance of unlabeled data, Bayesian methods have become increasingly popular in modern machine learning. Various real-world applications have greatly benefited from Bayesian modeling through uncertainty modeling [Blundell et al.2015, Zhang et al.2018], deep generative models [Feng, Wang, and Liu2017, Chen et al.2017] and deep reinforcement learning [Osband and Van Roy2017, Haarnoja et al.2017, Liu et al.2017]. The core of Bayesian methods is efficient Bayesian inference, for which Bayesian sampling stands as one of the most effective tools.

In the setting of big data, recent research has facilitated the development of scalable Bayesian sampling methods. There are two main directions for developing these methods: Markov-chain (MC) based and particle-optimization (PO) based methods. Stochastic gradient Markov chain Monte Carlo (SG-MCMC) is a family of scalable MC-based Bayesian learning algorithms designed to efficiently sample from a target (posterior) distribution [Welling and Teh2011, Chen, Fox, and Guestrin2014, Ding et al.2014, Chen, Ding, and Carin2015]. Specifically, SG-MCMC generates samples from a Markov chain induced by an Itô diffusion. Under a standard setting, samples from SG-MCMC can approximate a target distribution arbitrarily well given sufficient samples [Teh, Thiery, and Vollmer2016, Chen, Ding, and Carin2015]. By contrast, PO-based sampling methods such as Stein variational gradient descent (SVGD) [Liu and Wang2016] initialize a set of particles (or samples) from some simple distribution, and update them iteratively and interactively to approximate a target distribution. Recently, [Chen et al.2018] proposed a unified Bayesian sampling framework by combining SG-MCMC and SVGD from a Wasserstein-gradient-flow (WGF) perspective, obtaining improved performance compared to both SG-MCMC and SVGD. Our proposed method is partly based on the WGF theory presented in [Chen et al.2018].

Though achieving encouraging results, we note two issues in the aforementioned sampling methods:

Slow sample generation: though SG-MCMC and SVGD achieve scalable sampling by adopting stochastic gradient information, sample generation is still not efficient enough under complicated models such as a very deep neural network. The problem is even more severe in SVGD as each particle needs to interact with all other particles in the sample-generation process;

Slow mixing: samples tend to be highly correlated, leading to slow mixing. In fact, it has been shown that diffusion-based methods such as SG-MCMC might need exponential time to escape local modes [Raginsky, Rakhlin, and Telgarsky2017, Zhang, Liang, and Charikar2017]. More sample-efficient algorithms are therefore needed.

In this paper, we reinterpret Bayesian sampling as learning a Markov kernel (or a transition kernel), a conditional probability sequentially mapping an old state (sample) to a new state. Leveraging advantages of scalable sampling and recent developments in deep generative models, we reformulate the sampling process based on the generative-adversarial-net (GAN) framework [Goodfellow et al.2014]. The formulation is based on the connection between density evolution in a Bayesian sampling algorithm and WGFs. Specifically, a conditional generator which solves the corresponding WGF is trained to mimic the sample-generation process. In this way, both fast sample generation and fast sample mixing are achieved.

We consider two settings in our framework to learn the conditional generators, i.e., when samples from the unknown target distribution are available, and when only the form of the target distribution is provided. The former case can be learned by directly adopting standard GAN training techniques, whereas the latter case is much more challenging. To overcome the challenge, we propose a self-learning paradigm that adjusts samples from the generator itself to approach the target distribution in a principled way, such that the adjusted samples can be used as real samples to train the generator. We call our proposed framework self-adversarially learned Bayesian sampling. Extensive experiments are performed on both synthetic and real datasets, demonstrating the effectiveness and efficiency of the proposed framework relative to existing methods.

2 Preliminaries

This section reviews background of related Bayesian sampling algorithms, e.g., SG-MCMC, SVGD and particle-optimization Bayesian sampling (POS) [Chen et al.2018].

2.1 Stochastic gradient MCMC

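Given observations $X = \{x_i\}_{i=1}^N$, we aim at drawing samples from a target posterior distribution $p(\theta|X) \propto p(\theta)\prod_{i=1}^N p(x_i|\theta)$ with model parameters $\theta$. In Bayesian modeling, we write $p(\theta|X) \propto \exp(-U(\theta))$, where $U(\theta) = -\sum_{i=1}^N \log p(x_i|\theta) - \log p(\theta)$ is called the potential energy (negative log unnormalized posterior). SG-MCMC is a scalable Bayesian sampling method, which takes stochastic gradient information of the potential energy into consideration. Let $\tilde{U}(\theta) = -\frac{N}{n}\sum_{i=1}^n \log p(x_{\pi_i}|\theta) - \log p(\theta)$ be a stochastic version of $U$ computed on a minibatch of size $n$, with $\pi_i$ the $i$-th element of a random permutation of $\{1,\dots,N\}$. Stochastic gradient Langevin dynamics (SGLD) stands as the first SG-MCMC algorithm [Welling and Teh2011], with the following update rule (samples are indexed by $l$):

$$\theta_{l+1} = \theta_l - \epsilon_l \nabla_\theta \tilde{U}(\theta_l) + \sqrt{2\epsilon_l}\,\xi_l, \tag{1}$$

where $\{\epsilon_l\}$ is a stepsize sequence, and $\xi_l \sim \mathcal{N}(0, I)$. Further development of SG-MCMC methods leads to several variants of SGLD that introduce auxiliary variables into the corresponding dynamical systems [Ding et al.2014, Chen, Fox, and Guestrin2014]. With samples $\{\theta_l\}_{l=1}^L$ from a sampler, one can approximate statistics of a function $f(\theta)$, e.g., the posterior expectation $\mathbb{E}_{p(\theta|X)}[f(\theta)]$ is approximated as $\frac{1}{L}\sum_{l=1}^L f(\theta_l)$.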
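As an illustration of the SGLD update, here is a minimal NumPy sketch on a toy problem. The toy model (Gaussian likelihood with unit variance, a $\mathcal{N}(0, 10^2)$ prior on the mean), the constant step size, and the names `sgld_sample` and `grad_U_tilde` are assumptions made for this example and are not taken from the paper, which uses a stepsize sequence.

```python
import numpy as np

def sgld_sample(grad_U_tilde, theta0, step_size, n_steps, rng):
    """Iterate theta <- theta - eps * grad U~(theta) + sqrt(2 * eps) * xi."""
    theta = np.asarray(theta0, dtype=float).copy()
    samples = np.empty((n_steps,) + theta.shape)
    for l in range(n_steps):
        xi = rng.standard_normal(theta.shape)  # xi ~ N(0, I)
        theta = theta - step_size * grad_U_tilde(theta) + np.sqrt(2.0 * step_size) * xi
        samples[l] = theta
    return samples

# Hypothetical toy data: x_i ~ N(1.5, 1); we sample the posterior of the mean.
rng = np.random.default_rng(0)
data = rng.normal(1.5, 1.0, size=1000)
N, n = data.size, 50  # dataset size and minibatch size

def grad_U_tilde(theta):
    # Minibatch gradient of the potential energy:
    # U~(theta) = -(N/n) * sum_i log p(x_i | theta) - log p(theta)
    batch = rng.choice(data, size=n, replace=False)
    return -(N / n) * np.sum(batch - theta) + theta / 100.0

samples = sgld_sample(grad_U_tilde, theta0=np.zeros(1), step_size=1e-4,
                      n_steps=5000, rng=rng)
# Posterior expectation approximated by a sample average (first 1000 draws discarded).
posterior_mean_estimate = samples[1000:].mean()
```

With a constant step size the chain carries a small bias, which is why the original algorithm prescribes a decaying sequence; the constant step is used here only to keep the sketch short.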

2.2 Stein variational gradient descent

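Different from SG-MCMC, SVGD is derived from a particle-optimization perspective [Liu and Wang2016]. It iteratively and interactively updates a set of particles $\{\theta_i\}_{i=1}^M$ drawn from some initial distribution. The updating rule follows $\theta_i \leftarrow \theta_i + \epsilon\,\phi(\theta_i)$ with

$$\phi(\theta) = \frac{1}{M}\sum_{j=1}^M \left[\kappa(\theta_j,\theta)\,\nabla_{\theta_j}\log p(\theta_j|X) + \nabla_{\theta_j}\kappa(\theta_j,\theta)\right], \tag{2}$$

where $\kappa(\cdot,\cdot)$ is a positive definite kernel, e.g., the RBF kernel $\kappa(\theta,\theta') = \exp(-\|\theta-\theta'\|_2^2/\sigma)$ with bandwidth $\sigma$, and $\epsilon$ is the step size. It is shown that the update (2) performs functional gradient descent on the Kullback-Leibler (KL) divergence $\mathrm{KL}(q\,\|\,p)$, where $q$ is the underlying density of the particles. Consequently, SVGD drives the particles to be asymptotically distributed according to the target distribution.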
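The SVGD update can be sketched in a few lines of NumPy. This is a minimal sketch under assumptions: the kernel is written as $\exp(-\|\theta-\theta'\|^2/\text{bandwidth})$, the bandwidth follows the median heuristic, the target is a toy 1D Gaussian $\mathcal{N}(2, 1)$, and the helper names are hypothetical.

```python
import numpy as np

def median_bandwidth(X):
    """Median heuristic: (median pairwise distance)^2 / log(M + 1)."""
    sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    med_sq = np.median(sq[np.triu_indices(X.shape[0], k=1)])
    return med_sq / np.log(X.shape[0] + 1.0)

def svgd_step(X, grad_log_p, step_size):
    """One SVGD update of all particles X with shape (M, d)."""
    M = X.shape[0]
    bandwidth = median_bandwidth(X)
    diff = X[:, None, :] - X[None, :, :]             # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / bandwidth)  # K[j, i] = k(x_j, x_i)
    grad_K = -2.0 / bandwidth * diff * K[:, :, None]      # gradient of k(x_j, x_i) in x_j
    score = grad_log_p(X)                             # rows: grad log p(x_j)
    # phi(x_i) = (1/M) sum_j [ k(x_j, x_i) * score_j + grad_{x_j} k(x_j, x_i) ]
    phi = (K.T @ score + grad_K.sum(axis=0)) / M
    return X + step_size * phi

# Hypothetical toy target N(2, 1): grad log p(x) = -(x - 2).
rng = np.random.default_rng(0)
particles = rng.standard_normal((50, 1))
for _ in range(500):
    particles = svgd_step(particles, lambda X: -(X - 2.0), step_size=0.1)
```

The first term in `phi` pulls particles toward high-density regions, while the kernel-gradient term pushes nearby particles apart; every particle interacts with all others, which is the per-step cost the introduction refers to.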

2.3 Particle-optimization sampling

Compared with those from SVGD, itself an instance of the POS framework [Chen et al.2018, Zhang, Zhang, and Chen2018], samples from SG-MCMC are likely to be highly correlated because of their Markov-chain structure. The POS framework alleviates this issue by interpreting both SG-MCMC and SVGD as WGFs on the space of probability measures $\mathcal{P}(\Omega)$, and proposing a unified particle-optimization framework for efficient Bayesian sampling.

Specifically, the POS framework translates Bayesian sampling to solving a partial differential equation defined on $\mathcal{P}(\Omega)$ with $\Omega \subseteq \mathbb{R}^d$, defined as:

$$\partial_t \mu_t + \nabla\cdot(v_t\,\mu_t) = 0. \tag{3}$$

Here $(\mu_t)_{t \ge 0}$ is an absolutely continuous curve on $\mathcal{P}(\Omega)$ and $v_t$ is a vector field describing the direction of sample evolution. In WGFs, $v_t$ is related to what is known as the energy functional $F$, mapping a probability measure to a real value, i.e., $F: \mathcal{P}(\Omega) \to \mathbb{R}$, via the equation [Ambrosio, Gigli, and Savaré2005]: $v_t = -\nabla\frac{\delta F}{\delta \mu_t}(\mu_t)$, where $\frac{\delta F}{\delta \mu_t}$ is called the first variation of $F$ at $\mu_t$, with evolved directions constrained to the tangent space of the probability manifold. Consequently, gradient flows on $\mathcal{P}(\Omega)$ can be written as

$$\partial_t \mu_t = \nabla\cdot\left(\mu_t\,\nabla\frac{\delta F}{\delta \mu_t}(\mu_t)\right). \tag{4}$$
Solving by discrete gradient flows

An exact solution to the WGF formula (4) is generally infeasible. A typical approach is to approximate the continuous-time solution of (4) with discrete-time flows, called discrete gradient flows (DGFs). Denote by $\mathcal{P}_2(\Omega)$ the space of probability measures with finite second-order moments, and define the following optimization problem with a step size $h$:

$$J_h(\mu) = \arg\min_{\nu \in \mathcal{P}_2(\Omega)} \left\{ \frac{1}{2h} W_2^2(\mu, \nu) + F(\nu) \right\}, \tag{5}$$

where $W_2(\mu,\nu)$ denotes the 2-Wasserstein distance between $\mu$ and $\nu$. Here $F$ is such that its minimizer corresponds to the target distribution. The idea of DGFs is to approximate the continuous-time solution of (4) via a composition of a sequence of discrete solutions of (5), i.e.,

$$\mu_k = J_h(\mu_{k-1}) = J_h^k(\mu_0). \tag{6}$$

The DGF method is the analogue on $\mathcal{P}_2(\Omega)$ of gradient descent on Euclidean space. One can show that when $h \to 0$, the solution $\mu_k$ from DGFs (6) converges to the true flow $\mu_t$ of (4) for all $k$ with $t = kh$ [Craig2014].
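The Euclidean analogue of the discrete scheme can be made concrete with a small sketch. It assumes a one-dimensional energy $F(y) = y^2/2$ (chosen only because the minimization then has a closed form) and replaces the Wasserstein distance with the squared Euclidean distance; the function names are hypothetical. The continuous flow is $x(t) = x_0 e^{-t}$, and the composed discrete steps approach it as $h \to 0$.

```python
import math

def proximal_step(x, h):
    """Closed-form argmin_y { (x - y)^2 / (2h) + y^2 / 2 } = x / (1 + h)."""
    return x / (1.0 + h)

def discrete_flow(x0, t, k):
    """Compose k proximal steps of size h = t / k, mirroring mu_k = J_h^k(mu_0)."""
    h = t / k
    x = x0
    for _ in range(k):
        x = proximal_step(x, h)
    return x

# Exact gradient-flow solution of dx/dt = -x at t = 1, starting from x0 = 1.
exact = math.exp(-1.0)
errors = [abs(discrete_flow(1.0, 1.0, k) - exact) for k in (10, 100, 1000)]
```

The entries of `errors` shrink as the number of steps grows, which is the Euclidean counterpart of the convergence statement for DGFs.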

3 Self-Adversarially Learned Bayesian Sampling

In this section, we develop a GAN-based framework to efficiently solve the DGF problem (6), avoiding the computational complexity of the original particle-approximation-based methods [Chen et al.2018]. Based on this, more powerful and flexible approximations with a self-adversarial learning scheme are developed.

3.1 Reformulating POS as conditional GANs

We first specify the energy functional $F$ in (5). For popular sampling methods such as SG-MCMC and SVGD, $F$ has been shown to be the KL divergence between $\mu$ and the target distribution $p$ (though the metric for SVGD is defined as a variant of the Wasserstein distance called the $\mathcal{H}$-Wasserstein distance [Liu2017]). In this case, the DGF method described above becomes the well-known Jordan-Kinderlehrer-Otto (JKO) scheme [Jordan, Kinderlehrer, and Otto1998]. For convenience, we instead define the energy functional as the Jensen-Shannon divergence (JSD) between $\mu$ and $p$. Note that the JSD is also convex, so that, as for the KL divergence, the optimal solution is unique. Consequently, the DGF (5) becomes

$$\mu_{k+1} = \arg\min_{\mu} \left\{ \mathrm{JSD}(\mu\,\|\,p) + \lambda\,W_2^2(\mu_k, \mu) \right\}, \tag{7}$$

where $\lambda$ is a tunable parameter.

Now solving the WGF (4) is equivalent to composing results from a sequence of optimizations defined in (7) via (6). As a result, $\mu$ is optimized sequentially, each time conditioning on its previous value. In addition, the JSD is well known to be the objective function of the GAN. Consequently, the optimization problem (7) can be reformulated as a conditional GAN, where $\mu_{k+1}$ is defined as an implicit distribution induced by a conditional generator $G$. Specifically, $G$ is designed to take an old sample $\theta_k$ and random noise $\xi$ as input, and to output the updated sample $G(\theta_k, \xi)$. The $W_2^2$ term in (7) regularizes the outputs such that they are not too far away from their input samples. According to the GAN theory [Goodfellow et al.2014], (7) is equivalent to the following objective:

$$\min_G \max_D\ \mathbb{E}_{x \sim p_r}\left[\log D(x)\right] + \mathbb{E}_{\theta_k \sim \mu_k,\,\xi}\left[\log\left(1 - D(G(\theta_k, \xi))\right)\right] + \lambda\,W_2^2(\mu_k, \mu_{k+1}), \tag{8}$$

where $p_r$ denotes the true data distribution (the target $p$ in (7)); $\theta_k$ is the previous sample from the generator, whose implicit distribution is denoted as $\mu_k$; $\mu_{k+1}$ denotes the implicit distribution of the output $G(\theta_k, \xi)$; and $D$ is a discriminator network that distinguishes whether an input is real or fake. A conditional generator is required because the output is correlated with the input via the $W_2^2$ term. The objective (8) is illustrated in Figure 1 (left).
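To make the structure of the objective concrete, the following sketch evaluates it from discriminator outputs. It assumes the objective has the usual GAN form (an expected log-probability on real samples plus an expected log of one minus the probability on generated samples) with an added Wasserstein penalty; the function name `objective_8`, the weight `lam`, and the precomputed scalar `w2_sq` are hypothetical stand-ins, and no network is trained here.

```python
import numpy as np

def objective_8(d_real, d_fake, w2_sq, lam):
    """Monte Carlo estimate of E[log D(x)] + E[log(1 - D(G(theta, xi)))] + lam * W_2^2.

    d_real: discriminator outputs in (0, 1) on real samples.
    d_fake: discriminator outputs in (0, 1) on generated samples.
    w2_sq:  an estimate of the squared Wasserstein distance between
            generator inputs and outputs (assumed to be computed elsewhere).
    """
    tiny = 1e-12  # guards against log(0)
    gan_term = np.mean(np.log(d_real + tiny)) + np.mean(np.log(1.0 - d_fake + tiny))
    return gan_term + lam * w2_sq

# When the discriminator cannot tell real from fake (D = 1/2 everywhere)
# and the regularizer vanishes, the GAN term equals -log 4.
value_at_equilibrium = objective_8(np.full(8, 0.5), np.full(8, 0.5), w2_sq=0.0, lam=1.0)
```

The discriminator ascends this quantity while the generator descends it; only the generator is affected by the penalty term, since the penalty does not depend on the discriminator.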

Figure 1: Graphs of the proposed framework with real samples (left) and a self-learning component (right). “G” represents the generator, “D” represents the discriminator, and the remaining label corresponds to the Wasserstein regularizer in (8). Dashed lines indicate that gradients do not flow back in backpropagation.

We note two differences between our formulation (8) and the standard GAN: (i) performing stochastic gradient descent (SGD) to learn $G$ is challenging due to the existence of the $W_2^2$ regularization term; and (ii) the real samples $x$ might not be available in training. In the following, we address the first problem by deriving an approximate form for $W_2^2$ in Section 3.2, assuming the availability of real/true data $x$. We then solve the second problem in Section 3.3 by proposing a self-adversarial-learning component that automatically adjusts samples from the generator such that they approach real samples.

3.2 Adversarially-learning Bayesian sampling

Approximating the Wasserstein regularizer

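Following [Chen et al.2018], we use particle approximation to deal with the Wasserstein term in (8). Let $\{\theta_k^{(i)}\}_{i=1}^M$ be a minibatch of input samples from the generator (samples from the last step), and $\{\theta_{k+1}^{(j)}\}_{j=1}^M$ be the output samples. Using these samples/particles to approximate $\mu_k$ and $\mu_{k+1}$ in (8), $W_2^2(\mu_k, \mu_{k+1})$ is approximated as

$$W_2^2(\mu_k, \mu_{k+1}) \approx \inf_{\{p_{ij}\}} \sum_{i,j} p_{ij}\,c_{ij} \quad \text{s.t.} \quad \sum_j p_{ij} = \frac{1}{M},\ \ \sum_i p_{ij} = \frac{1}{M}, \tag{9}$$

where $c_{ij} = \|\theta_k^{(i)} - \theta_{k+1}^{(j)}\|_2^2$. Now the goal becomes solving for the optimal $\{p_{ij}\}$. Introducing Lagrangian multipliers $\{\alpha_i\}$ and $\{\beta_j\}$ to deal with the constraints, and adding an entropy regularization term for $\{p_{ij}\}$ with weight $\gamma$, the dual problem can be written as

$$\max_{\{\alpha_i\},\{\beta_j\}} \min_{\{p_{ij}\}}\ \sum_{i,j} p_{ij}\,c_{ij} + \sum_i \alpha_i \Big(\sum_j p_{ij} - \frac{1}{M}\Big) + \sum_j \beta_j \Big(\sum_i p_{ij} - \frac{1}{M}\Big) + \gamma \sum_{i,j} p_{ij} \log p_{ij}. \tag{10}$$

Solving (10), the optimal $p_{ij}$ takes the form $p_{ij}^* = u_i\,e^{-c_{ij}/\gamma}\,v_j$, where $u_i = e^{-\frac{1}{2} - \frac{\alpha_i}{\gamma}}$, $v_j = e^{-\frac{1}{2} - \frac{\beta_j}{\gamma}}$. Now substituting the optimal $p_{ij}^*$ back into (9), the Wasserstein distance can be approximated as:

$$W_2^2(\mu_k, \mu_{k+1}) \approx \sum_{i,j} c_{ij}\,e^{-c_{ij}/\gamma}, \tag{11}$$

where the original $u_i$ and $v_j$ have been merged into the constant $\lambda$ for simplicity.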
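For comparison with the closed-form approximation above, the entropy-regularized transport problem between two particle sets can also be solved numerically with Sinkhorn iterations. This is a swapped-in standard technique, not the procedure used in the paper; it is shown only because its solution has the same product form (a row scaling, times $e^{-c_{ij}/\gamma}$, times a column scaling). The function name, the value of `gamma`, and the toy particles are assumptions.

```python
import numpy as np

def sinkhorn_w2_sq(X, Y, gamma, n_iters=500):
    """Entropy-regularized estimate of W_2^2 between uniform empirical measures."""
    C = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)  # c_ij = ||x_i - y_j||^2
    K = np.exp(-C / gamma)
    a = np.full(X.shape[0], 1.0 / X.shape[0])  # row marginals
    b = np.full(Y.shape[0], 1.0 / Y.shape[0])  # column marginals
    u = np.ones_like(a)
    v = np.ones_like(b)
    for _ in range(n_iters):
        u = a / (K @ v)      # rescale rows to match a
        v = b / (K.T @ u)    # rescale columns to match b
    P = u[:, None] * K * v[None, :]  # transport plan p_ij = u_i * exp(-c_ij / gamma) * v_j
    return np.sum(P * C), P

# Hypothetical particles: the outputs are the inputs shifted by (0.5, 0).
rng = np.random.default_rng(0)
X = rng.standard_normal((20, 2))
Y = X + np.array([0.5, 0.0])
cost, plan = sinkhorn_w2_sq(X, Y, gamma=0.5)
```

Smaller values of `gamma` bring the estimate closer to the unregularized distance but make `exp(-C / gamma)` prone to underflow, so practical implementations often work in the log domain.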

Adversarially-learned Bayesian sampling

Given the approximation (11) for $W_2^2$ and real training data $x$, gradients with respect to the generator parameters can be readily calculated by backpropagation, so the generator can be updated with SGD. We call this version of our framework the adversarially-learned Markov chain (AL-MC), with the detailed algorithm given in the Supplementary Material (SM) on our homepage.

3.3 Self-adversarially learned Bayesian sampling

A more challenging setting in practice is that real samples are not readily available, e.g., in posterior sampling where only an (unnormalized) posterior distribution is provided. This section addresses the problem of how to learn a generator to generate effective samples only with such information.

Our basic idea is to add a self-learning module that can automatically adjust the current samples from the generator to approach a target distribution. These adjusted samples are then used as real data to update both the generator and discriminator gradually. This idea is illustrated in Figure 1 (right). Based on an unnormalized target distribution, we consider two settings:

Real data generation with approximate Bayesian sampling

In this case, one is assumed to be able to directly draw approximate samples from the target distribution, based on the previous outputs of the generator. This can be done by adopting existing effective approximate samplers such as SG-MCMC and SVGD. Specifically, let the previous outputs from the generator be $\{\theta_i\}_{i=1}^M$; the real samples $\{x_i\}_{i=1}^M$ for the next generator update are then approximated as

$$\{x_i\}_{i=1}^M = \text{SG-MCMC}\big(\{\theta_i\}_{i=1}^M\big) \quad \text{or} \quad \{x_i\}_{i=1}^M = \text{SVGD}\big(\{\theta_i\}_{i=1}^M\big),$$

where $\text{SG-MCMC}(\cdot)$ or $\text{SVGD}(\cdot)$ means running one or several SG-MCMC/SVGD updates on the input samples toward the target distribution. Based on these approximate real samples, the generator update then proceeds as in AL-MC, described in the previous section.

One can easily see that the effectiveness of this learning scheme highly depends on the approximate sampling algorithms, e.g., SG-MCMC or SVGD. Empirically, we usually observe samples from the generator collapsed to one mode on a multi-mode target distribution. The reason is that when modes are too far away from each other, making samples jump from one mode to another with SG-MCMC or SVGD typically takes a long time, and sometimes they even fail to move. This misleads the generator to be trained to generate samples only on one mode of the target distribution. In the following, we propose a novel self-adversarial learning module to overcome this issue, which does not even need samples from an approximate sampler.

Real-data generation with self learning

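Building on ideas from [Han and Liu2018], we propose a self-learning scheme that automatically adjusts samples from a generator by relying only on gradient information from the current sample density, instead of on the true gradient information of the target distribution. Specifically, let the current induced distribution from the generator be $q$ (an implicit distribution without an explicit form) with corresponding samples $\{\theta_i\}_{i=1}^M$. The output of the self-learning module in Figure 1 (right) is defined as:

$$\theta_i' = \theta_i + \epsilon \sum_{j=1}^M w_j \left[\kappa(\theta_j, \theta_i)\,\nabla_{\theta_j} \log q(\theta_j) + \nabla_{\theta_j}\kappa(\theta_j, \theta_i)\right],$$

where $w_j \propto q(\theta_j)/p(\theta_j)$ with $\sum_j w_j = 1$, and $p$ is the (unnormalized) target distribution. According to [Han and Liu2018], the distribution of $\{\theta_i'\}$ is guaranteed to asymptotically converge to $p$.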

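Note that the above update cannot be evaluated directly, because there are no explicit forms for either $q(\theta)$ or $\nabla_\theta \log q(\theta)$, as $q$ is an implicit distribution. Therefore, we adopt density estimation techniques for approximation. For $q(\theta)$, we use the popular kernel density estimator [Guidoum2015], i.e.,

$$\hat{q}(\theta) = \frac{1}{M}\sum_{i=1}^M \hat{\kappa}(\theta, \theta_i),$$

where $\hat{\kappa}$ is a positive kernel in our paper. For $\nabla_\theta \log q(\theta)$, we apply the recently developed Stein gradient estimator [Li and Turner2018], defined as

$$\hat{\mathbf{G}} = -(\mathbf{K} + \eta \mathbf{I})^{-1} \langle \nabla, \mathbf{K} \rangle,$$

where $\mathbf{K}$ is a kernel matrix with $\mathbf{K}_{ij} = \hat{\kappa}(\theta_i, \theta_j)$, $\langle \nabla, \mathbf{K} \rangle_{ij} = \sum_{k=1}^M \partial_{\theta_k^{(j)}} \hat{\kappa}(\theta_i, \theta_k)$ with $\theta_k^{(j)}$ the $j$-th coordinate of $\theta_k$, $\eta > 0$ is a regularization parameter, and the $i$-th row of $\hat{\mathbf{G}}$ estimates $\nabla \log q(\theta_i)$. The Stein gradient estimator has been shown to significantly outperform existing gradient estimators [Li and Turner2018]. Note that we use two RBF kernels, $\kappa$ with bandwidth $\sigma$ and $\hat{\kappa}$ with bandwidth $\hat{\sigma}$, to allow flexibility: $\kappa$ is used in SVGD, and $\hat{\kappa}$ is used to approximate the density and the gradient of its logarithm. Posterior sampling with the update above is an instance of the SVGD-without-gradient framework [Han and Liu2018]. We denote this update as approximate gradient SVGD (AG-SVGD), and will show that it performs impressively compared to standard SG-MCMC or SVGD in the experiments.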
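The pieces of the self-learning update (kernel density estimate, Stein gradient estimator, importance-weighted SVGD step) can be sketched together as follows. This is a sketch under assumptions, not the authors' implementation: kernels are written as $\exp(-\|\cdot\|^2/(2\sigma^2))$, the weights are taken proportional to the estimated particle density over the unnormalized target density and normalized to sum to one, and all function and parameter names are hypothetical.

```python
import numpy as np

def rbf(X, Y, sigma):
    sq = np.sum((X[:, None, :] - Y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq / (2.0 * sigma ** 2))

def kde(X, sigma):
    """Kernel density estimate of the particle density at the particles (up to a constant)."""
    return rbf(X, X, sigma).mean(axis=1)

def stein_score(X, sigma, eta):
    """Stein gradient estimator G = -(K + eta I)^{-1} <grad, K>; row i estimates grad log q(x_i)."""
    K = rbf(X, X, sigma)
    # <grad, K>[i] = sum_k grad_{x_k} k(x_i, x_k) = sum_k K[i, k] * (x_i - x_k) / sigma^2
    grad_K = (K.sum(axis=1, keepdims=True) * X - K @ X) / sigma ** 2
    return -np.linalg.solve(K + eta * np.eye(X.shape[0]), grad_K)

def ag_svgd_step(X, log_p_tilde, step_size, sigma_svgd, sigma_est, eta):
    """One update that uses target density values only, never target gradients."""
    score_q = stein_score(X, sigma_est, eta)
    log_w = np.log(kde(X, sigma_est)) - log_p_tilde(X)  # log of q_hat / p_tilde
    w = np.exp(log_w - log_w.max())
    w /= w.sum()                                         # normalized weights w_j
    wK = w[:, None] * rbf(X, X, sigma_svgd)              # wK[j, i] = w_j * k(x_j, x_i)
    drive = wK.T @ score_q                               # sum_j w_j k(x_j, x_i) grad log q(x_j)
    # sum_j w_j grad_{x_j} k(x_j, x_i) = sum_j wK[j, i] * (x_i - x_j) / sigma^2
    repulse = (wK.sum(axis=0)[:, None] * X - wK.T @ X) / sigma_svgd ** 2
    return X + step_size * (drive + repulse)

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1))
score_est = stein_score(X, sigma=1.0, eta=0.1)  # for N(0, 1) samples the true score is -x
X_new = ag_svgd_step(X, lambda Z: -0.5 * np.sum((Z - 1.0) ** 2, axis=1),
                     step_size=0.05, sigma_svgd=1.0, sigma_est=1.0, eta=0.1)
```

Because the weights are normalized, any constant factor in the density estimate or in the unnormalized target cancels, which is why neither needs to be a properly normalized density.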

Taking AG-SVGD as the self-learning module in Figure 1 (right), we obtain what is called self-adversarially-learning Markov chain (SAL-MC) sampler, described below.

The SAL-MC sampler

The training procedure is given in the first part of Algorithm 1, where AG-SVGD is used to generate approximate real samples from a target distribution. This procedure would gradually guide the generator to generate real samples as the stationary distribution of AG-SVGD is the target distribution. In addition, this self-adjusted behavior makes samples jump out of local modes easier, leading to much better performance compared to the one using SG-MCMC or SVGD to generate real samples, as will be shown in the experiments. We also find that multiple updates of the discriminator per generator update can increase effective sample size (ESS).

Input : , , and
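Output : Transition operator G and sample set
1. SAL-MC training:
for each training iteration t do
       // Adjust particles:
       run AG-SVGD on the current particles to obtain approximate real samples
       // Adversarial training between G and D:
       for each discriminator step i do
             train the discriminator with the objective (8)
       end for
       train the generator with the objective (8)
       // Update adversarially-learned particles:
       for each particle i do
             pass the particle through G with fresh noise
       end for
end for
2. SAL-MC sampling:
for each sample index l do
       generate the next sample with G
       no MH step; directly add the new sample to the sample set
end for
Algorithm 1 SAL-MC training and sampling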

After training, only the conditional generator $G$ is used to generate a sequence of samples via the following generative process:

$$\theta_{l+1} = G(\theta_l, \xi_l),$$

where $\xi_l$ is random noise drawn from a simple distribution such as an isotropic multivariate Gaussian $\mathcal{N}(0, \sigma^2 I)$. In this way, $G$ is taken as the transition operator of a Markov chain.
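The sampling loop itself is just repeated forward passes. The sketch below uses a hypothetical stand-in for the trained generator: a hand-written autoregressive map whose stationary distribution is $\mathcal{N}(0, 1)$. A real generator would be a neural network; the names `run_chain` and `toy_generator`, and the choice of giving the noise the same shape as the state, are assumptions for illustration.

```python
import numpy as np

def run_chain(generator, theta0, n_samples, noise_std, rng):
    """Iterate theta_{l+1} = G(theta_l, xi_l) and collect every state (no burn-in, no thinning)."""
    theta = np.asarray(theta0, dtype=float)
    chain = []
    for _ in range(n_samples):
        xi = rng.normal(0.0, noise_std, size=theta.shape)  # xi_l ~ N(0, noise_std^2 I)
        theta = generator(theta, xi)
        chain.append(theta)
    return np.stack(chain)

# Hypothetical stand-in for a trained G: an AR(1) map that leaves N(0, 1) invariant.
a = 0.5
def toy_generator(theta, xi):
    return a * theta + np.sqrt(1.0 - a ** 2) * xi

rng = np.random.default_rng(0)
chain = run_chain(toy_generator, theta0=np.zeros(1), n_samples=5000,
                  noise_std=1.0, rng=rng)
```

Each new sample costs one call to `generator`, with no gradient of the target density and no interaction between particles.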

Unlike SG-MCMC and SVGD, no gradient information from the target distribution is needed, leading to fast sample generation through forward passes of a neural network alone. In addition, the generator allows distant jumps between consecutive samples via the complex transformation in $G$, making samples mix faster. In practice, we do not apply any burn-in steps or thinning to the samples, yet the sampler still shows good convergence with high ESS, as will be demonstrated in the experiments.

4 Experiments

In this section, we first examine the effectiveness of AG-SVGD with stochastic gradient estimations for direct sampling, and compare it with standard SGLD and the recently proposed Annealed-SVGD (A-SVGD) [Han and Liu2018]. We then apply the proposed SAL-MC framework on a set of multi-mode synthetic distributions, as well as on Bayesian Logistic Regression (BLR) tasks. We compare SAL-MC with the recently developed A-NICE-MC method [Song, Zhao, and Ermon2017], which is also based on adversarial training. Finally, we apply our AL-MC algorithm for image generation trained on real samples.

4.1 Verification of AG-SVGD

We conduct experiments on two multi-mode toy examples, a 5D Gaussian Mixture Model (GMM) distributed in an aggregation state and a challenging 2D-GMM distribution with distant modes and varied variances. We use the same RBF kernel and the median trick for AG-SVGD and A-SVGD. As suggested by [Welling and Teh2011], a polynomially-decayed step size is used in SGLD for a fair comparison. We use samples to approximate the mean and variance of the distributions, measured by mean squared errors (MSE) w.r.t. true values. The results are averaged over 20 random runs with 500 iterations in each run. To be consistent with [Han and Liu2018], we also use RBF kernels and the median trick when calculating the maximum mean discrepancy (MMD) between the sample approximation and the true distribution for all three methods.
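As a reference for the MMD metric with an RBF kernel and the median trick, here is a minimal sketch. The bandwidth convention (kernel $\exp(-\|x-y\|^2/\text{bandwidth})$ with the bandwidth set to the median squared pairwise distance of the pooled samples) and the use of the biased estimator are assumptions; the paper does not spell out these details.

```python
import numpy as np

def mmd_squared(X, Y):
    """Biased (V-statistic) estimate of squared MMD between samples X and Y."""
    Z = np.vstack([X, Y])
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    # Median trick: bandwidth from the median squared distance over distinct pairs.
    bandwidth = np.median(sq[np.triu_indices(Z.shape[0], k=1)])
    K = np.exp(-sq / bandwidth)
    n = X.shape[0]
    return K[:n, :n].mean() + K[n:, n:].mean() - 2.0 * K[:n, n:].mean()

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
mmd_same = mmd_squared(X, rng.standard_normal((200, 2)))           # same distribution
mmd_shifted = mmd_squared(X, rng.standard_normal((200, 2)) + 3.0)  # shifted distribution
```

Samples drawn from the same distribution give a value close to zero, while a shifted distribution gives a clearly larger one, which is what makes the quantity usable as a convergence curve.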

The relatively simple 5D mixture distribution has 10 modes. Figure 2 compares the convergence results of the three methods for varying sample sizes. Unlike A-SVGD, our AG-SVGD does not need gradient information from the target distribution. Surprisingly, however, AG-SVGD is comparable to A-SVGD when the number of particles is large enough. Furthermore, AG-SVGD obtains the best estimate of the variance among the three algorithms, and converges slightly better when the sample size is small. In contrast, SGLD performs the worst due to its noisy updates and slowly mixing samples.

Figure 2: Simulation results on Mog10
Figure 3: Samples on Mog4. From left to right: AG-SVGD, A-SVGD and SGLD.
Figure 4: SAL-MC on Mog6. Left: without MH; Middle: with MH.
Figure 5: Convergence behavior of SAL-MC on Mog6. Acceptance rate in the third plot (without MH steps) denotes the ratio of the proposed-sample probability to the current-sample probability.
Figure 6: SAL sampler results of Ring under two settings
Figure 7: Sampling results of digits for (top) and (bottom). Each row represents 50 consecutive samples from the same chain. The sampler with mixes well by generating samples from different modes easily.

For the challenging 2D-GMM dataset Mog4, the samples obtained are visualized in Figure 3. Again, we use the same bandwidth of for AG-SVGD and A-SVGD. Parameters of the gradient estimator are set among and , selected by a grid search. The initial particles are empirically drawn from . As can be seen, both A-SVGD and SGLD somewhat fail to balance samples from different modes, whereas our AG-SVGD is able to travel better between low and high density regions, leading to a more accurate approximation. In the following, the well-validated AG-SVGD is applied in SAL-MC samplers as the self-learning module, which is tested on a number of datasets.

Method   MH   Ring          Mog2          Mog6          Ring5
SAL-MC   -    1635/121138   1435/72691    1212/65287    -
         MH   1561/25624    1172/17531    978/12235     414/16287
         MH   2000/43887    951/17216     889/11745     335/6022
Table 1: ESS (2000 maximum) and ESS speed, reported as ESS / (ESS/s), on synthetic distributions of both methods

4.2 SAL-MC on synthetic datasets

Next we demonstrate the effectiveness of SAL-MC by comparing it with A-NICE-MC [Song, Zhao, and Ermon2017]. We adopt the Ring, Mog2, Mog6 and Ring5 datasets used in A-NICE-MC, and measure the efficiency of an MCMC method in terms of ESS (2000 maximum) and ESS per second. The smallest ESS among all dimensions is reported. More details are provided in the SM. The stochastic term in the algorithm is drawn from for all experiments. Since A-NICE-MC requires a Metropolis-Hastings (MH) step to accept or reject a sample, we also test the MH step in our algorithm.
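For readers unfamiliar with ESS, the following sketch shows one common way to estimate it for a single scalar chain: divide the chain length by the integrated autocorrelation time, truncating the sum at the first negative autocorrelation. This particular estimator and the name `effective_sample_size` are assumptions; the paper follows the evaluation protocol of A-NICE-MC, whose exact formula may differ.

```python
import numpy as np

def effective_sample_size(x):
    """ESS = n / (1 + 2 * sum_k rho_k), with the sum stopped at the first negative rho_k."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    acov = np.correlate(xc, xc, mode="full")[n - 1:] / n  # autocovariance at lags 0..n-1
    rho = acov / acov[0]
    tau = 1.0
    for k in range(1, n):
        if rho[k] < 0.0:
            break
        tau += 2.0 * rho[k]
    return min(n / tau, float(n))  # capped at the chain length

rng = np.random.default_rng(0)
iid_chain = rng.standard_normal(2000)
noise = rng.standard_normal(5000)
ar_chain = np.zeros(5000)
for t in range(1, 5000):
    ar_chain[t] = 0.9 * ar_chain[t - 1] + noise[t]  # strongly correlated chain

ess_iid = effective_sample_size(iid_chain)
ess_ar = effective_sample_size(ar_chain)
```

For a multivariate chain the estimator would be applied per dimension and the minimum reported, matching the convention stated above; dividing by wall-clock sampling time gives ESS per second.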

The results are shown in Table 1, where SAL-MC consistently outperforms A-NICE-MC. Interestingly, A-NICE-MC collapses without MH steps, whereas SAL-MC works similarly with or without an MH step. In addition, it is observed that both POS and SGLD achieve a very low ESS of around 10. We also calculate the Gelman-Rubin convergence statistic $\hat{R}$ [Gelman et al.1995, Brooks and Gelman1998], a common diagnostic that uses multiple chains to check for the convergence of an algorithm. Typically, a value of $\hat{R}$ close to one (e.g., less than 1.1) is a good indicator of convergence. It is observed SAL-MC obtains using 32 chains for all tasks.
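The Gelman-Rubin statistic compares between-chain and within-chain variance. A minimal sketch of the classical (non-split) version for a scalar quantity is given below; the function name and the toy chains are assumptions for illustration.

```python
import numpy as np

def gelman_rubin(chains):
    """Potential scale reduction factor for an array of shape (num_chains, chain_length)."""
    chains = np.asarray(chains, dtype=float)
    m, n = chains.shape
    B = n * chains.mean(axis=1).var(ddof=1)    # between-chain variance
    W = chains.var(axis=1, ddof=1).mean()      # mean within-chain variance
    var_hat = (n - 1) / n * W + B / n          # pooled estimate of the target variance
    return np.sqrt(var_hat / W)

rng = np.random.default_rng(0)
mixed = rng.standard_normal((32, 500))               # 32 chains sampling the same N(0, 1)
stuck = mixed + np.linspace(-2.0, 2.0, 32)[:, None]  # chains stuck around different locations

r_hat_mixed = gelman_rubin(mixed)
r_hat_stuck = gelman_rubin(stuck)
```

Chains that explore the same distribution give a value near one, while chains trapped in different regions inflate the between-chain variance and push the statistic well above the 1.1 threshold.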

We further illustrate samples drawn from the Mog6 example in the first two scatter plots of Figure 4. As can be seen, SAL-MC is able to learn the six modes reasonably well, with or without the MH step. Without MH, SAL-MC tends to generate more samples in the low-density regions between modes. Note that the injected noise should be chosen appropriately: too little noise makes distant jumps difficult, while too much noise slows convergence and yields fewer effective samples. The rightmost plot of Figure 4, together with Figure 5, demonstrates that the proposed SAL-MC has good convergence properties in terms of rapid decay of MMD (efficiency and low bias), low autocorrelation (low variance), and high acceptance rates.

We also compare the settings of single and multiple chains in SAL-MC with an equal number of samples. Specifically, for the single-chain setting, a sample is initialized from , followed by 200 updates from SAL-MC to form 200 samples. For the multiple-chain setting, we initialize 200 samples followed by a single-step update to form the final 200 samples. The results on Ring are visualized in Figure 6, from which we can see that both settings perform similarly, with samples that approximate the target distribution well.

Figure 8: ESS and test accuracy with respect to training epochs, averaged over 10 different runs (German dataset).
Figure 9: Sampling results of faces for (top) and (bottom). Each row represents 24 consecutive samples from the same chain.

4.3 SAL-MC for BLR

We further compare SAL-MC with A-NICE-MC on several BLR tasks for scalable Bayesian posterior sampling. Three datasets, Heart (532-13), Australian (690-14) and German (1000-24), are used, where $N$-$d$ means a dataset with $N$ samples and feature dimensionality $d$. The mini-batch size for training is 64; and the injected noise is drawn from for all tasks.

The ESS and ESS speed are shown in Table 2, which are calculated after 20K training epochs. With an MH step, SAL-MC tends to generate relatively highly correlated samples with high rejection rates. Nevertheless, it is still better than A-NICE-MC, which does not work at all without MH steps. We also evaluate BLR in terms of test accuracy, calculated by averaging over 10 runs with 32 chains. The models are trained on a random 80% of each dataset and tested on the remaining 20% in each run. The resulting accuracies for SAL-MC are 84.10%, 88.38% and 80.32% on the Heart, Australian and German datasets, respectively, which are the same as those of A-NICE-MC [Song, Zhao, and Ermon2017]. However, as shown in Figure 8, SAL-MC obtains much better ESS. The experiments also indicate that A-NICE-MC needs 20K training iterations to reach its highest ESS (which is much lower than that of SAL-MC, as shown in Table 2).

Method   MH   Heart         Australian    German
SAL-MC   -    1683/93140    1385/70655    1897/86512
         MH   1/10          1/8           1/14
         MH   663/10939     596/9834      483/7848
Table 2: ESS (2000 maximum) and ESS speed, reported as ESS / (ESS/s), for BLR.

4.4 AL-MC for image synthesis

We finally test AL-MC, a variant with only real samples available in training, for image synthesis on MNIST and CelebA datasets. The balance factor of Wasserstein regularization term in (8) is set to .

The generated samples are visualized in Figure 7 and Figure 9. For MNIST, when a sampler is well trained, the distribution of the generated samples over the 10 classes should be relatively uniform in order to match the training-data statistics. To verify this, we classify the generated samples with a pre-trained deep-CNN MNIST classifier (with a 99.1% accuracy). We calculate the class distribution on two settings with different injected noise, and . The empirical distributions of the ten digits are indeed even. We also plot the digits generated from the learned Markov chain in Figure 7, from which we can see in the case of , the sampler can make distant jumps easily; whereas when , transitions seem to be very smooth, thus it needs longer time to generate all digits from the ten classes. Similar results on CelebA, though not as obvious, are observed in Figure 9. More detailed results are included in the SM.

5 Conclusion

Motivated by the WGF theory, we present self-adversarially learned Bayesian sampling, a generative model that learns to draw samples from a target distribution. Two settings, i.e., whether or not true samples are provided as training data, are considered. When learning without true samples, a self-learning mechanism is proposed to automatically adjust samples from the current generator to approach a target distribution. Our method is fully automatic, and is fast and effective in sample generation. Experiments on both synthetic and real datasets demonstrate the effectiveness of our framework, which has good convergence properties while generating much less correlated samples than existing methods.


References

  • [Ambrosio, Gigli, and Savaré2005] Ambrosio, L.; Gigli, N.; and Savaré, G. 2005. Gradient Flows in Metric Spaces and in the Space of Probability Measures. Lectures in Mathematics ETH Zürich.
  • [Blundell et al.2015] Blundell, C.; Cornebise, J.; Kavukcuoglu, K.; and Wierstra, D. 2015. Weight uncertainty in neural networks. In ICML.
  • [Brooks and Gelman1998] Brooks, S. P., and Gelman, A. 1998. General methods for monitoring convergence of iterative simulations. Journal of computational and graphical statistics 7(4):434–455.
  • [Chen et al.2017] Chen, C.; Li, C.; Chen, L.; Wang, W.; Pu, Y.; and Carin, L. 2017. Continuous-time flows for deep generative models. In arXiv:1709.01179.
  • [Chen et al.2018] Chen, C.; Zhang, R.; Wang, W.; Li, B.; and Chen, L. 2018. A unified particle-optimization framework for scalable Bayesian sampling. In UAI.
  • [Chen, Ding, and Carin2015] Chen, C.; Ding, N.; and Carin, L. 2015. On the convergence of stochastic gradient MCMC algorithms with high-order integrators. In NIPS.
  • [Chen, Fox, and Guestrin2014] Chen, T.; Fox, E. B.; and Guestrin, C. 2014. Stochastic gradient Hamiltonian Monte Carlo. In ICML.
  • [Craig2014] Craig, K. 2014. The Exponential Formula for the Wasserstein Metric. PhD thesis, Rutgers, The State University of New Jersey.
  • [Ding et al.2014] Ding, N.; Fang, Y.; Babbush, R.; Chen, C.; Skeel, R. D.; and Neven, H. 2014. Bayesian sampling using stochastic gradient thermostats. In NIPS.
  • [Feng, Wang, and Liu2017] Feng, Y.; Wang, D.; and Liu, Q. 2017. Learning to draw samples with amortized Stein variational gradient descent. In UAI.
  • [Gelman et al.1995] Gelman, A.; Carlin, J. B.; Stern, H. S.; and Rubin, D. B. 1995. Bayesian data analysis. Chapman and Hall/CRC.
  • [Goodfellow et al.2014] Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; and Bengio, Y. 2014. Generative adversarial nets. In Neural Information Processing Systems (NIPS).
  • [Guidoum2015] Guidoum, A. C. 2015. Kernel estimator and bandwidth selection for density and its derivatives. Technical Report The kedd Package, Version 1.0.3.
  • [Haarnoja et al.2017] Haarnoja, T.; Tang, H.; Abbeel, P.; and Levine, S. 2017. Reinforcement learning with deep energy-based policies. In ICML.
  • [Han and Liu2018] Han, J., and Liu, Q. 2018. Stein variational gradient descent without gradient. In ICML.
  • [Jordan, Kinderlehrer, and Otto1998] Jordan, R.; Kinderlehrer, D.; and Otto, F. 1998. The variational formulation of the Fokker-Planck equation. SIAM Journal on Mathematical Analysis 29(1):1–17.
  • [Li and Turner2018] Li, Y., and Turner, R. E. 2018. Gradient estimators for implicit models. In ICLR.
  • [Liu and Wang2016] Liu, Q., and Wang, D. 2016. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS.
  • [Liu et al.2017] Liu, Y.; Ramachandran, P.; Liu, Q.; and Peng, J. 2017. Stein variational policy gradient. In UAI.
  • [Liu2017] Liu, Q. 2017. Stein variational gradient descent as gradient flow. In NIPS.
  • [Osband and Van Roy2017] Osband, I., and Van Roy, B. 2017. Why is posterior sampling better than optimism for reinforcement learning? In ICML.
  • [Raginsky, Rakhlin, and Telgarsky2017] Raginsky, M.; Rakhlin, A.; and Telgarsky, M. 2017. Non-convex learning via stochastic gradient Langevin dynamics: a nonasymptotic analysis. In COLT.
  • [Song, Zhao, and Ermon2017] Song, J.; Zhao, S.; and Ermon, S. 2017. A-NICE-MC: Adversarial training for MCMC. In NIPS.
  • [Teh, Thiery, and Vollmer2016] Teh, Y. W.; Thiery, A. H.; and Vollmer, S. J. 2016. Consistency and fluctuations for stochastic gradient Langevin dynamics. JMLR 17(1):193–225.
  • [Welling and Teh2011] Welling, M., and Teh, Y. W. 2011. Bayesian learning via stochastic gradient Langevin dynamics. In ICML.
  • [Zhang et al.2018] Zhang, R.; Li, C.; Chen, C.; and Carin, L. 2018. Learning structural weight uncertainty for sequential decision-making. In AISTATS.
  • [Zhang, Liang, and Charikar2017] Zhang, Y.; Liang, P.; and Charikar, M. 2017. A hitting time analysis of stochastic gradient Langevin dynamics. In COLT.
  • [Zhang, Zhang, and Chen2018] Zhang, J.; Zhang, R.; and Chen, C. 2018. Stochastic particle-optimization sampling and the non-asymptotic convergence theory. Technical Report arXiv:1809.01293.