1 Introduction
Drawing novel samples from the data distribution is at the heart of generative modeling. Existing work falls into two categories: Implicit Generative Models (IGM) and Explicit Generative Models (EGM). Generative adversarial networks (GAN) Goodfellow et al. (2014) are representative examples of IGMs, which learn to transform simple source distributions into target data distributions by minimizing divergences Nowozin et al. (2016) or integral probability metrics Arjovsky et al. (2017); Li et al. (2017) between the model and the data distribution. On the other hand, EGMs typically optimize the likelihood of non-normalized density models (e.g., energy-based models LeCun et al. (2006)) or learn score functions (i.e., gradients of log-densities) Hyvärinen (2005); Vincent (2011). Because they explicitly model densities or gradients of log-densities, EGMs remain favorable for a wide range of applications such as anomaly detection Du and Mordatch (2019); Grathwohl et al. (2020), image processing Song and Ermon (2019), and more. However, the generative capability of many EGMs is not as competitive as that of GANs on high-dimensional distributions, such as images. This paper focuses on an in-depth study of EGMs.

Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) Welling and Teh (2011) have led to certain success in EGMs, especially with energy-based models Du and Mordatch (2019); Nijkamp et al. (2019); Grathwohl et al. (2020) and score-based models Song et al. (2019), for high-dimensional inference tasks such as image generation. As a stochastic optimization procedure, SGLD moves samples along the gradient of the log-density, with carefully controlled diminishing random noise, and converges to the true data distribution with a theoretical guarantee. The recent noise-conditional score network Song and Ermon (2019) estimates the score functions of data distributions perturbed with varying degrees of Gaussian noise. For inference, it uses annealed SGLD to produce impressive samples that are comparable to state-of-the-art (SOTA) GAN-based models on CIFAR-10.
Another interesting sampling technique is kernel Stein variational gradient descent (SVGD) Liu and Wang (2016); Liu (2017), which iteratively produces samples via deterministic updates that optimally reduce the KL divergence between the model and the target distribution. The particles (samples) in SVGD interact with each other, simultaneously moving towards high-density regions following the gradients while pushing each other away due to a repulsive force induced by the kernels. These properties have made SVGD promising in various challenging applications such as Bayesian optimization Gong et al. (2019), deep probabilistic models Wang and Liu (2016); Feng et al. (2017); Pu et al. (2017), and reinforcement learning Haarnoja et al. (2017).

Despite the attractive nature of SVGD, how to make its inference effective and scalable for complex high-dimensional data distributions is an open question that has not been studied in sufficient depth. One major challenge in high-dimensional inference is dealing with multi-modal distributions containing many low-density regions, where SVGD can fail even on simple Gaussian mixtures. A remedy for this problem is "noise annealing"; however, such a relatively simple solution may still lead to deteriorating performance as the dimensionality of the data increases.
In this paper, we aim to significantly enhance the capability of SVGD for sampling from complex and high-dimensional distributions. Specifically, we propose a novel method, the Noise-Conditional Kernel SVGD, or NCK-SVGD in short, where the kernels are conditionally learned or selected based on the perturbed data distributions. Our main contributions are threefold. First, we propose to learn parameterized kernels with noise-conditional autoencoders, which capture the shared visual properties of sampled data at different noise levels. Second, we introduce NCK-SVGD with an additional entropy regularization, allowing flexible control of the trade-off between sample quality and diversity, which we quantitatively evaluate with precision and recall curves. Third, the proposed NCK-SVGD achieves a new SOTA FID score on CIFAR-10 within the EGM family and is comparable to the results of GAN-based models in the IGM family. Our work shows that high-dimensional inference can be successfully achieved with SVGD.
2 Background
In this section, we review Stein Variational Gradient Descent (SVGD) for drawing samples from target distributions and describe how to estimate score functions via the recent Noise Conditional Score Network (NCSN).
Stein Variational Gradient Descent (SVGD)
Let $p(x)$ be a positive and continuously differentiable probability density function on $\mathcal{X} \subseteq \mathbb{R}^d$. For simplicity, we write $p$ for $p(x)$ in the following derivation. SVGD Liu and Wang (2016); Liu (2017) aims to find a set of particles $\{x_i\}_{i=1}^n$ to approximate $p$, such that the empirical distribution $\hat{\mu}_n = \frac{1}{n}\sum_{i=1}^n \delta_{x_i}$ of the particles weakly converges to $p$ when $n$ is large. Here $\delta_x$ denotes the Dirac delta function. To achieve this, SVGD begins with a set of initial particles drawn from an initial distribution $\mu$, and iteratively updates them with a deterministic transformation $T(x) = x + \epsilon\,\phi^*_{\mu,p}(x)$:
\phi^*_{\mu,p} = \arg\max_{\phi \in \mathcal{B}} \Big\{ -\tfrac{d}{d\epsilon}\, \mathrm{KL}(T_\sharp \mu \,\|\, p)\, \big|_{\epsilon=0} \Big\}   (1)
where $T_\sharp\mu$ is the measure of the updated particles $x' = T(x)$ with $x \sim \mu$, and $\phi^*_{\mu,p}$ is the optimal transformation maximally decreasing the KL divergence between the target data distribution $p$ and the transformed particle distribution $T_\sharp\mu$, for $\phi$ in the unit ball $\mathcal{B}$ of the RKHS.
By letting $\epsilon$ go to zero, the continuous Stein descent is given by $dx_t = \phi^*_{\mu_t,p}(x_t)\,dt$, where $\mu_t$ is the density of $x_t$. A key observation from Liu (2017) is that, under some mild conditions, the negative gradient of the KL divergence in Eq. (1) is exactly equal to the squared Kernel Stein Discrepancy (KSD):
\tfrac{d}{dt}\, \mathrm{KL}(\mu_t \,\|\, p) = -\mathbb{D}^2(\mu_t, p), \qquad \mathbb{D}(\mu, p) = \max_{\phi \in \mathcal{B}}\ \mathbb{E}_{x \sim \mu}\big[ \mathcal{T}_p\, \phi(x) \big]   (2)
where $s_p(x) = \nabla_x \log p(x)$, and $\mathcal{T}_p$ is the Stein operator that maps a vector-valued function $\phi(x)$ to a scalar-valued function $\mathcal{T}_p\,\phi(x) = s_p(x)^\top \phi(x) + \nabla_x \cdot \phi(x)$. KSD provides a discrepancy measure between $\mu$ and $p$: $\mathbb{D}(\mu, p) = 0$ if and only if $\mu = p$, given that $\mathcal{B}$ is sufficiently large. By taking $\mathcal{B}$ to be the unit ball of a reproducing kernel Hilbert space (RKHS), Eq. (2) admits a closed-form solution. Specifically, let $\mathcal{H}$ be an RKHS of scalar-valued functions with a positive definite kernel $k(x, x')$, and $\mathcal{H}^d$ the corresponding vector-valued RKHS. The optimal solution of Eq. (2) Liu et al. (2016); Chwialkowski et al. (2016) is $\phi^*_{\mu,p} \propto \phi_{\mu,p}$, where
\phi_{\mu,p}(x) = \mathbb{E}_{x' \sim \mu}\big[ k(x', x)\, \nabla_{x'} \log p(x') + \nabla_{x'} k(x', x) \big]   (3)
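The update in Eq. (3) can be sketched in a few lines of NumPy. The following is a minimal illustration, not the paper's implementation: `svgd_step`, `rbf_kernel`, and the median-heuristic bandwidth are illustrative choices, and the score function is assumed to be given.

```python
import numpy as np

def rbf_kernel(X, h):
    """RBF Gram matrix K[i, j] = exp(-||x_i - x_j||^2 / h) and the
    repulsive term sum_j grad_{x_j} k(x_i, x_j)."""
    diff = X[:, None, :] - X[None, :, :]
    sq = np.sum(diff ** 2, axis=-1)
    K = np.exp(-sq / h)
    # d/dx_j exp(-||x_i - x_j||^2 / h) = (2 / h) (x_i - x_j) K[i, j]
    grad_K = 2.0 / h * (K[..., None] * diff).sum(axis=1)
    return K, grad_K

def svgd_step(X, score_fn, step=0.1, h=None):
    """One SVGD update: phi(x_i) = mean_j [k(x_j, x_i) score(x_j) + grad_{x_j} k(x_j, x_i)]."""
    n = X.shape[0]
    if h is None:  # median heuristic for the bandwidth
        sq = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
        h = np.median(sq) / np.log(n + 1.0)
    K, grad_K = rbf_kernel(X, h)
    phi = (K @ score_fn(X) + grad_K) / n  # driving term + repulsive term
    return X + step * phi
```

Iterating `svgd_step` with the score of a standard Gaussian, for example, deterministically transports an arbitrary initial particle cloud toward that Gaussian.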
The remaining question is how to estimate the score function $\nabla_x \log p(x)$ from data, without knowing $p$, for generative modeling tasks.
Score Estimation
To circumvent the expensive Markov Chain Monte Carlo (MCMC) sampling or the intractable partition function when estimating a non-normalized probability density (e.g., energy-based models), Score Matching Hyvärinen (2005) directly estimates the score with a model $s_\theta(x)$ trained to approximate $\nabla_x \log p_{data}(x)$. To train the model $s_\theta$, we use the equivalent objective

\mathbb{E}_{p_{data}(x)} \Big[ \mathrm{tr}\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\, \|s_\theta(x)\|_2^2 \Big]   (4)

without the need of accessing $\nabla_x \log p_{data}(x)$. However, score matching is not scalable to deep neural networks and high-dimensional data Song et al. (2019) due to the expensive computation of $\mathrm{tr}(\nabla_x s_\theta(x))$.

To overcome this issue, denoising score matching (DSM) Vincent (2011) instead matches $s_\theta$ to the score of a nonparametric kernel density estimator (KDE), $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_{data}(x)\, dx$, where $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x};\, x, \sigma^2 I)$ is a smoothing kernel with isotropic Gaussian noise of variance $\sigma^2$. Vincent (2011) further shows that this objective is equivalent to the following:

\tfrac{1}{2}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_{data}(x)} \Big[ \big\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big]   (5)

where the target $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x})/\sigma^2$ has a simple closed form. We can interpret Eq. (5) as learning a score function that moves the noisy input $\tilde{x}$ toward the clean data $x$.
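The DSM objective can be verified numerically on a 1-D Gaussian toy case, where the minimizer is known in closed form: if the data are $\mathcal{N}(0, s_0^2)$, the optimal score of the $\sigma$-smoothed density is linear with slope $-1/(s_0^2 + \sigma^2)$. The sketch below (illustrative values, not from the paper) fits a linear score model to the DSM target by least squares and recovers that slope.

```python
import numpy as np

rng = np.random.default_rng(0)
s0, sigma = 2.0, 0.5                              # data std and perturbation noise std
x = rng.normal(0.0, s0, 100_000)                  # clean samples
x_tilde = x + sigma * rng.normal(size=x.shape)    # perturbed samples

# DSM target: grad log q_sigma(x_tilde | x) = (x - x_tilde) / sigma^2
target = (x - x_tilde) / sigma**2

# Fit a linear score model s(x~) = a * x~ by least squares (closed form).
a = np.sum(x_tilde * target) / np.sum(x_tilde**2)

# The DSM minimizer is the score of the smoothed density N(0, s0^2 + sigma^2),
# whose linear coefficient is -1 / (s0^2 + sigma^2).
print(a, -1.0 / (s0**2 + sigma**2))
```

The fitted coefficient `a` matches the analytical slope up to Monte Carlo error, illustrating that DSM estimates the score of the *smoothed* distribution rather than of the clean data.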
One caveat of DSM is that the optimal score matches the data score only when the noise $\sigma$ is small enough. However, learning the score function with a single-noise perturbed data distribution leads to inaccurate score estimation in low-density regions of high-dimensional data spaces, which can be severe under the low-dimensional manifold assumption. Thus, Song and Ermon (2019) propose learning a noise-conditional score network (NCSN) based on multiple perturbed data distributions with Gaussian noise of varying magnitudes:
\tfrac{1}{2L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{p_{data}(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma_i^2 I)} \Big[ \big\| s_\theta(\tilde{x}, \sigma_i) - \tfrac{x - \tilde{x}}{\sigma_i^2} \big\|_2^2 \Big]   (6)

where $\{\sigma_i\}_{i=1}^{L}$ is a positive geometric sequence such that $\sigma_1$ is large enough to mitigate the low-density regions of the high-dimensional space and $\sigma_L$ is small enough to minimize the effect of the perturbation. Note that $\lambda(\sigma_i) = \sigma_i^2$ balances the scale of the loss function across noise levels $\sigma_i$.

After learning the noise-conditional score functions $s_\theta(x, \sigma_i)$, Song and Ermon (2019) conduct annealed SGLD to draw samples from the sequence of the model's scores under annealed noise levels $\sigma_1 > \sigma_2 > \cdots > \sigma_L$:
x_t = x_{t-1} + \tfrac{\alpha_i}{2}\, s_\theta(x_{t-1}, \sigma_i) + \sqrt{\alpha_i}\, z_t, \qquad z_t \sim \mathcal{N}(0, I)   (7)

where $z_t$ is standard normal noise and $\epsilon$ is a constant controlling the diversity of SGLD. It is crucial to choose the learning rate $\alpha_i = \epsilon\, \sigma_i^2 / \sigma_L^2$, which balances the scale between the gradient term and the noise term. See the detailed explanation in Song and Ermon (2019).
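The annealed SGLD loop of Eq. (7) can be sketched as follows, here applied to a toy 1-D Gaussian whose noise-perturbed scores are available in closed form. Function names, the schedule, and all constants are illustrative, not the paper's settings.

```python
import numpy as np

def annealed_sgld(score_fn, sigmas, n_samples=500, T=100, eps=2e-3, seed=0):
    """Annealed SGLD (Eq. 7): for each noise level sigma_i, run T Langevin steps
    with step size alpha_i = eps * sigma_i^2 / sigma_L^2."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-8.0, 8.0, size=n_samples)    # broad initialization
    for sigma in sigmas:                          # sigma_1 > ... > sigma_L
        alpha = eps * sigma**2 / sigmas[-1]**2
        for _ in range(T):
            z = rng.standard_normal(n_samples)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(alpha) * z
    return x

# Toy target N(0, 1): the sigma-perturbed density is N(0, 1 + sigma^2),
# so its score is -x / (1 + sigma^2).
score = lambda x, sigma: -x / (1.0 + sigma**2)
sigmas = np.geomspace(5.0, 0.1, 10)               # geometric annealing schedule
samples = annealed_sgld(score, sigmas)
```

Starting from a broad uniform initialization, the large early noise levels let the chain traverse low-density regions, while the final small levels refine the samples toward the clean target.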
3 Challenges of SVGD in High Dimensions
In this section, we provide a deeper analysis of SVGD on a toy mixture model with imbalanced mixture weights, as the dimension of the data distribution increases.
3.1 Mixture of Gaussians with Disjoint Supports
Consider a simple mixture distribution $p(x) = \pi\, p_1(x) + (1 - \pi)\, p_2(x)$, where $p_1$ and $p_2$ have disjoint, separated supports, and $\pi \in (0, 1)$. Wenliang et al. (2019); Song and Ermon (2019) demonstrate that score-based sampling methods such as SGLD cannot correctly recover the mixture weights of the two modes in reasonable time. The reason is that the score function equals $\nabla_x \log p_1(x)$ in the support of $p_1$ and $\nabla_x \log p_2(x)$ in the support of $p_2$; in either case it carries no information about $\pi$, so score-based sampling algorithms are blind to the mixture weights, which may lead to samples with any reweighting of the components, depending on the initialization. We now show that vanilla SVGD also suffers from this issue on a 2-dimensional mixture of Gaussians, even when we use the ground-truth scores for sampling. See the middle left of Figure 0(a) for the samples and the green curve in Figure 0(b) for the error (e.g., Maximum Mean Discrepancy) between the generated samples and the ground-truth samples. This phenomenon can also be explained by the objective of SVGD, which accesses $p$ only through its score function and is therefore equally blind to the mixture weights.
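The weight blindness is easy to check numerically: for a well-separated 1-D two-component mixture, the score at a point inside one component's support is essentially independent of the mixture weight. The means, widths, and evaluation point below are illustrative values, not from the paper.

```python
import numpy as np

def mixture_score(x, pi, mu1=-5.0, mu2=5.0, s=0.5):
    """Score of p(x) = pi * N(mu1, s^2) + (1 - pi) * N(mu2, s^2), computed via
    posterior responsibilities for numerical stability. The shared normalizing
    constant of the two components cancels in the responsibilities."""
    log_w = np.log([pi, 1.0 - pi])
    log_comp = np.stack([-0.5 * ((x - m) / s) ** 2 for m in (mu1, mu2)])
    log_post = log_w[:, None] + log_comp
    log_post -= np.logaddexp(log_post[0], log_post[1])  # normalize responsibilities
    r = np.exp(log_post)
    # score(x) = sum_k r_k(x) * (mu_k - x) / s^2
    return (r[0] * (mu1 - x) + r[1] * (mu2 - x)) / s**2

x = np.array([-5.3])  # a point inside the support of the first component
print(mixture_score(x, pi=0.2), mixture_score(x, pi=0.8))  # nearly identical
```

Near the first mode the responsibility of the far-away component is vanishingly small, so changing $\pi$ from 0.2 to 0.8 leaves the score unchanged up to floating-point noise, which is exactly why score-driven samplers cannot recover the weights.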
3.2 Annealed SVGD in High Dimensions
To overcome this issue, Song and Ermon (2019) propose annealed SGLD (A-SGLD) with a sequence of noise-perturbed score functions, where injecting noise of varying magnitudes enlarges the supports of the different mixture components and mitigates the low-density regions in high-dimensional settings. We also perform annealed SVGD (A-SVGD) with the same sequence of noise-perturbed score functions as in A-SGLD, and set the RBF kernel bandwidth to the median pairwise distance of samples from the data distribution.
For the low-dimensional case, A-SVGD produces samples that represent the correct mixture weights, as shown in the bottom left of Fig. 0(a). However, the performance of A-SVGD deteriorates as the dimension $d$ increases, as shown in the red curve of Fig. 0(b). On the other hand, the performance of A-SGLD is rather robust w.r.t. the increasing $d$ (i.e., the orange curve in Fig. 0(b) and the top-right samples in Fig. 0(a)).
We argue that the failure of A-SVGD in high dimensions may be due to an inadequate kernel choice: the bandwidth is fixed to the median pairwise distance under the real data distribution, regardless of the noise level. In Fig. 2, we present the median pairwise distance of samples from the different noise-perturbed distributions as the dimension $d$ increases. We observe that, in low-dimensional settings, the medians under different noise levels do not deviate much, which explains the good performance of A-SVGD in Fig. 0(b). Nevertheless, in high-dimensional settings, the median differs substantially between the largest and the smallest noise-perturbed data distributions. A bandwidth suitable for the largest perturbation noise no longer suits the smallest one, hence limiting the performance of fixed kernels.
Following this insight, we propose noise-conditional kernels (NCK) for A-SVGD, namely NCK-SVGD, where the data-driven kernel is conditioned on the annealing noise to dynamically adapt to the varying scales. In this example, we set the bandwidth of the RBF kernel to the median pairwise distance of the noise-perturbed data distribution at each noise level. See the sample visualizations and the performance of NCK-SVGD in Fig. 0(a) and 0(b), respectively.
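The per-level median bandwidth can be estimated directly from perturbed samples. The sketch below is illustrative (toy random data, hypothetical function name); it also exhibits the scale gap motivating the noise-conditional kernel: in higher dimensions the median pairwise distance grows sharply with the perturbation noise.

```python
import numpy as np

def median_bandwidths(x_clean, sigmas, seed=0):
    """For each noise level sigma_i, estimate the median pairwise squared distance
    of samples from the sigma_i-perturbed data distribution; use it as the RBF
    bandwidth h(sigma_i) of the noise-conditional kernel."""
    rng = np.random.default_rng(seed)
    bandwidths = {}
    for sigma in sigmas:
        x = x_clean + sigma * rng.standard_normal(x_clean.shape)
        sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        bandwidths[sigma] = np.median(sq[np.triu_indices_from(sq, k=1)])
    return bandwidths

# Toy d = 64 data: the bandwidth at the largest noise level dwarfs the one at
# the smallest level, so a single fixed bandwidth cannot fit all levels.
x = np.random.default_rng(1).standard_normal((200, 64))
h = median_bandwidths(x, np.geomspace(10.0, 0.1, 5))
```

A fixed-kernel A-SVGD would use one entry of `h` for every level; NCK-SVGD instead switches to the matching entry as the annealing proceeds.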
4 SVGD with Noise-Conditional Kernels and Entropy Regularization
In Fig. 0(b), we show that a simple noise-conditional kernel (NCK) for A-SVGD leads to considerable gains on the Gaussian mixtures described in Sec. 3. Here we discuss how to learn the NCK with deep neural networks for complex real-world data distributions. Moreover, we extend the proposed NCK-SVGD with entropy regularization to trade off sample quality and diversity.
4.1 Learning Deep Noise-Conditional Kernels
Kernel selection, also known as kernel learning, is a critical ingredient of kernel methods and has been actively studied in many applications such as generative modeling Sutherland et al. (2017); Li et al. (2017); Bińkowski et al. (2018); Li et al. (2019), Bayesian nonparametric learning Wilson et al. (2016), change-point detection Chang et al. (2019), statistical testing Wenliang et al. (2019), and more.
To make the kernel adapt to different noise levels, we consider a deep NCK of the form

k_\sigma(x, x') = k_{\mathrm{RBF}}\big(x, x';\, h_\sigma\big) + k_{\mathrm{IMQ}}\big(f_\eta(x, \sigma),\, f_\eta(x', \sigma);\, c_\sigma\big)   (8)

where $k_{\mathrm{RBF}}(x, x'; h) = \exp(-\|x - x'\|_2^2 / h)$ is the Radial Basis Function (RBF) kernel, $k_{\mathrm{IMQ}}(z, z'; c) = (c + \|z - z'\|_2^2)^{-1/2}$ is the inverse multiquadratic (IMQ) kernel, and $f_\eta$ is a learnable deep encoder with parameters $\eta$. Note that the kernel hyperparameters $h_\sigma$ and $c_\sigma$ are also conditioned on the noise $\sigma$.
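One plausible instantiation of such a kernel is sketched below: an RBF kernel on the data space plus an IMQ kernel on encoder features. The random-projection "encoder" is a stand-in (the paper learns it with a noise-conditional autoencoder), and `nck_gram` and all constants are illustrative names and values.

```python
import numpy as np

def nck_gram(X, encoder, h_data, c_code):
    """Noise-conditional kernel Gram matrix (sketch): sum of an RBF kernel on
    the data space and an IMQ kernel on the code space. In practice `encoder`,
    `h_data`, and `c_code` would all be conditioned on the noise level sigma."""
    sq_x = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    k_rbf = np.exp(-sq_x / h_data)
    Z = encoder(X)                                      # code-space features
    sq_z = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    k_imq = 1.0 / np.sqrt(c_code + sq_z)                # inverse multiquadratic
    return k_rbf + k_imq

# Stand-in encoder: a fixed random projection to an 8-dim code space.
rng = np.random.default_rng(0)
W = rng.standard_normal((64, 8)) / 8.0
K = nck_gram(rng.standard_normal((32, 64)), lambda X: X @ W, h_data=64.0, c_code=1.0)
```

Since both summands are positive definite kernels, their sum is a valid positive definite kernel and can be plugged directly into the SVGD update of Eq. (3).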
Next, we learn the deep encoder $f_\eta$ via a noise-conditional autoencoder framework to capture the common visual semantics of the noise-perturbed data distributions:

\tfrac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \mathbb{E}_{p_{data}(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x,\, \sigma_i^2 I)} \Big[ \big\| x - g_\psi\big(f_\eta(\tilde{x}, \sigma_i),\, \sigma_i\big) \big\|_2^2 \Big]   (9)

where $g_\psi$ is the corresponding noise-conditional decoder, and the code space of the autoencoder has a dimension much smaller than the data dimension. Similar to the objective in Eq. (6), the scaling constant $\lambda(\sigma_i)$ makes the reconstruction losses of the different noise levels scale-balanced.
We note that kernel learning via autoencoding is simple to train and works well in our experimental study. There are many recent advances in deep kernel learning Wilson et al. (2016); Jacot et al. (2018); Wenliang et al. (2019) that can potentially bring additional performance; we leave combining advanced kernel learning with the proposed algorithm as future work.

4.2 Entropy Regularization and Diversity Trade-off
Similarly to Wang and Liu (2019), we propose to learn a distribution $q$ that minimizes $\mathbb{E}_{x \sim q}[-\log p(x)] - \beta H(q)$, for $\beta > 0$, where $H(q) = -\mathbb{E}_{x \sim q}[\log q(x)]$ is the entropy. The first term is the fidelity term and the second term controls the diversity. Up to a constant, this objective equals $\beta\, \mathrm{KL}(q \,\|\, p^{1/\beta})$.
Note that writing $\mathrm{KL}(q \,\|\, p^{1/\beta})$ is an abuse of notation, since $p^{1/\beta}$ is not normalized. Nevertheless, as we will see next, sampling from $q^* \propto p^{1/\beta}$ can be seamlessly achieved by an entropic regularization of the Stein descent.
Consider the continuous Stein descent $dx_t = \phi^*_{\mu_t,\, p^{1/\beta}}(x_t)\, dt$, where $\mu_t$ is the density of $x_t$. We have $\tfrac{d}{dt}\, \mathrm{KL}(\mu_t \,\|\, p^{1/\beta}) = -\mathbb{D}^2(\mu_t,\, p^{1/\beta})$. Hence $\phi^*_{\mu_t,\, p^{1/\beta}}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $\mu_t$ to the smoothed distribution $p^{1/\beta}$.
See Appendix A.1 for the detailed derivation. From this proposition we see that if $\beta$ is small, the entropic regularized Stein descent converges to $p^{1/\beta}$, which concentrates on the high-likelihood modes of the distribution, and we have less diversity. If $\beta$ is high, on the other hand, we have more diversity, but we target a smoothed distribution $p^{1/\beta}$. Hence $\beta$ controls the diversity and the precision/recall trade-off of the sampling. See Appendix A.2 for an illustration on simple Gaussian mixtures and the MNIST dataset.
In Eq. (7), the constant $\epsilon$ scaling the injected noise plays a role analogous to $\beta$ and ensures the convergence of SGLD to $p$.
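Since the tempered target $p^{1/\beta}$ has score $\frac{1}{\beta}\nabla_x \log p(x)$, the entropy-regularized Stein descent amounts to rescaling the driving term of the standard SVGD update by $1/\beta$ while leaving the kernel repulsion unchanged. A minimal sketch, with an analytically known Gaussian target and illustrative hyperparameters:

```python
import numpy as np

def entropy_reg_svgd(score_fn, beta, n=100, d=1, steps=500, step=0.1, seed=0):
    """Entropy-regularized SVGD: standard SVGD run on the tempered target
    p^(1/beta), i.e., the score term is scaled by 1/beta while the kernel
    repulsion is unchanged; beta > 1 increases sample diversity."""
    rng = np.random.default_rng(seed)
    X = rng.standard_normal((n, d))
    for _ in range(steps):
        diff = X[:, None, :] - X[None, :, :]
        sq = np.sum(diff ** 2, axis=-1)
        h = np.median(sq) / np.log(n + 1.0)          # median-heuristic bandwidth
        K = np.exp(-sq / h)
        grad_K = 2.0 / h * (K[..., None] * diff).sum(axis=1)
        phi = (K @ score_fn(X) / beta + grad_K) / n  # tempered score + repulsion
        X = X + step * phi
    return X

# Target N(0, 1): the beta-regularized stationary distribution is N(0, beta),
# so larger beta should yield a visibly wider particle spread.
score = lambda X: -X
spread_low = entropy_reg_svgd(score, beta=0.5).std()
spread_high = entropy_reg_svgd(score, beta=2.0).std()
```

For a standard Gaussian target, $p^{1/\beta} \propto \exp(-x^2/(2\beta))$ is exactly $\mathcal{N}(0, \beta)$, so the particle spread grows with $\beta$, matching the diversity-versus-precision behavior described above.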
5 Related Work
SVGD has been applied to deep generative models in various contexts Wang and Liu (2016); Feng et al. (2017); Pu et al. (2017). Specifically, Feng et al. (2017); Pu et al. (2017) apply amortized SVGD to learn complex encoder functions in VAEs, which avoids the restrictive assumption of a parametric prior such as a Gaussian. Wang and Liu (2016) train stochastic neural samplers to mimic the SVGD dynamics and apply them to adversarially training an energy-based model with the MLE objective, which avoids expensive MCMC sampling. Our proposed NCK-SVGD explicitly learns the noise-conditional score functions by matching annealed KDEs without any adversarial training. For inference, NCK-SVGD leverages noise-conditional kernels to interact robustly with the noise-conditional score functions across the different perturbation noise scales.
Recently, Wang and Liu (2019) presented an entropy-regularized SVGD algorithm to learn diversified models, applying it to toy Gaussian mixtures and deep clustering. The major differences between Wang and Liu (2019) and this paper are twofold. First, our analysis provides alternative insights on the entropy-regularized KL objective and shows that it converges to the tempered distribution $p^{1/\beta}$. Second, the main application of NCK-SVGD is high-dimensional inference for image generation, which is more challenging than the applications presented in Wang and Liu (2019).
Ye et al. (2020) propose Stein self-repulsive dynamics, which integrates SGLD with SVGD to decrease the auto-correlation of Langevin dynamics, hence potentially encouraging diversified samples. Their work is complementary to our proposed NCK-SVGD and is worth exploring as a future direction.
6 Experiments
We begin with the experimental setup, then show that NCK-SVGD produces good-quality images on the MNIST and CIFAR-10 datasets and offers flexible control between sample quality and diversity. A-SGLD is the primary competing method, and we use the recent state-of-the-art results of Song and Ermon (2019) for comparison.
Network Architecture. For the score network, we adopt the noise-conditional score network (NCSN) Song and Ermon (2019) for both A-SGLD and NCK-SVGD. For the noise-conditional kernels, we use the same architecture as NCSN except for reducing the bottleneck layer to a much smaller embedding size on CIFAR-10. Please refer to Appendix B.2 for more details.
Kernel Design. We consider a mixture of RBF and IMQ kernels on the data-space and code-space features, as defined in Eq. (8). The bandwidth of the RBF kernel is set from the median pairwise distance of samples drawn from the annealed data distributions.
Inference Hyperparameters. Following Song and Ermon (2019), we choose $L$ different noise levels whose standard deviations $\{\sigma_i\}_{i=1}^{L}$ form a geometric sequence. Note that Gaussian noise at the smallest level is almost indistinguishable to human eyes for image data. The remaining NCK-SVGD hyperparameters are listed in Appendix B.2.
Evaluation Metric. We report the Inception Salimans et al. (2016) and FID Heusel et al. (2017) scores computed on generated samples. In addition, we present the Improved Precision-Recall (IPR) curve Kynkäänniemi et al. (2019) to justify the impact of entropy regularization and kernel hyperparameters on the diversity-versus-quality trade-off.
6.1 Quantitative Evaluation
MNIST
We analyze variants of NCK-SVGD quantitatively with the IPR metric Kynkäänniemi et al. (2019) on MNIST, as shown in Fig. 3. NCK-SVGD-D denotes NCK-SVGD with data-space kernels and NCK-SVGD-C denotes NCK-SVGD with code-space kernels. We have three main observations. First and foremost, both NCK-SVGD-D and NCK-SVGD-C demonstrate flexible control of the quality (i.e., precision) versus diversity (i.e., recall) trade-off. This finding aligns well with our theoretical analysis that the entropy regularization constant $\beta$ explicitly controls sample diversity (i.e., the green and red curves in Fig. 2(a)). Furthermore, the bandwidth of the RBF kernel also impacts sample diversity, which attests to the original study of the repulsive term in SVGD Liu (2017). Second, NCK-SVGD-C improves upon NCK-SVGD-D in the higher-precision region, which justifies the advantage of the deep noise-conditional kernel with autoencoder learning. Finally, when initialized with samples obtained from 5 out of 100 steps of A-SGLD at the first noise level, NCK-SVGD-C achieves a higher recall than the SOTA A-SGLD.
CIFAR-10. We compare the proposed NCK-SVGD (using code-space kernels) with two representative families of generative models: IGMs (e.g., WGAN-GP Gulrajani et al. (2017), MoLM Ravuri et al. (2018), SNGAN Miyato et al. (2018), ProgressGAN Karras et al. (2018)) and gradient-based EGMs (e.g., EBM Du and Mordatch (2019), short-run MCMC Nijkamp et al. (2019), A-SGLD Song and Ermon (2019), JEM Grathwohl et al. (2020)). See Tab. 5 for the Inception/FID scores and Fig. 5 for the IPR curve.
Motivated by the MNIST experiment, we initialize NCK-SVGD with A-SGLD samples generated from the first 5 noise levels, then continue running NCK-SVGD for the remaining 5 noise levels. This amounts to using a portion of the A-SGLD schedule as initialization.
Comparing within the gradient-based EGMs (e.g., EBM, short-run MCMC, and A-SGLD), NCK-SVGD achieves a new SOTA FID score, which is considerably better than the competing A-SGLD and even better than some class-conditional EGMs. The Inception score is also comparable to top existing methods, such as SNGAN Miyato et al. (2018).
From the IPR curve in Fig. 5, we see that NCK-SVGD improves over A-SGLD by a sizable margin, especially in the high-recall region. This again certifies the advantage of noise-conditional kernels and shows that the entropy regularization indeed encourages samples with higher recall.
6.2 Qualitative Analysis
We present the generated samples from NCK-SVGD in Fig. 6. Our generated images have higher or comparable quality to those from modern IGMs such as GANs, and from gradient-based EGMs. To convey the intuition behind NCK-SVGD, we provide intermediate samples in Fig. 5(c), where each row shows how samples evolve from the initialization to high-quality images. We also compare NCK-SVGD against the two SVGD baselines mentioned in Sec. 3, namely vanilla SVGD and A-SVGD. Due to the space limit, the failure samples from the two baselines are given in Appendix B.3.
7 Conclusion
In this paper, we presented NCK-SVGD, a diverse and deterministic sampling procedure for high-dimensional inference and image generation. NCK-SVGD is competitive with advanced stochastic MCMC methods such as A-SGLD, reaching a lower FID score. In addition, NCK-SVGD with entropic regularization offers flexible control between sample quality and diversity, which is quantitatively verified by the precision and recall curves.
Broader Impact
Recent developments in generative models have begun to blur the lines between machine- and human-generated content, calling for additional care toward ethical issues such as the copyright ownership of AI-generated art pieces, face-swapping of fake celebrity images for malicious usage, the production of biased or offensive content reflective of the training data, and more. Our NCK-SVGD framework, which performs annealed SVGD sampling via score functions learned from data, is no exception. Fortunately, one advantage of explicit generative model families, including the proposed NCK-SVGD, is more explicit control over the iterative sampling process, where we can examine whether the intermediate samples violate any user specification or other ethical constraints. On the other hand, implicit generative models such as GANs are less transparent in the generative process, which is a one-step evaluation of a complex generator network.
References
Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML).
Demystifying MMD GANs. In Proceedings of the International Conference on Learning Representations (ICLR).
Large scale GAN training for high fidelity natural image synthesis. In ICLR.
Kernel change-point detection with auxiliary deep generative models. In Proceedings of the International Conference on Learning Representations (ICLR).
A kernel test of goodness of fit. In JMLR: Workshop and Conference Proceedings.
Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, pp. 3603–3613.
Learning to draw samples with amortized Stein variational gradient descent. In UAI.
Quantile Stein variational gradient descent for batch Bayesian optimization. In International Conference on Machine Learning.
Generative adversarial nets. In NIPS.
Your classifier is secretly an energy based model and you should treat it like one. In ICLR.
Improved training of Wasserstein GANs. In NIPS.
Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning, Volume 70, pp. 1352–1361.
GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In Advances in Neural Information Processing Systems, pp. 6626–6637.
Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709.
Neural tangent kernel: convergence and generalization in neural networks. In NIPS, pp. 8571–8580.
Progressive growing of GANs for improved quality, stability, and variation. In ICLR.
Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pp. 3929–3938.
A tutorial on energy-based learning. Predicting Structured Data 1 (0).
MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems (NIPS), pp. 2203–2213.
Implicit kernel learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS).
A kernelized Stein discrepancy for goodness-of-fit tests. In International Conference on Machine Learning, pp. 276–284.
Stein variational gradient descent: a general purpose Bayesian inference algorithm. In Advances in Neural Information Processing Systems.
Stein variational gradient descent as gradient flow. In Advances in Neural Information Processing Systems, pp. 3115–3123.
Spectral normalization for generative adversarial networks. In ICLR.
Learning non-convergent non-persistent short-run MCMC toward energy-based model. In Advances in Neural Information Processing Systems, pp. 5233–5243.
f-GAN: training generative neural samplers using variational divergence minimization. In NIPS, pp. 271–279.
VAE learning via Stein variational gradient descent. In NIPS, pp. 4236–4245.
Learning implicit generative models with the method of learned moments. In ICML.
Improved techniques for training GANs. In NIPS.
Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907.
Sliced score matching: a scalable approach to density and score estimation. In UAI.
Generative models and model criticism via optimized maximum mean discrepancy. In Proceedings of the International Conference on Learning Representations (ICLR).
A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674.
Learning to draw samples: with application to amortized MLE for generative adversarial learning. arXiv preprint arXiv:1611.01722.
Nonlinear Stein variational gradient descent for learning diversified mixture models. In International Conference on Machine Learning, pp. 6576–6585.
Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML-11).
Learning deep kernels for exponential family densities. In ICML.
Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS).
Stein self-repulsive dynamics: benefits from past samples. arXiv preprint arXiv:2002.09070.
Appendix A Generative Modeling with Diversity Constraint
A.1 Technical Proof
We propose to learn a distribution $q$ that minimizes $\mathcal{L}(q) = \mathbb{E}_{x \sim q}[-\log p(x)] - \beta H(q)$, for $\beta > 0$, where $H(q) = -\mathbb{E}_{x \sim q}[\log q(x)]$. The first term is the fidelity term and the second term controls the diversity. Up to a constant, we have $\mathcal{L}(q) = \beta\, \mathrm{KL}(q \,\|\, p^{1/\beta})$.
Its first variation is given by:
\frac{\delta \mathcal{L}}{\delta q}(x) = -\log p(x) + \beta \big( \log q(x) + 1 \big).
Consider the continuous Stein descent
dx_t = \phi^*_{\mu_t,\, p^{1/\beta}}(x_t)\, dt,
where $\mu_t$ is the density of $x_t$. We have:
\frac{d}{dt}\, \mathrm{KL}(\mu_t \,\|\, p^{1/\beta}) = -\mathbb{D}^2(\mu_t,\, p^{1/\beta}).
Hence $\phi^*_{\mu_t,\, p^{1/\beta}}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $\mu_t$ to the smoothed distribution $p^{1/\beta}$.
From this proposition we see that if $\beta$ is small, the entropic regularized Stein descent converges to $p^{1/\beta}$, which concentrates on the high-likelihood modes of the distribution, and we have less diversity. If $\beta$ is high, on the other hand, we have more diversity, but we target a smoothed distribution $p^{1/\beta}$. Hence $\beta$ controls the diversity and the precision/recall of the sampling.
Proof.
Let $\mu_t$ be the density of $x_t$, which evolves according to the continuity equation $\partial_t \mu_t = -\nabla \cdot \big( \mu_t\, \phi^*_{\mu_t,\, p^{1/\beta}} \big)$. We have:
\frac{d}{dt}\, \mathrm{KL}(\mu_t \,\|\, p^{1/\beta}) = \int \partial_t \mu_t(x)\, \Big( \log \mu_t(x) - \tfrac{1}{\beta} \log p(x) \Big)\, dx.
It follows that we have:
\frac{d}{dt}\, \mathrm{KL}(\mu_t \,\|\, p^{1/\beta}) = \int \mu_t(x)\, \Big\langle \phi^*_{\mu_t,\, p^{1/\beta}}(x),\, \nabla_x \log \tfrac{\mu_t(x)}{p^{1/\beta}(x)} \Big\rangle\, dx \quad \text{(using the divergence theorem)} = -\mathbb{D}^2(\mu_t,\, p^{1/\beta}).
∎
A.2 The Effect of Entropy Regularization on Precision/Recall
From the MNIST IPR curve in Figure 7, we see empirical evidence supporting the theoretical insights on the entropy regularizer of NCK-SVGD. We visualize four representative samples on the IPR curve, corresponding to different values of $\beta$. When $\beta$ is small (e.g., point A), the precision is high and the recall is low; the resulting samples show that many modes disappear. On the contrary, when $\beta$ is large (e.g., point D), the precision becomes lower but the recall greatly increases; the resulting samples have better coverage of the different digits.
A.3 The Effect of Entropy Regularization on a 2D Mixture of Gaussians
As shown in Figure 8, we visualize the non-normalized density $p^{1/\beta}$ of a 2-dimensional Gaussian mixture for varying choices of the entropy regularizer $\beta$. When $\beta$ is small, the resulting density shows mode dropping compared to the original $p$. When $\beta$ is large, the resulting density covers all four modes, but incorrectly, with almost equal weights.
Appendix B Additional Experiment Details
B.1 Toy Experiment
For the results in Figure 1, we mainly follow the setting of [30]. We generate samples for each subfigure of Figure 1; the score function can be analytically derived from the mixture density. The initial samples are all chosen uniformly at random in a square. The learning rate is chosen by grid search. When evaluating the Maximum Mean Discrepancy between the real data samples and the samples generated by the different sampling methods, we consider the RBF kernel and set the bandwidth by the median heuristic. The experiment was run on one Nvidia 2080Ti GPU.
B.2 Image Generation
Network Architecture
For the noise-conditional score network, we use the pretrained model of [30] (https://github.com/ermongroup/ncsn). For the noise-conditional kernel, we consider a modified NCSN architecture where the encoder consists of a ResNet with instance normalization layers, and the decoder is a U-Net-type architecture. The critical difference from the score network is the dimension of the bottleneck layer, which for both MNIST and CIFAR-10 is considerably smaller than the data dimension. In contrast, the dimensions of the hidden layers of NCSN are substantially larger than the data dimension.
Kernel Design
We consider a mixture of RBF and IMQ kernels on the data-space and code-space features, as defined in Eq. (8). The bandwidth of the RBF kernel is set from the median pairwise distance of samples drawn from the annealed data distributions. We search over the remaining kernel hyperparameters.
Inference Hyperparameters
Following [30], we choose $L$ different noise levels whose standard deviations form a geometric sequence. Note that Gaussian noise at the smallest level is almost indistinguishable to human eyes for image data.
Evaluation Metric
We report the Inception [29] (https://github.com/openai/improved-gan/tree/master/inception_score) and FID [13] (https://github.com/bioinf-jku/TTUR) scores. In addition, we also present the Improved Precision-Recall (IPR) curve [17] (https://github.com/kynkaat/improved-precision-and-recall-metric) to justify the impact of entropy regularization and kernel hyperparameters on the diversity-versus-quality trade-off. For the IPR curve on MNIST, we use data-space features (i.e., raw image pixels) to compute the KNN-3 data manifold, as grayscale images do not apply to the VGG-16 network. For the IPR curve on CIFAR-10, we follow the original setting of [17], which uses code-space embeddings from a pretrained VGG-16 model to construct the KNN-3 data manifold. For simplicity, we generate 1024 samples to compute the precision and recall, and report the average of 5 runs with different random seeds.

B.3 Baseline Comparison with SVGD and A-SVGD
Similar to the study in Section 3, we compare the proposed NCK-SVGD with two SVGD baselines, namely vanilla SVGD and annealed SVGD with a fixed kernel (A-SVGD), on three image generation benchmarks. See the MNIST results in Figure 9, the CIFAR-10 results in Figure 10, and the CelebA results in Figure 11. We present a qualitative study only and omit the quantitative evaluation, as the performance difference can be clearly distinguished from the sample quality alone. The proposed NCK-SVGD produces higher-quality samples than both baselines.