Kernel Stein Generative Modeling

07/06/2020 · Wei-Cheng Chang et al. · Carnegie Mellon University, Google, IBM

We are interested in gradient-based Explicit Generative Modeling, where samples can be derived from iterative gradient updates based on an estimate of the score function of the data distribution. Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) demonstrate impressive results with energy-based models on high-dimensional and complex data distributions. Stein Variational Gradient Descent (SVGD) is a deterministic sampling algorithm that iteratively transports a set of particles to approximate a given distribution, based on functional gradient descent that decreases the KL divergence. SVGD has shown promising results on several Bayesian inference applications. However, applying SVGD to high-dimensional problems is still under-explored. The goal of this work is to study high-dimensional inference with SVGD. We first identify key challenges for practical kernel SVGD inference in high dimensions. We then propose noise-conditional kernel SVGD (NCK-SVGD), which works in tandem with the recently introduced Noise Conditional Score Network estimator. NCK is crucial for successful inference with SVGD in high dimensions, as it adapts the kernel to the noise level of the score estimate. As we anneal the noise, NCK-SVGD targets the real data distribution. We further extend the annealed SVGD with an entropic regularization, show that this offers flexible control between sample quality and diversity, and verify it empirically by precision and recall evaluations. NCK-SVGD produces samples comparable to GANs and annealed SGLD on computer vision benchmarks, including MNIST and CIFAR-10.

1 Introduction

Drawing novel samples from the data distribution is at the heart of generative models. Existing work falls into two categories, namely Implicit Generative Models (IGM) and Explicit Generative Models (EGM). Generative adversarial networks (GANs) Goodfellow et al. (2014) are representative examples of IGMs, which learn to transform simple source distributions into target data distributions by minimizing an f-divergence Nowozin et al. (2016) or an integral probability metric Arjovsky et al. (2017); Li et al. (2017) between the model and the data distribution. On the other hand, EGMs typically optimize the likelihood of non-normalized density models (e.g., energy-based models LeCun et al. (2006)) or learn score functions (i.e., gradients of log-densities) Hyvärinen (2005); Vincent (2011). Because they explicitly model densities or gradients of log-densities, EGMs remain favorable for a wide range of applications such as anomaly detection Du and Mordatch (2019); Grathwohl et al. (2020), image processing Song and Ermon (2019), and more. However, the generative capability of many EGMs is not as competitive as GANs on high-dimensional distributions, such as images. This paper focuses on an in-depth study of EGMs.

Recent advances in Stochastic Gradient Langevin Dynamics (SGLD) Welling and Teh (2011) have led to notable successes in EGMs, especially with energy-based models Du and Mordatch (2019); Nijkamp et al. (2019); Grathwohl et al. (2020) and score-based models Song et al. (2019) on high-dimensional inference tasks such as image generation. As a stochastic optimization procedure, SGLD moves samples along the gradient of the log-density with carefully controlled, diminishing random noise, and converges to the true data distribution with theoretical guarantees. The recent noise-conditional score network Song and Ermon (2019) estimates the score functions of data distributions perturbed by Gaussian noise of varying magnitudes. For inference, it uses annealed SGLD to produce impressive samples that are comparable to state-of-the-art (SOTA) GAN-based models on CIFAR-10.

Another interesting sampling technique is kernel Stein variational gradient descent (SVGD) Liu and Wang (2016); Liu (2017), which iteratively produces samples via deterministic updates that optimally reduce the KL divergence between the model and the target distribution. The particles (samples) in SVGD interact with each other, simultaneously moving toward high-density regions following the gradients and pushing each other away due to a repulsive force induced by the kernel. These properties have made SVGD promising in various challenging applications such as Bayesian optimization Gong et al. (2019), deep probabilistic models Wang and Liu (2016); Feng et al. (2017); Pu et al. (2017), and reinforcement learning Haarnoja et al. (2017).

Despite the attractive nature of SVGD, how to make its inference effective and scalable for complex high-dimensional data distributions is an open question that has not been studied in sufficient depth. One major challenge in high-dimensional inference is dealing with multi-modal distributions with many low-density regions, where SVGD can fail even on simple Gaussian mixtures. A remedy for this problem is "noise annealing". However, such a relatively simple solution may still lead to deteriorating performance as the dimensionality of the data increases.

In this paper, we aim to significantly enhance the capability of SVGD for sampling from complex and high-dimensional distributions. Specifically, we propose a novel method, the Noise Conditional Kernel SVGD (NCK-SVGD for short), where the kernels are conditionally learned or selected based on the perturbed data distributions. Our main contributions are three-fold. First, we propose to learn parameterized kernels with noise-conditional auto-encoders, which capture shared visual properties of sampled data at different noise levels. Second, we introduce NCK-SVGD with an additional entropy regularization, for flexible control of the trade-off between sample quality and diversity, which we quantitatively evaluate with precision and recall curves. Third, the proposed NCK-SVGD achieves a new SOTA FID score of 21.95 on CIFAR-10 within the EGM family and is comparable to the results of GAN-based models in the IGM family. Our work shows that high-dimensional inference can be successfully achieved with SVGD.

2 Background

In this section, we review Stein Variational Gradient Descent (SVGD) for drawing samples from target distributions, and describe how to estimate score functions via the recently proposed Noise Conditional Score Network (NCSN).

Stein Variational Gradient Descent (SVGD)

Let $p(x)$ be a positive and continuously differentiable probability density function on $\mathbb{R}^d$. For simplicity, we abbreviate $p(x)$ as $p$ in the following derivation. SVGD Liu and Wang (2016); Liu (2017) aims to find a set of particles $\{x_i\}_{i=1}^n$ to approximate $p$, such that the empirical distribution $q_n(x) = \frac{1}{n}\sum_{i=1}^n \delta(x - x_i)$ of the particles weakly converges to $p$ when $n$ is large. Here $\delta$ denotes the Dirac delta function.

To achieve this, SVGD begins with a set of initial particles $\{x_i^0\}_{i=1}^n$ drawn from an initial distribution $q_0$, and iteratively updates them with a deterministic transformation $T(x) = x + \epsilon\,\phi^*(x)$:

$$\phi^* = \arg\max_{\phi \in \mathcal{B}} \Big\{ -\tfrac{d}{d\epsilon}\, \mathrm{KL}\big(q_{[T]} \,\|\, p\big)\Big|_{\epsilon=0} \Big\}, \qquad (1)$$

where $q_{[T]}$ is the measure of the updated particles and $\phi^*$ is the optimal transformation function maximally decreasing the KL divergence between the target data distribution $p$ and the transformed particle distribution $q_{[T]}$, for $\phi$ in the unit ball $\mathcal{B}$ of the RKHS.

By letting $\epsilon$ go to zero, the continuous Stein descent is therefore given by $dX_t = \phi^*_{q_t,p}(X_t)\,dt$, where $q_t$ is the density of $X_t$. A key observation from Liu (2017) is that, under some mild conditions, the negative gradient of the KL divergence in Eq. (1) is exactly equal to the squared Kernel Stein Discrepancy (KSD):

$$-\tfrac{d}{d\epsilon}\, \mathrm{KL}\big(q_{[T]} \,\|\, p\big)\Big|_{\epsilon=0} = \mathbb{S}^2(q, p), \qquad (2)$$

where $\mathbb{S}(q, p) = \max_{\phi \in \mathcal{B}} \mathbb{E}_{x \sim q}[\mathcal{T}_p \phi(x)]$, and $\mathcal{T}_p$ is the Stein operator that maps a vector-valued function $\phi(x)$ to the scalar-valued function $\mathcal{T}_p \phi(x) = \nabla_x \log p(x)^\top \phi(x) + \nabla_x \cdot \phi(x)$.

KSD provides a discrepancy measure between $q$ and $p$: $\mathbb{S}(q, p) = 0$ if and only if $q = p$, given that $\mathcal{B}$ is sufficiently large. By taking $\mathcal{B}$ to be the unit ball of a reproducing kernel Hilbert space (RKHS), KSD admits a closed-form solution of Eq. (2). Specifically, let $\mathcal{H}$ be an RKHS of scalar-valued functions with a positive definite kernel $k(x, x')$, and $\mathcal{H}^d$ the corresponding vector-valued RKHS. The optimal solution of Eq. (2) Liu et al. (2016); Chwialkowski et al. (2016) is $\phi^*_{q,p}(\cdot) \propto \mathbb{E}_{x \sim q}[\mathcal{T}_p k(x, \cdot)]$, where

$$\mathcal{T}_p k(x, \cdot) = \nabla_x \log p(x)\, k(x, \cdot) + \nabla_x k(x, \cdot). \qquad (3)$$
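To make the update concrete, here is a minimal NumPy sketch of a single SVGD step with an RBF kernel; the median-heuristic bandwidth, the step size, and the function names are our own illustrative choices, not the paper's implementation.

```python
import numpy as np

def rbf_kernel_and_grad(X, bandwidth):
    """RBF kernel matrix K[i, j] = exp(-||x_i - x_j||^2 / h) and
    grad_K[i, j] = gradient of K[i, j] w.r.t. x_i."""
    diffs = X[:, None, :] - X[None, :, :]                 # (n, n, d)
    sq_dists = np.sum(diffs ** 2, axis=-1)                # (n, n)
    K = np.exp(-sq_dists / bandwidth)                     # (n, n)
    grad_K = -2.0 / bandwidth * diffs * K[:, :, None]     # (n, n, d)
    return K, grad_K

def svgd_step(X, score_fn, step_size=1e-2):
    """One SVGD update: phi*(.) = E_{x~q}[k(x,.) grad log p(x) + grad_x k(x,.)]."""
    n = X.shape[0]
    # Median heuristic for the bandwidth (one common choice, not the only one).
    h = np.median(np.sum((X[:, None] - X[None]) ** 2, -1)) / np.log(n + 1)
    K, grad_K = rbf_kernel_and_grad(X, h)
    scores = score_fn(X)                                  # (n, d): grad_x log p(x_i)
    phi = (K @ scores + grad_K.sum(axis=0)) / n           # empirical Stein direction
    return X + step_size * phi
```

For a standard normal target, for instance, `score_fn` would simply be `lambda X: -X`.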

The remaining question is how to estimate the score function $\nabla_x \log p(x)$ based on the data, without knowing $p$, for generative modeling tasks.

Score Estimation

To circumvent expensive Markov chain Monte Carlo (MCMC) sampling or the intractable partition function when estimating a non-normalized probability density (e.g., an energy-based model), score matching Hyvärinen (2005) directly estimates the score with a model $s_\theta(x) \approx \nabla_x \log p_d(x)$, where $p_d$ denotes the data distribution. To train the model $s_\theta$, we use the equivalent objective

$$\mathbb{E}_{p_d(x)}\Big[ \mathrm{tr}\big(\nabla_x s_\theta(x)\big) + \tfrac{1}{2}\, \|s_\theta(x)\|_2^2 \Big], \qquad (4)$$

without the need to access $\nabla_x \log p_d(x)$. However, score matching is not scalable to deep neural networks and high-dimensional data Song et al. (2019) due to the expensive computation of $\mathrm{tr}(\nabla_x s_\theta(x))$.

To overcome this issue, denoising score matching (DSM) Vincent (2011) instead matches $s_\theta$ to a non-parametric kernel density estimator (KDE), $q_\sigma(\tilde{x}) = \int q_\sigma(\tilde{x} \mid x)\, p_d(x)\, dx$, where $q_\sigma(\tilde{x} \mid x) = \mathcal{N}(\tilde{x}; x, \sigma^2 I)$ is a smoothing kernel with isotropic Gaussian noise of variance $\sigma^2$. Vincent (2011) further shows that the objective is equivalent to

$$\tfrac{1}{2}\, \mathbb{E}_{q_\sigma(\tilde{x} \mid x)\, p_d(x)} \Big[ \big\| s_\theta(\tilde{x}) - \nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) \big\|_2^2 \Big], \qquad (5)$$

where the target $\nabla_{\tilde{x}} \log q_\sigma(\tilde{x} \mid x) = (x - \tilde{x})/\sigma^2$ has a simple closed form. We can interpret Eq. (5) as learning a score function that moves the noisy input $\tilde{x}$ toward the clean data $x$.
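As a hedged illustration of the DSM objective in Eq. (5), the following PyTorch sketch computes the loss for a placeholder MLP score model; the paper instead trains the much larger NCSN architecture.

```python
import torch
import torch.nn as nn

class TinyScoreNet(nn.Module):
    """Placeholder score model s_theta(x); only for illustrating the loss."""
    def __init__(self, dim, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, hidden), nn.Softplus(),
            nn.Linear(hidden, hidden), nn.Softplus(),
            nn.Linear(hidden, dim))

    def forward(self, x):
        return self.net(x)

def dsm_loss(score_net, x, sigma):
    """Denoising score matching (Eq. 5): match s_theta(x_tilde) to
    grad_{x_tilde} log q_sigma(x_tilde | x) = (x - x_tilde) / sigma^2."""
    noise = torch.randn_like(x) * sigma
    x_tilde = x + noise
    target = -noise / sigma ** 2          # equals (x - x_tilde) / sigma^2
    pred = score_net(x_tilde)
    return 0.5 * ((pred - target) ** 2).sum(dim=1).mean()
```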

One caveat of DSM is that the optimal score $s^*_\theta(\tilde{x}) = \nabla_{\tilde{x}} \log q_\sigma(\tilde{x})$ approximates the true score $\nabla_x \log p_d(x)$ only when the noise $\sigma$ is small enough. However, learning the score function from a single noise-perturbed data distribution leads to inaccurate score estimates in the low-density regions of a high-dimensional data space, which can be severe under the low-dimensional manifold assumption. Thus, Song and Ermon (2019) propose learning a noise-conditional score network (NCSN) based on multiple perturbed data distributions with Gaussian noise of varying magnitudes $\{\sigma_i\}_{i=1}^L$:

$$\frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\, \ell(\theta; \sigma_i), \quad \text{with} \quad \ell(\theta; \sigma) = \tfrac{1}{2}\, \mathbb{E}_{p_d(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma^2 I)} \Big[ \big\| s_\theta(\tilde{x}, \sigma) + (\tilde{x} - x)/\sigma^2 \big\|_2^2 \Big], \qquad (6)$$

where $\{\sigma_i\}_{i=1}^L$ is a positive geometric sequence such that $\sigma_1$ is large enough to mitigate the low-density regions of the high-dimensional space and $\sigma_L$ is small enough to minimize the effect of the perturbation. Note that $\lambda(\sigma_i) = \sigma_i^2$ balances the scale of the loss function across the noise levels $\sigma_i$.

After learning the noise-conditional score functions $s_\theta(x, \sigma_i)$, Song and Ermon (2019) run annealed SGLD to draw samples from the model's scores under annealed noise levels $\sigma_1 > \sigma_2 > \dots > \sigma_L$:

$$\tilde{x}_t = \tilde{x}_{t-1} + \frac{\alpha_i}{2}\, s_\theta(\tilde{x}_{t-1}, \sigma_i) + \sqrt{\gamma\, \alpha_i}\; z_t, \qquad (7)$$

where $z_t \sim \mathcal{N}(0, I)$ is standard normal noise and $\gamma$ is a constant controlling the diversity of SGLD. It is crucial to choose the learning rate $\alpha_i = \epsilon \cdot \sigma_i^2 / \sigma_L^2$, which balances the scale between the gradient term and the noise term. See the detailed explanation in Song and Ermon (2019).
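The annealed sampler of Eq. (7) can be sketched as follows in NumPy; `score_fn(x, sigma)` stands for the learned noise-conditional score, and `gamma` is our label for the diversity-controlling constant mentioned above (names and default values are illustrative, not the authors' code):

```python
import numpy as np

def annealed_sgld(score_fn, x, sigmas, eps=2e-5, n_steps=100, gamma=1.0, rng=None):
    """Annealed SGLD (Eq. 7): for each noise level sigma_i, run Langevin updates
    with step size alpha_i = eps * sigma_i^2 / sigma_L^2.  gamma scales the
    injected noise; gamma = 1 recovers the standard sampler."""
    rng = np.random.default_rng() if rng is None else rng
    sigma_L = sigmas[-1]
    for sigma in sigmas:                          # anneal from large to small noise
        alpha = eps * sigma ** 2 / sigma_L ** 2
        for _ in range(n_steps):
            z = rng.standard_normal(x.shape)
            x = x + 0.5 * alpha * score_fn(x, sigma) + np.sqrt(gamma * alpha) * z
    return x
```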

3 Challenges of SVGD in High-dimension

In this section, we provide a deeper analysis of SVGD on a toy mixture model with imbalanced mixture weights, as the dimension of the data distribution increases.

Figure 1: (a) Samples visualization: real data and samples from different inference algorithms. The three panels in the left column are for a 2-dimensional mixture of Gaussians; the right column shows results on a 64-dimensional distribution, where we visualize only the first two dimensions since the real-data distribution is isotropic. (b) MMD loss versus data dimension: Maximum Mean Discrepancy (MMD) between real-data samples and generated samples from different inference algorithms, where A-SGLD denotes annealed SGLD, A-SVGD denotes annealed SVGD with a fixed kernel, and NCK-SVGD denotes annealed SVGD with noise-conditional kernels. Note that SGLD (blue) and SVGD (green) produce similar samples with even mixture weights, so the two curves overlap.

3.1 Mixture of Gaussian with Disjoint Support

Consider a simple mixture distribution $p(x) = \pi\, p_1(x) + (1-\pi)\, p_2(x)$, where $p_1$ and $p_2$ have disjoint, separated supports and the mixture weights are imbalanced ($\pi \neq 1/2$). Wenliang et al. (2019); Song and Ermon (2019) demonstrate that score-based sampling methods such as SGLD cannot correctly recover the mixture weights of these two modes in reasonable time. The reason is that the score function $\nabla_x \log p(x)$ equals $\nabla_x \log p_1(x)$ in the support of $p_1$ and $\nabla_x \log p_2(x)$ in the support of $p_2$; in either case, score-based sampling algorithms are blind to the mixture weights, which may lead to samples with an arbitrary reweighting of the components, depending on the initialization. We now show that vanilla SVGD also suffers from this issue on a 2-dimensional mixture of Gaussians. We use the ground-truth scores (i.e., $\nabla_x \log p(x)$) when sampling with SVGD. See the middle-left panel of Figure 1(a) for the samples and the green curve in Figure 1(b) for the error (measured by Maximum Mean Discrepancy) between the generated samples and the ground-truth samples. This phenomenon can also be explained by the objective of SVGD, as the KL divergence is insensitive to the mixture weights.
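For concreteness, below is a small NumPy sketch of such an imbalanced two-component Gaussian mixture and its ground-truth score; the specific weights and means are our own illustrative choices, not necessarily the exact values used in the paper's toy experiment.

```python
import numpy as np

# Illustrative imbalanced 2-D Gaussian mixture with unit-variance components
# (weights and means are our choice, not the paper's exact configuration).
weights = np.array([0.2, 0.8])
means = np.array([[-5.0, -5.0], [5.0, 5.0]])

def mixture_log_density(x):
    """log p(x) for x of shape (n, 2)."""
    sq = ((x[:, None, :] - means[None]) ** 2).sum(-1)             # (n, 2)
    comp = np.log(weights)[None] - 0.5 * sq - np.log(2 * np.pi)   # per-component log prob
    return np.logaddexp(comp[:, 0], comp[:, 1])

def mixture_score(x):
    """Ground-truth score grad_x log p(x), the oracle used for SVGD/SGLD."""
    sq = ((x[:, None, :] - means[None]) ** 2).sum(-1)
    comp = np.log(weights)[None] - 0.5 * sq - np.log(2 * np.pi)
    resp = np.exp(comp - np.logaddexp(comp[:, :1], comp[:, 1:]))  # posterior responsibilities
    return (resp[:, :, None] * (means[None] - x[:, None, :])).sum(1)
```

Note that inside each component's support the responsibilities saturate to one, so the score carries no information about the mixture weights, which is exactly the blindness described above.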

Figure 2: Medians of pairwise distances under different noise levels, across data dimensions.

3.2 Anneal SVGD in High-dimension

To overcome this issue, Song and Ermon (2019) propose annealed SGLD (A-SGLD) with a sequence of noise-perturbed score functions: injecting noise with varying magnitudes of variance enlarges the supports of the different mixture components and mitigates the low-density regions in high-dimensional settings. We also perform annealed SVGD (A-SVGD) with the same sequence of noise-perturbed score functions as in A-SGLD, and set the RBF kernel bandwidth to the median pairwise distance of samples from the real-data distribution.

For the low-dimensional case (e.g., $d = 2$), A-SVGD produces samples that represent the correct mixture weights, as shown in the bottom-left panel of Fig. 1(a). However, the performance of A-SVGD deteriorates as $d$ increases, as shown by the red curve in Fig. 1(b). On the other hand, the performance of A-SGLD appears rather robust to increasing $d$ (the orange curve in Fig. 1(b) and the top-right samples in Fig. 1(a)).

We argue that the failure of A-SVGD in high dimensions may be due to an inadequate kernel choice: the bandwidth is fixed to the median pairwise distance of the real-data distribution, regardless of the noise level of the perturbed distributions. In Fig. 2, we present the median pairwise distances of samples drawn from the different noise-perturbed distributions as $d$ increases. We observe that, in the low-dimensional setting (e.g., $d = 2$), the medians under different noise levels do not deviate much from each other, which explains the good performance of A-SVGD at $d = 2$ in Fig. 1(b). Nevertheless, in high-dimensional settings (e.g., $d = 64$), the medians differ substantially between the largest and smallest perturbation noises ($\sigma_1$ vs. $\sigma_L$). A bandwidth suitable for the large perturbation noise $\sigma_1$ no longer suits the small perturbation noise $\sigma_L$, hence limiting the performance of fixed kernels.

Following this insight, we propose noise-conditional kernels (NCK) for A-SVGD, namely NCK-SVGD, where the data-driven kernels are conditioned on the annealing noise levels to dynamically adapt to the varying scales. In this example, we set the bandwidth of the RBF kernel to the median pairwise distance of the noise-perturbed data distribution at each noise level $\sigma_i$. See the sample visualizations and performance of NCK-SVGD in Fig. 1(a) and 1(b), respectively.
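A minimal sketch of this per-noise-level bandwidth selection, i.e., recomputing the median heuristic on each perturbed distribution, might look as follows (the helper names are ours):

```python
import numpy as np

def median_sq_dist(X):
    """Median of pairwise squared distances (the bandwidth heuristic used here)."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    return np.median(d2[np.triu_indices(len(X), k=1)])

def noise_conditional_bandwidths(X, sigmas, rng=None):
    """One RBF bandwidth per noise level, computed on the sigma-perturbed data."""
    rng = np.random.default_rng() if rng is None else rng
    return {s: median_sq_dist(X + s * rng.standard_normal(X.shape)) for s in sigmas}
```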

4 SVGD with Noise-conditional Kernels and Entropy Regularization

In Fig. 1(b), we showed that a simple noise-conditional kernel (NCK) for A-SVGD leads to considerable gains on the Gaussian mixtures described in Sec. 3. Here we discuss how to learn NCKs with deep neural networks for complex real-world data distributions. Furthermore, we extend the proposed NCK-SVGD with entropy regularization for a quality-diversity trade-off.

4.1 Learning Deep Noise-conditional Kernels

Kernel selection, also known as kernel learning, is a critical ingredient in kernel methods, and has been actively studied in many applications such as generative modeling Sutherland et al. (2017); Li et al. (2017); Bińkowski et al. (2018); Li et al. (2019), Bayesian non-parametric learning Wilson et al. (2016), change-point detection Chang et al. (2019), statistical testing Wenliang et al. (2019), and more.

To make the kernel adapt to different noise levels, we consider a deep NCK of the form

$$k_\sigma(x, y) = k_{\mathrm{rbf}}\big(f_\eta(x, \sigma), f_\eta(y, \sigma);\, \gamma_\sigma\big) + k_{\mathrm{imq}}\big(f_\eta(x, \sigma), f_\eta(y, \sigma);\, c_\sigma\big), \qquad (8)$$

where $k_{\mathrm{rbf}}(h, h'; \gamma) = \exp(-\|h - h'\|_2^2/\gamma)$ is the radial basis function (RBF) kernel, $k_{\mathrm{imq}}(h, h'; c) = (c + \|h - h'\|_2^2)^{-1/2}$ is the inverse multiquadratic (IMQ) kernel, and $f_\eta(x, \sigma)$ is a learnable deep encoder with parameters $\eta$ (the kernels can also be evaluated directly on data-space features, i.e., with $f$ the identity). Note that the kernel hyper-parameters $\gamma_\sigma$ and $c_\sigma$ are also conditioned on the noise level $\sigma$.
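Under this reading of Eq. (8), a noise-conditional kernel could be sketched as below; the exact way the paper combines data-space and code-space features may differ, so treat this only as an illustration:

```python
import numpy as np

def rbf(h1, h2, gamma):
    d2 = ((h1[:, None, :] - h2[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / gamma)

def imq(h1, h2, c):
    d2 = ((h1[:, None, :] - h2[None, :, :]) ** 2).sum(-1)
    return (c + d2) ** -0.5

def nck(x1, x2, encoder, sigma, gamma_by_sigma, c_by_sigma):
    """Noise-conditional kernel: RBF + IMQ evaluated on encoder features
    f(x, sigma), with bandwidth gamma and offset c chosen per noise level.
    Passing an identity encoder gives the data-space variant."""
    h1, h2 = encoder(x1, sigma), encoder(x2, sigma)
    return rbf(h1, h2, gamma_by_sigma[sigma]) + imq(h1, h2, c_by_sigma[sigma])
```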

Next, we learn the deep encoder $f_\eta$ via a noise-conditional auto-encoder framework to capture common visual semantics across the noise-perturbed data distributions $\{q_{\sigma_i}\}_{i=1}^L$:

$$\min_{\eta, \psi}\; \frac{1}{L} \sum_{i=1}^{L} \lambda(\sigma_i)\; \mathbb{E}_{p_d(x)}\, \mathbb{E}_{\tilde{x} \sim \mathcal{N}(x, \sigma_i^2 I)} \Big[ \big\| g_\psi\big(f_\eta(\tilde{x}, \sigma_i), \sigma_i\big) - \tilde{x} \big\|_2^2 \Big], \qquad (9)$$

where $g_\psi$ is the corresponding noise-conditional decoder, mapping the code space of the auto-encoder (of dimension $d_c$) back to the data space. Similar to the objective in Eq. (6), the scaling constant $\lambda(\sigma_i)$ makes the reconstruction loss of each noise level scale-balanced.
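A toy PyTorch sketch of such a noise-conditional auto-encoder objective follows; the MLP encoder/decoder, the conditioning by concatenating $\sigma$, and the $1/\sigma^2$ scale-balancing weight are our assumptions for illustration (the paper uses a modified NCSN encoder/decoder; see Appendix B.2):

```python
import torch
import torch.nn as nn

class NoiseCondAE(nn.Module):
    """Toy noise-conditional auto-encoder; sigma enters via concatenation."""
    def __init__(self, dim, code_dim=32, hidden=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, code_dim))
        self.dec = nn.Sequential(nn.Linear(code_dim + 1, hidden), nn.ReLU(),
                                 nn.Linear(hidden, dim))

    def encode(self, x, sigma):
        s = torch.full_like(x[:, :1], sigma)
        return self.enc(torch.cat([x, s], dim=1))

    def decode(self, z, sigma):
        s = torch.full_like(z[:, :1], sigma)
        return self.dec(torch.cat([z, s], dim=1))

def ncae_loss(model, x, sigmas):
    """Eq. (9)-style objective: reconstruction of sigma-perturbed data.
    The 1/sigma^2 weight is our guess at the scale-balancing constant."""
    loss = 0.0
    for sigma in sigmas:
        x_tilde = x + sigma * torch.randn_like(x)
        recon = model.decode(model.encode(x_tilde, sigma), sigma)
        loss = loss + ((recon - x_tilde) ** 2).sum(dim=1).mean() / sigma ** 2
    return loss / len(sigmas)
```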

We note that kernel learning via auto-encoding is simple to train and works well in our experimental study. There are many recent advances in deep kernel learning Wilson et al. (2016); Jacot et al. (2018); Wenliang et al. (2019) that could potentially bring additional performance gains. We leave combining more advanced kernel learning with the proposed algorithm as future work.

4.2 Entropy Regularization and Diversity Trade-off

Similarly to Wang and Liu (2019), we propose to learn a distribution $q^*$ such that

$$q^* = \arg\max_{q}\; \mathbb{E}_{x \sim q}[\log p(x)] + \beta\, H(q), \qquad \beta > 0,$$

where $H(q)$ denotes the entropy of $q$. The first term is the fidelity term and the second term controls the diversity. We have $q^*(x) \propto p(x)^{1/\beta}$.

Note that writing $q^* \propto p^{1/\beta}$ is an abuse of notation, since $p^{1/\beta}$ is not normalized. Nevertheless, as we will see next, sampling from $q^*$ can be seamlessly achieved by an entropic regularization of the Stein descent.

Consider the continuous Stein descent $dX_t = \phi^*_{\beta, q_t, p}(X_t)\, dt$, where

$$\phi^*_{\beta, q, p}(\cdot) = \mathbb{E}_{x \sim q}\big[ k(x, \cdot)\, \nabla_x \log p(x) + \beta\, \nabla_x k(x, \cdot) \big].$$

We have

$$\frac{d}{dt}\, \mathrm{KL}\big(q_t \,\|\, p^{1/\beta}\big) = -\beta\, \mathbb{S}^2\big(q_t,\, p^{1/\beta}\big) \leq 0.$$

Hence $\phi^*_{\beta}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $q_t$ to the smoothed distribution $p^{1/\beta}$.

See Appendix A.1 for the detailed derivation. From this proposition we see that if $\beta$ is small, the entropic-regularized Stein descent converges to $p^{1/\beta}$, which concentrates on the high-likelihood modes of the distribution, and we have less diversity. If $\beta$ is large, on the other hand, we will have more diversity but we will target a smoothed distribution $p^{1/\beta}$, since equivalently:

$$\arg\min_{q}\, \mathrm{KL}\big(q \,\|\, p^{1/\beta}\big) = \arg\max_{q}\; \mathbb{E}_{x \sim q}[\log p(x)] + \beta\, H(q).$$

Hence $\beta$ controls the diversity of the sampling and thus the precision/recall of the samples. See Appendix A.2 for an illustration on a simple Gaussian mixture and on the MNIST dataset.
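In code, the only change relative to the plain SVGD direction of Eq. (3) is that the kernel-gradient (repulsive) term is scaled by $\beta$. A minimal NumPy sketch, taking the kernel matrix, its gradients, and the scores as inputs (array conventions as in the earlier SVGD snippet), is:

```python
import numpy as np

def entropic_stein_direction(scores, K, grad_K, beta=1.0):
    """Entropy-regularized Stein direction: scaling the repulsive term by beta
    makes the fixed point target p^(1/beta) instead of p.
    scores: (n, d) score values; K: (n, n) kernel matrix;
    grad_K: (n, n, d) with grad_K[i, j] = grad_{x_i} k(x_i, x_j)."""
    n = K.shape[0]
    return (K @ scores + beta * grad_K.sum(axis=0)) / n
```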

In Eq. (7), the noise-scaling constant $\gamma$ plays the same role as $\beta$; setting $\gamma = 1$ ensures convergence of SGLD to $p$.

input : the data-perturbing noise levels $\{\sigma_i\}_{i=1}^L$, the noise-conditional score function $s_\theta(x, \sigma)$, the noise-conditional kernel $k_\sigma(\cdot, \cdot)$, a set of initial particles $\{x_j^0\}_{j=1}^n$,
the entropic regularizer $\beta$, an initial learning rate $\epsilon$, and a maximum iteration count $T$.
output : A set of particles $\{x_j\}_{j=1}^n$ that approximates the target distribution.
for $i \leftarrow 1$ to $L$ do
       $\alpha_i \leftarrow \epsilon \cdot \sigma_i^2 / \sigma_L^2$
        for $t \leftarrow 1$ to $T$ do
              $x_j^t \leftarrow x_j^{t-1} + \alpha_i\, \hat{\phi}^*(x_j^{t-1})$ for $j = 1, \dots, n$,  where $\hat{\phi}^*(\cdot) = \frac{1}{n} \sum_{l=1}^{n} \big[ k_{\sigma_i}(x_l^{t-1}, \cdot)\, s_\theta(x_l^{t-1}, \sigma_i) + \beta\, \nabla_{x_l} k_{\sigma_i}(x_l^{t-1}, \cdot) \big]$.
       $\{x_j^0\}_{j=1}^n \leftarrow \{x_j^T\}_{j=1}^n$.
Algorithm 1 NCK-SVGD with Entropic Regularization.
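Putting the pieces together, a compact NumPy sketch of Algorithm 1 is given below; the noise-conditional score and kernel are passed in as callables, and the step-size schedule mirrors the annealed SGLD sketch above. This is an illustration under our own conventions, not the authors' implementation.

```python
import numpy as np

def nck_svgd(score_fn, kernel_fn, x, sigmas, eps=1e-3, n_steps=50, beta=1.0):
    """Sketch of Algorithm 1.  score_fn(x, sigma) is the noise-conditional score;
    kernel_fn(x, sigma) returns (K, grad_K) for the noise-conditional kernel,
    with K of shape (n, n) and grad_K[i, j] = grad_{x_i} k(x_i, x_j)."""
    n = x.shape[0]
    sigma_L = sigmas[-1]
    for sigma in sigmas:                          # anneal from large to small noise
        alpha = eps * sigma ** 2 / sigma_L ** 2   # per-level step size
        for _ in range(n_steps):
            K, grad_K = kernel_fn(x, sigma)       # (n, n), (n, n, d)
            scores = score_fn(x, sigma)           # (n, d)
            phi = (K @ scores + beta * grad_K.sum(axis=0)) / n
            x = x + alpha * phi
    return x
```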

5 Related Work

SVGD has been applied to deep generative models in various contexts Wang and Liu (2016); Feng et al. (2017); Pu et al. (2017). Specifically, Feng et al. (2017); Pu et al. (2017) apply amortized SVGD to learn complex encoder functions in VAEs, which avoids the restrictive assumption of a parametric prior such as a Gaussian. Wang and Liu (2016) train stochastic neural samplers to mimic the SVGD dynamics and apply them to adversarially training energy-based models with an MLE objective, which avoids expensive MCMC sampling. Our proposed NCK-SVGD explicitly learns the noise-conditional score functions by matching annealed KDEs without any adversarial training. For inference, NCK-SVGD leverages noise-conditional kernels to robustly interact with the noise-conditional score functions across different perturbation noise scales.

Recently, Wang and Liu (2019) presented an entropy-regularized SVGD algorithm to learn diversified models, and applied it to toy Gaussian mixtures and deep clustering. The differences between Wang and Liu (2019) and this paper are two-fold. First, our analysis provides alternative insights on the entropy-regularized KL objective and shows that it converges to $p^{1/\beta}$. Second, the main application of NCK-SVGD is high-dimensional inference for image generation, which is more challenging than the applications presented in Wang and Liu (2019).

Ye et al. (2020) propose Stein self-repulsive dynamics, which integrates SGLD with SVGD to decrease the auto-correlation of Langevin dynamics and hence potentially encourages more diverse samples. Their work is complementary to our proposed NCK-SVGD and is worth exploring as a future direction.

6 Experiments

We begin with the experimental setup and show that NCK-SVGD produces good-quality images on the MNIST and CIFAR-10 datasets and offers flexible control between sample quality and diversity. A-SGLD is the primary competing method, and we use the recent state-of-the-art results of Song and Ermon (2019) for comparison.

Network Architecture For the score network, we adopt the noise-conditional score network (NCSN) Song and Ermon (2019) for both A-SGLD and NCK-SVGD. For the noise-conditional kernels, we use the same architecture as NCSN except for reducing the bottleneck layer width to a much smaller embedding size on CIFAR-10. Please refer to Appendix B.2 for more details.

Kernel Design We consider a mixture of RBF and IMQ kernels on the data-space and code-space features, as defined in Eq. (8). The bandwidth of the RBF kernel is set proportionally to the median of the samples' pairwise distances drawn from the annealed data distributions.

Inference Hyper-parameters Following Song and Ermon (2019), we choose $L = 10$ different noise levels, where the standard deviations $\{\sigma_i\}_{i=1}^{10}$ form a geometric sequence with $\sigma_1 = 1$ and $\sigma_{10} = 0.01$. Note that Gaussian noise of $\sigma = 0.01$ is almost indistinguishable to human eyes for image data. For NCK-SVGD, we tune the number of update steps per noise level $T$, the initial learning rate $\epsilon$, the entropic regularizer $\beta$, and the kernel hyper-parameters (see Appendix B.2).
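For reference, such a geometric schedule can be generated as follows (a sketch; the endpoint values match the schedule of Song and Ermon (2019) described above):

```python
import numpy as np

# Geometric annealing schedule: 10 noise levels from sigma_1 = 1.0 down to
# sigma_10 = 0.01, as in the NCSN setup described above.
sigmas = np.geomspace(1.0, 0.01, num=10)
```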

Evaluation Metrics We report the Inception Salimans et al. (2016) and FID Heusel et al. (2017) scores using k samples. In addition, we also present the Improved Precision and Recall (IPR) curve Kynkäänniemi et al. (2019) to assess the impact of the entropy regularization and kernel hyper-parameters on the diversity versus quality trade-off.
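A simplified NumPy sketch of how such precision/recall numbers are computed from pre-extracted feature vectors is shown below; it follows the k-NN-manifold definition of Kynkäänniemi et al. (2019) but omits batching and feature extraction, so it is an illustration rather than the evaluation code used here:

```python
import numpy as np

def knn_radii(feats, k=3):
    """Distance from each point to its k-th nearest neighbour (excluding itself)."""
    d = np.linalg.norm(feats[:, None, :] - feats[None, :, :], axis=-1)
    return np.sort(d, axis=1)[:, k]          # column 0 is the point itself

def manifold_coverage(query, ref, ref_radii):
    """Fraction of query points inside the union of k-NN balls around ref points."""
    d = np.linalg.norm(query[:, None, :] - ref[None, :, :], axis=-1)
    return np.mean((d <= ref_radii[None, :]).any(axis=1))

def improved_precision_recall(real_feats, fake_feats, k=3):
    precision = manifold_coverage(fake_feats, real_feats, knn_radii(real_feats, k))
    recall = manifold_coverage(real_feats, fake_feats, knn_radii(fake_feats, k))
    return precision, recall
```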

6.1 Quantitative Evaluation

Figure 3: MNIST experiments evaluated with Improved Precision and Recall (IPR). (a) Ablation of NCK-SVGD: NCK-SVGD-D (data-space kernels) and NCK-SVGD-C (code-space kernels) with varying RBF kernel bandwidths and entropy regularizers $\beta$. (b) Comparison with A-SGLD: with a short run of A-SGLD (5 steps at the first noise level) as initialization, NCK-SVGD-C achieves higher recall than A-SGLD. (c) Representative samples: uncurated samples of NCK-SVGD-C, ranging from high-precision/low-recall (A) to low-precision/high-recall (B).

MNIST

We analyze variants of NCK-SVGD quantitatively with IPR Kynkäänniemi et al. (2019) on MNIST, as shown in Fig. 3. NCK-SVGD-D denotes NCK-SVGD with data-space kernels and NCK-SVGD-C denotes NCK-SVGD with code-space kernels. We have three main observations. First and foremost, both NCK-SVGD-D and NCK-SVGD-C demonstrate flexible control over the quality (precision) versus diversity (recall) trade-off. This finding aligns with our theoretical analysis that the entropy-regularization constant $\beta$ explicitly controls sample diversity (the green and red curves in Fig. 3(a)). Furthermore, the bandwidth of the RBF kernel also impacts sample diversity, which corroborates the original analysis of the repulsive term in SVGD Liu (2017). Second, NCK-SVGD-C improves upon NCK-SVGD-D in the higher-precision region, which justifies the advantage of the deep noise-conditional kernel learned with the auto-encoder. Finally, when initialized with samples obtained from 5 out of 100 steps of A-SGLD at the first noise level (a small fraction of the total A-SGLD computation), NCK-SVGD-C achieves higher recall than the SOTA A-SGLD.

Figure 4: CIFAR-10 Inception and FID scores.

Model                                   Inception   FID
CIFAR-10 Unconditional
WGAN-GP Gulrajani et al. (2017)         7.86        36.4
MoLM Ravuri et al. (2018)               7.90        18.9
SN-GAN Miyato et al. (2018)             8.22        21.7
ProgressGAN Karras et al. (2018)        8.80        -
EBM (Ensemble) Du and Mordatch (2019)   6.78        38.2
Short-MCMC Nijkamp et al. (2019)        6.21        -
A-SGLD Song and Ermon (2019)            8.87        25.32
NCK-SVGD                                8.20        21.95
CIFAR-10 Conditional
EBM Du and Mordatch (2019)              8.30        37.9
JEM Grathwohl et al. (2020)             8.76        38.4
SN-GAN Miyato et al. (2018)             8.60        17.5
BigGAN Brock et al. (2019)              9.22        14.73

Figure 5: CIFAR-10 Precision-Recall Curve

CIFAR-10 We compare the proposed NCK-SVGD (using code-space kernels) with two representative families of generative models: IGMs (e.g., WGAN-GP Gulrajani et al. (2017), MoLM Ravuri et al. (2018), SN-GAN Miyato et al. (2018), ProgressGAN Karras et al. (2018)) and gradient-based EGMs (e.g., EBM Du and Mordatch (2019), Short-run MCMC Nijkamp et al. (2019), A-SGLD Song and Ermon (2019), JEM Grathwohl et al. (2020)). See Fig. 4 for the Inception/FID scores and Fig. 5 for the IPR curve.

Motivated by the MNIST experiment, we initialize NCK-SVGD with A-SGLD samples generated under the first 5 noise levels, then continue running NCK-SVGD for the remaining 5 noise levels. This amounts to using half of the A-SGLD annealing schedule as initialization.

Comparing within the gradient-based EGMs (EBM, Short-run MCMC, and A-SGLD), NCK-SVGD achieves a new SOTA FID score of 21.95, which is considerably better than the competing A-SGLD and even better than some class-conditional EGMs. The Inception score is also comparable to top existing methods such as SN-GAN Miyato et al. (2018).

From the IPR curve in Fig. 5, we see that NCK-SVGD improves over A-SGLD by a sizable margin, especially in the high-recall region. This again certifies the advantage of the noise-conditional kernels, and shows that the entropy regularization indeed encourages samples with higher recall.

Figure 6: Uncurated samples generated from NCK-SVGD on the MNIST and CIFAR-10 datasets: (a) MNIST; (b) CIFAR-10; (c) intermediate samples.

6.2 Qualitative Analysis

We present the generated samples from NCK-SVGD in Fig. 6. Our generated images have higher or comparable quality to those from modern IGMs like GANs and from gradient-based EGMs. To give intuition about the NCK-SVGD procedure, we provide intermediate samples in Fig. 6(c), where each row shows how samples evolve from the initialization to high-quality images. We also compare NCK-SVGD against the two SVGD baselines mentioned in Sec. 3, namely vanilla SVGD and A-SVGD. Due to space limits, the failure cases of the two baselines are shown in Appendix B.3.

7 Conclusion

In this paper, we presented NCK-SVGD, a diverse and deterministic sampling procedure for high-dimensional inference and image generation. NCK-SVGD is competitive with advanced stochastic MCMC methods such as A-SGLD, reaching a lower FID score of 21.95 on CIFAR-10. In addition, NCK-SVGD with entropic regularization offers flexible control between sample quality and diversity, which we quantitatively verify with precision and recall curves.

Broader Impact

Recent developments in generative models have begun to blur the line between machine- and human-generated content, calling for additional care regarding ethical issues such as copyright ownership of AI-generated art, face-swapping of celebrity images for malicious use, producing biased or offensive content reflective of the training data, and more. Our NCK-SVGD framework, which performs annealed SVGD sampling via score functions learned from data, is no exception. Fortunately, one advantage of explicit generative models, including the proposed NCK-SVGD, is more explicit control over the iterative sampling process, where we can examine whether intermediate samples violate user specifications or other ethical constraints. In contrast, implicit generative models such as GANs are less transparent in the generative process, which is a one-step evaluation of a complex generator network.

References

  • M. Arjovsky, S. Chintala, and L. Bottou (2017) Wasserstein generative adversarial networks. In International Conference on Machine Learning (ICML). Cited by: §1.
  • M. Bińkowski, D. J. Sutherland, M. Arbel, and A. Gretton (2018) Demystifying mmd gans. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • A. Brock, J. Donahue, and K. Simonyan (2019) Large scale gan training for high fidelity natural image synthesis. In ICLR, Cited by: Figure 5.
  • W. Chang, C. Li, Y. Yang, and B. Póczos (2019) Kernel change-point detection with auxiliary deep generative models. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • K. Chwialkowski, H. Strathmann, and A. Gretton (2016) A kernel test of goodness of fit. In JMLR: Workshop and Conference Proceedings, Cited by: §2.
  • Y. Du and I. Mordatch (2019) Implicit generation and modeling with energy based models. In Advances in Neural Information Processing Systems, pp. 3603–3613. Cited by: §1, §1, Figure 5, §6.1.
  • Y. Feng, D. Wang, and Q. Liu (2017) Learning to draw samples with amortized stein variational gradient descent. In UAI, Cited by: §1, §5.
  • C. Gong, J. Peng, and Q. Liu (2019) Quantile stein variational gradient descent for batch bayesian optimization. In International Conference on Machine Learning, Cited by: §1.
  • I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio (2014) Generative adversarial nets. In NIPS, Cited by: §1.
  • W. Grathwohl, K. Wang, J. Jacobsen, D. Duvenaud, M. Norouzi, and K. Swersky (2020) Your classifier is secretly an energy based model and you should treat it like one. In ICLR. Cited by: §1, §1, Figure 5, §6.1.
  • I. Gulrajani, F. Ahmed, M. Arjovsky, V. Dumoulin, and A. C. Courville (2017) Improved training of wasserstein gans. In NIPS, Cited by: Figure 5, §6.1.
  • T. Haarnoja, H. Tang, P. Abbeel, and S. Levine (2017) Reinforcement learning with deep energy-based policies. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1352–1361. Cited by: §1.
  • M. Heusel, H. Ramsauer, T. Unterthiner, B. Nessler, and S. Hochreiter (2017) Gans trained by a two time-scale update rule converge to a local nash equilibrium. In Advances in neural information processing systems, pp. 6626–6637. Cited by: §B.2, §6.
  • A. Hyvärinen (2005) Estimation of non-normalized statistical models by score matching. Journal of Machine Learning Research 6 (Apr), pp. 695–709. Cited by: §1, §2.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. In NIPS, pp. 8571–8580. Cited by: §4.1.
  • T. Karras, T. Aila, S. Laine, and J. Lehtinen (2018) Progressive growing of gans for improved quality, stability, and variation. In ICLR, Cited by: Figure 5, §6.1.
  • T. Kynkäänniemi, T. Karras, S. Laine, J. Lehtinen, and T. Aila (2019) Improved precision and recall metric for assessing generative models. In Advances in Neural Information Processing Systems, pp. 3929–3938. Cited by: §B.2, §6.1, §6.
  • Y. LeCun, S. Chopra, R. Hadsell, M. Ranzato, and F. Huang (2006) A tutorial on energy-based learning. Predicting structured data 1 (0). Cited by: §1.
  • C. Li, W. Chang, Y. Cheng, Y. Yang, and B. Póczos (2017) MMD GAN: towards deeper understanding of moment matching network. In Advances in Neural Information Processing Systems (NIPS), pp. 2203–2213. Cited by: §1, §4.1.
  • C. Li, W. Chang, Y. Mroueh, Y. Yang, and B. Póczos (2019) Implicit kernel learning. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS). Cited by: §4.1.
  • Q. Liu, J. Lee, and M. Jordan (2016) A kernelized stein discrepancy for goodness-of-fit tests. In International conference on machine learning, pp. 276–284. Cited by: §2.
  • Q. Liu and D. Wang (2016) Stein variational gradient descent: a general purpose bayesian inference algorithm. In Advances in neural information processing systems, Cited by: §1, §2.
  • Q. Liu (2017) Stein variational gradient descent as gradient flow. In Advances in neural information processing systems, pp. 3115–3123. Cited by: §1, §2, §2, §6.1.
  • T. Miyato, T. Kataoka, M. Koyama, and Y. Yoshida (2018) Spectral normalization for generative adversarial networks. In ICLR, Cited by: Figure 5, §6.1, §6.1.
  • E. Nijkamp, M. Hill, S. Zhu, and Y. N. Wu (2019) Learning non-convergent non-persistent short-run mcmc toward energy-based model. In Advances in Neural Information Processing Systems, pp. 5233–5243. Cited by: §1, Figure 5, §6.1.
  • S. Nowozin, B. Cseke, and R. Tomioka (2016) F-gan: training generative neural samplers using variational divergence minimization. In NIPS, pp. 271–279. Cited by: §1.
  • Y. Pu, Z. Gan, R. Henao, C. Li, S. Han, and L. Carin (2017) Vae learning via stein variational gradient descent. In NIPS, pp. 4236–4245. Cited by: §1, §5.
  • S. Ravuri, S. Mohamed, M. Rosca, and O. Vinyals (2018) Learning implicit generative models with the method of learned moments. In ICML, Cited by: Figure 5, §6.1.
  • T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen (2016) Improved techniques for training gans. In NIPS, Cited by: §B.2, §6.
  • Y. Song and S. Ermon (2019) Generative modeling by estimating gradients of the data distribution. In Advances in Neural Information Processing Systems, pp. 11895–11907. Cited by: §B.1, §B.2, §B.2, §1, §1, §2, §2, §3.1, §3.2, Figure 5, §6.1, §6, §6, §6.
  • Y. Song, S. Garg, J. Shi, and S. Ermon (2019) Sliced score matching: a scalable approach to density and score estimation. In UAI, Cited by: §1, §2.
  • D. J. Sutherland, H. Tung, H. Strathmann, S. De, A. Ramdas, A. Smola, and A. Gretton (2017) Generative models and model criticism via optimized maximum mean discrepancy. In Proceedings of the International Conference on Learning Representations (ICLR), Cited by: §4.1.
  • P. Vincent (2011) A connection between score matching and denoising autoencoders. Neural Computation 23 (7), pp. 1661–1674. Cited by: §1, §2.
  • D. Wang and Q. Liu (2016) Learning to draw samples: with application to amortized mle for generative adversarial learning. arXiv preprint arXiv:1611.01722. Cited by: §1, §5.
  • D. Wang and Q. Liu (2019) Nonlinear stein variational gradient descent for learning diversified mixture models. In International Conference on Machine Learning, pp. 6576–6585. Cited by: §4.2, §5.
  • M. Welling and Y. W. Teh (2011) Bayesian learning via stochastic gradient langevin dynamics. In Proceedings of the 28th international conference on machine learning (ICML-11), Cited by: §1.
  • L. Wenliang, D. Sutherland, H. Strathmann, and A. Gretton (2019) Learning deep kernels for exponential family densities. In ICML, Cited by: §3.1, §4.1, §4.1.
  • A. G. Wilson, Z. Hu, R. Salakhutdinov, and E. P. Xing (2016) Deep kernel learning. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), Cited by: §4.1, §4.1.
  • M. Ye, T. Ren, and Q. Liu (2020) Stein self-repulsive dynamics: benefits from past samples. arXiv preprint arXiv:2002.09070. Cited by: §5.

Appendix A Generative Modeling with Diversity Constraint

A.1 Technical Proof

We propose to learn a distribution $q^*$ such that

$$q^* = \arg\max_{q}\; \mathbb{E}_{x \sim q}[\log p(x)] + \beta\, H(q), \qquad \beta > 0.$$

The first term is the fidelity term and the second term controls the diversity. We have $q^*(x) \propto p(x)^{1/\beta}$.

Its first variation is given by

$$\frac{\delta}{\delta q}\Big( \mathbb{E}_{x \sim q}[\log p(x)] + \beta\, H(q) \Big) = \log p(x) - \beta\big(\log q(x) + 1\big),$$

and setting it to a constant (the Lagrange multiplier for the normalization of $q$) gives $q^*(x) \propto p(x)^{1/\beta}$.

Consider the continuous Stein descent $dX_t = \phi^*_{\beta, q_t, p}(X_t)\, dt$, where

$$\phi^*_{\beta, q, p}(\cdot) = \mathbb{E}_{x \sim q}\big[ k(x, \cdot)\, \nabla_x \log p(x) + \beta\, \nabla_x k(x, \cdot) \big].$$

We have

$$\frac{d}{dt}\, \mathrm{KL}\big(q_t \,\|\, p^{1/\beta}\big) = -\beta\, \mathbb{S}^2\big(q_t,\, p^{1/\beta}\big) \leq 0.$$

Hence $\phi^*_{\beta}$ is a descent direction for the entropy-regularized KL divergence. More importantly, the amount of decrease is the closeness, in the Stein sense, of $q_t$ to the smoothed distribution $p^{1/\beta}$.

From this proposition we see that if $\beta$ is small, the entropic-regularized Stein descent converges to $p^{1/\beta}$, which concentrates on the high-likelihood modes of the distribution, and we have less diversity. If $\beta$ is large, on the other hand, we will have more diversity but we will target a smoothed distribution $p^{1/\beta}$. Hence $\beta$ controls the diversity of the sampling and the precision/recall of the sampling.

Proof.

Let $q_t$ be the density of $X_t$ under the dynamics $dX_t = \phi^*_{\beta, q_t, p}(X_t)\, dt$, and write $\pi \propto p^{1/\beta}$. By the continuity equation, we have

$$\frac{\partial q_t}{\partial t} = -\nabla \cdot \big( q_t\, \phi^*_{\beta, q_t, p} \big).$$

It follows that we have:

$$\frac{d}{dt}\, \mathrm{KL}(q_t \,\|\, \pi) = \int \frac{\partial q_t}{\partial t}\, \big( \log q_t - \log \pi \big)\, dx = \int q_t\, \phi^*_{\beta, q_t, p} \cdot \nabla\big( \log q_t - \log \pi \big)\, dx \quad \text{(using the divergence theorem)}$$

$$= -\,\mathbb{E}_{x \sim q_t}\Big[ \nabla_x \log \pi(x)^\top \phi^*_{\beta, q_t, p}(x) + \nabla_x \cdot \phi^*_{\beta, q_t, p}(x) \Big] = -\,\mathbb{E}_{x \sim q_t}\big[ \mathcal{T}_{\pi}\, \phi^*_{\beta, q_t, p}(x) \big] = -\beta\, \mathbb{S}^2(q_t, \pi),$$

where the last equality uses $\phi^*_{\beta, q, p}(\cdot) = \beta\, \mathbb{E}_{x \sim q}[\mathcal{T}_{\pi} k(x, \cdot)]$, since $\nabla_x \log \pi = \tfrac{1}{\beta} \nabla_x \log p$.

A.2 The Effect of Entropy Regularization on Precision/Recall

Figure 7: Varying the entropy regularizer $\beta$ on the Improved Precision and Recall (IPR) curve and the corresponding samples: (a) IPR for different $\beta$; (b) corresponding samples for different $\beta$. Points A, B, C, and D correspond to different settings of $\beta$ and the kernel bandwidth. The empirical observations on MNIST's IPR curve align well with our theoretical insights on the entropy regularization $\beta$.

From the MNIST’s IPR curve in Figure 7, We see the empirical evidence that supports the theoretical insights on the entropy-regularizer of NCK-SVGD. We visualize four representative samples on the IPR curve, namely , corresponding to different of , respectively. When is small (e.g., point A), the precision is high and recall is small. The resulting samples show that many modes are disappearing. In contrary, when is large (e.g., point D), the precision becomes lower but recall is greatly increase. The resulting samples have better coverage in different digits.

A.3 The Effect of Entropy Regularization on a 2D Mixture of Gaussians

Figure 8: (Non-normalized) density of $p^{1/\beta}$, where $p$ is a 2-dimensional mixture of Gaussians with imbalanced mixture weights; panels (a)-(d) correspond to increasing values of $\beta$.

As shown in Figure 8, we visualize the non-normalized density $p^{1/\beta}$ of a 2-dimensional Gaussian mixture for varying choices of the entropy regularizer $\beta$. When $\beta$ is small, the resulting $p^{1/\beta}$ shows mode dropping compared to the original $p$. When $\beta$ is large, the resulting $p^{1/\beta}$ covers all four modes, but incorrectly with almost equal weights.
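As a sketch, the non-normalized density $p^{1/\beta}$ can be evaluated on a grid as follows; the four-mode mixture and the $\beta$ values here are our own illustrative choices, not necessarily the paper's exact configuration.

```python
import numpy as np

# Illustrative 4-mode 2-D Gaussian mixture with imbalanced weights (our own
# choice of means/weights, not the paper's exact setting).
weights = np.array([0.5, 0.3, 0.15, 0.05])
means = np.array([[4.0, 4.0], [-4.0, 4.0], [4.0, -4.0], [-4.0, -4.0]])

def log_p(x):
    """log p(x) for x of shape (n, 2), unit-variance components."""
    sq = ((x[:, None, :] - means[None]) ** 2).sum(-1)
    comp = np.log(weights)[None] - 0.5 * sq - np.log(2 * np.pi)
    m = comp.max(1, keepdims=True)
    return (m + np.log(np.exp(comp - m).sum(1, keepdims=True))).ravel()

# Non-normalized density p(x)^(1/beta) on a grid, for several beta values.
xs = np.linspace(-8.0, 8.0, 200)
grid = np.stack(np.meshgrid(xs, xs), axis=-1).reshape(-1, 2)
densities = {beta: np.exp(log_p(grid) / beta).reshape(200, 200)
             for beta in (0.1, 1.0, 5.0)}
```

Small $\beta$ sharpens the density toward the heaviest mode (mode dropping), while large $\beta$ flattens it toward near-uniform weights, matching the behavior described above.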

Appendix B Additional Experiment Details

B.1 Toy Experiment

For the results in Figure 1, we mainly follow the setting of [30] for the toy mixture. We generate samples for each subfigure of Figure 1. The score function can be analytically derived from the mixture density. The initial samples are all drawn uniformly from a square region. For SGLD and SVGD, we use a fixed step size. For A-SGLD, A-SVGD, and NCK-SVGD, we use the annealed noise schedule and per-level step sizes described in Sec. 2 and Sec. 4. The learning rate is chosen by grid search. When evaluating with Maximum Mean Discrepancy between the real-data samples and the generated samples from the different sampling methods, we consider the RBF kernel and set the bandwidth by the median heuristic. The experiment was run on one Nvidia 2080Ti GPU.

B.2 Image Generation

Network Architecture

For the noise-conditional score network, we use the pre-trained model of [30] (https://github.com/ermongroup/ncsn). For the noise-conditional kernel, we consider a modified NCSN architecture where the encoder consists of a ResNet with instance normalization layers, and the decoder is a U-Net-type architecture. The critical difference from the score network is the dimension of the bottleneck layer, which for both MNIST and CIFAR-10 is considerably smaller than the data dimension (784 for MNIST and 3072 for CIFAR-10). In contrast, the dimension of the hidden layers of NCSN is several times larger than the data dimension.

Kernel Design

We consider a mixture of RBF and IMQ kernels on the data-space and code-space features, as defined in Eq. (8). The bandwidth of the RBF kernel is set proportionally to the median of the samples' pairwise distances drawn from the annealed data distributions. We search for the best kernel hyper-parameters.

Inference Hyper-parameters

Following [30], we choose $L = 10$ different noise levels, where the standard deviations form a geometric sequence with $\sigma_1 = 1$ and $\sigma_{10} = 0.01$. Note that Gaussian noise of $\sigma = 0.01$ is almost indistinguishable to human eyes for image data. For A-SGLD, we follow the step size and number of steps per noise level of [30]. For NCK-SVGD, we tune the number of steps per noise level, the learning rate, the entropic regularizer $\beta$, and the kernel hyper-parameters.

Evaluation Metrics

We report the Inception [29] (https://github.com/openai/improved-gan/tree/master/inception_score) and FID [13] (https://github.com/bioinf-jku/TTUR) scores using k samples. In addition, we also present the Improved Precision Recall (IPR) curve [17] (https://github.com/kynkaat/improved-precision-and-recall-metric) to assess the impact of entropy regularization and kernel hyper-parameters on the diversity versus quality trade-off. For the IPR curve on MNIST, we use data-space features (i.e., raw image pixels) to compute the KNN-3 data manifold, as gray-scale images are not applicable to the VGG-16 network. For the IPR curve on CIFAR-10, we follow the original setting of [17], which uses code-space embeddings from a pre-trained VGG-16 model to construct the KNN-3 data manifold. For simplicity, we generate 1024 samples to compute precision and recall, and report the average over 5 runs with different random seeds.

B.3 Baseline Comparison with SVGD and A-SVGD

Figure 9: SVGD baseline comparison on MNIST: (a) SVGD, (b) A-SVGD, (c) NCK-SVGD.

Figure 10: SVGD baseline comparison on CIFAR-10: (a) SVGD, (b) A-SVGD, (c) NCK-SVGD.

Figure 11: SVGD baseline comparison on CelebA: (a) SVGD, (b) A-SVGD, (c) NCK-SVGD.

Similar to the study in Section 3, we compare the proposed NCK-SVGD with two SVGD baselines, namely vanilla SVGD and annealed SVGD with a fixed kernel (A-SVGD), on three image generation benchmarks. See the MNIST results in Figure 9, the CIFAR-10 results in Figure 10, and the CelebA results in Figure 11. We present only the qualitative study and omit quantitative evaluation, as the performance difference can be clearly distinguished from sample quality alone. We can see that the proposed NCK-SVGD produces higher-quality samples than the two baselines, SVGD and A-SVGD.