Deep Generative Learning via Variational Gradient Flow

01/24/2019 · by Gao Yuan, et al. · Wuhan University · The Hong Kong University of Science and Technology

We propose a general framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability spaces. The evolving distribution that asymptotically converges to the target distribution is governed by a vector field, which is the negative gradient of the first variation of the f-divergence between them. We prove that the evolving distribution coincides with the pushforward distribution through the infinitesimal-time composition of residual maps that are perturbations of the identity map along the vector field. The vector field depends on the density ratio of the pushforward distribution and the target distribution, which can be consistently learned from a binary classification problem. Connections of our proposed VGrow method with other popular methods, such as VAE, GAN and flow-based methods, have been established in this framework, gaining new insights into deep generative learning. We also evaluated several commonly used divergences, including the Kullback-Leibler, Jensen-Shannon and Jeffrey divergences, as well as our newly discovered “logD” divergence, which serves as the objective function of the logD-trick GAN. Experimental results on benchmark datasets demonstrate that VGrow can generate high-fidelity images in a stable and efficient manner, achieving competitive performance with state-of-the-art GANs.




1 Introduction

Learning a generative model, i.e., the underlying data-generating distribution, from large amounts of data is one of the fundamental tasks in machine learning and statistics [46]. Recent advances in deep generative models have provided novel techniques for unsupervised and semi-supervised learning, with broad applications ranging from image synthesis [44], semantic image editing [60] and image-to-image translation [61] to low-level image processing [29]. Implicit deep generative modeling is a powerful and flexible framework that approximates the target distribution by learning deep samplers [38], with generative adversarial networks (GAN) [16] and likelihood-based models, such as variational auto-encoders (VAE) [23] and flow-based methods [11], as its main representatives. These implicit deep generative models focus on learning a deterministic or stochastic nonlinear mapping that transforms low-dimensional latent samples from a simple reference distribution into samples that closely match the target distribution.

GANs set up a minimax two-player game between a generator and a discriminator. During training, the generator transforms samples from a simple reference distribution into samples intended to deceive the discriminator, while the discriminator conducts a two-sample test to distinguish the generated samples from the observed samples. The objective of the vanilla GAN amounts to the Jensen-Shannon (JS) divergence between the learned distribution and the target distribution. The vanilla GAN generates sharp image samples but suffers from instability issues [3]. A myriad of extensions to the vanilla GAN have been investigated, both theoretically and empirically, in order to achieve stable training and high-quality sample generation. Existing works include, but are not limited to, designing new learning procedures or network architectures [10, 43, 58, 59, 4, 51, 8], seeking alternative distribution discrepancy measures as loss criteria in feature or data space [31, 15, 30, 49, 6, 3, 36, 39], exploiting insightful regularization methods [9, 17, 37, 57], and building hybrid models [13, 53, 14, 54, 21].

VAE approximately minimizes the Kullback-Leibler (KL) divergence between the transformed distribution and the target distribution by minimizing a surrogate loss, i.e., the negative evidence lower bound, defined as a reconstruction loss plus a regularization loss, where the reconstruction loss measures the difference between the decoder and the encoder, and the regularization loss measures the difference between the encoder and the simple latent prior distribution [23]. VAE enjoys optimization stability but has been criticized for generating blurry image samples, caused by the Gaussian decoder and the marginal log-likelihood based loss [53]. Adversarial auto-encoders [35] use GANs to penalize the discrepancy between the aggregated posterior of latent codes and the simple prior distribution. Wasserstein auto-encoders [52], which extend adversarial auto-encoders to general penalized optimal transport objectives [7], alleviate the blurriness. Similar ideas appear in works on disentangled representations of natural images [20, 27].

Flow-based methods minimize exactly the negative log-likelihood, i.e., the KL divergence, where the model density is the pushforward of a simple reference density through a sequence of learnable invertible transformations called a normalizing flow [45]. Research on flow-based generative models mainly focuses on designing neural network architectures that trade off representational power against the computational complexity of the log-determinants [11, 12, 25, 42, 24].

In this paper, we propose a general framework to learn a deep generative model that samples from the target distribution by combining the strengths of variational gradient flow (VGrow) on probability space, particle optimization and deep neural networks. Our method aims to find a deterministic transportation map that transforms low-dimensional samples from a simple reference distribution, such as a Gaussian or uniform distribution, into samples from the underlying target distribution. The evolving distribution that asymptotically converges to the target distribution is governed by a vector field, which is the negative gradient of the first variation of the f-divergence between the evolving distribution and the target distribution. We prove that the evolving distribution coincides with the pushforward distribution through the infinitesimal-time composition of residual maps that are perturbations of the identity map along the vector field. At the population level, the vector field depends only on the density ratio of the pushforward distribution and the target distribution, which can be consistently learned from a binary classification problem that distinguishes the observed data sampled from the target distribution from the generated data sampled from the pushforward distribution. Both the transport map and the binary classifier are parameterized with deep convolutional neural networks and trained via stochastic gradient descent (SGD). Connections of our proposed VGrow method with other popular methods, such as VAE, GAN and flow-based methods, have been established in our framework, gaining new insights into deep generative learning. We also evaluated several commonly used divergences, including the Kullback-Leibler, Jensen-Shannon and Jeffrey divergences, as well as our newly discovered “logD” divergence, which serves as the objective function of the logD-trick GAN and is of independent interest on its own. We test VGrow with the above four divergences on four benchmark datasets: MNIST [28], FashionMNIST [56], CIFAR10 [26] and CelebA [34]. The VGrow learning procedure is very stable, as indicated by our established theory. The resulting deep sampler produces realistic-looking images, achieving competitive performance with state-of-the-art GANs. The code of VGrow is available at https://github.com/xjtuygao/VGrow.

2 Background, Notation and Theory

Let the observations be independent and identically distributed samples from an unknown target distribution with density with respect to the Lebesgue measure (we make the same assumption for all distributions in this paper). We aim to learn the distribution by constructing a variational gradient flow on the space of Borel probability measures. To this end, we need the following background results studied in [1].

Given a distribution with density, we use the f-divergence to measure its discrepancy from the target, which is defined as

(2.1)

where f is convex with f(1) = 0. For simplicity, we refer to this f-divergence, viewed as a functional of the evolving density, as the energy functional. Obviously it is nonnegative, and it vanishes if and only if the two distributions coincide.
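For concreteness, the f-divergence can be written in the following standard form (we use generic symbols, with q the evolving density and p the target density; the paper's own notation may differ), a convention consistent with the discussion of the KL case in Section 4:

\mathcal{D}_f(q \,\|\, p) \;=\; \int p(x)\, f\!\Big(\frac{q(x)}{p(x)}\Big)\, dx .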

Lemma 1.

Let the following be the first variation of the energy functional at the current density, with

Consider a curve of distributions with time-dependent density, and let the vector field be the negative gradient of the first variation along this curve.

Definition.

We call the curve a variational gradient flow of the energy functional governed by the vector field if its density satisfies the Vlasov-Fokker-Planck equation

(2.2)

As shown in the following Lemma 2, the energy functional is decreasing along the curve. As a consequence, the evolving distribution converges to the target as time tends to infinity.

Lemma 2.

For any fixed time, let a random variable be distributed according to the current evolving distribution. Let the perturbation direction be an element of the Hilbert space under consideration and let the step size be a small positive number. Define a residual map as a small perturbation of the identity map along this direction, i.e.,

Let the inverse of the residual map be well defined, which holds when the step size is small enough. By the change-of-variables formula, the density of the pushforward distribution of the transformed random variable is

Let the resulting objective denote the energy functional evaluated at the pushforward distribution, viewed as a function of the perturbation direction. It is natural to seek a direction that decreases this objective, which indicates that the pushforward distribution is closer to the target than the current distribution. We find such a direction by calculating the first variation of this functional, as stated in Theorem 1 below.

Theorem 1.

For any perturbation direction, if the vanishing condition is satisfied, then

The vanishing condition assumed in Theorem 1 holds when the densities have compact supports or sufficiently light tails. Theorem 1 shows that the residual map, defined as a small perturbation of the identity map along the vector field, pushes samples from the current distribution toward samples more likely to have been drawn from the target.
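In generic notation (with v the vector field, \epsilon the step size and q the current density; the symbols are ours), the residual map and the change-of-variables formula described above read

T_\epsilon(x) \;=\; x + \epsilon\, v(x),
\qquad
q_{T_\epsilon}(y) \;=\; q\big(T_\epsilon^{-1}(y)\big)\,\big|\det \nabla T_\epsilon^{-1}(y)\big| ,

where q_{T_\epsilon} denotes the density of the pushforward distribution and \epsilon is small enough that T_\epsilon is invertible.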

Theorem 2.

The evolving distribution under the infinitesimal pushforward map satisfies the Vlasov-Fokker-Planck equation (2.2).

As a consequence of Theorem 2, the pushforward distribution obtained through residual maps with infinitesimal time perturbations coincides with the variational gradient flow. This connection motivates us to approximately solve the Vlasov-Fokker-Planck equation (2.2) by finding a pushforward map defined as the composition of a sequence of discrete-time residual maps with small step size, as long as we can learn the vector field. By definition, the vector field is an explicit function of the density ratio, whose estimation is well studied; see, for example, [48].

Lemma 3.

Let a random variable pair be sampled from a joint distribution whose binary label component takes values in a two-element set. Denote the corresponding class-conditional distributions, and let

If , then

According to Lemma 3, we can estimate the density ratio from samples. Let two sets of samples be drawn from the target distribution and the pushforward distribution, respectively. We introduce a binary label variable and assign one label to the target samples and the other to the generated samples. Define

(2.3)

Then the resulting estimator consistently estimates the density ratio as the sample size grows.
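To make the estimator concrete, the following minimal sketch (ours, in PyTorch rather than the authors' TensorFlow code; the 1-D toy densities and all names are hypothetical) trains a logistic classifier on samples from two known Gaussians and recovers the density ratio from its logits, using the fact that with equal class sizes the logit equals the log density ratio.

import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy 1-D case: "target" p = N(1, 1) gets label 1, "current" q = N(0, 1) gets label 0.
x_p = torch.randn(4096, 1) + 1.0
x_q = torch.randn(4096, 1)

classifier = nn.Sequential(nn.Linear(1, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(classifier.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()
labels = torch.cat([torch.ones(4096, 1), torch.zeros(4096, 1)])

for _ in range(2000):
    loss = bce(classifier(torch.cat([x_p, x_q])), labels)
    opt.zero_grad(); loss.backward(); opt.step()

# With equal class sizes, P(label = 1 | x) = p(x) / (p(x) + q(x)), so the logit
# equals log p(x)/q(x); exponentiating it gives the density ratio estimate.
with torch.no_grad():
    x = torch.linspace(-2.0, 3.0, 6).unsqueeze(1)
    est_ratio = classifier(x).exp().squeeze()
    true_ratio = torch.exp(x.squeeze() - 0.5)   # N(1,1)/N(0,1) = exp(x - 1/2)
    print(torch.stack([est_ratio, true_ratio], dim=1))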

3 Variational gradient flow (VGrow) learning procedure

Given data samples from an unknown target distribution, our goal is to learn a deterministic transportation map that transforms low-dimensional samples from a simple reference distribution, such as a Gaussian or uniform distribution, into samples from the underlying target.

To this end, we parameterize the sought transport map by a deep neural network with trainable parameters. We sample particles from the simple reference distribution and transform them with the initial network. We then iterate the following two steps. First, we learn a density ratio estimate by solving (2.3) with the real data and the generated particles, where the estimator is parameterized as a neural network. Second, we define a residual map using the estimated vector field with a small step size and update the particles accordingly. According to the theory discussed in Section 2, iterating these two steps yields particles that are more likely to have been sampled from the target distribution. We can therefore update the generator by fitting it to the pairs of latent codes and updated particles, and repeat the whole procedure as desired with a warm start. We give a detailed description of the VGrow learning procedure below; a minimal code sketch follows the procedure.

  • Outer loop

    • Sample latent codes from the simple reference distribution and initialize the particles by passing them through the current sampler.

      Inner loop

      • Restrict the estimator in (2.3) to be a neural network with its own parameters and solve (2.3) with SGD to obtain the density ratio estimate.

      • Define the residual map with a small step size, where the vector field is computed from the estimated density ratio.

      • Update the particles with the residual map.

      End inner loop.

    • Update the sampler parameters by fitting the latent codes to the updated particles with SGD.

  • End outer loop
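The following is a minimal, self-contained sketch of the above procedure on a 2-D toy target (our illustration in PyTorch, not the authors' TensorFlow implementation; the network sizes, step size and loop counts are arbitrary choices). The inner loop fits the classifier, reads off the vector field for the KL case as the gradient of the classifier logit, and moves the particles; the outer loop refits the sampler to the moved particles with a warm start.

import torch
import torch.nn as nn

torch.manual_seed(0)
dim_z, dim_x, n = 2, 2, 1024            # hypothetical toy sizes
step_size = 0.1                         # the small step size of the residual map

def sample_target(m):
    # Toy 2-D target: a shifted Gaussian standing in for the data distribution.
    return torch.randn(m, dim_x) * 0.5 + torch.tensor([2.0, 2.0])

sampler = nn.Sequential(nn.Linear(dim_z, 64), nn.ReLU(), nn.Linear(64, dim_x))
g_opt = torch.optim.RMSprop(sampler.parameters(), lr=1e-3)

for outer in range(50):                                     # outer loop
    z = torch.randn(n, dim_z)                               # latent codes
    particles = sampler(z).detach()                         # initial particles

    for inner in range(5):                                  # inner loop
        # (1) Estimate the density ratio via binary classification (eq. (2.3)):
        #     real data labeled 1, current particles labeled 0.
        clf = nn.Sequential(nn.Linear(dim_x, 64), nn.ReLU(), nn.Linear(64, 1))
        d_opt = torch.optim.RMSprop(clf.parameters(), lr=1e-3)
        bce = nn.BCEWithLogitsLoss()
        real = sample_target(n)
        labels = torch.cat([torch.ones(n, 1), torch.zeros(n, 1)])
        for _ in range(100):
            loss = bce(clf(torch.cat([real, particles])), labels)
            d_opt.zero_grad(); loss.backward(); d_opt.step()

        # (2) KL case: the logit approximates log(p/q), so the vector field is its
        #     gradient; move the particles one small residual step along it.
        x = particles.clone().requires_grad_(True)
        field = torch.autograd.grad(clf(x).sum(), x)[0]
        particles = (particles + step_size * field).detach()

    # (3) Refit the sampler to the (latent code, moved particle) pairs,
    #     warm-started from its current weights.
    for _ in range(100):
        g_loss = ((sampler(z) - particles) ** 2).mean()
        g_opt.zero_grad(); g_loss.backward(); g_opt.step()

print(sampler(torch.randn(5, dim_z)))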

We consider four divergences in this paper. The form of the four divergences and their second-order derivatives are shown in Table 1. Three of them are commonly used divergences, namely the Kullback-Leibler (KL), Jensen-Shannon (JS) and Jeffrey divergences; the fourth is our newly discovered “logD” divergence, which serves as the objective function of the logD-trick GAN and, to the best of our knowledge, is a new result.

Theorem 3.

At the population level, the logD-trick GAN [16] minimizes the “logD” divergence between the distribution of the generated data and the target distribution.


f-Div
KL
JS
logD
Jeffrey
Table 1: Four representative f-divergences
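For reference, commonly used generator functions for three of the divergences in Table 1 (written in the convention displayed after (2.1); the paper's normalizations may differ by constants or scale factors) are

f_{\mathrm{KL}}(u) = u \log u,
\qquad
f_{2\cdot\mathrm{JS}}(u) = u \log u - (u+1)\log\frac{u+1}{2},
\qquad
f_{\mathrm{Jeffrey}}(u) = (u-1)\log u,

where u denotes the density ratio; the generator of the “logD” divergence is the one characterized by Theorem 3 and is not reproduced here.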

4 Related Works

We discuss connections between our proposed VGrow learning procedure and related works, such as VAE, GAN and flow-based methods.

VAE [23] is formulated as maximizing a lower bound derived from the KL divergence. Flow-based methods [11, 12] minimize the KL divergence between the target and a model whose density is the pushforward of a simple reference density through a sequence of learnable invertible transformations. Flow-based methods parameterize these transformations with specially designed neural networks that facilitate log-determinant computation [11, 12, 25, 42, 24] and train them using maximum likelihood. Our VGrow also learns a sequence of simple residual maps, but these are guided by the variational gradient flow in probability space, which differs from flow-based methods in principle.

The original vanilla GAN and the logD-trick GAN [16] minimize the JS divergence and the “logD” divergence, respectively, as shown in Theorem 3. This idea extends to the general f-GAN [41], where a general f-divergence is used. However, f-divergence-based GANs are formulated through the dual problem; in contrast, our VGrow minimizes the f-divergence in its primal form. The GAN works most closely related to our VGrow are [22, 40, 55], where the functional gradient (the first variation of a functional) is adopted to aid GAN training. [40] introduced a gradient layer based on the first variation of the generator loss in WGAN [3] to accelerate the convergence of training. In [55], a deep energy model was trained along the Stein variational gradient [33], which is the projection of the first variation of the KL divergence in Theorem 1 onto a reproducing kernel Hilbert space; see Section 7.7 for the proof. [22] proposed CFG-GAN, which directly minimizes the KL divergence via functional gradient descent. In their paper, the update direction is the gradient of the log density ratio multiplied by a positive scaling function, which they empirically set to 1 in their numerical study. Our VGrow is based on the general f-divergence, and Theorem 1 implies that in the KL case the update direction is indeed the gradient of the log density ratio, so the scaling function should be exactly 1.
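As a worked instance of this last claim (using the generic f-divergence convention displayed after (2.1), with q the current density and p the target), for the KL generator f(u) = u \log u the first variation and the resulting vector field are

\frac{\delta}{\delta q}\,\mathcal{D}_f(q\,\|\,p)
  \;=\; f'\!\Big(\frac{q(x)}{p(x)}\Big)
  \;=\; \log\frac{q(x)}{p(x)} + 1,
\qquad
v(x) \;=\; -\nabla_x\Big(\log\frac{q(x)}{p(x)} + 1\Big)
  \;=\; \nabla_x \log\frac{p(x)}{q(x)},

so the particles move along the gradient of the log density ratio with unit scaling.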

5 Experiments

We evaluated our model on four benchmark datasets: MNIST [28], FashionMNIST [56], CIFAR10 [26] and CelebA [34]. Four representative f-divergences were tested to demonstrate the effectiveness of the general Variational Gradient Flow (VGrow) framework for generative learning.

5.1 Experimental setup

f-divergences. Theoretically, our model works for the whole f-divergence family by simply plugging in the corresponding f-function. Special cases are obtained when specific f-divergences are considered. At the population level, adopting the KL divergence recovers CFG-GAN, while adopting the JS divergence leads to the vanilla GAN. As we proved above, the GAN with the logD trick corresponds to our newly discovered “logD” divergence, which belongs to the f-divergence family. Moreover, we consider the Jeffrey divergence to show that our model is applicable to other f-divergences. We name these four cases VGrow-KL, VGrow-JS, VGrow-logD and VGrow-JF.

Datasets. We chose four benchmark datasets from the GAN literature: three small datasets (MNIST, FashionMNIST, CIFAR10) and one large dataset (CelebA). MNIST and FashionMNIST each have a training set of 60k examples and a test set of 10k grayscale examples. CIFAR10 has a training set of 50k examples and a test set of 10k color examples. Each of these three datasets has 10 classes. CelebA consists of more than 200k celebrity images, which we randomly split into training and test sets with a ratio of approximately 9:1. For MNIST and FashionMNIST, the input images were resized to a common resolution. We also pre-processed the CelebA images by first taking a central crop and then resizing. Only the training sets were used to train our models.

Evaluation metrics. Inception Score (IS) [47] calculates the exponentiated mutual information between the conditional class distribution given a generated image and the marginal class distribution across generated images [5]. To estimate these distributions, we trained dedicated classifiers on MNIST, FashionMNIST and CIFAR10 following [22], using pre-activation ResNet-18 [18]. All IS values were calculated over 50k generated images. Fréchet Inception Distance (FID) [19] computes the Wasserstein-2 distance between Gaussians fitted to real and generated images after they are propagated through the Inception-v3 model [50]. In particular, all FID scores on MNIST, FashionMNIST and CIFAR10 are reported with respect to the 10k test examples, using the TensorFlow implementation at https://github.com/bioinf-jku/TTUR/blob/master/fid.py. In a nutshell, higher IS and lower FID are better.
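For reference, FID is the Fréchet distance between the two fitted Gaussians, \|\mu_r-\mu_g\|^2 + \mathrm{Tr}(\Sigma_r+\Sigma_g-2(\Sigma_r\Sigma_g)^{1/2}), and IS = \exp(\mathbb{E}_x\,\mathrm{KL}(p(y\mid x)\,\|\,p(y))). Below is a minimal NumPy/SciPy sketch of the FID formula on precomputed feature matrices (ours; extracting Inception-v3 features is assumed to be done separately, e.g. by the linked TTUR script).

import numpy as np
from scipy import linalg

def fid(features_real, features_gen):
    # Frechet distance between Gaussians fitted to two (n_samples x dim) feature sets.
    mu_r, mu_g = features_real.mean(axis=0), features_gen.mean(axis=0)
    sigma_r = np.cov(features_real, rowvar=False)
    sigma_g = np.cov(features_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(sigma_r.dot(sigma_g), disp=False)
    covmean = covmean.real                  # drop tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return diff.dot(diff) + np.trace(sigma_r + sigma_g - 2.0 * covmean)

# Toy usage with random "features"; in practice these would be Inception-v3 activations.
rng = np.random.default_rng(0)
print(fid(rng.normal(size=(1000, 64)), rng.normal(loc=0.1, size=(1000, 64))))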

Network architectures and hyperparameter settings. We adopted a new architecture modified from the residual networks used in [37]. The modifications comprise reducing the number of batch normalization layers and introducing spectral normalization in the deep sampler (generator). The architecture was shared across the three small datasets, and most hyperparameters were shared across the different divergences. More residual blocks, upsampling and downsampling are employed on CelebA. In our experiments, we set the batch size to 64 and used RMSProp as the SGD optimizer when training the neural networks. The learning rate was 0.0001 for both the deep sampler and the deep classifier, except 0.0002 on MNIST for VGrow-JF. Inputs to the deep samplers are vectors drawn from a 128-dimensional standard normal distribution on all datasets. The meta-parameters of our VGrow learning procedure are the step size and the number of inner loops, which are set to fixed values.

The sampler and the classifier are parameterized with residual networks. Each ResNet block has a skip connection. The skip connection applies upsampling or downsampling to its input followed by a 1x1 convolution if the residual block itself upsamples or downsamples; otherwise the skip connection is the identity mapping. Upsampling is nearest-neighbor upsampling and downsampling is achieved with mean pooling. Details of the networks are listed in Tables 4-7 in Appendix B; a code sketch of such a block follows.
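A minimal PyTorch sketch of such a residual block (ours; the channel counts are hypothetical, and the batch/spectral normalization layers mentioned above are omitted for brevity):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    # Residual block with optional nearest-neighbor upsampling or mean-pool downsampling.
    # mode is None (identity skip; requires in_ch == out_ch), "up", or "down".
    def __init__(self, in_ch, out_ch, mode=None):
        super().__init__()
        self.mode = mode
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        # 1x1 convolution on the skip path only when the block resamples.
        self.skip = nn.Conv2d(in_ch, out_ch, 1) if mode is not None else nn.Identity()

    def _resample(self, x):
        if self.mode == "up":
            return F.interpolate(x, scale_factor=2, mode="nearest")
        if self.mode == "down":
            return F.avg_pool2d(x, 2)
        return x

    def forward(self, x):
        h = self.conv1(self._resample(F.relu(x)))
        h = self.conv2(F.relu(h))
        shortcut = self.skip(self._resample(x))
        return h + shortcut

block = ResBlock(64, 128, mode="up")
print(block(torch.randn(2, 64, 8, 8)).shape)    # torch.Size([2, 128, 16, 16])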

5.2 Results

Through our experiments, we demonstrate empirically that (1) VGrow is very stable in the training phase, and that (2) VGrow can generate high-fidelity samples that are comparable to real samples both visually and quantitatively. Comparisons with state-of-the-art GANs suggest the effectiveness of VGrow.

Stability. It has been shown that the binary classification loss correlates poorly with generation quality for JS-divergence-based GAN models [3]. We observed a similar phenomenon with our f-divergence-based VGrow model: the classification loss changed little at the beginning of training and then fluctuated around a constant value. Since the classification loss is not meaningful enough to measure generation quality, we instead use the aforementioned Inception Score to draw IS-versus-loop learning curves on MNIST, FashionMNIST and CIFAR10. The results are presented in Figure 1. As all three subfigures indicate, the IS curves are very smooth and the Inception Scores increase nearly monotonically up to 3500 outer loops (about 75 epochs) on MNIST and FashionMNIST and 4500 outer loops (about 100 epochs) on CIFAR10.

Effectiveness. First, we show real images and examples generated by our VGrow-KL model on the four benchmark datasets in Figures 2-5. The realistic-looking generated images are visually comparable to real images sampled from the training set, and it is easy to tell which class a generated example belongs to, even on CIFAR10. Second, Table 2 presents the FID scores of the four considered models, together with the FID values computed on 10k training images of MNIST and FashionMNIST. The scores of generated samples are very close to the scores on real data; in particular, VGrow-JS obtains average scores of 3.32 and 8.75, while the scores on training data are 2.12 and 4.16 on MNIST and FashionMNIST, respectively. Third, Table 3 shows the FID evaluations of our four models alongside the evaluations of state-of-the-art WGANs and MMDGANs reported in [2], all based on 50k samples. Our VGrow-logD attains a score of 28.8 with smaller variance, which is competitive with the best referred baseline (28.5), and VGrow-JS and VGrow-KL outperform the remaining referred baselines. In summary, the quantitative results in Tables 2 and 3 illustrate the effectiveness of our VGrow model.


Models MNIST (10k) FashionMNIST (10k)
VGrow-KL 3.66 (0.09) 9.30 (0.09)
VGrow-JS 3.32 (0.05) 8.75 (0.06)
VGrow-logD 3.64 (0.05) 9.51 (0.09)
VGrow-JF 3.40 (0.07) 9.72 (0.06)
Training set 2.12 (0.02) 4.16 (0.03)
Table 2: Mean (standard deviation) of FID evaluations over 10k generated MNIST / FashionMNIST images with five-time bootstrap sampling. The last row reports the statistics of the FID scores between 10k training examples and 10k test examples.


Models CIFAR10 (50k)
VGrow-KL 29.7 (0.1)
VGrow-JS 29.1 (0.1)
VGrow-logD 28.8 (0.1)
VGrow-JF 32.3 (0.1)
WGAN-GP 31.1 (0.2)
MMDGAN-GP-L2 31.4 (0.3)
SMMDGAN 31.5 (0.4)
SN-SWGAN 28.5 (0.2)
Table 3: Mean (standard deviation) of FID evaluations over 50k generated CIFAR10 images with five-time bootstrap sampling. The last four rows are baseline results adapted from [2].

6 Conclusion

We propose a framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability spaces. We discuss connections of our proposed VGrow method with VAE, GAN and flow-based methods. We evaluated VGrow with several divergences, including a newly discovered “logD” divergence which serves as the objective function of the logD-trick GAN. Experimental results on benchmark datasets demonstrate that VGrow can generate high-fidelity images in a stable and efficient manner, achieving competitive performance with state-of-the-art GANs.

(a) MNIST
(b) FashionMNIST
(c) CIFAR10
Figure 1: IS-Loop learning curves on MNIST, FashionMNIST and CIFAR10. The training of VGrow is very stable until 3500 outer loops on MNIST and FashionMNIST (4500 outer loops on CIFAR10).
(a) real MNIST
(b) generated MNIST
Figure 2: Real samples and generated samples obtained by VGrow-KL on MNIST.
(a) real FashionMNIST
(b) generated FashionMNIST
Figure 3: Real samples and generated samples obtained by VGrow-KL on FashionMNIST.
(a) real CIFAR10
(b) generated CIFAR10
Figure 4: Real samples and generated samples obtained by VGrow-KL on CIFAR10.
(a) real CelebA
(b) generated CelebA
Figure 5: Real samples and generated samples obtained by VGrow-KL on CelebA.

References

  • [1] Luigi Ambrosio, Nicola Gigli, and Giuseppe Savaré. Gradient flows: in metric spaces and in the space of probability measures. Springer Science & Business Media, 2008.
  • [2] Michael Arbel, Dougal Sutherland, Mikolaj Binkowski, and Arthur Gretton. On gradient regularizers for MMD GANs. In NeurIPS, 2018.
  • [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In ICML, 2017.
  • [4] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). In ICML, 2017.
  • [5] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
  • [6] Mikołaj Bińkowski, Dougal J. Sutherland, Michael Arbel, and Arthur Gretton. Demystifying MMD GANs. In ICLR, 2018.
  • [7] Olivier Bousquet, Sylvain Gelly, Ilya Tolstikhin, Carl-Johann Simon-Gabriel, and Bernhard Schoelkopf. From optimal transport to generative modeling: the vegan cookbook. arXiv preprint arXiv:1705.07642, 2017.
  • [8] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale gan training for high fidelity natural image synthesis. arXiv preprint arXiv:1809.11096, 2018.
  • [9] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. In ICLR, 2017.
  • [10] Emily L. Denton, Soumith Chintala, Arthur Szlam, and Rob Fergus. Deep generative image models using a Laplacian pyramid of adversarial networks. In NIPS, 2015.
  • [11] Laurent Dinh, David Krueger, and Yoshua Bengio. Nice: Non-linear independent components estimation. In ICLR, 2015.
  • [12] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using real nvp. In ICLR, 2017.
  • [13] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. In ICLR, 2017.
  • [14] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. In ICLR, 2017.
  • [15] Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. In UAI, 2015.
  • [16] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
  • [17] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C Courville. Improved training of wasserstein gans. In NIPS, 2017.
  • [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
  • [19] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium. In NIPS, 2017.
  • [20] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational framework. In ICLR, 2017.
  • [21] Huaibo Huang, Zhihang Li, Ran He, Zhenan Sun, and Tieniu Tan. IntroVAE: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.
  • [22] Rie Johnson and Tong Zhang. Composite functional gradient learning of generative adversarial models. In ICML, 2018.
  • [23] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. In ICLR, 2014.
  • [24] Durk P Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In NeurIPS, 2018.
  • [25] Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NIPS, 2016.
  • [26] Alex Krizhevsky and Geoffrey Hinton. Learning multiple layers of features from tiny images. Technical report, Citeseer, 2009.
  • [27] Abhishek Kumar, Prasanna Sattigeri, and Avinash Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. In ICLR, 2018.
  • [28] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
  • [29] C. Ledig, L. Theis, F. Huszar, J. Caballero, A. Aitken, A. Tejani, J. Totz, Z. Wang, and W. Shi. Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
  • [30] Chun-Liang Li, Wei-Cheng Chang, Yu Cheng, Yiming Yang, and Barnabás Póczos. MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017.
  • [31] Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In ICML, 2015.
  • [32] Qiang Liu. Stein variational gradient descent as gradient flow. In NIPS, 2017.
  • [33] Qiang Liu and Dilin Wang. Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.
  • [34] Ziwei Liu, Ping Luo, Xiaogang Wang, and Xiaoou Tang. Deep learning face attributes in the wild. In ICCV, 2015.
  • [35] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. In ICLR, 2016.
  • [36] Xudong Mao, Qing Li, Haoran Xie, Raymond Y.K. Lau, Zhen Wang, and Stephen Paul Smolley. Least squares generative adversarial networks. In ICCV, 2017.
  • [37] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In ICML, 2018.
  • [38] Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
  • [39] Youssef Mroueh and Tom Sercu. Fisher GAN. In NIPS, 2017.
  • [40] Atsushi Nitanda and Taiji Suzuki. Gradient layer: Enhancing the convergence of adversarial training for generative models. In AISTATS, 2018.
  • [41] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, 2016.
  • [42] George Papamakarios, Theo Pavlakou, and Iain Murray. Masked autoregressive flow for density estimation. In NIPS, 2017.
  • [43] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
  • [44] Scott Reed, Zeynep Akata, Xinchen Yan, Lajanugen Logeswaran, Bernt Schiele, and Honglak Lee. Generative adversarial text to image synthesis. arXiv preprint arXiv:1605.05396, 2016.
  • [45] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In ICML, 2015.
  • [46] Ruslan Salakhutdinov. Learning deep generative models. Annual Review of Statistics and Its Application, 2:361–385, 2015.
  • [47] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, Xi Chen, and Xi Chen. Improved techniques for training gans. In NIPS, 2016.
  • [48] Masashi Sugiyama, Taiji Suzuki, and Takafumi Kanamori. Density ratio estimation in machine learning. Cambridge University Press, 2012.
  • [49] Dougal J Sutherland, Hsiao-Yu Tung, Heiko Strathmann, Soumyajit De, Aaditya Ramdas, Alex Smola, and Arthur Gretton. Generative models and model criticism via optimized maximum mean discrepancy. In ICLR, 2017.
  • [50] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
  • [51] Chenyang Tao, Liqun Chen, Ricardo Henao, Jianfeng Feng, and Lawrence Carin Duke. Chi-square generative adversarial network. In ICML, 2018.
  • [52] Ilya Tolstikhin, Olivier Bousquet, Sylvain Gelly, and Bernhard Schoelkopf. Wasserstein auto-encoders. In ICML, 2018.
  • [53] Ilya O Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann SIMON-GABRIEL, and Bernhard Schölkopf. AdaGAN: Boosting generative models. In NIPS, 2017.
  • [54] Dmitry Ulyanov, Andrea Vedaldi, and Victor S. Lempitsky. It takes (only) two: Adversarial generator-encoder networks. In AAAI, 2018.
  • [55] Dilin Wang and Qiang Liu. Learning to draw samples: With application to amortized mle for generative adversarial learning. In ICLR workshop, 2017.
  • [56] Han Xiao, Kashif Rasul, and Roland Vollgraf. Fashion-mnist: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747, 2017.
  • [57] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
  • [58] Han Zhang, Tao Xu, Hongsheng Li, Shaoting Zhang, Xiaogang Wang, Xiaolei Huang, and Dimitris Metaxas. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, 2017.
  • [59] Junbo Zhao, Michael Mathieu, and Yann LeCun. Energy-based generative adversarial network. In ICLR, 2017.
  • [60] Jun-Yan Zhu, Philipp Krähenbühl, Eli Shechtman, and Alexei A. Efros. Generative visual manipulation on the natural image manifold. In ECCV, 2016.
  • [61] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2017.

7 Appendix A

In this section we give detailed proofs of the main results in the paper.

7.1 Proof for Lemma 1

Proof.

For any perturbation direction, define the corresponding auxiliary function. The chain rule and a direct calculation then give the claimed identity. ∎

7.2 Proof for Lemma 2

Proof.

This follows from expression 10.1.16 in [1] (Section E of Chapter 10.1.2, page 233). ∎

7.3 Proof for Theorem 1

Proof.

For any perturbation direction, define

as a function of the step size. By definition,

Since

we need to calculate the derivative of this function. Recall that

by the chain rule, we get

where,

By definition, we claim that

Indeed, recall that

We get

and

Then it follows that

and

We finish our claim by calculating

Thus,

where the fourth equality follows from integration by parts and the vanishing assumption. ∎

7.4 Proof for Theorem 2

Proof.

This is similar to the proof of equation (13) in [32]; we present the details here for completeness. The proof of Theorem 1 shows that

and

Then, by Taylor expansion,

Let the following denote the density of the pushforward random variable. Then,

Letting the step size tend to zero, we obtain the desired result. ∎

7.5 Proof for Lemma 3

Proof.

is the minimizer of

The above criterion is a functional of the function being optimized. Setting its first variation to zero yields

i.e.,

7.6 Proof for Theorem 3

Proof.

By definition,

At the population level, the objective function of the logD-trick GAN reads [16]:

where the latent code follows the simple low-dimensional reference distribution. Denote the distribution of the generated samples accordingly. Then the two losses are equivalent to

The optimal discriminator takes its well-known closed form; substituting it into the criterion, we get

7.7 Proof of the relation of VGrow with SVGD

Proof.

Consider a test function in a Stein class associated with the target density. By the proof of Theorem 1, we know that

where the last equality follows from restricting the test function to a Stein class associated with the target density. ∎

8 Appendix B

In this section, we present the details of the networks used in our experiments. We use a symbol to denote the number of channels of the images, i.e., 1 for the grayscale datasets and 3 for the color datasets.


Layer Details Output size
Latent noise 128
Fully connected Linear 2048
Reshape
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
Conv ReLU, Conv, Tanh
Table 4: ResNet sampler with resolution.

Layer Details Output size
ResNet block Conv
ReLU, Conv
Downsampling
ResNet block ReLU, Conv
ReLU, Conv
Downsampling
ResNet block ReLU, Conv
ReLU, Conv
ResNet block ReLU, Conv
ReLU, Conv
Fully connected ReLU, GlobalSum pooling 128
Linear 1
Table 5: ResNet classifier with resolution.

Layer Details Output size
Latent noise 128
Fully connected Linear 2048
Reshape
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
ResNet block ReLU
Upsampling
Conv, BN, ReLU
Conv
Conv ReLU, Conv, Tanh
Table 6: ResNet sampler with resolution.

Layer Details Output size
ResNet block Conv
ReLU, Conv
Downsampling
ResNet block ReLU, Conv
ReLU, Conv
Downsampling
ResNet block ReLU, Conv
ReLU, Conv
Downsampling
ResNet block ReLU, Conv
ReLU, Conv
Fully connected ReLU, GlobalSum pooling 128
Linear 1
Table 7: ResNet classifier with resolution.