VGrow-Pg
A tensorflow implementation of VGrow by using progressive growing method.
We propose a general framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability spaces. The evolving distribution that asymptotically converges to the target distribution is governed by a vector field, which is the negative gradient of the first variation of the f-divergence between them. We prove that the evolving distribution coincides with the pushforward distribution through the infinitesimal-time composition of residual maps that are perturbations of the identity map along the vector field. The vector field depends on the density ratio of the pushforward distribution and the target distribution, which can be consistently learned from a binary classification problem. Connections of the proposed VGrow method with other popular methods, such as VAE, GAN and flow-based methods, are established in this framework, gaining new insights into deep generative learning. We also evaluate several commonly used divergences, including the Kullback-Leibler, Jensen-Shannon and Jeffrey divergences, as well as our newly discovered "logD" divergence, which serves as the objective function of the logD-trick GAN. Experimental results on benchmark datasets demonstrate that VGrow can generate high-fidelity images in a stable and efficient manner, achieving competitive performance with state-of-the-art GANs.
Learning a generative model, i.e., the underlying data-generating distribution, from large amounts of data is one of the fundamental tasks in machine learning and statistics [46]. Recent advances in deep generative models have provided novel techniques for unsupervised and semi-supervised learning, with broad applications ranging from image synthesis [44] and semantic image editing [60, 61] to low-level image processing [29]. Implicit deep generative modeling is a powerful and flexible framework that approximates the target distribution by learning deep samplers [38], with generative adversarial networks (GAN) [16] and likelihood-based models, such as variational auto-encoders (VAE) [23] and flow-based methods [11], as its main representatives. These models focus on learning a deterministic or stochastic nonlinear mapping that transforms low-dimensional latent samples from a simple reference distribution into samples that closely match the target distribution.

GANs set up a minimax two-player game between a generator and a discriminator. During training, the generator transforms samples from a simple reference distribution into samples intended to deceive the discriminator, while the discriminator conducts a differential two-sample test to distinguish the generated samples from the observed samples. The objective of the vanilla GAN amounts to the Jensen-Shannon (JS) divergence between the learned distribution and the target distribution. The vanilla GAN generates sharp image samples but suffers from instability issues [3]. A myriad of extensions to the vanilla GAN have been investigated, both theoretically and empirically, to achieve stable training and high-quality sample generation. Existing works include, but are not limited to, designing new learning procedures or network architectures [10, 43, 58, 59, 4, 51, 8], seeking alternative distribution discrepancy measures as loss criteria in feature or data space [31, 15, 30, 49, 6, 3, 36, 39], exploiting insightful regularization methods [9, 17, 37, 57], and building hybrid models [13, 53, 14, 54, 21].
VAE approximately minimizes the Kullback-Leibler (KL) divergence between the transformed distribution and the target distribution by minimizing a surrogate loss, i.e., the negative evidence lower bound, defined as a reconstruction loss plus a regularization loss: the reconstruction loss measures the discrepancy between the decoder and the encoder, and the regularization loss measures the discrepancy between the encoder and the simple latent prior distribution [23]. VAE enjoys optimization stability but has been criticized for generating blurry image samples, caused by the Gaussian decoder and the marginal log-likelihood based loss [53]. Adversarial auto-encoders [35] use GANs to penalize the discrepancy between the aggregated posterior of the latent codes and the simple prior distribution. Wasserstein auto-encoders [52], which extend adversarial auto-encoders to general penalized optimal transport objectives [7], alleviate the blurriness. Similar ideas appear in work on disentangled representations of natural images [20, 27].
Flow-based methods minimize the negative log-likelihood, i.e., the KL divergence, exactly, where the model density is the pushforward density of a simple reference density through a sequence of learnable invertible transformations called a normalizing flow [45]. Research on flow-based generative models mainly focuses on designing neural network architectures that trade off representational power against the computational complexity of the log-determinants [11, 12, 25, 42, 24].

In this paper, we propose a general framework for learning a deep generative model to sample from the target distribution by combining the strengths of variational gradient flow (VGrow) on probability spaces, particle optimization and deep neural networks. Our method aims to find a deterministic transportation map that transforms low-dimensional samples from a simple reference distribution, such as a Gaussian or uniform distribution, into samples from the underlying target distribution. The evolving distribution that asymptotically converges to the target distribution is governed by a vector field, which is the negative gradient of the first variation of the f-divergence between the evolving distribution and the target distribution. We prove that the evolving distribution coincides with the pushforward distribution through the infinitesimal-time composition of residual maps that are perturbations of the identity map along the vector field. At the population level, the vector field depends only on the density ratio of the pushforward distribution and the target distribution, which can be consistently learned from a binary classification problem distinguishing observed data sampled from the target distribution from generated data sampled from the pushforward distribution. Both the transport map and the binary classifier are parameterized with deep convolutional neural networks and trained via stochastic gradient descent (SGD). Connections of the proposed VGrow method with other popular methods, such as VAE, GAN and flow-based methods, are established in this framework, gaining new insights into deep generative learning. We also evaluate several commonly used divergences, including the Kullback-Leibler, Jensen-Shannon and Jeffrey divergences, as well as our newly discovered "logD" divergence, which serves as the objective function of the logD-trick GAN and is of independent interest. We test VGrow with these four divergences on four benchmark datasets: MNIST [28], FashionMNIST [56], CIFAR10 [26] and CelebA [34]. The VGrow learning procedure is very stable, as indicated by our established theory, and the resulting deep sampler generates realistic-looking images, achieving competitive performance with state-of-the-art GANs. The code of VGrow is available at https://github.com/xjtuygao/VGrow.

Let x_1, ..., x_n be independent and identically distributed samples from an unknown target distribution with density p with respect to the Lebesgue measure (we make the same assumption for all distributions in this paper). We aim to learn the distribution by constructing a variational gradient flow on the space of Borel probability measures. To this end, we need the following background studied in [1].
Given a distribution with density q, we use the f-divergence to measure the discrepancy between q and the target density p, defined as

D_f(q ‖ p) = ∫ p(x) f(q(x)/p(x)) dx,   (2.1)

where f is convex and f(1) = 0. We use F(q) = D_f(q ‖ p) to denote the energy functional for simplicity. Obviously F(q) ≥ 0, and F(q) = 0 if and only if q = p.
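For intuition, the definition (2.1) can be checked numerically on discrete distributions. The sketch below (a toy with hypothetical helper names, not part of the paper's code) evaluates D_f for the KL choice f(t) = t log t, under the convention D_f(q ‖ p) = Σ_i p_i f(q_i/p_i):

```python
import math

def f_divergence(q, p, f):
    # D_f(q || p) = sum_i p_i * f(q_i / p_i) for discrete distributions,
    # mirroring the integral definition (2.1).
    return sum(pi * f(qi / pi) for qi, pi in zip(q, p) if pi > 0)

f_kl = lambda t: t * math.log(t)  # KL choice: f(t) = t log t, with f(1) = 0

p = [0.25, 0.75]
q = [0.50, 0.50]
print(f_divergence(q, q, f_kl))  # 0 when the two distributions coincide
print(f_divergence(q, p, f_kl))  # strictly positive otherwise (equals KL(q || p))
```

The last value coincides with KL(q ‖ p) = Σ_i q_i log(q_i/p_i), as expected for this choice of f.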
Let δF/δq denote the first variation of F at q; for the energy functional (2.1) it is

δF/δq (x) = f'(q(x)/p(x)).

Consider a curve of distributions t ↦ μ_t with density q_t, and let

v_t(x) = −∇ f'(q_t(x)/p(x))

be the associated vector field. We call q_t a variational gradient flow of the energy functional F governed by the vector field v_t if q_t satisfies the Vlasov-Fokker-Planck equation

∂_t q_t = −∇·(q_t v_t) = ∇·(q_t ∇ f'(q_t/p)).   (2.2)
As shown in Lemma 2 below, the energy functional F(q_t) is decreasing along the curve q_t. As a consequence, q_t converges to the target density p as t → ∞.
For any fixed time t, let Z be a random variable with distribution q_t. Let v be an element of the Hilbert space L²(q_t) and let s be a small positive number. Define the residual map T_s as a small perturbation of the identity map along v, i.e., T_s(x) = x + s v(x). Let T_s^{-1} be the inverse of T_s, which is well defined when s is small enough. By the change-of-variables formula, the density of the pushforward distribution of the random variable T_s(Z) is

q̃_s(x) = q_t(T_s^{-1}(x)) |det ∇T_s^{-1}(x)|.
Let J(s) = F(q̃_s) denote the value of the energy functional along the residual map. It is natural to seek v satisfying J(s) < J(0), which indicates that the pushforward distribution is closer to p than q_t. We find such a v by calculating the first variation of J at s = 0.
For any v ∈ L²(q_t), if the vanishing condition is satisfied, then

d/ds J(s) |_{s=0} = ∫ ⟨∇ f'(q_t(x)/p(x)), v(x)⟩ q_t(x) dx,

so v_t = −∇ f'(q_t/p) is a descent direction for the energy functional.
The vanishing condition assumed in Theorem 1 holds when the densities have compact support or sufficiently light tails. Theorem 1 shows that the residual map, defined as a small perturbation of the identity map along the vector field, can push samples from q_t toward samples more likely drawn from p.
The evolving distribution of T_s(Z) under the infinitesimal pushforward map satisfies the Vlasov-Fokker-Planck equation (2.2).
As a consequence of Theorem 2, the pushforward distribution through the residual maps with infinitesimal time perturbations is the same as the variational gradient flow. This connection motivates us to approximately solve the Vlasov-Fokker-Planck equation (2.2) by finding a pushforward map defined as a composition of a sequence of discrete-time residual maps with small step size, as long as we can learn the vector field v_t. By definition, the vector field is an explicit function of the density ratio q_t/p, whose estimation is well studied; see, for example, [48].
Let (X, Y) be a random pair with a binary label Y taking values in {0, 1}, where X | Y = 1 ∼ p (observed data) and X | Y = 0 ∼ q_t (generated data). Denote by D(x) the classifier output and consider the logistic regression problem

min_D E[ −Y log D(X) − (1 − Y) log(1 − D(X)) ].   (2.3)

If D* solves (2.3) with balanced classes, then the density ratio is recovered as q_t(x)/p(x) = (1 − D*(x))/D*(x).
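To illustrate the classification route to the density ratio, the following numpy sketch (a toy, not the paper's implementation) fits a logistic classifier separating samples of N(2, 1) (label 1, "observed") from N(0, 1) (label 0, "generated"). For these two Gaussians the true log-odds are linear, log(p(x)/q(x)) = 2x − 2, so the learned weights should approach (2, −2):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
x_real = rng.normal(2.0, 1.0, size=n)   # "observed" data ~ p = N(2, 1), label 1
x_fake = rng.normal(0.0, 1.0, size=n)   # "generated" data ~ q = N(0, 1), label 0

# Design matrix with an intercept; logistic regression by full-batch gradient descent.
X = np.concatenate([x_real, x_fake])
y = np.concatenate([np.ones(n), np.zeros(n)])
feats = np.stack([X, np.ones_like(X)], axis=1)

w = np.zeros(2)
for _ in range(5000):
    prob = 1.0 / (1.0 + np.exp(-feats @ w))   # D(x), probability of label 1
    w -= 0.2 * feats.T @ (prob - y) / len(y)  # gradient step on the logistic loss (2.3)

# The fitted log-odds log(D/(1-D)) = w[0]*x + w[1] estimate log(p(x)/q(x)) = 2x - 2,
# so the density ratio q/p is recovered as (1 - D)/D.
print(w)  # approximately (2, -2)
```

The same mechanism, with a deep classifier in place of the linear model, is what supplies the density ratio inside the vector field.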
With data samples x_1, ..., x_n from an unknown target distribution, our goal is to learn a deterministic transportation map that transforms low-dimensional samples from a simple reference distribution, such as a Gaussian or a uniform distribution, into samples from the underlying target distribution.
To this end, we parameterize the sought transport map via a deep neural network G_θ, where θ denotes its parameters. We sample particles from the simple reference distribution and transform them into initial particles with G_θ. We then iterate the following two steps. First, we learn the density ratio by solving (2.3) on the real data and the generated particles, with the classifier parameterized by a neural network. Second, we define the residual map using the estimated vector field with a small step size and update the particles accordingly. According to the theory discussed in Section 3, iterating these two steps yields particles more likely sampled from the target distribution, so we can update the generator by fitting the particle pairs. The whole procedure can be repeated as desired with warm starts. A detailed description of the VGrow learning procedure follows.
Outer loop:
1. Train the classifier on real data and current particles to estimate the density ratio via (2.3).
2. Update the particles with the residual map along the estimated vector field.
3. Update the generator parameter θ by fitting the particle pairs with SGD.
End outer loop
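As a toy illustration of the particle-update step, the sketch below transports particles from N(0, 1) to a target N(2, 1) under the KL vector field v = −∇ log(q_t/p). It substitutes the analytic density ratio for the learned classifier and exploits the fact that here each update is a pure shift, so the particle density stays Gaussian with unit variance; none of these simplifications are part of the paper's algorithm:

```python
import numpy as np

rng = np.random.default_rng(1)
particles = rng.normal(0.0, 1.0, size=2000)  # initial particles ~ N(0, 1)
step = 0.1                                   # residual-map step size s

def grad_log_ratio(x, m):
    # For current density q_t = N(m, 1) and target p = N(2, 1):
    # grad log(q_t(x)/p(x)) = -(x - m) + (x - 2) = m - 2, a constant shift.
    return np.full_like(x, m - 2.0)

for _ in range(100):                         # outer loops
    m = particles.mean()                     # oracle stand-in for the classifier
    # residual map x + s*v with v = -grad log(q_t/p)
    particles = particles - step * grad_log_ratio(particles, m)

print(particles.mean())  # drifts to the target mean 2
```

The gap to the target mean shrinks by a factor (1 − s) per loop, mirroring the monotone decrease of the energy functional along the flow.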
We consider four divergences in this paper. Their forms and second-order derivatives are shown in Table 1: the three commonly used Kullback-Leibler (KL), Jensen-Shannon (JS) and Jeffrey divergences, as well as our newly discovered "logD" divergence, which serves as the objective function of the logD-trick GAN and, to the best of our knowledge, is a new result.
At the population level, the logD-trick GAN [16] minimizes the "logD" divergence D_logD(q ‖ p), where q is the distribution of the generated data and the corresponding f-function is given in Table 1.
f-Div | f(t) | f''(t)
---|---|---
KL | t log t | 1/t
JS | t log t − (t + 1) log((t + 1)/2) | 1/(t(t + 1))
logD | t log(t + 1) − t log 2 | (t + 2)/(t + 1)²
Jeffrey | (t − 1) log t | (t + 1)/t²
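As a quick check of Table 1, the sketch below encodes one standard choice of f and its second derivative f'' for each divergence, under the convention D_f(q ‖ p) = ∫ p f(q/p); the paper's exact normalizations may differ by constants or by the orientation of the ratio. Each f'' is verified against a central finite difference of f:

```python
import math

# One standard (f, f'') pair per divergence; constants chosen so that f(1) = 0.
DIVERGENCES = {
    "KL":      (lambda t: t * math.log(t),
                lambda t: 1.0 / t),
    "JS":      (lambda t: t * math.log(t) - (t + 1) * math.log((t + 1) / 2),
                lambda t: 1.0 / (t * (t + 1))),
    "logD":    (lambda t: t * math.log(t + 1) - t * math.log(2),
                lambda t: (t + 2) / (t + 1) ** 2),
    "Jeffrey": (lambda t: (t - 1) * math.log(t),
                lambda t: (t + 1) / t ** 2),
}

h = 1e-4
for name, (f, f2) in DIVERGENCES.items():
    assert abs(f(1.0)) < 1e-12, name                         # f(1) = 0
    for t in (0.5, 1.0, 2.0):
        numeric = (f(t + h) - 2 * f(t) + f(t - h)) / h ** 2  # central second difference
        assert abs(numeric - f2(t)) < 1e-5, name             # matches the analytic f''
        assert f2(t) > 0, name                               # convexity of f
```

Plugging any of these f-functions into the vector field −∇f'(q_t/p) yields the corresponding VGrow variant.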
We discuss connections between our proposed VGrow learning procedure and related works, such as VAE, GAN and flow-based methods.
VAE [23] is formulated as maximizing a lower bound related to the KL divergence. Flow-based methods [11, 12] minimize the KL divergence between the target and a model whose density is the pushforward of a simple reference density through a sequence of learnable invertible transformations. These methods parameterize the transformations via specially designed neural networks that facilitate log-determinant computation [11, 12, 25, 42, 24] and train them via maximum likelihood. VGrow also learns a sequence of simple residual maps, but these are guided by the variational gradient flow in probability space, which differs from flow-based methods in principle.
The original vanilla GAN and the logD-trick GAN [16] minimize the JS divergence and the "logD" divergence, respectively, as shown in Theorem 3. This idea extends to the general f-GAN [41], where a general f-divergence is used. However, f-divergence based GANs are formulated to solve a dual problem, whereas our VGrow minimizes the f-divergence in its primal form. The GAN works most closely related to VGrow are [22, 40, 55], where functional gradients (first variations of functionals) are used to aid GAN training. [40] introduced a gradient layer based on the first variation of the generator loss in WGAN [3] to accelerate training convergence. In [55], a deep energy model was trained via Stein variational gradient descent [33], whose update is the projection of the first variation of the KL divergence in Theorem 1 onto a reproducing kernel Hilbert space; see Section 7.7 for the proof. [22] proposed CFG-GAN, which directly minimizes the KL divergence via functional gradient descent; their update direction is the gradient of the log density ratio multiplied by a positive scaling function, which they empirically set to 1 in their numerical study. Our VGrow is based on the general f-divergence, and Theorem 1 implies that in the KL case the update direction is indeed the gradient of the log density ratio, so the scaling function should be exactly 1.
We evaluated our model on four benchmark datasets: MNIST [28], FashionMNIST [56], CIFAR10 [26] and CelebA [34]. Four representative f-divergences were tested to demonstrate the effectiveness of the general Variational Gradient Flow (VGrow) framework for generative learning.
f-divergences. Theoretically, our model works for the whole f-divergence family by simply plugging in the corresponding f-function; special cases are obtained when specific f-divergences are considered. At the population level, adopting the KL divergence recovers CFG-GAN, while adopting the JS divergence leads to the vanilla GAN. As proved above, the GAN with the logD trick corresponds to our newly discovered "logD" divergence, which belongs to the f-divergence family. Moreover, we consider the Jeffrey divergence to show that our model applies to other f-divergences. We name these four cases VGrow-KL, VGrow-JS, VGrow-logD and VGrow-JF.
Datasets. We chose four benchmark datasets from the GAN literature: three small datasets (MNIST, FashionMNIST, CIFAR10) and one large dataset (CelebA). MNIST and FashionMNIST each have a training set of 60k grayscale examples and a test set of 10k examples. CIFAR10 has a training set of 50k color examples and a test set of 10k examples. Each of these three datasets naturally has 10 classes. CelebA consists of more than 200k celebrity images, which we randomly divided into a training set and a test set with a ratio of approximately 9:1. The MNIST and FashionMNIST inputs were resized to a common resolution, and the CelebA images were pre-processed by first taking a central crop and then resizing. Only the training sets were used to train our models.
Evaluation metrics. The Inception Score (IS) [47] calculates the exponentiated mutual information exp(E_x KL(p(y|x) ‖ p(y))), where p(y|x) is the conditional class distribution given a generated image x and p(y) is the marginal class distribution across generated images [5]. To estimate p(y|x) and p(y), we trained dataset-specific classifiers on MNIST, FashionMNIST and CIFAR10 following [22], using a pre-activation ResNet-18 [18]. All IS values were calculated over 50k generated images. The Fréchet Inception Distance (FID) [19] computes the Wasserstein-2 distance between Gaussians fitted to real and generated images after propagation through the Inception-v3 model [50]. All FID scores are reported with respect to the 10k test examples on MNIST, FashionMNIST and CIFAR10 via the tensorflow implementation at https://github.com/bioinf-jku/TTUR/blob/master/fid.py. In a nutshell, higher IS and lower FID are better.

Network architectures and hyperparameter settings. We adopted a new architecture modified from the residual networks used in [37]. The modifications comprise reducing the number of batch normalization layers and introducing spectral normalization in the deep sampler / generator. The architecture was shared across the three small datasets, and most hyperparameters were shared across different divergences. More residual blocks, upsampling and downsampling are employed on CelebA. In our experiments, we set the batch size to 64 and used RMSProp as the SGD optimizer. The learning rate was 0.0001 for both the deep sampler and the deep classifier, except for 0.0002 on MNIST for VGrow-JF. Inputs to the deep samplers are vectors drawn from a 128-dimensional standard normal distribution on all datasets, and the remaining meta-parameters of the VGrow learning procedure (step size and number of inner loops) were held fixed.

The sampler and the classifier are parameterized with residual networks. Each ResNet block has a skip-connection: the skip-connection uses upsampling / downsampling of its input and a 1x1 convolution if the residual block contains upsampling / downsampling, and is the identity mapping otherwise. Upsampling is nearest-neighbor upsampling, and downsampling is achieved with mean pooling. Details of the networks are listed in Tables 4, 5, 6 and 7 in Appendix B.
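The nearest-neighbor upsampling and mean pooling used in the blocks are simple array operations. The numpy sketch below (shapes are illustrative, not the paper's) implements both and checks that 2x2 mean pooling exactly inverts 2x nearest-neighbor upsampling:

```python
import numpy as np

def nn_upsample(x):
    # Nearest-neighbor 2x upsampling: (H, W, C) -> (2H, 2W, C).
    return x.repeat(2, axis=0).repeat(2, axis=1)

def mean_pool(x):
    # 2x2 mean pooling: (2H, 2W, C) -> (H, W, C).
    h, w, c = x.shape
    return x.reshape(h // 2, 2, w // 2, 2, c).mean(axis=(1, 3))

x = np.arange(2 * 2 * 3, dtype=float).reshape(2, 2, 3)  # a tiny 2x2, 3-channel "image"
assert nn_upsample(x).shape == (4, 4, 3)
assert np.allclose(mean_pool(nn_upsample(x)), x)        # pooling undoes the upsampling
```

In the actual networks these operations are applied inside the residual blocks, before and after the convolutions listed in the tables.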
Our experiments demonstrate empirically that (1) VGrow is very stable in the training phase, and (2) VGrow generates high-fidelity samples that are comparable to real samples both visually and quantitatively. Comparisons with state-of-the-art GANs suggest the effectiveness of VGrow.

Stability. It has been shown that the binary classification loss correlates poorly with generation quality for JS divergence based GAN models [3]. We observed a similar phenomenon with our f-divergence based VGrow model: the classification loss changed little at the beginning of training and then fluctuated around a constant value. Since the classification loss was not meaningful enough to measure generation quality, we instead used the aforementioned Inception Score to draw IS-vs-loop learning curves on MNIST, FashionMNIST and CIFAR10. The results are presented in Figure 1. As indicated in all three subfigures, the learning curves are very smooth, and the Inception Scores increase nearly monotonically until 3500 outer loops (almost 75 epochs) on MNIST and FashionMNIST, and 4500 outer loops (almost 100 epochs) on CIFAR10.
Effectiveness. First, we show real images and examples generated by VGrow-KL on the four benchmark datasets in Figures 2-5. The realistic-looking generated images are visually comparable to real images sampled from the training set; it is easy to tell which class a generated example belongs to, even on CIFAR10. Second, Table 2 presents the FID scores for the four models, together with the FID values on 10k training examples of MNIST and FashionMNIST. The scores of generated samples are very close to those of real data; in particular, VGrow-JS obtains average scores of 3.32 and 8.75, versus 2.12 and 4.16 on training data, on MNIST and FashionMNIST respectively. Third, Table 3 shows the FID evaluations of our four models against the state-of-the-art WGANs and MMDGANs reported in [2], based on 50k samples. Our VGrow-logD attains a score of 28.8 with low variance, competitive with the best referenced baseline (28.5), and VGrow-JS and VGrow-KL outperform the remaining baselines. In short, the quantitative results in Table 2 and Table 3 illustrate the effectiveness of our VGrow model.

Models | MNIST (10k) | FashionMNIST (10k)
---|---|---|
VGrow-KL | 3.66 (0.09) | 9.30 (0.09) |
VGrow-JS | 3.32 (0.05) | 8.75 (0.06) |
VGrow-logD | 3.64 (0.05) | 9.51 (0.09) |
VGrow-JF | 3.40 (0.07) | 9.72 (0.06) |
Training set | 2.12 (0.02) | 4.16 (0.03) |
Mean (standard deviation) of FID evaluations over 10k generated MNIST / FashionMNIST images with five bootstrap resamplings. The last row gives the FID between 10k training examples and 10k test examples.
Models | CIFAR10 (50k) |
---|---|
VGrow-KL | 29.7 (0.1) |
VGrow-JS | 29.1 (0.1) |
VGrow-logD | 28.8 (0.1) |
VGrow-JF | 32.3 (0.1) |
WGAN-GP | 31.1 (0.2) |
MMDGAN-GP-L2 | 31.4 (0.3) |
SMMDGAN | 31.5 (0.4) |
SN-SWGAN | 28.5 (0.2) |
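The FID values above are Fréchet (squared Wasserstein-2) distances between Gaussians fitted to Inception features. A minimal numpy implementation of that distance is sketched below, using a symmetric-PSD matrix square root; for reported numbers the official script linked earlier should be preferred:

```python
import numpy as np

def sqrtm_psd(a):
    # Matrix square root of a symmetric positive semi-definite matrix via eigendecomposition.
    w, v = np.linalg.eigh(a)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^{1/2} S2 S1^{1/2})^{1/2}),
    # the squared Wasserstein-2 distance between two Gaussians.
    diff = mu1 - mu2
    s1_half = sqrtm_psd(sigma1)
    covmean = sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))

mu, sigma = np.zeros(2), np.eye(2)
print(frechet_distance(mu, sigma, mu, sigma))                    # 0.0 for identical Gaussians
print(frechet_distance(mu, sigma, np.array([1.0, 1.0]), sigma))  # 2.0: pure mean shift
```

In FID, mu and sigma are the empirical mean and covariance of Inception-v3 pool features computed separately on real and generated images.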
We propose a framework to learn deep generative models via Variational Gradient Flow (VGrow) on probability spaces. We discuss connections of the proposed VGrow method with VAE, GAN and flow-based methods, and evaluate VGrow with several divergences, including a newly discovered "logD" divergence, which serves as the objective function of the logD-trick GAN. Experimental results on benchmark datasets demonstrate that VGrow can generate high-fidelity images in a stable and efficient manner, achieving competitive performance with state-of-the-art GANs.
IntroVAE: Introspective variational autoencoders for photographic image synthesis. In NeurIPS, 2018.
Photo-realistic single image super-resolution using a generative adversarial network. In CVPR, 2017.
MMD GAN: Towards deeper understanding of moment matching network. In NIPS, 2017.
Stein variational gradient descent: A general purpose Bayesian inference algorithm. In NIPS, 2016.
Rethinking the inception architecture for computer vision. In CVPR, 2016.

In this section we give detailed proofs of the main theory in the paper.
The result follows from expression 10.1.16 in [1] (Section E of Chapter 10.1.2, page 233). ∎
For any v, let q̃_s denote the pushforward density of q_t under the residual map T_s(x) = x + s v(x), and define J(s) = ∫ p(x) f(q̃_s(x)/p(x)) dx as a function of s. By definition, J(0) = F(q_t), since T_0 is the identity map, so we need to calculate the derivative of J at s = 0. Recall that, by the change-of-variables formula, q̃_s(T_s(x)) |det(I + s ∇v(x))| = q_t(x). Differentiating this identity at s = 0 yields the continuity equation

∂_s q̃_s |_{s=0} = −∇·(q_t v).

By the chain rule, we get

d/ds J(s) |_{s=0} = ∫ f'(q_t(x)/p(x)) ∂_s q̃_s(x) |_{s=0} dx = −∫ f'(q_t(x)/p(x)) ∇·(q_t(x) v(x)) dx = ∫ ⟨∇ f'(q_t(x)/p(x)), v(x)⟩ q_t(x) dx,

where the last equality follows from integration by parts and the vanishing assumption. ∎
The optimal classifier D* is the minimizer of the logistic criterion (2.3), which at the population level with balanced classes is proportional to

−∫ [ p(x) log D(x) + q_t(x) log(1 − D(x)) ] dx.

The above criterion is a functional of D. Setting its first variation to zero yields

−p(x)/D(x) + q_t(x)/(1 − D(x)) = 0,

i.e., D*(x) = p(x)/(p(x) + q_t(x)), so that q_t(x)/p(x) = (1 − D*(x))/D*(x). ∎
By definition, the logD trick replaces the generator loss E[log(1 − D(G(z)))] of the vanilla GAN with −E[log D(G(z))]. At the population level, the objective function of the logD-trick GAN reads [16]

max_D E_{x∼p}[log D(x)] + E_{z∼p_z}[log(1 − D(G(z)))],   min_G E_{z∼p_z}[−log D(G(z))],

where p_z is the simple low-dimensional reference distribution. Denote by q the distribution of G(z). The optimal discriminator is D*(x) = p(x)/(p(x) + q(x)). Substituting this into the generator criterion, we get

E_{x∼q}[−log D*(x)] = ∫ q(x) log( (p(x) + q(x)) / p(x) ) dx,

which equals the "logD" divergence up to an additive constant. ∎
Let f(t) = t log t (the KL case) and let v be in a Stein class associated with q_t. By the proof of Theorem 1, we know

d/ds J(s) |_{s=0} = ∫ ⟨∇ log(q_t(x)/p(x)), v(x)⟩ q_t(x) dx = −∫ [ ⟨∇ log p(x), v(x)⟩ + ∇·v(x) ] q_t(x) dx,

where the last equality follows from integration by parts after restricting v to a Stein class associated with q_t, i.e., ∫ ∇·(q_t(x) v(x)) dx = 0. ∎
In this section, we present the details of the networks used in our experiments. We use c to denote the number of channels of the images, i.e., c = 1 for the grayscale datasets or c = 3 for the color datasets.
Layer | Details | Output size
---|---|---
Latent noise | 128 | |
Fully connected | Linear | 2048 |
Reshape | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
Conv | ReLU, Conv, Tanh |
Layer | Details | Output size |
---|---|---|
ResNet block | Conv | |
ReLU, Conv | ||
Downsampling | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
Downsampling | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
Fully connected | ReLU, GlobalSum pooling | 128 |
Linear | 1 |
Layer | Details | Output size
---|---|---
Latent noise | 128 | |
Fully connected | Linear | 2048 |
Reshape | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
ResNet block | ReLU | |
Upsampling | ||
Conv, BN, ReLU | ||
Conv | ||
Conv | ReLU, Conv, Tanh |
Layer | Details | Output size |
---|---|---|
ResNet block | Conv | |
ReLU, Conv | ||
Downsampling | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
Downsampling | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
Downsampling | ||
ResNet block | ReLU, Conv | |
ReLU, Conv | ||
Fully connected | ReLU, GlobalSum pooling | 128 |
Linear | 1 |