Semi-Implicit Generative Model

05/29/2019 · Mingzhang Yin et al. · The University of Texas at Austin

To combine explicit and implicit generative models, we introduce the semi-implicit generator (SIG) as a flexible hierarchical model that can be trained in the maximum likelihood framework. Both theoretically and experimentally, we demonstrate that SIG can generate high-quality samples, especially when dealing with multi-modality. By introducing SIG as an unbiased regularizer for the generative adversarial network (GAN), we show that the interplay between maximum likelihood and adversarial learning can stabilize adversarial training, resist the notorious mode-collapsing problem of GANs, and improve the diversity of generated random samples.


1 Introduction

Generative models consist of a group of fundamental machine learning algorithms that are used to estimate the underlying probability distributions over data manifolds. Promoted by recent developments in deep neural networks, deep generative models have achieved great success in data simulation, density estimation, missing data imputation, and reinforcement learning, and are widely utilized for tasks such as image super-resolution, compression, and image-to-text translation. The goal of generative models is to minimize the distance between the generative distribution $p_\theta(x)$ and the data distribution $p_{\text{data}}(x)$ under a certain metric or divergence $\mathcal{D}$,

$$\min_\theta \mathcal{D}\big(p_{\text{data}}(x),\, p_\theta(x)\big), \qquad (1)$$

where $p_{\text{data}}(x)$ is usually approximated with the empirical data distribution based on observations $\{x_i\}_{i=1}^N$.

Depending on the type of $\mathcal{D}$, an existing generative model can often be classified as either an explicit generative model or an implicit one. The former requires an explicit probability density function (PDF) for $p_\theta(x)$ such that we can both sample data from it and evaluate its likelihood. Examples of explicit generative models include variational auto-encoders kingma2013auto, PixelRNN oord2016pixel, Real NVP dinh2016density, and many Bayesian hierarchical models such as the sigmoid belief net neal1992connectionist. An explicit generative model has a tractable density that can often be directly optimized by (1). The optimization target is a distance measure with nice geometric properties, which often leads to stable training and theoretically guaranteed convergence. However, the requirement of having a tractable density usually restricts the flexibility of an explicit model, making it hard to scale with increasing data complexity.

An implicit generative model, on the other hand, generates its random samples via a stochastic procedure but may not allow a point-wise evaluable PDF, which often makes a direct optimization of (1) infeasible. Generative adversarial networks (GANs) goodfellow2016nips tackle this problem by introducing an augmented discriminator and solving a minimax game: a generative network generates random samples by propagating random noise through a deep neural network, whereas a discriminator aims to distinguish the generated samples from true data. Under the condition of having an optimal discriminator, training a vanilla GAN's generator is equivalent to optimizing (1) where $\mathcal{D}$ is set as the Jensen-Shannon divergence. Unfortunately, in practice the overall loss function of a GAN is usually non-convex, and practitioners have encountered a variety of obstacles such as gradient vanishing, mode collapsing, and high sensitivity to the network architecture arjovsky2017wasserstein; goodfellow2016nips; salimans2016improved; radford2015unsupervised.

To incorporate a highly expressive generative model while maintaining a well-behaved optimization objective, we introduce the semi-implicit generator (SIG), a Bayesian hierarchical generative model that mixes an explicit conditional distribution $p(x\,|\,z)$ with an implicit mixing distribution $q_\phi(z)$, where the implicit distribution can be constructed by deterministically transforming random noise $\epsilon \sim q(\epsilon)$ to $z$ using a parameterized deterministic transform $z = T_\phi(\epsilon)$. Intuitively, $p(x\,|\,z)$ can incorporate our prior knowledge about the observed data, such as the data support, while $q_\phi(z)$ can maintain high expressiveness. With this hierarchical structure, SIG can be directly trained by choosing $\mathcal{D}$ as the Kullback-Leibler (KL) divergence and estimating (1) with Monte Carlo estimation. We show that the SIG optimization objective can intrinsically resist the mode-collapse problem. By leveraging adversarial training, we apply SIG as a semi-implicit regularizer to generative adversarial networks, which helps stabilize optimization, significantly mitigates mode collapsing, and generates high-quality samples in natural image scenarios.

2 Semi-implicit generator

Defining a family of parametric distributions $\{p_\theta(x)\}$, a classic explicit generative model is trained by maximizing the log-likelihood as

$$\theta^* = \operatorname*{argmax}_\theta \frac{1}{N}\sum_{i=1}^N \log p_\theta(x_i), \qquad (2)$$

which is identical to minimizing the cross-entropy $-\mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)]$. As the entropy of $p_{\text{data}}(x)$ is independent of the optimization parameter $\theta$, minimizing this cross-entropy is equivalent to (1) where $\mathcal{D}$ is set as the KL divergence.
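This equivalence follows from a standard decomposition, restated here for completeness: $-\mathbb{E}_{p_{\text{data}}(x)}[\log p_\theta(x)] = \mathrm{KL}\big(p_{\text{data}}(x)\,\|\,p_\theta(x)\big) + \mathcal{H}\big(p_{\text{data}}(x)\big)$, where the entropy term does not depend on $\theta$.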

Instead of treating $\theta$ as a global optimization parameter, we consider it as a local random variable $z$ generated from a distribution $q_\phi(z)$ with parameter $\phi$. The semi-implicit generator (SIG) is defined in a two-stage manner:

$$z \sim q_\phi(z), \qquad x \sim p(x\,|\,z). \qquad (3)$$

Marginalizing $z$ out, we can view the generator as $p_\phi(x) = \mathbb{E}_{z\sim q_\phi(z)}[p(x\,|\,z)]$. Here $p(x\,|\,z)$ is required to be explicit, but $q_\phi(z)$ can be defined by sampling a random variable $\epsilon$ from a fixed distribution $q(\epsilon)$ and setting $z = T_\phi(\epsilon)$, where $T_\phi$ is a deterministic mapping represented by a neural network with parameter $\phi$. Therefore, $q_\phi(z)$ typically cannot be evaluated pointwise and the marginal $p_\phi(x)$ is implicit. Notice that in this setting $z$ is required to be continuous, while $x$ can be sampled from a discrete distribution with continuous parameters.
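For concreteness, the two-stage sampling in (3) can be sketched as follows (a minimal NumPy illustration, not the authors' code; the one-hidden-layer MLP transform and the Gaussian choice of $p(x\,|\,z)$ are assumptions made only for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

def T_phi(eps, W1, b1, W2, b2):
    """Illustrative deterministic transform: a one-hidden-layer MLP mapping noise to z."""
    h = np.maximum(eps @ W1 + b1, 0.0)            # ReLU hidden layer
    return h @ W2 + b2                            # z = T_phi(eps)

def sample_sig(n, params, noise_dim=10, sigma=0.1):
    """Two-stage SIG sampling: eps ~ q(eps), z = T_phi(eps), x ~ p(x | z).
    Here q(eps) is standard normal and p(x | z) = N(x; z, sigma^2 I), as assumed examples."""
    eps = rng.standard_normal((n, noise_dim))     # noise feeding the implicit layer
    z = T_phi(eps, *params)                       # implicit mixing variable
    x = z + sigma * rng.standard_normal(z.shape)  # draw from the explicit conditional
    return x

# Usage: randomly initialized transform from 10-d noise to 2-d data
params = (rng.standard_normal((10, 64)), np.zeros(64),
          rng.standard_normal((64, 2)), np.zeros(2))
samples = sample_sig(1000, params)
print(samples.shape)  # (1000, 2)
```

Only the marginal samples $x$ are needed for training; the density of $q_\phi(z)$ is never evaluated.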

Minimizing the cross-entropy $-\mathbb{E}_{p_{\text{data}}(x)}[\log p_\phi(x)]$ is equivalent to minimizing the KL divergence with respect to the model parameter $\phi$, as in (1):

$$\min_\phi \mathrm{KL}\big(p_{\text{data}}(x)\,\|\,p_\phi(x)\big) \qquad (4)$$
$$= \min_\phi\; -\mathbb{E}_{p_{\text{data}}(x)}\Big[\log \mathbb{E}_{z\sim q_\phi(z)}\big[p(x\,|\,z)\big]\Big] - \mathcal{H}\big(p_{\text{data}}(x)\big). \qquad (5)$$

We show below that SIG can be trained by minimizing an upper bound of the cross-entropy in (5).

Lemma 1.

Let us construct an estimator of the cross-entropy as

$$\mathcal{L}_K(\phi) = -\,\mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z)}\log\Big[\frac{1}{K}\sum_{k=1}^K p(x\,|\,z_k)\Big], \qquad (6)$$

then $\mathcal{L}_K(\phi) \ge \mathcal{L}_{K+1}(\phi) \ge -\mathbb{E}_{p_{\text{data}}(x)}[\log p_\phi(x)]$ for all $K$, and $\lim_{K\to\infty}\mathcal{L}_K(\phi) = -\mathbb{E}_{p_{\text{data}}(x)}[\log p_\phi(x)]$. When $K = 1$, let $z^* = \operatorname*{argmax}_z \mathbb{E}_{p_{\text{data}}(x)}[\log p(x\,|\,z)]$; then $\mathcal{L}_1(\phi) \ge -\mathbb{E}_{p_{\text{data}}(x)}[\log p(x\,|\,z^*)]$, where the equality is true if and only if $q_\phi(z)$ degenerates to a point mass at $z^*$.

In practice, $\mathcal{L}_K(\phi)$ is approximated with Monte Carlo samples as $\hat{\mathcal{L}}_K(\phi) = -\frac{1}{N}\sum_{i=1}^N \log\big[\frac{1}{K}\sum_{k=1}^K p(x_i\,|\,z_k)\big]$, where $\{x_i\}_{i=1}^N$ and $\{z_k\}_{k=1}^K$ are two sets of Monte Carlo samples generated from $p_{\text{data}}(x)$ and the implicit $q_\phi(z)$, respectively. When $K = 1$, the local $z$ degenerates to a single global value and the objective degenerates to (2). To analyze the performance of SIG, we first consider multi-modal data, on which popular deep generative models such as GANs often fail due to mode collapsing. For theoretical analysis, we first define a discrete multi-modal space as follows.
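The Monte Carlo objective above can be evaluated in a numerically stable way by working with log-densities and a log-sum-exp; the sketch below again assumes a Gaussian $p(x\,|\,z) = \mathcal{N}(x; z, \sigma^2 I)$ purely for illustration (any explicit conditional with a computable log-density would do):

```python
import numpy as np
from scipy.special import logsumexp

def sig_cross_entropy(x, z, sigma=0.1):
    """Monte Carlo estimate of -1/N sum_i log( 1/K sum_k p(x_i | z_k) ).
    x: (N, D) data samples, z: (K, D) samples from the implicit q_phi(z).
    Assumes p(x|z) = N(x; z, sigma^2 I) purely for illustration."""
    N, D = x.shape
    K = z.shape[0]
    # log p(x_i | z_k) for every pair (i, k), shape (N, K)
    sq = ((x[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    log_p = -0.5 * sq / sigma**2 - 0.5 * D * np.log(2 * np.pi * sigma**2)
    # log (1/K) sum_k p(x_i | z_k), computed stably
    log_mix = logsumexp(log_p, axis=1) - np.log(K)
    return -log_mix.mean()
```

In a deep learning framework the same computation would be written with differentiable tensors so that gradients flow back into $T_\phi$ through the reparameterized samples $z_k = T_\phi(\epsilon_k)$.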

Definition 1.

(Discrete multi-modal space) Suppose $(\mathcal{X}, d)$ is a metric space with metric $d$ and $\mathcal{X} = \bigcup_{k} \mathcal{X}_k$, where $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \neq j$. Let the distance between two sets be $d(\mathcal{X}_i, \mathcal{X}_j) = \inf\{d(x, y): x\in\mathcal{X}_i,\ y\in\mathcal{X}_j\}$ and let the diameter of a set be $\mathrm{diam}(\mathcal{X}_k) = \sup\{d(x, y): x, y\in\mathcal{X}_k\}$. Suppose there exists $\delta > 0$ such that $d(\mathcal{X}_i, \mathcal{X}_j) > \delta \ge \max_k \mathrm{diam}(\mathcal{X}_k)$ for all $i \neq j$. Then $\mathcal{X}$ is a discrete multi-modal space under the measure $d$.

Strictly speaking, there could be sub-modes within each $\mathcal{X}_k$, but the above definition emphasizes the existence of multiple separated regions in the support. Since the loss of a deep neural network is non-convex, finding the global optimality condition for $\phi$ can be difficult shang1996global; yun2017global. Thanks to the structure of SIG as a two-stage model, and assuming the implicit distribution $q_\phi(z)$ is flexible enough, we can study a simplified optimal assignment problem: given data points sampled from the true data distribution, how should generated data be assigned to the neighborhoods of the true data such that the objective defined in (6) is minimized in expectation,

(7)

where the data are assumed to be generated from a discrete multi-modal space $\mathcal{X} = \bigcup_k \mathcal{X}_k$, and $m_k$ denotes the number of generated points assigned to $\mathcal{X}_k$. Assuming that the data distribution is the marginal distribution of a normal-implicit mixture and that the modes are equally spaced, we have the following theorem.

Theorem 1.

(SIG for multi-modal space) Suppose $p_{\text{data}}(x)$ is defined on a discrete multi-modal space with the $\ell_2$-norm. Suppose there are $N$ data points, among which $n_k$ points belong to $\mathcal{X}_k$. Suppose we need to sample $M$ points, and $m_k$ denotes the number of generated points in $\mathcal{X}_k$. Measure proximity with a radial basis function (RBF) that takes a common value for pairs within the same mode and a much smaller value for pairs in different modes. Then the objective in (7) is convex, and at its optimum $m_k$ grows with $n_k$ up to a bias term determined by the cross-mode RBF values; in particular, no mode vanishes ($m_k > 0$ for all $k$) when the modes are sufficiently well separated.

Corollary 1.

Assume the normal-implicit mixture setting above. Then the expected cross-mode RBF values, and hence the bias in the optimal assignment, are exponentially small in the squared distance between the mode centers.

The ideal proportion for $m_k$ would be $\frac{n_k}{N}M$, and the cross-mode term plays the role of a bias. In the normal-implicit mixture case, as shown in Corollary 1, if the components are approximately normally distributed, this bias can be exponentially small for well-separated modes. This indicates that SIG has a strong built-in resistance to mode collapsing.

There is an interesting connection between SIG and the variational auto-encoder (VAE) kingma2013auto; rezende2014stochastic. VAE maximizes the evidence lower bound $\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x\,|\,z)] - \mathrm{KL}\big(q_\phi(z\,|\,x)\,\|\,p(z)\big)$, which is the same as maximizing

$$\mathbb{E}_{p_{\text{data}}(x)}\,\mathbb{E}_{q_\phi(z|x)}\bigg[\log\frac{p_\theta(x, z)}{q_\phi(z\,|\,x)}\bigg], \qquad (8)$$

for which the optimal solution is $q_\phi(z\,|\,x) = p_\theta(z\,|\,x)$. Therefore, VAE imposes the constraint that there exists a recognition network/encoder $q_\phi(z\,|\,x)$, which is inferred by minimizing the KL divergence from $p_\theta(x, z)$, the joint distribution of the model, to $p_{\text{data}}(x)\,q_\phi(z\,|\,x)$, the joint distribution specified by the data distribution and encoder.

In SIG, we maximize

$$\mathbb{E}_{p_{\text{data}}(x)}\big[\log p_\phi(x)\big] = \mathbb{E}_{p_{\text{data}}(x)}\Big[\log \mathbb{E}_{q_\phi(z)}\big[p(x\,|\,z)\big]\Big], \qquad (9)$$

where $p(x\,|\,z)$ can be any valid probability density/mass function. VAE tries to match the joint distribution of the data and its encoder to that of the model, whereas SIG only cares about matching the marginal model distribution to the data distribution. It is clear that SIG does not require a specific encoder structure and hence provides more flexibility.

In experiments, we find that SIG can generate high-quality samples on relatively simple data manifolds such as MNIST, but observe that the richness of its generated images can be hard to scale with high data complexity, such as the CelebA dataset with 200K RGB images. More specifically, when $K$ is not sufficiently large, we find a "mode averaging" effect on generated images for complex data. We suspect that $K$ needs to scale with data complexity so that $\frac{1}{K}\sum_{k=1}^K p(x\,|\,z_k)$ is close to $p_\phi(x)$, and this is the price we pay for SIG to have stable training with a strong resistance against mode collapsing. While SIG performs well on relatively simple data but suffers from "mode averaging" on complex natural images, the generative adversarial network (GAN) has shown the ability to generate high-quality samples from large-scale observed data, but suffers from "mode collapsing" even on a simple mixture of Gaussians. To benefit from both worlds, we apply SIG as a regularizer in adversarial learning, which produces realistic samples while strongly resisting both the mode collapsing and the unstable optimization problems that are notorious in the training of GANs.

3 Generative adversarial network with semi-implicit regularizer

The generative adversarial network (GAN) goodfellow2014generative solves a minimax problem

$$\min_G \max_D\; \mathbb{E}_{x\sim p_{\text{data}}(x)}\big[\log D(x)\big] + \mathbb{E}_{z\sim p(z)}\big[\log\big(1 - D(G(z))\big)\big]. \qquad (10)$$

It is shown in goodfellow2016nips; goodfellow2014distinguishability that, with a suitably modified generator loss and an ideally optimal discriminator, the generator objective in (10) becomes identical to the SIG loss (4), which means SIG can be considered as training with the GAN's objective while using the optimal discriminator in the update of the generator. The discriminator in a GAN can be considered as an augmented part of the model that avoids density evaluation and indirectly feeds information about the real data into the optimization of the generator. With the help of the discriminator, the weak fitting of the generator to real data yields highly expressive samples that go beyond memorizing inputs. However, extensive recent research, in both practical experiments radford2015unsupervised; metz2016unrolled and theoretical analysis li2017towards; zhang2017discrimination; arjovsky2017towards, shows that the lack of capacity, insufficient training of the discriminator, and mismatches between the generator and discriminator in both network types and structures are the root causes of a variety of obstacles in GAN training. It has also been observed in goodfellow2014generative and highlighted in metz2016unrolled; arjovsky2017wasserstein that the optimal generator for a fixed discriminator is a sum of delta functions at the points where the discriminator assigns the highest value, which eventually collapses the generator to producing a small family of similar samples. In comparison, SIG is trained by maximizing the likelihood without a discriminator, which can be considered as a strong, direct fitting between real data and generated samples. This encourages us to combine the two models and apply SIG as a regularizer in a GAN model, which we refer to as GAN-SI.

For GAN-SI, the discriminator loss is

$$\mathcal{L}_D = -\,\mathbb{E}_{x\sim p_{\text{data}}(x)}\big[\log D_\psi(x)\big] - \mathbb{E}_{x\sim p_\phi(x)}\big[\log\big(1 - D_\psi(x)\big)\big], \qquad (11)$$

and the generator loss is a linear combination of the original GAN loss and the SIG loss,

$$\mathcal{L}_G = \mathcal{L}_{\text{GAN}} + \lambda\,\mathcal{L}_K(\phi), \qquad (12)$$

where $\psi$ denotes the discriminator network parameters and $T_\phi$ is the deterministic transform for the implicit distribution in SIG. For image generation we choose a simple explicit form for $p(x\,|\,z)$, and set $\lambda$ as a hyperparameter to balance the strength of the GAN and SIG objectives. In practice, we set $\lambda$ such that the GAN's generator loss and the cross-entropy term in (12) are on the same scale. The neural networks are set according to DCGAN radford2015unsupervised.
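As an illustration of how the combined objective in (12) might be assembled in code (a hypothetical PyTorch-style sketch; the discriminator interface, the non-saturating GAN term, and the helper names are assumptions, not the authors' implementation):

```python
import math
import torch
import torch.nn.functional as F

def generator_loss_gan_si(D, x_fake, log_px_real_given_z, lam):
    """Sketch of the combined generator loss in (12): GAN term + lam * SIG term.

    D:                    discriminator returning one logit per sample.
    x_fake:               samples drawn through the SIG generator, x ~ p(x|z), z = T_phi(eps).
    log_px_real_given_z:  (N, K) tensor with log p(x_i | z_k) evaluated at real data x_i
                          and generated z_k (its exact form depends on the chosen p(x|z)).
    lam:                  the balancing hyperparameter lambda.
    """
    # GAN generator term (non-saturating form, used here as an illustrative choice).
    gan_term = F.binary_cross_entropy_with_logits(
        D(x_fake), torch.ones(x_fake.size(0), 1))
    # SIG term: -1/N sum_i log( 1/K sum_k p(x_i | z_k) ), via a stable log-mean-exp.
    K = log_px_real_given_z.size(1)
    sig_term = -(torch.logsumexp(log_px_real_given_z, dim=1) - math.log(K)).mean()
    return gan_term + lam * sig_term
```

In practice $\lambda$ would be tuned, as described above, so that the two terms are on the same scale.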

Since SIG can be considered as training a GAN with a theoretically optimal discriminator, by adjusting $\lambda$ we are able to interpolate between the standard GAN training and the true generator loss, thereby balancing the discrimination-generalization trade-off in the GAN dynamics zhang2017discrimination. This idea is related to Unrolled GAN metz2016unrolled, in which the discriminator parameters are temporarily updated $k$ times before updating the generator, and the looked-ahead discriminator parameters are used to train the current generator. By adjusting the number of unrolling steps $k$, Unrolled GAN can also interpolate between the standard GAN and an optimal-discriminator GAN. However, in Unrolled GAN the discriminator in the limit of many unrolling steps is not the theoretically optimal discriminator but a fully optimized one that is still influenced by the network design and data complexity. The effectiveness of Unrolled GAN in improving stability and mode coverage is explained by the intuition that training the generator with the look-ahead technique takes into account the discriminator's future reaction, thus helping spread the probability mass; no theoretical analysis has been provided yet. Moreover, the interpolation is non-linear and a few orders of magnitude slower, as shown by srivastava2017veegan, which makes picking $k$ difficult. Training a GAN with a semi-implicit regularizer benefits from both a theoretical explanation and low extra computation, and shows improved performance in reducing mode collapsing and increasing the stability of optimization in multiple experiments.

4 Related work

Using a two-stage model is related to empirical Bayes (EB) robbins1956empirical; casella1985introduction. A Bayesian hierarchical model can be represented as $p(x, z, \theta) = p(x\,|\,z)\,p(z\,|\,\theta)\,p(\theta)$, where $p(\theta)$ is a hyper-prior distribution. In EB, the hyper-prior $p(\theta)$ is dropped and the data are used to provide information about $\theta$ such that the marginal likelihood $p(x\,|\,\theta) = \int p(x\,|\,z)\,p(z\,|\,\theta)\,dz$ is maximized. Previous learning algorithms for EB are often based on simple methods such as expectation-maximization and moment matching. SIG can be considered as a parametric EB model where the prior parameters are represented by the neural network parameters $\phi$ and the training objective is to find the maximum marginal likelihood estimate (MMLE) of $\phi$ carlin1997bayes.

Without an explicit probability density, the evaluation of GANs has been considered challenging. There have been several recent attempts to introduce maximum likelihood into GAN training che2017maximum; grover2018flow. Flow-GAN grover2018flow constructs a generative model based on normalizing flow, which has proven to be an effective way to expand the distribution family in variational inference. Normalizing flow, however, requires the deterministic transformation to be invertible, a constraint that is often too strong to allow it to generate satisfactory random samples on its own. Therefore, it is mainly used in combination with the GAN objective to help improve sample quality.

There has been significant recent interest in improving the vanilla GAN objective. For example, the measure between the data and model distributions can be changed to the KL divergence goodfellow2014distinguishability or the Wasserstein distance arjovsky2017wasserstein; variational divergence estimation and density ratio estimation approaches have been used to extend the measure to the family of $f$-divergences nowozin2016f; poole2016improved; a mutual information term has been introduced into the generator loss to enable learning disentangled representations and visual concepts chen2016infogan; and, based on a heuristic intuition, two regularizers with an auxiliary encoder have been introduced to stabilize training and improve mode capturing, respectively che2016mode.

A variety of GAN research focuses on solving the mode collapse problem via new methodology and/or theoretical analysis. Encoder-decoder GAN architectures, such as MDGAN che2016mode, VEEGAN srivastava2017veegan, BiGAN donahue2016adversarial, and ALI dumoulin2016adversarially, use an encoding network to learn a reversed mapping from the data to the noise. The intuition is that training an encoder can force the system to learn a meaningful mapping that transforms embedded codes into data points from different modes. Unrolled GAN metz2016unrolled, as discussed in the previous section, interpolates between the vanilla GAN discriminator and an optimal discriminator that resists mode collapsing. AdaGAN tolstikhin2017adagan takes a boosting-like approach that is trained on weighted samples, with more weight assigned to missing modes. From a theoretical perspective, it has been shown that if the discriminator size is bounded, then even when the generator loss is close to optimal, the output distribution can be supported on only a limited number of images arora2017generalization. A simplified GMM-GAN has been used to show theoretically that the optimal-discriminator dynamics can converge to the ground truth in total variation distance, while a first-order approximation of the discriminator leads to unstable GAN dynamics and mode collapsing li2017towards. A negative conclusion has also been drawn that the encoder-decoder training objective may fail to learn meaningful latent codes and to avoid mode collapsing arora2017theoretical. These theoretical analyses support our practice of combining the GAN and SIG objectives.

5 Experiments

In this section, we first demonstrate the stability and mode-coverage properties of SIG on synthetic datasets. The toy examples show that SIG can capture skewness and multi-modality, and can generate both continuous and discrete random samples that are indistinguishable from the true data. By interpolating between MLE and the adversarial training scheme, we show that GAN-SI can balance sample quality and diversity on real datasets. The evaluation of generative models, however, is not straightforward and no single metric is conclusive on its own. Therefore, we exploit multiple metrics that cross-validate each other and emphasize quality and diversity separately. We notice that GAN training is sensitive to the network structure, hyper-parameters, random initialization, and mini-batch feeding. To make a fair comparison, we share the same network structure between different generative models in each specific experimental setting and perform multiple random trials. The results support the theory that SIG can stably cover multiple modes and that training GAN-SI adversarially greatly mitigates mode collapsing in GANs.

5.1 Toy examples

We first show the expressiveness of SIG with both discrete and continuous true data. For the discrete data, SIG uses a discrete conditional $p(x\,|\,z)$ whose implicit mixing distribution $q_\phi(z)$ is generated by mapping ten-dimensional random noise through a two-hidden-layer multi-layer perceptron (MLP); the top-left and bottom-left panels of Figure 1 correspond to the two discrete cases. For the continuous data, SIG uses a continuous conditional $p(x\,|\,z)$, with the same implicit $q_\phi(z)$ construction as in the discrete cases. As Figure 1 shows, the implicit distribution is able to recover the underlying mixing distribution, so that samples following the marginal distribution can well approximate the true data. The vanilla GAN, as a comparison, can only generate samples whose similarity to the true data is restricted by the discriminator and cannot recover the original data well.

Figure 1:

Generated samples from SIG with true data coming from: (a)-(b) a Negative-Binomial distribution, for which the implicit distribution learns the true mixing distribution; (f)-(g) a mixture of Poisson and Negative-Binomial distributions; (c)-(e) a ring plus Gaussian noise; (h)-(j) a Gaussian mixture arranged in a ring.

5.2 Mixture of Gaussians

We compare different generative models on a Gaussian mixture model. For a fair comparison, all models share the same generative network: a two-hidden-layer MLP with 100 units per layer and rectified linear unit (ReLU) activations. The discriminator for GAN has a fully connected layer with 100 units, and the encoder for VAE and VeeGAN is a two-hidden-layer MLP with 100 units per layer.

Detecting mode collapsing on a large dataset is challenging, but it can be accurately measured on synthetic data. To quantitatively evaluate sample quality, we sample 50,000 points from the trained generator and count a point as a high-quality sample if it is within three standard deviations of one of the mixture component centers. A center that is associated with more than 1,000 high-quality samples is counted as a captured mode. The proportion of high-quality samples at each mode, together with the proportion of low-quality samples, forms a 26-dimensional discrete distribution. We calculate the KL divergence between this empirical distribution and the corresponding ground-truth mixture proportions. All results are reported as the average and standard error over five independent random trials.
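A sketch of this bookkeeping is given below (illustrative only; the thresholds follow the description above, while the function and argument names are ours, and the exact direction of the KL computation is a choice of this example):

```python
import numpy as np

def mode_coverage_stats(samples, centers, std, hq_factor=3.0, mode_thresh=1000):
    """Bookkeeping for the mode-coverage metrics described above (a sketch; the
    ground-truth component centers and standard deviation are inputs, not assumed).

    samples: (S, D) generated points;  centers: (M, D) mixture component means.
    Returns the number of captured modes, the high-quality proportion, and the
    (M+1)-dimensional empirical distribution used for the KL metric."""
    d = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)  # (S, M)
    nearest = d.argmin(axis=1)
    high_quality = d.min(axis=1) < hq_factor * std        # within 3 std of some center
    counts = np.bincount(nearest[high_quality], minlength=len(centers))
    captured_modes = int((counts > mode_thresh).sum())    # >1000 HQ samples per mode
    hq_proportion = float(high_quality.mean())
    p_hat = np.append(counts / len(samples), 1.0 - hq_proportion)
    return captured_modes, hq_proportion, p_hat
```

The returned empirical distribution can then be compared against the true mixture proportions (augmented with a zero low-quality mass) to obtain the KL metric reported below.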

Figure 2: Comparison of generated samples for the Gaussian mixture model by the vanilla GAN, GANs modified to reduce mode collapsing (Unrolled GAN, VeeGAN), VAE, and SIG.

                                      GAN          VAE         VeeGAN       Unrolled GAN   SIG
Modes                                 4.0±3.08     25±0        23.2±0.84    6.2±8.6        25±0
Proportion of high quality samples    0.36±0.16    0.83±0.02   0.82±0.03    0.42±0.13      0.91±0.04
KL                                    2.87±0.78    0.32±0.07   0.38±0.08    1.97±0.60      0.14±0.07

Table 1: Comparison of mode-capturing ability on the mixture of Gaussians. 'Modes' is the number of captured modes out of 25. 'KL' is the KL divergence between the empirical sample distribution and the ground-truth mixture proportions. For 'Modes' and 'Proportion of high quality samples', higher is better; for 'KL', lower is better.

As shown in Table 1, SIG captures all the modes and generates the highest proportion of high-quality samples, whose distribution is closest to the ground truth. It also achieves the shortest running time and the highest stability, using only a single neural network.

We notice, however, that SIG's generalization ability may not scale well with increasing data complexity, as shown in Figure 3. To generate natural images, we train SIG adversarially and observe that the proposed GAN-SI can stabilize GAN training and mitigate the mode collapsing problem.

Figure 3: (a) SIG generates low-complexity data well. The input data come from an unevenly distributed GMM, where the components in the 1st, 3rd, and 5th columns have twice as much data as those in the 2nd and 4th; SIG generates samples well aligned with the true distribution. (b) SIG-generated MNIST digits. (c) SIG does not scale well when the data are as complex as CelebA.

5.3 Stacked MNIST

To measure the performance of combining the MLE and adversarial training schemes on discrete multi-modal data, we stack three randomly chosen MNIST images on the RGB color channels to form a 28×28×3 image (MNIST-3) srivastava2017veegan; metz2016unrolled; che2016mode; tolstikhin2017adagan. MNIST-3 contains 1,000 modes corresponding to the 3-digit numbers between 000 and 999. Similar to metz2016unrolled and tolstikhin2017adagan, we find that the missing-modes problem of GAN on MNIST-3 is sensitive to the network architecture and to the randomness of the training process due to its instability. Therefore, we choose three different network sizes (denoted S, M, and L), run each experiment five times, and use exactly the same generator and discriminator for DCGAN and DCGAN-SI.
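For reference, MNIST-3 can be constructed along the following lines (a minimal sketch; the digit-to-channel convention and the function name are ours, not from the paper):

```python
import numpy as np

def make_mnist3(images, labels, n, seed=0):
    """Form MNIST-3 samples by stacking three randomly chosen MNIST digits on the
    RGB channels, giving 28x28x3 images whose mode is the resulting 3-digit number.

    images: (M, 28, 28) array of MNIST digits;  labels: (M,) digit labels.
    The channel-to-digit convention below is an arbitrary illustrative choice."""
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(images), size=(n, 3))                      # 3 digits per sample
    stacked = np.stack([images[idx[:, c]] for c in range(3)], axis=-1)   # (n, 28, 28, 3)
    modes = labels[idx[:, 0]] * 100 + labels[idx[:, 1]] * 10 + labels[idx[:, 2]]
    return stacked, modes
```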

The inception score (IS) salimans2016improved is a widely used criterion for GAN evaluation. It is applied to labeled data using a pre-trained classifier. A low entropy of the conditional distribution $p(y\,|\,x)$ and a high entropy of the marginal distribution $p(y)$ are considered to represent high image quality and diversity, respectively:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\big[\mathrm{KL}\big(p(y\,|\,x)\,\|\,p(y)\big)\big]\Big). \qquad (13)$$

As the IS by itself cannot fully characterize generative model performance barratt2018note; borji2018pros, we provide additional metrics for evaluation: 'High quality image' is the proportion of images that the trained classifier classifies with a probability larger than 0.7; $H(y\,|\,x)$ and $H(y)$ are the conditional and marginal entropies of the classifier outputs; 'Mode' is the number of digit triples that have at least one sample; and 'KL' is the KL divergence between the empirical distribution of generated digit triples and that of the true data.
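These quantities can be computed directly from the classifier's predicted probabilities; the sketch below is illustrative (the classifier itself, and the choice of natural logarithms rather than bits, are assumptions of this example):

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """IS = exp( E_x KL( p(y|x) || p(y) ) ) computed from an (N, C) array of
    classifier probabilities p(y|x); the pre-trained classifier is assumed given."""
    p_y = probs.mean(axis=0, keepdims=True)                       # marginal p(y)
    kl = (probs * (np.log(probs + eps) - np.log(p_y + eps))).sum(axis=1)
    return float(np.exp(kl.mean()))

def diversity_quality_metrics(probs, hq_thresh=0.7):
    """Companion metrics from the text: high-quality proportion, conditional entropy
    H(y|x), and marginal entropy H(y)."""
    eps = 1e-12
    high_quality = float((probs.max(axis=1) > hq_thresh).mean())
    h_y_given_x = float(-(probs * np.log(probs + eps)).sum(axis=1).mean())
    p_y = probs.mean(axis=0)
    h_y = float(-(p_y * np.log(p_y + eps)).sum())
    return high_quality, h_y_given_x, h_y
```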

              IS          High quality   H(y|x)      H(y)        Mode           KL
DCGAN(S)      2.9±0.52    0.63±0.14      1.96±0.32   5.1±1.19    21.0±8.12      4.99±0.24
DCGAN-SI(S)   4.33±0.59   0.6±0.07       2.05±0.2    8.78±0.41   279.2±296.52   2.63±1.0
DCGAN(M)      5.59±0.36   0.7±0.03       1.71±0.09   9.51±0.31   811.8±116.24   0.75±0.35
DCGAN-SI(M)   5.93±0.47   0.72±0.04      1.65±0.11   9.75±0.11   969.0±29.19    0.3±0.13
DCGAN(L)      4.71±1.12   0.67±0.08      1.78±0.17   8.25±1.32   389.8±477.24   2.95±2.33
DCGAN-SI(L)   6.05±0.68   0.73±0.06      1.62±0.17   9.75±0.12   957.0±32.74    0.36±0.12

Table 2: 'High quality image' and $H(y\,|\,x)$ reflect sample quality, while $H(y)$, 'Mode', and 'KL' reflect sample diversity. For Inception score, High quality image, $H(y)$, and Mode, higher is better; for $H(y\,|\,x)$ and KL, lower is better.

5.4 Sample quality and diversity on CIFAR-10

We test the semi-implicit regularizer on CIFAR-10, a widely studied dataset consisting of 50,000 training images of 32×32 pixels from ten categories. The image diversity is high both between and within categories. We combine the semi-implicit regularizer with two popular GAN frameworks, DCGAN radford2015unsupervised and WGAN-GP gulrajani2017improved, to balance the quality and diversity of generated samples.

                   Real data    DCGAN       DCGAN-SI    WGAN-GP     WGAN-GP-SI
Inception score    11.24±.12    6.16±.14    6.85±.06    6.43±.07    6.67±.11

Table 3: Inception scores for models on CIFAR-10 (all generative models are trained unsupervised with standard CNN architectures).

We train each model for 100K iterations with mini-batch size 64. The optimizer is Adam with learning rate 0.0002. The inception model we use is the Inception network szegedy2016rethinking pre-trained on ImageNet. As shown in Appendix Figure 6, the images generated by DCGAN include duplicated images, indicating the existence of mode collapsing, which does not seem to happen with the regularized DCGAN-SI; this is reflected in the improvement of the inception score shown in Table 3.

6 Conclusions

We propose the semi-implicit generator (SIG) as a flexible and stable generative model. Trained under the well-understood maximum likelihood framework, SIG can be used either as a black-box generative model or as an unbiased regularizer in adversarial learning. We analyze its inherent mode-capturing mechanism and show its advantage over several state-of-the-art generative methods in reducing mode collapse. Combined with a GAN, the semi-implicit regularizer provides an interplay between adversarial learning and maximum likelihood inference, leading to a better balance between sample quality and diversity.

References

Appendix A Proofs

Proof of Lemma 1.

For the monotonicity, consider the set of all size-$K$ subsets of $K+1$ independent samples from $q_\phi(z)$ and a discrete uniform random variable over these subsets; averaging the $K$-sample estimator over the subsets and applying Jensen's inequality gives $\mathcal{L}_K(\phi) \ge \mathcal{L}_{K+1}(\phi)$.

By the law of large numbers, $\frac{1}{K}\sum_{k=1}^K p(x\,|\,z_k)$ converges to $p_\phi(x)$ almost surely as $K\to\infty$, so $\lim_{K\to\infty}\mathcal{L}_K(\phi) = -\mathbb{E}_{p_{\text{data}}(x)}[\log p_\phi(x)]$. When $K = 1$, for any $z$ we have $-\mathbb{E}_{p_{\text{data}}(x)}[\log p(x\,|\,z)] \ge -\mathbb{E}_{p_{\text{data}}(x)}[\log p(x\,|\,z^*)]$ by the definition of $z^*$.

Multiplying both sides by $q_\phi(z)$ and integrating over $z$, we have $\mathcal{L}_1(\phi) \ge -\mathbb{E}_{p_{\text{data}}(x)}[\log p(x\,|\,z^*)]$.

The minimum is reached when the implicit distribution degenerates to the point probability mass at the $z$ that maximizes the average log-likelihood over the data. ∎

Proof of Theorem 1.

Suppose $p_{\text{data}}(x)$ is defined on a discrete multi-modal space $\mathcal{X} = \bigcup_k \mathcal{X}_k$. For the true data and the generated data centers, let $c_i$ and $c'_j$ denote their mode labels, respectively. Then the expected objective can be written as

(14)

Notice that, by the definition of the discrete multi-modal space and of the RBF in Theorem 1, the within-mode and cross-mode terms take the two values specified there. We can then treat the objective (14) as a constrained optimization problem with a Lagrange multiplier.

Taking the gradient with respect to $m_k$ and setting it to zero gives the stationarity condition.

Together with the constraint $\sum_k m_k = M$, we have

(15)

The Hessian shows convexity, so (15) is the global minimum. Requiring the right-hand side of (15) to be greater than zero, we obtain the condition for mode $k$ not to vanish. ∎

Proof of Corollary 1.

Assume the normal components as in Corollary 1 and consider the expected RBF value between a generated center and a data point. After a change of variables, the scaled squared distance between them

follows a noncentral chi-squared distribution,

where the degrees of freedom equal the dimension of $x$ and the noncentrality parameter is determined by the distance between the corresponding mode centers. By the moment generating function (MGF) of the noncentral chi-squared distribution, we have

(16)

For pairs within the same mode the noncentrality parameter is small, while for pairs in different modes it grows with the squared distance between the mode centers. Plugging these two cases into (16) shows that the cross-mode expectation, and hence the bias, is exponentially small. ∎

Appendix B Algorithm for GAN-SI

while not converged do
      Sample a mini-batch $x_i \sim p_{\text{data}}(x)$ for $i = 1, \dots, N$;
      Sample noise $\epsilon_k \sim q(\epsilon)$ for $k = 1, \dots, K$;
      Set $z_k = T_\phi(\epsilon_k)$ for all $k$;
      Update the discriminator parameters $\psi$ with a gradient step on the discriminator loss (11);
      Update the generator parameters $\phi$ with a gradient step on the combined generator loss (12);
end while
Adam [37] is used as the first-order optimizer in our experiments.
Algorithm 1: Mini-batch training of GAN-SI
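The loop above can be sketched in a deep learning framework as follows (a hypothetical PyTorch-flavored skeleton under the same assumptions as the loss sketch in Section 3; none of the helper names are from the paper):

```python
import math
import torch

def train_gan_si(data_iter, T_phi, sample_px_given_z, log_px_given_z, D,
                 opt_G, opt_D, lam, K, noise_dim, num_iters):
    """Skeleton of Algorithm 1 (illustrative; every name here is a placeholder
    for the corresponding component described in Section 3)."""
    for it in range(num_iters):
        x_real = next(data_iter)                          # mini-batch of real data
        eps = torch.randn(K, noise_dim)                   # eps_k ~ q(eps)
        z = T_phi(eps)                                    # z_k = T_phi(eps_k)
        x_fake = sample_px_given_z(z)                     # x ~ p(x | z), reparameterized

        # Discriminator step on loss (11).
        d_loss = -(torch.log(torch.sigmoid(D(x_real)) + 1e-8).mean()
                   + torch.log(1 - torch.sigmoid(D(x_fake.detach())) + 1e-8).mean())
        opt_D.zero_grad(); d_loss.backward(); opt_D.step()

        # Generator step on the combined loss (12): GAN term + lam * SIG term.
        gan_term = -torch.log(torch.sigmoid(D(x_fake)) + 1e-8).mean()
        sig_term = -(torch.logsumexp(log_px_given_z(x_real, z), dim=1)
                     - math.log(K)).mean()
        opt_G.zero_grad(); (gan_term + lam * sig_term).backward(); opt_G.step()
```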

Appendix C Network architecture and samples for MNIST-3

The generator network is defined with a parameter that adjusts the network size:

Layer                   Number of outputs   Kernel size   Stride   Padding
Input                   -                   -             -        -
Fully connected         4*4*64              -             -        -
Transpose convolution   64*                 4             1        VALID
Transpose convolution   32*                 4             2        SAME
Transpose convolution   8*                  4             1        SAME
Convolution             3                   4             2        SAME

The discriminator network is defined with a parameter that adjusts the network size:

Layer                       Number of outputs   Kernel size   Stride   Padding
Input (28*28*3 image batch) -                   -             -        -
Transpose convolution       8*                  4             2        VALID
Transpose convolution       16*                 4             2        SAME
Transpose convolution       32*                 4             1        SAME
Flatten + Fully connected   1                   -             -        -

For the network sizes denoted (S), (M), and (L), the corresponding pair of generator and discriminator size parameters is chosen accordingly.

Appendix D Additional figures

Figure 4: MNIST-3, the highest-inception-score cases among 10 independent trials.
Figure 5: MNIST-3, the lowest-inception-score cases among 10 independent trials.
(a) DCGAN
(b) DCGAN-SI
Figure 6: Randomly generated images by DCGAN and by DCGAN with the semi-implicit regularizer.