1 Introduction
Generative models are a fundamental class of machine learning algorithms used to estimate the underlying probability distributions over data manifolds. Propelled by recent developments in deep neural networks, deep generative models have achieved great success in data simulation, density estimation, missing data imputation, and reinforcement learning, and are widely utilized for tasks such as image super-resolution, compression, and image-to-text translation. The goal of generative models is to minimize the distance between the generative distribution $p_\theta(x)$ and the data distribution $p(x)$ under a certain metric or divergence $D$,
$\min_\theta D\big(p(x),\, p_\theta(x)\big) \quad (1)$
where $p(x)$ is usually approximated with the empirical data distribution $\hat{p}(x)$ based on observations $\{x_i\}_{i=1}^{N}$.
Depending on the type of $p_\theta(x)$, an existing generative model can often be classified as either an explicit generative model or an implicit one. The former requires an explicit probability density function (PDF) for $p_\theta(x)$
such that we can both sample data from it and evaluate its likelihood. Examples of explicit generative models include variational autoencoders kingma2013auto , PixelRNN oord2016pixel , Real NVP dinh2016density , and many Bayesian hierarchical models such as the sigmoid belief net neal1992connectionist . An explicit generative model has a tractable density that can often be directly optimized through (1). The optimization target is a distance measure with nice geometric properties, which often leads to stable training and theoretically guaranteed convergence. However, the requirement of a tractable density usually restricts the flexibility of an explicit model, making it hard to scale with increasing data complexity.

An implicit generative model, on the other hand, generates its random samples via a stochastic procedure but may not admit a pointwise evaluable PDF, which often makes a direct optimization of (1) infeasible. Generative adversarial networks (GANs) goodfellow2016nips tackle this problem by introducing an augmented discriminator and solving a minimax game: a generative network produces random samples by propagating random noise through a deep neural network, whereas a discriminator aims to distinguish the generated samples from true data. Under the condition of an optimal discriminator, training a vanilla GAN's generator is equivalent to optimizing (1) where $D$
is set as the Jensen-Shannon divergence. Unfortunately, in practice the overall loss function of a GAN is usually non-convex, and practitioners have encountered a variety of obstacles such as vanishing gradients, mode collapse, and high sensitivity to the network architecture
arjovsky2017wasserstein ; goodfellow2016nips ; salimans2016improved ; radford2015unsupervised .

To incorporate a highly expressive generative model while maintaining a well-behaved optimization objective, we introduce the semi-implicit generator (SIG), a Bayesian hierarchical generative model that mixes a specified explicit conditional distribution $p(x\,|\,z)$ with an implicit mixing distribution $q_\phi(z)$, where the implicit distribution is constructed by deterministically transforming random noise $\epsilon$ into $z$ via a parameterized deterministic transform $z = T_\phi(\epsilon)$. Intuitively, $p(x\,|\,z)$ can incorporate our prior knowledge about the observed data, such as the data support, while $q_\phi(z)$ maintains high expressiveness. With the hierarchical structure, SIG can be directly trained by choosing $D$ as the Kullback-Leibler (KL) divergence and estimating (1) with Monte Carlo estimation. We show that the SIG optimization objective has an intrinsic resistance to the mode-collapse problem. By leveraging adversarial training, we apply SIG as a semi-implicit regularizer for generative adversarial networks, which helps stabilize optimization, significantly mitigates mode collapse, and generates high-quality samples on natural images.
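To make the two-stage construction concrete, here is a minimal runnable sketch. All names are illustrative assumptions rather than the paper's exact design: `T_phi` stands in for the noise-to-mixing-variable network, and an isotropic Gaussian is chosen for the explicit conditional only for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical weights of a one-hidden-layer network standing in for T_phi;
# in practice T_phi is a trained deep neural network.
W1, b1 = rng.normal(size=(10, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 2)), np.zeros(2)

def T_phi(eps):
    """Deterministic transform mapping noise eps to a mixing variable z,
    so that z = T_phi(eps) follows the implicit distribution q_phi(z)."""
    h = np.maximum(eps @ W1 + b1, 0.0)  # ReLU hidden layer
    return h @ W2 + b2

def sample_sig(n, sigma=0.1):
    """Two-stage SIG sampling: z ~ q_phi(z) implicitly, then x ~ p(x|z),
    here chosen (as an assumption) to be an isotropic Gaussian N(z, sigma^2 I)."""
    eps = rng.normal(size=(n, 10))            # fixed noise distribution p(eps)
    z = T_phi(eps)                            # implicit mixing samples
    x = z + sigma * rng.normal(size=z.shape)  # explicit conditional sample
    return x

samples = sample_sig(1000)
```

Note that `q_phi(z)` is never evaluated, only sampled, which is exactly why the marginal of `x` is implicit.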
2 Semi-implicit generator
Defining a family of parametric distributions $p_\theta(x)$, a classic explicit generative model is trained by maximizing the log-likelihood as
$\theta^{*} = \operatorname{argmax}_{\theta}\, \tfrac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i), \quad (2)$
which is identical to minimizing the cross-entropy $-\mathbb{E}_{x\sim \hat{p}(x)}[\log p_\theta(x)]$. Since the entropy of the data distribution is independent of the optimization parameter $\theta$, minimizing this cross-entropy is equivalent to (1) where $D$ is set as the KL divergence.
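This equivalence follows from the standard decomposition of the cross-entropy into the KL divergence plus the data entropy:

```latex
-\mathbb{E}_{x\sim p(x)}\big[\log p_\theta(x)\big]
  = \mathrm{KL}\big(p(x)\,\|\,p_\theta(x)\big)
  + \mathbb{H}\big(p(x)\big),
\qquad
\mathbb{H}\big(p(x)\big) = -\mathbb{E}_{x\sim p(x)}\big[\log p(x)\big].
```

Since $\mathbb{H}(p(x))$ does not depend on $\theta$, minimizing the cross-entropy over $\theta$ minimizes the KL divergence.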
Instead of treating the generating parameter as a single global optimization variable, we consider it as a local random variable $z$ generated from a mixing distribution $q_\phi(z)$ with parameter $\phi$. The semi-implicit generator (SIG) is defined in a two-stage manner:

$z \sim q_\phi(z), \qquad x \sim p(x\,|\,z). \quad (3)$
Marginalizing out $z$, we can view the generator as $p_\phi(x) = \mathbb{E}_{z\sim q_\phi(z)}\big[p(x\,|\,z)\big]$. Here $p(x\,|\,z)$ is required to be explicit, but $q_\phi(z)$ can be defined by sampling a random variable $\epsilon$ from a fixed distribution $p(\epsilon)$ and setting $z = T_\phi(\epsilon)$, where $T_\phi$ is a deterministic mapping represented by a neural network with parameter $\phi$. Therefore, $q_\phi(z)$ typically cannot be evaluated pointwise and the marginal $p_\phi(x)$ is implicit. Note that in this setting $z$ is required to be continuous, while $x$ can be sampled from a discrete distribution with continuous parameters.
Minimizing the cross-entropy $-\mathbb{E}_{x\sim p(x)}[\log p_\phi(x)]$ is equivalent to minimizing the KL divergence with respect to the model parameter $\phi$ as in (1):
$\mathrm{KL}\big(p(x)\,\|\,p_\phi(x)\big) = -\mathbb{E}_{x\sim p(x)}\big[\log p_\phi(x)\big] + \mathbb{E}_{x\sim p(x)}\big[\log p(x)\big] \quad (4)$

$\min_\phi \mathrm{KL}\big(p(x)\,\|\,p_\phi(x)\big) = \min_\phi\, -\mathbb{E}_{x\sim p(x)} \log \mathbb{E}_{z\sim q_\phi(z)}\big[p(x\,|\,z)\big] \quad (5)$
We show below that SIG can be trained by minimizing an upper bound of the cross-entropy in (5).
Lemma 1.
Let us construct an estimator of the cross-entropy as

$\mathcal{L}_K(\phi) = -\mathbb{E}_{x\sim p(x)}\,\mathbb{E}_{z_1,\dots,z_K \sim q_\phi(z)} \log \frac{1}{K}\sum_{k=1}^{K} p(x\,|\,z_k); \quad (6)$

then $\mathcal{L}_K \ge \mathcal{L}_{K+1}$ for all $K$, and $\lim_{K\to\infty}\mathcal{L}_K = -\mathbb{E}_{x\sim p(x)}\log p_\phi(x)$. When $K = 1$, let $z^{*} = \operatorname{argmax}_{z}\,\mathbb{E}_{x\sim p(x)}\log p(x\,|\,z)$; then $\mathcal{L}_1 \ge -\mathbb{E}_{x\sim p(x)}\log p(x\,|\,z^{*})$, where the equality is true if and only if $q_\phi(z)$ degenerates to the point mass at $z^{*}$.
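A Monte Carlo estimate of the bound in (6) can be sketched as follows, assuming, purely for concreteness, an isotropic Gaussian conditional $p(x\,|\,z) = \mathcal{N}(x; z, \sigma^2 I)$; the helper name `sig_upper_bound` is ours.

```python
import numpy as np

def sig_upper_bound(x, z, sigma=1.0):
    """Monte Carlo estimate of the cross-entropy upper bound L_K in (6),
    assuming an isotropic Gaussian likelihood p(x|z) = N(x; z, sigma^2 I).
    x: (N, d) data samples; z: (K, d) mixing samples drawn from q_phi."""
    N, d = x.shape
    # log p(x_i | z_k) for all pairs, shape (N, K)
    sq = ((x[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    log_p = -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # log (1/K) sum_k p(x_i|z_k), computed stably via log-sum-exp
    m = log_p.max(axis=1, keepdims=True)
    log_mix = m.squeeze(1) + np.log(np.exp(log_p - m).mean(axis=1))
    return -log_mix.mean()
```

Consistent with Lemma 1, feeding $K$ identical copies of one mixing sample reduces the estimate to the $K = 1$ value.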
In practice, $\mathcal{L}_K$ is approximated with Monte Carlo samples as $\hat{\mathcal{L}}_K = -\frac{1}{N}\sum_{i=1}^{N}\log\frac{1}{K}\sum_{k=1}^{K} p(x_i\,|\,z_k)$, where $\{x_i\}_{i=1}^{N}$ and $\{z_k\}_{k=1}^{K}$ are two sets of Monte Carlo samples generated from the data distribution and the implicit $q_\phi(z)$, respectively. When $K = 1$, the local $z$'s degenerate to the same point and the objective degenerates to (2). To analyze the performance of SIG, we first consider multimodal data, on which popular deep generative models such as GANs often fail due to mode collapse. For the theoretical analysis, we first define a discrete multimodal space as follows.
Definition 1.
(Discrete multimodal space) Suppose $(\mathcal{X}, d)$ is a metric space with metric $d$ and $\mathcal{X} = \bigcup_{k=1}^{M}\mathcal{X}_k$, where $\mathcal{X}_i \cap \mathcal{X}_j = \emptyset$ for $i \neq j$. Let the distance between two sets be $d(A, B) = \inf_{a\in A,\, b\in B} d(a, b)$ and let the diameter of a set be $\mathrm{diam}(A) = \sup_{a, a' \in A} d(a, a')$. Suppose there exists $\gamma > 1$ such that $d(\mathcal{X}_i, \mathcal{X}_j) \ge \gamma \max_k \mathrm{diam}(\mathcal{X}_k)$ for all $i \neq j$. Then $\mathcal{X}$ is a discrete multimodal space under measure $d$.
Strictly speaking, there could be sub-modes within each $\mathcal{X}_k$, but the above definition emphasizes the existence of multiple separated regions in the support. Since the loss of a deep neural network is non-convex, finding the global optimality condition is difficult shang1996global ; yun2017global . Thanks to the two-stage structure of SIG, assuming the implicit distribution is flexible enough, we can study a simplified optimal assignment problem: given $N$ data points sampled from the true data distribution, how should the generated mixing samples be assigned to the neighborhoods of the true data such that $\hat{\mathcal{L}}_K$ defined in (6) is minimized under expectation,
$\min_{m_1,\dots,m_M}\, \mathbb{E}\big[\hat{\mathcal{L}}_K\big] \quad \text{subject to} \quad \sum_{k=1}^{M} m_k = K, \quad (7)$
where the data are assumed to be generated from a discrete multimodal space $\mathcal{X} = \bigcup_{k=1}^{M}\mathcal{X}_k$, $n_k$ is the number of data points in $\mathcal{X}_k$, and $m_k$ is the number of $z$'s that are assigned to be in $\mathcal{X}_k$. Assuming the data distribution is the marginal distribution of a normal-implicit mixture as $p(x) = \mathbb{E}_{z\sim q(z)}\,\mathcal{N}(x; z, \sigma^2 I)$ and the modes are equally spaced, we have the following theorem.
Theorem 1.
(SIG for multimodal space) Suppose $p(x)$ is defined on a discrete multimodal space with the $\ell_2$ norm. Suppose there are $N$ data points $\{x_i\}_{i=1}^{N}$, among which $n_k$ points belong to $\mathcal{X}_k$. Suppose we need to sample $K$ mixing points $\{z_j\}_{j=1}^{K}$, and $m_k$ denotes the number of $z_j$'s in $\mathcal{X}_k$. Denoting $\varphi_\sigma(r) = e^{-r^2/(2\sigma^2)}$ as a radial basis function (RBF), with the cross-mode RBF terms collected into bias constants $\varepsilon_k$, the objective in (7) is convex and its optimum satisfies $\frac{m_k}{K} = \frac{n_k}{N} + \varepsilon_k$. In particular, $m_k > 0$ if $\varepsilon_k > -\frac{n_k}{N}$.

Corollary 1.

Assume the mode centers are equally spaced with pairwise distance $\Delta$ and the data within each mode are approximately $\mathcal{N}(\mu_k, \sigma^2 I)$ distributed. Then $|\varepsilon_k| = O\big(e^{-c\,\Delta^2/\sigma^2}\big)$ for some constant $c > 0$.
The ideal proportion for $m_k/K$ would be $n_k/N$, and $\varepsilon_k$ plays the role of a bias. In the normal-implicit mixture case, as shown in Corollary 1, if the data within each mode are approximately normally distributed, $\varepsilon_k$ can be exponentially small for well separated modes. This indicates that SIG has a strong built-in resistance to mode collapse.

There is an interesting connection between SIG and the variational autoencoder (VAE) kingma2013auto ; rezende2014stochastic . VAE maximizes the evidence lower bound (ELBO), which is the same as maximizing
$\mathbb{E}_{x\sim p(x)}\,\mathbb{E}_{z\sim q_\phi(z|x)}\big[\log p_\theta(x, z) - \log q_\phi(z\,|\,x)\big], \quad (8)$
for which the optimal solution is $q_\phi(z\,|\,x) = p_\theta(z\,|\,x)$. Therefore, VAE imposes the constraint that there exists a recognition network/encoder $q_\phi(z\,|\,x)$, which is inferred by minimizing the KL divergence from $p_\theta(x, z)$, the joint distribution of the model, to $p(x)\,q_\phi(z\,|\,x)$, the joint distribution specified by the data distribution and the encoder.

In SIG, we maximize
$\mathbb{E}_{x\sim p(x)} \log \mathbb{E}_{z\sim q_\phi(z)}\big[p(x\,|\,z)\big], \quad (9)$
where $p(x\,|\,z)$ can be any valid probability density/mass function. VAE tries to match the joint distribution of the data combined with its encoder to that of the model, whereas SIG only cares about matching the marginal model distribution to the data distribution. It is clear that SIG does not require a specific encoder structure and hence provides more flexibility.
In experiments, we find that SIG can generate high-quality data samples on relatively simple data manifolds such as MNIST, but we observe that the richness of its generated images may not scale well with high data complexity, such as the CelebA dataset with 200K RGB images. More specifically, with a fixed Monte Carlo size, we find a "mode averaging" effect on images generated for complex data. We suspect that $K$ needs to scale with data complexity so that $\hat{\mathcal{L}}_K$ stays close to the exact cross-entropy, and this is the price we pay for SIG to have stable training with a strong resistance against mode collapse. While SIG performs well on relatively simple data but suffers from "mode averaging" on complex natural images, the generative adversarial network (GAN) has shown the ability to generate high-quality samples from large-scale observed data, but suffers from "mode collapsing" even on a simple mixture of Gaussians. To benefit from both worlds, we apply SIG as a regularizer in adversarial learning, which produces realistic samples while strongly resisting both the mode collapse and the unstable optimization problems that are notorious in GAN training.
3 Generative adversarial network with semi-implicit regularizer
The generative adversarial network (GAN) goodfellow2014generative solves a minimax problem:
$\min_{G}\max_{D}\; \mathbb{E}_{x\sim p(x)}\big[\log D(x)\big] + \mathbb{E}_{\epsilon\sim p(\epsilon)}\big[\log\big(1 - D(G(\epsilon))\big)\big]. \quad (10)$
It is shown in goodfellow2016nips ; goodfellow2014distinguishability that if the generator loss is changed from the standard one in (10) to an appropriately reweighted alternative, then with an ideally optimal discriminator, the generator loss is identical to the SIG loss (4). This means SIG can be considered as training with the GAN objective while using the optimal discriminator in the update of the generator. The discriminator in a GAN can be considered as an augmented part of the model that avoids density evaluation and indirectly feeds the information of the real data into the optimization of the generator. With the help of the discriminator, the weak fitting of the generator to the real data yields highly expressive samples that go beyond memorizing inputs. However, extensive recent research in both practical experiments radford2015unsupervised ; metz2016unrolled and theoretical analysis li2017towards ; zhang2017discrimination ; arjovsky2017towards shows that lack of capacity, insufficient training of the discriminator, and mismatches between the generator and discriminator in network types and structures are root causes of a variety of obstacles in GAN training. It has also been observed in goodfellow2014generative and highlighted in metz2016unrolled ; arjovsky2017wasserstein that the optimal generator for a fixed discriminator is a sum of delta functions at the points where the discriminator assigns the highest value, which eventually collapses the generator to produce a small family of similar samples. In comparison, SIG is trained by maximizing likelihood without a discriminator, which can be considered a strong, direct fitting between real data and generated samples. This encourages us to combine the two models and apply SIG as a regularizer in a GAN model, which we refer to as GAN-SI.
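Before formalizing the losses in (11) and (12), a minimal numerical sketch of the combined generator objective may help. The sigmoid discriminator output, the isotropic Gaussian conditional for the SIG term, and the weight `lam` are illustrative assumptions; in practice this loss is differentiated through the networks.

```python
import numpy as np

def gansi_generator_loss(d_fake, x_real, z, sigma=0.1, lam=1.0):
    """Generator loss of GAN-SI: the standard GAN generator term plus
    lam times the Monte Carlo cross-entropy upper bound of SIG.
    d_fake: discriminator outputs D(G(eps)) in (0, 1) for generated samples;
    x_real: (N, d) real data; z: (K, d) mixing samples z = T_phi(eps)."""
    gan_term = np.mean(np.log(1.0 - d_fake + 1e-12))
    # log p(x_i|z_k) under the assumed N(z_k, sigma^2 I) conditional
    d = x_real.shape[1]
    sq = ((x_real[:, None, :] - z[None, :, :]) ** 2).sum(-1)
    log_p = -0.5 * sq / sigma**2 - 0.5 * d * np.log(2 * np.pi * sigma**2)
    # stable log-mean over the K mixing samples
    m = log_p.max(axis=1, keepdims=True)
    log_mix = m.squeeze(1) + np.log(np.exp(log_p - m).mean(axis=1))
    si_term = -log_mix.mean()
    return gan_term + lam * si_term
```

Setting `lam=0` recovers the plain GAN generator term, so the weight interpolates between the two training schemes.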
For GAN-SI, the discriminator loss is

$\mathcal{L}_D = -\mathbb{E}_{x\sim p(x)}\big[\log D_\omega(x)\big] - \mathbb{E}_{z\sim q_\phi(z)}\,\mathbb{E}_{x\sim p(x|z)}\big[\log\big(1 - D_\omega(x)\big)\big], \quad (11)$

and the generator loss is a linear combination of the original GAN loss and the SIG loss as

$\mathcal{L}_G = \mathbb{E}_{z\sim q_\phi(z)}\,\mathbb{E}_{x\sim p(x|z)}\big[\log\big(1 - D_\omega(x)\big)\big] + \lambda\, \hat{\mathcal{L}}_K, \quad (12)$
where $\omega$ denotes the discriminator network parameters and $T_\phi$ is the deterministic transform for the implicit distribution in SIG. We choose $p(x\,|\,z)$ as a Gaussian centered at $z$ for image generation, and set $\lambda$ as a hyperparameter to balance the strength of the GAN and SIG objectives. In practice, we set $\lambda$ such that the GAN's generator loss and the cross-entropy term in (12) are on the same scale. The neural networks are set according to DCGAN radford2015unsupervised .

Since SIG can be considered as training a GAN with a theoretically optimal discriminator, by adjusting $\lambda$ we are able to interpolate between standard GAN training and the true generator loss, thereby balancing the discrimination-generalization tradeoff in the GAN dynamics zhang2017discrimination . This idea is related to Unrolled GAN metz2016unrolled , in which the discriminator parameters are temporarily updated for a number of steps before updating the generator, and the looked-forward discriminator parameters are used to train the current generator. By adjusting the number of unrolling steps, Unrolled GAN can also interpolate between the standard GAN and the optimal-discriminator GAN. However, in Unrolled GAN the unrolled discriminator is not the theoretically optimal one but a fully optimized one that is still influenced by the network design and data complexity. The effectiveness of Unrolled GAN in improving stability and mode coverage is explained by the intuition that training the generator with the look-ahead technique can take into account the discriminator's future reaction, thus helping spread the probability mass; but no theoretical analysis has been provided yet. Moreover, the interpolation is nonlinear and a few orders of magnitude slower, as shown by srivastava2017veegan , which makes picking the number of unrolling steps not easy. Training a GAN with a semi-implicit regularizer benefits from both a theoretical explanation and low extra computation, and shows improved performance on reducing mode collapse and increasing optimization stability in multiple experiments.

4 Related work
Using a two-stage model is related to empirical Bayes (EB) robbins1956empirical ; casella1985introduction . A Bayesian hierarchical model can be represented as $x \sim p(x\,|\,z)$, $z \sim p(z\,|\,\eta)$, $\eta \sim p(\eta)$, where $p(\eta)$ is a hyperprior distribution. In EB, the hyperprior is dropped and the data are used to provide information about $\eta$ such that the marginal likelihood is maximized. Previous learning algorithms for EB are often based on simple methods such as Expectation-Maximization and moment matching. SIG can be considered as a parametric EB model in which the neural network parameters are represented by $\phi$ and the training objective is to find the maximum marginal likelihood estimate (MMLE) of $\phi$ carlin1997bayes .

Without an explicit probability density, the evaluation of GANs has been considered challenging. There have been several recent attempts to introduce maximum likelihood into GAN training che2017maximum ; grover2018flow . Flow-GAN grover2018flow constructs a generative model based on normalizing flows, which have been proven an effective way to expand the distribution family in variational inference. A normalizing flow, however, requires the deterministic transformation to be invertible, a constraint that is often too strong to allow it to generate satisfactory random samples on its own. Therefore, its main use is in combination with a GAN to help improve sample quality.
There has been significant recent interest in improving the vanilla GAN objective. For example, the measure between the data and model distributions can be changed to the KL divergence goodfellow2014distinguishability or the Wasserstein distance arjovsky2017wasserstein ; variational divergence estimation and density ratio estimation approaches have been used to extend the measure to a family of divergences nowozin2016f ; poole2016improved ; a mutual information term has been introduced into the generator loss to enable learning disentangled representations and visual concepts chen2016infogan ; and, based on a heuristic intuition, two regularizers with an auxiliary encoder have been introduced to stabilize training and improve mode-catching, respectively che2016mode .

A variety of GAN research focuses on solving the mode collapse problem via new methodology and/or theoretical analysis. Encoder-decoder GAN architectures, such as MDGAN che2016mode , VEEGAN srivastava2017veegan , BiGAN donahue2016adversarial , and ALI dumoulin2016adversarially , use an encoding network to learn a reversed mapping from the data to the noise. The intuition is that training an encoder can force the system to learn a meaningful mapping that transforms embedded codes into data points from different modes. Unrolled GAN metz2016unrolled , as discussed in the previous section, interpolates between the vanilla GAN discriminator and the optimal discriminator that resists mode collapse. AdaGAN tolstikhin2017adagan takes a boosting-like approach in which each model is trained on weighted samples, with more weight assigned to missing modes. From a theoretical perspective, it is shown that if the discriminator size is bounded, then even when the generator loss is close to optimal, the output distribution can be supported on only a limited number of images arora2017generalization . A simplified GMM-GAN is used to theoretically show that the optimal-discriminator dynamics can converge to the ground truth in total variation distance, while a first-order approximation of the discriminator leads to unstable GAN dynamics and mode collapse li2017towards . A negative conclusion is reached that the encoder-decoder training objective cannot learn meaningful latent codes and avoid mode collapse arora2017theoretical . These theoretical analyses support our practice of combining the GAN and SIG objectives.
5 Experiments
In this section, we first demonstrate the stability and mode-coverage properties of SIG on synthetic datasets. The toy examples show that SIG can capture skewness and multimodality and generate both continuous and discrete random samples that are indistinguishable from the true data. By interpolating between the MLE and adversarial training schemes, we show that GAN-SI can balance sample quality and diversity on real datasets. The evaluation of generative models, however, is not straightforward, and no single metric is conclusive on its own. Therefore, we exploit multiple metrics that cross-validate each other and emphasize quality and diversity separately. We notice that GAN training is sensitive to network structure, hyperparameters, random initialization, and minibatch feeding. To make a fair comparison, we share the same network structure between different generative models in each specific experimental setting and run multiple random trials. The results support the theorem that SIG can stably cover multiple modes, and that training GAN-SI adversarially greatly mitigates mode collapse in GANs.
5.1 Toy examples
We first show the expressiveness of SIG with both discrete and continuous true data. For the discrete data, SIG is set with a discrete conditional $p(x\,|\,z)$, where $q_\phi(z)$ is an implicit distribution generated by mapping from ten-dimensional random noise with a two-hidden-layer multilayer perceptron (MLP). The top-left and bottom-left panels of Figure 1 correspond to the two data settings, respectively. For the continuous data, SIG is set with a continuous conditional $p(x\,|\,z)$, where $q_\phi(z)$ is the same as that for the discrete cases. As Figure 1 shows, the implicit distribution is able to recover the underlying mixing distribution, so that samples following the marginal distribution can well approximate the true data. The vanilla GAN, as a comparison, can only generate samples whose similarity to the true data is restricted by the discriminator, and it cannot recover the original data well.

5.2 Mixture of Gaussians
We compare different generative models on a Gaussian mixture model. For a fair comparison, all the models share the same generative network: a two-hidden-layer MLP with hidden size 100 and rectified linear unit (ReLU) activations. The discriminator for the GAN has a fully connected layer of size 100, and the encoder for VAE and VEEGAN is a two-hidden-layer MLP with hidden size 100.
Detecting mode collapse on a large dataset is challenging, but it can be accurately measured on synthetic data. To quantitatively evaluate sample quality, we sample 50,000 points from the trained generator and count a point as a high-quality sample if it is within three standard deviations of any of the mixture component centers. A center that is associated with more than 1,000 high-quality samples is counted as a captured mode. The proportions of high-quality samples at each mode, together with the proportion of low-quality samples, form a 26-dimensional discrete distribution, for which we calculate the KL divergence from the ground truth. All results are reported as the average and standard error over five independent random trials.
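The evaluation protocol above can be sketched as follows, assuming the mixture centers and a shared standard deviation are known; the function names are ours, and the thresholds follow the text.

```python
import numpy as np

def mode_coverage(samples, centers, std, hq_thresh=3.0, mode_count=1000):
    """Count captured modes (centers with > mode_count high-quality samples,
    a high-quality sample being within hq_thresh * std of some center) and
    return the empirical distribution over modes plus a low-quality bin."""
    # distance from each sample to each center, shape (n_samples, n_centers)
    dist = np.linalg.norm(samples[:, None, :] - centers[None, :, :], axis=-1)
    nearest = dist.argmin(axis=1)
    is_hq = dist.min(axis=1) <= hq_thresh * std
    counts = np.bincount(nearest[is_hq], minlength=len(centers))
    captured = int((counts > mode_count).sum())
    # last bin collects the low-quality samples
    probs = np.append(counts, (~is_hq).sum()) / len(samples)
    return captured, probs

def kl_discrete(p, q, eps=1e-12):
    """KL divergence between two discrete distributions."""
    p = np.clip(p, eps, None)
    q = np.clip(q, eps, None)
    return float((p * np.log(p / q)).sum())
```

With 25 mixture components, `probs` is exactly the 26-dimensional discrete distribution described in the text.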
As shown in Table 1, SIG captures all the modes and generates the highest proportion of high-quality samples, whose distribution is closest to the ground truth. It also achieves the shortest running time and the highest stability while using a single neural network.
We notice, however, that SIG's generalization ability may not scale well with increasing data complexity, as shown in Figure 3. To generate natural images, we train SIG adversarially and observe that the proposed GAN-SI can stabilize GAN training and mitigate the mode collapse problem.
5.3 Stacked MNIST
To measure the performance of combining the MLE and adversarial training schemes on discrete multimodal data, we stack three randomly chosen MNIST images on the RGB color channels to form a $28\times28\times3$ image (MNIST3) srivastava2017veegan ; metz2016unrolled ; che2016mode ; tolstikhin2017adagan . MNIST3 contains 1,000 modes, corresponding to the 3-digit numbers between 0 and 999. Similar to metz2016unrolled and tolstikhin2017adagan , we find that the missing-modes problem of GANs on MNIST3 is sensitive to the network architecture and to the randomness of the training process due to instability. Therefore, we choose three different network sizes (denoted S, M, and L), run each experiment five times, and use exactly the same generator and discriminator for DCGAN and DCGAN-SI.
The inception score (IS) salimans2016improved is a widely used criterion for GAN evaluation. It is applied to labeled data using a pretrained classifier. Low entropy of the conditional distribution $p(y\,|\,x)$ and high entropy of the marginal distribution $p(y)$ are considered to represent high image quality and high diversity, respectively:
$\mathrm{IS} = \exp\Big(\mathbb{E}_{x}\big[\mathrm{KL}\big(p(y\,|\,x)\,\|\,p(y)\big)\big]\Big). \quad (13)$
As the IS by itself cannot fully characterize generative model performance barratt2018note ; borji2018pros , we provide more metrics for evaluation: High quality is the proportion of images that can be classified by the trained classifier with a probability larger than 0.7; Mode is the number of digit triples that have at least one sample; KL is the KL divergence between the empirical mode distribution and the uniform distribution over the 1,000 modes.
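For reference, (13) can be computed directly from classifier outputs. This is a generic sketch; `probs` stands for the matrix of conditional class probabilities $p(y\,|\,x)$ from any pretrained classifier.

```python
import numpy as np

def inception_score(probs, eps=1e-12):
    """Inception score from an (n_samples, n_classes) matrix of conditional
    class probabilities p(y|x): exp of the mean KL(p(y|x) || p(y))."""
    probs = np.clip(probs, eps, 1.0)
    p_y = probs.mean(axis=0)  # marginal class distribution p(y)
    kl = (probs * (np.log(probs) - np.log(p_y))).sum(axis=1)
    return float(np.exp(kl.mean()))
```

Uniform conditionals give a score of 1, while confident, perfectly balanced predictions over C classes give a score of C, the two extremes of the metric.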
5.4 Sample quality and diversity on CIFAR10
We test the semiimplicit regularizer on the CIFAR10 dataset, a widely studied dataset consisting of 50,000 training images with pixel from ten categories. The image diversity is high between or within each category. We combine semiimplicit regularizer with two popular GAN frameworks DCGAN radford2015unsupervised and WGANGPgulrajani2017improved to balance the quality and diversity of generated samples.
Table 3: Inception scores on CIFAR10 (unsupervised, standard CNN).

Real data | DCGAN | DCGAN-SI | WGAN-GP | WGAN-GP-SI
11.24 ± .12 | 6.16 ± .14 | 6.85 ± .06 | 6.43 ± .07 | 6.67 ± .11
We train each model for 100K iterations with minibatch size 64. The optimizer is Adam with learning rate 0.0002. The inception model we use is the pretrained Inception network szegedy2016rethinking trained on ImageNet. As shown in Appendix Figure 6, the images generated by DCGAN include duplicated images, indicating mode collapse, which does not seem to happen with the regularized DCGAN-SI; this is reflected in the improved inception scores shown in Table 3.

6 Conclusions
We propose the semi-implicit generator (SIG) as a flexible and stable generative model. Trained under the well-understood maximum likelihood framework, SIG can serve either as a black-box generative model or as an unbiased regularizer in adversarial learning. We analyze its inherent mode-capturing mechanism and show its advantage over several state-of-the-art generative methods in reducing mode collapse. Combined with a GAN, the semi-implicit regularizer provides an interplay between adversarial learning and maximum likelihood inference, leading to a better balance between sample quality and diversity.
References
 [1] Diederik P. Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 [2] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 [3] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 [4] Radford M. Neal. Connectionist learning of belief networks. Artificial Intelligence, 56(1):71–113, 1992.
 [5] Ian Goodfellow. NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160, 2016.
 [6] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 [7] Tim Salimans, Ian Goodfellow, Wojciech Zaremba, Vicki Cheung, Alec Radford, and Xi Chen. Improved techniques for training GANs. In NIPS, pages 2234–2242, 2016.
 [8] Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
 [9] Yi Shang and Benjamin W. Wah. Global optimization for neural network training. Computer, 29(3):45–54, 1996.
 [10] Chulhee Yun, Suvrit Sra, and Ali Jadbabaie. Global optimality conditions for deep neural networks. arXiv preprint arXiv:1707.02444, 2017.

 [11] Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pages 1278–1286, 2014.
 [12] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672–2680, 2014.
 [13] Ian J Goodfellow. On distinguishability criteria for estimating generative models. arXiv preprint arXiv:1412.6515, 2014.
 [14] Luke Metz, Ben Poole, David Pfau, and Jascha Sohl-Dickstein. Unrolled generative adversarial networks. arXiv preprint arXiv:1611.02163, 2016.
 [15] Jerry Li, Aleksander Madry, John Peebles, and Ludwig Schmidt. Towards understanding the dynamics of generative adversarial networks. arXiv preprint arXiv:1706.09884, 2017.
 [16] Pengchuan Zhang, Qiang Liu, Dengyong Zhou, Tao Xu, and Xiaodong He. On the discriminationgeneralization tradeoff in gans. arXiv preprint arXiv:1711.02771, 2017.
 [17] Martin Arjovsky and Léon Bottou. Towards principled methods for training generative adversarial networks. arXiv preprint arXiv:1701.04862, 2017.
 [18] Akash Srivastava, Lazar Valkov, Chris Russell, Michael U. Gutmann, and Charles Sutton. VEEGAN: Reducing mode collapse in GANs using implicit variational learning. In NIPS, pages 3310–3320, 2017.
 [19] Herbert Robbins et al. An empirical bayes approach to statistics. In Proceedings of the Third Berkeley Symposium on Mathematical Statistics and Probability, Volume 1: Contributions to the Theory of Statistics. The Regents of the University of California, 1956.
 [20] George Casella. An introduction to empirical bayes data analysis. The American Statistician, 39(2):83–87, 1985.

 [21] Bradley P. Carlin and Thomas A. Louis. Bayes and empirical Bayes methods for data analysis. Statistics and Computing, 7(2):153–154, 1997.
 [22] Tong Che, Yanran Li, Ruixiang Zhang, R. Devon Hjelm, Wenjie Li, Yangqiu Song, and Yoshua Bengio. Maximum-likelihood augmented discrete generative adversarial networks. arXiv preprint arXiv:1702.07983, 2017.
 [23] Aditya Grover, Manik Dhar, and Stefano Ermon. Flow-GAN: Combining maximum likelihood and adversarial learning in generative models. In AAAI, 2018.
 [24] Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. In NIPS, pages 271–279, 2016.
 [25] Ben Poole, Alexander A. Alemi, Jascha Sohl-Dickstein, and Anelia Angelova. Improved generator objectives for GANs. arXiv preprint arXiv:1612.02780, 2016.
 [26] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable representation learning by information maximizing generative adversarial nets. In NIPS, pages 2172–2180, 2016.
 [27] Tong Che, Yanran Li, Athul Paul Jacob, Yoshua Bengio, and Wenjie Li. Mode regularized generative adversarial networks. arXiv preprint arXiv:1612.02136, 2016.
 [28] Jeff Donahue, Philipp Krähenbühl, and Trevor Darrell. Adversarial feature learning. arXiv preprint arXiv:1605.09782, 2016.
 [29] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Olivier Mastropietro, Alex Lamb, Martin Arjovsky, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 [30] Ilya O. Tolstikhin, Sylvain Gelly, Olivier Bousquet, Carl-Johann Simon-Gabriel, and Bernhard Schölkopf. AdaGAN: Boosting generative models. In NIPS, pages 5430–5439, 2017.
 [31] Sanjeev Arora, Rong Ge, Yingyu Liang, Tengyu Ma, and Yi Zhang. Generalization and equilibrium in generative adversarial nets (GANs). arXiv preprint arXiv:1703.00573, 2017.
 [32] Sanjeev Arora, Andrej Risteski, and Yi Zhang. Theoretical limitations of encoder-decoder GAN architectures. arXiv preprint arXiv:1711.02651, 2017.
 [33] Shane Barratt and Rishi Sharma. A note on the inception score. arXiv preprint arXiv:1801.01973, 2018.
 [34] Ali Borji. Pros and cons of GAN evaluation measures. arXiv preprint arXiv:1802.03446, 2018.
 [35] Ishaan Gulrajani, Faruk Ahmed, Martin Arjovsky, Vincent Dumoulin, and Aaron C. Courville. Improved training of Wasserstein GANs. In NIPS, pages 5769–5779, 2017.

 [36] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.
 [37] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Appendix A Proofs
Proof of Lemma 1.
Assume integer $K \ge 1$, and let $S$ be the set of all size-$K$ subsets of $\{1, \dots, K+1\}$. Let $s$ be a discrete uniform random variable that takes each outcome in $S$ with probability $1/(K+1)$. We have $\frac{1}{K+1}\sum_{k=1}^{K+1} p(x\,|\,z_k) = \mathbb{E}_s\big[\frac{1}{K}\sum_{k\in s} p(x\,|\,z_k)\big]$. Then, by Jensen's inequality,
$\log \frac{1}{K+1}\sum_{k=1}^{K+1} p(x\,|\,z_k) \;\ge\; \mathbb{E}_s \log \frac{1}{K}\sum_{k\in s} p(x\,|\,z_k).$
Multiplying both sides by $-1$ and taking the expectation over $x \sim p(x)$ and $z_1, \dots, z_{K+1} \sim q_\phi(z)$, we have $\mathcal{L}_{K+1} \le \mathcal{L}_K$.
The minimum of $\mathcal{L}_1$ is reached when the implicit distribution degenerates to the point probability mass at the $z$ that maximizes the average log-likelihood over the data. ∎
Proof of Theorem 1.
Suppose $p(x)$ is defined on a discrete multimodal space $\mathcal{X} = \bigcup_{k=1}^{M}\mathcal{X}_k$. For each true data point $x_i$, let $k(i)$ denote its mode label, and for each generated center $z_j$, let $l(j)$ denote its mode label. The expected objective then decomposes by mode as
(14) 
Notice that $\sum_{k=1}^{M} m_k = K$. With the definitions of the RBF $\varphi_\sigma$ and the bias constants in Theorem 1, the objective (14) becomes a constrained optimization problem, which we handle with a Lagrange multiplier.
Taking the gradient with respect to $m_k$ and setting it to zero gives
Together with the constraint $\sum_{k=1}^{M} m_k = K$, we have
(15) 
The Hessian is positive definite, which shows convexity, so (15) is the global minimum. Requiring the right-hand side of (15) to be greater than 0 gives the condition for mode $k$ not to vanish. ∎
Proof of Corollary 1.
Assume $p(x) = \frac{1}{M}\sum_{k=1}^{M}\mathcal{N}(\mu_k, \sigma^2 I)$ with equally spaced centers. Let $x \sim \mathcal{N}(\mu_k, \sigma^2 I)$ and consider its squared distance to another center $\mu_l$; then $\|x - \mu_l\|^2/\sigma^2$
follows a noncentral chi-squared distribution,
where the degrees of freedom equal the dimension of $x$ and the noncentrality parameter is $\|\mu_k - \mu_l\|^2/\sigma^2$. By the moment generating function (MGF) of the noncentral chi-squared distribution, we have (16)
For $l = k$, the noncentrality parameter is $0$, and for $l \neq k$, it grows with the squared mode separation. Plugging these into (16), the cross-mode terms are exponentially small in the squared mode separation, and therefore so is $|\varepsilon_k|$. ∎
Appendix B Algorithm for GANSI
Appendix C Network architecture and samples for MNIST3
The generator network is defined with a width multiplier to adjust the network size:
Number of output  Kernel size  Stride  Padding  
Input    
Fully connected  4*4*64    
Transpose Convolution  64*  4  1  VALID 
Transpose Convolution  32*  4  2  SAME 
Transpose Convolution  8*  4  1  SAME 
Convolution  3  4  2  SAME 
The discriminator network is defined with a width multiplier to adjust the network size:
Number of output  Kernel size  Stride  Padding  
Input is image batch with size 28*28*3    
Convolution  8*  4  2  VALID 
Convolution  16*  4  2  SAME 
Convolution  32*  4  1  SAME 
Flat+Fully connected  1   
For the network sizes denoted (S), (M), and (L), the corresponding width-multiplier pairs are chosen accordingly.
Appendix D Additional figures
(a) DCGAN 
(b) DCGAN-SI 