1 Introduction
In recent years, many types of generative models, such as autoregressive models Van Oord et al. (2016); van den Oord et al. (2016), variational autoencoders (VAEs) Kingma & Welling (2014); Rezende et al. (2014), generative adversarial networks (GANs) Goodfellow et al. (2014), real-valued non-volume preserving (real NVP) transformations Dinh et al. (2017) and generative moment matching networks (GMMNs) Li et al. (2015), have been proposed and widely studied. They have achieved remarkable success in various tasks, such as unconditional or conditional image synthesis Lample et al. (2017); Ma et al. (2017); Liu et al. (2017); Zhu et al. (2017), image restoration Dahl et al. (2017); Huang et al. (2017) and speech synthesis Gibiansky et al. (2017). While each model has its own significant strengths and limitations, the two most prominent models are VAEs and GANs. VAEs are theoretically elegant and easy to train. They learn nice manifold representations but produce very blurry images that lack details. GANs usually generate much sharper images but face challenges in training stability and sampling diversity, especially when synthesizing high-resolution images.
Many techniques have been developed to address these challenges. LAPGAN Denton et al. (2015) and StackGAN Zhang et al. (2017a) train a stack of GANs within a Laplacian pyramid to generate high-resolution images in a coarse-to-fine manner. StackGAN-v2 Zhang et al. (2017b) and HDGAN Zhang et al. (2018) adopt multi-scale discriminators in a tree-like structure. Some studies Durugkar et al. (2017); Wang et al. (2018) have trained a single generator with multiple discriminators to improve the image quality. PGGAN Karras et al. (2018) achieves the state of the art by training symmetric generators and discriminators progressively. As illustrated in Fig. 1(a) (A, B, C and D show the above GANs, respectively), most existing GANs require multi-scale discriminators to decompose a high-resolution task into a sequence of from-low-to-high resolution tasks, which increases the training complexity.
In addition, much effort has been devoted to combining the strengths of VAEs and GANs via hybrid models. VAE/GAN Larsen et al. (2016) imposes a discriminator on the data space to improve the quality of the results generated by VAEs. AAE Makhzani et al. (2015) discriminates in the latent space to match the posterior to the prior distribution. ALI Dumoulin et al. (2017) and BiGAN Donahue et al. (2017) discriminate jointly in the data and latent space, while VEEGAN Srivastava et al. (2017) uses additional constraints in the latent space. However, hybrid models usually have more complex network architectures (as illustrated in Fig. 1(b), A, B, C and D show the above hybrid models, respectively) and still lag behind GANs in image quality Karras et al. (2018).
To alleviate this problem, we introduce an introspective variational autoencoder (IntroVAE), a simple yet efficient approach to training VAEs for photographic image synthesis. One of the reasons why samples from VAEs tend to be blurry could be that the training principle makes VAEs assign a high probability to training points, which cannot ensure that blurry points are assigned a low probability Goodfellow et al. (2016). Motivated by this issue, we train VAEs in an introspective manner such that the model can self-estimate the differences between generated and real images. In the training phase, the inference model attempts to minimize the divergence of the approximate posterior from the prior for real data while maximizing it for generated samples; the generator model attempts to mislead the inference model by minimizing the divergence of the generated samples. The model acts like a standard VAE for real data and acts like a GAN when handling generated samples. Compared to most VAE and GAN hybrid models, ours requires no extra discriminators, which reduces the complexity of the model. Another advantage of the proposed method is that it can generate high-resolution realistic images through a single-stream network in a single stage. The divergence objective is adversarially optimized along with the reconstruction error, which increases the difficulty of distinguishing between the generated and real images for the inference model, even at high resolution. This arrangement greatly improves the stability of the adversarial training. The reason could be that the instability of GANs is often due to the fact that the discriminator distinguishes the generated images from the training images too easily Karras et al. (2018); Odena et al. (2017).
Our contribution is threefold. i) We propose a new training technique for VAEs that trains them in an introspective manner such that the model itself estimates the differences between the generated and real images without extra discriminators. ii) We propose a single-stream, single-stage adversarial model for high-resolution photographic image synthesis, which is, to our knowledge, the first feasible method for GANs to generate high-resolution images in such a simple yet efficient manner, e.g., CelebA images at 1024×1024. iii) Experiments demonstrate that our method combines the strengths of GANs and VAEs, producing high-resolution photographic images comparable to those produced by state-of-the-art GANs while preserving the advantages of VAEs, such as stable training and a nice latent manifold.
2 Background
As our work is a specific hybrid model of VAEs and GANs, we start with a brief review of VAEs, GANs and their hybrid models.
Variational Autoencoders (VAEs) consist of two networks: a generative network (Generator) that samples the visible variables $x$ given the latent variables $z$, and an approximate inference network (Encoder) that maps the visible variables $x$ to latent variables $z$ that approximate a prior $p(z)$. The objective of VAEs is to maximize the variational lower bound (or evidence lower bound, ELBO) of $\log p_\theta(x)$:
$\log p_\theta(x) \geq \mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \,\|\, p(z))$ (1)
The main limitation of VAEs is that the generated samples tend to be blurry, which is often attributed to the limited expressiveness of the inference models, the injected noise and imperfect element-wise criteria such as the squared error Larsen et al. (2016); Zhao et al. (2017b). Although recent studies Chen et al. (2017); Dosovitskiy & Brox (2016); Kingma et al. (2016); Sønderby et al. (2016); Zhao et al. (2017b) have greatly improved the predicted log-likelihood, they still face challenges in generating high-resolution images.
Generative Adversarial Networks (GANs) employ a two-player min-max game between two models: the generative model (Generator) $G$ produces samples $G(z)$ from the prior $p(z)$ to confuse the discriminator $D$, while $D$ is trained to distinguish between the generated samples and the given training data. The training objective is
$\min_G \max_D \; \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p(z)}[\log(1 - D(G(z)))]$ (2)
GANs are promising tools for generating sharp images, but they are difficult to train. The training process is usually unstable and is prone to mode collapse, especially when generating high-resolution images. Many methods Zhao et al. (2017a); Arjovsky et al. (2017); Berthelot et al. (2017); Gulrajani et al. (2017); Salimans et al. (2016) have attempted to improve GANs in terms of training stability and sample variation. To synthesize high-resolution images, several studies have trained GANs in a Laplacian pyramid Denton et al. (2015); Zhang et al. (2017a) or a tree-like structure Zhang et al. (2017b, 2018) with multi-scale discriminators Durugkar et al. (2017); Nguyen et al. (2017); Wang et al. (2018), mostly in a coarse-to-fine manner, including the state-of-the-art PGGAN Karras et al. (2018).
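As a small numeric illustration of the min-max objective in Eq. (2), the inner expression can be evaluated directly for scalar discriminator outputs (a toy sketch of ours, not part of any GAN implementation; the function name is hypothetical):

```python
import math

def gan_value(d_real, d_fake):
    """Inner expression of the min-max objective in Eq. (2) for scalar
    discriminator outputs d_real = D(x) and d_fake = D(G(z)), both in (0, 1)."""
    return math.log(d_real) + math.log(1.0 - d_fake)

# At the theoretical optimum the discriminator outputs 1/2 everywhere,
# giving the well-known value log(1/4).
value_at_optimum = gan_value(0.5, 0.5)
```

A confident discriminator (large `d_real`, small `d_fake`) drives the value above this optimum, which is exactly the signal the generator fights against.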
Hybrid Models of VAEs and GANs usually consist of three components: an encoder and a decoder, as in autoencoders (AEs) or VAEs, to map between the latent space and the data space, and an extra discriminator to add an adversarial constraint to the latent space Makhzani et al. (2015), the data space Larsen et al. (2016), or their joint space Donahue et al. (2017); Dumoulin et al. (2017); Srivastava et al. (2017). Recently, Ulyanov et al. (2018) proposed adversarial generator-encoder networks (AGE), which share some similarity with ours in their two-component architecture, although the two models differ in many ways, such as the design of the inference models, the training objectives and the divergence computations. Brock et al. (2017) also proposed an introspective adversarial network (IAN), in which the encoder and discriminator share most of their layers except the last, and whose adversarial loss is a variation of the standard GAN loss. In addition, existing hybrid models, including AGE and IAN, still lag far behind GANs in generating high-resolution images, which is one of the focuses of our method.
3 Approach
In this section, we train VAEs in an introspective manner such that the model can self-estimate the differences between the generated samples and the training data and then update itself to produce more realistic samples. To achieve this goal, one part of the model needs to discriminate the generated samples from the training data, and another part should mislead the former part, analogous to the generator and discriminator in GANs. Specifically, we select the approximate inference model (or encoder) of VAEs as the discriminator of GANs and the generator model of VAEs as the generator of GANs. In addition to performing adversarial learning like GANs, the inference and generator models are also trained jointly on the given training data to preserve the advantages of VAEs.
There are two components in the ELBO objective of VAEs: a log-likelihood (autoencoding) term $L_{AE}$ and a prior regularization term $L_{REG}$, which are listed below in their negative form:
$L_{AE} = -\mathbb{E}_{q_\phi(z|x)}[\log p_\theta(x|z)]$ (3)
$L_{REG} = KL(q_\phi(z|x) \,\|\, p(z))$ (4)
The first term is the reconstruction error in a probabilistic autoencoder, and the second term regularizes the encoder by encouraging the approximate posterior $q_\phi(z|x)$ to match the prior $p(z)$. In the following, we describe the proposed introspective VAE (IntroVAE) with a modified combination of these two terms.
3.1 Adversarial distribution matching
To match the distribution of the generated samples to the true distribution of the given training data, we use the regularization term $L_{REG}$ as the adversarial training cost function. The inference model is trained to minimize $E(x) = KL(q_\phi(z|x) \| p(z))$ to encourage the posterior of the real data to match the prior $p(z)$, and simultaneously to maximize $E(G(z))$ to encourage the posterior of the generated samples to deviate from the prior, where $z$ is sampled from $p(z)$. Conversely, the generator is trained to produce samples that have a small $E(G(z))$, such that the samples' posterior distribution approximately matches the prior distribution.
Given a data sample $x$ and a generated sample $G(z)$, we design two different losses, one to train the inference model $E$, and the other to train the generator $G$:
$L_E(x, z) = E(x) + \left[m - E(G(z))\right]^+$ (5)
$L_G(z) = E(G(z))$ (6)
where $E(x) = KL(q_\phi(z|x) \,\|\, p(z))$, $[\cdot]^+ = \max(0, \cdot)$, and $m$ is a positive margin. The above two equations form a min-max game between the inference model $E$ and the generator $G$ when $E(G(z)) < m$, i.e., minimizing $L_G$ for the generator is equal to maximizing the second term of $L_E$ for the inference model $E$.^1 ^1 It should be noted that we use $E$ to denote the inference model and $KL(\cdot)$ to denote the KL-divergence function for representation convenience.
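To make the interplay of Eq. (5) and Eq. (6) concrete, here is a minimal scalar sketch of the two hinge-style losses (a toy illustration of ours; the function name and values are hypothetical, not from the released implementation):

```python
def adversarial_losses(e_real, e_fake, m):
    """Hinge-style adversarial losses of Eq. (5) and Eq. (6).

    e_real: energy E(x) = KL(q(z|x) || p(z)) of a real sample.
    e_fake: energy E(G(z)) of a generated sample.
    m: positive margin.
    """
    loss_e = e_real + max(0.0, m - e_fake)  # inference model: low E(x), push E(G(z)) above m
    loss_g = e_fake                          # generator: pull E(G(z)) back down
    return loss_e, loss_g

# While e_fake < m, the two losses pull E(G(z)) in opposite directions: a min-max game.
le, lg = adversarial_losses(e_real=1.0, e_fake=5.0, m=20.0)  # le = 16.0, lg = 5.0
```

Once a generated sample's energy exceeds the margin, the hinge term vanishes and the inference model stops pushing on it, which is what keeps the game bounded.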
Following the original GANs Goodfellow et al. (2014), we train the inference model $E$ to minimize the quantity $L_E$, and the generator $G$ to minimize the quantity $L_G$. In a nonparametric setting, i.e., when $E$ and $G$ are assumed to have infinite capacity, the following theorem shows that when the system reaches a Nash equilibrium (a saddle point) $(E^*, G^*)$, the generator produces samples that are indistinguishable from the given training distribution, i.e., $p_{g^*} = p_{data}$.
Theorem 1. Assuming that no region exists where $p_{data}(x) = 0$, $(E^*, G^*)$ forms a saddle point of the above system if and only if $p_{g^*} = p_{data}$ and $E^*(x) = \gamma$, where $\gamma \in [0, m]$ is a constant. Proof. See Appendix A.
Relationships with other GANs To some degree, the proposed adversarial method appears similar to energy-based GANs (EBGAN) Zhao et al. (2017a), which views the discriminator as an energy function that assigns low energies to regions of high data density and higher energies elsewhere. The proposed KL-divergence function can be considered a specific type of energy function that is computed by the inference model instead of by an extra autoencoder discriminator Zhao et al. (2017a). The architecture of our system is simpler, and the KL-divergence shows more promising properties than the reconstruction error Zhao et al. (2017a), such as stable training for high-resolution images.
3.2 Introspective variational inference
As demonstrated in the previous subsection, playing a min-max game between the inference model and the generator is a promising way for the model to align the generated and true distributions and thus produce visually realistic samples. However, training the model in this adversarial manner could still cause problems such as mode collapse and training instability, as in other GANs. As discussed above, we introduce IntroVAE to alleviate these problems by combining GANs with VAEs in an introspective manner.
The solution is surprisingly simple: we only need to combine the adversarial objectives in Eq. (5) and Eq. (6) with the ELBO objective of VAEs. The training objectives for the inference model $E$ and the generator $G$ can be reformulated as below:
$L_E(x, z) = E(x) + \left[m - E(G(z))\right]^+ + L_{AE}(x)$ (7)
$L_G(z) = E(G(z)) + L_{AE}(x)$ (8)
The addition of the reconstruction error $L_{AE}$ builds a bridge between the inference model and the generator and results in a specific hybrid model of VAEs and GANs. For a data sample $x$ from the training set, the objective of the proposed method collapses to the standard ELBO objective of VAEs, thus preserving the properties of VAEs; for a generated sample $G(z)$, this objective forms a min-max game of GANs between $E$ and $G$ and makes $G(z)$ more realistic.
Relationships with other hybrid models Compared to other hybrid models Makhzani et al. (2015); Larsen et al. (2016); Donahue et al. (2017); Dumoulin et al. (2017); Srivastava et al. (2017) of VAEs and GANs, which always use a discriminator to regularize the latent code and the generated data individually or jointly, the proposed method adds prior regularization to both the latent space and the data space in an introspective manner. The first term in Eq. (7) (i.e., $L_{REG}$ in Eq. (4)) encourages the latent code of the training data to approximately follow the prior distribution. The adversarial part of Eq. (7) and Eq. (8) encourages the generated samples to have the same distribution as the training data. The inference model and the generator are trained both jointly and adversarially without extra discriminators.
Compared to AGE Ulyanov et al. (2018), the major differences are threefold. 1) AGE is designed as an autoencoder in which the encoder has one output variable and no noise term is injected when reconstructing the input data. The proposed method follows the original VAEs in that the inference model has two output variables, i.e., $\mu$ and $\sigma$, to utilize the reparameterization trick, i.e., $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. 2) AGE uses different reconstruction errors to regularize the encoder and generator, respectively, while the proposed method uses the same reconstruction error $L_{AE}$ to regularize both. 3) AGE computes the KL-divergence using batch-level statistics (Eq. (7) in Ulyanov et al. (2018)), while we compute it using the two batch-independent outputs of the inference model, i.e., $\mu$ and $\sigma$ in Eq. (9). For high-resolution image synthesis, the training batch size is usually limited to be very small, which may harm the performance of AGE but has little influence on ours. As AGE is officially trained on lower-resolution images, we retrain AGE and find it hard to converge on higher-resolution images; there is no improvement even when replacing AGE's network with ours.
3.3 Training IntroVAE networks
Following the original VAEs Kingma & Welling (2014), we select the centered isotropic multivariate Gaussian $\mathcal{N}(0, I)$ as the prior $p(z)$ over the latent variables. As illustrated in Fig. 2, the inference model is designed to output two individual variables, $\mu$ and $\sigma$, and thus the posterior is $q_\phi(z|x) = \mathcal{N}(z; \mu, \sigma^2)$. The input of the generator is sampled from this posterior using the reparameterization trick: $z = \mu + \sigma \odot \epsilon$, where $\epsilon \sim \mathcal{N}(0, I)$. In this setting, the KL-divergence (i.e., $L_{REG}$ in Eq. (7) and Eq. (8)), given the data samples, can be computed as below:
$L_{REG}(z; \mu, \sigma) = -\frac{1}{2}\sum_{i=1}^{N_z}\left(1 + \log(\sigma_i^2) - \mu_i^2 - \sigma_i^2\right)$ (9)
where $N_z$ is the dimension of the latent code $z$.
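The KL term of Eq. (9) and the reparameterization trick can be sketched in a few lines of NumPy (our own helper names, a minimal illustration rather than the released implementation):

```python
import numpy as np

def kl_regularizer(mu, log_var):
    """L_REG of Eq. (9): KL(N(mu, sigma^2) || N(0, I)) for a diagonal Gaussian,
    i.e., -0.5 * sum(1 + log(sigma_i^2) - mu_i^2 - sigma_i^2) over latent dims."""
    return -0.5 * np.sum(1.0 + log_var - mu**2 - np.exp(log_var))

def reparameterize(mu, log_var, rng):
    """z = mu + sigma * eps with eps ~ N(0, I), keeping sampling differentiable."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

mu, log_var = np.zeros(512), np.zeros(512)   # posterior equal to the prior
# kl_regularizer(mu, log_var) is exactly 0 in this case
z = reparameterize(mu, log_var, np.random.default_rng(0))
```

Parameterizing the encoder output as `log_var` (rather than `sigma` directly) is a common numerical-stability choice and is assumed here.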
For the reconstruction error $L_{AE}$ in Eq. (7) and Eq. (8), we choose the commonly used pixel-wise mean squared error (MSE) function. Let $x_r$ be the reconstruction sample; $L_{AE}$ is defined as:
$L_{AE}(x, x_r) = \frac{1}{2}\sum_{i=1}^{N_x}\left(x_{r,i} - x_i\right)^2$ (10)
where $N_x$ is the dimension of the data $x$.
Similar to VAE/GAN Larsen et al. (2016), we train IntroVAE to discriminate real samples from both the model samples and the reconstructions. As shown in Fig. 2, these two types of samples are the reconstruction samples $x_r = G(z)$ and the new samples $x_p = G(z_p)$, where $z \sim q_\phi(z|x)$ and $z_p \sim p(z)$. When the KL-divergence objective of VAEs is adequately optimized, the posterior $q_\phi(z|x)$ matches the prior $p(z)$ approximately and the two types of samples are similar to each other. The combined use of samples from $q_\phi(z|x)$ and $p(z)$ is expected to provide a more useful signal for the model to learn a more expressive latent code and synthesize more realistic samples. The total loss functions for $E$ and $G$ are respectively redefined as:
$L_E = L_{REG}(z) + \alpha \sum_{s=r,p}\left[m - L_{REG}\big(Enc(ng(x_s))\big)\right]^+ + \beta L_{AE}(x, x_r)$ (11)
$L_G = \alpha \sum_{s=r,p} L_{REG}\big(Enc(x_s)\big) + \beta L_{AE}(x, x_r)$ (12)
where $ng(\cdot)$ indicates that the back-propagation of the gradients is stopped at this point, $Enc(\cdot)$ represents the mapping function of the inference model $E$, and $\alpha$ and $\beta$ are weighting parameters used to balance the importance of each term.
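Putting the pieces together, the totals in Eq. (11) and Eq. (12) can be sketched with scalar stand-ins for the sub-losses (a toy sketch with hypothetical names; in real code $ng(\cdot)$ is a stop-gradient in the autodiff framework, which plain scalars cannot express):

```python
def hinge(v):
    """[v]^+ = max(0, v)."""
    return max(0.0, v)

def total_losses(kl_real, kl_rec, kl_samp, l_ae, m, alpha, beta):
    """Scalar sketch of Eq. (11) and Eq. (12).

    kl_real: L_REG(z) of the real data; kl_rec / kl_samp: L_REG of the
    encoded reconstructions x_r and new samples x_p; l_ae: L_AE(x, x_r)."""
    loss_e = kl_real + alpha * (hinge(m - kl_rec) + hinge(m - kl_samp)) + beta * l_ae
    loss_g = alpha * (kl_rec + kl_samp) + beta * l_ae
    return loss_e, loss_g

le, lg = total_losses(kl_real=2.0, kl_rec=5.0, kl_samp=10.0,
                      l_ae=4.0, m=20.0, alpha=0.25, beta=0.5)
```

The shared `l_ae` term is what couples the two players, while the hinge only fires on generated samples whose KL energy is still below the margin.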
The networks of $E$ and $G$ are designed in a similar manner to other GANs Radford et al. (2016); Karras et al. (2018), except that $E$ has two output variables, $\mu$ and $\sigma$. As shown in Algorithm 1, $E$ and $G$ are trained iteratively: $E$ is updated using $L_E$ to distinguish the real data $x$ from the generated samples $x_r$ and $x_p$, and then $G$ is updated using $L_G$ to generate samples that are increasingly similar to the real data; these steps are repeated until convergence.
4 Experiments
In this section, we conduct a set of experiments to evaluate the performance of the proposed method. We first introduce the experimental implementations and then discuss in detail the image quality, training stability and sample diversity of our method. We also investigate the learned manifold via interpolation in the latent space.
4.1 Implementations
Datasets We consider three datasets, namely CelebA Liu et al. (2015), CelebA-HQ Karras et al. (2018) and LSUN BEDROOM Yu et al. (2015). The CelebA dataset consists of 202,599 celebrity images with large variations in facial attributes. Following the standard protocol of CelebA, we use 162,770 images for training, 19,867 for validation and 19,962 for testing. The CelebA-HQ dataset is a high-quality version of CelebA that consists of 30,000 images at 1024×1024 resolution. The dataset is split into two sets: the first 29,000 images as the training set and the remaining 1,000 images as the testing set. We use the testing set to evaluate the reconstruction quality. LSUN BEDROOM is a subset of the Large-scale Scene Understanding (LSUN) dataset Yu et al. (2015). We adopt its whole training set of 3,033,042 images in our experiments.
Network architecture We design the inference and generator models of IntroVAE in a similar way to the discriminator and generator in PGGAN, except for the use of residual blocks to accelerate training convergence (see Appendix B for more details). Like other VAEs, the inference model has two output vectors, representing the mean $\mu$ and the covariance $\sigma$ in Eq. (9), respectively. For the images at the highest resolution, the dimension of the latent code is set to 512, and the hyperparameters $m$, $\alpha$ and $\beta$ in Eq. (11) and Eq. (12) are set empirically to hold the training balance between the inference and generator models; for the middle resolution, the latent dimension is likewise 512, and for the lowest resolution it is 256, each with its own setting of $m$, $\alpha$ and $\beta$. The key is to hold the regularization term $L_{REG}$ in Eq. (11) and Eq. (12) below the margin value $m$ for most of the time. It is suggested to pretrain the model for a number of epochs in the original VAE form (i.e., with the adversarial terms disabled) to find the appropriate configuration of the hyperparameters for different image sizes. More analyses and results for different hyperparameters are provided in Appendix D.
As illustrated in Algorithm 1, the inference and generator models are trained iteratively using the Adam algorithm Kingma & Ba (2014) with a batch size of 8 and a fixed learning rate of 0.0002. An additional illustration of the training flow is provided in Appendix C.
4.2 High quality image synthesis
As shown in Fig. 3, our method produces visually appealing high-resolution images at 1024×1024 resolution, both in reconstruction and in sampling. The images in Fig. 3(c) are the reconstruction results of the original images in Fig. 3(a) from the CelebA-HQ testing set. Due to the training principle of VAEs, which injects random noise in the training phase, the reconstructed images cannot maintain accurate pixel-wise similarity with the original images. In spite of this, our results preserve most of the global topology information of the input images while achieving photographic high quality in visual perception.
We also compare our sampling results against PGGAN Karras et al. (2018), the state of the art in synthesizing high-resolution images. As illustrated in Fig. 3(d), our method is able to synthesize high-resolution, high-quality samples comparable with those of PGGAN, both hardly distinguishable from the real images. While PGGAN is trained with symmetric generators and discriminators in a progressive multi-stage manner, our model is trained in a much simpler manner that iteratively trains a single inference model and a single generator in a single stage, like the original GANs Goodfellow et al. (2014). These results demonstrate that it is possible to synthesize very high-resolution images by training directly on high-resolution images, without decomposing the single task into multiple from-low-to-high resolution tasks. Additionally, we provide visual quality results on LSUN BEDROOM in Fig. 4, which further demonstrate that our method is capable of synthesizing high-quality images comparable with PGGAN's. (More visual results on extra datasets are provided in Appendix F & G.)
4.3 Training stability and speed
Figure 5 illustrates the quality of the samples with regard to the loss functions of the reconstruction error and the KL-divergences. It can be seen that the losses converge very quickly to a stable stage in which their values fluctuate slightly around a balance line. As described in Theorem 1, the prediction of the inference model reaches a constant $\gamma \in [0, m]$. This is consistent with the curves in Fig. 5: when approximately converged, the KL-divergence of real images stays around a constant value lower than the margin $m$, while those of the reconstruction and sample images fluctuate around $m$. Besides, the image quality of the samples improves stably along with the training process.
We evaluate the training speed on CelebA images at various resolutions. As illustrated in Tab. 1, the convergence time increases with the resolution, since the hardware limits the minibatch size at high resolutions.
Resolution   |      |      |      |
Minibatch    |  64  |  32  |  12  |  8
Time (days)  | 0.5  |  1   |  7   |  21
4.4 Diversity analysis
We use two metrics to evaluate the sample diversity of our method, namely multi-scale structural similarity (MS-SSIM) Odena et al. (2017) and Fréchet inception distance (FID) Heusel et al. (2017). MS-SSIM measures the similarity of two images, and FID measures the Fréchet distance between two distributions in feature space. For a fair comparison with PGGAN, the MS-SSIM scores are computed among an average of 10K pairs of synthesized images for CelebA and LSUN BEDROOM, respectively. FID is computed from 50K images for CelebA-HQ and from 50K images for LSUN BEDROOM. As illustrated in Tab. 2, our method achieves comparable or better quantitative performance than PGGAN, which reflects the sample diversity to some degree. More visual results are provided in Appendix H to further demonstrate the diversity.
4.5 Latent manifold analysis
We conduct interpolations of real images in the latent space to estimate the manifold continuity. For a pair of real images, we first map them to latent codes using the inference model and then make linear interpolations between the codes. As illustrated in Fig. 6, our model demonstrates continuity in the latent space in interpolating from a male to a female or rotating a profile face. This manifold continuity verifies that the proposed model generalizes the image contents instead of simply memorizing them.
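The interpolation above amounts to linearly blending two inferred codes before decoding; a minimal sketch (our own helper names, with the encoding and decoding steps omitted):

```python
import numpy as np

def lerp(z0, z1, t):
    """Linear interpolation between two latent codes, t in [0, 1]."""
    return (1.0 - t) * z0 + t * z1

z_a = np.array([0.0, 0.0])   # code inferred from the first image
z_b = np.array([2.0, 4.0])   # code inferred from the second image
path = [lerp(z_a, z_b, t) for t in np.linspace(0.0, 1.0, 5)]
# each code on the path would then be decoded by the generator G
```

If every decoded point on such a path is a plausible image, the latent manifold is continuous between the two codes, which is the property being probed here.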
5 Conclusion
We have introduced the introspective VAE (IntroVAE), a novel and simple approach to training VAEs for synthesizing high-resolution photographic images. The learning objective is to play a min-max game between the inference and generator models of VAEs. The inference model not only learns a nice latent manifold structure but also acts as a discriminator that maximizes the divergence of the approximate posterior from the prior for generated data. Thus, the proposed IntroVAE has an introspection capability to self-estimate the quality of the generated images and improve itself accordingly. Compared to other state-of-the-art methods, the proposed model is simpler and more efficient, with a single-stream network in a single stage, and it can synthesize high-resolution photographic images via a stable training process. Since our model has a standard VAE architecture, it may be easily extended to various VAE-related tasks, such as conditional image synthesis.
Acknowledgments
This work is partially funded by the State Key Development Program (Grant No. 2016YFB1001001) and National Natural Science Foundation of China (Grant No. 61622310, 61427811).
References
 Arjovsky et al. (2017) Arjovsky, Martin, Chintala, Soumith, and Bottou, Léon. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.
 Berthelot et al. (2017) Berthelot, David, Schumm, Tom, and Metz, Luke. BEGAN: Boundary equilibrium generative adversarial networks. arXiv preprint arXiv:1703.10717, 2017.
 Brock et al. (2017) Brock, Andrew, Lim, Theodore, Ritchie, James M, and Weston, Nick. Neural photo editing with introspective adversarial networks. In ICLR, 2017.
 Chen et al. (2017) Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. In ICLR, 2017.

 Dahl et al. (2017) Dahl, Ryan, Norouzi, Mohammad, and Shlens, Jonathon. Pixel recursive super resolution. In ICCV, 2017.
 Denton et al. (2015) Denton, Emily L, Chintala, Soumith, Fergus, Rob, et al. Deep generative image models using a laplacian pyramid of adversarial networks. In NIPS, pp. 1486–1494, 2015.
 Dinh et al. (2017) Dinh, Laurent, SohlDickstein, Jascha, and Bengio, Samy. Density estimation using real NVP. In ICLR, 2017.
 Donahue et al. (2017) Donahue, Jeff, Krähenbühl, Philipp, and Darrell, Trevor. Adversarial feature learning. In ICLR, 2017.
 Dosovitskiy & Brox (2016) Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. In NIPS, pp. 658–666, 2016.
 Dumoulin et al. (2017) Dumoulin, Vincent, Belghazi, Ishmael, Poole, Ben, Mastropietro, Olivier, Lamb, Alex, Arjovsky, Martin, and Courville, Aaron. Adversarially learned inference. In ICLR, 2017.
 Durugkar et al. (2017) Durugkar, Ishan, Gemp, Ian, and Mahadevan, Sridhar. Generative multiadversarial networks. In ICLR, 2017.
 Gibiansky et al. (2017) Gibiansky, Andrew, Arik, Sercan, Diamos, Gregory, Miller, John, Peng, Kainan, Ping, Wei, Raiman, Jonathan, and Zhou, Yanqi. Deep voice 2: Multispeaker neural texttospeech. In NIPS, pp. 2966–2974, 2017.
 Goodfellow et al. (2014) Goodfellow, Ian, PougetAbadie, Jean, Mirza, Mehdi, Xu, Bing, WardeFarley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In NIPS, pp. 2672–2680, 2014.
 Goodfellow et al. (2016) Goodfellow, Ian, Bengio, Yoshua, Courville, Aaron, and Bengio, Yoshua. Deep learning, volume 1. MIT press Cambridge, 2016.
 Gulrajani et al. (2017) Gulrajani, Ishaan, Ahmed, Faruk, Arjovsky, Martin, Dumoulin, Vincent, and Courville, Aaron C. Improved training of wasserstein GANs. In NIPS, pp. 5769–5779, 2017.
 Heusel et al. (2017) Heusel, Martin, Ramsauer, Hubert, Unterthiner, Thomas, Nessler, Bernhard, and Hochreiter, Sepp. Gans trained by a two timescale update rule converge to a local nash equilibrium. In NIPS, pp. 6626–6637, 2017.
 Huang et al. (2017) Huang, Huaibo, He, Ran, Sun, Zhenan, and Tan, Tieniu. Waveletsrnet: A waveletbased cnn for multiscale face super resolution. In ICCV, pp. 1689–1697, 2017.
 Karras et al. (2018) Karras, Tero, Aila, Timo, Laine, Samuli, and Lehtinen, Jaakko. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
 Kingma & Ba (2014) Kingma, Diederik and Ba, Jimmy. Adam: A method for stochastic optimization. In ICLR, 2014.
 Kingma & Welling (2014) Kingma, Diederik P and Welling, Max. Autoencoding variational bayes. In ICLR, 2014.
 Kingma et al. (2016) Kingma, Diederik P, Salimans, Tim, Jozefowicz, Rafal, Chen, Xi, Sutskever, Ilya, and Welling, Max. Improved variational inference with inverse autoregressive flow. In NIPS, pp. 4743–4751, 2016.
 Lample et al. (2017) Lample, Guillaume, Zeghidour, Neil, Usunier, Nicolas, Bordes, Antoine, Denoyer, Ludovic, et al. Fader networks: Manipulating images by sliding attributes. In NIPS, pp. 5969–5978, 2017.
 Larsen et al. (2016) Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, Larochelle, Hugo, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. In ICML, pp. 1558–1566, 2016.
 Li et al. (2015) Li, Yujia, Swersky, Kevin, and Zemel, Rich. Generative moment matching networks. In ICML, pp. 1718–1727, 2015.
 Liu et al. (2017) Liu, MingYu, Breuel, Thomas, and Kautz, Jan. Unsupervised imagetoimage translation networks. In NIPS, pp. 700–708, 2017.
 Liu et al. (2015) Liu, Ziwei, Luo, Ping, Wang, Xiaogang, and Tang, Xiaoou. Deep learning face attributes in the wild. In ICCV, pp. 3730–3738, 2015.
 Ma et al. (2017) Ma, Liqian, Jia, Xu, Sun, Qianru, Schiele, Bernt, Tuytelaars, Tinne, and Van Gool, Luc. Pose guided person image generation. In NIPS, pp. 405–415, 2017.
 Makhzani et al. (2015) Makhzani, Alireza, Shlens, Jonathon, Jaitly, Navdeep, Goodfellow, Ian, and Frey, Brendan. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Nguyen et al. (2017) Nguyen, Tu, Le, Trung, Vu, Hung, and Phung, Dinh. Dual discriminator generative adversarial nets. In NIPS, pp. 2667–2677, 2017.
 Odena et al. (2017) Odena, Augustus, Olah, Christopher, and Shlens, Jonathon. Conditional image synthesis with auxiliary classifier GANs. In ICML, pp. 2642–2651, 2017.
 Radford et al. (2016) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. In ICLR, 2016.

 Rezende et al. (2014) Rezende, Danilo Jimenez, Mohamed, Shakir, and Wierstra, Daan. Stochastic backpropagation and approximate inference in deep generative models. In ICML, pp. 1278–1286, 2014.
 Salimans et al. (2016) Salimans, Tim, Goodfellow, Ian, Zaremba, Wojciech, Cheung, Vicki, Radford, Alec, and Chen, Xi. Improved techniques for training GANs. In NIPS, pp. 2234–2242, 2016.
 Sønderby et al. (2016) Sønderby, Casper Kaae, Raiko, Tapani, Maaløe, Lars, Sønderby, Søren Kaae, and Winther, Ole. Ladder variational autoencoders. In NIPS, pp. 3738–3746, 2016.
 Srivastava et al. (2017) Srivastava, Akash, Valkoz, Lazar, Russell, Chris, Gutmann, Michael U, and Sutton, Charles. VEEGAN: Reducing mode collapse in gans using implicit variational learning. In NIPS, pp. 3310–3320, 2017.
 Ulyanov et al. (2018) Ulyanov, Dmitry, Vedaldi, Andrea, and Lempitsky, Victor. It takes (only) two: Adversarial generatorencoder networks. In AAAI, 2018.
 van den Oord et al. (2016) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with pixelcnn decoders. In NIPS, pp. 4790–4798, 2016.

 Van Oord et al. (2016) Van Oord, Aaron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. In ICML, pp. 1747–1756, 2016.
 Wang et al. (2018) Wang, TingChun, Liu, MingYu, Zhu, JunYan, Tao, Andrew, Kautz, Jan, and Catanzaro, Bryan. High-resolution image synthesis and semantic manipulation with conditional GANs. In CVPR, 2018.
 Wu et al. Wu, Xiang, He, Ran, Sun, Zhenan, and Tan, Tieniu. A light CNN for deep face representation with noisy labels.
 Yu et al. (2015) Yu, Fisher, Seff, Ari, Zhang, Yinda, Song, Shuran, Funkhouser, Thomas, and Xiao, Jianxiong. LSUN: Construction of a large-scale image dataset using deep learning with humans in the loop. arXiv preprint arXiv:1506.03365, 2015.
 Zhang et al. (2017a) Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Huang, Xiaolei, Wang, Xiaogang, and Metaxas, Dimitris. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks. In ICCV, pp. 5907–5915, 2017a.
 Zhang et al. (2017b) Zhang, Han, Xu, Tao, Li, Hongsheng, Zhang, Shaoting, Wang, Xiaogang, Huang, Xiaolei, and Metaxas, Dimitris. StackGAN++: Realistic image synthesis with stacked generative adversarial networks. arXiv preprint arXiv:1710.10916v2, 2017b.
 Zhang et al. (2018) Zhang, Zizhao, Xie, Yuanpu, and Yang, Lin. Photographic text-to-image synthesis with a hierarchically-nested adversarial network. arXiv preprint arXiv:1802.09178, 2018.
 Zhao et al. (2017a) Zhao, Junbo, Mathieu, Michael, and LeCun, Yann. Energy-based generative adversarial network. In ICLR, 2017a.
 Zhao et al. (2017b) Zhao, Shengjia, Song, Jiaming, and Ermon, Stefano. InfoVAE: Information maximizing variational autoencoders. arXiv preprint arXiv:1706.02262, 2017b.
 Zhu et al. (2017) Zhu, Jun-Yan, Zhang, Richard, Pathak, Deepak, Darrell, Trevor, Efros, Alexei A, Wang, Oliver, and Shechtman, Eli. Toward multimodal image-to-image translation. In NIPS, pp. 465–476, 2017.
Appendix A Proof of theorem 1
Following EBGAN Zhao et al. (2017a), we give the proof as follows.

It is obvious that the sufficient conditions hold, so we prove the necessary conditions. For the necessary condition $p_{G^*} = p_{data}$: the pair $(G^*, E^*)$ forms a saddle point that satisfies:

$V(G^*, E^*) \le V(G^*, E), \quad \forall E$ (13)

$U(G^*, E^*) \le U(G, E^*), \quad \forall G$ (14)

Firstly, $V(G^*, E^*)$ can be transformed as follows:

$V(G^*, E^*) = \int_{x,z} p_{data}(x)\, p_z(z) \left( E^*(x) + [m - E^*(G^*(z))]^+ \right) dx\, dz$ (15)

$= \int_x p_{data}(x)\, E^*(x)\, dx + \int_z p_z(z)\, [m - E^*(G^*(z))]^+\, dz$ (16)

$= \int_x \left( p_{data}(x)\, E^*(x) + p_{G^*}(x)\, [m - E^*(x)]^+ \right) dx$ (17)

where $p_{G^*}$ is the distribution of the samples generated by $G^*$. According to the analysis of $\varphi(y) = a y + b\,[m - y]^+$ in Lemma A.1, which has been proved in Zhao et al. (2017a),

Lemma A.1
Let $a, b \ge 0$ and $\varphi(y) = a y + b\,[m - y]^+$. The minimum of $\varphi$ on $[0, +\infty)$ exists and is reached in $m$ if $a < b$, and it is reached in $0$ otherwise (the minimum may not be unique).

$V(G^*, E^*)$ reaches its minimum when we replace $E^*(x)$ by these values (applying the lemma pointwise with $a = p_{data}(x)$ and $b = p_{G^*}(x)$):

$V(G^*, E^*) = \int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)}\, m\, p_{data}(x)\, dx + \int_x \mathbb{1}_{p_{data}(x) \ge p_{G^*}(x)}\, m\, p_{G^*}(x)\, dx$ (18)

$= m \int_x \left( \mathbb{1}_{p_{data}(x) < p_{G^*}(x)}\, p_{data}(x) + \mathbb{1}_{p_{data}(x) \ge p_{G^*}(x)}\, p_{G^*}(x) \right) dx$ (19)

$= m \int_x \left( \mathbb{1}_{p_{data}(x) < p_{G^*}(x)}\, p_{data}(x) + \left( 1 - \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \right) p_{G^*}(x) \right) dx$ (20)

$= m \int_x \left( p_{G^*}(x) + \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{data}(x) - p_{G^*}(x) \right) \right) dx$ (21)

$= m \left( 1 + \int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{data}(x) - p_{G^*}(x) \right) dx \right)$ (22)

$= m - m \int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{G^*}(x) - p_{data}(x) \right) dx$ (23)

Since the second term in 23 satisfies $m \int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{G^*}(x) - p_{data}(x) \right) dx \ge 0$, we have $V(G^*, E^*) \le m$. By putting the ideal generator with $p_G = p_{data}$ into the right side of equation 14, we get

$\int_x p_{G^*}(x)\, E^*(x)\, dx \le \int_x p_{data}(x)\, E^*(x)\, dx$ (24)

and therefore

$V(G^*, E^*) = \int_x p_{data}(x)\, E^*(x)\, dx + \int_x p_{G^*}(x)\, [m - E^*(x)]^+\, dx$ (25)

$\ge \int_x p_{G^*}(x) \left( E^*(x) + [m - E^*(x)]^+ \right) dx$ (26)

According to Lemma A.1, $E^*(x) \le m$ almost everywhere. So we get $E^*(x) + [m - E^*(x)]^+ = m$ almost everywhere.
Thus $V(G^*, E^*) \ge \int_x p_{G^*}(x)\, m\, dx = m$, i.e. $V(G^*, E^*) = m$. Putting it into equation 23, $m - m \int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{G^*}(x) - p_{data}(x) \right) dx = m$, so we obtain $\int_x \mathbb{1}_{p_{data}(x) < p_{G^*}(x)} \left( p_{G^*}(x) - p_{data}(x) \right) dx = 0$. We can see that only if $p_{G^*} = p_{data}$ almost everywhere, the above equation is true.
Now for the necessary condition that $E^*(x) = \gamma$ is a constant. Following the proof by contradiction in Zhao et al. (2017a), let us assume that $E^*$ is not constant almost everywhere and find a contradiction. If it is not, then there exists a constant $C$ and a set $S$ of non-zero measure such that $\forall x \in S$, $E^*(x) > C$ and $\forall x \notin S$, $E^*(x) \le C$. In addition, we can choose $S$ such that there exists a subset $S' \subseteq S$ of non-zero measure such that $p_{data}(x) > 0$ on $S'$ (because of the assumption in the footnote). We can build a generator $G_0$ such that $p_{G_0}(x) \le p_{data}(x)$ over $S$ and $p_{G_0}(x) < p_{data}(x)$ over $S'$, with the removed probability mass placed outside $S$ so that $p_{G_0}(x) \ge p_{data}(x)$ over $\mathbb{R}^N \setminus S$. Since $p_{G^*} = p_{data}$, we compute

$U(G^*, E^*) - U(G_0, E^*) = \int_x \left( p_{data}(x) - p_{G_0}(x) \right) E^*(x)\, dx$ (27)

$= \int_S \left( p_{data}(x) - p_{G_0}(x) \right) E^*(x)\, dx + \int_{\mathbb{R}^N \setminus S} \left( p_{data}(x) - p_{G_0}(x) \right) E^*(x)\, dx$ (28)

$> C \int_S \left( p_{data}(x) - p_{G_0}(x) \right) dx + C \int_{\mathbb{R}^N \setminus S} \left( p_{data}(x) - p_{G_0}(x) \right) dx$ (29)

$= C \int_x \left( p_{data}(x) - p_{G_0}(x) \right) dx = 0$ (30)

which violates equation 14.
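The case analysis in Lemma A.1 is easy to check numerically. Below is a small sketch (the values of $a$, $b$, and $m$ are toy values chosen for illustration) that minimizes $\varphi(y) = ay + b[m-y]^+$ over a grid:

```python
# Numerical check of Lemma A.1: phi(y) = a*y + b*max(m - y, 0) on [0, +inf).
# Toy values for a, b, m are illustrative only.

def phi(y, a, b, m):
    return a * y + b * max(m - y, 0.0)

def argmin_phi(a, b, m, grid_step=0.001, y_max=5.0):
    # Brute-force minimizer over a fine grid; sufficient for the piecewise-linear phi.
    ys = [i * grid_step for i in range(int(y_max / grid_step) + 1)]
    return min(ys, key=lambda y: phi(y, a, b, m))

m = 1.0
print(argmin_phi(0.2, 0.8, m))  # a < b: minimum reached at y = m
print(argmin_phi(0.8, 0.2, m))  # a > b: minimum reached at y = 0
```

For $a < b$, $\varphi$ decreases on $[0, m]$ (slope $a - b$) and increases afterwards (slope $a$), so the grid search returns $m$; for $a > b$ the slope on $[0, m]$ is positive and the minimum sits at $0$.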
Appendix B Network Architecture
Tab. 1 shows the network architecture for generating images at the highest target resolution. For smaller resolutions, we reduce the number of [Resblock + AvgPool] blocks in the inference model and [Upsample + Resblock] blocks in the generator. During our experiments we found that residual blocks accelerate convergence for image synthesis, especially at larger resolutions.
Inference model  Act.  Output shape 

Input image  –  
Conv  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
AvgPool  –  
Resblock  
Reshape  –  
FC1024  –  
Split  –  512, 512 
Generator  Act.  Output shape 

Latent vector  –  
FC8192  ReLU  
Reshape  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Upsample  –  
Resblock  
Conv 
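As a consistency check on the table's depth: each [Resblock + AvgPool] pair halves the spatial size, so under the (assumed, illustrative) convention of a $4\times 4$ bottom resolution, the number of downsampling stages for a given input resolution is:

```python
import math

def num_downsample_stages(input_res, bottom_res=4):
    # Each [Resblock + AvgPool] pair halves the spatial resolution, so
    # reaching the bottom size takes log2(input/bottom) stages.
    # input_res=1024 here is an assumed example, not a value from the table.
    return int(math.log2(input_res / bottom_res))

print(num_downsample_stages(1024))  # 8
print(num_downsample_stages(256))   # 6
```

With these assumptions, a hypothetical 1024-pixel input needs eight pooling stages, which matches the eight AvgPool rows listed in the inference model above; smaller resolutions shed stages accordingly, as described in the text.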
Appendix C Illustration of training flow
As illustrated in Fig. 1, we train the inference model and the generator iteratively, so an extra pass through the inference model is necessary after images are generated or reconstructed. As shown in Algorithm 1, we use a stop-gradient operation to prevent the gradients in lines 8 and 9 of Algorithm 1 from propagating back to the generator in the first pass. Other choices, such as omitting the stop-gradient or updating the generator first, also work with one forward pass through the inference model; the current choice is simply convenient to implement.
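To make the two-pass flow concrete, here is a minimal scalar toy (all functions, parameters, and values below are hypothetical, not the paper's model): the stop-gradient in the first pass keeps the generator untouched while the inference model updates, and the generator is then updated through a fresh pass.

```python
# Toy illustration of the alternating update with a stop-gradient.
# Everything is scalar with hand-written gradients; hypothetical example only.

m = 1.0          # margin of the hinge term
lr = 0.1         # learning rate
theta_E = 0.5    # inference-model (encoder) parameter
theta_G = -0.3   # generator parameter

def energy(x, tE):           # toy energy assigned by the inference model
    return (tE * x) ** 2

def generate(z, tG):         # toy generator
    return tG * z

x_real, z = 1.0, 1.0

# ---- Pass 1: update the inference model on  E(x_real) + [m - E(G(z))]^+ .
# The generated sample is treated as a constant (stop-gradient), so
# theta_G receives no gradient in this pass.
x_fake = generate(z, theta_G)            # detached: just a number from here on
hinge = max(m - energy(x_fake, theta_E), 0.0)
grad_E = 2 * theta_E * x_real**2         # d/dtE of energy(x_real)
if hinge > 0:                            # hinge active: subtract d/dtE of energy(x_fake)
    grad_E -= 2 * theta_E * x_fake**2
theta_E -= lr * grad_E
theta_G_before = theta_G                 # untouched by this pass

# ---- Pass 2: update the generator on  E(G(z))  through a fresh pass
# of the (just updated) inference model.
x_fake = generate(z, theta_G)
grad_G = 2 * theta_E**2 * theta_G * z**2  # d/dtG of (tE * tG * z)^2
theta_G -= lr * grad_G

print(theta_G_before == -0.3)  # True: the encoder step never moved the generator
```

The same bookkeeping is what an autograd framework's stop-gradient (e.g., detaching the generated sample) performs automatically in the first pass.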
Appendix D Discussion of hyperparameters
We conduct experiments on CELEBA-HQ images and find that training stability is, to some degree, not very sensitive to the hyperparameters, though they do influence sample and reconstruction quality. Values of the adversarial weight that are too large or too small may decelerate convergence. As illustrated in Fig. 2, when the adversarial weight is fixed, a larger reconstruction weight always improves the reconstruction quality but may reduce sample diversity. The margin should be selected according to the reconstruction weight, because a larger reconstruction weight needs a larger margin to balance the adversarial training. Pre-training the model as an original VAE (i.e., without the adversarial terms) is suggested for finding the most appropriate margin for a specific reconstruction weight: the margin can be selected to be a little larger than the KL-divergence value reached when training as a plain VAE.
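Since the suggested recipe is to pre-train as a plain VAE and read off its converged KL value, the quantity to monitor is the closed-form KL divergence of a diagonal-Gaussian posterior against the standard-normal prior. A minimal sketch (the function name is ours):

```python
import math

def kl_to_standard_normal(mu, logvar):
    # KL( N(mu, diag(exp(logvar))) || N(0, I) ), summed over latent dimensions:
    # 0.5 * sum( mu^2 + sigma^2 - log sigma^2 - 1 ).
    return 0.5 * sum(m * m + math.exp(lv) - lv - 1.0
                     for m, lv in zip(mu, logvar))

# A posterior equal to the prior has zero KL; shifting the mean raises it.
print(kl_to_standard_normal([0.0, 0.0], [0.0, 0.0]))  # 0.0
print(kl_to_standard_normal([1.0, 0.0], [0.0, 0.0]))  # 0.5
```

The margin would then be set slightly above the value this term converges to during the VAE pre-training stage.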
Appendix E Nearest neighbors for the generated images
Fig. 3 shows the nearest neighbors from the training data for the generated images (the first row in Fig. 3). We find the nearest neighbors using two distance measures: the second row of images in Fig. 3 shows the results based on pixel-space distance; the bottom row shows the results based on cosine distance in feature space. The high-level features are extracted using a pre-trained face recognition network, i.e., LightCNN (Wu et al.).
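The two retrieval criteria can be sketched as follows (a toy implementation with made-up 2-D vectors; in practice the search runs over training images flattened to vectors, or over their LightCNN features):

```python
import math

def nearest_index(query, bank, metric="l2"):
    # Return the index of the bank vector closest to `query` under
    # Euclidean distance ("l2") or cosine distance ("cosine").
    def l2(a, b):
        return math.dist(a, b)
    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return 1.0 - dot / (na * nb)   # 1 - cosine similarity
    dist = l2 if metric == "l2" else cosine
    return min(range(len(bank)), key=lambda i: dist(query, bank[i]))

bank = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
print(nearest_index([0.9, 0.1], bank, "l2"))      # 0
print(nearest_index([3.0, 3.0], bank, "cosine"))  # 2
```

Note the difference between the two criteria: cosine distance ignores vector magnitude, which is why the scaled query [3, 3] matches [2, 2] exactly under cosine but not under Euclidean distance.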