1 Introduction
Successful recent generative models of natural images can be divided into two broad families, which are trained in fundamentally different ways. The first is trained using likelihood-based criteria, which ensure that all training data points are well covered by the model. This category includes variational autoencoders (VAEs) (Kingma & Welling, 2014; Kingma et al., 2016), autoregressive models such as PixelCNNs (van den Oord et al., 2016; Salimans et al., 2017), and flow-based models such as RealNVP (Dinh et al., 2017). The second category is trained based on a signal that measures to what extent (statistics of) samples from the model can be distinguished from (statistics of) the training data, i.e., based on the quality of samples drawn from the model. This is the case for generative adversarial networks (GANs) (Goodfellow et al., 2014; Karras et al., 2018), as well as moment matching methods (Li et al., 2015).

Motivation.
Despite recent progress, existing methods exhibit a number of drawbacks. Likelihood-based models are trained to put probability mass on all elements of the training set. However, covering all modes of the training distribution forces models to over-generalize and assign probability mass to non-realistic images due to their lack of flexibility, as illustrated in Figure 1(a). Limiting factors in such models include the use of fully factorized decoders in variational autoencoders, and the restriction to the class of fully invertible functions in RealNVP. Addressing these limitations is key to improving sample quality.

Adversarial training, on the other hand, pushes samples to be indistinguishable from training images, at the expense of covering the full support of the training distribution. This phenomenon, known as “mode collapse” (Arjovsky et al., 2017), is illustrated in Figure 1(b). Moreover, adversarial models have a low-dimensional support, so that held-out data typically has zero probability under the learned model. This, together with the lack of an inference mechanism, prevents the use of likelihood to assess coverage of held-out data, and thus complicates the evaluation of GANs.
Contribution. Prior attempts have been made to leverage the complementarity of quality and coverage driven training using an inference network, for instance the VAE-GAN model (Larsen et al., 2016), and approaches that learn an inference network adversarially (Dumoulin et al., 2017a; Donahue et al., 2017; Ulyanov et al., 2018). In contrast to these approaches, our model is directly optimized on a valid measure of log-likelihood performance in the RGB space, which we then report on a held-out dataset. As illustrated in Figure 2, our model uses non-volume-preserving invertible transformations close to the output, optimized to increase the volume of data points. This relaxes naive independence assumptions on pixels given the latent variables, which are typical in VAEs. The invertibility of the feature mapping is a crucial difference with Larsen et al. (2016), as it enables likelihood computations and ensures that separate data points cannot collapse in feature space. Experimental results show this extension to be beneficial for both the sample quality and the likelihood of held-out data. An adversarial loss is then used to explicitly optimize the sample quality.
We experimentally validate our approach on the CIFAR-10 dataset. Using the same architecture, our proposed model yields substantially improved samples over VAE models, as measured by the IS and FID scores, and improved likelihoods compared to a modified GAN model. Our model significantly improves upon existing hybrid models, producing GAN-like samples, and IS and FID scores that are competitive with fully adversarial models, while offering likelihoods on held-out data comparable to recent likelihood-based methods. We further confirm these observations with qualitative and quantitative experimental results on the CelebA, STL-10, ImageNet, and LSUN-Bedrooms datasets. We are the first to report IS and FID scores together with held-out likelihoods on all these five datasets. We also assess the performance of conditional versions of our models with the data-augmentation-based GAN evaluation procedure proposed in Shmelkov et al. (2018).
2 Related work
The complementary properties of autoencoders and GANs have motivated several hybrid approaches. The VAE-GAN model of Larsen et al. (2016) uses the intermediate layers of a GAN discriminator as target space for the VAE. This model goes beyond the pixel-wise reconstruction loss. A drawback of this model, however, is that it does not define a density model in the image space.
Another line of research has focused on using adversarial methods to train an inference network. Dumoulin et al. (2017a) and Donahue et al. (2017) show that it is possible to learn an encoder and decoder model with a fully adversarial procedure. Given pairs (x, z) of images and latent variable vectors, a discriminator has to predict whether z was encoded from a real image x, or whether x was decoded from a z sampled from the prior. A similar approach is taken by Chen et al. (2018), who showed that it is possible to approximate the symmetric KL divergence in a fully adversarial setup, and additionally use reconstruction losses to improve the correspondence between reconstructed and target variables for both x and z. Along the same line of research, Ulyanov et al. (2018) have shown that it is possible to collapse the encoder and the discriminator into one network that encodes both real images and generated samples, and tries to spread their posteriors apart. Rosca et al. (2017) use a discriminator, via the density ratio trick, to replace the Kullback-Leibler divergence terms in the variational lower bound used to train VAEs.
Makhzani et al. (2016) show that the regularization term on the latent variables of a VAE can be replaced with a discriminator that compares latent variables from the prior and from the posterior. This regularization is more flexible than the standard Kullback-Leibler divergence, but does not lead to a valid density measure on images either.

In the generative adversarial networks literature, mode collapse has recently received considerable attention as one of the main failure modes of GANs. One line of research focuses on allowing the discriminator to access batch statistics of generated images, as pioneered by Salimans et al. (2016); Karras et al. (2018), and further generalized by Lucas et al. (2018); Lin et al. (2018). This is based on the idea that a batch of samples should behave like a batch of real images. For this to happen, individual samples should look realistic, but should also have as much variability as a batch of real images, which approximates a form of coverage-driven training as the batch size increases. These approaches, however, do not define a likelihood or any other measure to assess the model on held-out data, and suffer from typical adversarial training instabilities.
In our work, in contrast to these approaches, we focus on models that define a valid likelihood measure in the data space, which we report on held-out data. To achieve this, we leverage the merits of invertible transformations together with the VAE framework. This allows us to avoid the severely limiting conditional independence assumption commonly made in VAEs. Some recent work (Gulrajani et al., 2017b; Chen et al., 2017; Lucas & Verbeek, 2018) has proposed using autoregressive decoders to go beyond the VAE independence assumption in pixel space. These, however, are not amenable to adversarial training, since they require prohibitively slow sequential pixel sampling.
3 Preliminaries
In this section we briefly review coverage and quality driven training, and their respective shortcomings.
3.1 Maximum-likelihood and over-generalization
The de-facto standard approach for training generative models is maximum-likelihood estimation (MLE). It maximizes the probability of data sampled from an unknown data generating distribution p* under the model p_θ w.r.t. the model parameters θ, which is equivalent to minimizing the Kullback-Leibler (KL) divergence, D_KL(p* ‖ p_θ), between p* and p_θ. This yields models that tend to cover all the modes of the data, but put mass in spurious regions of the target space; a phenomenon known as over-generalization (Bishop, 2006), and manifested by unrealistic samples in the context of generative image models, see Figure 1(a).
Over-generalization is inherent to the optimization of the KL divergence. Real images are sampled from p*, and p_θ is explicitly optimized to cover all of them. The training procedure, however, fails to sample from p_θ and evaluate the quality of these samples, ideally using the inaccessible p* as a score. Therefore p_θ may put mass in spurious regions of the space without being heavily penalized. We refer to this kind of training procedure as “coverage-driven training” (CDT). This optimizes a loss of the form L_C = E_{x∼p*}[C(x, p_θ)], where C(x, p_θ) evaluates how well a sample x from p* is covered by the model.
Explicitly evaluating sample quality is redundant in the regime of models with infinite capacity and infinite training data. Indeed, putting mass on spurious regions takes it away from the support of p*, and thus reduces the likelihood of the training data. In practice, however, datasets and model capacity are finite, and models need to put mass outside the finite training set in order to generalize. The maximum likelihood criterion, by construction, only measures how much mass goes off the training data, not where it goes. In classic MLE, generalization is controlled in two ways: (i) inductive bias, in the form of model architecture, controls where the off-dataset mass goes, and (ii) regularization controls to what extent this happens. An adversarial loss, which considers samples from the model p_θ, can provide a second handle to control where the off-dataset mass goes. In contrast to model architecture design, an adversarial loss provides a “trainable” form of inductive bias.
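The asymmetry of the MLE criterion is easy to see on a toy example. The sketch below (a hypothetical 4-state distribution, not from the paper) shows that the forward KL minimized by MLE penalizes a model that drops a mode far more than one that over-generalizes:

```python
import numpy as np

def kl(p, q):
    # Discrete KL divergence D_KL(p || q); assumes q > 0 wherever p > 0.
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

# A bimodal target distribution p* over 4 states.
p_star = np.array([0.5, 0.0, 0.0, 0.5])

# A model that spreads mass everywhere (over-generalizes), and a model
# that collapses onto a single mode.
q_cover = np.array([0.3, 0.2, 0.2, 0.3])
q_collapse = np.array([0.98, 0.01, 0.005, 0.005])

# Forward KL (the MLE objective) tolerates spurious mass, but harshly
# penalizes near-zero mass on a mode of p*.
print(kl(p_star, q_cover))     # moderate penalty for over-generalizing
print(kl(p_star, q_collapse))  # much larger penalty for mode dropping
```

The over-generalizing model wastes a large fraction of its mass off the support of p*, yet incurs the smaller loss; this is precisely why MLE does not, by itself, control where the off-dataset mass goes.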
3.2 Adversarial models and mode collapse
Adversarially trained models, e.g., GANs (Goodfellow et al., 2014), produce samples of excellent quality compared to MLE. However, their main drawback is that they typically do not cover the full support of the data. This well-recognized phenomenon is known as “mode dropping” (Bishop, 2006; Arjovsky et al., 2017). Another drawback is that they do not provide a measure to assess mode dropping, or their quantitative performance in general. The reasons for this are twofold. First, defining a valid likelihood requires adding volume to the low-dimensional manifold learned by GANs, so as to define a density under which training and test data have non-zero probability. Second, computing the density of a data point under the defined probability distribution requires marginalizing out the latent variables, which is not trivial in the absence of an efficient inference mechanism.
When a human expert subjectively evaluates the quality of generated images, samples from the model are compared to the expert’s implicit approximation of p*. This type of objective can be written as L_Q = E_{x∼p_θ}[Q(x, p*)], and we refer to it as “quality-driven training” (QDT). To see that GANs use this type of training, recall that the discriminator D is trained with the loss (Goodfellow et al., 2014)
L_D = E_{x∼p*}[log D(x)] + E_{x∼p_θ}[log(1 − D(x))].   (1)
It is easy to show that the optimal discriminator equals D*(x) = p*(x) / (p*(x) + p_θ(x)). Substituting the optimal discriminator, L_D equals (up to additive and multiplicative constants) the Jensen-Shannon divergence
JS(p* ‖ p_θ) = ½ D_KL(p* ‖ m) + ½ D_KL(p_θ ‖ m),  with m = (p* + p_θ)/2.   (2)
This loss, approximated by the discriminator, is symmetric and contains two KL divergence terms. Note that D_KL(p* ‖ m) is an integral over p*, and is thus coverage driven. However, the term that approximates it in Eq. (1), i.e., E_{x∼p*}[log D(x)], is independent of the generative model, and disappears when differentiating. Therefore, it cannot be used to perform coverage-driven training, and the generator is trained to minimize E_{z∼p(z)}[log(1 − D(G(z)))] (or equivalently, to maximize E_{z∼p(z)}[log D(G(z))] (Goodfellow et al., 2014)), where G is the deterministic generator that maps latent variables to the data space. Assuming D = D*, this yields
E_{x∼p_θ}[log(1 − D*(x))] = E_{x∼p_θ}[log(p_θ(x) / (p*(x) + p_θ(x)))],   (3)
which is a quality-driven criterion. Both human assessment of generative models and the GAN objective are quality-driven, and thus favor quality over support coverage. This may explain why the quality of images produced by GANs typically correlates well with human judgment.
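The reduction of the discriminator loss at its optimum to the Jensen-Shannon divergence can be checked numerically; a minimal sketch on discrete toy distributions (illustrative values, not from the paper):

```python
import numpy as np

def js(p, q):
    # Jensen-Shannon divergence between discrete distributions.
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * np.log(a / b)))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = np.array([0.7, 0.2, 0.1])  # data distribution p*
q = np.array([0.1, 0.3, 0.6])  # model distribution p_theta

d_star = p / (p + q)           # optimal discriminator, state by state

# Discriminator objective at the optimum:
# E_{p*}[log D*] + E_{p_theta}[log(1 - D*)]
loss = float(np.sum(p * np.log(d_star)) + np.sum(q * np.log(1.0 - d_star)))

# Identity: L_D(D*) = 2 * JS(p*, p_theta) - log 4
assert np.isclose(loss, 2.0 * js(p, q) - np.log(4.0))
```

The identity holds exactly because log D*(x) = log(p*/(p* + p_θ)) and log(1 − D*(x)) = log(p_θ/(p* + p_θ)), which are the integrands of the two KL terms in Eq. (2) up to the constant log 2.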
4 Our approach
In this section we describe our generative model, and how we train it to ensure both quality and coverage.
4.1 Partially Invertible Variational Autoencoders
Adversarial training requires sampling from the model throughout training. As VAEs and flow-based models allow for efficient feed-forward sampling, they are suitable likelihood-based models to build our approach on. VAEs rely on an inference network q(z|x), or “encoder”, to construct a variational evidence lower bound (ELBO),
log p_θ(x) ≥ E_{q(z|x)}[log p_θ(x|z)] − D_KL(q(z|x) ‖ p(z)).   (4)
The “decoder” p_θ(x|z) has a convolutional architecture similar to that of a GAN generator, with the exception that the decoder maps the latent variable z to a distribution over images, rather than to a single image. Another difference is that in a VAE the prior p(z) is typically learned, and more flexible than in a GAN, which can significantly improve the likelihoods on held-out data (Kingma et al., 2016). Our generative model uses a latent variable hierarchy with top-down sampling similar to Sønderby et al. (2016); Bachman (2016); Kingma et al. (2016), see Appendix A.1. It also leverages inverse autoregressive flow (Kingma et al., 2016) to obtain accurate posterior approximations, beyond the commonly used factorized Gaussian approximations, see Appendix A.2.
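As a sanity check on the KL regularizer appearing in Eq. (4), the sketch below (a toy 2-dimensional diagonal-Gaussian posterior with illustrative parameters) compares its closed-form value against a Monte Carlo estimate obtained via the reparameterization trick:

```python
import numpy as np

rng = np.random.default_rng(0)

# Diagonal Gaussian posterior q(z|x) = N(mu, diag(sigma^2)), prior p(z) = N(0, I).
mu = np.array([0.5, -1.0])
log_sigma = np.array([-0.2, 0.1])
sigma = np.exp(log_sigma)

# Closed-form KL(q || p) for the Gaussian case.
kl_closed = 0.5 * np.sum(sigma**2 + mu**2 - 1.0 - 2.0 * log_sigma)

# Monte Carlo estimate: z = mu + sigma * eps, with eps ~ N(0, I).
z = mu + sigma * rng.standard_normal((200_000, 2))
log_q = -0.5 * np.sum(((z - mu) / sigma) ** 2 + 2 * log_sigma + np.log(2 * np.pi), axis=1)
log_p = -0.5 * np.sum(z**2 + np.log(2 * np.pi), axis=1)
kl_mc = np.mean(log_q - log_p)

assert abs(kl_closed - kl_mc) < 0.05
```

For richer posteriors such as IAF, the closed form is unavailable, but the same Monte Carlo estimator of log q(z|x) − log p(z) still applies, which is what makes such flows easy to plug into the ELBO.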
Typically, strong independence assumptions are also made on the decoder, e.g. by constraining it to a fully factorized Gaussian, i.e. p_θ(x|z) = ∏_i N(x_i; μ_i(z), σ²). In this case, all dependency structure across the pixels has to be modeled by the latent variable z; any correlations not captured by z are treated as independent per-pixel noise. Unless z captures each and every aspect of the image structure, this is a poor model for natural images, and leads the model to over-generalize with independent per-pixel noise around blurry non-realistic examples. Using the decoder to produce a sparse Cholesky decomposition of the inverse covariance matrix alleviates this problem to some extent (Dorta et al., 2018), but retains a limiting assumption of linear-Gaussian dependency across pixel values.
Flow-based models offer a more flexible alternative, allowing the model to depart from Gaussian or other parametric distributions. Models such as real NVP (Dinh et al., 2017) map an image x from RGB space to a latent code z using a bijection z = f(x), and rely on the change of variables formula to compute the likelihood
log p(x) = log p_z(f(x)) + log |det(∂f(x)/∂x)|.   (5)
To sample, we first sample z from a parametric prior, e.g. a unit Gaussian, and use the inverse mapping x = f⁻¹(z) to find the corresponding image. Despite allowing for exact inference and efficient sampling, current flow-based approaches are worse than state-of-the-art likelihood-based approaches in terms of held-out likelihood and sample quality.
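A minimal sketch of a single affine coupling layer of the kind used in such flows illustrates both the exact invertibility and the log-determinant term of Eq. (5) (the toy scale and shift functions stand in for learned networks):

```python
import numpy as np

def coupling_forward(x, scale_net, shift_net):
    # Affine coupling: keep the first half, affinely transform the second.
    x1, x2 = x[: len(x) // 2], x[len(x) // 2 :]
    s, t = scale_net(x1), shift_net(x1)
    y2 = x2 * np.exp(s) + t
    log_det = float(np.sum(s))  # log |det(dy/dx)| of this bijection
    return np.concatenate([x1, y2]), log_det

def coupling_inverse(y, scale_net, shift_net):
    y1, y2 = y[: len(y) // 2], y[len(y) // 2 :]
    s, t = scale_net(y1), shift_net(y1)
    return np.concatenate([y1, (y2 - t) * np.exp(-s)])

# Toy stand-ins for the learned networks: any functions of the kept half work.
scale_net = lambda h: np.tanh(h)
shift_net = lambda h: 0.5 * h

x = np.array([0.3, -1.2, 0.8, 2.0])
y, log_det = coupling_forward(x, scale_net, shift_net)
assert np.allclose(x, coupling_inverse(y, scale_net, shift_net))  # exactly invertible

# Change of variables with a unit Gaussian prior on y:
log_px = -0.5 * np.sum(y**2 + np.log(2 * np.pi)) + log_det
```

Because the scale and shift depend only on the untouched half, the Jacobian is triangular and its log-determinant is simply the sum of the scales, which keeps likelihood evaluation cheap.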
In our model we use invertible network layers to map RGB images x to an abstract feature space y = f(x). A VAE is then trained to model the distribution of y. This results in a non-factorial and non-parametric form of p_θ(x|z) in the space of RGB images. See Figure 2 for a schematic illustration of the model. Although the likelihood of this model is intractable to compute, we can rely on a lower bound for training:
log p_θ(x) ≥ E_{q(z|y)}[log p_θ(y|z)] − D_KL(q(z|y) ‖ p(z)) + log |det(∂f(x)/∂x)|,  with y = f(x).   (6)
The bound is obtained by combining the VAE variational lower bound of Eq. (4) with the change of variables formula of Eq. (5). Our model combines benefits from VAEs and real NVP: it uses efficient non-invertible (convolutional) layers as in a VAE, while using a limited number of invertible layers as in real NVP to avoid factorization in the conditional distribution p_θ(x|z). An alternative interpretation of our model is to see it as a variant of real NVP with a complex non-parametric prior distribution rather than a unit Gaussian. The Jacobian term in Eq. (6) pushes the model to increase the volume around training images in feature space, while the VAE measures their density in that space. Experimentally, we find our partially invertible non-factorial decoder to improve both sample quality and the likelihood of held-out data.
4.2 Improving samples with adversarial training
When optimizing L_C, the regularization term drives the posterior and the prior closer together. Ideally, the posterior marginalized across real images matches the prior, i.e. E_{x∼p*}[q(z|x)] = p(z). If this is the case, latent variables z ∼ p(z), mapped through the feed-forward decoder, should result in realistic samples. Adversarial training can be leveraged for quality-driven training of the prior, thus enriching its training signal as previously discussed.
For quality-driven training, we train a discriminator using the modified objective proposed by Sønderby et al. (2017), which combines both generator losses considered by Goodfellow et al. (2014):
L_Q = E_{z∼p(z)}[log(1 − D(G(z))) − log D(G(z))].   (7)
Assuming the discriminator is trained to optimality at every step, it is easy to demonstrate that the generator is then trained to minimize D_KL(p_θ ‖ p*). To regularize the training of the discriminator, we use the gradient penalty introduced by Gulrajani et al. (2017a); see App. A.3 for details.
The training procedure, written as an algorithm in Appendix E, alternates between two steps, similar to that of GANs. In the first step, the discriminator is trained to maximize the objective of Eq. (1), bringing it closer to its optimal value D*. In the second step, the generative model is trained to minimize L_C + L_Q, the sum of the coverage-based loss in Eq. (6) and the quality-based loss in Eq. (7). Assuming that the discriminator is trained to optimality at every step, the generator is trained to minimize a bound on the sum of two symmetric KL divergences:

L_C + L_Q ≥ D_KL(p* ‖ p_θ) + D_KL(p_θ ‖ p*) + H(p*),

where the entropy of the data generating distribution, H(p*), is an additive constant that does not depend on the learned generative model p_θ.
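The claim that the combined generator loss of Eq. (7) reduces to a reverse KL divergence at the optimal discriminator can be verified numerically; a sketch on discrete toy distributions (illustrative values, not from the paper):

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])  # data distribution p*
q = np.array([0.2, 0.3, 0.5])  # generator distribution p_theta
d = p / (p + q)                # optimal discriminator D*

# Combined generator loss: E_{p_theta}[log(1 - D) - log D].
# At D = D*, log((1 - D*)/D*) = log(p_theta/p*), so this equals KL(p_theta || p*).
gen_loss = float(np.sum(q * (np.log(1.0 - d) - np.log(d))))
kl_qp = float(np.sum(q * np.log(q / p)))
assert np.isclose(gen_loss, kl_qp)
```

Together with the coverage loss, which bounds the forward KL up to the constant H(p*), this is what yields the symmetric-KL bound above.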
5 Experimental evaluation
Below, we present our evaluation protocol (Section 5.1), followed by an ablation study to assess the importance of the components of our model (Section 5.2). In Section 5.3 we improve quantitative and qualitative performance using recent advances from the VAE and GAN literature. We then compare to the state of the art on the CIFAR-10 and STL-10 datasets (Section 5.4), and present additional results at higher resolutions and on other datasets (Section 5.5). Finally, we evaluate a class-conditional version of our model using the image classification framework of Shmelkov et al. (2018).
5.1 Evaluation protocol
To evaluate how well models cover held-out data, we use the bits per dimension (BPD) measure. This measure is defined as the negative log-likelihood on held-out data, averaged across pixels and color channels (Dinh et al., 2017). Due to their degenerate low-dimensional support, GANs do not define a valid density in the image space, which prevents measuring BPD. To endow a GAN with full support and a valid likelihood, we train a VAE “around it”. In particular, we train an isotropic noise parameter that does not depend on the latent variable, as in our VAE decoder, as well as an inference network. While training these, the weights of the GAN generator are kept fixed. For both GANs and VAEs, we use the inference network to compute a lower bound to approximate the likelihood, i.e. an upper bound on BPD.
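Concretely, BPD divides the negative log-likelihood in nats by the number of dimensions and converts it to base 2; a minimal sketch (the NLL value below is illustrative):

```python
import numpy as np

def bits_per_dim(nll_nats, image_shape=(3, 32, 32)):
    # Negative log-likelihood in nats, averaged over all pixels and color
    # channels, and converted from nats to bits.
    num_dims = int(np.prod(image_shape))
    return nll_nats / (num_dims * np.log(2.0))

# E.g. a CIFAR-10 image (3 x 32 x 32 = 3072 dimensions) with an NLL of
# 8000 nats corresponds to roughly 3.76 bits per dimension.
print(bits_per_dim(8000.0))
```

For a VAE, plugging the negative ELBO in for the NLL makes the reported BPD an upper bound on the true value, as noted above.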
To evaluate sample quality, we report the Fréchet inception distance (FID) (Heusel et al., 2017) and the inception score (IS) (Salimans et al., 2016), which are commonly used to quantitatively evaluate GANs (Zhang et al., 2018; Brock et al., 2019). Although IS and FID are used to evaluate sample quality, these metrics are also sensitive to coverage. In fact, any metric evaluating sample quality only would be degenerate, as collapsing to the mode of the target distribution would maximize it. In practice, however, both metrics correlate more strongly with sample quality than with support coverage; see Appendix B. We evaluate all measures using held-out data not used during training, which improves over common practice in the GAN literature, where training data is often used for evaluation.
5.2 Comparison to GAN and VAE baselines
Experimental setup. We evaluate our approach on the CIFAR-10 dataset, using 50k/10k train/test images of 32×32 pixels (standard split). We train our GAN baseline to optimize the adversarial loss L_Q, and use the architecture of SNGAN (Miyato et al., 2018), which is stable and trains quickly. The same architecture and training hyper-parameters are used for all models in these experiments; see Appendix A for details.
We train our VAE baseline by optimizing the coverage loss L_C. We use the GAN generator architecture for the decoder, which produces the mean of a factorized Gaussian distribution over pixel RGB values. We add a trainable isotropic variance to ensure a valid density model. In the VAE model some feature maps in the decoder are treated as conditional latent variables, allowing for hierarchical top-down sampling. Experimentally, we find that similar top-down sampling is not effective for the GAN model.





[Figure 3: Samples from the VAE, piVAE, CQG, GAN, and CQFG (Ours) models.]
To train the generator for both coverage and quality, we optimize the sum of L_C and L_Q. We refer to the model trained in this way as CQG, and to the model that also includes invertible layers in the decoder as CQFG. The small invertible model uses a single scale with three invertible layers, each composed of two residual blocks, and increases the number of weights in the generator by roughly 1.4%, so we also slightly increase the width of the generator in the CQG version for a fair comparison. All implementation details can be found in Appendix A. Our code will also be released upon publication for reproducibility.
Analysis of results. From the experimental results in Table 1 we make several observations. As expected, the GAN baseline yields better sample quality (IS and FID) than the VAE baseline, e.g. obtaining inception scores of 6.8 and 2.0, respectively. Conversely, the VAE achieves better coverage, with a BPD of 4.4, compared to an estimated 7.0 for the GAN. The same generator trained for both quality and coverage, CQG, achieves the same BPD as the VAE baseline. The sample quality of this model is in between that of the GAN and the VAE baselines. In Figure 3 we show samples from the different models, and these confirm the quantitative observations.
Table 1: GAN and VAE baselines compared to our models on CIFAR-10. “Flow” indicates invertible layers in the decoder; brackets denote the approximated BPD of the GAN.

Model    Flow   BPD     IS    FID
GAN      –      [7.0]   6.8   31.4
VAE      –      4.4     2.0   171.0
piVAE    yes    3.5     3.0   112.0
CQG      –      4.4     5.1   58.6
CQFG     yes    3.9     7.1   28.0
Adding the invertible layers to the VAE decoder, while keeping maximum likelihood training with L_C (piVAE), improves sample quality, with IS increasing from 2.0 to 3.0 and FID dropping from 171.0 to 112.0. Note that this quantitative sample quality remains below that of the GAN baseline and our CQG model. When we combine the non-factorial decoder with coverage and quality driven training, CQFG, we obtain quantitative sample quality that is somewhat better than that of the GAN baseline: IS improving from 6.8 to 7.1, and FID decreasing from 31.4 to 28.0. The samples in Figure 3 confirm the high sample quality of the CQFG model. Note that the CQFG model also achieves a better BPD than the VAE baseline. These experimental observations demonstrate the importance of our contributions: our non-factorial decoder trained for coverage and quality improves both VAE and GAN in terms of held-out likelihood, and improves VAE sample quality to, or slightly beyond, that of the GAN.
In Appendix C we analyze the impact of the number of layers and scales in the invertible part of the decoder. In Appendix D we also provide reconstructions qualitatively demonstrating the inference abilities of our CQFG model. As is typical with expressive VAE models, ground-truth images and reconstructions are indistinguishable to the eye.
[Figure 4: Model samples and train images on CIFAR-10 (top) and STL-10 (bottom).]
5.3 Evaluation of architectural refinements
Table 2: Effect of architectural refinements on CIFAR-10: inverse autoregressive flow (IAF) and residual connections (Res).

Model            IAF   Res   BPD     IS    FID
GAN              –     –     [7.0]   6.8   31.4
GAN              –     yes   —       7.4   24.0
CQFG             –     –     3.9     7.1   28.0
CQFG             yes   –     3.8     7.5   26.0
CQFG             yes   yes   3.8     7.9   20.1
CQFG (large D)   yes   yes   3.8     8.1   18.6
To further improve quantitative and qualitative performance, we proceed to include two recent advances from the VAE and GAN literature. First, Gulrajani et al. (2017a) have shown a deeper discriminator with residual connections to be beneficial to training. When using such improved discriminators, we find it useful to make similar modifications to the generator. Second, Kingma et al. (2016) improve VAE encoders by introducing inverse autoregressive flow (IAF), which allows for more accurate posterior approximations that go beyond the commonly used factorized Gaussian approximations.

The results in Table 2 show consistent improvements across all metrics when adding residual connections and IAF. Increasing the size of the discriminator (denoted “large D”) yields further improvements in IS and FID, while slightly degrading the BPD from 3.74 to 3.77. These results show that our model benefits from recent architectural advances in GANs and VAEs. In the remainder of our experiments we use the CQFG (large D) model.
5.4 Comparison to the state of the art
In Table 3 we compare the performance of our models with previously proposed hybrid approaches, as well as state-of-the-art adversarial and likelihood-based models. Many entries in the table are missing, since the likelihood of held-out data is not defined for most adversarial methods, and most likelihood-based models do not report IS or FID scores. We present results for two variants of our CQFG model: the large-D variant from Table 2, as well as a variant that uses two scales in the invertible layers rather than one, denoted “S2”; see Appendix C for details. The latter model achieves better BPD at the expense of worse IS and FID.
Compared to the best hybrid approaches, our large-D model yields a substantial improvement in IS to 8.1, while our S2 model yields a comparable value of 6.9. Compared to adversarially trained models, our large-D model obtains results that are comparable to the best results obtained by SNGAN using residual connections and hinge loss. We note that the use of spectral normalization and hinge loss for adversarial training could potentially improve our results, but we leave this for future work. Our S2 model is comparable to the basic SNGAN (somewhat better FID, somewhat worse IS), which does not use residual connections and hinge loss. On STL-10 (48×48 pixels) our models, trained using 100k/8k train/test images, also achieve competitive IS and FID scores, being outperformed only by SNGAN (Res-Hinge).
Using our S2 model we obtain a BPD of 3.5 that is comparable to RealNVP, while our large-D model obtains a slightly worse value of 3.7. We computed IS and FID scores for VAE-IAF and PixelCNN++ using publicly released code and parameters. We find that these IS and FID scores are substantially worse than those we measured for both of our model variants. To the best of our knowledge, we are the first to report BPD measurements on STL-10, and therefore cannot compare to previous work on this metric.
We display samples from our CQFG (large D) model on both datasets in Figure 4.
Table 3: Comparison to hybrid, adversarial, and likelihood-based models on CIFAR-10 and STL-10. Brackets denote approximated values; one hybrid model name was lost in extraction and is shown as “—”.

                          CIFAR-10                 STL-10
Model                     BPD    IS      FID       BPD    IS     FID
Hybrid models
AGE                       —      5.9     —         —      —      —
ALI                       —      5.3     —         —      —      —
SVAE                      —      6.8     —         —      —      —
—                         —      6.8     —         —      —      —
SVAE-r                    —      7.0     —         —      —      —
CQFG (Ours)               3.7    8.1     18.6      4.0    8.6    52.7
CQFG (S2) (Ours)          3.5    6.9     28.9      3.8    8.6    52.1
Adversarial models
SNGAN                     —      7.4     29.3      —      8.3    53.1
BatchGAN                  —      7.5     23.7      —      8.7    51
WGAN-GP                   —      7.9     —         —      —      —
SNGAN (Res-Hinge)         —      8.2     21.7      —      9.1    40.1
Likelihood-based models
RealNVP                   3.5    —       —         —      —      —
VAE-IAF                   3.1    [3.8]   [73.5]    —      —      —
PixelCNN++                2.9    [5.4]   [121.3]   —      —      —
5.5 Results on additional datasets
To further validate our approach we train our CQFG (large D) model on three additional datasets, and on STL-10 at 96×96 resolution. The architecture and training procedure are unchanged from the preceding experiments, up to the addition of convolutional layers to adapt to the increased resolution. For the CelebA dataset we used 196k/6.4k train/test images, and used central image crops of both 178 and 96 pixels. We also train on STL-10 (100k/8k train/test images) resized to 96×96, on the LSUN-Bedrooms dataset (3M/300 train/test images), and on ImageNet (1.2M/50k train/test images) resized to a lower resolution.
We show samples and train images for these datasets in Figure 5, and quantitative evaluation results in Table 4. The fact that our model works without changing the architecture and training hyper-parameters shows the stability of our approach. On the CelebA and LSUN datasets, our CQFG generator produces compelling samples despite the high resolution of the images. The samples for STL-10 and ImageNet are less realistic due to the larger variability in these datasets; recall that we do not condition on class labels for generation. On CelebA, all scores improve significantly when using central crops of 96 pixels, due to the reduced variability in the smaller crop, which removes the background from the images.
[Figure 5: Samples (left) and real images (right) for the additional datasets.]
Table 4: Results of our CQFG (large D) model on additional datasets.

Dataset            BPD    IS    FID
CelebA, crop 178   2.85   —     24.3
CelebA, crop 96    2.45   —     13.8
STL-10             3.85   8.8   100.8
ImageNet           4.90   7.6   69.9
LSUN-Bedrooms      4.01   —     61.9
5.6 Class-conditional results
We also evaluate a class-conditional model, relying on two measures recently proposed by Shmelkov et al. (2018). The first measure, GAN-test, is obtained by training a classifier on natural image data and evaluating it on the samples of a class-conditional generative model. This measure is sensitive to sample quality only. The second measure, GAN-train, is obtained by training a classifier on generated samples and evaluating it on natural images. This measure is sensitive to both quality and coverage. For a given GAN-test level, variations in GAN-train can be attributed to differences in coverage.
To perform this evaluation we develop a class-conditional version of our CQFG model. The discriminator is conditioned using the projection-based class conditioning introduced by Miyato & Koyama (2018). GAN generators are typically made class-conditional using conditional batch normalization (De Vries et al., 2017; Dumoulin et al., 2017b); however, batch normalization is known to be detrimental in VAEs (Kingma et al., 2016), as we verified in practice. To address this issue, we propose conditional weight normalization (CWN). As in weight normalization (Salimans & Kingma, 2016), we separate the training of the scale and the direction of the weight matrix. Additionally, the scaling factor of the weight matrix is conditioned on the class label y:

w_y = γ_y · v / ‖v‖.   (8)
We also make the network biases conditional on the class label. Otherwise, the architecture is the same one used for the experiments in Table 1.
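A minimal sketch of a dense layer with such conditional weight normalization (the class and variable names here are illustrative, not the implementation used in the experiments): the direction of each weight vector is shared across classes, while its scale and the bias are looked up per class label y.

```python
import numpy as np

rng = np.random.default_rng(0)

class CondWeightNormDense:
    def __init__(self, in_dim, out_dim, num_classes):
        self.v = rng.standard_normal((out_dim, in_dim))  # shared directions
        self.gamma = np.ones((num_classes, out_dim))     # per-class scales
        self.bias = np.zeros((num_classes, out_dim))     # per-class biases

    def __call__(self, x, y):
        direction = self.v / np.linalg.norm(self.v, axis=1, keepdims=True)
        w = self.gamma[y][:, None] * direction           # w_y = gamma_y * v/||v||
        return w @ x + self.bias[y]

layer = CondWeightNormDense(in_dim=4, out_dim=3, num_classes=10)
x = rng.standard_normal(4)
out0 = layer(x, 0)

# Doubling the scales of class 7 doubles its output (the bias is zero here),
# while class 0 is unaffected.
layer.gamma[7] = 2.0
assert np.allclose(layer(x, 7), 2.0 * out0)
```

Only the per-class scale and bias vectors grow with the number of classes; the weight directions, which hold most of the parameters, remain shared.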
In Table 5 we report the GAN-train and GAN-test measures on CIFAR-10. Our CQFG model obtains a slightly higher GAN-test score than the GAN baseline, which shows that it achieves comparable if not better sample quality, in line with the results in terms of IS and FID scores in Section 5.2. Moreover, with CQFG we obtain a substantially better GAN-train score, going from 29.7% to 73.4%. Having established similar GAN-test performance, this demonstrates significantly improved sample diversity of the CQFG model as compared to the GAN baseline, and shows that coverage-driven training improves the coverage of the learned model.
Table 5: GAN-test and GAN-train accuracies on CIFAR-10.

Model   GAN-test (%)   GAN-train (%)
GAN     71.8           29.7
CQFG    76.9           73.4
DCGAN   58.2           65.0
6 Conclusion
We presented CQFG, a generative model that leverages invertible network layers to relax the conditional pixel independence assumption commonly made in VAE models. Since our model allows for efficient feed-forward sampling, we are able to train it using a maximum likelihood criterion that ensures coverage of the data generating distribution, as well as an adversarial criterion that ensures high sample quality. We provide quantitative and qualitative experimental results on a collection of five datasets (CIFAR-10, STL-10, CelebA, ImageNet and LSUN-Bedrooms). We obtain IS and FID scores comparable to state-of-the-art GAN models, and held-out likelihood scores that are comparable to recent pure likelihood-based models.
References
 Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein generative adversarial networks. In ICML, 2017.
 Bachman (2016) Bachman, P. An architecture for deep, hierarchical generative models. In NIPS, 2016.
 Bishop (2006) Bishop, C. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
 Brock et al. (2019) Brock, A., Donahue, J., and Simonyan, K. Large scale GAN training for high fidelity natural image synthesis. In ICLR, 2019.
 Chen et al. (2018) Chen, L., Dai, S., Pu, Y., Zhou, E., Li, C., Su, Q., Chen, C., and Carin, L. Symmetric variational autoencoder and connections to adversarial learning. In AISTATS, 2018.
 Chen et al. (2017) Chen, X., Kingma, D., Salimans, T., Duan, Y., Dhariwal, P., Schulman, J., Sutskever, I., and Abbeel, P. Variational lossy autoencoder. In ICLR, 2017.
 De Vries et al. (2017) De Vries, H., Strub, F., Mary, J., Larochelle, H., Pietquin, O., and Courville, A. C. Modulating early visual processing by language. In NIPS, 2017.
 Dinh et al. (2017) Dinh, L., SohlDickstein, J., and Bengio, S. Density estimation using real NVP. In ICLR, 2017.
 Donahue et al. (2017) Donahue, J., Krähenbühl, P., and Darrell, T. Adversarial feature learning. In ICLR, 2017.
 Dorta et al. (2018) Dorta, G., Vicente, S., Agapito, L., Campbell, N. D. F., and Simpson, I. Structured uncertainty prediction networks. In CVPR, 2018.
 Dumoulin et al. (2017a) Dumoulin, V., Belghazi, I., Poole, B., Lamb, A., Arjovsky, M., Mastropietro, O., and Courville, A. C. Adversarially learned inference. In ICLR, 2017a.
 Dumoulin et al. (2017b) Dumoulin, V., Shlens, J., and Kudlur, M. A learned representation for artistic style. In ICLR, 2017b.
 Goodfellow et al. (2014) Goodfellow, I., PougetAbadie, J., Mirza, M., Xu, B., WardeFarley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In NIPS, 2014.
 Gulrajani et al. (2017a) Gulrajani, I., Ahmed, F., Arjovsky, M., Dumoulin, V., and Courville, A. C. Improved training of Wasserstein GANs. In NIPS, 2017a.
 Gulrajani et al. (2017b) Gulrajani, I., Kumar, K., Ahmed, F., Taiga, A. A., Visin, F., Vazquez, D., and Courville, A. PixelVAE: A latent variable model for natural images. In ICLR, 2017b.
 Heusel et al. (2017) Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., and Hochreiter, S. GANs trained by a two timescale update rule converge to a local Nash equilibrium. In NIPS, 2017.
 Karras et al. (2018) Karras, T., Aila, T., Laine, S., and Lehtinen, J. Progressive growing of GANs for improved quality, stability, and variation. In ICLR, 2018.
 Kingma & Ba (2015) Kingma, D. and Ba, J. Adam: A method for stochastic optimization. In ICLR, 2015.
 Kingma & Welling (2014) Kingma, D. P. and Welling, M. Autoencoding variational Bayes. In ICLR, 2014.
 Kingma et al. (2016) Kingma, D. P., Salimans, T., Józefowicz, R., Chen, X., Sutskever, I., and Welling, M. Improving variational autoencoders with inverse autoregressive flow. In NIPS, 2016.
 Larsen et al. (2016) Larsen, A. B. L., Sønderby, S. K., and Winther, O. Autoencoding beyond pixels using a learned similarity metric. In ICML, 2016.
 Li et al. (2015) Li, Y., Swersky, K., and Zemel, R. S. Generative moment matching networks. In ICML, 2015.
 Lin et al. (2018) Lin, Z., Khetan, A., Fanti, G., and Oh, S. PacGAN: The power of two samples in generative adversarial networks. In NIPS, 2018.
 Lucas & Verbeek (2018) Lucas, T. and Verbeek, J. Auxiliary guided autoregressive variational autoencoders. In ECML, 2018.
 Lucas et al. (2018) Lucas, T., Tallec, C., Ollivier, Y., and Verbeek, J. Mixed batches and symmetric discriminators for GAN training. In ICML, 2018.
 Makhzani et al. (2016) Makhzani, A., Shlens, J., Jaitly, N., and Goodfellow, I. Adversarial autoencoders. In ICLR, 2016.
 Miyato & Koyama (2018) Miyato, T. and Koyama, M. cGANs with projection discriminator. In ICLR, 2018.
 Miyato et al. (2018) Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spectral normalization for generative adversarial networks. In ICLR, 2018.
 Rosca et al. (2017) Rosca, M., Lakshminarayanan, B., WardeFarley, D., and Mohamed, S. Variational approaches for autoencoding generative adversarial networks. arXiv preprint arXiv:1706.04987, 2017.

 Salimans & Kingma (2016) Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
 Salimans et al. (2016) Salimans, T., Goodfellow, I., Zaremba, W., Cheung, V., Radford, A., and Chen, X. Improved techniques for training GANs. In NIPS, 2016.
 Salimans et al. (2017) Salimans, T., Karpathy, A., Chen, X., and Kingma, D. P. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. In ICLR, 2017.
 Shmelkov et al. (2018) Shmelkov, K., Schmid, C., and Alahari, K. How good is my GAN? In ECCV, 2018.
 Sønderby et al. (2016) Sønderby, C. K., Raiko, T., Maaløe, L., Sønderby, S. K., and Winther, O. Ladder variational autoencoders. In NIPS, 2016.

 Sønderby et al. (2017) Sønderby, C. K., Caballero, J., Theis, L., Shi, W., and Huszár, F. Amortised MAP inference for image super-resolution. In ICLR, 2017.
 Thanh-Tung et al. (2019) Thanh-Tung, H., Tran, T., and Venkatesh, S. Improving generalization and stability of generative adversarial networks. In ICLR, 2019.
 Ulyanov et al. (2018) Ulyanov, D., Vedaldi, A., and Lempitsky, V. S. It takes (only) two: Adversarial generatorencoder networks. In AAAI, 2018.

 van den Oord et al. (2016) van den Oord, A., Kalchbrenner, N., and Kavukcuoglu, K. Pixel recurrent neural networks. In ICML, 2016.
 Zhang et al. (2018) Zhang, H., Goodfellow, I., Metaxas, D., and Odena, A. Self-attention generative adversarial networks. arXiv preprint arXiv:1805.08318, 2018.
Appendix A Model refinements and implementation details
A.1 Top-down sampling of hierarchical latent variables
Flexible priors and posteriors for the variational autoencoder model can be obtained by sampling hierarchical latent variables at different layers in the network. In the generative model $p(x, z)$, the latent variables can be split into $L$ groups $z_1, \ldots, z_L$, each one at a different layer, and the density over $z$ thus factorizes as:

$$p(z) = p(z_L) \prod_{l=1}^{L-1} p(z_l \mid z_{>l}).$$
Additionally, to allow the chain of latent variables to be sampled in the same order when encoding-decoding and when sampling, top-down sampling is used, as proposed in Sønderby et al. (2016); Bachman (2016); Kingma et al. (2016). With top-down sampling, the encoder (symmetric to the decoder) extracts deterministic features at different levels as the image is being encoded, constituting the bottom-up deterministic pass. While decoding the image, these previously extracted deterministic features are used for top-down sampling and help determine the posterior over latent variables at different depths in the decoder. These posteriors are also conditioned on the latent variables sampled at lower feature resolutions, using normal densities:

$$q(z_l \mid z_{>l}, x) = \mathcal{N}\big(z_l;\, \mu_l(z_{>l}, x),\, \sigma_l^2(z_{>l}, x)\big).$$
This constitutes the stochastic top-down pass.
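The stochastic top-down pass can be sketched as follows. The `posterior_params` callable is a hypothetical stand-in for the networks that merge a bottom-up feature with the latents already sampled at coarser levels; names and shapes are assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def top_down_sample(features, posterior_params, latent_dim):
    """Sketch of the stochastic top-down pass.

    features        : bottom-up deterministic features, ordered from the
                      coarsest (top) level to the finest.
    posterior_params: hypothetical callable (h, z_above) -> (mu, log_sigma)
                      standing in for the posterior networks.
    """
    z_above = np.zeros(latent_dim)          # no latents above the top level
    latents = []
    for h in features:
        mu, log_sigma = posterior_params(h, z_above)
        eps = rng.standard_normal(latent_dim)
        z = mu + np.exp(log_sigma) * eps    # reparametrization trick
        latents.append(z)
        z_above = z                         # condition the next level on z
    return latents
```

Each level is thus sampled conditioned both on its deterministic feature and on the latents from coarser levels, matching the normal posteriors above.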
A.2 Inverse autoregressive flow
To increase the flexibility of the posteriors over latent variables used in variational inference, Kingma et al. (2016) proposed a type of normalizing flow called inverse autoregressive flow (IAF). The main appeals of this normalizing flow are its scalability to high dimensionality and its ability to leverage autoregressive neural networks (such as those introduced in van den Oord et al. (2016)). First, a latent variable vector $z_0$ is sampled using the reparametrization trick (Kingma & Welling, 2014):

$$z_0 = \mu_0 + \sigma_0 \odot \epsilon, \qquad \epsilon \sim \mathcal{N}(0, I).$$
Then mean and variance parameters $\mu_t$ and $\sigma_t$ are computed as functions of $z_{t-1}$ using autoregressive models, and a new latent variable $z_t$ is obtained:

$$z_t = \mu_t(z_{t-1}) + \sigma_t(z_{t-1}) \odot z_{t-1}.$$
Because $\mu_t$ and $\sigma_t$ are implemented by autoregressive networks, the Jacobian is triangular with the values of $\sigma_t$ on the diagonal, and the density under the new latent variable remains efficient to compute. In theory this transformation can be repeated an arbitrary number of times for increased flexibility; in practice, a single step is typically used.
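A single IAF step can be sketched as below. Here `shift_net` and `scale_net` are stand-ins for the autoregressive networks computing $\mu_t$ and $\sigma_t$; a real implementation must ensure that output $i$ depends only on $z_{<i}$, which is what makes the Jacobian triangular.

```python
import numpy as np

def iaf_step(z, shift_net, scale_net):
    """One inverse autoregressive flow step (sketch).

    shift_net / scale_net: stand-ins for autoregressive networks;
    scale_net is assumed to return strictly positive values.
    """
    mu = shift_net(z)
    sigma = scale_net(z)
    z_new = mu + sigma * z             # z_t = mu_t + sigma_t * z_{t-1}
    log_det = np.sum(np.log(sigma))    # triangular Jacobian: product of sigma
    return z_new, log_det
```

The returned `log_det` is the log-determinant correction needed to evaluate the density of `z_new` from that of `z`.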
A.3 Gradient penalty
A body of work on Generative Adversarial Networks centers around the idea of regularizing the discriminator by enforcing Lipschitz continuity, for instance Miyato et al. (2018); Arjovsky et al. (2017); Gulrajani et al. (2017a); Thanh-Tung et al. (2019). In this work we use the approach of Gulrajani et al. (2017a), which enforces the Lipschitz constraint with a gradient penalty term added to the loss:

$$\mathcal{L}_{\mathrm{GP}} = \lambda \, \mathbb{E}_{\hat{x}} \Big[ \big( \lVert \nabla_{\hat{x}} D(\hat{x}) \rVert_2 - 1 \big)^2 \Big],$$

where $\hat{x}$ is obtained by interpolating between real and generated data:

$$\hat{x} = \epsilon x + (1 - \epsilon) \tilde{x}, \qquad \epsilon \sim \mathcal{U}[0, 1].$$

We add this term to the loss used to train the discriminator, which yields our quality-driven criterion.
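The interpolation and penalty term can be sketched as follows. The `disc_grad` callable is a hypothetical stand-in that returns the gradient of the discriminator at each input point; in practice this gradient is obtained by automatic differentiation, and the coefficient `lam` is a tunable hyperparameter.

```python
import numpy as np

rng = np.random.default_rng(0)

def gradient_penalty(x_real, x_fake, disc_grad, lam=10.0):
    """Sketch of the gradient penalty of Gulrajani et al. (2017a)."""
    eps = rng.uniform(size=(x_real.shape[0], 1))   # one epsilon per sample
    x_hat = eps * x_real + (1.0 - eps) * x_fake    # random interpolates
    grads = disc_grad(x_hat)                       # dD/dx_hat, per sample
    norms = np.linalg.norm(grads, axis=1)
    return lam * np.mean((norms - 1.0) ** 2)       # push ||grad|| toward 1
```

The penalty is zero exactly when the discriminator's gradient has unit norm at the interpolated points, i.e. when the Lipschitz constraint holds there.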
A.4 Architecture and training hyperparameters
We used Adamax (Kingma & Ba, 2015) with a learning rate of 0.002 for all experiments. All CIFAR-10 experiments use batch size 64; the higher-resolution experiments use batch size 32. To stabilize the adversarial training we use the gradient penalty (Gulrajani et al., 2017a) with coefficient 100, and one discriminator update per generator update. We experimented with different weighting coefficients between the two loss components, and found that a range of values for the coefficient on the adversarial component works well in practice. Within this range, no significant influence on the final performance of the model is observed, though the training dynamics early in training improve with higher values. With significantly smaller values, discriminator collapse was observed in a few isolated cases. All experiments reported here use the same coefficient.
For experiments with hierarchical latent variables we use a fixed number of latent variables per layer. In the generator we use the ELU nonlinearity; in the discriminator with residual blocks we use ReLU, while in the simple convolutional discriminator we use leaky ReLU with slope 0.2.
Unless stated otherwise we use three NVP layers with a single scale and two residual blocks that we train only with the likelihood loss. Regardless of the number of scales, the VAE decoder always outputs a tensor of the same dimension as the target image, which is then fed to the NVP layers. Just like in reference implementations we use both batch normalization and weight normalization in NVP and only weight normalization in IAF.
We use the reference implementations of IAF and NVP released by the authors.


Appendix B On the Inception Score and the Fréchet Inception Distance
Quantitative evaluation of Generative Adversarial Networks is complicated by the absence of a log-likelihood. The Inception Score (IS) and the Fréchet Inception Distance (FID) are metrics proposed by Salimans et al. (2016) and Heusel et al. (2017), respectively, to automate the qualitative evaluation of samples. Though imperfect, these metrics have been shown to correlate well with human judgement in practice, and it is standard in the GAN literature to use them for quantitative evaluation.
The Inception Score (IS) (Salimans et al., 2016) is a statistic of the generated images, based on an external deep network trained for classification on ImageNet:

$$\mathrm{IS} = \exp\Big(\mathbb{E}_{x \sim p_g}\big[ D_{\mathrm{KL}}\big(p(y \mid x) \,\Vert\, p(y)\big) \big]\Big),$$

where $x$ is sampled from the generative model $p_g$, $p(y \mid x)$ is the conditional class distribution obtained by applying the pretrained classification network to generated images, and $p(y)$ is the class marginal over generated images.
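Given classifier posteriors for a set of generated images, the score can be computed as below. This is a small-sample sketch; in practice the expectation is estimated over many thousands of samples, typically averaged over several splits.

```python
import numpy as np

def inception_score(class_probs, eps=1e-12):
    """Sketch of the Inception Score from classifier posteriors.

    class_probs: array of shape (n_samples, n_classes); row i holds
    p(y|x_i) for generated image x_i.
    """
    p_y = class_probs.mean(axis=0)                  # marginal p(y)
    kl = np.sum(class_probs * (np.log(class_probs + eps) - np.log(p_y + eps)),
                axis=1)                             # KL(p(y|x) || p(y)) per image
    return float(np.exp(kl.mean()))
```

Confident, diverse predictions maximize the score: one-hot posteriors spread evenly over all classes give an IS equal to the number of classes, while identical posteriors give an IS of 1.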
The Fréchet Inception Distance (FID) (Heusel et al., 2017) compares the distributions of Inception embeddings (activations from the penultimate layer of the Inception network) of real ($r$) and generated ($g$) images. Both of these distributions are modeled as multidimensional Gaussians parameterized by their respective mean and covariance. The distance measure is defined between the two Gaussian distributions as:

$$d^2\big((m_r, C_r), (m_g, C_g)\big) = \lVert m_r - m_g \rVert_2^2 + \mathrm{Tr}\big(C_r + C_g - 2 (C_r C_g)^{1/2}\big), \qquad (9)$$

where $m_r, C_r$ and $m_g, C_g$ denote the mean and covariance of the real and generated image distributions, respectively.
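A direct sketch of Equation (9), using `scipy.linalg.sqrtm` for the matrix square root (the means and covariances are assumed to have been estimated from Inception embeddings beforehand):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(mu_r, cov_r, mu_g, cov_g):
    """Sketch of the Frechet distance between two Gaussians,
    as used by FID on Inception embeddings."""
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):
        covmean = covmean.real      # discard numerical imaginary noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```

The distance is zero when both Gaussians coincide, and grows with any mismatch in mean or covariance.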
Practical Use. In practice, IS and FID correlate predominantly with the quality of samples. In the literature (mostly the generative adversarial networks literature, for instance Miyato et al. (2018)), they are considered to correlate well with human judgement of quality. An empirical indicator of this is that state-of-the-art likelihood-based models have poor IS and FID scores despite having good coverage, which shows that the low quality of their samples dominates. Conversely, state-of-the-art adversarial models have good IS and FID scores despite suffering from mode dropping (which strongly degrades BPD), so the score is determined mostly by the high quality of their samples. This is especially true when identical architectures and training budgets are considered, as in our first experiment in Section 5.2.
Table 7: IS and FID for random subsamples of the training set.

Split size   IS   FID
50k (full)
40k
30k
20k
10k
5k
2.5k
To obtain a quantitative estimate of how much entropy/coverage impacts the IS and FID measures, we evaluate the scores obtained by random subsamples of the dataset, such that quality is unchanged but coverage is progressively degraded (see details of the scores below). Table 7 shows that when using the full set of 50k images the FID is 0, as the distributions are identical. Notice that as the number of images decreases, IS is very stable (it can even increase, but by increments that fall below the statistical noise of the IS estimate). This is because the entropy of the distribution is not strongly impacted by subsampling, even though coverage is. FID is more sensitive, as it behaves more like a measure of coverage (it compares the two distributions). Nonetheless, the variations remain extremely low even when dropping most of the dataset, and FID stays far below typical GAN/CQFG values, which are around 20. These measurements demonstrate that IS and FID scores are heavily dominated by the quality of images. From this, we conclude that IS and FID can be used as reasonable proxies to assess sample quality, even though they are also slightly influenced by coverage. One should bear in mind, however, that a small increase in these scores may come from better coverage rather than improved sample quality.
Appendix C Qualitative influence of the feature space flexibility
In this section we experiment with different architectures to implement the invertible mapping used to build the feature space, as presented in Section 4.1. To assess the impact of the expressiveness of the invertible model on the behavior of our framework, we modify various standard parameters of the architecture. Popular invertible models such as NVP (Dinh et al., 2017) readily offer the possibility of extracting latent representations at several scales, separating global factors of variation from low-level detail; we therefore experiment with a varying number of scales. Another way of increasing the flexibility of the model is to change the number of residual blocks used in each invertible layer. Note that all the models evaluated so far in the main body of the paper are based on a single scale and two residual blocks. In addition to our CQFG models, we also compare with similar models trained with maximum likelihood estimation (MLE). Models are trained first with maximum-likelihood estimation, then with both the coverage and quality driven criteria.
The results in Table 8 show that factoring out features at two scales rather than one is helpful in terms of BPD. For the CQFG models, however, IS and FID deteriorate with more scales, so a trade-off between likelihood and sample quality must be struck. For the MLE models, the visual quality of samples also improves when using multiple scales, as reflected in better IS and FID scores. Their quality, however, remains far worse than that of samples produced with the coverage and quality driven training used for the CQFG models. Samples in the maximum-likelihood setting are provided in Figure 6. With three or more scales, models exhibit symptoms of overfitting: train BPD keeps decreasing while test BPD starts increasing, and IS and FID also degrade.





Figure 6 panels: No NVP; NVP 1 scale; NVP 2 scales; NVP 3 scales.
In Figure 6 we show samples obtained using VAE models trained with MLE. The models include one without invertible decoder layers, and with NVP layers using one, two and three scales. The samples illustrate the dramatic impact of using invertible NVP layers in these autoencoders.
Appendix D Visualisations of reconstructions
We display reconstructions obtained by encoding and then decoding ground truth images with our models (CQG and CQFG from Table 1) in Figure 7. As is typical for expressive variational autoencoders, real images and their reconstructions cannot be distinguished visually.
Figure 7 panels: Real image; CQFG reconstruction.