Generative models have made remarkable progress in recent years. Existing techniques can model fairly complex datasets, including natural images and speech (Radford et al., 2015; Arjovsky et al., 2017; van den Oord et al., 2016b, a; Bordes et al., 2016; Kingma et al., 2016; Gulrajani et al., 2016; Bowman et al., 2015; Chung et al., 2015). Latent variable models, such as variational autoencoders (VAE), are among the most successful ones (Kingma & Welling, 2013; Jimenez Rezende et al., 2014; Kingma et al., 2016; Kaae Sønderby et al., 2016; Bachman, 2016; Gulrajani et al., 2016). These generative models are very flexible, but the resulting marginal likelihood is intractable to compute and optimize. Models are thus learned by optimizing a tractable "evidence lower bound", obtained using a tunable inference distribution.
Despite their empirical success, existing VAE models are unable to accurately model complex, large-scale image datasets such as LSUN and ImageNet unless other approaches, such as adversarial training or supervised features, are employed (Dosovitskiy & Brox, 2016; Larsen et al., 2015; Lamb et al., 2016). When applied to complex datasets of natural images, VAE models tend to produce unrealistic, blurry samples (Dosovitskiy & Brox, 2016). The sample quality can be improved with a more expressive generative model; however, this leads to a tendency to ignore the latent variables, thus hindering the unsupervised feature learning goal (Chen et al., 2016).
In this paper, we propose a new derivation of VAEs which is not based on a variational Bayes approach. We propose a new and more general optimization criterion that is not always a lower bound on the marginal likelihood, but it is guaranteed to learn the data distribution under suitable conditions. For a particular choice of regularization, our approach becomes the regular VAE (Kingma & Welling, 2013). This new derivation gives us insights into the properties of VAE models. In particular, we are able to formally explain some common failure modes of VAEs, and propose novel methods to alleviate these issues.
In Section 3 we provide a formal explanation for why VAEs generate blurry samples when trained on complex natural images. We show that under some conditions, blurry samples are not caused by the use of a maximum likelihood approach as previously thought, but rather by an inappropriate choice for the inference distribution. We specifically target this problem by proposing a sequential VAE model, where we gradually augment the expressiveness of the inference distribution using a process inspired by the recent infusion training process (Bordes et al., 2016). As a result, we are able to generate sharp samples on the LSUN bedroom dataset, even using a 2-norm reconstruction loss in pixel space.
In Section 4 we propose a new explanation of the VAE tendency to ignore the latent code. We show that this problem is specific to the original VAE objective function (Kingma & Welling, 2013) and does not apply to the more general family of VAE models we propose. We show experimentally that, using our more general framework, we achieve sample quality comparable to the original VAE while at the same time learning meaningful features through the latent code, even when the decoder is a powerful PixelCNN that can by itself model the data (van den Oord et al., 2016b, a).
2 A Novel Derivation for VAE
2.1 Training Latent Variable Models
Let $p_{data}(x)$ be the true underlying data distribution defined over $\mathcal{X}$, and $\mathcal{D}$ be a dataset of i.i.d. samples from $p_{data}(x)$. In the context of unsupervised learning, we are often interested in learning features and representations directly from unlabeled data. A common approach for inducing features is to consider a joint probability distribution $p(x, z)$ over $(x, z)$, where $x \in \mathcal{X}$ is in the observed data space and $z \in \mathcal{Z}$ is a latent code or feature space. The distribution $p(x, z)$ is specified with a prior $p(z)$ and a conditional $p(x|z)$. The prior $p(z)$ is often chosen to be relatively simple: the hope is that the interactions between high level features are disentangled, and can be well approximated with a Gaussian or uniform distribution. The complexity of $p_{data}(x)$ is instead captured by the conditional distribution $p(x|z)$. For the analysis in this paper, it will be convenient to specify $p(x|z)$ using two components:
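A family of probability distributions $\mathcal{P}$ over $\mathcal{X}$. We require that the set $\mathcal{P}$ is parametric, which means that it can be indexed by a set $\Lambda$ in finite dimensional real space $\mathbb{R}^D$. We furthermore require that for every $\lambda \in \Lambda$, the corresponding element $P_\lambda \in \mathcal{P}$ has a well-defined and tractable log likelihood derivative $\nabla_\lambda \log P_\lambda(x)$ for any $x \in \mathcal{X}$.
A mapping $f_\theta: \mathcal{Z} \to \Lambda$ parameterized by $\theta$ with well-defined and tractable derivatives $\nabla_\theta f_\theta(z)$ for all $z \in \mathcal{Z}$. We also denote the family of all possible mappings defined by our model as $\mathcal{F} = \{f_\theta, \theta \in \Theta\}$.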
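Given $\mathcal{P}$ and $f_\theta$, we define a family of models
$$p_\theta(x, z) = p(z)\, P_{f_\theta(z)}(x)$$
indexed by $\theta \in \Theta$. Note that $P_{f_\theta(z)}$ plays the role of the conditional distribution $p_\theta(x|z)$ for any given $z$. For brevity, we will use $p_\theta(x|z)$ to indicate $P_{f_\theta(z)}(x)$. Note that the resulting marginal likelihood $p_\theta(x)$ is a mixture of distributions in $\mathcal{P}$:
$$p_\theta(x) = \int_z p_\theta(x, z)\, dz = E_{p(z)}[p_\theta(x|z)]$$
While the specification of $p(x|z)$ using $\mathcal{P}$ and $f_\theta$ is fully general, our definition imposes some (mild) tractability restrictions on $\mathcal{P}$ and $\mathcal{F}$ to allow for efficient learning. Specifically, the models we consider are those where $\nabla_\theta \log p_\theta(x|z)$ can be tractably computed using the chain rule:
$$\nabla_\theta \log p_\theta(x|z) = \nabla_\lambda \log P_\lambda(x)\big|_{\lambda = f_\theta(z)}\, \nabla_\theta f_\theta(z)$$
$\mathcal{P}$ is often Gaussian or a recurrent density estimator, and $f_\theta$ is a deep neural network.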
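We consider a maximum likelihood based approach to learn the parameters, where the goal is to maximize the marginal likelihood of the data
$$\max_\theta E_{p_{data}(x)}[\log p_\theta(x)] = \max_\theta E_{p_{data}(x)}\left[\log E_{p(z)}[p_\theta(x|z)]\right] \tag{1}$$
where the expectation over $p_{data}(x)$ is approximated using a sample average over the training data $\mathcal{D}$.
To actually optimize the above criterion we require that for any $x$, the derivative of $\log p_\theta(x)$ with respect to the model parameters be easy to compute or estimate. However, tractability of $\nabla_\theta \log p_\theta(x|z)$ does not imply tractability of $\nabla_\theta \log p_\theta(x)$, and if we directly take the derivative we get
$$\nabla_\theta \log p_\theta(x) = E_{p_\theta(z|x)}[\nabla_\theta \log p_\theta(x|z)] \tag{2}$$
Evaluating $p_\theta(z|x) = p(z)\, p_\theta(x|z) / p_\theta(x)$ involves the computation of a high dimensional integral. Even though this integral can be approximated by sampling
$$p_\theta(x) = E_{p(z)}[p_\theta(x|z)] \approx \frac{1}{M}\sum_{i=1}^{M} p_\theta(x|z_i), \qquad z_i \sim p(z)$$
this costly approximation has to be performed for every sample $x$.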
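As a rough illustration of why this estimate is costly, the following minimal sketch approximates $\log E_{p(z)}[p_\theta(x|z)]$ by sampling for a single data point. The linear `decoder_mean`, the unit-variance Gaussian conditional, and the standard normal prior are assumptions chosen only so that the exact marginal is known in closed form; they are not the models used in the paper.

```python
import numpy as np

def log_gaussian(x, mean, std):
    """Log density of a 1-D Gaussian N(mean, std^2) evaluated at x."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((x - mean) / std) ** 2

def mc_log_marginal(x, decoder_mean, std, num_samples, rng):
    """Estimate log p(x) = log E_{p(z)}[p(x|z)] with samples z ~ N(0, 1)."""
    z = rng.standard_normal(num_samples)
    log_px_given_z = log_gaussian(x, decoder_mean(z), std)
    # Log-mean-exp, shifted by the maximum for numerical stability.
    m = np.max(log_px_given_z)
    return m + np.log(np.mean(np.exp(log_px_given_z - m)))

rng = np.random.default_rng(0)
decoder_mean = lambda z: 2.0 * z   # hypothetical linear decoder (assumption)
x = 1.5
estimate = mc_log_marginal(x, decoder_mean, std=1.0, num_samples=200_000, rng=rng)

# For this linear-Gaussian toy model the marginal is exactly N(0, 2^2 + 1^2).
exact = log_gaussian(x, 0.0, np.sqrt(5.0))
print(estimate, exact)
```

Even in this one-dimensional toy case a large number of prior samples is spent on a single $x$; the same work would have to be repeated for every data point and every parameter update.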
2.2 A Naive Variational Lower Bound
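To gain some intuition, we consider a simple attempt to make Eq.(1) easier to optimize. By Jensen's inequality we can obtain a lower bound
$$\log p_\theta(x) = \log E_{p(z)}[p_\theta(x|z)] \ge E_{p(z)}[\log p_\theta(x|z)] \tag{3}$$
The gradient of this lower bound can be computed more efficiently, as it does not involve the intractable estimation of $p_\theta(z|x)$:
$$\nabla_\theta E_{p(z)}[\log p_\theta(x|z)] = E_{p(z)}[\nabla_\theta \log p_\theta(x|z)]$$
The hope is that maximizing this lower bound will also increase the original log-likelihood $\log p_\theta(x)$. However, this simple lower bound is not suitable. We can rewrite this lower bound as
$$E_{p_{data}(x)} E_{p(z)}[\log p_\theta(x|z)] = E_{p(z)}\left[E_{p_{data}(x)}[\log p_\theta(x|z)]\right] \tag{4}$$
No matter what prior $p(z)$ we choose, this criterion is maximized if for each $z \in \mathcal{Z}$, $E_{p_{data}(x)}[\log p_\theta(x|z)]$ is maximized. However, recall that $\forall z \in \mathcal{Z}, \theta \in \Theta$, $p_\theta(x|z) \in \mathcal{P}$. As a result there is an optimal member $p^* \in \mathcal{P}$, independent of $z$ or $\theta$, that maximizes this term:
$$p^* \in \arg\max_{p \in \mathcal{P}} E_{p_{data}(x)}[\log p(x)]$$
This means that regardless of $z$, if we always choose $p_\theta(\cdot|z) = p^*$, or equivalently an $f_\theta$ which maps all $z$ to the parameter of $p^*$, then Eq.(4) is maximized. For example, if $\mathcal{P}$ is the family of Gaussians, we will not learn a mixture of Gaussians, but rather the single best Gaussian fit to $p_{data}(x)$. Optimizing this lower bound is easy, but undermines our very purpose of learning meaningful latent features.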
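The collapse to a single best-fitting member of the family can be checked numerically. The sketch below is only an illustration under stated assumptions: a one-dimensional bimodal toy dataset stands in for the data distribution, and the family is the set of univariate Gaussians. It compares the average log-likelihood of the maximum likelihood Gaussian with that of Gaussians centered on the individual modes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy bimodal data: two well-separated clusters (assumed for illustration).
data = np.concatenate([rng.normal(-3.0, 0.5, 5000), rng.normal(3.0, 0.5, 5000)])

def avg_log_lik(mean, std):
    """Average Gaussian log-likelihood of the whole dataset."""
    return np.mean(-0.5 * np.log(2 * np.pi * std**2)
                   - 0.5 * ((data - mean) / std) ** 2)

# Closed-form maximum likelihood Gaussian for the entire dataset.
mle_mean, mle_std = data.mean(), data.std()
best = avg_log_lik(mle_mean, mle_std)

# Mode-specific Gaussians score far worse on the criterion of Eq.(4),
# because every z is asked to explain the same full dataset.
per_mode = [avg_log_lik(-3.0, 0.5), avg_log_lik(3.0, 0.5)]
print(best, per_mode)
```

Since each latent code is scored against the full dataset, mapping different codes to different mode-specific Gaussians only lowers the objective, so the wide single Gaussian wins.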
2.3 Using Discrimination to Avoid Trivial Solution
The key problem demonstrated in Eq.(4) is that for any $z$, we are fitting the same $p_{data}(x)$ with a member of $\mathcal{P}$. However, if for every $z$ we fit a different distribution, then we will no longer be limited to this trivial solution.
Suppose we are given a fixed inference distribution $q(z|x)$, which maps (probabilistically) inputs $x$ to features $z$. Even though our goal is unsupervised feature learning, we initially assume the features are given to us. It is much easier to understand the dynamics of the model when we take $q$ to be fixed. We then generalize our understanding to a learned $q$ in the next section.
We define a joint distribution $q(x, z) = p_{data}(x)\, q(z|x)$, a marginal $q(z) = \int_x q(x, z)\, dx$, and a posterior $q(x|z) = q(x, z)/q(z)$.
In contrast with standard variational Bayes approaches, for now we do not treat $q$ as a variational approximation to the posterior of some generative model $p$. Instead we simply take $q$ to be any distribution that probabilistically maps $x$ to features $z$. For example, $q$ can be a classifier that detects object categories.
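We define a new optimization criterion where for each $z$ we use a member of $\mathcal{P}$ to fit a different $q(x|z)$ rather than the entire $p_{data}(x)$:
$$\mathcal{L} = E_{q(x, z)}[\log p_\theta(x|z)] = E_{q(z)}\left[E_{q(x|z)}[\log p_\theta(x|z)]\right] \tag{5}$$
Comparing with Eq.(4), there is a key difference. As before, no matter what $q(z)$ we choose, this new criterion is maximized when for each $z$, $E_{q(x|z)}[\log p_\theta(x|z)]$ is maximized, or equivalently $KL(q(x|z) \| p_\theta(x|z))$ is minimized, as a function of $\theta$ (because $q(x|z)$ is fixed). However, in contrast with Eq.(4) we approximate a different $q(x|z)$ for each $z$, rather than finding the single best distribution in $\mathcal{P}$ to fit the entire data distribution $p_{data}(x)$.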
While this is no longer a (lower) bound on the marginal likelihood as in Eq. (3), we now show that under some conditions this criterion is suitable for learning.
1) Tractable stochastic gradient: This criterion admits a tractable stochastic gradient estimator because
$$\nabla_\theta \mathcal{L} = E_{q(x, z)}[\nabla_\theta \log p_\theta(x|z)]$$
As before it can be efficiently optimized using mini-batch stochastic gradient descent.
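2) Utilization of Latent Code: As the intuition that motivated the design of objective (5) suggests, this criterion incentivizes the use of a latent code that gives "discriminative power" over $x$, which we formally demonstrate in the following proposition.
Proposition 1. Let $\theta^*$ be the global optimum of $\mathcal{L}$ defined in (5), and $f_{\theta^*}$ the corresponding optimal mapping. If $\mathcal{F}$ has sufficient capacity, then for every $z \in \mathcal{Z}$
$$P_{f_{\theta^*}(z)} \in \arg\max_{p \in \mathcal{P}} E_{q(x|z)}[\log p(x)] = \arg\min_{p \in \mathcal{P}} KL(q(x|z) \| p)$$
This is illustrated in Figure 1. When $\mathcal{F}$ has sufficient representation capacity, we are using $\mathcal{P}$ to variationally approximate $q(x|z)$ for each $z \in \mathcal{Z}$ respectively.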
For example, suppose $x$ are images and $q(z|x)$ is an image classifier over $K$ classes. Then $q(x|z = k)$ can be thought of as the appearance distribution of objects belonging to class $k$. According to Proposition 1, the optimal generative model based on this inference distribution and objective (5) will select a member of $\mathcal{P}$ to approximate the distribution over images separately for each category. The optimal $f_{\theta^*}$ will map each object category to this optimal category-specific approximation, assuming it has enough capacity.
On the other hand, if a feature does not carry discriminative information about $x$, i.e. $q(x|z_1) = q(x|z_2)$, then the optimal $f_{\theta^*}$ will have no motivation to map $z_1$ and $z_2$ to different members of $\mathcal{P}$. $\mathcal{L}$ is already maximized if both are mapped to the same optimal member of $\mathcal{P}$ that approximates $q(x|z_1)$ or $q(x|z_2)$. We will return to this point below when we discuss learning $q$.
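3) Estimation of $p_{data}$: We further show that under suitable conditions this new learning criterion is consistent with our original goal of modeling $p_{data}(x)$.
Proposition 2. Let $\theta^*$ be the global optimum of $\mathcal{L}$ in Equation (5) for a sufficiently large $\mathcal{F}$. If $\mathcal{P}$ is sufficiently large so that
$$\forall z \in \mathcal{Z}, \quad q(x|z) \in \mathcal{P}$$
then the joint distribution $q(z)\, p_{\theta^*}(x|z)$ has marginal $p_{data}(x)$, and the Gibbs chain
$$z^{(t)} \sim q(z|x^{(t)}), \qquad x^{(t+1)} \sim p_{\theta^*}(x|z^{(t)}) \tag{6}$$
converges to $p_{data}(x)$ if it is ergodic.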
This condition is illustrated in Figure 1. Intuitively, this means that if $\mathcal{F}$ is sufficiently large and $\mathcal{P}$ can exactly represent the posterior $q(x|z)$, then our approach will learn $p_{data}$. Note however that it is $q(z)\, p_{\theta^*}(x|z)$ that has marginal $p_{data}(x)$, and not the original generative model $p(z)\, p_{\theta^*}(x|z)$. Nevertheless, if the conditions are met we will have learned all that is needed to sample or draw inferences from $p_{data}$, for example, by the Gibbs chain defined in Proposition 2.
The significance of this result is that $q(z|x)$ can be any feature detector. As long as its posterior can be represented by $\mathcal{P}$, we will learn $p_{data}$ by optimizing (5) with a sufficiently expressive family $\mathcal{F}$. We will show that this leads to an important class of models in the next section.
One drawback of the proposed approach is that we cannot tractably sample from $p_{data}(x)$ with ancestral sampling, because the marginal of the original generative model $p(z)\, p_{\theta^*}(x|z)$ will not match the data distribution in general. To do ancestral sampling on $p(z)\, p_{\theta^*}(x|z)$, we need an additional condition.
Proposition 3. If all conditions in Proposition 2 hold, and we further have
$$\forall z \in \mathcal{Z}, \quad p(z) = q(z)$$
then the original generative model $p(z)\, p_{\theta^*}(x|z)$ has marginal $p_{data}(x)$.
Enforcing this extra condition would restrict us to use only inference distributions $q$ whose marginal $q(z)$ matches the prior $p(z)$ specified by the generative model. This is the first time we are placing constraints on $q$. Such constraints generally require joint learning of $q$ and $f_\theta$, which we will discuss next.
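Propositions 2 and 3 can be checked exactly on a small discrete example. The sketch below is an illustration under assumptions: the four-valued data distribution, the two-valued latent code, and the table `q_z_given_x` are made up, and the decoder is set to the exact posterior $q(x|z)$, which is what Proposition 1 predicts for a sufficiently large family.

```python
import numpy as np

p_data = np.array([0.1, 0.2, 0.3, 0.4])   # toy data distribution over 4 values of x
q_z_given_x = np.array([[0.9, 0.1],        # rows index x, columns index z
                        [0.7, 0.3],
                        [0.2, 0.8],
                        [0.1, 0.9]])

q_xz = p_data[:, None] * q_z_given_x       # joint q(x, z)
q_z = q_xz.sum(axis=0)                     # marginal q(z)
q_x_given_z = q_xz / q_z                   # column z holds the posterior q(x | z)

# Assume the optimal decoder reproduces the posterior exactly.
p_x_given_z = q_x_given_z

# Proposition 2: q(z) p(x|z) has marginal p_data.
marginal_q = p_x_given_z @ q_z

# Gibbs chain of Eq.(6): T[x, x'] = sum_z q(z | x) p(x' | z).
T = q_z_given_x @ p_x_given_z.T
dist = np.full(4, 0.25)                    # arbitrary initial distribution
for _ in range(200):
    dist = dist @ T

# Ancestral sampling with a mismatched prior p(z) != q(z) gives a different marginal.
prior = np.array([0.5, 0.5])
marginal_prior = p_x_given_z @ prior
print(marginal_q, dist, marginal_prior)
```

With the uniform prior the ancestral marginal differs from `p_data`, which is the situation the extra condition of Proposition 3 rules out.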
2.4 Learning an Inference Distribution
In the previous section we assumed that the inference distribution $q$ was fixed and already given to us. However, in unsupervised learning feature detectors are generally not given a priori, and are the main purpose of learning itself. In this section we discuss learning a $q$ so that the conditions in Proposition 2 (and potentially 3) are satisfied.
Suppose $q$ is also a parameterized distribution with parameters $\phi \in \Phi$, and we denote it as $q_\phi$. As required in VAE models in general (Kingma & Welling, 2013), we require $q_\phi$ to be reparameterizable so that $\nabla_\phi \mathcal{L}$ can also be effectively approximated by stochastic gradients.
Intuitively, we are not only using $\mathcal{P}$ to approximate $q_\phi(x|z)$ for each $z$, but we are also learning a $q_\phi$ such that its posterior is simple enough to be representable by $\mathcal{P}$. Successful training under this criterion allows us to model $p_{data}(x)$ by a Gibbs Markov chain (6). We refer to this model as unregularized VAE. These models do not allow direct (tractable) sampling, but they have desirable properties that we will discuss and evaluate experimentally in Section 4.
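2) VAE with Regularization. If we would like to directly (and tractably) sample from $p(z)\, p_\theta(x|z)$ and have marginal $p_{data}(x)$, then we also need to have $p(z) = q_\phi(z)$. A general way to enforce this condition is by a regularization that penalizes deviation of $q_\phi(z)$ from $p(z)$ with some $R(q_\phi) > 0$, and $R(q_\phi) = 0$ if and only if $p(z) = q_\phi(z)$. Writing $q_\phi(x, z) = p_{data}(x)\, q_\phi(z|x)$, the optimization criterion becomes
$$\mathcal{L}_{VAE} = E_{q_\phi(x, z)}[\log p_\theta(x|z)] - R(q_\phi) \tag{7}$$
This gives us a new family of variational auto-encoding models. In particular when $R(q_\phi) = E_{p_{data}(x)}[KL(q_\phi(z|x) \| p(z))]$ we get the well known ELBO training criterion (Kingma & Welling, 2013)
$$\mathcal{L}_{ELBO} = E_{p_{data}(x)}\left[E_{q_\phi(z|x)}[\log p_\theta(x|z)] - KL(q_\phi(z|x) \| p(z))\right] \tag{8}$$
However, ELBO is only one of many possibilities. ELBO has the additional advantage that it gives us a lower bound for the log-likelihood:
$$\mathcal{L}_{ELBO} \le E_{p_{data}(x)}[\log p_\theta(x)]$$
However the ELBO also has significant disadvantages that we shall discuss in Section 4.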
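For concreteness, a per-example ELBO term can be sketched as follows. This is a minimal sketch under common assumptions that the text above does not fix: a diagonal Gaussian $q_\phi(z|x)$, a standard normal prior $p(z)$, and an isotropic Gaussian conditional with standard deviation `sigma`. The function names are hypothetical.

```python
import numpy as np

def kl_to_standard_normal(mean, log_var):
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) in closed form."""
    return 0.5 * np.sum(np.exp(log_var) + mean**2 - 1.0 - log_var, axis=-1)

def gaussian_log_lik(x, recon, sigma=1.0):
    """log N(x; recon, sigma^2 I), summed over the data dimensions."""
    d = x.shape[-1]
    return (-0.5 * np.sum((x - recon) ** 2, axis=-1) / sigma**2
            - 0.5 * d * np.log(2.0 * np.pi * sigma**2))

def elbo_term(x, recon, mean, log_var, sigma=1.0):
    """Single-sample estimate: log p(x|z) minus the KL regularizer."""
    return gaussian_log_lik(x, recon, sigma) - kl_to_standard_normal(mean, log_var)

x = np.array([0.2, -0.4])
kl_value = kl_to_standard_normal(np.array([1.0, 0.0]), np.zeros(2))
# Perfect reconstruction, unit-variance encoder shifted by one in the first coordinate.
value = elbo_term(x, recon=x, mean=np.array([1.0, 0.0]), log_var=np.zeros(2))
print(kl_value, value)
```

Replacing `kl_to_standard_normal` by another penalty, or dropping it, gives other members of the family in Eq.(7).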
To summarize, our new derivation provides two insights, which lay the foundation for all discussions in the rest of this paper:
1) Jointly optimizing $q_\phi$ and $f_\theta$ with a sufficiently flexible family $\mathcal{F}$ attempts to learn a feature detector such that its posterior $q_\phi(x|z)$ is representable by $\mathcal{P}$. We will explain in Section 3 that many existing problems with VAEs arise because of the inability of $\mathcal{P}$ to approximate the posterior of $q_\phi$. We will also propose a solution that targets this problem.
2) We can use any regularization $R(q_\phi)$ that encourages $q_\phi(z)$ to be close to $p(z)$, or no regularization at all if we do not need ancestral sampling. This will be the central topic of Section 4.
3 Simple $\mathcal{P}$ Requires Discriminative $q$
By our previous analysis, the posterior $q(x|z)$ of $q$ should be representable by $\mathcal{P}$. For many existing models, although $f_\theta$ is complex, $\mathcal{P}$ is often chosen to be simple, such as the Gaussian family (Kingma & Welling, 2013; Jimenez Rezende et al., 2014; Burda et al., 2015), or a fully factorized discrete distribution (Kingma & Welling, 2013). Proposition 2 requires $q$ to have a posterior $q(x|z)$ (the conditional data distribution corresponding to feature $z$) which is also simple. We claim that several existing problems of VAE models occur when this condition is not met.
3.1 Limitations of Gaussian conditionals
One commonly observed failure with auto-encoding models is the generation of blurry or fuzzy samples. This effect is commonly associated with AE/VAE models that use the L2 loss (Dosovitskiy & Brox, 2016). In this setting, we map from data $x$ to latent code $z$ through an encoder $f$, and then reconstruct through a decoder $g$. Loss is evaluated by the 2-norm of the reconstruction error
$$\mathcal{L}_{Recon} = E_{p_{data}(x)}\left[\|g(f(x)) - x\|_2^2\right] + R(f) \tag{9}$$
where $R(f)$ is some regularization on $f$. Note that if we define the distributions $q(z|x) = \delta(z - f(x))$ and $p(x|z) = \mathcal{N}(g(z), I/2)$, then the above criterion is equivalent to the VAE criterion in Eq. (7):
$$E_{q(x, z)}[\log p(x|z)] - R(q) = C - \mathcal{L}_{Recon}$$
where $C$ is a normalization constant irrelevant to the optimization. This means that the family $\mathcal{P}$ that we have chosen is actually the family of fixed variance factored Gaussians. According to Proposition 2, this objective will attempt to approximate the posterior $q(x|z)$, the distribution over data points that map to $z$, with a fixed variance Gaussian. Unfortunately, common distributions such as natural images almost never have a mixture of Gaussian structure: if $x$ is a likely sample, $x$ perturbed by independent Gaussian noise is not. Unless $f$ is lossless, it will map multiple $x$ to the same encoding $z$, resulting in a highly non-Gaussian posterior $q(x|z)$. This is where the fuzziness comes from: the mean of the best fitting Gaussian is some "average" of $q(x|z)$. Formally we have the following proposition.
Proposition 4. The optimal solution $g^*$ to the reconstruction loss of Eq.(9) for a given $f$ is
$$g^*(z) = E_{q(x|z)}[x]$$
and the optimal expected reconstruction error $E_{q(x|z)}[\|x - g^*(z)\|_2^2]$ for any $z$ is the sum of coordinate-wise variances $\sum_i \mathrm{Var}_{q(x|z)}[x_i]$.
Intuitively Proposition 4 follows from the observation that the optimal $p_\theta(x|z)$ is an M-projection of $q(x|z)$ onto $\mathcal{P}$, and therefore satisfies moment matching conditions. It shows that the optimal reconstruction is an average of $q(x|z)$, and $\sum_i \mathrm{Var}_{q(x|z)}[x_i]$ measures the reconstruction error. For image data this error is reflected by blurry samples.
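The statement of Proposition 4 can be verified on samples. In the sketch below, a two-mode toy point cloud (an assumption made for illustration) plays the role of all data points that a lossy encoder maps to one latent code; the best constant reconstruction under squared error is their mean, and its error equals the sum of coordinate-wise variances.

```python
import numpy as np

rng = np.random.default_rng(0)
# Samples sharing one latent code z: a two-mode, 2-D toy posterior q(x | z).
x = np.concatenate([rng.normal([0.0, 0.0], 0.1, (2000, 2)),
                    rng.normal([1.0, 1.0], 0.1, (2000, 2))])

def expected_sq_error(g):
    """Average squared 2-norm error when every sample is reconstructed as g."""
    return np.mean(np.sum((x - g) ** 2, axis=1))

g_star = x.mean(axis=0)                     # optimal reconstruction: posterior mean
best_error = expected_sq_error(g_star)
sum_of_variances = x.var(axis=0).sum()      # coordinate-wise variances, summed

# Reconstructing one of the two modes exactly is worse on average.
mode_error = expected_sq_error(np.array([0.0, 0.0]))
print(g_star, best_error, sum_of_variances, mode_error)
```

The optimal `g_star` lies between the two modes, where the toy posterior has almost no mass; for images this in-between point is what appears as blur.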
We illustrate this fact by fitting a VAE on MNIST with a 2-dimensional latent code using the ELBO regularization (8) and 2-norm (Gaussian) reconstruction loss (9). In Figure 2 we plot for each $z \in \mathcal{Z}$ the posterior variance of $q(x|z)$ (color coded) and the digits generated by $g(z)$. Regions of latent space where $q(x|z)$ has high variance (red) correspond to regions where "fuzzy" digits are generated.
The problem of fuzzy samples in VAEs was previously attributed to the maximum likelihood objective (which penalizes regions where $p_{data}(x) \gg p_\theta(x)$ more than regions where $p_{data}(x) \ll p_\theta(x)$), thus encouraging solutions with larger support. This explanation was put into question by Nowozin et al. (2016), who showed that no major difference is observed when we optimize over different types of divergences with adversarial learning. Our conclusion is consistent with this recent observation, in that fuzziness is not a direct consequence of maximum likelihood, but rather a consequence of the VAE approximation of maximum likelihood.
We will show similar results for other distribution families in the Appendix.
3.2 Infusion Training as Latent Code Augmentation
The key problem we observed in the previous section is that an insufficiently discriminative feature detector $q$ (one mapping different $x$ to the same $z$) will have a posterior too complex to be approximated by a simple family $\mathcal{P}$. In this section we propose a method to alleviate this problem and achieve significantly sharper samples on complex natural image datasets. In particular, we draw a connection with and generalize the recently proposed infusion training method (Bordes et al., 2016).
Infusion training (Bordes et al., 2016) trains a Markov chain to gradually converge to the data distribution $p_{data}$. Formally, training starts with some initial random noise $x^{(0)}$, and goes through the following two steps iteratively
1) Infusion: A new "latent state" $z^{(t)}$ is generated by taking the previous reconstruction $x^{(t-1)}$, and adding some pixels from a ground truth data point $x$.
2) Reconstruction: The decoding model attempts the next reconstruction $x^{(t)}$ by maximizing $\log p^{(t)}(x|z^{(t)})$. The superscript $(t)$ indicates that this can be a different distribution for each step $t$, leading to a non-homogeneous Markov chain.
To draw new samples at test time, we directly sample from the Markov chain
$$x^{(t)} \sim p^{(t)}(x^{(t)}|x^{(t-1)})$$
initializing from random noise. This idea is illustrated in Figure 3. Note that we refer to the resulting image after infusion as a "latent state" because it plays the same role as a VAE latent state. We can interpret the probability of obtaining $z^{(t)}$ by the above iterative procedure as an inference distribution $q(z^{(t)}|x)$. In contrast with VAEs, the inference distribution used in infusion training is manually specified by the "infusion" process. By adding more true pixels and making $z^{(t)}$ increasingly informative about $x$, for sufficiently large $t$ the "latent code" $z^{(t)}$ will become informative enough to have a simple posterior $q(x|z^{(t)})$ that is highly concentrated on $x$. Such a posterior can be well approximated by simple unimodal conditionals $p(x|z^{(t)})$, such as Gaussian distributions.
Inspired by this idea, we propose the model shown in Figure 3 which we will call a sequential VAE. Each step is a VAE, except the decoder is now also conditioned on the previous reconstruction outcome. We go through the following two steps iteratively during training:
1) Inference: An inference distribution $q(z^{(t)}|x)$ maps a ground truth data point $x$ to a latent code.
2) Reconstruction: A generative distribution (decoder) $p(x^{(t)}|z^{(t)}, x^{(t-1)})$ that takes as input a sample from the previous step $x^{(t-1)}$ and latent code $z^{(t)}$ to generate a new sample $x^{(t)}$. At the first step, we do not condition on previous samples.
The model is jointly trained by maximizing the VAE criteria for each time step respectively.
For experiments in this section we use ELBO regularization (Kingma & Welling, 2013) where is a simple fixed prior such as white Gaussian.
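To generate samples during test time, for each step we perform ancestral sampling $z^{(t)} \sim p(z^{(t)})$, $x^{(t)} \sim p(x^{(t)}|z^{(t)}, x^{(t-1)})$. Details about the implementation are described in the Appendix.
The idea is that the more latent code we add, the more we know about $x$, making the posterior $q(x|z^{(0:t)})$ simpler as $t$ becomes larger. In particular, we can show this formally for 2-norm loss as in Section 3.1.
Proposition 5. For any distribution $q$, any $t$, and any input dimension $i$,
$$E_{q(z^{(0:t)})}\left[\mathrm{Var}_{q(x|z^{(0:t)})}[x_i]\right] \ge E_{q(z^{(0:t+1)})}\left[\mathrm{Var}_{q(x|z^{(0:t+1)})}[x_i]\right]$$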
Therefore increasing the latent code size does not, in expectation, increase the variance. By the connection we established between variance of the posterior and blurriness of the samples, this should lead to sharper samples. We show this experimentally in Figure 4, where we evaluate our model on CelebA and LSUN (code is available at https://github.com/ShengjiaZhao/Sequential-Variational-Autoencoder). In particular we can generate sharp LSUN images based only on 2-norm loss in pixel space, something previously considered to be difficult for VAE models. Details about architecture and training are in the Appendix.
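The claim that conditioning on additional latent codes cannot increase the expected posterior variance is an instance of the law of total variance, and can be checked exactly on a small discrete example. The random joint table below is an assumption made purely for illustration; `z1` and `z2` stand for two successive latent codes.

```python
import numpy as np

rng = np.random.default_rng(0)
# Random joint table q[x, z1, z2] over a scalar x with 5 values and two 3-valued codes.
q = rng.random((5, 3, 3))
q /= q.sum()
xs = np.arange(5.0)                         # numeric values taken by x

def expected_cond_var(joint):
    """joint[x, c] -> E_c[ Var(x | c) ] for a conditioning variable c."""
    p_c = joint.sum(axis=0)
    cond = joint / p_c                      # column c holds q(x | c)
    mean = xs @ cond
    second_moment = (xs**2) @ cond
    return np.sum(p_c * (second_moment - mean**2))

var_none = expected_cond_var(q.sum(axis=(1, 2))[:, None])   # no latent code
var_z1 = expected_cond_var(q.sum(axis=2))                   # condition on z1
var_z1z2 = expected_cond_var(q.reshape(5, 9))               # condition on (z1, z2)
print(var_none, var_z1, var_z1z2)
```

The three values are non-increasing for any joint table, which matches the inequality stated in the proposition.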
Sequential generation is a general scheme under which many different models are possible. It encompasses infusion training as a special case, but many different variants are possible. This idea has great potential for improving auto-encoding models based on simple, unimodal families.
4 Complex $\mathcal{P}$ and the Information Preference Property
Models that use a complex $\mathcal{P}$, such as recurrent density estimators, have shown promise in modeling complex natural datasets (Gulrajani et al., 2016). However these models have a shortcoming demonstrated by Chen et al. (2016): a model with a complex conditional distribution that is optimized under the ELBO criterion tends to ignore the latent code. Chen et al. (2016) gave an explanation of this information preference property using coding theory. Here we provide an alternative simple explanation using the framework introduced in this paper. An equivalent way to write the ELBO criterion Eq.(8) is, up to an additive constant that does not depend on the parameters, as the negative sum of two divergences (Kingma & Welling, 2013)
$$\mathcal{L}_{ELBO} = -KL(p_{data}(x) \| p_\theta(x)) - E_{p_{data}(x)}\left[KL(q_\phi(z|x) \| p_\theta(z|x))\right] \tag{10}$$
Suppose $\mathcal{P}$ is sufficiently large, so that there is a member $p^* \in \mathcal{P}$ that already satisfies $KL(p_{data}(x) \| p^*(x)) = 0$. If the second divergence is also $0$, then this is already the best we can achieve. The model can trivially make the second divergence $0$ by making the latent code $z$ completely non-informative, i.e., making $z$ and $x$ independent under both $p_\theta$ and $q_\phi$, so that $p_\theta(z|x) = p(z)$, $q_\phi(z|x) = p(z)$. There is no motivation for the model to learn otherwise, undermining our purpose of learning a latent variable model.
However this problem can be fixed using the general VAE objective we introduced in Eq.(7)
$$\mathcal{L}_{VAE} = E_{q_\phi(x, z)}[\log p_\theta(x|z)] - R(q_\phi)$$
If we do not regularize (and therefore do not attempt to meet the conditions in Proposition 3), setting $R(q_\phi) = 0$, there is an incentive to use the latent code. This is because if we satisfy the conditions in Proposition 2, we have
$$\mathcal{L}_{VAE} = E_{q_\phi(x, z)}[\log q_\phi(x|z)] = -H_{q_\phi}(x|z) = I_{q_\phi}(x; z) - H_{p_{data}}(x)$$
where $H_p$ is the entropy under some distribution $p$, and $I_{q_\phi}$ the mutual information. This means that this optimization criterion actually prefers to maximize mutual information between $x$ and $z$ under $q_\phi$, unlike the ELBO objective (10).
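The identity relating the unregularized objective at its optimum to mutual information can be confirmed on any discrete joint distribution. The 3-by-2 table below is an arbitrary assumption used only to make the quantities concrete.

```python
import numpy as np

# Arbitrary toy joint q(x, z): rows index x, columns index z.
q_xz = np.array([[0.30, 0.05],
                 [0.05, 0.20],
                 [0.10, 0.30]])
q_x = q_xz.sum(axis=1)
q_z = q_xz.sum(axis=0)
q_x_given_z = q_xz / q_z

# Objective value when the decoder equals the posterior: E_q[log q(x | z)].
objective = np.sum(q_xz * np.log(q_x_given_z))

entropy_x = -np.sum(q_x * np.log(q_x))
mutual_info = np.sum(q_xz * np.log(q_xz / np.outer(q_x, q_z)))

# A non-informative code (x and z independent) attains the lowest value, -H(x).
independent = np.outer(q_x, q_z)
objective_indep = np.sum(independent * np.log(independent / q_z))
print(objective, mutual_info - entropy_x, objective_indep)
```

Because the data entropy is fixed, a larger objective value is only possible through larger mutual information, which is the incentive to use the latent code described above.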
We have derived before that $R(q_\phi)$ is not needed if we do not require sampling to be tractable, as we will still be able to sample by running a Markov chain as in Proposition 2. If the goal is to encode the data distribution and learn informative features, then we can ignore $R(q_\phi)$ and the objective will encourage the use of the latent code. We illustrate this on a model that uses PixelCNN (Salimans et al., 2017; van den Oord et al., 2016b, a; Gulrajani et al., 2016) as the family $\mathcal{P}$. The results are shown in Figure 5 and Figure 6. The experimental setting is explained in the Appendix (code is available at https://github.com/ShengjiaZhao/Generalized-PixelVAE).
On both MNIST and CIFAR, we can generate high quality samples with or without regularization with a Markov chain. As expected, only the regularized VAE produces high quality samples with ancestral sampling $z \sim p(z)$, $x \sim p_\theta(x|z)$, as it encourages satisfaction of the condition in Proposition 3. However, mutual information between data and latent code is minimized with the ELBO criterion, as shown in the top right plots of Figures 5 and 6. In fact, mutual information is driven to zero in Figure 5, indicating that the latent code is completely ignored. On the other hand, for the unregularized VAE high mutual information is preferred, as shown in the bottom right plots of Figures 5 and 6.
In this paper we derived a general family of VAE methods from a new perspective, which is not based on lower bounding the intractable marginal likelihood. Instead, we take the perspective of a variational approximation of the posterior of an inference distribution or feature detector. Using this new framework, we were able to explain some of the issues encountered with VAEs: blurry samples and the tendency to ignore the latent code. Using the insights derived from our new framework, we identified two new VAE models that significantly alleviate these problems.
We thank Justin Gottschlich, Aditya Grover, Volodymyr Kuleshow and Yang Song for comments and discussions. This research was supported by Intel, NSF (#1649208) and Future of Life Institute (#2016-158687).
- Arjovsky et al. (2017) Arjovsky, M., Chintala, S., and Bottou, L. Wasserstein GAN. ArXiv e-prints, January 2017.
- Bachman (2016) Bachman, Philip. An architecture for deep, hierarchical generative models. In Advances In Neural Information Processing Systems, pp. 4826–4834, 2016.
- Bordes et al. (2016) Bordes, Florian, Honari, Sina, and Vincent, Pascal. Learning to generate samples from noise through infusion training. https://openreview.net/pdf?id=BJAFbaolg, 2016.
- Bowman et al. (2015) Bowman, Samuel R., Vilnis, Luke, Vinyals, Oriol, Dai, Andrew M., Józefowicz, Rafal, and Bengio, Samy. Generating sentences from a continuous space. CoRR, abs/1511.06349, 2015. URL http://arxiv.org/abs/1511.06349.
- Burda et al. (2015) Burda, Yuri, Grosse, Roger, and Salakhutdinov, Ruslan. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
- Chen et al. (2016) Chen, Xi, Kingma, Diederik P, Salimans, Tim, Duan, Yan, Dhariwal, Prafulla, Schulman, John, Sutskever, Ilya, and Abbeel, Pieter. Variational lossy autoencoder. arXiv preprint arXiv:1611.02731, 2016.
- Chung et al. (2015) Chung, Junyoung, Kastner, Kyle, Dinh, Laurent, Goel, Kratarth, Courville, Aaron C., and Bengio, Yoshua. A recurrent latent variable model for sequential data. CoRR, abs/1506.02216, 2015. URL http://arxiv.org/abs/1506.02216.
- Dosovitskiy & Brox (2016) Dosovitskiy, Alexey and Brox, Thomas. Generating images with perceptual similarity metrics based on deep networks. CoRR, abs/1602.02644, 2016. URL http://arxiv.org/abs/1602.02644.
- Goodfellow et al. (2014) Goodfellow, Ian, Pouget-Abadie, Jean, Mirza, Mehdi, Xu, Bing, Warde-Farley, David, Ozair, Sherjil, Courville, Aaron, and Bengio, Yoshua. Generative adversarial nets. In Advances in Neural Information Processing Systems, pp. 2672–2680, 2014.
- Gulrajani et al. (2016) Gulrajani, Ishaan, Kumar, Kundan, Ahmed, Faruk, Taiga, Adrien Ali, Visin, Francesco, Vázquez, David, and Courville, Aaron C. Pixelvae: A latent variable model for natural images. CoRR, abs/1611.05013, 2016. URL http://arxiv.org/abs/1611.05013.
- Jimenez Rezende et al. (2014) Jimenez Rezende, D., Mohamed, S., and Wierstra, D. Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ArXiv e-prints, January 2014.
- Kaae Sønderby et al. (2016) Kaae Sønderby, C., Raiko, T., Maaløe, L., Kaae Sønderby, S., and Winther, O. Ladder Variational Autoencoders. ArXiv e-prints, February 2016.
- Kingma & Welling (2013) Kingma, D. P and Welling, M. Auto-Encoding Variational Bayes. ArXiv e-prints, December 2013.
- Kingma et al. (2016) Kingma, Diederik P, Salimans, Tim, and Welling, Max. Improving variational inference with inverse autoregressive flow. arXiv preprint arXiv:1606.04934, 2016.
- Lamb et al. (2016) Lamb, Alex, Dumoulin, Vincent, and Courville, Aaron. Discriminative regularization for generative models. arXiv preprint arXiv:1602.03220, 2016.
- Larsen et al. (2015) Larsen, Anders Boesen Lindbo, Sønderby, Søren Kaae, and Winther, Ole. Autoencoding beyond pixels using a learned similarity metric. arXiv preprint arXiv:1512.09300, 2015.
- Nowozin et al. (2016) Nowozin, Sebastian, Cseke, Botond, and Tomioka, Ryota. f-gan: Training generative neural samplers using variational divergence minimization. In Advances in Neural Information Processing Systems, pp. 271–279, 2016.
- Radford et al. (2015) Radford, Alec, Metz, Luke, and Chintala, Soumith. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
- Salimans et al. (2017) Salimans, Tim, Karpathy, Andrej, Chen, Xi, and Kingma, Diederik P. Pixelcnn++: Improving the pixelcnn with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
- van den Oord et al. (2016a) van den Oord, Aaron, Kalchbrenner, Nal, Espeholt, Lasse, Vinyals, Oriol, Graves, Alex, et al. Conditional image generation with pixelcnn decoders. In Advances in Neural Information Processing Systems, pp. 4790–4798, 2016a.
- van den Oord et al. (2016b) van den Oord, Aäron, Kalchbrenner, Nal, and Kavukcuoglu, Koray. Pixel recurrent neural networks. CoRR, abs/1601.06759, 2016b. URL http://arxiv.org/abs/1601.06759.
Appendix A Additional Results
A.1 Comparison to Adversarial Training
Adversarial training (Goodfellow et al., 2014) has shown great promise in generating high quality samples. The analysis in this paper points to a possible explanation of its relative success on complex natural image datasets compared to variational autoencoders. In Proposition 1 we pointed out that if $q(x|z_1) = q(x|z_2)$ then the model has no incentive to take advantage of its representation capacity by mapping $z_1$ and $z_2$ to different members of $\mathcal{P}$. This is the key reason why a simple family $\mathcal{P}$ requires an almost "lossless" $q$, and failure to satisfy this condition leads to fuzziness and other problems.
However adversarial training does not suffer from this limitation. In fact even without inference, adversarial training can map different latent codes $z$ to different members of $\mathcal{P}$. This is because for adversarial training we are not using $\mathcal{P}$ to approximate $q(x|z)$. Instead we are selecting a member of $\mathcal{P}$ whose support is covered by the support of real data. Intuitively, we would like to generate any real looking samples, and not a particular set.
Therefore we expect adversarial training to have an advantage over VAEs when $q(x|z)$ is expected to be complex and $\mathcal{P}$ is simple. However we showed in this paper that models with complex $\mathcal{P}$, carefully designed to avoid the information preference property, also show great promise.
A.2 Failure Modes for Factorized Discrete Family
When $\mathcal{P}$ is the family of factorized discrete distributions $p(x) = \prod_i p_i(x_i)$, where each $p_i$ is a discrete distribution on the $i$-th dimension, we obtain similar results.
Proposition 6. The optimal solution to $\mathcal{L}$ when $\mathcal{P}$ is the family of factorized discrete distributions, for a given $q$, is for all $i$ and $z$,
$$p_{\theta^*}(x_i|z) = q(x_i|z)$$
and for each $z$ the best achievable error is the pixel-wise negative entropy
$$-\sum_i H_q(x_i|z)$$
This shows us that for discrete distributions, mismatch between $\mathcal{P}$ and $q(x|z)$ manifests in a different way: by generating excessively noisy output where each pixel is independently sampled.
A.3 Estimating Mutual Information
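We can estimate mutual information by obtaining $M$ samples $x_i \sim p_{data}(x)$, $z_i \sim q(z|x_i)$, and
$$I_q(x; z) \approx \frac{1}{M}\sum_{i=1}^{M} \log \frac{q(z_i|x_i)}{\frac{1}{M}\sum_{j=1}^{M} q(z_i|x_j)}$$
This gives us good estimates unless the mutual information is large, because the above estimate is upper bounded by $\log M$.
This problem is not specific to our method of approximation. In fact, suppose the dataset has $M$ samples, then the true mutual information under the empirical data distribution is also upper bounded by
$$I(x; z) \le H(x) = \log M$$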
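A sample-based mutual information estimate of this kind, together with its $\log M$ ceiling, can be sketched as follows. The one-dimensional Gaussian encoder and the function names are assumptions for illustration; the estimator divides each $q(z_i|x_i)$ by the average of $q(z_i|x_j)$ over all samples.

```python
import numpy as np

def estimate_mutual_information(x, z, log_q):
    """Average of log q(z_i|x_i) - log[(1/M) sum_j q(z_i|x_j)] over samples."""
    log_cond = log_q(z, x)                               # log q(z_i | x_i)
    pairwise = log_q(z[:, None], x[None, :])             # [i, j] = log q(z_i | x_j)
    m = pairwise.max(axis=1, keepdims=True)              # stabilise the log-mean-exp
    log_marg = m[:, 0] + np.log(np.mean(np.exp(pairwise - m), axis=1))
    return np.mean(log_cond - log_marg)

def make_log_q(std):
    """Hypothetical encoder q(z | x) = N(x, std^2) for scalar x and z."""
    def log_q(z, x):
        return -0.5 * np.log(2 * np.pi * std**2) - 0.5 * ((z - x) / std) ** 2
    return log_q

rng = np.random.default_rng(0)
M = 500
x = rng.standard_normal(M)

# Noisy encoder: the true value is 0.5 * log(2), well below log M.
z_noisy = x + 1.0 * rng.standard_normal(M)
mi_noisy = estimate_mutual_information(x, z_noisy, make_log_q(1.0))

# Nearly lossless encoder: the true value exceeds log M, so the estimate saturates.
z_sharp = x + 1e-4 * rng.standard_normal(M)
mi_sharp = estimate_mutual_information(x, z_sharp, make_log_q(1e-4))
print(mi_noisy, mi_sharp, np.log(M))
```

The saturation near $\log M$ for the nearly lossless encoder is the ceiling discussed above, so large estimates should be read as lower bounds.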
Appendix B Proofs
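Proof of Propositions 1, 2 and 3. For any $\theta$, because $p_\theta(\cdot|z) \in \mathcal{P}$ for every $z$,
$$\mathcal{L} = E_{q(z)}\left[E_{q(x|z)}[\log p_\theta(x|z)]\right] \le E_{q(z)}\left[\max_{p \in \mathcal{P}} E_{q(x|z)}[\log p(x)]\right] \tag{11}$$
When $\mathcal{F}$ is a sufficiently large family, there must be an $f^* \in \mathcal{F}$ so that for all $z$, $P_{f^*(z)} \in \arg\max_{p \in \mathcal{P}} E_{q(x|z)}[\log p(x)]$. Therefore the bound in Equation (11) is attained, which means that $f^*$ is the global maximum of $\mathcal{L}$. Note that for any distribution $p$, we have
$$E_{q(x|z)}[\log p(x)] \le E_{q(x|z)}[\log q(x|z)]$$
from the non-negativity property of KL-divergence. If condition 2 is satisfied, i.e., $\forall z \in \mathcal{Z}$, $q(x|z) \in \mathcal{P}$, then the optimum in Equation (11) is attained by
$$p_{\theta^*}(x|z) = q(x|z)$$
for all $z$. Then
$$q(z)\, p_{\theta^*}(x|z) = q(z)\, q(x|z) = q(x, z)$$
which by definition has marginal $p_{data}(x)$. Finally if condition 3 is satisfied, then
$$p(z)\, p_{\theta^*}(x|z) = q(z)\, q(x|z) = q(x, z)$$
which also has marginal $p_{data}(x)$. ∎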
Proof of Proposition 4.
Given the that maximizes is given by
That is the optimal is simply the mean of , and under this , . The optimal must also map each to this because
Proof of Proposition 6.
Because is the family of factorized discrete distribution, denote each member of as where is the independent probability of the -th dimension taking value instead of , the loss for each can be written as
and the optimal solution to the above satisfies
whose unique solution is . We can further compute that the optimal loss as
Appendix C Experimental Setup
C.1 Sequential VAE
Each step of the Sequential VAE contains an encoder that takes as input a ground truth $x$ and produces a latent code $z^{(t)}$, and an autoencoder with shortcut connections which takes as input the output from the previous step $x^{(t-1)}$ and a latent code $z^{(t)}$ that is either generated from the prior (during testing) or by the encoder (during training). Shortcut connections encourage the learning of an identity mapping to help the model preserve and refine the results from the previous step. This is achieved either by direct addition or by gated addition with a learnable parameter.
The architecture is shown in Figure 7. We use a non-homogeneous Markov chain so weights are not shared between different time steps. For detailed information about the architecture please refer to https://github.com/ShengjiaZhao/Sequential-Variational-Autoencoder
C.2 VAE with PixelCNN
For MNIST we use a simplified version of the conditional PixelCNN architecture (van den Oord et al., 2016a). For CIFAR we use the public implementation of PixelCNN++ (Salimans et al., 2017). In either case we use a convolutional network to generate a 20 dimensional latent code, and plug this into the conditional input for both models. The entire model is trained end to end with or without regularization. For detailed information please refer to https://github.com/ShengjiaZhao/Generalized-PixelVAE.