1 Introduction
Our starting point is the desire to learn a probabilistic generative model of observable variables $x \in \chi$, where $\chi$ is an $r$-dimensional manifold embedded in $\mathbb{R}^d$. Note that if $r = d$, then this assumption places no restriction on the distribution of $x \in \mathbb{R}^d$ whatsoever; however, the added formalism is introduced to handle the frequently encountered case where $x$ possesses low-dimensional structure relative to a high-dimensional ambient space, i.e., $r \ll d$. In fact, the very utility of generative models of continuous data, and their attendant low-dimensional representations, often hinges on this assumption (Bengio et al., 2013). It therefore behooves us to explicitly account for this situation.
Beyond this, we assume that $\chi$ is a simple Riemannian manifold, which means there exists a diffeomorphism $\varphi$ between $\chi$ and $\mathbb{R}^r$, or more explicitly, the mapping $\varphi: \chi \mapsto \mathbb{R}^r$ is invertible and differentiable. Denote a ground-truth probability measure on $\chi$ as $\mu_{gt}$ such that the probability mass of an infinitesimal $dx$ on the manifold is $\mu_{gt}(dx)$ and $\int_\chi \mu_{gt}(dx) = 1$.

The variational autoencoder (VAE) (Kingma and Welling, 2014; Rezende et al., 2014) attempts to approximate this ground-truth measure using a parameterized density $p_\theta(x)$ defined across all of $\mathbb{R}^d$ since any underlying generative manifold is unknown in advance. This density is further assumed to admit the latent decomposition $p_\theta(x) = \int p_\theta(x|z)\,p(z)\,dz$, where $z \in \mathbb{R}^\kappa$ serves as a low-dimensional representation, with $\kappa \approx r$ and prior $p(z) = \mathcal{N}(z|0, I)$.
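Ideally we might like to minimize the negative log-likelihood $-\log p_\theta(x)$ averaged across the ground-truth measure $\mu_{gt}$, i.e., solve $\min_\theta \int_\chi -\log p_\theta(x)\,\mu_{gt}(dx)$. Unfortunately though, the required marginalization over $z$ is generally infeasible. Instead the VAE model relies on tractable encoder $q_\phi(z|x)$ and decoder $p_\theta(x|z)$ distributions, where $\phi$ represents additional trainable parameters. The canonical VAE cost is a bound on the average negative log-likelihood given by

$$\mathcal{L}(\theta,\phi) \triangleq \int_\chi \left\{ -\log p_\theta(x) + \mathrm{KL}\left[ q_\phi(z|x) \,\|\, p_\theta(z|x) \right] \right\} \mu_{gt}(dx) \;\geq\; \int_\chi -\log p_\theta(x)\,\mu_{gt}(dx), \tag{1}$$

where the inequality follows directly from the non-negativity of the KL-divergence. Here $\phi$ can be viewed as tuning the tightness of the bound, while $\theta$ dictates the actual estimation of $\mu_{gt}$. Using a few standard manipulations, this bound can also be expressed as

$$\mathcal{L}(\theta,\phi) = \int_\chi \left\{ -\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] + \mathrm{KL}\left[ q_\phi(z|x) \,\|\, p(z) \right] \right\} \mu_{gt}(dx), \tag{2}$$

which explicitly involves the encoder/decoder distributions and is conveniently amenable to SGD optimization of $\{\theta, \phi\}$ via a reparameterization trick (Kingma and Welling, 2014; Rezende et al., 2014). The first term in (2) can be viewed as a reconstruction cost (or a stochastic analog of a traditional autoencoder), while the second penalizes posterior deviations from the prior $p(z)$. Additionally, for any realizable implementation via SGD, the integration over $\chi$ must be approximated via a finite sum across training samples $\{x^{(i)}\}_{i=1}^n$ drawn from $\mu_{gt}$. Nonetheless, examining the true objective $\mathcal{L}(\theta,\phi)$ can lead to important, practically-relevant insights.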
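The "standard manipulations" connecting the two forms of the bound can be spelled out for a single $x$. The following is a sketch, writing $q_\phi(z|x)$ for the encoder, $p_\theta(x|z)$ for the decoder, and $p(z)$ for the prior (notation assumed here), and using only Bayes' rule $\log p_\theta(z|x) = \log p_\theta(x|z) + \log p(z) - \log p_\theta(x)$:

```latex
\begin{align*}
-\log p_\theta(x) + \mathrm{KL}\left[ q_\phi(z|x) \,\|\, p_\theta(z|x) \right]
&= -\log p_\theta(x) + \mathbb{E}_{q_\phi(z|x)}\left[ \log q_\phi(z|x) - \log p_\theta(z|x) \right] \\
&= \mathbb{E}_{q_\phi(z|x)}\left[ \log q_\phi(z|x) - \log p_\theta(x|z) - \log p(z) \right] \\
&= -\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] + \mathrm{KL}\left[ q_\phi(z|x) \,\|\, p(z) \right].
\end{align*}
```

The $-\log p_\theta(x)$ term cancels in the second line because it does not depend on $z$. Integrating both sides against the ground-truth measure then gives the equivalence between (1) and (2).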
At least in principle, $q_\phi(z|x)$ and $p_\theta(x|z)$ can be arbitrary distributions, in which case we could simply enforce $q_\phi(z|x) = p_\theta(z|x) \propto p_\theta(x|z)\,p(z)$ such that the bound from (1) is tight. Unfortunately though, this is essentially always an intractable undertaking. Consequently, largely to facilitate practical implementation, a commonly adopted distributional assumption for continuous data is that both $q_\phi(z|x)$ and $p_\theta(x|z)$ are Gaussian. This design choice has previously been cited as a key limitation of VAEs (Burda et al., 2015; Kingma et al., 2016), and existing quantitative tests of generative modeling quality thus far dramatically favor contemporary alternatives such as generative adversarial networks (GAN) (Goodfellow et al., 2014). Regardless, because the VAE possesses certain desirable properties relative to GAN models (e.g., stable training (Tolstikhin et al., 2018), interpretable encoder/inference network (Brock et al., 2016), outlier-robustness (Dai et al., 2018), etc.), it remains a highly influential paradigm worthy of examination and enhancement.

In Section 2 we closely investigate the implications of VAE Gaussian assumptions, leading to a number of interesting diagnostic conclusions. In particular, we differentiate the situation where $r = d$, in which case we prove that recovering the ground-truth distribution is actually possible iff the VAE global optimum is reached, and $r < d$, in which case the VAE global optimum can be reached by solutions that reflect the ground-truth distribution almost everywhere, but not necessarily uniquely so. In other words, there could exist alternative solutions that both reach the global optimum and yet do not assign the same probability measure as $\mu_{gt}$.
Section 3 then further probes this non-uniqueness issue by inspecting necessary conditions of global optima when $r < d$. This analysis reveals that an optimal VAE parameterization will provide an encoder/decoder pair capable of perfectly reconstructing all $x \in \chi$ using any $z$ drawn from $q_\phi(z|x)$. Moreover, we demonstrate that the VAE accomplishes this using a degenerate latent code whereby only $r$ dimensions are effectively active. Collectively, these results indicate that the VAE global optimum can in fact uniquely learn a mapping to the correct ground-truth manifold when $r < d$, but not necessarily the correct probability measure within this manifold, a critical distinction.
Next we leverage these analytical results in Section 4 to motivate an almost trivially-simple, two-stage VAE enhancement for addressing typical regimes when $r < d$. In brief, the first stage just learns the manifold per the allowances from Section 3, and in doing so, provides a mapping to a lower-dimensional intermediate representation with no degenerate dimensions that mirrors the $r = d$ regime. The second (much smaller) stage then only needs to learn the correct probability measure on this intermediate representation, which is possible per the analysis from Section 2. Experiments from Section 5 reveal that this procedure can generate high-quality crisp samples, avoiding the blurriness often attributed to VAE models in the past (Dosovitskiy and Brox, 2016; Larsen et al., 2015). And to the best of our knowledge, this is the first demonstration of a VAE pipeline that can produce stable FID scores, an influential recent metric for evaluating generated sample quality (Heusel et al., 2017), that are comparable to GAN models under neutral testing conditions. Moreover, this is accomplished without additional penalty functions, cost function modifications, or sensitive tuning parameters. Finally, Section 6 provides concluding thoughts and a discussion of broader VAE modeling paradigms, such as those involving normalizing flows, parameterized families for $p(z)$, or modifications to encourage disentangled representations.
2 High-Level Impact of VAE Gaussian Assumptions
Conventional wisdom suggests that VAE Gaussian assumptions will introduce a gap between $\mathcal{L}(\theta,\phi)$ and the ideal negative log-likelihood $\int_\chi -\log p_\theta(x)\,\mu_{gt}(dx)$, compromising efforts to learn the ground-truth measure. However, we will now argue that this pessimism is in some sense premature. In fact, we will demonstrate that, even with the stated Gaussian distributions, there exist parameters $\phi$ and $\theta$ that can simultaneously (i) globally optimize the VAE objective and (ii) recover the ground-truth probability measure in a certain sense described below. This is possible because, at least for some coordinated values of $\phi$ and $\theta$, $q_\phi(z|x)$ and $p_\theta(z|x)$ can indeed become arbitrarily close. Before presenting the details, we first formalize a $\kappa$-simple VAE, which is merely a VAE model with explicit Gaussian assumptions and parameterizations:

A $\kappa$-simple VAE is defined as a VAE model with $\dim[z] = \kappa$ latent dimensions, the Gaussian encoder $q_\phi(z|x) = \mathcal{N}(z|\mu_z, \Sigma_z)$, and the Gaussian decoder $p_\theta(x|z) = \mathcal{N}(x|\mu_x, \Sigma_x)$. Moreover, the encoder moments are defined as $\mu_z = f_{\mu_z}(x;\phi)$ and $\Sigma_z = S_z S_z^\top$ with $S_z = f_{S_z}(x;\phi)$. Likewise, the decoder moments are $\mu_x = f_{\mu_x}(z;\theta)$ and $\Sigma_x = \gamma I$. Here $\gamma > 0$ is a tunable scalar, while $f_{\mu_z}$, $f_{S_z}$, and $f_{\mu_x}$ specify parameterized differentiable functional forms that can be arbitrarily complex, e.g., a deep neural network.
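To make the definition concrete, the per-sample cost of a Gaussian encoder/decoder pair with a scalar decoder variance can be sketched in a few lines. This is only a minimal illustration under stated assumptions: the encoder covariance factor is taken to be diagonal, the expectation over the latent code is replaced by a single reparameterized draw, and `toy_encoder` and `toy_decoder` are hypothetical stand-ins for the learned networks rather than anything specified in the text.

```python
import math
import random


def sample_latent(mu_z, s_z, rng):
    """Reparameterized draw z = mu_z + s_z * eps with eps ~ N(0, I).

    Assumption: S_z is diagonal, passed as the list s_z of standard deviations.
    """
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_z, s_z)]


def kl_to_standard_normal(mu_z, s_z):
    """Closed-form KL[N(mu_z, diag(s_z^2)) || N(0, I)]."""
    return 0.5 * sum(s * s + m * m - 1.0 - math.log(s * s) for m, s in zip(mu_z, s_z))


def gaussian_nll(x, mu_x, gamma):
    """-log N(x | mu_x, gamma * I): scaled squared error plus (d / 2) * log(2 * pi * gamma)."""
    d = len(x)
    sq_err = sum((a - b) ** 2 for a, b in zip(x, mu_x))
    return 0.5 * sq_err / gamma + 0.5 * d * math.log(2.0 * math.pi * gamma)


def simple_vae_cost(x, encoder, decoder, gamma, rng):
    """Single-sample Monte Carlo estimate of the VAE cost for one data point x."""
    mu_z, s_z = encoder(x)
    z = sample_latent(mu_z, s_z, rng)
    return gaussian_nll(x, decoder(z), gamma) + kl_to_standard_normal(mu_z, s_z)


# Hypothetical toy networks with d = 2 and kappa = 1 (not learned, illustration only).
def toy_encoder(x):
    return [0.5 * (x[0] + x[1])], [0.1]


def toy_decoder(z):
    return [z[0], z[0]]


rng = random.Random(0)
cost = simple_vae_cost([1.0, 1.0], toy_encoder, toy_decoder, gamma=0.5, rng=rng)
print(cost)
```

Note that when the decoder variance is held fixed, `gaussian_nll` differs from a plain squared-error loss only by a scale and an additive constant, whereas treating it as a trainable scalar lets the `log(2 * pi * gamma)` term influence the optimum.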
Equipped with these definitions, we will now demonstrate that a $\kappa$-simple VAE, with $\kappa \geq r$, can achieve the optimality criteria (i) and (ii) from above. In doing so, we first consider the simpler case where $r = d$, followed by the extended scenario with $r < d$. The distinction between these two cases turns out to be significant, with practical implications to be explored in Section 4.
2.1 Manifold Dimension Equal to Ambient Space Dimension ($r = d$)
We first analyze the specialized situation where $r = d$. Assuming $p_{gt}(x) \triangleq \mu_{gt}(dx)/dx$ exists everywhere in $\mathbb{R}^d$, then $p_{gt}(x)$ represents the ground-truth probability density with respect to the standard Lebesgue measure in Euclidean space. Given these considerations, the minimal possible value of (1) will necessarily occur if

$$\mathrm{KL}\left[ q_\phi(z|x) \,\|\, p_\theta(z|x) \right] = 0 \quad \text{and} \quad p_\theta(x) = p_{gt}(x) \quad \text{almost everywhere}. \tag{3}$$

This follows because by VAE design it must be that $\mathcal{L}(\theta,\phi) \geq -\int p_{gt}(x) \log p_{gt}(x)\,dx$, and in the present context, this lower bound is achievable iff the conditions from (3) hold. Collectively, this implies that the approximate posterior produced by the encoder $q_\phi(z|x)$ is in fact perfectly matched to the actual posterior $p_\theta(z|x)$, while the corresponding marginalized data distribution $p_\theta(x)$ is perfectly matched to the ground-truth density $p_{gt}(x)$ as desired. Perhaps surprisingly, a $\kappa$-simple VAE can actually achieve such a solution:
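Suppose that $r = d$ and there exists a density $p_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\mathbb{R}^d$.^{1} Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^*, \phi_t^*\}$ such that

$$\lim_{t \to \infty} \mathrm{KL}\left[ q_{\phi_t^*}(z|x) \,\|\, p_{\theta_t^*}(z|x) \right] = 0 \quad \text{and} \quad \lim_{t \to \infty} p_{\theta_t^*}(x) = p_{gt}(x) \quad \text{almost everywhere}. \tag{4}$$

^{1} This nonzero assumption can be replaced with a much looser condition. Specifically, if there exists a diffeomorphism between the set $\{x \,|\, p_{gt}(x) \neq 0\}$ and $\mathbb{R}^d$, then it can be shown that Theorem 2.1 still holds even if $p_{gt}(x) = 0$ for some $x \in \mathbb{R}^d$.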
All the proofs can be found in the appendices. So at least when $r = d$, the VAE Gaussian assumptions need not actually prevent the optimal ground-truth probability measure from being recovered, as long as the latent dimension is sufficiently large (i.e., $\kappa \geq r$). And contrary to popular notions, a richer class of distributions is not required to achieve this. Of course Theorem 2.1 only applies to a restricted case that excludes $d > r$; however, later we will demonstrate that a key consequence of this result can nonetheless be leveraged to dramatically enhance VAE performance.
2.2 Manifold Dimension Less Than Ambient Space Dimension ($r < d$)
When $r < d$, additional subtleties are introduced that will be unpacked both here and in the sequel. To begin, if both $q_\phi(z|x)$ and $p_\theta(x|z)$ are arbitrary/unconstrained (i.e., not necessarily Gaussian), then $\inf_{\phi,\theta} \mathcal{L}(\theta,\phi) = -\infty$. To achieve this global optimum, we need only choose $\phi$ such that $q_\phi(z|x) = p_\theta(z|x)$ (minimizing the KL term from (1)) while selecting $\theta$ such that all probability mass collapses to the correct manifold $\chi$. In this scenario the density $p_\theta(x)$ will become unbounded on $\chi$ and zero elsewhere, such that $\int_\chi -\log p_\theta(x)\,\mu_{gt}(dx)$ will approach negative infinity.
But of course the stated Gaussian assumptions from the $\kappa$-simple VAE model could ostensibly prevent this from occurring by causing the KL term to blow up, counteracting the negative log-likelihood factor. We will now analyze this case to demonstrate that this need not happen. Before proceeding to this result, we first define a manifold density $\tilde{p}_{gt}(x)$ as the probability density (assuming it exists) of $\mu_{gt}$ with respect to the volume measure of the manifold $\chi$. If $d = r$ then this volume measure reduces to the standard Lebesgue measure in $\mathbb{R}^d$ and $\tilde{p}_{gt}(x) = p_{gt}(x)$; however, when $d > r$ a density $p_{gt}(x)$ defined in $\mathbb{R}^d$ will not technically exist, while $\tilde{p}_{gt}(x)$ is still perfectly well-defined. We then have the following:
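Assume $r < d$ and that there exists a manifold density $\tilde{p}_{gt}(x)$ associated with the ground-truth measure $\mu_{gt}$ that is nonzero everywhere on $\chi$. Then for any $\kappa \geq r$, there is a sequence of $\kappa$-simple VAE model parameters $\{\theta_t^*, \phi_t^*\}$ such that

(i)
$$\lim_{t \to \infty} \mathrm{KL}\left[ q_{\phi_t^*}(z|x) \,\|\, p_{\theta_t^*}(z|x) \right] = 0 \quad \text{and} \quad \lim_{t \to \infty} \int_\chi -\log p_{\theta_t^*}(x)\,\mu_{gt}(dx) = -\infty, \tag{5}$$

(ii)
$$\lim_{t \to \infty} \int_{x \in A} p_{\theta_t^*}(x)\,dx = \mu_{gt}(A \cap \chi) \tag{6}$$

for all measurable sets $A \subseteq \mathbb{R}^d$ with $\mu_{gt}(\partial A \cap \chi) = 0$, where $\partial A$ is the boundary of $A$.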
Technical details notwithstanding, Theorem 2.2 admits a very intuitive interpretation. First, (5) directly implies that the VAE Gaussian assumptions do not prevent minimization of $\mathcal{L}(\theta,\phi)$ from converging to minus infinity, which can be trivially viewed as a globally optimal solution. Furthermore, based on (6), this solution can be achieved with a limiting density estimate $\lim_{t \to \infty} p_{\theta_t^*}(x)$ that will assign a probability mass to nearly all measurable subsets of $\mathbb{R}^d$ that is indistinguishable from the ground-truth measure (which confines all mass to $\chi$). Hence this solution is more-or-less an arbitrarily-good approximation to $\mu_{gt}$ for all practical purposes.^{2}

^{2} Note that (6) is only framed in this technical way to accommodate the difficulty of comparing a measure $\mu_{gt}$ restricted to $\chi$ with the VAE density $p_\theta(x)$ defined everywhere in $\mathbb{R}^d$. See the appendices for details.
Regardless, there is an absolutely crucial distinction between Theorem 2.2 and the simpler case quantified by Theorem 2.1. Although both describe conditions whereby the $\kappa$-simple VAE can achieve the minimal possible objective, in the $r = d$ case achieving the lower bound (whether the specific parameterization for doing so is unique or not) necessitates that the ground-truth probability measure has been recovered almost everywhere. But the $r < d$ situation is quite different because we have not ruled out the possibility that a different set of parameters $\{\theta, \phi\}$ could push $\mathcal{L}(\theta,\phi)$ to $-\infty$ and yet not achieve (6). In other words, the VAE could reach the lower bound but fail to closely approximate $\mu_{gt}$. And we stress that this uniqueness issue is not a consequence of the VAE Gaussian assumptions per se; even if $q_\phi(z|x)$ were unconstrained the same lack of uniqueness can persist.
Rather, the intrinsic difficulty is that, because the VAE model does not have access to the ground-truth low-dimensional manifold, it must implicitly rely on a density $p_\theta(x)$ defined across all of $\mathbb{R}^d$ as mentioned previously. Moreover, if this density converges towards infinity on the manifold during training without increasing the KL term at the same rate, the VAE cost can be unbounded from below, even in cases where (6) is not satisfied, meaning incorrect assignment of probability mass.
To conclude, the key take-home message from this section is that, at least in principle, VAE Gaussian assumptions need not actually be the root cause of any failure to recover ground-truth distributions. Instead we expose a structural deficiency that lies elsewhere, namely, the non-uniqueness of solutions that can optimize the VAE objective without necessarily learning a close approximation to $\mu_{gt}$. But to probe this issue further and motivate possible workarounds, it is critical to further disambiguate these optimal solutions and their relationship with ground-truth manifolds. This will be the task of Section 3, where we will explicitly differentiate the problem of locating the correct ground-truth manifold from the task of learning the correct probability measure within the manifold.
Note that the only comparable prior work we are aware of related to the results in this section comes from Doersch (2016), where the implications of adopting Gaussian encoder/decoder pairs in the specialized case of $r = d = 1$ are briefly considered. Moreover, the analysis there requires additional, much stronger assumptions than ours, namely, that $p_{gt}(x)$ should be nonzero and infinitely differentiable everywhere in the requisite 1D ambient space. These requirements of course exclude essentially all practical usage regimes where $d > 1$ or $r < d$, or when ground-truth densities are not sufficiently smooth.
3 Optimal Solutions and the Ground Truth Manifold
We will now more closely examine the properties of optimal $\kappa$-simple VAE solutions, and in particular, the degree to which we might expect them to at least reflect the true $\chi$, even if perhaps not the correct probability measure $\mu_{gt}$ defined within $\chi$. To do so, we must first consider some necessary conditions for VAE optima:
Let $\{\theta^*_\gamma, \phi^*_\gamma\}$ denote an optimal $\kappa$-simple VAE solution (with $\kappa \geq r$) where the decoder variance $\gamma$ is fixed (i.e., it is the sole unoptimized parameter). Moreover, we assume that $\mu_{gt}$ is not a Gaussian distribution when $d = r$.^{3} Then for any $\gamma > 0$, there exists a $\gamma' < \gamma$ such that $\mathcal{L}(\theta^*_{\gamma'}, \phi^*_{\gamma'}) < \mathcal{L}(\theta^*_\gamma, \phi^*_\gamma)$.

^{3} This requirement is only included to avoid a practically irrelevant form of non-uniqueness that exists with full, non-degenerate Gaussian distributions.

This result implies that we can always reduce the VAE cost by choosing a smaller value of $\gamma$, and hence, if $\gamma$ is not constrained, it must be that $\gamma \to 0$ if we wish to minimize (2). Despite this necessary optimality condition, in existing practical VAE applications, it is standard to fix $\gamma \approx 1$ during training. This is equivalent to simply adopting a non-adaptive squared-error loss for the decoder and, at least in part, likely contributes to unrealistic/blurry VAE-generated samples (Bousquet et al., 2017). Regardless, there are more significant consequences of this intrinsic favoritism for $\gamma \to 0$, in particular as related to reconstructing data drawn from the ground-truth manifold $\chi$:
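Applying the same conditions and definitions as in Theorem 3, then for all $x$ drawn from $\mu_{gt}$, we also have that

$$\lim_{\gamma \to 0} f_{\mu_x}\left[ f_{\mu_z}(x;\phi^*_\gamma) + f_{S_z}(x;\phi^*_\gamma)\,\epsilon;\; \theta^*_\gamma \right] = \lim_{\gamma \to 0} f_{\mu_x}\left[ f_{\mu_z}(x;\phi^*_\gamma);\; \theta^*_\gamma \right] = x, \quad \forall \epsilon \in \mathbb{R}^\kappa. \tag{7}$$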
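By design any random draw $z \sim q_{\phi^*_\gamma}(z|x)$ can be expressed as $f_{\mu_z}(x;\phi^*_\gamma) + f_{S_z}(x;\phi^*_\gamma)\,\epsilon$ for some $\epsilon \sim \mathcal{N}(\epsilon|0, I)$. From this vantage point then, (7) effectively indicates that any $x \in \chi$ will be perfectly reconstructed by the VAE encoder/decoder pair at globally optimal solutions, achieving this necessary condition despite any possible stochastic corrupting factor $f_{S_z}(x;\phi^*_\gamma)\,\epsilon$.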
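But still further insights can be obtained when we more closely inspect the VAE objective function behavior at arbitrarily small but explicitly nonzero values of $\gamma$. In particular, when $\kappa = r$ (meaning $z$ has no superfluous capacity), Theorem 3 and attendant analyses in the appendices ultimately imply that the squared eigenvalues of $f_{S_z}(x;\phi^*_\gamma)$ will become arbitrarily small at a rate proportional to $\gamma$, meaning $\tfrac{1}{\sqrt{\gamma}} f_{S_z}(x;\phi^*_\gamma) \approx O(1)$ under mild conditions. It then follows that the VAE data term integrand from (2), in the neighborhood around optimal solutions, behaves as

$$-2\,\mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\left[ \log p_{\theta^*_\gamma}(x|z) \right] = \mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\left[ \tfrac{1}{\gamma} \left\| x - f_{\mu_x}[z;\theta^*_\gamma] \right\|_2^2 \right] + d \log 2\pi\gamma \approx \mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\left[ O(1) \right] + d \log 2\pi\gamma = d \log \gamma + O(1). \tag{8}$$

This expression can be derived by excluding the higher-order terms of a Taylor series approximation of $f_{\mu_x}\left[ f_{\mu_z}(x;\phi^*_\gamma) + f_{S_z}(x;\phi^*_\gamma)\,\epsilon;\; \theta^*_\gamma \right]$ around the point $f_{\mu_z}(x;\phi^*_\gamma)$, which will be relatively tight under the stated conditions. But because $\mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\left[ \tfrac{1}{\gamma} \| x - f_{\mu_x}[z;\theta^*_\gamma] \|_2^2 \right] \geq 0$, a theoretical lower bound on (8) is given by $d \log 2\pi\gamma \equiv d \log \gamma + O(1)$. So in this sense (8) cannot be significantly lowered further.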
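This observation is significant when we consider the inclusion of additional latent dimensions by allowing $\kappa > r$. Clearly based on the analysis above, adding dimensions to $z$ cannot improve the value of the VAE data term in any meaningful way. However, it can have a detrimental impact on the KL regularization factor in the $\gamma \to 0$ regime, where

$$2\,\mathrm{KL}\left[ q_\phi(z|x) \,\|\, p(z) \right] \equiv \mathrm{trace}\left[ \Sigma_z \right] + \| \mu_z \|_2^2 - \log \left| \Sigma_z \right| \approx -\hat{r} \log \gamma + O(1). \tag{9}$$

Here $\hat{r}$ denotes the number of eigenvalues $\{\lambda_j(\gamma)\}_{j=1}^\kappa$ of $f_{S_z}(x;\phi^*_\gamma)$ (or equivalently $\Sigma_z$) that satisfy $\lambda_j(\gamma) \to 0$ if $\gamma \to 0$. $\hat{r}$ can be viewed as an estimate of how many low-noise latent dimensions the VAE model is preserving to reconstruct $x$. Based on (9), there is obvious pressure to make $\hat{r}$ as small as possible, at least without disrupting the data fit. The smallest possible value is $\hat{r} = r$, since it is not difficult to show that any value below this will contribute consequential reconstruction errors, causing $\mathbb{E}_{q_{\phi^*_\gamma}(z|x)}\left[ \tfrac{1}{\gamma} \| x - f_{\mu_x}[z;\theta^*_\gamma] \|_2^2 \right]$ to grow at a rate of $\Omega\left(\tfrac{1}{\gamma}\right)$, pushing the entire cost function towards infinity.^{4}

^{4} Note that $\lim_{\gamma \to 0} \tfrac{C}{\gamma} + \log \gamma = \infty$ for any $C > 0$.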
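The logarithmic growth of the KL regularization factor can be checked numerically. The sketch below makes a simplifying assumption that is not part of the analysis itself: the encoder covariance has exactly `r_hat` eigenvalues equal to the decoder variance `gamma` (the low-noise directions) and the remaining eigenvalues equal to one (directions left at the prior variance). Unlike the displayed expression, the helper keeps the additive constant in the closed-form KL.

```python
import math


def two_kl(mu_z, eigenvalues):
    """2 * KL[N(mu_z, Sigma_z) || N(0, I)], given the eigenvalues of Sigma_z."""
    kappa = len(eigenvalues)
    trace = sum(eigenvalues)
    log_det = sum(math.log(lam) for lam in eigenvalues)
    return trace + sum(m * m for m in mu_z) - kappa - log_det


def encoder_eigenvalues(kappa, r_hat, gamma):
    """Assumed spectrum: r_hat low-noise directions at gamma, the rest at variance 1."""
    return [gamma] * r_hat + [1.0] * (kappa - r_hat)


kappa, r_hat = 8, 3
mu_z = [0.0] * kappa
for gamma in (1e-2, 1e-4, 1e-6):
    value = two_kl(mu_z, encoder_eigenvalues(kappa, r_hat, gamma))
    # After removing the -r_hat * log(gamma) term, what is left stays bounded.
    remainder = value + r_hat * math.log(gamma)
    print(gamma, value, remainder)
```

Under this assumed spectrum the remainder equals `r_hat * (gamma - 1)`, so it settles near `-r_hat` as `gamma` shrinks, while the total grows like `-r_hat * log(gamma)`. Directions held at unit variance contribute nothing, which is the sense in which fewer low-noise dimensions are cheaper.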
Therefore, in the neighborhood of optimal solutions the VAE will naturally seek to produce perfect reconstructions using the fewest number of clean, low-noise latent dimensions, meaning dimensions whereby $q_\phi(z|x)$ has negligible variance. For superfluous dimensions that are unnecessary for representing $x$, the associated encoder variance in these directions can be pushed to one. This will optimize $\mathrm{KL}\left[ q_\phi(z|x) \,\|\, p(z) \right]$ along these directions, and the decoder can selectively block the residual randomness to avoid influencing the reconstructions per Theorem 3. So in this sense the VAE is capable of learning a minimal representation of the ground-truth manifold $\chi$ when $r < \kappa$.
But we must emphasize that the VAE can learn $\chi$ independently of the actual distribution $\mu_{gt}$ within $\chi$. Addressing the latter is a completely separate issue from achieving the perfect reconstruction error defined by Theorem 3. This fact can be understood within the context of a traditional PCA-like model, which is perfectly capable of learning a low-dimensional subspace containing some training data without actually learning the distribution of the data within this subspace. The central issue is that there exists an intrinsic bias associated with the VAE objective such that fitting the distribution within the manifold will be completely neglected whenever there exists the chance for even an infinitesimally better approximation of the manifold itself.
Stated differently, if VAE model parameters have learned a near optimal, parsimonious latent mapping onto $\chi$ using $\gamma \approx 0$, then the VAE cost will scale as $(d - r) \log \gamma$ regardless of $\mu_{gt}$. Hence there remains a huge incentive to reduce the reconstruction error still further, allowing $\gamma$ to push even closer to zero and the cost closer to $-\infty$. And if we constrain $\gamma$ to be sufficiently large so as to prevent this from happening, then we risk degrading/blurring the reconstructions and widening the gap between $q_\phi(z|x)$ and $p_\theta(z|x)$, which can also compromise estimation of $\mu_{gt}$. Fortunately though, as will be discussed next, there is a convenient way around this dilemma by exploiting the fact that this dominating $(d - r) \log \gamma$ factor goes away when $d = r$.
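The scaling of the overall cost follows by adding the two approximations (8) and (9). As a brief sketch, write $\gamma$ for the decoder variance, $d$ for the ambient dimension, $r$ for the manifold dimension, and $\hat{r}$ for the number of low-noise latent dimensions (notation assumed here), and consider twice the integrand of (2) near an optimal solution:

```latex
\begin{align*}
-2\,\mathbb{E}_{q_\phi(z|x)}\left[ \log p_\theta(x|z) \right] + 2\,\mathrm{KL}\left[ q_\phi(z|x) \,\|\, p(z) \right]
&\approx \left[ d \log \gamma + O(1) \right] + \left[ -\hat{r} \log \gamma + O(1) \right] \\
&= (d - \hat{r}) \log \gamma + O(1).
\end{align*}
```

With the smallest admissible value $\hat{r} = r$, the leading term is $(d - r) \log \gamma$, which diverges to $-\infty$ as $\gamma \to 0$ whenever $d > r$ and vanishes when $d = r$, leaving only the bounded remainder to be minimized.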
4 From Theory to Practical VAE Enhancements
Sections 2 and 3 have exposed a collection of VAE properties with useful diagnostic value in and of themselves. But the practical utility of these results, beyond the underappreciated benefit of learning $\gamma$, warrants further exploration. In this regard, suppose we wish to develop a generative model of high-dimensional data $x \in \mathbb{R}^d$ where unknown low-dimensional structure is significant (i.e., the $r < d$ case with $r$ unknown). The results from Section 3 indicate that the VAE can partially handle this situation by learning a parsimonious representation of low-dimensional manifolds, but not necessarily the correct probability measure within such a manifold. In quantitative terms, this means that a decoder $p_\theta(x|z)$ will map all samples from an encoder $q_\phi(z|x)$ to the correct manifold such that the reconstruction error is negligible for any $x \in \chi$. But if the measure $\mu_{gt}$ on $\chi$ has not been accurately estimated, then

$$q_\phi(z) \triangleq \int_\chi q_\phi(z|x)\,\mu_{gt}(dx) \;\not\approx\; \int_{\mathbb{R}^d} p_\theta(z|x)\,p_\theta(x)\,dx = \int_{\mathbb{R}^d} p_\theta(x|z)\,p(z)\,dx = \mathcal{N}(z|0, I), \tag{10}$$

where $q_\phi(z)$ is sometimes referred to as the aggregated posterior (Makhzani et al., 2016). In other words, the distribution of the latent samples drawn from the encoder distribution, when averaged across the training data, will have lingering latent structure that is errantly incongruous with the original isotropic Gaussian prior. This then disrupts the pivotal ancestral sampling capability of the VAE, implying that samples drawn from $\mathcal{N}(z|0, I)$ and then passed through the decoder $p_\theta(x|z)$ will not closely approximate $\mu_{gt}$. Fortunately, our analysis suggests the following two-stage remedy:

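1. Given $n$ observed samples $\{x^{(i)}\}_{i=1}^n$, train a $\kappa$-simple VAE, with $\kappa \geq r$, to estimate the unknown $r$-dimensional ground-truth manifold $\chi$ embedded in $\mathbb{R}^d$ using a minimal number of active latent dimensions. Generate latent samples $\{z^{(i)}\}_{i=1}^n$ via $z^{(i)} \sim q_\phi(z|x^{(i)})$. By design, these samples will be distributed as $q_\phi(z)$, but likely not $\mathcal{N}(z|0, I)$.

2. Train a second $\kappa$-simple VAE, with independent parameters $\{\theta', \phi'\}$ and latent representation $u$, to learn the unknown distribution $q_\phi(z)$, i.e., treat $q_\phi(z)$ as a new ground-truth distribution and use samples $\{z^{(i)}\}_{i=1}^n$ to learn it.

3. Samples approximating the original ground-truth $\mu_{gt}$ can then be formed via the extended ancestral process $u \sim \mathcal{N}(u|0, I)$, $z \sim p_{\theta'}(z|u)$, and finally $x \sim p_\theta(x|z)$.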
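The data flow of the three steps above can be sketched as follows. This is only an illustration of the sampling logic, not a training implementation: `toy_encoder1`, `toy_decoder1`, and `toy_decoder2` are hypothetical hand-written stand-ins for networks that would actually be learned, both decoders are assumed to be Gaussian with scalar variances, and the second-stage latent variable is assumed to have the same dimension as the first-stage one.

```python
import random


def encode_dataset(xs, encoder, rng):
    """Step 1 output: one latent draw per training sample from the first-stage encoder."""
    latents = []
    for x in xs:
        mu_z, s_z = encoder(x)
        latents.append([m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu_z, s_z)])
    return latents


def gaussian_draw(mean, variance, rng):
    """Sample from N(mean, variance * I) for a list-valued mean and scalar variance."""
    std = variance ** 0.5
    return [m + std * rng.gauss(0.0, 1.0) for m in mean]


def two_stage_sample(decoder2, gamma2, decoder1, gamma1, kappa, rng):
    """Step 3: draw u from N(0, I), then z from the second decoder, then x from the first."""
    u = [rng.gauss(0.0, 1.0) for _ in range(kappa)]
    z = gaussian_draw(decoder2(u), gamma2, rng)
    return gaussian_draw(decoder1(z), gamma1, rng)


# Hypothetical stand-ins with d = 3 and kappa = 2; real networks would be trained.
def toy_encoder1(x):
    # One low-noise active dimension and one superfluous dimension left at unit variance.
    return [x[0], 0.0], [0.05, 1.0]


def toy_decoder1(z):
    # Ignores the superfluous second latent coordinate.
    return [z[0], z[0], -z[0]]


def toy_decoder2(u):
    return [2.0 * u[0], u[1]]


rng = random.Random(0)
xs = [[1.0, 1.0, -1.0], [0.5, 0.5, -0.5]]
zs = encode_dataset(xs, toy_encoder1, rng)  # would serve as training data for step 2
x_new = two_stage_sample(toy_decoder2, 1e-3, toy_decoder1, 1e-3, kappa=2, rng=rng)
print(zs, x_new)
```

The point of the separation is visible in the code structure: the second stage only ever sees the latent samples `zs`, never the original data, so its objective cannot be diverted toward improving the reconstruction of `xs`.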
The efficacy of the second-stage VAE from above is based on the following. If the first stage was successful, then even though they will not generally resemble $\mathcal{N}(z|0, I)$, samples from $q_\phi(z)$ will nonetheless have nonzero measure across the full ambient space $\mathbb{R}^\kappa$. If $\kappa = r$, this occurs because the entire latent space is needed to represent an $r$-dimensional manifold, and if $\kappa > r$, then the extra latent dimensions will be naturally filled in via randomness introduced along dimensions associated with nonzero eigenvalues of the encoder covariance $\Sigma_z$ per the analysis in Section 3.
Consequently, as long as we set $\kappa \geq r$, the operational regime of the second-stage VAE is effectively equivalent to the situation described in Section 2.1 where the manifold dimension is equal to the ambient dimension.^{5} And as we have already shown there via Theorem 2.1, the VAE can readily handle this situation, since in the narrow context of the second-stage VAE, $d = r = \kappa$, the troublesome $(d - r) \log \gamma$ factor becomes zero, and any globally minimizing solution is uniquely matched to the new ground-truth distribution $q_\phi(z)$. Consequently, the revised aggregated posterior $q_{\phi'}(u)$ produced by the second-stage VAE should now closely resemble $\mathcal{N}(u|0, I)$. And importantly, because we generally assume that $d \gg \kappa \geq r$, we have found that the second-stage VAE can be quite small.

^{5} Note that if a regular autoencoder were used to replace the first-stage VAE, then this would no longer be the case, so indeed a VAE is required for both stages unless we have strong prior knowledge such that we may confidently set $\kappa \approx r$.
It should also be emphasized that concatenating the two VAE stages and jointly training does not generally improve the performance. If trained jointly, the few extra second-stage parameters can simply be hijacked by the dominant influence of the first-stage reconstruction term and forced to work on an incrementally better fit of the manifold rather than addressing the critical mismatch between $q_\phi(z)$ and $\mathcal{N}(z|0, I)$. This observation can be empirically tested, which we have done in multiple ways. For example, we have tried fusing the respective encoders and decoders from the first and second stages to train what amounts to a slightly more complex single VAE model. We have also tried merging the two stages including the associated penalty terms. In both cases, as expected, joint training does not help at all, with average performance no better than the first-stage VAE (which contains the vast majority of parameters). Consequently, although perhaps counterintuitive, separate training of these two VAE stages is actually critical to achieving high-quality results, as will be demonstrated next.
5 Empirical Evaluation of VAE Two-Stage Enhancement
In this section we first present quantitative evaluations of the proposed two-stage VAE against various GAN and VAE baselines. We then describe experiments explicitly designed to corroborate our theoretical findings from Sections 2 and 3, including a demonstration of robustness to the choice of $\kappa$. Finally, we include representative samples generated from our model.
5.1 Quantitative Comparisons of Generated Sample Quality
We first present quantitative evaluation of novel generated samples using the large-scale testing protocol of GAN models from Lucic et al. (2018). In this regard, GANs are well-known to dramatically outperform existing VAE approaches in terms of the Fréchet Inception Distance (FID) score (Heusel et al., 2017) and related quantitative metrics. For fair comparison, Lucic et al. (2018) adopted a common neutral architecture for all models, with generator and discriminator networks based on InfoGAN (Chen et al., 2016a); the point here is standardized comparisons, not tuning arbitrarily-large networks to achieve the lowest possible absolute FID values. We applied the same architecture to our first-stage VAE decoder and encoder networks respectively for direct comparison. For the low-dimensional second-stage VAE we used small, 3-layer networks contributing negligible additional parameters beyond the first stage (see the appendices for further design details).
We evaluated our proposed VAE pipeline, henceforth denoted as 2-Stage VAE, against three baseline VAE models differing only in the decoder output layer: a Gaussian layer with fixed $\gamma$, a Gaussian layer with a learned $\gamma$, and a cross-entropy layer as has been adopted in several previous applications involving images (Chen et al., 2016b). We also tested the Gaussian decoder VAE model (with learned $\gamma$) combined with an encoder augmented with normalizing flows (Rezende and Mohamed, 2015), as well as the recently proposed Wasserstein autoencoder (WAE) (Tolstikhin et al., 2018) which maintains a VAE-like structure. All of these models were adapted to use the same neutral architecture from Lucic et al. (2018). Note also that the WAE includes two variants, referred to as WAE-MMD and WAE-GAN because different Maximum Mean Discrepancy (MMD) and GAN regularization factors are involved. We conduct experiments using the former because it does not involve potentially-unstable adversarial training, consistent with the other VAE baselines.^{6} Additionally, we present results from Lucic et al. (2018) involving numerous competing GAN models, including MM GAN (Goodfellow et al., 2014), WGAN (Arjovsky et al., 2017), WGAN-GP (Gulrajani et al., 2017), NS GAN (Fedus et al., 2017), DRAGAN (Kodali et al., 2017), LS GAN (Mao et al., 2017) and BEGAN (Berthelot et al., 2017). Testing is conducted across four significantly different datasets: MNIST (LeCun et al., 1998), Fashion MNIST (Xiao et al., 2017), CIFAR-10 (Krizhevsky and Hinton, 2009) and CelebA (Liu et al., 2015).

^{6} Later we compare against both WAE-MMD and WAE-GAN using the setup from Tolstikhin et al. (2018).
For each dataset we executed independent trials and report the mean and standard deviation of the FID scores in Table 1.^{7} No effort was made to tune VAE training hyperparameters (e.g., learning rates, etc.); rather a single generic setting was first agnostically selected and then applied to all VAE-like models (including the WAE-MMD). As an analogous baseline, we also report the value of the best GAN model for each dataset when trained using suggested settings from the authors; no single model was optimal across all datasets, so these values represent the best performance from different, dataset-dependent GANs. Even so, our single 2-Stage VAE is still better on two of four datasets, and in aggregate, better than any individual GAN model. For example, when averaged across datasets, the mean FID score for any individual GAN trained with suggested settings was always approximately 45 or higher (see Lucic et al., 2018, Figure 4), while our analogous 2-Stage VAE maintained a mean below 40. The other VAE baselines were not competitive. Note also that the relatively poor performance of the WAE-MMD on MNIST and Fashion MNIST data can be attributed to the sensitivity of this approach to the value of $\kappa$, which for consistency with other models was fixed at the same value for all experiments. This value is likely much larger than actually needed for these simpler data types (meaning $r \ll \kappa$), and the WAE-MMD model can potentially be more reliant on having some $\kappa \approx r$. We will return to this issue in Section 5.2 below, where among other things, we empirically examine how performance varies with $\kappa$ across different modeling paradigms.

^{7} All reported FID scores for VAE and GAN models were computed using TensorFlow (https://github.com/bioinf-jku/TTUR). We have found that alternative PyTorch implementations (https://github.com/mseitzer/pytorch-fid) can produce different values in some circumstances. This seems to be due, at least in part, to subtle differences in the underlying inception models being used for computing the scores. Either way, a consistent implementation is essential for calibrating results across different scenarios.

Table 1 also displays FID scores from GAN models evaluated using hyperparameters obtained from a large-scale search executed independently across each dataset to achieve the best results: 100 settings per model per dataset, plus an optimal, data-dependent stopping criterion as described in Lucic et al. (2018). Within this broader paradigm, cases of severe mode collapse were omitted when computing final GAN FID averages. Despite these considerable GAN-specific advantages, the FID performance of the default 2-Stage VAE is well within the range of the heavily-optimized GAN models for each dataset, unlike the other VAE baselines. Overall then, these results represent the first demonstration of a VAE pipeline capable of competing with GANs in the arena of generated sample quality.
Table 1: FID scores.
Model  MNIST  Fashion  CIFAR-10  CelebA
Optimized, data-dependent settings:
MM GAN
NS GAN
LSGAN
WGAN
WGAN GP
DRAGAN
BEGAN
Default settings:
Best GAN  10  32  70
VAE (cross-entr.)
VAE (fixed $\gamma$)
VAE (learned $\gamma$)
VAE + Flow
WAE-MMD
2-Stage VAE (ours)
Of course there are still other ways of evaluating sample quality. For example, although the FID score is a widely-adopted measure that significantly improves upon the earlier Inception Score (Salimans et al., 2016), it has been shown to exhibit bias in certain circumstances (Bińkowski et al., 2018). To address this issue, the recently-proposed Kernel Inception Distance (KID) applies a polynomial-kernel MMD measure to estimate the inception distance and is believed to enjoy better statistical properties (Bińkowski et al., 2018). Note that we cannot evaluate all of the GAN baselines using the KID score; only the authors of Lucic et al. (2018) could easily do this given the huge number of trained models involved that are not publicly available, and the need to retrain selected models multiple times to produce new average scores at optimal hyperparameter settings. However, we can at least compare our trained 2-Stage VAE to other VAE/AE-based models. Table 2 presents these results, where we observe that the same improvement patterns reported with respect to FID are preserved when we apply KID instead, providing further confidence in our approach.
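| Model | MNIST | Fashion | CIFAR-10 | CelebA |
| --- | --- | --- | --- | --- |
| VAE (cross-entr.) | | | | |
| VAE (fixed $\gamma$) | | | | |
| VAE (learned $\gamma$) | | | | |
| VAE + Flow | | | | |
| WAE-MMD | | | | |
| 2-Stage VAE (ours) | | | | |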
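To make the KID computation mentioned above more concrete, the following is a minimal sketch of an unbiased squared-MMD estimate with the cubic polynomial kernel commonly associated with KID. The random arrays are stand-ins for Inception features, which are not computed here, and the function names are illustrative assumptions rather than the evaluation code used for Table 2.

```python
import numpy as np

def polynomial_kernel(a, b, degree=3, coef0=1.0):
    """Polynomial kernel k(x, y) = (x.y / d + coef0) ** degree."""
    d = a.shape[1]
    return (a @ b.T / d + coef0) ** degree

def mmd2_unbiased(x, y, kernel=polynomial_kernel):
    """Unbiased estimate of squared MMD between samples x (m, d) and y (n, d)."""
    m, n = x.shape[0], y.shape[0]
    k_xx, k_yy, k_xy = kernel(x, x), kernel(y, y), kernel(x, y)
    # Exclude diagonal terms so the within-sample averages are unbiased.
    term_xx = (k_xx.sum() - np.trace(k_xx)) / (m * (m - 1))
    term_yy = (k_yy.sum() - np.trace(k_yy)) / (n * (n - 1))
    return term_xx + term_yy - 2.0 * k_xy.mean()

# Hypothetical feature vectors standing in for Inception activations.
rng = np.random.default_rng(0)
real = rng.normal(size=(500, 16))
same = rng.normal(size=(500, 16))                # same distribution as `real`
shifted = rng.normal(loc=0.5, size=(500, 16))    # mean-shifted distribution

kid_same = mmd2_unbiased(real, same)
kid_shifted = mmd2_unbiased(real, shifted)
```

In this toy setting the estimate is near zero when both sample sets share a distribution and clearly positive when one is shifted, which is the behavior a lower-is-better KID score relies on.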
Thus far we have presented performance evaluations of VAE models without any special tuning on a neutral testing platform; however, we have not yet compared our approach against state-of-the-art VAE-like competition benefiting from an encoder-decoder architecture and training protocol adjusted for a specific dataset. To the best of our knowledge, the WAE model involves the only published evaluation of this kind, where different architectures are designed for use with MNIST and CelebA datasets (Tolstikhin et al., 2018). However, because actual FID scores are only reported for CelebA (which is far more complex than MNIST anyway), we focus our attention on head-to-head testing with this data. In particular, we adopt the exact same encoder-decoder networks as the WAE models, and train using the same number of epochs. We do not tune any hyperparameters whatsoever, and apply the same small second-stage VAE as used in previous experiments. As before, the second-stage size is a small fraction of the first stage, so any benefit is not simply the consequence of a larger network structure. Results are reported in Table 3, where the 2-Stage VAE even outperforms the WAE-GAN model, which has the advantage of adversarial training tuned for this combination of data and network architecture.

| | VAE | WAE-MMD | WAE-GAN | 2-Stage VAE (ours) |
| --- | --- | --- | --- | --- |
| CelebA FID | 63 | 55 | 42 | 34 |
5.2 Experimental Corroboration of Theoretical Results
The true test of any theoretical contribution is the degree to which it leads to useful, empirically-testable predictions about behavior in real-world settings. In the present context, although our theory from Sections 2 and 3 involves some unavoidable simplifying assumptions, it nonetheless makes predictions that can be tested under practically-relevant conditions where these assumptions may not strictly hold. We now present the results of such tests, which provide strong confirmation of our previous analysis. In particular, after providing validation of Theorems 3 and 3, we explicitly demonstrate that the second stage of our 2-Stage VAE model can reduce the gap between $q_\phi(z)$ and $p(z)$, and that the combined VAE stages are quite robust to the value of $\kappa$ as predicted.
Validation of Theorem 3: This theorem implies that the decoder variance $\gamma$ will converge to zero at any global minimum of the stated VAE objective under consideration. Figure 0(a) presents empirical support for this result, where indeed the decoder variance does tend towards zero during training (red line). This then allows for tighter image reconstructions (dark blue curve) with lower average squared error, i.e., a better manifold fit as expected. Additionally, Figure 0(b) compares the FID score computed using reconstructed training images from different VAE models (these are not new generated samples). The VAE with a learnable $\gamma$ achieves the lowest FID on all four datasets, implying that it also produces more realistic overall reconstructions as $\gamma$ goes to zero. As will be discussed further in Section 6, high-quality image reconstructions are half the battle in ultimately achieving good generative modeling performance.
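As a small illustration of why a learnable decoder variance tracks reconstruction quality, the sketch below assumes an isotropic Gaussian decoder with variance gamma and holds the reconstructions fixed. Under that assumption the average negative log-likelihood is minimized in closed form by the mean squared error per dimension, so tighter reconstructions imply a smaller optimal gamma. This is only a toy calculation with hypothetical function names; it does not reproduce the theorem or the training curves in the figure.

```python
import numpy as np

def average_gaussian_nll(x, x_hat, gamma):
    """Average negative log-likelihood of x under N(x_hat, gamma * I)."""
    d = x.shape[1]
    sq_err = np.sum((x - x_hat) ** 2, axis=1)
    return np.mean(0.5 * (sq_err / gamma + d * np.log(2.0 * np.pi * gamma)))

def optimal_gamma(x, x_hat):
    """Closed-form minimizer over gamma: mean squared error per dimension."""
    return np.mean((x - x_hat) ** 2)

rng = np.random.default_rng(0)
x = rng.standard_normal((256, 32))

# Simulate progressively better reconstructions (hypothetical noise levels).
gammas = []
for noise_std in (0.5, 0.1, 0.01):
    x_hat = x + noise_std * rng.standard_normal(x.shape)
    gammas.append(optimal_gamma(x, x_hat))
```

The list `gammas` shrinks as the simulated reconstruction noise shrinks, mirroring the joint decrease of decoder variance and squared error described above.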
Validation of Theorem 3: Figure 2 bolsters this theorem, and the attendant analysis which follows in Section 3, by showcasing the dissimilar impact of noise factors applied to different directions in the latent space before passage through the decoder mean network $\mu_x$. In a direction where an eigenvalue of $\Sigma_z$ is large (i.e., a superfluous dimension), a random perturbation is completely muted by the decoder as predicted. In contrast, in directions where such eigenvalues are small (i.e., needed for representing the manifold), varying the input causes large changes in the image space reflecting reasonable movement along the correct manifold.
Reduced Mismatch between $q_\phi(z)$ and $p(z)$: Although the VAE with a learnable $\gamma$ can achieve high-quality reconstructions, the associated aggregated posterior is still likely not close to a standard Gaussian distribution as implied by (10). This mismatch then disrupts the critical ancestral sampling process. As we have previously argued, the proposed 2-Stage VAE has the ability to overcome this issue and achieve a standard Gaussian aggregated posterior, or at least nearly so. As empirical evidence for this claim, Figure 4 displays the singular value spectrum of latent sample matrices drawn from the aggregated posterior of the first stage and from that of the enhanced second stage. As expected, the latter is much closer to the spectrum from an analogous i.i.d. standard Gaussian matrix. We also used these same sample matrices to estimate the MMD metric (Gretton et al., 2007) between the standard Gaussian and the aggregated posterior distributions from the first and second stages in Table 4. Clearly the second stage has dramatically reduced the difference from the standard Gaussian as quantified by the MMD. Overall, these results indicate a superior latent representation, providing high-level support for our 2-Stage VAE proposal.

Robustness to Latent-Space Dimensionality: According to the analysis from Section 3, the 2-Stage VAE should be relatively insensitive to having a good estimate of the ground-truth manifold dimension $r$. As long as we choose $\kappa \geq r$, then our approach should in principle be able to compensate for any dimensionality mismatch. To recap, the first VAE stage should fill in useless dimensions with random noise, such that $q_\phi(z)$ does not lie on any lower-dimensional manifold. The second stage then operates within the regime arbitrated by Theorem 2.1 such that we obtain a tractable means for sampling from $q_\phi(z)$. This robustness to $\kappa$ need not be shared by alternative approaches, such as those predicated upon deterministic autoencoders or single-stage VAE models. For example, both the WAE-MMD and WAE-GAN models are dependent on having a reasonable estimate for $r$, at least for the deterministic encoder-decoder structures that were empirically tested in (Tolstikhin et al., 2018). This is because, if $\kappa < r$, reconstructing the data on the manifold is not possible, and if $\kappa > r$, then $p(z)$ and $q_\phi(z)$ cannot be matched since the latter is necessarily confined by the training data to an $r$-dimensional manifold within $\kappa$-dimensional space. This likely explains the poor performance of WAE-MMD on MNIST and Fashion MNIST data reported in Table 1.
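The singular value comparison described earlier in this section can be sketched as follows. The sample matrices here are synthetic stand-ins: an anisotropic Gaussian plays the role of first-stage latents and an isotropic one plays the role of second-stage latents. The helper names, matrix sizes, and scales are assumptions made purely for illustration, not the procedure behind the reported figure.

```python
import numpy as np

def singular_spectrum(samples):
    """Singular values of an (n, kappa) sample matrix, scaled by sqrt(n)."""
    return np.linalg.svd(samples, compute_uv=False) / np.sqrt(samples.shape[0])

def spectrum_gap(samples, rng):
    """Mean absolute gap to the spectrum of an i.i.d. N(0, 1) matrix of equal shape."""
    reference = rng.standard_normal(samples.shape)
    return np.mean(np.abs(singular_spectrum(samples) - singular_spectrum(reference)))

rng = np.random.default_rng(0)
n, kappa = 2000, 16

# Hypothetical first-stage latents: Gaussian with very uneven per-dimension scales.
scales = np.linspace(0.1, 2.0, kappa)
z_first = rng.standard_normal((n, kappa)) * scales
# Hypothetical second-stage latents: close to a standard Gaussian.
z_second = rng.standard_normal((n, kappa))

gap_first = spectrum_gap(z_first, rng)
gap_second = spectrum_gap(z_second, rng)
```

A small gap indicates a spectrum close to that of standard Gaussian samples; an MMD estimate against standard Gaussian draws could be used alongside it as a second check.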
To further probe these issues, we again adopt the neutral testing framework from (Lucic et al., 2018), and retrain each model as $\kappa$ is varied. We conduct this experiment using Fashion MNIST (relatively small/simple) and CelebA (more complex). FID scores from both reconstructions of the training data and novel generated samples are shown in Figure 5. From these plots (left column) we observe that both the 2-Stage VAE and WAE-MMD have similar reconstruction FID values that become smaller as $\kappa$ increases (reconstructions should generally improve with increasing $\kappa$). Note that the WAE-MMD relies on a Gaussian output layer with a small $\gamma$ value, and hence the reconstructions are unlikely to be significantly different from the 2-Stage VAE. In contrast, the other baselines are considerably worse, especially when $\gamma$ is fixed as is often done in practice.

[Footnote 8: Unlike standard VAEs that are capable of self-calibration in some sense, with $\gamma$ and elements of $\Sigma_z$ jointly pushing towards small values, $\gamma$ may be difficult to learn for WAE models because there is also a required weighting factor balancing the MMD (or GAN) loss. Additionally, the WAE code posted in association with (Tolstikhin et al., 2018) does not attempt to learn $\gamma$.]
The situation changes considerably, however, when we examine the FID scores obtained from generated samples. While the VAE models remain relatively stable over a significant range of sufficiently large $\kappa$ values, some of which are likely to be considerably larger than necessary (e.g., on Fashion MNIST data), the WAE-MMD performance is much more sensitive. Of course we readily concede that compensatory, data-dependent tuning of WAE-MMD hyperparameters might allow for some additional improvements in these curves, but this further highlights the larger point: the VAE models were not tuned in any way for these experiments and yet still maintain stable behavior, with our 2-Stage variant providing a sizeable advantage.
Of course, in practice, if we set $\kappa$ to be far too large, then the training will likely become much more difficult, since in addition to learning the correct ground-truth manifold, we are also burdening the model to detect a much larger number of unnecessary dimensions. But even so, the 2-Stage VAE is arguably quite robust to $\kappa$ within these experimental settings, and certainly we need not set $\kappa = r$ to achieve good results; a reasonable $\kappa \geq r$ appears to be sufficient. Still, as a final caveat, we should mention that the VAE cannot automatically manage all forms of excess model capacity. As discussed in (Dai et al., 2018)[Section 4], if the decoder mean network becomes inordinately complex/deep, then a useless degenerate solution involving the implicit memorization of training data can break even the natural VAE regularization mechanisms we have described herein (in principle, this can happen even with for a finite training dataset). GAN models and virtually all other deep generative pipelines share a similar vulnerability.
5.3 Qualitative Evaluation of Generated Samples
Finally, we qualitatively evaluate samples generated via our 2-Stage VAE using a simple, convenient residual network structure (with fewer parameters than the InfoGAN architecture). Details of this network are shown in the appendices. Randomly generated samples from our 2-Stage VAE are shown in Figure 6 for MNIST and CelebA data. Additional samples can be found in the appendices.
6 Discussion
It is often assumed that there exists an unavoidable trade-off between the stable training, valuable attendant encoder network, and resistance to mode collapse of VAEs, versus the impressive visual quality of images produced by GANs. While we certainly are not claiming that our two-stage VAE model is superior to the latest and greatest GAN-based architectures in terms of the realism of generated samples, we do strongly believe that this work at least narrows that gap substantially such that VAEs are worth considering in a broader range of applications. We now close by situating our work within the context of existing VAE enhancements, as well as recent efforts to learn so-called disentangled representations.
6.1 Connections with Existing VAE Enhancements
Although a variety of alternative VAE renovations have been proposed, unlike our work, nearly all of these have focused on improving the log-likelihood scores assigned by the model to test data. In particular, multiple elegant approaches involve replacing the Gaussian encoder network with a richer class of distributions instantiated through normalizing flows or related techniques (Burda et al., 2015; Kingma et al., 2016; Rezende and Mohamed, 2015; van den Berg et al., 2018). While impressive log-likelihood gains have been demonstrated, this achievement is largely orthogonal to the goal of improving quantitative measures of visual quality (Theis et al., 2016), which has been our focus herein. Additionally, improving the VAE encoder does not address the uniqueness issue raised in Section 2, and therefore, a second stage could potentially benefit these models too under the right circumstances.
Broadly speaking, if the overriding objective is generating realistic samples using an encoder-decoder-based architecture (VAE or otherwise), two important, well-known criteria must be satisfied:

(i) Small reconstruction error when passing through the encoder-decoder networks, and

(ii) An aggregate posterior $q_\phi(z)$ that is close to some known distribution like $p(z) = \mathcal{N}(z|0, I)$ that is easy to sample from.
The first criterion can be naturally enforced by a deterministic AE, and also by the VAE as $\gamma$ becomes small, as quantified by Theorem 3. Of course the second criterion is equally important. Without it, we have no tractable way of generating random inputs that, when passed through the learned decoder, produce realistic output samples resembling the training data distribution.
Criteria (i) and (ii) can be addressed in multiple different ways. For example, (Tomczak and Welling, 2018; Zhao et al., 2018) replace the standard Gaussian prior with a parameterized class of prior distributions such that there exist more flexible pathways for pushing the prior and the aggregate posterior closer together; a VAE-like objective is used for this purpose in (Tomczak and Welling, 2018), while (Zhao et al., 2018) employs adversarial training. Consequently, even if $q_\phi(z)$ is not Gaussian, we can nonetheless sample from a known non-Gaussian alternative using either of these approaches. This is certainly an interesting idea, but it has not yet been shown to improve FID scores. For example, only log-likelihood values on relatively small black-and-white images are reported in (Tomczak and Welling, 2018), while discrete data and associated evaluation metrics are the focus of (Zhao et al., 2018).

In fact, the only competing encoder-decoder-based architecture we are aware of that explicitly attempts to improve FID scores is the WAE model from (Tolstikhin et al., 2018), which can be viewed as a generalization of the adversarial autoencoder (Makhzani et al., 2016). For both WAE-MMD and WAE-GAN variants introduced and tested in Section 5 herein, the basic idea is to minimize an objective function composed of a reconstruction penalty for handling criterion (i), and a Wasserstein distance measure between $p(z)$ and $q_\phi(z)$ (either MMD or GAN-based) for addressing criterion (ii). Note that under the reported experimental design from (Tolstikhin et al., 2018), the WAE-GAN model more-or-less defaults to an adversarial autoencoder, although broader differentiating design choices are possible. The adversarially-regularized autoencoder proposed in (Zhao et al., 2018) can also be interpreted as a variant of the WAE-GAN customized to handle discrete data.
As with the approaches mentioned above, the two VAE stages we have proposed can also be motivated in one-to-one correspondence with criteria (i) and (ii). In brief, the first VAE stage addresses criterion (i) by pushing both the encoder and decoder variances selectively towards zero such that accurate reconstruction is possible using a minimal number of active latent dimensions. However, our detailed analysis suggests that, although the resulting aggregate posterior $q_\phi(z)$ will occupy nonzero measure in $\kappa$-dimensional space (selectively filling out superfluous dimensions with random noise), it need not be close to $\mathcal{N}(z|0, I)$. This then implies that if we take samples from $\mathcal{N}(z|0, I)$ and pass them through the learned decoder, the result may not closely resemble real data.
Of course if we could somehow directly sample from $q_\phi(z)$, then we would not need to use $\mathcal{N}(z|0, I)$. And fortunately, because the first-stage VAE ensures that $q_\phi(z)$ will satisfy the conditions of Theorem 2.1, we know that a second VAE can in fact be learned to accurately sample from this distribution, which in turn addresses criterion (ii). Specifically, per the arguments from Section 4, sampling $u \sim \mathcal{N}(u|0, I)$ and then $z \sim p_{\theta'}(z|u)$ is akin to sampling $z \sim q_\phi(z)$ even though the latter is not available in closed form. Such samples can then be passed through the first-stage VAE decoder to obtain samples of $x$. Hence our framework provides a principled alternative to existing encoder-decoder structures designed to handle criteria (i) and (ii), leading to state-of-the-art results for this class of model in terms of FID scores.
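The two-stage sampling procedure just described can be sketched as below. Everything concrete here is an assumption for illustration: the names `second_decoder` and `first_decoder`, the toy maps standing in for trained decoder mean networks, and the noise variances. Only the sampling order (second-stage latent, then first-stage latent, then data) follows the description above.

```python
import numpy as np

def sample_two_stage(n, second_decoder, first_decoder, kappa, gamma_z, gamma_x, rng):
    """Ancestral sampling through both stages: u -> z -> x."""
    # Second-stage latent drawn from a standard Gaussian prior.
    u = rng.standard_normal((n, kappa))
    # First-stage latent drawn from a Gaussian centered on the second-stage decoder mean.
    z_mean = second_decoder(u)
    z = z_mean + np.sqrt(gamma_z) * rng.standard_normal(z_mean.shape)
    # Data sample drawn from a Gaussian centered on the first-stage decoder mean.
    x_mean = first_decoder(z)
    return x_mean + np.sqrt(gamma_x) * rng.standard_normal(x_mean.shape)

rng = np.random.default_rng(0)
kappa, d = 4, 10

# Toy stand-ins for trained decoder mean networks (hypothetical weights).
w_second = rng.standard_normal((kappa, kappa))
w_first = rng.standard_normal((kappa, d))
second_decoder = lambda u: np.tanh(u @ w_second)
first_decoder = lambda z: z @ w_first

x_samples = sample_two_stage(8, second_decoder, first_decoder, kappa, 1e-2, 1e-3, rng)
```

With small decoder variances the final noise term is nearly negligible, so in practice one might simply return the first-stage decoder mean instead.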
6.2 Identifiability of Disentangled Representations
Although definitions vary, if a latent $z$ is 'disentangled' so to speak, then each element should ideally contain an interpretable, semantically-meaningful factor of variation that can act independently on the generative process (Chen et al., 2018). For example, consider an MNIST image $x$ and a hypothetical latent representation $z = [z_1, \ldots, z_\kappa]^\top$, whereby $z_1$ encodes the digit type (i.e., 0 through 9), $z_2$ is the stroke thickness, $z_3$ is a measure of the overall digit slant, and so on for $z_4, \ldots, z_\kappa$. This is a canonical disentangled representation, with clearly interpretable elements that can be independently adjusted to generate samples with predictable differences, e.g., we may adjust $z_2$ while keeping other dimensions fixed to isolate variations in stroke thickness.
A number of VAE modifications have been putatively designed to encourage this type of disentanglement (likewise for other generative modeling paradigms). These efforts can be partitioned into those that involve some degree of supervision, to both define and isolate specific ground-truth factors of variation (which may be application-specific), and those that aspire to be completely unsupervised. For the latter, the VAE is typically refashioned to minimize, at least to the extent possible, some loose proxy for the degree of entanglement. An influential example is the total correlation (TC), defined for the present context as

$$\text{TC} = \mathbb{KL}\left[\, q_\phi(z) \,\Big\|\, \prod_{j=1}^{\kappa} q_\phi(z_j) \right], \tag{11}$$

where $q_\phi(z_j)$ denotes the marginal distribution of $z_j$. It follows that $\text{TC} = 0$ iff $q_\phi(z) = \prod_j q_\phi(z_j)$, meaning that the aggregate posterior distribution, which generates input samples to the VAE decoder, involves independent factors of variation. It has been argued then that adjustments to the VAE objective that push the total correlation towards zero will favor desirable disentangled representations (Chen et al., 2018; Higgins et al., 2017). Note that if criterion (ii) from Section 6.1 is satisfied with a factorial prior such as $p(z) = \mathcal{N}(z|0, I)$, then we have actually already achieved $\text{TC} \approx 0$. In contrast, models with more complex priors as used in (Tomczak and Welling, 2018) would require further alterations to enforce this goal.

[Footnote 9: This naming convention is perhaps a misnomer, since total correlation measures statistical dependency beyond second-order correlations.]
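For intuition about the total correlation in (11), the sketch below evaluates it in the one case where it has a simple closed form: a zero-mean Gaussian with covariance `cov`, for which the total correlation equals half the difference between the summed log marginal variances and the log determinant. Treating the aggregate posterior as Gaussian is an assumption made only for this example, and the function name is illustrative.

```python
import numpy as np

def gaussian_total_correlation(cov):
    """Total correlation (in nats) of a Gaussian with covariance matrix `cov`.

    Equals the sum of marginal entropies minus the joint entropy, which
    reduces to 0.5 * (sum_j log cov_jj - log det cov).
    """
    cov = np.asarray(cov, dtype=float)
    _, logdet = np.linalg.slogdet(cov)
    return 0.5 * (np.sum(np.log(np.diag(cov))) - logdet)

# Factorial case: identity covariance, so the dimensions are independent.
tc_independent = gaussian_total_correlation(np.eye(3))

# Correlated case: off-diagonal covariance introduces dependence.
tc_correlated = gaussian_total_correlation([[1.0, 0.8], [0.8, 1.0]])
```

The identity covariance gives zero total correlation, consistent with the remark that a factorial standard Gaussian already satisfies this criterion, while any nonzero correlation makes the quantity strictly positive.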
While we agree that methods penalizing TC (either directly or indirectly) will favor latent representations with independent dimensions, unfortunately this need not correspond with any semantically-meaningful form of disentanglement. Intuitively, this means that independent latent factors need not correspond with interpretable concepts like stroke width or digit slope. In fact, without at least some form of supervision or constraints on the space of possible representations, disentanglement defined with respect to any particular ground-truth semantic factors is not generally identifiable in the strict statistical sense.
As one easy way to see this, suppose that we have access to data generated as , where serves as an arbitrary ground-truth decoder and represents ground-truth/disentangled latent factors of interest. However, we could just as well generate identical data via , where with associated factorial distribution . Here the operator converts to a standardized Gaussian distribution, is an arbitrary rotation matrix that mixes the factors but retains a factorial representation with , and converts the resulting Gaussian to a new factorial distribution with arbitrary marginals.
By construction, each new latent factor will be composed as a mixture of the original factors of interest, and yet they will also have zero total correlation, i.e., they will be statistically independent. Therefore if we treat the composite operator as the effective decoder, we have a new generative process and . By construction, this process will produce identical observed data using a latent representation with zero total correlation, but with highly entangled factors with respect to the original ground-truth delineation. Note also that the revised decoder need not necessarily be more complex than , except in special circumstances. For example, if we constrain our decoder architecture to be affine, then the model defaults to independent component analysis and the nonlinear operators and cannot be absorbed into an affine composite operator. In this case, the model is identifiable up to an arbitrary permutation and scaling (Hyvärinen and Oja, 2000).

Overall though, this lack of identifiability while searching for disentangled representations can exist even in trivially simple circumstances. For example, consider a set of observed facial images generated by two ground-truth binary factors related to gender, male (M) or female (F), and age, young (Y) or old (O). Both factors can be varied independently to modify the images with respect to either gender or age to create a set of four possible images, namely , where the subscripts indicate the latent factor values. But if given only a set of such images without explicit labels as to the ground-truth factors of interest, finding a disentangled representation with respect to age and gender is impossible.
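| First encoding ($z_1$ $z_2$) | Second encoding ($z_1$ $z_2$) |
| --- | --- |
| 0 0 | 0 0 |
| 0 1 | 1 1 |
| 1 0 | 0 1 |
| 1 1 | 1 0 |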
To visualize this, let $z = [z_1, z_2]^\top$ denote a 2D binary vector that serves as the latent representation. In Table 4 we present two candidate encodings, both of which display zero total correlation between factors $z_1$ and $z_2$ (i.e., knowledge of $z_1$ tells us nothing about the value of $z_2$ and vice versa). However, in terms of the semantically meaningful factors of gender and age, these two encodings are quite different. In the first, varying $z_1$ alters gender independently of age, while $z_2$ varies age independently of gender. In contrast, with the second encoding, varying $z_1$ while keeping $z_2$ fixed changes both the age and gender, an entangled representation per these attributes. Without supervising labels to resolve such ambiguity, there is no possible way, even in principle, for any generative model (VAE or otherwise) to differentiate these two encodings such that the so-called disentangled representation is favored over the entangled one.

Although in some limited cases it has been reported that disentangled representations can be at least partially identified, we speculate that this is a result of nuanced artifacts of the experimental design that need not translate to broader problems of interest. Indeed, in our own empirical tests we have not been able to consistently reproduce any stable disentangled representation across various testing conditions. It was our original intent to include a thorough presentation of these results; however, in the process of preparing this manuscript we became aware of a contemporary work with extensive testing in this area (Locatello et al., 2018). In full disclosure, the results from (Locatello et al., 2018) are actually far more extensive than those that we have completed, so we simply refer the reader to this paper to examine the compelling battery of tests presented there.
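The claim that both candidate encodings in the table above have zero total correlation can be checked directly. The sketch below assumes the four images are equally likely and lists the codes in the row order of the table; for two variables the total correlation coincides with the mutual information, which is what the hypothetical helper computes.

```python
import numpy as np
from itertools import product

def mutual_information(codes):
    """Mutual information (in nats) between the two bits of equally likely 2-bit codes."""
    codes = np.asarray(codes)
    mi = 0.0
    for a, b in product((0, 1), repeat=2):
        p_ab = np.mean((codes[:, 0] == a) & (codes[:, 1] == b))
        p_a = np.mean(codes[:, 0] == a)
        p_b = np.mean(codes[:, 1] == b)
        if p_ab > 0:
            mi += p_ab * np.log(p_ab / (p_a * p_b))
    return mi

# Rows follow the table: one code per image, in the same image order for both encodings.
encoding_1 = [(0, 0), (0, 1), (1, 0), (1, 1)]
encoding_2 = [(0, 0), (1, 1), (0, 1), (1, 0)]

mi_1 = mutual_information(encoding_1)
mi_2 = mutual_information(encoding_2)
```

Both values come out to zero because each encoding assigns the four images to the four distinct bit patterns, so an independence-based criterion alone cannot prefer one encoding over the other.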
Appendix A Comparison of Novel Samples Generated from our Model
Generation results for the CelebA, MNIST, Fashion-MNIST and CIFAR-10 datasets using different methods are shown in Figures 7-10 respectively. When $\gamma$ is fixed to be one, the generated samples are very blurry. If a learnable $\gamma$ is used, the samples become sharper; however, there are many lingering artifacts as expected. In contrast, the proposed 2-Stage VAE can remove these artifacts and generate more realistic samples. For comparison purposes, we also show the results from WAE-MMD, WAE-GAN (Tolstikhin et al., 2018) and WGAN-GP (Gulrajani et al., 2017) for the CelebA dataset.
Appendix B Example Reconstructions of Training Data
Reconstruction results for the MNIST, Fashion-MNIST, CIFAR-10 and CelebA datasets are shown in Figures 11-14 respectively. On relatively simple datasets like MNIST and Fashion-MNIST, the VAE with learnable $\gamma$ achieves almost exact reconstruction because of a better estimate of the underlying manifold, consistent with theory. However, the VAE with fixed $\gamma$ produces blurry reconstructions as expected. Note that the reconstruction of a 2-Stage VAE is the same as that of a VAE with learnable $\gamma$ because the second-stage VAE has nothing to do with facilitating the reconstruction task.
Appendix C Additional Experimental Results Validating Theoretical Predictions
We first present more examples similar to Figure 2 from the main paper. Random noise is added to the latent representation along different directions and the result is passed through the decoder network. Each row corresponds to a certain direction in the latent space and samples are shown for each direction. These dimensions/rows are ordered by the eigenvalues $\lambda_j$ of $\Sigma_z$. The larger $\lambda_j$ is, the less impact a random perturbation along this direction will have, as quantified by the reported image variance values. In the first two or three rows, the noise generates some images from different classes/objects/identities, indicating a significant visual difference. For a slightly larger $\lambda_j$, the corresponding dimensions encode relatively less significant attributes as predicted. For example, the fifth row of both MNIST and Fashion-MNIST contains images from the same class but with a slightly different style. The images in the fourth row of the CelebA dataset have very subtle differences. When $\lambda_j \approx 1$, the corresponding dimensions become completely inactive and all the output images are exactly the same, as shown in the last rows for all three datasets.
Additionally, as discussed in the main text and below in Section I, there are likely to be $r$ eigenvalues of $\Sigma_z$ converging to zero and $\kappa - r$ eigenvalues converging to one. We plot the histogram of $\lambda_j$ values for both MNIST and CelebA datasets in Figure 16. For both datasets, $\lambda_j$ approximately converges to either zero or one. However, since CelebA is a more complicated dataset than MNIST, the ground-truth manifold dimension of CelebA is likely to be much larger than that of MNIST. So more eigenvalues are expected to be near zero for the CelebA dataset. This is indeed the case, demonstrating that VAE has the ability to detect the manif