1 Introduction
The Variational AutoEncoder (VAE) belongs to a class of models, which we will refer to as deep maximum likelihood models, that use a deep neural network to learn a maximum likelihood model for some input data. VAEs are perhaps the simplest and most efficient deep maximum likelihood models available, and have thus gained popularity in representation learning and generative image modeling. Unfortunately, in my opinion, in some circles the term “VAE” has become somewhat synonymous with “an autoencoder with stochastic regularization that generates useful or beautiful samples”, which has led to various misconceptions about VAEs. In this tutorial, we will return to the probabilistic and information theoretic roots of VAEs, clarify common misconceptions about VAEs, and look at a toy example on 2D data that will illustrate the capabilities and limitations of VAEs.
In Section 2, we will give an overview of what a maximum likelihood model is and what a VAE looks like.
In Sections 3 and 4, we will motivate the VAE and obtain intuitive insight into its behavior by deriving its objective. This derivation is broken into a probabilistic interpretation (Section 3), in which we view a VAE through the lens of Bayes’ Rule, importance sampling, and the change of variables formula, and an information theoretic interpretation (Section 4), in which we view a VAE through the lens of lossless compression and transmission through a noisy channel.
In Section 5, we will clarify two misconceptions about VAEs that I have encountered in casual conversation and teaching materials: that they can be trained using the mean-squared error loss and that the latent vector of the VAE can be viewed as a parameter rather than a variable. These two misconceptions about the formulation may lead to the incorrect beliefs that VAEs have blurry reconstructions or that they can only model Gaussian data.
Finally, in Section 6, we will gain insight into the capabilities and limitations of VAEs through a code example on toy 2D data. In this code example, we will visualize the VAE’s density estimation abilities and latent space. An accompanying Jupyter Notebook is provided.
2 VAE Overview
2.1 What is Maximum Likelihood?
Suppose we have high-dimensional data $x \in \mathbb{R}^d$ that follows a ground truth distribution $p_{gt}(x)$. A maximum likelihood model learns a probabilistic model $p_\theta(x)$ parameterized by $\theta$ that seeks to approximate $p_{gt}(x)$. We can do so by collecting $N$ i.i.d. samples $x_1, \dots, x_N$ from $p_{gt}(x)$ to create a training set, and learning $\theta$ to maximize the likelihood of the joint distribution $\prod_{i=1}^{N} p_\theta(x_i)$. For numerical stability, we instead minimize the negative log-likelihood:
$$-\log \prod_{i=1}^{N} p_\theta(x_i) = -\sum_{i=1}^{N} \log p_\theta(x_i) \tag{1}$$
As is the case with virtually all machine learning models, we hope that by minimizing the empirical risk of our training set given by Equation 1, we will also minimize the true risk $\mathbb{E}_{x \sim p_{gt}}[-\log p_\theta(x)]$, which reaches a global minimum if and only if $p_\theta(x) = p_{gt}(x)$.
Two key operations for a maximum likelihood model are inference and generation. Inference is the ability to evaluate $p_\theta(x)$ for any input vector $x$. Generation is the ability to sample data from the distribution $p_\theta(x)$. In the asymptotic case where $p_\theta(x)$ approaches $p_{gt}(x)$, one application for inference is out-of-distribution data detection (e.g. adversarial examples). One application for generation is generative image modeling. However, existing maximum likelihood models are currently not powerful enough to reliably perform out-of-distribution detection or sample images that come close to achieving the diversity and perceptual quality of natural images.
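As a small illustration of the objective in Equation 1, the sketch below fits a one-dimensional Gaussian by maximum likelihood and compares its average negative log-likelihood against an arbitrary alternative. The Gaussian model family, the sample size, and the function names `gaussian_nll` and `fit_gaussian_mle` are assumptions made for this example only; they do not come from the tutorial or its notebook.

```python
import math
import random


def gaussian_nll(samples, mu, sigma):
    """Average negative log-likelihood (in nats) of 1-D samples under N(mu, sigma^2)."""
    const = 0.5 * math.log(2.0 * math.pi) + math.log(sigma)
    return sum(const + (x - mu) ** 2 / (2.0 * sigma ** 2) for x in samples) / len(samples)


def fit_gaussian_mle(samples):
    """Closed-form maximum likelihood estimate of (mu, sigma) for a 1-D Gaussian."""
    n = len(samples)
    mu = sum(samples) / n
    var = sum((x - mu) ** 2 for x in samples) / n
    return mu, math.sqrt(var)


rng = random.Random(0)
# Hypothetical stand-in for i.i.d. samples from the ground truth distribution.
train = [rng.gauss(2.0, 0.5) for _ in range(1000)]

mu_hat, sigma_hat = fit_gaussian_mle(train)
nll_mle = gaussian_nll(train, mu_hat, sigma_hat)   # empirical risk at the MLE
nll_other = gaussian_nll(train, 0.0, 1.0)          # empirical risk of another model
```

Within this model family, no other choice of `mu` and `sigma` achieves a lower empirical negative log-likelihood on `train` than the maximum likelihood estimate.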
2.2 What is a VAE?
The VAE performs inference and generation by introducing a latent variable $z$ that follows a prior distribution $p(z)$. The VAE then uses an autoencoder with an encoder parameterized by $\phi$ and a decoder parameterized by $\theta$ to infer a posterior distribution $q_\phi(z|x)$ and an output distribution $p_\theta(x|z)$. If inference and generation can be efficiently done for all three of these distributions, then they can also be done on $p_\theta(x)$. As in the case of a standard autoencoder, the decoder tries to reconstruct the input $x$ given a latent variable $z$. The encoder predicts which $z$ would be most capable of reconstructing $x$. One advantage of latent models such as VAEs over maximum likelihood models without latent variables is the potential application of $z$ for semi-supervised or disentangled representation learning.
During generation, a latent vector $z$ is sampled from $p(z)$, and the decoder outputs the parameters of $p_\theta(x|z)$, from which we can sample an output vector. While exact inference cannot typically be efficiently done using a VAE, we can efficiently estimate an upper bound of the negative log-likelihood given by:
$$-\log p_\theta(x) \le \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + KL\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{2}$$
This bound is also commonly referred to as the negative Evidence Lower BOund (ELBO), and can be denoted as $-\mathrm{ELBO}(x)$. During training, we can use the negative ELBO as an objective, which in turn minimizes $-\log p_\theta(x)$. Before deriving Equation 2 and discussing its intuitive meaning, let us first give a concrete example of what a VAE could look like in terms of neural network outputs.
2.3 A Typical VAE
The prior distribution $p(z)$ is typically a standard isotropic multivariate Gaussian $\mathcal{N}(0, I)$. The inferred posterior distribution $q_\phi(z|x)$ is typically a multivariate Gaussian with diagonal covariance $\mathcal{N}(\mu_z, \mathrm{diag}(\sigma_z^2))$. Practically speaking, our encoder would be a deep neural network that consumes $x$ as input and outputs two vectors $\mu_z, \sigma_z \in \mathbb{R}^n$, where $n$ is the dimensionality of the latent space.
Sampling $z \sim q_\phi(z|x)$ can be done using the reparameterization trick, by first sampling a random vector $\epsilon \sim \mathcal{N}(0, I)$ and letting
$$z = \mu_z + \sigma_z \odot \epsilon \tag{3}$$
where $\odot$ is the elementwise product. Since $z$ is a deterministic function of $\mu_z$ and $\sigma_z$ (for a given $\epsilon$), the resulting gradient with respect to the encoder is quite stable.
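The reparameterization trick of Equation 3 can be sketched in a few lines. This is a plain-Python illustration with hypothetical names (`reparameterize`, `mu_z`, `sigma_z`); in an actual VAE the same arithmetic would be done on tensors inside an automatic differentiation framework so that gradients can flow to the encoder outputs.

```python
import random


def reparameterize(mu_z, sigma_z, rng):
    """Draw z = mu_z + sigma_z * eps elementwise, with eps ~ N(0, I).

    All randomness lives in eps, so z is a deterministic (and differentiable)
    function of the encoder outputs mu_z and sigma_z.
    """
    eps = [rng.gauss(0.0, 1.0) for _ in mu_z]
    z = [m + s * e for m, s, e in zip(mu_z, sigma_z, eps)]
    return z, eps


rng = random.Random(0)
mu_z = [0.5, -1.0]      # hypothetical encoder output: posterior means
sigma_z = [0.1, 2.0]    # hypothetical encoder output: posterior standard deviations
z, eps = reparameterize(mu_z, sigma_z, rng)
```

Because `eps` is returned alongside `z`, the same noise can be reused to confirm that changing `mu_z` or `sigma_z` changes `z` in a predictable way.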
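There is more variety for how the output distribution $p_\theta(x|z)$ can be parameterized. One solution is to use a multivariate Gaussian with diagonal covariance, $\mathcal{N}(\mu_x, \mathrm{diag}(\sigma_x^2))$. Practically speaking, our decoder would be a deep neural network that consumes $z$ as input and outputs the vectors $\mu_x, \sigma_x \in \mathbb{R}^d$.
We are now ready to discuss the objective in terms of neural network outputs. Under this formulation, the first term in Equation 2, which is also commonly referred to as the reconstruction loss or $L_{rec}$, is given by:
$$L_{rec} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] \tag{4}$$
$$= \mathbb{E}_{z \sim q_\phi(z|x)}\left[\sum_{i=1}^{d}\left(\frac{1}{2}\log 2\pi + \log \sigma_{x,i} + \frac{(x_i - \mu_{x,i})^2}{2\sigma_{x,i}^2}\right)\right] \tag{5}$$
where a subscript $i$ denotes the $i$th element of a vector (e.g. $x_i$ is the $i$th element of the vector $x$). $L_{rec}$ is approximated via Monte Carlo sampling; however, due to computational constraints, during training $z$ is typically only sampled once per iteration.
Under our formulation, the second term in Equation 2, which is also commonly referred to as the regularization loss or $L_{reg}$, has a closed form expression given by:
$$L_{reg} = KL\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{6}$$
$$= \frac{1}{2}\sum_{i=1}^{n}\left(\mu_{z,i}^2 + \sigma_{z,i}^2 - 1 - \log \sigma_{z,i}^2\right) \tag{7}$$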
The decoder is only affected by the reconstruction loss and seeks to best reconstruct $x$ based on $z$. The reconstruction loss thus encourages the encoder to increase the signal-to-noise ratio in $z$ by decreasing $\sigma_z$ and increasing $|\mu_z|$. This effect is countered by the regularization loss, which encourages the encoder to increase $\sigma_z$ and decrease $|\mu_z|$.
In many autoencoding frameworks with regularization, whether deterministic (e.g. $L_2$ regularization) or stochastic, the weight of the regularization loss is manually tuned until the network achieves a certain desirable behavior. On the other hand, in Sections 3 and 4 we will see that in a VAE the one-to-one ratio between the reconstruction loss and regularization loss has a probabilistic and information theoretic meaning. We will refer to the VAE constructed in this section as a Typical VAE and will continue to refer to Equations 5 and 7 as an illustration. However, keep in mind that a Typical VAE is just one of many possible examples of how we can choose to construct $p(z)$, $q_\phi(z|x)$, and $p_\theta(x|z)$.
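The two loss terms of a Typical VAE can be written out directly from the network outputs. The sketch below assumes the diagonal-Gaussian encoder and decoder described in this section and a single sampled $z$ per input; the function names are illustrative and are not taken from the accompanying notebook.

```python
import math


def reconstruction_loss(x, mu_x, sigma_x):
    """Single-sample estimate of the reconstruction loss (Equation 5), in nats.

    Computes -log N(x; mu_x, diag(sigma_x^2)) for decoder outputs mu_x, sigma_x
    that were produced from one sampled latent vector z.
    """
    return sum(
        0.5 * math.log(2.0 * math.pi) + math.log(s) + (xi - m) ** 2 / (2.0 * s ** 2)
        for xi, m, s in zip(x, mu_x, sigma_x)
    )


def regularization_loss(mu_z, sigma_z):
    """Closed-form KL( N(mu_z, diag(sigma_z^2)) || N(0, I) ) (Equation 7), in nats."""
    return 0.5 * sum(
        m ** 2 + s ** 2 - 1.0 - math.log(s ** 2) for m, s in zip(mu_z, sigma_z)
    )


def negative_elbo(x, mu_x, sigma_x, mu_z, sigma_z):
    """Negative ELBO (Equation 2) with the two terms weighted one-to-one."""
    return reconstruction_loss(x, mu_x, sigma_x) + regularization_loss(mu_z, sigma_z)


# The regularization loss vanishes exactly when the posterior equals the prior.
kl_at_prior = regularization_loss([0.0, 0.0], [1.0, 1.0])
# Shrinking sigma_z (a higher signal-to-noise ratio in z) raises the KL cost.
kl_sharp = regularization_loss([0.0, 0.0], [0.1, 0.1])
```

Note that the two terms are simply added with no tunable weight, which is the one-to-one ratio discussed above.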
3 Probabilistic Interpretation of VAEs
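3.1 Bayes’ Rule
The VAE was first introduced as an autoencoder for performing Variational Bayes [9]. The negative ELBO can be derived with a simple application of Bayes’ Rule:
$$p_\theta(x) = \frac{p_\theta(x|z)\,p(z)}{p_\theta(z|x)} \tag{8}$$
$$-\log p_\theta(x) = -\log p_\theta(x|z) - \log p(z) + \log p_\theta(z|x) \tag{9}$$
$$= -\log p_\theta(x|z) + \log\frac{q_\phi(z|x)}{p(z)} - \log\frac{q_\phi(z|x)}{p_\theta(z|x)} \tag{10}$$
We can then take the expectation of both sides over $z \sim q_\phi(z|x)$. Since $-\log p_\theta(x)$ is constant over $z$, $\mathbb{E}_{z \sim q_\phi(z|x)}[-\log p_\theta(x)] = -\log p_\theta(x)$. Hence:
$$-\log p_\theta(x) = \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + KL\left(q_\phi(z|x) \,\|\, p(z)\right) - KL\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right) \tag{11}$$
Unfortunately, we cannot efficiently evaluate Equation 11 exactly since we do not know $p_\theta(z|x)$. However, KL-Divergence is nonnegative, so we can remove the last term in Equation 11 to obtain an upper bound equal to the negative ELBO:
$$-\log p_\theta(x) \le \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + KL\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{12}$$
with equality holding if and only if $q_\phi(z|x) = p_\theta(z|x)$ (i.e. when the encoder is able to perfectly predict $p_\theta(z|x)$).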
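The decomposition in Equation 11 can be checked numerically on a model where every quantity has a closed form. The sketch below assumes a one-dimensional linear-Gaussian toy model, $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, s^2)$, which is chosen purely for illustration and does not appear in the tutorial; for this model the marginal is $\mathcal{N}(0, 1 + s^2)$ and the true posterior is $\mathcal{N}\left(\frac{x}{1+s^2}, \frac{s^2}{1+s^2}\right)$.

```python
import math


def kl_gauss(m1, v1, m2, v2):
    """KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians with variances v1 and v2."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def negative_elbo(x, s, q_mean, q_var):
    """Negative ELBO for p(z) = N(0, 1), p(x|z) = N(z, s^2), q(z|x) = N(q_mean, q_var)."""
    # E_q[-log p(x|z)] in closed form: E_q[(x - z)^2] = (x - q_mean)^2 + q_var.
    rec = 0.5 * math.log(2.0 * math.pi * s * s) + ((x - q_mean) ** 2 + q_var) / (2.0 * s * s)
    reg = kl_gauss(q_mean, q_var, 0.0, 1.0)
    return rec + reg


x, s = 1.0, 1.0
post_mean = x / (1.0 + s * s)           # true posterior mean
post_var = s * s / (1.0 + s * s)        # true posterior variance
true_nll = 0.5 * math.log(2.0 * math.pi * (1.0 + s * s)) + x * x / (2.0 * (1.0 + s * s))

# An imperfect inferred posterior: the bound is loose by exactly KL(q || true posterior).
q_mean, q_var = 0.0, 1.0
loose_bound = negative_elbo(x, s, q_mean, q_var)
gap = kl_gauss(q_mean, q_var, post_mean, post_var)

# A perfect inferred posterior: the bound is tight.
tight_bound = negative_elbo(x, s, post_mean, post_var)
```

Here `loose_bound - gap` and `tight_bound` both coincide with `true_nll`, which is Equation 11 and the equality condition of Equation 12 respectively.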
3.2 Importance Sampling
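Importance-Weighted AutoEncoders [2] use importance sampling to provide a similar derivation to Section 3.1 for the negative ELBO by switching the order in which the expectation and logarithm are applied:
$$-\log p_\theta(x) = -\log \mathbb{E}_{z \sim p(z)}\left[p_\theta(x|z)\right] \tag{13}$$
$$= -\log \int p_\theta(x|z)\,p(z)\,\frac{q_\phi(z|x)}{q_\phi(z|x)}\,dz \tag{14}$$
$$= -\log \mathbb{E}_{z \sim q_\phi(z|x)}\left[\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right] \tag{15}$$
We can then apply Jensen’s Inequality to switch the expectation with the logarithm, obtaining an upper bound on the negative log-likelihood, which then simplifies to the negative ELBO:
$$-\log p_\theta(x) \le \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log\frac{p_\theta(x|z)\,p(z)}{q_\phi(z|x)}\right] \tag{16}$$
$$= \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] + KL\left(q_\phi(z|x) \,\|\, p(z)\right) \tag{17}$$
While it is not apparent under this derivation that the tightness of the negative ELBO can be quantified using $KL(q_\phi(z|x) \,\|\, p_\theta(z|x))$, we can see that equality holds if $q_\phi(z|x) = p_\theta(z|x)$, since the right-hand side of Equation 15 becomes $-\log \mathbb{E}_{z \sim q_\phi(z|x)}[p_\theta(x)]$, whose argument is constant over $z$.
The key advantage of the importance sampling interpretation is that Equation 15 gives us a method to approximate the true negative log-likelihood without knowledge of $p_\theta(z|x)$. In the asymptotic case where we take infinite samples, the approximation can become arbitrarily close. However, if the inferred posterior $q_\phi(z|x)$ deviates too much from the true posterior $p_\theta(z|x)$, which can easily happen when the latent space is high-dimensional, importance sampling may require a prohibitively large number of samples to be accurate.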
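A Monte Carlo version of Equation 15 can be sketched as follows. The example again assumes a one-dimensional linear-Gaussian toy model, $p(z) = \mathcal{N}(0, 1)$ and $p_\theta(x|z) = \mathcal{N}(z, s^2)$, chosen only because its true negative log-likelihood is known in closed form; the function names and sample counts are illustrative assumptions. The weights are averaged in log space with the log-sum-exp trick to avoid underflow.

```python
import math
import random


def log_normal_pdf(value, mean, var):
    """Log density of N(mean, var) evaluated at value."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def importance_weighted_nll(x, s, q_mean, q_var, num_samples, rng):
    """Estimate -log p(x) as -log mean_k[ p(x|z_k) p(z_k) / q(z_k|x) ], z_k ~ q(z|x)."""
    log_w = []
    for _ in range(num_samples):
        z = q_mean + math.sqrt(q_var) * rng.gauss(0.0, 1.0)
        log_w.append(
            log_normal_pdf(x, z, s * s)          # log p(x|z)
            + log_normal_pdf(z, 0.0, 1.0)        # log p(z)
            - log_normal_pdf(z, q_mean, q_var)   # log q(z|x)
        )
    max_lw = max(log_w)
    log_mean_w = max_lw + math.log(sum(math.exp(lw - max_lw) for lw in log_w) / num_samples)
    return -log_mean_w


x, s = 1.0, 1.0
true_nll = -log_normal_pdf(x, 0.0, 1.0 + s * s)
rng = random.Random(0)

# Imperfect proposal q(z|x) = p(z): many samples are needed for a good estimate.
estimate_prior_q = importance_weighted_nll(x, s, 0.0, 1.0, 5000, rng)

# Perfect proposal q(z|x) = p(z|x): every weight equals p(x), so any sample count is exact.
estimate_exact_q = importance_weighted_nll(
    x, s, x / (1.0 + s * s), s * s / (1.0 + s * s), 3, rng
)
```

The second estimate illustrates the equality case discussed above: when the inferred posterior matches the true posterior, the importance weights have zero variance.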
3.3 VAEs and Normalizing Flow
An alternative class of maximum likelihood models to VAEs is the more expressive but heavyweight normalizing flow models [7]. In this section, we will look at how a VAE with a Gaussian inferred posterior (but not necessarily diagonal, as in a Typical VAE) could in principle match the performance of normalizing flow models. However, such a VAE would be computationally no easier to train than a flow model.
Normalizing flow leverages the change-of-variables formula: let $f: \mathbb{R}^d \to \mathbb{R}^d$ be an invertible function mapping $z$ to $x = f(z)$, and let $J$ be the Jacobian of $f$ evaluated at $z = f^{-1}(x)$. The negative log-likelihood of $x$ can be evaluated using:
$$-\log p_\theta(x) = -\log p\left(f^{-1}(x)\right) + \log |J| \tag{18}$$
where $|J|$ is the absolute value of the determinant of $J$. Normalizing flow models learn neural networks $f$ that are guaranteed to be invertible and for which $|J|$ can be efficiently computed. An invertible mapping is learned from a latent space (where inference and generation can be easily done on $p(z)$) to the data space. Equation 18 can then be efficiently used as an objective function for maximum likelihood.
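Consider a VAE where $p(z)$ is an isotropic Gaussian as in a Typical VAE, but the covariance matrix $\Sigma_z$ of $q_\phi(z|x)$ is no longer restricted to be diagonal. For simplicity, we will assume that the output distribution is also an isotropic Gaussian (i.e. $\sigma_{x,i} = \sigma$ for all $i$, where $\sigma$ is a scalar). We will refer to such a VAE as a full-covariance VAE. We will now show that for any normalizing flow model that learns an invertible function $f$ from $z$ to $x$, a full-covariance VAE can in theory approach a solution where the negative ELBO would be equivalent to Equation 18.
Consider a full-covariance VAE where $\mu_x = f(z)$, $\mu_z = f^{-1}(x)$, the covariance matrix is $\Sigma_z = \sigma^2 (J^\top J)^{-1}$ with $J$ the Jacobian of $f$ at $f^{-1}(x)$, and where $\sigma$ is a small scalar. Note that since $f$ is invertible, $z$ must have the same dimensionality as $x$. Then we can sample a random vector $\epsilon \sim \mathcal{N}(0, I)$ and use the reparameterization trick to sample from $q_\phi(z|x)$, in which case
$$z = f^{-1}(x) + \sigma J^{-1}\epsilon \tag{19}$$
By definition of the Jacobian, for any vectors $z_0, \delta$, we have:
$$f(z_0 + \delta) = f(z_0) + J\delta + r(\delta) \tag{20}$$
for some remainder vector $r(\delta)$ where $\lim_{\delta \to 0} \|r(\delta)\| / \|\delta\| = 0$. This gives us:
$$\mu_x = f\left(f^{-1}(x) + \sigma J^{-1}\epsilon\right) \tag{21}$$
$$= x + \sigma J J^{-1}\epsilon + r(\sigma J^{-1}\epsilon) \tag{22}$$
$$= x + \sigma\epsilon + r(\sigma J^{-1}\epsilon) \tag{23}$$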
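Then according to Equation 5, our reconstruction loss is given by:
$$L_{rec} = \mathbb{E}_{\epsilon}\left[\frac{d}{2}\log 2\pi + d\log\sigma + \frac{\|x - \mu_x\|^2}{2\sigma^2}\right] \tag{24}$$
$$= \frac{d}{2}\log 2\pi + d\log\sigma + \mathbb{E}_{\epsilon}\left[\frac{\|\sigma\epsilon + r\|^2}{2\sigma^2}\right] \tag{25}$$
where $r$ abbreviates the remainder $r(\sigma J^{-1}\epsilon)$ from Equation 20 and $J$ is the Jacobian of $f$ at $f^{-1}(x)$. The regularization loss is given by:
$$L_{reg} = KL\left(\mathcal{N}(\mu_z, \Sigma_z) \,\|\, \mathcal{N}(0, I)\right) \tag{26}$$
$$= \frac{1}{2}\left(\mathrm{tr}(\Sigma_z) + \|\mu_z\|^2 - d - \log\det\Sigma_z\right) \tag{27}$$
$$= \frac{1}{2}\left(\sigma^2\,\mathrm{tr}\left((J^\top J)^{-1}\right) + \|f^{-1}(x)\|^2 - d\right) - d\log\sigma + \log|J| \tag{28}$$
Then the total objective function is given by:
$$-\mathrm{ELBO}(x) = L_{rec} + L_{reg} \tag{29}$$
$$= \frac{d}{2}\log 2\pi + \mathbb{E}_{\epsilon}\left[\frac{\|\sigma\epsilon + r\|^2}{2\sigma^2}\right] + \frac{\sigma^2}{2}\mathrm{tr}\left((J^\top J)^{-1}\right) + \frac{1}{2}\|f^{-1}(x)\|^2 - \frac{d}{2} + \log|J| \tag{30}$$
If we take the limit as $\sigma \to 0$, then $\sigma J^{-1}\epsilon$ approaches $0$, so $\|r\|/\sigma \to 0$ and the expectation in Equation 30 approaches $\mathbb{E}_{\epsilon}\left[\|\epsilon\|^2\right]/2 = d/2$. Since $J$ is constant with respect to $\sigma$, $\sigma^2\,\mathrm{tr}\left((J^\top J)^{-1}\right) \to 0$. Thus
$$\lim_{\sigma \to 0} -\mathrm{ELBO}(x) = \frac{d}{2}\log 2\pi + \frac{1}{2}\|f^{-1}(x)\|^2 + \log|J| \tag{31}$$
$$= -\log p\left(f^{-1}(x)\right) + \log|J| \tag{32}$$
which is the same as Equation 18. Note that the simplification in Equation 32 can be done because $p(z) = \mathcal{N}(0, I)$. Since the negative ELBO approaches the true negative log-likelihood as $\sigma \to 0$, we can also conclude that $KL\left(q_\phi(z|x) \,\|\, p_\theta(z|x)\right) \to 0$.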
Hence, a VAE with a full-covariance Gaussian posterior could in principle match the modelling capacity of any normalizing flow model. However, in practice there would be a computational bottleneck in computing the determinant of the Jacobian when calculating the regularization loss, meaning that training such a VAE would be no easier than training a normalizing flow model.
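The limiting argument of this section can be sanity-checked in the simplest possible setting. The sketch below assumes a one-dimensional affine flow $f(z) = az + b$, which is a hypothetical choice made for illustration: the Jacobian is the constant $a$, the remainder term vanishes, and the negative ELBO exceeds the flow objective of Equation 18 by exactly $\sigma^2 / (2a^2)$, which goes to zero with $\sigma$.

```python
import math


def flow_nll(x, a, b):
    """Equation 18 for the affine flow f(z) = a*z + b with prior p(z) = N(0, 1)."""
    z0 = (x - b) / a
    return 0.5 * math.log(2.0 * math.pi) + 0.5 * z0 ** 2 + math.log(abs(a))


def vae_negative_elbo(x, a, b, sigma):
    """Negative ELBO of the VAE built from f, with decoder noise sigma.

    Encoder: mu_z = f^{-1}(x), variance sigma^2 / a^2 (the 1-D case of sigma^2 (J^T J)^{-1}).
    Decoder: mu_x = f(z), standard deviation sigma.
    """
    z0 = (x - b) / a
    var_z = sigma ** 2 / a ** 2
    # f is affine, so mu_x = x + sigma * eps exactly and E[(x - mu_x)^2] = sigma^2.
    l_rec = 0.5 * math.log(2.0 * math.pi) + math.log(sigma) + 0.5
    l_reg = 0.5 * (var_z + z0 ** 2 - 1.0 - math.log(var_z))
    return l_rec + l_reg


x, a, b = 0.7, 2.0, -1.0
gap_large_sigma = vae_negative_elbo(x, a, b, 0.5) - flow_nll(x, a, b)
gap_small_sigma = vae_negative_elbo(x, a, b, 1e-3) - flow_nll(x, a, b)
```

For a nonlinear $f$ the remainder $r$ no longer vanishes and the expectation would have to be estimated by sampling, but the same shrinking gap is what the derivation above predicts.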
If we force the covariance matrix $\Sigma_z$ to be a diagonal matrix as in a Typical VAE, then inference becomes efficient, but log-likelihood can only be approximated as described in this section for functions $f$ such that $J^\top J$ is diagonal. If we consider the SVD $J = U S V^\top$, we see that the covariance matrix $\sigma^2 (J^\top J)^{-1} = \sigma^2\, V S^{-2} V^\top$ is diagonal when $V S^{-2} V^\top$ is diagonal. This can happen when all the eigenvalues of $J^\top J$ are equal or when $V$ is a permutation matrix. $V$ is a permutation matrix if each of the dimensions in the latent vector influences $x$ independently. Thus, though restricting $\Sigma_z$ to be a diagonal matrix limits the class of functions we can model, it may also naturally encourage disentanglement. The relationship between VAEs and disentanglement remains an active area of research [8, 3]. We also highlight that regardless of whether $\Sigma_z$ is diagonal or not, the Gaussian modelling assumptions of our model do not restrict the modelling capacity of a VAE to only Gaussian-like data.
4 Information Theoretic Interpretation of VAEs
4.1 Lossless Compression
Maximum likelihood can also be viewed as a lossless compression problem. The negative log-likelihood $-\log p_\theta(x)$ is the optimal expected number of bits needed to describe a sample from $p_{gt}(x)$ based on $p_\theta(x)$. VLAE [4] shows that the negative ELBO is an upper bound on this number by constructing a code to describe $x$ that on average uses $-\mathrm{ELBO}(x)$ bits.
Suppose a sender and receiver have access to our VAE, and the sender wishes to send a vector $x$ to the receiver. The sender could first sample $z \sim q_\phi(z|x)$. Sending $z$ will cost $-\log p(z)$ bits. The receiver can then use the VAE to decode $p_\theta(x|z)$. The sender can then spend another $-\log p_\theta(x|z)$ bits to send an additional code (e.g. the error-correcting code of a reconstruction) that could then describe $x$ exactly. The sender on average spends $\mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p(z) - \log p_\theta(x|z)\right]$ bits to describe $x$. However, since the receiver now knows $x$, it can then use the VAE to know $q_\phi(z|x)$, from which it can decode a secondary message. For example, if the VAE is reparameterized as in Equation 3, then the receiver now also knows the value of $\epsilon$, which contains $-\log q_\phi(z|x)$ bits of information. The expected cost used to describe $x$ with this coding scheme is thus given by:
$$\mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p(z) - \log p_\theta(x|z) + \log q_\phi(z|x)\right] \tag{33}$$
which is equal to the negative ELBO.
Based on this interpretation we can intuitively understand how the reconstruction loss and regularization loss interact with each other in a Typical VAE. The reconstruction loss essentially copies information from the input space to the latent space, and the regularization loss compresses that information. Even if information in the input space is incompressible (e.g. because it is pure noise), copying it into the latent space will not hurt the negative log-likelihood (although it will not help either). Thus, in an ideal optimization landscape, during training we would generally expect the reconstruction loss to decrease and the regularization loss to increase as more information gets stored in the latent space. Although VAEs are sometimes stereotypically associated with blurry reconstructions and heavy stochastic noise in the latent space, in reality the VAE should exhibit essentially perfect reconstructions and increasingly deterministic behavior as training progresses.
4.2 Continuous versus Discrete Data
So far our discussion has been limited to continuous data. However, in practice the data we work with is often discrete or quantized, so our discussion is incomplete without considering to what degree of accuracy we wish to describe $x$. For example, we typically wish to describe images to 8-bit accuracy. Hence, writing $\Delta$ for the spacing between adjacent quantization levels under the chosen normalization of the data, the cost to describe the reconstruction error for an image using the coding scheme in Section 4.1 would more accurately be written as:
$$L_{rec} = \mathbb{E}_{z \sim q_\phi(z|x)}\left[-\sum_{i=1}^{d}\log\int_{x_i - \Delta/2}^{x_i + \Delta/2} p_\theta(x_i = t \mid z)\,dt\right] \tag{34}$$
In Section 4.1, we discussed that in a nice optimization landscape, we expect the reconstruction loss to decrease and the regularization loss to increase during training. However, for discrete data, this can only occur until $L_{rec}$ reaches $0$, at which point the reconstruction loss can no longer decrease and the regularization loss will begin to decrease as compression occurs.
Since Equation 34 requires taking an integral, when working with discrete data it may be beneficial to model $p_\theta(x|z)$ using a distribution for which taking the cumulative distribution function (CDF) is efficient and differentiable. Thus instead of modelling $p_\theta(x|z)$ with an isotropic Gaussian as in a Typical VAE, VAE-IAF [10] instead uses the similarly bell-shaped isotropic logistic distribution, for which the CDF is given by the sigmoid function.
The unit for evaluating a maximum likelihood model for image data is bits per dim (bits/dim), which is the number of bits such a model would need to losslessly describe each pixel of the image to 8-bit accuracy. Any model with a negative log-likelihood of more than 8 bits/dim is worse than useless, as describing the raw pixel values only requires 8 bits/dim. Current state-of-the-art maximum likelihood models require considerably fewer bits/dim than this [5].
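To make the discrete reconstruction cost concrete, the sketch below evaluates the probability mass that a logistic output distribution assigns to one quantization bin as a difference of two sigmoid evaluations, and converts a negative log-likelihood from nats to bits/dim. The function names, the bin width, and the parameter values are assumptions for illustration; practical implementations usually work in log space to stay stable when a bin receives almost no mass.

```python
import math


def sigmoid(t):
    """Numerically stable logistic sigmoid, the CDF of the standard logistic distribution."""
    if t >= 0.0:
        return 1.0 / (1.0 + math.exp(-t))
    e = math.exp(t)
    return e / (1.0 + e)


def discretized_logistic_nll(x, mu, s, bin_width):
    """-log P(x - bin_width/2 < t <= x + bin_width/2) under Logistic(mu, s), in nats."""
    upper = sigmoid((x + 0.5 * bin_width - mu) / s)
    lower = sigmoid((x - 0.5 * bin_width - mu) / s)
    return -math.log(upper - lower)


def bits_per_dim(nll_nats, num_dims):
    """Convert a total negative log-likelihood in nats to bits per dimension."""
    return nll_nats / (num_dims * math.log(2.0))


# Probability mass over a wide grid of unit-width bins sums to (almost exactly) one.
total_mass = sum(
    math.exp(-discretized_logistic_nll(float(k), 0.3, 1.5, 1.0)) for k in range(-50, 51)
)

# A uniform code over 256 levels costs log(256) nats per dimension, i.e. 8 bits/dim.
uniform_bpd = bits_per_dim(3 * math.log(256.0), 3)
```

Unlike a continuous density, the bin probability can never exceed one, so the discrete reconstruction loss is bounded below by zero.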
4.3 Transmission Across a Noisy Gaussian Channel
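Once $L_{rec}$ reaches $0$, we can draw an analogy between a VAE and the classical problem of transmission across a memoryless noisy Gaussian channel. In such a problem, a sender wishes to reliably describe $x$ by transmitting a continuous number $\mu_i$ across a noisy Gaussian channel $n$ times. However, due to power constraints, the sender can only send a strong enough signal such that $\frac{1}{n}\sum_{i=1}^{n}\mu_i^2 \le P$. Moreover, every time the sender sends a signal, the channel adds Gaussian noise $\epsilon_i$ such that $\epsilon_i \sim \mathcal{N}(0, \sigma^2)$. Hence the receiver receives $n$ transmissions of $z_i = \mu_i + \epsilon_i$. If $P = 1 - \sigma^2$, then $\frac{1}{n}\sum_{i=1}^{n}\mathbb{E}[z_i^2] = 1$. The capacity of a Gaussian channel with noise $\sigma^2$ and power $P$ is given by [6]:
$$C = \frac{1}{2}\log\left(1 + \frac{P}{\sigma^2}\right) \tag{35}$$
If the channel has capacity $C$, then $nC$ bits of information can reliably be transmitted across the channel with arbitrarily low probability of error as $n$ increases. Transmission across a noisy channel can thus be viewed as the dual problem of compression, since the ability to describe $x$ reliably using $n$ transmissions of a channel with capacity $C$ indicates an ability to compress $x$ into a representation of at most $nC$ bits.
In an analogous VAE, $\sigma_{z,i} = \sigma$ for all $i$, where $\sigma$ is a scalar. $\frac{1}{n}\|\mu_z\|^2$ corresponds to the power, each dimension of $\mu_z$ corresponds to a transmission across the noisy channel, $\sigma^2$ corresponds to the noise, and each dimension of $z$ corresponds to a received transmission $z_i$. If we set $\frac{1}{n}\|\mu_z\|^2 = 1 - \sigma^2$, then according to Equation 7, the expected regularization loss is given by:
$$L_{reg} = \frac{1}{2}\sum_{i=1}^{n}\left(\mu_{z,i}^2 + \sigma^2 - 1 - \log\sigma^2\right) \tag{36}$$
$$= \frac{n}{2}\left((1 - \sigma^2) + \sigma^2 - 1 - \log\sigma^2\right) \tag{37}$$
$$= \frac{n}{2}\log\frac{1}{\sigma^2} = \frac{n}{2}\log\left(1 + \frac{1 - \sigma^2}{\sigma^2}\right) \tag{38}$$
which is equal to $nC$ in the transmission problem.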
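The equality between the regularization loss and the channel capacity can be verified with a few lines of arithmetic. The sketch below uses hypothetical helper names and an arbitrary noise level; it compares one dimension's KL term against the capacity formula under the power constraint $P = 1 - \sigma^2$.

```python
import math


def channel_capacity(power, noise_var):
    """Capacity of a Gaussian channel, in nats per transmission (Equation 35)."""
    return 0.5 * math.log(1.0 + power / noise_var)


def kl_per_dimension(mu_squared, sigma):
    """One dimension's contribution to the regularization loss of Equation 7."""
    return 0.5 * (mu_squared + sigma ** 2 - 1.0 - math.log(sigma ** 2))


sigma = 0.1                    # noise standard deviation (hypothetical value)
power = 1.0 - sigma ** 2       # average squared mean per latent dimension
capacity = channel_capacity(power, sigma ** 2)
kl = kl_per_dimension(power, sigma)
```

Both quantities reduce to $\log(1/\sigma)$, so halving the posterior standard deviation buys $\log 2$ nats (one bit) of capacity per latent dimension.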
We now give an intuitive geometric explanation for why a channel with power $P$ and noise $\sigma^2$ (where $P = 1 - \sigma^2$) has capacity $C = \log\frac{1}{\sigma}$. A formal proof is given in [6]. By the law of large numbers, as $n$ increases, $\frac{1}{n}\|z\|^2 \approx 1$ for almost all $z$, so after rescaling by $1/\sqrt{n}$, $z$ essentially forms a uniform distribution on the surface of the unit hypersphere. Let $V$ be the volume of the unit hypersphere. Similarly, given $\mu$, $z$ will lie uniformly on the surface of the hypersphere with radius $\sigma$ and center $\mu$ for almost all $z$. The volume of a hypersphere with radius $\sigma$ is given by $\sigma^n V$. We can thus expect to fit $V / (\sigma^n V) = \sigma^{-n}$ unique non-overlapping hyperspheres of radius $\sigma$ into the unit hypersphere, and indexing one of them takes $\log\sigma^{-n} = nC$ bits.
We can thus create a transmission scheme as follows. A codebook is created that maps every point $x$ to a hypersphere with volume proportional to $p_{gt}(x)$, and is revealed to both the sender and receiver. Each hypersphere has its own center $\mu$ and radius $\sigma$. During transmission, the sender maps each input $x$ to the center $\mu$ of its corresponding hypersphere, which is transmitted across the channel. The receiver then uses the codebook to map each received point $z$ to the center of the hypersphere that $z$ belongs to, which can then be mapped to the input image $x$.
Since there are finitely many values $x$ can take when it is discrete, constructing such a codebook is possible in theory. However, in practice, for an 8-bit RGB image with $d$ values in total (height times width times 3 channels), there would be $256^d$ possible values of $x$, so calculating such a codebook by brute force would be prohibitively expensive. Optimizing a neural network to learn such a coding scheme would also be difficult since mapping a point to the center of the hypersphere that it belongs to would be a non-differentiable operation. VQ-VAE [11] takes a step in this direction by using the straight-through gradient estimator.
One counterintuitive result of information theory is that the rate of a memoryless noisy channel is optimal, meaning that we could do no better if we were able to send each transmission $\mu_i$ with knowledge of what the receiver received for $z_j$ for all $j < i$. In other words, when $n$ is large enough, the modelling capacity of a VAE is not inhibited by the assumption of a diagonal covariance for the inferred posterior, and in theory a Typical VAE should be able to model any distribution to arbitrary accuracy. Indeed, in the above transmission scheme $q_\phi(z|x) = p_\theta(z|x)$, so the negative ELBO would be an exact estimate of the negative log-likelihood. This is despite making no assumptions about whether $p_{gt}(x)$ is Gaussian or that the latent dimensions of $z$ affect $x$ independently. This is consistent with the results of Section 3.3 since the mapping from $z$ to $x$ is not invertible. However, though a VAE may have the capacity to model a distribution $p_{gt}(x)$, in practice it is unable to learn such a solution due to computational constraints and a difficult optimization landscape.
5 Misconceptions About VAEs
In this section, we will look at misconceptions about VAEs that I have encountered in teaching materials and among other researchers in computer vision and machine learning. Not all readers may hold these misconceptions, so those who can correctly answer all of the following questions can skip this section:

Q: Can VAEs be trained using mean-squared error as the reconstruction loss? A: No.

Q: Should VAEs have blurry or sharp reconstructions? A: Reconstructions should be essentially perfect.

Q: Are VAEs only able to model data that is highly Gaussian? A: No.

Q: Suppose $p_{gt}(x)$ is an isotropic Gaussian. How many dimensions does the latent space of a Typical VAE need to be in order to model $p_{gt}(x)$? A: $0$. $z$ can be ignored and need not exist.

Q: Suppose where consists of mixtures of isotropic Gaussians. How many dimensions does the latent space of a Typical VAE need to be in order to model ? A: . The number of mixtures is irrelevant.
5.1 VAEs Cannot Be Trained With The MeanSquared Error Loss
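Minimizing the Mean-Squared Error ($\mathrm{MSE}$) in lieu of $L_{rec}$ is a common mistake I have encountered in casual conversation, a peer review, and at least one prominent blog/code tutorial on VAEs. In fact, using $\mathrm{MSE}$ as the reconstruction loss is extremely problematic for maximum likelihood and results in the VAE being highly over-regularized. The $\mathrm{MSE}$ objective can be defined as:
$$\mathrm{MSE} = \frac{1}{d}\,\mathbb{E}_{z \sim q_\phi(z|x)}\left[\|x - \mu_x\|^2\right] \tag{39}$$
In a simplified Typical VAE where $\sigma_{x,i} = \sigma_x$ for a scalar $\sigma_x$, as described in Section 3.3, the reconstruction loss is given by:
$$L_{rec} = \frac{1}{d}\,\mathbb{E}_{z \sim q_\phi(z|x)}\left[-\log p_\theta(x|z)\right] \tag{40}$$
$$= \frac{1}{d}\,\mathbb{E}_{z \sim q_\phi(z|x)}\left[\sum_{i=1}^{d}\left(\log\left(\sigma_x\sqrt{2\pi}\right) + \frac{(x_i - \mu_{x,i})^2}{2\sigma_x^2}\right)\right] \tag{41}$$
$$= \log\left(\sigma_x\sqrt{2\pi}\right) + \frac{\mathrm{MSE}}{2\sigma_x^2} \tag{42}$$
Note that we have normalized $L_{rec}$ and $\mathrm{MSE}$ by $d$ for simplicity, which is common practice (see Section 4.2).
We see that we obtain the $\mathrm{MSE}$ objective by assuming $\sigma_x = \frac{1}{\sqrt{2}}$ throughout training and ignoring the first term in Equation 42, which would now be constant. Such an assumption would obviously be suboptimal from a maximum likelihood standpoint; in fact we can analytically see that, given $\mathrm{MSE}$, Equation 42 is minimized when $\sigma_x^2 = \mathrm{MSE}$, in which case $L_{rec}$ simplifies to:
$$L_{rec} = \frac{1}{2}\log \mathrm{MSE} + \frac{1}{2}\log 2\pi e \tag{43}$$
Hence, minimizing negative log-likelihood introduces a logarithm operation in front of the $\mathrm{MSE}$ objective. This should fall in line with our intuition from a lossless compression standpoint, since if the root-mean-squared error $\sqrt{\mathrm{MSE}}$ were cut in half, we would expect to require one fewer bit to describe the reconstruction error.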
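The relationship between the two objectives can be checked numerically. The sketch below assumes the simplified setting of this section (a single scalar $\sigma_x$ and losses normalized per dimension) and uses hypothetical function names; it confirms that the per-dimension reconstruction loss is minimized at $\sigma_x^2 = \mathrm{MSE}$ and that halving the root-mean-squared error saves exactly one bit.

```python
import math


def rec_loss_per_dim(mse, sigma_x):
    """Per-dimension Gaussian reconstruction loss with scalar sigma_x (Equation 42), in nats."""
    return math.log(sigma_x * math.sqrt(2.0 * math.pi)) + mse / (2.0 * sigma_x ** 2)


def optimal_rec_loss_per_dim(mse):
    """Equation 42 evaluated at its minimizer sigma_x^2 = mse (Equation 43), in nats."""
    return 0.5 * math.log(mse) + 0.5 * math.log(2.0 * math.pi * math.e)


mse = 0.04   # hypothetical value
best = optimal_rec_loss_per_dim(mse)
at_minimizer = rec_loss_per_dim(mse, math.sqrt(mse))
at_fixed_sigma = rec_loss_per_dim(mse, 1.0 / math.sqrt(2.0))   # the implicit MSE-loss choice

# Halving the RMSE divides the MSE by four and saves log(2) nats, i.e. one bit.
saving_in_bits = (best - optimal_rec_loss_per_dim(mse / 4.0)) / math.log(2.0)
```

With $\sigma_x$ fixed at $1/\sqrt{2}$, the loss is simply $\mathrm{MSE}$ plus a constant, which is why the fixed-variance objective stops rewarding further improvements once $\mathrm{MSE}$ is small.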
The central practical difference between using $\mathrm{MSE}$ and $L_{rec}$ as a reconstruction loss for a VAE is that as the magnitude of $\mathrm{MSE}$ shrinks, the gradient of $\mathrm{MSE}$ shrinks along with it, while the gradient of $L_{rec}$ does not, since by Equation 43 $\nabla L_{rec} = \nabla \mathrm{MSE} / (2\,\mathrm{MSE})$ is rescaled by the current value of $\mathrm{MSE}$. This results in several problems for $\mathrm{MSE}$.
First, since we expect $\mathrm{MSE}$ to decrease during training, the gradient of the reconstruction loss relative to the regularization loss becomes weaker and weaker (often by several orders of magnitude) as training progresses. We expect the model to thus be much more highly regularized at the end of training compared to the beginning.
Second, the scale of the input data essentially becomes a hyperparameter that controls how much we wish to balance the initial weight of the reconstruction loss compared to the regularization loss. For example, we can make $\mathrm{MSE} < \delta$ for arbitrarily small $\delta$ by simply normalizing all our input data to lie in the range $[0, \sqrt{\delta}]$. Thus, training a VAE with $\mathrm{MSE}$ as the reconstruction loss could more accurately be described as “an autoencoder with stochastic regularization that generates useful or beautiful samples”.
Using $\mathrm{MSE}$ to assess the reconstruction capabilities of a VAE thus becomes quite meaningless. For example, one top tier paper incorrectly declared that their VAE had essentially memorized a natural image dataset because they had achieved an $\mathrm{MSE}$ that was ostensibly a low number. However, given the range to which the data had been normalized, that $\mathrm{MSE}$ meant that reconstructions were off by an average of roughly 25 pixel values if normalized back to the original 8-bit range. From a lossless compression standpoint as discussed in Section 4.2, if we calculated $L_{rec}$ using Equation 43, their model would spend a substantial number of bits just to describe the reconstruction error of each pixel to 8-bit accuracy, on top of the latent cost that a total negative log-likelihood must also include.
The above problems all point to VAEs trained with $\mathrm{MSE}$ being extremely over-regularized during training when the input data is scaled to lie in a small range. The over-regularization due to incorrectly using $\mathrm{MSE}$ instead of $L_{rec}$ as a reconstruction loss is also a source of the stereotype that VAEs have “blurry” reconstructions when in fact a properly trained VAE should have nearly perfect reconstructions as discussed in Section 4.1. On the other hand, a potential benefit of over-regularization is that the latent space only stores the most highly compressible information, which tends to be global structure. This may be beneficial for related downstream tasks like representation learning or disentanglement.
5.2 The Latent Vector Is Not A Parameter
The encoder in a Typical VAE learns the parameters $\phi$ of a neural network so that given an input $x$, it can predict the parameters $\mu_z, \sigma_z$ of a Gaussian distribution from which $z$ is sampled. The latent vector $z$ is not a parameter that describes the distribution $p_{gt}(x)$, but rather a variable that describes an individual data point $x$. Another misconception about VAEs that I have encountered in casual conversation, a peer review, and at least one university graduate course is to think of the parameters $\mu_z, \sigma_z$ as learned rather than predicted and/or to in turn think of the latent vector $z$ as a parameter rather than a variable.
To illustrate this, let us consider a toy case of a Typical VAE modelling $p_{gt}(x) = \mathcal{N}(\mu, \sigma^2)$ for scalar $x$. Those who view $z$ as a parameter may incorrectly guess that the global optimum to “learn the distribution” involves setting $\mu_z = \mu$ and $\sigma_z = \sigma$. This would clearly be suboptimal as $z$ contains no information about $x$ while incurring a nonzero regularization loss (unless the prior is constructed so that $p(z) = \mathcal{N}(\mu, \sigma^2)$).
Let us consider two valid solutions. One obvious global optimum would be for the VAE to ignore the latent space, so $q_\phi(z|x) = p(z)$ and $p_\theta(x|z) = \mathcal{N}(\mu, \sigma^2)$. Another solution would be for the VAE to learn to essentially act as a deterministic regular autoencoder, so $\mu_z = (x - \mu)/\sigma$ and $\mu_x = \mu + \sigma z$. It is easy to verify that this solution is a global minimum that achieves the same negative log-likelihood as the first solution as $\sigma_x, \sigma_z \to 0$ (with $\sigma_z = \sigma_x / \sigma$, following Section 3.3).
In both solutions, knowledge about the distribution (i.e. knowledge about $\mu$ and $\sigma$) is embedded in the parameters $\theta$ and (in the second solution) $\phi$. In the first solution $z$ is completely uninformative and would still be a valid solution if $z$ did not exist. In the second solution, $z$ describes the data point $x$ exactly but contains no information about the distribution $p_{gt}(x)$ or its parameters. As $z$ describes a data point, even if $p_{gt}(x)$ were a mixture of Gaussians, the dimensionality that $z$ needs would not grow with the number of mixture components (which would be the case if $z$ were estimating parameters).
A consequence of incorrectly viewing $z$ as a parameter is that one will underestimate the expressivity of a VAE. Since the information about $p_{gt}(x)$ is all embedded in $\theta$ and $\phi$, which can be arbitrarily powerfully parameterized, VAEs can model complicated non-Gaussian distributions as well, as seen in Sections 3.3 and 4.3. However, if one incorrectly believes that the purpose of the latent vector is to estimate the parameters of a distribution, then one may expect that a VAE can only model data similar to the isotropic Gaussian prior $p(z)$.
6 Example: Toy 2D Data


Table 1: Typical VAE results.
Dataset | $H(p_{gt})$ | Negative ELBO | NLL
8 Gaussians | -1.92 | -1.83 | -1.87
Checkerboard | -1 | -0.72 | -0.81
2 Spirals | -2.34 | -1.99 | -2.15
We train VAEs on several toy 2D distributions. Details of results and implementation can be found in the Jupyter notebook (available at https://github.com/ronaldiscool/VAETutorial). We summarize key results below.
6.1 Typical VAEs
Each Typical VAE has two latent dimensions. The architecture takes roughly 1 to 2 GB of GPU memory and takes 20 minutes to train on a K80 on Google Colab. We trained for 60000 iterations with a batch size of 200 input samples at each iteration.
In Figure 0(a), we see density estimation results on several 2D datasets: a multimodal Gaussian distribution (which we will call “8 Gaussians”), a uniform distribution over a checkerboard (“Checkerboard”), and a uniform distribution over 2 spirals (“2 Spirals”). A Typical VAE with 2 latent dimensions can capture the general shape of each distribution. However, the VAE also assigns non-trivial amounts of probability density to what should be low-density areas such as the space between two Gaussian clusters in “8 Gaussians”, the connection between two squares in “Checkerboard”, and the center of “2 Spirals”. As a result, less density is estimated on the ground-truth high-density areas, so in Figure 0(a) the ground truth distributions are more yellow than the predicted distributions.
We display the negative log-likelihood of a VAE trained on each dataset in Table 1. The first column indicates the entropy $H(p_{gt})$ of the ground truth distribution, which is also a lower bound for the negative log-likelihood of a maximum likelihood model since $\mathbb{E}_{x \sim p_{gt}}[-\log p_\theta(x)] = H(p_{gt}) + KL(p_{gt} \,\|\, p_\theta) \ge H(p_{gt})$. The second column is the negative ELBO. The third column is the negative log-likelihood approximated by taking 250 importance samples using Equation 15. We can see how tight of a bound the negative ELBO is by comparing its value with the true negative log-likelihood. We see that the VAE is nearly optimal on the “8 Gaussians” dataset and the negative ELBO is almost exact. However, there is room for improvement on the other two distributions.
We visualize the correspondence between the input space and the latent space in Figure 0(b). We see that the VAE copies information from the input space into the latent space and then expands the high-density (colored) regions, which by extension contracts the low-density (dark) regions. However, a significant portion of the latent space still maps to dark areas of low density, even for the “8 Gaussians” dataset on which the VAE was nearly optimal.
Why did the VAE not achieve optimal negative log-likelihood? Another way to look at this question is, over the course of 60000 iterations and 12 million input samples, why did the VAE fail to learn to expand the colored regions in Figure 0(b) until the dark regions were arbitrarily small? Based both on our discussion in Sections 3 and 4 and visual inspection of Figure 0(b), it does not appear that the model is incapable of expressing a model for which the colored regions are fully expanded. A possible alternative explanation lies in the optimization landscape of the VAE.
Table 2: IWAE results.
Dataset | $H(p_{gt})$ | Loss | NLL
8 Gaussians | -1.92 | -1.84 | -1.91
Checkerboard | -1 | -0.81 | -0.87
2 Spirals | -2.34 | -2.24 | -2.29


6.2 IWAEs and Beyond
One possible reason for the VAE’s suboptimal results may be that the decoder does not receive samples from $p(z)$ as input, but rather samples from $q_\phi(z|x)$, which would presumably be centered around colored regions of the latent space. During early phases of training, $q_\phi(z|x)$ has high variance, allowing the decoder to encounter inputs sampled from dark regions and “explore” the latent space, which in turn allows it to learn to expand the colored regions. However, as the VAE grows more powerful, the regularization loss quickly increases and the variance of $q_\phi(z|x)$ decreases. The decoder then explores increasingly fewer inputs from the dark regions, so the decoder will expand the colored region at an increasingly slower rate. If the variance of $q_\phi(z|x)$ decreases too quickly, then even asymptotically the VAE will not converge to the global optimum.
One remedy to this limitation is the Importance-Weighted AutoEncoder (IWAE) [2] mentioned in Section 3.2. For its loss function, the IWAE samples from the approximate posterior multiple times during each iteration of training to approximate Equation 15. For an intuition of the utility of importance sampling, consider the following scenario: suppose we have a multimodal distribution where the chief job of the encoder is to select which cluster the input belongs to. The VAE infers an accurate but not perfect posterior distribution, such that a sampled latent vector falls in the correct cluster with high probability and in a wrong cluster with small probability (e.g. the tail end of a Gaussian extends to a region of latent space belonging to another cluster). This small chance of error is amplified in the negative ELBO, because the logarithm is taken inside the expectation and the rare sample with near-zero likelihood contributes a very large negative log likelihood. On the other hand, if we swap the order of the expectation and logarithm as done in importance sampling, the accurate predictions will drown out the inaccurate ones, because the likelihoods are averaged before the logarithm is taken. Hence, compared to an IWAE, training a VAE using the negative ELBO will encourage a lower-variance approximate posterior, which strongly discourages the VAE from exploring the dark regions in Figure 0(b), where likelihoods are low.
We train an IWAE using ten importance samples during each training iteration for 30000 iterations, which takes roughly the same amount of time as our typical VAE did but receives half the amount of input data. We see that, compared to a VAE, the IWAE achieves better negative log likelihood numbers in Table 2 and thus a more faithful probability density map in Figure 1(a). Table 2 also shows that the objective of the IWAE is a tighter bound on the negative log likelihood than the negative ELBO. However, the most striking difference can be seen in Figure 1(b), in which the colored regions have been significantly expanded.
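The effect of swapping the expectation and the logarithm in the scenario above can be checked with a small calculation. The sketch below uses purely hypothetical numbers (a likelihood of 1.0 with probability 0.99 and of 1e-6 with probability 0.01) and, as a simplification, treats the importance weight as the likelihood alone, ignoring the ratio between prior and approximate posterior. Under these assumptions it computes the exact expected value of the K-sample objective; K = 1 corresponds to the reconstruction term of the negative ELBO, and larger K corresponds to an IWAE-style objective. The function name and all constants are illustrative.

```python
import math


def expected_k_sample_bound(k, p_good, lik_good, lik_bad):
    """Exact E[-log((1/k) * sum_i lik_i)] for an illustrative two-point model.

    Each of the k sampled likelihoods equals lik_good with probability p_good
    and lik_bad otherwise, so the number of bad samples is binomial and the
    expectation can be summed exactly instead of estimated by sampling.
    """
    total = 0.0
    for n_bad in range(k + 1):
        prob = math.comb(k, n_bad) * (1 - p_good) ** n_bad * p_good ** (k - n_bad)
        mean_lik = ((k - n_bad) * lik_good + n_bad * lik_bad) / k
        total += prob * -math.log(mean_lik)
    return total


# Hypothetical numbers, chosen only to illustrate the argument.
P_GOOD, LIK_GOOD, LIK_BAD = 0.99, 1.0, 1e-6

elbo_style = expected_k_sample_bound(1, P_GOOD, LIK_GOOD, LIK_BAD)   # log inside
iwae_style = expected_k_sample_bound(10, P_GOOD, LIK_GOOD, LIK_BAD)  # ten samples
limit = -math.log(P_GOOD * LIK_GOOD + (1 - P_GOOD) * LIK_BAD)        # log outside

print("K = 1  (ELBO-style):", elbo_style)
print("K = 10 (IWAE-style):", iwae_style)
print("K -> infinity      :", limit)
```

With the logarithm inside the expectation, the rare bad sample dominates the loss. Averaging ten likelihoods before taking the logarithm brings the objective much closer to the value obtained with the logarithm outside the expectation, so the penalty for occasionally sampling a wrong cluster is far smaller.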
From this toy example, we have seen the capabilities and limitations of a VAE in a practical setting. While VAEs are capable of capturing the general shape of even highly non-Gaussian distributions, the negative ELBO prevents the VAE from filling the latent space with high-probability datapoints. Even though we have shown in Sections 3.3 and 4.3 that there are certain situations (i.e. when the latent space is fully disentangled, or when the latent space is high-dimensional and the input data is discrete) where the VAE could theoretically model the ground truth input distribution perfectly, in practice these solutions could be virtually impossible to learn via gradient descent.
As a result of these limitations, in addition to IWAEs, further research on VAEs has focused on allowing for a more flexible posterior distribution using normalizing flows, with which the posterior can accurately infer the latent variable even when its variance is large [12, 10]. Other improvements to the VAE include allowing for a more flexible latent prior distribution [13, 1] or output distribution [4]. With these improvements, VAEs are becoming an increasingly powerful probabilistic model that can be used for a variety of applications such as lossless compression, generative image modeling, and representation learning.
References

[1] (2019) Resampled priors for variational autoencoders. In The 22nd International Conference on Artificial Intelligence and Statistics, pp. 66–75.
[2] (2015) Importance weighted autoencoders. arXiv preprint arXiv:1509.00519.
[3] (2018) Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pp. 2610–2620.
[4] (2017) Variational lossy autoencoder. In International Conference on Learning Representations (ICLR 2017).
[5] (2018) PixelSNAIL: an improved autoregressive generative model. In International Conference on Machine Learning, pp. 864–872.
[6] (2012) Elements of information theory. John Wiley & Sons.
[7] (2015) NICE: non-linear independent components estimation. In International Conference on Learning Representations (ICLR 2015), workshop track.
[8] (2017) beta-VAE: learning basic visual concepts with a constrained variational framework. In International Conference on Learning Representations (ICLR 2017).
[9] (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations (ICLR 2014).
[10] (2016) Improved variational inference with inverse autoregressive flow. In Advances in Neural Information Processing Systems, pp. 4743–4751.
[11] (2017) Neural discrete representation learning. arXiv preprint arXiv:1711.00937.
[12] (2015) Variational inference with normalizing flows. In International Conference on Machine Learning, pp. 1530–1538.
[13] (2018) VAE with a VampPrior. In International Conference on Artificial Intelligence and Statistics, pp. 1214–1223.