Generative Models with TensorFlow
The variational autoencoder (VAE; Kingma & Welling, 2014) is a recently proposed generative model pairing a top-down generative network with a bottom-up recognition network which approximates posterior inference. It typically makes strong assumptions about posterior inference, for instance that the posterior distribution is approximately factorial, and that its parameters can be approximated with nonlinear regression from the observations. As we show empirically, the VAE objective can lead to overly simplified representations which fail to use the network's entire modeling capacity. We present the importance weighted autoencoder (IWAE), a generative model with the same architecture as the VAE, but which uses a strictly tighter log-likelihood lower bound derived from importance weighting. In the IWAE, the recognition network uses multiple samples to approximate the posterior, giving it increased flexibility to model complex posteriors which do not fit the VAE modeling assumptions. We show empirically that IWAEs learn richer latent space representations than VAEs, leading to improved test log-likelihood on density estimation benchmarks.
In recent years, there has been a renewed focus on learning deep generative models (Hinton et al., 2006; Salakhutdinov & Hinton, 2009; Gregor et al., 2014; Kingma & Welling, 2014; Rezende et al., 2014). A common difficulty faced by most approaches is the need to perform posterior inference during training: the log-likelihood gradients for most latent variable models are defined in terms of posterior statistics (e.g. Salakhutdinov & Hinton (2009); Neal (1992); Gregor et al. (2014)). One approach for dealing with this problem is to train a recognition network alongside the generative model (Dayan et al., 1995). The recognition network aims to predict the posterior distribution over latent variables given the observations, and can often generate a rough approximation much more quickly than generic inference algorithms such as MCMC.
The variational autoencoder (VAE; Kingma & Welling (2014); Rezende et al. (2014)) is a recently proposed generative model which pairs a top-down generative network with a bottom-up recognition network. Both networks are jointly trained to maximize a variational lower bound on the data log-likelihood. VAEs have recently been successful at separating style and content (Kingma et al., 2014; Kulkarni et al., 2015) and at learning to “draw” images in a realistic manner (Gregor et al., 2015).
VAEs make strong assumptions about the posterior distribution. Typically VAE models assume that the posterior is approximately factorial, and that its parameters can be predicted from the observables through a nonlinear regression. Because they are trained to maximize a variational lower bound on the log-likelihood, they are encouraged to learn representations where these assumptions are satisfied, i.e. where the posterior is approximately factorial and predictable with a neural network. While this effect is beneficial, it comes at a cost: constraining the form of the posterior limits the expressive power of the model. This is especially true of the VAE objective, which harshly penalizes approximate posterior samples which are unlikely to explain the data, even if the recognition network puts much of its probability mass on good explanations.
In this paper, we introduce the importance weighted autoencoder (IWAE), a generative model which shares the VAE architecture, but which is trained with a tighter log-likelihood lower bound derived from importance weighting. The recognition network generates multiple approximate posterior samples, and their weights are averaged. As the number of samples is increased, the lower bound approaches the true log-likelihood. The use of multiple samples gives the IWAE additional flexibility to learn generative models whose posterior distributions do not fit the VAE modeling assumptions. This approach is related to reweighted wake sleep (Bornschein & Bengio, 2015), but the IWAE is trained using a single unified objective. Compared with the VAE, our IWAE is able to learn richer representations with more latent dimensions, which translates into significantly higher log-likelihoods on density estimation benchmarks.
In this section, we review the variational autoencoder (VAE) model of Kingma & Welling (2014). In particular, we describe a generalization of the architecture to multiple stochastic hidden layers. We note, however, that Kingma & Welling (2014) used a single stochastic hidden layer, and there are other sensible generalizations to multiple layers, such as the one presented by Rezende et al. (2014).
The VAE defines a generative process in terms of ancestral sampling through a cascade of hidden layers:
$$p(x \mid \theta) = \sum_{h^1, \ldots, h^L} p(h^L \mid \theta)\, p(h^{L-1} \mid h^L, \theta) \cdots p(x \mid h^1, \theta). \tag{1}$$
Here, $\theta$ is a vector of parameters of the variational autoencoder, and $h = \{h^1, \ldots, h^L\}$ denotes the stochastic hidden units, or latent variables. The dependence on $\theta$ is often suppressed for clarity. For convenience, we define $h^0 = x$. Each of the terms $p(h^\ell \mid h^{\ell+1})$ may denote a complicated nonlinear relationship, for instance one computed by a multilayer neural network. However, it is assumed that sampling and probability evaluation are tractable for each $p(h^\ell \mid h^{\ell+1})$. Note that $L$ denotes the number of stochastic hidden layers; the deterministic layers are not shown explicitly here. We assume the recognition model $q(h \mid x)$ is defined in terms of an analogous factorization:
$$q(h \mid x) = q(h^1 \mid x)\, q(h^2 \mid h^1) \cdots q(h^L \mid h^{L-1}), \tag{2}$$
where sampling and probability evaluation are tractable for each of the terms in the product.
In this work, we assume the same families of conditional probability distributions as Kingma & Welling (2014). In particular, the prior $p(h^L)$ is fixed to be a zero-mean, unit-variance Gaussian. In general, each of the conditional distributions $p(h^\ell \mid h^{\ell+1})$ and $q(h^\ell \mid h^{\ell-1})$ is a Gaussian with diagonal covariance, where the mean and covariance parameters are computed by a deterministic feed-forward neural network. For real-valued observations, $p(x \mid h^1)$ is also defined to be such a Gaussian; for binary observations, it is defined to be a Bernoulli distribution whose mean parameters are computed by a neural network.
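The VAE is trained to maximize a variational lower bound on the log-likelihood, as derived from Jensen's inequality:
$$\log p(x) = \log \mathbb{E}_{q(h \mid x)}\left[\frac{p(x, h)}{q(h \mid x)}\right] \ge \mathbb{E}_{q(h \mid x)}\left[\log \frac{p(x, h)}{q(h \mid x)}\right] = \mathcal{L}(x). \tag{3}$$
Since $\mathcal{L}(x) = \log p(x) - D_{\mathrm{KL}}(q(h \mid x) \,\|\, p(h \mid x))$, the training procedure is forced to trade off the data log-likelihood and the KL divergence from the true posterior. This is beneficial, in that it encourages the model to learn a representation where posterior inference is easy to approximate.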
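The trade-off described above can be made explicit with a short derivation. The following is a sketch of the standard identity relating the variational bound to the KL divergence; the symbols $\mathcal{L}(x)$, $q(h \mid x)$ and $p(h \mid x)$ are assumed to denote the bound of Eqn. 3, the recognition distribution and the true posterior, respectively.

```latex
\begin{aligned}
\log p(x) - \mathcal{L}(x)
  &= \mathbb{E}_{q(h \mid x)}\left[\log p(x) - \log \frac{p(x, h)}{q(h \mid x)}\right] \\
  &= \mathbb{E}_{q(h \mid x)}\left[\log \frac{q(h \mid x)\, p(x)}{p(h \mid x)\, p(x)}\right]
     \qquad \text{using } p(x, h) = p(h \mid x)\, p(x) \\
  &= D_{\mathrm{KL}}\bigl(q(h \mid x) \,\|\, p(h \mid x)\bigr) \;\ge\; 0 .
\end{aligned}
```

The gap between the log-likelihood and the bound is therefore exactly the KL divergence from the recognition distribution to the true posterior, which vanishes only when the two coincide.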
If one computes the log-likelihood gradient for the recognition network directly from Eqn. 3, the result is a REINFORCE-like update rule which trains slowly because it does not use the log-likelihood gradients with respect to latent variables (Dayan et al., 1995; Mnih & Gregor, 2014). Instead, Kingma & Welling (2014) proposed a reparameterization of the recognition distribution in terms of auxiliary variables with fixed distributions, such that the samples from the recognition model are a deterministic function of the inputs and auxiliary variables. While they presented the reparameterization trick for a variety of distributions, for convenience we discuss the special case of Gaussians, since that is all we require in this work. (The general reparameterization trick can be used with our IWAE as well.)
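In this paper, the recognition distribution $q(h^\ell \mid h^{\ell-1}, \theta)$ always takes the form of a Gaussian $\mathcal{N}(h^\ell \mid \mu(h^{\ell-1}, \theta), \Sigma(h^{\ell-1}, \theta))$, whose mean and covariance are computed from the states of the hidden units at the previous layer and the model parameters. This can be alternatively expressed by first sampling an auxiliary variable $\epsilon^\ell \sim \mathcal{N}(0, I)$, and then applying the deterministic mapping
$$h^\ell(\epsilon^\ell, h^{\ell-1}, \theta) = \Sigma(h^{\ell-1}, \theta)^{1/2}\, \epsilon^\ell + \mu(h^{\ell-1}, \theta). \tag{4}$$
The joint recognition distribution $q(h \mid x, \theta)$ over all latent variables can be expressed in terms of a deterministic mapping $h(\epsilon, x, \theta)$, with $\epsilon = (\epsilon^1, \ldots, \epsilon^L)$, by applying Eqn. 4 for each layer in sequence. Since the distribution of $\epsilon$ does not depend on $\theta$, we can reformulate the gradient of the bound $\mathcal{L}(x)$ from Eqn. 3 by pushing the gradient operator inside the expectation:
$$\nabla_\theta\, \mathbb{E}_{h \sim q(h \mid x, \theta)}\left[\log \frac{p(x, h \mid \theta)}{q(h \mid x, \theta)}\right] = \nabla_\theta\, \mathbb{E}_{\epsilon}\left[\log \frac{p(x, h(\epsilon, x, \theta) \mid \theta)}{q(h(\epsilon, x, \theta) \mid x, \theta)}\right] \tag{5}$$
$$= \mathbb{E}_{\epsilon}\left[\nabla_\theta \log \frac{p(x, h(\epsilon, x, \theta) \mid \theta)}{q(h(\epsilon, x, \theta) \mid x, \theta)}\right]. \tag{6}$$
Assuming the mapping $h$ is represented as a deterministic feed-forward neural network, for a fixed $\epsilon$, the gradient inside the expectation can be computed using standard backpropagation. In practice, one approximates the expectation in Eqn. 6 by generating $k$ samples of $\epsilon$ and applying the Monte Carlo estimator
$$\frac{1}{k} \sum_{i=1}^{k} \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta), \tag{7}$$
with $w(x, h, \theta) = p(x, h \mid \theta) / q(h \mid x, \theta)$. This is an unbiased estimate of $\nabla_\theta \mathcal{L}(x)$. We note that the VAE update and the basic REINFORCE-like update are both unbiased estimators of the same gradient, but the VAE update tends to have lower variance in practice because it makes use of the log-likelihood gradients with respect to the latent variables.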
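To illustrate the reparameterization idea on a case small enough to check by hand, here is a minimal Python sketch. It is not the paper's implementation: the scalar Gaussian, the toy objective E[h^2], and the function names are all assumptions chosen so that the exact gradients are known in closed form.

```python
import random


def reparameterized_sample(mu, sigma, rng):
    """Draw h ~ N(mu, sigma^2) as a deterministic function of eps ~ N(0, 1)."""
    eps = rng.gauss(0.0, 1.0)  # auxiliary noise; its distribution does not depend on mu, sigma
    return mu + sigma * eps, eps


def grad_estimate(mu, sigma, n, rng):
    """Monte Carlo estimate of the gradient of E[h^2] with respect to (mu, sigma).

    With h = mu + sigma * eps, the chain rule gives
    d(h^2)/dmu = 2 * h and d(h^2)/dsigma = 2 * h * eps.
    """
    g_mu = 0.0
    g_sigma = 0.0
    for _ in range(n):
        h, eps = reparameterized_sample(mu, sigma, rng)
        g_mu += 2.0 * h
        g_sigma += 2.0 * h * eps
    return g_mu / n, g_sigma / n


rng = random.Random(0)
g_mu, g_sigma = grad_estimate(mu=1.0, sigma=0.5, n=100_000, rng=rng)
# Exact values for comparison: E[h^2] = mu^2 + sigma^2,
# so the true gradients are 2 * mu = 2.0 and 2 * sigma = 1.0.
```

Because the noise is drawn independently of the parameters, the gradient passes through the sample itself, which is the property the update in Eqn. 7 relies on.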
The VAE objective of Eqn. 3 heavily penalizes approximate posterior samples which fail to explain the observations. This places a strong constraint on the model, since the variational assumptions must be approximately satisfied in order to achieve a good lower bound. In particular, the posterior distribution must be approximately factorial and predictable with a feed-forward neural network. This VAE criterion may be too strict; a recognition network which places only a small fraction (e.g. 20%) of its samples in the region of high posterior probability may still be sufficient for performing accurate inference. If we lower our standards in this way, this may give us additional flexibility to train a generative network whose posterior distributions do not fit the VAE assumptions. This is the motivation behind our proposed algorithm, the Importance Weighted Autoencoder (IWAE).
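Our IWAE uses the same architecture as the VAE, with both a generative network and a recognition network. The difference is that it is trained to maximize a different lower bound on $\log p(x)$. In particular, we use the following lower bound, corresponding to the $k$-sample importance weighting estimate of the log-likelihood:
$$\mathcal{L}_k(x) = \mathbb{E}_{h_1, \ldots, h_k \sim q(h \mid x)}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)}\right]. \tag{8}$$
Here, $h_1, \ldots, h_k$ are sampled independently from the recognition model. The term inside the sum corresponds to the unnormalized importance weights for the joint distribution, which we will denote as $w_i = p(x, h_i) / q(h_i \mid x)$.
This is a lower bound on the marginal log-likelihood, as follows from Jensen's inequality and the fact that the average importance weights are an unbiased estimator of $p(x)$:
$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \le \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} w_i\right] = \log p(x), \tag{9}$$
where the expectations are with respect to $q(h \mid x)$.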
It is perhaps unintuitive that importance weighting would be a reasonable estimator in high dimensions. Observe, however, that the special case of $k = 1$ is equivalent to the standard VAE objective shown in Eqn. 3. Using more samples can only improve the tightness of the bound:
Theorem 1. For all $k$, the lower bounds satisfy
$$\log p(x) \ge \mathcal{L}_{k+1} \ge \mathcal{L}_k. \tag{10}$$
Moreover, if $p(h, x) / q(h \mid x)$ is bounded, then $\mathcal{L}_k$ approaches $\log p(x)$ as $k$ goes to infinity.
Proof. See Appendix A. ∎
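The monotone tightening of the bound can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not the paper's experiment: it uses a hypothetical one-dimensional model with prior h ~ N(0, 1) and likelihood x | h ~ N(h, 1), so that log p(x) is available in closed form, and a deliberately mismatched Gaussian proposal. All function names are made up for this example.

```python
import math
import random


def log_normal(v, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (v - mean) ** 2 / var)


def log_weight(x, h, q_mean, q_var):
    """log w = log p(h) + log p(x | h) - log q(h | x) for the toy model."""
    return log_normal(h, 0.0, 1.0) + log_normal(x, h, 1.0) - log_normal(h, q_mean, q_var)


def log_mean_exp(values):
    """Numerically stable log of the average of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def estimate_bound(x, k, q_mean, q_var, n_outer, rng):
    """Monte Carlo estimate of the k-sample bound: average of log-mean-weights."""
    total = 0.0
    for _ in range(n_outer):
        hs = [rng.gauss(q_mean, math.sqrt(q_var)) for _ in range(k)]
        total += log_mean_exp([log_weight(x, h, q_mean, q_var) for h in hs])
    return total / n_outer


rng = random.Random(0)
x = 1.5
q_mean, q_var = 0.3 * x, 1.0          # mismatched proposal (true posterior is N(x/2, 1/2))
log_px = log_normal(x, 0.0, 2.0)      # exact marginal: x ~ N(0, 2)
bounds = {k: estimate_bound(x, k, q_mean, q_var, 10_000, rng) for k in (1, 5, 50)}
```

In this toy setting the estimates in `bounds` increase with k and approach `log_px` from below, and the k = 1 value sits below `log_px` by the KL divergence between the proposal and the true posterior.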
The bound can be estimated using the straightforward Monte Carlo estimator, where we generate samples from the recognition network and average the importance weights. One might worry about the variance of this estimator, since importance weighting famously suffers from extremely high variance in cases where the proposal and target distributions are not a good match. However, as our estimator is based on the log of the average importance weights, it does not suffer from high variance. This argument is made more precise in Appendix B.
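To train an IWAE with a stochastic gradient based optimizer, we use an unbiased estimate of the gradient of $\mathcal{L}_k$, defined in Eqn. 8. As with the VAE, we use the reparameterization trick to derive a low-variance update rule:
$$\nabla_\theta \mathcal{L}_k(x) = \nabla_\theta\, \mathbb{E}_{h_1, \ldots, h_k}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] = \nabla_\theta\, \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\log \frac{1}{k} \sum_{i=1}^{k} w(x, h(\epsilon_i, x, \theta), \theta)\right] \tag{11}$$
$$= \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\nabla_\theta \log \frac{1}{k} \sum_{i=1}^{k} w(x, h(\epsilon_i, x, \theta), \theta)\right] \tag{12}$$
$$= \mathbb{E}_{\epsilon_1, \ldots, \epsilon_k}\left[\sum_{i=1}^{k} \tilde{w}_i\, \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta)\right], \tag{13}$$
where $\epsilon_1, \ldots, \epsilon_k$ are the same auxiliary variables as defined in Section 2 for the VAE, $w_i = w(x, h(\epsilon_i, x, \theta), \theta)$ are the importance weights expressed as a deterministic function, and $\tilde{w}_i = w_i / \sum_{j=1}^{k} w_j$ are the normalized importance weights.
In the context of a gradient-based learning algorithm, we draw $k$ samples from the recognition network (or, equivalently, $k$ sets of auxiliary variables), and use the Monte Carlo estimate of Eqn. 13:
$$\sum_{i=1}^{k} \tilde{w}_i\, \nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta). \tag{14}$$
In the special case of $k = 1$, the single normalized weight $\tilde{w}_1$ takes the value 1, and one obtains the VAE update rule.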
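The weighted combination in this update can be sketched in a few lines of Python. This is a minimal illustration, assuming the log-weights and the per-sample gradients of the log-weights have already been computed elsewhere (for example by backpropagation); the function names and the list-of-lists gradient layout are hypothetical.

```python
import math


def normalized_weights(log_w):
    """Return w_i / sum_j w_j, computed stably from log-weights."""
    m = max(log_w)                          # subtract the max to avoid underflow
    unnorm = [math.exp(v - m) for v in log_w]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def weighted_gradient(log_w, per_sample_grads):
    """Average per-sample gradients of log w_i using the normalized weights.

    per_sample_grads[i] is the gradient vector of log w_i with respect to
    the parameters, given as a list of floats.
    """
    w_tilde = normalized_weights(log_w)
    dim = len(per_sample_grads[0])
    return [
        sum(w * g[d] for w, g in zip(w_tilde, per_sample_grads))
        for d in range(dim)
    ]


# With a single sample the normalized weight is 1, so the update reduces
# to the plain gradient of the single log-weight.
single = weighted_gradient([-3.2], [[0.5, -1.0]])
```

Working in log space matters in practice because the raw weights are ratios of densities over many dimensions and can easily underflow.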
We unpack this update because it does not quite parallel that of the standard VAE. (Kingma & Welling (2014) separated out the KL divergence in the bound of Eqn. 3 in order to achieve a simpler and lower-variance update. Unfortunately, no analogous trick applies for $k > 1$. In principle, the IWAE updates may be higher variance for this reason. However, in our experiments, we observed that the performance of the two update rules was indistinguishable in the case of $k = 1$.) The gradient of the log weights decomposes as:
$$\nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta) = \nabla_\theta \log p(x, h(\epsilon_i, x, \theta) \mid \theta) - \nabla_\theta \log q(h(\epsilon_i, x, \theta) \mid x, \theta). \tag{15}$$
The first term encourages the generative model to assign high probability to each $h^\ell$ given $h^{\ell+1}$ (following the convention that $x = h^0$). It also encourages the recognition network to adjust the hidden representations so that the generative network makes better predictions. In the case of a single stochastic layer (i.e. $L = 1$), the combination of these two effects is equivalent to backpropagation in a stochastic autoencoder. The second term of this update encourages the recognition network to have a spread-out distribution over predictions. This update is averaged over the samples with weight proportional to the importance weights, motivating the name "importance weighted autoencoder."
The dominant computational cost in IWAE training is computing the activations and parameter gradients needed for $\nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta)$. This corresponds to the forward and backward passes in backpropagation. In the basic IWAE implementation, both passes must be done independently for each of the $k$ samples. Therefore, the number of operations scales linearly with $k$. In our GPU-based implementation, the samples are processed in parallel by replicating each training example $k$ times within a mini-batch.
One can greatly reduce the computational cost by adding another form of stochasticity. Specifically, only the forward pass is needed to compute the importance weights. The sum in Eqn. 14 can be stochastically approximated by choosing a single sample $\epsilon_i$ with probability proportional to its normalized weight $\tilde{w}_i$ and then computing $\nabla_\theta \log w(x, h(\epsilon_i, x, \theta), \theta)$. This method requires $k$ forward passes and one backward pass per training example. Since the backward pass requires roughly twice as many add-multiply operations as the forward pass, for large $k$, this trick reduces the number of add-multiply operations by roughly a factor of 3. This comes at the cost of increased variance in the updates, but empirically we have found the tradeoff to be favorable.
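The two ingredients of this cost reduction, choosing one sample in proportion to its normalized weight and the resulting operation count, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the cost model simply takes a forward pass as one unit and a backward pass as two units, as stated in the text.

```python
import random


def sample_index(w_tilde, rng):
    """Pick index i with probability w_tilde[i]; the weights must sum to 1."""
    u = rng.random()
    cumulative = 0.0
    for i, w in enumerate(w_tilde):
        cumulative += w
        if u < cumulative:
            return i
    return len(w_tilde) - 1  # guard against floating-point round-off


def cost_ratio(k, backward_cost=2.0):
    """Cost of k forward + k backward passes relative to k forward + 1 backward pass."""
    full = k * (1.0 + backward_cost)
    reduced = k * 1.0 + backward_cost
    return full / reduced


rng = random.Random(0)
weights = [0.1, 0.2, 0.7]
counts = [0, 0, 0]
for _ in range(20_000):
    counts[sample_index(weights, rng)] += 1
# counts / 20000 should be close to the normalized weights.
```

Under this cost model the ratio is (3k) / (k + 2), which equals 1 at k = 1 and tends to 3 as k grows, matching the factor quoted above.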
There are several broad families of approaches to training deep generative models. Some models are defined in terms of Boltzmann distributions (Smolensky, 1986; Salakhutdinov & Hinton, 2009). This has the advantage that many of the conditional distributions are tractable, but the inability to sample from the model or compute the partition function has been a major roadblock (Salakhutdinov & Murray, 2008). Other models are defined in terms of belief networks (Neal, 1992; Gregor et al., 2014). These models are tractable to sample from, but the conditional distributions become tangled due to the explaining away effect.
One strategy for dealing with intractable posterior inference is to train a recognition network which approximates the posterior. A classic approach was the wake-sleep algorithm, used to train Helmholtz machines (Dayan et al., 1995). The generative model was trained to model the conditionals inferred by the recognition net, and the recognition net was trained to explain synthetic data generated by the generative net. Unfortunately, wake-sleep trained the two networks on different objective functions. Deep autoregressive networks (DARN; Gregor et al., 2014) consisted of deep generative and recognition networks trained using a single variational lower bound. Neural variational inference and learning (NVIL; Mnih & Gregor, 2014) is another algorithm for training recognition networks which reduces stochasticity in the updates by training a third network to predict reward baselines in the context of the REINFORCE algorithm (Williams, 1992). Salakhutdinov & Larochelle (2010) used a recognition network to approximate the posterior distribution in deep Boltzmann machines.
Variational autoencoders (Kingma & Welling, 2014; Rezende et al., 2014), as described in detail in Section 2, are another combination of generative and recognition networks, trained with the same variational objective as DARN and NVIL. However, in place of REINFORCE, they reduce the variance of the updates through a clever reparameterization of the random choices. The reparameterization trick is also known as “backprop through a random number generator” (Williams, 1992).
One factor distinguishing VAEs from the other models described above is that the model is described in terms of a simple distribution followed by a deterministic mapping, rather than a sequence of stochastic choices. Similar architectures have been proposed which use different training objectives. Generative adversarial networks (Goodfellow et al., 2014) train a generative network and a recognition network which act in opposition: the recognition network attempts to distinguish between training examples and generated samples, and the generative model tries to generate samples which fool the recognition network. Maximum mean discrepancy (MMD) networks (Li et al., 2015; Dziugaite et al., 2015) attempt to generate samples which match a certain set of statistics of the training data. They can be viewed as a kind of adversarial net where the adversary simply looks at the set of pre-chosen statistics (Dziugaite et al., 2015). In contrast to VAEs, the training criteria for adversarial nets and MMD nets are not based on the data log-likelihood.
Other researchers have derived log-probability lower bounds by way of importance sampling. Tang & Salakhutdinov (2013) and Ba et al. (2015) avoided recognition networks entirely, instead performing inference using importance sampling from the prior. Gogate et al. (2007) presented a variety of graphical model inference algorithms based on importance weighting. Reweighted wake-sleep (RWS) of Bornschein & Bengio (2015) is another recognition network approach which combines the original wake-sleep algorithm with updates to the generative network equivalent to gradient ascent on our bound $\mathcal{L}_k$. However, Bornschein & Bengio (2015) interpret this update as following a biased estimate of $\nabla_\theta \log p(x)$, whereas we interpret it as following an unbiased estimate of $\nabla_\theta \mathcal{L}_k$. The IWAE also differs from RWS in that the generative and recognition networks are trained to maximize a single objective, $\mathcal{L}_k$. By contrast, the $q$-wake and sleep steps of RWS do not appear to be related to $\mathcal{L}_k$. Finally, the IWAE differs from RWS in that it makes use of the reparameterization trick.
Apart from our approach of using multiple approximate posterior samples, another way to improve the flexibility of posterior inference is to use a more sophisticated algorithm than importance sampling. Examples of this approach include normalizing flows (Rezende & Mohamed, 2015) and the Hamiltonian variational approximation of Salimans et al. (2015).
We have compared the generative performance of the VAE and IWAE in terms of their held-out log-likelihoods on two density estimation benchmark datasets. We have further investigated a particular issue we have observed with VAEs and IWAEs, namely that they learn latent spaces of significantly lower dimensionality than the modeling capacity they are allowed. We tested whether the IWAE training method ameliorates this effect.
We evaluated the models on two benchmark datasets: MNIST, a dataset of images of handwritten digits (LeCun et al., 1998), and Omniglot, a dataset of handwritten characters in a variety of world alphabets (Lake et al., 2013). In both cases, the observations were binarized images. (Unfortunately, the generative modeling literature is inconsistent about the method of binarization, and different choices can lead to considerably different log-likelihood values. We follow the procedure of Salakhutdinov & Murray (2008): the binary-valued observations are sampled with expectations equal to the real values in the training set. See Appendix D for an alternative binarization scheme.) We used the standard splits of MNIST into 60,000 training and 10,000 test examples, and of Omniglot into 24,345 training and 8,070 test examples.
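The stochastic binarization described here, sampling each binary pixel with expectation equal to its real-valued intensity, can be sketched as below. This is a minimal illustration and not the authors' preprocessing code: it assumes intensities are already scaled to [0, 1] and stored as a flat list, and the function name is hypothetical.

```python
import random


def binarize(image, rng):
    """Sample a binary image whose pixel expectations equal the intensities in [0, 1]."""
    return [1 if rng.random() < p else 0 for p in image]


rng = random.Random(0)
gray = [0.0, 0.25, 0.5, 1.0]
samples = [binarize(gray, rng) for _ in range(20_000)]
# Empirical mean of each pixel across samples; should be close to `gray`.
pixel_means = [sum(s[i] for s in samples) / len(samples) for i in range(len(gray))]
```

Whether a fresh binarization is drawn once or at every pass over the data is a separate design choice that this sketch does not settle.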
We trained models with two architectures:
An architecture with a single stochastic layer with 50 units. In between the observations and the stochastic layer were two deterministic layers, each with 200 units.
An architecture with two stochastic layers $h^1$ and $h^2$, with 100 and 50 units, respectively. In between $x$ and $h^1$ were two deterministic layers with 200 units each. In between $h^1$ and $h^2$ were two deterministic layers with 100 units each.
All deterministic hidden units used the tanh nonlinearity. All stochastic layers used Gaussian distributions with diagonal covariance, with the exception of the visible layer, which used Bernoulli distributions. An exp nonlinearity was applied to the predicted variances of the Gaussian distributions. The network architectures are summarized in Appendix C.
All models were initialized with the heuristic of Glorot & Bengio (2010). For optimization, we used Adam (Kingma & Ba, 2015) with parameters and minibatches of size . The training proceeded for passes over the data with learning rate of for (for a total of passes over the data). This learning rate schedule was chosen based on preliminary experiments training a VAE with one stochastic layer on MNIST.
For each number of samples $k$ we trained a VAE with the gradient of $\mathcal{L}(x)$ estimated as in Eqn. 7 and an IWAE with the gradient estimated as in Eqn. 14. For each $k$, the VAE and the IWAE were trained for approximately the same length of time.
All log-likelihood values were estimated as the mean of on the test set. Hence, the reported values are stochastic lower bounds on the true value, but are likely to be more accurate than the lower bounds used for training.
The log-likelihood results are reported in Table 1. Our VAE results are comparable to those previously reported in the literature. We observe that training a VAE with $k > 1$ helped only slightly. By contrast, using multiple samples improved the IWAE results considerably on both datasets. Note that the two algorithms are identical for $k = 1$, so the results ought to match up to random variability.
On MNIST, the IWAE with two stochastic layers and $k = 50$ achieves a log-likelihood of -82.90 nats on the permutation-invariant model on this dataset. By comparison, deep belief networks achieved a log-likelihood of approximately -84.55 nats (Murray & Salakhutdinov, 2009), and deep autoregressive networks achieved a log-likelihood of -84.13 nats (Gregor et al., 2014). Gregor et al. (2015), who exploited spatial structure, achieved a log-likelihood of -80.97 nats. We did not find overfitting to be a serious issue for either the VAE or the IWAE: in both cases, the training log-likelihood was 0.62 to 0.79 nats higher than the test log-likelihood. We present samples from our models in Appendix E.
For the Omniglot dataset, the best performing IWAE has a log-likelihood of -103.38 nats, which is slightly worse than the log-likelihood of -100.46 nats achieved by a Restricted Boltzmann Machine with 500 hidden units trained with persistent contrastive divergence (Burda et al., 2015). RBMs trained with centering or FANG methods achieve a similar performance of around -100 nats (Grosse & Salakhutdinov, 2015). The training log-likelihood for the models we trained was 2.39 to 2.65 nats higher than the test log-likelihood.
Table 1. Columns: # stoch. layers, followed by NLL and active units for each of four conditions.
We have observed that both VAEs and IWAEs tend to learn latent representations with effective dimensions far below their capacity. Our next set of experiments aimed to quantify this effect and determine whether the IWAE objective ameliorates this effect.
If a latent dimension encodes useful information about the data, we would expect its distribution to change depending on the observations. Based on this intuition, we measured activity of a latent dimension $u$ using the statistic $A_u = \mathrm{Cov}_x\left(\mathbb{E}_{u \sim q(u \mid x)}[u]\right)$. We defined the dimension $u$ to be active if $A_u > 10^{-2}$. We have observed two pieces of evidence that this criterion is both well-defined and meaningful:
The distribution of $A_u$ for a trained model consisted of two widely separated modes, as shown in Appendix C.
To confirm that the inactive dimensions were indeed insignificant to the predictions, we evaluated all models with the inactive dimensions removed. In all cases, this changed the test log-likelihood by less than nats.
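A per-dimension activity measurement of this kind can be sketched in plain Python. The sketch assumes the statistic is the variance, across observations, of each latent unit's posterior mean under the recognition network, and it takes the threshold as an explicit argument rather than fixing a value; both the function names and the input layout are hypothetical.

```python
def activity(posterior_means):
    """Variance across observations of each latent dimension's posterior mean.

    posterior_means[n][u] is the recognition-network mean of unit u for example n.
    """
    n = len(posterior_means)
    dims = len(posterior_means[0])
    stats = []
    for u in range(dims):
        column = [row[u] for row in posterior_means]
        mean = sum(column) / n
        stats.append(sum((v - mean) ** 2 for v in column) / n)
    return stats


def active_units(posterior_means, threshold):
    """Indices of latent dimensions whose activity exceeds the threshold."""
    return [u for u, a in enumerate(activity(posterior_means)) if a > threshold]


# Toy input: unit 0 barely moves with the observation, unit 1 changes a lot.
means = [[0.0, 1.0], [0.001, -1.0], [0.0, 1.0], [-0.001, -1.0]]
stats = activity(means)
active = active_units(means, threshold=1e-2)
```

On this toy input only the second unit is reported as active, since the first unit's posterior mean is nearly constant across observations.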
In Table 1, we report the numbers of active units for all conditions. In all conditions, the number of active dimensions was far less than the total number of dimensions. Adding more latent dimensions did not increase the number of active dimensions. Interestingly, in the two-layer models, the second layer used very little of its modeling capacity: the number of active dimensions was always less than 10. In all cases with $k > 1$, the IWAE learned more latent dimensions than the VAE. Since this coincided with higher log-likelihood values, we speculate that a larger number of active dimensions reflects a richer latent representation.
Superficially, the phenomenon of inactive dimensions appears similar to the problem of “units dying out” in neural networks and latent variable models, an effect which is often ascribed to difficulties in optimization. For example, if a unit is inactive, it may never receive a meaningful gradient signal because of a plateau in the optimization landscape. In such cases, the problem may be avoided through a better initialization. To determine whether the inactive units resulted from an optimization issue or a modeling issue, we took the best-performing VAE and IWAE models from Table 1, and continued training the VAE model using the IWAE objective and vice versa. In both cases, the model was trained for an additional passes over the data with a learning rate of .
The results are shown in Table 2. We found that continuing to train the VAE with the IWAE objective increased the number of active dimensions and the test log-likelihood, while continuing to train the IWAE with the VAE objective did the opposite. The fact that training with the VAE objective actively reduces both the number of active dimensions and the log-likelihood strongly suggests that inactivation of the latent dimensions is driven by the objective functions rather than by optimization issues. On the other hand, optimization also appears to play a role, as the results in Table 2 are not quite identical to those in Table 1.
Table 2. Columns: first stage (trained as, NLL, active units) and second stage (trained as, NLL, active units).
In this paper, we presented the importance weighted autoencoder, a variant on the VAE trained by maximizing a tighter log-likelihood lower bound derived from importance weighting. We showed empirically that IWAEs learn richer latent representations and achieve better generative performance than VAEs with equivalent architectures and training time. We believe this method may improve the flexibility of other generative models currently trained with the VAE objective.
This research was supported by NSERC, the Fields Institute, and Samsung.
International Conference on Machine Learning, 2014.
DRAW: A recurrent neural network for image generation. In International Conference on Machine Learning, pp. 1462–1471, 2015.
Generative moment matching networks. In International Conference on Machine Learning, pp. 1718–1727, 2015.
Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8:229–256, 1992.
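Proof of Theorem 1. We need to show the following facts about the log-likelihood lower bound $\mathcal{L}_k$:
1. $\log p(x) \ge \mathcal{L}_k$,
2. $\mathcal{L}_k \ge \mathcal{L}_m$ for $k \ge m$,
3. $\log p(x) = \lim_{k \to \infty} \mathcal{L}_k$, assuming $p(h, x) / q(h \mid x)$ is bounded.
We prove each in turn:
1. It follows from Jensen's inequality that
$$\mathcal{L}_k = \mathbb{E}\left[\log \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)}\right] \le \log \mathbb{E}\left[\frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)}\right] = \log p(x). \tag{16}$$
2. Let $I \subset \{1, \ldots, k\}$ with $|I| = m$ be a uniformly distributed subset of distinct indices from $\{1, \ldots, k\}$. We will use the following simple observation: $\mathbb{E}_{I = \{i_1, \ldots, i_m\}}\left[\frac{a_{i_1} + \cdots + a_{i_m}}{m}\right] = \frac{a_1 + \cdots + a_k}{k}$ for any sequence of numbers $a_1, \ldots, a_k$.
Using this observation and Jensen's inequality, we get
$$\mathcal{L}_k = \mathbb{E}_{h_1, \ldots, h_k}\left[\log \frac{1}{k} \sum_{i=1}^{k} w_i\right] \tag{17}$$
$$= \mathbb{E}_{h_1, \ldots, h_k}\left[\log \mathbb{E}_{I = \{i_1, \ldots, i_m\}}\left[\frac{1}{m} \sum_{j=1}^{m} w_{i_j}\right]\right] \tag{18}$$
$$\ge \mathbb{E}_{h_1, \ldots, h_k}\left[\mathbb{E}_{I = \{i_1, \ldots, i_m\}}\left[\log \frac{1}{m} \sum_{j=1}^{m} w_{i_j}\right]\right] \tag{19}$$
$$= \mathbb{E}_{h_1, \ldots, h_m}\left[\log \frac{1}{m} \sum_{i=1}^{m} w_i\right] = \mathcal{L}_m. \tag{20}$$
3. Consider the random variable $M_k = \frac{1}{k} \sum_{i=1}^{k} \frac{p(x, h_i)}{q(h_i \mid x)}$. If $p(h, x) / q(h \mid x)$ is bounded, then it follows from the strong law of large numbers that $M_k$ converges to $\mathbb{E}_{q(h_i \mid x)}\left[\frac{p(x, h_i)}{q(h_i \mid x)}\right] = p(x)$ almost surely. Hence $\mathcal{L}_k = \mathbb{E}[\log M_k]$ converges to $\log p(x)$ as $k \to \infty$. ∎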
It is well known that the variance of an unnormalized importance sampling based estimator can be extremely large, or even infinite, if the proposal distribution is not well matched to the target distribution. Here we argue that the Monte Carlo estimator of $\mathcal{L}_k$, described in Section 3, does not suffer from large variance. More precisely, we bound the mean absolute deviation (MAD). While this does not directly bound the variance, it would be surprising if an estimator had small MAD yet extremely large variance.
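Suppose we have a strictly positive unbiased estimator $\hat{Z}$ of a positive quantity $Z$, and we wish to use $\log \hat{Z}$ as an estimator of $\log Z$. By Jensen's inequality, this is a biased estimator, i.e. $\mathbb{E}[\log \hat{Z}] \le \log Z$. Denote the bias as $\delta = \log Z - \mathbb{E}[\log \hat{Z}]$. We start with the observation that $\log \hat{Z}$ is unlikely to overestimate $\log Z$ by very much, as can be shown with Markov's inequality:
$$\Pr(\log \hat{Z} > \log Z + b) \le e^{-b}. \tag{21}$$
Let $(X)_+$ denote $\max(0, X)$. We now use the above facts to bound the MAD:
$$\mathbb{E}\left[\left|\log \hat{Z} - \mathbb{E}[\log \hat{Z}]\right|\right] = 2\, \mathbb{E}\left[\left(\log \hat{Z} - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{22}$$
$$= 2\, \mathbb{E}\left[\left(\log \hat{Z} - \log Z + \log Z - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{23}$$
$$\le 2\, \mathbb{E}\left[\left(\log \hat{Z} - \log Z\right)_+ + \left(\log Z - \mathbb{E}[\log \hat{Z}]\right)_+\right] \tag{24}$$
$$= 2\, \mathbb{E}\left[\left(\log \hat{Z} - \log Z\right)_+\right] + 2\delta \tag{25}$$
$$= 2 \int_0^\infty \Pr\left(\log \hat{Z} - \log Z > t\right) dt + 2\delta \tag{26}$$
$$\le 2 \int_0^\infty e^{-t}\, dt + 2\delta \tag{27}$$
$$= 2 + 2\delta. \tag{28}$$
Here, (22) is a general formula for the MAD, (26) uses the formula $\mathbb{E}[Y] = \int_0^\infty \Pr(Y > t)\, dt$ for a nonnegative random variable $Y$, and (27) applies the bound (21). Hence, the MAD is bounded by $2 + 2\delta$. In the context of IWAE, $\delta$ corresponds to the gap between $\mathcal{L}_k$ and $\log p(x)$.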
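The Markov's inequality step used in this argument is short enough to spell out. The following sketch assumes $\hat{Z}$ denotes the strictly positive unbiased estimator of $Z$, so that $\mathbb{E}[\hat{Z}] = Z$, and $b$ is any real number.

```latex
\begin{aligned}
\Pr\bigl(\log \hat{Z} > \log Z + b\bigr)
  &= \Pr\bigl(\hat{Z} > e^{b} Z\bigr)
     && \text{exponentiating both sides} \\
  &\le \frac{\mathbb{E}[\hat{Z}]}{e^{b} Z}
     && \text{Markov's inequality, since } \hat{Z} > 0 \\
  &= e^{-b}
     && \text{unbiasedness, } \mathbb{E}[\hat{Z}] = Z .
\end{aligned}
```

The probability of overestimating the log by more than $b$ nats therefore decays exponentially in $b$, regardless of how poorly the proposal matches the target.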
Here is a summary of the network architectures used in the experiments:
In Section 5.2, we defined the activity statistic $A_u$, and chose a threshold of $10^{-2}$ for determining if a unit is active. One justification for this is that the distribution of this statistic consisted of two widely separated modes in every case we looked at. Here is the histogram of $\log A_u$ for a VAE with one stochastic layer:
We show some examples of true and approximate posteriors for VAE and IWAE models trained with two latent dimensions. Heat maps show true posterior distributions for 6 training examples, and the pictures in the bottom row show the examples and their reconstruction from samples from $q(h \mid x)$. Left: VAE. Middle: IWAE, with $k = 5$. Right: IWAE, with $k = 50$. The IWAE prefers less regular posteriors and more spread out posterior predictions.
Several previous works have used a fixed binarization of the MNIST dataset defined by Larochelle (2011). We repeated our experiments training the models on the 50,000 examples from the training dataset, and evaluating them on the 10,000 examples from the test dataset. Otherwise we used the same training procedure and hyperparameters as in the experiments in the main part of the paper. The results in Table 3 indicate that the conclusions about the relative merits of VAEs and IWAEs are unchanged in the new experimental setup. In this setup we noticed significantly larger amounts of overfitting.
Table 3. Columns: # stoch. layers, followed by NLL and active units for each of two conditions.