XCoder_VAE_Conditional_Inference
Variational Autoencoders (VAEs) are a popular generative model, but one in which conditional inference can be challenging. If the decomposition into query and evidence variables is fixed, conditional VAEs provide an attractive solution. To support arbitrary queries, one is generally reduced to Markov Chain Monte Carlo sampling methods that can suffer from long mixing times. In this paper, we propose an idea we term crosscoding to approximate the distribution over the latent variables after conditioning on an evidence assignment to some subset of the variables. This allows generating query samples without retraining the full VAE. We experimentally evaluate three variations of crosscoding showing that (i) two can be quickly trained for different decompositions of evidence and query and (ii) they quantitatively and qualitatively outperform Hamiltonian Monte Carlo.
Variational Autoencoders (VAEs) (Kingma & Welling, 2014) are a popular deep generative model with numerous extensions including variations for planar flow (Rezende & Mohamed, 2015), inverse autoregressive flow (Kingma et al., 2016), importance weighting (Burda et al., 2016), ladder networks (Maaløe et al., 2016), and discrete latent spaces (Rolfe, 2017) to name just a few. Unfortunately, existing methods for conditional inference in VAEs are limited. Conditional VAEs (CVAEs) (Sohn et al., 2015) allow VAE training conditioned on a fixed decomposition of evidence and query, but are computationally impractical when varying queries are made. Alternatively, Markov Chain Monte Carlo methods such as Hamiltonian Monte Carlo (HMC) (Girolami & Calderhead, 2011; Daniel Levy, 2018) are difficult to adapt to these problems, and can suffer from long mixing times as we show empirically.
To remedy the limitations of existing methods for conditional inference in VAEs, we aim to approximate the distribution over the latent variables after conditioning on an evidence assignment through a variational Bayesian methodology. In doing this, we reuse the decoder of the VAE and show that the error of the distribution over query variables is controlled by that over latent variables via a fortuitous cancellation in the KL-divergence. This avoids the computational expense of retraining the decoder as done by the CVAE approach. We term the network that generates the conditional latent distribution the crosscoder.
We experiment with two crosscoding alternatives: Gaussian variational inference via a linear transform (GVI) and Normalizing Flows (NF). We also provide some comparison to a fully connected network (FCN), which suffers from some technical and computational issues but provides a useful point of reference for experimental comparison purposes. Overall, our results show that the GVI and NF variants of crosscoding can be optimized quickly for arbitrary decompositions of query and evidence and compare favorably against a ground truth provided by rejection sampling for low latent dimensionality. For high dimensionality, we observe that HMC often fails to mix despite our systematic efforts to tune its parameters and hence demonstrates poor performance compared to crosscoding in both quantitative and qualitative evaluation.
One way to define an expressive generative model is to introduce latent variables z. Variational AutoEncoders (VAEs) (Kingma & Welling, 2014) model p(z) as a simple fixed Gaussian distribution. Then, for real-valued x, p(x|z) is a Gaussian with mean determined by a “decoder” network:

p(x|z) = N(x; dec(z), σ²I).  (1)
If x is binary, a product of independent Bernoullis is parameterized by a sigmoidally transformed decoder. If the decoder network has high capacity, the marginal distribution p(x) can represent a wide range of distributions. In principle, one might wish to train such a model by (regularized) maximum likelihood. Unfortunately, the marginal p(x) = ∫ p(z) p(x|z) dz is intractable. However, a classic idea (Saul et al., 1996) is to use variational inference to lower-bound it. For any distributions p and q,
log p(x) = E_{q(z|x)}[ log p(x, z) − log q(z|x) ] + KL( q(z|x) ‖ p(z|x) ).  (2)
Since the KL-divergence is non-negative, the first term, the “evidence lower bound” (ELBO), lower-bounds log p(x). Thus, as a surrogate to maximizing the likelihood over p, one can maximize the ELBO over p and q simultaneously.
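The bound in Eq. 2 can be checked numerically on a toy discrete model. The two-variable model below, with arbitrary probabilities, is purely illustrative and not from the paper; it verifies that any q gives a lower bound, with equality exactly at the true posterior:

```python
import math

# Toy discrete model: z in {0,1}, x in {0,1}; probabilities chosen arbitrarily.
p_z = {0: 0.5, 1: 0.5}                      # prior p(z)
p_x_given_z = {0: {0: 0.9, 1: 0.1},         # p(x | z=0)
               1: {0: 0.2, 1: 0.8}}         # p(x | z=1)

def log_marginal(x):
    # log p(x) = log sum_z p(z) p(x|z)
    return math.log(sum(p_z[z] * p_x_given_z[z][x] for z in (0, 1)))

def elbo(x, q):
    # ELBO(q; x) = E_q[ log p(x, z) - log q(z) ]
    return sum(q[z] * (math.log(p_z[z] * p_x_given_z[z][x]) - math.log(q[z]))
               for z in (0, 1) if q[z] > 0)

x = 1
q = {0: 0.3, 1: 0.7}                        # an arbitrary variational distribution
assert elbo(x, q) <= log_marginal(x)        # ELBO never exceeds log p(x)

# The bound is tight exactly when q equals the true posterior p(z|x):
post = {z: p_z[z] * p_x_given_z[z][x] / math.exp(log_marginal(x)) for z in (0, 1)}
assert abs(elbo(x, post) - log_marginal(x)) < 1e-9
```

The gap between the two sides is precisely the KL term in Eq. 2, which is why maximizing the ELBO over q drives q(z|x) toward the true posterior.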
VAEs define p(x) as the marginal of p(z) p(x|z), where p(z) is simple and fixed, and define q(z|x) as a Gaussian with a mean and covariance both determined by an “encoder” network.
In this paper, we assume a VAE has been pretrained. Then, at test time, some arbitrary subset x_E of x is observed as evidence, and the goal is to predict the distribution of the non-observed query variables x_Q, where the decomposition x = (x_E, x_Q) is unpredictable. If this decomposition of x into evidence and query variables is fixed and known ahead of time, a natural solution is to train an explicit conditional model, the approach taken by Conditional Variational Autoencoders (Sohn et al., 2015). We focus on supporting arbitrary queries, where training a conditional model for each possible decomposition is infeasible.
We now turn to the details of conditional inference. We assume we have pretrained a VAE and now wish to approximate the distribution p(x_Q | x_E), where x_E is some new “test” evidence not known at VAE training time. Unfortunately, exact inference is difficult, since computing this probability exactly would require marginalizing out z. One helpful property comes from the fact that in a VAE, the conditional distribution over the output (Eq. 1) has a diagonal covariance, which leads to the following decomposition:
Observation 1. The distribution of a VAE can be factorized as p(x_E, x_Q, z) = p(z) p(x_E | z) p(x_Q | z).

Since x_E and x_Q are conditionally independent given z, the conditional of x_Q given x_E can be written as

p(x_Q | x_E) = ∫ p(z | x_E) p(x_Q | z) dz.  (3)
Here, p(x_Q | z) can easily be evaluated or simulated. However, p(z | x_E) is much more difficult to work with since it involves “inverting” the decoder. This factorization can also be exploited by Markov chain Monte Carlo (MCMC) methods, such as Hamiltonian Monte Carlo (HMC) (Girolami & Calderhead, 2011; Daniel Levy, 2018). In this case, it allows the Markov chain to be defined over z alone, rather than z and x_Q together. That is, one can use MCMC to attempt sampling from p(z | x_E), and then draw exact samples from p(x_Q | x_E) just by evaluating the decoder network at each of the samples of z. The experiments using MCMC in Section 4 use this strategy.
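The sample-z-then-decode strategy can be sketched concretely. The code below uses a simple random-walk Metropolis sampler (a stand-in for HMC) over z targeting p(z) p(x_E | z), then decodes each latent sample into the query dimension; the linear-Gaussian “decoder” W, noise scale sigma, and evidence x_E are all toy assumptions for illustration, not the paper's networks:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained decoder: x = W z + Gaussian noise.
W = np.array([[1.0, 0.5], [0.2, 1.0], [0.7, -0.3]])   # 3 outputs, 2 latents
sigma = 0.5
x_E = np.array([1.0, 0.5])          # observed values of the first two outputs
E = [0, 1]                          # evidence indices; index 2 is the query

def log_target(z):
    # log p(z) + log p(x_E | z), up to a constant -- all MCMC needs.
    mean = W @ z
    return -0.5 * z @ z - 0.5 * np.sum((x_E - mean[E]) ** 2) / sigma**2

# Random-walk Metropolis over z alone, exploiting the factorization in Eq. 3.
z = np.zeros(2)
samples = []
for _ in range(5000):
    prop = z + 0.3 * rng.standard_normal(2)
    if np.log(rng.random()) < log_target(prop) - log_target(z):
        z = prop
    samples.append(z)
samples = np.array(samples[1000:])   # drop burn-in

# Exact query samples given each z: just evaluate the decoder and add noise.
x_Q = (W @ samples.T).T[:, 2] + sigma * rng.standard_normal(len(samples))
print(x_Q.mean())                    # Monte Carlo estimate of E[x_Q | x_E]
```

Because the toy decoder is linear-Gaussian, the resulting conditional is tractable, which makes this a convenient sanity check; for a real VAE decoder only the MCMC route (or the crosscoding approach below) is available.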
The basic idea of variational inference (VI) is to posit some distribution q_θ, and optimize the parameters θ to make it match the target distribution as closely as possible. So, in principle, the goal of VI would be to minimize KL( q(x_Q) ‖ p(x_Q | x_E) ). For an arbitrary distribution q, this divergence would be difficult to work with due to the need to marginalize out z in p(x_Q | x_E), as in Eq. 3.
However, if q is chosen carefully, then the above divergence can be upper-bounded by one defined directly over z. Specifically, we will choose q so that the dependence of x_Q on z under q is the same as under p (both determined by the “decoder”).
Lemma 1. Suppose we choose q(x_Q, z) = q(z) p(x_Q | z). Then

KL( q(x_Q) ‖ p(x_Q | x_E) ) ≤ KL( q(z) ‖ p(z | x_E) ).  (4)
This is proven in the Appendix. The result follows from using the chain rule of KL-divergence (Cover & Thomas, 2006) to bound the divergence over x_Q by the divergence jointly over x_Q and z. Then the common factor p(x_Q | z) in q and p means this simplifies into a divergence over z alone.

Given this Lemma, it makes sense to seek a distribution q(z) such that the divergence on the right-hand side of Eq. 4 is as low as possible. To minimize this divergence, consider the decomposition
log p(x_E) = E_{q(z)}[ log p(x_E, z) − log q(z) ] + KL( q(z) ‖ p(z | x_E) ),  (5)
which is analogous to Eq. 2. Here, we call the first term the “conditional ELBO” (CELBO) to reflect that maximizing it is equivalent to minimizing an upper bound on KL( q(x_Q) ‖ p(x_Q | x_E) ).
The previous section says that we should seek a distribution q(z) to approximate p(z | x_E). Although the latent distribution p(z) may be simple, the conditional distribution p(z | x_E) is typically complex and often multimodal (cf. Fig. 3).
To define a variational distribution satisfying the conditions of Lemma 1, we propose to draw w from some fixed base density q(w) and then use a network XCoder_θ with parameters θ to map w to the latent space, so that the marginal q(z) is expressive. The conditional of x_Q given z is exactly p(x_Q | z), as in Lemma 1. The full variational distribution is therefore

q(w, z, x_Q) = q(w) δ( z − XCoder_θ(w) ) p(x_Q | z),  (6)
where δ is a multivariate delta function. We call this network a “CrossCoder” to emphasize that the parameters θ are fit so that q(z) matches p(z | x_E), and so that z, when “decoded” using p(x_Q | z), will predict x_Q given x_E.
Informally, this result can be proven as follows: the CELBO was defined on z alone, while our definition of q in Eq. 6 also involves w and x_Q. Marginalizing out x_Q is trivial. Then, since z and w are deterministically related under q, one can change variables to convert the expectation over z to one over w, leaving the log-determinant of the Jacobian as an artifact of the entropy of q(z).
This objective is related to the “triple ELBO” used by Vedantam et al. (2017) for a situation with a small number of fixed decompositions of x into (x_E, x_Q). Algorithmically, the approaches are quite different since they pretrain a single network for each subset of x, which can be used for any evidence with that pattern, and a further product-of-experts approximation is used for novel missing features at test time. We assume arbitrary queries, so pretraining is inapplicable and novel missing features pose no issue. Still, our bounding justification may provide additional insight for their approach.
We explore the following two candidate CrossCoders.
Gaussian Variational Inference (GVI): The GVI XCoder linearly warps a spherical Gaussian over w into an arbitrary Gaussian over z:

XCoder_θ(w) = A w + b,  (7)

where θ = (A, b) for a square matrix A and a mean vector b, so that q(z) = N(b, A Aᵀ). While projected gradient descent can be used to maintain invertibility of A, we did not encounter issues with a non-invertible A requiring projection during our experiments.

Normalizing Flows (NF): A normalizing flow (Rezende & Mohamed, 2015) projects a probability density through a sequence of easily computable and invertible mappings. By stacking multiple mappings, the transformation can be complex. We use the specially structured network called a Planar Normalizing Flow:
f_i(z) = z + u_i h( w_iᵀ z + b_i )  (8)

for all i ∈ {1, …, K}, where h is a smooth nonlinearity, i is the layer id, u_i and w_i are vectors, b_i is a scalar, and the output dimension is exactly the same as the input dimension. Using ∘ for function composition, the full XCoder is given as

XCoder_θ(w) = ( f_K ∘ ⋯ ∘ f_1 )(w).  (9)
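A minimal sketch of a stack of planar layers and the accumulated log-determinant needed for the entropy term, assuming h = tanh (for which the per-layer log-determinant is log|1 + uᵀψ(z)| with ψ(z) = h′(wᵀz + b) w, as in Rezende & Mohamed, 2015); the dimensions, layer count, and small random parameters are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
D, K = 2, 4                                  # latent dim and number of layers (illustrative)

# One planar layer is f(z) = z + u * tanh(w . z + b); parameters (u, w, b) per layer.
layers = [(0.1 * rng.standard_normal(D),     # u
           0.1 * rng.standard_normal(D),     # w
           0.0)                              # b
          for _ in range(K)]

def flow(z0):
    """Push a base sample through the stacked layers, accumulating log|det J|."""
    z, log_det = z0, 0.0
    for u, w, b in layers:
        a = w @ z + b
        z = z + u * np.tanh(a)
        psi = (1.0 - np.tanh(a) ** 2) * w    # psi(z) = h'(a) w
        log_det += np.log(np.abs(1.0 + u @ psi))
    return z, log_det

z0 = rng.standard_normal(D)                  # sample from the spherical base density
z, log_det = flow(z0)
# Change of variables: log q(z) = log q0(z0) - log|det J|
log_q = -0.5 * (z0 @ z0 + D * np.log(2 * np.pi)) - log_det
```

Each layer's Jacobian is a rank-one update of the identity, so its determinant costs O(D) rather than O(D³), which is what makes stacking many layers affordable.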
The bound in Theorem 2 requires that XCoder_θ is invertible. Nevertheless, we find Fully Connected Networks (FCNs) useful for comparison in low-dimensional visualizations. Here, the Jacobian must be calculated using separate gradient calls for each output variable, and the lack of invertibility prevents the CELBO bound from being correct.
We summarize our approach in Algorithm 1. In brief, we define a variational distribution q and optimize θ so that q(z) is close to p(z | x_E). The variational distribution includes a “CrossCoder” z = XCoder_θ(w). The algorithm uses stochastic gradient descent on the CELBO, with gradients estimated using Monte Carlo samples of w and the reparameterization trick (Kingma & Welling, 2014; Titsias & Lázaro-Gredilla, 2014; Rezende et al., 2014). After inference, the original VAE distribution p(x_Q | z) gives samples over the query variables.
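The optimization loop can be sketched for the GVI crosscoder on a toy linear-Gaussian decoder, where the reparameterization z = Aε + b makes the Monte Carlo CELBO gradients available by hand (the entropy term contributes log|det A|, whose gradient is A⁻ᵀ). W_E, sigma, x_E, and all hyperparameters below are assumptions for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy "pretrained decoder": x = W z + N(0, sigma^2 I); W_E are the evidence rows.
W_E = np.array([[1.0, 0.5], [0.2, 1.0]])
sigma = 0.5
x_E = np.array([1.0, 0.5])

def grad_log_p(z):
    # gradient of log p(x_E, z) = log N(z; 0, I) + log N(x_E; W_E z, sigma^2 I)
    return -z + W_E.T @ (x_E - W_E @ z) / sigma**2

# GVI crosscoder: z = A @ eps + b, i.e. q(z) = N(b, A A^T).
A, b = np.eye(2), np.zeros(2)
lr = 0.02
for _ in range(2000):
    eps = rng.standard_normal((32, 2))            # reparameterization samples
    z = eps @ A.T + b
    g = np.array([grad_log_p(zi) for zi in z])
    # Monte Carlo CELBO gradients; the entropy term adds A^{-T} to dA.
    dA = (g[:, :, None] * eps[:, None, :]).mean(axis=0) + np.linalg.inv(A).T
    db = g.mean(axis=0)
    A += lr * dA
    b += lr * db

# In this linear-Gaussian toy, b should approach the exact posterior mean.
prec = np.eye(2) + W_E.T @ W_E / sigma**2
exact_mean = np.linalg.solve(prec, W_E.T @ x_E / sigma**2)
print(b, exact_mean)
```

Because the toy model is linear-Gaussian, the learned b can be checked against the exact posterior mean, something that is not possible for a real VAE decoder, where one would rely on the CELBO itself as the training signal.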
Having defined our crosscoding methodology for conditional inference with pretrained VAEs, we now proceed to empirically evaluate our three previously defined XCoder instantiations and compare them with (Markov chain) Monte Carlo (MCMC) sampling approaches on three different pretrained VAEs. Below we discuss our datasets and methodology followed by our experimental results.
MNIST is the well-known benchmark handwritten digit dataset (LeCun & Cortes, 2010). We use a pretrained VAE (https://github.com/kvfrans/variationalautoencoder) with a fully connected encoder and decoder, each with one hidden layer of 64 ReLU units, a final sigmoid layer with Bernoulli likelihood, and 2 latent dimensions for z. The VAE has been trained on 60,000 black-and-white binary thresholded images. The limitation to 2 latent dimensions allows us to visualize the conditional latent distribution of all methods and compare to the ground truth through a fine-grained discretization of z.

Anime is a dataset of animated character faces (Jin et al., 2017). We use a pretrained VAE (URL suppressed for anonymous review) with a convolutional encoder and deconvolutional decoder, each with 4 layers. The decoder layers use filters of stride 2 and ReLU activations followed by batch-norm layers. The VAE has a final tanh layer with Gaussian likelihood, and 64 latent dimensions for z. The VAE has been trained on 20,000 RGB images.

CelebA (Liu et al., 2015) is a benchmark dataset of images of celebrity faces. We use a pretrained VAE (https://github.com/yzwxx/vaecelebA) with a structure that exactly matches the Anime VAE above, except that it uses 100 latent dimensions for z. The VAE has been trained on 200,000 RGB images.
For sampling approaches, we evaluate rejection sampling (RS), which is only feasible for our MNIST VAE with a 2dimensional latent embedding for . We also compare to the MCMC method of Hamiltonian Monte Carlo (HMC) (Girolami & Calderhead, 2011; Daniel Levy, 2018). Both sampling methods exploit the VAE decomposition and sampling methodology described in Section 3.1.
We went to great effort to tune the parameters of HMC. For MNIST, with low dimensions, this was generally feasible, with a few exceptions as noted in Figure 4(b). For the highdimensional latent space of the Anime and CelebA VAEs, finding parameters to achieve good mixing was often impossible, leading to poor performance. Section 6.4 of the Appendix discusses this in detail.
For the crosscoding methods, we use the three XCoder variants described in Section 3.3: Gaussian Variational Inference (GVI), Planar Normalizing Flow (NF), and a Fully Connected Neural Network (FCN). By definition, the output dimensionality of each XCoder must match the latent dimensionality of z for each pretrained VAE. Given evidence as described in the experiments, all crosscoders were trained as described in Algorithm 1. We could not train the FCN XCoder for conditional inference in Anime and CelebA due to the infeasibility of computing the Jacobian for the respective latent dimensionalities of these two VAEs.

In preliminary experiments, we considered the alternating sampling approach suggested by Rezende et al. (2014, Appendix F), but found it to perform very poorly when the evidence is ambiguous. We provide a thorough analysis of this in Section 6.3 of the Appendix, comparing results on MNIST with various fractions of the input taken as evidence. In summary, Rezende's alternation method produces reasonable results when a large fraction of pixels are observed, so the posterior is highly concentrated. When less than around 40% are observed, however, performance rapidly degrades.
We experiment with a variety of evidence sets to demonstrate the efficiency and flexibility of our crosscoding methodology for arbitrary conditional inference queries in pretrained VAEs. All crosscoding optimization and inference takes (typically well) under 32 seconds per evidence set for all experiments running on an Intel Xeon E5-1620 v4 CPU with 4 cores, 16 GB of RAM, and an NVIDIA GTX 1080 GPU. A detailed running time comparison is provided in Section 6.5 of the Appendix.
Qualitatively, we visually examine the 2D latent distribution of z conditioned on the evidence for the special case of MNIST, which has low enough latent dimensionality to enable us to obtain ground truth through discretization. For all experiments, we qualitatively assess sampled query images generated for each evidence set to judge both the coverage of the distribution and the quality of match between the query samples and the evidence, which is fixed in the displayed images.
Quantitatively, we evaluate the performance of the proposed framework and candidate inference methods through the following two metrics.
CELBO: As a comparative measure of inference quality for each of the XCoder methods, we provide pairwise scatterplots of the CELBO as defined in Eq. 5 for a variety of different evidence sets.
Query Marginal Likelihood: For each conditional inference evaluation, we randomly select an image, take a subset of that image as evidence, and use the remaining pixels as the ground-truth query assignment. Given this, we can evaluate the marginal likelihood of the query via Eq. 3, with a Monte Carlo estimate over latent samples:

p(x_Q | x_E) ≈ (1/N) Σ_{i=1}^{N} p(x_Q | z_i),  z_i ∼ q(z).
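A Monte Carlo query-marginal-likelihood estimate of this kind can be sketched as follows; the 3-pixel Bernoulli decoder V and the placeholder Gaussian q(z) below are made-up assumptions for illustration, not the paper's models:

```python
import numpy as np

rng = np.random.default_rng(3)

# Illustrative sigmoid Bernoulli decoder over 3 query pixels and 2 latents.
V = np.array([[1.5, -0.5], [0.3, 1.2], [-1.0, 0.8]])

def p_xq_given_z(x_q, z):
    # p(x_Q | z) = product of independent Bernoullis with sigmoid decoder outputs
    probs = 1.0 / (1.0 + np.exp(-(V @ z)))
    return np.prod(np.where(x_q == 1, probs, 1.0 - probs))

x_q = np.array([1, 0, 1])                     # ground-truth query assignment

# Suppose inference produced q(z) = N(mu, diag(s^2)); these values are placeholders.
mu, s = np.array([0.4, -0.2]), np.array([0.5, 0.7])
z_samples = mu + s * rng.standard_normal((500, 2))

# Estimate: p(x_Q | x_E) ~ (1/N) sum_i p(x_Q | z_i), with z_i drawn from q(z).
estimate = np.mean([p_xq_given_z(x_q, z) for z in z_samples])
```

The same estimator applies to MCMC baselines by replacing the q(z) samples with (post burn-in) chain samples, which is what makes the metric comparable across methods.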
For conditional inference in MNIST, we begin with Figure 2, which shows one example of conditional inference in the pretrained MNIST model using the different inference methods. While the original image used to generate the evidence represents a particular digit, the evidence is very sparse, allowing the plausible generation of other digits. It is easy to see that most of the methods can handle this simple conditional inference, with only GVI producing some samples that do not match the evidence well in this VAE with 2 latent dimensions.
To provide additional insight into Figure 2, we now turn to Figure 3, where we visually compare the true conditional latent distribution (leftmost) with the corresponding distributions of each of the inference methods. At a first glance, we note that the true distribution is both multimodal and nonGaussian. We see that GVI covers some mass not present in the true distribution that explains its relatively poor performance in Figure 2(b). All remaining methods (both XCoder and sampling) do a reasonable job of covering the irregular shape and mass of the true distribution.
We now proceed to a quantitative comparison of performance on MNIST over 50 randomly generated queries. In Figure 4(a), we present a pairwise comparison of the performance of each XCoder method on 50 randomly generated evidence sets. Noting that higher is better, we observe that FCN and NF perform comparably and generally outperform GVI. In Figure 4(b), we examine the Query Marginal Likelihood distribution for the same 50 evidence sets from (a), with each likelihood expectation generated from 500 samples. Again, noting that higher is better, here we see that RS slightly edges out all other methods, with all XCoders generally performing comparably. HMC performs worst here, where we remark that inadequate coverage of the latent space due to poor mixing leads to over-concentration of the sampled z, producing a long tail in a few cases with poor coverage. We will see that these issues with HMC mixing become much more pronounced as we move to experiments in VAEs with higher latent dimensionality in the next section.
Now we proceed to our larger VAEs for Anime and CelebA, with respective latent dimensionalities of 64 and 100, that allow us to work with larger and more visually complex RGB images. In these cases, FCN could not be applied due to the infeasibility of computing the Jacobian, and RS is also infeasible for such high dimensionality. Hence, we only compare the two XCoders GVI and NF with HMC.


We now continue to a qualitative and quantitative performance analysis of conditional inference for the Anime and CelebA VAEs. Qualitatively, in Figure 5 for Anime, we see that inference for both the NF XCoder and HMC show little identifiable variation and seem to have collapsed into a single latent mode. In contrast, GVI appears to show better coverage, generating a wide range of faces that generally match very well with the superimposed evidence. For Figure 6, HMC still performs poorly, but NF appears to perform much better, with both XCoders GVI and NF generating a wide range of faces that match the superimposed evidence, with perhaps slightly more face diversity for GVI.
Quantitatively, Figure 7 strongly reflects the qualitative visual observations above. In short for the XCoders, GVI solidly outperforms NF on the CELBO comparison. For all methods evaluated on Query Marginal Likelihood, GVI outperforms both NF and HMC on Anime, while for CelebA GVI performs comparably to (if not slightly worse) than NF, with both solidly outperforming HMC.
We introduced Crosscoding, a novel variational inference method for conditional queries in pretrained VAEs that does not require retraining the decoder. Using three VAEs pretrained on different datasets, we demonstrated that the Gaussian Variational Inference (GVI) and Normalizing Flows (NF) crosscoders generally outperform Hamiltonian Monte Carlo both qualitatively and quantitatively, thus providing a novel and efficient tool for conditional inference in VAEs with arbitrary queries.
To show this, we first note that the joint divergence over x_Q and z is equivalent to one over z only:
KL( q(x_Q, z) ‖ p(x_Q, z | x_E) )
 = KL( q(z) ‖ p(z | x_E) ) + E_{q(z)} KL( q(x_Q | z) ‖ p(x_Q | z, x_E) )   (by the chain rule of KL-divergence)
 = KL( q(z) ‖ p(z | x_E) )   (since q(x_Q | z) = p(x_Q | z) = p(x_Q | z, x_E) in both q and p).

Then, the result follows just from observing (again by the chain rule of KL-divergence) that

KL( q(x_Q) ‖ p(x_Q | x_E) ) ≤ KL( q(x_Q, z) ‖ p(x_Q, z | x_E) ). ∎
In this experiment, we do not use a VAE, but instead simply model a complex latent 2D multimodal distribution over as a Gaussian mixture model to evaluate the ability of each conditional inference method to accurately draw samples from this complex distribution. In general, Figure 8 shows that while the XCoders NF and FCN work well here, GVI (by definition) cannot model this multimodal distribution and HMC draws too few samples from the disconnected mode compared to the true sample distribution, indicating slight failure to mix well.
We compare to the alternating sampling approach of Rezende et al. (2014, Appendix F), which is essentially an approximation of block Gibbs sampling. We call it the “Rezende method” in the following. This method does not asymptotically sample from the conditional distribution, since the step sampling the latent variables given the query variables is approximated using the encoder.
Figure 9(a) shows one experiment comparing all candidate algorithms including the Rezende method. We noticed that it fails to generate images that match the evidence when less than 40% of pixels are observed as evidence, while it makes reasonable predictions when the observation rate is higher. Figure 9(b) shows this result is consistent over 50 randomly selected queries.
Comparison of different conditional inference methods, including the Rezende method, on the MNIST dataset. (a) One intuitive example: the first row shows the evidence observed, and the following rows show the mean of generated samples from the different algorithms. We note that with very high evidence, the posterior becomes extremely concentrated, meaning the rejection rates for rejection sampling become impractical. (b) The mean squared error between the query variables of the original image and the generated samples of the different algorithms. The results and standard deviations at each observation percentage come from 50 independent randomly selected queries.
While tuning HMC in lower dimensions was generally feasible for MNIST, with a few exceptions noted in the previous discussion of Figure 4(b), we observed that HMC becomes very difficult to tune in the Anime and CelebA VAEs with higher latent dimensionality. To illustrate these HMC tuning difficulties, we present a summary of our systematic efforts to tune HMC on Anime and CelebA in Figure 10, with boxplots of the acceptance rate distribution of HMC for 30 Markov chains vs. different step sizes on (a) Anime and (b) CelebA. We ran each Markov chain for 10,000 burn-in samples with 10 leapfrog steps per iteration; we tried 3 different standard leapfrog step settings, finding that 10 leapfrog steps provided the best performance across a range of step sizes and hence this setting was chosen for Figure 10.
In short, Figure 10 shows that only a very narrow band of step sizes leads to a reasonable acceptance rate for good mixing properties of HMC. Even then, however, the distribution of acceptance rates for any particular Markov chain at a good step size is still highly unpredictable, as given by the quartile ranges of the boxplots. In summary, we found that despite our systematic efforts to tune HMC for higher-dimensional problems, it was difficult to achieve a good mixing rate, which contributes to the generally poor performance observed for HMC on Anime and CelebA.
The running time of conditional inference with XCoding varies with the complexity of the XCoder, the optimization algorithm used, and the complexity of the pretrained decoder. We found that L-BFGS (Liu & Nocedal, 1989) consistently converged fastest and with the best results in comparison to SGD, Adam, Adadelta, and RMSProp. Table 1 shows the computation time for each of the three candidate XCoders (FCN is only applicable to MNIST) as well as HMC and Rejection Sampling (RS is only applicable to MNIST).

Period         MNIST                                 Anime                 CelebA
               GVI    NF     FCN    HMC     RS       GVI    NF     HMC     GVI     NF     HMC
Optimization   0.36   2.79   5.26   34.92   —        2.52   4.22   81.3    31.45   9.95   224.5
Prediction     2      0.19   2      2
To assess the quality of the pretrained VAE models, we show 100 samples from each in Figure 11.
In Figures 12 and 13, we show two additional examples of conditional inference matching the structure of experiments shown in Figures 5 and 6 in the main text. Overall, we observe the same general trends as discussed in the main text for Figures 5 and 6.