1 Introduction
Estimating posterior distributions is the primary focus of Bayesian inference, where we are interested in how our belief over the variables in a model changes after observing a set of data. Prediction also benefits from Bayesian inference, as every prediction is equipped with a confidence interval representing how certain the prediction is. Compared to the maximum a posteriori (MAP) estimator of the model parameters, which is a point estimator, the posterior distribution provides richer information about the model parameters and hence better justified predictions.
Among various inference algorithms for posterior estimation, variational inference (VI) (Blei et al., 2017) and Markov Chain Monte Carlo (MCMC) (Geyer, 1992) are the most widely used. It is well known that MCMC suffers from slow mixing, though asymptotically the chained samples approach the true posterior. Furthermore, for latent variable models (LVMs) (Wainwright et al., 2008), where each sampled data point is associated with a latent variable, the number of simulated Markov chains increases with the number of data points, making the computation too costly. VI, on the other hand, facilitates faster inference because it optimizes an explicit objective function, and its convergence can be measured and controlled. Hence, VI has been widely used in many Bayesian models, such as the mean-field approach for Latent Dirichlet Allocation (Blei et al., 2003). To enrich the family of distributions over the latent variables, neural network based variational inference methods have also been proposed, such as the Variational Autoencoder (VAE) (Kingma and Welling, 2013), the Importance Weighted Autoencoder (IWAE) (Burda et al., 2015), and others (Rezende and Mohamed, 2015; Mnih and Gregor, 2014; Kingma et al., 2016). These methods outperform traditional mean-field inference algorithms due to their flexible distribution families and easy-to-scale algorithms, and have therefore become the state of the art for variational inference.

The aforementioned VI methods essentially maximize the evidence lower bound (ELBO), i.e., a lower bound of the true marginal data likelihood, defined as
$\mathcal{L}_{\mathrm{ELBO}} = \mathbb{E}_{q(z|x)}\left[\log p(x,z) - \log q(z|x)\right] \le \log p(x)$    (1)
where $x$ and $z$ denote a data point and its latent code, and $p(x,z)$ and $q(z|x)$ denote the generative model and the variational model, respectively. The equality holds if and only if $q(z|x) = p(z|x)$; otherwise a gap always exists. The more flexible the variational family is, the more likely it will match the true posterior $p(z|x)$. However, arbitrarily enriching the variational model family is nontrivial, since optimizing Eq. 1 always requires evaluations of $q(z|x)$. Most existing methods either make over-simplified assumptions about the variational model, such as the simple Gaussian posterior in VAE (Kingma and Welling, 2013), or resort to implicit variational models without explicitly modeling $q(z|x)$ (Dumoulin et al., 2016).
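As a concrete illustration of the gap in Eq. 1, the following sketch (our own toy example, not from the paper; all function names are ours) evaluates the ELBO by numerical quadrature on a conjugate 1D Gaussian model, where the evidence and posterior are known in closed form:

```python
import numpy as np

# Toy conjugate model: p(z) = N(0, 1), p(x|z) = N(z, 1).
# Then the evidence is p(x) = N(x; 0, 2) and the posterior is N(x/2, 1/2).
def log_normal(v, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (v - mu) ** 2 / var)

def elbo(x, m, s2):
    """ELBO for a Gaussian q(z) = N(m, s2), computed by quadrature:
    E_q[log p(x, z) - log q(z)]."""
    z = np.linspace(m - 10 * np.sqrt(s2), m + 10 * np.sqrt(s2), 40001)
    dz = z[1] - z[0]
    log_q = log_normal(z, m, s2)
    q = np.exp(log_q)
    q /= q.sum() * dz  # renormalize the discretized density
    log_joint = log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)
    return np.sum(q * (log_joint - log_q)) * dz

x = 1.0
log_px = log_normal(x, 0.0, 2.0)
# The bound is tight exactly when q matches the posterior N(0.5, 0.5),
# and strictly loose for any other choice of q.
```

The gap between $\log p(x)$ and the ELBO is exactly $\mathrm{KL}(q(z)\,\|\,p(z|x))$, which vanishes only at the true posterior.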
In this paper we propose to enrich the variational distribution family by incorporating auxiliary variables into the variational model. Most importantly, density evaluations are not required for the auxiliary variables; thus a complex implicit density over the auxiliary variables can easily be constructed, which in turn yields a flexible variational posterior over the latent variables. We argue that the resulting inference network essentially models a probabilistic mixture of different variational posteriors indexed by the auxiliary variable, achieving a much richer and more flexible family of variational posterior distributions. We conduct empirical evaluations on several density estimation tasks, which validate the effectiveness of the proposed method.
The rest of the paper is organized as follows: we briefly review two existing approaches to inference network modeling in Section 2 and present our proposed framework in Section 3. We then point out connections between the proposed framework and related methods in Section 4. Empirical evaluations and analysis are carried out in Section 5, and lastly we conclude in Section 6.
2 Preliminaries
In this section, we briefly review several existing methods that aim to address variational inference with stochastic neural networks.
2.1 Variational Autoencoder (VAE)
Given a generative model $p(x,z)$ defined over data $x$ and latent variables $z$, indexed by parameter $\theta$, variational inference aims to approximate the intractable posterior $p(z|x)$ with a variational distribution $q(z|x)$, indexed by parameter $\phi$, such that the ELBO is maximized:
$\mathcal{L}_{\mathrm{VAE}} = \mathbb{E}_{q(z|x)}\left[\log p(x,z) - \log q(z|x)\right]$    (2)
Parameters of both the generative distribution and the variational distribution are learned by maximizing the ELBO with stochastic gradient methods.¹ ¹We drop the dependencies of $p$ and $q$ on the parameters $\theta$ and $\phi$ to prevent clutter. Specifically, VAE (Kingma and Welling, 2013) assumes that both the conditional distribution of the data given the latent codes in the generative model and the variational posterior distribution are Gaussians, whose means and diagonal covariances are parameterized by two neural networks, termed the generative network and the inference network, respectively. Model learning is possible due to the reparameterization trick (Kingma and Welling, 2013), which makes backpropagation through the stochastic variables possible.
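The two ingredients just described, the reparameterization trick and the closed-form KL term for a diagonal Gaussian posterior, can be sketched as follows (a minimal NumPy illustration with function names of our own choosing; a real implementation would live in an autodiff framework such as PyTorch):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I); the sample is a
    deterministic function of (mu, log_var, eps), so gradients can
    flow through mu and log_var."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * log_var) * eps

def gaussian_kl(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

rng = np.random.default_rng(0)
z = reparameterize(np.zeros(3), np.zeros(3), rng)  # a sample from N(0, I)
```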
2.2 Importance Weighted Autoencoder (IWAE)
The above ELBO is a lower bound of the true data log-likelihood $\log p(x)$; hence Burda et al. (2015) proposed IWAE to more directly estimate the true data log-likelihood in the presence of the variational model² (²The variational model is also referred to as the inference model; we use the two terms interchangeably), namely
$\mathcal{L}_{k} = \mathbb{E}_{z_1,\ldots,z_k \sim q(z|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)}\right]$    (3)
where $k$ is the number of importance weighted samples. The above bound is tighter than the ELBO used in VAE. When trained with the above estimate as the objective, on the same network structure as VAE, IWAE achieves considerable improvements over VAE on various density estimation tasks (Burda et al., 2015); a similar idea is also considered in (Mnih and Rezende, 2016).
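The $k$-sample bound in Eq. 3 reduces to a log-mean-exp of the per-sample log weights $\log p(x,z_i) - \log q(z_i|x)$; a numerically stable sketch (the helper name is ours):

```python
import numpy as np

def iwae_bound(log_w):
    """Importance weighted bound log (1/k) sum_i exp(log_w[i]),
    computed stably by shifting with the max log weight."""
    m = np.max(log_w)
    return m + np.log(np.mean(np.exp(log_w - m)))

log_w = np.array([-1.0, 0.0, 1.0])  # stand-ins for log p(x,z_i) - log q(z_i|x)
```

By Jensen's inequality the plain ELBO estimate `np.mean(log_w)` never exceeds `iwae_bound(log_w)`, consistent with the importance weighted bound being tighter.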
3 The Proposed Method
3.1 Variational Posterior with Auxiliary Variables
Consider the case of modeling binary data with the classic VAE and IWAE, which typically assume that a data point $x$ is generated from a multivariate Bernoulli distribution, conditioned on a latent code $z$ drawn from a Gaussian prior. It is easy to verify that the Gaussian variational posterior inferred by VAE and IWAE cannot match the non-Gaussian true posterior.
To this end, we propose to introduce an auxiliary random variable $\tau$ into the inference model of VAE and IWAE. Conditioned on the input $x$, the inference model equipped with the auxiliary variable now defines a joint density over $(z, \tau)$ as
$q(z, \tau|x) = q(\tau|x)\,q(z|x,\tau)$    (4)
where we assume $\tau$ has proper support and both $q(\tau|x)$ and $q(z|x,\tau)$ can be parameterized. Accordingly, the marginal variational posterior of $z$ given $x$ becomes
$q(z|x) = \int q(\tau|x)\,q(z|x,\tau)\,\mathrm{d}\tau = \mathbb{E}_{q(\tau|x)}\left[q(z|x,\tau)\right]$    (5)
which essentially models the posterior $q(z|x)$ as a probabilistic mixture of different densities $q(z|x,\tau)$ indexed by $\tau$, with $q(\tau|x)$ as the mixture weights. This allows a complex and flexible posterior to be constructed, even when both $q(\tau|x)$ and $q(z|x,\tau)$ are from simple density families. Due to the presence of the auxiliary variable $\tau$, the inference model captures more sources of stochasticity than the generative model; hence we term our approach the Asymmetric Variational Autoencoder (AVAE). Figures 1(a) and 1(b) present a comparison of the inference models of the classic VAE and the proposed AVAE.
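A toy simulation (our own illustration, not the paper's model) makes the mixture intuition concrete: even when each conditional $q(z|x,\tau)$ is a simple Gaussian, marginalizing over a simple auxiliary variable $\tau$ can produce a visibly non-Gaussian, here bimodal, $q(z|x)$:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_z(n, rng):
    """Ancestral sampling from the mixture posterior: draw tau from
    q(tau|x), then z from the Gaussian component q(z|x, tau)."""
    tau = rng.standard_normal(n)              # q(tau|x): standard normal
    mu = np.where(tau > 0.0, 3.0, -3.0)       # a toy mean network mu(x, tau)
    return mu + 0.5 * rng.standard_normal(n)  # z ~ N(mu(tau), 0.5^2)

z = sample_z(100_000, rng)
# The marginal of z is a two-component Gaussian mixture: no single
# Gaussian can match its spread around two well-separated modes.
```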
In the context of VAE and IWAE, the proposed approach includes two instantiations, AVAE and IWAVAE, with loss functions
$\mathcal{L}_{\mathrm{AVAE}} = \mathbb{E}_{q(z|x)}\left[\log p(x,z) - \log q(z|x)\right]$    (6)
and
$\mathcal{L}_{\mathrm{IWAVAE}} = \mathbb{E}_{z_1,\ldots,z_k \sim q(z|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)}\right]$    (7)
respectively.
AVAE enjoys the following properties:

VAEs are special cases of AVAE. Conventional variational autoencoders can be seen as special cases of AVAE with no auxiliary variables assumed;

No density evaluations for $\tau$ are required. One key advantage brought by the auxiliary variable is that the terms inside the expectations of $\mathcal{L}_{\mathrm{AVAE}}$ and $\mathcal{L}_{\mathrm{IWAVAE}}$ do not involve the density $q(\tau|x)$; hence no density evaluations of $\tau$ are required when Monte Carlo samples of $\tau$ are used to optimize the above bounds.

Flexible variational posterior. To fully enrich the flexibility of the variational model, we use a neural network $f$ to implicitly model $q(\tau|x)$, by sampling $\tau$ given the input $x$ and a random Gaussian noise vector $\epsilon$ as

$\tau = f(x, \epsilon), \quad \epsilon \sim \mathcal{N}(0, I)$    (8)

Due to the flexible representative power of $f$, the implicit density $q(\tau|x)$ can be arbitrarily complex. Further, we assume $q(z|x,\tau)$ to be Gaussian, with its mean and variance parameterized by neural networks. Since the actual variational posterior is the marginal $q(z|x) = \mathbb{E}_{q(\tau|x)}\left[q(z|x,\tau)\right]$, a complex posterior can be achieved even when a simple density family is assumed for $q(z|x,\tau)$, owing to the possibly flexible family of implicit densities of $\tau$ defined by $f$. (An illustration can be found in Section 5.1.)
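A sketch of such an implicit sampler for Eq. 8 (weights and layer sizes are arbitrary placeholders of ours): a small network maps the input $x$ and a Gaussian noise vector $\epsilon$ to $\tau$; only samples are produced, and the induced density $q(\tau|x)$ is never evaluated.

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer sampler tau = f(x, eps); x and eps are both 2-D here.
W1 = 0.5 * rng.standard_normal((16, 4))  # acts on the concatenation [x; eps]
W2 = 0.5 * rng.standard_normal((2, 16))

def sample_tau(x, rng):
    eps = rng.standard_normal(2)  # eps ~ N(0, I)
    h = np.tanh(W1 @ np.concatenate([x, eps]))
    return W2 @ h  # a sample of tau; its density is left implicit
```

Repeated calls with the same $x$ but fresh noise yield different $\tau$, tracing out the implicit distribution.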
For completeness, we briefly state the following result.
Proposition 1
Both $\mathcal{L}_{\mathrm{AVAE}}$ and $\mathcal{L}_{\mathrm{IWAVAE}}$ are lower bounds of the true data log-likelihood, satisfying $\mathcal{L}_{\mathrm{AVAE}} \le \mathcal{L}_{\mathrm{IWAVAE}} \le \log p(x)$.
The proof follows directly from Jensen's inequality and is omitted.
Remark 1 Though the first equality holds for any choice of the distribution $q(\tau|x)$ (whether $\tau$ depends on $x$ or not), for practical estimation with Monte Carlo methods it becomes an inequality, and the bound tightens as the number of importance samples is increased (Burda et al., 2015). The second inequality always holds when estimated with Monte Carlo samples.
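The tightening behavior noted in Remark 1 is easy to check empirically with synthetic log weights (our own toy experiment; standard normal draws stand in for the true log weights):

```python
import numpy as np

rng = np.random.default_rng(0)

def avg_bound(k, trials, rng):
    """Average k-sample log-mean-exp bound over many trials, with
    synthetic log weights drawn from N(0, 1)."""
    log_w = rng.standard_normal((trials, k))
    m = log_w.max(axis=1, keepdims=True)  # stabilize the log-mean-exp
    return float(np.mean(m[:, 0] + np.log(np.mean(np.exp(log_w - m), axis=1))))

b1 = avg_bound(1, 20_000, rng)    # k = 1: the plain ELBO estimate on average
b16 = avg_bound(16, 20_000, rng)  # k = 16: a tighter bound on average
```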
Remark 2 The above bounds are concerned with only one auxiliary variable $\tau$; in fact, $\tau$ can also be a set of auxiliary variables. Moreover, with the same motivation, we can make the variational family of AVAE even more flexible by defining a series of auxiliary variables $\tau_1, \ldots, \tau_T$, with the sample generation process for all $\tau_t$ defined as
$\tau_1 = f_1(x, \epsilon_1), \quad \tau_t = f_t(\tau_{t-1}, \epsilon_t) \ \text{for } t = 2, \ldots, T$    (9)
where all $\epsilon_t$ are random noise vectors and all $f_t$ are neural networks to be learned. Accordingly, we have
Proposition 2
The AVAE with $T$ auxiliary random variables is also a lower bound to the true log-likelihood, satisfying $\mathcal{L}_{\mathrm{AVAE}}^{T} \le \mathcal{L}_{\mathrm{IWAVAE}}^{T} \le \log p(x)$, where
$\mathcal{L}_{\mathrm{AVAE}}^{T} = \mathbb{E}_{q(z|x)}\left[\log p(x,z) - \log q(z|x)\right], \quad q(z|x) = \mathbb{E}_{q(\tau_1,\ldots,\tau_T|x)}\left[q(z|x,\tau_T)\right]$    (10)
and
$\mathcal{L}_{\mathrm{IWAVAE}}^{T} = \mathbb{E}_{z_1,\ldots,z_k \sim q(z|x)}\left[\log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{q(z_i|x)}\right]$    (11)
Figure 1(c) illustrates the inference model of an AVAE with $T$ auxiliary variables.
3.2 Learning with Importance Weighted Auxiliary Samples
For both AVAE and IWAVAE, we can estimate the corresponding bounds $\mathcal{L}_{\mathrm{AVAE}}$ and $\mathcal{L}_{\mathrm{IWAVAE}}$, and their gradients, with ancestral sampling from the inference model. For example, for AVAE with one auxiliary variable $\tau$, we estimate
$\hat{\mathcal{L}}_{\mathrm{AVAE}} = \frac{1}{k}\sum_{i=1}^{k}\left[\log p(x,z_i) - \log \frac{1}{K}\sum_{j=1}^{K} q(z_i|x,\tau_j)\right]$    (12)
and
$\hat{\mathcal{L}}_{\mathrm{IWAVAE}} = \log \frac{1}{k}\sum_{i=1}^{k}\frac{p(x,z_i)}{\frac{1}{K}\sum_{j=1}^{K} q(z_i|x,\tau_j)}$    (13)
where $k$ is the number of $z$s sampled from the current $q(z|x)$ and $K$ is the number of $\tau$s sampled from the implicit conditional $q(\tau|x)$, which by definition is achieved by first sampling $\epsilon$ from $\mathcal{N}(0,I)$ and subsequently computing $\tau = f(x,\epsilon)$. The parameters of both the inference model and the generative model are jointly learned by maximizing the above bounds. Besides, backpropagation through the stochastic variable $z$ (typically assumed to be Gaussian for continuous latent variables) is possible through the reparameterization trick, and the same naturally holds for all the auxiliary variables, since they are constructed in a generative manner.
The term $\frac{1}{K}\sum_{j=1}^{K} q(z|x,\tau_j)$ is essentially a $K$-sample importance weighted estimate of $q(z|x)$, hence it is reasonable to believe that more samples of $\tau$ will lead to a less noisy estimate of $q(z|x)$ and thus a more accurate inference model. It is worth pointing out that for AVAE, additional samples of $\tau$ come almost at no cost when multiple samples of $z$ are generated to optimize $\mathcal{L}_{\mathrm{AVAE}}$ and $\mathcal{L}_{\mathrm{IWAVAE}}$: sampling a $z$ from the inference model also generates an intermediate sample of $\tau$, so we can always reuse those samples of $\tau$ to estimate $q(z|x)$. For this purpose, in our experiments we always set $K = k$, so that no separate process of sampling $\tau$ is needed in estimating the lower bounds. This also ensures that the forward-pass and backward-pass time complexity of the inference model is the same as in conventional VAE and IWAE. In fact, as we show in all our empirical evaluations, AVAE outperforms VAE and IWAVAE outperforms IWAE, i.e., their counterparts with no auxiliary variables.
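The $K$-sample estimate of the marginal $q(z|x)$ discussed above can be sketched on a toy mixture (all densities here are our own stand-ins): $\tau$ is drawn from a standard normal, and $q(z|x,\tau)$ is a Gaussian whose mean depends on the sign of $\tau$.

```python
import numpy as np

rng = np.random.default_rng(0)

def normal_pdf(z, mu, sigma):
    return np.exp(-0.5 * ((z - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def estimate_q(z, K, rng):
    """Monte Carlo estimate of q(z|x) = E_{q(tau|x)}[q(z|x, tau)]
    using K samples of tau; the density of tau is never evaluated."""
    tau = rng.standard_normal(K)           # tau ~ q(tau|x)
    mu = np.where(tau > 0.0, 3.0, -3.0)    # toy component mean mu(x, tau)
    return float(np.mean(normal_pdf(z, mu, 0.5)))

q_hat = estimate_q(3.0, 100_000, rng)  # close to one mode of the mixture
```

Larger $K$ reduces the variance of the estimate, matching the intuition that more $\tau$ samples give a less noisy $q(z|x)$.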
4 Connection to Related Methods
Before we proceed to the experimental evaluations of the proposed methods, we highlight the relations of AVAE to other similar methods.
4.1 Other methods with auxiliary variables
Hierarchical Variational Models (HVM) (Ranganath et al., 2016) and Auxiliary Deep Generative Models (ADGM) (Maaløe et al., 2016) are two closely related variational methods with auxiliary variables. HVM also considers enriching the variational model family, by placing a prior over the latent variable of the variational distribution. ADGM takes another route to this goal, placing a prior over the auxiliary variable on the generative model, which in some cases keeps the marginal generative distribution of the data invariant. HVM and ADGM have been shown to be mathematically equivalent (Brümmer, 2016).
However, our proposed method does not place any prior on the generative model and thus does not change its structure. We emphasize that our proposed method makes the fewest assumptions about the generative model, and that our proposal is orthogonal to related methods; thus it can be integrated with previous methods with auxiliary variables to further boost performance on accurate posterior approximation and generative modeling.
4.2 Adversarial learning based inference models
Adversarial learning based inference models, such as Adversarial Autoencoders (Makhzani et al., 2015), Adversarial Variational Bayes (Mescheder et al., 2017), and Adversarially Learned Inference (Dumoulin et al., 2016)
, aim to maximize the ELBO without any variational likelihood evaluations at all. It can be shown that for the above adversarial learning based models, when the discriminator is trained to its optimum, the model is equivalent to optimizing the ELBO. However, due to the minimax game involved in the adversarial setting, in practice there is no guarantee at any moment that they are optimizing a lower bound of the true data likelihood, and thus no maximum likelihood learning interpretation can be provided. In our proposed framework, by contrast, we do not require variational density evaluations for the flexible auxiliary variables, while still maintaining the maximum likelihood interpretation.
5 Experiments
5.1 Flexible Variational Family of AVAE
To test the effect of adding auxiliary variables to the inference model, we parameterize two unnormalized 2D target densities.³ ³The sample densities originate from (Rezende and Mohamed, 2015).
We construct an inference model⁴ to approximate the target density by minimizing the KL divergence ⁴The inference model of a VAE defines a conditional variational posterior $q(z|x)$; to match a target density that is independent of $x$, we set $x$ to be fixed. In this synthetic example, $x$ is set to be an all-one vector of dimension 10.
$\min_{q}\ \mathrm{KL}\big(q(z)\,\|\,p(z)\big)$    (14)
Figure 2 illustrates the target densities as well as the ones learned by VAE and AVAE, respectively. It is unsurprising that the standard VAE, with a Gaussian stochastic layer as its inference model, can only produce Gaussian density estimates (Figure 2(b)). With the help of the introduced auxiliary random variables, AVAE is able to match the non-Gaussian target densities (Figure 2(c)), even though the last stochastic layer of the inference model, i.e., $q(z|x,\tau)$, is also Gaussian.
5.2 Handwritten Digits and Characters
To test AVAE for variational inference, we use the standard benchmark datasets MNIST⁵ and OMNIGLOT⁶ (Lake et al., 2013). ⁵http://www.cs.toronto.edu/~larocheh/public/datasets/binarized_mnist/ ⁶https://github.com/yburda/iwae/raw/master/datasets/OMNIGLOT/chardata.mat Our method is general and can be applied to any formulation of the generative model $p(x,z)$. For simplicity and fair comparison, in this paper we focus on $p(x,z)$ defined by stochastic neural networks, i.e., a family of generative models whose parameters are defined by neural networks. Specifically, we consider the following two types of generative models:

A model with a single Gaussian stochastic layer for $z$ with 50 units. In between the latent variable $z$ and the observation $x$ there are two deterministic layers, each with 200 units;

A model with two Gaussian stochastic layers for $z_2$ and $z_1$, with 50 and 100 units, respectively. Two deterministic layers with 200 units connect the observation $x$ and the latent variable $z_1$, and two deterministic layers with 100 units connect $z_1$ and $z_2$.
A Gaussian stochastic layer consists of two fully connected linear layers, with one outputting the mean and the other outputting the logarithm of the diagonal covariance. All other deterministic layers are fully connected with tanh nonlinearity. The same network architectures for both generative models are also used in (Burda et al., 2015).
For the single-stochastic-layer model, an inference network with the corresponding architecture is used by AVAE with $T$ auxiliary variables, where $\tau_0$ is defined as the input $x$, all $f_t$ are implemented as fully connected layers with tanh nonlinearity, and each $\tau_{t-1}$ is concatenated with its noise vector $\epsilon_t$. All noise vectors $\epsilon_t$ are set to be of 50 dimensions, and all other variables have the corresponding dimensions in the generative model. The inference network used for the two-stochastic-layer model is the same, except that the Gaussian stochastic layer is defined for $z_1$, and an additional Gaussian stochastic layer for $z_2$ is defined with $z_1$ as input, with the dimensions of the variables aligned to those in the generative model. Further, Bernoulli observation models are assumed for both MNIST and OMNIGLOT. For MNIST, we employ the static binarization strategy as in
(Larochelle and Murray, 2011), while dynamic binarization is employed for OMNIGLOT. Our baseline models include VAE and IWAE. Since our proposed method involves adding more layers to the inference network, we also include an enhanced version of VAE with more deterministic layers added to its inference network, which we term VAE+⁷ (⁷VAE+ is a restricted version of AVAE with all noise vectors $\epsilon_t$ set to be constantly 0, but with the additional layers for the $\tau_t$ retained)
and its importance weighted variant IWAE+. To eliminate discrepancies in implementation details of the models reported in the literature, we implement all models and carry out the experiments under the same setting: all models are implemented in PyTorch⁸ (⁸http://pytorch.org/), and the parameters of all models are optimized with Adam (Kingma and Ba, 2014) for 2000 epochs, with an initial learning rate of 0.001, cosine annealing for learning rate decay (Loshchilov and Hutter, 2016), and exponential decay rates for the 1st and 2nd moments of 0.9 and 0.999, respectively. Batch normalization (Ioffe and Szegedy, 2015) is applied to all fully connected layers, except for the final output layer of the generative model, as it has been shown to improve learning for neural stochastic models (Sønderby et al., 2016). Linear annealing of the weight of the KL divergence term between the variational posterior and the prior, from 0 to 1 over the first 200 epochs, is adopted in all the loss functions, as it has been shown to help training stochastic neural networks with multiple layers of latent variables (Sønderby et al., 2016). Code to reproduce all reported results will be made publicly available.
5.2.1 Generative Density Estimation
[Table 1: Estimated test negative log-likelihoods on MNIST and OMNIGLOT for both generative models. The upper block compares VAE (Burda et al., 2015), VAE+, and AVAE, each under several settings; the lower block compares their importance weighted counterparts IWAE (Burda et al., 2015), IWAE+, and IWAVAE. Numerical entries were lost in extraction.]
For both MNIST and OMNIGLOT, all models are trained and tuned on the training and validation sets, and the estimated log-likelihood on the test set, computed with 128 importance weighted samples, is reported. Table 1 presents the performance of all models on both generative models.
First, VAE+ achieves slightly higher log-likelihood estimates than vanilla VAE due to the additional layers added in the inference network, implying that a better Gaussian posterior approximation is learned. Second, AVAE achieves lower NLL estimates than VAE+, increasingly so with more samples from the auxiliary variables (i.e., larger $K$), which confirms our expectations that: a) adding auxiliary variables to the inference network leads to a richer family of variational distributions; and b) more samples of the auxiliary variables yield a more accurate estimate of the variational posterior $q(z|x)$. Lastly, with more importance weighted samples of both $z$ and $\tau$, i.e., the IWAVAE variants, the best data density estimates are achieved. Overall, on MNIST, AVAE outperforms VAE by 1.5 nats on the single-stochastic-layer model and 1.3 nats on the two-stochastic-layer model; IWAVAE outperforms IWAE by about 1.0 nat and 0.5 nats, respectively. Similar trends can be observed on OMNIGLOT, with AVAE and IWAVAE outperforming conventional VAE and IWAE in all cases, except that IWAE+ slightly outperforms IWAVAE in one setting.
Compared with previous methods under similar settings, IWAVAE achieves a best NLL of 83.77, significantly better than the 85.10 achieved by Normalizing Flows (Rezende and Mohamed, 2015). The best density modeling on statically binarized MNIST is achieved by PixelRNN (Oord et al., 2016; Salimans et al., 2017) among autoregressive models and by Inverse Autoregressive Flow (Kingma et al., 2016) among latent variable models; however, it is worth noting that much more sophisticated generative models are adopted in those methods, whereas AVAE enhances the standard VAE by focusing on enriching inference model flexibility, which pursues an orthogonal direction for improvement. Therefore, AVAE can be integrated with the above-mentioned methods to further improve performance on latent variable generative modeling.
5.2.2 Latent Code Visualization
We visualize the latent codes inferred for digits in the MNIST test set, with respect to their true class labels, from the different models using t-SNE (Maaten and Hinton, 2008) in Figure 3.
We observe that on the single-stochastic-layer generative model, all three models are able to infer latent codes for the digits consistent with their true classes. However, VAE and VAE+ still show disconnected clusters of latent codes from the same class (both classes 0 and 1) and overlapping latent codes from different classes (classes 3 and 5), while AVAE outputs clearly separable latent codes for the different classes (notably for classes 0, 1, 5, 6, and 7).
5.2.3 Reconstruction and Generated Samples
Generative samples can be obtained from a trained model by feeding samples from the prior $p(z)$ to the learned generative model. Since higher log-likelihood estimates are obtained on the two-stochastic-layer model, Figure 4 shows real samples from the dataset, their reconstructions, and random data points sampled from AVAE trained on that model, for both MNIST and OMNIGLOT. We observe that the reconstructions align well with the input data and that the random samples generated by the models are visually consistent with the training data.
6 Conclusions
This paper presents AVAE, a new framework that enriches the variational family for variational inference by incorporating auxiliary variables into the inference model. The resulting inference model essentially learns a richer probabilistic mixture of simple variational posteriors indexed by the auxiliary variables. We emphasize that no density evaluations are required for the auxiliary variables; hence neural networks can be used to construct complex implicit distributions for them. Empirical evaluations of two variants of AVAE demonstrate the effectiveness of incorporating auxiliary variables into variational inference for generative modeling.
References
 Blei et al. [2003] David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3:993–1022, 2003.
 Blei et al. [2017] David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Brümmer [2016] Niko Brümmer. Note on the equivalence of hierarchical variational models and auxiliary deep generative models. arXiv preprint arXiv:1603.02443, 2016.
 Burda et al. [2015] Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 Dumoulin et al. [2016] Vincent Dumoulin, Ishmael Belghazi, Ben Poole, Alex Lamb, Martin Arjovsky, Olivier Mastropietro, and Aaron Courville. Adversarially learned inference. arXiv preprint arXiv:1606.00704, 2016.
 Geyer [1992] Charles J Geyer. Practical Markov chain Monte Carlo. Statistical Science, pages 473–483, 1992.
 Ioffe and Szegedy [2015] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 448–456, 2015.
 Kingma and Ba [2014] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 Kingma and Welling [2013] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.
 Kingma et al. [2016] Diederik P. Kingma, Tim Salimans, Rafal Józefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improving variational autoencoders with inverse autoregressive flow. In Advances in Neural Information Processing Systems 29: Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 4736–4744, 2016.
 Lake et al. [2013] Brenden M. Lake, Ruslan Salakhutdinov, and Joshua B. Tenenbaum. One-shot learning by inverting a compositional causal process. In Advances in Neural Information Processing Systems 26: 27th Annual Conference on Neural Information Processing Systems 2013. Proceedings of a meeting held December 5–8, 2013, Lake Tahoe, Nevada, United States, pages 2526–2534, 2013.

 Larochelle and Murray [2011] Hugo Larochelle and Iain Murray. The neural autoregressive distribution estimator. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2011, Fort Lauderdale, USA, April 11–13, 2011, pages 29–37, 2011.
 Loshchilov and Hutter [2016] Ilya Loshchilov and Frank Hutter. SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
 Maaløe et al. [2016] Lars Maaløe, Casper Kaae Sønderby, Søren Kaae Sønderby, and Ole Winther. Auxiliary deep generative models. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, pages 1445–1453, 2016.
 Maaten and Hinton [2008] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.
 Makhzani et al. [2015] Alireza Makhzani, Jonathon Shlens, Navdeep Jaitly, Ian Goodfellow, and Brendan Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.
 Mescheder et al. [2017] Lars Mescheder, Sebastian Nowozin, and Andreas Geiger. Adversarial variational bayes: Unifying variational autoencoders and generative adversarial networks. arXiv preprint arXiv:1701.04722, 2017.
 Mnih and Gregor [2014] Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. In Proceedings of the 31st International Conference on Machine Learning, ICML 2014, Beijing, China, 21–26 June 2014, pages 1791–1799, 2014.
 Mnih and Rezende [2016] Andriy Mnih and Danilo Jimenez Rezende. Variational inference for Monte Carlo objectives. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, pages 2188–2196, 2016.
 Oord et al. [2016] Aaron van den Oord, Nal Kalchbrenner, and Koray Kavukcuoglu. Pixel recurrent neural networks. arXiv preprint arXiv:1601.06759, 2016.
 Ranganath et al. [2016] Rajesh Ranganath, Dustin Tran, and David M. Blei. Hierarchical variational models. In Proceedings of the 33rd International Conference on Machine Learning, ICML 2016, New York City, NY, USA, June 19–24, 2016, pages 324–333, 2016.
 Rezende and Mohamed [2015] Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. In Proceedings of the 32nd International Conference on Machine Learning, ICML 2015, Lille, France, 6–11 July 2015, pages 1530–1538, 2015.
 Salimans et al. [2017] Tim Salimans, Andrej Karpathy, Xi Chen, and Diederik P Kingma. PixelCNN++: Improving the PixelCNN with discretized logistic mixture likelihood and other modifications. arXiv preprint arXiv:1701.05517, 2017.
 Sønderby et al. [2016] Casper Kaae Sønderby, Tapani Raiko, Lars Maaløe, Søren Kaae Sønderby, and Ole Winther. Ladder variational autoencoders. In Annual Conference on Neural Information Processing Systems 2016, December 5–10, 2016, Barcelona, Spain, pages 3738–3746, 2016.
 Wainwright et al. [2008] Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.