1 Introduction
Stochastic variational inference [1, 2, 3] is a scalable method for learning generative models with latent variables using stochastic optimization [4]. It becomes especially scalable and efficient for models with continuous latent variables when combined with an amortized inference model and the reparameterization trick [5, 6]. The resulting method is commonly referred to as the variational autoencoder (VAE) [6]. In VAEs, a variational family of distributions parameterized by the inference model is used to approximate the posterior distribution of latent variables, and the generative and inference models are learned simultaneously by maximizing the evidence lower bound [5, 6]. The variational distributions used in VAEs are commonly chosen to be fully factorized, whereas the true posterior distribution is not necessarily fully factorized and might even have multiple modes. Because the optimization of the generative model depends strongly on the approximate posterior distribution, generative models learned using VAEs are biased toward having factorized posterior distributions and can be suboptimal.
One of the core problems in variational inference has been to increase the expressiveness of approximate posterior distributions while maintaining efficient optimization [3]. Two kinds of approaches developed to address this problem are of interest for the current study: flow-based approaches [7, 8, 9] and multi-sample-based approaches such as the importance weighted autoencoder (IWAE) [10].
In flow-based approaches, a chain of invertible transformations, referred to as a flow, is applied to samples from a simple factorized distribution so that the resulting samples have a more flexible distribution. Examples of flow-based approaches include the normalizing flow (NF) [7], the inverse autoregressive flow (IAF) [8], and the Hamiltonian variational autoencoder (HVAE) [9], among others. Both the NF and the IAF introduce new parameters to the inference model. In contrast, the HVAE does not introduce new parameters to the inference model, and its flow is guided by the generative model. For all three flow-based approaches, calculating parameter gradients requires running the backpropagation algorithm through the flow in the reverse direction. Therefore, both computation and memory costs increase linearly with the depth of the flow.
Like the HVAE, the multi-sample-based IWAE does not introduce new parameters to the inference model. The original motivation for the IWAE is to use multiple samples from an approximate posterior distribution to construct a tighter evidence lower bound; optimizing the tighter lower bound was shown to help learn better generative models [10]. Later, an alternative interpretation was given for the IWAE in [11, 12]: multiple samples from an approximate distribution are used to implicitly define a more flexible approximate posterior distribution based on importance sampling. The tighter lower bound in IWAE can then be understood as an ordinary evidence lower bound, as in the VAE, but with the implicitly defined flexible approximate posterior distribution [11, 12].
Here we introduce the annealed importance weighted autoencoder (AIWAE) for learning generative models with latent variables. The AIWAE combines multi-sample-based and flow-based approaches through annealed importance sampling: the flow used in AIWAE is constructed through an annealing process that facilitates better sampling from the posterior distribution. First, we present an alternative interpretation of how the IWAE optimizes generative model parameters: the IWAE maximizes the data log-likelihood, with the gradient of the data log-likelihood estimated using importance sampling [13]. With this interpretation, we can naturally generalize importance sampling to annealed importance sampling (AIS) [14] to better estimate the gradient of the data log-likelihood with respect to generative model parameters. The approximate posterior distribution parameterized by the inference model and the true posterior distribution are used as the initial and target distributions of the AIS, respectively. The inference model parameters are learned by minimizing the Kullback-Leibler (KL) divergence between the two distributions. From a flow-based point of view, samples from the initial distribution of AIS also go through a chain of transformations constructed using an annealing process. In contrast to previous flow-based approaches, the flow in AIWAE is guided by the posterior distribution and does not add new parameters to the inference model. The annealing process can facilitate exploration of the posterior distribution when it has multiple modes. In addition, training models with the AIWAE does not require running the backpropagation algorithm backward through the flow, which gives the AIWAE a memory cost that is constant in the depth of the flow.

2 Background
2.1 Variational Autoencoder and Importance Weighted Autoencoder
The generative model of interest is defined by a joint distribution of data $x$ and continuous latent variables $z$: $p_\theta(x, z) = p_\theta(z)\, p_\theta(x \mid z)$, where $\theta$ represents the parameters of the generative model. Learning $\theta$ by maximizing the data likelihood requires calculating expectations with respect to the posterior distribution of latent variables, $p_\theta(z \mid x)$, which is computationally expensive when analytical expressions for the expectations are not available. To efficiently learn the generative model, the variational autoencoder (VAE) [5, 6] uses an approximate posterior distribution $q_\phi(z \mid x)$ and maximizes the evidence lower bound (ELBO) objective function:

$$\mathcal{L}^{\mathrm{ELBO}}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z \mid x)}\left[\log p_\theta(x, z) - \log q_\phi(z \mid x)\right]. \quad (1)$$
The gradient of $\mathcal{L}^{\mathrm{ELBO}}$ with respect to $\theta$, $\nabla_\theta \mathcal{L}^{\mathrm{ELBO}}$, is estimated by Monte Carlo sampling of $z$ from $q_\phi(z \mid x)$. To efficiently estimate the gradient with respect to $\phi$, the VAE reparameterizes the approximate posterior distribution as $z = g_\phi(\epsilon, x)$, where $\epsilon$ is a random variable from a fixed distribution and $\phi$ represents the parameters of the transformation. When the latent variable is continuous, a common choice is a diagonal Gaussian, $q_\phi(z \mid x) = \mathcal{N}\left(z; \mu_\phi(x), \mathrm{diag}(\sigma^2_\phi(x))\right)$, with $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$ [5, 6]. With this parameterization, the ELBO function becomes

$$\mathcal{L}^{\mathrm{ELBO}}(\theta, \phi; x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[\log p_\theta(x, g_\phi(\epsilon, x)) - \log q_\phi(g_\phi(\epsilon, x) \mid x)\right], \quad (2)$$

and its gradient with respect to $\phi$, $\nabla_\phi \mathcal{L}^{\mathrm{ELBO}}$, can be estimated by Monte Carlo sampling of $\epsilon$ from $\mathcal{N}(0, I)$, in the same way as the gradient with respect to $\theta$.
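The reparameterized sampling step described above can be written in a few lines; the following is a minimal NumPy sketch (the function name is ours, not from the paper):

```python
import numpy as np

def reparameterize(mu, log_var, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I), so that z is a
    deterministic, differentiable function of (mu, log_var) given eps
    (the reparameterization trick)."""
    eps = rng.standard_normal(np.shape(mu))
    return mu + np.exp(0.5 * log_var) * eps
```

In an actual VAE, this sampling happens inside an autodiff framework, so the gradient with respect to $\phi$ flows through `mu` and `log_var` rather than through the random draw.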
In the importance weighted autoencoder (IWAE) [10], the following new ELBO objective based on multiple ($K$) samples is introduced:

$$\mathcal{L}_K(\theta, \phi; x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k \mid x)}\right]. \quad (3)$$

As $K$ increases, $\mathcal{L}_K$ forms a tighter lower bound of the data log-likelihood $\log p_\theta(x)$. The gradient of $\mathcal{L}_K$ with respect to $\theta$ is
$$\nabla_\theta \mathcal{L}_K = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\nabla_\theta \log \frac{1}{K} \sum_{k=1}^{K} w_k\right] \quad (4)$$

$$= \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z \mid x)}\left[\sum_{k=1}^{K} \tilde{w}_k\, \nabla_\theta \log p_\theta(x, z_k)\right], \quad (5)$$

where $w_k = p_\theta(x, z_k) / q_\phi(z_k \mid x)$ represents the importance weight of sample $z_k$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$ are the normalized importance weights. The gradient of $\mathcal{L}_K$ with respect to $\phi$ can be calculated similarly using the reparameterization trick [10].
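The $K$-sample bound and the normalized importance weights above are typically computed in log space for numerical stability. A minimal sketch, with our own helper name (not code from the paper):

```python
import numpy as np

def iwae_bound_and_weights(log_w):
    """Given importance log-weights log w_k = log p(x, z_k) - log q(z_k | x),
    return the K-sample bound log((1/K) * sum_k w_k), computed with the
    log-sum-exp trick, and the normalized weights w~_k."""
    log_w = np.asarray(log_w, dtype=float)
    m = log_w.max()                          # shift to avoid overflow
    bound = m + np.log(np.mean(np.exp(log_w - m)))
    w = np.exp(log_w - m)
    return bound, w / w.sum()
```

With $K = 1$ the bound reduces to a single-sample ELBO estimate, and equal log-weights give uniform normalized weights.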
2.2 Importance Sampling and Annealed Importance Sampling
Importance sampling is a widely used statistical technique for computing the expectation of a function $\phi(z)$ with respect to a probability distribution whose density is proportional to a function $f(z)$ with an unknown normalization constant. When it is difficult to draw independent samples from the distribution defined by $f$, importance sampling [15] uses a proposal distribution from which independent samples can be drawn directly. Suppose the proposal distribution has a probability density proportional to $g(z)$. With independent samples $z_1, \dots, z_K$ drawn from the proposal distribution, the expectation can be estimated by

$$\mathbb{E}_f[\phi(z)] \approx \sum_{k=1}^{K} \tilde{w}_k\, \phi(z_k), \qquad w_k = \frac{f(z_k)}{g(z_k)}, \quad (6)$$

where $w_k$ is the weight assigned to sample $z_k$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$ is the normalized weight. The accuracy of the estimator in Eq. (6) depends on the variability of the weights $w_k$, which in turn depends on how well the proposal distribution defined by $g$ approximates the target distribution defined by $f$. When $z$ is high dimensional and $f$ is complex or has multiple modes, it is difficult to find a proposal distribution that is both a good approximation of the target distribution and easy to draw independent samples from.
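The self-normalized estimator of Eq. (6) can be sketched directly, with both target and proposal known only up to a constant (the function names and the toy densities in the usage below are our own):

```python
import numpy as np

def importance_expectation(phi, log_f, log_g, z):
    """Self-normalized importance sampling: estimate E_f[phi(z)] from
    samples z drawn from the proposal g, where log_f and log_g are the
    unnormalized log-densities of the target and the proposal."""
    log_w = log_f(z) - log_g(z)
    log_w = log_w - log_w.max()      # stabilize before exponentiating
    w = np.exp(log_w)
    return np.sum((w / w.sum()) * phi(z))
```

For example, the mean of a unit Gaussian centered at 1 can be estimated from samples of a wider Gaussian proposal centered at 0; the estimate approaches 1 as the sample size grows.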
An alternative approach is to use Markov chain Monte Carlo (MCMC) methods [15] to draw dependent samples from the target distribution defined by $f$. MCMC works by making random perturbations to samples and accepting or rejecting the perturbed samples based on the Metropolis-Hastings criterion [16, 17]. Although theoretical results show that, under relatively weak conditions, the samples eventually converge to the target distribution [15], this may require running the Markov chain for too long to be practical, especially when the target distribution has multiple modes connected only through low-density regions.

Annealed importance sampling (AIS) [14] was introduced to alleviate the challenges of both importance sampling and MCMC. In AIS, a sequence of distributions with probability densities proportional to $f_T, f_{T-1}, \dots, f_0$ is constructed to connect the initial distribution defined by $f_T = g$ and the target distribution defined by $f_0 = f$. A generally useful way to construct these intermediate distributions is to let
$$f_t(z) = f_0(z)^{\beta_t}\, f_T(z)^{1 - \beta_t}, \quad (7)$$
where $1 = \beta_0 > \beta_1 > \dots > \beta_T = 0$. To generate a sample $z$ and calculate its weight $w$, a sequence of samples $z_T, z_{T-1}, \dots, z_1$ is generated as follows. Initially, $z_T$ is generated by sampling from the distribution defined by $f_T$, which is chosen to be a distribution from which independent samples can be easily drawn. For $t = T-1, \dots, 1$, $z_t$ is generated from $z_{t+1}$ using a reversible transition kernel that leaves the distribution defined by $f_t$ invariant. Then the sample $z$ is set to $z_1$ and the weight $w$ is calculated as
$$w = \frac{f_{T-1}(z_T)}{f_T(z_T)} \cdot \frac{f_{T-2}(z_{T-1})}{f_{T-1}(z_{T-1})} \cdots \frac{f_0(z_1)}{f_1(z_1)}. \quad (8)$$
After generating $K$ samples $z^{(1)}, \dots, z^{(K)}$ and their weights $w^{(1)}, \dots, w^{(K)}$, the expectation $\mathbb{E}_f[\phi(z)]$ can be estimated using Eq. (6) in the same way.
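The annealing run described above can be sketched end to end on a one-dimensional toy problem. Here a random-walk Metropolis kernel stands in for the HMC kernel used later in the paper, the schedule runs with $\beta$ increasing from 0 (initial) to 1 (target), and all names are our own:

```python
import numpy as np

def ais_log_weights(log_fT, log_f0, betas, z0, n_mh, step, rng):
    """Annealed importance sampling (Eqs. 7-8).  Chains start at z0,
    drawn from the initial distribution (unnormalized log-density
    log_fT), and are annealed toward the target (log-density log_f0)
    through f_beta = f0^beta * fT^(1 - beta).  Returns final samples
    and accumulated log importance weights; np.mean(np.exp(log_w))
    estimates Z_target / Z_initial."""
    z = np.array(z0, dtype=float)
    log_w = np.zeros_like(z)

    def log_f(z, b):
        return b * log_f0(z) + (1.0 - b) * log_fT(z)

    for b_prev, b in zip(betas[:-1], betas[1:]):
        # weight update: log f_b(z) - log f_{b_prev}(z) at the current state
        log_w += log_f(z, b) - log_f(z, b_prev)
        # Metropolis moves that leave the distribution under f_b invariant
        for _ in range(n_mh):
            prop = z + step * rng.standard_normal(z.shape)
            log_ratio = np.minimum(0.0, log_f(prop, b) - log_f(z, b))
            accept = rng.random(z.shape) < np.exp(log_ratio)
            z = np.where(accept, prop, z)
    return z, log_w
```

A useful sanity check is that AIS is unbiased for the ratio of normalization constants regardless of how well the kernel mixes, and that annealing a distribution into itself yields unit weights exactly.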
3 Annealed Importance Weighted Autoencoders
In this section, we introduce the annealed importance weighted autoencoder (AIWAE) for learning the generative model $p_\theta(x, z)$ by maximizing the data log-likelihood $\log p_\theta(x)$. The AIWAE originates from the importance sampling interpretation of the gradient estimator in Eq. (5). With this interpretation, Eq. (5) can be naturally generalized to utilize AIS [14].
Let us consider learning the parameters $\theta$ of the generative model by maximizing the data log-likelihood with stochastic gradient ascent. To do so, we need to estimate the gradient of the data log-likelihood with respect to $\theta$:

$$\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)}\left[\nabla_\theta \log p_\theta(x, z)\right]. \quad (9)$$
Because directly drawing independent samples from the posterior distribution $p_\theta(z \mid x)$ is usually not feasible, we can use importance sampling to estimate the above expectation. If we choose the proposal distribution to be $q_\phi(z \mid x)$, then setting $f(z) = p_\theta(x, z)$ and $g(z) = q_\phi(z \mid x)$ in Eq. (6) yields the following estimator:

$$\nabla_\theta \log p_\theta(x) \approx \sum_{k=1}^{K} \tilde{w}_k\, \nabla_\theta \log p_\theta(x, z_k), \qquad z_k \sim q_\phi(z \mid x), \quad (10)$$

where $w_k = p_\theta(x, z_k) / q_\phi(z_k \mid x)$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$. This estimator (Eq. 10) is the same as the estimator in Eq. (5) used for the gradient of the ELBO objective $\mathcal{L}_K$. Therefore, in terms of learning the parameters $\theta$, an alternative interpretation of the IWAE is that it learns $\theta$ by maximizing the data log-likelihood, with the gradient estimated using importance sampling and the approximate posterior $q_\phi(z \mid x)$ as the proposal distribution.
With this importance sampling view, the estimator in Eq. (10) can be improved using AIS [14] as follows. The unnormalized density of the target distribution is set to $f_0(z) = p_\theta(x, z)$ and the initial distribution density is set to $f_T(z) = q_\phi(z \mid x)$. A sequence of intermediate distributions is constructed using Eq. (7), i.e., $f_t(z) = p_\theta(x, z)^{\beta_t}\, q_\phi(z \mid x)^{1 - \beta_t}$. The reversible transition kernel that leaves $f_t$ invariant is constructed using the Hamiltonian Monte Carlo (HMC) method [18], in which the potential energy function is set to $U_t(z) = -\log f_t(z)$. With the AIS estimate of the gradient (Eq. 10), we can optimize the parameters $\theta$ by maximizing the data log-likelihood with stochastic gradient ascent.
The importance sampling interpretation of Eq. (5) and the AIS-based estimation of the gradient of the data log-likelihood apply only to the parameters $\theta$ of the generative model $p_\theta(x, z)$. How should we optimize the parameters $\phi$ of the approximate inference model $q_\phi(z \mid x)$? In the AIWAE, we take the following general strategy. The main objective of AIWAE is to learn the generative model by maximizing the data log-likelihood, which requires calculating the expectation with respect to the posterior distribution $p_\theta(z \mid x)$ in Eq. (9). The amortized approximate inference model is introduced to help estimate this expectation efficiently. Because the performance of the AIS estimator depends on the similarity between its initial and target distributions, the objective in optimizing $\phi$ is to make $q_\phi(z \mid x)$ close to the posterior distribution $p_\theta(z \mid x)$. In AIWAE, we choose to minimize the Kullback-Leibler divergence between $q_\phi(z \mid x)$ and $p_\theta(z \mid x)$, i.e., to maximize the ELBO function (Eq. (2)) with the reparameterization trick, as in the VAE [6]. The detailed procedure of the AIWAE is described in Algorithm (1).

In summary, the parameters $\theta$ and $\phi$ are optimized using different objective functions in AIWAE. The generative model parameters $\theta$ are optimized to maximize the data log-likelihood, and the inference model parameters $\phi$ are optimized to maximize the ELBO (Eq. 2) with the reparameterization trick. The purpose of the inference model is to help efficiently calculate the gradient of the data log-likelihood with respect to $\theta$ (Eq. (9)). When the posterior distribution is not factorized and has multiple modes, the fully factorized approximate posterior $q_\phi(z \mid x)$ will not be a good approximation, and the importance sampling estimator based on $q_\phi$ (Eq. (10)) will have large variance. AIS is used to alleviate this problem: samples from the approximate distribution $q_\phi(z \mid x)$ are transformed via an annealing process guided by the posterior distribution, so that their distribution moves toward $p_\theta(z \mid x)$. The annealing procedure in AIS can also help samples explore the latent space when $p_\theta(z \mid x)$ has multiple modes. In contrast to other flow-based approaches, the AIWAE's memory cost does not increase with the depth of the flow. The computational cost of AIWAE is proportional to both the number of weighted samples $K$ and the number of annealing steps $T$, whereas the computational cost of IWAE is proportional only to $K$. Therefore, AIWAE is approximately $T$ times more computationally expensive than IWAE when the same $K$ is used.

4 Related Work
In addition to the IWAE and AIWAE, several approaches have been proposed to utilize multiple samples with importance sampling, either to obtain a better objective function or to better estimate gradients of objective functions. For learning feedforward neural networks with stochastic binary hidden units, an importance-sampling-based estimator using multiple samples was proposed in [19] to estimate the expectation used in the generalized Expectation-Maximization algorithm [20]. A similar estimator was derived in [21] by constructing a variational distribution of hidden variables used in the ELBO from multiple samples and importance sampling. Another approach [22] extended the multi-sample approach in IWAE to discrete latent variables and developed an unbiased gradient estimator for importance-sampled objectives. The methods in [19, 21, 22] apply to models with discrete latent variables, whereas the IWAE applies only to models with continuous latent variables. Although our extension of IWAE to AIWAE using AIS likewise applies only to continuous latent variables, the idea of replacing importance sampling with AIS could also be useful for improving the discrete-latent-variable methods of [19, 21, 22].

The reweighted wake-sleep (RWS) algorithm [13] is another multi-sample approach for learning generative models with latent variables. Unlike the IWAE, the RWS algorithm uses different objective functions for optimizing the generative and inference models. The generative model is optimized by maximizing the data log-likelihood, with its gradient estimated using importance sampling; this gradient estimator is equivalent to the gradient estimator of Eq. (5) in IWAE. To optimize the inference model, two different update rules were proposed in RWS: (1) the wake-phase update minimizes the Kullback-Leibler divergence from $q_\phi(z \mid x)$ to $p_\theta(z \mid x)$, with the gradient estimated using importance sampling; (2) the sleep-phase update maximizes the log-likelihood under $q_\phi(z \mid x)$ of samples from the generative model. Compared with RWS, the AIWAE differs in two respects: (1) AIS, rather than importance sampling, is used to estimate the gradient when optimizing the generative model; (2) the inference model is optimized by minimizing the Kullback-Leibler divergence using the reparameterization trick. It would be interesting to investigate whether replacing importance sampling with AIS can also help RWS learn better generative models.
The AIWAE utilizes both MCMC and variational inference. In this respect, one closely related approach is that of [23], which uses the same objective functions as AIWAE for both the generative and inference models. The difference between AIWAE and [23] lies in how the gradient (Eq. (9)) is approximated. In [23], only one approximate sample from $p_\theta(z \mid x)$ is used to approximate the expectation for each data point $x$. The approximate sample is generated by first sampling from the approximate distribution $q_\phi(z \mid x)$ and then applying multiple Hamiltonian Monte Carlo (HMC) steps to the sample. In these HMC steps, the energy function is set to $-\log p_\theta(x, z)$. When the posterior distribution has multiple modes, running HMC for a limited number of steps can make it difficult to explore all of them. Multimodality is less of a problem for AIWAE because the annealing process in AIS starts with the smooth energy function $-\log q_\phi(z \mid x)$ and slowly switches to the rugged energy function $-\log p_\theta(x, z)$.
Hamiltonian variational inference (HVI) [24] is another related approach combining MCMC and variational inference. Unlike both AIWAE and [23], HVI treats samples from HMC steps as auxiliary variables and optimizes a single objective function called the auxiliary variational lower bound. A disadvantage of HVI is that it requires learning an extra inference network for the auxiliary variables to reverse the transformations in the HMC steps, which introduces new parameters. In addition, learning the auxiliary variables requires running the backpropagation algorithm backward through the HMC transformations, so the memory cost increases linearly with the number of HMC steps.
5 Experiments
5.1 Dataset and Model Setup
We conducted a series of experiments to evaluate the performance of AIWAE for learning generative models on both the MNIST [25] dataset and the Omniglot [26] dataset. The same generative models were also learned using IWAE, and the performance of the two methods was compared. We used the same generative and inference models as in the IWAE study [10]. The dimension of the latent variable $z$ is 50. For the generative model $p_\theta(x, z)$, the prior distribution $p_\theta(z)$ is a 50-dimensional standard Gaussian distribution. The conditional distribution $p_\theta(x \mid z)$ is a Bernoulli distribution parameterized by a neural network with two hidden layers, each with 200 units. The approximate posterior distribution $q_\phi(z \mid x)$ is a 50-dimensional Gaussian distribution with a diagonal covariance matrix; its mean and variance are similarly parameterized by a neural network with two hidden layers. We used the same optimization procedure [27] as the IWAE study. Detailed information about the datasets, model setup, and optimization is included in the supplementary material.

Models were trained using both IWAE and AIWAE with different hyperparameters. For both IWAE and AIWAE, the number of importance weighted samples $K$ is set to 1, 5, or 50. As shown in Algorithm (1), extra hyperparameters are required for the AIWAE. For each $K$, the number of inverse temperatures $T$ is set to 5, 8, or 11. Given a $T$, the inverse temperatures are evenly spaced between 0 and 1. In HMC, the number of integration steps is set to 5, and the step sizes are dynamically adapted such that the acceptance ratio is close to 0.6.

5.2 Results
Table 3: Results on the MNIST dataset (mean values over 5 repeats).

  K   Method        NLL(test)  NLL(train)  NVLB(test)  NVLB(train)  var gap  gen gap  active units
  1   IWAE              86.07       85.89       86.60        86.26     0.53     0.18            18
  1   AIWAE, T=5        84.55       84.59       87.30        87.22     2.75     0.04            28
  1   AIWAE, T=8        84.31       84.35       87.79        87.72     3.48     0.04            31
  1   AIWAE, T=11       84.19       84.15       88.02        87.94     3.83     0.05            32
  5   IWAE              85.17       84.93       85.64        85.31     0.47     0.24            20
  5   AIWAE, T=5        84.23       84.26       88.11        88.02     3.89     0.04            32
  5   AIWAE, T=8        84.06       84.12       88.95        88.88     4.89     0.06            34
  5   AIWAE, T=11       83.90       83.96       89.59        89.45     5.68     0.05            35
  50  IWAE              84.13       83.95       84.77        84.42     0.63     0.18            23
  50  AIWAE, T=5        83.92       83.96       89.24        89.17     5.32     0.04            35
  50  AIWAE, T=8        83.79       83.84       90.44        90.28     6.65     0.05            38
  50  AIWAE, T=11       83.68       83.81       91.55        91.41     7.87     0.13            40
Table 4: Results on the Omniglot dataset (mean values over 5 repeats).

  K   Method        NLL(test)  NLL(train)  NVLB(test)  NVLB(train)  var gap  gen gap  active units
  1   IWAE             107.39      105.75      108.19       106.32     0.79     1.65            27
  1   AIWAE, T=5       103.23      101.47      107.51       105.37     4.28     1.76            47
  1   AIWAE, T=8       102.63      100.85      108.47       106.20     5.84     1.78            50
  1   AIWAE, T=11      102.43      100.60      109.17       106.78     6.74     1.83            50
  5   IWAE             105.30      103.45      106.30       104.14     1.00     1.85            32
  5   AIWAE, T=5       102.54      100.71      108.60       106.38     6.06     1.83            50
  5   AIWAE, T=8       102.10      100.33      109.81       107.43     7.72     1.77            50
  5   AIWAE, T=11      102.05      100.14      110.46       108.02     8.42     1.90            50
  50  IWAE             103.31      101.29      104.67       102.36     1.36     2.02            39
  50  AIWAE, T=5       102.07      100.18      109.89       107.46     7.82     1.90            50
  50  AIWAE, T=8       101.83       99.89      110.96       108.33     9.13     1.94            50
  50  AIWAE, T=11      101.68       99.71      111.88       109.20    10.20     1.97            50
Models trained with IWAE and AIWAE under different hyperparameters are first evaluated using negative data log-likelihoods (NLLs) and negative variational lower bounds (NVLBs). Following [28], NLLs are calculated using 16 independent AIS chains with 10,000 inverse temperatures evenly spaced between 0 and 1; HMC with 10 leapfrog steps is used as the transition kernel, and the leapfrog step size is tuned to achieve an acceptance ratio of 0.6. Following the IWAE study, NVLBs are calculated as $-\mathcal{L}_K$ (Eq. 3).
Results on the MNIST and Omniglot datasets are presented in Table (3) and Table (4), respectively. Each experiment was repeated 5 times, and the results shown in Tables (3) and (4) are mean values (standard deviations are included in the tables in the supplementary material). For IWAE, the values of NVLB agree with those reported in the IWAE study. As the number of importance weighted samples $K$ increases from 1 to 50, the generative model trained with IWAE improves, as shown by the decreasing NLL values. For a fixed $K$, models trained with AIWAE have lower NLLs than models trained with IWAE for all choices of $T$; therefore, AIWAE produces better density models than IWAE. In addition, as the number of inverse temperatures $T$ increases from 5 to 11 in AIWAE, the resulting density models improve further. Similar to IWAE, models trained with AIWAE with a fixed $T$ also improve when the number of annealed importance weighted samples $K$ increases.
When $K = 50$ and $T = 11$, the model trained with AIWAE achieves negative log-likelihoods of 83.68 and 101.68 on the MNIST and Omniglot datasets, respectively. We note that these results are for permutation-invariant models with only one stochastic layer. For the Omniglot dataset, our best result of 101.68 is better than the best result in the IWAE study, which has an approximate negative log-likelihood of 103.38 and was obtained with two stochastic layers [10]. The generalization ability of the models is quantified by the generalization gap (gen gap in Tables (3) and (4)), defined as the difference between NLL values on test and training data. For the MNIST dataset, models trained with both IWAE and AIWAE have quite small generalization gaps (smaller than 0.25 nats). Most of the models trained with AIWAE have generalization gaps that are not significantly different from 0 nats.
In both VAE and IWAE, a factorized approximate posterior distribution is used to approximate the posterior distribution, and the generative model is trained by optimizing the ELBO objective. In this kind of optimization, the generative model is biased such that its posterior distribution is approximately factorized. Alleviating this bias is the main motivation for replacing the importance sampling used in IWAE with AIS in AIWAE. Here we use the variational gap (var gap in Tables (3) and (4)), defined as the difference between the NVLB and the NLL, to quantify the bias. The results show that models trained with AIWAE have larger variational gaps than those trained with IWAE, meaning that the posterior distributions of models trained with AIWAE deviate more from factorized distributions than do those of models trained with IWAE. Therefore, generative models trained with AIWAE are less biased toward having factorized posterior distributions and have more complex posterior structure. For models trained with AIWAE, as the number of annealed importance weighted samples $K$ or the number of inverse temperatures $T$ increases, the NLLs decrease whereas the NVLBs increase, so the variational gaps increase with either $K$ or $T$. This implies that, as $K$ and $T$ increase, models learned with AIWAE not only become better at density estimation but also have more complex posterior distributions. (A visualization of posterior distributions for models learned with both IWAE and AIWAE when the dimension of $z$ is set to 2 is included in the supplementary material.)
Following the IWAE study [10], we also calculated the number of active latent units and used it to represent how much of the latent space's representational capacity is utilized by learned models. The intuition is that if a latent unit is active in encoding information about the observation, its distribution is expected to change with the observation. Therefore, we used the variance statistic $A_u = \mathrm{Var}_{x}\left(\mathbb{E}_{q_\phi(z \mid x)}[z_u]\right)$ to quantify the activity of latent unit $u$; it measures how much the value of the latent unit changes as the observation $x$ varies over the test set. A latent unit is defined to be active if $A_u > 0.01$. Both the statistic and the cutoff value are adopted from the IWAE study [10]. As shown in Tables (3) and (4), the number of active units in models trained with AIWAE is much larger than in models trained with IWAE. In addition, the number of active units in models trained with AIWAE increases monotonically not only with the number of samples $K$ but also with the number of inverse temperatures $T$. Intuitively, when the number of inverse temperatures increases, the annealing process in AIWAE becomes longer and smoother, making it easier for samples to explore more of the latent space. On the Omniglot dataset, all latent units are active for most of the models trained with AIWAE.
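The active-units statistic can be computed directly from the per-example posterior means; a minimal sketch (the array name is ours):

```python
import numpy as np

def count_active_units(post_means, threshold=1e-2):
    """post_means: array of shape (n_examples, n_latent) holding the
    posterior mean E_q[z | x] for each test example.  A unit u is
    counted as active when Var_x(E_q[z_u | x]) exceeds the threshold."""
    return int(np.sum(np.var(post_means, axis=0) > threshold))
```

Units whose posterior mean barely moves across examples fall below the 0.01 cutoff and are counted as inactive.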
6 Conclusion
We presented the annealed importance weighted autoencoder (AIWAE), a learning algorithm for training probabilistic generative models with latent variables. AIWAE combines multi-sample-based and flow-based approaches through annealed importance sampling to better approximate the posterior distribution. In contrast with previous flow-based approaches, AIWAE does not require running backpropagation backward through flows and has a memory cost that is constant in the depth of the flow. AIWAE can also be viewed as a way of combining MCMC with variational inference, trading learning speed for model accuracy. The annealed sampling process used in AIWAE facilitates sampling from complex posterior distributions. In experiments, we demonstrate that, compared with models learned with IWAE, models learned with AIWAE assign higher likelihood to data, have more complex posterior distributions, and utilize more of their latent space representational capacity.
References

(1)
Martin J Wainwright, Michael I Jordan, et al.
Graphical models, exponential families, and variational inference.
Foundations and Trends® in Machine Learning
, 1(1–2):1–305, 2008.  (2) Matthew D Hoffman, David M Blei, Chong Wang, and John Paisley. Stochastic variational inference. The Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 (3) David M Blei, Alp Kucukelbir, and Jon D McAuliffe. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 (4) Herbert Robbins and Sutton Monro. A stochastic approximation method. The annals of mathematical statistics, pages 400–407, 1951.
 (5) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. arXiv preprint arXiv:1401.4082, 2014.
 (6) Diederik P Kingma and Max Welling. Autoencoding variational bayes. arXiv preprint arXiv:1312.6114, 2013.
 (7) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. arXiv preprint arXiv:1505.05770, 2015.
 (8) Durk P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In Advances in neural information processing systems, pages 4743–4751, 2016.
 (9) Anthony L Caterini, Arnaud Doucet, and Dino Sejdinovic. Hamiltonian variational autoencoder. In Advances in Neural Information Processing Systems, pages 8167–8177, 2018.
 (10) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. arXiv preprint arXiv:1509.00519, 2015.
 (11) Chris Cremer, Quaid Morris, and David Duvenaud. Reinterpreting importanceweighted autoencoders. arXiv preprint arXiv:1704.02916, 2017.
 (12) Philip Bachman and Doina Precup. Training deep generative models: Variations on a theme. In NIPS Approximate Inference Workshop, 2015.
 (13) Jörg Bornschein and Yoshua Bengio. Reweighted wakesleep. arXiv preprint arXiv:1406.2751, 2014.
 (14) Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 (15) Christian Robert and George Casella. Monte Carlo statistical methods. Springer Science & Business Media, 2013.
 (16) Nicholas Metropolis, Arianna W Rosenbluth, Marshall N Rosenbluth, Augusta H Teller, and Edward Teller. Equation of state calculations by fast computing machines. The journal of chemical physics, 21(6):1087–1092, 1953.
 (17) W. K. Hastings. Monte Carlo sampling methods using Markov chains and their applications. Biometrika, 57(1):97–109, 04 1970.
 (18) Radford M Neal et al. Mcmc using hamiltonian dynamics. Handbook of markov chain monte carlo, 2(11):2, 2011.
 (19) Yichuan Tang and Ruslan R Salakhutdinov. Learning stochastic feedforward neural networks. In Advances in Neural Information Processing Systems, pages 530–538, 2013.
 (20) Radford M Neal and Geoffrey E Hinton. A view of the em algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
 (21) Tapani Raiko, Mathias Berglund, Guillaume Alain, and Laurent Dinh. Techniques for learning binary stochastic feedforward neural networks. arXiv preprint arXiv:1406.2989, 2014.
 (22) Andriy Mnih and Danilo J Rezende. Variational inference for monte carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
 (23) Matthew D Hoffman. Learning deep latent gaussian models with markov chain monte carlo. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pages 1510–1519. JMLR. org, 2017.
 (24) Tim Salimans, Diederik Kingma, and Max Welling. Markov chain monte carlo and variational inference: Bridging the gap. In International Conference on Machine Learning, pages 1218–1226, 2015.
 (25) Yann LeCun, Léon Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.
 (26) Brenden M Lake, Ruslan Salakhutdinov, and Joshua B Tenenbaum. Human-level concept learning through probabilistic program induction. Science, 350(6266):1332–1338, 2015.
 (27) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
 (28) Yuhuai Wu, Yuri Burda, Ruslan Salakhutdinov, and Roger Grosse. On the quantitative analysis of decoder-based generative models. arXiv preprint arXiv:1611.04273, 2016.
A Details about datasets, model setup and optimization
In the MNIST dataset, there are 60,000 training examples and 10,000 test examples. In the Omniglot dataset, there are 24,345 training examples and 8,070 test examples (the Omniglot data was downloaded from https://github.com/yburda/iwae.git). Images from both datasets have a dimension of $28 \times 28$ pixels. For both training and testing, images are dynamically binarized into vectors of 0s and 1s: each pixel is set to 1 with probability equal to its normalized intensity in $[0, 1]$, so every pass over the data sees a fresh binary sample.
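The dynamic binarization step can be sketched in a few lines of NumPy (the function name and batch shape below are illustrative, not from the paper):

```python
import numpy as np

def dynamic_binarize(images, rng):
    """Resample binary images: each pixel is 1 with probability equal to
    its normalized intensity in [0, 1]."""
    return (rng.random(images.shape) < images).astype(np.float32)

rng = np.random.default_rng(0)
batch = rng.random((20, 784))            # stand-in for normalized pixel values
binary = dynamic_binarize(batch, rng)    # a fresh sample each call
```

Because the sampling is repeated on every pass, the model never sees the same binarization of an image twice, which acts as a mild regularizer.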
We used the same generative model and the same inference model as those used in the IWAE study. The dimension of the latent variable $z$ is 50. For the generative model $p_\theta(x, z)$, the prior distribution $p(z)$ is a 50-dimensional standard Gaussian distribution. The generative conditional distribution $p_\theta(x|z)$ is a Bernoulli distribution whose probability is parameterized by a neural network with two hidden layers, $z \rightarrow h_1 \rightarrow h_2 \rightarrow x$, where both $h_1$ and $h_2$ have 200 units and the $x$ layer has 784 units. The approximate posterior distribution $q_\phi(z|x)$ is a 50-dimensional Gaussian distribution with a diagonal covariance matrix. Its mean and variance are similarly parameterized by a neural network with two hidden layers, $x \rightarrow h_1 \rightarrow h_2 \rightarrow (\mu, \sigma^2)$, where both $h_1$ and $h_2$ have 200 units.
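The layer sizes above can be sketched as follows. This is a minimal NumPy sketch of the shapes only; the tanh activations, the initialization scale, and the function names are assumptions, not taken from the text:

```python
import numpy as np

def mlp(x, layers):
    """Two-hidden-layer network: tanh activations (assumed), linear output."""
    h = x
    for W, b in layers[:-1]:
        h = np.tanh(h @ W + b)
    W, b = layers[-1]
    return h @ W + b

def init(shapes, rng):
    return [(0.01 * rng.standard_normal((m, n)), np.zeros(n)) for m, n in shapes]

rng = np.random.default_rng(0)
# Generative side: z (50) -> h1 (200) -> h2 (200) -> Bernoulli logits for x (784)
decoder = init([(50, 200), (200, 200), (200, 784)], rng)
# Inference side: x (784) -> h1 (200) -> h2 (200) -> mean and log-variance (2 x 50)
encoder = init([(784, 200), (200, 200), (200, 100)], rng)

z = rng.standard_normal((4, 50))
probs = 1.0 / (1.0 + np.exp(-mlp(z, decoder)))      # Bernoulli probabilities
x = (rng.random(probs.shape) < probs).astype(float) # sampled binary images
mu, log_var = np.split(mlp(x, encoder), 2, axis=1)  # Gaussian posterior stats
```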
We used the same optimization setup as that used in the IWAE study. The Adam optimizer is used with the same hyperparameters as in that study. The optimization proceeded in stages of $3^i$ passes over the data with a learning rate of $0.001 \cdot 10^{-i/7}$ for $i = 0, \ldots, 7$. Overall, the optimization was run for 3,280 epochs. For IWAE, the size of a minibatch is 20, the same as that in the IWAE study. For AIWAE, the minibatch size is set to 128 to accelerate training.
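The stage structure of this schedule (as described in the original IWAE study; reproduced here as an assumption) can be checked in a few lines:

```python
# Learning-rate schedule assumed from the IWAE study: stage i runs 3**i
# epochs with learning rate 1e-3 * 10**(-i / 7), for i = 0, ..., 7.
stage_epochs = [3 ** i for i in range(8)]
stage_rates = [1e-3 * 10 ** (-i / 7) for i in range(8)]
total_epochs = sum(stage_epochs)   # matches the 3,280 epochs quoted above
```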
B Performance of models trained with both AIWAE and IWAE on both the MNIST and the Omniglot datasets when the dimension of the latent variable is equal to 50
Each experiment was repeated 5 times. The results shown in Tables 3 and 4 are mean values over the 5 repeats; values in parentheses are standard deviations.
Table 3: Results on the MNIST dataset (dimension of the latent variable equal to 50).

K = 1:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 86.07 (0.07) | 84.55 (0.07) | 84.31 (0.08) | 84.19 (0.06) |
| NLL (train) | 85.89 (0.07) | 84.59 (0.09) | 84.35 (0.06) | 84.15 (0.07) |
| NVLB (test) | 86.60 (0.07) | 87.30 (0.08) | 87.79 (0.17) | 88.02 (0.17) |
| NVLB (train) | 86.26 (0.05) | 87.22 (0.08) | 87.72 (0.12) | 87.94 (0.15) |
| var gap | 0.53 (0.06) | 2.75 (0.08) | 3.48 (0.11) | 3.83 (0.21) |
| gen gap | 0.18 (0.08) | 0.04 (0.04) | 0.04 (0.07) | 0.05 (0.03) |
| active units | 18 | 28 | 31 | 32 |

K = 5:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 85.17 (0.10) | 84.23 (0.10) | 84.06 (0.05) | 83.90 (0.03) |
| NLL (train) | 84.93 (0.06) | 84.26 (0.08) | 84.12 (0.05) | 83.96 (0.04) |
| NVLB (test) | 85.64 (0.09) | 88.11 (0.15) | 88.95 (0.18) | 89.59 (0.27) |
| NVLB (train) | 85.31 (0.06) | 88.02 (0.13) | 88.88 (0.15) | 89.45 (0.26) |
| var gap | 0.47 (0.10) | 3.89 (0.17) | 4.89 (0.20) | 5.68 (0.26) |
| gen gap | 0.24 (0.10) | 0.04 (0.06) | 0.06 (0.06) | 0.05 (0.03) |
| active units | 20 | 32 | 34 | 35 |

K = 50:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 84.13 (0.06) | 83.92 (0.07) | 83.79 (0.06) | 83.68 (0.10) |
| NLL (train) | 83.95 (0.06) | 83.96 (0.05) | 83.84 (0.09) | 83.81 (0.09) |
| NVLB (test) | 84.77 (0.05) | 89.24 (0.21) | 90.44 (0.50) | 91.55 (1.15) |
| NVLB (train) | 84.42 (0.03) | 89.17 (0.21) | 90.28 (0.47) | 91.41 (1.16) |
| var gap | 0.63 (0.09) | 5.32 (0.23) | 6.65 (0.45) | 7.87 (1.06) |
| gen gap | 0.18 (0.01) | 0.04 (0.04) | 0.05 (0.05) | 0.13 (0.02) |
| active units | 23 | 35 | 38 | 40 |
Table 4: Results on the Omniglot dataset (dimension of the latent variable equal to 50).

K = 1:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 107.39 (0.16) | 103.23 (0.07) | 102.63 (0.09) | 102.43 (0.11) |
| NLL (train) | 105.75 (0.11) | 101.47 (0.06) | 100.85 (0.10) | 100.60 (0.10) |
| NVLB (test) | 108.19 (0.16) | 107.51 (0.07) | 108.47 (0.08) | 109.17 (0.10) |
| NVLB (train) | 106.32 (0.13) | 105.37 (0.06) | 106.20 (0.08) | 106.78 (0.15) |
| var gap | 0.79 (0.03) | 4.28 (0.08) | 5.84 (0.10) | 6.74 (0.08) |
| gen gap | 1.65 (0.10) | 1.76 (0.03) | 1.78 (0.07) | 1.83 (0.05) |
| active units | 27 | 47 | 50 | 50 |

K = 5:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 105.30 (0.07) | 102.54 (0.08) | 102.10 (0.07) | 102.05 (0.06) |
| NLL (train) | 103.45 (0.06) | 100.71 (0.05) | 100.33 (0.04) | 100.14 (0.08) |
| NVLB (test) | 106.30 (0.04) | 108.60 (0.14) | 109.81 (0.10) | 110.46 (0.14) |
| NVLB (train) | 104.14 (0.04) | 106.38 (0.11) | 107.43 (0.07) | 108.02 (0.12) |
| var gap | 1.00 (0.06) | 6.06 (0.10) | 7.72 (0.12) | 8.42 (0.14) |
| gen gap | 1.85 (0.05) | 1.83 (0.04) | 1.77 (0.06) | 1.90 (0.04) |
| active units | 32 | 50 | 50 | 50 |

K = 50:
| Metric | IWAE | AIWAE (T=5) | AIWAE (T=8) | AIWAE (T=11) |
|---|---|---|---|---|
| NLL (test) | 103.31 (0.09) | 102.07 (0.08) | 101.83 (0.07) | 101.68 (0.09) |
| NLL (train) | 101.29 (0.06) | 100.18 (0.06) | 99.89 (0.08) | 99.71 (0.06) |
| NVLB (test) | 104.67 (0.12) | 109.89 (0.15) | 110.96 (0.12) | 111.88 (0.14) |
| NVLB (train) | 102.36 (0.04) | 107.46 (0.10) | 108.33 (0.11) | 109.20 (0.13) |
| var gap | 1.36 (0.06) | 7.82 (0.08) | 9.13 (0.14) | 10.20 (0.06) |
| gen gap | 2.02 (0.05) | 1.90 (0.04) | 1.94 (0.09) | 1.97 (0.03) |
| active units | 39 | 50 | 50 | 50 |
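The reported gaps are consistent with reading the variational gap as NVLB minus NLL on the test set and the generalization gap as test NLL minus train NLL. This is our reading of the tables, not a definition stated in this appendix; for example, for IWAE with K = 1 on MNIST:

```python
# IWAE, K = 1, MNIST; values copied from the table above.
nll_test, nll_train, nvlb_test = 86.07, 85.89, 86.60
var_gap = round(nvlb_test - nll_test, 2)   # variational gap, matches 0.53
gen_gap = round(nll_test - nll_train, 2)   # generalization gap, matches 0.18
```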
C Posterior distributions of models trained with both AIWAE and IWAE on the MNIST dataset when the dimension of $z$ is equal to 2
In order to visualize the posterior distribution of the latent variable, we also trained models with the dimension of $z$ being 2 using both AIWAE and IWAE on the MNIST dataset. The rest of the model setup and the optimization procedure are the same as those used for models with the dimension of $z$ being 50.
Figures S0–S9 show posterior distributions of models trained with both IWAE and AIWAE for examples of digits 0–9. The first row represents models learned with IWAE, and the second through last rows represent models learned with AIWAE using different numbers of temperatures. The left, middle, and right columns represent models learned with 1, 5, and 50 samples, respectively. The digits used for calculating the posterior distributions shown in Figures S0–S9 are shown in Figure S10.
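With a 2-dimensional $z$, the posterior can be visualized by evaluating the unnormalized density $p(x|z)\,p(z)$ on a grid of latent values. A minimal sketch, with a random toy linear map standing in for the trained decoder (the weights, grid range, and example image are illustrative):

```python
import numpy as np

def log_joint(z, x, decode):
    """log p(x|z) + log p(z), i.e. the unnormalized log-posterior, for binary x."""
    logits = decode(z)                                   # Bernoulli logits for x
    log_px = np.sum(x * logits - np.logaddexp(0.0, logits), axis=-1)
    log_pz = -0.5 * np.sum(z ** 2, axis=-1) \
             - z.shape[-1] / 2.0 * np.log(2.0 * np.pi)   # standard Gaussian prior
    return log_px + log_pz

rng = np.random.default_rng(0)
W = 0.1 * rng.standard_normal((2, 784))    # toy stand-in for the trained decoder
decode = lambda z: z @ W

grid = np.linspace(-3.0, 3.0, 50)
zz = np.stack(np.meshgrid(grid, grid), axis=-1).reshape(-1, 2)
x = (rng.random(784) < 0.5).astype(float)  # one binarized example image

lp = log_joint(zz, x, decode)
density = np.exp(lp - lp.max())            # rescaled to max 1 for plotting
```

Reshaping `density` back to the 50-by-50 grid and drawing it as a heat map gives the kind of posterior plots shown in the figures.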