Improving Importance Weighted Auto-Encoders with Annealed Importance Sampling

06/12/2019 · Xinqiang Ding et al. · University of Michigan, The University of Chicago

Stochastic variational inference with an amortized inference model and the reparameterization trick has become a widely used algorithm for learning latent variable models. Increasing the flexibility of approximate posterior distributions while maintaining computational tractability is one of the core problems in stochastic variational inference. Two families of approaches proposed to address the problem are flow-based approaches and multisample-based approaches such as the importance weighted auto-encoder (IWAE). We introduce a new learning algorithm, the annealed importance weighted auto-encoder (AIWAE), for learning latent variable models. The AIWAE combines multisample-based and flow-based approaches through annealed importance sampling, and its memory cost stays constant as the depth of the flow increases. The flow constructed using an annealing process in the AIWAE facilitates the exploration of the latent space when the posterior distribution has multiple modes. Through computational experiments, we show that, compared to models trained with the IWAE, AIWAE-trained models are better density models, have more complex posterior distributions and use more of the latent space's representation capacity.


1 Introduction

Stochastic variational inference wainwright2008graphical ; hoffman2013stochastic ; blei2017variational is a scalable inference method for learning generative models with latent variables using stochastic optimization robbins1951stochastic . The method becomes especially scalable and efficient for models with continuous latent variables when it is combined with an amortized inference model and the reparameterization trick rezende2014stochastic ; kingma2013auto . The resulting method is commonly referred to as the variational auto-encoder (VAE) kingma2013auto . In VAEs, a variational family of distributions parameterized by the inference model is used to approximate the posterior distribution of latent variables. VAEs learn both the generative and inference models simultaneously by maximizing the evidence lower bound rezende2014stochastic ; kingma2013auto . The variational distributions used in VAEs to approximate the posterior distribution of latent variables are commonly chosen to be fully factorized, whereas the true posterior distribution is not necessarily fully factorized and might even have multiple modes. Because the optimization of the generative model depends heavily on the approximate posterior distribution, generative models learned using VAEs are biased toward having factorized posterior distributions and can be suboptimal.

One of the core problems in variational inference has been to increase the expressibility of approximate posterior distributions while maintaining efficient optimization blei2017variational . Two kinds of approaches developed to address this problem are of interest for the current study: flow-based approaches rezende2015variational ; kingma2016improved ; caterini2018hamiltonian and multisample-based approaches such as the importance weighted auto-encoder (IWAE) burda2015importance .

In flow-based approaches, a chain of invertible transformations, referred to as a flow, is applied to samples from a simple factorized distribution so that the resulting samples have a more flexible distribution. Examples of flow-based approaches include the normalizing flow (NF) rezende2015variational , the inverse autoregressive flow (IAF) kingma2016improved , and the Hamiltonian variational auto-encoder (HVAE) caterini2018hamiltonian , among others. Both the NF and the IAF introduce new parameters into the inference model. In contrast, the HVAE does not introduce new parameters to the inference model, and the flow in the HVAE is guided by the generative model. For all three flow-based approaches, calculating parameter gradients requires running the backpropagation algorithm through the flow in the reverse direction. Therefore, both the computation and memory costs increase linearly with the depth of the flow.

Like the HVAE, the multisample-based IWAE does not introduce new parameters to the inference model. The original motivation for the IWAE was to use multiple samples from an approximate posterior distribution to construct a tighter evidence lower bound, and it was shown that optimizing this tighter lower bound helps learn better generative models burda2015importance . Later, an alternative interpretation of the IWAE was given in cremer2017reinterpreting ; bachman2015training : multiple samples from an approximate distribution are used to implicitly define a more flexible approximate posterior distribution based on importance sampling. The tighter lower bound in the IWAE can then be understood as the standard evidence lower bound used in the VAE, evaluated with this implicitly defined flexible approximate posterior distribution cremer2017reinterpreting ; bachman2015training .

Here we introduce the annealed importance weighted auto-encoder (AIWAE) for learning generative models with latent variables. The AIWAE combines multisample-based and flow-based approaches through annealed importance sampling: the flow used in the AIWAE is constructed through an annealing process that facilitates better sampling from the posterior distribution. First, we present an alternative interpretation of how the IWAE optimizes generative model parameters: the IWAE optimizes the generative model parameters by maximizing the data log-likelihood, and the gradient of the data log-likelihood is estimated using importance sampling bornschein2014reweighted . With this interpretation, we can naturally generalize the importance sampling to the annealed importance sampling (AIS) neal2001annealed to better estimate the gradient of the data log-likelihood with respect to the generative model parameters. The approximate posterior distribution parameterized by the inference model and the true posterior distribution are used as the initial and target distributions for the AIS, respectively. The inference model parameters are learned by minimizing the Kullback-Leibler (KL) divergence between these two distributions. From a flow-based point of view, samples from the initial distribution of the AIS also go through a chain of transformations constructed using an annealing process. In contrast to previous flow-based approaches, the flow in the AIWAE is guided by the posterior distribution and does not add new parameters to the inference model. The annealing process can facilitate the exploration of the posterior distribution when it has multiple modes. In addition, training models with the AIWAE does not require running the backpropagation algorithm backward through the flow, which gives the AIWAE a memory cost that is constant in the depth of the flow.

2 Background

2.1 Variational Auto-Encoder and Importance Weighted Auto-Encoder

The generative model of interest is defined by a joint distribution of both data $x$ and continuous latent variables $z$: $p_\theta(x, z) = p_\theta(z)\, p_\theta(x|z)$, where $\theta$ represents the parameters of the generative model. Learning $\theta$ by maximizing the data likelihood $p_\theta(x)$ requires calculating expectations with respect to the posterior distribution of latent variables $p_\theta(z|x)$, which is computationally expensive when analytical expressions for the expectation or for $p_\theta(z|x)$ are not available. To efficiently learn the generative model, the variational auto-encoder (VAE) rezende2014stochastic ; kingma2013auto uses an approximate posterior distribution $q_\phi(z|x)$ and maximizes the evidence lower bound (ELBO) objective function $\mathcal{L}(\theta, \phi; x)$:

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{q_\phi(z|x)}\left[ \log \frac{p_\theta(x, z)}{q_\phi(z|x)} \right]$   (1)

The gradient of $\mathcal{L}$ with respect to $\theta$, $\nabla_\theta \mathcal{L}$, is estimated by Monte Carlo sampling of $z$ from $q_\phi(z|x)$. To efficiently estimate the gradient of $\mathcal{L}$ with respect to $\phi$, the VAE reparameterizes the approximate posterior distribution as $z = g_\phi(\epsilon, x)$, where $\epsilon$ is a random variable drawn from a fixed distribution and $\phi$ represents the parameters of the transformation $g_\phi$. When the latent variable $z$ is continuous, a common choice for parameterizing the approximate posterior distribution is $q_\phi(z|x) = \mathcal{N}\!\left(z; \mu_\phi(x), \operatorname{diag}(\sigma_\phi^2(x))\right)$ with $z = \mu_\phi(x) + \sigma_\phi(x) \odot \epsilon$ and $\epsilon \sim \mathcal{N}(0, I)$ rezende2014stochastic ; kingma2013auto . With this parameterization, the ELBO function becomes

$\mathcal{L}(\theta, \phi; x) = \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ \log \frac{p_\theta\!\left(x, g_\phi(\epsilon, x)\right)}{q_\phi\!\left(g_\phi(\epsilon, x) \mid x\right)} \right]$   (2)

and its gradient with respect to $\phi$, $\nabla_\phi \mathcal{L}$, can be estimated by Monte Carlo sampling of $\epsilon$ from $\mathcal{N}(0, I)$, in the same way as the gradient $\nabla_\theta \mathcal{L}$.
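For concreteness, the reparameterized single-sample ELBO estimate of Eq. (2) can be sketched in a few lines of PyTorch. This is a minimal illustration, not the authors' code; encoder, log_px_given_z and log_pz are assumed callables returning the Gaussian parameters of q_phi(z|x), log p_theta(x|z) and log p_theta(z), respectively.

import torch

def elbo_estimate(x, encoder, log_px_given_z, log_pz):
    mu, log_var = encoder(x)                      # parameters of q_phi(z|x)
    std = torch.exp(0.5 * log_var)
    eps = torch.randn_like(std)                   # epsilon ~ N(0, I)
    z = mu + std * eps                            # reparameterization used in Eq. (2)
    log_qz = torch.distributions.Normal(mu, std).log_prob(z).sum(-1)
    log_pxz = log_px_given_z(x, z) + log_pz(z)    # log p_theta(x, z)
    return (log_pxz - log_qz).mean()              # single-sample Monte Carlo ELBO estimate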

In the importance weighted auto-encoder (IWAE) burda2015importance , the following new ELBO function based on multiple ($K$) samples is introduced:

$\mathcal{L}_K(\theta, \phi; x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z|x)}\left[ \log \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x, z_k)}{q_\phi(z_k|x)} \right]$   (3)

As $K$ increases, the ELBO function $\mathcal{L}_K$ forms a tighter lower bound of the data log-likelihood. The gradient of $\mathcal{L}_K$ with respect to $\theta$ is

$\nabla_\theta \mathcal{L}_K(\theta, \phi; x) = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z|x)}\left[ \nabla_\theta \log \frac{1}{K} \sum_{k=1}^{K} w_k \right]$   (4)
$\phantom{\nabla_\theta \mathcal{L}_K(\theta, \phi; x)} = \mathbb{E}_{z_1, \dots, z_K \sim q_\phi(z|x)}\left[ \sum_{k=1}^{K} \tilde{w}_k \nabla_\theta \log p_\theta(x, z_k) \right]$   (5)

where $w_k = p_\theta(x, z_k) / q_\phi(z_k|x)$ represents the importance weight of the sample $z_k$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$ are the normalized importance weights. The gradient of $\mathcal{L}_K$ with respect to $\phi$ can be calculated similarly using the reparameterization trick burda2015importance .
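The K-sample bound of Eq. (3) and the normalized weights of Eq. (5) are conveniently computed in log space. A minimal sketch, assuming log_w holds the log importance weights log w_k = log p_theta(x, z_k) - log q_phi(z_k|x) for K samples per data point:

import math
import torch

def iwae_bound(log_w):
    # log_w: tensor of shape [K, batch] with log importance weights
    K = log_w.shape[0]
    return (torch.logsumexp(log_w, dim=0) - math.log(K)).mean()   # estimate of L_K (Eq. 3)

def normalized_weights(log_w):
    # w_tilde_k of Eq. (5), computed stably in log space
    return torch.softmax(log_w, dim=0)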

2.2 Importance Sampling and Annealed Importance Sampling

Importance sampling is a widely used statistical technique for computing the expectation of a function $h(z)$ with respect to a probability distribution whose density is proportional to a function $f(z)$ with an unknown normalization constant. When it is difficult to draw independent samples from the distribution defined by $f$, importance sampling robert2013monte uses a proposal distribution from which it is feasible to draw independent samples directly. Suppose the proposal distribution has a probability density that is proportional to $g(z)$. With independent samples $z_1, \dots, z_K$ drawn from the proposal distribution, the expectation can be estimated by

$\mathbb{E}_f[h(z)] \approx \sum_{k=1}^{K} \tilde{w}_k\, h(z_k)$   (6)

where $w_k = f(z_k) / g(z_k)$ is the weight assigned to the sample $z_k$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$ is the normalized weight. The accuracy of the estimator in Eq. (6) depends on the variability of the weights $w_k$, which in turn depends on how well the proposal distribution defined by $g$ approximates the target distribution defined by $f$. When $z$ is high dimensional and $f$ is complex and has multiple modes, it is difficult to find a proposal distribution that is not only a good approximation of the target distribution but also easy to draw independent samples from.
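The self-normalized estimator of Eq. (6) in NumPy; h, log_f and log_g are assumed callables returning per-sample values of the integrand and the unnormalized log densities of the target and proposal distributions:

import numpy as np

def importance_sampling_estimate(h, log_f, log_g, z):
    # z: array of K samples drawn from the proposal distribution defined by g
    log_w = log_f(z) - log_g(z)                            # log w_k = log f(z_k) - log g(z_k)
    w_tilde = np.exp(log_w - np.logaddexp.reduce(log_w))   # normalized weights w_tilde_k
    return np.sum(w_tilde * h(z))                          # Eq. (6)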

An alternative approach is to use Markov chain Monte Carlo (MCMC) methods robert2013monte to draw dependent samples from the target distribution defined by $f$. This works by making random perturbations to samples and accepting or rejecting the perturbed samples based on the Metropolis-Hastings criterion metropolis1953equation ; hastings1970monte . Although theoretical results show that, under relatively weak conditions, the samples eventually converge to the target distribution robert2013monte , the Markov chain may have to run for too long to be practical, especially when the target distribution has multiple modes that are connected only through low-density regions.

The annealed importance sampling (AIS) neal2001annealed was introduced to alleviate challenges evident in both importance sampling and MCMC. In AIS, a sequence of distributions with probability densities proportional to $f_1(z), f_2(z), \dots, f_T(z)$ is constructed to connect the initial distribution defined by $f_1 = g$ and the target distribution defined by $f_T = f$. A generally useful way to construct these intermediate distributions is to let

$f_t(z) = g(z)^{1 - \beta_t}\, f(z)^{\beta_t}$   (7)

where $0 = \beta_1 < \beta_2 < \dots < \beta_T = 1$. To generate a sample $z_k$ and calculate its weight $w_k$, a sequence of samples $z_k^{(1)}, \dots, z_k^{(T)}$ is generated as follows. Initially, $z_k^{(1)}$ is generated by sampling from the distribution defined by $f_1 = g$, which is chosen to be a distribution from which independent samples can be easily drawn. For $t = 2, \dots, T$, $z_k^{(t)}$ is generated from $z_k^{(t-1)}$ using a reversible transition kernel $\mathcal{T}_t$ that keeps the distribution defined by $f_t$ invariant. Then the sample $z_k$ is set to $z_k^{(T)}$ and the weight $w_k$ is calculated as

$w_k = \frac{f_2(z_k^{(1)})}{f_1(z_k^{(1)})} \cdot \frac{f_3(z_k^{(2)})}{f_2(z_k^{(2)})} \cdots \frac{f_T(z_k^{(T-1)})}{f_{T-1}(z_k^{(T-1)})}$   (8)

After generating $K$ samples $z_1, \dots, z_K$ and their weights $w_1, \dots, w_K$, the expectation $\mathbb{E}_f[h(z)]$ can be estimated using Eq. (6) as before.
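A compact sketch of the AIS procedure of Eqs. (7) and (8) with a generic transition kernel. The names sample_initial and transition are assumptions: the former draws i.i.d. samples from the initial distribution, and the latter is any MCMC kernel that leaves the supplied unnormalized log density invariant.

import numpy as np

def ais_sample(log_g, log_f, sample_initial, transition, betas, K):
    # betas: inverse temperatures with betas[0] = 0 and betas[-1] = 1  (Eq. 7)
    z = sample_initial(K)               # z_k^(1): i.i.d. samples from the initial distribution
    log_w = np.zeros(K)
    for t in range(1, len(betas)):
        # accumulate log f_t(z^(t-1)) - log f_{t-1}(z^(t-1))  (Eq. 8)
        log_w += (betas[t] - betas[t - 1]) * (log_f(z) - log_g(z))
        # move the samples with a kernel that leaves f_t(z) = g(z)^(1-beta_t) f(z)^beta_t invariant
        log_f_t = lambda x, b=betas[t]: (1.0 - b) * log_g(x) + b * log_f(x)
        z = transition(z, log_f_t)
    return z, log_w                     # plug into Eq. (6) after normalizing exp(log_w)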

3 Annealed Importance Weighted Autoencoders

In this section, we introduce the annealed importance weighted auto-encoder (AIWAE) for learning the generative model $p_\theta(x, z)$ by maximizing the data log-likelihood $\log p_\theta(x)$. The AIWAE originates from the importance sampling interpretation of the gradient in Eq. (5). With this interpretation, Eq. (5) can be naturally generalized to utilize the AIS neal2001annealed .

Let us consider learning the parameters $\theta$ of the generative model by maximizing the data log-likelihood $\log p_\theta(x)$ with stochastic gradient ascent. To do so, we need to estimate the gradient of the data log-likelihood with respect to $\theta$:

$\nabla_\theta \log p_\theta(x) = \mathbb{E}_{p_\theta(z|x)}\left[ \nabla_\theta \log p_\theta(x, z) \right]$   (9)

Because directly drawing independent samples from the posterior distribution $p_\theta(z|x)$ is usually not feasible, we can use the importance sampling approach to estimate the above expectation. If we choose the proposal distribution to be $q_\phi(z|x)$, then setting $f(z) = p_\theta(x, z)$ and $g(z) = q_\phi(z|x)$ in Eq. (6) yields the following estimator:

$\nabla_\theta \log p_\theta(x) \approx \sum_{k=1}^{K} \tilde{w}_k\, \nabla_\theta \log p_\theta(x, z_k)$   (10)

where $w_k = p_\theta(x, z_k) / q_\phi(z_k|x)$ and $\tilde{w}_k = w_k / \sum_{j=1}^{K} w_j$. This estimator (Eq. 10) is the same as the estimator in Eq. (5) used to estimate the gradient of the ELBO function $\mathcal{L}_K$ with respect to $\theta$. Therefore, in terms of learning the parameter $\theta$, an alternative interpretation of the IWAE is that it learns $\theta$ by maximizing the data log-likelihood, with the gradient estimated by importance sampling using the approximate posterior distribution $q_\phi(z|x)$ as the proposal distribution.

With the importance sampling view, the estimator in Eq. (10) can be improved using the AIS neal2001annealed as follows. The unnormalized density of the target distribution is set to $f(z) = p_\theta(x, z)$ and the unnormalized density of the initial distribution is set to $g(z) = q_\phi(z|x)$. A sequence of intermediate distributions is constructed using Eq. (7), i.e., $f_t(z) = q_\phi(z|x)^{1 - \beta_t}\, p_\theta(x, z)^{\beta_t}$. The reversible transition kernel $\mathcal{T}_t$ that leaves the distribution defined by $f_t$ invariant is constructed using the Hamiltonian Monte Carlo (HMC) method neal2011mcmc , in which the potential energy function is set to $U_t(z) = -\log f_t(z)$. With the gradient in Eq. (10) estimated using AIS, we can optimize the parameter $\theta$ by maximizing the data log-likelihood with stochastic gradient ascent.
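For completeness, a minimal HMC transition kernel of the kind used for the kernels T_t is sketched below in PyTorch; the gradient of the potential energy U_t(z) = -log f_t(z) is obtained by automatic differentiation. This is an illustrative sketch, not the authors' implementation; eps and L are the leapfrog step size and the number of leapfrog steps listed in Algorithm 1.

import torch

def hmc_step(z, potential, eps, L):
    # One HMC transition that leaves the distribution proportional to exp(-potential(z)) invariant.
    z = z.detach().requires_grad_(True)
    def grad_U(z_):
        return torch.autograd.grad(potential(z_).sum(), z_)[0]
    p = torch.randn_like(z)                          # sample momenta
    z_new, p_new = z, p - 0.5 * eps * grad_U(z)      # half step for the momentum
    for i in range(L):                               # leapfrog integration
        z_new = (z_new + eps * p_new).detach().requires_grad_(True)
        if i < L - 1:
            p_new = p_new - eps * grad_U(z_new)
    p_new = p_new - 0.5 * eps * grad_U(z_new)        # final half step for the momentum
    # Metropolis-Hastings accept/reject based on the change in the Hamiltonian
    H_old = potential(z) + 0.5 * (p ** 2).sum(-1)
    H_new = potential(z_new) + 0.5 * (p_new ** 2).sum(-1)
    accept = (torch.rand_like(H_old) < torch.exp(H_old - H_new)).float().unsqueeze(-1)
    return (accept * z_new + (1.0 - accept) * z).detach()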

The importance sampling interpretation of Eq. (5) and the optimization procedure that uses AIS to estimate gradients of the data log-likelihood apply only to the parameters $\theta$ of the generative model $p_\theta(x, z)$. How should we optimize the parameters $\phi$ of the approximate inference model $q_\phi(z|x)$? In the AIWAE, we take the following general strategy. The main objective of the AIWAE is to learn the generative model by maximizing the data log-likelihood, which requires calculating the expectation with respect to the posterior distribution $p_\theta(z|x)$ in Eq. (9). The amortized approximate inference model is introduced to help estimate this expectation efficiently. Because the performance of the AIS estimator depends on the similarity between its initial and target distributions, the objective in optimizing the parameters $\phi$ is to make $q_\phi(z|x)$ close to the posterior distribution $p_\theta(z|x)$. In the AIWAE, we choose to minimize the Kullback-Leibler divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$, i.e., to maximize the ELBO function (Eq. (2)) with the reparameterization trick, as in the VAE kingma2013auto . Overall, the detailed procedure of the AIWAE is described in Algorithm (1).

Require:
    x: a data point
    K: the number of annealed importance weighted samples
    Generative and Inference Models
       p_θ(x, z): the joint distribution density of the generative model
       q_φ(z|x): the approximate posterior distribution density
    Parameters for Hamiltonian Monte Carlo (HMC)
       T: the number of inverse temperatures
       0 = β_1 < β_2 < ... < β_T = 1: inverse temperatures
       ε_t: the step size used in leapfrog integration at each inverse temperature
       L: the number of integration steps
Calculate Gradients and Optimize Parameters:
while θ and φ have not converged do
    sample example(s) x from the training data;
    update the generative model parameters θ:
       set log w_k = 0 for k = 1, ..., K;
       sample z_k^(1) ~ q_φ(z|x), where z_1^(1), ..., z_K^(1) are i.i.d. samples from q_φ(z|x);
       for t = 2 to T do
          log w_k ← log w_k + (β_t − β_{t−1}) [ log p_θ(x, z_k^(t−1)) − log q_φ(z_k^(t−1)|x) ] for k = 1, ..., K;
          z_k^(t) = HMC(z_k^(t−1), U_t, ε_t, L), where the potential energy function is:
             U_t(z) = −(1 − β_t) log q_φ(z|x) − β_t log p_θ(x, z);
       set z_k = z_k^(T) and w̃_k = w_k / Σ_{j=1}^{K} w_j;
       estimate the gradient ∇_θ log p_θ(x) with Σ_{k=1}^{K} w̃_k ∇_θ log p_θ(x, z_k);
       apply a gradient update to θ using ∇_θ log p_θ(x);
    update the inference model parameters φ:
       sample ε ~ N(0, I);
       set z = μ_φ(x) + σ_φ(x) ⊙ ε and calculate L(θ, φ; x) = log p_θ(x, z) − log q_φ(z|x);
       estimate the gradient ∇_φ L(θ, φ; x) with the reparameterization trick;
       apply a gradient update to φ using ∇_φ L(θ, φ; x);
Algorithm 1 Annealed Importance Weighted Auto-Encoder
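Putting the pieces together, one AIWAE parameter update might look like the following PyTorch sketch. It mirrors Algorithm 1 under several assumptions: encoder is the inference network returning the mean and log-variance of q_phi(z|x), log_joint and log_q are helpers computing log p_theta(x, z) and log q_phi(z|x) with broadcasting over the sample dimension, and hmc_step is a transition kernel such as the one sketched above. This is an illustrative reconstruction, not the authors' code.

import torch

def aiwae_step(x, encoder, log_joint, log_q, hmc_step,
               opt_theta, opt_phi, K, betas, step_sizes, L):
    # ---- update the generative model parameters theta (AIS estimate of Eq. (10)) ----
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z = mu + std * torch.randn(K, *mu.shape, device=mu.device)   # K samples z_k^(1) ~ q_phi(z|x)
    log_w = torch.zeros(K, x.shape[0], device=x.device)
    for t in range(1, len(betas)):
        with torch.no_grad():
            # log f_t(z) - log f_{t-1}(z) for f_t = q_phi^(1-beta_t) * p_theta^(beta_t)
            log_w += (betas[t] - betas[t - 1]) * (log_joint(x, z) - log_q(z, mu, std))
        # HMC transition that leaves f_t invariant; potential U_t(z) = -log f_t(z)
        potential = lambda zz, b=betas[t]: -(1.0 - b) * log_q(zz, mu, std) - b * log_joint(x, zz)
        z = hmc_step(z, potential, step_sizes[t], L)
    z = z.detach()                                               # treat AIS samples as fixed for theta
    w_tilde = torch.softmax(log_w, dim=0)                        # normalized AIS weights
    loss_theta = -(w_tilde * log_joint(x, z)).sum(0).mean()      # -sum_k w~_k log p_theta(x, z_k)
    opt_theta.zero_grad()
    loss_theta.backward()
    opt_theta.step()

    # ---- update the inference model parameters phi (ELBO of Eq. (2), reparameterized) ----
    mu, log_var = encoder(x)
    std = torch.exp(0.5 * log_var)
    z1 = mu + std * torch.randn_like(std)
    elbo = (log_joint(x, z1) - log_q(z1, mu, std)).mean()
    opt_phi.zero_grad()
    (-elbo).backward()
    opt_phi.step()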

In summary, the parameters $\theta$ and $\phi$ are optimized using different objective functions in the AIWAE. The generative model parameters $\theta$ are optimized to maximize the data log-likelihood, and the inference model parameters $\phi$ are optimized to maximize the ELBO function (Eq. 2) with the reparameterization trick. The purpose of the inference model is to help efficiently calculate the gradient of the data log-likelihood with respect to $\theta$ (Eq. (9)). When the posterior distribution $p_\theta(z|x)$ is not factorized and has multiple modes, the fully factorized approximate posterior distribution $q_\phi(z|x)$ will not be a good approximation, and the estimator based on importance sampling with $q_\phi(z|x)$ as the proposal distribution (Eq. (10)) will have a large variance. AIS is used to alleviate this problem. With AIS, samples from the approximate distribution $q_\phi(z|x)$ are transformed via an annealing process guided by the posterior distribution such that the distribution of these samples moves towards $p_\theta(z|x)$. The annealing procedure in AIS can also help samples explore the latent space when $p_\theta(z|x)$ has multiple modes. In contrast to other flow-based approaches, the AIWAE's memory cost does not increase with the depth of the flow. The computational cost of the AIWAE is proportional to both the number of weighted samples $K$ and the number of steps $T$ in the flow, whereas the computational cost of the IWAE is proportional only to $K$. Therefore, the AIWAE is approximately $T$ times more expensive than the IWAE in computation when the same $K$ is used.

4 Related Work

In addition to the IWAE and AIWAE, several approaches have been proposed to utilize multiple samples with importance sampling, either to obtain a better objective function or to better estimate gradients of objective functions. For learning feedforward neural networks with stochastic binary hidden units, an importance sampling based estimator using multiple samples was proposed in tang2013learning to estimate the expectation used in the generalized Expectation-Maximization algorithm neal1998view . A similar estimator was derived in raiko2014techniques by constructing a variational distribution of hidden variables used in the ELBO from multiple samples and importance sampling. Another approach mnih2016variational extended the multisample approach of the IWAE to discrete latent variables and developed an unbiased gradient estimator for importance-sampled objectives. The methods in tang2013learning ; raiko2014techniques ; mnih2016variational apply to models with discrete latent variables, whereas the IWAE applies only to models with continuous latent variables. Although our current extension of the IWAE to the AIWAE using AIS likewise applies only to continuous latent variables, the idea of replacing importance sampling with AIS could also be useful for improving the methods in tang2013learning ; raiko2014techniques ; mnih2016variational that work for discrete latent variables.

The reweighted wake-sleep (RWS) algorithm bornschein2014reweighted is another multisample approach for learning generative models with latent variables. Different from the IWAE, the RWS algorithm uses different objective functions for optimizing the generative and inference models. The generative model is optimized by maximizing the data log-likelihood, with its gradient estimated using importance sampling. This gradient estimator is equivalent to the gradient estimator in Eq. (5) of the IWAE. To optimize the inference model, two different update rules were proposed in the RWS: (1) (wake-phase update) minimizing the Kullback-Leibler divergence between the posterior distribution $p_\theta(z|x)$ and the approximate distribution $q_\phi(z|x)$, with the gradient estimated using importance sampling; (2) (sleep-phase update) maximizing the log-likelihood of $q_\phi(z|x)$ on samples drawn from the generative model $p_\theta(x, z)$. Compared with the RWS, the AIWAE differs in two aspects: (1) instead of importance sampling, AIS is used to estimate the gradient of the data log-likelihood when optimizing the generative model; (2) the inference model is optimized by minimizing the Kullback-Leibler divergence between $q_\phi(z|x)$ and $p_\theta(z|x)$ using the reparameterization trick. It would be interesting to investigate whether replacing importance sampling with AIS can also help the RWS learn better generative models.

The AIWAE utilizes both MCMC and variational inference. In this respect, one closely related approach is that proposed in hoffman2017learning , which uses the same objective functions as the AIWAE for both the generative and inference models. The difference between the AIWAE and hoffman2017learning is in the method for approximating the gradient in Eq. (9). In hoffman2017learning , only one approximate sample from the posterior distribution $p_\theta(z|x)$ is used to approximate the expectation for each data point $x$. The approximate sample is generated by first sampling from the approximate distribution $q_\phi(z|x)$ and then applying multiple Hamiltonian Monte Carlo (HMC) steps to the sample. In the HMC steps, the energy function is set to $-\log p_\theta(x, z)$. When the posterior distribution has multiple modes, running HMC for a limited number of steps with this energy function makes it difficult to explore multiple modes. Multimodality is less of a problem for the AIWAE because the annealing process in AIS starts with the smooth energy function $-\log q_\phi(z|x)$ and slowly switches to the rugged energy function $-\log p_\theta(x, z)$.

The Hamiltonian variational inference (HVI) salimans2015markov is another related approach combining MCMC and variational inference. Different from both the AIWAE and hoffman2017learning , the HVI treats samples from the HMC steps as auxiliary variables and optimizes a single objective function called the auxiliary variational lower bound. A disadvantage of the HVI is that it requires learning an extra inference network for the auxiliary variables to reverse the transformations in the HMC steps, which introduces new parameters. In addition, optimizing the auxiliary variational lower bound requires running the backpropagation algorithm backward through the HMC transformations, which increases the memory cost linearly with the number of HMC steps.

5 Experiments

5.1 Dataset and Model Setup

We conducted a series of experiments to evaluate the performance of the AIWAE on learning generative models using both the MNIST lecun1998gradient and the Omniglot lake2015human datasets. The same generative models were also learned using the IWAE, and the performance of the IWAE is compared to that of the AIWAE. We used the same generative model and the same inference model as those used in the IWAE study burda2015importance . The dimension of the latent variable $z$ is 50. For the generative model $p_\theta(x, z)$, the prior distribution $p_\theta(z)$ is a 50-dimensional standard Gaussian distribution. The conditional distribution $p_\theta(x|z)$ is a Bernoulli distribution and is parameterized by a neural network with two hidden layers, each of which has 200 units. The approximate posterior distribution $q_\phi(z|x)$ is a 50-dimensional Gaussian distribution with a diagonal covariance matrix; its mean and variance are similarly parameterized by a neural network with two hidden layers. We used the same optimization procedure kingma2014adam as that used in the IWAE study. Detailed information about the datasets, model setup and optimization is included in the supplementary material.

Models were trained using both the IWAE and the AIWAE with different hyperparameters. For both the IWAE and the AIWAE, the number of importance weighted samples $K$ is set to 1, 5 or 50. As shown in Algorithm (1), extra hyperparameters are required for the AIWAE. For each $K$, the number of inverse temperatures $T$ is set to 5, 8, or 11. Given a $T$, the inverse temperatures are evenly distributed between 0 and 1, i.e., $\beta_t = (t - 1)/(T - 1)$ for $t = 1, \dots, T$. In HMC, the number of integration steps $L$ is set to 5, and the step sizes $\epsilon_t$ are dynamically adapted such that the acceptance ratio is close to 0.6.
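The inverse-temperature grid and a simple step-size adaptation rule can be written as follows; the multiplicative adaptation toward a 0.6 acceptance ratio shown here is one straightforward possibility, not necessarily the exact rule used in the paper.

import numpy as np

def inverse_temperatures(T):
    # Evenly spaced inverse temperatures between 0 and 1 (beta_1 = 0, beta_T = 1)
    return np.linspace(0.0, 1.0, T)

def adapt_step_size(eps, acceptance_ratio, target=0.6, rate=1.02):
    # Increase the step size when acceptance is above the target, decrease it otherwise
    return eps * rate if acceptance_ratio > target else eps / rate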

5.2 Results

             | K = 1                          | K = 5                          | K = 50
             | IWAE   T=5    T=8    T=11      | IWAE   T=5    T=8    T=11      | IWAE   T=5    T=8    T=11
NLL(test)    | 86.07  84.55  84.31  84.19     | 85.17  84.23  84.06  83.90     | 84.13  83.92  83.79  83.68
NLL(train)   | 85.89  84.59  84.35  84.15     | 84.93  84.26  84.12  83.96     | 83.95  83.96  83.84  83.81
NVLB(test)   | 86.60  87.30  87.79  88.02     | 85.64  88.11  88.95  89.59     | 84.77  89.24  90.44  91.55
NVLB(train)  | 86.26  87.22  87.72  87.94     | 85.31  88.02  88.88  89.45     | 84.42  89.17  90.28  91.41
var gap      | 0.53   2.75   3.48   3.83      | 0.47   3.89   4.89   5.68      | 0.63   5.32   6.65   7.87
gen gap      | 0.18   -0.04  -0.04  0.05      | 0.24   -0.04  -0.06  -0.05     | 0.18   -0.04  -0.05  -0.13
active units | 18     28     31     32        | 20     32     34     35        | 23     35     38     40
Table 1: Results of both AIWAE and IWAE on the MNIST dataset. Within each K block, the IWAE column is followed by AIWAE results at T = 5, 8 and 11.
             | K = 1                              | K = 5                              | K = 50
             | IWAE    T=5     T=8     T=11       | IWAE    T=5     T=8     T=11       | IWAE    T=5     T=8     T=11
NLL(test)    | 107.39  103.23  102.63  102.43     | 105.30  102.54  102.10  102.05     | 103.31  102.07  101.83  101.68
NLL(train)   | 105.75  101.47  100.85  100.60     | 103.45  100.71  100.33  100.14     | 101.29  100.18  99.89   99.71
NVLB(test)   | 108.19  107.51  108.47  109.17     | 106.30  108.60  109.81  110.46     | 104.67  109.89  110.96  111.88
NVLB(train)  | 106.32  105.37  106.20  106.78     | 104.14  106.38  107.43  108.02     | 102.36  107.46  108.33  109.20
var gap      | 0.79    4.28    5.84    6.74       | 1.00    6.06    7.72    8.42       | 1.36    7.82    9.13    10.20
gen gap      | 1.65    1.76    1.78    1.83       | 1.85    1.83    1.77    1.90       | 2.02    1.90    1.94    1.97
active units | 27      47      50      50         | 32      50      50      50         | 39      50      50      50
Table 2: Results of both AIWAE and IWAE on the Omniglot dataset. Within each K block, the IWAE column is followed by AIWAE results at T = 5, 8 and 11.

Models trained with both the IWAE and the AIWAE with different hyperparameters are first evaluated using negative data log-likelihoods (NLLs) and negative variational lower bounds (NVLBs). Following wu2016quantitative , NLLs are calculated using 16 independent AIS chains with 10,000 inverse temperatures evenly spaced between 0 and 1. HMC with 10 leapfrog steps is used as the transition kernel, and the leapfrog step size is tuned to achieve an acceptance ratio of 0.6. Following the IWAE study, NVLBs are calculated as the negative of the multisample lower bound in Eq. (3).

Results on the MNIST dataset and the Omniglot dataset are presented in Table 1 and Table 2, respectively. Each experiment was repeated 5 times, and the results shown in Tables 1 and 2 are mean values (standard deviations are included in the tables in the supplementary material). For the IWAE, the NVLB values agree with those reported in the IWAE study. As the number of importance weighted samples $K$ increases from 1 to 50, the generative model trained with the IWAE improves, as shown by the decreasing NLL values. For a fixed $K$, models trained with the AIWAE have lower NLLs than models trained with the IWAE for all choices of $T$. Therefore, the AIWAE produces better density models than the IWAE. In addition, as the number of inverse temperatures $T$ increases from 5 to 11 in the AIWAE, the resulting density models improve further. Similar to the IWAE, models trained with the AIWAE with a fixed $T$ also improve when the number of annealed importance weighted samples increases.

When $K = 50$ and $T = 11$, the model trained with the AIWAE achieves a log-likelihood of -83.68 on the MNIST dataset and -101.68 on the Omniglot dataset. We note that these results are for permutation-invariant models with only one stochastic layer. For the Omniglot dataset, our best result, a log-likelihood of -101.68, is better than the best result in the IWAE study, which has an approximate log-likelihood of -103.38 and is obtained with two stochastic layers burda2015importance . The generalization ability of the models is quantified by the generalization gap (gen gap in Tables 1 and 2), defined as the difference between the log-likelihood values on the test and training data. For the MNIST dataset, models trained with both the IWAE and the AIWAE have quite small generalization gaps (smaller than 0.25 nats). Most of the models trained with the AIWAE have generalization gaps that are not significantly different from 0 nats.

In both the VAE and the IWAE, a factorized approximate posterior distribution is used to approximate the posterior distribution, and the generative model is trained by optimizing the ELBO objective function. In this kind of optimization, the generative model is biased such that its posterior distribution is approximately factorized. Alleviating this bias is the main motivation for replacing the importance sampling used in the IWAE with AIS in the AIWAE. Here we use the variational gap (var gap in Tables 1 and 2), defined as the difference between the NVLB and the NLL on the test data, to quantify the bias. The results show that models trained with the AIWAE have larger variational gaps than those trained with the IWAE. This means that the posterior distributions of models trained with the AIWAE are farther from factorized distributions than those of models trained with the IWAE. Therefore, the generative models trained with the AIWAE are less biased towards having factorized posterior distributions and have more complex structure in their posterior distributions. For models trained with the AIWAE, as the number of annealed importance weighted samples $K$ or the number of inverse temperatures $T$ increases, the NLLs decrease whereas the NVLBs increase, so the variational gaps increase with either $K$ or $T$. This implies that, as $K$ and $T$ increase, models learned with the AIWAE not only become better at density estimation but also have more complex posterior distributions. (A visualization of posterior distributions for models learned with both the IWAE and the AIWAE when the dimension of $z$ is set to 2 is included in the supplementary material.)

Following the IWAE study burda2015importance , we also calculated the number of active latent units and used it to represent how much of the latent space's representation capacity is utilized by the learned models. The intuition is that if a latent unit is active in encoding information about the observation, its distribution is expected to change with the observation. Therefore, we used the variance statistic $A_u = \operatorname{Cov}_x\!\left(\mathbb{E}_{z \sim q_\phi(z|x)}[z_u]\right)$ to quantify the activity of a latent unit $u$; it measures how much the value of the latent unit changes when the observation $x$ in the test set changes. A latent unit is defined to be active if $A_u > 10^{-2}$. Both the statistic and the cutoff value are adopted from the IWAE study burda2015importance . As shown in Tables 1 and 2, the number of active units in models trained with the AIWAE is much larger than in models trained with the IWAE. In addition, the number of active units in models trained with the AIWAE increases monotonically not only with the number of samples $K$ but also with the number of inverse temperatures $T$. Intuitively, when the number of inverse temperatures increases, the annealing process in the AIWAE becomes longer and smoother, making it easier for samples to explore more of the latent space. On the Omniglot dataset, all the latent units are active for most of the models trained with the AIWAE.
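The active-unit count can be computed directly from the posterior means produced by the inference network on the test set. A small sketch, assuming mu is an array of shape [num_test_examples, latent_dim]:

import numpy as np

def count_active_units(mu, threshold=1e-2):
    # mu: posterior means E_{z ~ q_phi(z|x)}[z] for each test example
    A = np.var(mu, axis=0)          # A_u = Cov_x( E_{z ~ q_phi(z|x)}[z_u] )
    return int(np.sum(A > threshold))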

6 Conclusion

We present the annealed importance weighted auto-encoder (AIWAE), a learning algorithm for training probabilistic generative models with latent variables. The AIWAE combines multisample-based and flow-based approaches through annealed importance sampling to better approximate the posterior distribution. In contrast with previous flow-based approaches, the AIWAE does not require running backpropagation backward through the flow, and its memory cost is constant in the depth of the flow. The AIWAE can also be viewed as a way of combining MCMC with variational inference, or as trading learning speed for model accuracy. The annealed sampling process used in the AIWAE facilitates sampling from complex posterior distributions. In experiments, we demonstrate that, compared with models learned with the IWAE, models learned with the AIWAE assign higher likelihood to data, have more complex posterior distributions and utilize more of their latent space representational capacity.

References

A Details about datasets, model setup and optimization

In the MNIST dataset, there are 60,000 training examples and 10,000 test examples. In the Omniglot dataset, there are 24,345 training examples and 8,070 test examples (the Omniglot data was downloaded from https://github.com/yburda/iwae.git). Images from both datasets have a dimension of 28 × 28. For both training and testing, images are dynamically binarized into vectors of 0s and 1s, with the probability of a pixel being 1 equal to its normalized pixel value between 0 and 1.

We used the same generative model and the same inference model as those used in the IWAE study. The dimension of the latent variable $z$ is 50. For the generative model $p_\theta(x, z)$, the prior distribution $p_\theta(z)$ is a 50-dimensional standard Gaussian distribution. The generative conditional distribution $p_\theta(x|z)$ is a Bernoulli distribution. The probability of the Bernoulli distribution is parameterized by a neural network with two hidden layers, $z \rightarrow h_1 \rightarrow h_2 \rightarrow x$, where both $h_1$ and $h_2$ have 200 units and $x$ has 784 units. The approximate posterior distribution $q_\phi(z|x)$ is a 50-dimensional Gaussian distribution with a diagonal covariance matrix. Its mean and variance are similarly parameterized by a neural network with two hidden layers, $x \rightarrow h_1 \rightarrow h_2 \rightarrow z$, where both $h_1$ and $h_2$ have 200 units.
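A sketch of this architecture in PyTorch. The layer sizes follow the text; the use of tanh hidden activations follows the IWAE study and should be treated as an assumption here.

import torch
import torch.nn as nn

class Encoder(nn.Module):
    # q_phi(z|x): two hidden layers of 200 units; outputs the mean and log-variance
    # of a 50-dimensional diagonal Gaussian
    def __init__(self, x_dim=784, h_dim=200, z_dim=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(x_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh())
        self.mu = nn.Linear(h_dim, z_dim)
        self.log_var = nn.Linear(h_dim, z_dim)

    def forward(self, x):
        h = self.net(x)
        return self.mu(h), self.log_var(h)

class Decoder(nn.Module):
    # p_theta(x|z): two hidden layers of 200 units; outputs Bernoulli logits over 784 pixels
    def __init__(self, x_dim=784, h_dim=200, z_dim=50):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(z_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, h_dim), nn.Tanh(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, z):
        return self.net(z)   # logits of the Bernoulli distribution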

We used the same optimization setup as that used in the IWAE study. The Adam optimizer is used with the same hyperparameters as in the IWAE study ($\beta_1 = 0.9$, $\beta_2 = 0.999$, $\epsilon = 10^{-4}$). The optimization proceeded for $3^i$ passes over the data with a learning rate of $0.001 \cdot 10^{-i/7}$ for $i = 0, 1, \dots, 7$. Overall, the optimization was run for 3,280 epochs. For the IWAE, the minibatch size is 20, which is the same as in the IWAE study. For the AIWAE, the minibatch size is set to 128 to accelerate training.
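The training schedule can be generated programmatically; the constants below are carried over from the IWAE study's schedule (3^i epochs at learning rate 0.001 · 10^(-i/7) for i = 0, ..., 7), which is consistent with the 3,280-epoch total, but treat them as an assumption.

def learning_rate_schedule(i_max=7, base_lr=1e-3):
    # For i = 0, ..., i_max: train for 3**i epochs with learning rate base_lr * 10**(-i/7)
    schedule = [(3 ** i, base_lr * 10 ** (-i / 7.0)) for i in range(i_max + 1)]
    assert sum(epochs for epochs, _ in schedule) == 3280   # total number of epochs
    return schedule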

B Performance of models trained with both AIWAE and IWAE on both the MNIST and the Omniglot dataset when the dimension of latent variable is equal to 50

Each experiment was repeated 5 times. The results shown in Tables 3 and 4 are mean values over the 5 repeats. Values in parentheses are standard deviations.

Columns (left to right): K = 1 (IWAE; AIWAE with T = 5, 8, 11), K = 5 (IWAE; AIWAE with T = 5, 8, 11), K = 50 (IWAE; AIWAE with T = 5, 8, 11).
NLL(test) 86.07 (0.07) 84.55 (0.07) 84.31 (0.08) 84.19 (0.06) 85.17 (0.10) 84.23 (0.10) 84.06 (0.05) 83.90 (0.03) 84.13 (0.06) 83.92 (0.07) 83.79 (0.06) 83.68 (0.10)
NLL(train) 85.89 (0.07) 84.59 (0.09) 84.35 (0.06) 84.15 (0.07) 84.93 (0.06) 84.26 (0.08) 84.12 (0.05) 83.96 (0.04) 83.95 (0.06) 83.96 (0.05) 83.84 (0.09) 83.81 (0.09)
NVLB(test) 86.60 (0.07) 87.30 (0.08) 87.79 (0.17) 88.02 (0.17) 85.64 (0.09) 88.11 (0.15) 88.95 (0.18) 89.59 (0.27) 84.77 (0.05) 89.24 (0.21) 90.44 (0.50) 91.55 (1.15)
NVLB(train) 86.26 (0.05) 87.22 (0.08) 87.72 (0.12) 87.94 (0.15) 85.31 (0.06) 88.02 (0.13) 88.88 (0.15) 89.45 (0.26) 84.42 (0.03) 89.17 (0.21) 90.28 (0.47) 91.41 (1.16)
var gap 0.53 (0.06) 2.75 (0.08) 3.48 (0.11) 3.83 (0.21) 0.47 (0.10) 3.89 (0.17) 4.89 (0.20) 5.68 (0.26) 0.63 (0.09) 5.32 (0.23) 6.65 (0.45) 7.87 (1.06)
gen gap 0.18 (0.08) -0.04 (0.04) -0.04 (0.07) 0.05 (0.03) 0.24 (0.10) -0.04 (0.06) -0.06 (0.06) -0.05 (0.03) 0.18 (0.01) -0.04 (0.04) -0.05 (0.05) -0.13 (0.02)
active units 18 28 31 32 20 32 34 35 23 35 38 40
Table 3: Results of both AIWAE and IWAE on the MNIST dataset
Columns (left to right): K = 1 (IWAE; AIWAE with T = 5, 8, 11), K = 5 (IWAE; AIWAE with T = 5, 8, 11), K = 50 (IWAE; AIWAE with T = 5, 8, 11).
NLL(test) 107.39 (0.16) 103.23 (0.07) 102.63 (0.09) 102.43 (0.11) 105.30 (0.07) 102.54 (0.08) 102.10 (0.07) 102.05 (0.06) 103.31 (0.09) 102.07 (0.08) 101.83 (0.07) 101.68 (0.09)
NLL(train) 105.75 (0.11) 101.47 (0.06) 100.85 (0.10) 100.60 (0.10) 103.45 (0.06) 100.71 (0.05) 100.33 (0.04) 100.14 (0.08) 101.29 (0.06) 100.18 (0.06) 99.89 (0.08) 99.71 (0.06)
NVLB(test) 108.19 (0.16) 107.51 (0.07) 108.47 (0.08) 109.17 (0.10) 106.30 (0.04) 108.60 (0.14) 109.81 (0.10) 110.46 (0.14) 104.67 (0.12) 109.89 (0.15) 110.96 (0.12) 111.88 (0.14)
NVLB(train) 106.32 (0.13) 105.37 (0.06) 106.20 (0.08) 106.78 (0.15) 104.14 (0.04) 106.38 (0.11) 107.43 (0.07) 108.02 (0.12) 102.36 (0.04) 107.46 (0.10) 108.33 (0.11) 109.20 (0.13)
var gap 0.79 (0.03) 4.28 (0.08) 5.84 (0.10) 6.74 (0.08) 1.00 (0.06) 6.06 (0.10) 7.72 (0.12) 8.42 (0.14) 1.36 (0.06) 7.82 (0.08) 9.13 (0.14) 10.20 (0.06)
gen gap 1.65 (0.10) 1.76 (0.03) 1.78 (0.07) 1.83 (0.05) 1.85 (0.05) 1.83 (0.04) 1.77 (0.06) 1.90 (0.04) 2.02 (0.05) 1.90 (0.04) 1.94 (0.09) 1.97 (0.03)
active units 27 47 50 50 32 50 50 50 39 50 50 50
Table 4: Results of both AIWAE and IWAE on the Omniglot dataset

C Posterior distributions of models trained with both AIWAE and IWAE on the MNIST dataset when the dimension of the latent variable is equal to 2

In order to visualize the posterior distribution, we also trained models with the dimension of the latent variable $z$ being 2, using both the AIWAE and the IWAE on the MNIST dataset. The rest of the model setup and the optimization procedure are the same as those used for the models with the dimension of $z$ being 50.

Figures S0-S9 show posterior distributions of models trained with both the IWAE and the AIWAE for examples of digits 0-9. In each figure, the first row shows models learned with the IWAE, and the second through last rows show models learned with the AIWAE using different numbers of inverse temperatures. The left, middle, and right columns correspond to models learned with 1, 5, and 50 samples, respectively. The digits used for calculating the posterior distributions shown in Figures S0-S9 are shown in Figure S10.

Figure S0: Posterior distribution for a digit 0.
Figure S1: Posterior distribution for a digit 1.
Figure S2: Posterior distribution for a digit 2.
Figure S3: Posterior distribution for a digit 3.
Figure S4: Posterior distribution for a digit 4.
Figure S5: Posterior distribution for a digit 5.
Figure S6: Posterior distribution for a digit 6.
Figure S7: Posterior distribution for a digit 7.
Figure S8: Posterior distribution for a digit 8.
Figure S9: Posterior distribution for a digit 9.
Figure S10: Digits used to calculate the posterior distributions shown in Figures S0-S9.