
# Training Variational Autoencoders with Buffered Stochastic Variational Inference

The recognition network in deep latent variable models such as variational autoencoders (VAEs) relies on amortized inference for efficient posterior approximation that can scale up to large datasets. However, this technique has also been demonstrated to select suboptimal variational parameters, often resulting in considerable additional error called the amortization gap. To close the amortization gap and improve the training of the generative model, recent works have introduced an additional refinement step that applies stochastic variational inference (SVI) to improve upon the variational parameters returned by the amortized inference model. In this paper, we propose Buffered Stochastic Variational Inference (BSVI), a new refinement procedure that makes use of SVI's sequence of intermediate variational proposal distributions and their corresponding importance weights to construct a new generalized importance-weighted lower bound. We demonstrate empirically that training variational autoencoders with BSVI consistently outperforms SVI, yielding an improved training procedure for VAEs.


## 1 Introduction

Deep generative latent-variable models are important building blocks in current approaches to a host of challenging high-dimensional problems including density estimation [1, 2, 3][4, 5] and representation learning for downstream tasks [6, 7, 8, 9]. To train these models, the principle of maximum likelihood is often employed. However, maximum likelihood is often intractable due to the difficulty of marginalizing the latent variables. Variational Bayes addresses this by instead providing a tractable lower bound of the log-likelihood, which serves as a surrogate target for maximization. Variational Bayes, however, introduces a per sample optimization subroutine to find the variational proposal distribution that best matches the true posterior distribution (of the latent variable given an input observation). To amortize the cost of this optimization subroutine, the variational autoencoder introduces an amortized inference model that learns to predict the best proposal distribution given an input observation [1, 10, 11, 12].

Although the computational efficiency of amortized inference has enabled latent variable models to be trained at scale on large datasets [13, 14], amortization introduces an additional source of error in the approximation of the posterior distributions if the amortized inference model fails to predict the optimal proposal distribution. This additional source of error, referred to as the amortization gap [15], causes variational autoencoder training to further deviate from maximum likelihood training [15, 16].

To improve training, numerous methods have been developed to reduce the amortization gap. In this paper, we focus on a class of methods [17, 18, 19] that takes an initial proposal distribution predicted by the amortized inference model and refines this initial distribution with the application of Stochastic Variational Inference (SVI) [20]. Since SVI applies gradient ascent to iteratively update the proposal distribution, a by-product of this procedure is a trajectory of proposal distributions $q_0, \ldots, q_k$ and their corresponding importance weights $w_0, \ldots, w_k$. In SVI training, the intermediate distributions are discarded, and only the last distribution $q_k$ is retained for updating the generative model. Our key insight is that the intermediate importance weights can be repurposed to further improve training. Our contributions are as follows:

1. We propose a new method, Buffered Stochastic Variational Inference (BSVI), that takes advantage of the intermediate importance weights and constructs a new lower bound (the BSVI bound).

2. We show that the BSVI bound is a special instance of a family of generalized importance-weighted lower bounds.

3. We show that training variational autoencoders with BSVI consistently outperforms SVI, demonstrating the effectiveness of leveraging the intermediate weights.

Our paper shows that BSVI is an attractive replacement for SVI with minimal development and computational overhead.

## 2 Background and Notation

We consider a latent-variable generative model $p_\theta(x, z)$ where $x$ is observed, $z$ is latent, and $\theta$ are the model's parameters. The marginal likelihood $p_\theta(x)$ is intractable but can be lower bounded by the evidence lower bound (ELBO)

$$\ln p_\theta(x) \ge \mathbb{E}_{q(z)}\left[\ln \frac{p_\theta(x, z)}{q(z)}\right] = \mathbb{E}_{q(z)} \ln w(z), \tag{1}$$

which holds for any distribution $q(z)$. Since the gap of this bound is exactly the Kullback-Leibler divergence $D_{\mathrm{KL}}(q(z) \,\|\, p_\theta(z \mid x))$, $q(z)$ is thus the variational approximation of the posterior. Furthermore, by viewing $q$ as a proposal distribution in an importance sampler, we refer to $w(z) = p_\theta(x, z) / q(z)$ as an unnormalized importance weight. Since $w(z)$ is a random variable, the variance can be reduced by averaging the importance weights derived from $k$ i.i.d. samples from $q(z)$. This yields the Importance-Weighted Autoencoder (IWAE) bound [21],

$$\ln p_\theta(x) \ge \mathbb{E}_{z_1, \ldots, z_k \overset{\text{i.i.d.}}{\sim} q}\left[\ln \frac{1}{k} \sum_{i=1}^{k} w(z_i)\right] \ge \mathrm{ELBO}, \tag{2}$$

which admits a tighter lower bound than the ELBO [21, 22].
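The relationship between Eq. 1 and Eq. 2 can be checked numerically on a toy problem. The sketch below is not from the paper: it assumes a one-dimensional model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$ (so that $\ln p_\theta(x)$ is known in closed form) and a deliberately mismatched Gaussian proposal. All function names are illustrative.

```python
import math
import random

def log_joint(x, z):
    # ln p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)
    return -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)

def log_q(z, mu, sigma):
    # ln q(z) for a Gaussian proposal N(mu, sigma^2)
    return -0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)

def log_mean_exp(values):
    # numerically stable ln((1/n) * sum_i exp(values[i]))
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values) / len(values))

def estimate_bounds(x, mu, sigma, k, n_outer, rng):
    elbo_total, iwae_total = 0.0, 0.0
    for _ in range(n_outer):
        log_w = []
        for _ in range(k):
            z = rng.gauss(mu, sigma)
            log_w.append(log_joint(x, z) - log_q(z, mu, sigma))
        elbo_total += sum(log_w) / k        # average of ln w: ELBO estimate (Eq. 1)
        iwae_total += log_mean_exp(log_w)   # ln of average w: IWAE estimate (Eq. 2)
    return elbo_total / n_outer, iwae_total / n_outer

x = 1.0
log_px = -0.25 * x * x - 0.5 * math.log(4.0 * math.pi)  # exact, since x ~ N(0, 2)
elbo, iwae = estimate_bounds(x, mu=0.0, sigma=1.0, k=8, n_outer=2000,
                             rng=random.Random(0))
print(elbo, iwae, log_px)
```

On each draw of $k$ weights, the log of the average is at least the average of the logs, so the IWAE estimate never falls below the ELBO estimate computed from the same samples.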

### 2.1 Stochastic Variational Inference

The generative model can be trained by jointly optimizing $\theta$ and $q$ to maximize the lower bound over the data distribution $\hat{p}(x)$. Supposing the variational family is parametric and indexed by a parameter $\lambda$ (e.g. a Gaussian variational family indexed by mean and covariance parameters), the optimization problem becomes

$$\max_\theta \mathbb{E}_{\hat{p}(x)}\left[\max_\lambda \mathbb{E}_{q(z; \lambda)} \ln w(z; \lambda, \theta)\right], \tag{3}$$

where the importance weight is now

$$w(z; \lambda, \theta) = \frac{p_\theta(x, z)}{q(z; \lambda)}. \tag{4}$$

For notational simplicity, we omit the dependency on $x$. For a fixed choice of $\theta$ and $x$, [17] proposed to optimize $\lambda$ via gradient ascent, where one initializes with $\lambda_0$ and takes successive steps of

$$\lambda_{i+1} \leftarrow \lambda_i + \eta \nabla_{\lambda_i} \mathrm{ELBO}, \tag{5}$$

for which the ELBO gradient with respect to $\lambda_i$ can be approximated via Monte Carlo sampling as

$$\nabla_{\lambda_i} \mathrm{ELBO} \approx \frac{1}{m} \sum_{j=1}^{m} \nabla_{\lambda_i} \ln w\left(z_{\lambda_i}(\epsilon_i^{(j)}); \lambda_i, \theta\right), \tag{6}$$

where $z$ is reparameterized as a function $z_{\lambda_i}(\epsilon)$ of $\lambda_i$ and noise $\epsilon$ drawn from a base distribution. We note that $k$ applications of gradient ascent generate a trajectory of variational parameters $(\lambda_0, \ldots, \lambda_k)$, where we use the final parameter $\lambda_k$ for the approximation. Following the convention in [20], we refer to this procedure as Stochastic Variational Inference (SVI).
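The updates in Eq. 5 and Eq. 6 can be sketched for a scalar variational parameter. The example below is an illustration under assumptions that are not in the paper: the same toy model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, and a Gaussian proposal whose standard deviation is held fixed so that $\lambda$ reduces to the mean. The function name and hyperparameters are hypothetical.

```python
import random

def svi_trajectory(x, mu0, sigma, eta, k, m, rng):
    """Run k reparameterized gradient-ascent steps on the proposal mean.

    Assumed toy model: z ~ N(0, 1), x | z ~ N(z, 1); proposal
    q(z; mu) = N(mu, sigma^2) with sigma fixed, so lambda is the scalar mu.
    Returns the trajectory [mu_0, ..., mu_k].
    """
    mus = [mu0]
    mu = mu0
    for _ in range(k):
        grad = 0.0
        for _ in range(m):
            eps = rng.gauss(0.0, 1.0)
            z = mu + sigma * eps          # reparameterization z = z_mu(eps)
            # d/dmu ln w(z_mu(eps); mu) = d/dz ln p(x, z) = (x - z) - z,
            # because ln q(z_mu(eps); mu) does not depend on mu.
            grad += (x - z) - z
        mu = mu + eta * grad / m          # Eq. 5 using the estimate in Eq. 6
        mus.append(mu)
    return mus

mus = svi_trajectory(x=1.0, mu0=-2.0, sigma=0.7, eta=0.1, k=50, m=16,
                     rng=random.Random(0))
print(mus[0], mus[-1])
```

For this toy model the exact posterior mean is $x/2$, so the trajectory should drift from the initial value toward $0.5$ and then fluctuate around it because of gradient noise.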

### 2.2 Amortized Inference Suboptimality

The SVI procedure introduces an inference subroutine that optimizes the proposal distribution per sample, which is computationally costly. [1, 10] observed that the computational cost of inference can be amortized by introducing an inference model $f_\phi$, parameterized by $\phi$, that directly seeks to learn the mapping from each sample $x$ to an optimal $\lambda^*$ that solves the maximization problem

$$\lambda^* = \operatorname*{arg\,max}_\lambda \mathbb{E}_{q(z; \lambda)} \ln \frac{p_\theta(x, z)}{q(z; \lambda)}. \tag{7}$$

This yields the amortized ELBO optimization problem

$$\max_{\theta, \phi} \mathbb{E}_{\hat{p}(x)}\left[\mathbb{E}_{q(z; f_\phi(x))} \ln \frac{p_\theta(x, z)}{q(z; f_\phi(x))}\right], \tag{8}$$

where $q(z; f_\phi(x))$ can be concisely rewritten (with a slight abuse of notation) as $q_\phi(z \mid x)$ to yield the standard variational autoencoder objective [1].

While computationally efficient, the influence of the amortized inference model on the training dynamics of the generative model has recently come under scrutiny [15, 17, 18, 16]. A notable consequence of amortization is the amortization gap

$$D_{\mathrm{KL}}\left(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\right) - D_{\mathrm{KL}}\left(q(z; \lambda^*) \,\|\, p_\theta(z \mid x)\right), \tag{9}$$

which measures the additional error incurred when the amortized inference model is used instead of the optimal $\lambda^*$ for approximating the posterior [15]. A large amortization gap can present a potential source of concern since it introduces further deviation from the maximum likelihood objective [16].
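As a worked instance of Eq. 9, the gap can be computed in closed form when every distribution involved is Gaussian. The sketch below assumes the toy model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, whose exact posterior is $\mathcal{N}(x/2, 1/2)$, together with a hypothetical linear encoder. None of these choices come from the paper.

```python
import math

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    # KL(N(mu_q, sigma_q^2) || N(mu_p, sigma_p^2)) in closed form
    return (math.log(sigma_p / sigma_q)
            + (sigma_q ** 2 + (mu_q - mu_p) ** 2) / (2.0 * sigma_p ** 2)
            - 0.5)

def amortization_gap(x, enc_slope, enc_sigma):
    """Eq. 9 for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1).

    The exact posterior is N(x / 2, 1 / 2). The hypothetical amortized
    encoder predicts q_phi(z | x) = N(enc_slope * x, enc_sigma^2).
    """
    post_mu, post_sigma = 0.5 * x, math.sqrt(0.5)
    kl_amortized = gaussian_kl(enc_slope * x, enc_sigma, post_mu, post_sigma)
    # The Gaussian family contains the exact posterior here, so lambda*
    # recovers it and the second KL term of Eq. 9 is zero.
    kl_optimal = gaussian_kl(post_mu, post_sigma, post_mu, post_sigma)
    return kl_amortized - kl_optimal

gap_bad = amortization_gap(x=2.0, enc_slope=0.3, enc_sigma=1.0)
gap_exact = amortization_gap(x=2.0, enc_slope=0.5, enc_sigma=math.sqrt(0.5))
print(gap_bad, gap_exact)
```

In this toy setting the gap vanishes only when the encoder outputs the posterior parameters exactly. In realistic models the second term of Eq. 9 is generally nonzero because the variational family does not contain the true posterior.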

### 2.3 Amortization-SVI Hybrids

To close the amortization gap, [17] proposed to blend amortized inference with SVI. Since SVI requires one to initialize $\lambda_0$, a natural solution is to set $\lambda_0 = f_\phi(x)$. Thus, SVI is allowed to fine-tune the initial proposal distribution found by the amortized inference model and reduce the amortization gap. Rather than optimizing $\theta$ and $\phi$ jointly with the amortized ELBO objective Eq. 8, the training of the inference and generative models is now decoupled; $\phi$ is trained to optimize the amortized ELBO objective, but $\theta$ is trained to approximately optimize Eq. 3, where $\lambda^*$ is approximated via SVI. To enable end-to-end training of the inference and generative models, [18] proposed to backpropagate through the SVI steps via a finite-difference estimation of the necessary Hessian-vector products. Alternatively, [19] adopts a learning-to-learn framework where an inference model iteratively outputs $\lambda_{i+1}$ as a function of $\lambda_i$ and the ELBO gradient.

## 3 Buffered Stochastic Variational Inference

In this paper, we focus on the simpler, decoupled training procedure described by [17] and identify a new way of improving the SVI training procedure (orthogonal to the end-to-end approaches in [18, 19]). Our key observation is that, as part of the gradient ascent estimation in Eq. 6, the SVI procedure necessarily generates a sequence of importance weights $(w_0, \ldots, w_k)$, where $w_i = w(z_i; \lambda_i, \theta)$ with $z_i \sim q(z; \lambda_i)$. Since the final proposal $q(z; \lambda_k)$ likely achieves the highest ELBO, the intermediate weights $(w_0, \ldots, w_{k-1})$ are subsequently discarded in the SVI training procedure, and only $w_k$ is retained for updating the generative model parameters. However, if the preceding proposal distributions are also reasonable approximations of the posterior, then it is potentially wasteful to discard their corresponding importance weights. A natural question to ask then is whether the full trajectory of weights can be leveraged to further improve the training of the generative model.

Taking inspiration from IWAE's weight-averaging mechanism, we propose a modification to the SVI procedure where we simply keep a buffer of the entire importance weight trajectory and use an average of the importance weights $\sum_{i=0}^{k} \pi_i w_i$ as the objective in training the generative model. (For simplicity, we use uniform weighting $\pi_i = 1/(k+1)$ in our base implementation of BSVI; Section 4.1 discusses how to optimize $\pi$ during training.) The generative model is then updated with the gradient $\nabla_\theta \ln \sum_{i=0}^{k} \pi_i w_i$. We call this procedure Buffered Stochastic Variational Inference (BSVI) and denote $\mathbb{E} \ln \sum_{i=0}^{k} \pi_i w_i$ as the BSVI objective. We describe the BSVI training procedure in Algorithm 1 and contrast it with SVI training. For notational simplicity, we shall always imply initialization with an amortized inference model when referring to SVI and BSVI.
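The buffering idea can be sketched as follows. This is not a reproduction of Algorithm 1: it assumes the toy model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$, a fixed-variance Gaussian proposal, and single-sample SVI steps in which the sample $z_i$ that produces $w_i$ is also the one used to compute the next parameter. All names and hyperparameters are hypothetical.

```python
import math
import random

def log_w(x, z, mu, sigma):
    # ln w = ln p(x, z) - ln q(z; mu, sigma) for the assumed toy model
    log_joint = -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)
    log_q = (-0.5 * ((z - mu) / sigma) ** 2 - math.log(sigma)
             - 0.5 * math.log(2.0 * math.pi))
    return log_joint - log_q

def bsvi_buffer(x, mu0, sigma, eta, k, rng):
    """Return the buffer [(z_0, ln w_0), ..., (z_k, ln w_k)] from one SVI run."""
    buffer, mu = [], mu0
    for i in range(k + 1):
        z = mu + sigma * rng.gauss(0.0, 1.0)
        buffer.append((z, log_w(x, z, mu, sigma)))
        if i < k:
            # single-sample (m = 1) SVI step that reuses z_i
            mu = mu + eta * ((x - z) - z)
    return buffer

def bsvi_objective(log_ws, log_pi):
    # ln sum_i pi_i w_i, computed with the log-sum-exp trick
    terms = [lp + lw for lp, lw in zip(log_pi, log_ws)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

k = 9
buffer = bsvi_buffer(x=1.0, mu0=-1.0, sigma=0.7, eta=0.1, k=k, rng=random.Random(0))
log_ws = [lw for _, lw in buffer]
uniform = [-math.log(k + 1)] * (k + 1)
bsvi = bsvi_objective(log_ws, uniform)   # single-sample BSVI objective
svi = log_ws[-1]                         # ln w_k, the only weight plain SVI keeps
print(bsvi, svi)
```

Plain SVI would keep only `svi`, whereas the buffered objective combines all $k + 1$ weights at no extra sampling cost.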

## 4 Theoretical Analysis

An important consideration is whether the BSVI objective serves as a valid lower bound to the log-likelihood $\ln p_\theta(x)$. A critical challenge in the analysis of the BSVI objective is that the trajectory of variational parameters $(\lambda_0, \ldots, \lambda_k)$ is actually a sequence of statistically-dependent random variables. This statistical dependency is a consequence of SVI's stochastic gradient approximation in Eq. 6. We capture this dependency structure in Figure 1(a), which shows that each $\lambda_{i+1}$ is only deterministically generated after $z_i$ is sampled. When the proposal distribution parameters $\lambda_{0:k}$ are marginalized, the resulting graphical model is a joint distribution $q(z_{0:k} \mid x)$ over $z_{0:k}$. To reason about such a joint distribution, we introduce the following generalization of the IWAE bound.

Theorem (Generalized IWAE bound). Let $p(x, z)$ be a distribution where $z \in \mathcal{Z}$. Consider a joint proposal distribution $q(z_{0:k})$ over $\mathcal{Z}^{k+1}$. Let $v(i) \subseteq \{0, \ldots, k\} \setminus \{i\}$ for all $i \in \{0, \ldots, k\}$, and $\pi$ be a categorical distribution over $\{0, \ldots, k\}$. The following construction, which we denote the Generalized IWAE Bound, is a valid lower bound of the log-marginal-likelihood

$$\mathbb{E}_{q(z_{0:k})} \ln \sum_{i=0}^{k} \pi_i \frac{p(x, z_i)}{q(z_i \mid z_{v(i)})} \le \ln p(x). \tag{10}$$

The proof follows directly from the linearity of expectation when using $q(z_{0:k})$ for importance sampling to construct an unbiased estimate of $p(x)$, followed by application of Jensen's inequality. A detailed proof is provided in Appendix A.

Notably, if $q(z_{0:k}) = \prod_i q(z_i)$ (with uniform $\pi$), then the Generalized IWAE bound reduces to the IWAE bound. The theorem thus provides a generalization of IWAE, where the samples drawn are potentially non-independently and non-identically distributed, and a way to construct new lower bounds on the log-likelihood whenever one has access to a set of non-independent samples.
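The two ingredients of the proof (unbiasedness of the mixed importance weight, then Jensen's inequality) can be checked by simulation with dependent samples. The sketch below assumes the toy model $z \sim \mathcal{N}(0, 1)$, $x \mid z \sim \mathcal{N}(z, 1)$ and an arbitrary two-step chain in which the second proposal depends on the first sample. The proposals and the value of $\pi$ are illustrative choices, not taken from the paper.

```python
import math
import random

def log_joint(x, z):
    # ln p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)
    return -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)

def log_normal(z, mean, std):
    return -0.5 * ((z - mean) / std) ** 2 - math.log(std) - 0.5 * math.log(2.0 * math.pi)

def sample_chain_weights(x, rng):
    """Draw a dependent pair (z_0, z_1) and return [w_0, w_1].

    z_0 ~ N(0, 1) and z_1 | z_0 ~ N(0.5 * z_0 + 0.25, 1), so v(0) is empty
    and v(1) = {0}. Both proposals are arbitrary illustrative choices.
    """
    z0 = rng.gauss(0.0, 1.0)
    w0 = math.exp(log_joint(x, z0) - log_normal(z0, 0.0, 1.0))
    mean1 = 0.5 * z0 + 0.25
    z1 = rng.gauss(mean1, 1.0)
    w1 = math.exp(log_joint(x, z1) - log_normal(z1, mean1, 1.0))
    return [w0, w1]

def check_generalized_iwae(x, pi, n, rng):
    mean_w, mean_log = 0.0, 0.0
    for _ in range(n):
        ws = sample_chain_weights(x, rng)
        mix = sum(p * w for p, w in zip(pi, ws))
        mean_w += mix / n               # estimates p(x): the mixed weight is unbiased
        mean_log += math.log(mix) / n   # estimates the left-hand side of Eq. 10
    return mean_w, mean_log

x = 1.0
px = math.exp(-0.25 * x * x) / math.sqrt(4.0 * math.pi)   # exact marginal, x ~ N(0, 2)
mean_w, bound = check_generalized_iwae(x, pi=[0.3, 0.7], n=20000,
                                       rng=random.Random(0))
print(mean_w, px, bound, math.log(px))
```

The averaged mixed weight should land close to $p(x)$, while the averaged log should sit below $\ln p(x)$, even though $z_1$ is neither independent of $z_0$ nor identically distributed.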

In this paper, we focus on a special instance where a chain of samples is constructed from the SVI trajectory. We note that the BSVI objective can be expressed as

$$\mathbb{E}_{q(z_{0:k} \mid x)} \ln \sum_{i=0}^{k} \pi_i w_i = \mathbb{E}_{q(z_{0:k} \mid x)} \ln \sum_{i=0}^{k} \pi_i \frac{p_\theta(x, z_i)}{q(z_i \mid z_{<i}, x)}. \tag{11}$$

Note that since $\lambda_i$ can be deterministically computed given $(x, z_{<i})$, it is therefore admissible to interchange the distributions $q(z_i \mid z_{<i}, x) = q(z_i; \lambda_i)$. The BSVI objective is thus a special case of the Generalized IWAE bound, where $v(i) = \{0, \ldots, i-1\}$ with auxiliary conditioning on $x$. Hence, the BSVI objective is a valid lower bound of $\ln p_\theta(x)$; we now refer to it as the BSVI bound where appropriate.

In the following two subsections, we address two additional aspects of the BSVI bound. First, we propose a method for ensuring that the BSVI bound is tighter than the Evidence Lower Bound achievable via SVI. Second, we provide an initial characterization of BSVI’s implicit sampling-importance-resampling distribution.

### 4.1 Buffer Weight Optimization

Stochastic variational inference uses a series of gradient ascent steps to generate a final proposal distribution $q(z; \lambda_k)$. As evident from Figure 1(a), the parameter $\lambda_k$ is in fact a random variable. The ELBO achieved via SVI, in expectation, is thus

$$\mathbb{E}_{q(z, \lambda_k \mid x)} \ln \frac{p_\theta(x, z)}{q_\phi(z \mid \lambda_k)} = \mathbb{E}_{q(z_{0:k} \mid x)} \ln w_k, \tag{12}$$

where the RHS re-expresses it in notation consistent with Eq. 11. We denote Eq. 12 as the SVI bound. In general, the BSVI bound with uniform-weighting is not necessarily tighter than the SVI bound. For example, if SVI's last proposal distribution exactly matches the posterior $p_\theta(z \mid x)$, then assigning equal weighting across $(w_0, \ldots, w_k)$ would make the BSVI bound looser.

In practice, we observe the BSVI bound with uniform-weighting to consistently achieve a tighter lower bound than SVI’s last proposal distribution. We attribute this phenomenon to the effectiveness of variance-reduction from averaging multiple importance weights—even when these importance weights are generated from dependent and non-identical proposal distributions.

To guarantee that the BSVI bound is at least as tight as the SVI bound, we propose to optimize the buffer weight $\pi$. This guarantees

$$\max_\pi \mathbb{E}_{q(z_{0:k} \mid x)} \ln \sum_{i=0}^{k} \pi_i w_i \ge \mathbb{E}_{q(z_{0:k} \mid x)} \ln w_k, \tag{13}$$

since the SVI bound is itself a special case of the BSVI bound when $\pi = (0, \ldots, 0, 1)$. It is worth noting that the objective in Eq. 13 is concave with respect to $\pi$, allowing for easy optimization of $\pi$.

Although $\pi$ is a local variational parameter, we shall, for simplicity, optimize only a single global $\pi$ that we update with gradient ascent throughout the course of training. As such, $\pi$ is jointly optimized with $\theta$ and $\phi$.
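One way to optimize a single global buffer weight is sketched below. The softmax parameterization of $\pi$, the learning rate, and the made-up batch of log-weights are all assumptions of this example; the paper only states that $\pi$ is updated by gradient ascent.

```python
import math

def softmax(logits):
    top = max(logits)
    exps = [math.exp(a - top) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]

def bsvi_value(log_ws, pi):
    # ln sum_i pi_i w_i for one buffer of log importance weights
    top = max(log_ws)
    return top + math.log(sum(p * math.exp(lw - top) for p, lw in zip(pi, log_ws)))

def optimize_buffer_weight(log_w_batch, steps=500, lr=0.5):
    """Gradient ascent on the logits of one global pi (softmax is an assumption)."""
    size = len(log_w_batch[0])
    logits = [0.0] * size                       # start from uniform pi
    for _ in range(steps):
        pi = softmax(logits)
        grad = [0.0] * size
        for log_ws in log_w_batch:
            top = max(log_ws)
            mix = [p * math.exp(lw - top) for p, lw in zip(pi, log_ws)]
            total = sum(mix)
            for i in range(size):
                # d/d logit_i of ln sum_j pi_j w_j equals r_i - pi_i,
                # where r_i = pi_i w_i / sum_j pi_j w_j
                grad[i] += (mix[i] / total - pi[i]) / len(log_w_batch)
        logits = [a + lr * g for a, g in zip(logits, grad)]
    return softmax(logits)

batch = [[-3.0, -2.0, -1.0], [-2.5, -2.2, -0.8]]   # made-up ln w_0..ln w_2 per example
uniform = [1.0 / 3.0] * 3
pi_star = optimize_buffer_weight(batch)
before = sum(bsvi_value(lw, uniform) for lw in batch) / len(batch)
after = sum(bsvi_value(lw, pi_star) for lw in batch) / len(batch)
print(pi_star, before, after)
```

In this contrived batch the last weight is always the largest, so the optimized $\pi$ concentrates on the final proposal and the objective approaches the SVI value from below, which is the boundary case of Eq. 13.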

### 4.2 Dependence-Breaking via Double-Sampling

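As observed in [20], taking the gradient of the log-likelihood with respect to $\theta$ results in the expression

$$\nabla_\theta \ln p_\theta(x) = \mathbb{E}_{p_\theta(z \mid x)} \nabla_\theta \ln p_\theta(x, z). \tag{14}$$

We note that the gradient of the ELBO with respect to $\theta$ results in a similar expression

$$\nabla_\theta \mathrm{ELBO}(x) = \mathbb{E}_{q_\phi(z \mid x)} \nabla_\theta \ln p_\theta(x, z). \tag{15}$$

As such, the ELBO gradient differs from the log-likelihood gradient only in terms of the distribution applied by the expectation operator. To approximate the log-likelihood gradient, we wish to set $q_\phi(z \mid x)$ close to $p_\theta(z \mid x)$ under some divergence.

We now show what results from computing the gradient of the BSVI objective: the BSVI gradient with respect to $\theta$ is

$$\nabla_\theta \mathrm{BSVI}(x) = \mathbb{E}_{q_{\mathrm{sir}}(z \mid x)} \nabla_\theta \ln p_\theta(x, z), \tag{16}$$

where $q_{\mathrm{sir}}$ is a sampling-importance-resampling procedure defined by the generative process

$$\begin{align}
z_{0:k} &\sim q(z_{0:k} \mid x) \tag{17} \\
i &\sim r(i \mid z_{0:k}) \tag{18} \\
z &\leftarrow z_i, \tag{19}
\end{align}$$

and $r(i \mid z_{0:k}) = \pi_i w_i / \sum_j \pi_j w_j$ is a probability mass function over $\{0, \ldots, k\}$. A detailed proof is provided in Appendix A.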

A natural question to ask is whether BSVI's $q_{\mathrm{sir}}(z \mid x)$ is closer to the posterior than $q(z; \lambda_k)$ in expectation. To assist in this analysis, we first characterize a particular instance of the Generalized IWAE bound when $(z_0, \ldots, z_k)$ are independent but non-identically distributed.

Theorem. When $q(z_{0:k}) = \prod_i q_i(z_i)$, the implicit distribution $q_{\mathrm{sir}}(z)$ admits the inequality

$$\begin{align}
\mathbb{E}_{q_{\mathrm{sir}}(z)} \ln \frac{p_\theta(x, z)}{q_{\mathrm{sir}}(z)} &\ge \mathbb{E}_{q(z_{0:k})} \ln \sum_{i=0}^{k} \pi_i w_i \tag{20} \\
&= \mathbb{E}_{q(z_{0:k})} \ln \sum_{i=0}^{k} \pi_i \frac{p_\theta(x, z_i)}{q_i(z_i)}. \tag{21}
\end{align}$$

This theorem extends the analysis by [23] from the i.i.d. case (i.e. the standard IWAE bound) to the non-identical case (proof in Appendix A). It remains an open question whether the inequality holds for the non-independent case.

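Since the BSVI objective employs dependent samples, it does not fulfill the conditions of the preceding theorem (Eqs. 20 and 21). To address this issue, we propose a variant, BSVI with double-sampling (BSVI-DS), that breaks dependency by drawing two samples at each SVI step: $\hat{z}_i$ for computing the SVI gradient update and $z_i$ for computing the BSVI importance weight $w_i$. The BSVI-DS bound is thus

$$\mathbb{E}_{q(\hat{z}_{<k} \mid x)} \, \mathbb{E}_{q(z_{0:k} \mid \hat{z}_{<k}, x)} \ln \sum_{i=0}^{k} \pi_i w_i, \tag{22}$$

where $q(z_{0:k} \mid \hat{z}_{<k}, x) = \prod_i q(z_i \mid \hat{z}_{<i}, x)$ is a product of independent but non-identical distributions when conditioned on $(\hat{z}_{<k}, x)$. Double-sampling now allows us to make the following comparison.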

Corollary. Let $q_k = q(z; \lambda_k)$ denote the proposal distribution found by SVI. For any choice of $k$, the distribution $q_{\mathrm{sir}}$ implied by BSVI-DS (with optimal weighting $\pi^*$) is at least as close to $p_\theta(z \mid x)$ as $q_k$,

$$D_{\mathrm{KL}}\left(q_{\mathrm{sir}} \,\|\, p_\theta(z \mid x)\right) \le D_{\mathrm{KL}}\left(q_k \,\|\, p_\theta(z \mid x)\right), \tag{23}$$

as measured by the Kullback-Leibler divergence.

The corollary follows from the preceding theorem (Eqs. 20 and 21) and the fact that the BSVI-DS bound under optimal $\pi$ is no worse than the SVI bound. Although the double-sampling procedure seems necessary for the inequality in Eq. 23 to hold, in practice we do not observe any appreciable difference between BSVI and BSVI-DS.

## 5 Computational Considerations

Another important consideration is the speed of training the generative model with BSVI versus SVI. Since BSVI reuses the trajectory of weights generated by SVI, the forward pass incurs the same cost. The backwards pass for BSVI, however, is $O(k)$ for $k$ SVI steps, in contrast to SVI's $O(1)$ cost. To make the cost of BSVI's backwards pass $O(1)$, we highlight a similar observation from the original IWAE study [21] that the gradient can be approximated via Monte Carlo sampling

$$\nabla_\theta \mathrm{BSVI}(x) \approx \frac{1}{m} \sum_{i=1}^{m} \nabla_\theta \ln p_\theta(x, z^{(i)}), \tag{24}$$

where $z^{(i)}$ is sampled from BSVI's implicit distribution $q_{\mathrm{sir}}(z \mid x)$. We denote this as training BSVI with sample-importance-resampling (BSVI-SIR). Setting $m = 1$ allows variational autoencoder training with BSVI-SIR to have the same wall-clock speed as training with SVI.
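The resampling step behind Eq. 24 amounts to drawing an index with probability proportional to $\pi_i w_i$ and returning the corresponding buffered sample. The following is a minimal sketch with hypothetical function names; it assumes the buffer of samples and log-weights has already been produced by an SVI run.

```python
import math
import random

def resampling_probs(log_ws, log_pi):
    # r(i | z_{0:k}) = pi_i w_i / sum_j pi_j w_j, computed stably in log space
    terms = [lp + lw for lp, lw in zip(log_pi, log_ws)]
    top = max(terms)
    exps = [math.exp(t - top) for t in terms]
    total = sum(exps)
    return [e / total for e in exps]

def sir_sample(zs, log_ws, log_pi, rng, m=1):
    """Draw m latent samples from the implicit q_sir by resampling the buffer."""
    probs = resampling_probs(log_ws, log_pi)
    return rng.choices(zs, weights=probs, k=m)

log_pi = [math.log(0.5), math.log(0.5)]            # uniform buffer weight
probs = resampling_probs([0.0, math.log(3.0)], log_pi)
picked = sir_sample([0.1, 0.9], [0.0, -1000.0], log_pi, random.Random(0), m=5)
print(probs, picked)
```

With $m = 1$, only one buffered sample needs a backward pass through the generative model, which is where the constant-cost claim above comes from.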

## 6 Experiments

### 6.1 Setup

We evaluated the performance of our method by training variational autoencoders with BSVI-SIR with buffer weight optimization (BSVI-SIR-$\pi$) on the dynamically-binarized Omniglot, grayscale SVHN, and FashionMNIST datasets (a complete evaluation of all BSVI variants is available in Appendix B). Our main comparison is against the SVI training procedure (as described in Algorithm 1). We also show the performance of the standard VAE and IWAE training procedures. Importantly, we note that we have chosen to compare SVI-$k$ and IWAE-$k$ against BSVI-SIR trained with $k - 1$ SVI steps. This is because BSVI with $k$ SVI steps generates $k + 1$ importance weights.

For all our experiments, we use the same architecture as [18] (where the decoder is a PixelCNN) and train with the AMSGrad optimizer [24]. For grayscale SVHN, we follow [25] and replace the Bernoulli observation model of [18] with a discretized logistic distribution model with a global scale parameter. Each model was trained for up to 200k steps with early-stopping based on validation set performance. For the Omniglot experiment, we followed the training procedure in [18] and annealed the KL term multiplier [2, 26] during the first iterations. We replicated all experiments four times and report the mean and standard deviation of all relevant metrics. For additional details, refer to Appendix D.

### 6.2 Log-Likelihood Performance

For all models, we report the log-likelihood (as measured by BSVI-500). We additionally report the SVI-500 (ELBO*) bound along with its decomposition into rate (KL*) and distortion (Reconstruction*) components [27]. We highlight that KL* provides a fair comparison of the rate achieved by each model without concern of misrepresentation caused by the amortized inference suboptimality.

Omniglot. Table 1 shows that BSVI-SIR outperforms SVI on the test set log-likelihood. BSVI-SIR also makes greater usage of the latent space (as measured by the lower Reconstruction*). Interestingly, BSVI-SIR’s log-likelihoods are noticeably higher than its corresponding ELBO*, suggesting that BSVI-SIR has learned posterior distributions not easily approximated by the Gaussian variational family when trained on Omniglot.

SVHN. Table 2 shows that BSVI-SIR outperforms SVI on test set log-likelihood. We observe that both BSVI-SIR and SVI significantly outperform both VAE and IWAE on log-likelihood, ELBO*, and Reconstruction*, demonstrating the efficacy of iteratively refining the proposal distributions found by the amortized inference model during training.

FashionMNIST. Table 3 similarly shows that BSVI-SIR outperforms SVI on test set log-likelihood. Here, BSVI achieves significantly better Reconstruction* as well as higher ELBO* compared to VAE, IWAE, and SVI.

In Tables 4, 5 and 6 (Appendix B), we also observe that the use of double sampling and buffer weight optimization makes no appreciable difference compared with their respective counterparts, demonstrating the efficacy of BSVI even when the samples are statistically dependent and the buffer weight is simply uniform.

### 6.3 Stochastic Gradient as Regularizer

Interestingly, Table 4 shows that BSVI-SIR can outperform BSVI on the test set despite having a higher variance gradient. We show in Figure 4 that this is the result of BSVI overfitting the training set. The results demonstrate the regularizing effect of having noisier gradients and thus provide informative empirical evidence to the on-going discussion about the relationship between generalization and the gradient signal-to-noise ratio in variational autoencoders [28, 16].

### 6.4 Latent Space Visualization

Table 1 shows that the model learned by BSVI-SIR training has better Reconstruction* than SVI, indicating greater usage of the latent variable for encoding information about the input image. We provide a visualization of the difference in latent space usage in Figure 5. Here, we sample multiple images conditioned on a fixed $z$. Since BSVI encoded more information into $z$ than SVI on the Omniglot dataset, we see that the conditional distribution $p_\theta(x \mid z)$ of the model learned by BSVI has lower entropy (i.e. is less diverse) than that of SVI.

### 6.5 Analysis of Training Metrics

Recall that the BSVI-$k$ training procedure runs SVI-$k$ as a subroutine, and therefore generates the trajectory of importance weights $(w_0, \ldots, w_k)$. Note that $\ln w_0$ and $\ln w_k$ are unbiased estimates of the ELBO achieved by the proposal distributions $q(z; \lambda_0)$ (SVI-$0$ bound) and $q(z; \lambda_k)$ (SVI-$k$ bound) respectively. It is thus possible to monitor the health of the BSVI training procedure by checking whether the bounds adhere to the ordering

$$\text{BSVI-}k \ge \text{SVI-}k \ge \text{SVI-}0 \tag{25}$$

in expectation. Figures 5(b) and 5(a) show that this is indeed the case. Since Omniglot was trained with KL-annealing [18], we see in Figure 5(a) that SVI plays a negligible role once the warm-up phase (the initial iterations) is over. In contrast, SVI plays an increasingly large role when training on the more complex SVHN and FashionMNIST datasets, demonstrating that the amortization gap is a significantly bigger issue in the generative modeling of SVHN and FashionMNIST. Figure 5(b) further shows that BSVI-$k$ consistently achieves a better bound than SVI-$k$. When the buffer weight is also optimized, we see in Figure 5(c) that $\pi$ learns to upweight the later proposal distributions in $(q_0, \ldots, q_k)$, as measured by the buffer weight average. For SVHN, the significant improvement of SVI-$k$ over SVI-$0$ results in $\pi$ being biased significantly toward the later proposal distributions. Interestingly, although Figure 5(c) suggests that the optimal buffer weight can differ significantly from naive uniform-weighting, we see from Tables 2 and 1 that buffer weight optimization has a negligible effect on the overall model performance.

## 7 Conclusion

In this paper, we proposed Buffered Stochastic Variational Inference (BSVI), a novel way to leverage the intermediate importance weights generated by stochastic variational inference. We showed that BSVI is effective at alleviating inference suboptimality and that training variational autoencoders with BSVI consistently outperforms its SVI counterpart, making BSVI an attractive and simple drop-in replacement for models that employ SVI. One promising line of future work is to extend the BSVI training procedure with the end-to-end learning approaches in [18, 19]. Additionally, we showed that the BSVI objective is a valid lower bound and belongs to a general class of importance-weighted (Generalized IWAE) bounds where the importance weights are statistically dependent. Thus, it would be of interest to study the implications of this bound for certain MCMC procedures such as Annealed Importance Sampling [29] and others.

#### Acknowledgements

We would like to thank Matthew D. Hoffman for his insightful comments and discussions during this project. This research was supported by NSF (#1651565, #1522054, #1733686), ONR (N00014-19-1-2145), AFOSR (FA9550-19-1-0024), and FLI.

## Appendix A Proofs

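Theorem (Generalized IWAE bound, Eq. 10).

###### Proof.

To show the validity of this lower bound, note that

$$\begin{align}
\mathbb{E}_{q(z_0, \ldots, z_k)}\left[\sum_i \pi_i \frac{p_\theta(x, z_i)}{q(z_i \mid z_{v(i)})}\right]
&= \sum_i \pi_i \, \mathbb{E}_{q(z_0, \ldots, z_k)} \frac{p_\theta(x, z_i)}{q(z_i \mid z_{v(i)})} \tag{26} \\
&= \sum_i \pi_i \, \mathbb{E}_{q(z_{v(i)})} \mathbb{E}_{q(z_i \mid z_{v(i)})} \frac{p_\theta(x, z_i)}{q(z_i \mid z_{v(i)})} \tag{27} \\
&= \sum_i \pi_i \, \mathbb{E}_{q(z_{v(i)})} p_\theta(x) \tag{28} \\
&= p_\theta(x). \tag{29}
\end{align}$$

Applying Jensen's inequality shows that the lower bound in the theorem is valid. ∎

BSVI gradient (Eq. 16).

###### Proof.

$$\begin{align}
\nabla_\theta \mathrm{BSVI}(x) &= \mathbb{E}_{q(z_{0:k} \mid x)} \nabla_\theta \ln \sum_{i=0}^{k} \pi_i w_i \tag{30} \\
&= \mathbb{E}_{q(z_{0:k} \mid x)} \mathbb{E}_{r(i \mid z_{0:k})} \nabla_\theta \ln p_\theta(x, z_i). \tag{31}
\end{align}$$

The double-expectation can now be reinterpreted as the sampling-importance-resampling distribution $q_{\mathrm{sir}}$. ∎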
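The step from Eq. 30 to Eq. 31 can be spelled out as follows. This is a sketch under the assumption that the proposal terms $q(z_i \mid z_{<i}, x)$ are treated as constants with respect to $\theta$ (that is, no gradient flows through the SVI updates). This appears to match the decoupled training setup but is not stated explicitly at this point in the text.

$$\begin{aligned}
\nabla_\theta \ln \sum_{i=0}^{k} \pi_i w_i
&= \frac{\sum_i \pi_i \nabla_\theta w_i}{\sum_j \pi_j w_j}
= \sum_i \frac{\pi_i w_i}{\sum_j \pi_j w_j} \nabla_\theta \ln w_i \\
&= \sum_i r(i \mid z_{0:k}) \, \nabla_\theta \ln p_\theta(x, z_i)
= \mathbb{E}_{r(i \mid z_{0:k})} \nabla_\theta \ln p_\theta(x, z_i).
\end{aligned}$$

The second equality uses $\nabla_\theta w_i = w_i \nabla_\theta \ln w_i$, and the third uses $\nabla_\theta \ln w_i = \nabla_\theta \ln p_\theta(x, z_i)$, which holds under the stated assumption because $\ln w_i = \ln p_\theta(x, z_i) - \ln q(z_i \mid z_{<i}, x)$.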

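Theorem (Eqs. 20 and 21).

###### Proof.

Recall that $q_{\mathrm{sir}}$ is defined by the following sampling procedure

$$\begin{align}
(z_0, \ldots, z_k) &\sim q(z_0, \ldots, z_k) \tag{32} \\
i &\sim r(i \mid z_{0:k}) \tag{33} \\
z &\leftarrow z_i, \tag{34}
\end{align}$$

where

$$r(i \mid z_{0:k}) = \frac{\pi_i w_i}{\sum_j \pi_j w_j} = \frac{\pi_i \, p(x, z_i) / q(z_i \mid z_{<i})}{\sum_j \pi_j \, p(x, z_j) / q(z_j \mid z_{<j})}. \tag{35}$$

We first note that, for any distribution $r(z)$,

$$r(z) = \int_a r(a) \, \delta_z(a) \, da = \mathbb{E}_{r(a)} \delta_z(a). \tag{36}$$

This provides an intuitive way of constructing the probability density function by reframing it as a sampling process (the expectation w.r.t. $r(a)$) paired with a filtering procedure (the Dirac delta $\delta_z$). Thus, the density under $q_{\mathrm{sir}}$ is

$$q_{\mathrm{sir}}(z) = \mathbb{E}_{q(z_{0:k})} \mathbb{E}_{r(i \mid z_{0:k})} \delta_z(z_i). \tag{37}$$

Additionally, we shall introduce the following terms

$$\begin{align}
\tilde{p}(z) &= p_\theta(x, z) \tag{38} \\
\bar{w}_i &= \frac{\pi_i w_i}{\sum_j \pi_j w_j} \tag{39} \\
\bar{v}_i &= \frac{w_i}{\sum_j \pi_j w_j} \tag{40} \\
\bar{v}_i(z) &= \frac{w(z)}{\pi_i w(z) + \sum_{j \ne i} \pi_j w_j} \tag{41}
\end{align}$$

for notational simplicity. Note that the density function can be re-expressed as

$$\begin{align}
q_{\mathrm{sir}}(z) &= \mathbb{E}_{q(z_{0:k})} \mathbb{E}_{r(i \mid z_{0:k})} \delta_z(z_i) \tag{42} \\
&= \mathbb{E}_{q(z_{0:k})} \sum_i \pi_i \bar{v}_i \, \delta_z(z_i) \tag{43} \\
&= \mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \mathbb{E}_{q_i(z_i \mid z_{-i})} \bar{v}_i \, \delta_z(z_i) \tag{44} \\
&= \mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \bar{v}_i(z) \, q_i(z \mid z_{-i}). \tag{45}
\end{align}$$

We now begin with the ELBO under $q_{\mathrm{sir}}$ and proceed from there via

$$\begin{align}
\mathbb{E}_{q_{\mathrm{sir}}(z)} \ln \frac{\tilde{p}(z)}{q_{\mathrm{sir}}(z)} &= -\tilde{D}_{\mathrm{KL}}\left(q_{\mathrm{sir}}(z) \,\|\, \tilde{p}(z)\right) \tag{46} \\
&= -\tilde{D}_{\mathrm{KL}}\left(\mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \bar{v}_i(z) \, q_i(z \mid z_{-i}) \,\|\, \tilde{p}(z)\right) \tag{47} \\
&\ge -\mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \tilde{D}_{\mathrm{KL}}\left(\bar{v}_i(z) \, q_i(z \mid z_{-i}) \,\|\, \tilde{p}(z)\right), \tag{48}
\end{align}$$

where we use Jensen's inequality to exploit the convexity of the unnormalized Kullback-Leibler divergence $\tilde{D}_{\mathrm{KL}}(\cdot \,\|\, \cdot)$. We now do a small change of notation when rewriting the unnormalized KL as an integral to keep the notation simple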

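$$\begin{align}
\mathbb{E}_{q_{\mathrm{sir}}(z)} \ln \frac{\tilde{p}(z)}{q_{\mathrm{sir}}(z)}
&\ge \mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \int_{z_i} \bar{v}_i \, q(z_i \mid z_{-i}) \ln \frac{\tilde{p}(z_i)}{\bar{v}_i \, q(z_i \mid z_{-i})} \tag{49} \\
&= \mathbb{E}_{\pi(i)} \mathbb{E}_{q(z_{-i})} \mathbb{E}_{q(z_i \mid z_{-i})} \bar{v}_i \ln \frac{\tilde{p}(z_i)}{\bar{v}_i \, q(z_i \mid z_{-i})} \tag{50} \\
&= \mathbb{E}_{q(z_{0:k})} \sum_i \bar{w}_i \ln \frac{\tilde{p}(z_i)}{\bar{v}_i \, q(z_i \mid z_{-i})} \tag{51} \\
&= \mathbb{E}_{q(z_{0:k})} \sum_i \bar{w}_i \ln \left(\sum_j \pi_j w_j \cdot \frac{q(z_i \mid z_{<i})}{q(z_i \mid z_{-i})}\right). \tag{52}
\end{align}$$

If $(z_0, \ldots, z_k)$ are independent, then it follows that $q(z_i \mid z_{<i}) = q(z_i \mid z_{-i})$. Thus,

$$\begin{align}
\mathbb{E}_{q_{\mathrm{sir}}(z)} \ln \frac{\tilde{p}(z)}{q_{\mathrm{sir}}(z)}
&\ge \mathbb{E}_{q(z_{0:k})} \sum_i \bar{w}_i \left[\ln \left(\sum_j \pi_j w_j\right) + \ln \left(\frac{q(z_i \mid z_{<i})}{q(z_i \mid z_{-i})}\right)\right] \tag{53} \\
&= \mathbb{E}_{q(z_{0:k})} \sum_i \bar{w}_i \ln \sum_j \pi_j w_j \notag \\
&= \mathbb{E}_{q(z_{0:k})} \ln \sum_j \pi_j w_j, \notag
\end{align}$$

where the last step uses $\sum_i \bar{w}_i = 1$. ∎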

## Appendix B Model Performance on Test and Training Data

Here we report various performance metrics for each type of model trained on the training set for both Omniglot and SVHN. As stated earlier, log-likelihood is estimated using BSVI-500, and ELBO* refers to the lower bound achieved by SVI-500 (i.e. $\mathbb{E}_{q(z_{0:500} \mid x)} \ln w_{500}$). KL* and Reconstruction* are the rate and distortion terms for ELBO*, respectively.

$$\text{Log-likelihood} = \mathbb{E}_{q(z_{0:500} \mid x)}\left[\ln \sum_{i=0}^{500} \pi_i \frac{p_\theta(x, z_i)}{q(z_i \mid z_{<i}, x)}\right]$$