We build upon AESMC (Le et al., 2017), a method for model learning that itself builds on VAE (Kingma & Welling, 2014; Rezende et al., 2014) and IWAE (Burda et al., 2016). AESMC is similarly based on maximizing a lower bound to the log marginal likelihood, but uses SMC (Doucet & Johansen, 2009) as the underlying marginal likelihood estimator instead of IS. For a very wide array of models, particularly those with sequential structure, SMC forms a substantially more powerful inference method than IS, typically returning lower variance estimates for the marginal likelihood. Consequently, by using SMC for its marginal likelihood estimation, AESMC often leads to improvements in model learning compared with VAE and IWAE. We provide experiments on structured time-series data that show that AESMC-based learning was able to learn useful representations of the latent space for both reconstruction and prediction more effectively than the IWAE counterpart.
AESMC was introduced in an earlier preprint (Le et al., 2017) concurrently with the closely related methods of Maddison et al. (2017) and Naesseth et al. (2017). In this work we take these ideas further by providing new theoretical insights for the resulting ELBOs, extending these to explore the relative efficiency of different approaches to proposal learning, and using our results to develop a new and improved training procedure. In particular, we introduce a method for expressing the gap between an ELBO and the log marginal likelihood as a KL divergence between two distributions on an extended sampling space. Doing so allows us to investigate the behavior of this family of algorithms when the objective is maximized perfectly, which occurs only if the KL divergence becomes zero. In the IWAE case, this implies that the proposal distributions are equal to the posterior distributions under the learned model. In the AESMC case, it has implications for both the proposal distributions and the intermediate set of targets that are learned. We demonstrate that, somewhat counter-intuitively, using lower variance estimates for the marginal likelihood can actually be harmful to proposal learning. Using these insights, we experiment with an adaptation to the AESMC algorithm, which we call alternating ELBOs, that uses different lower bounds for updating the model parameters and proposal parameters. We observe that this adaptation can, in some cases, improve model learning and proposal adaptation.
2.1 State-Space Models
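SSMs are probabilistic models over a set of latent variables $x_{1:T}$ and observed variables $y_{1:T}$. Given parameters $\theta$, an SSM is characterized by an initial density $\mu_\theta(x_1)$, a series of transition densities $f_{t,\theta}(x_t \mid x_{1:t-1})$, and a series of emission densities $g_{t,\theta}(y_t \mid x_{1:t})$, with the joint density being
$$p_\theta(x_{1:T}, y_{1:T}) = \mu_\theta(x_1) \prod_{t=2}^{T} f_{t,\theta}(x_t \mid x_{1:t-1}) \prod_{t=1}^{T} g_{t,\theta}(y_t \mid x_{1:t}).$$
We are usually interested in approximating the posterior $p_\theta(x_{1:T} \mid y_{1:T})$ or the expectation of some test function $\varphi$ under this posterior, $I(\varphi) := \int \varphi(x_{1:T}) \, p_\theta(x_{1:T} \mid y_{1:T}) \, dx_{1:T}$. We refer to these two tasks as inference. Inference in models which are non-linear, non-discrete, and non-Gaussian is difficult and one must resort to approximate methods, for which SMC has been shown to be one of the most powerful approaches (Doucet & Johansen, 2009).
We will consider model learning as a problem of maximizing the marginal likelihood $p_\theta(y_{1:T}) = \int p_\theta(x_{1:T}, y_{1:T}) \, dx_{1:T}$ in the family of models parameterized by $\theta$.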
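The factorization of the joint density into initial, transition, and emission terms can be illustrated with a small sketch. The model below is an assumption made purely for illustration (a scalar linear-Gaussian SSM with unit variances and a single parameter `theta`); the function names `sample_ssm` and `log_joint` are hypothetical and not taken from the paper.

```python
import math
import random


def log_normal(value, mean, var):
    """Log density of a univariate Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def sample_ssm(T, theta, rng):
    """Ancestral sampling from a toy SSM (assumed for illustration):
    x_1 ~ N(0, 1), x_t | x_{t-1} ~ N(theta * x_{t-1}, 1), y_t | x_t ~ N(x_t, 1)."""
    xs, ys = [], []
    for t in range(T):
        if t == 0:
            x = rng.gauss(0.0, 1.0)              # initial density
        else:
            x = rng.gauss(theta * xs[-1], 1.0)   # transition density
        xs.append(x)
        ys.append(rng.gauss(x, 1.0))             # emission density
    return xs, ys


def log_joint(xs, ys, theta):
    """Log joint density: initial term + transition terms + emission terms."""
    total = log_normal(xs[0], 0.0, 1.0)
    for t in range(1, len(xs)):
        total += log_normal(xs[t], theta * xs[t - 1], 1.0)
    for x, y in zip(xs, ys):
        total += log_normal(y, x, 1.0)
    return total


xs, ys = sample_ssm(T=5, theta=0.9, rng=random.Random(0))
print(log_joint(xs, ys, theta=0.9))
```

Model learning in this setting would mean adjusting `theta` to increase the marginal likelihood of `ys`, which requires integrating out `xs` and is what the estimators in the following sections approximate.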
2.2 Sequential Monte Carlo
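SMC performs approximate inference on a sequence of target distributions $(\pi_t(x_{1:t}))_{t=1}^{T}$. In the context of SSMs, the target distributions are often taken to be $(p_\theta(x_{1:t} \mid y_{1:t}))_{t=1}^{T}$. Given a parameter $\phi$ and proposal distributions $q_{1,\phi}(x_1 \mid y_1)$ and $(q_{t,\phi}(x_t \mid y_{1:t}, x_{1:t-1}))_{t=2}^{T}$ from which we can sample and whose densities we can evaluate, SMC is described in Algorithm 1.
Using the set of weighted particles $(\tilde{x}_{1:T}^{k}, w_T^{k})_{k=1}^{K}$ at the last time step, we can approximate the posterior as $\sum_{k=1}^{K} \bar{w}_T^{k} \delta_{\tilde{x}_{1:T}^{k}}(x_{1:T})$ and the integral $I(\varphi)$ of a test function $\varphi$ under the posterior as $\sum_{k=1}^{K} \bar{w}_T^{k} \varphi(\tilde{x}_{1:T}^{k})$, where $\bar{w}_T^{k} := w_T^{k} / \sum_{j} w_T^{j}$ is the normalized weight and $\delta_z$ is a Dirac measure centered on $z$. Furthermore, one can obtain an unbiased estimator of the marginal likelihood $p_\theta(y_{1:T})$ using the intermediate particle weights:
$$\hat{Z}_{\text{SMC}} := \prod_{t=1}^{T} \left[ \frac{1}{K} \sum_{k=1}^{K} w_t^{k} \right]. \quad (1)$$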
The sequential nature of SMC and the resampling step are crucial in making SMC scalable to large $T$. The former makes it easier to design efficient proposal distributions as each step need only target the next set of variables $x_t$. The resampling step allows the algorithm to focus on promising particles in light of new observations, avoiding the exponential divergence between the weights of different samples that occurs for importance sampling as $T$ increases. This can be demonstrated both empirically and theoretically (Del Moral, 2004, Chapter 9). We refer the reader to Doucet & Johansen (2009) for an in-depth treatment of SMC.
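As a rough illustration of the propagate, weight, and resample loop and of the marginal likelihood estimator built from the intermediate weights, here is a minimal sketch of a bootstrap particle filter. It assumes a toy scalar linear-Gaussian SSM (x_1 ~ N(0, 1), x_t | x_{t-1} ~ N(theta x_{t-1}, 1), y_t | x_t ~ N(x_t, 1)), which is not a model from the paper, and uses multinomial resampling at every step. All function names are hypothetical. A Kalman filter gives the exact log marginal likelihood for this model, so the two values can be compared.

```python
import math
import random


def log_normal(value, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def log_mean_exp(log_values):
    """Numerically stable log of the average of exp(log_values)."""
    m = max(log_values)
    return m + math.log(sum(math.exp(v - m) for v in log_values) / len(log_values))


def bootstrap_smc_log_z(ys, theta, K, rng):
    """Log of the SMC marginal likelihood estimate for the assumed toy model."""
    log_z = 0.0
    particles = [rng.gauss(0.0, 1.0) for _ in range(K)]       # propose from the prior
    weights = None
    for t, y in enumerate(ys):
        if t > 0:
            # Resample ancestors in proportion to the previous weights, then propagate.
            ancestors = rng.choices(particles, weights=weights, k=K)
            particles = [rng.gauss(theta * x, 1.0) for x in ancestors]
        # With the bootstrap proposal the weight is the emission density.
        log_w = [log_normal(y, x, 1.0) for x in particles]
        log_z += log_mean_exp(log_w)                           # one factor of the product over t
        m = max(log_w)
        weights = [math.exp(v - m) for v in log_w]
    return log_z


def kalman_log_lik(ys, theta):
    """Exact log marginal likelihood of the same toy model, for comparison."""
    mean, var, total = 0.0, 1.0, 0.0
    for t, y in enumerate(ys):
        if t > 0:
            mean, var = theta * mean, theta ** 2 * var + 1.0   # predict
        total += log_normal(y, mean, var + 1.0)                # p(y_t | y_{1:t-1})
        gain = var / (var + 1.0)
        mean, var = mean + gain * (y - mean), (1.0 - gain) * var  # update
    return total


ys = [0.3, -0.1, 0.8, 1.2, 0.5]
print(bootstrap_smc_log_z(ys, theta=0.9, K=2000, rng=random.Random(1)))
print(kalman_log_lik(ys, theta=0.9))
```

Because only the current particle values are needed for this Markovian toy model, the sketch does not store full trajectories; a general implementation would track ancestor indices as well.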
2.3 Importance Weighted Auto-Encoders
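Given a dataset of observations $(y^{(n)})_{n=1}^{N}$, a generative network $p_\theta(x, y)$ and an inference network $q_\phi(x \mid y)$, IWAEs (Burda et al., 2016) maximize $\frac{1}{N} \sum_{n=1}^{N} \mathrm{ELBO}_{\text{IS}}(\theta, \phi, y^{(n)})$ where, for a given observation $y$, $\mathrm{ELBO}_{\text{IS}}$ (with $K$ particles) is a lower bound on $\log p_\theta(y)$ by Jensen's inequality:
$$\mathrm{ELBO}_{\text{IS}}(\theta, \phi, y) = \int Q_{\text{IS}}(x^{1:K}) \log \hat{Z}_{\text{IS}}(x^{1:K}) \, dx^{1:K} \le \log p_\theta(y),$$
$$Q_{\text{IS}}(x^{1:K}) = \prod_{k=1}^{K} q_\phi(x^{k} \mid y), \quad (3)$$
$$\hat{Z}_{\text{IS}}(x^{1:K}) = \frac{1}{K} \sum_{k=1}^{K} \frac{p_\theta(x^{k}, y)}{q_\phi(x^{k} \mid y)}.$$
The IWAE optimization is performed using SGA where a sample from $Q_{\text{IS}}$ is obtained using the reparameterization trick (Kingma & Welling, 2014) and the gradient $\nabla_{\theta,\phi} \log \hat{Z}_{\text{IS}}(x^{1:K})$ is used to perform an optimization step.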
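To make the importance-weighted bound concrete, the sketch below evaluates the log of the K-particle importance sampling estimate for a toy model chosen only for illustration: z ~ N(0, 1) and y | z ~ N(z, 1), so that log p(y) = log N(y; 0, 2) and the posterior is N(y/2, 1/2). The Gaussian proposal and the name `iwae_log_z` are assumptions, not the paper's networks.

```python
import math
import random


def log_normal(value, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (value - mean) ** 2 / var)


def iwae_log_z(y, q_mean, q_var, K, rng):
    """Single-sample estimate of the K-particle bound for the assumed toy model."""
    log_w = []
    for _ in range(K):
        z = q_mean + math.sqrt(q_var) * rng.gauss(0.0, 1.0)   # reparameterized draw from q
        log_w.append(log_normal(z, 0.0, 1.0) + log_normal(y, z, 1.0)
                     - log_normal(z, q_mean, q_var))
    m = max(log_w)
    return m + math.log(sum(math.exp(v - m) for v in log_w) / K)  # log of the average weight


rng = random.Random(0)
y = 2.0
exact = log_normal(y, 0.0, 2.0)
# Averages over repeated draws approximate the bound itself.
for K in (1, 10, 100):
    avg = sum(iwae_log_z(y, 0.0, 1.0, K, rng) for _ in range(500)) / 500
    print(K, avg, exact)
```

When the proposal equals the exact posterior, every weight equals p(y) and the estimate matches log p(y) for any K; with a mismatched proposal the average sits below log p(y) and approaches it as K grows.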
3 Auto-Encoding Sequential Monte Carlo
AESMC implements model learning, proposal adaptation, and inference amortization in a similar manner to the VAE and the IWAE: it uses SGA on an empirical average of the ELBO over observations. However, it varies in the form of this ELBO. In this section, we will introduce the AESMC ELBO, explain how gradients of it can be estimated, and discuss the implications of these changes.
3.1 Objective Function
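Consider a family of SSMs $\{p_\theta(x_{1:T}, y_{1:T}) : \theta \in \Theta\}$ and a family of proposal distributions $\{q_\phi(x_{1:T} \mid y_{1:T}) = q_{1,\phi}(x_1 \mid y_1) \prod_{t=2}^{T} q_{t,\phi}(x_t \mid x_{1:t-1}, y_{1:t}) : \phi \in \Phi\}$. AESMC uses an ELBO objective based on the SMC marginal likelihood estimator (1). In particular, for a given $y_{1:T}$, the objective is defined as
$$\mathrm{ELBO}_{\text{SMC}}(\theta, \phi, y_{1:T}) := \int Q_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) \log \hat{Z}_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) \, dx_{1:T}^{1:K} \, da_{1:T-1}^{1:K},$$
where $\hat{Z}_{\text{SMC}}$ is defined in (1) and $Q_{\text{SMC}}$ is the sampling distribution of SMC,
$$Q_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) := \left( \prod_{k=1}^{K} q_{1,\phi}(x_1^{k} \mid y_1) \right) \prod_{t=2}^{T} \prod_{k=1}^{K} q_{t,\phi}(x_t^{k} \mid x_{1:t-1}^{a_{t-1}^{k}}, y_{1:t}) \, \mathrm{Discrete}(a_{t-1}^{k} \mid w_{t-1}^{1:K}), \quad (6)$$
with $a_{t-1}^{k}$ the ancestor index of particle $k$ and $\mathrm{Discrete}(\cdot \mid w_{t-1}^{1:K})$ the categorical distribution with probabilities proportional to the weights.
$\mathrm{ELBO}_{\text{SMC}}$ forms a lower bound to the log marginal likelihood $\log p_\theta(y_{1:T})$ due to Jensen's inequality and the unbiasedness of the marginal likelihood estimator. Hence, given a dataset $(y_{1:T}^{(n)})_{n=1}^{N}$, we can perform model learning based on maximizing the lower bound of $\frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_{1:T}^{(n)})$ as a surrogate target, namely by maximizing $\frac{1}{N} \sum_{n=1}^{N} \mathrm{ELBO}_{\text{SMC}}(\theta, \phi, y_{1:T}^{(n)})$.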
For notational convenience, we will talk about optimizing ELBOs in the rest of this section. However, we note that the main intended use of AESMC is to amortize over datasets, for which the ELBO is replaced by the dataset average in the optimization target. Nonetheless, rather than using the full dataset for each gradient update, we will instead use minibatches, noting that this forms an unbiased estimator of the dataset average.
3.2 Gradient Estimation
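We describe a gradient estimator used for optimizing $\mathrm{ELBO}_{\text{SMC}}(\theta, \phi, y_{1:T})$ using SGA. The SMC sampler in Algorithm 1 proceeds by sampling $x_1^{1:K}, a_1^{1:K}, x_2^{1:K}, \ldots$ sequentially from their respective distributions until the whole particle-weight trajectory $(x_{1:T}^{1:K}, a_{1:T-1}^{1:K})$ is sampled. From this trajectory, using equation (1), we can obtain an estimator for the marginal likelihood.
Assuming that the sampling of latent variables $x_{1:T}^{1:K}$ is reparameterizable, we can make their sampling independent of $(\theta, \phi)$. In particular, assume that there exists a set of auxiliary random variables $\epsilon_{1:T}^{1:K}$ where $\epsilon_t^{k} \sim s_t$ and a set of reparameterization functions $r_t$. We can simulate the SMC sampler by first sampling $\epsilon_1^{1:K}$ and setting $x_1^{k} = r_1(\epsilon_1^{k})$ and the weights $w_1^{k}$ accordingly, then for $t = 2, \ldots, T$ cycling through sampling $a_{t-1}^{k} \sim \mathrm{Discrete}(\cdot \mid w_{t-1}^{1:K})$ and $\epsilon_t^{k} \sim s_t$, and setting $x_t^{k} = r_t(\epsilon_t^{k})$ and the weights $w_t^{k}$ accordingly. We use the resulting reparameterized sample of $(x_{1:T}^{1:K}, a_{1:T-1}^{1:K})$ to evaluate the gradient estimator $\nabla_{\theta,\phi} \log \hat{Z}_{\text{SMC}}$.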
To account for the discrete choices of ancestor indices, one could additionally use the REINFORCE trick (Williams, 1992); however, in practice we found that the additional term in the estimator has problematically high variance. We explore various other possible gradient estimators and empirical assessments of their variances in Appendix A. This exploration confirms that including the additional REINFORCE terms leads to problematically high variance, justifying our decision to omit them, despite introducing a small bias into the gradient estimates.
3.3 Bias & Implications on the Proposals
In this section, we express the gap between ELBOs and the log marginal likelihood as a KL divergence and study implications on the proposal distributions. We present a set of claims and propositions whose full proofs are in Appendix B. These give insight into the behavior of AESMC and show the advantages, and disadvantages, of using our different ELBOs. This insight motivates Section 4, which proposes an algorithm for improving proposal learning.
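Given an unnormalized target density $\tilde{P}(x)$ with normalizing constant $Z_P$, $P(x) := \tilde{P}(x) / Z_P$, and a proposal density $Q(x)$, then
$$\mathrm{ELBO} := \int Q(x) \log \frac{\tilde{P}(x)}{Q(x)} \, dx$$
is a lower bound on $\log Z_P$ and satisfies
$$\mathrm{ELBO} = \log Z_P - \mathrm{KL}(Q \,\|\, P).$$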
The key observation for expressing such a bound for general ELBOs such as $\mathrm{ELBO}_{\text{IS}}$ and $\mathrm{ELBO}_{\text{SMC}}$ is that the target density $\tilde{P}$ and the proposal density $Q$ need not directly correspond to $p_\theta(x, y)$ and $q_\phi(x \mid y)$. This allows us to view the underlying sampling distributions of the marginal likelihood Monte Carlo estimators such as $Q_{\text{IS}}$ in (3) and $Q_{\text{SMC}}$ in (6) as proposal distributions on an extended space. The following claim uses this observation to express the bound between a general ELBO and the log marginal likelihood as a KL divergence from the extended space sampling distribution to a corresponding target distribution.
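Claim 1. Given a non-negative unbiased estimator $\hat{Z}_P(x) \ge 0$ of the normalizing constant $Z_P$ where $x$ is distributed according to the proposal distribution $Q(x)$, the following holds:
$$\mathrm{ELBO} = \int Q(x) \log \hat{Z}_P(x) \, dx = \log Z_P - \mathrm{KL}(Q \,\|\, P),$$
where $P(x) = Q(x) \hat{Z}_P(x) / Z_P$ is the implied normalized target density.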
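A short worked derivation may help to see why the claim holds. The notation here is assumed for the sketch: $Q$ is the sampling distribution, $\hat{Z}_P(x)$ the non-negative unbiased estimator of $Z_P$, and $P(x) := Q(x)\hat{Z}_P(x)/Z_P$ the implied target. Unbiasedness makes $P$ integrate to one, and the KL divergence then reduces to the gap between $\log Z_P$ and the expected log estimate:

```latex
\begin{align*}
\int P(x)\,dx
  &= \frac{1}{Z_P}\int Q(x)\,\hat{Z}_P(x)\,dx
   = \frac{\mathbb{E}_{Q}[\hat{Z}_P(x)]}{Z_P} = 1, \\
\mathrm{KL}(Q \,\|\, P)
  &= \int Q(x)\log\frac{Q(x)}{P(x)}\,dx
   = \int Q(x)\log\frac{Z_P}{\hat{Z}_P(x)}\,dx
   = \log Z_P - \int Q(x)\log\hat{Z}_P(x)\,dx .
\end{align*}
```

Choosing $\hat{Z}_P(x) = \tilde{P}(x)/Q(x)$ recovers the familiar single-sample case, in which the implied target is simply the normalized density $\tilde{P}(x)/Z_P$.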
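Similarly, in the case of AESMC, we obtain
$$P_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) = \frac{Q_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) \, \hat{Z}_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K})}{p_\theta(y_{1:T})}.$$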
Having expressions for the target distribution $P$ and the sampling distribution $Q$ for a given ELBO allows us to investigate what happens when we maximize that ELBO, remembering that the KL term is non-negative and zero if and only if $P = Q$. For the VAE and IWAE cases then, provided the proposal is sufficiently flexible, one can always perfectly maximize the ELBO by setting $q_\phi(x \mid y) = p_\theta(x \mid y)$ for all $x$. The reverse implication also holds: if $\mathrm{ELBO} = \log Z_P$ then it must be the case that $q_\phi(x \mid y) = p_\theta(x \mid y)$. However, for AESMC, achieving $\mathrm{ELBO} = \log Z_P$ is only possible when one also has sufficient flexibility to learn a particular series of intermediate target distributions, namely the marginals of the final target distribution. In other words, it is necessary to learn a particular factorization of the generative model, not just the correct individual proposals, to achieve $P = Q$ and thus $\mathrm{ELBO} = \log Z_P$. These observations are formalized in Propositions 1 and 2 below.
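Proposition 1. $Q_{\text{IS}}(x^{1:K}) = P_{\text{IS}}(x^{1:K})$ for all $x^{1:K}$ if and only if $q_\phi(x \mid y) = p_\theta(x \mid y)$ for all $x$, where $P_{\text{IS}} := Q_{\text{IS}} \hat{Z}_{\text{IS}} / p_\theta(y)$ is the target implied by Claim 1 in the IWAE case.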
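Proposition 2. If $K > 1$, then $P_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K}) = Q_{\text{SMC}}(x_{1:T}^{1:K}, a_{1:T-1}^{1:K})$ for all $(x_{1:T}^{1:K}, a_{1:T-1}^{1:K})$ if and only if
1. $\pi_t(x_{1:t}) = \int p_\theta(x_{1:T}, y_{1:T}) \, dx_{t+1:T} = p_\theta(x_{1:t}, y_{1:T})$ for all $x_{1:t}$ and $t = 1, \ldots, T$, and
2. $q_{1,\phi}(x_1 \mid y_1) = p_\theta(x_1 \mid y_{1:T})$ for all $x_1$ and $q_{t,\phi}(x_t \mid x_{1:t-1}, y_{1:t}) = p_\theta(x_{1:t} \mid y_{1:T}) / p_\theta(x_{1:t-1} \mid y_{1:T})$ for $t = 2, \ldots, T$ for all $x_{1:t}$,
where $\pi_t(x_{1:t})$ are the intermediate targets used by SMC.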
Proposition 2 has the consequence that if the family of generative models is such that the first condition does not hold, we will not be able to make the bound tight. This means that, except for a very small class of models, for most convenient parameterizations it will be impossible to learn a perfect proposal that gives a tight bound, i.e. there will be no $\theta$ and $\phi$ such that the above conditions can be satisfied. However, it also means that $\mathrm{ELBO}_{\text{SMC}}$ encodes important additional information about the implications the factorization of the generative model has on the inference: the model depends only on the final target $\pi_T(x_{1:T})$, but some choices of the intermediate targets will lead to much more efficient inference than others. Perhaps more importantly, SMC is usually a far more powerful inference algorithm than importance sampling and so the AESMC setup allows for more ambitious model learning problems to be effectively tackled than the VAE or IWAE. After all, even though it is well known in the SMC literature that, unlike for IS, most problems have no perfect set of SMC proposals which will generate exact samples from the posterior (Doucet & Johansen, 2009), SMC still gives superior performance on most problems with more than a few dimensions. These intuitions are backed up by our experiments that show that using $\mathrm{ELBO}_{\text{SMC}}$ regularly learns better models than using $\mathrm{ELBO}_{\text{IS}}$.
4 Improving Proposal Learning
In practice, one is rarely able to perfectly drive the divergence to zero and achieve a perfect proposal. In addition to the implications of the previous section, this occurs because $q_\phi(x_{1:T} \mid y_{1:T})$ may not be sufficiently expressive to represent $p_\theta(x_{1:T} \mid y_{1:T})$ exactly and because of the inevitable sub-optimality of the optimization process, remembering that we are aiming to learn an amortized inference artifact, rather than a single posterior representation. Consequently, to accurately assess the merits of different ELBOs for proposal learning, it is necessary to consider their finite-time performance. We therefore now consider the effect the number of particles $K$ has on the gradient estimators for $\mathrm{ELBO}_{\text{IS}}$ and $\mathrm{ELBO}_{\text{SMC}}$.
Counter-intuitively, it transpires that the tighter bounds implied by using a larger $K$ are often harmful to proposal learning for both IWAE and AESMC. At a high level, this is because an accurate estimate for the marginal likelihood can be achieved for a wide range of proposal parameters $\phi$, and so the magnitude of the gradient with respect to $\phi$ reduces as $K$ increases. Typically, this shrinkage happens faster than increasing $K$ reduces the standard deviation of the estimate, and so the standard deviation of the gradient estimate relative to the problem scaling (i.e. as a ratio of the true gradient) actually increases. This effect is demonstrated in Figure 1, which shows a kernel density estimator for the distribution of the gradient estimate for different $K$ and the model given in Section 5.2. Here we see that as we increase $K$, both the expected gradient estimate (which is equal to the true gradient by unbiasedness) and the standard deviation of the estimate decrease. However, the former decreases faster and so the relative standard deviation increases. This is perhaps easiest to appreciate by noting that for large $K$, there is a roughly equal probability of the estimate being positive or negative, such that we are equally likely to increase or decrease the parameter value at the next SGA iteration, inevitably leading to poor performance. On the other hand, when $K$ is small, it is far more likely that the gradient estimate is positive than negative, and so there is clear drift to the gradient steps. We add to the empirical evidence for this behavior in Section 5. Note the critical difference for model learning is that the gradient with respect to $\theta$ does not, in general, decrease in magnitude as $K$ increases. Note also that using a larger $K$ should always give better performance at test time; it may though be better to learn using a smaller $K$.
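In simultaneously developed work (Rainforth et al., 2017), we formalized this intuition in the IWAE setting by showing that the estimator of $\nabla_\phi \mathrm{ELBO}_{\text{IS}}$ with $K$ particles, denoted by $I_K$, has the following SNR:
$$\mathrm{SNR} = \left| \frac{\mathbb{E}[I_K]}{\sqrt{\operatorname{Var}[I_K]}} \right| = O\left( \sqrt{1/K} \right).$$
We thus see that increasing $K$ reduces the SNR and so the gradient updates for the proposal will degrade towards pure noise if $K$ is set too high.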
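The shrinking signal-to-noise ratio can be checked empirically on a small example. The sketch below is an illustration under assumed settings rather than a reproduction of the paper's experiment: the model is z ~ N(0, 1), x | z ~ N(z, 1), the proposal is N(m_q, s_q^2), and the gradient of the log importance sampling estimate with respect to m_q is computed by hand under the reparameterization z = m_q + s_q * eps. The function names are hypothetical.

```python
import math
import random


def grad_mq_log_z(x_obs, m_q, s_q, K, rng):
    """Reparameterized gradient of the log K-particle IS estimate w.r.t. the proposal mean."""
    zs = [m_q + s_q * rng.gauss(0.0, 1.0) for _ in range(K)]
    # Unnormalized log weights: log p(z) + log p(x|z) - log q(z), additive constants dropped.
    log_w = [-0.5 * z * z - 0.5 * (x_obs - z) ** 2 + 0.5 * ((z - m_q) / s_q) ** 2
             for z in zs]
    m = max(log_w)
    w = [math.exp(v - m) for v in log_w]
    # With z = m_q + s_q * eps, log q(z) does not depend on m_q, so
    # d log w / d m_q = d log p(z, x) / dz = x - 2z; the total gradient is the
    # self-normalized weighted average of these per-particle terms.
    return sum(wk * (x_obs - 2.0 * z) for wk, z in zip(w, zs)) / sum(w)


def snr(x_obs, m_q, s_q, K, reps, rng):
    """Empirical |mean| / standard deviation of the gradient estimate."""
    grads = [grad_mq_log_z(x_obs, m_q, s_q, K, rng) for _ in range(reps)]
    mean = sum(grads) / reps
    var = sum((g - mean) ** 2 for g in grads) / (reps - 1)
    return abs(mean) / math.sqrt(var)


rng = random.Random(0)
for K in (1, 10, 100):
    print(K, snr(x_obs=2.0, m_q=0.0, s_q=1.0, K=K, reps=2000, rng=rng))
```

In this toy setting both the mean and the spread of the gradient estimate shrink as K grows, but the mean shrinks faster, so the printed ratio falls with K.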
4.1 Alternating ELBOs
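To address these issues, we suggest and investigate the ALT algorithm which updates $(\theta, \phi)$ in a coordinate descent fashion using different ELBOs, and thus gradient estimates, for each. We pick a $\theta$-optimizing pair $(A_\theta, K_\theta)$ and a $\phi$-optimizing pair $(A_\phi, K_\phi)$, each in $\{\text{IS}, \text{SMC}\} \times \{1, 2, \ldots\}$, corresponding to an inference type and number of particles. In an optimization step, we obtain an estimator for $\nabla_\theta \mathrm{ELBO}_{A_\theta}$ with $K_\theta$ particles and an estimator for $\nabla_\phi \mathrm{ELBO}_{A_\phi}$ with $K_\phi$ particles which we call $g_\theta$ and $g_\phi$ respectively. We use $g_\theta$ to update the current $\theta$ and $g_\phi$ to update the current $\phi$. The results from the previous sections suggest that using $A_\theta = \text{SMC}$ and $A_\phi = \text{IS}$ with a large $K_\theta$ and a small $K_\phi$ may perform better model and proposal learning than just fixing $(A_\theta, K_\theta) = (A_\phi, K_\phi)$ to (SMC, large), since using IS with small $K_\phi$ helps learning $\phi$ (at least in terms of the SNR) and using SMC with large $K_\theta$ helps learning $\theta$. We experimentally observe that this procedure can in some cases improve both model and proposal learning.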
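The structure of an ALT update can be sketched as follows. This is a hypothetical interface on scalar parameters, not the authors' implementation: `grad_theta` and `grad_phi` stand for two separate gradient estimators, for example one built on an SMC-based bound with many particles and one on an IS-based bound with few particles. Updating the model parameter first and evaluating the proposal gradient at the updated value is an assumption of this sketch.

```python
def alt_step(theta, phi, grad_theta, grad_phi, lr):
    """One alternating update; each parameter uses its own gradient estimator."""
    theta = theta + lr * grad_theta(theta, phi)   # ascent step for the model parameter
    phi = phi + lr * grad_phi(theta, phi)         # ascent step for the proposal parameter
    return theta, phi


# Deterministic stand-ins for the two estimators, used only to exercise the loop:
# the "model" objective peaks at theta = 1, the "proposal" objective at phi = theta.
def toy_grad_theta(theta, phi):
    return -(theta - 1.0)


def toy_grad_phi(theta, phi):
    return -(phi - theta)


theta, phi = 0.0, 0.0
for _ in range(200):
    theta, phi = alt_step(theta, phi, toy_grad_theta, toy_grad_phi, lr=0.1)
print(theta, phi)
```

In practice the two callables would close over the chosen inference type and particle count for each parameter, so that changing the pair for one parameter leaves the other estimator untouched.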
5 Experiments
We now present a series of experiments designed to answer the following questions: 1) Does tightening the bound by using either more particles or a better inference procedure lead to an adverse effect on proposal learning? 2) Can AESMC, despite this effect, outperform IWAE? 3) Can we further improve the learned model and proposal by using ALT?
First we investigate an LGSSM for model learning and a latent variable model for proposal adaptation. This allows us to compare the learned parameters to the optimal ones. Doing so, we confirm our conclusions for this simple problem.
We then extend those results to more complex, high dimensional observation spaces that require models and proposals parameterized by neural networks. We do so by investigating the Moving Agents dataset, a set of partially occluded video sequences.
5.1 Linear Gaussian State Space Model
Given the following LGSSM
we find that optimizing $\mathrm{ELBO}_{\text{SMC}}$ w.r.t. $\theta$ leads to better generative models than optimizing $\mathrm{ELBO}_{\text{IS}}$. The same is true for using more particles.
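We generate a sequence $y_{1:T}$ for by sampling from the model with . We then optimize the different ELBOs w.r.t. $\theta$ using the bootstrap proposal $q_1(x_1 \mid y_1) = \mu_\theta(x_1)$ and $q_t(x_t \mid x_{1:t-1}, y_{1:t}) = f_{t,\theta}(x_t \mid x_{1:t-1})$. Because we use the bootstrap proposal, gradients w.r.t. $\theta$ are not backpropagated through the proposal.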
We use a fixed learning rate of and optimize for steps using SGA. Figure 2 shows that the convergence of both to and to is faster when and more particles are used.
5.2 Proposal Learning
We now investigate how learning $\phi$, i.e. the proposal, is affected by the choice of ELBO and the number of particles.
Consider a simple, fixed generative model where and are the latent and observed variables respectively and a family of proposal distributions parameterized by . For a fixed observation , we initialize and optimize with respect to . We investigate the quality of the learned parameter as we increase the number of particles during training. Figure 3 (left) clearly demonstrates that the quality of compared to the analytic posterior decreases as we increase .
Similar behavior is observed in Figure 3 (middle, right) where we optimize $\mathrm{ELBO}_{\text{SMC}}$ with respect to both $\theta$ and $\phi$ for the LGSSM described in Section 5.1. We see that using more particles helps model learning but makes proposal learning worse. Using our ALT algorithm alleviates this problem and at the same time makes model learning faster as it profits from a more accurate proposal distribution. We provide more extensive experiments exploring proposal learning with different ELBOs and numbers of particles in Appendix C.3.
5.3 Moving Agents
To show that our results are applicable to complex, high dimensional data, we compare AESMC and IWAE on stochastic, partially observable video sequences. Figure 7 in Appendix C.2 shows an example of such a sequence.
The dataset consists of sequences of images of which 1000 are randomly held out as a test set. Each sequence contains images represented as a 2-dimensional array of size . In each sequence there is one agent, represented as a circle, whose starting position is sampled randomly along the top and bottom of the image. The dataset is inspired by Ondrúška & Posner (2016), with the crucial difference that the movement of the agent is stochastic. The agent performs a directed random walk through the image. At each timestep, it moves according to
where are the coordinates in frame in a unit square that is then projected onto pixels. In addition to the stochasticity of the movement, half of the image is occluded, preventing the agent from being observed.
For the generative model and proposal distribution we use a VRNN (Chung et al., 2015). It extends RNNs by introducing a stochastic latent state $x_t$ at each timestep $t$. Together with the observation $y_t$, this state conditions the deterministic transition of the RNN. By introducing this unobserved stochastic state, the VRNN is able to better model complex long range variability in stochastic sequences. Architecture and hyperparameter details are given in Appendix C.1.
Figure 4 shows the ELBO for models trained with IWAE and AESMC for different particle numbers. The lines correspond to the mean over three different random seeds and the shaded areas indicate the standard deviation. The same number of particles was used for training and testing; additional hyperparameter settings are given in the appendix. One can see that models trained using AESMC outperform IWAE and that using more particles improves the ELBO for both. In Appendix C.2, we inspect different learned generative models by using them for prediction, confirming the results presented here. We also tested ALT on this task, but found that while it did occasionally improve performance, it was much less stable than IWAE and AESMC.
Figure 4: (Left) Rolling mean over 5 epochs of the ELBO on the test set; lines indicate the average over 3 random seeds and shaded areas indicate the standard deviation. The color indicates the number of particles, the line style the algorithm used. (Right) The table shows the final ELBO for each learned model.
We have developed AESMC—a method for performing model learning using a new ELBO objective which is based on the SMC marginal likelihood estimator. This ELBO objective is optimized using SGA and the reparameterization trick. Our approach utilizes the efficiency of SMC in models with intermediate observations and hence is suitable for highly structured models. We experimentally demonstrated that this objective leads to better generative model training than the IWAE objective for structured problems, due to the superior inference and tighter bound provided by using SMC instead of importance sampling.
Additionally, in Claim 1, we provide a simple way to express the bias of objectives induced by the log of marginal likelihood estimators as a KL divergence on an extended space. In Propositions 1 and 2, we investigate the implications of this KL divergence being zero in the case of IWAE and AESMC. In the latter case, we find that we can achieve zero KL only if we are able to learn SMC intermediate target distributions corresponding to marginals of the target distribution. Using our assertion that tighter variational bounds are not necessarily better, we then introduce and test a new method, alternating ELBOs, that addresses some of these issues and observe that, in some cases, this improves both model and proposal learning.
TAL is supported by EPSRC DTA and Google (project code DF6700) studentships. MI is supported by the UK EPSRC CDT in Autonomous Intelligent Machines and Systems. TR is supported by the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071; majority of TR’s work was undertaken while he was in the Department of Engineering Science, University of Oxford, and was supported by a BP industrial grant. TJ is supported by the UK EPSRC and MRC CDT in Statistical Science. FW is supported by The Alan Turing Institute under the EPSRC grant EP/N510129/1; DARPA PPAML through the U.S. AFRL under Cooperative Agreement FA8750-14-2-0006; Intel and DARPA D3M, under Cooperative Agreement FA8750-17-2-0093.
- Burda et al. (2016) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. In ICLR, 2016.
- Chung et al. (2015) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In Advances in neural information processing systems, pp. 2980–2988, 2015.
- Del Moral (2004) P Del Moral. Feynman-Kac formulae: genealogical and interacting particle systems with applications. Probability and its applications, 2004.
- Doucet & Johansen (2009) Arnaud Doucet and Adam M Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
- Kingma & Welling (2014) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. In ICLR, 2014.
- Le et al. (2017) Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding sequential Monte Carlo. arXiv preprint arXiv:1705.10306v1, 2017.
- Maddison et al. (2017) Chris J Maddison, John Lawson, George Tucker, Nicolas Heess, Mohammad Norouzi, Andriy Mnih, Arnaud Doucet, and Yee Teh. Filtering variational objectives. In Advances in Neural Information Processing Systems, pp. 6576–6586, 2017.
- Naesseth et al. (2017) Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.
- Ondrúška & Posner (2016) Peter Ondrúška and Ingmar Posner. Deep tracking: Seeing beyond seeing using recurrent neural networks. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.
- Rainforth et al. (2017) Tom Rainforth, Tuan Anh Le, Maximilian Igl, Chris J Maddison, Yee Whye Teh, and Frank Wood. Tighter variational bounds are not necessarily better. NIPS Workshop on Bayesian Deep Learning, 2017.
- Rezende et al. (2014) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. In ICML, 2014.
- Williams (1992) Ronald J Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine learning, 8(3-4):229–256, 1992.
Appendix A Gradients
The goal is to obtain an unbiased estimator for the gradient $\nabla_{\theta,\phi} \mathrm{ELBO}_{\text{SMC}}(\theta, \phi, y_{1:T})$.
A.1 Full REINFORCE
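We express the required quantity as
$$\nabla_{\theta,\phi} \int Q_{\text{SMC}} \log \hat{Z}_{\text{SMC}} = \int Q_{\text{SMC}} \left[ \nabla_{\theta,\phi} \log \hat{Z}_{\text{SMC}} + \log \hat{Z}_{\text{SMC}} \, \nabla_{\theta,\phi} \log Q_{\text{SMC}} \right],$$
which we can estimate by sampling directly from $Q_{\text{SMC}}$ and evaluating the term in square brackets.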
A.2 REINFORCE & Reparameterization
We express the required quantity as