1 Introduction
Learning in statistical models via gradient descent is straightforward when the objective function and its gradients are tractable. In the presence of latent variables, however, many objectives become intractable. For neural generative models with latent variables, there are currently a few dominant approaches: optimizing lower bounds on the marginal log-likelihood kingma2013auto ; rezende2014stochastic , restricting to a class of invertible models dinh2016density , or using likelihood-free methods goodfellow2014generative ; nowozin2016f ; tran2017deep ; mohamed2016learning . In this work, we focus on the first approach and introduce filtering variational objectives (FIVOs), a tractable family of objectives for maximum likelihood estimation (MLE) in latent variable models with sequential structure.
Specifically, let $x$ denote an observation of an $\mathcal{X}$-valued random variable. We assume that the process generating $x$ involves an unobserved $\mathcal{Z}$-valued random variable $z$ with joint density $p(x, z)$ in some family $\mathcal{P}$. The goal of MLE is to recover $p \in \mathcal{P}$ that maximizes the marginal log-likelihood, $\log p(x) = \log \left( \int p(x, z) \, dz \right)$ (we reuse $p$ to denote the conditionals and marginals of the joint density). The difficulty in carrying out this optimization is that the log-likelihood function is defined via a generally intractable integral. To circumvent marginalization, a common approach kingma2013auto ; rezende2014stochastic is to optimize a variational lower bound on the marginal log-likelihood jordan1999introduction ; beal2003variational . The evidence lower bound (ELBO) is the most common such bound and is defined by a variational posterior distribution $q(z|x)$ whose support includes $p$’s,

$$\mathcal{L}(x, p, q) = \mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] = \log p(x) - \mathrm{KL}\left( q(z|x) \,\|\, p(z|x) \right) \le \log p(x). \tag{1}$$

$\mathcal{L}(x, p, q)$ lower-bounds the marginal log-likelihood for any choice of $q$, and the bound is tight when $q$ is the true posterior $p(z|x)$. Thus, the joint optimum of $\mathcal{L}(x, p, q)$ in $p$ and $q$ is the MLE. In practice, it is common to restrict $q$ to a tractable family of distributions (e.g., a factored distribution) and to jointly optimize the ELBO over $p$ and $q$ with stochastic gradient ascent kingma2013auto ; rezende2014stochastic ; ranganath2014black ; kucukelbir2016automatic . Because of the KL penalty from $q$ to $p$, optimizing (1) under these assumptions tends to force $p$’s posterior to satisfy the factorizing assumptions of the variational family, which reduces the capacity of the model $p$. One strategy for addressing this is to decouple the tightness of the bound from the quality of $q$. For example, burda2015importance observed that Eq. (1) can be interpreted as the log of an unnormalized importance weight with the proposal given by $q$, and that using $N$ samples from the same proposal produces a tighter bound, known as the importance weighted auto-encoder bound, or IWAE.
Indeed, it follows from Jensen’s inequality that the log of any unbiased positive Monte Carlo estimator of the marginal likelihood results in a lower bound that can be optimized for MLE. The filtering variational objectives (FIVOs) build on this idea by treating the log of a particle filter’s likelihood estimator as an objective function. Following mnih2016variational , we call objectives defined as log-transformed likelihood estimators Monte Carlo objectives (MCOs). In this work, we show that the tightness of an MCO scales like the relative variance of the estimator from which it is constructed. It is well known that the variance of a particle filter’s likelihood estimator scales more favourably than simple importance sampling for models with sequential structure cerou2011nonasymptotic ; berard2014 . Thus, FIVO can potentially form a much tighter bound on the marginal log-likelihood than IWAE.
The main contributions of this work are introducing filtering variational objectives and a more careful study of Monte Carlo objectives. In Section 2, we review maximum likelihood estimation via maximizing the ELBO. In Section 3, we study Monte Carlo objectives and provide some of their basic properties. We define filtering variational objectives in Section 4, discuss details of their optimization, and present a sharpness result. Finally, we cover related work and present experiments showing that sequential models trained with FIVO outperform models trained with ELBO or IWAE in practice.
2 Background
We briefly review techniques for optimizing the ELBO as a surrogate MLE objective. We restrict our focus to latent variable models in which the model $p_\theta(x, z)$ factors into tractable conditionals $p_\theta(z)$ and $p_\theta(x|z)$ that are parameterized differentiably by parameters $\theta$. MLE in these models is then the problem of optimizing $\log p_\theta(x)$ in $\theta$. The expectation-maximization (EM) algorithm is an approach to this problem which can be seen as coordinate ascent, fully maximizing $\mathcal{L}(x, p_\theta, q)$ alternately in $q$ and $\theta$ at each iteration dempster1977maximum ; wu1983convergence ; neal1998view . Yet, EM rarely applies in general, because maximizing over $q$ for a fixed $\theta$ corresponds to a generally intractable inference problem.

Instead, an approach with mild assumptions on the model is to perform gradient ascent following a Monte Carlo estimator of the ELBO’s gradient hoffman2013stochastic ; ranganath2014black . We assume that $q$ is taken from a family of distributions parameterized differentiably by parameters $\phi$. We can follow an unbiased estimator of the ELBO’s gradient by sampling $z \sim q_\phi(z|x)$ and updating the parameters by $\theta' = \theta + \eta \nabla_\theta \log p_\theta(x, z)$ and $\phi' = \phi + \eta \left( \log p_\theta(x, z) - \log q_\phi(z|x) \right) \nabla_\phi \log q_\phi(z|x)$, where the gradients are computed conditional on the sample $z$ and $\eta$ is a learning rate. Such estimators follow the ELBO’s gradient in expectation, but variance reduction techniques are usually necessary ranganath2014black ; mnih2014neural ; mnih2016variational .

A lower variance gradient estimator can be derived if $q_\phi$ is a reparameterizable distribution kingma2013auto ; rezende2014stochastic ; gal2016uncertainty . Reparameterizable distributions are those that can be simulated by sampling from a distribution $\epsilon \sim d(\epsilon)$, which does not depend on $\phi$, and then applying a deterministic transformation $z = f_\phi(x, \epsilon)$. When $p_\theta$, $q_\phi$, and $f_\phi$ are differentiable, an unbiased estimator of the ELBO gradient consists of sampling $\epsilon$ and updating the parameters by $(\theta', \phi') = (\theta, \phi) + \eta \nabla_{(\theta, \phi)} \left( \log p_\theta(x, f_\phi(x, \epsilon)) - \log q_\phi(f_\phi(x, \epsilon)|x) \right)$. Given $\epsilon$, the gradients of the sampling process can flow through $z = f_\phi(x, \epsilon)$.
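The contrast between the two estimators can be illustrated on a toy problem. The following is a minimal sketch, not the implementation used in this work: it assumes a hypothetical one-dimensional model with $z \sim N(0, 1)$ and $x|z \sim N(z, 1)$, a Gaussian variational posterior with mean `mu` and standard deviation `sigma`, and it only considers the gradient with respect to `mu`. All function names are illustrative.

```python
import math
import random


def log_joint(x, z):
    """log p(x, z) for the assumed toy model z ~ N(0, 1), x | z ~ N(z, 1)."""
    return -0.5 * z * z - 0.5 * (x - z) ** 2 - math.log(2.0 * math.pi)


def log_q(z, mu, sigma):
    """log q(z | x) for the assumed Gaussian posterior N(mu, sigma^2)."""
    return (-0.5 * ((z - mu) / sigma) ** 2
            - math.log(sigma) - 0.5 * math.log(2.0 * math.pi))


def score_function_grad(x, mu, sigma, rng):
    """Single-sample score-function estimate of d ELBO / d mu."""
    z = rng.gauss(mu, sigma)
    signal = log_joint(x, z) - log_q(z, mu, sigma)
    # signal times d/dmu log q(z | x), with z held fixed
    return signal * (z - mu) / sigma ** 2


def reparam_grad(x, mu, sigma, rng):
    """Single-sample reparameterized estimate of d ELBO / d mu."""
    eps = rng.gauss(0.0, 1.0)  # noise that does not depend on mu
    z = mu + sigma * eps       # deterministic transformation of the noise
    # With z = mu + sigma * eps, log q(z | x) no longer depends on mu, so only
    # d/dz log p(x, z) = x - 2 z contributes (dz/dmu = 1).
    return x - 2.0 * z


def mean_and_variance(estimator, x, mu, sigma, num_samples, seed):
    """Sample mean and sample variance of a single-sample gradient estimator."""
    rng = random.Random(seed)
    draws = [estimator(x, mu, sigma, rng) for _ in range(num_samples)]
    mean = sum(draws) / num_samples
    var = sum((d - mean) ** 2 for d in draws) / (num_samples - 1)
    return mean, var


if __name__ == "__main__":
    for name, fn in [("score function", score_function_grad),
                     ("reparameterized", reparam_grad)]:
        mean, var = mean_and_variance(fn, x=1.0, mu=0.0, sigma=1.0,
                                      num_samples=20000, seed=0)
        print(f"{name}: mean={mean:.3f}, variance={var:.3f}")
```

For this toy model the exact gradient is $x - 2\mu$, so both sample means should land near that value, while the sample variances show how much noisier the score-function estimate is in this particular setting.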
Unfortunately, when the variational family of $q_\phi$ is restricted, following gradients of $-\mathrm{KL}(q_\phi(z|x) \,\|\, p_\theta(z|x))$ tends to reduce the capacity of the model $p_\theta$ to match the assumptions of the variational family. This KL penalty can be “removed” by considering generalizations of the ELBO whose tightness can be controlled by means other than the closeness of $q_\phi$ and $p_\theta$, e.g., burda2015importance . We consider this in the next section.
3 Monte Carlo Objectives (MCOs)
Monte Carlo objectives (MCOs) mnih2016variational generalize the ELBO to objectives defined by taking the log of a positive, unbiased estimator of the marginal likelihood. The key property of MCOs is that they are lower bounds on the marginal log-likelihood, and thus can be used for MLE. Motivated by the previous section, we present results on the convergence of generic MCOs to the marginal log-likelihood and show that the tightness of an MCO is closely related to the variance of the estimator that defines it.
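One can verify that the ELBO is a lower bound by using the concavity of log and Jensen’s inequality,

$$\mathbb{E}_{q(z|x)}\left[ \log \frac{p(x, z)}{q(z|x)} \right] \le \log \int \frac{p(x, z)}{q(z|x)} \, q(z|x) \, dz = \log p(x). \tag{2}$$

This argument relies only on the unbiasedness of $p(x, z)/q(z|x)$ when $z \sim q(z|x)$. Thus, we can generalize this by considering any unbiased marginal likelihood estimator $\hat{p}_N(x)$ and treating $\mathbb{E}[\log \hat{p}_N(x)]$ as an objective function over models $p$. Here $N \in \mathbb{N}$ indexes the amount of computation needed to simulate $\hat{p}_N(x)$, e.g., the number of samples or particles.

Definition (Monte Carlo Objectives). Let $\hat{p}_N(x)$ be an unbiased positive estimator of $p(x)$, $\mathbb{E}[\hat{p}_N(x)] = p(x)$, then the Monte Carlo objective $\mathcal{L}_N(x, p)$ over $p \in \mathcal{P}$ defined by $\hat{p}_N(x)$ is

$$\mathcal{L}_N(x, p) = \mathbb{E}[\log \hat{p}_N(x)]. \tag{3}$$

For example, the ELBO is constructed from a single unnormalized importance weight $\hat{p}(x) = p(x, z)/q(z|x)$. The IWAE bound burda2015importance takes $\hat{p}_N(x)$ to be $N$ averaged i.i.d. importance weights,

$$\mathcal{L}^{\mathrm{IWAE}}_N(x, p, q) = \mathbb{E}_{q(z^i|x)}\left[ \log \left( \frac{1}{N} \sum_{i=1}^{N} \frac{p(x, z^i)}{q(z^i|x)} \right) \right]. \tag{4}$$

We consider additional examples in the Appendix. To avoid notational clutter, we omit the arguments to an MCO, e.g., the observations $x$ or model $p$, when the default arguments are clear from context. Whether we can compute stochastic gradients of $\mathcal{L}_N$ efficiently depends on the specific form of the estimator and the underlying random variables that define it.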
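To make the relationship between the single-weight ELBO and the averaged-weight IWAE bound concrete, the sketch below estimates both on a toy model where the marginal likelihood is available in closed form. It is an illustration under stated assumptions rather than anything from the experiments: the model $z \sim N(0, 1)$, $x|z \sim N(z, 1)$ (so $x \sim N(0, 2)$) and the choice of the prior as proposal are hypothetical, as are all function names.

```python
import math
import random


def log_weight(x, z):
    """log of p(x, z) / q(z | x); with q equal to the prior this is log p(x | z)."""
    return -0.5 * (x - z) ** 2 - 0.5 * math.log(2.0 * math.pi)


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def iwae_estimate(x, num_samples, rng):
    """One draw of the log of N averaged importance weights.

    With num_samples = 1 this is a single-sample estimate of the ELBO.
    """
    log_ws = [log_weight(x, rng.gauss(0.0, 1.0)) for _ in range(num_samples)]
    return log_mean_exp(log_ws)


def average_bound(x, num_samples, num_reps, seed):
    """Monte Carlo approximation of the objective (the expected log-estimator)."""
    rng = random.Random(seed)
    total = sum(iwae_estimate(x, num_samples, rng) for _ in range(num_reps))
    return total / num_reps


def exact_log_marginal(x):
    """log p(x) for the assumed toy model, where x ~ N(0, 2)."""
    return -0.5 * math.log(2.0 * math.pi * 2.0) - x * x / 4.0


if __name__ == "__main__":
    x = 2.0
    print("ELBO (N=1):  ", average_bound(x, 1, 8000, seed=0))
    print("IWAE (N=16): ", average_bound(x, 16, 8000, seed=1))
    print("log p(x):    ", exact_log_marginal(x))
```

Up to Monte Carlo error, the three printed values should be ordered as ELBO, then IWAE, then the exact log-marginal, which is the ordering the bounds above imply.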
Many likelihood estimators $\hat{p}_N(x)$ converge to $p(x)$ almost surely as $N \to \infty$ (known as strong consistency). The advantage of a consistent estimator is that its MCO can be driven towards $\log p(x)$ by increasing $N$. We present sufficient conditions for this convergence and a description of the rate:

Proposition (Properties of Monte Carlo Objectives). Let $\mathcal{L}_N(x, p)$ be a Monte Carlo objective defined by an unbiased positive estimator $\hat{p}_N(x)$ of $p(x)$. Then,

(a) (Bound) $\mathcal{L}_N(x, p) \le \log p(x)$.

(b) (Consistency) If $\log \hat{p}_N(x)$ is uniformly integrable (see Appendix for definition) and $\hat{p}_N(x)$ is strongly consistent, then $\mathcal{L}_N(x, p) \to \log p(x)$ as $N \to \infty$.
Proof.
See the Appendix for the proof and a sufficient condition for controlling the first inverse moment when is the average of i.i.d. random variables. ∎
In some cases, convergence of the bound to $\log p(x)$ is monotonic, e.g., IWAE burda2015importance , but this is not true in general. The relative variance of estimators, $\mathrm{var}(\hat{p}_N(x)/p(x))$, tends to be well studied, so property 3 gives us a tool for comparing the convergence rate of distinct MCOs. For example, cerou2011nonasymptotic ; berard2014 study marginal likelihood estimators defined by particle filters and find that the relative variance of these estimators scales favorably in comparison to naive importance sampling. This suggests that a particle filter’s MCO, introduced in the next section, will generally be a tighter bound than IWAE.
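The connection between the tightness of an MCO and the relative variance of its estimator can be probed numerically. The sketch below is an informal check, not a proof and not the experiment reported here: it assumes the hypothetical toy model $z \sim N(0, 1)$, $x|z \sim N(z, 1)$ with the prior as proposal, and compares the gap $\log p(x) - \mathcal{L}_N$ with half the empirical relative variance, the leading term that a second-order expansion of the logarithm suggests. The constant `X` and the function names are illustrative.

```python
import math
import random

X = 2.0  # assumed observation for the toy model


def importance_weight(z):
    """p(X, z) / q(z) with the prior as proposal, i.e. the likelihood p(X | z)."""
    return math.exp(-0.5 * (X - z) ** 2) / math.sqrt(2.0 * math.pi)


def gap_and_half_relative_variance(num_samples, num_reps, seed):
    """Return (log p(X) - L_N, 0.5 * var(p_hat_N / p(X))), both estimated."""
    rng = random.Random(seed)
    p_x = math.exp(-X * X / 4.0) / math.sqrt(4.0 * math.pi)  # X ~ N(0, 2)
    estimates = []
    for _ in range(num_reps):
        total = sum(importance_weight(rng.gauss(0.0, 1.0))
                    for _ in range(num_samples))
        estimates.append(total / num_samples)
    mean_log = sum(math.log(e) for e in estimates) / num_reps
    ratios = [e / p_x for e in estimates]
    ratio_mean = sum(ratios) / num_reps
    rel_var = sum((r - ratio_mean) ** 2 for r in ratios) / (num_reps - 1)
    return math.log(p_x) - mean_log, 0.5 * rel_var


if __name__ == "__main__":
    for n in (4, 16, 64):
        gap, half_var = gap_and_half_relative_variance(n, 8000, seed=n)
        print(f"N={n:3d}  gap={gap:.4f}  half relative variance={half_var:.4f}")
```

In this setting both columns shrink as $N$ grows and move closer together, which is the qualitative behaviour one would expect if the gap is driven by the relative variance.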
4 Filtering Variational Objectives (FIVOs)
The filtering variational objectives (FIVOs) are a family of MCOs defined by the marginal likelihood estimator of a particle filter. For models with sequential structure, e.g., latent variable models of audio and text, the relative variance of a naive importance sampling estimator tends to scale exponentially in the number of steps. In contrast, the relative variance of particle filter estimators can scale more favorably with the number of steps—linearly in some cases cerou2011nonasymptotic ; berard2014 . Thus, the results of Section 3 suggest that FIVOs can serve as tighter objectives than IWAE for MLE in sequential models.
Let our observations be sequences of $T$ $\mathcal{X}$-valued random variables denoted $x_{1:T}$, where $x_{i:j} \equiv (x_i, \ldots, x_j)$. We also assume that the data generation process relies on a sequence of $T$ unobserved $\mathcal{Z}$-valued latent variables denoted $z_{1:T}$. We focus on sequential latent variable models that factor as a series of tractable conditionals, $p(x_{1:T}, z_{1:T}) = p_1(x_1, z_1) \prod_{t=2}^{T} p_t(x_t, z_t | x_{1:t-1}, z_{1:t-1})$.

A particle filter is a sequential Monte Carlo algorithm, which propagates a population of $N$ weighted particles for $T$ steps using a combination of importance sampling and resampling steps, see Alg. 1. In detail, the particle filter takes as arguments an observation $x_{1:T}$, the number of particles $N$, the model distribution $p$, and a variational posterior $q(z_{1:T}|x_{1:T})$ factored over $t$,

$$q(z_{1:T}|x_{1:T}) = \prod_{t=1}^{T} q_t(z_t | x_{1:t}, z_{1:t-1}). \tag{6}$$

The particle filter maintains a population $\{w_{t-1}^i, z_{1:t-1}^i\}_{i=1}^{N}$ of particles $z_{1:t-1}^i$ with weights $w_{t-1}^i$. At step $t$, the filter independently proposes an extension $z_t^i \sim q_t(z_t | x_{1:t}, z_{1:t-1}^i)$ to each particle’s trajectory $z_{1:t-1}^i$. The weights $w_{t-1}^i$ are multiplied by the incremental importance weights,

$$\alpha_t(z_{1:t}^i) = \frac{p_t(x_t, z_t^i | x_{1:t-1}, z_{1:t-1}^i)}{q_t(z_t^i | x_{1:t}, z_{1:t-1}^i)}, \tag{7}$$

and renormalized. If the current weights $w_t^i$ satisfy a resampling criterion, then a resampling step is performed and $N$ particles $z_{1:t}^i$ are sampled in proportion to their weights from the current population with replacement. Common resampling schemes include resampling at every step and resampling if the effective sample size (ESS) $\left( \sum_{i=1}^{N} (w_t^i)^2 \right)^{-1}$ of the population drops below $N/2$ doucet2009tutorial . After resampling, the weights are reset to be uniform. Otherwise, the particles are copied to the next step along with the accumulated weights. See Fig. 1 for a visualization.
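The steps described above can be sketched in a few lines. This is a minimal illustration and not Alg. 1 itself: it assumes a hypothetical linear Gaussian model ($z_1 \sim N(0, 1)$, $z_t | z_{t-1} \sim N(a z_{t-1}, 1)$, $x_t | z_t \sim N(z_t, 1)$), uses the model's own transition as the proposal so that the incremental weight reduces to $p(x_t | z_t)$, and keeps only the most recent latent state of each particle because that is all this Markov model needs. The function names, the coefficient `a`, and the ESS threshold of half the particle count are assumptions of the sketch.

```python
import math
import random


def log_normal_pdf(value, mean, var):
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (value - mean) ** 2 / var


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def particle_filter_log_lik(xs, num_particles, rng, a=0.9, resample=True):
    """Return the log of a particle filter's likelihood estimate for xs.

    With resample=False no resampling ever happens and the result is the log
    of averaged importance weights over whole trajectories.
    """
    n = num_particles
    log_w = [-math.log(n)] * n   # normalized log-weights, uniform at the start
    states = [0.0] * n           # placeholder; overwritten at the first step
    log_p_hat = 0.0
    for t, x in enumerate(xs):
        # Propose an extension of every particle from the assumed transition.
        states = [rng.gauss(0.0 if t == 0 else a * z_prev, 1.0)
                  for z_prev in states]
        # Multiply weights by the incremental weights, here p(x_t | z_t).
        log_terms = [lw + log_normal_pdf(x, z, 1.0)
                     for lw, z in zip(log_w, states)]
        log_p_t = logsumexp(log_terms)   # log of sum_i w_{t-1}^i * alpha_t^i
        log_p_hat += log_p_t
        log_w = [lt - log_p_t for lt in log_terms]   # renormalize
        ess = 1.0 / sum(math.exp(2.0 * lw) for lw in log_w)
        if resample and ess < n / 2.0:
            weights = [math.exp(lw) for lw in log_w]
            states = rng.choices(states, weights=weights, k=n)
            log_w = [-math.log(n)] * n   # reset to uniform after resampling
    return log_p_hat


if __name__ == "__main__":
    observations = [0.3, -0.1, 0.8, 1.2, 0.4, -0.6, 0.0, 0.9]
    with_rs = particle_filter_log_lik(observations, 64, random.Random(1))
    without = particle_filter_log_lik(observations, 64, random.Random(1),
                                      resample=False)
    print("log-estimate with resampling:   ", with_rs)
    print("log-estimate without resampling:", without)
```

Averaging the returned value over many independent runs would give a Monte Carlo approximation of the corresponding objective; a single run is only one noisy draw.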
Instead of viewing Alg. 1 as an inference algorithm, we treat the quantity $\mathbb{E}[\log \hat{p}_N(x_{1:T})]$ as an objective function over $p$. Because $\hat{p}_N(x_{1:T})$ is an unbiased estimator of $p(x_{1:T})$, proven in the Appendix and in del2004feynman ; del2013mean ; andrieu2010particle ; pitt2012some , it defines an MCO, which we call FIVO:

Definition (Filtering Variational Objectives). Let $\log \hat{p}_N(x_{1:T})$ be the output of Alg. 1 with inputs $(x_{1:T}, p, q, N)$, then $\mathcal{L}^{\mathrm{FIVO}}_N(x_{1:T}, p, q) = \mathbb{E}[\log \hat{p}_N(x_{1:T})]$ is a filtering variational objective.

$\hat{p}_N(x_{1:T})$ is a strongly consistent estimator del2004feynman ; del2013mean . So if $\log \hat{p}_N(x_{1:T})$ is uniformly integrable, then $\mathcal{L}^{\mathrm{FIVO}}_N(x_{1:T}, p, q) \to \log p(x_{1:T})$ as $N \to \infty$. Resampling is the distinguishing feature of $\mathcal{L}^{\mathrm{FIVO}}_N$; if resampling is removed, then FIVO reduces to IWAE. Resampling does add an amount of immediate variance, but it allows the filter to discard low weight particles with high probability. This has the effect of refocusing the distribution of particles to regions of higher mass under the posterior, and in some sequential models can reduce the variance from exponential to linear in the number of time steps cerou2011nonasymptotic ; berard2014 . Resampling is a greedy process, and it is possible that a particle discarded at step $t$ could have attained a high mass at step $T$. In practice, the best trade-off is to use adaptive resampling schemes doucet2009tutorial . If for a given $x_{1:T}, p, q$ a particle filter’s likelihood estimator improves over simple importance sampling in terms of variance, we expect $\mathcal{L}^{\mathrm{FIVO}}_N$ to be a tighter bound than $\mathcal{L}$ or $\mathcal{L}^{\mathrm{IWAE}}_N$.

4.1 Optimization
The FIVO bound can be optimized with the same stochastic gradient ascent framework used for the ELBO. We found in practice it was effective simply to follow a Monte Carlo estimator of the biased gradient $\mathbb{E}[\nabla_{(\theta, \phi)} \log \hat{p}_N(x_{1:T})]$ with reparameterized $z_t^i$. This gradient estimator is biased, as the full FIVO gradient has three kinds of terms: it has the term $\mathbb{E}[\nabla_{(\theta, \phi)} \log \hat{p}_N(x_{1:T})]$, where $\log \hat{p}_N(x_{1:T})$ is defined conditional on the random variables of Alg. 1; it has gradient terms for every distribution of Alg. 1 that depends on the parameters; and, if adaptive resampling is used, then it has additional terms that account for the change in FIVO with respect to the decision to resample. In this section, we derive the FIVO gradient when $z_t^i$ are reparameterized and a fixed resampling schedule is followed. We derive the full gradient in the Appendix.
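In more detail, we assume that $p$ and $q$ are parameterized in a differentiable way by $\theta$ and $\phi$. Assume that $q$ is from a reparameterizable family and that $z_t^i$ of Alg. 1 are reparameterized. Assume that we use a fixed resampling schedule, and let $\mathbb{I}(\text{resampling at step } t)$ be an indicator function indicating whether a resampling occurred at step $t$. Now, $\mathcal{L}^{\mathrm{FIVO}}_N$ depends on the parameters via $\log \hat{p}_N(x_{1:T})$ and the resampling probabilities $w_t^i$ in the density. Thus,

$$\nabla_{(\theta, \phi)} \mathcal{L}^{\mathrm{FIVO}}_N = \mathbb{E}\left[ \nabla_{(\theta, \phi)} \log \hat{p}_N(x_{1:T}) + \sum_{t=1}^{T} \sum_{i=1}^{N} \mathbb{I}(\text{resampling at step } t) \, \log \frac{\hat{p}_N(x_{1:T})}{\hat{p}_N(x_{1:t})} \, \nabla_{(\theta, \phi)} \log w_t^i \right]. \tag{8}$$

Given a single forward pass of Alg. 1 with reparameterized $z_t^i$, the terms inside the expectation form a Monte Carlo estimator of Eq. (8). However, the terms from resampling events contribute the majority of the variance of the estimator. Thus, the gradient estimator that we found most effective in practice consists only of the gradient $\nabla_{(\theta, \phi)} \log \hat{p}_N(x_{1:T})$, the solid red arrows of Figure 1. We explore this experimentally in Section 6.3.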



4.2 Sharpness
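As with the ELBO, FIVO is a variational objective taking a variational posterior $q$ as an argument. An important question is whether FIVO achieves the marginal log-likelihood at its optimal $q$. We can only guarantee this for models in which $z_{1:t-1}$ and $x_{t+1:T}$ are independent given $x_{1:t}$.

Proposition (Sharpness of Filtering Variational Objectives). Let $\mathcal{L}^{\mathrm{FIVO}}_N(x_{1:T}, p, q)$ be a FIVO, and $q^*(x_{1:T}, p) = \arg\max_q \mathcal{L}^{\mathrm{FIVO}}_N(x_{1:T}, p, q)$. If $p$ has independence structure such that $p(z_{1:t-1} | x_{1:t}) = p(z_{1:t-1} | x_{1:T})$ for $t \in \{2, \ldots, T\}$, then

$$q^*(x_{1:T}, p)(z_{1:T}) = p(z_{1:T} | x_{1:T}) \quad \text{and} \quad \mathcal{L}^{\mathrm{FIVO}}_N(x_{1:T}, p, q^*(x_{1:T}, p)) = \log p(x_{1:T}).$$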
Proof.
See Appendix. ∎
Most models do not satisfy this assumption, and deriving the optimal $q$ in general is complicated by the resampling dynamics. For the restricted model class in Proposition 4.2, the optimal $q_t$ does not condition on future observations $x_{t+1:T}$. We explored this experimentally with richer models in Section 6.4, and found that allowing $q_t$ to condition on $x_{t+1:T}$ does not reliably improve FIVO. This is consistent with the view of resampling as a greedy process that responds to each intermediate distribution as if it were the final. Still, we found that the impact of this effect was outweighed by the advantage of optimizing a tighter bound.
5 Related Work
The marginal log-likelihood is a central quantity in statistics and probability, and there has long been an interest in bounding it wainwright2008graphical . The literature relating to the bounds we call Monte Carlo objectives has typically focused on the problem of estimating the marginal likelihood itself. grosse2015sandwiching ; burda2015accurate use Jensen’s inequality in a forward and reverse estimator to detect the failure of inference methods. IWAE burda2015importance is a clear influence on this work, and FIVO can be seen as an extension of this bound. The ELBO enjoys a long history jordan1999introduction and there have been efforts to improve the ELBO itself. ranganath2016operator generalize the ELBO by considering arbitrary operators of the model and variational posterior. More closely related to this work is a body of work improving the ELBO by increasing the expressiveness of the variational posterior. For example, rezende2015variational ; kingma2016improved augment the variational posterior with deterministic transformations with fixed Jacobians, and salimans2015markov extend the variational posterior to admit a Markov chain.
Other approaches to learning in neural latent variable models include bornschein2014reweighted , who use importance sampling to approximate gradients under the posterior, and gu2015neural , who use sequential Monte Carlo to approximate gradients under the posterior. These are distinct from our contribution in the sense that for them inference for the sake of estimation is the ultimate goal. To our knowledge, the idea of treating the output of inference as an objective in and of itself, while not completely novel, has not been fully appreciated in the literature. This idea does, however, share inspiration with methods that optimize the convergence of Markov chains bengio2013generalized .
We note that the idea to optimize the log estimator of a particle filter was independently and concurrently considered in naesseth2017variational ; le2017auto . In naesseth2017variational the bound we call FIVO is cast as a tractable lower bound on the ELBO defined by the particle filter’s nonparametric approximation to the posterior. le2017auto additionally derive an expression for FIVO’s bias as the KL between the filter’s distribution and a certain target process. Our work is distinguished by our study of the convergence of MCOs in $N$, which includes FIVO, our investigation of FIVO sharpness, and our experimental results on stochastic RNNs.
6 Experiments
In our experiments, we sought to: (a) compare models trained with ELBO, IWAE, and FIVO bounds in terms of final test log-likelihoods, (b) explore the effect of the resampling gradient terms on FIVO, (c) investigate how the lack of sharpness affects FIVO, and (d) consider how models trained with FIVO use the stochastic state. To explore these questions, we trained variational recurrent neural networks (VRNN) chung2015recurrent with the ELBO, IWAE, and FIVO bounds using TensorFlow abadi2016tensorflow on two benchmark sequential modeling tasks: natural speech waveforms and polyphonic music. These datasets are known to be difficult to model without stochastic latent states fraccaro2016sequential .

The VRNN is a sequential latent variable model that combines a deterministic recurrent neural network (RNN) with stochastic latent states $z_t$ at each step. The observation distribution over $x_t$ is conditioned directly on $z_t$ and indirectly on $z_{1:t-1}$ via the RNN’s state $h_t(z_{t-1}, x_{t-1}, h_{t-1})$. For a length $T$ sequence, the model factors into the conditionals $\prod_{t=1}^{T} p_t(z_t | h_t(z_{t-1}, x_{t-1}, h_{t-1})) \, g_t(x_t | z_t, h_t(z_{t-1}, x_{t-1}, h_{t-1}))$, and the variational posterior factors as $\prod_{t=1}^{T} q_t(z_t | h_t(z_{t-1}, x_{t-1}, h_{t-1}), x_t)$. All distributions over latent variables are factorized Gaussians, and the output distributions depend on the dataset. The RNN is a single-layer LSTM and the conditionals are parameterized by fully connected neural networks with one hidden layer of the same size as the LSTM hidden layer. We used the residual parameterization fraccaro2016sequential for the variational posterior.
For FIVO we resampled when the ESS of the particles dropped below $N/2$. For FIVO and IWAE we used a batch size of 4, and for the ELBO, we used batch sizes of $4N$ to match computational budgets (resampling is $O(N)$ with the alias method). For all models we report bounds using the variational posterior trained jointly with the model. For models trained with FIVO we report $\mathcal{L}^{\mathrm{FIVO}}_{128}$. To provide strong baselines, we report the maximum across bounds, $\max\{\mathcal{L}, \mathcal{L}^{\mathrm{IWAE}}_{128}, \mathcal{L}^{\mathrm{FIVO}}_{128}\}$, for models trained with ELBO and IWAE. Additional details are in the Appendix.
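The alias method mentioned above is a standard way to draw many samples from a fixed discrete distribution: after a linear-time table construction, each draw costs constant time. The sketch below shows Vose's variant in plain Python as an illustration of the technique only; it is not the TensorFlow code used for the experiments, and the function names are hypothetical.

```python
import random


def build_alias_table(weights):
    """Build (prob, alias) tables for non-negative weights in linear time."""
    n = len(weights)
    total = sum(weights)
    scaled = [w * n / total for w in weights]  # mean of the scaled weights is 1
    prob = [0.0] * n
    alias = [0] * n
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    while small and large:
        s = small.pop()
        l = large.pop()
        prob[s] = scaled[s]   # keep index s with this probability
        alias[s] = l          # otherwise fall through to index l
        scaled[l] = scaled[l] + scaled[s] - 1.0  # mass of l used to fill slot s
        if scaled[l] < 1.0:
            small.append(l)
        else:
            large.append(l)
    for i in large + small:   # leftovers are full slots (up to rounding)
        prob[i] = 1.0
    return prob, alias


def alias_sample(prob, alias, rng):
    """Draw one index in constant time from the tables."""
    i = rng.randrange(len(prob))
    return i if rng.random() < prob[i] else alias[i]


def resample_indices(weights, rng):
    """Draw len(weights) ancestor indices in proportion to the weights."""
    prob, alias = build_alias_table(weights)
    return [alias_sample(prob, alias, rng) for _ in range(len(weights))]


if __name__ == "__main__":
    rng = random.Random(0)
    print(resample_indices([0.1, 0.2, 0.7, 0.0], rng))
```

Here `resample_indices` plays the role of a multinomial resampling step: it returns which particles survive, and the caller would copy the corresponding trajectories.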
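Table 1: Test set log-likelihood bounds for models trained with ELBO, IWAE, and FIVO. Polyphonic music results are per timestep; TIMIT results are per sequence, relative to models trained with ELBO.

Polyphonic music
N    Bound   Nottingham   JSB     MuseData   Piano-midi.de
4    ELBO    -3.00        -8.60   -7.15      -7.81
4    IWAE    -2.75        -7.86   -7.20      -7.86
4    FIVO    -2.68        -6.90   -6.20      -7.76
8    ELBO    -3.01        -8.61   -7.19      -7.83
8    IWAE    -2.90        -7.40   -7.15      -7.84
8    FIVO    -2.77        -6.79   -6.12      -7.45
16   ELBO    -3.02        -8.63   -7.18      -7.85
16   IWAE    -2.85        -7.41   -7.13      -7.79
16   FIVO    -2.58        -6.72   -5.89      -7.43

TIMIT
N    Bound   64 units   256 units
4    ELBO    0          10,438
4    IWAE    160        11,054
4    FIVO    5,691      17,822
8    ELBO    2,771      9,819
8    IWAE    3,977      11,623
8    FIVO    6,023      21,449
16   ELBO    1,676      9,918
16   IWAE    3,236      13,069
16   FIVO    8,630      21,536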
6.1 Polyphonic Music
We evaluated VRNNs trained with the ELBO, IWAE, and FIVO bounds on 4 polyphonic music datasets: the Nottingham folk tunes, the JSB chorales, the MuseData library of classical piano and orchestral music, and the Piano-midi.de MIDI archive boulanger2012modeling . Each dataset is split into standard train, valid, and test sets and is represented as a sequence of 88-dimensional binary vectors denoting the notes active at the current timestep. We mean-centered the input data and modeled the output as a set of 88 factorized Bernoulli variables. We used 64 units for the RNN hidden state and latent state size for all polyphonic music models except for JSB chorales models, which used 32 units. We report bounds on average log-likelihood per timestep in Table 1. Models trained with the FIVO bound significantly outperformed models trained with either the ELBO or the IWAE bounds on all four datasets. In some cases, the improvements exceeded 1 nat per timestep, and in all cases optimizing FIVO with $N = 4$ outperformed optimizing IWAE or ELBO for $N \in \{4, 8, 16\}$.

6.2 Speech
The TIMIT dataset is a standard benchmark for sequential models that contains 6300 utterances with an average duration of 3.1 seconds spoken by 630 different speakers. The 6300 utterances are divided into a training set of size 4620 and a test set of size 1680. We further divided the training set into a validation set of size 231 and a training set of size 4389, with the splits exactly as in fraccaro2016sequential . Each TIMIT utterance is represented as a sequence of real-valued amplitudes which we split into a sequence of 200-dimensional frames, as in chung2015recurrent ; fraccaro2016sequential . Data preprocessing was limited to mean centering and variance normalization as in fraccaro2016sequential . For TIMIT, the output distribution was a factorized Gaussian, and we report the average log-likelihood bound per sequence relative to models trained with ELBO. Again, models trained with FIVO significantly outperformed models trained with IWAE or ELBO, see Table 1.
6.3 Resampling Gradients
All models in this work (except those in this section) were trained with gradients that did not include the term in Eq. (8) that comes from resampling steps. We omitted this term because it has an outsized effect on gradient variance, often increasing it by 6 orders of magnitude. To explore the effects of this term experimentally, we trained VRNNs with and without the resampling gradient term on the TIMIT and polyphonic music datasets. When using the resampling term, we attempted to control its variance using a moving-average baseline linear in the number of timesteps. For all datasets, models trained without the resampling gradient term outperformed models trained with the term by a large margin on both the training set and held-out data. Many runs with resampling gradients failed to improve beyond random initialization. A representative pair of train log-likelihood curves is shown in Figure 2: gradients without the resampling term led to earlier convergence and a better solution. We stress that this is an empirical result; in principle, biased gradients can lead to divergent behaviour. We leave exploring strategies to reduce the variance of the unbiased estimator to future work.
6.4 Sharpness
FIVO does not achieve the marginal log-likelihood at its optimal variational posterior $q^*$, because the optimal $q^*$ does not condition on future observations (see Section 4.2). In contrast, ELBO and IWAE are sharp, and their $q^*$s depend on future observations. To investigate the effects of this, we defined a smoothing variant of the VRNN in which $q$ takes as additional input the hidden state of a deterministic RNN run backwards over the observations, allowing $q$ to condition on future observations. We trained smoothing VRNNs using ELBO, IWAE, and FIVO, and report evaluation on the training set (to isolate the effect on optimization performance) in Table 2. Smoothing helped models trained with IWAE, but not enough to outperform models trained with FIVO. As expected, smoothing did not reliably improve models trained with FIVO. Test set performance was similar, see the Appendix for details.
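Table 2: Train set log-likelihood bounds for VRNNs and smoothing VRNNs (marked +s) trained with ELBO, IWAE, and FIVO.

Bound    Nottingham   JSB     MuseData   Piano-midi.de   TIMIT
ELBO     -2.40        -5.48   -6.54      -6.68           0
ELBO+s   -2.59        -5.53   -6.48      -6.77           925
IWAE     -2.52        -5.77   -6.54      -6.74           1,469
IWAE+s   -2.37        -4.63   -6.47      -6.74           2,630
FIVO     -2.29        -4.08   -5.80      -6.41           6,991
FIVO+s   -2.34        -3.83   -5.87      -6.34           9,773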
6.5 Use of Stochastic State
A known pathology when training stochastic latent variable models with the ELBO is that stochastic states can go unused. Empirically, this is associated with the collapse of the variational posterior network to the model prior bowman2015generating . To investigate this, we plot the KL divergence from $q(z_{1:T}|x_{1:T})$ to $p(z_{1:T})$ averaged over the dataset (Figure 2). Indeed, the KL of models trained with ELBO collapsed during training, whereas the KL of models trained with FIVO remained high, even while achieving a higher log-likelihood bound.
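Because the latent distributions in these models are factorized Gaussians, a per-step KL term of this kind has a closed form. The sketch below shows that standard formula for two diagonal Gaussians; it is an illustration only, the function name and argument layout are hypothetical, and how the per-step values are accumulated over time steps and averaged over the dataset is an assumption left to the caller.

```python
import math


def kl_diag_gaussians(mu_q, sigma_q, mu_p, sigma_p):
    """KL(q || p) for factorized Gaussians given per-dimension means and stds.

    Each argument is a sequence with one entry per latent dimension; the
    divergence of a factorized distribution is the sum over dimensions.
    """
    total = 0.0
    for mq, sq, mp, sp in zip(mu_q, sigma_q, mu_p, sigma_p):
        total += (math.log(sp / sq)
                  + (sq * sq + (mq - mp) ** 2) / (2.0 * sp * sp)
                  - 0.5)
    return total


if __name__ == "__main__":
    # A posterior that matches the prior exactly carries no information
    # about the observation, which is the collapsed case described above.
    print(kl_diag_gaussians([0.0, 0.0], [1.0, 1.0], [0.0, 0.0], [1.0, 1.0]))
    print(kl_diag_gaussians([0.5, -1.0], [0.3, 0.8], [0.0, 0.0], [1.0, 1.0]))
```

A value near zero at every step indicates that the variational posterior has collapsed onto the prior, whereas larger values indicate that the latent state is being used.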
7 Conclusions
We introduced the family of filtering variational objectives, a class of lower bounds on the log marginal likelihood that extend the evidence lower bound. FIVOs are suited for MLE in neural latent variable models. We trained models with the ELBO, IWAE, and FIVO bounds and found that the models trained with FIVO significantly outperformed other models across four polyphonic music modeling tasks and a speech waveform modeling task. Future work will include exploring control variates for the resampling gradients, FIVOs defined by more sophisticated filtering algorithms, and new MCOs based on differentiable operators like leapfrog operators with deterministically annealed temperatures. In general, we hope that this paper inspires the machine learning community to take a fresh look at the literature of marginal likelihood estimators—seeing them as objectives instead of algorithms for inference.
Acknowledgments
We thank Matt Hoffman, Matt Johnson, Danilo J. Rezende, Jascha Sohl-Dickstein, and Theophane Weber for helpful discussions and support in this project. A. Doucet was partially supported by the EPSRC grant EP/K000276/1. Y. W. Teh’s research leading to these results has received funding from the European Research Council under the European Union’s Seventh Framework Programme (FP7/2007-2013) ERC grant agreement no. 617071.
References
 (1) Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. ICLR, 2014.
 (2) Danilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. Stochastic backpropagation and approximate inference in deep generative models. ICML, 2014.
 (3) Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
 (4) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, 2014.
 (5) Sebastian Nowozin, Botond Cseke, and Ryota Tomioka. f-GAN: Training generative neural samplers using variational divergence minimization. arXiv preprint arXiv:1606.00709, 2016.
 (6) Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint arXiv:1702.08896, 2017.
 (7) Shakir Mohamed and Balaji Lakshminarayanan. Learning in implicit generative models. arXiv preprint arXiv:1610.03483, 2016.
 (8) Michael I Jordan, Zoubin Ghahramani, Tommi S Jaakkola, and Lawrence K Saul. An introduction to variational methods for graphical models. Machine learning, 37(2):183–233, 1999.

 (9) Matthew J. Beal. Variational algorithms for approximate Bayesian inference. 2003.
 (10) Rajesh Ranganath, Sean Gerrish, and David Blei. Black box variational inference. In AISTATS, 2014.
 (11) Alp Kucukelbir, Dustin Tran, Rajesh Ranganath, Andrew Gelman, and David M Blei. Automatic differentiation variational inference. arXiv preprint arXiv:1603.00788, 2016.

 (12) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Importance weighted autoencoders. ICLR, 2016.
 (13) Andriy Mnih and Danilo J Rezende. Variational inference for Monte Carlo objectives. arXiv preprint arXiv:1602.06725, 2016.
 (14) Frédéric Cérou, Pierre Del Moral, and Arnaud Guyader. A nonasymptotic theorem for unnormalized Feynman–Kac particle models. Ann. Inst. H. Poincaré B, 47(3):629–649, 2011.

 (15) Jean Bérard, Pierre Del Moral, and Arnaud Doucet. A lognormal central limit theorem for particle approximations of normalizing constants. Electron. J. Probab., 19(94):1–28, 2014.
 (16) Arthur P Dempster, Nan M Laird, and Donald B Rubin. Maximum likelihood from incomplete data via the EM algorithm. J. R. Stat. Soc. Ser. B Stat. Methodol., pages 1–38, 1977.
 (17) CF Jeff Wu. On the convergence properties of the EM algorithm. Ann. Stat., pages 95–103, 1983.
 (18) Radford M Neal and Geoffrey E Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. In Learning in graphical models, pages 355–368. Springer, 1998.
 (19) Matthew D Hoffman, David M Blei, Chong Wang, and John William Paisley. Stochastic variational inference. Journal of Machine Learning Research, 14(1):1303–1347, 2013.
 (20) Andriy Mnih and Karol Gregor. Neural variational inference and learning in belief networks. arXiv preprint arXiv:1402.0030, 2014.

 (21) Yarin Gal. Uncertainty in Deep Learning. PhD thesis, University of Cambridge, 2016.
 (22) Arnaud Doucet and Adam M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, The Oxford Handbook of Nonlinear Filtering, pages 656–704. Oxford University Press, 2011.
 (23) Pierre Del Moral. Feynman-Kac formulae: genealogical and interacting particle systems with applications. Springer Verlag, 2004.
 (24) Pierre Del Moral. Mean field simulation for Monte Carlo integration. CRC Press, 2013.
 (25) Christophe Andrieu, Arnaud Doucet, and Roman Holenstein. Particle Markov chain Monte Carlo methods. J. R. Stat. Soc. Ser. B Stat. Methodol., 72(3):269–342, 2010.
 (26) Michael K Pitt, Ralph dos Santos Silva, Paolo Giordani, and Robert Kohn. On some properties of Markov chain Monte Carlo simulation methods based on the particle filter. J. Econometrics, 171(2):134–151, 2012.
 (27) Martin J Wainwright, Michael I Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1–2):1–305, 2008.
 (28) Roger B Grosse, Zoubin Ghahramani, and Ryan P Adams. Sandwiching the marginal likelihood using bidirectional Monte Carlo. arXiv preprint arXiv:1511.02543, 2015.
 (29) Yuri Burda, Roger Grosse, and Ruslan Salakhutdinov. Accurate and conservative estimates of MRF loglikelihood using reverse annealing. In AISTATS, 2015.
 (30) Rajesh Ranganath, Dustin Tran, Jaan Altosaar, and David Blei. Operator variational inference. In NIPS, 2016.
 (31) Danilo Jimenez Rezende and Shakir Mohamed. Variational inference with normalizing flows. ICML, 2015.
 (32) Diederik P Kingma, Tim Salimans, Rafal Jozefowicz, Xi Chen, Ilya Sutskever, and Max Welling. Improved variational inference with inverse autoregressive flow. In NIPS, 2016.
 (33) Tim Salimans, Diederik Kingma, and Max Welling. Markov chain Monte Carlo and variational inference: Bridging the gap. In ICML, 2015.
 (34) Jörg Bornschein and Yoshua Bengio. Reweighted wake-sleep. ICLR, 2015.
 (35) Shixiang Gu, Zoubin Ghahramani, and Richard E Turner. Neural adaptive sequential Monte Carlo. In NIPS, 2015.
 (36) Yoshua Bengio, Li Yao, Guillaume Alain, and Pascal Vincent. Generalized denoising autoencoders as generative models. In NIPS, 2013.
 (37) Christian A Naesseth, Scott W Linderman, Rajesh Ranganath, and David M Blei. Variational sequential Monte Carlo. arXiv preprint arXiv:1705.11140, 2017.
 (38) Tuan Anh Le, Maximilian Igl, Tom Jin, Tom Rainforth, and Frank Wood. Auto-encoding sequential Monte Carlo. arXiv preprint arXiv:1705.10306, 2017.
 (39) Junyoung Chung, Kyle Kastner, Laurent Dinh, Kratarth Goel, Aaron C Courville, and Yoshua Bengio. A recurrent latent variable model for sequential data. In NIPS, 2015.
 (40) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. TensorFlow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467, 2016.
 (41) Marco Fraccaro, Søren Kaae Sønderby, Ulrich Paquet, and Ole Winther. Sequential neural models with stochastic layers. In NIPS, 2016.
 (42) Nicolas Boulanger-Lewandowski, Yoshua Bengio, and Pascal Vincent. Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription. In ICML, 2012.
 (43) Samuel R Bowman, Luke Vilnis, Oriol Vinyals, Andrew M Dai, Rafal Jozefowicz, and Samy Bengio. Generating sentences from a continuous space. arXiv preprint arXiv:1511.06349, 2015.
 (44) Eric Veach and Leonidas J Guibas. Optimally combining sampling techniques for Monte Carlo rendering. In Proceedings of the 22nd annual conference on Computer graphics and interactive techniques, pages 419–428. ACM, 1995.
 (45) Siddhartha Chib. Marginal likelihood from the Gibbs output. Journal of the American Statistical Association, 90(432):1313–1321, 1995.
 (46) Xiao-Li Meng and Wing Hung Wong. Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statistica Sinica, pages 831–860, 1996.
 (47) Andrew Gelman and Xiao-Li Meng. Simulating normalizing constants: From importance sampling to bridge sampling to path sampling. Statistical Science, pages 163–185, 1998.
 (48) Radford M Neal. Annealed importance sampling. Statistics and computing, 11(2):125–139, 2001.
 (49) Radford M Neal. Estimating ratios of normalizing constants using linked importance sampling. arXiv preprint math/0511216, 2005.
 (50) John Skilling. Nested sampling for general Bayesian computation. Bayesian analysis, 1(4):833–859, 2006.
 (51) Víctor Elvira, Luca Martino, David Luengo, and Mónica F Bugallo. Generalized multiple importance sampling. arXiv preprint arXiv:1511.03095, 2015.
 (52) Pierre Del Moral, Arnaud Doucet, and Ajay Jasra. Sequential Monte Carlo samplers. J. R. Stat. Soc. Ser. B Stat. Methodol., 68(3):411–436, 2006.
 (53) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In AISTATS, 2010.
 (54) Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.
Appendix to Filtering Variational Objectives
Other Examples of MCOs.
There is an extensive literature on marginal likelihood estimators [44, 45, 46, 47, 48, 49, 50]. Each defines an MCO, and we consider two in more detail, annealed importance sampling [48] and multiple importance sampling [44, 51]. Let $x$ denote an observation of an $\mathcal{X}$-valued random variable generated in a process with an unobserved $\mathcal{Z}$-valued random variable $z$. Let $p(x, z)$ be the joint density.
Annealed Importance Sampling MCO.
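Annealed importance sampling (AIS) is a generalization of importance sampling [48]. We present an MCO derived from a special case of the AIS algorithm. Let $q(z|x)$ be a variational posterior distribution and let $\beta_n$ be a sequence of real numbers for $n \in \{0, \ldots, N\}$ such that $\beta_0 = 0$ and $\beta_N = 1$ and $\beta_{n-1} < \beta_n$. Let $\mathcal{T}_n(z'|z)$ be a Markov transition distribution whose stationary distribution is proportional to $q(z|x)^{1-\beta_n} p(x, z)^{\beta_n}$. Then for $z_0 \sim q(z|x)$ and $z_n \sim \mathcal{T}_n(z|z_{n-1})$ for $n \in \{1, \ldots, N-1\}$ we have the following unbiased estimator,
(9) $\hat{p}_N(x) = \prod_{n=1}^{N} \left( \frac{p(x, z_{n-1})}{q(z_{n-1}|x)} \right)^{\beta_n - \beta_{n-1}}$
Notice two things. First, there is no assumption that the states $z_n$ are at equilibrium, and second, we did not require a transition operator keeping $p(z|x)$ as an invariant distribution. All together, we can define the AIS MCO,
(10) $\mathcal{L}_N^{\mathrm{AIS}}(x, p, q, \{\mathcal{T}_n\}) = \mathbb{E}\left[ \sum_{n=1}^{N} (\beta_n - \beta_{n-1}) \log \frac{p(x, z_{n-1})}{q(z_{n-1}|x)} \right]$
This is a sharp objective if we take $q(z|x)$ as the true posterior, $q(z|x) = p(z|x)$, and $\mathcal{T}_n$ to be the Dirac delta copy operator. The difficulty in applying this MCO is finding transition distributions $\mathcal{T}_n$ that are scalable and easy to optimize. Generalizations of the AIS procedure have been proposed in [52]. The resulting sequential Monte Carlo samplers also provide an unbiased estimator of the marginal likelihood and are structurally identical to the particle algorithm presented in this paper.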
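The AIS estimator can be sketched numerically under some illustrative assumptions that are not part of the paper: a toy Gaussian model with $z \sim N(0, 1)$ and $x | z \sim N(z, 1)$, so that $p(x) = N(x; 0, 2)$ and $p(z|x) = N(x/2, 1/2)$ are known in closed form, and a random-walk Metropolis step standing in for each transition $\mathcal{T}_n$. The function names (`log_normal`, `log_joint`, `ais_log_estimate`) and the `step` parameter are hypothetical. The sketch accumulates $(\beta_n - \beta_{n-1}) \log \frac{p(x, z_{n-1})}{q(z_{n-1}|x)}$ along the chain.

```python
import numpy as np


def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((v - mean) / std) ** 2


def log_joint(x, z):
    """Assumed toy model (not from the paper): z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)


def ais_log_estimate(x, q_mean, q_std, betas, rng, step=0.5):
    """Return one sample of log p_hat_N(x) with a Gaussian proposal q(z|x)."""

    def log_ratio(z):
        return log_joint(x, z) - log_normal(z, q_mean, q_std)

    def log_target(z, beta):
        # Unnormalized log density of q(z|x)^(1 - beta) * p(x, z)^beta.
        return log_normal(z, q_mean, q_std) + beta * log_ratio(z)

    n_steps = len(betas) - 1
    z = rng.normal(q_mean, q_std)  # z_0 ~ q(z|x)
    log_w = 0.0
    for n in range(1, n_steps + 1):
        # Weight increment uses the previous state z_{n-1}.
        log_w += (betas[n] - betas[n - 1]) * log_ratio(z)
        if n < n_steps:
            # z_n ~ T_n(. | z_{n-1}): one Metropolis step targeting beta_n.
            proposal = z + step * rng.normal()
            log_accept = log_target(proposal, betas[n]) - log_target(z, betas[n])
            if np.log(rng.uniform()) < log_accept:
                z = proposal
    return log_w


x_obs = 2.0
betas = np.linspace(0.0, 1.0, 11)
rng = np.random.default_rng(0)
log_px = log_normal(x_obs, 0.0, np.sqrt(2.0))
# Sharp case: q is the true posterior, so every draw equals log p(x).
sharp = ais_log_estimate(x_obs, x_obs / 2.0, np.sqrt(0.5), betas, rng)
# Mismatched q (the prior): the average estimates the AIS MCO.
mco = np.mean([ais_log_estimate(x_obs, 0.0, 1.0, betas, rng) for _ in range(200)])
```

With the posterior as proposal the log-ratio is constant in $z$, so the accepted or rejected Metropolis moves do not affect the result. With a mismatched proposal the average of the log estimates is a Monte Carlo approximation of the objective and is expected to lie below $\log p(x)$.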
Multiple Importance Sampling MCO.
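Multiple importance sampling (MIS) [44] is another generalization of importance sampling. Let $q_n(z|x)$ for $n \in \{1, \ldots, N\}$ be possibly distinct variational posterior distributions and $\alpha_n \geq 0$ be such that $\sum_{n=1}^{N} \alpha_n = 1$. There are a variety of distinct estimators that could be formed from the $q_n$ [51]. We present just one. Let $z_n \sim q_n(z|x)$, then we have the following unbiased estimator
(11) $\hat{p}_N(x) = \sum_{n=1}^{N} \frac{\alpha_n \, p(x, z_n)}{\sum_{m=1}^{N} \alpha_m \, q_m(z_n|x)}$
Notice that the latent sample $z_n$ is evaluated under all $q_m$'s. One can view this as a Rao-Blackwellized estimator corresponding to the mixture distribution $\sum_{n=1}^{N} \alpha_n q_n(z|x)$. All together,
(12) $\mathcal{L}_N^{\mathrm{MIS}}(x, p, \{q_n\}, \{\alpha_n\}) = \mathbb{E}\left[ \log \sum_{n=1}^{N} \frac{\alpha_n \, p(x, z_n)}{\sum_{m=1}^{N} \alpha_m \, q_m(z_n|x)} \right]$
Again, this objective is sharp if we take any $q_n(z|x) = p(z|x)$ and $\alpha_n = 1$. The difficulty in making this objective more useful is optimizing it in a way that distinguishes the $q_n$ and assigns the appropriate $\alpha_n$.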
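A minimal sketch of one MIS estimator follows, again under an assumed toy Gaussian model ($z \sim N(0, 1)$, $x | z \sim N(z, 1)$) and Gaussian proposals, none of which come from the paper. It draws one sample $z_n$ from each proposal $q_n$, evaluates every sample under every proposal, and sums $\alpha_n p(x, z_n) / \sum_m \alpha_m q_m(z_n|x)$. The function names are hypothetical.

```python
import numpy as np


def log_normal(v, mean, std):
    """Log density of N(mean, std^2) evaluated at v."""
    return -0.5 * np.log(2.0 * np.pi * std**2) - 0.5 * ((v - mean) / std) ** 2


def log_joint(x, z):
    """Assumed toy model (not from the paper): z ~ N(0, 1), x | z ~ N(z, 1)."""
    return log_normal(z, 0.0, 1.0) + log_normal(x, z, 1.0)


def mis_log_estimate(x, q_means, q_stds, alphas, rng):
    """Return one sample of log p_hat_N(x) with Gaussian proposals q_n(z|x)."""
    q_means = np.asarray(q_means, dtype=float)
    q_stds = np.asarray(q_stds, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    z = rng.normal(q_means, q_stds)  # z_n ~ q_n(z|x), one draw per proposal
    # log q_m(z_n|x) for every pair: rows index samples n, columns proposals m.
    log_q = log_normal(z[:, None], q_means[None, :], q_stds[None, :])
    mixture = np.exp(log_q) @ alphas  # sum_m alpha_m q_m(z_n|x) for each n
    terms = alphas * np.exp(log_joint(x, z)) / mixture
    return np.log(terms.sum())


x_obs = 2.0
rng = np.random.default_rng(0)
log_px = log_normal(x_obs, 0.0, np.sqrt(2.0))
# Sharp case: all weight on a proposal equal to the true posterior N(x/2, 1/2).
sharp = mis_log_estimate(x_obs, [x_obs / 2.0, 0.0], [np.sqrt(0.5), 1.0], [1.0, 0.0], rng)
# Three distinct proposals with uniform mixture weights.
mixed = mis_log_estimate(x_obs, [0.0, 1.0, 2.0], [1.0, 1.0, 1.0], [1 / 3, 1 / 3, 1 / 3], rng)
```

In the sharp configuration the single non-zero term equals $p(x, z_1) / p(z_1|x) = p(x)$ for every draw, which is the sharpness property stated above.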
Proof of Proposition 1.
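Let $\hat{p}_N(x)$ be an unbiased positive estimator of $p(x)$ and define $\mathcal{L}_N(x, p) = \mathbb{E}[\log \hat{p}_N(x)]$ as the Monte Carlo objective defined by $\hat{p}_N(x)$.

(a) By the concavity of $\log$ and Jensen's inequality, $\mathcal{L}_N(x, p) = \mathbb{E}[\log \hat{p}_N(x)] \leq \log \mathbb{E}[\hat{p}_N(x)] = \log p(x)$.

(b) Assume
(i) $\hat{p}_N(x)$ is strongly consistent, i.e. $\hat{p}_N(x) \to p(x)$ almost surely as $N \to \infty$.
(ii) $\log \hat{p}_N(x)$ is uniformly integrable. That is, let $(\Omega, \mathcal{F}, \mu)$ be the probability space on which $\log \hat{p}_N(x)$ is defined. The random variables $\{\log \hat{p}_N(x)\}_{N=1}^{\infty}$ are uniformly integrable if $\sup_N \mathbb{E}[|\log \hat{p}_N(x)|] < \infty$ and if for any $\epsilon > 0$, there exists $\delta > 0$, such that for all $N$ and $A \in \mathcal{F}$, $\mu(A) < \delta$ implies $\mathbb{E}[|\log \hat{p}_N(x)| \mathbb{1}_A] < \epsilon$, where $\mathbb{1}_A$ is an indicator function of the set $A$.
Then by continuity of $\log$, $\log \hat{p}_N(x)$ converges almost surely to $\log p(x)$. By Vitali's convergence theorem (using the uniform integrability assumption), we get $\mathcal{L}_N(x, p) \to \log p(x)$ as $N \to \infty$.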


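(c) Let $g(N) = \mathbb{E}[(\hat{p}_N(x) - p(x))^6]$, and assume $\limsup_{N \to \infty} \mathbb{E}[\hat{p}_N(x)^{-1}] < \infty$. Define the relative error
(13) $\Delta = \frac{\hat{p}_N(x) - p(x)}{p(x)}$
Then the bias $\log p(x) - \mathcal{L}_N(x, p) = -\mathbb{E}[\log(1 + \Delta)]$. Now, Taylor expand $\log(1 + \Delta)$ about 0,
(14) $\log(1 + \Delta) = \Delta - \frac{1}{2}\Delta^2 + \int_0^{\Delta} \left( \frac{1}{1 + t} - 1 + t \right) dt$
(15) $\phantom{\log(1 + \Delta)} = \Delta - \frac{1}{2}\Delta^2 + \int_0^{\Delta} \frac{t^2}{1 + t} \, dt$
and in expectation, using $\mathbb{E}[\Delta] = 0$,
(16) $\mathbb{E}[\log(1 + \Delta)] = -\frac{1}{2}\mathbb{E}[\Delta^2] + \mathbb{E}\left[ \int_0^{\Delta} \frac{t^2}{1 + t} \, dt \right]$
Our aim is to show
(17) $\left| \mathbb{E}\left[ \int_0^{\Delta} \frac{t^2}{1 + t} \, dt \right] \right| \in \mathcal{O}(g(N)^{1/2})$
In particular, by Cauchy-Schwarz
(18) $\left| \mathbb{E}\left[ \int_0^{\Delta} \frac{t^2}{1 + t} \, dt \right] \right| \leq \mathbb{E}\left[ \left| \int_0^{\Delta} t^4 \, dt \right|^{1/2} \left| \int_0^{\Delta} \frac{dt}{(1 + t)^2} \right|^{1/2} \right]$
(19) $= \mathbb{E}\left[ \left( \frac{|\Delta|^5}{5} \right)^{1/2} \left( \frac{|\Delta|}{1 + \Delta} \right)^{1/2} \right]$
(20) $= \frac{1}{\sqrt{5}} \, \mathbb{E}\left[ \frac{|\Delta|^3}{(1 + \Delta)^{1/2}} \right]$
and again by Cauchy-Schwarz
(21) $\leq \frac{1}{\sqrt{5}} \left( \mathbb{E}[\Delta^6] \right)^{1/2} \left( \mathbb{E}\left[ (1 + \Delta)^{-1} \right] \right)^{1/2} = \frac{1}{\sqrt{5}} \left( \frac{g(N)}{p(x)^6} \right)^{1/2} \left( p(x) \, \mathbb{E}[\hat{p}_N(x)^{-1}] \right)^{1/2}$
Since the first inverse moment is bounded for large $N$, the right-hand side is in $\mathcal{O}(g(N)^{1/2})$. This concludes the proof.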
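The three properties can be checked exactly on a small example, sketched here under assumptions that are not from the paper: the estimator is the average of $N$ i.i.d. weights taking the values 0.5 and 1.5 with probability 1/2 each, so that $p = 1$, the weights are bounded away from zero, and $\mathbb{E}[\log \hat{p}_N]$ reduces to a finite binomial sum. The function names are hypothetical. For this estimator half the relative variance is $1/(8N)$, and the sixth central moment of the average is of order $N^{-3}$, so the gap between the bias and $1/(8N)$ should shrink faster than $1/N$.

```python
import math


def exact_mco_bias(N):
    """Exact log p - E[log p_hat_N] for the assumed two-valued weights.

    p_hat_N = 0.5 + K / N with K ~ Binomial(N, 1/2), and log p = log 1 = 0.
    """
    expected_log = 0.0
    for k in range(N + 1):
        prob = math.comb(N, k) / 2**N  # P(K = k)
        expected_log += prob * math.log(0.5 + k / N)
    return -expected_log


def half_relative_variance(N):
    """0.5 * var(p_hat_N / p): each weight has variance 0.25 and p = 1."""
    return 0.5 * 0.25 / N


for N in (1, 10, 100, 1000):
    print(N, exact_mco_bias(N), half_relative_variance(N))
```

The bias is positive for every $N$ (the bound), decreases towards zero (consistency), and approaches half the relative variance as $N$ grows (asymptotic bias).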
Controlling the first inverse moment.
We provide a sufficient condition that guarantees that the first inverse moment of the average of i.i.d. random variables is bounded, a condition used in Proposition 1 (c). Intuitively, this is a fairly weak condition, because it only requires that the mass in an arbitrarily small neighbourhood of zero is bounded.
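Let $w_1, \ldots, w_N$ be i.i.d. positive random variables and $\hat{p}_N = \frac{1}{N} \sum_{i=1}^{N} w_i$. If there exist $M, C, \epsilon > 0$ such that $\mathbb{P}(w_i < w) \leq C w^{1 + \epsilon}$ for $w \in [0, M]$, then $\limsup_{N \to \infty} \mathbb{E}[\hat{p}_N^{-1}] < \infty$.
Proof.
Let $M, C, \epsilon > 0$ be such that $\mathbb{P}(w_i < w) \leq C w^{1 + \epsilon}$ for $w \in [0, M]$. We proceed in two cases. If $N = 1$, then
$\mathbb{E}[w_1^{-1}] = \int_0^{\infty} \mathbb{P}(w_1^{-1} > u) \, du \leq M^{-1} + \int_{M^{-1}}^{\infty} \mathbb{P}(w_1 < u^{-1}) \, du \leq M^{-1} + C \int_{M^{-1}}^{\infty} u^{-(1 + \epsilon)} \, du = M^{-1} + \frac{C M^{\epsilon}}{\epsilon} < \infty$
For $N > 1$, we show that $\mathbb{E}[\hat{p}_N^{-1}] \leq \mathbb{E}[\hat{p}_1^{-1}]$, so the same condition is sufficient for any $N$. The AM-GM inequality tells us that
$\frac{1}{N} \sum_{i=1}^{N} w_i \geq \prod_{i=1}^{N} w_i^{1/N}$
so
$\mathbb{E}[\hat{p}_N^{-1}] \leq \mathbb{E}\left[ \prod_{i=1}^{N} w_i^{-1/N} \right] = \prod_{i=1}^{N} \mathbb{E}[w_i^{-1/N}] \leq \prod_{i=1}^{N} \left( \mathbb{E}[w_i^{-1}] \right)^{1/N} = \mathbb{E}[w_1^{-1}]$
where the equality uses independence and the last inequality is Jensen's inequality for the concave map $u \mapsto u^{1/N}$.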