Streaming Variational Monte Carlo

06/04/2019
by   Yuan Zhao, et al.
Stony Brook University

Nonlinear state-space models are powerful tools to describe dynamical structures in complex time series. In a streaming setting where data are processed one sample at a time, simultaneously inferring the states and their nonlinear dynamics has posed significant challenges in practice. We develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our method provides a filtering posterior arbitrarily close to the true filtering distribution for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently infer a posterior over the dynamics using sparse Gaussian processes. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications.


1 Introduction

A nonlinear state-space model is a generative model for complex time series with underlying nonlinear dynamical structure Haykin1998 ; Ko2009 ; Mattos2016 . Specifically, it identifies nonlinear dynamics in the latent state-space, $x_t$, that capture the spatiotemporal structure of the noisy observations, $y_t$:

$x_t = f_\theta(x_{t-1}) + \epsilon_t$ (state dynamics model) (1a)
$y_t \sim P(y_t \mid g_\kappa(x_t))$ (observation model) (1b)

where $f_\theta$ and $g_\kappa$ are continuous vector functions, $P$ denotes a probability distribution, and $\epsilon_t$ is intended to capture unobserved perturbations of the state $x_t$. Such state-space models have many applications (e.g., object tracking) where the flow of the latent states is governed by known physical laws and constraints, or where learning the laws is of great interest, especially in neuroscience Roweis2001 ; Sussillo2013 ; Frigola2014 ; Daniels2015 ; Zhao2016 ; Nassar2019 .
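To make the generative reading of (1a) and (1b) concrete, the following minimal sketch draws one trajectory from a state-space model with additive state noise. The function name `simulate_ssm`, the Gaussian observation noise, and the particular rotation-plus-tanh dynamics are assumptions chosen for illustration; they are not taken from the experiments in this paper.

```python
import numpy as np

def simulate_ssm(f, g, x0, n_steps, state_noise_std, obs_noise_std, rng):
    """Sample x_t = f(x_{t-1}) + eps_t and y_t = g(x_t) + Gaussian noise.

    The Gaussian observation noise is an assumption made for this sketch;
    the observation model in (1b) allows other distributions.
    """
    x = np.asarray(x0, dtype=float)
    states, observations = [], []
    for _ in range(n_steps):
        # State dynamics: deterministic map plus a random perturbation.
        x = f(x) + state_noise_std * rng.standard_normal(x.shape)
        # Observation model: noisy readout of the current state.
        mean_y = g(x)
        y = mean_y + obs_noise_std * rng.standard_normal(mean_y.shape)
        states.append(x)
        observations.append(y)
    return np.array(states), np.array(observations)

# Hypothetical 2-D latent dynamics observed through a 3-D linear readout.
angle = 0.2
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle), np.cos(angle)]])
readout = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

xs, ys = simulate_ssm(
    f=lambda x: np.tanh(1.5 * rotation @ x),
    g=lambda x: readout @ x,
    x0=np.array([0.5, 0.0]),
    n_steps=200,
    state_noise_std=0.05,
    obs_noise_std=0.1,
    rng=np.random.default_rng(0),
)
```

Only `ys` would be available to a filtering algorithm; `xs` plays the role of the latent trajectory to be inferred.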

If the parametric form of the model and the parameters are known a priori, then the latent states can be inferred online through the filtering distribution or offline through the smoothing distribution Ho1964 ; Sarkka2013 . Otherwise, the challenge lies in learning the parameters of the state-space model, which is known in the literature as the system identification problem.

In a streaming setting where data are processed one sample at a time, simultaneously inferring the states and their nonlinear dynamics has posed significant challenges in practice. In this study, we are interested in online algorithms that can recursively solve the dual estimation problem of learning both the latent trajectory in the state-space and the underlying parameters of the model, especially the dynamical system, from streaming observations Haykin2001 .

Popular solutions, such as the extended Kalman filter (EKF) or the unscented Kalman filter (UKF) wan2000unscented , build an online dual estimator using nonlinear Kalman filtering by augmenting the state-space with its parameters Wan2000 ; Wan2001 . While powerful, they usually provide coarse approximations to the filtering distribution and involve many hyperparameters that must be tuned, which hinders their practical performance. Moreover, they do not take advantage of modern stochastic gradient optimization techniques commonly used throughout machine learning and are not easily applicable to arbitrary observation likelihoods.

Recursive stochastic variational inference has been proposed for streaming data assuming either independent Broderick2013 or temporally dependent samples Frigola2014 ; Zhao2017 . However, the proposed variational distributions are not guaranteed to be good approximations to the true posterior. In contrast to variational inference, sequential Monte Carlo (SMC) leverages importance sampling to build an approximation to the target distribution in a data streaming setting smith2013sequential ; doucet2009tutorial . However, its success heavily depends on the choice of proposal distributions, and the (locally) optimal proposal distribution doucet2009tutorial is usually available only in the simplest cases. While work has been done on learning good proposals for SMC gu2015neural ; Naesseth2018 ; cornebise2014adaptive ; guarniero2017iterated , most are designed only for offline scenarios, targeting the smoothing distributions instead of the filtering distributions. In cornebise2014adaptive , the proposal is learned online, but the class of dynamics to which it is applicable is extremely limited.

To address these concerns, we develop an online inference framework that leverages variational inference and sequential Monte Carlo and is applicable to a wide range of nonlinear state-space models and observation likelihoods. We derive a lower bound to the log marginal likelihood of the filtering distribution and show that it can be made arbitrarily tight. Moreover, we choose the sparse Gaussian process (GP) Titsias2009 for modeling the dynamics, which allows for recursive Bayesian inference. Under this framework, we can jointly filter latent states and identify the underlying nonlinear system in online settings.

2 Streaming Variational Monte Carlo

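Given the state-space model defined in (1), our goal is to infer the latent state $x_t$ from the streaming observations $y_{1:t}$. In Bayesian terms, we seek the filtering distribution at time $t$,

$p(x_t \mid y_{1:t}) = \frac{p(y_t \mid x_t)}{p(y_t \mid y_{1:t-1})} \int p(x_t \mid x_{t-1}) \, p(x_{t-1} \mid y_{1:t-1}) \, dx_{t-1}$ (2)

where each observation $y_t$ lies in a measurable space $\mathcal{Y}$ (typically $\mathbb{R}^{d_y}$ or $\mathbb{N}^{d_y}$), and the distribution is computed recursively from the previous filtering posterior $p(x_{t-1} \mid y_{1:t-1})$.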

However the above posterior is generally intractable except for limited cases Haykin2001 . Thus we turn to approximate methods and propose to combine sequential Monte Carlo and variational inference, which allows us to utilize modern stochastic optimization while leveraging the flexibility and theoretical guarantees of SMC. We refer to our approach as streaming variational Monte Carlo (SVMC).

2.1 Sequential Monte Carlo

SMC builds an approximation to the sequence of distributions using weighted samples, where the samples come from a proposal distribution where are the parameters of the distribution. Due to the Markovian assumption in the state-space model defined in (1) the posterior can be factorized as . We also enforce the same factorization for the proposal, .

In SMC, the approximation is constructed recursively doucet2009tutorial . At each time instant, the trajectories, with indexing the random sample, and their corresponding weights, , form an approximation to the smoothing distribution:

(3)

where the weights are computed according to

(4)

and is the Dirac-delta function centered at . Marginalizing out in (3) gives an approximation to the filtering distribution:

(5)

As a byproduct, the weights produced in an SMC run yield an unbiased estimate of the marginal likelihood of the smoothing distribution smith2013sequential :

(6)

and a biased but consistent estimate of the marginal likelihood of the filtering distribution smith2013sequential ; van2000asymptotic

(7)

For completeness, we reproduce the consistency proof of (7) in the supplementary material A.1. The recursive nature of SMC makes it a natural candidate for online settings adali2010adaptive : it has constant time complexity per time step and constant memory, because only the samples and weights generated at the previous time step are needed. These attractive properties have allowed SMC to enjoy much success in fields such as robotics thrun2002particle , control engineering greenfield2003adaptive and target tracking gustafsson2002particle . The success of an SMC sampler crucially depends on the design of the proposal distribution. A common choice for the proposal distribution is the transition distribution, which is known as the bootstrap particle filter (BPF). While simple, it is well known that the BPF needs a large number of particles to perform well and suffers in high dimensions bickel2008sharp .
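The resample, propose and reweigh cycle of the bootstrap particle filter can be sketched in a few lines. The code below is an illustrative sketch on a hypothetical one-dimensional linear-Gaussian model, not the implementation used in our experiments; the names `bootstrap_particle_filter` and `log_mean_exp`, the multinomial resampling scheme, and all parameter values are assumptions. Because the proposal is the transition distribution, each weight reduces to the observation likelihood, and the average unnormalized weight at each step accumulates into the log marginal-likelihood estimate.

```python
import numpy as np

def log_mean_exp(log_values):
    """Numerically stable log of the mean of exp(log_values)."""
    peak = np.max(log_values)
    return peak + np.log(np.mean(np.exp(log_values - peak)))

def bootstrap_particle_filter(ys, transition_sample, log_likelihood, init_particles, rng):
    particles = init_particles
    n_particles = particles.shape[0]
    weights = np.full(n_particles, 1.0 / n_particles)
    filter_means = []
    log_marginal = 0.0
    for y in ys:
        # Resample: draw ancestors in proportion to the current weights.
        ancestors = rng.choice(n_particles, size=n_particles, p=weights)
        # Propose: the bootstrap filter proposes from the transition model.
        particles = transition_sample(particles[ancestors], rng)
        # Reweigh: with this proposal the weight reduces to the likelihood.
        log_w = log_likelihood(y, particles)
        # The average unnormalized weight estimates p(y_t | y_{1:t-1}).
        log_marginal += log_mean_exp(log_w)
        weights = np.exp(log_w - np.max(log_w))
        weights /= weights.sum()
        filter_means.append(weights @ particles)
    return np.array(filter_means), log_marginal

# Hypothetical model: x_t = a x_{t-1} + N(0, q_std^2), y_t = x_t + N(0, r_std^2).
rng = np.random.default_rng(1)
a, q_std, r_std = 0.9, 0.5, 0.5
x, true_states, ys = 0.0, [], []
for _ in range(100):
    x = a * x + q_std * rng.standard_normal()
    true_states.append(x)
    ys.append(x + r_std * rng.standard_normal())
true_states, ys = np.array(true_states), np.array(ys)

means, log_marginal = bootstrap_particle_filter(
    ys,
    transition_sample=lambda p, gen: a * p + q_std * gen.standard_normal(p.shape),
    log_likelihood=lambda y, p: -0.5 * ((y - p[:, 0]) / r_std) ** 2
    - np.log(r_std * np.sqrt(2.0 * np.pi)),
    init_particles=rng.standard_normal((500, 1)),
    rng=rng,
)
rmse = np.sqrt(np.mean((means[:, 0] - true_states) ** 2))
```

Only the current particles and weights are kept between iterations, which is what gives the filter constant memory and constant cost per time step.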

Designing a proposal is even more difficult in an online setting because a proposal distribution that was optimized for the system at one time step may not be the best proposal several steps ahead. For example, if the dynamics were to change abruptly, a phenomenon known as concept drift vzliobaite2016overview , the previous proposal may fail for the current time step. Thus, we propose to adapt the proposal distribution online using variational inference. This allows us to utilize modern stochastic optimization to adapt the proposal on-the-fly while still leveraging the theoretical guarantees of SMC.

2.2 Variational Inference

Variational inference (VI) takes an optimization approach to posterior inference. In VI, we approximate the target posterior by a class of simpler, parameterized distributions. We then minimize a divergence, usually the Kullback-Leibler divergence (KLD), between the posterior and the approximate distribution in order to bring the approximation closer to the posterior. If the divergence used is the KLD, then minimizing it is equivalent to maximizing the so-called evidence lower bound (ELBO) wainwright2008graphical :

(8)

For filtering inference, the intractability introduced by marginalizing over in (8) makes the problem much harder to optimize, rendering variational inference impractical in a streaming setting where incoming data are temporally dependent.
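The relationship between the ELBO and the log evidence can be checked numerically on a toy conjugate model where everything is available in closed form. The sketch below assumes a hypothetical model $x \sim \mathcal{N}(0, 1)$, $y \mid x \sim \mathcal{N}(x, 1)$, which is not one of the models studied in this paper; the function names are illustrative. When the approximation equals the exact posterior the bound is tight, and when it equals the prior the bound is loose by the KLD between the two.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def elbo_estimate(y, q_mean, q_var, n_samples, rng):
    """Monte Carlo ELBO for x ~ N(0, 1), y | x ~ N(x, 1), q(x) = N(q_mean, q_var)."""
    x = q_mean + np.sqrt(q_var) * rng.standard_normal(n_samples)
    log_joint = log_normal_pdf(x, 0.0, 1.0) + log_normal_pdf(y, x, 1.0)
    log_q = log_normal_pdf(x, q_mean, q_var)
    return np.mean(log_joint - log_q)

y = 1.5
rng = np.random.default_rng(0)
# Marginally y ~ N(0, 2), and the exact posterior is N(y / 2, 1 / 2).
log_evidence = log_normal_pdf(y, 0.0, 2.0)
elbo_prior = elbo_estimate(y, 0.0, 1.0, 10_000, rng)      # q = prior: loose bound
elbo_exact = elbo_estimate(y, y / 2.0, 0.5, 10_000, rng)  # q = posterior: tight bound
```

In this toy problem the posterior lies inside the variational family, so the gap can be closed exactly; in the filtering problem the posterior is generally outside any simple family, which motivates the alternative objective developed in the next subsection.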

Algorithm 1 Streaming Variational Monte Carlo (Step )
    for  do
        for  do
            Resample
            Propose
            Reweigh
        end for
        SGD
    end for
    Resample, propose and reweigh particles
    return ,

Figure 1: SVMC

2.3 A Tight Lower Bound

Due to the intractability of the filtering distribution, the standard ELBO is difficult to optimize, which forces us to define a different objective function. As stated above, we know that the importance weights are an unbiased estimator of the marginal likelihood. Applying Jensen’s inequality to (6) Naesseth2018 ; anh2018autoencoding gives

(9)

Expanding (9) we obtain

(10)
(11)

where is the variational gap. Leveraging this we propose to optimize

(12)

We call the filtering ELBO; it lower bounds the log normalization constant (log partition function) of the filtering distribution where accounts for the bias of the estimator (7).

The implicit distribution that arises from performing SMC Cremer2017 is

(13)

As the number of samples goes to infinity, (12) can be made arbitrarily tight; as a result, the implicit approximation to the filtering distribution (13) becomes arbitrarily close to the true posterior almost everywhere, which allows for a trade-off between accuracy and computational complexity. We note that this result does not hold in most applications of VI because of the simplicity of the variational families used. As an example, we showcase the accuracy of the implicit distribution in Fig. 2. We summarize this result in the following theorem (proof in A.2).

Theorem 2.1 (Filtering ELBO).

The filtering ELBO (12), , is a lower bound to the logarithm of the normalization constant of the filtering distribution, . As the number of samples, , goes to infinity, will become arbitrarily close to .

Theorem 2.1 leads to the following corollary del1996non (proof in A.3).

Corollary 2.1.1.

Theorem 2.1 implies that the implicit filtering distribution converges to the true posterior as .
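The tightening of a Jensen-type bound with the number of samples can be illustrated with a single-step importance-sampling analogue. The sketch below is not the filtering ELBO itself: it assumes a hypothetical model $x \sim \mathcal{N}(0, 1)$, $y \mid x \sim \mathcal{N}(x, 1)$ with the prior as proposal, and estimates the expected log of the averaged weights for several particle counts. The function names and all values are assumptions for illustration.

```python
import numpy as np

def log_normal_pdf(x, mean, var):
    """Log density of a univariate Gaussian."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def importance_weighted_bound(y, n_particles, n_repeats, rng):
    """Average of log((1/N) sum_i w_i) with w_i = p(y | x_i) and x_i ~ N(0, 1)."""
    x = rng.standard_normal((n_repeats, n_particles))
    log_w = log_normal_pdf(y, x, 1.0)
    # Stable log-mean-exp over the particle axis.
    peak = log_w.max(axis=1, keepdims=True)
    log_mean_w = peak[:, 0] + np.log(np.mean(np.exp(log_w - peak), axis=1))
    return np.mean(log_mean_w)

y = 1.5
rng = np.random.default_rng(0)
log_evidence = log_normal_pdf(y, 0.0, 2.0)  # exact, since y ~ N(0, 2) marginally
bounds = {n: importance_weighted_bound(y, n, 2000, rng) for n in (1, 10, 1000)}
gaps = {n: log_evidence - bound for n, bound in bounds.items()}
```

With one particle the bound coincides with a standard ELBO that uses the prior as the approximation; as the particle count grows, the gap to the exact log evidence shrinks even though the proposal itself never changes.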

Figure 2: Accurate SVMC posterior. SVMC is able to capture the posterior fairly well (200 samples) even with a unimodal proposal. The unimodal variational approximation fails to capture the structure of the true posterior.

2.4 Stochastic Optimization

As in variational inference, we fit the parameters of the proposal, dynamics and observation model by maximizing the (filtering) ELBO (Alg. 1). While the expectations in (12) are not available in closed form, we can obtain unbiased estimates of the ELBO and its gradients with Monte Carlo. Note that, while obtaining gradients with respect to , we only need to take gradients of . We also assume that the proposal distribution is reparameterizable, i.e., we can sample from it by applying a deterministic function of the proposal parameters to noise drawn from a distribution that is independent of those parameters. Thus we can express the gradient of (12) using the reparameterization trick Kingma2014 as

(14)

In Algorithm 1, we perform a fixed number of stochastic gradient descent (SGD) updates at each time step.
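The reparameterization trick itself can be demonstrated on a toy objective with known gradients. The sketch below does not implement (14); it assumes a Gaussian distribution with mean `mu` and standard deviation `sigma` and the hypothetical objective $\mathbb{E}[x^2]$, whose exact gradients are $2\mu$ and $2\sigma$. Sampling is written as a deterministic function of the parameters and parameter-free noise, so the gradient can be moved inside the expectation and estimated by Monte Carlo.

```python
import numpy as np

def reparam_gradients(mu, sigma, n_samples, rng):
    """Estimate d/dmu and d/dsigma of E[x^2] for x ~ N(mu, sigma^2)."""
    eps = rng.standard_normal(n_samples)  # noise independent of (mu, sigma)
    x = mu + sigma * eps                  # reparameterized sample
    dh_dx = 2.0 * x                       # derivative of h(x) = x^2
    grad_mu = np.mean(dh_dx)              # chain rule with dx/dmu = 1
    grad_sigma = np.mean(dh_dx * eps)     # chain rule with dx/dsigma = eps
    return grad_mu, grad_sigma

grad_mu, grad_sigma = reparam_gradients(1.0, 0.5, 200_000, np.random.default_rng(0))
# Analytically E[x^2] = mu^2 + sigma^2, so the exact gradients are 2 * mu and 2 * sigma.
```

In practice the derivative `dh_dx` would be supplied by automatic differentiation rather than written by hand.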

While using more samples for computing the approximation creates a tighter bound, using more samples for estimating (14) may be detrimental for optimizing the parameters, as it has been shown to decrease the signal-to-noise ratio (SNR) of the gradient estimator for importance-sampling-based ELBOs rainforth2018tighter . The intuition is as follows: as the number of samples used to compute the gradient increases, the bound gets tighter and tighter, which in turn causes the magnitude of the gradient to become smaller. The magnitude of the gradient decreases at a much faster rate than the variance of the estimator, thus driving the SNR to zero. In practice, we found that 1 sample is enough to obtain good performance.

2.5 Learning Dynamics with Sparse Gaussian Processes

State-space models allow for various time series models Shumway2010 to model the evolution of the state and ultimately predict the future. In this study, we use Gaussian processes (GPs) Rasmussen2006 rather than principled parametric models. Gaussian processes provide a convenient way of imposing general assumptions on the dynamical system, such as continuity or smoothness, of predicting the future explicitly, and of quantifying the uncertainty over the predictions.

A Gaussian process is a collection of random variables, any finite number of which have a joint Gaussian distribution. It is completely specified by its mean and covariance functions. One can describe a distribution over a function $f$ as

$f(x) \sim \mathcal{GP}(m(x), k(x, x'))$ (15)

where $m(\cdot)$ is the mean function and $k(\cdot, \cdot)$ is the covariance function. With the GP prior imposed on a function, one can do Bayesian inference with data. In this study, we assume that for dynamics.
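As a small illustration of how a covariance function encodes smoothness assumptions, the sketch below evaluates a squared exponential kernel on a grid of inputs and draws functions from the corresponding GP prior. The zero mean function, the unit lengthscale and variance, the jitter term, and the function names are assumptions made for this example.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def sample_gp_prior(inputs, n_draws, rng, jitter=1e-6):
    """Draw functions from a zero-mean GP evaluated at the given inputs."""
    # Any finite set of function values is jointly Gaussian with covariance K.
    cov = se_kernel(inputs, inputs) + jitter * np.eye(len(inputs))
    chol = np.linalg.cholesky(cov)
    return (chol @ rng.standard_normal((len(inputs), n_draws))).T

inputs = np.linspace(-3.0, 3.0, 50)[:, None]
draws = sample_gp_prior(inputs, n_draws=3, rng=np.random.default_rng(0))
```

Each row of `draws` is one smooth random function evaluated on the grid; a shorter lengthscale would produce functions that vary more rapidly.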

However, fully Bayesian inference on GPs is not practical for online learning because the space and time complexities grow with the number of samples, which in turn grows with time $t$, i.e., $O(t^2)$ and $O(t^3)$ respectively. In other words, the space and time costs increase as more and more observations are processed. To overcome this limitation, we employ the sparse GP methods Titsias2009 ; Snelson2006 . We augment the state to with inducing points where and are pseudo-inputs and impose their prior distributions .

Assuming that is a sufficient statistic for given , we have

(16)

where is the covariance matrix determined by the covariance function for finite samples. Note that the inclusion of the inducing points in the model makes the computational complexity constant with respect to the sample size, i.e., it does not depend on time.
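The standard GP conditional given a set of inducing values shows why the cost no longer depends on time: every quantity involves only the fixed set of pseudo-inputs. The sketch below uses the textbook formulas for the conditional mean, $K_{xz} K_{zz}^{-1} u$, and covariance, $K_{xx} - K_{xz} K_{zz}^{-1} K_{zx}$, with a squared exponential kernel and scalar inputs. It is meant as an illustration of the idea rather than a transcription of (16); the names `z`, `u` and `conditional_given_inducing` and all numerical values are assumptions.

```python
import numpy as np

def se_kernel(a, b, lengthscale=1.0, variance=1.0):
    """Squared exponential covariance between the rows of a and the rows of b."""
    sq_dist = (np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return variance * np.exp(-0.5 * np.maximum(sq_dist, 0.0) / lengthscale**2)

def conditional_given_inducing(x_new, z, u, jitter=1e-8):
    """Mean and covariance of f(x_new) given inducing values u = f(z)."""
    k_zz = se_kernel(z, z) + jitter * np.eye(len(z))
    k_xz = se_kernel(x_new, z)
    # Projection K_xz K_zz^{-1}; its cost depends on len(z), not on time.
    proj = np.linalg.solve(k_zz, k_xz.T).T
    mean = proj @ u
    cov = se_kernel(x_new, x_new) - proj @ k_xz.T
    return mean, cov

z = np.linspace(-2.0, 2.0, 5)[:, None]  # hypothetical pseudo-inputs
u = np.sin(z[:, 0])                     # hypothetical inducing values
mean_at_z, cov_at_z = conditional_given_inducing(z, z, u)
mean_new, cov_new = conditional_given_inducing(np.array([[0.5], [5.0]]), z, u)
```

At the pseudo-inputs the conditional reproduces `u` with essentially zero variance, while far from them (here at 5.0) the variance returns to the prior level.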

In SVMC, computing the weights requires the prior over the state at current time. Supposedly the prior of is the posterior of which is represented by the particles if it is assumed static i.e. . However, in this case, the prior becomes improper as a Dirac delta function, which is an obstacle to the computation of weights. Therefore, we further impose a diffusion process on :

(17)

where and as defined before.

Hence this formulation is well-suited to be used in the SVMC framework to learn the dynamics online. The corresponding weights would generally be

(18)

A good proposal distribution for SMC is one that is close to the posterior. We factorize the joint proposal into , and employ the sparse GP regression technique Rasmussen2006 ; Titsias2009 to construct . Let . Given the previous state and by the Bayes’ rule

(19)

we obtain the recursive updating rule:

(20)

where and . For each new sample during the (dual) filtering, we first sample from and use these samples to obtain updated from . The recursive proposal on the inducing variables does not require gradients and optimization, and hence accelerates the training.
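The gradient-free character of a recursive Gaussian update can be seen in a generic linear-Gaussian setting. The sketch below does not reproduce (20); it assumes a Gaussian belief over a vector of inducing variables and scalar targets that depend linearly on them through hypothetical feature rows with Gaussian noise. Processing the targets one at a time yields the same posterior as conditioning on all of them at once, and each step touches only matrices whose size is set by the number of inducing variables.

```python
import numpy as np

def recursive_gaussian_update(mean, cov, a, target, noise_var):
    """Condition u ~ N(mean, cov) on target = a @ u + N(0, noise_var)."""
    cov_a = cov @ a
    innovation_var = a @ cov_a + noise_var
    gain = cov_a / innovation_var
    new_mean = mean + gain * (target - a @ mean)
    new_cov = cov - np.outer(gain, cov_a)
    return new_mean, new_cov

rng = np.random.default_rng(0)
n_inducing, n_steps, noise_var = 4, 20, 0.1
features = rng.standard_normal((n_steps, n_inducing))  # hypothetical rows a_t
targets = (features @ rng.standard_normal(n_inducing)
           + np.sqrt(noise_var) * rng.standard_normal(n_steps))

# Streaming pass: one closed-form update per incoming sample.
mean, cov = np.zeros(n_inducing), np.eye(n_inducing)
for a, target in zip(features, targets):
    mean, cov = recursive_gaussian_update(mean, cov, a, target, noise_var)

# Batch posterior from all data at once, for comparison.
batch_cov = np.linalg.inv(np.eye(n_inducing) + features.T @ features / noise_var)
batch_mean = batch_cov @ (features.T @ targets / noise_var)
```

The streaming and batch results agree up to floating-point error, which is the property that lets a recursive proposal avoid storing past samples.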

3 Experiments

To showcase the power of SVMC, we employ it on a number of simulated and real experiments.

3.1 Linear Dynamical System (LDS)

As a baseline, we use the proposed method on an LDS

(21)

LDS is the de facto dynamical system for many fields of science and engineering due to its simplicity and efficient exact filtering (i.e., Kalman filtering). The use of an LDS also yields a closed-form expression of the log marginal likelihood for the filtering and smoothing distributions. Thus, as a baseline we compare the estimates of the log marginal likelihood, , produced by SVMC, the offline variational sequential Monte Carlo (VSMC) Naesseth2018 and BPF gordon1993novel in an experiment similar to the one used in Naesseth2018 . We generated data from (21) with , , , with and where the state and emission parameters are fixed. For both VSMC and SVMC, 100 particles were used for both computing gradients and computing the lower bound. 25,000 gradient steps were used for VSMC, and 25,000/50 = 500 gradient steps were performed at each time step for SVMC to equate the total number of gradient steps between both methods. The results are shown in Table 1; SVMC obtains a higher ELBO than VSMC and BPF.

BPF VSMC SVMC True
ELBO or LL
Table 1: ELBO of BPF, VSMC and SVMC, and True LL
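The closed-form log marginal likelihood of an LDS is obtained by accumulating the Gaussian one-step predictive densities produced by the Kalman filter. The sketch below is a generic implementation under the assumption of a linear-Gaussian model with transition matrix `A`, emission matrix `C`, and noise covariances `Q` and `R`; these names and the small numerical example are hypothetical and do not correspond to the parameter values used for Table 1.

```python
import numpy as np

def kalman_log_likelihood(ys, A, C, Q, R, m0, P0):
    """Exact log p(y_{1:T}) for x_t = A x_{t-1} + N(0, Q), y_t = C x_t + N(0, R)."""
    mean, cov = m0, P0
    total = 0.0
    for y in ys:
        # Predict the next state from the previous filtering distribution.
        mean = A @ mean
        cov = A @ cov @ A.T + Q
        # Predictive distribution of the observation, p(y_t | y_{1:t-1}).
        resid = y - C @ mean
        S = C @ cov @ C.T + R
        _, logdet = np.linalg.slogdet(S)
        total += -0.5 * (len(y) * np.log(2.0 * np.pi) + logdet
                         + resid @ np.linalg.solve(S, resid))
        # Update the filtering distribution with the new observation.
        gain = np.linalg.solve(S, C @ cov).T
        mean = mean + gain @ resid
        cov = cov - gain @ S @ gain.T
    return total

# Hypothetical scalar example with unit parameters and two observations.
one = np.eye(1)
log_lik = kalman_log_likelihood(np.array([[0.5], [1.0]]), one, one, one, one,
                                np.zeros(1), one)
```

A value computed this way serves as the reference against which sampling-based estimates of the log marginal likelihood can be compared.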

3.2 Chaotic recurrent neural network

To show the performance of our algorithm in filtering data generated from a complex, nonlinear and high-dimensional dynamical system, we generate data from a continuous-time "vanilla" recurrent neural network (vRNN)

(22)

where is Brownian motion. Using Euler integration, (22) can be described as a discrete time dynamical system

(23)

where is the Euler step. The emission model is

(24)

where each dimension of the emission noise is independently and identically distributed (i.i.d.) according to a Student’s t distribution parameterized by its degrees of freedom and its scale.

We set and the elements of are i.i.d. drawn from . We set which produces chaotic dynamics at a relatively slow timescale compared to  SUSSILLO2009544 . The rest of the parameters values are: , , , and , which are kept fixed. We generated a time series of length of . We ran SVMC using 200 particles with proposal distribution where and

are parameterized by a perceptron with 100 hidden units where 15 gradient steps were performed at each time step. For a comparison, we also ran a BPF with 200 and 2000 particles and an unscented Kalman filter (UKF).
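The data-generating procedure for this kind of experiment can be sketched with an Euler-Maruyama discretization. The code below assumes a drift of the form $(-x + \gamma W \tanh(x)) / \tau$, a linear emission map, and scaled Student's t emission noise; the drift form, the function name, the dimensions, and every parameter value are assumptions for illustration and are not the settings used for Figure 3.

```python
import numpy as np

def simulate_noisy_rnn(W, C, n_steps, step, tau, gain, state_noise_std, df, scale, rng):
    """Euler-Maruyama simulation of a tanh RNN with Student's t emissions.

    Assumed drift: dx/dt = (-x + gain * W @ tanh(x)) / tau.
    """
    d = W.shape[0]
    x = rng.standard_normal(d)
    xs = np.empty((n_steps, d))
    ys = np.empty((n_steps, C.shape[0]))
    for t in range(n_steps):
        drift = (-x + gain * W @ np.tanh(x)) / tau
        # Brownian increments scale with the square root of the step size.
        x = x + step * drift + state_noise_std * np.sqrt(step) * rng.standard_normal(d)
        xs[t] = x
        # Heavy-tailed emission noise: scaled Student's t, i.i.d. per dimension.
        ys[t] = C @ x + scale * rng.standard_t(df, size=C.shape[0])
    return xs, ys

# Hypothetical sizes and parameter values.
rng = np.random.default_rng(0)
latent_dim, obs_dim = 5, 10
W = rng.standard_normal((latent_dim, latent_dim)) / np.sqrt(latent_dim)
C = rng.standard_normal((obs_dim, latent_dim))
rnn_states, rnn_obs = simulate_noisy_rnn(
    W, C, n_steps=500, step=0.001, tau=0.025, gain=2.5,
    state_noise_std=0.1, df=2.0, scale=0.1, rng=rng,
)
```

With few degrees of freedom the emission noise produces occasional large outliers, which is the regime where filters built on Gaussian assumptions tend to struggle.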

Figure 3: Chaotic RNN. All particle filters run with 100 different random seeds. (A) Average RMSE of SVMC, BPF and UKF. (B) Average ELBO over time of SVMC and BPF.

SVMC can achieve better performance than BPF with an order of magnitude fewer samples. As is evident from Figure 3, SVMC can handle the Student’s t distributed emission noise, whereas UKF fails, as can be seen in the sudden spikes in the root mean squared error (RMSE). SVMC is also able to achieve a much higher ELBO than the standard BPF with almost an order of magnitude fewer samples.

3.3 Synthetic NASCAR® Dynamics

We test learning dynamics with the sparse GP on synthetic data whose underlying dynamics follow a recurrent switching linear dynamical system linderman2017bayesian . The simulated trajectory resembles the NASCAR® track (Fig. 4A). We train the model with observations simulated from where is a -by- matrix. The proposal is defined as of which and are linear maps of the concatenation of observation and previous state . We use 50 particles and a squared exponential (SE) kernel for the GP. To investigate the learning of dynamics, we control for other factors, i.e., we fix the observation model and other hyperparameters such as noise variances at their true values. (See the details in the supplementary material A.4.)

Figure 4A shows the true (blue) and inferred (red) latent states. The inference quickly catches up with the true state and stays on the track. As the state oscillates on the track, the sparse GP learns a reasonable limit cycle (Fig. 4F) without seeing much data outside the track. The velocity fields in Figure 4D–F show the gradual improvement in the online learning. The -step prediction also shows that the GP captures the dynamics (Fig. 4B). We compare SVMC with the Kalman filter in terms of mean squared error (MSE) (Fig. 4C). The transition matrix of the LDS of the Kalman filter (KF) is learned by expectation-maximization, which is an offline method and hence not truly online. Yet, SVMC performs better than KF after steps.

Figure 4: NASCAR® Dynamics. (A) True and inferred latent trajectory. (B) Filtering and prediction. We show the last steps of filtered states and the following steps of predicted states. (C) Predictive error. We compare the -step predictive MSE (average over realizations) of SVMC (sparse GP dynamics) with Kalman filter. The transition matrix of the Kalman filter was learned by EM. (D)–(F) Velocity field learned by different time steps.

3.4 Analog Nonlinear Oscillator

To test the proposed framework on real data, we physically built an analog circuit (supplementary material A.5) that implements a Hz nonlinear oscillation, and applied our method to the recording from the circuit. The recording contains three channels of voltage sampled at Hz. We assume the following SSM: where , , and . We train the model with 3000 steps of observations (Fig. 5A). The dynamics learned by the sparse GP (SE kernel) show a limit cycle that can implement the oscillation (Fig. 5C). We also predict 500 more observations that resemble an oscillating trajectory (Fig. 5B).

Figure 5: Analog oscillator. (A) Observations for training (3000 steps). (B) Predicted observations (500 steps). (C) Velocity field of inferred dynamics.

4 Discussion

In this study, we develop a novel online learning framework, leveraging variational inference and sequential Monte Carlo, which enables flexible and accurate Bayesian joint filtering. Our derivation shows that our filtering posterior can be arbitrarily close to the true one for a wide class of dynamics models and observation models. Specifically, the proposed framework can efficiently infer a posterior over the dynamics using sparse Gaussian processes by augmenting the state with the inducing variables that follow a diffusion process. By exploiting Bayes’ rule, our recursive proposal on the inducing variables does not require gradient-based optimization. Constant time complexity per sample makes our approach amenable to online learning scenarios and suitable for real-time applications. In contrast to previous works, we demonstrate that our approach is able to accurately filter the latent states for linear and nonlinear dynamics, recover complex posteriors, faithfully infer dynamics, and provide long-term predictions. In the future, we want to focus on reducing the computation time per sample, which could allow for real-time application on faster systems. On the GP side, we would like to investigate the priors and kernels that characterize the properties of dynamical systems, as well as the hyperparameters.

5 Acknowledgement

Yuan Zhao and Il Memming Park were supported by NSF IIS-1734910, NIH/NIBIB EB026946, and Thomas Hartman Center for Parkinson’s Research (#6424). Josue Nassar was supported by NSF LSAMP Bridge to the Doctorate (HRD-1612689). Mónica Bugallo was supported by NSF CCF-1617986.

References

  • (1) S. Haykin and J. Principe. Making sense of a complex world [chaotic events modeling]. IEEE Signal Processing Magazine, 15(3):66–81, May 1998.
  • (2) J. Ko and D. Fox. GP-BayesFilters: Bayesian filtering using gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, 5 2009.
  • (3) C. L. C. Mattos, Z. Dai, A. Damianou, et al. Recurrent gaussian processes. International Conference on Learning Representations (ICLR), 2016.
  • (4) S. Roweis and Z. Ghahramani. Learning nonlinear dynamical systems using the expectation-maximization algorithm, pages 175–220. John Wiley & Sons, Inc, 2001.
  • (5) D. Sussillo and O. Barak. Opening the black box: Low-dimensional dynamics in high-dimensional recurrent neural networks. Neural Computation, 25(3):626–649, March 2013.
  • (6) R. Frigola, Y. Chen, and C. E. Rasmussen. Variational gaussian process state-space models. In Proceedings of the 27th International Conference on Neural Information Processing Systems - Volume 2, pages 3680–3688, Montreal, Canada, 2014.
  • (7) B. C. Daniels and I. Nemenman. Automated adaptive inference of phenomenological dynamical models. Nature Communications, 6:8133+, August 2015.
  • (8) Y. Zhao and I. M. Park. Interpretable nonlinear dynamic modeling of neural trajectories. In Advances in Neural Information Processing Systems (NIPS), 2016.
  • (9) J. Nassar, S. Linderman, M. Bugallo, and I. M. Park. Tree-structured recurrent switching linear dynamical systems for multi-scale modeling. In International Conference on Learning Representations, 2019.
  • (10) Y. Ho and R. Lee. A Bayesian approach to problems in stochastic estimation and control. IEEE Transactions on Automatic Control, 9(4):333–339, October 1964.
  • (11) S. Särkkä. Bayesian filtering and smoothing. Cambridge University Press, 2013.
  • (12) S. S. Haykin. Kalman filtering and neural networks. Wiley, 2001.
  • (13) E. A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No. 00EX373), pages 153–158. IEEE, 2000.
  • (14) E. A. Wan and R. Van Der Merwe. The unscented Kalman filter for nonlinear estimation. In Proceedings of the IEEE 2000 Adaptive Systems for Signal Processing, Communications, and Control Symposium (Cat. No.00EX373), pages 153–158, Lake Louise, Alta., Canada, August 2000. IEEE.
  • (15) E. A. Wan and A. T. Nelson. Dual extended Kalman filter methods, pages 123–173. John Wiley & Sons, Inc, 2001.
  • (16) T. Broderick, N. Boyd, A. Wibisono, A. C. Wilson, and M. I. Jordan. Streaming variational Bayes. In C. J. C. Burges, L. Bottou, M. Welling, Z. Ghahramani, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 26, pages 1727–1735. Curran Associates, Inc., 2013.
  • (17) Y. Zhao and I. M. Park. Variational joint filtering. arXiv:1707.09049, 2017.
  • (18) A. Smith. Sequential Monte Carlo methods in practice. Springer Science & Business Media, 2013.
  • (19) A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. Handbook of nonlinear filtering, 12(656-704):3, 2009.
  • (20) S. S. Gu, Z. Ghahramani, and R. E. Turner. Neural adaptive sequential monte carlo. In Advances in Neural Information Processing Systems, pages 2629–2637, 2015.
  • (21) C. Naesseth, S. Linderman, R. Ranganath, and D. Blei. Variational sequential monte carlo. In A. Storkey and F. Perez-Cruz, editors, Proceedings of the Twenty-First International Conference on Artificial Intelligence and Statistics, volume 84 of Proceedings of Machine Learning Research, pages 968–977, Playa Blanca, Lanzarote, Canary Islands, 09–11 Apr 2018. PMLR.
  • (22) J. Cornebise, E. Moulines, and J. Olsson. Adaptive sequential monte carlo by means of mixture of experts. Statistics and Computing, 24(3):317–337, 2014.
  • (23) P. Guarniero, A. M. Johansen, and A. Lee. The iterated auxiliary particle filter. Journal of the American Statistical Association, 112(520):1636–1647, 2017.
  • (24) M. Titsias. Variational learning of inducing variables in sparse gaussian processes. In Artificial Intelligence and Statistics, pages 567–574, April 2009.
  • (25) A. W. Van der Vaart. Asymptotic statistics, volume 3. Cambridge university press, 2000.
  • (26) T. Adali and S. Haykin. Adaptive signal processing: next generation solutions, volume 55. John Wiley & Sons, 2010.
  • (27) S. Thrun. Particle filters in robotics. In Proceedings of the Eighteenth conference on Uncertainty in artificial intelligence, pages 511–518. Morgan Kaufmann Publishers Inc., 2002.
  • (28) A. Greenfield and A. Brockwell. Adaptive control of nonlinear stochastic systems by particle filtering. In 2003 4th International Conference on Control and Automation Proceedings, pages 887–890. IEEE, 2003.
  • (29) F. Gustafsson, F. Gunnarsson, N. Bergman, et al. Particle filters for positioning, navigation, and tracking. IEEE Transactions on signal processing, 50(2):425–437, 2002.
  • (30) P. Bickel, B. Li, T. Bengtsson, et al. Sharp failure rates for the bootstrap particle filter in high dimensions. In Pushing the limits of contemporary statistics: Contributions in honor of Jayanta K. Ghosh, pages 318–329. Institute of Mathematical Statistics, 2008.
  • (31) I. Žliobaitė, M. Pechenizkiy, and J. Gama. An overview of concept drift applications. In Big data analysis: new algorithms for a new society, pages 91–114. Springer, 2016.
  • (32) M. J. Wainwright, M. I. Jordan, et al. Graphical models, exponential families, and variational inference. Foundations and Trends® in Machine Learning, 1(1–2):1–305, 2008.
  • (33) T. A. Le, M. Igl, T. Rainforth, T. Jin, and F. Wood. Auto-encoding sequential monte carlo. In International Conference on Learning Representations, 2018.
  • (34) C. Cremer, Q. Morris, and D. Duvenaud. Reinterpreting Importance-Weighted Autoencoders. arXiv e-prints, Apr 2017.
  • (35) P. Del Moral. Non-linear filtering: interacting particle resolution. Markov processes and related fields, 2(4):555–581, 1996.
  • (36) D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv:1312.6114 [cs, stat], May 2014. arXiv: 1312.6114.
  • (37) T. Rainforth, A. R. Kosiorek, T. A. Le, et al. Tighter variational bounds are not necessarily better. arXiv preprint arXiv:1802.04537, 2018.
  • (38) R. H. Shumway and D. S. Stoffer. Time Series Analysis and Its Applications: With R Examples (Springer Texts in Statistics). Springer, 2010.
  • (39) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • (40) E. Snelson and Z. Ghahramani. Sparse Gaussian Processes using Pseudo-inputs. In Y. Weiss, B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1257–1264. MIT Press, 2006.
  • (41) N. J. Gordon, D. J. Salmond, and A. F. M. Smith. Novel approach to nonlinear/non-gaussian bayesian state estimation. In IEE proceedings F (radar and signal processing), volume 140, pages 107–113. IET, 1993.
  • (42) D. Sussillo and L. Abbott. Generating coherent patterns of activity from chaotic neural networks. Neuron, 63(4):544 – 557, 2009.
  • (43) S. Linderman, M. Johnson, A. Miller, et al. Bayesian learning and inference in recurrent switching linear dynamical systems. In Artificial Intelligence and Statistics, pages 914–922, 2017.
  • (44) N. Chopin et al. Central limit theorem for sequential monte carlo methods and its application to bayesian inference. The Annals of Statistics, 32(6):2385–2411, 2004.

Appendix A Supplementary Materials

A.1 Proof that is a consistent estimator for

Proof.

To prove that is a consistent estimator, we will rely on the delta method van2000asymptotic . From chopin2004central , we know that the central limit theorem (CLT) holds for and

(25)
(26)

where we assume that are finite. We can express as a function of and

(27)

Since and is a continuous function, an application of the Delta method gives

(28)

where , and where by the Cauchy-Schwarz inequality van2000asymptotic , is also finite. Thus, as , will converge in probability to , proving the consistency of the estimator. ∎

A.2 Proof of Theorem 2.1

Proof.

It is well known that the importance weights produced in a run of SMC are an unbiased estimator of  smith2013sequential

(29)

where . We can apply Jensen’s inequality to obtain

(30)

Expanding both sides of (30)

(31)

Subtracting from both sides gives

(32)

Letting , where is the number of samples, we get

(33)

where by Jensen’s inequality (30), for all values of . By the continuous mapping theorem van2000asymptotic ,

(34)

As a consequence, . By the same logic, and leveraging that is a consistent estimator for , we get that

(35)

Thus will get arbitrarily close to as . ∎

A.3 Proof of Corollary 2.1.1

Proof.

The implicit smoothing distribution that arises from performing SMC Naesseth2018 is defined as

(36)

We can obtain the implicit filtering distribution by marginalizing out from (36)

(37)
(38)
(39)

In Naesseth2018 ; anh2018autoencoding , it was shown that

(40)

Rearranging terms in (40), we get

(41)

By Theorem 2.1, we know that as , thus as

(42)

Leveraging Theorem 1 from Naesseth2018

(43)

which implies that

(44)

thus plugging this into (42)

(45)
(46)
(47)

which is true iff a.e.. Thus by Lebesgue’s dominated convergence theorem van2000asymptotic