We deal with generic state-space models (SSMs) which may be nonlinear and non-Gaussian. Inference for this important and popular family of statistical models presents tremendous challenges that have prohibited its widespread applicability. The key difficulty is that inference on the latent process of the model depends crucially on unknown static parameters that also need to be estimated. MCMC samplers are unsatisfactory because they fail to produce high-dimensional, efficiently mixing Markov chains and because they are inappropriate for on-line inference; sequential Monte Carlo (SMC) methods (Kantas et al., 2015), by contrast, provide the tools to construct viable implementation strategies. In particular, particle MCMC (Andrieu et al., 2010) utilises SMC to build generic, efficient MCMC algorithms that provide inferences for both static parameters and latent paths. We provide a scalable alternative to these methods via an approximation that combines SMC and variational inference.
We introduce a new variational distribution that, unlike a recent strand of literature (Maddison et al., 2017; Naesseth et al., 2018; Le et al., 2018), performs variational inference also on the static parameters of the SSM. This is essential for several reasons. First, when there is dependency between static and dynamic parameters, posterior inference may be inaccurate if the joint posterior density is approximated by conditioning on fixed values of the static parameters. Second, inferring the static parameters is often the primary problem of interest: for example, for biochemical networks and models involving Lotka-Volterra equations, we are not interested in the population of the species per se, but rather in some chemical rate constants (such as reaction rates or predation/growth rates), which are parameters of the transition density; in neuroscience, Bayesian decoding of neural spike trains is often carried out via a state-space representation of point processes in which inference for static parameters is of great importance. Finally, for complex dynamic systems it is often advisable to improve model compression or interpretability by encouraging sparsity, and such operations may require inference for the posterior densities of the static parameters.
Sampling from the new variational distribution involves running an SMC algorithm which yields an unbiased estimate of the likelihood for a fixed static parameter value. Importantly, we show that the SMC algorithm constructs a computational graph that allows for optimisation of the variational bound using stochastic gradient descent. We provide empirical evidence that variational inference on static parameters can give better predictive performance, either out-of-sample in the linear Gaussian state space model or in-sample for predictive distributions in a multivariate stochastic volatility model. We also illustrate our method by modelling fairly general intensity functions in a multivariate Hawkes process model.
Let us begin by introducing the standard inference problem in a generic SSM, followed by a review of the SMC approach to sample from a sequence of distributions arising in such probabilistic structures. SSMs are characterized by a latent Markov state process $(X_t)_{t \geq 1}$ on $\mathcal{X}$ and an observable process $(Y_t)_{t \geq 1}$ on $\mathcal{Y}$. We follow the standard convention of using capital letters for random variables and the corresponding lower case letters to denote their values. The dynamics of the latent states are determined, conditional on a static parameter vector $\theta \in \Theta$, by a transition probability density $f_{\theta}(x_t \mid x_{t-1})$ along with an initial density $\mu_{\theta}(x_1)$. The observations are assumed to be conditionally iid given the states, with density given by $g_{\theta}(y_t \mid x_t)$ for any $t \in \{1, \dots, T\}$, with the generic notation $x_{1:t} = (x_1, \dots, x_t)$.
We consider a Bayesian framework and assume $\theta$ has a prior density $p(\theta)$. Consequently, for observed data $y_{1:T}$, we perform inference using the posterior density
$$p(\theta, x_{1:T} \mid y_{1:T}) \propto p(\theta)\, p_{\theta}(x_{1:T}, y_{1:T}),$$
where the joint density of the latent states and observations given a fixed static parameter value writes as
$$p_{\theta}(x_{1:T}, y_{1:T}) = \mu_{\theta}(x_1) \prod_{t=2}^{T} f_{\theta}(x_t \mid x_{t-1}) \prod_{t=1}^{T} g_{\theta}(y_t \mid x_t).$$
The posterior density is in general intractable, as is the marginal likelihood $p_{\theta}(y_{1:T}) = \int p_{\theta}(x_{1:T}, y_{1:T})\, \mathrm{d}x_{1:T}$. However, an SMC algorithm can be used to approximate both. A brief review of how this sampling algorithm proceeds is as follows, and further details can be found in Doucet et al. (2000); Doucet and Johansen (2009).
SMC methods approximate $p_{\theta}(x_{1:t} \mid y_{1:t})$ using a set of $K$ weighted random samples $x_{1:t}^{k}$, $k = 1, \dots, K$, also called particles, having positive weights $w_t^{k}$, so that $p_{\theta}(x_{1:t} \mid y_{1:t}) \approx \sum_{k=1}^{K} w_t^{k}\, \delta_{x_{1:t}^{k}}(x_{1:t})$. Here, $\delta_x$ denotes the Dirac delta function. To do so, one starts at $t = 1$ by sampling $x_1^{k}$ from an importance density $q_{\phi, \theta}(x_1)$, parametrized with $\phi$, where $q$ can depend on the static parameters $\theta$. For any $t > 1$, we first resample an ancestor variable $a_{t-1}^{k}$ that represents the 'parent' of particle $k$ according to $a_{t-1}^{k} \sim \mathcal{C}(w_{t-1}^{1:K})$, where $\mathcal{C}(w_{t-1}^{1:K})$ is a categorical distribution on $\{1, \dots, K\}$ with probabilities $w_{t-1}^{1:K}$. We then set $\tilde{x}_{1:t-1}^{k} = x_{1:t-1}^{a_{t-1}^{k}}$ and proceed by extending the path of each particle by sampling $x_t^{k}$ from a transition kernel $q_{\phi, \theta}(x_t \mid \tilde{x}_{t-1}^{k})$. This yields an updated latent path $x_{1:t}^{k} = (\tilde{x}_{1:t-1}^{k}, x_t^{k})$ for which we compute the incremental importance weight
$$\tilde{w}_t^{k} = \frac{f_{\theta}(x_t^{k} \mid \tilde{x}_{t-1}^{k})\, g_{\theta}(y_t \mid x_t^{k})}{q_{\phi, \theta}(x_t^{k} \mid \tilde{x}_{t-1}^{k})}.$$
We set $w_t^{k} = \tilde{w}_t^{k} / \sum_{j=1}^{K} \tilde{w}_t^{j}$ as well as $\tilde{w}_1^{k} = \mu_{\theta}(x_1^{k})\, g_{\theta}(y_1 \mid x_1^{k}) / q_{\phi, \theta}(x_1^{k})$ and define
$$\hat{Z}_{\phi, \theta} = \prod_{t=1}^{T} \frac{1}{K} \sum_{k=1}^{K} \tilde{w}_t^{k},$$
which is an unbiased and strongly consistent estimator of $p_{\theta}(y_{1:T})$, see Del Moral (1996). A pseudo-code (Algorithm 1) for this standard SMC sampler can be found in Appendix A. It is possible to perform the resampling step only if some condition on the weights $w_t^{1:K}$ is satisfied, see Algorithm 1. For simplicity, we assume that the particles are resampled at every step. The density of all variables generated by this SMC sampler for a fixed static parameter value is given by
$$Q_{\phi, \theta}\big(x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big) = w_T^{k^{\star}} \prod_{k=1}^{K} q_{\phi, \theta}(x_1^{k}) \prod_{t=2}^{T} \prod_{k=1}^{K} w_{t-1}^{a_{t-1}^{k}}\, q_{\phi, \theta}\big(x_t^{k} \mid \tilde{x}_{t-1}^{k}\big),$$
where $k^{\star}$ is a final particle index drawn from a categorical distribution with weights $w_T^{1:K}$. Since $\hat{Z}_{\phi, \theta}$ is unbiased, we have $\mathbb{E}_{Q_{\phi, \theta}}\big[\hat{Z}_{\phi, \theta}\big] = p_{\theta}(y_{1:T})$.
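The mechanics of the weighted-resample-propagate recursion and the likelihood estimator can be made concrete with a minimal bootstrap filter. The toy model ($x_t = x_{t-1} + \mathcal{N}(0,1)$, $y_t = x_t + \mathcal{N}(0,1)$), the helper name and all settings below are illustrative assumptions, not the sampler used in the paper:

```python
import numpy as np

def bootstrap_particle_filter(y, n_particles, rng):
    """Minimal bootstrap SMC sketch: propose from the transition, weight by
    the observation density, resample every step, and accumulate the
    likelihood estimate on the log scale."""
    x = rng.standard_normal(n_particles)                # initial density N(0, 1)
    w = np.full(n_particles, 1.0 / n_particles)
    log_z = 0.0
    for y_t in y:
        a = rng.choice(n_particles, size=n_particles, p=w)       # ancestor indices
        x = x[a] + rng.standard_normal(n_particles)              # propagate paths
        log_w = -0.5 * (y_t - x) ** 2 - 0.5 * np.log(2 * np.pi)  # obs log-density
        m = log_w.max()
        log_z += m + np.log(np.mean(np.exp(log_w - m)))          # log of mean weight
        w = np.exp(log_w - m)
        w /= w.sum()                                             # normalised weights
    return log_z

rng = np.random.default_rng(0)
log_z_hat = bootstrap_particle_filter(np.zeros(10), 500, rng)
```

Averaging `log_z_hat` over repeated runs illustrates the Jensen gap discussed in the next section: the mean of the log-estimate sits below the log of the mean estimate.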
3 Variational bounds for state space models using SMC samplers
Variational inference (Jordan et al., 1999; Wainwright and Jordan, 2008; Blei et al., 2017) allows Bayesian inference to scale to large data sets (Hoffman et al., 2013) and is applicable to a wide range of models (Ranganath et al., 2014; Kucukelbir et al., 2017). It generally postulates a family of approximating distributions with variational parameters that minimize some divergence, most commonly the KL divergence, between the approximating distribution and the posterior. The quality of the approximation hinges on the expressiveness of the variational family.
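The divergence being minimised is available in closed form for simple families; for instance, between two univariate Gaussians. This generic identity is included purely for illustration of the mechanics, not as part of the method:

```python
import numpy as np

def kl_gaussians(mu_q, s_q, mu_p, s_p):
    """Closed-form KL(N(mu_q, s_q^2) || N(mu_p, s_p^2)): the quantity a
    Gaussian mean-field family drives towards zero when it contains the
    target distribution."""
    return np.log(s_p / s_q) + (s_q ** 2 + (mu_q - mu_p) ** 2) / (2 * s_p ** 2) - 0.5
```

The divergence vanishes exactly when the two distributions coincide and is strictly positive otherwise, which is what makes it a usable optimisation target.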
Let $q_{\psi}(\theta)$ be a distribution on $\Theta$ with variational parameters $\psi$. We aim to approximate the posterior density $p(\theta, x_{1:T} \mid y_{1:T})$ with a variational distribution that results as an appropriate marginal of auxiliary variables arising from an SMC sampler of the form
$$q_{\psi, \phi}\big(\theta, x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big) = q_{\psi}(\theta)\, Q_{\phi, \theta}\big(x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big),$$
defined precisely below. Note that sampling from the extended variational distribution (5) just means sampling $\theta \sim q_{\psi}$ and then running a particle filter using the sampled value as the static parameter.
We introduce the proposed variational bound first as a lower bound on . We then show that optimizing the proposed bound means minimizing the KL-divergence between the extended variational distribution and an extended target density that resembles closely the density targeted in particle MCMC methods.
We can write $p(y_{1:T}) = \int_{\Theta} p(\theta)\, p_{\theta}(y_{1:T})\, \mathrm{d}\theta$. Hence, using the fact that the likelihood estimator is unbiased and due to Jensen's inequality,
$$\log p(y_{1:T}) = \log \mathbb{E}_{q_{\psi, \phi}}\left[\frac{p(\theta)\, \hat{Z}_{\phi, \theta}}{q_{\psi}(\theta)}\right] \geq \mathbb{E}_{q_{\psi, \phi}}\left[\log \frac{p(\theta)\, \hat{Z}_{\phi, \theta}}{q_{\psi}(\theta)}\right] =: \mathcal{L}(\psi, \phi).$$
In particular, $\mathcal{L}(\psi, \phi)$ is a lower bound on $\log p(y_{1:T})$.
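The gap in this Jensen step is what separates the bound from the log-evidence, and it grows with the variance of the likelihood estimator. A quick Monte Carlo illustration using log-normal stand-ins for the positive estimator, chosen so that both sides are known in closed form ($\log \mathbb{E}[Z] = 0.5$, $\mathbb{E}[\log Z] = 0$):

```python
import numpy as np

# Jensen's inequality for a positive "likelihood estimator": the expectation
# of the log is below the log of the expectation whenever the estimator has
# positive variance.
rng = np.random.default_rng(42)
z = rng.lognormal(mean=0.0, sigma=1.0, size=100_000)
log_of_mean = np.log(z.mean())   # Monte Carlo estimate of log E[Z] = 0.5
mean_of_log = np.log(z).mean()   # Monte Carlo estimate of E[log Z] = 0.0
```

Reducing the variance of $\hat{Z}_{\phi, \theta}$ (e.g. by increasing $K$ or improving the proposal) shrinks this gap and tightens the bound.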
Remark 1 (Inference for multiple independent time series).
Instead of considering one latent process $X$ and observable process $Y$, we can also consider $M$ independent latent processes $X^{(1)}, \dots, X^{(M)}$ with corresponding observable processes $Y^{(1)}, \dots, Y^{(M)}$ described by the same static parameter $\theta$. We obtain a lower bound on $\log p\big(y_{1:T}^{(1)}, \dots, y_{1:T}^{(M)}\big)$ given by
$$\mathcal{L}_M(\psi, \phi) = \mathbb{E}\left[\log \frac{p(\theta)}{q_{\psi}(\theta)} + \sum_{m=1}^{M} \log \hat{Z}_{\phi, \theta}^{(m)}\right],$$
where $\hat{Z}_{\phi, \theta}^{(m)}$ is the estimator of $p_{\theta}\big(y_{1:T}^{(m)}\big)$. Note that we can obtain an unbiased estimate of this bound by sampling an element $m$ uniformly from $\{1, \dots, M\}$ and using $M \log \hat{Z}_{\phi, \theta}^{(m)}$ as an estimate of $\sum_{m=1}^{M} \log \hat{Z}_{\phi, \theta}^{(m)}$, thereby allowing our method to scale to a large number of independent time series. For ease of exposition, we formulate our results for a single time series only.
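The subsampling step in the remark is the usual minibatch argument: scaling a uniformly drawn per-series term by $M$ is unbiased for the sum over all series. A small numerical sanity check, with made-up per-series bound values used purely for illustration:

```python
import numpy as np

# Made-up per-series log-likelihood bound values (one entry per time series).
per_series = np.array([-3.1, -2.7, -4.0, -1.9, -2.2])
M = len(per_series)

full_sum = per_series.sum()                               # sum over all M series
# Averaging the single-series estimator M * per_series[m] over a uniform
# index m recovers the full sum exactly, confirming unbiasedness.
subsample_mean = np.mean([M * per_series[m] for m in range(M)])
```

In practice one draws a fresh index (or a small batch of indices) at every gradient step, so each update touches only a few of the $M$ particle filters.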
Next, we show that the variational bound can be represented as the difference between the log-evidence and the KL divergence between the variational distribution and an extended target density. More concretely, following Andrieu et al. (2010), we consider a target density on the extended space of all variables generated by the sampler,
$$\tilde{\pi}\big(\theta, x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big) = \frac{p\big(\theta, x_{1:T}^{b_{1:T}} \mid y_{1:T}\big)}{K^T}\, \frac{Q_{\phi, \theta}\big(x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big)}{w_T^{b_T}\, q_{\phi, \theta}\big(x_1^{b_1}\big) \prod_{t=2}^{T} w_{t-1}^{b_{t-1}}\, q_{\phi, \theta}\big(x_t^{b_t} \mid x_{t-1}^{b_{t-1}}\big)}.$$
Here, we have defined $b_T = k^{\star}$ and $b_t = a_t^{b_{t+1}}$ for $t = T-1, \dots, 1$, i.e. $b_t$ is the index that the ancestor of particle $k^{\star}$ at generation $t$ had. It follows, using the definition of $\hat{Z}_{\phi, \theta}$, that the ratio between the extended target density and the variational distribution is given by
$$\frac{\tilde{\pi}\big(\theta, x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big)}{q_{\psi, \phi}\big(\theta, x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big)} = \frac{p(\theta)\, \hat{Z}_{\phi, \theta}}{q_{\psi}(\theta)\, p(y_{1:T})}.$$
Proposition 2 (KL divergence in extended space).
It holds that
$$\mathrm{KL}\big(q_{\psi, \phi}\, \big\|\, \tilde{\pi}\big) = \log p(y_{1:T}) - \mathcal{L}(\psi, \phi).$$
The proof can be found in Appendix B. Recall that we have introduced $\mathcal{L}(\psi, \phi)$ so that its maximisation pushes the variational approximation of the static parameter closer to its true posterior as measured by the KL divergence. The above proposition shows that this objective also minimizes the KL divergence between densities on an extended space that includes multiple latent paths. To elucidate further the relation between the variational distribution of a single latent path and its posterior, we need to introduce a further distribution. Consider the density under $q_{\psi, \phi}$ of the variables generated by an SMC algorithm conditional on a fixed latent path $x_{1:T}^{b_{1:T}}$ with indices $b_{1:T}$. This is known as a conditional SMC algorithm (Andrieu et al., 2010), with distribution given by
$$Q_{\phi, \theta}^{\mathrm{CSMC}}\big(\cdot \mid x_{1:T}^{b_{1:T}}, b_{1:T}\big) = \frac{Q_{\phi, \theta}\big(x_1^{1:K}, a_{1:T-1}^{1:K}, x_{2:T}^{1:K}, k^{\star}\big)}{w_T^{b_T}\, q_{\phi, \theta}\big(x_1^{b_1}\big) \prod_{t=2}^{T} w_{t-1}^{b_{t-1}}\, q_{\phi, \theta}\big(x_t^{b_t} \mid x_{t-1}^{b_{t-1}}\big)},$$
where the conditional distribution is over the variables $x^{-b_{1:T}}$ and $a^{-b_{1:T-1}}$, i.e. the indices of all particles that are not part of the retained path. We obtain the following corollary, proved in Appendix C.
Corollary 3 (Marginal KL divergence and marginal ELBO).
The KL divergence in the extended space is an upper bound on the KL divergence between the marginal variational approximation $q_{\psi, \phi}(\theta, x_{1:T})$ and the posterior, with the gap between bounds being the expected KL divergence between the conditional SMC distribution and the conditional of the variational distribution given the retained path:
$$\mathrm{KL}\big(q_{\psi, \phi}\, \big\|\, \tilde{\pi}\big) \geq \mathrm{KL}\big(q_{\psi, \phi}(\theta, x_{1:T})\, \big\|\, p(\theta, x_{1:T} \mid y_{1:T})\big).$$
Particularly, $\mathcal{L}(\psi, \phi)$ is a lower bound compared to the standard ELBO using the marginal $q_{\psi, \phi}(\theta, x_{1:T})$ as the variational distribution:
$$\mathcal{L}(\psi, \phi) \leq \mathbb{E}_{q_{\psi, \phi}(\theta, x_{1:T})}\left[\log \frac{p(\theta)\, p_{\theta}(x_{1:T}, y_{1:T})}{q_{\psi, \phi}(\theta, x_{1:T})}\right] \leq \log p(y_{1:T}).$$
The proposed surrogate objective resembles variational bounds with auxiliary variables (Salimans et al., 2015; Maaløe et al., 2016; Ranganath et al., 2016) where the gap between the two bounds is expressed by the KL-divergence between the variational approximation of the auxiliary variable given the latent variable of interest and a so-called reverse model. Here, this reverse model is specified by the conditional SMC algorithm. The above corollary implies that the variational bound is looser than the standard ELBO with the auxiliary variables integrated out. This marginal variational distribution cannot in general be evaluated analytically. However, we can obtain unbiased estimates of it by computing the log-likelihood estimate under a conditional SMC algorithm, resembling a particle Gibbs update. This constitutes an extension of Proposition 1 in Naesseth et al. (2018). We present a proof in Appendix D.
Proposition 4 (Marginal variational distribution).
It holds that
$$q_{\psi, \phi}(\theta, x_{1:T}) = q_{\psi}(\theta)\, p_{\theta}(x_{1:T} \mid y_{1:T})\, \mathbb{E}_{Q_{\phi, \theta}^{\mathrm{CSMC}}}\left[\frac{p_{\theta}(y_{1:T})}{\hat{Z}_{\phi, \theta}}\right],$$
and there exists $c_{\phi, \theta} > 0$ so that
$$\left|\, \mathbb{E}_{Q_{\phi, \theta}^{\mathrm{CSMC}}}\left[\frac{p_{\theta}(y_{1:T})}{\hat{Z}_{\phi, \theta}}\right] - 1 \,\right| \leq \frac{c_{\phi, \theta}}{K}.$$
The last inequality in Proposition 4 is a straightforward extension of an analogous result in the EM setting (Naesseth et al., 2018). It implies that, for fixed variational parameters $\psi$ and $\phi$, the approximation becomes more accurate for increasing $K$. Sampling from this distribution can be seen as an extension of visualizing the expected importance weighted approximation in Importance Weighted Auto-Encoders (Cremer et al., 2017). Since this distribution can be high-dimensional, the preceding proposition gives an alternative to kernel-density estimation.
Lastly, from a different angle, the variational objective can be seen as a sequential variational auto-encoding (VAE) bound. Indeed, as a consequence of Proposition 2 and equation (6), we obtain immediately the following result. We elaborate on it further in the next section.
Corollary 5 (Sequential VAE representation).
The variational bound can be written as
$$\mathcal{L}(\psi, \phi) = \mathbb{E}_{q_{\psi, \phi}}\left[\log \frac{p(\theta)}{q_{\psi}(\theta)} + \sum_{t=1}^{T} \log \left(\frac{1}{K} \sum_{k=1}^{K} \tilde{w}_t^{k}\right)\right].$$
4 Related Work
The representation in Corollary 5 allows us to contrast the variational bound to previously considered sequential VAE frameworks (Chung et al., 2015; Archer et al., 2015; Fraccaro et al., 2016; Krishnan et al., 2017; Goyal et al., 2017). The introduced bound contains the cross-entropy between the proposal distribution and the likelihood common to sequential VAE bounds. However, this reconstruction error is only evaluated for surviving particles. Similarly, while a sequential VAE framework includes a KL-divergence between the proposal distribution and the prior transition probability, the log-ratio of these two densities is only evaluated for a surviving path. Most work using sequential VAEs has considered observation and state transition models parametrised by neural networks and, given the high dimensionality of the static parameters, has confined its analysis to variational EM inferences. This is also the case for the approaches in Maddison et al. (2017); Naesseth et al. (2018); Le et al. (2018), to which this work is most closely related. They have demonstrated that resampling increases the variational bound compared to a sequential IWAE (Burda et al., 2015) approach. Rainforth et al. (2018) demonstrated that increasing the number of particles leads to a worse signal-to-noise ratio of the gradient estimate of the proposal parameters in an IWAE setting. Le et al. (2018) suggested using fewer particles without resampling for calculating the proposal gradient. A possible approach left for future work would be to consider a different resampling threshold for the proposal gradients. Finally, the objective in this work differs from adaptive SMC approaches optimizing the reverse KL-divergence (or $\chi^2$-divergence) between the posterior and the proposal, cf. Cornebise et al. (2008); Gu et al. (2015).
5 Optimization of the variational bound
The gradient of the variational bound is given by
$$\nabla_{\psi, \phi}\, \mathcal{L}(\psi, \phi) = \nabla_{\psi, \phi}\, \mathbb{E}_{q_{\psi, \phi}}\big[\log \hat{Z}_{\phi, \theta}\big] + \nabla_{\psi}\, \mathbb{E}_{q_{\psi}(\theta)}\left[\log \frac{p(\theta)}{q_{\psi}(\theta)}\right].$$
We focus on the gradient of the first expectation and note that the gradient of the second expectation can be estimated by standard (black-box) approaches in variational inference, depending of course on the chosen variational approximation. If, for instance, the variational distribution over the static parameters is continuously reparametrisable, one can use standard low-variance reparametrised gradients (Kingma and Welling, 2014; Rezende et al., 2014; Titsias and Lázaro-Gredilla, 2014). This is the gradient estimator that we use in our experiments, in combination with mean-field variational families. We assume that the proposals are reparametrisable, i.e. there exists a differentiable deterministic function $h_{\phi, \theta}$ such that $x_t^{k} = h_{\phi, \theta}(\tilde{x}_{t-1}^{k}, \epsilon_t^{k})$ for $\epsilon_t^{k} \sim p(\epsilon)$, with $p(\epsilon)$ continuous and independent of $\phi$ and $\theta$. Similarly, we assume that the variational distribution of the static parameters is reparametrisable, i.e. there exists a differentiable deterministic function $g_{\psi}$ such that $\theta = g_{\psi}(\eta)$ for $\eta \sim p(\eta)$, with $p(\eta)$ continuous and independent of $\psi$. Using the product rule, observe that the first gradient splits into a pathwise term, obtained by differentiating $\log \hat{Z}_{\phi, \theta}$ through the reparametrised particles, and a score-function term arising from the resampled ancestor indices.
Analogously to Maddison et al. (2017); Le et al. (2018); Naesseth et al. (2018) in a variational EM framework, we have also ignored the second summand in the gradient due to its high variance in our experiments. We take Monte Carlo samples of the expectation above and optimize the bound using Adam (Kingma and Ba, 2014). It is also possible to use natural gradients (Amari, 1998), see Appendix E.
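The reparametrised (pathwise) part of the gradient can be illustrated on a toy objective whose gradient is known in closed form; the objective and all values below are assumptions for illustration, unrelated to any specific model in the paper:

```python
import numpy as np

# Pathwise gradient: write theta = mu + sigma * eps with eps ~ N(0, 1) and
# differentiate inside the expectation. For E[theta^2] the exact gradient
# with respect to mu is 2 * mu, so the estimator is the sample mean of
# 2 * theta.
rng = np.random.default_rng(1)
mu, sigma = 1.5, 0.5
eps = rng.standard_normal(200_000)
theta = mu + sigma * eps           # reparametrised samples of theta
grad_mu = np.mean(2.0 * theta)     # estimates d/d_mu E[theta^2] = 2 * mu = 3
```

The same mechanism, applied through the whole particle filter, is what lets automatic differentiation propagate gradients of $\log \hat{Z}_{\phi, \theta}$ to $\phi$ and $\psi$.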
6 Experiments

6.1 Linear Gaussian state space models
Regularisation in a high-dimensional model.
We illustrate potential benefits of a fully Bayesian approach in a standard linear Gaussian state space model
$$X_t = A X_{t-1} + V_t, \quad V_t \sim \mathcal{N}(0, Q), \qquad Y_t = C X_t + W_t, \quad W_t \sim \mathcal{N}(0, R),$$
with initial state distribution $X_1 \sim \mathcal{N}(0, \Sigma_1)$ and parameters $A$, $C$, $Q$ and $R$. Naesseth et al. (2018) have shown in a linear Gaussian model that learning the proposal yields a higher variational lower bound compared to proposing from the prior, and the variational bound is close to the true log-marginal likelihood for both sparse and dense emission matrices $C$. However, an EM approach might easily over-fit, unless one employs some regularisation, such as stopping early if the variational bound decreases on some test set. We demonstrate this effect by re-examining one of the experiments in Naesseth et al. (2018), assuming that $Q$, $R$ and $\Sigma_1$ are all identity matrices, that $A$ is a fixed transition matrix and that $C$ has randomly generated elements. We assume that the proposal density is
$$q_{\phi}(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\, \mu_t + \mathrm{diag}(\beta_t)\, A x_{t-1},\, \mathrm{diag}(\sigma_t^2)\big)$$
and $q_{\phi}(x_1) = \mathcal{N}\big(x_1;\, \mu_1, \mathrm{diag}(\sigma_1^2)\big)$, with diagonal covariance matrices. We perform both a variational EM approach and a fully Bayesian approach over the static parameters using $K$ particles. In the latter case, we place Normal priors on the elements of $A$ and $C$. Furthermore, we suppose that a priori $Q$ is diagonal with variances drawn independently from an Inverse Gamma distribution with equal shape and scale parameters. A mean-field approximation for the static parameters is assumed. We suppose that the variational distribution over each element of $A$ and $C$ is a normal distribution and the approximation over the diagonal elements of $Q$ is log-normal. For identifiability reasons, we assume that $R$ and $\Sigma_1$ are known. We compare the EM and VB approaches in terms of log-likelihoods on out-of-sample data, assuming training and testing on iid sequences. Figure 1 shows that, in contrast to the VB approach, the EM approach attains a higher log-likelihood on the training data with a lower log-likelihood on the test set as the training progresses.
Log-likelihood for linear Gaussian state space models. Log-likelihood values are computed using Kalman filtering. The static parameters used in the VB case are the mean of the variational distribution (VB mean) or the samples from the variational distribution (VB samples) as they are drawn during training.
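The exact log-likelihoods used for this evaluation come from Kalman filtering; a minimal scalar version of that computation, for the generic model $x_t = a x_{t-1} + \mathcal{N}(0, q)$, $y_t = c x_t + \mathcal{N}(0, r)$, is sketched below (an illustrative assumption of scalar dynamics, not the multivariate code behind the figure):

```python
import numpy as np

def kalman_loglik(y, a, c, q, r, m0, p0):
    """Exact log-likelihood of a scalar linear Gaussian SSM via the Kalman
    prediction-error decomposition."""
    m, p, ll = m0, p0, 0.0
    for t, y_t in enumerate(y):
        if t > 0:
            m, p = a * m, a * a * p + q        # predict
        s = c * c * p + r                      # innovation variance
        v = y_t - c * m                        # innovation (prediction error)
        ll += -0.5 * (np.log(2 * np.pi * s) + v * v / s)
        k = p * c / s                          # Kalman gain
        m, p = m + k * v, (1.0 - k * c) * p    # update
    return ll

ll = kalman_loglik(np.zeros(5), 0.9, 1.0, 1.0, 1.0, 0.0, 1.0)
```

Evaluating this at the variational mean versus at samples from the variational distribution yields the two VB curves referred to in the caption.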
Approximation bias in a low-dimensional model.
Variational approximations for the latent path can yield biased estimates of the static parameters, see Turner and Sahani (2011). We illustrate that this bias decreases for increasing $K$ in a two-dimensional linear Gaussian model, both in an EM and a VB setting. We therefore consider inference in a linear Gaussian state space model (8-9) with two-dimensional latent states and one-dimensional observations. The state transition matrix $A$ is assumed to be determined by an autoregressive parameter $\lambda$. We consider inference over $\lambda$ as the static parameter and fix the remaining parameters, with $Q$ and $R$ being identity matrices. We simulate multiple realisations using a common ground-truth value of $\lambda$. Inference is performed with different initialisations and learning rates over the simulated datasets. It has been documented in such a linear Gaussian model, see Turner and Sahani (2011), that Gaussian variational approximations of the latent path that factorise over the state components underestimate $\lambda$. We observe the same effect in Figure 1(a) when using just $K = 1$ particle. However, increasing the number of particles used during inference reduces this bias. Furthermore, we find that point estimates of the static parameters show some variation over different simulations, while an approximate Bayesian approach can be argued to better account for this uncertainty. The variational distributions for $\lambda$ for each of the simulations are shown in Figure 1(b), confirming that they all put significant mass on the ground truth. Let us remark that these experiments also complement those in Le et al. (2018), where it is illustrated that increasing $K$ improves learning point estimates of the static parameters in a Gaussian model with a one-dimensional latent state. Indeed, as shown next, the marginal variational distribution allows not just for dependencies in the latent states across time, but also across different state dimensions, even if they are independent under the proposal.
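The underestimation effect documented by Turner and Sahani (2011) can be seen already in a static toy problem: fitting a factorised Gaussian to a correlated Gaussian by minimising the reverse KL matches the diagonal of the *precision* matrix, which understates the true marginal variances. A minimal numerical sketch (the correlation value is an illustrative choice):

```python
import numpy as np

# Mean-field (reverse-KL) Gaussian fit to a correlated 2-D Gaussian: the
# optimal factorised approximation has variance 1 / Lambda_ii, where Lambda
# is the precision matrix, so it is tighter than the true marginals.
rho = 0.9
cov = np.array([[1.0, rho], [rho, 1.0]])
prec = np.linalg.inv(cov)
q_var = 1.0 / np.diag(prec)    # mean-field variances: 1 - rho^2 = 0.19
true_var = np.diag(cov)        # true marginal variances: 1.0
```

Underestimating the posterior variance of the latent path in turn biases the static parameter towards values that explain more of the data through observation noise, which is the mechanism behind the underestimated $\lambda$.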
Marginal variational distribution in a low-dimensional model.
In an additional experiment, we evaluate whether the variational approximation of the latent path from Proposition 4 matches its true posterior distribution. We consider the above state space model over $T$ time steps as in Turner and Sahani (2011). Note that for given static parameters, the posterior is Gaussian, with moments that can be computed in closed form by Kalman smoothing, assuming $x_1$ is drawn from its stationary distribution. We visualise the posterior distribution along with the marginal variational distribution
in Figure 3, using $K$ particles and Monte Carlo samples for the expectation. We find that the approximation mirrors the true posterior. In particular, it accounts for explaining-away between different dimensions of the latent state, although we have used isotropic proposals.
6.2 Stochastic volatility models
To show that our method allows inference of latent states and static parameters of higher dimensions, we consider a multivariate stochastic volatility model,
$$X_t = \mu + \Phi\, (X_{t-1} - \mu) + V_t, \qquad Y_t = \exp(X_t / 2) \odot W_t,$$
where $V_t \sim \mathcal{N}(0, Q)$ with covariance matrix $Q$, $W_t \sim \mathcal{N}(0, I)$, and $\odot$ denotes elementwise multiplication. This model has been considered in Guarniero et al. (2017) using particle MCMC methods, under the restriction that $Q$ is band-diagonal to reduce the number of parameters. It is also more general than that entertained in Naesseth et al. (2018), where $Q$ is assumed diagonal; see also Chib et al. (2009) for a review of stochastic volatility models. We consider a fully Bayesian treatment as in Guarniero et al. (2017), applied to the same data set of monthly returns (9/2008 to 2/2016) of exchange rates with respect to the US dollar as reported by the Federal Reserve System. The specification of the prior and variational forms of the static parameters is explained in Appendix F. We consider Gaussian proposals whose covariance matrix is diagonal, using $K$ particles. Densities of the variational approximation that correspond to the GBP exchange rate can be found in Appendix F, Figure 4; they are largely similar to those obtained in Guarniero et al. (2017). Furthermore, we approximate the one- and two-step predictive distributions $p(y_{T+h} \mid y_{1:T})$
for $h \in \{1, 2\}$, where the filtering distribution at time $T$ is approximated by the particle filter and future observations are simulated from the generative model. The predictive distributions are evaluated using a log scoring rule (Gneiting and Raftery, 2007; Geweke and Amisano, 2010) to arrive at the predictive log-likelihoods in Table 1. The full variational approach attains higher predictive log-likelihoods.
|    | one-step | two-step |
|----|----------|----------|
| EM | 9.697 (0.008) | 9.716 (0.008) |
| VB | 9.707 (0.011) | 9.728 (0.015) |
| EM | 9.690 (0.003) | 9.713 (0.003) |
| VB | 9.701 (0.004) | 9.727 (0.005) |
Table 1: Predictive log-likelihoods, evaluated with particle filters with the same optimal static values. Mean estimates with standard deviation in parentheses based on 100 replicates.
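A simulation sketch of a stochastic volatility model of this broad type can clarify the generative mechanism; the exact parametrisation, function name and parameter values below are assumptions for illustration, not the specification used in the experiments:

```python
import numpy as np

def simulate_sv(T, mu, phi, l_state, rng):
    """Simulate a multivariate stochastic volatility model: log-volatilities
    follow a VAR(1) with elementwise autoregression phi and state-noise
    Cholesky factor l_state; returns are Gaussian with standard deviations
    exp(x_t / 2)."""
    d = len(mu)
    x = mu.copy()
    xs, ys = np.empty((T, d)), np.empty((T, d))
    for t in range(T):
        x = mu + phi * (x - mu) + l_state @ rng.standard_normal(d)  # log-vols
        y = np.exp(x / 2.0) * rng.standard_normal(d)                # returns
        xs[t], ys[t] = x, y
    return xs, ys

rng = np.random.default_rng(3)
xs, ys = simulate_sv(50, np.zeros(4), np.full(4, 0.95), 0.2 * np.eye(4), rng)
```

Simulating forward from the filtered particles in this way is also how the one- and two-step predictive samples above are generated.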
6.3 Non-linear stochastic Hawkes processes
There has been an increasing interest in modelling asynchronous sequential data using point processes in various domains, including social networks (Linderman and Adams, 2014; Wang et al., 2017), finance (Bacry et al., 2015), and electronic health records (Lian et al., 2015). Recent work (Du et al., 2016; Mei and Eisner, 2017; Xiao et al., 2017b, a) has advocated the use of neural networks in a black-box treatment of point process dynamics.
We illustrate that our approach allows scalable probabilistic inference for continuous-time event data $(t_i, c_i)$, $i \geq 1$, where $t_i$ is the time when the $i$-th event occurs and $c_i$ is an additional discrete mark associated with the event. We consider describing such a realisation as a multivariate point process with intensities $\lambda_t^{j}$, driven by continuous-time processes $Z_t^{j}$ and a non-negative monotone function $\varphi$, so that $\lambda_t^{j} = \varphi(Z_t^{j})$. Importantly, we allow the dynamics of $Z_t$ to depend on the marks, with an excitation weight describing by how much an event excites, if positive, or inhibits, if negative, subsequent events of type $j$. It is possible to view the dynamics as a discrete-time SSM; the essential idea being that $Z_t$ is piecewise-deterministic between events, see Appendix G for details along with related work on Hawkes point processes (Hawkes, 1971a). Let us define the discrete-time latent process $X_i = Z_{t_i}$ at the event times. Standard theory about point processes, see Daley and Vere-Jones (2003), implies that the observation density is determined by the intensities, where our model specification yields $Z_t$ as a deterministic function between