1 Introduction
Many real-world systems exhibit complex nonlinear dynamics, and a wide variety of approaches have been proposed to model them. Characterising these dynamics is essential both to analysing a system's behaviour and to prediction. Autoregressive models map past observations directly to future ones and are common in fields as diverse as economics and model predictive control. State-space models (SSMs) instead lift the dynamics from the observations to a set of auxiliary, unobserved variables (the latent states) which fully describe the state of the system at each timestep; the system is thus taken to evolve as a Markov chain. The "transition function" maps each latent state to the next. We place a Gaussian process (GP) prior on the transition function, obtaining the Gaussian process state-space model (GPSSM). This Bayesian nonparametric approach allows us to:
1) obtain uncertainty estimates and predictions from the posterior over the transition function,
2) handle increasing amounts of data without the model saturating, and
3) maintain high uncertainty estimates in regions with little data.

While many approximate inference schemes have been proposed (Frigola et al., 2013), we focus on variational "inducing point" approximations (Titsias, 2009), as they offer a particularly elegant framework for approximating GP models without losing key properties of the nonparametric model. Since a nonparametric Gaussian process is used as the approximate posterior (Matthews et al., 2016), the properties of the original model are maintained. Increasing the number of inducing points adds capacity to the approximation, and the quality of the approximation is measured by the marginal likelihood lower bound (the evidence lower bound, or ELBO).

The accuracy of variational methods is fundamentally limited by the chosen class of approximate posteriors, with independence assumptions being particularly harmful for time-series models (Turner et al., 2010). Several variational inference schemes that factorise the states and transition function in the approximate posterior have been proposed (Frigola et al., 2014; McHutchon et al., 2014; Ialongo et al., 2017; Eleftheriadis et al., 2017). Here, we investigate the design choices available in specifying a non-factorised posterior.
2 Gaussian Process State Space Models
Conceptually, a GPSSM is identical to other SSMs. We model discrete-time sequences of observations $y_{1:T}$, where $y_t \in \mathbb{R}^D$, by a corresponding latent Markov chain of states $x_{0:T}$, where $x_t \in \mathbb{R}^E$. All state-to-state transitions are governed by the same transition function $f$. For simplicity, we take the transition and observation densities to be Gaussian, although any closed-form density could be chosen. Without loss of generality (subject to a suitable augmentation of the state-space; Frigola, 2015), we also assume a linear mapping between $x_t$ and the mean of $y_t$ to alleviate non-identifiabilities between transitions and emissions. The generative model is specified by the following equations:
$$f \sim \mathcal{GP}\big(m(\cdot), k(\cdot,\cdot)\big), \qquad x_0 \sim p(x_0),$$
$$x_t \mid f_t \sim \mathcal{N}(x_t;\, f_t, Q), \quad f_t \triangleq f(x_{t-1}), \qquad y_t \mid x_t \sim \mathcal{N}(y_t;\, C x_t, R), \qquad (1)$$
and the function values $f_{1:T}$ are given by the GP as:
$$p(f_{1:T} \mid x_{0:T-1}) = \mathcal{N}\big(f_{1:T};\, m(x_{0:T-1}),\, k(x_{0:T-1}, x_{0:T-1})\big). \qquad (2)$$
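To make the generative process concrete, the following sketch draws one trajectory from a one-dimensional GPSSM by lazily sampling the transition function: at each step the GP is conditioned on the function values already drawn along the trajectory. All names, and the choice of an RBF kernel with unit hyperparameters, are illustrative assumptions rather than part of the model specification above.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """RBF (squared-exponential) kernel matrix between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sample_gpssm(T, Q=0.01, R=0.1, C=1.0, x0=0.0, seed=None):
    """Draw one trajectory (x_{0:T}, y_{1:T}) from the GPSSM generative model.

    The transition function f is sampled lazily: at each step the GP is
    conditioned on the function values already drawn along the trajectory.
    """
    rng = np.random.default_rng(seed)
    xs, ys = [x0], []
    f_inputs, f_values = [], []
    for _ in range(T):
        x_prev = np.array([xs[-1]])
        if f_inputs:
            X = np.array(f_inputs)
            F = np.array(f_values)
            K = rbf(X, X) + 1e-9 * np.eye(len(X))   # Gram matrix of visited states
            k = rbf(X, x_prev)
            kinv_k = np.linalg.solve(K, k)
            mean = (kinv_k.T @ F).item()             # GP conditional mean at x_{t-1}
            var = (rbf(x_prev, x_prev) - k.T @ kinv_k).item()
            f_t = rng.normal(mean, np.sqrt(max(var, 0.0)))
        else:
            f_t = rng.normal(0.0, 1.0)               # zero-mean GP prior marginal
        f_inputs.append(xs[-1])
        f_values.append(f_t)
        x_t = rng.normal(f_t, np.sqrt(Q))            # x_t | f_t ~ N(f_t, Q)
        ys.append(rng.normal(C * x_t, np.sqrt(R)))   # y_t | x_t ~ N(C x_t, R)
        xs.append(x_t)
    return np.array(xs), np.array(ys)
```

Note that the conditioning set grows with $t$; this is exactly the cost issue discussed in Section 3.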
3 Design choices in variational inference
We want a variational approximation to the intractable posterior $p(f, x_{0:T} \mid y_{1:T})$. We begin by writing our approximate posterior as $q(f, x_{0:T}) = q(f)\, q(x_{0:T} \mid f)$. For $q(f)$ we choose an inducing-point posterior following Titsias (2009) and Hensman et al. (2013), writing $q(f) = p(f_{\neq u} \mid u)\, q(u)$ by splitting the function into the inducing outputs $u$ and all other points $f_{\neq u}$. For $q(x_{0:T} \mid f)$, we consider Markovian distributions, following the structure of the exact posterior:^1

^1 More precisely, the exact conditional posterior is $p(x_{0:T} \mid f, y_{1:T}) = p(x_0 \mid f, y_{1:T}) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, f, y_{t:T})$.
$$q(x_{0:T} \mid f) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}, f). \qquad (3)$$
This allows us to write down the general form of the variational lower bound:
$$\log p(y_{1:T}) \;\geq\; \mathbb{E}_{q(f, x_{0:T})}\left[\log \frac{p(y_{1:T}, x_{0:T}, f)}{q(f, x_{0:T})}\right] \qquad (4)$$

$$= \mathbb{E}_{q}\left[\log \frac{p(x_0)}{q(x_0)}\right] + \sum_{t=1}^{T} \mathbb{E}_{q}\big[\log p(y_t \mid x_t)\big] + \sum_{t=1}^{T} \mathbb{E}_{q}\left[\log \frac{p(x_t \mid f_t)}{q(x_t \mid x_{t-1}, f)}\right] - \mathrm{KL}\big[q(u)\,\|\,p(u)\big], \qquad (5)$$

where the term $p(f_{\neq u} \mid u)$ cancels between the numerator and the denominator.
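The bound in equation 5 is a sum of per-timestep terms plus a closed-form KL, so assembling its Monte Carlo estimate from sampled trajectories is mechanical. The sketch below assumes the per-trajectory log-densities have already been evaluated; the field names are illustrative, not from the paper.

```python
import numpy as np

def elbo_mc(samples, kl_u):
    """Monte Carlo estimate of the bound in equation 5.

    `samples` is a list of dicts, one per sampled trajectory, holding
    log-densities evaluated along that trajectory (illustrative field names):
      lp_x0, lq_x0       : log p(x_0) and log q(x_0)
      lp_y               : array of log p(y_t | x_t), t = 1..T
      lp_trans, lq_trans : arrays of log p(x_t | f_t) and log q(x_t | x_{t-1}, f)
    `kl_u` is KL[q(u) || p(u)], available in closed form for Gaussians.
    """
    per_sample = [
        s["lp_x0"] - s["lq_x0"]
        + np.sum(s["lp_y"])
        + np.sum(s["lp_trans"] - s["lq_trans"])
        for s in samples
    ]
    # Average the sampled terms, then subtract the analytic KL once.
    return np.mean(per_sample) - kl_u
```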
We are now left to specify the form of $q(u)$, $q(x_0)$, and $q(x_t \mid x_{t-1}, f)$. For the first two, we choose Gaussian distributions and variationally optimise their means and covariances; for the last, several choices are available. We follow the form of the exact filtering distribution (for Gaussian emissions) but treat $\{A_t, b_t, S_t\}$ as free variational parameters to be optimised (thus approximating the smoothing distribution):

$$q(x_t \mid x_{t-1}, f) = \mathcal{N}\big(x_t;\, A_t f(x_{t-1}) + b_t,\, S_t\big). \qquad (6)$$
The four resulting choices of posterior, each following the filtering form of equation 6 but differing in what the conditional may depend on, are:

1) Factorised – linear: $\mathcal{N}(x_t;\, A_t x_{t-1} + b_t,\, S_t)$; sampling $\mathcal{O}(T)$
2) Factorised – non-linear: $\mathcal{N}(x_t;\, A_t \mu(x_{t-1}) + b_t,\, A_t \sigma^2(x_{t-1}) A_t^\top + S_t)$; sampling $\mathcal{O}(TM^2)$
3) U-Factorised – non-linear: $\mathcal{N}(x_t;\, A_t f_u(x_{t-1}) + b_t,\, S_t)$, conditioning on $u$ only; sampling $\mathcal{O}(TM^2)$
4) Non-Factorised – non-linear: $\mathcal{N}(x_t;\, A_t f(x_{t-1}) + b_t,\, S_t)$, conditioning on the full GP; sampling $\mathcal{O}(T^3)$

where $M$ is the number of inducing points, $\mu(\cdot)$ is the sparse GP's marginal posterior mean, and $\sigma^2(\cdot)$ is the sparse GP's marginal posterior variance.
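The factorised choices rely on the sparse GP's marginal posterior mean and variance at the previous state. A minimal sketch of the standard inducing-point predictive under a Gaussian $q(u) = \mathcal{N}(m_u, S_u)$ follows; the function names are illustrative assumptions.

```python
import numpy as np

def rbf(a, b, variance=1.0, lengthscale=1.0):
    """RBF kernel matrix between 1-D input vectors a and b."""
    d = a[:, None] - b[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def sparse_gp_predict(xq, Z, m_u, S_u, kern=rbf, jitter=1e-9):
    """Marginal posterior mean mu(xq) and variance sigma^2(xq) of the sparse GP
    at query points xq, given inducing inputs Z and q(u) = N(m_u, S_u)."""
    M = len(Z)
    Kzz = kern(Z, Z) + jitter * np.eye(M)
    Kzx = kern(Z, xq)
    A = np.linalg.solve(Kzz, Kzx)        # Kzz^{-1} Kzx: O(M^3) once, O(M^2) per query
    mean = A.T @ m_u                     # mu(xq)
    var = np.diag(kern(xq, xq) - Kzx.T @ A + A.T @ S_u @ A)  # sigma^2(xq)
    return mean, var
```

A useful sanity check: when $q(u)$ is the prior, $S_u = K_{zz}$, the predictive marginals revert to the prior marginals.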
Each posterior is identified by whether it factorises the distribution between the states and the transition function, and by whether it is linear in the latent states (only the first one is, corresponding to a joint Gaussian over the states). The cubic sampling cost associated with the last choice of posterior ("Non-Factorised – non-linear", i.e. the full GP) derives from having to condition, at every timestep $t$, on the preceding subsequence of length $t-1$, giving $T$ operations of complexity up to $\mathcal{O}(T^2)$ each (updating and solving a triangular linear system represented by a Cholesky decomposition). The cost of sampling is crucial, because evaluating and optimising our variational bound requires samples of $x_{0:T}$ and $f$ to compute the expectations by Monte Carlo integration. We now review two options to sidestep the cubic cost associated with the fully nonparametric GP.
3.1 Dependence on the entire process – "Chunking"
If we wish to retain the full GP as well as a non-factorised posterior, but the data does not come in short independent sequences, one approach is to "cut" the posterior into $K$ subsequences of lengths $\{L_k\}_{k=1}^{K}$:
$$q(x_{0:T} \mid f) = \prod_{k=1}^{K} \Big[\, q(x_{s_k}) \prod_{t=s_k+1}^{s_k+L_k} q(x_t \mid x_{t-1}, f) \Big], \qquad (7)$$

where $s_k$ denotes the first timestep of the $k$-th subsequence.
For $L_k = L$, this reduces the cost of sampling to $\mathcal{O}(TL^2)$. Conditioning (which has cubic cost in the size of the conditioning set) now only needs to extend as far back as the beginning of the current chunk, where the marginal is explicitly represented and can be sampled directly. Moreover, we can now "minibatch" over different chunks, evaluating our bound in a "doubly stochastic" manner (Salimbeni and Deisenroth, 2017).
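The chunking bookkeeping itself is simple; a sketch assuming equal chunk length $L$ follows (the helper name is illustrative):

```python
def make_chunks(T, L):
    """(start, end) index pairs of contiguous chunks of length at most L
    covering timesteps 0..T. Each chunk start carries its own explicitly
    parameterised marginal, so GP conditioning never reaches back past a
    chunk boundary, and chunks can be minibatched independently."""
    return [(s, min(s + L, T)) for s in range(0, T, L)]
```

Sampling within a chunk then conditions on at most $L$ previous states, giving the $\mathcal{O}(TL^2)$ total cost stated above.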
3.2 Dependence on the inducing points only – "U-Factorisation"
In order to avoid "cutting" long-ranging temporal dependences in our data by "chunking", we could instead use the inducing points to represent the dependence between $x_{0:T}$ and $f$. We can take $q(x_{0:T} \mid f) = q(x_{0:T} \mid u)$ and, constraining $q(x_{0:T} \mid u)$ to be Markovian (which is not exact in general, but is required for efficiency), we write:
$$q(x_{0:T} \mid u) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}, u). \qquad (8)$$
The intuition behind this choice of posterior (the third in the table) is to represent the GP by samples from the inducing-point variational distribution $q(u)$ (each sample effectively being a different transition function), and to generate trajectories for each of these samples. Because these functions are represented parametrically (with finite "resolution" corresponding to the number of inducing points), and our posterior is Markovian, the sampling complexity does not grow as we traverse the latent chain. If we were to "integrate out" $u$ (as in the second posterior in the table), however, the dependence between $x_{0:T}$ and $f$ would be severed (recall that $u$ is "part" of $f$).
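A sketch of this sampling scheme follows: one draw of $u$ fixes a transition function for the whole rollout, and each step conditions on $u$ alone, keeping the per-step cost $\mathcal{O}(M^2)$. For brevity this illustrative sketch propagates only the conditional mean of $f$ given $u$; the conditional variance could be added analogously. All names and hyperparameters are assumptions.

```python
import numpy as np

def rbf(a, b):
    """Unit-hyperparameter RBF kernel matrix between 1-D inputs."""
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def sample_u_factorised(T, Z, m_u, L_u, A, b, S, x0=0.0, seed=None):
    """Roll out one trajectory under the U-factorised posterior of equation 8.

    A single draw u ~ q(u) = N(m_u, L_u L_u^T) fixes a transition function for
    the whole rollout; every step then conditions on u alone.
    """
    rng = np.random.default_rng(seed)
    M = len(Z)
    Kzz = rbf(Z, Z) + 1e-9 * np.eye(M)
    u = m_u + L_u @ rng.standard_normal(M)   # one sample of the inducing outputs
    alpha = np.linalg.solve(Kzz, u)          # precompute Kzz^{-1} u once
    xs = [x0]
    for _ in range(T):
        xq = np.array([xs[-1]])
        f_u = (rbf(Z, xq).T @ alpha).item()  # mean of f at x_{t-1} given u: O(M^2)
        xs.append(rng.normal(A * f_u + b, np.sqrt(S)))  # q(x_t | x_{t-1}, u)
    return np.array(xs)
```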
4 Experiments
In order to test the effect of factorisation on the approximate GP posterior (i.e. the learned dynamics), we perform inference on data generated by the "kink" transition function (see Figure 1's "true function").^2 The models whose fit is shown are "Factorised – non-linear" and "U-Factorised – non-linear", both using an RBF kernel. Factorisation leads to an overconfident, miscalibrated posterior, and this holds for both factorised models (they gave virtually the same fit). Using 100 inducing points, the U-Factorised and Non-Factorised posteriors were likewise indistinguishable, the transition function being precisely "pinned down" by the inducing points.

^2 A sequence of 50 steps was generated with Gaussian emission and process noise. The emission model was fixed to the generative one to allow comparisons.

References

Andreas Doerr, Christian Daniel, Martin Schiegg, Duy Nguyen-Tuong, Stefan Schaal, Marc Toussaint, and Sebastian Trimpe. Probabilistic recurrent state-space models. In Proceedings of the 35th International Conference on Machine Learning, 2018.

Stefanos Eleftheriadis, Tom Nicholson, Marc Deisenroth, and James Hensman. Identification of Gaussian process state space models. In Advances in Neural Information Processing Systems 30, 2017. URL http://papers.nips.cc/paper/7115-identification-of-gaussian-process-state-space-models.pdf.

Roger Frigola. Bayesian Time Series Learning with Gaussian Processes. PhD thesis, University of Cambridge, 2015.

Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems 26, 2013. URL http://media.nips.cc/nipsbooks/nipspapers/paper_files/nips26/1449.pdf.

Roger Frigola, Yutian Chen, and Carl Edward Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems 27, 2014. URL http://papers.nips.cc/paper/5375-variational-gaussian-process-state-space-models.pdf.

James Hensman, Nicolò Fusi, and Neil D. Lawrence. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.

Alessandro Davide Ialongo, Mark van der Wilk, and Carl Edward Rasmussen. Closed-form inference and prediction in Gaussian process state-space models. NIPS 2017 Time Series Workshop, 2017.

Alexander Matthews, James Hensman, Richard Turner, and Zoubin Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics, 2016.

Andrew McHutchon. Nonlinear Modelling and Control using Gaussian Processes. PhD thesis, University of Cambridge, Department of Engineering, 2014.

Hugh Salimbeni and Marc Deisenroth. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pages 4588–4599, 2017.

Michalis Titsias. Variational learning of inducing variables in sparse Gaussian processes. In Artificial Intelligence and Statistics, 2009.

Ryan Turner, Marc Deisenroth, and Carl Rasmussen. State-space inference and learning with Gaussian processes. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, 2010.
Appendix A Comparison to PRSSM
Doerr et al. (2018) were, to the best of our knowledge, the first to consider a non-factorised variational posterior for the GPSSM; they call their approach PRSSM. Their work, however, has two significant shortcomings. Firstly, the $q(x_t \mid x_{t-1}, f)$ terms are set equal to the corresponding prior terms $p(x_t \mid f_t)$ (i.e. the prior transitions). This posterior fails to exploit information contained in the observations other than by adapting $q(u)$ (it performs no filtering or smoothing on the latent states), and can only be an adequate approximation when the process noise is low and the observed sequence is short (even low noise levels can compound, and potentially be amplified, over a long sequence). It would also be an appropriate choice (if somewhat difficult to optimise) when the process noise is zero; but then the latent variables become deterministic given the transition function, and it is unclear whether modelling them explicitly through a probabilistic state-space model is beneficial (an autoregressive model with no latent variables might suffice).
Secondly, Doerr et al. (2018) employ a sampling scheme which gives incorrect marginal samples of $x_{0:T}$, even under their own assumption that:
$$q(f, x_{0:T}) = q(u)\, p(f_{\neq u} \mid u)\, q(x_0) \prod_{t=1}^{T} p(x_t \mid f_t). \qquad (9)$$
The mistake is predicated on believing that the factorisation in equation 9 produces a Markovian marginal $q(x_{0:T})$. In general, this is not the case: marginalising $u$ out couples all the states, so $x_t$ depends on the entire history $x_{0:t-1}$ rather than on $x_{t-1}$ alone. A mismatch is thus introduced between the form of the variational lower bound and the samples used to evaluate it, resulting in a spurious objective.
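How ignoring the correlations induced by a shared $u$ corrupts marginals can be demonstrated with a toy one-parameter "transition function" $f_u(x) = u\,x$ with $u \sim \mathcal{N}(0,1)$, $x_0 = 1$, and no process noise: drawing $u$ once per trajectory gives $\mathbb{E}[x_2] = \mathbb{E}[u^2] = 1$, whereas (incorrectly) redrawing $u$ at every step gives $\mathbb{E}[x_2] = \mathbb{E}[u]^2 = 0$. A minimal sketch (function name and setup are illustrative):

```python
import numpy as np

def two_step_marginal(resample_u, n=200_000, seed=0):
    """Monte Carlo mean of x_2 under x_t = u * x_{t-1}, x_0 = 1, u ~ N(0, 1).

    With resample_u=False the parameter u is drawn once per trajectory
    (correct); with resample_u=True it is redrawn at every timestep,
    which silently changes the marginal distribution of x_2.
    """
    rng = np.random.default_rng(seed)
    u1 = rng.standard_normal(n)
    u2 = rng.standard_normal(n) if resample_u else u1
    x1 = u1 * 1.0        # first transition from x_0 = 1
    x2 = u2 * x1         # second transition
    return x2.mean()

# Analytically: E[x_2] = E[u^2] = 1 when u is shared, E[u]^2 = 0 when redrawn.
```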