1 Introduction
The setting we consider is that of stochastic, nonlinear dynamical systems under noisy measurements. Given discrete-time data, recovering the system dynamics (i.e. a "transition function" from the current state to the next) can be likened to regression with noisy targets and inputs. The state-space model formalism addresses this by distinguishing between observed variables and latent ones (the denoised "states"). The system's evolution is assumed to be independent of the observations, given the states. Moreover, it is common to make a Markovian assumption and let the state at time $t$ depend only on the state at time $t-1$ and not on any preceding ones: the system is fully specified by its current state.
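As a minimal sketch of this generative structure (the transition function, noise scales and emission model below are invented purely for illustration), a Markovian state-space model with linear-Gaussian emissions can be simulated as:

```python
import numpy as np

rng = np.random.default_rng(0)

def transition(x):
    # Invented nonlinear transition function, for illustration only
    return 0.8 * x + 0.5 * np.sin(2.0 * x)

def simulate(T, q_std=0.05, r_std=0.2, x0=0.0):
    """Roll out latent states x and noisy observations y.
    Markov assumption: x[t+1] depends only on x[t], not on earlier states."""
    xs = np.empty(T + 1)
    ys = np.empty(T)
    xs[0] = x0
    for t in range(T):
        xs[t + 1] = transition(xs[t]) + q_std * rng.standard_normal()
        ys[t] = xs[t + 1] + r_std * rng.standard_normal()  # linear-Gaussian emission
    return xs, ys

xs, ys = simulate(50)
```

Inference then amounts to recovering both `transition` and the latent `xs` from `ys` alone.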
In this work we focus on the Gaussian Process State Space Model (GPSSM), where the transition function is given a Gaussian process prior. We aim to perform approximate Bayesian inference both over the transition function and the unobserved states using a variational method. We follow on from work by
Frigola et al. [2] and McHutchon et al. [5] and make three main contributions. Firstly, we investigate the behaviour of a fully analytic variational inference scheme, and show how it performs on some initial tasks. Secondly, we introduce a new method for multi-step-ahead prediction, based on augmenting the variational distribution. Finally, we show how this method gives a way of assessing the quality of the approximate predictions and posterior at test time, which was lacking in previous approaches.
2 Doubly Variational Inference
Here we briefly review a simplified derivation of variational inference in the GPSSM introduced by Turner et al. [7]. The main simplification over earlier presentations [2, 5] is the use of a density $p(f)$ over entire functions (through a slight abuse of notation). While such a density does not strictly exist, we use it as an intermediate step before we integrate out all but a finite number of function evaluations, at which point it becomes a familiar Gaussian distribution again. We begin with the standard form of the evidence lower bound (ELBO), with an approximate posterior $q(x_{0:T})\,q(f)$ over states and transition function:

$$\log p(y_{1:T}) \geq \mathbb{E}_{q(x_{0:T})\,q(f)}\left[\log \frac{p(y_{1:T}, x_{0:T}, f)}{q(x_{0:T})\,q(f)}\right] = \mathcal{L}. \quad (1)$$

We constrain the approximate posterior to be independent between the states and the transition function, and we further use the sparse GP posterior introduced by Titsias [6] (see Matthews et al. [4] for a more recent discussion). We write $q(f)$ as the GP prior conditioned on a small set of inducing variables $u$, together with a free density $q(u)$: $q(f) = p(f \mid u)\,q(u)$. This gives a bound in which the difficult GP conditional terms cancel out:
$$\mathcal{L} = \mathbb{E}_{q}\Big[\log p(x_0) + \sum_{t=1}^{T} \log p(y_t \mid x_t) + \sum_{t=1}^{T} \log p(x_t \mid f(x_{t-1}))\Big] + \mathcal{H}[q(x_{0:T})] - \mathrm{KL}\big(q(u)\,\|\,p(u)\big) \quad (2)$$
This general bound still allows a choice for the form of $q(x_{0:T})$ and $q(u)$. Frigola et al. [2] sample from the intractable optimal variational distributions. We follow McHutchon et al. [5] in choosing a Gaussian $q(x_{0:T})$, which allows all expectations to be calculated in closed form. Finally, the optimal $q(u)$ is found by calculus of variations and is also a Gaussian with closed-form moments. We optimise the bound with respect to: the GP hyperparameters, the inducing inputs, the process noise standard deviations, the parameters of the linear-Gaussian emissions, and the moments of $q(x_{0:T})$.
3 Predictions
In addition to inference, making predictions in these models is itself challenging and also requires approximations or sampling. We address the general case of multi-step-ahead forecasting of a given sequence. The quality of existing methods for making approximate predictions can only be assessed using a held-out test set. The variational framework, however, can be used to quantify the KL divergence between the exact predictive distribution and the approximation. This provides a method to assess the reliability of the approximate predictive distribution, and of the approximate posterior over states, without the need for a test set.
Exact predictions
After training on a sequence of data $y_{1:T}$, we have a factorised approximate posterior $q(x_{0:T})\,q(f)$ over latent states and transition function. We can sample from the predictive distribution as follows:

1. Sample $x_T$ from the final state's posterior $q(x_T)$ and $f$ from the transition function's posterior $q(f)$ (in practice, to sample $f$ we incrementally sample a function value, with the previous one serving as input).

2. Incrementally sample future states $x_{t+1} \sim p(x_{t+1} \mid f(x_t))$ for $t = T, \dots, T+K-1$.

3. Sample predictive observations $y_t \sim p(y_t \mid x_t)$.
This would be repeated in order to sample another trajectory. There are two main issues with the predictive distribution which make it difficult to deal with: a) there are correlations between $f$ and the states across all times, and b) the distributions become non-Gaussian after being passed through a GP several times. We will discuss two Gaussian approximate predictive distributions.
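The sampling steps above can be sketched as follows. For brevity we draw $f$ from a GP prior rather than a trained posterior, sampling function values incrementally and conditioning each draw on the values sampled so far; the RBF kernel and noise scale are placeholders:

```python
import numpy as np

rng = np.random.default_rng(1)

def rbf(a, b, var=1.0, ls=1.0):
    # Squared-exponential kernel on scalar inputs (placeholder choice)
    a = np.asarray(a, float).reshape(-1, 1)
    b = np.asarray(b, float).reshape(-1, 1)
    return var * np.exp(-0.5 * (a - b.T) ** 2 / ls ** 2)

def sample_trajectory(x_last, K, q_std=0.05, jitter=1e-9):
    """Ancestral sampling of K future states: each new f(x) is drawn
    from the GP conditioned on the function values sampled so far."""
    X, F = [], []        # inputs and sampled function values seen so far
    xs = [x_last]
    for _ in range(K):
        x = xs[-1]
        if X:
            Kxx = rbf(X, X) + jitter * np.eye(len(X))
            kx = rbf(X, [x])
            A = np.linalg.solve(Kxx, kx)
            mu = float(A.T @ np.array(F))
            var = float(rbf([x], [x]) - kx.T @ A)
        else:
            mu, var = 0.0, float(rbf([x], [x]))
        f = mu + np.sqrt(max(var, 0.0)) * rng.standard_normal()
        X.append(x)
        F.append(f)
        xs.append(f + q_std * rng.standard_normal())  # add process noise
    return np.array(xs)

traj = sample_trajectory(0.3, K=10)
```

Repeating the call yields independent trajectories; their spread approximates the (non-Gaussian) predictive distribution.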
Moment matching
Given a Gaussian distribution over an input $x$ of a GP $f$, it is possible to analytically calculate the mean and variance of the GP output $f(x)$. This was used by Girard et al. [3] to propose a moment matching approximation, where the moments of the next state are iteratively matched to those of the previous state passed through the GP. It was also used by Deisenroth and Rasmussen [1] to make multi-step predictions for model-based RL. The quality of this approximation degrades the further it has to predict, as any error introduced in the first moment matching step is passed on to the input of the next.
Variational predictions
Alternatively, we note that we can view the predictive distribution over states simply as a continuation of the posterior to states that do not have observations associated with them. Given this insight, we can augment the approximate posterior to include these extra states in the variational lower bound. This gives extra transition terms, without any terms from emission likelihoods. We can write the new augmented ELBO as
$$\mathcal{L}_{\text{aug}} = \mathcal{L} + \mathbb{E}_{q}\Big[\sum_{t=T+1}^{T+K} \log p(x_t \mid f(x_{t-1}))\Big] + \mathcal{H}[q(x_{T+1:T+K})] \quad (3)$$
and maximise it w.r.t. the added state distributions to make predictions, while keeping all the parameters obtained during training fixed. This gives an alternative Gaussian predictive distribution to moment matching.
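As a toy sketch of this optimisation (all quantities invented: we fix a known linear transition $f(x) = ax$ and assume a single independent Gaussian factor for the added state, whereas the paper's transitions are GP-distributed), the added-state term of the augmented bound can be maximised in closed form, and a grid search confirms the maximiser:

```python
import numpy as np

# Invented toy setup: linear transition f(x) = a x with process noise
# variance sq; q(x_T) = N(mt, vt) is fixed from training, and we add one
# Gaussian factor q(x_{T+1}) = N(m, s) for the predicted state.
a, sq = 0.9, 0.2 ** 2
mt, vt = 1.0, 0.3

def added_term(m, s):
    """E_q[log p(x_{T+1} | f(x_T))] + H[q(x_{T+1})]: the extra ELBO terms."""
    e_logp = -0.5 * np.log(2 * np.pi * sq) \
             - ((m - a * mt) ** 2 + s + a ** 2 * vt) / (2 * sq)
    entropy = 0.5 * np.log(2 * np.pi * np.e * s)
    return e_logp + entropy

# Closed-form maximiser: m* = a * mt and s* = sq. Note that this
# factorised q drops the a^2 * vt input-uncertainty contribution
# from the predictive variance.
ms = np.linspace(0.5, 1.5, 201)
ss = np.linspace(0.01, 0.5, 200)
vals = added_term(ms[:, None], ss[None, :])
i, j = np.unravel_index(np.argmax(vals), vals.shape)
```

In this toy case the grid maximum coincides with the analytic optimum, which is the kind of closed-form update the Gaussian choice of $q$ makes possible.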
4 Assessing approximate predictions at test time
We can expect both Gaussian approximations to do well in situations where the transitions are close to linear, or perhaps even when the state distribution remains unimodal. When state distributions become multimodal, which occurs when predicting near unstable equilibria of the dynamical system, we cannot expect any Gaussian approximation to be close to the model's true predictive distribution. Here, we show that the augmented lower bound can be used to monitor the accuracy of our approximate predictive distribution. We believe this to be important, since good predictions require both a correct model and predictions that are consistent with it.
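A small numeric illustration (mixture parameters invented) of why Gaussian approximations fail in the multimodal regime: for a bimodal predictive density, even the best moment-matched Gaussian remains far away in KL divergence:

```python
import numpy as np

def gauss(x, m, v):
    return np.exp(-0.5 * (x - m) ** 2 / v) / np.sqrt(2 * np.pi * v)

x = np.linspace(-8.0, 8.0, 4001)
dx = x[1] - x[0]

# Invented bimodal "true" predictive, e.g. near an unstable equilibrium
p = 0.5 * gauss(x, -2.0, 0.3) + 0.5 * gauss(x, 2.0, 0.3)

# Best Gaussian approximation by moment matching
m = np.sum(x * p) * dx
v = np.sum((x - m) ** 2 * p) * dx
q = gauss(x, m, v)

kl = np.sum(p * np.log(p / q)) * dx  # KL(p || q) in nats; stays large
```

The moment-matched Gaussian places most of its mass between the two modes, where the true predictive has almost none.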
To quantify the quality of our approximate predictive density, we compute the KL divergence between it and the true predictive distribution. We start by considering the gap between the two bounds (where $x_{T+1:T+K}$ are the predicted states):
$$\log p(y_{1:T}) - \mathcal{L} = \mathrm{KL}\big(q(x_{0:T})\,q(f)\,\big\|\,p(x_{0:T}, f \mid y_{1:T})\big), \qquad \log p(y_{1:T}) - \mathcal{L}_{\text{aug}} = \mathrm{KL}\big(q(x_{0:T+K})\,q(f)\,\big\|\,p(x_{0:T+K}, f \mid y_{1:T})\big) \quad (4)$$
We now note that the augmented KL divergence is equal to the original one plus terms depending on the predictive distributions; these terms (listed in (3)) can also be written as:
$$\mathbb{E}_{q(x_T)\,q(f)}\Big[\mathrm{KL}\big(q(x_{T+1:T+K})\,\big\|\,p(x_{T+1:T+K} \mid x_T, f)\big)\Big] \quad (5)$$
This shows that the additional ELBO gap due to the additional states is equal to the expected KL divergence between the model’s true predictive distribution over the states and the approximate one.
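This identity can be checked numerically for a one-step, linear-Gaussian toy case (all parameters invented; the transition is a fixed linear map rather than a GP):

```python
import numpy as np

# One-step toy: transition f(x) = a x, process noise variance sq,
# fixed q(x_T) = N(mt, vt) and one added factor q(x_{T+1}) = N(m, s).
a, sq = 0.9, 0.2 ** 2
mt, vt = 1.0, 0.3
m, s = 0.7, 0.1

# Extra ELBO terms contributed by the added state, negated to give
# the extra slack in the augmented bound
e_logp = -0.5 * np.log(2 * np.pi * sq) \
         - ((m - a * mt) ** 2 + s + a ** 2 * vt) / (2 * sq)
entropy = 0.5 * np.log(2 * np.pi * np.e * s)
extra_gap = -(e_logp + entropy)

# Expected KL between q(x_{T+1}) and p(x_{T+1} | x_T) under q(x_T):
# Gaussian KL in closed form, using E[(m - a x_T)^2] = (m - a mt)^2 + a^2 vt
expected_kl = 0.5 * (np.log(sq / s) - 1.0
                     + (s + (m - a * mt) ** 2 + a ** 2 * vt) / sq)
```

The two quantities agree, and since the expected KL is non-negative, adding predictive states can only widen the bound's gap.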
5 Results and Conclusion
We test our model on two nonlinear dynamical systems: the "kink" transition model (Figure 1) and the "Cart and Pole" (Figure 4 in the supplementary material).
As we can see from Figures 1, 2 and 4, the variational model can successfully learn the transition function from noisy data, filter on observed data (before the vertical black line), and predict multiple steps ahead. We see, however, that predictive performance degrades as we reach the nonlinear region of the state space, where our Gaussian posterior is a rough approximation. This can be quantified by measuring the difference in bound degradation (Figure 3) when predicting from different regions of the state space.
Finally, in Figure 6 we can see how the ELBO objective changes as a function of the model's latent dimensionality on the Cart and Pole dataset with omitted velocities. The ELBO trades off a complexity penalty against a data-fit term, leading it to be maximised close to the true dimensionality of the system (i.e. 5 latent dimensions) and suggesting the model is able to impute the missing information.
Results under this variational inference scheme are promising, though they point to the fragility of a Gaussian $q(x_{0:T})$, especially when predicting variationally (compare with the more robust sampled predictions of Figures 4 and 5). Future work will explore the use of more flexible variational distributions.
Acknowledgments
We would like to acknowledge Richard E. Turner, James Hensman and Hong Ge for helpful discussions. ADI and MvdW are generously supported by Qualcomm Innovation Fellowships.
References

[1] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
[2] R. Frigola, Y. Chen, and C. E. Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680–3688, 2014.
[3] A. Girard, C. E. Rasmussen, J. Quiñonero-Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs: application to multiple-step ahead time series forecasting. In Advances in Neural Information Processing Systems 15, 2003.
[4] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Artificial Intelligence and Statistics, pages 231–239, 2016.
[5] A. McHutchon. Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge, Department of Engineering, 2014.
[6] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
[7] R. E. Turner, C. E. Rasmussen, M. van der Wilk, T. D. Bui, and R. Frigola. Gaussian process state space models. Unpublished note, January 2015.