Closed-form Inference and Prediction in Gaussian Process State-Space Models

We examine an analytic variational inference scheme for the Gaussian Process State Space Model (GPSSM) - a probabilistic model for system identification and time-series modelling. Our approach performs variational inference over both the system states and the transition function. We exploit Markov structure in the true posterior, as well as an inducing point approximation to achieve linear time complexity in the length of the time series. Contrary to previous approaches, no Monte Carlo sampling is required: inference is cast as a deterministic optimisation problem. In a number of experiments, we demonstrate the ability to model non-linear dynamics in the presence of both process and observation noise as well as to impute missing information (e.g. velocities from raw positions through time), to de-noise, and to estimate the underlying dimensionality of the system. Finally, we also introduce a closed-form method for multi-step prediction, and a novel criterion for assessing the quality of our approximate posterior.

1 Introduction

The setting we consider is that of stochastic, non-linear dynamical systems under noisy measurements. Given discrete-time data, recovering the system dynamics (i.e. a "transition function" from the current state to the next) can be likened to regression with noisy targets and inputs. The state-space model formalism addresses this by distinguishing between observed variables and latent ones (the de-noised "states"). The system's evolution is assumed to be independent of the observations, given the states. Moreover, it is common to make a Markovian assumption and let the state at time $t$ depend only on the state at time $t-1$ and not on any preceding ones: the system is fully specified by its current state.
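To make these generative assumptions concrete, the following minimal sketch simulates such a system with a one-dimensional state. The particular transition function, identity emission, and noise levels are illustrative choices only, not the ones used in our experiments.

```python
import numpy as np

def simulate_ssm(f, T, q_std, r_std, x0=0.0, seed=0):
    """Simulate a 1-D state-space model:
       x_{t+1} = f(x_t) + process noise,   y_t = x_t + observation noise."""
    rng = np.random.default_rng(seed)
    x = np.empty(T)
    y = np.empty(T)
    x[0] = x0
    for t in range(T):
        y[t] = x[t] + r_std * rng.standard_normal()              # noisy observation
        if t + 1 < T:
            x[t + 1] = f(x[t]) + q_std * rng.standard_normal()   # Markov transition
    return x, y

# An illustrative non-linear transition (not the exact "kink" function used later).
f = lambda x: 0.8 * x + 0.2 * np.sin(3.0 * x)
states, observations = simulate_ssm(f, T=200, q_std=0.05, r_std=0.2)
```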

In this work we focus on the Gaussian Process State Space Model (GPSSM), where the transition function is given a Gaussian process prior. We aim to perform approximate Bayesian inference over both the transition function and the unobserved states using a variational method. We follow on from the work of Frigola et al. [2] and McHutchon et al. [5] and make three main contributions. Firstly, we investigate the behaviour of a fully analytic variational inference scheme and show how it performs on some initial tasks. Secondly, we introduce a new method for multi-step-ahead prediction, based on augmenting the variational distribution. Finally, we show how this method gives a way of assessing the quality of the approximate predictions and posterior at test time, which was lacking in previous approaches.

2 Doubly Variational Inference

Here we briefly review a simplified derivation of variational inference in the GPSSM introduced by Turner et al. [7]. The main simplification over earlier presentations [2, 5] is the use of a density over entire functions, $q(f)$ (through a slight abuse of notation). While such a density does not strictly exist, we use it as an intermediate step before we integrate out all but a finite number of function values, at which point it becomes a familiar Gaussian distribution again. We begin with the familiar form of the evidence lower bound (ELBO), with an approximate posterior over the states and the transition function, $q(x_{1:T}, f)$. We constrain $q(x_{1:T}, f) = q(x_{1:T})\,q(f)$ to be independent between $x_{1:T}$ and $f$, and we further use the sparse GP posterior introduced by Titsias [6] (see Matthews et al. [4] for a more recent discussion). We write $q(f)$ as the GP prior conditioned on a small set of inducing variables $u$, together with a free density over $u$: $q(f) = p(f \mid u)\,q(u)$. This gives a bound where the difficult GP conditional terms cancel out.

$$\mathcal{L} = \mathbb{E}_{q(x_{1:T})\,q(f)}\!\left[\log \frac{p(y_{1:T}, x_{1:T}, f)}{q(x_{1:T})\,q(f)}\right] \qquad (1)$$
$$\phantom{\mathcal{L}} = \sum_{t=1}^{T}\mathbb{E}_{q(x_t)}\!\left[\log p(y_t \mid x_t)\right] + \sum_{t=1}^{T-1}\mathbb{E}_{q(x_t, x_{t+1})\,q(f)}\!\left[\log p(x_{t+1} \mid f(x_t))\right] + \mathbb{E}_{q(x_1)}\!\left[\log p(x_1)\right] + \mathcal{H}\!\left[q(x_{1:T})\right] - \mathrm{KL}\!\left[q(u)\,\|\,p(u)\right] \qquad (2)$$

This general bound still allows a choice for the form of $q(x_{1:T})$ and $q(u)$. Frigola et al. [2] sample from the intractable optimal variational distributions. We follow McHutchon et al. [5] by choosing a Gaussian $q(x_{1:T})$, which allows all expectations to be calculated in closed form. Finally, the optimal $q(u)$ is found by calculus of variations and is also a Gaussian with closed-form moments. We optimise the bound with respect to: the GP hyper-parameters, the inducing inputs, the process noise standard deviations, the parameters of the linear-Gaussian emissions, and the moments of $q(x_{1:T})$.
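To illustrate the kind of closed-form expectation that keeps the bound analytic, the sketch below evaluates $\mathbb{E}_{q(x)}[k(x, z)]$ for a one-dimensional RBF kernel under a Gaussian $q(x)$ (the analogue of the Psi-statistics of sparse GP models). The function names and the Monte Carlo check are purely illustrative.

```python
import numpy as np

def expected_rbf_kernel(mu, var, z, signal_var, lengthscale):
    """E_{x ~ N(mu, var)}[k(x, z)] for the 1-D RBF kernel
    k(x, z) = signal_var * exp(-(x - z)^2 / (2 * lengthscale^2)).
    Closed-form expectations of this type are what allow the transition
    terms in the bound to be evaluated without sampling."""
    denom = lengthscale**2 + var
    return signal_var * np.sqrt(lengthscale**2 / denom) * np.exp(-0.5 * (mu - z)**2 / denom)

# Quick Monte Carlo sanity check.
rng = np.random.default_rng(0)
mu, var, z, sf2, ell = 0.3, 0.5, -1.0, 1.0, 0.7
x = mu + np.sqrt(var) * rng.standard_normal(200_000)
mc = np.mean(sf2 * np.exp(-0.5 * (x - z)**2 / ell**2))
print(expected_rbf_kernel(mu, var, z, sf2, ell), mc)  # the two values should agree closely
```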

3 Predictions

In addition to inference, making predictions in these models is itself challenging and also requires approximations or sampling. We will address the general case of multi-step-ahead forecasting of a given sequence. The quality of existing methods for making approximate predictions can only be assessed using a held-out test set. The variational framework, however, can be used to quantify the KL divergence between the exact predictive distribution and the approximation. This provides a method to assess the reliability of the approximate predictive distribution, and of the approximate posterior over states, without the need for a test set.

Exact predictions

After training on a sequence of data, we have a factorised approximate posterior over the latent states and the transition function: $q(x_{1:T})\,q(f)$. We can sample from the predictive distribution as follows:

  1. Sample $x_T$ from the final state's posterior $q(x_T)$ and a transition function from $q(f)$. (In practice, to sample $f$ we incrementally sample a function value, with the previous one serving as input.)

  2. Incrementally sample future states $x_{T+h} \sim p(x_{T+h} \mid f(x_{T+h-1}))$ for $h = 1, \dots, H$.

  3. Sample predictive observations $y_{T+h} \sim p(y_{T+h} \mid x_{T+h})$.

This would be repeated in order to sample another trajectory. There are two main issues with the predictive distribution which make it difficult to deal with: a) there are correlations between $f$ and the states across all time steps, and b) the distributions become non-Gaussian after being passed through a GP several times. We will discuss two Gaussian approximate predictive distributions.
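As a concrete illustration of steps 1-3, the following sketch draws predictive state trajectories by incrementally sampling function values, conditioning each new draw on those already sampled. For brevity it conditions on a GP prior with an RBF kernel rather than on the trained posterior $q(f)$, and all names and settings are illustrative.

```python
import numpy as np

def rbf(a, b, signal_var=1.0, lengthscale=1.0):
    """RBF kernel matrix between 1-D input arrays a and b."""
    a, b = np.atleast_1d(a).astype(float), np.atleast_1d(b).astype(float)
    return signal_var * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def sample_trajectory(x_start, horizon, q_std, seed=0, jitter=1e-8):
    """Sample one state trajectory by incrementally sampling the transition
    function: each f(x_t) is drawn from the GP conditioned on the function
    values already sampled, so a single consistent function is used along
    the whole trajectory. (Here we condition on the prior; in the GPSSM one
    would condition on the approximate posterior over f instead.)"""
    rng = np.random.default_rng(seed)
    X = np.empty(0)       # inputs at which f has been sampled so far
    F = np.empty(0)       # the corresponding sampled function values
    states = [float(x_start)]
    for _ in range(horizon):
        x = states[-1]
        k_xx = rbf(x, x)[0, 0]
        if F.size == 0:
            mean, var = 0.0, k_xx
        else:
            K = rbf(X, X) + jitter * np.eye(F.size)
            k_star = rbf(x, X)[0]
            w = np.linalg.solve(K, k_star)
            mean, var = w @ F, max(k_xx - w @ k_star, 1e-12)
        f_val = mean + np.sqrt(var) * rng.standard_normal()    # sample f(x_t)
        X, F = np.append(X, x), np.append(F, f_val)
        states.append(f_val + q_std * rng.standard_normal())   # add process noise
    return np.array(states)

# Repeat to obtain many trajectories, e.g. for sampled predictive densities.
trajectories = np.stack([sample_trajectory(0.5, horizon=20, q_std=0.05, seed=s)
                         for s in range(100)])
```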

Moment matching

Given a Gaussian distribution over the input of a GP, it is possible to analytically calculate the mean and variance of the GP output. This was used by Girard et al. [3] to propose a moment matching approximation, where the moments of the next state are iteratively matched to those of the previous state passed through the GP. It was also used by Deisenroth and Rasmussen [1] to make multi-step predictions for model-based RL. The quality of this approximation degrades the further it has to predict, as any error introduced in the first moment matching step is passed on to the input of the next moment matching step.
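The sketch below shows one way such an iterated approximation can be implemented for a generic one-dimensional transition, using Gauss-Hermite quadrature to compute the output moments at each step. For the kernels used in practice these moments are available analytically, and a GP transition would also contribute its own predictive variance, which this deterministic-function sketch ignores.

```python
import numpy as np

def propagate_moments(f, mean, var, q_var, n_points=50):
    """One moment-matching step: given x ~ N(mean, var), return the mean and
    variance of f(x) plus process noise, using Gauss-Hermite quadrature.
    Note: f is treated as a deterministic function here; a GP would add its
    own predictive variance to `new_var`."""
    nodes, weights = np.polynomial.hermite.hermgauss(n_points)
    x = mean + np.sqrt(2.0 * var) * nodes       # change of variables for N(mean, var)
    w = weights / np.sqrt(np.pi)
    fx = f(x)
    new_mean = w @ fx
    new_var = w @ (fx - new_mean)**2 + q_var    # variance of f(x), plus process noise
    return new_mean, new_var

# Iterate the Gaussian approximation forward: errors made at one step are fed
# into the input of the next, which is why quality degrades with the horizon.
f = lambda x: 0.8 * x + 0.2 * np.sin(3.0 * x)   # illustrative transition
m, v = 0.5, 0.01
for _ in range(20):
    m, v = propagate_moments(f, m, v, q_var=0.05**2)
```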

Variational predictions

Alternatively, we note that we can view the predictive distribution over states simply as a continuation of the posterior to states that have no observations associated with them. Given this insight, we can augment the approximate posterior to include these extra states in the variational lower bound. This gives extra transition terms, without any terms from emission likelihoods. Writing $x_{T+1:T+H}$ for the $H$ added states, we can write the new augmented ELBO as

$$\mathcal{L}_{\mathrm{aug}} = \mathcal{L} + \sum_{h=1}^{H}\mathbb{E}_{q(x_{T+h-1},\, x_{T+h})\,q(f)}\!\left[\log p(x_{T+h} \mid f(x_{T+h-1}))\right] + \mathcal{H}\!\left[q(x_{T+1:T+H})\right] \qquad (3)$$

and maximise it w.r.t. the added state distributions to make predictions, while keeping all the parameters obtained during training fixed. This gives an alternative Gaussian predictive distribution to moment matching.
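A minimal sketch of this idea follows, assuming a factorised Gaussian over the added states, a fixed deterministic transition function, and Gauss-Hermite quadrature in place of the closed-form kernel expectations; all of these are simplifications relative to the GPSSM setting and the names are purely illustrative.

```python
import numpy as np
from scipy.optimize import minimize

def gh_expect(g, mean, var, n=50):
    """E_{u ~ N(mean, var)}[g(u)] via Gauss-Hermite quadrature."""
    t, w = np.polynomial.hermite.hermgauss(n)
    return (w / np.sqrt(np.pi)) @ g(mean + np.sqrt(2.0 * var) * t)

def augmented_terms(params, m0, v0, f, q_var):
    """Extra ELBO terms for H predicted states with q(x_h) = N(m_h, s_h^2):
    sum over h of E_q[log N(x_h; f(x_{h-1}), q_var)] plus the entropy of each
    q(x_h). Here f is a fixed deterministic transition; in the GPSSM the
    expectation is also taken under the approximate posterior over f."""
    H = len(params) // 2
    means = np.concatenate([[m0], params[:H]])
    variances = np.concatenate([[v0], np.exp(params[H:])])     # log-variance parameters
    total = 0.0
    for h in range(1, H + 1):
        mp, vp, m, v = means[h - 1], variances[h - 1], means[h], variances[h]
        e_sq = gh_expect(lambda u: (m - f(u))**2, mp, vp)      # E[(m_h - f(x_{h-1}))^2]
        total += -0.5 * np.log(2 * np.pi * q_var) - 0.5 * (v + e_sq) / q_var
        total += 0.5 * np.log(2 * np.pi * np.e * v)            # entropy of q(x_h)
    return total

f = lambda x: 0.8 * x + 0.2 * np.sin(3.0 * x)                  # illustrative transition
H = 10
init = np.concatenate([np.zeros(H), np.full(H, np.log(0.1))])
res = minimize(lambda p: -augmented_terms(p, m0=0.5, v0=0.01, f=f, q_var=0.05**2), init)
pred_means, pred_vars = res.x[:H], np.exp(res.x[H:])           # Gaussian predictive marginals
```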

4 Assessing approximate predictions at test time

We can expect both Gaussian approximations to do well in situations where the transitions are close to linear, or perhaps even when the state distribution remains unimodal. When state distributions become multimodal, which occurs when predicting near unstable equilibria in the dynamical system, we cannot expect any Gaussian approximation to be close to the model's true predictive distribution. Here, we show that the augmented lower bound can be used to monitor the accuracy of our approximate predictive distribution. We believe this to be important, since we can only expect to make good predictions if our model is correct and our predictions are consistent with it.

To quantify the quality of our approximate predictive density, we will compute the KL divergence between it and the true predictive distribution. We start by considering the gap between the two bounds (where $x_{T+1:T+H}$ are the predicted states):

$$\log p(y_{1:T}) - \mathcal{L} = \mathrm{KL}\!\left[q(x_{1:T})\,q(f) \,\middle\|\, p(x_{1:T}, f \mid y_{1:T})\right], \qquad \log p(y_{1:T}) - \mathcal{L}_{\mathrm{aug}} = \mathrm{KL}\!\left[q(x_{1:T+H})\,q(f) \,\middle\|\, p(x_{1:T+H}, f \mid y_{1:T})\right] \qquad (4)$$

We now note that the augmented KL divergence is equal to the original one, plus terms depending on the predictive distributions; these terms (which appear in (3)) can also be written as

$$\mathrm{KL}\!\left[q(x_{1:T+H})\,q(f) \,\middle\|\, p(x_{1:T+H}, f \mid y_{1:T})\right] - \mathrm{KL}\!\left[q(x_{1:T})\,q(f) \,\middle\|\, p(x_{1:T}, f \mid y_{1:T})\right] = \mathbb{E}_{q(x_T)\,q(f)}\!\left[\mathrm{KL}\!\left[q(x_{T+1:T+H}) \,\middle\|\, p(x_{T+1:T+H} \mid x_T, f)\right]\right] \qquad (5)$$

This shows that the additional ELBO gap due to the additional states is equal to the expected KL divergence between the model’s true predictive distribution over the states and the approximate one.

5 Results and Conclusion

We test our model on two non-linear dynamical systems: the "kink" transition model (Figure 1) and the "Cart and Pole" (Figure 4 in the supplementary material). As we can see from Figures 1, 2 and 4, the variational model can successfully learn the transition function from noisy data, filter on observed data (before the vertical black line), and predict multiple steps ahead. We see, however, that predictive performance degrades as we reach the non-linear region of the state-space, where our Gaussian posterior is a rough approximation. This can be quantified by measuring the difference in bound degradation (Figure 3) when predicting from different regions of the state-space.
Finally, in Figure 6 we can see how the ELBO objective changes as a function of the model’s latent dimensionality on the Cart and Pole dataset with omitted velocities. The ELBO objective has both a complexity penalty and a data-fit term which need to be traded off, leading it to be optimal close to the true dimensionality of the system (i.e. 5 latent dimensions), suggesting it is able to impute missing information.
Results under this variational inference scheme are promising, though they point out the fragility of a Gaussian $q(x_{1:T})$, especially when predicting variationally (compare with the more robust sampled predictions of Figures 4 and 5). Future work will explore the use of more flexible variational distributions.

Figure 1: Kink dynamics: learned transition function from noisy observations.
Figure 2: Kink dynamics: variational predictions. Model trained on near-noiseless data. Only data before the vertical black line is observed, in order to filter the latent states.

Figure 3: Assessing the quality of the Gaussian predictive distribution in different regions of the state-space. Each panel plots the probability density of the prediction, comparing the variational predictive density against the sampled predictive density.
Left: linear regime, initial state: -2.5, degradation in ELBO due to prediction: -0.220.
Right: non-linear regime, initial state: -0.5, degradation in ELBO due to prediction: -46.9.

Acknowledgments

We would like to acknowledge Richard E. Turner, James Hensman and Hong Ge for helpful discussions. ADI and MvdW are generously supported by Qualcomm Innovation Fellowships.

References

  • Deisenroth and Rasmussen [2011] M. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML-11), pages 465–472, 2011.
  • Frigola et al. [2014] R. Frigola, Y. Chen, and C. E. Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pages 3680–3688, 2014.
  • Girard et al. [2003] A. Girard, C. E. Rasmussen, J. Quiñonero-Candela, and R. Murray-Smith. Gaussian process priors with uncertain inputs - application to multiple-step ahead time series forecasting. In Advances in Neural Information Processing Systems 15, 2003.
  • Matthews et al. [2016] A. G. d. G. Matthews, J. Hensman, R. Turner, and Z. Ghahramani. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Artificial Intelligence and Statistics, pages 231–239, 2016.
  • McHutchon et al. [2014] A. McHutchon et al. Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge, Department of Engineering, 2014.
  • Titsias [2009] M. K. Titsias. Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pages 567–574, 2009.
  • Turner et al. [2015] R. E. Turner, C. E. Rasmussen, M. van der Wilk, T. D. Bui, and R. Frigola. Gaussian process state space models. Unpublished note, January 2015.

6 Supplementary Material

Figure 4: Open-loop predictions for the cart and pole dynamical system. Control inputs are given and predictions are evaluated by sampling trajectories. Predictive performance is significantly improved by de-noising in the GPSSM. The true noiseless states (green line) are given for reference. Prediction begins after the black vertical line; the initial sequence is given to filter the state.
Figure 5: Kink dynamics: predictions obtained by sampling trajectories, using the same model as in Figure 2.
Figure 6: ELBO as a function of the latent state dimensionality. The model correctly identifies that a dimensionality of 5-6 is optimal (the true dimensionality is 5) even though it only observes cart positions and the $\sin$ and $\cos$ of the pendulum angle. We initialise the surplus dimensions (if any) with the difference through time of positions and angles to help the model capture velocities (i.e. derivatives).