Overcoming Mean-Field Approximations in Recurrent Gaussian Process Models

by   Alessandro Davide Ialongo, et al.

We identify a new variational inference scheme for dynamical systems whose transition function is modelled by a Gaussian process. Inference in this setting has either employed computationally intensive MCMC methods, or relied on factorisations of the variational posterior. As we demonstrate in our experiments, the factorisation between latent system states and transition function can lead to a miscalibrated posterior and to learning unnecessarily large noise terms. We eliminate this factorisation by explicitly modelling the dependence between state trajectories and the Gaussian process posterior. Samples of the latent states can then be tractably generated by conditioning on this representation. The method we obtain (VCDT: variationally coupled dynamics and trajectories) gives better predictive performance and more calibrated estimates of the transition function, yet maintains the same time and space complexities as mean-field methods. Code is available at: github.com/ialong/GPt.



There are no comments yet.


page 8


Non-Factorised Variational Inference in Dynamical Systems

We focus on variational inference in dynamical systems where the discret...

Symplectic Gaussian Process Dynamics

Dynamics model learning is challenging and at the same time an active fi...

Likelihood-Free Inference in State-Space Models with Unknown Dynamics

We introduce a method for inferring and predicting latent states in the ...

Identification of Gaussian Process State Space Models

The Gaussian process state space model (GPSSM) is a non-linear dynamical...

Factorized Gaussian Process Variational Autoencoders

Variational autoencoders often assume isotropic Gaussian priors and mean...

A Variant of Gaussian Process Dynamical Systems

In order to better model high-dimensional sequential data, we propose a ...

Multi-Class Gaussian Process Classification Made Conjugate: Efficient Inference via Data Augmentation

We propose a new scalable multi-class Gaussian process classification ap...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Many time series are well explained by assuming their progression is determined by an underlying dynamical system. A model of this dynamical system can be used to make strong predictions of the future behaviour of the time series. We often require predictions of systems with unknown dynamics, be it for economic models of financial markets, or models of physical systems to be controlled with model predictive control or model-based reinforcement learning. Particularly in low-data regimes, estimates of the uncertainty in the predictions are crucial for making robust decisions, as the expected utility of an action can depend strongly on the distribution of the outcome

(von Neumann et al., 1944). A striking example of this is model based policy search (Deisenroth & Rasmussen, 2011).

There are many approaches to time series prediction. Within the machine learning community, autoregressive (AR)

(Billings, 2013) and state-space (SSMs) models are popular, in part due to the relative ease with which predictions can be made. AR predictions are obtained by a learned mapping from the last observations to the next one, whereas SSMs model the underlying dynamical system by learning a transition function which maps a system state forward in time. At each point in time, the state contains sufficient information for predicting both future states and observations. While AR models can be easier to train, SSMs have the potential to be more data efficient, due to their minimal state representation.

We aim to learn complex, possibly stochastic, non-linear dynamical systems from noisy data, following the Bayesian paradigm in order to capture model uncertainty. We use a Gaussian process (GP) prior (Rasmussen & Williams, 2005) for the transition function, giving the Gaussian Process State Space Model (GPSSM) (Frigola et al., 2013). Aside from capturing uncertainty, GPs are non-parametric, guaranteeing that our model complexity will not saturate as we observe more data.

Despite the challenge of performing accurate approximate inference in the GPSSM, an impressive amount of progress has been made (Frigola et al., 2014; McHutchon, 2014; Eleftheriadis et al., 2017; Ialongo et al., 2017; Doerr et al., 2018), helped along by the development of elegant variational inference techniques (Titsias, 2009; Hensman et al., 2013) which retain the key non-parametric property of GPs. In this work, we improve variational approximations by critically examining existing methods and their failure modes. We propose a family of non-factorised variational posteriors which alleviate crucial problems with earlier approaches while maintaining the same efficiency. We refer to the proposed approach as VCDT: variationally coupled dynamics and trajectories.

2 Background

State-space models are common in machine learning, and appear in many forms. At the most basic level, linear-Gaussian state-space models can be learned by maximum likelihood, combining Kalman smoothing and EM for instance (Roweis & Ghahramani, 1999). Extensions have been developed for deterministic non-linear transitions (Ghahramani & Roweis, 1999)

. Recurrent neural networks, like the popular LSTM

(Hochreiter & Schmidhuber, 1996) learn deterministic mappings with state space structure for sequence prediction, which have seen successful recent use in a wide range of tasks including translation (Sutskever et al., 2014).

2.1 Bayesian non-parametric approaches

We are particularly interested in prediction tasks which require good estimates of uncertainty, for example, for use in model-based reinforcement learning systems (Deisenroth & Rasmussen, 2011)

. These applications distinguish themselves by requiring 1) continual learning, as datasets are incrementally gathered, and 2) uncertainty estimates to ensure that policies learned are robust to the many dynamics that are consistent with a small dataset. Bayesian non-parametric models provide a unified and elegant way of solving both these problems: model complexity scales with the size of the dataset and is controlled by the Bayesian

Occam’s razor (Rasmussen & Ghahramani, 2000). In this work, we focus on approximations (Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016) which offer improved computational scaling with the ability to fully recover the original non-parametric model.

2.2 Gaussian process State Space Models

Recurrent models with GP transitions (GPSSM) have been proposed in a variety of ways, each with its own inference method (Wang et al., 2005; Ko & Fox, 2009; Turner et al., 2010; Frigola et al., 2013). Early methods relied on maximum a posteriori inference for either the latent states (Wang et al., 2005; Ko & Fox, 2009) or the transition function (Turner et al., 2010). Frigola et al. (2013) presented the first fully Bayesian treatment with a particle MCMC method that sampled over the latent states and the GP transition function. Due to issues with computational complexity and sampling in higher dimensional latent spaces, later attention turned to variational methods. All variational inference schemes have so far relied on independence assumptions between latent states and transition function (Frigola et al., 2014; McHutchon, 2014; Ialongo et al., 2017), sometimes even factorising the state distribution over time (Mattos et al., 2016). Eleftheriadis et al. (2017) introduced a recognition model to help with optimisation of the variational distribution, while keeping the mean-field assumption. Recently, Doerr et al. (2018) introduced the first method to account for the dependence between transition function and latent states by using a doubly stochastic inference scheme similar to Salimbeni & Deisenroth (2017). However, their approach has severe limitations which we will discuss and show experimentally. Bui et al. (2016) investigated Power Expectation Propagation as an alternative approach for fitting factorised approximate posteriors.

2.3 Variational inference approaches

Variational methods often make simplifying assumptions about the form of the approximate posterior to improve computational tractability. This may result in significantly biased solutions. The bias is particularly severe if independence assumptions are made where strong correlations are actually present (Turner & Sahani, 2011). Clearly, in dynamical systems, the trajectory of the latent states depends strongly on the dynamics. Hence, we focus on improving existing variational inference schemes by removing the independence assumption between latent states and transition function. We identify a general method that performs well across varying noise scales.

3 The Model

3.1 State-space model structure

We model discrete-time sequences of observations , where , by positing that each data point is generated by a corresponding latent variable

(the system’s “state”). These latent variables form a Markov chain, implying that, for any time-step

, we can generate by conditioning only on and the transition function . We take the system to evolve according to a single, time-invariant transition function, which we would like to infer. While our inference scheme provides the freedom to choose arbitrary transition and observation density functions, in keeping with previous literature, we use Gaussians. We also assume a linear mapping between and the mean of . This does not limit the range of systems that can be modelled, though it may require us to choose a higher dimensional latent state (Frigola, 2015). We make this choice to reduce the non-identifiabilities between transitions and emissions. To reduce them further we may also impose constraints on the scale of or of the linear mapping . This allows the model to be expressed by the following equations:


where we take the “process noise” covariance matrix to be diagonal, encouraging the transition function to account for all correlations between latent dimensions. The crucial difference from a standard state-space model is that we have placed a GP prior over the transition function. Thus, for a given mean function and positive-definite covariance function , any finite collection of function evaluations, at arbitrary inputs

, will have a Gaussian joint distribution:


where and . This gives the joint density of our model:


In order to draw samples, has to be drawn by conditioning on all the previously sampled function values:


By multiplying the Gaussian conditionals, we can return to our GP prior as in eq. 5. Although the function values are jointly Gaussian, the states are not necessarily so, since the GP’s mean and kernel functions specify non-linear relations between inputs and outputs .

3.2 Multivariate latent states

If our state has more than one dimension (i.e. ), we model the system’s transitions by placing a GP prior on each of the dimension-specific functions, and we write:


where each independent GP has its own mean and kernel functions: . For ease of notation, we drop subscripts indexing state dimensions . Since the GP’s are independent from each other, each process only has to condition on its own function evaluations. Thus, the multivariate conditional

is a diagonal Gaussian where the mean and variance of dimension

are as in eq. 9 and eq. 10, using as mean function and as kernel function .

3.3 Control inputs

A control input influences the transition in a Markovian manner. We can therefore view it as being part of an “augmented” latent state. This makes our transition function , and requires only a change to ’s kernel and mean functions to be defined over

input dimensions. Our transition probability is now also conditioned on

: , though, for brevity, we drop

from the notation, as it is not a random variable and we always condition on it.

3.4 Computational costs of sampling

Despite the conceptual similarity between the non-parametric GPSSM and parametric non-linear state space models, it is considerably more expensive to sample from the GPSSM prior. The main reason for this can be seen in eqs. 10, 9, 8 and 7: to sample we have to condition on the previous points. This requires incrementally building up a matrix inverse, which costs in time. Since our non-parametric transition function is implicitly defined by the function values sampled so far, it is necessary to condition on all previous samples to ensure our overall function is self-consistent over the whole time series.

This poses a problem for MCMC based inference methods, which rely on drawing samples from distributions closely related to the prior. Prediction from a GP posterior also has cubic cost. One way to avoid this cost is to sample from an adapted GP prior (e.g. FITC (Snelson & Ghahramani, 2005)), which was suggested by Frigola et al. (2013).

Modifying the model in this way can have unintended consequences (Bauer et al., 2016), so we focus on performing inference without assumptions on the structure of the GP or its kernel. Our inference scheme should approximate the correct model, while avoiding the cost at training time.

4 Variational inference

4.1 General variational bound

We are interested in approximating the posterior 111We abuse notation by directly denoting a density over the infinite-dimensional , even though such a density does not exist. We only ever concern ourselves with finite subsets of . using variational inference. The general procedure is to construct a variational lower bound to the log marginal likelihood that has the KL divergence from an approximation to the true posterior as its slack: . By maximising w.r.t. , we improve the quality of the approximation.

Practical stochastic variational inference places two requirements on approximating distributions. First, we need to be able to generate samples from them and, second, we need to be able to evaluate their density. We start by deriving the lower bound for a general joint approximate posterior , to which we will later add constraints in order to produce different tractability-accuracy trade-offs. Following the usual derivation using Jensen’s inequality (Blei et al., 2017), we obtain


For we will use a sparse Gaussian process (Titsias, 2009; Hensman et al., 2013; Matthews, 2016). This constraint is ubiquitous in models where the inputs to a GP are random variables (Damianou et al., 2016), such as the GPLVM (Titsias & Lawrence, 2010), or deep GP (Damianou & Lawrence, 2013), as it provides analytical as well as computational tractability. A sparse GP approximation specifies a free Gaussian density on the function values at locations , and uses the prior conditional for the rest of the Gaussian process, which we denote . At this time, we let depend on the entire process for generality:


Choosing the posterior in this way results in a finite KL between the approximate posterior and the prior, which becomes (Matthews et al., 2016).

4.2 Approximate posterior design choices

1) Factorised - linear
2) Factorised - non-linear
3) Non-Factorised - non-linear
Table 1: Variations of approximate posteriors. are free parameters in all cases. is the sparse GP’s marginal posterior variance whereas is the conditional variance of : .

So far, the only simplifying assumptions have been reducing to the prior conditional and to a Gaussian. Now we have to specify the form of . Most existing approximate inference schemes focused on factorised posteriors to obtain an efficient sampling scheme or even a closed form for . In particular, Frigola et al. (2014) find the optimal variational by calculus of variations and sample from it, whereas McHutchon (2014) and Ialongo et al. (2017) constrain it to be Markov-Gaussian. A factorised, Gaussian posterior together with Gaussian emissions leads to a Gaussian optimal , which, for certain kernels, gives us a closed-form bound. Eleftheriadis et al. (2017) reduced the memory cost associated with storing the Markov-Gaussian and improved optimisation behaviour by employing a recurrent neural network as a recognition model. This also parallels the work of (Krishnan et al., 2017), where recurrent neural networks perform smoothing in order to learn a deterministic model of the dynamics.

In our experiments, we show that imposing independence between and does in fact encourage over-confident beliefs about the dynamics, which is a well-known phenomenon in variational inference (Turner & Sahani, 2011). To avoid this we focus on approximate posteriors which preserve correlations between and in various forms. We impose one more general constraint: Markovian structure in . This structure is required to eliminate the impractical amount of correlations that can be expressed between the large number of random variables in . Furthermore, the true posterior is also Markovian in :


Hence the general form of our approximate posterior:


where we are free to specify and . We can thus simplify the ELBO from eq. 13 to


Now we add two restrictions: 1) we use a Gaussian to obtain a tractable KL, and 2) we assume and are linear, and depend only on a finite number of points on , , to make the expectation over tractable. We choose , , and to be free variational parameters in all cases (see table 1 for how and are related). They are crucial to allow information from the observations to make its way into our latent state posterior. We now discuss various choices for and , and the influence they have on the flexibility of the approximate posterior, which we summarise in table 1. The densities of all these variational distributions can still be evaluated, and exact samples can be drawn for unbiased evaluation of the bound.

4.3 A natural choice

Another way to justify our choice of posterior is that, for Gaussian emissions, the exact posterior “filtering” factors: are Gaussians of precisely the form we propose. In fact, we can compute the corresponding , , and in closed form:


Notice that here and are constant through time, reducing the time and memory footprint of the algorithm. While the exact posterior “smoothing” factors are not generally Gaussian or closed-form (hence leaving free), in settings where the observation noise is small, filtering and smoothing will give very similar results, making the filtering factors very close to optimal. Likewise, if future observations hold little information about the present due to process noise corruption, the filtering factors will provide good approximations to the conditional state posterior.

4.4 Factorised approximations

We can recover the existing Gaussian factorised approximation of McHutchon (2014) by taking (see 1 in table 1). The linear relationship between consecutive states implies a linearisation of the transition function. Yet, the true will contain influence from only through the non-linear transition function.

We can break the linearisation constraint by correlating with “pushed through” the approximate GP mean (2 in table 1). This results in a non-Gaussian which is still factorised from . Since this corresponds to marginalising out from , we let to allow our state posterior to be more “uncertain” precisely where our GP posterior tells us to. If the observations provide sufficient evidence to discount the GP’s uncertainty, optimisation will drive and down, giving us a concentrated state posterior.

The main reason we introduce this modified factorised method, is so we can directly isolate the effect of introducing dependence between and , compared to simply making non-Gaussian.

Both approximate posteriors can be sampled in time due to their Markovian structure. The factorised - linear posterior is particularly convenient as it allows us to exploit GPU-optimised banded-matrix operations as in Durrande et al. (2019).

4.5 Direct dependence on

The factorised, non-linear approximation from the previous section is attempting to indirectly incorporate the learned transition function into the approximation over . Instead of summarising the information using the mean of , we want to break the factorisation between and by using the actual, stochastic in the approximation. This is summarised in table 1, line 3. The actual function value of a function in the posterior is weighed by against .

Sampling from this approximate posterior poses a challenge. From eq. 16, we note that every time a is sampled, it has to be consistent with a single function throughout the entire time series. This means that has to be sampled by conditioning on the function values sampled for earlier time-steps, resulting in an cost (as per section 3.4). This high cost prohibits learning in long time series. We can circumvent this issue by cutting the approximate posterior into subsequences with lengths . The state immediately after a subsequence, e.g.  is given a non-conditional posterior:


This reduces the cost of sampling to (for ), and allows for mini-batching over subsequences. However, this factorisation catastrophically “breaks” long-range dependences, again introducing a potential mismatch between our state and transition function posteriors. In the extreme case of , we would obtain an approximate posterior which factorises completely across time, as in Mattos et al. (2016).

4.6 VCDT and “U-dependence”

The previous section introduces an approximate posterior exhibiting dependence between and

. This will generally come at a cubic cost since our transition function has potentially as many degrees of freedom as time-steps. Parametric models avoid this cubic cost since they can sample a transition function with fixed degrees of freedom before sampling the entire time series. Conveniently, our sparse variational GP posterior separates the finite degrees of freedom that we can learn from the data, from the rest of the potentially infinite degrees of freedom of our prior.

In order to gain an approximate posterior with a tractable sampling scheme, we choose ((4) in table 1). Crucially, we can sample from this posterior in linear-time, as only a single needs to be sampled from for an entire sample trajectory. We are effectively being parametric about our sample trajectories while minimising a KL distance from the full non-parametric model. By the properties of sparse variational GPs (Matthews et al., 2016), by adding more inducing points we can reduce this distance.

Intuitively, by finely tiling with inducing points the parts of the state-space we are likely to traverse with our sample paths we can fully specify the behaviour of our transition function in those regions. In this case, conditioning on is sufficient to “tie” down the transitions arbitrarily tightly leaving very little to be gained by additionally conditioning on sampled values. The elegance of this approach lies in the fact that we can assess how many inducing points are required by finding the saturation point where the converged ELBO values stop improving as more points are added.

4.7 Comparison to PR-SSM

Doerr et al. (2018) were, to the best of our knowledge, the first to consider a non-factorised variational posterior for the GPSSM. Their work, however, has two significant shortcomings. Firstly, is taken to be the same as the prior transition. This leaves no free parameters, except for those in and , to incorporate information from the observations into our posterior: no filtering or smoothing is possible. This gives a good approximation only when the process noise is low and/or the observed sequence is short, as low noise levels can compound in a long sequence. Even in the absence of process noise, using the observations to update our beliefs about the states can help optimisation greatly by effectively setting up easier “targets” for the GP to go through. Notice that by setting , , and in our posterior transitions we can also force them to match the prior, making PR-SSM a special case of our more flexible parameterisation.

Secondly, Doerr et al. (2018) employ a sampling scheme which gives incorrect marginal samples of , even assuming, as they did, that:


Despite this conditional factorisation assumption, once is integrated out, as in PR-SSM, the ’s become correlated. Thus, in order to correctly sample , we need to condition on all previous samples , introducing a cubic time cost. This is ignored in PR-SSM and samples from the posterior over are effectively drawn independently for every time-step . The resulting samples from are biased. A mismatch is thus introduced between the functional form of the approximate posterior (which has correlations between function values), and the samples we use to compute its expectations. Hence, PR-SSM’s objective does not correspond to a valid variational lower bound.

We solve this by explicitly sampling and conditioning on it. Conditioning on the same enforces consistency of the dynamics along a sample trajectory. However, we also wish to avoid the factorisation assumption of eq. 22 as that does not generally correspond to a valid GP prior. Explicitly assuming we obtain a valid variational bound where we only need to condition on to sample , giving us an cost while maintaining a full GP prior.

5 Experiments

In the experiments we set out to test how the proposed posterior (VCDT) compares with the Factorised - linear, Factorised - non-linear, and PR-SSM (Doerr et al., 2018) approximations. Each model serves as a “control” for a specific experimental question:

  1. Factorised - linear: what is gained by making our approximation non-linear? I.e. if our posterior were not jointly Gaussian, but only its Markovian conditional factors.

  2. Factorised - non-linear: what is gained by introducing dependence between our states and our transition function? I.e. if we do not marginalise both and from our posterior over .

  3. PR-SSM: what is gained, or lost, by forcing our model to find transition functions that explain the data well, without accounting for process noise?

To answer these questions, we test model calibration and predictive performance in a number of environments: the “kink” function, a set of five system identification benchmark datasets (also used in (Doerr et al., 2018)), and a cart and pole system.

5.1 “Kink” Function

We generate data according to the transition function displayed in red in fig. 1:

This function generates cyclic trajectories through the state-space, as it combines two linear segments with slope greater than (left) and less than (right). It also contains a non-linearity at the joining of these segments, giving it a variety of dynamics and serving as an interesting testing environment. We train our models on the data described in fig. 1, fixing their emissions () to the true, generative ones so as to enable direct latent space comparison222This is to side-step any scale invariances built into our model..

Figure 1: “Kink” data. In red is the true system transition function, in black are the latent states and in grey the observations. Each point’s coordinates are a pair or for some index . Such pairs are connected through time, highlighting trajectories through the state-space. A sequence of 120 latent states were sampled beginning from a standard normal . Their trajectory was perturbed by Gaussian noise with s.d. . Observations were obtained by adding Gaussian noise with s.d. to the latents.

In order to make inference challenging, we add a small amount of process noise (s.d. ) and a significant amount of observation noise (s.d. ). As we can see from fig. 2, in this regime PR-SSM is unable to learn, given that its posterior encourages it to find trajectories with no process noise. Indeed, regardless of which data we are training on, during optimisation, PR-SSM always drives the process noise to zero. This leads it to converge to poor solutions when process noise exists and/or to drive up the observation noise (across all experiments, PR-SSM learned the highest values for the observation noise). Even initialising PR-SSM at the solution found by other methods leads to the same sub-optimal fit of fig. 2. On the other hand, the GPSSM models can handle process noise and recover inflated estimates of it in order to ascribe some transitions to noise and justify smoother dynamics. The Factorised - linear model’s estimates of the process noise are consistently the highest (across many settings of noise parameters and sequence lengths), followed by the Factorised - non-linear model. As can be seen in fig. 3, VCDT finds the most well-calibrated posterior. The factorised approaches (the linear one is shown, though they gave a very similar fit) result in a posterior that is not only less accurate, but also more confidently wrong. In general, the posteriors found by the factorised approaches favour higher process noise to achieve smoother, over-confident dynamics. This is consistent with the behaviour discussed in (Turner & Sahani, 2011): mean-field variational inference favours more concentrated posteriors. By matching our true posterior’s structure more closely and modelling the dependence between and , we can overcome this issue.

Figure 2: PR-SSM fails to learn in the presence of process noise, for moderately long time series (). Plot elements as in fig. 3.

5.2 System identification benchmarks

We also train our models on five benchmark datasets taken from (De Moor et al., 1997). Test set results are reported in table 2. As in (Doerr et al., 2018), we train on the first half of the sequence, we normalise the data, and we use a 4-dimensional latent state (plus 1 non-stochastic dimension for the control inputs). No mini-batching was used, the bound is evaluated using 100 samples from the posterior, and we use 100 variational inducing points. The test results are for predictions 30 steps into the future from the end of the training sequence.

We initialise all models at the solution found by the Factorised - non-linear method to assess differences in optima of the variational objectives. This turned out to be necessary for PR-SSM as learning on such long sequences, without mini-batching, proved very challenging without any filtering from the observations. As can be seen from table 2

, PR-SSM can often find good solutions despite its constraints. This is particularly true for low noise regimes, as in the “Drive” dataset. The GPSSM models are more robust in general though, with VCDT in the lead among GPSSMs for NLPP in all but the last two datasets. No error-bars are displayed since the initialisation was deterministic, no mini-batching was used, and 100 samples from the posterior gave low variance estimates of the training objective. To compute the test statistics accurately, we used many more samples (


Figure 3: Posterior miscalibration due to factorisation. Left: fit for the factorised ; Middle: fit for VCDT ; Right: superimposed posteriors (grey is factorised, black is VCDT). All subplots: the shaded regions indicate a confidence interval and the contour plots (left and middle) show the pairwise posteriors over some test latent states. The coordinates of the squares, triangles and circles correspond to, respectively: the true latent states , the observed states and the means of the marginal posteriors .
Model Actuator Ballbeam Drive Dryer Gas Furnace
Factorised - linear -0.364; 0.154 -0.486; 0.075 0.770; 0.439 -0.709; 0.098 -0.296; 0.170
Factorised - non-linear -0.641; 0.142 -1.379; 0.073 0.283; 0.246 -1.310; 0.049 0.264; 0.168
VCDT -0.644; 0.141 -1.395; 0.072 0.238; 0.285 -1.282; 0.050 0.377; 0.169
PR-SSM (initialised) 3.653; 1.976 24.765; 1.118 -0.649; 0.139 -1.265; 0.053 0.144; 0.162
Table 2: Test set performance (NLPP; RMSE) for the system identification datasets. Lower is better.

5.3 Cart and pole

Finally, we consider a cart and pole system (or inverted pendulum). We add Gaussian noise to the states in order to obtain our observations. Standard deviations are: 0.03; 0.1047; 0.4; 1.3963; for the respective dimensions of our system: cart position, pendulum angle, cart velocity and angular velocity of the pendulum. The noise s.d. levels correspond to 3cm; 6 degrees; 3cm/0.075sec; 6 degrees/0.075sec. Results for models fit using 100 samples from the posterior (per optimisation step) and 300 inducing points are shown in

table 3. Because there is no process noise and the training sequences are short (20 time-steps), we see that PR-SSM performs the best in terms of predictive performance. Of course PR-SSM can be viewed as a special case of our approximation, albeit with a mismatched sampling scheme. Running the experiment with the correct sampling scheme and fixing the variational parameters to: , , and , we learn a solution with the same performance as PR-SSM. In the GPSSM models, VCDT performs the best, and, as can be seen in fig. 4, it also recovers the lowest noise levels, using the dynamics to explain more of the structure in the data.

Factorised - linear 2.268 3.548
Factorised - non-linear 1.847 2.974
VCDT 0.694 2.139
PR-SSM -0.049 1.613
PR-SSM (corrected sampling) -0.053 1.628
Table 3: Test set performance for the Cart and Pole dataset. Results averaged over 30 different test sequences (prediction 20 steps into the future). Lower is better.
Figure 4: Lower values of the learned process noise standard deviation.

6 Conclusion

The GPSSM is a powerful formalism. The main challenge is to perform fast, accurate inference. Variational inference maintains many of the benefits of using non-parametric models, but faces particular difficulties in time series models due to their sensitivity to factorisation assumptions. Naive non-factorising approximations regain the poor computational scaling that variational methods were introduced to avoid. By exploiting the low-rank structure of the approximate GP posterior, we were able to construct a non-factorised posterior with the desired computational scaling. This often leads to better predictive performance, calibration, and an improved estimation of model parameters.


ADI would like to acknowledge the generous support of the Cambridge-Tübingen and Qualcomm Innovation fellowships.


  • Bauer et al. (2016) Bauer, M., van der Wilk, M., and Rasmussen, C. E. Understanding probabilistic sparse Gaussian process approximations. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, 2016.
  • Billings (2013) Billings, S. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. Wiley, 2013. ISBN 9781119943594.
  • Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
  • Bui et al. (2016) Bui, T., Turner, R. E., and Rasmussen, C. E. Bayesian Gaussian process state-space models via Power-EP. In ICML 2016 Workshop on Data efficient Machine Learning, 2016.
  • Damianou & Lawrence (2013) Damianou, A. and Lawrence, N. Deep Gaussian processes. In

    Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS)

    , pp. 207–215, 2013.
  • Damianou et al. (2016) Damianou, A. C., Titsias, M. K., and Lawrence, N. D. Variational inference for latent variables and uncertain inputs in Gaussian processes. The Journal of Machine Learning Research, 17(1):1425–1486, 2016.
  • De Moor et al. (1997) De Moor, B., De Gersem, P., De Schutter, B., and Favoreel, W. DAISY: A database for identification of systems. JOURNAL A, 38:4–5, 1997. URL http://homes.esat.kuleuven.be/~smc/daisy/.
  • Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
  • Doerr et al. (2018) Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent state-space models. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
  • Durrande et al. (2019) Durrande, N., Adam, V., Bordeaux, L., Eleftheriadis, S., and Hensman, J.

    Banded matrix operators for Gaussian Markov models in the automatic differentiation era.

    In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
  • Eleftheriadis et al. (2017) Eleftheriadis, S., Nicholson, T., Deisenroth, M., and Hensman, J. Identification of Gaussian process state-space models. In Advances in Neural Information Processing Systems 30, 2017.
  • Frigola (2015) Frigola, R. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, 2015.
  • Frigola et al. (2013) Frigola, R., Lindsten, F., Schön, T. B., and Rasmussen, C. E. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems 26, 2013.
  • Frigola et al. (2014) Frigola, R., Chen, Y., and Rasmussen, C. E. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems 27, 2014.
  • Ghahramani & Roweis (1999) Ghahramani, Z. and Roweis, S. T. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems 12, 1999.
  • Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for Big Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
  • Hochreiter & Schmidhuber (1996) Hochreiter, S. and Schmidhuber, J. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9, 1996.
  • Ialongo et al. (2017) Ialongo, A. D., van der Wilk, M., and Rasmussen, C. E. Closed-form inference and prediction in Gaussian process state-space models. NIPS 2017 Time-Series Workshop, 2017.
  • Ko & Fox (2009) Ko, J. and Fox, D. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, 2009.
  • Krishnan et al. (2017) Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state-space models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
  • Matthews et al. (2016) Matthews, A., Hensman, J., Turner, R., and Ghahramani, Z.

    On sparse variational methods and the Kullback-Leibler divergence between stochastic processes.

    In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
  • Matthews (2016) Matthews, A. G. d. G. Scalable Gaussian process inference using variational methods. PhD thesis, University of Cambridge, 2016.
  • Mattos et al. (2016) Mattos, C. L. C., Dai, Z., Damianou, A., Forth, J., Barreto, G. A., and Lawrence, N. D. Recurrent Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
  • McHutchon (2014) McHutchon, A. Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge, 2014.
  • Rasmussen & Ghahramani (2000) Rasmussen, C. E. and Ghahramani, Z. Occam’s razor. In Advances in Neural Information Processing Systems 13, 2000.
  • Rasmussen & Williams (2005) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2005.
  • Roweis & Ghahramani (1999) Roweis, S. and Ghahramani, Z. A unifying review of linear Gaussian models. Neural computation, 11(2):305–345, 1999.
  • Salimbeni & Deisenroth (2017) Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems 30, pp. 4588–4599, 2017.
  • Snelson & Ghahramani (2005) Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2005.
  • Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, 2014.
  • Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
  • Titsias & Lawrence (2010) Titsias, M. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 844–851, 2010.
  • Turner et al. (2010) Turner, R., Deisenroth, M., and Rasmussen, C. State-space inference and learning with Gaussian processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
  • Turner & Sahani (2011) Turner, R. E. and Sahani, M. Two problems with variational expectation maximisation for time-series models. Bayesian Time series models, 1(3.1):3–1, 2011.
  • von Neumann et al. (1944) von Neumann, J., Morgenstern, O., Kuhn, H. W., and Rubinstein, A. Theory of Games and Economic Behavior. Princeton University Press, 1944. ISBN 9780691130613.
  • Wang et al. (2005) Wang, J., Hertzmann, A., and Fleet, D. J. Gaussian process dynamical models. In Advances in Neural Information Processing Systems 18, 2005.