1 Introduction
Gaussian processes (GPs, Rasmussen and Williams 2006) have proven to be powerful probabilistic nonparametric modeling tools for static nonlinear functions. However, many real-world applications, such as control, target tracking, and time-series analysis, tackle problems with nonlinear dynamical behavior. The use of GPs in modeling nonlinear dynamical systems is an emerging topic, with many strong contributions during recent years, for example the work by Turner et al. (2010), Frigola et al. (2013, 2014a, 2014b), and Mattos et al. (2016). The aim of this paper is to advance the state of the art in Bayesian inference on Gaussian process state space models (GP-SSMs). As we will detail, a GP-SSM is a state space model, using a GP as its state transition function. Thus, the GP-SSM is not a GP itself, but a state space model (i.e., a dynamical system). Overviews of GP-SSMs are given by, e.g., McHutchon (2014) and Frigola-Alcade (2015).
We provide a novel reduced-rank model formulation of the GP-SSM with good convergence properties both in theory and practice. The advantage of our approach over the variational approach by Frigola et al. (2014b), as well as other inducing-point-based approaches, is that it attempts to approximate the optimal Karhunen–Loève eigenbasis for the reduced-rank approximation, instead of using the suboptimal Nyström approximation which is implicitly the underlying approximation in all inducing point methods. Because of this, we do not need to resort to variational approximations, but can instead perform the Bayesian computations in full. By utilizing the structure of the reduced-rank model, we construct a computationally efficient linear-time-complexity MCMC-based algorithm for learning in the proposed GP-SSM, which we demonstrate and evaluate on several challenging examples. We also provide a proof of convergence of the reduced-rank GP-SSM to a full GP-SSM (in the supplementary material).
GP-SSMs are a general class of models defining a dynamical system for $t = 1, 2, \ldots, T$ given by
(1a)  $x_{t+1} = f(x_t) + w_t,$
(1b)  $y_t = g(x_t) + e_t,$
where the noise terms $w_t$ and $e_t$ are i.i.d. Gaussian, $w_t \sim \mathcal{N}(0, Q)$ and $e_t \sim \mathcal{N}(0, R)$. The latent state $x_t$ is observed via the measurements $y_t$. The key feature of this model is the nonlinear transformations $f(\cdot)$ and $g(\cdot)$, which are not known explicitly and do not adhere to any specific parametrization. The model functions $f(\cdot)$ and $g(\cdot)$ are assumed to be realizations from a Gaussian process prior with a given covariance function $\kappa(x, x')$, subject to some hyperparameters $\theta$. Learning of this model, which we will tackle, amounts to inferring the posterior distribution of $f(\cdot)$, $g(\cdot)$, $Q$, $R$, and $\theta$ given a set of (noisy) observations $y_{1:T}$.
The strength of including the GP in (1) is its ability to systematically model uncertainty: not only uncertainty originating from stochastic noise within the system, but also uncertainty inherited from data, such as few measurements or poor excitation of the dynamics in certain regions of the state space. An example of this is given in Figure 1, where we learn the posterior distribution of the unknown function in a GP-SSM (see Sec. 5 for details). An inspiring real-world example of how such probabilistic information can be utilized for simultaneous learning and control is given by Deisenroth et al. (2015).
Non-probabilistic methods for modeling nonlinear dynamical systems include learning of state space models using a basis function expansion (Ghahramani and Roweis, 1998), but also nonlinear extensions of AR(MA) and GARCH models from the time-series analysis literature (Tsay, 2010), as well as nonlinear extensions of ARX and state space models from the system identification literature (Sjöberg et al., 1995; Ljung, 1999). In particular, nonlinear ARX models are now a standard tool for the system identification engineer (The MathWorks, Inc., 2015). For probabilistic modeling, the latent force model (Alvarez et al., 2009) presents one approach to modeling dynamical phenomena using GPs, by encoding a priori known dynamics within the construction of the GP. Another approach is the Gaussian process dynamical model (Wang et al., 2008), where a GP is used to model the nonlinear function within an SSM, that is, a GP-SSM. However, the work by Wang et al. (2008) is, as opposed to this paper, mostly focused on the problem setting where the dimension of the observations is much larger than the dimension of the latent state. That is also the focus of the further development by Damianou et al. (2011), where the EM algorithm for learning is replaced by a variational approach.
State space filtering and smoothing in GP-SSMs has been tackled before (e.g., Deisenroth et al. 2012; Deisenroth and Mohamed 2012), and recent interest has turned to learning GP-SSMs (Turner et al., 2010; Frigola et al., 2013, 2014a, 2014b). An inherent problem in learning the GP-SSM is the entangled relationship between the states $x_{0:T}$ and the nonlinear function $f(\cdot)$. Two different approaches have been proposed in the literature: In the first approach, the GP is represented by a parametrized form (Turner et al. use a pseudo-training data set, akin to the inducing inputs by Frigola et al. 2014b, whereas we will employ a basis function expansion). The second approach (used by Frigola et al. 2013, 2014a) handles the nonlinear function implicitly by marginalizing it out. Concerning learning, Turner et al. (2010) and Frigola et al. (2014a) use an EM-based procedure, whereas we and Frigola et al. (2013) use an MCMC algorithm.
The main bottleneck prohibiting the practical use of some of the previously proposed GP-SSM methods is the computational load. For example, the training of a one-dimensional system on a fairly small data set takes on the order of several hours for the solution by Frigola et al. (2013). Akin to Frigola et al. (2014b), our proposed method will typically handle such an example within minutes, or even less. To reduce the computational load, Frigola et al. (2014b) suggest variational sparse GP techniques to approximate the solution. Our approach instead uses the reduced-rank GP approximation by Solin and Särkkä (2014), which is a disparate solution with different properties. The reduced-rank GP approximation enjoys favorable theoretical properties, and we can prove convergence to a non-approximated GP-SSM.
The outline of the paper is as follows: In Section 2 we introduce reduced-rank Gaussian process state space models by making use of the representation of GPs via basis functions corresponding to the prior covariance structure (Solin and Särkkä, 2014), a theoretically well-supported approximation significantly reducing the computational load. In Section 3 we develop an algorithm for learning reduced-rank Gaussian process state space models using recent MCMC methods (Lindsten et al., 2014; Wills et al., 2012), and in Section 4 we state theoretical properties of the model and the learning algorithm. We demonstrate the approach on synthetic as well as real data examples in Section 5, and finally discuss the contribution and further extensions in Section 6.
2 REDUCED-RANK GP-SSMs
We use GPs as flexible priors in Bayesian learning of the state space model. The covariance function $\kappa(x, x')$ encodes the prior assumptions on the model functions, thus representing our best belief about the behavior of the nonlinear transformations. In the following, we present an approach for parametrizing this model in terms of an $m$-rank truncation of a basis function expansion, as presented by Solin and Särkkä (2014). Related ideas have also been proposed by, for example, Lázaro-Gredilla et al. (2010).
Provided that the covariance function is stationary (homogeneous, i.e., $\kappa(x, x') = \kappa(x - x')$), it can be equivalently represented in terms of its spectral density $S(\omega)$. This Fourier duality is known as the Wiener–Khintchin theorem, which we parametrize as $S(\omega) = \int \kappa(r)\, e^{-\mathrm{i}\,\omega^\top r}\, \mathrm{d}r$. We employ the relation presented by Solin and Särkkä (2014) to approximate the covariance operator corresponding to $\kappa$. This operator is a pseudo-differential operator, which we approximate by a series of differential operators, namely Laplace operators $\nabla^2$. In the isotropic case, the approximation of the covariance function is given most concisely in the following form:
(2)  $\kappa(x, x') \approx \sum_{j=1}^{m} S\big(\sqrt{\lambda_j}\big)\, \phi^{j}(x)\, \phi^{j}(x'),$
where $S(\cdot)$ is the spectral density function of $\kappa$, and $\lambda_j$ and $\phi^{j}(\cdot)$ are the Laplace operator eigenvalues and eigenfunctions solved for the domain $\Omega \ni x$. See Solin and Särkkä (2014) for a detailed derivation and convergence proofs.
The key feature in the Hilbert space approximation (2) is that $\lambda_j$ and $\phi^{j}(\cdot)$ are independent of the hyperparameters $\theta$; it is only the spectral density $S(\cdot)$ that depends on $\theta$. Equation (2) is a direct approximation of the eigendecomposition of the Gram matrix (e.g., Rasmussen and Williams 2006), and it can be interpreted as an optimal parametric expansion with respect to the given covariance function in the GP prior.
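To make the approximation (2) concrete, the following sketch evaluates it numerically in one dimension for the exponentiated quadratic covariance function, using the Dirichlet Laplace eigenpairs and spectral density stated later in Section 5.1. The domain half-length, the number of basis functions, and the test points are illustrative choices, not values from the paper.

```python
import numpy as np

def se_cov(x, xp, sigma_f=1.0, ell=1.0):
    """Exponentiated quadratic (squared exponential) covariance function."""
    return sigma_f**2 * np.exp(-(x - xp)**2 / (2.0 * ell**2))

def se_spectral_density(omega, sigma_f=1.0, ell=1.0):
    """Spectral density of the SE covariance (its Wiener-Khintchin pair)."""
    return sigma_f**2 * np.sqrt(2.0 * np.pi * ell**2) * np.exp(-0.5 * (ell * omega)**2)

def reduced_rank_cov(x, xp, m=64, L=4.0):
    """Rank-m approximation (2): sum_j S(sqrt(lambda_j)) phi_j(x) phi_j(x')."""
    j = np.arange(1, m + 1)
    sqrt_lam = np.pi * j / (2.0 * L)                      # sqrt(lambda_j) on [-L, L]
    phi = lambda z: np.sqrt(1.0 / L) * np.sin(sqrt_lam * (z + L))
    return float(np.sum(se_spectral_density(sqrt_lam) * phi(x) * phi(xp)))

exact = se_cov(0.3, -0.2)               # kappa for two points well inside the domain
approx = reduced_rank_cov(0.3, -0.2)    # approaches `exact` as m and L grow
```

Well inside the domain, the truncation error is dominated by the (rapidly decaying) tail of the spectral density, so already moderate $m$ gives agreement to several digits.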
In terms of a basis function expansion, this can be expressed as
(3)  $f(x) \approx \sum_{j=1}^{m} w^{j} \phi^{j}(x), \qquad w^{j} \sim \mathcal{N}\big(0,\, S\big(\sqrt{\lambda_j}\big)\big),$
where the weights $w^{j}$ are a priori independent. In the case $n_x > 1$, this formulation does allow for nonzero covariance between different components of the state space. We can now formulate a reduced-rank GP-SSM, corresponding to (1a), as
(4)  $x_{t+1} = \sum_{j=1}^{m} w_f^{j}\, \phi^{j}(x_t) + w_t,$
and similarly for (1b). Henceforth we will consider the reduced-rank GP-SSM
(5a)  $x_{t+1} = A\, \phi(x_t) + w_t,$
(5b)  $y_t = C\, \phi(x_t) + e_t,$
where $\phi(x) = [\phi^{1}(x)\ \cdots\ \phi^{m}(x)]^\top$, and $A$ and $C$ are matrices of weights with priors for each element as described by (3).
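As a sanity check of the generative model (5), the following sketch draws weights from the prior (3) and simulates a one-dimensional reduced-rank GP-SSM forward in time. The SE spectral density with unit hyperparameters, the domain half-length, the noise variances, and the identity measurement function are all illustrative assumptions, not the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, L, T = 20, 4.0, 100          # basis functions, domain half-length, time steps
Q, R = 0.1, 0.1                 # process and measurement noise variances

j = np.arange(1, m + 1)
sqrt_lam = np.pi * j / (2.0 * L)                         # sqrt(lambda_j)
S = np.sqrt(2.0 * np.pi) * np.exp(-0.5 * sqrt_lam**2)    # SE spectral density, unit scale

def phi(x):
    """Laplace eigenfunctions on [-L, L] with Dirichlet boundary conditions."""
    return np.sqrt(1.0 / L) * np.sin(sqrt_lam * (x + L))

w = rng.normal(0.0, np.sqrt(S))          # prior (3): w_j ~ N(0, S(sqrt(lambda_j)))

x = np.zeros(T)
for t in range(T - 1):
    x[t + 1] = w @ phi(x[t]) + rng.normal(0.0, np.sqrt(Q))   # transition as in (5a)
    x[t + 1] = np.clip(x[t + 1], -L, L)  # keep the state inside the basis domain
y = x + rng.normal(0.0, np.sqrt(R), size=T)                  # y_t = x_t + e_t (assumed)
```

Note that clipping the state to the domain is a pragmatic device for the sketch; in practice one chooses $L$ large enough that the state stays well inside the domain.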
3 LEARNING GP-SSMs
Learning of the reduced-rank Gaussian process state space model (5) from data $y_{1:T}$ amounts to inferring the posterior distribution of $A$, $C$, $Q$, $R$, and the hyperparameters $\theta$. For clarity of presentation, we will focus on inferring the dynamics, and assume the observation model ($C$ and $R$) to be known a priori. However, the extension to an unknown observation model, as well as exogenous input signals, follows in the same fashion, and will be demonstrated in the numerical examples.
To infer the sought distributions, we will use a blocked Gibbs sampler, outlined in Algorithm 1. Although it involves sequential Monte Carlo (SMC) for inference in the state space, the validity of this approach does not rely on asymptotics in the number of particles in the SMC algorithm, thanks to recent particle MCMC methods (Andrieu et al., 2010; Lindsten et al., 2014).
It is possible to learn (5) under different assumptions on what is known. We will focus on the general (and in many cases realistic) setting where $A$, $Q$, and $\theta$ are all unknown. In cases when $Q$ or $\theta$ are known a priori, the presented scheme is straightforward to adapt. To be able to infer the posterior distributions of $Q$ and $R$, we make the additional prior assumptions
(6)  $Q \sim \mathcal{IW}(\ell_Q, \Lambda_Q), \qquad R \sim \mathcal{IW}(\ell_R, \Lambda_R),$
where $\mathcal{IW}(\ell, \Lambda)$ denotes the inverse Wishart distribution with $\ell$ degrees of freedom and scale matrix $\Lambda$. For brevity, we will omit the problem of finding the unknown initial distribution $p(x_0)$. It is possible to treat this rigorously akin to the other unknown quantities, but it is of minor importance in most practical situations. We will now, in Sections 3.1–3.3, explain the four main steps 5–8 of Algorithm 1.
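Since Algorithm 1 itself is not reproduced here, the following skeleton sketches how the blocked Gibbs sweep described in Sections 3.1–3.3 fits together. The function names and the toy stand-in samplers are ours, for illustration only.

```python
import numpy as np

def blocked_gibbs(y, K, sample_trajectory, sample_Q_A, sample_theta, init):
    """Skeleton of a blocked Gibbs sweep for the reduced-rank GP-SSM.

    Each block is drawn conditionally on the latest values of the others:
      1. x_{0:T} | A, Q, theta, y   via PGAS          (Section 3.1)
      2. Q, A    | x_{0:T}, theta   in closed form    (Section 3.2)
      3. theta   | A, Q, x_{0:T}    via an MH step    (Section 3.3)
    """
    x, (Q, A), theta = init
    samples = []
    for _ in range(K):
        x = sample_trajectory(y, A, Q, theta)
        Q, A = sample_Q_A(x, theta)
        theta = sample_theta(A, Q, x)
        samples.append((x.copy(), Q, A.copy(), theta))
    return samples

# exercise the control flow with trivial stand-in samplers
rng = np.random.default_rng(1)
samples = blocked_gibbs(
    y=None, K=10,
    sample_trajectory=lambda y, A, Q, th: rng.normal(size=5),
    sample_Q_A=lambda x, th: (1.0, rng.normal(size=3)),
    sample_theta=lambda A, Q, x: 0.5,
    init=(np.zeros(5), (1.0, np.zeros(3)), 0.5),
)
```

The point of the blocking is that step 2 is exact (conjugacy) and step 1 leaves the correct conditional invariant for any number of particles, so the sweep as a whole is a valid MCMC kernel.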
3.1 Sampling in State Space with SMC
SMC methods (Doucet and Johansen, 2011) are a family of techniques developed around the problem of inferring the posterior state distribution in SSMs. SMC can be seen as a sequential application of importance sampling along the sequence of distributions $p(x_{0:t} \mid y_{1:t})$, $t = 1, \ldots, T$, with a resampling procedure to avoid sample depletion.
To sample the state space trajectory $x_{0:T}$, conditional on a model ($A$, $Q$, and $\theta$) and data $y_{1:T}$, we employ a conditional particle filter with ancestor sampling, forming the particle Gibbs Markov kernel of Algorithm 2 (PGAS, Lindsten et al. 2014). PGAS can be thought of as an SMC algorithm for approximating the so-called smoothing distribution, to be used within an MCMC procedure.
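A minimal sketch of such a conditional particle filter with ancestor sampling is given below, for a scalar model $x_{t+1} = f(x_t) + w_t$, $y_t = x_t + e_t$. The $\mathcal{N}(0,1)$ initial distribution, the identity measurement, and all numerical values are illustrative assumptions; this is a hedged reconstruction, not the paper's Algorithm 2.

```python
import numpy as np

def cpf_as(y, x_ref, f, Q, R, N, rng):
    """Conditional particle filter with ancestor sampling (one PGAS kernel sweep).

    Returns a trajectory drawn from the particle approximation of the smoothing
    distribution, conditioned on the reference trajectory x_ref.
    """
    T = len(y)
    X = np.zeros((T, N))                      # particles
    anc = np.zeros((T, N), dtype=int)         # ancestor indices
    X[0, :-1] = rng.normal(0.0, 1.0, N - 1)   # x_0 ~ N(0, 1) (assumed)
    X[0, -1] = x_ref[0]                       # last particle holds the reference
    logw = -0.5 * (y[0] - X[0])**2 / R
    w = np.exp(logw - logw.max()); w /= w.sum()
    for t in range(1, T):
        anc[t, :-1] = rng.choice(N, size=N - 1, p=w)          # resample ancestors
        X[t, :-1] = f(X[t - 1, anc[t, :-1]]) + rng.normal(0.0, np.sqrt(Q), N - 1)
        X[t, -1] = x_ref[t]
        # ancestor sampling: reconnect the reference particle to the history
        logv = np.log(w + 1e-300) - 0.5 * (x_ref[t] - f(X[t - 1]))**2 / Q
        v = np.exp(logv - logv.max()); v /= v.sum()
        anc[t, -1] = rng.choice(N, p=v)
        logw = -0.5 * (y[t] - X[t])**2 / R                    # reweight
        w = np.exp(logw - logw.max()); w /= w.sum()
    k = rng.choice(N, p=w)                    # draw one index at the final time
    traj = np.zeros(T)
    for t in reversed(range(T)):              # trace its ancestral path backwards
        traj[t] = X[t, k]
        k = anc[t, k]
    return traj

rng = np.random.default_rng(0)
f = lambda x: 0.8 * np.tanh(x)                # illustrative transition function
x_true = np.zeros(30)
for t in range(29):
    x_true[t + 1] = f(x_true[t]) + rng.normal(0.0, np.sqrt(0.1))
y = x_true + rng.normal(0.0, np.sqrt(0.1), 30)
traj = cpf_as(y, x_ref=np.zeros(30), f=f, Q=0.1, R=0.1, N=30, rng=rng)
```

The ancestor sampling step is what distinguishes PGAS from a plain conditional particle filter: it breaks up the reference trajectory and greatly improves mixing.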
3.2 Sampling of Covariances and Weights
The sampling of the weights $A$ and the noise covariance $Q$, conditioned on $x_{0:T}$ and $\theta$, can be done exactly, by following the procedure of Wills et al. (2012). With the priors (3) and (6), the joint prior of $A$ and $Q$ can be written using the Matrix Normal Inverse Wishart (MNIW) distribution as
(7)  $p(A, Q) = \mathcal{MNIW}(A, Q \mid 0, V, \ell_Q, \Lambda_Q).$
Details on the parametrization of the MNIW distribution we use are available in the supplementary material; it is given by the hierarchical model $Q \sim \mathcal{IW}(\ell_Q, \Lambda_Q)$ and $A \mid Q \sim \mathcal{MN}(0, Q, V^{-1})$. For our problem, the most important part is the second argument, the inverse row covariance $V$, a square $m \times m$ matrix with the inverse spectral density of the covariance function as its diagonal entries:
(8)  $V = \operatorname{diag}\big(S^{-1}\big(\sqrt{\lambda_1}\big), \ldots, S^{-1}\big(\sqrt{\lambda_m}\big)\big).$
This is how the prior from (3) enters the formulation. (Note that the marginal variance of each element of $A$ is also scaled by $Q$. For notational convenience, we refrain from introducing a scaling factor, but let it be absorbed into the covariance function.) With this (conjugate) prior, the posterior follows analytically by introducing the following statistics of the sampled trajectory $x_{0:T}$:
(9)  $\Phi = \sum_{t=0}^{T-1} \zeta_t \zeta_t^\top, \qquad \Psi = \sum_{t=0}^{T-1} \zeta_t z_t^\top, \qquad \Sigma = \sum_{t=0}^{T-1} z_t z_t^\top,$
where $\zeta_t = x_{t+1}$ and $z_t = \phi(x_t)$. Using the Markov property of the SSM, it is possible to write the conditional distribution for $Q$ as (Wills et al., 2012, Eq. (42)):
(10)  $Q \mid x_{0:T} \sim \mathcal{IW}\big(T + \ell_Q,\; \Lambda_Q + \Phi - \Psi (\Sigma + V)^{-1} \Psi^\top\big).$
Given the prior (7), $A$ can now be sampled from (Wills et al., 2012, Eq. (43)):
(11)  $A \mid Q, x_{0:T} \sim \mathcal{MN}\big(\Psi (\Sigma + V)^{-1},\; Q,\; (\Sigma + V)^{-1}\big).$
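The closed-form updates (9)–(11) can be sketched for a scalar state as follows. The inverse Wishart and matrix-normal sampling details follow the standard conjugate update of Wills et al. (2012) as we read it; the trajectory, prior parameters, and basis dimensions below are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import invwishart

def sample_Q_A(x, basis, V, ell_Q, Lambda_Q, rng):
    """One Gibbs draw of (Q, A) from a scalar-state trajectory x_{0:T}.

    Statistics (9) with zeta_t = x_{t+1}, z_t = phi(x_t), then
    Q ~ IW(T + ell_Q, Lambda_Q + Phi - Psi (Sigma + V)^{-1} Psi^T)  and
    A | Q ~ MN(Psi (Sigma + V)^{-1}, Q, (Sigma + V)^{-1}).
    """
    Z = np.array([basis(xt) for xt in x[:-1]])   # (T, m) regressors z_t
    zeta = x[1:].reshape(-1, 1)                  # (T, 1) one-step-ahead states
    Phi, Psi, Sig = zeta.T @ zeta, zeta.T @ Z, Z.T @ Z
    P = np.linalg.inv(Sig + V)                   # posterior column covariance
    M = Psi @ P                                  # posterior mean of A
    Q = np.atleast_2d(invwishart(df=len(zeta) + ell_Q,
                                 scale=Lambda_Q + Phi - Psi @ P @ Psi.T).rvs())
    E = rng.standard_normal(M.shape)             # matrix-normal draw via Cholesky
    A = M + np.linalg.cholesky(Q) @ E @ np.linalg.cholesky(P).T
    return Q, A

# illustrative usage with the diagonal prior precision V from (8)
rng = np.random.default_rng(0)
m, L = 5, 4.0
sqrt_lam = np.pi * np.arange(1, m + 1) / (2.0 * L)
S = np.sqrt(2.0 * np.pi) * np.exp(-0.5 * sqrt_lam**2)           # SE spectral density
basis = lambda z: np.sqrt(1.0 / L) * np.sin(sqrt_lam * (z + L))
x_traj = rng.normal(size=50)
Q, A = sample_Q_A(x_traj, basis, V=np.diag(1.0 / S),
                  ell_Q=2, Lambda_Q=np.eye(1), rng=rng)
```

Because the prior precision $V$ simply adds to the data statistic $\Sigma$, the whole update costs one $m \times m$ factorization per Gibbs sweep, which is the source of the method's efficiency.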
3.3 Marginalizing the Hyperparameters
Concerning the sampling of the hyperparameters $\theta$, we note that we can easily evaluate the conditional distribution up to proportionality as
(12)  $p(\theta \mid A, Q, x_{0:T}) \propto p(A, Q \mid \theta)\, p(\theta).$
To utilize this, we suggest sampling the hyperparameters using a Metropolis–Hastings (MH) step, resulting in a so-called Metropolis-within-Gibbs procedure.
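A random-walk Metropolis-within-Gibbs step for (12) can be sketched as below; the Gaussian toy target stands in for the unnormalized conditional $p(A, Q \mid \theta)\, p(\theta)$ and is purely illustrative.

```python
import numpy as np

def mh_step(theta, log_target, rng, step=0.5):
    """One random-walk Metropolis-Hastings update of the hyperparameters."""
    prop = theta + step * rng.standard_normal(theta.shape)
    # accept with probability min(1, target ratio); the proposal is symmetric
    if np.log(rng.uniform()) < log_target(prop) - log_target(theta):
        return prop, True
    return theta, False

# toy usage: the unnormalized log target is a standard normal (stand-in for (12))
rng = np.random.default_rng(0)
theta, accepted = np.array([2.0]), 0
for _ in range(2000):
    theta, acc = mh_step(theta, lambda t: -0.5 * np.sum(t**2), rng)
    accepted += acc
```

In practice one would work with log-transformed hyperparameters to keep length-scales and variances positive, and tune the step size for a reasonable acceptance rate.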
4 Theoretical Results
Our model (5) and learning Algorithm 1 inherit certain well-defined properties from the reduced-rank approximation and the presented sampling scheme. In the first theorem, we consider the convergence of the series expansion approximation to the GP-SSM as the number of basis functions increases. As in Solin and Särkkä (2014), we only provide the convergence results for a rectangular domain with Dirichlet boundary conditions, but the result could easily be extended to a more general case. Proofs for all theorems are included in the supplementary material.
Theorem 4.1.
The probabilistic model implied by the dynamic and measurement models of the approximate GP-SSM converges in distribution to the exact GP-SSM, when the size of the domain and the number of basis functions $m$ tend to infinity.
The above theorem means that, in the limit, any probabilistic inference in the approximate model will be equivalent to inference in the exact model, because the prior and likelihood models become equivalent. The benefit of considering the $m$-rank model instead of a standard GP is the following:
Theorem 4.2.
Provided the rank-reduced approximation, the computational load scales as $\mathcal{O}(m^2 T)$, i.e., linearly in the number of data points $T$, as opposed to the $\mathcal{O}(T^3)$ scaling of a naive GP-SSM implementation.
Furthermore, the proposed learning procedure enjoys sound theoretical properties:
Theorem 4.3.
The Markov chain generated by Algorithm 1 admits the exact posterior distribution of the parameters and states as its stationary distribution, for any finite number of particles in Algorithm 2.
Hence, Theorem 4.3 guarantees that our learning procedure is indeed sampling from the distribution we expect it to, even when a finite number of particles is used in the Monte Carlo based Algorithm 2. It is also possible to prove uniform ergodicity for Algorithm 1, as such a result exists for Algorithm 2 (Lindsten et al., 2014).
5 Numerical Examples
Table 1: Results for the numerical examples.

Data / Method | RMSE | LL | Train time [min] | Test time [s] | Comments
Synthetic data:
PMCMC GP-SSM (Frigola et al., 2013) | | | 547 | | As reported by Frigola et al. (2014b).
Variational GP-SSM (Frigola et al., 2014b) | | | | | As reported by Frigola et al. (2014b).
Reduced-rank GP-SSM | | | | | Average over 10 runs.
Damper modeling:
Linear OE model (4th order) | | N/A | | |
Hammerstein–Wiener (4th order) | | N/A | | |
NARX (3rd order, wavelet network) | | N/A | | |
NARX (3rd order, tree partition) | | N/A | | |
NARX (3rd order, sigmoid network) | | N/A | | |
Reduced-rank GP-SSM | | | | |
Energy forecasting:
Static GP | | | | |
Reduced-rank GP-SSM | | | | |
In this section, we demonstrate and evaluate our contribution, the model (5) and the associated learning Algorithm 1, using four numerical examples. We demonstrate and evaluate the proposed method (including the convergence of the learning algorithm) on two synthetic examples and two real-world datasets, as well as making a comparison with other methods. In all examples, we separate the data set into training data and evaluation data $y^{\star}_{1:T}$. To evaluate the performance quantitatively, we compare the predictions $\hat{y}_t$ (with predictive variances $\hat{\sigma}_t^2$) to the true data $y^{\star}_t$ using the root mean square error (RMSE) and the mean log likelihood (LL):
(13)  $\mathrm{RMSE} = \sqrt{\tfrac{1}{T} \sum_{t=1}^{T} \big(y^{\star}_t - \hat{y}_t\big)^2}$
and
(14)  $\mathrm{LL} = \tfrac{1}{T} \sum_{t=1}^{T} \log \mathcal{N}\big(y^{\star}_t \mid \hat{y}_t, \hat{\sigma}_t^2\big).$
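Read concretely, with Gaussian predictive densities $\mathcal{N}(y^{\star}_t \mid \hat{y}_t, \hat{\sigma}_t^2)$, the two metrics (13) and (14) can be computed as follows (the toy arrays are illustrative):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error, Eq. (13)."""
    return float(np.sqrt(np.mean((y_true - y_pred)**2)))

def mean_log_lik(y_true, y_pred_mean, y_pred_var):
    """Mean log likelihood under Gaussian predictive densities, Eq. (14)."""
    return float(np.mean(-0.5 * np.log(2.0 * np.pi * y_pred_var)
                         - 0.5 * (y_true - y_pred_mean)**2 / y_pred_var))

y_star = np.array([1.0, 2.0, 3.0])
r = rmse(y_star, np.array([1.0, 2.0, 4.0]))     # sqrt(1/3)
ll = mean_log_lik(y_star, y_star, np.ones(3))   # -0.5 * log(2 * pi)
```

Unlike the RMSE, the LL rewards a well-calibrated predictive variance, which is why it is the more informative metric for a Bayesian method.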
The source code for all examples is available via the first author's homepage.
5.1 Synthetic Data
As a proof of concept, already presented in Figure 1, we have simulated data from the model
(15) 
where the process and measurement noise are Gaussian. We inferred $f(\cdot)$ and $Q$, using a GP with the exponentiated quadratic (squared exponential, parametrized as in Rasmussen and Williams 2006) covariance function with unknown hyperparameters, and priors as in (6). In this one-dimensional case ($n_x = 1$), the eigenvalues and eigenfunctions on the domain $[-L, L]$ are $\lambda_j = \big(\tfrac{\pi j}{2L}\big)^2$ and $\phi^{j}(x) = \tfrac{1}{\sqrt{L}} \sin\big(\tfrac{\pi j (x + L)}{2L}\big)$. The spectral density corresponding to the covariance function is $S(\omega) = \sigma_f^2 \sqrt{2 \pi \ell^2}\, \exp\big(-\tfrac{\omega^2 \ell^2}{2}\big)$.
The posterior estimate of the learned model is shown in Figure 1, together with samples of the basis function weights $w^{j}$. The variance of the posterior distribution of $f(\cdot)$ increases in the regimes where the data is not exciting the model.
As a second example, we repeat the numerical benchmark on synthetic data from Frigola et al. (2014b): a one-dimensional state space model with a transition function defined piecewise over the state space, a known linear measurement equation, and Gaussian process and measurement noise. The model is learned from a set of training data points and evaluated on held-out data. As in Frigola et al. (2014b), a Matérn covariance function is used (see, e.g., Section 4.2.1 of Rasmussen and Williams 2006 for details, including its spectral density). The results for our model are provided in Table 1.
We also restate two results from Frigola et al. (2014b): the GP-SSM method by Frigola et al. (2013) (which also uses particle MCMC for learning) and the variational GP-SSM by Frigola et al. (2014b). Due to the compact writing in Frigola et al. (2013, 2014b), we have not been able to reproduce the results, but to make the comparison as fair as possible, we average our results over 10 runs (with different realizations of the training data). Our method was evaluated using the provided Matlab implementation on a standard desktop computer (Intel i7-4600 2.1 GHz CPU).
The choice to use a relatively small number of MCMC iterations of the learning algorithm is motivated by Figure 2, which illustrates the model quality (in terms of log likelihood and RMSE) as a function of the number of iterations $K$: it is clear from Figure 2 that the model quality is of the same magnitude after a few hundred samples as after many more. As we know that the sampler converges to the right distribution in the limit $K \to \infty$, this indicates that the sampler converges already after a few hundred samples for this example. This is most likely thanks to the linear-in-parameter structure of the reduced-rank GP, allowing for the efficient Gibbs updates (10)–(11).
There is an advantage for our proposed reduced-rank GP-SSM in terms of LL, but considering the stochastic elements involved in the experiment, the different RMSE performance results are hardly outside the error bounds. Regarding the computational load, however, there is a substantial advantage for our proposed method, which enjoys a training time of only about a third of that of the variational GP-SSM, which in turn outperforms the method by Frigola et al. (2013).
5.2 Nonlinear Modeling of a Magneto-Rheological Fluid Damper
We also compare our proposed method to state-of-the-art conventional system identification methods (Ljung, 1999). The problem is the modeling of the input–output behavior of a magneto-rheological fluid damper, introduced by Wang et al. (2009) and used as a case study in the System Identification Toolbox for Mathworks Matlab (The MathWorks, Inc., 2015). Part of the data is used for training and the remainder for evaluation, as shown in Figure 2(a). The data exhibits some nontrivial dynamics, and as the training data probably does not contain enough information to determine the system uniquely, a certain amount of uncertainty is present in the posterior. This is thus an interesting and realistic problem for a Bayesian method, as it can possibly provide useful information about the posterior uncertainty that is not captured by classical maximum likelihood methods for system identification.
We learn a three-dimensional model:
(16a)  
(16b) 
where the noise terms are Gaussian and the nonlinear functions are unknown. We assume a GP prior with an exponentiated quadratic covariance function, with separate length-scales for each dimension. We use a number of basis functions for each unknown function (explicit expressions for the basis functions in the multidimensional case are found in the supplementary material), resulting in a set of basis function weights and hyperparameters to sample.
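The multidimensional basis functions referred to here can, following the hyperrectangle construction of Solin and Särkkä (2014), be formed as tensor products of the one-dimensional eigenfunctions, with the eigenvalues adding across dimensions. The sketch below is our reconstruction; the paper's exact expressions are in its supplementary material.

```python
import numpy as np
from itertools import product

def tensor_basis(n_per_dim, L):
    """Laplace eigenpairs on the hyperrectangle [-L_1, L_1] x ... x [-L_d, L_d],
    built as tensor products of the 1-D Dirichlet eigenfunctions."""
    L = np.asarray(L, dtype=float)
    idx = np.array(list(product(*[range(1, n + 1) for n in n_per_dim])))  # (m, d)
    lam = np.sum((np.pi * idx / (2.0 * L))**2, axis=1)    # summed 1-D eigenvalues

    def phi(x):
        x = np.asarray(x, dtype=float)
        # product over dimensions of sqrt(1/L_k) sin(pi j_k (x_k + L_k) / (2 L_k))
        return np.prod(np.sqrt(1.0 / L) *
                       np.sin(np.pi * idx * (x + L) / (2.0 * L)), axis=1)

    return lam, phi

lam, phi = tensor_basis([3, 3], [4.0, 4.0])   # 9 basis functions in 2-D
vals = phi([0.5, -0.5])
```

Note that the total number of basis functions grows as the product of the per-dimension counts, which is what makes the per-dimension rule of thumb in Section 6.1 important for higher-dimensional states.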
The learned model was used to simulate a distribution of the output for the test data, plotted in Figure 2(a). Note how the variance of the prediction changes in different regimes of the plot, quantifying the uncertainty in the posterior belief. The resulting output is also evaluated quantitatively in Table 1, together with five state-of-the-art maximum likelihood methods, and our proposed method performs on par with the best of these. The learning algorithm took about two hours to run on a standard desktop computer.
The assumed model with known linear and additive form could be replaced by an even more general structure, but this choice seems to give a sensible trade-off between structure (reducing computational load) and flexibility (increasing computational load) for this particular problem. Our proposed Bayesian method does indeed appear to be a realistic alternative to the maximum likelihood methods, without any more problem-specific tailoring than the rather natural model assumption (16a).
5.3 Energy Consumption Forecasting
As a fourth example, we consider the problem of forecasting the daily energy consumption in Sweden four days in advance (data from Svenska Kraftnät, available at http://www.svk.se/aktorsportalen/elmarknad/statistik/). The daily data from 2013 was used for training, and the data from 2014 for evaluation. The time series was modeled as an autonomous dynamical system (driven only by noise), and a three-dimensional reduced-rank GP-SSM was trained for this, with all functions and parameters unknown. To obtain the forecast, the model was used inside a particle filter to find the state distribution, and the four-step-ahead prediction density was computed. The data and the predictions are shown in Figure 2(b).
6 Discussion
6.1 Tuning
For a successful application of the proposed algorithm, there are a few algorithm-specific parameters for the user to choose: the number of basis functions $m$ and the number of particles $N$ in PGAS. A large number of basis functions makes the model more flexible and the reduced-rank approximation 'closer' to a non-approximated GP, but it also increases the computational load. With a smooth covariance function $\kappa$, the prior variance $S(\sqrt{\lambda_j})$ of the weights is in practice negligible for moderately large $j$, and $m$ can be chosen fairly small (as a rule of thumb, say, 6–15 per dimension) without making a too crude approximation. In our experience, the number of particles $N$ in PGAS can also be chosen fairly small, without heavily affecting the mixing properties of the Markov chain. This is in accordance with what has been reported in the literature by Lindsten et al. (2014).
6.2 Properties of the Proposed Model
We have proposed to use the reduced-rank approximation of GPs by Solin and Särkkä (2014) within a state space model, to obtain a GP-SSM which can be learned efficiently using a particle MCMC algorithm. As discussed in Section 3 and studied in the numerical examples in Section 5, the linear-in-the-parameter structure of the reduced-rank GP-SSM allows for a computationally efficient learning algorithm. However, the question of whether a similar performance could be obtained with another GP approximation method or another learning scheme arises naturally.
Other GP approximation methods, for example pseudo-inputs, would most likely not allow for such efficient learning as the reduced-rank approximation does: unless closed-form Gibbs updates are available (requiring a linear-in-the-parameter structure, or similar), the parameter learning would have to resort to Metropolis–Hastings, which would most likely give a significantly slower learning procedure. For many GP approximation methods it is also more natural to find a point estimate of the parameters (the inducing points, for example) using, for example, EM, rather than inferring the parameter posterior, as is the case in this paper.
The learning algorithm, on the other hand, could probably be replaced by some other method that also infers (at least approximately) the posterior distribution of the parameters, such as SMC$^2$ (Chopin et al., 2013) or a variational method. However, to maintain efficiency, the method has to utilize the linear-in-the-parameter structure of the model to reach a computational load competitive with our proposed scheme. Such an alternative (which, however, only infers a MAP estimate of the sought quantities) could possibly be the method by Kokkala et al. (2014).
6.3 Conclusions
We have proposed the reduced-rank GP-SSM (5), and provided theoretical support for its convergence towards the full GP-SSM. We have also proposed a theoretically sound MCMC-based learning algorithm (including the hyperparameters) that utilizes the structure of the model efficiently.
By demonstration on several examples, the computational load and the modeling capabilities of our approach have been shown to be competitive. The computational load of the learning is even smaller than in the variational sparse GP solution provided by Frigola et al. (2014b), and the performance in challenging input–output modeling is on par with well-established state-of-the-art maximum likelihood methods.
6.4 Possible Extensions and Further Work
A natural extension, for applications where some domain knowledge is present, is to let the model include some functions with an a priori known parametrization. The handling of such models in the learning algorithm should be feasible, as it is already known how to use PGAS for such models (Lindsten et al., 2014). Further, the prior assumptions (6) are possible to circumvent by using, for example, MH, at the cost of an increased computational load. The same holds true for the Gaussian noise assumption in (5).
Another direction for further work is to adapt the procedure to sequentially learn and improve the model when data is added in batches, by formulating the previously learned model as the prior for the next iteration of the learning. This could probably be useful in, for example, reinforcement learning, along the lines of Deisenroth et al. (2015).
In the engineering literature, dynamical systems are mostly defined in discrete time. An interesting approach to modeling the continuous-time counterpart using Gaussian processes is presented by Ruttor et al. (2013). A development of the reduced-rank GP-SSM to continuous-time dynamical models using stochastic Runge–Kutta methods would be of great interest for further research.
References
Rasmussen and Williams (2006) C. E. Rasmussen and C. K. I. Williams. Gaussian Processes for Machine Learning. MIT Press, Cambridge, MA, 2006.
Turner et al. (2010) R. D. Turner, M. P. Deisenroth, and C. E. Rasmussen. State-space inference and learning with Gaussian processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics, pages 868–875, 2010.
Frigola et al. (2013) R. Frigola, F. Lindsten, T. B. Schön, and C. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems, volume 26, pages 3156–3164, 2013.
Frigola et al. (2014a) R. Frigola, F. Lindsten, T. B. Schön, and C. Rasmussen. Identification of Gaussian process state-space models with particle stochastic approximation EM. In Proceedings of the 19th IFAC World Congress, pages 4097–4102, 2014a.
Frigola et al. (2014b) R. Frigola, Y. Chen, and C. Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, volume 27, pages 3680–3688, 2014b.
Mattos et al. (2016) C. L. C. Mattos, Z. Dai, A. Damianou, J. Forth, G. A. Barreto, and N. D. Lawrence. Recurrent Gaussian processes. arXiv preprint arXiv:1511.06644, 2016. To be presented at the 4th International Conference on Learning Representations (ICLR), San Juan, Puerto Rico, May 2016.
 McHutchon (2014) A. McHutchon. Nonlinear Modelling and Control Using Gaussian Processes. PhD thesis, University of Cambridge, 2014.
 FrigolaAlcade (2015) R. FrigolaAlcade. Bayesian Time Series Learning with Gaussian Processes. PhD thesis, University of Cambridge, 2015.
 Deisenroth et al. (2015) M. P. Deisenroth, D. Fox, and C. E. Rasmussen. Gaussian processes for dataefficient learning in robotics and control. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(2):408–423, 2015.
 Ghahramani and Roweis (1998) Z. Ghahramani and S. T. Roweis. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems, volume 11, pages 431–437, 1998.
Tsay (2010) R. S. Tsay. Analysis of Financial Time Series. Wiley, Hoboken, NJ, 3rd edition, 2010.
Sjöberg et al. (1995) J. Sjöberg, Q. Zhang, L. Ljung, A. Benveniste, B. Delyon, P.-Y. Glorennec, H. Hjalmarsson, and A. Juditsky. Nonlinear black-box modeling in system identification: a unified overview. Automatica, 31(12):1691–1724, 1995.
 Ljung (1999) L. Ljung. System Identification: Theory for the User. Prentice Hall, Upper Saddle River, NJ, 1999.
The MathWorks, Inc. (2015) The MathWorks, Inc. Nonlinear modeling of a magneto-rheological fluid damper. Example file provided by Matlab R2015b System Identification Toolbox, 2015. Available at http://mathworks.com/help/ident/examples/nonlinearmodelingofamagnetorheologicalfluiddamper.html.
Alvarez et al. (2009) M. A. Alvarez, D. Luengo, and N. D. Lawrence. Latent force models. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics, pages 9–16, 2009.
 Wang et al. (2008) J. M. Wang, D. J. Fleet, and A. Hertzmann. Gaussian process dynamical models for human motion. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):283–298, 2008.
 Damianou et al. (2011) A. Damianou, M. K. Titsias, and N. D. Lawrence. Variational Gaussian process dynamical systems. In Advances in Neural Information Processing Systems, volume 24, pages 2510–2518, 2011.
 Deisenroth et al. (2012) M. P. Deisenroth, R. D. Turner, M. F. Huber, U. D. Hanebeck, and C. E. Rasmussen. Robust filtering and smoothing with Gaussian processes. IEEE Transactions on Automatic Control, 57(7):1865–1871, 2012.
 Deisenroth and Mohamed (2012) M. Deisenroth and S. Mohamed. Expectation propagation in Gaussian process dynamical systems. In Advances in Neural Information Processing Systems (NIPS), volume 25, pages 2609–2617, 2012.
Solin and Särkkä (2014) A. Solin and S. Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. arXiv preprint arXiv:1401.5508, 2014.
 Lindsten et al. (2014) F. Lindsten, M. I. Jordan, and T. B. Schön. Particle Gibbs with ancestor sampling. Journal of Machine Learning Research, 15(1):2145–2184, 2014.
 Wills et al. (2012) A. Wills, T. B. Schön, F. Lindsten, and B. Ninness. Estimation of linear systems using a Gibbs sampler. In Proceedings of the 16^{th} IFAC Symposium on System Identification, pages 203–208, 2012.
 LázaroGredilla et al. (2010) M. LázaroGredilla, J. QuiñoneroCandela, C. E. Rasmussen, and A. R. FigueirasVidal. Sparse spectrum Gaussian process regression. Journal of Machine Learning Research, 11(1):1865–1881, 2010.
 Andrieu et al. (2010) C. Andrieu, A. Doucet, and R. Holenstein. Particle Markov chain Monte Carlo methods. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 72(3):269–342, 2010.
 Doucet and Johansen (2011) A. Doucet and A. M. Johansen. A tutorial on particle filtering and smoothing: Fifteen years later. In D. Crisan and B. Rozovsky, editors, Nonlinear Filtering Handbook, pages 656–704. Oxford University Press, Oxford, 2011.
Wang et al. (2009) J. Wang, A. Sano, T. Chen, and B. Huang. Identification of Hammerstein systems without explicit parameterisation of nonlinearity. International Journal of Control, 82(5):937–952, 2009.
Chopin et al. (2013) N. Chopin, P. E. Jacob, and O. Papaspiliopoulos. SMC^2: An efficient algorithm for sequential analysis of state space models. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 75(3):397–426, 2013.
 Kokkala et al. (2014) J. Kokkala, A. Solin, and S. Särkkä. Expectation maximization based parameter estimation by sigmapoint and particle smoothing. In Proceedings of 17^{th} International Conference on Information Fusion, pages 1–8, 2014.
 Ruttor et al. (2013) A. Ruttor, P. Batz, and M. Opper. Approximate Gaussian process inference for the drift of stochastic differential equations. In Advances in Neural Information Processing Systems, volume 26, pages 2040–2048, 2013.
 Särkkä and Piché (2014) S. Särkkä and R. Piché. On convergence and accuracy of state-space approximations of squared exponential covariance functions. In Proceedings of the International Workshop on Machine Learning for Signal Processing, 2014.
 Tierney (1994) L. Tierney. Markov chains for exploring posterior distributions. Annals of Statistics, 22(4):1701–1728, 1994.
Supplementary material
This is the supplementary material for ‘Computationally Efficient Bayesian Learning of Gaussian Process State Space Models’ by Svensson, Solin, Särkkä and Schön, presented at AISTATS 2016. The references in this document point to the bibliography in the article.
1 Proofs
Proof of Theorem 4.1.
Let us start by considering the GP approximation $\tilde f(\cdot)$ to the transition function $f(\cdot)$. By Theorem 4.4 of Solin and Särkkä (2014), when the domain size $L \to \infty$ and the number of basis functions $m \to \infty$, the approximate covariance function converges pointwise to the exact one. As the prior means of the exact and approximate GPs are both zero, the prior means thus converge as well. By a similar argument as is used in the proof of Theorem 2.2 in Särkkä and Piché (2014), it follows that the posterior mean and covariance functions converge pointwise as well.
Now, consider the random variables defined by
(17)  $x_{t+1} = f(x_t) + w_t,$
(18)  $\tilde x_{t+1} = \tilde f(\tilde x_t) + w_t,$
where $\tilde f$ is an $m$-term series expansion approximation to the GP $f$. It now follows that for any fixed $x_t = \tilde x_t$ the mean and covariance of $x_{t+1}$ and $\tilde x_{t+1}$ coincide when $m \to \infty$. However, because these random variables are Gaussian, the first two moments determine the whole distribution and hence we can conclude that $\tilde x_{t+1} \to x_{t+1}$
in distribution. For the measurement model we can similarly consider the random variables
(19)  $y_t = g(x_t) + e_t,$
(20)  $\tilde y_t = \tilde g(\tilde x_t) + e_t.$
By a similar argument as above, we can conclude that the approximation converges in distribution. ∎
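The pointwise convergence of the approximate covariance function can be illustrated numerically. The sketch below is our own illustration (not code from the paper): it uses the sinusoidal Laplace eigenbasis of Solin and Särkkä (2014) on a 1D domain $[-L, L]$ together with the spectral density of the squared exponential kernel, and compares the $m$-term approximation to the exact kernel well inside the domain.

```python
import numpy as np

# Exact squared exponential kernel and its spectral density (1D),
# following the conventions in Solin and Sarkka (2014).
def k_se(r, ell=1.0, sf2=1.0):
    return sf2 * np.exp(-r**2 / (2.0 * ell**2))

def spectral_density_se(w, ell=1.0, sf2=1.0):
    return sf2 * np.sqrt(2.0 * np.pi) * ell * np.exp(-(ell**2) * w**2 / 2.0)

def k_reduced_rank(x, xp, L=4.0, m=128, ell=1.0, sf2=1.0):
    """m-term series approximation of the SE kernel on the domain [-L, L]."""
    j = np.arange(1, m + 1)
    lam_sqrt = np.pi * j / (2.0 * L)                 # sqrt of Laplace eigenvalues
    phi_x = np.sin(np.outer(x + L, lam_sqrt)) / np.sqrt(L)   # eigenfunctions
    phi_xp = np.sin(np.outer(xp + L, lam_sqrt)) / np.sqrt(L)
    S = spectral_density_se(lam_sqrt, ell, sf2)      # spectral-density weights
    return (phi_x * S) @ phi_xp.T

x = np.linspace(-1.0, 1.0, 41)
K_exact = k_se(x[:, None] - x[None, :])
K_approx = k_reduced_rank(x, x)
max_err = np.max(np.abs(K_exact - K_approx))
print(max_err)  # small for large enough L and m
```

Increasing $m$ and $L$ drives the error further down, in line with the theorem; shrinking $L$ towards the data range reintroduces a boundary error.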
Proof of Theorem 4.2.
Provided the reduced-rank approximation of the Gram matrix, the reduction in the computational load follows directly from an application of the matrix inversion lemma. ∎
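The computational saving can be checked numerically. With an $n \times n$ Gram matrix replaced by a rank-$m$ term $\Phi S \Phi^{\mathsf{T}}$, the matrix inversion lemma reduces the $\mathcal{O}(n^3)$ inverse to the inverse of an $m \times m$ matrix. The sketch below is a generic illustration with dimensions and variable names of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, s2 = 500, 10, 0.1                 # many data points, few basis functions

Phi = rng.standard_normal((n, m))       # basis functions at the inputs
S = np.diag(rng.uniform(0.5, 2.0, m))   # diagonal spectral-density weights

# Direct O(n^3) inverse of the reduced-rank covariance plus noise.
A = Phi @ S @ Phi.T + s2 * np.eye(n)
direct = np.linalg.inv(A)

# Matrix inversion lemma: only an m x m matrix has to be inverted.
inner = np.linalg.inv(np.linalg.inv(S) + Phi.T @ Phi / s2)   # m x m
woodbury = np.eye(n) / s2 - (Phi @ inner @ Phi.T) / s2**2

print(np.max(np.abs(direct - woodbury)))  # the two inverses agree
```

The same identity applied inside the sampler is what yields the linear-in-$n$ cost per iteration.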
Proof of Theorem 4.3.
Using fundamental properties of the Gibbs sampler (see, e.g., Tierney (1994)), the claim holds if all steps of Algorithm 1 leave the correct conditional probability density invariant. Step 5 is justified by Lindsten et al. (2014) (even for a finite number of particles $N$), and steps 6–7 by Wills et al. (2012). Further, step 8 can be seen as a Metropolis-within-Gibbs procedure (Tierney, 1994). ∎
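The Metropolis-within-Gibbs argument can be made concrete on a toy target: replacing one exact Gibbs conditional with a Metropolis step that targets the same conditional still leaves the joint distribution invariant. The sketch below is our own toy example (a correlated bivariate Gaussian), not Algorithm 1 itself:

```python
import numpy as np

rng = np.random.default_rng(1)
rho = 0.8                    # target: zero-mean bivariate Gaussian, corr. rho
cond_var = 1.0 - rho**2      # conditional variance of either coordinate

def log_cond(x, other):      # log p(x | other), up to an additive constant
    return -0.5 * (x - rho * other) ** 2 / cond_var

x1, x2 = 0.0, 0.0
samples = np.empty((20000, 2))
for i in range(len(samples)):
    # Exact Gibbs draw for x1 from p(x1 | x2).
    x1 = rho * x2 + np.sqrt(cond_var) * rng.standard_normal()
    # Metropolis-within-Gibbs step for x2, targeting p(x2 | x1).
    prop = x2 + 0.8 * rng.standard_normal()
    if np.log(rng.random()) < log_cond(prop, x1) - log_cond(x2, x1):
        x2 = prop
    samples[i] = x1, x2

print(samples.mean(axis=0), np.corrcoef(samples.T)[0, 1])
```

The chain recovers the zero mean and the correlation $\rho$, as invariance of both steps guarantees.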
2 Details on Matrix Normal and Inverse Wishart distributions
As presented in the article, the matrix normal inverse Wishart (MNIW) distribution is the conjugate prior for state space models linear in their parameters (Wills et al., 2012). The MNIW distribution can be written as $\mathcal{MN}(A \mid M, Q, V)\,\mathcal{IW}(Q \mid \ell, \Lambda)$, where each part is defined as follows:
The pdf of the matrix normal distribution for an $n \times m$ matrix $A$ with mean matrix $M$, left (row) covariance $Q$ and right (column) covariance $V$ is
$\mathcal{MN}(A \mid M, Q, V) = \frac{1}{(2\pi)^{nm/2}\,|Q|^{m/2}\,|V|^{n/2}} \exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left[V^{-1}(A-M)^{\mathsf{T}} Q^{-1}(A-M)\right]\right).$
The pdf of the inverse Wishart distribution with $\ell$ degrees of freedom and positive definite scale matrix $\Lambda$ is
(21)  $\mathcal{IW}(Q \mid \ell, \Lambda) = \frac{|\Lambda|^{\ell/2}}{2^{\ell n/2}\,\Gamma_n(\ell/2)}\,|Q|^{-(\ell+n+1)/2} \exp\!\left(-\tfrac{1}{2}\operatorname{tr}\!\left(\Lambda Q^{-1}\right)\right),$
with $\Gamma_n(\cdot)$ being the multivariate gamma function and $n$ the dimension of $Q$.
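The log of the density (21) is straightforward to implement (the provided iwishpdf.m does this in Matlab). The Python sketch below is our own rendering, with the multivariate gamma function written out via its product form; it is checked against the scalar case, where the inverse Wishart reduces to an inverse gamma density:

```python
import math
import numpy as np

def multigammaln(a, d):
    """Logarithm of the multivariate gamma function Gamma_d(a)."""
    return (d * (d - 1) / 4.0) * math.log(math.pi) + \
        sum(math.lgamma(a + (1 - j) / 2.0) for j in range(1, d + 1))

def iw_logpdf(Q, ell, Lam):
    """Log-pdf of the inverse Wishart distribution, cf. (21)."""
    n = Q.shape[0]
    _, logdet_Lam = np.linalg.slogdet(Lam)
    _, logdet_Q = np.linalg.slogdet(Q)
    return (0.5 * ell * logdet_Lam
            - 0.5 * ell * n * math.log(2.0)
            - multigammaln(0.5 * ell, n)
            - 0.5 * (ell + n + 1) * logdet_Q
            - 0.5 * np.trace(Lam @ np.linalg.inv(Q)))

# Scalar sanity check: IW(q; ell, lam) is inverse gamma with
# shape ell/2 and scale lam/2.
q, ell, lam = 0.7, 5.0, 2.0
ig = 0.5 * ell * math.log(0.5 * lam) - math.lgamma(0.5 * ell) \
    - (0.5 * ell + 1.0) * math.log(q) - 0.5 * lam / q
print(iw_logpdf(np.array([[q]]), ell, np.array([[lam]])), ig)
```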
To sample from the $\mathcal{MN}(M, Q, V)$ distribution, one may sample an $n \times m$ matrix $X$ of i.i.d. $\mathcal{N}(0, 1)$ random variables, and obtain $A$ as $A = M + \operatorname{chol}(Q)\, X \operatorname{chol}(V)^{\mathsf{T}}$, where $\operatorname{chol}(\cdot)$ denotes the lower triangular Cholesky factor ($Q = \operatorname{chol}(Q)\operatorname{chol}(Q)^{\mathsf{T}}$).
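This sampling recipe can be verified empirically: if $A = M + \operatorname{chol}(Q)\,X\operatorname{chol}(V)^{\mathsf{T}}$, then $\operatorname{vec}(A)$ has covariance $V \otimes Q$. A minimal check (dimensions and parameter values are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
M = np.zeros((2, 2))
Q = np.array([[1.0, 0.3], [0.3, 0.8]])   # left (row) covariance
V = np.array([[0.5, 0.1], [0.1, 0.4]])   # right (column) covariance
LQ, LV = np.linalg.cholesky(Q), np.linalg.cholesky(V)

n_samples = 100_000
X = rng.standard_normal((n_samples, 2, 2))
A = M + LQ @ X @ LV.T                    # batch of matrix normal samples

# Stack columns: cov(vec(A)) should be the Kronecker product V (x) Q.
vecs = A.transpose(0, 2, 1).reshape(n_samples, 4)
emp_cov = np.cov(vecs.T)
print(np.max(np.abs(emp_cov - np.kron(V, Q))))  # sampling error only
```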
3 Eigenfunctions for Multi-Dimensional Spaces
4 Provided Matlab Software
The following Matlab files are available via the first author's homepage:
File                     Use                                           Comments
synthetic_example_1.m    First synthetic example (including Figure 1)
synthetic_example_2.m    Second synthetic example
damper.m                 MR damper example                             For other results, see The MathWorks, Inc. (2015)
energy_forecast.m        Energy consumption forecasting example
iwishpdf.m               Implements (21)
mvnpdf_log.m             Logarithm of the normal distribution pdf
systematic_resampling.m  Systematic resampling (Step 7, Algorithm 2)
All files are published under the GPL license.
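For reference, the systematic resampling step implemented in systematic_resampling.m can be written in a few lines. The following is a Python rendering of the standard algorithm, not the distributed Matlab code:

```python
import numpy as np

def systematic_resampling(weights, rng):
    """Return ancestor indices drawn by systematic resampling.

    A single uniform draw is shared by all particles: the i-th index is
    the weight-CDF inverse evaluated at (u + i) / n for u ~ U[0, 1).
    """
    n = len(weights)
    positions = (rng.random() + np.arange(n)) / n
    return np.searchsorted(np.cumsum(weights), positions)

rng = np.random.default_rng(0)
w = np.array([0.1, 0.6, 0.1, 0.2])
print(systematic_resampling(w, rng))
```

Compared with multinomial resampling, only one uniform random number is needed and the resulting particle counts have lower variance.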