1 Introduction
Many time series are well explained by assuming their progression is determined by an underlying dynamical system. A model of this dynamical system can be used to make strong predictions of the future behaviour of the time series. We often require predictions of systems with unknown dynamics, be it for economic models of financial markets, or models of physical systems to be controlled with model predictive control or model-based reinforcement learning. Particularly in low-data regimes, estimates of the uncertainty in the predictions are crucial for making robust decisions, as the expected utility of an action can depend strongly on the distribution of the outcome (von Neumann et al., 1944). A striking example of this is model-based policy search (Deisenroth & Rasmussen, 2011).

There are many approaches to time series prediction. Within the machine learning community, autoregressive (AR) models
(Billings, 2013) and state-space models (SSMs) are popular, in part due to the relative ease with which predictions can be made. AR predictions are obtained by a learned mapping from the last observations to the next one, whereas SSMs model the underlying dynamical system by learning a transition function which maps a system state forward in time. At each point in time, the state contains sufficient information for predicting both future states and observations. While AR models can be easier to train, SSMs have the potential to be more data-efficient, due to their minimal state representation.

We aim to learn complex, possibly stochastic, nonlinear dynamical systems from noisy data, following the Bayesian paradigm in order to capture model uncertainty. We use a Gaussian process (GP) prior (Rasmussen & Williams, 2005) for the transition function, giving the Gaussian Process State Space Model (GPSSM) (Frigola et al., 2013). Aside from capturing uncertainty, GPs are non-parametric, guaranteeing that our model complexity will not saturate as we observe more data.
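The AR versus state-space distinction above can be sketched in a few lines. The toy system, noise levels, and estimators below are made up purely for illustration: the AR model learns a direct mapping from the last observation to the next, while the state-space view rolls a transition function forward from a state estimate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 1-D system (all numbers are illustrative): the latent state x evolves
# under a fixed transition, and we observe y = x + noise.
def transition(x):
    return 0.9 * x + 0.1      # fixed point at x = 1.0

T = 50
x = np.empty(T)
x[0] = 2.0
for t in range(1, T):
    x[t] = transition(x[t - 1])
y = x + 0.01 * rng.standard_normal(T)

# AR view: learn a direct mapping from the last observation to the next one
# (here an AR(1) model with intercept, fitted by least squares on y).
Phi = np.stack([y[:-1], np.ones(T - 1)], axis=1)
coef, *_ = np.linalg.lstsq(Phi, y[1:], rcond=None)
ar_pred = coef[0] * y[-1] + coef[1]

# SSM view: roll the (here known) transition forward from the state estimate.
ssm_pred = transition(x[-1])
```

With a well-estimated state, both views predict the next value near the fixed point; the state-space prediction needs only the current state, not a window of past observations.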
Despite the challenge of performing accurate approximate inference in the GPSSM, an impressive amount of progress has been made (Frigola et al., 2014; McHutchon, 2014; Eleftheriadis et al., 2017; Ialongo et al., 2017; Doerr et al., 2018), helped along by the development of elegant variational inference techniques (Titsias, 2009; Hensman et al., 2013) which retain the key non-parametric property of GPs. In this work, we improve variational approximations by critically examining existing methods and their failure modes. We propose a family of non-factorised variational posteriors which alleviate crucial problems with earlier approaches while maintaining the same efficiency. We refer to the proposed approach as VCDT: variationally coupled dynamics and trajectories.
2 Background
State-space models are common in machine learning, and appear in many forms. At the most basic level, linear-Gaussian state-space models can be learned by maximum likelihood, for instance by combining Kalman smoothing with EM (Roweis & Ghahramani, 1999). Extensions have been developed for deterministic nonlinear transitions (Ghahramani & Roweis, 1999). Recurrent neural networks, like the popular LSTM (Hochreiter & Schmidhuber, 1996), learn deterministic mappings with state-space structure for sequence prediction, and have seen successful recent use in a wide range of tasks including translation (Sutskever et al., 2014).

2.1 Bayesian non-parametric approaches
We are particularly interested in prediction tasks which require good estimates of uncertainty, for example, for use in model-based reinforcement learning systems (Deisenroth & Rasmussen, 2011). These applications distinguish themselves by requiring 1) continual learning, as datasets are incrementally gathered, and 2) uncertainty estimates to ensure that policies learned are robust to the many dynamics that are consistent with a small dataset. Bayesian non-parametric models provide a unified and elegant way of solving both these problems: model complexity scales with the size of the dataset and is controlled by the Bayesian Occam's razor (Rasmussen & Ghahramani, 2000). In this work, we focus on approximations (Titsias, 2009; Hensman et al., 2013; Matthews et al., 2016) which offer improved computational scaling with the ability to fully recover the original non-parametric model.

2.2 Gaussian process state-space models
Recurrent models with GP transitions (GPSSM) have been proposed in a variety of ways, each with its own inference method (Wang et al., 2005; Ko & Fox, 2009; Turner et al., 2010; Frigola et al., 2013). Early methods relied on maximum a posteriori inference for either the latent states (Wang et al., 2005; Ko & Fox, 2009) or the transition function (Turner et al., 2010). Frigola et al. (2013) presented the first fully Bayesian treatment with a particle MCMC method that sampled over the latent states and the GP transition function. Due to issues with computational complexity and sampling in higher dimensional latent spaces, later attention turned to variational methods. All variational inference schemes have so far relied on independence assumptions between latent states and transition function (Frigola et al., 2014; McHutchon, 2014; Ialongo et al., 2017), sometimes even factorising the state distribution over time (Mattos et al., 2016). Eleftheriadis et al. (2017) introduced a recognition model to help with optimisation of the variational distribution, while keeping the mean-field assumption. Recently, Doerr et al. (2018) introduced the first method to account for the dependence between transition function and latent states by using a doubly stochastic inference scheme similar to Salimbeni & Deisenroth (2017). However, their approach has severe limitations which we will discuss and show experimentally. Bui et al. (2016) investigated Power Expectation Propagation as an alternative approach for fitting factorised approximate posteriors.
2.3 Variational inference approaches
Variational methods often make simplifying assumptions about the form of the approximate posterior to improve computational tractability. This may result in significantly biased solutions. The bias is particularly severe if independence assumptions are made where strong correlations are actually present (Turner & Sahani, 2011). Clearly, in dynamical systems, the trajectory of the latent states depends strongly on the dynamics. Hence, we focus on improving existing variational inference schemes by removing the independence assumption between latent states and transition function. We identify a general method that performs well across varying noise scales.
3 The Model
3.1 State-space model structure
We model discrete-time sequences of observations $y_1, \dots, y_T$, where $y_t \in \mathbb{R}^E$, by positing that each data point is generated by a corresponding latent variable $x_t \in \mathbb{R}^D$ (the system's "state"). These latent variables form a Markov chain, implying that, for any timestep $t$, we can generate $x_{t+1}$ by conditioning only on $x_t$ and the transition function $f$. We take the system to evolve according to a single, time-invariant transition function, which we would like to infer. While our inference scheme provides the freedom to choose arbitrary transition and observation density functions, in keeping with previous literature, we use Gaussians. We also assume a linear mapping $C$ between $x_t$ and the mean of $p(y_t \mid x_t)$. This does not limit the range of systems that can be modelled, though it may require us to choose a higher dimensional latent state (Frigola, 2015). We make this choice to reduce the non-identifiabilities between transitions and emissions. To reduce them further we may also impose constraints on the scale of $f$ or of the linear mapping $C$. This allows the model to be expressed by the following equations:

(1)  $f \sim \mathcal{GP}(m(\cdot), k(\cdot, \cdot))$
(2)  $x_0 \sim p(x_0)$
(3)  $x_{t+1} \mid f, x_t \sim \mathcal{N}(x_{t+1};\, f(x_t),\, Q)$
(4)  $y_t \mid x_t \sim \mathcal{N}(y_t;\, C x_t,\, R)$
where we take the "process noise" covariance matrix $Q$ to be diagonal, encouraging the transition function to account for all correlations between latent dimensions. The crucial difference from a standard state-space model is that we have placed a GP prior over the transition function. Thus, for a given mean function $m(\cdot)$ and positive-definite covariance function $k(\cdot, \cdot)$, any finite collection of function evaluations, at arbitrary inputs $\tilde{x}_1, \dots, \tilde{x}_N$, will have a Gaussian joint distribution:

(5)  $p(f(\tilde{x}_1), \dots, f(\tilde{x}_N)) = \mathcal{N}(\mathbf{m}, \mathbf{K})$

where $[\mathbf{m}]_i = m(\tilde{x}_i)$ and $[\mathbf{K}]_{ij} = k(\tilde{x}_i, \tilde{x}_j)$. This gives the joint density of our model:

(6)  $p(y_{1:T}, x_{0:T}, f) = p(f)\, p(x_0) \prod_{t=1}^{T} p(x_t \mid f, x_{t-1})\, p(y_t \mid x_t)$
In order to draw samples, each function value has to be drawn by conditioning on all the previously sampled function values. Writing $f_t \triangleq f(x_{t-1})$:

(7)  $x_t \mid f_t \sim \mathcal{N}(x_t;\, f_t,\, Q)$
(8)  $f_t \mid f_{1:t-1}, x_{0:t-1} \sim \mathcal{N}(f_t;\, \mu_t,\, \Sigma_t)$
(9)  $\mu_t = m(x_{t-1}) + \mathbf{K}_{t,1:t-1} \mathbf{K}_{1:t-1,1:t-1}^{-1} \left(f_{1:t-1} - m_{1:t-1}\right)$
(10) $\Sigma_t = k(x_{t-1}, x_{t-1}) - \mathbf{K}_{t,1:t-1} \mathbf{K}_{1:t-1,1:t-1}^{-1} \mathbf{K}_{1:t-1,t}$
By multiplying the Gaussian conditionals, we can return to our GP prior as in eq. 5. Although the function values $f_{1:T}$ are jointly Gaussian, the states $x_{0:T}$ are not necessarily so, since the GP's mean and kernel functions specify nonlinear relations between inputs $x_{t-1}$ and outputs $f_t$.
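The sequential sampling scheme of eqs. 7 to 10 can be sketched as follows for a one-dimensional state. The kernel, noise level, and jitter are illustrative choices, not the paper's: each new function value is drawn from the GP conditioned on all previously sampled values, so the trajectory remains consistent with a single underlying function.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative choices: zero mean function, squared-exponential kernel.
def k(a, b):
    a, b = np.atleast_1d(a), np.atleast_1d(b)
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2)

def sample_gpssm_prior(T, x0=0.0, q=0.01, jitter=1e-6):
    """Draw one state trajectory from a 1-D GP-SSM prior.

    Each new function value f_t = f(x_{t-1}) is sampled from the GP
    conditioned on ALL previously sampled values f_{1:t-1}, keeping the
    trajectory consistent with a single underlying function."""
    xs = [x0]
    fs = []                                 # sampled function values so far
    for t in range(T):
        x_prev = xs[-1]
        if fs:
            X = np.array(xs[:-1])           # inputs of past evaluations
            F = np.array(fs)
            K = k(X, X) + jitter * np.eye(len(X))
            k_star = k(np.array([x_prev]), X)[0]
            w = np.linalg.solve(K, k_star)
            mu = w @ F                      # conditional mean (zero prior mean)
            var = max(k(x_prev, x_prev)[0, 0] - w @ k_star, 0.0)
        else:                               # first step: unconditional prior
            mu, var = 0.0, k(x_prev, x_prev)[0, 0]
        f_t = mu + np.sqrt(var) * rng.standard_normal()
        fs.append(f_t)
        xs.append(f_t + np.sqrt(q) * rng.standard_normal())  # process noise
    return np.array(xs), np.array(fs)

xs, fs = sample_gpssm_prior(T=30)
```

The growing conditioning set is what makes this expensive: step $t$ solves a linear system of size $t-1$, which is the cubic cost discussed in section 3.4.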
3.2 Multivariate latent states
If our state has more than one dimension (i.e. $D > 1$), we model the system's transitions by placing a GP prior on each of the dimension-specific functions, and we write:

(11) $f = (f^{(1)}, \dots, f^{(D)}), \qquad f^{(d)} \sim \mathcal{GP}(m_d(\cdot), k_d(\cdot, \cdot))$

where each independent GP has its own mean and kernel functions, $m_d(\cdot)$ and $k_d(\cdot, \cdot)$. For ease of notation, we drop the subscripts indexing state dimensions. Since the GPs are independent from each other, each process only has to condition on its own function evaluations. Thus, the multivariate conditional $p(f_t \mid f_{1:t-1}, x_{0:t-1})$ is a diagonal Gaussian where the mean and variance of dimension $d$ are as in eq. 9 and eq. 10, using $m_d$ as mean function and $k_d$ as kernel function.

3.3 Control inputs
A control input $c_t$ influences the transition in a Markovian manner. We can therefore view it as being part of an "augmented" latent state $\hat{x}_t = (x_t, c_t)$. This makes our transition function $f(\hat{x}_t)$, and requires only a change to $f$'s kernel and mean functions to be defined over the augmented input dimensions. Our transition probability is now also conditioned on $c_t$: $p(x_{t+1} \mid f, x_t, c_t)$, though, for brevity, we drop $c_t$ from the notation, as it is not a random variable and we always condition on it.
3.4 Computational costs of sampling
Despite the conceptual similarity between the non-parametric GPSSM and parametric nonlinear state-space models, it is considerably more expensive to sample from the GPSSM prior. The main reason for this can be seen in eqs. 7 to 10: to sample $f_t$ we have to condition on the $t-1$ previously sampled points. This requires incrementally building up a matrix inverse, which costs $O(T^3)$ in time over the whole sequence. Since our non-parametric transition function is implicitly defined by the function values sampled so far, it is necessary to condition on all previous samples to ensure our overall function is self-consistent over the whole time series.
This poses a problem for MCMC-based inference methods, which rely on drawing samples from distributions closely related to the prior. Prediction from a GP posterior also has cubic cost. One way to avoid this cost is to sample from an adapted GP prior (e.g. FITC (Snelson & Ghahramani, 2005)), which was suggested by Frigola et al. (2013).
Modifying the model in this way can have unintended consequences (Bauer et al., 2016), so we focus on performing inference without assumptions on the structure of the GP or its kernel. Our inference scheme should approximate the correct model, while avoiding the cubic cost at training time.
4 Variational inference
4.1 General variational bound
We are interested in approximating the posterior $p(x_{0:T}, f \mid y_{1:T})$ using variational inference.¹ The general procedure is to construct a variational lower bound $\mathcal{L}$ to the log marginal likelihood that has the KL divergence from an approximation $q$ to the true posterior as its slack: $\log p(y_{1:T}) = \mathcal{L} + \mathrm{KL}\left[q(x_{0:T}, f) \,\|\, p(x_{0:T}, f \mid y_{1:T})\right]$. By maximising $\mathcal{L}$ w.r.t. $q$, we improve the quality of the approximation.

¹We abuse notation by directly denoting a density over the infinite-dimensional $f$, even though such a density does not exist. We only ever concern ourselves with finite subsets of $f$.
Practical stochastic variational inference places two requirements on approximating distributions. First, we need to be able to generate samples from them and, second, we need to be able to evaluate their density. We start by deriving the lower bound for a general joint approximate posterior $q(x_{0:T}, f)$, to which we will later add constraints in order to produce different tractability-accuracy trade-offs. Following the usual derivation using Jensen's inequality (Blei et al., 2017), we obtain

(12) $\log p(y_{1:T}) \geq \mathbb{E}_{q(x_{0:T}, f)}\left[\log \frac{p(y_{1:T}, x_{0:T}, f)}{q(x_{0:T}, f)}\right]$
(13) $\phantom{\log p(y_{1:T})} = \mathbb{E}_{q(x_{0:T}, f)}\left[\sum_{t=1}^{T} \log p(y_t \mid x_t) + \log \frac{p(x_{0:T}, f)}{q(x_{0:T}, f)}\right] \triangleq \mathcal{L}$
For $q(f)$ we will use a sparse Gaussian process (Titsias, 2009; Hensman et al., 2013; Matthews, 2016). This constraint is ubiquitous in models where the inputs to a GP are random variables (Damianou et al., 2016), such as the GPLVM (Titsias & Lawrence, 2010), or deep GP (Damianou & Lawrence, 2013), as it provides analytical as well as computational tractability. A sparse GP approximation specifies a free Gaussian density $q(u)$ on the function values $u$ at a set of inducing input locations $Z$, and uses the prior conditional for the rest of the Gaussian process, which we denote $p(f_{\neq u} \mid u)$. At this time, we let $q(x_{0:T})$ depend on the entire process $f$ for generality:

(14) $q(x_{0:T}, f) = p(f_{\neq u} \mid u)\, q(u)\, q(x_{0:T} \mid f)$

Choosing the posterior in this way results in a finite KL between the approximate posterior and the prior, which becomes $\mathrm{KL}\left[q(u) \,\|\, p(u)\right]$ (Matthews et al., 2016).
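A minimal sketch of the sparse construction: given $q(u) = \mathcal{N}(m, S)$ at inducing inputs $Z$, the marginal of $q(f)$ at any test inputs follows from the standard sparse variational GP identities. The kernel and the values of $m$ and $S$ below are made up for illustration.

```python
import numpy as np

# Standard sparse variational GP predictive density at test inputs X*, given
# q(u) = N(m, S) at inducing inputs Z (zero prior mean assumed):
#   mean = K*z Kzz^{-1} m
#   cov  = K** - K*z Kzz^{-1} (Kzz - S) Kzz^{-1} Kz*

def kern(A, B, ls=1.0):
    d = A[:, None] - B[None, :]
    return np.exp(-0.5 * (d / ls) ** 2)

def sparse_gp_predict(Xs, Z, m, S, jitter=1e-8):
    Kzz = kern(Z, Z) + jitter * np.eye(len(Z))
    Ksz = kern(Xs, Z)
    Kss = kern(Xs, Xs)
    A = np.linalg.solve(Kzz, Ksz.T).T       # Ksz Kzz^{-1}
    mean = A @ m
    cov = Kss - A @ (Kzz - S) @ A.T
    return mean, cov

Z = np.linspace(-2.0, 2.0, 5)               # illustrative inducing inputs
m = np.sin(Z)                               # illustrative q(u) mean
S = 0.01 * np.eye(5)                        # illustrative q(u) covariance
mean, cov = sparse_gp_predict(np.array([0.0]), Z, m, S)
```

A useful sanity check of the identities: predicting at the inducing inputs themselves recovers $m$ and $S$ (up to jitter), since there the prior conditional collapses onto $u$.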
4.2 Approximate posterior design choices
Table 1: Choices of the posterior conditional $q(x_t \mid x_{t-1}, \cdot)$ (conditional mean shown; covariance $S_{t-1}$) and the associated sampling cost.

   Posterior                     | Conditional mean                      | Sampling
1) Factorised, linear            | $A_{t-1} x_{t-1} + b_{t-1}$           | $O(T)$
2) Factorised, non-linear        | $A_{t-1}\, \mu(x_{t-1}) + b_{t-1}$    | $O(T)$
3) Non-factorised, non-linear    | $A_{t-1}\, f(x_{t-1}) + b_{t-1}$      | $O(T^3)$
4) VCDT                          | as (3), with $f$'s dependence restricted to $u$ | $O(T)$
So far, the only simplifying assumptions have been reducing $q(f_{\neq u} \mid u)$ to the prior conditional and $q(u)$ to a Gaussian. Now we have to specify the form of $q(x_{0:T} \mid f)$. Most existing approximate inference schemes focused on factorised posteriors to obtain an efficient sampling scheme or even a closed form for $\mathcal{L}$. In particular, Frigola et al. (2014) find the optimal variational $q(x_{0:T})$ by calculus of variations and sample from it, whereas McHutchon (2014) and Ialongo et al. (2017) constrain it to be Markov-Gaussian. A factorised, Gaussian posterior together with Gaussian emissions leads to a Gaussian optimal $q(f)$, which, for certain kernels, gives us a closed-form bound. Eleftheriadis et al. (2017) reduced the memory cost associated with storing the Markov-Gaussian $q(x_{0:T})$ and improved optimisation behaviour by employing a recurrent neural network as a recognition model. This also parallels the work of Krishnan et al. (2017), where recurrent neural networks perform smoothing in order to learn a deterministic model of the dynamics.
In our experiments, we show that imposing independence between $q(x_{0:T})$ and $q(f)$ does in fact encourage overconfident beliefs about the dynamics, which is a well-known phenomenon in variational inference (Turner & Sahani, 2011). To avoid this we focus on approximate posteriors which preserve correlations between $x_{0:T}$ and $f$ in various forms. We impose one more general constraint: Markovian structure in $q(x_{0:T} \mid f)$. This structure is required to eliminate the impractical number of correlations that could otherwise be expressed between the large number of random variables in $x_{0:T}$. Furthermore, the true posterior is also Markovian in $x_{0:T}$:

(15) $p(x_{0:T} \mid f, y_{1:T}) = p(x_0 \mid f, y_{1:T}) \prod_{t=1}^{T} p(x_t \mid x_{t-1}, f, y_{t:T})$

Hence the general form of our approximate posterior:

(16) $q(x_{0:T} \mid f) = q(x_0) \prod_{t=1}^{T} q(x_t \mid x_{t-1}, f)$
where we are free to specify $q(x_0)$ and $q(x_t \mid x_{t-1}, f)$. We can thus simplify the ELBO from eq. 13 to

(17) $\mathcal{L} = \sum_{t=1}^{T} \mathbb{E}_{q}\left[\log p(y_t \mid x_t)\right] - \mathrm{KL}\left[q(x_0) \,\|\, p(x_0)\right] - \mathrm{KL}\left[q(u) \,\|\, p(u)\right] - \sum_{t=1}^{T} \mathbb{E}_{q}\left[\mathrm{KL}\left[q(x_t \mid x_{t-1}, f) \,\|\, p(x_t \mid x_{t-1}, f)\right]\right]$

Now we add two restrictions: 1) we use a Gaussian $q(x_t \mid x_{t-1}, f)$ to obtain a tractable KL, and 2) we assume its conditional means are linear and depend only on a finite number of points on $f$, to make the expectation over $q$ tractable. We choose the weights $A_t$, offsets $b_t$, and covariances $S_t$ to be free variational parameters in all cases (see table 1 for how $x_t$ and $x_{t-1}$ are related). They are crucial to allow information from the observations to make its way into our latent state posterior. We now discuss various choices for the conditional mean, and the influence they have on the flexibility of the approximate posterior, which we summarise in table 1. The densities of all these variational distributions can still be evaluated, and exact samples can be drawn for unbiased evaluation of the bound.
4.3 A natural choice
Another way to justify our choice of posterior is that, for Gaussian emissions, the exact posterior "filtering" factors $p(x_t \mid x_{t-1}, f, y_t)$ are Gaussians of precisely the form we propose. In fact, we can compute the corresponding $A$, $b_t$, and $S$ in closed form:

(18) $S = \left(Q^{-1} + C^{\top} R^{-1} C\right)^{-1}$
(19) $A = S\, Q^{-1}$
(20) $b_t = S\, C^{\top} R^{-1} y_t$

so that $p(x_t \mid x_{t-1}, f, y_t) = \mathcal{N}(x_t;\, A f(x_{t-1}) + b_t,\, S)$. Notice that here $A$ and $S$ are constant through time, reducing the time and memory footprint of the algorithm. While the exact posterior "smoothing" factors $p(x_t \mid x_{t-1}, f, y_{t:T})$ are not generally Gaussian or closed-form (hence leaving $A_t$, $b_t$, and $S_t$ free), in settings where the observation noise is small, filtering and smoothing will give very similar results, making the filtering factors very close to optimal. Likewise, if future observations hold little information about the present due to process noise corruption, the filtering factors will provide good approximations to the conditional state posterior.
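These filtering parameters are an instance of standard Gaussian conditioning: combining the Gaussian transition prior over $x_t$ with the linear-Gaussian emission likelihood. A sketch with illustrative values for $C$, $Q$, and $R$, checked against the equivalent Kalman-gain form of the same update:

```python
import numpy as np

# Standard Gaussian conditioning: combining the transition prior
# N(x_t; f(x_{t-1}), Q) with the likelihood N(y_t; C x_t, R) gives a
# Gaussian in x_t with
#   S = (Q^{-1} + C^T R^{-1} C)^{-1},  A = S Q^{-1},  b_t = S C^T R^{-1} y_t,
# so the filtering factor is N(x_t; A f(x_{t-1}) + b_t, S).

def filtering_params(C, Q, R):
    Qi = np.linalg.inv(Q)
    Ri = np.linalg.inv(R)
    S = np.linalg.inv(Qi + C.T @ Ri @ C)    # constant through time
    A = S @ Qi                              # constant through time
    B = S @ C.T @ Ri                        # b_t = B @ y_t (depends on y_t)
    return A, B, S

D = 2
C = np.eye(D)                               # illustrative emission matrix
Q = 0.1 * np.eye(D)                         # illustrative process noise
R = 0.5 * np.eye(D)                         # illustrative observation noise
A, B, S = filtering_params(C, Q, R)
```

The same update written with a Kalman gain $K = Q C^{\top}(C Q C^{\top} + R)^{-1}$ gives mean $f + K(y - Cf)$ and covariance $(I - KC)Q$; the two forms agree, which is a convenient correctness check.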
4.4 Factorised approximations
We can recover the existing Gaussian factorised approximation of McHutchon (2014) by dropping the dependence on $f$ altogether and setting the conditional mean to $A_{t-1} x_{t-1} + b_{t-1}$ (see 1 in table 1). The linear relationship between consecutive states implies a linearisation of the transition function. Yet, the true posterior will contain influence from $x_{t-1}$ only through the nonlinear transition function.
We can break the linearisation constraint by correlating $x_t$ with $x_{t-1}$ "pushed through" the approximate GP mean $\mu(\cdot)$ (2 in table 1). This results in a non-Gaussian $q(x_{0:T})$ which is still factorised from $q(f)$. Since this corresponds to marginalising $f$ out of $q(x_t \mid x_{t-1}, f)$, we also let the GP's predictive variance enter $S_{t-1}$, to allow our state posterior to be more "uncertain" precisely where our GP posterior tells us to. If the observations provide sufficient evidence to discount the GP's uncertainty, optimisation will drive $A_{t-1}$ and $S_{t-1}$ down, giving us a concentrated state posterior.
The main reason we introduce this modified factorised method is so we can directly isolate the effect of introducing dependence between $x_{0:T}$ and $f$, compared to simply making $q(x_{0:T})$ non-Gaussian.
Both approximate posteriors can be sampled in $O(T)$ time due to their Markovian structure. The factorised, linear posterior is particularly convenient as it allows us to exploit GPU-optimised banded-matrix operations as in Durrande et al. (2019).
4.5 Direct dependence on $f$
The factorised, nonlinear approximation from the previous section attempts to indirectly incorporate the learned transition function into the approximation over $x_{0:T}$. Instead of summarising the information using the mean of $q(f)$, we want to break the factorisation between $x_{0:T}$ and $f$ by using the actual, stochastic $f$ in the approximation. This is summarised in table 1, line 3. The actual function value $f(x_{t-1})$ of a function in the posterior is weighted by $A_{t-1}$ and offset by $b_{t-1}$.
Sampling from this approximate posterior poses a challenge. From eq. 16, we note that every time an $f(x_{t-1})$ is sampled, it has to be consistent with a single function throughout the entire time series. This means that $f(x_{t-1})$ has to be sampled by conditioning on the function values sampled for earlier timesteps, resulting in an $O(T^3)$ cost (as per section 3.4). This high cost prohibits learning in long time series. We can circumvent this issue by cutting the approximate posterior into subsequences of length $s$. The state immediately after a subsequence, e.g. $x_{s+1}$, is given a non-conditional posterior:

(21) $q(x_{s+1}) = \mathcal{N}(x_{s+1};\, \mu_{s+1},\, \Sigma_{s+1})$

This reduces the cost of sampling to $O(T s^2)$ (for $s \ll T$), and allows for mini-batching over subsequences. However, this factorisation catastrophically "breaks" long-range dependences, again introducing a potential mismatch between our state and transition function posteriors. In the extreme case of $s = 1$, we would obtain an approximate posterior which factorises completely across time, as in Mattos et al. (2016).
4.6 VCDT and "u-dependence"
The previous section introduces an approximate posterior exhibiting dependence between $x_{0:T}$ and $f$. This will generally come at a cubic cost, since our transition function has potentially as many degrees of freedom as timesteps. Parametric models avoid this cubic cost since they can sample a transition function with fixed degrees of freedom before sampling the entire time series. Conveniently, our sparse variational GP posterior separates the finite degrees of freedom $u$, which we can learn from the data, from the rest of the potentially infinite degrees of freedom of our prior.
In order to gain an approximate posterior with a tractable sampling scheme, we choose $q(x_t \mid x_{t-1}, f) = q(x_t \mid x_{t-1}, u)$ ((4) in table 1). Crucially, we can sample from this posterior in linear time, as only a single $u$ needs to be sampled for an entire sample trajectory. We are effectively being parametric about our sample trajectories while minimising a KL distance from the full non-parametric model. By the properties of sparse variational GPs (Matthews et al., 2016), we can reduce this distance by adding more inducing points.
Intuitively, by finely tiling with inducing points the parts of the state-space we are likely to traverse with our sample paths, we can fully specify the behaviour of our transition function in those regions. In this case, conditioning on $u$ is sufficient to "tie" down the transitions arbitrarily tightly, leaving very little to be gained by additionally conditioning on sampled function values. The elegance of this approach lies in the fact that we can assess how many inducing points are required by finding the saturation point where the converged ELBO values stop improving as more points are added.
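A sketch of this linear-time sampling scheme: draw $u \sim q(u)$ once, then roll the trajectory forward conditioning each transition only on that fixed $u$. The kernel, the inducing setup, and the posterior-transition parameters below are all made up for illustration (the conditional mean given $u$ serves as the transition mean, with a fixed variance standing in for the learned $S_t$).

```python
import numpy as np

rng = np.random.default_rng(3)

def kern(a, b):
    d = np.atleast_1d(a)[:, None] - np.atleast_1d(b)[None, :]
    return np.exp(-0.5 * d ** 2)

# Sample u ~ q(u) ONCE per trajectory, then condition every transition only
# on that fixed u, never on earlier sampled function values.
Z = np.linspace(-3.0, 3.0, 10)              # illustrative inducing inputs
m_u = np.sin(Z)                             # illustrative q(u) mean
S_u = 0.01 * np.eye(len(Z))                 # illustrative q(u) covariance
Kzz = kern(Z, Z) + 1e-6 * np.eye(len(Z))

def rollout(x0, T, S=0.01):
    u = m_u + np.linalg.cholesky(S_u) @ rng.standard_normal(len(Z))
    alpha = np.linalg.solve(Kzz, u)         # precomputed once: O(M^3)
    xs = [x0]
    for _ in range(T):                      # each step is O(M), so O(TM) total
        mu = kern(xs[-1], Z)[0] @ alpha     # E[f(x_{t-1}) | u], zero prior mean
        xs.append(mu + np.sqrt(S) * rng.standard_normal())
    return np.array(xs)

traj = rollout(0.0, T=100)
```

Because $u$ is fixed along the trajectory, the sampled dynamics are self-consistent, yet the per-step cost no longer grows with $t$: the cubic conditioning of section 3.4 collapses to a single solve against the inducing points.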
4.7 Comparison to PRSSM
Doerr et al. (2018) were, to the best of our knowledge, the first to consider a non-factorised variational posterior for the GPSSM. Their work, however, has two significant shortcomings. Firstly, $q(x_t \mid x_{t-1}, f)$ is taken to be the same as the prior transition $p(x_t \mid x_{t-1}, f)$. This leaves no free parameters, except for those in $q(u)$ and $q(x_0)$, to incorporate information from the observations into our posterior: no filtering or smoothing is possible. This gives a good approximation only when the process noise is low and/or the observed sequence is short, as low noise levels can compound in a long sequence. Even in the absence of process noise, using the observations to update our beliefs about the states can help optimisation greatly by effectively setting up easier "targets" for the GP to go through. Notice that by setting $A_t = I$, $b_t = 0$, and $S_t = Q$ in our posterior transitions we can also force them to match the prior, making PRSSM a special case of our more flexible parameterisation.
Secondly, Doerr et al. (2018) employ a sampling scheme which gives incorrect marginal samples of $x_{0:T}$, even assuming, as they did, that:

(22) $q(f_1, \dots, f_T \mid u) = \prod_{t=1}^{T} p(f_t \mid u)$

Despite this conditional factorisation assumption, once $u$ is integrated out, as in PRSSM, the $f_t$'s become correlated. Thus, in order to correctly sample $x_{0:T}$, we need to condition each $f_t$ on all previous samples $f_{1:t-1}$, introducing a cubic time cost. This is ignored in PRSSM, and samples from the posterior over $f_t$ are effectively drawn independently for every timestep $t$. The resulting samples from $q(x_{0:T})$ are biased. A mismatch is thus introduced between the functional form of the approximate posterior (which has correlations between function values), and the samples we use to compute its expectations. Hence, PRSSM's objective does not correspond to a valid variational lower bound.
We solve this by explicitly sampling $u$ and conditioning on it. Conditioning on the same $u$ enforces consistency of the dynamics along a sample trajectory. However, we also wish to avoid the factorisation assumption of eq. 22, as that does not generally correspond to a valid GP prior. Explicitly assuming $q(x_t \mid x_{t-1}, f) = q(x_t \mid x_{t-1}, u)$, we obtain a valid variational bound where we only need to condition on $u$ to sample $x_t$, giving us an $O(T)$ cost while maintaining a full GP prior.
5 Experiments
In the experiments we set out to test how the proposed posterior (VCDT) compares with the factorised, linear, the factorised, non-linear, and the PRSSM (Doerr et al., 2018) approximations. Each model serves as a "control" for a specific experimental question:

Factorised, linear: what is gained by making our approximation non-linear? I.e. what if our posterior $q(x_{0:T})$ were not jointly Gaussian, and only its Markovian conditional factors were Gaussian?

Factorised, non-linear: what is gained by introducing dependence between our states and our transition function? I.e. what if we did not marginalise $f$ and $u$ from our posterior over $x_{0:T}$?

PRSSM: what is gained, or lost, by forcing our model to find transition functions that explain the data well, without accounting for process noise?
To answer these questions, we test model calibration and predictive performance in a number of environments: the “kink” function, a set of five system identification benchmark datasets (also used in (Doerr et al., 2018)), and a cart and pole system.
5.1 “Kink” Function
We generate data according to the transition function displayed in red in fig. 1.
This function generates cyclic trajectories through the state-space, as it combines two linear segments with different slopes (one on the left, one on the right). It also contains a nonlinearity at the joining of these segments, giving it a variety of dynamics and serving as an interesting testing environment. We train our models on the data described in fig. 1, fixing their emissions to the true, generative ones so as to enable direct latent space comparison.²

²This is to sidestep any scale invariances built into our model.
In order to make inference challenging, we add a small amount of process noise and a significant amount of observation noise. As we can see from fig. 2, in this regime PRSSM is unable to learn, given that its posterior encourages it to find trajectories with no process noise. Indeed, regardless of which data we are training on, during optimisation PRSSM always drives the process noise to zero. This leads it to converge to poor solutions when process noise exists and/or to drive up the observation noise (across all experiments, PRSSM learned the highest values for the observation noise). Even initialising PRSSM at the solution found by other methods leads to the same suboptimal fit of fig. 2. On the other hand, the GPSSM models can handle process noise and recover inflated estimates of it in order to ascribe some transitions to noise and justify smoother dynamics. The factorised, linear model's estimates of the process noise are consistently the highest (across many settings of noise parameters and sequence lengths), followed by the factorised, non-linear model. As can be seen in fig. 3, VCDT finds the most well-calibrated posterior. The factorised approaches (the linear one is shown, though they gave a very similar fit) result in a posterior that is not only less accurate, but also more confidently wrong. In general, the posteriors found by the factorised approaches favour higher process noise to achieve smoother, overconfident dynamics. This is consistent with the behaviour discussed in Turner & Sahani (2011): mean-field variational inference favours more concentrated posteriors. By matching our true posterior's structure more closely and modelling the dependence between $x_{0:T}$ and $f$, we can overcome this issue.
5.2 System identification benchmarks
We also train our models on five benchmark datasets taken from De Moor et al. (1997). Test set results are reported in table 2. As in Doerr et al. (2018), we train on the first half of the sequence, we normalise the data, and we use a 4-dimensional latent state (plus 1 non-stochastic dimension for the control inputs). No mini-batching was used, the bound is evaluated using 100 samples from the posterior, and we use 100 variational inducing points. The test results are for predictions 30 steps into the future from the end of the training sequence.
We initialise all models at the solution found by the factorised, non-linear method to assess differences in optima of the variational objectives. This turned out to be necessary for PRSSM, as learning on such long sequences, without mini-batching, proved very challenging without any filtering from the observations. As can be seen from table 2, PRSSM can often find good solutions despite its constraints. This is particularly true for low noise regimes, as in the "Drive" dataset. The GPSSM models are more robust in general though, with VCDT in the lead among GPSSMs for NLPP in all but the last two datasets. No error bars are displayed since the initialisation was deterministic, no mini-batching was used, and 100 samples from the posterior gave low variance estimates of the training objective. To compute the test statistics accurately, we used a much larger number of samples.

Table 2: Test set results on the five system identification benchmarks.

Model                    | Actuator      | Ballbeam       | Drive         | Dryer         | Gas Furnace
Factorised, linear       | 0.364; 0.154  | 0.486; 0.075   | 0.770; 0.439  | 0.709; 0.098  | 0.296; 0.170
Factorised, non-linear   | 0.641; 0.142  | 1.379; 0.073   | 0.283; 0.246  | 1.310; 0.049  | 0.264; 0.168
VCDT                     | 0.644; 0.141  | 1.395; 0.072   | 0.238; 0.285  | 1.282; 0.050  | 0.377; 0.169
PRSSM (initialised)      | 3.653; 1.976  | 24.765; 1.118  | 0.649; 0.139  | 1.265; 0.053  | 0.144; 0.162
5.3 Cart and pole
Finally, we consider a cart and pole system (or inverted pendulum). We add Gaussian noise to the states in order to obtain our observations. Standard deviations are: 0.03; 0.1047; 0.4; 1.3963; for the respective dimensions of our system: cart position, pendulum angle, cart velocity and angular velocity of the pendulum. The noise s.d. levels correspond to 3cm; 6 degrees; 3cm/0.075sec; 6 degrees/0.075sec. Results for models fit using 100 samples from the posterior (per optimisation step) and 300 inducing points are shown in
table 3. Because there is no process noise and the training sequences are short (20 timesteps), we see that PRSSM performs the best in terms of predictive performance. Of course, PRSSM can be viewed as a special case of our approximation, albeit with a mismatched sampling scheme. Running the experiment with the correct sampling scheme and fixing the variational parameters to $A_t = I$, $b_t = 0$, and $S_t = Q$, we learn a solution with the same performance as PRSSM. Among the GPSSM models, VCDT performs the best, and, as can be seen in fig. 4, it also recovers the lowest noise levels, using the dynamics to explain more of the structure in the data.

Table 3: Cart and pole test results.

Model                        | NLPP   | RMSE
Factorised, linear           | 2.268  | 3.548
Factorised, non-linear       | 1.847  | 2.974
VCDT                         | 0.694  | 2.139
PRSSM                        | 0.049  | 1.613
PRSSM (corrected sampling)   | 0.053  | 1.628
6 Conclusion
The GPSSM is a powerful formalism. The main challenge is to perform fast, accurate inference. Variational inference maintains many of the benefits of using non-parametric models, but faces particular difficulties in time series models due to their sensitivity to factorisation assumptions. Naive non-factorising approximations regain the poor computational scaling that variational methods were introduced to avoid. By exploiting the low-rank structure of the approximate GP posterior, we were able to construct a non-factorised posterior with the desired computational scaling. This often leads to better predictive performance, calibration, and an improved estimation of model parameters.
Acknowledgements
ADI would like to acknowledge the generous support of the Cambridge-Tübingen and Qualcomm Innovation fellowships.
References
 Bauer et al. (2016) Bauer, M., van der Wilk, M., and Rasmussen, C. E. Understanding probabilistic sparse Gaussian process approximations. In Lee, D. D., Sugiyama, M., Luxburg, U. V., Guyon, I., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 29, 2016.
 Billings (2013) Billings, S. Nonlinear System Identification: NARMAX Methods in the Time, Frequency, and Spatio-Temporal Domains. Wiley, 2013. ISBN 9781119943594.
 Blei et al. (2017) Blei, D. M., Kucukelbir, A., and McAuliffe, J. D. Variational inference: A review for statisticians. Journal of the American Statistical Association, 112(518):859–877, 2017.
 Bui et al. (2016) Bui, T., Turner, R. E., and Rasmussen, C. E. Bayesian Gaussian process state-space models via Power-EP. In ICML 2016 Workshop on Data-efficient Machine Learning, 2016.

 Damianou & Lawrence (2013) Damianou, A. and Lawrence, N. Deep Gaussian processes. In Proceedings of the 16th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 207–215, 2013.
 Damianou et al. (2016) Damianou, A. C., Titsias, M. K., and Lawrence, N. D. Variational inference for latent variables and uncertain inputs in Gaussian processes. The Journal of Machine Learning Research, 17(1):1425–1486, 2016.
 De Moor et al. (1997) De Moor, B., De Gersem, P., De Schutter, B., and Favoreel, W. DAISY: A database for identification of systems. JOURNAL A, 38:4–5, 1997. URL http://homes.esat.kuleuven.be/~smc/daisy/.
 Deisenroth & Rasmussen (2011) Deisenroth, M. and Rasmussen, C. E. PILCO: A model-based and data-efficient approach to policy search. In Proceedings of the 28th International Conference on Machine Learning (ICML), 2011.
 Doerr et al. (2018) Doerr, A., Daniel, C., Schiegg, M., Nguyen-Tuong, D., Schaal, S., Toussaint, M., and Trimpe, S. Probabilistic recurrent state-space models. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.
 Durrande et al. (2019) Durrande, N., Adam, V., Bordeaux, L., Eleftheriadis, S., and Hensman, J. Banded matrix operators for Gaussian Markov models in the automatic differentiation era. In Proceedings of the 22nd International Conference on Artificial Intelligence and Statistics (AISTATS), 2019.
 Eleftheriadis et al. (2017) Eleftheriadis, S., Nicholson, T., Deisenroth, M., and Hensman, J. Identification of Gaussian process state-space models. In Advances in Neural Information Processing Systems 30, 2017.
 Frigola (2015) Frigola, R. Bayesian time series learning with Gaussian processes. PhD thesis, University of Cambridge, 2015.
 Frigola et al. (2013) Frigola, R., Lindsten, F., Schön, T. B., and Rasmussen, C. E. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems 26, 2013.
 Frigola et al. (2014) Frigola, R., Chen, Y., and Rasmussen, C. E. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems 27, 2014.
 Ghahramani & Roweis (1999) Ghahramani, Z. and Roweis, S. T. Learning nonlinear dynamical systems using an EM algorithm. In Advances in Neural Information Processing Systems 12, 1999.
 Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for Big Data. In Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI), 2013.
 Hochreiter & Schmidhuber (1996) Hochreiter, S. and Schmidhuber, J. LSTM can solve hard long time lag problems. In Advances in Neural Information Processing Systems 9, 1996.
 Ialongo et al. (2017) Ialongo, A. D., van der Wilk, M., and Rasmussen, C. E. Closed-form inference and prediction in Gaussian process state-space models. In NIPS 2017 Time-Series Workshop, 2017.
 Ko & Fox (2009) Ko, J. and Fox, D. GP-BayesFilters: Bayesian filtering using Gaussian process prediction and observation models. Autonomous Robots, 27(1):75–90, 2009.
 Krishnan et al. (2017) Krishnan, R. G., Shalit, U., and Sontag, D. Structured inference networks for nonlinear state-space models. In Proceedings of the AAAI Conference on Artificial Intelligence, 2017.
 Matthews et al. (2016) Matthews, A., Hensman, J., Turner, R., and Ghahramani, Z. On sparse variational methods and the Kullback-Leibler divergence between stochastic processes. In Proceedings of the 19th International Conference on Artificial Intelligence and Statistics (AISTATS), 2016.
 Matthews (2016) Matthews, A. G. d. G. Scalable Gaussian process inference using variational methods. PhD thesis, University of Cambridge, 2016.
 Mattos et al. (2016) Mattos, C. L. C., Dai, Z., Damianou, A., Forth, J., Barreto, G. A., and Lawrence, N. D. Recurrent Gaussian processes. In Proceedings of the International Conference on Learning Representations (ICLR), 2016.
 McHutchon (2014) McHutchon, A. Nonlinear modelling and control using Gaussian processes. PhD thesis, University of Cambridge, 2014.
 Rasmussen & Ghahramani (2000) Rasmussen, C. E. and Ghahramani, Z. Occam’s razor. In Advances in Neural Information Processing Systems 13, 2000.
 Rasmussen & Williams (2005) Rasmussen, C. E. and Williams, C. K. I. Gaussian Processes for Machine Learning. The MIT Press, 2005.
 Roweis & Ghahramani (1999) Roweis, S. and Ghahramani, Z. A unifying review of linear Gaussian models. Neural computation, 11(2):305–345, 1999.
 Salimbeni & Deisenroth (2017) Salimbeni, H. and Deisenroth, M. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems 30, pp. 4588–4599, 2017.
 Snelson & Ghahramani (2005) Snelson, E. and Ghahramani, Z. Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems 18, 2005.
 Sutskever et al. (2014) Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to sequence learning with neural networks. In Advances in Neural Information Processing Systems 27, 2014.
 Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the 12th International Conference on Artificial Intelligence and Statistics (AISTATS), 2009.
 Titsias & Lawrence (2010) Titsias, M. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 844–851, 2010.
 Turner et al. (2010) Turner, R., Deisenroth, M., and Rasmussen, C. State-space inference and learning with Gaussian processes. In Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS), 2010.
 Turner & Sahani (2011) Turner, R. E. and Sahani, M. Two problems with variational expectation maximisation for time-series models. In Bayesian Time Series Models. Cambridge University Press, 2011.
 von Neumann et al. (1944) von Neumann, J., Morgenstern, O., Kuhn, H. W., and Rubinstein, A. Theory of Games and Economic Behavior. Princeton University Press, 1944. ISBN 9780691130613.
 Wang et al. (2005) Wang, J., Hertzmann, A., and Fleet, D. J. Gaussian process dynamical models. In Advances in Neural Information Processing Systems 18, 2005.