Dynamical systems modeling is a cornerstone of experimental sciences. Modelers attempt to capture the dynamical behavior of a stochastic system or a phenomenon in order to improve its understanding and make predictions about its future state. Stochastic differential equations (SDEs) are an effective formalism for modelling systems with underlying stochastic dynamics, with wide range of applications 
. The key problem in SDE’s is estimation of the underlying deterministic driving function, and the stochastic diffusion component.
We consider the dynamics of a multivariate system governed by Markov process described by an SDE
is the state vector of a-dimensional dynamical system at continuous time , is a deterministic state evolution, is a scalar magnitude of the stochastic multivariate Wiener process . The Wiener process has zero initial state , and the independent increments
follow a Gaussian with standard deviation.
The SDE system (1) transforms states forward in continuous time by the deterministic drift component , while the is the magnitude of the random Brownian diffusion that scatters the state with random fluctuations. The state solutions of SDE are given by the Itô integral 
where we integrate the system state from an initial state for time forward, and where is an auxiliary time variable. The only non-deterministic part of the solution (2) is the Brownian motion , whose random realisations generate path realisations that induce state distributions at time given the drift and diffusion . SDEs produce continuous, but non-smooth trajectories over time due to the non-differentiable Brownian motion.
We assume that both and are completely unknown and we only observe one or several multivariate time series obtained from noisy observations at observation times ,
where; and the latent states follow the state distribution. The goal of SDE modelling is to learn the drift and diffusion functions such that the process matches data .
There is considerable amount of literature on inferring SDEs that have a pre-defined parametric drift or diffusion functions for specific applications . There has also been interest on estimating non-parametric SDE drift and diffusion functions from data using the general Bayesian formalism [3, 4]. With linear drift approximations the state distribution turns out to be a Gaussian, which can be solved with variational smoothing algorithm  or by variational mean field approximation . Non-linear drifts and diffusions are predominantly modelled with Gaussian processes [3, 7, 4], which are a family of Bayesian kernel methods . For such models the state distributions are intractable, and hence these methods resort to using a family of gradient matching approximations [9, 10, 11], where the drift is estimated to match the empirical gradients of data, , and the diffusion relates to the residual of the approximation [3, 7, 4]. The gradient matching is only applicable to dense observations over time, while additional linearisation [3, 7] is necessary to model sparse observations. Non-parametric estimation of diffusion only applies to dense data [7, 4].
In this paper we propose to infer non-parametric drift and diffusion functions with Gaussian processes for arbitrary sparse or dense data. We learn the underlying system to induce state distributions with high expected likelihood,
The expected likelihood is generally intractable. In contrast to earlier works we do not use gradient matching or other approximative models, but instead we directly tackle and optimize the SDE system against the true likelihood (4) by performing full forward simulation. We propose an unbiased stochastic Monte Carlo approximation for the likelihood, for which we derive efficient, tractable gradients. Our approach places no restrictions on the spacing or sparsity of the observations. Our model is denoted as npSDE, and the implementation is publicly available in http://www.github.com/cagatayyildiz/npde.
2 Inducing Gaussian process SDE model
In this section we model both the drift and diffusion as Gaussian processes with a general inducing point parameterisation. The drift function defines a vector field , that is, an assignment of a -dimensional gradient vector to every -dimensional state . We assume that drift does not depend on time. The diffusion function
is a standard scalar function. We model both functions as Gaussian processes (GP), which are flexible Bayesian non-linear and non-parametric models.
2.1 Drift Gaussian process
The inducing point parameterisation for the drift GP was originally proposed in the context of ordinary differential equation systems, which we review here. We assume a zero-mean vector-valued GP prior on the drift function
which defines a priori distribution over drift values whose mean and covariance are
where is a block matrix of matrix-valued kernels . The key property of Gaussian processes is that they encode functions where similar states induce similar drifts , and where the state similarity is defined by the kernels . Several families of rich matrix-valued kernels exist [14, 13, 16]. In this work we opt for the family of decomposable kernels , where is a Gaussian base kernel
with drift variance , and dimension-specific lengthscales that determine the smoothness of drift field, and is a PSD dependency matrix between dimensions. In practise global dependency structures are often unavailable, and the diagonal structure is then chosen.
In standard GP regression we would obtain posterior of the drift by conditioning the GP prior with data . In SDE models the conditional is intractable due to the integral mapping (2) between observations and drifts . Instead, we augment the Gaussian process with a set of inducing vectors at locations , such that 
. We interpolate drift from inducing points as
which supports the drift with inducing locations and inducing vectors , and where . This corresponds to a vector-valued kernel function , or to a multi-task Gaussian process posterior mean . Due to universality of the Gaussian kernel , we can represent arbitrary drifts with sufficient inducing points.
2.2 Diffusion Gaussian process
We represent diffusion as another inducing point GP, similarly to drift. Diffusion is a scalar function that uses a scalar kernel. The diffusion has a zero-mean GP prior
that defines covariance with a Gaussian kernel of form (9), but with diffusion variance and lengthscales . Diffusion values at states then follow a prior
where . We interpolate the diffusion from inducing locations with inducing values ,
Drift and diffusion naturally share their inducing locations .
2.3 Stochastic Monte Carlo inference
The inducing SDE model is determined via the inducing locations , the inducing values and , the observation noise variance , and the kernel parameters and of the drift and diffusion kernels. The posterior of the model combines the likelihood of (4) and the independent priors and
using Bayes’ theorem as
where the time-dependent state distribution now depends on the inducing parameters. We propose to approximate the true expected likelihood with an unbiased stochastic Monte Carlo averaging, since we can draw path samples from the state distribution by sampling the Brownian motion path . The stochastic likelihood estimate with
samples turns out to be a kernel density estimator with Gaussian bases.
where we discretise time into subintervals of width , and sample the Wiener coefficients as with standard deviation . We set to the initial observation and use (16) to compute state path . The number of time steps is often higher than the number of observed timepoints to achieve sufficient path resolution.
We find a maximum a posteriori (MAP) estimates of by gradient ascent, while choosing lengthscales from a grid, keeping the inducing locations fixed on a dense grid (See Figure 1(d)) and setting . In practise in or systems placing inducing locations on a grid is a robust choice, noting that they can be also optimised with increased computational complexity .
2.4 Computing stochastic gradients
The gradient of the expectation of the log-likelihood (15) is
where the sample paths are from equation (16). The last term is the cumulative derivative of the state of sample at time against the parameters . The gradients of the piecewise Euler-Maruyama paths are:
The derivatives are constructed iteratively over time starting from with a fixed initial state , and the four partial derivatives are gradients of the kernel functions (10) and (13) with respect to and , respectively:
These gradients are related to the sensitivity equations [20, 21] derived for non-parametric ODEs previously . The gradients (19) can be computed in practise together with the sample paths (16) during the Euler-Maruyama iteration. The computation of the gradients has the same computational complexity as the numerical simulator. The iterative gradients are superior to finite difference approximation since we have exact formulation of the gradients, albeit of the approximal Euler-Maruyama paths, which can be solved to arbitrary numerical accuracy by tuning discretisation.
In order to illustrate the performance of our model, we conduct several experiments on synthetic data as well as real-world data sets. In all experiments inducing vectors are initialized by gradient matching, and then fully optimised. We use EM method to simulate the state distributions and compute stochastic gradients over samples. We use L-BFGS algorithm to compute the MAP estimates.
3.1 Double well
We first consider the double well system where the drift is given by and with constant diffusion . We generate random noisy input trajectories, each with observed data points. True dynamics and the observed data points are illustrated in Figure 3a. We fit our npSDE model with inducing points located uniformly within . We accurately approximate the true drift (the right plot on Figure 3), and learn a diffusion estimate of .
3.2 Simple oscillating dynamics
Next, we investigate how the quality of fit changes by the amount of data used for training. We consider the synthetic system in , whose drift equations are given by and , and the diffusion is . The state-dependent diffusion acts an hotspot of increased path scatter, and provides interesting and challenging dynamics to infer. We generate six data batches from the true dynamics using EM method with step size , and observe every 100’th state corrupted by a Gaussian noise with variance . The data batches contain 1, 5, 10, 25, 50 and 100 input trajectories, each having 25 data points. We repeat the experiments 50 times and report the average error.
The left plot in Figure 4 illustrates the cumulative discrepancy in the state distributions over time, and the right plot shows the error between the true and estimated drift/diffusion. Unsurprisingly, the discrepancy in both plots decrease when more training data is used. We also observe that the performance gain is insignificant after 50 input trajectories.
3.3 Ice core data
As another showcase of our model, we consider the NGRIP ice core dataset , which contains records of isotopic oxygen O concetrations within clacial ice cores. The record is used to explore the climatic changes that date back to last glacial period. During that period, the North Atlantic region underwent abrupt climate changes known as Dansgaard-Oeschger (DO) events. The events are characterized by a sudden increase in the temperature followed by a gradual cooling phase. Following , we consider timepoints from the time span from 60000 years to 20000 years before present, where 16 DO events have been identified.
Figure 5(a) illustrates the highly variable data. We observe a repeating pattern of DO events: a sudden increase followed by a slower settlement phase. The panel 5(b) indicates estimated drift that pushes the oxygen down until state and up with small states, matching the data. Interestingly, diffusion at 5(c) is highly peaked between and , which has accurately identified the regime of DO events. The model has learned to explain the DO events with high diffusion.
3.4 Human motion dataset
We finally demonstrate our approach on human motion capture data. Our goal is twofold: to estimate a single drift function that captures the dynamics of the walking sequences of several people, and to explain the discrepancies among sequences via diffusion. The input trajectories are the same as in : four walking sequences, each from a different person. We also follow the preprocessing method described in , which results in a simplified skeleton that consists of 50-dimensional pose configurations. All records are mean centered and downsampled by a factor of two.
Inference is performed in three dimensional space where the input sequences are projected using PCA. We place the inducing points on a grid and set the length-scale of both drift and diffusion process to 0.5. data set, inferred drift fit and the density of the sample paths are visualized in Figure 6. We can conclude that our model is capable of inferring drift and diffusion functions that match arbitrary data.
We propose an approach for learning non-parametric drift and diffusion functions of stochastic differential equation (SDE) systems such that the resulting simulated state distributions match data. Our approach can learn arbitrary dynamics due to the flexible inducing Gaussian process formulation. We propose a stochastic estimate of the simulated state distributions and an efficient system of computing their gradients. Our approach does not place any restrictions on the sparsity or denseness of the observations data. We leave learning of time-varying drifts and diffusions as interesting future work.
The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, the Academy of Finland grants no. 260403, 299915, 275537, 311584.
-  R. Friedrich, J. Peinkeb, M. Sahimic, and R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Phys. reports, vol. 506, pp. 87–162, 2011.
-  B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, Springer, 6th edition, 2014.
-  A. Ruttor, P. Batz, and M. Opper, “Approximate Gaussian process inference for the drift function in stochastic differential equations,” in Advances in Neural Information Processing Systems, 2013, pp. 2040–2048.
-  C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez, “Nonparametric estimation of stochastic differential equations with sparse Gaussian processes,” Physical Review E, vol. 96, no. 2, pp. 022104, 2017.
-  C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor, “Gaussian process approximations of stochastic differential equations,” in Gaussian Processes in Practice, 2007.
-  M. Vrettas, M. Opper, and D. Cornford, “Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations,” Physical Review E, vol. 91, pp. 012148, 2015.
-  P. Batz, A. Ruttor, and M. Opper, “Approximate bayes learning of stochastic differential equations,” arXiv:1702.05390, 2017.
-  C.E. Rasmussen and K.I. Williams, Gaussian processes for machine learning, MIT Press, 2006.
-  J. M. Varah, “A spline least squares method for numerical parameter estimation in differential equations,” SIAM J sci Stat Comput, vol. 3, pp. 28–46, 1982.
-  S. Ellner, Y. Seifu, and R. Smith, “Fitting population dynamic models to time-series data by gradient matching,” Ecology, vol. 83, pp. 2256–2270, 2002.
-  J. Ramsay, G. Hooker, D. Campbell, and J. Cao, “Parameter estimation for differential equations: a generalized smoothing approach,” J R Stat Soc B, vol. 69, pp. 741–796, 2007.
-  M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki, “Learning unknown ODE models with Gaussian processes,” in Proceedings of the 35th International Conference on Machine Learning, 2018.
-  M. Alvarez, L. Rosasco, and N. Lawrence, “Kernels for vector-valued functions: A review,” Foundations and Trends in Machine Learning, 2012.
-  N. Wahlström, M. Kok, and T. Schön, “Modeling magnetic fields using Gaussian processes,” IEEE ICASSP, 2013.
-  I. Macedo and R. Castro, “Learning divergence-free and curl-free vector fields with matrix-valued kernels,” Instituto Nacional de Matematica Pura e Aplicada, 2008.
-  C. Micchelli and M. Pontil, “On learning vector-valued functions,” Neural computation, 2005.
-  J. Quiñonero-Candela and C.E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
-  J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.
J. Hensman, N. Fusi, and N. Lawrence,
“Gaussian processes for big data,”
Uncertainty in Artificial Intelligence. AUAI Press, 2013, pp. 282–290.
-  P. Kokotovic and J. Heller, “Direct and adjoint sensitivity equations for parameter optimization,” IEEE Trans. on Automatic Control, vol. 12, pp. 609–610, 1967.
-  F. Fröhlich, B. Kaltenbacher, F. Theis, and J. Hasenauer, “Scalable parameter estimation for genome-scale biochemical reaction networks,” PLOS Comp Biol, vol. 13, pp. 1–18, 2017.
-  K. Andersen, N. Azuma, J-M. Barnola, M. Bigler, P. Biscaye, N. Caillon, J. Chappellaz, H. Clausen, et al., “High-resolution record of northern hemisphere climate extending into the last interglacial period,” Nature, vol. 431, pp. 147, 2004.
-  F. Kwasniok, “Analysis and modelling of glacial climate transitions using simple dynamical systems,” Phil Trans R Soc A, vol. 371, pp. 20110472, 2013.
-  J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Trans. on pattern analysis and machine intelligence, vol. 30, pp. 283–298, 2008.