1 Introduction
Dynamical systems modeling is a cornerstone of experimental sciences. Modelers attempt to capture the dynamical behavior of a stochastic system or phenomenon in order to improve its understanding and make predictions about its future state. Stochastic differential equations (SDEs) are an effective formalism for modelling systems with underlying stochastic dynamics, with a wide range of applications [1]. The key problem in SDEs is the estimation of the underlying deterministic driving function and the stochastic diffusion component.
We consider the dynamics of a multivariate system governed by a Markov process described by an SDE

(1) $dx_t = f(x_t)\,dt + \sigma(x_t)\,dW_t$

where $x_t \in \mathbb{R}^D$ is the state vector of a $D$-dimensional dynamical system at continuous time $t$, $f(x_t) \in \mathbb{R}^D$ is a deterministic state evolution function, and $\sigma(x_t) \in \mathbb{R}$ is a scalar magnitude of the stochastic multivariate Wiener process $W_t \in \mathbb{R}^D$. The Wiener process has zero initial state, $W_0 = 0$, and its independent increments $W_{t+s} - W_t \sim \mathcal{N}(0, sI)$ follow a Gaussian with standard deviation $\sqrt{s}$. The SDE system (1) transforms states forward in continuous time by the deterministic drift component $f$, while $\sigma$ is the magnitude of the random Brownian diffusion $W_t$ that scatters the state with random fluctuations. The state solutions of the SDE are given by the Itô integral [2]
(2) $x_t = x_0 + \int_0^t f(x_\tau)\,d\tau + \int_0^t \sigma(x_\tau)\,dW_\tau$

where we integrate the system state from an initial state $x_0$ forward in time, and where $\tau$ is an auxiliary time variable. The only nondeterministic part of the solution (2) is the Brownian motion $W_\tau$, whose random realisations generate path realisations $x_t$ that induce state distributions $p(x_t)$ at time $t$ given the drift $f$ and diffusion $\sigma$. SDEs produce continuous, but nonsmooth, trajectories over time due to the nondifferentiable Brownian motion.
We assume that both $f$ and $\sigma$ are completely unknown, and that we only observe one or several multivariate time series $Y = (y_1, \ldots, y_N)$ obtained from noisy observations at observation times $t_1, \ldots, t_N$,

(3) $y_i = x_{t_i} + \varepsilon_i, \qquad \varepsilon_i \sim \mathcal{N}(0, \Omega)$

where the noise $\varepsilon_i$ follows a stationary, time-invariant, zero-mean multivariate Gaussian distribution with diagonal noise variances $\Omega = \operatorname{diag}(\omega_1^2, \ldots, \omega_D^2)$, and the latent states $x_{t_i}$ follow the state distribution. The goal of SDE modelling is to learn the drift and diffusion functions such that the process matches the data $Y$.

There is a considerable amount of literature on inferring SDEs that have predefined parametric drift or diffusion functions for specific applications [1]. There has also been interest in estimating nonparametric SDE drift and diffusion functions from data using the general Bayesian formalism [3, 4]. With linear drift approximations the state distribution turns out to be a Gaussian, which can be solved with a variational smoothing algorithm [5] or by a variational mean-field approximation [6]. Nonlinear drifts and diffusions are predominantly modelled with Gaussian processes [3, 7, 4], a family of Bayesian kernel methods [8]. For such models the state distributions are intractable, and hence these methods resort to a family of gradient matching approximations [9, 10, 11], where the drift is estimated to match the empirical gradients of the data, and the diffusion relates to the residual of the approximation [3, 7, 4]. Gradient matching is only applicable to dense observations over time, while additional linearisation [3, 7] is necessary to model sparse observations. Nonparametric estimation of the diffusion only applies to dense data [7, 4].
In this paper we propose to infer nonparametric drift and diffusion functions with Gaussian processes for arbitrarily sparse or dense data. We learn the underlying system to induce state distributions with high expected likelihood,

(4) $\mathbb{E}_{p(x_t \mid f, \sigma)} \Big[ \prod_{i=1}^{N} \mathcal{N}(y_i \mid x_{t_i}, \Omega) \Big]$

The expected likelihood is generally intractable. In contrast to earlier works we do not use gradient matching or other approximate models; instead we directly optimize the SDE system against the true likelihood (4) by performing full forward simulation. We propose an unbiased stochastic Monte Carlo approximation of the likelihood, for which we derive efficient, tractable gradients. Our approach places no restrictions on the spacing or sparsity of the observations. We denote our model npSDE, and the implementation is publicly available at http://www.github.com/cagatayyildiz/npde.
2 Inducing Gaussian process SDE model
In this section we model both the drift and diffusion as Gaussian processes with a general inducing point parameterisation. The drift function $f$ defines a vector field $f : \mathbb{R}^D \to \mathbb{R}^D$, that is, an assignment of a $D$-dimensional gradient vector $f(x)$ to every $D$-dimensional state $x$. We assume that the drift does not depend on time. The diffusion function $\sigma : \mathbb{R}^D \to \mathbb{R}$ is a standard scalar function. We model both functions as Gaussian processes (GPs), which are flexible Bayesian nonlinear and nonparametric models [8].

2.1 Drift Gaussian process
The inducing point parameterisation for the drift GP was originally proposed in the context of ordinary differential equation systems [12], which we review here. We assume a zero-mean vector-valued GP prior on the drift function

(5) $f(x) \sim \mathcal{GP}(0, K(x, x'))$

which defines a prior distribution over drift values whose mean and covariance are

(6) $\mathbb{E}[f(x)] = 0$
(7) $\operatorname{cov}[f(x), f(x')] = K(x, x')$
where the kernel $K(x, x') \in \mathbb{R}^{D \times D}$ is matrix-valued [13]. A GP prior defines that for any collection of states $X = (x_1, \ldots, x_N)$, the drift values $f(X) = (f(x_1), \ldots, f(x_N))$ follow a matrix-valued normal [13],

(8) $p(f(X)) = \mathcal{N}(0, K(X, X))$

where $K(X, X) \in \mathbb{R}^{ND \times ND}$ is a block matrix of matrix-valued kernels $K(x_i, x_j)$. The key property of Gaussian processes is that they encode functions where similar states $x, x'$ induce similar drifts $f(x), f(x')$, and where the state similarity is defined by the kernels $K(x, x')$. Several families of rich matrix-valued kernels exist [14, 15, 16]. In this work we opt for the family of decomposable kernels $K(x, x') = k(x, x') \cdot A$, where $k(x, x')$ is a Gaussian base kernel

(9) $k(x, x') = \sigma_f^2 \exp\Big( -\tfrac{1}{2} \sum_{d=1}^{D} \frac{(x_d - x'_d)^2}{\ell_d^2} \Big)$

with drift variance $\sigma_f^2$ and dimension-specific lengthscales $\ell_1, \ldots, \ell_D$ that determine the smoothness of the drift field, and where $A \in \mathbb{R}^{D \times D}$ is a PSD dependency matrix between the output dimensions. In practice global dependency structures are often unavailable, and a diagonal structure is then chosen.
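The decomposable kernel can be sketched in a few lines of NumPy. This is a minimal illustration, not the paper's implementation; the function and parameter names (`gaussian_kernel`, `sf2` for $\sigma_f^2$, `ell` for the lengthscales) are ours.

```python
import numpy as np

def gaussian_kernel(X1, X2, sf2=1.0, ell=None):
    # Gaussian base kernel k(x, x') of (9) with per-dimension lengthscales.
    # X1: (N, D), X2: (M, D); returns the (N, M) kernel matrix.
    D = X1.shape[1]
    ell = np.ones(D) if ell is None else np.asarray(ell, dtype=float)
    diff = X1[:, None, :] / ell - X2[None, :, :] / ell
    return sf2 * np.exp(-0.5 * np.sum(diff ** 2, axis=-1))

def decomposable_kernel(X1, X2, A, sf2=1.0, ell=None):
    # Decomposable matrix-valued kernel K(x, x') = k(x, x') * A, assembled
    # over all state pairs as a Kronecker product of blocks.
    return np.kron(gaussian_kernel(X1, X2, sf2, ell), A)
```

With a diagonal (or identity) dependency matrix `A`, the block matrix decouples across output dimensions.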
In standard GP regression we would obtain the posterior of the drift by conditioning the GP prior with the data [8]. In SDE models this conditional is intractable due to the integral mapping (2) between the observations $y_i$ and the drift $f$. Instead, we augment the Gaussian process with a set of inducing vectors $u_1, \ldots, u_M \in \mathbb{R}^D$ at inducing locations $z_1, \ldots, z_M \in \mathbb{R}^D$, such that $f(z_j) = u_j$ [17]. We interpolate the drift from the inducing points as

(10) $f(x) = K(x, Z) K(Z, Z)^{-1} U$

which supports the drift with the inducing locations $Z = (z_1, \ldots, z_M)$ and inducing vectors $U = (u_1, \ldots, u_M)$, and where $K(Z, Z)$ is a block matrix of kernels $K(z_i, z_j)$. This corresponds to a vector-valued kernel function [13], or to a multi-task Gaussian process posterior mean [8]. Due to the universality of the Gaussian kernel [18], we can represent arbitrary drifts with sufficient inducing points.
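A minimal sketch of the inducing-point interpolation (10), assuming an identity dependency matrix $A$ so that each output dimension is interpolated independently with a scalar base kernel; the function name and the small jitter term are our additions, not the paper's code.

```python
import numpy as np

def interpolate_drift(x, Z, U, kernel, jitter=1e-6):
    # f(x) = K(x, Z) K(Z, Z)^{-1} U for a scalar base kernel applied
    # per output dimension. Z: (M, D) inducing locations, U: (M, D)
    # inducing vectors. The jitter stabilises the linear solve.
    x = np.atleast_2d(x)
    Kxz = kernel(x, Z)                             # (1, M)
    Kzz = kernel(Z, Z) + jitter * np.eye(len(Z))   # (M, M)
    return (Kxz @ np.linalg.solve(Kzz, U)).ravel() # (D,)
```

By construction the interpolant passes (up to jitter) through the inducing values, $f(z_j) \approx u_j$.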
2.2 Diffusion Gaussian process
We represent the diffusion as another inducing point GP, similarly to the drift. The diffusion is a scalar function and hence uses a scalar kernel. The diffusion has a zero-mean GP prior

(11) $\sigma(x) \sim \mathcal{GP}(0, k_\sigma(x, x'))$

that defines covariances with a Gaussian kernel of the form (9), but with diffusion variance $\sigma_s^2$ and lengthscales $\ell_\sigma$. Diffusion values at states $X$ then follow a prior

(12) $p(\sigma(X)) = \mathcal{N}(0, K_\sigma(X, X))$

where $K_\sigma(X, X) = \big(k_\sigma(x_i, x_j)\big)_{i,j=1}^{N}$. We interpolate the diffusion from the inducing locations $Z$ with inducing values $u_\sigma = (u_{\sigma 1}, \ldots, u_{\sigma M})$,

(13) $\sigma(x) = k_\sigma(x, Z) K_\sigma(Z, Z)^{-1} u_\sigma$

Drift and diffusion naturally share their inducing locations $Z$.
2.3 Stochastic Monte Carlo inference
The inducing SDE model is determined by the inducing locations $Z$, the inducing values $U$ and $u_\sigma$, the observation noise variances $\Omega$, and the kernel parameters $(\sigma_f, \ell)$ and $(\sigma_s, \ell_\sigma)$ of the drift and diffusion kernels. The posterior of the model combines the likelihood of (4) and the independent priors $p(U)$ and $p(u_\sigma)$ using Bayes' theorem as

(14) $p(U, u_\sigma \mid Y) \propto p(Y \mid U, u_\sigma)\, p(U)\, p(u_\sigma)$
(15) $p(Y \mid U, u_\sigma) = \mathbb{E}_{p(x_t \mid U, u_\sigma)} \Big[ \prod_{i=1}^{N} \mathcal{N}(y_i \mid x_{t_i}, \Omega) \Big]$

where the time-dependent state distribution $p(x_t \mid U, u_\sigma)$ now depends on the inducing parameters. We propose to approximate the true expected likelihood with an unbiased stochastic Monte Carlo average, since we can draw path samples $x^{(s)}_t$ from the state distribution by sampling the Brownian motion path $W_t$. The stochastic likelihood estimate with $S$ samples turns out to be a kernel density estimator with Gaussian bases.
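For a single observation, the Monte Carlo estimate averages Gaussian densities centred at the sampled states, which is exactly a kernel density estimate with Gaussian bases. A minimal sketch (the function name and argument names are ours):

```python
import numpy as np

def expected_likelihood_mc(y, x_samples, omega2):
    # Estimate E[N(y | x_t, Omega)] by averaging Gaussian densities
    # over S sampled states. y: (D,), x_samples: (S, D),
    # omega2: (D,) diagonal noise variances.
    norm = np.prod(2.0 * np.pi * omega2) ** -0.5
    sq = np.sum((y - x_samples) ** 2 / omega2, axis=1)
    return np.mean(norm * np.exp(-0.5 * sq))
```

The average over samples is unbiased for the expectation under the sampled state distribution.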
We draw the sample paths using the Euler-Maruyama (EM) method for approximating the solution of an SDE (2) [2]:

(16) $x_{k+1} = x_k + f(x_k)\,\Delta t + \sigma(x_k)\,\Delta W_k$

where we discretise time into subintervals of width $\Delta t$, and sample the Wiener increments as $\Delta W_k \sim \mathcal{N}(0, \Delta t\, I)$ with standard deviation $\sqrt{\Delta t}$. We set $x_0$ to the initial observation and use (16) to compute the state path $x_1, \ldots, x_K$. The number of time steps $K$ is often higher than the number of observed timepoints, to achieve sufficient path resolution.
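The recursion (16) is straightforward to implement. A generic sketch, assuming a drift function returning a $D$-vector and a scalar diffusion function (names are ours):

```python
import numpy as np

def euler_maruyama(f, sigma, x0, dt, steps, rng):
    # Simulate one path x_{k+1} = x_k + f(x_k) dt + sigma(x_k) dW_k
    # with Wiener increments dW_k ~ N(0, dt * I).
    x = np.empty((steps + 1, len(x0)))
    x[0] = x0
    for k in range(steps):
        dW = rng.normal(0.0, np.sqrt(dt), size=len(x0))
        x[k + 1] = x[k] + f(x[k]) * dt + sigma(x[k]) * dW
    return x
```

Repeating the simulation with independent Wiener draws yields the sample paths $x^{(1)}, \ldots, x^{(S)}$ used in the Monte Carlo likelihood estimate.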
We find maximum a posteriori (MAP) estimates of $U$ and $u_\sigma$ by gradient ascent, while choosing the lengthscales from a grid and keeping the inducing locations $Z$ fixed on a dense grid (see Figure 1(d)). In practice, for low-dimensional systems placing the inducing locations on a grid is a robust choice, noting that they can also be optimised at increased computational complexity [19].
2.4 Computing stochastic gradients
The gradient of the expected log-likelihood (15) is

(17) $\nabla_\theta \log \mathbb{E}_{p(x_t \mid \theta)} \Big[ \prod_{i=1}^{N} \mathcal{N}(y_i \mid x_{t_i}, \Omega) \Big] \approx \nabla_\theta \log \frac{1}{S} \sum_{s=1}^{S} \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid x^{(s)}_{t_i}, \Omega\big)$

(18) $= \dfrac{\sum_{s=1}^{S} \Big[ \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid x^{(s)}_{t_i}, \Omega\big) \Big] \sum_{i=1}^{N} \big(y_i - x^{(s)}_{t_i}\big)^{\top} \Omega^{-1}\, \frac{\partial x^{(s)}_{t_i}}{\partial \theta}}{\sum_{s=1}^{S} \prod_{i=1}^{N} \mathcal{N}\big(y_i \mid x^{(s)}_{t_i}, \Omega\big)}$

where the sample paths $x^{(s)}_t$ are drawn from equation (16), and $\theta = (U, u_\sigma)$ collects the inducing parameters. The last term $\partial x^{(s)}_{t_i} / \partial \theta$ is the cumulative derivative of the state of sample $s$ at time $t_i$ against the parameters $\theta$. The gradients of the piecewise Euler-Maruyama paths are:

(19) $\dfrac{\partial x_{k+1}}{\partial \theta} = \dfrac{\partial x_k}{\partial \theta} + \Big( \dfrac{\partial f(x_k)}{\partial x_k} \dfrac{\partial x_k}{\partial \theta} + \dfrac{\partial f(x_k)}{\partial \theta} \Big) \Delta t + \Big( \dfrac{\partial \sigma(x_k)}{\partial x_k} \dfrac{\partial x_k}{\partial \theta} + \dfrac{\partial \sigma(x_k)}{\partial \theta} \Big) \Delta W_k$

The derivatives are constructed iteratively over time starting from $\partial x_0 / \partial \theta = 0$ with a fixed initial state, and the four partial derivatives are gradients of the kernel interpolants (10) and (13) with respect to the states and the inducing values $U$ and $u_\sigma$, respectively:

(20) $\dfrac{\partial f(x)}{\partial U} = K(x, Z) K(Z, Z)^{-1}$
(21) $\dfrac{\partial f(x)}{\partial x} = \dfrac{\partial K(x, Z)}{\partial x} K(Z, Z)^{-1} U$
(22) $\dfrac{\partial \sigma(x)}{\partial u_\sigma} = k_\sigma(x, Z) K_\sigma(Z, Z)^{-1}$
(23) $\dfrac{\partial \sigma(x)}{\partial x} = \dfrac{\partial k_\sigma(x, Z)}{\partial x} K_\sigma(Z, Z)^{-1} u_\sigma$
These gradients are related to the sensitivity equations [20, 21] derived previously for nonparametric ODEs [12]. The gradients (19) can be computed in practice together with the sample paths (16) during the Euler-Maruyama iteration, with the same computational complexity as the numerical simulator. The iterative gradients are superior to finite difference approximations since we have an exact formulation of the gradients, albeit of the approximate Euler-Maruyama paths, which can be simulated to arbitrary numerical accuracy by tuning the discretisation.
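To illustrate the iterative gradient computation, consider a toy one-parameter drift $f(x) = \theta x$ with fixed constant diffusion: propagating the sensitivity $\partial x / \partial \theta$ alongside the Euler-Maruyama state recovers the exact derivative of the discretised path. This is our own minimal example, not the paper's code.

```python
import numpy as np

def em_path_and_sensitivity(theta, x0, dt, dW, sigma=0.1):
    # Euler-Maruyama path for the toy drift f(x) = theta * x with
    # constant diffusion sigma, propagating dx/dtheta via the recursion
    # dx_{k+1}/dtheta = dx_k/dtheta + (df/dx * dx_k/dtheta + df/dtheta) dt.
    x, dxdtheta = x0, 0.0
    for w in dW:
        # sensitivity update uses the state *before* the state update
        dxdtheta = dxdtheta + (theta * dxdtheta + x) * dt
        x = x + theta * x * dt + sigma * w
    return x, dxdtheta
```

For a fixed realisation of the Wiener increments `dW`, the propagated sensitivity agrees with a finite-difference derivative of the same discrete map to numerical precision, which is the sense in which the iterative gradients are exact for the Euler-Maruyama paths.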
3 Experiments
In order to illustrate the performance of our model, we conduct several experiments on synthetic as well as real-world data sets. In all experiments the inducing vectors are initialized by gradient matching, and then fully optimised. We use the EM method to simulate the state distributions and compute the stochastic gradients over the sample paths. We use the L-BFGS algorithm to compute the MAP estimates.
3.1 Double well
We first consider the double well system with constant diffusion. We generate random noisy input trajectories, each with a fixed number of observed data points. The true dynamics and the observed data points are illustrated in Figure 3a. We fit our npSDE model with inducing points located uniformly over the state space. We accurately approximate the true drift (the right plot in Figure 3), and also learn an estimate of the diffusion.
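For reference, a double-well system can be simulated directly with Euler-Maruyama. We assume here the standard double-well drift $f(x) = 4(x - x^3)$, since the exact coefficients are not reproduced in the text; the constants in this sketch are illustrative only.

```python
import numpy as np

def double_well_drift(x):
    # Assumed standard double-well drift f(x) = 4(x - x^3), with stable
    # fixed points at x = -1 and x = +1 and an unstable one at x = 0.
    return 4.0 * (x - x ** 3)

def simulate_double_well(x0, sigma, dt, steps, rng):
    # Scalar Euler-Maruyama simulation of the double-well SDE.
    x = x0
    for _ in range(steps):
        x = x + double_well_drift(x) * dt + sigma * rng.normal(0.0, np.sqrt(dt))
    return x
```

With small diffusion, paths settle into one of the two wells near $\pm 1$; larger diffusion induces occasional transitions between the wells, which is what makes the system a standard SDE benchmark.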
3.2 Simple oscillating dynamics
Next, we investigate how the quality of the fit changes with the amount of training data. We consider the synthetic system in [7], with an oscillating drift and a state-dependent diffusion. The state-dependent diffusion acts as a hotspot of increased path scatter, and provides interesting and challenging dynamics to infer. We generate six data batches from the true dynamics using the EM method with a small step size, observing every 100th state corrupted by Gaussian noise. The data batches contain 1, 5, 10, 25, 50 and 100 input trajectories, each having 25 data points. We repeat the experiments 50 times and report the average error.
The left plot in Figure 4 illustrates the cumulative discrepancy in the state distributions over time, and the right plot shows the error between the true and estimated drift/diffusion. Unsurprisingly, the discrepancy in both plots decreases as more training data is used. We also observe that the performance gain is insignificant beyond 50 input trajectories.
3.3 Ice core data
As another showcase of our model, we consider the NGRIP ice core dataset [22], which contains records of isotopic oxygen concentrations within glacial ice cores. The record is used to explore climatic changes dating back to the last glacial period. During that period, the North Atlantic region underwent abrupt climate changes known as Dansgaard-Oeschger (DO) events. The events are characterized by a sudden increase in temperature followed by a gradual cooling phase. Following [23], we consider timepoints from the time span from 60,000 to 20,000 years before present, where 16 DO events have been identified.
Figure 5(a) illustrates the highly variable data. We observe a repeating pattern of DO events: a sudden increase followed by a slower settlement phase. Panel 5(b) indicates an estimated drift that pushes the oxygen concentration down at high states and up at low states, matching the data. Interestingly, the diffusion in 5(c) is highly peaked over an intermediate range of states, which accurately identifies the regime of DO events. The model has learned to explain the DO events with high diffusion.
3.4 Human motion dataset
We finally demonstrate our approach on human motion capture data. Our goal is twofold: to estimate a single drift function that captures the dynamics of the walking sequences of several people, and to explain the discrepancies among the sequences via diffusion. The input trajectories are the same as in [24]: four walking sequences, each from a different person. We also follow the preprocessing method described in [24], which results in a simplified skeleton consisting of 50-dimensional pose configurations. All records are mean-centered and downsampled by a factor of two.
Inference is performed in a three-dimensional space onto which the input sequences are projected using PCA. We place the inducing points on a grid and set the lengthscales of both the drift and diffusion processes to 0.5. The data set, the inferred drift fit and the density of the sample paths are visualized in Figure 6. We conclude that our model is capable of inferring drift and diffusion functions that match arbitrary data.
4 Discussion
We propose an approach for learning nonparametric drift and diffusion functions of stochastic differential equation (SDE) systems such that the resulting simulated state distributions match the data. Our approach can learn arbitrary dynamics due to the flexible inducing Gaussian process formulation. We propose a stochastic estimate of the simulated state distributions and an efficient method of computing their gradients. Our approach places no restrictions on the sparsity or density of the observed data. We leave the learning of time-varying drifts and diffusions as interesting future work.
Acknowledgements.
The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, the Academy of Finland grants no. 260403, 299915, 275537, 311584.
References
 [1] R. Friedrich, J. Peinke, M. Sahimi, and R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Physics Reports, vol. 506, pp. 87–162, 2011.
 [2] B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, Springer, 6th edition, 2014.
 [3] A. Ruttor, P. Batz, and M. Opper, “Approximate Gaussian process inference for the drift function in stochastic differential equations,” in Advances in Neural Information Processing Systems, 2013, pp. 2040–2048.
 [4] C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez, “Nonparametric estimation of stochastic differential equations with sparse Gaussian processes,” Physical Review E, vol. 96, no. 2, pp. 022104, 2017.
 [5] C. Archambeau, D. Cornford, M. Opper, and J. ShaweTaylor, “Gaussian process approximations of stochastic differential equations,” in Gaussian Processes in Practice, 2007.
 [6] M. Vrettas, M. Opper, and D. Cornford, “Variational meanfield algorithm for efficient inference in large systems of stochastic differential equations,” Physical Review E, vol. 91, pp. 012148, 2015.
 [7] P. Batz, A. Ruttor, and M. Opper, “Approximate bayes learning of stochastic differential equations,” arXiv:1702.05390, 2017.
 [8] C.E. Rasmussen and C.K.I. Williams, Gaussian Processes for Machine Learning, MIT Press, 2006.
 [9] J. M. Varah, “A spline least squares method for numerical parameter estimation in differential equations,” SIAM J sci Stat Comput, vol. 3, pp. 28–46, 1982.
 [10] S. Ellner, Y. Seifu, and R. Smith, “Fitting population dynamic models to timeseries data by gradient matching,” Ecology, vol. 83, pp. 2256–2270, 2002.
 [11] J. Ramsay, G. Hooker, D. Campbell, and J. Cao, “Parameter estimation for differential equations: a generalized smoothing approach,” J R Stat Soc B, vol. 69, pp. 741–796, 2007.
 [12] M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki, “Learning unknown ODE models with Gaussian processes,” in Proceedings of the 35th International Conference on Machine Learning, 2018.
 [13] M. Alvarez, L. Rosasco, and N. Lawrence, “Kernels for vectorvalued functions: A review,” Foundations and Trends in Machine Learning, 2012.
 [14] N. Wahlström, M. Kok, and T. Schön, “Modeling magnetic fields using Gaussian processes,” IEEE ICASSP, 2013.
 [15] I. Macedo and R. Castro, “Learning divergencefree and curlfree vector fields with matrixvalued kernels,” Instituto Nacional de Matematica Pura e Aplicada, 2008.
 [16] C. Micchelli and M. Pontil, “On learning vectorvalued functions,” Neural computation, 2005.
 [17] J. QuiñoneroCandela and C.E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
 [18] J. ShaweTaylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.

 [19] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big data,” in Uncertainty in Artificial Intelligence. AUAI Press, 2013, pp. 282–290.
 [20] P. Kokotovic and J. Heller, “Direct and adjoint sensitivity equations for parameter optimization,” IEEE Trans. on Automatic Control, vol. 12, pp. 609–610, 1967.
 [21] F. Fröhlich, B. Kaltenbacher, F. Theis, and J. Hasenauer, “Scalable parameter estimation for genomescale biochemical reaction networks,” PLOS Comp Biol, vol. 13, pp. 1–18, 2017.
 [22] K. Andersen, N. Azuma, JM. Barnola, M. Bigler, P. Biscaye, N. Caillon, J. Chappellaz, H. Clausen, et al., “Highresolution record of northern hemisphere climate extending into the last interglacial period,” Nature, vol. 431, pp. 147, 2004.
 [23] F. Kwasniok, “Analysis and modelling of glacial climate transitions using simple dynamical systems,” Phil Trans R Soc A, vol. 371, pp. 20110472, 2013.
 [24] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Trans. on pattern analysis and machine intelligence, vol. 30, pp. 283–298, 2008.