Learning Stochastic Differential Equations With Gaussian Processes Without Gradient Matching

by   Cagatay Yildiz, et al.

We introduce a novel paradigm for learning non-parametric drift and diffusion functions for stochastic differential equation (SDE) that are learnt to simulate trajectory distributions that match observations of arbitrary spacings. This is in contrast to existing gradient matching or other approximations that do not optimize simulated responses. We demonstrate that our general stochastic distribution optimisation leads to robust and efficient learning of SDE systems.



There are no comments yet.


page 1

page 2

page 3

page 4


Non-parametric Estimation of Stochastic Differential Equations with Sparse Gaussian Processes

The application of Stochastic Differential Equations (SDEs) to the analy...

Scalable Inference in SDEs by Direct Matching of the Fokker-Planck-Kolmogorov Equation

Simulation-based techniques such as variants of stochastic Runge-Kutta a...

New Algorithms And Fast Implementations To Approximate Stochastic Processes

We present new algorithms and fast implementations to find efficient app...

Deterministic Inference of Neural Stochastic Differential Equations

Model noise is known to have detrimental effects on neural networks, suc...

AReS and MaRS - Adversarial and MMD-Minimizing Regression for SDEs

Stochastic differential equations are an important modeling class in man...

Stochastic embeddings of dynamical phenomena through variational autoencoders

System identification in scenarios where the observed number of variable...

Deep learning with differential Gaussian process flows

We propose a novel deep learning paradigm of differential flows that lea...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Dynamical systems modeling is a cornerstone of experimental sciences. Modelers attempt to capture the dynamical behavior of a stochastic system or a phenomenon in order to improve its understanding and make predictions about its future state. Stochastic differential equations (SDEs) are an effective formalism for modelling systems with underlying stochastic dynamics, with wide range of applications [1]

. The key problem in SDE’s is estimation of the underlying deterministic driving function, and the stochastic diffusion component.

We consider the dynamics of a multivariate system governed by Markov process described by an SDE



is the state vector of a

-dimensional dynamical system at continuous time , is a deterministic state evolution, is a scalar magnitude of the stochastic multivariate Wiener process . The Wiener process has zero initial state , and the independent increments

follow a Gaussian with standard deviation


The SDE system (1) transforms states forward in continuous time by the deterministic drift component , while the is the magnitude of the random Brownian diffusion that scatters the state with random fluctuations. The state solutions of SDE are given by the Itô integral [2]


where we integrate the system state from an initial state for time forward, and where is an auxiliary time variable. The only non-deterministic part of the solution (2) is the Brownian motion , whose random realisations generate path realisations that induce state distributions at time given the drift and diffusion . SDEs produce continuous, but non-smooth trajectories over time due to the non-differentiable Brownian motion.

We assume that both and are completely unknown and we only observe one or several multivariate time series obtained from noisy observations at observation times ,



follows a stationary time-invariant zero-mean multivariate Gaussian distribution with diagonal noise variances

; and the latent states follow the state distribution. The goal of SDE modelling is to learn the drift and diffusion functions such that the process matches data .

There is considerable amount of literature on inferring SDEs that have a pre-defined parametric drift or diffusion functions for specific applications [1]. There has also been interest on estimating non-parametric SDE drift and diffusion functions from data using the general Bayesian formalism [3, 4]. With linear drift approximations the state distribution turns out to be a Gaussian, which can be solved with variational smoothing algorithm [5] or by variational mean field approximation [6]. Non-linear drifts and diffusions are predominantly modelled with Gaussian processes [3, 7, 4], which are a family of Bayesian kernel methods [8]. For such models the state distributions are intractable, and hence these methods resort to using a family of gradient matching approximations [9, 10, 11], where the drift is estimated to match the empirical gradients of data, , and the diffusion relates to the residual of the approximation [3, 7, 4]. The gradient matching is only applicable to dense observations over time, while additional linearisation [3, 7] is necessary to model sparse observations. Non-parametric estimation of diffusion only applies to dense data [7, 4].

Figure 1: (a) A Van der Pol oscillator with local diffusion and drift, (b-c) path samples and distribution, (d) the estimated GP system, (e) three noisy input trajectories for training, (e-f) the estimated path samples that match the true samples.

In this paper we propose to infer non-parametric drift and diffusion functions with Gaussian processes for arbitrary sparse or dense data. We learn the underlying system to induce state distributions with high expected likelihood,


The expected likelihood is generally intractable. In contrast to earlier works we do not use gradient matching or other approximative models, but instead we directly tackle and optimize the SDE system against the true likelihood (4) by performing full forward simulation. We propose an unbiased stochastic Monte Carlo approximation for the likelihood, for which we derive efficient, tractable gradients. Our approach places no restrictions on the spacing or sparsity of the observations. Our model is denoted as npSDE, and the implementation is publicly available in http://www.github.com/cagatayyildiz/npde.

2 Inducing Gaussian process SDE model

In this section we model both the drift and diffusion as Gaussian processes with a general inducing point parameterisation. The drift function defines a vector field , that is, an assignment of a -dimensional gradient vector to every -dimensional state . We assume that drift does not depend on time. The diffusion function

is a standard scalar function. We model both functions as Gaussian processes (GP), which are flexible Bayesian non-linear and non-parametric models


2.1 Drift Gaussian process

The inducing point parameterisation for the drift GP was originally proposed in the context of ordinary differential equation systems

[12], which we review here. We assume a zero-mean vector-valued GP prior on the drift function


which defines a priori distribution over drift values whose mean and covariance are


where the kernel is matrix-valued [13]. A GP prior defines that for any collection of states , the drift values follow a matrix-valued normal [13],


where is a block matrix of matrix-valued kernels . The key property of Gaussian processes is that they encode functions where similar states induce similar drifts , and where the state similarity is defined by the kernels . Several families of rich matrix-valued kernels exist [14, 13, 16]. In this work we opt for the family of decomposable kernels , where is a Gaussian base kernel


with drift variance , and dimension-specific lengthscales that determine the smoothness of drift field, and is a PSD dependency matrix between dimensions. In practise global dependency structures are often unavailable, and the diagonal structure is then chosen.

In standard GP regression we would obtain posterior of the drift by conditioning the GP prior with data [8]. In SDE models the conditional is intractable due to the integral mapping (2) between observations and drifts . Instead, we augment the Gaussian process with a set of inducing vectors at locations , such that [17]

. We interpolate drift from inducing points as


which supports the drift with inducing locations and inducing vectors , and where . This corresponds to a vector-valued kernel function [13], or to a multi-task Gaussian process posterior mean [8]. Due to universality of the Gaussian kernel [18], we can represent arbitrary drifts with sufficient inducing points.

2.2 Diffusion Gaussian process

We represent diffusion as another inducing point GP, similarly to drift. Diffusion is a scalar function that uses a scalar kernel. The diffusion has a zero-mean GP prior


that defines covariance with a Gaussian kernel of form (9), but with diffusion variance and lengthscales . Diffusion values at states then follow a prior


where . We interpolate the diffusion from inducing locations with inducing values ,


Drift and diffusion naturally share their inducing locations .

2.3 Stochastic Monte Carlo inference

The inducing SDE model is determined via the inducing locations , the inducing values and , the observation noise variance , and the kernel parameters and of the drift and diffusion kernels. The posterior of the model combines the likelihood of (4) and the independent priors and

using Bayes’ theorem as


where the time-dependent state distribution now depends on the inducing parameters. We propose to approximate the true expected likelihood with an unbiased stochastic Monte Carlo averaging, since we can draw path samples from the state distribution by sampling the Brownian motion path . The stochastic likelihood estimate with

samples turns out to be a kernel density estimator with Gaussian bases.

We draw the sample paths using Euler-Maruyama(EM) method for approximating the solution of an SDE (2) [2]:


where we discretise time into subintervals of width , and sample the Wiener coefficients as with standard deviation . We set to the initial observation and use (16) to compute state path . The number of time steps is often higher than the number of observed timepoints to achieve sufficient path resolution.

Figure 2: (a) True state distribution from the model in Figure 1(a-b), (b,d) state distribution approximations with different number of path samples, (c) approximation errors.

We find a maximum a posteriori (MAP) estimates of by gradient ascent, while choosing lengthscales from a grid, keeping the inducing locations fixed on a dense grid (See Figure 1(d)) and setting . In practise in or systems placing inducing locations on a grid is a robust choice, noting that they can be also optimised with increased computational complexity [19].

2.4 Computing stochastic gradients

The gradient of the expectation of the log-likelihood (15) is


where the sample paths are from equation (16). The last term is the cumulative derivative of the state of sample at time against the parameters . The gradients of the piecewise Euler-Maruyama paths are:


The derivatives are constructed iteratively over time starting from with a fixed initial state , and the four partial derivatives are gradients of the kernel functions (10) and (13) with respect to and , respectively:


These gradients are related to the sensitivity equations [20, 21] derived for non-parametric ODEs previously [12]. The gradients (19) can be computed in practise together with the sample paths (16) during the Euler-Maruyama iteration. The computation of the gradients has the same computational complexity as the numerical simulator. The iterative gradients are superior to finite difference approximation since we have exact formulation of the gradients, albeit of the approximal Euler-Maruyama paths, which can be solved to arbitrary numerical accuracy by tuning discretisation.

3 Experiments

In order to illustrate the performance of our model, we conduct several experiments on synthetic data as well as real-world data sets. In all experiments inducing vectors are initialized by gradient matching, and then fully optimised. We use EM method to simulate the state distributions and compute stochastic gradients over samples. We use L-BFGS algorithm to compute the MAP estimates.

3.1 Double well

We first consider the double well system where the drift is given by and with constant diffusion . We generate random noisy input trajectories, each with observed data points. True dynamics and the observed data points are illustrated in Figure 3a. We fit our npSDE model with inducing points located uniformly within . We accurately approximate the true drift (the right plot on Figure 3), and learn a diffusion estimate of .

Figure 3: (a) Double well system, (b) estimated drift.
Figure 4: (a) Cumulative discrepancy in the state distributions over time and over number of observed trajectories in synthetic model, (b) drift and diffusion estimation errors.

3.2 Simple oscillating dynamics

Next, we investigate how the quality of fit changes by the amount of data used for training. We consider the synthetic system in [7], whose drift equations are given by and , and the diffusion is . The state-dependent diffusion acts an hotspot of increased path scatter, and provides interesting and challenging dynamics to infer. We generate six data batches from the true dynamics using EM method with step size , and observe every 100’th state corrupted by a Gaussian noise with variance . The data batches contain 1, 5, 10, 25, 50 and 100 input trajectories, each having 25 data points. We repeat the experiments 50 times and report the average error.

The left plot in Figure 4 illustrates the cumulative discrepancy in the state distributions over time, and the right plot shows the error between the true and estimated drift/diffusion. Unsurprisingly, the discrepancy in both plots decrease when more training data is used. We also observe that the performance gain is insignificant after 50 input trajectories.

Figure 5: (a) The ice core data, (b) estimated drift, (c) the highly state-dependent diffusion.

3.3 Ice core data

As another showcase of our model, we consider the NGRIP ice core dataset [22], which contains records of isotopic oxygen O concetrations within clacial ice cores. The record is used to explore the climatic changes that date back to last glacial period. During that period, the North Atlantic region underwent abrupt climate changes known as Dansgaard-Oeschger (DO) events. The events are characterized by a sudden increase in the temperature followed by a gradual cooling phase. Following [23], we consider timepoints from the time span from 60000 years to 20000 years before present, where 16 DO events have been identified.

Figure 5(a) illustrates the highly variable data. We observe a repeating pattern of DO events: a sudden increase followed by a slower settlement phase. The panel 5(b) indicates estimated drift that pushes the oxygen down until state and up with small states, matching the data. Interestingly, diffusion at 5(c) is highly peaked between and , which has accurately identified the regime of DO events. The model has learned to explain the DO events with high diffusion.

3.4 Human motion dataset

Figure 6: (a) Shared drift estimate learned from walking data of 4 subjects, (b) estimated sample paths, (c) density plots of the sample paths. The 4 observed trajectories are shown as black lines in (b-c), with red circles denoting the initial state.

We finally demonstrate our approach on human motion capture data. Our goal is twofold: to estimate a single drift function that captures the dynamics of the walking sequences of several people, and to explain the discrepancies among sequences via diffusion. The input trajectories are the same as in [24]: four walking sequences, each from a different person. We also follow the preprocessing method described in [24], which results in a simplified skeleton that consists of 50-dimensional pose configurations. All records are mean centered and downsampled by a factor of two.

Inference is performed in three dimensional space where the input sequences are projected using PCA. We place the inducing points on a grid and set the length-scale of both drift and diffusion process to 0.5. data set, inferred drift fit and the density of the sample paths are visualized in Figure 6. We can conclude that our model is capable of inferring drift and diffusion functions that match arbitrary data.

4 Discussion

We propose an approach for learning non-parametric drift and diffusion functions of stochastic differential equation (SDE) systems such that the resulting simulated state distributions match data. Our approach can learn arbitrary dynamics due to the flexible inducing Gaussian process formulation. We propose a stochastic estimate of the simulated state distributions and an efficient system of computing their gradients. Our approach does not place any restrictions on the sparsity or denseness of the observations data. We leave learning of time-varying drifts and diffusions as interesting future work.


The data used in this project was obtained from mocap.cs.cmu.edu. The database was created with funding from NSF EIA-0196217. This work has been supported by the Academy of Finland Center of Excellence in Systems Immunology and Physiology, the Academy of Finland grants no. 260403, 299915, 275537, 311584.


  • [1] R. Friedrich, J. Peinkeb, M. Sahimic, and R. Tabar, “Approaching complexity by stochastic methods: From biological systems to turbulence,” Phys. reports, vol. 506, pp. 87–162, 2011.
  • [2] B. Øksendal, Stochastic Differential Equations: An Introduction with Applications, Springer, 6th edition, 2014.
  • [3] A. Ruttor, P. Batz, and M. Opper, “Approximate Gaussian process inference for the drift function in stochastic differential equations,” in Advances in Neural Information Processing Systems, 2013, pp. 2040–2048.
  • [4] C. García, A. Otero, P. Felix, J. Presedo, and D. Marquez, “Nonparametric estimation of stochastic differential equations with sparse Gaussian processes,” Physical Review E, vol. 96, no. 2, pp. 022104, 2017.
  • [5] C. Archambeau, D. Cornford, M. Opper, and J. Shawe-Taylor, “Gaussian process approximations of stochastic differential equations,” in Gaussian Processes in Practice, 2007.
  • [6] M. Vrettas, M. Opper, and D. Cornford, “Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations,” Physical Review E, vol. 91, pp. 012148, 2015.
  • [7] P. Batz, A. Ruttor, and M. Opper, “Approximate bayes learning of stochastic differential equations,” arXiv:1702.05390, 2017.
  • [8] C.E. Rasmussen and K.I. Williams, Gaussian processes for machine learning, MIT Press, 2006.
  • [9] J. M. Varah, “A spline least squares method for numerical parameter estimation in differential equations,” SIAM J sci Stat Comput, vol. 3, pp. 28–46, 1982.
  • [10] S. Ellner, Y. Seifu, and R. Smith, “Fitting population dynamic models to time-series data by gradient matching,” Ecology, vol. 83, pp. 2256–2270, 2002.
  • [11] J. Ramsay, G. Hooker, D. Campbell, and J. Cao, “Parameter estimation for differential equations: a generalized smoothing approach,” J R Stat Soc B, vol. 69, pp. 741–796, 2007.
  • [12] M. Heinonen, C. Yildiz, H. Mannerström, J. Intosalmi, and H. Lähdesmäki, “Learning unknown ODE models with Gaussian processes,” in Proceedings of the 35th International Conference on Machine Learning, 2018.
  • [13] M. Alvarez, L. Rosasco, and N. Lawrence, “Kernels for vector-valued functions: A review,” Foundations and Trends in Machine Learning, 2012.
  • [14] N. Wahlström, M. Kok, and T. Schön, “Modeling magnetic fields using Gaussian processes,” IEEE ICASSP, 2013.
  • [15] I. Macedo and R. Castro, “Learning divergence-free and curl-free vector fields with matrix-valued kernels,” Instituto Nacional de Matematica Pura e Aplicada, 2008.
  • [16] C. Micchelli and M. Pontil, “On learning vector-valued functions,” Neural computation, 2005.
  • [17] J. Quiñonero-Candela and C.E. Rasmussen, “A unifying view of sparse approximate Gaussian process regression,” Journal of Machine Learning Research, vol. 6, pp. 1939–1959, 2005.
  • [18] J. Shawe-Taylor and N. Cristianini, Kernel methods for pattern analysis, Cambridge University Press, 2004.
  • [19] J. Hensman, N. Fusi, and N. Lawrence, “Gaussian processes for big data,” in

    Uncertainty in Artificial Intelligence

    . AUAI Press, 2013, pp. 282–290.
  • [20] P. Kokotovic and J. Heller, “Direct and adjoint sensitivity equations for parameter optimization,” IEEE Trans. on Automatic Control, vol. 12, pp. 609–610, 1967.
  • [21] F. Fröhlich, B. Kaltenbacher, F. Theis, and J. Hasenauer, “Scalable parameter estimation for genome-scale biochemical reaction networks,” PLOS Comp Biol, vol. 13, pp. 1–18, 2017.
  • [22] K. Andersen, N. Azuma, J-M. Barnola, M. Bigler, P. Biscaye, N. Caillon, J. Chappellaz, H. Clausen, et al., “High-resolution record of northern hemisphere climate extending into the last interglacial period,” Nature, vol. 431, pp. 147, 2004.
  • [23] F. Kwasniok, “Analysis and modelling of glacial climate transitions using simple dynamical systems,” Phil Trans R Soc A, vol. 371, pp. 20110472, 2013.
  • [24] J. Wang, D. Fleet, and A. Hertzmann, “Gaussian process dynamical models for human motion,” IEEE Trans. on pattern analysis and machine intelligence, vol. 30, pp. 283–298, 2008.