AReS and MaRS - Adversarial and MMD-Minimizing Regression for SDEs

Stochastic differential equations are an important modeling class in many disciplines. Consequently, there exist many methods relying on various discretization and numerical integration schemes. In this paper, we propose a novel, probabilistic model for estimating the drift and diffusion given noisy observations of the underlying stochastic system. Using state-of-the-art adversarial and moment matching inference techniques, we circumvent the use of the discretization schemes as seen in classical approaches. This yields significant improvements in parameter estimation accuracy and robustness given random initial guesses. On four commonly used benchmark systems, we demonstrate the performance of our algorithms compared to state-of-the-art solutions based on extended Kalman filtering and Gaussian processes.


1 Introduction

Modeling discretely observed time series is a challenging problem that arises in the quantitative sciences and many fields of engineering. While it is possible to tackle such problems with difference equations or ordinary differential equations (ODEs), both approaches suffer from serious drawbacks. Difference equations are difficult to apply if the observation times are unevenly distributed. Furthermore, they do not generalize well across observation frequencies, even though the underlying natural laws are independent of this artificial construct. ODEs can deal with these challenges, but fail to incorporate the inherent stochastic behavior present in many physical, chemical or biological systems. These effects can be captured by introducing stochasticity into the dynamics model, which leads to stochastic differential equations (SDEs). In this paper, we work exclusively with the Itô form

(1)   dx(t) = f(x(t), θ) dt + G(x(t)) dW(t),

where x(t) is the time-dependent vector of states we would like to model, θ collects the parameters of the model, f is the drift term, G is the vector-valued diffusion function and W(t) is a Wiener process of the same dimension as the state vector x(t).
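For illustration, classical approaches would discretize equation (1), e.g. with an Euler-Maruyama scheme as sketched below; the scalar drift, diffusion level and step size are illustrative choices, not taken from the paper.

```python
import numpy as np

def euler_maruyama(f, G, x0, t_grid, theta, rng):
    """Simulate dx = f(x, theta) dt + G dW on a fixed grid (Euler-Maruyama)."""
    x = np.empty((len(t_grid), len(x0)))
    x[0] = x0
    for i in range(1, len(t_grid)):
        dt = t_grid[i] - t_grid[i - 1]
        dW = rng.normal(scale=np.sqrt(dt), size=len(x0))  # Wiener increments
        x[i] = x[i - 1] + f(x[i - 1], theta) * dt + G @ dW
    return x

# Illustrative example: mean-reverting drift f(x) = theta * (1 - x), constant diffusion.
rng = np.random.default_rng(0)
f = lambda x, theta: theta * (1.0 - x)
G = np.array([[0.3]])
path = euler_maruyama(f, G, x0=np.array([2.0]), t_grid=np.linspace(0, 10, 1001),
                      theta=0.5, rng=rng)
print(path[-1])
```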

While SDEs can efficiently capture stochasticity and deal with unevenly spaced observation times and varying observation frequencies, inference is rather challenging. Due to the stochasticity of the Wiener process W(t), the state vector x(t) is itself a random variable. Except for a few special cases, it is not possible to find an analytic solution for the statistics of x(t) for general drift and diffusion terms. The problem becomes even more challenging if we want to condition on discrete-time observations to estimate the states (filtering/smoothing) or to infer statistics of the parameters (parameter inference). It is well known that parameter inference is a difficult task, with most approaches either being very sensitive to initialization (Picchini, 2007), strongly dependent on the choice of hyperparameters such as the spacing of the integration grid (Bhat et al., 2015), or requiring excessive computational resources even for small-scale systems and state-of-the-art implementations (Ryder et al., 2018).

The difficulty of estimating the parameters of drift and diffusion under observational noise is readily exemplified by the fact that even major providers of scientific programming environments still lack an established toolbox for practical use. In this paper, we take a step in a novel direction to tackle this open and exciting research question.

1.1 Related Work

While it is impossible to cover all related research efforts, we would like to give a quick overview by mentioning some of the most relevant. For a more in-depth discussion, we recommend Tronarp and Särkkä (2018), who provide an excellent review of the current state-of-the-art smoothing schemes. Moreover, Sørensen (2004) and Nielsen et al. (2000) provide extensive explanations of the more traditional approaches.

Most classical methods rely on calculating the distribution of the states given the parameters. Since this distribution is usually analytically intractable, approximation schemes are necessary. Elerian et al. (2001) and Eraker (2001) used the Euler-Maruyama discretization to approximate it on a fixed, fine grid of artificial observation times, later to be leveraged in an MCMC sampling scheme. Pieschner and Fuchs (2018) and van der Meulen et al. (2017) subsequently refined this approach with improved bridge constructs and incorporated partial observability. Ryder et al. (2018) recently followed up on this idea by combining discretization procedures with variational inference. Särkkä et al. (2015) investigate different approximation methods based on Kalman filtering, while Archambeau et al. (2007) and Vrettas et al. (2015) use a variational Gaussian process-based approximation. Finally, it should also be mentioned that the state densities can be inferred by solving the Fokker-Planck-Kolmogorov equations using standard numerical methods for PDEs (Hurn and Lindsay, 1999; Aït-Sahalia, 2002).

Instead of approximating the state distribution in a variational fashion, Gaussian processes can also be used to directly model the drift and diffusion functions, ignoring in this way any prior knowledge about their parametric form. This approach was investigated by Ruttor et al. (2013), whose linearization and discretization assumptions were later relaxed by Yildiz et al. (2018). While we will show in our experiments that these methods can be used for parameter estimation if the parametric forms of drift and diffusion are known, it should be noted that parameter inference was not the original goal of their work.

1.2 Our Work

To the best of our knowledge, there are only very few works that try to circumvent calculating the state distribution altogether. Our approach is probably most closely related to the ideas presented by Riesinger et al. (2016). Their approach relies on the Doss-Sussmann transformation (Doss, 1977; Sussmann, 1978) to reduce the parameter inference problem for SDEs to parameter inference in an ensemble of random ordinary differential equations (RODEs). These equations can then be solved path-wise, using either standard numerical schemes or the computationally efficient gradient matching scheme of Gorbach et al. (2017), as proposed by Bauer et al. (2017).

The path-wise method of Bauer et al. (2017) parallelizes naturally, but there is still an inherent approximation error due to the Monte Carlo estimation of the expectation over the stochastic element in the RODE. Furthermore, their framework imposes severe linearity restrictions on the functional form of the drift f, while being unable to estimate the diffusion matrix G.

While we keep their assumption of a constant diffusion matrix, i.e. G(x(t)) = G, our approach gets rid of the linearity assumptions on the drift f. Furthermore, we substitute the Monte Carlo approximation by embedding the SDE into a fully statistical framework, allowing for efficient estimation of both θ and G using state-of-the-art statistical inference methods.

Even though a constant diffusion assumption might seem restrictive at first, such SDE models are widely used, e.g. in chemical engineering (Karimi and McAuley, 2018), civil engineering (Jiménez et al., 2008), pharmacology (Donnet and Samson, 2013) and of course in signal processing, control and econometrics. While we believe that this approach could be extended approximately to systems with general diffusion matrices, we leave this for future work.

The contributions of our framework are the following:

  • We derive a new statistical framework for diffusion and drift parameter estimation of SDEs using the Doss-Sussmann transformation and Gaussian Processes.

  • We introduce a grid-free, computationally efficient and robust parameter inference scheme that combines a non-parametric Gaussian Process model with adversarial loss functions.

  • We demonstrate that our method is able to estimate constant but non-diagonal diffusion terms of stochastic differential equations without any functional form assumption on the drift.

  • We show that our method significantly outperforms the state-of-the-art algorithms for SDEs with multi-modal state posteriors, both in terms of diffusion and drift parameter estimation.

  • We share and publish our code to facilitate future research. Code available at https://github.com/gabb7/AReS-MaRS

2 Background

In this section, we formalize our problem and introduce the necessary notation and background drawn from Gaussian process-based gradient matching for ODEs.

2.1 Problem Setting

We consider SDEs of the form

(2)   dx(t) = f(x(t), θ) dt + G dW(t),

where x(t) is the K-dimensional state vector at time t; dW(t) are the increments of a standard Wiener process; f is an arbitrary, potentially highly nonlinear function whose parametric form is known, save for the unknown parameter vector θ; G is the unknown but constant diffusion matrix. Without loss of generality, we can assume G to be a lower-triangular, positive semi-definite matrix.

The system is observed at N arbitrarily spaced time points t = [t_1, …, t_N], subject to Gaussian observation noise:

(3)   y(t_i) = x(t_i) + ε(t_i),

where we assume the noise variances to be state-dependent but time-independent, i.e.

(4)   ε_k(t_i) ~ N(0, σ_k²),

for k = 1, …, K and i = 1, …, N.

2.2 Deterministic ODE Case

In the context of Bayesian parameter inference for deterministic ordinary differential equations, Calderhead et al. (2009) identify numerical integration as the main culprit for poor computational performance. Thus, they propose to turn the parameter estimation procedure on its head: instead of calculating the states via numerical integration, they extract two probabilistic estimates of the state derivatives, one using only the noisy observations and one using the differential equations. The main challenge is then to combine these two distributions such that the information in the observations can guide the search towards good parameter estimates. For this purpose, Calderhead et al. (2009) propose a product of experts heuristic, which was accepted and reused until Wenk et al. (2018) showed that this heuristic leads to severe theoretical issues. They instead propose an alternative graphical model, forcing equality between the data-based and the ODE-based model, save for a Gaussian-distributed slack variable.

In this paper, we use another interpretation of gradient matching, which aims at finding parameters θ such that the two distributions over the derivatives match as closely as possible. Acknowledging the fact that standard methods like minimizing the KL divergence are not tractable, we use robust moment matching techniques, while solving the much harder problem of SDEs with non-zero diffusion G. However, it should be clear that our methodology could easily be applied to the special case of deterministic ODEs and thus provides an additional contribution towards parameter estimation for systems of ODEs.

2.3 Notation

Throughout this paper, bold capital letters denote matrices. Values of a time-dependent quantity such as the state vector x(t) can be collected in a matrix X of dimension K × N, whose k-th row collects the values of state k at the observation times t_1, …, t_N.

The matrix X can be vectorized by concatenating its rows, defining in this way the vector x. This vector should not be confused with x(t), which is still a time-dependent vector of dimension K.

As we work with Gaussian processes, it is useful to standardize the state observations by subtracting the mean and dividing by the standard deviation, in a state-wise fashion. We define the vector σ_y of the state-wise standard deviations of the observations, and the matrix

(5)   S = diag(σ_y) ⊗ I_N,

where ⊗ indicates the Kronecker product and I_N is the identity matrix of size N. Similarly for the means, we can define the vector μ_y that contains the state-wise means of the observations, each repeated N times. Thus, the standardized vector can be defined as:

(6)   ỹ = S^{-1} (y − μ_y).

For the sake of clarity, we omit the normalization in the following sections. It should be noted, however, that in a finite sample setting, standardization strongly improves the performance of GP regression. In both our implementation and all the experiments shown in section 4, we assume a GP prior on states that were standardized using the state-wise mean and standard deviation of the observations y.

For coherence with the current Gaussian process-based gradient matching literature, we follow the notation introduced by Calderhead et al. (2009) and Wenk et al. (2018) wherever possible.
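As a small illustration of this notation, the following sketch (with hypothetical array names) performs the state-wise standardization and the row-wise vectorization of the observation matrix, using the Kronecker-product form of equation (5) as reconstructed above.

```python
import numpy as np

Y = np.array([[1.0, 2.0, 3.0, 4.0],       # K x N matrix of observations:
              [10.0, 20.0, 30.0, 40.0]])  # row k holds state k at times t_1..t_N
K, N = Y.shape

mean = Y.mean(axis=1, keepdims=True)   # state-wise means
std = Y.std(axis=1, keepdims=True)     # state-wise standard deviations

Y_tilde = (Y - mean) / std             # standardized observation matrix
y_tilde = Y_tilde.reshape(-1)          # vectorization by concatenating rows

# Equivalent formulation with the scaling matrix S = diag(sigma_y) (x) I_N:
S = np.kron(np.diag(std.ravel()), np.eye(N))
mu = np.repeat(mean.ravel(), N)
assert np.allclose(y_tilde, np.linalg.solve(S, Y.reshape(-1) - mu))
```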

3 Methods

In the deterministic case with G = 0, the GP regression model can be directly applied to the states x(t). However, if G ≠ 0, the derivatives of the states with respect to time no longer exist due to the contributions of the Wiener process. Thus, performing direct gradient matching on the states is not feasible.

3.1 Latent States Representation

We propose to tackle this problem by introducing a latent variable z(t), defined via the linear coordinate transformation

(7)   x(t) = z(t) + o(t),

where o(t) is the solution of the following SDE:

(8)   do(t) = −o(t) dt + G dW(t).

Without loss of generality, we set o(0) = 0 and thus z(0) = x(0). While in principle the framework supports more general initial conditions, the reasons for this choice will become clear in section 3.2.3.

Using Itô’s formula, we obtain the following differential equation for z(t):

(9)   ż(t) = f(z(t) + o(t), θ) + o(t).

This means that for a given realization of o(t), we obtain a differentiable latent state z(t). In principle, we could sample realizations of o(t) and solve the corresponding deterministic problems, which is equivalent to approximately marginalizing over o(t). However, it is actually possible to treat this problem statistically, completely bypassing such a marginalization. We do this by analytically deriving probabilistic generative models for the observations and the derivatives. The equations are derived in this section, while the final models are shown in Figure 1.
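As a sanity check of the transformation in equations (7)-(9) as reconstructed here, the sketch below simulates x(t) and o(t) with the same Brownian increments and verifies numerically that z = x − o evolves according to the noise-free equation (9); the drift, parameters and step size are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, G = 2.0, 0.4
f = lambda x, th: th * (1.0 - x)          # illustrative scalar drift

dt, n = 1e-4, 20000
dW = rng.normal(scale=np.sqrt(dt), size=n)

x = np.empty(n + 1); o = np.empty(n + 1)
x[0], o[0] = 1.5, 0.0                      # o(0) = 0, hence z(0) = x(0)
for i in range(n):
    x[i + 1] = x[i] + f(x[i], theta) * dt + G * dW[i]    # eq. (2)
    o[i + 1] = o[i] - o[i] * dt + G * dW[i]              # eq. (8), same dW

z = x - o                                                # eq. (7)
# Finite-difference derivative of z vs. the right-hand side of eq. (9):
z_dot = np.diff(z) / dt
rhs = f(z[:-1] + o[:-1], theta) + o[:-1]
print(np.max(np.abs(z_dot - rhs)))   # tiny; the Wiener increments cancel exactly
```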

3.2 Generative Model for Observations

Let us define ε(t) as the Gaussian observation error at time t. Using the matrix notation introduced in section 2.3, we can write

(10)   Y = Z + O + E,

where Y, Z, O and E are the matrices corresponding to the lower-case variables introduced in the previous sections. In contrast to standard GP regression, we have an additional additive noise term O, which is the result of the stochastic process described by equation (8). As in standard GP regression, it is possible to recover a closed-form Gaussian distribution for each term.

3.2.1 GP Prior

We assume a zero-mean Gaussian prior over the latent states z, whose covariance matrix C_φ is given by a kernel function k_φ, in turn parameterized by the hyperparameter vector φ:

(11)   p(z | φ) = N(z | 0, C_φ).

We treat all state dimensions as independent, meaning that we put independent GP priors with separate hyperparameters on the time evolution of each state. Consequently, C_φ will be a block-diagonal matrix with K blocks, each of dimension N × N. The blocks model the correlation over time introduced by the GP prior.

3.2.2 Error Model

In equation (4), we assume that our errors are i.i.d. Gaussians, uncorrelated over time. The joint distribution of all errors is thus still a Gaussian distribution, whose covariance matrix Σ has only diagonal elements, given by the GP likelihood variances σ_k². More precisely:

(12)   Σ = diag(σ_1², …, σ_K²) ⊗ I_N

and

(13)   p(e | Σ) = N(e | 0, Σ).

3.2.3 Ornstein-Uhlenbeck Process

Through the coordinate transformation in equation (7), all stochasticity is captured by the stochastic process described by equation (8). This mathematical construct has a closed-form, Gaussian solution and is called an Ornstein-Uhlenbeck (OU) process. For the one-dimensional case with zero initial condition and unit diffusion,

(14)   dω(t) = −ω(t) dt + dW(t),   ω(0) = 0,

we get the following mean and covariance:

(15)   E[ω(t)] = 0,
(16)   cov(ω(s), ω(t)) = (1/2) (exp(−|t − s|) − exp(−(t + s))).

Sampling at the discrete time points t yields the vector ω, which is Gaussian distributed:

(17)   p(ω) = N(ω | 0, Σ_OU),

where the elements of Σ_OU are given by equation (16). In the case of a K-dimensional process with identity diffusion, i.e.

(18)   dω(t) = −ω(t) dt + dW(t),

we can just treat each state dimension as an independent, one-dimensional OU process. Thus, after sampling K times at the time points in t and unrolling the resulting matrix as described in section 2.3, we get

(19)   p(ω) = N(ω | 0, Σ_OU),

where, with a slight abuse of notation, Σ_OU is now a block-diagonal matrix with one block of the form (16) for each state dimension.

Using Itô’s formula, we can show that the samples of the original Ornstein-Uhlenbeck process o(t) with diffusion G can be obtained at each time point via the linear coordinate transformation

(20)   o(t) = G ω(t).

Let T be defined as the matrix that performs this linear transformation for the unrolled vectors. We can then write the density of the original Ornstein-Uhlenbeck process as

(21)   p(o) = N(o | 0, T Σ_OU T^T).
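To make the closed-form solution concrete, the sketch below builds the covariance matrix of equation (16) as reconstructed here and compares it with the empirical covariance of exactly simulated, zero-initialized, unit-diffusion OU paths; this is an illustrative check, not part of the authors' pipeline.

```python
import numpy as np

def ou_covariance(times):
    """Covariance of the zero-initialized, unit-diffusion OU process, cf. eq. (16)."""
    s, t = np.meshgrid(times, times, indexing="ij")
    return 0.5 * (np.exp(-np.abs(t - s)) - np.exp(-(t + s)))

times = np.linspace(0.1, 5.0, 20)
Sigma_ou = ou_covariance(times)

# Monte Carlo check using the exact Gaussian transition of the OU process.
rng = np.random.default_rng(2)
n_paths = 20000
samples = np.zeros((n_paths, len(times)))
prev_t, state = 0.0, np.zeros(n_paths)
for j, t_j in enumerate(times):
    delta = t_j - prev_t
    mean = state * np.exp(-delta)                 # conditional mean of the transition
    var = 0.5 * (1.0 - np.exp(-2.0 * delta))      # conditional variance of the transition
    state = mean + rng.normal(scale=np.sqrt(var), size=n_paths)
    samples[:, j] = state
    prev_t = t_j

empirical = np.cov(samples, rowvar=False)
print(np.max(np.abs(empirical - Sigma_ou)))   # small, up to Monte Carlo error
```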

3.2.4 Marginals of the Observations

Using equation (10), the marginal distribution of y can be computed as the sum of three independent, zero-mean, Gaussian-distributed random variables, described respectively by equations (11), (13) and (21). Thus, y is again Gaussian distributed, according to

(22)   p(y | φ, σ, G) = N(y | 0, Σ_y),

where

(23)   Σ_y = C_φ + Σ + T Σ_OU T^T.

It is important to note that, due to the latent state representation, the diffusion matrix G is now part of the hyperparameters of the observation model. It will later be inferred alongside the hyperparameters of our GP using maximum evidence (Rasmussen, 2004). Using a stationary kernel, C_φ captures the stationary part of x(t) as in standard GP regression, while the parameters in G describe the non-stationary part deriving from the Ornstein-Uhlenbeck process. This ultimately leads to an identifiable problem.
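As a hedged illustration of how the diffusion enters the evidence, the sketch below fits a scalar diffusion G jointly with RBF kernel hyperparameters and a noise level by maximizing the marginal likelihood implied by the reconstruction of equations (22)-(23); the kernel choice, toy data and optimizer settings are illustrative assumptions, not the authors' exact setup.

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve
from scipy.optimize import minimize

def rbf(times, variance, lengthscale):
    d = times[:, None] - times[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def ou_cov(times):
    s, t = np.meshgrid(times, times, indexing="ij")
    return 0.5 * (np.exp(-np.abs(t - s)) - np.exp(-(t + s)))

def neg_log_evidence(log_params, times, y):
    var, ls, sigma, G = np.exp(log_params)
    # Sigma_y = C_phi + sigma^2 I + G^2 Sigma_OU   (cf. eq. (23), single state)
    Sigma_y = rbf(times, var, ls) + sigma**2 * np.eye(len(times)) + G**2 * ou_cov(times)
    c, lower = cho_factor(Sigma_y + 1e-8 * np.eye(len(times)))
    alpha = cho_solve((c, lower), y)
    # 0.5 * y^T Sigma_y^{-1} y + 0.5 * log|Sigma_y|, up to an additive constant
    return 0.5 * y @ alpha + np.sum(np.log(np.diag(c)))

# Toy data: observations of a single state (replace with real data).
rng = np.random.default_rng(3)
times = np.linspace(0.05, 5.0, 60)
y = np.sin(times) \
    + 0.4 * rng.multivariate_normal(np.zeros(len(times)), ou_cov(times)) \
    + 0.1 * rng.normal(size=len(times))

res = minimize(neg_log_evidence, x0=np.log([1.0, 1.0, 0.1, 0.5]),
               args=(times, y), method="L-BFGS-B", bounds=[(-5.0, 5.0)] * 4)
print("estimated diffusion G ~", np.exp(res.x)[3])
```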

Figure 1: Generative models for the two different ways to compute the derivatives of the latent states z. (a) SDE-based model. (b) Data-based model.

3.3 Generative Model for Derivatives

Similarly to gradient matching approaches, we create two generative models for our derivatives, one based on the data and one based on the SDE model.

3.3.1 Data-based Model

As shown, e.g., in the appendix of Wenk et al. (2018), the prior defined in equation (11) automatically induces a GP prior on the derivatives of z, conditioned on the states. Defining

(24)   D := 'C_φ C_φ^{-1},
(25)   A := C''_φ − 'C_φ C_φ^{-1} C'_φ,

where

(26)   C'_φ := Cov(z, ż),
(27)   'C_φ := Cov(ż, z),
(28)   C''_φ := Cov(ż, ż),

are the cross-covariance matrices obtained by differentiating the kernel function with respect to its second, first or both arguments respectively, evaluated at the observation times, we can write

(29)   p(ż | z, φ) = N(ż | D z, A).
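For concreteness, the following sketch implements the standard Gaussian process derivative identities behind equations (24)-(29) (cf. Wenk et al., 2018) for an RBF kernel; the closed-form kernel derivatives are textbook results, but the exact correspondence to the paper's equation numbers is an assumption of this reconstruction.

```python
import numpy as np

def rbf_blocks(times, variance, lengthscale):
    """RBF kernel and its derivative cross-covariances on a time grid."""
    d = times[:, None] - times[None, :]           # d = t_i - t_j
    C = variance * np.exp(-0.5 * (d / lengthscale) ** 2)
    dC_left = -(d / lengthscale**2) * C           # Cov(dz(t_i)/dt, z(t_j))
    dC_right = (d / lengthscale**2) * C           # Cov(z(t_i), dz(t_j)/dt)
    ddC = (1.0 / lengthscale**2 - (d / lengthscale**2) ** 2) * C  # Cov of derivatives
    return C, dC_left, dC_right, ddC

def derivative_conditional(times, variance, lengthscale, jitter=1e-8):
    """Return D and A with p(z_dot | z) = N(D z, A), as in GP gradient matching."""
    C, dC_left, dC_right, ddC = rbf_blocks(times, variance, lengthscale)
    C_inv = np.linalg.inv(C + jitter * np.eye(len(times)))
    D = dC_left @ C_inv
    A = ddC - dC_left @ C_inv @ dC_right
    return D, A

times = np.linspace(0.0, 5.0, 30)
D, A = derivative_conditional(times, variance=1.0, lengthscale=0.8)
z = np.sin(times)                              # any smooth latent state sample
print(np.max(np.abs(D @ z - np.cos(times))))   # D z approximates the time derivative
```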

3.3.2 SDE-based Model

There is also a second way of obtaining an expression for the derivatives of z, namely using equation (9):

(30)   p(ż | z, o, θ) = δ(ż − f(z + o, θ) − o),

where δ represents the Dirac delta.

3.4 Inference

Combined with the modeling paradigms introduced in the previous sections, this yields two generative models relating the observations to the derivatives of the latent states, as pictured in Figure 1. The graphical model in Figure 1(a) represents the derivatives obtained via the generative process described by the SDE, i.e. through the nonlinear drift function f. The model in Figure 1(b) represents the derivatives yielded by the generative process described by the Gaussian process. Assuming a perfect GP fit and access to the true parameters θ, these two distributions should intuitively be equal. We thus want to find parameters that minimize the difference between these two distributions.

Compared to the deterministic ODE case, the graphical models in Figure 1 contain additional dependencies on the contribution of the Ornstein-Uhlenbeck process o. Furthermore, the SDE-based probability distribution of the derivatives in Figure 1(a) depends on z, o and θ through a potentially highly nonlinear drift function f. Thus, it is not possible to perform analytical inference without making restrictive assumptions on the functional form of f.

However, as shown in Appendix A.2, it is possible to derive computationally efficient ancestral sampling schemes for both models, as summarized in Algorithm 1. While the intractability rules out classical approaches like analytically minimizing the KL divergence, we are now able to deploy likelihood-free algorithms that were designed for matching two probability densities based on samples.

1:Input:
2:Ancestral sampling the SDE model
3:
4:
5:
6:Ancestral sampling the Data model
7:
8:
9:Return:
Algorithm 1 Ancestral sampling for the SDE-based and the data-based derivative models
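Since the concrete conditionals of Algorithm 1 are only given in the appendix (and are partially lost in this extraction), the sketch below shows one plausible instantiation for a single state dimension, based on the reconstructions used throughout this section; it is an illustrative reading, not the authors' reference implementation. The matrices C, Sigma_ou, D and A can be assembled with the kernel and OU helpers sketched above.

```python
import numpy as np

def gaussian_condition(y, cov_xy, cov_xx, cov_yy, rng):
    """Draw x | y for zero-mean, jointly Gaussian (x, y)."""
    inv = np.linalg.solve(cov_yy, np.eye(len(y)))
    mean = cov_xy @ inv @ y
    cov = cov_xx - cov_xy @ inv @ cov_xy.T
    return rng.multivariate_normal(mean, cov + 1e-8 * np.eye(len(y)))

def sample_derivatives(y, f, theta, C, Sigma_ou, sigma, G, D, A, rng):
    """One plausible reading of Algorithm 1 for a single state dimension."""
    N = len(y)
    O_cov = G**2 * Sigma_ou                       # covariance of the OU contribution
    Sigma_y = C + sigma**2 * np.eye(N) + O_cov    # cf. eq. (23), scalar case

    # --- SDE-based model: sample o | y, then z | o, y, then apply eq. (9) ---
    o = gaussian_condition(y, O_cov, O_cov, Sigma_y, rng)
    z = gaussian_condition(y - o, C, C, C + sigma**2 * np.eye(N), rng)
    x_dot_sde = f(z + o, theta) + o               # f acts elementwise over time points

    # --- Data-based model: sample z | y, then z_dot | z ~ N(D z, A) ---
    z_data = gaussian_condition(y, C, C, Sigma_y, rng)
    x_dot_data = rng.multivariate_normal(D @ z_data, A + 1e-8 * np.eye(N))
    return x_dot_sde, x_dot_data
```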

3.5 Adversarial Sample-based Inference

Figure 2: Comparing the sampled gradients from the graphical model in Figure 1(a) (model-based) and the graphical model in Figure 1(b) (data-based), (a) before and (b) after adversarial training on Lotka-Volterra.

Arguably among the most popular algorithms of this kind is the generative adversarial network (GAN), where a parametric neural network is trained to match the unknown likelihood of the data (Goodfellow et al., 2014). The basic GAN setup consists of a fixed data set, a generator that tries to create realistic samples of said data set, and a discriminator that tries to tell the fake samples from the true ones. As recently shown by Yang et al. (2018), GANs have huge potential for solving stochastic partial differential equations (SPDEs).

Yang et al. (2018) assume a fixed data set consisting of observations (similar to our y) and use an SPDE-inspired neural network as a generator for realistic observations. In the case of SDEs, however, this would still involve a lot of numerical integration. Thus, we modify the GAN setup by leaving behind the idea of having a fixed data set. Instead of relying on bootstrapped samples of our observations, we sample the derivatives from our data-based model shown in Figure 1(b), which, assuming a sufficiently good model fit, represent the true derivatives of our latent variable z. We then use the SDE-based model shown in Figure 1(a) as a generator. To avoid standard GAN problems such as training instability and to improve robustness, we replace the discriminator with a critic trained to estimate the Wasserstein distance between the two sample distributions, as proposed by Arjovsky et al. (2017). The resulting algorithm, summarized in Algorithm 2, can be interpreted as performing Adversarial Regression for SDEs and will thus be called AReS. In Figure 2, we show the derivatives sampled from the two models both before and after training, for one example run of the Lotka-Volterra system (cf. Section 4.4). While not perfect, the GAN is clearly able to push the SDE-based gradients towards the gradients of the observed data.

3.6 Maximum Mean Discrepancy

Even though GANs work well in practical settings, they need careful balancing between their generator and discriminator. Dziugaite et al. (2015) propose to solve this problem by using the Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as a metric to substitute the discriminator. As proposed by Li et al. (2015), we choose a rational quadratic kernel to obtain a robust discriminator that can be deployed without fine-tuning on a variety of problems. The resulting algorithm, summarized in Algorithm 3, can be interpreted as performing Maximum mean discrepancy-minimizing Regression for SDEs and will thus be called MaRS.

1:Input: Observations y at times t, an SDE model, the learning rate, the number of total iterations, the clipping parameter, the batch size and the number of critic iterations per generator iteration.
2:Train the Gaussian process on the data to recover the hyperparameters φ, σ and the diffusion G
3:Initialize the critic parameters and the SDE parameters θ
4:for   do
5:     for  do
6:         Sample model-based and data-based derivative batches as described in Algorithm 1
7:         
8:         
9:         
10:     end for
11:     
12:     
13:end for
Algorithm 2 AReS
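For readers unfamiliar with the Wasserstein critic used in Algorithm 2, a generic weight-clipped critic update in the style of Arjovsky et al. (2017) is sketched below in PyTorch; the critic architecture follows the description in Section 4.1, but the optimizer settings and the random stand-in batches are illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class Critic(nn.Module):
    """2-layer fully connected critic with 256 and 128 hidden units."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(),
                                 nn.Linear(256, 128), nn.ReLU(),
                                 nn.Linear(128, 1))

    def forward(self, x):
        return self.net(x)

def critic_step(critic, opt, data_samples, model_samples, clip=0.01):
    """One WGAN critic update with weight clipping (Arjovsky et al., 2017)."""
    opt.zero_grad()
    # The critic maximizes E[critic(data)] - E[critic(model)], i.e. minimizes the negative.
    loss = critic(model_samples).mean() - critic(data_samples).mean()
    loss.backward()
    opt.step()
    for p in critic.parameters():
        p.data.clamp_(-clip, clip)
    return -loss.item()   # current estimate of the Wasserstein distance

# Illustrative usage with random stand-ins for the two derivative batches.
dim = 10
critic = Critic(dim)
opt = torch.optim.RMSprop(critic.parameters(), lr=5e-5)
data_samples = torch.randn(256, dim)          # stand-in for data-based derivatives
model_samples = torch.randn(256, dim) + 0.5   # stand-in for SDE-based derivatives
for _ in range(5):
    w_est = critic_step(critic, opt, data_samples, model_samples)
print(w_est)
```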
1:Input: Observations y at times t, an SDE model, the learning rate, the number of iterations and the batch size.
2:Train the Gaussian process on the data to recover the hyperparameters φ, σ and the diffusion G
3:Initialize the SDE parameters θ
4:for   do
5:     Sample model-based and data-based derivative batches as described in Algorithm 1; each batch contains the specified number of elements
6:     
7:     
8:end for
Algorithm 3 MaRS
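Similarly, the MMD objective minimized in Algorithm 3 can be written down in a few lines; the single rational quadratic kernel with fixed hyperparameters below is an illustrative choice in the spirit of Li et al. (2015) and Gretton et al. (2012), not necessarily the configuration used in the released code.

```python
import numpy as np

def rq_kernel(X, Y, alpha=1.0, lengthscale=1.0):
    """Rational quadratic kernel between two sample sets (rows are samples)."""
    sq = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return (1.0 + sq / (2.0 * alpha * lengthscale**2)) ** (-alpha)

def mmd2(X, Y, **kw):
    """Biased estimate of MMD^2 between samples X and Y (Gretton et al., 2012)."""
    return rq_kernel(X, X, **kw).mean() + rq_kernel(Y, Y, **kw).mean() \
        - 2.0 * rq_kernel(X, Y, **kw).mean()

# Example: two batches of "derivative" samples; MMD^2 grows with the mismatch.
rng = np.random.default_rng(4)
data_batch = rng.normal(size=(256, 10))
model_batch_good = rng.normal(size=(256, 10))
model_batch_bad = rng.normal(loc=1.0, size=(256, 10))
print(mmd2(data_batch, model_batch_good), mmd2(data_batch, model_batch_bad))
```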

4 Experiments

4.1 Setups

To evaluate the empirical performance of our method, we conduct several experiments on simulated data, using four standard benchmark systems and comparing against the EKF-based approach of Särkkä et al. (2015) and the two GP-based approaches by Vrettas et al. (2015) and Yildiz et al. (2018).

The first system is a simple Ornstein-Uhlenbeck process, shown in Figure 3(a) and given by the SDE

(31)

As mentioned in Section 3.2.3, this system has an analytical Gaussian process solution and thus serves more academic purposes. We simulate our data set using fixed ground-truth values for the drift parameters, the diffusion and the observation noise.

The second system is the Lorenz63 model given by the SDEs

In both systems, the drift function is linear in one state or one parameter conditioned on all the others (cf. Gorbach et al., 2017). Furthermore, there is no coupling across state dimensions in the diffusion matrix. This leads to two more interesting test cases.

To investigate the algorithm’s capability to deal with off-diagonal terms in the diffusion, we introduce the two-dimensional Lotka-Volterra system shown in Figure 3(b), given by the SDEs

(32)

where G is, without loss of generality, assumed to be a lower-triangular matrix. The true parameter vector θ and the initial state are kept fixed across all realizations. Since its original introduction by Lotka (1932), the Lotka-Volterra system has been widely used to model population dynamics in biology. The system is observed at 50 equidistant time points. As it turns out, this problem is remarkably challenging for all algorithms, even without the presence of observation noise.

To investigate the effect of strong nonlinearities in the drift, we introduce the Ginzburg-Landau double-well potential shown in Figure 3(c), defined by the SDE

(33)

For our choice of drift parameters and diffusion, this system exhibits an interesting bifurcation effect. While there are two stable equilibria, it is completely up to the noise which one the system ends up in. This is thus the perfect system to test how well an algorithm can deal with multi-modal SDEs. The system is observed at 50 equidistant time points, subject to Gaussian observational noise.
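To illustrate the bifurcation behaviour described above, the following sketch simulates a few double-well trajectories; since the exact drift parameters of equation (33) are not recoverable here, it uses the generic double-well drift x − x³ with stable wells at ±1 and an assumed diffusion level.

```python
import numpy as np

def simulate_double_well(x0=0.0, sigma=0.5, T=10.0, dt=1e-3, n_paths=10, seed=5):
    """Euler-Maruyama paths of dx = (x - x^3) dt + sigma dW (illustrative drift)."""
    rng = np.random.default_rng(seed)
    n = int(T / dt)
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n):
        x += (x - x**3) * dt + sigma * np.sqrt(dt) * rng.normal(size=n_paths)
    return x

final = simulate_double_well()
print(np.sort(np.round(final, 2)))   # paths split between the wells near -1 and +1
```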

Lastly, some implementation details are kept constant throughout all experiments: the critic in the adversarial parameter estimation is a 2-layer fully connected neural network with 256 and 128 nodes, respectively. Every batch, for both MMD and adversarial training, contains 256 elements. While the Ornstein-Uhlenbeck and double-well systems were modeled with a sigmoid kernel, for Lotka-Volterra we used a standard RBF kernel (we refer to Rasmussen (2004) for more information about kernels and GPs).

(a) Ornstein Uhlenbeck
(b) Lotka Volterra
(c) Double Well
Figure 3: Sample trajectories for three different benchmark systems. While Ornstein Uhlenbeck and Lotka Volterra are rather tame, the Double Well system clearly exhibits a bifurcation effect.

4.2 Evaluation

For all systems, the parameters turn out to be identifiable. Thus, the estimated parameter values are a good indicator of how well an algorithm is able to infer the drift function. However, since the different dimensions of the Wiener process are independent, there are multiple diffusion matrices G that create the same trajectories. We thus directly compare the variance of the increments, i.e. the components of G G^T.

To account for statistical fluctuations, we use 100 independent realizations of the SDE systems and compare the mean and standard deviation of the estimated drift parameters and diffusion terms. AReS and MaRS are compared against the Gaussian process-based approaches of Vrettas et al. (2015) and Yildiz et al. (2018), as well as the classic Kalman filter-based approach recommended by Särkkä et al. (2015).

4.3 Locally Linear Systems

As mentioned in Section 4.1, the functional forms of the drift functions of both the Ornstein-Uhlenbeck process and the Lorenz 63 system satisfy a local linearity assumption, while their diffusion is kept diagonal. They thus serve as excellent benchmark systems for a wide variety of parameter inference algorithms. The empirical results for the Ornstein-Uhlenbeck process are shown in Table 1. Unfortunately, one of the competing algorithms turns out to be rather unstable if both the diffusion and the parameters are unknown, even though it takes on average roughly 58 hours to converge. We have thus provided it with the true diffusion and show only its empirical parameter estimates. Since both AReS and MaRS use Equation (22) to determine the diffusion, they share the same diffusion estimates. Due to space restrictions, we moved the results for Lorenz 63 to Table 4 in the appendix. As demonstrated by these systems, AReS and MaRS are both able to deal with locally linear systems, occasionally outperforming their competitors, especially in their estimates of the diffusion.

4.4 Non-Diagonal Diffusion

To investigate the effect of off-diagonal entries in G, we use the Lotka-Volterra dynamics. For the diffusion estimates, AReS and MaRS again share the same values. Since one of the competing algorithms was unable to model non-diagonal diffusions, we provide it with the true G and only compare parameter estimates. As another competitor was already struggling in the lower-dimensional cases, we omitted it from this comparison due to limited computational resources. The results are shown in Table 2. AReS and MaRS clearly outperform their competition in terms of diffusion estimation, while only one competing algorithm is able to keep up in terms of drift parameter estimation.

4.5 Dealing with Multi-Modality

Table 1: Inferred parameters over 100 independent realizations of the Ornstein-Uhlenbeck dynamics. For every algorithm, we show the median ± one standard deviation. The ground truth is shown in the leftmost column.

Table 2: Inferred parameters over 100 independent realizations of the Lotka-Volterra dynamics. For every algorithm, we show the median ± one standard deviation. The ground truth is shown in the leftmost column.

Table 3: Inferred parameters over 100 independent realizations of the Ginzburg-Landau double-well dynamics. For every algorithm, we show the median ± one standard deviation. The ground truth is shown in the leftmost column.

As a final challenge, we investigate the Ginzburg-Landau double-well system. While it is one-dimensional, its state distribution is multi-modal even if all parameters are known. As shown in Table 3, this is a huge challenge for all classical approaches. While the non-parametric approach probably does not have enough data points for its proxy of the drift function, the time-dependent Gaussianity assumptions of the other two competitors are problematic in this case. In our gradient matching framework, no such assumption is made. Thus, both AReS and MaRS are able to deal with the multi-modality of the problem.

5 Conclusion

Parameter and diffusion estimation for stochastic systems arises in the quantitative sciences and many fields of engineering. Current estimation techniques based on Kalman filtering or Gaussian processes try to approximate the state distribution conditioned on the parameters and then iteratively optimize the data likelihood. In this work, we propose to turn this procedure on its head by leveraging key ideas from gradient matching algorithms designed for deterministic ODEs. By introducing a novel noise model for Gaussian process regression using the Doss-Sussmann transformation, we are able to reliably estimate the diffusion and the drift parameters. Our algorithm is able to keep up with, and occasionally outperform, the state-of-the-art on the simpler benchmark systems, while also accurately estimating parameters for systems that exhibit multi-modal state densities, a case where traditional methods fail. While our approach is currently restricted to systems with a constant diffusion matrix G, it would be interesting to see how it generalizes to other systems by using different or approximate bridge constructs. Unfortunately, this is outside the scope of this work. However, we hope that our publicly available code will facilitate future research in that direction.

This research was supported by the Max Planck ETH Center for Learning Systems. GA acknowledges funding from Google DeepMind and University of Oxford.

References

  • Aït-Sahalia (2002) Yacine Aït-Sahalia. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262, 2002.
  • Archambeau et al. (2007) Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research, 1:1–16, 2007.
  • Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
  • Bauer et al. (2017) Stefan Bauer, Nico S Gorbach, Djordje Miladinovic, and Joachim M Buhmann. Efficient and flexible inference for stochastic systems. In Advances in Neural Information Processing Systems, pages 6988–6998, 2017.
  • Bhat et al. (2015) Harish S Bhat, RWMA Madushani, and Shagun Rawat. Parameter inference for stochastic differential equations with density tracking by quadrature. In International Workshop on Simulation, pages 99–113. Springer, 2015.
  • Calderhead et al. (2009) Ben Calderhead, Mark Girolami, and Neil D Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In Advances in Neural Information Processing Systems, pages 217–224, 2009.
  • Donnet and Samson (2013) Sophie Donnet and Adeline Samson. A review on estimation of stochastic differential equations for pharmacokinetic/pharmacodynamic models. Advanced Drug Delivery Reviews, 65(7):929–939, 2013.
  • Doss (1977) Halim Doss. Liens entre équations différentielles stochastiques et ordinaires. 1977.
  • Dziugaite et al. (2015) Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
  • Elerian et al. (2001) Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4):959–993, 2001.
  • Eraker (2001) Bjørn Eraker. Mcmc analysis of diffusion models with application to finance. Journal of Business & Economic Statistics, 19(2):177–191, 2001.
  • Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing systems, pages 2672–2680, 2014.
  • Gorbach et al. (2017) Nico S Gorbach, Stefan Bauer, and Joachim M Buhmann. Scalable variational inference for dynamical systems. In Advances in Neural Information Processing Systems, pages 4806–4815, 2017.
  • Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
  • Hurn and Lindsay (1999) AS Hurn and KA Lindsay. Estimating the parameters of stochastic differential equations. Mathematics and computers in simulation, 48(4-6):373–384, 1999.
  • Jiménez et al. (2008) MJ Jiménez, Henrik Madsen, JJ Bloem, and Bernd Dammann. Estimation of non-linear continuous time models for the heat exchange dynamics of building integrated photovoltaic modules. Energy and Buildings, 40(2):157–167, 2008.
  • Karimi and McAuley (2018) Hadiseh Karimi and Kimberley B McAuley. Bayesian objective functions for estimating parameters in nonlinear stochastic differential equation models with limited data. Industrial & Engineering Chemistry Research, 57(27):8946–8961, 2018.
  • Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
  • Lotka (1932) Alfred J Lotka. The growth of mixed populations: two species competing for a common food supply. Journal of the Washington Academy of Sciences, 22(16/17):461–469, 1932.
  • Nielsen et al. (2000) Jan Nygaard Nielsen, Henrik Madsen, and Peter C Young. Parameter estimation in stochastic differential equations: an overview. Annual Reviews in Control, 24:83–94, 2000.
  • Picchini (2007) Umberto Picchini. Sde toolbox: Simulation and estimation of stochastic differential equations with matlab. 2007.
  • Pieschner and Fuchs (2018) Susanne Pieschner and Christiane Fuchs. Bayesian inference for diffusion processes: Using higher-order approximations for transition densities. arXiv preprint arXiv:1806.02429, 2018.
  • Rasmussen (2004) Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
  • Riesinger et al. (2016) Christoph Riesinger, Tobias Neckel, and Florian Rupp. Solving random ordinary differential equations on gpu clusters using multiple levels of parallelism. SIAM Journal on Scientific Computing, 38(4):C372–C402, 2016.
  • Ruttor et al. (2013) Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate gaussian process inference for the drift function in stochastic differential equations. In Advances in Neural Information Processing Systems, pages 2040–2048, 2013.
  • Ryder et al. (2018) Thomas Ryder, Andrew Golightly, A Stephen McGough, and Dennis Prangle. Black-box variational inference for stochastic differential equations. International Conference on Machine Learning, 2018.
  • Särkkä et al. (2015) Simo Särkkä, Jouni Hartikainen, Isambi Sailon Mbalawata, and Heikki Haario. Posterior inference on parameters of stochastic differential equations via non-linear gaussian filtering and adaptive mcmc. Statistics and Computing, 25(2):427–437, 2015.
  • Sørensen (2004) Helle Sørensen. Parametric inference for diffusion processes observed at discrete points in time: a survey. International Statistical Review, 72(3):337–354, 2004.
  • Sussmann (1978) Héctor J Sussmann. On the gap between deterministic and stochastic ordinary differential equations. The Annals of Probability, pages 19–41, 1978.
  • Tronarp and Särkkä (2018) Filip Tronarp and Simo Särkkä. Iterative statistical linear regression for gaussian smoothing in continuous-time non-linear stochastic dynamic systems. arXiv preprint arXiv:1805.11258, 2018.
  • van der Meulen et al. (2017) Frank van der Meulen, Moritz Schauer, et al. Bayesian estimation of discretely observed multi-dimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11(1):2358–2396, 2017.
  • Vrettas et al. (2015) Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.
  • Wenk et al. (2018) Philippe Wenk, Alkis Gotovos, Stefan Bauer, Nico Gorbach, Andreas Krause, and Joachim M Buhmann. Fast gaussian process based gradient matching for parameter identification in systems of nonlinear odes. arXiv preprint arXiv:1804.04378, 2018.
  • Yang et al. (2018) Liu Yang, Dongkun Zhang, and George Em Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations. arXiv preprint arXiv:1811.02033, 2018.
  • Yildiz et al. (2018) Cagatay Yildiz, Markus Heinonen, Jukka Intosalmi, Henrik Mannerström, and Harri Lähdesmäki. Learning stochastic differential equations with gaussian processes without gradient matching. arXiv preprint arXiv:1807.05748, 2018.

Appendix A

A.1 Parameter Estimation for Lorenz 63

Table 4: Median and standard deviation of the 65 best runs of each algorithm. As one of the algorithms crashed in roughly one third of all experiments, we compare only the 65 best runs of each algorithm, where a crash is treated as a complete failure. While this provides a somewhat fair comparison, it should be noted that it significantly overestimates the performance of all algorithms.

A.2 Densities for Ancestral Sampling of the SDE-Based Model

Given the graphical model in Figure 1(a), it is straightforward to calculate the densities used in the ancestral sampling scheme in Algorithm 1. After the appropriate marginalization, the joint density described by the graphical model can be written as

(34)

Inserting the densities given by Equations (10), (11), (13) and (21) yields

(35)

Using variable substitution to simplify notation, we write

(36)

This equation is now subsequently modified by observing that the product of two Gaussian densities in the same random variable is again a Gaussian density:

(37)
(38)
(39)
(40)
(41)

where

(42)
(43)
(44)
(45)

This formula can be further modified using the Woodbury identity, i.e.

(46)
(47)
(48)

which leads to

(49)

and

(50)
(51)
(52)

which leads to

(53)

Since we observe y, we are interested in calculating the conditional distribution

(54)

Conveniently enough, the marginal density of y is already factored out in Equation (41) (compare Equation (22)). Thus, we have

(55)

We can thus employ ancestral sampling: we first obtain a sample of the Ornstein-Uhlenbeck contribution from its conditional given the observations, and then use that sample to obtain a sample of the latent states from their conditional.

A.3 Calculating the GP Posterior for Data-Based Ancestral Sampling

Given the graphical model in Figure 1(b), it is straightforward to calculate the densities used in the ancestral sampling scheme in Algorithm 1. After the appropriate marginalization and using the variable substitutions introduced in Equation (36), the joint density described by the graphical model can be written as

(56)
(57)
(58)
(59)
(60)
(61)

where

(62)
(63)
(64)
(65)
(66)

and , is independent of .

After marginalizing out and then dividing by the marginal of , we get the conditional distribution

(67)