1 Introduction
Modeling discretely observed time series is a challenging problem that arises in the quantitative sciences and many fields of engineering. While it is possible to tackle such problems with difference equations or ordinary differential equations (ODEs), both approaches suffer from serious drawbacks. Difference equations are difficult to apply if the observation times are unevenly spaced. Furthermore, they do not generalize well across observation frequencies, even though natural laws are indifferent to this artificial construct. ODEs can deal with these challenges, but fail to capture the inherent stochastic behavior present in many physical, chemical or biological systems. These effects can be captured by introducing stochasticity into the dynamics model, which leads to stochastic differential equations (SDEs). In this paper, we will work exclusively with the Itô form
(1) $\mathrm{d}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), \boldsymbol{\theta})\,\mathrm{d}t + \mathbf{G}(\mathbf{x}(t))\,\mathrm{d}\mathbf{W}(t)$
where $\mathbf{x}(t)$ is the time-dependent vector of states we would like to model, $\boldsymbol{\theta}$ collects the parameters of the model, $\mathbf{f}$ is the drift term, $\mathbf{G}$ is the vector-valued diffusion function and $\mathbf{W}(t)$ is a Wiener process of the same dimension as the state vector $\mathbf{x}$. While SDEs can efficiently capture stochasticity and deal with unevenly spaced observation times and varying observation frequencies, inference is rather challenging. Due to the stochasticity of $\mathbf{W}(t)$, the state vector $\mathbf{x}(t)$ is itself a random variable. Except for a few special cases, it is not possible to find an analytic solution for its statistics for general drift and diffusion terms. The problem is even more challenging if we want to condition on discrete-time observations to estimate the states (filtering/smoothing) or to infer statistics of the parameters (parameter inference). It is well known that the parameter inference problem is a difficult task, with most approaches either being very sensitive to initialization (Picchini, 2007), strongly dependent on the choice of hyperparameters like the spacing of the integration grid (Bhat et al., 2015), or using excessive amounts of computational resources even for small-scale systems and state-of-the-art implementations (Ryder et al., 2018). The difficulty of estimating the parameters of drift and diffusion under observation noise is readily exemplified by the fact that even major scientific programming environments still lack an established toolbox for practical use. In this paper, we take a step in a novel direction to tackle this open and exciting research question.
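To illustrate the pathwise behavior discussed above, a minimal Euler-Maruyama simulator for an Itô SDE of the form (1) can be sketched as follows; the drift, diffusion and parameter values are illustrative assumptions, not taken from this paper:

```python
import numpy as np

def euler_maruyama(f, G, x0, t_grid, theta, rng):
    """Simulate one path of the Ito SDE dx = f(x, theta) dt + G dW
    on the given (possibly uneven) time grid with the Euler-Maruyama scheme."""
    x = np.empty((len(t_grid), len(x0)))
    x[0] = x0
    for i in range(1, len(t_grid)):
        dt = t_grid[i] - t_grid[i - 1]
        dW = rng.normal(0.0, np.sqrt(dt), size=len(x0))  # Wiener increments
        x[i] = x[i - 1] + f(x[i - 1], theta) * dt + G @ dW
    return x

# Example: scalar mean-reverting drift f(x, theta) = -theta * x
rng = np.random.default_rng(0)
t = np.linspace(0.0, 5.0, 1001)
path = euler_maruyama(lambda x, th: -th * x, 0.5 * np.eye(1),
                      np.array([2.0]), t, 1.0, rng)
```

Note that every evaluation of the likelihood in a sampling-based inference scheme would require many such simulations, which is one reason the problem is computationally demanding.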
1.1 Related Work
While it is impossible to cover all related research efforts, we would like to give a quick overview of some of the most relevant work. For a more in-depth discussion, we recommend Tronarp and Särkkä (2018), who provide an excellent review of the current state-of-the-art smoothing schemes. Moreover, Sørensen (2004) and Nielsen et al. (2000) provide extensive overviews of the more traditional approaches.
Most classical methods rely on calculating the likelihood of the observations given the parameters. Since this likelihood is usually analytically intractable, approximation schemes are necessary. Elerian et al. (2001) and Eraker (2001) used the Euler-Maruyama discretization to approximate the transition densities on a fixed, fine grid of artificial observation times, later to be leveraged in an MCMC sampling scheme. Pieschner and Fuchs (2018) and van der Meulen et al. (2017) subsequently refined this approach with improved bridge constructs and incorporated partial observability. Ryder et al. (2018) recently followed up on this idea by combining discretization procedures with variational inference. Särkkä et al. (2015) investigate different approximation methods based on Kalman filtering, while Archambeau et al. (2007) and Vrettas et al. (2015) use a variational Gaussian process-based approximation. Finally, it should be mentioned that the state densities can also be inferred by solving the Fokker-Planck-Kolmogorov equations using standard numerical methods for PDEs (Hurn and Lindsay, 1999; Aït-Sahalia, 2002).
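To make the role of such discretizations concrete, the Euler-Maruyama scheme approximates the (generally intractable) transition density by a Gaussian over one small time step. A hedged sketch, with names and values that are our own rather than from any of the cited implementations:

```python
import numpy as np

def em_transition_logpdf(x_next, x, f, G, dt, theta):
    """Euler-Maruyama approximation of the one-step transition density:
    x_next | x  ~  N(x + f(x, theta) * dt,  G @ G.T * dt)."""
    mean = x + f(x, theta) * dt
    cov = (G @ G.T) * dt
    diff = x_next - mean
    k = len(x)
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (diff @ np.linalg.solve(cov, diff)
                   + logdet + k * np.log(2.0 * np.pi))

# Scalar example: drift f(x, theta) = -theta * x, one step of size dt = 0.1
val = em_transition_logpdf(np.array([0.95]), np.array([1.0]),
                           lambda x, th: -th * x, np.array([[0.5]]), 0.1, 1.0)
```

MCMC- and bridge-based schemes of the kind cited above chain many such Gaussian factors over a fine artificial grid, which is why their cost grows quickly with the grid resolution.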
Instead of approximating the state posterior in a variational fashion, Gaussian processes can also be used to directly model the drift and diffusion themselves, ignoring in this way any prior knowledge about their parametric form. This approach was investigated by Ruttor et al. (2013), whose linearization and discretization assumptions were later relaxed by Yildiz et al. (2018). While we will show in our experiments that these methods can be used for parameter estimation if the parametric forms of drift and diffusion are known, it should be noted that parameter inference was not the original goal of their work.
1.2 Our Work
To the best of our knowledge, there are only very few works that circumvent calculating the state transition densities altogether. Our approach is probably most closely related to the ideas presented by Riesinger et al. (2016), who rely on the Doss-Sussmann transformation (Doss, 1977; Sussmann, 1978) to reduce the parameter inference problem for SDEs to parameter inference in an ensemble of random ordinary differential equations (RODEs). These equations can then be solved pathwise, using either standard numerical schemes or the computationally efficient gradient matching scheme of Gorbach et al. (2017), as proposed by Bauer et al. (2017). The pathwise method of Bauer et al. (2017) has natural parallelization properties, but there is still an inherent approximation error due to the Monte Carlo estimation of the expectation over the stochastic element in the RODE. Furthermore, their framework imposes severe linearity restrictions on the functional form of the drift $\mathbf{f}$, and it is unable to estimate the diffusion matrix $\mathbf{G}$.
While we keep their assumption of a constant diffusion matrix, i.e. $\mathbf{G}(\mathbf{x}(t)) \equiv \mathbf{G}$, our approach gets rid of the linearity assumptions on the drift $\mathbf{f}$. Furthermore, we substitute the Monte Carlo approximation by embedding the SDE into a fully statistical framework, allowing for efficient estimation of both $\boldsymbol{\theta}$ and $\mathbf{G}$ using state-of-the-art statistical inference methods.
Even though a constant diffusion assumption might seem restrictive at first, such SDE models are widely used, e.g. in chemical engineering (Karimi and McAuley, 2018), civil engineering (Jiménez et al., 2008), pharmacology (Donnet and Samson, 2013) and of course in signal processing, control and econometrics. While we believe that this approach could be extended approximately to systems with general diffusion matrices, we leave this for future work.
The contributions of our framework are the following:

We derive a new statistical framework for drift and diffusion parameter estimation of SDEs using the Doss-Sussmann transformation and Gaussian processes.

We introduce a grid-free, computationally efficient and robust parameter inference scheme that combines a nonparametric Gaussian process model with adversarial loss functions.

We demonstrate that our method is able to estimate constant but non-diagonal diffusion terms of stochastic differential equations without any functional form assumption on the drift.

We show that our method significantly outperforms the state-of-the-art algorithms for SDEs with multimodal state posteriors, both in terms of diffusion and drift parameter estimation.

We share and publish our code to facilitate future research. (Code available at https://github.com/gabb7/AReSMaRS)
2 Background
In this section, we formalize our problem and introduce the necessary notation and background drawn from Gaussian process-based gradient matching for ODEs.
2.1 Problem Setting
We consider SDEs of the form
(2) $\mathrm{d}\mathbf{x}(t) = \mathbf{f}(\mathbf{x}(t), \boldsymbol{\theta})\,\mathrm{d}t + \mathbf{G}\,\mathrm{d}\mathbf{W}(t)$
where $\mathbf{x}(t)$ is the $K$-dimensional state vector at time $t$; $\mathrm{d}\mathbf{W}(t)$ are the increments of a standard Wiener process; $\mathbf{f}$ is an arbitrary, potentially highly nonlinear function whose parametric form is known, save for the unknown parameter vector $\boldsymbol{\theta}$; $\mathbf{G}$ is the unknown but constant diffusion matrix. Without loss of generality, we assume $\mathbf{G}$ to be a lower-triangular, positive semi-definite matrix.
The system is observed at arbitrarily spaced time points $t_1, \dots, t_N$, subject to Gaussian observation noise:

(3) $\mathbf{y}(t_i) = \mathbf{x}(t_i) + \boldsymbol{\epsilon}(t_i)$

where we assume the noise variances to be state-dependent but time-independent, i.e.

(4) $\epsilon_k(t_i) \sim \mathcal{N}(0, \sigma_k^2)$

for all states $k = 1, \dots, K$ and observation times $t_1, \dots, t_N$.
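As a small sketch of this observation model, noisy observations at arbitrarily spaced times with state-dependent, time-independent noise as in equations (3) and (4) could be generated as follows; the trajectory and noise levels are placeholder assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
t_obs = np.sort(rng.uniform(0.0, 10.0, size=20))   # arbitrarily spaced observation times
K = 2                                              # number of states
x_true = np.stack([np.sin(t_obs), np.cos(t_obs)])  # placeholder for the true trajectory
sigma = np.array([0.1, 0.3])                       # one noise std per state, constant in time
y = x_true + rng.normal(0.0, 1.0, size=(K, len(t_obs))) * sigma[:, None]
```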
2.2 Deterministic ODE Case
In the context of Bayesian parameter inference for deterministic ordinary differential equations, Calderhead et al. (2009) identify numerical integration as the main culprit for bad computational performance. Thus, they propose to turn the parameter estimation procedure on its head: instead of obtaining the trajectories via numerical integration, they extract two probabilistic estimates of the state derivatives, one using only the noisy observations and one using the differential equations. The main challenge is then to combine these two distributions such that the information about the derivatives can guide the search for good parameter estimates. For this purpose, Calderhead et al. (2009) propose a product-of-experts heuristic that was accepted and reused until recently, when Wenk et al. (2018) showed that this heuristic leads to severe theoretical issues. They instead propose an alternative graphical model, forcing equality between the data-based and the ODE-based model save for a Gaussian-distributed slack variable.
In this paper, we use another interpretation of gradient matching, which aims at finding parameters such that the two distributions over the derivatives match as closely as possible. Acknowledging that standard methods like minimizing the KL divergence are not tractable, we use robust moment matching techniques, while solving the much harder problem of nonzero diffusion. However, it should be clear that our methodology can easily be applied to the special case of deterministic ODEs and thus provides an additional contribution towards parameter estimation for systems of ODEs.
2.3 Notation
Throughout this paper, bold capital letters denote matrices. Values of a time-dependent quantity such as the state vector $\mathbf{x}(t)$ are collected in a matrix $\mathbf{X}$ of dimension $K \times N$, whose $k$-th row collects the values of state $k$ at the times $t_1, \dots, t_N$. The matrix $\mathbf{X}$ can be vectorized by concatenating its rows, defining in this way the vector $\mathbf{x}$. This vector should not be confused with $\mathbf{x}(t)$, which is still a time-dependent vector of dimension $K$.
As we work with Gaussian processes, it is useful to standardize the state observations by subtracting the mean and dividing by the standard deviation, in a state-wise fashion. We define the vector $\boldsymbol{\sigma}_{\mathbf{y}}$ of state-wise standard deviations of the data, and the matrix

(5) $\mathbf{S} = \operatorname{diag}(\boldsymbol{\sigma}_{\mathbf{y}}) \otimes \mathbf{I}_N$

where $\otimes$ indicates the Kronecker product and $\mathbf{I}_N$ is the identity matrix of size $N$. Similarly for the means, we can define the vector $\boldsymbol{\mu}_{\mathbf{y}}$ that contains the state-wise means of the observations, each repeated $N$ times. The standardized vector can thus be defined as:

(6) $\tilde{\mathbf{y}} = \mathbf{S}^{-1}(\mathbf{y} - \boldsymbol{\mu}_{\mathbf{y}})$
For the sake of clarity, we omit the standardization in the following sections. It should be noted, however, that in a finite sample setting, standardization strongly improves the performance of GP regression. In both our implementation and all experiments shown in Section 4, we assume a GP prior on states that were standardized using the state-wise mean and standard deviation of the observations.
3 Methods
In the deterministic case with $\mathbf{G} = \mathbf{0}$, the GP regression model can be directly applied to the states $\mathbf{x}$. However, if $\mathbf{G} \neq \mathbf{0}$, the derivatives of the states with respect to time no longer exist due to the contributions of the Wiener process. Thus, performing direct gradient matching on the states is not feasible.
3.1 Latent States Representation
We propose to tackle this problem by introducing a latent variable $\mathbf{z}(t)$, defined via the linear coordinate transformation

(7) $\mathbf{z}(t) = \mathbf{x}(t) - \mathbf{u}(t)$

where $\mathbf{u}(t)$ is the solution of the following SDE:

(8) $\mathrm{d}\mathbf{u}(t) = -\mathbf{u}(t)\,\mathrm{d}t + \mathbf{G}\,\mathrm{d}\mathbf{W}(t)$

Without loss of generality, we set $\mathbf{u}(0) = \mathbf{0}$ and thus $\mathbf{z}(0) = \mathbf{x}(0)$. While in principle the framework supports any deterministic initial condition, the reasons for this choice will become clear in Section 3.2.3.

Using Itô's formula, we obtain the following differential equation for $\mathbf{z}(t)$:

(9) $\dot{\mathbf{z}}(t) = \mathbf{f}(\mathbf{z}(t) + \mathbf{u}(t), \boldsymbol{\theta}) + \mathbf{u}(t)$
This means that for a given realization of $\mathbf{u}(t)$, we obtain a differentiable latent state $\mathbf{z}(t)$. In principle, we could sample realizations of $\mathbf{u}(t)$ and solve the corresponding deterministic problems, which would be equivalent to approximately marginalizing over $\mathbf{u}(t)$. However, it is actually possible to treat this problem statistically, completely bypassing such marginalization. We do this by analytically deriving probabilistic generative models for the observations and the derivatives. The equations are derived in this section, while the final models are shown in Figure 1.
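The differentiability of the latent state can be checked numerically: simulating the state and the auxiliary process with shared Wiener increments under Euler-Maruyama, the finite-difference derivative of $z = x - u$ matches $f(z + u) + u$ exactly, because the stochastic increments cancel. A one-dimensional sketch, with an assumed drift and illustrative parameter values:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, G = 1.5, 0.4
dt, n = 0.001, 5000
f = lambda x: theta * (2.0 - x)          # example drift, assumed for illustration

x, u = 1.0, 0.0                          # u(0) = 0  =>  z(0) = x(0)
zs, rhs = [], []
for _ in range(n):
    dW = rng.normal(0.0, np.sqrt(dt))    # shared Wiener increment
    zs.append(x - u)                     # latent state z = x - u
    rhs.append(f(x) + u)                 # claimed derivative f(z + u) + u
    x += f(x) * dt + G * dW              # dx = f(x) dt + G dW
    u += -u * dt + G * dW                # du = -u dt + G dW

zs, rhs = np.array(zs), np.array(rhs)
zdot = np.diff(zs) / dt                  # finite-difference derivative of z
err = np.max(np.abs(zdot - rhs[:-1]))    # G dW cancels exactly in the difference
```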
3.2 Generative Model for Observations
Let us define $\boldsymbol{\epsilon}(t)$ as the Gaussian observation error at time $t$. Using the matrix notation introduced in Section 2.3, we can write

(10) $\mathbf{Y} = \mathbf{Z} + \mathbf{U} + \mathbf{E}$

where $\mathbf{Y}$, $\mathbf{Z}$, $\mathbf{U}$ and $\mathbf{E}$ are the matrices corresponding to the lowercase variables introduced in the previous sections. In contrast to standard GP regression, we have an additional additive noise term $\mathbf{U}$, which is the result of the stochastic process described by equation (8). As in standard GP regression, it is possible to recover a closed-form Gaussian distribution for each term.
3.2.1 GP Prior
We assume a zero-mean Gaussian prior over the latent states $\mathbf{z}$, whose covariance matrix $\mathbf{C}_{\boldsymbol{\phi}}$ is given by a kernel function, in turn parameterized by the hyperparameter vector $\boldsymbol{\phi}$:

(11) $p(\mathbf{z} \mid \boldsymbol{\phi}) = \mathcal{N}(\mathbf{z} \mid \mathbf{0}, \mathbf{C}_{\boldsymbol{\phi}})$

We treat all state dimensions as independent, meaning that we put independent GP priors with separate hyperparameters on the time evolution of each state. Consequently, $\mathbf{C}_{\boldsymbol{\phi}}$ is a block-diagonal matrix with $K$ blocks, each of dimension $N \times N$. The blocks model the correlation over time introduced by the GP prior.
3.2.2 Error Model
In equation (4), we assume that our errors are i.i.d. Gaussians, uncorrelated over time. The joint distribution of all errors is thus still Gaussian, with a diagonal covariance matrix $\boldsymbol{\Sigma}_{\boldsymbol{\epsilon}}$ whose entries are given by the GP likelihood variances $\sigma_k^2$. More precisely:

(12) $\boldsymbol{\Sigma}_{\boldsymbol{\epsilon}} = \operatorname{diag}(\sigma_1^2, \dots, \sigma_K^2) \otimes \mathbf{I}_N$

and

(13) $p(\boldsymbol{\epsilon} \mid \boldsymbol{\sigma}) = \mathcal{N}(\boldsymbol{\epsilon} \mid \mathbf{0}, \boldsymbol{\Sigma}_{\boldsymbol{\epsilon}})$
3.2.3 OrnsteinUhlenbeck Process
Through the coordinate transformation in equation (7), all stochasticity is captured by the stochastic process described by equation (8). This mathematical construct has a closed-form Gaussian solution and is called the Ornstein-Uhlenbeck process. For the one-dimensional case with zero initial condition and unit diffusion,

(14) $\mathrm{d}u(t) = -u(t)\,\mathrm{d}t + \mathrm{d}W(t), \qquad u(0) = 0$

we get the following mean and covariance:

(15) $\mathbb{E}[u(t)] = 0$

(16) $\operatorname{cov}\big(u(s), u(t)\big) = \tfrac{1}{2}\big(e^{-|t - s|} - e^{-(t + s)}\big)$

Sampling at the discrete time points $t_1, \dots, t_N$ yields the vector $\mathbf{u}$, which is Gaussian distributed:

(17) $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \boldsymbol{\Sigma}_{\mathrm{OU}})$
where the entries of $\boldsymbol{\Sigma}_{\mathrm{OU}}$ are given by equation (16). In the case of a $K$-dimensional process with identity diffusion, i.e.

(18) $\mathrm{d}\mathbf{u}(t) = -\mathbf{u}(t)\,\mathrm{d}t + \mathbf{I}\,\mathrm{d}\mathbf{W}(t)$

we can just treat each state dimension as an independent, one-dimensional OU process. Thus, after sampling at the time points in $\mathbf{t}$ and unrolling the resulting matrix as described in Section 2.3, we get

(19) $p(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid \mathbf{0}, \boldsymbol{\Sigma})$

where $\boldsymbol{\Sigma}$ is a block-diagonal matrix with one copy of $\boldsymbol{\Sigma}_{\mathrm{OU}}$ for each state dimension.
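The closed-form covariance of equation (16) can be computed and sanity-checked directly; the grid below is an illustrative assumption:

```python
import numpy as np

def ou_covariance(t_grid):
    """Covariance matrix of a unit-diffusion OU process du = -u dt + dW, u(0) = 0,
    evaluated at the given times: cov(s, t) = (exp(-|t - s|) - exp(-(t + s))) / 2."""
    s, t = np.meshgrid(t_grid, t_grid, indexing="ij")
    return 0.5 * (np.exp(-np.abs(t - s)) - np.exp(-(t + s)))

t = np.linspace(0.1, 3.0, 30)
Sigma = ou_covariance(t)
```

The diagonal recovers the familiar marginal variance $\operatorname{var}(u(t)) = (1 - e^{-2t})/2$, which vanishes at $t = 0$, consistent with the deterministic initial condition $u(0) = 0$.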
Using Itô's formula, we can show that samples of the original Ornstein-Uhlenbeck process with diffusion $\mathbf{G}$ can be obtained at each time point via the linear coordinate transformation

(20) $\tilde{\mathbf{u}}(t) = \mathbf{G}\,\mathbf{u}(t)$

Let $\mathbf{T}_{\mathbf{G}}$ be defined as the matrix that performs this linear transformation for the unrolled vectors. We can then write the density of the original Ornstein-Uhlenbeck process as

(21) $p(\tilde{\mathbf{u}} \mid \mathbf{G}) = \mathcal{N}\big(\tilde{\mathbf{u}} \mid \mathbf{0}, \mathbf{T}_{\mathbf{G}} \boldsymbol{\Sigma} \mathbf{T}_{\mathbf{G}}^{\top}\big)$
3.2.4 Marginals of the Observations
Using equation (10), the marginal distribution of $\mathbf{y}$ can be computed as the sum of three independent, zero-mean, Gaussian-distributed random variables, described respectively by equations (11), (13) and (21). Thus, $\mathbf{y}$ is again Gaussian distributed, according to

(22) $p(\mathbf{y} \mid \boldsymbol{\phi}, \boldsymbol{\sigma}, \mathbf{G}) = \mathcal{N}(\mathbf{y} \mid \mathbf{0}, \boldsymbol{\Sigma}_{\mathbf{y}})$

where

(23) $\boldsymbol{\Sigma}_{\mathbf{y}} = \mathbf{C}_{\boldsymbol{\phi}} + \mathbf{T}_{\mathbf{G}} \boldsymbol{\Sigma} \mathbf{T}_{\mathbf{G}}^{\top} + \boldsymbol{\Sigma}_{\boldsymbol{\epsilon}}$
It is important to note that, due to the latent state representation, the diffusion matrix $\mathbf{G}$ is now part of the hyperparameters of the observation model. It will later be inferred alongside the hyperparameters of our GP using maximum evidence (Rasmussen, 2004). Using a stationary kernel, $\mathbf{C}_{\boldsymbol{\phi}}$ captures the stationary part of $\boldsymbol{\Sigma}_{\mathbf{y}}$ as in standard GP regression, while the parameters in $\mathbf{G}$ describe the non-stationary part deriving from $\mathbf{u}(t)$. This ultimately leads to an identifiable problem.
3.3 Generative Model for Derivatives
Similarly to gradient matching approaches, we create two generative models for our derivatives, one based on the data and one based on the SDE model.
3.3.1 Data-based Model
3.3.2 SDE-based Model
There is also a second way of obtaining an expression for the derivatives of $\mathbf{z}$, namely using equation (9):

(30) $p(\dot{\mathbf{z}} \mid \mathbf{z}, \mathbf{u}, \boldsymbol{\theta}) = \delta\big(\dot{\mathbf{z}} - \mathbf{f}(\mathbf{z} + \mathbf{u}, \boldsymbol{\theta}) - \mathbf{u}\big)$

where $\delta$ represents the Dirac delta.
3.4 Inference
Combined with the modeling paradigms introduced in the previous sections, this yields two generative models for the observations, as pictured in Figure 1. The graphical model in Figure 1(a) represents the derivatives obtained via the generative process described by the SDEs, through the nonlinear drift function $\mathbf{f}$. The model in Figure 1(b) represents the derivatives yielded by the generative process described by the Gaussian process. Assuming a perfect GP fit and access to the true parameters, these two distributions should intuitively be equal. We thus want to find parameters that minimize the difference between these two distributions.
Compared to the deterministic ODE case, the graphical models in Figure 1 contain additional dependencies on the contribution of the Ornstein-Uhlenbeck process $\mathbf{u}$. Furthermore, the SDE-driven probability distribution of the derivatives in Figure 1(a) depends on $\mathbf{z}$, $\mathbf{u}$ and $\boldsymbol{\theta}$ through a potentially highly nonlinear drift function $\mathbf{f}$. Thus, it is not possible to do analytical inference without making restrictive assumptions on the functional form of $\mathbf{f}$. However, as shown in the appendix in Section A.2, it is possible to derive computationally efficient ancestral sampling schemes for both models, as summarized in Algorithm 1. While the nonlinearity rules out classical approaches like analytically minimizing the KL divergence, we are able to deploy likelihood-free algorithms that were designed for matching two probability densities based on samples.
3.5 Adversarial Sample-based Inference
Arguably among the most popular algorithms of this kind is the generative adversarial network (GAN), where a parametric neural network is trained to match the unknown likelihood of the data (Goodfellow et al., 2014). The basic GAN setup consists of a fixed data set, a generator that tries to create realistic samples of said data set, and a discriminator that tries to tell the fake samples from the true ones. As recently shown by Yang et al. (2018), GANs have huge potential for solving stochastic partial differential equations (SPDEs).

Yang et al. (2018) assume a fixed data set consisting of observations (similar to our $\mathbf{y}$) and use an SPDE-inspired neural network as a generator for realistic observations. In the case of SDEs, however, this would still involve a lot of numerical integration. Thus, we modify the GAN setup by leaving behind the idea of having a fixed data set. Instead of relying on bootstrapped samples of our observations, we sample the derivatives of our data-based model shown in Figure 1(b), which, assuming a sufficiently good model fit, represent the true derivatives of our latent variable $\mathbf{z}$. We then use the SDE-based model shown in Figure 1(a) as a generator. To avoid standard GAN problems such as training instability, and to improve robustness, we replace the discriminator with a critic trained to estimate the Wasserstein distance between the sample distributions, as proposed by Arjovsky et al. (2017). The resulting algorithm, summarized in Algorithm 2, can be interpreted as performing Adversarial Regression for SDEs and will thus be called AReS. In Figure 2, we show the derivatives sampled from the two models, both before and after training, for one example run of the Lotka-Volterra system (cf. Section 4.4). While not perfect, the GAN is clearly able to push the SDE gradients towards the gradients of the observed data.

3.6 Maximum Mean Discrepancy
Even though GANs work well in practical settings, they need careful balancing of generator and discriminator. Dziugaite et al. (2015) propose to solve this problem by using Maximum Mean Discrepancy (MMD) (Gretton et al., 2012) as a metric to substitute the discriminator. As proposed by Li et al. (2015), we choose a rational quadratic kernel to obtain a robust discrepancy measure that can be deployed without fine-tuning on a variety of problems. The resulting algorithm, summarized in Algorithm 3, can be interpreted as performing Maximum mean discrepancy-minimizing Regression for SDEs and will thus be called MaRS.
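A minimal sample-based MMD estimate with a rational quadratic kernel can be sketched as follows; the kernel hyperparameters and test distributions are illustrative assumptions, not the settings used in our experiments:

```python
import numpy as np

def rq_kernel(a, b, alpha=1.0, lengthscale=1.0):
    """Rational quadratic kernel between two 1-D sample arrays."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return (1.0 + d2 / (2.0 * alpha * lengthscale**2)) ** (-alpha)

def mmd2(x, y, **kw):
    """Biased (V-statistic) estimate of squared MMD between samples x and y."""
    return (rq_kernel(x, x, **kw).mean()
            - 2.0 * rq_kernel(x, y, **kw).mean()
            + rq_kernel(y, y, **kw).mean())

rng = np.random.default_rng(3)
same = mmd2(rng.normal(0.0, 1.0, 500), rng.normal(0.0, 1.0, 500))  # near zero
diff = mmd2(rng.normal(0.0, 1.0, 500), rng.normal(3.0, 1.0, 500))  # clearly positive
```

Minimizing such an estimate with respect to the generator parameters replaces the adversarial critic entirely, which removes the generator/discriminator balancing problem at the price of a fixed kernel choice.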
4 Experiments
4.1 Setups
To evaluate the empirical performance of our method, we conduct several experiments on simulated data, using four standard benchmark systems and comparing against the EKF-based approach of Särkkä et al. (2015) and the two GP-based approaches of Vrettas et al. (2015) and Yildiz et al. (2018).
The first system is a simple Ornstein-Uhlenbeck process, shown in Figure 3(a) and given by the SDE

(31) $\mathrm{d}x(t) = -\theta\,x(t)\,\mathrm{d}t + g\,\mathrm{d}W(t)$

As mentioned in Section 3.2.3, this system has an analytical Gaussian process solution and thus serves more academic purposes. We simulate our dataset using fixed true values of the drift parameter $\theta$, the diffusion $g$ and the initial condition.
The second system is the Lorenz '63 model, whose drift is the classical Lorenz vector field $\mathbf{f}(x, y, z) = \big(\sigma(y - x),\; x(\rho - z) - y,\; xy - \beta z\big)$, equipped with a diagonal diffusion matrix.
In both systems, the drift function is linear in one state or one parameter conditioned on all the others (cf. Gorbach et al., 2017). Furthermore, there is no coupling across state dimensions in the diffusion matrix. This leads to two more interesting test cases.
To investigate the algorithm's capability to deal with off-diagonal terms in the diffusion, we introduce the two-dimensional Lotka-Volterra system shown in Figure 3(b), given by the SDEs

(32) $\mathrm{d}\begin{pmatrix} x \\ y \end{pmatrix} = \begin{pmatrix} \theta_1 x - \theta_2 xy \\ -\theta_3 y + \theta_4 xy \end{pmatrix} \mathrm{d}t + \mathbf{G}\,\mathrm{d}\mathbf{W}(t)$

where $\mathbf{G}$ is, without loss of generality, assumed to be a lower-triangular matrix. The system is simulated from a fixed initial condition with a fixed true parameter vector $\boldsymbol{\theta}$. Since its original introduction by Lotka (1932), the Lotka-Volterra system has been widely used to model population dynamics in biology. The system is observed at 50 equidistant points over a fixed time interval. As it turns out, this problem is significantly challenging for all algorithms, even without the presence of observation noise.
To investigate the effect of strong nonlinearities in the drift, we introduce the Ginzburg-Landau double-well potential shown in Figure 3(c), defined by the SDE

(33) $\mathrm{d}x(t) = \theta\,x(t)\big(1 - x(t)^2\big)\,\mathrm{d}t + g\,\mathrm{d}W(t)$

For our choice of the parameter $\theta$, the diffusion $g$ and the initial condition, this system exhibits an interesting bifurcation effect. While there are two stable equilibria at $x = \pm 1$, it is completely up to the noise in which one the system ends up. It is thus the perfect system to test how well an algorithm can deal with multimodal SDEs. The system is observed at 50 equidistant points over a fixed time interval, subject to Gaussian observation noise.
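A quick simulation illustrates the bifurcation: starting an ensemble of paths at the unstable equilibrium $x = 0$, roughly half end up in each well. The drift form matches equation (33) with $\theta = 1$; the diffusion, step size and ensemble size below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(4)
f = lambda x: x - x**3            # double-well drift with stable equilibria at +/- 1
g, dt, n_steps, n_paths = 0.5, 0.01, 1000, 200

x = np.zeros(n_paths)             # all paths start at the unstable equilibrium x = 0
for _ in range(n_steps):
    x += f(x) * dt + g * rng.normal(0.0, np.sqrt(dt), n_paths)

frac_up = np.mean(x > 0)          # roughly half the paths settle in each well
```

The resulting bimodal endpoint distribution is exactly the multimodality that breaks time-dependent Gaussianity assumptions.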
Lastly, some implementation details are constant throughout the experiments: the critic in the adversarial parameter estimation is a 2-layer fully connected neural network with 256 and 128 nodes, respectively. Every batch, for both MMD and adversarial training, contains 256 elements. While the Ornstein-Uhlenbeck process and the double-well potential were modeled with a sigmoid kernel, for Lotka-Volterra we used a common RBF kernel (we refer to Rasmussen (2004) for more information about kernels and GPs).
4.2 Evaluation
For all systems, the parameters turn out to be identifiable. Thus, the parameter estimates are a good indicator of how well an algorithm is able to infer the drift function. However, since the different dimensions of the Wiener process are independent, there are multiple diffusion matrices $\mathbf{G}$ that create the same trajectories. We thus directly compare the variance of the increments, i.e. the components of $\mathbf{G}\mathbf{G}^{\top}$.
To account for statistical fluctuations, we use 100 independent realizations of the SDE systems and compare the mean and standard deviation of the estimates of $\boldsymbol{\theta}$ and $\mathbf{G}\mathbf{G}^{\top}$. AReS and MaRS are compared against the Gaussian process-based approaches of Vrettas et al. (2015) and Yildiz et al. (2018), as well as the classic Kalman filter-based approach recommended by Särkkä et al. (2015).
4.3 Locally Linear Systems
As mentioned in Section 4.1, the functional forms of the drift functions of both the Ornstein-Uhlenbeck process and the Lorenz '63 system satisfy a local linearity assumption, while their diffusion is kept diagonal. They thus serve as excellent benchmark systems for a wide variety of parameter inference algorithms. The empirical results for the Ornstein-Uhlenbeck process are shown in Table 3. Unfortunately, the nonparametric approach of Yildiz et al. (2018) turns out to be rather unstable if both the diffusion and the parameters are unknown, even though it takes on average roughly 58 hours to converge. We have thus provided it with the true diffusion and show only its empirical parameter estimates. Since both AReS and MaRS use equation (22) to determine the diffusion, they share the same diffusion estimates. Due to space restrictions, we moved the results for Lorenz '63 to Table 4 in the appendix. As demonstrated by these systems, AReS and MaRS are both able to deal with locally linear systems, occasionally outperforming their competitors, especially in their estimates of the diffusion.
4.4 Non-Diagonal Diffusion
To investigate the effect of off-diagonal entries in $\mathbf{G}$, we use the Lotka-Volterra dynamics. For the diffusion estimates, AReS and MaRS again share the same values. Since the variational approach of Vrettas et al. (2015) was unable to model non-diagonal diffusions, we provided it with the true diffusion and only compare parameter estimates. As the approach of Yildiz et al. (2018) was already struggling in the lower-dimensional cases, we omitted it from this comparison due to limited computational resources. The results are shown in Table 3. AReS and MaRS clearly outperform their competition in terms of diffusion estimation, while the approach of Vrettas et al. (2015) is the only competing algorithm able to keep up in terms of drift parameter estimation.
4.5 Dealing with Multi-Modality
As a final challenge, we investigate the Ginzburg-Landau double-well system. While it is one-dimensional, its state distribution is multimodal even if all parameters are known. As shown in Table 3, this is a huge challenge for all classical approaches. While the nonparametric approach of Yildiz et al. (2018) probably does not have enough data points for its proxy of the drift function, the time-dependent Gaussianity assumptions of the remaining competitors are problematic in this case. In our gradient matching framework, no such assumption is made. Thus, both AReS and MaRS are able to deal with the multimodality of the problem.
5 Conclusion
Parameter and diffusion estimation in stochastic systems arises in the quantitative sciences and many fields of engineering. Current estimation techniques based on Kalman filtering or Gaussian processes try to approximate the state distribution conditioned on the parameters and then iteratively optimize the data likelihood. In this work, we propose to turn this procedure on its head by leveraging key ideas from gradient matching algorithms designed for deterministic ODEs. By introducing a novel noise model for Gaussian process regression using the Doss-Sussmann transformation, we are able to reliably estimate the diffusion matrix and the drift parameters. Our algorithm is able to keep up with and occasionally outperform the state-of-the-art on the simpler benchmark systems, while also accurately estimating parameters for systems that exhibit multimodal state densities, a case where traditional methods fail. While our approach is currently restricted to systems with a constant diffusion matrix $\mathbf{G}$, it would be interesting to see how it generalizes to other systems by using different or approximate bridge constructs. Unfortunately, this is outside the scope of this work. However, we hope that our publicly available code will facilitate future research in that direction.
This research was supported by the Max Planck ETH Center for Learning Systems. GA acknowledges funding from Google DeepMind and University of Oxford.
References
Aït-Sahalia (2002) Yacine Aït-Sahalia. Maximum likelihood estimation of discretely sampled diffusions: a closed-form approximation approach. Econometrica, 70(1):223–262, 2002.

Archambeau et al. (2007) Cedric Archambeau, Dan Cornford, Manfred Opper, and John Shawe-Taylor. Gaussian process approximations of stochastic differential equations. Journal of Machine Learning Research, 1:1–16, 2007.
Arjovsky et al. (2017) Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223, 2017.
 Bauer et al. (2017) Stefan Bauer, Nico S Gorbach, Djordje Miladinovic, and Joachim M Buhmann. Efficient and flexible inference for stochastic systems. In Advances in Neural Information Processing Systems, pages 6988–6998, 2017.
 Bhat et al. (2015) Harish S Bhat, RWMA Madushani, and Shagun Rawat. Parameter inference for stochastic differential equations with density tracking by quadrature. In International Workshop on Simulation, pages 99–113. Springer, 2015.

Calderhead et al. (2009) Ben Calderhead, Mark Girolami, and Neil D Lawrence. Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes. In Advances in Neural Information Processing Systems, pages 217–224, 2009.
Donnet and Samson (2013) Sophie Donnet and Adeline Samson. A review on estimation of stochastic differential equations for pharmacokinetic/pharmacodynamic models. Advanced Drug Delivery Reviews, 65(7):929–939, 2013.
 Doss (1977) Halim Doss. Liens entre équations différentielles stochastiques et ordinaires. 1977.
 Dziugaite et al. (2015) Gintare Karolina Dziugaite, Daniel M Roy, and Zoubin Ghahramani. Training generative neural networks via maximum mean discrepancy optimization. arXiv preprint arXiv:1505.03906, 2015.
 Elerian et al. (2001) Ola Elerian, Siddhartha Chib, and Neil Shephard. Likelihood inference for discretely observed nonlinear diffusions. Econometrica, 69(4):959–993, 2001.
Eraker (2001) Bjørn Eraker. MCMC analysis of diffusion models with application to finance. Journal of Business & Economic Statistics, 19(2):177–191, 2001.
Goodfellow et al. (2014) Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.
 Gorbach et al. (2017) Nico S Gorbach, Stefan Bauer, and Joachim M Buhmann. Scalable variational inference for dynamical systems. In Advances in Neural Information Processing Systems, pages 4806–4815, 2017.
Gretton et al. (2012) Arthur Gretton, Karsten M Borgwardt, Malte J Rasch, Bernhard Schölkopf, and Alexander Smola. A kernel two-sample test. Journal of Machine Learning Research, 13(Mar):723–773, 2012.
Hurn and Lindsay (1999) AS Hurn and KA Lindsay. Estimating the parameters of stochastic differential equations. Mathematics and Computers in Simulation, 48(4-6):373–384, 1999.
 Jiménez et al. (2008) MJ Jiménez, Henrik Madsen, JJ Bloem, and Bernd Dammann. Estimation of nonlinear continuous time models for the heat exchange dynamics of building integrated photovoltaic modules. Energy and Buildings, 40(2):157–167, 2008.
 Karimi and McAuley (2018) Hadiseh Karimi and Kimberley B McAuley. Bayesian objective functions for estimating parameters in nonlinear stochastic differential equation models with limited data. Industrial & Engineering Chemistry Research, 57(27):8946–8961, 2018.
 Li et al. (2015) Yujia Li, Kevin Swersky, and Rich Zemel. Generative moment matching networks. In International Conference on Machine Learning, pages 1718–1727, 2015.
 Lotka (1932) Alfred J Lotka. The growth of mixed populations: two species competing for a common food supply. Journal of the Washington Academy of Sciences, 22(16/17):461–469, 1932.
 Nielsen et al. (2000) Jan Nygaard Nielsen, Henrik Madsen, and Peter C Young. Parameter estimation in stochastic differential equations: an overview. Annual Reviews in Control, 24:83–94, 2000.
Picchini (2007) Umberto Picchini. SDE Toolbox: Simulation and estimation of stochastic differential equations with MATLAB. 2007.
Pieschner and Fuchs (2018) Susanne Pieschner and Christiane Fuchs. Bayesian inference for diffusion processes: Using higher-order approximations for transition densities. arXiv preprint arXiv:1806.02429, 2018.
 Rasmussen (2004) Carl Edward Rasmussen. Gaussian processes in machine learning. In Advanced lectures on machine learning, pages 63–71. Springer, 2004.
 Riesinger et al. (2016) Christoph Riesinger, Tobias Neckel, and Florian Rupp. Solving random ordinary differential equations on gpu clusters using multiple levels of parallelism. SIAM Journal on Scientific Computing, 38(4):C372–C402, 2016.
Ruttor et al. (2013) Andreas Ruttor, Philipp Batz, and Manfred Opper. Approximate Gaussian process inference for the drift function in stochastic differential equations. In Advances in Neural Information Processing Systems, pages 2040–2048, 2013.
 Ryder et al. (2018) Thomas Ryder, Andrew Golightly, A Stephen McGough, and Dennis Prangle. Blackbox variational inference for stochastic differential equations. International Conference on Machine Learning, 2018.
 Särkkä et al. (2015) Simo Särkkä, Jouni Hartikainen, Isambi Sailon Mbalawata, and Heikki Haario. Posterior inference on parameters of stochastic differential equations via nonlinear gaussian filtering and adaptive mcmc. Statistics and Computing, 25(2):427–437, 2015.
 Sørensen (2004) Helle Sørensen. Parametric inference for diffusion processes observed at discrete points in time: a survey. International Statistical Review, 72(3):337–354, 2004.
 Sussmann (1978) Héctor J Sussmann. On the gap between deterministic and stochastic ordinary differential equations. The Annals of Probability, pages 19–41, 1978.
 Tronarp and Särkkä (2018) Filip Tronarp and Simo Särkkä. Iterative statistical linear regression for Gaussian smoothing in continuous-time nonlinear stochastic dynamic systems. arXiv preprint arXiv:1805.11258, 2018.
 van der Meulen et al. (2017) Frank van der Meulen, Moritz Schauer, et al. Bayesian estimation of discretely observed multidimensional diffusion processes using guided proposals. Electronic Journal of Statistics, 11(1):2358–2396, 2017.
 Vrettas et al. (2015) Michail D Vrettas, Manfred Opper, and Dan Cornford. Variational mean-field algorithm for efficient inference in large systems of stochastic differential equations. Physical Review E, 91(1):012148, 2015.
 Wenk et al. (2018) Philippe Wenk, Alkis Gotovos, Stefan Bauer, Nico Gorbach, Andreas Krause, and Joachim M Buhmann. Fast Gaussian process based gradient matching for parameter identification in systems of nonlinear ODEs. arXiv preprint arXiv:1804.04378, 2018.
 Yang et al. (2018) Liu Yang, Dongkun Zhang, and George Em Karniadakis. Physics-informed generative adversarial networks for stochastic differential equations. arXiv preprint arXiv:1811.02033, 2018.
 Yildiz et al. (2018) Cagatay Yildiz, Markus Heinonen, Jukka Intosalmi, Henrik Mannerström, and Harri Lähdesmäki. Learning stochastic differential equations with gaussian processes without gradient matching. arXiv preprint arXiv:1807.05748, 2018.
Appendix A
A.1 Parameter Estimation Lorenz 63
A.2 Densities for Ancestral Sampling of the SDE-Based Model
Given the graphical model in Figure 0(a), it is straightforward to calculate the densities used in the ancestral sampling scheme in Algorithm 1. After marginalizing out , the joint density described by the graphical model can be written as
(34) 
Inserting the densities given by Equations (10), (11), (13) and (21) yields
(35) 
Using variable substitution to simplify notation, we write
(36) 
This expression is now simplified step by step, using the fact that the product of two Gaussian densities in the same random variable is proportional to another Gaussian density:
(37)  
(38)  
(39)  
(40)  
(41) 
where
(42)  
(43)  
(44)  
(45) 
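The Gaussian product rule used in this step can be checked numerically. The following sketch (with illustrative means and variances, using NumPy; the scalar case is shown for readability) verifies that N(x; a, A) N(x; b, B) = Z N(x; c, C), where the combined precision is the sum of the individual precisions and the normalizing constant Z is itself a Gaussian density evaluated at the difference of the means:

```python
import numpy as np

def gauss_pdf(x, mean, var):
    """Density of a univariate Gaussian N(x; mean, var)."""
    return np.exp(-0.5 * (x - mean) ** 2 / var) / np.sqrt(2 * np.pi * var)

# Two Gaussian densities in the same variable x (illustrative values).
a, A = 1.0, 2.0   # mean and variance of the first factor
b, B = -0.5, 0.5  # mean and variance of the second factor

# Product rule: N(x; a, A) N(x; b, B) = Z * N(x; c, C) with
C = 1.0 / (1.0 / A + 1.0 / B)   # combined variance (sum of precisions)
c = C * (a / A + b / B)         # precision-weighted combined mean
Z = gauss_pdf(a, b, A + B)      # normalizing constant N(a; b, A + B)

# Numerical check on a grid of points.
x = np.linspace(-5.0, 5.0, 101)
lhs = gauss_pdf(x, a, A) * gauss_pdf(x, b, B)
rhs = Z * gauss_pdf(x, c, C)
assert np.allclose(lhs, rhs)
```

The same identity, applied with matrix-valued precisions, yields the combined means and covariances defined above.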
This formula can be further modified using the Woodbury identity, i.e.
(46)  
(47)  
(48) 
which leads to
(49) 
and
(50)  
(51)  
(52) 
which leads to
(53) 
Since we observe , we are interested in calculating the conditional distribution
(54) 
Conveniently enough, the marginal density of is already factorized out in Equation (41) (compare Equation (22)). Thus, we have
(55) 
As is independent of , we can employ ancestral sampling by first obtaining a sample of by sampling and then using that sample to obtain a sample of by sampling .
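The two-stage ancestral sampling scheme can be illustrated on a minimal linear-Gaussian example (the coefficients below are illustrative, not taken from the model): first draw the parent variable from its marginal, then draw the child from its conditional given that sample. The empirical joint moments then match the analytic ones:

```python
import numpy as np

rng = np.random.default_rng(0)
a, s = 0.5, 0.3   # illustrative conditional slope and noise scale
n = 200_000

# Ancestral sampling: parent first, then child conditioned on the parent.
z1 = rng.standard_normal(n)               # z1 ~ N(0, 1)
z2 = a * z1 + s * rng.standard_normal(n)  # z2 | z1 ~ N(a * z1, s^2)

# Analytic joint moments: Var(z2) = a^2 + s^2, Cov(z1, z2) = a.
assert abs(z2.var() - (a**2 + s**2)) < 0.02
assert abs(np.cov(z1, z2)[0, 1] - a) < 0.02
```

The scheme in Algorithm 1 follows the same pattern, with the Gaussian densities derived above taking the place of the toy marginal and conditional.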
A.3 Calculating the GP Posterior for Data-Based Ancestral Sampling
Given the graphical model in Figure 0(b), it is straightforward to calculate the densities used in the ancestral sampling scheme in Algorithm 1. After marginalizing out and using the variable substitutions introduced in Equation (36), the joint density described by the graphical model can be written as
(56)  
(57)  
(58)  
(59)  
(60)  
(61) 
where
(62)  
(63)  
(64)  
(65)  
(66) 
and is independent of .
After marginalizing out and then dividing by the marginal of , we get the conditional distribution
(67) 
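The final step, dividing the joint Gaussian by the marginal of the observed block to obtain the conditional, follows the standard Gaussian conditioning rule. A small numerical sketch (with illustrative scalar blocks, not the model's actual covariances) confirms the factorization p(x1, x2) = p(x2) p(x1 | x2):

```python
import numpy as np

def mvn_pdf(x, mean, cov):
    """Density of a multivariate Gaussian N(x; mean, cov)."""
    d = len(mean)
    diff = x - mean
    return np.exp(-0.5 * diff @ np.linalg.solve(cov, diff)) / np.sqrt(
        (2 * np.pi) ** d * np.linalg.det(cov))

# Joint Gaussian over (x1, x2); both blocks are scalar for readability.
m = np.array([1.0, -1.0])
S = np.array([[2.0, 0.6],
              [0.6, 1.0]])

x2 = 0.3
# Conditioning rule:
#   x1 | x2 ~ N(m1 + S12 S22^{-1} (x2 - m2),  S11 - S12 S22^{-1} S21)
cond_mean = m[0] + S[0, 1] / S[1, 1] * (x2 - m[1])
cond_var = S[0, 0] - S[0, 1] ** 2 / S[1, 1]

# Pointwise check of p(x1, x2) = p(x2) * p(x1 | x2).
x1 = -0.7
joint = mvn_pdf(np.array([x1, x2]), m, S)
marg = mvn_pdf(np.array([x2]), m[1:], S[1:, 1:])
cond = mvn_pdf(np.array([x1]), np.array([cond_mean]), np.array([[cond_var]]))
assert np.isclose(joint, marg * cond)
```

With matrix-valued blocks, the same rule gives the conditional in Equation (67).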