 # Bayesian Filtering for ODEs with Bounded Derivatives

Recently there has been increasing interest in probabilistic solvers for ordinary differential equations (ODEs) that return full probability measures, instead of point estimates, over the solution and can incorporate uncertainty over the ODE at hand, e.g. if the vector field or the initial value is only approximately known or evaluable. The ODE filter proposed in recent work models the solution of the ODE by a Gauss-Markov process which serves as a prior in the sense of Bayesian statistics. While previous work employed a Wiener process prior on the (possibly multiple times) differentiated solution of the ODE and established equivalence of the corresponding solver with classical numerical methods, this paper raises the question whether other priors also yield practically useful solvers. To this end, we discuss a range of possible priors which enable fast filtering and propose a new prior--the Integrated Ornstein-Uhlenbeck Process (IOUP)--that complements the existing Integrated Wiener process (IWP) filter by encoding the property that a derivative in time of the solution is bounded, in the sense that it tends to drift back to zero. We provide experiments comparing IWP and IOUP filters which support the belief that the IWP prior better approximates solutions of divergent ODEs, whereas the IOUP is a better prior for trajectories with bounded derivatives.


## 1 Introduction

Ordinary differential equations (ODEs) are an extensively studied mathematical problem. Dynamical systems are described by ODEs, and their applications extend to many different fields of science and engineering. Given the initial value problem (IVP)

$$x'(t) = f(t, x(t)), \qquad x(0) = x_0 \in \mathbb{R}^d \tag{1}$$

with a globally Lipschitz continuous vector field $f$, we want to compute a numerical approximation of the solution $x$ using a probabilistic approach. Solving ODEs is one of the major tasks of numerical mathematics. Classical numerical ODE solvers (e.g. Runge-Kutta methods) construct the solution by iteratively evaluating the vector field at discrete times at the current numerical approximations, thereby collecting information on the derivative. As pointed out in the past [12, 3, 10], to incorporate the uncertainties that arise from these inaccurate observations of the derivative and the consequently accumulated numerical error, we can interpret the observed evaluations of $f$ as (potentially noisy) data for statistical inference. For quantifying the uncertainty on $x$ we model the analytic solution as a Gauss-Markov stochastic process $X$ defined on a probability space. The iterative algorithm then outputs the law of $X$, i.e. the probability measure that the process induces on the collection of functions from the time interval to the state space. This probability measure contains the belief over the solution $x$.

From this viewpoint, unknown numerical quantities are modelled as random variables, and numerical algorithms return probability distributions instead of point estimates. This probabilistic approach carries out numerical computations in a statistically interpretable way by combining probability theory and numerics, and makes it possible to quantify the uncertainty of the computations. This idea has become a dynamic research area, spanning optimization, linear algebra and differential equations, called Probabilistic Numerics (PN).

A PN solver for ODEs was first introduced in work that recast the inference of the solution as Gaussian process (GP) regression. However, this original solver did not attain the polynomial convergence rates of classical solvers. In parallel, a solver of similar structure was proposed which uses a Monte Carlo, instead of a closed-form Gaussian, updating scheme; it produced solutions of linear, but still not polynomial, order. Subsequently, polynomial convergence rates were established by choosing an Integrated Wiener Process (IWP) prior. This solver was further amended by adding calibrated uncertainty quantification via Bayesian quadrature, and by using the Markov property of the IWP to recast it as a filter, thereby drastically speeding up the probabilistic computations [9, 16]. While the resulting filter is the fastest probabilistic ODE solver available, it outputs solely Gaussian probability measures. Another novel method provides a more flexible non-parametric uncertainty quantification by defining a generative process that samples random numerical solutions of the ODE, obtained by adding Gaussian noise to the output of a classical solver. The computational cost of this method per sample, however, equals the cost of the underlying solver, which makes it impractical for cost-sensitive applications.

In this paper we recall the solver of [15, 9], which employs a $q$-times integrated Wiener process prior on the solution, and introduce instead an Integrated Ornstein-Uhlenbeck (IOU) process prior. This new model improves the algorithm for ODEs whose trajectories have a higher derivative that is expected to gravitate back to zero. In the following, we will call such systems 'ODEs with bounded derivatives'.

## 2 Gaussian filtering for ODEs

### 2.1 Bayesian filtering problem

Assume we have a signal process $X$ which is not directly observable, the solution of the stochastic differential equation (SDE)

$$dX_t = b(t, X_t)\,dt + \sigma(t, X_t)\,dW_t \tag{2}$$

where $W$ denotes a Wiener process. Instead, we can only see an observation process $Z$ that is a transformation of $X$. Both processes are defined on some probability space. The goal of the filtering problem is to compute, given the observations, the $L^2$-optimal estimate $\hat{X}_t$ of $X_t$. Accordingly, $\hat{X}_t$ has to be measurable w.r.t. the natural filtration of $Z$ and minimize the mean-square distance between $X_t$ and the candidates:

$$\mathbb{E}\left\|X_t - \hat{X}_t\right\|^2 = \min_{Y \in L^2(\Omega, \mathcal{F}^Z_t, \mathbb{P})} \mathbb{E}\left\|X_t - Y\right\|^2. \tag{3}$$

The best mean-square estimate of $X_t$ given the observations is known to be the conditional expectation [8, 11]. In a Bayesian spirit, we seek to compute the posterior distribution of $X$ given the current and previous noisy observations $Z$. This can be done by using Bayes' rule

$$p(X \mid Z) = \frac{p(Z \mid X)\,p(X)}{p(Z)}. \tag{4}$$

The distribution $p(Z \mid X)$ is the measurement model, which describes the noisy relationship between the true signal and the observations, and $p(X)$ is the prior containing preliminary information on $X$ before any observations are taken into account. A natural question is then: which prior should we put on the signal process $X$? Due to the computationally advantageous properties of Gaussian distributions, which allow for fast inference by matrix computations, we assume $X$ to be a Gaussian process, i.e. to have finite-dimensional joint Gaussian distributions. Moreover, we want to iteratively update the solution from step to step without taking into account the observations at all previous time steps. This is possible when $X$ is also a Markov process, since otherwise the computational cost per step would grow with the number of previous steps. Sample-continuous Gauss-Markov processes can be written as the solution of a linear time-invariant stochastic differential equation (LTI-SDE) of the form

$$dX_t = F X_t\,dt + L\,dW_t \tag{5}$$

with a Gaussian initial condition. The matrix $F$ is the so-called drift matrix and $L$ is the diffusion matrix of the process. The solution of Eq. 5 depends on the form of $F$ and $L$. Under these conditions the distribution of the estimate can be computed by Kalman filtering, an iterative algorithmic implementation of the linear filtering problem under Gaussian model assumptions. Since it is essentially an application of Bayes' rule, it is determined by the choice of the prior and the measurement model. Previous work [9, 15, 16] employed an IWP prior, discussed in Section 2.3, whereas in this paper we analyse Bayesian filtering for ODEs with an IOUP prior. Due to the properties of Gaussian distributions, together with Gaussian measurement model assumptions, the matrices needed for Gaussian filtering of an LTI-SDE like Eq. 5 can be computed analytically.
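The two Kalman steps referred to here (prediction through the LTI-SDE, then conditioning on an observation) can be written compactly. The following is a minimal generic sketch, not the paper's implementation; `A`, `Q` are the discretized transition and process-noise matrices and `H`, `R` the measurement matrix and noise:

```python
import numpy as np

def kf_predict(m, P, A, Q):
    """Prediction step: propagate mean and covariance through the linear dynamics."""
    m_pred = A @ m
    P_pred = A @ P @ A.T + Q
    return m_pred, P_pred

def kf_update(m_pred, P_pred, z, H, R):
    """Update step: condition the Gaussian prediction on the observation z."""
    S = H @ P_pred @ H.T + R               # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)    # Kalman gain
    m = m_pred + K @ (z - H @ m_pred)
    P = P_pred - K @ S @ K.T
    return m, P
```

Alternating these two functions over the time grid is exactly the iterative application of Bayes' rule described above.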

### 2.2 Filtering for ODEs

We use Bayesian filtering to compute a numerical estimate of the solution of the IVP in Eq. 1. We treat the imprecise measurements as noisy observations: since each evaluation of $f$ is taken at a numerical approximation of the true solution, the evaluations of $f$ are indeed noisy. Therefore, we model the solution and its first $q$ derivatives as a draw from a Gauss-Markov stochastic process $X = (X^{(0)}, \dots, X^{(q)})$. Hence, the SDE in Eq. 5, which describes the dynamics of $X$, has the following form

$$d\begin{pmatrix} X^{(0)}_t \\ X^{(1)}_t \\ \vdots \\ X^{(q-1)}_t \\ X^{(q)}_t \end{pmatrix} = \underbrace{\begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & & 0 & 1 \\ f_{q,0} & f_{q,1} & \cdots & \cdots & f_{q,q} \end{pmatrix}}_{F} \begin{pmatrix} X^{(0)}_t \\ X^{(1)}_t \\ \vdots \\ X^{(q-1)}_t \\ X^{(q)}_t \end{pmatrix} dt + \underbrace{\begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \sigma \end{pmatrix}}_{L}\, dW_t. \tag{6}$$

The drift matrix $F$ has non-zero entries only on the first off-diagonal and on the last row. This form of $F$ guarantees that the $(i+1)$-th component of the state space is the derivative of the $i$-th component. The parameters $f_{q,0}, \dots, f_{q,q}$ determine the process employed on the $q$-th component of the state space: they are all equal to zero for an integrated Wiener process; all equal to zero apart from $f_{q,q} = \theta$ for the integrated OUP with some $\theta$; whereas for the Matern process all entries of the last row are non-zero. Inference in Eq. 6 with linear Gaussian measurement models gives rise to the Kalman filter equations implemented in Algorithm 1. The algorithm outputs the filtering distribution at the discretization points.

The transition matrices $A(h)$ and $Q(h)$ in lines 6-7 are given by

$$A(h) = \exp(Fh) \tag{7}$$
$$Q(h) = \int_0^h \exp(F\tau)\, L L^T \exp(F\tau)^T\, d\tau \tag{8}$$

where $F$ and $L$ come from Eq. 6. The matrix $H$ is the measurement model matrix, in our case the unit row vector selecting the $q$-th derivative of the state. In line 1 a prior model is selected. Lines 6-7 compute the prediction distribution, which depends on the form of the matrices $A(h)$ and $Q(h)$ and is therefore strictly related to the choice of the prior. Lines 9-13, known as the update step, compute the updated distribution. In the experiments we will assume for simplicity $R = 0$, i.e. the numerical approximation is treated as exact.
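In practice, $A(h)$ and $Q(h)$ from Eqs. 7-8 need not be integrated by hand: for any drift $F$ and diffusion $L$ they can be obtained from a single matrix exponential of a block matrix (Van Loan's trick). A short sketch, assuming scipy is available; `discretize` is our own helper name:

```python
import numpy as np
from scipy.linalg import expm

def discretize(F, L, sigma2, h):
    """Compute A(h) = exp(Fh) and Q(h) = int_0^h exp(F t) L L^T exp(F t)^T dt
    via one matrix exponential of the block matrix [[F, LL^T], [0, -F^T]]."""
    n = F.shape[0]
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n] = F
    M[:n, n:] = sigma2 * np.outer(L, L)
    M[n:, n:] = -F.T
    Phi = expm(M * h)
    A = Phi[:n, :n]
    Q = Phi[:n, n:] @ A.T   # upper-right block times A^T recovers the integral
    return A, Q
```

For the once-integrated Wiener process this reproduces the familiar $A = \begin{psmallmatrix}1 & h\\ 0 & 1\end{psmallmatrix}$ and $Q = \sigma^2\begin{psmallmatrix}h^3/3 & h^2/2\\ h^2/2 & h\end{psmallmatrix}$.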

### 2.3 Prior selection: Integrated Wiener process

The standard Brownian motion is a Gauss-Markov process with mean zero and variance $t$. Samples of the process are depicted in Fig. 1 (b). Previous work [9, 15, 16] worked with an IWP prior for probabilistic ODE solvers. In this case $A(h)$ and $Q(h)$ are given by

$$A(h)_{i,j} = \left(e^{Fh}\right)_{i,j} = \mathbb{1}_{j \ge i}\,\frac{h^{j-i}}{(j-i)!} \tag{9}$$
$$Q(h)_{i,j} = \sigma^2\,\frac{h^{2q+1-i-j}}{(2q+1-i-j)\,(q-i)!\,(q-j)!}. \tag{10}$$

This work demonstrated useful uncertainty quantification for the solution with only a small computational overhead compared to classical numerical methods. In particular, Schober et al. constructed a class of probabilistic solvers whose posterior means have the same analytic structure as classical Runge-Kutta methods (of first, second and third order), thus inheriting their good properties.
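The closed forms in Eqs. 9-10 are easy to check numerically. The following hypothetical helper (names are ours) builds both matrices for a $q$-times IWP; $A(h)$ can be compared against the matrix exponential of the IWP drift:

```python
import numpy as np
from math import factorial

def iwp_A_Q(q, sigma2, h):
    """Closed-form A(h) and Q(h) of the q-times integrated Wiener process
    (Eqs. 9-10), with state indices i, j running from 0 to q."""
    A = np.zeros((q + 1, q + 1))
    Q = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(q + 1):
            if j >= i:
                A[i, j] = h ** (j - i) / factorial(j - i)
            p = 2 * q + 1 - i - j
            Q[i, j] = sigma2 * h ** p / (p * factorial(q - i) * factorial(q - j))
    return A, Q
```

For example, for $q = 2$ this gives $Q(h)_{0,0} = \sigma^2 h^5/20$ and $Q(h)_{2,2} = \sigma^2 h$.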

## 3 Ornstein Uhlenbeck process

Can we use other Gauss-Markov priors over the solution which preserve good convergence rates? In this paper we model the solution function and its first $q$ derivatives as a draw from an integrated Ornstein-Uhlenbeck process (IOUP) to perform inference on ODEs. We show that this model is competitive with the IWP solver discussed in the previous section. The Ornstein-Uhlenbeck process (OUP) is a Gauss-Markov process whose dynamics are governed by the following SDE

$$dX_t = \theta\,(X_t - \mu)\,dt + \sigma\,dW_t \tag{11}$$

where $\theta < 0$ and $W$ denotes a standard Wiener process. The OUP converges to its stationary distribution as $t \to \infty$. The difference between the Wiener process and the Ornstein-Uhlenbeck process is the drift term, which is zero for the standard Wiener process and $\theta(X_t - \mu)$ for the OUP. If the current value of the process is smaller than the mean $\mu$, the drift is positive and the process tends to drift back up to $\mu$. The same (symmetric) effect occurs if the current value of the process is greater than $\mu$. In other words, the samples evolve around the mean, which acts as an equilibrium level. Some samples of the process are shown in Fig. 1 (a). This property is called 'mean reversion', and the magnitude of $\theta$ determines the strength of this drift. For small values of $|\theta|$ the effect vanishes, whereas for large $|\theta|$ the process stays close to the mean with high probability. This behaviour is reflected in the fact that the variance of the OUP is bounded (by $\sigma^2/(2|\theta|)$), whereas the Wiener process variance is unbounded ($\mathrm{Var}(W_t) = t$) and its samples do not tend to remain around the mean.

Figure 1 (a): Samples of an Ornstein-Uhlenbeck process with initial position μ = 0, variance σ² = 1 and θ = −1. The samples tend to remain close to the mean μ (mean-reverting property).
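The mean reversion and the bounded stationary variance $\sigma^2/(2|\theta|)$ are easy to reproduce empirically. A minimal Euler-Maruyama sketch (here we write the drift with an explicit minus sign and a rate $\theta > 0$, i.e. the opposite sign convention to Eq. 11; all parameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_oup(theta, mu, sigma, x0, h, n_steps, n_paths):
    """Euler-Maruyama samples of dX_t = -theta (X_t - mu) dt + sigma dW_t,
    with mean-reversion rate theta > 0; returns the terminal values."""
    x = np.full(n_paths, x0, dtype=float)
    for _ in range(n_steps):
        x += -theta * (x - mu) * h + sigma * np.sqrt(h) * rng.standard_normal(n_paths)
    return x
```

Running many paths to a time well past $1/\theta$ and taking the empirical variance recovers a value close to $\sigma^2/(2\theta)$, rather than the linearly growing variance of the Wiener process.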

### 3.1 Bayesian Filtering with Integrated Ornstein Uhlenbeck process

We analyse the probabilistic treatment of ODEs by filtering, investigated in Section 2.2, in the case of a $q$-times integrated Ornstein-Uhlenbeck process.
To this end, we a priori assume that the solution of the ODE and its first $q$ derivatives follow a $q$-times integrated Ornstein-Uhlenbeck process. Accordingly, the drift matrix in Eq. 6 has last-row entries $f_{q,j} = 0$ for $j < q$ and $f_{q,q} = \theta$. Using this prior model the matrices $A(h)$ and $Q(h)$ can be computed analytically (see appendix) from Eq. 7 and Eq. 8 and have the following form:

$$A(h)_{i,j} = \left(e^{hF}\right)_{i,j} = \begin{cases} \mathbb{1}_{j \ge i}\,\dfrac{h^{j-i}}{(j-i)!}, & \text{if } j \neq q \\[2ex] \dfrac{1}{\theta^{q-i}}\left(e^{\theta h} - \displaystyle\sum_{k=0}^{q-i-1} \frac{(\theta h)^k}{k!}\right), & \text{if } j = q \end{cases} \tag{12}$$

with the corresponding expression for $Q(h)$ (Eq. 13) derived in the appendix.
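The closed form of Eq. 12 can be verified against a direct matrix exponential of the IOUP drift matrix. A sketch (the helper name is ours):

```python
import numpy as np
from math import factorial

def ioup_A(q, theta, h):
    """Closed-form A(h) = exp(Fh) for the q-times integrated OUP (Eq. 12):
    polynomial entries in the first q columns, a truncated exponential
    tail in the last column."""
    A = np.zeros((q + 1, q + 1))
    for i in range(q + 1):
        for j in range(q + 1):
            if j < q:
                if j >= i:
                    A[i, j] = h ** (j - i) / factorial(j - i)
            else:
                tail = np.exp(theta * h) - sum((theta * h) ** k / factorial(k)
                                               for k in range(q - i))
                A[i, j] = tail / theta ** (q - i)
    return A
```

Note that for $i = q$ the last-column entry reduces to $e^{\theta h}$, and as $\theta \to 0$ each entry recovers the IWP polynomial $h^{q-i}/(q-i)!$.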

A probabilistic solver with IOUP prior has a local convergence rate for the predicted posterior mean of the same order as the solver with an IWP prior:

###### Proposition 1

The prediction error of the posterior mean in the first step satisfies

$$\left\| m^-(h)_0 - x(h) \right\| \le K h^{q+1}, \tag{14}$$

for some constant $K$ independent of $h$.

###### Proof
$$m^-(h)_0 = \sum_{i=0}^{q} A(h)_{0,i}\, m_i = m_0 + \sum_{k=1}^{q-1} \frac{h^k m_k}{k!} + \frac{m_q}{\theta^q} \sum_{k=q}^{\infty} \frac{(\theta h)^k}{k!} \tag{15}$$
$$= m_0 + \sum_{k=1}^{q-1} \frac{h^k m_k}{k!} + m_q \sum_{k=q}^{\infty} \frac{h^k \theta^{k-q}}{k!} \tag{16}$$
$$= m_0 + h m_1 + \frac{h^2}{2!} m_2 + \dots + \frac{h^{q-1}}{(q-1)!} m_{q-1} + \frac{h^q}{q!} m_q + O(h^{q+1}) \tag{17}$$
$$= \underbrace{x_0 + h f(0, x_0) + \dots + \frac{h^{q-1}}{(q-1)!} f^{(q-2)}(0, x_0) + \frac{h^q}{q!} f^{(q-1)}(0, x_0)}_{\text{first } q+1 \text{ terms of the Taylor expansion of } x(h)} + O(h^{q+1}) \tag{18}$$

By writing the Taylor expansion of $x(h)$ and subtracting it from Eq. 18, we obtain the desired order of the truncation error.

As mentioned above, we want to exploit the 'mean-reverting' property of the OUP (absent in the WP), or equivalently the fact that the OUP has bounded variance, for solving ODEs whose trajectories have bounded derivatives. Since we use a $q$-times IOUP prior over the solution, we also want the integral of the process to maintain this behaviour, i.e. to tend to stay closer to the mean than the WP. This is indeed the case, as we prove in the appendix that $\mathrm{Var}\left(\int_0^T X_s\,ds\right) \le \mathrm{Var}\left(\int_0^T W_s\,ds\right)$.

## 4 Experiments

In this section we analyse applications of the probabilistic ODE solver with IOUP prior discussed in Section 3.1. In particular we compare this solver with the one considered in Section 2.3, which uses an IWP prior over the solution , by applying the two probabilistic ODE solvers to the same ODEs and examining the plots. The experiments aim to test the intuition that the selection of the prior depends on the structure of the ODE’s solution: when the solution looks ’mean-reverting’, in the sense that it has bounded derivatives, we expect better results from the IOUP prior, whereas we should choose an IWP prior when solutions to ODEs have unbounded derivatives or look divergent. According to these considerations Table 1 shows our prior belief on which solver (IOUP or IWP) performs better for which of the following ODEs.

• Problem 1: Linear ODEs

  • a) Exponential function: $x'(t) = x, \quad x(0) = 1$

  • b) Negative exponential function: $x'(t) = -x, \quad x(0) = 1$

• Problem 2: Oscillators

  • a) Orbit equations ($\varepsilon$ is the eccentricity of the orbit)

  • b) Van der Pol oscillator: $x''(t) - \mu(1 - x^2)\,x'(t) + x = 0, \quad x(0) = 0$

• Problem 3: Moderate systems — a system of first-order ODEs of dimension 10 describing a radioactive decay chain (see Section 4.3)

For the experiments, the IOUP and IWP filters are initialized with a mean and a covariance whose non-zero entries lie only in the minor determined by ord, where ord is the order of the ODE and $I$ is the identity matrix.

### 4.1 Linear ODEs

The solution of Problem 1 a), the exponential function, has a divergent trajectory with unbounded derivatives, and as expected the plot in Fig. 2 (a) shows that the IWP-prior solver approximates the solution better than the IOUP-prior solver. On the contrary, the negative exponential function, the solution of Problem 1 b), decays to zero. This resembles the behaviour of the mean-reverting samples of the IOU process, which should therefore work as a good prior over the solution. The plot in Fig. 2 (b) confirms this intuition: the IOUP solver approximates the solution better than the IWP solver. Both methods use the same step size.
The plot in Fig. 2 (b) clearly shows the behaviour of the two solvers: the IWP oscillates around the solution much more than the IOUP as the solution converges to zero, because the IWP does not tend to 'drift back' to the mean, whereas the IOUP solver remains closer to the solution due to the drift of the process and its smaller variance.
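To make the comparison concrete, here is a self-contained sketch of the filter of Algorithm 1 with $R = 0$ applied to Problem 1 b) ($x' = -x$), for $q = 1$; the IOUP variant only changes the last drift entry. All helper names, the step size and $\theta = -1$ are our illustrative choices, not the paper's exact setup:

```python
import numpy as np
from scipy.linalg import expm

def ode_filter(f, x0, T, h, F, L, sigma2):
    """Minimal Gauss-Markov ODE filter (noise-free measurements, R = 0):
    predict through the LTI-SDE, then condition on z_n = f(t_n, m_pred)."""
    n = F.shape[0]
    q = n - 1
    # Discretize: A(h) = exp(Fh), Q(h) via a block-matrix exponential.
    M = np.zeros((2 * n, 2 * n))
    M[:n, :n], M[:n, n:], M[n:, n:] = F, sigma2 * np.outer(L, L), -F.T
    Phi = expm(M * h)
    A, Q = Phi[:n, :n], Phi[:n, n:] @ Phi[:n, :n].T
    H = np.zeros((1, n))
    H[0, q] = 1.0                                  # observe the q-th derivative
    m = np.zeros(n)
    m[0], m[q] = x0, f(0.0, x0)                    # initialize with x0 and x'(0)
    P = np.zeros((n, n))
    ts, ms = [0.0], [m[0]]
    t = 0.0
    while t < T - 1e-12:
        m_pred, P_pred = A @ m, A @ P @ A.T + Q    # prediction step
        z = f(t + h, m_pred[0])                    # vector-field "data"
        S = H @ P_pred @ H.T                       # innovation covariance (R = 0)
        K = P_pred @ H.T / S if S[0, 0] > 0 else np.zeros((n, 1))
        m = m_pred + (K * (z - H @ m_pred)).ravel()  # update step
        P = P_pred - K @ S @ K.T
        t += h
        ts.append(t)
        ms.append(m[0])
    return np.array(ts), np.array(ms)

# IWP prior vs IOUP prior: only the (q, q) entry of the drift differs.
f = lambda t, x: -x                                # Problem 1 b): x' = -x
F_iwp = np.array([[0.0, 1.0], [0.0, 0.0]])
F_ioup = np.array([[0.0, 1.0], [0.0, -1.0]])       # theta = -1
ts, m_iwp = ode_filter(f, 1.0, 2.0, 0.01, F_iwp, np.array([0.0, 1.0]), 1.0)
_, m_ioup = ode_filter(f, 1.0, 2.0, 0.01, F_ioup, np.array([0.0, 1.0]), 1.0)
```

Comparing `m_iwp` and `m_ioup` against $e^{-t}$ reproduces the qualitative picture of Fig. 2 (b) in this sketch: both track the solution, with the IOUP mean staying closer to it as it decays.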

### 4.2 Oscillators

We used an IOUP prior to compute approximate oscillating solutions of the ODEs in Problems 2 a) and b). The original intuition was that the IOUP would be a good prior for oscillators, because their trajectories resemble the 'mean-reverting' property and look similar to samples of an OUP. However, we do not always see better results from the IOUP approximation. Fig. 3 depicts two different oscillators: in plot (a) the IWP solver distinctly approximates the solution of Problem 2 a) better. On the other hand, the Van der Pol oscillator (Problem 2 b)), illustrated in Fig. 3 (b), is better approximated with an IOUP prior. To explain these results we argue that the similarity to OUP samples is a global property of the solution's structure, while the filter computes the approximate solution along the time axis 'locally', proceeding in small time steps. In particular, the IOUP filter predicts a smaller $q$-th derivative with high probability at every time step, whereas the derivative of an oscillator changes periodically and neither decays to zero nor converges to a finite value.

### 4.3 Moderate systems

Problem 3 is a system of ODEs (of order 1 and dimension 10) which describes a radioactive decay chain. The ten solutions of the ODE problem are shown in Fig. 4: the trajectories decay to zero, apart from the last component, which converges to one. Again, as discussed for Problem 1, the solution functions look similar to the mean-reverting samples of the OUP. In Fig. 5 we plot two orbits of ODE Problem 3 at two different step sizes. While the IWP oscillates around the orbit, the IOUP solver remains near to it. Moreover, the stiffer the orbit, the better we expect the Ornstein-Uhlenbeck prior to perform. Indeed, the plots confirm that this intuition might be useful.

Figure 4: Solution functions {x0, …, x9} of ODE Problem 3, computed with a Runge-Kutta method with step size h = 0.0001. (a) Orbit with the function x8(t) on the x-axis and x7(t) on the y-axis, t ∈ (0, 10).

## 5 Discussion

The experiments presented in Section 4 provide good material to discuss the differences between the IOUP and the IWP solvers. An IOUP prior appears to better approximate 'ODEs with bounded derivatives', due to the drift which keeps the process (in our model, the $q$-th derivative of the solution) near the zero mean with high probability. Fig. 2 (b) and Fig. 5 suggest this argument is reasonable: they distinctly show how the approximate solution with an IWP prior oscillates unstably around the analytical solution, because the IWP is less conservative than the OUP (in the sense that it tends to deviate away from its mean), whereas the IOUP solver remains closer to it. From this perspective, the IWP works as a better prior for divergent ODEs, such as the exponential function (Fig. 2 (a)). Our intuition was not validated for oscillators (Fig. 3). This is probably because the trajectory of an oscillator, although it resembles the mean-reverting behaviour globally, periodically changes the value of its derivative, whereas the IOUP solver predicts a smaller derivative at each time step.

## 6 Summary

Within the framework of Bayesian filtering we examined the theory of probabilistic ODE solvers, which return Gaussian process posterior distributions instead of point estimates and thereby provide uncertainty measures over the solution. We stressed the Bayesian nature of these solvers and discussed the question of prior selection. We then proposed a novel solver which employs an IOUP prior over the solution of the ODE, and provided experiments showing that this solver is competitive with the well-established IWP solver.

## 7 Appendix

### 7.1 Matrices A(h) and Q(h) for IOUP

It is easy to see that

$$A(h)_{i,j} := \left(e^{hF}\right)_{i,j} = \begin{cases} \mathbb{1}_{j \ge i}\,\dfrac{h^{j-i}}{(j-i)!}, & \text{if } j \neq q \\[2ex] \dfrac{1}{\theta^{q-i}}\left(e^{\theta h} - \displaystyle\sum_{k=0}^{q-i-1} \frac{(\theta h)^k}{k!}\right), & \text{if } j = q \end{cases} \tag{19}$$

Let us calculate $Q(h)$, given by

$$Q(h) = \int_0^h \exp(F\tau)\, L Q L^T \exp(F\tau)^T\, d\tau \tag{20}$$

where

$$F = \begin{pmatrix} 0 & 1 & 0 & \cdots & 0 \\ 0 & 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \ddots & \vdots \\ 0 & \cdots & & 0 & 1 \\ 0 & 0 & \cdots & 0 & \theta \end{pmatrix}, \qquad L = \begin{pmatrix} 0 \\ 0 \\ \vdots \\ 0 \\ \sigma \end{pmatrix}$$

with $\theta$ as in Eq. 11. We assume $Q = 1$. We have

$$\left(L Q L^T\right)_{i,j} = \begin{cases} \sigma^2 & \text{if } i = j = q \\ 0 & \text{otherwise} \end{cases}$$

so

$$\left(\exp(F\tau)\, L Q L^T\right)_{i,j} = \begin{cases} \sigma^2 \displaystyle\sum_{k=0}^{\infty} \frac{\theta^k \tau^{k+q-i}}{(k+q-i)!} = \dfrac{\sigma^2}{\theta^{q-i}}\left(e^{\theta \tau} - \displaystyle\sum_{k=0}^{q-i-1} \frac{(\theta \tau)^k}{k!}\right) & \text{if } j = q \\ 0 & \text{otherwise} \end{cases}$$

and

$$\left(\exp(F\tau)\, L Q L^T \exp(F\tau)^T\right)_{i,j} = \sigma^2 \left(\sum_{k=0}^{\infty} \frac{\theta^k \tau^{k+q-i}}{(k+q-i)!}\right)\left(\sum_{k=0}^{\infty} \frac{\theta^k \tau^{k+q-j}}{(k+q-j)!}\right)$$

Then

$$\begin{aligned}
Q_{i,j}(h) &= \sigma^2 \int_0^h \left(\sum_{k=0}^{\infty} \frac{\theta^k \tau^{k+q-i}}{(k+q-i)!}\right)\left(\sum_{k=0}^{\infty} \frac{\theta^k \tau^{k+q-j}}{(k+q-j)!}\right) d\tau \\
&= \sigma^2 \int_0^h \left(\frac{1}{\theta^{q-i}}\sum_{k=0}^{\infty} \frac{(\theta\tau)^{k+q-i}}{(k+q-i)!}\right)\left(\frac{1}{\theta^{q-j}}\sum_{k=0}^{\infty} \frac{(\theta\tau)^{k+q-j}}{(k+q-j)!}\right) d\tau \\
&= \sigma^2 \int_0^h \left(\frac{e^{2\theta\tau}}{\theta^{2q-i-j}} - \frac{e^{\theta\tau}}{\theta^{2q-i-j}}\sum_{k=0}^{q-i-1}\frac{(\theta\tau)^k}{k!} - \frac{e^{\theta\tau}}{\theta^{2q-i-j}}\sum_{k=0}^{q-j-1}\frac{(\theta\tau)^k}{k!} + \frac{1}{\theta^{2q-i-j}}\sum_{k=0}^{q-i-1}\frac{(\theta\tau)^k}{k!}\sum_{k=0}^{q-j-1}\frac{(\theta\tau)^k}{k!}\right) d\tau \\
&= \frac{\sigma^2}{\theta^{2q-i-j}} \left\{ \frac{e^{2\theta h}-1}{2\theta} - \sum_{k=0}^{q-i-1}\frac{\theta^k}{k!}\underbrace{\int_0^h e^{\theta\tau}\tau^k\,d\tau}_{I_k(h)} - \sum_{k=0}^{q-j-1}\frac{\theta^k}{k!}\underbrace{\int_0^h e^{\theta\tau}\tau^k\,d\tau}_{I_k(h)} + \sum_{k_1=0}^{q-i-1}\sum_{k_2=0}^{q-j-1}\frac{\theta^{k_1+k_2}}{k_1!\,k_2!}\int_0^h \tau^{k_1+k_2}\,d\tau \right\}
\end{aligned}$$

Writing the recursion obtained by integration by parts,

$$I_k(h) = \frac{e^{\theta h} h^k}{\theta} - \frac{k}{\theta}\, I_{k-1}(h) \qquad \text{with} \qquad I_0(h) = \int_0^h e^{\theta\tau}\,d\tau = \frac{1}{\theta}\left(e^{\theta h} - 1\right),$$

we get

 (21)

### 7.2 IOUP local convergence rate for the posterior mean

If we employ a $q$-times integrated OUP prior over the solution, the prediction step for the mean with initial value $m$ is $m^-(h) = A(h)\,m$, where

$$A(h)_{i,j} = \left(e^{hF}\right)_{i,j} = \begin{cases} \mathbb{1}_{j \ge i}\,\dfrac{h^{j-i}}{(j-i)!}, & \text{if } j \neq q \\[2ex] \dfrac{1}{\theta^{q-i}}\left(e^{\theta h} - \displaystyle\sum_{k=0}^{q-i-1} \frac{(\theta h)^k}{k!}\right), & \text{if } j = q \end{cases} \tag{22}$$

### 7.3 Variances of IOUP and IWP

Let $X$ denote an Ornstein-Uhlenbeck process started at $X_0 = 0$ with mean $\mu = 0$, and let $W$ denote a standard Wiener process. In this section we write the drift as $-\theta X_t$ with $\theta > 0$ for simplicity in the calculations. We want to show:

 Var(∫T0Xsds)a)≤Var(∫T0Wsds)b)

We calculate a) and b):

• $$\begin{aligned}
\mathrm{Var}\left(\int_0^T X_s\,ds\right) &= \mathbb{E}\int_0^T\!\!\int_0^T X_s X_t\,ds\,dt = \int_0^T\!\!\int_0^T \mathbb{E}[X_s X_t]\,ds\,dt \overset{\mathbb{E}[X_t] = \mathbb{E}[X_s] = 0}{=} \int_0^T\!\!\int_0^T \mathrm{Cov}(X_s, X_t)\,ds\,dt \\
&= \frac{\sigma^2}{2\theta}\int_0^T\!\!\int_0^T \left(e^{-\theta|t-s|} - e^{-\theta(t+s)}\right)ds\,dt = \frac{\sigma^2}{2\theta}\Bigg(\underbrace{\int_0^T\!\!\int_0^T e^{-\theta|t-s|}\,ds\,dt}_{1)} - \underbrace{\int_0^T\!\!\int_0^T e^{-\theta(t+s)}\,ds\,dt}_{2)}\Bigg)
\end{aligned}$$

• $$\int_0^T\!\!\int_0^T e^{-\theta|t-s|}\,ds\,dt = \int_0^T\left(\int_0^s e^{-\theta(s-t)}\,dt + \int_s^T e^{-\theta(t-s)}\,dt\right)ds = \int_0^T\left(\frac{1-e^{-\theta s}}{\theta} + \frac{1-e^{-\theta(T-s)}}{\theta}\right)ds = \frac{2T}{\theta} + \frac{e^{-\theta T}-1}{\theta^2} - \frac{1-e^{-\theta T}}{\theta^2} = \frac{2}{\theta}\left(T + \frac{e^{-\theta T}-1}{\theta}\right)$$

• $$\int_0^T\!\!\int_0^T e^{-\theta(t+s)}\,ds\,dt = \int_0^T \frac{1}{\theta}\left(e^{-\theta s} - e^{-\theta(T+s)}\right)ds = \frac{1}{\theta}\left(\frac{1-e^{-\theta T}}{\theta} - \frac{e^{-\theta T}-e^{-2\theta T}}{\theta}\right) = \frac{1}{\theta^2}\left(1-e^{-\theta T}\right)^2$$

So

$$\mathrm{Var}\left(\int_0^T X_s\,ds\right) = \frac{\sigma^2}{2\theta}\left(\frac{2}{\theta}\left(T + \frac{e^{-\theta T}-1}{\theta}\right) - \frac{1}{\theta^2}\left(1-e^{-\theta T}\right)^2\right) = \frac{\sigma^2}{2\theta^3}\left(2\theta T + 2\left(e^{-\theta T}-1\right) - \left(1-e^{-\theta T}\right)^2\right)$$
• $$\mathrm{Var}\left(\int_0^T W_s\,ds\right) = \int_0^T\!\!\int_0^T \mathrm{Cov}(W_s, W_t)\,ds\,dt = \int_0^T\!\!\int_0^T \min(s,t)\,ds\,dt = \int_0^T\left(\int_0^s t\,dt + \int_s^T s\,dt\right)ds = \int_0^T\left(\frac{s^2}{2} + s(T-s)\right)ds = \frac{T^3}{3}$$

So we can see that

$$\frac{\sigma^2}{2\theta^3}\left(2\theta T + 2\left(e^{-\theta T}-1\right) - \left(1-e^{-\theta T}\right)^2\right) \le \frac{T^3}{3} \qquad \forall\, T > 0.$$
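As a sanity check, the two closed-form variances derived above can be compared numerically over a range of horizons; a small sketch with our own helper names (for $\sigma = 1$, matching the right-hand side):

```python
import numpy as np

def var_int_oup(theta, sigma2, T):
    """Closed-form Var(int_0^T X_s ds) for an OUP started at 0 (Section 7.3),
    with mean-reversion rate theta > 0."""
    e = np.exp(-theta * T)
    return sigma2 / (2 * theta ** 3) * (2 * theta * T + 2 * (e - 1) - (1 - e) ** 2)

def var_int_wiener(T):
    """Var(int_0^T W_s ds) = T^3 / 3 for a standard Wiener process."""
    return T ** 3 / 3
```

Expanding the OUP expression for small $\theta T$ gives $T^3/3 - \theta T^4/4 + O(T^5)$, so the inequality holds with equality in the limit $\theta T \to 0$ and the gap grows with the mean-reversion rate.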