 # A Direct Method to Learn States and Parameters of Ordinary Differential Equations

Though ordinary differential equations (ODE) are used extensively in science and engineering, the task of learning ODE parameters from noisy observations still presents challenges. To address these challenges, we propose a direct method that involves alternating minimization of an objective function over the filtered states and parameters. This objective function directly measures how well the filtered states and parameters satisfy the ODE, in contrast to many existing methods that use separate objectives over the observations, filtered states, and parameters. As we show on several ODE systems, as compared to state-of-the-art methods, the direct method exhibits increased robustness (to noise, parameter initialization, and hyperparameters), decreased training times, and improved accuracy in estimating both filtered states and parameters. The direct method involves only one hyperparameter that plays the role of an inverse step size. We show how the direct method can be used with general multistep numerical discretizations, and demonstrate its performance on systems with up to d=40 dimensions. The code of our algorithms can be found in the authors' web pages.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Ordinary differential equations (ODEs) are widely used to describe the behavior of time-varying systems in various fields of science and engineering. When researchers derive ODEs that govern the behavior of a system, the parameter values for these ODEs are often unknown. An important task is to estimate ODE parameters given noisy experimental measurements.

ODE parameter estimation is challenging for two main reasons: (1) the data is noisy, and (2) most ODEs do not have an analytical solution. Without these two challenges, the parameters could be estimated by solving a simple regression problem.

Two popular methods to solve the ODE parameter estimation problem are Bayesian and spline-based. Spline-based methods try to fit (cubic) splines to the noisy data and then estimate the parameters Varah (1982); Poyton et al. (2006); Ramsay et al. (2007); Cao et al. (2011); Cao and Zhao (2008). Estimators other than splines, such as smoothing kernels and local polynomials, have also been tried Liang and Wu (2008); Dattner and Klaassen (2015); Gugushvili and Klaassen (2012). These methods have many hyperparameters such as the smoothing parameter, the number of knots, the positions of the knots, etc. These methods are sensitive to the initialization of the parameters. Bayesian approaches Dondelinger et al. (2013); Calderhead et al. (2009); Gorbach et al. (2017)

make assumptions about the type of the noise and distribution of the data, which are not necessarily correct. Also, these methods need to set the prior distributions, variance of the noise, kernel width, etc. very carefully to produce reasonable results. Another disadvantage of these methods is their large training time.

In this paper, we propose an entirely different approach to address the issues of previous methods. We apply alternating minimization to an objective function that directly measures how well the states and parameters satisfy the ODE system. In each step of our learning algorithm, the states and parameters move slowly away from their initializations, stepping toward the desired solution. Our method is fast to train, easy to implement, robust with respect to parameter initializations, and works well under different magnitudes and types of noise.

We call our method a direct method because: (1) it learns the states directly in the original space, instead of learning them indirectly by fitting a smoothed function to the observations, and (2) it learns the states and parameters jointly using one, unified objective function, which gives better results than learning them in separate objective functions.

The rest of the paper is organized as follows. In Section 2 we define the problem mathematically. In Section 3, we explain our proposed method in detail. We show the advantages of our method with several experiments in Section 4.

### 1.1 Related Work

There have been several different approaches to estimate ODE parameters. Nonlinear least squares methods start with an initial guess for the parameters that is iteratively updated to bring the model’s predictions close to measurements Himmelblau et al. (1967); Bard (1973); Benson (1979); Hosten (1979). These methods are slow and have convergence issues when the initial parameters are far from the true parameters. The main advantage of our method over these methods is that we learn states and parameters simultaneously, which makes our method robust to initialization.

In Varah (1982) it was first suggested to fit splines to the noisy observations, and then consider these splines as the clean states. Since the derivatives of the splines can be found easily, the parameters are estimated by solving a regression problem. The idea of fitting splines and other smooth functions to the observations has been used extensively in the literature Poyton et al. (2006); Ramsay et al. (2007); Cao et al. (2011); Cao and Zhao (2008); Liang and Wu (2008); Dattner and Klaassen (2015); Gugushvili and Klaassen (2012).

The idea of learning the splines and parameters jointly, leading to better results, has been explored (Poyton et al., 2006; Ramsay et al., 2007). Spline-based methods have many hyperparameters that require careful tuning, and are also sensitive to initialization. Our method learns parameters and states jointly, but it does not fit a smooth function to the observations. Our direct learning strategy decreases the number of hyperparameters and makes the results robust to initialization.

Bayesian approaches have also been very popular recently (Girolami, 2008; Dondelinger et al., 2013; Calderhead et al., 2009; Gorbach et al., 2017). Such approaches typically have large training times and depend sensitively on a large number of hyperparameters (priors, noise variances, etc.). Another disadvantage of these methods, as has been mentioned in Gorbach et al. (2017), is that they cannot simultaneously learn clean states and parameters. In Gorbach et al. (2017), a variational inference approach has been used to overcome this problem, but the method is not applicable to all ODEs. Often, Bayesian methods make multiple assumptions about the distribution of the data and noise.

Finally, let us mention that the problem we solve here is that of simultaneous filtering (recovering clean states from noisy observations) and parameter estimation. Many well-known nonlinear ODE filtering methods, including extended and ensemble Kalman filters as well as particle filters, are online methods that make Gaussian assumptions. Our approach is a distribution-free, batch method.

## 2 Problem Definition: ODE Parameter Estimation

Consider a dynamical system in with state at time . We assume the system depends on a parameter , in which case the time-evolution of the state is given by

 ˙x(t)=dx(t)dt=f(x(t),θ). (1)

At distinct times , we have noisy observations :

 y(ti)=x(ti)+z(ti),i=1,…,T (2)

where is the noise of the observation at time . We represent the set of -dimensional states, noises, and observations by , and , respectively. For concision, in what follows, we write the time as a subscript, i.e., instead of .

In this paper, we assume that the form of the vector field

is known. The goal is to use to estimate and (or equivalently ). Examples of and can be found in Section 4.

## 3 Our Direct Approach: Learning ODE Parameters and States

The first step of our approach is to discretize the ODE (1) in time using multistep methods. In this section, we focus on the explicit Euler method (which is a one-step method) because it is intuitive, makes our formulation simple, and helps the reader to focus purely on our novel approach. Later, we explain how higher-order multistep methods can be used in our formulation. We also compare multistep methods with different orders in our experimental results section.

The explicit Euler method discretizes the ODE (1) for the time points as follows:

 x(ti+1)−x(ti)=f(x(ti),θ)Δi,i=1,...,T−1 (3)

where . In (3), both states and parameters are unknown; we are given only the noisy observations . With this discretization, we formulate our objective function:

 E(X,θ)=∑T−1i=1∥∥x(ti+1)−x(ti)−f(x(ti),θ)Δi∥∥2 (4)

Before continuing, it is worth analyzing our objective function . Let us define fidelity as the degree to which the estimated states and parameters actually satisfy the ODE. Our objective function directly measures time-discretized fidelity; if , then we have a solution to the time-discretized ODE. Note that the observations do not appear explicitly—we use to initialize Alg. (1).

In contrast, most prior work maximizes the likelihood function that stems from assuming the noise in (2) is Gaussian with mean zero and covariance matrix . In such approaches, fidelity (as defined above) is treated as a secondary term, e.g., using a penalty term (Ramsay et al., 2007) or using a probabilistic model that accounts for temporal discretization errors (Arnold et al., 2014, 2013). In the present work, we seek to show that making fidelity the primary objective yields better estimates of and .

###### Theorem 1.

The objective function in (4) has an infinite number of optimal solutions, each of which makes the objective value .

###### Proof.

Assign arbitrary real vectors to and . Then we use the following equation sequentially with :

 x(ti+1)=x(ti)+f(x(ti),θ)Δi. (5)

By computing the states in this way, we ensure that each term in the objective function —see (4)—is zero. Since the objective function in (4) is always greater than or equal to zero, we achieve a global minimum. Because and are arbitrary, an infinite number of solutions exist. ∎

It is easy to see that Theorem 1 generalizes to the case where we replace the Euler method—in the definition of —by a higher-order, explicit ODE solver.

Additionally, suppose we fix and consider to be a function of alone. Then, if we also fix the value of , we see that there is a unique global minimizer. We return to this point below when we discuss minimizing over .

Our goal is to use the observations to find the clean (or filtered) states and the parameters . To achieve this goal, we will not directly compute a global minimizer of (4), for reasons that we now explain. In what follows, we make a distinction between learned (asterisks) and predicted (hats) states and parameters.

We represent the learned states and parameters by and , where . We propose an iterative approach that optimizes and alternately for iterations. We represent the states and parameters learned at iteration by and .

We also represent the predicted states at iteration by . The predicted states are achieved by setting and , and then stepping forward in time using an ODE solver—e.g., Euler’s method in (5) or a multistep method as in (11)—to generate the states at . By Theorem 1, this predicted state/parameter pair, , is a global minimizer of (4).

We initialize the states with the observations: . Then we update and alternately for iterations. At each iteration , our method has two steps: 1) fix and optimize (4) over , and 2) fix and optimize a modified version of (4) over .

Before we detail the two steps, let us explain why and how we modify the objective function in the second step (optimization over ). Assume we are in the first iteration , and that we have learned in the first step. In the second step, if we try to optimize (4) (as it is) over the states, then the predicted states give us the optimal solution. In other words, is the optimal solution. Since this makes the objective , our algorithm should stop. To see why this is undesirable, note that we initialized the states with the observations. Our hope is that the clean states are somewhat close to the observations. The predicted states are in general far away from the observations. By setting , we lose all the information that was given to us by the observations. For this reason, we modify the objective function in (4) such that the changes in states are small from one iteration to another. We want to be close to and . Figure 1: ODE parameter estimation of the ODE in (6). At each iteration, the predicted states ^X and the learned states X∗ move closer to each other and the clean states.

Let us clarify with a simple -dimensional example. Consider the following ODE:

 ˙x(t)=dx(t)dt=9.8−θ0x(t). (6)

This ODE is achieved by applying Newton’s second law of motion to a falling object of mass , where shows the velocity of the object at time . Here, is the parameter which is unknown. Fig. 1 shows how our algorithm works. To generate the clean states (shown with green circles), we solve (6) over the time interval with parameters and . Our algorithm does not have access to these clean states. The observations (shown with magenta circles) are achieved by adding Gaussian noise to the clean states in the initialization step. Now examine the plots for iteration to in Fig. 1, where we show learned and predicted states after the initialization. The predicted states (blue circles) are far from the clean states. The learned states (red circles) are not on top of the predicted states, because we forced them to be close to their previous values (observations). At each iteration, the first step (learning ), pushes the predicted states closer to the learned states . The second step pushes close to . As we can see, after about iterations, both and are very close to the clean states. The pseudocode of our method can be found in Algorithm 1. In the following, we explain the two steps in more detail.

### 3.1 Optimization Over the ODE Parameter θ

At iteration , we fix to equal the learned states from the previous iteration: , and then minimize the objective function (4) over only:

 θ∗(n)=argminθE(X∗(n−1),θ)=argminθ∑T−1i=1∥∥x∗(n−1)(ti+1)−x∗(n−1)(ti)−f(x∗(n−1)(ti),θ)Δi∥∥2 (7)

This is a -dimensional regression problem where the th input and output are and , respectively. We optimize this objective using the LBFGS algorithm, implemented in Python via the scipy.optimize.minimize function. Throughout this work, when using LBFGS, we use automatic differentiation to supply the optimizer with gradients of the objective function.

### 3.2 Optimization Over the States X

At iteration , we fix to equal the parameters learned in the previous step: . By Theorem 1, directly minimizing will yield a global minimizer such that , terminating the optimization procedure. To avoid this fate, and to learn better states and parameters, we modify the objective function in (2):

 X∗(n)=argminX{E(X,θ∗(n))+λ∥∥X−X∗(n−1)∥∥2}, (8)

where has been defined in (4). For , the penalty term encourages to remain close to . As we increase , we tighten this closeness. To optimize this objective function, we again use the LBFGS algorithm, as in the first step.

We can also motivate (8) using the notion of proximal operators (Parikh and Boyd, 2014). Let us fix . When the are all identically zero, the objective function reduces to a quadratic form:

 T−1∑i=1∥x(ti+1)−x(ti)∥2.

If we now fix , the Hessian with respect to the remaining variables

has strictly positive eigenvalues. The eigenvalues of the Hessian of

are continuous functions of the parameters ; hence there exists such that if , the Hessian remains positive definite. The upshot is that, under these conditions, is a strictly convex function of . This is why it has a unique global minimizer , as mentioned above.

Additionally, because is convex, we can form the proximal operator

 proxμ˜E(X∗(n−1))=argminX2:{˜E(X2:)+(2μ)−1∥X2:−X∗(n−1)2:∥2} (9)

We see that when , the proximal step returns . For sufficiently small, as shown in (Parikh and Boyd, 2014), this proximal step is approximately equal to a gradient descent step:

 proxμ˜E(X∗(n−1))=X∗(n−1)2:−μ∇˜E(X∗(n−1)2:)+o(μ) (10)

One can see that our in (8) plays the role of in (9). Hence large values of correspond to small gradient descent step sizes, and vice versa.

Though in (8) is non-convex, we can view it as a set-valued proximal operator (Li et al., 2017); still, the analogy with convex proximal operators gives valuable intuition. When , we see that (8) ignores and directly outputs with arbitrary. As increases, (8) approximates a small step that starts from and proceeds in the direction of the minimizer .

### 3.3 Higher-Order Discretization

As we mentioned before, our approach is not limited to the Euler discretization. Here, we show that it is straightforward to use higher-order discretization methods.

The idea of multistep (-step) methods is to use the previous states to predict the next state. Let us consider the general formulation of the explicit linear -step method to discretize the ODE (1):

 x(ti+1)=m−1∑j=0ajx(ti−j)+Δim−1∑j=0bjf(x(i−j),θ), (11)

where is the time step. There are several strategies to determine the coefficients and . Each strategy leads to a specific family of multistep methods. For example, the Adams-Bashforth method approximates with a polynomial of order to find the coefficients and predict the next state. Further information regarding different strategies can be found in (Iserles, 2009; Palais and Palais, 2009).

Note that to use -step methods to predict the state at time , we need its previous states. To predict the states (the first few states), the maximum order we can use is , because there are only states before the state . In general, to predict we use a multistep method of order .

When using a general -step discretization method, we define our objective function as follows:

 Em-step(X,θ)=∑T−1i=1∥∥x(ti+1)−∑k−1j=0ajx(ti−j)−Δi∑k−1j=0bjf(x(i−j),θ)∥∥2, (12)

where is the order of the discretization method to predict the state . Now, we repeat the following steps until convergence:

 θ∗(n) =argminθEm-step(X∗(n−1),θ) (13a) X∗(n) =argminX{Em-step(X,θ(∗n))+λ∥X−X∗(n−1)∥2} (13b)

For both steps, we use LBFGS, initialized with and , respectively.

#### Convergence.

There are two notions of convergence that we will discuss briefly here: practical and theoretical.

In practice, we stop Algorithm 1 when the error changes less than from one iteration to the next. To see when this happens, we take another look at the optimization over the states in (8) at iteration . This objective function has two terms. The optimal solution of the left term is the predicted states . The optimal solution of the right term is . When we optimize this objective function to find , there are three cases: 1) , 2) , and 3) is neither nor . Our algorithm stops when we are in case 1 or 2 since further optimization over and does not change anything. In case 3, the algorithm continues, leading to further optimization steps to decrease error.

From a theoretical standpoint, our treatment will be elementary and serve to motivate future work. Let us first note that (7) and (8) imply, in turn:

 E(X∗(n−1),θ∗(n))≤E(X∗(n−1),θ∗(n−1))

Putting these together, we obtain

 E(X∗(n),θ∗(n))≤E(X∗(n−1),θ∗(n−1)). (14)

The function , bounded below by , is non-increasing along the trajectory . Hence the sequence must converge to some limit . Note, however, that is not a metric on . Without further assumptions, there is not much more we can say.

Let us assume that the sequence itself has a limit in . Let ; let be the same function with replaced by . Then we can reframe (7) and (8) as

 θ∗(n) =argminθFn(X∗(n−1),θ) X∗(n) =argminXFn(X,θ∗(n))

Taking the limit as , we obtain

 θ∗ (15a) X∗ =argminXF∗(X,θ∗)=argminX{E(X,θ∗)+λ∥X−X∗∥2}. (15b)

From these equations, we can derive

 ∇θE(X∗,θ∗) =0 ∇XE(X∗,θ∗) =0

Let denote the Hessian of with respect to . Similarly, let denote the Hessian of with respect to . By (15a) and (15b), both and are positive semidefinite at . Note that

 F∗XX=EXX+2λI

at any point of the form , including the limit point . For sufficiently large, we see that will be positive definite, implying strict convexity of in a neighborhood of .

Using the above framework, together with ideas from the alternating minimization literature—e.g., (Grippo and Sciandrone, 2000; Byrne, 2013; Byrne and Lee, 2015; Ha and Barber, 2017)—we believe we will be able to establish conditions for convergence of our algorithm’s iterates, , to a local minimizer of .

Our method is robust with respect to its only hyperparameter . We will show in our experimental results later that for a broad range of values, our method works well. We fix it to in our later experiments. As explained before, previous methods have a large number of hyperparameters, which are difficult to set.

Our method can be trained quickly. On a standard laptop, it takes around seconds for our method to learn the parameters and states jointly on ODE problems with states. The spline-based methods take a few minutes and Bayesian methods take a few hours to converge on the same problem.

Because our method, unlike Bayesian methods, does not make assumptions about the type of the noise or distribution of the states, it performs well under different noise and state distributions. In particular, as the magnitude of noise in the observations increases, our method clearly outperforms the extended Kalman filter.

As our experiments confirm, both spline-based and Bayesian approaches are very sensitive to the initialization of the ODE parameters. If we initialize them far away from the true values, they do not converge. Our method is much more robust. This robustness stems from simultaneously learning states and parameters. Even if the estimated parameters are far from the true parameters at some iteration, they can improve later, as the estimated states converge to the clean states.

## 4 Experiments

In this section, we first introduce three benchmark datasets. Then we show how our method works under different kinds of noise and initializations on these datasets. We finally show that our method performs well compared to other state-of-the-art methods. In the following, we create the clean states using a Runge-Kutta method of order . In all of our experiments, unless otherwise stated, we use the three-step Adams-Bashforth method to discretize the ODE.

#### Lotka–Volterra model.

This model is used to study the interaction between predator (variable ) and prey (variable ) in biology Lotka (1932). The model contains two nonlinear equations as follows:

 dx0dt=θ0x0−θ1x0x1dx1dt=θ2x0x1−θ3x1. (16)

The state is two-dimensional and there are four unknown parameters. We use the same settings as in Dondelinger et al. (2013). We set the parameters to , , and . With initial condition , we generate clean states in the time range of with a spacing of .

#### FitzHugh–Nagumo model.

This model describes spike generation in squid giant axons FitzHugh (1961); Nagumo et al. (1962). It also has two nonlinear equations:

 dx0dt=θ2(x0−(x0)33+x1)dx1dt=−1θ2(x0−θ0+θ1x1), (17)

where is the voltage across an axon and is the outward current. In this model, the states are two-dimensional and there are three unknown parameters. We use the same settings as in Ramsay et al. (2007). We set the parameters as and . With initial condition , we generate clean states in the time range of with a spacing of .

#### Rössler attractor.

This three-dimensional nonlinear system has a chaotic attractor Rössler (1976):

 dx0dt=−x1−x2dx1dt=x0+θ0x1dx2dt=θ1+x2(x0−θ2). (18)

The states are three-dimensional and there are three unknown parameters. We use the same settings as in Ramsay et al. (2007). We set the parameters as and . With the initial condition , we generate clean states in the time range of , with a spacing of .

#### Lorenz-96 model.

The goal of this model is to study weather predictability Lorenz and Emanuel (1998). The th differential equation has the following form:

 dxkdt=(xk+1−xk−2)(xk−1)−xk+θ0,k=0,…,d−1 (19)

The model has one parameter and states, where can be set by the user. This gives us the opportunity to test our method on larger ODEs. Note that to make (19) meaningful, we have , , and . As suggested in Lorenz and Emanuel (1998), we set and . The clean states are generated in the time range with a spacing of

. The initial state is generated randomly from a Gaussian distribution with mean

and variance . Figure 2: Prediction error at different iterations of our algorithm. Noisy observations are achieved by adding Gaussian noise with variance σ2 to the clean observations. We consider the FitzHugh–Nagumo (first row), Rössler (second row), and Lorenz-96 (third row) models. Our learning strategy decreases the error in all cases. Figure 3: Robustness to the hyperparameter λ in FitzHugh–Nagumo (first row) and Lotka–Volterra (second row) models. The true parameters are θ0=.5,θ1=.3, and θ2=3 in the FitzHugh–Nagumo and θ0=2,θ1=1,θ2=4, and θ3=1 in the Lotka–Volterra. For each λ, we report the mean error and parameter value in 10 experiments.

#### Evaluation metrics.

Let and denote the true parameters and the clean states, respectively. Let and denote the estimated parameters and the predicted states. We report the Frobenius norm of as the prediction error. We also consider as the th parameter error. Recall that predicted states are achieved by considering as the parameter and as the initial state, and then repeatedly applying (5) or its multistep analogue (11).

#### Optimization of our objective function leads to a better estimation.

We first focus only on our objective function in (4). At each iteration of our optimization, we compute the predicted states and report the prediction error. Fig. 2 shows the results.

In Fig. 2, we consider two kinds of discretization: 1) one-step Euler method, and 2) three-step Adams-Bashforth method. Note that as we increase the order, we expect to see more accurate results.

We added Gaussian noise with mean and variance to each of the clean states, where in the first column and in the second column. Fig. 2 shows that at the first iteration the error is significant in all models. The error is for FitzHugh–Nagumo and Rössler, and for Lorenz-96 model.

After several iterations of our algorithm, the error decreases significantly, no matter what kind of discretization we use. But, as expected, three-step Adams-Bashforth performs better than Euler in general: it converges faster and achieves a smaller error at the end. This is more clear for the Lorenz-96 model, where the error becomes almost zero when we use the three-step Adams-Bashforth, while the error becomes around for Euler.

The last point about Fig. 2 is that, as expected, sometimes the prediction error increases; the error does not decrease monotonically. This mainly happens at the first few iterations. The main reason for this behavior is that our objective function in (4) is different from the prediction error. We cannot directly optimize the prediction error because we do not have access to the clean states. Still, the fact that our algorithm eventually brings the prediction error close to zero suggests that minimizing the objective in (4) has the same effect as minimizing the prediction error.

#### Robustness to the hyperparameter λ.

The only hyperparameter in our algorithm is . In Fig. 3, we set in turn to a set of values from to , run our algorithm, and report the results after convergence. In both models, we generate observations by adding Gaussian noise with variance to the clean states. Because of randomness included in creating noisy observations, we create sets of observations, run our algorithm once for each of them, and report the mean in Fig. 3

. We also show the standard deviation in prediction errors, but not in parameter values (to avoid clutter).

In Fig. 3 we report the prediction error and the estimated parameters for each value of . The true values for the FitzHugh–Nagumo are and . For the Lotka–Volterra model, the true values are , and .

We see in Fig. 3 that for , our method correctly finds the parameters and brings the error close to zero. Also, in the range of to , the errors and the estimated parameters remain almost the same. We actually increased to and found that it works like the previous values of . The only disadvantage of increasing to a large value is that training time increases—as explained above, increasing is analogous to decreasing the step size in a gradient descent method. Large implies that states can change very little from one iteration to another, forcing the algorithm to run longer for convergence. As explained in detail above, when , the algorithm stops after a single iteration, with the predicted states far from the clean states. Figure 4: We change the amount and type of noise in the observations, and report the prediction and parameter errors on the FitzHugh–Nagumo (first column) and Rössler (second column) models.

#### Different types and amounts of noise in the observations.

Our method does not assume anything about the type of noise. In reality, the noise could be from any distribution. In Fig. 4, we investigate the effect of the type of noise on the outcome of our algorithm. The red (blue) curves correspond to the case when we add Gaussian (Laplacian) noise to the observations. We set the mean to , change the variance of the noise, and report the prediction and parameter errors. Note that for each noise variance, we repeat the experiment times and report the mean and standard deviation of the error.

In general, increasing the noise variance increases the error. We can see this in almost all plots. In both models, the error does not change much by changing the variance from to . We can also see that the method performs almost as well for observations corrupted by Laplacian noise as in the Gaussian noise case. Note that the Laplacian noise has a heavier-than-Gaussian tail. Figure 5: Comparison with other methods. We create initializations by adding Gaussian noise of variance σ2θ to the true parameters. We create 10 sets of observations and initializations per each σ2θ and report the errors. Each error bar corresponds to the error in one of the experiments. Figure 6: We compare our method with the mean-field method Gorbach et al. (2017) on Lotka–Volterra model. We add Gaussian noise with variance σ2 to the clean states to create noisy observations. For each value of σ2, we generate 10 sets of observations and report the prediction and parameter errors. Each error bar corresponds to the error in one of the experiments. We have put the average error of each method for the 10 experiments below each plot.

#### Comparison with other methods (robustness to the initialization).

As the first experiment, we compare our method with three other methods, each of them from a different category. Among the spline-based methods, we use the online MATLAB code corresponding to Ramsay et al. (2007), denoted by “splines” in our experiments. Among the Bayesian approaches, we use the online R code corresponding to Dondelinger et al. (2013), denoted “Bayes” in our experiments. We also implement a method that uses the iterative least square approach, denoted “lsq” in our experiments. This method considers the parameters and the initial state as the unknown variables. To implement lsq, we use the Python LMFIT package Newville et al. (2014). We use the FitzHugh–Nagumo and Rössler models, creating noisy observations by adding Gaussian noise with mean and variance to the clean states.

All methods including ours need an initial guess for the unknown parameters. We add Gaussian noise with mean and variance to the true parameter and use the result to initialize the methods. Fig. 5 shows the results, where we change the variance from to . Since there is randomness in both initialization and observation, we repeat the experiment times. Note that the comparisons are fair, with the same observations and initializations used across all methods.

In Fig. 5, each of the bars corresponds to the prediction or parameter error for one of the methods in one of the experiments. Hence there are error bars for each of the methods in each plot. We set in our method for all the experiments. For the other methods, we chose the best hyperparameters we could determine after careful experimentation.

The first point regarding Fig. 5 is that our method is robust with respect to the initialization, while the other methods are not. The total number of experiments per method is (on the two models). The prediction error of our method exceeds in experiments. The prediction error of splines (the second best method after ours) exceeds in experiments (nearly half the experiments). For the other methods, the error only increases.

We can see in Fig. 5 that almost all the methods work well when the initialization is close to the true parameters (small noise). But, in reality, we do not know what the real parameters are. So it is reasonable to say that the last column of Fig. 5 (initialization with the largest noise) determines which method performs better in real-world applications. As we can see, our method outperforms the other methods in both prediction and parameter error.

In our second experiment, we compare our method with the mean-field method (also a Bayesian method) Gorbach et al. (2017) on the Lotka–Volterra model. The mean-field method is only applicable to the differential equations with a specific form (see Eq.  in Gorbach et al. (2017)). While we cannot apply it to FitzHugh–Nagumo and Rössler models, we can apply it to the Lotka–Volterra model. In Fig. 6, we compare the methods by reporting the prediction and the parameter errors. We create noisy observations by starting with the clean states and adding Gaussian noise with variance . We show the results for different variances at different columns of Fig. 6. Similar to our previous experiments, we generate sets of noisy observations. Each of the bars in Fig. 6 corresponds to the error for one of the methods in one of the experiments.

Fig. 6 shows that the average error of our method is less than the mean-field method in almost all cases Gorbach et al. (2017). As we increase the noise in the observations, the error of both methods increases. But, we can also see that our method is more robust than the mean-field method when it comes to observation noise. Consider the case where (second column). In this case, the average parameter error of the mean-field method for and becomes around and , respectively, but the average error of our method for both parameters remains less than . Figure 7: We compare our method with the extended Kalman filter (EKF) on the Lotka–Volterra model. We add Gaussian noise with variance σ2=0.1 and σ2=1.5 to the clean states to create noisy observations. We set the number of observation to T=20 (left panel) and T=10000 (right panel). For each value of T, we generate 10 sets of observations and report the average estimation and parameter errors. Each error bar corresponds to the error in one of the experiments. We have put the average error of each method for the 10 experiments below each plot.

#### Comparison with the extended Kalman filter (EKF).

We follow Sitz et al. (2002) in applying the Kalman filter to our problem of estimating the parameters and states. We first need to write an equation that recursively finds the state in terms of . As suggested in Sitz et al. (2002), this can be achieved by discretizing the ODE in (1) using the Euler discretization:

 x(ti+1)=x(ti)+f(x(ti),θ)Δi. (20)

Let us define as the parameter estimated at time by the Kalman filter. We define a joint state variable , which merges the states and the parameters as follows:

 ξ(ti)=(x(ti)θ(ti)),ξ(ti)∈Rd+p. (21)

The process model to predict the next state variable can be written as:

 ξ(ti+1)=(x(ti+1)θ(ti+1))=(x(ti)+f(x(ti),θ)Δiθ(ti)). (22)

We define the observation model as follows:

 y(ti)=Hξ(ti),H=(I0)d×(d+p), (23)

where is a matrix, is a identity matrix, and is a matrix where all elements are .

In most cases, the function is nonlinear, which makes the process model nonlinear. For this reason, we use the extended Kalman filter (EKF), which linearizes the model.

We compare our method with the EKF in Fig. 7 on the Lotka–Volterra model. We compare the methods in different settings by changing the amount of noise and the number of samples. To create noisy observations, we add Gaussian noise with the variances and to the clean states. We set the number of samples to (time range ) and (time range ).

We use an open-source Python code Labbe (2014) to implement the EKF. We set the state covariance (noise covariance) to a diagonal matrix with elements equal to (). We set the process covariance using the function Q_discrete_white_noise() provided in Labbe (2014), where the variance is set to . Note that these parameters must be carefully tuned to obtain reasonable results; changing the state or noise covariance yields significantly worse results.

In Fig. 7 we report the average estimation error instead of the prediction error. Estimation error is defined as the difference between the clean states and the estimated states . We report the estimation error because the prediction error of the EKF goes to infinity. To see why this happens, note that to obtain reasonable predictions we need good estimations of the parameters and the initial state. Since the EKF is an online method, it never updates the initial state. Given that the initial state is noisy, no matter how well parameters are estimated, the prediction error becomes very large. Our method updates the initial state, yielding small prediction error.

As we can see in Fig. 7, the only setting in which the EKF performs comparably to our method is the case of and . In other words, EKF works fine when we have long time series with low noise. In more realistic settings, our method significantly outperforms the EKF. A key difference between the two methods is that the EKF is an online method while ours is a batch method, iterating over the entire data set repeatedly. Consequently, our method updates parameters based on information in all the states, leading to more robust updates than is possible with the EKF, which updates parameters based on a single observation.

We can also see in Fig. 7 that the error of both methods becomes smaller as we increase the number of samples . This is expected to happen because increasing is equivalent to giving more information about the model to the methods. The average estimation error of our method becomes almost for large .

#### Animation to show how our method works.

We consider the FitzHugh–Nagumo model, with settings as explained at the beginning of this section, except that we consider the first seconds instead of . We add Gaussian noise with variance to the clean states to create the noisy observations. In Fig. 8, we show how our algorithm works in the first iterations. Acrobat Reader is required to play the animation. In this animation, denotes clean states (green circles), denotes estimated states, and denotes predicted states. Note that initially, is the same as the noisy observations. Fig. 8 shows the two dimensions separately. At the top of each figure, we show the estimated parameters at each iteration. Note that the true parameters are , and . As explained before, the estimated and predicted states move closer to each other at each iteration. This helps the estimated parameters converge to the true parameters.

## 5 Conclusion

ODE parameter estimation is an important problem, made difficult due to noisy data and lack of analytical ODE solutions. Previous approaches make several assumptions about the states and the noise, leading to issues regarding the number of hyperparameters and robustness to the noise. In this paper, we have shown that these assumptions are unnecessary. The key in learning the unknown parameters is to learn the clean states. Hence we proposed a direct method that learns states and parameters jointly. Our approach addresses issues of previous works, achieving fast training and robustness to noise, initialization, and hyperparameter tuning.

## References

• Arnold et al.  A. Arnold, D. Calvetti, and E. Somersalo. Linear multistep methods, particle filtering and sequential Monte Carlo. Inverse Problems, 29(8):085007, 25, 2013.
• Arnold et al.  A. Arnold, D. Calvetti, and E. Somersalo. Parameter estimation for stiff deterministic dynamical systems via ensemble Kalman filter. Inverse Problems, 30(10):105008, 30, 2014.
• Bard  Y. Bard. Nonlinear Parameter Estimation. Academic Press, 1973.
• Benson  M. Benson. Parameter fitting in dynamic models. Ecological Modelling, 6(2):97–115, 1979.
• Byrne  C. L. Byrne. Alternating minimization as sequential unconstrained minimization: A survey. J. Optimization Theory and Applications, 156(3):554–566, 2013.
• Byrne and Lee  C. L. Byrne and J. S. Lee. Alternating minimization, proximal minimization and optimization transfer are equivalent. arXiv preprint arXiv:1512.03034, 2015.
• Calderhead et al.  B. Calderhead, M. Girolami, and N. D. Lawrence.

Accelerating Bayesian inference over nonlinear differential equations with Gaussian processes.

In Advances in Neural Information Processing Systems, pages 217–224, 2009.
• Cao and Zhao  J. Cao and H. Zhao. Estimating dynamic models for gene regulation networks. Bioinformatics, 24(14):1619–1624, 2008.
• Cao et al.  J. Cao, L. Wang, and J. Xu. Robust estimation for ordinary differential equation models. Biometrics, 67(4):1305–1313, 2011.
• Dattner and Klaassen  I. Dattner and C. A. J. Klaassen. Optimal rate of direct estimators in systems of ordinary differential equations linear in functions of the parameters. Electronic Journal of Statistics, 9(2):1939–1973, 2015.
• Dondelinger et al.  F. Dondelinger, D. Husmeier, S. Rogers, and M. Filippone. ODE parameter inference using adaptive gradient matching with Gaussian processes. In Artificial Intelligence and Statistics, pages 216–228, 2013.
• FitzHugh  R. FitzHugh. Impulses and physiological states in theoretical models of nerve membrane. Biophysical journal, 1(6):445–466, 1961.
• Girolami  M. Girolami. Bayesian inference for differential equations. Theoretical Computer Science, 408(1):4–16, 2008.
• Gorbach et al.  N. S. Gorbach, S. Bauer, and J. M. Buhmann. Scalable variational inference for dynamical systems. In Advances in Neural Information Processing Systems, pages 4809–4818, 2017.
• Grippo and Sciandrone  L. Grippo and M. Sciandrone. On the convergence of the block nonlinear Gauss-Seidel method under convex constraints. Oper. Res. Lett., 26(3):127–136, 2000.
• Gugushvili and Klaassen  S. Gugushvili and C. A. Klaassen. -consistent parameter estimation for systems of ordinary differential equations: bypassing numerical integration via smoothing. Bernoulli, 18(3):1061–1098, 2012.
• Ha and Barber  W. Ha and R. F. Barber. Alternating minimization and alternating descent over nonconvex sets. arXiv preprint arXiv:1709.04451, 2017.
• Himmelblau et al.  D. M. Himmelblau, C. R. Jones, and K. B. Bischoff. Determination of rate constants for complex kinetics models. Industrial & Engineering Chemistry Fundamentals, 6(4):539–543, 1967.
• Hosten  L. Hosten. A comparative study of short cut procedures for parameter estimation in differential equations. Computers & Chemical Engineering, 3(1-4):117–126, 1979.
• Iserles  A. Iserles. A First Course in the Numerical Analysis of Differential Equations. Cambridge Texts in Applied Mathematics. Cambridge University Press, Cambridge, second edition, 2009.
• Labbe  R. Labbe. Kalman and Bayesian filters in Python, 2014.
• Li et al.  Q. Li, Y. Zhou, Y. Liang, and P. K. Varshney. Convergence analysis of proximal gradient with momentum for nonconvex optimization. In

Proceedings of the 34th International Conference on Machine Learning, ICML 2017, Sydney, NSW, Australia, 6-11 August 2017

, pages 2111–2119, 2017.
• Liang and Wu  H. Liang and H. Wu. Parameter estimation for differential equation models using a framework of measurement error in regression models. Journal of the American Statistical Association, 103(484):1570–1583, 2008.
• Lorenz and Emanuel  E. N. Lorenz and K. A. Emanuel. Optimal sites for supplementary weather observations: Simulation with a small model. Journal of the Atmospheric Sciences, 55(3):399–414, 1998.
• Lotka  A. J. Lotka. The growth of mixed populations: two species competing for a common food supply. Journal of the Washington Academy of Sciences, 22(16/17):461–469, 1932.
• Nagumo et al.  J. Nagumo, S. Arimoto, and S. Yoshizawa. An active pulse transmission line simulating nerve axon. Proceedings of the IRE, 50(10):2061–2070, 1962.
• Newville et al.  M. Newville, T. Stensitzki, D. B. Allen, and A. Ingargiola. LMFIT: Non-linear least-square minimization and curve-fitting for Python, 2014.
• Palais and Palais  R. S. Palais and R. A. Palais. Differential Equations, Mechanics, and Computation. Number 51 in Student Mathematical Library. American Math. Soc. [u.a.], Providence, RI [u.a.], 2009. Literaturverz. S. 307 - 309.
• Parikh and Boyd  N. Parikh and S. P. Boyd. Proximal Algorithms. Foundations and Trends in Optimization, 1(3):127–239, 2014.
• Poyton et al.  A. Poyton, M. Varziri, K. McAuley, P. McLellan, and J. Ramsay. Parameter estimation in continuous-time dynamic models using principal differential analysis. Computers & Chemical Engineering, 30(4):698–708, 2006.
• Ramsay et al.  J. O. Ramsay, G. Hooker, D. Campbell, and J. Cao. Parameter estimation for differential equations: a generalized smoothing approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 69(5):741–796, 2007.
• Rössler  O. E. Rössler. An equation for continuous chaos. Physics Letters A, 57(5):397–398, 1976.
• Sitz et al.  A. Sitz, U. Schwarz, J. Kurths, and H. U. Voss. Estimation of parameters and unobserved components for nonlinear systems from noisy time series. Phys. Rev. E, 66(1):016210, 2002.
• Varah  J. M. Varah. A spline least squares method for numerical parameter estimation in differential equations. SIAM Journal on Scientific and Statistical Computing, 3(1):28–46, 1982.