 # Model inference for Ordinary Differential Equations by parametric polynomial kernel regression

Model inference for dynamical systems aims to estimate the future behaviour of a system from observations. Purely model-free statistical methods, such as Artificial Neural Networks, tend to perform poorly for such tasks. They are therefore not well suited to many questions from applications, for example in Bayesian filtering and reliability estimation. This work introduces a parametric polynomial kernel method that can be used for inferring the future behaviour of Ordinary Differential Equation models, including chaotic dynamical systems, from observations. Using numerical integration techniques, parametric representations of Ordinary Differential Equations can be learnt using Backpropagation and Stochastic Gradient Descent. The polynomial technique presented here is based on a nonparametric method, kernel ridge regression. However, the time complexity of nonparametric kernel ridge regression scales cubically with the number of training data points. Our parametric polynomial method avoids this manifestation of the curse of dimensionality, which becomes particularly relevant when working with large time series data sets. Two numerical demonstrations are presented. First, a simple regression test case is used to illustrate the method and to compare the performance with standard Artificial Neural Network techniques. Second, a more substantial test case is the inference of a chaotic spatio-temporal dynamical system, the Lorenz--Emanuel system, from observations. Our method was able to successfully track the future behaviour of the system over time periods much larger than the training data sampling rate. Finally, some limitations of the method are presented, as well as proposed directions for future work to mitigate these limitations.


## 1 Introduction

Dynamical systems play a crucial role in mathematical modelling across all areas of physics, engineering and applied mathematics. The equations used in a particular application domain are typically derived either phenomenologically or from first principles such as the conservation of energy, mass or momentum (as in mechanics). The structure of the equations should capture the fundamental aspects of the system in question as much as possible. On the other hand, constitutive parameters are often hard to determine explicitly and need to be learnt from data. As such, it is necessary to balance rigidity and flexibility when modelling a system.

This paper considers the problem of finding a model of a dynamical system, represented by coupled Ordinary Differential Equations (ODEs), from observations. This is a particular form of inverse problem. The time evolution of many dynamical systems is described by polynomial equations in the system variables and their derivatives. We introduce a form of parametric polynomial kernel regression (related to Radial Basis Function networks). This technique was developed during the search for an algorithm that can be trained continuously on streaming data, as opposed to complete trajectories. Hidden parameter models (with unobserved variables) are not addressed, but the techniques shown here could be extended to such cases in the future, augmenting probabilistic Bayesian filtering methods.

Kernel ridge regression is a nonparametric method for fitting polynomials to data without explicitly calculating all polynomial terms of a set of variables [18, 21]. There are two limitations of this approach when fitting models to time series data. First, as a nonparametric method, its computational time complexity scales cubically with the number of observation points. This is a significant issue when dealing with time series data. Second, it is difficult to compute kernel ridge regression efficiently on streaming data. While it is possible to continually update the inverse of a matrix, the roughly cubic scaling of the required matrix operations is not well suited to monitoring high-dimensional systems in a real-time data setting. Here, to optimise our parametric polynomial kernel function representations, Stochastic Gradient Descent (SGD) is used along with the Backpropagation method. This combination of techniques helps to minimise computational complexity and the amount of explicit feature engineering required to find a good representation of an unknown ODE.

We represent ODE models parametrically as compute graphs. Compute graphs are used in Artificial Neural Network (ANN) theory to model complicated nonlinear structures by the composition of simple functions, and they are well suited to gradient descent optimisation via the Backpropagation method. It is demonstrated that numerical integration (both explicit and implicit) can be used to discretise ODE time integrals in a way that allows for the inference of continuous-time dynamical system models by gradient descent. This extends an approach that has appeared in earlier work. The discretisation procedure is related to the Backpropagation Through Time method, which is used for modelling discrete time series with so-called Recurrent Neural Networks.

To demonstrate the findings of this paper, two numerical case studies were carried out. The first is a simple analysis that contrasts the performance of standard ANN techniques with the proposed kernel method; our method had the best extrapolation performance. A more extensive analysis of the chaotic spatio-temporal Lorenz–Emanuel dynamical system is also presented. The proposed method is able to recover a maximum likelihood estimate of the hidden polynomial model. For comparison, a parametric model constructed by direct summation of polynomial features (without kernels) was also tested. The parametric polynomial kernel method outperformed the direct polynomial expansion, accurately predicting the future evolution of a chaotic dynamical system over periods many times greater than the training interval.

The primary advantage of the technique presented in this paper is that the model representation in parametric form can avoid the curse of dimensionality and poor scaling with training set size associated with nonparametric kernel regression. Further, polynomial kernels avoid the combinatorial explosion that occurs when explicitly computing polynomial series expansions. Interestingly, the accuracy of the proposed parametric kernel method can be tuned by adjusting the dimension of a set of intermediate parameters. The trade-off for increased accuracy is additional training time.

## 2 Background on Compute Graph Optimisation

### 2.1 Compute graphs and nonlinear function representations

The parametric polynomial regression technique introduced in this paper is built on the framework of so-called compute graphs. This section provides the background theory required for later parts of this work. Compute graphs are very general structures which define the flow of information over a topology and, as such, provide a convenient parametric representation of nonlinear functions. In particular, compute graphs can be coupled with Automatic Differentiation and the Backpropagation algorithm (an application of the chain rule) to allow for gradient-based optimisation. Stochastic Gradient Descent is the most common optimiser used in this context and is briefly described in this section.

Artificial Neural Networks (ANNs) are a subset of compute graphs (in the sense of discrete mathematics). Common ANN terminology such as Deep Neural Networks, Boltzmann Machines, Convolutional Neural Networks and Multilayer Perceptrons refers to different ANN connectivity, training and subcomponent patterns [3, 8]. The choice of an appropriate ANN type depends on the problem being solved. This section works with general compute graph terminology, rather than specific ANN design patterns, as these principles apply to all ANN architectures.

A (real-valued) compute graph consists of a weighted directed graph, i.e. an ordered pair G = (V, E) with the following properties:

• V is the finite set of vertices (or nodes), {v_1, …, v_|V|}. Each vertex v_i specifies an activation function σ_i and an output (or activation) value a_i.

• E is the set of edges. Each edge W_ij ∈ E specifies a start vertex, defined to be v_i, and an end vertex, defined to be v_j. That is, the edge W_ij is said to start at v_i and terminate at v_j. Each edge also specifies a weight, W_ij ∈ ℝ.

Edges can be understood as ‘pointing’ from v_i to v_j. Incoming edges to a node v_i are all edges in E that terminate at v_i. Similarly, outgoing edges from a node v_i are all edges in E that start at v_i. Parents of a node v_i refer to all nodes v_k such that there is an edge starting at v_k and terminating at v_i. Similarly, children of a node v_i refer to all nodes v_j such that there is an edge starting at v_i and terminating at v_j. A valid path of length p starting at v_{k_1} and terminating at v_{k_p} is a set of at least two nodes {v_{k_1}, …, v_{k_p}} such that there exists an edge in E from v_{k_j} to v_{k_{j+1}} for all j = 1, …, p−1. A recurrent edge in a compute graph refers to an edge that lies on a valid path from a node to any of its parents. A graph with recurrent edges is said to be a recurrent graph. An example of a (recurrent) compute graph is shown in fig 1.

Inputs to the compute graph are all those nodes with no incoming edges (i.e. no parents). The activation values a_i for input nodes must be assigned. The values at all other nodes in the compute graph are calculated by

$$z_i = \sum_{k:\, v_k \text{ parent of } v_i} W_{ki} a_k, \tag{1}$$

$$a_i = \sigma_i(z_i), \tag{2}$$

where z_i represents the weighted inputs to node v_i from all parent nodes and a_i represents the output from the node.

Note that ANNs often define so-called bias units. Bias units allow the inputs to a node to have their mean easily shifted. A bias input to some node v_i can be represented in a compute graph by creating a node b_i, with no parents, that always outputs a value of 1. Each b_i is assigned to be an additional parent of v_i by creating an edge from b_i to v_i with weight B_i so that

$$a_i = \sigma_i\left(\sum_{k:\, v_k \text{ parent of } v_i} W_{ki} a_k + B_i\right). \tag{3}$$

Bias units will not, however, be explicitly indicated in the rest of this section as they can be assumed to be implicitly defined in eqn (1).
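To make eqns (1) and (2) concrete, here is a minimal sketch of the forward pass over a small acyclic graph; the topology, weights and choice of sigmoid activations are illustrative, not taken from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative acyclic graph: node 0 is an input node;
# edges are stored as (start, end, weight), i.e. W_ij entries.
edges = [(0, 1, 0.5), (0, 2, -1.0), (1, 2, 2.0)]
activations = {0: 3.0}  # a_0 must be assigned: node 0 has no parents

for i in (1, 2):  # non-input nodes, visited in topological order
    # eqn (1): weighted sum over all parent activations
    z_i = sum(w * activations[k] for (k, j, w) in edges if j == i)
    # eqn (2): apply the node's activation function
    activations[i] = sigmoid(z_i)
```

Evaluating the nodes in topological order guarantees that every parent activation is available when eqn (1) is applied, which is exactly why the acyclicity condition below matters for training.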

The composition of simple functions with a compute graph structure allows for complicated nonlinear functions to be represented parametrically.

Figure 1: Example of a compute graph. The subscript inside each node denotes the node number. Arrowheads indicate the direction of the graph edges. The function inside each node refers to the output function applied at the node. Note that node 1 is an input (with value a1) as it has no parents. Further note that edge W63 is recurrent, as there is a cycle formed in the graph between nodes 3, 5 and 6.

### 2.2 Optimisation by Stochastic Gradient Descent and Backpropagation

Optimisation over very large compute graphs representing highly nonlinear functions has become possible using Stochastic Gradient Descent (SGD) coupled with Backpropagation of errors. Advanced forms of SGD, such as the Adam optimisation technique, are useful for optimising complicated compute graphs. The basic SGD method is described here. Stochastic Gradient Descent finds a locally optimal set of parameters, θ*, by iteratively updating the current estimate for the optimal parameters, θ_i. It does so by moving the current estimate in the direction of greatest decreasing error, given by the derivative ∇_θ J(θ_i):

$$\theta_{i+1} := \theta_i - \eta \nabla_\theta J(\theta_i), \tag{4}$$

where η is a small parameter (the learning rate) that gives the distance to move in the direction defined by ∇_θ J(θ_i). Iterations are repeated until a specified error tolerance ϵ is reached, i.e. until

$$J(\theta_i) \leq \epsilon. \tag{5}$$
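The update rule of eqn (4) with the stopping rule of eqn (5) can be sketched on a toy quadratic loss. The loss, starting point, learning rate and tolerance below are illustrative; plain (deterministic) gradient descent is shown for clarity, since SGD simply replaces the full gradient with a stochastic minibatch estimate:

```python
def grad_J(theta):
    # Gradient of the toy loss J(theta) = (theta - 2)^2
    return 2.0 * (theta - 2.0)

theta = 0.0   # initial parameter estimate
eta = 0.1     # learning rate eta
for _ in range(200):
    theta = theta - eta * grad_J(theta)  # eqn (4)
    if (theta - 2.0) ** 2 <= 1e-12:      # eqn (5): tolerance epsilon
        break
```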

Consider the case of approximating some unknown function f by a compute graph that outputs the function f̃_θ. The parameters θ are taken to be the values of the edge weights W_ij for all edges in the graph. Let the loss functional in this example be given by

$$J(\theta) := \sum_x |f(x) - \tilde f_\theta(x)|^2, \tag{6}$$

for x in some finite set. Thus, J is also representable as a compute graph; the graph for J contains the graph for f̃_θ as a subgraph. To apply SGD to a compute graph, extended to contain the terms computing the loss functional, the Backpropagation method (an application of the chain rule) can be used if two conditions are met:

• All nodal activation functions, σ_i, must be differentiable.

• The graph must be directed and acyclic, meaning the graph cannot contain any valid paths from a node to any of its parents, i.e. the graph must not have any recurrent edges.

If the above conditions are satisfied, Backpropagation can compute ∇_θ J via the chain rule. The basic procedure is outlined here; a more detailed treatment can be found in the literature. In the case that the graph is not acyclic, it can be unrolled via a technique referred to as Backpropagation Through Time.

Backwards error derivatives must be computed at all nodes, v_i, in the network:

$$\delta_i := \frac{\partial J}{\partial z_i}. \tag{7}$$

For nodes in the graph that compute the loss functional J, the derivative δ_i can be computed directly. Otherwise, assume that node v_i has children v_1, …, v_N. Using the chain rule, the error derivative can be calculated by pushing the error derivatives backwards through the graph from children to parents:

$$\delta_i = \sum_{j=1}^N \delta_j \frac{\partial z_j}{\partial a_i}\frac{\partial a_i}{\partial z_i} = \sum_{j=1}^N \delta_j W_{ij} \sigma'_i(z_i). \tag{8}$$

Given the error derivative terms, the desired error gradients of J with respect to the weights can be computed at a node v_j with M parents by

$$\frac{\partial J}{\partial W_{ij}} = \delta_j \frac{\partial z_j}{\partial W_{ij}} = \delta_j \frac{\partial}{\partial W_{ij}}\left(\sum_{k=1}^M W_{kj} a_k\right) = \delta_j a_i. \tag{9}$$
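Eqns (7)–(9) can be checked numerically on a tiny chain graph. The weights and target value below are illustrative; a central finite difference serves as an independent check of the Backpropagation gradient:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny chain: input a0 -> node 1 -> node 2, with loss J = (a2 - y)^2
a0, y = 1.5, 0.3
W01, W12 = 0.8, -0.4

def forward(W01, W12):
    z1 = W01 * a0; a1 = sigmoid(z1)
    z2 = W12 * a1; a2 = sigmoid(z2)
    return z1, a1, z2, a2

z1, a1, z2, a2 = forward(W01, W12)

# eqns (7)-(9), using sigmoid'(z) = a (1 - a)
delta2 = 2.0 * (a2 - y) * a2 * (1.0 - a2)  # dJ/dz2 at the output node
delta1 = delta2 * W12 * a1 * (1.0 - a1)    # eqn (8): push back to node 1
dJ_dW01 = delta1 * a0                      # eqn (9)

# Independent central finite-difference check of the same gradient
eps = 1e-6
Jp = (forward(W01 + eps, W12)[3] - y) ** 2
Jm = (forward(W01 - eps, W12)[3] - y) ** 2
fd = (Jp - Jm) / (2.0 * eps)
```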

Automatic Differentiation can be used to write efficient computer code for Backpropagation. Specifically, Backpropagation is a form of ‘reverse accumulation mode’ Automatic Differentiation. The above calculations can be organised efficiently by traversing the compute graph from output to input nodes. At the time of writing, Tensorflow is a popular implementation of the algorithms described above. Although other (including gradient-free) optimisation procedures suitable for general compute graphs can be used, SGD with Backpropagation is typically very computationally efficient when applicable.

## 3 Parametric Polynomial Kernel Regression

### 3.1 Overview

Before discussing model inference for ODEs in particular, a parametric polynomial kernel function representation is introduced. Although ANNs and compute graphs are very effective at fitting arbitrary functions, standard ANN methods are poorly suited to polynomial function representation. As typical ANN architectures fit a very large number of parameters, they are unable to perform sensible extrapolation for even low-dimensional polynomial regression problems. Polynomial kernel ridge regression using the so-called kernel trick works well for fitting polynomials but suffers from cubic (that is, O(N³)) computational time complexity in the number of training points N. Gradient-descent compute graph optimisation, as a parametric method, provides a way to fit large data sets without the computational difficulties faced by nonparametric methods. While it is possible to build a compute graph that explicitly includes polynomial basis features, the number of such features grows combinatorially with the polynomial order and input dimension. In this paper it is shown that polynomial kernels can be inserted into compute graph structures and optimised by SGD, avoiding both the combinatorial explosion of polynomial series expansions and the poor time scaling of nonparametric kernel ridge regression.

### 3.2 Polynomial kernel ridge regression

Polynomial kernels, typically associated with kernel regression and Support Vector Machines [18, 21], are functions of the form

$$K(x, y) = (b\langle x, y\rangle + c)^d \tag{10}$$

for some b, c ∈ ℝ and d ∈ ℕ. If the values of y are treated as parameters, the expansion of the polynomial kernel will, implicitly, yield all polynomial combinations of the components of x up to order d.
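A minimal sketch of eqn (10); the vectors and parameter values are illustrative:

```python
import numpy as np

def poly_kernel(x, y, b=1.0, c=1.0, d=2):
    # eqn (10): K(x, y) = (b * <x, y> + c)^d
    return (b * np.dot(x, y) + c) ** d

x = np.array([1.0, 2.0])
y = np.array([0.5, -1.0])
# Expanding (b<x, y> + c)^2 implicitly contains all monomials in the
# entries of x up to order 2, without computing them explicitly.
k = poly_kernel(x, y)
```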

Kernel ridge regression is a nonparametric method in the sense that the number of parameters grows with the amount of training data, N. By contrast, in this paper ‘parametric model’ refers to a model with a fixed number of parameters. The standard form of kernel ridge regression is as follows. Given observations of an unknown function f at N locations, x_1, …, x_N, kernel ridge regression finds an approximation, f_k, by

$$f(x) \approx f_k(x) = \sum_{i=1}^N \alpha_i K(x, x_i), \tag{11}$$

where the values α_i are termed weights and K is a kernel function. Kernel functions are a form of generalisation of positive definite matrices. Only the (real-valued) polynomial kernel in eqn (10) will be discussed in this paper. The weights are calculated using a regularisation parameter λ as follows:

$$\alpha = (K + \lambda I)^{-1} f(x), \tag{12}$$

where K is the matrix with entries K_ij = K(x_i, x_j), I is the N by N identity matrix and f(x) is the vector of observations. The term λ is a regularisation term that controls overfitting. Note that if (K + λI) is not invertible, then the inverse must be replaced by a pseudo-inverse. In the sense of Bayesian regression, the term λ represents the scale of Gaussian noise added to observations as a part of the approximation procedure.

The use of kernels for regression as in eqn (11) has the effect of mapping a low-dimensional problem implicitly into a high-dimensional space. This is a very powerful technique for projecting data onto high-dimensional basis functions. Unfortunately, as a (typically) dense matrix must be inverted to calculate α, the computational complexity of standard kernel ridge regression scales cubically with the number of data points, N. This is a severe limitation when considering large data sets such as the time series data considered in later sections of this paper.
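Eqns (11) and (12) can be sketched for a one-dimensional cubic; the data, kernel parameters and regularisation value are illustrative. Note the dense N×N solve, which is the source of the cubic scaling discussed above:

```python
import numpy as np

def poly_kernel(x, y, b=1.0, c=1.0, d=3):
    # eqn (10) with a third-order polynomial kernel
    return (b * np.dot(x, y) + c) ** d

# Observations of an unknown function (here f(x) = x^3) at N locations
xs = np.linspace(-2.0, 2.0, 20)
fx = xs ** 3

# eqn (12): alpha = (K + lambda I)^{-1} f(x); solving this dense
# N x N system is the O(N^3) step that limits nonparametric methods
K = np.array([[poly_kernel(xi, xj) for xj in xs] for xi in xs])
lam = 1e-8
alpha = np.linalg.solve(K + lam * np.eye(len(xs)), fx)

def f_k(x):
    # eqn (11): weighted sum of kernel evaluations at training points
    return sum(a * poly_kernel(x, xi) for a, xi in zip(alpha, xs))
```

Because x³ lies in the span of the third-order kernel's implicit features, the fitted f_k extrapolates well outside the training interval, a behaviour revisited in Section 3.4.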

### 3.3 Parametric polynomial kernel representation

Instead of calculating an inner product between known values of x and y as in eqn (10) and inverting a matrix as in eqn (12), this paper demonstrates that a kernel representation can be found in an efficient way using compute graphs and SGD. Consider the following parametric representation of a function f_θ with parameters θ:

$$f_\theta(x) = W_2\big[(W_1 x + B_1) \circ (W_1 x + B_1)\big] + B_2, \tag{13}$$

where ∘ denotes elementwise multiplication (the Hadamard product), i.e. (A ∘ B)_ij = A_ij B_ij for corresponding entries. For input x ∈ ℝ^k and intermediate dimension h, the remaining terms are W_1 ∈ ℝ^{h×k}, B_1 ∈ ℝ^h, W_2 ∈ ℝ^{1×h} and B_2 ∈ ℝ. The parameters B_1 and B_2 are known as bias weights in the ANN literature. The full set of parameters for this model is θ = {W_1, B_1, W_2, B_2}. The dimension h is an intermediate representation dimension and is discussed below.

Eqn (13) is a parametric representation of a second-order polynomial kernel. Expanding eqn (13) explicitly would yield a set of second-order polynomials in the components of x. However, using SGD the unknown polynomial expression can be found without the need to know the expanded polynomial form. The elementwise product acts like the d-th power in eqn (10). The parameters can be trained by SGD and function as parametric representations of Support Vectors. The dimension h required to complete the definition of eqn (13) is a hyperparameter representing a choice of intermediate representation dimension and is related to the number of Support Vectors required to represent the system (as in Support Vector Regression). Increasing the size of h increases the number of parameters but can improve the fit of the regressor (as is demonstrated empirically in Section 5).

An n-th order polynomial can be fit by taking a larger number of Hadamard products. Denote the n-fold Hadamard product of a vector with itself by (·)^{∘n}. Then, our approach consists of expressing an n-th order representation of f_θ as follows:

$$f_\theta(x) = W_2\big[(W_1 x + B_1)^{\circ n}\big] + B_2, \tag{14}$$

or some similar variation on this theme. The expression in eqn (14) is differentiable in the sense of compute graphs since all of the operations in eqn (14) are differentiable. Comparing with eqns (11) and (12), the parametric form of polynomial kernel regression can be thought of as an approximation to both the α and K terms in a single expression. As the parametric regression form can be optimised by SGD, the cubic scaling of nonparametric kernel ridge regression is avoided.
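A sketch of the representation in eqn (14) in NumPy. The parameter shapes are an assumption consistent with the text (scalar input, intermediate dimension h); the check below verifies that, for random parameters, the model output is an exact polynomial of order n in x:

```python
import numpy as np

rng = np.random.default_rng(0)
h, n = 5, 3                          # intermediate dimension h, order n

# Hypothetical parameter shapes for a scalar input x (k = 1)
W1 = rng.normal(size=(h, 1)); B1 = rng.normal(size=h)
W2 = rng.normal(size=(1, h)); B2 = rng.normal(size=1)

def f_theta(x):
    v = W1 @ np.atleast_1d(x) + B1   # (W1 x + B1), shape (h,)
    prod = np.ones_like(v)
    for _ in range(n):               # n-fold Hadamard product, eqn (14)
        prod = prod * v
    return (W2 @ prod + B2)[0]

# The output should be fit exactly by a degree-n polynomial in x:
xs = np.linspace(-2.0, 2.0, 50)
ys = np.array([f_theta(x) for x in xs])
resid = ys - np.polyval(np.polyfit(xs, ys, n), xs)
```

In a full implementation, all of θ = {W1, B1, W2, B2} would be trained by SGD with Backpropagation rather than drawn at random.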

### 3.4 Numerical demonstration on simple regression problem

This section demonstrates the proposed method via the approximation of a simple cubic function, namely

$$f(x) := (x-1)(x+1)(x+0.5). \tag{15}$$

The goal of this analysis is to infer the hidden function f. Given a set of training data, N pairs (x_i, f(x_i)), the problem is to minimise the loss functional

$$J(\theta) := \frac{1}{N}\sum_{i=1}^N |f(x_i) - f_\theta(x_i)|^2. \tag{16}$$

For this test problem, 25 training data points were sampled uniformly between x = −2 and x = 2.

First, a standard ANN ‘Multilayer Perceptron’ (specifically a three-layer deep, 100-unit wide perceptron network) was tested. The reader unfamiliar with these terms can consult the literature for definitions, but it is sufficient for the purposes of this paper to understand that this perceptron model computes the function

$$f_\theta(x) = W_4\,\sigma(W_3\,\sigma(W_2\,\sigma(W_1 x + B_1) + B_2) + B_3) + B_4, \tag{17}$$

where W_1 ∈ ℝ^{100×1}, W_2, W_3 ∈ ℝ^{100×100}, W_4 ∈ ℝ^{1×100}, B_1, B_2, B_3 ∈ ℝ^{100} and B_4 ∈ ℝ, so that the parameters of this network are θ = {W_1, …, W_4, B_1, …, B_4}. Additionally, σ denotes the sigmoid function:

$$\sigma(x) := \frac{1}{1 + e^{-x}}. \tag{18}$$

In eqn (17), σ is applied to vectors componentwise.

Second, the parametric polynomial method in eqn (14) was tested for several polynomial orders n. The intermediate dimension h was fixed to the same value for all comparisons.

Both the perceptron model and the parametric polynomial kernel model were trained in two stages: the Adam optimiser was first run for 1000 iterations with an initial learning rate, and then for an additional 1000 iterations with a reduced learning rate. All ANNs and SGD optimisers were implemented using the Tensorflow software library.

Finally, a nonparametric kernel ridge regression estimator of the form in eqn (11) was tested. This was implemented using the SciKit-learn ‘KernelRidge’ function with a third-order polynomial kernel. Note that this function has additional hyperparameters (such as the regularisation weight and kernel coefficients); the SciKit-learn documentation describes these parameters in detail. As with the parametric estimator, the choice of maximum polynomial degree (d in eqn (10)) is another hyperparameter. For this demonstration, only the known true value (d = 3) was tested with the nonparametric regression estimator.

The values of J(θ) after running SGD are shown in table 1. The third-order parametric polynomial loss is ten orders of magnitude lower than the regression loss of the perceptron network. The lower loss of the third-order parametric polynomial method compared to the other tested orders is (of course) expected, as the hidden function is a third-order polynomial. This indicates that several polynomial orders should be tested when applying the proposed technique to other problems.

The results of the analysis are shown in figs 2 and 3. Each model tested was able to recover the true form of f in the region of the training data. Relative errors for each method are shown in fig 4. Both the parametric and nonparametric polynomial methods were also able to extrapolate well beyond the range of the original training data. This can best be seen in fig 3. The perceptron model, by contrast, almost immediately fails to predict values of the hidden function outside of the range of the training data. For inferring hidden polynomial dynamical systems from observations, where the ability to extrapolate beyond the training data is essential, the analysis in this section suggests that the parametric polynomial kernel method can be expected to outperform standard ANN methods.

This analysis also indicates that the loss is an effective indicator of extrapolation performance for polynomial kernel methods (at least in this test case). This is not true for the Multilayer Perceptron model, which had a low loss value but poor extrapolation performance. One must, however, take care when making assertions about extrapolation performance, as it is easy to make incorrect inferences in the absence of data.

Figure 2: Comparison of performance of the parametric polynomial kernel method on a simple regression task. Note that the true hidden function, from eqn (15), lies underneath the function inferred by the n=3 parametric polynomial; the two coincide because of the virtually perfect fit. The nonparametric polynomial kernel ridge estimator also closely coincides with the true f(x). The 25 regression training data points were sampled uniformly between x=−2 and x=2.

Figure 3: Comparison of performance of the parametric polynomial kernel method on a simple regression task. This is a zoomed-out view of fig 2 and shows that the polynomial kernel estimators (both parametric for n=3 and nonparametric) are able to recover the true hidden function in eqn (15) outside of the range of the training data.

Figure 4: Comparison of pointwise relative errors for the simple regression task. Errors are computed as |(y − f(x))/f(x)|, where f(x) is the true hidden function defined in eqn (15). The parametric polynomial kernel method has the best performance, followed by the nonparametric polynomial kernel ridge method. Note that the training data was restricted to lie within x=−2 and x=2.

## 4 Ordinary Differential Equation Model Inference

### 4.1 Dynamical Systems

Dynamical systems are classified into either difference equations (discrete-time systems) or differential equations (continuous-time systems). In this paper, only continuous-time dynamical systems are investigated, although the numerical methods presented could be applied to both continuous-time and discrete-time systems. Continuous-time dynamical systems of the form considered in this paper can be expressed as coupled first-order Ordinary Differential Equations (ODEs):

$$\frac{d}{dt}u(t) = f(t, u(t)), \tag{19}$$

where:

• t represents time;

• u(t) is the vector of values representing the variables of the system at time t;

• f(t, u(t)) represents the prescribed time derivatives of u(t).

A trajectory of a dynamical system refers to a parameterised path u(t), which returns a value of u for each value of the parameter t. The value of u(t) in eqn (19) can be computed given some initial value, u(0), by integrating forward in time:

$$u(t) = u(0) + \int_0^t \frac{d}{d\tau}u(\tau)\, d\tau = u(0) + \int_0^t f(\tau, u(\tau))\, d\tau. \tag{20}$$

To simplify the solution of ODEs and the implementation of the learning algorithm presented in this paper, we only consider first-order systems. This is without loss of generality, as a differential equation of order m of the form

$$\frac{d^m}{dt^m}u(t) = f(t, u(t)) \tag{21}$$

can be converted into a system of first-order coupled ODEs. This is also the standard approach employed in numerical implementations of ODE solvers; for an example, see the SciPy function `solve_ivp`. The conversion can be achieved by introducing new variables for the higher derivatives. Consider an m-th order equation of the form

$$\frac{d^m u}{dt^m} = g\left(t, u, \frac{du}{dt}, \frac{d^2 u}{dt^2}, \cdots, \frac{d^{m-1} u}{dt^{m-1}}\right). \tag{22}$$

This can be rewritten by replacing the derivative terms by new variables v_j (for j = 1, …, m−1) such that:

$$\frac{d}{dt}\begin{bmatrix} u \\ v_1 \\ \vdots \\ v_{m-1} \end{bmatrix} = \begin{bmatrix} v_1 \\ \vdots \\ v_{m-1} \\ g(t, u, v_1, v_2, \ldots, v_{m-1}) \end{bmatrix}. \tag{23}$$
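As an illustration of the reduction in eqn (23), the second-order equation u'' = −u (a simple harmonic oscillator, chosen here only for illustration) can be written as a first-order system and passed to SciPy's `solve_ivp`:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Second-order ODE u'' = -u rewritten as the first-order system of
# eqn (23): d/dt [u, v1] = [v1, g(t, u, v1)], with v1 := du/dt, g = -u
def rhs(t, y):
    u, v1 = y
    return [v1, -u]

# Initial conditions u(0) = 1, u'(0) = 0, so the exact solution is cos(t)
sol = solve_ivp(rhs, (0.0, np.pi), [1.0, 0.0], rtol=1e-9, atol=1e-9)
u_at_pi = sol.y[0, -1]  # exact value is cos(pi) = -1
```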

As the value of u at some time depends on the values at infinitesimally earlier times through the derivatives of u, there is a recursive structure present in the equations (this would be even clearer for difference equations, or after a discretisation). The model inference technique presented in this paper uses loop unrolling to simplify the derived optimisation problem.

### 4.2 Model inference for coupled ODEs

Model inference, in this context, is the problem of recovering the form of f (as in eqn (19)) given observations of u(t) at times from t = 0 to t = T. Model inference can be expressed as an optimisation problem:

$$\text{Minimise } J(\theta) := \int_0^T \left| u(t) - \left( u(0) + \int_0^t f_\theta(\tau, u(\tau))\, d\tau \right) \right|^2 dt, \tag{24}$$

where J is a loss functional over some unknown parameters θ. The function f_θ denotes a parametric approximation to the true latent function f. For the purposes of this paper, the parametric representation of f_θ can be assumed to be a directed acyclic compute graph. Denote the trajectories computed using the integral of f_θ by

$$\tilde u_\theta(t) := u(0) + \int_0^t f_\theta(\tau, u(\tau))\, d\tau. \tag{25}$$

Then the loss functional in eqn (24) can be expressed as

$$J(\theta) = \int_0^T |u(t) - \tilde u_\theta(t)|^2\, dt. \tag{26}$$

In this form, it is clear that J measures how closely the observed trajectories match the predicted trajectories for each value of t. Additionally, although the L² norm has been used above, it could be replaced by any other norm as appropriate. For simplicity, only the L² norm will be used in this paper.

If observations of du/dt are available, the optimisation problem can be expressed in an alternative, but not exactly equivalent, differential form:

$$\text{Minimise } K(\theta) := \int_0^T \left| \frac{d}{dt}u(t) - f_\theta(t, u(t)) \right|^2 dt. \tag{27}$$

The loss functional surface of J will tend to be smoother over θ than that of the differential form K (since there is an additional integration), potentially altering the behaviour of various optimisation methods. However, the exact minimisers of both J and K, if they exist so that the loss is zero, are the same, as can be seen by differentiation.

The choice to optimise over J or K depends on the chosen representation of f_θ and the availability of observations. Assume that only observations of u(t) are available and not direct observations of du/dt. Then it is necessary either to introduce some way to approximate du/dt or to approximate the integrals in J. In the remainder of this section, it is shown that a discretised form of J, denoted by Ĵ, can be derived. The discretised objective can be trained using SGD and Backpropagation as long as f_θ can be represented by an acyclic compute graph. The derivation of Ĵ proceeds by first approximating the outer integral in eqn (26) using a finite set of observations of u(t). The derivation of the discretisation is completed by approximating the inner time integral using standard numerical integration techniques.

### 4.3 Discretisation of the approximate trajectories

The continuous form of the integral in eqn (25) is not amenable to numerical computation and requires discretisation. In particular, if f_θ is to be represented by a compute graph and learnt by SGD, then the entire loss functional must be represented by a differentiable, directed acyclic compute graph. To achieve this, it is useful to first note that the integral in eqn (25) can be decomposed into a series of integrals over smaller time domains. Consider the trajectories from times 0 to t and from t to t + h:

$$\tilde u_\theta(t) := u(0) + \int_0^t f_\theta(\tau, u(\tau))\, d\tau. \tag{28}$$

Then,

$$\tilde u_\theta(t+h) = u(0) + \int_0^t f_\theta(\tau, u(\tau))\, d\tau + \int_t^{t+h} f_\theta(\tau, u(\tau))\, d\tau \tag{29}$$

$$= \tilde u_\theta(t) + \int_t^{t+h} f_\theta(\tau, u(\tau))\, d\tau, \tag{30}$$

giving the trajectory predicted by f_θ from t to t + h.

The required discretisation can be completed using standard numerical integration techniques. Numerical integration methods such as Euler, Runge–Kutta and Backwards Differentiation work, roughly, by assuming some functional form for the integrand and analytically integrating this approximation. Numerical integration methods can be expressed as a function G of the integrand evaluated at some finite set of points {x_j}_{j=1}^m:

$$\int_a^b f(x)\, dx \approx G\big(a, b, f, \{x_j\}_{j=1}^m\big). \tag{31}$$

Note that the points x_j are defined as a part of the specification of a particular numerical integration scheme. The function to be integrated, f, must be able to be evaluated at each x_j.

The trajectories in eqn (30) can then be approximated with a numerical approximation scheme as in eqn (31):

$$\tilde u_\theta(t+h) \approx \hat u_\theta(t+h) := \hat u_\theta(t) + G\big(t, t+h, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\big), \tag{32}$$

$$\hat u_\theta(0) := u(0). \tag{33}$$

Here û_θ refers to a trajectory with the continuous integrals replaced by approximate numerical integrals. The values τ_j are evaluation points and correspond to the values x_j in eqn (31). In general, the smaller the value of h, the greater the accuracy of the approximation. Small values of h, however, increase the computational burden required to compute approximate trajectories.

### 4.4 ODE inference loss functional for observations at discrete times

For practical problems, observations of u(t) will not be available for all times between 0 and T. Typically, the trajectory will be known only at a finite set of times t_1, …, t_N. The finite set {(t_i, u(t_i))}_{i=1}^N will be referred to as ‘training data’ and can be used to discretise the optimisation problem in eqn (24) by the following approximation:

$$\text{Minimise } \tilde J(\theta) := \frac{1}{N}\sum_{i=1}^N \left| u(t_i) - \left( u(0) + \int_0^{t_i} f_\theta(\tau, u(\tau))\, d\tau \right) \right|^2 \tag{34}$$

$$= \frac{1}{N}\sum_{i=1}^N \left| u(t_i) - \tilde u_\theta(t_i) \right|^2. \tag{35}$$

However, the ũ_θ(t_i) terms must also be replaced by a discretisation, as in eqn (32). Assume that a numerical integration scheme is selected that evaluates the integrand at m points. It is convenient to decompose the trajectory integrals into a series of integrals over finite subsets of the training data, from t_i to t_{i+p} for a window size p (typically either 1 or 2), such that

$$\tilde{u}_\theta(t_{i+p}) = \tilde{u}_\theta(t_i) + \int_{t_i}^{t_{i+p}} f_\theta(\tau, u(\tau))\, d\tau. \tag{36}$$

With reference to eqn (32), this can be further approximated by numerical integration:

$$\hat{u}_\theta(t_{i+p}) = \hat{u}_\theta(t_i) + G\big(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\big) \tag{37}$$

such that the value of $u(\tau_j)$ is known (given the training data) for all evaluation points $\tau_j$, $j = 1, \ldots, m$.

Finally, eqn (37) can be modified by using the known value (from the training data) of $u(t_i)$ in place of $\hat{u}_\theta(t_i)$:

$$\hat{u}(t_{i+p}) := u(t_i) + G\big(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\big). \tag{38}$$

Eqn (35) can then be approximated by the discretised loss functional $\hat{J}$ by inserting $\hat{u}(t_{i+p})$:

$$\hat{J}(\theta) := \frac{1}{N-p}\sum_{i=1}^{N-p} \left| u(t_{i+p}) - \hat{u}(t_{i+p}) \right|^2 \tag{39}$$
$$= \frac{1}{N-p}\sum_{i=1}^{N-p} \left| u(t_{i+p}) - \left( u(t_i) + G\big(t_i, t_{i+p}, f_\theta, \{(\tau_j, u(\tau_j))\}_{j=1}^m\big) \right) \right|^2. \tag{40}$$

As $\hat{J}$ is a discrete approximation to $J$, the model inference problem in eqn (26) is approximately solved by minimisation of $\hat{J}$ over a training data set:

$$\theta^* = \operatorname*{argmin}_\theta J(\theta) \approx \operatorname*{argmin}_\theta \hat{J}(\theta). \tag{41}$$

The inferred ODE model is then $du/dt = f_{\theta^*}(t, u)$.

Note that in the above derivation, loss functionals have been computed for time-dependent models of the form $f_\theta(t, u)$. In practice, optimisation over a single trajectory will only provide useful estimates of $f_\theta$ very close to the observed points $(t_i, u(t_i))$. To find estimates of $f_\theta$ away from those points, one would have to observe multiple trajectories and modify $\hat{J}$ to average over these trajectories. Alternatively, in the autonomous case, where $f_\theta$ is of the form $f_\theta(u)$, one trajectory may be enough to infer $f_\theta$, depending on the number of sampling points available.
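The discretised inference problem of eqns (39)-(41) can be sketched end to end. The following hypothetical example fits a one-parameter model $f_\theta(t, u) = \theta u$ to synthetic observations of the true system $du/dt = -2u$, using a Forward Euler discretisation ($p = 1$) and a generic minimiser standing in for SGD:

```python
import numpy as np
from scipy.optimize import minimize

# Training data: observations of the true model du/dt = -2u.
t = np.linspace(0.0, 1.0, 21)
u = np.exp(-2.0 * t)

def J_hat(theta):
    # Discretised loss of eqn (40) with G given by Forward Euler (p = 1).
    pred = u[:-1] + np.diff(t) * theta * u[:-1]
    return np.mean((u[1:] - pred) ** 2)

# Approximate solution of eqn (41); the recovered rate is close to the
# true value -2, up to Forward Euler discretisation bias.
theta_star = minimize(J_hat, x0=np.array([0.0])).x[0]
```

The residual bias illustrates why higher-order schemes (Section 4.6) are attractive: the minimiser of $\hat{J}$ inherits the truncation error of the integration rule used to build it.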

### 4.5 Example using Euler integration

To demonstrate concretely how eqn (40) gives a loss functional discretisation, $\hat{J}$, for an ODE model that can be optimised by SGD and Backpropagation, this section discusses an example using simple numerical integration techniques. Forward Euler computes an approximation to a dynamical system trajectory time integral as follows:

$$u(t+h) \approx u(t) + h f(t, u(t)). \tag{42}$$

With reference to eqn (31), Forward Euler is a numerical integration scheme with $m = 1$, $x_1 = a$, and

$$G\big(a, b, f, \{(a, u(a))\}\big) = |b - a|\, f(a, u(a)). \tag{43}$$

Forward Euler is a so-called explicit method, as the approximation of $u(t+h)$ depends only on quantities evaluated at times earlier than $t + h$. Backward Euler, conversely, is an implicit method:

$$u(t+h) \approx u(t) + h f(t+h, u(t+h)). \tag{44}$$

With reference to eqn (31), Backward Euler is a numerical integration scheme with $m = 1$, $x_1 = b$, and

$$G\big(a, b, f, \{(b, u(b))\}\big) = |b - a|\, f(b, u(b)). \tag{45}$$

Forward time integration using Backward Euler requires solving a system of equations (typically by Newton-Raphson iterations), as $u(t+h)$ appears on both sides of eqn (44). This is characteristic of implicit integration methods. The choice between explicit and implicit integration methods for simulation of a system depends on the form of the dynamical system to be approximated. Implicit methods are more efficient and accurate for so-called ‘stiff’ problems [10, 11].
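The implicit solve can be sketched as follows. For simplicity, this hypothetical example uses fixed-point iteration rather than Newton-Raphson; the iteration converges here because the step size is small relative to the Lipschitz constant of $f$:

```python
# Sketch of one Backward Euler step, eqn (44): solve v = u(t) + h f(t+h, v)
# for v, here by simple fixed-point iteration (Newton-Raphson is the more
# typical choice for stiff systems).
def backward_euler_step(f, t, u, h, iters=50):
    v = u  # initial guess: the current state
    for _ in range(iters):
        v = u + h * f(t + h, v)
    return v

# du/dt = -u: the exact Backward Euler update is u / (1 + h).
v = backward_euler_step(lambda t, u: -u, 0.0, 1.0, 0.1)
```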

However, either method can be used to discretise an ODE into a compute graph representation. For example, assume that $f_\theta$ is represented by an acyclic compute graph. Then, given training data $\{(t_i, u(t_i))\}_{i=1}^N$, the model inference loss functional, $\hat{J}$, in eqn (40) can be approximated using Forward Euler as follows:

$$\hat{J}_F(\theta) := \frac{1}{N-1}\sum_{i=1}^{N-1} \left| u(t_{i+1}) - \big( u(t_i) + |t_{i+1} - t_i|\, f_\theta(t_i, u(t_i)) \big) \right|^2. \tag{46}$$

Implicit integration schemes can be used in essentially the same way as shown above for Forward Euler. As an example, the Backward Euler scheme in eqn (44) can be used to set the model inference loss functional, $\hat{J}_B$, from eqn (40) as follows:

$$\hat{J}_B(\theta) := \frac{1}{N-1}\sum_{i=1}^{N-1} \left| u(t_{i+1}) - \big( u(t_i) + |t_{i+1} - t_i|\, f_\theta(t_{i+1}, u(t_{i+1})) \big) \right|^2. \tag{47}$$

Note that for the explicit Euler scheme, as in eqn (46), observations up to time $t_N$ allow $f_\theta$ to be inferred only up to time $t_{N-1}$. Hence, there is a time lag in the learning which is not observed for the implicit Euler scheme.

The loss functionals in eqns (46) and (47) are trivially differentiable and acyclic (as the values of $t_i$ and $u(t_i)$ are just constants that have been taken from observations) as long as the graph representation of $f_\theta$ is differentiable and acyclic. Thus, if $f_\theta$ is represented by a differentiable and acyclic compute graph, these loss functionals can be optimised by SGD.
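As a minimal sketch of such an optimisation, assume again the scalar toy model $f_\theta(t, u) = \theta u$, with the Backpropagation gradient written out by hand. Full-batch gradient descent on the Forward Euler loss of eqn (46) might look like:

```python
import numpy as np

h = 0.05
t = np.arange(0.0, 1.0, h)
u = np.exp(-2.0 * t)  # observations of the true model du/dt = -2u

theta, eta = 0.0, 500.0  # large learning rate offsets the small h^2 curvature
for _ in range(500):
    resid = u[1:] - (u[:-1] + h * theta * u[:-1])  # eqn (46) residuals
    grad = np.mean(-2.0 * resid * h * u[:-1])      # d J_F / d theta
    theta -= eta * grad

# theta approaches the Forward Euler least-squares optimum, close to -2.
```

In the paper's actual experiments, this hand-written gradient is replaced by automatic differentiation of the compute graph, and the full-batch update by stochastic minibatch updates.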

### 4.6 Example using linear multistep integration approximation

More sophisticated integration schemes than Backward or Forward Euler can be used to find a differentiable parametric representation of $\hat{J}$. Linear multistep integral approximation schemes are briefly described here, as they are used for the numerical simulations presented in the next section of this paper. Any numerical scheme that is differentiable and representable by a directed acyclic compute graph when inserted into the loss functional could be used. Linear multistep methods are a convenient choice when the training data consists of observations of $u$ that have been sampled at constant frequency.

Adams-Moulton linear multistep integration of order three can be used to approximate a trajectory of a dynamical system from time $a + h$ to time $b = a + 2h$, for some step size $h > 0$, as follows:

$$\hat{u}(b) = u(a+h) + h\left( \frac{5}{12} f_\theta(b, u(b)) + \frac{2}{3} f_\theta(a+h, u(a+h)) - \frac{1}{12} f_\theta(a, u(a)) \right). \tag{48}$$

To derive the loss functional $\hat{J}_A$, assume that training data observations of $u$ are given by $\{(t_i, u(t_i))\}_{i=1}^N$ and that the times are evenly spaced, such that $t_{i+1} - t_i = h$. Inserting eqn (48) into eqn (40) (with window size $p = 2$) gives the Adams-Moulton approximate loss functional:

$$\hat{J}_A(\theta) := \frac{1}{N-2}\sum_{i=1}^{N-2} \left| u(t_{i+2}) - \hat{u}(t_{i+2}) \right|^2. \tag{49}$$
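The accuracy gain from the higher-order scheme can be checked numerically. This sketch applies the order-three step of eqn (48) to $du/dt = -u$, using exact trajectory values at the three evaluation points, and compares the local error against a Forward Euler step of the same size:

```python
import numpy as np

def am3_step(f, a, h, u_a, u_ah, u_b):
    # Order-three Adams-Moulton step of eqn (48), from t = a + h to b = a + 2h.
    b = a + 2 * h
    return u_ah + h * (5/12 * f(b, u_b) + 2/3 * f(a + h, u_ah)
                       - 1/12 * f(a, u_a))

f = lambda t, u: -u
u = lambda t: np.exp(-t)  # exact trajectory of du/dt = -u, u(0) = 1
h = 0.1

am_err = abs(am3_step(f, 0.0, h, u(0.0), u(h), u(2 * h)) - u(2 * h))
euler_err = abs(u(h) + h * f(h, u(h)) - u(2 * h))
# The Adams-Moulton local error is several orders of magnitude smaller.
```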

Note that the full Adams-Moulton integrator could also be used to derive a loss functional that approximates a trajectory discretisation using a series of interpolation points between the observations in the training set. For simplicity, only the method shown above (placing the evaluation points at the values in the training data set) is used in this paper.

## 5 Numerical Analysis of the Lorenz–Emanuel System

### 5.1 Overview

This section demonstrates the application of the parametric polynomial kernel regression technique to the model inference problem for a dynamical system, using the discretisation detailed in the previous section of this paper. Simulations of the Lorenz–Emanuel system (see §7.1 of ) were analysed. This dynamical system consists of $N$ variables, $u_i$ for $i = 1, \ldots, N$, arranged periodically such that indices are taken modulo $N$ (so $u_{N+1} = u_1$ and $u_0 = u_N$). Let the full set of variables be denoted by $u$. The Lorenz–Emanuel system can be highly chaotic, displaying sensitive dependence on initial conditions. The equations of motion for this system are:

$$\frac{du_i}{dt} = (u_{i+1} - u_{i-2})\, u_{i-1} - u_i + F. \tag{50}$$

For the analysis in this section, the following parameters were adopted:

$$N = 8, \qquad F = 5. \tag{51}$$

The parameter $F$ represents an external forcing term that prevents the energy in the system from decaying to zero. The value $F = 5$ was chosen to be high enough to cause sensitive dependence on initial conditions.
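The equations of motion in eqn (50), with the periodic index convention handled by array rolling, can be sketched as follows (a minimal reimplementation, not the paper's code):

```python
import numpy as np
from scipy.integrate import solve_ivp

N, F = 8, 5.0  # parameters of eqn (51)

def lorenz_emanuel(t, u):
    # eqn (50): du_i/dt = (u_{i+1} - u_{i-2}) u_{i-1} - u_i + F,
    # with periodic indices implemented via np.roll.
    return (np.roll(u, -1) - np.roll(u, 2)) * np.roll(u, 1) - u + F

# Short demonstration run with normally distributed initial conditions.
rng = np.random.default_rng(0)
u0 = rng.normal(loc=0.0, scale=3.0, size=N)
sol = solve_ivp(lorenz_emanuel, (0.0, 1.0), u0, method="RK45")
```

Note that the uniform state $u_i = F$ is an equilibrium of eqn (50), which provides a quick sanity check on the implementation.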

### 5.2 Model inference training data and test description

Model inference was performed given the training data shown in fig 5. The training data was generated using the SciPy solve_ivp method with the ‘RK45’ algorithm (variable 4th-5th order Runge-Kutta) and sampled at a constant rate over the training time interval. The initial values for the data were generated by sampling each $u_i(t = 0)$ independently from a normal distribution with mean $0$ and standard deviation $3$:

$$u_i(t = 0) \sim \mathcal{N}(\mu = 0, \sigma = 3). \tag{52}$$

The performance of the proposed method was tested by resampling new initial conditions from the same distribution in eqn (52) and comparing the outputs from the true simulation to simulations generated using an inferred model. All test simulations were again carried out using the SciPy solve_ivp method with ‘RK45’ integration [14, 5].

We used the Adams-Moulton loss functional, $\hat{J}_A$, in eqn (49) to define the model inference task. The specific form of the inferred models is given in Section 5.3. All models were implemented using Tensorflow and optimised with the Adam variant of SGD. A fixed optimisation training schedule was adopted in all cases and consisted of three phases, $P_1$, $P_2$, $P_3$. Each phase is described by an ordered pair $P = (n, \eta)$, where $n$ is the number of gradient descent iterations for that phase and $\eta$ is the ‘learning rate’ parameter as in eqn (4). The training schedule adopted was:

$$\{P_1 = (1000, 0.1),\; P_2 = (2000, 0.01),\; P_3 = (200, 0.001)\}. \tag{53}$$

It was found that this schedule was sufficient to minimise $\hat{J}_A$ to approximately the maximum achievable precision for all models tested.
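The phase structure of eqn (53) can be sketched as a simple loop. Purely for illustration, this hypothetical example applies the schedule to plain gradient descent on a toy quadratic loss, rather than the actual Adam optimiser and $\hat{J}_A$:

```python
# Each phase P = (n, eta): n gradient descent iterations at learning rate eta.
schedule = [(1000, 0.1), (2000, 0.01), (200, 0.001)]

theta = 10.0
grad = lambda th: 2.0 * (th - 3.0)  # gradient of the toy loss (theta - 3)^2
for n_iters, eta in schedule:
    for _ in range(n_iters):
        theta -= eta * grad(theta)
# theta converges to the minimiser of the toy loss, 3.0
```

Decaying the learning rate across phases lets early iterations move quickly through parameter space while late iterations settle precisely into a minimum.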

Note that the integrator used to generate trajectories (RK45) and the integrator used for discretisation of the ODE trajectories (Adams-Moulton) are not the same. This was to demonstrate that any ODE solver can be used to generate simulations from the inferred model.

Figure 5: Lorenz–Emanuel system training data, generated using the model defined in eqn (50).

### 5.3 Model representation with polynomial linearisations and kernels

To complete the specification of the problem, the basic form of $f_\theta$ must be provided. If the form of the dynamical system equations is known beforehand, this information can be used to simplify the analysis. If no information is available, a search over different types of compute graph architectures must be conducted. For this demonstration, only a polynomial structure is assumed. This is a reasonable assumption that one could make when investigating general interdependent data observations from a dynamical system without any other prior knowledge, as a number of systems have such a structure.

For this inference task, the exact form of the polynomial couplings between the various $u_i$ was not provided to the compute graph. Instead, two types of polynomial nonlinearities were tested. First, a linear combination of all second-order polynomial terms that could be constructed from the $u_i$ was considered, that is, equations of the form

$$\frac{d\hat{u}_i}{dt} = f^i_\theta(u) = \sum_{k=1}^N \sum_{j=1}^k \alpha_{ikj} u_k u_j + \sum_{k=1}^N \beta_{ik} u_k + \gamma_i \tag{54}$$

for each $i = 1, \ldots, N$. The parameters are $\alpha_{ikj}$, $\beta_{ik}$ and $\gamma_i$, for $i, k = 1, \ldots, N$ and $j = 1, \ldots, k$. This sort of polynomial is of the traditional form used for polynomial chaos expansions.
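The feature construction behind eqn (54) can be sketched as follows (`poly2_features` is a hypothetical helper name): for each component, the model output is a linear combination of all second-order monomials $u_k u_j$ with $j \le k$, the linear terms, and a constant.

```python
import numpy as np

def poly2_features(u):
    # All second-order monomials u_k u_j (j <= k), linear terms, constant.
    quad = [u[k] * u[j] for k in range(len(u)) for j in range(k + 1)]
    return np.concatenate([quad, u, [1.0]])

N = 8
phi = poly2_features(np.ones(N))
# N(N+1)/2 quadratic + N linear + 1 constant parameters per component:
n_params = N * (N + 1) // 2 + N + 1
```

Each $f^i_\theta(u)$ is then a dot product of this feature vector with that component's parameter vector, which makes the quadratic growth of the direct feature expansion with $N$ explicit.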

Second, the parametric polynomial kernel method introduced in this paper and defined in eqn (13) was used to represent $f_\theta$. Several values of the kernel weight dimension were tried to test the effect of this parameter on the accuracy of the results.

### 5.4 Results

Stochastic Gradient Descent, combined with ODE trajectory discretisation, was successfully applied to model inference for the Lorenz–Emanuel system in eqn (50). Our parametric kernel model gave the best accuracy on the inference task. Importantly, the kernel model could be tuned to higher accuracies by increasing the number of weights used. Although this increases the total number of parameters to be optimised, the trade-off may be worthwhile depending on the particular problem.

The performance of the different models is shown in fig 6. The accumulated error, $\epsilon(t)$, was calculated as the running sum of the errors from the true model:

$$\epsilon(t = 0) = 0, \tag{55}$$
$$\epsilon(t + h) = \sqrt{\big(u(t+h) - \hat{u}(t+h)\big)^2} + \epsilon(t), \tag{56}$$

where $h$ matches the training data sampling rate. The errors were calculated for the polynomial feature model in eqn (54) and for the polynomial kernel model in eqn (13), for each of the kernel dimensions tested.
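The error recursion in eqns (55)-(56) can be sketched as follows, assuming (as eqn (56) suggests) that the per-step error is the Euclidean distance between the true and inferred states at each sampling time:

```python
import numpy as np

def accumulated_error(u_true, u_pred):
    # eqns (55)-(56): running sum of Euclidean distances between the
    # true and inferred trajectories; rows are sampling times.
    eps = [0.0]
    for ut, up in zip(u_true[1:], u_pred[1:]):
        eps.append(eps[-1] + np.sqrt(np.sum((ut - up) ** 2)))
    return np.array(eps)

# Toy check: a single 3-4-5 offset at the second sampling time.
err = accumulated_error(np.zeros((3, 2)),
                        np.array([[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]))
```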

From fig 6, the direct polynomial feature mapping had the worst accuracy. The parametric kernel method was able to track the system evolution more accurately. In all cases, the inferred models were able to maintain a small inference error at times up to at least an order of magnitude greater than the training data sampling rate.

Performance on the model inference task for the polynomial kernel method defined by eqn (13) is demonstrated in fig 7. This pair of figures shows a comparison between the true model output, $u(t)$, and the inferred model output, $\hat{u}(t)$. From fig 7, it can be seen that the overall structure of the equations is captured by the inferred model. Due to the chaotic nature of the system being analysed, once a few errors accumulate, the true and inferred models diverge rapidly.

Figure 6: Lorenz–Emanuel system error vs time. Errors are calculated as per eqn (56). (a) Data trace from true model, u(t).

### 5.5 Discussion

The parametric polynomial kernel method was able to infer the hidden ODE model with good accuracy given a fixed set of training data. The accumulated errors grow quickly with time. This is reasonable considering the chaotic nature of the Lorenz–Emanuel system. A more mathematically rigorous stability analysis of the numerical scheme would be interesting but is beyond the scope of this paper. A number of possible variations on the numerical example presented could be analysed in future work. For instance, the type of integration method used, the sampling rate of the data, and the effect of different amounts of training data would all be interesting to investigate.

## 6 Conclusions

This paper presented a parametric form of polynomial kernel regression, as well as numerical case studies. In particular, the proposed method was applied to the model inference problem for a chaotic dynamical system. Our parametric polynomial kernel method was able to harness the power of kernelised regression without the cubic computational complexity typically incurred by nonparametric polynomial regression, thereby avoiding the curse of dimensionality. Although the method was successfully applied to a test problem, more work will be required to fully understand how best to apply parametric polynomial kernels to real world (rather than simulated) data. As is the case in all regression models, some form of regularisation would need to be included to address overfitting and observational noise.

It was assumed for the analysis in this paper that it was known a priori that only certain polynomial couplings are present. Using the wrong polynomial order in the model expansion was found to cause convergence difficulties. This is also the case in nonparametric kernel regression (see the example in fig 2). As such, this is not considered a serious limitation of the method, in that it is possible to test a few different model forms when attempting to find a good fit to a data set. Bayesian model selection methods could be applied to formally assess the quality of different polynomial kernel model dimensions.

It is worth noting that direct projection onto polynomial features was found to perform poorly compared to the polynomial kernel method. Although stochasticity was not considered in this paper, it is quite possible that this finding will impact standard techniques frequently employed for Uncertainty Quantification. A kernel representation of the type introduced in this paper, applied to Gaussian and other stochastic features, may be useful for improving standard polynomial chaos methods.

The search for effective compute graph architectures remains a problem that plagues all methods attempting to learn hidden function structures without inserting large amounts of prior knowledge into the inverse problem. Scaling to very high-dimensional problems would be an interesting challenge. Given the partial decoupling from the curse of dimensionality that gradient descent methods can provide, it is hoped that the techniques presented in this paper would be suitable for model inference on large scale dynamical systems in the future.

## 7 Acknowledgements

This project has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme, grant agreement No 757254 (SINGULARITY) and a Lloyds Register Foundation grant for the Data-Centric Engineering programme at the Alan Turing Institute.

## References

•  M. Abadi, A. Agarwal, P. Barham & others, TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. http://www.tensorflow.org/, 2015.
•  J. Adler, O. Öktem, Solving ill-posed inverse problems using iterative deep neural networks, Inverse Problems. 33(12), 2017.
•  C.M. Bishop,

Neural Networks for Pattern Recognition

. Oxford University Press, 1995.
•  B. Carpenter, M.D. Hoffman, M. Brubaker, D. Lee, P. Li, M. Betancourt, The Stan math library: Reverse-mode automatic differentiation in C++. ArXiv, arXiv:1509.07164, 2015.
•  R. Dormand, P.J. Prince, A family of embedded Runge-Kutta formulae, Journal of Computational and Applied Mathematics. 6(1), 19–26, 1980.
•  P. Eberhard, C. Bischof, Automatic differentiation of numerical integration algorithms, Neural Networks. 68, 717–732, 1999.
•  S. Epp, Discrete Mathematics with Applications. Cengage Learning, 2010.
•  I. Goodfellow, Y. Bengio, A. Courville, Deep Learning. MIT Press, 2016.
•  W.W. Hager, Updating the Inverse of a Matrix, SIAM Review, 31(2), 221–239, 1989.
•  E. Hairer, S.P. Nørsett, G. Wanner, Solving Ordinary Differential Equations I: Nonstiff Problems. Springer Science & Business Media, 1993.
•  E. Hairer, S.P. Nørsett, G. Wanner, Solving Ordinary Differential Equations II: Stiff and Differential-Algebraic Problems. Springer Science & Business Media, 1996.
•  R.A. Horn, C.R. Johnson, Matrix Analysis. Cambridge University Press, 2012.
•  A. Iserles, A First Course in the Numerical Analysis of Differential Equations. Cambridge University Press, 2009.
•  E. Jones, T. Oliphant, P. Peterson, SciPy: Open source scientific tools for Python. http://www.scipy.org/, 2018.
•  D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, Proceedings of the 3rd International Conference on Learning Representations (ICLR). Springer Verlag, 2015.
•  H.G. Matthies, E. Zander, B.V. Rosic̀, A. Litvinenko, O. Pajonk, Inverse Problems in a Bayesian Setting, Computational Methods for Solids and Fluids. 41, 245–286, 2016.
•  J.D. Meiss, Differential Dynamical Systems, Revised Edition. SIAM, 2017.
•  K.P. Murphy, Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
•  F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, E. Duchesnay, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research. 12, 2825–2830, 2011.
•  L.B. Rall, Automatic Differentiation: Techniques and Applications, Lecture Notes in Computer Science. 120, 1981.
•  S. Russell, P. Norvig, Artificial Intelligence: A Modern Approach. Pearson, 2016.
•  M. Schmidt, H. Lipson, Distilling free-form natural laws from experimental data, Science. 324, 81–85, 2009.
•  J.C. Sprott, Elegant Chaos: Algebraically Simple Chaotic Flows. World Scientific Publishing Company, 2010.
•  K.O. Stanley, R. Miikkulainen, Evolving Neural Networks through Augmenting Topologies, Evolutionary Computation. 10(1), 99–127, 2002.
•  A.M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica. 19, 451–559, 2005.
•  B. Sudret, Uncertainty propagation and sensitivity analysis in mechanical models: Contributions to structural reliability and stochastic spectral methods, Habilitation à diriger des recherches, Université Blaise Pascal, 2007.
•  J.R. Taylor, Classical Mechanics. University Science Books, 2005.
•  K. Vu, J.C. Snyder, L. Li, M. Rupp, B.F. Chen, T. Khelif, K.‐R. Müller, K. Burke, Understanding kernel ridge regression: Common behaviors from simple functions to density functionals, International Journal of Quantum Chemistry. 115(16), 1115–1128, 2015.
•  P. Werbos, Generalisation of Backpropagation with application to a recurrent gas market model, Neural Networks. 1(4), 339–356, 1988.