# Probabilistic solution of chaotic dynamical system inverse problems using Bayesian Artificial Neural Networks

This paper demonstrates the application of Bayesian Artificial Neural Networks to Ordinary Differential Equation (ODE) inverse problems. We consider the case of estimating an unknown chaotic dynamical system transition model from state observation data. Inverse problems for chaotic systems are numerically challenging as small perturbations in model parameters can cause very large changes in estimated forward trajectories. Bayesian Artificial Neural Networks can be used to simultaneously fit a model and estimate model parameter uncertainty. Knowledge of model parameter uncertainty can then be incorporated into the probabilistic estimates of the inferred system's forward time evolution. The method is demonstrated numerically by analysing the chaotic Sprott B system. Observations of the system are used to estimate a posterior predictive distribution over the weights of a parametric polynomial kernel Artificial Neural Network. It is shown that the proposed method is able to perform accurate time predictions. Further, the proposed method is able to correctly account for model uncertainties and provide useful prediction uncertainty bounds.


## Code Repositories

### bayesianODEInverse

Probabilistic solution of chaotic dynamical system inverse problems using Bayesian Artificial Neural Networks


## 1 Introduction

This paper explores Bayesian Machine Learning-type methodologies for the inference of chaotic dynamical system models from trajectory observations. Problems of this form are also known as inverse problems for dynamical systems. The forward time behaviour of a dynamical system can be predicted given the solution of an inverse problem by, for example, direct simulation [15] or as a part of a filter which incorporates observational data into time evolution estimates (see [22], §17.4). The ability to predict the forward time behaviour of dynamical systems has applications across virtually all areas of science and engineering [20, 32]. If the equations are chaotic, then the single maximum a posteriori (MAP) forward estimate from an inverse problem solver is likely to be very inaccurate after only a short time. Inverse problems for dynamical systems are almost always ill-posed in the sense that many different models could be used to generate the observed data. Thus, probabilistic techniques are required to make usable inverse model estimates.

Artificial Neural Networks (ANNs) [9], augmented with Bayesian uncertainty quantification techniques from [7], are used in this paper to solve dynamical system inverse problems. Bayesian updating provides the fundamental method for solving inverse problems [33]. Following [10], the terms ANN and ‘compute graph’ will be used interchangeably. Compute graphs will be used in this paper to parametrically represent systems of Ordinary Differential Equations.

Dynamical system trajectory observations can be used to estimate the parameters of a compute graph representation of an ODE by minimising an objective functional. This minimisation procedure effectively performs a search over a parameterised space of possible functions. Nonlinearities in the composed functions allow for the parametric space to represent a very large number of possible functions. This is advantageous when solving inverse problems as it allows for the amount of a priori knowledge about the unknown functional form of the dynamical system in question to be minimised.

By utilising Bayesian methods, forward prediction errors can be properly quantified. A probabilistic formulation also allows for errors in the measurement data to be incorporated in a clear manner. Techniques for doing so are described within. Until recently, Bayesian inference over compute graph parameters has been a significant challenge. Fortunately, the method in [7] provides a computationally efficient approximation. The 'dropout' technique [30] (originally developed as a regulariser) is used to build up probabilistic output samples from a compute graph by randomly disabling certain parameters. This forces the encoding of the solution to spread out across the entire compute graph, thereby potentially avoiding overfitting of outlier data. Bayesian parameter uncertainty can be estimated by repeatedly sampling from the compute graph while dropout is enabled. This gives a computationally tractable approximation to a Gaussian process over the graph parameters.

In [7], forward projections of time series are made without learning an explicit ODE model. We detail a method for combining an ODE model and time discretisation errors into the solution of ODE inverse problems. Directly incorporating time discretisation into the inverse problem solution allows the learnt compute graph to be used with standard ODE solver algorithms (as described in [4]) for forward time estimation. Further, errors due to time discretisation can be quantified and more carefully controlled. Thus, the interpretability of the proposed method is improved relative to the technique employed in [7].

The Bayesian inverse dynamical system methodology described in this paper is tested by numerical analysis of the Sprott B system [29]. The effect of various hyperparameters, such as derivative discretisation error and dropout rate, was examined. The proposed method was able to provide forward time estimates, solving the inverse problem in a probabilistic sense. Accurate predictions were able to be made over time intervals up to several orders of magnitude longer than the observation data sampling rate, demonstrating that useful predictions can be made. Further, the proposed method was able to provide confidence intervals that correctly bounded test case data. The methodology presented within could be improved by further testing on more complex cases.

## 2 Bayesian Artificial Neural Networks

Neural networks are highly effective as nonlinear regression models. On the other hand, modern deep neural networks typically rely on using a massive number of parameters to ensure that gradient based optimisation will not get stuck in local minima. This is problematic for chaotic ODE inverse problems. The right balance between model size and learning capacity must be found. Bayesian modelling of network parameters can help by quantifying the true range of predictions the trained network is capable of producing for a given input. Full Bayesian modelling of neural network parameter uncertainty is computationally intractable. In [7], it is shown that approximate Bayesian inference of network parameters can be carried out by introducing a probabilistic compute graph architecture. This section describes this approximation technique, which is then applied to ODE inverse problems in subsequent sections.

### 2.1 Compute Graphs and Artificial Neural Networks

Artificial Neural Networks (ANNs) (composed of compute graphs, see [3, 10]) are able to represent nonlinear functions by weighted composition of simpler functions. Roughly, a neural network architecture is defined by a directed graph, consisting of a set of nodes and edges. A complete definition is provided in [10]. The power of compute graph representation of functions is that a large number of possible alternative function choices can be searched efficiently.

Compute graphs can be used for nonlinear regression problems. Given a compute graph architecture, the parameters defining how to weigh the composed functions can be adjusted until some error functional is minimised over the regression data points. Under certain circumstances, this optimisation can be achieved efficiently by combining Automatic Differentiation [23], the backpropagation method and Stochastic Gradient Descent [3, 27].

For the purposes of the regression problems considered in this paper, a subset of suitable compute graphs is described. A real valued feedforward, layerwise compute graph computes a function of the form

$$g_\theta : \mathbb{R}^m \longrightarrow \mathbb{R}^n \qquad (1)$$

as follows. Let the network have $L$ 'layers', each labelled by a natural number from $1$ to $L$. The input to each layer $i$ is a vector $a_{i-1} \in \mathbb{R}^{n_{i-1}}$. Each layer has a 'parameter matrix' (or 'weight matrix'), $\theta_i \in \mathbb{R}^{n_i \times n_{i-1}}$. Each layer computes a linear transformation of its inputs, $z_i$, computed by left-multiplying $a_{i-1}$ by $\theta_i$. Finally, each layer possesses a nonlinearity function, $\sigma_i$, which is applied elementwise to $z_i$. In summary, each layer computes

$$z_i := \theta_i a_{i-1}; \qquad (2)$$
$$a_i := \sigma_i(z_i) \qquad (3)$$

with

$$a_0 := x; \qquad (4)$$
$$g_\theta(x) := a_L, \qquad (5)$$

where $x$ is the input to the compute graph and $g_\theta(x)$ is the graph output function.

Compute graphs of the form defined above are termed 'nonrecurrent' (also known as 'feedforward') graphs. As the inputs to each layer $i$, $a_{i-1}$, depend only on layers $j$ for $j < i$, the flow of information is unidirectional. Recurrent graphs, by contrast, allow for a layer to have inputs from layers $j \geq i$. Details regarding recurrent graphs can be found in [5]. Recurrent graphs will not be considered further in this paper.

The search for compute graph weights, from the set of all possible weights, is an optimisation problem. Let $\theta$ be the set of all weights in the network across all layers. Then

$$\theta := \{\theta_i\}_{i=1}^{L}. \qquad (6)$$

Further, let $\Theta$ denote the set of all possible weights, such that $\theta \in \Theta$.

The output of the network can be written directly as the composition of linear combinations of inputs and the application of the nonlinearities as follows:

$$g_\theta(x) = \sigma_L(\theta_L \sigma_{L-1}(\theta_{L-1} \sigma_{L-2}(\ldots \sigma_1(\theta_1 x) \ldots))). \qquad (7)$$

If $g_\theta$ should approximate some given function, the values of $\theta$ can be found by optimising some loss functional, $J(T, \theta)$, given a training data set, $T$. Define the training data set, $T$, as

$$T := \{(x_i, g_i)\}_{i=1}^{M}, \qquad (8)$$

where $(x_i, g_i)$ are given value pairs of the function to be approximated.

Training seeks some optimal weights $\theta^*$ such that

$$\theta^* \in \underset{\theta \in \Theta}{\arg\min}\; J(T, \theta). \qquad (9)$$

A common choice for the loss functional is a 2-norm over $T$:

$$J(T, \theta) = \frac{1}{M} \sum_{i=1}^{M} \| g_\theta(x_i) - g_i \|_2^2, \qquad (10)$$

where $\| \cdot \|_2$ denotes the Euclidean norm on $\mathbb{R}^n$. Other loss functionals are also possible [9].

For this paper, it is assumed that

$$J(T, \theta) \geq 0 \qquad (11)$$

for all $\theta \in \Theta$.

In the case that the nonlinearities $\sigma_i$ are piecewise (or weakly) differentiable and that the graph is nonrecurrent, Stochastic Gradient Descent [9] can be used to find an approximation to $\theta^*$ by iteratively moving in the direction of decreasing $J$. Let $\theta_j$ denote the $j$-th iteration of the gradient descent process. Then, for each weight, an approximation to a local minimum can be found by computing

$$\theta_{j+1} := \theta_j - \alpha \nabla_\theta J(T, \theta_j) \qquad (12)$$

where $\alpha$ is the learning rate (gradient descent step size) and $\nabla_\theta J$ is the derivative of $J$ with respect to all weights, computed at $\theta_j$. The gradients can be computed efficiently by the backpropagation method (an application of the chain rule [3]).

The computation of these gradients is typically carried out using Automatic Differentiation methods. Many software packages exist for building and optimising compute graphs, including Tensorflow [1]. Further, adaptive learning rates are typically used to improve the optimisation performance over the basic SGD algorithm outlined above. For instance, the Adam optimiser [17] works well for many problems.
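As a concrete illustration of the layerwise computation in eqs. 2-5 and the gradient descent update of eq. 12, the following is a minimal sketch in plain NumPy (not the code accompanying the paper; the toy data, network sizes and learning rate are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data T = {(x_i, g_i)}: here g(x) = sin(x) on [-2, 2].
X = rng.uniform(-2.0, 2.0, size=(64, 1))
G = np.sin(X)

# Layer weight matrices theta_1, theta_2 (biases omitted for brevity).
theta1 = rng.normal(0.0, 0.5, size=(16, 1))
theta2 = rng.normal(0.0, 0.5, size=(1, 16))

def forward(x):
    z1 = theta1 @ x.T            # z_1 := theta_1 a_0      (eq. 2)
    a1 = np.tanh(z1)             # a_1 := sigma_1(z_1)     (eq. 3)
    z2 = theta2 @ a1             # z_2 := theta_2 a_1
    return z1, a1, z2.T          # identity output nonlinearity

alpha = 0.05                     # learning rate
for step in range(2000):
    z1, a1, pred = forward(X)
    err = pred - G                              # residual of the 2-norm loss
    # Backpropagation (chain rule) through the two layers.
    grad_theta2 = (err.T @ a1.T) / len(X)
    dz1 = (theta2.T @ err.T) * (1.0 - a1 ** 2)  # tanh'(z) = 1 - tanh(z)^2
    grad_theta1 = (dz1 @ X) / len(X)
    theta2 -= alpha * grad_theta2               # theta_{j+1} = theta_j - alpha grad J
    theta1 -= alpha * grad_theta1

loss = np.mean((forward(X)[2] - G) ** 2)
print(f"final training loss: {loss:.4f}")
```

A framework such as Tensorflow would compute the same gradients by Automatic Differentiation rather than by hand.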

### 2.2 Standard neural network training as a maximum likelihood estimate

Standard neural network training can be viewed as obtaining a maximum likelihood estimate of the posterior, $P(\theta | T)$, where $T$ is some set of observational data that can be used to compute (or 'train') the weights as in eq. 8.

#### 2.2.1 Required probabilistic notation

The probabilistic notation used in this paper, summarised here, is as follows:

• $P(X)$ denotes a distribution (a measure that may be applied to events) of $X$.

• $P(X = x)$ denotes the probability of event $X = x$.

• $P(Y | X)$ denotes the conditional probability of $Y$ given $X$, to be understood as a distribution over $Y$ that is dependent on $X$.

• Marginalisation of $X$ from a distribution over $Y$ and $X$ is the operation

$$P(Y) = \int P(Y | X = x)\, dP(X = x). \qquad (13)$$

Marginalisation is also denoted by the shorthand

$$P(Y) = \int P(Y | X)\, dP(X) \qquad (14)$$

in this paper.

#### 2.2.2 Gibbs measure definition

A definition of the Gibbs measure is also required. The Gibbs measure (defined rigorously in [8] and roughly here) over some space $X$ consisting of points $x \in X$ is given by

$$P(X = x) = \frac{\exp(-\beta E(x))}{\int \exp(-\beta E(x))\, dx} = \frac{1}{Z(\beta)} \exp(-\beta E(x)) \qquad (15)$$

where:

• $E(x)$ is a so-called 'energy function'. Energies can be used to define the relative probabilities of each $x \in X$.

• $\beta$ is a parameter which defines how 'spread out' $P(X)$ is over $X$. It can be considered to be analogous to the inverse of the variance of a Gaussian distribution.

• $Z(\beta)$ is a normalising function, referred to as a 'partition function', which ensures that $P(X)$ is a valid probability measure.

The Gibbs measure as given in eq. 15 is defined as long as the integral in $Z(\beta)$ converges [8]. In the limit that $\beta$ goes to positive infinity, all probability mass over $X$ will be concentrated at the minima of $E(x)$. In other words, eq. 15 converges (weakly) to the Dirac measure at the minimum of $E(x)$.
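A small numerical sketch of eq. 15 over a finite state space (with illustrative energies only) shows both the normalising role of the partition function and the concentration of mass at the energy minimum as $\beta$ grows:

```python
import numpy as np

def gibbs(energies, beta):
    """Gibbs measure over a finite state space: p(x) ∝ exp(-beta * E(x))."""
    w = np.exp(-beta * np.asarray(energies, dtype=float))
    return w / w.sum()           # division by the partition function Z(beta)

E = [1.0, 0.2, 3.0]              # state 1 has the minimum energy

p_small = gibbs(E, beta=0.1)     # nearly uniform: measure is 'spread out'
p_large = gibbs(E, beta=50.0)    # almost all mass at argmin E (Dirac limit)

print(p_small, p_large)
```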

#### 2.2.3 Maximum likelihood approximated from Bayes theorem

Using Bayes theorem, the posterior distribution over the weights is

$$P(\theta | T) = \frac{P(T | \theta) P(\theta)}{P(T)} \qquad (16)$$
$$= \frac{P(T | \theta) P(\theta)}{\int_\Theta P(T | \theta)\, dP(\theta)}. \qquad (17)$$

This section derives a maximum likelihood estimate, so that a more general probabilistic approach can be adopted in later parts of this paper.

The output of the network, after training, can be computed by marginalising over the weight posterior to calculate the posterior predictive distribution for $g(x)$. This gives

$$P(g(x) | T) = \int_\Theta P(g_\theta(x) | \theta)\, dP(\theta | T) \qquad (18)$$

where $g_\theta(x)$ is the output of a compute graph as in eq. 7 with weights $\theta$. The posterior $P(\theta | T)$ must be estimated using eq. 16.

Following the techniques described in [33], the likelihood ratio over weight space can (by assumption) be modelled as a Gibbs measure by letting

$$\frac{P(T | \theta)}{P(T)} = \frac{1}{Z} \exp(-\beta J(T, \theta)) \qquad (19)$$

where the loss functional $J(T, \theta)$ in eq. 19 defines the error between the data $T$ and $g_\theta$ as in eq. 10. In eq. 19, the partition function has been modified to absorb the normalising factor, $P(T)$, so that $Z$ is given by

$$Z = \int_\Theta \exp(-\beta J(T, \theta))\, dP(\theta) \qquad (20)$$
$$= \mathbb{E}_{P(\theta)}[\exp(-\beta J(T, \theta))]. \qquad (21)$$

The normalising factor ensures that the posterior $P(\theta | T)$ is a probability distribution (using eq. 16) so

$$\int dP(\theta | T) = \frac{1}{Z} \int \exp(-\beta J(T, \theta))\, dP(\theta) = 1. \qquad (22)$$

Assuming a prior, $P(\theta)$, equal to a point mass at $\theta_j$, then, from eq. 16, the likelihood can be expressed as

$$P(\theta | T) = \frac{P(T | \theta_j)}{P(T)}. \qquad (23)$$

Taking logs of eq. 23 gives

$$\log P(\theta | T) = \log \frac{P(T | \theta_j)}{P(T)} \qquad (24)$$
$$= \log\left(\frac{1}{Z} \exp(-\beta J(T, \theta_j))\right) \qquad (25)$$
$$= -\beta J(T, \theta_j) - \log Z. \qquad (26)$$

$P(\theta | T)$ is bounded between $0$ and $1$, so $\log P(\theta | T) \leq 0$. Since $\log$ is monotonic, the maximum likelihood estimate of $\theta$ is found when $\log P(\theta | T)$ is maximised.

The posterior can then be maximised by gradient ascent by iteratively setting

$$\theta_{j+1} := \theta_j + \alpha \nabla_\theta \log P(\theta_j | T). \qquad (27)$$

Taking gradients of eq. 26 with respect to $\theta$ (assuming that all terms in eq. 26 are smooth in $\theta$) gives

$$\nabla_\theta \log P(\theta | T) = -\beta \nabla_\theta J(T, \theta_j) - \nabla_\theta \log Z. \qquad (28)$$

This is simplified by noting that $\nabla_\theta \log Z = 0$, as the $Z$ defined in eq. 20 is a constant.

Computing the maximum of $\log P(\theta | T)$ iteratively by gradient ascent yields

$$\theta_{j+1} = \theta_j - \alpha \nabla_\theta J(T, \theta_j), \qquad (29)$$

where the constant parameter $\beta$ has been absorbed into $\alpha$.

The local optimisation target in eq. 29 is identical to eq. 12. That is, minimisation of $J(T, \theta)$ finds the maximum likelihood estimate of the posterior distribution of the weights, given the training data, by iterating until

$$\theta_{j+1} \approx \theta^* \in \underset{\theta \in \Theta}{\arg\min}\; J(T, \theta). \qquad (30)$$

Then $\theta^*$ is an approximation of the maximum likelihood in the posterior $P(\theta | T)$. Assuming that the maximum likelihood estimate in eq. 30 is a reasonable approximation to the true posterior gives

$$P(\theta | T) \approx \delta_{\theta^*}. \qquad (31)$$

That is, the posterior is assumed to be approximated by the single point $\theta^*$.

The posterior predictive distribution for $g(x)$ is then

$$P(g(x) | T) = \int_\Theta P(g_\theta(x) | \theta)\, dP(\theta | T) \qquad (32)$$
$$\approx \int_\Theta P(g_\theta(x) | \theta)\, d\delta_{\theta^*} \qquad (33)$$
$$= P(g_{\theta^*}(x)) \qquad (34)$$
$$= \delta_{g_{\theta^*}(x)}. \qquad (35)$$

The approximate maximum a posteriori distribution for $g(x)$ in eq. 35, after standard neural network training, reduces to a deterministic function $g_{\theta^*}(x)$.

Unfortunately, a MAP estimate of a function is insufficient for the needs of this paper and a Bayesian method for approximation of the full posterior predictive distribution is required.

### 2.3 Dropout regularisation for neural networks

Bayesian updating of large parameter spaces is numerically intractable. In [7] an efficient approximation technique for parametric Gaussian process regression is introduced. For a compute graph with dropout layers [30], it can be shown that introducing dropout before weight layers is equivalent to an approximation of a probabilistic deep Gaussian process [6]. This section introduces the original dropout regularising prior, in preparation for section 2.4, which describes a method for estimating the posterior over all weights in a trained network.

Dropout randomly disconnects weights within a network. For a single layer, following the definitions for eq. 7, dropout can be implemented as follows. Define the inverse vector Bernoulli distribution (a specific sort of Bernoulli process [19]) of dimension $n$ to be a vector in $\mathbb{R}^n$ with random variable entries, $X_j$, for $j = 1, \ldots, n$, such that each $X_j$ is either $0$ or $1$ and that the probability that $X_j = 0$, denoted $r$, is the same for all $j$. Denote the inverse vector Bernoulli distribution by $D(r, n)$, which is such that

$$P(X_j = 1) = 1 - r \quad \text{for } j = 1, \ldots, n. \qquad (36)$$

Returning to the definition of dropout, let $d_i \sim D(r, n_i)$ (for the $i$-th layer in a compute graph) be a vector in $\mathbb{R}^{n_i}$. The value $r$ will be referred to as the 'dropout rate' and a sample $d_i$ referred to as a 'dropout mask' for layer $i$.

Define the Hadamard product, denoted $\circ$, of two vectors in $\mathbb{R}^n$ as the entrywise product

$$\circ : \mathbb{R}^n \times \mathbb{R}^n \longrightarrow \mathbb{R}^n \qquad (37)$$

such that, for $u, v \in \mathbb{R}^n$, the entries of $u \circ v$ (denoted $(u \circ v)_j$) are given by $(u \circ v)_j = u_j v_j$.

The dropout mask is applied to layer $i$ by taking the Hadamard product of $d_i$ with $z_i$. The layer output, $a_i$, is modified to

$$a_i := \sigma_i(d_i \circ z_i). \qquad (38)$$

Entries of $z_i$ multiplied with entries of $d_i$ equal to zero are thus 'dropped out' from the computation of $a_i$. Denote the set of all (independent) dropout distributions across the network by

$$D(r) := \{D(r, n_i)\}_{i=1}^{L}, \qquad (39)$$

so that $d$ is the set of sampled dropout masks for all layers

$$d := \{d_i \sim D(r, n_i)\}_{i=1}^{L}. \qquad (40)$$

Denote the function computed by the network, with dropout mask sample $d$ applied to $\theta$ (for all layers in the network), as

$$g_\theta(x | d) := g_\theta(x) \text{ with } d \text{ applied to } \theta. \qquad (41)$$
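The mask sampling of eq. 36 and the masked layer computation of eq. 38 can be sketched as follows (NumPy; the layer sizes and nonlinearity are arbitrary illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(1)

def dropout_mask(r, n):
    """Sample d ~ D(r, n): each entry is 0 with probability r, else 1."""
    return (rng.random(n) >= r).astype(float)

def layer_with_dropout(theta, a_prev, r, sigma=np.tanh):
    z = theta @ a_prev               # z_i := theta_i a_{i-1}
    d = dropout_mask(r, z.shape[0])
    return sigma(d * z)              # a_i := sigma_i(d_i ∘ z_i)

theta = rng.normal(size=(5, 3))
a0 = rng.normal(size=3)
a1 = layer_with_dropout(theta, a0, r=0.4)
print(a1)    # dropped units contribute exactly tanh(0) = 0
```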

Dropout was first designed as a regularisation method (in the sense of Tikhonov regularisation, see [22]). Regularisation methods are equivalent, in a Bayesian optimisation sense, to the selection of some prior over the weight space [22]. In the original implementation of dropout, randomisation was used only during training, and then disabled when using the compute graph for predictions. That is, after training was completed the dropout layer was modified to have $r = 0$.

In standard dropout training, the maximum likelihood estimate of $\theta$ is found as in eq. 29, with the additional step that the posterior is calculated by marginalising over $D(r)$. Training (minimising $J$) over some data set $T$ with dropout enabled computes a posterior distribution for the weights. The model likelihood given a set of weights $\theta$ and a dropout mask can be computed by

$$\frac{P(T | \theta)}{P(T)} = \int_{D(r)} \frac{P(T | \theta, d = \delta)}{P(T)}\, dP(d = \delta | \theta) \qquad (42)$$
$$= \int_{D(r)} \frac{P(T | \theta, d = \delta)}{P(T)}\, dP(d = \delta) \qquad (43)$$

as $d$ is independent of $\theta$. Then eq. 43 can be expressed as an expectation over $D(r)$ as follows

$$\frac{P(T | \theta)}{P(T)} = \mathbb{E}_{D(r)}\left[\frac{P(T | \theta, d)}{P(T)}\right]. \qquad (44)$$

As in eq. 19, following [33], the likelihood ratio can be assumed to be a Gibbs measure:

$$\frac{P(T | \theta, d)}{P(T)} = \frac{1}{Z} \exp(-\beta J(T, \theta, d)) \qquad (45)$$

where $J(T, \theta, d)$ is the loss functional computed using $g_\theta(x | d)$ from eq. 41 (the network output calculated after applying the dropout mask to $\theta$). For example, using a 2-norm loss functional, as in eq. 10, and the definition of $T$ in eq. 8 yields

$$J(T, \theta, d) = \frac{1}{M} \sum_{i=1}^{M} \| g_\theta(x_i | d) - g_i \|_2^2. \qquad (46)$$

From this point, essentially the same procedure as that used in eq. 29 can be used to find the posterior maximum likelihood estimate of $\theta$ given $T$ and $d$.

Using Bayes theorem, note that the posterior for $\theta$ given $T$ and a particular dropout mask $d$ is

$$P(\theta | T, d) = \frac{P(T | \theta, d)}{P(T)} P(\theta | d). \qquad (47)$$

Assuming the prior $P(\theta | d)$ is given by a point mass at $\theta_j$ and using eq. 45, the probability of the weight posterior given a dropout mask, $d$, is

$$\log P(\theta | T, d) = \log \frac{P(T | \theta_j, d)}{P(T)} \qquad (48)$$
$$= -\beta J(T, \theta_j, d) - \log Z. \qquad (49)$$

Following the discussion in section 2.2.3, the posterior probability for $\theta$ will be maximised by the value of $\theta$ which maximises $\log P(\theta | T, d)$. Taking derivatives of the posterior in eq. 49 with respect to $\theta$, we get

$$\nabla_\theta \log P(\theta | T, d) = -\beta \nabla_\theta J(T, \theta_j, d) - \nabla_\theta \log Z \qquad (50)$$
$$= -\beta \nabla_\theta J(T, \theta_j, d) \qquad (51)$$

where, as discussed in section 2.2, $\nabla_\theta \log Z = 0$.

Taking expectations of eq. 51 over all dropout masks gives

$$\mathbb{E}_{D(r)}[\nabla_\theta \log P(\theta | T, d)] = \mathbb{E}_{D(r)}[-\beta \nabla_\theta J(T, \theta_j, d)]; \qquad (52)$$
$$\nabla_\theta \mathbb{E}_{D(r)}[\log P(\theta | T, d)] = -\beta \nabla_\theta \mathbb{E}_{D(r)}[J(T, \theta_j, d)]. \qquad (53)$$

Gradient ascent on this expected log posterior then updates the weights by

$$\theta_{j+1} = \theta_j + \alpha \nabla_\theta \mathbb{E}_{D(r)}[\log P(\theta | T, d)] \qquad (54)$$
$$= \theta_j - \alpha \nabla_\theta \mathbb{E}_{D(r)}[J(T, \theta_j, d)] \qquad (55)$$

where the term $\beta$ has been absorbed into the constant $\alpha$ in eq. 55.

Then, the maximum likelihood estimate for $\theta$, averaged across dropout masks, can be approximated by iteratively updating until

$$\theta_{j+1} \approx \theta^* \in \underset{\theta \in \Theta}{\arg\min}\; \mathbb{E}_{D(r)}[J(T, \theta, d)]. \qquad (56)$$

The dropout modified gradient can be estimated by Monte Carlo sampling $K$ times from $D(r)$:

$$\mathbb{E}_{D(r)}[J(T, \theta, d)] \approx \frac{1}{K} \sum_{k=1}^{K} J(T, \theta, d_k) \quad \text{where } d_k \sim D(r). \qquad (57)$$
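The Monte Carlo estimate of eq. 57 amounts to averaging the loss (or its gradient) over $K$ freshly sampled masks at each update. A hedged sketch, using a single linear layer with the mask applied directly to the weights and hypothetical toy data (not the paper's model or data):

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy linear-regression data (illustrative only).
X = rng.normal(size=(32, 4))
G = X @ np.array([1.0, -2.0, 0.5, 0.0])

theta = np.zeros(4)
r, K, alpha = 0.2, 10, 0.05

def loss_and_grad(theta, d):
    """2-norm loss J(T, theta, d) for one mask d, and its gradient in theta."""
    err = X @ (d * theta) - G        # dropout mask applied to the weights
    return np.mean(err ** 2), 2.0 * d * (X.T @ err) / len(X)

for step in range(500):
    # E_{D(r)}[grad J] approximated by K Monte Carlo mask samples (eq. 57).
    grads = [loss_and_grad(theta, (rng.random(4) >= r).astype(float))[1]
             for _ in range(K)]
    theta -= alpha * np.mean(grads, axis=0)

print(theta)    # weights learnt under the dropout-averaged objective
```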

Training with dropout has the effect of adding a regularising prior over $\Theta$, which aims to prevent the network from overfitting. The learnt parameters, $\theta^*$, are the maximum likelihood estimate for the posterior $P(\theta | T)$.

The posterior predictive distribution for $g(x)$, after disabling dropout, is then

$$P(g(x) | T) = \int_\Theta P(g_\theta(x))\, dP(\theta | T). \qquad (58)$$

As in eq. 31, standard dropout training assumes that the posterior density for $\theta$ in eq. 58 is approximately a point mass at the maximum likelihood estimate $\theta^*$ from eq. 56. Then,

$$P(g(x) | T) \approx \int_\Theta P(g_\theta(x))\, d\delta_{\theta^*} \qquad (59)$$
$$= P(g_{\theta^*}(x)). \qquad (60)$$

As in eq. 35, the posterior predictive distribution over the weights in eq. 59 is a deterministic approximation $g_{\theta^*}(x)$.

### 2.4 Bayesian neural network approximation

While the regularising action of dropout-enabled training can be beneficial, additional steps must be taken to approximate the actual posterior predictive distribution for $g(x)$ over all weights. In [7], an extension to the dropout method is introduced which is able to estimate the full posterior. Roughly, the technique is as follows. The starting point is to first train using dropout regularisation. Then, rather than disable dropout to find a maximum a posteriori estimate of $g(x)$ for prediction, dropout is left active during prediction. Repeated sampling from the dropout-enabled predictive network can be used to estimate $P(g(x) | T)$. Denote the posterior predictive distribution (used to approximate $P(g(x) | T)$) after training by

$$P(\hat{g}(x) | T). \qquad (61)$$

Crucially, [7] demonstrates that the posterior of the network with dropout-enabled prediction is approximately a Gaussian process. This is done by showing that the Kullback-Leibler divergence between the posterior of a deep Gaussian process and the posterior predictive distribution of an ANN is minimised by the dropout training objective. The first two moments of the approximate Gaussian process representing $P(\hat{g}(x) | T)$ are sufficient to describe the trained ANN predictions. These moments can be recovered efficiently by Monte Carlo sampling.

Training for the full posterior estimation method is the same as that shown in eq. 55. However, rather than disable the dropout layers when computing $g(x)$, the dropout layers remain active. Denote samples from the probabilistic network as

$$g_\theta(x | d) \sim P(g(x | d) | \theta) \qquad (62)$$

where $P(g(x | d) | \theta)$ is the probability to sample some value $g(x | d)$ given $\theta$ and a dropout mask, $d$. The dropout mask is assumed to be sampled from $D(r)$ as in eq. 39.

Assigning a single value (a MAP estimate) $\theta^*$ of $\theta$ to the network simplifies $P(g(x | d) | T)$ to

$$P(g(x | d) | T) := \int_\Theta P(g(x | d) | \theta)\, d\delta_{\theta^*} \qquad (63)$$
$$= P(g_{\theta^*}(x | d)) \qquad (64)$$

where $\theta^*$ is the set of weight parameters found after training the network by gradient descent. That is, $\theta^*$ are the maximum likelihood parameters for $P(\theta | T)$ as in eq. 56.

The $P(\hat{g}(x) | T)$ can be approximated by the posterior predictive distribution computed by marginalising over all dropout masks in $D(r)$, so

$$P(\hat{g}(x)) = \int P(g_{\theta^*}(x | d))\, dD(r). \qquad (65)$$

It is shown in [7] that $P(\hat{g}(x) | T)$ can be approximated efficiently by a Gaussian process of the form

$$P(\hat{g}(x) | T) \approx \mathcal{N}(\mu_{\theta^*}(x), \sigma_{\theta^*}(x)) \qquad (66)$$

where the mean, $\mu_{\theta^*}(x)$, and standard deviation, $\sigma_{\theta^*}(x)$, of the process are computed empirically by repeatedly sampling from the dropout-enabled network.

Denote the $k$-th sample from the dropout-enabled network by

$$g_k(x | d) \sim P(g_{\theta^*}(x | d_k)), \quad d_k \sim D(r). \qquad (67)$$

Then the empirical mean, $\mu_{\theta^*}(x)$, for $P(\hat{g}(x) | T)$ can be computed by

$$\mu_{\theta^*}(x) \approx \frac{1}{K} \sum_{k=1}^{K} g_k(x | d) \qquad (68)$$

where $K$ is some finite number of samples from the dropout-enabled network. The standard deviation can be similarly estimated by

$$\sigma_{\theta^*}(x) \approx \sqrt{\frac{1}{K-1} \sum_{k=1}^{K} \left(g_k(x | d) - \mu_{\theta^*}(x)\right)^2}. \qquad (69)$$

Calculating the output of a compute graph for a given input is computationally cheap, so estimation of $\mu_{\theta^*}(x)$ and $\sigma_{\theta^*}(x)$ is tractable. The repeated sampling of $g_{\theta^*}(x | d)$ adds only a constant overhead to the computation of a prediction from the network.

As $P(\hat{g}(x) | T)$ in eq. 66 is a Gaussian distribution for each input, $x$, prediction confidence intervals at each $x$ can be obtained. Upper and lower confidence bounds for a Gaussian distribution can be computed respectively by:

$$g_U(x) = \mu_{\theta^*}(x) + c\, \sigma_{\theta^*}(x); \qquad (70)$$
$$g_L(x) = \mu_{\theta^*}(x) - c\, \sigma_{\theta^*}(x). \qquad (71)$$

The factor $c$ is the number of standard deviations away from the mean required for some confidence level. For example, $c \approx 1.96$ corresponds to a $95\%$ confidence interval. See [19] for further details on this subject. Confidence intervals can be used to simplify the interpretation of the quality of the predictions produced by the Bayesian compute graph posterior and are used to present the numerical results in section 6.
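Eqs. 66-71 reduce, in practice, to repeated stochastic forward passes. The sketch below uses a stand-in predictor with a fresh dropout mask per call (a toy surrogate, not the paper's trained network) to recover the empirical moments and confidence bounds:

```python
import numpy as np

rng = np.random.default_rng(3)

# Stand-in for a trained dropout-enabled network: fixed weights plus
# mask-induced randomness (purely illustrative).
theta_star = np.array([0.7, -1.3])
r = 0.1

def predict(x):
    d = (rng.random(theta_star.shape) >= r).astype(float)  # d ~ D(r, n)
    return float((d * theta_star) @ np.array([x, x ** 2]))

def mc_dropout_moments(x, K=2000):
    samples = np.array([predict(x) for _ in range(K)])     # g_k(x|d) draws
    mu = samples.mean()                                    # eq. 68
    sigma = samples.std(ddof=1)                            # eq. 69
    return mu, sigma

mu, sigma = mc_dropout_moments(0.5)
c = 1.96                                                   # ~95% confidence
upper, lower = mu + c * sigma, mu - c * sigma              # eqs. 70-71
print(mu, sigma, (lower, upper))
```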

## 3 Probabilistic representations of dynamical systems

This section presents a probabilistic ODE representation that can be used for solving inverse problems with Bayesian ANNs.

### 3.1 Dynamical systems

Following the notation in §4.1 of [10], we consider in this paper continuous-time dynamical systems that can be expressed as coupled first-order Ordinary Differential Equations (ODEs) of the form

$$\frac{d}{dt} u(t) = f(t, u(t)) \qquad (72)$$

where:

• $t \in \mathbb{R}$ represents time;

• $u(t) \in \mathbb{R}^n$ is the vector of values representing the variables of the system at time $t$;

• $f(t, u(t))$ represents the prescribed time derivatives of $u(t)$.

For the remainder of this paper, we assume that the unknown ODE model, $f$, is autonomous [4]. Then

$$\frac{d}{dt} u(t) = f(u(t)). \qquad (73)$$

Inverse problems for non-autonomous systems require further assumptions and, while not explored further in this paper, would be a useful avenue for future work.

For notational convenience, define

$$u_t := u(t). \qquad (74)$$

A solution to the ODE in eq. 72 satisfies

$$u_{t+s} = u_t + \int_t^{t+s} f(u_\tau)\, d\tau. \qquad (75)$$

The deterministic (analytic) trajectory of a dynamical system from time $t$ to $t+s$ refers to the set of all states occupied by the dynamical system,

$$\{(a, u_a) : a \in [t, t+s]\}, \qquad (76)$$

ordered by time. Numerical ODE solvers, described in [4], can be used to produce approximations to the analytic trajectory. These approximations are denoted

$$\{u_{t_i}\}_{i=1}^{N} = \mathrm{ODESOLVE}(\{t_i\}_{i=1}^{N}, u_{t_0}, f) \qquad (77)$$

where $u_{t_0}$ is the initial value for the trajectory, $\{t_i\}_{i=1}^{N}$ is some finite set of times at which the values of $u_t$ will be computed, and $f$ is the ODE derivative function as in eq. 72.

In this paper, only continuous-time dynamical systems are investigated, although the numerical methods presented could be applied to both continuous-time and discrete-time systems.
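A minimal ODESOLVE in the sense of eq. 77 is the forward Euler scheme, the simplest of the solvers described in [4]. The right-hand side below is the commonly quoted Sprott B vector field, included only as an illustrative autonomous $f$:

```python
import numpy as np

def ode_solve(times, u0, f):
    """Forward-Euler approximation to {u_{t_i}} = ODESOLVE({t_i}, u_{t_0}, f)."""
    traj = [np.asarray(u0, dtype=float)]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        h = t_next - t_prev
        traj.append(traj[-1] + h * f(traj[-1]))   # u_{t+h} ≈ u_t + h f(u_t)
    return np.array(traj)

def sprott_b(u):
    # Commonly quoted Sprott B right-hand side (autonomous, as in eq. 73).
    x, y, z = u
    return np.array([y * z, x - y, 1.0 - x * y])

times = np.linspace(0.0, 1.0, 1001)
traj = ode_solve(times, u0=[0.1, 0.1, 0.1], f=sprott_b)
print(traj.shape)    # one state per requested time
```

Higher-order solvers (e.g. Runge-Kutta schemes) follow the same interface with a more accurate per-step update.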

### 3.2 Markov process representation of continuous-time dynamical systems

Continuous-time dynamical systems can be represented as a Markov process by making reference to the Gibbs measure. An alternative representation, not discussed in this paper, is the Stochastic Partial Differential Equation (SPDE) random variable approach (as in [13]). For rigorous definitions of continuous-time stochastic processes, see §IV of [21]. We will outline only the necessary parts here.

For a dynamical system, with time derivative model $f$, the Gibbs measure can be used to define the probability to transition from one state, $u_t$, to another state, $u_{t+s}$, in $s$ time units. First, denote the probability to be in a state $u_t$ at time $t$ by

$$P_t(u_t). \qquad (78)$$

Define the so-called 'transition probability' as the probability to go from some state, $u_t$, to another state, $u_{t+s}$, in $s$ time units by

$$P(u_{t+s} | u_t). \qquad (79)$$

Then

$$P_{t+s}(u_{t+s}) = \int_{\mathbb{R}^n} P(u_{t+s} | u_t)\, dP_t(u_t). \qquad (80)$$

#### 3.2.1 Discretised trajectory Markov model formulation

A discrete trajectory is considered to be a set of probabilities for $u_{t_i}$ at some finite set of times, $\{t_i\}_{i=1}^{N}$. The assumed Markov property of the transition probability can be used to define trajectories in terms of transitions between states at one time, $t_i$, to another, $t_{i+1}$, as a series of smaller steps from $t_1$ to $t_N$. A probabilistic trajectory is then given by

$$\{P_{t_i}(u_{t_i})\}_{i=1}^{N} \qquad (81)$$

where each $P_{t_i}(u_{t_i})$ is calculated by

$$P_{t_1}(u_{t_1}) = \text{assumed a priori}; \qquad (82)$$
$$P_{t_{i+1}}(u_{t_{i+1}}) = P(u_{t_{i+1}} | u_{t_i})\, P_{t_i}(u_{t_i}); \qquad (83)$$
$$P_{t_{i+2}}(u_{t_{i+2}}) = P(u_{t_{i+2}} | u_{t_{i+1}})\, P(u_{t_{i+1}} | u_{t_i})\, P_{t_i}(u_{t_i}); \qquad (84)$$

and so on. The entire trajectory can be computed by chaining together conditional probabilities:

$$P_{t_N}(u_{t_N}) = \left[\prod_{i=1}^{N-1} P(u_{t_{i+1}} | u_{t_i})\right] P_{t_1}(u_{t_1}). \qquad (85)$$
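The chaining of transition probabilities in eqs. 82-85 can be simulated by propagating samples through a transition kernel. The sketch below assumes a hypothetical Gaussian kernel centred on an Euler step and a toy linear model, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)

def transition_sample(u_t, h, f, sigma):
    """Draw u_{t+h} ~ P(u_{t+h} | u_t): a Gaussian centred on an Euler
    step (an assumed kernel, for illustration only)."""
    return u_t + h * f(u_t) + rng.normal(0.0, sigma, size=u_t.shape)

def sample_trajectory(times, u0, f, sigma=1e-3):
    """One draw from the probabilistic trajectory {P_{t_i}(u_{t_i})}."""
    traj = [np.asarray(u0, dtype=float)]
    for t_prev, t_next in zip(times[:-1], times[1:]):
        traj.append(transition_sample(traj[-1], t_next - t_prev, f, sigma))
    return np.array(traj)

f = lambda u: -u                       # toy linear decay model
times = np.linspace(0.0, 1.0, 101)
draws = np.array([sample_trajectory(times, [1.0], f) for _ in range(200)])
print(draws.mean(axis=0)[-1])          # empirical mean of P_{t_N}
```

Repeating the chain many times builds up an empirical approximation to each marginal $P_{t_i}(u_{t_i})$.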

### 3.3 ODE model parameter uncertainty

Uncertainty regarding model parameters can be included by replacing the dependency on $f$ in the transition probability with $f_\theta$ and marginalising over all $\theta \in \Theta$.

The probability of an output, $f_\theta(u_t)$, given the set of inputs will be denoted

$$P(f_\theta(u_t)) := P(f_\theta(u_t) | \theta). \qquad (86)$$

Let $\hat{f}(u_t)$ refer to the value of $f_\theta(u_t)$ marginalised over $\theta$. Then

$$P(\hat{f}(u_t)) := \int_\Theta P(f_\theta(u_t))\, dP(\theta). \qquad (87)$$

The values of $\theta$ can be estimated using Bayes rule given training data, $T$, yielding a posterior distribution, $P(\theta | T)$. The posterior over the weights can be used to estimate the posterior predictive distribution

$$P(\hat{f}(u_t) | T) := \int_\Theta P(f_\theta(u_t))\, dP(\theta | T). \qquad (88)$$

The posterior predictive distribution in eq. 88 can be combined with an ODE integral discretisation technique to recover estimates of the future states of the dynamical system.

#### 3.3.1 Bayesian networks for dynamical system inverse problems

To solve a dynamical system inverse problem, the Bayesian Gaussian approximation can be applied to the problem of learning the posterior predictive distribution for $f$ given observations of some process $u_t$.

Let the model for $f$ with dropout augmentation be denoted

$$f_\theta(u_t | d). \qquad (89)$$

Training with dropout activated can be used to recover the MAP estimate, $\theta^*$. After training, a Gaussian approximation to $P(\hat{f}(u_t) | T)$ can be recovered for eq. 88 by marginalising out the dropout layers, as in eq. 66, so that

$$P(\hat{f}(u_t) | T) = \int \int_\Theta P(f_\theta(u_t | d))\, dP(\theta | T)\, dD(r) \qquad (90)$$
$$\approx \int \int_\Theta P(f_\theta(u_t | d))\, d\delta_{\theta^*}\, dD(r) \qquad (91)$$
$$\approx \mathcal{N}(\mu_{\theta^*}(u_t), \sigma_{\theta^*}(u_t)) \qquad (92)$$

where $\mu_{\theta^*}(u_t)$ and $\sigma_{\theta^*}(u_t)$ can be estimated by sampling, as in eqs. 68 and 69 respectively.

## 4 Solution of the inverse problem

Although the previous section describes the posterior predictive distribution for $f$ given $\theta$, the method for finding $P(\theta | T)$ given observations has not yet been described. The particulars depend on additional assumptions. First, an explicit probabilistic representation of the transition probability is required. A Gaussian (Gibbs measure) form is utilised. Second, the form of the error induced by numerical ODE integration schemes must be assumed. We take the error to be additive Gaussian noise. By making these assumptions, the transition probability can be approximated as a Gaussian process. This, in combination with the Bayesian compute graph approximation method, allows for the ODE problem to be solved in a computationally tractable manner. This section derives the form of the inverse problem approximation scheme used for the numerical analysis in section 6.

### 4.1 Finding the posterior distribution, assuming a 2-norm error distribution and Euler integration

The predictive Bayesian network computing f_θ(u_t | d) can be found by training a Bayesian compute graph on approximations to the time derivatives of u_t. In [10], a method for approximating f given an integral discretisation was used. In this paper a simpler method is shown, based on approximations to the time derivative of u_t.

Denote an approximation to f(u_t), computed using observations in T, by f_γ(u_t). Assuming that f_γ has some approximation error, γ, gives

f(u_t) = f_\gamma(u_t) + \gamma \qquad (93)

where the form of the implied distribution P(f(u_t) | f_γ(u_t)) depends on the exact choice of f_γ.

Then a dropout-enabled network can be trained, as in eq. 56, by finding

\theta^* \in \operatorname*{argmin}_{\theta \in \Theta} \; \mathbb{E}_{D(r)}\left[ J(T, \theta, d) \right]. \qquad (94)

We assume, for mathematical convenience, that the loss term has a 2-norm representation of the form

J(T, \theta, d) = \int \mathbb{E}_{P(f(u_t) \mid f_\gamma(u_t))}\left[ \| f(u_t) - f_\theta(u_t \mid d) \|_2^2 \right] du_t. \qquad (95)

This error can be approximated by taking B Monte Carlo samples of f(u_t) from P(f(u_t) | f_γ(u_t)):

J(T, \theta, d) \approx \int \frac{1}{B} \sum_{b=1}^B \| f_\gamma^b(u_t) - f_\theta(u_t \mid d) \|_2^2 \, du_t \qquad (96)

\text{for } f_\gamma^b(u_t) \sim P(f(u_t) \mid f_\gamma(u_t)). \qquad (97)
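The Monte Carlo loss of eqs. 96-97 can be sketched directly. This is an illustrative stand-in, not the paper's code: the arrays of derivative estimates and network outputs, and the Gaussian sampling distribution, are assumptions made concrete for the example:

```python
import numpy as np

rng = np.random.default_rng(1)

def mc_loss(f_gamma, f_theta_out, sigma_gamma, B, rng):
    """Monte Carlo approximation of the 2-norm loss in eq. 96.

    f_gamma:     (N, d) derivative estimates at each observation
    f_theta_out: (N, d) network outputs f_theta(u_t | d)
    """
    total = 0.0
    for _ in range(B):
        # Draw f_gamma^b ~ P(f(u_t) | f_gamma(u_t)), here Gaussian (eq. 97).
        f_b = f_gamma + sigma_gamma * rng.normal(size=f_gamma.shape)
        total += np.sum((f_b - f_theta_out) ** 2, axis=1).mean()
    return total / B

# With perfectly matching targets and outputs, the loss reduces to the
# injected sampling noise alone.
f_gamma = np.zeros((10, 2))
f_theta_out = np.zeros((10, 2))
loss = mc_loss(f_gamma, f_theta_out, sigma_gamma=0.01, B=50, rng=rng)
```

In practice f_theta_out would be produced by a stochastic (dropout-enabled) forward pass, so the expectation over D(r) in eq. 94 is taken implicitly across training steps.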

#### 4.1.1 Euler integral approximation for fγ

To actually compute eq. 96, the distribution P(f(u_t) | f_γ(u_t)) must be known. In this paper the data, T, is assumed to consist of observations of the process, u_t, at discrete times. Then

T := \{ t_i, u_{t_i} \}_{i=1}^N. \qquad (98)

Further, the t_i values are assumed to be evenly spaced, so that the observations are sampled at a constant time step, h. This assumption gives

h = t_{i+1} - t_i. \qquad (99)

The simplest method to compute the approximate time derivatives from the training observations is to use a first-order finite difference method of the form

f_\gamma(u_{t_i}) = \frac{u_{t_{i+1}} - u_{t_i}}{h} \qquad (100)

f(u_{t_i}) \approx f_\gamma(u_{t_i}) + O(h^2). \qquad (101)

This approximation implies γ = O(h²).

The error γ is assumed to be additive Gaussian noise. Then

P(f(u_t) \mid f_\gamma(u_t)) = \mathcal{N}\left( \frac{u_{t_{i+1}} - u_{t_i}}{h}, \sigma_\gamma \right); \qquad (102)

\sigma_\gamma := c h^2 \qquad (103)

for some constant c. For an inverse problem, the value of c is unknown and must be estimated. In this paper a fixed value of c is used, since the dropout rate, r, must also be adjusted to match the variance of the actual training data. As such, the variance induced by γ can be implicitly controlled by adjusting r. It was found that it is still useful, for a numerical problem, to include the error due to γ in σ_γ.
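Eqs. 100-103 can be made concrete with a short sketch. The trajectory below is a synthetic stand-in for observed data, and c = 1 is assumed purely for illustration; neither choice comes from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)

# Observations of a process at a constant time step h (eqs. 98-99).
h = 0.01
t = np.arange(0.0, 1.0, h)
u = np.stack([np.sin(t), np.cos(t)], axis=1)   # stand-in observed trajectory

# First-order forward differences, eq. 100.
f_gamma = (u[1:] - u[:-1]) / h

# Additive Gaussian model of the discretisation error, eqs. 102-103,
# with the constant c set to 1 purely for illustration.
sigma_gamma = 1.0 * h ** 2
f_sample = f_gamma + sigma_gamma * rng.normal(size=f_gamma.shape)
```

Since the true derivatives here are (cos t, -sin t), the forward differences recover them up to the O(h) truncation error of the first-order scheme, and f_sample provides the noisy draws used in eq. 97.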

The derivative approximation in eq. 100 could be replaced by some other suitable approximation, such as a higher-order Taylor series based approximation, as described in any standard reference on numerical methods [11].

#### 4.1.2 Dropout training objective assuming an Euler integral approximation

The Euler approximation can be inserted into the loss functional in eq. 96 to find a computable training objective. As the training data is assumed to be sampled at discrete points, the integral over u_t can be approximated by a finite sum over the points at which f_γ is computed. The training objective becomes

J(T, \theta, d) \approx \frac{1}{N-1} \sum_{i=1}^{N-1} \frac{1}{B} \sum_{b=1}^B \| f_\gamma^b(u_{t_i}) - f_\theta(u_{t_i} \mid d) \|_2^2 \qquad (104)

\text{for } f_\gamma^b(u_{t_i}) \sim \mathcal{N}\left( \frac{u_{t_{i+1}} - u_{t_i}}{h}, \sigma_\gamma \right). \qquad (105)

Then, as in eq. 66, θ* can be used as a maximum likelihood estimator for computing the posterior distribution P(θ|T). The application of the posterior distribution to forward model prediction is discussed in the next section.

The posterior distribution in eq. 92, with θ* found using eq. 104, is the solution to the dynamical system inverse problem. The full training procedure to calculate the posterior over weight space is summarised in algorithm 1.

Note that the training data is assumed to be sampled at evenly spaced intervals, h. Also note that a first-order Taylor series derivative approximation has been used. The algorithm presented could be modified to make use of an irregular time discretisation or different derivative approximation methods by altering the assumptions made in this section.
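The training pipeline above can be sketched end-to-end. This is not algorithm 1 verbatim: a linear surrogate model replaces the dropout network for brevity, and the synthetic data, learning rate and step counts are assumptions chosen so the sketch runs quickly:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic observations of a linear system du/dt = A u, sampled at step h.
A = np.array([[0.0, 1.0], [-1.0, 0.0]])
h, N = 0.01, 200
u = np.zeros((N, 2))
u[0] = [1.0, 0.0]
for i in range(N - 1):
    u[i + 1] = u[i] + h * (A @ u[i])       # Euler-generated training data

f_gamma = (u[1:] - u[:-1]) / h             # derivative targets, eq. 100
sigma_gamma = h ** 2                       # eq. 103 with c = 1 (assumed)

# Train a linear surrogate f_theta(u) = Theta @ u by gradient descent on
# the Monte Carlo objective of eq. 104 (B = 1 fresh sample per step).
Theta = np.zeros((2, 2))
lr = 0.05
for step in range(2000):
    f_b = f_gamma + sigma_gamma * rng.normal(size=f_gamma.shape)
    resid = u[:-1] @ Theta.T - f_b
    Theta -= lr * (resid.T @ u[:-1]) / len(f_b)
```

Because the surrogate is linear, Theta converges to the true transition matrix A up to the injected discretisation noise; in the paper's setting the same loop would update the weights of the dropout-augmented polynomial kernel network instead.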

## 5 Predicting future states given the posterior predictive distribution

The posterior predictive model in eq. 92 can be used for forward prediction. That is, the learnt model can be used to compute an a posteriori estimate of the trajectories defined in eq. 85. If certain assumptions about approximation errors are made, then the estimated trajectory can be assumed to be a Gaussian process.

To achieve this, first a discrete-time state transition distribution is derived using ideas from probabilistic numerics [12]. Then, the posterior model for f is combined with the transition probability model to compute a posterior distribution over future ODE states.

### 5.1 Error induced by numerical approximation of the transition probability

Discretisation must be introduced to the ODE integral to allow for numerical approximation of trajectories. Examples of ODE discretisation schemes include Euler integration and Runge-Kutta methods; see [4] for a detailed overview.

Denote the numerical approximation to the analytical ODE integral in eq. 75 by

u_{t+s} = \hat{u}^f_{t+s}(u_t) + \epsilon := u_t + G(t, s, f) + \epsilon \qquad (106)

where G(t, s, f) is some numerical approximation scheme with

G(t, s, f) \approx \int_t^{t+s} f(u_\tau) \, d\tau \qquad (107)

and ϵ represents the error of the approximation.

For a standard numerical integration method, G can be represented as a weighted sum of a set of values f(u_{t_i}), with t ≤ t_i ≤ t+s. The values t_i are termed the ‘evaluation points’ of the integration scheme. The integral approximation can then be written

G(t, s, f) = G(\{ f(u_{t_i}) \}_{i=1}^N) = \alpha_0 + \sum_{i=1}^N \alpha_i f(u_{t_i}). \qquad (108)

The weighting factors, α_i, and the evaluation points, t_i, depend on the particular numerical integration scheme used.
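For instance, the weighted-sum form of eq. 108 covers both forward Euler (a single evaluation point with weight s) and the trapezoidal rule (two evaluation points weighted s/2 each). A minimal sketch, with the function values and step size chosen arbitrarily for illustration:

```python
import numpy as np

def G_weighted_sum(alphas, f_values, alpha0=0.0):
    """Integral approximation as a weighted sum of evaluations, eq. 108."""
    return alpha0 + sum(a * f for a, f in zip(alphas, f_values))

s = 0.1                              # integration step length
f_ut = np.array([1.0, -2.0])         # f evaluated at the left endpoint
f_uts = np.array([1.2, -1.8])        # f evaluated at the right endpoint

# Forward Euler: one evaluation point, alpha_1 = s.
G_euler = G_weighted_sum([s], [f_ut])

# Trapezoidal rule: two evaluation points, each weighted s/2.
G_trap = G_weighted_sum([s / 2, s / 2], [f_ut, f_uts])
```

Higher-order Runge-Kutta methods fit the same template, with intermediate evaluation points and the corresponding Butcher tableau weights as the α_i.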

The transition probability can be computed, given the probability of sampling some value of \hat{u}^f_{t+s}(u_t), by marginalisation:

P(u_{t+s} \mid u_t) = \int P(u_{t+s} \mid \hat{u}^f_{t+s}(u_t)) \, dP(\hat{u}^f_{t+s}(u_t)). \qquad (109)

If f is known, eq. 109 is deterministic. Later, f will be replaced by the random-valued posterior approximation \hat{f}.
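The marginalisation in eq. 109 can be approximated by Monte Carlo: draw models from the posterior predictive, propagate each one step with the integration scheme, and read off the resulting ensemble over u_{t+s}. In the sketch below the posterior draw is mimicked by a known drift plus Gaussian model uncertainty, which is purely an illustrative assumption:

```python
import numpy as np

rng = np.random.default_rng(4)

def f_posterior_sample(u, rng):
    """Stand-in for one draw from the posterior predictive of f:
    a fixed drift plus Gaussian model uncertainty (illustrative only)."""
    drift = np.array([u[1], -u[0]])
    return drift + 0.05 * rng.normal(size=2)

# Marginalise eq. 109 by Monte Carlo: each sampled model is propagated
# one Euler step, giving an ensemble of next states u_{t+s}.
u_t = np.array([1.0, 0.0])
s = 0.1
ensemble = np.stack(
    [u_t + s * f_posterior_sample(u_t, rng) for _ in range(1000)]
)
mean_next = ensemble.mean(axis=0)
std_next = ensemble.std(axis=0)
```

The ensemble mean gives the point prediction while the ensemble spread propagates the model uncertainty into the state prediction, which is the mechanism behind the prediction bounds reported for the Sprott B system.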

### 5.2 Gaussian representation of the integral approximation error

Combining the techniques in [24] and [33], we assume that the likelihood of