 # Implicit Probabilistic Integrators for ODEs

We introduce a family of implicit probabilistic integrators for initial value problems (IVPs) taking as a starting point the multistep Adams-Moulton method. The implicit construction allows for dynamic feedback from the forthcoming time-step, by contrast with previous probabilistic integrators, all of which are based on explicit methods. We begin with a concise survey of the rapidly-expanding field of probabilistic ODE solvers. We then introduce our method, which builds on and adapts the work of Conrad et al. (2016) and Teymur et al. (2016), and provide a rigorous proof of its well-definedness and convergence. We discuss the problem of the calibration of such integrators and suggest one approach. We give an illustrative example highlighting the effect of the use of probabilistic integrators - including our new method - in the setting of parameter inference within an inverse problem.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Set-up, motivation and context

We consider the common statistical problem of inferring model parameters from data . In a Bayesian setting, the parameter posterior is given by . Suppose we have a regression model in which the likelihood term

requires us to solve an ordinary differential equation (ODE). Specifically, for each datum, we have

for some latent function satisfying

and vector of measurement errors

We can write the full model as . Since

is latent, it is included as an integral part of the posterior model. This more general decomposition would not need to be considered explicitly in, say, a linear regression model for which

; here we would simply have . In other words, given there is no uncertainty in , and the model would reduce to simply .

In our case, however, is defined implicitly through the ODE and is therefore no longer trivial. What we mean by this is that can only be calculated approximately and thus—following the central principle of probabilistic numerical methods (Hennig et al., 2015)

—we assign to it a probability distribution representing our lack of knowledge about its true value. Our focus here is on initial value problems (IVPs) where we assume the initial value

is known (though an extension to unknown is straightforward). We thus have the setup

 p(θ,σ,x|Y,X0)∝p(Y|x,σ)p(x|θ,X0)p(θ)p(σ). (1)

For our purposes, the interesting term on the right-hand side is , where hereafter we omit . In the broadest sense, our aim is to account as accurately as possible for the numerical error which is inevitable in the calculation of , and to do this within a probabilistic framework by describing . We then wish to consider the effect of this uncertainty as it is propagated through (1), when performing inference on in an inverse problem setting. An experimental example in this context is considered in Section 3.

### 1.1 Probabilistic numerical methods

Before we give a summary of the current state of probabilistic numerical methods (PN) for ODEs, we take a brief diversion. It is interesting to note that the concept of defining a distribution for has appeared in the recent literature in different forms. For example, a series of papers (Calderhead et al., 2009; Dondelinger et al., 2013; Wang and Barber, 2014; Macdonald et al., 2015), which arose separately from PN in its modern form, seek to avoid solving the ODE entirely and instead replace it with a ‘surrogate’ statistical model, parameterised by , with the primary motivation being to reduce overall computation. The central principle in these papers is to perform gradient matching (GM) between the full and surrogate models. A consequence of this framework is the introduction of a distribution akin to the appearing in (1). The aim is to approximate using statistical techniques, but there is no attempt to model the error itself—instead, simply an attempt to minimise the discrepancy between the true solution and its surrogate. Furthermore, the parameters of the surrogate models proposed in the GM framework are fitted by conditioning on data meaning needs to be viewed as a data-conditioned posterior . In our view this is problematic, since where the uncertainty in a quantity of interest arises solely from the inexactness of the numerical methods used to calculate it, inference over that quantity should not be based on data that is the outcome of an experiment. The circularity induced in (1) by -conditioning is clear.

The fundamental shift in thinking in the papers by Hennig and Hauberg (2014) and Chkrebtii et al. (2016), building on Skilling (1991), and then followed up and extended by Schober et al. (2014), Conrad et al. (2016), Teymur et al. (2016), Kersting and Hennig (2016), Schober et al. (2018) and others is that of what constitutes ‘data’ in the algorithm used to determine . By contrast to the GM approach, the experimental data is not used in constructing this distribution. Though the point has already been made tacitly in some of these works, we argue that this constitutes the key difference in philosophy. Instead, we should strive to quantify the numerical uncertainty in first, then propagate this uncertainty via the data likelihood to the Bayesian inversion employed for inferring . This is effectively direct probabilistic modelling of the numerical error and is the approach taken in PN.

How then is inferred in PN? The common thread here is that a discrete path is generated which numerically approximates —the discretised version of the true solution —then ‘model interrogations’ (Chkrebtii et al., 2016) are thought of as a sort of numerical data and is inferred based on these. Done this way, an entirely model-based description of the uncertainty in results, with no recourse to experimental data .

### 1.2 Sequential inference

Another central feature of PN solvers from Chkrebtii et al. (2016) onward is that of treating the problem sequentially, in the manner of a classic IVP integrator. In all of the GM papers, and indeed in Hennig and Hauberg (2014), is treated as a block – inferred all at once, given data (or, in Hennig and Hauberg, ). This necessarily limits the degree of feedback possible from the dynamics of the actual ODE, and in a general IVP this may be the source of significant inaccuracy, since errors in the inexact values approximating are amplified by the ODE itself. In a sequential approach, the numerical data is not a static pre-existing object as the true data is, but rather is generated as we go by repeatedly evaluating the ODE at a sequence of input ordinates. Thus it is clear that the numerical data generated at time is affected by the inferred solution at times before . This iterative information feedback is qualitatively much more like a standard IVP solver than a block inference approach and is similar to the principle of statistical filtering (Särkkä, 2013).

We now examine the existing papers in this area more closely, in order to give context to our own contribution in the subsequent section. In Chkrebtii et al. (2016) a Gaussian process (GP) prior is jointly placed over and its derivative , then at step the current GP parameters are used to predict a value for the state at the next step, . This is then transformed to give . The modelling assumption now made is that this value is distributed around the true value of the derivative with Gaussian error. On this basis the new datum is assimilated into the model, giving a posterior for which can be used as the starting prior in the next step. This approach does not make direct use of the sequence ; rather it is merely generated in order to produce the numerical data which is then compared to the prior model in derivative space. The result is a distributional Gaussian posterior over consistent with the sequence .

Conrad et al. (2016) take a different approach. Treating the problem in a discrete setting, they produce a sequence of values approximating , with constituting the data and calculated from the previous values and by employing some iterative relation akin to a randomised version of a standard IVP solver. Note that there is no attempt to continuously assimilate the generated values into the model for the unknown or during the run of the algorithm. Instead, the justification for the method comes post hoc in the form of a convergence theorem bounding the maximum expected squared-error . An extension to multistep methods—in which is allowed to depend on multiple past values  —is introduced in Teymur et al. (2016) and utilises the same basic approach. Various extensions and generalisations of the theoretical results in these papers are given in Lie et al. (2017), and a related idea in which the step-size is randomised is proposed by Abdulle and Garegnani (2018).

This approach is intuitive, allowing for modified versions of standard algorithms which inherit known useful properties, and giving provable expected error bounds. It is also more general since it allows for non-parametric posterior distributions for , though it relies on Monte Carlo sampling to give empirical approximations to it. Mathematically, we write

 p(Z|θ)=∫p(Z,F|θ)dF=∫[N−1∏i=0p(Fi|Zi,θ)p(Zi+1|Zi,F≤i)]dF. (2)

Here, is the approximation to the unknown discretised solution function , and each is a piece of numerical data. We use to mean . Using the terminology of Hennig et al. (2015), the two constituent components of the telescopic decomposition in the right-hand side of (2) correspond to the ‘decision rule’ (how the algorithm generates a new data-point ) and the ‘generative model’ (which encodes the likelihood model for ) respectively. Note that, from a statistical viewpoint, the method explicitly defines a distribution over numerical solutions rather than an uncertainty centred around the true solution (or ). The relationship of the measure over to that over is then guaranteed by the convergence analysis.

The term is taken in both Conrad et al. (2016) and Teymur et al. (2016) to be simply a deterministic transformation; this could be written in distributional form as . The term is given by Conrad et al. as a Gaussian centred around the output

of any deterministic single step IVP solver, with variance scaled in accordance with the constraints of their theorem. Teymur et al. introduce a construction for this term which permits conditioning on multiple previous

’s and has mean equivalent to the multistep Adams–Bashforth method. They give the corresponding generalised convergence result. Their proof is also easily verified to be valid for implicit multistep methods—a result we appeal to later—though the specific implicit integrator model they suggest is methodologically inconsistent, for reasons we will explain in Section 2.

In all of the approaches described so far, Monte Carlo sampling is required to marginalise and thereby calculate . This constitutes an appreciable computational overhead. A third approach, related to stochastic filtering, is presented in Schober et al. (2014), Kersting and Hennig (2016) and Schober et al. (2018)

. These papers develop a framework which does not rely on sampling, but instead makes the simplifying assumption that all distributions are Gaussian, and propagates the uncertainty from step to step using the theory of Kalman filtering

(Särkkä, 2013). This is an alternative settlement to the accuracy/computational cost trade-off, a point which is acknowledged in those papers.

For the sake of comparison, we can loosely rewrite their general approach in our notation as follows:

 p(x|θ)=∫[∏ip(~Zi+1|x[i],F0:i)p(Fi+1|~Zi+1,θ)p(x[i+1]|~Zi+1,Fi+1)]dFd~Z, (3)

where we write instead of to emphasise that this represents an -times updated model for the continuous solution , rather than the ’th iteration of an algorithm which generates an approximation to the discrete . This algorithm predicts a value for the new state from the current model and all previous data, then generates a data point based on that prediction, and then updates the model based on this new datum. Note that all distributions in this framework are Gaussian, to permit fast filtering, and as a result the non-linearities in are effectively linearised, and any numerical method which produces non-Gaussian errors has Gaussians fitted to them anyway.

This filtering approach is interesting because of the earlier-stated desideratum of maximising the degree of feedback from the ODE dynamics to the solver. The predict-evaluate-update approach suggested by (3) means that information from the ODE function at the next time step is fed back into the procedure at each step, unlike in other methods which only predict forwards. In numerical analysis this is typically a much-desired feature, leading to methods with improved stability and accuracy. However, it is still a three-part procedure, analogous for example to paired Adams–Bashforth and Adams–Moulton integrators used in PEC mode (Butcher, 2008). This connection is referred to in Schober et al. (2018).

## 2 Our proposed method

We now propose a different, novel sequential procedure which also incorporates information from the ODE at time step but does so directly. This produces a true implicit probabilistic integrator, without a subtle inconsistency present in the method suggested by Teymur et al. (2016). There, the analogue of (2

) defines a joint Gaussian distribution over

and (the right-hand component, with replaced by ) but then generates by passing through the function (the left hand component). This gives two mutually-incompatible meanings to , one linearly and one non-linearly related to . Our proposed method fixes this problem. Indeed, we specifically exploit the difference in these two quantities by separating them out and directly penalising the discrepancy between them.

To introduce the idea we consider the one-dimensional case first, then later we generalise to a multi-dimensional context. We first note that unlike in the explicit randomised integrators of Conrad et al. (2016) and Teymur et al. (2016), we do not have access to the exact deterministic Adams–Moulton predictor, to which we could then add a zero-mean perturbation. An alternative approach is therefore required. Consider instead the following distribution which directly advances the integrator one step and depends only the current point:

 p(Zi+1=z|Zi,θ,η)∝g(r(z),η). (4)

Here, is a positive discrepancy measure in derivative space defined in the next paragraph, and is an -scaled functional transformation which ensures that the expression is a valid probability distribution in the variable .

A concrete example will illuminate the definition. Consider the simplest implicit method, backward Euler. This is defined by the relation and typically can only be solved by an iterative calculation, since

is of course unknown. If the random variable

has value , then we may express as a function of . Specifically, we have . The discrepancy between the value of and the value of can then be used as a measure of the error in the linear method, and penalised. This is equivalent to penalising the difference between the two different expressions for arising from the previously-described naive extension of (2) to the implicit case. We write

 p(Zi+1=z|Zi,θ,η)=K−1exp(−12η−2(h−1(z−Zi)−f(z,θ))2). (5)

Comparing (4) and (5), is the expression , and is in this case the transformation . This approach directly advances the solver in a single leap, without collecting explicit numerical data as in previous approaches. It is in general non-parametric and requires either sampling or approximation to be useful (more on which in the next section). Since is in general non-linear, it follows that is non-linear too. It then follows that the density in equation (5) does not result in a Gaussian measure, despite being a squared-exponential transformation. The generalisation to higher order implicit linear multistep methods of Adams–Moulton (AM) type, having the form , for AM coefficients , follows as

 p(Zi+1=z|Z≤i,θ,η)=1Kexp⎛⎜⎝−12η2⎛⎝h−1(z−Zi)−∑s−1j=0βjFi−jβ−1−f(z,θ)⎞⎠2⎞⎟⎠. (6)

### 2.1 Mathematical properties of the proposed method

The following analysis proves the well-definedness and convergence properties of the construction proposed in Section 2. First we show that the distribution (6) is well-defined and proper, by proving the finiteness and strict positivity of the normalising constant . We then describe conditions on the -dependence of the scaling parameter , such that the random variables in (8) satisfy the hypotheses of Theorem 3 in Teymur et al. (2016). In particular, the convergence of our method follows from the Adams–Moulton analogue of that result.

Denote by the deterministic map defined by the -step Adams–Moulton method. For example, the implicit map associated with the backward Euler method—the ‘zero step’ AM method—for a fixed parameter is . More generally, the map associated with the -step AM method is

 Ψhθ,s(Zi−s+1:i)=Zi+h[β−1f(Ψhθ,s(Zi−s+1:i),θ)+∑s−1j=0βjf(Zi−j,θ)], (7)

where , and the are the Adams–Moulton coefficients. Note that

represents the deterministic Adams–Moulton estimate for

. Given a probability space , define for every the random variable according to

 Zi+1=Ψhθ,s(Zi−s+1:i)+ξhi. (8)

The relationship between the expressions (6) and (8) is addressed in part (i) of the following Theorem, the proof of which is given in the supplementary material accompanying this paper.

###### Theorem.

Assume that the vector field is globally Lipschitz with Lipschitz constant . Fix , , , and . If for some independent of and , then the following statements hold:

1. The function defined in (6) is a well-defined probability density.

2. For every , there exists a constant that does not depend on , such that for all , .

3. If , the probabilistic integrator defined by (6) converges in mean-square as , at the same rate as the deterministic -step Adams–Moulton method.

### 2.2 Multi-dimensional extension

The extrapolation part of any linear method operates on each component of a multi-dimensional problem separately. Thus if , we have for each component in turn. Of course, this is not true of the transformation , except in the trivial case where is linear in ; thus in (2), the right-hand distribution is componentwise-independent while the left-hand one is not. All previous sequential PN integrators have treated the multi-dimensional problem in this way, as a product of one-dimensional relations.

In our proposal it does not make sense to consider the system of equations component by component, due to the presence of the non-linear term, which appears as an intrinsic part of the step-forward distribution . The multi-dimensional analogue of (6) should take account of this and be defined over all dimensions together. For vector-valued , we therefore define

 p(Zi+1|Z≤i,θ,H)∝exp{−12r(z)TH−1r(z)}. (9)

where is now a vector of discrepancies in derivative space, and is a matrix encoding the solver scale, generalising . Straightforward modifications to the proof give multi-dimensional analogues to the statements in the Theorem.

### 2.3 Calibration and setting H

The issue of calibration of ODE solvers is addressed without consensus in every treatment of this topic referenced in Section 1. The approaches can broadly be split into those of ‘forward’ type, in which there is an attempt to directly model what the theoretical uncertainty in a solver step should be and propagate that through the calculation; and those of ‘backward’ type, where the uncertainty scale is somehow matched after the computation to that suggested by some other indicator. Both of these have shortcomings, the former due to the inherent difficulty of explicitly describing the error, and the latter because it is by definition less precise. One major stumbling block is that it is in general a challenging problem to even define what it means for an uncertainty estimate to be well-calibrated.

In the present paper, we require a way of setting . We proceed by modifying and generalising an idea from Conrad et al. (2016) which falls into the ‘backward’ category. There, the variance of the step-forward distribution is taken to be a matrix , with determined by a scale-matching procedure that ensures the integrator outputs a global error scale in line with expectations. We refer the reader to the detailed exposition of this procedure in Section 3.1 of that paper. Furthermore, the convergence result from Teymur et al. (2016) implies that, for the probabilistic -step Adams–Bashforth integrator, the exponent should be taken to be .

In our method, we are not able to relate such a matrix directly to because from the definition (9) it is clear that is a scaling matrix for the spread of the derivative , whereas measures the spread of the state . In order to transform to the correct space without linearising the ODE, we apply the multivariate delta method (Oehlert, 1992) to give an approximation for the variance of the transformed random variable, and set to be equal to the result. Thus

 H=Var(f(Zi+1)) ≈Jf(E(Zi+1))ΣZJf(E(Zi+1))T (10) =αhρJf(E(Zi+1))Jf(E(Zi+1))T,

where is the Jacobian of . The mean value is unknown, but we can use an explicit method of equal or higher order to compute an estimate at negligible cost, and use instead, under the assumption that these are reasonably close. Remember that we are roughly calibrating the method so some level of approximation is unavoidable. This comment applies equally to the case where the Jacobian is not analytically available and is estimated numerically. Such approximations do not affect the fundamental convergence properties of the algorithm, since they do not affect the -scaling of the stepping distribution. We also note that we are merely matching variances/spread parameters and nowhere assuming that the distribution (9) is Gaussian. This idea bears some similarity to the original concept in Skilling (1993), where a scalar ‘stiffness constant’ is used in a similar way to transform the uncertainty scale from solution space to derivative space.

We now ascertain the appropriate -scaling for by setting the exponent

. The condition required by the univariate analysis in this paper is that

; part (iii) of the Theorem shows that we require , where is the number of steps in the corresponding AM method.111We take this opportunity to remind the reader of the unfortunate convention from numerical analysis that results in having different meanings here and in the previous paragraph—the explicit method of order is the one with steps, whereas the implicit method of order is the one with steps. Choosing  —an approach supported by the numerical experiments in Section 3—the backwards Euler method () requires The multidimensional analogue of the above condition is for an -independent positive-definite matrix . Since is independent of , this means we must set to be proportional to , and thus we have .

Our construction has the beneficial consequence of giving a non-trivial cross-correlation structure to the error calibration matrix , allowing a richer description of the error in multi-dimensional problems, something absent from previous approaches. Furthermore, it derives this additional information via direct feedback from the ODE, which we have shown is a desirable attribute.

### 2.4 Reducing computational expenditure

In the form described in the previous section, our algorithm results in a non-parametric distribution for at each step. With this approach, a description of the uncertainty in the numerical method can only be evaluated by a Monte Carlo sampling procedure at every iteration. Even if this sampling is performed using a method well-suited to targeting distributions close to Gaussian—we use a modified version of the pre-conditioned Crank–Nicolson algorithm proposed by Cotter et al. (2013)—there is clearly a significant computational penalty associated with this. Figure 1: 500 Monte Carlo repetitions of the probabilistic backward Euler (AM0) method applied to the FitzHugh–Nagumo model with h=0.1 and 0≤t≤20. The approximation from Section 2.4 is used and α∗AM0=0.2. The upper pane plots the ensemble of discrete trajectories, with the path of the deterministic backward Euler method in light blue. The lower pane is based on the same data, this time summarised. The ensemble mean is shown dashed, and 1σ, 2σ and 3σ intervals are shown shaded, with reference solution in solid black.

The only way to avoid this penalty is by reverting to distributions of standard form, which are easy to sample from. One possibility is to approximate (6) by a Gaussian distribution—depending on how this approximation is performed the desideratum of maintaining information feedback from the future dynamics of the target function can be maintained. For example, a first order Taylor expansion of , when substituted into as defined in (9), gives an approximation which is linear in . This yields a non-centred Gaussian when transformed into a probability measure as in (9). Defining and

, some straightforward algebra gives the moments of the approximating Gaussian measure for the next step as

and . We note that this procedure is merely to facilitate straightforward sampling—though is linear in , the inclusion of the first additional term from the Taylor expansion means that information about the non-linearity (in ) of are still incorporated to second order, and the generated solution is not jointly Gaussian across time steps . Furthermore, since is order 1 in , this approximation does not impact the global convergence of the integrator, as long as is set in accordance with the principles described in Section 2.3. This method of solving implicit integrators by linearising them in is well-known in classical numerical analysis, and the resulting methods are sometimes called semi-implicit methods (Press et al., 2007).

## 3 Experimental results

We illustrate our new algorithm by considering the case of a simple inverse problem, the FitzHugh–Nagumo model discussed in Ramsay et al. (2007) and subsequently considered in a several papers on this topic. This is a two-dimensional non-linear dynamical system with three parameters , the values of which () are chosen to produce periodic motion. With the problem having a Bayesian structure, we write down the posterior as

 p(θ,Z|Y)∝p(Y|Z,σ)p(Z|θ,ξ)p(θ)p(ξ). (11)

This expression recalls (1), but with substituting for as described in Section 1.2. We write to emphasise that the trajectory depends on the sequence of random perturbations . For simplicity we use the known value of throughout, so do not include it in the posterior model.

Conrad et al. (2016) remark on the biasing effect on the posterior distribution for of naively evaluating the forward model using a standard numerical method. They showed that their probabilistic integrator returns wider posteriors, preventing misplaced overconfidence in an erroneous estimate. We now extend these experiments to our new method. In passing, we note interesting recent theoretical developments discussing the quantitative effect on posterior inference of randomised forward models, presented in Lie et al. (2018).

### 3.1 Calibration

The first task is to infer the appropriate value for the overall scaling constant for each method, to be used in setting the matrix in (10). As in Conrad et al. (2016), we calculate a value that maximises the agreement between the output of the probabilistic integrator and a measure of global error from the deterministic method, and then fix and proceed with this value.

For each of several methods , was calculated for a range of values of and was close to constant throughout, suggesting that the -scaling advocated in Section 2.3 (ie. taking the equality in the bound in part (iii) of the Theorem) is the correct one. This point has not been specifically addressed in previous works on this subject. The actual maxima for each method are different and further research is required to examine whether a relationship can be deduced between these values and some known characteristic of each method, such as number of steps or local error constant of the underlying method. Furthermore, we expect these values to be problem-dependent. In this case, we found , , , , , .

Having calibrated the probabilistic integrator, we illustrate its typical output in Figure 1: the top pane plots the path of 500 iterations of the probabilistic backward Euler method run at . We plot the discrete values for each repetition, without attempting to distinguish the trajectories from individual runs. This is to stress that each randomised run (resulting from a different instantiation of ) is not intended to be viewed as a ‘typical sample’ from some underlying continuous probability measure, as in some other probabilistic ODE methods, but rather that collectively they form an ensemble from which an empirical distribution characterising discrete-time solution uncertainty can be calculated. The bottom pane plots the same data but with shaded bands representing the , and intervals, and a dotted line representing the empirical mean.

### 3.2 Parameter inference

We now consider the inverse problem of inferring the parameters of the FitzHugh–Nagumo model in the range . We first generate synthetic data ; 20 two-dimensional data-points collected at times corrupted by centred Gaussian noise with variance . We then treat the parameters as unknown and run an MCMC algorithm—Adaptive Metropolis Hastings (Haario et al., 2001)—to infer their posterior distribution.

In Conrad et al. (2016), the equivalent algorithm performs multiple repetitions of the forward solve at each step of the outer MCMC (each with a different instantiation of ) then marginalises out to form an expected likelihood. This is computationally very expensive; in our experiments we find that for the MCMC to mix well, many tens of repetitions of the forward solve are required at each step.

Instead we use a Metropolis-within-Gibbs scheme where at MCMC iteration , a candidate parameter is proposed and accepted or rejected having had its likelihood calculated using the same sample as used in the current iteration . If accepted as , a new can then be sampled and the likelihood value recalculated ready for the next proposal. The proposal at step is then compared to this new value. Pseudo-code for this algorithm is given in the supplementary material.

Our approach requires that be recalculated exactly once for each time a new parameter value is accepted. The cost of this strategy is therefore bounded by twice the cost of an MCMC operating with a deterministic integrator—the bound being achieved only in the scenario that all proposed moves are accepted. Thus the algorithm, in contrast to the calibration procedure (which is relatively costly but need only be performed once), has limited additional computational overhead compared to the naive approach using a classical method.

Figure 2

shows kernel density estimates approximating the posterior distribution of (

,) for the forward Euler, probabilistic forward Euler, backward Euler and probabilistic backward Euler methods. Each represents 1000 parameter samples from simulations run with step-sizes . This is made of 11000 total samples, with the first 1000 discarded as burn-in, and the remainder thinned by a factor of 10. For each method , its pre-calculated calibration parameter is used to set the variance of .

At larger step-sizes, the deterministic methods both give over-confident and biased estimates (on different sides of the true value). In accordance with the findings of Conrad et al. (2016), the probabilistic forward Euler method returns a wider posterior which covers the true solution. The bottom right-hand panel demonstrates the same effect with the probabilistic backward Euler method we have introduced in this paper.

We find similar results for second- and higher-order methods, both explicit and implicit. The scale of the effect is however relatively small on such a simple test problem, where a higher-order integrator would not be expected to produce much error in the forward solve. Further work will investigate the application of these methods to more challenging problems. Figure 2: Comparison of the posterior distribution of (θ2,θ3) from the FitzHugh–Nagumo model in cases where the forward solve is calculated using one of four different integrators (deterministic and probabilistic backward- and forward-Euler methods), each for four different step sizes h=0.005,0.01,0.02,0.05. All density estimates calculated using 1000 MCMC samples. Dashed black lines indicate true parameter values. Full details are given in main text.

## 4 Conclusions and avenues for further work

In this paper, we have surveyed the existing collection of probabilistic integrators for ODEs, and proposed a new construction—the first to be based on implicit methods—giving a rigorous description of its theoretical properties. We have given preliminary experimental results showing the effect on parameter inference of the use of different first-order methods, both existing and new, in the evaluation of the forward model. Higher-order multistep methods are allowed by our construction.

Our discussion on integrator calibration does not claim a resolution to this subtle and thorny problem, but suggests several avenues for future research. We have mooted a question on the relationship between the scaling parameter and other method characteristics. Insight into this issue may be the key to making these types of randomised methods more practical, since common tricks for calibration may emerge which are then applicable to different problems. An interesting direction of enquiry, being explored separately, concerns whether estimates of global error from other sources, eg. adjoint error modelling, condition number estimation, could be gainfully applied to calibrate these methods.

## Acknowledgements

HCL and TJS are partially supported by the Freie Universität Berlin within the Excellence Initiative of the German Research Foundation (DFG). This work was partially supported by the DFG through grant CRC 1114 Scaling Cascades in Complex Systems, and by the National Science Foundation (NSF) under grant DMS-1127914 to the Statistical and Applied Mathematical Sciences Institute’s QMC Working Group II Probabilistic Numerics.

## References

• Abdulle and Garegnani (2018) Abdulle, A. and Garegnani, G. (2018). Random Time Step Probabilistic Methods for Uncertainty Quantification in Chaotic and Geometric Numerical Integration. arXiv:1801.01340.
• Butcher (2008) Butcher, J. (2008). Numerical Methods for Ordinary Differential Equations: Second Edition. Wiley.
• Calderhead et al. (2009) Calderhead, B., Girolami, M., and Lawrence, N. (2009).

Accelerating Bayesian Inference over Nonlinear Differential Equations with Gaussian Processes.

Advances in Neural Information Processing Systems (NeurIPS), 21:217–224.
• Chkrebtii et al. (2016) Chkrebtii, O., Campbell, D., Calderhead, B., and Girolami, M. (2016). Bayesian Solution Uncertainty Quantification for Differential Equations. Bayesian Analysis, 11(4):1239–1267.
• Conrad et al. (2016) Conrad, P., Girolami, M., Särkkä, S., Stuart, A., and Zygalakis, K. (2016). Statistical Analysis of Differential Equations: Introducing Probability Measures on Numerical Solutions. Statistics and Computing, pages 1–18.
• Cotter et al. (2013) Cotter, S., Roberts, G., Stuart, A., and White, D. (2013). MCMC Methods for Functions: Modifying Old Algorithms to Make Them Faster. Statistical Science, 28(3):424–446.
• Dondelinger et al. (2013) Dondelinger, F., Rogers, S., and Husmeier, D. (2013). ODE Parameter Inference using Adaptive Gradient Matching with Gaussian Processes.

Proc. of the 16th Int. Conf. on Artificial Intelligence and Statistics (AISTATS)

, 31:216–228.
• Haario et al. (2001) Haario, H., Saksman, E., and Tamminen, J. (2001). An Adaptive Metropolis Algorithm. Bernoulli, 7(2):223–242.
• Hennig and Hauberg (2014) Hennig, P. and Hauberg, S. (2014). Probabilistic Solutions to Differential Equations and their Application to Riemannian Statistics. Proc. of the 17th Int. Conf. on Artificial Intelligence and Statistics (AISTATS), 33:347–355.
• Hennig et al. (2015) Hennig, P., Osborne, M., and Girolami, M. (2015). Probabilistic Numerics and Uncertainty in Computations. Proc. R. Soc. A, 471(2179):20150142.
• Kersting and Hennig (2016) Kersting, H. and Hennig, P. (2016). Active Uncertainty Calibration in Bayesian ODE Solvers. Uncertainty in Artificial Intelligence (UAI), 32.
• Lie et al. (2017) Lie, H. C., Stuart, A., and Sullivan, T. (2017). Strong Convergence Rates of Probabilistic Integrators for Ordinary Differential Equations. arXiv:1703.03680.
• Lie et al. (2018) Lie, H. C., Sullivan, T. J., and Teckentrup, A. L. (2018). Random Forward Models and Log-Likelihoods in Bayesian Inverse Problems. SIAM/ASA Journal on Uncertainty Quantification. To appear.
• Macdonald et al. (2015) Macdonald, B., Higham, C., and Husmeier, D. (2015). Controversy in Mechanistic Modelling with Gaussian Processes.

Proc. of the 32nd Int. Conf. on Machine Learning (ICML)

, 37:1539–1547.
• Oehlert (1992) Oehlert, G. (1992). A Note on the Delta Method. The American Statistician, 46(1):27–29.
• Press et al. (2007) Press, W., Teukolsky, S., Vetterling, W., and Flannery, B. (2007). Numerical Recipes 3rd Edition: The Art of Scientific Computing. Cambridge University Press.
• Ramsay et al. (2007) Ramsay, J., Hooker, G., Campbell, D., and Cao, J. (2007). Parameter Estimation for Differential Equations: a Generalized Smoothing Approach. J. Royal Stat. Soc. B, 69(5):741–796.
• Särkkä (2013) Särkkä, S. (2013). Bayesian Filtering and Smoothing. Cambridge University Press.
• Schober et al. (2014) Schober, M., Duvenaud, D., and Hennig, P. (2014). Probabilistic ODE Solvers with Runge-Kutta Means. Advances in Neural Information Processing Systems (NeurIPS), 27:739–747.
• Schober et al. (2018) Schober, M., Särkkä, S., and Hennig, P. (2018). A Probabilistic Model for the Numerical Solution of Initial Value Problems. Statistics and Computing, pages 1–24.
• Skilling (1991) Skilling, J. (1991). Bayesian Solution of Ordinary Differential Equations. In Smith, C. R., editor, Maximum Entropy and Bayesian Methods, Fundamental Theories of Physics, pages 23–37. Springer, Dordrecht.
• Skilling (1993) Skilling, J. (1993). Bayesian Numerical Analysis. In Grandy, W. J. and Milonni, P., editors, Physics and Probability, pages 207–222. Cambridge University Press.
• Teymur et al. (2016) Teymur, O., Zygalakis, K., and Calderhead, B. (2016). Probabilistic Linear Multistep Methods. Advances in Neural Information Processing Systems (NeurIPS), 29:4314–4321.
• Wang and Barber (2014) Wang, Y. and Barber, D. (2014). Gaussian Processes for Bayesian Estimation in Ordinary Differential Equations. Proc. of the 31st Int. Conf. on Machine Learning (ICML), 32:1485–1493.

## Appendix – Proof of Theorem

We first prove statement (i). Let and ; we will omit the -dependence hereafter. Let . It follows from (8) that

 z=v+x=Zi+h(β−1f(v,θ)+∑s−1j=0βjf(Zi−j,θ))+x. (12)

It follows by rearranging the terms in (12) that

 1β−1(z−Zih−s−1∑j=0βjf(Zi−j,θ))=f(v,θ)+xβ−1h.

By subtracting from both sides, taking the norm, and squaring, we obtain

We may thus rewrite (6) as

 p(v+x|v,θ,η,h)=1Kexp(−12η2∥∥∥f(v,θ)−f(v+x,θ)+xβ−1h∥∥∥2). (13)

and it follows that the normalising constant is given by

 K(v,θ,η,h)=∫Rdexp(−12η2∥∥∥f(v,θ)−f(v+x,θ)+xβ−1h∥∥∥2)dx.

We now bound the function described in (13) from above and below by unnormalised Gaussian probability densities. First note that by the triangle inequality and the assumption of global Lipschitz continuity we have the lower bound

 ∥∥∥f(v,θ)−f(v+x,θ)+xβ−1h∥∥∥≥∥∥∥xβ−1h∥∥∥−∥f(v,θ)−f(v+x,θ)∥≥∥∥∥xβ−1h∥∥∥−Lf,θ∥x∥=∥x∥((β−1h)−1−Lf,θ).

Similar reasoning yields the upper bound

 ∥∥∥f(v,θ)−f(v+x,θ)+xβ−1h∥∥∥≤∥x∥(Lf,θ+(β−1h)−1).

Thus for , there exist constants that do not depend on such that

 ch∥x∥2≤∥∥∥f(v,θ)−f(v+x,θ)+xβ−1h∥∥∥2≤Ch∥x∥2

where we have defined and . Recall that by the definition of the AM method . (13) now gives

 exp(−(2η2)−1Ch∥x∥2)≤K(v,θ,η,h)⋅p(v+x|v,θ,η,h)≤exp(−(2η2)−1ch∥x∥2) (14)

where we note that the lower and upper bounds in (14) do not depend on . In what follows, we will omit the dependence of and on and , and write and , in order to emphasise the dependence of these quantities on .

The interpretation of (14) is that, up to normalisation, the random variable has a Lebesgue density that lies between the densities of two centred Gaussian random variables.

Integrating each of the three terms in (14) with respect to and using the formula for the normalising constant of a Gaussian measure on , we obtain from the hypotheses and that

 (15)

Note that and are the normalising constants for the Gaussian random variables and respectively, where denotes the identity matrix.

Since , the upper and lower bounds in (15) are respectively finite and strictly positive. This proves (i).

To prove (ii), observe that (15) yields that, for all and , we have

 1≤Kc,hKC,h=(Chch)d/2=(1+Lf,θβ−1h1−Lf,θβ−1h)d (16)

The upper bound decreases to 1 as decreases to zero, since , and are all strictly positive. By the second inequality in (14),

 E[∥v+ξi∥2] =E[∥Zi+1∥2] =∫Rd∥z∥2p(z|v,θ,η,h)dz ≤Kc,hK−1h∫Rd∥z∥2exp(−ch∥x∥22η2)(dx) =Kc,hK−1hE[∥v+ζc,h∥2]. (17)

Since the preceding inequalities hold for arbitrary , we may set in (13). Using this fact and the fact that (16) implies that , we only need to show for some that does not depend on . Consider the change of variables . Since this is just a scaling, we have by the change of variables formula that , and hence

 K−1c,h∫Rd∥x∥rexp (−∥x∥22η2/ch)dx =(2πη2ch)−d/2∫Rd[(η2ch)r/2∥x′∥rexp(−∥x′∥22)(η2ch)d/2]dx′ =Cr(η2ch)r/2 ≤C′rh(ρ+1)r. (18)

where we have used (15) in the first equation, and where do not depend on .

To prove (iii), we set and in (ii) to obtain . Since is the number of steps of the Adams–Moulton method of order , the random variable satisfies the assumption in the statement of Theorem 3 in Teymur et al. (2016). It then follows from that result that

 sup0≤ih≤TE∥Zi−x(ti)∥≤ch2(s+1)

(Note: the used here is different to in the referenced paper, since we follow the usual convention in numerical analysis texts where the implicit multistep method of order has steps.) ∎