Probabilistic Linear Multistep Methods

by   Onur Teymur, et al.
Imperial College London

We present a derivation and theoretical investigation of the Adams-Bashforth and Adams-Moulton family of linear multistep methods for solving ordinary differential equations, starting from a Gaussian process (GP) framework. In the limit, this formulation coincides with the classical deterministic methods, which have been used as higher-order initial value problem solvers for over a century. Furthermore, the natural probabilistic framework provided by the GP formulation allows us to derive probabilistic versions of these methods, in the spirit of a number of other probabilistic ODE solvers presented in the recent literature. In contrast to higher-order Runge-Kutta methods, which require multiple intermediate function evaluations per step, Adams family methods make use of previous function evaluations, so that increased accuracy arising from a higher-order multistep approach comes at very little additional computational cost. We show that through a careful choice of covariance function for the GP, the posterior mean and standard deviation over the numerical solution can be made to exactly coincide with the value given by the deterministic method and its local truncation error respectively. We provide a rigorous proof of the convergence of these new methods, as well as an empirical investigation (up to fifth order) demonstrating their convergence rates in practice.



page 7


Pick-and-Mix Information Operators for Probabilistic ODE Solvers

Probabilistic numerical solvers for ordinary differential equations comp...

Probabilistic Solutions To Ordinary Differential Equations As Non-Linear Bayesian Filtering: A New Perspective

We formulate probabilistic numerical approximations to solutions of ordi...

Convergence Rates of Gaussian ODE Filters

A recently-introduced class of probabilistic (uncertainty-aware) solvers...

Linear multistep methods and global Richardson extrapolation

In this work, we study the application the classical Richardson extrapol...

Implicit Probabilistic Integrators for ODEs

We introduce a family of implicit probabilistic integrators for initial ...

Stable Implementation of Probabilistic ODE Solvers

Probabilistic solvers for ordinary differential equations (ODEs) provide...

Sharp global convergence guarantees for iterative nonconvex optimization: A Gaussian process perspective

We consider a general class of regression models with normally distribut...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Numerical solvers for differential equations are essential tools in almost all disciplines of applied mathematics, due to the ubiquity of real-world phenomena described by such equations, and the lack of exact solutions to all but the most trivial examples. The performance – speed, accuracy, stability, robustness – of the numerical solver is of great relevance to the practitioner. This is particularly the case if the computational cost of accurate solutions is significant, either because of high model complexity or because a high number of repeated evaluations are required (which is typical if an ODE model is used as part of a statistical inference procedure, for example). A field of work has emerged which seeks to quantify this performance – or indeed lack of it – by modelling the numerical errors probabilistically, and thence trace the effect of the chosen numerical solver through the entire computational pipeline kennedy01. The aim is to be able to make meaningful quantitative statements about the uncertainty present in the resulting scientific or statistical conclusions.

Recent work in this area has resulted in the development of probabilistic numerical methods, first conceived in a very general way in diaconis88. An recent summary of the state of the field is given in hennig15. The particular case of ODE solvers was first addressed in skilling93, formalised and extended in chkrebtii16; hennig14; schober14 with a number of theoretical results recently given in conrad16. The present paper modifies and extends the constructions in chkrebtii16; conrad16 to the multistep case, improving the order of convergence of the method but avoiding the simplifying linearisation of the model required by the approaches of hennig14; schober14. Furthermore we offer extensions to the convergence results in conrad16 to our proposed method and give empirical results confirming convergence rates which point to the practical usefulness of our higher-order approach without significantly increasing computational cost.

1.1 Mathematical setup

We consider an Initial Value Problem (IVP) defined by an ODE


Here is the solution function,

is the vector-valued function that defines the ODE, and

is a given vector called the initial value. The dependence of on an -dimensional parameter

will be relevant if the aim is to incorporate the ODE into an inverse problem framework, and this parameter is of scientific interest. Bayesian inference under this setup (see

girolami08) is covered in most of the other treatments of this topic but is not the main focus of this paper; we therefore suppress for the sake of clarity.

Some technical conditions are required in order to justify the existence and uniqueness of solutions to (1). We assume that is evaluable point-wise given and and also that it satisfies the Lipschitz condition in , namely for some and all and ; and also is continuous in . These conditions imply the existence of a unique solution, by a classic result usually known as the Picard-Lindelöf Theorem iserles08.

We consider a finite-dimensional discretisation of the problem, with our aim being to numerically generate an -dimensional vector111The notation denotes the vector , and analogously , etc. approximating the true solution in an appropriate sense. Following chkrebtii16

, we consider the joint distribution of

and the auxiliary variables (obtained by evaluating the function ), with each obtained by sequentially conditioning on previous evaluations of . A basic requirement is that the marginal mean of should correspond to some deterministic iterative numerical method operating on the grid . In our case this will be a linear multistep method (LMM) of specified type. 222We argue that the connection to some specific deterministic method is a desirable feature, since it aids interpretability and allows much of the well-developed theory of IVP solvers to be inherited by the probabilistic solver. This is a particular strength of the formulation in conrad16 which was lacking in all previous works.

Firstly we telescopically factorise the joint distribution as follows:


We can now make simplifying assumptions about the constituent distributions. Firstly since we have assumed that is evaluable point-wise given and ,


which is a Dirac-delta measure equivalent to simply performing this evaluation deterministically. Secondly, we assume a finite moving window of dependence for each new state – in other words is only allowed to depend on and for some . This corresponds to the inputs used at each iteration of the -step Adams-Bashforth method. For we will assume dependence on only those derivative evaluations up to ; this initialisation detail is discussed briefly in Section 4. Strictly speaking, is superfluous to our requirements (since we already have ) and thus we can rewrite (2) as


The conditional distributions are the primary objects of our study – we will define them by constructing a particular Gaussian process prior over all variables, then identifying the appropriate (Gaussian) conditional distribution. Note that a simple modification to the decomposition (2) allows the same set-up to generate an -step Adams-Moulton iterator333The convention is that the number of steps is equal to the total number of derivative evaluations used in each iteration, hence the -step AB and -step AM methods both go ‘equally far back’. – the implicit multistep method where depends in addition on . At various stages of this paper this extension is noted but omitted for reasons of space – the collected results are given in Appendix C.

Linear multistep methods

We give a very short summary of Adams family LMMs and their conventional derivation via interpolating polynomials. For a fuller treatment of this well-studied topic we refer the reader to the comprehensive references

iserles08; hairer08; butcher08. Using the usual notation we write

for the numerical estimate of the true solution

, and for the estimate of .

The classic -step Adams-Bashforth method calculates by constructing the unique polynomial interpolating the points . This is given by Lagrange’s method as


The are known as Lagrange polynomials, have the property that , and form a basis for the space known as the Lagrange basis. The Adams-Bashforth iteration then proceeds by writing the integral version of (1) as and approximating the function under the integral by the extrapolated interpolating polynomial to give


where and the are the Adams-Bashforth coefficients for order , all independent of and summing to 1. Note that if is a polynomial of degree (so is a polynomial of degree ) this procedure will give the next solution value exactly. Otherwise the extrapolation error in is of order and in (after an integration) is of order . So the local truncation error is and the global error iserles08.

Adams-Moulton methods are similar except that the polynomial interpolates the points . The resulting equation analogous to (7) is thus an implicit one, with the unknown appearing on both sides. Typically AM methods are used in conjunction with an AB method of one order lower, in a ‘predictor-corrector’ arrangement. Here, a predictor value is calculated using an AB step; this is then used to estimate ; and finally an AM step uses this value to calculate . We again refer the reader to Appendix C for details of the AM construction.

2 Derivation of Adams family LMMs via Gaussian processes

We now consider a formulation of the Adams-Bashforth family starting from a Gaussian process framework and then present a probabilistic extension. We fix a joint Gaussian process prior over as follows. We define two vectors of functions and in terms of the Lagrange polynomials defined in (6) as


The elements (excluding the first) of form a basis for and the elements of form a basis for . The initial in is necessary to make the dimensions of the two vectors equal, so we can correctly define products such as which will be required later. The first element of can be any non-zero constant ; the analysis later is unchanged and we therefore take .

Since we will solely be interested in values of the argument corresponding to discrete equispaced time-steps indexed relative to the current time-point , we will make our notation more concise by writing for , and similarly for . We now use these vectors of basis functions to define a joint Gaussian process prior as follows:


This construction works because and differentiation is a linear operator; the rules for the transformation of the covariance elements is given in Section 9.4 of rasmussen06 and can easily be seen to correspond to the defined relationship between and .

Recalling the decomposition in (5), we are interested in the conditional distribution . This is also Gaussian, with mean and covariance given by the standard formulae for Gaussian conditioning. This construction now allows us to state the following result:

Proposition 1.

The conditional distribution under the Gaussian process prior given in (10), with covariance kernel basis functions as in (8) and (9), is a -measure concentrated on the -step Adams-Bashforth predictor .

The proof of this proposition is given in Appendix A.

Because of the natural probabilistic structure provided by the Gaussian process framework, we can augment the basis function vectors and to generate a conditional distribution for

that has non-zero variance. By choosing a particular form for this augmented basis we can obtain an expression for the standard deviation of

that is exactly equal to the leading-order local truncation error of the corresponding deterministic method.

We will expand the vectors and by one component, chosen so that the new vector comprises elements that span a polynomial space of order one greater than before. Define the augmented bases and as


The additional term at the end of is the polynomial of order which arises from interpolating at points (with the additional point at ) and choosing the basis function corresponding to the root at , scaled by with a positive constant whose role will be explained in the next section. The elements of these vectors span and respectively. With this new basis we can give the following result:

Proposition 2.

The conditional distribution under the Gaussian process prior given in (10), with covariance kernel basis functions as in (11) and (12), is Gaussian with mean equal to the -step Adams-Bashforth predictor and, setting for some , standard deviation equal to its local truncation error.

The proof is given in Appendix B. In order to de-mystify the construction, we now exhibit a concrete example for the case . The conditional distribution of interest is . In the deterministic case, the vectors of basis functions become

and simple calculations give that

The probabilistic version follows by setting

and further calculation shows that

An entirely analogous argument can be shown to reproduce and probabilistically extend the implicit Adams-Moulton scheme. The Gaussian process prior now includes as an additional variable and the correlation structure and vectors of basis functions are modified accordingly. The required modifications are given in Appendix C and a explicit derivation for the 4-step AM method is given in Appendix D.

2.1 The role of

Replacing in (11) by , with , makes the variance of the integrator coincide exactly with the local truncation error of the underlying deterministic method.444We do not claim that this is the only possible way of modelling the numerical error in the solver. The question of how to do this accurately is an open problem in general, and is particularly challenging in the multi-dimensional case. In many real world problems different noise scales will be appropriate for different dimensions and – especially in ‘hierarchical’ models arising from higher-order ODEs – non-Gaussian noise is to be expected. That said, the Gaussian assumption as a first order approximation for numerical error is present in virtually all work on this subject and goes all the way back to skilling93. We adopt this premise throughout, whilst noting this interesting unresolved issue.

This is of course of limited utility unless higher derivatives of are available, and even if they are, is itself unknowable in general. However it is possible to estimate the integrator variance in a systematic way by using backward difference approximations fornberg88 to the required derivative at . We show this by expanding the -step Adams-Bashforth iterator as


where are the set of coefficients and the local truncation error constant for the -step Adams-Bashforth method, and are the set of backward difference coefficients for estimating the th derivative of to order fornberg88.

In other words, the constant can be substituted with , using already available function values and to adequate order. It is worth noting that collecting the coefficients and results in an expression equivalent to the Adams-Bashforth method of order and therefore, this procedure is in effect employing two integrators of different orders and estimating the truncation error from the difference of the two.555An explicit derivation of this for is given in Appendix E. This principle is similar to the classical Milne Device butcher08, which pairs an AB and and AM iterator to achieve the same thing. Using the Milne Device to generate a value for the error variance is also straightforward within our framework, but requires two evaluations of at each iteration (one of which immediately goes to waste) instead of the approach presented here, which only requires one.

3 Convergence of the probabilistic Adams-Bashforth integrator

We now give the main result of our paper, which demonstrates that the convergence properties of the probabilistic Adams-Bashforth integrator match those of its deterministic counterpart.

Theorem 3.

Consider the -step deterministic Adams-Bashforth integrator given in Proposition 1, which is of order . Then the probabilistic integrator constructed in Proposition 2 has the same mean square error as its deterministic counterpart. In particular

where denotes the true solution, the numerical solution, and K is a positive real number depending on but independent of .

The proof of this theorem is given in Appendix F, and follows a similar line of reasoning to that given for a one-step probabilistic Euler integrator in conrad16. In particular, we deduce the convergence of the algorithm by extrapolating from the local error. The additional complexity arises due to the presence of the stochastic part, which means we cannot rely directly on the theory of difference equations and the representations of their solutions. Instead, following buckwar06, we rewrite the defining -step recurrence equation as a one-step recurrence equation in a higher dimensional space.

4 Implementation

We now have an implementable algorithm for an -step probabilistic Adams-Bashforth integrator. Firstly, an accurate initialisation is required for the first iterations – this can be achieved with, for example, a Runge-Kutta method of sufficiently high order.666We use a (packaged) adaptive Runge-Kutta-Fehlberg solver of 7th order with 8th order error control. Secondly, at iteration , the preceding stored function evaluations are used to find the posterior mean and variance of . The integrator then advances by generating a realisation of the posterior measure derived in Proposition 2. Following chkrebtii16, a Monte Carlo repetition of this procedure with different random seeds can then be used as an effective way of generating propagated uncertainty estimates at any time .

4.1 Example – Chua circuit

The Chua circuit chua92 is the simplest electronic circuit that exhibits chaotic behaviour, and has been the subject of extensive study – in both the mathematics and electronics communities – for over 30 years. Readers interested in this rich topic are directed to chua07 and the references therein. The defining characteristic of chaotic systems is their unpredictable long-term sensitivity to tiny changes in initial conditions, which also manifests itself in the sudden amplification of error introduced by any numerical scheme. It is therefore of interest to understand the limitations of a given numerical method applied to such a problem – namely the point at which the solution can no longer be taken to be a meaningful approximation of the ground truth. Probabilistic integrators allow us to do this in a natural way chkrebtii16.

The Chua system is given by , , . We use parameter values , , , , and initial conditions , , . This particular choice is taken from ‘Attractor CE96’ in bilotta08.

Figure 1: Time series for the -component in the Chua circuit model described in Section 4.1, solved 20 times for using an -step probabilistic AB integrator with (top), (middle), (bottom). Step-size remains throughout. Wall-clock time for each simulation was close to constant ( per cent – the difference primarily accounted for by the RKF initialisation procedure).

Using the probabilistic version of the Adams-Bashforth integrator with , it is possible to delay the point at which numerical path diverges from the truth, with effectively no additional evaluations of required compared to the one-step method. This is demonstrated in Figure 1. Our approach is therefore able to combine the benefits of classical higher-order methods with the additional insight into solution uncertainty provided by a probabilistic method.

4.2 Example – Lotka-Volterra model

We now apply the probabilistic integrator to a simple periodic predator-prey model given by the system , for parameters , , and . We demonstrate the convergence behaviour stated in Theorem 3 empirically.

The left-hand plot in Figure 2 shows the sample mean of the absolute error of 200 realisations of the probabilistic integrator plotted against step-size, on a log-log scale. The differing orders of convergence of the probabilistic integrators are easily deduced from the slopes of the lines shown.

The right-hand plot shows the actual error value (no logarithm or absolute value taken) of the same 200 realisations, plotted individually against step-size. This plot shows that the error in the one-step integrator is consistently positive, whereas for two- and three-step integrators is approximately centred around 0. (This is also visible with the same data if the plot is zoomed to more closely examine the range with small .) Though this phenomenon can be expected to be somewhat problem-dependent, it is certainly an interesting observation which may have implications for bias reduction in a Bayesian inverse problem setting.

Figure 2: Empirical error analysis for the -component of 200 realisations of the probabilistic AB integrator as applied to the Lotka-Volterra model described in Section 4.2. The left-hand plot shows the convergence rates for AB integrators of orders 1-5, while the right-hand plot shows the distribution of error around zero for integrators of orders 1-3.

5 Conclusion

We have given a derivation of the Adams-Bashforth and Adams-Moulton families of linear multistep ODE integrators, making use of a Gaussian process framework, which we then extend to develop their probabilistic counterparts.

We have shown that the derived family of probabilistic integrators result in a posterior mean at each step that exactly coincides with the corresponding deterministic integrator, with the posterior standard deviation equal to the deterministic method’s local truncation error. We have given the general forms of the construction of these new integrators to arbitrary order. Furthermore, we have investigated their theoretical properties and provided a rigorous proof of their rates of convergence, Finally we have demonstrated the use and computational efficiency of probabilistic Adams-Bashforth methods by implementing the solvers up to fifth order and providing example solutions of a chaotic system, and well as empirically verifying the convergence rates in a Lotka-Voltera model.

We hope the ideas presented here will add to the arsenal of any practitioner who uses numerical methods in their scientific analyses, and contributes a further tool in the emerging field of probabilistic numerical methods.

KZ was partially supported by a grant from the Simons Foundation. Part of this work was done during the author’s stay at the Newton Institute for the programme Stochastic Dynamical Systems in Biology: Numerical Methods and Applications.


A Proof of Proposition 1

Recall that for all . Straightforward substitutions into the definitions give that , etc. and hence , for all . Furthermore since every component of bar the first is a polynomial of degree with a factor . Finally

Now by (10) and the standard formulae for Gaussian conditioning, we have

which is equal to the -step Adams-Bashforth predictor defined by (6) and (7). Next we write

and the proposition follows.

B Proof of Proposition 2

We follow the same reasoning as in Proposition 1. Since the additional basis function at the end of is clearly zero at for all , each inner product of the form , and is equal to the corresponding inner product , and as no additional contribution from the new extended basis arises. It therefore suffices to check only the terms of the form .

Integrating the additional basis function gives a polynomial of degree with a constant factor . Evaluating this at means that the additional term is also 0 in . Therefore and . It follows that the expression for is exactly the same as when using the unaugmented basis function set.

The argument in the previous paragraph means we can immediately write down that

Since the first components of are equal to the components of , this expression reduces to the contribution of the augmented basis element. Therefore

The Adams-Moulton coefficient is equal to the local truncation error constant for the -step Adams-Bashforth method butcher08 and the proposition follows.

C Extension to Adams-Moulton

We collect here the straightforward modifications required to the constructions in the main paper to produce implicit Adams-Moulton methods instead of explicit Adams-Bashforth versions.

The telescopic decomposition (5) becomes


where it is particularly to be noted that is no longer superfluous.

The Lagrange interpolation resulting in the the Adams-Moulton method is


the analogous vectors of basis polynomials to (8) and (9) are


and the iterator is defined by


with are the Adams-Moulton coefficients.

The Gaussian process prior resulting in AM is


D Adams-Moulton integrator with

The conditional distribution of interest is . In the deterministic case the vectors of basis functions become


and the resulting calculations give

The probabilistic version is