 # A Bulirsch-Stoer algorithm using Gaussian processes

In this paper, we treat the problem of evaluating the asymptotic error in a numerical integration scheme as one with inherent uncertainty. Adding to the growing field of probabilistic numerics, we show that Gaussian process regression (GPR) can be embedded into a numerical integration scheme to allow for (i) robust selection of the adaptive step-size parameter and; (ii) uncertainty quantification in predictions of putatively converged numerical solutions. We present two examples of our approach using Richardson's extrapolation technique and the Bulirsch-Stoer algorithm. In scenarios where the error-surface is smooth and bounded, our proposed approach can match the results of the traditional polynomial (parametric) extrapolation methods. In scenarios where the error surface is not well approximated by a finite-order polynomial, e.g. in the vicinity of a pole or in the assessment of a chaotic system, traditional methods can fail, however, the non-parametric GPR approach demonstrates the potential to continue to furnish reasonable solutions in these situations.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

Solutions to ordinary differential equations (ODEs) are often sought by means of adaptive numerical integration algorithms

NumRce . The accuracy of these approaches regularly relies on the assumption that the true solution is suitably smooth across the region of integration so that locally, i.e. in an interval of width equal to integration step-size , where indexes the algorithm’s iteration, the numerical error is well approximated by a finite-order polynomial. The choice of step-size is then determined using an extrapolation procedure and a given precision bound. Of particular importance to the present study are the extrapolation procedures of Richardson Rich and the Bulirsch-Stoer algorithm (BSA) BS1964 . The BSA is a descendant of Richardson’s algorithm and aims to quantify whether the numerical integration scheme has recovered a converged solution to the ODE at the iteration by recursively assessing the behaviour of the numerical error in the limit as . Central to the good performance of this method is the assumption that the error function can be expressed as a power series in . While this assumption has broad applicability, it can breakdown in a number of scenarios, e.g. in the vicinity of a singularity or when aiming to compute particle trajectories in chaotic systems - owing to exponentially large separation rates between nearby particles. In scenarios like these, numerical integration algorithms which are underpinned by polynomial extrapolation can fail and often with considerable burden on computational economy.

In this paper we instead propose treating the numerical error as uncertain. We employ a probabilistic reinterpretation of the problem of adaptively determining a suitable choice of step-size at each iteration. We do this by embedding Gaussian process regression (GPR) in to the extrapolation component of the numerical integration scheme. Our non-parametric approach side-steps the assumption that the numerical error is ‘well-behaved’ near the region of interest and contrasts with the parametric (polynomial) extrapolation methods of Richardson and Bulirsch-Stoer. Moreover, the proposed approach allows for uncertainty quantification in the choice of step-size, at each iteration, and consequently the numerical error. Hence, the GPR implementation can allow for a broader theoretical assessment of candidate numerical solutions to the ODE relative to alternative deterministic methods. Our GPR extrapolation method contributes to the expanding field of probabilistic numerics BOGO2018 ; HOG2015 ; Cockayne2017 and as we will demonstrate may offer a more robust numerical integration scheme, particularly when the true solution to the ODE over the region of interest is not well approximated by a finite-order polynomial.

We begin our discussion by considering the widely used extrapolation scheme put forward by Richardson. After a brief mathematical description of the method, we introduce how a GP can be successfully included into the scheme and present an algorithm to do this. The Bulirsch-Stoer algorithm is a descendant of Richardson’s scheme and is regularly used to compute practical numerical solutions to ODEs. We summarize the BSA before presenting a probabilistic version of the algorithm with a application. We finish by applying both the probabilistic and deterministic versions of the algorithm to the classical problem of a Keplerian orbit.

Our goal is to approximate the integral , which we assume exists and is bounded over the domain . Where possible, we avoid unnecessary mathematical details and instead focus on describing the various issues associated with the computation of , i.e. owing to the behaviour of the integrand , either locally or globally in , or because of the choice of numerical integration scheme.

## 2 Richardson extrapolation

Let be a numerical evaluation of the integral using a fixed step-size . Generally, if is smooth and bounded across , can be written as a weighted sum of evaluations of over a sub-interval of length , i.e.

 I(f,h)=hN∑i=1wif(xi)≈∫ω⊆Ωf(x)dx,

where the weights and number of abscissa are given by the choice of numerical scheme, e.g. a Gaussian or Netwon-Cotes type quadrature method. As it will be important later, the trapezoidal rule, from the Newton-Cotes family of numerical schemes, sets , , . In all evaluations of this type, the error in approximating the integral by is particularly sensitive to the choice of step parameter .

Under the assumption that the numerical evaluation can be written as

 I(f,h0)=F∗+C0hn0+O(hn+10),n∈N

where is some choice of step parameter and a bounded constant, Richardson’s extrapolation scheme can be used to: (i) assess sensitivity to the choice of and; (ii) recursively reduce the leading order approximation error, i.e. sequentially remove . Note, in order to simplify notation the above equation assumes that . To derive Richardson’s extrapolation scheme, we re-write the numerical approximation in terms of a new step-size , for some (the choice of which is discussed later), and combine and to remove the leading order error term. Repeating this process for , the scheme can be written as a recurrence relation in which, for each additional iteration, the approximation error decreases111The error cannot be reduced beyond numerical rounding error, i.e.

 Rj=γ1−n−jI(f,hj)−Rj−1γ1−n−j−1=F∗+Cjhn+j0+O(hn+j+10):j=1,2…% andR0=I(f,h0).

Ignoring numerical rounding errors, the approach can be used to compute to within a desired tolerance , which we refer to as a converged numerical solution, e.g. by identifying . For example, the method underpins the Romberg extension of the trapezoidal scheme.

A variation of Richardson’s method, sometimes referred to as Richardson’s deferred approach to the limit, computes an ensemble of values and fits a polynomial, as a function of , to

. The intercept of the fitted polynomial is then an estimate of

, i.e. in the theoretical limit as , which we denote by . The process is repeated for several distinct initializations of , e.g. and , and a converged numerical solution is identified when . This approach underpins the Bulirsch-Stoer BS1964 extension of Richardson’s extrapolation technique.

Central to the good performance of both methods is the parametric assumption that the numerical error can be approximated by a power series in of order , or more generally on being analytic and bounded in . The assumption is widely applicability, but can fail when two nearby approximations, e.g. at and , diverge at a rate not captured by , i.e. near to a singular point or owing to the systems intrinsic chaotic nature. To overcome this limitation, we propose a non-parametric approach that is inspired, in part, by recent developments in the Bayesian Quadrature literature D1988 ; OH1991 . We illustrate the approach considering the example of Richardson’s deferred approach to the limit, i.e. the Bulirsch-Stoer algorithm.

### 2.1 Uncertainty in the numerical evaluation

So far we have treated the numerical evaluation as one in which the approximation error is fixed by a given choice of step-size . We now make a conceptual shift from this position and instead assume that the approximation error has intrinsic uncertainty. For a very good discussion of the philosophical and scientific implications of this switch see HOG2015 . In the present paper, we let denote the probabilistic re-interpretation of , where

 ^I(f,hj)=F∗+ϵj(hj),F∗∈Randhj∈R+,

i.e. the uncertainty appears through the presence of the random error term . By construct, the approximation error must approach zero as the step-size approaches zero, which, in the probabilistic model, requires that

 p(^I(f,hj)−F∗=0∣F∗)→1,j→∞and0≤hj

The identity is motivated by a key assumption we make: that a numerical evaluation , computed using a deterministic numerical integration scheme, is a draw from the distribution of probabilistic evaluations . Hence, as , the above identity holds. We furthermore consider that a collection of random variables is a Gaussian process (GP), where denotes a distinct step-size that monotonically decreases as increases. In the spirit of the Bulirsch-Stoer algorithm, our objective is to estimate at from the fitted estimate of . In the probabilistic reinterpretation, an estimate of is deemed a converged numerical evaluation when the uncertainty in is below a user defined uncertainty tolerance (henceforth we drop the hat). Therefore, in deterministic evaluations of , the tolerance places a bound on the numerical precision of an estimate, whereas in probabilistic evaluations places a bound on the maximum uncertainty of an estimate. Our approach can broadly be described in four steps.

Step 1; Input a

dimensional vector of step-sizes

and compute the initial data set comprising pairs of variables , where each is assumed a draw from the distribution of probabilistic evaluations . is computed via a numerical integration scheme, and in the present treatment we use the trapezoidal scheme. Step 2; Let the collection of random variables be a GP, i.e.

 ^I(f,h)∣D(0)h∼GP(m(h),k(h,h′)),

where and denote the GP mean and kernel functions, respectively. From this model, compute estimates of: (i) at , denoted , and; (ii) , i.e. the uncertainty of in the estimate . Step 3; if the uncertainty in the estimate is above a user defined threshold, i.e. , we reject as a reasonable estimate of and instead estimate a new step-size which satisfies

 hK+1:σ(hK+1)=^γσ0,0<^γ<1,

i.e. locate such that the uncertainty in the fitted value is a user defined fraction of the uncertainty in the current estimate of . In practice is identified using a grid search over the region . Step 4; augment the sample data with a new observation using , i.e.

 D(1)h=D(0)h∪{hK+1,I(f,hK+1)},

where is computed via the deterministic numerical integration scheme. Using the augmented data, we then repeat steps 2-4 until we identify an estimate of with uncertainty . See Alg. 1 for pseudo-code.

In summary, the proposed approach is an adaptive learning technique, developed from Richardson’s extrapolation and the Bulirsch-Stoer algorithm, that sees the uncertainty in the current estimate of being used to guide the choice of subsequent step-sizes . There are two notable benefits of this approach over and above the deterministic alternative: (i) there is no parametric assumption concerning the behaviour of the error function, i.e. that the leading order error term is well approximated by of a finite order polynomial, and; (ii) the choice of step-size between two observations and is ‘dynamic’, i.e. in the deterministic set-up the distance between two abscissa and is fixed and controlled by the parameter whereas in the probabilistic setting the distance can vary between any two adjacent abscissa owing to asymmetry in the distribution of the uncertainty parameter and the choice of (henceforth we drop the hat). A fundamental challenge remains between both the deterministic and probabilistic approaches, however, as each subsequent step-size , , must be small enough to closely approximate while large enough that computation is dominated by discretization error only, e.g. see HV2016 . In the probabilistic setting, this issue is related to the choice of step-control parameter , the choice of ‘prior’ mean and the kernel function of the GP.

To illustrate our approach, and to go some way toward assessing the influence of these choices, we consider the following test function

 f(x)=exp(−(x−0.35)22(0.1)2)−sin(10x)3 (1)

and aim to approximate the integral of over the interval . We construct the initial data from a 4-dimensional vector of step-sizes and use the trapezoidal rule to compute each , and set the step-control parameter . For the GP, we chose the Matérn class of kernel Stein1999 :

where , is a modified Bessel function and are the kernel parameters and set

(for mean square differentiability). We discounted the popular radial basis function (also called squared exponential), which is a limiting case of the Matérn class (as

), as it makes stronger assumptions on smoothness and it has been argued that this may be unrealistic for modeling many physical systems, see Stein1999 ; GPbook . The results from our analysis are presented in Fig. 1. We note that a consequence of the GP modeling approach is that the posterior mean of is strongly influenced by the choice of prior mean when lies in the interval between the zero and the most recent update of the step size , i.e. . We therefore assessed the sensitivity of our results to the choice of prior mean across three prior initializations: zero, the average of the initial points and , i.e. evaluated at the smallest step-size in the initial data set (Fig. 1 left). In our example, the results illustrate a strong preference toward setting the prior mean as the final observation . Working under the assumption that the most recent evaluation of I is the most accurate, the result seems reasonable. Fig. 1 (right) displays the results when applying the approach and demonstrates how the sequential updating of the pair , and thus the data , can quickly shrink the uncertainty both in the estimation of the integral at . It is possible, therefore, that dynamic updating of the step-size to can lead to faster convergence properties than when fixing the step-size, as in the deterministic set-up. Figure 1: Left: assessment of sensitivity to the choice of the prior mean (μ) in the GP. Three choices are considered: μ=0 , μ=¯x and μ=xi. Right: Evaluation F∗ (marked with a star at h=0). The Blue points denote the evaluation of I(f,h) using the 4 initial step-sizes hK={1,1/2,1/3,1/4} and a 95% uncertainty band. We then apply the GP algorithm over several iterations (only 2 required) to identify a converged estimate of F∗. Each of the two new step-sizes were chosen based on the step-size hj in which the 95% uncertainty band is below γ=0.5 of the uncertainty σ0 at h=0. The first evaluation, point h5, is denoted by Green point and the second, h6, is denoted by Red.

## 3 Bulirsch-Stoer Algorithm

We have reviewed the Bulirsch-Stoer extrapolation algorithm by fixing the domain of integration to be the length the initial step-size . If the domain is local to a singular point, however, numerical evaluations which rely on a fixed initial step-size can regularly fail to converge. The full implementation of the Bulirsch-Stoer algorithm aims to resolve this issue. In regions where a converged numerical solution appears difficult to identify, the choice of initial step-size, and thus the domain of integration, is repeatedly reduced until a converged numerical solution has been identified. Hence, the original domain of integration is partitioned into segments, in which a converged solution is identified, and the integral is evaluated by aggregating the results from each segment, see Alg. 2 for pseudo code. Our probabilistic re-interpretation of this algorithm sees our GP scheme replacing the traditional extrapolation algorithm, as before, for each segment in the domain, see Alg. 3 for pseudo code. To better suit our numerical example in this section, we henceforth consider the domain of integration as a temporal region and denote a step-size (time-step) by .

The Bulirsch-Stoer algorithm is descendant of Richardson’s deferred approach to the limit and therefore typically222It is worth noting that the original method as proposed by Bulirsch-Stoer used rational function extrapolation, however straightforward polynomial extrapolation is typically slightly more efficient than rational function extrapolation NumRce (for smooth problems). relies on performing polynomial extrapolation between the numerical evaluations at each abscissa. Briefly, for each segment of integration, the algorithm usually selects the abscissa at the points , where follows the series originally proposed by Bulirsch-Stoer . However, it is more common to see the sequence more recently suggested by Deuflard Deuflard1 ; Deuflard2 – i.e. – which is the scheme we adopt. Typically, the choice of is made Deuflard1 ; Deuflard2 . The Bulirsch-Stoer algorithm is an adaptive deterministic numerical scheme, which furnishes solutions to the integral problem across a broad range of scenarios. However, the approach is underpinned by the assumption that the integrand is well approximated by a polynomial in each segment of the domain of integration which, even for an analytic function , can lead to issues. We now illustrate this.

In Fig. 2

we provide an example of a well-known situation where polynomial interpolation can perform badly. In this famous example of

Runge’s phenomenon, the GP approach, using a Matérn kernel Stein1999 , provides excellent extrapolation and interpolation whereas the polynomial interpolation procedure fails, particularly at the tails of the domain. This result might be reversed however, particularly if the true underlying function is or is well approximated by a polynomial. In this case, the GP scheme will almost certainly fail to match the goodness of the polynomial extrapolation scheme. The extent to which the methods differ in performance will likely be dependent on the the choice of kernel function: if we had some prior knowledge that the function should ‘look like’ a polynomial, we might choose a kernel that helps the GP posterior mean to exhibit polynomial like behaviour, see for example Karvonen1998 . Figure 2: The well-known case of Runge’s phenomenon where polynomial interpolation results in oscillation at the edge of the interval. Gaussian Process Regression does not suffer from the same problem and provides a better fit on the same data.

In the probabilistic re-interpretation of the Bulirsch-Stoer algorithm, we make the assumption that the numerical evaluation converges to the correct solution as . This is typically true, but when the domain is partitioned into sufficiently small time-steps rounding error can dominate error due to discretization. The GPR derived scheme suffers from the same issue. Outside of numerical rounding error, the GPR scheme can assess convergence using a combination of two conditions: assessing the uncertainty of the predicted mean in the limit as , via the fitted GP model, and; by comparing the estimated means between any two intializations. The standard Bulirsch-Stoer algorithm only assesses the later.

The main disadvantage of the probabilistic version of the Bulirsch-Stoer algorithm is computation time, i.e. for a fixed number of abscissa fitting a polynomial using Neville’s algorithm is typically faster than aiming to maximize the likelihood of the kernel parameters. Certainly, for a smooth and bounded function , the original BSA will likely outperform the probabilistic version. However, in cases where a segment of the domain is partitioned multiple times, e.g. owing to rapidly changing over the interval, it is possible that the overall performance of the probabilistic version will be better than the original.

### 3.1 Example: Kepler orbit

To illustrate utility in the probabilistic re-interpretation of the Bulirsch-Stoer algorithm, we consider the case of a mass-less particle in a Keplerian potential 333The choice of an astronomical problem is perhaps fitting as an early reported use of Gaussian processes dates to time series analysis by the astronomer T.N. Thiele in 1880. L1981 . The problem involves finding a solution to the second order ODE,

 ¨r=GM|r|3r

where is the position vector, is Newton’s constant and is total mass (it is convention to use units with ). We consider the case of a particle initially at on an orbit with eccentricity of 0.99. This results in a close passage to the singularity in the force at .

We compare two different versions of Bulirsch-Stoer: the usual polynomial extrapolation and; the Gaussian process regression approach. Integration over each sub-interval is performed using a second-order leapfrog integrator. For the polynomial scheme, if integration fails to converge after eight sub-divisions of the time-step , the region is iteratively halved until convergence is achieved. We note that more complex schemes can be used to reduce the time-step (e.g. see NumRce ), which might outperform the Bulirsch-Stoer scheme. For the GP approach, we do not place a limit on the number of times an interval can be sub-divided owing to the ability of the GP scheme to closely approximate a wider class of analytic functions than those that are well approximated by a finite order polynomial.

Fig. 3 displays a comparison of the two schemes relative to the true solution. Notably, as the particle first passes through pericenter passage, i.e. the closest approach to the origin, which is a singularity, the polynomial based scheme fails and no longer furnishes an estimate of the particle’s position. However, the non-parametric GP scheme closely approximates the particle’s trajectory as it first passes through pericenter and, moreover,continues to furnish a reasonable approximation of the particle path throughout the time interval considered. Figure 3: Comparison of Bulirsch-Stoer schemes with polynomial extrapolation and Gaussian process regression. Displayed are the x (horizontal) and y (vertical) co-ordinates of a particle, as a function of time, in a Kepler orbit with eccentricity of 0.99. The Bulirsch-Stoer scheme using polynomial extrapolation (blue circles) fails during the pericenter passage (i.e. closest approach). In contrast, Gaussian process regression (green circles) closely folows the true particle trajectory across the spatio-temporal domain.

## 4 Discussion

In this paper we have presented an adaptive-learning integration scheme, which embeds Gaussian process regression (GPR) into the Bulirsch-Stoer algorithm (BSA), for use in numerical integration. This probabilistic approach extends the current deterministic implementation of the BSA in two ways. Firstly, the approach avoids the parametric assumption that a numerical evaluation of the integral has an error term that is closely approximated by a finite order polynomial. This assumption can lead to the deterministic approach failing near a singular point in the integrand. Secondly, for given step-size, uncertainty in the numerical evaluation of the integral via GPR is used to guide how the step-size should be adapted. The combination of these additional features leads to a more robust numerical integration scheme beyond the current standard. The probabilistic scheme is not intended one-size-fits-all replacement for existing numerical methods, however, but rather to be considered as an alternative when standard approaches fail.

#### Acknowledgments

We would like to thank Prof. Chris Oates for helpful comments in connection with the project and Dr. Paul Kirk for interesting scientific discussions and helpful comments on the manuscripts. PB acknowledge support from the Leverhulme Trust (Research Project Grant, RPG-2015-408). CNF acknowledges support from the Medical Research Council (MC UU 00002/7).

## References

• (1) Briol, F.-X. , Oates, C. J., Girolami, M., Osborne, M. A. and Sejdinovic, D.. Probabilistic integration: A role in statistical computation? (with discussion). Statistical Science, 2018.
• (2) Briol, F-X., Oates, C. J., Girolami, M., Osborne, M. A. (2015). Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. Advances In Neural Information Processing Systems (NIPS), pages 1162-1170
• (3)

Briol, F-X, Oates, C. J., Cockayne, J. Chen, W. Y., Girolami, M. (2017). On the sampling problem for kernel quadrature. Internatiational Conference on Machine Learning.PMLR 70:586-595

• (4) Boekholt T., Portegies Zwart S., 2015, Comput. Astrophys. Cosmol., 2, 2
• (5) Bulirsch, R., Stoer, J.: Fehlerabschätzungen und extrapolation mit rationalen funktionen bei verfahren vom richardson-typus. Numerische Mathematik (6), 413–427 (1964)
• (6) Cockayne, J. Oates, C. J., Sullivan T., and Girolami, M.. Bayesian probabilistic numerical methods. arXiv:1702.03673, 2017
• (7) Dehnen W., Read J., 2011, European Physical Journal Plus, 126, 55
• (8) Deuflhard, P. 1983, Numerische Mathematik, vol. 41, pp. 399–422.
• (9) Deuflhard, P. 1985, SIAM Review, vol. 27, pp. 505–535.
• (10) Diaconis (1988) Bayesian numerical analysis, Statistical decision theory and related topics V, pp. 163–175.
• (11) Harvey R, Verseghy DL (2016) The reliability of single precision computations in the simulation of deep soil heat diffusion in a land surface model. Clim Dyn 46:3865–3882.
• (12) Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A, 471(2179):20150142, 2015.
• (13) Karvonen, T., Oates, C. J., and Särkkä, S. A Bayes-Sard Cubature Method. arXiv preprint arXiv:1804.03016, 2018
• (14) Lauritzen, S. L., "Time series analysis in 1880. A discussion of contributions made by T.N. Thiele". International Statistical Review 49, 1981, 319–333
• (15)

MacKay, D. (1998). In C.M. Bishop (Ed.), Neural networks and machine learning. (NATO ASI Series, Series F, Computer and Systems Sciences, Vol. 168, pp. 133- 166.) Dordrecht: Kluwer Academic Press

• (16) O’Hagan (1991) Bayes-Hermite Quadrature, Journal of Statistical Planning and Inference 29, pp. 245–260.
• (17) Oates, C. J., Niederer, S., Lee, A., Briol, F-X. Girolami, M. (2016). Probabilistic Models for Integration Error in the Assessment of Functional Cardiac Models
• (18) Press, W., 2007, Numerical Recipes in C
• (19) Rasmussen, C. and Williams, C., Gaussian Processes for Machine Learning, 2006
• (20) Richardson, L. F. (1911), "The Approximate Arithmetical Solution by Finite Differences of Physical Problems Involving Differential Equations, with an Application to the Stresses in a Masonry Dam", Philosophical Transactions of the Royal Society A, 210 (459–470): 307–357
• (21) Runge, C., Zeitschrift für Mathematik und Physik, 46: 224–243, 1901
• (22) Schwaighofer, A., Tresp, V. and Yu., K. Learning Gaussian Process Kernels via Hierarchical Bayes. In NIPS 17, Cambridge, MA, 2005. MIT Press.
• (23) Stein, M. L. (1999). Interpolation of Spatial Data.
• (24) Wilson, A. G. and Adams, R. P., Gaussian process kernels for pattern discovery and extrapolation. arXiv preprint arXiv:1302.4245, 2013.
• (25) Wilson, A. G. and Gilboa, E. and Nehorai, A. and Cunningham, J. P. Fast Kernel Learning for Multidimensional Pattern Extrapolation