1 Introduction
Solutions to ordinary differential equations (ODEs) are often sought by means of adaptive numerical integration algorithms
NumRce. The accuracy of these approaches typically relies on the assumption that the true solution is suitably smooth across the region of integration, so that locally, i.e. in an interval of width equal to the integration stepsize $h_n$, where $n$ indexes the algorithm's iteration, the numerical error is well approximated by a finite-order polynomial. The choice of stepsize is then determined using an extrapolation procedure and a given precision bound. Of particular importance to the present study are the extrapolation procedures of Richardson Rich and the Bulirsch–Stoer algorithm (BSA) BS1964. The BSA is a descendant of Richardson's algorithm and aims to quantify whether the numerical integration scheme has recovered a converged solution to the ODE at the $n$-th iteration by recursively assessing the behaviour of the numerical error in the limit as $h_n \to 0$. Central to the good performance of this method is the assumption that the error function can be expressed as a power series in $h_n$. While this assumption has broad applicability, it can break down in a number of scenarios, e.g. in the vicinity of a singularity, or when aiming to compute particle trajectories in chaotic systems, owing to exponentially large separation rates between nearby particles. In scenarios like these, numerical integration algorithms which are underpinned by polynomial extrapolation can fail, often with considerable burden on computational economy.

In this paper we instead propose treating the numerical error as uncertain. We employ a probabilistic reinterpretation of the problem of adaptively determining a suitable choice of stepsize at each iteration. We do this by embedding Gaussian process regression (GPR) into the extrapolation component of the numerical integration scheme. Our nonparametric approach sidesteps the assumption that the numerical error is 'well-behaved' near the region of interest and contrasts with the parametric (polynomial) extrapolation methods of Richardson and Bulirsch–Stoer.
Moreover, the proposed approach allows for uncertainty quantification in the choice of stepsize at each iteration, and consequently in the numerical error. Hence, the GPR implementation can allow for a broader theoretical assessment of candidate numerical solutions to the ODE relative to alternative deterministic methods. Our GPR extrapolation method contributes to the expanding field of probabilistic numerics BOGO2018; HOG2015; Cockayne2017 and, as we will demonstrate, may offer a more robust numerical integration scheme, particularly when the true solution to the ODE over the region of interest is not well approximated by a finite-order polynomial.
We begin our discussion by considering the widely used extrapolation scheme put forward by Richardson. After a brief mathematical description of the method, we introduce how a GP can be successfully included in the scheme and present an algorithm to do this. The Bulirsch–Stoer algorithm is a descendant of Richardson's scheme and is regularly used to compute practical numerical solutions to ODEs. We summarize the BSA before presenting a probabilistic version of the algorithm with an application. We finish by applying both the probabilistic and deterministic versions of the algorithm to the classical problem of a Keplerian orbit.
Our goal is to approximate the integral $I = \int_a^b f(x)\,\mathrm{d}x$, which we assume exists and is bounded over the domain $[a, b]$. Where possible, we avoid unnecessary mathematical details and instead focus on describing the various issues associated with the computation of $I$, i.e. owing to the behaviour of the integrand $f$, either locally or globally in $[a, b]$, or because of the choice of numerical integration scheme.
2 Richardson extrapolation
Let $T(h)$ be a numerical evaluation of the integral $I$ using a fixed stepsize $h$. Generally, if $f$ is smooth and bounded across $[a, b]$, $T(h)$ can be written as a weighted sum of evaluations of $f$ over subintervals of length $h$, i.e.
$$T(h) = \sum_{i=1}^{m} w_i\, f(x_i),$$
where the weights $w_i$ and number of abscissae $m$ are given by the choice of numerical scheme, e.g. a Gaussian or Newton–Cotes type quadrature method. As it will be important later, we note that the trapezoidal rule, from the Newton–Cotes family of numerical schemes, assigns weight $h/2$ to the two abscissae at the endpoints of each subinterval. In all evaluations of this type, the error in approximating the integral $I$ by $T(h)$ is particularly sensitive to the choice of step parameter $h$.
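As a concrete sketch of this weighted-sum form (our own minimal Python illustration, not the paper's code), the composite trapezoidal rule gives interior abscissae weight $h$ and the two endpoints weight $h/2$:

```python
import numpy as np

def trapezoid(f, a, b, h):
    """Composite trapezoidal rule written as a weighted sum sum_i w_i f(x_i):
    interior abscissae get weight h, the two endpoints weight h/2.
    Assumes (b - a) is an integer multiple of h."""
    n = int(round((b - a) / h))
    x = a + h * np.arange(n + 1)          # abscissae
    w = np.full(n + 1, h)                 # interior weights
    w[0] = w[-1] = h / 2.0                # endpoint weights
    return float(w @ f(x))
```

Halving the stepsize roughly quarters the error, consistent with the $h^2$ leading error term of the trapezoidal rule.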
Under the assumption that the numerical evaluation can be written as
$$T(h) = I + c\,h^{k} + \mathcal{O}(h^{k+1}),$$
where $h$ is some choice of step parameter and $c$ a bounded constant, Richardson's extrapolation scheme can be used to: (i) assess sensitivity to the choice of $h$ and; (ii) recursively reduce the leading-order approximation error, i.e. sequentially remove $c\,h^{k}$. Note, in order to simplify notation, the above equation retains only the leading-order error term. To derive Richardson's extrapolation scheme, we rewrite the numerical approximation in terms of a new stepsize $h/t$, for some $t > 1$ (the choice of which is discussed later), and combine $T(h)$ and $T(h/t)$ to remove the leading-order error term. Repeating this process for $T(h/t^2), T(h/t^3), \ldots$, the scheme can be written as a recurrence relation in which, for each additional iteration, the approximation error decreases (the error cannot, of course, be reduced beyond numerical rounding error).
Ignoring numerical rounding errors, the approach can be used to compute $I$ to within a desired tolerance $\epsilon$, which we refer to as a converged numerical solution, e.g. by identifying successive iterates whose difference is smaller than $\epsilon$. For example, the method underpins the Romberg extension of the trapezoidal scheme.
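The recurrence can be sketched as follows: a minimal Python illustration of the Romberg scheme, i.e. trapezoidal estimates combined by Richardson's recurrence with $t = 2$ (names and defaults are ours):

```python
import numpy as np

def romberg(f, a, b, max_iter=8, tol=1e-10):
    """Richardson's recurrence applied to trapezoidal estimates with
    stepsize halving (t = 2): column j cancels the O(h^(2j)) error term."""
    R = [[0.0] * max_iter for _ in range(max_iter)]
    h = b - a
    R[0][0] = h * (f(a) + f(b)) / 2.0
    for k in range(1, max_iter):
        h /= 2.0
        # abscissae new to this refinement level (odd multiples of h)
        mids = a + h * (2 * np.arange(2 ** (k - 1)) + 1)
        R[k][0] = R[k - 1][0] / 2.0 + h * np.sum(f(mids))
        for j in range(1, k + 1):
            # combine two estimates to remove the leading-order error
            R[k][j] = R[k][j - 1] + (R[k][j - 1] - R[k - 1][j - 1]) / (4 ** j - 1)
        if abs(R[k][k] - R[k - 1][k - 1]) < tol:
            return R[k][k]
    return R[max_iter - 1][max_iter - 1]
```

The convergence check compares the two most recent diagonal entries against the tolerance, in the spirit of the deterministic stopping rule described above.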
A variation of Richardson's method, sometimes referred to as Richardson's deferred approach to the limit, computes an ensemble of values $T(h_1), T(h_2), \ldots, T(h_N)$ and fits a polynomial, as a function of $h$, to these evaluations. The intercept of the fitted polynomial is then an estimate of $I$, i.e. in the theoretical limit as $h \to 0$, which we denote by $\hat{I}$. The process is repeated for several distinct initializations of the stepsize, and a converged numerical solution is identified when successive estimates of $\hat{I}$ agree to within the desired tolerance. This approach underpins the Bulirsch–Stoer BS1964 extension of Richardson's extrapolation technique.

Central to the good performance of both methods is the parametric assumption that the numerical error can be approximated by a power series in $h$ of finite order, or more generally on the error function being analytic and bounded in $h$. The assumption has wide applicability, but can fail when two nearby approximations, e.g. at $h$ and $h/t$, diverge at a rate not captured by the fitted polynomial, i.e. near to a singular point or owing to the system's intrinsic chaotic nature. To overcome this limitation, we propose a nonparametric approach that is inspired, in part, by recent developments in the Bayesian quadrature literature D1988; OH1991. We illustrate the approach considering the example of Richardson's deferred approach to the limit, i.e. the Bulirsch–Stoer algorithm.
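A minimal sketch of the deferred approach to the limit (our own illustrative code; the fit is in powers of $h^2$, matching the even-power error expansion of the trapezoidal rule):

```python
import numpy as np

def extrapolate_to_zero(hs, Ts):
    """Deferred approach to the limit: fit a polynomial (here in h**2)
    through the pairs (h, T(h)) and return its intercept, i.e. the
    fitted value in the limit h -> 0."""
    hs = np.asarray(hs, dtype=float)
    Ts = np.asarray(Ts, dtype=float)
    coeffs = np.polyfit(hs ** 2, Ts, deg=len(hs) - 1)
    return float(coeffs[-1])   # constant term = estimate of I at h = 0
```

With four trapezoidal estimates of $\int_0^\pi \sin x\,\mathrm{d}x$ at successively halved stepsizes, the intercept recovers the true value 2 to high accuracy, even though each individual estimate carries an $\mathcal{O}(h^2)$ error.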
2.1 Uncertainty in the numerical evaluation
So far we have treated the numerical evaluation as one in which the approximation error is fixed by a given choice of stepsize $h$. We now make a conceptual shift from this position and instead assume that the approximation error has intrinsic uncertainty. For a very good discussion of the philosophical and scientific implications of this switch see HOG2015. In the present paper, we let $Y(h)$ denote the probabilistic reinterpretation of $T(h)$, where
$$Y(h) = I + e(h),$$
i.e. the uncertainty appears through the presence of the random error term $e(h)$. By construction, the approximation error must approach zero as the stepsize approaches zero, which, in the probabilistic model, requires that
$$Y(0) = I.$$
This identity is motivated by a key assumption we make: that a numerical evaluation $T(h)$, computed using a deterministic numerical integration scheme, is a draw from the distribution of probabilistic evaluations $Y(h)$. Hence, as $h \to 0$, the above identity holds. We furthermore consider that the collection of random variables $\{Y(h_n)\}$ is a Gaussian process (GP), where $h_n$ denotes a distinct stepsize that monotonically decreases as $n$ increases. In the spirit of the Bulirsch–Stoer algorithm, our objective is to estimate $Y$ at $h = 0$ from the fitted GP. In the probabilistic reinterpretation, an estimate $\hat{I}$ is deemed a converged numerical evaluation when the uncertainty in $\hat{I}$ is below a user-defined uncertainty tolerance $\epsilon$ (henceforth we drop the hat). Therefore, in deterministic evaluations of $I$, the tolerance places a bound on the numerical precision of an estimate, whereas in probabilistic evaluations $\epsilon$ places a bound on the maximum uncertainty of an estimate. Our approach can broadly be described in four steps.
Step 1: input an $N$-dimensional vector of stepsizes $(h_1, \ldots, h_N)$ and compute the initial data set comprising pairs of variables $(h_n, T(h_n))$, where each $T(h_n)$ is assumed a draw from the distribution of probabilistic evaluations $Y(h_n)$. Each $T(h_n)$ is computed via a numerical integration scheme, and in the present treatment we use the trapezoidal scheme.

Step 2: let the collection of random variables $\{Y(h_n)\}$ be a GP, with given mean and kernel functions. From this model, compute estimates of: (i) $Y$ at $h = 0$, denoted $\hat{I}$, and; (ii) the uncertainty $\sigma(0)$ in the estimate $\hat{I}$.

Step 3: if the uncertainty in the estimate is above the user-defined threshold, i.e. $\sigma(0) > \epsilon$, we reject $\hat{I}$ as a reasonable estimate of $I$ and instead estimate a new stepsize $h_{N+1}$ which satisfies
$$\sigma(h_{N+1}) = \rho\,\sigma(0),$$
i.e. we locate the stepsize such that the uncertainty in the fitted value is a user-defined fraction $\rho$ of the uncertainty in the current estimate of $I$. In practice $h_{N+1}$ is identified using a grid search over the region $(0, h_N)$.

Step 4: augment the sample data with a new observation $T(h_{N+1})$, computed via the deterministic numerical integration scheme. Using the augmented data, we then repeat steps 2–4 until we identify an estimate of $I$ with uncertainty below $\epsilon$. See Alg. 1 for pseudocode.
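The four steps above can be sketched in Python as follows. This is our own minimal illustration, not the authors' implementation: the function names, the fixed Matérn $\nu = 5/2$ smoothness, and the hand-chosen amplitude and lengthscale are all our assumptions; a fuller version would maximize the likelihood of the kernel parameters rather than fixing them.

```python
import numpy as np

def matern52(x1, x2, amp, ls):
    """Matern nu = 5/2 kernel on scalar inputs (here, stepsizes)."""
    r = np.abs(x1[:, None] - x2[None, :]) / ls
    s = np.sqrt(5.0) * r
    return amp ** 2 * (1.0 + s + s ** 2 / 3.0) * np.exp(-s)

def gp_predict(h_tr, y_tr, h_new, amp=1.0, ls=0.5, mean=0.0, jitter=1e-10):
    """Posterior mean and sd of a constant-mean GP at stepsizes h_new."""
    K = matern52(h_tr, h_tr, amp, ls) + jitter * np.eye(len(h_tr))
    Ks = matern52(h_new, h_tr, amp, ls)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_tr - mean))
    mu = mean + Ks @ alpha
    v = np.linalg.solve(L, Ks.T)
    var = np.clip(amp ** 2 - np.sum(v ** 2, axis=0), 0.0, None)
    return mu, np.sqrt(var)

def next_stepsize(h_tr, y_tr, rho=0.5, **gp_kw):
    """Steps 2-3: predict at h = 0, then grid-search for the stepsize
    whose predictive sd is the fraction rho of the sd at h = 0."""
    mu0, sd0 = gp_predict(h_tr, y_tr, np.array([0.0]), **gp_kw)
    grid = np.linspace(1e-4, h_tr.min(), 200)
    _, sd = gp_predict(h_tr, y_tr, grid, **gp_kw)
    h_next = grid[np.argmin(np.abs(sd - rho * sd0[0]))]
    return float(mu0[0]), float(sd0[0]), float(h_next)
```

Setting the prior `mean` to the evaluation at the smallest stepsize mirrors the prior-mean choice favoured later in the text; step 4 would then append $(h_{N+1}, T(h_{N+1}))$ to the training data and repeat.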
In summary, the proposed approach is an adaptive learning technique, developed from Richardson's extrapolation and the Bulirsch–Stoer algorithm, in which the uncertainty in the current estimate of $I$ is used to guide the choice of subsequent stepsizes. There are two notable benefits of this approach over and above the deterministic alternative: (i) there is no parametric assumption concerning the behaviour of the error function, i.e. that the leading-order error term is well approximated by a finite-order polynomial, and; (ii) the choice of stepsize between two observations is 'dynamic', i.e. in the deterministic setup the distance between two abscissae is fixed and controlled by the parameter $t$, whereas in the probabilistic setting the distance can vary between any two adjacent abscissae owing to asymmetry in the distribution of the uncertainty and the choice of the step-control parameter $\rho$. A fundamental challenge remains for both the deterministic and probabilistic approaches, however, as each subsequent stepsize must be small enough that the evaluation closely approximates $I$ while large enough that computation is dominated by discretization error only, e.g. see HV2016. In the probabilistic setting, this issue is related to the choice of step-control parameter $\rho$, the choice of 'prior' mean and the kernel function of the GP.
To illustrate our approach, and to go some way toward assessing the influence of these choices, we consider the following test function
(1) 
and aim to approximate the integral of $f$ over a fixed interval. We construct the initial data from a 4-dimensional vector of stepsizes, use the trapezoidal rule to compute each $T(h_n)$, and set the step-control parameter $\rho$. For the GP, we chose the Matérn class of kernel Stein1999:
$$k(h, h') = \sigma^2\, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\ell}\right)^{\!\nu} K_{\nu}\!\left(\frac{\sqrt{2\nu}\, r}{\ell}\right), \qquad r = |h - h'|,$$
where $K_{\nu}$ is a modified Bessel function and $(\sigma, \nu, \ell)$ are the kernel parameters; the smoothness parameter $\nu$ is fixed (for mean-square differentiability). We discounted the popular radial basis function (also called squared exponential), which is a limiting case of the Matérn class (as $\nu \to \infty$), as it makes stronger assumptions on smoothness, and it has been argued that this may be unrealistic for modeling many physical systems, see Stein1999; GPbook.

The results from our analysis are presented in Fig. 1. We note that a consequence of the GP modeling approach is that the posterior mean of $Y$ is strongly influenced by the choice of prior mean when $h$ lies in the interval between zero and the most recent update of the stepsize. We therefore assessed the sensitivity of our results to the choice of prior mean across three prior initializations: zero, the average of the initial points, and the evaluation at the smallest stepsize in the initial data set (Fig. 1, left). In our example, the results illustrate a strong preference toward setting the prior mean as the final observation. Working under the assumption that the most recent evaluation of $I$ is the most accurate, this result seems reasonable. Fig. 1 (right) displays the results when applying the approach and demonstrates how the sequential updating of the pair $(h, T(h))$, and thus the data, can quickly shrink the uncertainty in the estimate of the integral at $h = 0$. It is possible, therefore, that dynamic updating of the stepsize can lead to faster convergence properties than when fixing the stepsize, as in the deterministic setup.

3 Bulirsch–Stoer Algorithm
We have reviewed the Bulirsch–Stoer extrapolation algorithm with the domain of integration fixed to the length of the initial stepsize. If the domain is local to a singular point, however, numerical evaluations which rely on a fixed initial stepsize can regularly fail to converge. The full implementation of the Bulirsch–Stoer algorithm aims to resolve this issue. In regions where a converged numerical solution appears difficult to identify, the choice of initial stepsize, and thus the domain of integration, is repeatedly reduced until a converged numerical solution has been identified. Hence, the original domain of integration is partitioned into segments, in each of which a converged solution is identified, and the integral is evaluated by aggregating the results from each segment; see Alg. 2 for pseudocode. Our probabilistic reinterpretation of this algorithm sees our GP scheme replacing the traditional extrapolation algorithm, as before, for each segment in the domain; see Alg. 3 for pseudocode. To better suit our numerical example in this section, we henceforth consider the domain of integration as a temporal region and refer to a stepsize as a timestep.
The Bulirsch–Stoer algorithm is a descendant of Richardson's deferred approach to the limit and therefore typically relies on performing polynomial extrapolation between the numerical evaluations at each abscissa. (It is worth noting that the original method as proposed by Bulirsch and Stoer used rational function extrapolation; however, straightforward polynomial extrapolation is typically slightly more efficient than rational function extrapolation for smooth problems NumRce.) Briefly, for each segment of integration, the algorithm selects the number of subdivisions of the timestep according to the series originally proposed by Bulirsch and Stoer. However, it is more common to see the sequence more recently suggested by Deuflhard Deuflard1; Deuflard2, i.e. $n_k = 2k$, which is the scheme we adopt. Typically, a maximum number of subdivisions is also fixed in advance Deuflard1; Deuflard2. The Bulirsch–Stoer algorithm is an adaptive deterministic numerical scheme, which furnishes solutions to the integral problem across a broad range of scenarios. However, the approach is underpinned by the assumption that the integrand is well approximated by a polynomial in each segment of the domain of integration which, even for an analytic function, can lead to issues. We now illustrate this.
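As an aside, the two subdivision sequences can be generated as follows (a small sketch; we state the standard forms of both sequences, which are our reading of the literature rather than quoted from this paper):

```python
def deuflhard_sequence(k_max):
    """Deuflhard's subdivision counts: n_k = 2k, i.e. 2, 4, 6, 8, ..."""
    return [2 * (k + 1) for k in range(k_max)]

def bulirsch_stoer_sequence(k_max):
    """Original Bulirsch-Stoer counts: 2, 4, 6, then n_k = 2 * n_(k-2),
    giving 2, 4, 6, 8, 12, 16, 24, 32, ..."""
    seq = [2, 4, 6]
    while len(seq) < k_max:
        seq.append(2 * seq[-2])
    return seq[:k_max]
```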
In Fig. 2 we provide an example of a well-known situation where polynomial interpolation can perform badly. In this famous example of Runge's phenomenon, the GP approach, using a Matérn kernel Stein1999, provides excellent extrapolation and interpolation, whereas the polynomial interpolation procedure fails, particularly at the tails of the domain. This result might be reversed, however, particularly if the true underlying function is, or is well approximated by, a polynomial. In this case, the GP scheme will almost certainly fail to match the goodness of the polynomial extrapolation scheme. The extent to which the methods differ in performance will likely depend on the choice of kernel function: if we had some prior knowledge that the function should 'look like' a polynomial, we might choose a kernel that helps the GP posterior mean to exhibit polynomial-like behaviour, see for example Karvonen1998.

In the probabilistic reinterpretation of the Bulirsch–Stoer algorithm, we make the assumption that the numerical evaluation converges to the correct solution as the timestep approaches zero. This is typically true, but when the domain is partitioned into sufficiently small timesteps, rounding error can dominate error due to discretization. The GPR-derived scheme suffers from the same issue. Outside of numerical rounding error, the GPR scheme can assess convergence using a combination of two conditions: assessing the uncertainty of the predicted mean in the limit as the timestep approaches zero, via the fitted GP model, and comparing the estimated means between any two initializations. The standard Bulirsch–Stoer algorithm only assesses the latter.
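Runge's phenomenon is easy to reproduce (our own small sketch, not the paper's experiment): interpolating $f(x) = 1/(1 + 25x^2)$ on equispaced nodes in $[-1, 1]$, the maximum interpolation error grows as the polynomial degree increases, with the damage concentrated at the tails of the domain:

```python
import numpy as np

def runge(x):
    """Runge's function 1 / (1 + 25 x^2)."""
    return 1.0 / (1.0 + 25.0 * x ** 2)

def max_interp_error(n_nodes):
    """Max abs error of the degree-(n_nodes - 1) polynomial interpolant
    of Runge's function on equispaced nodes in [-1, 1]."""
    xn = np.linspace(-1.0, 1.0, n_nodes)
    coeffs = np.polyfit(xn, runge(xn), deg=n_nodes - 1)
    xs = np.linspace(-1.0, 1.0, 1001)
    return float(np.max(np.abs(np.polyval(coeffs, xs) - runge(xs))))
```

Raising the node count from 7 to 13 makes the equispaced interpolant markedly worse, the opposite of the behaviour one would hope for from a convergent scheme.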
The main disadvantage of the probabilistic version of the Bulirsch–Stoer algorithm is computation time, i.e. for a fixed number of abscissae, fitting a polynomial using Neville's algorithm is typically faster than aiming to maximize the likelihood of the kernel parameters. Certainly, for a smooth and bounded function, the original BSA will likely outperform the probabilistic version. However, in cases where a segment of the domain is partitioned multiple times, e.g. owing to the integrand changing rapidly over the interval, it is possible that the overall performance of the probabilistic version will be better than the original.
3.1 Example: Kepler orbit
To illustrate the utility of the probabilistic reinterpretation of the Bulirsch–Stoer algorithm, we consider the case of a massless particle in a Keplerian potential. (The choice of an astronomical problem is perhaps fitting, as an early reported use of Gaussian processes dates to time series analysis by the astronomer T. N. Thiele in 1880 L1981.) The problem involves finding a solution to the second-order ODE
$$\ddot{\mathbf{r}} = -\frac{GM}{|\mathbf{r}|^{3}}\,\mathbf{r},$$
where $\mathbf{r}$ is the position vector, $G$ is Newton's constant and $M$ is the total mass (it is conventional to use units with $GM = 1$). We consider the case of a particle initially on an orbit with an eccentricity of 0.99. This results in a close passage to the singularity in the force at the origin.
We compare two different versions of Bulirsch–Stoer: the usual polynomial extrapolation, and the Gaussian process regression approach. Integration over each subinterval is performed using a second-order leapfrog integrator. For the polynomial scheme, if integration fails to converge after eight subdivisions of the timestep, the region is iteratively halved until convergence is achieved. We note that more complex schemes can be used to reduce the timestep (e.g. see NumRce), which might outperform the Bulirsch–Stoer scheme. For the GP approach, we do not place a limit on the number of times an interval can be subdivided, owing to the ability of the GP scheme to closely approximate a wider class of analytic functions than those that are well approximated by a finite-order polynomial.
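The integrator inside each subinterval can be sketched as a standard kick-drift-kick leapfrog in units with $GM = 1$ (our own minimal version, not the paper's code):

```python
import numpy as np

def kepler_accel(r):
    """Acceleration for r'' = -GM r / |r|^3, in units with GM = 1."""
    return -r / np.linalg.norm(r) ** 3

def leapfrog(r, v, dt, n_steps):
    """Second-order kick-drift-kick leapfrog over n_steps fixed timesteps."""
    for _ in range(n_steps):
        v = v + 0.5 * dt * kepler_accel(r)   # half kick
        r = r + dt * v                       # drift
        v = v + 0.5 * dt * kepler_accel(r)   # half kick
    return r, v
```

Being symplectic, the scheme keeps the orbital energy bounded over many steps, which is what makes it a natural building block inside each Bulirsch–Stoer subinterval.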
Fig. 3 displays a comparison of the two schemes relative to the true solution. Notably, as the particle first passes through pericenter, i.e. the closest approach to the origin, which is a singularity, the polynomial-based scheme fails and no longer furnishes an estimate of the particle's position. However, the nonparametric GP scheme closely approximates the particle's trajectory as it first passes through pericenter and, moreover, continues to furnish a reasonable approximation of the particle's path throughout the time interval considered.
4 Discussion
In this paper we have presented an adaptive-learning integration scheme, which embeds Gaussian process regression (GPR) into the Bulirsch–Stoer algorithm (BSA), for use in numerical integration. This probabilistic approach extends the current deterministic implementation of the BSA in two ways. Firstly, the approach avoids the parametric assumption that a numerical evaluation of the integral has an error term that is closely approximated by a finite-order polynomial. This assumption can lead to the deterministic approach failing near a singular point in the integrand. Secondly, for a given stepsize, the uncertainty in the numerical evaluation of the integral via GPR is used to guide how the stepsize should be adapted. The combination of these additional features leads to a more robust numerical integration scheme beyond the current standard. The probabilistic scheme is not intended as a one-size-fits-all replacement for existing numerical methods, however, but rather as an alternative to be considered when standard approaches fail.
Acknowledgments
We would like to thank Prof. Chris Oates for helpful comments in connection with the project and Dr. Paul Kirk for interesting scientific discussions and helpful comments on the manuscript. PB acknowledges support from the Leverhulme Trust (Research Project Grant, RPG2015408). CNF acknowledges support from the Medical Research Council (MC UU 00002/7).
References
 (1) Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A. and Sejdinovic, D. Probabilistic integration: A role in statistical computation? (with discussion). Statistical Science, 2018.
 (2) Briol, F.-X., Oates, C. J., Girolami, M., Osborne, M. A. (2015). Frank-Wolfe Bayesian Quadrature: Probabilistic Integration with Theoretical Guarantees. Advances in Neural Information Processing Systems (NIPS), pages 1162–1170
 (3) Briol, F.-X., Oates, C. J., Cockayne, J., Chen, W. Y., Girolami, M. (2017). On the sampling problem for kernel quadrature. International Conference on Machine Learning, PMLR 70:586–595
 (4) Boekholt T., Portegies Zwart S., 2015, Comput. Astrophys. Cosmol., 2, 2
 (5) Bulirsch, R., Stoer, J.: Fehlerabschätzungen und Extrapolation mit rationalen Funktionen bei Verfahren vom Richardson-Typus. Numerische Mathematik (6), 413–427 (1964)
 (6) Cockayne, J. Oates, C. J., Sullivan T., and Girolami, M.. Bayesian probabilistic numerical methods. arXiv:1702.03673, 2017
 (7) Dehnen W., Read J., 2011, European Physical Journal Plus, 126, 55
 (8) Deuflhard, P. 1983, Numerische Mathematik, vol. 41, pp. 399–422.
 (9) Deuflhard, P. 1985, SIAM Review, vol. 27, pp. 505–535.
 (10) Diaconis (1988) Bayesian numerical analysis, Statistical decision theory and related topics V, pp. 163–175.
 (11) Harvey R, Verseghy DL (2016) The reliability of single precision computations in the simulation of deep soil heat diffusion in a land surface model. Clim Dyn 46:3865–3882.
 (12) Hennig, P., Osborne, M. A., and Girolami, M. Probabilistic Numerics and Uncertainty in Computations. Proceedings of the Royal Society A, 471(2179):20150142, 2015.
 (13) Karvonen, T., Oates, C. J., and Särkkä, S. A Bayes–Sard Cubature Method. arXiv preprint arXiv:1804.03016, 2018
 (14) Lauritzen, S. L., "Time series analysis in 1880. A discussion of contributions made by T.N. Thiele". International Statistical Review 49, 1981, 319–333
 (15) MacKay, D. (1998). In C. M. Bishop (Ed.), Neural networks and machine learning. (NATO ASI Series, Series F, Computer and Systems Sciences, Vol. 168, pp. 133–166.) Dordrecht: Kluwer Academic Press
 (16) O'Hagan (1991) Bayes–Hermite Quadrature, Journal of Statistical Planning and Inference 29, pp. 245–260.
 (17) Oates, C. J., Niederer, S., Lee, A., Briol, F.-X., Girolami, M. (2016). Probabilistic Models for Integration Error in the Assessment of Functional Cardiac Models
 (18) Press, W., 2007, Numerical Recipes in C
 (19) Rasmussen, C. and Williams, C., Gaussian Processes for Machine Learning, 2006
 (20) Richardson, L. F. (1911), "The Approximate Arithmetical Solution by Finite Differences of Physical Problems Involving Differential Equations, with an Application to the Stresses in a Masonry Dam", Philosophical Transactions of the Royal Society A, 210 (459–470): 307–357
 (21) Runge, C., Zeitschrift für Mathematik und Physik, 46: 224–243, 1901
 (22) Schwaighofer, A., Tresp, V. and Yu., K. Learning Gaussian Process Kernels via Hierarchical Bayes. In NIPS 17, Cambridge, MA, 2005. MIT Press.
 (23) Stein, M. L. (1999). Interpolation of Spatial Data.
 (24) Wilson, A. G. and Adams, R. P., Gaussian process kernels for pattern discovery and extrapolation. arXiv preprint arXiv:1302.4245, 2013.
 (25) Wilson, A. G. and Gilboa, E. and Nehorai, A. and Cunningham, J. P. Fast Kernel Learning for Multidimensional Pattern Extrapolation