A wide range of commonly occurring inference problems can be fruitfully tackled using Bayesian methods. A particular common inference problem is that of regression; determining the relationship of a control variable to an output variable given a set of measurements of at points . The solution requires a model , which allows us to predict the value of at an untested value of1].
Gaussian process regression (GPR) is a powerful mathematical technique for performing non-parametric regression in a Bayesian framework [1, 2, 3, 4, 5]. The key assumption underpinning the method is that the observed data set being interpolated is a realisation of a GP with a particular covariance function. This assumption presents us with a challenge: how do we choose the covariance function which gives the best interpolant?
The process of choosing the covariance function is known as learning, or training of the GP. In this training process, it is necessary to compute the inverse of the covariance matrix (the matrix formed by evaluating the covariance function pairwise between all observed points). The time taken to evaluate the inverse of the covariance matrix scales as , where is the number of points being interpolated; this has typically restricted the application of GPR to smaller problems (), although work has been done on extending its applicability to larger data sets [6, 7, 8, 9].
In this paper, we present two techniques that speed up the training stage of the GPR algorithm. The first aims to reduce the dimensionality of the problem, and hence speed up the learning of the hyperparameters for a single covariance function; this does not change the fact that the cost of this process is, instead it simply reduces the constant in this scaling. The second aims to enable fast Bayesian model comparison between different covariance functions while also incorporating the benefits of the first technique.
We consider maximising the hyperlikelihood: the conditional probability of the data given a particular set of hyperparameters used to specify the covariance function.111This is often referred to as the marginal likelihood. In order to avoid confusion with distributions over the model parameters, we prefer to consistently use the hyper- prefix to denote probability distributions connected to the inference of the hyperparameters. We provide an expression for the Hessian matrix of the hyperlikelihood surface and show how this can be used as a valuable tool for comparing the performance of two different covariance functions. We also present modified expressions for the hyperlikelihood, its gradient and its Hessian matrix, which have all been analytically maximised and marginalised over a single scale hyperparameter. This analytic maximisation or marginalisation reduces the dimensionality of the subsequent optimisation problem and hence further speeds up the training and comparison of GPs.
These techniques are useful when attempting to rapidly fit large, irregularly sampled datasets with a variety of covariance function models. The authors have previously made use of these techniques in exploring the correlation structure of the differences between complicated waveform models in the field of gravitational-wave astronomy [10, 11]; this was done so that the effect of different models on the parameter infererences could be marginalised over. There, the behaviour of the data was largely unknown a priori and it was necessary to quantitatively compare a wide range of different covariance functions. Work in this area with larger datasets is ongoing.
In Sec. 2, we review the GPR method and discuss methods of efficiently determining a covariance function. In Sec. 22.1, we present our expression for the Hessian of the hyperlikelihood along with a discussion of how it can be used for model comparison, and in Sec. 22.2 we show how the training of the GP can be accelerated by analytically maximising or marginalising the hyperlikelihood over a single scale parameter. In Sec. 3, we apply these methods to both synthetic and real data sets, and compare the computational cost to that of a full numerical evaluation of the Bayesian model evidences. Finally, a brief discussion and concluding remarks are given in Sec. 4.
2 Gaussian process regression & training
The technique of GPR is a method for interpolating (or extrapolating) the data contained in a training set
. The vectoris called the input vector and the output vector is given by for some unknown function . The method works by assuming that the data have been drawn from an underlying GP with specified mean (usually assumed to be zero) and covariance function .
There is freedom in specifying the covariance function; common choices, such as the squared exponential and Matérn function, include a number of free hyperparameters that control the properties of the GP, i.e. .
The predictive power of the method comes from computing the conditional probability of the function taking a given value at some new -th input point , given the observed values in and the values of the hyperparameters . This predictive probability distribution for the function at the new point is a Gaussian with mean
where we have defined the scalar, vector and matrix shorthand
Since the posterior distribution for (2) relies upon the form of the covariance, GPR cannot be used to make definite predictions until we have fixed a method for dealing with the unknown hyperparameters .
Ideally, we would like to place a prior probability distribution onand make predictions by evaluating the integral
where we have used Bayes’ theorem to obtain the second equality. We have introduced the hyperlikelihood given by
which encodes the probability that the observed (training) data were drawn from a GP with covariance function . The integral (4) is almost always analytically intractable and prohibitively expensive to evaluate numerically. A common approximate approach is to use the most probable values of the hyperparameters , which maximise [12, 13, 14].
Assuming the prior distribution is sufficiently flat (or uninformative) over the region of interest, this is equivalent to maximising the hyperlikelihood . Under this approximation, the predictive distribution becomes
which is simply the Gaussian in (2) with mean and variance evaluated at . Implementing the above procedure requires numerically maximising the hyperlikelihood in (5). This can be computationally expensive; in Sec. 22.1 and Sec. 22.2, we present methods for reducing the cost of maximising the hyperlikelihood.
2.1 Using the gradient & Hessian
The maximisation process may be accelerated if the gradient of the hyperlikelihood is known and a gradient-based algorithm, such as a conjugate gradient method [13, 15], can be used. The gradient of the logarithm of the hyperlikelihood is given by 
This can be shown by differentiating (5) and making use of the standard results
The gradient in (7) is useful because the rate-determining step in computing the hyperlikelihood is computing the inverse matrix (usually achieved through a Cholesky decomposition in practice), which is an operation. All other steps in (5) scale as or less.222As described in a footnote in , the matrix–matrix products in (7) should not be evaluated directly, as this is an operation. Rather, the first term should be evaluated in terms of matrix–vector products, and, in the second term, only the diagonal elements that contribute to the trace need to be calculated; these are both operations. Once the inverse has been calculated, the gradient in (7) may also be evaluated in ; so in evaluating the hyperlikelihood for a large training set we can also get the gradient for negligible extra cost.
The procedure outlined above can be performed for multiple covariance functions, each yielding a different GP interpolant. It is therefore necessary to have a method of comparing the performance of different interpolants to decide which to use. One way to achieve this is to evaluate the (hyperprior-weighted) volume under the hyperlikelihood surface, the hyperevidence, and use this as a figure of merit for the performance. Evaluating this integral is prohibitive, so an approximation is to calculate the Hessian matrix of thesurface at the peak (the position and value of which have already been found) and to analytically integrate the resulting Gaussian. This procedure assumes flat (or slowly varying) hyperpriors in the vicinity of the peak, but this has already been assumed in going from (4) to (6). Differentiating the gradient in (7), again making use of the results in (8), and evaluating the derivatives at the position of peak hyperlikelihood, , gives the Hessian,
This expression has the same advantages as the expression for the gradient; as the inverse of the covariance matrix has already been computed, the Hessian may be evaluated at negligible extra cost. The hyperlikelihood surface may therefore be approximated by the Gaussian [16, 12]
We seek the evidence, which is given by the following integral of the posterior, where we have specified a prior on the hyperparameters;
Assuming the posterior is a sufficiently well peaked distribution, with peak at position , the evidence may be written using the Laplace approximation  as
It is always possible to change the hyperparameterisation so that the prior is flat in which case the hyperposterior is proportional to the hyperlikelihood.333For example, if the original prior on the parameters is , we can define and then is a constant. If such a hyperparameterisation has been chosen then (where is the hyperprior volume, or range of integration), , and ; therefore
This expression is now invariant under further changes to the hyperparameter specification which preserve the property that the prior is constant. We use hyperparameterisations with flat hyperpriors as this choice uniquely specifies the approximation in Eq. (13); although there remains the possibility that another hyperparameterisation exists in which the posterior is better approximated as a Gaussian.
For two covariance functions, and
, the odds ratio may be defined as the ratio of the value of (13) evaluated with to the value evaluated using , and this may be used to discriminate among competing models. The hyperprior volume in (13) acts as an Occam factor, penalising models with greater complexity . Once suitable prior volumes have been fixed, the Hessian approximation to the hyperevidence is a computationally inexpensive means of comparing covariance functions.
2.2 Partial analytic maximisation
In general, covariance functions can be arbitrarily complicated, with large numbers of hyperparameters. Inevitably, simple covariance functions are the most prevalent in the literature. If there are a small number of hyperparameters, then even reducing the number of hyperparameters by one can have a great impact on the length of time taken to maximise the hyperlikelihood. In this section, we show how the hyperlikelihood for any covariance function, regardless of complexity, can be analytically maximised over an overall scale parameter, thereby reducing the number of remaining hyperparameters. We also generalise the expressions for the gradient and the Hessian found in Sec. 22.1 to this case.
Consider the following transformation of the covariance, ; substituting this into the expression for the hyperlikelihood gives,
This function always has a unique maximum with respect to variations in at the position
at this point the hyperlikelihood takes the value
Equation (16) is to be considered as a function of the remaining hyperparameters . The peak evidence may now be found more easily by numerically maximising in (16) with respect to the remaining parameters . If a gradient-based algorithm is used, it is advantageous to have an analogous expression to (7) to give inexpensive derivatives. This can be found by differentiating (16) with respect to , making use of the results in (8),
These are not the same as the derivatives in (7).
As well as maximising, we can also consider marginalising over . As we are marginalising over a scale parameter we use the (improper) Jeffreys prior . The result is equal to the maximised form, up to a multiplicative constant,
As before, once the peak hyperlikelihood has been found, the Hessian at the peak position can aid in model comparison. In this case, the Hessian should be calculated using the second derivatives of . However, we may instead differentiate , as this differs only by a constant which will cancel when using the Hessian to compare two models. Differentiating (17) with respect to ,444Here, retains its maximum value from (15), although the new maximum value has actually now shifted to due to the effect of the hyperprior. For large data sets () the difference between the two is negligible (the hyperprior becomes uninformative as it is overwhelmed by the hyperlikelihood).
Again, these are not the same as the derivatives in (9). These expressions for the gradient and the Hessian of the hyperlikelihood, maximised or marginalised over , share the same advantages as the analogous expressions in Sec. 2: they may be evaluated in time once the hyperlikelihood itself has been evaluated in time.
3 Numerical results
In order to perform model comparison calculations between competing covariance functions, we must first specify at least two different covariance functions. We choose the two functions in (20) and (21), where . These functions are both based on the periodic covariance function proposed by . The first function is the product of a single periodic component with timescale and a simple compact-support polynomial covariance function  to describe any non-periodic component of the data. The choice of a compact-support covariance function is especially useful when working with large datasets; this is precisely the situation where the techniques described above are also designed to be of maximum benefit. The second function includes an additional periodic component with timescale . In order to avoid double-counting in , we impose the constraint . Both covariance functions also include an uncorrelated noise term; we define this in such a way that remains an overall scale hyperparameter which can be maximised or marginalised over analytically as described in Sec. 32.2.
The covariance functions are completely specified by the hyperparameters (overall scale), (; timescales), and (; smoothing parameters for the periodic components). The noise parameter could also be taken to be a hyperparameter; instead, for simplicity, we here take to be fixed. As appears in multiplied by the overall scale, , fixing is roughly equivalent to specifying a fixed fractional error.
We want to perform model comparison using the Laplace approximation outlined previously. This technique requires reparametrising the covariance function such that the hyperpriors are flat. For the timescale hyperparameters, which are dimensionful, we choose to use the scale-invariant Jeffreys prior, . This prior is improper if the range of is , so we restrict the range to , where and are respectively the smallest and largest separations between the sampling points. If there was a timescale in the problem outside of this range, we would be unable to resolve it from the data. We now seek a transformation to a new hyperparameter such that the prior is flat in this parameter, . The conservation of probability gives a differential equation relating the two
where the s are constants which we can set equal to . The range of these new hyperparameters is and .
For the smoothness parameters we choose to use log-normal priors, , with mean and variance . As before, we seek a transformation to some new hyperparameters in which the prior is flat. The desired transformation is given by
3.1 Synthetic data
Shown in Fig. 1 are realisations of GPs with covariance functions and .555The code, optimised for use on a GPU, and the synthetic used to produce these numerical results is available at http://hdl.handle.net/10283/1924. In order to perform test model comparison calculations, a realisation of the GP with points was drawn and analysed using both the and covariance functions. For each covariance the peak hyperlikelihood was found by numerically maximising (14) using a conjugate gradient method, making use of the gradient in (17). The hyperevidence was estimated using (13) and the expression for the Hessian in (19); the results are summarised in Tab. 1. To verify the accuracy of this estimate, the hyperevidence was also integrated numerically using MultiNest, [19, 20, 21] which implements a nested sampling algorithm . This was repeated for three different values of (in the case the synthetic data is plotted in the right-hand panel of Fig. 1), and the results are also summarised in Tab. 1.
From Tab. 1 it can be seen that as
is increased, the Bayes factors increasingly favour the more complicated covariance function (and in this case the correct covariance function from which the data was drawn). In almost all cases, the Laplace approximation gives a valuewhich is in agreement at better than with the numerically integrated value . There is one exception which is highlighted in bold; this occurs for the most complicated covariance function (with the largest number of hyperparameters) and when the number of data points is smallest. In this situation, it would be expected that the posterior distribution on the hyperparameters may be highly multimodal and/or exhibit strong degeneracies (both of these expectations were confirmed by examining the posterior distribution on the hyperparameters returned by MultiNest). This exceptional case serves to highlight situations in which the Laplace approximation should not be trusted. The MultiNest posteriors in all other cases were verified to be well approximated by a single Gaussian mode. Fig. 2 shows the posterior distribution for the parameters of obtained from the largest () synthetic data set.
Our method of model comparison is proposed as a faster alternative to model comparison using numerically evaluated Bayes factors. Simply comparing the peak hyperlikelihood (marginal likelihood) values would also give a measure of the goodness-of-fit, but this tends to favour more complex models and incurs the risk of overfitting. More sophisticated methods of model selection exist in the literature (see [23, 24] and references within), e.g., the comparison of models based on estimated predictive criteria [25, 26, 27], or the construction of a larger reference model and the subsequent selection of a simpler submodel with similar predictions [28, 29]. A detailed numerical comparison to these methods is left for future investigation.
The values in Tab. 1, evaluated using MultiNest, required between 20,000 and 50,000 likelihood evaluations. The maximisation routines typically took fewer than likelihood evaluations to find the peak, and then one additional evaluation to calculate the Hessian and hence . In order to guard against the possibility of the maximisation routines becoming trapped in local maxima, as opposed to the global maximum, the algorithm was run multiple times from randomly selected starting positions. The typical number of runs required to find the global maximum was . After these duplicate runs are accounted for, the speed-up factor in calculating compared to was between and in all cases.
3.2 Tidal data from Woods Hole
In order to illustrate the effectiveness of the techniques described above on real data, we consider several tidal data sets of different sizes from Woods Hole, MA .666Data from the National Oceanic and Atmospheric Administration, http://tidesandcurrents.noaa.gov/waterlevels.html (accessed August 2015). We consider the mean sea level offset recorded between 3 January and 15 June 2014, six lunar months, sampled at two-hour intervals (giving data points). This is plotted in Fig. 3. We also consider a smaller subset of the data (the first lunar month), with data points.777This data set is regularly sampled in time, and therefore the covariance matrix will be a Toeplitz matrix. This structure could be exploited to accelerate the inversion of the covariance matrix; we choose not to use this here so that our code can be applied to irregularly sampled data in arbitrary dimensions.
We interpolate the data using the two covariance functions in (20) and (21); these functions are well suited to the data, as we expect the sea level to contain harmonics of the various timescales associated with the daily, monthly and yearly cycles of the tides. For simplicity, we fix , which is the typical fractional error in the sea-level measurements. As in Sec. 33.1 we reparametrise the covariance functions so that the have Jeffreys priors, and the smoothness parameters have log-normal priors. We use a conjugate gradient maximisation algorithm with (17) and the Hessian in (19) to evaluate the volume in (13) and perform model comparison between the two covariance functions.
For the smaller dataset, we find the timescale with , which corresponds to the two main tides per day. With we find the timescales and ; the second timescale corresponds to the height difference between the first and second tides of the day. The two-timescale model is highly favoured with a log Bayes factor of .
For the larger dataset, we find with , and and with . In all cases, the (squared) errors are estimated using the diagonal components of the inverse Hessian; it can be seen that the timescales are more precisely measured for the larger dataset, as expected. The two-timescale model is even more conclusively favoured for the larger dataset, with a log Bayes factor of . We also find a number of subsidiary hyperlikelihood peaks associated with other timescales in the data, but all subsidiary peaks are strongly suppressed relative to the global peak (by at least of ) and so we expect our Bayes factor estimates to be robust.
Sea-level data is known to contain a large number of different frequencies, which necessitates the use of harmonic analysis in tidal modelling; the number of constituents included in tide prediction calculations has increased from tens  to thousands  over the past century. Clearly any -like covariance function with timescales is simplistic, but the construction of a more detailed tidal model is beyond the scope of this paper. 888We have conducted preliminary investigations of a three-timescale model. The hyperlikelihood surface for this covariance function is more structured and non-Gaussian than for and . Estimates of the Bayes factors indicate that the inclusion of additional timescales is favoured, as expected based on the known large number of modes present.
The number of evaluations of (13) needed to obtain these results was comparable to the numbers for the synthetic data discussed in Sec. 33.1. However, each evaluation here was more expensive () due to the size of the data set. Based on the speed-ups found in Sec. 33.1, it would be expected that multinest would take up to to calculate the Bayes factor.
Shown in the inset plot in Fig. 3 are the two interpolants from and
for the larger dataset, which both perform equally well on the timescale of one week. These interpolants, which are the result of the regression analysis, may be used to estimate the tidal height at a time where a measurement is not available.
We have described some simple ways in which the computationally expensive training stage of implementing GPR can be accelerated. The analytic maximisation of the hyperlikelihood over a single scale hyperparameter of the covariance function aids in speeding up the maximisation of the hyperlikelihood by reducing the dimensionality of the problem; the advantages of this will be most keenly felt in (common) problems where relatively simple covariance functions are used. Meanwhile, the analytic evaluation of the Hessian matrix, either in the manner of (9) or (19), aids in speeding up the process of model comparison between different types of covariance function. We have successfully demonstrated these techniques on a synthetic data set where the data was drawn from one of two covariance functions under consideration. In the case of the synthetic data the size of the data set was limited to less than 300 points, so that the results could be verified by using the MultiNest algorithm to numerically sample and integrate the posteriors. We also demonstrated the techniques by applying them to a larger real data set of mean sea level measurements, where the full MultiNest calculation would have taken too long to perform. It is to be hoped that these techniques will aid in the wider application of GP methods to larger data sets.
CJM and CPLB are supported by the STFC. AJKC’s work is supported by the Cambridge Commonwealth, European and International Trust. JRG’s work is supported by the Royal Society.
CJM devised the study, jointly performed the analysis and wrote the manuscript. AJKC jointly performed the analysis and wrote the manuscript. CPLB helped with the analysis, and the writing and editing of the manuscript. JRG helped devise the study and edited the manuscript. All authors gave final approval for the publication.
Conflict of interest
We have no competing interests.
Rasmussen and Williams 
C. E. Rasmussen
and C. K. I.
Gaussian Processes for Machine Learning(MIT Press, Cambridge, MA, 2006).
- MacKay  D. J. C. MacKay, Information Theory, Inference and Learning Algorithms (Cambridge University Press, Cambridge, 2003).
- Barry  D. Barry, The Annals of Statistics 14, 934 (1986).
- Wahba  G. Wahba, Journal of the Royal Statistical Society. Series B (Methodological) 40, 364 (1978).
- O’Hagan and Kingman  A. O’Hagan and J. F. C. Kingman, Journal of the Royal Statistical Society. Series B (Methodological) 40, 1 (1978).
- Smola and Bartlett  A. J. Smola and P. L. Bartlett, in Advances in Neural Information Processing Systems, edited by T. K. Leen, T. Dietterich, and V. Tresp (MIT Press, Cambridge, MA, 2001), vol. 13, pp. 619–625.
- Quiñonero-Candela and Rasmussen  J. Quiñonero-Candela and C. E. Rasmussen, Journal of Machine Learning Research 6, 1939 (2005).
- Cressie and Johannesson  N. Cressie and G. Johannesson, Journal of the Royal Statistical Society: Series B (Statistical Methodology) 70, 209 (2008).
- Banerjee et al.  A. Banerjee, D. B. Dunson, and S. T. Tokdar, Biometrika 100, 75 (2013), 1106.5779.
- Moore et al.  C. J. Moore, C. P. L. Berry, A. J. K. Chua, and J. R. Gair, Physical Review D 93, 064001 (2016), 1509.04066.
- Moore and Gair  C. J. Moore and J. R. Gair, Physical Review Letters 113, 251101 (2014), 1412.3657.
- MacKay  D. J. C. MacKay, Neural Computation 11, 1035 (1999).
- Snelson and Ghahramani  E. Snelson and Z. Ghahramani, in Advances in Neural Information Processing Systems, edited by Y. Weiss, B. Schölkopf, and J. C. Platt (MIT Press, Cambridge, MA, 2006), vol. 18, pp. 1257–1264.
- Quiñonero-Candela et al.  J. Quiñonero-Candela, C. E. Rasmussen, and C. K. I. Williams, in Large-scale Kernel Machines, edited by L. Bottou, O. Chapelle, D. DeCoste, and J. Weston (MIT Press, Cambridge, MA, 2007), chap. 9, pp. 203–223.
Blum and Riedmiller 
M. Blum and
M. Riedmiller, in
European Symposium on Artificial Neural Networks, Computational Intelligence and Machine Learning(2013).
- MacKay  D. J. C. MacKay, in Maximum Entropy and Bayesian Methods, edited by G. R. Heidbreder (Springer Netherlands, Dordrecht, 1996), vol. 62 of Fundamental Theories of Physics, pp. 43–59.
- Jeffreys  H. Jeffreys, Proceedings of the Royal Society A: Mathematical, Physical and Engineering Sciences 186, 453 (1946).
- Wendland  H. Wendland, Scattered Data Approximation (Cambridge Monographs on Applied and Computational Mathematics) (Cambridge University Press, 2005).
- Feroz et al.  F. Feroz, M. P. Hobson, and M. Bridges, Monthly Notices of the the Royal Astronomical Society 398, 1601 (2009), 0809.3437.
- Feroz and Hobson  F. Feroz and M. P. Hobson, Monthly Notices of the Royal Astronomical Society 384, 449 (2008), 0704.3704.
- Feroz et al.  F. Feroz, M. P. Hobson, E. Cameron, and A. N. Pettitt, Importance Nested Sampling and the MultiNest Algorithm (2013), 1306.2144.
- Skilling  J. Skilling, Bayesian Analysis 1, 833 (2006).
- O’Hara and Sillanpää  R. B. O’Hara and M. J. Sillanpää, Bayesian Analysis 4, 85 (2009).
- Piironen and Vehtari  J. Piironen and A. Vehtari, Comparison of Bayesian predictive methods for model selection (2015), 1503.08650.
- Geisser and Eddy  S. Geisser and W. F. Eddy, Journal of the American Statistical Association 74, 153 (1979).
Algebraic geometry and statistical learning theory, Cambridge Monographs on Applied and Computational Mathematics (Cambridge University Press, Cambridge, 2009).
- Gelman et al.  A. Gelman, J. Hwang, and A. Vehtari, Statistics and Computing 24, 997 (2013), 1307.5928.
- Lindley  D. V. Lindley, Journal of the Royal Statistical Society. Series B (Methodological) 30, 31 (1968).
- San Martini and Spezzaferri  A. San Martini and F. Spezzaferri, Journal of the Royal Statistical Society. Series B (Methodological) 46, 296 (1984).
- Scherer et al.  W. Scherer, W. M. Stoney, T. N. Mero, M. O’Hargan, W. M. Gibson, J. R. Hubbard, M. I. Weiss, O. Varmer, B. Via, D. M. Frilot, et al., Tech. Rep. NOAA Special Publication NOS CO-OPS 1, National Oceanic and Atmospheric Administration, Silver Spring, Maryland (2001).
- Lisitzin  E. Lisitzin, Sea-level changes (Elsevier, Oxford, 1974).
- Casotto and Biscani  S. Casotto and F. Biscani, in AAS/Division of Dynamical Astronomy Meeting #35 (2004), vol. 36 of Bulletin of the American Astronomical Society, p. 862.