 # Evaluating the squared-exponential covariance function in Gaussian processes with integral observations

This paper deals with the evaluation of double line integrals of the squared exponential covariance function. We propose a new approach in which the double integral is reduced to a single integral using the error function. This single integral is then computed with efficiently implemented numerical techniques. The performance is compared against existing state of the art methods and the results show superior properties in numerical robustness and accuracy per computation time.

## Authors

##### This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

## 1 Introduction

The Gaussian process (GP) 

has over the past decade become an important and fairly standard tool for system identification. The flexibility of the GP and more specifically its inherent data-driven bias-variance trade off makes it a very successful model for linear systems

[16, 3, 17], where it in fact improves performance compared to more classical models. When it comes to nonlinear system the GP has also shown great promise, for example when combined with the state space model [6, 5, 20] and when used in a classic auto-regressive setting [13, 2].

A useful property of GP is that it is closed under linear operators, such as integrals [18, 15, 9]. This enables GP regression to be applied to data that is related to the underlying object of interest via a line integral. In this case, evaluating the resulting measurement covariance involves double line integrals of the covariance function. There are no analytical solutions available and we are forced to numerical approximations. To give a few examples of where this problem arises, we mention; optimization algorithms [11, 21], continuous-time system identification, quadrature rules , and tomographic reconstruction .

The contribution of this paper is a numerically efficient and robust method for solving these integrals when the squared-exponential covariance function is used. We also provide a thorough comparison of the accuracy and computational time to existing state of the art methods for this rather specific problem.

## 2 Problem Statement

Consider a set of line integral observations of the form

 yi=||wi||∫10f(wis+pi)ds+ei, (1)

that provide measurements of a scalar function over the input space , where 111The notation denotes that

is normally distributed with mean

and variance .. By

we denote the start point of the line while the vector

specifies its direction and length; hence the end point is given by , and any point on the line can be expressed as with . Given this set we wish to make predictions of the function for a new input .

In GP regression this is accomplished through the specification of a prior covariance function which defines the correlation between any two function values and loosely speaking controls how smooth we believe the underlying function to be. Although there exists many possible covariance functions , here we will consider only the squared-exponential covariance function;

 k(z,z′)=exp(−0.5(z−z′)TV(z−z′)), (2)

where is a positive definite and symmetric scaling matrix.

The covariance between a measurement and the function at a test input is given by the single integral

 Ki∗=∥wi∥1∫0exp(−0.5(wit+vi)TV(wit+vi))dt, (3)

where . Two cases need considering.

Case 1: . Then the integral (3) reduces to

 Ki∗=∥wi∥exp(−12vTiVvi)=0. (4)

Case 2: . Then the solution is given by

 Ki∗=∥wi∥√π2c1exp(c228c1−12c3)(erf(2c1+c22√2c1)−erf(c22√2c1)), (5a) where erf(⋅) is the error function  and c1=wTiVwi,c2=2wTiVvi,c3=vTiVvi. (5b)

Determining the covariance function between a pair of line integral measurements requires evaluating the following double integral

 (6)

where . This is analytically intractable and hence we are forced to numerical approximations.

## 3 Related Work

Here, we provide a brief overview of existing approaches and their benefits and disadvantages.

The need to perform a double integral can be avoided entirely by the use of reduced rank approximation schemes that approximate the squared-exponential by a finite set of basis functions. One such scheme is the Hilbert space approximation  that was used in  to perform inference of strain fields from line integral measurements. Although the problem of calculating the double line integral is removed, the number of basis functions required and the subsequent computational cost grows exponentially with the problem dimension; making this method infeasible for higher-dimensional problems. An additional drawback is that the expressiveness of the model is limited to the basis functions chosen which may make this approach undesirable in many applications. Although this approach may be appropriate for some applications, the rest of this paper focuses on solutions that do not approximate the covariance function.

A simple approach to evaluate the double line integral is the use of 2D numerical integration methods (e.g. using a 2D Simpson’s rule). Such an approach has a trade off between computation time and accuracy. Although the approach detailed later still requires the numerical evaluation of a single integral, we have found it to be more efficient for the accuracy provided.

Another approach is to transform the problem into the double integral of a bivariate normal distribution. The integral in (6) can be rewritten as

 I2=1∫01∫0e−12f(t,s)dtds, (7a) where the function f(t,s) is given by f(t,s) =(pi−pj+wit−wjs)TV (pi−pj+wit−wjs) (7b) =(ts)T(abbc)(ts)−2(de)(ts)+¯f, (7c) with the following variables Σ−1≜(abbc) (7d) d =(pi−wj)TVwi, (7e) e =−(pi−pj)TVwj, (7f) ¯f =(pi−pj)TV(pi−pj). (7g)

When the matrix is invertible, then it is possible to reformulate the integral and employ integration methods for bivariate normal distributions (see e.g.  and the references therein).

However, when for some small value , then this approach fails to reliably deliver accurate solutions. This problem is most notable when , in which case is not invertible and the bivariate approach cannot be used. There is an alternate approach in this particular case, but the user is now faced with a choice to numerically determine the appropriate threshold to decide between the solution methods. This is most notably an issue for lines that are close to co-linear. Importantly, the case where occurs frequently, e.g. along the diagonal entries of the covariance matrix.

Transformation of the problem into the Bivariate Normal is presented in the Appendix of . However the non-invertible case is not explicitly dealt with, therefore we have provided the details for applying this approach and the necessary cases in Appendix A.

## 4 Efficient Computation of the Double Line Integral

Here, a method to accurately evaluate the double line integral (6) in a computationally efficient manner is presented. The integral can be rewritten as

 Kij=∥wi∥∥∥wj∥∥1∫01∫0exp(−12(a−bs+ct−dst+et2+fs2))dtds, (8a) where a=uTijVuij,b=2uTijVwj,c=2uTijVwi,d=2wTjVwi,e=wTiVwi,f=wTjVwj. (8b)

The error function can be used to provide a solution to the integral over ;

 Kij=∥wi∥∥∥wj∥∥√π2e∫10Γ1(s)Γ2(s)ds, (9)

where

 Γ1(s)=exp(12(bs−a−fs2+(c−ds)24e))Γ2(s)=erf(c−ds+2e2√2e)−erf(c−ds2√2e). (10)

For numerical reasons the exponent should not be split into a product of and . The reason is that when is large, while and rounding errors become a problem.

Numerical methods can then be used to evaluate the remaining integral. Care should be taken when considering the following two cases:

Case 1: Either or . Then the problem reduces to a single line integral and two sub cases should be considered in implementation.

• Case 1a: and . Then Equation (9) is numerically unstable and two approaches can be taken. Either the solution to the single integral (5) can be used, which requires the user to determine a numerical threshold to decide between the two cases. Alternatively, the order of the integrals can be swapped, and Equation (9) can be applied.

• Case 1b: and . Equation (9) is numerically stable, and hence the user can either ignore this case or define a numerical tolerance below which to use the solution to the single integral (5).

Case 2: and . Then Equation (9) is numerically unstable and the solution is instead given by

 Kij=∥wi∥∥∥wj∥∥exp(−12a)=0. (11)

Psudocode is provided in Algorithm 1 and a Matlab mex function implemented in C is provided at . In the implementation and psuedocode, the order of the lines is swapped whenever as this means the function is used to evaluate the larger interval and the numerical integration the shorter interval; which was found to require less function evaluations. The C implementation utilises the BLAS functions dgemm, ddot, and dnrm2 , and the GNU Scientific Library’s non-adaptive Gauss Konrod function gsl_integration_qng  with relative and absolute error tolerance set to square root of double precision epsilon.

## 5 Results and Analysis

The performance of the methods was evaluated for several sets of double line integrals; containing 10,000 pairs each and with input dimension . The sets were chosen to test a variety of cases and in particular to evaluate the performance on the transition between the corner cases. The following sets were used:

• Set 1: A standard set; , , , and for ,

• Set 2: Almost colinear lines; , , , and for ,

• Set 3: Randomly selected diagonal scaling matrix; , , , and for ,

• Set 4: Reduced to single integral; , , , and for ,

• Set 5: Nicely scaled (large integral intervals and small random scaling); , , , and for ,

• Set 6: Poorly scaled (large integral intervals and large random scaling); , , , and for ,

• Set 7:: Almost reduced to single integral; , , , and for ,

• Set 8: Almost reduced to no integral; , , , and for .

Here, the notation denotes that the element of the vector is distributed uniformly between and .

Errors were calculated by comparison to Matlab’s Integral2 using absolute and relative error tolerance of double precision epsilon. The mean magnitude of the error for each set are recorded in Table 1. The computation times required by each method were consistent across the cases and so the average computation times are recorded in Table 2.

A 2D Simpson’s rule was implemented with sub intervals along and ; results are shown for , and . This 2D Simpson’s rule gives the lowest error for Set 8 where the integrand is close to a constant function and is quite fast a choice of small . Sets 1 through 6 it show relatively large errors with the errors decreasing as is increased. However, even with the errors are still larger than for the Bivariate Normal method and the proposed method and the computation times are an order of magnitude greater.

The Bivariate Normal method has acceptable small errors across all sets excluding set 2. The Bivariate normal method required a different algorithm for the co-linear and non co-linear cases and Set 2 lies on the transition between these cases. Figure 1 provides a histogram of the error magnitudes for Set 2 as computed by each method. The histogram shows that the Bivariate normal method can result in large errors which would be prohibitive for many applications.

Our proposed method has acceptable errors across all sets. The worst errors, of and for Set 5 and Set 6 respectively, could potentially be improved by using an adaptive integration method that may be more suited to the scaling in these sets. The average computation time is the smallest reported. Figure 1: Histogram of the error magnitudes for evaluating the double line integrals in set 2 (the almost co-linear case); (a) Simpson’s rule with p=10, (b) Simpson’s rule with p=200, (c) Bivariate Normal method, (d) our proposed method.

## 6 Conclusion

In this paper we have focussed on the specific problem of evaluating a double line integral over the squared-exponential covariance function. This problem arises in several applications of Gaussian process regression. Existing approaches to this problem and their advantages and disadvantages were outlined. An alternative method was proposed and its performance compared to the existing methods was evaluated for a number of sets of line integrals corresponding to both common cases and corner cases. The comparison shows the proposed method is computationally efficient while providing acceptable accuracy across all tested cases.

## Acknowledgements

This research was financially supported by the Swedish Foundation for Strategic Research (SSF) via the project ASSEMBLE (contract number: RIT15-0012) and by the Swedish Research Council via the projects Learning flexible models for nonlinear dynamics (contract number: 2017-03807) and NewLEADS - New Directions in Learning Dynamical Systems (contract number: 621-2016-06079).

## Appendix A Bivariate normal method

In this appendix we show how to compute the double integral in (6) by reformulating the problem as the double integral of a Bivariate normal distribution. The integral can be written as

 I2=∫10∫10e−12f(t,s)dtds, (12a) where the function f(s,t) is given by f(t,s) =(pi−pj+wit−wjs)TV(pi−pj+wit−wjs) =(ts)T(abbc)(ts)−2(de)(ts)+¯f, (12b) with the following variables (abbc) =(wTiVwi−wTiVwj−wTjVwiwTjVwj), (12c) d =−(pi−pj)TVwi, (12d) e =(pi−pj)TVwj, (12e) ¯f =(pi−pj)TV(pi−pj). (12f)

There are two cases that need to be considered separately as the solutions to each case do not commute. The reason for needing two cases is that when is colinear with (as occurs along the block diagonal elements of the matrix), then the above matrix involving is not invertible.

### a.1 Double integral with ac−b2=0

This case can occur in four distinct ways.

Case 1: and . Then and and we can use (4) and (5) to solve this integral.

Case 2: and . Then and and we can again use (4) and (5) to solve this integral.

Case 3: and . Then and and the solution is

Case 4: and . Since we are already in the case where , and and then it follows that for some since

 ac−b2 =(wTiVwi)(wTjVwj)−(wTiVwj)2, =(wTiVwi)(β2wTiVwi)−(βwTiVwi)2, =β2(wTiVwi)2−β2(wTiVwi)2 =0

as claimed. Therefore, by defining as

 β≜wTiVwjwTiVwi (13)

then can be expressed as

 f(t,s) =f+2t(pi−pj)TVwi−2βs(pi−pj)TVwi+t2(wTiVwi)−2βts(wTiVwi)+β2s2(wTiVwi).

If we make a change of variables for to

 ¯s=βs⟹f(t,¯s)=f+a(t2−2t¯s+s2)

and the integral becomes

 I2 =e−f/2β∫β0∫10e−a2(t2−2t¯s+¯s2)dtd¯s =e−f/2β∫β0∫10e−a2(t−¯s)2dtd¯s

By a change of variables then

 I2 =√2e−f/2β√a∫β0∫(1−¯s)√a/2−¯s√a/2e−ϕ2dϕd% ¯s (14) =√π√2e−f/22β√a∫β0(erf((1−¯s)√a/2)−erf(−¯s√a/2))d¯s (15)

where the second equality stems from the definition of the error function

 erf(x)=2√π∫x0e−u2d% u. (16)

Next we can exploit the fact that the integral of the error function is given by (using integration by parts)

 ∫baerf(x)dx=berf(b)+e−b2√π−aerf(a)−e−a2√π. (17)

To apply this to it is helpful to split the integral component via

 I2 =√π√2e−f/22β√a(I3−I4), (18) I3 ≜∫β0erf((1−¯s)√a/2)d¯s (19) I4 ≜∫β0erf(−¯s√a/2)d% ¯s (20)

By change of variables then becomes

 I3 =√2a∫√a/2(1−β)√a/2% erf(ζ)dζ =√2a[√a2erf(√a2)+e−a22√π−(1−β)√a2erf((1−β)√a2)−e−(1−β)2a22√π]

And similarly for with change of variables to arrive at

 I4 =√2a∫0−β√a/2erf(ψ)dψ (21) =√2a[1√π+β√a2erf(−β√a2)−e−β2a22√π] (22)

### a.2 Double integral with ac−b2>0

By defining

 z =(ts),Σ−1=(abbc),ν=(de),μ=Σν g(z) ≜(z−μ)TΣ−1(z−μ),h≜f−μTΣ−1μ

we have that

 f(t,s) =g(z)+h (23)

Therefore,

 e−12f(t,s) =e−12g(z)−h2=2πe−h2√detΣN(z|μ,Σ) (24)

where

denotes the probability density function for the normal distributed random variable

, with mean value  and covariance . With the change of variables the integral can be expressed as

 I2=2πe−h2√detΣ∫1−μ1−μ1∫1−μ2−μ2N(¯z|0,Σ)d¯z1d¯z2, (25)

Note that the density in (25) is a bivariate normal distribution, which means that it can be written as

 N(¯z|0,Σ) =12π√1−ρ2√c11c22exp(−α(¯z)2(1−ρ2))

where

 [c11c12c21c22]≜Σ=[a−b−bc]−1=1ac−b2[cbba] α(¯z) =(¯z1√c11)2−2ρ(¯z1√c11)(¯z2√c22)+(¯z2√c22)2 ρ =c12√c11c22.

By change of variables and the resulting integral as

 I2=2πe−h2√detΣ∫1−μ1√c11−μ1√c11∫1−μ2√c22−μ2√c22β(ρ)d˜z1d˜z2. (26a) where β(ρ)=12π√1−ρ2exp(−˜z21−2ρ˜z1˜z2+˜s222(1−ρ2)). (26b)

The integral (26a) has been widely studied in the literature, see e.g.  and the references therein. There is efficient code available to compute this integral.

## References

•  Larry C Andrews and Larry C Andrews. Special functions of mathematics for engineers. McGraw-Hill New York, 1992.
•  Hildo Bijl, Thomas B. Schön, Jan-Willem van Wingerden, and Michel Verhaegen. System identification through online sparse Gaussian process regression with input noise. In IFAC Journal of Systems and Control, volume 2, pages 1–11, 2017.
•  Tianshi Chen, Henrik Ohlsson, and Lennart Ljung.

On the estimation of transfer functions, regularizations and Gaussian processes–revisited.

Automatica, 48(8):1525–1535, 2012.
•  Jack J Dongarra, Jeremy Du Croz, Sven Hammarling, and Iain S Duff. A set of level 3 basic linear algebra subprograms. ACM Transactions on Mathematical Software (TOMS), 16(1):1–17, 1990.
•  Roger Frigola, Yutian Chen, and Carl E. Rasmussen. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems (NIPS), 2014.
•  Roger Frigola, Fredrik Lindsten, Thomas B. Schön, and Carl E. Rasmussen. Bayesian inference and learning in Gaussian process state-space models with particle MCMC. In Advances in Neural Information Processing Systems (NIPS) 26, Lake Tahoe, NV, USA, December 2013.
•  Alan Genz. Numerical computation of rectangular bivariate and trivariate normal and t probabilities. Statistics and Computing, 14(3):251–260, 2004.
•  Brian Gough. GNU scientific library reference manual. Network Theory Ltd., 2009.
•  Thore Graepel.

Solving noisy linear operator equations by Gaussian processes: Application to ordinary and partial differential equations.

In

International Conference on Machine Learning

, pages 234–241, 2003.
•  Johannes Hendriks. Mex code for evaluating double line integrals of the scquared exponential covariance function.
•  Philipp Hennig and Martin Kiefel. Quasi-newton methods: A new direction. Journal of Machine Learning Research, 14, 06 2012.
•  Carl Jidling, Johannes Hendriks, Niklas Wahlström, Alexander Gregg, Thomas B Schön, Chris Wensrich, and Adrian Wills. Probabilistic modelling and reconstruction of strain. Nuclear instruments and methods in physics research section B: Beam interactions with materials and atoms, 436:141–155, 2018.
•  Juc Kocijan, Agathe Girard, Blaz Banko, and Roderick Murray-Smith. Dynamic systems identification with Gaussian processes. Mathematical and Computer Modelling of Dynamical Systems, 11(4):411—424, 2005.
•  Thomas P Minka. Deriving quadrature rules from Gaussian processes. Technical report, Technical report, Statistics Department, Carnegie Mellon University, 2000.
•  Athanasios Papoulis. Probability, random variables, and stochastic processes. 1965.
•  Gianluigi Pillonetto and Giuseppe De Nicolao. A new kernel-based approach for linear system identification. Automatica, 46(1):81—93, 2010.
•  Gianluigi Pillonetto, Francesco Dinuzzo, Tianshi Chen, Giuseppe De Nicolao, and Lennart Ljung. Kernel methods in system identification, machine learning and function estimation: a survey. Automatica, 50(3):657–682, 2014.
•  Carl Edward Rasmussen and Christopher KI Williams. Gaussian process for machine learning. MIT press, 2006.
•  Arno Solin and Simo Särkkä. Hilbert space methods for reduced-rank Gaussian process regression. arXiv preprint arXiv:1401.5508, 2014.
•  Andreas Svensson and Thomas B. Schön. A flexible state space model for learning nonlinear dynamical systems. Automatica, 80:189–199, June 2017.
•  Adrian G Wills and Thomas B Schön. On the construction of probabilistic newton-type algorithms. In Proceedings of the 56th IEEE Annual Conference on Decision and Control (CDC), pages 6499–6504, Melbourne, Australia, 2017.