1 Introduction
Gaussian processes (GPs) (Rasmussen & Williams, 2006)
provide a flexible framework for modelling unknown functions: they are robust to overfitting, offer good predictive uncertainty estimates, and allow us to incorporate prior assumptions into the model. Given a dataset with inputs $X = \{x_n\}_{n=1}^N$ and outputs $y = \{y_n\}_{n=1}^N$, a GP regression model assumes $y_n = f(x_n) + \epsilon_n$, where $f$ is a GP and the $\epsilon_n$ are normal random variables accounting for observation noise. The model predictions at new inputs $x^*$ are then given by the posterior distribution $p(f(x^*) \mid y)$. However, computing this posterior distribution usually scales as $\mathcal{O}(N^3)$, because it requires solving a linear system involving an $N \times N$ matrix.
A range of sparse GP methods (see Quiñonero-Candela & Rasmussen (2005) for an overview) have been developed to improve on this scaling. Among them, variational inference is a practical approach allowing regression (Titsias, 2009), classification (Hensman et al., 2015), mini-batching (Hensman et al., 2013) and structured models including latent variables, time series and depth (Titsias & Lawrence, 2010; Frigola et al., 2014; Hensman & Lawrence, 2014; Salimbeni & Deisenroth, 2017). Variational inference in GPs works by approximating the exact (but intractable) posterior $p(f \mid y)$. This approximate posterior process is constructed from a conditional distribution based on $M$ pseudo observations $u$ of $f$ at inducing locations $Z = \{z_m\}_{m=1}^M$. The approximation is optimised by minimising the Kullback–Leibler divergence between the approximate posterior and the exact one. The resulting complexity is $\mathcal{O}(NM^2 + M^3)$, so choosing $M \ll N$ enables significant speed-ups compared to vanilla GP models. However, when using common stationary kernels such as the Matérn, Squared Exponential (SE) or rational quadratic, the influence of a pseudo-observation is only local and limited to the neighbourhood of its inducing point $z_m$, so a large number of inducing points may be required to cover the input space. This is especially problematic for higher-dimensional data as a result of the curse of dimensionality.
Variational Fourier Features (VFF, Hensman et al., 2017) inducing variables have been proposed to overcome this limitation. In VFF the inducing variables based on pseudo observations are replaced with inducing variables obtained by projecting the GP onto the Fourier basis. This results in inducing variables that have a global influence on the predictions. For one-dimensional inputs, this construction leads to covariance matrices between inducing variables that are almost diagonal, and this structure can be exploited to further reduce the computational complexity. However, for $d$-dimensional input spaces VFF requires the construction of a new basis given by the outer product of $d$ one-dimensional Fourier bases. This implies that the number of inducing variables grows exponentially with the dimensionality, which limits the use of VFF to one or two dimensions. Furthermore, VFF is restricted to kernels of the Matérn family.
In this work we improve upon VFF in multiple directions. Rather than using a sine and cosine basis, we use a basis of spherical harmonics to define a new inter-domain sparse approximation. As we will show, spherical harmonics are the eigenfunctions of stationary kernels on the hypersphere, which allows us to exploit the Mercer representation of the kernel when defining the inducing variables. In arbitrary dimensions, our method leads to diagonal covariance matrices, which makes it faster than VFF as we fully bypass the need to compute expensive matrix inverses. Compared to both sparse GPs and VFF, our approximation scheme suffers less from the curse of dimensionality. As with VFF, each spherical harmonic inducing function has a global influence, but there is a natural ordering of the spherical harmonics that guarantees that the best features are picked given an overall budget for the number of inducing variables. Moreover, our method works for any stationary kernel on the sphere.
Following the illustration in fig. 1, we outline the different steps of our method. We start by concatenating the data with a constant input (bias) and projecting it linearly onto the unit hypersphere. We then learn a sparse GP on the sphere based on the projected data. We can do this extremely efficiently by making use of our spherical harmonic inducing variables, shown in fig. 2. Finally, the linear mapping between the sphere and the subspace containing the data can be used to map the predictions of the GP on the sphere back to the original space.
This paper is organised as follows. In section 2 we give the necessary background on sparse GPs and VFF. In section 3 we highlight that every stage of the proposed method can be elegantly justified. For example, the linear mapping between the data and the sphere is a property of some specific covariance functions such as the arccosine kernel, from which we can expect good generalisation properties (section 3.1). Another example is that the spherical harmonics are coupled with the structure of the Reproducing Kernel Hilbert Space (RKHS) norm for stationary kernels on the sphere, which makes them very natural candidates for basis functions for sparse GPs (section 3.2). In section 3.3 we define our spherical harmonic inducing variables and compute the necessary components to do efficient GP modelling. Section 4 is dedicated to the experimental evaluation.
2 Background
2.1 GPs and Sparse Variational Inference
GPs are stochastic processes such that the distribution of any finite-dimensional marginal is multivariate normal. The distribution of a GP is fully determined by its mean function $\mu(\cdot)$ and covariance function (kernel) $k(\cdot, \cdot)$. GP models typically consist of combining a latent (unobserved) GP $f$ with a likelihood that factorises over observation points: $p(y \mid f) = \prod_n p(y_n \mid f(x_n))$. When the observation model is Gaussian, $y_n = f(x_n) + \epsilon_n$ with $\epsilon_n \sim \mathcal{N}(0, \sigma^2)$, the posterior distributions of $f$ and $y$ given some data $\mathcal{D} = (X, y)$ are still Gaussian and can be computed analytically. Assuming a zero-mean prior and letting $\sigma^2$ denote the observation noise variance, the GP posterior distribution is $f \mid y \sim \mathcal{GP}(m(\cdot), v(\cdot, \cdot))$ with
$m(x) = k(x, X) \left[k(X, X) + \sigma^2 I\right]^{-1} y$ and $v(x, x') = k(x, x') - k(x, X) \left[k(X, X) + \sigma^2 I\right]^{-1} k(X, x')$,
where $k(x, X) = [k(x, x_1), \dots, k(x, x_N)]$ and $[k(X, X)]_{ij} = k(x_i, x_j)$.
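These closed-form expressions are a short linear-algebra recipe; the following is a minimal NumPy sketch, assuming a zero-mean prior and an SE kernel (the kernel choice and helper names are ours, for illustration only):

```python
import numpy as np

def se_kernel(A, B, variance=1.0, lengthscale=1.0):
    """Squared Exponential kernel matrix between input sets A (N, d) and B (T, d)."""
    sq = np.sum(A**2, axis=1)[:, None] + np.sum(B**2, axis=1)[None, :] - 2 * A @ B.T
    return variance * np.exp(-0.5 * sq / lengthscale**2)

def gp_posterior(X, y, Xs, noise_var=0.1):
    """Exact GP posterior mean and covariance at test inputs Xs (zero-mean prior)."""
    K = se_kernel(X, X) + noise_var * np.eye(len(X))   # k(X, X) + sigma^2 I
    L = np.linalg.cholesky(K)                          # the O(N^3) bottleneck
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    Ks = se_kernel(X, Xs)                              # cross-covariances k(X, x*)
    mean = Ks.T @ alpha
    V = np.linalg.solve(L, Ks)
    cov = se_kernel(Xs, Xs) - V.T @ V
    return mean, cov

rng = np.random.default_rng(0)
X = rng.uniform(-3, 3, (50, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(50)
mu, cov = gp_posterior(X, y, np.array([[0.0]]))
```

The Cholesky factorisation of the $N \times N$ matrix is exactly the cubic cost discussed next.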
Computing this exact posterior requires inverting the matrix $k(X, X) + \sigma^2 I$, which has an $\mathcal{O}(N^3)$ computational complexity and an $\mathcal{O}(N^2)$ memory footprint. Given typical current hardware specifications, this limits the dataset size to the order of a few thousand observations. Furthermore, there is no known analytical expression for the posterior distribution when the likelihood is not conjugate, as encountered in classification for instance.
Sparse GPs combined with variational inference provide an elegant way to address these two shortcomings (Titsias, 2009; Hensman et al., 2013, 2015). The approach consists of introducing a parametric distribution $q(f)$, and finding the values of its parameters such that $q(f)$ gives the best possible approximation of the exact posterior $p(f \mid y)$. Sparse GPs introduce $M$ pseudo inputs $Z = \{z_m\}_{m=1}^M$ corresponding to inducing variables $u = f(Z)$, and choose to write the approximating distribution as $q(f) = p(f \mid u)\, q(u)$ with $q(u) = \mathcal{N}(m, S)$. This results in a distribution that is parametrised by the variational parameters $Z$, $m$ and $S$, which are learned by minimising the Kullback–Leibler (KL) divergence $\mathrm{KL}\left[q(f) \,\|\, p(f \mid y)\right]$.
At prediction time, the conjugacy between $q(u)$ and the conditional $p(f \mid u)$ implies that $q(f)$, where $u$ is marginalised out, is a GP whose mean and covariance function can be computed in closed form:
(1) $\quad \mu(x) = k_u(x)^\top K_{uu}^{-1} m, \qquad v(x, x') = k(x, x') + k_u(x)^\top K_{uu}^{-1} (S - K_{uu}) K_{uu}^{-1} k_u(x'),$
where $[K_{uu}]_{ij} = \mathrm{cov}(u_i, u_j)$ and $[k_u(x)]_i = \mathrm{cov}(u_i, f(x))$.
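Equation (1) is a pure linear-algebra computation once the covariances are assembled; a small NumPy sketch (the array shapes and the helper name are ours):

```python
import numpy as np

def svgp_predict(kuf, kss, Kuu, m, S):
    """Sparse variational predictive distribution of eq. (1).
    kuf: (M, T) covariances cov(u_m, f(x_t)); kss: (T, T) prior test covariance;
    Kuu: (M, M); m: (M,) and S: (M, M) are the variational parameters."""
    A = np.linalg.solve(Kuu, kuf)      # Kuu^{-1} k_u(x), shape (M, T)
    mean = A.T @ m                     # k_u(x)^T Kuu^{-1} m
    cov = kss + A.T @ (S - Kuu) @ A    # eq. (1) covariance
    return mean, cov
```

A handy sanity check: setting $q(u)$ equal to the prior ($m = 0$, $S = K_{uu}$) makes the correction term vanish and recovers the prior predictive.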
Sparse GPs result in $\mathcal{O}(NM^2 + M^3)$ and $\mathcal{O}(M^2)$ computational complexity at training (minimising the KL) and prediction respectively. Picking $M \ll N$ can thus result in a drastic improvement, but the lower $M$ is, the less accurate the approximation, as recently shown by Shi et al. (2020). Typical kernels, such as the Matérn or Squared Exponential (SE), depend on a lengthscale parameter that controls how quickly the correlation between two evaluations of $f$ drops as the two inputs move away from one another. For short lengthscales, this correlation drops quickly and two observations can be almost independent even for input points that are close by in the input space. For a sparse GP model, this implies that $\mu(x)$ and $v(x, x)$ will rapidly revert to the prior mean and variance when $x$ is not in the immediate neighbourhood of an inducing point $z_m$. A similar effect can be observed when the input space is high-dimensional: because inducing variables only have a local influence, the number of inducing points required to cover the space grows exponentially with the input space dimensionality. In practice, it may thus be required to pick large values of $M$ to obtain accurate approximate posteriors, but this defeats the original intent of sparse methods. This behaviour, where the feature vector $k_u(x)$ of the approximate distribution is given by kernel evaluations at the inducing points, can be addressed using inter-domain inducing variables.
2.2 Inter-domain GPs and Variational Fourier Features
Inter-domain Gaussian processes (see Lázaro-Gredilla & Figueiras-Vidal (2009), and van der Wilk et al. (2020) for a modern exposition) use alternative forms of inducing variables such that the resulting sparse GP models have more informative features. In inter-domain GPs the inducing variables are obtained by integrating the GP against an inducing function $g_m$:
$u_m = \int f(x)\, g_m(x)\, \mathrm{d}x.$
This redefinition of $u$ implies that the expressions of $K_{uu}$ and $k_u(x)$ change, but the inference scheme of inter-domain GPs and the mathematical expressions for the posterior mean and variance are exactly the same as for classic sparse GPs. Depending on the choice of $g_m$, inter-domain GPs can result in various feature vectors $k_u(x)$. These feature vectors can alleviate the classic sparse GP limitation of inducing variables having only a local influence.
VFF (Hensman et al., 2017) is an inter-domain method where the inducing variables are given by a Matérn RKHS inner product between the GP and elements of the Fourier basis:
$u_m = \langle f, \phi_m \rangle_{\mathcal{H}},$
where $\phi_0(x) = 1$, $\phi_{2i-1}(x) = \cos(\omega_i x)$ and $\phi_{2i}(x) = \sin(\omega_i x)$, with $\omega_i = 2 \pi i / (b - a)$ if the input space is the interval $[a, b]$. This leads to
$[k_u(x)]_m = \mathrm{cov}(u_m, f(x)) = \phi_m(x).$
This results in several advantages. First, the features are exactly the elements of the Fourier basis, which are independent of the kernel parameters and can be precomputed. Second, the matrix $K_{uu}$ is the sum of a diagonal matrix and low-rank matrices. This structure can be used to drastically reduce the computational complexity, and the experiments showed one or two orders of magnitude speed-ups compared to classic sparse GPs. Finally, the variance of the inducing variables typically decays quickly with increasing frequencies, which means that by selecting the first elements of the Fourier basis we pick the features that carry the most signal.
The main flaw of VFF comes from the way it generalises to multidimensional input spaces. The approach in Hensman et al. (2017) for getting a set of $d$-dimensional inducing functions consists of taking the outer product of $d$ univariate bases and considering separable kernels, so that the entries of $K_{uu}$ are given by the product of the inner products in each dimension. For example, in dimension 2, a set of inducing functions is given by $\{\phi_i(x_1)\, \phi_j(x_2)\}_{i,j}$, and the entries of $K_{uu}$ are $\langle \phi_i, \phi_{i'} \rangle \langle \phi_j, \phi_{j'} \rangle$. This construction scales poorly with the dimension: for example, choosing a univariate basis as simple as $\{1, \cos, \sin\}$ for an eight-dimensional problem already results in $3^8 = 6{,}561$ inducing functions. Additionally, this construction is very inefficient in terms of captured variance, as we illustrate in Figure 3 for a 2D input space. The figure shows that the prior variance associated with the inducing function $\phi_i(x_1)\, \phi_j(x_2)$ vanishes quickly when both $i$ and $j$ increase. This means that most of the inducing functions on which the variational posterior is built are irrelevant, whereas some important ones, such as those with a high frequency in one dimension and a low frequency in the other, are ignored. Although we used a 2D example to illustrate this poor behaviour, it is important to bear in mind that the issue gets exacerbated for higher-dimensional input spaces. As detailed in fig. 4 and discussed later, our proposed approach does not suffer from such behaviour.
3 Variational Inference with Spherical Harmonics
In this section we describe the three key ingredients of the proposed approach: the mapping of the dataset to the hypersphere, the definition of GPs on this sphere, and the use of spherical harmonics as inducing functions for such GPs.
3.1 Linear Mapping to the Hypersphere
As illustrated in fig. 1, for a 1D dataset, the first step of our approach involves appending the dataset with a dummy input value that we will refer to as the bias $b$ (grey dots in fig. 1). For convenience we will overload the notation and refer to the augmented inputs as $x$ from now on, denote their dimension by $d$, and assume $x \in \mathbb{R}^d$. The next step is to project the augmented data onto the unit hypersphere $\mathbb{S}^{d-1}$ according to a linear mapping:
$x \mapsto x / \|x\|,$
as depicted by the orange dots. Given the projected data, we learn a GP model on the sphere (orange function), and the model predictions can be mapped back to the hyperplane (blue curve) using the inverse linear mapping (i.e. following the fine grey lines starting from the origin).
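The augment-and-project step can be sketched in a few lines of NumPy (function names are ours; the stored radii make the mapping invertible):

```python
import numpy as np

def to_sphere(X, bias=1.0):
    """Append a constant bias coordinate and project each augmented input
    onto the unit hypersphere; the radii are kept to invert the mapping."""
    Xa = np.concatenate([X, np.full((len(X), 1), bias)], axis=1)
    r = np.linalg.norm(Xa, axis=1, keepdims=True)
    return Xa / r, r

def from_sphere(S, r):
    """Inverse linear mapping: rescale sphere points along rays through the
    origin back onto the hyperplane where the last coordinate equals the bias."""
    return S * r

X = np.array([[3.0], [0.0]])
S, r = to_sphere(X, bias=1.0)
```

Since every augmented point has a non-zero last coordinate, each ray through the origin meets the hyperplane exactly once, so no information is lost.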
Although this setup may seem rather arbitrary, it is inspired by important work on the limits of neural networks as Gaussian processes, especially the arc-sine kernel (Williams, 1998) and the arc-cosine kernels (Cho & Saul, 2009). An arc-cosine kernel corresponds to the infinitely-wide limit of a single-layer ReLU-activated network with Gaussian weights. Let $x = (x_1, \dots, x_{d-1}, b)$ be an augmented input vector whose last entry corresponds to the bias. Then the (first-order) arc-cosine kernel can be written as
$k(x, x') = \frac{\|x\|\, \|x'\|}{\pi} \left( \sin \alpha + (\pi - \alpha) \cos \alpha \right),$
where $\alpha$ is the geodesic distance (or the great-circle distance) between the projections of $x$ and $x'$ on the unit hypersphere. We observe that $k$ is separable in polar coordinates, and its dependence on the radii is just that of the linear kernel. Since the linear kernel is degenerate, this means there is a one-to-one mapping between the GP samples restricted to the unit sphere and the samples over $\mathbb{R}^d$, and that this mapping is exactly the linear transformation we introduced previously.
One insight provided by the link between our settings and the arccosine kernel is that once mapped back to the original space, our GP samples will exhibit linear asymptotic behaviour, which can be compared to the way ReLUactivated neural networks operate. Although a proper demonstration would require further work, this correspondence suggests that our models may inherit the desirable generalisation properties of neural networks.
Having mapped the data to the sphere, we must now introduce GPs on the sphere and their covariance. The following section provides some theory that allows us to work with various kernels on the sphere, including the arccosine kernel as a special case.
3.2 Mercer’s Theorem for Zonal Kernels on the Sphere
Stationary kernels, i.e. translation-invariant covariances, are ubiquitous in machine learning when the input space is Euclidean. When working on the hypersphere, their spherical counterparts are zonal kernels, which are invariant to rotations. More precisely, a kernel
$k: \mathbb{S}^{d-1} \times \mathbb{S}^{d-1} \to \mathbb{R}$ is called zonal if there exists a shape function $s: [-1, 1] \to \mathbb{R}$ such that $k(x, x') = s(x^\top x')$. From here on, we focus on zonal kernels. Since stationary kernels are functions of $x - x'$, they have the property that $\Delta_x k(x, x') = \Delta_{x'} k(x, x')$, where $\Delta$ is the Laplacian operator. This property is also verified by zonal kernels: $\Delta_x k(x, x') = \Delta_{x'} k(x, x')$, where we now denote by $\Delta_x$ the Laplace–Beltrami operator with respect to the variable $x$. Combined with an integration by parts, this property can be used to show that the kernel operator $\mathcal{K}: f \mapsto \int_{\mathbb{S}^{d-1}} k(\cdot, x') f(x')\, \mathrm{d}x'$ of a zonal covariance and the Laplace–Beltrami operator commute:
$\mathcal{K}(\Delta f) = \Delta(\mathcal{K} f),$
which in turn implies that these two operators share the same eigenfunctions. This result is of particular relevance to us since there is a huge body of literature on the diagonalisation of the Laplace–Beltrami operator on $\mathbb{S}^{d-1}$, and it is well known that its eigenfunctions are given by the spherical harmonics. This reasoning can be summarised by the following theorem:
Theorem 1 (Mercer representation).
Any zonal kernel $k$ on the hypersphere $\mathbb{S}^{d-1}$ can be decomposed as
(2) $\quad k(x, x') = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} a_{\ell, j}\, \phi_{\ell, j}(x)\, \phi_{\ell, j}(x'),$
where the $a_{\ell, j}$ are positive coefficients, the $\phi_{\ell, j}$ denote the elements of the spherical harmonic basis in $\mathbb{S}^{d-1}$, and $N_\ell^d$ corresponds to the number of spherical harmonics for a given level $\ell$.
Although it is typically stated without a proof, this theorem is already known in some communities (see Wendland (2005) for a functional analysis exposition, or Peacock (1999) for its use in cosmology).
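The count $N_\ell^d$ appearing in the theorem has a standard closed form; a small sketch (the dimension-count formula assumes ambient dimension $d \geq 3$, and the function names are ours):

```python
from math import comb

def num_harmonics(d, ell):
    """Number N_ell^d of spherical harmonics of degree ell on the sphere
    S^(d-1), where d >= 3 is the ambient dimension (standard count)."""
    if ell == 0:
        return 1
    return (2 * ell + d - 2) * comb(ell + d - 3, ell) // (d - 2)

def total_harmonics(d, max_level):
    """Total number of basis functions when truncating at level max_level."""
    return sum(num_harmonics(d, ell) for ell in range(max_level + 1))
```

For $d = 3$ this recovers the familiar $2\ell + 1$ harmonics per degree, and for the 8-dimensional experiments later in the paper ($d = 9$ after appending the bias), truncating at levels 3 and 4 gives 210 and 660 basis functions, matching the reported values of $M$.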
Given the Mercer representation of a zonal kernel, its RKHS can be characterised by
$\mathcal{H} = \Big\{ f = \sum_{\ell, j} \hat{f}_{\ell, j}\, \phi_{\ell, j} \ \Big|\ \sum_{\ell, j} \hat{f}_{\ell, j}^2 / a_{\ell, j} < \infty \Big\},$
with the inner product between two functions $f = \sum_{\ell, j} \hat{f}_{\ell, j} \phi_{\ell, j}$ and $g = \sum_{\ell, j} \hat{g}_{\ell, j} \phi_{\ell, j}$ defined as
(3) $\quad \langle f, g \rangle_{\mathcal{H}} = \sum_{\ell, j} \frac{\hat{f}_{\ell, j}\, \hat{g}_{\ell, j}}{a_{\ell, j}}.$
It is straightforward to show that this inner product satisfies the reproducing property (see Appendix LABEL:appendix:proof:reproducing).
In order to make these results practical, we need to compute the coefficients $a_{\ell, j}$ for some kernels of interest. For a given value of $x'$ we can see $k(\cdot, x')$ as a function from $\mathbb{S}^{d-1}$ to $\mathbb{R}$, and represent it in the basis of spherical harmonics:
(4) $\quad k(x, x') = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} \hat{k}_{\ell, j}(x')\, \phi_{\ell, j}(x), \quad \text{with} \quad \hat{k}_{\ell, j}(x') = \int_{\mathbb{S}^{d-1}} k(x, x')\, \phi_{\ell, j}(x)\, \mathrm{d}x.$
Combining equations (2) and (4) gives the following expression for the coefficients we are interested in: $a_{\ell, j} = \hat{k}_{\ell, j}(x') / \phi_{\ell, j}(x')$. Although this expression involves a $(d-1)$-dimensional integral on the hypersphere, our hypothesis that $k$ is zonal means we can make use of the Funk–Hecke formula (see LABEL:appendix:theorem:funk in the supplementary) and rewrite it as a simpler 1D integral over the shape function of $k$. Following this procedure finally leads to
(5) $\quad a_{\ell, j} = \frac{\omega_d}{C_\ell^{(\alpha)}(1)} \int_{-1}^{1} s(t)\, C_\ell^{(\alpha)}(t)\, (1 - t^2)^{(d-3)/2}\, \mathrm{d}t,$
where $C_\ell^{(\alpha)}$ is the Gegenbauer polynomial of degree $\ell$ and $s$ is the kernel's shape function. The constants $\omega_d$ and $\alpha$ are given in LABEL:appendix:theorem:funk in the supplementary.
Using eq. 5 we are able to compute the Fourier coefficients of any zonal kernel. The details of the calculations for the arc-cosine kernel restricted to the sphere are given in LABEL:sec:appendix:arccosine in the supplementary material.
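The 1D integral in eq. (5) is easy to evaluate numerically; a sketch using a Gegenbauer recurrence and the trapezoid rule, computed up to the dimension-dependent constant $\omega_d$ (the shape function used in the example is a hypothetical smooth zonal kernel, not one from the paper):

```python
import numpy as np

def gegenbauer(ell, alpha, t):
    """Gegenbauer polynomial C_ell^(alpha)(t) via the three-term recurrence."""
    c_prev, c_curr = np.ones_like(t), 2 * alpha * t
    if ell == 0:
        return c_prev
    for n in range(2, ell + 1):
        c_prev, c_curr = c_curr, (2 * (n - 1 + alpha) * t * c_curr
                                  - (n - 2 + 2 * alpha) * c_prev) / n
    return c_curr

def funk_hecke_coefficient(shape_fn, ell, d, n_grid=20001):
    """Fourier coefficient a_ell of a zonal kernel on S^(d-1), as in eq. (5),
    up to the constant omega_d. shape_fn maps t = x^T x' in [-1, 1] to k."""
    alpha = (d - 2) / 2.0
    t = np.linspace(-1.0, 1.0, n_grid)
    f = shape_fn(t) * gegenbauer(ell, alpha, t) * (1 - t**2) ** ((d - 3) / 2.0)
    integral = np.sum((f[:-1] + f[1:]) * np.diff(t)) / 2.0   # trapezoid rule
    return integral / gegenbauer(ell, alpha, np.array(1.0))

# Example on S^2 (d = 3): energy concentrates on low degrees, so the
# coefficients decay with ell.
coeffs = [funk_hecke_coefficient(lambda t: np.exp(t), ell, d=3) for ell in range(4)]
```

For $d = 3$ the Gegenbauer polynomials with $\alpha = 1/2$ reduce to the Legendre polynomials, which gives a convenient way to check the recurrence.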
Alternatively, a key result from Solin & Särkkä (2014, eq. 20) shows that the coefficients have a simple expression that depends on the kernel spectral density $S(\cdot)$ and the eigenvalues of the Laplace–Beltrami operator. For GPs on $\mathbb{S}^{d-1}$, the coefficients boil down to $a_{\ell, j} = S\big(\sqrt{\ell (\ell + d - 2)}\big)$. This is the expression we used in our experiments to define Matérn and SE covariances on the hypersphere. More details on this method are given in LABEL:appendix:sec:materns in the supplementary.
As one may have noticed, the values of $a_{\ell, j}$ do not depend on the second index $j$ (i.e. the eigenvalues only depend on the degree of the spherical harmonic, but not on its orientation). This is a remarkable property of zonal kernels which allows us to use the addition theorem (see supplementary material LABEL:appendix:theorem:addition) for spherical harmonics to simplify eq. 2:
$k(x, x') = \sum_{\ell=0}^{\infty} a_\ell\, \frac{\ell + \alpha}{\alpha}\, C_\ell^{(\alpha)}(x^\top x').$
This representation is cheaper to evaluate than eq. 2, but it still requires truncation for practical use.
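Putting the two results together, a truncated kernel evaluation on the sphere can be sketched as follows (the spectral-density expression is an assumed, unnormalised Matérn-like form, since the normalisation only rescales the kernel variance):

```python
import numpy as np

def spectral_density(omega, nu=1.5, lengthscale=1.0):
    """Unnormalised Matern-like spectral density S(omega) (assumed form)."""
    return (2 * nu / lengthscale**2 + omega**2) ** -(nu + 1.0)

def zonal_kernel(t, d=3, num_levels=30):
    """Truncated addition-theorem series for a zonal kernel on S^(d-1):
    k(t) = sum_ell a_ell * (ell + alpha)/alpha * C_ell^(alpha)(t),
    with a_ell = S(sqrt(ell * (ell + d - 2))) and t = x^T x'."""
    alpha = (d - 2) / 2.0
    C_prev, C_curr = np.ones_like(t), 2 * alpha * t   # C_0 and C_1
    total = np.zeros_like(t)
    for ell in range(num_levels):
        if ell == 0:
            C = C_prev
        elif ell == 1:
            C = C_curr
        else:  # three-term Gegenbauer recurrence
            C_prev, C_curr = C_curr, (2 * (ell - 1 + alpha) * t * C_curr
                                      - (ell - 2 + 2 * alpha) * C_prev) / ell
            C = C_curr
        a_ell = spectral_density(np.sqrt(ell * (ell + d - 2)))
        total += a_ell * (ell + alpha) / alpha * C
    return total
```

Because the coefficients $a_\ell$ decay polynomially for Matérn-like densities, the truncation error becomes negligible after a few tens of levels.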
3.3 Spherical Harmonics as Inducing Features
We can now build on the results from the previous section to propose powerful and efficient sparse GP models. We want features that exhibit non-local influence for expressiveness, and inducing variables that induce sparse structure in $K_{uu}$ for efficiency. We achieve this by defining the inducing variables to be the RKHS inner product between the GP and the spherical harmonics (although $f$ does not belong to $\mathcal{H}$ (Kanagawa et al., 2018), this expression is well defined since the regularity of the spherical harmonics can be used to extend the domain of definition of the first argument of the inner product to a larger class of functions; see Hensman et al. (2017) for a detailed discussion):
(6) $\quad u_m = \langle f, \phi_m \rangle_{\mathcal{H}}.$
Note that in the context of inducing variables, we switch to a single integer index $m$ for the spherical harmonics and order them first by increasing level $\ell$, and then by increasing $j$ within a level.
To leverage these new inducing variables we need to compute two quantities: 1) the covariance between $u_m$ and $f(x)$ for $x \in \mathbb{S}^{d-1}$, and 2) the covariance between the inducing variables themselves for the $K_{uu}$ matrix. See van der Wilk et al. (2020) for an in-depth discussion of these concepts.
The covariance of the inducing variables with $f$ is
$\mathrm{cov}(u_m, f(x)) = \mathbb{E}\big[\langle f, \phi_m \rangle_{\mathcal{H}}\, f(x)\big] = \langle k(x, \cdot), \phi_m \rangle_{\mathcal{H}} = \phi_m(x),$
where we relied on the linearity of both the expectation and the inner product, and used the reproducing property of $k$.
The covariance between the inducing variables is given by
$\mathrm{cov}(u_m, u_{m'}) = \langle \phi_m, \phi_{m'} \rangle_{\mathcal{H}} = \delta_{m m'} / a_m,$
where $\delta_{m m'}$ is the Kronecker delta. Crucially, this means that $K_{uu}$ is a diagonal matrix with elements $1 / a_m$.
Substituting $k_u(x)$ and $K_{uu}$ into the sparse variational approximation (eq. 1) leads to the following form for $q(f)$:
$\mu(x) = \phi(x)^\top K_{uu}^{-1} m, \qquad v(x, x') = k(x, x') + \phi(x)^\top K_{uu}^{-1} (S - K_{uu}) K_{uu}^{-1} \phi(x'),$
with $[\phi(x)]_m = \phi_m(x)$ and $K_{uu} = \mathrm{diag}(1/a_1, \dots, 1/a_M)$.
This sparse approximation has two main differences compared to an SVGP model with standard inducing points. First, the spherical harmonic inducing variables lead to features with non-local structure. Second, the approximation does not require any matrix inverses any more. The computational bottleneck of this model is now simply the matrix multiplication in the variance calculation, which has an $\mathcal{O}(NM^2)$ cost. Compared to the $\mathcal{O}(NM^2 + M^3)$ cost of inducing-point SVGPs, this gives a significant speed-up, as we show in the experiments.
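Because $K_{uu}$ is diagonal, its "inverse" is an elementwise rescaling; a minimal sketch of the resulting predictive computation (the helper name and array layout are ours):

```python
import numpy as np

def vish_predict(phi, a, m, S, prior_var):
    """Predictive mean/variance of eq. (1) when Kuu = diag(1 / a_m) and
    [k_u(x)]_m = phi_m(x): no matrix inverse is required.
    phi: (T, M) harmonics at the test points; a: (M,) Mercer coefficients;
    m: (M,) and S: (M, M) variational parameters; prior_var: (T,)."""
    A = phi * a                 # phi(x)^T Kuu^{-1}, since Kuu^{-1} = diag(a)
    mean = A @ m
    Kuu = np.diag(1.0 / a)
    var = prior_var + np.einsum('tm,mn,tn->t', A, S - Kuu, A)
    return mean, var
```

As with the generic SVGP predictive, setting $q(u)$ to the prior ($m = 0$, $S = K_{uu}$) recovers the prior variance, which makes for an easy unit test.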
As is usual in sparse GP methods, the number of inducing variables $M$ is constrained by the computational budget available to the user. Given that we ordered the spherical harmonics by increasing level $\ell$, choosing the first $M$ elements means we select the features with the lowest angular frequencies first. Provided that the kernel spectral density is a decreasing function (this will be true for classic covariances, but not for quasi-periodic ones), the selected features correspond to the ones carrying the most signal according to the prior. In other words, the decomposition of the kernel can be compared to an infinite-dimensional principal component analysis, and our choice of inducing functions is optimal since we pick the ones with the largest variance. This is illustrated in fig. 4, which shows the analogue of fig. 3 for spherical harmonic inducing functions.
4 Experiments
We evaluate our method, Variational Inference with Spherical Harmonics (VISH), on regression and classification problems and show the following properties: 1) VISH performs competitively in terms of accuracy and uncertainty quantification on a range of problems from the UCI dataset repository. 2) VISH is extremely fast and accurate on large-scale conjugate problems (approximately 6 million 8D entries in less than 2 minutes). 3) Compared to VFF, VISH can be applied to multidimensional datasets while preserving its computational efficiency. 4) On problems with a non-conjugate likelihood our method does not suffer from some of the issues encountered by VFF.
We begin with a toy experiment in which we show that the approximation becomes more accurate as we increase the number of basis functions.
4.1 Toy Experiment: Banana Classification
The banana dataset is a 2D binary classification problem (Hensman et al., 2015). In fig. 5 we show three fits of VISH corresponding to increasing maximum levels $\ell$ for our inducing functions. Since the variational framework guarantees that more inducing variables must be monotonically better (Titsias, 2009), we expect that increasing the number of inducing functions will provide improved approximations. This is indeed the case, as we show in the rightmost panel: with increasing $M$ the ELBO converges and the fit becomes tighter.
While this is expected behaviour for SVGP methods, it is not guaranteed by VFF. Given the Kronecker structure used by VFF for this 2D experiment, Hensman et al. (2017) report that using a full-rank covariance matrix for the variational distribution was intolerably slow. They also show that enforcing the Kronecker structure on the posterior results in an ELBO that decreases as frequencies are added, and they finally propose a sum-of-two-Kroneckers structure, but provide no guarantee that this would converge to the exact posterior in the limit of larger $M$. In VISH we do not need to impose any structure on the approximate covariance matrix $S$, so we retain the guarantee that adding more basis functions will move us closer to the posterior process. The method remains fast despite optimising over full covariance matrices: fitting the models displayed in fig. 5 only takes a few seconds on a standard desktop.
4.2 Regression on UCI Benchmarks
MSE  NLPD  

Dataset  N  D  M  VISH  AVFF  SVGP  AGPR  GPR  VISH  AVFF  SVGP  AGPR  GPR 
Yacht  
Energy  
Concrete  
Kin8mn  
Power 
Table 1. Predictive mean squared errors (MSEs) and negative log predictive densities (NLPDs) with one standard deviation based on 5 splits on 5 UCI regression datasets. Lower is better. All models assume a Gaussian noise model, use a Matérn-3/2 kernel and use the L-BFGS optimiser for the hyper- and variational parameters. VISH and SVGP are configured with the same number of inducing points $M$. AVFF and AGPR assume an additive structure over the inputs (see text). For AVFF and VISH the optimal posterior distribution for the inducing variables is set following Titsias (2009).
We use five UCI regression datasets to compare the performance of our method against other GP approaches. We measure the accuracy of the predictive mean with the Mean Squared Error (MSE) and uncertainty quantification with the mean Negative Log Predictive Density (NLPD). For each dataset we randomly select 90% of the data for training and 10% for testing, and repeat this 5 times to get error bars. We normalise the inputs and the targets to be centered unit Gaussians, and report the MSE and NLPD of the normalised data. In Table 1 we report the performance of VISH, Additive-VFF (AVFF) (Hensman et al., 2017), SVGP (Hensman et al., 2013), Additive-GPR (AGPR) and GPR.
We start by comparing VISH against SVGP, and notice that for Energy, Concrete and Power the change in inductive bias and the expressiveness of spherical harmonic inducing variables, as opposed to standard inducing points, improves performance. Also, while the sparse methods (AVFF, VISH and SVGP) are necessary for scalability (as highlighted by the next experiment), they remain inferior to the exact GPR model, which should be seen as the optimal baseline.
For VFF we have to resort to an additive model (AVFF) in order to deal with the dimensionality of the data, as a vanilla VFF model can only be used in one or two dimensions. Following (Hensman et al., 2017, eq. 78), we assume a different function for each input dimension,
(7) $\quad f(x) = \sum_{i=1}^{d} f_i(x^{(i)}), \qquad f_i \sim \mathcal{GP}(0, k_i),$
and approximate this process by a mean-field approximate posterior over the processes $f_i$, where each process is approximated as an SVGP. We used the same number of frequencies for each input dimension. As an extra baseline we added an exact GPR model which makes the same additive assumption (AGPR). As expected, not having to impose this structure improves performance; we see that VISH beats AVFF on every dataset.
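The additive prior of eq. (7) corresponds to summing independent one-dimensional kernels; a minimal sketch (we use an SE kernel per dimension for brevity, whereas the AVFF/AGPR baselines use Matérn-3/2 terms):

```python
import numpy as np

def additive_kernel(X, X2, lengthscales):
    """Additive prior covariance of eq. (7): a sum of independent
    one-dimensional kernels, one per input dimension."""
    K = np.zeros((len(X), len(X2)))
    for i, ls in enumerate(lengthscales):
        diff = X[:, i][:, None] - X2[:, i][None, :]
        K += np.exp(-0.5 * (diff / ls) ** 2)   # unit-variance 1-D SE term
    return K
```

Note how the diagonal of the resulting matrix equals the number of dimensions: the prior variances of the independent components simply add up.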
Limitations
Our current implementation of VISH only supports datasets of up to 8 dimensions (9 dimensions when the bias is concatenated). This is not caused by a theoretical limitation, because our approach leads to a diagonal $K_{uu}$ matrix in any dimension. The problem stems from the fact that there are no libraries available providing stable spherical harmonic implementations in high dimensions. Our implementation, based on Dai & Xu (2013, Theorem 5.1), is stable up to 9 dimensions and future work will focus on scaling this up. Furthermore, VISH does not solve the curse of dimensionality for GP models, but it does drastically improve over the scaling of VFF.
4.3 LargeScale Regression on Airline Delay
Method  M  MSE  NLPD  Time  MSE  NLPD  MSE  NLPD  MSE  NLPD  Time 

VISH  210  
VISH  660  
AVFF  30/dim.  
SVGP  500 
This experiment illustrates three core capabilities of VISH: 1) it can deal with large datasets, 2) it is computationally efficient, and 3) it improves performance in terms of NLPD.
We use the 2008 U.S. airline delay dataset to assess these capabilities. The goal of this problem is to predict the amount of delay given eight characteristics of a flight, such as the age of the aircraft (number of years since deployment), route distance, airtime, etc. We follow the exact same experimental setup as Hensman et al. (2017) (code at https://github.com/jameshensman/VFF) and evaluate the performance on 4 datasets of size 10,000, 100,000, 1,000,000 and 5,929,413 (the complete dataset), created by subsampling the original one. For each dataset we use two thirds of the data for training and one third for testing. Every split is repeated 10 times and we report the mean and one standard deviation of the MSE and NLPD. For every run the outputs are normalised to be a centered unit Gaussian. The inputs are normalised to [0, 1] for VFF and SVGP. For VISH we normalise the inputs so that each column falls within a fixed interval. The bias hyperparameter corresponds to the prior variance of the weights of an infinite-width fully-connected neural network layer (see Cho & Saul (2009)). We can optimise this weight variance by backpropagation of the ELBO, similarly to the lengthscale hyperparameters of stationary kernels.
Table 2 shows the outcome of the experiment. The results for VFF and SVGP are from Hensman et al. (2017). We observe that VISH improves on the other methods in terms of NLPD and is within error bars in terms of MSE. Given the variability in the data, the GP models improve when more data is available during training.
Given the dimensionality of the dataset, a full VFF model is completely infeasible: using just four frequencies per dimension would already lead to $4^8 = 65{,}536$ inducing variables. VFF therefore has to resort to an additive model with a prior covariance structure given by a sum of Matérn-3/2 kernels over the input dimensions, as in eq. 7. Each of the functions $f_i$ is approximated using 30 frequencies. We report two variants of VISH: one using all spherical harmonics up to degree 3 ($M = 210$) and another up to degree 4 ($M = 660$). As expected, the more inducing variables, the better the fit.
We also report the wall-clock time for the experiments (training and evaluation). All experiments were run on a single consumer-grade GPU (Nvidia GTX 1070). On the complete dataset of almost 6 million records, VISH took less than two minutes on average, AVFF came close, and the SVGP method needed approximately 15 minutes to fit and predict. This shows that VISH is roughly two orders of magnitude faster than SVGP, while AVFF has to impose an additive structure to keep its computational advantage.
4.4 SUSY Classification
In the last experiment we tackle a large-scale classification problem. We are tasked with distinguishing between a signal process which produces supersymmetric (SUSY) particles and a background process which does not. The inputs consist of eight kinematic properties measured by the particle detectors in the accelerator. The dataset contains 5 million records, of which we use the last 10% for testing. We are interested in obtaining a calibrated classifier and measure the AUC of the ROC curve.
For SVGP and VISH we first used a subset of 20,000 points to train the variational parameters and hyperparameters of the model with L-BFGS. We then applied Adam to the whole dataset. A similar approach was used to fine-tune the NN baselines by Baldi et al. (2014).
Table 3 lists the performance of VISH and compares it to a boosted decision tree (BDT), a 5-layer neural network (NN), and an SVGP. We observe the competitive performance of VISH, a single-layer GP method, compared to a 5-layer neural net with 300 hidden units per layer and extensive hyperparameter optimisation (Baldi et al., 2014). We also note the improvement over an SVGP with SE kernel.
Method  AUC

BDT  
NN  
SVGP (SE)  
VISH 
5 Conclusion
We introduced a framework for performing variational inference in Gaussian processes using spherical harmonics. Our general setup is closely related to VFF, and we inherit several of its advantages, such as a considerable speed-up compared to classic sparse GP models and features with a global influence on the approximation. By projecting the data onto the hypersphere and using dedicated GP models on this manifold, our approach succeeds where other sparse GP methods fail. First, VISH provides good scaling properties as the dimension of the input space increases. Second, we showed that under some relatively weak hypotheses we are able to select the optimal features to include in the approximation. This is due to the intricate link between "stationary" covariances on the sphere and the Laplace–Beltrami operator. Third, the Mercer representation of the kernel means that the matrices to be inverted at training time are exactly diagonal, resulting in a very cheap-to-compute sparse approximate GP.
We showed on a wide range of regression and classification problems that our method performs at or close to the state of the art while being extremely fast. The good predictive performance may be inherited from the connection between infinitely wide neural networks and the way we map the predictions on the sphere back to the original space. Future work will explore this hypothesis.
Acknowledgements
Thanks to Fergus Simpson for pointing us in the direction of spherical harmonics in the early days of this work. Also many thanks to Arno Solin for sharing his implementation of the Matérn kernels’ power spectrum.
References

Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
Cho, Y. and Saul, L. K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.
Dai, F. and Xu, Y. Approximation Theory and Harmonic Analysis on Spheres and Balls. Springer, 2013.
Frigola, R., Chen, Y., and Rasmussen, C. E. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pp. 3680–3688, 2014.
Hensman, J. and Lawrence, N. D. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
Hensman, J., Matthews, A. G. d. G., and Ghahramani, Z. Scalable variational Gaussian process classification. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2015.
Hensman, J., Durrande, N., and Solin, A. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–52, 2017.
Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
Lázaro-Gredilla, M. and Figueiras-Vidal, A. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, 2009.
Peacock, J. A. Cosmological Physics. Cambridge University Press, 1999.
Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.
Salimbeni, H. and Deisenroth, M. P. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, 2017.
Shi, J., Titsias, M., and Mnih, A. Sparse orthogonal variational inference for Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, pp. 1932–1942, 2020.
Solin, A. and Särkkä, S. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, pp. 1–28, 2014.
Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2009.
Titsias, M. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In Proceedings of the International Conference on Artificial Intelligence and Statistics, 2010.
van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., and Hensman, J. A framework for interdomain and multioutput Gaussian processes. arXiv preprint arXiv:2003.01115, 2020. URL https://arxiv.org/abs/2003.01115.
Wendland, H. Scattered Data Approximation. Cambridge University Press, 2005.