Sparse Gaussian Processes with Spherical Harmonic Features

06/30/2020
by   Vincent Dutordoir, et al.

We introduce a new class of inter-domain variational Gaussian processes (GP) where data is mapped onto the unit hypersphere in order to use spherical harmonic representations. Our inference scheme is comparable to variational Fourier features, but it does not suffer from the curse of dimensionality, and leads to diagonal covariance matrices between inducing variables. This enables a speed-up in inference, because it bypasses the need to invert large covariance matrices. Our experiments show that our model is able to fit a regression model for a dataset with 6 million entries two orders of magnitude faster compared to standard sparse GPs, while retaining state of the art accuracy. We also demonstrate competitive performance on classification with non-conjugate likelihoods.


1 Introduction

Gaussian processes (GPs) (Rasmussen & Williams, 2006) provide a flexible framework for modelling unknown functions: they are robust to overfitting, offer good predictive uncertainty estimates and allow us to incorporate prior assumptions into the model. Given a dataset with inputs $\{x_n\}_{n=1}^N$ and outputs $\{y_n\}_{n=1}^N$, a GP regression model assumes $y_n = f(x_n) + \varepsilon_n$, where $f$ is a GP and the $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$ are normal random variables accounting for observation noise. The model predictions at a new input $x^*$ are then given by the posterior distribution $p(f(x^*) \mid \mathbf{y})$. However, computing the posterior distribution usually scales as $\mathcal{O}(N^3)$, because it requires solving a system involving an $N \times N$ matrix.

A range of sparse GP methods (see Quiñonero-Candela & Rasmussen (2005) for an overview) have been developed to improve on this scaling. Among them, variational inference is a practical approach allowing regression (Titsias, 2009), classification (Hensman et al., 2015), minibatching (Hensman et al., 2013) and structured models including latent variables, time-series and depth (Titsias & Lawrence, 2010; Frigola et al., 2014; Hensman & Lawrence, 2014; Salimbeni & Deisenroth, 2017). Variational inference in GPs works by approximating the exact (but intractable) posterior $p(f \mid \mathbf{y})$. This approximate posterior process is constructed from a conditional distribution based on $M$ pseudo observations $\mathbf{u}$ of $f$ at locations $Z = \{z_m\}_{m=1}^M$. The approximation is optimised by minimising the Kullback–Leibler divergence between the approximate posterior and the exact one. The resulting complexity is $\mathcal{O}(NM^2 + M^3)$, so choosing $M \ll N$ enables significant speed-ups compared to vanilla GP models. However, when using common stationary kernels such as the Matérn, Squared Exponential (SE) or rational quadratic, the influence of pseudo-observations is only local and limited to the neighbourhoods of the inducing points $z_m$, so a large number of inducing points may be required to cover the input space. This is especially problematic for higher-dimensional data as a result of the curse of dimensionality.

Figure 1: Illustration of the mapping between a 1D dataset (grey dots) embedded into a 2D space and its projection (orange dots) onto the unit half-circle using a linear mapping.

Variational Fourier Features (VFF; Hensman et al., 2017) have been proposed to overcome this limitation. In VFF the inducing variables based on pseudo observations are replaced with inducing variables obtained by projecting the GP onto the Fourier basis. This results in inducing variables that have a global influence on the predictions. For one-dimensional inputs, this construction leads to covariance matrices between inducing variables that are almost diagonal, and this structure can be exploited to reduce the computational cost. However, for $d$-dimensional input spaces VFF requires the construction of a new basis given by the outer product of one-dimensional Fourier bases. This implies that the number of inducing variables grows exponentially with the dimensionality, which limits the use of VFF to just one or two dimensions. Furthermore, VFF is restricted to kernels of the Matérn family.

In this work we improve upon VFF in multiple directions. Rather than using a sine and cosine basis, we use a basis of spherical harmonics to define a new interdomain sparse approximation. As we will show, spherical harmonics are the eigenfunctions of stationary kernels on the hypersphere, which allows us to exploit the Mercer representation of the kernel for defining the inducing variables. In arbitrary dimensions, our method leads to diagonal covariance matrices between inducing variables, which makes it faster than VFF as we fully bypass the need to compute expensive matrix inverses. Compared to both sparse GPs and VFF, our approximation scheme suffers less from the curse of dimensionality. As with VFF, each spherical harmonic inducing function has a global influence, but in addition there is a natural ordering of the spherical harmonics that guarantees that the best features are picked for a given overall budget of inducing variables. Moreover, our method works for any stationary kernel on the sphere.

Following the illustration in fig. 1, we outline the different steps of our method. We start by concatenating the data with a constant input (the bias) and project it linearly onto the unit hypersphere $\mathbb{S}^{d-1}$. We then learn a sparse GP on the sphere based on the projected data. We can do this extremely efficiently by making use of our spherical harmonic inducing variables, shown in fig. 2. Finally, the linear mapping between the sphere and the sub-space containing the data can then be used to map the predictions of the GP on the sphere back to the original space.
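
To make this mapping concrete, here is a minimal NumPy sketch of the bias augmentation and the radial projection onto the unit hypersphere. The helper name `map_to_sphere` and the default bias value are illustrative choices, not taken from the paper's implementation.

```python
import numpy as np

def map_to_sphere(X, bias=1.0):
    """Append a constant bias column to X and project each augmented point
    radially onto the unit hypersphere S^{d-1}, with d = X.shape[1] + 1.

    Returns the projected points and the radii needed to map back."""
    X_aug = np.concatenate([X, np.full((X.shape[0], 1), bias)], axis=1)  # (N, d)
    radii = np.linalg.norm(X_aug, axis=1, keepdims=True)                 # (N, 1)
    return X_aug / radii, radii

# Example: a 1D dataset as in fig. 1 becomes points on the unit half-circle.
X = np.linspace(-3.0, 3.0, 7)[:, None]
X_sphere, radii = map_to_sphere(X)
print(np.linalg.norm(X_sphere, axis=1))  # all ones
```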

This paper is organised as follows. In section 2 we give the necessary background on sparse GPs and VFF. In section 3 we highlight that every stage of the proposed method can be elegantly justified. For example, the linear mapping between the data and the sphere is a property of some specific covariance functions such as the arc-cosine kernel, from which we can expect good generalisation properties (section 3.1). Another example is that the spherical harmonics are coupled with the structure of the Reproducing Kernel Hilbert Space (RKHS) norm for stationary kernels on the sphere, which makes them very natural candidates for basis functions for sparse GPs (section 3.2). In section 3.3 we define our spherical harmonic inducing variables and compute the necessary components to do efficient GP modelling. Section 4 is dedicated to the experimental evaluation.

Figure 2: The first four levels of spherical harmonic functions in $\mathbb{R}^3$. The domain of the spherical harmonics is the surface of the unit sphere $\mathbb{S}^2$. The function value is given by the colour and radius.

2 Background

2.1 GPs and Sparse Variational Inference

GPs are stochastic processes such that the distribution of any finite-dimensional marginal is multivariate normal. The distribution of a GP is fully determined by its mean $\mu(\cdot)$ and covariance function (kernel) $k(\cdot, \cdot)$. GP models typically consist of combining a latent (unobserved) GP $f$ with a likelihood that factorises over observation points: $p(\mathbf{y} \mid f) = \prod_n p(y_n \mid f(x_n))$. When the observation model is Gaussian, the posterior distributions of $f$ and of the latent function values given some data are still Gaussian and can be computed analytically. Let $\varepsilon_n \sim \mathcal{N}(0, \sigma^2)$ represent the observation noise at $x_n$; then the GP posterior distribution is $f \mid \mathbf{y} \sim \mathcal{GP}(\mu_{f|\mathbf{y}}, k_{f|\mathbf{y}})$ with

$\mu_{f|\mathbf{y}}(x) = \mathbf{k}_x^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{y}, \qquad k_{f|\mathbf{y}}(x, x') = k(x, x') - \mathbf{k}_x^\top (\mathbf{K} + \sigma^2 \mathbf{I})^{-1} \mathbf{k}_{x'},$

where $[\mathbf{k}_x]_n = k(x, x_n)$ and $[\mathbf{K}]_{nn'} = k(x_n, x_{n'})$.

Computing this exact posterior requires inverting the $N \times N$ matrix $\mathbf{K} + \sigma^2 \mathbf{I}$, which has an $\mathcal{O}(N^3)$ computational complexity and an $\mathcal{O}(N^2)$ memory footprint. Given typical current hardware specifications, this limits the dataset size to the order of a few thousand observations. Furthermore, there is no known analytical expression for the posterior distribution when the likelihood is not conjugate, as encountered in classification for instance.

Sparse GPs combined with variational inference provide an elegant way to address these two shortcomings (Titsias, 2009; Hensman et al., 2013, 2015). The idea is to introduce a distribution $q(f)$ that depends on some parameters, and to find the values of these parameters such that $q(f)$ gives the best possible approximation of the exact posterior $p(f \mid \mathbf{y})$. Sparse GPs introduce $M$ pseudo inputs $Z = \{z_m\}_{m=1}^M$ corresponding to inducing variables $\mathbf{u} = f(Z)$, and choose to write the approximating distribution as $q(f) = p(f \mid \mathbf{u})\, q(\mathbf{u})$ with $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$. This results in a distribution that is parametrised by the variational parameters $Z$, $\mathbf{m}$ and $\mathbf{S}$, which are learned by minimising the Kullback–Leibler (KL) divergence $\mathrm{KL}[\,q(f)\,\|\,p(f \mid \mathbf{y})\,]$.

At prediction time, the conjugacy between $q(\mathbf{u})$ and the conditional $p(f \mid \mathbf{u})$ implies that $q(f)$, where $\mathbf{u}$ is marginalised out, is a GP with a mean and a covariance function that can be computed in closed form:

$\mu_q(x) = \mathbf{k}_u(x)^\top \mathbf{K}_{uu}^{-1} \mathbf{m}, \qquad k_q(x, x') = k(x, x') + \mathbf{k}_u(x)^\top \mathbf{K}_{uu}^{-1} (\mathbf{S} - \mathbf{K}_{uu}) \mathbf{K}_{uu}^{-1} \mathbf{k}_u(x'),$     (1)

where $[\mathbf{k}_u(x)]_m = \mathrm{cov}(u_m, f(x))$, $[\mathbf{K}_{uu}]_{mm'} = \mathrm{cov}(u_m, u_{m'})$ and $q(\mathbf{u}) = \mathcal{N}(\mathbf{m}, \mathbf{S})$.

Sparse GPs result in $\mathcal{O}(NM^2 + M^3)$ computational complexity at training (minimising the KL) and $\mathcal{O}(M^2)$ per test point at prediction. Picking $M \ll N$ can thus result in drastic speed-ups, but the lower $M$ is, the less accurate the approximation, as recently shown by Shi et al. (2020). Typical kernels, such as the Matérn or Squared Exponential (SE), depend on a lengthscale parameter that controls how quickly the correlation between two evaluations of $f$ drops as the two inputs move away from one another. For short lengthscales, this correlation drops quickly and two observations can be almost independent even for input points that are close in the input space. For a sparse GP model, this implies that $\mu_q(x)$ and $k_q(x, x)$ will rapidly revert to the prior mean and variance when $x$ is not in the immediate neighbourhood of an inducing point $z_m$. A similar effect can be observed when the input space is high-dimensional: because inducing variables only have a local influence, the number of inducing points required to cover the space grows exponentially with the input-space dimensionality. In practice, it may thus be necessary to pick large values of $M$ to obtain accurate approximate posteriors, but this defeats the original intent of sparse methods.

This behaviour, where the feature vector $\mathbf{k}_u(x)$ of the approximate distribution is given by evaluations of the kernel function, can be addressed using interdomain inducing variables.

2.2 Inter-domain GPs and Variational Fourier Features

Inter-domain Gaussian processes (see Lázaro-Gredilla & Figueiras-Vidal (2009), and van der Wilk et al. (2020) for a modern exposition) use alternative forms of inducing variables such that the resulting sparse GP models have more informative features. In interdomain GPs the inducing variables are obtained by integrating the GP against an inducing function $\psi_m$:

$u_m = \int f(x)\, \psi_m(x)\, \mathrm{d}x.$

This redefinition of $\mathbf{u}$ implies that the expressions of $\mathbf{k}_u(\cdot)$ and $\mathbf{K}_{uu}$ change, but the inference scheme of interdomain GPs and the mathematical expressions for the posterior mean and variance are exactly the same as for classic sparse GPs. Depending on the choice of $\psi_m$, interdomain GPs can result in various feature vectors $\mathbf{k}_u(\cdot)$. These feature vectors can alleviate the classic sparse GP limitation of inducing variables having only a local influence.

VFF (Hensman et al., 2017) is an interdomain method where the inducing variables are given by a Matérn RKHS inner product between the GP and elements of the Fourier basis:

$u_m = \langle f, \psi_m \rangle_{\mathcal{H}},$

where the $\psi_m$ are the elements $\{1, \cos(\omega_1 x), \sin(\omega_1 x), \cos(\omega_2 x), \sin(\omega_2 x), \dots\}$ of the Fourier basis with harmonic frequencies $\omega_i$ when the input space is an interval $[a, b]$. This leads to

$\mathrm{cov}(u_m, f(x)) = \langle \psi_m, k(x, \cdot) \rangle_{\mathcal{H}} = \psi_m(x), \qquad \mathrm{cov}(u_m, u_{m'}) = \langle \psi_m, \psi_{m'} \rangle_{\mathcal{H}}.$

This results in several advantages. First, the features $\mathbf{k}_u(x)$ are exactly the elements of the Fourier basis, which are independent of the kernel parameters and can be precomputed. Second, the matrix $\mathbf{K}_{uu}$ is the sum of a diagonal matrix plus low-rank matrices. This structure can be used to drastically reduce the computational complexity, and the experiments showed one to two orders of magnitude speed-ups compared to classic sparse GPs. Finally, the variance of the inducing variables typically decays quickly with increasing frequencies, which means that by selecting the first elements of the Fourier basis we pick the features that carry the most signal.
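
To illustrate why a "diagonal plus low-rank" $\mathbf{K}_{uu}$ helps, the sketch below solves such a system with the Woodbury identity instead of a dense inverse. This is a generic illustration of the structure VFF exploits, not the VFF implementation itself.

```python
import numpy as np

def solve_diag_plus_lowrank(d, U, b):
    """Solve (diag(d) + U @ U.T) x = b via the Woodbury identity.

    d: (M,) positive diagonal, U: (M, r) with r << M, b: (M,).
    Cost is O(M r^2 + r^3) rather than the O(M^3) of a dense solve."""
    Dinv_b = b / d
    Dinv_U = U / d[:, None]
    capacitance = np.eye(U.shape[1]) + U.T @ Dinv_U          # small (r, r) system
    return Dinv_b - Dinv_U @ np.linalg.solve(capacitance, U.T @ Dinv_b)

# Check against a dense solve on a random positive-definite instance.
rng = np.random.default_rng(0)
M, r = 500, 3
d, U, b = rng.uniform(1.0, 2.0, M), rng.normal(size=(M, r)), rng.normal(size=M)
x = solve_diag_plus_lowrank(d, U, b)
assert np.allclose((np.diag(d) + U @ U.T) @ x, b)
```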

Figure 3: Illustration of the variance of the inducing variables when using VFF with a two-dimensional input space. Each pair $(i, j)$ corresponds to an inducing function $\psi_i(x_1)\,\psi_j(x_2)$, and the area of each square is proportional to the variance of the associated inducing variable. For a given number of inducing variables, as highlighted with the red dashed line, VFF selects features that do not carry signal (north-east quarter of the red dashed square) whereas it ignores features that are expected to carry signal (along the axes).

The main flaw of VFF comes from the way it generalises to multidimensional input spaces. The approach in Hensman et al. (2017) for getting a set of $d$-dimensional inducing functions consists of taking the outer product of univariate bases and considering separable kernels, so that the elements of $\mathbf{K}_{uu}$ are given by the product of the inner products in each dimension. For example, in dimension 2, a set of inducing functions is given by $\psi_{i,j}(x_1, x_2) = \psi_i(x_1)\,\psi_j(x_2)$, and the entries of $\mathbf{K}_{uu}$ are $\langle \psi_i, \psi_{i'} \rangle_{\mathcal{H}_1} \langle \psi_j, \psi_{j'} \rangle_{\mathcal{H}_2}$. This construction scales poorly with the dimension: for example, choosing a univariate basis as simple as $\{1, \cos(\omega_1 x), \sin(\omega_1 x)\}$ for an eight-dimensional problem already results in more than 6,500 inducing functions. Additionally, this construction is very inefficient in terms of captured variance, as we illustrate in Figure 3 for a 2D input space. The figure shows that the prior variance associated with the inducing function $\psi_i(x_1)\,\psi_j(x_2)$ vanishes quickly when both $i$ and $j$ increase. This means that most of the inducing functions on which the variational posterior is built are irrelevant, whereas some important ones, such as $\psi_i(x_1)\,\psi_0(x_2)$ or $\psi_0(x_1)\,\psi_j(x_2)$ for large $i$ or $j$, are ignored. Although we used a 2D example to illustrate this poor behaviour, it is important to bear in mind that the issue gets exacerbated for higher-dimensional input spaces. As detailed in fig. 4 and discussed later, our proposed approach does not suffer from such behaviour.

3 Variational Inference with Spherical Harmonics

In this section we describe the three key ingredients of the proposed approach: the mapping of the data to the hypersphere, the definition of GPs on this sphere, and the use of spherical harmonics as inducing functions for such GPs.

3.1 Linear Mapping to the Hypersphere

As illustrated in fig. 1, for a 1D dataset, the first step of our approach involves appending the dataset with a dummy input value that we will refer to as the bias (grey dots in fig. 1). For convenience we will overload the notation and refer to the augmented inputs as $x$ from now on, denoting their dimension by $d$. The next step is to project the augmented data onto the unit hypersphere $\mathbb{S}^{d-1}$ according to a linear mapping,

$x \mapsto x / \lVert x \rVert,$

as depicted by the orange dots. Given the projected data, we learn a GP model on the sphere (orange function), and the model predictions can be mapped back to the hyperplane (blue curve) using the inverse linear mapping (i.e. following the fine grey lines starting from the origin).

Although this setup may seem rather arbitrary, it is inspired by the important works on the limits of neural networks as Gaussian processes, especially the arc-sine kernel (Williams, 1998) and the arc-cosine kernels (Cho & Saul, 2009). An arc-cosine kernel corresponds to the infinitely-wide limit of a single-layer ReLU-activated network with Gaussian weights. Let $\tilde{x} = [x^\top, b]^\top \in \mathbb{R}^d$ be an augmented input vector whose last entry corresponds to the bias. Then the first-order arc-cosine kernel can be written as

$k(\tilde{x}, \tilde{x}') = \frac{\lVert \tilde{x} \rVert\, \lVert \tilde{x}' \rVert}{\pi} \left( \sin\theta + (\pi - \theta)\cos\theta \right),$

where $\theta$ is the geodesic distance (or great-circle distance) between the projections of $\tilde{x}$ and $\tilde{x}'$ on the unit hypersphere. We observe that $k$ is separable in polar coordinates, and the dependence on the radius is just the linear kernel. Since the linear kernel is degenerate, this means that there is a one-to-one mapping between the GP samples restricted to the unit sphere and the samples over $\mathbb{R}^d$, and that this mapping is exactly the linear transformation we introduced previously.
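
For reference, the first-order arc-cosine kernel of Cho & Saul (2009) has a simple closed form; the sketch below evaluates it and makes the separation into a radial (linear-kernel) part and an angular part explicit.

```python
import numpy as np

def arccosine_kernel(X, Y):
    """First-order arc-cosine kernel k(x, y) = ||x|| ||y|| (sin t + (pi - t) cos t) / pi,
    where t is the angle between x and y (Cho & Saul, 2009)."""
    norm_x = np.linalg.norm(X, axis=1)
    norm_y = np.linalg.norm(Y, axis=1)
    cos_t = np.clip((X @ Y.T) / (norm_x[:, None] * norm_y[None, :]), -1.0, 1.0)
    t = np.arccos(cos_t)   # geodesic distance between the projections on the sphere
    angular = (np.sin(t) + (np.pi - t) * np.cos(t)) / np.pi
    return norm_x[:, None] * norm_y[None, :] * angular   # radial part is the linear kernel

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))
print(np.allclose(np.diag(arccosine_kernel(X, X)), np.linalg.norm(X, axis=1) ** 2))  # True
```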

One insight provided by the link between our settings and the arc-cosine kernel is that once mapped back to the original space, our GP samples will exhibit linear asymptotic behaviour, which can be compared to the way ReLU-activated neural networks operate. Although a proper demonstration would require further work, this correspondence suggests that our models may inherit the desirable generalisation properties of neural networks.

Having mapped the data to the sphere, we must now introduce GPs on the sphere and their covariance. The following section provides some theory that allows us to work with various kernels on the sphere, including the arc-cosine kernel as a special case.

3.2 Mercer’s Theorem for Zonal Kernels on the Sphere

Stationary kernels, i.e. translation-invariant covariances, are ubiquitous in machine learning when the input space is Euclidean. When working on the hypersphere, their spherical counterparts are zonal kernels, which are invariant to rotations. More precisely, a kernel $k : \mathbb{S}^{d-1} \times \mathbb{S}^{d-1} \to \mathbb{R}$ is called zonal if there exists a shape function $s : [-1, 1] \to \mathbb{R}$ such that $k(x, x') = s(x^\top x')$. From here on, we focus on zonal kernels. Since stationary kernels are functions of $x - x'$, they have the property that $\Delta_x k(x, x') = \Delta_{x'} k(x, x')$, where $\Delta$ is the Laplacian operator. Such a property is also verified by zonal kernels: $\Delta_{\mathbb{S},x} k(x, x') = \Delta_{\mathbb{S},x'} k(x, x')$, where we denote by $\Delta_{\mathbb{S},x}$ the Laplace–Beltrami operator with respect to the variable $x$. Combined with an integration by parts, this property can be used to show that the kernel operator of a zonal covariance and the Laplace–Beltrami operator commute:

$\Delta_{\mathbb{S}} \int_{\mathbb{S}^{d-1}} k(\cdot, x')\, g(x')\, \mathrm{d}x' = \int_{\mathbb{S}^{d-1}} k(\cdot, x')\, \Delta_{\mathbb{S}}\, g(x')\, \mathrm{d}x',$

which in turn implies that these two operators share the same eigenfunctions. This result is of particular relevance to us since there is a huge body of literature on the diagonalisation of the Laplace–Beltrami operator on $\mathbb{S}^{d-1}$, and it is well known that its eigenfunctions are given by the spherical harmonics. This reasoning can be summarised by the following theorem:

Theorem 1 (Mercer representation).

Any zonal kernel $k$ on the hypersphere $\mathbb{S}^{d-1}$ can be decomposed as

$k(x, x') = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} \hat{a}_\ell\, \phi_{\ell,j}(x)\, \phi_{\ell,j}(x'),$     (2)

where the $\hat{a}_\ell$ are positive coefficients, the $\phi_{\ell,j}$ denote the elements of the spherical harmonic basis in $\mathbb{S}^{d-1}$, and $N_\ell^d$ corresponds to the number of spherical harmonics for a given level $\ell$.

Although it is typically stated without a proof, this theorem is already known in some communities (see Wendland (2005) for a functional analysis exposition, or Peacock (1999) for its use in cosmology).

Given the Mercer representation of a zonal kernel, its RKHS can be characterised as

$\mathcal{H} = \Big\{ g = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} \hat{g}_{\ell,j}\, \phi_{\ell,j} \;:\; \sum_{\ell,j} \hat{g}_{\ell,j}^2 / \hat{a}_\ell < \infty \Big\},$

with the inner product between two functions $g = \sum_{\ell,j} \hat{g}_{\ell,j}\, \phi_{\ell,j}$ and $h = \sum_{\ell,j} \hat{h}_{\ell,j}\, \phi_{\ell,j}$ defined as

$\langle g, h \rangle_{\mathcal{H}} = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} \frac{\hat{g}_{\ell,j}\, \hat{h}_{\ell,j}}{\hat{a}_\ell}.$     (3)

It is straightforward to show that this inner product satisfies the reproducing property (see Appendix LABEL:appendix:proof:reproducing).

In order to make these results practical, we need to compute the coefficients $\hat{a}_\ell$ for some kernels of interest. For a given value of $x'$ we can see $k(\cdot, x')$ as a function from $\mathbb{S}^{d-1}$ to $\mathbb{R}$, and represent it in the basis of spherical harmonics:

$k(x, x') = \sum_{\ell=0}^{\infty} \sum_{j=1}^{N_\ell^d} \hat{k}_{\ell,j}(x')\, \phi_{\ell,j}(x), \quad \text{with} \quad \hat{k}_{\ell,j}(x') = \int_{\mathbb{S}^{d-1}} k(x, x')\, \phi_{\ell,j}(x)\, \mathrm{d}x.$     (4)

Combining equations (2) and (4) gives the following expression for the coefficients we are interested in: $\hat{a}_\ell\, \phi_{\ell,j}(x') = \int_{\mathbb{S}^{d-1}} k(x, x')\, \phi_{\ell,j}(x)\, \mathrm{d}x$. Although this expression involves a $(d-1)$-dimensional integral on the hypersphere, our hypothesis that $k$ is zonal means we can make use of the Funk–Hecke formula (see LABEL:appendix:theorem:funk in the supplementary) and rewrite it as a simpler 1D integral over the shape function of $k$. Following this procedure finally leads to

$\hat{a}_\ell = \frac{\omega_d}{c_\ell} \int_{-1}^{1} s(t)\, C_\ell^{(\alpha)}(t)\, (1 - t^2)^{\frac{d-3}{2}}\, \mathrm{d}t,$     (5)

where $\alpha = \frac{d-2}{2}$, $C_\ell^{(\alpha)}$ is the Gegenbauer polynomial of degree $\ell$ and $s$ is the kernel's shape function. The constants $\omega_d$ and $c_\ell$ are given in LABEL:appendix:theorem:funk in the supplementary.

Using eq. 5 we are able to compute the Fourier coefficients of any zonal kernel. The details of the calculations for the arc-cosine kernel restricted to the sphere are given in LABEL:sec:appendix:arc-cosine in the supplementary material.
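
As an illustration of eq. (5), the coefficients of a zonal kernel can be approximated by one-dimensional quadrature of its shape function against a Gegenbauer polynomial. The sketch below assumes the standard Funk–Hecke normalisation, with the constant taken as the surface area of $\mathbb{S}^{d-2}$ divided by $C_\ell^{(\alpha)}(1)$; the exact constants used by the paper are deferred to its supplementary material, so treat the scaling here as an assumption.

```python
import numpy as np
from scipy import integrate, special

def funk_hecke_coefficient(shape_fn, ell, d):
    """Approximate the level-ell coefficient of a zonal kernel k(x, x') = s(x^T x')
    on S^{d-1} via a 1D quadrature (Funk-Hecke formula, standard normalisation)."""
    alpha = (d - 2) / 2.0
    omega = 2.0 * np.pi ** ((d - 1) / 2.0) / special.gamma((d - 1) / 2.0)  # area of S^{d-2}
    gegenbauer_at_one = special.eval_gegenbauer(ell, alpha, 1.0)
    integrand = lambda t: (shape_fn(t) * special.eval_gegenbauer(ell, alpha, t)
                           * (1.0 - t * t) ** ((d - 3) / 2.0))
    integral, _ = integrate.quad(integrand, -1.0, 1.0)
    return omega / gegenbauer_at_one * integral

# Example: shape function of the first-order arc-cosine kernel restricted to the sphere.
shape = lambda t: (np.sqrt(1.0 - t**2) + (np.pi - np.arccos(t)) * t) / np.pi
print([funk_hecke_coefficient(shape, ell, d=4) for ell in range(4)])
```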

Alternatively, a key result from Solin & Särkkä (2014, eq. 20) shows that the coefficients have a simple expression that depends on the kernel spectral density $S(\cdot)$ and on the eigenvalues of the Laplace–Beltrami operator. For GPs on $\mathbb{S}^{d-1}$, whose Laplace–Beltrami eigenvalues are $\lambda_\ell = \ell(\ell + d - 2)$, the coefficients boil down to $\hat{a}_\ell = S(\sqrt{\lambda_\ell})$. This is the expression we used in our experiments to define Matérn and SE covariances on the hypersphere. More details on this method are given in LABEL:appendix:sec:materns in the supplementary.
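
A short sketch of this second route, assuming the usual form of the $d$-dimensional Matérn spectral density and the Laplace–Beltrami eigenvalues $\lambda_\ell = \ell(\ell + d - 2)$; normalising constants are dropped since the overall scale can be absorbed into the kernel variance.

```python
import numpy as np

def matern_coefficients(num_levels, d, nu=1.5, lengthscale=1.0):
    """Coefficients a_ell = S(sqrt(lambda_ell)) for a Matern kernel on S^{d-1}.

    Uses the d-dimensional Matern spectral density up to a multiplicative
    constant, so only the relative decay across levels is meaningful here."""
    ells = np.arange(num_levels)
    eigenvalues = ells * (ells + d - 2)                       # Laplace-Beltrami spectrum
    spectral = (2.0 * nu / lengthscale**2 + eigenvalues) ** (-(nu + d / 2.0))
    return spectral / spectral[0]                             # normalise level 0 to 1

print(matern_coefficients(num_levels=6, d=9, nu=1.5))
```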

As one may have noticed, the values of $\hat{a}_\ell$ do not depend on the second index $j$ (i.e. the eigenvalues only depend on the degree of the spherical harmonic, but not on its orientation). This is a remarkable property of zonal kernels which allows us to use the addition theorem for spherical harmonics (see supplementary material LABEL:appendix:theorem:addition) to simplify eq. 2 into a single sum over levels:

$k(x, x') = \sum_{\ell=0}^{\infty} \hat{a}_\ell\, \frac{N_\ell^d}{\Omega_{d-1}}\, \frac{C_\ell^{(\alpha)}(x^\top x')}{C_\ell^{(\alpha)}(1)},$

where $\Omega_{d-1}$ denotes the surface area of $\mathbb{S}^{d-1}$. This representation is cheaper to evaluate than eq. 2 but it still requires truncation for practical use.
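
Combining such coefficients with the addition theorem gives a cheap, truncated evaluation of the kernel as a function of $x^\top x'$ only. The sketch below assumes spherical harmonics that are orthonormal with respect to the surface measure (hence the $\Omega_{d-1}$ factor); under a different normalisation the constant would change.

```python
import numpy as np
from scipy.special import comb, eval_gegenbauer, gamma

def num_harmonics(d, ell):
    """Number of spherical harmonics of level ell on S^{d-1} in R^d."""
    n = comb(d + ell - 1, ell, exact=True)
    if ell >= 2:
        n -= comb(d + ell - 3, ell - 2, exact=True)
    return n

def zonal_kernel(X, Y, coefficients, d):
    """Truncated zonal kernel via the addition theorem:
    k(x, y) = sum_ell a_ell * N_ell^d / Omega_{d-1} * C_ell^alpha(x.y) / C_ell^alpha(1)."""
    alpha = (d - 2) / 2.0
    omega = 2.0 * np.pi ** (d / 2.0) / gamma(d / 2.0)   # surface area of S^{d-1}
    inner = np.clip(X @ Y.T, -1.0, 1.0)
    K = np.zeros_like(inner)
    for ell, a_ell in enumerate(coefficients):
        gegen = eval_gegenbauer(ell, alpha, inner) / eval_gegenbauer(ell, alpha, 1.0)
        K += a_ell * num_harmonics(d, ell) / omega * gegen
    return K

# Example on S^2: Matern-3/2 coefficients a_ell = S(sqrt(lambda_ell)), as sketched above.
d, nu = 3, 1.5
ells = np.arange(15)
a = (2.0 * nu + ells * (ells + d - 2)) ** (-(nu + d / 2.0))
X = np.random.default_rng(2).normal(size=(4, d))
X /= np.linalg.norm(X, axis=1, keepdims=True)
print(zonal_kernel(X, X, a, d))
```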

3.3 Spherical Harmonics as Inducing Features

We can now build on the results from the previous section to propose powerful and efficient sparse GP models. We want features that exhibit non-local influence for expressiveness, and inducing variables that induce sparse structure in $\mathbf{K}_{uu}$ for efficiency. We achieve this by defining the inducing variables to be the RKHS inner product between the GP and the spherical harmonics:

$u_m = \langle f, \phi_m \rangle_{\mathcal{H}}.$     (6)

Although $f$ does not belong to $\mathcal{H}$ (Kanagawa et al., 2018), this expression is well defined since the regularity of the spherical harmonics can be used to extend the domain of definition of the first argument of the inner product to a larger class of functions; see Hensman et al. (2017) for a detailed discussion. Note that in the context of inducing variables we switch to a single integer $m$ to index the spherical harmonics, ordering them first by increasing level $\ell$ and then by increasing $j$ within a level.

To leverage these new inducing variables we need to compute two quantities: 1) the covariance between $u_m$ and $f(x)$, which gives $\mathbf{k}_u(x)$, and 2) the covariance between the inducing variables themselves, which gives the $\mathbf{K}_{uu}$ matrix. See van der Wilk et al. (2020) for an in-depth discussion of these concepts.

The covariance of the inducing variables with $f(x)$ is

$\mathrm{cov}(u_m, f(x)) = \mathbb{E}\big[ \langle f, \phi_m \rangle_{\mathcal{H}}\, f(x) \big] = \big\langle \mathbb{E}[f\, f(x)], \phi_m \big\rangle_{\mathcal{H}} = \langle k(x, \cdot), \phi_m \rangle_{\mathcal{H}} = \phi_m(x),$

where we relied on the linearity of both the expectation and the inner product, and where we used the reproducing property of $\mathcal{H}$.

The covariance between the inducing variables is given by

$\mathrm{cov}(u_m, u_{m'}) = \langle \phi_m, \phi_{m'} \rangle_{\mathcal{H}} = \frac{\delta_{m m'}}{\hat{a}_{\ell(m)}},$

where $\delta_{m m'}$ is the Kronecker delta and $\ell(m)$ denotes the level of the $m$-th spherical harmonic. Crucially, this means that $\mathbf{K}_{uu}$ is a diagonal matrix with elements $1 / \hat{a}_{\ell(m)}$.

Substituting $\mathbf{k}_u(x) = \boldsymbol{\phi}(x)$ and the diagonal $\mathbf{K}_{uu}$ into the sparse variational approximation (eq. 1) leads to the following form for $q(f)$:

$\mu_q(x) = \boldsymbol{\phi}(x)^\top \mathbf{K}_{uu}^{-1} \mathbf{m}, \qquad k_q(x, x') = k(x, x') + \boldsymbol{\phi}(x)^\top \mathbf{K}_{uu}^{-1} (\mathbf{S} - \mathbf{K}_{uu}) \mathbf{K}_{uu}^{-1} \boldsymbol{\phi}(x'),$

with $[\boldsymbol{\phi}(x)]_m = \phi_m(x)$ and $\mathbf{K}_{uu}^{-1} = \mathrm{diag}\big(\hat{a}_{\ell(1)}, \dots, \hat{a}_{\ell(M)}\big)$.

This sparse approximation has two main differences compared to an SVGP model with standard inducing points. First, the spherical harmonic inducing variables lead to features with non-local structure. Second, the approximation does not require any matrix inverses anymore. The computational bottleneck of this model is now simply the matrix multiplication in the variance calculation, which has an $\mathcal{O}(NM^2)$ cost. Compared to the $\mathcal{O}(NM^2 + M^3)$ cost of inducing-point SVGPs, this gives a significant speed-up, as we show in the experiments.
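
To make the cost argument concrete, here is a NumPy sketch of eq. (1) for this model: because $\mathbf{K}_{uu}$ is diagonal its "inverse" is an element-wise scaling, and the dominant cost is the product with the $N \times M$ feature matrix. The feature matrix of spherical harmonics evaluated at the inputs is assumed to be given, and the function name is illustrative.

```python
import numpy as np

def vish_marginals(Phi, k_diag, a, m, S):
    """Marginal posterior mean and variance of eq. (1) when K_uu = diag(1 / a).

    Phi:    (N, M) spherical harmonics evaluated at the inputs, phi_m(x_n)
    k_diag: (N,)   prior variances k(x_n, x_n)
    a:      (M,)   coefficients a_{ell(m)}, so that K_uu^{-1} = diag(a)
    m, S:   (M,), (M, M) variational mean and covariance
    No matrix inversion is required; the bottleneck is the O(N M^2) einsum."""
    scaled = Phi * a[None, :]                 # Phi @ K_uu^{-1}, an element-wise scaling
    mean = scaled @ m
    middle = S - np.diag(1.0 / a)             # S - K_uu
    var = k_diag + np.einsum("nm,mk,nk->n", scaled, middle, scaled)
    return mean, var
```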

As is usual in sparse GP methods, the number of inducing variables is constrained by the computational budget available to the user. Given that we order the spherical harmonics by increasing level $\ell$, choosing the first $M$ elements means that we select the features with the lowest angular frequencies first. Provided that the kernel spectral density is a decreasing function (this will be true for classic covariances, but not for quasi-periodic ones), the selected features correspond to the ones carrying the most signal according to the prior. In other words, the decomposition of the kernel can be compared to an infinite-dimensional principal component analysis, and our choice of inducing functions is optimal since we pick the ones with the largest variance. This is illustrated in fig. 4, which shows the analogue of fig. 3 for spherical harmonic inducing functions.
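
Because the harmonics are ordered by level, the size $M$ of the approximation for a given truncation is easy to count; here is a small sketch using the standard closed form for the number of harmonics per level. For the airline experiment of section 4.3 (eight inputs plus a bias, so $d = 9$), truncating at levels 3 and 4 gives $M = 210$ and $M = 660$.

```python
from scipy.special import comb

def num_harmonics(d, ell):
    """Number of spherical harmonics of level ell on the sphere S^{d-1} in R^d
    (the dimension of the space of degree-ell harmonic homogeneous polynomials)."""
    n = comb(d + ell - 1, ell, exact=True)
    if ell >= 2:
        n -= comb(d + ell - 3, ell - 2, exact=True)
    return n

def total_harmonics(d, max_level):
    """Total number of inducing functions when truncating at max_level (inclusive)."""
    return sum(num_harmonics(d, ell) for ell in range(max_level + 1))

print(total_harmonics(9, 3), total_harmonics(9, 4))   # 210, 660
```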

Figure 4: Illustration of the variance of the VISH inducing variables for a 2D input space. Settings are the same as in fig. 3, but a spherical harmonic feature is associated with each pair $(\ell, j)$. For a given number of inducing variables, the truncation pattern (red dashed triangle) is optimal since it selects the most influential features.

4 Experiments

We evaluate our method Variational Inference with Spherical Harmonics (VISH) on regression and classification problems and show the following properties of our method: 1) VISH performs competitively in terms of accuracy and uncertainty quantification on a range of problems from the UCI dataset repository. 2) VISH is extremely fast and accurate on large-scale conjugate problems (approximately 6 million 8D entries in less than 2 minutes). 3) Compared to VFF, VISH can be applied to multi-dimensional datasets and preserve its computational efficiency. 4) On problems with non-conjugate likelihood our method does not suffer from some of the issues encountered by VFF.

We begin with a toy experiment in which we show that the approximation becomes more accurate as we increase the number of basis functions.

Figure 5: Classification of the 2D banana dataset with a growing number of spherical harmonic basis functions. The right plot shows the convergence of the ELBO with respect to an increasing number of basis functions.

4.1 Toy Experiment: Banana Classification

The banana dataset is a 2D binary classification problem (Hensman et al., 2015). In fig. 5 we show three different fits of VISH, corresponding to increasing maximum levels of spherical harmonics for our inducing functions. Since the variational framework provides a guarantee that more inducing variables must be monotonically better (Titsias, 2009), we expect that increasing the number of inducing functions will provide improved approximations. This is indeed the case, as we show in the rightmost panel: as the number of basis functions grows, the ELBO converges and the fit becomes tighter.

While this is expected behaviour for SVGP methods, it is not guaranteed by VFF. Given the Kronecker structure used by VFF for this 2D experiment, Hensman et al. (2017) report that using a full-rank covariance matrix for the variational distribution was intolerably slow. They also show that enforcing the Kronecker structure on the posterior results in an ELBO that decreases as frequencies are added, and they finally propose a sum-of-two-Kroneckers structure, but provide no guarantee that this would converge to the exact posterior in the limit of larger $M$. In VISH we do not need to impose any structure on the approximate covariance matrix $\mathbf{S}$, so we retain the guarantee that adding more basis functions will move us closer to the posterior process. The method remains fast despite optimising over full covariance matrices: fitting the models displayed in fig. 5 takes only a few seconds on a standard desktop.

4.2 Regression on UCI Benchmarks

Table 1: Predictive mean squared errors (MSEs) and negative log predictive densities (NLPDs) with one standard deviation based on 5 splits on 5 UCI regression datasets (Yacht, Energy, Concrete, Kin8mn, Power). Lower is better. All models assume a Gaussian noise model, use a Matérn-3/2 kernel and use the L-BFGS optimiser for the hyper- and variational parameters. VISH and SVGP are configured with the same number of inducing points $M$. A-VFF and A-GPR assume an additive structure over the inputs (see text). For A-VFF and VISH the optimal posterior distribution for the inducing variables is set following Titsias (2009).

We use five UCI regression datasets to compare the performance of our method against other GP approaches. We measure the accuracy of the predictive mean with the Mean Squared Error (MSE) and uncertainty quantification with the mean Negative Log Predictive Density (NLPD). For each dataset we randomly select 90% of the data for training and 10% for testing, and repeat this 5 times to get error bars. We normalise the inputs and the targets to be zero-mean and unit-variance, and report the MSE and NLPD of the normalised data. In Table 1 we report the performance of VISH, Additive-VFF (A-VFF) (Hensman et al., 2017), SVGP (Hensman et al., 2013), Additive-GPR (A-GPR) and GPR.

We start by comparing VISH against SVGP, and notice that for Energy, Concrete and Power the change in inductive bias and the expressiveness of spherical harmonic inducing variables, as opposed to standard inducing points, improves performance. Also, while the sparse methods (A-VFF, VISH and SVGP) are necessary for scalability (as highlighted by the next experiment), they remain inferior to the exact GPR model, which should be seen as the optimal baseline.

For VFF we have to resort to an additive model (A-VFF) in order to deal with the dimensionality of the data, as a vanilla VFF model can only be used in one or two dimensions. Following Hensman et al. (2017, eq. 78), we assume a different function for each input dimension,

$f(x) = \sum_{c=1}^{d} f_c(x_c), \qquad f_c \sim \mathcal{GP}\big(0, k_c(x_c, x_c')\big),$     (7)

and approximate this process by a mean-field approximate posterior over the processes $f_c$, where each process is a sparse variational GP with Fourier features. We use the same number of frequencies for each input dimension. As an extra baseline we added an exact GPR model which makes the same additive assumption (A-GPR). As expected, not having to impose this structure improves performance: we see that VISH beats A-VFF on every dataset.
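
For reference, the additive prior of eq. (7) corresponds to a kernel that is a sum of one-dimensional kernels, one per input column; below is a minimal sketch with Matérn-3/2 components (the lengthscales here are arbitrary placeholders).

```python
import numpy as np

def matern32_1d(x, y, lengthscale=1.0, variance=1.0):
    """One-dimensional Matern-3/2 kernel between the vectors x (N,) and y (M,)."""
    r = np.abs(x[:, None] - y[None, :]) / lengthscale
    return variance * (1.0 + np.sqrt(3.0) * r) * np.exp(-np.sqrt(3.0) * r)

def additive_kernel(X, Y, lengthscales):
    """Additive kernel k(x, y) = sum_c k_c(x_c, y_c), as in eq. (7)."""
    return sum(matern32_1d(X[:, c], Y[:, c], ls) for c, ls in enumerate(lengthscales))

X = np.random.default_rng(4).uniform(size=(5, 8))
print(additive_kernel(X, X, lengthscales=np.ones(8)).shape)   # (5, 5)
```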

Limitations

Our current implementation of VISH only supports datasets up to 8 dimensions (9 dimensions when the bias is concatenated). This is not caused by a theoretical limitation, because our approach leads to a diagonal $\mathbf{K}_{uu}$ matrix in any dimension. The problem stems from the fact that there are no libraries available that provide stable spherical harmonic implementations in higher dimensions. Our implementation based on Dai & Xu (2013, Theorem 5.1) is stable up to 9 dimensions, and future work will focus on scaling this up. Furthermore, VISH does not solve the curse of dimensionality for GP models, but it does drastically improve over the scaling of VFF.

4.3 Large-Scale Regression on Airline Delay

Table 2: Predictive mean squared errors (MSEs), negative log predictive densities (NLPDs) and wall-clock time in seconds with one standard deviation based on 10 random splits on the airline arrival delays experiment. The methods compared are VISH (M = 210), VISH (M = 660), A-VFF (30 frequencies per dimension) and SVGP (M = 500). Total dataset size is given by $N$, and in each split we randomly select 2/3 of the data for training and 1/3 for testing.

This experiment illustrates three core capabilities of VISH: 1) it can deal with large datasets, 2) it is computationally and time efficient, and 3) it improves performance in terms of NLPD.

We use the 2008 U.S. airline delay dataset to assess these capabilities. The goal of this problem is to predict the amount of delay given eight characteristics of a flight, such as the age of the aircraft (number of years since deployment), route distance, airtime, etc. We follow the exact same experimental setup as Hensman et al. (2017) (https://github.com/jameshensman/VFF) and evaluate the performance on 4 datasets of size 10,000, 100,000, 1,000,000 and 5,929,413 (the complete dataset), created by subsampling the original one. For each dataset we use two thirds of the data for training and one third for testing. Every split is repeated 10 times and we report the mean and one standard deviation of the MSE and NLPD. For every run the outputs are normalised to be a centered unit Gaussian. The inputs are normalised to [0, 1] for VFF and SVGP; for VISH we normalise the inputs so that each column falls within a fixed range controlled by a hyperparameter. This hyperparameter corresponds to the prior variance of the weights of an infinite-width fully-connected neural network layer (see Cho & Saul (2009)), and we can optimise this weight variance by back-propagating the ELBO, similarly to the lengthscale hyperparameters of stationary kernels.

Table 2 shows the outcome of the experiment. The results for VFF and SVGP are from Hensman et al. (2017). We observe that VISH improves on the other methods in terms of NLPD and is within error bars in terms of MSE. Given the variability in the data, the GP models improve as more data becomes available during training.

Given the dimensionality of the dataset, a full VFF model is completely infeasible. As an example, using just four frequencies per dimension would already lead to an intractably large number of inducing variables once the outer product over the eight input dimensions is taken. So VFF has to resort to an additive model with a prior covariance structure given as a sum of Matérn-3/2 kernels over the input dimensions, as in eq. 7. Each of the functions $f_c$ is approximated using 30 frequencies. We report two variants of VISH: one using all spherical harmonics up to degree 3 ($M = 210$) and another up to degree 4 ($M = 660$). As expected, the more inducing variables, the better the fit.

We also report the wall-clock time for the experiments (training and evaluation) for the smallest and the largest dataset. All these experiments were run on a single consumer-grade GPU (Nvidia GTX 1070). On the complete dataset of almost 6 million records, VISH and A-VFF completed in a matter of seconds on average, while the SVGP method needed approximately 15 minutes to fit and predict. This shows that VISH is roughly two orders of magnitude faster than SVGP. A-VFF comes close to VISH but has to impose additive structure to keep its computational advantage.

4.4 SUSY Classification

In the last experiment we tackle a large-scale classification problem. We are tasked with distinguishing between a signal process which produces super-symmetric (SUSY) particles and a background process which does not. The inputs consist of eight kinematic properties measured by the particle detectors in the accelerator. The dataset contains 5 million records of which we use the last 10% for testing. We are interested in obtaining a calibrated classifier and measure the AuC of the ROC curve.

For SVGP and VISH we first used a subset of 20,000 points to train the variational and hyper-parameters of the model with L-BFGS. We then applied Adam to the whole dataset. A similar approach was used to fine-tune the NN baselines by Baldi et al. (2014).

Table 3 lists the performance of VISH and compares it to a boosted decision tree (BDT), a 5-layer neural network (NN), and an SVGP. We observe the competitive performance of VISH, which is a single-layer GP method, compared to a 5-layer neural net with 300 hidden units per layer and extensive hyper-parameter optimisation (Baldi et al., 2014). We also note the improvement over an SVGP with an SE kernel.

Table 3: Performance comparison for the SUSY benchmark (methods: BDT, NN, SVGP with SE kernel, and VISH). The mean AuC is reported with one standard deviation, computed by training five models with different initialisations. Larger is better. Results for BDT and NN are from Baldi et al. (2014).

5 Conclusion

We introduced a framework for performing variational inference in Gaussian processes using spherical harmonics. Our general setup is closely related to VFF, and we inherit several of its advantages, such as a considerable speed-up compared to classic sparse GP models and features with a global influence on the approximation. By projecting the data onto the hypersphere and using dedicated GP models on this manifold, our approach succeeds where other sparse GP methods fail. First, VISH provides good scaling properties as the dimension of the input space increases. Second, we showed that under some relatively weak hypotheses we are able to select the optimal features to include in the approximation. This is due to the intricate link between “stationary” covariances on the sphere and the Laplace–Beltrami operator. Third, the Mercer representation of the kernel means that the matrices to be inverted at training time are exactly diagonal, resulting in a very cheap-to-compute sparse approximate GP.

We showed on a wide range of regression and classification problems that our method performs at or close to the state of the art while being extremely fast. The good predictive performance may be inherited from the connection between infinitely wide neural networks and the way we map the predictions on the sphere back to the original space. Future work will explore this hypothesis.

Acknowledgements

Thanks to Fergus Simpson for pointing us in the direction of spherical harmonics in the early days of this work. Also many thanks to Arno Solin for sharing his implementation of the Matérn kernels’ power spectrum.

References

  • Baldi et al. (2014) Baldi, P., Sadowski, P., and Whiteson, D. Searching for exotic particles in high-energy physics with deep learning. Nature Communications, 5:4308, 2014.
  • Cho & Saul (2009) Cho, Y. and Saul, L. K. Kernel methods for deep learning. In Advances in Neural Information Processing Systems, pp. 342–350, 2009.
  • Dai & Xu (2013) Dai, F. and Xu, Y. Approximation theory and harmonic analysis on spheres and balls. Springer, 2013.
  • Frigola et al. (2014) Frigola, R., Chen, Y., and Rasmussen, C. E. Variational Gaussian process state-space models. In Advances in Neural Information Processing Systems, pp. 3680–3688, 2014.
  • Hensman & Lawrence (2014) Hensman, J. and Lawrence, N. D. Nested variational compression in deep Gaussian processes. arXiv preprint arXiv:1412.1370, 2014.
  • Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. Gaussian processes for big data. In Uncertainty in Artificial Intelligence, 2013.
  • Hensman et al. (2015) Hensman, J., Matthews, A. G. d. G., and Ghahramani, Z. Scalable variational Gaussian process classification. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2015.
  • Hensman et al. (2017) Hensman, J., Durrande, N., and Solin, A. Variational Fourier features for Gaussian processes. Journal of Machine Learning Research, 18(151):1–52, 2017.
  • Kanagawa et al. (2018) Kanagawa, M., Hennig, P., Sejdinovic, D., and Sriperumbudur, B. K. Gaussian processes and kernel methods: A review on connections and equivalences. arXiv preprint arXiv:1807.02582, 2018.
  • Lázaro-Gredilla & Figueiras-Vidal (2009) Lázaro-Gredilla, M. and Figueiras-Vidal, A. Inter-domain Gaussian processes for sparse inference using inducing features. In Advances in Neural Information Processing Systems, 2009.
  • Peacock (1999) Peacock, J. A. Cosmological physics. Cambridge University Press, 1999.
  • Quiñonero-Candela & Rasmussen (2005) Quiñonero-Candela, J. and Rasmussen, C. E. A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research, 6(Dec):1939–1959, 2005.
  • Rasmussen & Williams (2006) Rasmussen, C. E. and Williams, C. K. Gaussian Processes for Machine Learning. MIT Press, 2006.
  • Salimbeni & Deisenroth (2017) Salimbeni, H. and Deisenroth, M. P. Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, 2017.
  • Shi et al. (2020) Shi, J., Titsias, M., and Mnih, A. Sparse orthogonal variational inference for Gaussian processes. In Proceedings of International Conference on Artificial Intelligence and Statistics, pp. 1932–1942, 2020.
  • Solin & Särkkä (2014) Solin, A. and Särkkä, S. Hilbert space methods for reduced-rank Gaussian process regression. Statistics and Computing, pp. 1–28, 2014.
  • Titsias (2009) Titsias, M. Variational learning of inducing variables in sparse Gaussian processes. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2009.
  • Titsias & Lawrence (2010) Titsias, M. and Lawrence, N. D. Bayesian Gaussian process latent variable model. In Proceedings of International Conference on Artificial Intelligence and Statistics, 2010.
  • van der Wilk et al. (2020) van der Wilk, M., Dutordoir, V., John, S., Artemev, A., Adam, V., and Hensman, J. A framework for interdomain and multioutput Gaussian processes. arXiv:2003.01115, 2020. URL https://arxiv.org/abs/2003.01115.
  • Wendland (2005) Wendland, H. Scattered data approximation. Cambridge University Press, 2005.
  • Williams (1998) Williams, C. K. Computation with infinite neural networks. Neural Computation, 10(5):1203–1216, 1998.