Adaptive Inducing Points Selection For Gaussian Processes

07/21/2021
by   Theo Galy-Fajou, et al.

Gaussian Processes (GPs) are flexible non-parametric models with a strong probabilistic interpretation. While they are a standard choice for performing inference on time series, GPs have few techniques for working in a streaming setting. <cit.> developed an efficient variational approach to train online GPs using sparsity techniques: the whole set of observations is approximated by a smaller set of inducing points (IPs), which are moved around as new data arrive. Both the number and the locations of the IPs greatly affect the performance of the algorithm. In addition to optimizing their locations, we propose to adaptively add new points based on the properties of the GP and the structure of the data.


1 Introduction

Gaussian Processes (GPs) are flexible non-parametric models with a strong probabilistic interpretation. They are particularly well suited to time series (roberts_gaussian_2013), but one of their biggest limitations is that they scale cubically with the number of points (williams2006gaussian). quinonerocandelaunifying2005 introduced the notion of sparse GPs, models approximating the posterior by a smaller number of inducing points (IPs) and reducing the inference complexity from $\mathcal{O}(N^3)$ to $\mathcal{O}(NM^2)$, where $M$ is the number of IPs. titsias2009variational later introduced them in a variational setting, allowing their locations to be optimized. Based on this idea, bui2017streaming introduced a variational streaming model relying on inducing points. One of their algorithm's features is that hyper-parameters can be optimized and, more specifically, the number of inducing points can vary between batches of data. However, in their work the number of IPs is fixed and their locations are simply optimized against the variational bound on the marginal likelihood. Having a fixed number of IPs limits the model's scope if the total data size is unknown, and a gradient-based approach leads to two problems:
-  IPs' locations need to be optimized until convergence for every batch. Batches therefore need to be sufficiently large to yield a meaningful improvement, and if new data arrive far from the original positions of the IPs, the optimization will be extremely slow.
-  With the number of IPs fixed, there is no way to know how many will be required to reach a desired accuracy. Finding the optimal number of IPs is not an option either, as it is an ill-posed problem: the objective only decreases with more IPs, i.e. the optimum is obtained when every data point is an IP.

Figure 1:

Illustration of the inducing point selection process. Blue points represent inducing points, green points represent data, and the orange line represents the mean of the GP prediction surrounded by one standard error. The dashed regions represent the space covered by the existing IPs; only points seen outside those regions are selected as new IPs.

We propose a different approach to this problem with a simple algorithm, Online Inducing Points Selection (OIPS), which requires only one parameter to automatically select both the number of inducing points and their locations. OIPS naturally takes the structure of the data into account, while the performance trade-off and the expected number of IPs can be inferred.

Our main contributions are as follows:
- We develop an efficient online algorithm to automatically select the number and location of inducing points for a streaming GP.
- We give theoretical guarantees on the expected number of inducing points and the performance of the GP.

In section 2 we present existing methods for selecting inducing points, as well as online inference for GPs. We present our algorithm and its theoretical guarantees in section 3. We show our experiments in comparison with popular inducing point selection methods in section 4. Finally, we summarize our findings and discuss outlooks in section 5.

2 Background

2.1 Sparse Variational Gaussian Processes

Gaussian Processes:

Given some training data $\mathcal{D} = \{X, \mathbf{y}\}$, where $X = \{x_i\}_{i=1}^N$ are the inputs and $\mathbf{y} = \{y_i\}_{i=1}^N$ are the labels, we want to compute the predictive distribution for new inputs $X^*$. In order to do this we try to find an optimal distribution over a latent function $f$. We set the latent vector $\mathbf{f}$ as the realization of $f$ on $X$, where $f_i = f(x_i)$, and put a GP prior on $f$ with prior mean $\mu$ (set to 0 without loss of generality) and a kernel function $k$. In this work we use an isotropic squared exponential kernel (SE kernel), $k(x, x') = \sigma^2 \exp\left(-\|x - x'\|^2 / (2\ell^2)\right)$, but the approach is generally applicable to all translation-invariant kernels. We then compute the posterior:

$p(\mathbf{f} \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \mathbf{f})\, p(\mathbf{f})}{p(\mathbf{y})}, \qquad p(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid 0, K_N)$,   (1)

where $K_N$ is the kernel matrix evaluated on $X$. For a Gaussian likelihood the posterior is known analytically in closed form. Prediction and inference nonetheless have a complexity of $\mathcal{O}(N^3)$.
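To make the notation concrete, here is a minimal NumPy sketch of exact GP regression with an isotropic SE kernel. The function and variable names (se_kernel, ell, sigma2, sigma_n2) are our own and not taken from the paper, and the hyper-parameter values are placeholders.

```python
import numpy as np

def se_kernel(A, B, ell=1.0, sigma2=1.0):
    """Isotropic SE kernel k(a, b) = sigma2 * exp(-||a - b||^2 / (2 ell^2))."""
    sq_dists = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2.0 * A @ B.T
    return sigma2 * np.exp(-0.5 * np.maximum(sq_dists, 0.0) / ell**2)

def gp_posterior(X, y, X_star, ell=1.0, sigma2=1.0, sigma_n2=0.1):
    """Exact GP regression posterior at test inputs X_star; costs O(N^3)."""
    K = se_kernel(X, X, ell, sigma2) + sigma_n2 * np.eye(len(X))
    K_s = se_kernel(X, X_star, ell, sigma2)
    K_ss = se_kernel(X_star, X_star, ell, sigma2)
    mean = K_s.T @ np.linalg.solve(K, y)              # K^{-1} y
    cov = K_ss - K_s.T @ np.linalg.solve(K, K_s)
    return mean, cov
```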

Sparse Variational Gaussian Processes:

When the likelihood is not Gaussian, there is no tractable solution for the posterior. One possible approximation is variational inference: a family of distributions over $\mathbf{f}$ is selected, e.g. the multivariate Gaussian $q(\mathbf{f}) = \mathcal{N}(\mathbf{f} \mid m, S)$, and one optimizes the variational parameters $m$ and $S$ by minimizing the negative ELBO, a proxy for the KL divergence between $q(\mathbf{f})$ and the true posterior. However, the computational complexity still grows cubically with the number of samples and is therefore inadequate for large datasets.

quinonerocandelaunifying2005 and titsias2009variational introduced the notion of sparse variational GPs (SVGP). One adds $M$ inducing variables $\mathbf{u}$ and their inducing locations $Z$ to the model. In this work we restrict $Z$ to be in the same domain as $X$, but inter-domain approaches also exist (hensman2017variational). The relation between $\mathbf{u}$ and $\mathbf{f}$ is given by the distribution $p(\mathbf{f} \mid \mathbf{u})$, where

$p(\mathbf{f} \mid \mathbf{u}) = \mathcal{N}\left(\mathbf{f} \mid K_{NM} K_M^{-1} \mathbf{u},\; K_N - K_{NM} K_M^{-1} K_{MN}\right)$,   (2)

where $K_M$ is the kernel matrix evaluated on $Z$ and $K_{NM}$ is the cross-covariance matrix between $X$ and $Z$.

We then approximate $p(\mathbf{f}, \mathbf{u} \mid \mathbf{y})$ with the variational distribution $q(\mathbf{f}, \mathbf{u}) = p(\mathbf{f} \mid \mathbf{u})\, q(\mathbf{u})$, where $q(\mathbf{u}) = \mathcal{N}(\mathbf{u} \mid m, S)$, by optimizing the ELBO.

Note that if the likelihood is Gaussian, the optimal variational parameters $m$ and $S$ are known in closed form. The only parameters left to optimize are the kernel parameters, as well as the number and locations of the inducing variables.
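For reference, the closed-form optimum for a Gaussian likelihood can be computed as in the following sketch, which reuses se_kernel from the previous snippet and follows the standard expressions from titsias2009variational; it is an illustration in our own notation, not the authors' implementation.

```python
def optimal_qu(X, y, Z, ell=1.0, sigma2=1.0, sigma_n2=0.1):
    """Closed-form optimal q(u) = N(m, S) for a Gaussian likelihood (Titsias-style)."""
    K_m = se_kernel(Z, Z, ell, sigma2) + 1e-8 * np.eye(len(Z))   # jitter for stability
    K_nm = se_kernel(X, Z, ell, sigma2)
    Sigma = K_m + K_nm.T @ K_nm / sigma_n2        # K_M + sigma_n^{-2} K_MN K_NM
    S = K_m @ np.linalg.solve(Sigma, K_m)         # S = K_M Sigma^{-1} K_M
    m = K_m @ np.linalg.solve(Sigma, K_nm.T @ y) / sigma_n2
    return m, S
```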

2.2 Inducing points selection methods

titsias2009variational initially proposed to select the point locations via a greedy selection: a small batch of data is randomly sampled, and each sample is successively tested by adding it to the set of inducing points and evaluating the improvement on the ELBO. The sample bringing the best improvement is added to the set of inducing points, and the operation is repeated until the desired number of inducing points is reached. This greedy approach has the advantage of selecting a set that is already close to the optimal one, but it is extremely expensive and is not applicable to non-conjugate likelihoods, as it relies on evaluating the optimal bound.

The most popular approach currently is to use the k-means++ algorithm (arthur2007k) and take the optimized cluster centers as inducing point locations. The clustering nature of the algorithm gives good coverage of the whole dataset. However, the k-means algorithm has a complexity of $\mathcal{O}(NMT)$ on the whole dataset, where $T$ is the number of k-means iterations. Another issue is that it might allocate multiple centers in a region of high density, leading to very close inducing points and no significant performance improvement. It is also not applicable online and does not solve the problem of choosing the number of inducing points.
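As a point of comparison, selecting inducing point locations with k-means++ takes only a few lines with scikit-learn; this generic sketch is not the exact setup used in the experiments.

```python
from sklearn.cluster import KMeans

def kmeans_inducing_points(X, M, seed=0):
    """Use the M cluster centers found by k-means++ as inducing point locations."""
    km = KMeans(n_clusters=M, init="k-means++", n_init=10, random_state=seed).fit(X)
    return km.cluster_centers_
```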

Another classical approach is simply to use a grid. For example, MorenoArtesAlvarez19 use a grid in an online setting by updating the bounds of a uniform grid. Using a grid is unfortunately limited to a small number of dimensions and does not take the structure of the data into account.

2.3 Online Variational Gaussian Process Learning

bui2017streaming developed a streaming algorithm for GPs (SSVGP) based on the inducing point approach of titsias2009variational. The method consists in recursively optimizing the variational distribution $q_{\mathrm{new}}$ for each new batch of data $\mathbf{y}_{\mathrm{new}}$ given the previous variational distribution $q_{\mathrm{old}}$. $q_{\mathrm{old}}$ initially approximates the posterior:

$q_{\mathrm{old}}(\mathbf{f}) \approx p(\mathbf{f} \mid \mathbf{y}_{\mathrm{old}}) = \frac{p(\mathbf{f} \mid \theta_{\mathrm{old}})\, p(\mathbf{y}_{\mathrm{old}} \mid \mathbf{f})}{p(\mathbf{y}_{\mathrm{old}})}$,   (3)

where $\theta$ is the set of hyper-parameters. Since $\mathbf{y}_{\mathrm{old}}$ is not accessible anymore, the likelihood of previously seen data is approximated using the previous variational approximation and the previous hyper-parameters:

$p(\mathbf{y}_{\mathrm{old}} \mid \mathbf{f}) \approx \frac{q_{\mathrm{old}}(\mathbf{f})\, p(\mathbf{y}_{\mathrm{old}})}{p(\mathbf{f} \mid \theta_{\mathrm{old}})}$.

The distribution approximated by $q_{\mathrm{new}}$ is in the end:

$\hat{p}(\mathbf{f} \mid \mathbf{y}_{\mathrm{old}}, \mathbf{y}_{\mathrm{new}}) \propto p(\mathbf{f} \mid \theta_{\mathrm{new}})\, p(\mathbf{y}_{\mathrm{new}} \mid \mathbf{f})\, \frac{q_{\mathrm{old}}(\mathbf{f})}{p(\mathbf{f} \mid \theta_{\mathrm{old}})}$.   (4)

The optimization of the (bound on the) KL divergence between the two distributions for each new batch preserves the information of previous batches via $q_{\mathrm{old}}$ and ensures a smooth transition of the hyper-parameters, including the number of inducing points. We give all technical details, including the hyper-parameter derivatives and the ELBO in full form, in appendix A.
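To fix ideas, the overall streaming procedure has roughly the following structure. This is only a schematic sketch: the streaming ELBO corresponding to equation (4) and its optimizer are left abstract (the callable optimize_elbo is a placeholder of ours), and it is not the authors' implementation.

```python
def streaming_svgp(batches, init_theta, init_Z, optimize_elbo):
    """Schematic SSVGP loop: each new batch is absorbed by re-optimizing the
    variational distribution given the previous q, hyper-parameters and IPs."""
    theta_old, Z_old, q_old = init_theta, init_Z, None
    for y_new in batches:
        # optimize_elbo is assumed to maximize the bound corresponding to eq. (4)
        # and to return the new variational distribution, hyper-parameters and IPs.
        q_new, theta_new, Z_new = optimize_elbo(y_new, q_old, theta_old, Z_old)
        q_old, theta_old, Z_old = q_new, theta_new, Z_new
    return q_old, theta_old, Z_old
```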

3 Algorithm

The idea of our algorithm is that, to give a good approximation, a large majority of the samples should be "close" (in the reproducing kernel Hilbert space (RKHS)) to the set of IP locations $Z$. Additionally, $Z$ should be as diverse as possible, since IP degeneracy will not improve the approximation. This intuition is supported by previous works:
- bauer2016understanding showed that the most substantial improvement obtained by adding a new inducing point comes from the reduction of the uncertainty of the conditional $p(\mathbf{f} \mid \mathbf{u})$, which decreases quadratically with the covariance between a sample and the inducing set.
- burt2019rates showed that the quality of the approximation made with $M$ inducing points is bounded by the norm of the Nyström error $K_N - Q_N$.
Therefore, by ensuring that the covariance between every sample and its closest inducing point is sufficiently large and that the inducing set is sufficiently diverse, we can expect an improvement in the approximation of the non-sparse problem.

3.1 Adding New Inducing Points

A simple yet efficient strategy is to verify that, for each new data point $x$ seen during training, there exists a close inducing point. We first compute the vector of covariances $k(x, Z)$ between $x$ and the current inducing points. If the maximum value of $k(x, Z)$ is smaller than a threshold parameter $\rho$, the sample is added to the set of IPs $Z$. If not, the algorithm passes on to the next sample. We summarize all steps in Algorithm 1.

The streaming nature of the algorithm makes it perfectly suited to an online learning setting: it needs to see each sample only once, whereas other algorithms like k-means need to parse all the data multiple times before converging. It is fully deterministic for a given sequence of samples, and convergence guarantees can therefore be given under some conditions. This approach was previously explored in a different context by csato2002sparse, but was limited to small datasets.

  Input: sample $x$, set of inducing points $Z$, acceptance threshold $\rho$, kernel function $k$
  if $\max_{z \in Z} k(x, z) < \rho$ then
     $Z \leftarrow Z \cup \{x\}$
  end if
  return $Z$
Algorithm 1 Online Inducing Point Selection (OIPS)

The extra cost of the algorithm is virtually free, since $k(x, Z)$ needs to be computed anyway for the variational updates of the model.
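A minimal NumPy implementation of Algorithm 1 could look as follows; it reuses the se_kernel sketch from section 2.1, and the parameter names (rho, ell, sigma2) are our own.

```python
def oips_update(x, Z, rho=0.8, ell=1.0, sigma2=1.0):
    """Online Inducing Point Selection: add x to Z if no existing IP is close enough."""
    if len(Z) == 0 or np.max(se_kernel(x[None, :], Z, ell, sigma2)) < rho:
        Z = np.vstack([Z.reshape(-1, x.shape[0]), x[None, :]])
    return Z

# Streaming usage: parse each sample once and grow Z on the fly.
# Z = np.empty((0, D))
# for x in X_stream:
#     Z = oips_update(x, Z, rho=0.8)
```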

One of our claims is that our algorithm is model and data agnostic. The reason is that, as kernel hyper-parameters are optimized, the acceptance condition changes as well.

Note that this method can be interpreted as a half-greedy approach to sequentially sampling a determinantal point process (Kulesza2012). In appendix B, we show that, for the same number of points, the probability of our selected set is higher than that of a k-DPP.

3.2 Theoretical guarantees

The final size of $Z$ depends on many factors: the selected threshold $\rho$, the chosen kernel, the structure of the data (distribution, sparsity, etc.) and the number of points seen. However, under some weak assumptions on the data, we can prove a bound on the expected number of inducing points as well as on the quality of the variational approximation.

Expected number of inducing points:

Since the selection process directly depends on the data, it is impossible to give an arbitrary bound. However, by adding assumptions on the distribution of the inputs, one can derive the following result.

Theorem 1.

Given a dataset $X = \{x_i\}_{i=1}^N$ i.i.d. and uniformly distributed, i.e. $x_i \sim \mathcal{U}\left([0, 1]^D\right)$, and a SE kernel with lengthscale $\ell$, the expected number of selected inducing points after parsing $N$ points is bounded by

$\mathbb{E}[M_N] \leq \frac{1}{p} - \left(\frac{1}{p} - 1\right)(1 - p)^N$,   (5)

where $p = \left(\frac{2r}{\sqrt{D}}\right)^D$ and $r = \ell\sqrt{2\log(1/\rho)}$.

The proof is given in appendix C. As $N \to \infty$, this bound converges to $1/p$, which is the estimated number of overlapping hyper-spheres of radius $r$ needed to fill a unit hypercube of dimension $D$. This can be used as an upper bound for any data lying in a compact domain. It confirms the intuition that the number of selected inducing points grows faster with larger dimensions, a larger threshold $\rho$, and smaller lengthscales.

Expected performance on regression:

burt2019rates derived a convergence bound for the inducing point approach of titsias2009variational. Even though they show this bound in an offline setting, it is still relevant for online problems. They show that when $Z$ is sampled via a k-DPP process (kulesza2011k), i.e. a determinantal point process conditioned on a fixed set size $M$, the difference between the ELBO $\mathcal{L}$ and the log evidence is bounded in expectation by

$\mathbb{E}\left[\log p(\mathbf{y}) - \mathcal{L}\right] \leq \frac{(M+1)\sum_{m=M+1}^{N} \lambda_m}{2\sigma_n^2}\left(1 + \frac{\|\mathbf{y}\|^2}{\sigma_n^2}\right)$,   (6)

where $\lambda_m$ is the $m$-th largest eigenvalue of $K_N$, $\sigma_n^2$ is the observation noise variance, and the bound arises from controlling the trace of $K_N - Q_N$, where $Q_N$ is the Nyström approximation of $K_N$.

We derive a similar bound when using our algorithm instead of k-DPP sampling:

Theorem 2.

Let $Z$ be the set of inducing point locations of size $M$ selected via Algorithm 1 on the dataset $X$ of size $N$. Then

(7)

where $K_N$ is the kernel matrix on $X$ and $Q_N$ is the Nyström approximation of $K_N$ using the subset $Z$.

The proof and an empirical comparison are given in appendix D.

4 Experiments

In this section we take a quick look at how our algorithm performs in different settings compared to the approaches described in section 2.2. We compare the online model SSVGP described in section 2.3 with different IP selection techniques: we select IPs from the first batch via k-means and then optimize their locations (k-means/opt), select them via our algorithm and optimize them (OIPS/opt), select them via our algorithm without optimizing them (OIPS), and finally create a grid that we adapt according to new bounds (Grid). We consider 3 different toy datasets, two of which are displayed in figure 2. Dataset A is a uniform time series whose output is a noisy sinusoid. Dataset B is an irregular time series with a gap in the inputs; its output is also a noisy sinusoid. Dataset C inputs are random samples from an isotropic 3D multivariate Gaussian, with a noisy nonlinear output function. All datasets contain 200 training points and 200 test points. For all experiments we use an isotropic SE kernel with fixed parameters. For datasets A and B, Grid and k-means have 25 IPs while OIPS converged to around 20 IPs. For dataset C, k-means has 50 IPs and both OIPS variants converged to 10 IPs. Figure 2 shows the evolution of the average negative log-likelihood on test data after each batch has been seen. In a uniform time-series context all methods are essentially equivalent. The presence of a gap blocks the optimization of IP locations and impedes inference on future points, whereas the grid suffers in higher dimensions. All details on the datasets, training methods, hyper-parameters and optimization parameters can be found in appendix E.
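For reproducibility, the two time-series datasets can be generated along the following lines; the exact noise level, frequency and gap location are our assumptions, as they are not fully specified above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dataset A: uniform time series with a noisy sinusoidal output (assumed noise std 0.1).
x_a = np.linspace(0.0, 10.0, 200)
y_a = np.sin(x_a) + 0.1 * rng.standard_normal(200)

# Dataset B: irregular time series with a gap in the inputs, same output function.
x_b = np.concatenate([np.linspace(0.0, 4.0, 100), np.linspace(7.0, 10.0, 100)])
y_b = np.sin(x_b) + 0.1 * rng.standard_normal(200)
```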

Figure 2: Toy datasets A and B, divided into 4 batches. Average negative test log-likelihood on a test set as a function of the number of batches seen. In a uniform streaming setting all methods perform similarly, but a gap in the data blocks the convergence of a simple position optimization, whereas in a non-compact situation the adaptive grid suffers in performance.

5 Conclusion

We presented a new algorithm, OIPS, able to select inducing points automatically for a GP in an online setting. The theoretical bounds derived outperform previous work based on DPPs. The selection process still needs to be made robust to outliers and to variations of the hyper-parameters: using, for instance, a threshold on the median or a mean over the k-nearest IPs could help to avoid picking adversarial points such as outliers. We have only considered regression, but our algorithm is also compatible with non-conjugate likelihoods; using augmentation approaches (wenzel2019efficient; galy2019multi), the same performance can be attained. Finally, the most interesting improvement would be to use a non-stationary kernel (remes2017non) and automatically adapt the number of inducing points across the dataset.

References

Appendix A Derivations for online GPs

A.1 ELBO

Following bui2017streaming, the ELBO for variational inference is defined as:

The terms of the first line correspond to a classical SVGP problem, and the second line expresses the KL divergence with the previous variational posterior. The distributions are defined as:

The first terms are:

The expected log-likelihood is given by:

Writing the second terms out fully, we get:

Subtracting the second term from the first, we get:

Where .

Taking the derivative of the ELBO with respect to the variational parameters gives us directly the optimal solution for Gaussian regression:

Rewritten in terms of natural parameters:

A.2 Hyper-parameter derivatives

Given a kernel hyper-parameter, the derivatives are given by:

The special case of the derivative with respect to the variance:

A.3 Comparison with SVI

If we take the special case where inducing points do not change between iterations, the updates become:

Compared to the SVI updates:

If we ignore the learning rate by setting it to 1:

We then completely forget the previous variational parameters.
To make it directly comparable to streaming:

SVI
Streaming

Appendix B The deterministic algorithm as half-greedy DPP sampling

We proceed to a simple experiment: given a dataset, Abalone, we repeatedly shuffle the data. We apply Algorithm 1, parsing all the data, to obtain the selected subset. We use the resulting number of inducing points as a parameter to sample from a k-DPP and obtain a second subset. We compute the probabilities of both subsets under the DPP and report the histogram of the probabilities in figure 3. One can observe that the probability given by the OIPS algorithm is consistently higher, as well as more narrowly distributed, than that of the k-DPP sampling. This can be explained by the fact that we deterministically constrain all the points to be at a certain distance from each other and therefore put a deterministic limit on the determinant of the corresponding kernel matrix.

Figure 3: Histogram of the set probabilities for the OIPS algorithm and k-DPP sampling.

Appendix C Proof of Theorem 1: Bound on the number of points

Algorithm 1 can be interpreted as filling a domain with closed balls, where ball intersections are allowed but no center can be inside another ball. For a SE kernel we can compute the radius (in Euclidean space) of these balls: $r = \ell\sqrt{2\log(1/\rho)}$.

We can bound the volume of the union of the balls by the union of the inscribed hypercubes. The side length of a hypercube inscribed in a hypersphere of radius $r$ is $2r/\sqrt{D}$. Since the volume of the hypercube is by definition smaller, this gives us an upper bound on the expected number of inducing points. Defining $M_t$ as the number of inducing points at time $t$, the probability of a new point falling outside the union of all hypercubes is $1 - p\,M_t$, where $p$ is the volume of one hypercube and therefore the probability of a new sample appearing in it.
The probability of keeping the same number of points is therefore $p\,M_t$.

We now consider the problem as a Markov chain where the state at time $t$ is represented by a vector whose $i$-th entry is 1 if there are $i$ inducing points. The transition matrix $T$ is given by $T_{i,i} = i\,p$ and $T_{i,i+1} = 1 - i\,p$.

If we start with one inducing point, the probability of having $i$ balls after $t$ steps is the $i$-th entry of the initial state propagated through $T$ applied $t$ times, while the expected number of points is given by the corresponding expectation over states.

This sequence can be complex to compute. Instead, we can approximate the final expectation by recursively computing the update given the expectation at the previous step:

$\mathbb{E}[M_{t+1}] = (1 - p)\,\mathbb{E}[M_t] + 1$.

This is an arithmetico-geometric sequence, and given the initial condition and since $|1 - p| < 1$, we can get the closed-form solution for $\mathbb{E}[M_N]$ given in equation (5).
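The recursion above is straightforward to evaluate numerically. The sketch below iterates it and compares the result against a direct simulation of Algorithm 1 on uniform data; the particular form of p (the relative volume of the hypercube inscribed in a ball of radius r) follows our reading of the proof and should be treated as an assumption, as should the parameter values.

```python
import numpy as np

def expected_num_ips(N, p, M0=1):
    """Iterate E[M_{t+1}] = (1 - p) E[M_t] + 1, the arithmetico-geometric recursion."""
    m = float(M0)
    for _ in range(N):
        m = (1.0 - p) * m + 1.0
    return m

def simulate_oips(N, D, ell, rho, seed=0):
    """Empirical number of IPs selected by Algorithm 1 on uniform data in [0, 1]^D."""
    rng = np.random.default_rng(seed)
    X = rng.uniform(size=(N, D))
    Z = [X[0]]
    for x in X[1:]:
        k = np.exp(-0.5 * np.sum((np.asarray(Z) - x) ** 2, axis=1) / ell**2)
        if k.max() < rho:
            Z.append(x)
    return len(Z)

D, ell, rho = 3, 0.2, 0.8
r = ell * np.sqrt(2.0 * np.log(1.0 / rho))     # ball radius for a unit-variance SE kernel
p = min(1.0, (2.0 * r / np.sqrt(D)) ** D)      # assumed volume of one inscribed hypercube
print(expected_num_ips(1000, p), simulate_oips(1000, D, ell, rho))
```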

C.1 Empirical Comparison

We show the realization of this bound on uniform data in 3 dimensions in figure 4.

Figure 4: Bound on the number of inducing points accepted given the number of seen points vs. the empirical estimate.

Appendix D Proof of Theorem 2: Bounding the ELBO

We follow the approach of burt2019rates and belabbas2009spectral. burt2019rates showed that the error between the ELBO and the log evidence is bounded in terms of $\|K_N - Q_N\|_F$, where $\|\cdot\|_F$ is the Frobenius norm. Using k-DPP sampling (kulesza2011k), they were able to show a bound on the expectation of this norm. We follow similar calculations with our deterministic algorithm for fixed kernel parameters. Let $K_N$ be the kernel matrix of the full dataset and $K_M$ the submatrix given the set of points $Z$. The Schur complement of $K_M$ in $K_N$ is given by $\bar{K} = K_N - K_{NM} K_M^{-1} K_{MN}$. Following a similar approach to belabbas2009spectral, we bound the norm by the trace:

Using the definition of the Schur complement, we get:

where every element of the sum is a scalar. Taking the eigendecomposition of $K_M$, and assuming a kernel variance of 1 (although this generalizes to all variances) and a translation-invariant kernel such that $k(x, x) = 1$, we get:

where we used the fact that each point was close enough to at least one inducing point such that their covariance is at least $\rho$. For clarity we denote by $\lambda_1$ the largest eigenvalue of $K_M$. Summing over the trace, we get the final bound:

Now, by construction, all off-diagonal terms of $K_M$ are smaller than $\rho$. Using the equality (Stewart90):

We get that

Assuming , we get

We then get the final bound:

D.1 Empirical Comparison

These bounds are difficult to compare due to the different parameters characterizing them. Nevertheless, we give an example by comparing the bound and the empirical value on toy data drawn uniformly in 3 dimensions in figure 5. For each value of $\rho$ we ran our algorithm and used the resulting number of selected inducing points as the required $M$ in the bounds. We show in section 4 the empirical effect of the choice of $\rho$ on the accuracy and on the number of points.
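One quantity underlying these bounds is the trace of the Schur complement $\mathrm{tr}(K_N - Q_N)$; a direct way to compute it for a given inducing set is sketched below, reusing the se_kernel function from the snippet in section 2.1.

```python
import numpy as np

def nystrom_trace_error(X, Z, ell=1.0, sigma2=1.0):
    """Trace of the Schur complement K_N - K_NM K_M^{-1} K_MN (Nystrom residual)."""
    K_n = se_kernel(X, X, ell, sigma2)
    K_m = se_kernel(Z, Z, ell, sigma2) + 1e-8 * np.eye(len(Z))   # jitter for stability
    K_nm = se_kernel(X, Z, ell, sigma2)
    Q_n = K_nm @ np.linalg.solve(K_m, K_nm.T)
    return np.trace(K_n - Q_n)
```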

Figure 5: Empirical value obtained with the OIPS algorithm compared to the bound from burt2019rates given in equation (6) and our bound given in equation (7).

Appendix E Experiment parameters

For every problem we use an isotropic squared exponential kernel $k(x, x') = \sigma^2 \exp\left(-\|x - x'\|^2/(2\ell^2)\right)$, where $\sigma^2$ is the kernel variance and $\ell$ the lengthscale.