Input Dependent Sparse Gaussian Processes

07/15/2021 · Bahram Jafrasteh, et al. · Universidad Autónoma de Madrid

Gaussian Processes (GPs) are Bayesian models that provide uncertainty estimates associated with the predictions made. They are also very flexible due to their non-parametric nature. Nevertheless, GPs suffer from poor scalability as the number of training instances N increases. More precisely, they have a cubic cost with respect to N. To overcome this problem, sparse GP approximations are often used, in which a set of M ≪ N inducing points is introduced during training. The location of the inducing points is learned by considering them as parameters of an approximate posterior distribution q. Sparse GPs, combined with variational inference for inferring q, reduce the training cost of GPs to 𝒪(M^3). Critically, the inducing points determine the flexibility of the model, and they are often located in regions of the input space where the latent function changes. A limitation is, however, that in some learning tasks a large number of inducing points may be required to obtain good prediction performance. To address this limitation, we propose here to amortize the computation of the inducing point locations, as well as the parameters of the variational posterior approximation q. For this, we use a neural network that receives the observed data as an input and outputs the inducing point locations and the parameters of q. We evaluate our method in several experiments, showing that it performs similarly to or better than other state-of-the-art sparse variational GP approaches. However, with our method the number of inducing points is reduced drastically due to their dependency on the input data. This makes our method scale to larger datasets and have faster training and prediction times.


1 Introduction

Gaussian Processes (GPs) are non-parametric models that can be used to address machine learning problems, including regression and classification [19]. GPs become more expressive as the number of training instances grows and, since they are Bayesian models, they provide a predictive distribution that estimates the uncertainty associated with the predictions made. This uncertainty estimation, or ability to know what is not known, becomes critical in many practical applications [7]. Nevertheless, GPs suffer from poor scalability, as their training cost is 𝒪(N^3) due to the need to compute the inverse of a covariance matrix of size N × N, with N the number of training instances. Another limitation is that approximate inference is required with non-Gaussian likelihoods [19].

To improve the cost of GPs, sparse approximations can be used [19]. The most popular ones introduce a set of M ≪ N inducing points [25, 26]. The inducing points and their associated posterior values completely specify the posterior process at test points. In the method described in [25], the computational gain is obtained by assuming independence among the process values at the training points given the inducing points and their values. This approach can also be seen as using an approximate GP prior [18]. By contrast, in the method in [26] the computational gain is obtained by combining variational inference (VI) with a posterior approximation that has a fixed part and a tunable part. In both methods the computational cost is reduced to 𝒪(NM²), and the inducing points, considered as model hyper-parameters, are learned by maximizing an estimate of the marginal likelihood.

Importantly, the VI approach in [26] maximizes a lower bound on the log-marginal likelihood as an indirect way of minimizing the KL-divergence between an approximate posterior distribution for the process values associated with the inducing points and the corresponding exact posterior. The advantage of this approach is that the objective is expressed as a sum over the training instances, allowing mini-batch training and stochastic optimization techniques to be applied to the objective [11]. This reduces the training cost to 𝒪(M³), making GPs scalable to very large datasets.

When using sparse approximations, however, one often observes in practice that after the optimization process the inducing points are located in those regions of the input space in which the latent function changes [25, 26, 12, 2]. Therefore, the expressive power of the model critically depends on the number of inducing points and on their correct placement across the input space. For example, some problems may require a large number of inducing points, in the order of several hundreds, to get good prediction results [11, 23, 27]. This makes it difficult to use sparse GPs based on inducing points in those problems.

In the literature there have been some attempts to improve the computational cost of sparse approximations. These approaches follow different strategies, including using different sets of inducing points for the computation of the posterior mean and variance [3]. Other approaches use an orthogonal decomposition of the GP that allows introducing an extra set of inducing points at a lower cost [23]. Finally, other methods consider a large set of inducing points, but restrict the computations for a particular data point to its nearest neighbors within the set of inducing points [27].

In this work we are inspired by the approach of [27] and propose a novel method to improve the computational cost of sparse GPs. Our method also tries to produce a set of inducing points (and an associated variational approximation q) that is specific to each different input data point, as in [27]. For that, we note that some works in the literature have observed that one can learn a mapping from inputs to proposal distributions instead of directly optimizing their parameters [15, 24]. This approach, known as amortized variational inference, is a key contribution of variational auto-encoders (VAEs) [15], and has also been explored in the context of GPs to solve other types of problems, such as multi-class classification with input noise [28]. Amortized inference has also been empirically shown to lead to useful regularization properties that improve generalization performance [24].

Specifically, here we propose to combine sparse GPs with a neural network architecture that computes the inducing point locations associated with each input point. Moreover, we also employ a neural network to carry out amortized VI to compute the parameters of the approximate variational distribution modeling the posterior distribution associated with the values of the outputted inducing points. Critically, this approach allows us to reduce the number of inducing points drastically without losing expressive power, since we now have a different set of inducing points associated with each input location. The inducing points are simply given by a mapping from the inputs provided by a neural network. We show in several experiments that the proposed method performs similarly to or better than standard sparse GPs and competitive methods for improving the cost of sparse GPs [27, 23]. However, the training and prediction times of our method are much shorter.

2 Gaussian processes

A Gaussian Process (GP) is a stochastic process for which any finite set of variables has a Gaussian distribution [19]. GPs are non-linear and non-parametric approaches for regression and classification. In a learning task, we use a GP as a prior over a latent function. Then, Bayes' rule is used to get a posterior for that function given the observed data. Consider a dataset D = {(x_i, y_i)}_{i=1}^N, where each scalar y_i is assumed to be obtained by adding independent and identically distributed Gaussian noise with zero mean and variance σ² to a latent function f evaluated at x_i, i.e., y_i = f(x_i) + ε_i, where ε_i ∼ 𝒩(0, σ²). We specify a prior distribution for f in the form of a GP, described by a mean function (often set to zero) and a covariance function k(x, x') such that cov(f(x), f(x')) = k(x, x'). Covariance functions typically have some adjustable parameters θ. The GP formulation allows us to compute a posterior or predictive distribution for the potential values of f given the observed data y = (y_1, …, y_N)ᵀ. More precisely, the prediction at a new test point x⋆ is Gaussian, with mean and variance given by

m(x⋆) = k⋆ᵀ (K + σ² I)⁻¹ y,   (1)
v(x⋆) = k(x⋆, x⋆) − k⋆ᵀ (K + σ² I)⁻¹ k⋆,   (2)

where m(x⋆) and v(x⋆) are the prediction mean and variance, respectively; k⋆ is a vector with the covariances between f(x⋆) and each f(x_i); K is the N × N matrix with the covariances between f(x_i) and f(x_j), for i, j = 1, …, N; and I stands for the identity matrix. A popular covariance function k(·, ·) is the squared exponential covariance function, which assumes that f is smooth [19]. Its parameters θ, together with the noise variance σ², can simply be found via type-II maximum likelihood estimation, i.e., by maximizing the log-marginal likelihood log p(y) [19]. Note, however, that the computational complexity of this approach is 𝒪(N³), since it needs the inversion of K + σ²I, an N × N matrix. This makes GPs unsuitable for large data sets.
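To make the previous expressions concrete, the following sketch computes the predictive mean and variance of (1) and (2) for a squared exponential kernel. It is a minimal illustration rather than the implementation used in the paper; the hyper-parameter values and the toy data are placeholders.

```python
import numpy as np

def sq_exp_kernel(A, B, amplitude=1.0, lengthscale=1.0):
    """Squared exponential kernel k(a, b) = amplitude^2 exp(-||a - b||^2 / (2 lengthscale^2))."""
    d2 = np.sum(A ** 2, 1)[:, None] + np.sum(B ** 2, 1)[None, :] - 2.0 * A @ B.T
    return amplitude ** 2 * np.exp(-0.5 * d2 / lengthscale ** 2)

def gp_predict(X, y, X_star, noise_var=0.1):
    """Exact GP predictive mean and variance of the latent function, Eqs. (1)-(2)."""
    N = X.shape[0]
    K = sq_exp_kernel(X, X)                          # N x N covariance matrix of the training inputs
    K_star = sq_exp_kernel(X_star, X)                # covariances between test and training inputs
    A = np.linalg.solve(K + noise_var * np.eye(N), K_star.T)   # (K + sigma^2 I)^{-1} k_*, the O(N^3) step
    mean = A.T @ y                                   # Eq. (1)
    var = np.diag(sq_exp_kernel(X_star, X_star)) - np.sum(K_star * A.T, axis=1)  # Eq. (2)
    return mean, var

# Toy usage on synthetic 1-dimensional data.
X = np.random.randn(100, 1)
y = np.sin(3 * X[:, 0]) + 0.1 * np.random.randn(100)
mean, var = gp_predict(X, y, np.linspace(-2, 2, 5)[:, None])
```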

2.1 Sparse variational Gaussian processes

Sparse Gaussian processes improve the computational cost of GPs. The most popular approaches introduce, in the same input space as the original data, a new set of M ≪ N points, called the inducing points, denoted by Z = {z_j}_{j=1}^M [25, 26]. Let the corresponding latent function values be u = (f(z_1), …, f(z_M))ᵀ. The inducing points are not restricted to be part of the observed data and their location can be learned during training. A GP prior is placed on u. Namely, p(u) = 𝒩(u | 0, K_ZZ), where K_ZZ is an M × M matrix with the covariances associated with each pair of points in Z. The main idea of sparse approximations is that the posterior for f can be approximated in terms of the posterior for u.

In this work we focus on a widely used variational inference (VI) approach to approximate the posterior for u [26]. Let f = (f(x_1), …, f(x_N))ᵀ. In VI, the goal is to find an approximate posterior for f and u, q(f, u), that resembles as much as possible the true posterior p(f, u | y). Critically, q(f, u) is constrained to be p(f | u) q(u), with p(f | u) fixed and q(u) a tunable multivariate Gaussian. To find q(u), a lower bound on the log-marginal likelihood is maximized. This bound is obtained through Jensen's inequality, leading to the following expression, after some simplifications:

ℒ(q) = ∑_{i=1}^N E_{q(f_i)}[log p(y_i | f_i)] − KL(q(u) ‖ p(u)),   (3)

where p(y_i | f_i) is the model's likelihood, q(f_i) is the marginal of q(f, u) for f_i, and KL(· ‖ ·) is the Kullback-Leibler divergence between probability distributions. In [26], q(u) is optimized in closed form. The resulting expression is then maximized to estimate the inducing points Z, the covariance function parameters θ and the noise variance σ². This leads to a complexity of 𝒪(NM²). However, if the variational posterior q(u) is optimized alongside Z, θ and σ², as proposed in [10], the ELBO can be expressed as a sum over training instances, which allows for mini-batch training and stochastic optimization techniques. Using stochastic variational inference (SVI) reduces the training cost to 𝒪(M³) [10]. Importantly, the first term in (3) is an expectation that has a closed-form solution in the case of Gaussian likelihoods. It needs to be approximated in other cases, e.g., binary classification, either by quadrature or MCMC methods [11]. The second term is the KL-divergence between the variational posterior and the prior, which can be computed analytically since they are both Gaussian.
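As a concrete illustration, the bound in (3) can be evaluated in a few lines for a Gaussian likelihood. The sketch below is a full-batch illustration assuming a free-form Gaussian q(u) = 𝒩(m, S); it reuses the sq_exp_kernel function from the previous snippet, and the jitter term and fixed hyper-parameters are simplifications.

```python
import numpy as np

def svgp_elbo(X, y, Z, m, S, noise_var, kernel):
    """Evidence lower bound of Eq. (3) for a Gaussian likelihood (full batch, for clarity)."""
    M = Z.shape[0]
    Kzz = kernel(Z, Z) + 1e-6 * np.eye(M)        # M x M covariance of the inducing points (with jitter)
    Kzx = kernel(Z, X)                           # M x N cross-covariances
    L = np.linalg.cholesky(Kzz)
    A = np.linalg.solve(Kzz, Kzx)                # Kzz^{-1} Kzx, shape M x N
    mu = A.T @ m                                 # mean of q(f_i) at every training input
    var = (np.diag(kernel(X, X)) - np.sum(Kzx * A, axis=0)
           + np.sum(A * (S @ A), axis=0))        # marginal variance of q(f_i)
    # Expected log-likelihood: closed form for a Gaussian likelihood.
    ell = np.sum(-0.5 * np.log(2.0 * np.pi * noise_var)
                 - 0.5 * ((y - mu) ** 2 + var) / noise_var)
    # KL(q(u) || p(u)) between two multivariate Gaussians.
    alpha = np.linalg.solve(L, m)
    kl = 0.5 * (np.trace(np.linalg.solve(Kzz, S)) + alpha @ alpha - M
                + 2.0 * np.sum(np.log(np.diag(L))) - np.linalg.slogdet(S)[1])
    return ell - kl

# Usage: pass the sq_exp_kernel defined above, e.g.
# elbo = svgp_elbo(X, y, Z, m, S, noise_var=0.1, kernel=sq_exp_kernel)
```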

After optimizing (3), one often observes that the inducing points are located in those regions of the input space in which the latent function changes [26, 12, 2]. In consequence, the expressive power of the sparse GP critically depends on the number of inducing points and on their correct placement. In some learning problems, however, a large number of inducing points, in the order of several hundred, is required to get good prediction results [11, 23, 27]. This makes using sparse GPs in those problems difficult and expensive. In the next section we describe how to reduce the cost of this method.

3 Input dependent sparse GPs

We develop a novel formulation of sparse GPs which, for every given input, computes the corresponding inducing points to be used for prediction, as well as the associated parameters of the approximate distribution q. To achieve this goal we consider a meta-point b that is used to determine the inducing points and the corresponding q. Namely, Z now depends on b, i.e., Z(b). In particular, we let the inducing points Z(b) depend non-linearly, e.g., via a deep neural network, on b. The joint distribution of f and u is then given by p(f, u) = ∫ p(f, u | b) p(b) db, for some prior distribution p(b) over the meta-point. Following [27], we can consider an implicit distribution p(b). That is, its analytical form is unknown, but we can draw samples from it. Later on, we will specify p(b).

Note that the marginalized prior p(f, u) is no longer Gaussian. However, we can show that this formulation does not have an impact on the prior over f. For an arbitrarily selected meta-point b we have that

p(f, u | b) = 𝒩( (f, u)ᵀ | 0, [[K_XX, K_XZ(b)], [K_Z(b)X, K_Z(b)Z(b)]] ),   (4)

where K_XX is the covariance matrix of f, K_Z(b)Z(b) is the covariance matrix of u, and K_XZ(b) contains the cross-covariances between f and u. Therefore, if u is marginalized out in (4), the prior for f is the standard GP prior 𝒩(f | 0, K_XX) and does not depend on b. Hence, p(f) = 𝒩(f | 0, K_XX). This means that p(f, u) is a mixture of Gaussian densities, where the marginal over f is the same for every component of the mixture.

Note, however, that in the standard sparse GP the inducing points also have an impact on the variational approximation via the fixed conditional distribution p(f | u) [26]. Therefore, we also have to incorporate the dependence on b in the approximation q. This is done in the next section.

3.1 Lower bound on the log-marginal likelihood

We follow [27] to derive a lower bound on the log-marginal likelihood of the extended model described above. Consider a posterior approximation of the form q(f, u, b) = p(f | u, Z(b)) q(u | b) q(b), where only q(u | b) can be adjusted and the other factors are fixed. Using the above explanation and the help of Jensen's inequality we obtain, after some simplifications, the lower bound:

ℒ = E_{q(b)}[ ∑_{i=1}^N E_{q(f_i | b)}[log p(y_i | f_i)] − KL(q(u | b) ‖ p(u | Z(b))) ].   (5)

Now, assuming that q(b) is an implicit distribution, we can draw samples from it and approximate the expectation with respect to q(b). Thus, for a meta-point sample b drawn from q(b), the lower bound is approximated as

ℒ̂ = ∑_{i=1}^N E_{q(f_i | b)}[log p(y_i | f_i)] − KL(q(u | b) ‖ p(u | Z(b))).   (6)

Note that we can evaluate (6) and its gradients to maximize the original objective in (5) using stochastic optimization techniques. This is valid for any implicit distribution q(b). Consider now that we use mini-batch-based training for optimization, and that we set q(b) to place the meta-point at the inputs of the current mini-batch. In this case, the value of b remains random, as it depends on the points that are selected in the random mini-batch. This results in a method that computes a different set of inducing points associated with each input location.

Figure 1: The network's input is a data point x_i. The outputs are the inducing points Z(x_i), and the mean vector μ(x_i) and Cholesky factor L(x_i) of q(u | x_i).

3.2 Amortized variational inference and deep neural networks

Maximizing the lower bound finds the optimal approximate distribution q(u | b). A problem, however, is that we have a potentially large number of parameters to adjust, corresponding to each possible b. In particular, if we set q(u | b) to be Gaussian, we will have to infer a different mean and covariance matrix for each different b. This is expected to be memory inefficient and to make optimization difficult.

To reduce the number of parameters of our method we use amortized variational inference and specify a function that can generate these parameters for each b [24]. More precisely, we set the mean and covariance matrix of q(u | b) to be μ(b) and Σ(b), for some non-linear functions μ(·) and Σ(·).

Deep neural networks (DNNs) are very flexible models that can describe complicated non-linear functions. In these models, the inputs go through several layers of non-linear transformations. We use these models to compute the non-linearities that generate, from b, the inducing points Z(b) and the mean and covariance of q(u | b), i.e., μ(b) and Σ(b). The architecture employed is displayed in Figure 1. At the output of the DNN we obtain Z(b), a mean vector μ(b) and the Cholesky factor L(b) of the covariance matrix Σ(b). The maximization of the lower bound in (5) when using DNNs for the non-linearities is shown in Algorithm 1. The required expectations are computed in closed form in regression. In binary classification, we use 1-dimensional quadrature, as in [12].
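As an illustration, a minimal TensorFlow 2 / Keras sketch of an amortization network with these outputs is given below. The number of layers and units, the activation function, and the way the Cholesky factor is parameterized are assumptions for the sketch, not necessarily the authors' exact architecture.

```python
import tensorflow as tf

class InducingNet(tf.keras.Model):
    """DNN that maps an input x to M input-dependent inducing points Z(x), and to the
    mean and Cholesky factor of the Gaussian approximation q(u | x) (cf. Figure 1)."""

    def __init__(self, input_dim, num_inducing, hidden_units=50, hidden_layers=2):
        super().__init__()
        self.M, self.d = num_inducing, input_dim
        self.trunk = tf.keras.Sequential(
            [tf.keras.layers.Dense(hidden_units, activation="tanh")
             for _ in range(hidden_layers)])
        self.z_head = tf.keras.layers.Dense(num_inducing * input_dim)     # inducing point locations
        self.m_head = tf.keras.layers.Dense(num_inducing)                 # mean of q(u | x)
        self.l_head = tf.keras.layers.Dense(num_inducing * num_inducing)  # raw Cholesky entries

    def call(self, x):                                         # x: [batch, d]
        h = self.trunk(x)
        Z = tf.reshape(self.z_head(h), [-1, self.M, self.d])   # [batch, M, d]
        m = self.m_head(h)                                     # [batch, M]
        raw = tf.reshape(self.l_head(h), [-1, self.M, self.M])
        L = tf.linalg.band_part(raw, -1, 0)                    # keep the lower triangle
        L = tf.linalg.set_diag(L, tf.nn.softplus(tf.linalg.diag_part(L)))  # positive diagonal
        return Z, m, L

# Example: 5 inducing points per input for 8-dimensional data.
net = InducingNet(input_dim=8, num_inducing=5)
Z, m, L = net(tf.random.normal([32, 8]))
```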

Input: training data D = {(x_i, y_i)}_{i=1}^N, mini-batch size B, neural network NNet with a given number of hidden layers and hidden units.
Output: optimal parameters of the model.
initialize the parameters, i.e., the kernel's parameters and the neural network's weights
while stopping criterion is False do
     ℒ ← 0
     gather a mini-batch ℬ of size B from D
     for (x_i, y_i) in ℬ do
          Z(x_i), μ(x_i), L(x_i) = NNet(x_i)
          ℒ += E_{q(f_i | x_i)}[log p(y_i | f_i)]
          ℒ −= KL(q(u | x_i) ‖ p(u | Z(x_i)))
     Update the parameters using the gradient of ℒ
Algorithm 1: Training input dependent sparse GPs
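The sketch below fleshes out Algorithm 1 in TensorFlow 2 for regression, reusing the InducingNet class defined above. The kernel hyper-parameters and noise variance are kept fixed for brevity (in practice they would be trained jointly), and the relative weighting of the data term and the per-point KL terms follows the usual stochastic variational inference recipe, which is an assumption; the authors' exact objective may weight these terms differently.

```python
import math
import tensorflow as tf

def sq_exp_tf(A, B, amp=1.0, ls=1.0):
    """Batched squared exponential kernel; A: [..., n, d], B: [..., m, d]."""
    d2 = (tf.reduce_sum(A ** 2, -1)[..., :, None]
          + tf.reduce_sum(B ** 2, -1)[..., None, :]
          - 2.0 * tf.matmul(A, B, transpose_b=True))
    return amp ** 2 * tf.exp(-0.5 * d2 / ls ** 2)

def idsgp_minibatch_loss(net, x, y, noise_var, n_total, jitter=1e-5):
    """Negative stochastic estimate of the bound in (6) for a regression mini-batch."""
    batch = tf.cast(tf.shape(x)[0], tf.float32)
    Z, m, L = net(x)                                   # [B, M, d], [B, M], [B, M, M]
    M = net.M
    Kzz = sq_exp_tf(Z, Z) + jitter * tf.eye(M)         # per-point inducing covariances
    kzx = sq_exp_tf(Z, x[:, None, :])                  # [B, M, 1], covariances with the input itself
    Lzz = tf.linalg.cholesky(Kzz)
    A = tf.linalg.cholesky_solve(Lzz, kzx)             # Kzz^{-1} kzx
    S = tf.matmul(L, L, transpose_b=True)              # covariance of q(u | x)
    mean = tf.squeeze(tf.matmul(A, m[..., None], transpose_a=True), [-1, -2])
    var = (1.0                                         # k(x, x) = amp^2 = 1 for this kernel
           - tf.squeeze(tf.matmul(kzx, A, transpose_a=True), [-1, -2])
           + tf.squeeze(tf.matmul(A, tf.matmul(S, A), transpose_a=True), [-1, -2]))
    # Expected log-likelihood E_{q(f_i)}[log p(y_i | f_i)], closed form in regression.
    ell = (-0.5 * tf.math.log(2.0 * math.pi * noise_var)
           - 0.5 * ((y - mean) ** 2 + var) / noise_var)
    # Per-point KL(q(u | x_i) || p(u | Z(x_i))) between two Gaussians.
    alpha = tf.linalg.triangular_solve(Lzz, m[..., None], lower=True)
    kl = 0.5 * (tf.linalg.trace(tf.linalg.cholesky_solve(Lzz, S))
                + tf.reduce_sum(alpha ** 2, [-1, -2]) - float(M)
                + 2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(Lzz)), -1)
                - 2.0 * tf.reduce_sum(tf.math.log(tf.linalg.diag_part(L)), -1))
    # Assumed SVI-style weighting of the two terms; the paper's exact weighting may differ.
    return -(n_total / batch * tf.reduce_sum(ell) - tf.reduce_sum(kl) / batch)

def train(net, X, y, epochs=20, batch_size=256, noise_var=0.1, learning_rate=1e-2):
    """Mini-batch training loop corresponding to Algorithm 1 (fixed kernel and noise for brevity)."""
    opt = tf.keras.optimizers.Adam(learning_rate)
    n_total = float(len(X))
    data = tf.data.Dataset.from_tensor_slices((X.astype("float32"), y.astype("float32")))
    data = data.shuffle(len(X)).batch(batch_size)
    for _ in range(epochs):
        for xb, yb in data:
            with tf.GradientTape() as tape:
                loss = idsgp_minibatch_loss(net, xb, yb, noise_var, n_total)
            grads = tape.gradient(loss, net.trainable_variables)
            opt.apply_gradients(zip(grads, net.trainable_variables))
    return net

# Hypothetical usage with numpy arrays X ([N, d]) and y ([N]):
# net = InducingNet(input_dim=X.shape[1], num_inducing=5); train(net, X, y)
```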

3.3 Predictions and computational cost

At test time, however, the inputs of interest are not randomly chosen. In that case, we simply set q(b) to be a deterministic distribution placed on the candidate test point x⋆. The DNN is used to obtain the associated relevant information, namely the inducing points Z(x⋆) and the parameters of q(u | x⋆), i.e., μ(x⋆) and Σ(x⋆). The predictive distribution for f⋆ = f(x⋆) is:

p(f⋆ | y) ≈ ∫ p(f⋆ | u, Z(x⋆)) q(u | x⋆) du.   (7)

Given this distribution for f⋆, the probability distribution for y⋆ can be computed in closed form in the case of regression problems and with 1-dimensional quadrature in the case of binary classification.

The cost of our method is smaller than the cost of a standard sparse GP if a smaller number of inducing points is used. The cost of a forward pass through a DNN with L layers, K hidden units, input dimension d and output dimension D is 𝒪(dK + LK² + KD). The cost of the sparse GP computations is 𝒪(BM³), with B the mini-batch size, since each point in the mini-batch has its own set of M inducing points. Therefore, the cost of our method is 𝒪(B(dK + LK² + KD) + BM³). Since in our method the inducing points are input dependent, we expect to obtain good prediction results even for values of M that are fairly small.
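Under the same assumptions as the previous sketches, the predictive distribution in (7) for a batch of test points can be computed by reusing the InducingNet class and the sq_exp_tf kernel defined above; x_star is assumed to be a float32 array of shape [S, d].

```python
import tensorflow as tf

def idsgp_predict(net, x_star, noise_var=0.1, jitter=1e-5):
    """Predictive mean and variance of y_* at test inputs x_star ([S, d]), cf. Eq. (7)."""
    Z, m, L = net(x_star)                              # input-dependent inducing points and q(u | x_*)
    Kzz = sq_exp_tf(Z, Z) + jitter * tf.eye(net.M)
    kzx = sq_exp_tf(Z, x_star[:, None, :])             # [S, M, 1]
    Lzz = tf.linalg.cholesky(Kzz)
    A = tf.linalg.cholesky_solve(Lzz, kzx)             # Kzz^{-1} k_*
    S_u = tf.matmul(L, L, transpose_b=True)            # covariance of q(u | x_*)
    mean = tf.squeeze(tf.matmul(A, m[..., None], transpose_a=True), [-1, -2])
    var_f = (1.0 - tf.squeeze(tf.matmul(kzx, A, transpose_a=True), [-1, -2])
             + tf.squeeze(tf.matmul(A, tf.matmul(S_u, A), transpose_a=True), [-1, -2]))
    return mean, var_f + noise_var                     # add the noise variance for y_*
```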

4 Related work

Early works on sparse GPs simply choose a subset of the training data for inference based on an information criterion [4, 16, 22, 9]. This approach is limited in practice, and more advanced methods in which the inducing points need not be equal to the training points are believed to be superior. In the literature there are several works analyzing and studying sparse GP approximations based on inducing points. Some of these works include [18, 25, 17, 26, 2, 13]. We focus here on a variational approach [26] which allows for stochastic optimization and mini-batch training [10, 12]. This enables learning in very large datasets with a cost of 𝒪(M³), with M the number of inducing points.

In general, however, a large number of inducing variables is desirable to obtain a good approximation to the posterior distribution. For some problems, even several hundreds of inducing points may be needed to get good prediction results [11, 23, 27]. There is hence a need to improve the computational cost of sparse GP approximations without losing expressive power. One work addressing this task is that of [3]. In that work it is proposed to decouple the process of inferring the posterior mean and variance, making it possible to consider a different number of inducing points for each one. Importantly, the computation of the mean achieves a linear complexity, which allows more expressive posterior means at a lower cost. A disadvantage is that such an approach suffers from optimization difficulties. An alternative decoupled parameterization that adopts an orthogonal basis in the mean is proposed in [20]. Such a method can be considered a particular case of [23], which introduces a new interpretation of sparse variational approximations for GPs using inducing points. For this, the GP is decomposed as a sum of two independent processes. This leads to tighter lower bounds on the marginal likelihood and new inference algorithms that allow considering two different sets of inducing points. This enables including more inducing points at a linear, instead of cubic, cost.

Our work is closer to [27], which also describes a mechanism to consider input dependent inducing points in the context of sparse GPs. However, the difference is significant. In particular, in [27] a very large set of inducing points is considered initially. Then, for each input point, a subset of these inducing points is considered. This subset is obtained by finding the inducing points nearest to the current data instance x_i. This approach significantly reduces the cost of the standard sparse GP described in [26]. However, it suffers from the difficulty of having to find the nearest neighbors for each point in a mini-batch, which is very expensive. Therefore, the final cost is higher than one would initially expect. Our method is expected to be better because of the extra flexibility provided by the non-linear relation between x_i and the inducing points computed by the DNN. Furthermore, the DNN can take advantage of GPU acceleration.

A different way of improving the computational cost of GPs is described in [29, 6, 8]. The approach consists in placing the inducing points on a grid structure. This allows fast computations that exploit the structure of the inducing points. One can easily consider values for M that are even larger than N. However, to get such benefits the inducing points need to be fixed due to the structure constraints. This may be detrimental in problems with a high input dimension.

Finally, the use of amortized variational inference [24] has also been explored in the context of GPs in [28]. There, input noise is considered in a multi-class learning problem, and the variational parameters of the posterior approximation for the noiseless inputs are amortized using a DNN that receives the observed inputs and the associated labels. Amortized VI is shown to improve the generalization performance of the resulting model.

5 Experiments

We evaluate the performance of the proposed method, to which we refer as Input Dependent Sparse GP (IDSGP). We consider both regression and binary classification with a probit likelihood. In this latter case, we approximate the expectation in the lower bound using 1-dimensional quadrature, as in [12]. An implementation of the proposed method in TensorFlow 2.0 [1] is provided in the supplementary material. In the experiments we compare results with the standard variational sparse GP [26], to which we refer as VSGP. We also compare results with two of the methods described in Section 4: the sparse within sparse GP (SWSGP) described in [27], and the sparse GP based on an orthogonal decomposition that allows considering two different sets of inducing points [23], referred to as SOLVE. All methods use a Matérn covariance function [19].

5.1 Toy problems

A first set of experiments illustrates the resulting predictive mean and standard deviation of each method on a 1-dimensional regression problem [25]. Each method is compared with a full GP. Figure 2 shows the results obtained, including the learned locations of the inducing points. In the case of IDSGP we show the locations of the inducing points for the point represented with a star. The number of inducing points used by each method is indicated in the figure's caption. IDSGP uses a smaller number of inducing points than the other methods. The figure shows that, in regions with observed data, IDSGP outputs a mean and a standard deviation that look closer to those of the full GP.

Figure 3 shows similar results for the banana classification dataset [10]. We show here the resulting decision boundary of each method. We observe that IDSGP produces the most accurate boundaries.

Figure 2: Toy dataset. Initial and final locations of the inducing points are shown at the top and bottom of each panel. In IDSGP, the inducing points shown correspond to the point drawn with a star. The posterior mean and standard deviation of the full GP are shown with blue and brown dashed lines, respectively. IDSGP uses fewer inducing points than VSGP, SWSGP and SOLVE, and a neural network with 2 layers of 50 units each; SWSGP additionally uses a set of nearest neighbors.
Figure 3: Banana classification dataset. The final locations of the inducing points are shown inside the figures. For IDSGP, we show the location of the inducing points related to the green colored point. IDSGP uses fewer inducing points than VSGP, SWSGP and SOLVE, and a neural network with 2 hidden layers of 50 units each; SWSGP uses 2 neighbors.

5.2 Experiments on UCI datasets

A second set of experiments considers several regression and binary classification datasets extracted from the UCI repository [5]. In the regression setting, IDSGP uses a small DNN and a number of inducing points that is much smaller than the number used by VSGP, SOLVE and SWSGP; SWSGP additionally uses a fixed number of nearest neighbors. All the methods are trained using ADAM [14] with the same mini-batch size and learning rate. In the classification setting we use the same setup, except for the number of inducing points of IDSGP. All methods are trained on a Tesla P100 GPU with 16GB of memory. On each dataset we use most of the data for training and the rest for testing. We report results across 5 splits of the data since the datasets are already quite big.

The average neg. test log-likelihood of each method on each dataset is displayed in Table 1, for the regression datasets, and in Table 2, for the classification datasets, respectively. The average rank of each method is also displayed in the last row of each table. The RMSE and prediction accuracy results are similar to those displayed here and can be found in the supplementary material. Each table also shows the number of instances and dimensions of each dataset. We observe that in the regression datasets the proposed method, IDSGP, obtains the best results in 6 out of the 8 datasets. IDSGP also obtains the best average rank, i.e., it is the method closest to always performing best on each train / test data split. This is remarkable given that IDSGP uses a much smaller number of inducing points than the other methods. In classification, however, all the methods seem to perform similarly and the differences between them are smaller. Again, IDSGP uses a smaller number of inducing points here. Increasing this number does not seem to improve the results.

Dataset N d VSGP SOLVE SWSGP IDSGP
Kin40k 32,000 8 -0.047 (0.003) -0.415 (0.006) -0.110 (0.007) -1.461 (0.019)
Protein 36,584 9 2.848 (0.002) 2.818 (0.003) 2.835 (0.002) 2.775 (0.007)
KeggDirected 42,730 19 -1.955 (0.013) -1.756 (0.073) -2.256 (0.012) -2.410 (0.012)
KEGGU 51,686 26 -2.344 (0.012) -2.531 (0.015) -2.396 (0.006) -2.908 (0.042)
3dRoad 347,899 3 3.691 (0.006) 3.726 (0.010) 3.879 (0.026) 3.399 (0.009)
Song 412,276 90 3.613 (0.003) 3.608 (0.002) 3.618 (0.004) 3.637 (0.002)
Buzz 466,600 77 6.272 (0.012) 6.297 (0.009) 6.137 (0.008) 6.317 (0.055)
HouseElectric 1,639,424 6 -1.737 (0.006) -1.743 (0.005) -1.711 (0.010) -1.774 (0.004)
Avg. Ranks 3.125 (0.125) 2.475 (0.156) 2.850 (0.150) 1.550 (0.172)
Table 1: Avg. neg. test log-likelihood values for the UCI regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted in bold face.

Dataset N d VSGP SOLVE SWSGP IDSGP
MagicGamma 15,216 10 0.308 (0.004) 0.314 (0.005) 0.371 (0.005) 0.311 (0.002)
DefaultOrCredit 24,000 30 0.000 (0.000) 0.000 (0.000) 0.000 (0.000) 0.000 (0.000)
NOMAO 27,572 174 0.113 (0.004) 0.103 (0.004) 0.134 (0.004) 0.121 (0.004)
BankMarketing 36,169 51 0.206 (0.001) 0.199 (0.001) 0.304 (0.021) 0.209 (0.002)
Miniboone 104,051 50 0.151 (0.001) 0.142 (0.001) 0.180 (0.007) 0.153 (0.001)
Skin 196,046 3 0.005 (0.000) 0.005 (0.000) 0.006 (0.001) 0.003 (0.000)
Crop 260,667 174 0.003 (0.000) 0.003 (0.000) 0.002 (0.000) 0.003 (0.000)
HTSensor 743,193 11 0.003 (0.001) 0.001 (0.000) 0.030 (0.009) 0.005 (0.001)
Avg. Ranks 2.425 (0.143) 1.775 (0.158) 3.175 (0.182) 2.625 (0.155)
Table 2: Avg. test neg. log-likelihood values for the UCI classification datasets. The numbers in parentheses are standard errors. Best mean values are highlighted in bold face.

In these experiments we also measure the average training time per epoch for each method. The results corresponding to the UCI regression datasets are displayed in Table 3. The results for the UCI classification datasets are found in the supplementary material and look very similar to the ones reported here. We observe that the fastest method in terms of training time is the proposed approach, IDSGP, even though the speed-up obtained is partially offset by the overhead of having to compute the output of the DNN and update its parameters. IDSGP also results in faster prediction times than VSGP, SOLVE and SWSGP. See the supplementary material for further details.

Kin40k Protein KeggDirected KEGGU 3dRoad Song Buzz HouseElectric
VSGP 591.7 (0.58) 737.2 (1.16) 932.7 (2.56) 1128.1 (3.78) 7880.9 (66.79) 9777.7 (42.84) 9901.0 (146.07) 32784.2 (190.18)
SOLVE 1739.3 (0.45) 2015.9 (0.66) 2357.3 (1.70) 2909.1 (1.19) 19567.1 (10.34) 23196.6 (98.35) 25769.5 (20.12) 92214.9 (452.18)
SWSGP 875.7 (0.68) 1023.5 (0.35) 1220.6 (1.89) 1458.0 (5.57) 10203.4 (12.03) 12241.7 (62.01) 13371.5 (12.34) 46163.3 (427.23)
IDSGP 190.3 (0.75) 371.5 (1.25) 533.0 (1.73) 693.7 (5.77) 4070.1 (201.09) 4296.5 (25.03) 3640.4 (33.36) 16352.2 (90.15)
Table 3: Average training time per epoch across the 5 splits for the UCI regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.

5.3 Large scale datasets

A last set of experiments considers two very large datasets. The first is the Airlines Delay binary classification dataset, as described in [13]. The second is the Yellow taxi regression dataset, as described in [21]. In each dataset we use a held-out test set chosen at random. The DNN architecture for IDSGP has 2 hidden layers with 25 units each, and the number of inducing points of IDSGP is again small; in the other methods, we use the same number of inducing points as in the previous section. The mini-batch size and the ADAM learning rate are kept fixed across methods. Training is also performed on the same GPU as in the previous section.

The average negative test log-likelihood of each method on the test set is displayed in Figure 4, for each dataset. We report performance as a function of the training time, in a logarithmic scale. The results in terms of the test RMSE and classification error are very similar to the ones displayed here and can be found in the supplementary material. We observe that the proposed method IDSGP performs best on each dataset. In particular, it obtains better performance in a smaller computational time. We believe this is a consequence of using a smaller number of inducing points, and also of the extra flexibility that the DNN provides for specifying the locations of the inducing points.

Figure 4: Negative log-likelihood on the test set for each method as a function of the training time in seconds, in logarithmic scale, for the Yellow taxi and the Airlines Delay datasets. Best seen in color.

6 Conclusions

Gaussian processes (GPs) are flexible models for regression and classification. However, they have a cost of 𝒪(N³), with N the number of training points. Sparse approximations based on M ≪ N inducing points reduce such a cost to 𝒪(M³). A problem, however, is that in some situations a large number of inducing points has to be used in practice, since they determine the flexibility of the resulting approximation. There is hence a need to reduce their computational cost.

We have proposed here input dependent sparse GP (IDSGP), a method that can improve the training time and the flexibility of sparse GP approximations. IDSGP uses a deep neural network (DNN) to output specific inducing points for each point at which the predictive distribution of the GP needs to be computed. The DNN also outputs the parameters of the corresponding variational approximation on the inducing values associated with the inducing points. IDSGP can be obtained under a formulation that considers an implicit distribution for the input instance to the DNN. Importantly, such a formulation is shown to keep intact the GP prior on the latent function values associated with the training points.

The extra flexibility provided by the DNN allows a significant reduction in the number of inducing points used by IDSGP. The resulting model provides results that are similar to or better than those of other sparse GP approximations from the literature, at a smaller computational cost. IDSGP has been evaluated on several regression and binary classification problems from the UCI repository. The results obtained show that it improves the quality of the predictive distribution and reduces the computational cost of sparse GP approximations. Better results are obtained most of the time in regression problems. In classification problems, however, the performance obtained is similar to that of the state-of-the-art, although the training and prediction times are always shorter. The scalability of IDSGP is also illustrated on massive regression and binary classification datasets with up to a billion points. There, IDSGP also obtains better results than alternative sparse GP approximations at a smaller training cost.

The authors gratefully acknowledge the use of the facilities of Centro de Computación Científica (CCC) at Universidad Autónoma de Madrid. The authors also acknowledge financial support from Spanish Plan Nacional I+D+i, grant PID2019-106827GB-I00 / AEI / 10.13039/501100011033.

References

  • [1] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S. Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar, P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng (2015) TensorFlow: large-scale machine learning on heterogeneous systems. Software available from tensorflow.org. Cited by: §5.
  • [2] M. Bauer, M. van der Wilk, and C. E. Rasmussen (2016) Understanding probabilistic sparse Gaussian process approximations. In Advances in Neural Information Processing Systems 29, pp. 1533–1541. Cited by: §1, §2.1, §4.
  • [3] C.-A. Cheng and B. Boots (2017) Variational inference for Gaussian process models with linear complexity. In Advances in Neural Information Processing Systems, pp. 5184–5194. Cited by: §1, §4.
  • [4] L. Csató and M. Opper (2002) Sparse on-line Gaussian processes. Neural Computation 14, pp. 641–668. Cited by: §4.
  • [5] D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: §5.2.
  • [6] T. Evans and P. Nair (2018) Scalable Gaussian processes with grid-structured eigenfunctions (GP-GRIEF). In International Conference on Machine Learning, pp. 1417–1426. Cited by: §4.
  • [7] Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, University of Cambridge. Cited by: §1.
  • [8] J. Gardner, G. Pleiss, R. Wu, K. Weinberger, and A. G. Wilson (2018) Product kernel interpolation for scalable Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1407–1416. Cited by: §4.
  • [9] R. Henao and O. Winther (2012) Predictive active set selection methods for Gaussian processes. Neurocomputing 80, pp. 10–18. Cited by: §4.
  • [10] J. Hensman, N. Fusi, and N. D. Lawrence (2013) Gaussian processes for big data. In Uncertainty in Artificial Intelligence, pp. 282–290. Cited by: §2.1, §4, §5.1.
  • [11] J. Hensman, A. G. Matthews, M. Filippone, and Z. Ghahramani (2015) MCMC for variationally sparse Gaussian processes. In Advances in Neural Information Processing Systems 28, pp. 1648–1656. Cited by: §1, §1, §2.1, §2.1, §4.
  • [12] J. Hensman, A. Matthews, and Z. Ghahramani (2015) Scalable variational Gaussian process classification. In International Conference on Artificial Intelligence and Statistics, pp. 351–360. Cited by: §1, §2.1, §3.2, §4, §5.
  • [13] D. Hernández-Lobato and J. M. Hernández-Lobato (2016) Scalable Gaussian process classification via expectation propagation. In Artificial Intelligence and Statistics, pp. 168–176. Cited by: §4, §5.3.
  • [14] D. P. Kingma and J. Ba (2015) ADAM: a method for stochastic optimization. In International Conference on Learning Representations, pp. 1–15. Cited by: §5.2.
  • [15] D. P. Kingma and M. Welling (2014) Auto-encoding variational Bayes. In International Conference on Learning Representations, Cited by: §1.
  • [16] N. Lawrence, M. Seeger, and R. Herbrich (2003) Fast sparse Gaussian process methods: the informative vector machine. In Neural Information Processing Systems, pp. 609–616. Cited by: §4.
  • [17] A. Naish-Guzman and S. Holden (2007) The generalized FITC approximation. Advances in neural information processing systems 20, pp. 1057–1064. Cited by: §4.
  • [18] J. Quiñonero-Candela and C. E. Rasmussen (2005) A unifying view of sparse approximate Gaussian process regression. Journal of Machine Learning Research 6, pp. 1939–1959. Cited by: §1, §4.
  • [19] C. E. Rasmussen and C. K. I. Williams (2006) Gaussian processes for machine learning (adaptive computation and machine learning). The MIT Press. Cited by: §1, §1, §2, §5.
  • [20] H. Salimbeni, C.-A. Cheng, B. Boots, and M. Deisenroth (2018) Orthogonally decoupled variational Gaussian processes. In Advances in Neural Information Processing Systems, Vol. 31, pp. 8725–8734. Cited by: §4.
  • [21] H. Salimbeni and M. Deisenroth (2017) Doubly stochastic variational inference for deep Gaussian processes. In Advances in Neural Information Processing Systems, pp. 4588–4599. Cited by: §5.3.
  • [22] M. W. Seeger, C. K. I. Williams, and N. D. Lawrence (2003) Fast forward selection to speed up sparse Gaussian process regression. In Proceedings of the Ninth International Workshop on Artificial Intelligence and Statistics, pp. 254–261. Cited by: §4.
  • [23] M. Shi and A. Mnih (2020) Sparse orthogonal variational inference for Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 1932–1942. Cited by: §1, §1, §1, §2.1, §4, §5.
  • [24] R. Shu, H. H. Bui, S. Zhao, M. J. Kochenderfer, and S. Ermon (2018) Amortized inference regularization. In Advances in Neural Information Processing Systems, pp. 4393–4402. Cited by: §1, §3.2, §4.
  • [25] E. Snelson and Z. Ghahramani (2006) Sparse Gaussian processes using pseudo-inputs. In Advances in Neural Information Processing Systems, pp. 1257–1264. Cited by: §1, §1, §2.1, §4, §5.1.
  • [26] M.K. Titsias (2009) Variational learning of inducing variables in sparse Gaussian processes. In International Conference on Artificial Intelligence and Statistics, pp. 567–574. Cited by: §1, §1, §1, §2.1, §2.1, §2.1, §3, §4, §4, §5.
  • [27] G.-L. Tran, D. Milios, P. Michiardi, and M. Filippone (2020) Sparse within sparse Gaussian processes using neighbor information. arXiv:2011.05041 [stat.ML]. Cited by: §1, §1, §1, §1, §2.1, §3.1, §3, §4, §4, §5.
  • [28] C. Villacampa-Calvo, B. Zaldívar, E. C. Garrido-Merchán, and D. Hernández-Lobato (2021) Multi-class gaussian process classification with noisy inputs. Journal of Machine Learning Research 22, pp. 1–52. Cited by: §1, §4.
  • [29] A. G. Wilson and H. Nickisch (2015) Kernel interpolation for scalable structured Gaussian processes (kiss-gp). In International Conference on International Conference on Machine Learning, pp. 1775–1784. Cited by: §4.

Appendix A Potential negative societal impacts

As for the potential negative societal impacts, because this paper focuses on the development of a new methodology, we believe these would arise indirectly, through the particular application in which the proposed method is used. Since one of the main advantages of GPs is that they provide uncertainty estimates associated with the predictions made, we think the potential harm of these models could arise in applications where this uncertainty estimation is critical, for example, AI systems whose decisions can have an influence on people's lives, such as autonomous vehicles or automatic medical diagnosis tools.

Appendix B Extra experimental results

In this section, we include some extra results that did not fit in the main manuscript. Namely, the test RMSE results and prediction times for the UCI regression datasets, and the test accuracy, training times and prediction times for the UCI classification datasets. In both cases, the setup is the same as described in Section 5, and the results are similar to the ones obtained in terms of the negative test log-likelihood and training times in that section. Finally, we include plots similar to those in Section 5.3, but in terms of the test RMSE for the Yellow Taxi dataset and the test classification error for the Airlines Delays dataset.

B.1 UCI regression datasets

Dataset N d VSGP SOLVE SWSGP IDSGP
Kin40k 32,000 8 0.198 (0.002) 0.157 (0.001) 0.215 (0.002) 0.050 (0.002)
Protein 36,584 9 4.161 (0.011) 4.062 (0.011) 4.133 (0.008) 3.756 (0.019)
KeggDirected 42,730 19 0.032 (0.001) 0.079 (0.032) 0.024 (0.000) 0.022 (0.001)
KEGGU 51,686 26 0.024 (0.000) 0.020 (0.000) 0.022 (0.000) 0.014 (0.000)
3dRoad 347,899 3 9.641 (0.063) 10.020 (0.095) 11.726 (0.327) 7.250 (0.069)
Song 412,276 90 8.966 (0.022) 8.925 (0.020) 9.013 (0.029) 9.068 (0.011)
Buzz 466,600 77 175.076 (15.021) 173.352 (14.957) 160.744 (13.467) 166.784 (18.040)
HouseElectric 1,639,424 6 0.035 (0.000) 0.034 (0.000) 0.036 (0.001) 0.032 (0.000)
Avg. Ranks 3.075 (0.126) 2.400 (0.138) 3.025 (0.170) 1.500 (0.151)
Table 4: Test Root Mean Squared Error (RMSE) values for the UCI regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.
Kin40k Protein KeggDirected KEGGU 3dRoad Song Buzz HouseElectric
VSGP 0.9(0.00) 1.1(0.00) 1.4(0.01) 1.7(0.01) 11.6(0.07) 14.5(0.08) 18.0(0.08) 48.3(0.12)
SOLVE 2.4(0.00) 2.8(0.00) 3.2(0.00) 4.0(0.00) 27.0(0.04) 31.8(0.19) 37.0(0.03) 127.7(0.43)
SWSGP 1.3(0.00) 1.5(0.00) 1.8(0.01) 2.1(0.01) 14.8(0.03) 17.6(0.02) 21.3(0.10) 66.2(0.88)
IDSGP 0.3(0.00) 0.5(0.00) 0.7(0.00) 1.0(0.02) 5.7(0.26) 5.6(0.05) 8.1(0.08) 22.1(0.14)
Table 5: Average prediction time per epoch across the 5 splits for the UCI regression datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.

B.2 UCI classification datasets

Dataset N d VSGP SOLVE SWSGP IDSGP
MagicGamma 15,216 10 0.876 (0.001) 0.877 (0.002) 0.867 (0.002) 0.877 (0.002)
DefaultOrCredit 24,000 30 1.000 (0.000) 1.000 (0.000) 1.000 (0.000) 1.000 (0.000)
NOMAO 27,572 174 0.956 (0.002) 0.960 (0.001) 0.961 (0.001) 0.955 (0.001)
BankMarketing 36,169 51 0.906 (0.001) 0.907 (0.001) 0.897 (0.001) 0.905 (0.001)
Miniboone 104,051 50 0.941 (0.001) 0.945 (0.001) 0.938 (0.000) 0.937 (0.001)
Skin 196,046 3 0.999 (0.000) 0.999 (0.000) 0.999 (0.000) 0.999 (0.000)
Crop 260,667 174 0.999 (0.000) 0.999 (0.000) 0.999 (0.000) 0.999 (0.000)
HTSensor 743,193 11 0.999 (0.000) 1.000 (0.000) 0.989 (0.003) 0.999 (0.000)
Avg. Ranks 2.475 (0.127) 1.975 (0.152) 2.900 (0.187) 2.650 (0.168)
Table 6: Test Accuracy values for the UCI classification datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.
Magic DefaultOrCredit NOMAO BankMarket Miniboone Skin Crop HTSensor
VSGP 3105(459) 4759(516) 4445(549) 6231(862) 18447(1279) 37835(7065) 49962(9292) 115463(17511)
SOLVE 5154(1061) 7554(1039) 6718(1028) 8949(1635) 37022(7902) 64606(13314) 88819(18864) 168709(21194)
SWSGP 1547(145) 2354(182) 2728(188) 3682(351) 10040(347) 20283(2796) 21770(3038) 67687(5880)
IDSGP 1143(100) 1293(90) 2026(94) 2987(354) 7654(134) 15700(1918) 21378(2561) 53895(5652)
Table 7: Average training time per epoch across the 5 splits for the UCI classification datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.
MagicGamma DefaultOrCredit NOMAO BankMarketing Miniboone Skin Crop HTSensor
VSGP 3.6(0.56) 4.1(0.59) 4.4(0.74) 9.5(4.39) 17.9(1.84) 49.7(11.15) 58.7(12.82) 139.7(41.67)
SOLVE 4.0(0.85) 5.4(0.73) 3.4(0.76) 4.9(1.24) 49.2(17.83) 48.8(16.73) 86.8(36.26) 93.5(15.73)
SWSGP 3.0(0.43) 3.9(0.38) 4.3(0.51) 5.4(0.72) 16.4(1.82) 36.0(8.00) 33.9(6.25) 88.4(9.16)
IDSGP 2.5(0.24) 2.5(0.21) 3.5(0.39) 4.8(0.54) 13.9(0.75) 26.1(4.92) 37.5(4.96) 83.4(8.23)
Table 8: Average prediction time per epoch across the 5 splits for the UCI classification datasets. The numbers in parentheses are standard errors. Best mean values are highlighted.

B.3 Large scale datasets

Figure 5: RMSE on the test set for each method as a function of the training time in seconds, in logarithmic scale, for the Yellow taxi dataset. Best seen in color.
Figure 6: Test classification error for each method as a function of the training time in seconds, in logarithmic scale, for the Airlines Delays dataset. Best seen in color.