Direct loss minimization for sparse Gaussian processes

04/07/2020 ∙ by Yadi Wei, et al. ∙ Microsoft Indiana University 0

The Gaussian process (GP) is an attractive Bayesian model for machine learning which combines an elegant formulation with model flexibility and uncertainty quantification. Sparse Gaussian process (sGP) algorithms provide an approximate solution that mitigates the high computational complexity of GP and the variational approximation is the current best practice for such approximations. Recent theoretical work has shown that an alternative approach, direct loss minimization (DLM), which directly minimizes predictive loss, comes with strong guarantees on the expected loss of the algorithm. In this paper we explore this approach experimentally. We develop the DLM algorithm for sGP and show that with appropriate hyperparameter optimization it provides a significant improvement over the variational approach. In particular, optimizing sGP for log loss provides better calibrated predictions for regression, classification and count prediction, and optimizing sGP for square loss improves the mean square error in regression.



There are no comments yet.


page 1

page 2

page 3

page 4

This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

1 Introduction

Bayesian models provide an attractive approach for learning from data. Assuming that model assumptions are correct, given the data and prior one can calculate a posterior distribution that compactly captures all our knowledge about the problem. Then, given a prediction task with an associated loss for wrong predictions, we can pick the best action according to our posterior. This is less clear, however, when exact inference is not possible. As argued by several authors (e.g., Lacoste-Julien et al. (2011); Stoyanov et al. (2011)

), in this case it makes sense to optimize the choice of approximate posterior so as to minimize the expected loss of the learner in the future. This requires using the loss function directly during training of the model. We call this approach

direct loss minimization (DLM).

The Gaussian process (GP) model (Rasmussen and Williams, 2006) provides a flexible Bayesian model capturing functions over arbitrary spaces. For the case of regression, GPs provide closed-form inference procedures and many approximations exist for the other likelihoods, for example for binary classification. The main problem with using GPs is their computational complexity since solutions are cubic in the number of examples . Sparse GP solutions reduce this complexity to where is the number of pseudo inputs which serve as an approximate sufficient statistic for prediction. The two approaches most widely used are FITC (Snelson and Ghahramani, 2006) and the variational solution of Titsias (2009). The variational solution has been extended for large datasets and general likelihoods and is known as SVGP (Hensman et al., 2013, 2015; Sheth et al., 2015); see further discussion in Bauer et al. (2016). In sGP, the GP prior jointly generates the pseudo values and the latent variables which we write as and the observations are generated from the likelihood model . The SVGP solution minimizes the negative ELBO

where is the variational distribution over the inducing values, is the corresponding GP prior, is the KL divergence and (but other values of are discussed below). This shows that SVGP can be seen as performing regularized loss minimization (RLM) with loss given by and as the regularizer. These connections have been observed before and are attractive from a theoretical perspective because RLM is known to have good generalization performance in some cases; see for example (Shalev-Shwartz and Ben-David, 2014) and applications by Sheth and Khardon (2017). However, as pointed out by Sheth and Khardon (2019) the loss used in SVGP is not matched to the intended loss when using its result. When using the log loss of the Bayesian predictor the correct regularized loss term is

where is the predictive distribution for . On the other hand, if we care about a different loss, for example square loss for regression, the criterion will change. For square loss, we first have to calculate the optimal prediction given our posterior (further details given below). The training criterion then becomes

This distinction is in contrast with some previous work that aims to find the “best posterior” without regard to its intended use. This illustrates the point argued above: a pure Bayesian procedure with a correctly specified model and exact inference can afford to separate learning from prediction because the posterior captures all the information we have. But approximate methods do not have this luxury and must adapt the posterior to the loss.

The use of DLM was suggested by Sheth and Khardon (2017) who first demonstrated its utility in the correlated topic model (Blei and Lafferty, 2006). Risk bounds for DLM on sGP were given by Sheth and Khardon (2019). Motivated by this prior work, in this paper we explore these ideas empirically. In particular, we investigate the use of DLM for sparse GP models, including the case of regression, and non-conjugate prediction with binary and count observations.

To fully describe the algorithm recall that SVGP and DLM algorithms for sGP optimize both the posterior on pseudo values and any hyperparameters of the GP. In preliminary experiments we have found that, while it works in many cases, hyperparameters optimization of DLM can be sensitive for some datasets especially when training size is small. In addition, the theoretical results of Sheth and Khardon (2019) claim bounds for optimization of the posterior, for any setting of hyperparameters, but they do not cover hyperparameter optimization. We therefore propose a variant of DLM where we obtain hyperparameters using a stable method (SVGP in our experiments) and then obtain using DLM. We show that the results of this approach are significantly better than the accepted best practice given by SVGP. In particular, when optimizing for the corresponding loss, the DLM algorithm shows significantly better log loss in regression, classification and count prediction and significantly better square loss for regression problems. Thus the main contribution of this paper is empirical in demonstrating the strength of DLM in sGP models.

The proposed algorithm raises one additional consideration. Optimizing the log loss DLM objective requires the computation of the loss term. When the log predictive probability

is analytically tractable this causes no problem. This is the case for regression and for binary classification through probit regression. However, other cases, for example count regression, require computation of

through Monte-Carlo approximation or quadrature which leads to biased gradients. While much previous work in machine learning assumes that gradients must be unbiased here we show that our method works even with biased gradients if enough samples are used to estimate the gradients. Our experiments show that this holds both for binary classification and for count prediction. To explore this potential further, we point out previous analysis of stochastic gradient descent that allows for biased gradients

(Bertsekas and Tsitsiklis, 1996). We present an experimental study for the binary case, where we can compare the stochastic biased gradients with the exact computation, and show that conditions for such convergence may hold in practice. While this does not prove convergence we believe that it is a promising direction for future work.

To summarize, the paper develops practical DLM algorithms for sparse GP and shows their empirical success in a range of problems. In addition, the paper develops empirical observations on learning with biased stochastic gradients in Bayesian models which may be of independent interest.

2 Optimization Objectives and Algorithms

A Gaussian process generates a random function where, given any finite set of sample points

, the function values are jointly normally distributed as

. Here is the mean function and

is a vector with entries

, is the covariance function and is a matrix with entries , and together specify the GP. The solution for GP can be developed and is well known but its complexity is where is the number of examples. Variational sparse GP (Titsias, 2009) replaces the full solution with a cheaper alternative as follows. We first augment the model with new locations , and unobserved function evaluations at these points . The GP prior jointly generates and we have observations generated from the likelihood model . The marginal likelihood is then approximated through a variational approximation with . Here the first portion is assumed to take the form with parameters . The second portion in the approximate posterior does not allow flexibility and is fixed to . We then have the evidence lower bound (ELBO):


is the Kullback-Leibler divergence. The SVGP learning algorithm maximizes the variational parameters

and likelihood and kernel hyperparameters . Let . When is not available analytically one can use Monte Carlo estimates of the objective and this is often combined with mini-batch sampling (subsampling over ) to speed up convergence for large datasets (Hensman et al., 2013, 2015; Sheth and Khardon, 2016).

As discussed in the introduction this can be viewed as a regularized loss minimization, but viewed in this manner the loss on example is assumed to be which is not the intended process for a Bayesian predictor. Instead, given a posterior, the Bayesian algorithm first calculates its predictive distribution and then suffers a loss that depends on the context in which the algorithm is used. For the case of log-loss this term is . Replacing the loss term yields the LogLoss DLM objective from the introduction.

Comparing DLM to SVGP we see that the main difference is the log term which is applied before the two expectations. Sheth and Khardon (2017) considered a third alternative where the log term is applied after one expectation, yielding the objective

They showed that for sGP regression with fixed optimizing this objective yields the same as the FITC algorithm.

In this paper we consider DLM for sGP for regression, binary prediction through Probit regression and count prediction through Poisson regression. Since both and

are Gaussian distributions, the marginal

is also Gaussian with mean

, variance

where we use the standard notation using subscripts for arguments of the kernel function, where , etc.

Figure 1: Comparison of DLM with two settings. Plots on the left show results with all parameters optimized by the DLM objective. Plot on the right show results for DLM with hyperparameters chosen by SVGP.
Figure 2: sGP Regression: Left and middle columns show a comparison of SVGP, FITC and DLM on average Negative Log Likelihood (NLL) in 4 datasets. The right column shows NLL as a function of for cadata for a small training size and a large training size. In all plots, lower values imply better performance.

Regression: In this case we have and we can compute . The regression DLM objective is therefore


Classification: For probit regression where is the CDF of the standard normal distribution. Here we have for , . The regression DLM objective is therefore


Count prediction and general case: We also explore the case where cannot be computed directly. For example, for Poisson regression (with log link function) we have and we do not have a closed form for . In such cases we propose to approximate where . This is obviously a biased estimate of the objective which will be inherited by gradient computations during the optimization. We leave derivation of unbiased gradient samples for future work. In this paper we explore the potential of using the biased approximation to optimize sGP. We demonstrate this both for Poisson regression and for the binary case where we can compare the results to the exact computation.

Square loss: Finally, we consider the case where the model is specified using regression but the loss function is the square loss. Here is similar to the regression case. For the square loss the optimal prediction is the mean of the predictive distribution, that is . Therefore the square loss objective is


which simplifies into an objective that depends only on :


From Optimization Objectives to Algorithms:

We optimize all models with automatic differentiation with optimization settings described below with two exceptions. Preliminary experiments with joint optimization of variational parameters and hyperparameters in DLM objectives showed that it is successful in many problems but that in some specific cases the optimization is not stable. We suspect that this is due to interaction between optimization of variational parameters and hyperparameters which therefore complicates an experimental comparison. To avoid this complication in this paper we evaluate all algorithms with the same hyperparameters and pseudo inputs. Since SVGP is stable and the current best practice we first learn the model using SVGP and then use the hyperparameter and pseudo input locations found by SVGP for all algorithms. This guarantees stable results. It also makes sense as an experimental setup with a comparison to SVGP as a baseline because any improvement over SVGP is due to better calibrated computation of . In addition to these, our algorithm uses a validation set to select the value for the regularization parameter . Prior theoretical work does not have a clear recommendation and include results for , the standard setting , and . Our algorithm uses a grid search with exponential decay , trains the DLM objective with such regularization, selects based on the validation set, before testing on the independent test data. In some experiments below we diverge from this and present results for specific values of .

Figure 3: Square loss in sGP Regression: Comparison of SVGP, FITC, DLM and SQ_DLM in Mean Squared Error (MSE). In all plots, lower values imply better performance.
Figure 4: sGP Classification: Comparison of SVGP and DLM in average negative log likelihood. In all plots, lower values imply better performance.

3 Related Work

Two lines of prior work have explored related ideas from a theoretical perspective. The first aims to show that the approximations recover exact inference under some conditions. This includes, for example, consistency results for variational inference (Wang and Blei, 2017, 2019) and the Laplace approximation (Dehaene, 2017). For sparse GP, the work of Burt et al. (2019) shows that this holds when using the RBF kernel, picking appropriately, and selecting pseudo inputs in a smart manner. The work of Alquier et al. (2016) connects variational inference and PAC Bayes theory and also formulates conditions under which the approximation is close to the true posterior. On the other hand, Alquier et al. (2016); Sheth and Khardon (2017, 2019) develop agnostic PAC learning guarantees on the loss of variational and DLM algorithms. In these results the algorithm is compared to the “best in class” where, in the context of this paper, this refers to the “best sparse GP posterior”. These results suggest that under some conditions certain DLM methods for selecting the posterior in sGP have risk which as ”as good as any sparse GP posterior”. This paper can be seen as an empirical validation of such results.

Sparse GPs have received significant attention in the last few years. Bauer et al. (2016) compare and investigate the performance of SVGP and FITC and provide many insights. Their observations on difficulties in the optimization of the variance in FITC might have parallels in DLM. Our experimental setup gets around this issue to yield a stable comparison. Reeb et al. (2018) develop a new sGP algorithm by optimizing a PAC-Bayes bound. The output of their algorithm is chosen in a manner that provides better upper bound guarantees on its true error, but the actual test error is not improved over SVGP. The work of Samilbeni et al. (2018) develops a novel variant of SVGP that uses different pseudo locations for and . Since is linear in the size of and is quadratic this allows their algorithm to use a larger set of pseudo inputs for and improve predictive accuracy. This idea is orthogonal to the development of DLM and can potentially be combined with it.

Finally, in recent work that was developed independently from ours, Jankowiak et al. (2019) have proposed an algorithm which is identical to DLM for sGP for the case of regression and showed some empirical success. However, they were not aware of prior work or the theoretical perspective and their results do not explore the full range of DLM objectives. In addition, their algorithm is developed only for the case of regression and its parameter optimization may suffer from the issues mentioned above.

Figure 5: Comparison of SVGP and DLM on the binary classification airline dataset. On the left is negative log predictive probability and on the right is error rate. In both plots, lower values imply better performance.
Figure 6: Comparison of negative log predictive probability for SVGP and DLM with 10 (left) and 100 (right) Monte Carlo samples per datapoint on the binary classification airline dataset.
Figure 7: Statistics for calculation of biased gradients for the Cholesky parameter for binary classification. Left: condition c. Middle: condition d. Right: estimate of bias. (Note, denotes the variable described in the main text.)
Figure 8: sGP Count Prediction: Comparison of SVGP and DLM with 10 MC samples in terms of average negative log likelihood. In all plots, lower values imply better performance.

4 Experiments

For regression, the algorithms are implemented in PyTorch. DLM is implemented as described above. Where simplified objectives are available, specifically regression ELBO for SVGP and regression objective for FITC, we implement these so-called collapsed forms that require less optimization than the forms presented above

(Snelson and Ghahramani, 2006; Titsias, 2009; Bauer et al., 2016). For classification and count prediction, we extend the implementation from GPyTorch (Gardner et al., 2018). The optimization setup w.r.t. hyperparameters and is modified as explained above. Isotropic RBF kernels are used unless otherwise specified. We use a zero mean function for experiments in regression and count prediction and a constant mean function for binary prediction (b/c some of the datasets require this to obtain reasonable performance with GP). We trained every log-loss model until convergence, defined as having a stable training log loss. More specifically, we stop when the difference between the minimum and maximum of the loss in the last iterations does not exceed , for iterations in regression, and iterations in classification and count prediction. For square loss DLM the optimization for has a closed form, i.e., it is optimized in one step. If the log loss does not converge, we stop when the number of iterations exceeds 5000 for regression, and 3000 for classification and count regression.

Hyperparameter optimization: We start by using an artificial 1d dataset to illustrate the sensitivity of hyperparameter optimization. We first draw a sample randomly from , and where

is drawn from Gaussian distribution with standard deviation 1.0. We have 500 such samples for training and 100 samples for testing. Below we compare the predictive distribution of DLM in two settings. In the first

fully-optimized version, kernel parameters, likelihood variance and inducing input locations are learned by directly optimizing LogLoss DLM objective. In the second svgp-init version, we follow the algorithm explained above. We run multiple trials on the generated sine data and put one typical trial in Figure 1. For both and , the predictive distribution with svgp-init is smoother when compared with fully-optimized DLM. We can also see that the fully-optimized DLM misses more points in its 95% interval which corresponds to higher log loss. For this dataset, a smooth predictive distribution is intuitively better as the sine function is smooth and the noise variance is high. The shape of fully-optimized DLM prediction comes from the underestimate of the noise variance, which is similar to what Bauer et al. (2016) reports for FITC. Recall that the true variance here is 1.0. Averages over 10 runs show that DLM with estimates to be , and the estimate of DLM with is . This is much smaller compared to the SVGP estimate of and FITC estimate . In further experiments, not shown here, we have observed similar issues for other hyperparameters (e.g., the length scale). This phenomenon is dataset dependent. While in many cases fully-optimized does perform well, for consistency in the remainder of the paper we show results for svgp-init DLM which has excellent performance across all experiments.

Regression: We next turn to regression experiments with log-loss objective. We compare the average negative log likelihood of LogLoss DLM with SVGP and FITC on four datasets, pol, cahousing, sarcos and song that have been used in recent work to evaluate sGP. Each dataset is split into portions with relative sizes 67/8/25 for training, validation and testing. All datasets are normalized with respect to training data and the same normalization is performed on validation and test data. For the larger song dataset (-M samples in total), we randomly choose a subset of 10000 examples for test data in order to reduce the test time in experiments. For pseudo point size, we use 206 points in Cahousing (following previous work) and use 100 inducing inputs in other datasets. To reduce run time for DLM on large datasets we use mini-batch training with batches of 6000 samples.

In order to get a full view of the performance of DLM, we show the performance as a function of training set size with the number of inducing inputs fixed. Each point in our plots shows the result at convergence for that data size. This shows the potential of DLM to improve over SVGP when their objectives are optimized. Figure 2 shows the results from 5 repetitions of the experiments. The figure clearly shows that LogLoss DLM significantly outperforms both SVGP and FITC, that is, it provides significantly better calibrated predictions.

It is interesting to consider the values selected by our algorithm. Specifically, the trend for cadata shows two facts. Figure 2 (rightmost column) plots of log-loss as a function of for a small (691) and large (11063) training set size. When the training size is small the optimal is about 5 and when it is large the optimal valies is smaller than 1. This shows that the suggestion from theoretical analysis for large might be especially important in difficult prediction tasks when we do not have enough data. The other datasets do not require such strong regularization even when training sizes are small.

Square Loss: We next turn to regression experiments with the square-loss objective using the same datasets and experimental setup. Here we can derive a collapsed form for squareLoss DLM and we use this form in our experiment. As shown in Figure 3, squareLoss DLM is significantly better than SVGP and FITC. While LogLoss DLM is very close to squareLoss DLM in 3 of the datasets it performs less well on cadata. Overall squareLoss DLM dominates in all cases. The trends for are similar to the regression case and they are omitted.

Classification: For classification, we first compare DLM with SVGP on four datasets commonly used in literature: Banana, Thyroid, Twonorm and Ringnorm. We use a similar setup to the previous experiments. We first select a number of training sizes(up to 2000) and pick 10% of all data to be the validation set. From the remaining examples we randomly choose up to 1000 samples for testing (to reduce test time for the experiments). Notice that all data is normalized based on the training set with size 2000. Figure 4 shows the mean and standard deviation of NLL of five repetitions. When the number of training samples is small, DLM yields better performance on all four datasets. As the number of training samples grows, the improvement of DLM is still obvious on Banana dataset while for other datasets, the gap between SVGP and DLM shrinks. The appendix shows the corresponding classification error results in the same experiments. It is interesting to note that we get significantly better log-loss, that is calibrated predictions, but the classification errors remain similar. This is as expected because DLM optimizes for a specific loss, and importantly there is no degredation in classification performance.

We next turn to an evaluation of classification on a very large dataset. Here we show performance as a function of training iterations. In particular, for the M-size airline dataset of Hensman et al. (2015)

, we split a 100000 test set from the full dataset, and trained on the remaining data for 20 epochs with Adam and learning rate

. The number of inducing points was set to 200 and the mini-batch size was 1000. Here, we used the RBF-ARD kernel. The train/evaluation protocol was: SVGP was trained with all hyperparameters and variational parameters being learned; then, DLM was initialized with the learned SVGP hyperparameters which were then fixed; the DLM variational parameters were learned from scratch. The results are shown in Figure 5 and we see similar trends to other classification datasets where DLM produces a better-calibrated predictive probability (left) for approximately the same level of error (right). Here, a large value for produces the worst performance highlighting the need for a good selection strategy for the regularization parameter.

Optimization with Biased Gradients: We next consider the use of biased gradients to optimize the general case. We start our exploration with binary classification. Here, since we have the exact gradients, we can collect statistics on the characteristics of the gradients as compared to their true values. To explore the results we refer to the following result on stochastic gradient descent with potentially biased gradients:

Theorem 1 ( Bertsekas and Tsitsiklis (1996)).

Consider an algorithm to minimize a function , with updates of the form with satisfying and . If conditions (a), (b), (c), (d) hold then converges to a stationary point of .
(a) .
(b) .
(c) .
(d) .

Intuitively condition c refers to the direction bias of the noisy gradient . Here all we need is a sufficiently strong positive projection on the direction of the gradient. Condition d indirectly limits the variance of because its norm must be bounded relative to the norm of the true gradient.

Figure 6 shows log loss for the airline dataset with 10 and 100 Monte Carlo samples. Additional plots for 1 MC sample and for classification error in the same experiments are given in the appendix. We can observe that with 10 samples DLM is not distinguishable from SVGP. However with 100 MC samples the log loss of DLM is very similar to the case of exact computation and provides a significant improvement over SVGP.

Figure 7 shows statistics for the noisy gradient using MC samples for the Cholesky factor. Similar plots for the mean and inducing variables are given in the appendix. To obtain these plots, we run DLM with for 0.1 epochs (using fixed hyperparameters learned from SVGP). We then freeze the learning algorithm and repeatedly sample the same gradients to gather statistics. The left plot shows a distribution plot for , which helps visualize condition c, for the calculated gradients. We observe that this holds with approximately equal to 1, the value that would be obtained if we had . The middle plot shows a distribution plot of and compares it to

in an attempts to visualize condition d. Here we see that the bias clearly exists despite the concentration in the case of 100 samples. However we also see that the norms are close and that they are getting closer with a larger number of samples. The right plot is a distribution plot of the norm of the bias vector of

where again we see as expected that the bias decreases with more samples. What is significant is the fact that despite the bias that still exists even for the large number of samples, DLM successfully converges and provides better log loss than SVGP. This is a preliminary result but it shows the potential for convergence with biased gradients.

Count Prediction: We next apply the same approximation to Poisson regression with the log link function. For this case, the loss term in the SVGP objective, , has a closed form which is used in our implementation. Figure 8 shows results on four datasets: abalone and three labels provided in the UCSD Peds datasets using 10 Monte Carlo samples in the approximation for DLM. We can see that DLM performs better than SVGP on all datasets.

5 Conclusion

Sparse GP algorithms retain the benefits of GP in terms of model flexibility and uncertainty quantification, and mitigate the computational complexity of inference in the full GP model. Direct loss minimization algorithms acknowledge that the separation of inference and prediction, which is a benefit of the Bayesian approach, breaks down with approximate inference. They therefore choose a posterior distribution which is tailored for the task at hand, by integrating the loss function into the learning process. The paper develops practical DLM algorithms for sparse GP and shows their empirical success in a range of problems. DLM leads to significantly better calibrated predictions in regression, classification and count prediction and to significantly lower square loss for regression problems. This is an important performance improvement in a model that received a large amount of attention in the last decade. In addition, the paper develops empirical observations on learning with biased stochastic gradients in Bayesian models. Since a lot of recent work in machine learning relies on samples to compute gradients, this shows interesting promise for future work. Extending the results in this paper to develop an analysis of algorithms using biased gradients or alternatively to provide unbiased sampling methods for DLM are important directions for future work.


This work was partly supported by NSF under grant IIS-1906694.


  • Alquier et al. (2016) Alquier, P., Ridgway, J., and Chopin, N. (2016). On the properties of variational approximations of Gibbs posteriors. JMLR, 17:1–41.
  • Bauer et al. (2016) Bauer, M., van der Wilk, M., and Rasmussen, C. E. (2016). Understanding probabilistic sparse gaussian process approximations. In Advances in neural information processing systems, pages 1533–1541.
  • Bertsekas and Tsitsiklis (1996) Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific.
  • Blei and Lafferty (2006) Blei, D. and Lafferty, J. (2006). Correlated topic models. In NIPS, pages 147–154.
  • Burt et al. (2019) Burt, D. R., Rasmussen, C. E., and van der Wilk, M. (2019). Rates of convergence for sparse variational gaussian process regression. In Proceedings of the 36th International Conference on Machine Learning, ICML, volume 97, pages 862–871.
  • Dehaene (2017) Dehaene, G. P. (2017). Computing the quality of the laplace approximation.
  • Gardner et al. (2018) Gardner, J. R., Pleiss, G., Bindel, D., Weinberger, K. Q., and Wilson, A. G. (2018). Gpytorch: Blackbox matrix-matrix gaussian process inference with GPU acceleration. CoRR, abs/1809.11165.
  • Hensman et al. (2013) Hensman, J., Fusi, N., and Lawrence, N. D. (2013). Gaussian processes for big data. In Proceedings of the 29th UAI Conference, pages 282–290.
  • Hensman et al. (2015) Hensman, J., Matthews, A., and Ghahramani, Z. (2015). Scalable variational gaussian process classification. JMLR.
  • Jankowiak et al. (2019) Jankowiak, M., Pleiss, G., and Gardner, J. R. (2019). Sparse gaussian process regression beyond variational inference. arXiv preprint arXiv:1910.07123.
  • Lacoste-Julien et al. (2011) Lacoste-Julien, S., Huszar, F., and Ghahramani, Z. (2011). Approximate inference for the loss-calibrated bayesian. In

    Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS

    , volume 15 of JMLR Proceedings, pages 416–424.
  • Rasmussen and Williams (2006) Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. MIT Press.
  • Reeb et al. (2018) Reeb, D., Doerr, A., Gerwinn, S., and Rakitsch, B. (2018). Learning gaussian processes by minimizing pac-bayesian generalization bounds. In Advances in Neural Information Processing Systems, pages 3341–3351.
  • Samilbeni et al. (2018) Samilbeni, H., Cheng, C.-A., Boots, B., and Deisenroth, M. (2018). Orthogonally decoupled variational gaussian processes. In Proceedings of Advances in Neural Information Processing Systems 32 (NeurIPS).
  • Shalev-Shwartz and Ben-David (2014) Shalev-Shwartz, S. and Ben-David, S. (2014). Understanding machine learning: From theory to algorithms. Cambridge University Press.
  • Sheth and Khardon (2016) Sheth, R. and Khardon, R. (2016). Monte carlo structured svi for two-level non-conjugate models. arXiv preprint arXiv:1612.03957.
  • Sheth and Khardon (2017) Sheth, R. and Khardon, R. (2017). Excess risk bounds for the Bayes risk using variational inference in latent Gaussian models. In NIPS, pages 5151–5161.
  • Sheth and Khardon (2019) Sheth, R. and Khardon, R. (2019). Pseudo-bayesian learning via direct loss minimization with applications to sparse gaussian process models. In

    Symposium on Advances in Approximate Bayesian Inference (AABI)

  • Sheth et al. (2015) Sheth, R., Wang, Y., and Khardon, R. (2015). Sparse variational inference for generalized Gaussian process models. In ICML, pages 1302–1311.
  • Snelson and Ghahramani (2006) Snelson, E. and Ghahramani, Z. (2006). Sparse gaussian processes using pseudo-inputs. In Advances in neural information processing systems, pages 1257–1264.
  • Stoyanov et al. (2011) Stoyanov, V., Ropson, A., and Eisner, J. (2011). Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, AISTATS, volume 15 of JMLR Proceedings, pages 725–733.
  • Titsias (2009) Titsias, M. (2009). Variational learning of inducing variables in sparse Gaussian processes. In AISTATS, pages 567–574.
  • Wang and Blei (2017) Wang, Y. and Blei, D. M. (2017). Frequentist consistency of variational bayes. CoRR, abs/1705.03439.
  • Wang and Blei (2019) Wang, Y. and Blei, D. M. (2019). Variational bayes under model misspecification. In Advances in Neural Information Processing Systems, pages 13357–13367.

Appendix A Datasets

dataset type size dim
pol111 regression 15000 26 100
cadata222 regression 20640 8 206
sarcos333(Rasmussen and Williams, 2006) regression 48933 21 100
song444 regression 515345 90 100
banana555 classification 5300 2 53
thyroid444 classification 3772 6 37
twonorm666 classifcation 7400 20 74
ringnorm777 classification 7400 20 74
airline888(Hensman et al., 2015) classification 2055733 8 200
abalone444 count 4177 9 41
Peds1_dir0999 count 4000 30 40
Peds1_dir1999 count 4000 30 40
Peds1_dir2999 count 4000 30 40
Table 1: Details of datasets

Table 1 shows the datasets used and their characteristics. In the table “dim” refers to the number of features and is the number of inducing points used in our experiments. Notice that in some datasets, categorical features are converted to dummy coding, i.e., we use binary features to represent a feature with categories. One category is assigned the all zero code while the other categories are assigned to the unit vector with the corresponding entry set to 1.

Appendix B Classification Error Results

Figure 9 shows the classification errors in the experiment shown in the main paper. As can be seen while log loss in the main paper is much improved the classification errors are comparable.

Figure 9: sGP Classification: Comparison of SVGP and DLM in term of error rate. In all plots, lower values imply better performance.

Appendix C Additional Results for MC samples on the airline dataset

In this section we show detailed log loss and classification error results for 1, 10, 100 MC samples on the airline dataset. In addition we show gradient statistics for the mean and inducing variables which show similar behavior to the Cholesky factor. The results are shown in Figures 10, 11, 12, 13, 14. The plots confirm the trends shown in the main paper.

Figure 10: Comparison of SVGP and DLM with one Monte Carlo sample per datapoint on the binary classification airline dataset. On the left is negative log predictive probability and on the right is error rate. In both plots, lower values imply better performance.
Figure 11: Comparison of SVGP and DLM with ten Monte Carlo samples per datapoint on the binary classification airline dataset. On the left is negative log predictive probability and on the right is error rate. In both plots, lower values imply better performance.
Figure 12: Comparison of SVGP and DLM with one hundred Monte Carlo samples per datapoint on the binary classification airline dataset. On the left is negative log predictive probability and on the right is error rate. In both plots, lower values imply better performance.
Figure 13: Statistics for calculation of biased gradients for the mean parameter for binary classification. Left: condition c. Middle: condition d. Right: estimate of bias.
Figure 14: Statistics for calculation of biased gradients for the inducing locations parameter for binary classification. Left: condition c. Middle: condition d. Right: estimate of bias.

Appendix D Relative Error Results for Count Prediction

Figure 15 shows relative error MRE (which is defined as , ) in the experiment shown in the main paper. As can be seen the differences shown for log loss hold also for relative error.

Figure 15: sGP Count Prediction: Comparison of SVGP and DLM with 10 MC samples in terms of mean relative error (MRE). In all plots, lower values imply better performance.