1 Introduction
Gaussian Processes (GPs) are a powerful modelling method due to their nonparametric nature rasmussenwilliamsbook
. Although GPs are probabilistic models and hence come equipped with an intrinsic measure of uncertainty, this uncertainty does not allow conclusions about their performance on previously unseen test data. For instance, one often observes overfitting if a large number of hyperparameters is adjusted using marginal likelihood optimization
bauerunderstandingsparseGPapproximations . While a fully Bayesian approach, i.e. marginalizing out the hyperparameters, reduces this risk, it incurs a prohibitive runtime since the predictive distribution is no longer analytically tractable. Also, it does not entail outofthebox safety guarantees.In this work, we propose a novel training objective for GP models, which enables us to give rigorous
and quantitatively good performance guarantees on future predictions. Such rigorous guarantees are developed within Statistical Learning Theory (e.g.
understandingmachinelearningbook ). But as the classical uniform learning bounds are meaningless for expressive models like deep neural nets understandingdeeplearningrequiresrethinkinggeneralization (as e.g. the VC dimension exceeds the training size) and GPs or nonparametric methods in general, such guarantees cannot be employed for learning those models. Instead, common optimization schemes are (regularized) empirical risk minimization (ERM) understandingdeeplearningrequiresrethinkinggeneralization ; understandingmachinelearningbook , maximum likelihood (MLE) rasmussenwilliamsbook , or variational inference (VI) jordanghahramanietalvi ; titsias09 .On the other hand, better nonuniform learning guarantees have been developed within the PACBayesian framework mcallesterpacbayesianmodelaveraging ; asdf2 ; catonithermodynofml (Sect. 2). They are specially adapted to probabilistic methods like GPs and can yield tight generalization bounds, as observed for GP classification seegerclassificationpaper , probabilistic SVMs tighterpacbayesboundssvm ; pacbayesandmargins
, linear classifiers
pacbayeslearningoflinearclassifiers , or stochastic NNs nonvacuousgeneralizationboundssnn . Most previous works used PACBayesian bounds merely for the final evaluation of the generalization performance, whereas learning by optimizing a PACBayesian bound has been barely explored pacbayeslearningoflinearclassifiers ; nonvacuousgeneralizationboundssnn . This work, for the first time, explores the use of PACBayesian bounds (a) for GP training and (b) in the regression setting.Specifically, we propose to learn full and sparse GP predictors directly by minimizing a PACBayesian upper bound from Eq. (5) on the true future risk of the predictor, as a principled method to ensure good generalization (Sect. 3). Our general approach comes naturally for GPs because the KL divergence
in the PACBayes theorem can be evaluated analytically for GPs
sharing the same hyperparameters. As this applies to popular sparse GP variants such as DTC seegerdtc , FITC snelsonspgp , and VFE titsias09 , they all become amenable to our method of PACBayes learning, combining computational benefits of sparse GPs with theoretical guarantees. We carefully account for the different types of parameters (hyperparameters, inducing inputs, observation noise, freeform parameters), as only some of them contribute to the “penalty term” in the PACBayes bound. Further, we base GP learning directly on the inverse binary KL divergence seegerclassificationpaper , and not on looser bounds used previously, such as from Pinsker’s inequality (e.g., nonvacuousgeneralizationboundssnn ).We demonstrate our GP learning method on regression tasks, whereas PACBayes bounds have so far mostly been used in a classification setting. A PACBayesian bound for regression with potentially unbounded loss function was developed in
pacbayesiantheorymeetsbayesianinference , it requires a subGaussian assumption w.r.t. the (unknown) data distribution, see also excessriskboundsusingviingaussianmodels . To remain distributionfree as in the usual PAC setting, we employ and investigate a generic bounded loss function for regression.We evaluate our learning method on several datasets and compare its performance to stateoftheart GP methods rasmussenwilliamsbook ; snelsonspgp ; titsias09 in Sect. 4. Our learning objective exhibits robust optimization behaviour with the same scaling to large datasets as the other GP methods. We find that our method yields significantly better risk bounds, often by a factor of more than two, and that only for our approach the guarantee improves with the number of inducing points.
2 General PACBayesian Framework
2.1 Risk functions
We consider the standard supervised learning setting understandingmachinelearningbook where a set of training examples () is used to learn in a hypothesis space , a subset of the space of functions . We allow learning algorithms that output a distribution over hypotheses , rather than a single hypothesis , which is the case for GPs we consider later on.
To quantify how well a hypothesis performs, we assume a bounded loss function to be given, w.l.o.g. scaled to the interval . measures how well the prediction approximates the actual output at an input . The empirical risk of a hypothesis is then defined as the average training loss . As in the usual PAC framework, we assume an (unknown) underlying distribution on the set of examples, and define the (true) risk as . We will later assume that the training set consists of independent draws from and study how close is to its mean understandingmachinelearningbook . To quantify the performance of stochastic learning algorithms, that output a distribution over hypotheses, we define the empirical and true risks by a slight abuse of notation as mcallesterpacbayesianmodelaveraging :
(1)  
(2) 
These are the average losses, also termed Gibbs risks, on the training and true distributions, respectively, where the hypothesis is sampled according to before prediction.
In the following, we focus on the regression case, where is the set of reals. An exemplary loss function in this case is , where the functions specify an interval outside of which a prediction is deemed insufficient; similar to
support vector regression
vapnikthenatureofstatisticallearningtheory , we use , with a desired accuracy goal specified before learning (see Sect. 4). In any case, the expectations over in (1)–(2) reduce to onedimensional integrals asis a realvalued random variable at each
. See App. C, where we also explore other loss functions.Instead of the stochastic predictor with , one is often interested in the deterministic Bayes predictor seegerclassificationpaper ; for GP regression, this simply equals the predictive mean at . The corresponding Bayes risk is defined by . While PACBayesian theorems do not directly give a bound on but only on , it is easy to see that if is quasiconvex in , as in the examples above, and the distribution of is symmetric around its mean (e.g., Gaussian) seegerclassificationpaper . An upper bound on below thus implies a nontrivial bound on .
2.2 PACBayesian generalization bounds
In this paper we aim to learn a GP by minimizing suitable risk bounds. Due to the probabilistic nature of GPs, we employ generalization bounds for stochastic predictors, which were previously observed to yield stronger guarantees than those for deterministic predictors seegerclassificationpaper ; tighterpacbayesboundssvm ; nonvacuousgeneralizationboundssnn . The most important results in this direction are the socalled “PACBayesian bounds”, originating from mcallesterpacbayesianmodelaveraging ; asdf2 and developed in various directions seegerclassificationpaper ; maurernotepacbayes ; catonithermodynofml ; pacbayeslearningoflinearclassifiers ; pacbayesianboundsbasedontherenyidivergence ; pacbayesiantheorymeetsbayesianinference .
The PACBayesian theorem (Theorem 1) gives a probabilistic upper bound (generalization guarantee) on the true risk of a stochastic predictor in terms of its empirical risk on a training set . It requires to fix a distribution on the hypothesis space before seeing the training set , and applies to the true risk of any distribution on ^{1}^{1}1We follow common usage and call and
“prior” and “posterior” distributions in the PACBayesian setting, although their meaning is somewhat different from priors and posteriors in Bayesian probability theory.
. The bound contains a term that can be interpreted as complexity of the hypothesis distribution , namely the KullbackLeibler (KL) divergence , which takes values in . The bound also contains the binary KLdivergence , defined for , or more precisely its (upper) inverse w.r.t. the second argument (for , ):(3) 
which equals the unique satisfying . While has no closedform expression, we refer to App. A for an illustration and more details, including its derivatives for optimization.
Theorem 1 (PACBayesian theorem mcallesterpacbayesianmodelaveraging ; seegerclassificationpaper ; maurernotepacbayes ).
For any valued loss function , for any distribution , for any , for any distribution on a hypothesis set , and for any , the following holds with probability at least over the training set :
(4) 
The RHS of (4) can be upper bounded by , which gives a useful intuition about the involved terms, but can exceed and thereby yield a trivial statement. Note that the full PACBayes theorem maurernotepacbayes gives a simultaneous lower bound on , which is however not relevant here as we are going to minimize the upper risk bound. Further refinements of the bound are possible (e.g., maurernotepacbayes ), but as they improve over Theorem 1 only in small regimes catonithermodynofml ; pacbayeslearningoflinearclassifiers ; pacbayesianboundsbasedontherenyidivergence , often despite adjustable parameters, we will stick with the parameterfree bound (4).
We want to consider a family of prior distributions parametrized by , e.g. in GP hyperparameter training rasmussenwilliamsbook . If this family is countable, one can generalize the above analysis by fixing some probability distribtion on and defining the mixture prior ; when
is a finite set, the uniform distribution
is a canonical choice. Using the fact that holds for each (App. B), Theorem 1 yields that, with probability at least over ,(5) 
The bound (5) holds simultaneously for all and all ^{2}^{2}2The same result can be derived from (4) via a union bound argument (see Appendix B).. One can thus optimize over both and to obtain the best generalization guarantee, with confidence at least . We use for our training method below, but we will also compare to training with the suboptimal upper bound as was done previously nonvacuousgeneralizationboundssnn . The PACBayesian bound depends only weakly on the confidence parameter , which enters logarithmically and is suppressed by the sample size . When the hyperparameter set is not too large (i.e. is small compared to ), the main contribution to the penalty term in the second argument of comes from , which must be for a good generalization statement (see Sect. 4).
3 PACBayesian learning of GPs
3.1 Learning full GPs
GP modelling is usually presented as a Bayesian method rasmussenwilliamsbook , in which the prior is specified by a positive definite kernel and a mean function on the input set . In ordinary GP regression, the learned distribution is then chosen as the Bayesian posterior coming from the assumption that the training outputs are noisy versions of with i.i.d. Gaussian likelihood . Under this assumption, is again a GP rasmussenwilliamsbook :
(6) 
with , , . Eq. (6) is employed to make (stochastic) predictions for on new inputs . In our approach below, we do not require any Bayesian rationale behind but merely use its form, parametrized by , as an optimization ansatz within the PACBayesian theorem.
Importantly, for any full GP prior and its corresponding posterior from (6), the KLdivergence in Theorem 1 and Eq. (5) can be evaluated on finite ()dimensional matrices. This allows us to evaluate the PACBayesian bound and in turn to learn GPs by optimizing it. More precisely, one can easily verify that and have the same conditional distribution ^{3}^{3}3In fact, direct computation rasmussenwilliamsbook gives . Remarkably, does not depend on nor on , even though from (6) does. Intuitively this is because, for the above likelihood, is independent of given ., so that
(7)  
(8)  
where in the last step we used the wellknown formula kullbackleiblerpaper
for the KL divergence between normal distributions
and , and simplified a bit (see also App. D).To learn a full GP means to select “good” values for the hyperparameters , which parametrize a family of GP priors , and for the noise level rasmussenwilliamsbook . Those values are afterwards used to make predictions with the corresponding posterior from (6). In our experiments (Sect. 4) we will use the squared exponential (SE) kernel on , , where
is the signal variance,
are the lengthscales, and we set the mean function to zero. The hyperparameters are (SEARD kernel rasmussenwilliamsbook ), or if we take all lengthscales to be equal (nonARD).The basic idea of our method, which we call “PACGP” is now to learn the parameters^{4}^{4}4Contrary to the usual GP viewpoint rasmussenwilliamsbook , is not a hyperparameter in our method since the prior does not depend on . Thus, does also not contribute to the “penalty term” . is merely a free parameter in the posterior distribution . By (8), as , so we need this parameter because otherwise and the bound as well as the optimization objective would become trivial. Although the parameter is originally motivated by a Gaussian observation noise assumption, the aim here is merely to parameterize the posterior in some way while maintaining computational tractability; cf. also Sect. 3.2. and by minimizing the upper bound from Eq. (5), therefore selecting the GP predictor with the best generalization performance guarantee within the scope of the PACBayesian bound. Note that all involved terms (App. C) and from (8) as well as their derivatives (App. A) can be computed effectively, so we can use gradientbased optimization.
The only remaining issue is that the learned prior hyperparameters have to come from a discrete set that must be specified before seeing the training set (Sect. 2.2). To achieve this, we first minimize the RHS of Eq. (5) over and in a gradientbased manner, and thereafter discretize each of the components of to the closest point in the equispaced element set ; thus, when denotes the number of components of , the penalty term to be used in the optimization objective (5) is . The SEARD kernel has , while the standard SE kernel has parameters. In our experiments we round each component of to two decimal digits in the range , i.e. , . We found that this discretization has virtually no effect on the predictions of , and that coarser rounding (i.e. smaller ) does not significantly improve the bound (5) (via its smaller penalty term ) nor the optimization (via its higher sensitivity to ); see App. F.
3.2 Learning sparse GPs
Despite the fact that, with confidence , the bound in (5) holds for any from the prior GP family and for any distribution , we optimized in Sect. 3.1 the upper bound merely over the parameters after substituting and the corresponding from (6). We are limited by the need to compute effectively, for which we relied on the property and the Gaussianity of and , cf. (7). Building on this two requirements, we now construct more general pairs of GPs with effectively computable , so that our learning method becomes more widely applicable, including sparse GP methods.
Instead of the points associated with the training set as in Sect. 3.1, one may choose from the input space any number of points , often called inducing inputs
, and any Gaussian distribution
on function values , with any and positive semidefinite matrix . The distribution on can be extended to all function values at all inputs using the conditional from the prior (see Sect. 3.1). This yields the following predictive GP:(9)  
where , , and . This form of
includes several approximate posteriors from Bayesian inference that have been used in the literature
rasmussenwilliamsbook ; seegerclassificationpaper ; titsias09 ; scalablevariationalgaussianprocessclassification , even for noise models other than the Gaussian one used to motivate the from Sect. 3.1. Analogous reasoning as in (7) now gives seegerclassificationpaper ; titsias09 ; KLdivergencebetweenstochasticprocesses :(10) 
One can thus effectively optimize in (5) the prior and the posterior distribution by varying the number and locations of inducing inputs and the parameters and , along with the hyperparameters . Optimization can in this framework be organized such that it consumes time per gradient step and memory as opposed to and for the full GP of Sect. 3.1. This is a big saving when and justifies the name “sparse GP” rasmussenwilliamsbook ; unifyingviewsparsegpapprox .
Some popular sparseGP methods unifyingviewsparsegpapprox are special cases of the above form, by prescribing certain and depending on the training set , so that only the inducing inputs and a few parameters such as are left free:
(11) 
where with , , and is a diagonal matrix with entries . Setting corresponds to the FITC approximation snelsonspgp , whereas is the VFE and DTC method titsias09 ; seegerdtc (see App. D
for their training objectives); one can also linearly interpolate between both choices with
buiyanturnerunifying . Another form of sparse GPs where the latent function values are fixed and not marginalized over, corresponds to , which however gives diverging via (10) and therefore trivial bounds in (4)–(5).Our learning method for sparse GPs (“PACSGP”) follows now similar steps as in Sect. 3.1: One has to include a penality for the prior hyperparameters , which are to be discretized into the set after the optimization of (5). Note, contains the prior hyperparameters only and not the inducing points nor , , , or from (11); all these quantities can be optimized over simultaneously with , but do not need to be discretized. The number of inducing inputs can also be varied, which determines the required computational effort, and all optimizations can be both discrete seegerdtc or continuous snelsonspgp ; titsias09 . When optimizing over positive , the parametrization with a lower triangular matrix can be used scalablevariationalgaussianprocessclassification . For the experiments below we always employ the FITC parametrization (fixed ) in our proposed PACSGP method, i.e. our optimization parameters are and besides the length scale hyperparameters .
4 Experiments^{5}^{5}5Python code (building on GPflow gpflow and TensorFlow tensorflow ) implementing our method is available at https://github.com/boschresearch/PAC_GP.
We now illustrate our learning method and compare it with other GP methods on various regression tasks. In contrast to prior work nonvacuousgeneralizationboundssnn , we found the gradientbased training with the objective (5) to be robust enough, such that no pretraining with conventional objectives (such as from App. D) is necessary. We set throughout seegerclassificationpaper ; nonvacuousgeneralizationboundssnn , cf. Sect. 2.2, and use (unless specified otherwise) the generic bounded loss function for regression, with accuracy goal as specified below.
We evaluate the following methods: (a) PACGP: Our proposed method (cf. Sect. 3.1) with the training objective (5) (klPACGP) and for comparison with the looser training objective (sqrtPACGP) (see below (5), similar to e.g. nonvacuousgeneralizationboundssnn ); (b) PACSGP: Our sparse GP method (Sect. 3.2), again with objectives (klPACSGP) and (sqrtPACSGP), respectively; (c) fullGP: The ordinary full GP for regression rasmussenwilliamsbook ; (d) VFE: Titsias’ sparse GP titsias09 ; (e) FITC: SnelsonGhahramani’s sparse GP snelsonspgp . Note that fullGP, VFE, and FITC as well as sqrtPACGP and sqrtPACSGP are trained on other objectives (see App. D), and we will evaluate the upper bound (5) on their generalization performance by evaluating via (8) or (10). To obtain finite generalization bounds, we discretize for all methods at the end of training as in Sect. 3.1 and use the appropriate in (5).
(a) Predictive distribution.
To get a first intuition, we illustrate in Fig. 1 the effect of varying in the loss function on the predictive distribution of our sparse PACSGP. The accuracy goal defines a band around the predictive mean within which datapoints do not contribute to the empirical risk . We thus chose the accuracy goal relative to the observation noise obtained from an ordinary fullGP. Results are presented on the 1D toy dataset^{6}^{6}6snelson: dimensions 200 1, available at www.gatsby.ucl.ac.uk/~snelson. from the original FITC snelsonspgp and VFE titsias09 publications (for a comparison to the predictive distributions of FITC and VFE see App. E, which also contains an illustration that our klPACSGP avoids FITC’s known overfitting on pathological datasets.). Here and below, we optimize the hyperparameters in each experiment anew.
We find that for large (right plot) the predictive distribution (blue) becomes smoother: Due to the wider band (green), the PACSGP does not need to adapt much to the data for the band to contain many data points. Hence the predictive distribution can remain closer to the prior, which reduces the KLterm in the objective (5). For the same reason, the inducing points need not adapt much compared to their initial positions for large . For smaller , the PACSGP adapts more to the data, whereas for very small (left plot), it is anyhow not possible to place many data points within the narrow band, so the predictive distribution can again be closer to the prior (compare e.g. in the first and second plots the blue curves near the rightmost datapoints) for a smaller KLterm. In particular, the KLdivergence (divided by number of training points) for the three settings in Fig.1 are: 0.097 (left), 0.109 (middle), and 0.031 (right).
(b) FullGP experiments – dependence on the accuracy goal .
To explore the dependence on the desired accuracy further, we compare in Fig. 2 the ordinary fullGP to our PACGPs on the boston housing dataset^{7}^{7}7boston: dimensions 506 13, available at http://lib.stat.cmu.edu/datasets/boston. As preprocessing we normalized all features and the output to mean zero and unit variance, then analysed the impact of the accuracy goal . We used 80% of the dataset for training and 20% for testing, in ten repetitions of the experiment.
Our PACGP yields significantly better generalization guarantees for all accuracy goals compared to fullGP, since we are directly optimizing the bound (5). This effect is stronger for large , where the KLterm of PACGP can decrease as may again remain closer to while keeping the training loss low. Although better bounds do not necessarily imply better Gibbs test risk, klPACGP performs only slightly below the ordinary fullGP in this regard. Moreover, our PACGPs exhibit less overfitting than the fullGP, for which the training risks are significantly larger than the test risks (see Table 1 in App. G for numerical values). On the other hand, the tighter objective (5) in the klPACGP allows learning a slightly more complex GP in terms of the KLdivergence compared to the sqrtPACGP, which results in better test risks and at the same time better guarantees. This confirms that klPACGP is always preferable to sqrtPACGP. However, as any prediction within the full band around the ground truth incurs no risk for our PACGPs, their mean squared error (MSE) increases with .
The fact that our learned PACGPs exhibit higher training and test errors (Gibbs risk and esp. MSE) than fullGP can be explained by their underfitting in order to hedge against violating Eq. (5) (i.e. Theorem 1). This underfitting is evidenced by PACGP’s significantly less complex learned posterior as measured by (Fig. 2), or similarly (via Eqs. (8,10)), by its larger learned noise variance compared to fullGP’s (Table 1 in App. G). It is exactly this stronger regularization of PACGP in terms of the divergence that leads to its better generalization guarantees.
In the following, we will fix after preprocessing data as above, to illustrate PACGP further. Note however that in a concrete application, should be fixed to a desired accuracy goal using domain knowledge, but before seeing the training set . Alternatively, one can consider a set of values chosen in advance, at the cost of a term in addition to in the objective (5).
(c) SparseGP experiments – dependence on number of inducing inputs .
We now examine our sparse PACSGP method (Sect. 3.2) on the three large data sets pol, sarcos, and kin40k^{8}^{8}8pol: 15,000 26, kin40k: 40,000 8 (both from https://github.com/trungngv/fgp.git);
sarcos: 48,933 21 (http://www.gaussianprocess.org/gpml/data), again using 80%–20% traintest splits and ten iterations.
The results are shown in Fig. 3.
Here, we vary the number of inducing points . For modelling pol and kin40k, we use the SEARD kernel due to its better performance on these datasets, whereas we model sarcos without ARD (cf. Table 2 in App. G for the comparison ARD vs. nonARD).
The corresponding penalty terms for the three plots are and ; when compared to from Fig. 3, their contribution is largest for the pol dataset.
Our klPACSGP achieves significantly better upper bounds than VFE and FITC, by more than a factor of 3 on sarcos, a factor of roughly 2 on pol, and a factor between 1.3 and 2 on kin40k (Fig. 3, cf. also Table 2 in App. G). Also, the PACBayes upper bound is much tighter for klPACSGP than for VFE or FITC, i.e. closer to the Gibbs risk, often by factors exceeding 3. Our klPACSGP behaves also more favorably in terms of generalization guarantee when inducing points are added and more complex models are allowed: our upper bound improves substantially with (kin40) or does at least not degrade (pol and sarcos), as opposed to VFE and FITC, whose complexities grow substantially with . Since very low training risks can already be achieved by a moderate number of inducing points for pol and sarcos, a growing with deteriorates the upper bound. Regarding the upper bound, the increased flexibility from larger only pays off for the kin40k dataset, whereas the MSE improves with increasing for all models and datasets. As above, klPACSGP is always slightly preferrable to sqrtPACSGP, not only for the upper bound and Gibbs risks as expected but also for MSE (see Table 2 in App. G).
Similarly to the boston dataset, the higher test errors of klPACSGP compared to VFE and FITC can be explained by underfitting due to the stronger regularization, again shown by lower and significantly larger learned (by factors of 4–28 compared to VFE), cf. Table 2 in App. G. In fact, although our implementation of PACSGP employs the FITC parametrization, the PAC(S)GP optimization is not prone to FITC’s wellknown overfitting tendency bauerunderstandingsparseGPapproximations , due to the regularization via the divergence (see App. E, and in particular Supplementary Figure 7).
To investigate whether the higher test MSE of PACGP compared to VFE and FITC (and the fullGP above) is a consequence of the 01loss used so far, we reran the PACSGP experiments for inducing inputs with the more distancesensitive loss function (Eq. (20)), which is MSElike for small deviations , i.e. (Supplementary Figure 5). Our results are tabulated in Table 3 in App. G. The findings are inconclusive and range from an improvement w.r.t. MSE of 25% (pol) over little change (sarcos) to a decline of 12% (kin40k), showing that the effect of the loss function is smaller than might have been expected. Nevertheless, generalization guarantees of PACSGP remain much better than the ones of the other methods. While the MSE of our PACGPs would improve by choosing smaller (e.g., Fig. 2), this comes at the disadvantage of worse generalization bounds.
We further note that no method shows significant overfitting in Fig. 3, in the sense that the differences between test and training Gibbs risks are all rather small, despite the KLcomplexity increasing with for VFE and FITC. This is unlike for Boston housing above, and may be due to the much larger training sets here. When comparing VFE and FITC, we observe that VFE consistently outperforms FITC in terms of both MSE as well as generalization guarantee, where VFE’s higher KLcomplexity is offset by its much lower Gibbs risk. This fortifies the results in bauerunderstandingsparseGPapproximations . We lastly note that, since for our PACSGP the obtained guarantees are much smaller than 1/2, we obtain strong guarantees even on the Bayes risk (Sect. 2.1).
5 Conclusion
In this paper, we proposed and explored the use of PACBayesian bounds as an optimization objective for GP training. Consequently, we were able to achieve significantly better guarantees on the outofsample performance compared to stateoftheart GP methods, such as VFE or FITC, while maintaining computational scalability. We further found that using the tighter generalization bound (5) based on the inverse binary kldivergence leads to an increase in the performance on all metrics compared to a looser bound as employed in previous works (e.g. nonvacuousgeneralizationboundssnn ).
Despite the much better generalization guarantees obtained by our method, it often yields worse test error, in particular test MSE, than standard GP regression methods; this largely persists even when using more distancesensitive loss functions than the 01loss. The underlying reason could be that all loss functions considered in this work were bounded, as necessitated by our desire to provide generalization guarantees irrespective of the true data distribution. While rigorous PACBayesian bounds exist for MSElike unbounded loss functions under special assumptions on the data distribution pacbayesiantheorymeetsbayesianinference , it may nevertheless be worthwhile to investigate whether these training objectives lead to better test MSE in examples. A drawback is that those assumptions are usually impossible to verify, thus the generalization guarantees are not comparable. Note that the design of a loss function is dependent on the application domain and there is no ubiquitous choice across all settings. In many safetycritical applications, small deviations are tolerable whereas larger deviations are all equally catastrophic, thus a 01loss as ours and a rigorous bound on it can be more useful than the MSE test error.
While in this work we focussed on regression tasks, the same strategy of optimizing a generalization bound can also be applied to learn GPs for binary and categorical outputs. Note that the true term in this setting has so far been merely upper bounded by its regression proxy seegerclassificationpaper , and it would be interesting to develop better bounds on the classification complexity term. Lastly, it may be worthwhile to use other or more general sparse GPs within our PACBayesian learning method, such as freeform scalablevariationalgaussianprocessclassification or even more general GPs moregeneralthanfreeform .
Acknowledgments
We would like to thank Duy NguyenTuong, Martin Schiegg, and Michael Schober for helpful discussions and proofreading.
References

(1)
C. E. Rasmussen, C. K. I. Williams, “Gaussian Processes for Machine Learning”, The MIT Press (2006).
 (2) M. Bauer, M. v. d. Wilk, C. E. Rasmussen, “Understanding Probabilistic Sparse Gaussian Process Approximations”, In NIPS (2016).
 (3) S. ShalevShwartz, S. BenDavid, “Understanding Machine Learning: From Theory to Algorithms”, Cambridge University Press (2014).

(4)
C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, “Understanding deep learning requires rethinking generalization”, In ICLR (2017).
 (5) M. Jordan, Z. Ghahramani, T. Jaakkola, L. Saul, “Introduction to variational methods for graphical models”, Machine Learning, 37, 183233 (1999).
 (6) M. Titsias, “Variational Learning of Inducing Variables in Sparse Gaussian Processes”, In AISTATS (2009).
 (7) D. McAllester, “PACBayesian model averaging”, COLT (1999).
 (8) D. McAllester, “PACBayesian Stochastic Model Selection”, Machine Learning 51, 521 (2003).
 (9) O. Catoni, “PacBayesian Supervised Classification: The Thermodynamics of Statistical Learning”, IMS Lecture Notes Monograph Series 2007, Vol. 56 (2007).
 (10) M. Seeger, “PACBayesian Generalization Error Bounds for Gaussian Process Classification”, Journal of Machine Learning Research 3, 233269 (2002).
 (11) A. Ambroladze, E. ParradoHernández, J. ShaweTaylor, “Tighter PACBayes bounds”, In NIPS (2007).
 (12) J. Langford, J. ShaweTayor, “PACBayes & margins”, In NIPS (2002).
 (13) P. Germain, A. Lacasse, F. Laviolette, M. Marchand, “PACBayesian Learning of Linear Classifiers”, In ICML (2009).

(14)
G. K. Dziugaite, D. M. Roy, “Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data”, In UAI (2017).
 (15) E. Snelson, Z. Ghahramani, “Sparse Gaussian Processes using Pseudoinputs”, In NIPS (2005).
 (16) M. Seeger, C. K. I. Williams, N. Lawrence, “Fast Forward Selection to Seepd Up Sparse Gaussian Process Regression”, In AISTATS (2003).
 (17) P. Germain, F. Bach, A. Lacoste, S. LacosteJulien, “PACBayesian Theory Meets Bayesian Inference”, In NIPS (2016).
 (18) R. Sheth, R. Khardon, “Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models”, In NIPS (2017).
 (19) V. Vapnik, “The Nature of Statistical Learning Theory”, Springer (1995).
 (20) A. Maurer, “A Note on the PAC Bayesian Theorem”, arXiv:cs/0411099 (2004).
 (21) L. Begin, P. Germain, F. Laviolette, J.F. Roy, “PACBayesian Bounds based on the Renyi Divergence”, In AISTATS (2016).
 (22) S. Kullback, R. Leibler, “On information and sufficiency”, Annals of Mathematical Statistics 22, 7986 (1951).

(23)
A. G. de G. Matthews, J. Hensman, R. Turner, Z. Ghahramani, “On Sparse Variational Methods and the KullbackLeibler Divergence between Stochastic Processes”, In AISTATS (2016).
 (24) J. QuinoneroCandela, C. E. Rasmussen, “A Unifying View of Sparse Approximate Gaussian Process Regression”, Journal of Machine Learning Research 6, 19391959 (2005).
 (25) T. D. Bui, J. Yan, R. E. Turner, “A Unifying Framework for Gaussian Process PseudoPoint Approximations using Power Expectation Propagation”, Journal of Machine Learning Research 18, 172 (2017).
 (26) J. Hensman, A. Matthews, Z. Ghahramani, “Scalable Variational Gaussian Process Classification”, In AISTATS (2015).

(27)
A. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. LeónVillagrá, Z. Ghahramani, J. Hensman, “GPflow: A Gaussian process library using TensorFlow”, Journal of Machine Learning Research 18, 16 (2017).
 (28) M. Abadi et al., “TensorFlow: LargeScale Machine Learning on Heterogeneous Systems”, https://www.tensorflow.org/ (2015).
 (29) C.A. Cheng, B. Boots, “Variational Inference for Gaussian Process Models with Linear Complexity”, In NIPS (2017).
Appendix A Inverse binary KLdivergence and its derivatives, Pinsker’s inequality
The function , defined in Eq. (3), can easily be computed numerically for any , to any desired accuracy via the bisection method, since the function is strictly monotonically increasing in from to (for or , we set ). Note that the monotonicity in implies further that is monotonically increasing in . Fig. 4 shows a plot of for various values of , and states a few special function values of . By Pinsker’s inequality, it holds that , which implies that .
When applying gradient descent on the RHS of the generalization bound (5) (which includes (4) as a special case), as we propose and do, one further needs, besides the evaluation of , also the derivatives of w.r.t. both of its arguments. These can be easily derived by differentiating the identity w.r.t. and , plugging in the easily computed derivatives of . The result is:
(12)  
(13) 
The derivative of the RHS of (5) with respect to parameters (which may include the hyperparameters , the noise level , the inducing points , or any other parameters of and such as , or from Section 3.2) thus reads:
(14) 
using the partial derivatives of in parentheses from (12)–(13).
We use the expression (14) for our gradientbased optimization of the parameters in Sect. 4. Note that we treat all components of as continuous parameters during this optimization, despite the fact that the hyperparameters for the prior have to come from a countable set (see around Eq. (5) and App. B). It is only after the optimization that we discretize all of those parameters onto a predefined grid (chosen before seeing the training sample ), as described in Sect. 3.1.
Note that in (14) can be computed analytically in our proposed methods PACGP and PACSGP by using standard matrix algebra (e.g., (rasmussenwilliamsbook, , App. A)) for differentiating the matrixanalytic expressions of in (7) or (10) (possibly with the parametrization (11)). Furthermore, the derivative is easily computed for common distributions (Sect. 2.2), again treating first as a continuous parameter in the optimization as explained in the previous paragraph; in our paper, we always discretize the hyperparameter set to be finite and choose as the uniform distribution, so is independent of and . Lastly, we show in App. C how to effectively compute in the expression (14) for relevant loss functions .
To our knowledge, the parameterfree PACBayes bound from Theorem 1 or Eq. (5) has never before been used for learning, as we do in our paper here, ostensibly due to the perceived difficulty of handling the derivatives of nonvacuousgeneralizationboundssnn . Instead, when a PACBayes bound was used to guide learning in prior works pacbayeslearningoflinearclassifiers ; nonvacuousgeneralizationboundssnn , then a simple sum of and a penalty term involving and was employed as an upper bound, either obtained from alternative PACBayes theorems catonithermodynofml ; pacbayeslearningoflinearclassifiers or from loosening the upper bound in Eq. (5) to an expression of the form by a use of Pinsker’s inequality nonvacuousgeneralizationboundssnn or by looser derivations (some of which are mentioned in pacbayeslearningoflinearclassifiers ; pacbayesianboundsbasedontherenyidivergence ). We show in our work how to perform the learning directly with the bound (5) using the derivative from Eq. (14), and demonstrate that its optimization is robust and stable and has better performance than the optimization of looser bounds (see Sect. 4).
Appendix B Proof of Eq. (5) — KLdivergence inequality and union bound
Let be a countable set (i.e. a finite set or a countably infinite set), and let
be any probability distribution over its elements
. Further, let be a family of probability distributions indexed by the , and define their mixture . Then it holds for each and :(15)  
(16)  
(17) 
The inequality (16) follows from the simple fact that the sum contains only nonnegative terms and is therefore at least as large as any of its summands, , together with the monotonicity of the logarithm . This inequality would not generally hold when were replaced by an integral over a continuous index , which explains the requirement of a countable index set . The inequality holds also for with the interpretation .