Learning Gaussian Processes by Minimizing PAC-Bayesian Generalization Bounds

10/29/2018, by David Reeb et al. (Bosch)

Gaussian Processes (GPs) are a generic modelling tool for supervised learning. While they have been successfully applied on large datasets, their use in safety-critical applications is hindered by the lack of good performance guarantees. To this end, we propose a method to learn GPs and their sparse approximations by directly optimizing a PAC-Bayesian bound on their generalization performance, instead of maximizing the marginal likelihood. Besides its theoretical appeal, we find in our evaluation that our learning method is robust and yields significantly better generalization guarantees than other common GP approaches on several regression benchmark datasets.


1 Introduction

Gaussian Processes (GPs) are a powerful modelling method due to their non-parametric nature [1]. Although GPs are probabilistic models and hence come equipped with an intrinsic measure of uncertainty, this uncertainty does not allow conclusions about their performance on previously unseen test data. For instance, one often observes overfitting if a large number of hyperparameters is adjusted using marginal likelihood optimization [2]. While a fully Bayesian approach, i.e. marginalizing out the hyperparameters, reduces this risk, it incurs a prohibitive runtime since the predictive distribution is no longer analytically tractable. Also, it does not entail out-of-the-box safety guarantees.

In this work, we propose a novel training objective for GP models, which enables us to give rigorous and quantitatively good performance guarantees on future predictions. Such rigorous guarantees are developed within statistical learning theory (e.g. [3]). But since the classical uniform learning bounds are meaningless for expressive models like deep neural networks [4] (where, e.g., the VC dimension exceeds the training set size) and for GPs or non-parametric methods in general, such guarantees cannot be employed for learning those models. Instead, common optimization schemes are (regularized) empirical risk minimization (ERM) [3, 4], maximum likelihood estimation (MLE) [1], or variational inference (VI) [5, 6].

On the other hand, better non-uniform learning guarantees have been developed within the PAC-Bayesian framework [7, 8, 9] (Sect. 2). They are specially adapted to probabilistic methods like GPs and can yield tight generalization bounds, as observed for GP classification [10], probabilistic SVMs [11, 12], linear classifiers [13], and stochastic neural networks [14]. Most previous works used PAC-Bayesian bounds merely for the final evaluation of the generalization performance, whereas learning by optimizing a PAC-Bayesian bound has barely been explored [13, 14]. This work, for the first time, explores the use of PAC-Bayesian bounds (a) for GP training and (b) in the regression setting.

Specifically, we propose to learn full and sparse GP predictors by directly minimizing the PAC-Bayesian upper bound from Eq. (5) on the true future risk of the predictor, as a principled method to ensure good generalization (Sect. 3). Our general approach comes naturally for GPs because the KL divergence KL(Q‖P) in the PAC-Bayes theorem can be evaluated analytically for GPs P and Q sharing the same hyperparameters. As this applies to popular sparse GP variants such as DTC [16], FITC [15], and VFE [6], they all become amenable to our method of PAC-Bayes learning, combining the computational benefits of sparse GPs with theoretical guarantees. We carefully account for the different types of parameters (hyperparameters, inducing inputs, observation noise, free-form parameters), as only some of them contribute to the "penalty term" in the PAC-Bayes bound. Further, we base GP learning directly on the inverse binary KL divergence [10], and not on looser bounds used previously, such as those obtained from Pinsker's inequality (e.g., [14]).

We demonstrate our GP learning method on regression tasks, whereas PAC-Bayes bounds have so far mostly been used in classification settings. A PAC-Bayesian bound for regression with a potentially unbounded loss function was developed in [17]; it requires a sub-Gaussian assumption w.r.t. the (unknown) data distribution, see also [18]. To remain distribution-free as in the usual PAC setting, we employ and investigate a generic bounded loss function for regression.

We evaluate our learning method on several datasets and compare its performance to state-of-the-art GP methods [1, 15, 6] in Sect. 4. Our learning objective exhibits robust optimization behaviour with the same scaling to large datasets as the other GP methods. We find that our method yields significantly better risk bounds, often by a factor of more than two, and that only for our approach does the guarantee improve with the number of inducing points.

2 General PAC-Bayesian Framework

2.1 Risk functions

We consider the standard supervised learning setting [3], where a set S = {(x_1, y_1), …, (x_N, y_N)} ⊆ X × Y of N training examples is used to learn a hypothesis in a hypothesis space F, a subset of the space Y^X of functions f: X → Y. We allow learning algorithms that output a distribution Q over hypotheses f ∈ F, rather than a single hypothesis f, which is the case for the GPs we consider later on.

To quantify how well a hypothesis f performs, we assume a bounded loss function ℓ to be given, w.l.o.g. scaled to the interval [0, 1]. ℓ(y, f(x)) measures how well the prediction f(x) approximates the actual output y at an input x. The empirical risk of a hypothesis f is then defined as the average training loss R_S(f) := (1/N) Σ_{i=1}^N ℓ(y_i, f(x_i)). As in the usual PAC framework, we assume an (unknown) underlying distribution μ on the set X × Y of examples, and define the (true) risk as R(f) := E_{(x,y)∼μ}[ℓ(y, f(x))]. We will later assume that the training set S consists of N independent draws from μ and study how close R_S(f) is to its mean R(f) [3]. To quantify the performance of stochastic learning algorithms, which output a distribution Q over hypotheses, we define the empirical and true risks, by a slight abuse of notation, as [7]:

R_S(Q) := E_{f∼Q}[R_S(f)] = E_{f∼Q}[(1/N) Σ_{i=1}^N ℓ(y_i, f(x_i))],   (1)
R(Q) := E_{f∼Q}[R(f)] = E_{f∼Q} E_{(x,y)∼μ}[ℓ(y, f(x))].   (2)

These are the average losses, also termed Gibbs risks, on the training set and under the true distribution, respectively, where the hypothesis f is sampled according to Q before prediction.

In the following, we focus on the regression case, where Y ⊆ ℝ is a set of reals. An exemplary loss function in this case is ℓ(y, f(x)) = 1[f(x) ∉ (r_−(y), r_+(y))], where the functions r_± specify an interval outside of which a prediction is deemed insufficient; similar to ε-support vector regression [19], we use r_±(y) = y ± ε, with a desired accuracy goal ε > 0 specified before learning (see Sect. 4). In any case, the expectations over f ∼ Q in (1)–(2) reduce to one-dimensional integrals, as f(x) is a real-valued random variable at each input x. See App. C, where we also explore other loss functions.
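For GP predictors (Sect. 3), the marginal f(x_i) under Q is Gaussian, so these one-dimensional integrals are available in closed form via the Gaussian CDF. The following is a minimal sketch of evaluating the Gibbs empirical risk (1) for the above ε-insensitive 0-1 loss; the function and variable names are ours, not taken from the paper's released code.

```python
# Sketch: Gibbs empirical risk (1) for the eps-insensitive 0-1 loss under
# Gaussian marginals f(x_i) ~ N(mu_i, s2_i), e.g. from a GP posterior.
import numpy as np
from scipy.stats import norm

def gibbs_empirical_risk(y, mu, s2, eps):
    """R_S(Q) = (1/N) sum_i Pr_{f~Q}[|f(x_i) - y_i| > eps].

    Each term is a 1D Gaussian integral:
    Pr[|f(x_i) - y_i| <= eps] = Phi((y_i + eps - mu_i)/s_i) - Phi((y_i - eps - mu_i)/s_i).
    """
    s = np.sqrt(s2)
    p_inside = norm.cdf((y + eps - mu) / s) - norm.cdf((y - eps - mu) / s)
    return np.mean(1.0 - p_inside)
```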

Instead of the stochastic predictor f ∼ Q, one is often interested in the deterministic Bayes predictor x ↦ E_{f∼Q}[f(x)] [10]; for GP regression, this simply equals the predictive mean at x. The corresponding Bayes risk is defined by R_Bayes(Q) := E_{(x,y)∼μ}[ℓ(y, E_{f∼Q}[f(x)])]. While PAC-Bayesian theorems do not directly give a bound on R_Bayes(Q) but only on R(Q), it is easy to see that R_Bayes(Q) ≤ 2 R(Q) if ℓ(y, ·) is quasi-convex, as in the examples above, and the distribution of f(x) is symmetric around its mean (e.g., Gaussian) [10]: whenever the Bayes predictor suffers a loss of 1 at (x, y), at least half of the hypotheses f ∼ Q do so as well. An upper bound on R(Q) below 1/2 thus implies a nontrivial bound on R_Bayes(Q).

2.2 PAC-Bayesian generalization bounds

In this paper we aim to learn a GP by minimizing suitable risk bounds. Due to the probabilistic nature of GPs, we employ generalization bounds for stochastic predictors, which were previously observed to yield stronger guarantees than those for deterministic predictors [10, 11, 14]. The most important results in this direction are the so-called "PAC-Bayesian bounds", originating from [7, 8] and developed in various directions [10, 20, 9, 13, 21, 17].

The PAC-Bayesian theorem (Theorem 1) gives a probabilistic upper bound (generalization guarantee) on the true risk R(Q) of a stochastic predictor in terms of its empirical risk R_S(Q) on a training set S. It requires one to fix a distribution P on the hypothesis space F before seeing the training set S, and applies to the true risk of any distribution Q on F. (We follow common usage and call P and Q "prior" and "posterior" distributions in the PAC-Bayesian setting, although their meaning is somewhat different from priors and posteriors in Bayesian probability theory.) The bound contains a term that can be interpreted as the complexity of the hypothesis distribution Q, namely the Kullback-Leibler (KL) divergence KL(Q‖P), which takes values in [0, +∞]. The bound also contains the binary KL divergence kl(q‖p) := q ln(q/p) + (1−q) ln((1−q)/(1−p)), defined for q, p ∈ [0, 1], or more precisely its (upper) inverse w.r.t. the second argument (for q ∈ [0, 1], ε ≥ 0):

kl^{-1}(q, ε) := max {p ∈ [0, 1] : kl(q‖p) ≤ ε},   (3)

which for q < 1 and ε > 0 equals the unique p ∈ (q, 1) satisfying kl(q‖p) = ε. While kl^{-1} has no closed-form expression, we refer to App. A for an illustration and more details, including its derivatives for optimization.

Theorem 1 (PAC-Bayesian theorem [7, 10, 20]).

For any [0, 1]-valued loss function ℓ, for any distribution μ, for any N ∈ ℕ, for any distribution P on a hypothesis set F, and for any δ ∈ (0, 1], the following holds with probability at least 1 − δ over the training set S ∼ μ^N: for all distributions Q on F,

R(Q) ≤ kl^{-1}(R_S(Q), [KL(Q‖P) + ln(2√N/δ)] / N).   (4)

The RHS of (4) can be upper bounded by R_S(Q) + √([KL(Q‖P) + ln(2√N/δ)] / (2N)) (App. A), which gives a useful intuition about the involved terms, but can exceed 1 and thereby yield a trivial statement. Note that the full PAC-Bayes theorem [20] gives a simultaneous lower bound on R(Q), which is however not relevant here as we are going to minimize the upper risk bound. Further refinements of the bound are possible (e.g., [20]), but as they improve over Theorem 1 only in small regimes [9, 13, 21], often despite adjustable parameters, we will stick with the parameter-free bound (4).

We want to consider a family of prior distributions P_θ parametrized by θ ∈ Θ, e.g. as in GP hyperparameter training [1]. If this family is countable, one can generalize the above analysis by fixing some probability distribution p_θ on Θ and defining the mixture prior P := Σ_{θ∈Θ} p_θ P_θ; when Θ is a finite set, the uniform distribution p_θ = 1/|Θ| is a canonical choice. Using the fact that KL(Q‖P) ≤ KL(Q‖P_θ) + ln(1/p_θ) holds for each θ ∈ Θ (App. B), Theorem 1 yields that, with probability at least 1 − δ over S,

R(Q) ≤ kl^{-1}(R_S(Q), [KL(Q‖P_θ) + ln(1/p_θ) + ln(2√N/δ)] / N).   (5)

The bound (5) holds simultaneously for all θ ∈ Θ and all Q. (The same result can be derived from (4) via a union bound argument; see App. B.) One can thus optimize over both θ and Q to obtain the best generalization guarantee, with confidence at least 1 − δ. We use the RHS of (5) for our training method below, but we will also compare to training with the suboptimal upper bound R_S(Q) + √([KL(Q‖P_θ) + ln(1/p_θ) + ln(2√N/δ)] / (2N)) as was done previously [14]. The PAC-Bayesian bound depends only weakly on the confidence parameter δ, which enters logarithmically and is suppressed by the sample size N. When the hyperparameter set Θ is not too large (i.e. ln|Θ| is small compared to N), the main contribution to the penalty term in the second argument of kl^{-1} comes from KL(Q‖P_θ), which must be small compared to N for a good generalization statement (see Sect. 4).
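As a concrete illustration, the following sketch (ours, not the paper's released code) assembles the RHS of (5) for a uniform hyperparameter prior, computing kl^{-1} by one-dimensional root finding as discussed in App. A; all function names here are assumptions made for this sketch.

```python
# Sketch: the training/evaluation objective of Eq. (5), with kl^{-1} of Eq. (3)
# computed by root finding on [q, 1), using the monotonicity of p -> kl(q||p).
import numpy as np
from scipy.optimize import brentq

def kl_bernoulli(q, p):
    """Binary KL divergence kl(q||p), with the convention 0 ln 0 := 0."""
    val = 0.0
    if q > 0:
        val += q * np.log(q / p)
    if q < 1:
        val += (1 - q) * np.log((1 - q) / (1 - p))
    return val

def kl_inv(q, eps, tol=1e-12):
    """kl^{-1}(q, eps) = max{p in [0,1] : kl(q||p) <= eps}, cf. Eq. (3)."""
    if eps <= 0 or q >= 1.0:
        return min(q, 1.0)
    hi = 1.0 - tol
    if kl_bernoulli(q, hi) <= eps:   # bound runs into the numerical endpoint p = 1
        return 1.0
    return brentq(lambda p: kl_bernoulli(q, p) - eps, q, hi)

def pac_bayes_bound(risk_emp, kl_QP, N, log_card_Theta, delta):
    """RHS of Eq. (5) with uniform p_theta, i.e. ln(1/p_theta) = ln|Theta|."""
    eps = (kl_QP + log_card_Theta + np.log(2.0 * np.sqrt(N) / delta)) / N
    return kl_inv(risk_emp, eps)
```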

3 PAC-Bayesian learning of GPs

3.1 Learning full GPs

GP modelling is usually presented as a Bayesian method [1], in which the prior P = GP(m, k) is specified by a positive definite kernel k: X × X → ℝ and a mean function m: X → ℝ on the input set X. In ordinary GP regression, the learned distribution Q is then chosen as the Bayesian posterior coming from the assumption that the training outputs y_i are noisy versions of the latent f(x_i), with i.i.d. Gaussian likelihood y_i ∼ N(f(x_i), σ_n²). Under this assumption, Q is again a GP [1]:

Q = GP(m_Q, k_Q),  with
m_Q(x) = m(x) + k(x, X) [K + σ_n² 1_N]^{-1} (Y − m(X)),
k_Q(x, x') = k(x, x') − k(x, X) [K + σ_n² 1_N]^{-1} k(X, x'),   (6)

with the kernel matrix K := [k(x_i, x_j)]_{i,j=1}^N, the output vector Y := (y_1, …, y_N)ᵀ, m(X) := (m(x_1), …, m(x_N))ᵀ, and the row vector k(x, X) := (k(x, x_1), …, k(x, x_N)). Eq. (6) is employed to make (stochastic) predictions on new inputs x. In our approach below, we do not require any Bayesian rationale behind Q but merely use its form, parametrized by θ and σ_n², as an optimization ansatz within the PAC-Bayesian theorem.

Importantly, for any full GP prior P and its corresponding posterior Q from (6), the KL divergence KL(Q‖P) in Theorem 1 and Eq. (5) can be evaluated on finite (N-)dimensional matrices. This allows us to evaluate the PAC-Bayesian bound and in turn to learn GPs by optimizing it. More precisely, one can easily verify that Q and P have the same conditional distribution Q(f | f_X) = P(f | f_X) of all other function values f given the function values f_X := (f(x_1), …, f(x_N)) at the training inputs. (In fact, direct computation [1] shows that both conditionals equal the noise-free GP conditional on f_X. Remarkably, this conditional does not depend on Y nor on σ_n², even though Q from (6) does. Intuitively this is because, for the above likelihood, Y is independent of f given f_X.) Therefore,

KL(Q‖P) = KL(Q(f_X)‖P(f_X))   (7)
 = (1/2) [ (Y − m(X))ᵀ (K + σ_n² 1)^{-1} K (K + σ_n² 1)^{-1} (Y − m(X)) + tr(σ_n² (K + σ_n² 1)^{-1}) − N + ln det(1_N + K/σ_n²) ],   (8)

where in the last step we used the well-known formula [22]

KL(N(μ_0, Σ_0)‖N(μ_1, Σ_1)) = (1/2) [ tr(Σ_1^{-1} Σ_0) + (μ_1 − μ_0)ᵀ Σ_1^{-1} (μ_1 − μ_0) − N + ln(det Σ_1 / det Σ_0) ]

for the KL divergence between N-dimensional normal distributions N(μ_0, Σ_0) and N(μ_1, Σ_1), and simplified a bit (see also App. D).
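A minimal sketch of evaluating Eq. (8) numerically, assuming a zero mean function and using a Cholesky factorization for the log-determinant; the code and its names are ours, not the paper's implementation:

```python
# Sketch: KL(Q||P) of Eq. (8) for a full GP with m = 0.
import numpy as np

def kl_full_gp(K, Y_centered, sigma_n2):
    """Eq. (8) with Y_centered = Y - m(X); K is the N x N kernel matrix."""
    N = K.shape[0]
    A = K + sigma_n2 * np.eye(N)                      # K + sigma_n^2 1_N
    alpha = np.linalg.solve(A, Y_centered)            # (K + s^2 1)^{-1} (Y - m)
    quad = alpha @ (K @ alpha)                        # data-fit term in (8)
    trace = sigma_n2 * np.trace(np.linalg.inv(A))     # tr[s^2 (K + s^2 1)^{-1}]
    # ln det(1 + K/s^2) = ln det(A) - N ln s^2, via the Cholesky factor of A
    L = np.linalg.cholesky(A)
    logdet = 2.0 * np.sum(np.log(np.diag(L))) - N * np.log(sigma_n2)
    return 0.5 * (quad + trace - N + logdet)
```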

To learn a full GP means to select "good" values for the hyperparameters θ, which parametrize a family of GP priors P_θ, and for the noise level σ_n² [1]. Those values are afterwards used to make predictions with the corresponding posterior Q from (6). In our experiments (Sect. 4) we use the squared exponential (SE) kernel on X = ℝ^d,

k(x, x') = σ_f² exp[ −Σ_{a=1}^d (x_a − x'_a)² / (2 λ_a²) ],

where σ_f² is the signal variance, λ_1, …, λ_d are the lengthscales, and we set the mean function m to zero. The hyperparameters are θ = (σ_f², λ_1, …, λ_d) (SE-ARD kernel [1]), or θ = (σ_f², λ) if we take all lengthscales to be equal (non-ARD).
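For concreteness, a direct (unoptimized) sketch of this kernel; the function name and array conventions are ours:

```python
# Sketch: the SE-ARD kernel as defined above.
import numpy as np

def se_ard_kernel(X1, X2, sigma_f2, lengthscales):
    """k(x, x') = sigma_f^2 exp(-sum_a (x_a - x'_a)^2 / (2 lambda_a^2)).

    X1: (N1, d), X2: (N2, d), lengthscales: (d,) array (or a scalar for the
    non-ARD case); returns the (N1, N2) kernel matrix."""
    diff = (X1[:, None, :] - X2[None, :, :]) / lengthscales
    return sigma_f2 * np.exp(-0.5 * np.sum(diff**2, axis=-1))
```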

The basic idea of our method, which we call "PAC-GP", is now to learn the parameters θ and σ_n² by minimizing the upper bound from Eq. (5), therefore selecting the GP predictor with the best generalization performance guarantee within the scope of the PAC-Bayesian bound. (Contrary to the usual GP viewpoint [1], σ_n² is not a hyperparameter in our method, since the prior P_θ does not depend on it; thus, σ_n² also does not contribute to the "penalty term" ln(1/p_θ). It is merely a free parameter in the posterior distribution Q. By (8), KL(Q‖P_θ) → ∞ as σ_n² → 0, so we need this parameter because otherwise the bound as well as the optimization objective would become trivial. Although the parameter σ_n² is originally motivated by a Gaussian observation noise assumption, the aim here is merely to parametrize the posterior Q in some way while maintaining computational tractability; cf. also Sect. 3.2.) Note that all involved terms, R_S(Q) (App. C) and KL(Q‖P_θ) from (8), as well as their derivatives (App. A), can be computed effectively, so we can use gradient-based optimization.

The only remaining issue is that the learned prior hyperparameters θ have to come from a discrete set Θ that must be specified before seeing the training set S (Sect. 2.2). To achieve this, we first minimize the RHS of Eq. (5) over θ and σ_n² in a gradient-based manner, and thereafter discretize each of the D components of θ to the closest point in an equispaced L-element set; thus, the penalty term to be used in the optimization objective (5) is ln(1/p_θ) = ln|Θ| = D ln L. The SE-ARD kernel has D = d + 1, while the standard SE kernel has D = 2 parameters. In our experiments we round each component of θ to two decimal digits within a fixed range, which determines L. We found that this discretization has virtually no effect on the predictions of Q, and that coarser rounding (i.e. smaller L) does not significantly improve the bound (5) (via its smaller penalty term ln|Θ|) nor the optimization (via its higher sensitivity to θ); see App. F.
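A sketch of this post-hoc rounding and of the resulting penalty for a uniform p_θ; the concrete grid below (step 0.01 on log-scaled parameters over [−6, 6]) is an illustrative assumption of ours, not necessarily the paper's exact choice:

```python
# Sketch: discretize learned hyperparameters onto a pre-defined grid and
# compute the penalty ln|Theta| = D ln L for a uniform p_theta.
import numpy as np

def discretize_and_penalty(log_theta, lo=-6.0, hi=6.0, step=0.01):
    """Round each log-hyperparameter to the nearest grid point in [lo, hi];
    return the rounded values together with ln|Theta| = D * ln(L)."""
    L_grid = int(round((hi - lo) / step)) + 1         # L grid points per component
    log_theta_disc = np.clip(np.round(np.asarray(log_theta) / step) * step, lo, hi)
    D = np.size(log_theta)                            # number D of hyperparameters
    return log_theta_disc, D * np.log(L_grid)
```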

3.2 Learning sparse GPs

Despite the fact that, with confidence 1 − δ, the bound in (5) holds for any prior P_θ from the GP family and for any distribution Q, in Sect. 3.1 we optimized the upper bound merely over (θ, σ_n²), after substituting P = P_θ and the corresponding posterior Q from (6). We are limited by the need to compute KL(Q‖P) effectively, for which we relied on the equality of conditionals Q(f | f_X) = P(f | f_X) and on the Gaussianity of Q(f_X) and P(f_X), cf. (7). Building on these two requirements, we now construct more general pairs (P, Q) of GPs with effectively computable KL(Q‖P), so that our learning method becomes more widely applicable, including to sparse GP methods.

Instead of the points X = (x_1, …, x_N) associated with the training set as in Sect. 3.1, one may choose from the input space any number M of points Z = (z_1, …, z_M), often called inducing inputs, and any Gaussian distribution N(a, B) on the function values f_Z := (f(z_1), …, f(z_M)), with any mean vector a ∈ ℝ^M and positive semidefinite matrix B ∈ ℝ^{M×M}. This distribution on f_Z can be extended to the function values at all inputs by using the conditional P(f | f_Z) from the prior (see Sect. 3.1). This yields the following predictive GP:

Q = GP(m_Q, k_Q),  with
m_Q(x) = m(x) + k(x, Z) K_ZZ^{-1} (a − m(Z)),
k_Q(x, x') = k(x, x') − k(x, Z) K_ZZ^{-1} [K_ZZ − B] K_ZZ^{-1} k(Z, x'),   (9)

where K_ZZ := [k(z_i, z_j)]_{i,j=1}^M, k(x, Z) := (k(x, z_1), …, k(x, z_M)), and m(Z) := (m(z_1), …, m(z_M))ᵀ. This form of Q includes several approximate posteriors from Bayesian inference that have been used in the literature [1, 10, 6, 26], even for noise models other than the Gaussian one used to motivate the Q from Sect. 3.1. Analogous reasoning as in (7) now gives [10, 6, 23]:

KL(Q‖P) = KL(N(a, B)‖N(m(Z), K_ZZ)).   (10)

One can thus effectively optimize in (5) the prior P_θ and the posterior distribution Q by varying the number M and the locations Z of the inducing inputs and the parameters a and B, along with the hyperparameters θ. Optimization can in this framework be organized such that it consumes O(NM² + M³) time per gradient step and O(NM + M²) memory, as opposed to O(N³) time and O(N²) memory for the full GP of Sect. 3.1. This is a big saving when M ≪ N and justifies the name "sparse GP" [1, 24].
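Since (10) is simply a KL divergence between two M-dimensional Gaussians, it can be evaluated with the formula stated below Eq. (8); a minimal sketch of ours, assuming B is strictly positive definite:

```python
# Sketch: KL(N(a, B) || N(m_Z, K_ZZ)) of Eq. (10).
import numpy as np

def kl_sparse_gp(a, B, m_Z, K_ZZ):
    """Gaussian KL in M dimensions, for strictly positive definite B and K_ZZ."""
    M = K_ZZ.shape[0]
    diff = a - m_Z
    trace = np.trace(np.linalg.solve(K_ZZ, B))        # tr(K_ZZ^{-1} B)
    quad = diff @ np.linalg.solve(K_ZZ, diff)         # diff^T K_ZZ^{-1} diff
    logdet_K = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(K_ZZ))))
    logdet_B = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(B))))
    return 0.5 * (trace + quad - M + logdet_K - logdet_B)
```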

Some popular sparse-GP methods [24] are special cases of the above form, obtained by prescribing certain a and B depending on the training set (X, Y), so that only the inducing inputs Z and a few parameters such as σ_n² are left free:

a = m(Z) + K_ZZ [K_ZZ + K_ZX (Λ + σ_n² 1_N)^{-1} K_XZ]^{-1} K_ZX (Λ + σ_n² 1_N)^{-1} (Y − m(X)),
B = K_ZZ [K_ZZ + K_ZX (Λ + σ_n² 1_N)^{-1} K_XZ]^{-1} K_ZZ,   (11)

where K_ZX := [k(z_i, x_j)] = K_XZᵀ, and Λ is a diagonal N×N matrix with entries λ_i. Setting λ_i = k(x_i, x_i) − k(x_i, Z) K_ZZ^{-1} k(Z, x_i) corresponds to the FITC approximation [15], whereas λ_i = 0 gives the VFE and DTC methods [6, 16] (see App. D for their training objectives); one can also linearly interpolate between both choices [25]. Another form of sparse GPs, where the latent function values f_Z are fixed and not marginalized over, corresponds to B = 0, which however gives diverging KL(Q‖P) via (10) and therefore trivial bounds in (4)–(5).
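A sketch of the prescribed pair (a, B), following our reading of Eq. (11) with a zero mean function; the routine and its names are chosen for this illustration only:

```python
# Sketch: (a, B) of Eq. (11) for FITC (lam = diagonal residuals) or VFE/DTC
# (lam = 0), assuming m = 0.
import numpy as np

def fitc_vfe_params(K_ZZ, K_ZX, kxx_diag, y, sigma_n2, fitc=True):
    """K_ZZ: (M, M); K_ZX: (M, N); kxx_diag: (N,) with entries k(x_i, x_i);
    y: (N,) training outputs. Returns the mean a and covariance B of Eq. (11)."""
    if fitc:
        # lambda_i = k(x_i, x_i) - k(x_i, Z) K_ZZ^{-1} k(Z, x_i)   (FITC)
        lam = kxx_diag - np.einsum('mn,mn->n', K_ZX, np.linalg.solve(K_ZZ, K_ZX))
    else:
        lam = np.zeros_like(kxx_diag)                 # VFE / DTC
    W = K_ZX / (lam + sigma_n2)                       # K_ZX (Lambda + s^2 1)^{-1}
    G = K_ZZ + W @ K_ZX.T                             # K_ZZ + K_ZX (...)^{-1} K_XZ
    a = K_ZZ @ np.linalg.solve(G, W @ y)
    B = K_ZZ @ np.linalg.solve(G, K_ZZ)
    return a, B
```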

Our learning method for sparse GPs ("PAC-SGP") now follows similar steps as in Sect. 3.1: one has to include a penalty ln(1/p_θ) for the prior hyperparameters θ, which are to be discretized into the set Θ after the optimization of (5). Note that Θ contains the prior hyperparameters θ only, and neither the inducing points Z nor σ_n², a, B, or Λ from (11); all these quantities can be optimized over simultaneously with θ, but do not need to be discretized. The number M of inducing inputs can also be varied, which determines the required computational effort, and all optimizations can be either discrete [16] or continuous [15, 6]. When optimizing over positive semidefinite B, the parametrization B = LLᵀ with a lower triangular matrix L can be used [26]. For the experiments below we always employ the FITC parametrization (11) (with the fixed λ_i given above) in our proposed PAC-SGP method, i.e. our optimization parameters are Z and σ_n² besides the lengthscale hyperparameters θ.

4 Experiments

Python code (building on GPflow [27] and TensorFlow [28]) implementing our method is available at https://github.com/boschresearch/PAC_GP.

We now illustrate our learning method and compare it with other GP methods on various regression tasks. In contrast to prior work [14], we found the gradient-based training with the objective (5) to be robust enough that no pretraining with conventional objectives (such as those from App. D) is necessary. We set the confidence parameter δ as in [10, 14] throughout, cf. Sect. 2.2, and use (unless specified otherwise) the generic bounded loss function ℓ(y, f(x)) = 1[|f(x) − y| > ε] for regression, with the accuracy goal ε specified below.

We evaluate the following methods: (a) PAC-GP: our proposed method (cf. Sect. 3.1), with the training objective (5) (kl-PAC-GP) and, for comparison, with the looser square-root training objective (sqrt-PAC-GP) (see below (5); similar to e.g. [14]); (b) PAC-SGP: our sparse GP method (Sect. 3.2), again with the two objectives (kl-PAC-SGP and sqrt-PAC-SGP, respectively); (c) full-GP: the ordinary full GP for regression [1]; (d) VFE: Titsias' sparse GP [6]; (e) FITC: Snelson-Ghahramani's sparse GP [15]. Note that full-GP, VFE, and FITC, as well as sqrt-PAC-GP and sqrt-PAC-SGP, are trained on objectives other than (5) (see App. D); we evaluate the upper bound (5) on their generalization performance by computing KL(Q‖P_θ) via (8) or (10). To obtain finite generalization bounds, we discretize θ for all methods at the end of training, as in Sect. 3.1, and use the appropriate ln|Θ| in (5).

Figure 1: Predictive distributions. The predictive distributions (mean with shaded uncertainty area) of our kl-PAC-SGP (blue) are shown for various choices of ε together with the full-GP's prediction (red). (Note that by Eqs. (9) and (11), kl-PAC-SGP's predictive variance does not include the additive σ_n², whereas full-GP's does [1].) The shaded green area visualizes an ε-band, centered around the kl-PAC-SGP's predictive mean; datapoints (black dots) inside this band do not contribute to the empirical risk R_S(Q). Crosses above/below the plots indicate the inducing point positions before/after training.

(a) Predictive distribution.

To get a first intuition, we illustrate in Fig. 1 the effect of varying ε in the loss function on the predictive distribution of our sparse PAC-SGP. The accuracy goal ε defines a band around the predictive mean within which data points do not contribute to the empirical risk R_S(Q). We thus chose the accuracy goal ε relative to the observation noise σ_n obtained from an ordinary full-GP. Results are presented on the 1D toy dataset (snelson, 200 points in 1 dimension, available at www.gatsby.ucl.ac.uk/~snelson) from the original FITC [15] and VFE [6] publications. (For a comparison to the predictive distributions of FITC and VFE see App. E, which also contains an illustration that our kl-PAC-SGP avoids FITC's known overfitting on pathological datasets.) Here and below, we optimize the hyperparameters in each experiment anew.

We find that for large ε (right plot) the predictive distribution (blue) becomes smoother: due to the wider ε-band (green), the PAC-SGP does not need to adapt much to the data for the ε-band to contain many data points. Hence the predictive distribution can remain closer to the prior, which reduces the KL term in the objective (5). For the same reason, the inducing points need not adapt much compared to their initial positions for large ε. For smaller ε, the PAC-SGP adapts more to the data, whereas for very small ε (left plot), it is anyhow not possible to place many data points within the narrow ε-band, so the predictive distribution can again stay closer to the prior (compare e.g. in the first and second plots the blue curves near the rightmost datapoints) for a smaller KL term. In particular, the KL divergence (divided by the number N of training points) for the three settings in Fig. 1 is: 0.097 (left), 0.109 (middle), and 0.031 (right).

(b) Full-GP experiments – dependence on the accuracy goal ε.

To explore the dependence on the desired accuracy further, we compare in Fig. 2 the ordinary full-GP to our PAC-GPs on the boston housing dataset (506 points, 13 input dimensions, available at http://lib.stat.cmu.edu/datasets/boston). As pre-processing, we normalized all features and the output to mean zero and unit variance, and then analysed the impact of the accuracy goal ε. We used 80% of the dataset for training and 20% for testing, in ten repetitions of the experiment.

Figure 2: Dependence on the accuracy goal ε. For each ε, the plots from left to right show (means as bars, standard errors over the ten repetitions as grey ticks) the upper bound from Eq. (5), the Gibbs training risk R_S(Q), the Gibbs test risk (as a proxy for the true risk R(Q)), the MSE, and the complexity KL(Q‖P_θ)/N, after learning on the boston housing dataset by three different methods: our kl-PAC-GP from Sect. 3.1, sqrt-PAC-GP, and the ordinary full-GP.

Our PAC-GP yields significantly better generalization guarantees for all accuracy goals ε compared to full-GP, since we are directly optimizing the bound (5). This effect is stronger for large ε, where the KL term of PAC-GP can decrease, as Q may remain closer to the prior while keeping the training loss low. Although better bounds do not necessarily imply better Gibbs test risk, kl-PAC-GP performs only slightly below the ordinary full-GP in this regard. Moreover, our PAC-GPs exhibit less overfitting than the full-GP, for which the test risks are significantly larger than the training risks (see Table 1 in App. G for numerical values). On the other hand, the tighter objective (5) of kl-PAC-GP allows learning a slightly more complex GP, in terms of the KL divergence, than sqrt-PAC-GP does, which results in better test risks and at the same time better guarantees. This confirms that kl-PAC-GP is always preferable to sqrt-PAC-GP. However, as any prediction within the full ε-band around the ground truth incurs no risk for our PAC-GPs, their mean squared error (MSE) increases with ε.

The fact that our learned PAC-GPs exhibit higher training and test errors (Gibbs risk and especially MSE) than full-GP can be explained by their underfitting in order to hedge against violating Eq. (5) (i.e. Theorem 1). This underfitting is evidenced by PAC-GP's significantly less complex learned posterior, as measured by KL(Q‖P_θ) (Fig. 2), or similarly (via Eqs. (8) and (10)) by its larger learned noise variance σ_n² compared to full-GP's (Table 1 in App. G). It is exactly this stronger regularization of PAC-GP in terms of the KL divergence that leads to its better generalization guarantees.

In the following, we fix ε after pre-processing the data as above, to illustrate PAC-GP further. Note however that in a concrete application, ε should be fixed to a desired accuracy goal using domain knowledge, but before seeing the training set S. Alternatively, one can consider a set E of ε-values chosen in advance, at the cost of a term ln|E| in addition to ln|Θ| in the objective (5).

(c) Sparse-GP experiments – dependence on the number M of inducing inputs.

Figure 3: Dependence on the number of inducing inputs. Shown are the averages (with standard errors over the ten repetitions) of the upper bound from Eq. (5), the Gibbs training risk R_S(Q), the Gibbs test risk, the MSE, and KL(Q‖P_θ)/N, as a function of the number M of inducing inputs (from left to right). We compare our sparse kl-PAC-SGP (Sect. 3.2) with the two popular GP approximations VFE and FITC. Each row corresponds to one dataset: pol (top), sarcos (middle), and kin40k (bottom). kl-PAC-SGP has the best guarantee in all settings (left column), due to a lower model complexity (right column), but this comes at the price of slightly larger test errors.

We now examine our sparse PAC-SGP method (Sect. 3.2) on the three large datasets pol (15,000 points, 26 dimensions) and kin40k (40,000 points, 8 dimensions), both from https://github.com/trungngv/fgp.git, and sarcos (48,933 points, 21 dimensions, from http://www.gaussianprocess.org/gpml/data), again using 80%–20% train-test splits and ten repetitions. The results are shown in Fig. 3. Here, we vary the number M of inducing points. For modelling pol and kin40k we use the SE-ARD kernel, due to its better performance on these datasets, whereas we model sarcos without ARD (cf. Table 2 in App. G for the comparison ARD vs. non-ARD). The corresponding penalty terms ln|Θ| = D ln L differ accordingly; when compared to the values of KL(Q‖P_θ) in Fig. 3, their contribution is largest for the pol dataset.

Our kl-PAC-SGP achieves significantly better upper bounds than VFE and FITC: by more than a factor of 3 on sarcos, a factor of roughly 2 on pol, and a factor between 1.3 and 2 on kin40k (Fig. 3, cf. also Table 2 in App. G). Also, the PAC-Bayes upper bound is much tighter for kl-PAC-SGP than for VFE or FITC, i.e. closer to the Gibbs risk, often by factors exceeding 3. Our kl-PAC-SGP also behaves more favorably in terms of the generalization guarantee when inducing points are added and more complex models are allowed: our upper bound improves substantially with M (kin40k) or at least does not degrade (pol and sarcos), as opposed to VFE and FITC, whose complexities grow substantially with M. Since very low training risks can already be achieved with a moderate number of inducing points on pol and sarcos, a KL term growing with M deteriorates their upper bounds. Regarding the upper bound, the increased flexibility from larger M thus only pays off on the kin40k dataset, whereas the MSE improves with increasing M for all models and datasets. As above, kl-PAC-SGP is always slightly preferable to sqrt-PAC-SGP, not only for the upper bound and Gibbs risks, as expected, but also for MSE (see Table 2 in App. G).

Similarly to the boston dataset, the higher test errors of kl-PAC-SGP compared to VFE and FITC can be explained by underfitting due to the stronger regularization, again evidenced by a lower KL(Q‖P_θ) and a significantly larger learned σ_n² (by factors of 4–28 compared to VFE), cf. Table 2 in App. G. In fact, although our implementation of PAC-SGP employs the FITC parametrization, the PAC-(S)GP optimization is not prone to FITC's well-known overfitting tendency [2], due to the regularization via the KL divergence (see App. E, in particular Supplementary Figure 7).

To investigate whether the higher test MSE of PAC-GP compared to VFE and FITC (and to the full-GP above) is a consequence of the 0-1-loss used so far, we re-ran the PAC-SGP experiments (at a fixed number of inducing inputs) with the more distance-sensitive loss function of Eq. (20), which is MSE-like for small deviations |f(x) − y| (Supplementary Figure 5). Our results are tabulated in Table 3 in App. G. The findings are inconclusive and range from an improvement w.r.t. MSE of 25% (pol), over little change (sarcos), to a decline of 12% (kin40k), showing that the effect of the loss function is smaller than might have been expected. Nevertheless, the generalization guarantees of PAC-SGP remain much better than those of the other methods. While the MSE of our PAC-GPs would improve by choosing a smaller ε (e.g., Fig. 2), this comes at the disadvantage of worse generalization bounds.

We further note that no method shows significant overfitting in Fig. 3, in the sense that the differences between test and training Gibbs risks are all rather small, despite the KL complexity increasing with M for VFE and FITC. This is unlike for boston housing above, and may be due to the much larger training sets here. When comparing VFE and FITC, we observe that VFE consistently outperforms FITC in terms of both MSE and generalization guarantee, where VFE's higher KL complexity is offset by its much lower Gibbs risk. This fortifies the results of [2]. We lastly note that, since for our PAC-SGP the obtained guarantees are much smaller than 1/2, we obtain strong guarantees even on the Bayes risk (Sect. 2.1).

5 Conclusion

In this paper, we proposed and explored the use of PAC-Bayesian bounds as an optimization objective for GP training. Consequently, we were able to achieve significantly better guarantees on the out-of-sample performance compared to state-of-the-art GP methods, such as VFE or FITC, while maintaining computational scalability. We further found that using the tighter generalization bound (5), based on the inverse binary KL divergence, leads to an increase in performance on all metrics compared to a looser bound as employed in previous works (e.g. [14]).

Despite the much better generalization guarantees obtained by our method, it often yields worse test error, in particular test MSE, than standard GP regression methods; this largely persists even when using more distance-sensitive loss functions than the 0-1-loss. The underlying reason could be that all loss functions considered in this work are bounded, as necessitated by our desire to provide generalization guarantees irrespective of the true data distribution. While rigorous PAC-Bayesian bounds exist for MSE-like unbounded loss functions under special assumptions on the data distribution [17], it may nevertheless be worthwhile to investigate whether these training objectives lead to better test MSE in examples. A drawback is that those assumptions are usually impossible to verify, so the resulting generalization guarantees are not comparable. Note that the design of a loss function depends on the application domain, and there is no ubiquitous choice across all settings. In many safety-critical applications, small deviations are tolerable whereas larger deviations are all equally catastrophic; there, a 0-1-loss like ours, together with a rigorous bound on it, can be more useful than the MSE test error.

While in this work we focussed on regression tasks, the same strategy of optimizing a generalization bound can also be applied to learn GPs for binary and categorical outputs. Note that the complexity term KL(Q‖P_θ) in this setting has so far been merely upper bounded by its regression proxy [10], and it would be interesting to develop better bounds on the classification complexity term. Lastly, it may be worthwhile to use other or more general sparse GPs within our PAC-Bayesian learning method, such as free-form optimized posteriors [26] or even more general GPs [29].

Acknowledgments

We would like to thank Duy Nguyen-Tuong, Martin Schiegg, and Michael Schober for helpful discussions and proofreading.

References

[1] C. E. Rasmussen, C. K. I. Williams, "Gaussian Processes for Machine Learning", The MIT Press (2006).
[2] M. Bauer, M. van der Wilk, C. E. Rasmussen, "Understanding Probabilistic Sparse Gaussian Process Approximations", In NIPS (2016).
[3] S. Shalev-Shwartz, S. Ben-David, "Understanding Machine Learning: From Theory to Algorithms", Cambridge University Press (2014).
[4] C. Zhang, S. Bengio, M. Hardt, B. Recht, O. Vinyals, "Understanding deep learning requires rethinking generalization", In ICLR (2017).
[5] M. Jordan, Z. Ghahramani, T. Jaakkola, L. Saul, "Introduction to variational methods for graphical models", Machine Learning 37, 183-233 (1999).
[6] M. Titsias, "Variational Learning of Inducing Variables in Sparse Gaussian Processes", In AISTATS (2009).
[7] D. McAllester, "PAC-Bayesian model averaging", In COLT (1999).
[8] D. McAllester, "PAC-Bayesian Stochastic Model Selection", Machine Learning 51, 5-21 (2003).
[9] O. Catoni, "PAC-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning", IMS Lecture Notes Monograph Series, Vol. 56 (2007).
[10] M. Seeger, "PAC-Bayesian Generalization Error Bounds for Gaussian Process Classification", Journal of Machine Learning Research 3, 233-269 (2002).
[11] A. Ambroladze, E. Parrado-Hernández, J. Shawe-Taylor, "Tighter PAC-Bayes bounds", In NIPS (2007).
[12] J. Langford, J. Shawe-Taylor, "PAC-Bayes & margins", In NIPS (2002).
[13] P. Germain, A. Lacasse, F. Laviolette, M. Marchand, "PAC-Bayesian Learning of Linear Classifiers", In ICML (2009).
[14] G. K. Dziugaite, D. M. Roy, "Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data", In UAI (2017).
[15] E. Snelson, Z. Ghahramani, "Sparse Gaussian Processes using Pseudo-inputs", In NIPS (2005).
[16] M. Seeger, C. K. I. Williams, N. Lawrence, "Fast Forward Selection to Speed Up Sparse Gaussian Process Regression", In AISTATS (2003).
[17] P. Germain, F. Bach, A. Lacoste, S. Lacoste-Julien, "PAC-Bayesian Theory Meets Bayesian Inference", In NIPS (2016).
[18] R. Sheth, R. Khardon, "Excess Risk Bounds for the Bayes Risk using Variational Inference in Latent Gaussian Models", In NIPS (2017).
[19] V. Vapnik, "The Nature of Statistical Learning Theory", Springer (1995).
[20] A. Maurer, "A Note on the PAC Bayesian Theorem", arXiv:cs/0411099 (2004).
[21] L. Begin, P. Germain, F. Laviolette, J.-F. Roy, "PAC-Bayesian Bounds based on the Renyi Divergence", In AISTATS (2016).
[22] S. Kullback, R. Leibler, "On information and sufficiency", Annals of Mathematical Statistics 22, 79-86 (1951).
[23] A. G. de G. Matthews, J. Hensman, R. Turner, Z. Ghahramani, "On Sparse Variational Methods and the Kullback-Leibler Divergence between Stochastic Processes", In AISTATS (2016).
[24] J. Quinonero-Candela, C. E. Rasmussen, "A Unifying View of Sparse Approximate Gaussian Process Regression", Journal of Machine Learning Research 6, 1939-1959 (2005).
[25] T. D. Bui, J. Yan, R. E. Turner, "A Unifying Framework for Gaussian Process Pseudo-Point Approximations using Power Expectation Propagation", Journal of Machine Learning Research 18, 1-72 (2017).
[26] J. Hensman, A. Matthews, Z. Ghahramani, "Scalable Variational Gaussian Process Classification", In AISTATS (2015).
[27] A. Matthews, M. van der Wilk, T. Nickson, K. Fujii, A. Boukouvalas, P. León-Villagrá, Z. Ghahramani, J. Hensman, "GPflow: A Gaussian process library using TensorFlow", Journal of Machine Learning Research 18, 1-6 (2017).
[28] M. Abadi et al., "TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems", https://www.tensorflow.org/ (2015).
[29] C.-A. Cheng, B. Boots, "Variational Inference for Gaussian Process Models with Linear Complexity", In NIPS (2017).

Appendix A Inverse binary KL-divergence and its derivatives, Pinsker’s inequality

The function kl^{-1}(q, ε), defined in Eq. (3), can easily be computed numerically for any q ∈ [0, 1] and ε ≥ 0, to any desired accuracy, via the bisection method, since the function p ↦ kl(q‖p) is strictly monotonically increasing from 0 to +∞ as p ranges over [q, 1) (for the edge cases we set kl^{-1}(q, 0) := q and kl^{-1}(1, ε) := 1). Note that the monotonicity in p implies further that kl^{-1}(q, ε) is monotonically increasing in ε. Fig. 4 shows a plot of kl^{-1}(q, ε) for various values of ε, and states a few special function values of kl^{-1}. By Pinsker's inequality it holds that kl(q‖p) ≥ 2(q − p)², which implies that kl^{-1}(q, ε) ≤ q + √(ε/2).

Figure 4: Inverse binary KL divergence. The figure shows plots of kl^{-1}(q, ε) for several values of ε in different colors, the curves for larger ε lying higher. For ε = 0 it is kl^{-1}(q, 0) = q (straight blue line). At q = 0 the curves start at kl^{-1}(0, ε) = 1 − e^{−ε}. At q = 1 we have kl^{-1}(1, ε) = 1 for any ε.

When applying gradient descent to the RHS of the generalization bound (5) (which includes (4) as a special case), as we propose and do, one further needs, besides the evaluation of kl^{-1}, also the derivatives of kl^{-1}(q, ε) w.r.t. both of its arguments. These can easily be derived by differentiating the identity kl(q‖kl^{-1}(q, ε)) = ε w.r.t. q and ε, plugging in the easily computed partial derivatives of kl. Writing p := kl^{-1}(q, ε), the result is:

∂kl^{-1}(q, ε)/∂q = [ln((1−q)/(1−p)) − ln(q/p)] · p(1−p)/(p − q),   (12)
∂kl^{-1}(q, ε)/∂ε = p(1−p)/(p − q).   (13)

The derivative of the RHS of (5) with respect to parameters η (which may include the hyperparameters θ, the noise level σ_n², the inducing points Z, or any other parameters of Q and P_θ, such as a, B, or Λ from Sect. 3.2) thus reads:

∂/∂η kl^{-1}(R_S(Q), ε(η)) = (∂kl^{-1}/∂q) ∂R_S(Q)/∂η + (∂kl^{-1}/∂ε) (1/N) ∂[KL(Q‖P_θ) + ln(1/p_θ)]/∂η,   (14)

where ε(η) := [KL(Q‖P_θ) + ln(1/p_θ) + ln(2√N/δ)]/N denotes the second argument of kl^{-1} in (5), and where the partial derivatives of kl^{-1} in parentheses are given by (12)–(13).
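A sketch (ours) of Eqs. (12)–(13), reusing the kl_inv routine sketched after Eq. (5) in Sect. 2.2:

```python
# Sketch: the derivatives (12)-(13) of kl^{-1}, obtained by implicit
# differentiation of the identity kl(q || p) = eps.
import numpy as np

def kl_inv_derivatives(q, eps):
    """Return (p, dp/dq, dp/deps) with p = kl^{-1}(q, eps), for 0 < q < p < 1."""
    p = kl_inv(q, eps)                                # as sketched in Sect. 2.2
    dp_deps = p * (1.0 - p) / (p - q)                 # Eq. (13)
    dp_dq = dp_deps * (np.log((1 - q) / (1 - p)) - np.log(q / p))   # Eq. (12)
    return p, dp_dq, dp_deps
```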

We use the expression (14) for our gradient-based optimization of the parameters in Sect. 4. Note that we treat all components of η as continuous parameters during this optimization, despite the fact that the hyperparameters θ for the prior have to come from a countable set Θ (see around Eq. (5) and App. B). It is only after the optimization that we discretize all of those parameters onto a pre-defined grid (chosen before seeing the training sample S), as described in Sect. 3.1.

Note that ∂KL(Q‖P_θ)/∂η in (14) can be computed analytically in our proposed methods PAC-GP and PAC-SGP, by using standard matrix algebra (e.g., [1, App. A]) to differentiate the matrix-analytic expressions for KL(Q‖P_θ) in (7)–(8) or (10) (possibly with the parametrization (11)). Furthermore, the derivative of ln(1/p_θ) is easily computed for common distributions p_θ (Sect. 2.2), again treating θ first as a continuous parameter in the optimization, as explained in the previous paragraph; in our paper, we always discretize the hyperparameter set Θ to be finite and choose p_θ as the uniform distribution, so that ln(1/p_θ) = ln|Θ| is independent of θ. Lastly, we show in App. C how to effectively compute ∂R_S(Q)/∂η in the expression (14) for relevant loss functions ℓ.

To our knowledge, the parameter-free PAC-Bayes bound from Theorem 1 or Eq. (5) has never before been used for learning, as we do in our paper here, ostensibly due to the perceived difficulty of handling the derivatives of kl^{-1} [14]. Instead, when a PAC-Bayes bound was used to guide learning in prior works [13, 14], a simple sum of R_S(Q) and a penalty term involving KL(Q‖P) and N was employed as an upper bound, either obtained from alternative PAC-Bayes theorems [9, 13], or from loosening the upper bound in Eq. (5) to an expression of the form R_S(Q) + √([KL(Q‖P) + ln(2√N/δ)]/(2N)) by a use of Pinsker's inequality [14], or by looser derivations (some of which are mentioned in [13, 21]). We show in our work how to perform the learning directly with the kl^{-1}-bound (5), using the derivative from Eq. (14), and demonstrate that its optimization is robust and stable and has better performance than the optimization of looser bounds (see Sect. 4).

Appendix B Proof of Eq. (5) — KL-divergence inequality and union bound

Let Θ be a countable set (i.e. a finite or countably infinite set), and let p_θ be any probability distribution over its elements θ ∈ Θ. Further, let {P_θ}_{θ∈Θ} be a family of probability distributions indexed by θ ∈ Θ, and define their mixture P := Σ_{θ'∈Θ} p_{θ'} P_{θ'}. Then it holds for each θ ∈ Θ and every distribution Q:

KL(Q‖P) = ∫ dQ ln [dQ / d(Σ_{θ'} p_{θ'} P_{θ'})]   (15)
 ≤ ∫ dQ ln [dQ / d(p_θ P_θ)]   (16)
 = KL(Q‖P_θ) + ln(1/p_θ).   (17)

The inequality (16) follows from the simple fact that the sum Σ_{θ'} p_{θ'} P_{θ'} contains only nonnegative terms and is therefore at least as large as any one of its summands, p_θ P_θ, together with the monotonicity of the logarithm. This inequality would not generally hold if the sum were replaced by an integral over a continuous index θ, which explains the requirement of a countable index set Θ. The inequality holds also when KL(Q‖P_θ) = +∞, with the interpretation +∞ ≤ +∞.

The inequality (15)–(17), together with the fact that kl^{-1}(q, ε) is monotonically increasing in its second argument ε, shows how to obtain Eq. (5) from Theorem 1.

Remark 2.

Using the divergence KL(Q‖P) to the mixture P = Σ_θ p_θ P_θ directly in (4) would of course yield a better bound than (5), but this