Code related to my NIPS 2016 paper
We exhibit a strong link between frequentist PAC-Bayesian risk bounds and the Bayesian marginal likelihood. That is, for the negative log-likelihood loss function, we show that the minimization of PAC-Bayesian generalization risk bounds maximizes the Bayesian marginal likelihood. This provides an alternative explanation to the Bayesian Occam's razor criterion, under the assumption that the data is generated by an i.i.d. distribution. Moreover, as the negative log-likelihood is an unbounded loss function, we motivate and propose a PAC-Bayesian theorem tailored for the sub-gamma loss family, and we show that our approach is sound on classical Bayesian linear regression tasks.
). However, despite the amount of work dedicated to this statistical learning theory (many authors improved the initial results (Catoni, 2007; Lever et al., 2013; McAllester, 2003; Seeger, 2003; Tolstikhin and Seldin, 2013) and/or generalized them for various machine learning setups (Bégin et al., 2014; Germain et al., 2016; Grünwald and Mehta, 2016; Langford and Shawe-Taylor, 2002; Pentina and Lampert, 2014; Seldin and Tishby, 2010; Seldin et al., 2011, 2012)), it is mostly used as a frequentist method. That is, under the assumption that the learning samples are i.i.d.-generated by a data distribution, this theory expresses probably approximately correct (PAC) bounds on the generalization risk. In other words, with probability $1-\delta$, the generalization risk is at most $\varepsilon$ away from the training risk. The Bayesian side of PAC-Bayes comes mostly from the fact that these bounds are expressed on the averaging/aggregation/ensemble of multiple predictors (weighted by a posterior distribution) and incorporate prior knowledge. Although it is still sometimes referred to as a theory that bridges the Bayesian and frequentist approaches (e.g., Guyon et al., 2010), it has rarely been used to justify Bayesian methods until now. [Footnote 1: Some existing connections (Banerjee, 2006; Bissiri et al., 2016; Grünwald, 2012; Lacoste, 2015; Seeger, 2002, 2003; Zhang, 2006) are discussed in Appendix A.1.]
In this work, we provide a direct connection between Bayesian inference techniques (summarized by Bishop, 2006; Ghahramani, 2015) and PAC-Bayesian risk bounds in a general setup. Our study is based on a simple but insightful connection between the Bayesian marginal likelihood and PAC-Bayesian bounds (previously mentioned by Grünwald (2012)) obtained by considering the negative log-likelihood loss function (Section 3). By doing so, we provide an alternative explanation for the Bayesian Occam’s razor criterion (Jeffreys and Berger, 1992; MacKay, 1992) in the context of model selection, expressed as the complexity-accuracy trade-off appearing in most PAC-Bayesian results. In Section 4, we extend PAC-Bayes theorems to regression problems with unbounded loss, adapted to the negative log-likelihood loss function. Finally, we study Bayesian model selection from a PAC-Bayesian perspective (Section 5), and illustrate our findings on classical Bayesian regression tasks (Section 6).
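We denote the learning sample $(X,Y) = \{(x_i,y_i)\}_{i=1}^n \in (\mathcal{X}\times\mathcal{Y})^n$, which contains $n$ input-output pairs. The main assumption of frequentist learning theories, including PAC-Bayes, is that $(X,Y)$ is randomly sampled from a data-generating distribution that we denote $\mathcal{D}$. Thus, we denote $(X,Y)\sim\mathcal{D}^n$ the i.i.d. observation of $n$ elements. From a frequentist perspective, we consider in this work loss functions $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, where $\mathcal{F}$ is a (discrete or continuous) set of predictors $f:\mathcal{X}\to\mathcal{Y}$, and we write the empirical risk on the sample $(X,Y)$ and the generalization error on distribution $\mathcal{D}$ as
$$\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f,x_i,y_i)\,; \qquad \mathcal{L}^{\ell}_{\mathcal{D}}(f) = \mathbf{E}_{(x,y)\sim\mathcal{D}}\,\ell(f,x,y)\,.$$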
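The PAC-Bayesian theory (McAllester, 1999, 2003) studies an averaging of the above losses according to a posterior distribution $\hat\rho$ over $\mathcal{F}$. That is, it provides probably approximately correct generalization bounds on the (unknown) quantity
$$\mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) = \mathbf{E}_{f\sim\hat\rho}\,\mathbf{E}_{(x,y)\sim\mathcal{D}}\,\ell(f,x,y)\,,$$
given the empirical estimate $\mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)$ and some other parameters. Among these, most PAC-Bayesian theorems rely on the Kullback-Leibler divergence $\mathrm{KL}(\hat\rho\|\pi) = \mathbf{E}_{f\sim\hat\rho}\ln\frac{\hat\rho(f)}{\pi(f)}$ between a prior distribution $\pi$ over $\mathcal{F}$ (specified before seeing the learning sample $(X,Y)$) and the posterior $\hat\rho$ (typically obtained by feeding a learning process with $(X,Y)$).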
Two appealing aspects of PAC-Bayesian theorems are that they provide data-driven generalization bounds that are computed on the training sample (i.e., they do not rely on a testing sample), and that they are uniformly valid for all $\hat\rho$ over $\mathcal{F}$. This explains why many works study them as model selection criteria or as an inspiration for the design of learning algorithms. Theorem 1, due to Catoni (2007), has been used to derive or study learning algorithms (Germain et al., 2009; Hazan et al., 2013; McAllester and Keshet, 2011; Noy and Crammer, 2014).
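Theorem 1 (Catoni, 2007). Given a distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell':\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[0,1]$, a prior distribution $\pi$ over $\mathcal{F}$, a real number $\delta\in(0,1]$, and a real number $\beta>0$, with probability at least $1-\delta$ over the choice of $(X,Y)\sim\mathcal{D}^n$, we have
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell'}_{\mathcal{D}}(f) \;\le\; \frac{1}{1-e^{-\beta}}\left[1-\exp\left(-\beta\,\mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell'}_{X,Y}(f) - \frac{1}{n}\Big(\mathrm{KL}(\hat\rho\|\pi)+\ln\frac{1}{\delta}\Big)\right)\right]. \tag{1}$$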
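Theorem 1 is limited to loss functions mapping to the range $[0,1]$. Through a straightforward rescaling, we can extend it to any bounded loss, i.e., $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[a,b]$, where $[a,b]\subset\mathbb{R}$. This is done by using $\beta := b-a$ and the rescaled loss function $\ell'(f,x,y) := (\ell(f,x,y)-a)/(b-a)\in[0,1]$. After a few arithmetic manipulations, we can rewrite Equation (1) as
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; a + \frac{b-a}{1-e^{a-b}}\left[1-\exp\left(-\mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + a - \frac{1}{n}\Big(\mathrm{KL}(\hat\rho\|\pi)+\ln\frac{1}{\delta}\Big)\right)\right]. \tag{2}$$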
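From an algorithm design perspective, Equation (2) suggests optimizing a trade-off between the empirical expected loss and the Kullback-Leibler divergence. Indeed, for fixed $\pi$, $X$, $Y$, $n$, and $\delta$, minimizing Equation (2) is equivalent to finding the distribution $\hat\rho$ that minimizes
$$n\,\mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \mathrm{KL}(\hat\rho\|\pi)\,. \tag{3}$$
The minimizer is the optimal Gibbs posterior $\hat\rho^*$, given by
$$\hat\rho^*(f) = \frac{1}{Z_{X,Y}}\,\pi(f)\,e^{-n\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)}\,, \tag{4}$$
where $Z_{X,Y}$ is a normalization term. Notice that the constant $\beta$ of Equation (1) is now absorbed in the loss function as the rescaling factor setting the trade-off between the expected empirical loss and $\mathrm{KL}(\hat\rho\|\pi)$.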
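The trade-off above can be checked numerically on a finite hypothesis set. The following is a minimal sketch, assuming a hypothetical set of five predictors with a uniform prior and randomly drawn empirical losses (none of these values come from the paper): it builds the Gibbs posterior, which is proportional to the prior times $e^{-n\widehat{\mathcal{L}}(f)}$, and compares the objective "$n$ times expected empirical loss plus KL" against other candidate posteriors.

```python
import numpy as np

def gibbs_posterior(prior, emp_losses, n):
    """Return rho*(f) proportional to prior(f) * exp(-n * emp_loss(f)), and ln Z."""
    log_w = np.log(prior) - n * emp_losses
    log_z = np.logaddexp.reduce(log_w)  # log of the normalization term Z
    return np.exp(log_w - log_z), log_z

def objective(rho, prior, emp_losses, n):
    """n * E_rho[emp_loss] + KL(rho || prior) on a finite hypothesis set."""
    mask = rho > 0  # 0 * ln(0) is taken to be 0
    kl = np.sum(rho[mask] * np.log(rho[mask] / prior[mask]))
    return n * np.dot(rho, emp_losses) + kl

# Hypothetical toy values, chosen only for illustration.
rng = np.random.default_rng(0)
prior = np.full(5, 0.2)
emp_losses = rng.uniform(0.0, 1.0, size=5)
n = 20

rho_star, log_z = gibbs_posterior(prior, emp_losses, n)
best = objective(rho_star, prior, emp_losses, n)
print("objective at Gibbs posterior:", best)
print("minus log normalizer        :", -log_z)
```

On such a finite set, the objective evaluated at the Gibbs posterior equals $-\ln Z_{X,Y}$, and any other distribution over the five predictors gives a value at least as large.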
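In this section, we show that by choosing the negative log-likelihood loss function, minimizing the PAC-Bayes bound is equivalent to maximizing the Bayesian marginal likelihood. To obtain this result, we first consider the Bayesian approach that starts by defining a prior $p(\theta)$ over the set of possible model parameters $\Theta$. This induces a set of probabilistic estimators $f_\theta\in\mathcal{F}$, mapping $x$ to a probability distribution over $\mathcal{Y}$. Then, we can estimate the likelihood of observing $y$ given $x$ and $\theta$, i.e., $p(y|x,\theta)\equiv f_\theta(y|x)$. [Footnote 2: To stay aligned with the PAC-Bayesian setup, we only consider the discriminative case in this paper. One can extend to the generative setup by considering the likelihood of the form $p(y,x|\theta)$ instead.] Using Bayes’ rule, we obtain the posterior $p(\theta|X,Y)$:
$$p(\theta|X,Y) = \frac{p(\theta)\,p(Y|X,\theta)}{p(Y|X)} \;\propto\; p(\theta)\,p(Y|X,\theta)\,, \tag{5}$$
where $p(Y|X,\theta) = \prod_{i=1}^n p(y_i|x_i,\theta)$ and $p(Y|X) = \int_\Theta p(\theta)\,p(Y|X,\theta)\,d\theta$.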
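To bridge the Bayesian approach with the PAC-Bayesian framework, we consider the negative log-likelihood loss function (Banerjee, 2006), denoted $\ell_{nll}$ and defined by
$$\ell_{nll}(f_\theta,x,y) \equiv -\ln p(y|x,\theta)\,. \tag{6}$$
Then, we can relate the empirical loss of a predictor to its likelihood:
$$\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta) = \frac{1}{n}\sum_{i=1}^n \ell_{nll}(\theta,x_i,y_i) = -\frac{1}{n}\sum_{i=1}^n \ln p(y_i|x_i,\theta) = -\frac{1}{n}\ln p(Y|X,\theta)\,,$$
or, the other way around,
$$p(Y|X,\theta) = e^{-n\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta)}\,. \tag{7}$$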
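Unfortunately, existing PAC-Bayesian theorems work with bounded loss functions or in very specific contexts (e.g., Dalalyan and Tsybakov, 2008; Zhang, 2006), and $\ell_{nll}$ spans the whole real axis in its general form. In Section 4, we explore PAC-Bayes bounds for unbounded losses. Meanwhile, we consider priors with bounded likelihood. This can be done by assigning a prior of zero to any $\theta$ yielding $\ln\frac{1}{p(y|x,\theta)}\notin[a,b]$.
Then, using Equation (7) and identifying the prior $\pi(\theta)$ with $p(\theta)$, the optimal Gibbs posterior of Equation (4) becomes
$$\hat\rho^*(\theta) = \frac{\pi(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta)}}{Z_{X,Y}} = \frac{p(\theta)\,p(Y|X,\theta)}{p(Y|X)} = p(\theta|X,Y)\,, \tag{8}$$
where the normalization constant $Z_{X,Y}$ corresponds to the Bayesian marginal likelihood:
$$Z_{X,Y} \equiv p(Y|X) = \int_\Theta \pi(\theta)\,e^{-n\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta)}\,d\theta\,. \tag{9}$$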
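This shows that the optimal PAC-Bayes posterior given by the generalization bound of Theorem 1 coincides with the Bayesian posterior, when one chooses $\ell_{nll}$ as loss function and $\beta := b-a$ (as in Equation 2). Moreover, using the posterior of Equation (8) inside Equation (3), we obtain
$$n\,\mathbf{E}_{\theta\sim\hat\rho^*}\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta) + \mathrm{KL}(\hat\rho^*\|\pi) = \mathbf{E}_{\theta\sim\hat\rho^*}\left[n\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\theta) + \ln\frac{\hat\rho^*(\theta)}{\pi(\theta)}\right] = \mathbf{E}_{\theta\sim\hat\rho^*}\left[\ln\frac{1}{Z_{X,Y}}\right] = -\ln Z_{X,Y}\,. \tag{10}$$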
In other words, minimizing the PAC-Bayes bound is equivalent to maximizing the marginal likelihood. Thus, from the PAC-Bayesian standpoint, the latter encodes a trade-off between the averaged negative log-likelihood loss function and the prior-posterior Kullback-Leibler divergence. Note that Equation (3) has been mentioned by Grünwald (2012), based on an earlier observation of Zhang (2006). However, the PAC-Bayesian theorems proposed by the latter do not bound the generalization loss directly, unlike the “classical” PAC-Bayesian results (Catoni, 2007; McAllester, 1999; Seeger, 2002) that we extend to regression in Section 4 (see the corresponding remarks in Appendix A.1).
In Section 5, we exploit the link between PAC-Bayesian bounds and the Bayesian marginal likelihood to expose similarities between both frameworks in the context of model selection. Before that, Section 4 extends the PAC-Bayesian generalization guarantees to unbounded loss functions. This is mandatory to make our study fully valid, as the negative log-likelihood loss function is in general unbounded (as are other common regression losses).
This section aims to extend the PAC-Bayesian results of Section 3 to real-valued unbounded losses. These results are used in forthcoming sections to study $\ell_{nll}$, but they are valid for broader classes of loss functions. Importantly, our new results are focused on regression problems, as opposed to the usual PAC-Bayesian classification framework.
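Theorem 3 (Alquier et al., 2016). Given a distribution $\mathcal{D}$ over $\mathcal{X}\times\mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to\mathbb{R}$, a prior distribution $\pi$ over $\mathcal{F}$, a $\delta\in(0,1]$, and a real number $\lambda>0$, with probability at least $1-\delta$ over the choice of $(X,Y)\sim\mathcal{D}^n$, we have
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{\lambda}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta} + \Psi_{\ell,\pi,\mathcal{D}}(\lambda,n)\right], \tag{11}$$
$$\text{where}\quad \Psi_{\ell,\pi,\mathcal{D}}(\lambda,n) = \ln \mathbf{E}_{f\sim\pi}\,\mathbf{E}_{(X',Y')\sim\mathcal{D}^n} \exp\left[\lambda\left(\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \widehat{\mathcal{L}}^{\,\ell}_{X',Y'}(f)\right)\right]. \tag{12}$$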
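Alquier et al. (2016) used Theorem 3 to design a learning algorithm for $\{0,1\}$-valued classification losses. Indeed, a bounded loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[a,b]$ can be used along with Theorem 3 by applying Hoeffding’s lemma to Equation (12), which gives $\Psi_{\ell,\pi,\mathcal{D}}(\lambda,n)\le\lambda^2(b-a)^2/(2n)$. More specifically, with $\lambda := n$, we obtain the following bound:
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{n}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta}\right] + \frac{1}{2}(b-a)^2\,. \tag{13}$$
Note that the latter bound leads to the same trade-off as Theorem 1 (expressed by Equation 3). However, the choice $\lambda := n$ has the inconvenience that the bound value is at least $\frac{1}{2}(b-a)^2$, even at the limit $n\to\infty$. With $\lambda := \sqrt{n}$, the bound converges (a result similar to Equation (14) is also formulated by Pentina and Lampert (2014)):
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{\sqrt{n}}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta} + \frac{1}{2}(b-a)^2\right]. \tag{14}$$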
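To make the difference between the two choices of $\lambda$ concrete, here is a small sketch that evaluates the right-hand side of Theorem 3 with the complexity term replaced by the Hoeffding estimate $\lambda^2(b-a)^2/(2n)$. The function name and all numerical values (empirical loss, KL value, confidence level) are hypothetical and only serve the illustration.

```python
import math

def hoeffding_bound(emp_loss, kl, n, delta, a, b, lam):
    """Bound of Theorem 3 with Psi replaced by lam**2 * (b - a)**2 / (2 * n)."""
    psi = lam**2 * (b - a) ** 2 / (2.0 * n)
    return emp_loss + (kl + math.log(1.0 / delta) + psi) / lam

# Hypothetical values: expected empirical loss, KL divergence, confidence, loss range.
emp_loss, kl, delta, a, b = 0.10, 5.0, 0.05, 0.0, 1.0

for n in (10**2, 10**4, 10**6):
    with_n = hoeffding_bound(emp_loss, kl, n, delta, a, b, lam=n)
    with_sqrt_n = hoeffding_bound(emp_loss, kl, n, delta, a, b, lam=math.sqrt(n))
    print(f"n={n:>8}  lambda=n: {with_n:.4f}  lambda=sqrt(n): {with_sqrt_n:.4f}")
```

With $\lambda = n$, the gap between the bound and the empirical loss never drops below $(b-a)^2/2$, whereas with $\lambda=\sqrt{n}$ it shrinks at rate $1/\sqrt{n}$.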
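In a regression context, it may be restrictive to consider strictly bounded loss functions. Therefore, we extend Theorem 3 to sub-Gaussian losses. We say that a loss function $\ell$ is sub-Gaussian with variance factor $s^2$ under a prior $\pi$ and a data-distribution $\mathcal{D}$ if it can be described by a sub-Gaussian random variable $V = \mathcal{L}^{\ell}_{\mathcal{D}}(f) - \ell(f,x,y)$ (see Boucheron et al. (2013, Section 2.3)):
$$\psi_V(\lambda) = \ln\mathbf{E}\,e^{\lambda V} = \ln\mathbf{E}_{f\sim\pi}\,\mathbf{E}_{(x,y)\sim\mathcal{D}}\exp\left[\lambda\left(\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \ell(f,x,y)\right)\right] \;\le\; \frac{\lambda^2 s^2}{2}\,, \quad \forall\lambda\in\mathbb{R}\,.$$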
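The above sub-Gaussian assumption corresponds to the Hoeffding assumption of Alquier et al. (2016), and allows us to obtain the following result.
Corollary 4. Given $\mathcal{D}$, $\mathcal{F}$, $\ell$, $\pi$ and $\delta$ defined in the statement of Theorem 3, if the loss is sub-Gaussian with variance factor $s^2$, we have, with probability at least $1-\delta$ over the choice of $(X,Y)\sim\mathcal{D}^n$,
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{n}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta}\right] + \frac{1}{2}s^2\,.$$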
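We say that an unbounded loss function $\ell$ is sub-gamma with a variance factor $s^2$ and scale parameter $c$, under a prior $\pi$ and a data-distribution $\mathcal{D}$, if it can be described by a sub-gamma random variable $V$ (see Boucheron et al. (2013, Section 2.4)), that is
$$\psi_V(\lambda) \;\le\; \frac{\lambda^2 s^2}{2(1-c\lambda)}\,, \quad \forall\lambda\in\left(0,\tfrac{1}{c}\right).$$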
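Under this sub-gamma assumption, we obtain the following new result, which is necessary to study linear regression in the next sections.
Corollary 5. Given $\mathcal{D}$, $\mathcal{F}$, $\ell$, $\pi$ and $\delta$ defined in the statement of Theorem 3, if the loss is sub-gamma with variance factor $s^2$ and scale $c<1$, we have, with probability at least $1-\delta$ over the choice of $(X,Y)\sim\mathcal{D}^n$,
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{n}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta}\right] + \frac{s^2}{2(1-c)}\,.$$
Proof. Following the same path as in the proof of Corollary 4 (with $\lambda := n$), we have
$$\Psi_{\ell,\pi,\mathcal{D}}(n,n) \;\le\; n\,\frac{s^2}{2(1-c)}\,,$$
where the inequality comes from the sub-gamma loss assumption, with $\lambda := 1$. ∎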
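As an illustration of the sub-gamma condition, the sketch below uses a toy random variable that is not the one studied in the paper: $V = \alpha(Z^2-1)$ with $Z$ standard normal, whose log moment generating function has a closed form and which is known to be sub-gamma with $s^2 = 2\alpha^2$ and $c = 2\alpha$. The code checks the sub-gamma envelope on a grid and then evaluates a bound of the form "empirical loss $+ \frac{1}{n}[\mathrm{KL} + \ln\frac{1}{\delta}] + \frac{s^2}{2(1-c)}$". The helper names, the value of $\alpha$, and the inputs to the bound are assumptions made for this example.

```python
import numpy as np

def log_mgf_centered_chi2(lam, scale):
    """ln E exp(lam * V) for V = scale * (Z**2 - 1), Z ~ N(0, 1); needs 2 * scale * lam < 1."""
    return -scale * lam - 0.5 * np.log1p(-2.0 * scale * lam)

def sub_gamma_envelope(lam, s2, c):
    """Sub-gamma upper envelope s2 * lam**2 / (2 * (1 - c * lam)), for 0 < lam < 1 / c."""
    return s2 * lam**2 / (2.0 * (1.0 - c * lam))

def sub_gamma_bound(emp_loss, kl, n, delta, s2, c):
    """Empirical loss + (KL + ln(1/delta)) / n + s2 / (2 * (1 - c)); requires c < 1."""
    return emp_loss + (kl + np.log(1.0 / delta)) / n + s2 / (2.0 * (1.0 - c))

scale = 0.25                          # hypothetical alpha
s2, c = 2.0 * scale**2, 2.0 * scale   # variance factor and scale parameter of V
lams = np.linspace(1e-3, 1.0 / c - 1e-3, 200)
holds = bool(np.all(log_mgf_centered_chi2(lams, scale)
                    <= sub_gamma_envelope(lams, s2, c) + 1e-12))
print("sub-gamma envelope holds on the grid:", holds)
print("bound value:", sub_gamma_bound(emp_loss=0.5, kl=2.0, n=100, delta=0.05, s2=s2, c=c))
```

The constant term $s^2/(2(1-c))$ does not depend on $n$, so a bound of this form keeps a fixed gap above the empirical loss however large the sample is.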
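The parameters $s$ and $c$ of Corollary 5 rely on the chosen loss function and prior, and the assumptions concerning the data distribution. As an example, consider a regression problem where $\mathcal{X}\times\mathcal{Y}\subset\mathbb{R}^d\times\mathbb{R}$, a family of linear predictors $f_w(x) = w\cdot x$, with $w\in\mathbb{R}^d$, and a Gaussian prior $\mathcal{N}(0,\sigma_\pi^2 I)$. Let us assume that the input examples are generated by $x\sim\mathcal{N}(0,\sigma_x^2 I)$ with label $y = w^*\cdot x + \epsilon$, where $w^*\in\mathbb{R}^d$ and $\epsilon\sim\mathcal{N}(0,\sigma_\epsilon^2)$ is a Gaussian noise. Under the squared loss function
$$\ell_{sqr}(w,x,y) = (w\cdot x - y)^2\,, \tag{19}$$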
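The classical PAC-Bayesian theorems are stated in a classification context and bound the generalization error/loss of the stochastic Gibbs predictor $G_{\hat\rho}$. In order to predict the label of an example $x\in\mathcal{X}$, the Gibbs predictor first draws a hypothesis $f\in\mathcal{F}$ according to $\hat\rho$, and then returns $f(x)$. Maurer (2004) shows that we can generalize PAC-Bayesian bounds on the generalization risk of the Gibbs classifier to any loss function with output between zero and one. Provided that $\mathcal{Y}=\{-1,1\}$ and $f(x)\in[-1,1]$, a common choice is to use the linear loss function $\ell'_{01}(f,x,y) = \frac{1}{2}-\frac{1}{2}\,y\,f(x)$. The Gibbs generalization loss is then given by $R_{\mathcal{D}}(G_{\hat\rho}) = \mathbf{E}_{(x,y)\sim\mathcal{D}}\,\mathbf{E}_{f\sim\hat\rho}\,\ell'_{01}(f,x,y)$. Many PAC-Bayesian works use $R_{\mathcal{D}}(G_{\hat\rho})$ as a surrogate loss to study the zero-one classification loss of the majority vote classifier $R_{\mathcal{D}}(B_{\hat\rho})$:
$$R_{\mathcal{D}}(B_{\hat\rho}) = \Pr_{(x,y)\sim\mathcal{D}}\Big(y\,\mathbf{E}_{f\sim\hat\rho}f(x) < 0\Big) = \mathbf{E}_{(x,y)\sim\mathcal{D}}\,I\Big[y\,\mathbf{E}_{f\sim\hat\rho}f(x) < 0\Big], \tag{20}$$
where $I[\cdot]$ is the indicator function. Given a distribution $\hat\rho$, an upper bound on the Gibbs risk is converted to an upper bound on the majority vote risk by $R_{\mathcal{D}}(B_{\hat\rho})\le 2R_{\mathcal{D}}(G_{\hat\rho})$ (Langford and Shawe-Taylor, 2002). In some situations, this factor of two may be reached, i.e., $R_{\mathcal{D}}(B_{\hat\rho})\simeq 2R_{\mathcal{D}}(G_{\hat\rho})$. In other situations, we may have $R_{\mathcal{D}}(B_{\hat\rho})=0$ even if $R_{\mathcal{D}}(G_{\hat\rho})=\frac{1}{2}-\epsilon$ (see Germain et al. (2015) for an extensive study). Indeed, these bounds obtained via the Gibbs risk are prone to be loose and/or unrepresentative of the majority vote generalization error. [Footnote 3: It is noteworthy that the best PAC-Bayesian empirical bound values are so far obtained by considering a majority vote of linear classifiers, where the prior and posterior are Gaussian (Ambroladze et al., 2006; Germain et al., 2009; Langford and Shawe-Taylor, 2002), similarly to the Bayesian linear regression analyzed in Section 6.]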
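The factor of two between the majority vote risk and the Gibbs risk can be observed empirically. The sketch below is a toy simulation under assumed settings (seven random $\pm 1$ voters, random labels, a random posterior drawn from a Dirichlet distribution); it measures both risks on the same sample using the linear loss $\frac{1}{2}(1 - y f(x))$ for the Gibbs predictor, and counts ties of the vote as errors.

```python
import numpy as np

rng = np.random.default_rng(0)
num_voters, num_examples = 7, 1000

# Hypothetical voters with outputs in {-1, +1} and random binary labels.
votes = rng.choice([-1.0, 1.0], size=(num_voters, num_examples))
labels = rng.choice([-1.0, 1.0], size=num_examples)
rho = rng.dirichlet(np.ones(num_voters))   # posterior weights over the voters

margins = labels * (rho @ votes)           # y * E_{f ~ rho} f(x), in [-1, 1]
gibbs_risk = np.mean(0.5 * (1.0 - margins))  # expected linear loss of the Gibbs predictor
mv_risk = np.mean(margins <= 0.0)          # zero-one loss of the vote, ties count as errors

print("Gibbs risk        :", gibbs_risk)
print("majority vote risk:", mv_risk)
print("twice Gibbs risk  :", 2.0 * gibbs_risk)
```

The inequality holds example by example: whenever the weighted vote errs, the margin is non-positive, so the linear loss on that example is at least one half.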
In the current work, we study regression losses instead of classification ones. That is, the provided results express upper bounds on $\mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f)$ for any (bounded, sub-Gaussian, or sub-gamma) losses. Of course, one may want to bound the regression loss of the averaged regressor $F_{\hat\rho}(x) = \mathbf{E}_{f\sim\hat\rho}f(x)$. In this case, if the loss function is convex (as is the squared loss), Jensen’s inequality gives $\mathcal{L}^{\ell}_{\mathcal{D}}(F_{\hat\rho})\le\mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f)$. Note that this direct inequality replaces the factor of two mentioned above for the classification case, which is due to the non-convex indicator function of Equation (20).
Now that we have generalization bounds for real-valued loss functions, we can continue our study linking PAC-Bayesian results to Bayesian inference. In the next section, we focus on model selection.
We consider $L$ distinct models $\{\mathcal{M}_i\}_{i=1}^L$, each one defined by a set of parameters $\Theta_i$. The PAC-Bayesian theorems naturally suggest selecting the model that is best adapted for the given task by evaluating the bound for each model and selecting the one with the lowest bound (Ambroladze et al., 2006; McAllester, 2003; Zhang, 2006). This is closely linked with the Bayesian model selection procedure, as we showed in Section 3 that minimizing the PAC-Bayes bound amounts to maximizing the marginal likelihood. Indeed, given a collection of $L$ optimal Gibbs posteriors (one for each model) given by Equation (8),
Corollary 6 below formally links the PAC-Bayesian and the Bayesian model selection. To obtain this result, we simply use the bound of Corollary 5 $L$ times, together with $\ell_{nll}$ and Equation (3). From the union bound (a.k.a. Bonferroni inequality), it is mandatory to compute each bound with a confidence parameter of $\delta/L$, to ensure that the final conclusion is valid with probability at least $1-\delta$.
Hence, under the uniform prior over the models, choosing the one with the best model evidence is equivalent to choosing the one with the lowest PAC-Bayesian bound.
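The selection rule can be sketched in a few lines. The code below assumes that the bound for each model takes the form obtained by plugging the identity "$n$ times expected empirical loss plus KL equals minus log marginal likelihood" into the sub-gamma bound with confidence $\delta/L$, and that the sub-gamma parameters $s^2$ and $c$ are shared by all models. The log-evidence values and all other numbers are hypothetical.

```python
import numpy as np

def model_selection_bounds(log_evidences, n, delta, s2, c):
    """Per-model bounds (-ln Z_i + ln(L / delta)) / n + s2 / (2 * (1 - c)).

    The union bound over the L models is applied by using delta / L for each model.
    Assumes the same sub-gamma parameters (s2, c) for every model.
    """
    log_evidences = np.asarray(log_evidences, dtype=float)
    num_models = log_evidences.size
    return (-log_evidences + np.log(num_models / delta)) / n + s2 / (2.0 * (1.0 - c))

# Hypothetical log marginal likelihoods ln Z_i of three candidate models.
log_evidences = [-42.0, -35.5, -37.1]
bounds = model_selection_bounds(log_evidences, n=50, delta=0.05, s2=0.5, c=0.2)
selected = int(np.argmin(bounds))

print("bounds          :", bounds)
print("selected model  :", selected)
print("highest evidence:", int(np.argmax(log_evidences)))
```

Since every term other than $-\ln Z_i$ is identical across models under these assumptions, the model with the lowest bound is the one with the highest evidence.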
To perform proper inference on hyperparameters, we have to rely on the hierarchical Bayes approach. This is done by considering a hyperprior $p(\eta)$ over the set of hyperparameters $H$. Then, the prior $p(\theta|\eta)$ can be conditioned on a choice of hyperparameter $\eta$. The Bayes rule of Equation (5) becomes
$$p(\theta,\eta|X,Y) = \frac{p(\eta)\,p(\theta|\eta)\,p(Y|X,\theta)}{p(Y|X)}\,.$$
Under the negative log-likelihood loss function, we can rewrite the results of Corollary 5 as a generalization bound on $\mathbf{E}_{\eta\sim\hat\rho_0}\,\mathbf{E}_{\theta\sim\hat\rho_\eta}\,\mathcal{L}^{\ell_{nll}}_{\mathcal{D}}(\theta)$, where $\hat\rho_0$ is the hyperposterior on $H$ and $\pi_0$ the hyperprior. Indeed, Equation (18) becomes
To relate to the bound obtained in Corollary 6, we consider the case of a discrete hyperparameter set $H=\{\eta_i\}_{i=1}^L$, with a uniform prior $\pi_0(\eta_i)=\frac{1}{L}$ (from now on, we regard each hyperparameter $\eta_i$ as the specification of a model $\Theta_i$). Then, Equation (23) becomes
This bound is now a function of $\sum_{i=1}^L Z_{X,Y,\eta_i}$ instead of $\max_i Z_{X,Y,\eta_i}$ as in the bound given by the “best” model in Corollary 6. This yields a tighter bound, corroborating the Bayesian wisdom that model averaging performs best. Conversely, when selecting a single hyperparameter $\hat\eta\in H$, the hierarchical representation is equivalent to choosing a deterministic hyperposterior, satisfying $\hat\rho_0(\hat\eta)=1$ and $0$ for every other value. We then have
With the optimal posterior for the selected $\hat\eta$, we have
Inserting this result into Equation (17), we fall back on the bound obtained in Corollary 6. Hence, by comparing the values of the bounds, one can estimate the cost of performing model selection instead of model averaging.
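In this section, we perform Bayesian linear regression using the parameterization of Bishop (2006). The output space is $\mathcal{Y} := \mathbb{R}$ and, for an arbitrary input space $\mathcal{X}$, we use a mapping function $\phi:\mathcal{X}\to\mathbb{R}^d$.
Given $(x,y)\in\mathcal{X}\times\mathcal{Y}$ and model parameters $\theta := \langle w,\sigma\rangle\in\mathbb{R}^d\times\mathbb{R}^+$, we consider the likelihood $p(y|x,\langle w,\sigma\rangle) = \mathcal{N}(y\,|\,w\cdot\phi(x),\sigma^2)$. Thus, the negative log-likelihood loss is
$$\ell_{nll}(\langle w,\sigma\rangle,x,y) = -\ln p(y|x,\langle w,\sigma\rangle) = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\big(y - w\cdot\phi(x)\big)^2\,. \tag{24}$$
For a fixed $\sigma^2$, minimizing Equation (24) is equivalent to minimizing the squared loss function of Equation (19). We also consider an isotropic Gaussian prior of mean $0$ and variance $\sigma_\pi^2$: $p(w|\sigma_\pi) = \mathcal{N}(w\,|\,0,\sigma_\pi^2 I)$. For the sake of simplicity, we consider fixed parameters $\sigma^2$ and $\sigma_\pi^2$. The Gibbs optimal posterior (see Equation 8) is then given by
$$\hat\rho^*(w) \equiv p(w|X,Y,\sigma,\sigma_\pi) = \mathcal{N}(w\,|\,\widehat{w},A^{-1})\,, \tag{25}$$
where $A := \frac{1}{\sigma^2}\Phi^T\Phi + \frac{1}{\sigma_\pi^2}I$; $\widehat{w} := \frac{1}{\sigma^2}A^{-1}\Phi^T\mathbf{y}$; $\Phi$ is an $n\times d$ matrix such that the $i$-th line is $\phi(x_i)$; $\mathbf{y} := [y_1,\ldots,y_n]$ is the labels-vector; and the negative log marginal likelihood is
$$-\ln p(Y|X,\sigma,\sigma_\pi) = \frac{1}{2\sigma^2}\|\mathbf{y}-\Phi\widehat{w}\|^2 + \frac{n}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma_\pi^2}\|\widehat{w}\|^2 + \frac{1}{2}\ln|A| + d\ln\sigma_\pi$$
$$= \underbrace{n\,\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(\widehat{w}) + \frac{1}{2\sigma^2}\mathrm{tr}(\Phi^T\Phi A^{-1})}_{n\,\mathbf{E}_{w\sim\hat\rho^*}\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(w)} + \underbrace{\frac{1}{2\sigma_\pi^2}\mathrm{tr}(A^{-1}) - \frac{d}{2} + \frac{1}{2\sigma_\pi^2}\|\widehat{w}\|^2 + \frac{1}{2}\ln|A| + d\ln\sigma_\pi}_{\mathrm{KL}\left(\mathcal{N}(\widehat{w},A^{-1})\,\|\,\mathcal{N}(0,\sigma_\pi^2 I)\right)}\,.$$
To obtain the second equality, we substitute $\frac{1}{2\sigma^2}\mathrm{tr}(\Phi^T\Phi A^{-1}) + \frac{1}{2\sigma_\pi^2}\mathrm{tr}(A^{-1}) = \frac{1}{2}\mathrm{tr}(A^{-1}A) = \frac{d}{2}$.
This exhibits how the Bayesian regression optimization problem is related to the minimization of a PAC-Bayesian bound, expressed by a trade-off between $\mathbf{E}_{w\sim\hat\rho^*}\widehat{\mathcal{L}}^{\,\ell_{nll}}_{X,Y}(w)$ and $\mathrm{KL}\big(\mathcal{N}(\widehat{w},A^{-1})\,\|\,\mathcal{N}(0,\sigma_\pi^2 I)\big)$. See Appendix A.5 for detailed calculations.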
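The decomposition of the negative log marginal likelihood into an empirical-risk term and a KL term can be verified numerically. The following sketch assumes synthetic data and hypothetical values for the noise and prior standard deviations; the function names are illustrative and not taken from the paper's code. The evidence is computed independently from the Gaussian marginal of the labels (zero mean, covariance equal to the noise variance times the identity plus the prior variance times $\Phi\Phi^T$), then compared with the trade-off evaluated at the Gaussian posterior.

```python
import numpy as np

def gibbs_posterior(Phi, y, sigma, sigma_pi):
    """Mean w_hat and precision A of the Gaussian posterior N(w_hat, A^{-1})."""
    d = Phi.shape[1]
    A = Phi.T @ Phi / sigma**2 + np.eye(d) / sigma_pi**2
    w_hat = np.linalg.solve(A, Phi.T @ y) / sigma**2
    return w_hat, A

def expected_empirical_nll(Phi, y, w_hat, A, sigma):
    """Posterior expectation of the empirical negative log-likelihood loss."""
    n = len(y)
    resid = y - Phi @ w_hat
    quad = resid @ resid + np.trace(Phi.T @ Phi @ np.linalg.inv(A))
    return 0.5 * np.log(2 * np.pi * sigma**2) + quad / (2 * sigma**2 * n)

def kl_to_prior(w_hat, A, sigma_pi):
    """KL( N(w_hat, A^{-1}) || N(0, sigma_pi^2 I) )."""
    d = len(w_hat)
    _, logdet_A = np.linalg.slogdet(A)
    return 0.5 * (np.trace(np.linalg.inv(A)) / sigma_pi**2
                  + w_hat @ w_hat / sigma_pi**2
                  - d + logdet_A + 2 * d * np.log(sigma_pi))

def neg_log_marginal(Phi, y, sigma, sigma_pi):
    """-ln p(Y|X), using y ~ N(0, sigma^2 I + sigma_pi^2 Phi Phi^T)."""
    n = len(y)
    C = sigma**2 * np.eye(n) + sigma_pi**2 * Phi @ Phi.T
    _, logdet_C = np.linalg.slogdet(C)
    return 0.5 * (n * np.log(2 * np.pi) + logdet_C + y @ np.linalg.solve(C, y))

# Hypothetical synthetic data and hyperparameters.
rng = np.random.default_rng(0)
n, d, sigma, sigma_pi = 30, 4, 0.5, 2.0
Phi = rng.normal(size=(n, d))
y = Phi @ rng.normal(size=d) + sigma * rng.normal(size=n)

w_hat, A = gibbs_posterior(Phi, y, sigma, sigma_pi)
trade_off = (n * expected_empirical_nll(Phi, y, w_hat, A, sigma)
             + kl_to_prior(w_hat, A, sigma_pi))
evidence = neg_log_marginal(Phi, y, sigma, sigma_pi)
print("n * E[empirical nll] + KL  :", trade_off)
print("negative log marginal lik. :", evidence)
```

The two printed quantities agree up to floating-point error, which is the regression instance of the identity linking the PAC-Bayesian trade-off to the marginal likelihood.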
To produce Figures (a) and (b), we reimplemented the toy experiment of Bishop (2006, Section 3.5.1). That is, we generated a learning sample of data points according to , where $x$ is uniformly sampled in the interval and $\epsilon$ is a Gaussian noise. We then learn seven different polynomial models by applying Equation (25). More precisely, for a polynomial model of degree $d$, we map input $x$ to a vector $\phi(x)$, and we fix the parameters $\sigma_\pi^2$ and $\sigma^2$. Figure (a) illustrates the seven learned models. Figure (b) shows the negative log marginal likelihood computed for each polynomial model, and is designed to reproduce Bishop (2006, Figure 3.14), where it is explained that the marginal likelihood correctly indicates that the polynomial model of degree is “the simplest model which gives a good explanation for the observed data”. We show that this claim is well quantified by the trade-off intrinsic to our PAC-Bayesian approach: the complexity term $\mathrm{KL}$ keeps increasing with the parameter $d$, while the empirical risk drastically decreases from to , and only slightly afterward. Moreover, we show that the generalization risk (computed on a test sample of size ) tends to increase with complex models (for ).
Figure (c) compares the values of the PAC-Bayesian bounds presented in this paper on a synthetic dataset, where each input is generated by a Gaussian .
The associated output is given by , with , , and . We perform Bayesian linear regression in the input space, i.e., $\phi(x) := x$, fixing $\sigma_\pi^2$ and $\sigma^2$.
That is, we compute the posterior of Equation (25) for training samples of sizes from to .
For each learned model, we compute the empirical negative log-likelihood loss of Equation (24), and the three PAC-Bayes bounds, with confidence parameter of .
Note that this loss function is an affine transformation of the squared loss studied in Section 4 (Equation 19), i.e., $\ell_{nll}(\langle w,\sigma\rangle,x,y) = \frac{1}{2}\ln(2\pi\sigma^2) + \frac{1}{2\sigma^2}\,\ell_{sqr}(w,x,y)$. It turns out that $\ell_{nll}$ is sub-gamma, with parameters derived in Appendix A.6. The bounds of Corollary 5 are computed using the above mentioned values of , leading to and . As the two other bounds of Figure (c) are not suited for unbounded loss, we compute their value using a cropped loss . Different parameter values could have been chosen, sometimes leading to another picture: a large value of degrades our sub-gamma bound, as a larger interval $[a,b]$ does for the other bounds.
In the studied setting, the bound of Corollary 5, which we have developed for (unbounded) sub-gamma losses, gives tighter guarantees than the two results for $[a,b]$-bounded losses (up to ). However, our new bound always maintains a gap of $\frac{s^2}{2(1-c)}$ between its value and the generalization loss. The result of Corollary 2 (adapted from Catoni (2007)) for bounded losses suffers from a similar gap, while having higher values than our sub-gamma result. Finally, the result of Theorem 3 (Alquier et al. (2016)), combined with $\lambda=\sqrt{n}$ (Eq. 14), converges to the expected loss, but it provides good guarantees only for large training samples (). Note that the latter bound is not directly minimized by our “optimal posterior”, as opposed to the one with $\lambda=n$ (Eq. 13), for which we observe values between (for ) and (for ), not displayed in Figure (c).
The first contribution of this paper is to bridge the concepts underlying the Bayesian and the PAC-Bayesian approaches; under proper parameterization, the minimization of the PAC-Bayesian bound maximizes the marginal likelihood. This motivates the second contribution of this paper, which is to prove PAC-Bayesian generalization bounds for regression with unbounded sub-gamma loss functions, including the squared loss used in regression tasks.
In this work, we studied model selection techniques. On a broader perspective, we would like to suggest that both Bayesian and PAC-Bayesian frameworks may have more to learn from each other than what has been done lately (even if other works paved the way (e.g., Bissiri et al., 2016; Grünwald, 2012; Seeger, 2003)). Predictors learned from the Bayes rule can benefit from strong PAC-Bayesian frequentist guarantees (under the i.i.d. assumption). Also, the rich Bayesian toolbox may be incorporated in PAC-Bayesian driven algorithms and risk bounding techniques.
We thank Gabriel Dubé and Maxime Tremblay for having proofread the paper and supplemental.
In this section, we discuss briefly other works containing (more or less indirect) links between Bayesian inference and PAC-Bayesian theory, and explain how they relate to the current paper.
Soon after the initial work of McAllester (1999, 2003), Seeger (2002) shows how to apply the PAC-Bayesian theorems to bound the generalization error of Gaussian Processes in a classification context. By building upon the PAC-Bayesian theorem initially appearing in Langford and Seeger (2001), where the divergence between the training error and the generalization one is given by the Kullback-Leibler divergence between two Bernoulli distributions, he achieves very tight generalization bounds. [Footnote 4: The PAC-Bayesian results for Gaussian processes are summarized in Rasmussen and Williams (2006, Section 7.4).] Also, the thesis of Seeger (2003, Section 3.2) foresees the connection studied here by noticing that “the log marginal likelihood incorporates a similar trade-off as the PAC-Bayesian theorem”, but using another variant of the PAC-Bayes bound and in the context of classification.
Banerjee (2006) shows similarities between the early PAC-Bayesian results (McAllester (2003), Langford and Seeger (2001)), and the Bayesian log-loss bound (Freund and Schapire (1997); Kakade and Ng (2004)). This is done by highlighting that the proofs of all these results rely strongly on the same compression lemma (Banerjee, 2006, Lemma 1), which is equivalent to our change of measure used in the proof of Theorem 3 (see forthcoming Equation 26). Note that the loss studied in the Bayesian part of Banerjee (2006) is the negative log-likelihood of Equation (6). Also, as in Equation (3), the Bayesian log-loss bound contains the Kullback-Leibler divergence between the prior and the posterior. However, the latter result is not a generalization bound, but a bound on the training loss that is obtained by computing a surrogate training loss in the specific context of online learning. Moreover, the marginal likelihood and the model selection techniques are not addressed in Banerjee (2006).
Zhang (2006) presents a family of information theoretical bounds for randomized estimators that have a lot in common with PAC-Bayesian results (although the bounded quantity is not directly the generalization error). Minimizing these bounds leads to the same optimal Gibbs posterior of Equation (4). The author noted that using the negative log-likelihood (Equation 6) leads to the Bayesian posterior, but made no connection with the marginal likelihood.
Grünwald (2012) proposes the Safe Bayesian algorithm, which selects a proper Bayesian learning rate (analogous to the parameter $\beta$ of our Equation (1), and the parameter $\lambda$ of our Equation (11)) in the context of misspecified models. [Footnote 5: The empirical model selection capabilities of the Safe Bayesian algorithm have been further studied in Grünwald and van Ommen (2014).] The standard Bayesian inference method is obtained with a fixed learning rate, corresponding to the case $\lambda := n$ (that is the case we focus on in the current paper, see Corollaries 4 and 5). The analysis of Grünwald (2012) relies both on the Minimum Description Length principle (Grünwald, 2007) and PAC-Bayesian theory. Building upon the work of Zhang (2006) discussed above, they formulate the result that we presented as Equation (3), linking the marginal likelihood to the inherent PAC-Bayesian trade-off. However, they do not compute explicit bounds on the generalization loss, which required us to take into account the complexity term of Equation (12).
Lacoste (2015) shows, in a binary classification context, that the parameter $\beta$ of Theorem 1 can be interpreted as a Bernoulli label noise model from a Bayesian likelihood standpoint. For more details, we refer the reader to Section 2.2 of that thesis.
The recent work of Bissiri et al. (2016) studies Bayesian inference through the lens of loss functions. When the loss function is the negative log-likelihood (Equation 6), the approach of Bissiri et al. (2016) coincides with the Bayesian update rule. As mentioned by the authors, there is some connection between their framework and the PAC-Bayesian one, but “the motivation and construction are very different.”
Recall that Theorem 3 originally comes from Alquier et al. (2016, Theorem 4.1). We present below a different proof that follows the key steps of the very general PAC-Bayesian theorem presented in Bégin et al. (2016, Theorem 4).
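The Donsker-Varadhan change of measure states that, for any measurable function $\phi:\mathcal{F}\to\mathbb{R}$, we have
$$\mathbf{E}_{f\sim\hat\rho}\,\phi(f) \;\le\; \mathrm{KL}(\hat\rho\|\pi) + \ln\Big(\mathbf{E}_{f\sim\pi}\,e^{\phi(f)}\Big)\,. \tag{26}$$
Thus, with $\phi(f) := \lambda\big(\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)\big)$, we obtain
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \lambda\Big(\mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)\Big) \;\le\; \mathrm{KL}(\hat\rho\|\pi) + \ln\Big(\mathbf{E}_{f\sim\pi}\,e^{\lambda\left(\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)\right)}\Big)\,.$$
Now, we apply Markov’s inequality on the random variable $\zeta_\pi(X,Y) := \mathbf{E}_{f\sim\pi}\,e^{\lambda\left(\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f)\right)}$:
$$\Pr_{(X,Y)\sim\mathcal{D}^n}\left(\zeta_\pi(X,Y) \le \frac{1}{\delta}\,\mathbf{E}_{(X',Y')\sim\mathcal{D}^n}\,\zeta_\pi(X',Y')\right) \;\ge\; 1-\delta\,.$$
This implies that with probability at least $1-\delta$ over the choice of $(X,Y)\sim\mathcal{D}^n$, we have
$$\forall\hat\rho \text{ on } \mathcal{F}:\quad \mathbf{E}_{f\sim\hat\rho}\,\mathcal{L}^{\ell}_{\mathcal{D}}(f) \;\le\; \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{\lambda}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{\mathbf{E}_{(X',Y')\sim\mathcal{D}^n}\,\zeta_\pi(X',Y')}{\delta}\right] = \mathbf{E}_{f\sim\hat\rho}\,\widehat{\mathcal{L}}^{\,\ell}_{X,Y}(f) + \frac{1}{\lambda}\left[\mathrm{KL}(\hat\rho\|\pi) + \ln\frac{1}{\delta} + \Psi_{\ell,\pi,\mathcal{D}}(\lambda,n)\right].$$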
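Given a loss function $\ell:\mathcal{F}\times\mathcal{X}\times\mathcal{Y}\to[a,b]$, and a fixed predictor $f\in\mathcal{F}$, we consider the random experiment of sampling $(x,y)\sim\mathcal{D}$. We denote $\ell_i := \mathcal{L}^{\ell}_{\mathcal{D}}(f) - \ell(f,x_i,y_i)$ a realization of the random variable $\mathcal{L}^{\ell}_{\mathcal{D}}(f) - \ell(f,x,y)$, for $i=1,\ldots,n$. Each $\ell_i$ is i.i.d., zero mean, and bounded by $a-b$ and $b-a$, as $\ell(f,x,y)\in[a,b]$. Thus,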