This paper combines two directions of research: stability of learning algorithms, and PAC-Bayes bounds for algorithms that randomize with a data-dependent distribution. The combination of these ideas enables the development of risk bounds that exploit stability of the learned hypothesis but are independent of the complexity of the hypothesis class. We use the PAC-Bayes setting with ‘priors’ defined in terms of the data-generating distribution, as introduced by Catoni2007 and developed further e.g. by Lever-etal2010 and PASTS2012. Our work can be viewed as deriving specific results for this approach in the case of stable Hilbert space valued algorithms.
The analysis introduced by BE2002Stability, which followed and extended LugosiPawlak1994posterior and was further developed by CelisseGuedj2016stability, KarimCsaba2017Apriori and Liu-etal2017BarcelonaPaper among others, shows that stability of learning algorithms can be used to give bounds on the generalisation of the learned functions. Their results work by assessing how small changes in the training set affect the resulting classifiers. Intuitively, this is because stable learning should ensure that slightly different training sets give similar solutions.
In this paper we focus on the sensitivity coefficients (see our Definition 1) of the hypothesis learned by a Hilbert space valued algorithm, and provide an analysis leading to a PAC-Bayes bound for randomized classifiers under Gaussian randomization. As a by-product of the stability analysis we derive a concentration inequality for the output of a Hilbert space valued algorithm. Applying it to Support Vector Machines [ShaweCristianini2004KernelMethods, SteinwartChristmann2008SVMs] we deduce a concentration bound for the SVM weight vector, and also a PAC-Bayes performance bound for SVM with Gaussian randomization. Experimental results compare our new bound with other stability-based bounds, and with a more standard PAC-Bayes bound.
Our work contributes to a line of research aiming to develop ‘self-bounding algorithms’ (Freund1998self-bounding, LangfordBlum2003self-bounding) in the sense that besides producing a predictor the algorithm also creates a performance certificate based on the available data.
2 Main Result(s)
We consider a learning problem where the learner observes pairs of patterns (inputs) from the space (footnote 1: all spaces where random variables take values are assumed to be measurable spaces) and labels in the space . A training set (or sample) is a finite sequence of such observations. Each pair is a random element of whose (joint) probability law is (footnote 2: denotes the set of all probability measures over the space ). We think of as the underlying ‘true’ (but unknown) data-generating distribution. Examples are i.i.d. (independent and identically distributed) in the sense that the joint distribution of is the -fold product measure .
A learning algorithm is a function that maps training samples (of any size) to predictor functions. Given , the algorithm produces a learned hypothesis that will be used to predict the label of unseen input patterns . Typically and . For instance, for binary classification, and for regression. A loss function is used to assess the quality of hypotheses . If a pair is sampled, then quantifies the dissimilarity between the label predicted by , and the actual label . We may write to express the losses (of ) as a function of the training examples. The (theoretical) risk of hypothesis under data-generating distribution is (footnote 3: mathematicians write for the integral of a function with respect to a (not necessarily probability) measure on ). It is also called the error of under . The empirical risk of on a sample is where is the empirical measure (footnote 4: integrals with respect to evaluate as follows: ) on associated to the sample. Notice that the risk (empirical or theoretical) is tied to the choice of a loss function. For instance, consider binary classification with the 0-1 loss , where is an indicator function equal to when the argument is true and equal to when the argument is false. In this case the risk is , i.e. the probability of misclassifying the random example when using ; and the empirical risk is , i.e. the in-sample proportion of misclassified examples.
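As a concrete illustration of the last sentence, the following minimal sketch computes the empirical risk under the 0-1 loss as the in-sample proportion of misclassified examples. The toy data, the weight vector `w` and the helper name `empirical_risk_01` are hypothetical and serve only to make the definition tangible.

```python
import numpy as np

def empirical_risk_01(predict, X, y):
    """In-sample proportion of misclassified examples (0-1 loss)."""
    predictions = predict(X)
    return float(np.mean(predictions != y))

# Hypothetical toy sample with labels in {-1, +1}.
X = np.array([[1.0, 0.0], [0.5, 0.5], [-1.0, 0.2], [-0.3, -0.8]])
y = np.array([1, 1, -1, 1])

# Hypothetical linear classifier h(x) = sign(<w, x>).
w = np.array([1.0, 0.0])
h = lambda inputs: np.where(inputs @ w > 0, 1, -1)

risk = empirical_risk_01(h, X, y)  # one of the four examples is misclassified
```

The theoretical risk replaces this average over the sample by an expectation under the data-generating distribution, which is unknown in practice.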
Our main theorem concerns Hilbert space valued algorithms, in the sense that its learned hypotheses live in a Hilbert space . In this case we may use the Hilbert space norm to measure the difference between the hypotheses learned from two slightly different samples.
To shorten the notation we will write . A generic element of this space is , the observed examples are and the sample of size is .
Definition 1. Consider a learning algorithm where is a separable Hilbert space. We define (footnote 5: for a list and indexes , we write , i.e. the segment from to ) the hypothesis sensitivity coefficients of as follows:
This is close in spirit to what is called “uniform stability” in the literature, except that our definition concerns stability of the learned hypothesis itself (measured by a distance on the hypothesis space), while e.g. BE2002Stability deal with stability of the loss functional. The latter could be called “loss stability” (in terms of “loss sensitivity coefficients”) for the sake of informative names.
Writing when these -tuples differ at one entry (at most), an equivalent formulation to the above is . In particular, if two samples and differ only on one example, then . Thus our definition implies stability with respect to replacing one example with an independent copy. Alternatively, one could define , which corresponds to the “uniform argument stability” of Liu-etal2017BarcelonaPaper. We avoid the ‘almost-sure’ technicalities by defining our ’s as the maximal difference (in norm) with respect to all -tuples . The extension to sensitivity when changing several examples is natural: . Note that is a Lipschitz factor with respect to the Hamming distance. The “total Lipschitz stability” of Kontorovich2014concentration is a similar notion for stability of the loss functional. The “collective stability” of London-etal2013 is not comparable to ours (different setting) despite the similar look.
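To make the notion of hypothesis sensitivity tangible, here is a rough numerical sketch. It does not use an algorithm from this paper: it swaps in ridge regression (a regularized ERM with a closed-form solution) as a stand-in, and it only records the largest observed change in the learned weight vector over random single-example replacements, which is a lower estimate of the supremum in the definition. All names and parameter values are hypothetical.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimise (1/n) * sum_i (<w, x_i> - y_i)^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

def replace_one_sensitivity(X, y, lam, rng, trials=200):
    """Largest observed ||A(s) - A(s')|| when s' replaces one example of s."""
    n, d = X.shape
    w = ridge_fit(X, y, lam)
    worst = 0.0
    for _ in range(trials):
        i = rng.integers(n)                 # index of the example to replace
        X2, y2 = X.copy(), y.copy()
        X2[i] = rng.uniform(-1, 1, size=d)  # fresh input
        y2[i] = rng.uniform(-1, 1)          # fresh label
        worst = max(worst, float(np.linalg.norm(w - ridge_fit(X2, y2, lam))))
    return worst

rng = np.random.default_rng(0)

def make_sample(n, d=3):
    return rng.uniform(-1, 1, size=(n, d)), rng.uniform(-1, 1, size=n)

s_small = replace_one_sensitivity(*make_sample(50), lam=1.0, rng=rng)
s_large = replace_one_sensitivity(*make_sample(500), lam=1.0, rng=rng)
```

In this toy setting the observed sensitivity shrinks as the sample grows, which is the qualitative behaviour one expects from a stable regularized algorithm.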
We will consider randomized classifiers that operate as follows. Let be the classifier space, and let be a probability distribution over the classifiers. To make a prediction the randomized classifier picks according to and predicts a label with the chosen . Each prediction is made with a fresh draw. For simplicity we use the same label for the probability distribution and for the corresponding randomized classifier. The risk measures and are extended to randomized classifiers: is the average theoretical risk of , and its average empirical risk. Given two distributions , the Kullback-Leibler divergence (a.k.a. relative entropy) of with respect to is
Of course this makes sense when is absolutely continuous with respect to , which ensures that the Radon-Nikodym derivative exists. For Bernoulli distributions with parameters and we write , and .
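For Bernoulli distributions the divergence reduces to a two-term sum. The sketch below assumes the standard formula kl(q||p) = q log(q/p) + (1-q) log((1-q)/(1-p)) and assumes that the one-sided variant equals kl(q||p) when q < p and zero otherwise; the function names are ours and hypothetical.

```python
import math

def kl_bernoulli(q, p):
    """KL divergence of Bernoulli(q) with respect to Bernoulli(p), with 0*log(0) = 0."""
    def term(a, b):
        if a == 0.0:
            return 0.0
        if b == 0.0:
            return math.inf  # Bernoulli(q) is not absolutely continuous w.r.t. Bernoulli(p)
        return a * math.log(a / b)
    return term(q, p) + term(1.0 - q, 1.0 - p)

def kl_plus(q, p):
    """Assumed one-sided version: kl(q || p) if q < p, and 0 otherwise."""
    return kl_bernoulli(q, p) if q < p else 0.0
```

The infinite value returned when `p` is 0 or 1 but `q` is not mirrors the absolute continuity requirement mentioned above.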
2.1 A PAC-Bayes bound for stable algorithms with Gaussian randomization
This is our main result:
The proof of our theorem combines stability of the learned hypothesis (in the sense of Definition 1) and a PAC-Bayes bound for the average theoretical error of a randomized classifier, quoted below in Section 4 (Proofs) for reference. Note that the randomizing distribution depends on the sample. Literature on the PAC-Bayes framework for learning linear classifiers includes Germain-etal2015 and PASTS2012 with references. Application of the PAC-Bayes framework to training neural networks can be seen e.g. in London2017, DziugaiteRoy2017.
2.2 A PAC-Bayes bound for SVM with Gaussian randomization
For a Support Vector Machine (SVM) with feature map into a separable Hilbert space , we may identify (footnote 6: the Riesz representation theorem is behind this identification) a linear classifier with a vector . With this identification we can regard an SVM as a Hilbert space (footnote 7: may be infinite-dimensional, e.g. Gaussian kernel) valued mapping that based on a training sample learns a weight vector . In this context, stability of the SVM’s solution then reduces to stability of the learned weight vector.
To be specific, let be the SVM that regularizes the empirical risk over the sample by solving the following optimization problem:
Our stability coefficients in this case satisfy (Example 2 of BE2002Stability, adapted to our setting). Then a direct application of our Theorem 2 together with a concentration argument for the SVM weight vector (see our Corollary 9 below) gives the following:
Corollary 3. Let . Suppose that (once trained) the algorithm will randomize according to Gaussian (footnote 8: see Appendix E about the interpretation of Gaussian randomization for a Hilbert space valued algorithm) distributions . For any randomization variance , for any , with probability we have
In closing this section we mention that our main theorem is general in that it may be specialized to any Hilbert space valued algorithm. This covers any regularized ERM algorithm [Liu-etal2017BarcelonaPaper]. We applied it to SVMs, whose hypothesis sensitivity coefficients (as in our Definition 1) are known. It can be argued that neural networks (NNs) fall under this framework as well. An appealing future research direction, with deep learning in view, is then to figure out the sensitivity coefficients of NNs trained by Stochastic Gradient Descent. Our main theorem could then be applied to provide non-vacuous bounds for the performance of NNs, which we believe is very much needed.
3 Comparison to other bounds
For reference we list several risk bounds (including ours). They are in the context of binary classification (). For clarity, risks under the 0-1 loss are denoted by and risks with respect to the (clipped) hinge loss are denoted by . Bounds requiring a Lipschitz loss function do not apply to the 0-1 loss. However, the 0-1 loss is upper bounded by the hinge loss, allowing us to upper bound the risk with respect to the former in terms of the risk with respect to the latter. On the other hand, results requiring a bounded loss function do not apply to the regular hinge loss. In those cases the clipped hinge loss is used, which enjoys boundedness and Lipschitz continuity.
3.1 P@EW: Our new instance-dependent PAC-Bayes bound
Our Corollary 3, with , a Gaussian centered at with randomization variance , gives the following risk bound which holds with probability :
3.2 P@O: Prior at the origin PAC-Bayes bound
The PAC-Bayes bound (Theorem 4), again with , gives the following risk bound which holds with probability :
3.3 Bound of Liu-etal2017BarcelonaPaper
From Corollary 1 of Liu-etal2017BarcelonaPaper (but with as in formulation (svm)) we get the following risk bound which holds with probability :
We use Corollary 1 of Liu-etal2017BarcelonaPaper with , and (clipped hinge loss).
3.4 Bound of BE2002Stability
From Example 2 of BE2002Stability (but with as in formulation (svm)) we get the following risk bound which holds with probability :
We use Example 2 and Theorem 17 (based on Theorem 12) of BE2002Stability with (normalized kernel) and (clipped hinge loss).
In Appendix C below there is a list of different SVM formulations, and how to convert between them. We found it useful when implementing code for experiments.
There are obvious differences in the nature of these bounds: the last two (Liu-etal2017BarcelonaPaper and BE2002Stability) are risk bounds for the (un-randomized) classifiers, while the first two (P@EW, P@O) give an upper bound on the KL-divergence between the average risks (empirical to theoretical) of the randomized classifiers. Of course, inverting the KL-divergence we get a bound for the average theoretical risk in terms of the average empirical risk and the (square root of the) right hand side. Also, the first two bounds have an extra parameter, the randomization variance (), that can be optimized. Note that the P@O bound is not based on stability, while the other three bounds are based on stability notions. Next let us comment on how these bounds compare quantitatively.
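The inversion of the KL-divergence mentioned above can be done numerically. The following is a minimal sketch by bisection, not the authors' implementation: given an average empirical risk `q` and a value `bound` for the right hand side, it returns the largest `p` with kl(q||p) <= bound. The function names and tolerances are our own choices.

```python
import math

def kl_bernoulli(q, p):
    """Binary KL divergence, with arguments clipped away from 0 and 1 for stability."""
    eps = 1e-12
    q = min(max(q, eps), 1.0 - eps)
    p = min(max(p, eps), 1.0 - eps)
    return q * math.log(q / p) + (1.0 - q) * math.log((1.0 - q) / (1.0 - p))

def kl_inverse_upper(q, bound, tol=1e-10):
    """Largest p in [q, 1] with kl(q || p) <= bound, found by bisection."""
    lo, hi = q, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(q, mid) <= bound:
            lo = mid   # mid still satisfies the constraint, move right
        else:
            hi = mid   # constraint violated, move left
    return lo

p_upper = kl_inverse_upper(0.1, 0.05)
```

Bisection applies because kl(q||p) is increasing in p on [q, 1]. By Pinsker's inequality the result is never larger than q + sqrt(bound / 2), which is the square-root form alluded to in the text.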
Our P@EW bound and the P@O bound are similar except for the first term on the right hand side. This term comes from the KL-divergence between the Gaussian distributions. Our P@EW bound’s first term improves with larger values of , which in turn penalize the norm of the weight vector of the corresponding SVM, resulting in a small first term in the P@O bound. Note that the P@O bound is equivalent to the setting of , a Gaussian with center in the direction of , at distance from the origin (as discussed in Langford05 and implemented in PASTS2012).
The first term on the right hand side of our P@EW bound comes from the concentration of the weight (see our Corollary 9). Lemma 1 of Liu-etal2017BarcelonaPaper implies a similar concentration inequality for the weight vector, but it is not hard to see that our concentration bound is slightly better.
Finally, in the experiments we compare our P@EW bound with BE2002Stability.
4 Proofs
Theorem 4 (PAC-Bayes bound) Consider a learning algorithm . For any , and for any , with probability we have
The probability is over the generation of the training sample .
The above is Theorem 5.1 of Langford05, though see also Theorem 2.1 of Germain-etal2009. To use the PAC-Bayes bound, we will use and , a Gaussian distribution centered at the expected output and a Gaussian (posterior) distribution centered at the random output , both with covariance operator . The KL-divergence between those Gaussians scales with . More precisely:
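For two Gaussians that share the same isotropic covariance, the divergence depends only on the distance between the centers. The display below states this standard identity in the finite-dimensional case; the symbols $\mu$, $\mu_0$ and $\sigma^2$ are our notation for the two centers and the common variance and are assumptions of this sketch, not taken from the text.

```latex
\mathrm{KL}\bigl(\mathcal{N}(\mu, \sigma^2 I) \,\big\|\, \mathcal{N}(\mu_0, \sigma^2 I)\bigr)
  = \frac{\lVert \mu - \mu_0 \rVert^2}{2\sigma^2}
```

With $\mu$ the random output of the algorithm and $\mu_0$ its expected output, the divergence is the squared deviation of the output from its mean, scaled by $1/(2\sigma^2)$.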
Therefore, bounding will give (via the PAC-Bayes bound of Theorem 4 above) a corresponding bound on the divergence between the average empirical risk and the average theoretical risk of the randomized classifier . Hypothesis stability (in the form of our Definition 1) implies a concentration inequality for . This is done in our Corollary 8 (see Section 4.3 below) and completes the circle of ideas to prove our main theorem. The proof of our concentration inequality is based on an extension of the bounded differences theorem of McDiarmid to vector-valued functions discussed next.
4.1 Real-valued functions of the sample
To shorten the notation let’s present the training sample as where each example is a random variable taking values in the (measurable) space . We quote a well-known theorem:
Theorem 5 (McDiarmid inequality) Let be independent -valued random variables, and a real-valued function such that for each and for each list of ‘complementary’ arguments we have
Then for every , .
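As a quick numerical reading of the inequality, the sketch below assumes its usual one-sided form, P(f - E f > t) <= exp(-2 t^2 / sum_i c_i^2), where c_i are the bounded-difference constants, and solves for the deviation t at a given confidence level. The example function (the mean of n variables in [0, 1]) and all names are hypothetical.

```python
import math

def mcdiarmid_deviation(c, delta):
    """Deviation t with P(f - E f > t) <= delta, assuming the usual McDiarmid form."""
    return math.sqrt(sum(ci ** 2 for ci in c) * math.log(1.0 / delta) / 2.0)

n = 1000
c = [1.0 / n] * n  # the mean of n variables in [0, 1] changes by at most 1/n per argument
t = mcdiarmid_deviation(c, delta=0.05)
```

For this example the deviation scales as the inverse square root of n, as expected for an average.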
McDiarmid’s inequality applies to a real-valued function of independent random variables. Next we present an extension to vector-valued functions of independent random variables. The proof follows the steps of the proof of the classic result above, but we have not found this result in the literature, hence we include the details.
4.2 Vector-valued functions of the sample
Let be independent -valued random variables and a function into a separable Hilbert space. We will prove that bounded differences in norm (footnote 9: the Hilbert space norm, induced by the inner product of ) imply concentration of around its mean in norm, i.e., that is small with high probability.
Notice that McDiarmid’s theorem can’t be applied directly to when is vector-valued. We will apply McDiarmid to the real-valued , which will give an upper bound for in terms of . The next lemma upper bounds for a function with bounded differences in norm. Its proof is in Appendix A.
Lemma 6. Let be independent -valued random variables, and a function into a Hilbert space satisfying the bounded differences property: for each and for each list of ‘complementary’ arguments we have
If the vector-valued function has bounded differences in norm (as in the Lemma) and is any constant, then the real-valued function has the bounded differences property (as in McDiarmid’s theorem). In particular this is true for (notice that is constant over replacing by an independent copy ) so applying McDiarmid’s inequality to it, combining with Lemma 6, we get the following theorem:
Theorem 7. Under the assumptions of Lemma 6, for any , with probability we have
Notice that the vector of difference bounds appears in the above inequality only through its Euclidean norm .
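The following Monte Carlo sketch illustrates the vector-valued concentration on a toy example. It assumes that the bound takes the form ||c||_2 + sqrt(||c||_2^2 log(1/delta) / 2), which is what combining McDiarmid's inequality with an expected-norm bound of ||c||_2 would give; this form, the toy function (the mean of unit vectors) and all names are assumptions of the sketch.

```python
import numpy as np

def norm_deviation_bound(c, delta):
    """Assumed form of the bound: ||c||_2 + sqrt(||c||_2^2 * log(1/delta) / 2)."""
    c_norm = float(np.linalg.norm(c))
    return c_norm + float(np.sqrt(c_norm ** 2 * np.log(1.0 / delta) / 2.0))

rng = np.random.default_rng(0)
n, d, delta, runs = 200, 5, 0.05, 2000

# Hypothetical f: the mean of n i.i.d. vectors uniform on the unit sphere, so E f = 0.
X = rng.normal(size=(runs, n, d))
X /= np.linalg.norm(X, axis=2, keepdims=True)
deviations = np.linalg.norm(X.mean(axis=1), axis=1)  # ||f - E f|| for each run

c = np.full(n, 2.0 / n)  # replacing one unit vector moves the mean by at most 2/n
bound = norm_deviation_bound(c, delta)
violation_rate = float(np.mean(deviations > bound))
```

In this example the observed fraction of runs exceeding the bound stays below delta, and the average deviation stays below the Euclidean norm of the vector of difference bounds.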
4.3 Stability implies concentration
The hypothesis sensitivity coefficients give concentration of the learned hypothesis:
Corollary 8. Let be a Hilbert space valued algorithm. Suppose has hypothesis sensitivity coefficients . Then for any , with probability we have
This is a consequence of Theorem 7 since for , hence .
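The substitution behind this step can be written out as follows. The sketch assumes that Theorem 7 bounds the deviation by $\lVert c \rVert_2 + \sqrt{\tfrac{1}{2}\lVert c \rVert_2^2 \log(1/\delta)}$ and writes $\beta_n$ for the hypothesis sensitivity coefficient at sample size $n$; both the form of the bound and the symbols are our assumptions.

```latex
c_i = \beta_n \ \ (i = 1, \dots, n)
\quad\Longrightarrow\quad
\lVert c \rVert_2 = \sqrt{n}\,\beta_n ,
\qquad
\lVert c \rVert_2 + \sqrt{\tfrac{1}{2}\lVert c \rVert_2^2 \log\tfrac{1}{\delta}}
  = \sqrt{n}\,\beta_n \Bigl( 1 + \sqrt{\tfrac{1}{2}\log\tfrac{1}{\delta}} \Bigr)
```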
Last (not least) we deduce concentration of the weight vector .
Corollary 9. Let . Suppose that the kernel used by SVM is bounded by . For any , for any , with probability we have
Under these conditions we have hypothesis sensitivity coefficients (we follow BE2002Stability, Example 2 and Lemma 16, adapted to our setting). Then apply Corollary 8.
The purpose of the experiments was to explore the strengths and potential weaknesses of our new bound in relation to the alternatives presented earlier, as well as to explore the bound’s ability to help model selection. For this, to facilitate comparisons, taking the setup of PASTS2012, we experimented with the five UCI datasets described there. However, we present results for pim and rin only, as the results on the other datasets mostly followed the results on these and in a way these two datasets are the most extreme. In particular, they are the smallest and largest with dimensions ( examples, and dimensional feature space), and , respectively.
Model and data preparation We used an offset-free SVM classifier with a Gaussian RBF kernel with RBF width parameter . The SVM used the so-called standard SVM-C formulation; the conversion between our formulation and the SVM-C formulation, which multiplies the total (hinge) loss by , is given by , where is the number of training examples and is the penalty in our formulation (svm). The datasets were split into a training and a test set using the train_test_split method of scikit-learn, keeping of the data for training and for testing.
Model parameters Following the procedure suggested in Section 2.3.1 of ChZi05, we set up a geometric grid over the -parameter space where ranges between and and ranges between and , where is the median of the Euclidean distance between pairs of data points of the training set, and given , is obtained as the reciprocal value of the empirical variance of data in feature space underlying the RBF kernel with width . The grid size was selected for economy of computation. The grid lower and upper bounds for were ad hoc, though inspired by the literature, while for we enlarged the lower range to focus on the region of the parameter space where the stability-based bounds have a better chance to be effective: in particular, the stability-based bounds grow with in a linear fashion, with a coefficient that was empirically observed to be close to one.
Computations For each of the pairs on the said grid, we trained an SVM model using a Python implementation of the SMO algorithm of Platt99, adjusted to SVMs with no offset (SteinwartChristmann2008SVMs argue that “the offset term has neither a known theoretical nor an empirical advantage” for the Gaussian RBF kernel). We then calculated various bounds using the obtained model, as well as the corresponding test error rates (recall that the randomized classifiers’ test error is different from the test error of the SVM model that uses no randomization). The bounds compared were the two hinge-loss based bounds mentioned earlier: the bound by Liu-etal2017BarcelonaPaper and that of BE2002Stability. In addition we calculated the P@O and (our) P@EW bounds. When calculating the latter two, we optimized the randomization variance parameter by minimizing the error estimate obtained from the respective bound (the KL divergence was inverted numerically). Further details can be found in Appendix D.
Results As explained earlier, our primary interest is to explore the various bounds’ strengths and weaknesses. In particular, we are interested in their tightness, as well as their ability to support model selection. As the qualitative results were insensitive to the split, only results for a single “random” (arbitrary) split are shown.
Tightness The hinge loss based bounds gave trivial bounds over almost all pairs of . Upon investigating this we found that this is because the hinge loss takes much larger values than the training error rate unless takes large values (cf. Fig. 3 in Appendix D). However, for large values of , both of the bounds are vacuous. In general, the stability-based bounds (Liu-etal2017BarcelonaPaper, BE2002Stability and our bound) are sensitive to large values of . Fig. 1 shows the difference between the P@O bound and the test error of the underlying randomized classifiers as a function of , while Fig. 2 shows the difference between the P@EW bound and the test error of the underlying randomized classifier. (Figs. 9 and 7 in the appendix show the test errors for these classifiers, while Figs. 8 and 6 show the bounds.) The meticulous reader may worry that on the smaller dataset, pim, the difference shown for P@O is sometimes negative. As it turns out this is due to the randomness of the test error: once we add a confidence correction that accounts for the small test set (), this difference disappears. From the figures the most obvious difference between the bounds is that the P@EW bound is sensitive to the value of and becomes loose for larger values of . This is expected: as noted earlier, stability-based bounds, of which P@EW is an instance, are sensitive to . The P@O bound shows a weaker dependence on , if any. In the appendix we show the advantage (or disadvantage) of the P@EW bound over the P@O bound in Fig. 10. From this figure we can see that on pim, P@EW is to be preferred almost uniformly for small values of (), while on rin, the advantage of P@EW is limited both for smaller values of and a certain range of the RBF width.
Two comments are in order in connection to this: (i) We find it remarkable that a stability-based bound can be competitive with the P@O bound, which is known as one of the best bounds available. (ii) While comparing bounds is interesting for learning about their qualities, the bounds can be used together (e.g., at the price of an extra union bound).
Model selection To evaluate a bound’s capability in helping model selection it is worth comparing the correlation between the bound and the test error of the underlying classifiers. By comparing Figs. 7 and 6 with Figs. 9 and 8 it appears that perhaps the behavior of the P@EW bound (at least for small values of ) follows more closely the behavior of the corresponding test error surface. This is particularly visible on rin, where the P@EW bound seems to be able to pick better values both for and , which lead to a much smaller test error (around ) than what one can obtain by using the P@O bound.
We have developed a PAC-Bayes bound for randomized classifiers. We proceeded by investigating the stability of the hypothesis learned by a Hilbert space valued algorithm, a special case being SVMs. We applied our main theorem to SVMs, leading to our P@EW bound, and we compared it to other stability-based bounds and to a previously known PAC-Bayes bound. The main finding is that perhaps P@EW is the first nontrivial bound that uses stability.
Appendix A Proof of Lemma 6
Let be a function of the independent -valued random variables , where the function maps into a separable Hilbert space . Let’s write as the telescopic sum (footnote 10: the Doob decomposition: are martingale differences and their sum is a martingale)
and the -algebra generated by the first examples. Thus
We need . Taking the expectation above makes the second sum disappear since for we have
and clearly for . Thus we have
Also recall the notation for . It will be used extensively in what follows.
Let’s write the conditional expectations in terms of regular conditional probabilities:
The random variables are labelled with capitals. The lower case letters are for the variables of integration. We write for the distribution (probability law) of .
By independence, and (this latter is not really needed in the proof, but shortens the formulae). Hence,
Then is equal to the integral w.r.t. of
Notice that in the integrand, only the th argument differs. If , then . Thus bounded differences for implies bounded martingale differences (in norm).
Finally, using Jensen’s inequality and (1), and the bounded differences assumption:
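The chain of inequalities in this last step can be sketched as follows. Here $D_i$ stands for the $i$-th martingale difference, $c_i$ for the $i$-th bounded-difference constant, and $f$ is shorthand for the function evaluated at the sample; this notation is our assumption. The first inequality is Jensen's, the equality is the orthogonality of martingale differences (which we take to be the identity referred to as (1)), and the last inequality uses the bounded differences assumption.

```latex
\mathbb{E}\,\lVert f - \mathbb{E} f \rVert
  \;\le\; \sqrt{\mathbb{E}\,\lVert f - \mathbb{E} f \rVert^{2}}
  \;=\; \sqrt{\sum_{i=1}^{n} \mathbb{E}\,\lVert D_i \rVert^{2}}
  \;\le\; \sqrt{\sum_{i=1}^{n} c_i^{2}}
```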
Appendix B The average empirical error for Gaussian random linear classifiers
Let , a Gaussian with center and covariance matrix the identity . The average empirical error is to be calculated as
where and is the standard normal cumulative distribution
Recall that is the empirical measure on associated to the training examples, and the integral with respect to evaluates as a normalized sum.
In this section we write the derivation of (2).
To make things more general let , a Gaussian with center and covariance matrix . We’ll write for the corresponding Gaussian measure on . But to make notation simpler, let’s work with input vectors (instead of feature vectors ). This is in the context of binary classification, so the labels are . The classifier is identified with the weight vector . The loss on example can be written as
We’ll talk about the empirical error of , namely . The average empirical error when choosing a random weight according to is:
Plugging in the definition of and swapping the order of the integrals and using the above formula for the loss, the right hand side is
where for a fixed pair we are writing
Decompose into two terms:
and notice that for the random vector we have and , so the functional has a 1-dimensional Gaussian distribution with mean and variance . Then
Altogether this gives
Notice that using (identity) and instead of this gives (2).
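The derivation above can be checked numerically. The sketch below assumes the 0-1 loss of the linear classifier sign(<W, x>) with W drawn from a Gaussian with center `w` and covariance `Sigma`, so that the probability of misclassifying (x, y) is 1 - Phi(y <w, x> / sqrt(x' Sigma x)), with Phi the standard normal cumulative distribution. The data, names and sample sizes are hypothetical, and the closed form is compared with a Monte Carlo estimate.

```python
import math
import numpy as np

def gaussian_avg_01_error(w, Sigma, X, y):
    """Average empirical 0-1 error of W ~ N(w, Sigma) predicting sign(<W, x>)."""
    margins = y * (X @ w)                                    # mean of y * <W, x>
    scales = np.sqrt(np.einsum("ij,jk,ik->i", X, Sigma, X))  # std of <W, x>
    # P(y <W, x> <= 0) = 1 - Phi(m / s) = 0.5 * erfc(m / (s * sqrt(2)))
    tails = [0.5 * math.erfc(m / (s * math.sqrt(2.0))) for m, s in zip(margins, scales)]
    return float(np.mean(tails))

rng = np.random.default_rng(0)
n, d = 20, 3
X = rng.normal(size=(n, d))
y = rng.choice([-1.0, 1.0], size=n)
w = rng.normal(size=d)

exact = gaussian_avg_01_error(w, np.eye(d), X, y)

# Monte Carlo estimate with identity covariance: draw weights, count errors.
W = w + rng.normal(size=(20000, d))
mc = float(np.mean((W @ X.T) * y <= 0))
```

The two values agree up to Monte Carlo error in this example, which supports using the closed form instead of sampling when evaluating the average empirical error.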
REMARK: Langford05 uses a which is along the direction of a vector , and in all directions perpendicular to . Such is a Gaussian centered at , giving his formula
Appendix C SVM weight vector: clarification about formulations
We have a sample of size .
In the standard implementation the weight vector found by is a solution of the following optimization problem:
In our paper the weight vector found by is a solution of the following optimization problem:
In BE2002Stability and Liu-etal2017BarcelonaPaper the weight vector found by is a solution of the following optimization problem: