# On the role of data in PAC-Bayes bounds

The dominant term in PAC-Bayes bounds is often the Kullback–Leibler divergence between the posterior and prior. For so-called linear PAC-Bayes risk bounds based on the empirical risk of a fixed posterior kernel, it is possible to minimize the expected value of the bound by choosing the prior to be the expected posterior, which we call the oracle prior on account of the fact that it is distribution dependent. In this work, we show that the bound based on the oracle prior can be suboptimal: In some cases, a stronger bound is obtained by using a data-dependent oracle prior, i.e., a conditional expectation of the posterior, given a subset of the training data that is then excluded from the empirical risk term. While using data to learn a prior is a known heuristic, its essential role in optimal bounds is new. In fact, we show that using data can mean the difference between vacuous and nonvacuous bounds. We apply this new principle in the setting of nonconvex learning, simulating data-dependent oracle priors on MNIST and Fashion MNIST with and without held-out data, and demonstrating new nonvacuous bounds in both cases.


## Code Repositories

### role-of-data


## 1. Introduction

In this work, we are interested in the application of PAC-Bayes bounds McAllester1999,shawe1997pac to the problem of understanding the generalization properties of learning algorithms. Our focus is on supervised learning from i.i.d. data, although PAC-Bayes theory has been generalized far beyond this setting. (Guedjsurvey provides a survey.) In our setting, PAC-Bayes bounds control the risk of Gibbs classifiers, i.e., randomized classifiers whose predictions, on each input, are determined by a classifier sampled according to some distribution $Q$ on the hypothesis space. The hallmark of a PAC-Bayes bound is a normalized Kullback–Leibler (KL) divergence, $\mathrm{KL}(Q\|P)$, defined in terms of a Gibbs classifier $P$ that is called a "prior" because it must be independent of the data points used to estimate the empirical risk of $Q$.

In applications of PAC-Bayes bounds to generalization error, the contribution of the KL divergence often dominates the bound: In order to have a small KL with a strongly data-dependent posterior, the prior must, in essence, predict the posterior. This is difficult without knowledge of (or access to) the data distribution, and represents a significant statistical barrier to achieving tight bounds. Instead, many PAC-Bayesian analyses rely on generic priors chosen for analytical convenience.

Generic priors, however, are not inherent to the PAC-Bayes framework: every valid prior yields a valid bound. Therefore, if one does not adapt the prior to the data distribution, one may obtain a bound that is loose on account of ignoring important, favorable properties of the data distribution.

langford2003microchoice were the first to consider the problem of optimizing the prior to minimize the expected value of the high-probability PAC-Bayes bound. In the realizable case, they show that the problem reduces to optimizing the expected value of the KL term. More precisely, they consider a fixed learning rule $Q$, i.e., a fixed posterior kernel, which chooses a posterior, $Q(S)$, based on a training sample, $S$. In the realizable case, the bound depends linearly on the KL term. Then $\mathbb{E}[\mathrm{KL}(Q(S)\|P)]$ is minimized by the expected posterior, $P^* = \mathbb{E}[Q(S)]$, i.e., $P^*(A) = \mathbb{E}[Q(S)(A)]$ for measurable $A$. Both expectations are taken over the unknown distribution of the training sample, $S$. We call $P^*$ the oracle prior. If we introduce an $\mathcal{H}$-valued random variable $\hat h$ satisfying $\hat h \sim Q(S)$ a.s., we see that its distribution is $P^*$, and thus the "optimality" of the oracle prior is an immediate consequence of the identity $\mathbb{E}[\mathrm{KL}(Q(S)\|P)] = I(\hat h; S) + \mathrm{KL}(P^*\|P)$, a well-known variational characterization of mutual information in terms of KL divergence.
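The optimality of the expected posterior can be checked numerically. The sketch below is our own illustration, not taken from the paper: a Dirichlet draw stands in for the map $S \mapsto Q(S)$ over a hypothetical three-point hypothesis space, and we verify that the empirical expected posterior achieves the smallest average KL among candidate priors.

```python
import numpy as np

def kl(q, p):
    # KL divergence between discrete distributions with full support.
    return float(np.sum(q * np.log(q / p)))

rng = np.random.default_rng(0)
# Stand-in for a posterior kernel S -> Q(S): each Dirichlet draw plays the
# role of a posterior over a 3-element hypothesis space.
posteriors = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=5000)

p_star = posteriors.mean(axis=0)  # oracle prior: the expected posterior
obj_star = np.mean([kl(q, p_star) for q in posteriors])

# Any other prior does at least as badly on average, since
# E[KL(Q||P)] - E[KL(Q||P*)] = KL(P*||P) >= 0.
for _ in range(20):
    p = rng.dirichlet(alpha=[1.0, 1.0, 1.0])
    assert np.mean([kl(q, p) for q in posteriors]) >= obj_star
```

The identity used in the comment is exactly the variational characterization mentioned above, so the check holds deterministically, not just in the limit of many samples.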

For so-called linear PAC-Bayes bounds (introduced below), the oracle prior is seen to minimize the bound in expectation, even in the unrealizable setting. In light of this, having settled on a learning rule $Q$, we might seek to achieve the tightest linear PAC-Bayes bound in expectation by attempting to approximate the oracle prior, $P^*$. Indeed, there is a large literature aimed at obtaining localized PAC-Bayes bounds via distribution-dependent priors, whether analytically Catoni,lever2010distribution, through data Amb07,negrea2019information, or by way of concentration of measure, privacy, or stability Oneto2016,Oneto2017,DR18,rivasplata2018pac.

One of the contributions of this paper is the demonstration that an oracle prior may not yield the tightest linear PAC-Bayes risk bound in expectation: once one allows data-dependent priors, one can sometimes do better than the oracle prior. Our main result (mainthm) gives conditions on a learning rule under which there exist data-dependent priors that improve on the bound based upon the oracle prior. This phenomenon is a hitherto unstated principle of PAC-Bayesian analysis: data-dependent priors are sometimes necessary for tight bounds. Note that, as the prior must be independent of the data used to compute the bound a posteriori, if training data are used to define the prior, only the remaining data should be used to compute the bound (i.e., to compute the empirical risk term and to divide the KL term). All training data are still used by the learning algorithm. We formalize these subtleties in the body of the paper.

We give an example of a learning problem where mainthm implies data-dependent priors dominate. The example is adapted from a simple model of SGD in a linear model by NagaKolter19c. In the example, most input dimensions are noise with no signal and this noise accumulates in the learned weights. In our version, we introduce a learning rate schedule, and so earlier data points have a larger influence on the resulting weights. Even so, there is enough variability in the posterior that the oracle prior yields a vacuous bound. By conditioning on early data points, we reduce the variability and obtain nonvacuous bounds.

The idea of using data-dependent priors to obtain tighter bounds is not new Amb07,parrado2012pac,DR18,rivasplata2018pac. The idea is also implicit in the luckiness framework shawe1996framework. However, the observation that using data can be essential to obtaining a tight bound, even in full knowledge of the true distribution, is new, and brings a new dimension to the problem of constructing data-dependent priors.

In addition to demonstrating the theoretical role of data-dependent priors, we investigate them empirically, by studying generalization in nonconvex learning by stochastic (sub)gradient methods. As data-dependent oracle priors depend on the unknown distribution, we propose to use held-out data (“ghost sample”) to estimate unknown quantities. Unlike standard held-out test set bounds, this approach relies implicitly on a type of stability demonstrated by SGD. We also propose approximations to data-dependent oracle priors that use no ghost sample, and find, given enough data, the advantage of the ghost sample diminishes significantly. We show that both approaches yield state-of-the-art nonvacuous bounds on MNIST and Fashion-MNIST for posterior Gaussian distributions whose means are clamped to the weights learned by SGD. Our MNIST bound (11%) improves significantly on the best published bound (46%) Zhou18. Finally, we evaluate minimizing a PAC-Bayes bound with our data-dependent priors as a learning algorithm. We demonstrate significant improvements to both classifier accuracy and bound tightness, compared to optimizing with generic priors.

## 2. Preliminaries

Let $\mathcal{Z}$ be a space of labeled examples, and write $\mathcal{M}_1(\mathcal{Z})$ for the space of (probability) distributions on $\mathcal{Z}$. Given a space $\mathcal{H}$ of classifiers and a bounded loss function $\ell$, the risk of a hypothesis $h \in \mathcal{H}$ with respect to a distribution $\mathcal{D} \in \mathcal{M}_1(\mathcal{Z})$ is $L_{\mathcal{D}}(h) = \mathbb{E}_{z\sim\mathcal{D}}[\ell(h,z)]$. We also consider Gibbs classifiers, i.e., elements $Q$ in the space $\mathcal{M}_1(\mathcal{H})$ of distributions on $\mathcal{H}$, whose risk is defined by $L_{\mathcal{D}}(Q) = \mathbb{E}_{h\sim Q}[L_{\mathcal{D}}(h)]$. As $\mathcal{D}$ is unknown, learning algorithms often work by optimizing an objective that depends on i.i.d. training data $S \sim \mathcal{D}^n$, such as the empirical risk $L_S(Q)$, defined via the empirical distribution of $S$. Writing $Q(S)$ for a data-dependent Gibbs classifier (i.e., a posterior), our primary focus is its risk, $L_{\mathcal{D}}(Q(S))$, and its relationship to empirical estimates, such as $L_S(Q(S))$.

We now present a bound within the PAC-Bayes framework McAllester1999,shawe1997pac. Write $\mathrm{KL}(Q\|P)$ for the KL divergence between distributions $Q$ and $P$. (See KLapp for definitions.) This bound follows from McAllesterDropOut [Thm. 2]. (See also Catoni [Thm. 1.2.6].)

[Linear PAC-Bayes bound] Let $\beta \in (0,1)$, $\delta \in (0,1)$, $n \in \mathbb{N}$, and $P \in \mathcal{M}_1(\mathcal{H})$. With probability at least $1-\delta$ over $S \sim \mathcal{D}^n$, for all $Q \in \mathcal{M}_1(\mathcal{H})$,

$$L_{\mathcal{D}}(Q) \;\le\; \Psi_{\beta,\delta}(Q,P;S) \;\overset{\text{def}}{=}\; \frac{1}{\beta}\, L_S(Q) \;+\; \frac{\mathrm{KL}(Q\,\|\,P) + \log\frac{1}{\delta}}{2\beta(1-\beta)\,|S|}. \tag{1}$$
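Equation (1) is simple enough to compute directly. The helper below is our own sketch, with argument names of our choosing, evaluating the right-hand side of (1):

```python
import math

def linear_pac_bayes_bound(emp_risk, kl, n, beta, delta):
    """Right-hand side of the linear PAC-Bayes bound (1):
    L_D(Q) <= (1/beta) L_S(Q) + (KL(Q||P) + log(1/delta)) / (2 beta (1-beta) |S|)."""
    assert 0 < beta < 1 and 0 < delta < 1 and n > 0
    return emp_risk / beta + (kl + math.log(1.0 / delta)) / (2.0 * beta * (1.0 - beta) * n)
```

Note how the bound degrades as the KL term grows relative to $2\beta(1-\beta)|S|$: this is precisely why the choice of prior matters.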

We call $P$ the prior. Since the bound is valid for all priors $P$ independent from $S$, we can choose $P$ by optimizing, e.g., the risk bound in expectation, as first proposed by langford2003microchoice:

Let $n \in \mathbb{N}$ and fix a probability kernel $Q : \mathcal{Z}^n \to \mathcal{M}_1(\mathcal{H})$. For all $\beta \in (0,1)$ and $\delta \in (0,1)$, $\mathbb{E}_{S\sim\mathcal{D}^n}[\Psi_{\beta,\delta}(Q(S),P;S)]$ is minimized by the "oracle" prior $P^* = \mathbb{E}[Q(S)]$.

## 3. Data-dependent oracle priors

Here we demonstrate that, for linear PAC-Bayes bounds, one may obtain a stronger bound using a “data-dependent oracle” prior, rather than the usual (data-independent) oracle prior. Further, using a data-dependent oracle prior may mean the difference between a vacuous and nonvacuous bound.

A typical PAC-Bayes generalization bound for a posterior kernel $Q$ is based on the empirical risk computed from the same data fed to the kernel. Instead, let $J$ be a (possibly random) subset of $[n]$ of size $m$, independent from $S$, let $S_J$ denote the subsequence of data with indices in $J$, and let $S \setminus S_J$ denote the complementary subsequence. Consider now the PAC-Bayes bound based on the estimate $L_{S\setminus S_J}(Q(S))$. In this case, the prior need only be independent from $S \setminus S_J$. The $S_J$-measurable data-dependent oracle prior arises as the solution of the optimization

$$\inf_{P:\,\mathcal{Z}^{|J|}\to\mathcal{M}_1(\mathcal{H})}\ \mathbb{E}\big[\mathrm{KL}\big(Q(S)\,\|\,P(S_J)\big)\big].$$

Letting $\hat h$ be a random element in $\mathcal{H}$ satisfying $\hat h \sim Q(S)$ a.s., the value of condminfvar is the conditional mutual information $I(\hat h; S \mid S_J)$. Define the information rate gain (from using $S_J$ to choose the prior) and the excess bias (from using $S \setminus S_J$ to estimate the risk) to be, respectively,

$$\mathcal{R}(\hat h; S \mid S_J) \;=\; \frac{I(\hat h; S)}{|S|} \;-\; \frac{I(\hat h; S \mid S_J)}{|S \setminus S_J|}$$

and

$$\mathcal{B}(\hat h; S \mid S_J) \;=\; \mathbb{E}\big[L_{S\setminus S_J}(\hat h) - L_S(\hat h)\big].$$

Note that, if $J$ is chosen uniformly at random, then $\mathcal{B}(\hat h; S \mid S_J) = 0$. Using these two quantities, we can characterize whether a data-dependent prior can outperform the oracle prior. Let $\beta \in (0,1)$, $\delta \in (0,1)$, and $n \in \mathbb{N}$. Fix $m < n$ and let $J$ be a (possibly random) subset of $[n]$ of nonrandom cardinality $m$, independent from $S$. Conditional on $J$ and $S_J$, let $P^*(S_J)$ denote the conditional expected posterior $\mathbb{E}[Q(S) \mid J, S_J]$. Then

$$\mathbb{E}_J\,\mathbb{E}_{S\sim\mathcal{D}^n}\,\Psi_{\beta,\delta}\big(Q(S),\,P^*(S_J);\,S\setminus S_J\big)\;\le\;\mathbb{E}_{S\sim\mathcal{D}^n}\,\Psi_{\beta,\delta}\big(Q(S),\,P^*;\,S\big)$$

if and only if

$$\mathcal{R}(\hat h; S \mid S_J) \;>\; 2(1-\beta)\,\mathcal{B}(\hat h; S \mid S_J) \;+\; \frac{\log\frac{1}{\delta}}{n}\cdot\frac{m}{n-m}. \tag{2}$$

(Proof in app:mainthm.) To interpret the proposition, consider condition (2): if the information rate gain is larger than the excess bias plus a term that accounts for excess variance, then a data-dependent prior yields a tighter bound. It is reasonable to ask whether such situations arise. The following demonstration modifies a linear classification problem presented by NagaKolter19c. Their example was originally constructed to demonstrate potential roadblocks to studying generalization in SGD using uniform convergence arguments. Here, we modify the learning algorithm to have a decreasing step size, which causes earlier data points to have more influence. We exploit this property to achieve much tighter bounds using data-dependent oracle priors. Indeed, we will obtain nonvacuous bounds, while the optimal data-independent oracle prior yields a vacuous bound.
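Condition (2) is a one-line arithmetic check. The helper below is our own sketch (argument names are ours): given the information rate gain $\mathcal{R}$, the excess bias $\mathcal{B}$, and the bound parameters, it decides whether the data-dependent oracle prior is preferred in expectation.

```python
import math

def prefer_data_dependent_prior(rate_gain, excess_bias, beta, delta, n, m):
    """Condition (2): the data-dependent oracle prior P*(S_J) beats the
    oracle prior P* in expectation iff
        R > 2(1-beta)*B + (log(1/delta)/n) * (m/(n-m))."""
    assert 0 < beta < 1 and 0 < delta < 1 and 0 <= m < n
    threshold = 2 * (1 - beta) * excess_bias + math.log(1 / delta) / n * (m / (n - m))
    return rate_gain > threshold
```

When $J$ is uniformly random, the excess bias vanishes, and the trade-off is purely between the information rate gain and the variance penalty $\frac{\log(1/\delta)}{n}\cdot\frac{m}{n-m}$.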

Consider the hypothesis class $\mathcal{H} = \mathbb{R}^d$, interpreted as linear classifiers

$$x \mapsto \mathrm{sign}(\langle x, w\rangle) : \mathbb{R}^d \to \{-1, 0, 1\}, \quad\text{for } w \in \mathbb{R}^d.$$

Assume that , with , and decompose each input as , where and . (We will decompose the weights similarly.) Labels take values in and so a prediction of (i.e., on the decision boundary) is a mistake.

Consider the following i.i.d. training data: Let be a nonrandom vector and, for each , choose uniformly at random in , let , and let be multivariate normal with mean 0 and covariance , where is the identity matrix. Let denote the common marginal distribution of each training example .

Consider the following one-pass learning algorithm: Let , then, for and , put Then define the final weights to be , where is an independent, zero-mean multivariate Gaussian with covariance . Note that where and .

We will compare bounds based on oracle priors with those based on data-dependent oracle priors. To that end, define $\hat h$ by $\hat h \sim Q(S)$ a.s. For a subset $J \subseteq [n]$, let $S_J$ be the corresponding subsequence of the data and let $S \setminus S_J$ be the complementary subsequence.

There is a choice of the problem constants such that the infimum

$$\inf_{J\subseteq[n]}\ \inf_{\beta\in(0,1)}\ \inf_{P:\,\mathcal{Z}^{|J|}\to\mathcal{M}_1(\mathcal{H})}\ \mathbb{E}\big[\Psi_{\beta,\delta}\big(Q(S),P(S_J);\,S\setminus S_J\big)\big]$$

is achieved by a nonempty set $J$. In particular, the optimal prior is data dependent.

Lower and upper bounds on the objective (theinf) for prefix sets $J = \{1,\dots,j\}$ are visualized in fig:theorybound. Using a data-dependent prior in this scenario is critical for obtaining a nonvacuous bound. The derivation of these bounds, a proof sketch, and a complete rigorous proof can be found in app:fullproof.

## 4. Data-dependent priors for SGD

As the theoretical results in the previous section demonstrate, data-dependent oracle priors can lead to dramatically tighter bounds. In this section, we take the first steps towards understanding whether data-dependent priors can aid us in the study of stochastic gradient descent (SGD).

In mainexample, we studied a posterior that depended more heavily on some data points than others. This property was introduced intentionally in order to serve as a toy model for SGD. Unlike the toy model, however, we know of no representations of the marginal distribution of the parameters learned by SGD that would allow us to optimize or compute a PAC-Bayes bound with respect to a data-dependent oracle prior. As a result, we are forced to make approximations.

Issues of tractability aside, another obstacle to using a data-dependent oracle prior is its dependence on the unknown data distribution. Ostensibly, this statistical barrier can be surmounted with extra data, although this would not make sense in a standard model-selection or self-bounded learning setup. In these more traditional learning scenarios, one has a training data set and wants to exploit this data set to the maximum extent possible. Using some of this data to estimate or approximate (functionals of) the unknown distribution means that this data is not available to the learning algorithm or the PAC-Bayes bound. Indeed, if our goal is simply to obtain the tightest possible bound on the risk of our classifier, we ought to use most of this extra data to learn a better classifier, leaving out a small fraction to get a tight Hoeffding-style estimate of our risk.

However, if our goal is to understand the generalization properties of some posterior kernel $Q$ (and, indirectly, an algorithm like SGD), we do not simply want a tight estimate of risk. Indeed, a held-out test set bound is useless for understanding why a classifier generalizes, as it merely certifies that a learned classifier generalizes. If a classifier generalizes due to favorable properties of the data distribution, then we must necessarily capture these properties in our bound. These properties may be natural side products of the learning algorithm (such as weight norms) or functionals of the unknown distribution that we must estimate (such as data-dependent oracle priors or functionals thereof). In this case, it makes sense to exploit held-out data to gain insight.

We begin by optimizing a prior over a restricted family. In particular, we consider Gaussian priors when the posterior kernel chooses Gaussian posteriors. Based on empirical findings on SGD in the literature, we propose an approximation to the data-dependent oracle prior.

### 4.1. Optimal isotropic Gaussian priors

Let $U$ be a random element representing a source of randomness. Our focus here is on kernels $Q(S,U)$ that produce a multivariate normal centered at the weights $w_S$ learned by SGD (using randomness $U$, which we may assume without loss of generality encodes both the random initialization and the sequence of minibatches) on the full data set, $S$. Such posteriors underlie several recent approaches to obtaining PAC-Bayes bounds for SGD. In these bounds, the covariance matrix is chosen to be diagonal and the scales are chosen to allow one to derive a bound on the deterministic classifier from the bound on the randomized classifier. For example, neyshabur2017pac derive deterministic classifier bounds from a PAC-Bayes bound based on (an estimate of) the Lipschitz constant of the network.

Fix some nonnegative integer $m$ and let $\alpha = m/n$. Let $S_\alpha$ denote the size-$m$ subset of $S$ corresponding to the first $m$ indices processed by SGD. (Note that these indices are encoded in $U$.) Writing $\mathbb{E}_\alpha$ for the conditional expectation operator given $S_\alpha$ and $U$, oracleprior implies that the tightest (linear PAC-Bayes) bound in expectation is obtained by minimizing $\mathbb{E}_\alpha[\mathrm{KL}(Q(S,U)\|P)]$ in terms of $P$, which yields the data-dependent oracle prior $\mathbb{E}_\alpha[Q(S,U)]$. (We are permitted to condition on $U$ because $U$ is independent from $S$.)

As this prior is assumed to be intractable and the data distribution is unknown, we consider choosing the prior from the family of multivariate Gaussians. Specifically, consider the problem of finding an isotropic Gaussian prior, with mean $w_\alpha$, that minimizes the expected KL term. (We will revisit this simplification in sec:oraclevar, where we consider priors and posteriors with non-isotropic diagonal covariance matrices.) For a fixed variance, the problem reduces to

$$\mathop{\mathrm{arg\,min}}_{w_\alpha}\ \mathbb{E}_{S_\alpha,U}\big[\lVert w_S - w_\alpha \rVert^2\big]. \tag{3}$$

It follows that the (Gaussian) oracle prior mean is the conditional expectation $\mathbb{E}_{S_\alpha,U}[w_S]$ of the weights learned by SGD. Under this choice, the contribution of the mean component to the bound is the trace of the conditional covariance of $w_S$ given $S_\alpha$ and $U$. For the remainder of the section, we focus on the problem of approximating the oracle prior mean. The optimal choice of the prior variance depends on the distribution of $w_S$. One approach, which assumes that we build separate bounds for different variances that we combine via a union bound argument, is outlined in app:approx.

### 4.2. Ghost samples

In the setting above, the optimal Gaussian prior mean is given by the conditional expectation $\mathbb{E}_{S_\alpha,U}[w_S]$. Although the data distribution is presumed to be unknown, there is a natural statistical estimate for this quantity. Namely, consider a ghost sample, $S_G$, independent from $S$ and equal in distribution to $S$. Let $S'$ be the data set obtained by combining the prefix $S_\alpha$ of $S$ with a matching fraction of $S_G$. (We can do so by matching the position of $S_\alpha$ within $S'$ and within $S$.) Note that $S'$ is also equal in distribution to $S$. We may then take the prior mean $w_\alpha$ to be $w_{S'}$, i.e., the weights produced by SGD on the data set $S'$ using the randomness $U$.

By design, SGD acting on $S'$ and randomness $U$ will process $S_\alpha$ first and then start processing the data from the ghost sample. Crucially, the initial $\alpha$-fraction of the first epoch in both runs will be identical. By design, $w_S$ and $w_{S'}$ are equal in distribution when conditioned on $S_\alpha$ and $U$, and so $w_{S'}$ is an unbiased estimator for $\mathbb{E}_{S_\alpha,U}[w_S]$. (We can minimize the variance of the KL term by producing conditionally i.i.d. copies of $w_{S'}$ and averaging, although each such copy requires an independent ghost sample of size $(1-\alpha)n$.)

### 4.3. Terminology

We call the run of SGD on data $S_\alpha$ the $\alpha$-prefix run. The run of SGD on the full data $S$ is called the base run. A prior is constructed from the $\alpha$-prefix run by centering a Gaussian at the parameters obtained after $t$ steps of optimization. The prefix stopping time $t$ is chosen from a discrete set of values to minimize the distance to the posterior mean. (We account for these data-dependent choices via a union bound, which produces a negligible contribution.) Note that, for $\alpha = 0$, the prior is centered at the random initialization, as it has no access to data; this is equivalent to the approach taken by DR17. When the prior also has access to ghost data, we call the SGD run an $\alpha$-prefix+ghost run, obtaining parameters $w_{S'}$.

The procedure of running the $\alpha$-prefix and base runs together for the first $\alpha$-fraction of a base run epoch using shared information (storing the data order) is an example of a coupling. This coupling is simple and does not attempt to match the base and $\alpha$-prefix runs beyond the first $\alpha n / b$ iterations (where $b$ is the batch size, which we presume divides $\alpha n$ evenly for simplicity). It exploits the fact that the final weights have an outsized dependence on the first few iterations of SGD. More advanced coupling methods can be constructed; such methods might attempt to couple beyond the first $\alpha$-fraction of the first epoch.

As argued above, it is reasonable to use held-out data to probe the implications of a data-dependent prior, as it may give us insight into the generalization properties of the posterior kernel. At the same time, we may be interested in approximations to the data-dependent oracle that do not use a ghost sample. Ordinarily, we would expect two independent runs of SGD, even on the same dataset, to produce potentially quite different weights (measured, e.g., by their L2 distance) NagaKolter19c. fig:correlationplots shows that, when we condition on an initial prefix of data, we dramatically decrease the variability of the learned weights. This experiment shows that we can predict the final weights of SGD on the full data set fairly well using only a fraction of the data set, implying that most of the variability in SGD arises at the beginning of training. Crucially, the two runs are coupled in the same manner as the ghost-sample runs: the first $\alpha$-fraction of the first epoch is identical. When only a fraction of the data is available, SGD treats this data as the entire data set, starting its second epoch immediately.

## 5. Methodology for empirical evaluation

mainexample shows that data-dependent oracle priors can yield tighter generalization bounds than an oracle prior. In this section, we describe the experimental methodology we use to evaluate this phenomenon in neural networks trained by stochastic gradient descent (SGD).

### 5.1. Pseudocode

mainalgs (right) describes the procedure for obtaining a PAC-Bayes risk bound on a network trained by SGD. (mainalgs (right) uses a fixed learning rate and vanilla SGD for simplicity, but the algorithm can be adapted to variants of SGD with different learning rate schedules.) Note that the steps outlined in Lines 1–3 do not change with $\alpha$, and therefore the best $\alpha$ can be chosen efficiently without rerunning the optimization. If ghost data is not used, the prefix+ghost data set is replaced with the prefix data alone.

To avoid choosing the prior variance in advance, we use a variational KL bound, described in app:varkl, which allows us to optimize it a posteriori for a small penalty. This PAC-Bayes bound on risk is evaluated at the same confidence level in all of our experiments during evaluation and optimization.

### 5.2. Datasets and Architectures.

We use three datasets: MNIST, Fashion-MNIST and CIFAR-10. See app:datasets for more details. The architectures used are described in detail in app:architectures. For the details of the training procedure, see app:hyperparameters.

### 5.3. Stopping criteria

We terminate SGD optimization in the base run once the empirical error (in mainalgs, alg:sgdrun) measured on all of $S$ falls below some desired value, which we refer to as the stopping criterion. We evaluate the results for different stopping criteria.

## 6. Empirical study of SGD-trained networks

### 6.1. Evaluating data-dependent priors.

A PAC-Bayes risk bound trades off the empirical risk and the contribution of the KL term. For isotropic Gaussian priors and posteriors, the mean component of the KL is proportional to the squared distance between the means, normalized by the effective number of training samples not seen by the prior, i.e., $\lVert w_S - w_\alpha\rVert^2 / ((1-\alpha)n)$. This scaled squared L2 distance determines the tightness of the bound when the prior variance, the posterior, and the data are fixed, as the bound grows with it. In this section, we empirically evaluate how the squared distance and its scaled version vary with different values of $\alpha$.

Our goal is to evaluate whether, on standard vision datasets and architectures, a data-dependent oracle prior can be superior to an oracle prior. Since we do not have access to an oracle prior, we approximate it using a ghost sample, as described in sec:coupling. Data-dependent oracle priors are approximated by using a combination of training samples and ghost samples.

Our experimental results on MNIST and Fashion-MNIST appear in fig:l2distanceplots, where we plot the squared distance and its scaled version. The results suggest that the value of $\alpha$ minimizing the scaled distance is data- and architecture-dependent, and it differs between the MNIST FC, MNIST LeNet-5, and Fashion-MNIST LeNet-5 experiments. We found that batch size affects the optimal $\alpha$, whether on $\alpha$-prefix or ghost data. As one might expect, the best $\alpha$ is larger for smaller batch sizes. We hypothesize that this is due to the increased stochasticity of SGD.

Interestingly, at larger values of $\alpha$, we observe that the gap between the prefix-only and prefix+ghost distances closes. This happens in all three experimental setups: the prior mean obtained with training data alone becomes as close to the final SGD weights as the prior mean obtained with the help of ghost data.

### 6.2. Generalization bounds for SGD-trained networks.

We apply data-dependent priors to obtain tighter PAC-Bayes risk bounds for SGD-trained networks. We do not use ghost data in these experiments, as oracle priors are inaccessible in practice. Thus the prior mean is obtained by the $\alpha$-prefix run on prefix data alone. See mainalgs (right) and sec:method for the details of the experiment.

From the data in fig:bound, it is apparent that $\alpha$ has a significant impact on the size of the bound. In all three networks tested, the best results are achieved for a nonzero $\alpha$.

One of the clearest relationships to emerge from the data is the dependence of the bound on the stopping criterion: the smaller the error at which the base run was terminated, the looser the bound. This suggests that the extra optimization introduces variability into the weights that we are not able to predict well.

### 6.3. Optimal prior variance

We use oracle bounds to quantify limits on how much tighter these generalization bounds could be, were we able to optimize a diagonal prior variance.

Our data-dependent priors do not attempt to minimize the variance component of the KL bound. For a fixed posterior, the variance component in klgaussians (see app:approx) increases as the posterior variance deviates from the prior variance. When the prior is isotropic, our empirical study shows that the optimized posterior variance is also close to isotropic. However, an isotropic structure may not describe the local minima found by SGD well. We are thus also interested in a hypothetical experiment, where we allow the prior variance to be optimal for any given diagonal Gaussian posterior. While this produces an invalid bound, it reveals the contribution to the risk bound due to the prior variance. Optimizing klgaussians w.r.t. the diagonal prior covariance yields a prior with optimal variance, and the expression reduces to

$$\frac{1}{2}\sum_{i=1}^{p}\log\!\Big(1 + (w_i - w_i^\alpha)^2/\sigma_i^2\Big), \tag{4}$$

where $\sigma_i^2$ is the $i$-th entry of the diagonal of the posterior covariance.
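Equation (4) is straightforward to evaluate; the sketch below is our own helper, with hypothetical argument names. It computes the KL contribution under the optimal diagonal prior variance, which per coordinate is obtained (one can check) by setting the prior variance to $\sigma_i^2 + (w_i - w_i^\alpha)^2$.

```python
import numpy as np

def oracle_variance_kl_term(w, w_alpha, sigma2):
    """Eq. (4): KL contribution when each diagonal prior variance is set
    optimally, lambda_i = sigma_i^2 + (w_i - w_i^alpha)^2."""
    w, w_alpha, sigma2 = map(np.asarray, (w, w_alpha, sigma2))
    # log1p for numerical stability when the squared gap is small
    return float(0.5 * np.sum(np.log1p((w - w_alpha) ** 2 / sigma2)))
```

A quick way to validate the formula is to minimize the per-coordinate Gaussian KL over the prior variance numerically and compare against (4).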

Computing the bounds with the optimal-variance prior requires some minor modifications to mainalgs (right). As in mainalgs (right), the posterior is set to a Gaussian centered at $w_S$, with a diagonal covariance matrix initialized to the prior variance. The prior is centered at $w_\alpha$, and its variance is automatically determined by the posterior variance. The KL then takes the form stated in eq:oraclevariance. The $\alpha$-prefix run in mainalgs (right) is followed by another SGD run minimizing the bound with respect to a diagonal covariance.

We present the results in fig:bound, second row. At small values of $\alpha$, the optimal prior variance decreases the bound substantially. However, at larger values of $\alpha$, the effect diminishes. In particular, at the values of $\alpha$ that produce the lowest risk bound with a fixed isotropic prior variance, the optimal prior variance makes little to no improvement. Interestingly, the optimized posterior variance remains close to isotropic. Overall, the results suggest that a diagonal prior offers little advantage over an isotropic prior.

## 7. Direct risk bound minimization.

One of the dominant approaches to training Gaussian neural networks is to minimize the evidence lower bound (ELBO), which essentially takes the same form as linearpbound, but with a different relative weight on the KL term. Here, we optimize a PAC-Bayes bound using our data-dependent prior methodology, which can be related to empirical Bayes approaches. The details of the algorithm are outlined in mainalgs (left), where the objective is a PAC-Bayes bound computed with a differentiable surrogate loss. We perform experiments on 3 different datasets and architectures (see app:drbm for further details).

fig:directboundopt presents the error of the posterior (dashed line) optimized using mainalgs with different values of $\alpha$. It is apparent from the figure that, for all the networks and datasets tested, the error of the posterior drops dramatically as $\alpha$ increases. Note that with the optimal $\alpha$, the posterior achieves very high performance even compared to state-of-the-art networks, and at the same time comes with a valid guarantee on error. For example, the error achieved by ResNet20 (without data augmentation and weight decay) trained on CIFAR-10 is comparable to the average error of the best-performing posterior in fig:directboundopt, which additionally comes with a bound that holds with probability 0.95.

### Acknowledgments

The authors would like to thank Mufan Li, Jeffrey Negrea, Alexandre Drouin, and Blair Bilodeau for feedback on drafts. DMR was supported, in part, by an NSERC Discovery Grant, Ontario Early Researcher Award, and a stipend provided by the Charles Simonyi Endowment. This research was carried out in part during the Foundations of Deep Learning program at the Simons Institute for the Theory of Computing, and during the Special Year on Optimization, Statistics, and Theoretical Machine Learning at the Institute of Advanced Studies.

## Appendix A KL divergence

This material is adapted from DR17.

Let $Q$ and $P$ be probability measures defined on a common measurable space. When $Q$ is absolutely continuous with respect to $P$, written $Q \ll P$, we write $\frac{\mathrm{d}Q}{\mathrm{d}P}$ for some Radon–Nikodym derivative (a.k.a. density) of $Q$ with respect to $P$. The Kullback–Leibler (KL) divergence from $Q$ to $P$ is $\mathrm{KL}(Q\|P) = \int \log\frac{\mathrm{d}Q}{\mathrm{d}P}\,\mathrm{d}Q$ if $Q \ll P$, and $\infty$ otherwise.

Assume and admit densities and , respectively, w.r.t. some sigma-finite measure . In this case, the KL divergence satisfies

$$\mathrm{KL}(Q\,\|\,P)=\int \log\frac{q(x)}{p(x)}\,q(x)\,\nu(dx).$$
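For intuition, taking ν to be the counting measure reduces this to the familiar discrete sum; the following sketch (function name ours) computes it, returning +∞ when absolute continuity fails:

```python
import math

def kl_divergence(q, p):
    """KL(Q||P) for discrete distributions given as probability vectors,
    i.e. the density form above with nu taken to be counting measure."""
    total = 0.0
    for qi, pi in zip(q, p):
        if qi == 0.0:
            continue  # 0 * log(0/p) = 0 by convention
        if pi == 0.0:
            return math.inf  # Q is not absolutely continuous w.r.t. P
        total += qi * math.log(qi / pi)
    return total

kl_divergence([0.5, 0.5], [0.9, 0.1])  # about 0.51 nats
```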

## Appendix B Proof of mainthm

The proof follows essentially from definitions. Note that we do not require to have any particular distribution: e.g., could be uniformly distributed among subsets of cardinality , or could be a.s. nonrandom and equal to . Note that the statement that a.s. implies that a.s. and that is independent of , both marginally and conditionally on . Informally, any randomness in plays no role in the determination of . Let .

Consider the linear PAC-Bayes bound based on . It is optimized, in expectation, by choosing an oracle prior:

In contrast, the linear PAC-Bayes bound based on is optimized, in expectation, by choosing a data-dependent oracle prior:

It follows that the data-dependent risk bound is tighter, in expectation, when

$$(1-\beta)\,\mathbb{E}[L_S(Q(S))]+\frac{I(\hat{h};S)+\log\frac{1}{\delta}}{2n} \;>\; (1-\beta)\,\mathbb{E}[L_{S_{\bar{J}}}(Q(S))]+\frac{I(\hat{h};S\mid S_J)+\log\frac{1}{\delta}}{2(1-\alpha)n}.$$

Equivalently,

$$\frac{I(\hat{h};S)+\log\frac{1}{\delta}}{2n}-\frac{I(\hat{h};S\mid S_J)+\log\frac{1}{\delta}}{2(1-\alpha)n} \;>\; (1-\beta)\,\mathbb{E}\bigl[L_{S_{\bar{J}}}(Q(S))-L_S(Q(S))\bigr].$$

Rewriting the left-hand side,

Therefore, we prefer a data-dependent prior based on when

$$\frac{I(\hat{h};S)}{n}-\frac{I(\hat{h};S\mid S_J)}{(1-\alpha)n} \;>\; 2(1-\beta)\,\mathbb{E}\bigl[L_{S_{\bar{J}}}(Q(S))-L_S(Q(S))\bigr]+\frac{\log\frac{1}{\delta}}{n}\cdot\frac{\alpha}{1-\alpha}.$$

The result follows by the definition of the information rate gain and excess bias.
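The preference condition above is easy to evaluate for hypothetical values of the information rates and the excess bias; the following sketch (all names and numeric values are ours, for illustration only) checks whether the information-rate gain outweighs the penalties of dedicating a fraction α of the data to the prior:

```python
import math

def prefers_data_dependent_prior(i_full, i_cond, excess_bias, n, alpha, beta, delta):
    """Checks the displayed condition: the information-rate gain on the
    left must exceed twice the (1-beta)-weighted excess bias plus a
    log(1/delta) penalty that grows with the prior's data fraction alpha."""
    lhs = i_full / n - i_cond / ((1.0 - alpha) * n)
    rhs = (2.0 * (1.0 - beta) * excess_bias
           + (math.log(1.0 / delta) / n) * (alpha / (1.0 - alpha)))
    return lhs > rhs

# Giving 10% of the data to the prior pays off when conditioning on it
# removes most of the mutual information between hypothesis and sample...
wins = prefers_data_dependent_prior(500.0, 100.0, 0.01, 10000, 0.1, 0.5, 0.05)
# ...but not when the conditional information stays nearly as large.
loses = prefers_data_dependent_prior(500.0, 480.0, 0.01, 10000, 0.1, 0.5, 0.05)
```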

## Appendix C Proof of keyclaim

We begin with a proof sketch.

###### Proof sketch.

With and fixed, the minimization over meets the hypotheses of oracleprior and so we may simplify the objective by taking . The KL term then becomes a conditional mutual information . Due to linearity of expectation, we may then optimize explicitly, leaving only a minimization over subsets ,

where and .

One can show that , where is the variance contribution from and . Using sub-Gaussian and sub-exponential tail bounds, one can establish that , where is due to variance in , , and .

Choosing , , , , , and , we obtain , while . Our upper bound is achieved by , i.e., by using the initial 24 data points to obtain a data-dependent (oracle) prior. ∎

### C.1. Complete proof and bounds on the objective

We now provide a complete rigorous proof. For subsets (of ), let ; let for ; let ; and let .

For every subset and , oracleprior and linearity of expectation imply that the optimal prior is , and so we can simplify theinf by choosing this prior. In particular, now .

Define and . By linearity of expectation, we can remove the infimum over by explicit minimization. As a result, we see that theinf is equivalent to

$$\inf_{J\subseteq[n]}\;\Phi(J), \qquad \text{where } \Phi(J)=R(J)+C(J)+\sqrt{2R(J)C(J)+C(J)^2}.$$
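The minimization over subsets J can be carried out by evaluating Φ(J) = R(J) + C(J) + √(2R(J)C(J) + C(J)²) on each candidate; a minimal sketch (helper names and the example values are ours):

```python
import math

def phi(r_j, c_j):
    """Phi(J) = R(J) + C(J) + sqrt(2*R(J)*C(J) + C(J)**2)."""
    return r_j + c_j + math.sqrt(2.0 * r_j * c_j + c_j ** 2)

def best_subset(candidates):
    """candidates: dict mapping a subset label to its (R(J), C(J)) pair.
    Returns the label minimizing Phi."""
    return min(candidates, key=lambda j: phi(*candidates[j]))

# A data-dependent prior can accept a slightly larger risk term R(J)
# in exchange for a much smaller complexity term C(J).
cands = {"empty": (0.05, 0.9), "initial_segment": (0.08, 0.1)}
```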

Pick some . Then the optimal prior conditioned on is

$$P_J(S_J)=\mathbb{E}\bigl[Q(S)\mid S_J\bigr]=\delta_{\eta\mathbf{1}_{[n]}}\otimes u^{\otimes N_J},$$

where . Let for . Then

$$\mathrm{KL}\bigl(Q(S)\,\big\|\,P_J(S_J)\bigr)=\frac{D\,\psi(\kappa/\phi_{\bar{J}})}{2}+\frac{1}{2\phi_{\bar{J}}}\sum_{j=1}^{D}\Bigl(\sum_{i\notin J}\eta_i y_i x_{i,2,j}\Bigr)^{2}.$$

Taking expectations,

It remains to control the empirical risk term. To that end, pick and let . Then

where

$$y_i\langle w_{n,2},\,x_{i,2}\rangle=\eta_i\|x_{i,2}\|^2+\sum_{j\neq i}\eta_j y_i y_j\langle x_{j,2},\,x_{i,2}\rangle.$$

Rearranging and exploiting the chain rule of conditional expectation and symmetry of the normal distribution,

$$\mathbb{E}\,\ell(W,z_i)=\mathbb{E}\,\mathbb{P}_{x_{i,2}}\Bigl[\,\textstyle\sum_{j\neq i}\langle\eta_j x_{j,2},\,x_{i,2}\rangle+\langle\xi,\,x_{i,2}\rangle\;\geq\;\tau+\eta_i\|x_{i,2}\|^2\Bigr],$$

where the conditional probability is a tail bound on a univariate Gaussian with mean zero and variance .

Applying the standard (sub-)Gaussian tail bound,

$$\mathbb{E}\,\ell(W,z_i)\;\leq\;\mathbb{E}\exp\Bigl\{-\frac{1}{2}\,\frac{(\tau+\eta_i\|x_{i,2}\|^2)^2}{\|x_{i,2}\|^2\,\phi_{-i}}\Bigr\}\;\leq\;\mathbb{E}\exp\Bigl\{-\frac{\tau^2}{2\|x_{i,2}\|^2\,\phi_{-i}}\Bigr\},$$

where the last inequality is crude, but suffices for our application. Note that is a chi-squared random variable with degrees of freedom, hence sub-exponential. Indeed, with probability at least ,

$$D\,\|x_{i,2}\|^2/\sigma^2\;\leq\;D+2\sqrt{D\log(1/c)}+2\log(1/c).$$

Rearranging,

where the second inequality holds assuming , which we will ensure from this point on. So

$$\mathbb{E}\,\ell(W,z_i)\;\leq\;\inf_{c\geq e^{-D}}\Bigl\{\,c+(1-c)\exp\bigl\{-\tau^2/(2\phi_{-i}B(c))\bigr\}\Bigr\}.$$

Taking , we have . Then, using ,

$$\mathbb{E}\,\ell(W,z_i)\;\leq\;\exp\{-D/16\}+\exp\bigl\{-\tau^2/(4\phi_{[n]}\sigma^2)\bigr\}\;=:\;\bar{R}.$$

We may now obtain a bound

$$R(J)=\mathbb{E}\,L_{S_{\bar{J}}}(Q(S))=\frac{1}{n-|J|}\sum_{i\notin J}\mathbb{E}\,\ell(W,z_i)\;\leq\;\max_{i\notin J}\mathbb{E}\,\ell(W,z_i)\;\leq\;\bar{R}.$$

Thus

$$\Phi(J)\;\leq\;\bar{R}+C(J)+\sqrt{2\bar{R}\,C(J)+C(J)^2}.$$

At the same time, we have for all . (Note that these two bounds are used to produce fig:theorybound.)

In particular, noting ,

$$\Phi(\emptyset)\;\geq\;\frac{D}{m}\,\ln\frac{\sigma^2\eta_{[n]}^2/D+\kappa}{\kappa}.$$

The result can be seen to follow from these bounds by evaluation using the particular values. In particular, one can see that taking to be a nonempty initial segment of , we have .
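The sub-exponential chi-squared tail bound invoked above (with probability at least 1−c, a χ²_D variable is at most D + 2√(D log(1/c)) + 2 log(1/c)) can be sanity-checked by simulation; a minimal sketch, with function name and sample sizes ours:

```python
import math
import random

def chi2_tail_violation_rate(D, c, trials=5000, seed=0):
    """Empirical frequency with which a chi-squared(D) sample exceeds the
    sub-exponential threshold D + 2*sqrt(D*log(1/c)) + 2*log(1/c).
    The tail bound guarantees this frequency is at most about c."""
    rng = random.Random(seed)
    threshold = D + 2.0 * math.sqrt(D * math.log(1.0 / c)) + 2.0 * math.log(1.0 / c)
    exceed = sum(
        sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(D)) > threshold
        for _ in range(trials)
    )
    return exceed / trials
```

In practice the empirical violation rate sits well below c, since this tail inequality is conservative.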

## Appendix D Analytic form of the KL for an approximate data-dependent oracle bound

In this section, we explore one possible analytic bound for a KL term for a PAC-Bayes bound, based on the setup in sec:sgdddpriors. We assume and are nonrandom. In an application, one would have to cover a set of possible values to handle the random case.

The KL divergence between Gaussians and takes the form

$$2\,\mathrm{KL}\bigl(Q(U,S)\,\big\|\,P\bigr)=\underbrace{\|w_S-w_\alpha\|^2_{\Sigma_\alpha^{-1}}}_{\text{mean component}}+\underbrace{\operatorname{tr}\bigl(\Sigma_\alpha^{-1}\Sigma\bigr)-p+\ln\frac{\det\Sigma_\alpha}{\det\Sigma}}_{\text{variance component}}.$$

Specializing to an isotropic prior, i.e., , we obtain

$$2\,\mathrm{KL}\bigl(Q(U,S)\,\big\|\,P\bigr)=\underbrace{\frac{1}{\sigma_P}\|w_S-w_\alpha\|^2}_{\text{mean component}}+\underbrace{\frac{1}{\sigma_P}\operatorname{tr}(\Sigma)-p+p\ln\sigma_P-\ln\det\Sigma}_{\text{variance component}}.$$
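Assuming a diagonal posterior covariance Σ = diag(σ₁, …, σ_p), the isotropic-prior expression above can be computed directly; a minimal sketch (function and argument names are ours, and the σ's denote variances, not standard deviations):

```python
import math

def gaussian_kl_isotropic_prior(w_s, w_alpha, sigma_diag, sigma_p):
    """KL( N(w_s, diag(sigma_diag)) || N(w_alpha, sigma_p * I) ):
    half the mean component plus half the variance component from the
    display above, specialized to a diagonal posterior covariance."""
    p = len(w_s)
    mean_component = sum((a - b) ** 2 for a, b in zip(w_s, w_alpha)) / sigma_p
    variance_component = (sum(sigma_diag) / sigma_p - p
                          + p * math.log(sigma_p)
                          - sum(math.log(s) for s in sigma_diag))
    return 0.5 * (mean_component + variance_component)
```

As expected, the divergence vanishes exactly when the means agree and every posterior variance equals the prior variance.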

Note that

$$\operatorname{tr}\bigl(\operatorname{cov}_{S_\alpha,U}(w_S)\bigr)=\inf_{w_\alpha}\;\mathbb{E}_{S_\alpha,U}\bigl[\|w_S-w_\alpha\|^2\bigr].$$

Consider

 σP=1p(tr(covSα,U(wS))+tr(