1 Introduction
While deep learning can provide remarkable empirical performance in many machine learning tasks, a theoretical account of this success remains elusive. One of the main metrics of performance in machine learning is generalization, or the ability to predict correctly on new unseen data. Statistical learning theory offers a theoretical framework for understanding generalization, with the main tool being the derivation of upper bounds on the generalization error. However, it is now widely acknowledged that “classical” approaches in learning theory, based on worst-case analyses, are not sufficient to explain the generalization of deep learning models, see e.g.
bartlett1998sample; zhang2016understanding. This observation has spurred a large amount of work on applying and developing other learning theory approaches to deep learning, some of which are showing promising results. However, current approaches are still lacking in many ways: generalization error bounds are often vacuous, or offer little explanatory power by showing trends opposite to the true error as one varies important quantities. For example, nagarajan2019uniform show that the predictions of some common spectral-norm-based bounds increase with increasing training set size, while the measured error decreases. Moreover, many bounds are published without being sufficiently tested, so that it can be hard to assess their strengths and weaknesses. There is therefore a need for a more systematic approach to studying generalization bounds. Two important recent studies have taken up this challenge. Firstly, jiang2019fantastic performed extensive empirical tests of the predictive power of a large number of different generalization measures for a range of commonly used hyperparameters. They mainly studied flatness- and norm-based measures, and compared them by reporting an average of the performance. In their comparisons, they focused mostly on training hyperparameters. Architectures were only compared by changing the width and depth of an architecture resembling the Network-in-Network
(gao2011robustness), and they did not systematically study variation in the data (dataset complexity or training set size). Secondly, in a follow-up study, dziugaite2020search argue that average performance is not a sufficient metric of generalization performance. They empirically demonstrate that generalization measures can exhibit correct correlations for certain parameter changes and seeds, but badly fail for others. In addition to some of the hyperparameter changes studied by jiang2019fantastic, they also consider the effect of changing dataset complexity and training set size.
In this paper we also perform a systematic analysis of generalization bounds. We first describe, in section 2, a set of seven qualitatively different desiderata which, we argue, should be used to guide the kind of experiments (here, ‘experiment’ can be interpreted in the formal sense which dziugaite2020search introduce) needed to fully explore the quality of the predictions of a theory of generalization. These include four desiderata for making correct predictions when varying architecture, dataset complexity, dataset size, or optimizer choice, as well as three further desiderata, which describe the importance of quantitative agreement, computational efficiency and theoretical rigour. Our set of desiderata is broader than those used in previous tests of generalization theories, and will, we hope, lead to a better understanding of the strengths and weaknesses of competing approaches.
Rather than empirically testing the many bounds that can be found in the literature, we take a more general theoretical approach here. We first, in section 3, present an overall framework for classifying the many different approaches to deriving frequentist bounds on the generalization error. This framework is then used to systematically organize the discussion of the different approaches in the literature. For each approach we review existing results in the literature and compare performance against the seven desiderata we propose. Such an analysis allows us to draw some more general conclusions about the kinds of strategies for deriving generalization bounds that may be most successful. In this context we also note that for some proposed bounds, there is not enough empirical evidence in the literature to determine whether or not they satisfy our desiderata. This lacuna highlights the importance of the proposals put forward by
jiang2019fantastic and dziugaite2020search of large-scale empirical studies, which we also advocate for in this paper. Inspired by our analysis of existing bounds, we present in section 5 a high-probability version of the realizable PAC-Bayes bound introduced by mcallester1998some and applied to deep learning in valle2018deep. Because our bound, which is derived from a function-based picture, is directly proportional to the Bayesian evidence or marginal likelihood of the model, we call it the marginal-likelihood PAC-Bayes bound. While our bound has frequentist origins, the connection between marginal likelihood and generalization is an important theme in Bayesian analyses, and can be traced back to the work of MacKay and Neal (mackay1992practical; neal1994priors; mackay2003information), as well as the early work on PAC-Bayes (shawe1997pac; mcallester1998some). See also an important recent discussion of this connection in wilson2020bayesian. Recent large-scale empirical work has shown that learning curves for deep learning have an extensive power-law regime (hestness2017deep; spigler2019asymptotic; kaplan2020scaling; henighan2020scaling). The exponents in the power law depend mainly on the data rather than on the architecture, with smaller exponents for more complex datasets. Under the assumption of power-law asymptotic behaviour with training set size, we are able to prove that our marginal-likelihood PAC-Bayes bound is asymptotically optimal up to a constant in the limit of large training sets.
In order to test our marginal-likelihood PAC-Bayes bound against the full set of desiderata, we perform over 1,000 experiments comparing 19 different architectures, ranging from simple fully connected networks (FCNs) and convolutional neural networks (CNNs) to more sophisticated architectures such as ResNet50, DenseNet121 and MobileNetv2. We thus include variations in architecture hyperparameters such as pooling type, the number of layers, skip connections, etc. These networks are tested on 5 different computer vision datasets of varying complexity. For each of these 95 architecture/dataset combinations, we train on a range of training set sizes to study learning curves. We additionally study the effect of label corruption on two of the datasets, MNIST and CIFAR10. Our bound provides non-vacuous generalization bounds, which are also tight enough to provide a reasonably good quantitative guide to generalization. It also does remarkably well at predicting qualitative trends in the generalization error as a function of data complexity, architecture (including, for example, differences between average and max pooling), and training set size. In particular, we find that we can estimate the value of the learning curve power-law exponent for different datasets.
In our concluding section, we argue that the use of a function-based perspective is key to understanding the relatively good performance of our marginal-likelihood bound across so many desiderata. This approach contrasts with many other PAC-Bayesian bounds in the deep learning theory literature, which use distributions in parameter space, mostly as a way to measure flatness. We also discuss potential weaknesses of our approach, such as its inability to capture the effect of DNN width or of optimizer choice (but see mingard2020sgd). Finally, we argue that capturing trends in generalization when architecture and data are varied is not only theoretically interesting, but also important for many applications, including architecture search and data augmentation.
2 Desiderata for predictive theories of generalization error
The fundamental question of what constitutes a “good” theory of generalization in deep learning is a profound and contested one. jiang2019fantastic and dziugaite2020search both consider this question and suggest that a good theory should capture the causal mechanisms behind generalization. jiang2019fantastic use a measure of conditional independence to estimate the causal relation between different complexity measures and generalization. dziugaite2020search consider a stronger notion that tries to capture whether a theory predicts the generalization error well over all possible experimental interventions. To formalize this notion they look at distributional robustness, which quantifies the worst-case predictive performance of a theory.
Here we take a slightly different approach to this question by defining seven key desiderata that a “good” theory of generalization should satisfy. This set is broader than those considered by jiang2019fantastic and dziugaite2020search in that it considers desiderata beyond predictive performance alone.
Furthermore, rather than focusing only on the average-case or worst-case performance of a predictive theory, we argue that a precise formulation of the performance of a generalization measure necessarily depends on the application. Therefore, in our discussions in section 4, and in the extensive experiments in section 7, we aim to paint a fine-grained picture of how well the theory predicts generalization for different experimental settings, and how it fares on the other important desiderata.
We focus our attention on generalization error upper bounds, but the same ideas could apply to other types of theories that aim to predict generalization error (for example, those based on Bayesian assumptions on the data distribution).
We propose that a good predictive theory of generalization for deep learning should aim to satisfy the seven desiderata listed below:
 D.1

data complexity: The predicted error should correctly scale with data complexity. In other words, the predicted error should correlate well with the true error when the dataset is changed. For example, a fixed DNN, for a fixed training set size, will typically have higher error for CIFAR10 than for MNIST, and an even higher error for a label-corrupted dataset. The bound should capture such differences.
 D.2

training set size: The predicted error should correctly scale with training set size. That is, the predicted error should follow the same trend as the true error as the number of training examples increases. For example, it has been found empirically that the generalization error often follows a power law decay with training set size with an exponent which depends on the dataset, but not strongly on the architecture (hestness2017deep; novak2019neural; rosenfeld2019constructive; kaplan2020scaling).
 D.3

architectures: The predicted error should capture differences in generalization between architectures. Different architectures can display significant variation in generalization performance. For example, CNNs with pooling tend to generalize better than CNNs without pooling on image classification tasks. The predicted error should aim to predict these differences. Furthermore, one of the more puzzling properties of DNNs is that their performance depends only weakly on the number of parameters used for an architecture, provided the system is large enough (belkin2019reconciling; nakkiran2019deep). Since the question of why DNNs generalize so well in the overparameterized regime is one of the central questions in the theory of DNNs, it is particularly important that a predicted error reproduces this relative insensitivity to the number of parameters.
 D.4

optimization algorithms: The predicted error should capture differences in generalization between different optimization algorithms. Different optimization algorithms used in training DNNs, as well as different choices of training hyperparameters such as SGD batch size and learning rates, or regularization techniques, can lead to differences in generalization, which the theory should aim to predict.
 D.5

non-vacuous: The predicted error should be quantitatively close to the true error. For generalization error upper bounds, it is commonly required that the predicted error is less than 1, as any upper bound on a misclassification probability higher than that is satisfied by definition, and thus can be considered “vacuous”. Beyond this requirement, the bound should aim to be as close as possible to the true error (a property often referred to as being a tight bound).
 D.6

efficiently computable: The predicted error should be efficiently computable. This requirement is particularly important if the prediction or bound is to be useful in practical applications (for example, in architecture search). It is of little use having a bound that cannot in practice be calculated.
 D.7

rigorous: The prediction should be rigorous. For the case of upper bounds, this means that the bound comes with a theorem that guarantees its validity under a well-specified set of assumptions, which should be met in the domain where it is applied. In practice, theoretical results are often applied to domains where the assumptions are not known to hold (or are known not to hold), but rigorous guarantees offer the highest level of confidence in one’s predictions, and should be aimed for whenever possible.
A theory of generalization error should aim to satisfy as many of these desiderata as possible. However, the trade-offs that one is willing to make depend on the domain of application of the theory. For example, a rigorous proof may be more valuable for theoretical insight than for a practical application, where computational efficiency may be more important. As another example, for an application to neural architecture search one may care more about the scaling with architecture, while for an application to decision making for data collection one may care more about correct scaling with training set size.
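Desideratum D.2 refers to the empirically observed power-law decay of learning curves. As a minimal illustration (not part of this paper's method), the exponent of an assumed power law $\epsilon(m) \approx a\, m^{-\alpha}$ can be recovered by least squares in log-log space, since $\log \epsilon = \log a - \alpha \log m$; the learning-curve values below are synthetic:

```python
import math

def fit_power_law_exponent(train_sizes, errors):
    """Estimate alpha in eps(m) ~ a * m**(-alpha) by least squares on the
    log-log transformed data: log eps = log a - alpha * log m."""
    xs = [math.log(m) for m in train_sizes]
    ys = [math.log(e) for e in errors]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
        / sum((x - mean_x) ** 2 for x in xs)
    return -slope  # the slope of the log-log fit is -alpha

# Synthetic learning curve with true exponent alpha = 0.5:
sizes = [100, 1_000, 10_000, 100_000]
errs = [0.30 * m ** -0.5 for m in sizes]
alpha = fit_power_law_exponent(sizes, errs)
```

On real learning curves the fit would only be meaningful over the power-law regime, i.e. after discarding the small-$m$ transient.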
3 Classifying generalization bounds for deep learning
In this section, we provide an overview of existing approaches to predict generalization error. After some preliminary sections describing notation and definitions we present a general taxonomy which classifies generalization error upper bounds based on the key assumptions they make.
3.1 Notation for the supervised learning problem
Supervised learning deals with the problem of predicting outputs from inputs, given a set of example input-output pairs. The inputs live in an input domain $\mathcal{X}$, and the outputs belong to an output space $\mathcal{Y}$. We will mostly focus on the binary classification setting where $\mathcal{Y}=\{0,1\}$ (see appendix C for a brief comment on extensions to the multi-class setting for the generalization bounds we study here). We assume that there is a data distribution $\mathcal{D}$ on the set of input-output pairs $\mathcal{X}\times\mathcal{Y}$. The training set $S$ is a sample of $m$ input-output pairs sampled i.i.d. from $\mathcal{D}$, $S=\{(x_i,y_i)\}_{i=1}^{m}$, where $x_i\in\mathcal{X}$ and $y_i\in\mathcal{Y}$. We will refer to the data distribution as noiseless if it can be factored as $\mathcal{D}(x,y)=\mathcal{D}(x)\mathcal{D}(y|x)$ where $\mathcal{D}(y|x)$ is deterministic, that is, for every $x$ it has support on a single value in $\mathcal{Y}$. In this case the function mapping each $x$ to that value is called the target function.
We define a loss function $\ell:\mathcal{Y}\times\mathcal{Y}\to\mathbb{R}$, which measures how “well” a prediction $\hat{y}$ matches an observed output $y$, by assigning to it a score which is high when they don’t match. In supervised learning, we define a hypothesis (sometimes also called a predictor, or classifier, in the context of supervised learning) as a function $h:\mathcal{X}\to\mathcal{Y}$ from inputs to outputs, and the risk of a hypothesis as the expected value of the loss of the predicted outputs on new samples, $R(h)=\mathbb{E}_{(x,y)\sim\mathcal{D}}[\ell(h(x),y)]$. For the case of classification, where $\mathcal{Y}$ is a discrete set, we focus on the 0-1 loss function (also called classification loss) $\ell(\hat{y},y)=\mathbb{1}[\hat{y}\neq y]$, where $\mathbb{1}$ is the indicator function. We define the generalization error $\epsilon(h)$ as the expected risk using this loss, which equals the probability of misclassification.
$$\epsilon(h)=\mathbb{E}_{(x,y)\sim\mathcal{D}}\big[\mathbb{1}[h(x)\neq y]\big]=\mathbb{P}_{(x,y)\sim\mathcal{D}}\big[h(x)\neq y\big] \qquad (1)$$
This is the central quantity which we study here. We also define the training error $\hat{\epsilon}(h)=\frac{1}{m}\sum_{i=1}^{m}\mathbb{1}[h(x_i)\neq y_i]$. For a more general loss function, we also define the empirical risk $\hat{R}(h)=\frac{1}{m}\sum_{i=1}^{m}\ell(h(x_i),y_i)$, the empirical average of the loss function over the training set. To simplify notation, we often simply write $\epsilon$ and $\hat{\epsilon}$ for $\epsilon(h)$ and $\hat{\epsilon}(h)$, respectively.
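These definitions translate directly into code. The sketch below (an illustration on a hypothetical one-dimensional threshold task, not an experiment from this paper) computes the training error as the empirical average of the 0-1 loss, and Monte Carlo estimates the generalization error by drawing fresh samples from the data distribution:

```python
import random

def zero_one_loss(y_hat, y):
    """0-1 loss: 1 if the prediction and label disagree, else 0."""
    return 1 if y_hat != y else 0

def empirical_risk(h, sample):
    """Average 0-1 loss of hypothesis h over a sample of (x, y) pairs;
    on the training set S this is the training error."""
    return sum(zero_one_loss(h(x), y) for x, y in sample) / len(sample)

def estimate_generalization_error(h, draw_pair, n=100_000, seed=0):
    """Monte Carlo estimate of eps(h) = P[h(x) != y] using fresh samples."""
    rng = random.Random(seed)
    fresh = [draw_pair(rng) for _ in range(n)]
    return empirical_risk(h, fresh)

# Toy noiseless distribution: x uniform on [0,1], target y = 1[x > 0.5].
def draw_pair(rng):
    x = rng.random()
    return (x, int(x > 0.5))

def h(x):
    return int(x > 0.6)  # hypothesis with a slightly misplaced threshold

# True error is P[0.5 < x <= 0.6] = 0.1; the estimate should be close.
err = estimate_generalization_error(h, draw_pair)
```

The Monte Carlo estimate is itself an empirical risk, just measured on held-out samples rather than on $S$.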
Finally, we define a learning algorithm $\mathcal{A}$ to be a mapping $S\mapsto h$ from training sets of any size to hypothesis functions. For simplicity, we mainly describe deterministic hypotheses and learning algorithms, but most results should be easily generalizable to stochastic versions. A stochastic learning algorithm maps training sets to probability distributions over hypotheses, while a stochastic hypothesis maps inputs in $\mathcal{X}$ to probability distributions over outputs in $\mathcal{Y}$. For PAC-Bayes we will consider stochastic learning algorithms. The supervised learning problem consists of finding a learning algorithm that produces low generalization error (in expectation, or with high probability), under certain well-defined assumptions (including none) on the data distribution $\mathcal{D}$.
3.2 General PAC learning framework
The modern theory of statistical learning originated with Leslie Valiant’s probably approximately correct (PAC) framework (valiant1984theory), and Vapnik and Chervonenkis’ uniform convergence analysis (vapnik1968uniform; vapnik1974theory; vapnik1995nature). Here we describe a general frequentist formulation of generalization error bounds of the type studied in supervised learning theory. Although the original PAC framework included the condition of computational efficiency of the learning algorithm in its definition, the term is used more widely now (guedj2019primer). In the most general case, the PAC learning framework we present here proves confidence bounds that establish a relationship between observed quantities (derived from the training set $S$) and the unobserved generalization error $\epsilon$, which is the real quantity of interest. Specifically, we consider bounds on $\Delta(\epsilon,\hat{\epsilon})$, a function measuring the difference between generalization and training error (which we refer to as the generalization gap), which state that, under some (or no) assumptions on the data distribution $\mathcal{D}$ and algorithm $\mathcal{A}$, the following bound holds:
$$\Delta(\epsilon,\hat{\epsilon})\leq\left(\frac{C(\mathcal{A},S,\delta)}{m}\right)^{\gamma} \qquad (2)$$
with probability at least $1-\delta$. Here the probability is over the sampling of the training set $S$ from $\mathcal{D}^m$. $C$ is a function which, following common practice in the literature, we will call capacity. The capacity measures some notion of the “complexity” of the algorithm, the data, or both. $\delta$ is called the confidence parameter and measures the probability with which the bound may fail because of getting an “unusual” training set. Finally, $\gamma$ is an exponent, typically $1/2$ or $1$. The most common measure of generalization gap is the absolute difference between the generalization and training error, $\Delta(\epsilon,\hat{\epsilon})=|\epsilon-\hat{\epsilon}|$, but in some general PAC-Bayesian analyses $\Delta$ may be any convex function (rivasplata2020pac). The capacity can also take many forms. Some common examples include the VC dimension (section 4.1.1) and the KL divergence between a posterior and prior for PAC-Bayes bounds (section 4.2.1.1). Sometimes the dependence on the training set size is fully absorbed into $C$, so that the explicit factor of $m$ is omitted.
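As a concrete instance of the template in eq. (2), consider the textbook Hoeffding bound for a single fixed hypothesis (a standard result, included here only to illustrate the roles of $C$, $\gamma$ and $\delta$, not a bound analyzed in this paper): with probability at least $1-\delta$, $|\epsilon-\hat{\epsilon}|\leq\sqrt{\ln(2/\delta)/(2m)}$, i.e. eq. (2) with $\gamma=1/2$ and capacity $C=\ln(2/\delta)/2$, independent of the algorithm and the data:

```python
import math

def hoeffding_gap_bound(m, delta):
    """For a single fixed hypothesis, |eps - eps_hat| <= sqrt(ln(2/delta)/(2m))
    with probability >= 1 - delta over the training sample. This matches the
    general template with gamma = 1/2 and capacity C = ln(2/delta)/2."""
    return math.sqrt(math.log(2 / delta) / (2 * m))

# The gap shrinks as m^{-1/2}: quadrupling m halves the bound.
b1 = hoeffding_gap_bound(1_000, 0.05)
b2 = hoeffding_gap_bound(4_000, 0.05)
```

Note that the $m^{-1/2}$ rate is slower than the $1/m$ rate of the realizable bounds discussed below, which is exactly the role played by the exponent $\gamma$.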
We also distinguish two types of bounds:

generalization gap bounds are bounds on $\Delta(\epsilon,\hat{\epsilon})$ or some other measure of discrepancy between the generalization and training error;

generalization error bounds are bounds on a function of $\epsilon$ alone.

Note that a generalization gap bound immediately implies a generalization error bound, but not necessarily vice versa. If $\hat{\epsilon}=0$ (realizability assumption), the distinction between these bounds disappears.
Finally, in order to simplify notation, we typically omit the dependence of the capacity $C$ on the confidence parameter $\delta$ throughout the paper.
3.3 A classification of generalization bounds
In this section we classify the main types of bounds which are possible under the general PAC framework described above, according to the different assumptions they make on the data distribution or algorithm, and on the different quantities the capacity may depend on.
3.3.1 According to assumptions on the data
The PAC framework is characterized by bounds which make as few assumptions about the data distribution as possible. There are two main approaches: agnostic and realizable bounds.

Agnostic bounds. For agnostic or distribution-free bounds, no assumption is made on the data distribution. In this case, the No Free Lunch theorem (wolpert1994relationship) implies that we cannot guarantee a small generalization error, but may still be able to guarantee a small difference between the training and generalization error (generalization gap), or a small generalization error for some training sets (for data-dependent bounds).

Realizable bounds. For realizable bounds, the data distribution $\mathcal{D}$ and the algorithm $\mathcal{A}$ are assumed to be such that $\hat{\epsilon}=0$ for any training set $S$ which has nonzero probability. This says that the algorithm is always able to fit the data perfectly. Note that this is a combined assumption about the algorithm (for instance the expressivity of its hypothesis class) and the data distribution. If the algorithm is fully expressive (able to express any function), and minimizes the training error (it belongs to the class of empirical risk minimization (ERM) algorithms), then this condition doesn’t put any constraint on $\mathcal{D}$ beyond being noiseless.
For the realizable case, bounds usually take the form
$$\epsilon\leq\frac{C(\mathcal{A},S,\delta)}{m} \qquad (3)$$
with probability at least $1-\delta$, where we have omitted (without loss of generality) the exponent $\gamma$, because realizable bounds have $\gamma=1$ in most cases. (The reason for this exponent comes fundamentally from the non-Gaussian behaviour of the tail of the binomial distribution when the mean is close to zero (langford2005tutorial).)
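A standard concrete instance of eq. (3) is the classic realizable bound for a finite hypothesis class (a textbook result, shown here only to illustrate the $1/m$ rate, not a bound from this paper): with probability at least $1-\delta$, every $h$ in a finite class $\mathcal{H}$ with zero training error satisfies $\epsilon(h)\leq(\ln|\mathcal{H}|+\ln(1/\delta))/m$:

```python
import math

def realizable_finite_class_bound(class_size, m, delta):
    """Classic realizable bound for a finite hypothesis class: with
    probability >= 1 - delta, any h in H with zero training error has
    eps(h) <= (ln|H| + ln(1/delta)) / m, i.e. eq. (3) with gamma = 1
    and capacity C = ln|H| + ln(1/delta)."""
    return (math.log(class_size) + math.log(1 / delta)) / m

# Doubling m halves the bound (gamma = 1), in contrast to the m^{-1/2}
# rate of agnostic Hoeffding-type bounds.
b1 = realizable_finite_class_bound(2 ** 20, 10_000, 0.05)
b2 = realizable_finite_class_bound(2 ** 20, 20_000, 0.05)
```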
Statistical learning theory has focused on these two classes of assumptions, because they are considered to be minimal, and cover most cases of interest. However, other assumptions on are sometimes used in more advanced analyses, typically involving datadependent and nonuniform bounds (see below).
In this context it is interesting to compare a frequentist with a Bayesian approach to assumptions on the data. In the former case, one only assumes that $\mathcal{D}$ belongs to a restricted set of possible data distributions. This approach naturally leads to studying the worst-case generalization over that set. In a Bayesian approach, one assumes a prior over data distributions, and then studies the typical or average-case generalization. Most of the bounds we describe here use the frequentist approach, but there have been some interesting results using a Bayesian prior over data distributions; see section 4.5.
3.3.2 According to assumptions on the algorithm
As is the case for $\mathcal{D}$, supervised learning theory often makes very minimal assumptions on the learning algorithm $\mathcal{A}$, though there is also a rich set of algorithm-dependent analyses. The minimal assumption made on $\mathcal{A}$ is that it outputs hypotheses within a set of hypotheses called the hypothesis class $\mathcal{H}$. As this is the minimal assumption often made, we will refer to it as “algorithm-independent” to distinguish it from approaches which make stronger assumptions. We thus classify bounds into the following two classes.

Algorithm-independent bounds. These bounds only assume that $\mathcal{A}(S)\in\mathcal{H}$ for all $S$ (or for all $S$ within the support of a restricted set of data distributions, as in nagarajan2019uniform). Furthermore, the capacity can only depend on $\mathcal{H}$ and $h$ (and possibly $S$), and not in a more general way on $\mathcal{A}$. (We include this constraint because otherwise there wouldn’t be any useful distinction from algorithm-dependent bounds: we could just take a set of algorithm-dependent bounds that covers all algorithms with outputs in $\mathcal{H}$, and call it an “algorithm-independent bound”. We choose the definition to avoid this possibility.) This definition means that bounds in this class must bound the generalization gap or the generalization error for the worst-case hypothesis from the hypothesis class $\mathcal{H}$. (Under these definitions, realizable bounds should technically be algorithm-dependent, because they assume something beyond the hypothesis class of the algorithm: they assume that the algorithm is ERM. However, we will still refer to bounds that only add this extra assumption as algorithm-independent, as the ERM assumption may be considered relatively weak. This distinction becomes important when discussing some subtleties, like those in section 3.3.4.) We can therefore write bounds in this class in the following way
$$\forall h\in\mathcal{H}:\ \Delta(\epsilon,\hat{\epsilon})\leq\left(\frac{C(\mathcal{H},h,S,\delta)}{m}\right)^{\gamma} \qquad (4)$$
which holds with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$. Algorithm-independent bounds are commonly classified in two classes, according to the dependence of the capacity on $h$:

Uniform convergence bounds (or uniform bounds for short) are algorithm-independent bounds where the capacity is independent of $h$. This includes VC dimension bounds (section 4.1.1) and Rademacher complexity bounds (section 4.1.2). Here the nomenclature “uniform” means the value of the bound is independent of the hypothesis.

Non-uniform convergence bounds (or non-uniform bounds for short) are algorithm-independent bounds where the capacity depends on $h$. Common examples include bounds for structural risk minimization (section 4.2.1.1).


Algorithm-dependent bounds. This type of bound generalizes the above class by considering stronger, or more general, assumptions on $\mathcal{A}$, as well as more general dependence of the capacity on $\mathcal{A}$. We can express this general class of bounds as
$$\forall\mathcal{A}\in\mathbb{A}:\ \Delta(\epsilon,\hat{\epsilon})\leq\left(\frac{C(\mathcal{A},S,\delta)}{m}\right)^{\gamma} \qquad (5)$$
which holds with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, where $\mathbb{A}$ is a set of algorithms that represents our assumptions on $\mathcal{A}$. An example of this class of bounds are stability-based bounds (section 4.3.1), which rely on the assumption that the algorithm’s output doesn’t depend strongly on any individual training example. The case where $\mathbb{A}$ contains a single algorithm corresponds to the analysis of a particular algorithm. We will refer to this special case as algorithm-specific bounds. However, in almost all cases, analyses of particular algorithms rely only on a subset of the properties of the algorithm, so that the results often apply more generally and are not restricted to a specific algorithm.
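For illustration (a textbook example, not a bound analyzed in this paper), the simplest uniform convergence bound of the form of eq. (4) comes from a union bound over a finite hypothesis class combined with Hoeffding's inequality; the capacity $\ln|\mathcal{H}|$ depends on $\mathcal{H}$ but not on $h$, which is what makes the bound uniform:

```python
import math

def uniform_finite_class_gap(class_size, m, delta):
    """Uniform convergence over a finite class H via a union bound over
    Hoeffding's inequality: with probability >= 1 - delta, EVERY h in H
    satisfies |eps(h) - eps_hat(h)| <= sqrt((ln|H| + ln(2/delta)) / (2m)).
    The capacity ln|H| is independent of h: a uniform bound."""
    return math.sqrt(
        (math.log(class_size) + math.log(2 / delta)) / (2 * m)
    )

# With |H| = 2^30 and m = 100,000 the worst-case gap is already ~1%.
gap = uniform_finite_class_gap(2 ** 30, 100_000, 0.05)
```

Because the guarantee holds simultaneously for all $h\in\mathcal{H}$, it holds in particular for whatever $h$ the algorithm returns, which is how such bounds are applied without any assumption on $\mathcal{A}$ beyond $\mathcal{A}(S)\in\mathcal{H}$.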
3.3.3 According to dependence of capacity on training set
Another way to classify bounds is according to the dependence of the capacity on the training dataset $S$. For any of the classes of bounds discussed above, we further distinguish the following two classes of bounds.

Data-independent bounds are bounds in which the capacity does not depend on $S$. A common example is the VC dimension bound.

Data-dependent bounds are bounds in which the capacity depends on $S$. Common examples are Rademacher complexity bounds. The PAC-Bayes bound for Bayesian algorithms (such as the one we present in theorem 5.1) is another example.
Note that in this paper we do not explicitly consider data-distribution-dependent bounds, because we only consider bounds that depend on quantities derivable from $S$. However, we will consider the behaviour of bounds under different assumptions on the data distribution $\mathcal{D}$. In particular, we could look at the expected value of the bound, or its whole distribution, for different $\mathcal{D}$. This is the sense in which we will discuss “data distribution dependence” in this paper. Note that only data-dependent bounds, which depend on $S$, can depend on the data distribution in this sense.
3.3.4 Further comments on non-uniform bounds and algorithm-dependent bounds
A common way to derive algorithm-dependent bounds is to start with non-uniform convergence bounds, and then make further assumptions on the algorithm which restrict the set of hypotheses $h$ that may be returned for a given $S$. We can then use the maximum value of the bound among those $h$ as an algorithm-dependent bound, valid for algorithms which satisfy the assumptions. Note that this makes the bound automatically data-dependent. This can be expressed by bounds of the form
$$\forall h\in\tilde{\mathcal{H}}(S):\ \Delta(\epsilon,\hat{\epsilon})\leq\left(\frac{C(\mathcal{H},h,S,\delta)}{m}\right)^{\gamma} \qquad (6)$$
which holds with probability at least $1-\delta$ over $S\sim\mathcal{D}^m$, where $\tilde{\mathcal{H}}(S)$ is a set which includes $\mathcal{A}(S)$ for the class of algorithms which we are considering. This is what is done, for example, for margin bounds (section 4.2.1.2) for max-margin classifiers, like SVMs, where we assume the algorithm will output an $h$ which maximizes the margin, and then plug that condition into a non-uniform margin bound (by setting $\tilde{\mathcal{H}}(S)$ to be the set of max-margin classifiers for $S$).
Non-uniform bounds are often designed with particular assumptions on the algorithm and dataset in mind. This is because the value of non-uniform bounds depends on both $h$ and $S$. This in turn means that the notion of optimality for non-uniform bounds (and algorithm- and data-dependent bounds in general) should also depend on the algorithm and the data distribution, as argued in section 6.
In the applications to deep learning theory we study, non-uniform convergence bounds are only used as a way to obtain algorithm-dependent bounds, though we still present the fundamental non-uniform bounds in section 4.2.1.1 as useful background for the algorithm-dependent bounds based on them.
In a recent influential paper, nagarajan2019uniform showed that for SGD-trained networks, the tightest double-sided bounds based on uniform convergence give loose bounds for certain families of data distributions. (“Tightest” in their analysis refers to the fact that their bound is specific to the particular algorithm they study, SGD-trained DNNs, and the particular dataset, which was synthetically constructed.) In appendix G.4 they extend this result to include all algorithm-dependent bounds for which the capacity is only allowed to depend on the output of the algorithm and has no other data-dependence beyond this, while in appendix J they show that these limitations also apply to standard deterministic PAC-Bayes bounds based on the general KL-divergence-based PAC-Bayes bound (eq. 13). The intuition behind these results is that this kind of bound can encode little information about the algorithm beyond the hypothesis class, as it cannot explicitly capture the dependence of $h$ on $S$.
Most of the algorithm-dependent bounds derived from non-uniform bounds that we will study are based on data-independent non-uniform bounds. This automatically gives the resulting algorithm-dependent bounds a capacity that depends only on $h$, and so they suffer from the limitations pointed out in nagarajan2019uniform. Inspired by this result, we chose to classify algorithm-dependent bounds into those that are based on non-uniform convergence and those that are not.
It is also worth noting that there are several ways to “get around” the limitations in nagarajan2019uniform. 1) The analysis of nagarajan2019uniform finds a family of distributions where the class of bounds we discuss above fails. However, it doesn’t rule out the possibility that these bounds may give tight predictions for other data distributions. 2) We can allow an algorithm-dependent bound to depend on the data in other ways than via $h$ (e.g. by having explicit dependence on $S$, as in shawe1998structural; shawe1997pac). 3) negrea2019defense showed that one can still make use of uniform convergence for the distributions that nagarajan2019uniform study, by bounding the risk difference between $\mathcal{A}(S)$ and a surrogate hypothesis in a class for which uniform convergence gives tight bounds. 4) We can consider non-double-sided bounds. Bounds derived from an analysis assuming realizability can satisfy 2) and 4). They can satisfy 2) because they only guarantee convergence on a hypothesis class which depends on the data (the set of $h$ with zero training error). Bounds considered by nagarajan2019uniform, even if they can depend on $S$, are worst-case over training sets. However, an analysis which assumes realizability can bound the generalization gap only for those training sets where $\hat{\epsilon}=0$. Realizable bounds can satisfy 4) because they can use the one-sided version of the Chernoff bound for the mean (langford2005tutorial), as can be seen, for example, in Corollary 2.3 in shalev2014understanding. Note that realizable bounds derived from agnostic bounds (by setting $\hat{\epsilon}=0$) will still suffer from the limitations that nagarajan2019uniform point out, because agnostic bounds themselves do not satisfy the conditions above. Therefore, only bounds which take full advantage of the realizability assumption may avoid the limitations in nagarajan2019uniform.
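The gain from one-sidedness under realizability can be made concrete with a textbook calculation (an illustration in the spirit of langford2005tutorial, not a bound from this paper): for a hypothesis with zero training error, the exact one-sided binomial tail inversion gives the smallest $\epsilon^\ast$ such that observing $m$ correct predictions is still plausible at confidence $\delta$, and it is always at least as tight as the standard relaxation $\ln(1/\delta)/m$:

```python
import math

def realizable_one_sided_bound(m, delta):
    """Exact one-sided inversion for a zero-training-error hypothesis:
    if eps(h) > eps*, then P[h is correct on all m i.i.d. samples]
    = (1 - eps(h))**m < delta. Solving (1 - eps*)**m = delta gives
    eps* = 1 - delta**(1/m)."""
    return 1 - delta ** (1 / m)

m, delta = 10_000, 0.05
exact = realizable_one_sided_bound(m, delta)
relaxed = math.log(1 / delta) / m  # standard relaxation: 1 - e^{-x} <= x
```

Both quantities decay as $1/m$, recovering the $\gamma=1$ rate of realizable bounds, whereas a double-sided Gaussian-tail analysis would only give the $m^{-1/2}$ rate.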
We note that the marginal-likelihood PAC-Bayes bound we present in section 5 is based on a realizable analysis using one-sided bounds, and thus avoids the limitations of double-sided bounds discussed above. In fact, our bound holds with high probability over the Bayesian posterior, rather than universally over a whole family of hypothesis-dependent posteriors, as is usual for deterministic PAC-Bayes bounds (nagarajan2019uniform).
3.3.5 Overview of bounds
| | Algorithm-independent (section 4.1): based on uniform convergence | Algorithm-dependent (section 4.2): based on non-uniform convergence | Algorithm-dependent: other |
|---|---|---|---|
| Data-independent | VC dimension bound^{*} (section 4.1.1) | SRM-based bounds^{†} (section 4.2.1.1) | uniform stability bounds^{‡} and compression bounds^{§} (section 4.3.1) |
| Data-dependent | Rademacher complexity bound^{¶} (section 4.1.2) | data-dependent SRM-based bounds^{**} (section 4.2.1.1), margin bounds^{††} (section 4.2.1.2), NTK-based bounds^{§§} (section 4.2.1.3), sensitivity-based bounds^{‡‡} (section 4.2.1.4), other PAC-Bayes bounds^{¶¶} (section 4.2.2) | non-uniform stability bounds^{***} (section 4.3.1), marginal-likelihood PAC-Bayes bound^{†††} (section 5) |

^{*}vapnik1974theory; blumer1989learnability; bartlett2017nearly
^{†}vapnik1995nature; mcallester1998some
^{‡}bousquet2002stability; pmlrv48hardt16; mou2018generalization
^{§}littlestone1986relating; brutzkus2018sgd
^{¶}bartlett2002rademacher
^{**}shawe1998structural; shawe1997pac
^{††}bartlett1997valid; bartlett1998sample; bartlett2017spectrally; neyshabur2018a; golowich2017size; neyshabur2018towards; barron2019complexity
^{‡‡}neyshabur2017exploring; dziugaite2017computing; arora2018stronger; banerjee2020randomized
^{§§}arora2019fine; cao2019generalization
^{¶¶}zhou2018non; dziugaite2018data
^{***}kuzborskij2017data
^{†††}valle2018deep
In the next sections of this paper, we will describe the major families of generalization error bounds that have been applied to DNNs. While we don’t claim that the list is exhaustive, we have tried to cover all the major approaches to generalization bounds.
In table 1 we present a highlevel overview of where different general classes of bounds found in the literature fit within the classification introduced above. It also lists which bounds we treat explicitly in the rest of the paper, and where they sit in our taxonomy. Thus the table helps illustrate what kinds of general assumptions go into the different bounds.
Given this hierarchy of the main types of bounds, we next turn to a comparison of their performance. As expected, the overall empirical performance of the bounds improves as more assumptions are added.
4 Comparing existing bounds against desiderata
In this section we use the taxonomy from section 3 (illustrated in table 1) to organise a discussion of how different bounds fare against the desiderata proposed in section 2. We use a ✗ when there is strong evidence that bounds in a family fail to satisfy the most important aspects of a desideratum, ✓ when there is strong evidence that bounds in the family satisfy the most important aspects of a desideratum, and ⚫ otherwise. We are aware that these are not formally defined notions, and the marks should just be taken as an aid for the reader.
4.1 Algorithm-independent bounds
4.1.1 Data-independent uniform convergence bounds: VC dimension
One of the iconic results in the theory of generalization is the notion of uniform convergence, introduced by Vapnik and Chervonenkis (vapnik1974theory). Expressed in the language of PAC learning (blumer1989learnability), it considers data-independent uniform convergence bounds, where the capacity doesn’t depend on the data, but only on the hypothesis class $\mathcal{H}$. The main result of this theory is that the optimal bound of this form (up to a fixed multiplicative constant) for the generalization gap, in the case of binary classification, is

$$\epsilon(h) - \hat{\epsilon}(h) \leq c\sqrt{\frac{d_{\mathrm{VC}}(\mathcal{H}) + \ln(1/\delta)}{m}} \quad \text{for all } h \in \mathcal{H}, \qquad (7)$$

for some constant $c$, where $d_{\mathrm{VC}}(\mathcal{H})$ is a combinatorial quantity called the Vapnik-Chervonenkis dimension (shalev2014understanding), which depends on the hypothesis class alone. In the realizable case, they also proved that the optimal realizable data-independent uniform bound is

$$\epsilon(h) \leq c'\,\frac{d_{\mathrm{VC}}(\mathcal{H})\ln\left(m/d_{\mathrm{VC}}(\mathcal{H})\right) + \ln(1/\delta)}{m} \quad \text{for all } h \in \mathcal{H}_S^0, \qquad (8)$$

for some constant $c'$, where $\mathcal{H}_S^0$ is the set of all $h \in \mathcal{H}$ with zero training error on $S$. The particular realizability assumption here is that the data distribution should be such that, for all training sets $S$, $\mathcal{H}_S^0$ is non-empty.
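To see concretely why the VC bound becomes vacuous at modern scales (desideratum D.5 below), the following back-of-the-envelope sketch evaluates a bound of the form of eq. 7. The constants are assumed ($c = 1$), and the scaling $W L \log_2 W$ for a network with $W$ parameters and $L$ layers is only a rough order-of-magnitude stand-in for the nearly-tight VC-dimension bounds for piecewise-linear networks (bartlett2017nearly):

```python
import math

def vc_bound(d_vc, m, c=1.0, delta=0.05):
    # Agnostic VC generalization-gap bound: c * sqrt((d_vc + ln(1/delta)) / m).
    return c * math.sqrt((d_vc + math.log(1 / delta)) / m)

# Rough VC-dimension scaling d_vc ~ W * L * log2(W) for a ReLU network
# (up to constants; a crude stand-in for the nearly-tight bounds).
W, L, m = 25_000_000, 50, 1_200_000  # ResNet-50-scale parameters, ImageNet-scale m
d_vc = W * L * math.log2(W)
print(vc_bound(d_vc, m))  # far above 1, hence vacuous
```

With these (assumed) numbers the bound exceeds 1 by two orders of magnitude, while the same formula gives sensible values only when $d_{\mathrm{VC}} \ll m$.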
How does this bound do at the desiderata?

D.1 ✗ The bound is dataindependent by construction, and therefore its value is the same for any data distribution or training set.

D.2 ✗ The bound decreases with training set size, but at a rate which is independent of the dataset, unlike what is observed in practice (learning curves typically follow a power law $m^{-\alpha}$ for a range of exponents $\alpha$, often significantly smaller than 1). Recently, bousquet2020theory pointed out that a fundamental reason why data-independent uniform convergence bounds don’t capture the behaviour of learning curves is that the worst-case distribution can depend on $m$, so that the VC bound bounds an ‘envelope’ of the actual learning curves for individual distributions, and this envelope may have a markedly different form as a function of $m$ than the individual learning curves.

D.3 ✗ The VC dimension can capture differences in architectures. However, it doesn’t appear to capture the correct trends. For example, the VC dimension grows with the number of parameters (baum1989size; bartlett2017nearly), while for neural networks the generalization error tends to decrease (or at least not increase) with increased overparametrization (neyshabur2018the).
D.4 ✗ The bound is only dependent on the algorithm via the hypothesis class. Therefore it won’t capture any algorithm-dependent behaviour except for regularization techniques that restrict the hypothesis class.

D.5 ✗ The VC dimension of neural networks used in modern deep learning is typically much larger than the number of training examples (zhang2016understanding), thus leading to vacuous VC-dimension bounds.

D.6 ✓ Although computing the exact VC dimension of neural networks is intractable, there are good approximations and bounds which are easily computable (bartlett2017nearly).

D.7 ✓ The VC dimension offers a rigorous bound with minimal assumptions. Therefore, its guarantees are rigorously applicable to many cases.
A common way to interpret the VC dimension bound is in terms of the bias-variance tradeoff (neal2018modern; neal2019bias), a simple heuristic that is widely used in machine learning. The bias-variance tradeoff captures the intuition that there is a tradeoff between a model being too simple, when it cannot properly represent the data (large bias), and a model being too complex (large capacity), when it will tend to overfit, leading to large variance on unseen data. For the bound eq. 7 we can identify the training error $\hat{\epsilon}(h)$ as measuring the bias, and the term involving the VC dimension as indicative of the variance. Intuitively, increasing the VC dimension can make the training error smaller by increasing the capacity, at the expense of higher variance (the second term), so that one may expect to see a U-shaped curve of generalization error versus model complexity. However, many empirical works have shown that, once in the overparametrized regime, DNNs in fact typically show a monotonic decrease in generalization error as overparametrization increases, unlike what the VC dimension bound would suggest (lawrence1998size; neyshabur2018the; belkin2019reconciling). As the VC dimension bound is optimal among data-independent algorithm-independent bounds, these results tell us that this class of bounds is fundamentally unable to explain the generalization of overparametrized neural networks.
Intuitively, it is not surprising that the VC dimension bound cannot capture this behaviour. As overparametrization increases, the model is able to express more functions, and so the worst-case generalization in the hypothesis class can only get worse. What the VC dimension measure (or, for that matter, naive applications of the bias-variance tradeoff heuristic in terms of overparametrization) is not capturing is: 1) the strong inductive bias within the hypothesis class which DNNs have; 2) that the effective expressivity of the model can depend on the data. That is, we need to look for bounds that are algorithm-dependent and/or data-dependent. In the following section, we will look at algorithm-independent data-dependent bounds, and we will see that data-dependence alone is not enough, so that a bound which takes the inductive bias into account is necessary to explain the generalization of overparametrized DNNs.
4.1.2 Data-dependent uniform convergence bounds: Rademacher complexity
As a first step towards including data dependence, we consider a classic data-dependent uniform convergence bound of the form of eq. 4 for algorithm-independent uniform bounds, where the function in eq. 4 is the absolute value, and the capacity is independent of the hypothesis. It is given by:

$$\epsilon(h) \leq \hat{\epsilon}(h) + 2\,\mathcal{R}_S(\mathcal{H}) + c\sqrt{\frac{\ln(1/\delta)}{m}}, \qquad (9)$$

where $c$ is a constant, and $\mathcal{R}_S(\mathcal{H})$ is the Rademacher complexity of the set of vectors $\{(h(x_1), \dots, h(x_m)) : h \in \mathcal{H}\}$, where $x_1, \dots, x_m$ are the input points in $S$ (bartlett2002rademacher; shalev2014understanding). There is also a lower bound which matches it up to a constant and up to an additive $O(1/\sqrt{m})$ term (bartlett2002rademacher; koltchinskii2011oracle). Therefore, whether this bound is the optimal data-dependent uniform generalization gap bound up to a constant depends on the rate at which the Rademacher complexity decays with $m$. Bounds on the Rademacher complexity tend to be $O(1/\sqrt{m})$, but they often dominate the second term, so that the lower and upper bounds match in their first-order behaviour, suggesting that the bound in eq. 9 may often be close to optimal within this class of bounds. Rademacher complexity bounds for neural networks typically rely on a bound on the norm of the weights (bartlett2002rademacher), which typically grows with overparametrization. The lesson we learn is that although Rademacher bounds are data-dependent, they are still worst-case over algorithms with hypothesis class $\mathcal{H}$. Given that DNNs can express functions that generalize badly (zhang2016understanding), it is thus not surprising that Rademacher complexity bounds are vacuous, and suffer from similar problems as VC dimension bounds.
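The definition of the empirical Rademacher complexity, an expected supremum of correlations with random sign vectors, can be estimated directly by Monte Carlo for a small finite class. The toy sketch below (all names and classes are our own illustrative constructions) shows that a class rich enough to fit arbitrary sign patterns has much higher complexity than a small class, mirroring why expressive DNN hypothesis classes yield loose worst-case bounds:

```python
import random

def empirical_rademacher(outputs_per_hypothesis, n_trials=2000, seed=0):
    """Monte-Carlo estimate of the empirical Rademacher complexity of a finite
    hypothesis class, given each hypothesis's output vector on the m inputs:
    E_sigma[ sup_h (1/m) * sum_i sigma_i * h(x_i) ]."""
    rng = random.Random(seed)
    m = len(outputs_per_hypothesis[0])
    total = 0.0
    for _ in range(n_trials):
        sigma = [rng.choice((-1, 1)) for _ in range(m)]
        total += max(sum(s * o for s, o in zip(sigma, outs)) / m
                     for outs in outputs_per_hypothesis)
    return total / n_trials

# "rich": all 16 binary output patterns on m=4 points (can fit any signs);
# "poor": just the constant-0 and constant-1 hypotheses.
rich = [[float(b) for b in f"{k:04b}"] for k in range(16)]
poor = [[0.0] * 4, [1.0] * 4]
print(empirical_rademacher(rich), empirical_rademacher(poor))
```

For the rich class the supremum always picks the hypothesis matching the positive signs, giving a complexity near 0.5, while the two-element class comes out far lower.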
We note that Rademacher complexity has also been used as a tool in analyses which take into account more properties of the algorithm, for example for bounds based on non-uniform convergence. In this paper we will treat these other bounds separately, mainly in section 4.2.1.2, when looking at margin bounds. We will nevertheless briefly comment on these margin bounds in the desiderata below, as they have been studied more thoroughly, and may give insights into Rademacher complexity more generally.
The performance of Rademacher-complexity-based bounds on our desiderata can be found below:

D.1 ✗ The bound is datadependent, and could capture some dependence in the dataset. However, it only depends on the distribution over inputs, and therefore it can’t depend on the complexity of the target function, when the input distribution is fixed. This is unlike real neural networks which generalize worse when the labels are corrupted (zhang2016understanding).

D.2 ✗ Bounds on the Rademacher complexity are typically $O(1/\sqrt{m})$, thus not capturing the behaviour of learning curves. Furthermore, margin-based bounds (which rely on norm-based bounds on the Rademacher complexity) often increase with $m$. See section 4.2.1.2.

D.3 ✗ As the Rademacher complexity captures a notion of expressivity, similarly to the VC dimension, it often grows with overparametrization and the number of layers (see, for example, the norm-based bounds in section 4.2.1.2). It seems unlikely that it could capture other architectural differences.

D.4 ✗ Like the VC dimension bound, it is only dependent on the algorithm via the hypothesis class. The most studied quantity for comparing hypothesis classes in this context is the norm of the weights of the DNN. As we comment in section 4.2.1.2, these bounds appear to anticorrelate with the true error when changing several common optimization hyperparameters.

D.5 ✗ Like the VC dimension bound, these bounds are typically vacuous in the overparametrized regime. In fact, it has recently been shown that there are data distributions for which the tightest double-sided distribution-dependent uniform convergence bounds for several SGD-trained models are provably vacuous (nagarajan2019uniform). This implies that the distribution-dependent data-independent version of the Rademacher complexity bound (shalev2014understanding), and thus the bound in eq. 9, is also vacuous for those data distributions. Although for other data distributions the bounds may not be vacuous, this work suggests that uniform convergence bounds have some fundamental limitations (see section 3.3.4 for further discussion of this issue).

D.6 ✓ Same as for VC dimension bounds.

D.7 ✓ Same as for VC dimension bounds.
Although some fundamental limitations of the VC dimension are overcome by Rademacher bounds, the bounds for neural networks currently based on Rademacher complexity have very similar problems to those based on VC dimension. Although their capacity measure is data-dependent, for the problems to which DNNs are applied these bounds still give vacuous predictions and grow with the number of parameters. This observation, combined with the fact that the bounds may be optimal among uniform generalization gap bounds, suggests that to overcome the limitations highlighted here we will need to consider data-dependent non-uniform and algorithm-dependent bounds.
Note that, although one can prove that the VC dimension bound (and perhaps the Rademacher complexity bound) is optimal within its class of bounds, for data-dependent bounds this will not in general be possible. As we discuss in section 6, whether one bound is tighter than another depends on what prior assumptions are made about the data distribution, so that a unique notion of optimality may not exist.
4.2 Algorithm-dependent bounds
In this section we consider the major classes of algorithmdependent bounds and how they fare at satisfying the desiderata. We focus on realizable bounds (which assume zero training error), because modern deep learning often works in this regime (zhang2016understanding). For many of the recent bounds for deep learning, the lack of experiments means that we can’t conclusively answer whether they satisfy several of the desiderata. We hope that future work can fill these gaps.
4.2.1 Algorithm-dependent bounds based on non-uniform convergence
We start by looking at algorithm-dependent bounds derived from non-uniform convergence bounds (see also section 3.3.4). We begin by presenting the basic types of non-uniform bounds in the next section, before covering the two main applications to deep learning in the two sections after that: margin bounds and sensitivity-based bounds. We will also briefly comment on some other proposed approaches for applying PAC-Bayes to deep learning.
4.2.1.1 Basic non-uniform convergence bounds and structural risk minimization
The simplest and most fundamental idea for making non-uniform bounds is related to a learning technique called structural risk minimization (SRM), developed by Vapnik and Chervonenkis (vapnik1995nature). The derivation of this bound is very similar to the classic textbook PAC bound (see e.g. corollary 2.3 in shalev2014understanding), but rather than using a uniform union bound, it uses a non-uniform union bound over the hypothesis class to prove that, for any countable hypothesis class $\mathcal{H}$ and any distribution $P$ over hypotheses (mcallester1998some; shalev2014understanding):

$$\epsilon(h) \leq \hat{\epsilon}(h) + \sqrt{\frac{\ln(1/P(h)) + \ln(2/\delta)}{2m}} \quad \text{for all } h \in \mathcal{H}. \qquad (10)$$

First, as expected, this bound reduces to the standard uniform finite-class PAC bound (valiant1984theory) when $P$ is the uniform distribution over a finite $\mathcal{H}$, in which case $\ln(1/P(h)) = \ln|\mathcal{H}|$. What eq. 10 tells us is that if we have a learning algorithm and a problem for which we have prior knowledge that some functions are more likely to be learned than others, then we can obtain tighter bounds by choosing a $P$ that assigns higher probability to these functions. One way to understand intuitively how this knowledge affects the bound is to consider the limit where $P$ is highly concentrated on a subset of $\mathcal{H}$, and approximately uniform within that subset. Then eq. 10 approaches the finite-class uniform PAC bound for a reduced hypothesis class, and the capacity $\ln(1/P(h))$ can be interpreted as measuring an “effective size” of the hypothesis class. Another way to think about this is that choosing a $P$ amounts to “betting” that some hypotheses are more likely to be output than others. If we are right, then our bound will be better in practice than the standard finite-class PAC bound, while if we are wrong and the algorithm outputs an $h$ with low $P(h)$, the bound will perform worse. In other words, unlike the uniform finite-class PAC bound, eq. 10 depends on the $h$ we obtain, so that in order to evaluate its performance we need to take into account the data distribution and the algorithm (which together determine the probability distribution with which the learning algorithm outputs hypotheses), and to work with an expected value of the generalization error bound.
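The “betting” interpretation of eq. 10 can be made concrete with a toy computation (illustrative numbers only; the class size and prior values are our own assumptions). If the algorithm outputs a hypothesis our prior $P$ favoured, the non-uniform capacity $\ln(1/P(h))$ beats the uniform $\ln|\mathcal{H}|$; if it outputs a rare hypothesis, the bet loses:

```python
import math

def uniform_bound(H_size, m, delta=0.05):
    # Finite-class uniform bound: the same capacity ln|H| for every hypothesis.
    return math.sqrt((math.log(H_size) + math.log(2 / delta)) / (2 * m))

def nonuniform_bound(p_h, m, delta=0.05):
    # SRM-style bound of eq. 10: capacity ln(1/P(h)) set by the prior weight of h.
    return math.sqrt((math.log(1 / p_h) + math.log(2 / delta)) / (2 * m))

H_size, m = 10**9, 10_000
p_favoured, p_rare = 0.01, 1e-12  # a "bet": a few hypotheses get most of the mass
print(uniform_bound(H_size, m))         # same value for every h
print(nonuniform_bound(p_favoured, m))  # tighter when the bet pays off
print(nonuniform_bound(p_rare, m))      # looser when it does not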
To define an expected value of the bound, we assume a distribution over data distributions $\mathcal{D}$, which we call the prior $\mathcal{P}$. For example, this may be fully supported on a single $\mathcal{D}$ if we know the distribution fully (perhaps we are looking only at the images in CIFAR10). In more real-world settings, we will have uncertainty over what the true distribution is, but we may believe that certain distributions (e.g. simpler ones) are more likely. We consider a stochastic learning algorithm which, for training set $S$, outputs hypothesis $h$ with probability $Q_S(h)$, called the posterior (which need not be the Bayesian posterior). Under these two assumptions, it is not hard to see that, for a given distribution $P$ over hypotheses, the expected value of the capacity in the bound eq. 10 is given by

$$\mathbb{E}_{\mathcal{D} \sim \mathcal{P}}\,\mathbb{E}_{S \sim \mathcal{D}^m}\,\mathbb{E}_{h \sim Q_S}\left[\ln\frac{1}{P(h)}\right] = \sum_h \langle Q \rangle(h) \ln\frac{1}{P(h)} = \mathrm{CE}(\langle Q \rangle, P), \qquad (11)$$

where $\mathrm{CE}$ is the cross-entropy, and $\langle Q \rangle$ is the posterior averaged over training sets and data distributions. The second equality follows from the definition of cross-entropy. We can immediately see that the optimal bound of this form, for a given prior $\mathcal{P}$ and algorithm with posterior $Q_S$, is obtained by the choice $P = \langle Q \rangle$, in which case we obtain $\mathrm{CE}(\langle Q \rangle, \langle Q \rangle) = H(\langle Q \rangle)$, where $H$ is the entropy. This calculation formalizes the intuition that we should choose $P$ to be as close as possible to the probabilities with which the algorithm outputs different hypotheses for the task at hand. It also strengthens the intuition that this bound is capturing a notion of effective size, as $H(\langle Q \rangle)$ is often interpreted as the logarithm of the effective number of elements on which $\langle Q \rangle$ is concentrated.

Furthermore, we can consider the case where $Q_S$ is given by the Bayesian posterior. In this case $\langle Q \rangle = P(h)$, where $P(h)$ is the prior distribution over hypotheses, obtained by marginalizing over input distributions^{8}^{8}8Note that, unless we restrict to the noiseless case, these would be stochastic hypotheses as defined in section 3.1.. The average capacity in this case is $H(P)$. This is conceptually similar to the No Free Lunch theorem, in that it tells us that, for the optimal algorithm, the bound only guarantees good generalization if we make enough assumptions about the data distribution (which corresponds to a low-entropy $P$). We note, however, that the Bayesian posterior may not give the optimal value of the bound, as can be seen from the fact that the optimal value $H(\langle Q \rangle)$ is always lowered by making $Q_S$ more deterministic, while the Bayesian posterior is not deterministic in general.
The problem with the basic non-uniform SRM bound eq. 10 is that it does not capture the idea that some functions may be more similar to others, which quantities such as the VC dimension and Rademacher complexity do capture^{9}^{9}9For example, the VC dimension of a set of functions which are very similar to each other will typically be lower than that of a set of very dissimilar functions. The generalized version of the SRM bound which we present below has the advantage of being non-uniform while also capturing some notion of similarity among the functions within each subclass.
A more commonly-used extension of the basic SRM bound considers dividing the (now potentially uncountable) hypothesis class $\mathcal{H}$ into a countable set of (usually nested) subclasses $\mathcal{H}_k$, $k \in \mathbb{N}$, such that $\mathcal{H} = \bigcup_k \mathcal{H}_k$. The result is that, for any distribution $P$ over $\mathbb{N}$ (shalev2014understanding), we have:

$$\epsilon(h) \leq \hat{\epsilon}(h) + \min_{k\,:\,h \in \mathcal{H}_k}\left[C(\mathcal{H}_k, S) + \sqrt{\frac{\ln(1/P(k)) + \ln(2/\delta)}{2m}}\,\right], \qquad (12)$$

where $C(\mathcal{H}_k, S)$ is any (potentially data-dependent) capacity for class $\mathcal{H}_k$ which guarantees uniform convergence within $\mathcal{H}_k$ (for example, a bound on its VC dimension or Rademacher complexity). Results of this form are proven in shawe1998structural and shalev2014understanding.
We can also compute the expected value of the bound eq. 12, analogously to eq. 10. For the capacity term (ignoring the confidence term), we obtain $\mathrm{CE}(\langle Q_k \rangle, P)$, where $\langle Q_k \rangle$ is the averaged posterior over subclass indices and $k(h)$ represents the index of the subclass to which $h$ belongs. Analogously to before, the optimal choice is $P = \langle Q_k \rangle$, and the Bayesian posterior will in general not result in the optimal average value of the bound.
One shortcoming of eq. 12 is that the decomposition of $\mathcal{H}$ into the subclasses $\mathcal{H}_k$ has to be defined a priori; that is, it cannot depend on the data. shawe1998structural proposed an extension to the SRM framework which addressed this shortcoming, defining a potentially infinite hierarchy of subclasses which could depend on the data $S$. This framework includes as a special case the margin bounds we will see in section 4.2.1.2.
shawe1997pac applied the data-dependent SRM framework to obtain bounds for a parametrized model, where the capacity was related to the volume in parameter space of a sphere contained within the set of parameters producing zero training error. This work inspired the development of the first PAC-Bayes bounds in mcallester1998some^{10}^{10}10Although shawe1997pac is often cited as a precursor to PAC-Bayes (shawe2019primer), it offers a distinct analysis (for example, it gives deterministic bounds rather than bounds on expected error), which as far as the authors know hasn’t been shown to necessarily give stronger or weaker bounds than PAC-Bayes, and hasn’t been applied to neural networks.. These bounds apply to stochastic learning algorithms, and bound the expected value of the generalization error under the posterior $Q$, uniformly over posteriors. The standard form of the general PAC-Bayes bound was proven by maurer2004note and states that, for any distribution $P$ over $\mathcal{H}$,

$$\mathrm{kl}\left(\hat{\epsilon}(Q)\,\middle\|\,\epsilon(Q)\right) \leq \frac{\mathrm{KL}(Q \| P) + \ln\frac{2\sqrt{m}}{\delta}}{m}, \qquad (13)$$

where $\mathrm{KL}(Q \| P)$ is the KL divergence between $Q$ and $P$. On the left-hand side we use the standard abuse of notation, writing $\mathrm{kl}(a \| b)$ for the KL divergence between Bernoulli distributions with means $a$ and $b$, and defining $\hat{\epsilon}(Q) = \mathbb{E}_{h \sim Q}[\hat{\epsilon}(h)]$ and $\epsilon(Q) = \mathbb{E}_{h \sim Q}[\epsilon(h)]$.
This bound can be seen as generalizing the SRM with data-dependent hierarchies of shawe1998structural, where instead of “hard” subdivisions of $\mathcal{H}$ into subclasses, we consider all possible distributions $Q$ on $\mathcal{H}$. The term $\mathrm{KL}(Q \| P)$ is analogous to $\ln(1/P(k))$ in eq. 12, in that it very roughly measures how much of the total probability mass of $P$ lies in the high-probability region of $Q$. The term which penalizes “classes” of hypotheses which are too diverse, analogously to the capacity term in eq. 12, is $\hat{\epsilon}(Q)$, the average training error, because this is only small if the functions agree on $S$, which will only happen with high probability if they are sufficiently similar. These analogies between the data-dependent SRM bounds and PAC-Bayes are only intuitive, but we conjecture that a more formal connection might be possible.
The PAC-Bayes bound in eq. 13 is one of the most general non-uniform data-dependent bounds, and its different applications give rise to sensitivity-based bounds (section 4.2.1.4), among others (section 4.2.2). In fact, margin bounds can also be derived from PAC-Bayes (langford2003pac).
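In practice, a bound of the kl form of eq. 13 is turned into an explicit upper bound on the expected error by inverting the binary kl divergence numerically. The sketch below does this by bisection, for illustrative (made-up) values of the training error, $\mathrm{KL}(Q\|P)$, and $m$; it is a generic recipe, not the computation from any specific cited paper:

```python
import math

def kl_bernoulli(q, p):
    """KL divergence between Bernoulli(q) and Bernoulli(p)."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pac_bayes_bound(train_err, kl_qp, m, delta=0.05):
    """Invert a PAC-Bayes-kl bound: return (approximately) the largest p with
    kl(train_err || p) <= (KL(Q||P) + ln(2*sqrt(m)/delta)) / m, via bisection."""
    rhs = (kl_qp + math.log(2 * math.sqrt(m) / delta)) / m
    lo, hi = train_err, 1.0 - 1e-9
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(train_err, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return lo

# Illustrative numbers: a small KL(Q||P) relative to m yields a non-vacuous bound.
print(pac_bayes_bound(train_err=0.02, kl_qp=500.0, m=50_000))
```

The kl inversion is what makes the bound tight in the realizable regime: when the training error is near zero, the resulting error bound decays like $1/m$ rather than $1/\sqrt{m}$.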
We don’t comment on the desiderata for the SRM and PAC-Bayes bounds described above, because they provide a general framework for the non-uniform data-dependent bounds we explore next. We will comment on the desiderata for these bounds individually.
4.2.1.2 Norm-based and margin bounds
A popular method for obtaining data-dependent non-uniform bounds is the method of margin-based bounds. These usually start with norm-based bounds, which bound the Rademacher complexity of subclasses of a hypothesis class parametrized by a parameter vector (weights) $w$, corresponding to balls in which $w$ has bounded norm. One can then either express the bound under an assumption on the weight norms, or have the bound depend on the weight norms. The latter is done by applying an SRM-like bound to the family of hypothesis subclasses corresponding to different weight norms. For margin bounds, the bound is applied to a margin loss, which upper-bounds the 0-1 loss. The result is a bound on the generalization error which depends on a new data-dependent property called the margin, which measures how confidently the classifier is classifying examples. The bound, in the agnostic case, has the general form, for any margin $\gamma > 0$ (shalev2014understanding),
$$\epsilon(w) \leq \hat{\epsilon}_\gamma(w) + a\,\frac{C(w)}{\gamma\sqrt{m}} + b\sqrt{\frac{\ln(1/\delta)}{m}}, \qquad (14)$$

where $a$ and $b$ are constants (that may depend on the algorithm but not on $S$ or $w$), $C(w)$ is a capacity measure which usually measures the norm of the parameters $w$, and the margin error $\hat{\epsilon}_\gamma(h)$ is defined as the fraction of training examples with $y_i h(x_i) < \gamma$, for a hypothesis $h$. Note that we usually apply this to real-valued hypotheses $h$, for which the classification error is defined as $\epsilon(h) = \mathbb{P}[\mathrm{sign}(h(x)) \neq y]$, where $y \in \{-1, +1\}$ for binary classification. We abuse notation by writing $\epsilon(w)$ and $\hat{\epsilon}_\gamma(w)$ for the quantities evaluated at the hypothesis corresponding to parameters $w$. One can also make a bound that holds (non-uniformly) for all values of $\gamma$ by applying a weighted union bound over discretized values of $\gamma$ (shalev2014understanding).

The margin loss counts the misclassification errors plus those examples which were classified correctly but with low confidence (measured by $y_i h(x_i)$ being smaller than $\gamma$). In the case of linear models, where $h_w(x) = w \cdot x$ and $C(w) = \|w\|_2$, the ratio $\gamma / \|w\|_2$ measures a geometric margin, and the margin loss measures the number of examples that are not on the right side of the classification boundary by a distance greater than $\gamma / \|w\|_2$. Support vector machines are a famous example of an algorithm trying to maximize this geometric margin (cortes1995support).

For neural networks, margin bounds were originally developed based on an analysis of their fat-shattering dimension (bartlett1997valid; bartlett1998sample). These bounds depend on the norm of the weights, and thus typically grow with overparametrization. The bounds also grow exponentially with depth. More recently, more complex norms of the weights have been studied, as well as margin bounds using the Lipschitz constant of the network as the capacity measure (bartlett2017spectrally; neyshabur2018a; golowich2017size; neyshabur2018towards; barron2019complexity). These bounds have shown some correlation with the complexity of the data, but suffer from their (implicit or explicit) dependence on width and depth, and are vacuous (neyshabur2017exploring; neyshabur2015norm; arora2018stronger). They also show negative correlation with the generalization error when changing certain training hyperparameters (jiang2019fantastic).
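The margin quantities entering eq. 14 are straightforward to compute for a linear model. The toy sketch below (our own illustrative data and function names) computes normalized margins $y_i (w \cdot x_i + b) / \|w\|$ and the margin loss at different thresholds $\gamma$, showing how the margin loss interpolates between the 0-1 error and a stricter, confidence-aware error:

```python
import math

def margins(w, b, points, labels):
    """Normalized margins y * (w.x + b) / ||w|| for a linear classifier."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return [y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w
            for x, y in zip(points, labels)]

def margin_loss(margin_values, gamma):
    """Fraction of points misclassified, or correct with margin below gamma."""
    return sum(1 for g in margin_values if g < gamma) / len(margin_values)

w, b = [3.0, 4.0], -1.0  # ||w|| = 5
X = [[1.0, 1.0], [0.0, 0.0], [2.0, -1.0], [-1.0, 0.5]]
y = [+1, -1, +1, -1]
g = margins(w, b, X, y)
print(g)                    # geometric margins per example
print(margin_loss(g, 0.0))  # plain 0-1 training error
print(margin_loss(g, 1.0))  # stricter: also counts low-confidence correct points
```

Here all four points are classified correctly (0-1 error 0), but three of them have margin below 1, so the margin loss at $\gamma = 1$ is 0.75: the margin term is what lets the bound reward confident classification rather than mere correctness.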
We now summarize the strengths and weaknesses of marginbased bounds on our desiderata. Note that most of our conclusions are based on empirical results on existing marginbased bounds, and may not be fundamental to the marginbased approach itself.

D.1 ✓ Margin-based bounds have been shown to correlate with the true error when comparing CIFAR versus MNIST, and when comparing uncorrupted versus corrupted data (bartlett2017spectrally; neyshabur2017exploring). The correlation should be explored over a wider range of datasets and quantified more precisely.

D.2 ⚫ The dependence of margin-based bounds on the training set size depends on how the capacity measure changes with $m$. In nagarajan2019uniform, it has been shown that several types of margin-based bounds proposed for deep neural networks actually increase with training set size! However, in dziugaite2020search, other norm- and margin-based generalization measures are found to correlate well with the error when changing training set size, suggesting that it may be possible to prove better bounds based on these measures.

D.3 ⚫ While most marginbased bounds increase with layer width, some (path norm bound in neyshabur2017exploring and the bound in neyshabur2018towards) actually decrease with layer width. Regarding variations in depth, all the proposed bounds increase with number of layers, and most of them do so exponentially (neyshabur2015norm; arora2018stronger; barron2019complexity). However, the empirical results in jiang2019fantastic; maddox2020rethinking; dziugaite2020search show that certain measures (e.g. pathnorm) positively correlate with the error when varying depth. Furthermore, according to maddox2020rethinking, the log pathnorm correlates with both width and depth significantly better than pathnorm. While no bounds have been derived that scale like these measures, it may be promising to consider work in this direction.

D.4 ⚫ jiang2019fantastic show that marginbased bounds that have been proposed to date appear to often anticorrelate with the generalization error when changing common optimization hyperparameters (dropout, learning rate, type of optimizer, etc.). On the other hand, in dziugaite2020search, it was demonstrated that some measures such as path norm, often predict certain properties well when changing learning rate (but they didn’t vary the other hyperparameters studied in jiang2019fantastic).

D.5 ✗ As far as the authors are aware, all of the margin-based bounds published to date for DNNs are vacuous (neyshabur2017exploring; neyshabur2015norm; arora2018stronger).

D.6 ✓ Margin-based bounds are often based on a notion of the norm of the weights that is relatively efficiently computable.

D.7 ✓ Proposed bounds are based on rigorous theorems with typically weak assumptions.
4.2.1.3 Bounds based on the neural tangent kernel
It was recently demonstrated that infinite-width DNNs, when trained by SGD with infinitesimal learning rate, evolve in function space as linear models with a kernel known as the neural tangent kernel (NTK) (jacot2018neural; lee2019wide). Several works, whether inspired by the NTK or not, have relied on the linearization of the dynamics of wide neural networks. The resulting bounds are similar to the norm-based bounds in the previous section 4.2.1.2, in that they tend to bound the Rademacher complexity of hypothesis subclasses characterized by a norm, usually the norm of the deviation of the weights from initialization. Analyses based on the NTK are often also able to guarantee convergence of the optimizer, so that we can also bound the empirical risk for a sufficiently large number of optimizer iterations. For instance, arora2019fine proved that, for a sufficiently wide two-layer fully connected neural network and sufficiently many (full-batch) gradient descent steps, we have
$$\epsilon(h) \leq \sqrt{\frac{2\,y^\top (H^\infty)^{-1} y}{m}} + O\!\left(\sqrt{\frac{\ln\left(m/(\lambda_0 \delta)\right)}{m}}\right), \qquad (15)$$

where $h$ is the hypothesis learnt by the DNN, $y$ is the vector of training labels, $H^\infty$ is the NTK Gram matrix for the inputs in the training set $S$, and $\lambda_0$ is a lower bound on the eigenvalues of $H^\infty$. See arora2019fine (Theorem 5.1) for the full statement of the result. The connection to the norm-based bounds comes from the NTK analysis they carried out, which showed that the Frobenius norm of the deviation of the weights from initialization is bounded by $\sqrt{y^\top (H^\infty)^{-1} y}$ plus higher-order terms. They then used this bound to obtain a data-dependent bound on the Rademacher complexity. NTK analyses could provide tighter bounds on the Rademacher complexity than those we saw in section 4.2.1.2, because the NTK likely captures the relation between the parameters and the function the network implements more precisely than analyses based on bounding the Lipschitz constant or similar quantities. More recently, cao2019generalization sharpened arora2019fine’s result with a similar bound which applies to networks of any depth trained with SGD. In their bound, $H^\infty$ is replaced by the NTK matrix of the deep network, which gives a tighter bound.
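The dominant term of eq. 15 is cheap to evaluate once the Gram matrix is known. The toy sketch below (a hypothetical 2×2 Gram matrix of our own choosing, not a real NTK) shows how the capacity $\sqrt{2\,y^\top H^{-1} y / m}$ is small when the labels align with the top eigenvector of $H$, and large when they align with a small eigenvalue, matching the label-corruption behaviour discussed in the desiderata below:

```python
import math

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ntk_capacity(H, y):
    """Dominant term of eq. 15: sqrt(2 * y^T H^{-1} y / m)."""
    m = len(y)
    x = solve(H, y)                              # H^{-1} y
    quad = sum(yi * xi for yi, xi in zip(y, x))  # y^T H^{-1} y
    return math.sqrt(2 * quad / m)

# Toy Gram matrix with eigenvalues 1.9 (eigvec [1,1]) and 0.1 (eigvec [1,-1]).
H = [[1.0, 0.9], [0.9, 1.0]]
print(ntk_capacity(H, [1.0, 1.0]))   # labels aligned with top eigenvector: small
print(ntk_capacity(H, [1.0, -1.0]))  # "noisy" labels: much larger
```

Corrupted labels tend to have more mass on small-eigenvalue directions of the kernel, which is why this capacity tracks label noise.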

D.1 ✓ The NTK-based capacity in arora2019fine was shown to increase with the amount of label corruption on MNIST and CIFAR, as was the capacity measure in cao2019generalization on MNIST.

D.2 ⚫ The authors are not aware of any work studying the dependence of NTK-based bounds on $m$. The explicit $1/\sqrt{m}$ factor is not necessarily indicative, because the numerator in the bound likely has non-trivial dependence on $m$.

D.3 ✗ Like norm-based margin bounds, current NTK-based bounds grow with depth (cao2019generalization). On the other hand, they show very little dependence on network width, for large enough width. This is a property of NTK analyses which matches empirical observations well.

D.4 ⚫ The authors are not aware of any work studying the dependence of NTK-based bounds on the optimization algorithm. Most analyses focus on vanilla versions of SGD with specific hyperparameter choices which help the theoretical analysis. Other optimization algorithms have been shown to have NTK limits (e.g. momentum in lee2019wide), but we are not aware of generalization bounds for these.

D.5 ⚫ The bounds in cao2019generalization are non-vacuous (at least up to the dominant term), but they are not very tight.

D.6 ✓ The NTK of fully connected networks has an analytical form which is efficiently computable (lee2019wide). However, for more complex architectures, it may be necessary to estimate the NTK limit (when it exists, see yang2019scaling) via Monte Carlo (novak2019neural).

D.7 ✓ Proposed bounds are based on rigorous theorems, though the assumptions on the algorithms and the width are sometimes hard to match in practice (e.g. widths larger, or learning rates smaller, than those used in practice).
4.2.1.4 Sensitivity-based bounds
Many generalization error bounds recently developed and applied to deep learning are based on the idea that neural networks whose outputs (or loss function values) are robust to perturbations in the weights may generalize better. This is linked to the observed phenomenon that flatter minima empirically generalize better than sharper minima (hochreiter1997flat; Hinton:1993:KNN:168304.168306; zhang2018energy; keskar2016large). At an informal level, it has been argued that the reason for this correlation is that flatter minima may correspond to simpler functions (hochreiter1997flat; wu2017towards). In particular, hochreiter1997flat link flatness to generalization via the idea of minimum description length (MDL) (rissanen1978modeling). MDL generalization bounds are formally equivalent to the simple SRM bound in eq. 10, where the complexity penalty is often interpreted as the length of the string representing the hypothesis under some prefix-free code (shalev2014understanding). A more sophisticated argument linking flatness to generalization is found in the data-dependent SRM analysis of shawe1997pac. As we mentioned in section 4.2.1.1, they proved generalization bounds where the capacity was mainly controlled by the volume of a region (which they took to be a ball) in weight space in which the training error was zero. A larger volume corresponds to a flatter minimum. Note that one difference with previous work is that shawe1997pac define flatness in terms of the classification error, rather than the loss function (for which one typically uses the Hessian as a measure of flatness).
Recent theoretical works studying the link between sensitivity, flatness, and generalization focus on PAC-Bayes analyses (neyshabur2017exploring; jiang2019fantastic). In the PAC-Bayes bound eq. 13, the KL term is typically smaller when the posterior $Q$ has a larger variance, so that it can "overlap" with the prior more. On the other hand, to control the average training error, we need $Q$ to put most of its weight on regions of low error. If we combine these two considerations, the bound typically predicts best generalization for large regions of weight space with low error (flat minima). However, as shown in neyshabur2017exploring, a more careful look at the PAC-Bayesian analysis suggests that flatness alone is not sufficient to control capacity, and should be complemented with some other measure such as the norm of the weights. In particular, if $Q$ is taken to be a Gaussian around the weights found after training, and $P$ is taken to be a Gaussian around the origin, then the KL term also grows with the norm of the weights. This can be seen in the bound proposed in neyshabur2017exploring, which states that, for any $\delta > 0$, with probability at least $1-\delta$ over the training set $S$, we have
(16) $\mathbb{E}_u[L(f_{w+u})] \leq \hat{L}(f_w) + \underbrace{\mathbb{E}_u[\hat{L}(f_{w+u})] - \hat{L}(f_w)}_{\text{expected sharpness}} + 4\sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{2m}{\delta}}{m-1}}$
where $L$ is the risk; $f_w$ and $w$ are the function and weights, respectively, produced by the network after training; and $f_{w+u}$ is the function obtained by perturbing the weights of the network, $w$, by the noise $u$. The distribution of the perturbation $u$ is a hyperparameter of the bound. The expected-sharpness term after the inequality, $\mathbb{E}_u[\hat{L}(f_{w+u})] - \hat{L}(f_w)$, measures how much the loss increases on average under the perturbation $u$ of the weights. neyshabur2017exploring perform experiments showing that this bound correlates well with the true error when varying data complexity and number of training examples, but not when varying the amount of overparametrization. By optimizing the posterior of a PAC-Bayes bound conceptually similar to eq. 16, dziugaite2017computing also obtained bounds on the expected error under weight perturbations. Their results were noteworthy because the bounds were non-vacuous.
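The expected-sharpness term can be estimated by simple Monte Carlo sampling of Gaussian weight perturbations. A minimal sketch, on two hypothetical toy loss surfaces (a flat and a sharp quadratic minimum; all names and values are illustrative, not from neyshabur2017exploring):

```python
import numpy as np

def expected_sharpness(loss_fn, w, sigma, n_samples=500, rng=None):
    """Monte Carlo estimate of E_u[L(w + u)] - L(w) for Gaussian
    perturbations u ~ N(0, sigma^2 I) -- the 'expected sharpness'
    term appearing in PAC-Bayes bounds of the neyshabur2017exploring type."""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(w)
    perturbed = [loss_fn(w + sigma * rng.standard_normal(w.shape))
                 for _ in range(n_samples)]
    return float(np.mean(perturbed)) - base

# Toy loss surfaces: a flat and a sharp quadratic minimum at w = 0.
flat  = lambda w: 0.1 * (w ** 2).sum()
sharp = lambda w: 10.0 * (w ** 2).sum()

w0 = np.zeros(20)
print(expected_sharpness(flat, w0, sigma=0.1),
      expected_sharpness(sharp, w0, sigma=0.1))
```

As expected, the sharper minimum yields a much larger expected sharpness at the same perturbation scale, which is the quantity the bound penalizes.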
The bounds in neyshabur2017exploring; dziugaite2017computing bound the expected value of the generalization error under perturbation of the weights, rather than the generalization error of the original network. Obtaining bounds on the latter is addressed by recent work on deterministic PAC-Bayes bounds (nagarajan2018deterministic). However, their bounds are vacuous, and follow the wrong trends when varying depth and width.
In jiang2019fantastic, it was found empirically that a worst-case measure of sharpness, which measures the loss change along the worst weight perturbation of a certain magnitude (very similar to the one proposed in keskar2016large), gives the best correlation (and best results in their causal analysis) among the many measures they tested.
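A crude sketch of such a worst-case sharpness measure is below. keskar2016large solve the inner maximization with an inexact optimizer; here random search over box-constrained perturbations is used purely as an illustrative, gradient-free stand-in, and the toy loss surfaces are our own:

```python
import numpy as np

def worst_case_sharpness(loss_fn, w, alpha, n_trials=2000, rng=None):
    """Random-search estimate of a keskar2016large-style sharpness:
    the largest loss increase over perturbations u with |u_i| <= alpha.
    (The original work uses an inexact optimizer for this maximization;
    random search is only an illustrative stand-in.)"""
    rng = rng or np.random.default_rng(0)
    base = loss_fn(w)
    best = 0.0
    for _ in range(n_trials):
        u = rng.uniform(-alpha, alpha, size=w.shape)
        best = max(best, loss_fn(w + u) - base)
    return best

flat  = lambda w: 0.1 * (w ** 2).sum()
sharp = lambda w: 10.0 * (w ** 2).sum()
w0 = np.zeros(10)
print(worst_case_sharpness(flat, w0, alpha=0.1),
      worst_case_sharpness(sharp, w0, alpha=0.1))
```

Unlike the expected sharpness, this measure depends on the single worst direction within the perturbation box, which is why it can behave differently when width is varied.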
arora2018stronger developed another approach to prove generalization bounds based on the robustness of neural networks to perturbations of the weights. They showed that if the effect of perturbing the weights does not grow too much as it propagates through the layers (a condition which they formalized through a series of measurable quantities), then the network can be compressed to a network with fewer parameters, for which a generalization error bound can be given that is tighter than other proposed bounds, although still vacuous. They found that their bound correlates with the true error as it decreases during training. However, this bound has the disadvantage that it applies to the compressed network only, and not to the original network.
Recently, banerjee2020randomized have developed novel deterministic bounds based on a de-randomization of a PAC-Bayes bound. Their bound is also based on the flatness of the minimum found after training (measured by Hessian eigenvalues), and also takes into account the distance moved in parameter space. They provided evidence that their bound correlates well with the test error when varying the training set size and label corruption. However, they did not study the tightness of their bound (which depends on certain smoothness constants of a linearized version of the nonlinear DNN).

D.1 ✓ neyshabur2017exploring; banerjee2020randomized provided some evidence that their PAC-Bayesian bounds correlate with the true error when varying the data complexity. The authors are not aware of similar results for the other measures.

D.2 ✓ neyshabur2017exploring; banerjee2020randomized; dziugaite2020search have shown evidence that different sensitivity-based PAC-Bayesian bounds correlate with the true error when varying the training set size. However, a more quantitative comparison could be done, over more datasets and different architectures.

D.3 ⚫ Both the worst-case and expected sharpness measures appear to correlate well with the true error when varying depth. However, only the worst-case sharpness appears to correlate with the error when varying width (neyshabur2017exploring; jiang2019fantastic). Furthermore, dziugaite2020search show that although the average correlation with depth and width is good for some PAC-Bayes and sharpness measures, they are not robust, and all of them fail for a significant number of experiments. The bound in arora2018stronger depends on quantities whose dependence on the architecture is hard to predict; however, the bound's explicit dependence on depth suggests that it may grow linearly with depth, unlike the empirical observations. Recently, maddox2020rethinking showed that a measure of flatness known as effective dimensionality correlates better with the error than PAC-Bayes measures when varying width and depth, suggesting that it may be a better measure than PAC-Bayes-based flatness measures for understanding generalization.

D.4 ✓ The worst-case sharpness appears to correlate well with the true error when varying several algorithm hyperparameters, while other sharpness measures correlate somewhat worse (jiang2019fantastic). In dziugaite2020search, it was shown that some flatness measures indeed correlate well with the error over most experiments.

D.5 ⚫ Although the sharpness bounds in neyshabur2017exploring; jiang2019fantastic are likely vacuous, dziugaite2017computing showed that by optimizing the PAC-Bayesian posterior over a large family of Gaussians, non-vacuous bounds could be obtained.

D.6 ⚫ Some sharpness bounds studied in jiang2019fantastic are efficiently computable. However, the more advanced ones, such as those in dziugaite2017computing, require significant computational expense.

D.7 ⚫ The bounds in neyshabur2017exploring; dziugaite2017computing; arora2018stronger are based on rigorous theorems. However, they only apply either to the expected error under random perturbations of the weights, or to the compressed network. The deterministic PAC-Bayes bounds in nagarajan2018deterministic; banerjee2020randomized apply to the deterministic error of the original network, but may be vacuous or not very tight. The worst-case sharpness measure which appears to correlate best with the generalization error (jiang2019fantastic) lacks a rigorous theorem that explains this correlation.
4.2.2 Other PAC-Bayes bounds
There are many recent works applying PAC-Bayesian ideas to obtain generalization error bounds in novel ways. zhou2018non proved non-vacuous generalization error bounds on compressed networks trained on large datasets. However, their bounds are still very loose, and their correlation with the true error has not been studied yet. dziugaite2018data extended the PAC-Bayesian analysis to include data-dependent priors under the assumption that they are close to differentially private priors. Their bounds are non-vacuous and apply to the expected value of the generalization error after training with SGLD (welling2011bayesian), but they are very computationally expensive, and have only been tested on a small synthetic dataset.
4.3 Other algorithm-dependent bounds
We now consider other types of algorithm-dependent bounds, which are not based on non-uniform convergence. The main class of such bounds are stability-based bounds, under which we include compression and algorithmic stability bounds. These include both data-independent and data-dependent bounds. The data-independent bounds suffer from many of the same pitfalls as VC dimension bounds, as DNNs show generalization for some datasets but not others, while the data-dependent bounds are more promising.
4.3.1 Stability-based bounds
Stability-based bounds offer an alternative way to obtain algorithm-dependent bounds, different from the non-uniform convergence SRM-like bounds. In fact, they even allow one to obtain data-independent algorithm-dependent bounds. Stability analyses show that if the output of a learning algorithm depends only weakly on the training set, then it can be shown to generalize. One approach of this kind was developed by littlestone1986relating, who derived compression bounds. They obtain data-independent bounds on the generalization error for learning algorithms whose output can be computed via a fixed function of only $k$ out of the $n$ training examples (which $k$ examples they are can depend on the training sample). For the realizable case, the bound is
(17) $\epsilon(h) = O\!\left(\dfrac{k \log(n/\delta)}{n}\right)$
See shalev2014understanding for the formal statement and proof. These bounds are based on the general concept of 'stability' described above: if the output of the algorithm only depends on a small subset of the training examples, then the output is insensitive to changes in most of the training examples. A compression bound has recently been developed for two-layer neural networks trained with SGD on linearly separable data, based on a proof that in this case SGD converges in a bounded number of non-zero weight updates, which therefore gives a bound on $k$ (brutzkus2018sgd).
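The scaling of the compression bound is easy to sanity-check numerically. The sketch below evaluates the schematic $O(k\log(n/\delta)/n)$ form only, ignoring constants (which depend on the exact theorem used); the value of $k$ is a hypothetical illustration of the brutzkus2018sgd scenario, where the number of non-zero SGD updates is bounded independently of $n$:

```python
import math

def compression_bound(n, k, delta=0.05):
    """Schematic realizable compression bound, of order k*log(n/delta)/n.
    k = number of training examples the output hypothesis depends on,
    n = training set size.  Constants are omitted (O(.) form only)."""
    assert n > k, "bound is only meaningful when n >> k"
    return k * math.log(n / delta) / n

# brutzkus2018sgd-style scenario: k is bounded independently of n,
# so the bound shrinks (roughly like log(n)/n) as the training set grows.
for n in (1_000, 10_000, 100_000):
    print(n, compression_bound(n, k=50))
```

The point of the sketch is the qualitative behaviour: the bound decreases as $n$ grows with $k$ fixed, and grows as the compression size $k$ grows.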
The most common notion of stability, called algorithmic stability, was related to generalization by bousquet2002stability, and considers how sensitive the output of the learning algorithm is to the removal of a single example from the training sample. Most work on algorithmic stability has focused on the data-independent notion of uniform stability, which has been used to obtain data-independent bounds for SGD (pmlrv48hardt16; mou2018generalization).
Because compression bounds and uniform stability are both data-independent, they cannot capture the crucial data-dependence of generalization in deep learning which was pointed out, for example, by zhang2016understanding. To this end, some recent extensions have looked at data-dependent notions of stability. kuzborskij2017data applied this idea to obtain generalization error bounds for SGD-trained models. They obtain bounds of the form
(18) $\mathbb{E}\big[R(w_T) - \hat{R}(w_T)\big] \lesssim \dfrac{c\,\sigma}{n}\sqrt{T\,\big(R(w_1) - R^*\big)}$
where the expectation is also taken over the randomness in the algorithm (as SGD is a stochastic algorithm), $c$ is a constant related to the step size of SGD, $w_1$ is the initial weight vector of the neural network, $R^*$ is the minimum risk achievable by the hypothesis class of the algorithm, $T$ is the training time, and $\sigma$ is a bound on the variance of the SGD gradients.
The bound in eq. 18 has several limitations, which we discuss now. First of all, it is a bound on the expected value of the generalization error, which does not immediately imply a bound that holds both with high probability and with logarithmic dependence on the confidence parameter $\delta$, as the other bounds studied here do (shalev2010learnability; feldman2019high). (Note that the Markov inequality implies a high-probability bound, but one with a polynomial dependence on the confidence parameter, which is expected to be far from optimal.) The bound applies to smooth convex losses, but the authors also provide a more complex bound for smooth non-convex losses. The smoothness is an important limitation, as most neural networks in practice use ReLU activations, making the loss surface non-smooth. However, in their experiments they use a CNN with max pooling, which gives a non-smooth loss surface, and their bounds still work well empirically. The bound also requires an estimate of $R(w_1)$. They estimate this with a validation set, which makes the bound dependent on more than the training set $S$ alone. However, in practice the initialization is usually random and independent of the training data, which means this quantity could be estimated from the empirical loss on $S$. The bound cannot be directly applied to the classification error, which is not Lipschitz, but it can be applied to the cross-entropy loss, which in turn implies an upper bound on the classification error. Perhaps the most serious limitation of this stability bound is that it assumes a single pass over the data, which is not the usual case in practice, as training and generalization error often decrease by training over several passes of the data. Some recent works have extended the data-dependent stability analysis of SGD-trained neural networks in different directions. london2017pac; li2019generalization combined stability bounds with the PAC-Bayesian approach discussed above. zhou2019understanding proved data-dependent stability bounds that apply to SGD with multiple passes over the data. However, their bound increases with training time (although logarithmically, rather than polynomially as in pmlrv48hardt16; mou2018generalization), contradicting the empirical result that generalization error appears to plateau with training time (hoffer2017train). However, their results have not yet been empirically tested, so it is hard to evaluate these bounds on the desiderata.

D.1 ✓ The data-dependent stability bound in zhou2019understanding correlated with the true error when varying the amount of label corruption on three different datasets. li2019generalization also showed that their Bayes-stability analysis gives bounds that are larger for randomly labelled CIFAR10 than for uncorrupted CIFAR10.

D.2 ⚫ The explicit dependence of many stability bounds on the training set size $n$ is given by the classical $O(1/n)$ or $O(1/\sqrt{n})$ rates. However, data-dependent bounds contain quantities which may change with $n$ in complicated ways. Furthermore, the bounds for non-convex losses in kuzborskij2017data have a more complicated data-dependence, with an explicit power-law dependence on $n$. kuzborskij2017data show a good correlation between their bound and the empirical generalization gap. However, this is only done for one-pass (online) SGD. To the best of our knowledge, no study has compared the dependence of the other bounds with the true error for the usual case where SGD is run over multiple passes to reach low training error.

D.3 ⚫ The data-dependent stability bounds depend on empirical quantities of the loss surface and the behaviour of SGD, both of which can in principle be affected by the choice of architecture. However, as far as the authors know, this dependence has not been explored.

D.4 ✗ Most stability bounds grow with training time. However, empirically the opposite correlation is found, with longer training times leading to better generalization (hoffer2017train; jiang2019fantastic). charles2018stability showed situations in which gradient descent (GD) is not uniformly stable but SGD is. However, whether SGD really generalizes better than GD is still a controversial topic (hoffer2017train).

D.5 ⚫ Many stability bounds (pmlrv48hardt16; mou2018generalization; zhou2019understanding) grow with training time, and thus become vacuous for sufficiently long training. kuzborskij2017data have shown remarkably tight expected generalization gap bounds, which however only apply to one-pass SGD. These results are very promising, but further empirical analysis, and work on tight bounds for multi-pass SGD, is still needed.

D.6 ✓ Data-independent stability bounds are typically easy to compute. Data-dependent ones like the one in kuzborskij2017data are harder (depending on empirical quantities like the Hessian and gradient sizes), but still applicable to reasonably sized problems.

D.7 ⚫ Proposed bounds are based on rigorous theorems. However, these often rest on assumptions which do not hold in common practice (e.g. a single pass over the data, smoothness of the loss function, linear separability).
4.4 Other bounds and generalization measures
jiang2018predicting have recently empirically studied a measure of generalization based on features of the distribution of margins at different hidden layers. Their measure shows significantly better correlation with the error, as data complexity and architecture are varied, than the margin-based measures in bartlett2017spectrally. However, they provide limited results regarding the predictive power when changing individual features of the data or architecture, and instead provide an aggregate correlation score when they are all changed simultaneously. The success of the measure explored in jiang2018predicting may be related to the correlation observed in valle2018deep; mingard2020sgd between the prior probability of functions for Bayesian DNNs and the critical sample ratio (CSR), which was proposed as a complexity measure in arpit2017closer. The critical sample ratio is an aggregate measure of the distances between input points and the decision boundary, like the distribution of input margins in jiang2018predicting. Furthermore, the correlation between the prior probability and generalization is established in valle2018deep; mingard2020sgd, as well as by our PAC-Bayes bound in section 5, which helps explain why CSR may correlate with generalization. wei2019data; wei2019improved offered theoretical bounds based on a similar idea to jiang2018predicting: they considered extending the notion of margins to all the layers of the DNN. However, no empirical evaluation was presented.
Recently, and in response to the work of nagarajan2019uniform, negrea2019defense proposed a method to apply uniform convergence to cases where it previously produced vacuous generalization gap bounds. The idea is to show that an algorithm produces a hypothesis with generalization and empirical errors close to those of some hypothesis in a class with a uniform convergence property. This work has yet to be applied to deep learning, but could offer an interesting direction.
Finally, several measures investigated by jiang2019fantastic; maddox2020rethinking; dziugaite2020search, some of which we have discussed in the sections on margin- and sensitivity-based bounds, do not yet correspond to rigorous analyses of generalization, but some predict generalization better than existing bounds, and so are promising directions for future, more rigorous analyses of generalization.
4.5 Generalization error predictions for specific data distributions
There is recent work on predicting the generalization error of DNNs by making assumptions on the data distribution $D$, rather than relying on frequentist PAC bounds, which typically make no assumption on $D$ (beyond realizability). spigler2019asymptotic; bordelon2020spectrum study kernel regression with misspecified priors, based on sollich2002gaussian (a similar analysis from a physics-inspired perspective has been presented in DBLP:journals/corr/abs190605301). bordelon2020spectrum further apply this idea to the neural tangent kernel (NTK) of a fully connected network, which approximates the behaviour of SGD-trained DNNs in the infinite-width and infinitesimal-learning-rate limit (jacot2018neural), and which seems to work well for finite-width but wide DNNs (lee2019wide). They apply this to MNIST by estimating the NTK eigenspectrum on a sample of MNIST, and then training the DNN on smaller samples. Their predicted generalization error closely follows the observed error of the SGD-trained DNN. As far as the authors are aware, this is one of the most accurate predictions of the generalization error of DNNs based on well-established theory.
One of the limitations of the analysis in bordelon2020spectrum is that it relies on knowing the data distribution, and in particular the eigenvalues of the NTK, and the eigenspectrum of the target function (with respect to the eigenbasis of the NTK). These can be estimated using a sufficiently large sample of the data, but it is not discussed in bordelon2020spectrum how big the sample needs to be for the estimate to be accurate. They use a sample larger than the training set, which therefore makes this predictor fall outside the requirements of the kinds of predictions we have been considering (which only depend on the training set $S$). However, the approach offers an analytical theory of generalization which can help with interpretability and with understanding which properties of a DNN architecture lead to generalization for a particular dataset. The other limitation of the work in bordelon2020spectrum is that the analysis only applies to MSE loss, which is not commonly used for classification (though training with the two losses often results in DNNs with similar learned functions (mingard2020sgd)).
5 Marginal-likelihood PAC-Bayesian generalization error bound
In the previous section, we saw that algorithm-independent or data-independent bounds are clearly insufficient to explain the generalization performance of DNNs, because the hypothesis class of DNNs is too expressive and the generalization strongly depends on the dataset, respectively. Furthermore, the main approaches for algorithm-dependent bounds are based on non-uniform convergence, which has been shown to have fundamental limitations in its ability to predict generalization in SGD-trained DNNs for some datasets (nagarajan2019uniform). Although there are ways around this limitation (see the discussion in section 3.3.4), it suggests that looking at other approaches to obtain generalization bounds may be promising. Non-uniform stability bounds offer an interesting alternative to non-uniform convergence, but their empirical success so far is still limited.
Here we present a new deterministic realizable PAC-Bayes bound which applies to a DNN trained using Bayesian inference, with high probability over the posterior. We work in the same setup as valle2018deep and mcallester1998some. We consider binary classification, and a space of functions or hypotheses with codomain $\{0,1\}$. We consider a "prior" $P$ over the hypothesis space, and an algorithm which samples hypotheses according to the Bayesian posterior with a 0-1 likelihood. To recall, we define the generalization error as the probability of misclassification upon a new sample, $\epsilon(h) = \mathbb{P}_{x \sim D}[h(x) \neq t(x)]$, where $t$ is the target function. In section A.1, we prove the following theorem:

Theorem 5.1.
(marginal-likelihood PAC-Bayes bound)
For any distribution $P$ on any hypothesis space $\mathcal{H}$ and any realizable distribution $D$ on a space of instances, we have, for $\delta, \gamma \in (0,1]$, that with probability at least $1-\delta$ over the choice of sample $S$ of $m$ instances, that with probability at least $1-\gamma$ over the choice of $h$:

$\epsilon(h) \leq \dfrac{\ln\frac{1}{P(U)} + \ln\frac{1}{\delta} + \ln\frac{1}{\gamma}}{m}$

where $h$ is chosen according to the posterior distribution $Q(S)$, $U$ is the set of hypotheses in $\mathcal{H}$ consistent with the sample $S$, and where $P(U) = \sum_{h \in U} P(h)$.
The proof is presented in section A.1. It closely follows that of the original PAC-Bayesian theorem by McAllester, with the main technical step relying on the quantifier reversal lemma of mcallester1998some. Note that the bound is essentially the same as that of langford2001bounds, except for the fact that it holds in probability and adds an extra term dependent on the confidence parameter $\gamma$, which is usually negligible, but may be important when considering the effect of optimizer choice. The quantity $P(U)$ corresponds to the marginal likelihood, or Bayesian evidence, of the data $S$, and we will also denote it by $P(S)$ to simplify notation.
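To make the role of the marginal likelihood concrete, the following toy sketch evaluates the schematic leading form of the bound, $(\ln\frac{1}{P(U)} + \ln\frac{1}{\delta} + \ln\frac{1}{\gamma})/m$ (the exact lower-order terms of theorem 5.1 are in section A.1), for a hypothetical class of threshold functions on twelve points with a uniform prior. All names and numerical values here are our own illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
X_all = np.arange(12)                    # instance space {0, ..., 11}
target = (X_all < 6).astype(int)         # target function t(x) = 1[x < 6]
thresholds = np.arange(13)               # hypotheses h_t(x) = 1[x < t]

def consistent(t, xs):
    """Is h_t consistent with the target's labels on the sample xs?"""
    return np.all((xs < t).astype(int) == (xs < 6).astype(int))

prior = np.full(13, 1 / 13)              # uniform "prior" P over hypotheses

m = 8
S = rng.choice(12, size=m, replace=True)           # training sample
U = [t for t in thresholds if consistent(t, S)]    # consistent hypotheses
P_U = sum(prior[t] for t in U)                     # marginal likelihood P(U)

delta = gamma = 0.05
bound = (np.log(1 / P_U) + np.log(1 / delta) + np.log(1 / gamma)) / m

# True generalization error of each posterior sample (posterior is
# uniform on U here, since the prior is uniform and the likelihood 0-1).
errors = [np.mean((X_all < t).astype(int) != target) for t in U]
print(P_U, float(np.mean(errors)), float(bound))
```

A larger marginal likelihood $P(U)$ (more prior mass on hypotheses consistent with $S$) gives a smaller bound, which is the mechanism behind the claim that priors biased towards functions like the target yield good generalization.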
In valle2018deep, the authors interpreted the Bayesian posterior probability as approximating the probability with which the stochastic algorithm (e.g. SGD) outputs hypothesis $h$ after training. The preceding bound relaxes this assumption, because it shows that, in some sense, the bound holds for "almost all" of the zero-error region of parameter space. More precisely, it holds with high probability over the posterior. This suggests that SGD may not need to approximate Bayesian inference as closely for this bound to be useful. Nevertheless, mingard2020sgd gave empirical results showing that, for DNNs, the distribution over functions that SGD samples from approximates the Bayesian posterior rather closely. A fully rigorous generalization error bound for DNNs would need further analysis of SGD dynamics, but we believe these theoretical and empirical results strongly suggest that the PAC-Bayes bound should be applicable to SGD-trained DNNs.
Because it applies to the Bayesian posterior only, the bound in theorem 5.1 does not apply universally over a large family of posteriors, as standard deterministic PAC-Bayes bounds do, which can be shown to sometimes give loose bounds (nagarajan2019uniform). Furthermore, as we will show in section 6.2, the bound is in a certain sense asymptotically optimal in the limit of large training set size.
We expect our bound to give significantly tighter results than previous PAC-Bayes bounds applied to DNNs because, rather than working with parameters, our bound works directly with posteriors and priors in function space. Since the parameter-function map (valle2018deep) of DNNs is many-to-one, with a lot of parameter redundancy, it is not hard to construct situations where the KL divergence between a parameter-space posterior and prior is high, but that between the induced posterior and prior in function space is low. In fact, in section A.3, we show that the following inequality holds

(19) $KL(Q_f \,\|\, P_f) \leq KL(Q_w \,\|\, P_w)$

which implies that it is always better (or at least not worse) to consider PAC-Bayes bounds in function space for parametrized models, if possible. Furthermore, in section 7, we will empirically verify that our bound gives good predictions for SGD-trained DNNs, and satisfies most of our desiderata for a generalization error bound. Thus our empirical results corroborate the expectation of better agreement above.
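This inequality (an instance of the data-processing inequality for KL divergence) can be checked numerically on a toy many-to-one parameter-function map; the map and the distributions below are arbitrary illustrative choices:

```python
import numpy as np

def kl(q, p):
    """KL divergence between discrete distributions q and p."""
    q, p = np.asarray(q, float), np.asarray(p, float)
    mask = q > 0
    return float(np.sum(q[mask] * np.log(q[mask] / p[mask])))

# Many-to-one parameter-function map: 6 "parameter settings" realize
# only 2 distinct functions (phi[w] is the function index of parameter w).
phi = np.array([0, 0, 0, 0, 1, 1])

P_w = np.array([1, 1, 1, 1, 1, 1]) / 6   # parameter-space prior
Q_w = np.array([4, 1, 1, 0, 0, 0]) / 6   # parameter-space posterior

def push_forward(dist):
    """Induced distribution over functions under the map phi."""
    out = np.zeros(2)
    for w, f in enumerate(phi):
        out[f] += dist[w]
    return out

P_f, Q_f = push_forward(P_w), push_forward(Q_w)

# KL in function space never exceeds KL in parameter space.
print(kl(Q_w, P_w), kl(Q_f, P_f))
```

Here the parameter-space KL is inflated by how the posterior distributes mass among parameters realizing the *same* function, while the function-space KL only sees the total mass per function; this is the parameter redundancy the text refers to.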
6 Optimality of data-dependent bounds
6.1 General definitions for optimality
In section 4.1.1 we saw that VC dimension bounds are provably optimal (up to a constant) among the set of data-independent, algorithm-independent uniform bounds. The question of optimality is more difficult for the other types of bounds. As will be shown below, the optimal data-dependent bound may depend on a chosen prior over data distributions, as well as on the algorithm. We will not consider notions of optimality for data-independent non-uniform, or algorithm-dependent bounds (e.g. uniform stability bounds). Instead, the notion of optimality we treat here is for data-dependent, algorithm-dependent bounds.
To explore this notion of optimality, we make the following two definitions, using the simplified notation $B(m, S, \delta)$ for the right-hand side of the inequality in eq. 2, which we refer to as the value of the bound:
Definition 6.1.
A PAC bound of the form in eq. 2 with value $B(m, S, \delta)$ is called distribution-admissible or distribution-Pareto optimal for algorithm $A$ if there does not exist another bound for algorithm $A$ with value $B'(m, S, \delta)$ such that $B'(m, S, \delta) \leq B(m, S, \delta)$ for all $m$, for all $S$, and for all $\delta$, and $B'(m, S, \delta) < B(m, S, \delta)$ for some $m$, $S$, $\delta$.
Definition 6.2.
A PAC bound of the form in eq. 2 with value $B(m, S, \delta)$ is called optimal with respect to a prior over data distributions, for algorithm $A$, for a training set size $m$ and confidence level $\delta$, if it minimizes the expected value of $B(m, S, \delta)$ (averaged over data distributions drawn from the prior, and training sets $S$ drawn from each distribution) over valid PAC bounds for algorithm $A$.
We also say that the bound is optimal with respect to a prior over data distributions, for algorithm $A$, if it is optimal in the above sense for all $m$ and $\delta$.
Any of these definitions can be extended to require optimality over a family of algorithms rather than a single one. Analogous definitions can also be made for data-dependent uniform bounds like eq. 9. For hypothesis classes where the Rademacher complexity dominates the remaining terms (in expectation over $S$, for any data distribution), so that the lower bound closely matches the upper bound, not only is eq. 9 (approximately) distribution-Pareto optimal, it is the unique distribution-Pareto optimal uniform generalization gap bound, and therefore the optimal uniform generalization gap bound for any prior (up to lower-order terms).
For non-uniform data-dependent bounds, the authors are not aware of any result showing optimality. However, as we saw in section 4.2.1.1, non-uniform bounds are typically based on choosing a "prior", and the expected value of the bound for different data distributions depends heavily on the choice of this "prior". This suggests that perhaps there is not a unique notion of optimality (i.e. the optimal bound could depend on the assumed true prior over data distributions), but it does not prove this negative result, because these are only upper bounds. We will see in the next section that for some pairs of algorithms and priors over data distributions, the marginal-likelihood PAC-Bayes bound asymptotically matches (up to a constant) the expected generalization error (which is a lower bound on the expected value of any generalization error upper bound) and is thus asymptotically optimal (up to a constant) according to our definition. This result, combined with the empirical result that the average value of the bound depends on the data distribution, also implies that the optimal PAC bound, for a fixed algorithm, depends on the prior over data distributions.
6.2 Bayesian optimality of the marginal-likelihood PAC-Bayesian bound
In this section, we present a theorem showing that, under certain conditions on the algorithm and distribution, the marginal-likelihood PAC-Bayes bound is tight. In particular, we will show that if the generalization error decreases as a power law with training set size $m$, the bound is asymptotically optimal up to a constant.
We again consider binary classification. Let $S_m$ be a training set of size $m$. Consider sampling one more instance to obtain $S_{m+1}$. We think of $S_m$ and $S_{m+1}$ as random variables with distributions determined by $D$. We also define $\rho(S_m, x)$ to be the probability that a Bayesian posterior conditioned on a training set $S_m$ and a test point $x$ predicts the incorrect label. We will denote $\rho_m = \rho(S_m, x)$, where the dependence on $x$ is left implicit. Note that averaging $\rho_m$ over $x$ gives the average generalization error of a learning algorithm that returns a function sampled from the Bayesian posterior conditioned on the training set $S_m$ (averaged both over posterior samples and the test point). We will denote this average generalization error by $\epsilon(m)$, where we omit the dependence on the algorithm and on $S_m$ for brevity. In the following we denote with angle brackets $\langle \cdot \rangle$ an average over $S_{m+1}$ (which includes an average over $S_m$). Note that $\langle \rho_m \rangle = \langle \epsilon(m) \rangle$, and this is also the same as $\epsilon(m)$ averaged over $S_m$, because $\epsilon(m)$ already includes an average over one test point sampled from $D$.
Lemma 6.3.
Assume $\langle\epsilon_m\rangle \to 0$ as $m \to \infty$. Then, assuming that for some $c > 0$, for all $m$, $\epsilon_m < 1 - c$, there exists a constant $C > 0$, such that, as $m \to \infty$,
(20)  $\langle -\log\left(1 - \epsilon_m\right)\rangle \leq C\,\langle\epsilon_m\rangle$
Furthermore, if $\operatorname{Var}(\epsilon_m) \in o(\langle\epsilon_m\rangle)$, then $C = 1$.
The proof is found in section A.2.
Theorem 6.4.
Assume $\langle\epsilon_m\rangle \sim c\, m^{-\alpha}$ as $m \to \infty$, with $0 < \alpha < 1$. Then, assuming that for some $c' > 0$, for all $m$, $\epsilon_m < 1 - c'$, there exists a constant $C > 0$, such that, as $m \to \infty$,
(21)  $\dfrac{\langle -\log P(S')\rangle}{m+1} \sim \dfrac{C}{1-\alpha}\,\langle\epsilon_m\rangle$
Furthermore, if $\operatorname{Var}(\epsilon_m) \in o(\langle\epsilon_m\rangle)$, then $C = 1$.
Proof.
We can write $\langle -\log P(S')\rangle$ using a telescoping sum over nested training sets $S_0 \subset S_1 \subset \dots \subset S_{m+1} = S'$, where $S_k$ has size $k$:
(22)  $\langle -\log P(S')\rangle = \left\langle -\sum_{k=0}^{m} \log \frac{P(S_{k+1})}{P(S_k)} \right\rangle$
(23)  $= \sum_{k=0}^{m} \langle -\log\left(1 - \epsilon_k\right)\rangle$
(24)  $\sim C \sum_{k=0}^{m} \langle\epsilon_k\rangle \sim \frac{C\,c}{1-\alpha}\,(m+1)^{1-\alpha}$
In eq. 23 we used that $P(S_{k+1})/P(S_k)$ is the Bayesian posterior probability of the correct label of the $(k+1)$-th point, which averages to $1 - \epsilon_k$; in eq. 24 we applied lemma 6.3 and the power-law assumption. Dividing by $m+1$ gives eq. 21. ∎
In theorem 6.4, $\langle\epsilon_m\rangle$ is the actual expected error averaged over training sets of size $m$. This is a lower bound for the training-set-averaged value of any upper bound on the expected error, such as the PAC-Bayes bound (mcallester1998some). It is also a lower bound on the average value of high-probability bounds like theorem 5.1. On the other hand, the PAC-Bayes bound has an expected value with a leading-order behaviour given by
(25)  $\dfrac{\langle -\log P(S)\rangle}{m} \sim \dfrac{C}{1-\alpha}\,\langle\epsilon_m\rangle$
which, up to a constant, matches the asymptotic behaviour of the lower bound given by $\langle\epsilon_m\rangle$ itself. Therefore, we can conclude that the PAC-Bayes bound of section 5 is asymptotically optimal (up to a constant, which may be computable a priori) for pairs of priors and learning algorithms that satisfy the theorem's assumptions.
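This asymptotic matching can be illustrated numerically. The sketch below assumes, as a toy model, a deterministic per-point error $\epsilon_k = c\,(k+1)^{-\alpha}$ (so the variance condition holds and $C = 1$); the function name and constants are ours, not the paper's:

```python
import math

def bound_to_error_ratio(c, alpha, m):
    """Ratio of -log P(S') / (m+1), telescoped over per-point errors
    eps_k = c * (k+1)**(-alpha), to the theorem's prediction
    eps_m / (1 - alpha). With zero variance (C = 1) the ratio should
    approach 1 as m grows."""
    eps = [c * (k + 1) ** (-alpha) for k in range(m + 1)]
    lhs = sum(-math.log(1.0 - e) for e in eps) / (m + 1)
    rhs = eps[m] / (1.0 - alpha)
    return lhs / rhs

print(bound_to_error_ratio(0.5, 0.5, 10**5))  # close to 1
```

The small residual deviation comes from the higher-order terms of $-\log(1-\epsilon)$ and from approximating the sum by an integral, both of which vanish relative to the leading term as $m$ grows.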
The main assumption of the theorem is that $\langle\epsilon_m\rangle$ follows a power-law behaviour asymptotically in $m$. As we mentioned in section 2, this has been empirically found to be the case for sufficiently expressive deep learning models. spigler2019asymptotic proved that the learning curve follows a power law for stationary Gaussian processes in a misspecified teacher-student scenario, with input instances distributed on a lattice; they also computed the power-law exponents analytically. More recently, bordelon2020spectrum developed a theory of average generalization error learning curves (see section 4.5) and proposed an explanation for the power-law behaviour based on some assumptions on the data distribution.
The second assumption, which is only necessary to obtain a smaller value of the constant $C$, is that $\operatorname{Var}(\epsilon_m) \in o(\langle\epsilon_m\rangle)$. This seems plausible in many situations. For large $m$, most of the variation in $\epsilon_m$ should typically come from the choice of test point (on which $\epsilon_m$ implicitly depends) rather than the choice of $S$. We can consider two extreme cases. If $\epsilon_m$ is $1$ for some test points and $0$ for all others, then $\epsilon_m$ is a Bernoulli variable, and the variance is of the same order as the mean. At the other extreme, if $\epsilon_m = \langle\epsilon_m\rangle$ for all choices of $S$ and test point, then the variance is $0$. It seems plausible that in many situations we find something in between.
In section 7, we will show extensive empirical evidence that the PAC-Bayes bound follows a power-law behaviour with an exponent closely matching that of the empirically measured test error. We sometimes observe deviations, but these likely come from the use of the expectation-propagation (EP) approximation, which empirically appears to introduce systematic errors that themselves scale as a power law (mingard2020sgd). Furthermore, we observe a positive correlation between the exponent and the proportionality constant relating the bound and the error, as predicted by theorem 6.4.
7 Experimental results for the marginal-likelihood PAC-Bayes bound
In section 2 we presented a series of seven desiderata for a generalization theory of deep learning, and in section 4 we used this framework to compare a wide range of different bounds from the literature. In this section we perform the extensive experiments needed to test the marginal-likelihood PAC-Bayes bound against the desiderata, especially D.1–D.3 (performance when varying dataset, architecture, and training set size). This empirical work provides an example of how to test a bound in detail against the desiderata. In section 8, we will discuss the comparison of our bound against all the desiderata.
The key quantity needed to compute the bound of theorem 5.1 for DNNs is the marginal likelihood, which measures the capacity term in the bound. We follow the same approach as valle2018deep and calculate the marginal likelihood using the Gaussian process (GP) approximation to DNNs (also called the neural network GP, or NNGP), proposed in several recent works (lee2017deep; matthews2018gaussian; novak2018bayesian; garriga2018convnets; yang2019tensor), which can be used to approximate Bayesian inference in DNNs. This approximation requires computing the NNGP kernel for the inputs in the training set $S$, which in the infinite-width limit equals the covariance matrix of the outputs of the DNN with random weights, for inputs in $S$. This can be computed analytically for FCNs, but for more complex architectures this is unfeasible, so we rely on the Monte Carlo approximation used in novak2018bayesian: we approximate the kernel by an estimator of the empirical covariance matrix of the vector of outputs of the DNN, computed from random initializations of the DNN. The marginal likelihood is then approximated using the expectation-propagation (EP) approximation, as was done in previous works (valle2018deep; mingard2020sgd). See section B.1 for more details. We note that for some experiments, running the network to convergence or computing the EP approximation turned out to be too computationally expensive to achieve within our budget, which is why several architectures are absent from some of the experiments at larger training set sizes (e.g. those in section 7.3). Code for the experiments can be found at https://github.com/guillefix/nnpacbayes
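The Monte Carlo kernel estimate can be sketched as follows, with a toy one-hidden-layer ReLU network standing in for a real DNN (all names, widths, and sample counts below are illustrative, not the paper's setup):

```python
import numpy as np

def mc_kernel(X, n_init=2000, width=64, seed=0):
    """Monte Carlo estimate of the NNGP kernel: the empirical covariance
    of the network outputs on the inputs X, over n_init independent draws
    of the random weights. Toy one-hidden-layer ReLU network."""
    rng = np.random.default_rng(seed)
    m, d = X.shape
    outs = np.empty((n_init, m))
    for i in range(n_init):
        W1 = rng.normal(0.0, 1.0 / np.sqrt(d), (d, width))
        b1 = rng.normal(0.0, 1.0, width)
        W2 = rng.normal(0.0, 1.0 / np.sqrt(width), (width, 1))
        h = np.maximum(X @ W1 + b1, 0.0)   # hidden activations
        outs[i] = (h @ W2).ravel()         # scalar output per input
    return outs.T @ outs / n_init          # (m, m) empirical covariance

X = np.random.default_rng(1).normal(size=(10, 5))
K = mc_kernel(X)
```

Widening the last layer yields several output samples per initialization, which is one way the cost of this estimator can be reduced in practice.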
In the following experiments, the datasets are binarized, and all DNNs are trained using the Adam optimizer to 0 training error. The test error is measured on the full test set for the different datasets used, while the training set is sampled from the full training set provided by these datasets. For full experimental details see appendix B.
7.1 Error versus label corruption (Desideratum D.1)
Desideratum D.1 requires that the bound correlates with the error as we change the dataset complexity. We test this correlation in two ways: by directly corrupting a fraction of the labels of a standard dataset, increasing its complexity, and by comparing different standard datasets differing in complexity.
In fig. 1, we show the true test error and the PAC-Bayes bound for a CNN and an FCN, as we increase the label corruption for three different datasets (CIFAR10, MNIST, and Fashion MNIST), which have been binarized. We find that the bound is not only relatively tight, but also qualitatively follows the behaviour of the true error, increasing with complexity and preserving the order among the three datasets. In fig. 3 and fig. 4, we present more datasets, and observe that for sufficiently large training set size, the bound can typically correctly predict on which datasets the networks will generalize better.
7.2 Error versus training set size (Desideratum D.2)
Desideratum D.2 requires that the bound predicts the change in error as we increase the training set size. As mentioned previously, several works (hestness2017deep; novak2019neural; rosenfeld2019constructive; kaplan2020scaling) have found that learning curves for DNNs tend to show a power-law behaviour with an exponent that depends on the dataset, but not significantly on the architecture. However, we note that in some work on learning curves for Gaussian processes (sollich2002gaussian; spigler2019asymptotic; bordelon2020spectrum), more complex types of learning curves have also been observed (sometimes involving several regions of non-monotonicity), suggesting that DNNs may potentially show more complex learning curves, different from power laws, in some regimes. As a simple example, in hestness2017deep it was pointed out that in the low-data regime learning curves do not show the asymptotic power-law behaviour. Another example is the double-descent behaviour with respect to data size observed in nakkiran2019deep.
We empirically computed the learning curves for many combinations of architectures and datasets, as well as the corresponding PAC-Bayes bounds. In fig. 2 and fig. 3, we show the learning curves for three representative architectures on different datasets. The exponent of the learning curve clearly depends on the dataset, but less strongly on the architecture. The PAC-Bayes bound approximately matches the power-law exponent of the empirical learning curves, as predicted by theorem 6.4. In fig. 2, the bound even predicts the quick drop in generalization error for EMNIST between two successive training set sizes for the FCN. However, we found that this fine-grained agreement doesn't typically hold for other architectures, for which the bound only matches the overall power-law behaviour of the learning curve.
In fig. 4, we show the learning curves for several representative architectures on five different datasets. For each architecture, the bound and the SGD results have a very similar learning curve exponent, though there can be a different vertical offset that depends on the architecture. The relative ordering in generalization performance between different architectures is typically also predicted by the bound, especially for large training set sizes. We will explore this trend in more detail in section 7.3.
The learning curves we observe in the figures above agree with previous empirical observations of power-law behaviour in learning curves for DNNs, with only a few exceptions, where we observe a deviation from power-law behaviour. In particular, the learning curve for CIFAR10 with batch size 32 appears to deviate from a power law over this range of training set sizes. However, with batch size 256 it shows cleaner power-law behaviour (see fig. 12, fig. 13, fig. 14 in appendix E) that agrees better with the PAC-Bayes bound exponent.
In fig. 8 and fig. 9, shown in appendix D, we present the learning curves for several variants of ResNets and DenseNets, respectively. Within each family of similar architectures, the learning curves are even more similar to each other. The PAC-Bayes bound matches the behaviour of the true error rather closely over the entire range of architectures and datasets used. In particular, the power-law exponent of the PAC-Bayes bound is close to that of the true learning curves for these 14 different architectures, just as was found in fig. 3 for three representative ones, showing that our generalization error theory is robust and widely applicable.
[Fig. 5 caption, partial: The second method of estimating the exponent makes assumptions about how the bounds scale. The error bars are estimated standard errors from the linear fits; for the ratio estimate, the errors due to fluctuations in the dataset are negligible. Note that the exponents cluster according to dataset. The outliers for MNIST and KMNIST are both the FCN. The DNNs were trained using Adam with batch size 32 to 0 training error. For some datasets (e.g. CIFAR10) the power-law behaviour is less clear, but we still chose to include the estimated exponent for completeness.]
In fig. 5, we compare the power-law exponent from the empirical learning curves (calculated with Adam and batch size 32) to the exponent from the PAC-Bayes bounds. For the empirical learning curves, the exponent is estimated from a linear fit of the log of the error vs the log of the training set size. We note that the exponents cluster according to the dataset, as observed in previous work (hestness2017deep). However, we also observe some smaller, but statistically significant, variation in exponents within a dataset, indicating that the architecture may nevertheless play a role in the learning curve behaviour (though a less significant one than the dataset). One exception is the FCN, which shows a significantly different exponent than the other architectures – a deviation which is also predicted by the PAC-Bayes bound, but for which we do not yet have an explanation.
To estimate the learning curve exponent for the PAC-Bayes curves, we used two methods. In fig. 4(a), we used a linear fit to the log of the PAC-Bayes bound vs the log of the training set size, as was done for the empirical exponents. In fig. 4(b), we estimate it as $\alpha = 1 - 1/\rho$, where $\rho$ is the ratio of the PAC-Bayes bound and the error, which is obtained from the expression in theorem 6.4. This way of deriving the exponent assumes that the condition on the variance of the error holds (although one may still expect $\rho$ and the exponent to correlate even if the condition doesn't hold exactly). For both ways of estimating the exponent, there is a good correlation between the estimate and the empirical exponent. The absolute value of the estimated exponent in fig. 4(a) does show deviation from the true value, which is probably due to systematic errors in the EP approximation used to compute the marginal likelihood (this was discussed and empirically investigated in mingard2020sgd). It should be kept in mind that power-law exponents can be sensitive to the protocol used to measure them (clauset2009power; stumpf2012critical), so the exact values we find in fig. 5 may not be as meaningful as the correlation between the empirical values and those from the bound.
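The two estimators can be sketched on a synthetic power-law learning curve (all names and values below are illustrative, not the paper's data; the ratio method is written assuming the variance condition holds, i.e. $C = 1$):

```python
import numpy as np

# Synthetic learning curve err(m) = c * m**(-alpha); recover alpha two ways.
c_true, alpha_true = 2.0, 0.4
ms = np.array([1000, 2000, 5000, 10000, 20000], dtype=float)
err = c_true * ms ** (-alpha_true)

# (a) linear fit of log(error) vs log(m); the slope is -alpha
slope, _ = np.polyfit(np.log(ms), np.log(err), 1)
alpha_fit = -slope

# (b) ratio method: if bound/error -> rho = 1/(1 - alpha), then alpha = 1 - 1/rho
bound = err / (1.0 - alpha_true)   # idealized bound per theorem 6.4 with C = 1
rho = bound[-1] / err[-1]
alpha_ratio = 1.0 - 1.0 / rho

print(alpha_fit, alpha_ratio)
```

On exact power-law data both estimators recover the true exponent; on real curves they differ because the fit averages over the whole range of $m$ while the ratio uses the asymptotic relation at a single $m$.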
7.3 Error versus architecture (Desideratum D.3)
Desideratum D.3 requires that the bound correlates with the error when changing the architecture. We explore this in two ways: by varying certain common architecture hyperparameters (pooling type and depth), and by comparing several state-of-the-art (SOTA) architectures to each other. In fig. 5(a), we vary the pooling type, and find that the bound correctly predicts that the error is higher for max pooling than for average pooling, and that both are lower than with no pooling, on this particular dataset. In fig. 5(b), we vary the number of hidden layers of a CNN trained on MNIST, and find that the bound closely tracks the change in generalization error with the number of layers.
To explore more complex changes to the architecture, we plot in fig. 6(a) the bound and error against each other for five datasets, for a set of state-of-the-art architectures, including several ResNet and DenseNet variants (see section B.3 for architecture details), at a fixed training set size of 15K. The results display a clear correlation, showing that our PAC-Bayes bound can help explain why some architectures generalize better than others.
Nevertheless, the empirical differences between architectures are rather small. For that reason, we can’t disentangle whether the deviations from the bound predictions are due to deviations from the bound assumptions (for example SGD not behaving as a Bayesian sampler), or from the different approximations used in computing the bound (for example, the EP approximation used in computing the marginal likelihood, see section B.1).
8 Evaluating the marginallikelihood bound against the seven desiderata
We now evaluate the marginal-likelihood PAC-Bayes bound against the seven desiderata, in the same manner as we did for the other families of bounds that we studied.

D.1 ✓ In section 7.1 and section 7.2, we saw that the bound correctly predicts the relative performance between different datasets for all the architectures we tried, at least for sufficiently large training sets. Thus the bound captures trends with data complexity.

D.2 ✓ In section 7.2 we showed that the bound correctly predicts the overall behaviour of the learning curve for all the architectures and datasets we studied. In particular, it captures the relative ordering of the power-law exponents for different dataset/architecture pairs (fig. 5). Thus the bound captures trends with training set size.

D.3 ✓ In section 7.3 we showed that the bound correlates well with the generalization error when varying the architecture, for all the datasets. We still don't know whether the deviations are fundamental to our approach or due to the various approximations used in computing the bound. We do know that the NNGP approximation can't capture the dependence of generalization on layer width. However, theorem 5.1 may still capture these effects if the marginal likelihood could be computed for finite-width DNNs, perhaps using finite-width corrections to NNGPs (antognini2019finite; yaida2019non). Overall, the bound does a good job of tracking the effect of architecture changes.

D.4 ✗ Our approach currently cannot capture effects on generalization caused by different DNN optimization algorithms, because the marginal likelihood, as we calculate it, assumes a Bayesian posterior and does not depend on the fine details of the SGD-based optimization algorithms normally employed for DNNs. In this context it is useful to consider the distinction made in (mingard2020sgd) between questions of type 1), which ask why DNNs generalize at all in the overparameterised regime, and questions of type 2), which ask how to further fine-tune generalisation, when the algorithm already performs well, by for example changing optimiser hyperparameters such as batch size or learning rate. Effects of type 2), which are sensitive to the optimiser, are hard to capture with the current version of our bound. We are effectively relying on the observation that, to first order, different SGD variants seem to perform similarly, and all appear to approximate Bayesian inference (mingard2020sgd). Capturing the second-order effects, e.g. deviations from approximate Bayesian inference due to optimiser hyperparameter tuning, would require an extension to our approach. We note that one of the main hypotheses used to explain differences in generalization among different optimization algorithms is that some algorithms are more biased towards flat solutions in parameter space than others (hochreiter1997flat; keskar2016large; jastrzebski2018finding; wu2017towards; zhang2018energy; wei2019noise). Therefore, one possibility could be to combine our PAC-Bayes approach (based on probabilities of functions) with the more standard approaches based on flatness, to capture this effect.

D.5 ✓ The results in section 7 clearly show that the bound is non-vacuous. In fact, the logarithm on the left-hand side of theorem 5.1 ensures the bound is less than one. (Note that PAC-Bayes bounds of the form derived in eq. 13 are non-vacuous when the KL divergence on the left-hand side is directly inverted, rather than bounded using Pinsker's inequality; in the realizable case, the KL divergence reduces to a logarithmic form that can be easily inverted.) More importantly, we find that the bound can be relatively tight.
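The realizable-case inversion is simple enough to state explicitly. A minimal sketch (the function name is ours): with zero empirical error, $\mathrm{kl}(0\,\|\,\epsilon) = -\log(1-\epsilon) \leq B$ inverts exactly to $\epsilon \leq 1 - e^{-B}$, which is always below one, whereas Pinsker's inequality only gives $\epsilon \leq \sqrt{B/2}$, which becomes vacuous once $B > 2$:

```python
import math

def invert_realizable_kl(B):
    """Exact inversion of kl(0 || eps) = -log(1 - eps) <= B,
    giving eps <= 1 - exp(-B), which is always < 1."""
    return 1.0 - math.exp(-B)

for B in (0.1, 1.0, 5.0):
    print(B, invert_realizable_kl(B), math.sqrt(B / 2))  # exact vs Pinsker
```
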

D.6 ⚫ Our bound lies on the higher end of computation cost among the bounds we compare here. The most expensive steps are computing the kernel via Monte Carlo sampling and computing the marginal likelihood. The former has a complexity proportional to the number of Monte Carlo samples times the cost of running the model, but this number can be reduced by making the last layer wider, and the computation can be heavily parallelized. On the other hand, computing the marginal likelihood using the NNGP approach has a complexity of $O(m^3)$, because it requires inverting an $m \times m$ matrix, and this may only be improved by making assumptions on the kernel matrix (such as it being low rank). Therefore, the computational complexity of the bound scales similarly to inference in Gaussian processes. This means we could compute the bound for the training set sizes used here, but larger training set sizes would be difficult. In concurrent work, park2020towards showed that for small enough training set sizes, computations based on the NNGP are competitive relative to training the corresponding DNN.
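The $O(m^3)$ cost comes from factorizing the $m \times m$ kernel matrix. A minimal sketch of this step (using a Gaussian likelihood as a stand-in for the EP-approximated Bernoulli likelihood the paper actually uses; the function is ours):

```python
import numpy as np

def gp_log_marginal_likelihood(K, y, noise=1e-6):
    """Log marginal likelihood of a GP with kernel matrix K at targets y,
    under a Gaussian likelihood. The Cholesky factorization below is the
    O(m^3) step that dominates the cost; EP for classification has the
    same cubic scaling."""
    m = len(y)
    L = np.linalg.cholesky(K + noise * np.eye(m))          # O(m^3)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))    # (K + noise I)^-1 y
    return (-0.5 * y @ alpha
            - np.log(np.diag(L)).sum()                     # -0.5 * log det
            - 0.5 * m * np.log(2 * np.pi))
```

Using a Cholesky factorization rather than an explicit inverse is the standard design choice here: it is numerically stabler and gives the log-determinant for free from the factor's diagonal.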

D.7 ⚫ Theorem 5.1 is fully rigorous for DNNs trained with exact Bayesian inference. In section 5 we argue that the theorem is probably applicable to DNNs trained with SGD (and several of its variants), based on empirical evidence and arguments from valle2018deep and mingard2020sgd, as well as the new evidence in the current paper. Furthermore, unlike the bound used in valle2018deep, theorem 5.1 applies with high probability over the posterior (with only logarithmic dependence on the confidence parameter). This property could aid the analysis of algorithms that sample parameter space with a distribution sufficiently similar to the Bayesian posterior. However, further work is needed to justify the application of the bound to other optimizers. In addition, the NNGP and EP approximations which we used to evaluate the bound may introduce errors for which we don't yet have rigorous guarantees.
9 Discussion
In this paper we provide a general framework for comparing generalization bounds for deep learning, which complements two other recent large-scale studies (jiang2019fantastic; dziugaite2020search). Our framework has two parts. Firstly, we introduce seven desiderata in section 2 against which generalization theories (including the bounds we focus on here) should be compared. Secondly, we classify, in