On the Implicit Biases of Architecture and Gradient Descent

10/08/2021 · Jeremy Bernstein et al., California Institute of Technology

Do neural networks generalise because of bias in the functions returned by gradient descent, or bias already present in the network architecture? Why not both? This paper finds that while typical networks that fit the training data already generalise fairly well, gradient descent can further improve generalisation by selecting networks with a large margin. This conclusion is based on a careful study of the behaviour of infinite width networks trained by Bayesian inference and finite width networks trained by gradient descent. To measure the implicit bias of architecture, new technical tools are developed to both analytically bound and consistently estimate the average test error of the neural network–Gaussian process (NNGP) posterior. This error is found to be already better than chance, corroborating the findings of Valle-Pérez et al. (2019) and underscoring the importance of architecture. Going beyond this result, this paper finds that test performance can be substantially improved by selecting a function with much larger margin than is typical under the NNGP posterior. This highlights a curious fact: minimum a posteriori functions can generalise best, and gradient descent can select for those functions. In summary, new technical tools suggest a nuanced portrait of generalisation involving both the implicit biases of architecture and gradient descent. Code for this paper is available at: https://github.com/jxbz/implicit-bias/.


1 Introduction

Following an influential paper by Zhang et al. (2017), the basic question of why neural networks generalise is generally regarded to be open. The authors demonstrate a surprising fact: deep learning generalises even when the neural network is expressive enough to represent functions that do not generalise. In turn, this implies that the theory of Vapnik & Chervonenkis (1971), which is based on uniform convergence of train error to population error, does not explain generalisation in neural networks.

Several hypotheses have since been proposed to fill this theoretical vacuum. One prominent hypothesis states that, while most neural network solutions do not generalise well, there is an implicit bias in the kinds of functions returned by gradient descent (Soudry et al., 2018). In sharp contrast, a second hypothesis states that the solution space of a neural network is dominated by simple functions, while the complex kinds of functions that overfit are relatively rare (Valle-Pérez et al., 2019).

This latter hypothesis dovetails with a particular “PAC-Bayesian” theorem of McAllester (1998), which bounds the average population error of all classifiers consistent with a training sample. If this bound is small, then the complex functions that overfit must indeed be rare. In more technical terms, the PAC-Bayesian theorem can provide a meaningful certificate of generalisation even for machine learning models with an infinite Vapnik-Chervonenkis dimension, or an arbitrarily large number of parameters, by properly accounting for the measure of those very complex functions.

This paper takes a more nuanced position between these two hypotheses. While the average population error of all neural networks that fit a training sample is found to be good (and certifiably good by the PAC-Bayesian theorem), it is still possible for certain networks with special properties to perform substantially better than average (and likewise substantially better than the PAC-Bayes bound). Moreover, gradient descent may be used to specifically target these special networks. This subtly counters a position put forward by Mingard et al. (2021), which suggests that gradient descent may be well-modelled as sampling randomly from a particular Bayesian posterior distribution.

To support these claims, a careful study of the behaviour of both infinite width and finite width neural networks is conducted. In particular, this paper makes the following technical contributions:

  1. Section 3, Implicit Bias of Architecture. A purely analytical PAC-Bayes bound (Theorem 2) on the population error of the neural network–Gaussian process (NNGP) posterior for binary classification is derived. The bound furnishes an interpretable measure of model complexity that depends on both the architecture and the training data. This exact analytical bound improves upon an approximate computational approach due to Valle-Pérez et al. (2019).

  2. Section 4, Testing the Bound. The new bound is found to be both non-vacuous and correlated with the test error of finite width multilayer perceptrons trained by gradient descent. This provides supporting evidence for the important role of architecture in generalisation, as put forward by Valle-Pérez et al. (2019). Still, a gap exists with the performance of gradient descent.

  3. Section 5, Implicit Bias of Gradient Descent. Going beyond the work of Valle-Pérez et al. (2019), a new theoretical tool (Theorem 3) is developed to enable consistent estimation of the average error of the NNGP posterior on a given holdout set. The average is found to be significantly worse than the holdout performance of Gaussian process draws with large margin. The experiment is repeated for finite width neural networks trained by gradient descent, and the same qualitative phenomenon persists. This finding demonstrates the ability of gradient descent to select large margin functions with vanishing posterior probability that nonetheless generalise significantly better than the posterior average.

2 Related Work

Margin-based generalisation theory

A rich body of work explores the connection between margin and generalisation. For instance, Bartlett et al. (2017) propose a margin-based complexity measure for neural networks derived via Rademacher complexity analysis, and Neyshabur et al. (2018) derive a similar result via PAC-Bayes analysis. Neyshabur et al. (2019) test these bounds experimentally, finding them to be vacuous and to scale poorly with network width. Biggs & Guedj (2021) further develop this style of bound. Margin-based PAC-Bayes bounds go back at least to the work of Herbrich (2001), who derived such a bound for linear classification. The standard idea is that a solution with large margin implies the existence of nearby solutions with the same training error (but perhaps smaller margin), facilitating a more spread out PAC-Bayes posterior. This style of margin-based PAC-Bayes bound does not provide guidance on whether the original large margin classifier should generalise better or worse than the additional nearby classifiers included in the posterior.

Non-vacuous bounds for neural networks

Aside from PAC-Bayes, many styles of generalisation bound for neural networks are vacuous (Dziugaite & Roy, 2017). Even many PAC-Bayes bounds are vacuous (Achille & Soatto, 2018; Foret et al., 2021) due to their choice of PAC-Bayes prior. Dziugaite & Roy (2017) construct a non-vacuous weight space PAC-Bayes bound by iteratively optimising both the PAC-Bayes prior and posterior by gradient descent. Wu et al. (2020) extend this approach to involve Hessians. Meanwhile, Valle-Pérez et al. (2019) instantiate a non-vacuous PAC-Bayes bound in the function space of neural networks via the NNGP correspondence (Neal, 1994; Lee et al., 2018). The evaluation of this bound involves an iterative and statistically inconsistent expectation-propagation approximation. Unlike this paper’s purely analytical bound in Theorem 2, these non-vacuous bounds are all evaluated by iterative computational procedures.

Implicit bias of gradient descent

Much research has gone into the role that gradient descent plays in neural network generalisation. For instance, Zhang et al. (2017) suggest that gradient descent may converge to solutions with special generalisation properties, while Wilson et al. (2017) suggest that gradient descent may have a more favourable implicit bias than the Adam optimiser. Azizan et al. (2021) make a related study in the context of mirror descent. Keskar et al. (2017) point to an implicit bias present in mini-batch stochastic gradient descent, but Geiping et al. (2021) argue against the importance of stochasticity. Meanwhile, Soudry et al. (2018) observe how gradient descent combined with certain loss functions converges to max-margin solutions; Wei et al. (2018) make a similar observation. On a different tack, Mingard et al. (2021), Valle-Pérez & Louis (2020) and Valle-Pérez et al. (2019) argue that the implicit bias of neural architecture is more important than that of gradient descent for understanding generalisation, and that so long as gradient descent does not select solutions pathologically, generalisation should occur.

Figure 1: The solution space of a learning task denotes the subset of classifiers that attain zero error on a given dataset. While the solution space can have a complicated geometry when described in weight space, in function space the solution space of a binary classification task is just an orthant.

3 Implicit Bias of Architecture

This section derives an analytical bound on the population error of an infinitely wide neural network, averaged over all weight configurations that attain 0% train error: in other words, the average population error of the NNGP posterior. Since this quantity assesses the performance of all solutions rather than just those returned by gradient descent, it is intended to measure the implicit bias of architecture. In Section 4, the bound is found to be non-vacuous for multilayer perceptrons. The source of a substantial gap with the results of gradient descent training is investigated in Section 5, and ultimately attributed to an additional implicit bias of gradient descent: namely, margin.

3.1 Preliminary Notation

Consider a training dataset $S = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $n$ input–label pairs. The inputs are embedded in Euclidean space, while to simplify the analysis the paper shall restrict to binary labels $y_i \in \{\pm 1\}$. A classifier shall depend on a weight vector $w \in \mathcal{W}$, where $\mathcal{W}$ denotes the weight space. In particular, for an input $x$ the prediction is given by the sign of the network output $f(x; w)$.

One is interested in the relationship between the train error of a classifier on the dataset $S$ and its population error on a data distribution $\mathcal{D}$. For a weight vector $w$, these are defined as:

$\mathrm{err}_S(w) := \frac{1}{n}\sum_{i=1}^{n} \mathbb{I}\big[\operatorname{sign} f(x_i; w) \neq y_i\big], \qquad \mathrm{err}_{\mathcal{D}}(w) := \mathbb{P}_{(x,y)\sim\mathcal{D}}\big[\operatorname{sign} f(x; w) \neq y\big].$ (1)

The solution space $\mathcal{C} := \{ w \in \mathcal{W} : \mathrm{err}_S(w) = 0 \}$ denotes those classifiers that attain zero train error.
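
To make the definitions concrete, here is a minimal numpy sketch of the train error, an empirical error estimate, and membership in the solution space, assuming a classifier that predicts via the sign of a real-valued output. The function `f` below is an illustrative stand-in, not the paper's architecture.

```python
import numpy as np

def error_rate(f, w, X, y):
    """Fraction of points where sign(f(x; w)) disagrees with the +/-1 label."""
    preds = np.sign(f(X, w))
    return float(np.mean(preds != y))

def in_solution_space(f, w, X_train, y_train):
    """A weight vector is a 'solution' if it attains zero train error."""
    return error_rate(f, w, X_train, y_train) == 0.0

# Toy usage: a linear classifier standing in for a neural network.
rng = np.random.default_rng(0)
f = lambda X, w: X @ w
X_train = rng.standard_normal((20, 5))
w = rng.standard_normal(5)
y_train = np.sign(X_train @ w)          # labels generated by the same weights

print(error_rate(f, w, X_train, y_train))         # 0.0
print(in_solution_space(f, w, X_train, y_train))  # True
```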

3.2 Geometry of Solutions in Function Space

The function space is defined to be the set of output vectors that the classifier can realise on the dataset $S$:

$\mathcal{F}_S := \big\{ \big(f(x_1; w), \ldots, f(x_n; w)\big) : w \in \mathcal{W} \big\} \subseteq \mathbb{R}^n.$ (2)

In weight space the geometry of solutions can be arbitrarily complicated, but in function space the solutions occupy the single orthant picked out by the binary training labels.

To measure the volume of the solution space, one can define a measure $P$ on weight space. It is convenient to enforce $P(\mathcal{W}) = 1$, so that $P$ is a probability measure. In the absence of a better alternative, $P$ is usually set to a multivariate Gaussian or uniform distribution. The measure $P$ on weight space induces a measure on function space via the relation:

$P_{\mathcal{F}_S}(A) := P\big[\, w \in \mathcal{W} : \big(f(x_1; w), \ldots, f(x_n; w)\big) \in A \,\big] \quad \text{for measurable } A \subseteq \mathbb{R}^n.$ (3)

The volume of solutions is denoted $P(\mathcal{C})$, and can be computed either in weight space or in function space. In function space, $P(\mathcal{C})$ is an orthant probability. The situation is visualised in Figure 1.
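
As an illustration of how the same volume can be computed in either space, the sketch below Monte-Carlo estimates the solution volume of a toy linear model in weight space (sampling weight vectors from a Gaussian prior) and compares it against the corresponding orthant probability in function space. The linear model is a stand-in for a neural network, chosen because its induced function-space measure is exactly Gaussian; this is not the paper's experimental setup.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 20                        # few data points, so the volume is not too tiny
X = rng.standard_normal((n, d)) / np.sqrt(d)
y = np.sign(rng.standard_normal(n))

# Weight-space estimate: fraction of prior draws w ~ N(0, I_d) that fit all labels.
num_samples = 200_000
W = rng.standard_normal((num_samples, d))
vol_weight_space = np.all(np.sign(W @ X.T) == y, axis=1).mean()

# Function-space estimate: for this model the induced measure on outputs is N(0, K)
# with K = X X^T, so the volume is the Gaussian probability of the labelled orthant.
K = X @ X.T
L = np.linalg.cholesky(K + 1e-10 * np.eye(n))
Z = rng.standard_normal((num_samples, n)) @ L.T     # rows distributed as N(0, K)
vol_function_space = np.all(np.sign(Z) == y, axis=1).mean()

print(f"weight space:   {vol_weight_space:.4f}")
print(f"function space: {vol_function_space:.4f}")  # the two estimates should agree
```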

3.3 PAC-Bayes Theory

PAC-Bayes relates the volume of solutions to their average population error. The following result was derived by Valle-Pérez et al. (2019) as a corollary of a theorem due to Langford & Seeger (2001). The result is similar in form to a theorem of McAllester (1998).

Theorem 1 (A PAC-Bayesian theorem).

First, fix a probability measure $P$ over the weight space of a classifier. Let $S$ denote a training set of $n$ datapoints sampled iid from the data distribution $\mathcal{D}$, and let $\mathcal{C}$ denote the corresponding solution space. Consider the population error $\mathrm{err}_{\mathcal{D}}(w)$ of weight setting $w$, and its average over the solution space $\mathcal{C}$. Then, for a proportion $1 - \delta$ of draws of the training set $S$,

(4)

For large $n$, the confidence term in Inequality 4 is negligible, and the result says that the population error averaged over solutions is at most the ratio of the negative log volume of solutions, $\ln \frac{1}{P(\mathcal{C})}$, to the number of training points $n$.
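
Given an estimate of the solution volume, this large-$n$ reading of Theorem 1 becomes a one-line calculation. The sketch below implements only that simplified form, negative log volume divided by the number of training points, and omits the finite-$n$ confidence term of Equation 4.

```python
import numpy as np

def simplified_pac_bayes_bound(log_volume_of_solutions: float, n: int) -> float:
    """Large-n reading of Theorem 1: the average population error over solutions
    is at most (negative log volume of solutions) / (number of training points)."""
    return -log_volume_of_solutions / n

# Example: a solution volume of 2^-30 on a training set of 1000 points
# gives a simplified bound of 30 * ln(2) / 1000, roughly 0.021.
print(simplified_pac_bayes_bound(np.log(2.0 ** -30), n=1000))
```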

3.4 Volume of Solutions via Gaussian Orthant Probabilities

Since infinitely wide neural networks induce a Gaussian measure on function space, computing the volume of the solution space amounts to computing a Gaussian orthant probability.

In more detail, let $P$ denote a measure on the weight space of an infinitely wide neural network satisfying the conditions of the NNGP correspondence (see Theorem 4 in Appendix A.3 for an example). Then, for a weight vector $w \sim P$, the network outputs on the training inputs are distributed:

$\big(f(x_1; w), \ldots, f(x_n; w)\big) \sim \mathcal{N}(0, K),$ (5)

with covariance matrix $K$ given by the architecture's NNGP kernel evaluated on the training inputs.

Therefore, under the NNGP correspondence, the volume of solutions computed in function space is just the Gaussian probability of the orthant picked out by the binary training labels $y = (y_1, \ldots, y_n)$:

$P(\mathcal{C}) = \mathbb{P}_{z \sim \mathcal{N}(0, K)}\big[\, z_i\, y_i > 0 \ \text{for all } i \,\big].$ (6)

To facilitate estimating and bounding this probability, this paper has derived the following lemma.

Lemma 1 (Gaussian orthant probability).

For a covariance matrix $\Sigma \in \mathbb{R}^{n \times n}$ and a binary vector $y \in \{\pm 1\}^n$, let $P_{\mathrm{orth}}(\Sigma, y)$ denote the corresponding Gaussian orthant probability:

$P_{\mathrm{orth}}(\Sigma, y) := \mathbb{P}_{z \sim \mathcal{N}(0, \Sigma)}\big[\, z_i\, y_i > 0 \ \text{for all } i \,\big].$ (7)

Letting $I$ denote the identity matrix, $\odot$ the elementwise product and $|\cdot|$ the elementwise absolute value, $P_{\mathrm{orth}}(\Sigma, y)$ may be equivalently expressed as:

(8)

and may be bounded as follows:

(9)

The proof is given in Appendix A.2. Equation 8 yields an unbiased Monte Carlo estimator of Gaussian orthant probabilities, and Inequality 9 yields a lower bound. The two complexity measures appearing in these expressions are defined for later use.

To gain intuition about the lemma, observe that $1/2^n$ is the orthant probability for an isotropic Gaussian. Depending on the degree of anisotropy inherent in the covariance matrix $\Sigma$, Equation 8 captures how the orthant probability may be either exponentially amplified or suppressed compared to $1/2^n$.
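
As a baseline, the following sketch estimates a Gaussian orthant probability directly from the definition in Equation 7 by naive Monte-Carlo sampling. Unlike the estimator of Equation 8, its relative accuracy degrades quickly once the orthant probability becomes exponentially small, so it is only a sanity-checking tool.

```python
import numpy as np

def orthant_probability_mc(Sigma, y, num_samples=100_000, seed=0):
    """Naive Monte-Carlo estimate of P_{z ~ N(0, Sigma)}[ z_i * y_i > 0 for all i ]."""
    rng = np.random.default_rng(seed)
    n = len(y)
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(n))   # jitter for numerical safety
    Z = rng.standard_normal((num_samples, n)) @ L.T     # rows distributed as N(0, Sigma)
    return float(np.mean(np.all(Z * y > 0, axis=1)))

# Sanity check: for an isotropic Gaussian the orthant probability is 2^-n.
n = 8
y = np.ones(n)
print(orthant_probability_mc(np.eye(n), y), 2.0 ** -n)
```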

As an aside, using Inequality 9 to lower bound Equation 6 has a Bayesian interpretation. Since the volume of solutions may be written as the prior expectation of the indicator that every training point is classified correctly, it may be interpreted as the Bayesian evidence for the network architecture under a 0–1 likelihood function. A Bayesian would then refer to Inequality 9 as an evidence lower bound.

The following generalisation bound is a basic consequence of Theorem 1, Equation 6 and Lemma 1:

Theorem 2 (Upper bound on the average population error of an infinitely wide neural network).

First, fix a probability measure $P$ over the weight space of an infinitely wide neural network. Let $S$ denote a training set of $n$ datapoints sampled iid from the data distribution $\mathcal{D}$, let $y \in \{\pm 1\}^n$ denote the binary vector of training labels, and let $\mathcal{C}$ denote the corresponding solution space. Consider the population error $\mathrm{err}_{\mathcal{D}}(w)$ of weight setting $w$, and its average over the solution space $\mathcal{C}$. Let $K$ denote the NNGP covariance matrix (Equation 5). Then, for a proportion $1 - \delta$ of draws of the training set $S$,

(10)

where the complexity measures are those defined in Lemma 1.

Since the complexity measures are analytical functions of the NNGP covariance matrix $K$ and the binary vector of training labels $y$, Theorem 2 is an analytical generalisation bound for the NNGP posterior. For large $n$, the confidence term is negligible, and the bound depends on the ratio of the complexity measures of Lemma 1 to the number of datapoints $n$.

To gain further intuition about Theorem 2, consider two special cases. First, suppose that the neural architecture induces no correlation between any pair of distinct data points, so that the NNGP covariance matrix $K$ is proportional to the identity. Then the volume of solutions is $1/2^n$ and the bound is worse than chance. This corresponds to pure memorisation of the training labels. Next, suppose that the neural architecture induces strong intra-class correlations and strong inter-class anti-correlations, so that $K$ is proportional to the rank-one matrix $y y^{\top}$. Although this covariance is singular, it may be seen directly that the volume of solutions is $1/2$: every Gaussian draw is a scalar multiple of the label vector $y$, and it fits all the labels exactly when that scalar is positive. Then by Theorem 1 the average population error is of order $\ln 2 / n$, which is much better than chance for large $n$. This corresponds to pure generalisation from the training labels. Interpolating between these two extremes would suggest that a good neural architecture would impose a prior on functions with strong intra-class and weak inter-class correlations.
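
These two extremes can be checked with a short calculation, shown in the sketch below using only the simplified large-$n$ reading of Theorem 1 (negative log volume over $n$); the exact confidence term is omitted.

```python
import numpy as np

n = 1000   # number of training points

# Case 1: identity covariance (no correlation between distinct data points).
# The solution volume is 2^-n, so the simplified bound is n*ln(2)/n = ln(2) > 1/2.
log_volume_identity = -n * np.log(2.0)
print(f"identity covariance: bound ~ {-log_volume_identity / n:.3f}")    # ~0.693, worse than chance

# Case 2: rank-one covariance proportional to y y^T. Every Gaussian draw is a scalar
# multiple of the label vector, so it fits all the labels exactly when that scalar is
# positive: the solution volume is 1/2 and the simplified bound is ln(2)/n.
log_volume_rank_one = np.log(0.5)
print(f"rank-one covariance: bound ~ {-log_volume_rank_one / n:.5f}")    # ~0.0007, much better than chance
```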

4 Testing the Bound

This section compares the generalisation bound for infinite width networks (Theorem 2) to the performance of finite width multilayer perceptrons (MLPs) trained by gradient descent. The bound is found to be non-vacuous and correlated with the effects of varying depth and dataset complexity. Still, there is a substantial gap between the bound and gradient descent, which is investigated in Section 5.

Three modified versions of the MNIST handwritten digit dataset (LeCun et al., 1998) of varying “hardness” were used in the experiments, as detailed in Figure 2 (top left). Each input image was flattened into a vector and normalised to a fixed Euclidean norm. The networks were MLPs consisting of an input layer, a stack of hidden layers of equal width, and a scalar output layer, with the nonlinearity set to a scaled relu. For this architecture, the infinite width NNGP kernel is the compositional arccosine kernel described in Theorem 4 in Appendix A.3. For the finite width networks, the training loss was set to square loss using the binary training labels as regression targets, and the networks were trained for 100 epochs using the Nero optimiser (Liu et al., 2021) with an initial learning rate of 0.01 decayed by a factor of 0.9 every epoch, and a fixed mini-batch size. The final train error was 0% in all reported experiments. All bounds were computed with a fixed failure probability $\delta$, and the complexity measure of Lemma 1 was estimated by Monte-Carlo sampling. All experiments were run on one NVIDIA Titan RTX GPU.
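
For reference, here is a sketch of how the infinite width NNGP covariance matrix of a scaled-relu MLP can be computed via the standard compositional arccosine recursion of Cho & Saul (2009). It assumes unit-norm inputs, the scaled relu $\varphi(t) = \sqrt{2}\max(t, 0)$, and a depth convention counting weight matrices; these normalisation choices may differ in constants from the exact parameterisation of Theorem 4.

```python
import numpy as np

def nngp_relu_kernel(X, depth):
    """NNGP covariance for a scaled-relu MLP via the compositional arccosine
    recursion (Cho & Saul, 2009).

    X     : (n, d) array of inputs, assumed pre-normalised to unit Euclidean norm.
    depth : number of weight matrices in the MLP."""
    K = X @ X.T                          # first-layer covariance for unit-norm inputs
    for _ in range(depth - 1):
        diag = np.sqrt(np.diag(K))
        norm = np.outer(diag, diag)
        cos_theta = np.clip(K / norm, -1.0, 1.0)
        theta = np.arccos(cos_theta)
        # E[phi(u) phi(v)] for jointly Gaussian (u, v) and phi(t) = sqrt(2) * max(t, 0):
        K = (norm / np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))
    return K

# Toy usage on random unit-norm "images".
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 784))
X /= np.linalg.norm(X, axis=1, keepdims=True)
K = nngp_relu_kernel(X, depth=7)
print(np.round(K, 3))                    # diagonal stays ~1 under this normalisation
```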

The generalisation bound of Theorem 2 was first compared across three datasets of varying complexity. The network architecture was set to a depth 7 MLP, and the bound was computed via Monte-Carlo estimation of the complexity measure in Equation 8. The results are shown in Figure 2 (top right). The bound was found to reflect the relative hardness of the datasets. For random labels, the bound was vacuous as desired. Next, the generalisation bound was compared against the empirical performance of a depth 7 MLP trained on the decimal digits dataset. The results are shown in Figure 2 (bottom left). While loose compared to the holdout error of the finite width network, the bound is still non-vacuous. Finally, the effect of varying network depth was investigated on the decimal digits dataset. Two depths were compared: 2 and 7. The results are shown in Figure 2 (bottom right). The bounds (computed via Monte-Carlo estimation of the complexity measure in Equation 8) appear to predict the relative holdout performance of the two architectures at finite width. These results corroborate those of Valle-Pérez & Louis (2020), but without the use of the statistically inconsistent expectation-propagation approximation.

MNIST Variant | Inputs | Labels
random labels | digit images | coin flip
decimal digits | images of all ten digits | parity of the digit
binary digits | images of the digits 0 and 1 | parity of the digit
Figure 2: Testing Theorem 2, the bound on the average population error of the NNGP posterior. The bound is found to be non-vacuous and correlated with the effects of varying network depth and dataset complexity. For all curves, the mean and range are plotted over three global random seeds. A substantial gap is visible between the bound and the results of gradient descent training, which is investigated in Section 5.
Top left: The three datasets used in the experiments, listed in order of hardness. For random labels there is no meaningful relationship between image and label, so generalisation is impossible. Binary digits is easier than decimal digits because each class is less diverse.
Top right: Comparing generalisation bounds from Theorem 2 for datasets of varying hardness. The curves are computed by Monte-Carlo estimation of the complexity measure in Equation 8 for an infinite width depth 7 MLP. The ordering of the bounds reflects the dataset hardness. For random labels the bound is rightfully vacuous.
Bottom left: Comparing generalisation bounds from Theorem 2 to the results of training finite width networks by gradient descent. The bounds are computed both by Monte-Carlo estimation of the complexity measure in Equation 8 (referred to as the estimator) and by exact computation of the complexity measure in Inequality 9 (referred to as the bound). The comparison is made for depth 7 MLPs on the decimal digits dataset. The exact bound is only slightly looser than the bound computed via Monte-Carlo estimation, and both are non-vacuous above 150 datapoints. While the bounds follow the same trend as the results of training finite width networks, a substantial gap is visible.
Bottom right: Comparing generalisation bounds from Theorem 2 to the results of training finite width networks by gradient descent on decimal digits, for networks of varying depth. The bounds are computed by exact computation of the complexity measure in Inequality 9. The ordering of the bounds matches the finite width results, but again a substantial gap is visible between the bounds and the results of gradient descent training.

5 Implicit Bias of Gradient Descent

In Section 3, a bound was derived on the average population error of all infinitely wide neural networks that fit a certain training set. Since the bounded quantity measures the performance of all solutions rather than just those returned by gradient descent, it is intended to assess the implicit bias of architecture. In Section 4, this bound was tested and found to be non-vacuous. Still a substantial gap was found between the bound and the performance of finite width networks trained by gradient descent. This gap could arise for several potential reasons:

  a) slackness in the bounding technique;

  b) a difference between infinite width and finite width neural networks;

  c) an additional implicit bias of gradient descent.

This section tests these various possibilities, ultimately concluding that gradient descent does have an important additional implicit bias: the ability to control the margin of the returned network.

5.1 An Exact Formula for the Average Holdout Error of Solutions

To investigate whether slackness in the bounding technique is responsible for the gap, an additional theoretical tool was developed: a formula for the holdout error of a binary classifier averaged over the solution space. In contrast to Theorem 2 which gives an upper bound on population error, Theorem 3 is an equality and therefore does not suffer from slackness. The proof is given in Appendix A.1.

Theorem 3 (Average holdout error of a binary classifier).

First, fix a probability measure $P$ over the weight space of a binary classifier. Let $S$ denote a training set and let $\mathcal{C}$ denote the solution space. For a holdout set of $m$ datapoints, consider the holdout error of weight setting $w$, and its average over the solution space $\mathcal{C}$. Then:

(11)

In words, the average holdout error over solutions equals the reduction in the volume of solutions when the training set is augmented with a negated holdout point, averaged over the holdout set. For an infinitely wide neural network, Equation 11 involves computing a sum of ratios of Gaussian orthant probabilities. These probabilities can be consistently estimated by Equation 8 in Lemma 1.
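
Concretely, for each holdout point one appends that point with its label flipped to the training set, recomputes the solution volume, and divides by the original volume; averaging over the holdout set gives the average holdout error over solutions. The sketch below estimates this quantity for a Gaussian process model, using naive Monte-Carlo orthant probabilities rather than the estimator of Equation 8; `kernel_fn` is a hypothetical stand-in for the NNGP kernel.

```python
import numpy as np

def orthant_probability_mc(Sigma, y, rng, num_samples=200_000):
    """Naive Monte-Carlo estimate of P_{z ~ N(0, Sigma)}[ z_i * y_i > 0 for all i ]."""
    L = np.linalg.cholesky(Sigma + 1e-10 * np.eye(len(y)))
    Z = rng.standard_normal((num_samples, len(y))) @ L.T
    return float(np.mean(np.all(Z * y > 0, axis=1)))

def average_holdout_error(kernel_fn, X_train, y_train, X_holdout, y_holdout, rng):
    """Theorem 3 (sketch): average holdout error of the GP posterior equals the mean,
    over holdout points, of P(solutions also wrong on that point) / P(solutions)."""
    K_train = kernel_fn(X_train, X_train)
    p_solutions = orthant_probability_mc(K_train, y_train, rng)
    errors = []
    for x, label in zip(X_holdout, y_holdout):
        X_aug = np.vstack([X_train, x[None, :]])
        y_aug = np.concatenate([y_train, [-label]])     # append the *negated* holdout point
        K_aug = kernel_fn(X_aug, X_aug)
        errors.append(orthant_probability_mc(K_aug, y_aug, rng) / p_solutions)
    return float(np.mean(errors))

# Toy usage with a linear kernel standing in for the NNGP kernel.
rng = np.random.default_rng(0)
kernel_fn = lambda A, B: A @ B.T
X_train, y_train = rng.standard_normal((5, 10)), np.sign(rng.standard_normal(5))
X_holdout, y_holdout = rng.standard_normal((8, 10)), np.sign(rng.standard_normal(8))
print(average_holdout_error(kernel_fn, X_train, y_train, X_holdout, y_holdout, rng))
```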

5.2 Three-Way Comparison of Holdout Error

Armed with Theorem 3, this subsection makes a three-way comparison between the holdout error of infinite width networks averaged over solutions, the holdout error of finite width networks averaged over solutions, and the holdout error of finite width networks trained by gradient descent.

To compute the holdout error of finite width networks averaged over solutions, weight vectors were randomly sampled and all non-solutions were discarded. To make this process computationally tractable, a very small training set was used consisting of only 5 samples from binary digits. A holdout set of 50 datapoints was used, and the network was set to a 7-layer MLP. The results were:

average holdout error at infinite width:
average holdout error at width 10,000:
gradient descent holdout error at width 10,000:

Based on these results, three comments are in order. First, the close agreement between the average holdout error at finite and infinite width suggests that the infinite width limit may not be responsible for the significant gap observed in Section 4. Second, the significant gap between the gradient descent holdout error and the holdout error averaged over solutions suggests slackness in the bounding technique in Theorem 2 may also not be the main culprit. Third, the significant gap between the gradient descent holdout error and the holdout error averaged over solutions suggests that gradient descent does have an extra implicit bias. The next subsection attempts to diagnose this implicit bias.
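
For concreteness, the following sketch shows the rejection-sampling procedure in miniature: sample finite width relu MLPs from a Gaussian prior over the weights, keep only those that fit a tiny training set exactly, and average their holdout error. The width, depth, prior and synthetic data below are illustrative choices, not the exact configuration used in the experiment.

```python
import numpy as np

def mlp_forward(params, X):
    """Forward pass of a relu MLP; the caller takes signs of the scalar outputs."""
    h = X
    for i, W in enumerate(params):
        h = h @ W / np.sqrt(W.shape[0])          # 1/sqrt(fan-in) scaling
        if i < len(params) - 1:
            h = np.maximum(h, 0.0)               # relu
    return h[:, 0]

def sample_params(rng, d_in, width, depth):
    """Draw all weight matrices iid from a standard Normal prior."""
    dims = [d_in] + [width] * (depth - 1) + [1]
    return [rng.standard_normal((dims[i], dims[i + 1])) for i in range(depth)]

rng = np.random.default_rng(0)
d_in, width, depth = 20, 64, 3                   # illustrative sizes, not the paper's
teacher = rng.standard_normal(d_in)
X_train = rng.standard_normal((5, d_in))
y_train = np.sign(X_train @ teacher)
X_hold = rng.standard_normal((50, d_in))
y_hold = np.sign(X_hold @ teacher)

holdout_errors = []
for _ in range(20_000):
    params = sample_params(rng, d_in, width, depth)
    if np.all(np.sign(mlp_forward(params, X_train)) == y_train):   # keep solutions only
        preds = np.sign(mlp_forward(params, X_hold))
        holdout_errors.append(np.mean(preds != y_hold))

print(f"kept {len(holdout_errors)} solutions; "
      f"average holdout error over solutions = {np.mean(holdout_errors):.3f}")
```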

Figure 3: Exploring generalisation performance along the label ray defined in Section 5.3. Functions further along the ray are found to generalise substantially better than average, for both NNGP draws and finite width networks trained by gradient descent.
Top left: Schematic diagram illustrating the label ray in function space. While all functions in the grey shaded orthant attain zero train error, the darker shaded solutions have small margin, and thus, under a Gaussian process model, their holdout predictions are expected to be driven by random fluctuations.
Top right: Holdout error along the label ray of the small learning task described in Section 5.2. Results are shown for both networks trained by gradient descent and NNGP posterior samples. The label scale $\alpha$ is defined in Section 5.3 and measures the distance of the function along the label ray. NNGP posterior samples were generated by sampling from a Gaussian distribution with mean and covariance given by Equations 12 and 13. Networks trained by gradient descent were found by training with the loss function given in Equation 14. Shading shows the standard deviation over 100 random initialisations or posterior samples. Deep into function space (large $\alpha$) the holdout error is significantly better than for functions close to the origin (small $\alpha$). The average holdout error over the orthant, as computed in Section 5.2, is also plotted. Large margin networks and NNGP posterior samples both significantly outperform the orthant average. Finally, a gap is visible between large margin NNGP posterior samples and large margin networks trained by gradient descent; this may be due to an additional and undiagnosed implicit bias of gradient descent.
Bottom: Repeating experiments from Section 4 with a smaller label scale to sanity check the findings. For a depth 2 and a depth 7 MLP on the decimal digits dataset, the experiment from Section 4 was repeated with the loss function set to the label-ray loss of Equation 14 at a small label scale $\alpha$. The infinite width curve shows the Theorem 2 bound estimated via Equation 8. The other two curves show the results of networks trained by gradient descent. Despite all networks attaining 0% train error, holdout error was significantly worse for the networks trained with the small label scale. Also, the networks trained with the small label scale exhibit a substantially smaller gap with the Theorem 2 upper bound.

5.3 Diagnosis: Margin

This subsection finds that gradient descent has an important implicit bias in determining the margin of the returned network. This conclusion is made by studying the generalisation error of functions along the label ray in function space, depicted in Figure 3 (top left). For a training set with a vector of binary labels $y$, a function $\alpha$-far along the label ray refers to the point $\alpha\, y$ in function space.

For the case of a zero mean NNGP, the predictive distribution on a holdout set conditioned on training labels $\alpha$-far along the label ray is given by Bishop (2006, Chapter 2.3.1):

posterior mean $= \alpha\, K_{*X}\, K_{XX}^{-1}\, y,$ (12)
posterior covariance $= K_{**} - K_{*X}\, K_{XX}^{-1}\, K_{X*},$ (13)

where $K_{*X}$ is the covariance between holdout and train inputs, and $K_{XX}$, $K_{X*}$ and $K_{**}$ are defined analogously. For large $\alpha$, predictions are driven by the posterior mean, whereas for small $\alpha$, they are driven by random fluctuations. So one expects that letting $\alpha$ grow should improve holdout error.
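
The following sketch draws posterior samples conditioned on training outputs $\alpha\, y$ using the standard noiseless Gaussian process conditioning formulas above; `kernel_fn` is again a hypothetical stand-in for the NNGP kernel, and a small jitter is added for numerical stability.

```python
import numpy as np

def posterior_along_label_ray(kernel_fn, X_train, y_train, X_holdout, alpha,
                              num_samples=100, jitter=1e-6, seed=0):
    """Sample holdout outputs of a zero-mean GP conditioned on train outputs = alpha * y."""
    rng = np.random.default_rng(seed)
    K_tt = kernel_fn(X_train, X_train) + jitter * np.eye(len(X_train))
    K_ht = kernel_fn(X_holdout, X_train)
    K_hh = kernel_fn(X_holdout, X_holdout)
    solve = np.linalg.solve(K_tt, np.concatenate([(alpha * y_train)[:, None], K_ht.T], axis=1))
    mean = K_ht @ solve[:, 0]                     # Equation 12: alpha * K_ht K_tt^-1 y
    cov = K_hh - K_ht @ solve[:, 1:]              # Equation 13
    L = np.linalg.cholesky(cov + jitter * np.eye(len(X_holdout)))
    samples = mean + rng.standard_normal((num_samples, len(X_holdout))) @ L.T
    return samples                                # predictions are sign(samples)

# Toy usage: holdout error of posterior samples at a small and a large label scale.
rng = np.random.default_rng(1)
kernel_fn = lambda A, B: np.exp(-0.5 * ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1))
X_train, X_holdout = rng.standard_normal((10, 3)), rng.standard_normal((40, 3))
w = rng.standard_normal(3)
y_train, y_holdout = np.sign(X_train @ w), np.sign(X_holdout @ w)
for alpha in [0.01, 10.0]:
    S = posterior_along_label_ray(kernel_fn, X_train, y_train, X_holdout, alpha)
    err = np.mean(np.sign(S) != y_holdout)
    print(f"alpha = {alpha:5.2f}: average holdout error of samples = {err:.3f}")
```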

For the case of gradient descent training, functions $\alpha$-far along the label ray can be returned by minimising the following loss function:

$L_\alpha(w) := \frac{1}{n} \sum_{i=1}^{n} \big(f(x_i; w) - \alpha\, y_i\big)^2.$ (14)

A subtle but important point is that for gradient descent training with Nero (Liu et al., 2021) the norms of the weight matrices are constrained, and thus $\alpha$ controls a properly normalised notion of margin.
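
A minimal PyTorch sketch of the loss in Equation 14 follows. Plain Adam stands in for the Nero optimiser used in the paper (so the margin is not normalised in the same way), and the architecture, data and hyperparameters are illustrative.

```python
import torch
import torch.nn as nn

def label_ray_loss(outputs, labels, alpha):
    """Equation 14 (sketch): square loss against the +/-1 labels scaled by alpha."""
    return ((outputs.squeeze(-1) - alpha * labels) ** 2).mean()

# Illustrative setup: a small relu MLP on random +/-1-labelled data.
torch.manual_seed(0)
X = torch.randn(64, 20)
y = torch.sign(torch.randn(64))
net = nn.Sequential(nn.Linear(20, 128), nn.ReLU(), nn.Linear(128, 1))

alpha = 10.0                                          # label scale to target
opt = torch.optim.Adam(net.parameters(), lr=1e-2)     # Adam stands in for Nero here
for step in range(2000):
    opt.zero_grad()
    loss = label_ray_loss(net(X), y, alpha)
    loss.backward()
    opt.step()

with torch.no_grad():
    train_error = (torch.sign(net(X).squeeze(-1)) != y).float().mean().item()
print(f"final loss {loss.item():.4f}, train error {train_error:.2f}")
```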

An experiment was performed to measure the holdout error of networks $\alpha$-far along the label ray, for both NNGP draws and neural networks selected by Nero, over a wide range of label scales $\alpha$. As can be seen in Figure 3 (top right), varying $\alpha$ appears to directly control the holdout error, despite all solutions attaining 0% train error. Moreover, large-$\alpha$ solutions significantly outperform the average holdout error over the solution space. This suggests that gradient descent possesses a significant implicit bias in its ability to control margin, going beyond the implicit bias of architecture.

To sanity check this result, experiments from Section 4 were repeated with the loss function $L_\alpha$ at a small label scale replacing the plain square loss. As can be seen in Figure 3 (bottom), the holdout performance was significantly diminished at the small label scale, as was the gap with the PAC-Bayes bound from Theorem 2.

6 Discussion and Conclusion

This paper has explored the separate implicit biases of architecture and gradient descent. Section 3 derived an analytical generalisation bound on the NNGP posterior, Section 4 found this bound to be non-vacuous, while Section 5 showed that large margin functions substantially outperform the bound.

The findings in this paper support the importance of the implicit bias of architecture: reproducing the findings of Valle-Pérez & Louis (2020) but with improved technical tools, the average generalisation performance of the architecture alone was already found to be good. But the implicit bias of gradient descent was also found to be important. In particular, in contrast to an assumption made by Valle-Pérez & Louis (2020) that gradient descent “samples the zero-error region close to uniformly”, it was found that gradient descent can be used to target zero-error functions with large margin. This also subtly counters a proposal by Mingard et al. (2021) that gradient descent acts like a “Bayesian sampler”. In particular, gradient descent can target functions with margin so large that their Bayesian posterior probability vanishes. And indeed, these minimum a posteriori functions seem to generalise best.

One direction of future work suggested by these results is an improvement to the function space PAC-Bayes theory to account for margin. While margin-based PAC-Bayes bounds do already exist (Herbrich, 2001; Langford & Shawe-Taylor, 2003; Neyshabur et al., 2018), these bounds operate in weight space and further do not seem to imply an advantage to selecting the max-margin classifier over other classifiers included in the PAC-Bayes posterior.

Ultimately, a generalisation theory that properly accounts for the various implicit biases of deep learning could provide a more principled basis for both neural architecture design as well as the design of new regularisation schemes. It is hoped that the new analytical results as well as experimental insights included in this paper contribute a step towards reaching that goal.

References

  • Achille & Soatto (2018) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 2018.
  • Azizan et al. (2021) Navid Azizan, Sahin Lale, and Babak Hassibi. Stochastic mirror descent on overparameterized nonlinear models. IEEE Transactions on Neural Networks and Learning Systems, 2021.
  • Bartlett et al. (2017) Peter L. Bartlett, Dylan J. Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. In Neural Information Processing Systems, 2017.
  • Biggs & Guedj (2021) Felix Biggs and Benjamin Guedj. On margins and derandomisation in PAC-Bayes. arXiv:2107.03955, 2021.
  • Bishop (2006) Christopher M. Bishop. Pattern Recognition and Machine Learning. Springer-Verlag, 2006.
  • Cho & Saul (2009) Youngmin Cho and Lawrence Saul. Kernel methods for deep learning. In Neural Information Processing Systems, 2009.
  • Dziugaite & Roy (2017) Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In Uncertainty in Artificial Intelligence, 2017.
  • Foret et al. (2021) Pierre Foret, Ariel Kleiner, Hossein Mobahi, and Behnam Neyshabur. Sharpness-aware minimization for efficiently improving generalization. In International Conference on Learning Representations, 2021.
  • Geiping et al. (2021) Jonas Geiping, Micah Goldblum, Phillip E. Pope, Michael Moeller, and Tom Goldstein. Stochastic training is not necessary for generalization. arXiv:2109.14119, 2021.
  • Herbrich (2001) Ralf Herbrich. Learning Kernel Classifiers: Theory and Algorithms. MIT Press, 2001.
  • Keskar et al. (2017) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. In International Conference on Learning Representations, 2017.
  • Langford & Seeger (2001) John Langford and Matthias Seeger. Bounds for averaging classifiers. Technical report, Carnegie Mellon University, 2001.
  • Langford & Shawe-Taylor (2003) John Langford and John Shawe-Taylor. PAC-Bayes & margins. In Neural Information Processing Systems, 2003.
  • LeCun et al. (1998) Yann LeCun, Corinna Cortes, and Christopher J.C. Burges. MNIST handwritten digit database, 1998.
  • Lee et al. (2018) Jaehoon Lee, Jascha Sohl-Dickstein, Jeffrey Pennington, Roman Novak, Sam Schoenholz, and Yasaman Bahri. Deep neural networks as Gaussian processes. In International Conference on Learning Representations, 2018.
  • Liu et al. (2021) Yang Liu, Jeremy Bernstein, Markus Meister, and Yisong Yue. Learning by turning: Neural architecture aware optimisation. In International Conference on Machine Learning, 2021.
  • McAllester (1998) David A. McAllester. Some PAC-Bayesian theorems. In Conference on Computational Learning Theory, 1998.
  • Mingard et al. (2021) Chris Mingard, Guillermo Valle-Pérez, Joar Skalse, and Ard A. Louis. Is SGD a Bayesian sampler? Well, almost. Journal of Machine Learning Research, 2021.
  • Neal (1994) Radford M. Neal. Bayesian Learning for Neural Networks. Ph.D. thesis, Department of Computer Science, University of Toronto, 1994.
  • Neyshabur et al. (2018) Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In International Conference on Learning Representations, 2018.
  • Neyshabur et al. (2019) Behnam Neyshabur, Zhiyuan Li, Srinadh Bhojanapalli, Yann LeCun, and Nathan Srebro. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
  • Soudry et al. (2018) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. Journal of Machine Learning Research, 2018.
  • Valle-Pérez & Louis (2020) Guillermo Valle-Pérez and Ard A. Louis. Generalization bounds for deep learning. arXiv:2012.04115, 2020.
  • Valle-Pérez et al. (2019) Guillermo Valle-Pérez, Chico Q. Camargo, and Ard A. Louis. Deep learning generalizes because the parameter–function map is biased towards simple functions. In International Conference on Learning Representations, 2019.
  • van der Vaart (1998) Aad W. van der Vaart. Asymptotic Statistics. Cambridge University Press, 1998.
  • Vapnik & Chervonenkis (1971) Vladimir N. Vapnik and Alexey Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 1971.
  • Wei et al. (2018) Colin Wei, J. Lee, Qiang Liu, and Tengyu Ma. On the margin theory of feedforward neural networks. arXiv:1810.05369, 2018.
  • Wilson et al. (2017) Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht. The marginal value of adaptive gradient methods in machine learning. In Neural Information Processing Systems, 2017.
  • Wu et al. (2020) Yikai Wu, Xingyu Zhu, Chenwei Wu, Annie Wang, and Rong Ge. Dissecting Hessian: Understanding common structure of Hessian in neural networks. arXiv:2010.04261, 2020.
  • Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. In International Conference on Learning Representations, 2017.

Appendix A Proofs

A.1 Function Space Orthant Arithmetic

The following is a proof of Theorem 1, the PAC-Bayesian theorem due to Valle-Pérez et al. (2019).

Proof of Theorem 1.

The first inequality is a basic property of logarithms. The second inequality follows from Theorem 3 of Langford & Seeger (2001), by setting the prior measure to $P$ and the posterior measure to $P$ conditioned on the solution space $\mathcal{C}$. Under these settings, the average training error rate over the posterior is zero, and the KL divergence between posterior and prior equals $\ln \frac{1}{P(\mathcal{C})}$. ∎

Proof of Theorem 3.

The result follows by interchanging the order of expectations, rewriting the expectation of an indicator variable as a probability, and finally applying the definition of conditional probability. ∎

A.2 Gaussian Orthant Probabilities

Proof of Lemma 1.

The orthant probability may first be expressed using the probability density function of the multivariate Normal distribution as follows:

By the change of variables or equivalently , the orthant probability may be expressed as:

where the second equality follows by symmetry, and the third equality follows by inserting a factor of into the integrand.

Next, by Jensen’s inequality,

The third equality follows by noting , while for , . ∎

A.3 Neural Networks as Gaussian Processes

The essence of the following lemma is due to Neal (1994). The lemma will be used in the proof of Theorem 4.

Lemma 2 (NNGP correspondence).

For the neural network layer given by Equation 18, consider randomly sampling the weight matrix. If the following hold:

  (i) for every input, the activations at the current layer are iid with finite first and second moment;

  (ii) the weights are drawn iid with zero mean and finite variance;

  (iii) for any random variable with finite first and second moment, applying the nonlinearity yields a random variable that also has finite first and second moment;

then, in the limit that the layer width tends to infinity, the following also hold:

  (1) for every input, the activations at the next layer are iid with finite first and second moment;

  (2) for any collection of inputs, the distribution of the corresponding pre-activations is jointly Normal.

While condition (i) may seem non-trivial, notice that the lemma propagates this condition to the next layer via entailment (1). This means that provided condition (i) holds at the first layer, then recursive application of the lemma implies that the network’s pre-activations are jointly Normal at all layers via entailment (2).

Proof of Lemma 2.

To establish entailment (1), consider the vector of pre-activations at the next layer, which may be written as a sum over the incoming hidden units:

(15)

By conditions (i) and (ii), the summands in Equation 15 are iid random vectors with zero mean, and any two distinct components of the same vector summand have the same variance and zero covariance. Then by the multivariate central limit theorem (van der Vaart, 1998, p. 16), in the limit that the layer width tends to infinity, the components of this vector are Gaussian with a covariance equal to a scaled identity matrix. In particular, the components are iid with finite first and second moment. Applying condition (iii) then implies that the same holds for the next layer’s activations. This establishes entailment (1).

To establish entailment (2), consider instead the vector collecting the pre-activations of a single hidden unit across the full collection of inputs, which may again be written as a sum over the incoming hidden units:

(16)

Again by combining conditions (i) and (ii), the summands in Equation 16 are iid random vectors with finite mean and finite covariance. Then, as the layer width tends to infinity, the distribution of this vector is jointly Normal, again by the multivariate central limit theorem. This establishes entailment (2). ∎

The essence of the following theorem appears in a paper by Lee et al. (2018), building on the work of Cho & Saul (2009). The theorem and its proof are included for completeness.

Theorem 4 (NNGP for relu networks).

Consider an -layer MLP defined recursively via:

(17)
(18)

where denotes an input, denotes the pre-activations at the th layer, denotes the weight matrix at the th layer, and denotes the nonlinearity.

Set the output dimension to one, and set the nonlinearity to a scaled relu. Suppose that the weight matrices have entries drawn iid from a standard Normal distribution, and consider any collection of inputs each with the same fixed Euclidean norm.

In the limit that the hidden layer widths tend to infinity, the distribution of outputs induced by random sampling of the weights is jointly Normal with mean zero and covariance:

(19)

where .

Proof.

Condition (ii) of Lemma 2 holds at all layers for iid standard Normal weights, and condition (iii) holds trivially for the scaled relu nonlinearity. Provided one can establish condition (i) for the first layer activations, condition (i) will hold at all layers by recursive application of Lemma 2, thus establishing joint Normality of the pre-activations at all layers (including the network outputs). But condition (i) does hold at the first layer, since it is quick to check by Equation 17 that for any input of the assumed norm the first layer pre-activations are iid Gaussian, and the nonlinearity preserves both iid-ness and finiteness of the first and second moment.

Since the pre-activations at any layer are jointly Normal, all that remains is to compute their first and second moments. The first moment of every pre-activation is zero. This can be seen by taking the expectation of Equation 18 and using the fact that the weights are independent of the previous layer’s activations and have mean zero.

Since the pre-activations and are jointly Normal with mean zero, their distribution is completely described by their covariance matrix , defined by:

where the hidden unit index is unimportant since hidden units in the same layer are identically distributed.

The theorem statement will follow from an effort to express in terms of , and then recursing back through the network. By Equation 18 and independence of the , the covariance may be expressed as:

(20)

where indexes an arbitrary hidden unit in the th layer. To make progress, it helps to first evaluate:

which follows by the definition of and symmetry of the Gaussian expectation around zero. Then, by recursion:

where the final equality holds because the first layer pre-activations are iid by Equation 17. Therefore, the covariance at layer is just:

Equation 20 may now be used to express in terms of . Dropping the indexing for brevity:

By making the substitution , this integral becomes equivalent to as expressed in Equation 15 of Cho & Saul (2009). Substituting in the evaluation of this integral (Cho & Saul, 2009, Equation 6), one obtains:

(21)

where .

All that remains is to evaluate . Since , this is given by:

The proof is completed by combining this expression with the recurrence relation in Equation 21. ∎
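
As a numerical sanity check on the key step of the recursion, the sketch below compares a Monte-Carlo estimate of $\mathbb{E}[\varphi(u)\varphi(v)]$ for jointly Gaussian $(u, v)$ against the first-order arccosine closed form of Cho & Saul (2009), assuming the scaled relu $\varphi(t) = \sqrt{2}\max(t, 0)$; this is the same closed form used in the kernel sketch in Section 4, and its constants are an assumption relative to the paper's exact parameterisation.

```python
import numpy as np

def arccos_closed_form(k_uu, k_vv, k_uv):
    """E[phi(u) phi(v)] for zero-mean jointly Gaussian (u, v) and phi(t) = sqrt(2)*max(t, 0),
    via the first-order arccosine kernel of Cho & Saul (2009)."""
    norm = np.sqrt(k_uu * k_vv)
    theta = np.arccos(np.clip(k_uv / norm, -1.0, 1.0))
    return (norm / np.pi) * (np.sin(theta) + (np.pi - theta) * np.cos(theta))

rng = np.random.default_rng(0)
k_uu, k_vv, k_uv = 1.0, 1.5, 0.6
cov = np.array([[k_uu, k_uv], [k_uv, k_vv]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=2_000_000)
phi = np.sqrt(2.0) * np.maximum(samples, 0.0)
print("monte carlo :", float(np.mean(phi[:, 0] * phi[:, 1])))
print("closed form :", arccos_closed_form(k_uu, k_vv, k_uv))
```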