
# Better PAC-Bayes Bounds for Deep Neural Networks using the Loss Curvature

We investigate whether it is possible to tighten PAC-Bayes bounds for deep neural networks by utilizing the Hessian of the training loss at the minimum. For the case of Gaussian priors and posteriors, we introduce a Hessian-based method to obtain tighter PAC-Bayes bounds that relies on closed-form solutions of layerwise subproblems. We thus avoid commonly used variational inference techniques, which can be difficult to implement and time-consuming for modern deep architectures. Through careful experiments we analyze the influence of the prior mean, prior covariance, posterior mean, and posterior covariance on obtaining tighter bounds. We discuss several limitations in further improving PAC-Bayes bounds through more informative priors.


## 1 Introduction

Deep neural networks are by now the established method for tackling a number of machine learning tasks. Despite this, their performance on out-of-sample data is extremely difficult to prove formally and is usually validated empirically using a held-out validation set. Classic measures of capacity such as the VC dimension, which are uniform across all functions representable by the classification architecture, are doomed to fail: DNNs are typically overparameterized, and correspondingly the set of representable functions is large enough to make the bounds vacuous. For example, in the highly cited work of Zhang et al. (2016) the authors show that the set of representable functions of typical DNNs contains solutions that can memorize the labels over the training set.

A lesson drawn from this experiment is that defining the hypothesis class a priori results in bounds that are too loose. Clearly, from empirical observation, the optimisation algorithm reaches solutions that do not trivially memorize the labels. As such, researchers have recently turned to measures of complexity that are data dependent and are defined a posteriori, that is, taking into account the specific solution achieved after optimization. One way of achieving the above is by deriving norm-based generalization bounds. Examples include Bartlett et al. (2017); Neyshabur et al. (2017); Golowich et al. (2017), which have been reached through a number of proof techniques. Typically these involve a product of the spectral (or other) norms of the DNN weight matrices: the smaller the norms after training, the simpler the final hypothesis, and as such the generalization bounds can be made significantly tighter. Furthermore, these adaptive measures of complexity have been shown to correlate empirically with generalization error; they exhibit large values for random labels and small values for real labeled data.

However, these analytical bounds, when evaluated explicitly, are still vacuous by several orders of magnitude. Furthermore, the conclusions drawn from empirical correlations have been criticized in a number of works Kawaguchi et al. (2017); Pitas et al. (2019); Nagarajan and Kolter (2019). Analyzing different cases, they provide counterexamples where uniform convergence fails to explain generalization. In Dinh et al. (2017) the authors show that sharp minima can generalize, leading to the conclusion that flatness at the minimum might be a sufficient but not necessary condition for generalization. A general point of confusion seems to be that a number of works propose conditions (such as small spectral norms) that are sufficient but not necessary for good generalization. Concretely defining such conditions for individual hypotheses, and not hypothesis classes, remains an important open research question Kawaguchi et al. (2017).

A number of other works have found metrics, sometimes heuristic, that correlate with better generalization: Thomas et al. (2019); Liang et al. (2017); Rangamani et al. (2019); Novak et al. (2018); Jiang et al. (2018). However, on a more fundamental level, simple correlation with generalization error is unsatisfying for more critical applications such as healthcare, autonomous driving, and finance, where DNNs are increasingly being deployed and potentially make life-altering decisions. Consequently, some works have achieved success in proving generalization in specific settings by optimizing PAC-Bayesian bounds McAllester (1999). PAC-Bayes theorems typically assume a randomized classifier defined by a posterior distribution $Q$; they then bound the generalization error of this randomized classifier using as a measure of complexity the KL divergence between the posterior $Q$ and a proper prior distribution $P$. The prior $P$ is meant to model a "very simple function" and is usually chosen to be a scaled standard Gaussian distribution.

In Dziugaite and Roy (2017) the authors optimize the mean of the posterior distribution while enforcing non-trivial training accuracy, so as to obtain a non-vacuous bound on significantly simplified MNIST experiments. In Zhou et al. (2018) the authors compress an original neural network, thereby minimizing its effective capacity while constraining it to have high accuracy over the training set. The compressed network can be shown to have non-vacuous generalization bounds even for large-scale ImageNet experiments.

It is worthwhile to note the subtle but important ways in which the above two works diverge from PAC-Bayesian intuition. PAC-Bayes defines an a-posteriori hypothesis class roughly as a ball around the obtained classifier solution; this ball is defined implicitly by assuming a Gaussian posterior with a given variance. The larger the variance of the posterior, the larger the ball that can be placed around the obtained solution and the simpler the hypothesis class (or, in the case of derandomized PAC-Bayes, the simpler the individual hypothesis). By optimizing the mean of the posterior in Dziugaite and Roy (2017) and by applying compression in Zhou et al. (2018), the authors arrive at posteriors that are not similar, even in expectation, to the original classifier. Furthermore, intuition regarding the role of the magnitude of the variance is largely lost. It therefore remains an open problem to test the limits of PAC-Bayes for proving generalization of the original solutions obtained by vanilla SGD.

In Dziugaite and Roy (2018) the authors take a step in this direction by optimising the prior of the PAC-Bayes bound. PAC-Bayesian theory allows the prior to be distribution dependent but not dependent on the training set; the authors enforce this constraint through the differential privacy approach of Dwork (2011). Both objectives in Dziugaite and Roy (2017); Dziugaite and Roy (2018) are, however, difficult to optimise, which makes them typically unusable for anything but extremely small-scale experiments.

Somewhat in parallel, the neural network compression literature has seen a number of works using second-order information in the Hessian to remove redundant parameters of neural networks LeCun et al. (1990); Hassibi and Stork (1993). Recent works such as Dong et al. (2017); Wang et al. (2019); Peng et al. (2019) have extended the above to the deep setting, introducing layerwise approximations of the Hessian and executing pruning in the Kronecker-factored eigenbasis. There has been controversy over the relationship between the Hessian, the Fisher, and the empirical Fisher, and their use as curvature matrices in modern machine learning Kunstner et al. (2019); Thomas et al. (2019). Despite this, our layerwise Hessian is well grounded, and we are mainly interested in its empirical ability to capture parameter relevance, which has been good in practice. Furthermore, a number of approximations can be efficiently and stably computed using simple forward passes over the training set. The Hessian can be used to find directions (weights) along which the empirical loss is flat. This has been used in the compression literature to prune away irrelevant weights; for example, one could use the information in the Hessian to first prune a network and then derive a bound on the remaining weights. By contrast, we aim to directly add more noise in directions along which the loss does not vary.

In this work we adopt the PAC-Bayesian approach. We seek a posterior covariance that is close to a given prior while taking into account weight importance through the Hessian, thereby adding noise with higher variance to unimportant weights. Our approach can be seen as approximately finding the largest ball (or ellipsoid) around the given DNN solution obtained by vanilla SGD for which the classification results are consistent, and is therefore directly related to PAC-Bayesian intuition. While the resulting bounds are still vacuous, they are significantly tighter than when not incorporating second-order information, and we discuss the implications for PAC-Bayes in general.

Our work has close connections with Achille and Soatto (2018); Achille et al. (2019). These works aim to link, on a fundamental level, the Kolmogorov, Shannon, and Fisher information in deep neural networks to the sufficiency, minimality, and invariance of their representations. Our work, by contrast, focuses on tightening PAC-Bayesian bounds and determining how much progress can be made towards non-vacuous bounds simply by leveraging local properties of a given minimum.

Computing the exact Hessian is computationally hard for DNNs Martens and Grosse (2015); Martens et al. (2012), and inverting the resulting full matrix is often intractable Dong et al. (2017). The layerwise Hessian that we use circumvents both issues.

## 2 Contributions

• We develop a method of tightening existing PAC-Bayes bounds which doesn't rely on stochastic optimization. Our method instead focuses on approximating the layerwise Hessian matrices and then performing a grid search over optimal posterior covariances, for which we derive a closed-form solution with respect to a given prior. We show how, in a specific setting, this can be seen as minimizing an upper bound on the empirical loss.

• We perform detailed experiments on PAC-Bayes bounds, separating the effects of the choice of the prior mean, posterior mean, and posterior covariance on the tightness of the bounds. Choosing the prior mean to be equal to the random initialization accounts for the vast majority of the improvement of a bound, often making the bound non-vacuous for simple problems. For bounds that are made non-vacuous in this way, optimizing the covariance of the posterior results in a small further improvement. For bounds that remain vacuous after appropriately choosing the prior mean, optimizing the posterior covariance results in a bigger improvement; we are nevertheless unable to tighten the bounds to the point of non-vacuity. An obvious conclusion is that better approximations of the Hessian might be critical.

• We discuss difficulties in tightening the bounds further by learning a more informative prior from a separate training set. We also give intuition as to why this is impossible for the first layer, given our setting.

## 3 PAC-Bayesian framework

We consider the hypothesis class realized by a feedforward neural network architecture of depth $L$ with coordinate-wise activation functions $\phi$, defined as the set of functions $f_\theta(x) = W_L\,\phi(W_{L-1}\cdots\phi(W_1 x))$ with $\theta = (W_1,\ldots,W_L)$. Given the loss function $\ell$ we can define the population loss $L(\theta) = \mathbb{E}_{(x,y)\sim D}[\ell(f_\theta(x), y)]$ and, given a training set of $N$ instances $\{(x_i,y_i)\}_{i=1}^N$, the empirical loss $\hat{L}(\theta) = \frac{1}{N}\sum_{i=1}^N \ell(f_\theta(x_i), y_i)$.

The PAC-Bayesian framework McAllester (1999) provides generalization error guarantees for randomized classifiers drawn from a posterior distribution $Q$. The framework models the complexity of the randomized classifier as the KL divergence between the posterior $Q$ and a prior $P$. The prior must be valid in the Bayesian sense, in that it cannot depend in any way on the training data. On the contrary, the posterior can be chosen to be any arbitrary distribution. We will use the following form of the PAC-Bayes bound.

###### Theorem 3.1.

(PAC-Bayesian theorem McAllester (1999)) For any data distribution $D$ over examples, the following bound holds with probability at least $1-\delta$ over random i.i.d. samples $S$ of size $N$ drawn from the data distribution:

$$\mathbb{E}_{\theta\sim Q}[L(\theta)] \le \mathbb{E}_{\theta\sim Q}[\hat{L}(\theta)] + \sqrt{\frac{\mathrm{KL}(Q\|P)+\ln\frac{2(N-1)}{\delta}}{2N}} \tag{1}$$

Here $Q$ is an arbitrary "posterior" distribution over parameter vectors, which may depend on the sample $S$ and on the prior $P$.

In Neyshabur et al. (2017) the authors derive an analytical solution for the above theorem while "derandomizing" the bound so that it applies to deterministic classifiers. The resulting bound has a strong dependence on the spectral norms of each layer's weight matrix. This and other norm-based bounds derived by different techniques Bartlett et al. (2017); Golowich et al. (2017) correlate empirically with the generalization error but are still vacuous by several orders of magnitude.
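To make the quantities in Theorem 3.1 concrete, the bound can be evaluated numerically once the KL term is known. The following is a minimal sketch, not code from the paper: the function names and the toy numbers are ours, and it evaluates the right-hand side of the bound for diagonal Gaussian priors and posteriors.

```python
import numpy as np

def kl_diag_gaussians(mu_q, var_q, mu_p, var_p):
    """KL(N(mu_q, diag(var_q)) || N(mu_p, diag(var_p)))."""
    return 0.5 * np.sum(var_q / var_p + (mu_p - mu_q) ** 2 / var_p
                        - 1.0 + np.log(var_p / var_q))

def pac_bayes_bound(emp_loss, kl, n, delta=0.05):
    """Right-hand side of the McAllester-style bound in Theorem 3.1."""
    complexity = np.sqrt((kl + np.log(2 * (n - 1) / delta)) / (2 * n))
    return emp_loss + complexity

# toy example: 1000 weights, posterior centered at the trained solution,
# prior centered at the random initialization (as done later in the paper)
rng = np.random.default_rng(0)
mu_p = rng.normal(size=1000)                 # random initialization (prior mean)
mu_q = mu_p + 0.01 * rng.normal(size=1000)   # trained weights (posterior mean)
kl = kl_diag_gaussians(mu_q, np.full(1000, 0.01), mu_p, np.full(1000, 0.01))
print(pac_bayes_bound(emp_loss=0.02, kl=kl, n=55000, delta=0.05))
```

Note how a prior mean close to the posterior mean keeps the KL term, and hence the complexity penalty, small.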

## 4 From full empirical loss to quadratic approximation

### 4.1 Previous work

As the analytical solution for the KL term in (1) obviously underestimates the noise robustness of the deep neural network around the minimum, one might be tempted to obtain a tighter PAC-Bayes bound by directly optimising

$$\mathcal{L}(Q(w|D)) = \mathbb{E}_{\theta\sim Q}[\hat{L}(\theta)] + \sqrt{\frac{\mathrm{KL}(Q\|P)+\ln\frac{2(N-1)}{\delta}}{2N}} \tag{2}$$

so as to obtain a posterior that is both close to the PAC-Bayesian prior and retains non-trivial accuracy. The above objective cannot be optimised directly, as computing $\mathbb{E}_{\theta\sim Q}[\hat{L}(\theta)]$ or its gradients is intractable for general distributions $Q$. A typical workaround is to parameterize $Q$ as a Gaussian distribution $\mathcal{N}(\mu,\Sigma)$, sample $\theta\sim Q$, and compute gradients of the resulting unbiased estimate of the objective. Note that the objective is still typically very difficult to optimise due to the high variance of the resulting gradients. In Dziugaite and Roy (2017) the authors use this technique to obtain non-vacuous bounds for fully connected deep neural networks. Crucially, however, they optimise the mean of the parameterized posterior, resulting in a stochastic classifier which corresponds to a different minimum than the original deterministic one.

Remarkably, the above formulation bears striking resemblance to the objective

$$C_\beta(D;P,Q) = \mathbb{E}_{\theta\sim Q}[\hat{L}(\theta)] + \beta\,\mathrm{KL}(Q\|P) \tag{3}$$

which is known as the Information Bottleneck (IB) Lagrangian (strictly speaking this is an upper bound on the IB Lagrangian, but we will use the term for simplicity) under the Information Bottleneck framework Achille and Soatto (2018); Tishby et al. (2000), as the Evidence Lower Bound (ELBO) in the variational inference literature Kingma et al. (2015); Bishop (2006) when $\beta = 1$, or more recently as the task complexity Achille et al. (2019). We will use a version of the above objective to obtain our bounds.

### 4.2 Our approach

We now present the first main contribution of our paper. The original objective (3) is difficult to optimise directly, as the expectation term or its gradients cannot be computed efficiently. Furthermore, variational inference using an MCMC estimate suffers from high variance of the gradients and requires careful initialization and hyperparameter tuning Wu et al. (2018). We will derive an upper bound on (3) which is more suitable for optimisation.

###### Theorem 4.1.

For the layerwise $\ell_2$ error defined in the proof below, the following upper bound on the IB Lagrangian holds, given that we are at a local minimum:

$$C_\beta(D;P,Q) \lesssim \sum_l \sqrt{\sum_j c_{lj}\,\mathbb{E}_{\eta\sim Q'_{lj}}\Big[\tfrac{1}{2}\eta^T H_{lj}\eta\Big]} + \beta\sum_{l,j}\mathrm{KL}(Q_{lj}\|P_{lj}) \tag{4}$$

where $l$ indexes the layers, $j$ indexes the neurons at each layer (we assume the same number per layer for simplicity), $H_{lj}$ denotes the local Hessian, and $Q'_{lj}$ is a centered version of $Q_{lj}$.

###### Proof.

We start by defining a layerwise empirical error $\hat{E}_l(\theta_l) = \frac{1}{N}\sum_{i=1}^N \|W_l z^i_{l-1} - z^i_l\|_2^2$, where $z^i_l$ denotes the activations of layer $l$ for training sample $i$. One can then easily show that, substituting this into the IB Lagrangian, we get

$$\begin{aligned}
C_\beta(D;P,Q) &= \mathbb{E}_{\theta\sim Q}[\hat{L}(\theta)] + \beta\,\mathrm{KL}(Q\|P)\\
&\le \mathbb{E}_{\theta\sim Q}\Big[\sum_{l=1}^{L-1}\sqrt{\hat{E}_l(\theta_l)}\prod_{k=l+1}^{L}\|\hat{\theta}_k\|_F + \sqrt{\hat{E}_L(\theta_L)}\Big] + \beta\,\mathrm{KL}(Q\|P)\\
&\le \sum_{l=1}^{L-1}\sqrt{\mathbb{E}_{\theta\sim Q}[\hat{E}_l(\theta_l)]}\prod_{k=l+1}^{L}\mathbb{E}_{\theta\sim Q}[\|\hat{\theta}_k\|_F] + \sqrt{\mathbb{E}_{\theta\sim Q}[\hat{E}_L(\theta_L)]} + \beta\,\mathrm{KL}(Q\|P)\\
&\le \sum_{l=1}^{L} c_l\sqrt{\mathbb{E}_{\theta\sim Q}[\hat{E}_l(\theta_l)]} + \beta\,\mathrm{KL}(Q\|P)\\
&\le \sum_{l=1}^{L} c_l\sqrt{\mathbb{E}_{\eta\sim Q'}\Big[\Big(\tfrac{\partial \hat{E}_l(\theta_l)}{\partial \theta_l}\Big)^{T}\eta + \tfrac{1}{2}\eta^{T}H_l\eta + O(\|\eta\|^3)\Big]} + \beta\,\mathrm{KL}(Q\|P)\\
&\approx \sum_{l=1}^{L} c_l\sqrt{\mathbb{E}_{\eta\sim Q'}\Big[\tfrac{1}{2}\eta^{T}H_l\eta\Big]} + \beta\,\mathrm{KL}(Q\|P)
\end{aligned}\tag{5}$$

where in line 3 we use the linearity of expectation, Hölder's inequality (due to the non-negativity of the random variables), and Jensen's inequality for the concave square root. In line 4 we absorb the Frobenius norm terms into constants $c_l$. Each error term is only multiplied by Frobenius norm terms from the deeper layers; therefore one can start optimising from the final layer and proceed to the first while considering the $c_l$ as constant. In practice we will simply treat all $c_l$ as unknown scaling factors. In line 5 we expand each term using a Taylor expansion, and subsequently ignore the first-order term, as the DNN is assumed to be well trained so that the first derivative is zero, while terms of order higher than 2 are unimportant. We use $Q'$ to denote the centered version of distribution $Q$.

Taking the first and second derivatives of the layerwise error with respect to $W_l$, we get

$$\frac{\partial \hat{E}_l(\theta)}{\partial W_l} = \frac{1}{N}\sum_{i=1}^{N}\frac{\partial}{\partial W_l}\|W_l z^i_{l-1} - z^i_l\|_2^2 = \frac{1}{N}\sum_{i=1}^{N} 2\,(W_l z^i_{l-1} - z^i_l)\,z^{i\,T}_{l-1} \tag{6}$$
$$\frac{\partial^2 \hat{E}_l(\theta)}{\partial W_l^{(j,:)}\,\partial W_l^{(j,:)\,T}} = \frac{1}{N}\sum_{i=1}^{N} z^i_{l-1}\,z^{i\,T}_{l-1} \tag{7}$$

where the second derivative is with respect to any row $W_l^{(j,:)}$ of the weight matrix $W_l$. We see that the full layerwise Hessian then has a block-diagonal structure where each block is equal to $\frac{1}{N}\sum_i z^i_{l-1} z^{i\,T}_{l-1}$. Each row corresponds to a neuron of the layer, and for an appropriate choice of prior and posterior with block-diagonal covariances it is easy to see that the final form of expression (5) factorizes as

$$C_\beta(D;P,Q) \lesssim \sum_l \sqrt{\sum_j c_{lj}\,\mathbb{E}_{\eta\sim Q'_{lj}}\Big[\tfrac{1}{2}\eta^T H_{lj}\eta\Big]} + \beta\sum_{l,j}\mathrm{KL}(Q_{lj}\|P_{lj}) \tag{8}$$

this completes the proof. ∎

We see that we have managed to upper bound the empirical randomized loss by a scaled sum of quadratic terms involving layerwise Hessian matrices and centered random noise vectors. Intuitively we have reduced the complexity of our optimisation problem simply by turning it into a number of separate subproblems.
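Since each diagonal block of the layerwise Hessian in the proof is just the (uncentered) second moment of the previous layer's activations, it can be estimated with a single forward pass over the training set. A minimal NumPy sketch (the function name and toy shapes are ours):

```python
import numpy as np

def layerwise_hessian(z_prev):
    """Layerwise Hessian block of Eq. (7): (1/N) * sum_i z_i z_i^T,
    where z_prev has shape (N, d) and holds the previous layer's
    activations for all N training samples."""
    n = z_prev.shape[0]
    return z_prev.T @ z_prev / n

# toy check: activations of a hypothetical 5-unit layer over 100 samples
rng = np.random.default_rng(0)
z = rng.normal(size=(100, 5))
H = layerwise_hessian(z)
assert H.shape == (5, 5)
assert np.allclose(H, H.T)                       # symmetric
assert np.all(np.linalg.eigvalsh(H) >= -1e-10)   # positive semi-definite
```

The positive semi-definiteness follows directly from the outer-product structure, which is what makes the later closed-form inversions well behaved.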

## 5 Finding closed-form solutions for (4)

We will see that for specific choices of $P$ and $Q$ the subproblems in the above upper bound have closed-form solutions. We make the modelling assumptions $Q_{lj} = \mathcal{N}(\mu_0, \Sigma_0)$ and $P_{lj} = \mathcal{N}(\mu_1, \lambda\Sigma_1)$. We also remove the square root from the above expression, as this greatly simplifies the following calculations. We can then show that this problem has a closed-form solution. We also ignore the constants $c_{lj}$ and suppress the neuron subscript $j$ from $H_{lj}$, writing $H_l$.

###### Lemma 5.1.

The optimization problem $\min_{\Sigma_0}\,\mathbb{E}_{\eta\sim Q'}\big[\tfrac{1}{2}\eta^T H_l\eta\big] + \beta\,\mathrm{KL}(Q\|P)$ has the closed-form solution

$$\Sigma_0^* = \beta\Big(H_l + \frac{\beta}{\lambda}\Sigma_1^{-1}\Big)^{-1} \tag{9}$$

where $H_l$ captures the approximate curvature along the directions of the parameters, while $\Sigma_1$ is a chosen prior covariance.

###### Proof.
$$\begin{aligned}
C_\beta(D;P,Q) &= \mathbb{E}_{\eta\sim Q'}\Big[\tfrac{1}{2}\eta^T H_l\eta\Big] + \beta\,\mathrm{KL}(Q\|P)
= \mathbb{E}_{\eta\sim Q'}\Big[\tfrac{1}{2}\mathrm{tr}(H_l\,\eta\eta^T)\Big] + \beta\,\mathrm{KL}(Q\|P)\\
&= \tfrac{1}{2}\mathrm{tr}\big(H_l\,\mathbb{E}_{\eta\sim Q'}[\eta\eta^T]\big) + \beta\,\mathrm{KL}(Q\|P)\\
&= \tfrac{1}{2}\mathrm{tr}(H_l\Sigma_0) + \tfrac{\beta}{2}\Big(\mathrm{tr}\big(\tfrac{1}{\lambda}\Sigma_1^{-1}\Sigma_0\big) - k + \tfrac{1}{\lambda}(\mu_0-\mu_1)^T\Sigma_1^{-1}(\mu_0-\mu_1) + \ln\frac{\det\lambda\Sigma_1}{\det\Sigma_0}\Big)
\end{aligned}\tag{10}$$

The gradient with respect to $\Sigma_0$ is

$$\frac{\partial C_\beta(D;P,Q)}{\partial \Sigma_0} = \frac{1}{2}H_l + \frac{\beta}{2\lambda}\Sigma_1^{-1} - \frac{\beta}{2}\Sigma_0^{-1}. \tag{11}$$

Setting it to zero, we obtain the minimizer $\Sigma_0^* = \beta\big(H_l + \frac{\beta}{\lambda}\Sigma_1^{-1}\big)^{-1}$. ∎

Even though the above minimiser is exact, we encounter a number of scaling issues due to the hidden constants $c_{lj}$. In practice we therefore perform a grid search over the parameters $\beta$ and $\lambda$ and try to find Pareto-optimal pairs balancing the accuracy of the randomized classifier and the KL complexity term.
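The procedure above can be sketched in a few lines of NumPy. This is a minimal illustration under the stated Gaussian assumptions; the helper names, toy dimensions, identity prior covariance, and grid values are ours, not from the paper:

```python
import itertools
import numpy as np

def optimal_posterior_cov(H, prior_prec, beta, lam):
    """Closed-form minimizer from Lemma 5.1:
    Sigma0* = beta * (H + (beta/lam) * Sigma1^{-1})^{-1}."""
    return beta * np.linalg.inv(H + (beta / lam) * prior_prec)

def kl_gaussians(mu0, S0, mu1, S1):
    """KL(N(mu0, S0) || N(mu1, S1)) for full-covariance Gaussians."""
    k = len(mu0)
    S1_inv = np.linalg.inv(S1)
    diff = mu0 - mu1
    return 0.5 * (np.trace(S1_inv @ S0) + diff @ S1_inv @ diff - k
                  + np.log(np.linalg.det(S1) / np.linalg.det(S0)))

# grid search over (beta, lambda), as described in the text
rng = np.random.default_rng(0)
d = 4
A = rng.normal(size=(d, d))
H = A @ A.T / d                       # stand-in for a layerwise Hessian block
mu0, mu1 = rng.normal(size=d), np.zeros(d)
for beta, lam in itertools.product([0.1, 1.0], [0.5, 2.0]):
    S0 = optimal_posterior_cov(H, np.eye(d), beta, lam)
    quad = 0.5 * np.trace(H @ S0)     # E[0.5 eta^T H eta] for centered eta ~ N(0, S0)
    kl = kl_gaussians(mu0, S0, mu1, lam * np.eye(d))
    print(beta, lam, quad + beta * kl)
```

Each grid point only requires one matrix inversion per block, in contrast to the stochastic optimisation of Dziugaite and Roy (2017).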

## 6 Experiments

We apply the above procedure to fully connected networks trained on the MNIST and CIFAR datasets. Specifically, we test the architecture

 input→300FC→300FC→#classesFC→output

on MNIST and

 input→200FC→200FC→#classesFC→output

on CIFAR. We test three different setups for each dataset: the original 10-class problems (MNIST-10, CIFAR-10) and two simplifications where we collapse the 10 classes into 5 classes (MNIST-5, CIFAR-5) and 2 classes (MNIST-2, CIFAR-2) respectively. We train each configuration to 100% training accuracy, compute the layerwise Hessians, and perform a grid search over the parameters $\beta$ and $\lambda$. For each point on the grid we calculate the empirical accuracy over the training set using Monte Carlo sampling with 5 samples, as well as the complexity term $\mathrm{KL}(Q\|P)$. We plot the results in Figure 2.
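The Monte Carlo estimate of the randomized classifier's empirical error used above can be sketched as follows. The linear toy model and all names here are ours, purely for illustration; the paper uses the fully connected networks described above:

```python
import numpy as np

def mc_randomized_error(predict, mu, sigma, X, y, n_samples=5, rng=None):
    """Monte Carlo estimate of the randomized classifier's training error:
    draw weights theta ~ N(mu, diag(sigma^2)) and average the 0-1 loss."""
    rng = rng or np.random.default_rng(0)
    errs = []
    for _ in range(n_samples):
        theta = mu + sigma * rng.normal(size=mu.shape)
        errs.append(np.mean(predict(theta, X) != y))
    return float(np.mean(errs))

# toy linear "network" on random data
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
w_true = rng.normal(size=10)
y = (X @ w_true > 0).astype(int)
predict = lambda theta, X: (X @ theta > 0).astype(int)
print(mc_randomized_error(predict, w_true, 0.05 * np.ones(10), X, y))
```

With only 5 samples the estimate is noisy, which is one reason Pareto fronts rather than single optima are reported.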

The baseline that we use is a Gaussian prior and posterior with the same diagonal covariance structure, scaled by the free parameter $\lambda$. The mean of the prior is chosen as the random initialization. Interestingly, we see that for MNIST not much improvement can be achieved using the Hessian approach, for any number of classes. Furthermore, even the baseline approach yields non-vacuous bounds. This calls for a more careful interpretation of the results in Dziugaite and Roy (2017). There the authors attribute their non-vacuous bounds mainly to their non-convex optimisation of the PAC-Bayes posterior. We see that non-vacuity is achieved primarily as a result of the problem being very simple, together with the choice of the prior mean as the random initialization. The techniques employed in Dziugaite and Roy (2017) simply tighten the bound further.

For CIFAR we see that we can significantly tighten the bound. However, we cannot turn a vacuous bound into a non-vacuous one, at least with the current problem set, optimisation approach, and approximation of the Hessian. This suggests that either our approximation of the Hessian is too crude or we need to design better priors. We discuss both in detail in the next sections.

## 7 Can we find a better prior?

### 7.1 Absolute limits of our bound

PAC-Bayesian theory allows one to choose an informative prior: the prior can depend on the generating distribution, but not on the training set used for the classifier that will be bounded. A number of previous works Parrado-Hernández et al. (2012); Catoni (2003); Ambroladze et al. (2007) have used this insight, mainly in simpler linear settings, usually by training a classifier on a separate training set and using the result as a prior.

Deep neural networks present a particular challenge when trying to learn a prior distribution. Given a function computed by a deep neural network, a huge number of weight reparametrizations that compute the same function are possible in principle Dinh et al. (2017). As such, it would be hopeless to expect meaningful priors for fully connected layers, as their curvature at the obtained minima should not exhibit statistical regularity. That need not be the case for the input layer, where the spatial structure of the input signals should enforce some regularity on the Hessian.

As an example, at the input layer of a neural network trained on the CIFAR datasets, the layerwise Hessian has the non-trivial structure shown in Figure 3.

We made some efforts to utilize this additional information to construct a tighter bound, simply by computing the covariance for a number of datasets similar to CIFAR and using that as a prior covariance for the first hidden layer. This did not yield the benefits we anticipated; we therefore derived closed-form solutions for the optimal prior induced by our solution for the optimal posterior. This is considered cheating in PAC-Bayes analysis, but it proves illustrative of why we fail to tighten our bound further, and none of the other results in this paper rely on the intuition we extract.

While it is difficult to find an optimum in the general case of a full prior Hessian and a full Hessian at the minimum, for the diagonal case the calculations are straightforward. We formulate the following theorem.

###### Theorem 7.1.

Under the diagonal approximations $H_l = \mathrm{diag}(h_{il})$ and $H_1 = \Sigma_1^{-1} = \mathrm{diag}(h_{i1})$, and assuming that the posterior covariance is set to the minimizer $\Sigma_0^*$ of Lemma 5.1, the optimal prior has

$$h^*_{i1} = \lambda\sqrt{\frac{h_{il}^2}{4\beta^2} + \frac{h_{il}}{\beta}(\mu_{i0}-\mu_{i1})^2} \tag{12}$$

where $h_{il}$ encodes the local curvature at the minimum obtained by SGD, $\mu_1$ corresponds to the random initialization of the DNN (by design), and $\mu_0$ corresponds to the minimum obtained after optimization.

For our choice of Gaussian posterior and optimisation procedure, the following is a lower bound on the minima we obtain under any Gaussian prior:

$$\min_{\Sigma_0,\Sigma_1} C_\beta(D;P,Q) \gtrsim \frac{1}{2}\Big(\sum_i a_{il}(\mu_{i0}-\mu_{i1})^2 + \beta\sum_i \ln\frac{h_{il}+a_{il}}{a_{il}}\Big) \tag{13}$$

where $a_{il}$ collects the curvature-dependent terms obtained by substituting the optimal prior (12).

###### Proof.

By Lemma 5.1 the minimizer is $\Sigma_0^* = \beta\big(H_l + \frac{\beta}{\lambda}H_1\big)^{-1}$. Substituting into the objective we obtain:

$$\begin{aligned}
C_\beta(D;P,Q)\big|_{\Sigma_0=\Sigma_0^*} &= \mathbb{E}_{\eta\sim Q'}\Big[\tfrac{1}{2}\eta^T H_l \eta\Big] + \beta\,\mathrm{KL}(Q\|P)\Big|_{\Sigma_0=\Sigma_0^*}\\
&= \tfrac{1}{2}\mathrm{tr}\Big(H_l\,\beta\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}\Big) + \tfrac{\beta}{2}\Big(\mathrm{tr}\Big(\tfrac{1}{\lambda}H_1\,\beta\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}\Big) + \tfrac{1}{\lambda}(\mu_0-\mu_1)^T H_1(\mu_0-\mu_1) - k + \ln\frac{\det \lambda H_1^{-1}}{\det \beta\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}}\Big)\\
&= \tfrac{\beta}{2}\,\mathrm{tr}\Big(\big(H_l+\tfrac{\beta}{\lambda}H_1\big)\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}\Big) + \tfrac{\beta}{2}\Big(\tfrac{1}{\lambda}(\mu_0-\mu_1)^T H_1(\mu_0-\mu_1) - k + \ln\frac{\det \lambda H_1^{-1}}{\det \beta\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}}\Big)\\
&= \tfrac{\beta}{2}\Big[\tfrac{1}{\lambda}(\mu_0-\mu_1)^T H_1(\mu_0-\mu_1) + \ln\frac{\det \lambda H_1^{-1}}{\det \beta\big(H_l+\tfrac{\beta}{\lambda}H_1\big)^{-1}}\Big]
\end{aligned}\tag{14}$$

The matrix equation (14) is difficult to deal with directly. We therefore use the common diagonal approximation of the Hessian, which is more amenable to manipulation. Substituting $H_l = \mathrm{diag}(h_{il})$ and $H_1 = \mathrm{diag}(h_{i1})$ into the above expression we get

$$C_\beta(D;P,Q) = \frac{\beta}{2}\Big(\frac{1}{\lambda}\sum_i h_{i1}(\mu_{i0}-\mu_{i1})^2 - \sum_i \ln\frac{h_{i1}}{\lambda} + \sum_i \ln\frac{h_{il}+\frac{\beta}{\lambda}h_{i1}}{\beta}\Big) \tag{15}$$

The above expression is easy to optimize; the global minimum is attained at

$$h^*_{i1} = \lambda\sqrt{\frac{h_{il}^2}{4\beta^2} + \frac{h_{il}}{\beta}(\mu_{i0}-\mu_{i1})^2} \tag{16}$$

The above result is intuitively pleasing, relating a lower bound on what we can achieve only to the initialization, the obtained minimum, the curvature at the minimum, and the regularization parameter $\beta$. In particular, the scaling factor $\lambda$ has disappeared. As we discuss in detail in the next section, we also now have a clearer picture of why we were failing to tighten our bound further: the optimal prior depends not only on the curvature $h_{il}$ but also on the displacement $(\mu_{i0}-\mu_{i1})$.
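Equation (12) is cheap to evaluate, which makes it easy to probe which of the two terms under the square root dominates in practice. A minimal sketch (the function name and toy values are ours):

```python
import numpy as np

def optimal_prior_precision(h_l, mu0, mu1, beta, lam):
    """'Cheating' optimal diagonal prior of Theorem 7.1, Eq. (12):
    h*_{i1} = lam * sqrt(h_il^2/(4 beta^2) + (h_il/beta)*(mu_i0-mu_i1)^2)."""
    return lam * np.sqrt(h_l ** 2 / (4 * beta ** 2)
                         + (h_l / beta) * (mu0 - mu1) ** 2)

# illustrate the dominance discussed in Section 7.2: when curvature is
# nearly flat, h* tracks the weight displacement (mu0 - mu1)^2 instead
h_l = np.full(3, 1e-4)                 # flat curvature at the minimum
mu0 = np.array([0.5, -0.3, 0.8])       # trained weights
mu1 = np.zeros(3)                      # initialization
print(optimal_prior_precision(h_l, mu0, mu1, beta=0.1, lam=1.0))
```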

### 7.2 Experiments: "Learning" a prior covariance

Ideally, we would like the dominant term in expression (12) to be $\frac{h_{il}^2}{4\beta^2}$. By contrast, if the dominant term is $\frac{h_{il}}{\beta}(\mu_{i0}-\mu_{i1})^2$, then utilizing this relation for the prior would be problematic, as the optimal prior should be independent of the obtained minimum $\mu_0$.

In Figure 4 we plot the results using the "cheating" prior. In Figure 4(a) we see that we are indeed able to get a much tighter bound; however, in Figures 4(b) and 4(c) it is made clear that the optimal values $h^*_{i1}$ correlate exactly with the layer weights, and not with the Hessian at the minimum. This helps explain why choosing a prior covariance based on a statistically similar training set does not lead to any tightening of the bound.

We furthermore test to what extent the above optimum correlates with $(\mu_{i0}-\mu_{i1})^2$ or $h_{il}$. For a realistic neural network we plot the pairs of variables while changing the value of $\beta$. We see that the optimum correlates perfectly with $(\mu_{i0}-\mu_{i1})^2$ but not at all with $h_{il}$.

Together, the above results might be useful in finding limits on how much progress can be made through learning valid priors, for example through differential privacy Dziugaite and Roy (2018). In particular, we see in Theorem 7.1 that simply choosing a prior that is centered on the initialization and not on the minimum leads to an error factor that probably cannot be avoided, which is obvious in hindsight.

## 8 Beyond an upper bound

As noted in Kunstner et al. (2019), the generalized Gauss-Newton approximation of the Hessian coincides with the Fisher matrix in modern deep learning architectures. While the Fisher matrix is at least as difficult to compute exactly as the Hessian, one can compute an unbiased but noisy Monte Carlo estimate as a sum of outer products of per-sample gradients Martens and Grosse (2015), where care must be taken to sample the labels from the model predictive distribution. Additionally, we note that the interpretation of the outputs after the softmax as probabilities is not well grounded theoretically Gal and Ghahramani (2016). Determining the true predictive distribution requires MCMC sampling, for example by taking multiple dropout samples Gal and Ghahramani (2016). One can then directly expand the IB Lagrangian as follows.

###### Lemma 8.1.

The IB Lagrangian can be approximated as

$$C_\beta(D;P,Q) \approx \mathbb{E}_{\eta\sim Q'}\Big[\frac{1}{2}\eta^T\tilde{H}\eta\Big] + \beta\,\mathrm{KL}(Q\|P) \tag{17}$$

where $\tilde{H}$ denotes the estimated curvature matrix and $Q'$ the centered posterior. For $P = \mathcal{N}(\mu_1, \lambda\Sigma_1)$ and $Q = \mathcal{N}(\mu_0, \Sigma_0)$ it has the closed-form solution

$$\Sigma_0^* = \beta\Big(\tilde{H} + \frac{\beta}{\lambda}\Sigma_1^{-1}\Big)^{-1} \tag{18}$$

which can be computed efficiently.

We now make two additional notes regarding computational aspects of the above. The approximation of the Hessian can be computed efficiently as the outer product of large but manageable gradient vectors. The main computational burden after we approximate the Hessian, given that we choose a standard normal prior, is inverting a matrix of the form $\tilde{H} + \frac{\beta}{\lambda}I$. This problem can be tackled in a few different ways. The simplest would be to consider only the diagonal elements of $\tilde{H}$; as we saw before, the diagonal elements include considerable information regarding the importance of weights. However, inversion of the full matrix is also possible recursively using the Sherman-Morrison formula

$$(A + uv^T)^{-1} = A^{-1} - \frac{A^{-1}uv^T A^{-1}}{1 + v^T A^{-1} u}$$

by taking into account that it consists of a sum of outer products of gradient vectors. Performing a grid search over the resulting closed-form solutions should be feasible for a number of architectures, albeit very time-consuming. Direct inversion should also be possible for moderate parameter counts.
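The recursive inversion can be sketched directly: starting from the inverse of the ridge term $\frac{\beta}{\lambda}I$, each rank-one outer product is folded in with one Sherman-Morrison update. The function name and toy dimensions below are ours:

```python
import numpy as np

def inv_sum_outer_products(base_inv, gradients):
    """Invert base + sum_i g_i g_i^T by applying the Sherman-Morrison
    formula once per rank-one term, starting from base_inv = base^{-1}."""
    A_inv = base_inv.copy()
    for g in gradients:
        Ag = A_inv @ g
        # (A + g g^T)^{-1} = A^{-1} - (A^{-1} g g^T A^{-1}) / (1 + g^T A^{-1} g)
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return A_inv

# check against direct inversion of (beta/lam) I + sum_i g_i g_i^T
rng = np.random.default_rng(0)
d, n, ridge = 6, 10, 0.5
G = rng.normal(size=(n, d))                      # rows play the role of gradients
direct = np.linalg.inv(ridge * np.eye(d) + G.T @ G)
recursive = inv_sum_outer_products(np.eye(d) / ridge, G)
assert np.allclose(direct, recursive)
```

Each update costs $O(d^2)$, so the full inverse is built in $O(Nd^2)$ without ever forming or factorizing the curvature matrix directly.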

## 9 Discussion

We have seen how utilizing an approximation to the curvature around a given DNN minimum can lead to tighter PAC-Bayes bounds. Two obvious extensions stand out: deriving better approximations of the Hessian and learning more informative priors. Conducting experiments on convolutional and larger architectures would also show the limits of what can be achieved. This is the subject of ongoing research.

## References

• A. Achille, G. Paolini, G. Mbeng, and S. Soatto (2019) The information complexity of learning tasks, their structure and their distance. arXiv preprint arXiv:1904.03292. Cited by: §1, §4.1.
• A. Achille and S. Soatto (2018) Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §1, §4.1.
• A. Ambroladze, E. Parrado-Hernández, and J. S. Shawe-taylor (2007) Tighter pac-bayes bounds. In Advances in neural information processing systems, pp. 9–16. Cited by: §7.1.
• P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249. Cited by: §1, §3.
• C. M. Bishop (2006) Pattern recognition and machine learning. springer. Cited by: §4.1.
• O. Catoni (2003) A pac-bayesian approach to adaptive classification. preprint 840. Cited by: §7.1.
• L. Dinh, R. Pascanu, S. Bengio, and Y. Bengio (2017) Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine Learning-Volume 70, pp. 1019–1028. Cited by: §1, §7.1.
• X. Dong, S. Chen, and S. Pan (2017) Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867. Cited by: §1, §1.
• C. Dwork (2011) Differential privacy. Encyclopedia of Cryptography and Security, pp. 338–340. Cited by: §1.
• G. K. Dziugaite and D. M. Roy (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §1, §1, §1, §4.1, §6.
• G. K. Dziugaite and D. M. Roy (2018) Data-dependent pac-bayes priors via differential privacy. In Advances in Neural Information Processing Systems, pp. 8430–8441. Cited by: §1, §7.2.
• Y. Gal and Z. Ghahramani (2016) Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §8.
• N. Golowich, A. Rakhlin, and O. Shamir (2017) Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541. Cited by: §1, §3.
• B. Hassibi and D. G. Stork (1993) Second order derivatives for network pruning: optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 164–171. Cited by: §1.
• Y. Jiang, D. Krishnan, H. Mobahi, and S. Bengio (2018) Predicting the generalization gap in deep networks with margin distributions. arXiv preprint arXiv:1810.00113. Cited by: §1.
• K. Kawaguchi, L. P. Kaelbling, and Y. Bengio (2017) Generalization in deep learning. arXiv preprint arXiv:1710.05468. Cited by: §1.
• D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §4.1.
• F. Kunstner, L. Balles, and P. Hennig (2019) Limitations of the empirical fisher approximation. arXiv preprint arXiv:1905.12558. Cited by: §1, §8.
• Y. LeCun, J. S. Denker, and S. A. Solla (1990) Optimal brain damage. In Advances in Neural Information Processing Systems, pp. 598–605. Cited by: §1.
• T. Liang, T. Poggio, A. Rakhlin, and J. Stokes (2017) Fisher-rao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530. Cited by: §1.
• J. Martens and R. Grosse (2015) Optimizing neural networks with Kronecker-factored approximate curvature. In International Conference on Machine Learning, pp. 2408–2417. Cited by: §1, §8.
• J. Martens, I. Sutskever, and K. Swersky (2012) Estimating the hessian by back-propagating curvature. arXiv preprint arXiv:1206.6464. Cited by: §1.
• D. A. McAllester (1999) Some PAC-Bayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §1, Theorem 3.1, §3.
• V. Nagarajan and J. Z. Kolter (2019) Uniform convergence may be unable to explain generalization in deep learning. arXiv preprint arXiv:1902.04742. Cited by: §1.
• B. Neyshabur, S. Bhojanapalli, and N. Srebro (2017) A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: §1, §3.
• R. Novak, Y. Bahri, D. A. Abolafia, J. Pennington, and J. Sohl-Dickstein (2018) Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760. Cited by: §1.
• E. Parrado-Hernández, A. Ambroladze, J. Shawe-Taylor, and S. Sun (2012) PAC-Bayes bounds with data-dependent priors. Journal of Machine Learning Research 13 (Dec), pp. 3507–3531. Cited by: §7.1.
• H. Peng, J. Wu, S. Chen, and J. Huang (2019) Collaborative channel pruning for deep networks. In International Conference on Machine Learning, pp. 5113–5122. Cited by: §1.
• K. Pitas, A. Loukas, M. Davies, and P. Vandergheynst (2019) Some limitations of norm based generalization bounds in deep neural networks. arXiv preprint arXiv:1905.09677. Cited by: §1.
• A. Rangamani, N. H. Nguyen, A. Kumar, D. Phan, S. H. Chin, and T. D. Tran (2019) A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434. Cited by: §1.
• V. Thomas, F. Pedregosa, B. van Merriënboer, P. Manzagol, Y. Bengio, and N. L. Roux (2019) Information matrices and generalization. arXiv preprint arXiv:1906.07774. Cited by: §1, §1.
• N. Tishby, F. C. Pereira, and W. Bialek (2000) The information bottleneck method. arXiv preprint physics/0004057. Cited by: §4.1.
• C. Wang, R. Grosse, S. Fidler, and G. Zhang (2019) EigenDamage: structured pruning in the Kronecker-factored eigenbasis. arXiv preprint arXiv:1905.05934. Cited by: §1.
• A. Wu, S. Nowozin, E. Meeds, R. E. Turner, J. M. Hernández-Lobato, and A. L. Gaunt (2018) Deterministic variational inference for robust Bayesian neural networks. arXiv preprint arXiv:1810.03958. Cited by: §4.2.
• C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.
• W. Zhou, V. Veitch, M. Austern, R. P. Adams, and P. Orbanz (2018) Non-vacuous generalization bounds at the imagenet scale: a pac-bayesian compression approach. arXiv preprint arXiv:1804.05862. Cited by: §1, §1.