1 Introduction
Deep neural networks are by now the established method for tackling a wide range of machine learning tasks. Despite this, their performance on out-of-sample data is extremely difficult to prove formally and is usually validated empirically using a held-out validation set. Classic measures of capacity such as the VC dimension, which are uniform across all functions representable by the classification architecture, are doomed to fail; DNNs are typically overparameterized, and correspondingly the set of representable functions is large enough to make the bounds vacuous. For example, in the highly cited work
Zhang et al. (2016) the authors show that the set of representable functions of typical DNNs contains solutions that can memorize the labels over the training set. A lesson drawn from this experiment is that defining the hypothesis class a priori results in bounds that are too loose. Empirically, the optimisation algorithm clearly reaches solutions that do not trivially memorize the labels. As such, researchers have recently turned to measures of complexity that are data dependent and are defined a posteriori, that is, taking into account the specific solution reached after optimization. One way of achieving this is through norm-based generalization bounds. Examples include Bartlett et al. (2017), Neyshabur et al. (2017), Golowich et al. (2017), and have been reached through a number of proof techniques. Typically these involve a product of the spectral (or other) norms of the DNN weight matrices: the smaller the norms after training, the simpler the final hypothesis, and the generalization bounds can correspondingly be made significantly tighter. Furthermore, these adaptive measures of complexity have been shown to correlate empirically with generalization error: they exhibit large values for random labels and small values for real labeled data.
However, these analytical bounds, when evaluated explicitly, are still vacuous by several orders of magnitude. Furthermore, the conclusions drawn from empirical correlations have been criticized in a number of works Kawaguchi et al. (2017), Pitas et al. (2019), Nagarajan and Kolter (2019). Analyzing different cases, they provide counterexamples where uniform convergence fails to explain generalization. In Dinh et al. (2017) the authors show that sharp minima can generalize, leading to the conclusion that flatness at the minimum may be a sufficient but not necessary condition for generalization. A general point of confusion seems to be that a number of works propose conditions (such as small spectral norms) that are sufficient but not necessary for good generalization. Concretely defining such conditions for individual hypotheses, and not hypothesis classes, remains an important open research question Kawaguchi et al. (2017).
A number of other works have found, sometimes heuristic, metrics that correlate with better generalization
Thomas et al. (2019), Liang et al. (2017), Rangamani et al. (2019), Novak et al. (2018), Jiang et al. (2018). However, on a more fundamental level, simple correlation with generalization error is unsatisfying for more critical applications such as healthcare, autonomous driving, and finance, where DNNs are increasingly being deployed, potentially making life-altering decisions. Consequently, some works have achieved success in proving generalization in specific settings by optimizing PAC-Bayesian bounds McAllester (1999). PAC-Bayes theorems typically assume a randomized classifier defined by a posterior distribution Q; they then bound the generalization error of this randomized classifier by using as a measure of complexity the KL-divergence KL(Q‖P) between the posterior distribution and a proper prior distribution P. The prior is meant to model a "very simple function" and is usually chosen to be a scaled standard Gaussian distribution. In
Dziugaite and Roy (2017) the authors optimize the mean of the posterior distribution while enforcing non-trivial training accuracy, so as to obtain a non-vacuous bound on significantly simplified MNIST experiments. In Zhou et al. (2018) the authors compress an original neural network, thereby minimizing its effective capacity, while constraining it to have high accuracy over the training set. The obtained network can be shown to have non-vacuous generalization bounds even for large scale Imagenet experiments.
It is worthwhile to note the subtle but important ways in which the above two works diverge from PAC-Bayesian intuition. PAC-Bayes defines an a posteriori hypothesis class, roughly, as a ball around the obtained classifier solution; this ball is defined implicitly by assuming a posterior that is Gaussian with a given variance. The larger the variance of the posterior, the larger the ball that can be placed around the obtained solution and the simpler the hypothesis class, or, in the case of derandomized PAC-Bayes, the simpler the individual hypothesis. By optimizing the mean of the posterior in
Dziugaite and Roy (2017), and by applying compression in Zhou et al. (2018), the authors arrive at posteriors that are not similar, even in expectation, to the original classifier. Furthermore, intuition regarding the role of the magnitude of the variance is largely destroyed. It is therefore an open problem to test the limits of PAC-Bayes for proving generalization of the original solutions obtained by vanilla SGD. In Dziugaite and Roy (2018) the authors take a step in this direction by optimising the prior of the PAC-Bayes bound. PAC-Bayesian theory allows the prior to be distribution dependent but not dependent on the training set; the authors enforce this constraint through the differential privacy approach Dwork (2011). Both objectives in Dziugaite and Roy (2017), Dziugaite and Roy (2018) are however difficult to optimise, which makes them typically unusable for anything but extremely small scale experiments.
Somewhat in parallel, the neural network compression literature has seen a number of works using second order information in the Hessian to remove redundant parameters of neural networks LeCun et al. (1990), Hassibi and Stork (1993). Recent works such as Dong et al. (2017), Wang et al. (2019), Peng et al. (2019) have extended the above to the deep setting, introducing layerwise approximations of the Hessian and executing pruning in the Kronecker-factored eigenbasis. There has been controversy over the relationship between the Hessian, the Fisher, the empirical Fisher and their use as curvature matrices in modern machine learning Kunstner et al. (2019), Thomas et al. (2019); despite this, our layerwise Hessian is well grounded, and we are mainly interested in its empirical ability to capture parameter relevance, which has been good in practice. Furthermore, a number of approximations can be efficiently and stably computed using simple forward passes over the training set. The Hessian can be used to find directions (weights) along which the empirical loss is flat. This has been used in the compression literature to prune away irrelevant weights. For example, one could use the information in the Hessian to first prune a network and then derive a bound on the remaining weights. By contrast, we aim to directly add more noise in directions along which the loss does not vary.
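As a concrete illustration of how second order information flags irrelevant weights, the classic Optimal Brain Damage criterion scores each weight by how much the locally quadratic loss would grow if that weight were zeroed. The sketch below uses a diagonal Hessian approximation on hypothetical arrays; it is an illustrative simplification, not the exact procedure of any of the cited works.

```python
import numpy as np

def obd_saliencies(weights, hessian_diag):
    """Optimal-Brain-Damage saliency: s_i = 0.5 * H_ii * w_i^2.
    Under a local quadratic model (zero gradient at the minimum), s_i is
    the loss increase from setting weight i to zero."""
    return 0.5 * hessian_diag * weights ** 2

def prune_by_saliency(weights, hessian_diag, keep_ratio=0.5):
    """Zero out the weights with the smallest saliencies, keeping keep_ratio."""
    s = obd_saliencies(weights, hessian_diag)
    k = int(len(weights) * keep_ratio)
    thresh = np.sort(s)[::-1][k - 1]  # saliency of the k-th most salient weight
    return np.where(s >= thresh, weights, 0.0)
```

Note that with a flat (small `H_ii`) direction the corresponding weight is pruned first, which is the same flatness information we later exploit to place high-variance noise.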
In this work we adopt the PAC-Bayesian approach. We seek to find a posterior covariance that is close to a given prior while taking into account weight importance through the Hessian, thereby adding noise with higher variance to unimportant weights. Our approach can be seen as approximately finding the largest ball (or ellipsoid) around the given DNN solution obtained by vanilla SGD for which the classification results are consistent, and is therefore directly related to PAC-Bayesian intuition. While the resulting bounds are still vacuous, they are significantly tighter than when not incorporating second order information, and we discuss the implications for PAC-Bayes in general.
Our work has close connections with Achille and Soatto (2018), Achille et al. (2019). These works aim to link, on a fundamental level, the Kolmogorov, Shannon and Fisher information in deep neural networks to the sufficiency, minimality, and invariance of their representations. Our work by contrast focuses on tightening PAC-Bayesian bounds and determining how much progress can be made towards non-vacuous bounds simply by leveraging local properties of a given minimum.
2 Contributions

We develop a method of tightening existing PAC-Bayes bounds which does not rely on stochastic optimization. Our method instead focuses on approximating the layerwise Hessian matrices, and then performing a grid search over optimal posterior covariances, for which we derive a closed form solution with respect to a given prior. We show how, for a specific setting, this can be seen as minimizing an upper bound on the empirical loss.

We perform detailed experiments on PAC-Bayes bounds, separating the effect of the choice of the prior mean, the posterior mean, and the posterior Hessian on the tightness of the bounds. Choosing the prior mean to be equal to the random initialization accounts for the vast majority of the improvement of a bound, often making the bound non-vacuous for simple problems. For bounds that are made non-vacuous in this way, optimizing the covariance of the posterior results in a small further improvement. For bounds that remain vacuous after appropriately choosing the prior mean, optimizing the posterior covariance results in a bigger improvement of the bound; we are nevertheless unable to tighten the bounds to the point of non-vacuity. An obvious conclusion is that better approximations of the Hessian might be critical.

We discuss difficulties in tightening the bounds further by learning a more informative prior from a separate training set. We also give intuition as to why this is impossible for the first layer, given our setting.
3 PAC-Bayesian framework
We consider the hypothesis class realized by a feedforward neural network architecture of depth d with coordinatewise activation functions φ, defined as the set of functions f_w(x) = W_d φ(W_{d−1} ⋯ φ(W_1 x)), where w denotes the concatenation of the vectorized weight matrices W_1, …, W_d. Given a loss function ℓ we can define the population loss L(w) = E_{(x,y)∼D}[ℓ(f_w(x), y)], and, given a training set S of m instances, the empirical loss L̂(w) = (1/m) Σ_{i=1}^m ℓ(f_w(x_i), y_i). The PAC-Bayesian framework McAllester (1999) provides generalization error guarantees for randomized classifiers drawn from a posterior distribution Q. The framework models the complexity of the randomized classifier as the KL-divergence KL(Q‖P) between the posterior Q and a prior P. The prior must be valid in the Bayesian sense, in that it cannot depend in any way on the training data. On the contrary, the posterior can be any arbitrary distribution. We will use the following form of the PAC-Bayes bound.
Theorem 3.1.
(PAC-Bayesian theorem McAllester (1999)) For any data distribution D, any prior P over parameter vectors, and any δ > 0, the following bound holds with probability at least 1 − δ over random i.i.d. samples S of size m drawn from the data distribution:

L(Q) ≤ L̂(Q) + sqrt( (KL(Q‖P) + ln(2√m/δ)) / (2m) ).     (1)
Here Q is an arbitrary "posterior" distribution over parameter vectors, which may depend on the sample S and on the prior P. In Neyshabur et al. (2017) the authors derive an analytical solution for the above theorem while "derandomizing" the bound, so that it applies to deterministic classifiers. The resulting bound has a strong dependence on the spectral norms of each layer's weight matrix. This and other norm-based bounds derived by different techniques Bartlett et al. (2017), Golowich et al. (2017) correlate empirically with the generalization error but are still vacuous by several orders of magnitude.
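To make the quantities in such bounds concrete, the helper below evaluates the right-hand side of a McAllester-style square-root PAC-Bayes bound for given values of the empirical loss, the KL term, the sample size and the confidence level. The exact constants vary slightly between statements in the literature, so treat this as an illustrative sketch rather than the paper's exact bound.

```python
import math

def mcallester_bound(emp_loss, kl, m, delta=0.05):
    """McAllester-style PAC-Bayes bound (square-root form):
    L(Q) <= L_hat(Q) + sqrt((KL(Q||P) + ln(2*sqrt(m)/delta)) / (2m)).
    Holds with probability >= 1 - delta over the draw of the m samples."""
    slack = math.sqrt((kl + math.log(2.0 * math.sqrt(m) / delta)) / (2.0 * m))
    return emp_loss + slack
```

For instance, with a KL term of a few thousand nats and tens of thousands of samples the slack stays well below 1, which is why small KL values are the key to non-vacuous bounds for 0-1 loss.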
4 From full empirical loss to quadratic approximation
4.1 Previous work
As the analytical solution for the KL term in (1) obviously underestimates the noise robustness of the deep neural network around the minimum, one might be tempted to obtain a tighter PAC-Bayes bound by directly optimising

E_{w∼Q}[L̂(w)] + sqrt( (KL(Q‖P) + ln(2√m/δ)) / (2m) )     (2)
so as to obtain a posterior that is both close to the PAC-Bayesian prior and has non-vacuous accuracy. Optimising the above objective cannot be done directly, as computing E_{w∼Q}[L̂(w)] or its gradients is intractable for general distributions Q. A typical workaround is to parameterize Q as a Gaussian distribution, writing w = μ + σ ⊙ ε with ε ∼ N(0, I),
and to compute gradients of the resulting unbiased Monte Carlo estimate of the expectation. Note that the objective is still typically very difficult to optimise due to the high variance of the resulting gradients. In Dziugaite and Roy (2017) the authors use this technique to obtain non-vacuous bounds for fully connected deep neural networks. However, crucially, they optimise the mean of the parameterized posterior, resulting in a stochastic classifier which corresponds to a different minimum than the original deterministic one. Remarkably, the above formulation bears a striking resemblance to the objective
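The reparameterization workaround can be sketched in a few lines; the loss function and the diagonal-Gaussian parameter shapes below are placeholders for illustration, not the paper's architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def mc_randomized_loss(loss_fn, mu, sigma, n_samples=5):
    """Unbiased Monte Carlo estimate of E_{w ~ N(mu, diag(sigma^2))}[loss_fn(w)]
    via the reparameterization w = mu + sigma * eps, eps ~ N(0, I)."""
    total = 0.0
    for _ in range(n_samples):
        eps = rng.standard_normal(mu.shape)
        total += loss_fn(mu + sigma * eps)
    return total / n_samples
```

Because each sample contributes its own noisy loss value, the estimate (and any gradient taken through it) has high variance for small `n_samples`, which is the optimisation difficulty referred to above.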
E_{w∼Q}[L̂(w)] + β KL(Q‖P)     (3)

which is known as the Information Bottleneck (IB) Lagrangian^{*}^{*}*Actually this is an upper bound on the IB Lagrangian, but we will use this term for simplicity. under the Information Bottleneck framework Achille and Soatto (2018), Tishby et al. (2000), as the Evidence Lower Bound (ELBO) in the variational inference literature Kingma et al. (2015), Bishop (2006) when β = 1, or more recently as the task complexity Achille et al. (2019). We will use a version of the above objective to obtain our bounds.
4.2 Our approach
We now present the first main contribution of our paper. The original objective (3) is difficult to optimise directly, as the term E_{w∼Q}[L̂(w)] or its gradients cannot be computed efficiently. Furthermore, variational inference using a Monte Carlo estimate suffers from
high variance of the gradients and requires careful initialization and hyperparameter tuning Wu et al. (2018). We will derive an upper bound on
(3) which is more suitable for optimisation.

Theorem 4.1.
Assuming a squared-error style loss function, the following is an upper bound on the IB Lagrangian, given that we are at a local minimum:
(4) 
where l indexes the different layers,
i indexes the different neurons at each layer (we assume the same number for simplicity),
H_l denotes the local layerwise Hessian, and Q̃ is a centered version of Q.

Proof.
We start by defining a layerwise empirical error for each layer. One can then easily show that, substituting this into the IB Lagrangian, we get
(5) 
where in line 3 we use the linearity of expectation, Hölder's inequality (due to the nonnegativity of the random variables), and Jensen's inequality for the concave square root. In line 4 we hide the Frobenius norm terms in constants.
Each error term is only multiplied by Frobenius norm terms from the deeper layers. Therefore one can start optimising from the final layer and proceed towards the first, while treating these terms as constant; in practice we will simply treat them as unknown scaling factors. In line 5 we expand each term using a Taylor expansion, and subsequently ignore the first-order term, as the DNN is assumed to be well trained so that the first derivative is zero, while terms of order higher than 2 are unimportant. We also use Q̃ to denote the centered version of the distribution Q. Taking the first and second derivatives of the layerwise error with respect to the weights we get
(6) 
(7) 
where the second derivative is with respect to any row of the weight matrix W_l. We see that the full Hessian matrix then has a block diagonal structure, where each block is equal to the second derivative in (7). Each row corresponds to a neuron of the layer, and for an appropriate choice of prior and posterior with block diagonal covariances it is easy to see that the final form of expression (5) factorizes as
(8) 
this completes the proof. ∎
We see that we have managed to upper bound the empirical randomized loss by a scaled sum of quadratic terms involving layerwise Hessian matrices and centered random noise vectors. Intuitively, we have reduced the complexity of our optimisation problem by turning it into a number of separate subproblems.
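The block structure is what makes this tractable: for a layer with input activations x, every row of the weight matrix shares (up to a per-sample scalar) the same block built from outer products of activations. A sketch of forming such a per-neuron block from a batch of activations, under that assumption:

```python
import numpy as np

def layerwise_hessian_block(X, scale=None):
    """Per-neuron Hessian block for a layer with input activations X (n x d).

    Under the layerwise quadratic approximation, every row of the weight
    matrix shares (up to a per-sample scalar c_i) the same block
    H = (1/n) * sum_i c_i * x_i x_i^T."""
    n = X.shape[0]
    c = np.ones(n) if scale is None else np.asarray(scale)
    return (X * c[:, None]).T @ X / n
```

The block is d x d rather than (d*k) x (d*k) for a layer with k neurons, so it can be formed and inverted cheaply with forward passes only.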
5 Finding closed form solutions for (4)
We will see that for specific choices of Q and P the subproblems in the above upper bound have closed form solutions. We make the following modelling assumptions: Q = N(μ, Σ) and P = N(μ_0, λI). We also remove the square root from the above expression, as this greatly simplifies the following calculations. We can then show that this problem has a closed form solution. We also ignore the constants and suppress the layer subscript on H.
Lemma 5.1.
The optimization problem min_Σ { tr(HΣ) + β KL(N(μ, Σ) ‖ N(μ_0, λI)) } has the closed form solution

Σ* = (2H/β + λ^{-1} I)^{-1}     (9)

where H captures the approximate curvature in the directions of the parameters, while λ is the chosen prior scale.
Proof.
Using the standard expression for the KL-divergence between Gaussians, and dropping terms that do not depend on Σ, the objective equals

tr(HΣ) + (β/2) ( tr(Σ)/λ − ln det Σ ) + const.     (10)

The gradient with respect to Σ is

H + (β/2) ( λ^{-1} I − Σ^{-1} ).     (11)

Setting it to zero, we obtain the minimizer Σ* = (2H/β + λ^{-1} I)^{-1}. ∎
Even though the above minimiser is exact, we encounter a number of scaling issues due to the hidden constants. We therefore perform a grid search over the parameters β and λ in practice, and try to find Pareto-optimal pairs balancing the accuracy of the randomized classifier and the KL complexity term.
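A sketch of how the closed-form covariance and the resulting KL term would be evaluated at each grid point follows. It assumes the Gaussian posterior N(μ, Σ) and prior N(μ_0, λI) above and the surrogate objective tr(HΣ) + β·KL, with hypothetical array inputs; it is illustrative rather than the paper's exact code.

```python
import numpy as np

def optimal_posterior_cov(H, beta, lam):
    """Closed-form minimizer of tr(H @ Sigma) + beta * KL(N(mu,Sigma) || N(mu0, lam*I))
    with respect to Sigma: Sigma* = (2H/beta + I/lam)^{-1}.
    Flat directions (small H entries) receive large posterior variance."""
    k = H.shape[0]
    return np.linalg.inv(2.0 * H / beta + np.eye(k) / lam)

def gaussian_kl(mu, Sigma, mu0, lam):
    """KL(N(mu, Sigma) || N(mu0, lam*I)) for a k-dimensional Gaussian."""
    k = len(mu)
    _, logdet = np.linalg.slogdet(Sigma)
    diff = mu - mu0
    return 0.5 * (np.trace(Sigma) / lam + diff @ diff / lam
                  - k + k * np.log(lam) - logdet)
```

At each (β, λ) grid point one would form Σ*, Monte Carlo estimate the randomized accuracy, and evaluate `gaussian_kl` to obtain the complexity term entering the bound.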
6 Experiments
We apply the above procedure to fully connected networks trained on the MNIST and CIFAR datasets. Specifically, we test one fully connected architecture on MNIST and another on CIFAR. We test three different setups for each dataset: the original 10-class problems (MNIST-10, CIFAR-10) and two simplifications where we collapse the 10 classes into 5 classes (MNIST-5, CIFAR-5) and 2 classes (MNIST-2, CIFAR-2) respectively. We train each configuration to 100% accuracy, derive the layerwise Hessians, and perform a grid search over the parameters β and λ. For each point on the grid we calculate the empirical accuracy over the training set using Monte Carlo sampling with 5 samples, as well as the complexity term KL(Q‖P). We plot the results in Figure 2.
The baseline that we use is a Gaussian prior and posterior with the same diagonal covariance structure, scaled by the free parameter λ. The mean of the prior is chosen as the random initialization. Interestingly, we see that for the case of MNIST not much improvement can be achieved using the Hessian approach, for any number of classes. Furthermore, even the baseline approach yields non-vacuous bounds. This calls for a more careful interpretation of the results in Dziugaite and Roy (2017). There the authors claim to obtain non-vacuous bounds mainly through their nonconvex optimisation of the PAC-Bayes posterior. We see that non-vacuity is achieved primarily as a result of the problem being very simple, together with the choice of the prior mean as the random initialization. The techniques employed in Dziugaite and Roy (2017) simply tighten the bound further.
For the case of CIFAR we see that we can significantly tighten the bound. However, we cannot manage to turn a vacuous bound into a non-vacuous one, at least with the current problem set, optimisation approach and approximation of the Hessian. This suggests that we are either making an approximation to the Hessian that is too crude, or that we need to design better priors. We will discuss both in detail in the next sections.
7 Can we find a better prior?
7.1 Absolute limits of our bound
PAC-Bayesian theory allows one to choose an informative prior; the prior can depend on the generating distribution but not on the training set used for the classifier that will be bounded. A number of previous works Parrado-Hernández et al. (2012), Catoni (2003), Ambroladze et al. (2007) have used this insight, mainly in simpler linear settings, and usually by training a classifier on a separate training set and using the result as a prior.
Deep neural networks present a particular challenge when trying to learn a prior distribution. In fact, given a function computed by a deep neural network, a huge number of weight reparametrizations that compute the same function are possible in principle Dinh et al. (2017). As such, it would be hopeless to expect meaningful priors for fully connected layers, as their curvature at obtained minima should not showcase statistical regularity. That need not be the case for the input layer, where the spatial structure of the input signals should enforce some regularity on the Hessian.
As an example, at the input layer of a neural network trained on the CIFAR datasets, the layerwise Hessian has the nontrivial structure shown in Figure 3.
We made some efforts to utilize this additional information to construct a tighter bound, by simply computing the covariance for a number of datasets similar to CIFAR and using that as a prior covariance for the first hidden layer. This did not result in the benefits we anticipated; we therefore derived closed form solutions for the optimal prior induced by our solution for the optimal posterior. This is considered cheating in PAC-Bayes analysis, however it will prove illustrative as to why we are failing to tighten our bound further, and none of the other results in this paper rely on the intuition that we extract.
While it is difficult to find an optimum in the general case of a full prior Hessian and a full Hessian at the minimum, for the diagonal case the calculations are straightforward. We formulate the following theorem.
Theorem 7.1.
Consider our Gaussian posterior Q = N(μ, Σ) together with a Gaussian prior centered at the initialization, and assume diagonal covariances. Then the optimal prior covariance has
(12)
where H encodes the local curvature at the minimum obtained by SGD, μ_0 corresponds to the random initialization (by design) of the DNN, and μ corresponds to the minimum obtained after optimization.
For our choice of Gaussian posterior and optimisation procedure, the following is a lower bound on the minima we obtain under any Gaussian prior:
(13)
where the remaining quantities are defined as above.
Proof.
We can then see the form of the minimizer of the prior covariance. Substituting it in we obtain:
(14)
The above matrix equation (14) is difficult to deal with directly. We will therefore use the common diagonal approximation of the Hessian, which is more amenable to manipulation. Substituting the diagonal Hessian and the diagonal covariance in the above expression we get
(15)
The above expression is easy to optimize. We see that the global minimum exists at
(16)
∎
The above result is intuitively pleasing, tying a lower bound on what we can achieve only to the initialization, the obtained minimum, the curvature at the minimum, and the regularization parameter. In particular, the scaling factor has disappeared. As we discuss in detail in the next section, we also now have a clearer picture of why we were failing to tighten our bound further: the optimal prior depends not only on the curvature but also on the distance between the minimum and the initialization.
7.2 Experiments: "Learning" a prior covariance
Ideally we would like the dominant term in expression (12) to be the curvature term. By contrast, if the dominant term is the distance from initialization, then utilizing this relation for the prior would be problematic, as the optimal prior should be independent of the obtained weights.
In Figure 4 we plot the results using the "cheating" prior. In 3(a) we see that we are indeed able to get a much tighter bound; however, in 3(b) and 3(c) it is made clear that the optimal values correlate exactly with the layer weights, and not with the Hessian at the minimum. This helps explain why choosing a prior covariance with respect to a statistically similar training set does not lead to any tightening of the bound.
We furthermore test to what extent the above optimum correlates with the weights or with the curvature. For a realistic neural network we plot the pairs of variables while changing the value of the regularization parameter. We see that the optimum indeed correlates perfectly with the weights but not at all with the Hessian.
Together, the above results might be useful in finding limits on how much progress can be made through learning valid priors, for example through differential privacy Dziugaite and Roy (2018). In particular, we see in Theorem 7.1 that simply choosing a prior that is centered on the initialization, and not on the minimum, leads to an error factor that can probably not be avoided, which is obvious with hindsight.
8 Beyond an upper bound
As noted in Kunstner et al. (2019), the generalized Gauss-Newton approximation of the Hessian coincides with the Fisher matrix
in modern deep learning architectures. While the Fisher matrix is at least as difficult to compute exactly as the Hessian, one can compute an unbiased but noisy estimate as a sum of outer products of gradients Martens and Grosse (2015), where care must be taken to sample the targets from the model predictive distribution. Additionally, we note that the interpretation of the outputs after the softmax as probabilities is not well grounded theoretically Gal and Ghahramani (2016). Determining the true predictive distribution requires MCMC sampling, for example by taking multiple dropout samples Gal and Ghahramani (2016). One can then directly expand the IB Lagrangian as follows.
Lemma 8.1.
The IB Lagrangian can be approximated as
(17) 
where and . For and it has the closed form solution
(18) 
which can be computed efficiently.
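A minimal sketch of such a Monte Carlo Fisher estimate (diagonal only) is given below, with a toy score function standing in for backpropagated network gradients. The key detail illustrated is that the target y is sampled from the model's predictive distribution rather than taken from the data; the function names and signatures are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def fisher_diag_estimate(grad_log_p, params, inputs, model_sampler, n_mc=1):
    """Diagonal Fisher estimate: F = E_x E_{y ~ p(y|x;w)}[g g^T], g = grad log p.
    The labels y are drawn from the MODEL's predictive distribution,
    not from the dataset; using dataset labels gives the empirical Fisher."""
    acc = np.zeros_like(params)
    for x in inputs:
        for _ in range(n_mc):
            y = model_sampler(x)          # y ~ p(y | x; w)
            g = grad_log_p(params, x, y)  # per-example score vector
            acc += g * g                  # diagonal of the outer product g g^T
    return acc / (len(inputs) * n_mc)
```

For a Bernoulli model p(y=1|x) = sigmoid(w·x) the score is g = (y − sigmoid(w·x)) x, and the estimate converges to the true Fisher diagonal E[p(1−p) x²].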
We now make two additional notes regarding computational aspects of the above. The approximation of the Hessian can be computed efficiently as the outer product of large but manageable gradient vectors. The main computational burden after we approximate the Hessian, and given that we choose a standard normal prior, is inverting a matrix of the form λI + Σ_i g_i g_i^T. This problem can be tackled in a few different ways. The simplest would be to consider only the diagonal elements; as we saw before, the diagonal elements include considerable information regarding the importance of weights. However, inversion of the full matrix is also possible recursively using the Sherman-Morrison formula,
taking into account that it consists of a sum of outer products g_i g_i^T. Performing a grid search over the resulting closed form solutions should be feasible for a number of architectures, albeit very time consuming. Direct inversion should also be possible for moderate parameter numbers.
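The recursive inversion follows directly from the Sherman-Morrison identity (A + g gᵀ)⁻¹ = A⁻¹ − A⁻¹ g gᵀ A⁻¹ / (1 + gᵀ A⁻¹ g); the scale λ and the gradient vectors g_i below are hypothetical placeholders.

```python
import numpy as np

def inv_via_sherman_morrison(lam, gs):
    """Invert lam*I + sum_i g_i g_i^T one rank-1 update at a time.

    Each step applies (A + g g^T)^{-1} = A^{-1} - (A^{-1} g)(A^{-1} g)^T / (1 + g^T A^{-1} g),
    so the full matrix is never factorized from scratch."""
    k = len(gs[0])
    A_inv = np.eye(k) / lam
    for g in gs:
        Ag = A_inv @ g
        A_inv -= np.outer(Ag, Ag) / (1.0 + g @ Ag)
    return A_inv
```

Each rank-1 update costs O(k²), so n gradient samples cost O(n k²) instead of the O(k³) of a direct inversion, which is the saving that makes the grid search plausible for larger layers.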
9 Discussion
We have seen how utilizing an approximation to the curvature around a given DNN minimum can lead to tighter PAC-Bayes bounds. Two obvious extensions stand out: deriving better approximations of the Hessian and learning more informative priors. Conducting experiments on convolutional and larger architectures would also show the limits of what can be achieved. This is the subject of ongoing research.
References
 The information complexity of learning tasks, their structure and their distance. arXiv preprint arXiv:1904.03292. Cited by: §1, §4.1.
 Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research 19 (1), pp. 1947–1980. Cited by: §1, §4.1.
 Tighter pacbayes bounds. In Advances in neural information processing systems, pp. 9–16. Cited by: §7.1.
 Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6240–6249. Cited by: §1, §3.
 Pattern recognition and machine learning. springer. Cited by: §4.1.
 A pacbayesian approach to adaptive classification. preprint 840. Cited by: §7.1.
 Sharp minima can generalize for deep nets. In Proceedings of the 34th International Conference on Machine LearningVolume 70, pp. 1019–1028. Cited by: §1, §7.1.
 Learning to prune deep neural networks via layerwise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4857–4867. Cited by: §1, §1.
 Differential privacy. Encyclopedia of Cryptography and Security, pp. 338–340. Cited by: §1.
 Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §1, §1, §1, §4.1, §6.
 Datadependent pacbayes priors via differential privacy. In Advances in Neural Information Processing Systems, pp. 8430–8441. Cited by: §1, §7.2.
 Dropout as a bayesian approximation: representing model uncertainty in deep learning. In international conference on machine learning, pp. 1050–1059. Cited by: §8.
 Sizeindependent sample complexity of neural networks. arXiv preprint arXiv:1712.06541. Cited by: §1, §3.
 Second order derivatives for network pruning: optimal brain surgeon. In Advances in neural information processing systems, pp. 164–171. Cited by: §1.
 Predicting the generalization gap in deep networks with margin distributions. arXiv preprint arXiv:1810.00113. Cited by: §1.
 Generalization in deep learning. arXiv preprint arXiv:1710.05468. Cited by: §1.
 Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §4.1.
 Limitations of the empirical fisher approximation. arXiv preprint arXiv:1905.12558. Cited by: §1, §8.
 Optimal brain damage. In Advances in neural information processing systems, pp. 598–605. Cited by: §1.
 Fisherrao metric, geometry, and complexity of neural networks. arXiv preprint arXiv:1711.01530. Cited by: §1.
 Optimizing neural networks with kroneckerfactored approximate curvature. In International conference on machine learning, pp. 2408–2417. Cited by: §1, §8.
 Estimating the hessian by backpropagating curvature. arXiv preprint arXiv:1206.6464. Cited by: §1.
 Some pacbayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §1, Theorem 3.1, §3.
 Uniform convergence may be unable to explain generalization in deep learning. arXiv preprint arXiv:1902.04742. Cited by: §1.
 A pacbayesian approach to spectrallynormalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: §1, §3.
 Sensitivity and generalization in neural networks: an empirical study. arXiv preprint arXiv:1802.08760. Cited by: §1.
 PACbayes bounds with data dependent priors. Journal of Machine Learning Research 13 (Dec), pp. 3507–3531. Cited by: §7.1.
 Collaborative channel pruning for deep networks. In International Conference on Machine Learning, pp. 5113–5122. Cited by: §1.
 Some limitations of norm based generalization bounds in deep neural networks. arXiv preprint arXiv:1905.09677. Cited by: §1.
 A scale invariant flatness measure for deep network minima. arXiv preprint arXiv:1902.02434. Cited by: §1.
 Information matrices and generalization. arXiv preprint arXiv:1906.07774. Cited by: §1, §1.
 The information bottleneck method. arXiv preprint physics/0004057. Cited by: §4.1.
 EigenDamage: structured pruning in the kroneckerfactored eigenbasis. arXiv preprint arXiv:1905.05934. Cited by: §1.
 Deterministic variational inference for robust bayesian neural networks. arXiv preprint arXiv:1810.03958. Cited by: §4.2.
 Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1.
 Nonvacuous generalization bounds at the imagenet scale: a pacbayesian compression approach. arXiv preprint arXiv:1804.05862. Cited by: §1, §1.