
Cold Posteriors through PAC-Bayes

by   Konstantinos Pitas, et al.

We investigate the cold posterior effect through the lens of PAC-Bayes generalization bounds. We argue that in the non-asymptotic setting, when the number of training samples is (relatively) small, discussions of the cold posterior effect should take into account that approximate Bayesian inference does not readily provide guarantees of performance on out-of-sample data. Instead, out-of-sample error is better described through a generalization bound. In this context, we explore the connections between the ELBO objective from variational inference and the PAC-Bayes objectives. We note that, while the ELBO and PAC-Bayes objectives are similar, the latter objectives naturally contain a temperature parameter λ which is not restricted to be λ=1. For both regression and classification tasks, in the case of isotropic Laplace approximations to the posterior, we show how this PAC-Bayesian interpretation of the temperature parameter captures the cold posterior effect.



1 Introduction

Figure 1: The PAC-Bayes bound and the test negative log-likelihood for different values of $\lambda$ (quantities on the y-axis are normalized). Panels: (a) UCI Abalone, (b) UCI Diamonds, (c) KC_House, (d) MNIST-10. Panels (a-c) are regression tasks, while (d) is a classification task. The PAC-Bayes bound closely tracks the test negative log-likelihood.

In their influential paper, Wenzel et al. (2020) highlighted the observation that Bayesian neural networks typically exhibit better test-time predictive performance if the posterior distribution is “sharpened” through tempering. Their work has been influential primarily because it serves as a well-documented example of the potential drawbacks of the Bayesian approach to deep learning. While other subfields of deep learning have seen rapid adoption and have had an impact on real-world problems, Bayesian deep learning has, to date, seen relatively little practical use Izmailov et al. (2021); Lotfi et al. (2022); Dusenberry et al. (2020); Wenzel et al. (2020). The “cold posterior effect”, as the authors of Wenzel et al. (2020) named their observation, highlights an important mismatch between Bayesian theory and practice. As we increase the number of training samples, Bayesian theory tells us that the posterior should concentrate more and more on the true model parameters, in a frequentist sense. At any given moment, the posterior is our best guess at what the true model parameters are, without having to resort to heuristics. Since the original paper, a number of works Noci et al. (2021); Zeno et al. (2020); Adlam et al. (2020); Nabarro et al. (2021); Fortuin et al. (2021); Aitchison (2020) have tried to explain the cold posterior effect, identify its origins, propose remedies and defend Bayesian deep learning in the process.

The experimental setups where the cold posterior effect arises have, however, been hard to pinpoint precisely. In Noci et al. (2021) the authors conducted detailed experiments testing various hypotheses. The cold posterior effect was shown to arise from augmenting the data during optimization (data augmentation hypothesis), from selecting only the “easiest” data samples when constructing the dataset (data curation hypothesis), and from selecting a “bad” prior (prior misspecification hypothesis). In Nabarro et al. (2021) the authors propose a principled log-likelihood that incorporates data augmentation; however, they show that the cold posterior effect persists. Data curation was first proposed as an explanation in Aitchison (2020); however, the authors show that data curation can only explain a part of the cold posterior effect. Misspecified priors have also been explored as a possible cause in several other works Zeno et al. (2020); Adlam et al. (2020); Fortuin et al. (2021). Again the results have been mixed. In smaller models, data-dependent priors seem to decrease the cold posterior effect, while in larger models the effect increases Fortuin et al. (2021).

We propose that discussions of the cold posterior effect should take into account that in the non-asymptotic setting (where the number of training data points is relatively small), Bayesian inference does not readily provide a guarantee for performance on out-of-sample data. Existing theorems describe posterior contraction Ghosal et al. (2000); Blackwell and Dubins (1962), however in practical settings, for a finite number of training steps and for finite training data, it is often difficult to precisely characterise how much the posterior concentrates. Furthermore, theorems on posterior contraction are somewhat unsatisfying in the supervised classification setting, in which the cold posterior effect is usually discussed. Ideally, one would want a theoretical analysis that links the posterior distribution to the test error directly.

Here, we investigate PAC-Bayes generalization bounds McAllester (1999); Catoni (2007); Alquier et al. (2016); Dziugaite and Roy (2017) as the model that governs performance on out-of-sample data. PAC-Bayes bounds describe the performance on out-of-sample data through an application of the convex duality relation between measurable functions and probability measures. The convex duality relation naturally gives rise to the log-Laplace transform of a special random variable Catoni (2007). Importantly, the log-Laplace transform has a temperature parameter $\lambda$ which is not constrained to be $\lambda = 1$. We investigate the relationship of this temperature parameter to cold posteriors. Our contributions are the following: 1) We prove a PAC-Bayes bound for linearized deep neural networks that has a simple analytical form with respect to $\lambda$. This provides useful intuition for potential causes of the cold posterior effect. 2) For isotropic Laplace approximations to the posterior, for both regression and classification tasks, we show that a related PAC-Bayes bound correlates with performance on out-of-sample data. Our bounds are oracle bounds, in that some quantities are typically unknown in real settings. We also rely on Monte Carlo sampling to estimate some quantities, and do not attempt to bound the error of these estimates. Even with these caveats we believe that our analysis highlights an important aspect of the cold posterior effect that has up to now been overlooked.

2 The cold posterior effect in the misspecified and non-asymptotic setting

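We denote the learning sample $(X, Y) = \{(x_i, y_i)\}_{i=1}^{n} \in (\mathcal{X} \times \mathcal{Y})^n$, that contains $n$ input-output pairs. Observations $(X, Y)$ are assumed to be sampled randomly from a distribution $\mathcal{D}$. Thus, we denote $(X, Y) \sim \mathcal{D}^n$ the i.i.d. observation of $n$ elements. We consider loss functions $\ell : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, where $\mathcal{F}$ is a set of predictors $f : \mathcal{X} \to \mathcal{Y}$. We also denote the risk $\mathcal{L}^{\ell}_{\mathcal{D}}(f) = \mathbb{E}_{(x,y) \sim \mathcal{D}}\, \ell(f, x, y)$ and the empirical risk $\hat{\mathcal{L}}^{\ell}_{X,Y}(f) = \frac{1}{n} \sum_{i=1}^{n} \ell(f, x_i, y_i)$. We encounter cases where we make predictions using the posterior predictive distribution $\mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)]$; with some abuse of notation we write the corresponding risk and empirical risk terms as $\mathcal{L}^{\ell}_{\mathcal{D}}(\mathbb{E}_{f \sim \hat{\rho}}[f])$ and $\hat{\mathcal{L}}^{\ell}_{X,Y}(\mathbb{E}_{f \sim \hat{\rho}}[f])$ correspondingly.

We will use two loss functions, the non-differentiable zero-one loss $\ell_{01}(f, x, y) = \mathbb{I}(\arg\max_j f(x)_j \neq y)$, and the negative log-likelihood, which is a commonly used differentiable surrogate, $\ell_{nll}(f, x, y) = -\ln p(y|x, f)$, where we assume that the outputs of $f$ are normalized to form a probability distribution. Given the above, denoting the prior by $\pi$ and the (variational) posterior by $\hat{\rho}$, the Evidence Lower Bound (ELBO) has the following form

$$ -\mathbb{E}_{f \sim \hat{\rho}}\, \hat{\mathcal{L}}^{\ell_{nll}}_{X,Y}(f) - \frac{1}{\lambda n} \mathrm{KL}(\hat{\rho} \,\|\, \pi), \qquad (1) $$

where $\lambda = 1$. Note that our temperature parameter $\lambda$ is the inverse of the one typically used in cold posterior papers. In this form $\lambda$ has a clearer interpretation as the temperature of a log-Laplace transform. There is a slight ambiguity between tempered and cold posteriors; as argued in Aitchison (2020), for Gaussian priors and posteriors the two objectives are equivalent. Overall our setup is equivalent to the one in Wenzel et al. (2020). One typically models the posterior and prior distributions over weights using a parametric distribution (commonly a Gaussian) and optimizes the ELBO, using the reparametrization trick, to find the posterior distribution Blundell et al. (2015); Khan et al. (2018); Mishkin et al. (2018); Ashukha et al. (2019); Wenzel et al. (2020). The cold posterior effect is the following observation: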

Even though the ELBO has the form (1) with $\lambda = 1$, practitioners have found that much larger values $\lambda \gg 1$ typically result in better test time performance, for example a lower test misclassification rate and lower test negative log-likelihood.
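As a rough illustration of how the temperature enters the objective, the sketch below evaluates a tempered ELBO of the form $-\mathbb{E}[\text{empirical NLL}] - \mathrm{KL}/(\lambda n)$ for isotropic Gaussian prior and posterior. The function names, the toy linear model with unit-variance Gaussian noise, and the Monte Carlo sample count are assumptions made for illustration only; they are not taken from the authors' code.

```python
import numpy as np

def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q I) || N(mu_p, var_p I) ) for d-dimensional isotropic Gaussians."""
    d = mu_q.size
    sq_dist = float(np.sum((mu_q - mu_p) ** 2))
    return 0.5 * (d * var_q / var_p + sq_dist / var_p - d + d * np.log(var_p / var_q))

def gaussian_nll(w, X, y):
    """Mean negative log-likelihood of a linear model with unit-variance Gaussian noise."""
    resid = y - X @ w
    return float(np.mean(0.5 * resid ** 2 + 0.5 * np.log(2.0 * np.pi)))

def tempered_elbo(mu_q, var_q, mu_p, var_p, X, y, lam, n_samples=200, seed=0):
    """Monte Carlo estimate of -E_q[empirical NLL] - KL(q || p) / (lam * n)."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Draw weight samples from the isotropic Gaussian posterior q.
    samples = mu_q + np.sqrt(var_q) * rng.standard_normal((n_samples, mu_q.size))
    expected_nll = np.mean([gaussian_nll(w, X, y) for w in samples])
    return -expected_nll - kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p) / (lam * n)

# Hypothetical toy data: 50 points, 3 weights.
rng = np.random.default_rng(1)
X = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.standard_normal(50)
mu_p = np.zeros(3)

elbo_lam1 = tempered_elbo(w_true, 0.01, mu_p, 1.0, X, y, lam=1.0)
elbo_lam10 = tempered_elbo(w_true, 0.01, mu_p, 1.0, X, y, lam=10.0)
```

For a fixed posterior, increasing `lam` only down-weights the KL penalty, so a narrow posterior far from the prior is penalized less; this is the sense in which larger $\lambda$ corresponds to a colder posterior in this parametrization.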

The starting point of our discussion will thus be to define the quantity that we care about in the context of Bayesian deep neural networks and cold posterior analyses. Concretely, in the setting of supervised prediction, what we often try to minimize is

$$ \mathrm{KL}\big(p_{\mathcal{D}}(y|x) \,\|\, \mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)]\big) = \mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \ln \frac{p_{\mathcal{D}}(y|x)}{\mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)]} \right], \qquad (2) $$

the conditional relative entropy (Cover, 1999) between the true conditional distribution $p_{\mathcal{D}}(y|x)$ and the posterior predictive distribution $\mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)]$. For example, this is implicitly the quantity that we minimize when optimizing classifiers using the cross-entropy loss Masegosa (2020); Morningstar et al. (2022). It determines how accurately we can predict the future; it is often what governs how much money our model will make or how many lives it will save. It is also on this and similar predictive metrics that the cold posterior effect appears. In the following we will outline the relationship between the ELBO, PAC-Bayes and Equation (2).

2.1 ELBO

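We assume a training sample $(X, Y) \sim \mathcal{D}^n$ as before, denote $p(w|X, Y)$ the true posterior probability over predictors $f_w$ parameterized by $w$ (typically weights for neural networks), and $\pi$ and $\hat{\rho}$ respectively the prior and variational posterior distributions as before. The ELBO results from the following calculations

$$
\begin{aligned}
\mathrm{KL}\big(\hat{\rho}(w) \,\|\, p(w|X, Y)\big)
&= \mathbb{E}_{w \sim \hat{\rho}} \left[ \ln \frac{\hat{\rho}(w)}{p(w|X, Y)} \right] \\
&= \mathbb{E}_{w \sim \hat{\rho}} \left[ \ln \frac{\hat{\rho}(w)\, p(Y|X)}{p(Y|X, w)\, \pi(w)} \right] \\
&= n\, \mathbb{E}_{w \sim \hat{\rho}}\, \hat{\mathcal{L}}^{\ell_{nll}}_{X,Y}(f_w) + \mathrm{KL}(\hat{\rho} \,\|\, \pi) + \ln p(Y|X).
\end{aligned}
$$

Since $\ln p(Y|X)$ does not depend on $\hat{\rho}$, maximizing the ELBO can be seen as minimizing the KL divergence between the variational posterior and the true posterior over the weights. The true posterior distribution gives more probability mass to predictors which are more likely given the training data; however, these predictors do not necessarily minimize the conditional relative entropy of Equation (2), the evaluation metric of choice for supervised classification.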

It is well known that Bayesian inference is strongly consistent under very broad conditions Ghosal et al. (2000). For example, let the set of predictors $\mathcal{F}$ be countable and suppose that the data distribution $\mathcal{D}$ is such that, for some $f^* \in \mathcal{F}$, we have that $p(y|x, f^*)$ is equal to the true conditional distribution of $y$ given $x$. Then the Blackwell–Dubins consistency theorem Blackwell and Dubins (1962) implies that with $\mathcal{D}$-probability 1, the Bayesian posterior concentrates on $f^*$. In supervised classification methods such as SVMs, the number of parameters $d$ is typically much smaller than the number of samples $n$. In this situation, it is reasonable to assume that we are operating in the regime where $n \to \infty$ and that the posterior quickly concentrates on the true set of parameters. In such cases, a more detailed analysis, such as a PAC-Bayesian one, is unnecessary as the posterior is akin to a Dirac delta mass at the true parameters. However, neural networks do not operate in this regime. In particular, they are heavily overparametrized, such that Bayesian model averaging always occurs empirically. In such cases, it is often difficult to precisely characterise how much the posterior concentrates. Furthermore, ideally, one would want a theoretical analysis that links the posterior distribution to the test error directly.

There is a more subtle cause that undermines consistency theorems that have worked well in the past, specifically model misspecification. As shown in Grünwald and Langford (2007), for the case of supervised classification, there are two cases of misspecification where the Bayesian posterior does not concentrate to the optimal distribution with respect to the true risk, even with infinite training data ($n \to \infty$): 1) assuming homoskedastic noise in the likelihood, when some data samples are corrupted with higher-level noise than others; 2) the set of all predictors $\mathcal{F}$ does not include the true predictor $f^*$. Both types of misspecification probably occur for deep neural networks. For example, accounting for heteroskedastic noise Collier et al. (2021) improves performance on some classification benchmarks. And the existence of multiple minima is a clue that no single best parametrization exists Masegosa (2020).

Operating in the regime where $n$ is (comparatively) small and where $f^* \notin \mathcal{F}$ makes it important to derive a more precise certificate of generalization through a generalization bound, which directly bounds the true risk. In the following we focus on analyzing a PAC-Bayes bound in order to obtain insights into when the cold posterior effect occurs.

2.2 PAC-Bayes

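We first look at the following bound, which we name the original bound and denote by $\mathcal{B}_{\mathrm{original}}$.

Theorem 1 ($\mathcal{B}_{\mathrm{original}}$, Alquier et al. (2016)).

Given a distribution $\mathcal{D}$ over $\mathcal{X} \times \mathcal{Y}$, a hypothesis set $\mathcal{F}$, a loss function $\ell : \mathcal{F} \times \mathcal{X} \times \mathcal{Y} \to \mathbb{R}$, a prior distribution $\pi$ over $\mathcal{F}$, real numbers $\delta \in (0, 1]$ and $\lambda > 0$, with probability at least $1 - \delta$ over the choice $(X, Y) \sim \mathcal{D}^n$, we have for all $\hat{\rho}$ on $\mathcal{F}$

$$ \mathbb{E}_{f \sim \hat{\rho}}\, \mathcal{L}^{\ell}_{\mathcal{D}}(f) \le \mathbb{E}_{f \sim \hat{\rho}}\, \hat{\mathcal{L}}^{\ell}_{X,Y}(f) + \frac{1}{\lambda n} \left[ \mathrm{KL}(\hat{\rho} \,\|\, \pi) + \ln \frac{1}{\delta} + \Psi_{\ell, \pi, \mathcal{D}}(\lambda, n) \right], $$

where $\Psi_{\ell, \pi, \mathcal{D}}(\lambda, n) = \ln \mathbb{E}_{f \sim \pi}\, \mathbb{E}_{X', Y' \sim \mathcal{D}^n} \exp\big[ \lambda n \big( \mathcal{L}^{\ell}_{\mathcal{D}}(f) - \hat{\mathcal{L}}^{\ell}_{X', Y'}(f) \big) \big]$.

There are three different terms in the above bound:

  • The empirical risk term $\mathbb{E}_{f \sim \hat{\rho}}\, \hat{\mathcal{L}}^{\ell}_{X,Y}(f)$ is the empirical mean of the loss of the classifier over all training samples.

  • The KL term $\mathrm{KL}(\hat{\rho} \,\|\, \pi)$ is the complexity of the model, which in this case is measured as the KL divergence between the posterior and prior distributions.

  • The Moment term $\Psi_{\ell, \pi, \mathcal{D}}(\lambda, n)$ is the log-Laplace transform, for a reversal of the temperature; we will keep the name “Moment” in the following.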
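To make the role of each term concrete, the following minimal sketch assembles the right-hand side of such a bound from already-estimated terms. It assumes the bound has the form "empirical risk + (KL + ln(1/δ) + Moment) / (λ n)"; the function name and the numeric inputs are hypothetical.

```python
import math

def pac_bayes_bound(emp_risk, kl, moment, lam, n, delta=0.05):
    """Right-hand side emp_risk + (kl + ln(1/delta) + moment) / (lam * n).

    emp_risk: expected empirical risk under the posterior
    kl:       KL(posterior || prior)
    moment:   log-Laplace (moment) term, evaluated at the same lam and n
    """
    if lam <= 0 or n <= 0 or not (0.0 < delta <= 1.0):
        raise ValueError("need lam > 0, n > 0 and 0 < delta <= 1")
    return emp_risk + (kl + math.log(1.0 / delta) + moment) / (lam * n)

# Hypothetical numbers, only to show the call.
bound_value = pac_bayes_bound(emp_risk=0.5, kl=10.0, moment=2.0, lam=1.0, n=100)
```

Note that the moment term itself depends on `lam`, so it has to be re-estimated for every temperature; scanning `lam` with a fixed `moment` would overstate the benefit of large temperatures.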

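Using a PAC-Bayes bound together with Jensen’s inequality, one can bound (2) directly as follows

$$
\begin{aligned}
\mathbb{E}_{(x,y) \sim \mathcal{D}} \left[ \ln \frac{p_{\mathcal{D}}(y|x)}{\mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)]} \right]
&= \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ln p_{\mathcal{D}}(y|x)] + \mathbb{E}_{(x,y) \sim \mathcal{D}}\big[ -\ln \mathbb{E}_{f \sim \hat{\rho}}[p(y|x, f)] \big] \\
&\le \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ln p_{\mathcal{D}}(y|x)] + \mathbb{E}_{f \sim \hat{\rho}}\, \mathcal{L}^{\ell_{nll}}_{\mathcal{D}}(f) \\
&\le \mathbb{E}_{(x,y) \sim \mathcal{D}}[\ln p_{\mathcal{D}}(y|x)] + \mathbb{E}_{f \sim \hat{\rho}}\, \hat{\mathcal{L}}^{\ell_{nll}}_{X,Y}(f) + \frac{1}{\lambda n} \left[ \mathrm{KL}(\hat{\rho} \,\|\, \pi) + \ln \frac{1}{\delta} + \Psi_{\ell_{nll}, \pi, \mathcal{D}}(\lambda, n) \right],
\end{aligned}
$$

where $\Psi$ denotes the Moment term of Theorem 1. The last line holds under the conditions of Theorem 1 and in particular with probability at least $1 - \delta$ over the choice $(X, Y) \sim \mathcal{D}^n$. Notice here the presence of the temperature parameter $\lambda$, which need not be $\lambda = 1$.

In particular, it is easy to see that maximizing the ELBO is equivalent to minimizing a PAC-Bayes bound for $\lambda = 1$, which might not necessarily be optimal for a finite sample size. More specifically, even for exact inference, where $\hat{\rho} = p(w|X, Y)$, the Bayesian posterior predictive distribution does not necessarily minimize the conditional relative entropy (2).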
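The Jensen step used above can be checked numerically. The sketch below compares the negative log-likelihood of the posterior predictive (average the probabilities, then take the log) with the expected negative log-likelihood of individual predictors (take the log, then average). The probability matrix is synthetic and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# probs[s, i]: probability that posterior sample s assigns to the true label of point i.
# Synthetic values; in practice these would come from sampled networks.
probs = rng.uniform(0.05, 0.95, size=(100, 20))

# NLL of the posterior predictive: -ln E_f[p(y|x, f)], averaged over data points.
predictive_nll = float(np.mean(-np.log(probs.mean(axis=0))))

# Expected NLL of single predictors: E_f[-ln p(y|x, f)], averaged over data points.
gibbs_nll = float(np.mean(-np.log(probs)))

# Jensen's inequality (convexity of -ln) guarantees predictive_nll <= gibbs_nll.
jensen_gap = gibbs_nll - predictive_nll
```

The gap `jensen_gap` is nonnegative and shrinks as the sampled predictors agree more, which is what happens when the posterior becomes colder and approaches a point mass.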

2.3 Safe-Bayes and other relevant work

We are not the first to discuss the relationship between Bayesian inference and PAC-Bayes, nor the connection to tempered posteriors; however, we are the first to investigate the relationship with the cold posterior effect in the context of deep learning. In Germain et al. (2016) the authors were the first to find connections between PAC-Bayes and Bayesian inference. However, they only investigate the case where $\lambda = 1$. After identifying two sources of misspecification, the authors in Grünwald and Langford (2007) proposed a solution, through an approach which they named Safe-Bayes Grünwald (2012); Grünwald and Van Ommen (2017). Safe-Bayes corresponds to finding a temperature parameter $\lambda$ for a generalized (tempered) posterior distribution, with $\lambda$ possibly different than 1. The optimal value of $\lambda$ is found by taking a sequential view of Bayesian inference and minimizing a prequential risk, the risk of each new data sample given the previous ones. This results in a PAC-Bayes bound on the true risk, and is reminiscent of recent works in Bayesian inference and model selection such as Lyle et al. (2020); Ru et al. (2021). The analysis of Grünwald (2012); Grünwald and Van Ommen (2017) is restricted to the case where $\lambda < 1$. By contrast, we provide an analytical expression of the bound on true risk, given $\lambda$, and also numerically investigate the case of $\lambda > 1$. Our analysis thus provides intuition regarding which parameters (for example the curvature) might result in cold posteriors.

3 The effect of the temperature parameter on the PAC-Bayes bound

PAC-Bayes objectives are typically difficult to analyze theoretically. In the following we make a number of simplifying assumptions, thus making deep neural networks amenable to study. We exploit the idea of a recent line of works Zancato et al. (2020); Maddox et al. (2021); Jacot et al. (2018); Khan et al. (2019) that have considered linearizations of deep neural networks, at some estimate $w_{\hat{\rho}}$, such that

$$ f_{\mathrm{lin}}(x; w) = f(x; w_{\hat{\rho}}) + \nabla_w f(x; w_{\hat{\rho}})^{\top} (w - w_{\hat{\rho}}), \qquad (3) $$

to derive theoretical results. Our approach is somewhat connected to the NTK Jacot et al. (2018); however, it is much closer to Zancato et al. (2020); Maddox et al. (2021); Khan et al. (2019) as we make no assumptions about infinite width. For appropriate modelling choices, we aim at deriving a bound for this linearized model.
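A first-order linearization of this kind can be sketched as follows for a tiny one-hidden-layer network with an analytic gradient. The architecture, the names `f`, `grad_f` and `f_lin`, and the expansion point `w0` are all assumptions chosen for illustration; the sketch only shows that the linear model matches the network at the expansion point and degrades as the weights move away from it.

```python
import numpy as np

HIDDEN, DIM = 4, 3  # hypothetical layer sizes

def unpack(w):
    """Split a flat weight vector into the hidden matrix W and output vector v."""
    W = w[: HIDDEN * DIM].reshape(HIDDEN, DIM)
    v = w[HIDDEN * DIM:]
    return W, v

def f(x, w):
    """Scalar-output network f(x; w) = v^T tanh(W x)."""
    W, v = unpack(w)
    return float(v @ np.tanh(W @ x))

def grad_f(x, w):
    """Gradient of f with respect to the flat weight vector."""
    W, v = unpack(w)
    a = np.tanh(W @ x)
    dW = np.outer(v * (1.0 - a ** 2), x)    # derivative with respect to W
    return np.concatenate([dW.ravel(), a])  # derivative with respect to v is a

def f_lin(x, w, w0):
    """First-order Taylor expansion of f around w0."""
    return f(x, w0) + float(grad_f(x, w0) @ (w - w0))

rng = np.random.default_rng(0)
w0 = rng.standard_normal(HIDDEN * DIM + HIDDEN)
x = rng.standard_normal(DIM)
direction = rng.standard_normal(w0.size)

# Linearization error for a small and a larger step away from w0.
err_small = abs(f(x, w0 + 1e-3 * direction) - f_lin(x, w0 + 1e-3 * direction, w0))
err_large = abs(f(x, w0 + 1e-1 * direction) - f_lin(x, w0 + 1e-1 * direction, w0))
```

The error grows roughly quadratically with the step size, which is why linearized models are mostly trusted for weights close to the expansion point, for example in fine-tuning.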

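We adopt the linear form (3) together with the Gaussian likelihood with $\sigma = 1$, yielding $\ell_{nll}(w, x, y) = \frac{1}{2} \ln(2\pi) + \frac{1}{2} \big( y - f_{\mathrm{lin}}(x; w) \big)^2$, where $f_{\mathrm{lin}}$ denotes the linearized network of (3). We also make the following modeling choices: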

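  • Prior over weights: $\pi = \mathcal{N}(w_{\pi}, \sigma_{\pi}^2 I)$.

  • Gradients as Gaussian mixtures: the gradients of the network with respect to the weights are modeled as draws from a mixture of Gaussians; note that this assumption should be somewhat realistic for pretrained neural networks, in that multiple works have shown that gradients evaluated on the training set, at the linearization point, are clusterable Zancato et al. (2020).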

  • Labeling function: , where .

Thus . The assumption that is close to is quite strong, and we furthermore argued in the previous sections that no single is truly “correct”. However we note that for fine-tuning tasks linearized neural networks work remarkably well Maddox et al. (2021); Deshpande et al. (2021). It is therefore at least somewhat reasonable to assume the above oracle labelling function, in that for deep learning architectures good that fit many datasets can be found close to in practical settings. In any case as we will see later we will only look for some rough intuition from this analysis, and will resort to 1 for practical settings.

We also assume that we have a deterministic estimate of the posterior weights $w_{\hat{\rho}}$ which we keep fixed, and we model the posterior as $\hat{\rho} = \mathcal{N}(w_{\hat{\rho}}, \sigma_{\hat{\rho}}^2 I)$. Therefore estimating the posterior corresponds to estimating the variance $\sigma_{\hat{\rho}}^2$. This setting has been widely explored before in the literature as it coincides with the Laplace approximation to the posterior.

Proposition 1 ().

With the above modeling choices, and given a distribution over , real numbers and with , with probability at least over the choice , we have

where is the curvature parameter, and is the posterior gradient variance.


(Sketch) We first develop all the terms in the PAC-Bayes bound based on our modelling choices. We start with the empirical risk term

where we set . This factor is the only one multiplied with the posterior variance . It can be interpreted as measuring the curvature of the loss landscape.

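We continue with the KL term. For our modelling choice of isotropic Gaussian prior $\pi = \mathcal{N}(w_{\pi}, \sigma_{\pi}^2 I)$ and posterior $\hat{\rho} = \mathcal{N}(w_{\hat{\rho}}, \sigma_{\hat{\rho}}^2 I)$ over $d$ weights, the KL has the following analytical expression

$$ \mathrm{KL}(\hat{\rho} \,\|\, \pi) = \frac{1}{2} \left[ d\, \frac{\sigma_{\hat{\rho}}^2}{\sigma_{\pi}^2} + \frac{\| w_{\hat{\rho}} - w_{\pi} \|_2^2}{\sigma_{\pi}^2} - d + d \ln \frac{\sigma_{\pi}^2}{\sigma_{\hat{\rho}}^2} \right]. $$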

Finally we develop the moment term. We find the following upper bound

where and . In bounding the moment term we first recognize that the random variable consists in the difference . We remove the term resulting in an upper bound to the moment, and in this way we avoid also having to calculate the expectation . For samples from the prior , we can then compute the remaining expectation because we have assumed that the labelling function is known and equal to where . The cost of finding an analytical expression is that this bound is very loose. While can be small by itself can be very large. We are also now saddled with the constraint where . We will see that this is also pessimistic.

We now minimize with respect to the following objective which is equivalent to the one of Equation (4) in Wenzel et al. (2020)

For our particular modeling choices, we can then find the minimum in closed-form by setting the gradient with respect to to be zero. We get where . We get our result by putting everything in the original bound. ∎
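The closed-form minimization in the sketch can be illustrated on a simplified stand-in objective. The code below assumes (i) the expected empirical risk depends on the posterior variance $s$ only through a term $\frac{1}{2} h s$, with $h$ a curvature factor, (ii) prior and posterior are isotropic Gaussians with equal means over $d$ weights, and (iii) the KL is weighted by $1/(\lambda n)$. These normalizations are assumptions for illustration and may differ from the exact expression in Proposition 1.

```python
import numpy as np

def objective(s, lam, n, h, d, var_prior):
    """0.5*h*s + KL(N(w, s I) || N(w, var_prior I)) / (lam * n), up to constants in s."""
    kl = 0.5 * d * (s / var_prior - 1.0 - np.log(s / var_prior))
    return 0.5 * h * s + kl / (lam * n)

def optimal_variance(lam, n, h, d, var_prior):
    """Stationary point in s: 0.5*h + d/(2*lam*n) * (1/var_prior - 1/s) = 0."""
    return 1.0 / (lam * n * h / d + 1.0 / var_prior)

# Hypothetical values for the sample size, curvature, dimension and prior variance.
n, h, d, var_prior = 100, 5.0, 10, 1.0
grid = np.linspace(1e-4, 1.0, 200001)

results = {}
for lam in (0.5, 1.0, 10.0):
    s_closed = optimal_variance(lam, n, h, d, var_prior)
    s_grid = grid[np.argmin(objective(grid, lam, n, h, d, var_prior))]
    results[lam] = (s_closed, s_grid)
```

Under these assumptions the optimal variance shrinks as `lam` or the curvature `h` grows, which matches the intuition that colder posteriors concentrate more tightly around the fixed mean.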

We now make a number of observations regarding Proposition 1. The first is that, although we started our derivations from the linearized deep neural network and the negative log-likelihood with a Gaussian likelihood, the empirical risk term in Proposition 1 is exactly the Gauss–Newton approximation to the loss landscape for the full deep neural network. Here the curvature parameter is the trace of the Hessian under the Gauss–Newton approximation (without a scaling factor). We defer details about this interpretation of the empirical risk to the Appendix. We call the bound of Proposition 1 the $\mathcal{B}_{\mathrm{approximate}}$ bound. The r.h.s. of the inequality in Proposition 1 is unwieldy and provides little intuition. We simplify it by making the following parameter choices.

Corollary 1.

For , , and ignoring additive constants, the dependence of Proposition 1 bound on the temperature parameter is as follows, with probability at least

Figure 2: Bound terms as a function of the $\lambda$ parameter. Panels: (a) $\mathcal{B}_{\mathrm{approximate}}$ for Corollary 1, (b) $\mathcal{B}_{\mathrm{approximate}}$ for KC_House, (c) $\mathcal{B}_{\mathrm{original}}$ for KC_House. We plot the empirical risk, moment and KL terms, as well as the bound values for different $\lambda$. In Figure 2(a) we plot the simplified case of Corollary 1. We see that the empirical risk and KL terms decrease while the moment term increases as $\lambda$ increases. The bound first decreases and then increases, and has a global minimum. In Figure 2(b) we plot the $\mathcal{B}_{\mathrm{approximate}}$ bound for a single MAP estimate of a neural network trained on the KC_House dataset. The x-axis contains only a restricted range of $\lambda$ values, which limits our investigation of the cold posterior effect; the Moment term also explodes. In Figure 2(c) we plot the $\mathcal{B}_{\mathrm{original}}$ bound for the same model and dataset. We can now investigate larger values of $\lambda$. The terms increase or decrease according to the intuition in Corollary 1. However, the KL term decreases faster than the moment term increases, and the bound always decreases as we increase $\lambda$.

From Equation (4) it is easy to derive some intuition as to the effect of $\lambda$. We see that as $\lambda$ increases the empirical risk decreases while the moment term increases. For this particular modeling choice the KL term decreases as we increase $\lambda$. We plot the bound for the parameter choices of Corollary 1 in Figure 2(a). We see that, as a function of $\lambda$, the bound first decreases and then increases; the bound is minimized for an intermediate value. This analysis also gives us some interesting intuition regarding cold posteriors in general. Under the PAC-Bayesian modeling of the risk, cold posteriors are the result of a complex interaction between the various parameters of the bound. This might explain why pinpointing their cause is difficult in practice.

In the next section we show that for real datasets and models, the bound terms often have the same behaviour as in Corollary 1. However, this is not the case for the overall bound, which has different shapes based on different hyperparameter values. In particular, in some cases the bound as a function of $\lambda$ is not convex, while in others it does not have a global minimum, but keeps decreasing as $\lambda$ increases.

4 Experiments

Figure 3: PAC-Bayes bound (top row) and negative log-likelihood (bottom row) for varying $\lambda$ (quantities on the y-axis are normalized). Panels: (a) UCI Abalone, (b) UCI Diamonds, (c) KC_House, (d) MNIST-10. Panels (a-c) are regression tasks, while (d) is a classification task. The PAC-Bayes bound closely tracks the negative log-likelihood.

We tested our theoretical results on three regression datasets and one classification dataset. The regression tasks use the Abalone and Diamonds datasets from the UCI repository Dua and Graff (2017), as well as the popular “House Sales in King County, USA” (KC_House) dataset from the Kaggle competition website [29]. We chose these datasets because they provide a large number of samples, compared to more common UCI datasets such as Boston and Wine. This larger number of samples makes it easier to compute oracle quantities. For the classification task we used the typical MNIST-10 dataset Deng (2012). When needing extra samples to approximate the complete distribution we used samples from the EMNIST Cohen et al. (2017) dataset.

In all experiments we split the dataset into five sets. Three of them are the typical prediction task sets: a training set, a testing set, and a validation set. Our experimental setup requires two extra sets: the “train-suffix” set, as well as a large sample set called the “true” set that is used to approximate the complete distribution.
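A five-way split of this kind can be sketched with a single shuffled index permutation. The set names and the fractions below are hypothetical placeholders, since the actual sizes are not stated here.

```python
import numpy as np

def five_way_split(n_total, fractions, seed=0):
    """Shuffle indices 0..n_total-1 and cut them into disjoint named subsets."""
    if abs(sum(fractions.values()) - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_total)
    names = list(fractions)
    splits, start = {}, 0
    for i, name in enumerate(names):
        # The last subset takes whatever remains, so every index is used exactly once.
        stop = n_total if i == len(names) - 1 else start + int(round(fractions[name] * n_total))
        splits[name] = perm[start:stop]
        start = stop
    return splits

# Hypothetical proportions; the large "true" set approximates the data distribution.
fractions = {"train": 0.3, "train_suffix": 0.1, "validation": 0.1, "test": 0.1, "true": 0.4}
splits = five_way_split(1000, fractions)
```

Keeping the subsets disjoint matters here because the bound is only valid when the prior is independent of the data on which the bound is evaluated.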

We will encounter several problems when trying to evaluate the $\mathcal{B}_{\mathrm{approximate}}$ bound in practice. Firstly, as we mentioned, we expect the bound to be loose. We will therefore find it useful to evaluate the original bound, without any approximations. As we mentioned previously, we name this the $\mathcal{B}_{\mathrm{original}}$ bound. We approximate both the moment term and the empirical risk term using Monte Carlo sampling. An additional problem when analyzing our classification model is that we derived the bound for a model with a single output, while the MNIST-10 dataset has 10 classes and thus a 10-dimensional output. Furthermore, classification models are not typically trained with the Gaussian likelihood, but with a softmax activation coupled with the categorical cross-entropy loss. In the previous section we observed that for the linearized neural network and the Gaussian likelihood we arrived at a second-order Taylor expansion of the loss landscape where the Hessian was approximated with the Gauss–Newton approximation. In the case of MNIST, we thus make this Taylor expansion directly on the full cross-entropy empirical loss, where the curvature parameter is the trace of the Gauss–Newton approximation to the Hessian for the categorical cross-entropy loss (without a scaling factor); the expression involves the number of output dimensions and a special function, see Kunstner et al. (2019), eq. 35, for details. We can now, similarly to before, calculate the posterior variance for our Laplace approximation to the posterior.
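A Monte Carlo estimate of the moment term can be sketched as below, assuming the term has the log-Laplace form $\ln \mathbb{E} \exp[\lambda n (\mathcal{L}_{\mathcal{D}}(f) - \hat{\mathcal{L}}(f))]$ with the expectation taken over prior samples and fresh datasets. The inputs `true_risks` and `emp_risks` are hypothetical arrays with one entry per prior sample: the risk estimated on the large “true” set and the empirical risk on an independently drawn sample of size `n`.

```python
import numpy as np

def log_mean_exp(a):
    """Numerically stable ln(mean(exp(a))) via the max-shift trick."""
    a = np.asarray(a, dtype=float)
    m = np.max(a)
    return float(m + np.log(np.mean(np.exp(a - m))))

def moment_term(true_risks, emp_risks, lam, n):
    """Monte Carlo estimate of ln E exp[lam * n * (true risk - empirical risk)]."""
    gaps = np.asarray(true_risks, dtype=float) - np.asarray(emp_risks, dtype=float)
    return log_mean_exp(lam * n * gaps)

# Hypothetical risks for four prior samples.
psi_hat = moment_term([0.52, 0.48, 0.55, 0.50], [0.50, 0.49, 0.51, 0.50], lam=2.0, n=100)
```

Because the exponent scales with `lam * n`, a handful of samples with a large gap dominate the estimate; the max-shift avoids overflow but does not remove the bias of a finite-sample log-mean-exp, which is consistent with the caveat that the error of these estimates is not bounded here.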

PAC-Bayes bounds require correct control of the prior mean, as the distance between prior and posterior means in the KL term is often the dominant term in the bound. To control this distance we follow a variation of the approach in Dziugaite et al. (2021) to construct our classifiers. We first use the training set to find a prior mean. We then set the posterior mean equal to the prior mean but evaluate the r.h.s. of the bounds on the train-suffix set. Note that in this way the distance between the prior and posterior means is zero, while the bound is still valid since the prior is independent of the evaluation set. The cost is that we deviate from normal practice. We have in essence constructed a Laplace approximation where the mean was learned using the training set while the posterior variance was learned using the train-suffix set.

For the UCI and KC_House experiments we use fully connected networks with 2 hidden layers with 100 dimensions, followed by the ReLU activation function, and a final Softmax activation. For the MNIST-10 dataset we use the standard LeNet architecture Lecun et al. (1998). More details on the experimental setup can be found in the Appendix.

4.1 UCI + KC_House experiments

We first test the applicability of the $\mathcal{B}_{\mathrm{approximate}}$ bound in practice. For this we use the KC_House dataset, although the results for the rest of the datasets are similar. We train the network on the training set using SGD with stepsize for 10 epochs. We then estimate the bound using the train-suffix set. We give details on how all bound parameters are estimated in the Appendix. We plot the results in Figure 2(b). While the terms of the bound move in the way we expected from Corollary 1, we see that the bound is significantly off scale on the y-axis. Specifically, $\lambda$ is constrained to a restricted range, which is far from the regime that we want to explore. In Figure 2(c) we thus plot the $\mathcal{B}_{\mathrm{original}}$ bound for the same dataset. We see that some of the problems of the $\mathcal{B}_{\mathrm{approximate}}$ bound have been fixed. Specifically, we can now estimate bound values for larger $\lambda$. However, surprisingly, the bound always decreases as we increase $\lambda$. This is because the moment term increases at a much slower rate than the KL term decreases. We thus use the $\mathcal{B}_{\mathrm{original}}$ bound in the rest of our experiments.

In Figure 3 we plot the results for all the regression datasets. As before we train the networks on the training set using SGD with stepsize for 10 epochs. We then estimate the $\mathcal{B}_{\mathrm{original}}$ bound using the train-suffix set. We also plot the test NLL of the posterior predictive. For all cases we discover a cold posterior effect. We test different prior variances and the effect gets stronger with higher variances. The test NLL decreases with colder posteriors up to the point where the classifier is essentially deterministic. The bound correlates tightly with this behaviour. For increasing values of the prior variance the posterior needs to be colder ($\lambda$ larger) to achieve the same NLL. We see that the bound also tracks this behaviour.

The results for all graphs are averaged over 10 different initializations to the neural network.

4.2 MNIST-10 experiments

We repeat the above experiment for the MNIST-10 dataset. As before we train the networks on the training set using SGD with stepsize for 10 epochs. We then estimate the $\mathcal{B}_{\mathrm{original}}$ bound using the train-suffix set. The behaviour is similar to the regression datasets: we again find a consistent cold posterior effect, which the PAC-Bayes bound captures. We also again see that the effect gets stronger with higher variances. We plot the results in Figure 3(d). In the Appendix we discuss two additional metrics, the Expected Calibration Error and the Zero-One loss.

The results for all graphs are averaged over 10 different initializations to the neural network.

An important caveat that we want to disclose is that even for extremely large $\lambda$ the results of the stochastic classifiers and the deterministic ones do not match exactly. We have double-checked the relevant code, and believe that the source is some numerical instabilities of the torch package. It is for this reason that we also do not plot the deterministic results on the same graphs as the stochastic ones.

5 Discussion

For the case of isotropic Laplace approximations to the posterior, we presented a PAC-Bayesian interpretation of the cold posterior effect. We argued that performance on out-of-sample data is best described through a PAC-Bayes bound, which naturally includes a temperature parameter $\lambda$ which is not constrained to be $\lambda = 1$. There are a number of avenues for future work. The most interesting to the authors is extending this analysis to more complex approximations to the posterior and in particular Monte Carlo methods.


  • B. Adlam, J. Snoek, and S. L. Smith (2020) Cold posteriors and aleatoric uncertainty. arXiv preprint arXiv:2008.00029. Cited by: §1, §1.
  • L. Aitchison (2020) A statistical theory of cold posteriors in deep neural networks. arXiv preprint arXiv:2008.05912. Cited by: §1, §1, §2.
  • P. Alquier, J. Ridgway, and N. Chopin (2016) On the properties of variational approximations of Gibbs posteriors.

    The Journal of Machine Learning Research

    17 (1), pp. 8374–8414.
    Cited by: Appendix C, §1, Theorem 1.
  • A. Ashukha, A. Lyzhov, D. Molchanov, and D. Vetrov (2019) Pitfalls of in-domain uncertainty estimation and ensembling in deep learning. In International Conference on Learning Representations, Cited by: §2.
  • L. Bégin, P. Germain, F. Laviolette, and J. Roy (2016) PAC-bayesian bounds based on the rényi divergence. In Artificial Intelligence and Statistics, pp. 435–444. Cited by: Appendix C.
  • D. Blackwell and L. Dubins (1962) Merging of opinions with increasing information. The Annals of Mathematical Statistics 33 (3), pp. 882–886. Cited by: §1, §2.1.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural network. In International Conference on Machine Learning, pp. 1613–1622. Cited by: §2.
  • O. Catoni (2007) PAC-Bayesian supervised classification: the thermodynamics of statistical learning. Monograph Series, Vol. 56, Institute of Mathematical Statistics Lecture Notes. Cited by: §1.
  • G. Cohen, S. Afshar, J. Tapson, and A. Van Schaik (2017) EMNIST: extending MNIST to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN), pp. 2921–2926. Cited by: §4.
  • M. Collier, B. Mustafa, E. Kokiopoulou, R. Jenatton, and J. Berent (2021) Correlated input-dependent label noise in large-scale image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1551–1560. Cited by: §2.1.
  • T. M. Cover (1999) Elements of information theory. John Wiley & Sons. Cited by: §2.
  • F. Dangel, F. Kunstner, and P. Hennig (2019) BackPACK: packing more into backprop. In International Conference on Learning Representations, Cited by: 1st item.
  • E. Daxberger, A. Kristiadi, A. Immer, R. Eschenhagen, M. Bauer, and P. Hennig (2021) Laplace redux – effortless Bayesian deep learning. Advances in Neural Information Processing Systems 34. Cited by: 1st item, 1st item, §B.4.4, §B.6.
  • L. Deng (2012) The MNIST database of handwritten digit images for machine learning research. IEEE Signal Processing Magazine 29 (6), pp. 141–142. Cited by: 6th item, §4.
  • A. Deshpande, A. Achille, A. Ravichandran, H. Li, L. Zancato, C. Fowlkes, R. Bhotika, S. Soatto, and P. Perona (2021) A linearized framework and a new benchmark for model selection for fine-tuning. arXiv preprint arXiv:2102.00084. Cited by: §3.
  • D. Dua and C. Graff (2017) UCI machine learning repository. University of California, Irvine, School of Information and Computer Sciences. Cited by: 4th item, §4.
  • M. Dusenberry, G. Jerfel, Y. Wen, Y. Ma, J. Snoek, K. Heller, B. Lakshminarayanan, and D. Tran (2020) Efficient and scalable Bayesian neural nets with rank-1 factors. In International Conference on Machine Learning, pp. 2782–2792. Cited by: §1.
  • G. K. Dziugaite, K. Hsu, W. Gharbieh, G. Arpino, and D. Roy (2021) On the role of data in PAC-Bayes. In International Conference on Artificial Intelligence and Statistics, pp. 604–612. Cited by: §B.4.4, §4.
  • G. K. Dziugaite and D. M. Roy (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. Uncertainty in Artificial Intelligence. Cited by: §B.4.4, §1.
  • V. Fortuin, A. Garriga-Alonso, F. Wenzel, G. Rätsch, R. Turner, M. van der Wilk, and L. Aitchison (2021) Bayesian neural network priors revisited. In International Conference on Learning Representations, Cited by: §1, §1.
  • P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien (2016) PAC-Bayesian theory meets Bayesian inference. Advances in Neural Information Processing Systems 29. Cited by: §A.1, Appendix C, §2.3.
  • S. Ghosal, J. K. Ghosh, and A. W. Van Der Vaart (2000) Convergence rates of posterior distributions. Annals of Statistics, pp. 500–531. Cited by: §1, §2.1.
  • P. Grünwald and J. Langford (2007) Suboptimal behavior of Bayes and MDL in classification under misspecification. Machine Learning 66 (2), pp. 119–149. Cited by: §2.1, §2.3.
  • P. Grünwald and T. Van Ommen (2017) Inconsistency of Bayesian inference for misspecified linear models, and a proposal for repairing it. Bayesian Analysis 12 (4), pp. 1069–1103. Cited by: §2.3.
  • P. Grünwald (2012) The safe Bayesian. In International Conference on Algorithmic Learning Theory, pp. 169–183. Cited by: §2.3.
  • A. Immer, M. Korzepa, and M. Bauer (2021) Improving predictions of Bayesian neural nets via local linearization. In International Conference on Artificial Intelligence and Statistics, pp. 703–711. Cited by: §A.1.
  • P. Izmailov, S. Vikram, M. D. Hoffman, and A. G. G. Wilson (2021) What are Bayesian neural network posteriors really like?. In International Conference on Machine Learning, pp. 4629–4640. Cited by: §1.
  • A. Jacot, F. Gabriel, and C. Hongler (2018) Neural tangent kernel: convergence and generalization in neural networks. Advances in Neural Information Processing Systems 31. Cited by: §3.
  • [29] Kaggle. Note: Cited by: 5th item, §4.
  • M. E. E. Khan, A. Immer, E. Abedi, and M. Korzepa (2019) Approximate inference turns deep networks into Gaussian processes. Advances in Neural Information Processing Systems 32. Cited by: §3.
  • M. Khan, D. Nielsen, V. Tangkaratt, W. Lin, Y. Gal, and A. Srivastava (2018) Fast and scalable Bayesian deep learning by weight-perturbation in Adam. In International Conference on Machine Learning, pp. 2611–2620. Cited by: §2.
  • F. Kunstner, P. Hennig, and L. Balles (2019) Limitations of the empirical Fisher approximation for natural gradient descent. Advances in Neural Information Processing Systems 32. Cited by: §A.1, §4.
  • F. Küppers, J. Kronenberger, J. Schneider, and A. Haselhoff (2021) Bayesian confidence calibration for epistemic uncertainty modelling. In Proceedings of the IEEE Intelligent Vehicles Symposium (IV), Cited by: 2nd item.
  • Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner (1998) Gradient-based learning applied to document recognition. Proceedings of the IEEE 86 (11), pp. 2278–2324. Cited by: §B.3, §4.
  • S. Lotfi, P. Izmailov, G. Benton, M. Goldblum, and A. G. Wilson (2022) Bayesian model selection, the marginal likelihood, and generalization. arXiv preprint arXiv:2202.11678. Cited by: §1.
  • C. Lyle, L. Schut, R. Ru, Y. Gal, and M. van der Wilk (2020) A Bayesian perspective on training speed and model selection. Advances in Neural Information Processing Systems 33, pp. 10396–10408. Cited by: §2.3.
  • W. Maddox, S. Tang, P. Moreno, A. G. Wilson, and A. Damianou (2021) Fast adaptation with linearized neural networks. In International Conference on Artificial Intelligence and Statistics, pp. 2737–2745. Cited by: §3, §3.
  • A. Masegosa (2020) Learning under model misspecification: applications to variational and ensemble methods. Advances in Neural Information Processing Systems 33, pp. 5479–5491. Cited by: §2.1, §2.
  • D. A. McAllester (1999) Some PAC-Bayesian theorems. Machine Learning 37 (3), pp. 355–363. Cited by: §1.
  • A. Mishkin, F. Kunstner, D. Nielsen, M. Schmidt, and M. E. Khan (2018) SLANG: fast structured covariance approximations for Bayesian deep learning with natural gradient. Advances in Neural Information Processing Systems 31. Cited by: §2.
  • W. R. Morningstar, A. Alemi, and J. V. Dillon (2022) PAC-Bayes: Narrowing the empirical risk gap in the misspecified Bayesian regime. In International Conference on Artificial Intelligence and Statistics, pp. 8270–8298. Cited by: §2.
  • S. Nabarro, S. Ganev, A. Garriga-Alonso, V. Fortuin, M. van der Wilk, and L. Aitchison (2021) Data augmentation in Bayesian neural networks and the cold posterior effect. arXiv preprint arXiv:2106.05586. Cited by: §1, §1.
  • M. P. Naeini, G. Cooper, and M. Hauskrecht (2015) Obtaining well calibrated probabilities using Bayesian binning. In Twenty-Ninth AAAI Conference on Artificial Intelligence, Cited by: §B.6.
  • L. Noci, K. Roth, G. Bachmann, S. Nowozin, and T. Hofmann (2021) Disentangling the roles of curation, data-augmentation and the prior in the cold posterior effect. Advances in Neural Information Processing Systems 34. Cited by: §1, §1.
  • A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala (2019) PyTorch: an imperative style, high-performance deep learning library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.), pp. 8024–8035. Cited by: 3rd item.
  • H. Ritter, A. Botev, and D. Barber (2018) A scalable Laplace approximation for neural networks. In 6th International Conference on Learning Representations, Vol. 6. Cited by: §B.6.
  • R. Ru, C. Lyle, L. Schut, M. Fil, M. van der Wilk, and Y. Gal (2021) Speedy performance estimation for neural architecture search. Advances in Neural Information Processing Systems 34. Cited by: §2.3.
  • F. Wenzel, K. Roth, B. S. Veeling, J. Swiatkowski, L. Tran, S. Mandt, J. Snoek, T. Salimans, R. Jenatton, and S. Nowozin (2020) How good is the Bayes posterior in deep neural networks really?. International Conference on Machine Learning. Cited by: §1, §2, §3.
  • L. Zancato, A. Achille, A. Ravichandran, R. Bhotika, and S. Soatto (2020) Predicting training time without training. Advances in Neural Information Processing Systems 33, pp. 6136–6146. Cited by: 2nd item, §3.
  • C. Zeno, I. Golan, A. Pakman, and D. Soudry (2020) Why cold posteriors? on the suboptimal generalization of optimal Bayes estimates. In Third Symposium on Advances in Approximate Bayesian Inference, Cited by: §1, §1.

Appendix A Proofs of main results

A.1 Proof of Proposition 1

Recall that we model our predictor as . Then for the choice of a Gaussian likelihood, given a training signal , a training label and weights , the negative log-likelihood loss takes the form . We also define . Our derivations closely follow the approach of Germain et al. [2016] p.11, section A.4.

Given the above definitions and modelling choices we develop the empirical risk term

In the penultimate line, we have used the fact that a real number is the trace of itself, as well as the cyclic property of the trace. The second summation (over the parameters of the model) results from the fact that is isotropic with a common scaling factor . The term in blue is exactly the Gauss–Newton approximation to the Hessian of the full neural network for the squared loss function (see Kunstner et al. [2019], Immer et al. [2021]), and in the last line we set . Since is a sum of positive numbers, taking into account that the blue term is the Gauss–Newton approximation to the Hessian, and if we assume that the Gauss–Newton approximation is diagonal, then is a measure of the curvature of the loss landscape at the minimum. We finally get
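
The trace argument used in this step can be checked numerically. The sketch below is only an illustration under stated assumptions: the matrix `J` is a random stand-in for the per-example gradients of the network output with respect to the weights, and `expected_quadratic_form`, `sigma2` and the array sizes are hypothetical names and values, not quantities from the paper. It verifies that, for an isotropic Gaussian perturbation with variance `sigma2`, the expected quadratic form under the Gauss–Newton matrix `J.T @ J` equals `sigma2` times its trace, which is a sum of non-negative squared gradient entries.

```python
import numpy as np


def expected_quadratic_form(jacobian, sigma2):
    """Return E[d^T (J^T J) d] for d ~ N(0, sigma2 * I).

    Uses E[d^T G d] = sigma2 * tr(G) and tr(J^T J) = sum of squared
    entries of J, so the Gauss-Newton matrix is never formed here.
    """
    return sigma2 * float(np.sum(jacobian ** 2))


rng = np.random.default_rng(0)
n_samples, n_params = 5, 3                    # hypothetical sizes
J = rng.normal(size=(n_samples, n_params))    # stand-in for output gradients
sigma2 = 0.25                                 # hypothetical posterior variance

ggn = J.T @ J                                 # Gauss-Newton matrix (squared loss)
closed_form = expected_quadratic_form(J, sigma2)

# Monte Carlo estimate of the same expectation, for comparison only.
deltas = rng.normal(scale=np.sqrt(sigma2), size=(200_000, n_params))
mc_estimate = float(np.mean(np.einsum("ij,jk,ik->i", deltas, ggn, deltas)))

print(closed_form, mc_estimate)
```

The closed form depends on the Jacobian only through its squared entries, which is why the resulting curvature term is a sum of non-negative numbers.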

We continue with the KL term which is known to have the following analytical expression for Gaussian prior and posterior distributions
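
As a point of reference, the standard closed form of the KL divergence between two Gaussians is easy to evaluate in the isotropic case. The sketch below assumes that both the posterior and the prior are isotropic, with variances `var_q` and `var_p`; the function name and argument names are hypothetical and do not follow the paper's notation.

```python
import numpy as np


def kl_isotropic_gaussians(mu_q, var_q, mu_p, var_p):
    """KL( N(mu_q, var_q * I) || N(mu_p, var_p * I) ) in nats.

    Specialises the general Gaussian formula
    0.5 * [tr(S_p^-1 S_q) + (mu_p - mu_q)^T S_p^-1 (mu_p - mu_q)
           - d + ln(det S_p / det S_q)]
    to isotropic covariances S_q = var_q * I and S_p = var_p * I.
    """
    mu_q = np.asarray(mu_q, dtype=float)
    mu_p = np.asarray(mu_p, dtype=float)
    d = mu_q.size
    sq_dist = float(np.sum((mu_q - mu_p) ** 2))
    return 0.5 * (
        d * var_q / var_p            # trace term
        + sq_dist / var_p            # mean-shift term
        - d
        + d * np.log(var_p / var_q)  # log-determinant ratio
    )


# Identical distributions have zero divergence.
print(kl_isotropic_gaussians([0.0, 0.0], 1.0, [0.0, 0.0], 1.0))
```

Only the squared distance between the means and the ratio of the two variances enter the result, so the cost is linear in the number of parameters.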

We finally develop the moment term. Using an intermediate variable to simplify the calculations, we get