1 Introduction
(Figure caption) Sub-gamma fits for both a ResNet (top) and a linear model (bottom). Notice, this suggests that the corresponding random variables are sub-gamma (see Definition 1 and Theorem 1). We obtain the results for the ResNet using the CIFAR-10 dataset and for the linear model using the MNIST dataset.

Deep neural networks are ubiquitous across disciplines and often achieve state-of-the-art results. Although deep nets are able to encode highly complex input-output relations, in practice they do not tend to overfit
(Zhang et al., 2016). This tendency not to overfit has been investigated in numerous works on generalization bounds. Indeed, many generalization bounds apply to composite functions specified by deep nets. However, most of these bounds assume that the loss function is bounded or Lipschitz. Unfortunately, this assumption excludes plenty of deep nets and Bayesian deep nets that rely on the popular negative log-likelihood (NLL) loss.

In this work we introduce a new PAC-Bayesian generalization bound for unbounded loss functions with unbounded gradient-norm, i.e., non-Lipschitz functions. This setting is closer to present-day deep net training, which uses the unbounded NLL loss and requires avoiding large gradient values during training so as to prevent exploding gradients. To prove the bound we utilize the contractivity of the log-Sobolev inequality (Ledoux, 1999). It enables us to bound the moment-generating function of the model risk. Our PAC-Bayesian bound adds a novel complexity term to existing PAC-Bayesian bounds: the expected norm of the loss function gradients computed with respect to the input. Intuitively, this norm measures the complexity of the loss function, i.e., of the model. In our work we prove that this complexity term is sub-gamma when considering linear models with the NLL loss, or more generally, for any linear model with a Lipschitz loss function. We also derive a bound for any Bayesian deep net, which permits verifying empirically that this complexity term is sub-gamma. See Figure 3.

This new term, which measures the complexity of the model, augments existing PAC-Bayesian bounds for bounded or Lipschitz loss functions, which typically consist of two terms: (1) the empirical risk, which measures the fit of the posterior over the parameters to the training data, and (2) the KL divergence between the prior and the posterior distributions over the parameters, which measures the complexity of learning the posterior when starting from the prior over the parameters.
2 Related work
Generalization bounds for deep nets were explored in various settings. VC theory provides both upper and lower bounds on a network's VC dimension, which are linear in the number of network parameters (Bartlett et al., 2017b, 2019). While VC theory asserts that such a model should overfit, as it can learn any random labeling (e.g., Zhang et al. (2016)), surprisingly, deep nets generally do not overfit.
Rademacher complexity makes it possible to apply data-dependent bounds to deep nets (Bartlett and Mendelson, 2002; Neyshabur et al., 2015; Bartlett et al., 2017a; Golowich et al., 2017; Neyshabur et al., 2018). These bounds rely on the loss and the Lipschitz constant of the network and consequently depend on a product of norms of weight matrices, which scales exponentially in the network depth. Wei and Ma (2019) developed a bound that considers the gradient-norm over training examples. In contrast, our bound depends on average quantities of the gradient-norm, and thus we answer an open question of Bartlett et al. (2017a) about the existence of bounds that depend on the average loss and average gradient-norm, albeit in a PAC-Bayesian setting. PAC-Bayesian bounds that use Rademacher complexity have been studied by Kakade et al. (2009); Yang et al. (2019).
Stability bounds may be applied to unbounded loss functions and in particular to the negative log-likelihood (NLL) loss (Bousquet and Elisseeff, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2009; Hardt et al., 2015; Zhang et al., 2016). However, stability bounds for convex loss functions, e.g., for logistic regression, do not apply to deep nets since they require the NLL loss to be a convex function of the parameters. Alternatively, Hardt et al. (2015) and Kuzborskij and Lampert (2017) estimate the stability of stochastic gradient descent dynamics, which strongly relies on early stopping. This approach results in weaker bounds for the non-convex setting. Stability PAC-Bayesian bounds for bounded and Lipschitz loss functions were developed by London (2017).

PAC-Bayesian bounds were recently applied to deep nets (McAllester, 2013; Dziugaite and Roy, 2017; Neyshabur et al., 2017). In contrast to our work, those related works all consider bounded loss functions. An excellent survey on PAC-Bayesian bounds was provided by Germain et al. (2016). Alquier et al. (2016) introduced PAC-Bayesian bounds for linear classifiers with the hinge loss by explicitly bounding its moment-generating function.
Alquier et al. (2012) provide an analysis of PAC-Bayesian bounds for Lipschitz functions. Our work differs as we derive PAC-Bayesian bounds for non-Lipschitz functions. The work by Germain et al. (2016) is closer to our setting and considers PAC-Bayesian bounds for the quadratic loss function. In contrast, our work considers the multiclass setting and nonlinear models. PAC-Bayesian bounds for the NLL loss in the online setting were put forward by Takimoto and Warmuth (2000); Banerjee (2006); Bartlett et al. (2013); Grünwald and Mehta (2017). The online setting does not consider the whole sample space and is therefore simpler to analyze in the Bayesian setting.

PAC-Bayesian bounds for the NLL loss function are intimately related to learning Bayesian inference (Germain et al., 2016). Recently, many works applied various posteriors in Bayesian deep nets. Gal and Ghahramani (2015); Gal (2016) introduce a Bayesian inference approximation using Monte Carlo (MC) dropout, which approximates a Gaussian posterior using Bernoulli dropout. Srivastava et al. (2014); Kingma et al. (2015) introduced Gaussian dropout, which effectively creates a Gaussian posterior that couples the mean and the variance of the learned parameters, and explored the relevant log-uniform priors. Blundell et al. (2015); Louizos and Welling (2016) suggest taking a full Bayesian perspective and learning the mean and the variance of each parameter separately.

3 Background
Generalization bounds provide statistical guarantees on learning algorithms. They assess how the learned parameters of a model perform on test data, given the model's result on the training data , where is the data instance and is the corresponding label. The performance of the parametric model is measured by a loss function . The risk of this model is its average loss when the data instance and its label are sampled from the true but unknown distribution . We denote the risk by . The empirical risk is the average training-set loss .

PAC-Bayesian theory bounds the expected risk of a model when its parameters are averaged over the learned posterior distribution . The parameters of the posterior distribution are learned from the training data . In our work we start from the following PAC-Bayesian bound:
Theorem 1 (Alquier et al. (2016)).
Let
be the KL divergence between two probability density functions
. For any , for any , and for any prior distribution , with probability at least over the draw of the training set , the following holds simultaneously for any posterior distribution : where
Unfortunately, the complexity term of this bound is impossible to compute for large values of , as we show in our experimental evaluation. To deal with this complexity term, Alquier et al. (2016); Germain et al. (2016); Boucheron et al. (2013) consider the sub-Gaussian assumption, which amounts to the bound for any and some variance factor . This assumption is also referred to as the Hoeffding assumption; it is related to Hoeffding's lemma, which is usually applied in PAC-Bayesian bounds to loss functions that are uniformly bounded by a constant, i.e., for any simultaneously.
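The numerical difficulty of computing such a moment-generating-function term directly can be seen in a toy sketch. The snippet below is purely illustrative (losses are drawn from an exponential distribution, not from an actual model): a direct Monte-Carlo estimate of E[exp(λ·loss)] overflows for large λ, while the same quantity evaluated in log-space via a log-sum-exp stays finite.

```python
import math, random

random.seed(0)

# Hypothetical per-sample losses, drawn here from an exponential
# distribution purely for illustration (not the paper's data).
losses = [random.expovariate(1.0) for _ in range(10_000)]

def naive_mgf(lam):
    """Direct Monte-Carlo estimate of E[exp(lam * loss)] -- overflows fast."""
    try:
        return sum(math.exp(lam * l) for l in losses) / len(losses)
    except OverflowError:
        return float("inf")

def log_mgf(lam):
    """The same quantity computed in log-space via log-sum-exp."""
    m = max(lam * l for l in losses)
    return m + math.log(sum(math.exp(lam * l - m) for l in losses) / len(losses))

# For a modest lam both agree; for large lam the naive estimate overflows
# a float while the log-space version is still computable.
print(naive_mgf(0.5), math.exp(log_mgf(0.5)))
print(naive_mgf(2000.0))   # inf: exp(2000 * loss) overflows
print(log_mgf(2000.0))     # finite log-domain value
```

This mirrors the observation above: the exponentiated term and the normalizer cannot balance each other numerically, whereas log-domain computation remains feasible.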
Unfortunately, many loss functions that are used in practice are unbounded. In particular, the NLL loss function is unbounded even in the multiclass setting when is discrete. For instance, consider a fully connected deep net, where the input vector of the th layer is a function of the parameters of all previous layers, i.e., . The entries of are computed from the response of its preceding layer, i.e., , followed by a transfer function , i.e., . Since the NLL is defined as , the NLL loss increases with and is unbounded when if the rows in consist of the vector . In our experimental validation in Section 5 we show that the unboundedness of the NLL loss results in a complexity term that is not sub-Gaussian.

This complexity term influences the value of , which controls the convergence rate of the bound, as it weighs the complexity terms and by . Therefore, a tight bound requires to be as large as possible. However, since influences exponentially, one needs to make sure that . In such cases one may use sub-gamma random variables (Alquier et al., 2016; Germain et al., 2016; Boucheron et al., 2013):
Definition 1.
The random variable is called sub-gamma if the complexity term in Theorem 1 satisfies for every such that .
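A classical example of a sub-gamma variable is a centered Gamma random variable, which is sub-gamma with variance factor s² = kθ² and scale c = θ (Boucheron et al., 2013). The sketch below (with illustrative shape/scale values, not taken from the paper) numerically checks the Definition 1-style bound on the log moment-generating function over the whole admissible interval.

```python
import math

# Centered Gamma(k, theta): sub-gamma with s2 = k * theta**2 and c = theta.
# We check  log E[exp(lam * (X - E X))] <= s2 * lam**2 / (2 * (1 - c * lam))
# on a grid of lam in (0, 1/c), using the Gamma's closed-form cumulant.

k, theta = 3.0, 0.5            # illustrative shape/scale values
s2, c = k * theta**2, theta

def centered_log_mgf(lam):
    # log E[exp(lam*X)] = -k*log(1 - theta*lam); subtract lam*E[X] to center.
    return -k * math.log(1.0 - theta * lam) - lam * k * theta

def subgamma_bound(lam):
    return s2 * lam**2 / (2.0 * (1.0 - c * lam))

for i in range(1, 100):
    lam = (i / 100.0) / c      # sweeps lam over (0, 1/c)
    assert centered_log_mgf(lam) <= subgamma_bound(lam) + 1e-12
print("sub-gamma bound holds on the interval (0, 1/c)")
```

The same kind of check, with the empirical log moment-generating function in place of the closed form, underlies the sub-gamma fits reported in the experiments.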
In Corollary 1 we prove that the complexity term is sub-gamma when considering linear models with the NLL loss, or more generally, for any linear model with a Lipschitz loss function. In Corollary 2 we derive a bound on for any Bayesian deep net, which permits verifying empirically that is sub-gamma.
4 PAC-Bayesian bounds for smooth loss functions
Our main theorem below shows that for smooth loss functions, the complexity term is bounded by the expected gradient-norm of the loss function with respect to the data . This property is appealing since the gradient-norm contracts the network's output, as evident in the extreme case of the vanishing-gradients property. In our experimental evaluation we show how this contractivity depends on the depth of the network and the variance of the prior, and its effect on the generalization of Bayesian deep nets.
Theorem 2.
Consider the setting of Theorem 1 and assume , given
follows the Gaussian distribution and
is a smooth loss function (e.g., the negative log-likelihood loss). Let . Then: (1)
Our proof technique for the main theorem uses the log-Sobolev inequality for Gaussian distributions (Ledoux, 1999), as we illustrate next.
Proof.
The proof consists of three steps. First, we use the statistical independence of the training samples to decompose the moment-generating function.
Next, we consider the moment-generating function in log-space, i.e., by considering the cumulant generating function , and obtain the following equality:
Finally, we use the log-Sobolev inequality for Gaussian distributions,
and complete the proof through some algebraic manipulations.
The first step of the proof results in Eq. (4). To derive it we use the independence of the training samples:
The first equality holds since is a constant that is independent of the expectation . The last equality holds since are identically distributed.
The second step of the proof results in Eq. (4). It relies on the relation between the moment-generating function and the cumulant generating function . The fundamental theorem of calculus asserts , where refers to the derivative at . We then compute and :
The second equality follows from l'Hôpital's rule: , recalling that and .
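To make the l'Hôpital step concrete, for a generic cumulant generating function $K(\lambda) = \log \mathbb{E}\, e^{\lambda X}$ (generic notation, not tied to the elided symbols above) one has:

```latex
K(0) = \log \mathbb{E}\, e^{0 \cdot X} = 0, \qquad
K'(\lambda) = \frac{\mathbb{E}\!\left[ X e^{\lambda X} \right]}{\mathbb{E}\, e^{\lambda X}},
\qquad\text{hence}\qquad
\lim_{\lambda \to 0} \frac{K(\lambda)}{\lambda} = K'(0) = \mathbb{E}[X].
```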
The third and final step of the proof begins with applying the log-Sobolev inequality in Eq. (4) for Gaussian distributions. Combining it with Eq. (4) leads to the inequality
It is insightful to see that
This is compelling because the term in the above inequality cancels with the term in Eq. (4). This shows theoretically why the complexity term mostly depends on the gradient-norm. This observation also concludes the proof after rearranging terms. ∎
The above theorem can be extended to settings in which is sampled from any log-concave distribution, e.g., the Laplace distribution; cf. Gentil (2005). For readability we do not discuss this generalization here.
Eq. (4) hints at the infeasibility of computing the complexity term directly for large values of : since the loss function is non-negative, the value of grows exponentially with , while diminishes to zero exponentially fast. These opposing quantities make evaluation of numerically infeasible. In contrast, our bound performs all computations in log-space, hence its computation is feasible for larger values of , up to their sub-gamma interval ; see Table 1.
Notably, the general bound in Theorem 2 is more theoretical than practical: to estimate it in practice one needs to avoid the integration over . However, it is an important intermediate step toward a practical bound for linear models with a Lipschitz loss function, and a general bound for any smooth loss function, as we discuss in the next two sections respectively.
4.1 Linear models
In the following we consider smooth loss functions over linear models in the multiclass setting, where is the data instance, are the possible labels, and the loss function takes the form . We also assume that is a Lipschitz function, i.e., . These assumptions cover the popular NLL loss that is used in logistic regression and the multiclass hinge loss that is used in support vector machines (SVMs).
Corollary 1.
Consider smooth loss functions over linear models , with Lipschitz constant , i.e., . Under the conditions of Theorem 2 with Gaussian prior distribution and variance for which , we obtain .
Proof.
This bound is derived by applying Theorem 2. We begin by computing the gradient of with respect to . Using the chain rule, . Hence, we obtain for the gradient norm . Plugging this result into Eq. (1), we obtain the following bound for its exponent: Since , the ratio in the integral equals one and the integral . Combining these results we obtain:
Finally, whenever , we follow the Gaussian integral and derive the bound
∎
The above corollary provides a PAC-Bayesian bound for classification using the NLL loss, and shows that is sub-gamma in the interval . Interestingly, it can achieve a rate of , albeit with a prior variance of . In our experimental evaluation we show that it is better to settle for a lower rate, i.e., , while using a prior with a fixed variance, i.e., . The above bound also extends the result of Alquier et al. (2016) for the binary hinge loss to the multiclass hinge loss (cf. Alquier et al. (2016), Section 6). Unfortunately, the above bound cannot be applied to nonlinear loss functions, since their gradient-norm is not bounded, as evident by the exploding-gradients property in deep nets.
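The bounded input-gradient of the linear case can be illustrated numerically. The sketch below (dimensions and data are arbitrary illustrative choices) takes a linear multiclass model with the softmax NLL, computes the input-gradient analytically as W^T(p - e_y), verifies it against finite differences, and checks that its norm respects a Lipschitz-style bound: since ||p - e_y|| <= sqrt(2), the gradient norm is at most sqrt(2) times the (Frobenius) norm of W.

```python
import math, random

random.seed(1)

d, k = 8, 5                                   # input dim, number of classes
W = [[random.gauss(0, 1) for _ in range(d)] for _ in range(k)]
x = [random.gauss(0, 1) for _ in range(d)]
y = 2                                         # an arbitrary true label

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def nll(x_):
    z = [sum(W[i][j] * x_[j] for j in range(d)) for i in range(k)]
    return -math.log(softmax(z)[y])

# Analytic input-gradient: W^T (p - e_y).
z = [sum(W[i][j] * x[j] for j in range(d)) for i in range(k)]
p = softmax(z)
r = [p[i] - (1.0 if i == y else 0.0) for i in range(k)]
grad = [sum(W[i][j] * r[i] for i in range(k)) for j in range(d)]

grad_norm = math.sqrt(sum(g * g for g in grad))
frob_W = math.sqrt(sum(W[i][j] ** 2 for i in range(k) for j in range(d)))
assert grad_norm <= math.sqrt(2.0) * frob_W   # Lipschitz-style bound

# Central finite-difference sanity check of the analytic gradient.
eps = 1e-6
for j in range(d):
    xp, xm = list(x), list(x)
    xp[j] += eps
    xm[j] -= eps
    fd = (nll(xp) - nll(xm)) / (2 * eps)
    assert abs(fd - grad[j]) < 1e-5

print("input-gradient norm:", round(grad_norm, 4),
      "bound:", round(math.sqrt(2) * frob_W, 4))
```

For a nonlinear model no such worst-case bound on the input-gradient is available, which is why the next subsection works with on-average quantities instead.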
4.2 Nonlinear models
In the following we derive a generalization bound for on-average bounded loss functions and on-average bounded gradient-norms.
Corollary 2.
Consider smooth loss functions that are on-average bounded, i.e., . Under the conditions of Theorem 2, for any we obtain
Proof.
This bound is derived by applying Theorem 2 and bounding . We derive this bound in three steps:

1. From we obtain .
2. We lower bound for any : First we note that , therefore we consider . Also, since the function is monotone in within the unit interval, i.e., for there holds , and consequently for any .
3. The assumption and the monotonicity of the exponential function yield the lower bound . From convexity of the exponential function, , and the lower bound follows.
Combining these bounds we derive the upper bound , and the result follows. ∎
The above derivation upper bounds the complexity term by the expected gradient-norm of the loss function, i.e., the flow of its gradients through the architecture of the model. It provides the means to empirically show that is sub-gamma (see Section 5.3). In particular, we show empirically that the rate of the bound can be as high as , depending on the gradient-norm. This is a favorable property, since the convergence of the bound scales as . Therefore, one would like to avoid exploding gradient-norms, as they effectively harm the true-risk bound. While one may achieve a fast-rate bound by forcing the gradient-norm to vanish rapidly, practical experience shows that vanishing gradients prevent the deep net from fitting the training data when minimizing the empirical risk. In our experimental evaluation we demonstrate the influence of the expected gradient-norm on the bound of the true risk.
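The expected input-gradient-norm term can be estimated by plain Monte Carlo. The sketch below is illustrative only: a small two-layer ReLU network with weights drawn from an isotropic Gaussian "posterior" and synthetic data (all shapes, the variance, and the data are assumptions, not the paper's setup), backpropagating the softmax NLL to the input and averaging the squared gradient norm.

```python
import numpy as np

rng = np.random.default_rng(0)

d, h, k = 10, 16, 3             # input dim, hidden width, classes
sigma2 = 0.1                    # variance of the Gaussian over weights
X = rng.standard_normal((32, d))
Y = rng.integers(0, k, size=32)

def expected_sq_grad_norm(n_samples=50):
    """Monte-Carlo estimate of E_{w,x}[ ||grad_x NLL(f_w(x), y)||^2 ]."""
    total = 0.0
    for _ in range(n_samples):
        W1 = rng.normal(0, np.sqrt(sigma2), (h, d))
        W2 = rng.normal(0, np.sqrt(sigma2), (k, h))
        for x, y in zip(X, Y):
            a = W1 @ x                        # pre-activation
            hidden = np.maximum(a, 0.0)       # ReLU
            z = W2 @ hidden
            p = np.exp(z - z.max()); p /= p.sum()
            r = p.copy(); r[y] -= 1.0         # dNLL/dz for softmax NLL
            # Backprop to the *input*: through W2, the ReLU mask, then W1.
            g = W1.T @ ((W2.T @ r) * (a > 0))
            total += float(g @ g)
    return total / (n_samples * len(X))

est = expected_sq_grad_norm()
print("estimated E||grad_x loss||^2:", est)
```

Repeating such an estimate across depths and prior variances is the kind of computation behind the gradient-norm curves reported in Section 5.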
5 Experiments
In this section we perform an experimental evaluation of our PAC-Bayesian bounds, both for linear and nonlinear models. We begin by verifying our assumptions: (i) the PAC-Bayesian bound in Theorem 1 cannot be computed for large values of ; (ii) although the NLL loss is unbounded, it is on-average bounded. Next, we study the behavior of the complexity term for different architectures, both for linear models and deep nets. We show that the random variable is sub-gamma, namely that for every such that . Importantly, we show that , which relates to the rate of convergence of the bound, is determined by the architecture of the deep net. Lastly, we demonstrate the importance of in the learning process, balancing the three terms of (i) the empirical risk, (ii) the KL divergence, and (iii) the complexity term.
Implementation details.
We use multi-layer perceptrons (MLPs) for the MNIST and Fashion-MNIST datasets (Xiao et al., 2017). We use convolutional neural networks (CNNs) for the CIFAR-10 dataset. In all models we use the ReLU activation function. We optimize the NLL loss function using SGD with a learning rate of 0.01 and a momentum value of 0.9 in all settings for 50 epochs. We use mini-batches of size 128 and do not use any learning-rate scheduling. For the ResNet experiments we optimize an 18-layer ResNet model on the CIFAR-10 dataset, using the Adam optimizer with a learning rate of 0.001 for 150 epochs, halving the learning rate every 50 epochs, with a batch size of 128.
5.1 Verifying the Assumptions
We start by empirically demonstrating the numerical instability of computing the complexity term in Eq. (1). This numerical instability occurs due to the exponentiation of the random variables, namely , which quickly goes to infinity as grows. We estimated Eq. (1) over MNIST for MLPs of depth , where an MLP of depth is a linear model. For a fair comparison we changed the layers' width to reach roughly the same number of parameters in each model (except for the linear case). We evaluated these architectures for different values of , using a variance of 0.1 for the prior distribution. The results are depicted in Figure 4. One can see that we are able to compute the complexity term for , and for even smaller for the linear model (). Standard bounds for MNIST require to be in the interval , where is the training sample size ( for MNIST). We observe that while computing the term, goes to infinity while goes to zero, and they are not able to balance each other. In our derivation this is solved by looking at the gradient to emphasize their change; see Eq. (4).
In Corollary 2 we assume that although the loss is unbounded, it is on-average bounded, meaning . We tested this assumption on MNIST and Fashion-MNIST using MLPs of depth , and on CIFAR-10 using CNNs of depth , where for the CNN models is the number of convolutional layers, and we include a max-pooling layer after each convolutional layer. In all CNN models we add a fully connected output layer. To span the possible weights we sampled them from a normal prior distribution with different variances. The results appear in Figure 8. We observe that the loss is on-average bounded by , while being dependent on the variance of the prior. Moreover, for variance up to , the on-average loss bound is about and its effect on the complexity term is minimal. Notice that although the expected loss increases for high variance levels, such levels are not used to initialize deep nets under common initialization techniques.

5.2 Complexity of Neural Nets
Prior Variance  0.0004  0.01  0.05  0.1  0.3  0.5  0.7  

One 
Test Loss  0.753  0.442  0.277  0.271  0.324  0.408  0.677 
Train Loss  0.731  0.424  0.273  0.253  0.292  0.373  0.697  
Bound on  10.57  66.92  inf  inf  inf  inf  inf  
Bound on  0.0002  0.001  0.029  0.141  1.76  6.33  15.18  
KL  20027  10776  2561  1478  3886  7447  8995  
Two 
Test Loss  1.743  0.549  0.104  0.066  0.127  0.236  0.602 
Train Loss  1.752  0.569  0.095  0.038  0.056  0.155  0.425  
Bound on  0.0  0.014  0.540  inf  inf  inf  inf  
Bound on  0.0  0.0  0.006  0.115  16.95  inf  inf  
KL  25848  30469  7369  7834  96976  166638  244255  
Three 
Test Loss  2.3  2.3  0.091  0.062  0.136  0.294  1.319 
Train Loss  2.3  2.3  0.078  0.027  0.067  0.268  1.173  
Bound on  0.0  0.0  0.002  31.99  inf  inf  inf  
Bound on  0.0  0.001  0.001  0.041  62.73  inf  inf  
KL  nan  10776  9480  8215  95984  175134  226237  
Four 
Test Loss  2.3  2.3  0.083  0.067  0.132  inf  inf 
Train Loss  2.3  2.3  0.064  0.022  0.081  inf  inf  
Bound on  0.0  0.0  0.0  2.855  inf  inf  inf  
Bound on  0.0  0.0  0.0  0.012  inf  inf  inf  
KL  nan  nan  10849  8239  113943  nan  nan  
Five 
Test Loss  2.3  2.3  0.087  0.066  0.133  inf  inf 
Train Loss  2.3  2.3  0.055  0.019  0.101  inf  inf  
Bound on  0.0  0.0  0.0  0.204  inf  inf  inf  
Bound on  0.0  0.0  0.0  0.004  inf  inf  inf  
KL  nan  nan  11800  9090  140305  nan  nan 
Next we turn to estimating our bounds on , both for linear and nonlinear models, corresponding to Corollary 1 and Corollary 2. We use the same architectures as mentioned above. The bound on is controlled by the expected gradient-norm . Figure 12 presents the expected gradient-norm as a function of different variance levels for the prior distribution over the model parameters. For the linear model we used the bound in Corollary 1. One can see that the linear model has the largest expected gradient-norm, since the Lipschitz condition considers the worst-case gradient-norm. One can also see that the deeper the network, the smaller its gradient-norm. This is attributed to the gradient-vanishing property. As a result, deeper nets can have a faster convergence rate, i.e., use larger values of in the generalization bound, since the vanishing gradients create a contractivity property that stabilizes the loss function, i.e., reduces its variability. However, this comes at the expense of the expressivity of the deep net, since vanishing gradients cannot fit the training data in the learning phase. This is demonstrated in the next experiment.
Figure 20 presents the bound on as a function of the variance levels of the prior distribution over the model parameters. One can see that the bound gets larger as the variance of the prior increases. Another thing to note is that the bound often explodes when the variance is larger than . This conforms with Corollary 1, which is unbounded for variance larger than . This plot also confirms the practice in Bayesian deep nets of setting the variance of the prior to at most .
5.3 Sub-Gamma Approximation
In Section 4.1 we proved that the proposed bound is sub-gamma in the linear case. Unfortunately, that proof cannot be directly applied to the nonlinear case. Hence, we empirically demonstrate that the proposed bound on is indeed sub-gamma for various model architectures. For this we used the same model architectures as before on the MNIST dataset. Results are depicted in Figure 17. Notice that, similar to the ResNet model, the proposed bound is sub-gamma in all explored settings, using with different scaling factors.
5.4 Optimization
Lastly, in order to better understand the balance between all components composing the proposed generalization bound, we optimize all five MLP models presented above on the MNIST dataset and compute the average training loss, average test loss, KL divergence, and the bound on , using and . We repeat this optimization process for various variance levels of the prior distribution over the model parameters. Results for the MNIST dataset are summarized in Table 1; more experimental results can be found in Section A.1 in the Appendix.
The results suggest that variance levels in [0.05, 0.1] produce the best overall performance across all depth levels. This finding is consistent with our previous results, which suggest that below this value the bound goes to zero, hence yielding good generalization at the expense of model performance. However, larger variance levels may cause the bound to explode and as a result make the optimization problem harder.
Lastly, we analyzed the commonly used ResNet model (He et al., 2016). For this, we trained four different versions of the ResNet-18 model: (i) the standard model (ResNet); (ii) a model with no skip connections (ResNet-NoSkip); (iii) a model with no batch normalization (ResNet-NoBN); and (iv) a model without both skip connections and batch normalization layers (ResNet-NoSkip-NoBN). We optimize all models on the CIFAR-10 dataset. Figure 24 visualizes the results. Consistently with previous findings, a variance level of 0.1 yields the best performance overall, both in terms of model test loss and generalization. Notice that ResNet and ResNet-NoSkip achieve comparable performance in all measures. Additionally, when considering a variance level of 0.1 for the prior distribution, removing the batch normalization layers while keeping the skip connections also yields performance comparable to ResNet and ResNet-NoSkip. Similarly to Zhang et al. (2019), these findings suggest that even without batch normalization layers, models can converge given careful initialization. On the other hand, when removing both batch normalization and skip connections, models either explode immediately or suffer greatly from vanishing gradients. These results are consistent with previous findings in which batch normalization greatly improves optimization (Santurkar et al., 2018).
6 Discussion and Future Work
We present a new PAC-Bayesian generalization bound for Bayesian deep neural networks with unbounded loss functions and unbounded gradient-norm. The proof relies on bounding the log-partition function using the expected squared norm of the gradients with respect to the input. We prove that the proposed bound is sub-gamma for any linear model with a Lipschitz loss function, and we verify it empirically in the nonlinear case. Experimental validation shows that the resulting bound provides insights for better model optimization, prior-distribution search, and model initialization.
References
 On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research 17 (239), pp. 1–41. Cited by: §2, §3, §3, §4.1, Theorem 1.
 Model selection for weakly dependent time series forecasting. Bernoulli 18 (3), pp. 883–913. Cited by: §2.
 On Bayesian bounds. In Proceedings of the 23rd International Conference on Machine Learning, pp. 81–88. Cited by: §2.
 Horizon-independent optimal prediction with log-loss in exponential families. arXiv preprint arXiv:1305.4324. Cited by: §2.
 Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241–6250. Cited by: §2.
 Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930. Cited by: §2.
 Rademacher and Gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §2.
 Nearly-tight VC-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research 20 (63), pp. 1–17. External Links: Link Cited by: §2.
 Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §2.
 Concentration inequalities: a nonasymptotic theory of independence. Oxford University Press. Cited by: §3, §3.
 Stability and generalization. The Journal of Machine Learning Research 2, pp. 499–526. Cited by: §2.
 Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §2.

 Dropout as a Bayesian approximation: representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142. Cited by: §2.
 Uncertainty in deep learning. Ph.D. thesis, University of Cambridge. Cited by: §2.
 Logarithmic Sobolev inequality for log-concave measure from Prékopa–Leindler inequality. arXiv preprint math/0503476. Cited by: §4.
 PAC-Bayesian theory meets Bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892. Cited by: §2, §2, §3, §3.
 Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541. Cited by: §2.
 A tight excess risk bound via a unified PAC-Bayesian–Rademacher–Shtarkov–MDL complexity. arXiv preprint arXiv:1710.07732. Cited by: §2.
 Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240. Cited by: §2.
 Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778. Cited by: §5.4.
 On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in Neural Information Processing Systems, pp. 793–800. Cited by: §2.
 Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §2.
 Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678. Cited by: §2.
 Concentration of measure and logarithmic Sobolev inequalities. In Séminaire de Probabilités XXXIII, pp. 120–216. Cited by: §1, §4.
 A PAC-Bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2931–2940. Cited by: §2.
 Structured and efficient variational deep learning with matrix Gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716. Cited by: §2.
 A PAC-Bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118. Cited by: §2.
 A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: §2.
 Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076. Cited by: §2.
 Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401. Cited by: §2.
 Stability results in learning theory. Analysis and Applications 3 (04), pp. 397–417. Cited by: §2.
 How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §5.4.
 Stochastic convex optimization. In COLT. Cited by: §2.
 Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2.
 The last-step minimax algorithm. In International Conference on Algorithmic Learning Theory, pp. 279–290. Cited by: §2.
 Data-dependent sample complexity of deep neural networks via Lipschitz augmentation. In Advances in Neural Information Processing Systems, pp. 9722–9733. Cited by: §2.
 Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
 Fast-rate PAC-Bayes generalization bounds via shifted Rademacher processes. In Advances in Neural Information Processing Systems, pp. 10802–10812. Cited by: §2.
 Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1, §2, §2.
 Fixup initialization: residual learning without normalization. arXiv preprint arXiv:1901.09321. Cited by: §5.4.
Appendix A Appendix
A.1 Extended results
We provide additional results for the optimization experiments. We optimized MLP models at different depth levels using the Fashion-MNIST dataset.
Prior Variance  0.0004  0.01  0.05  0.1  0.3  0.5  0.7  

One 
Test Loss  0.79  0.57  0.45  0.46  1.04  1.59  10.43 
Train Loss  0.77  0.55  0.41  0.39  0.86  1.31  9.97  
MGF bound ()  10.57  67.31  inf  inf  inf  inf  inf  
MGF bound ()  0.0001  0.0011  0.0296  0.143  1.812  6.386  14.74  
KL  14255  7829  2430  1648  5379  8440  35299  
Two 
Test Loss  1.46  0.71  0.35  0.31  0.42  0.6  10.43 
Train Loss  1.46  0.69  0.29  0.2  0.3  0.49  9.97  
MGF bound ()  0.014  0.53  inf  inf  inf  inf  inf  
MGF bound ()  0.0  0.0  0.005  0.12  16.86  inf  inf  
KL  28222  22053  6983  9180  98648  168266  209935  
Three 
Test Loss  2.3  0.92  0.34  0.31  0.42  0.83  2.39 
Train Loss  2.3  0.91  0.28  0.19  0.35  0.79  1.76  
MGF bound ()  0.0  0.0  0.002  32.16  inf  inf  inf  
MGF bound ()  0.0  0.0  0.0  0.04  64.48  inf  inf  
KL  0  29826  9219  1648  547636  198760  374718  
Four 
Test Loss  2.3  2.3  0.34  0.31  0.44  inf  inf 
Train Loss  2.3  2.3  0.26  0.18  0.38  inf  inf  
MGF bound ()  0.0  0.0  0.0  2.81  inf  inf  inf  
MGF bound ()  0.0  0.0  0.0  0.01  inf  inf  inf  
KL  0  0  10859  10165  388842  nan  nan  
Five 
Test Loss  2.3  2.3  0.34  0.31  1.3  inf  inf 
Train Loss  2.3  2.3  0.25  0.18  1.28  inf  inf  
MGF bound ()  0.0  0.0  0.0  0.19  inf  inf  inf  
MGF bound ()  0.0  0.0  0.0  0.003  inf  inf  inf  
KL  0  0  11918  10986  353632  nan  nan 