On the generalization of Bayesian deep nets for multi-class classification

02/23/2020
by   Yossi Adi, et al.

Generalization bounds, which assess the gap between the true risk and the empirical risk, have been studied extensively. However, to obtain bounds, current techniques rely on strict assumptions such as a uniformly bounded or a Lipschitz loss function. To avoid these assumptions, in this paper we propose a new generalization bound for Bayesian deep nets by exploiting the contractivity of the log-Sobolev inequality. Using this inequality adds a loss-gradient norm term to the generalization bound, which is intuitively a surrogate of the model complexity. Empirically, we analyze the effect of this loss-gradient norm term using different deep nets.


1 Introduction

Figure 3: The proposed bound as a function of λ for both ResNet (top) and a linear model (bottom). Notice that this suggests that the corresponding random variables are sub-gamma for both the ResNet and the linear model (see Definition 1 and Theorem 1). We obtain the results for ResNet using the CIFAR-10 dataset and for the linear model using the MNIST dataset.

Deep neural networks are ubiquitous across disciplines and often achieve state-of-the-art results. Although deep nets can encode highly complex input-output relations, in practice they tend not to overfit (Zhang et al., 2016). This tendency not to overfit has been investigated in numerous works on generalization bounds. Indeed, many generalization bounds apply to composite functions specified by deep nets. However, most of these bounds assume that the loss function is bounded or Lipschitz. Unfortunately, this assumption excludes many deep nets and Bayesian deep nets that rely on the popular negative log-likelihood (NLL) loss.

In this work we introduce a new PAC-Bayesian generalization bound for unbounded loss functions with unbounded gradient-norm, i.e., non-Lipschitz functions. This setting is closer to present-day deep net training, which uses the unbounded NLL loss and requires avoiding large gradient values during training so as to prevent exploding gradients. To prove the bound we utilize the contractivity of the log-Sobolev inequality (Ledoux, 1999), which enables us to bound the moment-generating function of the model risk. Our PAC-Bayesian bound adds a novel complexity term to existing PAC-Bayesian bounds: the expected norm of the loss-function gradients computed with respect to the input. Intuitively this norm measures the complexity of the loss function, i.e., the model. In our work we prove that this complexity term is sub-gamma when considering linear models with the NLL loss, or more generally, for any linear model with a Lipschitz loss function. We also derive a bound for any Bayesian deep net, which allows us to verify empirically that this complexity term is sub-gamma. See Figure 3.
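For illustration, the following is a minimal sketch (not the code used in our experiments) of how this loss-gradient-norm surrogate can be estimated with automatic differentiation; `model` and `loader` stand in for any classifier and data loader:

```python
import torch
import torch.nn.functional as F

def expected_sq_input_grad_norm(model, loader, device="cpu"):
    """Monte Carlo estimate of E[ ||grad_x loss(w, x, y)||^2 ] over the data.

    Assumes a feed-forward model whose per-example loss depends only on that
    example's input (e.g., no batch normalization), so the gradient of the
    summed loss w.r.t. the batch yields per-example input gradients.
    """
    model.eval()
    total, count = 0.0, 0
    for x, y in loader:
        x = x.to(device).requires_grad_(True)
        y = y.to(device)
        loss = F.cross_entropy(model(x), y, reduction="sum")  # NLL of the softmax outputs
        (grad_x,) = torch.autograd.grad(loss, x)
        total += grad_x.flatten(1).pow(2).sum(dim=1).sum().item()
        count += x.size(0)
    return total / count
```

A small value of this estimate corresponds to a small complexity term in the bound discussed below.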

This new term, which measures the complexity of the model, augments existing PAC-Bayesian bounds for bounded or Lipschitz loss functions which typically consist of two terms: (1) the empirical risk, which measures the fitting of the posterior over the parameters to the training data, and (2) the KL-divergence between the prior and the posterior distributions over the parameters, which measures the complexity of learning the posterior when starting from the prior over the parameters.

2 Related work

Generalization bounds for deep nets have been explored in various settings. VC-theory provides both upper and lower bounds on a network's VC-dimension, which are linear in the number of network parameters (Bartlett et al., 2017b, 2019). While VC-theory asserts that such a model should overfit, as it can learn any random labeling (e.g., Zhang et al. (2016)), surprisingly, deep nets generally do not overfit.

Rademacher complexity allows applying data-dependent bounds to deep nets (Bartlett and Mendelson, 2002; Neyshabur et al., 2015; Bartlett et al., 2017a; Golowich et al., 2017; Neyshabur et al., 2018). These bounds rely on the loss and the Lipschitz constant of the network and consequently depend on a product of norms of weight matrices, which scales exponentially in the network depth. Wei and Ma (2019) developed a bound that considers the gradient-norm over training examples. In contrast, our bound depends on average quantities of the gradient-norm, and thus we answer an open question of Bartlett et al. (2017a) about the existence of bounds that depend on the average loss and the average gradient-norm, albeit in a PAC-Bayesian setting. PAC-Bayesian bounds that use Rademacher complexity have been studied by Kakade et al. (2009); Yang et al. (2019).

Stability bounds may be applied to unbounded loss functions and in particular to the negative log-likelihood (NLL) loss (Bousquet and Elisseeff, 2002; Rakhlin et al., 2005; Shalev-Shwartz et al., 2009; Hardt et al., 2015; Zhang et al., 2016). However, stability bounds for convex loss functions, e.g., for logistic regression, do not apply to deep nets since they require the NLL loss to be a convex function of the parameters. Alternatively, Hardt et al. (2015) and Kuzborskij and Lampert (2017) estimate the stability of stochastic gradient descent dynamics, which strongly relies on early stopping. This approach results in weaker bounds for the non-convex setting. Stability-based PAC-Bayesian bounds for bounded and Lipschitz loss functions were developed by London (2017).

PAC-Bayesian bounds were recently applied to deep nets (McAllester, 2013; Dziugaite and Roy, 2017; Neyshabur et al., 2017). In contrast to our work, those related works all consider bounded loss functions. An excellent survey on PAC-Bayesian bounds was provided by Germain et al. (2016). Alquier et al. (2016) introduced PAC-Bayesian bounds for linear classifiers with the hinge-loss by explicitly bounding its moment generating function. Alquier et al. (2012) provide an analysis of PAC-Bayesian bounds for Lipschitz functions; our work differs in that we derive PAC-Bayesian bounds for non-Lipschitz functions. The work of Germain et al. (2016) is closer to our setting and considers PAC-Bayesian bounds for the quadratic loss function. In contrast, our work considers the multi-class setting and non-linear models. PAC-Bayesian bounds for the NLL loss in the online setting were put forward by Takimoto and Warmuth (2000); Banerjee (2006); Bartlett et al. (2013); Grünwald and Mehta (2017). The online setting does not consider the whole sample space and is therefore simpler to analyze in the Bayesian setting.

PAC-Bayesian bounds for the NLL loss function are intimately related to learning Bayesian inference (Germain et al., 2016). Recently, many works have applied various posteriors to Bayesian deep nets. Gal and Ghahramani (2015); Gal (2016) introduce a Bayesian inference approximation using Monte Carlo (MC) dropout, which approximates a Gaussian posterior using Bernoulli dropout. Srivastava et al. (2014); Kingma et al. (2015) introduced Gaussian dropout, which effectively creates a Gaussian posterior that couples the mean and the variance of the learned parameters, and explored the relevant log-uniform priors. Blundell et al. (2015); Louizos and Welling (2016) suggest taking a fully Bayesian perspective and learning the mean and the variance of each parameter separately.

3 Background

Generalization bounds provide statistical guarantees for learning algorithms. They assess how the learned parameters of a model perform on test data given the model's performance on the training data, where each training example consists of a data instance and its corresponding label. The performance of the parametric model is measured by a loss function. The risk of the model is its average loss when a data instance and its label are sampled from the true but unknown data distribution, and the empirical risk is the average loss over the training set.

PAC-Bayesian theory bounds the expected risk of a model when its parameters are averaged over a learned posterior distribution, whose parameters are learned from the training data. In our work we start from the following PAC-Bayesian bound:

Theorem 1 (Alquier et al. (2016)).

Let KL(q‖p) denote the KL-divergence between two probability density functions q and p. For any λ > 0, for any δ ∈ (0, 1] and for any prior distribution p over the model parameters, with probability at least 1 − δ over the draw of the training set S of n i.i.d. samples, the following holds simultaneously for any posterior distribution q:

\[
\mathbb{E}_{w \sim q}\big[L_D(w)\big] \;\le\; \mathbb{E}_{w \sim q}\big[L_S(w)\big] + \frac{1}{\lambda}\Big(\mathrm{KL}(q\|p) + \ln\tfrac{1}{\delta} + \psi(\lambda, n)\Big),
\]

where L_D(w) denotes the risk of the model with parameters w, L_S(w) its empirical risk on S, and

\[
\psi(\lambda, n) \;=\; \ln \mathbb{E}_{w \sim p}\, \mathbb{E}_{S \sim D^n} \exp\!\big(\lambda\,(L_D(w) - L_S(w))\big)
\]

is the complexity term.

Unfortunately, the complexity term ψ(λ, n) of this bound is impossible to compute for large values of λ, as we show in our experimental evaluation. To deal with this complexity term, Alquier et al. (2016); Germain et al. (2016); Boucheron et al. (2013) consider the sub-Gaussian assumption, which amounts to bounding ψ(λ, n) by a term quadratic in λ, for any λ and some variance factor. This assumption is also referred to as the Hoeffding assumption, since it is related to Hoeffding's lemma, which is usually applied in PAC-Bayesian bounds to loss functions that are uniformly bounded by a constant.

Unfortunately, many loss functions used in practice are unbounded. In particular, the NLL loss is unbounded even in the multi-class setting, where the label space is discrete. For instance, consider a fully connected deep net in which the input vector of each layer is a function of the parameters of all preceding layers: the entries of a layer are computed from the response of the preceding layer, i.e., an affine transformation of its output followed by a transfer function. Since the NLL is defined as the negative log-probability of the correct label, the loss increases without bound as the weights drive the probability of the correct label toward zero. In our experimental validation in Section 5 we show that the unboundedness of the NLL loss results in a complexity term that is not sub-Gaussian.
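As a toy illustration of this unboundedness (the numbers are illustrative only), pushing the logit of a competing class far above that of the correct class makes the softmax NLL arbitrarily large:

```python
import torch
import torch.nn.functional as F

# Class 0 is the true label; as the competing logit grows, -log p(y|x) grows without bound.
for margin in [1.0, 10.0, 100.0, 1000.0]:
    logits = torch.tensor([[0.0, margin]])
    loss = F.cross_entropy(logits, torch.tensor([0]))
    print(f"margin={margin:7.1f}  NLL={loss.item():.2f}")
```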

This complexity term influences the value of λ, which controls the convergence rate of the bound, as the KL-divergence and the complexity term are weighted by 1/λ. Therefore, a tight bound requires λ to be as large as possible. However, since λ enters the complexity term through an exponent, one needs to make sure the complexity term remains under control. In such cases one may use sub-gamma random variables (Alquier et al., 2016; Germain et al., 2016; Boucheron et al., 2013):

Definition 1.

The relevant random variable is called sub-gamma with variance factor v and scale c if the complexity term in Theorem 1 satisfies ψ(λ, n) ≤ λ²v / (2(1 − cλ)) for every λ such that 0 < λ < 1/c.
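As a worked example of this definition (using the standard variance factor v and scale c of Boucheron et al. (2013), which may differ from the constants used in our corollaries), a centered Gamma random variable satisfies the sub-gamma condition, as the following check confirms:

```python
import numpy as np

# X ~ Gamma(shape=k, scale=theta), centered. Its log-MGF is
#   psi(lam) = -k*log(1 - theta*lam) - k*theta*lam,   for 0 < lam < 1/theta,
# and it satisfies psi(lam) <= lam**2 * v / (2*(1 - c*lam)) with v = k*theta**2, c = theta.
k, theta = 3.0, 0.5
v, c = k * theta**2, theta

lam = np.linspace(1e-3, 0.99 / c, 500)
psi = -k * np.log(1.0 - theta * lam) - k * theta * lam
envelope = lam**2 * v / (2.0 * (1.0 - c * lam))

assert np.all(psi <= envelope + 1e-12)   # sub-gamma on (0, 1/c)
```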

In Corollary 1 we prove that the complexity term is sub-gamma when considering linear models with the NLL loss or, more generally, any linear model with a Lipschitz loss function. In Corollary 2 we derive a bound on the complexity term for any Bayesian deep net, which allows us to verify empirically that it is sub-gamma.

4 PAC-Bayesian bounds for smooth loss functions

Our main theorem below shows that for smooth loss functions the complexity term is bounded by the expected gradient-norm of the loss function with respect to the data instance. This property is appealing since the gradient-norm contracts the network's output, as evident in its extreme case from the vanishing-gradients property. In our experimental evaluation we show how this contractivity depends on the depth of the network and the variance of the prior, and its effect on the generalization of Bayesian deep nets.

Theorem 2.

Consider the setting of Theorem 1 and assume that, given its label, the data instance follows a Gaussian distribution, and that the loss is a smooth function of the data instance (e.g., the negative log-likelihood loss). Then the complexity term ψ(λ, n) is bounded by an expression involving the expected squared gradient-norm of the loss with respect to the data instance:

(1)

Our proof technique for the main theorem uses the log-Sobolev inequality for Gaussian distributions (Ledoux, 1999), as we illustrate next.

Proof.

The proof consists of three steps. First, we use the statistical independence of the training samples to decompose the moment generating function. Next, we consider the moment generating function in log-space, i.e., we work with the cumulant generating function and obtain a corresponding equality. Finally, we use the log-Sobolev inequality for Gaussian distributions and complete the proof through some algebraic manipulations.
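For reference, one standard form of the Gaussian logarithmic Sobolev inequality (Ledoux, 1999), stated for the standard Gaussian measure γ and a smooth function f (the normalization used in our proof may differ):

\[
\mathbb{E}_{\gamma}\!\left[f^{2}\log f^{2}\right]-\mathbb{E}_{\gamma}\!\left[f^{2}\right]\log\mathbb{E}_{\gamma}\!\left[f^{2}\right]\;\le\;2\,\mathbb{E}_{\gamma}\!\left[\lVert\nabla f\rVert^{2}\right].
\]

In Herbst-type arguments of this kind, the inequality is applied with f² proportional to the exponentiated loss, so that the gradient term on the right-hand side introduces the squared loss-gradient norm that appears in the complexity bound.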

The first step of the proof uses the independence of the training samples to decompose the moment generating function. The first equality in this decomposition holds because the corresponding factor is a constant, independent of the expectation, and the last equality holds because the training samples are identically distributed.

The second step of the proof relies on the relation between the moment generating function and the cumulant generating function. The fundamental theorem of calculus expresses the cumulant generating function as the integral of its derivative, which we then compute explicitly; the second equality in this computation follows from l'Hôpital's rule.

The third and final step applies the log-Sobolev inequality for Gaussian distributions. Combining it with the previous step leads to an inequality in which the remaining terms cancel; this shows theoretically why the complexity term mostly depends on the gradient-norm. Rearranging terms concludes the proof. ∎

The above theorem can be extended to settings in which the data instance is sampled from any log-concave distribution, e.g., the Laplace distribution; cf. Gentil (2005). For readability we do not discuss this generalization here.

This decomposition hints at the infeasibility of computing the complexity term directly for large values of λ: since the loss function is non-negative, one factor grows exponentially with λ while the other diminishes to zero exponentially fast. These opposing quantities make a direct evaluation numerically infeasible. In contrast, our bound performs all computations in log-space, hence it can be computed for larger values of λ, up to the sub-gamma interval; see Table 1.

Notably, the general bound in Theorem 2 is more theoretical than practical: to estimate it in practice one needs to avoid the integration it involves. However, it is an important intermediate step for deriving a practical bound for linear models with a Lipschitz loss function and a general bound for any smooth loss function, as we discuss in the next two sections, respectively.

4.1 Linear models

In the following we consider smooth loss functions over linear models in the multi-class setting, where the loss is a function of the scores obtained by applying a linear map to the data instance, one score per possible label. We also assume that this function of the scores is Lipschitz. These assumptions cover the popular NLL loss used in logistic regression and the multi-class hinge loss used in support vector machines (SVMs).

Corollary 1.

Consider smooth loss functions over linear models with a Lipschitz constant. Under the conditions of Theorem 2, with a Gaussian prior distribution whose variance is small enough relative to λ and the Lipschitz constant, the complexity term admits an explicit bound and, in particular, is sub-gamma.

Proof.

This bound is derived by applying Theorem 2. We begin by computing the gradient of the loss with respect to the data instance: by the chain rule it equals the weight matrix applied to the gradient of the loss with respect to the scores, so the gradient norm is bounded by the Lipschitz constant times the norm of the weights. Plugging this result into Eq. (1) bounds its exponent; the ratio in the integral then equals one, and the remaining integral reduces to a Gaussian integral over the prior, which yields the stated bound whenever the variance condition of the corollary holds.

The above corollary provides a PAC-Bayesian bound for classification using the NLL loss and shows that the complexity term is sub-gamma on the corresponding interval of λ. Interestingly, the bound can achieve a fast rate, albeit only with a prior variance that shrinks accordingly. In our experimental evaluation we show that it is better to settle for a lower rate while using a prior with a fixed variance. The above bound also extends the result of Alquier et al. (2016) for the binary hinge-loss to the multi-class hinge loss (cf. Alquier et al. (2016), Section 6). Unfortunately, the bound cannot be applied to non-linear loss functions, since their gradient-norm is not bounded, as is evident from the exploding-gradients property of deep nets.
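The chain-rule bound above can be checked numerically; the following sketch does so for a random linear model with the softmax NLL, using √2 as a generic bound on the Lipschitz constant of the NLL with respect to the scores (the constant in the corollary may be defined differently):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d, num_classes = 20, 5
W = torch.randn(num_classes, d)          # linear model: scores s = W x
x = torch.randn(d, requires_grad=True)
y = torch.tensor([2])                    # arbitrary label

loss = F.cross_entropy((W @ x).unsqueeze(0), y)
(grad_x,) = torch.autograd.grad(loss, x)

# grad_x = W^T (softmax(Wx) - e_y), hence ||grad_x|| <= ||W||_2 * sqrt(2).
lipschitz_bound = (2.0 ** 0.5) * torch.linalg.matrix_norm(W, ord=2)
print(grad_x.norm().item(), "<=", lipschitz_bound.item())
assert grad_x.norm() <= lipschitz_bound + 1e-6
```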

Figure 4: Estimating the complexity term in Eq. (1) over MNIST for a linear model and MLPs of varying depth. We are able to compute the complexity term only for small values of λ, and for somewhat larger values in the case of the linear model. Standard bounds for MNIST require λ to be at least the square root of the training sample size. In all settings we use a variance of 0.1 for the prior distribution over the model parameters.

4.2 Non-linear models

In the following we derive a generalization bound for on-average bounded loss functions and on-average bounded gradient-norms.

Corollary 2.

Consider smooth loss functions that are on-average bounded. Under the conditions of Theorem 2, for any λ we obtain an upper bound on the complexity term in terms of the on-average bound and the expected gradient-norm of the loss.

Proof.

This bound is derived by applying Theorem 2 and bounding the quantity appearing there. We derive this bound in three steps:

  • First, we use the result of Theorem 2 to obtain an initial bound.

  • Second, we lower bound the relevant integrand for any admissible value, using the fact that the function involved is monotone within the unit interval.

  • Third, the on-average boundedness assumption and the monotonicity of the exponential function yield a lower bound, which is completed using the convexity of the exponential function.

Combining these bounds we derive the stated upper bound, and the result follows. ∎

The above derivation upper bounds the complexity term by the expected gradient-norm of the loss function, i.e., the flow of its gradients through the architecture of the model. It provides the means to empirically show that the complexity term is sub-gamma (see Section 5.3). In particular, we show empirically that the rate of the bound can be high, depending on the gradient-norm. This is a favorable property, since the rate determines how fast the bound converges. Therefore, one would like to avoid exploding gradient-norms, as they effectively harm the true-risk bound. While one may achieve a fast-rate bound by forcing the gradient-norm to vanish rapidly, practical experience shows that vanishing gradients prevent the deep net from fitting the training data when minimizing the empirical risk. In our experimental evaluation we demonstrate the influence of the expected gradient-norm on the bound of the true risk.

5 Experiments

Figure 8: Verifying the assumption in Corollary 2 that the loss function is on-average bounded. We test this assumption on MNIST (left) and Fashion-MNIST (middle) using MLPs, and on CIFAR-10 (right) using CNNs. The loss is on-average bounded, with the bound depending on the variance of the prior; for small prior variances the on-average bound remains small.
Figure 12: Estimating the expected gradient-norm as a function of the variance of the prior distribution. Results are reported for MNIST (left) and Fashion-MNIST (middle) using MLPs, and for CIFAR-10 (right) using CNNs. The linear model has the largest expected gradient-norm, since the Lipschitz condition considers the worst-case gradient-norm (see Corollary 1). The gradient-norm decreases with depth due to the vanishing-gradient property. As a result, deeper nets can have a faster convergence rate, i.e., larger values of λ can be used in the generalization bound.
Figure 17: The proposed bound as a function of λ for MLP models with two, three, four, and five layers, depicted from left to right. Notice that this suggests that the corresponding random variables are sub-gamma for all presented MLP models. We obtain the results for all models using the MNIST dataset; the sub-gamma fits differ only in their scaling factors.

In this section we perform an experimental evaluation of our PAC-Bayesian bounds, both for linear models and for non-linear models. We begin by verifying our assumptions: (i) the PAC-Bayesian bound in Theorem 1 cannot be computed for large values of λ; (ii) although the NLL loss is unbounded, it is on-average bounded. Next, we study the behavior of the complexity term for different architectures, both linear models and deep nets. We show that the relevant random variable is sub-gamma, namely that the complexity term satisfies the condition of Definition 1. Importantly, we show that the admissible range of λ, which relates to the rate of convergence of the bound, is determined by the architecture of the deep net. Lastly, we demonstrate the importance of balancing the three terms of the bound in the learning process: (i) the empirical risk; (ii) the KL divergence; and (iii) the complexity term.

Implementation details.

We use multilayer perceptrons (MLPs) for the MNIST and Fashion-MNIST datasets (Xiao et al., 2017), and convolutional neural networks (CNNs) for the CIFAR-10 dataset. In all models we use the ReLU activation function. We optimize the NLL loss function using SGD with a learning rate of 0.01 and a momentum of 0.9 in all settings for 50 epochs. We use mini-batches of size 128 and no learning-rate scheduling. For the ResNet experiments we optimize an 18-layer ResNet model on the CIFAR-10 dataset using the Adam optimizer with a learning rate of 0.001 for 150 epochs, halving the learning rate every 50 epochs, with a batch size of 128.
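For concreteness, a sketch of this MLP training setup follows (layer widths, data loading, and device handling are placeholders rather than the exact configuration used in the experiments):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    def __init__(self, depth=3, width=512, in_dim=28 * 28, num_classes=10):
        super().__init__()
        dims = [in_dim] + [width] * (depth - 1) + [num_classes]
        self.layers = nn.ModuleList(nn.Linear(a, b) for a, b in zip(dims[:-1], dims[1:]))

    def forward(self, x):
        x = x.flatten(1)
        for layer in self.layers[:-1]:
            x = F.relu(layer(x))          # ReLU activations, as stated above
        return self.layers[-1](x)

def train(model, loader, epochs=50):
    # SGD, lr 0.01, momentum 0.9, mini-batches of 128, no LR scheduling (as above).
    opt = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
    for _ in range(epochs):
        for x, y in loader:
            opt.zero_grad()
            F.cross_entropy(model(x), y).backward()   # NLL of the softmax outputs
            opt.step()
```

With depth=1 this reduces to the linear model used in the experiments.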

5.1 Verify Assumptions

We start by empirically demonstrating the numerical instability of computing the complexity term in Eq. (1). This instability occurs due to the exponentiation of the random variables, which quickly goes to infinity as λ grows. We estimated Eq. (1) over MNIST for MLPs of varying depth, where an MLP of depth one is a linear model. For a fair comparison we adjusted the layer widths so that each model has roughly the same number of parameters (except for the linear case). We evaluated these architectures for different values of λ, using a variance of 0.1 for the prior distribution. The results are depicted in Figure 4. One can see that we are able to compute the complexity term only for small values of λ, and for somewhat larger values in the case of the linear model. Standard bounds for MNIST require λ to be at least on the order of the square root of the training sample size (60,000 examples for MNIST). We observe that in a direct computation one factor goes to infinity while the other goes to zero, and they are not able to balance each other; in our derivation this is resolved by working in log-space.
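The instability can be reproduced with a toy computation (the values below are illustrative, not our measurements): exponentiating λ times quantities of order one already overflows double precision for λ in the hundreds, whereas the same average is stable when computed in log-space.

```python
import numpy as np
from scipy.special import logsumexp

rng = np.random.default_rng(0)
z = rng.normal(loc=1.0, scale=0.5, size=10_000)   # stand-in for per-sample risk gaps
lam = 1_000.0

naive = np.log(np.mean(np.exp(lam * z)))           # overflows: exp(~1000) -> inf
stable = logsumexp(lam * z) - np.log(z.size)       # same quantity, computed in log-space
print(naive, stable)
```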

Figure 20: Our bound on the complexity term as a function of the variance of the prior distribution over the model parameters, for two values of λ (top and bottom). One can see that the bound grows as the variance of the prior increases. This plot conforms with the practice of Bayesian deep nets, which set the variance of the prior to small values. Notice that for the five-layer model the bound attains a non-zero finite value only at an intermediate variance: below that value the bound is zero, and above it the bound explodes.

In Corollary 2 we assume that although the loss is unbounded, it is on-average bounded. We tested this assumption on MNIST and Fashion-MNIST using MLPs of varying depth, and on CIFAR-10 using CNNs of varying depth, where the depth of a CNN refers to the number of convolutional layers and we include a max-pooling layer after each convolutional layer. In all CNN models the additional output layer is a fully connected one. To span the possible weights we sampled them from a normal prior distribution with different variances. The results appear in Figure 8. We observe that the loss is on-average bounded, with the bound depending on the variance of the prior. Moreover, for small prior variances the on-average loss bound is small and its effect on the complexity term is minimal. Notice that although the expected loss increases for high variance levels, such variances are not used for initializing deep nets under common initialization techniques.

5.2 Complexity of Neural Nets

Prior Variance 0.0004 0.01 0.05 0.1 0.3 0.5 0.7

One

Test Loss 0.753 0.442 0.277 0.271 0.324 0.408 0.677
Train Loss 0.731 0.424 0.273 0.253 0.292 0.373 0.697
Bound on ψ (λ₁) 10.57 66.92 inf inf inf inf inf
Bound on ψ (λ₂) 0.0002 0.001 0.029 0.141 1.76 6.33 15.18
KL 20027 10776 2561 1478 3886 7447 8995

Two

Test Loss 1.743 0.549 0.104 0.066 0.127 0.236 0.602
Train Loss 1.752 0.569 0.095 0.038 0.056 0.155 0.425
Bound on ψ (λ₁) 0.0 0.014 0.540 inf inf inf inf
Bound on ψ (λ₂) 0.0 0.0 0.006 0.115 16.95 inf inf
KL 25848 30469 7369 7834 96976 166638 244255

Three

Test Loss 2.3 2.3 0.091 0.062 0.136 0.294 1.319
Train Loss 2.3 2.3 0.078 0.027 0.067 0.268 1.173
Bound on ψ (λ₁) 0.0 0.0 0.002 31.99 inf inf inf
Bound on ψ (λ₂) 0.0 0.001 0.001 0.041 62.73 inf inf
KL nan 10776 9480 8215 95984 175134 226237

Four

Test Loss 2.3 2.3 0.083 0.067 0.132 inf inf
Train Loss 2.3 2.3 0.064 0.022 0.081 inf inf
Bound on ψ (λ₁) 0.0 0.0 0.0 2.855 inf inf inf
Bound on ψ (λ₂) 0.0 0.0 0.0 0.012 inf inf inf
KL nan nan 10849 8239 113943 nan nan

Five

Test Loss 2.3 2.3 0.087 0.066 0.133 inf inf
Train Loss 2.3 2.3 0.055 0.019 0.101 inf inf
Bound on ψ (λ₁) 0.0 0.0 0.0 0.204 inf inf inf
Bound on ψ (λ₂) 0.0 0.0 0.0 0.004 inf inf inf
KL nan nan 11800 9090 140305 nan nan
Table 1: We optimize MLP models of different depths, where one corresponds to the linear model, two to a two-layer model, and so on. For the MNIST dataset we report the average test loss, average train loss, the bounds on the complexity term ψ for two values of λ (denoted λ₁ and λ₂), and the KL value.

Next we turn to estimating our bounds on the complexity term, both for linear models and for non-linear models, corresponding to Corollary 1 and Corollary 2. We use the same architectures as mentioned above. The bound on the complexity term is controlled by the expected gradient-norm. Figure 12 presents the expected gradient-norm as a function of different variance levels of the prior distribution over the model parameters. For the linear model we used the bound in Corollary 1. One can see that the linear model has the largest expected gradient-norm, since the Lipschitz condition considers the worst-case gradient-norm. One can also see that the deeper the network, the smaller its gradient-norm, which is attributed to the vanishing-gradient property. As a result, deeper nets can have a faster convergence rate, i.e., use larger values of λ in the generalization bound, since the vanishing gradients create a contractivity property that stabilizes the loss function, i.e., reduces its variability. However, this comes at the expense of the expressivity of the deep net, since with vanishing gradients the net cannot fit the training data in the learning phase. This is demonstrated in the next experiment.

Figure 20 presents the bound on the complexity term as a function of the variance of the prior distribution over the model parameters. One can see that the bound grows as the variance of the prior increases. Note also that the bound often explodes when the variance exceeds a certain threshold, in agreement with Corollary 1, whose bound becomes unbounded beyond a corresponding variance. This conforms with the practice of Bayesian deep nets, which set the variance of the prior to small values.

5.3 Sub-Gamma Approximation

In Section 4.1 we proved that the proposed bound is sub-gamma in the linear case. Unfortunately, this proof cannot be applied directly to the non-linear case. Hence, we empirically demonstrate that the proposed bound on the complexity term is indeed sub-gamma for various model architectures. For this we used the same model architectures as before on the MNIST dataset. Results are depicted in Figure 17. Notice that, similar to the ResNet model, the proposed bound is sub-gamma in all explored settings, up to different scaling factors.
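A sub-gamma fit of the kind shown in Figure 17 can be obtained by least-squares fitting the envelope of Definition 1; in the sketch below the (λ, ψ) pairs are hypothetical stand-ins for the measured bound values, not our data.

```python
import numpy as np
from scipy.optimize import curve_fit

def sub_gamma_envelope(lam, v, c):
    # psi(lam) <= lam**2 * v / (2 * (1 - c * lam)), valid for lam < 1/c (Definition 1).
    return lam**2 * v / (2.0 * (1.0 - c * lam))

lams = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])        # hypothetical lambda grid
psis = np.array([0.02, 0.09, 0.40, 1.80, 8.50, 40.0])    # hypothetical bound estimates

(v, c), _ = curve_fit(sub_gamma_envelope, lams, psis, p0=[0.05, 0.005],
                      bounds=([0.0, 0.0], [np.inf, 0.9 / lams.max()]))
print(f"variance factor v = {v:.4f}, scale c = {c:.5f}")
```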

5.4 Optimization

Lastly, in order to better understand the balance between the components that compose the proposed generalization bound, we optimize all five MLP models presented above on the MNIST dataset and compute the average training loss, average test loss, KL divergence, and the bound on the complexity term for two values of λ. We repeat this optimization process for various variance levels of the prior distribution over the model parameters. Results for the MNIST dataset are summarized in Table 1; more experimental results can be found in Section A.1 in the Appendix.

The results suggest that variance levels in [0.05, 0.1] produce the best overall performance across all depth levels. This finding is consistent with our previous results, which suggest that below this range the bound goes to zero, yielding good generalization at the expense of model performance. However, larger variance levels may cause the bound to explode and, as a result, make the optimization problem harder.

Figure 24: Results for ResNet variations on the CIFAR-10 dataset. In the left subplot we report the average training loss and average test loss (dashed lines); in the middle subplot we present the KL values; and in the right subplot we report the bound on the complexity term. All results are reported for different variance levels of the prior distribution over the model parameters.

Lastly, we analyzed the commonly used ResNet model (He et al., 2016). For this, we trained four different versions of the ResNet18 model: (i) the standard model (ResNet); (ii) a model with no skip connections (ResNetNoSkip); (iii) a model with no batch normalization (ResNetNoBN); and (iv) a model without both skip connections and batch normalization layers (ResNetNoSkipNoBN). We optimized all models on the CIFAR-10 dataset. Figure 24 visualizes the results. Consistent with previous findings, a variance level of 0.1 gives the best overall performance, both in terms of test loss and generalization.

Notice that ResNet and ResNetNoSkip achieve comparable performance on all measures. Additionally, when considering a variance level of 0.1 for the prior distribution, removing the batch normalization layers while keeping the skip connections also yields performance comparable to ResNet and ResNetNoSkip. Similar to Zhang et al. (2019), these findings suggest that even without batch normalization layers, models can converge given a suitable initialization. On the other hand, when removing both batch normalization and skip connections, models either explode immediately or suffer greatly from vanishing gradients. These results are consistent with previous findings showing that batch normalization greatly improves optimization (Santurkar et al., 2018).

6 Discussion and Future Work

We present a new PAC-Bayesian generalization bound for Bayesian deep nets with unbounded loss functions and unbounded gradient-norms. The proof relies on bounding the log-partition function using the expected squared norm of the gradients with respect to the input. We prove that the proposed bound is sub-gamma for any linear model with a Lipschitz loss function, and we verify this empirically for the non-linear case. Experimental validation shows that the resulting bound provides insights for better model optimization, prior-distribution search, and model initialization.

References

  • P. Alquier, J. Ridgway, and N. Chopin (2016) On the properties of variational approximations of gibbs posteriors. Journal of Machine Learning Research 17 (239), pp. 1–41. Cited by: §2, §3, §3, §4.1, Theorem 1.
  • P. Alquier, O. Wintenberger, et al. (2012) Model selection for weakly dependent time series forecasting. Bernoulli 18 (3), pp. 883–913. Cited by: §2.
  • A. Banerjee (2006) On bayesian bounds. In Proceedings of the 23rd international conference on Machine learning, pp. 81–88. Cited by: §2.
  • P. Bartlett, P. Grunwald, P. Harremoës, F. Hedayati, and W. Kotlowski (2013) Horizon-independent optimal prediction with log-loss in exponential families. arXiv preprint arXiv:1305.4324. Cited by: §2.
  • P. L. Bartlett, D. J. Foster, and M. J. Telgarsky (2017a) Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241–6250. Cited by: §2.
  • P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian (2017b) Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. arXiv preprint arXiv:1703.02930. Cited by: §2.
  • P. L. Bartlett and S. Mendelson (2002) Rademacher and gaussian complexities: risk bounds and structural results. Journal of Machine Learning Research 3 (Nov), pp. 463–482. Cited by: §2.
  • P. L. Bartlett, N. Harvey, C. Liaw, and A. Mehrabian (2019) Nearly-tight vc-dimension and pseudodimension bounds for piecewise linear neural networks. Journal of Machine Learning Research 20 (63), pp. 1–17. External Links: Link Cited by: §2.
  • C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra (2015) Weight uncertainty in neural networks. arXiv preprint arXiv:1505.05424. Cited by: §2.
  • S. Boucheron, G. Lugosi, and P. Massart (2013) Concentration inequalities: a nonasymptotic theory of independence. Oxford university press. Cited by: §3, §3.
  • O. Bousquet and A. Elisseeff (2002) Stability and generalization. The Journal of Machine Learning Research 2, pp. 499–526. Cited by: §2.
  • G. K. Dziugaite and D. M. Roy (2017) Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008. Cited by: §2.
  • Y. Gal and Z. Ghahramani (2015) Dropout as a bayesian approximation: representing model uncertainty in deep learning. arXiv preprint arXiv:1506.02142. Cited by: §2.
  • Y. Gal (2016) Uncertainty in deep learning. Ph.D. Thesis, PhD thesis, University of Cambridge. Cited by: §2.
  • I. Gentil (2005) Logarithmic sobolev inequality for log-concave measure from prekopa-leindler inequality. arXiv preprint math/0503476. Cited by: §4.
  • P. Germain, F. Bach, A. Lacoste, and S. Lacoste-Julien (2016) Pac-bayesian theory meets bayesian inference. In Advances in Neural Information Processing Systems, pp. 1884–1892. Cited by: §2, §2, §3, §3.
  • N. Golowich, A. Rakhlin, and O. Shamir (2017) Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541. Cited by: §2.
  • P. D. Grünwald and N. A. Mehta (2017) A tight excess risk bound via a unified pac-bayesian-rademacher-shtarkov-mdl complexity. arXiv preprint arXiv:1710.07732. Cited by: §2.
  • M. Hardt, B. Recht, and Y. Singer (2015) Train faster, generalize better: stability of stochastic gradient descent. arXiv preprint arXiv:1509.01240. Cited by: §2.
  • K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778. Cited by: §5.4.
  • S. M. Kakade, K. Sridharan, and A. Tewari (2009) On the complexity of linear prediction: risk bounds, margin bounds, and regularization. In Advances in neural information processing systems, pp. 793–800. Cited by: §2.
  • D. P. Kingma, T. Salimans, and M. Welling (2015) Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems, pp. 2575–2583. Cited by: §2.
  • I. Kuzborskij and C. H. Lampert (2017) Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678. Cited by: §2.
  • M. Ledoux (1999) Concentration of measure and logarithmic sobolev inequalities. In Seminaire de probabilites XXXIII, pp. 120–216. Cited by: §1, §4.
  • B. London (2017) A pac-bayesian analysis of randomized learning with application to stochastic gradient descent. In Advances in Neural Information Processing Systems, pp. 2931–2940. Cited by: §2.
  • C. Louizos and M. Welling (2016) Structured and efficient variational deep learning with matrix gaussian posteriors. In International Conference on Machine Learning, pp. 1708–1716. Cited by: §2.
  • D. McAllester (2013) A pac-bayesian tutorial with a dropout bound. arXiv preprint arXiv:1307.2118. Cited by: §2.
  • B. Neyshabur, S. Bhojanapalli, D. McAllester, and N. Srebro (2017) A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564. Cited by: §2.
  • B. Neyshabur, Z. Li, S. Bhojanapalli, Y. LeCun, and N. Srebro (2018) Towards understanding the role of over-parametrization in generalization of neural networks. arXiv preprint arXiv:1805.12076. Cited by: §2.
  • B. Neyshabur, R. Tomioka, and N. Srebro (2015) Norm-based capacity control in neural networks. In Conference on Learning Theory, pp. 1376–1401. Cited by: §2.
  • A. Rakhlin, S. Mukherjee, and T. Poggio (2005) Stability results in learning theory. Analysis and Applications 3 (04), pp. 397–417. Cited by: §2.
  • S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization?. In Advances in Neural Information Processing Systems, pp. 2483–2493. Cited by: §5.4.
  • S. Shalev-Shwartz, O. Shamir, N. Srebro, and K. Sridharan (2009) Stochastic convex optimization.. In COLT, Cited by: §2.
  • N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov (2014) Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research 15 (1), pp. 1929–1958. Cited by: §2.
  • E. Takimoto and M. K. Warmuth (2000) The last-step minimax algorithm. In International Conference on Algorithmic Learning Theory, pp. 279–290. Cited by: §2.
  • C. Wei and T. Ma (2019) Data-dependent sample complexity of deep neural networks via lipschitz augmentation. In Advances in Neural Information Processing Systems, pp. 9722–9733. Cited by: §2.
  • H. Xiao, K. Rasul, and R. Vollgraf (2017) Fashion-MNIST: a novel image dataset for benchmarking machine learning algorithms. arXiv preprint arXiv:1708.07747. Cited by: §5.
  • J. Yang, S. Sun, and D. M. Roy (2019) Fast-rate pac-bayes generalization bounds via shifted rademacher processes. In Advances in Neural Information Processing Systems, pp. 10802–10812. Cited by: §2.
  • C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals (2016) Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530. Cited by: §1, §2, §2.
  • H. Zhang, Y. N. Dauphin, and T. Ma (2019) Fixup initialization: residual learning without normalization. arXiv preprint arXiv:1901.09321. Cited by: §5.4.

Appendix A Appendix

A.1 Extended results

We provide additional results for the optimization experiments. We optimized MLP models of different depths using the Fashion-MNIST dataset.

Prior Variance 0.0004 0.01 0.05 0.1 0.3 0.5 0.7

One

Test Loss 0.79 0.57 0.45 0.46 1.04 1.59 10.43
Train Loss 0.77 0.55 0.41 0.39 0.86 1.31 9.97
MGF bound (λ₁) 10.57 67.31 inf inf inf inf inf
MGF bound (λ₂) 0.0001 0.0011 0.0296 0.143 1.812 6.386 14.74
KL 14255 7829 2430 1648 5379 8440 35299

Two

Test Loss 1.46 0.71 0.35 0.31 0.42 0.6 10.43
Train Loss 1.46 0.69 0.29 0.2 0.3 0.49 9.97
MGF bound (λ₁) 0.014 0.53 inf inf inf inf inf
MGF bound (λ₂) 0.0 0.0 0.005 0.12 16.86 inf inf
KL 28222 22053 6983 9180 98648 168266 209935

Three

Test Loss 2.3 0.92 0.34 0.31 0.42 0.83 2.39
Train Loss 2.3 0.91 0.28 0.19 0.35 0.79 1.76
MGF bound (λ₁) 0.0 0.0 0.002 32.16 inf inf inf
MGF bound (λ₂) 0.0 0.0 0.0 0.04 64.48 inf inf
KL 0 29826 9219 1648 547636 198760 374718

Four

Test Loss 2.3 2.3 0.34 0.31 0.44 inf inf
Train Loss 2.3 2.3 0.26 0.18 0.38 inf inf
MGF bound (λ₁) 0.0 0.0 0.0 2.81 inf inf inf
MGF bound (λ₂) 0.0 0.0 0.0 0.01 inf inf inf
KL 0 0 10859 10165 388842 nan nan

Five

Test Loss 2.3 2.3 0.34 0.31 1.3 inf inf
Train Loss 2.3 2.3 0.25 0.18 1.28 inf inf
MGF bound (λ₁) 0.0 0.0 0.0 0.19 inf inf inf
MGF bound (λ₂) 0.0 0.0 0.0 0.003 inf inf inf
KL 0 0 11918 10986 353632 nan nan
Table 2: We optimize models of different depths on the Fashion-MNIST dataset, where one corresponds to the linear model, two to a two-layer model, and so on. We report the average test loss, average train loss, the MGF bounds on the complexity term for two values of λ (λ₁ and λ₂), and the KL value.