Theoretical understanding of the high generalization ability of deep learning models is a crucial research objective for the principled improvement of their performance. Such insights and formulations are also useful for comparing and interpreting models. Some prior work has explained generalization by the fact that trained networks can be compressed well (Arora et al., 2018; Blier & Ollivier, 2018), while other work has tried to explain it by the scale of the network and prediction margins (Bartlett et al., 2017; Neyshabur et al., 2018). From the minimum description length principle (Rissanen, 1986), models representable with a smaller number of bits are expected to generalize better. Bits-back arguments (Hinton & van Camp, 1993; Honkela & Valpola, 2004) have shown that when models are stable against noise on parameters, we can describe them with fewer bits. These arguments have motivated research on “flat minima.” Empirical work has supported the usefulness of measuring the flatness of local minima (Keskar et al., 2017; Yao et al., 2018). Other work has proposed training methods that search for flatter minima (Hochreiter & Schmidhuber, 1997; Chaudhari et al., 2017; Hoffer et al., 2017). As measures of flatness, prior work has proposed the volume of the region in which a network keeps roughly the same loss (Hochreiter & Schmidhuber, 1997), the maximum loss around the minima (Keskar et al., 2017), and the spectral norm of the Hessian (Yao et al., 2018).
Despite the empirical connections of “flatness” to generalization, its existing definitions suffer from scale dependence issues. Dinh et al. (2017) showed that we can arbitrarily change the flatness of the loss landscape for some networks without changing the functions represented by the networks. Such scale dependence appears in networks with ReLU activation functions or normalization layers such as batch normalization (Ioffe & Szegedy, 2015) and weight normalization (Salimans & Kingma, 2016). Since generalization does not depend on rescalings of parameters, the scale dependence issues suggest that the prior definitions of “flatness” might not be good measures of the generalization of neural networks. The literature has shown that solely looking at flatness is not sufficient and that both the sharpness and the scale of parameters should be small for generalization (Dziugaite & Roy, 2017; Neyshabur et al., 2017). However, how to unify the two quantities has been left as an open problem.
What causes the problems in the previous definitions of “flatness”? Prior definitions implicitly use Gaussian priors with the same variance for all parameters (Sec. 3). However, as Dinh et al. (2017) pointed out, the assumption that all parameters have the same scale is not good prior knowledge for neural networks. In this paper, using a PAC-Bayesian framework (McAllester, 1999, 2003), we explicitly take the scale properties of neural networks into consideration. We first incorporate the knowledge that each weight matrix can have a different scale (Sec. 4). Next, we extend the analysis to row-wise and column-wise scaling of parameters (Sec. 5). To the best of our knowledge, our analysis provides the first rescaling-invariant definition of the flatness of local minima. Figure 1 shows that our definition of flat minima can distinguish models trained on random labels even when a previous definition fails.
2 Related work
Dziugaite & Roy (2017) introduced the PAC-Bayesian framework to study the generalization of deep learning models. They also connected PAC-Bayesian arguments to flat minima. They pointed out that sharpness alone is not sufficient and that we need to pay attention to the scale of parameters. However, they could not remove the scale dependence. Neyshabur et al. (2017) extended Dziugaite & Roy (2017) and suggested that sharpness should be scaled by the scale of parameters. However, they left the way to combine the sharpness and the scale of parameters as an open problem.
Neyshabur et al. (2018) analyzed the generalization of deep learning models using the PAC-Bayesian framework. They focused on bounding the worst-case propagation of perturbations. Given the existence of adversarial examples (Szegedy et al., 2014), this approach inevitably provides a loose bound. Alternatively, we rely on flat-minima arguments, which better capture the effect of parameter perturbations. We point out an additional insufficiency of their bound in Sec. 5.1. Our redefined notion of flat minima is free from this issue, suggesting it better captures generalization.
Wang et al. (2018) examined a better choice of the posterior variance and showed that Hessian-based analysis relates to the scale of parameters. However, their argument could not overcome the scale dependence. Moreover, their analysis is based on a parameter-wise argument, which involves a factor that scales with the number of parameters, making their overall bound essentially equivalent to naive parameter counting. In contrast, our analysis completely removes the scale dependence. Additionally, our analysis does not have a constant that scales with the number of parameters.
Li et al. (2018) demonstrated that normalizing the loss landscape by the scale of filters in convolutional neural networks provides better visualization of the loss landscape. While their work empirically showed the effectiveness of normalizing the flatness by the scale of parameters, they did not provide theoretical justification for why the normalization is essential. We provide some theoretical justification for focusing on the normalized loss landscape through the lens of the PAC-Bayesian framework.
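Table 1: Notation.
| $P$ | prior distribution of hypothesis (parameters) |
| $Q$ | posterior distribution of hypothesis (parameters) |
| $\mathcal{D}$ | underlying (true) data distribution |
| $S$ | training set, i.i.d. sample from $\mathcal{D}$ |
| $z_i$ | $i$-th sample in $S$, $z_i = (x_i, y_i)$ |
| $m$ | number of data points in a training set |
| $K$ | number of classes |
| $d$ | number of layers (depth) in NN |
| $h$ | number of hidden units (width) in NN |
| $W^{(l)}$ | $l$-th weight matrix |
| $\theta$ | parameter of network |
| $f$ | hypothesis, typically depends on $\theta$ ($f_\theta$) |
| $\mathcal{L}_{\mathcal{D}}(f)$ | expected (0-1) loss concerning distribution $\mathcal{D}$ |
| $\ell(f, z)$ | loss on a data point $z$ of a hypothesis $f$ |
| $\mathrm{KL}[Q \,\|\, P]$ | KL divergence |
| $\nabla_\theta$ | derivative concerning parameter $\theta$ |
| $\nabla^2_\theta$ | Hessian concerning parameter $\theta$ |
| $\|W\|_F$ | Frobenius norm of a matrix $W$ |
| $f(x)$ | output of a hypothesis $f$ at a data point $x$ |
| $v[i]$ | $i$-th element of a vector $v$ |
| $W[i, j]$ | the $(i, j)$-th element of a matrix $W$ |
| $y_x$ | label of data point $x$ |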
3 Flat minima from PAC-Bayesian perspective
In this section, we introduce a fundamental PAC-Bayesian generalization error bound and its connection to flat minima provided by prior work. Table 1 summarizes notations used in this paper.
3.1 PAC-Bayesian generalization error bound
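For any distribution $\mathcal{D}$, any set $\mathcal{H}$ of classifiers, any distribution $P$ of support $\mathcal{H}$, any $\delta \in (0, 1]$, and any nonnegative real number $\lambda$, we have, with probability at least $1 - \delta$,

$$\mathbb{E}_{f \sim Q}\, \mathcal{L}_{\mathcal{D}}(f) \le \mathbb{E}_{f \sim Q}\, \mathcal{L}_{S}(f) + \frac{1}{\lambda}\left( \mathrm{KL}[Q \,\|\, P] + \ln\frac{1}{\delta} + \Psi(\lambda, m) \right) \tag{1}$$

for all posterior distributions $Q$ over $\mathcal{H}$, where $\mathcal{L}_{S}$ denotes the loss on the training set $S$ and $\Psi(\lambda, m) = \ln \mathbb{E}_{f \sim P}\, \mathbb{E}_{S' \sim \mathcal{D}^m} \exp\left( \lambda \left( \mathcal{L}_{\mathcal{D}}(f) - \mathcal{L}_{S'}(f) \right) \right)$.
We can provide calculable bounds on $\Psi(\lambda, m)$ depending on which types of loss functions we use (Germain et al., 2016). In particular, when we use the 0-1 loss, $\Psi(\lambda, m)$ can be bounded by a term of order $\lambda^2 / m$. Note that this bound does not depend on the choice of priors. In this paper, we mainly treat the 0-1 loss.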
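We reorganize the PAC-Bayesian bound (1) for later use as follows.

$$\mathcal{L}_{\mathcal{D}}(f_\theta) \le \mathcal{L}_{S}(f_\theta) + \underbrace{\mathcal{L}_{\mathcal{D}}(f_\theta) - \mathbb{E}_{f \sim Q}\, \mathcal{L}_{\mathcal{D}}(f)}_{\text{(A)}} + \frac{1}{\lambda} \underbrace{\mathrm{KL}[Q \,\|\, P]}_{\text{(B)}} + \underbrace{\mathbb{E}_{f \sim Q}\, \mathcal{L}_{S}(f) - \mathcal{L}_{S}(f_\theta)}_{\text{(C)}} + \frac{1}{\lambda}\left( \ln\frac{1}{\delta} + \Psi(\lambda, m) \right) \tag{3}$$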
Similar decompositions can be found in prior work (Dziugaite & Roy, 2017; Neyshabur et al., 2017, 2018).
We use a different PAC-Bayes bound (1) for later analysis, but they are essentially the same. (Footnote 1: To apply our analysis, we can also use some other PAC-Bayesian bounds such as Theorem 1.2.6 in Catoni (2007), which is known to be relatively tight in some cases and successfully provided empirically nontrivial bounds for ImageNet scale networks in Zhou et al. (2019).) The original PAC-Bayesian bound (1) is for a stochastic classifier $Q$, but the reorganized one (3) is a bound for a deterministic classifier $f_\theta$.
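As a rough numerical illustration of how a bound of the form (1) trades off its terms, the sketch below evaluates the right-hand side for a loss bounded in [0, 1]. It assumes the Hoeffding-lemma bound $\Psi(\lambda, m) \le \lambda^2 / (8m)$, which may differ by a constant from the bound used in the text; the function name `pac_bayes_rhs` and all input values are hypothetical.

```python
import math

def pac_bayes_rhs(emp_loss, kl, m, delta, lam):
    """Right-hand side of a PAC-Bayes bound of the form (1).

    Assumption: the loss lies in [0, 1], so Hoeffding's lemma gives
    Psi(lam, m) <= lam**2 / (8 * m).
    """
    psi_bound = lam ** 2 / (8.0 * m)
    return emp_loss + (kl + math.log(1.0 / delta) + psi_bound) / lam

# Hypothetical values: 50,000 training points and a KL term of 2,000 nats.
m, kl, delta, emp_loss = 50_000, 2_000.0, 0.05, 0.02
complexity = kl + math.log(1.0 / delta)

# The lambda that balances the two lambda-dependent terms for these values.
# (In the theorem lambda is fixed in advance; this only shows the trade-off.)
lam_star = math.sqrt(8.0 * m * complexity)
bound_at_star = pac_bayes_rhs(emp_loss, kl, m, delta, lam_star)

for lam in (0.1 * lam_star, lam_star, 10.0 * lam_star):
    value = pac_bayes_rhs(emp_loss, kl, m, delta, lam)
    print(f"lambda={lam:12.1f}  bound={value:.4f}")
```

Under this assumption the balanced value equals the empirical loss plus $\sqrt{(\mathrm{KL} + \ln(1/\delta)) / (2m)}$, which makes explicit why a smaller KL term tightens the guarantee.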
3.2 PAC-Bayesian view of flat minima
Flat minima, which are the noise stability of the training loss with respect to parameters, naturally correspond to (C) in Eq. (3). When (C) is sufficiently small, we can expect that (A) in (3) is also small. Similarly to existing work (Langford & Caruana, 2002; Hochreiter & Schmidhuber, 1997; Dziugaite & Roy, 2017; Arora et al., 2018), we focus on analyzing terms (B) and (C).
3.3 Effect of noises under second-order approximation
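To connect PAC-Bayes analysis with the Hessian of the loss landscape as in prior work (Keskar et al., 2017; Dinh et al., 2017; Yao et al., 2018), we consider the second-order approximation of some surrogate loss functions. We use the unit-variance Gaussian as the posterior of parameters. Then the term (C) in the PAC-Bayesian bound (3) can be calculated as

$$\mathbb{E}_{\epsilon \sim \mathcal{N}(0, I)}\left[ \mathcal{L}_{S}(f_{\theta + \epsilon}) \right] - \mathcal{L}_{S}(f_\theta) \approx \frac{1}{2} \mathrm{Tr}\left( \nabla^2_\theta \mathcal{L}_{S}(f_\theta) \right). \tag{4}$$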
Thus, we can approximate the term (C) by the trace of the Hessian. There are two issues in using the trace of the Hessian as a sharpness metric. First, we used unit-variance Gaussians for all parameters, which might not necessarily be the best choice. Second, we ignored the effect of the KL-divergence term (B). While prior work already pointed out these issues (Dziugaite & Roy, 2017; Neyshabur et al., 2017), there have not been methods to analyze the two terms jointly. The two flaws are the keys of our analysis described in the next section.
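The link between unit-variance parameter noise and the trace of the Hessian can be checked numerically on a toy problem. The sketch below uses a hypothetical quadratic loss with a diagonal Hessian, for which the second-order approximation is exact, and compares a Monte Carlo estimate of the expected loss increase with half the trace. The curvature values and sample count are arbitrary choices.

```python
import random

def quadratic_loss(theta, hessian_diag):
    """Loss 0.5 * sum_i h_i * theta_i^2; its Hessian is diag(h), its minimum is at 0."""
    return 0.5 * sum(h * t * t for h, t in zip(hessian_diag, theta))

def expected_loss_increase(theta, hessian_diag, n_samples, rng):
    """Monte Carlo estimate of E[L(theta + eps)] - L(theta) with eps ~ N(0, I)."""
    base = quadratic_loss(theta, hessian_diag)
    total = 0.0
    for _ in range(n_samples):
        noisy = [t + rng.gauss(0.0, 1.0) for t in theta]
        total += quadratic_loss(noisy, hessian_diag) - base
    return total / n_samples

rng = random.Random(0)
hessian_diag = [2.0, 0.5]   # toy curvature values (assumed)
theta_min = [0.0, 0.0]      # the minimum of the toy loss

estimate = expected_loss_increase(theta_min, hessian_diag, 20_000, rng)
half_trace = 0.5 * sum(hessian_diag)
print(f"Monte Carlo: {estimate:.3f}, half trace: {half_trace:.3f}")
```

For a real network the agreement is only approximate, since higher-order terms of the loss are ignored.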
4 Warm-up: Matrix-normalized flat minima
In this section, we modify flat minima to make them invariant to the transformations proposed in Dinh et al. (2017). An example of networks we consider is the following network with one hidden layer.

$$f_\theta(x) = W^{(2)}\, \mathrm{ReLU}\left( W^{(1)} x \right) \tag{5}$$

Weight matrices $W^{(1)}$ and $W^{(2)}$ are subsets of the parameters $\theta$. Prior work has used some unit-variance Gaussian for the prior and the posterior as discussed in Sec. 3.3. This corresponds to the assumption that all parameters have the same scale. However, we already know that the scale can vary from weight matrix to weight matrix. In other words, the current choice of priors does not capture our prior knowledge well. To cope with this problem, we explicitly give the parameters' priors uncertainty in their scale. This section implies that we need to multiply the scale of the loss landscape by the scale of parameters. (Footnote 2: Our loss landscape rescaling is slightly different from simply multiplying the Hessian by the scale of parameters (13). Its benefits are discussed in Sec. 7.3.)
4.1 Controlling prior variance
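In this subsection, we first revisit the technique to control the variance parameters of the Gaussian priors after training (Hinton & van Camp, 1993). Next, we consider its effect on the KL divergence term (B) in (3). Following standard practice, we use a Gaussian with zero mean and diagonal covariance matrix ($\mathcal{N}(0, \Sigma_P)$) as the prior, where $\Sigma_P$ assigns the variance $\sigma_{P,l}^2$ to every parameter of the $l$-th weight matrix. We also use a Gaussian with mean $\theta$ and covariance $\Sigma_Q$, $\mathcal{N}(\theta, \Sigma_Q)$, as the posterior. The mean $\theta$ is the parameters of the network. The KL divergence term is then

$$\mathrm{KL}[Q \,\|\, P] = \sum_l \left( \frac{\|W^{(l)}\|_F^2}{2 \sigma_{P,l}^2} + \frac{n_l}{2}\left( \frac{\sigma_{Q,l}^2}{\sigma_{P,l}^2} - 1 - \ln\frac{\sigma_{Q,l}^2}{\sigma_{P,l}^2} \right) \right) \tag{6}$$

where $W^{(l)}$ is the $l$-th weight matrix, $n_l$ is the number of its elements, and $\sigma_{P,l}^2$ and $\sigma_{Q,l}^2$ are the prior variance and the posterior variance of the $l$-th weight matrix, respectively. When we fix the prior variances, the KL term is proportional to the squared Frobenius norm of parameters. However, since we introduced the special prior, we can arbitrarily change the prior variance after training and control the KL term. To minimize the KL divergence term, the prior variance $\sigma_{P,l}^2$ is set to the same value as the posterior variance $\sigma_{Q,l}^2$. Below, we use $\sigma_l^2$ to denote both $\sigma_{P,l}^2$ and $\sigma_{Q,l}^2$. Now, thanks to our hyperprior, we can write the KL divergence term as

$$\mathrm{KL}[Q \,\|\, P] = \sum_l \frac{\|W^{(l)}\|_F^2}{2 \sigma_l^2} + \mathrm{const.} \tag{7}$$

where $\sigma_l$ are parameters we can tune after training. The constant only depends on the number of weight matrices, which is much smaller than the total number of parameters. The KL divergence term (7) has additional flexibility to deal with the scale of weight matrices because we can scale $\sigma_l$ after training.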
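To make the role of the tunable variance concrete, the sketch below computes the KL divergence between two isotropic Gaussians for a single flattened weight matrix. It assumes a posterior $\mathcal{N}(w, \sigma_Q^2 I)$ and a prior $\mathcal{N}(0, \sigma_P^2 I)$ and ignores the constant contributed by the hyperprior; `kl_isotropic` and the toy weights are hypothetical.

```python
import math

def kl_isotropic(weights, var_post, var_prior):
    """KL[N(w, var_post * I) || N(0, var_prior * I)] for one flattened weight matrix."""
    n = len(weights)
    sq_norm = sum(w * w for w in weights)
    ratio = var_post / var_prior
    return 0.5 * (n * (ratio - 1.0 - math.log(ratio)) + sq_norm / var_prior)

weights = [0.5, -1.0, 2.0, 0.25]   # toy flattened weight matrix (assumed)
sq_norm = sum(w * w for w in weights)
var = 0.1

# Prior variance matched to the posterior variance: only the norm term survives.
kl_matched = kl_isotropic(weights, var_post=var, var_prior=var)

# Rescaling the weights by alpha and the shared variance by alpha^2 leaves
# the KL term unchanged, which is the flexibility a tunable variance provides.
alpha = 5.0
scaled = [alpha * w for w in weights]
kl_scaled = kl_isotropic(scaled, var_post=alpha**2 * var, var_prior=alpha**2 * var)

print(kl_matched, sq_norm / (2.0 * var), kl_scaled)
```

With a fixed variance, the same rescaling would multiply the norm term by $\alpha^2$.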
4.2 Defining matrix-normalized flat minima
In this subsection, we show how to tune the variances introduced in Sec. 4.1. Deciding the value of the variances induces our definition of scale-invariant flat minima. To minimize the PAC-Bayesian bound (1), we choose the variance to minimize the following quantity.

$$\mathbb{E}_{f \sim Q}\, \mathcal{L}_{S}(f) - \mathcal{L}_{S}(f_\theta) + \frac{1}{\lambda} \mathrm{KL}[Q \,\|\, P] \tag{8}$$

First, we model the loss function by second-order approximation. (Footnote 6: The choice of the surrogate loss function to calculate the Hessian is discussed in Sec. 5.4.) For the sake of notational simplicity, we introduce the following quantity for each weight matrix $W^{(l)}$.

$$H^{(l)} = \sum_{i,j} \frac{\partial^2 \mathcal{L}_{S}(f_\theta)}{\partial \left( W^{(l)}[i, j] \right)^2} \tag{9}$$

where $W^{(l)}[i, j]$ are parameters in the weight matrix. The quantity $H^{(l)}$ is the sum of the diagonal elements of the Hessian of the training loss function for the weight matrix $W^{(l)}$. Now, up to a constant, the quantity (8) can be approximated by

$$\sum_l \left( \frac{\sigma_l^2}{2} H^{(l)} + \frac{\|W^{(l)}\|_F^2}{2 \lambda \sigma_l^2} \right) \tag{10}$$

where $\sigma_l^2$ is the variance associated with the weight matrix $W^{(l)}$. With the assumption that the Hessian is positive semidefinite, the quantity (10) is minimized when we set

$$\sigma_l^2 = \frac{\|W^{(l)}\|_F}{\sqrt{\lambda H^{(l)}}}.$$

By inserting this into the quantity (10), we get

$$\sum_l \|W^{(l)}\|_F \sqrt{\frac{H^{(l)}}{\lambda}}. \tag{13}$$

We refer to this quantity as matrix-normalized sharpness. Intuitively, we scale the sharpness by the scale of each weight matrix. Matrix-normalized sharpness (13) is invariant to the rescaling of parameters proposed in Dinh et al. (2017). Thus, we can overcome one of the open problems by considering the effect of both terms (B) and (C) in the PAC-Bayesian bound (3). Its connection to minimum description length arguments, which are the basis of flat minima arguments, is discussed in Sec. 7.1.
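The invariance claimed for matrix-normalized sharpness can be illustrated with a small numerical sketch. It assumes that, per weight matrix, the second-order surrogate has the form $\sigma^2 H / 2 + \|W\|_F^2 / (2 \lambda \sigma^2)$, whose minimum over $\sigma^2$ is $\|W\|_F \sqrt{H / \lambda}$, and that a layer-wise rescaling $W^{(1)} \to a W^{(1)}$, $W^{(2)} \to W^{(2)} / a$ of a positively homogeneous network scales the Hessian diagonal sums by $1/a^2$ and $a^2$. All numbers and function names are hypothetical.

```python
import math

def objective(var, sq_norm, hess_trace, lam):
    """Per-matrix surrogate: var * H / 2 + ||W||_F^2 / (2 * lam * var)."""
    return 0.5 * var * hess_trace + sq_norm / (2.0 * lam * var)

def matrix_normalized_sharpness(sq_norms, hess_traces, lam=1.0):
    """Sum over weight matrices of ||W||_F * sqrt(H / lam)."""
    return sum(math.sqrt(n * h / lam) for n, h in zip(sq_norms, hess_traces))

# Toy two-layer network: squared Frobenius norms and Hessian diagonal sums.
sq_norms = [4.0, 9.0]
hess_traces = [3.0, 0.5]
lam = 1.0

# The closed-form variance minimizes the per-matrix surrogate.
var_star = math.sqrt(sq_norms[0] / (lam * hess_traces[0]))
best = objective(var_star, sq_norms[0], hess_traces[0], lam)

# Layer-wise rescaling with factor a: norms and curvatures move in opposite directions.
a = 10.0
rescaled_norms = [sq_norms[0] * a**2, sq_norms[1] / a**2]
rescaled_traces = [hess_traces[0] / a**2, hess_traces[1] * a**2]

before = matrix_normalized_sharpness(sq_norms, hess_traces, lam)
after = matrix_normalized_sharpness(rescaled_norms, rescaled_traces, lam)
print("normalized sharpness:", before, after)
print("plain Hessian trace: ", sum(hess_traces), sum(rescaled_traces))
```

The plain trace changes by more than an order of magnitude under the rescaling, while the normalized quantity stays the same.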
5 Normalized flat minima
In this section, we point out a scale dependence of matrix-wise capacity control which is similar to the prior flat minima definitions. To remove the scale dependence, we extend matrix-normalized flat minima and define normalized flat minima. The new definition provides improved invariance while enjoying a reduced effective number of parameters in the constant term for a better generalization guarantee. The extension has a more complicated form than matrix-normalized flat minima. However, it is similar in a sense that it multiplies the scale of parameters to the loss curvatures.
5.1 Scale dependence of matrix-wise capacity control
First, we propose a transformation different from Dinh et al. (2017) that changes the scale of the Hessian of networks arbitrarily. Let us consider a simple network with a single hidden layer and ReLU activation. We denote this network as

$$f_\theta(x) = W^{(2)}\, \mathrm{ReLU}\left( W^{(1)} x \right). \tag{14}$$

We can scale the $i$-th column of $W^{(2)}$ by $\alpha > 0$ and the $i$-th row of $W^{(1)}$ by $1/\alpha$ without modifying the function that the network represents as follows. (Footnote 7: Running examples of the transformation can be found in appendix A.1.)

$$W^{(2)}[:, i] \leftarrow \alpha\, W^{(2)}[:, i], \qquad W^{(1)}[i, :] \leftarrow \frac{1}{\alpha}\, W^{(1)}[i, :]$$

Since we are using the ReLU activation function, which has positive homogeneity, this transformation does not change the represented function. By the transformation, the diagonal elements of the Hessian corresponding to the $i$-th row of $W^{(1)}$ are scaled by $\alpha^2$. This can cause essentially the same effect as the transformation proposed by Dinh et al. (2017).
The transformation reveals a scale dependence of matrix-norm based generalization error bounds as follows. Assume $W^{(1)}$ has at least two non-zero rows and $W^{(2)}$ has at least two non-zero columns. Using the transformation, we can make both $W^{(1)}$ and $W^{(2)}$ have at least one arbitrarily large element. In other words, both weight matrices have arbitrarily large spectral norms and Frobenius norms. Also, the stable rank of the two matrices becomes arbitrarily close to one. Thus, the matrix-norm based capacity control (Bartlett et al., 2017; Neyshabur et al., 2018) suffers from the same scale dependence as the prior definitions of flatness.
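The argument above can be reproduced on a tiny example. The sketch below builds a hypothetical one-hidden-layer ReLU network with two hidden units, applies the row and column rescaling to one unit with a large factor and to the other with a small factor, and checks that the outputs are unchanged while the Frobenius norms of both weight matrices grow. The weights, inputs, and helper names are arbitrary choices.

```python
import math

def relu(v):
    return [max(0.0, x) for x in v]

def matvec(mat, vec):
    return [sum(m * x for m, x in zip(row, vec)) for row in mat]

def forward(w1, w2, x):
    """One-hidden-layer ReLU network f(x) = W2 ReLU(W1 x)."""
    return matvec(w2, relu(matvec(w1, x)))

def rescale_unit(w1, w2, i, alpha):
    """Scale row i of W1 by 1/alpha and column i of W2 by alpha (alpha > 0)."""
    new_w1 = [[v / alpha for v in row] if r == i else list(row)
              for r, row in enumerate(w1)]
    new_w2 = [[v * alpha if c == i else v for c, v in enumerate(row)]
              for row in w2]
    return new_w1, new_w2

def frobenius(mat):
    return math.sqrt(sum(v * v for row in mat for v in row))

w1 = [[1.0, -2.0], [0.5, 1.5]]
w2 = [[2.0, -1.0], [0.25, 3.0]]
inputs = [[0.7, -0.3], [-1.0, 2.0]]

# Inflate hidden unit 0 in W2 and hidden unit 1 in W1.
w1_a, w2_a = rescale_unit(w1, w2, i=0, alpha=1000.0)
w1_b, w2_b = rescale_unit(w1_a, w2_a, i=1, alpha=0.001)

for x in inputs:
    print(forward(w1, w2, x), forward(w1_b, w2_b, x))
print(frobenius(w1), frobenius(w1_b), frobenius(w2), frobenius(w2_b))
```

Any capacity measure built only from these matrix norms therefore changes by orders of magnitude between two parameterizations of the same function.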
5.2 Improving invariance of flatness
To address the newly revealed scale dependence in Sec. 5.1, we modify the choice of the hyperprior discussed in Sec. 4.2. We introduce a parameter $\sigma_i$ for the $i$-th row and $\sigma'_j$ for the $j$-th column and use the product of them as variance. (Footnote 8: In some parameters such as bias terms and scaling parameters in normalization layers, setting variance parameters per row corresponds to applying naive parameter counting for these parameters. Thus, the noise induced by the posterior becomes zero and the KL-term becomes a constant that scales with the number of such parameters.) In other words, we set the variance of $W[i, j]$ to $\sigma_i \sigma'_j$. For the priors of $\sigma_i$ and $\sigma'_j$, we use the same priors as in Sec. 4. Setting the variances per row and column makes the constant term in the KL-term scale with the number of rows and columns. This is still much smaller than setting the variance per parameter, which scales with the number of parameters. Applying the same discussion as Sec. 4.2, we define normalized sharpness as the sum of the solutions of the following optimization problem defined for each weight matrix.

$$\min_{\sigma, \sigma'} \sum_{i,j} \left( \frac{\sigma_i \sigma'_j}{2} \frac{\partial^2 \mathcal{L}_{S}(f_\theta)}{\partial \left( W[i, j] \right)^2} + \frac{W[i, j]^2}{2 \lambda \sigma_i \sigma'_j} \right) \tag{17}$$

In convolutional layers, since the same filter has the same scale, we only need to set the hyperprior on the input and output channels. When $\lambda$ changes to $\lambda'$, the normalized sharpness (17) is scaled by $\sqrt{\lambda / \lambda'}$, as with matrix-normalized flat minima (Sec. 4). Thus, networks with smaller normalized sharpness at some choice of $\lambda$, e.g., $\lambda = 1$, also have smaller normalized sharpness at other choices of $\lambda$. Below, we set $\lambda = 1$ for simpler calculation.
5.3 Practical calculation
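We present a practical calculation technique to solve the optimization problem (17). First, we reparametrize the variance parameters $\sigma_i$ and $\sigma'_j$ as follows.

$$\sigma_i = \exp(a_i), \qquad \sigma'_j = \exp(b_j)$$

With this reparametrization, every term of the objective in (17) is the exponential of a linear function of $(a, b)$ multiplied by a nonnegative coefficient, provided that the diagonal elements of the Hessian are nonnegative. The problem is therefore convex with respect to $a$ and $b$ for fully connected layers. It is straightforward to see that the convexity also holds with convolutional layers. Thus, we can estimate the near-optimal values of $\sigma_i$ and $\sigma'_j$ by gradient descent. Details of the gradient calculation can be found in appendix C. Figure 2 shows the pseudo code of the normalized sharpness calculation.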
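As a minimal sketch of how the row and column variance factors could be computed for one weight matrix, the code below minimizes $\sum_{i,j} \left( s_i t_j H[i, j] / 2 + W[i, j]^2 / (2 \lambda s_i t_j) \right)$ over positive $s$ and $t$, where $H[i, j]$ stands for the diagonal Hessian entry of $W[i, j]$. Instead of the gradient descent described above, it swaps in alternating closed-form updates: with the other factors fixed, each $s_i$ (or $t_j$) solves a one-dimensional problem of the form $A s + B / s$ exactly, so the objective never increases. The function name, the toy values, and the assumption that all Hessian entries are positive are illustrative only.

```python
import math

def normalized_sharpness(weights, hess_diag, lam=1.0, n_iters=100):
    """Minimize sum_ij [s_i t_j H_ij / 2 + W_ij^2 / (2 lam s_i t_j)] over s, t > 0.

    Assumes every row and column has a positive Hessian entry and a nonzero weight.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    s = [1.0] * n_rows
    t = [1.0] * n_cols
    for _ in range(n_iters):
        for i in range(n_rows):
            a = sum(t[j] * hess_diag[i][j] for j in range(n_cols)) / 2.0
            b = sum(weights[i][j] ** 2 / t[j] for j in range(n_cols)) / (2.0 * lam)
            s[i] = math.sqrt(b / a)   # exact minimizer of a * s + b / s
        for j in range(n_cols):
            a = sum(s[i] * hess_diag[i][j] for i in range(n_rows)) / 2.0
            b = sum(weights[i][j] ** 2 / s[i] for i in range(n_rows)) / (2.0 * lam)
            t[j] = math.sqrt(b / a)
    return sum(
        s[i] * t[j] * hess_diag[i][j] / 2.0
        + weights[i][j] ** 2 / (2.0 * lam * s[i] * t[j])
        for i in range(n_rows) for j in range(n_cols)
    )

weights = [[1.0, -2.0, 0.5], [0.3, 0.1, -1.5]]   # toy weight matrix (assumed)
hess = [[0.2, 1.0, 0.4], [2.0, 0.1, 0.3]]        # toy diagonal Hessian entries

value = normalized_sharpness(weights, hess)

# Row rescaling as in Sec. 5.1: row 0 of W divided by alpha, its curvature times alpha^2.
alpha = 8.0
weights_r = [[w / alpha for w in weights[0]], list(weights[1])]
hess_r = [[h * alpha**2 for h in hess[0]], list(hess[1])]
value_r = normalized_sharpness(weights_r, hess_r)

# Matrix-level normalization uses a single variance for the whole matrix.
sq_norm = sum(w * w for row in weights for w in row)
matrix_level = math.sqrt(sq_norm * sum(h for row in hess for h in row))
print(value, value_r, matrix_level)
```

The value is unchanged by the row rescaling and does not exceed the matrix-level quantity, because a single shared variance is one feasible choice of $s$ and $t$.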
5.4 Choice of the surrogate loss function
When we measure the generalization gap using the 0-1 loss, which is not differentiable with respect to parameters, we need to use surrogate loss functions. The choice of the surrogate loss function needs special care when we use flatness for model comparison. For the comparison to make sense, the value of the normalized sharpness should preferably not change when the accuracy of the models does not change. Thus, the surrogate loss function should make the normalized sharpness invariant against changes that do not affect accuracy, such as scaling and shifting of the networks' outputs. For example, the cross-entropy loss taken after softmax does not satisfy the first condition. Thus, using that loss function makes the model comparison less meaningful. While the above conditions do not make the choice of the surrogate loss function unique, we heuristically use the following loss.
Here, $f(x)$ is an output of a network, $y_x$ is the label of $x$, and $K$ is the number of classes. We refer to the loss function as normalized-softmax-cross-entropy loss. We use this loss function in later experiments (Sec. 6).
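One way to obtain a surrogate loss with the two invariances discussed above is to standardize the network outputs before applying softmax cross-entropy. The sketch below follows that idea; the standardization to zero mean and unit variance is an assumption made for illustration and is not necessarily the exact normalization used in the text, and the function name and inputs are hypothetical.

```python
import math

def normalized_softmax_cross_entropy(logits, label):
    """Cross-entropy after standardizing logits to zero mean and unit variance.

    Assumption: this standardization is one way to get invariance to shifting
    and positive scaling of the outputs; the paper's formula may differ.
    """
    k = len(logits)
    mean = sum(logits) / k
    std = math.sqrt(sum((z - mean) ** 2 for z in logits) / k)
    normed = [(z - mean) / std for z in logits]   # requires non-constant logits
    top = max(normed)                             # stabilizes the log-sum-exp
    log_norm = top + math.log(sum(math.exp(z - top) for z in normed))
    return log_norm - normed[label]

logits = [2.0, -1.0, 0.5]
base = normalized_softmax_cross_entropy(logits, label=0)

# Scaling and shifting the outputs leaves both the prediction and the loss unchanged.
transformed = normalized_softmax_cross_entropy([10.0 * z + 3.0 for z in logits], label=0)
print(base, transformed)
```

Plain softmax cross-entropy would shrink toward zero under the same scaling of a correctly classified example, even though the accuracy is identical.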
6 Numerical evaluations
We numerically justify the insights from the previous sections. Specifically, we check the following.
- Scale dependence of existing sharpness metrics can be harmful in common settings, not only artificial ones (Sec. 6.2).
- Normalized sharpness better captures generalization than existing sharpness metrics (Sec. 6.2).
Detailed experimental setups are described in appendix E.
6.1 Distinguishing models trained on random labels
We checked whether normalized sharpness can distinguish models trained on random labels. Hypotheses that fit random labels belong to hypothesis classes whose Rademacher complexity is close to its maximum. Thus, if normalized sharpness captures generalization reasonably well, it should have a larger value for networks trained on random labels.
We trained a multilayer perceptron with three hidden layers and LeNet (Lecun et al., 1998) on MNIST (LeCun et al., 1998), and LeNet and Wide ResNet (Zagoruyko & Komodakis, 2016) with 16 layers on CIFAR-10 (Krizhevsky, 2009), 100 times for each pair of network and dataset. At each run, we randomly selected the ratio of random labels from an evenly spaced grid of values. We used the Adam optimizer and applied no regularization or data augmentation so that the training accuracy reached nearly 100% even with random labels.
Figure 3 shows scatter plots of normalized sharpness vs. accuracy gap for networks trained on MNIST and CIFAR-10. The results show that networks tended to have larger normalized sharpness when they fit random labels. Thus, we can say that normalized sharpness provides a reasonably good hierarchy over hypothesis classes. These results support our analysis concerning normalized flat minima in Sec. 4 and Sec. 5.
6.2 Effect of normalization
We tested how our modification of flat minima changes its properties. We used the same trained models as in Sec. 6.1, but used the cross-entropy loss for calculating the Hessian. We plotted the trace of the Hessian without normalization (4) and the sum of the squared Frobenius norms of the weight matrices (7).
Figure 4 shows the results. Even though sharpness without normalization can also distinguish models trained on random labels to some extent, the signal is weaker compared to normalized sharpness. Notably, in larger models with normalization layers, sharpness without normalization lost its ability to distinguish models. The result shows that the scaling dependence of the flatness measures can be problematic even in natural settings and also supports the advantages of the normalization.
7 Discussion
In this section, we discuss connections between normalized flat minima and previous studies.
7.1 Connection to MDL arguments
Minimum description length (MDL) (Rissanen, 1986) provides us with a measure of generalization through the amount of information necessary to describe the model. Intuitively, at flat minima, we can use less accurate representations of the parameters and thus require fewer bits. The gain in description length is quantitatively explained by bits-back arguments (Hinton & van Camp, 1993; Honkela & Valpola, 2004). According to the theory, we can represent the model by the following number of bits.

$$\mathrm{KL}[Q \,\|\, P]$$

This is the same as the (B) term in (3). Thus, from the minimum description length principle, normalized sharpness balances the effect of the posterior variance and the number of bits we can save. For more discussion on the connection to MDL, please refer to Dziugaite & Roy (2017).
7.2 Comparison with other prior choices
Kingma et al. (2015) proposed a local reparametrization trick that removes the scale dependence of the KL term by hyperpriors. However, in Kingma et al. (2015), reparametrization was performed per parameter. This makes the constant in (7) scale with the number of parameters. Thus, even though the trick removes the scale dependence, the resultant KL-term is only as good as naive parameter counting. On the other hand, the constant in our analysis scales at most with the number of rows and columns of the weight matrices, compared to the number of parameters in theirs. Achille & Soatto (2018) used the local reparametrization trick to connect the information bottleneck (Tishby et al., 1999) with flat minima and PAC-Bayes. However, the use of the trick made their discussion vacuous from the PAC-Bayesian perspective. Reinterpreting the information bottleneck using our prior design might provide novel insights.
Dziugaite & Roy (2017) and Neyshabur et al. (2019) used the initial parameters as the mean of the prior. In deep networks, we could not empirically observe its advantages over using zero means. Moreover, it provides additional scale dependence to the notion of the flat minima. Applying some normalization to the prior mean to make the overall bound scale-invariant might help to utilize the initial state of the networks even in deep networks, but we leave the exploration as future work.
7.3 Comparison with Fisher-Rao norm
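Liang et al. (2017) proposed the Fisher-Rao norm, which is defined as follows.

$$\|\theta\|_{\mathrm{fr}}^2 = \theta^\top I(\theta)\, \theta$$

where $I(\theta)$ is the Fisher information matrix.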
While the formulation is similar to normalized sharpness, there are three crucial differences. First, normalized sharpness uses the Hessian, not Fisher, and directly measures curvature. Second, the Fisher-Rao norm is parameter-wise, while the normalized sharpness exploits the parameter structures in neural networks. Third, normalized sharpness takes the square root of the Frobenius norms of parameters and the Hessian. To highlight an advantage by the third difference, we consider the following network.
Next, we rescale the parameters as follows.
By this rescaling, the Fisher-Rao norm of the weight matrix becomes half, while the normalized sharpness (17) is kept the same. The definition of matrix-normalized sharpness (13) is also invariant against this rescaling. This additional invariance suggests that our definition better captures generalization.
7.4 Supporting empirical findings
Li et al. (2018) showed that we can clearly observe flat and sharp minima when we rescale the loss landscape by the scale of the parameters. In particular, they applied normalization filter-wise rather than layer-wise in convolutional layers. This is closely related to the notion of row-wise and column-wise normalization proposed in Sec. 5. Loshchilov & Hutter (2019) proposed AdamW and empirically narrowed the generalization gap of networks trained with Adam. Zhang et al. (2019) connected second-order optimization with weight decay, such as AdamW, to normalized curvature, which plays a key role in this paper. Rethinking such optimization methods through normalized flat minima might be useful to improve them further.
8 Conclusion
In this paper, we proposed a notion of normalized flat minima, which is free from the known scale dependence. The advantages of our definition are as follows.
- It is invariant to transformations from which the prior definitions of flatness suffered.
- It can approximate generalization bounds more tightly than naive parameter counting.
Our discussion extends potential applications of the notion of flat minima from the cases when parameters are “appropriately” normalized to general cases. Experimental results suggest that our analysis is powerful enough to distinguish overfitted models even when models are large and existing flat minima definitions tend to suffer from the scale dependence issues.
One flaw of normalized flat minima is that they use Gaussians for both the prior and the posterior, even though that is standard practice in the literature (Hinton & van Camp, 1993; Neyshabur et al., 2018). From Draxler et al. (2018) and Izmailov et al. (2018), we know that appropriate posteriors of networks have more complex structures than Gaussians. From the minimum description length perspective, using Gaussians limits the compression algorithms of models. Recent analyses of compression algorithms for neural networks (Blier & Ollivier, 2018) might be useful for better prior and posterior designs. Sun et al. (2019) might help to develop methods to define priors and posteriors on function space and calculate the KL-divergence on function space directly.
Acknowledgements
YT was supported by Toyota/Dwango AI scholarship. IS was supported by KAKENHI 17H04693. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.
References
- Achille & Soatto (2018) Achille, A. and Soatto, S. Emergence of Invariance and Disentanglement in Deep Representations. Journal of Machine Learning Research, 19, 2018.
- Alquier et al. (2016) Alquier, P., Ridgway, J., and Chopin, N. On the properties of variational approximations of Gibbs posteriors. Journal of Machine Learning Research, 17(239):1–41, 2016.
- Arora et al. (2018) Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. Stronger Generalization Bounds for Deep Nets via a Compression Approach. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 254–263. PMLR, 10–15 Jul 2018.
- Bartlett et al. (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems 30, pp. 6240–6249. Curran Associates, Inc., 2017.
- Blier & Ollivier (2018) Blier, L. and Ollivier, Y. The Description Length of Deep Learning Models. In Advances in Neural Information Processing Systems 31, pp. 2220–2230. Curran Associates, Inc., 2018.
- Catoni (2007) Catoni, O. Pac-Bayesian Supervised Classification: The Thermodynamics of Statistical Learning. Institute of Mathematical Statistics, 2007.
- Chaudhari et al. (2017) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. Entropy-SGD: Biasing Gradient Descent Into Wide Valleys. In International Conference on Learning Representations, 2017.
- Dinh et al. (2017) Dinh, L., Pascanu, R., Bengio, S., and Bengio, Y. Sharp Minima Can Generalize For Deep Nets. In Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pp. 1019–1028. PMLR, 06–11 Aug 2017.
- Draxler et al. (2018) Draxler, F., Veschgini, K., Salmhofer, M., and Hamprecht, F. Essentially no barriers in neural network energy landscape. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 1309–1318. PMLR, 10–15 Jul 2018.
- Dziugaite & Roy (2017) Dziugaite, G. K. and Roy, D. M. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, 2017.
- Germain et al. (2016) Germain, P., Bach, F., Lacoste, A., and Lacoste-Julien, S. PAC-Bayesian Theory Meets Bayesian Inference. In Advances in Neural Information Processing Systems 29, pp. 1884–1892. Curran Associates, Inc., 2016.
- Hinton & van Camp (1993) Hinton, G. E. and van Camp, D. Keeping the Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory, COLT ’93, pp. 5–13. ACM, 1993. ISBN 0-89791-611-5.
- Hochreiter & Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. Flat Minima. Neural Computation, 9(1):1–42, 1997.
- Hoffer et al. (2017) Hoffer, E., Hubara, I., and Soudry, D. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems 30, pp. 1731–1741. Curran Associates, Inc., 2017.
- Honkela & Valpola (2004) Honkela, A. and Valpola, H. Variational Learning and Bits-back Coding: An Information-theoretic View to Bayesian Learning. Trans. Neur. Netw., 15(4):800–810, July 2004. ISSN 1045-9227.
- Ioffe & Szegedy (2015) Ioffe, S. and Szegedy, C. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pp. 448–456. PMLR, 07–09 Jul 2015.
- Izmailov et al. (2018) Izmailov, P., Podoprikhin, D., Garipov, T., Vetrov, D. P., and Wilson, A. G. Averaging Weights Leads to Wider Optima and Better Generalization. In Conference on Uncertainty in Artificial Intelligence, 2018.
- Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima. In International Conference on Learning Representations, 2017.
- Kingma et al. (2015) Kingma, D. P., Salimans, T., and Welling, M. Variational Dropout and the Local Reparameterization Trick. In Advances in Neural Information Processing Systems 28, pp. 2575–2583. Curran Associates, Inc., 2015.
- Krizhevsky (2009) Krizhevsky, A. Learning Multiple Layers of Features from Tiny Images. 2009.
- Langford & Caruana (2002) Langford, J. and Caruana, R. (Not) Bounding the True Error. In Advances in Neural Information Processing Systems 14, pp. 809–816. MIT Press, 2002.
- Lecun et al. (1998) Lecun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-based Learning Applied to Document Recognition. In Proceedings of the IEEE, pp. 2278–2324, 1998.
- LeCun et al. (1998) LeCun, Y., Cortes, C., and Burges, C. J. C. The MNIST Database of Handwritten Digits. 1998.
- Li et al. (2018) Li, H., Xu, Z., Taylor, G., and Goldstein, T. Visualizing the Loss Landscape of Neural Nets. In Advances in Neural Information Processing Systems 31, pp. 6391–6401. Curran Associates, Inc., 2018.
- Liang et al. (2017) Liang, T., Poggio, T. A., Rakhlin, A., and Stokes, J. Fisher-Rao Metric, Geometry, and Complexity of Neural Networks. CoRR, abs/1711.01530, 2017.
- Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- McAllester (1999) McAllester, D. A. Some PAC-Bayesian Theorems. Machine Learning, 37(3):355–363, Dec 1999. ISSN 1573-0565.
- McAllester (2003) McAllester, D. A. PAC-Bayesian Stochastic Model Selection. Machine Learning, 51(1):5–21, Apr 2003. ISSN 1573-0565.
- Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. Exploring Generalization in Deep Learning. In Advances in Neural Information Processing Systems 30, pp. 5947–5956. Curran Associates, Inc., 2017.
- Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., and Srebro, N. A PAC-Bayesian Approach to Spectrally-Normalized Margin Bounds for Neural Networks. In International Conference on Learning Representations, 2018.
- Neyshabur et al. (2019) Neyshabur, B., Li, Z., Bhojanapalli, S., LeCun, Y., and Srebro, N. The role of over-parametrization in generalization of neural networks. In International Conference on Learning Representations, 2019.
- Rissanen (1986) Rissanen, J. Stochastic complexity and modeling. Ann. Statist., 14(3):1080–1100, Sep 1986.
- Salimans & Kingma (2016) Salimans, T. and Kingma, D. P. Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks. In Advances in Neural Information Processing Systems 29, pp. 901–909. Curran Associates, Inc., 2016.
- Sun et al. (2019) Sun, S., Zhang, G., Shi, J., and Grosse, R. Functional Variational Bayesian Neural Networks. In International Conference on Learning Representations, 2019.
- Szegedy et al. (2014) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. Intriguing properties of neural networks. In International Conference on Learning Representations, 2014.
- Tishby et al. (1999) Tishby, N., Pereira, F. C., and Bialek, W. The information bottleneck method. In Proceedings of the 37-th Annual Allerton Conference on Communication, Control and Computing, pp. 368–377, 1999.
- Wang et al. (2018) Wang, H., Keskar, N. S., Xiong, C., and Socher, R. Identifying Generalization Properties in Neural Networks. ArXiv e-prints, 2018.
- Xie et al. (2017) Xie, S., Girshick, R. B., Dollár, P., Tu, Z., and He, K. Aggregated Residual Transformations for Deep Neural Networks. In
- Yao et al. (2018) Yao, Z., Gholami, A., Lei, Q., Keutzer, K., and Mahoney, M. W. Hessian-based Analysis of Large Batch Training and Robustness to Adversaries. In Advances in Neural Information Processing Systems 31, pp. 4954–4964. Curran Associates, Inc., 2018.
- Zagoruyko & Komodakis (2016) Zagoruyko, S. and Komodakis, N. Wide Residual Networks. In Proceedings of the British Machine Vision Conference, pp. 87.1–87.12, 2016.
- Zhang et al. (2019) Zhang, G., Wang, C., Xu, B., and Grosse, R. Three Mechanisms of Weight Decay Regularization. In International Conference on Learning Representations, 2019.
- Zhou et al. (2019) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. Non-vacuous Generalization Bounds at the ImageNet Scale: a PAC-Bayesian Compression Approach. In International Conference on Learning Representations, 2019.
- Zoph et al. (2018) Zoph, B., Vasudevan, V., Shlens, J., and Le, Q. V. Learning Transferable Architectures for Scalable Image Recognition. In 2018 IEEE Conference on Computer Vision and Pattern Recognition, pp. 8697–8710, 2018.
Appendix A Running examples
A.1 Row and column scaling
We show running examples of the transformation proposed in Sec. 5.1. We consider the following network.
Now, matrix norms are as follows.
We apply the transformation to the first row of and the first column of with . Then, parameters change as follows.
Next, we apply the transformation to the second row of and the second column of with . Parameters change as follows.
Now, matrix norms changed as follows.
Using the same method, we can make matrix norms of both and arbitrarily large.
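The effect of this transformation can be checked numerically. The following is a minimal sketch, assuming a two-layer ReLU network without biases; the weight values, the input, and the scaling factors 10 and 0.1 are hypothetical and are not the ones used in the running example above. The helper names `forward` and `rescale_unit` are also introduced only for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def forward(W1, W2, x):
    """Two-layer ReLU network without biases: f(x) = W2 @ relu(W1 @ x)."""
    return W2 @ relu(W1 @ x)

def rescale_unit(W1, W2, i, alpha):
    """Scale row i of W1 by alpha and column i of W2 by 1/alpha (alpha > 0)."""
    W1, W2 = W1.copy(), W2.copy()
    W1[i, :] *= alpha
    W2[:, i] /= alpha
    return W1, W2

# Hypothetical weights and input, chosen only for illustration.
W1 = np.array([[1.0, -1.0], [0.5, 2.0]])
W2 = np.array([[1.0, 1.0], [-1.0, 0.5]])
x = np.array([0.3, -0.7])

# First row of W1 / first column of W2, then second row / second column.
V1, V2 = rescale_unit(W1, W2, 0, 10.0)
V1, V2 = rescale_unit(V1, V2, 1, 0.1)

# The represented function is unchanged, while both Frobenius norms grow.
print(np.allclose(forward(W1, W2, x), forward(V1, V2, x)))
print(np.linalg.norm(W1), np.linalg.norm(V1))
print(np.linalg.norm(W2), np.linalg.norm(V2))
```

The invariance relies on the positive homogeneity of ReLU, so the sketch only applies to positive scaling factors.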
Appendix B Convexity of variance parameters
We show that the optimization problem (17) is convex with respect to the log of variance parameters and . Let us define the following parameters.
For the sake of notational simplicity, we rewrite the objective in (17) as follows.
where and for all and . To show that is convex with respect to and , we show that the Hessian of is positive semidefinite. First, we calculate the elements of the Hessian.
For notational simplicity, we define the following.
Note that . We can rewrite the elements of the Hessian as follows.
Now, it suffices to show that .
Appendix C Calculation of normalized sharpness
To calculate the normalized sharpness (17), we must solve the following optimization problem for each weight matrix.
The parameter is arbitrary, but we set for simplicity. Once the diagonal elements of the Hessian are estimated, the remaining steps are straightforward. We can use the following to estimate the diagonal elements of the Hessian.
where is a small constant. In our experiments, was chosen per weight matrix according to its Frobenius norm for better estimation of the Hessian.
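As an illustration of how Hessian diagonals can be estimated from gradients alone, the following sketch uses a Hutchinson-style estimator with central finite differences. This is an assumption made for illustration and is not necessarily the estimator used in our experiments: the function name `hessian_diag_estimate`, the Rademacher probe vectors, the quadratic test loss, and the value of `eps` are all hypothetical.

```python
import numpy as np

def hessian_diag_estimate(grad_fn, theta, eps=1e-3, n_samples=1000, seed=0):
    """Estimate diag(H) at theta using gradient evaluations only.

    Uses E_s[s * (H s)] = diag(H) for Rademacher vectors s, with the
    Hessian-vector product H s approximated by a central difference
    of gradients with step size eps.
    """
    rng = np.random.default_rng(seed)
    acc = np.zeros_like(theta)
    for _ in range(n_samples):
        s = rng.choice([-1.0, 1.0], size=theta.shape)
        hvp = (grad_fn(theta + eps * s) - grad_fn(theta - eps * s)) / (2.0 * eps)
        acc += s * hvp
    return acc / n_samples

# Hypothetical quadratic loss L(theta) = 0.5 * theta^T A theta, whose Hessian is A.
A = np.array([[2.0, 0.1, 0.0],
              [0.1, 1.0, 0.2],
              [0.0, 0.2, 3.0]])
grad_fn = lambda th: A @ th
theta = np.array([0.5, -1.0, 2.0])

est = hessian_diag_estimate(grad_fn, theta, eps=1e-3, n_samples=4000)
print(est)          # stochastic estimate of the diagonal
print(np.diag(A))   # exact diagonal for comparison
```

In a network, `grad_fn` would be the gradient of the loss with respect to one weight matrix, and the step size would be adapted to the scale of that matrix, as noted above.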
Appendix D Alternative definition of the normalized sharpness
We can use stochastic gradient descent to optimize and . An advantage of (62) over (17) is that we do not need a second-order approximation. However, we have the following disadvantages.
The optimization problem becomes nonconvex.
Appendix E Experimental setups
E.1 Setups of Sec. 6.1
The ratio of random labels was selected uniformly at random from (, , , , , , , , , , ) for each training run. We used cross-entropy loss during training. We used a normalized-softmax-cross-entropy loss (20) to calculate normalized sharpness.
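The label randomization step can be sketched as follows. This is a minimal illustration under stated assumptions: the helper `randomize_labels`, the candidate grid `candidate_ratios`, and the toy label array are hypothetical, and the grid does not reproduce the ratios used in our experiments.

```python
import numpy as np

def randomize_labels(labels, ratio, num_classes, seed=0):
    """Replace a `ratio` fraction of labels with uniformly random classes.

    Returns the corrupted labels and the indices that were resampled.
    A resampled label may coincide with the original one by chance.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels).copy()
    n_corrupt = int(round(ratio * len(labels)))
    idx = rng.choice(len(labels), size=n_corrupt, replace=False)
    labels[idx] = rng.integers(0, num_classes, size=n_corrupt)
    return labels, idx

# Hypothetical candidate ratios; one is drawn uniformly for each training run.
candidate_ratios = np.linspace(0.0, 1.0, 11)
run_rng = np.random.default_rng(123)
ratio = run_rng.choice(candidate_ratios)

clean = np.arange(100) % 10            # toy labels for a 10-class problem
noisy, idx = randomize_labels(clean, ratio, num_classes=10)
print(ratio, len(idx))
```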
MLP on MNIST:
We trained an MLP for 50 epochs with batch size 128 on MNIST. We used the Adam optimizer with its default parameters ().
LeNet on MNIST:
We trained LeNet for 50 epochs with batch size 128 on MNIST. We used the Adam optimizer with its default parameters ().
LeNet on CIFAR10:
We trained LeNet for 100 epochs with batch size 128 on CIFAR10. We used the Adam optimizer with its default parameters ().
Wide ResNet on CIFAR10:
We trained layers Wide ResNet for 200 epochs with batch size 128 on CIFAR10. We used width factor . We used the Adam optimizer with its default parameters ().
E.2 Setups of Sec. 6.2
We used the same setups as described in Sec. E.1. We used cross-entropy loss for both training and calculation of the Hessian.