Recent work by Zhang et al. (2016) showed experimentally that standard deep convolutional architectures can easily fit a random labelling over unstructured random noise. This challenged the conventional wisdom that the good generalization of DNNs results from implicit regularization through the use of SGD, or from explicit regularization using dropout and batch normalization, and stirred a considerable amount of research towards rigorously understanding generalization for these now ubiquitous classification architectures.
A key idea dating back to Hochreiter & Schmidhuber (1997) has been that neural networks that generalize correspond to flat minima in the optimisation landscape. This has been observed empirically in Keskar et al. (2016); however, in Dinh et al. (2017) the authors argue that the notion of flatness needs to be carefully defined. One usually proceeds by assuming that a flat minimum can be described with low precision, while a sharp minimum requires high precision, so that the statistical model corresponding to the flat minimum can be described with few bits. One can then use a minimum description length (MDL) argument to show generalization (Rissanen, 1983). Alternatively, one can quantitatively measure the "flatness" of a minimum by injecting noise into the network parameters and measuring the stability of the network output. The more stable the output of a trained network is to noise, the "flatter" the minimum to which it corresponds, and the better its generalization (Shawe-Taylor & Williamson, 1997; McAllester, 1999). Using this measure of "flatness", Neyshabur et al. (2017a) proposed a GE bound for deep fully connected networks.
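The noise-injection notion of flatness above can be made concrete in a few lines (a minimal sketch on a toy two-layer ReLU predictor; `forward` and `output_stability` are our own illustrative names, not from the cited papers):

```python
import numpy as np

rng = np.random.default_rng(0)

def forward(weights, x):
    """Toy two-layer ReLU network standing in for any trained predictor."""
    h = np.maximum(0.0, x @ weights[0])
    return h @ weights[1]

def output_stability(weights, x, sigma, trials=100):
    """Average output change under Gaussian noise on the parameters.

    Smaller values indicate a 'flatter' minimum in the sense used above."""
    base = forward(weights, x)
    deltas = []
    for _ in range(trials):
        noisy = [w + sigma * rng.standard_normal(w.shape) for w in weights]
        deltas.append(np.linalg.norm(forward(noisy, x) - base))
    return float(np.mean(deltas))

weights = [rng.standard_normal((10, 32)), rng.standard_normal((32, 3))]
x = rng.standard_normal((5, 10))
# Output instability grows with the injected parameter noise level.
assert output_stability(weights, x, 1e-3) < output_stability(weights, x, 1e-1)
```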
The bound of Neyshabur et al. (2017a) depends linearly on the latent ambient dimensionality of the hidden layers. For convolutional architectures, while the ambient dimensionality of the convolution operators is huge, the effective number of parameters is much smaller. Therefore, with a careful analysis one can derive generalization bounds that depend on the intrinsic layer dimensionality, leading to much tighter bounds. In recent work, Arora et al. (2018) explore a similar idea. They first compress a neural network, removing redundant parameters with little or no degradation in accuracy, and then derive a generalization bound on the new compressed network. This results in much tighter bounds than previous works.
Contributions of this paper.
We apply a sparsity-based analysis to convolutional-like layers of DNNs. We define convolutional-like layers as layers with a sparse banded structure similar to convolutions but without weight sharing. We show that the implicit sparsity of convolutional-like layers significantly reduces the capacity of the network, resulting in a much tighter GE bound.
We then extend our results to true convolutional layers, with weight sharing, finding again tighter GE bounds. Surprisingly, the bounds for convolutional-like and convolutional layers are of the same order of magnitude.
For completeness we propose to sparsify fully connected layers in line with the work of Arora et al. (2018) in order to reduce the number of effective network parameters. We then derive improved GE bounds based on the new reduced parameters.
Other related works. A number of recent works have tried to analyze the generalization of deep neural networks. These include margin approaches such as Sokolić et al. (2016) and Bartlett et al. (2017). Other lines of inquiry have investigated the data-dependent stability of SGD (Kuzborskij & Lampert, 2017) as well as the implicit bias of SGD over separable data (Soudry et al., 2017; Neyshabur et al., 2017b).
We now expand on the PAC-Bayes framework. Specifically, let $f_w$ be any predictor (not necessarily a neural network) learned from the training data and parameterized by $w$. We assume a prior distribution $P$ over the parameters, which should be a proper Bayesian prior and cannot depend on the training data. We also assume a posterior over the predictors of the form $w + u$, where $u$ is a random perturbation; applying the PAC-Bayes theorem (McAllester, 1999) we get:
Notice that the above gives a generalization result over a distribution of predictors. We now restate a useful lemma from Neyshabur et al. (2017a) which can be used to give a generalization result for a single predictor instance.
Let $f_w(x)$ be any predictor (not necessarily a neural network) with parameters $w$, and let $P$ be any distribution on the parameters that is independent of the training data. Then, with probability $1-\delta$ over the training set of size $m$, for any random perturbation $u$ s.t. $\mathbb{P}_u\left(\max_x \|f_{w+u}(x) - f_w(x)\|_\infty < \frac{\gamma}{4}\right) \ge \frac{1}{2}$, we have:
$$L_0(f_w) \le \hat{L}_\gamma(f_w) + 4\sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{6m}{\delta}}{m-1}},$$
where $L_0$ is the expected 0-1 risk and $\hat{L}_\gamma$ is the empirical margin risk at margin $\gamma$.
Let us look at some intuition behind this bound. It links the empirical margin risk $\hat{L}_\gamma(f_w)$ of the predictor to the true risk $L_0(f_w)$, for a specific predictor and not a posterior distribution of predictors. We have also moved to using a margin-based loss; this is an essential step in order to remove the posterior assumption. The perturbation $u$ quantifies how the true risk would be affected by choosing a bad predictor. The condition $\mathbb{P}_u\left(\max_x \|f_{w+u}(x) - f_w(x)\|_\infty < \frac{\gamma}{4}\right) \ge \frac{1}{2}$ can be interpreted as choosing a posterior with small variance, sufficiently concentrated around the current empirical estimate $w$, so that we can remove the randomness assumption with high confidence.
How small should we choose the variance of $u$? The choice is complicated because the KL term in the bound is inversely proportional to the variance of the perturbation. Therefore, we need to find the largest possible variance for which our stability condition holds.
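This search for the largest admissible variance can be sketched as a simple grid scan, with the stability condition checked empirically on a toy network (an illustrative sketch only; the helper names, the grid, and the toy sizes are our own assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def stable(weights, x, sigma, gamma, trials=200):
    """Empirical version of the stability condition: does the perturbed
    output stay within gamma/4 of the clean output at least half the time?"""
    base = np.maximum(0.0, x @ weights[0]) @ weights[1]
    hits = 0
    for _ in range(trials):
        noisy = [w + sigma * rng.standard_normal(w.shape) for w in weights]
        out = np.maximum(0.0, x @ noisy[0]) @ noisy[1]
        hits += np.max(np.abs(out - base)) < gamma / 4
    return hits / trials >= 0.5

def largest_sigma(weights, x, gamma, grid):
    """Scan an ascending grid and return the largest sigma that is stable."""
    best = 0.0
    for sigma in grid:
        if stable(weights, x, sigma, gamma):
            best = sigma
    return best

weights = [rng.standard_normal((8, 16)), rng.standard_normal((16, 2))]
x = rng.standard_normal((4, 8))
best = largest_sigma(weights, x, gamma=0.5, grid=[1e-4, 1e-3, 1e-2, 1e-1, 1.0])
assert best >= 1e-4  # a very small perturbation is always stable here
```

In the paper this role is played analytically by the concentration bounds of Section 3, which give the largest $\sigma$ in closed form rather than by search.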
The basis of our analysis is the following perturbation bound from Neyshabur et al. (2017a) on the output of a DNN:
(Perturbation Bound). For any $B, d > 0$, let $f_w$ be a $d$-layer network with ReLU activations. Then for any $w$, any $x$ with $\|x\| \le B$, and any perturbation $u = \mathrm{vec}(\{U_i\}_{i=1}^d)$ such that $\|U_i\|_2 \le \frac{1}{d}\|W_i\|_2$, the change in the output of the network can be bounded as follows:
$$\|f_{w+u}(x) - f_w(x)\|_2 \le e B \left(\prod_{i=1}^{d} \|W_i\|_2\right) \sum_{i=1}^{d} \frac{\|U_i\|_2}{\|W_i\|_2},$$
where $B$, $d$ and the spectral norms $\|W_i\|_2$ are considered as constants after an appropriate normalization of the layer weights.
We note that correctly estimating the spectral norm of the perturbation at each layer is critical to obtaining a tight bound. Specifically, if we exploit the structure of the perturbation, we can significantly increase the variance of the added perturbation for which our stability condition holds.
We will also use the following definition of a network sparsification, which will prove useful later:
Let $f_w$ be a classifier. We say that $f_{\hat{w}}$ is an $(s, \epsilon)$-sparsification of $f_w$ if for any $x$ in the training set we have $|f_w(x)[i] - f_{\hat{w}}(x)[i]| \le \epsilon$ for all labels $i$, and each fully connected layer in $f_{\hat{w}}$ has at most $s$ non-zero entries along each row and column.
3 Layerwise Pertubations
We need to find the maximum variance $\sigma^2$ for which the stability condition of Lemma 2.2 holds. For this we present the following lemmas, which bound the spectral norm of the noise at each layer. Specifically, we assume a variance level $\sigma^2$ for the noise applied to each DNN parameter; based on the sparsity structure of a given layer we obtain noise matrices with different structure and corresponding concentration inequalities for the spectral norm. In the following we omit log factors for clarity.
3.1 Fully Connected Layers
We start with fully connected layers that have been sparsified to a sparsity level $s$. After some calculations we get the following:
Let $U_i$ be the perturbation matrix of a fully connected layer with row and column sparsity equal to $s$. Then the spectral norm of $U_i$ behaves like:
Note that in the above theorem, for fully connected layers without sparsity, we can simply set $s$ equal to the layer dimension.
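The claimed scaling can be checked empirically (a sketch; we use a cyclic band as a convenient matrix with exactly $s$ non-zeros per row and per column, and unit noise variance — up to log factors the spectral norm should grow like $\sqrt{s}$, not like the ambient dimension):

```python
import numpy as np

rng = np.random.default_rng(2)

def sparse_gaussian(n, s, sigma):
    """n x n Gaussian matrix with exactly s nonzeros per row and per column
    (a cyclic band), modelling the noise on a sparsified dense layer."""
    W = np.zeros((n, n))
    for i in range(n):
        cols = (i + np.arange(s)) % n
        W[i, cols] = sigma * rng.standard_normal(s)
    return W

def mean_spec_norm(n, s, sigma, trials=20):
    return np.mean([np.linalg.norm(sparse_gaussian(n, s, sigma), 2)
                    for _ in range(trials)])

n, sigma = 256, 1.0
# Spectral norm grows roughly like sigma * sqrt(s), not sigma * sqrt(n):
assert mean_spec_norm(n, 16, sigma) < mean_spec_norm(n, 64, sigma)
assert mean_spec_norm(n, 16, sigma) < 0.5 * mean_spec_norm(n, n, sigma)
```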
3.2 Convolutional-Like Layers
We now analyze convolutional-like layers. These are layers that have a sparse banded structure similar to convolutions, with the simplifying assumption that the weights of the translated filters are not shared. This results in a matrix which, for 1d convolutions, is plotted in Figure 2.a. While this type of layer is purely theoretical and does not directly correspond to practically applied layers, it is nevertheless useful for isolating the effect of sparsity on the generalization error. After some calculations we get the following:
Let $U_i$ be the perturbation matrix of a 2d convolutional-like layer with $c_{in}$ input channels, $c_{out}$ output channels, and convolutional filters with support $k \times k$. Then, with high probability, the spectral norm of $U_i$ behaves like:
We see that the spectral norm of the noise is, up to log factors, independent of the dimensions of the latent feature maps and of the ambient layer dimensionality. The spectral norm is a function of the square root of the filter support $k^2$, the number of input channels $c_{in}$ and the number of output channels $c_{out}$.
3.3 Convolutional Layers
We extend our analysis to true 2d convolutional layers. After some calculations we get the following:
Let $U_i$ be the perturbation matrix of a 2d convolutional layer with $c_{in}$ input channels, $c_{out}$ output channels, convolutional filters with support $k \times k$, and feature maps of size $n \times n$. Then, with high probability, the spectral norm of $U_i$ behaves like:
We see that, again up to log factors, the spectral norm of the noise is independent of the dimensions of the latent feature maps. The spectral norm is a function of the square root of the filter support $k^2$, the number of input channels $c_{in}$ and the number of output channels $c_{out}$. The additional factor in the concentration probability implies a less tight concentration than for the convolutional-like layer. This is to be expected, as the layer has fewer parameters due to weight sharing. We see also that, up to log factors, the expected value of the spectral norm is the same for convolutional-like and convolutional layers. This is somewhat surprising, as it implies similar generalization error bounds with and without weight sharing, contrary to conventional wisdom about DNN design.
4 Generalization Bound
We now proceed to find the maximum value of the variance parameter $\sigma^2$. For this we use the following lemma:
(Perturbation Bound). For any $B, d > 0$, let $f_w$ be a $d$-layer network with ReLU activations, and denote by $C$ the set of convolutional layers and by $D$ the set of dense layers. Then for any $w$, any $x$ with $\|x\| \le B$, and any perturbation with entrywise variance $\sigma^2$ structured as in Section 3, with high probability the change in the output of the network can be bounded as follows:
where $B$, $d$ and the spectral norms $\|W_i\|_2$ are considered as constants after an appropriate normalization of the layer weights.
While we have deferred all other proofs to the Appendix, we will describe the proof of this lemma in detail, as it is crucial for understanding the GE bound.
We denote by $C$ the set of convolutional layers and by $D$ the set of dense layers, and assume $|C| + |D| = d$, where $d$ is the total number of layers. We then assume that the probability of each of the events (5) and the events (7) is upper bounded by $\frac{1}{2d}$. We take a union bound over these events and after some calculations obtain that:
We are then ready to apply our result directly in Lemma 2.2. We calculate that with probability at least $\frac{1}{2}$:
We have now found a bound on the perturbation at the final layer of the network as a function of $\sigma$, holding with probability at least $\frac{1}{2}$. What remains is to find the specific value of $\sigma$ such that this bound equals $\frac{\gamma}{4}$. We calculate:
We are now ready to state our main result. It follows directly from calculating the KL term in Lemma 2.1, where both the prior and the perturbation are zero-mean Gaussians with the variance value chosen in equation (8):
(Generalization Bound). For any $B, d > 0$, let $f_w$ be a $d$-layer network with ReLU activations. Then for any $\delta, \gamma > 0$, with probability $1-\delta$ over the training set of size $m$, we have:
We can now compare this result with the previous one by Neyshabur et al. (2017a). For the case where all layers are sparse with sparsity $s$, and ignoring log factors, our bound scales with the sparsity level; for the case where all the layers are convolutional, again ignoring log factors, our bound scales with the filter support and the number of channels. For the same cases the bound of Neyshabur et al. (2017a) scales with the ambient layer dimensionality. We see that for convolutional layers our bound is orders of magnitude tighter, and similarly for sparse fully connected layers.
5.1 Concentration Bounds
In this section we present experiments that validate the proposed concentration bounds for convolutional-like and convolutional layers. We assume 2d convolutions with $c_{in}$ input channels and $c_{out}$ output channels, and calculate theoretically and experimentally the spectral norm of the random perturbation matrix $U_i$. We increase the number of input and output channels, assuming that $c_{in} = c_{out}$. To find empirical estimates we average the results over a number of iterations for each choice of channel counts. We plot the results in Figure 3. We see that the results for the expected value deviate by some log factors, while the bounds correctly capture the growth rate of the expected spectral norm as the parameters increase. Furthermore, the empirical estimates validate the prediction that the norm will be less concentrated around the mean for the true convolutional layer.
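A miniature version of this experiment can be run with 1d convolutions (a simplified sketch; the sizes are illustrative, not those of Figure 3, and the `shared` flag switches between a true convolution and its convolutional-like counterpart):

```python
import numpy as np

rng = np.random.default_rng(4)

def conv_matrix(n, c_in, c_out, k, shared):
    """1d convolution matrix with circular padding. With shared=True the
    band repeats a single filter in every row (true convolution); with
    shared=False every row gets fresh entries (convolutional-like)."""
    U = np.zeros((c_out * n, c_in * n))
    for o in range(c_out):
        for c in range(c_in):
            if shared:
                row = np.zeros(n)
                row[:k] = rng.standard_normal(k)
                block = np.stack([np.roll(row, i) for i in range(n)])
            else:
                block = np.zeros((n, n))
                for i in range(n):
                    block[i, (i + np.arange(k)) % n] = rng.standard_normal(k)
            U[o * n:(o + 1) * n, c * n:(c + 1) * n] = block
    return U

def mean_std(shared, trials=10, n=32, c=8, k=3):
    norms = [np.linalg.norm(conv_matrix(n, c, c, k, shared), 2)
             for _ in range(trials)]
    return np.mean(norms), np.std(norms)

m_conv, s_conv = mean_std(True)
m_like, s_like = mean_std(False)
# Up to log factors the expected norms coincide, with and without sharing.
assert 0.5 < m_conv / m_like < 2.0
```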
5.2 Generalization Error Bounds
For each layer we define the expected value of the layerwise bound term, depending on different estimations of the perturbation norm. We then plot this quantity on a log scale for different layers of some common DNN architectures. The intuition behind this plot is that, if we consider a network where all layers are the same, the bound of (13) scales with the plotted layerwise quantity. We experiment on LeNet-5 for the MNIST dataset, and on AlexNet and VGG-16 for the ImageNet dataset. We see that for convolutional layers the original approach of Neyshabur et al. (2017a) gives too pessimistic estimates, orders of magnitude higher than our approach.
We also see that for large $c_{in}$ and $c_{out}$ the convolutional-like estimate is approximately the same as the convolutional estimate.
We also take a closer look at the sample complexity estimates for the above architectures, under a simplifying assumption on the norm-dependent terms. We plot the results in Table 1. We obtain results orders of magnitude tighter than the previous PAC-Bayesian approach.
[Table 1: sample complexity estimates for our bound and for that of Neyshabur et al. (2017a).]
An interesting observation is that the original approach estimates the sample complexity of VGG-16 to be one order of magnitude larger than that of AlexNet, even though both are trained on the same dataset (ImageNet) and do not overfit. By contrast, our bound estimates approximately the same sample complexity for both architectures, even though they differ significantly in ambient architecture dimensions. We also observe that our bound results in improved estimates when the spatial support of the filters is small relative to the dimensions of the feature maps. For the LeNet-5 architecture, where the support of the filters is big relative to the ambient dimension, the benefit from our approach is small, as the convolutional layers can be adequately modeled as dense matrices.
It must be noted here that the assumption is quite strong and the bound is still worse than naive parameter counting when applied to real trained networks.
We have presented a new PAC-Bayes bound for deep convolutional neural networks. By decoupling the analysis from the ambient dimension of convolutional layers we manage to obtain bounds orders of magnitude tighter than existing approaches. We present numerical experiments on common feedforward architectures that corroborate our theoretical analysis.
- Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
- Bandeira et al. (2016) Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
- Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pp. 6241–6250, 2017.
- Dinh et al. (2017) Laurent Dinh, Razvan Pascanu, Samy Bengio, and Yoshua Bengio. Sharp minima can generalize for deep nets. arXiv preprint arXiv:1703.04933, 2017.
- Dong et al. (2017) Xin Dong, Shangyu Chen, and Sinno Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. In Advances in Neural Information Processing Systems, pp. 4860–4874, 2017.
- Han et al. (2015) Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. In Advances in neural information processing systems, pp. 1135–1143, 2015.
- Hochreiter & Schmidhuber (1997) Sepp Hochreiter and Jürgen Schmidhuber. Flat minima. Neural Computation, 9(1):1–42, 1997.
- Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
- Kuzborskij & Lampert (2017) Ilja Kuzborskij and Christoph Lampert. Data-dependent stability of stochastic gradient descent. arXiv preprint arXiv:1703.01678, 2017.
- McAllester (1999) David A McAllester. Some pac-bayesian theorems. Machine Learning, 37(3):355–363, 1999.
- Neyshabur et al. (2017a) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A pac-bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
- Neyshabur et al. (2017b) Behnam Neyshabur, Ryota Tomioka, Ruslan Salakhutdinov, and Nathan Srebro. Geometry of optimization and implicit regularization in deep learning. arXiv preprint arXiv:1705.03071, 2017b.
- Pitas et al. (2018) Konstantinos Pitas, Mike Davies, and Pierre Vandergheynst. Feta: A dca pruning algorithm with generalization error guarantees. arXiv preprint arXiv:1803.04239, 2018.
- Rissanen (1983) Jorma Rissanen. A universal prior for integers and estimation by minimum description length. The Annals of statistics, pp. 416–431, 1983.
- Shawe-Taylor & Williamson (1997) John Shawe-Taylor and Robert C Williamson. A pac analysis of a bayesian estimator. In Proceedings of the tenth annual conference on Computational learning theory, pp. 2–9. ACM, 1997.
- Sokolić et al. (2016) Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2016.
- Soudry et al. (2017) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345, 2017.
- Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
- Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
In the derivations below we will rely upon the following useful theorem from Bandeira et al. (2016) for the concentration of the spectral norm of sparse random matrices:
Let $X$ be a random $n \times m$ rectangular matrix with $X_{ij} = g_{ij}\,b_{ij}$, where the $g_{ij}$ are independent standard Gaussian random variables and the $b_{ij}$ are scalars. Then:
$$\mathbb{P}\left(\|X\|_2 \ge (1+\epsilon)(\sigma_1 + \sigma_2) + t\right) \le (n \wedge m)\, e^{-t^2 / C_\epsilon \sigma_*^2}$$
for any $\epsilon > 0$ and $t \ge 0$, with:
$$\sigma_1 = \max_i \sqrt{\sum\nolimits_j b_{ij}^2}, \qquad \sigma_2 = \max_j \sqrt{\sum\nolimits_i b_{ij}^2}, \qquad \sigma_* = \max_{i,j} |b_{ij}|.$$
7.1 Proof of Theorem 3.1
For $\sigma = 1$ we get the result directly from Theorem 7.1 by setting $b_{ij} = 1$ on the sparsity pattern of the layer and $b_{ij} = 0$ elsewhere, so that $\sigma_1 = \sigma_2 = \sqrt{s}$ and $\sigma_* = 1$. We can extend the result to general $\sigma$ by considering that $U = \sigma X$.
7.2 Proof of Theorem 3.2
We will consider first the case $\sigma = 1$. A convolutional layer is characterised by its output channels. For each output channel, each input channel is convolved with an independent filter, and the output of the channel is the sum of the results of these convolutions. We consider convolutional-like layers, i.e. the layers are banded but the entries are independent and there is no weight sharing. We plot, for the case of one-dimensional signals, the implied structure.
We need to evaluate $\sigma_1$, $\sigma_2$ and $\sigma_*$ of Theorem 7.1 for this matrix, where:
Below we plot what these sums represent:
For $\sigma_1$ we can find an upper bound by considering that the sum for a given filter and a given pixel location represents the maximum number of overlaps over all 2d shifts. For the case of 2d convolutions this is equal to the support of the filters, $k^2$, and we also need to consider all $c_{in}$ input channels. We then get $\sigma_1 \le \sqrt{k^2 c_{in}}$.
For $\sigma_2$ we need to consider that each column in the matrix represents a concatenation of $c_{out}$ convolutional filters. Then it is straightforward to derive that $\sigma_2 = \sqrt{k^2 c_{out}}$.
Furthermore, it is trivial to show that $\sigma_* = 1$. The theorem extends trivially to general $\sigma$ by considering that $U = \sigma X$. ∎
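The values of $\sigma_1$, $\sigma_2$ and $\sigma_*$ derived above can be verified numerically on the sparsity pattern of a 1d convolutional-like matrix (a sketch with $\sigma = 1$; in 1d the filter support is $k$ rather than $k^2$, and `conv_like_pattern` is our own helper):

```python
import numpy as np

def conv_like_pattern(n, c_in, c_out, k):
    """0/1 sparsity pattern of a 1d convolutional-like layer with circular
    padding: each (output, input) channel pair contributes a band of width k."""
    B = np.zeros((c_out * n, c_in * n))
    for o in range(c_out):
        for c in range(c_in):
            for i in range(n):
                B[o * n + i, c * n + (i + np.arange(k)) % n] = 1.0
    return B

n, c_in, c_out, k = 16, 3, 5, 4
B = conv_like_pattern(n, c_in, c_out, k)
sigma1 = np.max(np.sqrt((B ** 2).sum(axis=1)))  # max row l2 norm
sigma2 = np.max(np.sqrt((B ** 2).sum(axis=0)))  # max column l2 norm
sigma_star = np.max(np.abs(B))

assert np.isclose(sigma1, np.sqrt(k * c_in))   # each row sees k taps per input channel
assert np.isclose(sigma2, np.sqrt(k * c_out))  # each column feeds k outputs per output channel
assert sigma_star == 1.0
```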
7.3 Proof of Theorem 3.3
We start by defining the structure of the 2d convolutional noise matrix $U$. Given $c_{in}$ input channels and $c_{out}$ output channels, the noise matrix is structured as:
We transform this matrix into the Fourier domain to obtain:
Then the resulting block matrices have entries:
In the first line we have used the fact that 2d convolution is diagonalized by the 2d Fourier transform into diagonal matrices. In the second line we have used the fact that this concatenation of diagonal matrices can always be rearranged as a block diagonal matrix with blocks of size $c_{out} \times c_{in}$, one per frequency.
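The diagonalization step can be checked numerically in 1d (a sketch; the 2d case works identically with `fft2`):

```python
import numpy as np

rng = np.random.default_rng(8)
n = 16
h = rng.standard_normal(n)  # filter, zero-padded to length n

# Circulant (circular convolution) matrix with first column h:
# C[i, j] = h[(i - j) mod n], so C @ x is the circular convolution h * x.
C = np.stack([np.roll(h, j) for j in range(n)], axis=1)

x = rng.standard_normal(n)
# The DFT diagonalizes C: convolution becomes entrywise multiplication.
assert np.allclose(np.fft.fft(C @ x), np.fft.fft(h) * np.fft.fft(x))
# Consequently the spectral norm of C is the largest DFT magnitude of h.
assert np.isclose(np.linalg.norm(C, 2), np.max(np.abs(np.fft.fft(h))))
```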
We can now apply the following concentration inequality from Vershynin (2010):
Let $A$ be an $n \times m$ matrix whose entries are independent standard normal random variables. Then for every $t \ge 0$:
$$\mathbb{P}\left(\|A\|_2 > \sqrt{n} + \sqrt{m} + t\right) \le e^{-t^2/2}.$$
We apply this inequality to the block matrices defined above, and obtain the following concentration inequalities:
We then make the following calculations:
We now count the number of block matrices of each type. We assume that the probability of each of the events in (21) is upper bounded by a common value and take a union bound.
We then get:
7.4 Proof of Theorem 4.2
We have to calculate the KL term in Lemma 2.1 with the chosen distributions for $P$ and $Q$, for the value of $\sigma$ chosen in equation (8). We use the fact that both $P$ and $Q$ follow multivariate Gaussian distributions, for which the KL divergence is available in closed form. Theorem 4.2 results from substituting the value of the KL term into Lemma 2.1.
We now present a technical point regarding the parameter $\beta$. Recall that we mentioned in Lemma 2.2 that the bound depends on a normalization of the network layers; we will formalise this concept below. The analysis is identical to the one in Neyshabur et al. (2017a).
Let $\beta = \left(\prod_{i=1}^{d} \|W_i\|_2\right)^{1/d}$ and consider a network with the normalized weights $\widetilde{W}_i = \frac{\beta}{\|W_i\|_2} W_i$. Due to the homogeneity of the ReLU, we have that $f_{\widetilde{w}} = f_w$ for feedforward neural networks with ReLU activations, and so the (empirical and the expected) loss (including the margin loss) is the same for $f_{\widetilde{w}}$ and $f_w$. We can also verify that $\prod_i \|W_i\|_2 = \prod_i \|\widetilde{W}_i\|_2$ and $\frac{\|W_i\|_F}{\|W_i\|_2} = \frac{\|\widetilde{W}_i\|_F}{\|\widetilde{W}_i\|_2}$, and so the excess error in the Theorem statement is also invariant to this transformation. It is therefore sufficient to prove the Theorem only for the normalized weights $\widetilde{w}$, and hence we assume w.l.o.g. that the spectral norm is equal across layers, i.e. for any layer $i$, $\|W_i\|_2 = \beta$.
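The invariance under this normalization is easy to verify numerically (a sketch with a toy ReLU network; `net` is our own helper):

```python
import numpy as np

rng = np.random.default_rng(6)

def net(weights, x):
    """Feedforward ReLU network: ReLU between layers, linear output."""
    for W in weights[:-1]:
        x = np.maximum(0.0, x @ W)
    return x @ weights[-1]

weights = [rng.standard_normal((8, 16)),
           rng.standard_normal((16, 16)),
           rng.standard_normal((16, 4))]
x = rng.standard_normal((3, 8))

# Rescale each layer to the geometric mean beta of the spectral norms;
# by homogeneity of the ReLU the network output is unchanged.
norms = [np.linalg.norm(W, 2) for W in weights]
beta = np.prod(norms) ** (1.0 / len(weights))
normalized = [beta * W / s for W, s in zip(weights, norms)]

assert np.allclose(net(weights, x), net(normalized, x))
# After normalization every layer has spectral norm exactly beta.
assert np.allclose([np.linalg.norm(W, 2) for W in normalized], beta)
```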
In the previous derivations we have set $\sigma$ according to $\beta$. More precisely, since the prior $P$ cannot depend on the learned predictor $w$ or its norm, we will set $\sigma$ based on an approximation $\tilde{\beta}$. For each value of $\tilde{\beta}$ on a pre-determined grid, we will compute the PAC-Bayes bound, establishing the generalization guarantee for all $\beta$ for which $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and ensuring that each relevant value of $\beta$ is covered by some $\tilde{\beta}$ on the grid. We will then take a union bound over all $\tilde{\beta}$ on the grid. In the previous derivations we considered a fixed $\tilde{\beta}$ and the $\beta$ for which $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$, and hence $\frac{1}{e}\beta^{d-1} \le \tilde{\beta}^{d-1} \le e\,\beta^{d-1}$.
Finally, we need to take a union bound over the different choices of $\tilde{\beta}$. Let us see how many choices of $\tilde{\beta}$ we need to ensure that we always have a $\tilde{\beta}$ in the grid s.t. $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$. We only need to consider values of $\beta$ in the range $\left(\frac{\gamma}{2B}\right)^{1/d} \le \beta \le \left(\frac{\gamma\sqrt{m}}{2B}\right)^{1/d}$. For $\beta$ outside this range the theorem statement holds trivially: recall that the LHS of the theorem statement, $L_0(f_w)$, is always bounded by 1. If $\beta^d < \frac{\gamma}{2B}$, then for any $x$ we have $|f_w(x)| \le \beta^d B < \frac{\gamma}{2}$, and therefore $\hat{L}_\gamma = 1$. Alternatively, if $\beta^d > \frac{\gamma\sqrt{m}}{2B}$, then the second term in equation (1) is greater than one. Hence, we only need to consider values of $\beta$ in the range discussed above. Since we need $|\beta - \tilde{\beta}| \le \frac{1}{d}\beta$ to be satisfied, the size of the cover we need to consider is bounded by $d\,m^{\frac{1}{2d}}$. Taking a union bound over the choices of $\tilde{\beta}$ in this cover and using the bound in equation (33) gives us the theorem statement.