PAC-Bayesian Margin Bounds for Convolutional Neural Networks - Technical Report

12/30/2017 ∙ by Konstantinos Pitas, et al. ∙ EPFL

Recently the generalisation error of deep neural networks has been analysed through the PAC-Bayesian framework, for the case of fully connected layers. We adapt this approach to the convolutional setting.


1 Introduction

Recent work by Zhang et al. (2016) showed experimentally that standard deep convolutional architectures can easily fit a random labelling over unstructured random noise. This challenged the conventional wisdom that the good generalization of DNNs results from implicit regularization through the use of SGD or explicit regularization using dropout and batch normalization, and spurred a considerable amount of research towards rigorously understanding generalization for these now ubiquitous classification architectures.

A key idea dating back to Hochreiter & Schmidhuber (1997) has been that neural networks that generalize correspond to flat minima in the optimisation landscape. This has been observed empirically in Keskar et al. (2016); however, in Dinh et al. (2017) the authors argue that the notion of flatness needs to be carefully defined. One usually proceeds by assuming that a flat minimum can be described with low precision while a sharp minimum requires high precision. This means that we can describe the statistical model corresponding to the flat minimum with few bits. Then one can use a minimum description length (MDL) argument to show generalization Rissanen (1983). Alternatively, one can quantitatively measure the "flatness" of a minimum by injecting noise into the network parameters and measuring the stability of the network output. The more stable a trained network output is to noise, the "flatter" the minimum to which it corresponds, and the better its generalization Shawe-Taylor & Williamson (1997); McAllester (1999). Using this measure of "flatness", Neyshabur et al. (2017a) proposed a generalization error (GE) bound for deep fully connected networks.

The bound of Neyshabur et al. (2017a) depends linearly on the latent ambient dimensionality of the hidden layers. For convolutional architectures the ambient dimensionality of the convolution operators is huge, while the effective number of parameters is much smaller. Therefore, with a careful analysis one can derive generalization bounds that depend on the intrinsic layer dimensionality, leading to much tighter bounds. In a recent work, Arora et al. (2018) explore a similar idea. They first compress a neural network, removing redundant parameters with little or no degradation in accuracy, and then derive a generalization bound on the new compressed network. This results in much tighter bounds than previous works.

Contributions of this paper.

  • We apply a sparsity based analysis on convolutional-like layers of DNNs. We define convolutional-like layers as layers with a sparse banded structure similar to convolutions but without weight sharing. We show that the implicit sparsity of convolutional-like layers significantly reduces the capacity of the network, resulting in a much tighter GE bound.

  • We then extend our results to true convolutional layers, with weight sharing, finding again tighter GE bounds. Surprisingly, the bounds for convolutional-like and convolutional layers are of the same order of magnitude.

  • For completeness, we propose to sparsify fully connected layers, in line with the work of Arora et al. (2018), in order to reduce the number of effective network parameters. We then derive improved GE bounds based on the reduced number of parameters.

Other related works. A number of recent works have tried to analyze the generalization of deep neural networks. These include margin approaches such as Sokolić et al. (2016); Bartlett et al. (2017). Other lines of inquiry have investigated the data dependent stability of SGD Kuzborskij & Lampert (2017), as well as the implicit bias of SGD over separable data Soudry et al. (2017); Neyshabur et al. (2017b).

2 Preliminaries

We now expand on the PAC-Bayes framework. Specifically, let f_w be any predictor (not necessarily a neural network) learned from the training data and parameterized by w. We assume a prior distribution P over the parameters, which should be a proper Bayesian prior and cannot depend on the training data. We also assume a posterior over the predictors of the form Q = w + u, where u is a random variable whose distribution can depend on the training data. Then with probability at least 1 - δ over a training set of size m we get:

E_u[L_0(f_{w+u})] ≤ E_u[\hat{L}_0(f_{w+u})] + \sqrt{ (KL(w+u \| P) + \ln(2m/δ)) / (2(m-1)) }        (1)

Notice that the above gives a generalization result over a distribution of predictors. We now restate a useful lemma from Neyshabur et al. (2017a) which can be used to give a generalization result for a single predictor instance.

Lemma 2.1.

Let f_w be any predictor (not necessarily a neural network) with parameters w, and P be any distribution on the parameters that is independent of the training data. Then, with probability 1 - δ over the training set of size m, for any w and any random perturbation u s.t. P_u( max_x \|f_{w+u}(x) - f_w(x)\|_∞ < γ/4 ) ≥ 1/2, we have:

L_0(f_w) ≤ \hat{L}_γ(f_w) + C_1 \sqrt{ (KL(w+u \| P) + \ln(C_2 m / δ)) / (m-1) }        (2)

where C_1, C_2 are constants.

Let's look at some intuition behind this bound. It links the empirical margin risk \hat{L}_γ(f_w) of the predictor to the true risk L_0(f_w), for a specific predictor and not a posterior distribution of predictors. We have also moved to using a margin based loss; this is an essential step in order to remove the posterior assumption. The perturbation u quantifies how the true risk would be affected by choosing a bad predictor. The condition P_u( max_x \|f_{w+u}(x) - f_w(x)\|_∞ < γ/4 ) ≥ 1/2 can be interpreted as choosing a posterior with small variance, sufficiently concentrated around the current empirical estimate w, so that we can remove the randomness assumption with high confidence.

How small should we choose the variance of u? The choice is complicated because the KL term in the bound is inversely proportional to the variance of the perturbation. Therefore we need to find the largest possible variance for which our stability condition holds.

Figure 1: Overlap between P and Q: We see that as the variance increases, the KL divergence between the two distributions decreases.
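As a concrete illustration of this trade-off, the following sketch (our own illustration, not taken from the paper; the layer sizes, the margin γ = 1 and the grid of noise levels are arbitrary choices) estimates, for a small random ReLU network, the largest entrywise noise standard deviation σ for which the perturbed output stays within γ/4 of the clean output with empirical probability at least 1/2.

import numpy as np

rng = np.random.default_rng(0)

def relu_net(x, weights):
    """Forward pass of a fully connected ReLU network (no biases)."""
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

# Toy network and data (hypothetical sizes, for illustration only).
dims = [20, 50, 50, 10]
weights = [rng.standard_normal((dims[i + 1], dims[i])) / np.sqrt(dims[i])
           for i in range(len(dims) - 1)]
X = rng.standard_normal((100, 20))   # stand-in "training" inputs
gamma = 1.0                          # margin, chosen arbitrarily

def stability_prob(sigma, n_trials=200):
    """Empirical P_u( max_x ||f_{w+u}(x) - f_w(x)||_inf < gamma/4 )."""
    clean = np.array([relu_net(x, weights) for x in X])
    hits = 0
    for _ in range(n_trials):
        noisy_w = [W + sigma * rng.standard_normal(W.shape) for W in weights]
        noisy = np.array([relu_net(x, noisy_w) for x in X])
        if np.max(np.abs(noisy - clean)) < gamma / 4:
            hits += 1
    return hits / n_trials

# Largest sigma on a grid for which the stability condition still holds.
for sigma in [0.001, 0.003, 0.01, 0.03, 0.1]:
    print(f"sigma={sigma:6.3f}  P(stable)={stability_prob(sigma):.2f}")

The largest σ on the grid that still satisfies the condition is the one we would then plug into the KL term.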

The basis of our analysis is the following perturbation bound from Neyshabur et al. (2017a) on the output of a DNN:

Lemma 2.2.

(Perturbation Bound). For any B, d > 0, let f_w be a d-layer network with ReLU activations. Then for any w, any input x with \|x\|_2 ≤ B, and any perturbation u = vec({U_i}_{i=1}^d) such that \|U_i\|_2 ≤ (1/d)\|W_i\|_2, the change in the output of the network can be bounded as follows:

\|f_{w+u}(x) - f_w(x)\|_2 ≤ e B β^{d-1} Σ_{i=1}^d \|U_i\|_2        (3)

where β = \|W_i\|_2 and B are considered as constants after an appropriate normalization of the layer weights.

We note that correctly estimating the spectral norm of the perturbation at each layer is critical to obtaining a tight bound. Specifically, if we exploit the structure of the perturbation we can significantly increase the variance of the added perturbation for which our stability condition holds.
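The toy comparison below (ours; the width n = 400 and the 1d filter support k = 9 are made up for illustration) draws a dense Gaussian perturbation and a perturbation restricted to a convolution-like band, with the same per-entry variance, and compares their spectral norms; the structured perturbation is far smaller, which is why a larger entrywise σ can be tolerated before the stability condition breaks.

import numpy as np

rng = np.random.default_rng(1)
n, k = 400, 9          # layer width and 1d filter support (hypothetical)
sigma = 0.1

# Mask of a banded, convolution-like layer: each output sees only k inputs.
mask = np.zeros((n, n))
for i in range(n):
    mask[i, max(0, i - k // 2): min(n, i + k // 2 + 1)] = 1.0

dense_norms, banded_norms = [], []
for _ in range(20):
    U = sigma * rng.standard_normal((n, n))
    dense_norms.append(np.linalg.norm(U, 2))          # spectral norm
    banded_norms.append(np.linalg.norm(U * mask, 2))

print("dense  ||U||_2 ~", np.mean(dense_norms))   # grows like sigma * 2 * sqrt(n)
print("banded ||U||_2 ~", np.mean(banded_norms))  # roughly sigma * (2*sqrt(k) + sqrt(log n))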

We will also use the following definition of a network sparsification which will prove useful later:

Definition 2.1.

(ε, s)-sparsification. Let f_w be a classifier; we say that f_{w̃} is an (ε, s)-sparsification of f_w if for any x in the training set we have, for all labels y,

|f_w(x)[y] - f_{w̃}(x)[y]| ≤ ε        (4)

and each fully connected layer in f_{w̃} has at most s non-zero entries along each row and column.

We will assume for simplicity that we can always obtain an (ε, s)-sparsification of f_w, see for example Han et al. (2015); Dong et al. (2017); Pitas et al. (2018), and will therefore use f_w and f_{w̃} interchangeably.
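A crude magnitude-pruning heuristic in the spirit of Definition 2.1 (a sketch of ours; it is not the pruning scheme of the cited works, and whether the resulting output change ε is small has to be checked on the training set) keeps only the s largest-magnitude entries in every row and column of a fully connected weight matrix:

import numpy as np

def sparsify_rows_cols(W, s):
    """Keep at most s non-zero entries per row and per column of W
    (largest magnitudes first); a crude (eps, s)-sparsification heuristic."""
    W_s = np.zeros_like(W)
    # Keep the s largest-magnitude entries in each row.
    top_cols = np.argsort(-np.abs(W), axis=1)[:, :s]
    for i, cols in enumerate(top_cols):
        W_s[i, cols] = W[i, cols]
    # Then enforce the column constraint on what survived.
    for j in range(W.shape[1]):
        col = np.abs(W_s[:, j])
        if np.count_nonzero(col) > s:
            keep = np.argsort(-col)[:s]
            mask = np.zeros(W.shape[0], dtype=bool)
            mask[keep] = True
            W_s[~mask, j] = 0.0
    return W_s

W = np.random.default_rng(2).standard_normal((64, 64))
W_s = sparsify_rows_cols(W, s=8)
print("max nonzeros per row:", np.count_nonzero(W_s, axis=1).max())
print("max nonzeros per col:", np.count_nonzero(W_s, axis=0).max())
print("relative error:", np.linalg.norm(W - W_s) / np.linalg.norm(W))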

3 Layerwise Perturbations

We need to find the maximum variance σ² for which the stability condition of Lemma 2.1 holds. For this we present the following lemmas, which bound the spectral norm of the noise at each layer. Specifically, we assume a variance level σ² for the noise applied to each DNN parameter; based on the sparsity structure of a given layer we obtain noise matrices with different structure and corresponding concentration inequalities for their spectral norm. In the following we omit logarithmic factors for clarity.

3.1 Fully Connected Layers

We start with fully connected layers that have been sparsified to a sparsity level s. After some calculations we get the following:

Theorem 3.1.

Let U be the perturbation matrix of a fully connected layer, with i.i.d. N(0, σ²) entries on its support and with row and column sparsity equal to s. Then the spectral norm of U behaves like:

(5)

Note that in the above theorem, for fully connected layers without sparsity, we can set s equal to the full layer width.
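As a quick numerical illustration of Theorem 3.1 (our own check, with an arbitrary width n = 512 and a circulant support pattern chosen so that every row and column has exactly s non-zeros), the sketch below tracks how the average spectral norm of such a sparse Gaussian matrix grows with s; up to logarithmic factors it should grow like the square root of s rather than like the ambient width.

import numpy as np

rng = np.random.default_rng(3)
n, sigma, trials = 512, 1.0, 10

for s in [4, 16, 64, 256]:
    norms = []
    for _ in range(trials):
        U = np.zeros((n, n))
        # Place s non-zeros per row on a circulant pattern so that every
        # column also receives exactly s non-zeros.
        for i in range(n):
            cols = (i + np.arange(s)) % n
            U[i, cols] = sigma * rng.standard_normal(s)
        norms.append(np.linalg.norm(U, 2))
    print(f"s={s:4d}  mean ||U||_2 = {np.mean(norms):7.2f}  "
          f"(compare sqrt(s) = {np.sqrt(s):5.2f})")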

Figure 2: Structured Layers: We separate two cases in our analysis: a) Convolutional-like layers, where the matrix has a sparse banded structure but there is no weight sharing. b) Convolutional layers, where the matrix represents convolutions and the weights are shared.

3.2 Convolutional-Like Layers

We now analyze convolutional-like layers. These are layers that have a sparse banded structure similar to convolutions, with the simplifying assumption that the weights of the translated filters are not shared. This results in a matrix with the banded structure plotted in Figure 2.a. While this type of layer is purely theoretical and does not directly correspond to practically applied layers, it is nevertheless useful for isolating the effect of sparsity on the generalization error. After some calculations we get the following:

Theorem 3.2.

Let U be the perturbation matrix of a 2d convolutional-like layer with c_in input channels, c_out output channels, and k × k convolutional filters, with i.i.d. N(0, σ²) entries on its support. Then the spectral norm of U behaves like:

(6)

We see that the spectral norm of the noise is, up to logarithmic factors, independent of the dimensions of the latent feature maps and of the ambient layer dimensionality. The spectral norm is a function of the square root of the filter support k², the number of input channels c_in, and the number of output channels c_out.
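The sketch below (ours; it uses 1d circular convolutions and made-up values of the filter length and channel counts to keep the matrices small) materializes the banded matrix of a convolutional-like layer with independent Gaussian entries and no weight sharing, and checks that the spectral norm stays roughly flat as the feature-map length n grows, in line with Theorem 3.2.

import numpy as np

rng = np.random.default_rng(4)

def conv_like_noise(n, c_in, c_out, k, sigma=1.0):
    """Banded 'convolutional-like' perturbation: each (output channel,
    input channel) block is banded with independent entries (no weight
    sharing), filter length k, circular boundary conditions."""
    U = np.zeros((c_out * n, c_in * n))
    for o in range(c_out):
        for i in range(c_in):
            block = np.zeros((n, n))
            for row in range(n):
                cols = (row + np.arange(k)) % n
                block[row, cols] = sigma * rng.standard_normal(k)
            U[o * n:(o + 1) * n, i * n:(i + 1) * n] = block
    return U

k, c = 3, 8
for n in [16, 32, 64, 128]:                      # feature-map length
    norm = np.mean([np.linalg.norm(conv_like_noise(n, c, c, k), 2)
                    for _ in range(5)])
    print(f"n={n:4d}  mean ||U||_2 = {norm:6.2f}   (roughly flat in n)")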

3.3 Convolutional Layers

We extend our analysis to true 2d convolutional layers. After some calculations we get the following:

Theorem 3.3.

Let U be the perturbation matrix of a 2d convolutional layer with c_in input channels, c_out output channels, k × k convolutional filters, and n × n feature maps, with i.i.d. N(0, σ²) entries on its free parameters. Then the spectral norm of U behaves like:

(7)

We see that again, up to logarithmic factors, the spectral norm of the noise is independent of the dimensions of the latent feature maps. The spectral norm is a function of the square root of the filter support k², the number of input channels c_in and the number of output channels c_out. The additional factor on the concentration probability, which grows with the feature-map size, implies a less tight concentration than for the convolutional-like layer. This is to be expected, as the layer has fewer parameters due to weight sharing. We also see that, up to logarithmic factors, the expected value of the spectral norm is the same for convolutional-like and convolutional layers. This is somewhat surprising, as it implies similar generalization error bounds with and without weight sharing, contrary to conventional wisdom about DNN design.
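For a matching check with weight sharing (again our own 1d circular sketch with arbitrary sizes, not the paper's 2d experiment), we can materialize the dense operator of a true convolutional layer whose filters are i.i.d. Gaussian; comparing with the convolutional-like construction above at the same n, c and k, one expects means of the same order but visibly larger fluctuations here, mirroring the weaker concentration of Theorem 3.3.

import numpy as np

rng = np.random.default_rng(5)

def conv_noise(n, c_in, c_out, k, sigma=1.0):
    """Dense matrix of a circular 1d convolutional layer whose filters are
    i.i.d. N(0, sigma^2) -- i.e. a true convolutional perturbation."""
    U = np.zeros((c_out * n, c_in * n))
    for o in range(c_out):
        for i in range(c_in):
            w = sigma * rng.standard_normal(k)        # shared filter
            block = np.zeros((n, n))
            for row in range(n):
                block[row, (row + np.arange(k)) % n] = w
            U[o * n:(o + 1) * n, i * n:(i + 1) * n] = block
    return U

n, k, c, trials = 64, 3, 8, 20
norms = [np.linalg.norm(conv_noise(n, c, c, k), 2) for _ in range(trials)]
print("true conv:  mean ||U||_2 =", np.mean(norms), "  std =", np.std(norms))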

4 Generalization Bound

We now proceed to find the maximum value of the variance parameter σ². For this we use the following lemma:

Lemma 4.1.

(Perturbation Bound). For any B, d > 0, let f_w be a d-layer network with ReLU activations, and denote by C the set of convolutional layers and by D the set of dense layers. Then for any w, any input x with \|x\|_2 ≤ B, and a perturbation u = vec({U_i}_{i=1}^d) with i.i.d. N(0, σ²) entries at each layer, for any σ with

(8)

we have:

(9)

where β and B are considered as constants after an appropriate normalization of the layer weights.

While we have deferred all other proofs to the Appendix, we describe the proof of this lemma in detail as it is crucial for understanding the GE bound.

Proof.

We denote by C the set of convolutional layers and by D the set of dense layers, and assume |C| + |D| = d, where d is the total number of layers. We then require that the failure probability of each of the events (5) and each of the events (7) is upper bounded by 1/(2d), so that a union bound over the d layers yields an overall failure probability of at most 1/2. Taking this union bound, after some calculations we obtain that, with probability at least 1/2:

(10)

We are then ready to apply this result directly to Lemma 2.2. We calculate that, with probability at least 1/2:

(11)

We have now found a bound on the perturbation at the final layer of the network as a function of σ, holding with probability at least 1/2. What remains is to find the specific value of σ such that this perturbation is at most γ/4. We calculate:

(12)

We are now ready to state our main result. It follows directly from calculating the KL term in Lemma 2.1, where the prior P and the perturbation u are zero-mean Gaussians with variance σ² per parameter, for the variance value chosen in equation (8):

Theorem 4.2.

(Generalization Bound). For any B, d > 0, let f_w be a d-layer network with ReLU activations. Then for any δ, γ > 0, with probability 1 - δ over a training set of size m, we have:

(13)

where .

We can now compare this result with the previous one by Neyshabur et al. (2017a). For the case where all layers are sparse with sparsity s, ignoring logarithmic factors, our bound scales with the sparsity s of each layer. For the case where all the layers are convolutional, ignoring logarithmic factors, our bound scales with the filter support and the number of channels. For the same cases, the bound of Neyshabur et al. (2017a) scales with the ambient dimensionality of the layers. We see that for convolutional layers our bound is orders of magnitude tighter, and similarly for sparse fully connected layers.
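To get a feel for the difference in scale, the back-of-the-envelope computation below uses made-up but typical layer shapes (a 3×3, 256-to-256-channel layer on 56×56 feature maps; these numbers are not taken from the paper's tables) and compares the ambient dimensionality of the unrolled convolution operator with the number of parameters that actually enter our bound.

# Hypothetical VGG-style layer: 3x3 filters, 256 -> 256 channels, 56x56 maps.
k, c_in, c_out, n = 3, 256, 256, 56

ambient_rows = c_out * n * n        # rows of the unrolled convolution operator
ambient_cols = c_in * n * n         # columns of the unrolled operator
effective = k * k * c_in * c_out    # free parameters with weight sharing

print(f"unrolled operator : {ambient_rows} x {ambient_cols} "
      f"(~{ambient_rows * ambient_cols:.1e} entries)")
print(f"effective params  : {effective:.1e}")
print(f"ratio             : {ambient_rows * ambient_cols / effective:.1e}")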

5 Experiments

5.1 Concentration Bounds

In this section we present experiments that validate the proposed concentration bounds for convolutional-like and convolutional layers. We assume 2d convolutions and calculate theoretically and experimentally the spectral norm of the random perturbation matrix U. We increase the number of input and output channels, assuming that c_in = c_out. To find empirical estimates we average the results over a number of iterations for each choice of c_in and c_out. We plot the results in Figure 3. We see that the results for the expected value deviate by some logarithmic factors, while the bounds correctly capture the growth rate of the expected spectral norm as the parameters increase. Furthermore, the empirical estimates validate the prediction that the norm will be less concentrated around the mean for the true convolutional layer.

Figure 3: Empirical vs Theoretical Spectral Norms: We plot the empirical (solid lines) and theoretical (dashed lines) estimates of the spectral norm of the perturbation. We also shade the area between the maximum and minimum empirical values. We see that the bounds are tight up to logarithmic factors and correctly capture the norm growth as the number of channels increases.

5.2 Generalization Error Bounds

For each layer we define a layerwise constant based on the expected value of the spectral norm of the layer perturbation, under the different estimates of this norm discussed above. We then plot these constants on a log scale for different layers of some common DNN architectures. The intuition behind this plot is that if we consider a network where all layers are the same, the bound of (13) scales with the common layerwise constant. We experiment on LeNet-5 for the MNIST dataset, and on AlexNet and VGG-16 for the ImageNet dataset. We see that for convolutional layers the original approach of Neyshabur et al. (2017a) gives overly pessimistic estimates, orders of magnitude higher than our approach. We also see that for a large number of input and output channels the convolutional-like estimate is approximately the same as the convolutional estimate.

We also take a closer look at the sample complexity estimates for the above architectures, under an additional simplifying assumption. We plot the results in Table 1. We obtain results orders of magnitude tighter than the previous PAC-Bayesian approach.


LeNet-5 AlexNet VGG-16
Neyshabur et al. (2017a)
Ours
Table 1: Generalization error bounds for common feedforward architectures.

An interesting observation is that the original approach estimates the sample complexity of VGG-16 to be one order of magnitude larger than that of AlexNet, even though both are trained on the same dataset (ImageNet) and do not overfit. By contrast, our bound estimates approximately the same sample complexity for both architectures, even though they differ significantly in ambient architecture dimensions. We also observe that our bound results in improved estimates when the spatial support of the filters is small relative to the dimensions of the feature maps. For the LeNet-5 architecture, where the support of the filters is large relative to the ambient dimension, the benefit from our approach is small, as the convolutional layers can be adequately modeled as dense matrices.

It must be noted here that this assumption is quite strong and the bound is still worse than naive parameter counting when applied to real trained networks.

(a) Layerwise constants for the LeNet-5 architecture.
(b) Layerwise constants for the AlexNet architecture.
(c) Layerwise constants for the VGG-16 architecture.
Figure 4: Layerwise constants for common feedforward architectures: With yellow we plot the estimation based on the ambient dimensionality. With magenta we plot the estimation based on sparsity; all layers have been compressed to a fixed sparsity level. With red we plot the convolutional-like estimation. With purple we plot the convolutional estimation. We see that the layerwise constants are improved by orders of magnitude. Furthermore, we notice that for convolutional layers the improvement is more significant for deeper architectures, where the spatial support of the convolutional filters is much smaller than the size of the feature maps. For fully connected layers, somewhat disappointingly, the improvement is not very large.

6 Discussion

We have presented a new PAC-Bayes bound for deep convolutional neural networks. By decoupling the analysis from the ambient dimension of convolutional layers we manage to obtain bounds orders of magnitude tighter than existing approaches. We present numerical experiments on common feedforward architectures that corroborate our theoretical analysis.

References

7 Appendix

In the derivations below we will rely upon the following useful theorem for the concentration of the spectral norm of sparse random matrices, from Bandeira et al. (2016):

Theorem 7.1.

Let X be an n × m random rectangular matrix with X_{ij} = b_{ij} g_{ij}, where g_{ij} are independent standard Gaussian random variables and b_{ij} are scalars. Then:

P( \|X\|_2 ≥ (1+ε)(σ_1 + σ_2) + t ) ≤ min(n, m) e^{ -t² / (c_ε σ_*²) }        (14)

for any ε ∈ (0, 1/2] and t ≥ 0, where c_ε is a constant depending only on ε, and with:

σ_1 = max_i \sqrt{ Σ_j b_{ij}² },   σ_2 = max_j \sqrt{ Σ_i b_{ij}² },   σ_* = max_{i,j} |b_{ij}|        (15)

7.1 Proof of Theorem 3.1

For σ = 1 we get the result trivially from Theorem 7.1 by assuming:

σ_1 = \sqrt{s},   σ_2 = \sqrt{s},   σ_* = 1        (16)

We can extend the result to arbitrary σ by considering that \|σ U\|_2 = σ \|U\|_2.

7.2 Proof of Theorem 3.2

Proof.

We will consider first the case σ = 1. A convolutional layer is characterised by its output channels. For each output channel, each input channel is convolved with an independent filter; the output of that output channel is then the sum of the results of these convolutions. We consider convolutional-like layers, i.e. the layers are banded but the entries are independent and there is no weight sharing. We plot the implied structure for the case of one-dimensional signals in Figure 5.

Figure 5: Structure of a 1d convolutional-like layer with 2 input channels and 2 output channels.

We need to evaluate σ_1, σ_2 and σ_* for this matrix, where:

(17)

Below we plot what these sums represent:

Figure 6: The row and column sums corresponding to σ_1 and σ_2.

For σ_1 we can find an upper bound by considering that the sum for a given filter and a given pixel location represents the maximum number of overlaps over all 2d shifts. For the 2d case this is equal to the support of the filters, and we also need to account for all input channels. We then get:

(18)

Figure 7: Possible shifts with overlap: With blue we plot a 2d filter and 3 filters that overlap with its bottom right pixel. In purple we plot the box denoting the boundaries of the set of all shifted filters that overlap with the bottom right pixel.

For σ_2 we need to consider that each column in the matrix represents a concatenation of convolutional filters. Then it is straightforward to derive that:

(19)

Furthermore, it is trivial to show that σ_* = 1. The theorem extends trivially to arbitrary σ by considering that \|σ U\|_2 = σ \|U\|_2. ∎
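The quantities σ_1, σ_2 and σ_* of Theorem 7.1 can also be read off numerically from the support pattern. The helper below (ours, for a 1d circular convolutional-like mask with arbitrary sizes) computes them directly from their definitions, which is a convenient sanity check on the counting arguments used in this proof.

import numpy as np

def sigma_quantities(B):
    """sigma_1, sigma_2, sigma_* of Theorem 7.1 for a matrix of scale
    factors B (here: the 0/1 support mask of the layer)."""
    s1 = np.sqrt((B ** 2).sum(axis=1)).max()   # max row norm
    s2 = np.sqrt((B ** 2).sum(axis=0)).max()   # max column norm
    s_star = np.abs(B).max()
    return s1, s2, s_star

def conv_like_mask(n, c_in, c_out, k):
    """0/1 support of a circular 1d convolutional(-like) layer."""
    B = np.zeros((c_out * n, c_in * n))
    for o in range(c_out):
        for i in range(c_in):
            for row in range(n):
                B[o * n + row, i * n + (row + np.arange(k)) % n] = 1.0
    return B

n, c_in, c_out, k = 32, 4, 6, 3
print(sigma_quantities(conv_like_mask(n, c_in, c_out, k)))
# Row norm: each output position sees k entries from each of c_in channels.
# Column norm: each input position feeds k positions in each of c_out channels.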

7.3 Proof of Theorem 3.3

Proof.

Figure 8: Concatenation of diagonal matrices: We see that the concatenation of diagonal matrices can always be rearranged into a block diagonal matrix.

We start by defining the structure of the 2d convolutional noise matrix U. Given c_in input channels and c_out output channels, the noise matrix is structured as:

(20)

We transform this matrix into the Fourier domain to obtain:

(21)

where:

(22)

and:

(23)

Then the resulting block matrices have entries:

(24)

In the first step we have used the fact that 2d convolution is diagonalized by the 2d Fourier transform into diagonal matrices. In the second step we have used the fact that this concatenation of diagonal matrices can always be rearranged as a block diagonal matrix (Figure 8).

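The diagonalization step can be checked numerically. The sketch below (ours) builds the circulant matrix of a circular 1d convolution and verifies that the unitary DFT matrix diagonalizes it, with the diagonal given by the FFT of the filter; this is the 1d analogue of the 2d Fourier argument used here.

import numpy as np

rng = np.random.default_rng(6)
n, k = 16, 3
w = rng.standard_normal(k)

# Circulant matrix of a circular 1d convolution with filter w
# (first column is w padded with zeros).
C = np.zeros((n, n))
for row in range(n):
    C[row, (row - np.arange(k)) % n] = w

F = np.fft.fft(np.eye(n)) / np.sqrt(n)   # unitary DFT matrix
D = F @ C @ F.conj().T                   # should be (numerically) diagonal

print("max off-diagonal magnitude:",
      np.abs(D - np.diag(np.diag(D))).max())          # essentially zero
print("diagonal equals FFT of the filter:",
      np.allclose(np.diag(D), np.fft.fft(C[:, 0])))   # True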
We can now apply the following concentration inequality from Vershynin (2010):

Theorem 7.2.

Let A be an n × m matrix whose entries are independent standard normal random variables. Then for every t ≥ 0:

P( \|A\|_2 > \sqrt{n} + \sqrt{m} + t ) ≤ 2 e^{-t²/2}        (25)

on the blocks of the rearranged matrix. We obtain the following concentration inequalities:

(26)

with:

(27)

We then make the following calculations:

(28)

since:

(29)

We now notice that there is one such block for each spatial frequency of the feature maps. We assume that the failure probability of each of the events in (26) is upper bounded appropriately and take a union bound.

We then get:

(30)

7.4 Proof of Theorem 4.2

Proof.

We have to calculate the KL term in Lemma 2.1 with the chosen distributions for P and Q, for the value of σ given by:

(31)

We get:

(32)

with the notation as above, where we have used the fact that both P and Q follow multivariate Gaussian distributions. Theorem 4.2 results from substituting the value of the KL term into Lemma 2.1:

(33)
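For reference, with a prior N(0, σ²I) and a posterior centred at w with the same isotropic covariance, the KL term reduces to \|w\|² / (2σ²). The snippet below (ours) checks this closed form against a direct evaluation of the general Gaussian KL formula; the dimension and σ are arbitrary.

import numpy as np

rng = np.random.default_rng(7)
d, sigma = 50, 0.1
w = rng.standard_normal(d)

# KL( N(w, sigma^2 I) || N(0, sigma^2 I) ): with equal isotropic covariances
# the trace and log-det terms cancel and only the mean term survives.
kl_closed_form = np.dot(w, w) / (2 * sigma ** 2)

# Direct evaluation of the general Gaussian KL formula.
Sigma_p = Sigma_q = sigma ** 2 * np.eye(d)
kl_general = 0.5 * (np.trace(np.linalg.solve(Sigma_p, Sigma_q))
                    + w @ np.linalg.solve(Sigma_p, w)
                    - d
                    + np.log(np.linalg.det(Sigma_p) / np.linalg.det(Sigma_q)))

print(kl_closed_form, kl_general)   # identical up to floating point error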

We now present a technical point regarding the parameter β. Recall that we mentioned in Lemma 2.2 that β depends on a normalization of the network layers; we formalise this concept below. The analysis is identical to the one in Neyshabur et al. (2017a).

Let β = (∏_i \|W_i\|_2)^{1/d} and consider a network with the normalized weights W̃_i = (β / \|W_i\|_2) W_i. Due to the homogeneity of the ReLU, we have that f_{w̃} = f_w for feedforward neural networks with ReLU activations, and so the (empirical and the expected) loss (including the margin loss) is the same for f_w and f_{w̃}. We can also verify that ∏_i \|W_i\|_2 = ∏_i \|W̃_i\|_2 and that the norm-dependent terms in the bound are unchanged, and so the excess error in the Theorem statement is also invariant to this transformation. It is therefore sufficient to prove the Theorem only for the normalized weights w̃, and hence we assume w.l.o.g. that the spectral norm is equal across layers, i.e. for any layer i, \|W_i\|_2 = β.
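A quick numerical check of this invariance (our own sketch, with arbitrary layer sizes and no biases): rescaling every weight matrix to have spectral norm β = (∏_i \|W_i\|_2)^{1/d} leaves the function computed by the ReLU network unchanged.

import numpy as np

rng = np.random.default_rng(8)
dims = [10, 30, 30, 5]                      # arbitrary layer sizes
Ws = [rng.standard_normal((dims[i + 1], dims[i])) for i in range(len(dims) - 1)]

def relu_net(x, weights):
    h = x
    for W in weights[:-1]:
        h = np.maximum(W @ h, 0.0)
    return weights[-1] @ h

spec = [np.linalg.norm(W, 2) for W in Ws]
beta = np.prod(spec) ** (1.0 / len(Ws))           # geometric mean of spectral norms
Ws_tilde = [(beta / s) * W for W, s in zip(Ws, spec)]

x = rng.standard_normal(dims[0])
print("outputs equal:", np.allclose(relu_net(x, Ws), relu_net(x, Ws_tilde)))
print("normalized spectral norms:",
      [round(np.linalg.norm(W, 2), 4) for W in Ws_tilde])  # all equal to beta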

In the previous derivations we have set σ according to β. More precisely, since the prior cannot depend on the learned predictor w or its norm, we will set σ based on an approximation β̃. For each value of β̃ on a pre-determined grid, we will compute the PAC-Bayes bound, establishing the generalization guarantee for all w for which |β − β̃| ≤ (1/d)β, and ensuring that each relevant value of β is covered by some β̃ on the grid. We will then take a union bound over all β̃ on the grid. In the previous derivations we have considered a fixed β̃ and those w for which |β − β̃| ≤ (1/d)β, and hence β̃^{d−1} is within a constant factor of β^{d−1}.

Finally, we need to take a union bound over the different choices of β̃. Let us see how many choices of β̃ we need to ensure that we always have a β̃ in the grid s.t. |β − β̃| ≤ (1/d)β. We only need to consider values of β in the range (γ/(2B))^{1/d} ≤ β ≤ (γ\sqrt{m}/(2B))^{1/d}. For β outside this range the theorem statement holds trivially: recall that the LHS of the theorem statement, L_0(f_w), is always bounded by 1. If β^d < γ/(2B), then for any x, |f_w(x)| ≤ β^d B ≤ γ/2 and therefore \hat{L}_γ = 1. Alternatively, if β^d > γ\sqrt{m}/(2B), then the second term in the bound of Theorem 4.2 is greater than one. Hence, we only need to consider values of β in the range discussed above. Since we need to satisfy |β − β̃| ≤ (1/d)β, the size of the cover we need to consider is bounded by d m^{1/(2d)}. Taking a union bound over the choices of β̃ in this cover and using the bound in equation (33) gives us the theorem statement.