1 Introduction
Standard deep convolutional architectures have the capacity to fit a labelling over random noise (Zhang et al., 2016). At the same time, these same networks generalize well on real image data. This challenges the conventional wisdom that the good generalization properties of deep convolutional neural networks result from restricted capacity, whether through implicit regularization by stochastic gradient descent (SGD) or explicit regularization by dropout and batch normalization (Zhang et al., 2016). In fact, any measure of model complexity which is uniform across all functions representable by a given architecture is doomed to provide contradictory measurements (Bartlett et al., 2017; Arora et al., 2018; Neyshabur et al., 2015). A good measure of complexity should allow for high complexity models for difficult datasets (random noise) and low complexity models for easier datasets (real images).
Researchers have recently focused on spectral complexity (Bartlett et al., 2017) normalized by the margin. Different techniques include robustness, PAC-Bayes and Rademacher complexity (Sokolić et al., 2016; Bartlett et al., 2017; Neyshabur et al., 2017a, 2015; Golowich et al., 2017). Spectral complexity has been shown to correlate with the generalization error (GE) in a number of works (Neyshabur et al., 2017b; Bartlett et al., 2017). For the same network, this measure of model complexity has high values when the network is trained on data with random labels and considerably lower values for real labels. The measure can also be linked to the flatness of the minimum represented by a specific network instantiation (Wang et al., 2018). Flat minima have also been shown empirically to correlate with better generalization (Keskar et al., 2016).
Two competing hypotheses. Nevertheless, spectral complexity based bounds are still vacuous, as demonstrated empirically by Arora et al. (2018). One possible explanation for this shortcoming is that, while derived for general fully connected layers, the bounds are evaluated w.r.t. deep convolutional networks whose weights are significantly more structured. Perhaps due to the nontrivial correlations between filters, the generalization capacity of deep convolutional neural networks has been less studied. Zhou and Feng (2018), whose work was published after our preprint, provide an interesting analysis, but it holds only for sigmoid activations, includes complex instance-dependent network quantities that do not lend themselves easily to intuition, and is ultimately vacuous. A second interesting bound was proposed by Du et al. (2017), though it was derived for a greatly simplified architecture consisting of a single layer and a single convolutional filter.
An alternative hypothesis is that current analyses relying on spectral complexity-type metrics miss a crucial piece of the generalization puzzle. In particular, it is generally agreed that invariance to symmetries in the data is a key property of modern deep convolutional neural networks and should be exploited to improve existing GE bounds. Surprisingly, the role of invariances has seen little attention in the generalization literature: Achille and Soatto (2018) showed that low information content in the network weights corresponds to learning signal representations that are invariant to various nuisance latent parameters. Their work, however, does not result in a meaningful generalization bound. Sokolic et al. (2016) demonstrated that classifiers that are invariant (to a set of discrete transformations of input signals) can potentially have a much lower GE than non-invariant ones.
Contributions. This paper aims to shed light on the limitations of spectral complexity based analyses.

Aiming to test the first hypothesis, we adapt one of the most recent spectral complexity based generalization bounds (Neyshabur et al., 2017a) to the case of deep convolutional neural networks. Interestingly, despite being orders of magnitude tighter than the previous estimate, our bound remains vacuous.

To investigate this limitation, we consider the paradoxical case of locally connected layers, i.e., layers that have the same support as convolutional layers but do not employ weight sharing. Counterintuitively, we find that these enjoy the same generalization guarantees as convolutional layers (up to log factors that are artifacts of the derivation). Our experiments indicate that crucial quantities in the bound are tight, pointing to an inherent shortcoming of the proof technique.

We confirm empirically that the bounds in question fail to capture the invariance properties of convolutional networks to data symmetries, such as elastic deformations and translations, even though these properties are important to generalization. Our experiments suggest that these conclusions are not unique to our approach, but apply to spectral complexity based generalization bounds in general. We conclude that more research should be conducted in this direction.
Notation.
We denote vectors with bold lowercase letters and matrices with bold capital letters. Given two probability measures and over a set we define the Kullback-Leibler divergence as . We denote with the Gaussian kernel.
2 Problem and previous work
Let be a distribution over samples and labels . We consider the standard classification problem in which a class classifier parameterized by is used to map input vectors to a dimensional vector, encoding class membership.
We may encode the confidence of the classifier by incorporating a dependence on a desired margin . Then, the margin classification loss is defined as
Note that we easily recover the standard classification loss definition by setting . Our objective is to obtain bounds of the generalization error GE:
(1) 
where
is the empirical loss computed over a random training set of size . For easy reference, we summarize some of the most crucial definitions in Table 1.
symbol  meaning 

input image size is  
neural network parameterized by  
classes  
training set size  
network depth  
set of fully connected layers  
set of convolutional layers  
set of locally connected layers  
weight matrix of th layer  
filter support in th convolutional layer  
output channels in th convolutional layer 
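As a concrete reading of the margin classification loss introduced earlier in this section, the following minimal sketch computes its empirical version. It assumes the usual form used by Neyshabur et al. (2017a), in which a sample counts as an error when the score of the true class does not exceed the best competing score by more than the margin; the function name `margin_loss` and the toy scores are illustrative and not taken from the paper.

```python
import numpy as np

def margin_loss(scores, labels, gamma=0.0):
    """Empirical margin loss (assumed form): fraction of samples whose
    true-class score is at most gamma above the best other-class score."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    idx = np.arange(scores.shape[0])
    true_scores = scores[idx, labels]
    # Mask out the true class before taking the runner-up maximum.
    masked = scores.copy()
    masked[idx, labels] = -np.inf
    runner_up = masked.max(axis=1)
    return float(np.mean(true_scores <= gamma + runner_up))

# Toy classifier outputs for three samples and three classes (illustrative).
scores = np.array([[2.0, 0.5, 0.1],
                   [0.2, 0.9, 0.8],
                   [0.3, 0.4, 1.5]])
labels = np.array([0, 1, 0])
print(margin_loss(scores, labels, gamma=0.0))  # standard classification loss
print(margin_loss(scores, labels, gamma=0.5))  # a stricter margin counts more errors
```

Setting `gamma=0.0` recovers the standard classification loss, while a positive margin additionally penalizes correct but low-confidence predictions.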
Recent advances. A variety of techniques, based on VC dimension, Rademacher complexity, and PAC-Bayes-type arguments, have been employed in the attempt to understand the generalization error of neural networks. Intriguingly, in a number of recent works the generalization error of a layer neural network with layer weights is expressed as
(2) 
with the terms and being, respectively, architecture-dependent and dependent solely on the network weights. The latter term has been referred to as the spectral complexity of a neural network (Bartlett et al., 2017) and can be defined as
(3) 
To be precise, the aforementioned definition corresponds to the one derived in a PAC-Bayes framework (Neyshabur et al., 2017a) together with . In Bartlett et al. (2017) the proposed measure is:
(4) 
obtained using an involved covering argument. Here, the network weights are contrasted to some reference weights and the Frobenius norm is substituted by the matrix norm defined as for . In the similar works of Bartlett and Mendelson (2002) and Neyshabur et al. (2015) the authors use the norm and the norm respectively.
Several experiments have shown that spectral complexity generally correlates with the true generalization error as quantified by held-out data. Furthermore, it is large for difficult datasets and small for easy datasets. Intuitively, spectral complexity is related to how robust a model is to noise added to its layers. A simple model will be more robust to noise and can be seen as lying in a flat minimum: even if the weights are moved away from the center of the minimum by a large amount, the loss remains approximately the same.
Nevertheless, it has also been empirically demonstrated that for standard deep convolutional neural networks these analyses tend to overestimate the generalization error by several orders of magnitude (Arora et al., 2018). The aim of this paper is to test whether this gap between theory and practice stems from artifacts of the analysis or from some fundamental limitation of the approach.
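To make the PAC-Bayes variant of the spectral complexity concrete, the sketch below evaluates it on a list of weight matrices. It assumes the form given by Neyshabur et al. (2017a), namely the product of squared spectral norms multiplied by the sum of the ratios of squared Frobenius to squared spectral norms; the function name and the toy weights are hypothetical.

```python
import numpy as np

def spectral_complexity(weights):
    """Assumed PAC-Bayes form:
    prod_i ||W_i||_2^2  *  sum_i ||W_i||_F^2 / ||W_i||_2^2."""
    spec_sq = [np.linalg.norm(W, 2) ** 2 for W in weights]      # squared spectral norms
    frob_sq = [np.linalg.norm(W, "fro") ** 2 for W in weights]  # squared Frobenius norms
    product = float(np.prod(spec_sq))
    ratio_sum = float(sum(f / s for f, s in zip(frob_sq, spec_sq)))
    return product * ratio_sum

rng = np.random.default_rng(0)
layers = [rng.normal(size=(64, 32)), rng.normal(size=(32, 10))]  # toy two-layer network
print(spectral_complexity(layers))
```

Note that rescaling a single layer changes only the product term, since each Frobenius-to-spectral ratio is scale invariant.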
Other approaches. There has been a separate line of work aiming to numerically evaluate nonvacuous bounds. A key work in this direction was carried out by Dziugaite and Roy (2017), where the PAC-Bayes bound was optimized to obtain nonvacuous estimates. Their work is, however, limited to small datasets and is still far from tight. Stronger results can be obtained by first compressing the network and then computing a bound on the remaining parameters (Zhou et al., 2018; Arora et al., 2018). It is important to note, however, that besides being far from tight, these bounds do not concern the original classifier, as a one-to-one correspondence cannot be established between the compressed and uncompressed architectures.
LeNet5  AlexNet  VGG16  

Neyshabur et al. (2017a)  
Ours 
3 Bound for convolutional layers
An obvious shortcoming is that, being derived for fully connected layers, norm-based generalization bounds are not specifically adapted to convolutional architectures. Our first order of business is thus to understand how much one may gain by explicitly considering the structure of convolutions in the generalization error derivation.
To this end, we first aim to tighten the bound of Neyshabur et al. (2017a) and adapt it to the convolutional case. Specifically, we will improve upon the architecture-dependent constant . We show that for the case of convolutional layers the original value of is unacceptably high.
Our main result is as follows:
Theorem 3.1.
(Generalization Bound). Let be a layer network, consisting of convolutional layers, fully-connected layers, and layer-wise ReLU activations. For any , with probability at least over the training set of size we have
with being a uniform bound on the input vectors, is as in (45), terms and denote respectively the filter support and number of output channels of the th convolutional layer, and counts the number of nonzero entries of the th fully-connected layer.
The theorem associates the generalization capacity of a deep convolutional neural network to its weights, as well as to key aspects of its architecture. Interestingly, there is a sharp contrast between convolutional and fully-connected layers.
Fully-connected layers, in accordance with previous analyses, exhibit a sample complexity that depends linearly on the number of neurons, subject to sparsity constraints. For instance, when all layers are sparse with sparsity and constant stable rank, ignoring log factors our bound implies that suffice to attain good generalization. For the same setting, the sample complexity was determined as by Neyshabur et al. (2017a).
Convolutional layers contribute more mildly to the sample complexity, with the latter increasing linearly with the filter support and channels , but being independent of the layer input size. As a case in point, in a fully convolutional network of layers, each with constant stable rank and output channels, our bound scales like , while the previous bound scaled like , which constitutes a two-orders-of-magnitude improvement when the filter support is (as is usually the case).
To illustrate these differences resulting from , we conduct an experiment on LeNet5 for the MNIST dataset, and on AlexNet and VGG16 for the Imagenet dataset. We omit the term assuming that . We report the results in Table 1. We obtain results orders of magnitude tighter than the previous PAC-Bayesian approach.
An interesting observation is that the original approach estimates the sample complexity of VGG16 to be one order of magnitude larger than that of AlexNet, even though both are trained on the same dataset (Imagenet) and do not overfit. By contrast, our bound estimates approximately the same sample complexity for both architectures, even though they differ significantly in ambient architecture dimensions. We also observe that our bound results in improved estimates when the spatial support of the filters is small relative to the dimensions of the feature maps. For the LeNet5 architecture, where the support of the filters is large relative to the ambient dimension, the benefit from our approach is small, as the convolutional layers can be adequately modeled as dense matrices.
Clearly, the assumption is unrealistic in practice. For values obtained by trained networks the bounds presented above are still vacuous by several orders of magnitude. Some paradoxes resulting from this looseness as well as some possible explanations are explored in Section 5.
4 Proof outline of Theorem 3.1
We will first present two prior results which will be useful later. We first state a result relating the robustness of a classifier to noise perturbations to the GE, and then state a result quantifying the perturbation robustness of general deep neural networks. We then present our results relating all the above to the convolutional setting.
Let be any deterministic predictor (not necessarily a neural network). The following lemma from Neyshabur et al. (2017a) introduces the condition as a probabilistic bound on the Lipschitz constant of the predictor , and relates it to the generalization error:
Lemma 4.1.
Neyshabur et al. (2017a) Let be any predictor (not necessarily a neural network) with parameters , and be any distribution on the parameters that is independent of the training data. Then with probability over the training set of size , for any random perturbation s.t. , we have
(5) 
where are constants.
There is a tradeoff between the condition and the KL term on the right-hand side of the above inequality. The KL term is inversely proportional to the variance of the noise . One would therefore want to maximize the variance of the noise; however, the distance can grow unboundedly with high probability when the variance is large enough.
Characterizing the condition entails understanding the sensitivity of our deep convolutional neural network classifier to random perturbations. To that end, we review here a useful perturbation bound from Neyshabur et al. (2017a) on the output of a general deep neural network:
Lemma 4.2.
(Perturbation Bound). For any , let be a d-layer network with ReLU activations. Then for any , and , and perturbation such that , the change in the output of the network can be bounded as follows
(6) 
where , and are considered as constants after an appropriate normalization of the layer weights.
We note that correctly estimating the spectral norm of the perturbation at each layer is critical to obtaining a tight bound. Specifically, if we exploit the structure of the perturbation we can increase significantly the variance of the added perturbation for which holds.
The analysis for the convolutional case is difficult because the noise per pixel is not independent. We defer the proof to the appendix and omit log factors for clarity. We obtain the following lemma:
Lemma 4.3.
Let be the perturbation matrix of a 2d convolutional layer with input channels, output channels, convolutional filters and feature maps . Then if we vectorize the convolutional filter weights and add a vectorized noise vector such that with probability greater than
(7) 
We see that the spectral norm of the noise is independent of the dimensions of the latent feature maps. The spectral norm is a function of the root of the filter support , the number of input channels and the number of output channels .
We now proceed to find the maximum value of the variance parameter that balances the noise sensitivity with the KL term dependence. We present the following lemma:
Lemma 4.4.
(Perturbation Bound). For any , let be a d-layer network with ReLU activations, where we denote the set of convolutional layers and the set of fully connected layers. Then for any , and , and a perturbation for , for any with
(8) 
we have
(9) 
where , , are considered as constants after an appropriate normalization of the layer weights
(10) 
and
(11) 
Theorem 3.1 follows directly from calculating the KL term in Lemma 2.1 by noting that , , and that then .
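The KL calculation mentioned above has a simple closed form in the Gaussian setting. The sketch below assumes, following the setup of Neyshabur et al. (2017a), a prior centered at zero and a perturbed posterior centered at the vectorized weights, both isotropic with the same standard deviation; the function name and the toy weight vector are illustrative.

```python
import numpy as np

def kl_isotropic_gaussians(w, sigma):
    """KL( N(w, sigma^2 I) || N(0, sigma^2 I) ) = ||w||^2 / (2 sigma^2).
    Assumes equal isotropic covariances for prior and posterior."""
    w = np.asarray(w, dtype=float)
    return float(np.sum(w ** 2) / (2.0 * sigma ** 2))

w = np.array([3.0, 4.0])  # toy vectorized weights, ||w||^2 = 25
for sigma in (1.0, 2.0, 4.0):
    # Larger noise variance shrinks the KL term, but makes the
    # perturbation condition of Lemma 4.1 harder to satisfy.
    print(sigma, kl_isotropic_gaussians(w, sigma))
```

This makes the tradeoff explicit: the KL term scales as the inverse of the noise variance, so the proof seeks the largest variance for which the perturbation condition still holds.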
5 Limitations of spectral norm based bounds
The improvement we attained by taking into account the structure of convolutional layers, though significant, still falls short of explaining why deep convolutional neural networks are able to generalize beyond the training set: the bounds are too pessimistic. In the following, we argue that this is not a shortcoming unique to our analysis but an inherent limitation of spectral complexity-based measures for generalization.
5.1 Are we still counting parameters?
We start by observing that spectral complexity-based measures feature a strong dependence on the stable rank of the weight matrices involved, given by where stands for a generic norm, such as the Frobenius norm in and the norm in (see respectively (45) and (4)) (Arora et al., 2018). The stable rank gives a robust estimate of the degrees of freedom of a matrix: roughly, an matrix with constant stable rank has degrees of freedom, instead of as usual.
This interpretation should give us pause. Viewed this way, bounds based on spectral complexity (and incorporating the stable rank) appear to be sophisticated parameter counting techniques, able to adapt to different neural network realizations. As such, they should in principle not be able to capture the complex interactions between data symmetries and CNN invariance to these symmetries. However, these network invariances are thought to play a key role in simplifying the classification problem and beating the curse of dimensionality (Mallat, 2016).
To demonstrate this paradox, in the following we will consider locally connected layers, which have the same sparsity structure as convolutional layers but do not use weight sharing. Surprisingly, despite not exploiting the translation symmetries inherent to natural images, from the perspective of spectral complexity these layers have the same generalization capacity as convolutional layers.
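The stable rank discussed at the start of this subsection is easy to compute directly. The following sketch uses the Frobenius-norm variant, i.e. the ratio of the squared Frobenius norm to the squared spectral norm; the function name and the test matrices are illustrative.

```python
import numpy as np

def stable_rank(W):
    """Frobenius-norm stable rank: ||W||_F^2 / ||W||_2^2."""
    fro_sq = np.linalg.norm(W, "fro") ** 2
    spec_sq = np.linalg.norm(W, 2) ** 2
    return float(fro_sq / spec_sq)

rng = np.random.default_rng(0)
u, v = rng.normal(size=(50, 1)), rng.normal(size=(1, 40))
print(stable_rank(np.eye(8)))                   # full rank with equal singular values
print(stable_rank(u @ v))                       # rank-one matrix
print(stable_rank(rng.normal(size=(50, 40))))   # dense Gaussian matrix
```

The stable rank never exceeds the usual rank and, unlike it, is insensitive to small singular values, which is why it behaves like a soft count of degrees of freedom.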
5.1.1 The paradox of locally connected layers
Locally connected layers have a sparse banded structure similar to convolutions, with the simplifying assumption that the weights of the translated filters are not shared. The weight matrix is exemplified in Figure 1(a) for the case of one-dimensional convolutions. While this type of layer is not used in practice, it allows us to isolate the effect of sparsity on the generalization error.
We prove the following:
Theorem 5.1.
Let be a layer network, consisting of locally-connected layers, fully-connected layers, and layer-wise ReLU activations. For any , with probability at least over the training set of size we have
with being a uniform bound on the input vectors, is as in (45), terms and denote respectively the filter support and number of output channels of the th locally connected layer, and counts the number of nonzero entries of the th fully-connected layer.
The analysis is similar to the case of convolutional layers but is considerably simpler, as we can readily apply results for the spectral norm of sparse matrices. We defer the proof to the appendix.
Our bound is based on the following result and on the same proof technique as in the case of convolutional layers:
Lemma 5.2.
Let be the perturbation matrix of a 2d locally connected layer with input channels, output channels, filters and feature maps . Then if nonzero elements follow , with probability greater than
(12) 
The spectral norm of the noise is, up to log factors that are artifacts of the calculations, independent of the dimensions of the latent feature maps and of the ambient layer dimensionality. The spectral norm is a function of the root of the filter support , the number of input channels and the number of output channels . Our derivation is based on the implicit sparsity of the locally connected layers and relies on counting the number of nonzero elements along the input and output layer axes. We note also that Equation 7 is identical to Equation 6 if we substitute .
Surprisingly, the obtained bounds are identical up to log factors that are artifacts of the derivation. In the next subsection we investigate empirically the critical quantity in our bound, the expected spectral norm of the layer noise for the convolutional and the locally connected case, and find that it is tight. We argue that this points to inherent limitations in the proof technique.
5.1.2 Empirical investigation of tightness
Corollary 3.0.1, Lemma 3.1 and Lemma 3.2 represent concentration bounds around an expected value. We test our concentration bounds by computing, both theoretically and empirically, the expected value of the spectral norm for synthetic data. We assume 2d signals, filters , feature maps , input channels, output channels and calculate the spectral norm . We increase the number of input and output channels assuming that . To find empirical estimates we average the results over iterations for each choice of . We plot the results in Figure 1(b). We see that the theoretical and empirical estimates for the expected value deviate by some log factors. This is clearer in the case of locally connected layers (blue lines). However, the bounds correctly capture the growth rate of the expected value of the spectral norm as the parameters increase. Furthermore, the empirical estimates validate the prediction that the norm will be less concentrated around the mean for the true convolutional layer.
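The empirical side of this experiment can be sketched as follows. This is a simplified illustration rather than the setup used for Figure 1(b): it assumes one-dimensional signals with circular boundary conditions instead of 2d signals, unit-variance Gaussian noise, and small sizes chosen only to keep the run short; all function names and parameter values are hypothetical.

```python
import numpy as np

def circulant_block(filt, n):
    """n x n matrix of a 1d circular convolution with a length-k filter (k <= n)."""
    k = filt.shape[0]
    rows = np.arange(n)[:, None]
    cols = (rows + np.arange(k)[None, :]) % n
    block = np.zeros((n, n))
    block[rows, cols] = filt[None, :]  # every row reuses the same k weights
    return block

def conv_noise(n, k, a, b, sigma, rng):
    """Noise with weight sharing: one Gaussian filter per (output, input) channel pair."""
    return np.block([[circulant_block(rng.normal(0.0, sigma, k), n)
                      for _ in range(a)] for _ in range(b)])

def local_noise(n, k, a, b, sigma, rng):
    """Same sparsity pattern, but every nonzero entry is an independent Gaussian."""
    mask = circulant_block(np.ones(k), n)
    return np.block([[mask * rng.normal(0.0, sigma, (n, n))
                      for _ in range(a)] for _ in range(b)])

def mean_spectral_norm(make, trials, **kw):
    """Average spectral norm of the noise matrix over independent draws."""
    return float(np.mean([np.linalg.norm(make(**kw), 2) for _ in range(trials)]))

rng = np.random.default_rng(0)
for n in (16, 32):        # feature-map length
    for c in (2, 4):      # a = b = c channels
        kw = dict(n=n, k=3, a=c, b=c, sigma=1.0, rng=rng)
        print(n, c,
              round(mean_spectral_norm(conv_noise, 5, **kw), 2),
              round(mean_spectral_norm(local_noise, 5, **kw), 2))
```

Comparing rows with the same channel count but different feature-map lengths gives a rough check of how weakly the norm depends on the ambient dimension, while increasing the channel count shows the growth that the lemmas attribute to the filter support and the number of channels.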
5.2 Insensitivity of spectral complexity to data manifold symmetries
We conduct experiments to explore hidden variables that might not be captured by the spectral complexity term. We conjecture that spectral complexity based bounds do not account for invariances to different symmetries in the data. We aim to test this idea by increasing the amount of two simple nonlinear transformations applied to the data, namely translations and elastic deformations. Deep convolutional neural networks have been shown formally to be invariant to translations and stable to deformations (Mallat, 2016; Wiatowski and Bölcskei, 2018). We note, however, that deep convolutional neural networks are invariant to much more complex nonlinear transformations of the data, such as adding sunglasses to faces (Radford et al., 2015).
In all following experiments we used the following architecture:
(13) 
, where denotes a convolutional layer with output channels and filter support, denotes a fully connected layer with outputs, and denotes the max-pooling operator with pooling size of . The total number of parameters is .
5.2.1 Invariance to symmetries affects generalization
We first create three different versions of the CIFAR10 dataset. a) The first control version consists of 10000 training images and 10000 test images sampled randomly from the CIFAR10 dataset. b) The second “translated” version is constructed by taking 5000 training images and 5000 test images sampled randomly from the CIFAR10 dataset. These “base” sets are then augmented separately with another 5000 images that are random translations of the originals. c) The third “elastic” version is constructed similarly to the “translated” version; however, the “base” sets are augmented with images that are random elastic deformations of the originals.
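The construction of the “translated” version can be sketched as below. This is an illustration under stated assumptions: the maximum shift `max_shift`, the zero-filling of exposed borders, and the use of one translated copy per base image are hypothetical choices not specified in the text, and random arrays stand in for the sampled CIFAR10 images.

```python
import numpy as np

def translate(img, dy, dx):
    """Shift an (H, W, C) image by (dy, dx) pixels, zero-filling the exposed border.
    Assumes |dy| < H and |dx| < W."""
    h, w = img.shape[:2]
    out = np.zeros_like(img)
    out[max(0, dy):min(h, h + dy), max(0, dx):min(w, w + dx)] = \
        img[max(0, -dy):min(h, h - dy), max(0, -dx):min(w, w - dx)]
    return out

def augment_with_translations(base, max_shift, rng):
    """Return the base images followed by one randomly translated copy of each."""
    shifts = rng.integers(-max_shift, max_shift + 1, size=(len(base), 2))
    moved = np.stack([translate(x, int(dy), int(dx))
                      for x, (dy, dx) in zip(base, shifts)])
    return np.concatenate([base, moved], axis=0)

rng = np.random.default_rng(0)
base = rng.random((50, 32, 32, 3))   # stand-in for the 5000 sampled base images
translated_set = augment_with_translations(base, max_shift=4, rng=rng)
print(translated_set.shape)          # twice as many images as the base set
```

The “elastic” version would follow the same pattern with the translation replaced by a random elastic deformation, and the labels of the augmented copies would be inherited from their originals.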
We train a deep convolutional neural network using SGD on each of the above datasets and calculate the GE and the spectral complexity metric defined in (45) at the end of each epoch. To confirm that these results were not specific to the Frobenius norm, but also representative of other spectral complexity definitions, we repeated the experiment with the (2,1)-norm metric defined in (4). The results were consistent with those presented here; they are deferred to the appendix for completeness. As a sanity check, we confirm in Figure 2(a) that the GE and the metric correlate during training (though lying on different scales).
Figure 2(b) plots the GE as a function of the metric for all three datasets, with markers corresponding to results for different epochs. It is important to compare GE values for the same spectral complexity, as this highlights a hidden variable along which the GE varies and which is not captured by spectral complexity. We see that for the same metric value the deep convolutional neural network exhibits different GE for the different dataset versions. The network is able to exploit its translation invariance and deformation stability to obtain a lower GE compared to the normal dataset. Intuitively, by replacing part of the variation in the data manifold with variations to which the network is invariant, we are simplifying the manifold for the deep convolutional neural network, improving the GE (even though the complexity of the classifier according to the spectral complexity is the same). We furthermore observe that the deep convolutional neural network is more robust to translations than to elastic deformations, as it obtains improved GE for the former for the same metric values.
5.2.2 Delving deeper into invariances
This section further explores the insensitivity of spectral complexity to data symmetries. Specifically, we create a number of datasets where we vary the percentage of augmentations. We start from datasets that have normal samples and gradually lower that percentage to by adding augmentations. We create two sets: one for augmentations using translations and one for augmentations using elastic deformations. We train a deep convolutional neural network using SGD on these datasets and calculate the GE and the spectral complexity metric after each epoch.
We plot the results in Figures 11 and 5. We see that, for both the translated and elastic datasets, more augmentations result in GE curves that have gradually smaller slopes. Thus, for the same metric, the GE decreases as the number of augmentations increases. Alternatively, we can fix a metric value and plot the GE vs the percentage of normal datapoints. We plot the results in Figures 10(a) and 4(a). We see that for fixed metric values the percentage of normal datapoints, i.e., ones that are not slight translations or deformations of others, correlates with the GE. We conclude that spectral complexity has limitations as it does not account for variations in the data manifold. Incorporating the geometry of the data manifold in future measures of complexity will improve their predictive abilities.
6 Discussion
We have presented new PAC-Bayes generalization bounds for deep convolutional neural networks that are orders of magnitude tighter than the previous estimate. We then explored several limitations of spectral complexity as a measure of generalization performance. We applied our technique to locally connected layers and showed that they are indistinguishable from convolutional layers under the current proof technique, which is not true in practice. We furthermore explored the insensitivity of spectral complexity to invariances of deep convolutional neural networks. Our findings suggest that incorporating the data structure in generalization bounds should improve their predictive ability.
References

Achille and Soatto (2018) Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. The Journal of Machine Learning Research, 19(1):1947–1980, 2018.
Arora et al. (2018) Sanjeev Arora, Rong Ge, Behnam Neyshabur, and Yi Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.
Bandeira et al. (2016) Afonso S Bandeira, Ramon Van Handel, et al. Sharp nonasymptotic bounds on the norm of random matrices with independent entries. The Annals of Probability, 44(4):2479–2506, 2016.
Bartlett and Mendelson (2002) Peter L Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, 3(Nov):463–482, 2002.
Bartlett et al. (2017) Peter L Bartlett, Dylan J Foster, and Matus J Telgarsky. Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6241–6250, 2017.
Du et al. (2017) Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradient descent learns one-hidden-layer CNN: Don’t be afraid of spurious local minima. arXiv preprint arXiv:1712.00779, 2017.
Dziugaite and Roy (2017) Gintare Karolina Dziugaite and Daniel M Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. arXiv preprint arXiv:1703.11008, 2017.
Golowich et al. (2017) Noah Golowich, Alexander Rakhlin, and Ohad Shamir. Size-independent sample complexity of neural networks. arXiv preprint arXiv:1712.06541, 2017.
Keskar et al. (2016) Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang. On large-batch training for deep learning: Generalization gap and sharp minima. arXiv preprint arXiv:1609.04836, 2016.
Mallat (2016) Stéphane Mallat. Understanding deep convolutional networks. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065):20150203, 2016.
McAllester (1999) David A McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3):355–363, 1999.
Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In Conference on Learning Theory, pages 1376–1401, 2015.
Neyshabur et al. (2017a) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564, 2017a.
Neyshabur et al. (2017b) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nati Srebro. Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956, 2017b.
Radford et al. (2015) Alec Radford, Luke Metz, and Soumith Chintala. Unsupervised representation learning with deep convolutional generative adversarial networks. arXiv preprint arXiv:1511.06434, 2015.
Sokolic et al. (2016) Jure Sokolic, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Generalization error of invariant classifiers. arXiv preprint arXiv:1610.04574, 2016.
Sokolić et al. (2016) Jure Sokolić, Raja Giryes, Guillermo Sapiro, and Miguel RD Rodrigues. Robust large margin deep neural networks. IEEE Transactions on Signal Processing, 65(16):4265–4280, 2016.
Vershynin (2010) Roman Vershynin. Introduction to the non-asymptotic analysis of random matrices. arXiv preprint arXiv:1011.3027, 2010.
Wang et al. (2018) Huan Wang, Nitish Shirish Keskar, Caiming Xiong, and Richard Socher. Identifying generalization properties in neural networks. arXiv preprint arXiv:1809.07402, 2018.
Wiatowski and Bölcskei (2018) Thomas Wiatowski and Helmut Bölcskei. A mathematical theory of deep convolutional neural networks for feature extraction. IEEE Transactions on Information Theory, 64(3):1845–1866, 2018.
Zhang et al. (2016) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530, 2016.
Zhou and Feng (2018) Pan Zhou and Jiashi Feng. Understanding generalization and optimization performance of deep CNNs. arXiv preprint arXiv:1805.10767, 2018.
Zhou et al. (2018) Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. 2018.
Appendix
Concentration of measure
In the derivations below we will rely upon the following useful theorem for the concentration of the spectral norm of sparse random matrices:
Theorem 1.1.
Bandeira et al. (2016) Let be a random rectangular matrix with where are independent random variables and are scalars. Then
(14) 
for any and with
(15) 
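As a rough numerical illustration of the kind of statement Theorem 1.1 makes, the sketch below draws a sparse Gaussian matrix and compares its spectral norm with quantities computed from the scalar coefficients alone. The names `sigma_1`, `sigma_2` and `sigma_star` (largest row norm, largest column norm and largest absolute entry of the coefficient matrix) are assumed notation, and the 0/1 mask, the matrix sizes and the density are hypothetical choices made only for this example.

```python
import numpy as np

def sparse_gaussian_norm_stats(n=200, m=300, density=0.1, seed=0):
    """Spectral norm of a masked Gaussian matrix and norms of its coefficient matrix."""
    rng = np.random.default_rng(seed)
    # Scalar coefficients: a 0/1 sparsity mask (assumption for illustration).
    coeffs = (rng.random((n, m)) < density).astype(float)
    # Independent standard Gaussian variables, one per entry.
    gauss = rng.standard_normal((n, m))
    x = coeffs * gauss
    sigma_1 = np.sqrt((coeffs ** 2).sum(axis=1)).max()   # largest row norm
    sigma_2 = np.sqrt((coeffs ** 2).sum(axis=0)).max()   # largest column norm
    sigma_star = np.abs(coeffs).max()                    # largest absolute entry
    spectral = np.linalg.norm(x, 2)                      # largest singular value
    return spectral, sigma_1, sigma_2, sigma_star

spectral, sigma_1, sigma_2, sigma_star = sparse_gaussian_norm_stats()
dense_scale = np.sqrt(200) + np.sqrt(300)  # scale expected for a dense Gaussian matrix
print(spectral, sigma_1 + sigma_2, dense_scale)
```

The point of the comparison is that the sparse matrix is controlled by its row and column norms, which are much smaller than the scale of a dense matrix of the same size.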
In the following we will use the same numbering for Theorems and Lemmas as in the main paper. Theorems and Lemmas unique to the appendix will be numbered with a prefix corresponding to the section where the theorem is introduced and suffix with a corresponding number.
A. Fully Connected Layers
Lemma A.1.
Let be the perturbation matrix of a fully connected layer with with row and column sparsity equal to . Then if nonzero elements follow , with probability greater than
(16) 
Proof.
For we need to define an index function that allows a Gaussian random noise variable at the locations where the original dense layer is nonzero.
We assume that when and is zero otherwise, and get the result trivially from Theorem 7.1. We can extend the result to by considering that . ∎
B. Locally Connected Layers
Lemma 5.2.
Let be the perturbation matrix of a 2d locally connected layer with input channels, output channels, filters and feature maps . Then if nonzero elements follow , with probability greater than
(17) 
Proof.
We will consider first the case . A convolutional layer is characterised by its output channels. For each output channel, each input channel is convolved with an independent filter, resulting in a set of feature maps. For each output channel these feature maps are then summed together. We consider locally connected layers, i.e. the layers are banded but the entries are independent and there is no weight sharing. For the case of one dimensional signals the implied structure is plotted in Figure 2a.
Similar to Lemma A.1 we assume that when and is zero otherwise. We need to evaluate , and for a matrix like the one in Figure 2a.
We plot what these sums represent in Figures 1 a,b. For we can find an upper bound, by considering that the sum for a given filter and a given pixel location represents the maximum number of overlaps for all 2d shifts. For the case of 2d this is , equal to the support of the filters. We plot these shifts in Figure 2. We also need to consider that there are input channels. We then get
(18) 
For each column in the matrix represents a concatenation of convolutional filters . The support of the filters is and there are filters stacked on top of each other, corresponding to the output channels. Then it is straightforward to derive that
(19) 
Furthermore trivially and when we can get the result by considering that . ∎
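The counting argument in the proof can be checked on a small one dimensional example. The sketch below builds the 0/1 support pattern of a locally connected layer and counts nonzeros per row and per column. The names `n`, `q`, `a` and `b` (signal length, filter support, input channels, output channels) and the circular boundary handling are assumptions made for illustration; in the 2d case the filter support would take the place of `q`.

```python
import numpy as np

def locally_connected_mask(n=16, q=3, a=2, b=4):
    """0/1 support of a 1d locally connected layer with circular boundaries."""
    band = np.zeros((n, n))
    rows = np.arange(n)
    for shift in range(q):
        # Each output location sees q consecutive input locations.
        band[rows, (rows + shift) % n] = 1.0
    # a input channels placed side by side, b output channels stacked vertically.
    return np.tile(band, (b, a))

mask = locally_connected_mask()
row_nnz = mask.sum(axis=1)   # nonzeros per row: one band per input channel
col_nnz = mask.sum(axis=0)   # nonzeros per column: one band per output channel

# With independent entries of standard deviation sigma on the support, the
# largest row and column norms of the coefficient matrix follow directly.
sigma = 0.1
max_row_norm = sigma * np.sqrt(row_nnz.max())
max_col_norm = sigma * np.sqrt(col_nnz.max())
print(row_nnz.max(), col_nnz.max(), max_row_norm, max_col_norm)
```

In this toy setting every row has `a * q` nonzeros and every column has `b * q`, which mirrors the roles played by the input and output channels in the two sums above.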
C. Convolutional Layers
Lemma 4.3.
Let be the perturbation matrix of a 2d convolutional layer with input channels, output channels, convolutional filters and feature maps . Then if we vectorize the convolutional filter weights and add a vectorized noise vector such that with probability greater than
(20) 
Proof.
We consider noise filters and feature maps . We define the convolutional noise operator from input channel to output channel in the spatial domain as and in the frequency domain as
and we denote the Fourier transform matrix as
. Each convolutional operator corresponds to one convolutional noise filter . We can now define the structure of the 2d convolutional noise matrix . Given input channels and output channels the noise matrix is structured as
(21) 
where for all output channels the signal’s input channels are convolved with independent noise filters and the results of these convolutions are summed.
By exploiting the unitary invariance of the spectral norm we transform this matrix into the Fourier domain to obtain
(22) 
where we have used the fact that the matrices are diagonal and a concatenation of diagonal matrices can always be rearranged into block diagonal form. In our case we have defined blocks
(23) 
with entries
(24) 
where , are the frequency coordinates. In this way the block corresponds to the frequency components from the Fourier transforms of all filters.
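The Fourier-domain rearrangement can be sanity-checked numerically. The sketch below is a one dimensional analogue, assuming circular convolution: it assembles the multi-channel convolution matrix from circulant blocks and compares its spectral norm with the largest spectral norm among the per-frequency blocks of Fourier coefficients. The helper names `circulant`, `conv_matrix` and `max_frequency_block_norm`, and the sizes used, are hypothetical.

```python
import numpy as np

def circulant(first_col):
    """Circulant matrix C with C[i, j] = first_col[(i - j) mod n]."""
    n = len(first_col)
    idx = (np.arange(n)[:, None] - np.arange(n)[None, :]) % n
    return first_col[idx]

def conv_matrix(filters, n):
    """Stack circulant blocks: b output channels (rows) by a input channels (columns)."""
    b, a, q = filters.shape
    padded = np.zeros((b, a, n))
    padded[:, :, :q] = filters  # zero-pad each filter to the signal length
    return np.block([[circulant(padded[o, i]) for i in range(a)] for o in range(b)])

def max_frequency_block_norm(filters, n):
    """Largest spectral norm over the b-by-a blocks, one block per frequency."""
    spectrum = np.fft.fft(filters, n=n, axis=-1)   # shape (b, a, n)
    blocks = np.moveaxis(spectrum, -1, 0)          # shape (n, b, a)
    return max(np.linalg.norm(block, 2) for block in blocks)

rng = np.random.default_rng(0)
filters = rng.standard_normal((4, 2, 3))   # b=4 outputs, a=2 inputs, support q=3
n = 16
direct = np.linalg.norm(conv_matrix(filters, n), 2)
via_fourier = max_frequency_block_norm(filters, n)
print(direct, via_fourier)
```

The two numbers agree up to floating point error, which is the 1d counterpart of replacing the full noise matrix by its frequency blocks.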
We will also need the matrices and
(25) 
The entries of these matrices have the following distributions
(26) 
where we have used the fact that are i.i.d. Gaussian.
We have now turned our initial problem into a form that lends itself more easily to a solution. Our original matrix has been turned into block diagonal form, and each block can be split into real and imaginary parts that have independent Gaussian entries. Note, however, that the blocks are not independent of each other. We will now derive a concentration bound on the original matrix by using the fact that the spectral norm of a block diagonal matrix is equal to the maximum of the spectral norms of the individual blocks.
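Both facts used here are easy to verify on random data. The sketch below checks that the spectral norm of a block diagonal matrix equals the largest block norm, and that each complex block is bounded by the sum of the norms of its real and imaginary parts. Using the triangle inequality for the second step is an assumption about how the split is exploited; the block sizes and the helper `block_diag` are hypothetical.

```python
import numpy as np

def block_diag(mats):
    """Place rectangular complex blocks along the diagonal of a zero matrix."""
    rows = sum(m.shape[0] for m in mats)
    cols = sum(m.shape[1] for m in mats)
    out = np.zeros((rows, cols), dtype=complex)
    r = c = 0
    for m in mats:
        out[r:r + m.shape[0], c:c + m.shape[1]] = m
        r += m.shape[0]
        c += m.shape[1]
    return out

rng = np.random.default_rng(1)
blocks = [rng.standard_normal((4, 2)) + 1j * rng.standard_normal((4, 2))
          for _ in range(5)]

full_norm = np.linalg.norm(block_diag(blocks), 2)
block_norms = [np.linalg.norm(m, 2) for m in blocks]
# Triangle inequality applied to block = real part + i * imaginary part.
split_bounds = [np.linalg.norm(m.real, 2) + np.linalg.norm(m.imag, 2)
                for m in blocks]
print(full_norm, max(block_norms), max(split_bounds))
```

Because the maximum is taken over blocks that are not independent, a union bound over the blocks is the natural way to turn per-block tail bounds into a bound for the whole matrix.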
We can write the following inequalities
(27) 
by setting and arbitrary constants we can furthermore write
(28) 
where in line 4 we set and in line 5 we used a union bound and assumed that and for positive constants .
We will now calculate concentration inequalities for the individual blocks and , turning the general formula we have derived into a specific one for our case. To do that we first apply the following concentration inequality by Vershynin (2010)
Theorem C.1.
Let be an matrix whose entries are independent Gaussian random variables with variance . Then for every
(29) 
on the matrices and . We obtain the following concentration inequalities
(30) 
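The Gaussian concentration result can be probed with a small Monte Carlo experiment. The sketch below assumes the standard form of the bound for a matrix with independent Gaussian entries of standard deviation `sigma`: the largest singular value exceeds `sigma * (sqrt(N) + sqrt(n) + t)` with probability at most `2 * exp(-t**2 / 2)`. The sizes, the value of `t` and the number of trials are hypothetical.

```python
import numpy as np

def exceedance_rate(N=60, n=40, t=2.0, sigma=0.5, trials=200, seed=0):
    """Empirical frequency of the spectral norm exceeding the assumed threshold."""
    rng = np.random.default_rng(seed)
    threshold = sigma * (np.sqrt(N) + np.sqrt(n) + t)
    norms = np.array([
        np.linalg.norm(sigma * rng.standard_normal((N, n)), 2)
        for _ in range(trials)
    ])
    tail_bound = 2.0 * np.exp(-t ** 2 / 2.0)
    return float(np.mean(norms > threshold)), tail_bound, float(norms.mean()), threshold

rate, tail_bound, mean_norm, threshold = exceedance_rate()
print(rate, tail_bound, mean_norm, threshold)
```

In practice the empirical rate sits well below the tail bound, since the spectral norm of a Gaussian matrix concentrates tightly around its mean.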
We then make the following calculations, which will prove useful
(31) 
since
(32) 
which implies