# Some limitations of norm-based generalization bounds in deep neural networks

Deep convolutional neural networks have been shown to be able to fit a labeling over random data while still generalizing well on normal datasets. Describing deep convolutional neural network capacity through the measure of spectral complexity has recently been proposed to tackle this apparent paradox. Spectral complexity correlates with the generalization error (GE) and can distinguish networks trained on normal and random labels. We propose the first GE bound based on spectral complexity for deep convolutional neural networks and provide bounds that are tighter by orders of magnitude than the previous estimate. We then investigate, theoretically and empirically, the insensitivity of spectral complexity to invariances of modern deep convolutional neural networks, and show several limitations of spectral complexity that occur as a result.


## 1 Introduction

Standard deep convolutional architectures have the capacity to fit a labelling over random noise (Zhang et al., 2016). At the same time, these same networks generalize well on real image data. This challenges the conventional wisdom that the good generalization properties of deep convolutional neural networks are the result of restricted capacity from implicit regularization through the use of stochastic gradient descent (SGD) or explicit regularization using dropout and batch normalization (Zhang et al., 2016). In fact, any measure of model complexity which is uniform across all functions representable by a given architecture is doomed to provide contradictory measurements (Bartlett et al., 2017; Arora et al., 2018; Neyshabur et al., 2015). A good measure of complexity should allow for high complexity models for difficult datasets (random noise) and low complexity models for easier datasets (real images).

Researchers have recently focused on spectral complexity (Bartlett et al., 2017) normalized by the margin. Proof techniques include robustness, PAC-Bayes, and Rademacher complexity arguments (Sokolić et al., 2016; Bartlett et al., 2017; Neyshabur et al., 2017a, 2015; Golowich et al., 2017). Spectral complexity has been shown to correlate with the generalization error (GE) in a number of works (Neyshabur et al., 2017b; Bartlett et al., 2017). For the same network, this measure of model complexity has high values when the network is trained on data with random labels and considerably lower values for real labels. The measure can also be linked to the flatness of the minimum represented by a specific network instantiation (Wang et al., 2018). Flat minima have been shown empirically to correlate with better generalization (Keskar et al., 2016).

Two competing hypotheses. Nevertheless, spectral complexity based bounds are still vacuous, as demonstrated empirically by Arora et al. (2018). One possible explanation for this shortcoming is that, while derived for general fully connected layers, the bounds are evaluated w.r.t. deep convolutional networks whose weights are significantly more structured. Perhaps due to the non-trivial correlations between filters, the generalization capacity of deep convolutional neural networks has been less studied. Zhou and Feng (2018), whose work was published after our preprint, provide an interesting analysis, but their analysis holds only for sigmoid activations, includes complex instance-dependent network quantities that do not lend themselves easily to intuition, and is ultimately vacuous. A second interesting bound was proposed by Du et al. (2017), though the latter was derived for a greatly simplified architecture consisting of a single layer and a single convolutional filter.

An alternative hypothesis is that current analyses relying on spectral complexity-type metrics miss a crucial piece of the generalization puzzle. In particular, it is generally agreed that invariance to symmetries in the data is a key property of modern deep convolutional neural networks and should be exploited to improve existing GE bounds. Surprisingly, the role of invariances has seen little attention in the generalization literature: Achille and Soatto (2018) showed that low information content in the network weights corresponds to learning signal representations that are invariant to various nuisance latent parameters; their work however does not result in a meaningful generalization bound. Sokolić et al. (2016) demonstrated that classifiers that are invariant (to a set of discrete transformations of input signals) can potentially have a much lower GE than non-invariant ones.

Contributions. This paper aims to shed light on the limitations of spectral complexity based analyses.

• Aiming to test the first hypothesis, we adapt one of the most recent spectral complexity based generalization bounds (Neyshabur et al., 2017a) to the case of deep convolutional neural networks. Interestingly, despite being orders of magnitude tighter than the previous estimate, our bound remains vacuous.

• To investigate this limitation, we consider the paradoxical case of locally connected layers, i.e., layers constructed to have the same support as convolutional layers that do not employ weight sharing. Counter-intuitively, we find that these enjoy the same generalization guarantees as convolutional layers (up to log factors that are artifacts of the derivation). Our experiments indicate that crucial quantities in the bound are tight, pointing to an inherent shortcoming of the proof technique.

• We confirm empirically that, though important to generalization, the bounds in question fail to capture the invariance properties of convolutional networks to data symmetries, such as elastic deformations and translations. Our experiments suggest that these conclusions are not unique to our approach, but apply to spectral complexity based generalization bounds in general. We conclude that more research should be conducted in this direction.

Notation.

We denote vectors with bold lowercase letters and matrices with bold capital letters. Given two probability measures $P$ and $Q$ over a set $\mathcal{X}$, we define the Kullback-Leibler divergence as $KL(P\,\|\,Q) := \mathbb{E}_{P}\big[\log\frac{dP}{dQ}\big]$. We denote with $\mathcal{N}(\mu,\sigma^2)$ the Gaussian distribution with mean $\mu$ and variance $\sigma^2$.

## 2 Problem and previous work

Let $\mathcal{D}$ be a distribution over samples $x$ and labels $y$. We consider the standard classification problem in which a $k$-class classifier $f_w$, parameterized by $w$, is used to map input vectors to a $k$-dimensional vector encoding class membership.

We may encode the confidence of the classifier by incorporating a dependence on a desired margin $\gamma \geq 0$. Then, the $\gamma$-margin classification loss is defined as

$L_\gamma(f_w) := \mathbb{P}_{(x,y)\sim\mathcal{D}}\big[f_w(x)[y] \leq \gamma + \max_{j\neq y} f_w(x)[j]\big].$
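As a concrete reference, the margin loss above can be sketched in a few lines of numpy (the function name `margin_loss` and the toy scores are ours, not the paper's):

```python
import numpy as np

def margin_loss(scores, labels, gamma):
    """Empirical gamma-margin loss: the fraction of samples where the
    true-class score does not beat the best other class score by more
    than gamma. `scores` is (m, k), `labels` is (m,)."""
    m = scores.shape[0]
    true = scores[np.arange(m), labels]
    # mask out the true class before taking the max over the other classes
    masked = scores.copy()
    masked[np.arange(m), labels] = -np.inf
    runner_up = masked.max(axis=1)
    return np.mean(true <= gamma + runner_up)

# gamma = 0 recovers the standard 0-1 classification loss
scores = np.array([[2.0, 0.5], [0.2, 1.0], [1.0, 0.9]])
labels = np.array([0, 1, 0])
print(margin_loss(scores, labels, 0.0))  # 0.0: all samples correct
print(margin_loss(scores, labels, 0.5))  # 1/3: third sample has margin 0.1
```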

Note that we easily recover the standard classification loss definition by setting $\gamma = 0$. Our objective is to obtain bounds on the generalization error GE:

$L_0(f_w) \leq \hat{L}_\gamma(f_w) + \mathrm{GE},$ (1)

where

$\hat{L}_\gamma(f_w) = \frac{1}{m}\sum_{i=1}^m \mathbf{1}\big\{f_w(x_i)[y_i] \leq \gamma + \max_{j\neq y_i} f_w(x_i)[j]\big\}$

is the empirical loss computed over a random training set of size $m$. For easy reference, we summarize some of the most crucial definitions in Table 1.

Recent advances. A variety of techniques, based on VC dimension, Rademacher complexity, and PAC-Bayes type arguments, have been employed in the attempt to understand the generalization error of neural networks. Intriguingly, in a number of recent works the generalization error of a $d$-layer neural network with layer weights $W_1,\dots,W_d$ is expressed as

$L_0(f_w) \leq \hat{L}_\gamma(f_w) + \tilde{O}\Big(\frac{\Psi_f\, R_W}{\gamma\sqrt{m}} + \Phi_f\Big),$ (2)

with terms $\Psi_f$ and $\Phi_f$ being architecture-dependent and $R_W$ depending solely on the network weights. The latter term has been referred to as the spectral complexity of a neural network (Bartlett et al., 2017) and can be defined as

$R_W := \prod_{l=1}^d \|W_l\|_2 \Big(\sum_{l=1}^d \frac{\|W_l\|_F^2}{\|W_l\|_2^2}\Big)^{1/2}.$ (3)
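For illustration, a minimal numpy sketch of computing $R_W$ as defined in (3) (the helper name `spectral_complexity` is ours):

```python
import numpy as np

def spectral_complexity(weights):
    """Spectral complexity R_W: the product of layer spectral norms times
    the square root of the summed squared stable ranks, as in eq. (3)."""
    spec = [np.linalg.norm(W, 2) for W in weights]       # spectral norms
    fro = [np.linalg.norm(W, 'fro') for W in weights]    # Frobenius norms
    prod = np.prod(spec)
    stable = sum((f / s) ** 2 for f, s in zip(fro, spec))
    return prod * np.sqrt(stable)

rng = np.random.default_rng(0)
layers = [rng.standard_normal((64, 64)) / 8 for _ in range(3)]
print(spectral_complexity(layers))
```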

To be precise, the aforementioned definition corresponds to the one derived in a PAC-Bayes framework (Neyshabur et al., 2017a), together with a corresponding architecture-dependent term $\Psi_f$. In Bartlett et al. (2017) the proposed measure is:

$R'_W := \prod_{l=1}^d \|W_l\|_2 \Big(\sum_{l=1}^d \frac{\|W_l^\top - M_l^\top\|_{2,1}^{2/3}}{\|W_l\|_2^{2/3}}\Big)^{3/2},$ (4)

obtained using an involved covering argument. Here, the network weights $W_l$ are contrasted to some reference weights $M_l$, and the Frobenius norm is substituted by the $(2,1)$-matrix norm, defined as $\|W\|_{2,1} := \sum_j \|w_j\|_2$ for the columns $w_j$ of $W$. In the similar works of Bartlett and Mendelson (2002) and Neyshabur et al. (2015), the authors use other matrix norms.
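Under one common convention (column-wise grouping, an assumption on our part), the $(2,1)$ norm can be computed as:

```python
import numpy as np

def norm_2_1(W):
    """(2,1) matrix norm: the sum of the Euclidean norms of the columns,
    ||W||_{2,1} = sum_j ||W[:, j]||_2."""
    return np.linalg.norm(W, axis=0).sum()

# the (2,1) norm always upper-bounds the Frobenius norm
# (l1 vs l2 norm of the vector of column norms)
W = np.array([[3.0, 0.0], [4.0, 0.0]])
print(norm_2_1(W))               # 5.0: a single active column
print(np.linalg.norm(W, 'fro'))  # 5.0: equal here, smaller in general
```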

Several experiments have shown that spectral complexity generally correlates with the true generalization error as quantified by held-out data. Furthermore, it is large for difficult datasets and small for easy datasets. Intuitively, spectral complexity is related to how robust a model is when noise is added to its layers. A simple model will be more robust to noise and can be seen as lying in a flat minimum: even when moved away from the minimum's center by a large amount, the loss remains approximately the same.

Nevertheless, it has also been empirically demonstrated that for standard deep convolutional neural networks these analyses tend to overestimate the generalization error by several orders of magnitude (Arora et al., 2018). The aim of this paper is to test whether this gap between theory and practice stems from artifacts of the analysis or from some fundamental limitation of the approach.

Other approaches. There has been a separate line of work aiming to numerically evaluate non-vacuous bounds. A key work in this direction was carried out by Dziugaite and Roy (2017), where the PAC-Bayes bound was optimized to obtain non-vacuous estimates. Their work is however limited to small datasets, and is still far from tight. Stronger results can be obtained by first compressing the network and then computing a bound on the remaining parameters (Zhou et al., 2018; Arora et al., 2018). It is important however to note that, besides being far from tight, these bounds do not concern the original classifier, as a one-to-one correspondence cannot be established between the compressed and uncompressed architectures.

## 3 Bound for convolutional layers

An obvious shortcoming is that, by being derived for fully connected layers, norm based generalization bounds are not specifically adapted to convolutional architectures. Our first order of business is thus to understand how much one may gain by explicitly considering the structure of convolutions in the generalization error derivation.

To this end, we first aim to tighten the bound of Neyshabur et al. (2017a) and adapt it to the convolutional case. Specifically, we will improve upon the architecture-dependent constant $\Psi_f$. We show that for the case of convolutional layers the original value of $\Psi_f$ is unacceptably high.

Our main result is as follows:

###### Theorem 3.1.

(Generalization Bound). Let $f_w$ be a $d$-layer network, consisting of convolutional layers, fully-connected layers, and layer-wise ReLU activations. For any $\delta, \gamma > 0$, with probability at least $1-\delta$ over the training set of size $m$ we have

$L_0(f_w) \leq \hat{L}_\gamma(f_w) + \tilde{O}\Big(\frac{B\,\Psi_f\,R_W}{\gamma\sqrt{m}}\Big),$

with $B$ a uniform bound on the input vectors, $R_W$ as in (3),

$\Psi_f = q\sum_{l\in C}\sqrt{b_l} + \sum_{l\in F}\sqrt{s_l},$

where $q^2$ and $b_l$ denote respectively the filter support and the number of output channels of the $l$-th convolutional layer, and $s_l$ counts the number of non-zero entries of the $l$-th fully-connected layer.

The theorem associates the generalization capacity of a deep convolutional neural network to its weights, as well as to key aspects of its architecture. Interestingly, there is a sharp contrast between convolutional and fully-connected layers.

Fully-connected layers, in accordance with previous analyses, exhibit a sample complexity that depends linearly on the number of neurons, subject to sparsity constraints. For instance, when all layers are sparse with sparsity $s_l$ and have constant stable-rank, ignoring log factors, our bound implies that considerably fewer samples suffice to attain good generalization than the sample complexity determined by Neyshabur et al. (2017a) for the same setting.

Convolutional layers contribute more mildly to the sample complexity, with the latter increasing linearly with the filter support $q^2$ and the number of channels $b_l$, but being independent of the layer input size. As a case in point, in a fully convolutional network of $d$ layers, each with constant stable-rank and a constant number of output channels, our bound improves on the previous estimate by two orders of magnitude when the filter support is small relative to the feature-map dimensions (as is usually the case).

To illustrate these differences resulting from $\Psi_f$, we conduct an experiment on LeNet-5 for the MNIST dataset, and on AlexNet and VGG-16 for the ImageNet dataset. We omit the term $B$, assuming that it equals one. We report the results in Table 1. We obtain results orders of magnitude tighter than the previous PAC-Bayesian approach.

An interesting observation is that the original approach estimates the sample complexity of VGG-16 to be one order of magnitude larger than that of AlexNet, even though both are trained on the same dataset (ImageNet) and do not overfit. By contrast, our bound estimates approximately the same sample complexity for both architectures, even though they differ significantly in ambient architecture dimensions. We also observe that our bound results in improved estimates when the spatial support of the filters is small relative to the dimensions of the feature maps. For the LeNet-5 architecture, where the support of the filters is large relative to the ambient dimension, the benefit from our approach is small, as the convolutional layers can be adequately modeled as dense matrices.

Clearly, this assumption is unrealistic in practice. For values obtained by trained networks, the bounds presented above are still vacuous by several orders of magnitude. Some paradoxes resulting from this looseness, as well as some possible explanations, are explored in Section 5.

## 4 Proof outline of Theorem 3.1

We will first present two prior results which will be useful later: a result relating the robustness of a classifier to perturbations to its GE, and a result quantifying the perturbation robustness of general deep neural networks. We then present our results relating the above to the convolutional setting.

Let $f_w$ be any deterministic predictor (not necessarily a neural network). The following lemma from Neyshabur et al. (2017a) introduces a condition that acts as a probabilistic bound on the Lipschitz constant of the predictor $f_w$, and relates it to the generalization error:

###### Lemma 4.1.

Neyshabur et al. (2017a) Let $f_w$ be any predictor (not necessarily a neural network) with parameters $w$, and let $P$ be any distribution on the parameters that is independent of the training data. Then with probability $1-\delta$ over the training set of size $m$, for any random perturbation $u$ s.t. $\mathbb{P}_u\big[\max_{x}|f_{w+u}(x)-f_w(x)|_\infty < \frac{\gamma}{4}\big] \geq \frac{1}{2}$, we have

$L_0(f_w) \leq \hat{L}_\gamma(f_w) + O\bigg(\sqrt{\frac{KL(w+u\,\|\,P) + \ln\frac{6m}{\delta}}{m-1}}\bigg)$ (5)

where the constants hidden in $O(\cdot)$ are universal.

There is a trade-off between the perturbation condition and the KL term on the right-hand side of the above inequality. The KL term is inversely proportional to the variance of the noise $\sigma^2$. One would therefore want to maximize the variance of the noise; however, the perturbation magnitude can grow unbounded with high probability for large enough values of the variance.

Characterizing this condition entails understanding the sensitivity of our deep convolutional neural network classifier to random perturbations. To that end, we review here a useful perturbation bound from Neyshabur et al. (2017a) on the output of a general deep neural network:

###### Lemma 4.2.

(Perturbation Bound). For any $d$, let $f_w$ be a $d$-layer network with ReLU activations. Then for any input $x$ with $\|x\|_2 \leq B$, and any perturbation $u = \mathrm{vec}(\{U_l\}_{l=1}^d)$ such that $\|U_l\|_2 \leq \frac{1}{d}\|W_l\|_2$, the change in the output of the network can be bounded as follows

$|f_{w+u}(x) - f_w(x)|_2 \leq e^2 B\,\tilde{\beta}^{\,d-1}\sum_l \|U_l\|_2$ (6)

where $\tilde{\beta}$ and $B$ are considered constants after an appropriate normalization of the layer weights.

We note that correctly estimating the spectral norm of the perturbation at each layer is critical to obtaining a tight bound. Specifically, if we exploit the structure of the perturbation, we can significantly increase the variance of the added perturbation for which the condition of Lemma 4.1 holds.

The analysis for the convolutional case is difficult due to the fact that the noise per pixel is not independent. We defer the proof to the appendix and omit log parameters for clarity. We obtain the following lemma:

###### Lemma 4.3.

Let $U$ be the perturbation matrix of a 2-d convolutional layer with $a$ input channels, $b$ output channels, convolutional filters of support $q^2$ and $N\times N$ feature maps. Then if we vectorize the convolutional filter weights and add a vectorized noise vector $u \sim \mathcal{N}(0, \sigma^2 I)$, with probability greater than $1-T$

$\|U\|_2 \leq \sigma\Big(q\big[\sqrt{a}+\sqrt{b}\big] + \sqrt{2\log(2N^2/T)}\Big).$ (7)

We see that the spectral norm of the noise is independent of the dimensions of the latent feature maps. The spectral norm is a function of the square root $q$ of the filter support, the number of input channels $a$, and the number of output channels $b$.
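As a rough numerical sanity check, one can build a 1-D analogue of the structured perturbation $U$ and compare its spectral norm against the leading term of the bound. The construction below (a block matrix of independent circulant noise blocks) is our simplification of the 2-D setting; for support-$q$ 1-D filters the leading term reads $\sigma\sqrt{q}(\sqrt{a}+\sqrt{b})$:

```python
import numpy as np

def conv_noise_matrix(a, b, q, n, sigma, rng):
    """1-D analogue of the structured perturbation U: a (b*n, a*n) block
    matrix whose (i, j) block is the circulant matrix of an independent
    i.i.d. Gaussian filter with support q, padded to length n."""
    blocks = []
    for _ in range(b):
        row = []
        for _ in range(a):
            filt = np.zeros(n)
            filt[:q] = rng.normal(0.0, sigma, q)
            # circulant matrix: column j is the filter rolled by j
            C = np.stack([np.roll(filt, j) for j in range(n)], axis=1)
            row.append(C)
        blocks.append(row)
    return np.block(blocks)

rng = np.random.default_rng(1)
a, b, q, n, sigma = 4, 4, 3, 16, 1.0
norms = [np.linalg.norm(conv_noise_matrix(a, b, q, n, sigma, rng), 2)
         for _ in range(20)]
# leading term of the bound (log factor omitted), 1-D scaling
bound = sigma * np.sqrt(q) * (np.sqrt(a) + np.sqrt(b))
print(np.mean(norms), bound)
```

The average norm stays within a small constant factor of the leading term, and in particular does not grow with the feature-map length `n`.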

We now proceed to find the maximum value of the variance parameter $\sigma^2$ that balances the noise sensitivity with the KL term dependence. We present the following lemma:

###### Lemma 4.4.

(Perturbation Bound). For any $d$, let $f_w$ be a $d$-layer network with ReLU activations, and denote by $C$ the set of convolutional layers and by $F$ the set of fully connected layers. Then for any input $x$ with $\|x\|_2 \leq B$, and a perturbation $u_l \sim \mathcal{N}(0, \sigma^2 I)$ for each layer $l$, for any $\sigma$ with

$\sigma = \frac{\gamma}{42\,B\,\tilde{\beta}^{\,d-1}\big[\sum_{l\in C} K_l + \sum_{l\in F} J_l\big]}$ (8)

we have

$\mathbb{P}_u\Big[\max_{x\in X}\,|f_{w+u}(x) - f_w(x)|_2 \leq \frac{\gamma}{4}\Big] \geq \frac{1}{2}$ (9)

where $\tilde{\beta}$ and $B$ are considered constants after an appropriate normalization of the layer weights,

$K_l = q_l\big\{\sqrt{a_l} + \sqrt{b_l} + \sqrt{2\log(4N_l^2 d)}\big\}$ (10)

and

$J_l = 2\sqrt{s_l} + \sqrt{2\log(2d)}.$ (11)

Theorem 3.1 follows directly from calculating the KL term in Lemma 4.1: with the prior $P = \mathcal{N}(0, \sigma^2 I)$ and posterior $w+u$, $u \sim \mathcal{N}(0, \sigma^2 I)$, we have $KL(w+u\,\|\,P) \leq \frac{\|w\|_2^2}{2\sigma^2}$, and substituting the value of $\sigma$ from Lemma 4.4 yields the result.
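The KL computation referred to above reduces, for Gaussian prior and posterior with equal covariance, to $\|w\|_2^2/(2\sigma^2)$. A quick numerical check of this identity (the helper `kl_gaussians` is ours):

```python
import numpy as np

def kl_gaussians(mu0, cov0, mu1, cov1):
    """KL( N(mu0, cov0) || N(mu1, cov1) ) via the standard closed form."""
    d = mu0.size
    inv1 = np.linalg.inv(cov1)
    diff = mu1 - mu0
    return 0.5 * (np.trace(inv1 @ cov0) + diff @ inv1 @ diff - d
                  + np.log(np.linalg.det(cov1) / np.linalg.det(cov0)))

rng = np.random.default_rng(0)
w = rng.standard_normal(10)
sigma = 0.3
# KL( N(w, s^2 I) || N(0, s^2 I) ) collapses to ||w||^2 / (2 s^2):
# the trace and log-det terms cancel, leaving only the mean shift
kl = kl_gaussians(w, sigma**2 * np.eye(10), np.zeros(10), sigma**2 * np.eye(10))
print(kl, (w @ w) / (2 * sigma**2))
```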

## 5 Limitations of spectral norm based bounds

The improvement we attained by taking into account the structure of convolutional layers, though significant, still falls short of explaining why deep convolutional neural networks are able to generalize beyond the training set: the bounds are too pessimistic. In the following, we argue that this is not a shortcoming unique to our analysis but an inherent limitation of spectral complexity-based measures for generalization.

### 5.1 Are we still counting parameters?

We start by observing that spectral complexity-based measures feature a strong dependence on the norm-based stable rank of the weight matrices involved, given by $\|W\|^2/\|W\|_2^2$, where $\|\cdot\|$ stands for a generic norm, such as the Frobenius norm in (3) and the $(2,1)$ norm in (4) (Arora et al., 2018). The stable rank gives a robust estimate of the degrees of freedom of a matrix: roughly, an $n\times n$ matrix with constant stable rank has $O(n)$ degrees of freedom, instead of the usual $O(n^2)$.
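As an illustration of this notion, the stable rank is cheap to compute and behaves as described (the helper `stable_rank` is ours):

```python
import numpy as np

def stable_rank(W):
    """Frobenius stable rank ||W||_F^2 / ||W||_2^2: a robust proxy for
    the degrees of freedom of W, always at most rank(W)."""
    return np.linalg.norm(W, 'fro') ** 2 / np.linalg.norm(W, 2) ** 2

rng = np.random.default_rng(0)
# a random n x n Gaussian matrix has stable rank around n/4,
# well below its full rank n
G = rng.standard_normal((200, 200))
print(stable_rank(G))
# a rank-1 matrix has stable rank exactly 1, whatever its ambient size
u = rng.standard_normal(200)
print(stable_rank(np.outer(u, u)))
```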

This interpretation should give us pause for thought. Seen this way, bounds based on spectral complexity (and incorporating the stable rank) appear to be sophisticated parameter-counting techniques, able to adapt to different neural network realizations. As such, they should in principle not be able to capture the complex interactions between data symmetries and CNN invariance to these symmetries. However, these network invariances are thought to play a key role in simplifying the classification problem and beating the curse of dimensionality (Mallat, 2016).

To demonstrate this paradox, in the following we will consider locally connected layers, which have the same sparsity structure as convolutional layers but don’t use weight sharing. Surprisingly, despite not exploiting the translation symmetries inherent to natural images, from the perspective of spectral complexity these layers have the same generalization capacity as convolutional layers.

#### 5.1.1 The paradox of locally connected layers

Locally connected layers have a sparse banded structure similar to convolutions, with the simplifying assumption that the weights of the translated filters are not shared. The weight matrix is exemplified in Figure 1(a) for the case of one-dimensional convolutions. While this type of layer is not used in practice, it allows us to isolate the effect of sparsity on the generalization error.
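To make the distinction concrete, the following sketch builds the 1-D versions of both weight matrices (our own toy construction, with circular wrap-around): the two have identical banded support, but the convolutional matrix has $k$ free parameters while the locally connected one has $nk$:

```python
import numpy as np

def conv_matrix(filt, n):
    """Circulant matrix of a 1-D (circular) convolution: every row is the
    same filter, shifted -- weights are shared across positions."""
    k = len(filt)
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.arange(i, i + k) % n] = filt
    return W

def locally_connected_matrix(filters, n):
    """Same banded support, but each output position owns its own filter."""
    k = filters.shape[1]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, np.arange(i, i + k) % n] = filters[i]
    return W

rng = np.random.default_rng(0)
n, k = 8, 3
Wc = conv_matrix(rng.standard_normal(k), n)
Wl = locally_connected_matrix(rng.standard_normal((n, k)), n)
# identical sparsity pattern, different number of free parameters: k vs n*k
assert ((Wc != 0) == (Wl != 0)).all()
```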

We prove the following:

###### Theorem 5.1.

Let $f_w$ be a $d$-layer network, consisting of locally-connected layers, fully-connected layers, and layer-wise ReLU activations. For any $\delta, \gamma > 0$, with probability at least $1-\delta$ over the training set of size $m$ we have

$L_0(f_w) \leq \hat{L}_\gamma(f_w) + \tilde{O}\Big(\frac{B\,\Psi_f\,R_W}{\gamma\sqrt{m}}\Big),$

with $B$ a uniform bound on the input vectors, $R_W$ as in (3),

$\Psi_f = q\sum_{l\in C}\sqrt{b_l} + \sum_{l\in F}\sqrt{s_l},$

where $q^2$ and $b_l$ denote respectively the filter support and the number of output channels of the $l$-th locally connected layer, and $s_l$ counts the number of non-zero entries of the $l$-th fully-connected layer.

The analysis is similar to the case of convolutional layers but considerably simpler, as we can readily apply results for the spectral norm of sparse matrices. We defer the proof to the appendix.

Our bound is based on the following result, and an identical proof technique as in the case of convolutional layers:

###### Lemma 5.2.

Let $U$ be the perturbation matrix of a 2-d locally connected layer with $a$ input channels, $b$ output channels, filters of support $q^2$ and $N\times N$ feature maps. Then if the non-zero elements follow $\mathcal{N}(0, \sigma^2)$, with probability greater than $1-T$

$\|U\|_2 \leq O\Big(\sigma\big(q[\sqrt{a}+\sqrt{b}] + \sqrt{2\log(1/T)}\big)\Big).$ (12)

The spectral norm of the noise is, up to log factors that are artifacts of the calculations, independent of the dimensions of the latent feature maps and of the ambient layer dimensionality. It is a function of the square root $q$ of the filter support, the number of input channels $a$ and the number of output channels $b$. Our derivation is based on the implicit sparsity of the locally connected layers and relies on counting the number of non-zero elements along the input and output layer axes. We note also that the bound (12) matches the convolutional bound (7) up to the log term.

Surprisingly, the obtained bounds are identical up to log factors that are artifacts of the derivation. In the next subsection we investigate empirically the critical quantity in our bound, the expected spectral norm of the layer noise for the convolutional and the locally connected case, and find that it is tight. We argue that this points to inherent limitations in the proof technique.

#### 5.1.2 Empirical investigation of tightness

Lemmas 4.3 and 5.2 represent concentration bounds around an expected value. We test these concentration bounds by computing theoretically and empirically the expected value of the spectral norm for synthetic data. We assume 2-d signals, random filters and feature maps, $a$ input channels and $b$ output channels, and calculate the spectral norm $\|U\|_2$. We increase the number of input and output channels, assuming that $a = b$. To find empirical estimates we average the results over multiple iterations for each parameter choice. We plot the results in Figure 1(b). We see that the theoretical and empirical estimates for the expected value deviate by some log factors. This is clearer in the case of locally connected layers (blue lines). However, the bounds correctly capture the growth rate of the expected value of the spectral norm as the parameters increase. Furthermore, the empirical estimates validate the prediction that the norm is less concentrated around the mean for the true convolutional layer.

### 5.2 Insensitivity of spectral complexity to data manifold symmetries

We conduct experiments to explore hidden variables that might not be captured by the spectral complexity term. We conjecture that spectral complexity based bounds do not account for invariances to different symmetries in the data. We test this idea by applying increasing amounts of two simple non-linear transformations to the data, namely translations and elastic deformations. Deep convolutional neural networks have been shown formally to be invariant to translations and stable to deformations (Mallat, 2016; Wiatowski and Bölcskei, 2018). We note, however, that deep convolutional neural networks are invariant to much more complex non-linear transformations of the data, such as adding sunglasses to faces (Radford et al., 2015).

In all following experiments we used the following architecture:

$\text{input} \rightarrow 32\mathrm{C}3 \rightarrow \mathrm{MP}2 \rightarrow 64\mathrm{C}3 \rightarrow \mathrm{MP}2 \rightarrow 10\mathrm{FC} \rightarrow \text{output}$ (13)

where $c\,\mathrm{C}k$ denotes a convolutional layer with $c$ output channels and $k\times k$ filter support, $n\,\mathrm{FC}$ denotes a fully connected layer with $n$ outputs, and $\mathrm{MP}k$ denotes the max-pooling operator with pooling size $k$.

#### 5.2.1 Invariance to symmetries affects generalization

We first create three different versions of the CIFAR-10 dataset. a) The first control version consists of 10000 training images and 10000 test images sampled randomly from the CIFAR-10 dataset. b) The second “translated” version is constructed by taking 5000 training images and 5000 test images sampled randomly from the CIFAR-10 dataset. These “base” sets are then augmented separately with another 5000 images that are random translations of the originals. c) The third “elastic” version is constructed similarly to the “translated” version, however the “base” sets are augmented with images that are random elastic deformations of the originals.
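The "translated" construction in (b) can be sketched as follows (wrap-around shifts are our simplifying assumption; the paper does not specify the padding used):

```python
import numpy as np

def random_translate(img, max_shift, rng):
    """Randomly translate an HxWxC image by up to max_shift pixels in
    each spatial direction, wrapping around at the borders."""
    dy, dx = rng.integers(-max_shift, max_shift + 1, size=2)
    return np.roll(np.roll(img, dy, axis=0), dx, axis=1)

def augment_half(base, max_shift, seed=0):
    """Build a 'translated' split: the base images plus one random
    translation of each, doubling the set as in version (b)."""
    rng = np.random.default_rng(seed)
    shifted = np.stack([random_translate(x, max_shift, rng) for x in base])
    return np.concatenate([base, shifted])

imgs = np.random.default_rng(0).random((100, 32, 32, 3))
train = augment_half(imgs, max_shift=4)
print(train.shape)  # (200, 32, 32, 3)
```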

We train using SGD a deep convolutional neural network on each of the above datasets and calculate the GE and the spectral complexity metric defined in (3) at the end of each epoch. To confirm that these results are not specific to the Frobenius norm, but also representative of other spectral complexity definitions, we repeated the experiment with the $(2,1)$-norm metric defined in (4). The results were consistent with those presented here; they are deferred to the appendix for completeness. As a sanity check, we confirm in Figure 2(a) that the GE and the metric correlate during training (though lying on different scales).

Figure 2(b) plots the GE as a function of the metric for all three datasets, with markers corresponding to results for different epochs. It is important to compare GE values for the same spectral complexity, as this highlights a hidden variable along which the GE varies and which is not captured by spectral complexity. We see that for the same metric value the deep convolutional neural network exhibits different GE for the different dataset versions. The network is able to exploit its translation invariance and deformation stability to obtain a lower GE compared to the normal dataset. Intuitively, by replacing part of the variation in the data manifold with variations to which the network is invariant, we simplify the manifold for the deep convolutional neural network, improving the GE (even though the complexity of the classifier according to spectral complexity is the same). We furthermore observe that the deep convolutional neural network is more robust to translations than to elastic deformations, as it obtains improved GE for the former at the same metric values.

#### 5.2.2 Delving deeper into invariances

This section explores further the insensitivity of spectral complexity to data symmetries. Specifically, we create a number of datasets where we vary the percentage of augmentations. We start from datasets consisting entirely of normal samples and gradually lower that percentage by adding augmentations. We create two collections of datasets: one for augmentations using translations and one for augmentations using elastic deformations. We train using SGD a deep convolutional neural network on these datasets and calculate after each epoch the GE and the spectral complexity metric.

We plot the results in Figures 11 and 5. We see that, for both the translated and elastic datasets, more augmentations result in GE curves that have gradually smaller slopes. Thus, for the same metric, the GE decreases as the number of augmentations increases. Alternatively, we can fix a metric value and plot the GE vs the percentage of normal data-points. We plot the results in Figures 10(a) and 4(a). We see that for fixed metric values the percentage of normal data-points, i.e., ones that are not slight translations or deformations of others, correlates with the GE. We conclude that spectral complexity has limitations as it does not account for variations in the data manifold. Incorporating the geometry of the data manifold in future measures of complexity will improve their predictive abilities.

## 6 Discussion

We have presented new PAC-Bayes generalization bounds for deep convolutional neural networks that are orders of magnitude tighter than the previous estimate. We then explored several limitations of spectral complexity as a measure of generalization performance. We applied our technique to locally connected layers and showed that they are indistinguishable from convolutional layers under the current proof technique, contrary to what one would expect in practice. We furthermore explored the insensitivity of spectral complexity to invariances of deep convolutional neural networks. Our findings suggest that incorporating the data structure in generalization bounds should improve their predictive ability.

## Concentration of measure

In the derivations below we will rely upon the following useful theorem for the concentration of the spectral norm of sparse random matrices:

###### Theorem 1.1.

Bandeira et al. (2016) Let $A$ be a $d_1 \times d_2$ random rectangular matrix with entries $A_{ij} = \psi_{ij} g_{ij}$, where the $g_{ij}$ are independent standard Gaussian random variables and the $\psi_{ij}$ are scalars. Then

$\mathbb{P}\Big(\|A\|_2 \geq (1+\epsilon)\Big\{\sigma_1 + \sigma_2 + \frac{5}{\sqrt{\log(1+\epsilon)}}\,\sigma_*\sqrt{\log(\max(d_2, d_1))} + t\Big\}\Big) \leq e^{-t^2/2\sigma_*^2}$ (14)

for any $\epsilon > 0$ and $t > 0$, with

$\sigma_1 := \max_i \sqrt{\textstyle\sum_j \psi_{ij}^2}, \qquad \sigma_2 := \max_j \sqrt{\textstyle\sum_i \psi_{ij}^2}, \qquad \sigma_* := \max_{ij} |\psi_{ij}|.$ (15)
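A quick empirical check of this estimate for an arbitrary variance profile (our own toy experiment, using the $\epsilon$-free leading terms):

```python
import numpy as np

# For A with independent N(0, psi_ij^2) entries, ||A||_2 should stay below
# sigma1 + sigma2 plus a sigma_* sqrt(log d) correction, and is never far
# below sigma1 + sigma2 either.
rng = np.random.default_rng(0)
d = 300
psi = rng.random((d, d))                          # arbitrary variance profile
A = rng.normal(0.0, psi)                          # entry-wise scaled Gaussians
sigma1 = np.sqrt((psi ** 2).sum(axis=1)).max()    # max row l2 mass
sigma2 = np.sqrt((psi ** 2).sum(axis=0)).max()    # max column l2 mass
sigma_star = psi.max()
norm = np.linalg.norm(A, 2)
print(norm, sigma1 + sigma2 + 5 * sigma_star * np.sqrt(np.log(d)))
```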

In the following we will use the same numbering for Theorems and Lemmas as in the main paper. Theorems and Lemmas unique to the appendix will be numbered with a prefix corresponding to the section where the theorem is introduced and suffixed with a corresponding number.

### A. Fully Connected Layers

###### Lemma A.1.

Let $U$ be the perturbation matrix of a fully connected layer with row and column sparsity equal to $s$. Then if the non-zero elements follow $\mathcal{N}(0, \sigma^2)$, with probability greater than $1-T$

$\|U\|_2 \leq O\big(\sigma\big(2\sqrt{s} + \sqrt{2\log(1/T)}\big)\big).$ (16)
###### Proof.

For the proof we need to define an index function that places a Gaussian random noise variable at the locations where the original dense layer is non-zero.

We assume that $\psi_{ij} = 1$ when the corresponding entry is non-zero and $\psi_{ij} = 0$ otherwise, and get the result for $\sigma = 1$ trivially from Theorem 1.1. We can extend the result to general $\sigma$ by considering that $\|\sigma U\|_2 = \sigma\|U\|_2$. ∎

### B. Locally Connected Layers

###### Lemma 5.2.

Let $U$ be the perturbation matrix of a 2-d locally connected layer with $a$ input channels, $b$ output channels, filters of support $q^2$ and $N\times N$ feature maps. Then if the non-zero elements follow $\mathcal{N}(0, \sigma^2)$, with probability greater than $1-T$

$\|U\|_2 \leq O\Big(\sigma\big(q[\sqrt{a}+\sqrt{b}] + \sqrt{2\log(1/T)}\big)\Big).$ (17)
###### Proof.

We first consider the case $\sigma = 1$. A convolutional layer is characterised by its output channels. For each output channel, each input channel is convolved with an independent filter, resulting in a set of feature maps; for each output channel these feature maps are then summed together. We consider locally connected layers, i.e. the layers are banded but the entries are independent and there is no weight sharing. For the case of one-dimensional signals the implied structure is plotted in Figure 2a.

Similar to Lemma A.1, we assume that $\psi_{ij} = 1$ when the corresponding entry is non-zero and $\psi_{ij} = 0$ otherwise. We need to evaluate $\sigma_1$, $\sigma_2$ and $\sigma_*$ for a matrix like the one in Figure 2a.

We plot what these sums represent in Figures 1a,b. For $\sigma_1$ we can find an upper bound by considering that the sum for a given filter and a given pixel location represents the maximum number of overlaps over all 2D shifts. For the 2D case this is $q^2$, equal to the support of the filters. We plot these shifts in Figure 2. We also need to consider that there are $a$ input channels. We then get

$$\sigma_1 := \max_i \sqrt{\textstyle\sum_j \psi_{ij}^2} = \sqrt{\textstyle\sum_a \sum_{q^2} 1^2} = \sqrt{a q^2} = q\sqrt{a}. \tag{18}$$

For $\sigma_2$, each column in the matrix represents a concatenation of convolutional filters. The support of each filter is $q^2$ and there are $b$ filters stacked on top of each other, corresponding to the output channels. It is then straightforward to derive that

$$\sigma_2 := \max_j \sqrt{\textstyle\sum_i \psi_{ij}^2} = \sqrt{\textstyle\sum_b \sum_{q^2} 1^2} = \sqrt{b q^2} = q\sqrt{b}. \tag{19}$$

Furthermore, trivially $\sigma_* = 1$, and when $\sigma \neq 1$ we can get the result by considering that $\|\sigma U\|_2 = \sigma\|U\|_2$. ∎
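The quantities $\sigma_1$ and $\sigma_2$ can be verified on a small example. The sketch below builds the 1-D analogue of the banded matrix of Figure 2a (so the filter support is $q$ rather than $q^2$, and the row/column sums become $aq$ and $bq$; all sizes are hypothetical) and checks the maximal row and column sums of $\psi_{ij}^2$.

```python
import numpy as np

# 1-D analogue of the locally connected noise matrix: b x a grid of banded
# blocks with independent entries (no weight sharing), bandwidth q.
rng = np.random.default_rng(1)
a, b, q, N = 3, 4, 5, 32   # input channels, output channels, filter size, signal length

blocks = []
for _ in range(b):
    row = []
    for _ in range(a):
        A = np.zeros((N, N))
        for n in range(N):
            for k in range(q):
                # circular band so every row and column has exactly q non-zeros
                A[n, (n + k) % N] = rng.normal()
        row.append(A)
    blocks.append(row)
U = np.block(blocks)

psi2 = (U != 0).astype(float)              # psi_ij^2 for the 0/1 index function
sigma1 = np.sqrt(psi2.sum(axis=1).max())   # max over rows
sigma2 = np.sqrt(psi2.sum(axis=0).max())   # max over columns
print(sigma1, np.sqrt(q * a))              # both equal sqrt(a*q)
print(sigma2, np.sqrt(q * b))              # both equal sqrt(b*q)
```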

### C. Convolutional Layers

###### Lemma 4.3.

Let $U$ be the perturbation matrix of a 2D convolutional layer with $a$ input channels, $b$ output channels, convolutional filters $f^{ij} \in \mathbb{R}^{q\times q}$ and feature maps of size $N\times N$. Then if we vectorize the convolutional filter weights and add a vectorized noise vector $u \sim \mathcal{N}(0,\sigma^2 I)$, with probability greater than $1-T$,

$$\|U\|_2 \leq \sigma\left(q\left[\sqrt{a}+\sqrt{b}\right]+\sqrt{2\log\tfrac{2N^2}{T}}\right). \tag{20}$$
###### Proof.

We consider noise filters $f^{ij} \in \mathbb{R}^{q\times q}$ and feature maps of size $N\times N$. We define the convolutional noise operator from input channel to output channel in the spatial domain as $A_{ij}$ and in the frequency domain as $\tilde{A}_{ij}$, and we denote the Fourier transform matrix as $F$. Each convolutional operator $A_{ij}$ corresponds to one convolutional noise filter $f^{ij}$. We can now define the structure of the 2D convolutional noise matrix $U$. Given $a$ input channels and $b$ output channels the noise matrix is structured as

$$U = \begin{bmatrix} A_{00} & \dots & A_{0a} \\ \vdots & \ddots & \vdots \\ A_{b0} & \dots & A_{ba} \end{bmatrix} \tag{21}$$

where for all output channels the signal's input channels are convolved with independent noise filters and the results of these convolutions are summed up.

By exploiting the unitary-invariance property of the spectral norm we transform this matrix into the Fourier domain to obtain

$$\|U\|_2 = \left\|(I_b \otimes F^T)\begin{bmatrix} \tilde{A}_{00} & \dots & \tilde{A}_{0a} \\ \vdots & \ddots & \vdots \\ \tilde{A}_{b0} & \dots & \tilde{A}_{ba} \end{bmatrix}(I_a \otimes F)\right\|_2 = \left\|\begin{bmatrix} \tilde{A}_{00} & \dots & \tilde{A}_{0a} \\ \vdots & \ddots & \vdots \\ \tilde{A}_{b0} & \dots & \tilde{A}_{ba} \end{bmatrix}\right\|_2 = \left\|\begin{bmatrix} \tilde{B}_0 & \dots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \dots & \tilde{B}_{N^2} \end{bmatrix}\right\|_2 \tag{22}$$

where we have used the fact that the matrices $\tilde{A}_{ij}$ are diagonal and a concatenation of diagonal matrices can always be rearranged into block diagonal form. In our case we have defined blocks

$$\tilde{B}_n = \begin{bmatrix} \lambda^{00}_n & \dots & \lambda^{0a}_n \\ \vdots & \ddots & \vdots \\ \lambda^{b0}_n & \dots & \lambda^{ba}_n \end{bmatrix} \tag{23}$$

with entries

$$\lambda^{ij}_n = \lambda^{ij}_{n_1,n_2} = \sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1} e^{-2\pi i\left(\frac{k_1 n_1}{q}+\frac{k_2 n_2}{q}\right)} f^{ij}_{k_1,k_2} = \sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1} \cos\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right) f^{ij}_{k_1,k_2} - i\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1} \sin\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right) f^{ij}_{k_1,k_2} \tag{24}$$

where $n = (n_1, n_2)$ are the frequency coordinates. In this way the block $\tilde{B}_n$ corresponds to the frequency components at $n$ from the Fourier transforms of all filters.
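The block-diagonalization argument can be illustrated in one dimension, where the blocks $A_{ij}$ are circulant and are simultaneously diagonalized by the DFT. The sketch below (hypothetical sizes) confirms that the spectral norm of the full block matrix $U$ equals the largest spectral norm among the small per-frequency blocks $\tilde{B}_n$.

```python
import numpy as np

# 1-D check: the eigenvalues of each circulant block are the DFT of its
# filter, so ||U||_2 = max_n ||B_n||_2 over the frequency blocks.
rng = np.random.default_rng(2)
a, b, N = 2, 3, 16
filters = rng.normal(size=(b, a, N))   # one noise filter per (out, in) pair

def circulant(f):
    # circulant convolution operator with first column f
    return np.array([[f[(i - j) % N] for j in range(N)] for i in range(N)])

U = np.block([[circulant(filters[i, j]) for j in range(a)] for i in range(b)])

lam = np.fft.fft(filters, axis=2)      # per-frequency entries lambda_n^{ij}
per_freq = [np.linalg.norm(lam[:, :, n], 2) for n in range(N)]

print(np.linalg.norm(U, 2), max(per_freq))   # the two values agree
```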

We will also need the matrices $\mathrm{Re}(\tilde{B}_n)$ and $\mathrm{Im}(\tilde{B}_n)$:

$$\mathrm{Re}(\tilde{B}_n) = \begin{bmatrix} \mathrm{Re}(\lambda^{00}_n) & \dots & \mathrm{Re}(\lambda^{0a}_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Re}(\lambda^{b0}_n) & \dots & \mathrm{Re}(\lambda^{ba}_n) \end{bmatrix},\qquad \mathrm{Im}(\tilde{B}_n) = \begin{bmatrix} \mathrm{Im}(\lambda^{00}_n) & \dots & \mathrm{Im}(\lambda^{0a}_n) \\ \vdots & \ddots & \vdots \\ \mathrm{Im}(\lambda^{b0}_n) & \dots & \mathrm{Im}(\lambda^{ba}_n) \end{bmatrix} \tag{25}$$

The entries of these matrices have the following distributions

$$\mathrm{Re}(\lambda^{ij}_n) \sim \mathcal{N}\!\left(0,\;\sigma^2_{\mathrm{re},n}\right),\quad \sigma^2_{\mathrm{re},n} = \sigma^2\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1} \cos^2\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right),\qquad \mathrm{Im}(\lambda^{ij}_n) \sim \mathcal{N}\!\left(0,\;\sigma^2_{\mathrm{im},n}\right),\quad \sigma^2_{\mathrm{im},n} = \sigma^2\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1} \sin^2\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right) \tag{26}$$

where we have used the fact that the $f^{ij}_{k_1,k_2}$ are i.i.d. Gaussian.
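These variances can be checked by Monte Carlo simulation for a single frequency (with unit filter variance and hypothetical $q$, $n_1$, $n_2$):

```python
import numpy as np

# Monte-Carlo check: the real and imaginary parts of a DFT coefficient of an
# i.i.d. Gaussian filter have variances sum(cos^2) and sum(sin^2).
rng = np.random.default_rng(3)
q, n1, n2, trials = 3, 1, 2, 200_000
k1, k2 = np.meshgrid(np.arange(q), np.arange(q), indexing="ij")
phase = 2 * np.pi * (k1 * n1 / q + k2 * n2 / q)

f = rng.normal(size=(trials, q, q))                # i.i.d. standard Gaussian filters
lam = (f * np.exp(-1j * phase)).sum(axis=(1, 2))   # DFT coefficient at (n1, n2)

print(lam.real.var(), (np.cos(phase) ** 2).sum())  # empirical vs. sum of cos^2
print(lam.imag.var(), (np.sin(phase) ** 2).sum())  # empirical vs. sum of sin^2
```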

We have now turned our initial problem into a form that lends itself more easily to a solution. Our original matrix has been turned into block diagonal form, and each block can be split into real and imaginary parts that have independent Gaussian entries; we note however that the blocks are not independent of each other. We will now derive a concentration bound on the original matrix by using the fact that the spectral norm of a block diagonal matrix is equal to the maximum of the spectral norms of the individual blocks.

We can write the following inequalities

$$\mathbb{P}(\|U\|_2 \leq \epsilon) = \mathbb{P}\left(\bigcap_n \{\|\tilde{B}_n\|_2 \leq \epsilon\}\right) \geq \mathbb{P}\left(\bigcap_n \{\|\mathrm{Re}(\tilde{B}_n)\|_2 + \|\mathrm{Im}(\tilde{B}_n)\|_2 \leq \epsilon\}\right) \tag{27}$$

and by setting arbitrary constants $\epsilon_n$ we can furthermore write

$$\begin{aligned}\mathbb{P}\left(\|U\|_2 \leq \max_n(\epsilon_n)\right) &\geq \mathbb{P}\left(\bigcap_n \{\|\mathrm{Re}(\tilde{B}_n)\|_2 + \|\mathrm{Im}(\tilde{B}_n)\|_2 \leq \max_n(\epsilon_n)\}\right)\\ &\geq \mathbb{P}\left(\bigcap_n \{\|\mathrm{Re}(\tilde{B}_n)\|_2 + \|\mathrm{Im}(\tilde{B}_n)\|_2 \leq \epsilon_n\}\right)\\ &\geq \mathbb{P}\left(\bigcap_n \left[\{\|\mathrm{Re}(\tilde{B}_n)\|_2 \leq \epsilon_{n,\mathrm{re}}\}\cap\{\|\mathrm{Im}(\tilde{B}_n)\|_2 \leq \epsilon_{n,\mathrm{im}}\}\right]\right)\\ &\geq 1-\sum_{n=1}^{N^2}\left[T_{n,\mathrm{re}}+T_{n,\mathrm{im}}\right] \end{aligned} \tag{28}$$

where in the third inequality we set $\epsilon_n = \epsilon_{n,\mathrm{re}} + \epsilon_{n,\mathrm{im}}$, and in the last we used a union bound and assumed that $\mathbb{P}(\|\mathrm{Re}(\tilde{B}_n)\|_2 \geq \epsilon_{n,\mathrm{re}}) \leq T_{n,\mathrm{re}}$ and $\mathbb{P}(\|\mathrm{Im}(\tilde{B}_n)\|_2 \geq \epsilon_{n,\mathrm{im}}) \leq T_{n,\mathrm{im}}$ for positive constants $T_{n,\mathrm{re}}, T_{n,\mathrm{im}}$.

We will now calculate concentration inequalities for the individual blocks $\mathrm{Re}(\tilde{B}_n)$ and $\mathrm{Im}(\tilde{B}_n)$, turning the general formula we have derived into a specific one for our case. To do that we apply the following concentration inequality by Vershynin (2010)

###### Theorem C.1.

Let $A$ be an $N \times n$ matrix whose entries are independent Gaussian random variables with variance $\sigma^2$. Then for every $T > 0$,

$$\mathbb{P}\left(\|A\|_2 \geq \sigma\left(\sqrt{N}+\sqrt{n}+\sqrt{2\ln\tfrac{1}{T}}\right)\right) \leq T \tag{29}$$

to the matrices $\mathrm{Re}(\tilde{B}_n)$ and $\mathrm{Im}(\tilde{B}_n)$. We obtain the following concentration inequalities

$$\mathbb{P}\left(\|\mathrm{Re}(\tilde{B}_n)\|_2 \geq \sigma_{\mathrm{re},n}\left(\sqrt{a}+\sqrt{b}+\sqrt{2\ln\tfrac{1}{T_{n,\mathrm{re}}}}\right)\right) \leq T_{n,\mathrm{re}},\qquad \mathbb{P}\left(\|\mathrm{Im}(\tilde{B}_n)\|_2 \geq \sigma_{\mathrm{im},n}\left(\sqrt{a}+\sqrt{b}+\sqrt{2\ln\tfrac{1}{T_{n,\mathrm{im}}}}\right)\right) \leq T_{n,\mathrm{im}} \tag{30}$$
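Theorem C.1 can be checked empirically: over repeated draws of a Gaussian matrix, the fraction of spectral norms exceeding the threshold should stay below $T$ (the sizes below are hypothetical).

```python
import numpy as np

# Empirical check of the Gaussian-matrix bound: the failure rate over many
# draws should be (much) smaller than the nominal level T.
rng = np.random.default_rng(5)
Nr, nc, sigma, T, draws = 60, 40, 1.0, 0.05, 500
thresh = sigma * (np.sqrt(Nr) + np.sqrt(nc) + np.sqrt(2 * np.log(1 / T)))

norms = np.array([np.linalg.norm(rng.normal(0, sigma, (Nr, nc)), 2)
                  for _ in range(draws)])
print((norms >= thresh).mean())   # empirical failure rate, at most T
```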

We then make the following calculation, which will prove useful:

$$\max_n\left[\sigma_{\mathrm{re},n}+\sigma_{\mathrm{im},n}\right] = \sigma\max_n\left[\sqrt{\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1}\cos^2\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right)}+\sqrt{\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1}\sin^2\!\left(2\pi\left(\tfrac{k_1 n_1}{q}+\tfrac{k_2 n_2}{q}\right)\right)}\right] \leq \sigma\left(\sqrt{\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1}\tfrac{1}{2}}+\sqrt{\sum_{k_1=0}^{q-1}\sum_{k_2=0}^{q-1}\tfrac{1}{2}}\right) = \sqrt{2}\,\sigma q \leq 1.5\,\sigma q \tag{31}$$

since

$$\frac{\partial}{\partial\theta_l}\left(\sqrt{\sum_i\sin^2(\theta_i)}+\sqrt{\sum_i\cos^2(\theta_i)}\right) = \sin(\theta_l)\cos(\theta_l)\left(\frac{1}{\sqrt{\sum_i\sin^2(\theta_i)}}-\frac{1}{\sqrt{\sum_i\cos^2(\theta_i)}}\right) = 0 \tag{32}$$

which implies that at the maximum $\sum_i\sin^2(\theta_i) = \sum_i\cos^2(\theta_i)$, so that each sum equals half the number of terms, $q^2/2$, and the bound above follows.
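The stationarity argument can be confirmed by random search: with $m = q^2$ angles, the quantity $\sqrt{\sum\sin^2}+\sqrt{\sum\cos^2}$ never exceeds $\sqrt{2m}$ and approaches it when the two sums are balanced (the value of $m$ below is a hypothetical choice).

```python
import numpy as np

# Random search over angle configurations: g(theta) is bounded by sqrt(2m)
# and the bound is approached when sum(sin^2) = sum(cos^2) = m/2.
rng = np.random.default_rng(4)
m = 9          # plays the role of q^2 in the lemma
best = 0.0
for _ in range(20000):
    th = rng.uniform(0, 2 * np.pi, m)
    g = np.sqrt((np.sin(th) ** 2).sum()) + np.sqrt((np.cos(th) ** 2).sum())
    best = max(best, g)
print(best, np.sqrt(2 * m))   # best approaches sqrt(2m) from below
```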