# Spectrally-normalized margin bounds for neural networks

This paper presents a margin-based multiclass generalization bound for neural networks that scales with their margin-normalized "spectral complexity": their Lipschitz constant, meaning the product of the spectral norms of the weight matrices, times a certain correction factor. This bound is empirically investigated for a standard AlexNet network trained with SGD on the mnist and cifar10 datasets, with both original and random labels; the bound, the Lipschitz constants, and the excess risks are all in direct correlation, suggesting both that SGD selects predictors whose complexity scales with the difficulty of the learning task, and that the presented bound is sensitive to this complexity.

## 1 Overview

Neural networks owe their astonishing success not only to their ability to fit any data set: they also generalize well, meaning they provide a close fit on unseen data. A classical statistical adage is that models capable of fitting too much will generalize poorly; what’s going on here?

Let’s navigate the many possible explanations provided by statistical theory. A first observation is that any analysis based solely on the number of possible labellings on a finite training set (as is the case with VC dimension) is doomed: if the function class can fit all possible labels (as is the case with neural networks in standard configurations (Zhang et al., 2017)), then this analysis cannot distinguish it from the collection of all possible functions!

Next let’s consider scale-sensitive measures of complexity, such as Rademacher complexity and covering numbers, which (can) work directly with real-valued function classes, and moreover are sensitive to their magnitudes. Figure 1 plots the excess risk (the test error minus the training error) across training epochs against one candidate scale-sensitive complexity measure, the Lipschitz constant of the network (the product of the spectral norms of the weight matrices), and demonstrates that they are tightly correlated (which is not the case for, say, the norm of the weights). The data considered in Figure 1 is the standard cifar10 dataset, both with original and with random labels, which has been used as a sanity check when investigating neural network generalization (Zhang et al., 2017).

There is still an issue with basing a complexity measure purely on the Lipschitz constant (although it has already been successfully employed to regularize neural networks (Cisse et al., 2017)): as depicted in Figure 1, the measure grows over time, despite the excess risk plateauing. Fortunately, there is a standard resolution to this issue: investigating the margins (a precise measure of confidence) of the outputs of the network. This tool has been used to study the behavior of 2-layer networks, boosting methods, SVMs, and many others (Bartlett, 1996; Schapire et al., 1997; Boucheron et al., 2005); in boosting, for instance, there is a similar growth in complexity over time (each training iteration adds a weak learner), whereas margin bounds correctly stay flat or even decrease. This behavior is recovered here: as depicted in Figure 1, even though standard networks exhibit growing Lipschitz constants, normalizing these Lipschitz constants by the margin instead gives a decaying curve.

### 1.1 Contributions

This work investigates a complexity measure for neural networks that is based on the Lipschitz constant, but normalized by the margin of the predictor. The two central contributions are as follows.

• Theorem 1.1 below will give the rigorous statement of the generalization bound that is the basis of this work. In contrast to prior work, this bound: (a) scales with the Lipschitz constant (product of spectral norms of weight matrices) divided by the margin; (b) has no dependence on combinatorial parameters (e.g., number of layers or nodes) outside of log factors; (c) is multiclass (with no explicit dependence on the number of classes); (d) measures complexity against a reference network (e.g., for the ResNet (He et al., 2016), the reference network has identity mappings at each layer). The bound is stated below, with a general form and analysis summary appearing in Section 3 and the full details relegated to the appendix.

• An empirical investigation, in Section 2, of neural network generalization on the standard datasets cifar10, cifar100, and mnist using the preceding bound. Rather than using the bound to provide a single number, it can be used to form a margin distribution as in Figure 2. These margin distributions will illuminate the following intuitive observations: (a) cifar10 is harder than mnist; (b) random labels make cifar10 and mnist much more difficult; (c) the margin distributions (and bounds) converge during training, even though the weight matrices continue to grow; (d) regularization (“weight decay”) does not significantly impact margins or generalization.

A more detailed description of the margin distributions is as follows. Suppose a neural network computes a function $f:\mathbb{R}^d\to\mathbb{R}^k$, where $k$ is the number of classes; the most natural way to convert this to a classifier is to select the output coordinate with the largest magnitude, meaning $x\mapsto\arg\max_j f(x)_j$. The margin, then, measures the gap between the output for the correct label and other labels, meaning $f(x)_y-\max_{j\neq y}f(x)_j$.

Unfortunately, margins alone do not seem to say much; see for instance Figure 1(a), where the collections of all margins for all data points — the unnormalized margin distribution — are similar for cifar10 with and without random labels. What is missing is an appropriate normalization, as in Figure 1(b). This normalization is provided by Theorem 1.1, which can now be explained in detail.

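To state the bound, a little bit of notation is necessary. The networks will use $L$ fixed nonlinearities $(\sigma_1,\ldots,\sigma_L)$, where $\sigma_i:\mathbb{R}^{d_{i-1}}\to\mathbb{R}^{d_i}$ is $\rho_i$-Lipschitz (e.g., as with coordinate-wise ReLU, and max-pooling, as discussed in Section A.1); occasionally, it will also hold that $\sigma_i(0)=0$. Given $L$ weight matrices $\mathcal{A}=(A_1,\ldots,A_L)$, let $F_{\mathcal{A}}$ denote the function computed by the corresponding network:

$$F_{\mathcal{A}}(x) := \sigma_L\big(A_L\,\sigma_{L-1}(A_{L-1}\cdots\sigma_1(A_1x)\cdots)\big). \tag{1.1}$$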

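The network output $F_{\mathcal{A}}(x)\in\mathbb{R}^{d_L}$ (with $d_0=d$ and $d_L=k$) is converted to a class label in $\{1,\ldots,k\}$ by taking the $\arg\max$ over components, with an arbitrary rule for breaking ties. Whenever input data $x_1,\ldots,x_n\in\mathbb{R}^d$ are given, collect them as rows of a matrix $X\in\mathbb{R}^{n\times d}$. Occasionally, notation will be overloaded to discuss $F_{\mathcal{A}}(X^\top)$, a matrix whose $i$th column is $F_{\mathcal{A}}(x_i)$. Let $W$ denote the maximum of $\{d,d_1,\ldots,d_L\}$. The $\ell_2$ norm $\|\cdot\|_2$ is always computed entry-wise; thus, for a matrix, it corresponds to the Frobenius norm.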
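To make the notation of eq. (1.1) concrete, the following is a minimal NumPy sketch of the forward map and the argmax classifier. It assumes every nonlinearity is the coordinate-wise ReLU; the function names `forward` and `predict` and the example matrices are hypothetical, and labels are 0-indexed here rather than in $\{1,\ldots,k\}$.

```python
import numpy as np

def relu(z):
    """Coordinate-wise ReLU, which is 1-Lipschitz and maps 0 to 0."""
    return np.maximum(z, 0.0)

def forward(weights, x, nonlinearities=None):
    """Evaluate sigma_L(A_L sigma_{L-1}(... sigma_1(A_1 x) ...)).

    `x` may be a single vector or a matrix whose columns are examples.
    """
    if nonlinearities is None:
        nonlinearities = [relu] * len(weights)  # assumption: ReLU everywhere
    h = np.asarray(x, dtype=float)
    for A, sigma in zip(weights, nonlinearities):
        h = sigma(A @ h)
    return h

def predict(weights, x):
    """Class label as the index of the largest output (ties go to the lowest index)."""
    return int(np.argmax(forward(weights, x)))

# Hypothetical two-layer network with d = 3, d_1 = 2, k = 2.
A1 = np.array([[1.0, 0.0, -1.0],
               [0.0, 2.0, 0.0]])
A2 = np.eye(2)
x = np.array([1.0, 1.0, 0.5])
output = forward([A1, A2], x)
label = predict([A1, A2], x)
```

Passing `X.T` instead of a single `x` yields the matrix whose columns are the network outputs on each example.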

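Next, define a collection of reference matrices $(M_1,\ldots,M_L)$ with the same dimensions as $A_1,\ldots,A_L$; for instance, to obtain a good bound for ResNet (He et al., 2016), it is sensible to set $M_i:=I$, the identity map, and the bound below will worsen as the network moves farther from the identity map; for AlexNet (Krizhevsky et al., 2012), the simple choice $M_i=0$ suffices. Finally, let $\|\cdot\|_\sigma$ denote the spectral norm, and let $\|\cdot\|_{p,q}$ denote the $(p,q)$ matrix norm, defined by $\|A\|_{p,q}:=\big\|(\|A_{:,1}\|_p,\ldots,\|A_{:,m}\|_p)\big\|_q$ for $A\in\mathbb{R}^{d\times m}$. The spectral complexity $R_{\mathcal{A}}$ of a network $F_{\mathcal{A}}$ with weights $\mathcal{A}$ is then defined as

$$R_{\mathcal{A}} := \left(\prod_{i=1}^{L}\rho_i\|A_i\|_\sigma\right)\left(\sum_{i=1}^{L}\frac{\|A_i^\top-M_i^\top\|_{2,1}^{2/3}}{\|A_i\|_\sigma^{2/3}}\right)^{3/2}. \tag{1.2}$$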
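As an illustration, the spectral complexity can be evaluated directly from a list of weight matrices. The sketch below assumes the form of eq. (1.2) with the $(2,1)$ norm applied to $A_i^\top-M_i^\top$, defaults the reference matrices to zero and the Lipschitz constants to 1, and uses hypothetical helper names; it is not the authors' experimental code.

```python
import numpy as np

def spectral_norm(A):
    """Largest singular value of A."""
    return np.linalg.norm(A, ord=2)

def norm_2_1(A):
    """(2,1) norm: l2 norm of each column, then the l1 norm of those values."""
    return np.linalg.norm(A, axis=0).sum()

def spectral_complexity(weights, references=None, lipschitz=None):
    """Product of rho_i * ||A_i||_sigma times the (2/3, 3/2) correction factor."""
    if references is None:
        references = [np.zeros_like(A) for A in weights]  # assumption: M_i = 0
    if lipschitz is None:
        lipschitz = [1.0] * len(weights)                  # assumption: rho_i = 1
    product, total = 1.0, 0.0
    for A, M, rho in zip(weights, references, lipschitz):
        s = spectral_norm(A)
        product *= rho * s
        total += (norm_2_1((A - M).T) / s) ** (2.0 / 3.0)
    return product * total ** 1.5

weights = [2.0 * np.eye(2), np.eye(2)]
r_zero_reference = spectral_complexity(weights)
r_identity_reference = spectral_complexity(
    weights, references=[np.zeros((2, 2)), np.eye(2)])
```

In this toy case the second layer equals the identity, so choosing the identity as its reference matrix removes that layer's contribution to the sum and lowers the complexity.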

The following theorem provides a generalization bound for neural networks whose nonlinearities are fixed but whose weight matrices $\mathcal{A}$ have bounded spectral complexity $R_{\mathcal{A}}$.

###### Theorem 1.1.

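Let nonlinearities $(\sigma_1,\ldots,\sigma_L)$ and reference matrices $(M_1,\ldots,M_L)$ be given as above (i.e., $\sigma_i$ is $\rho_i$-Lipschitz and $\sigma_i(0)=0$). Then for $(x,y),(x_1,y_1),\ldots,(x_n,y_n)$ drawn iid from any probability distribution over $\mathbb{R}^d\times\{1,\ldots,k\}$, with probability at least $1-\delta$ over $((x_i,y_i))_{i=1}^n$, every margin $\gamma>0$ and network $F_{\mathcal{A}}:\mathbb{R}^d\to\mathbb{R}^k$ with weight matrices $\mathcal{A}=(A_1,\ldots,A_L)$ satisfy

$$\Pr\Big[\arg\max_j F_{\mathcal{A}}(x)_j\neq y\Big]\le\hat{R}_\gamma(F_{\mathcal{A}})+\tilde{O}\left(\frac{\|X\|_2R_{\mathcal{A}}}{\gamma n}\ln(W)+\sqrt{\frac{\ln(1/\delta)}{n}}\right),$$

where $\hat{R}_\gamma(f)\le n^{-1}\sum_i\mathbb{1}\big[f(x_i)_{y_i}\le\gamma+\max_{j\neq y_i}f(x_i)_j\big]$ and $\|X\|_2=\sqrt{\sum_i\|x_i\|_2^2}$.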
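The empirical term in the theorem can be computed from network outputs alone. The sketch below assumes that this term is controlled by the fraction of training points whose margin is at most $\gamma$; the function names are hypothetical and labels are 0-indexed.

```python
import numpy as np

def margins(outputs, labels):
    """Margin f(x)_y - max_{j != y} f(x)_j for each row of an (n, k) output array."""
    outputs = np.asarray(outputs, dtype=float)
    labels = np.asarray(labels)
    rows = np.arange(outputs.shape[0])
    correct = outputs[rows, labels]
    masked = outputs.copy()
    masked[rows, labels] = -np.inf  # exclude the correct class from the max
    return correct - masked.max(axis=1)

def empirical_margin_error(outputs, labels, gamma):
    """Fraction of examples whose margin is at most gamma."""
    return float(np.mean(margins(outputs, labels) <= gamma))

outputs = np.array([[3.0, 1.0, 0.0],
                    [0.5, 1.0, 0.0],
                    [0.0, 0.0, 2.0]])
labels = np.array([0, 0, 2])
raw = margins(outputs, labels)
error_at_zero = empirical_margin_error(outputs, labels, 0.0)
error_at_one = empirical_margin_error(outputs, labels, 1.0)
error_at_two = empirical_margin_error(outputs, labels, 2.0)
```

Increasing $\gamma$ can only increase this empirical term while shrinking the complexity term, which is the trade-off the bound expresses.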

The full proof and a generalization beyond spectral norms is relegated to the appendix, but a sketch is provided in Section 3, along with a lower bound. Section 3 also gives a discussion of related work: briefly, it’s essential to note that margin and Lipschitz-sensitive bounds have a long history in the neural networks literature (Bartlett, 1996; Anthony and Bartlett, 1999; Neyshabur et al., 2015); the distinction here is the sensitivity to the spectral norm, and that there is no explicit appearance of combinatorial quantities such as numbers of parameters or layers (outside of log terms, and indices to summations and products).

To close, miscellaneous observations and open problems are collected in Section 4.

## 2 Generalization case studies via margin distributions

In this section, we empirically study the generalization behavior of neural networks, via margin distributions and the generalization bound stated in Theorem 1.1.

Before proceeding with the plots, it’s a good time to give a more refined description of the margin distribution, one that is suitable for comparisons across datasets. Given $n$ pattern/label pairs $((x_i,y_i))_{i=1}^n$, with patterns as rows of matrix $X\in\mathbb{R}^{n\times d}$, and given a predictor $F_{\mathcal{A}}:\mathbb{R}^d\to\mathbb{R}^k$, the (normalized) margin distribution is the univariate empirical distribution of the labeled data points each transformed into a single scalar according to

$$(x,y)\mapsto\frac{F_{\mathcal{A}}(x)_y-\max_{i\neq y}F_{\mathcal{A}}(x)_i}{R_{\mathcal{A}}\|X\|_2/n},$$

where the spectral complexity $R_{\mathcal{A}}$ is from eq. 1.2. The normalization is thus derived from the bound in Theorem 1.1, but ignoring log terms.
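A minimal sketch of this normalization follows. It assumes the raw margins and the spectral complexity have already been computed, divides by $R_{\mathcal{A}}\|X\|_2/n$ with $\|X\|_2$ taken as the Frobenius norm, and uses a hypothetical function name.

```python
import numpy as np

def normalized_margins(raw_margins, data, complexity):
    """Divide raw margins by complexity * ||X||_2 / n (Frobenius norm of the data)."""
    raw_margins = np.asarray(raw_margins, dtype=float)
    X = np.asarray(data, dtype=float)
    n = X.shape[0]
    scale = complexity * np.linalg.norm(X) / n
    return raw_margins / scale

X = np.array([[3.0, 0.0],
              [0.0, 4.0]])          # ||X||_2 = 5 and n = 2
raw = np.array([2.0, -0.5])
normalized = normalized_margins(raw, X, complexity=4.0)  # scale = 4 * 5 / 2 = 10

# A histogram of the normalized values gives a simple density estimate.
density, edges = np.histogram(normalized, bins=4, density=True)
```

Because the scale is a single positive number per network and dataset, normalization shifts and stretches the margin distribution without changing the ordering of the points.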

Taken this way, the two margin distributions for two datasets can be interpreted as follows. Considering any fixed point on the horizontal axis, if the cumulative distribution of one density is lower than the other, then it corresponds to a lower right hand side in Theorem 1.1. For no reason other than visual interpretability, the plots here will instead depict a density estimate of the margin distribution. The vertical and horizontal axes are rescaled in different plots, but the random and true cifar10 margin distributions are always the same.

A little more detail about the experimental setup is as follows. All experiments were implemented in Keras (Chollet et al., 2015). In order to minimize conflating effects of optimization and regularization, the optimization method was vanilla SGD with a fixed step size, and all regularization (weight decay, batch normalization, etc.) was disabled. “cifar” in general refers to cifar10; however, cifar100 will also be explicitly mentioned. The network architecture is essentially AlexNet (Krizhevsky et al., 2012) with all normalization/regularization removed, and with no adjustments of any kind (even to the learning rate) across the different experiments.

Comparing datasets. A first comparison is of cifar10 and the standard mnist digit data. mnist is considered “easy”, since any of a variety of methods can achieve roughly 1% test error. The “easiness” is corroborated by Figure 2(a), where the margin distribution for mnist places all its mass far to the right of the mass for cifar10. Interestingly, randomizing the labels of mnist, as in Figure 2(b), results in a margin distribution to the left of not only cifar10, but also slightly to the left of (but close to) cifar10 with randomized labels.

Next, Figure 2(c) compares cifar10 and cifar100, where cifar100 uses the same input images as cifar10; indeed, cifar10 is obtained from cifar100 by collapsing the original 100 categories into 10 groups. Interestingly, cifar100, from the perspective of margin bounds, is just as difficult as cifar10 with random labels. This is consistent with the large observed test error on cifar100 (which has not been “optimized” in any way via regularization).

Lastly, Figure 2(d) replaces the cifar10 input images with random images sampled from Gaussians matching the first- and second-order image statistics (see (Zhang et al., 2017) for similar experiments).

Convergence of margins. As was pointed out in Section 1, the weights of the neural networks do not seem to converge in the usual sense during training (the norms grow continually). However, as depicted in Figure 3(a), the sequence of (normalized) margin distributions is itself converging.

Regularization. As remarked in (Zhang et al., 2017), regularization only seems to bring minor benefits to test error (though adequate to be employed in all cutting edge results). This observation is certainly consistent with the margin distributions in Figure 3(b), which do not improve (e.g., by shifting to the right) in any visible way under regularization. An open question, discussed further in Section 4, is to design regularization that improves margins.

## 3 Analysis of margin bound

This section will sketch the proof of Theorem 1.1, give a lower bound, and discuss related work.

### 3.1 Multiclass margin bound

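The starting point of this analysis is a margin-based bound for multiclass prediction. To state the bound, first recall that the margin operator $\mathcal{M}:\mathbb{R}^k\times\{1,\ldots,k\}\to\mathbb{R}$ is defined as $\mathcal{M}(v,y):=v_y-\max_{i\neq y}v_i$, and define the ramp loss $\ell_\gamma:\mathbb{R}\to\mathbb{R}^+$ as

$$\ell_\gamma(r):=\begin{cases}0 & r<-\gamma,\\ 1+r/\gamma & r\in[-\gamma,0],\\ 1 & r>0,\end{cases}$$

and ramp risk as $R_\gamma(f):=\mathbb{E}\big(\ell_\gamma(-\mathcal{M}(f(x),y))\big)$. Given a sample $S:=((x_1,y_1),\ldots,(x_n,y_n))$, define an empirical counterpart $\hat{R}_\gamma$ of $R_\gamma$ as $\hat{R}_\gamma(f):=n^{-1}\sum_i\ell_\gamma(-\mathcal{M}(f(x_i),y_i))$; note that $R_\gamma$ and $\hat{R}_\gamma$ respectively upper bound the probability and fraction of errors on the source distribution and training set. Lastly, given a set of real-valued functions $\mathcal{H}$, define the Rademacher complexity as $\mathfrak{R}(\mathcal{H}_{|S}):=n^{-1}\,\mathbb{E}\sup_{h\in\mathcal{H}}\sum_{i=1}^n\epsilon_ih(x_i,y_i)$, where the expectation is over the Rademacher random variables $(\epsilon_1,\ldots,\epsilon_n)$, which are iid with $\Pr[\epsilon_1=1]=\Pr[\epsilon_1=-1]=1/2$.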
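The ramp loss and its empirical risk are simple to evaluate once margins are available. The sketch below assumes the piecewise definition with breakpoints at $-\gamma$ and $0$, which coincides with clipping $1+r/\gamma$ to $[0,1]$; the function names are hypothetical.

```python
import numpy as np

def ramp_loss(r, gamma):
    """0 for r < -gamma, 1 + r/gamma on [-gamma, 0], and 1 for r > 0."""
    return np.clip(1.0 + np.asarray(r, dtype=float) / gamma, 0.0, 1.0)

def empirical_ramp_risk(raw_margins, gamma):
    """Average ramp loss of the negated margins over the sample."""
    negated = -np.asarray(raw_margins, dtype=float)
    return float(np.mean(ramp_loss(negated, gamma)))

raw = np.array([2.0, 0.5, -0.5])   # hypothetical margins of three examples
gamma = 1.0
risk = empirical_ramp_risk(raw, gamma)
error_fraction = float(np.mean(raw <= 0.0))      # misclassified fraction
margin_fraction = float(np.mean(raw <= gamma))   # margin at most gamma
```

On this toy sample the ramp risk sits between the fraction of misclassified points and the fraction of points with margin at most $\gamma$, consistent with the sandwich described in the text.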

With this notation in place, the basic bound is as follows.

###### Lemma 3.1.

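Given functions $\mathcal{F}$ with $\mathcal{F}\ni f:\mathbb{R}^d\to\mathbb{R}^k$ and any $\gamma>0$, define

$$\mathcal{F}_\gamma:=\big\{(x,y)\mapsto\ell_\gamma(-\mathcal{M}(f(x),y)):f\in\mathcal{F}\big\}.$$

Then, with probability at least $1-\delta$ over a sample $S$ of size $n$, every $f\in\mathcal{F}$ satisfies

$$\Pr\big[\arg\max_if(x)_i\neq y\big]\le\hat{R}_\gamma(f)+2\,\mathfrak{R}\big((\mathcal{F}_\gamma)_{|S}\big)+3\sqrt{\frac{\ln(1/\delta)}{2n}}.$$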

This bound is a direct consequence of standard tools in Rademacher complexity. In order to instantiate this bound, covering numbers will be used to directly upper bound the Rademacher complexity term $\mathfrak{R}((\mathcal{F}_\gamma)_{|S})$. Interestingly, the choice of directly working in terms of covering numbers seems essential to providing a bound with no explicit dependence on $k$; by contrast, prior work primarily handles multiclass via a Rademacher complexity analysis on each coordinate of a $k$-tuple of functions, and pays a factor of $\sqrt{k}$ (Zhang, 2004).

### 3.2 Covering number complexity upper bounds

This subsection proves Theorem 1.1 via Lemma 3.1 by controlling, via covering numbers, the Rademacher complexity for networks with bounded spectral complexity.

The notation here for (proper) covering numbers is as follows. Let $\mathcal{N}(U,\epsilon,\|\cdot\|)$ denote the least cardinality of any subset $V\subseteq U$ that covers $U$ at scale $\epsilon$ with norm $\|\cdot\|$, meaning

$$\sup_{A\in U}\min_{B\in V}\|A-B\|\le\epsilon.$$

Choices of $U$ that will be used in the present work include both the image $\mathcal{F}_{|S}$ of data $S$ under some function class $\mathcal{F}$, as well as the conceptually simpler choice of a family of matrix products.

The full proof has the following steps. (I) A matrix covering bound for the affine transformation of each layer is provided in Lemma 3.2; handling whole layers at once allows for more flexible norms. (II) An induction on layers then gives a covering number bound for entire networks; this analysis is only sketched here for the special case of norms used in Theorem 1.1, but the full proof in the appendix culminates in a bound for more general norms (cf. Lemma A.7). (III) The preceding whole-network covering number leads to Theorem 1.1 via Lemma 3.1 and standard techniques.

Step (I), matrix covering, is handled by the following lemma. The covering number considers the matrix product $XA$, where $A$ will be instantiated as the weight matrix for a layer, and $X$ is the data passed through all layers prior to the present layer.

###### Lemma 3.2.

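Let conjugate exponents $(p,q)$ and $(r,s)$ be given with $p\le2$, as well as positive reals $(a,b,\epsilon)$ and positive integer $m$. Let matrix $X\in\mathbb{R}^{n\times d}$ be given with $\|X\|_p\le b$. Then

$$\ln\mathcal{N}\Big(\big\{XA:A\in\mathbb{R}^{d\times m},\ \|A\|_{q,s}\le a\big\},\epsilon,\|\cdot\|_2\Big)\le\left\lceil\frac{a^2b^2m^{2/r}}{\epsilon^2}\right\rceil\ln(2dm).$$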

The proof relies upon the Maurey sparsification lemma (Pisier, 1980), which is stated in terms of sparsifying convex hulls, and in its use here is inspired by covering number bounds for linear predictors (Zhang, 2002). To prove Theorem 1.1, this matrix covering bound will be instantiated for the case of $\|\cdot\|_{2,1}$. It is possible to instead scale with $\|A\|_2$ and $\|X\|_2$, but even for the case of the identity matrix $X=I$, this incurs an extra dimension factor. The use of $\|\cdot\|_{2,1}$ here thus helps Theorem 1.1 avoid any appearance of $W$ and $L$ outside of log terms; indeed, the goal of covering a whole matrix at a time (rather than the more standard vector covering) was to allow this greater sensitivity and avoid combinatorial parameters.
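The sparsification idea behind the matrix covering lemma can be illustrated in its simplest vector form. The sketch below is not the matrix argument of Lemma 3.2; it assumes the standard empirical-method statement that a convex combination of atoms of norm at most $B$ is approximated, in expected squared $\ell_2$ error at most $B^2/k$, by the average of $k$ atoms sampled according to the convex weights. All names and sizes are hypothetical.

```python
import numpy as np

def maurey_sample(atoms, alpha, k, rng):
    """Average of k atoms drawn iid with probabilities given by the convex weights."""
    idx = rng.choice(len(alpha), size=k, p=alpha)
    return atoms[idx].mean(axis=0)

rng = np.random.default_rng(0)
atoms = rng.normal(size=(50, 10))       # rows are the atoms v_i
alpha = rng.dirichlet(np.ones(50))      # convex weights (nonnegative, sum to 1)
target = alpha @ atoms                  # v = sum_i alpha_i v_i

k = 25
B = np.linalg.norm(atoms, axis=1).max()  # largest atom norm
errors = [np.sum((maurey_sample(atoms, alpha, k, rng) - target) ** 2)
          for _ in range(2000)]
mean_sq_error = float(np.mean(errors))   # Monte Carlo estimate of the expectation
bound = B ** 2 / k
```

Since every point of the convex hull has a $k$-term average nearby, the set of all such averages forms a cover; counting them is what produces a logarithmic dependence on the number of atoms.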

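Step (II), the induction on layers, proceeds as follows. Let $X_i$ denote the output of layer $i-1$ but with images of examples as columns (thus $X_1=X^\top$), and inductively suppose there exists a cover element $\hat{X}_i$ for $X_i$ which depends on covering matrices $(\hat{A}_1,\ldots,\hat{A}_{i-1})$ chosen to cover weight matrices in earlier layers. Thanks to Lemma 3.2, there also exists $\hat{A}_i$ so that $\|A_i\hat{X}_i-\hat{A}_i\hat{X}_i\|_2\le\epsilon_i$. The desired cover element is thus $\hat{X}_{i+1}=\sigma_i(\hat{A}_i\hat{X}_i)$, where $\sigma_i$ is the nonlinearity in layer $i$; indeed, supposing $\sigma_i$ is $\rho_i$-Lipschitz,

$$\begin{aligned}\|X_{i+1}-\hat{X}_{i+1}\|_2&\le\rho_i\|A_iX_i-\hat{A}_i\hat{X}_i\|_2\\&\le\rho_i\big(\|A_iX_i-A_i\hat{X}_i\|_2+\|A_i\hat{X}_i-\hat{A}_i\hat{X}_i\|_2\big)\\&\le\rho_i\|A_i\|_\sigma\|X_i-\hat{X}_i\|_2+\rho_i\epsilon_i,\end{aligned}$$

where the first term is controlled with the inductive hypothesis. Since $\hat{X}_{i+1}$ depends on each choice $(\hat{A}_1,\ldots,\hat{A}_i)$, the cardinality of the full network cover is the product of the individual matrix covers.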
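The single-layer error propagation used in this induction can be checked numerically. The sketch below assumes a ReLU layer (so the Lipschitz constant is 1) and uses small random perturbations as stand-ins for cover elements; the perturbation sizes and variable names are hypothetical.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

rng = np.random.default_rng(0)
rho = 1.0                                        # Lipschitz constant of ReLU

A = rng.normal(size=(4, 5))                      # layer weights
A_hat = A + 0.01 * rng.normal(size=A.shape)      # stand-in for the weight cover element
X = rng.normal(size=(5, 8))                      # layer input, columns are examples
X_hat = X + 0.01 * rng.normal(size=X.shape)      # stand-in for the input cover element

# Left side: distance between the true and the covered layer outputs (Frobenius norm).
lhs = np.linalg.norm(relu(A @ X) - relu(A_hat @ X_hat))

# Right side: spectral norm times input error, plus the matrix covering error.
eps_layer = np.linalg.norm(A @ X_hat - A_hat @ X_hat)
rhs = rho * np.linalg.norm(A, 2) * np.linalg.norm(X - X_hat) + rho * eps_layer
```

The inequality `lhs <= rhs` holds for any choice of matrices, which is why the argument is insensitive to how the cover elements were obtained.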

The preceding proof had no sensitivity to the particular choice of norms; it merely required an operator norm on $A_i$, as well as some other norm that allows matrix covering. Such an analysis is presented in full generality in Section A.5. Specializing to the particular case of spectral norms and $(2,1)$ group norms leads to the following full-network covering bound.

###### Theorem 3.3.

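Let fixed nonlinearities $(\sigma_1,\ldots,\sigma_L)$ and reference matrices $(M_1,\ldots,M_L)$ be given, where $\sigma_i$ is $\rho_i$-Lipschitz and $\sigma_i(0)=0$. Let spectral norm bounds $(s_1,\ldots,s_L)$, and matrix $(2,1)$ norm bounds $(b_1,\ldots,b_L)$ be given. Let data matrix $X\in\mathbb{R}^{n\times d}$ be given, where the $n$ rows correspond to data points. Let $\mathcal{H}_X$ denote the family of matrices obtained by evaluating $X$ with all choices of network $F_{\mathcal{A}}$:

$$\mathcal{H}_X:=\Big\{F_{\mathcal{A}}(X^\top):\ \mathcal{A}=(A_1,\ldots,A_L),\ \|A_i\|_\sigma\le s_i,\ \|A_i^\top-M_i^\top\|_{2,1}\le b_i\Big\},$$

where each matrix has dimension at most $W$ along each axis. Then for any $\epsilon>0$,

$$\ln\mathcal{N}(\mathcal{H}_X,\epsilon,\|\cdot\|_2)\le\frac{\|X\|_2^2\ln(2W^2)}{\epsilon^2}\left(\prod_{j=1}^{L}s_j^2\rho_j^2\right)\left(\sum_{i=1}^{L}\left(\frac{b_i}{s_i}\right)^{2/3}\right)^{3}.$$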
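For a sense of scale, the right-hand side of this covering bound can be evaluated numerically. The sketch below assumes the form $\|X\|_2^2\ln(2W^2)\,\epsilon^{-2}\,\prod_j s_j^2\rho_j^2\,(\sum_i (b_i/s_i)^{2/3})^3$ and uses hypothetical argument names and toy values.

```python
import numpy as np

def log_covering_bound(data_norm, eps, width, spectral_bounds, norm21_bounds,
                       lipschitz=None):
    """Upper bound on the log covering number of the network class at scale eps."""
    s = np.asarray(spectral_bounds, dtype=float)
    b = np.asarray(norm21_bounds, dtype=float)
    rho = np.ones_like(s) if lipschitz is None else np.asarray(lipschitz, dtype=float)
    product = np.prod((s * rho) ** 2)
    correction = np.sum((b / s) ** (2.0 / 3.0)) ** 3
    return data_norm ** 2 * np.log(2 * width ** 2) / eps ** 2 * product * correction

# Toy values: two layers, unit norm bounds, width 2.
value = log_covering_bound(1.0, 1.0, 2, [1.0, 1.0], [1.0, 1.0])
value_half_scale = log_covering_bound(1.0, 0.5, 2, [1.0, 1.0], [1.0, 1.0])
```

The bound grows as $1/\epsilon^2$, and its square root reproduces the spectral complexity of eq. (1.2) when the norm bounds are tight.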

What remains is (III): Theorem 3.3 can be combined with the standard Dudley entropy integral upper bound on Rademacher complexity (see e.g. Mohri et al. (2012)), which together with Lemma 3.1 gives Theorem 1.1.

### 3.3 Rademacher complexity lower bounds

By reduction to the linear case (i.e., removing all nonlinearities), it is easy to provide a lower bound on the Rademacher complexity of the networks studied here. Unfortunately, this bound only scales with the product of spectral norms, and not the other terms in $R_{\mathcal{A}}$ (cf. eq. 1.2).

###### Theorem 3.4.

Consider the setting of Theorem 3.3, but all nonlinearities are the ReLU $z\mapsto\max\{0,z\}$, the output dimension is $d_L=1$, and all non-output dimensions are at least 2 (and hence $W\ge2$). Let data $S:=(x_1,\ldots,x_n)$ be collected into data matrix $X\in\mathbb{R}^{n\times d}$. Then there is a $c$ such that for any scalar $r>0$,

$$\mathfrak{R}\Big(\Big\{F_{\mathcal{A}}:\ \mathcal{A}=(A_1,\ldots,A_L),\ \prod_i\|A_i\|_\sigma\le r\Big\}_{|S}\Big)\ge c\,\|X\|_2\,r. \tag{3.1}$$

Note that, due to the nonlinearity, the lower bound should indeed depend on $\prod_i\|A_i\|_\sigma$ and not $\|\prod_iA_i\|_\sigma$; as a simple sanity check, there exist networks for which the latter quantity is 0, but the network does not compute the zero function.
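The sanity check in the preceding remark can be made concrete with a two-layer ReLU network that computes the absolute value of the first input coordinate. The construction below is a hypothetical example chosen for this sketch: the product of its weight matrices is the zero matrix, while the product of their spectral norms is 2.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

A1 = np.array([[1.0, 0.0],
               [-1.0, 0.0]])    # hidden units see x_1 and -x_1
A2 = np.array([[1.0, 1.0]])     # output sums the two hidden units

def network(x):
    """Two-layer ReLU network; equals |x_1| on every input."""
    return relu(A2 @ relu(A1 @ np.asarray(x, dtype=float)))

norm_of_product = np.linalg.norm(A2 @ A1, 2)                        # spectral norm of A2 A1
product_of_norms = np.linalg.norm(A1, 2) * np.linalg.norm(A2, 2)    # sqrt(2) * sqrt(2)
values = [float(network(x)[0]) for x in ([-2.0, 5.0], [0.0, 1.0], [3.0, 0.0])]
```

Without the ReLU the two hidden units would cancel exactly, which is why only the product of norms can appear in a lower bound for the nonlinear class.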

### 3.4 Related work

To close this section on proofs, it is a good time to summarize connections to existing literature.

The algorithmic idea of large margin classifiers was introduced in the linear case by Vapnik (1982) (see also (Boser et al., 1992; Cortes and Vapnik, 1995)). Vapnik (1995) gave an intuitive explanation of the performance of these methods based on a sample-dependent VC-dimension calculation, but without generalization bounds. The first rigorous generalization bounds for large margin linear classifiers (Shawe-Taylor et al., 1998) required a scale-sensitive complexity analysis of real-valued function classes. At the same time, a large margin analysis was developed for two-layer networks (Bartlett, 1996), indeed with a proof technique that inspired the layer-wise induction used to prove Theorem 1.1 in the present work. Margin theory was quickly extended to many other settings (see for instance the survey by Boucheron et al. (2005)), one major success being an explanation of the generalization ability of boosting methods, which exhibit an explicit growth in the size of the function class over time, but a stable excess risk (Schapire et al., 1997). The contribution of the present work is to provide a margin bound (and corresponding Rademacher analysis) that can be adapted to various operator norms at each layer. Additionally, the present work operates in the multiclass setting, and avoids an explicit dependence on the number of classes $k$, which seems to appear in prior work (Zhang, 2004; Tewari and Bartlett, 2007).

There are numerous generalization bounds for neural networks, including VC-dimension and fat-shattering bounds (many of these can be found in (Anthony and Bartlett, 1999)). Scale-sensitive analysis of neural networks started with (Bartlett, 1996), which can be interpreted in the present setting as utilizing data norm $\|\cdot\|_\infty$ and operator norm $\|\cdot\|_{\infty\to\infty}$ (equivalently, the norm $\|A_i^\top\|_{1,\infty}$ on weight matrix $A_i$). This analysis can be adapted to give a Rademacher complexity analysis (Bartlett and Mendelson, 2002), and has been adapted to other norms (Neyshabur et al., 2015), although the $\|\cdot\|_{1,\infty}$ setting appears to be necessary to avoid extra combinatorial factors. More work is still needed to develop complexity analyses that have matching upper and lower bounds, and also to determine which norms are well-adapted to neural networks as used in practice.

The present analysis utilizes covering numbers, and is most closely connected to earlier covering number bounds (Anthony and Bartlett, 1999, Chapter 12), themselves based on the earlier fat-shattering analysis (Bartlett, 1996), however the technique here of pushing an empirical cover through layers is akin to VC dimension proofs for neural networks (Anthony and Bartlett, 1999). The use of Maurey’s sparsification lemma was inspired by linear predictor covering number bounds (Zhang, 2002).

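Comparison to preprint. The original preprint of this paper (Bartlett et al., 2017) featured a slightly different version of the spectral complexity $R_{\mathcal{A}}$, given by $\left(\prod_{i=1}^{L}\rho_i\|A_i\|_\sigma\right)\left(\sum_{i=1}^{L}\frac{\|A_i-M_i\|_1^{2/3}}{\|A_i\|_\sigma^{2/3}}\right)^{3/2}$. In the present version (eq. 1.2), each $\|A_i-M_i\|_1$ term is replaced by $\|A_i^\top-M_i^\top\|_{2,1}$. This is a strict improvement since for any matrix $A\in\mathbb{R}^{d\times m}$ one has $\|A\|_{2,1}\le\|A\|_1$, and in general the gap between these two norms can be as large as $\sqrt{d}$.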

On a related note, all of the figures in this paper use the $\ell_1$ norm in the spectral complexity $R_{\mathcal{A}}$ instead of the $(2,1)$ norm. Variants of the experiments described in Section 2 were carried out using each of the $\ell_1$, $(2,1)$, and $\ell_2$ norms in the $\left(\sum_{i=1}^{L}(\cdot)^{2/3}\right)^{3/2}$ term with negligible difference in the results.

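Since spectrally-normalized margin bounds were first proposed in the preprint (Bartlett et al., 2017), subsequent works (Neyshabur et al., 2017; Neyshabur, 2017) re-derived a similar spectrally-normalized bound using the PAC-Bayes framework. Specifically, these works showed that $R_{\mathcal{A}}$ may be replaced (up to $\ln(W)$ factors) by: $\left(\prod_{i=1}^{L}\rho_i\|A_i\|_\sigma\right)\cdot L\left(\sum_{i=1}^{L}\frac{(\sqrt{W}\|A_i-M_i\|_2)^2}{\|A_i\|_\sigma^2}\right)^{1/2}$. Unfortunately, this bound never improves on Theorem 1.1, and indeed can be derived from it as follows. First, the dependence on the individual matrices in the second term of this bound can be obtained from Theorem 1.1 since for any $A\in\mathbb{R}^{d\times m}$ it holds that $\|A^\top\|_{2,1}\le\sqrt{d}\,\|A\|_2$. Second, the functional form $\left(\sum_i\alpha_i^{2/3}\right)^{3/2}$ appearing in Theorem 1.1 may be replaced by the form $L\left(\sum_i\alpha_i^2\right)^{1/2}$ appearing above by using $\|\alpha\|_{2/3}\le L\|\alpha\|_2$, which holds for any $\alpha\in\mathbb{R}^L$ (and can be proved, for instance, with Jensen’s inequality).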
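The two elementary norm comparisons behind this derivation can be checked numerically. The sketch below assumes they take the form $\|A^\top\|_{2,1}\le\sqrt{d}\,\|A\|_2$ for $A\in\mathbb{R}^{d\times m}$ (Frobenius norm on the right) and $\|\alpha\|_{2/3}\le L\|\alpha\|_2$ for $\alpha\in\mathbb{R}^L$; the dimensions and random inputs are hypothetical.

```python
import numpy as np

def norm_2_1(A):
    """(2,1) norm: sum of the l2 norms of the columns."""
    return np.linalg.norm(A, axis=0).sum()

rng = np.random.default_rng(0)
d, m, L = 6, 4, 5
A = rng.normal(size=(d, m))
alpha = np.abs(rng.normal(size=L))

# Matrix comparison: sum of the d row norms of A versus sqrt(d) times the Frobenius norm.
lhs_matrix = norm_2_1(A.T)
rhs_matrix = np.sqrt(d) * np.linalg.norm(A)

# Vector comparison: the 2/3 quasi-norm versus L times the l2 norm.
lhs_vector = np.sum(alpha ** (2.0 / 3.0)) ** 1.5
rhs_vector = L * np.linalg.norm(alpha)

# Both comparisons are tight when all entries are equal.
ones = np.ones(L)
tight_lhs = np.sum(ones ** (2.0 / 3.0)) ** 1.5
tight_rhs = L * np.linalg.norm(ones)
```

The all-ones vector attains equality in the second comparison, so the factor $L$ cannot be improved in general.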

## 4 Further observations and open problems

Adversarial examples are a phenomenon where the neural network predictions can be altered by adding seemingly imperceptible noise to an input (Goodfellow et al., 2014). This phenomenon can be connected to margins as follows. The margin is nothing more than the distance an input must traverse before its label is flipped; consequently, low margin points are more susceptible to adversarial noise than high margin points. Concretely, taking the 100 lowest margin inputs from cifar10 and adding uniform noise at scale yielded flipped labels on 5.86% of the images, whereas the same level of noise on high margin points yielded 0.04% flipped labels. Can the bounds here suggest a way to defend against adversarial examples?

#### Regularization.

It was observed in (Zhang et al., 2017) that explicit regularization contributes little to the generalization performance of neural networks. In the margin framework, standard weight decay ($\ell_2$) regularization seemed to have little impact on margin distributions in Section 2. On the other hand, in the boosting literature, special types of regularization were developed to maximize margins (Shalev-Shwartz and Singer, 2008); perhaps a similar development can be performed here?

#### SGD.

The present analysis applies to predictors that have large margins; what is missing is an analysis verifying that SGD applied to standard neural networks returns large margin predictors! Indeed, perhaps SGD returns not simply large margin predictors, but predictors that are well-behaved in a variety of other ways that can be directly translated into refined generalization bounds.

#### Improvements to Theorem 1.1.

There are several directions in which Theorem 1.1 might be improved. Can a better choice of layer geometries (norms) yield better bounds on practical networks? Can the nonlinearities’ worst-case Lipschitz constant be replaced with an (empirically) averaged quantity? Alternatively, can better lower bounds rule out these directions?

Is it possible to prove Theorem 1.1 solely via Rademacher complexity, with no invocation of covering numbers?

### Acknowledgements

The authors thank Srinadh Bhojanapalli, Ryan Jian, Behnam Neyshabur, Maxim Raginsky, Andrew J. Risteski, and Belinda Tzen for useful conversations and feedback. The authors thank Ben Recht for giving a provocative lecture at the Simons Institute, stressing the need for understanding of both generalization and optimization of neural networks. M.T. and D.F. acknowledge the use of a GPU machine provided by Karthik Sridharan and made possible by an NVIDIA GPU grant. D.F. acknowledges the support of the NDSEG fellowship. P.B. gratefully acknowledges the support of the NSF through grant IIS-1619362 and of the Australian Research Council through an Australian Laureate Fellowship (FL110100281) and through the ARC Centre of Excellence for Mathematical and Statistical Frontiers. The authors thank the Simons Institute for the Theory of Computing Spring 2017 program on the Foundations of Machine Learning. Lastly, the authors are grateful to La Burrita (both the north and the south Berkeley campus locations) for upholding the glorious tradition of the California Burrito.

## References

• Anthony and Bartlett (1999) Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.
• Bartlett et al. (2017) Peter Bartlett, Dylan J Foster, and Matus Telgarsky. Spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1706.08498, 2017.
• Bartlett (1996) Peter L. Bartlett. For valid generalization the size of the weights is more important than the size of the network. In NIPS, 1996.
• Bartlett and Mendelson (2002) Peter L. Bartlett and Shahar Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. JMLR, 3:463–482, Nov 2002.
• Boser et al. (1992) Bernhard E. Boser, Isabelle M. Guyon, and Vladimir N. Vapnik. A training algorithm for optimal margin classifiers. In Proceedings of the Fifth Annual Workshop on Computational Learning Theory, COLT ’92, pages 144–152, New York, NY, USA, 1992. ACM. ISBN 0-89791-497-X.
• Boucheron et al. (2005) Stéphane Boucheron, Olivier Bousquet, and Gabor Lugosi. Theory of classification: A survey of some recent advances. ESAIM: Probability and Statistics, 9:323–375, 2005.
• Chollet et al. (2015) François Chollet et al. Keras.
• Cisse et al. (2017) Moustapha Cisse, Piotr Bojanowski, Edouard Grave, Yann Dauphin, and Nicolas Usunier. Parseval networks: Improving robustness to adversarial examples. In ICML, 2017.
• Cortes and Vapnik (1995) Corinna Cortes and Vladimir N. Vapnik. Support-vector networks. Machine Learning, 20(3):273–297, 1995.
• Goodfellow et al. (2014) Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and harnessing adversarial examples. 2014. arXiv:1412.6572 [stat.ML].
• He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
• Krizhevsky et al. (2012) Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton. Imagenet classification with deep convolutional neural networks. In NIPS, 2012.
• Mohri et al. (2012) Mehryar Mohri, Afshin Rostamizadeh, and Ameet Talwalkar. Foundations of Machine Learning. MIT Press, 2012.
• Neyshabur (2017) Behnam Neyshabur. Implicit regularization in deep learning. CoRR, abs/1709.01953, 2017. URL http://arxiv.org/abs/1709.01953.
• Neyshabur et al. (2015) Behnam Neyshabur, Ryota Tomioka, and Nathan Srebro. Norm-based capacity control in neural networks. In COLT, 2015.
• Neyshabur et al. (2017) Behnam Neyshabur, Srinadh Bhojanapalli, David McAllester, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. CoRR, abs/1707.09564, 2017.
• Pisier (1980) Gilles Pisier. Remarques sur un résultat non publié de B. Maurey. Séminaire Analyse fonctionnelle (dit), pages 1–12, 1980.
• Schapire et al. (1997) Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: A new explanation for the effectiveness of voting methods. In ICML, pages 322–330, 1997.
• Shalev-Shwartz and Singer (2008) Shai Shalev-Shwartz and Yoram Singer. On the equivalence of weak learnability and linear separability: New relaxations and efficient boosting algorithms. In COLT, 2008.
• Shawe-Taylor et al. (1998) J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theor., 44(5):1926–1940, September 1998.
• Tewari and Bartlett (2007) Ambuj Tewari and Peter L. Bartlett. On the consistency of multiclass classification methods. Journal of Machine Learning Research, 8:1007–1025, 2007.
• Vapnik (1982) Vladimir N. Vapnik. Estimation of Dependences Based on Empirical Data. Springer-Verlag, New York, 1982.
• Vapnik (1995) Vladimir N. Vapnik. The Nature of Statistical Learning Theory. Springer-Verlag, New York, 1995.
• Zhang et al. (2017) Chiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding deep learning requires rethinking generalization. ICLR, 2017.
• Zhang (2002) Tong Zhang. Covering number bounds of certain regularized linear function classes. Journal of Machine Learning Research, 2:527–550, 2002.
• Zhang (2004) Tong Zhang. Statistical analysis of some multi-category large margin classification methods. Journal of Machine Learning Research, 5:1225–1251, 2004.

## Appendix A Proofs

This appendix collects various proofs omitted from the main text.

### A.1 Lipschitz properties of ReLU and max-pooling nonlinearities

The standard ReLU (“Rectified Linear Unit”) is the univariate mapping

$$\sigma_r(r) := \max\{0, r\}.$$

When applied to a vector or a matrix, it operates coordinate-wise. While the ReLU is currently the most popular choice of univariate nonlinearity, another common choice is the sigmoid $r \mapsto 1/(1+\exp(-r))$. More generally, these univariate nonlinearities are Lipschitz, and this carries over to their vector and matrix forms as follows.

###### Lemma A.1.

If $\sigma : \mathbb{R}^d \to \mathbb{R}^d$ is $\rho$-Lipschitz along every coordinate, then it is $\rho$-Lipschitz according to $\|\cdot\|_p$ for any $p \geq 1$.

###### Proof.

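For any $z, z' \in \mathbb{R}^d$,

$$\|\sigma(z) - \sigma(z')\|_p = \Big(\sum_i |\sigma(z)_i - \sigma(z')_i|^p\Big)^{1/p} \leq \Big(\sum_i \rho^p |z_i - z'_i|^p\Big)^{1/p} = \rho \|z - z'\|_p.$$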
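As a quick numerical sanity check of Lemma A.1, the following sketch samples random pairs of vectors and reports the largest observed ratio $\|\sigma(z)-\sigma(z')\|_p / \|z-z'\|_p$ for the ReLU ($\rho = 1$) and the sigmoid ($\rho = 1/4$). The helper names, the sampling range, and the dimension are illustrative assumptions, not part of the paper.

```python
import math
import random


def relu(r):
    """Univariate ReLU; it is 1-Lipschitz."""
    return max(0.0, r)


def sigmoid(r):
    """Univariate sigmoid; it is (1/4)-Lipschitz because its derivative is at most 1/4."""
    return 1.0 / (1.0 + math.exp(-r))


def p_norm(v, p):
    """The l_p norm of a list of floats."""
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def worst_ratio(sigma, p, dim=5, trials=2000, seed=0):
    """Largest observed ||sigma(z) - sigma(z')||_p / ||z - z'||_p over random pairs,
    with sigma applied coordinate-wise (sampling setup is an assumption)."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        z = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        z_prime = [rng.uniform(-3.0, 3.0) for _ in range(dim)]
        num = p_norm([sigma(a) - sigma(b) for a, b in zip(z, z_prime)], p)
        den = p_norm([a - b for a, b in zip(z, z_prime)], p)
        if den > 0.0:
            worst = max(worst, num / den)
    return worst


for p in (1, 2, 3):
    print(p, worst_ratio(relu, p), worst_ratio(sigmoid, p))
```

The observed ratios should never exceed $\rho$ (up to floating-point error); random sampling only probes the bound and does not prove it.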

Define a max-pooling operator as follows. Given an input and output pair of finite-dimensional vector spaces $\mathcal{T}$ and $\mathcal{T}'$ (possibly arranged as matrices or tensors), the max-pooling operator $P$ iterates over a collection of sets of indices $\mathcal{Z}$ (whose cardinality is equal to the dimension of $\mathcal{T}'$), and for each element $Z_i \in \mathcal{Z}$ sets the corresponding coordinate $i$ in the output to the maximum entry of the input over $Z_i$: given $T \in \mathcal{T}$,

$$P(T)_i := \max_{j \in Z_i} T_j.$$

The following Lipschitz constant of pooling operators will depend on the number of times each coordinate is accessed across elements of $\mathcal{Z}$; when this operator is used in computer vision, this number is typically a small constant, for instance 5 or 9 (Krizhevsky et al., 2012).

###### Lemma A.2.

Suppose that each coordinate $j$ of the input appears in at most $m$ elements of the collection $\mathcal{Z}$. Then the max-pooling operator $P$ is $m^{1/p}$-Lipschitz wrt $\|\cdot\|_p$ for any $p \geq 1$. In particular, the max-pooling operator is 1-Lipschitz whenever $\mathcal{Z}$ forms a partition.

###### Proof.

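Let $T, T' \in \mathcal{T}$ be given. First consider any fixed set of indices $Z \in \mathcal{Z}$, and suppose without loss of generality that $\max_{j \in Z} T_j \geq \max_{j \in Z} T'_j$. Then

$$|P(T)_Z - P(T')_Z|^p = \Big(\min_{j' \in Z} \max_{j \in Z}\, (T_j - T'_{j'})\Big)^p \leq \Big(\max_{j \in Z}\, (T_j - T'_j)\Big)^p \leq \sum_{j \in Z} |T_j - T'_j|^p.$$

Consequently,

$$\begin{aligned}
\|P(T) - P(T')\|_p
&= \Big(\sum_i |P(T)_i - P(T')_i|^p\Big)^{1/p} = \Big(\sum_{Z \in \mathcal{Z}} |P(T)_Z - P(T')_Z|^p\Big)^{1/p} \\
&\leq \Big(\sum_{Z \in \mathcal{Z}} \sum_{j \in Z} |T_j - T'_j|^p\Big)^{1/p} = \Big(\sum_j \sum_{Z \in \mathcal{Z} : j \in Z} |T_j - T'_j|^p\Big)^{1/p} \\
&\leq \Big(m \sum_j |T_j - T'_j|^p\Big)^{1/p} = m^{1/p} \|T - T'\|_p.
\end{aligned}$$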
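The dependence on the multiplicity $m$ in Lemma A.2 can be probed numerically. The sketch below builds a hypothetical 1-D pooling layout (windows of width 3 with stride 1, so interior coordinates are accessed $m = 3$ times) and a partition layout ($m = 1$), then compares the largest observed Lipschitz ratio against $m^{1/p}$. The layouts and function names are assumptions chosen for illustration.

```python
import random


def max_pool(T, index_sets):
    """P(T)_i := max over j in Z_i of T_j, for a list of index sets Z_i."""
    return [max(T[j] for j in Z) for Z in index_sets]


def p_norm(v, p):
    return sum(abs(x) ** p for x in v) ** (1.0 / p)


def max_multiplicity(index_sets, n):
    """The quantity m: the largest number of index sets containing any one coordinate."""
    counts = [0] * n
    for Z in index_sets:
        for j in Z:
            counts[j] += 1
    return max(counts)


def worst_pool_ratio(index_sets, n, p, trials=2000, seed=0):
    """Largest observed ||P(T) - P(T')||_p / ||T - T'||_p over random input pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        T = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        T_prime = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        pooled = max_pool(T, index_sets)
        pooled_prime = max_pool(T_prime, index_sets)
        num = p_norm([a - b for a, b in zip(pooled, pooled_prime)], p)
        den = p_norm([a - b for a, b in zip(T, T_prime)], p)
        if den > 0.0:
            worst = max(worst, num / den)
    return worst


# Hypothetical layouts over 6 input coordinates.
overlapping = [[0, 1, 2], [1, 2, 3], [2, 3, 4], [3, 4, 5]]  # width 3, stride 1
partition = [[0, 1, 2], [3, 4, 5]]                          # disjoint windows

for name, sets in (("overlapping", overlapping), ("partition", partition)):
    m = max_multiplicity(sets, 6)
    for p in (1, 2):
        print(name, "m =", m, "p =", p,
              "observed:", worst_pool_ratio(sets, 6, p),
              "bound:", m ** (1.0 / p))
```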

### A.2 Margin properties in Section 3.1

The goal of this subsection is to prove the general margin bound in Lemma 3.1. To this end, it is first necessary to establish a few properties of the margin operator $\mathcal{M}(v, y) := v_y - \max_{i \neq y} v_i$ and of the ramp loss $\ell_\gamma$.

###### Lemma A.3.

For every $j$ and every $p \geq 1$, $\mathcal{M}(\cdot, j)$ is 2-Lipschitz wrt $\|\cdot\|_p$.

###### Proof.

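Let $v, v', j$ be given, and suppose (without loss of generality) $\mathcal{M}(v, j) \geq \mathcal{M}(v', j)$. Choose coordinate $i \neq j$ so that $\mathcal{M}(v', j) = v'_j - v'_i$. Then

$$\begin{aligned}
\mathcal{M}(v, j) - \mathcal{M}(v', j)
&= \Big(v_j - \max_{l \neq j} v_l\Big) - \big(v'_j - v'_i\big) = v_j - v'_j + v'_i + \min_{l \neq j} (-v_l) \\
&\leq \big(v_j - v'_j\big) + \big(v'_i - v_i\big) \leq 2\|v - v'\|_\infty \leq 2\|v - v'\|_p.
\end{aligned}$$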
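The constant 2 in Lemma A.3 cannot be improved in the $\ell_\infty$ norm, which a two-class example already shows. The following sketch implements the margin operator, evaluates that example, and probes the bound on random inputs. The function names and sampling setup are illustrative assumptions.

```python
import random


def margin(v, j):
    """Margin operator M(v, j) = v_j - max_{l != j} v_l."""
    return v[j] - max(v[l] for l in range(len(v)) if l != j)


def sup_norm(v):
    return max(abs(x) for x in v)


def worst_margin_ratio(num_classes=4, trials=5000, seed=0):
    """Largest observed |M(v, j) - M(v', j)| / ||v - v'||_inf over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        v = [rng.uniform(-1.0, 1.0) for _ in range(num_classes)]
        v_prime = [rng.uniform(-1.0, 1.0) for _ in range(num_classes)]
        j = rng.randrange(num_classes)
        gap = abs(margin(v, j) - margin(v_prime, j))
        dist = sup_norm([a - b for a, b in zip(v, v_prime)])
        if dist > 0.0:
            worst = max(worst, gap / dist)
    return worst


# Swapping the two coordinates changes the margin by 2 while the
# sup-norm distance between the two vectors is only 1.
tight_gap = margin([1.0, 0.0], 0) - margin([0.0, 1.0], 0)
tight_dist = sup_norm([1.0 - 0.0, 0.0 - 1.0])
print(tight_gap, tight_dist, worst_margin_ratio())
```

Since $\|v - v'\|_\infty \leq \|v - v'\|_p$ for every $p \geq 1$, checking the ratio in the sup norm covers the other norms as well.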

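Next, recall the definition of the ramp loss

$$\ell_\gamma(r) := \begin{cases} 0 & r < -\gamma, \\ 1 + r/\gamma & r \in [-\gamma, 0], \\ 1 & r > 0, \end{cases}$$

and of the ramp risk

$$\mathcal{R}_\gamma(f) := \mathbb{E}\big(\ell_\gamma(-\mathcal{M}(f(x), y))\big).$$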

(These quantities are standard; see for instance (Boucheron et al., 2005; Zhang, 2004; Tewari and Bartlett, 2007).)

###### Lemma A.4.

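For any $f : \mathbb{R}^d \to \mathbb{R}^k$ and every $\gamma > 0$,

$$\Pr\big[\arg\max_i f(x)_i \neq y\big] \leq \Pr\big[\mathcal{M}(f(x), y) \leq 0\big] \leq \mathcal{R}_\gamma(f),$$

where the $\arg\max$ follows any deterministic tie-breaking strategy.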

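###### Proof.

$$\begin{aligned}
\Pr\big[\arg\max_i f(x)_i \neq y\big]
&\leq \Pr\big[\max_{i \neq y} f(x)_i \geq f(x)_y\big] \\
&= \Pr\big[-\mathcal{M}(f(x), y) \geq 0\big] \\
&= \mathbb{E}\, \mathbb{1}\big[-\mathcal{M}(f(x), y) \geq 0\big] \\
&\leq \mathbb{E}\, \ell_\gamma(-\mathcal{M}(f(x), y)).
\end{aligned}$$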
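The chain of inequalities in Lemma A.4 also holds for empirical averages, which makes it easy to illustrate on a handful of points. The sketch below implements the ramp loss and the three quantities on a small hand-made sample; the data, the value of $\gamma$, and the tie-breaking rule (smallest index wins) are assumptions chosen so that each inequality is strict.

```python
def margin(v, y):
    """Margin operator M(v, y) = v_y - max_{i != y} v_i."""
    return v[y] - max(v[i] for i in range(len(v)) if i != y)


def ramp_loss(r, gamma):
    """Ramp loss: 0 below -gamma, 1 above 0, linear in between."""
    if r < -gamma:
        return 0.0
    if r > 0.0:
        return 1.0
    return 1.0 + r / gamma


def misclassification_rate(outputs, labels):
    """Fraction of points with argmax_i f(x)_i != y; ties go to the smallest index."""
    errors = 0
    for v, y in zip(outputs, labels):
        predicted = max(range(len(v)), key=lambda i: v[i])
        errors += predicted != y
    return errors / len(labels)


def nonpositive_margin_rate(outputs, labels):
    """Fraction of points with M(f(x), y) <= 0."""
    return sum(margin(v, y) <= 0.0 for v, y in zip(outputs, labels)) / len(labels)


def empirical_ramp_risk(outputs, labels, gamma):
    """Average of the ramp loss evaluated at -M(f(x), y)."""
    total = sum(ramp_loss(-margin(v, y), gamma) for v, y in zip(outputs, labels))
    return total / len(labels)


# Hypothetical network outputs f(x) for four points and three classes.
outputs = [
    [2.0, 0.5, 0.0],  # large margin: ramp loss 0
    [0.2, 0.1, 0.0],  # correct, but margin below gamma: ramp loss in (0, 1)
    [0.0, 1.0, 1.0],  # tie resolved in favor of the true label, margin 0
    [0.3, 0.9, 0.1],  # misclassified
]
labels = [0, 0, 1, 0]
gamma = 0.5

print(misclassification_rate(outputs, labels),
      nonpositive_margin_rate(outputs, labels),
      empirical_ramp_risk(outputs, labels, gamma))
```

The third point shows why the first inequality can be strict: a tie may be broken in favor of the true label even though the margin is zero.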

With these tools in place, the proof of Lemma 3.1 is straightforward.

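###### Proof of Lemma 3.1.

Since $\ell_\gamma$ has range $[0, 1]$, it follows by standard properties of Rademacher complexity (see, for example, Mohri et al., 2012, Theorem 3.1) that with probability at least $1 - \delta$, every $f \in \mathcal{F}$ satisfies

$$\mathcal{R}_\gamma(f) \leq \widehat{\mathcal{R}}_\gamma(f) + 2\mathfrak{R}\big((\mathcal{F}_\gamma)_{|S}\big) + 3\sqrt{\frac{\ln(2/\delta)}{2n}}.$$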

The bound now follows by applying Lemma A.4 to the left hand side.

### A.3 Dudley Entropy Integral

This section contains a slight variant of the standard Dudley entropy integral bound on the empirical Rademacher complexity (e.g., Mohri et al., 2012), which is used in the proof of Theorem 1.1. The presentation here diverges from standard presentations because the data metric (as in eq. A.1) is not normalized by $\sqrt{n}$. The proof itself is entirely standard, however, even up to constants, and is included only for completeness.

###### Lemma A.5.

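Let $\mathcal{F}$ be a real-valued function class taking values in $[0, 1]$, and assume that $\mathbf{0} \in \mathcal{F}$. Then

$$\mathfrak{R}(\mathcal{F}_{|S}) \leq \inf_{\alpha > 0} \left( \frac{4\alpha}{\sqrt{n}} + \frac{12}{n} \int_{\alpha}^{\sqrt{n}} \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon, \|\cdot\|_2)}\, d\varepsilon \right).$$

###### Proof.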

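Let $N \in \mathbb{N}$ be arbitrary and let $\varepsilon_i = \sqrt{n}\, 2^{-(i-1)}$ for each $i \geq 1$. For each $i$ let $V_i$ denote the cover achieving $\mathcal{N}(\mathcal{F}_{|S}, \varepsilon_i, \|\cdot\|_2)$, so that

$$\forall f \in \mathcal{F}\ \ \exists v \in V_i \quad \Big(\sum_{t=1}^n (f(x_t) - v_t)^2\Big)^{1/2} \leq \varepsilon_i, \tag{A.1}$$

and $|V_i| = \mathcal{N}(\mathcal{F}_{|S}, \varepsilon_i, \|\cdot\|_2)$. For a fixed $f \in \mathcal{F}$, let $v^i[f]$ denote the nearest element in $V_i$. Then

$$\begin{aligned}
\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t)
&= \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(f(x_t) - v^N_t[f]\big) - \sum_{i=1}^{N-1} \sum_{t=1}^n \epsilon_t \big(v^i_t[f] - v^{i+1}_t[f]\big) + \sum_{t=1}^n \epsilon_t v^1_t[f] \right] \\
&\leq \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(f(x_t) - v^N_t[f]\big) \right] + \sum_{i=1}^{N-1} \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(v^i_t[f] - v^{i+1}_t[f]\big) \right] + \mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t v^1_t[f] \right].
\end{aligned}$$

The first line is a telescoping identity. The second line uses subadditivity of the supremum together with the fact that $\epsilon$ and $-\epsilon$ have the same distribution, so the sign in front of each term is immaterial.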

For the third term, observe that it suffices to take $V_1 = \{\mathbf{0}\}$, which implies

$$\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t v^1_t[f] \right] = 0.$$

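The first term may be handled using Cauchy-Schwarz as follows:

$$\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(f(x_t) - v^N_t[f]\big) \right] \leq \sqrt{\mathbb{E}_\epsilon \sum_{t=1}^n (\epsilon_t)^2}\, \sqrt{\sup_{f \in \mathcal{F}} \sum_{t=1}^n \big(f(x_t) - v^N_t[f]\big)^2} \leq \sqrt{n}\, \varepsilon_N.$$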

Last to take care of are the terms of the form

$$\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(v^i_t[f] - v^{i+1}_t[f]\big) \right].$$

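For each $i$, let $W_i = \{ v^i[f] - v^{i+1}[f] : f \in \mathcal{F} \}$. Then $|W_i| \leq |V_i| \cdot |V_{i+1}| \leq |V_{i+1}|^2$,

$$\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \left[ \sum_{t=1}^n \epsilon_t \big(v^i_t[f] - v^{i+1}_t[f]\big) \right] \leq \mathbb{E}_\epsilon \sup_{w \in W_i} \left[ \sum_{t=1}^n \epsilon_t w_t \right],$$

and furthermore

$$\begin{aligned}
\sup_{w \in W_i} \sqrt{\sum_{t=1}^n w_t^2}
&= \sup_{f \in \mathcal{F}} \big\| v^i[f] - v^{i+1}[f] \big\|_2 \\
&\leq \sup_{f \in \mathcal{F}} \big\| v^i[f] - (f(x_1), \ldots, f(x_n)) \big\|_2 + \sup_{f \in \mathcal{F}} \big\| (f(x_1), \ldots, f(x_n)) - v^{i+1}[f] \big\|_2 \\
&\leq \varepsilon_i + \varepsilon_{i+1} \\
&= 3\varepsilon_{i+1}.
\end{aligned}$$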

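With this observation, the standard Massart finite class lemma (Mohri et al., 2012) implies

$$\mathbb{E}_\epsilon \sup_{w \in W_i} \left[ \sum_{t=1}^n \epsilon_t w_t \right] \leq \sqrt{2 \sup_{w \in W_i} \sum_{t=1}^n (w_t)^2 \log |W_i|} \leq 3\sqrt{2 \log |W_i|}\, \varepsilon_{i+1} \leq 6\sqrt{\log |V_{i+1}|}\, \varepsilon_{i+1}.$$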
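For small $n$, the Massart finite class lemma used in this step can be checked exactly by enumerating all $2^n$ sign vectors. The sketch below does this for a hypothetical four-element set $W \subset \mathbb{R}^4$; the set and the function names are illustrative assumptions.

```python
import itertools
import math


def exact_rademacher_sup(W):
    """Exact value of E_eps max_{w in W} sum_t eps_t w_t, averaging over all sign vectors."""
    n = len(W[0])
    total = 0.0
    for signs in itertools.product((-1.0, 1.0), repeat=n):
        total += max(sum(s * x for s, x in zip(signs, w)) for w in W)
    return total / 2 ** n


def massart_bound(W):
    """Massart bound: max_w ||w||_2 * sqrt(2 log |W|)."""
    radius = max(math.sqrt(sum(x * x for x in w)) for w in W)
    return radius * math.sqrt(2.0 * math.log(len(W)))


# Hypothetical finite class of vectors in R^4.
W = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
    [-1.0, 1.0, -1.0, 1.0],
]

print(exact_rademacher_sup(W), massart_bound(W))
```

The enumeration costs $2^n |W| n$ operations, so it is only practical for very small $n$; for larger samples a Monte Carlo estimate over random sign vectors would be the natural substitute.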

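Collecting all terms, this establishes

$$\begin{aligned}
\mathbb{E}_\epsilon \sup_{f \in \mathcal{F}} \sum_{t=1}^n \epsilon_t f(x_t)
&\leq \varepsilon_N \sqrt{n} + 6 \sum_{i=1}^{N-1} \varepsilon_{i+1} \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon_{i+1}, \|\cdot\|_2)} \\
&\leq \varepsilon_N \sqrt{n} + 12 \sum_{i=1}^{N} (\varepsilon_i - \varepsilon_{i+1}) \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon_i, \|\cdot\|_2)} \\
&\leq \varepsilon_N \sqrt{n} + 12 \int_{\varepsilon_{N+1}}^{\sqrt{n}} \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon, \|\cdot\|_2)}\, d\varepsilon.
\end{aligned}$$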

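Finally, select any $\alpha > 0$ and take $N$ to be the largest integer with $\varepsilon_{N+1} > \alpha$. Then $\varepsilon_N = 4\varepsilon_{N+2} \leq 4\alpha$, and so

$$\varepsilon_N \sqrt{n} + 12 \int_{\varepsilon_{N+1}}^{\sqrt{n}} \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon, \|\cdot\|_2)}\, d\varepsilon \leq 4\alpha\sqrt{n} + 12 \int_{\alpha}^{\sqrt{n}} \sqrt{\log \mathcal{N}(\mathcal{F}_{|S}, \varepsilon, \|\cdot\|_2)}\, d\varepsilon.$$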
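To see how the bound of Lemma A.5 behaves once a covering number estimate is available, the sketch below evaluates the right-hand side numerically over a small grid of $\alpha$ values. It assumes a hypothetical log-covering bound of the form $\log \mathcal{N}(\varepsilon) \leq a^2/\varepsilon^2$, for which the entropy integral has the closed form $a \ln(\sqrt{n}/\alpha)$; the constant $a$, the grid, and the function names are assumptions for illustration only.

```python
import math


def entropy_integral(log_cover, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of sqrt(log_cover(eps)) over [lower, upper]."""
    if lower >= upper:
        return 0.0
    h = (upper - lower) / steps
    return h * sum(
        math.sqrt(max(0.0, log_cover(lower + (k + 0.5) * h))) for k in range(steps)
    )


def dudley_bound(log_cover, n, alphas):
    """min over the given alphas of 4*alpha/sqrt(n) + (12/n) * entropy integral.

    Every alpha > 0 yields a valid upper bound, so minimizing over a finite
    grid still gives an upper bound (possibly not the tightest one)."""
    root_n = math.sqrt(n)
    return min(
        4.0 * alpha / root_n + (12.0 / n) * entropy_integral(log_cover, alpha, root_n)
        for alpha in alphas
    )


# Hypothetical covering estimate: log N(eps) <= (a / eps)^2.
a = 2.0
n = 100


def log_cover(eps):
    return (a / eps) ** 2


alphas = [0.1, 0.5, 1.0, 5.0]
print(dudley_bound(log_cover, n, alphas))
```

With a covering bound of this $1/\varepsilon^2$ form the integrand is $a/\varepsilon$, so the integral grows only logarithmically as $\alpha \to 0$, which is why small values of $\alpha$ remain usable.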

### A.4 Proof of matrix covering (Lemma 3.2)

First recall the Maurey sparsification lemma.

###### Lemma A.6 (Maurey; cf. Pisier, 1980; Zhang, 2002, Lemma 1).

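Fix a Hilbert space $\mathcal{H}$ with norm $\|\cdot\|$. Let $U \in \mathcal{H}$ be given with representation $U = \sum_{i=1}^d \alpha_i V_i$ where $V_i \in \mathcal{H}$ and $\alpha \in \mathbb{R}^d_{\geq 0} \setminus \{0\}$. Then for any positive integer $k$, there exists a choice of nonnegative integers $(k_1, \ldots, k_d)$, $\sum_i k_i = k$, such that

$$\left\| U - \frac{\|\alpha\|_1}{k} \sum_{i=1}^d k_i V_i \right\|^2 \leq \frac{\|\alpha\|_1}{k} \sum_{i=1}^d \alpha_i \|V_i\|^2 \leq \frac{\|\alpha\|_1^2}{k} \max_i \|V_i\|^2.$$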
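The statement is an existence result, so for small $d$ and $k$ it can be illustrated by brute force: enumerate every tuple $(k_1, \ldots, k_d)$ of nonnegative integers summing to $k$ and keep the one with the smallest approximation error. The sketch below does this in $\mathbb{R}^2$; the vectors, the coefficients, and the function names are illustrative assumptions, and exhaustive search is a stand-in for the probabilistic argument of the proof, not a description of it.

```python
def compositions(k, d):
    """Yield all tuples of d nonnegative integers that sum to k."""
    if d == 1:
        yield (k,)
        return
    for first in range(k + 1):
        for rest in compositions(k - first, d - 1):
            yield (first,) + rest


def sq_norm(v):
    return sum(x * x for x in v)


def maurey_best(alpha, V, k):
    """Return (best squared error, best (k_1..k_d), middle bound of Lemma A.6).

    Assumes every alpha_i >= 0 and alpha != 0, so ||alpha||_1 = sum(alpha)."""
    d, dim = len(V), len(V[0])
    beta = sum(alpha)
    U = [sum(alpha[i] * V[i][c] for i in range(d)) for c in range(dim)]
    best_err, best_ks = None, None
    for ks in compositions(k, d):
        approx = [beta / k * sum(ks[i] * V[i][c] for i in range(d)) for c in range(dim)]
        err = sq_norm([u - a for u, a in zip(U, approx)])
        if best_err is None or err < best_err:
            best_err, best_ks = err, ks
    bound = beta / k * sum(alpha[i] * sq_norm(V[i]) for i in range(d))
    return best_err, best_ks, bound


# Hypothetical instance: three vectors in R^2, sparsity level k = 4.
alpha = [0.5, 1.0, 1.5]
V = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
print(maurey_best(alpha, V, 4))
```

The number of candidate tuples is $\binom{k+d-1}{d-1}$, so this enumeration is only feasible for toy sizes.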
###### Proof.

Set for convenience, and let </