1 Introduction
Generalization, the ability of a classifier to perform well on unseen examples, is a desideratum for progress towards real-world deployment of deep neural networks in domains such as autonomous cars and healthcare. Until recently, it was commonly believed that deep networks generalize well to unseen examples. This was based on empirical evidence about performance on held-out datasets. However, new research has started to question this assumption. Adversarial examples cause networks to misclassify even slightly perturbed images at very high rates [Goodfellow et al., 2014, Papernot et al., 2016]. In addition, deep networks can overfit to arbitrarily corrupted data [Zhang et al., 2016], and they are sensitive to small geometric transformations [Azulay and Weiss, 2018, Engstrom et al., 2017]. These results have led to the important question of how the generalization gap (difference between train and test accuracy) of a deep network can be predicted using the training data and network parameters. Since in all of the above cases the training loss is usually very small, it is clear that existing losses such as cross-entropy cannot serve that purpose. It has also been shown (e.g. in [Zhang et al., 2016]) that regularizers such as weight decay cannot solve this problem either. Consequently, a number of recent works [Neyshabur et al., 2017a, Kawaguchi et al., 2017, Bartlett et al., 2017, Poggio et al., 2017, Arora et al., 2018] have started to address this question, proposing generalization bounds based on analyses of network complexity or noise stability properties. However, a thorough empirical assessment of these bounds in terms of how accurately they can predict the generalization gap across various practical settings is not yet available.
[Figure 1: normalized margin distributions for three networks with test accuracies 55.2%, 70.6%, and 85.1%.]

In this work, we propose a new quantity for predicting the generalization gap of a feedforward neural network. Using the notion of margin in support vector machines [Vapnik, 1995] and its extension to deep networks [Elsayed et al., 2018], we develop a measure that shows a strong correlation with generalization gap and significantly outperforms recently developed theoretical bounds on generalization. (In fairness, the theoretical bounds we compare against were designed to be provable upper bounds rather than estimates with low expected error. Nevertheless, since recent developments on characterizing the generalization gap of deep networks are in the form of upper bounds, they form a reasonable baseline.) This is empirically shown by studying a wide range of deep networks trained on the CIFAR-10 and CIFAR-100 datasets. The measure presented in this paper may be useful for constructing new loss functions with better generalization. Besides the improvement in predicting the generalization gap, our work is distinct from recently developed bounds and margin definitions in a number of ways:
These recently developed bounds are typically functions of weight norms (such as spectral, Frobenius, or various mixed norms). Consequently, they cannot capture variations in network topology that are not reflected in the weight norms, e.g. adding residual connections [He et al., 2016], without careful additional engineering based on the topology changes. Furthermore, some of the bounds require specific treatment for nonlinear activations. Our proposed measure can handle any feedforward deep network.
Although some of these bounds involve margin, the margin is only defined and measured at the output layer [Bartlett et al., 2017, Neyshabur et al., 2017a]. For a deep network, however, margin can be defined at any layer [Elsayed et al., 2018]. We show that measuring margin at a single layer does not suffice to capture generalization gap. We argue that it is crucial to use margin information across layers and show that this significantly improves generalization gap prediction.

The common definition of margin, as used in the recent bounds e.g. [Neyshabur et al., 2017a], or as extended to deep networks, is based on the closest distance of the training points to the decision boundary. However, this notion is brittle and sensitive to outliers. In contrast, we adopt the margin distribution [Garg et al., 2002, Langford and Shawe-Taylor, 2002] by looking at the entire distribution of distances. This is shown to have far better prediction power.
We argue that the direct extension of the margin definition to deep networks [Elsayed et al., 2018], although allowing margin to be defined at all layers of the model, is unable to capture the generalization gap without proper normalization. We propose a simple normalization scheme that significantly boosts prediction accuracy.
2 Related Work
The recent seminal work of [Zhang et al., 2016] has brought into focus the question of how generalization can be measured from training data. They showed that deep networks can easily learn to fit randomly labeled data with extremely high accuracy, but with arbitrarily low generalization capability. This overfitting is not countered by deploying commonly used regularizers.
The work of [Bartlett et al., 2017]
proposes a measure based on the ratio of two quantities: the margin distribution measured at the output layer of the network; and a spectral complexity measure related to the network’s Lipschitz constant. Their normalized margin distribution provides a strong indication of the complexity of the learning task, e.g. the distribution is skewed towards the origin (lower normalized margin) for training with random labels.
[Neyshabur et al., 2017a, Neyshabur et al., 2017b] also develop bounds based on the product of norms of the weights across layers. [Arora et al., 2018] develop bounds based on noise stability properties of networks: more stability implies better generalization. Using these criteria, they are able to derive stronger generalization bounds than previous works. The margin distribution (specifically, boosting of margins across the training set) has been shown to correspond to generalization properties in the literature on linear models [Schapire et al., 1998], where this connection was used to explain the effectiveness of boosting and bagging techniques. [Reyzin and Schapire, 2006] showed that it is important to control the complexity of a classifier when measuring margin, which calls for some type of normalization. In the linear case (SVM), margin is naturally defined as a function of the norm of the weights [Vapnik, 1995]. In the case of deep networks, the true margin is intractable. Recent work [Elsayed et al., 2018] proposed a linearization to approximate the margin, and defined the margin at any layer of the network. [Sokolic et al., 2016] provide another approximation to the margin based on the norm of the Jacobian with respect to the input layer. They show that maximizing their approximations to the margin leads to improved generalization. However, their analysis was restricted to margin at the input layer.
[Poggio et al., 2017] and [Liao et al., 2018] propose a normalized cross-entropy measure that correlates well with test accuracy. Their proposed normalized loss trades off confidence of predictions against stability, which improves correlation with test accuracy while significantly lowering the output margin.
3 Prediction of Generalization Gap
In this section, we introduce our margin-based measure. We first explain the construction scheme for obtaining the margin distribution. We then compress the distributional information of the margin into a small number of statistics. Finally, we regress these statistics against the value of the generalization gap. We assess prediction quality by applying the learned regression coefficients to predict the generalization gap of unseen models.
3.1 Margin Approximation
First, we establish some notation. Consider a classification problem with $n$ classes. We assume a classifier consists of nonlinear functions $f_i : \mathcal{X} \to \mathbb{R}$, for $i = 1, \dots, n$, that generate a prediction score for classifying the input vector $x \in \mathcal{X}$ to class $i$. The predicted label is decided by the class with maximal score, i.e. $i^* = \arg\max_i f_i(x)$. Define the decision boundary for each class pair $(i, j)$ as:

$$\mathcal{D}_{(i,j)} \triangleq \{ x \mid f_i(x) = f_j(x) \} \qquad (1)$$

Under this definition, the distance of a point $x$ to the decision boundary $\mathcal{D}_{(i,j)}$ can be expressed as the smallest displacement of the point that results in a score tie:

$$d_{f,x,(i,j)} \triangleq \min_{\delta} \|\delta\|_2 \;\; \text{s.t.} \;\; f_i(x + \delta) = f_j(x + \delta) \qquad (2)$$
Unlike an SVM, computing the “exact” distance of a point to the decision boundary (Eq. 2) for a deep network is intractable. (This is because computing the distance of a point to a nonlinear surface is intractable; this differs from an SVM, where the surface is linear and the distance of a point to a hyperplane admits a closed-form expression.) In this work, we adopt the approximation scheme from [Elsayed et al., 2018] to capture the distance of a point to the decision boundary. This is a first-order Taylor approximation to the true distance of Eq. 2. Formally, given an input $x$ to a network, denote its representation at layer $l$ (the $l$-th layer activation vector) by $x^l$. For the input layer, let $l = 0$ and thus $x^0 = x$. Then the distance of the representation vector $x^l$ to the decision boundary for class pair $(i, j)$ is given by the following approximation:

$$d_{f,x,(i,j)}(x^l) = \frac{f_i(x^l) - f_j(x^l)}{\| \nabla_{x^l} f_i(x^l) - \nabla_{x^l} f_j(x^l) \|_2} \qquad (3)$$
Here $f_i(x^l)$ represents the output (logit) of the network for class $i$, viewed as a function of the layer representation $x^l$. Note that this distance can be positive or negative, denoting whether the training sample is on the “correct” or “wrong” side of the decision boundary respectively. The training data induces a distribution of distances at each layer which, following earlier naming convention [Garg et al., 2002, Langford and Shawe-Taylor, 2002], we refer to as the margin distribution (at layer $l$). For the margin distribution, we only consider distances with positive sign (we ignore all misclassified training points).

A problem with plain distances and their associated distribution is that they can be trivially boosted without any significant change in the way the classifier separates the classes. For example, consider multiplying the weights at a layer by a constant and dividing the weights in the following layer by the same constant. In a ReLU network, due to the positive homogeneity property [Liao et al., 2018], this operation does not affect how the network classifies a point, but it changes the distances to the decision boundary. (For example, suppose the constant $c$ is greater than one. Then multiplying the weights of a layer by $c$ magnifies distances computed at that layer by a factor of $c$.)

To offset the scaling effect, we normalize the margin distribution. Consider the margin distribution at some layer $l$, and let $x_k^l$ be the representation vector for training sample $k$. We compute the variance of each coordinate of $\{x_k^l\}$ separately, and then sum these individual variances. This quantity is called the total variation of $x^l$. The square root of this quantity relates to the scale of the distribution: if $x^l$ is scaled by a factor, so is the square root of its total variation. Thus, by dividing distances by the square root of the total variation, we can construct a margin distribution that is invariant to scaling. More concretely, the total variation is computed as:

$$\nu(x^l) = \mathrm{tr}\Big( \frac{1}{m} \sum_{k=1}^{m} (x_k^l - \bar{x}^l)(x_k^l - \bar{x}^l)^{\top} \Big), \qquad \bar{x}^l = \frac{1}{m} \sum_{k=1}^{m} x_k^l \qquad (4)$$

i.e. the trace of the empirical covariance matrix of activations over the $m$ training samples. Using the total variation, the normalized margin is specified by:

$$\hat{d}_{f,x,(i,j)}(x^l) = \frac{d_{f,x,(i,j)}(x^l)}{\sqrt{\nu(x^l)}} \qquad (5)$$
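To make the computation concrete, here is a minimal NumPy sketch of the approximate margin (Eq. 3) and its normalization (Eqs. 4 and 5) for a one-hidden-layer ReLU network. This is an illustration under our own assumptions, not the authors' implementation: the network, shapes, and random inputs are all made up, and (lacking labels here) we take the predicted class and its runner-up as the class pair, which keeps every distance nonnegative.

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.normal(size=(16, 8))   # hidden-layer weights (illustrative)
W2 = rng.normal(size=(4, 16))   # output (logit) weights: 4 classes

def forward(x):
    h = np.maximum(W1 @ x, 0.0)      # ReLU representation x^1
    return W2 @ h, h                 # logits f(x), hidden activations

def margin_at_input(x, i, j):
    """First-order distance (Eq. 3) of x to the (i, j) decision boundary,
    measured at the input layer."""
    f, h = forward(x)
    # In this piecewise-linear region, grad_x (f_i - f_j) = W1^T (1[h>0] * (W2_i - W2_j))
    g = W1.T @ ((h > 0) * (W2[i] - W2[j]))
    return (f[i] - f[j]) / (np.linalg.norm(g) + 1e-12)

# Margin distribution at the input layer over a random batch.
X = rng.normal(size=(256, 8))
dists = []
for x in X:
    f, _ = forward(x)
    order = np.argsort(f)
    dists.append(margin_at_input(x, int(order[-1]), int(order[-2])))
dists = np.array(dists)

# Total variation (Eq. 4): sum of per-coordinate variances, i.e. the trace of
# the empirical covariance of the layer representation (here the input layer).
total_variation = X.var(axis=0).sum()
normalized_margins = dists / np.sqrt(total_variation)   # Eq. 5
```

Repeating this at each chosen layer, with gradients taken with respect to that layer's activations, yields the per-layer margin distributions discussed below.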
While the quantity is relatively primitive and easy to compute, Fig. 1 (top) shows that the normalized-margin distributions based on Eq. 5 have the desirable effect of becoming heavier tailed and shifting to the right (increasing margin) as the generalization gap decreases. We find that this effect holds across a range of networks trained with different hyperparameters.
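The scale invariance that motivates Eq. 5 can also be checked numerically. The sketch below (our own, illustrative, not the authors' code) multiplies the first layer's weights of a small ReLU network by a constant $c$ and divides the next layer's weights by $c$: the network's function is unchanged by positive homogeneity, the raw hidden-layer distances grow by a factor of $c$, and the normalized margins are identical.

```python
import numpy as np

rng = np.random.default_rng(1)
W1 = rng.normal(size=(16, 8))
W2 = rng.normal(size=(4, 16))
X = rng.normal(size=(128, 8))
c = 3.0

def hidden_margins(W1, W2, X):
    """Distances to the decision boundary measured at the hidden layer (the
    top layer is linear in x^1, so Eq. 3 is exact here), plus the hidden
    layer's total variation (Eq. 4)."""
    H = np.maximum(X @ W1.T, 0.0)         # hidden activations x^1, one row per sample
    F = H @ W2.T                          # logits
    d = []
    for h, f in zip(H, F):
        i, j = np.argsort(f)[-1], np.argsort(f)[-2]
        d.append((f[i] - f[j]) / np.linalg.norm(W2[i] - W2[j]))
    return np.array(d), H.var(axis=0).sum()

d1, tv1 = hidden_margins(W1, W2, X)
d2, tv2 = hidden_margins(c * W1, W2 / c, X)   # same function, rescaled weights

ratio = d2 / d1                                        # ~ c for every sample
norm1, norm2 = d1 / np.sqrt(tv1), d2 / np.sqrt(tv2)    # Eq. 5: identical
```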
3.2 Summarizing the Margin Distribution
Instead of working directly with the (normalized) margin distribution, it is easier to analyze a compact signature of it. The moments of a distribution are a natural criterion for this purpose. Perhaps the most standard construction is to compute the empirical moments from the samples and then take the $k$-th root of the $k$-th moment. In our experiments, we used the first five moments. However, it is a well-known phenomenon that the estimation of higher-order moments from samples can be unreliable. Therefore, we also consider an alternative way to construct the distribution's signature. Given the set of distances that constitutes the margin distribution at a layer, we use the median $Q_2$, first quartile $Q_1$ and third quartile $Q_3$ of the normalized margin distribution, along with the two fences that indicate variability outside the upper and lower quartiles. There are many variations for fences, but in this work, with $\mathrm{IQR} = Q_3 - Q_1$, we define the upper fence to be $\max\{d : d \le Q_3 + 1.5\,\mathrm{IQR}\}$ and the lower fence to be $\min\{d : d \ge Q_1 - 1.5\,\mathrm{IQR}\}$ [McGill et al., 1978]. These five statistics form the quartile description that summarizes the normalized margin distribution at a specific layer, as shown in the box plots of Fig. 1. We will later see that both signature representations are able to predict the generalization gap, with the second signature working slightly better.

A number of prior works such as [Bartlett et al., 2017], [Neyshabur et al., 2017a], [Liu et al., 2016], [Sun et al., 2015], [Sokolic et al., 2016], and [Liang et al., 2017] have focused on analyzing or maximizing the margin at either the input or the output layer of a deep network. Since a deep network has many hidden layers with evolving representations, it is not immediately clear which of the layer margins is of importance for improving generalization. Our experiments reveal that margin distributions from all layers of the network contribute to the prediction of the generalization gap. This is also clear from Fig. 1 (top): comparing the input layer (layer 0) margin distributions between the left and right plots, the input layer distribution shifts slightly left, but the other layer distributions shift the other way. If we use the quartile signature, we have $5L$ components in this vector, where $L$ is the number of layers used. We incorporate dependence on all layers simply by concatenating the margin signatures of all layers into a single combined vector that we refer to as the total signature.
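A possible implementation of the quartile signature (the function names, synthetic data, and the fence constant 1.5, the usual box-plot choice from [McGill et al., 1978], are our own assumptions):

```python
import numpy as np

def quartile_signature(distances, c=1.5):
    """Five summary statistics of one layer's normalized margin distribution:
    lower fence, Q1, median, Q3, upper fence."""
    d = np.asarray(distances, dtype=float)
    q1, q2, q3 = np.percentile(d, [25, 50, 75])
    iqr = q3 - q1
    upper = d[d <= q3 + c * iqr].max()    # largest point inside the upper fence
    lower = d[d >= q1 - c * iqr].min()    # smallest point inside the lower fence
    return np.array([lower, q1, q2, q3, upper])

def total_signature(per_layer_distances):
    """Concatenate per-layer signatures: 5L entries for L layers."""
    return np.concatenate([quartile_signature(d) for d in per_layer_distances])

rng = np.random.default_rng(0)
# Four synthetic per-layer margin distributions -> a 20-dimensional signature.
sig = total_signature([rng.gamma(2.0, 1.0, size=500) for _ in range(4)])
```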
3.3 Evaluation Metrics
Our goal is to predict the generalization gap, i.e. the difference between training and test accuracy at the end of training, based on the total signature $\theta$ of a trained model. We use the simplest prediction model, a linear form $\hat{g} = a^{\top} \phi(\theta) + b$, where $a$ and $b$ are parameters of the predictor, and $\phi$ is a function applied elementwise to $\theta$. Specifically, we explore two choices of $\phi$: the identity $\phi(x) = x$ and the entrywise transform $\phi(x) = \log x$, which correspond to additive and multiplicative combinations of the margin statistics respectively.

In order to estimate the predictor parameters $a$ and $b$, we generate a pool of pretrained models (covering different datasets, architectures, regularization schemes, etc. as explained in Sec. 4), each of which gives one instance of the pair $(\theta, g)$ ($g$ being the generalization gap for that model). We then find $a$ and $b$ by minimizing the mean squared error $\sum_i (\hat{g}_i - g_i)^2$, where $i$ indexes the models in the pool. The next step is to assess the prediction quality. We consider two metrics for this.
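Estimating the predictor is ordinary least squares on the transformed signatures. A sketch on a synthetic pool (the pool size, dimensions, and noise level are made up; this is not the authors' code):

```python
import numpy as np

rng = np.random.default_rng(0)
n_models, dim = 60, 20
Theta = rng.uniform(0.1, 2.0, size=(n_models, dim))     # total signatures (positive)
gaps = (np.log(Theta) @ rng.normal(size=dim) + 0.05
        + 0.01 * rng.normal(size=n_models))             # synthetic generalization gaps

def fit_predictor(Theta, gaps, phi=np.log):
    """Least-squares fit of g_hat = a . phi(theta) + b."""
    A = np.column_stack([phi(Theta), np.ones(len(Theta))])  # constant column = intercept b
    coef, *_ = np.linalg.lstsq(A, gaps, rcond=None)
    return coef[:-1], coef[-1]

a, b = fit_predictor(Theta, gaps)
pred = np.log(Theta) @ a + b
```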
The first metric examines the quality of predictions on unseen models. For that, we consider a held-out pool of models, different from those used to estimate $a$ and $b$, and compute the predicted gap $\hat{g} = a^{\top} \phi(\theta) + b$ on them. In order to quantify the discrepancy between the predicted gap $\hat{g}$ and the ground-truth gap $g$, we use the notion of coefficient of determination ($R^2$) [Glantz et al., 1990]:

$$R^2 = 1 - \frac{\sum_i (g_i - \hat{g}_i)^2}{\sum_i (g_i - \bar{g})^2} \qquad (6)$$

$R^2$ measures what fraction of the data variance can be explained by the linear model. (A simple manipulation shows that the prediction residual satisfies $\sum_i (g_i - \hat{g}_i)^2 = (1 - R^2) \sum_i (g_i - \bar{g})^2$, so $1 - R^2$ can be interpreted as a scale-invariant alternative to the residual.) It ranges from 0 to 1 on training points but can fall outside that range on unseen points. To be precise, we use k-fold validation to study how the predictor performs on a held-out pool of trained deep networks: we use a 90/10 split, fit the linear model with the training pool, and measure $R^2$ on the held-out pool. The performance is averaged over the 10 splits. Since $R^2$ is now not measured on the training pool, it does not suffer from high data dimension and can be negative. We provide a subset of residual plots and the corresponding univariate F-tests for the experiments in the appendix (Sec. 7). The F-score also indicates how important each individual variable is.
The second metric examines how well the model fits the provided training pool; it does not require a test pool. To characterize this, we use the adjusted $R^2$ [Glantz et al., 1990], defined as:

$$\bar{R}^2 = 1 - (1 - R^2)\,\frac{n - 1}{n - d - 1} \qquad (7)$$

where $n$ is the number of models in the pool and $d$ is the dimension of the signature. The adjusted $R^2$ can be negative when the data is nonlinear, and it is always smaller than $R^2$. Intuitively, adjusted $R^2$ penalizes the model if the number of features is high relative to the available data points. The closer adjusted $R^2$ is to 1, the better the model fits. Adjusted $R^2$ is a simple yet effective method to test the fitness of a linear model, and it is independent of the scale of the target, making it a more illustrative metric than raw residuals.
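Both metrics admit short implementations; the following is our own sketch on synthetic data (the fold construction is one plausible reading of the 90/10 scheme, not necessarily the authors' exact split):

```python
import numpy as np

def r_squared(g, g_hat):
    """Coefficient of determination (Eq. 6)."""
    g, g_hat = np.asarray(g, float), np.asarray(g_hat, float)
    return 1.0 - np.sum((g - g_hat) ** 2) / np.sum((g - g.mean()) ** 2)

def adjusted_r_squared(r2, n, d):
    """Eq. 7: n models in the pool, d signature dimensions."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - d - 1)

def kfold_r_squared(Theta, gaps, k=10, phi=np.log):
    """Average out-of-sample R^2 over k folds (each fold holds out ~1/k of the pool)."""
    idx = np.arange(len(gaps))
    scores = []
    for fold in range(k):
        test = idx[fold::k]                     # every k-th model held out
        train = np.setdiff1d(idx, test)
        A = np.column_stack([phi(Theta[train]), np.ones(len(train))])
        coef, *_ = np.linalg.lstsq(A, gaps[train], rcond=None)
        A_test = np.column_stack([phi(Theta[test]), np.ones(len(test))])
        scores.append(r_squared(gaps[test], A_test @ coef))
    return float(np.mean(scores))

rng = np.random.default_rng(0)
Theta = rng.uniform(0.1, 2.0, size=(100, 5))
gaps = np.log(Theta) @ rng.normal(size=5) + 0.3   # exactly linear in the log-signature
cv_score = kfold_r_squared(Theta, gaps)
```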
4 Experiments
We tested our measure of the generalization gap, along with baseline measures, on a number of deep networks and architectures: nine-layer convolutional networks on CIFAR-10 (10 layers counting the input layer), and 32-layer residual networks on both the CIFAR-10 and CIFAR-100 datasets.
4.1 Convolutional Neural Networks on CIFAR-10
Using the CIFAR-10 dataset, we train nine-layer convolutional networks with different settings of hyperparameters and training techniques. We apply weight decay and dropout with different strengths; we use networks with and without batch norm and data augmentation; and we change the number of hidden units in the hidden layers. Finally, we also include training with and without corrupted labels, as introduced in [Zhang et al., 2016]; we use a fixed amount of corruption of the true labels. The accuracy on the test set ranges from to and the generalization gap ranges from to . In standard settings, creating neural network models with a small generalization gap is difficult; in order to create sufficiently diverse generalization behaviors, we limit some models' capacities with large weight regularization, which decreases the generalization gap by lowering the training accuracy. All networks are trained by SGD with momentum. Further details are provided in the supplementary material (Sec. 6).

For each trained network, we compute the signature of the normalized margin distribution (see Sec. 3). Empirically, we found that constructing this on only four evenly-spaced layers (the input and three of the hidden layers) leads to good predictors. This results in a 20-dimensional signature vector $\theta$. We estimate the parameters of the linear predictor using the log transform $\phi(x) = \log x$ of the 20-dimensional signature vector. Fig. 2 (left) shows the resulting scatter plot of the predicted generalization gap $\hat{g}$ against the true generalization gap $g$. As can be seen, the relationship is very close to linear across the range of generalization gaps, and this is also supported by the adjusted $R^2$ of the model (the maximum possible value being 1).
As a first baseline method, we compare against the work of [Bartlett et al., 2017], which provides one of the best generalization bounds currently known for deep networks. This work also constructs a margin distribution for the network, but in a different way. To make a fair comparison, we extract the same signature from their margin distribution. Since their margin distribution can only be defined at the output layer, their signature $\theta$ is 5-dimensional for any network. The resulting fit is shown in Fig. 2 (right). It is clearly a poorer fit than that of our signature, with a significantly lower adjusted $R^2$.

For a fairer comparison, we also reduced our signature from 20 dimensions to the best-performing 4 dimensions (one dimension fewer than used for Bartlett's) by dropping 16 components of our $\theta$. This is shown in Fig. 2 (middle); its adjusted $R^2$ is poorer than that of our complete signature but still significantly higher than that of [Bartlett et al., 2017]. In addition, we considered two other baseline comparisons: [Sokolic et al., 2016], where the margin at the input is defined as a function of the Jacobian of the output (logits) with respect to the input; and [Elsayed et al., 2018], where the linearized approximation to the margin is derived (for the same layers where we use our normalized margin approximation).
[Figure 2: generalization gap prediction fits. Left: our normalized margin signature (20D); Middle: our best 4 features (4D); Right: the 5D signature extracted from the margin distribution of [Bartlett et al., 2017].]

To quantify the effect of the normalization, different layers, feature transformation, etc., we conduct a number of ablation experiments with the following configurations:
1. linear/log: use the signature with transform $\phi(x) = x$ or $\phi(x) = \log x$;
2. single layer: use the signature from the best single layer;
3. single feat: use only the best statistic from the total signature across all layers;
4. moment: use the first 5 moments of the normalized margin distribution as the signature instead of the quartile statistics (Sec. 3);
5. spectral: use the signature of the spectrally normalized margins from [Bartlett et al., 2017];
6. quartile: use all the quartile statistics as the total signature (Sec. 3);
7. best4: use the best 4 statistics from the total signature;
8. Jacobian: use the Jacobian-based margin defined in Eq. (39) of [Sokolic et al., 2016];
9. LM: use the large-margin loss from [Elsayed et al., 2018] at the same four layers where our statistics are measured.
Table 1: Adjusted R^2 and k-fold R^2 for the ablation experiments.

Experiment Setting |   CNN+CIFAR-10   |  ResNet+CIFAR-10 | ResNet+CIFAR-100
                   | Adj.    k-fold   | Adj.    k-fold   | Adj.    k-fold
quartile+log       | 0.94    0.90     | 0.87    0.81     | 0.97    0.96
quartile+linear    | 0.88    0.84     | 0.82    0.72     | 0.91    0.87
single feat+log    | 0.86    0.83     | 0.44    0.22     | 0.80    0.78
single layer+log   | 0.73    0.67     | 0.53    0.39     | 0.95    0.94
moment+log         | 0.93    0.87     | 0.83    0.74     | 0.80    0.78
best4+log          | 0.89    0.87     | 0.54    0.43     | 0.93    0.92
spectral+log       | 0.73    0.70     |  --      --      |  --      --
Jacobian+log       | 0.42    negative | 0.20    negative | 0.47    negative
LM+linear          | 0.35    negative | 0.68    negative | 0.74    negative
In Table 1, we list the adjusted $R^2$ and k-fold $R^2$ from fitting models under each of these scenarios. We see that both the quartile and moment signatures perform similarly, lending support to our thesis that the margin distribution, rather than the smallest or largest margin, is of importance in the context of generalization.
4.2 Residual Networks on CIFAR-10
On the CIFAR-10 dataset, we train convolutional networks with residual connections; these networks are 32 layers deep with the standard ResNet 32 topology [He et al., 2016]. Since it is difficult to train ResNets without activation normalization, we create generalization gap variation with batch normalization [Ioffe and Szegedy, 2015] and group normalization [Wu and He, 2018]. We further use different initial learning rates. The accuracy on the test set ranges from to and the generalization gap from to . The residual networks are much deeper, so we again chose only four layers, for feature-length compatibility with the shallower convolutional networks. This design choice also facilitates ease of analysis and circumvents the dependency on the depth of the models. Table 1 shows the resulting $R^2$ values.

Note that in the presence of residual connections that use convolution instead of the identity, and of identity blocks that span more than one convolutional layer, it is not immediately clear how to properly apply the bounds of [Bartlett et al., 2017] (third-from-last row of Table 1) without morphing the topology of the architecture and carefully designing reference matrices. As such, we omit them for ResNet. Fig. 3 (left) shows the fit for the ResNet models. Fig. 3 (middle) and Fig. 3 (right) compare the log normalized density plots of a CIFAR-10 ResNet and a CIFAR-10 CNN. The plots show that the ResNet achieves a better margin distribution, correlated with greater test accuracy, even though it was trained without data augmentation.
[Figure 3: Left: fit of the 20D normalized margin signature for the ResNet models; Middle: log density of normalized margins for ResNet-32; Right: log density for the CNN.]
4.3 ResNet on CIFAR-100
On the CIFAR-100 dataset, we trained ResNet 32 with the same variation in hyperparameter settings as for the CIFAR-10 networks, with one additional initial learning rate. The accuracy on the test set ranges from to and the generalization gap ranges from to . Table 1 shows the adjusted $R^2$ and k-fold $R^2$ for a number of ablation experiments and the full feature set. Fig. 4 (left) shows the fit of predicted and true generalization gaps over these networks. Fig. 4 (middle) and Fig. 4 (right) compare a CIFAR-100 residual network and a CIFAR-10 residual network with the same architecture and hyperparameters. Under these settings, the CIFAR-100 network achieves lower test accuracy than the CIFAR-10 network. The resulting normalized margin density plots clearly reflect the better generalization achieved on CIFAR-10: the densities at all layers are wider and shifted to the right. Thus, the normalized margin distributions reflect the relative “difficulty” of a particular dataset for a given architecture.
[Figure 4: Left: fit of the 20D normalized margin signature; Middle: normalized margin densities for the CIFAR-100 ResNet; Right: densities for the CIFAR-10 ResNet.]
5 Discussion
We have presented a predictor for the generalization gap based on the margin distribution in deep networks and conducted extensive experiments to assess it. Our results show that our scheme achieves a high adjusted coefficient of determination (a linear regression predicts the generalization gap accurately). Specifically, the predictor uses the normalized margin distribution across multiple layers of the network. The best predictor uses quartiles of the distribution combined in a multiplicative way (additive in the $\log$ transform). Compared to the strong baseline of spectral-complexity-normalized output margin [Bartlett et al., 2017], our scheme exhibits much higher predictive power and can be applied to any feedforward network (including ResNets, unlike generalization bounds such as [Bartlett et al., 2017, Neyshabur et al., 2017a, Arora et al., 2018]). Our findings could be a stepping stone for studying new loss functions with better generalization properties. We leave some final thoughts in Appendix Sec. 8.

Acknowledgments
We are thankful to Gamaleldin Elsayed (Google), Tomer Koren (Google), Sergey Ioffe (Google), Vighnesh Birodkar (Google), Shraman Ray Chaudhuri (Google), Kevin Regan (Google), Behnam Neyshabur (NYU), and Dylan Foster (Cornell) for discussions and helpful comments.
References
 [Arora et al., 2018] Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296.
 [Azulay and Weiss, 2018] Azulay, A. and Weiss, Y. (2018). Why do deep convolutional networks generalize so poorly to small image transformations? arXiv preprint arXiv:1805.12177.
 [Bartlett et al., 2017] Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrally-normalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
 [Elsayed et al., 2018] Elsayed, G. F., Krishnan, D., Mobahi, H., Regan, K., and Bengio, S. (2018). Large margin deep networks for classification. arXiv preprint arXiv:1803.05598.
 [Engstrom et al., 2017] Engstrom, L., Tsipras, D., Schmidt, L., and Madry, A. (2017). A rotation and a translation suffice: Fooling cnns with simple transformations. arXiv preprint arXiv:1712.02779.
 [Garg et al., 2002] Garg, A., Har-Peled, S., and Roth, D. (2002). On generalization bounds, projection profile, and margin distribution. In Machine Learning, Proceedings of the Nineteenth International Conference (ICML 2002), University of New South Wales, Sydney, Australia, July 8–12, 2002, pages 171–178.
 [Glantz et al., 1990] Glantz, S. A., Slinker, B. K., and Neilands, T. B. (1990). Primer of Applied Regression and Analysis of Variance, volume 309. McGraw-Hill, New York.
 [Goodfellow et al., 2014] Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572.

 [He et al., 2016] He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
 [Ioffe and Szegedy, 2015] Ioffe, S. and Szegedy, C. (2015). Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167.
 [Kawaguchi et al., 2017] Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. (2017). Generalization in deep learning. arXiv preprint arXiv:1710.05468.
 [Langford and Shawe-Taylor, 2002] Langford, J. and Shawe-Taylor, J. (2002). PAC-Bayes margins. In Proceedings of the 15th International Conference on Neural Information Processing Systems, NIPS'02, pages 439–446, Cambridge, MA, USA. MIT Press.
 [Liang et al., 2017] Liang, X., Wang, X., Lei, Z., Liao, S., and Li, S. Z. (2017). Soft-margin softmax for deep classification. In International Conference on Neural Information Processing, pages 413–421. Springer.
 [Liao et al., 2018] Liao, Q., Miranda, B., Banburski, A., Hidary, J., and Poggio, T. (2018). A surprising linear relationship predicts test performance in deep networks. arXiv preprint arXiv:1807.09659.
 [Lin et al., 2013] Lin, M., Chen, Q., and Yan, S. (2013). Network in network. arXiv preprint arXiv:1312.4400.

 [Liu et al., 2016] Liu, W., Wen, Y., Yu, Z., and Yang, M. (2016). Large-margin softmax loss for convolutional neural networks. In ICML, pages 507–516.
 [McGill et al., 1978] McGill, R., Tukey, J. W., and Larsen, W. A. (1978). Variations of box plots. The American Statistician, 32(1):12–16.

 [Neyshabur et al., 2017a] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017a). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956.
 [Neyshabur et al., 2017b] Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017b). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. arXiv preprint arXiv:1707.09564.
 [Papernot et al., 2016] Papernot, N., McDaniel, P., Goodfellow, I., Jha, S., Celik, Z. B., and Swami, A. (2016). Practical black-box attacks against deep learning systems using adversarial examples. arXiv preprint arXiv:1602.02697.
 [Poggio et al., 2017] Poggio, T., Kawaguchi, K., Liao, Q., Miranda, B., Rosasco, L., Boix, X., Hidary, J., and Mhaskar, H. (2017). Theory of deep learning III: explaining the non-overfitting puzzle. arXiv preprint arXiv:1801.00173.
 [Reyzin and Schapire, 2006] Reyzin, L. and Schapire, R. E. (2006). How boosting the margin can also boost classifier complexity. In Proceedings of the 23rd international conference on Machine learning, pages 753–760. ACM.
 [Schapire et al., 1998] Schapire, R. E., Freund, Y., Bartlett, P., Lee, W. S., et al. (1998). Boosting the margin: A new explanation for the effectiveness of voting methods. The annals of statistics, 26(5):1651–1686.
 [Sokolic et al., 2016] Sokolic, J., Giryes, R., Sapiro, G., and Rodrigues, M. R. D. (2016). Robust large margin deep neural networks. CoRR, abs/1605.08254.
 [Sun et al., 2015] Sun, S., Chen, W., Wang, L., and Liu, T. (2015). Large margin deep neural networks: Theory and algorithms. CoRR, abs/1506.05232.

 [Vapnik, 1995] Vapnik, V. N. (1995). The Nature of Statistical Learning Theory. Springer-Verlag, New York, NY, USA.
 [Wu and He, 2018] Wu, Y. and He, K. (2018). Group normalization. arXiv preprint arXiv:1803.08494.
 [Zhang et al., 2016] Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2016). Understanding deep learning requires rethinking generalization. arXiv preprint arXiv:1611.03530.
6 Appendix: Experimental Details
6.1 CNN + CIFAR-10
We use an architecture very similar to Network in Network [Lin et al., 2013], but we remove all dropout and max pooling from the network.
Layer Index  Layer Type              Output Shape
0            Input
1            convolution + stride 2
2            convolution + stride 1
3            convolution + stride 1
4            convolution + stride 2
5            convolution + stride 1
6            convolution + stride 1
7            convolution + stride 2
8            convolution + stride 1
9            convolution + stride 1
10           convolution + stride 1
To create a range of generalization gaps in this model, we make the following modifications to the base architecture:

Use channel sizes of 192, 288, and 384 to create different widths

Train with and without batch norm at all convolutional layers

Apply dropout at layer 3 and 6 with

Apply regularization with

Trian with and without data augmentation with random cropping, flipping and shifting

Train each configuration twice
In total this gives us different network architectures. The models are trained using SGD with momentum () at a minibatch size of 128 and an initial learning rate of 0.01. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.

6.2 ResNet 32 + CIFAR10
For these experiments, we use the standard ResNet 32 architecture. We treat downsampling as the marker of a stage, so there are in total 3 stages in the ResNet 32 architecture. To create a generalization gap in this model, we make the following modifications to the architecture:

- Use network widths that are wider in number of channels
- Train with batch norm or group norm [Wu and He, 2018]
- Train with initial learning rate of
- Apply regularization with
- Train with and without data augmentation (random cropping, flipping, and shifting)
- Train each configuration 3 times
In total this gives us different network architectures. The models are trained using SGD with momentum () at a minibatch size of 128. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.
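The step schedule used in all of our experiments can be sketched as follows. The decay factor of 0.1 is an assumption (the text does not state it), and the default base rate of 0.01 is the value stated only for the CNN experiments:

```python
# Stepwise learning-rate schedule: decay every 100 epochs over 380 epochs.
# The decay factor (0.1) is an assumption; the base rate of 0.01 is the
# value stated for the CNN experiments.

def learning_rate(epoch, base_lr=0.01, decay=0.1, interval=100):
    """Learning rate at a given epoch under step decay."""
    return base_lr * decay ** (epoch // interval)

print([learning_rate(e) for e in (0, 99, 100, 250, 379)])
```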
6.3 ResNet 32 + CIFAR100
For these experiments, we use the standard ResNet 32 architecture. We treat downsampling as the marker of a stage, so there are in total 3 stages in the ResNet 32 architecture. To create a generalization gap in this model, we make the following modifications to the architecture:

- Use network widths that are wider in number of channels
- Train with batch norm or group norm [Wu and He, 2018]
- Train with initial learning rate of
- Apply regularization with
- Train with and without data augmentation (random cropping, flipping, and shifting)
- Train each configuration 3 times
In total this gives us different network architectures. The models are trained using SGD with momentum () at a minibatch size of 128. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.
7 Appendix: Further Analysis of Regression
7.1 CNN + CIFAR10 + All Quartile Signature
    lower fence  Q1  median  Q3  upper fence

h0  306.40  114.41  39.56  12.54  5.07 
h1  286.53  9.42  5.16  17.29  38.57 
h2  259.68  6.95  77.03  110.40  152.20 
h3  188.59  10.29  49.76  83.40  143.23 
    lower fence  Q1  median  Q3  upper fence

h0  3.59e-43  1.13e-21  1.76e-9  4.87e-4  2.52e-2
h1  2.34e-41  2.41e-3  2.40e-2  4.64e-5  2.70e-09
h2  8.76e-39  8.95e-3  5.38e-16  4.30e-21  9.12e-27
h3  3.40e-31  1.54e-3  2.37e-11  5.17e-17  1.31e-25
7.2 ResNet 32 + CIFAR10 + All Quartile Signature
    lower fence  Q1  median  Q3  upper fence

h0  45.67  16.67  6.97  1.71  0.68 
h1  58.84  88.14  44.15  15.59  9.36 
h2  60.20  78.57  35.76  12.89  7.52 
h3  59.75  0.27  1.192  7.37  44.22 
    lower fence  Q1  median  Q3  upper fence

h0  1.30e-10  6.25e-5  8.88e-3  0.192  0.40
h1  5.94e-13  9.33e-18  2.47e-10  1.06e-4  2.49e-3
h2  3.45e-13  3.04e-16  9.21e-9  4.07e-4  6.59e-3
h3  4.14e-13  0.60  0.27  7.14e-3  2.4e-10
7.3 ResNet 32 + CIFAR100 + All Quartile Signature
    lower fence  Q1  median  Q3  upper fence

h0  80.12  8.40  59.62  141.56  248.77 
h1  65.24  109.86  343.57  700.91  1124.43 
h2  99.06  15.47  122.36  305.88  512.69 
h3  244.07  128.45  65.58  28.10  2.34 
    lower fence  Q1  median  Q3  upper fence

h0  2.85e-17  4.00e-3  1.46e-13  2.65e-27  6.32e-42
h1  1.34e-14  2.60e-22  1.04e-52  8.12e-83  4.55e-107
h2  1.59e-20  1.03e-4  2.53e-24  1.29e-48  1.42e-68
h3  2.40e-41  2.78e-25  1.16e-14  2.13e-7  0.127
8 Appendix: Some Observations and Conjectures
Everything here uses the full quartile description.
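As a concrete reference, the full quartile description of a set of margin values is the five-number signature {lower fence, Q1, median, Q3, upper fence}; the Tukey-style 1.5 × IQR fences below are an assumption about the exact fence definition:

```python
# Five-number "quartile signature" of a margin distribution:
# {lower fence, Q1, median, Q3, upper fence}.
# Tukey fences at 1.5 * IQR are an assumption about the fence definition.
import statistics

def quartile_signature(margins):
    """Return [lower fence, Q1, median, Q3, upper fence] for a sample."""
    q1, median, q3 = statistics.quantiles(margins, n=4)
    iqr = q3 - q1
    return [q1 - 1.5 * iqr, q1, median, q3, q3 + 1.5 * iqr]

print(quartile_signature([1, 2, 3, 4, 5, 6, 7, 8]))
```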
8.1 Cross Architecture Comparison
We perform regression analysis with both the base CNN and ResNet32 on CIFAR10. The resulting and the k-fold . This suggests that the same coefficients work generally well across architectures, provided they are trained on the same data. Interestingly, the distributions at the 3 locations of the networks are comparable even though the depths are vastly different.

8.2 Cross Dataset Comparison
We perform regression analysis with ResNet32 on both CIFAR10 and CIFAR100. The resulting and the k-fold . This suggests that the same coefficients work generally well across datasets for the same architecture.
8.3 Cross Everything
We join all our experiment data. The resulting and the k-fold . It is perhaps surprising that a single set of coefficients exists across both datasets and architectures.
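A minimal sketch of this kind of analysis: an ordinary-least-squares predictor of the generalization gap from signature features, scored by k-fold R². The feature matrix and targets below are synthetic stand-ins for illustration, not the data from our experiments:

```python
# Linear regression of generalization gap on signature features,
# scored by k-fold R^2. The data here is synthetic, for illustration only.
import numpy as np

def kfold_r2(X, y, k=5):
    """Mean out-of-fold R^2 of an OLS fit (with intercept) over k folds."""
    idx = np.arange(len(y))
    scores = []
    for fold in range(k):
        test = idx[fold::k]                          # every k-th sample held out
        train = np.setdiff1d(idx, test)
        Xtr = np.c_[np.ones(len(train)), X[train]]   # add intercept column
        coef, *_ = np.linalg.lstsq(Xtr, y[train], rcond=None)
        pred = np.c_[np.ones(len(test)), X[test]] @ coef
        ss_res = np.sum((y[test] - pred) ** 2)
        ss_tot = np.sum((y[test] - y[test].mean()) ** 2)
        scores.append(1.0 - ss_res / ss_tot)
    return float(np.mean(scores))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                        # stand-in signature features
y = X @ np.array([0.5, -0.2, 0.1]) + 0.01 * rng.normal(size=100)
print(kfold_r2(X, y))                                # close to 1 for near-linear data
```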
8.4 Implications on Generalization Bounds
We believe that the method developed here can be used as a complement to existing generalization bounds. More sophisticated engineering of the predictor may help verify what form a generalization bound should take, up to constant factors or exponents, and may thus aid in developing generalization bounds tighter than existing ones.