Predicting the Generalization Gap in Deep Networks with Margin Distributions

09/28/2018 · by Yiding Jiang, et al.

As shown in recent research, deep neural networks can perfectly fit randomly labeled data, but with very poor accuracy on held-out data. This phenomenon indicates that loss functions such as cross-entropy are not a reliable indicator of generalization. This leads to the crucial question of how the generalization gap should be predicted from the training data and network parameters. In this paper, we propose such a measure, and conduct extensive empirical studies on how well it can predict the generalization gap. Our measure is based on the concept of margin distribution, i.e. the distances of training points to the decision boundary. We find that it is necessary to use margin distributions at multiple layers of a deep network. On the CIFAR-10 and the CIFAR-100 datasets, our proposed measure correlates very strongly with the generalization gap. In addition, we find the following other factors to be of importance: normalizing margin values for scale independence, using characterizations of the margin distribution rather than just the margin (closest distance to decision boundary), and working in log space instead of linear space (effectively using a product of margins rather than a sum). Our measure can be easily applied to feedforward deep networks with any architecture and may point towards new training loss functions that could enable better generalization.

1 Introduction

Generalization, the ability of a classifier to perform well on unseen examples, is a desideratum for progress towards real-world deployment of deep neural networks in domains such as autonomous cars and healthcare. Until recently, it was commonly believed that deep networks generalize well to unseen examples. This was based on empirical evidence about performance on held-out datasets. However, new research has started to question this assumption. Adversarial examples cause networks to misclassify even slightly perturbed images at very high rates [Goodfellow et al., 2014, Papernot et al., 2016]. In addition, deep networks can overfit to arbitrarily corrupted data [Zhang et al., 2016], and they are sensitive to small geometric transformations [Azulay and Weiss, 2018, Engstrom et al., 2017]. These results have led to the important question of how the generalization gap (difference between train and test accuracy) of a deep network can be predicted using the training data and network parameters. Since in all of the above cases the training loss is usually very small, it is clear that existing losses such as cross-entropy cannot serve that purpose. It has also been shown (e.g. in [Zhang et al., 2016]) that regularizers such as weight decay cannot solve this problem either.

Consequently, a number of recent works [Neyshabur et al., 2017a, Kawaguchi et al., 2017, Bartlett et al., 2017, Poggio et al., 2017, Arora et al., 2018] have started to address this question, proposing generalization bounds based on analyses of network complexity or noise stability properties. However, a thorough empirical assessment of these bounds in terms of how accurately they can predict the generalization gap across various practical settings is not yet available.

Figure 1: (Best seen as PDF) Density plots (top) and box plots (bottom) of the normalized margin of three convolutional networks trained with cross-entropy loss on CIFAR-10 with varying test accuracy: left: 55.2%, middle: 70.6%, right: 85.1%. The left network was trained with 20% corrupted labels. The train accuracy of all of the above networks is close to 100%, and the training losses are close to zero. The densities and box plots are computed on the training set. Normalized margin distributions are strongly correlated with test accuracy (moving to the right as accuracy increases). This motivates our use of normalized margins at all layers. The (Tukey) box plots show the median and other order statistics (see Section 3.2 for details), and motivate their use as features to summarize the distributions.

In this work, we propose a new quantity for predicting the generalization gap of a feedforward neural network. Using the notion of margin in support vector machines [Vapnik, 1995] and its extension to deep networks [Elsayed et al., 2018], we develop a measure that shows a strong correlation with generalization gap and significantly outperforms recently developed theoretical bounds on generalization. (In fairness, the theoretical bounds we compare against were designed to be provable upper bounds rather than estimates with low expected error. Nevertheless, since recent developments on characterizing the generalization gap of deep networks are in the form of upper bounds, they form a reasonable baseline.) This is empirically shown by studying a wide range of deep networks trained on the CIFAR-10 and CIFAR-100 datasets. The measure presented in this paper may be useful for constructing new loss functions with better generalization. Besides improvement in the prediction of the generalization gap, our work is distinct from recently developed bounds and margin definitions in a number of ways:

  1. These recently developed bounds are typically functions of weight norms (such as the spectral, Frobenius, or various mixed norms). Consequently, they cannot capture variations in network topology that are not reflected in the weight norms, e.g. adding residual connections [He et al., 2016], without careful additional engineering based on the topology changes. Furthermore, some of the bounds require specific treatment for nonlinear activations. Our proposed measure can handle any feedforward deep network.

  2. Although some of these bounds involve margin, the margin is only defined and measured at the output layer [Bartlett et al., 2017, Neyshabur et al., 2017a]. For a deep network, however, margin can be defined at any layer [Elsayed et al., 2018]. We show that measuring margin at a single layer does not suffice to capture generalization gap. We argue that it is crucial to use margin information across layers and show that this significantly improves generalization gap prediction.

  3. The common definition of margin, as used in the recent bounds, e.g. [Neyshabur et al., 2017a], or as extended to deep networks, is based on the closest distance of the training points to the decision boundary. However, this notion is brittle and sensitive to outliers. In contrast, we adopt the margin distribution [Garg et al., 2002, Langford and Shawe-Taylor, 2002] by looking at the entire distribution of distances. This is shown to have far better prediction power.

  4. We argue that the direct extension of margin definition to deep networks [Elsayed et al., 2018], although allowing margin to be defined on all layers of the model, is unable to capture generalization gap without proper normalization. We propose a simple normalization scheme that significantly boosts prediction accuracy.

2 Related Work

The recent seminal work of [Zhang et al., 2016] has brought into focus the question of how generalization can be measured from training data. They showed that deep networks can easily learn to fit randomly labeled data with extremely high accuracy, but with arbitrarily low generalization capability. This overfitting is not countered by deploying commonly used regularizers.

The work of [Bartlett et al., 2017] proposes a measure based on the ratio of two quantities: the margin distribution measured at the output layer of the network; and a spectral complexity measure related to the network’s Lipschitz constant. Their normalized margin distribution provides a strong indication of the complexity of the learning task, e.g. the distribution is skewed towards the origin (lower normalized margin) for training with random labels.

[Neyshabur et al., 2017a, Neyshabur et al., 2017b] also develop bounds based on the product of norms of the weights across layers. [Arora et al., 2018] develop bounds based on noise stability properties of networks: more stability implies better generalization. Using these criteria, they are able to derive stronger generalization bounds than previous works.

The margin distribution (specifically, boosting of margins across the training set) has been shown to correspond to generalization properties in the literature on linear models [Schapire et al., 1998]: they used this connection to explain the effectiveness of boosting and bagging techniques. [Reyzin and Schapire, 2006] showed that it is important to control the complexity of a classifier when measuring margin, which calls for some type of normalization. In the linear case (SVM), the margin is naturally defined as a function of the norm of the weights [Vapnik, 1995]. In the case of deep networks, the true margin is intractable. Recent work [Elsayed et al., 2018] proposed a linearization to approximate the margin, and defined the margin at any layer of the network. [Sokolic et al., 2016] provide another approximation to the margin based on the norm of the Jacobian with respect to the input layer. They show that maximizing their approximations to the margin leads to improved generalization. However, their analysis was restricted to the margin at the input layer.

[Poggio et al., 2017] and [Liao et al., 2018] propose a normalized cross-entropy measure that correlates well with test accuracy. Their proposed normalized loss trades off confidence of predictions with stability, which leads to better correlation with test accuracy, though at the cost of a significantly lower output margin.

3 Prediction of Generalization Gap

In this section, we introduce our margin-based measure. We first explain the construction scheme for obtaining the margin distribution. We then squeeze the distributional information of the margin into a small number of statistics. Finally, we regress these statistics to the value of the generalization gap. We assess prediction quality by applying the learned regression coefficients to predict the generalization gap of unseen models.

3.1 Margin Approximation

First, we establish some notation. Consider a classification problem with $n$ classes. We assume a classifier $f$ consists of non-linear functions $f_i: \mathcal{X} \to \mathbb{R}$, for $i = 1, \dots, n$, that generate a prediction score for classifying the input vector $x \in \mathcal{X}$ to class $i$. The predicted label is decided by the class with maximal score, i.e. $i^* = \arg\max_i f_i(x)$. Define the decision boundary for each class pair $(i, j)$ as:

$$\mathcal{D}_{(i,j)} \triangleq \{\, x \mid f_i(x) = f_j(x) \,\} \qquad (1)$$

Under this definition, the distance of a point $x$ to the decision boundary $\mathcal{D}_{(i,j)}$ can be expressed as the smallest displacement of the point that results in a score tie:

$$d_{f,(i,j)}(x) \triangleq \min_{\delta} \|\delta\|_2 \quad \text{s.t.} \quad f_i(x + \delta) = f_j(x + \delta) \qquad (2)$$

Unlike an SVM, computing the “exact” distance of a point to the decision boundary (Eq. 2) for a deep network is intractable. (This is because computing the distance of a point to a nonlinear surface is intractable. This is different from an SVM, where the surface is linear and the distance of a point to a hyperplane admits a closed-form expression.) In this work, we adopt the approximation scheme from [Elsayed et al., 2018] to capture the distance of a point to the decision boundary; it is a first-order Taylor approximation to the true distance of Eq. 2. Formally, given an input $x$ to a network, denote its representation at the $l$-th layer (the layer activation vector) by $x^l$. For the input layer, let $l = 0$ and thus $x^0 = x$. Then the distance of the representation vector $x^l$ to the decision boundary for class pair $(i, j)$ is given by the following approximation:

$$d_{f,(i,j)}(x^l) = \frac{f_i(x^l) - f_j(x^l)}{\|\nabla_{x^l} f_i(x^l) - \nabla_{x^l} f_j(x^l)\|_2} \qquad (3)$$

Here $f_i(x^l)$ represents the output (logit) of the network for class $i$ given the representation $x^l$. Note that this distance can be positive or negative, denoting whether the training sample is on the “correct” or “wrong” side of the decision boundary respectively. The training data induces a distribution of distances at each layer $l$ which, following earlier naming convention [Garg et al., 2002, Langford and Shawe-Taylor, 2002], we refer to as the margin distribution (at layer $l$). For the margin distribution, we only consider distances with positive sign (we ignore all misclassified training points).
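To make the approximation concrete, the following is a minimal NumPy sketch of Eq. 3 applied at the input layer of a hypothetical two-layer ReLU classifier; the architecture, sizes, and finite-difference gradients are illustrative assumptions, not the paper's implementation.

```python
# A minimal NumPy sketch of the first-order margin approximation in Eq. 3.
# The toy network and the finite-difference gradient are illustrative only.
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(64, 32)) * 0.1, np.zeros(64)   # hidden layer
W2, b2 = rng.normal(size=(10, 64)) * 0.1, np.zeros(10)   # output layer (10 classes)

def logits(x):
    h = np.maximum(W1 @ x + b1, 0.0)   # ReLU hidden representation
    return W2 @ h + b2

def grad_logit(x, c, eps=1e-4):
    """Numerical gradient of logit c with respect to the representation x."""
    g = np.zeros_like(x)
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = eps
        g[k] = (logits(x + e)[c] - logits(x - e)[c]) / (2 * eps)
    return g

def margin_distance(x, i, j):
    """Approximate signed distance of x to the (i, j) decision boundary (Eq. 3)."""
    f = logits(x)
    g = grad_logit(x, i) - grad_logit(x, j)
    return (f[i] - f[j]) / (np.linalg.norm(g) + 1e-12)

x = rng.normal(size=32)
i = int(np.argmax(logits(x)))        # e.g. the true class of a correctly classified point
j = int(np.argsort(logits(x))[-2])   # the highest-scoring other class
print(margin_distance(x, i, j))      # positive: the point lies on class i's side
```

For a hidden layer, the same computation would be applied to the layer's activation vector instead of the raw input.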

A problem with plain distances and their associated distribution is that they can be trivially boosted without any significant change in the way the classifier separates the classes. For example, consider multiplying the weights at a layer by a constant and dividing the weights in the following layer by the same constant. In a ReLU network, due to the positive homogeneity property [Liao et al., 2018], this operation does not affect how the network classifies a point, but it changes the distances to the decision boundary. (For example, suppose the constant is greater than one. Then multiplying the weights of a layer by that constant magnifies the distances computed at that layer by the same factor.)

To offset the scaling effect, we normalize the margin distribution. Consider the margin distribution at some layer $l$, and let $x_k^l$ be the representation vector of training sample $x_k$ at that layer. We compute the variance of each coordinate of $\{x_k^l\}$ separately, and then sum these individual variances. This quantity is called the total variation of $x^l$. The square root of this quantity relates to the scale of the distribution: if $x^l$ is scaled by a factor, so is the square root of the total variation. Thus, by dividing distances by the square root of the total variation, we can construct a margin distribution that is invariant to scaling. More concretely, the total variation is computed as:

$$\nu(x^l) = \mathrm{tr}\Big(\frac{1}{m}\sum_{k=1}^{m}\big(x_k^l - \bar{x}^l\big)\big(x_k^l - \bar{x}^l\big)^\top\Big), \qquad \bar{x}^l = \frac{1}{m}\sum_{k=1}^{m} x_k^l \qquad (4)$$

i.e. the trace of the empirical covariance matrix of activations, where $m$ is the number of training samples. Using the total variation, the normalized margin is specified by:

$$\hat{d}_{f,(i,j)}(x^l) = \frac{d_{f,(i,j)}(x^l)}{\sqrt{\nu(x^l)}} \qquad (5)$$

While this quantity is relatively primitive and easy to compute, Fig. 1 (top) shows that the normalized-margin distributions based on Eq. 5 have the desirable effect of becoming heavier tailed and shifting to the right (increasing margin) as the generalization gap decreases. We find that this effect holds across a range of networks trained with different hyper-parameters.
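As a rough illustration of Eqs. 4-5, the sketch below normalizes a set of margin distances by the square root of the total variation of hypothetical layer activations; the array names and data are illustrative.

```python
# A minimal sketch of the normalization in Eqs. 4-5, assuming `acts` holds the
# layer-l representations of the training points (one row per point) and `dists`
# their unnormalized margin distances from Eq. 3.
import numpy as np

def normalized_margins(dists, acts):
    # Total variation (Eq. 4): sum of per-coordinate variances, i.e. the trace
    # of the empirical covariance matrix of the activations.
    total_variation = np.var(acts, axis=0).sum()
    # Dividing by its square root makes the distribution scale invariant (Eq. 5).
    return dists / np.sqrt(total_variation + 1e-12)

acts = np.random.randn(1000, 64)        # hypothetical layer-l activations
dists = np.abs(np.random.randn(1000))   # hypothetical (positive) margin distances
norm_margins = normalized_margins(dists, acts)
```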

3.2 Summarizing the Margin Distribution

Instead of working directly with the (normalized) margin distribution, it is easier to analyze a compact signature of it. The moments of a distribution are a natural criterion for this purpose: perhaps the most standard approach is to compute the empirical moments from the samples and then take the $n$-th root of the $n$-th moment. In our experiments, we used the first five moments. However, it is a well-known phenomenon that the estimation of higher-order moments from samples can be unreliable. Therefore, we also consider an alternative way to construct the distribution's signature. Given the set of normalized distances $\{d_k\}$ that constitutes the margin distribution, we use the median $Q_2$, the first quartile $Q_1$, and the third quartile $Q_3$ of the normalized margin distribution, along with the two fences that indicate variability outside the upper and lower quartiles. There are many variations for fences, but in this work, with $\mathrm{IQR} = Q_3 - Q_1$, we define the upper fence as $\max\{d_k : d_k \le Q_3 + 1.5\,\mathrm{IQR}\}$ and the lower fence as $\min\{d_k : d_k \ge Q_1 - 1.5\,\mathrm{IQR}\}$ [McGill et al., 1978]. These five statistics form the quartile description that summarizes the normalized margin distribution at a specific layer, as shown in the box plots of Fig. 1. We will later see that both signature representations are able to predict the generalization gap, with the second signature working slightly better.

A number of prior works such as [Bartlett et al., 2017], [Neyshabur et al., 2017a], [Liu et al., 2016], [Sun et al., 2015], [Sokolic et al., 2016], and [Liang et al., 2017] have focused on analyzing or maximizing the margin at either the input or the output layer of a deep network. Since a deep network has many hidden layers with evolving representations, it is not immediately clear which of the layer margins is of importance for improving generalization. Our experiments reveal that margin distributions from all of the layers of the network contribute to the prediction of the generalization gap. This is also clear from Fig. 1 (top): comparing the input layer (layer 0) margin distributions between the left and right plots, the input layer distribution shifts slightly left, but the other layer distributions shift the other way. For example, if we use the quartile signature, we have $5L$ components in this vector, where $L$ is the total number of layers in the network. We incorporate dependence on all layers simply by concatenating the margin signatures of all layers into a single combined vector that we refer to as the total signature.
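The following sketch shows one way the quartile signature and its concatenation into a total signature could be computed; the fences follow the Tukey box-plot convention cited above, and the data is synthetic and purely illustrative.

```python
# A sketch of the quartile signature of Sec. 3.2, assuming each entry of
# `margins` holds the normalized margin distribution of one chosen layer.
import numpy as np

def quartile_signature(d):
    q1, q2, q3 = np.percentile(d, [25, 50, 75])
    iqr = q3 - q1
    upper_fence = d[d <= q3 + 1.5 * iqr].max()   # largest point within 1.5*IQR above Q3
    lower_fence = d[d >= q1 - 1.5 * iqr].min()   # smallest point within 1.5*IQR below Q1
    return np.array([lower_fence, q1, q2, q3, upper_fence])

def total_signature(margins_per_layer):
    # Concatenate the 5 statistics of each chosen layer into one feature vector
    # (5L entries for L layers; 20 for the 4 layers used in the experiments).
    return np.concatenate([quartile_signature(d) for d in margins_per_layer])

margins = [np.abs(np.random.randn(1000)) for _ in range(4)]  # 4 hypothetical layers
theta = total_signature(margins)                             # 20-dimensional total signature
```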

3.3 Evaluation Metrics

Our goal is to predict the generalization gap, i.e. the difference between training and test accuracy at the end of training, based on the total signature $\boldsymbol{\theta}$ of a trained model. We use the simplest prediction model, i.e. a linear form $\hat{g} = \mathbf{a}^\top \phi(\boldsymbol{\theta}) + b$, where $\mathbf{a}$ and $b$ are parameters of the predictor, and $\phi$ is a function applied element-wise to $\boldsymbol{\theta}$. Specifically, we will explore two choices of $\phi$: the identity $\phi(x) = x$ and the entry-wise $\log$ transform $\phi(x) = \log(x)$, which correspond to additive and multiplicative combinations of margin statistics respectively.

In order to estimate the predictor parameters $(\mathbf{a}, b)$, we generate a pool of pretrained models (covering different datasets, architectures, regularization schemes, etc., as explained in Sec. 4), each of which gives one instance of the pair $(\boldsymbol{\theta}_j, g_j)$, $g_j$ being the generalization gap for that model. We then find $(\mathbf{a}, b)$ by minimizing the mean squared error $\sum_j \big(\hat{g}(\boldsymbol{\theta}_j) - g_j\big)^2$, where $j$ indexes the models in the pool. The next step is to assess the prediction quality. We consider two metrics for this.
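A minimal least-squares sketch of this fitting step is given below, assuming `signatures` is an (n_models, 20) array of total signatures and `gaps` the corresponding generalization gaps; this is an illustration under those assumptions, not the authors' implementation.

```python
# Fit the linear predictor g_hat = a^T phi(theta) + b by ordinary least squares.
import numpy as np

def fit_predictor(signatures, gaps, log_transform=True):
    # Margin statistics are positive, so the element-wise log is well defined.
    phi = np.log(signatures) if log_transform else signatures
    X = np.column_stack([phi, np.ones(len(phi))])      # append a bias column
    coef, *_ = np.linalg.lstsq(X, gaps, rcond=None)    # minimizes the squared error
    return coef[:-1], coef[-1]                         # (a, b)

def predict_gap(signatures, a, b, log_transform=True):
    phi = np.log(signatures) if log_transform else signatures
    return phi @ a + b
```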

The first metric examines the quality of predictions on unseen models. For that, we consider a held-out pool of models, different from those used to estimate $(\mathbf{a}, b)$, and compute the predicted gap on them via $\hat{g} = \mathbf{a}^\top \phi(\boldsymbol{\theta}) + b$. In order to quantify the discrepancy between the predicted gap and the ground-truth gap, we use the notion of the coefficient of determination ($R^2$) [Glantz et al., 1990]:

$$R^2 = 1 - \frac{\sum_j \big(\hat{g}_j - g_j\big)^2}{\sum_j \big(g_j - \bar{g}\big)^2} \qquad (6)$$

$R^2$ measures what fraction of the data variance can be explained by the linear model (a simple manipulation shows that $1 - R^2$ is the prediction residual normalized by the data variance, so $R^2$ can be interpreted as a scale-invariant alternative to the residual). It ranges from 0 to 1 on training points but can be outside that range on unseen points. To be precise, we use k-fold validation to study how the predictor performs on a held-out pool of trained deep networks. We use a 90/10 split, fit the linear model on the training pool, and measure $R^2$ on the held-out pool. The performance is averaged over the 10 splits. Since $R^2$ is now not measured on the training pool, it does not suffer from high data dimension and can be negative. In all of our experiments, we use $k = 10$. We provide a subset of residual plots and the corresponding univariate F-tests for the experiments in the appendix (Sec. 7). The F-score also indicates how important each individual variable is.
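The k-fold evaluation described above could be sketched as follows, reusing the hypothetical fit_predictor and predict_gap helpers from the earlier sketch; the fold construction is an assumption for illustration.

```python
# k-fold evaluation: fit on 90% of the model pool, measure R^2 on the held-out 10%.
import numpy as np

def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot        # Eq. 6; can be negative on held-out models

def kfold_r2(signatures, gaps, k=10, seed=0):
    idx = np.random.default_rng(seed).permutation(len(gaps))
    scores = []
    for held_out in np.array_split(idx, k):
        train = np.setdiff1d(idx, held_out)
        a, b = fit_predictor(signatures[train], gaps[train])
        preds = predict_gap(signatures[held_out], a, b)
        scores.append(r_squared(gaps[held_out], preds))
    return float(np.mean(scores))
```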

The second metric examines how well the model fits the provided training pool; it does not require a test pool. To characterize this, we use the adjusted $\bar{R}^2$ [Glantz et al., 1990], defined as:

$$\bar{R}^2 = 1 - \big(1 - R^2\big)\,\frac{m - 1}{m - p - 1} \qquad (7)$$

where $m$ is the number of models in the pool and $p$ is the number of features (the dimension of the total signature). The adjusted $\bar{R}^2$ can be negative when the data is non-linear. Note that $\bar{R}^2$ is always smaller than $R^2$. Intuitively, $\bar{R}^2$ penalizes the model if the number of features is high relative to the available data points. The closer $\bar{R}^2$ is to 1, the better the model fits. Using $\bar{R}^2$ is a simple yet effective method to test the fitness of a linear model and is independent of the scale of the target, making it a more illustrative metric than residuals.
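For concreteness, a small helper for Eq. 7 (the standard adjusted R-squared) might look like this; the pool size and values in the example call are illustrative only.

```python
# Adjusted R^2 (Eq. 7): penalize plain R^2 by the number of features p relative
# to the number of fitted models n.
def adjusted_r2(r2, n, p):
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)

print(adjusted_r2(0.95, n=200, p=20))   # hypothetical pool of 200 models, 20 features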

4 Experiments

We tested our measure of generalization gap, along with baseline measures, on a number of deep networks and architectures: nine-layer convolutional networks on CIFAR-10 (10 layers counting the input layer), and 32-layer residual networks on both the CIFAR-10 and CIFAR-100 datasets.

4.1 Convolutional Neural Networks on CIFAR-10

Using the CIFAR-10 dataset, we train nine-layer convolutional networks with different settings of hyperparameters and training techniques. We apply weight decay and dropout with different strengths; we use networks with and without batch norm and data augmentation; we change the number of hidden units in the hidden layers. Finally, we also include training with and without corrupted labels, as introduced in [Zhang et al., 2016]; we use a fixed amount of corruption of the true labels. The accuracy on the test set ranges from to and the generalization gap ranges from to . In standard settings, creating neural network models with a small generalization gap is difficult; in order to create sufficiently diverse generalization behaviors, we limit some models' capacities with large weight regularization, which decreases the generalization gap by lowering the training accuracy. All networks are trained by SGD with momentum. Further details are provided in the supplementary material (Sec. 6).

For each trained network, we compute the signature of the normalized margin distribution (see Sec. 3). Empirically, we found that constructing this signature on only four evenly-spaced layers (the input layer and three hidden layers) leads to good predictors. This results in a 20-dimensional signature vector. We estimate the parameters of the linear predictor with the log transform $\phi(x) = \log(x)$ using the 20-dimensional signature vector. Fig. 2 (left) shows the resulting scatter plot of the predicted generalization gap and the true generalization gap. As can be seen, the fit is very close to linear across the range of generalization gaps, and this is also supported by the adjusted $\bar{R}^2$ of the model, which is 0.94 (the maximum possible value is 1; see Table 1).

As a first baseline method, we compare against the work of [Bartlett et al., 2017], which provides one of the best generalization bounds currently known for deep networks. This work also constructs a margin distribution for the network, but in a different way. To make a fair comparison, we extract the same signature from their margin distribution. Since their margin distribution can only be defined at the output layer, their signature is 5-dimensional for any network. The resulting fit is shown in Fig. 2 (right). It is clearly a poorer fit than that of our signature, with a significantly lower adjusted $\bar{R}^2$ of 0.73 (Table 1, spectral+log).

For a fairer comparison, we also reduced our signature from 20 dimensions to the best-performing 4 dimensions (one dimension fewer than used for Bartlett's) by dropping 16 components. This is shown in Fig. 2 (middle) and has an adjusted $\bar{R}^2$ of 0.89, which is poorer than that of our complete signature but still significantly higher than that of [Bartlett et al., 2017]. In addition, we considered two other baseline comparisons: [Sokolic et al., 2016], where the margin at the input is defined as a function of the Jacobian of the output (logits) with respect to the input; and [Elsayed et al., 2018], where a linearized approximation to the margin is derived (for the same layers where we use our normalized margin approximation).

Norm. Margin 20D | Norm. Margin 4D | Bartlett Margin 5D
Figure 2: (Best seen as PDF) Regression models to predict generalization gap. Left: regression model fit in log space for the full 20-dimensional feature space (adjusted $\bar{R}^2$ = 0.94); Middle: fit for a subset of only 4 features, each from one of the hidden layers (adjusted $\bar{R}^2$ = 0.89); Right: fit for the 5 features extracted from the normalized margin distribution as used in [Bartlett et al., 2017] (adjusted $\bar{R}^2$ = 0.73).

To quantify the effect of the normalization, different layers, feature transformation etc., we conduct a number of ablation experiments with the following configurations (a small sketch of how these feature subsets can be selected follows the list):

  1. linear/log: Use the signature transform $\phi(x) = x$ or $\phi(x) = \log(x)$;
  2. single layer: Use the signature from the best single layer (5 features);
  3. single feat: Use only the best single statistic from the total signature for all the layers (4 features, one per layer);
  4. moment: Use the first 5 moments of the normalized margin distribution as the signature instead of the quartile statistics (Sec. 3);
  5. spectral: Use the signature of spectrally normalized margins from [Bartlett et al., 2017] (5 features);
  6. quartile: Use all the quartile statistics as the total signature (Sec. 3, 20 features);
  7. best4: Use the best 4 statistics from the total signature (4 features);
  8. Jacobian: Use the Jacobian-based margin defined in Eq. (39) of [Sokolic et al., 2016];
  9. LM: Use the large-margin loss from [Elsayed et al., 2018] at the same four layers where our statistics are measured.
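As referenced above, the sketch below illustrates how most of these ablation variants amount to selecting column subsets of the total-signature matrix and choosing an element-wise transform; the specific "best" layer and statistic indices are hypothetical placeholders, not the ones found to perform best in the paper.

```python
# Select ablation feature subsets from an (n_models, 20) total-signature matrix.
import numpy as np

def ablation_features(signatures, variant, log_transform=True):
    phi = np.log(signatures) if log_transform else signatures
    cols = np.arange(20).reshape(4, 5)       # 4 layers x 5 quartile statistics
    if variant == "quartile":                # all statistics at all layers
        sel = cols.ravel()
    elif variant == "single layer":          # all 5 statistics of one layer
        sel = cols[1]                        # hypothetical "best" layer
    elif variant == "single feat":           # one statistic across the 4 layers
        sel = cols[:, 2]                     # hypothetical "best" statistic (the median)
    elif variant == "best4":                 # the best-performing 4 individual entries
        sel = np.array([2, 7, 12, 17])       # hypothetical indices
    else:
        raise ValueError(variant)
    return phi[:, sel]
```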

Experiment Settings | CNN+CIFAR10 (Adj. / kfold) | ResNet+CIFAR10 (Adj. / kfold) | ResNet+CIFAR100 (Adj. / kfold)
quartile+log        | 0.94 / 0.90 | 0.87 / 0.81 | 0.97 / 0.96
quartile+linear     | 0.88 / 0.84 | 0.82 / 0.72 | 0.91 / 0.87
single feat+log     | 0.86 / 0.83 | 0.44 / 0.22 | 0.80 / 0.78
single layer+log    | 0.73 / 0.67 | 0.53 / 0.39 | 0.95 / 0.94
moment+log          | 0.93 / 0.87 | 0.83 / 0.74 | 0.80 / 0.78
best4+log           | 0.89 / 0.87 | 0.54 / 0.43 | 0.93 / 0.92
spectral+log        | 0.73 / 0.70 | - / -       | - / -
Jacobian+log        | 0.42 / Negative | 0.20 / Negative | 0.47 / Negative
LM+linear           | 0.35 / Negative | 0.68 / Negative | 0.74 / Negative
Table 1: Ablation experiments on all networks considering a number of different scenarios (see text for details). The last rows are baselines from other works: [Bartlett et al., 2017, Sokolic et al., 2016, Elsayed et al., 2018].

In Table 1, we list the adjusted $\bar{R}^2$ and k-fold $R^2$ from fitting models based on each of these scenarios. We see that both the quartile and moment signatures perform similarly, lending support to our thesis that the margin distribution, rather than the smallest or largest margin, is of importance in the context of generalization.

4.2 Residual Networks on CIFAR-10

On the CIFAR-10 dataset, we train convolutional networks with residual connections; these networks are 32 layers deep with the standard ResNet-32 topology [He et al., 2016]. Since it is difficult to train ResNet without activation normalization, we create variation in the generalization gap by training with batch normalization [Ioffe and Szegedy, 2015] and group normalization [Wu and He, 2018]. We further use different initial learning rates. The accuracy on the test set ranges from to and the generalization gap from to . The residual networks are much deeper, so we only chose four layers for feature-length compatibility with the shallower convolutional networks. This design choice also facilitates ease of analysis and circumvents the dependency on the depth of the models. Table 1 shows the adjusted $\bar{R}^2$ and k-fold $R^2$.

Note that in the presence of residual connections that use convolutions instead of the identity, and of identity blocks that span more than one convolutional layer, it is not immediately clear how to properly apply the bounds of [Bartlett et al., 2017] (third-from-last row) without morphing the topology of the architecture and careful design of reference matrices. As such, we omit them for ResNet. Fig. 3 (left) shows the fit for the ResNet models, with an adjusted $\bar{R}^2$ of 0.87 (Table 1). Fig. 3 (middle) and Fig. 3 (right) compare the log normalized density plots of a CIFAR-10 ResNet and a CIFAR-10 CNN. The plots show that the ResNet achieves a better margin distribution, correlated with greater test accuracy, even though it was trained without data augmentation.

Norm. Margin 20D | Log density (ResNet-32) | Log density (CNN)
Figure 3: (Best seen as PDF) Left: Regression model fit in log space for the full 20-dimensional feature space for residual networks on CIFAR-10 (adjusted $\bar{R}^2$ = 0.87); Middle: Log density plot of normalized margins of a particular residual network that reaches a higher test accuracy without data augmentation; Right: Log density plot of normalized margins of a CNN trained with data augmentation. We see that the ResNet achieves larger margins, especially at the hidden layers, and this is reflected in its higher test accuracy.

4.3 ResNet on CIFAR-100

On the CIFAR-100 dataset, we trained ResNet-32 with the same variation in hyperparameter settings as for the CIFAR-10 networks, with one additional initial learning rate. The accuracy on the test set ranges from to and the generalization gap ranges from to . Table 1 shows the adjusted $\bar{R}^2$ and k-fold $R^2$ for a number of ablation experiments and the full feature set. Fig. 4 (left) shows the fit of predicted and true generalization gaps over these networks (adjusted $\bar{R}^2$ = 0.97; Table 1). Fig. 4 (middle) and Fig. 4 (right) compare a CIFAR-100 residual network and a CIFAR-10 residual network with the same architecture and hyperparameters. Under these settings, the CIFAR-100 network achieves a lower test accuracy than the CIFAR-10 network. The resulting normalized margin density plots clearly reflect the better generalization achieved on CIFAR-10: the densities at all layers are wider and shifted to the right. Thus, the normalized margin distributions reflect the relative “difficulty” of a particular dataset for a given architecture.

Norm. Margin 20D | Density (CIFAR-100) | Density (CIFAR-10)
Figure 4: (Best seen as PDF) Left: Regression model fit in log space for the full 20-dimensional feature space for residual networks on CIFAR-100 (adjusted $\bar{R}^2$ = 0.97); Middle: density plot of normalized margins of a particular residual network trained on CIFAR-100; Right: density plot of normalized margins of a residual network with the same architecture trained on CIFAR-10, which reaches a higher test accuracy.

5 Discussion

We have presented a predictor for the generalization gap based on the margin distribution in deep networks and conducted extensive experiments to assess it. Our results show that our scheme achieves a high adjusted coefficient of determination (a linear regression predicts the generalization gap accurately). Specifically, the predictor uses the normalized margin distribution across multiple layers of the network. The best predictor uses quartiles of the distribution combined in a multiplicative way (additive in the log transform). Compared to the strong baseline of the spectral-complexity-normalized output margin [Bartlett et al., 2017], our scheme exhibits much higher predictive power and can be applied to any feedforward network (including ResNets, unlike generalization bounds such as [Bartlett et al., 2017, Neyshabur et al., 2017a, Arora et al., 2018]). Our findings could be a stepping stone for studying new loss functions with better generalization properties. We leave some final thoughts in Appendix Sec. 8.

Acknowledgments

We are thankful to Gamaleldin Elsayed (Google), Tomer Koren (Google), Sergey Ioffe (Google), Vighnesh Birodkar (Google), Shraman Ray Chaudhuri (Google), Kevin Regan (Google), Behnam Neyshabur (NYU), and Dylan Foster (Cornell) for discussions and helpful comments.

References

6 Appendix: Experimental Details

6.1 CNN + CIFAR-10

We use an architecture very similar to Network in Network [Lin et al., 2013], but we remove all dropout and max pooling from the network.

Layer Index | Layer Type             | Output Shape
0           | Input                  |
1           | convolution + stride 2 |
2           | convolution + stride 1 |
3           | convolution + stride 1 |
4           | convolution + stride 2 |
5           | convolution + stride 1 |
6           | convolution + stride 1 |
7           | convolution + stride 2 |
8           | convolution + stride 1 |
9           | convolution + stride 1 |
10          | convolution + stride 1 |
Table 2: Architecture of the base CNN model.

To create a generalization gap in this model, we make the following modifications to the base architecture:

  1. Use channel sizes of 192, 288, and 384 to create different widths

  2. Train with and without batch norm at all convolutional layers

  3. Apply dropout with different rates at layers 3 and 6

  4. Apply weight regularization with different strengths

  5. Train with and without data augmentation (random cropping, flipping and shifting)

  6. Train each configuration twice

In total, this gives us a pool of different network configurations. The models are trained with SGD with momentum at a minibatch size of 128 and an initial learning rate of 0.01. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.
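For illustration, the hyperparameter sweep described above could be enumerated as follows; the widths, batch-norm choices, augmentation choices, and repeat count come from the list above, while the dropout and weight-decay grids are placeholders for values not listed here.

```python
# A hypothetical enumeration of the CNN variants described above.
import itertools

widths = [192, 288, 384]
batch_norm = [True, False]
dropout_rates = [0.0, 0.5]          # placeholder grid
weight_decay = [0.0, 5e-4]          # placeholder grid
augmentation = [True, False]
repeat = [1, 2]

configs = list(itertools.product(widths, batch_norm, dropout_rates,
                                 weight_decay, augmentation, repeat))
print(len(configs))                 # size of the model pool under these placeholder grids
```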

6.2 ResNet 32 + CIFAR-10

For these experiments, we use the standard ResNet-32 architecture. We consider each downsampling operation as the marker of a stage, so there are in total 3 stages in the ResNet-32 architecture. To create a generalization gap in this model, we make the following modifications to the architecture:

  1. Use network widths that are wider in the number of channels

  2. Train with batch norm or group norm [Wu and He, 2018]

  3. Train with different initial learning rates

  4. Apply weight regularization with different strengths

  5. Train with and without data augmentation (random cropping, flipping and shifting)

  6. Train each configuration 3 times

In total, this gives us a pool of different network configurations. The models are trained with SGD with momentum at a minibatch size of 128. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.

6.3 ResNet 32 + CIFAR-100

For these experiments, we use the standard ResNet-32 architecture. We consider each downsampling operation as the marker of a stage, so there are in total 3 stages in the ResNet-32 architecture. To create a generalization gap in this model, we make the following modifications to the architecture:

  1. Use network widths that are wider in the number of channels

  2. Train with batch norm or group norm [Wu and He, 2018]

  3. Train with different initial learning rates

  4. Apply weight regularization with different strengths

  5. Train with and without data augmentation (random cropping, flipping and shifting)

  6. Train each configuration 3 times

In total, this gives us a pool of different network configurations. The models are trained with SGD with momentum at a minibatch size of 128. All networks are trained for 380 epochs with learning rate decay at intervals of 100 epochs.

7 Appendix: Further Analysis of Regression

7.1 CNN + CIFAR-10 + All Quartile Signature

Figure 5: Residual plots for all explanatory variables; rows: h0, h1, h2, h3; columns: lower fence, Q1, Q2, Q3, upper fence. The lower fence is clipped because distances cannot be smaller than 0. The residuals are fairly evenly distributed around 0.
lower fence  Q1  Q2  Q3  upper fence
h0 306.40 114.41 39.56 12.54 5.07
h1 286.53 9.42 5.16 17.29 38.57
h2 259.68 6.95 77.03 110.40 152.20
h3 188.59 10.29 49.76 83.40 143.23
lower fence  Q1  Q2  Q3  upper fence
h0 3.59e-43 1.13e-21 1.76e-9 4.87e-4 2.52e-2
h1 2.34e-41 2.41e-3 2.40e-2 4.64e-5 2.70e-09
h2 8.76e-39 8.95e-3 5.38e-16 4.30e-21 9.12e-27
h3 3.40e-31 1.54e-3 2.37e-11 5.17e-17 1.31e-25
Table 3: F scores (top) and p-values (bottom) for all 20 variables. Using a significance level of 0.05, the null hypotheses are rejected for every variable.

7.2 ResNet 32 + CIFAR-10 + All Quartile Signature

Figure 6: Residual plots for all explanatory variables; rows: h0, h1, h2, h3; columns: lower fence, Q1, Q2, Q3, upper fence. The lower fence is clipped because distances cannot be smaller than 0. The residuals are less evenly distributed than in the other two settings; this is reflected in the cluster along the x axis and in the lower k-fold R². We speculate that this is because the trained models do not have diverse enough generalization gaps to cover the entire space, unlike in the other two settings.
lower fence  Q1  Q2  Q3  upper fence
h0 45.67 16.67 6.97 1.71 0.68
h1 58.84 88.14 44.15 15.59 9.36
h2 60.20 78.57 35.76 12.89 7.52
h3 59.75 0.27 1.192 7.37 44.22
lower fence  Q1  Q2  Q3  upper fence
h0 1.30e-10 6.25e-5 8.88e-3 0.192 0.40
h1 5.94e-13 9.33e-18 2.47e-10 1.06e-4 2.49e-3
h2 3.45e-13 3.04e-16 9.21e-9 4.07e-4 6.59e-3
h3 4.14e-13 0.60 0.27 7.14e-3 2.4e-10
Table 4: F scores (top) and p-values (bottom) for all 20 variables. Using a significance level of 0.05, we see that the null hypotheses are not rejected for 4 of the variables. We believe that a study with more diverse generalization behavior would solve this problem.

7.3 ResNet 32 + CIFAR-100 + All Quartile Signature

Figure 7: Residual plots for all explanatory variables; rows: h0, h1, h2, h3; columns: lower fence, Q1, Q2, Q3, upper fence. The lower fence is clipped because distances cannot be smaller than 0. The residuals are fairly evenly distributed around 0. There is one outlier in this experimental setting, as shown in the plots.
lower fence  Q1  Q2  Q3  upper fence
h0 80.12 8.40 59.62 141.56 248.77
h1 65.24 109.86 343.57 700.91 1124.43
h2 99.06 15.47 122.36 305.88 512.69
h3 244.07 128.45 65.58 28.10 2.34
lower fence  Q1  Q2  Q3  upper fence
h0 2.85e-17 4.00e-3 1.46e-13 2.65e-27 6.32e-42
h1 1.34e-14 2.60e-22 1.04e-52 8.12e-83 4.55e-107
h2 1.59e-20 1.03e-4 2.53e-24 1.29e-48 1.42e-68
h3 2.40e-41 2.78e-25 1.16e-14 2.13e-7 0.127
Table 5: F scores (top) and p-values (bottom) for all 20 variables. Using a significance level of 0.05, the null hypotheses are rejected for every variable except the h3 upper fence.

8 Appendix: Some Observations and Conjectures

Everything here uses the full quartile description.

8.1 Cross Architecture Comparison

We perform regression analysis with both the base CNN and ResNet-32 on CIFAR-10. The resulting adjusted R² and k-fold R² suggest that the same coefficients work generally well across architectures, provided they are trained on the same data. Somehow, the distributions at the 3 locations of the networks are comparable even though the depths are vastly different.

Figure 8: Scatter Plots

8.2 Cross Dataset Comparison

We perform regression analysis with ResNet-32 on both CIFAR-10 and CIFAR-100. The resulting adjusted R² and k-fold R² suggest that the same coefficients work generally well across datasets for the same architecture.

Figure 9: Scatter Plots

8.3 Cross Everything

We join all our experimental data and fit a single regression model. Judging by the resulting adjusted R² and k-fold R², it is perhaps surprising that a set of coefficients exists that works across both datasets and architectures.

Figure 10: Scatter Plots

8.4 Implications on Generalization Bounds

We believe that the method developed here can be used as a complement to existing generalization bounds; more sophisticated engineering of the predictor may be used to probe what functional form a generalization bound should take, up to constant factors or exponents, and it may be helpful for developing generalization bounds tighter than the existing ones.