Are All Layers Created Equal?

02/06/2019 · by Chiyuan Zhang, et al.

Understanding learning and generalization of deep architectures has been a major research objective in recent years, with notable theoretical progress. A main focal point of generalization studies stems from the success of excessively large networks, which defy the classical wisdom of uniform convergence and learnability. We study empirically the layer-wise functional structure of over-parameterized deep models and provide evidence for the heterogeneous characteristics of layers. To do so, we introduce the notion of (post-training) re-initialization and re-randomization robustness. We show that layers can be categorized as either "robust" or "critical". In contrast to critical layers, resetting the robust layers to their initial values has no negative consequence, and in many cases they barely change throughout training. Our study provides further evidence that mere parameter counting or norm accounting is too coarse for studying generalization of deep models.


1 Introduction

Deep neural networks have been remarkably successful in many real-world machine learning applications. A distilled understanding of these systems is at least as important as their state-of-the-art performance when they are applied in critical domains. Recent work on understanding why deep networks perform so well in practice has focused on questions such as the networks' behavior under drifting or even adversarially perturbed data distributions. Another direction relevant to this work is research on how to interpret or explain the decision function of trained networks. While related, this work takes a different angle: we focus on the role of the layers in trained networks and then relate the empirical results to generalization and robustness properties.

Theoretical research on the representation power of neural networks is well established. It is known that a neural network with a single, sufficiently wide hidden layer is a universal approximator for continuous functions over a compact domain (Gybenko, 1989; Hornik, 1991; Anthony and Bartlett, 2009). More recent research further examines whether deep networks can have representation power superior to shallow ones with the same number of units or edges (Pinkus, 1999; Delalleau and Bengio, 2011; Montufar et al., 2014; Telgarsky, 2016; Shaham et al., 2015; Eldan and Shamir, 2015; Mhaskar and Poggio, 2016; Rolnick and Tegmark, 2017). The capacity to represent arbitrary functions on finite samples is also extensively discussed in recent work (Hardt and Ma, 2017; Zhang et al., 2017; Nguyen and Hein, 2018; Yun et al., 2018). However, the constructions used in the aforementioned work for building networks that approximate particular functions are typically “artificial” and are unlikely to be obtained by gradient-based learning algorithms. We focus instead on empirically studying the roles that different layers of a deep architecture take on after gradient-based training.

Research on the generalization of deep neural networks has attracted a lot of interest. The observation that big neural networks can fit random labels on the training set (Zhang et al., 2017) makes it difficult to apply classical learning-theoretic results based on uniform convergence over the hypothesis space. One approach to getting around this issue is to show that, while the space of neural networks of a given architecture is very large, gradient-based learning on “well-behaved” tasks leads to relatively “simple” models. More recent research focuses on the analysis of post-training complexity metrics such as norm, margin, robustness, flatness, or compressibility of the learned model, in contrast to the pre-training capacity of the entire hypothesis space. This line of work has resulted in improved generalization bounds for deep neural networks; see for instance Dziugaite and Roy (2016); Kawaguchi et al. (2017); Bartlett et al. (2017); Neyshabur et al. (2018, 2017); Liang et al. (2017); Arora et al. (2018); Zhou et al. (2019) and the references therein. This paper provides further empirical evidence and alludes to potentially more fine-grained analyses.

In particular, we show empirically that the layers of a deep network are not homogeneous in the role they play in representing the prediction function. Some layers are critical to forming good predictions, while others are fairly robust to the values assigned to their parameters during training. Moreover, depending on the capacity of the network and the complexity of the target function, gradient-based training conserves complexity by not using excess capacity. The exact definitions and the implications for generalization are discussed in the body of the paper.

Before proceeding to the body of the paper, we would like to point to a few related papers. Modern neural networks are typically over-parameterized and thus have plenty of redundancy in their representation capabilities. Previous work exploited over-parameterization to compress (Han et al., 2015) or distill (Hinton et al., 2015) a trained network. Rosenfeld and Tsotsos (2018) found that comparable performance can be achieved by training only a small fraction of the network parameters, such as a subset of the channels in each convolutional layer. Interpreting residual networks as ensembles of shallow networks, Veit et al. (2016) found that residual blocks in a trained network can be deleted or permuted to some extent without hurting performance too much. In another line of research, it has been shown that under extreme over-parameterization, such as when the network width is polynomial in the training set size and input dimension (Allen-Zhu et al., 2018; Du et al., 2018a, b; Zou et al., 2018), or even in the asymptotic regime of infinite width (Lee et al., 2019), the network weights move only slowly during training. The observations in this paper show that in more practical regimes, different layers can behave very differently.

The rest of the paper is organized as follows: the experiment framework and our main notions of layerwise robustness are introduced in Section 2. Section 3 presents the results and analysis of layerwise robustness on a wide range of neural network models. Section 4 discusses the theoretical implications for generalization. Studies on joint robustness and connections to other notions of robustness are presented in Section 5 and Section 6, respectively. Finally, the paper ends with a conclusion that summarizes our main contributions.

2 Setting

Feed-forward networks naturally consist of multiple layers, where each unit in a layer takes inputs from units in the previous layer. It is common to form a layer using a linear transform (e.g. convolution), followed by some kind of normalization (e.g. batch normalization), followed by a unit-wise non-linear activation function (e.g. rectification, ReLU). We use the term layer in a more general sense to denote any layer-like computation block. In particular, a residual block in a residual network (He et al., 2016a) can also be treated as a layer.

Let $\mathcal{F}$ be the space of functions realizable by a particular neural network architecture with $D$ (parametric) layers. We use the term capacity to refer to properties of the entire space $\mathcal{F}$ before training takes place; it is usually measured by notions such as the Rademacher complexity, the VC dimension, and various types of covering numbers. The term complexity is used in reference to properties of a single neural network $f \in \mathcal{F}$, often employing a notion of norm of its parameters, possibly normalized by empirical quantities such as the margin.

We are interested in analyzing the post-training behavior of the layers of popular deep networks. Such networks are typically trained using stochastic gradient methods, which start by sampling random values for the parameters of each layer $d$ from a pre-defined distribution $P_d^0$. The choice of $P_d^0$ typically depends on the type, fan-in, and fan-out of each layer. During training, the parameters are iteratively updated via

$$ \theta^{\tau} \;=\; \theta^{\tau-1} - \eta^{\tau}\, g^{\tau} , \qquad\qquad (1) $$

where $g^{\tau}$ designates a stochastic estimate of the gradient of the loss on a sample of examples with respect to the parameters $\theta^{\tau-1}$. Variants that use momentum and pre-conditioners, as well as adaptive schedules for the learning rate $\eta^{\tau}$, are commonly used in practice but are not employed in this paper. After training for $T$ iterations, the parameters $\theta^T$ are used as the final trained model.

A deep network builds up the representation of its inputs by incrementally applying the nonlinear transformations defined by each layer. As a result, the representation at a particular layer recursively depends on all the layers beneath it. This complex dependency makes it challenging to isolate and inspect each layer independently in theoretical studies. In this paper, we introduce and use the following two empirical probes to inspect the individual layers of a trained neural network.

Re-initialization

After training, for a given layer $d$, we can re-initialize its parameters through the assignment $\theta_d^T \leftarrow \theta_d^0$, while keeping the parameters of the other layers unchanged. The model with these mixed parameters is then evaluated. Unless noted otherwise, we use the term performance to designate the classification error on test data. The performance of a network in which layer $d$ was re-initialized is referred to as the re-initialization robustness of layer $d$. Note that $\theta_d^0$ denotes the random values realized at the beginning of training. More generally, for a checkpoint saved at time step $\tau \in \{0, 1, \ldots, T\}$, we can re-initialize the $d$-th layer by setting $\theta_d^T \leftarrow \theta_d^{\tau}$, and obtain the re-initialization robustness of layer $d$ after $\tau$ updates.

Re-randomization

To go one step further, we also examine re-randomization of a layer $d$ by re-sampling random values $\tilde{\theta}_d \sim P_d^0$ and evaluating the model's performance with the parameters $(\theta_1^T, \ldots, \tilde{\theta}_d, \ldots, \theta_D^T)$. Analogously, we refer to the evaluated performance as the re-randomization robustness of layer $d$.

Note that there is no re-training or fine-tuning after re-initialization or re-randomization; the network is evaluated directly with the mixed weights. When a network exhibits no or only negligible decrease in performance after re-initializing or re-randomizing a layer, we say that the layer is robust; otherwise the layer is called critical.
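The two probes amount to a small amount of checkpoint surgery on a trained model. Below is a minimal sketch (an assumed PyTorch-style illustration, not the authors' code); checkpoints, init_fn, and evaluate are hypothetical helpers standing in for the saved checkpoint dictionary, the layer initializer, and the usual test-set evaluation loop.

```python
# Minimal sketch of the re-initialization and re-randomization probes,
# assuming a model whose layers are registered as named sub-modules and a
# dict `checkpoints` mapping a step tau to a saved state_dict (tau = 0 being
# the values right after initialization).
import copy

def reinitialize_layer(model, layer_name, checkpoint_state):
    """Reset layer `layer_name` to the values stored in `checkpoint_state`."""
    probed = copy.deepcopy(model)
    new_state = copy.deepcopy(probed.state_dict())
    for key, value in checkpoint_state.items():
        if key.startswith(layer_name + "."):
            new_state[key] = value.clone()
    probed.load_state_dict(new_state)
    return probed

def rerandomize_layer(model, layer_name, init_fn):
    """Re-sample layer `layer_name` from its initialization distribution."""
    probed = copy.deepcopy(model)
    for name, module in probed.named_modules():
        if name == layer_name:
            module.apply(init_fn)  # init_fn re-initializes weights in place
    return probed

# Usage (hypothetical helpers): the re-initialization robustness of layer3
# w.r.t. checkpoint 0 is evaluate(reinitialize_layer(model, "layer3", checkpoints[0])).
```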

3 Robustness of individual layers

In this section, we study the layer robustness of commonly used neural networks on standard image classification benchmarks, in particular MNIST, CIFAR10 and ImageNet. All networks are trained using SGD with Nesterov momentum and a piecewise constant learning rate schedule. Please refer to Appendix A for further details.

3.1 Fully connected networks

Figure 1: Robustness results for FCN on MNIST. (a) Test error rate: each row corresponds to one layer in the network. The last row shows the full-model performance at the corresponding epoch (i.e. all model parameters are loaded from that checkpoint) as a reference. The first column designates the robustness of each layer w.r.t. re-randomization, and the remaining columns designate re-initialization robustness at different checkpoints; the last column shows the final performance (at the last checkpoint of training) as a reference. (b-c) Weight distances: each cell in the heatmaps depicts the distance of a layer's trained parameters from its initial weights, under the two norms defined below.

We start by examining the robustness of fully connected networks (FCNs). An FCN consists of a stack of fully connected layers, each with the same output dimension and a ReLU activation function; an extra final layer is a linear multi-class predictor with one output per class (see Appendix A).
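As a concrete sketch (assumed, not the authors' code), an FCN of this form with three hidden layers of width 256, the configuration listed in Appendix A, can be written as follows.

```python
# Sketch of the FCN architecture described above: a stack of equal-width
# fully connected ReLU layers followed by a final linear classifier.
import torch.nn as nn

def make_fcn(num_hidden=3, width=256, in_dim=28 * 28, num_classes=10):
    layers = [nn.Flatten()]
    for _ in range(num_hidden):
        layers += [nn.Linear(in_dim, width), nn.ReLU()]
        in_dim = width
    layers.append(nn.Linear(width, num_classes))  # final multi-class predictor
    return nn.Sequential(*layers)
```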

As a starting point, we trained an FCN on the MNIST digit classification task and applied the re-initialization and re-randomization analysis to the trained model. The results are shown in Figure 1(a). As expected, due to the intricate dependency of the classification function on each of the layers, re-randomizing any layer completely disintegrates the representations, and classification accuracy drops to the level of random guessing. For re-initialization, however, we find that while the first layer is very sensitive, the remaining layers are robust to being reset to their pre-training random weights.

A plausible explanation for this phenomenon could be that gradient norms grow during back-propagation, to the point that the bottom layers are updated more aggressively than the top ones. Alas, if this were the case, we would expect a smoother transition rather than a sharp contrast at the first layer. We therefore measured how far the weights of each layer move from their initialization, “Checkpoint 0”, using both the normalized $\ell_2$ distance $\|\theta_d^T - \theta_d^0\|_2 / \|\theta_d^0\|_2$ and the $\ell_\infty$ distance $\|\theta_d^T - \theta_d^0\|_\infty$.
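The following is a small assumed sketch of how such per-layer distances can be computed from the checkpoint-0 and final parameters; the exact normalization used in the figures may differ from the one chosen here.

```python
# Per-layer distance between trained and initial parameters: a normalized
# L2 distance and an L-infinity distance, computed over all tensors whose
# state_dict keys belong to the given layer.
import torch

def layer_distances(trained_state, init_state, layer_name):
    diffs, inits = [], []
    for key in trained_state:
        if key.startswith(layer_name + "."):
            diffs.append((trained_state[key] - init_state[key]).flatten())
            inits.append(init_state[key].flatten())
    diff, init = torch.cat(diffs), torch.cat(inits)
    normalized_l2 = (diff.norm(p=2) / init.norm(p=2)).item()
    linf = diff.abs().max().item()
    return normalized_l2, linf
```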

(a) Test error
(b) Test loss
Figure 2: Layer-wise robustness studies of an FCN on MNIST. The figure uses the same layout as Figure 1(a). The two subfigures show robustness evaluated with the test error (the default) and with the test loss, respectively.

The results are shown in Figure 1(b) and (c), respectively. As we can see, robustness to re-initialization does not obviously correlate with either of the distances. Figure 2 shows the results for another FCN configuration, which demonstrates the same phenomenon. The figure also shows that the cross-entropy loss on the test set behaves similarly to the classification error. This suggests that something more intricate is going on than a simple exploding-gradient issue. We loosely summarize the observations as follows:


Over-capacitated deep networks trained with stochastic gradient methods have low complexity due to self-restricting the number of critical layers.

Intuitively, if a subset of parameters can be re-initialized to the random values at checkpoint 0 (which are independent of the training data), then the effective number of parameters, and as a result, the complexity of the model, can be reduced. We defer more detailed discussion on the theoretical implications to Section 4.

3.2 Adaptive complexity adjustment

We next applied the same analysis procedure to a large number of different configurations in order to assess the effects of network capacity and task complexity on layer robustness.

As the results in the previous section show, the first layer is rather sensitive to re-initialization while the remaining layers are quite robust. In Figure 3(a), we compare the average re-initialization robustness of all layers but the first for FCNs of varying hidden dimensions on MNIST. The upper layers clearly become more robust as the hidden dimension increases, which we believe reflects the higher model capacity of wider FCNs. When the capacity is small, all layers are vigilant participants in representing the prediction function. As capacity increases, it suffices to use the bottom layer while the rest act as random projections with non-linearities.

Similarly, Figure 3(b) shows experiments on CIFAR10, which has the same number of classes and a comparable number of training examples as MNIST, but is more difficult to classify. While it is hard to directly compare the robustness of the same model across two different tasks, we still observe similar trends as the hidden dimension increases, though they are less pronounced. Informally put, the difficulty of the learning task seems to necessitate more diligence from the layers in forming accurate predictions.


(a) MNIST
(b) CIFAR10
Figure 3: Re-initialization robustness (to Checkpoint 0) of all layers but the first, for FCNs with hidden layers of varying dimensions. Each bar designates the difference in classification error between a fully trained model and a model with one layer re-initialized. The error bars designate one standard deviation over five runs with different random initializations.

In summary, the empirical results in this section provide evidence that deep networks automatically adjust their de-facto complexity: when a big network is trained on an easy task, only a few layers seem to play critical roles.

3.3 Large convolutional networks

(a) VGG11
(b) VGG19
(c) VGG13
(d) VGG16
Figure 4: Layer-wise robustness analysis with VGG networks on CIFAR10. The heatmaps use the same layout as in Figure 1, but they are transposed, to visualize the deeper architectures more effectively.

On typical computer vision tasks beyond MNIST, densely connected FCNs are significantly outperformed by convolutional neural networks. VGGs and ResNets are among the most widely used convolutional network architectures. Figure 4 and Figure 5 show the robustness analysis for the two types of networks, respectively.

Since these networks are much deeper than the FCNs, we transpose the heatmaps to show the layers as columns. For VGGs, a large number of layers are sensitive to re-initialization, but the pattern is similar to the observations for the simple FCNs on MNIST: the bottom layers are more critical, while the upper layers are robust to re-initialization.

(a) ResNet18
(b) ResNet50
(c) ResNet101
(d) ResNet152
Figure 5: Layer-wise robustness analysis on residual blocks of ResNets trained on CIFAR10.

The results for ResNets in Figure 5 should be considered together with the ImageNet results in Figure 6. We find the robustness patterns of ResNets more interesting, mainly for two reasons:

ResNets re-distribute sensitive layers.

Unlike the FCNs and VGGs, which have their sensitive layers at the bottom of the network, ResNets distribute them across the network. To better understand the patterns, let us briefly recap the ResNet architectures. It is common in theoretical analysis to broadly define ResNets as any neural network architecture with residual blocks. In practice, a few “standard” architectures (and variants) that divide the network into a few “stages” are commonly used. At the bottom, there is a pre-processing stage (stage0) with vanilla convolutional layers. It is followed by a few (typically four) residual stages (stage1 to stage4) consisting of residual blocks, and then global average pooling and a densely connected linear classifier (final_linear). The image size shrinks and the number of convolutional feature channels doubles from each residual stage to the next. (There are more subtle details, especially at stage1, depending on factors such as the input image size, whether the residual blocks contain a bottleneck, and the version of the ResNet.) As a result, while most residual blocks have true identity skip connections, the first block of each stage (stage*.resblk1), which connects to the previous stage, has a non-identity skip connection due to differing input/output shapes. Figure 7 illustrates the two types of residual blocks.

With this picture of the ResNet architecture in mind, we can see that each stage in a ResNet acts as a sub-network, and the layer-wise robustness pattern within each stage resembles that of the VGGs and FCNs.
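For concreteness, the following is a simplified sketch (our own, not the authors' code) of a pre-activation (ResNet V2) basic residual block with the two kinds of skip connections discussed above; it is written with batch normalization for familiarity, although the main experiments in this paper omit batch normalization (Appendix B).

```python
# Pre-activation basic residual block: BN -> ReLU -> Conv, twice. The skip
# connection is the identity unless the spatial size or channel count changes
# (the first block of a stage), in which case a strided 1x1 projection is used.
import torch.nn as nn

class PreActBasicBlock(nn.Module):
    def __init__(self, in_ch, out_ch, stride=1):
        super().__init__()
        self.bn1 = nn.BatchNorm2d(in_ch)
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(out_ch)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1, bias=False)
        self.relu = nn.ReLU(inplace=True)
        self.shortcut = None  # identity skip connection by default
        if stride != 1 or in_ch != out_ch:
            self.shortcut = nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False)

    def forward(self, x):
        out = self.relu(self.bn1(x))
        skip = x if self.shortcut is None else self.shortcut(out)
        out = self.conv2(self.relu(self.bn2(self.conv1(out))))
        return out + skip
```

In this sketch, only the first block of each stage (changed stride or channel count) gets the non-identity shortcut; all later blocks in the stage keep a true identity skip connection.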

Residual blocks can be robust to re-randomization.

Among the layers that are robust to re-initialization, those that are residual blocks are also robust to re-randomization: compare, for example, the final_linear layer with any of the robust residual blocks. A possible reason is that in those blocks the identity skip connection dominates the residual branch. It is known from previous lesion studies (Veit et al., 2016) that residual blocks in a ResNet can be removed without seriously hurting performance. Our experiments, however, put this in context with other architectures and study the adaptive robustness arising from the interplay between model capacity and task difficulty. In particular, comparing the results on CIFAR10 and ImageNet, we see that, especially for ResNet18 in Figure 6(a), many residual blocks with true identity skip connections also become sensitive compared to the bigger models, due to the smaller capacity.

(a) ResNet18
(b) ResNet50
(c) ResNet101
(d) ResNet152
Figure 6: Layer-wise robustness analysis on residual blocks of ResNets trained on ImageNet.
(a) Residual block
(b) Residual block with downsampling
Figure 7: Illustration of residual blocks (from ResNets V2) with and without a downsampling skip branch. C, N and R stand for convolution, (batch) normalization and ReLU activation, respectively. These are the basic residual blocks used in ResNet18 and ResNet34; ResNet50 and deeper models use bottleneck residual blocks, which are similar to the illustration here except that the residual body reduces the number of convolution channels in the middle, forming a “bottlenecked” residual branch.

4 Theoretical Implications on Generalization

As mentioned earlier, if some parameters can be re-assigned their randomly initialized values without affecting the model performance, then the effective number of parameters is reduced, since the random initialization is independent of the training data. The benefit for generalization is most easily demonstrated with a naive parameter-counting generalization bound. Suppose we have a generalization bound of the form

$$ R[f_\theta] \;\le\; \hat{R}_m[f_\theta] + B\big(C(\theta),\, m\big), $$

where $f_\theta$ is a model with parameters $\theta$ trained on $m$ i.i.d. samples, $R$ and $\hat{R}_m$ denote the risk and the empirical risk, $C(\theta)$ is some complexity measure based on counting the number of parameters, and $B$ is the corresponding generalization bound. For example, Anthony and Bartlett (2009) provide various bounds on the VC dimension based on the number of weights in a neural network, which can then be plugged into standard VC-dimension-based generalization bounds for classification (Vapnik, 1998). Now, if we know that a fraction $\rho$ of the network weights will be robust to re-initialization after training, at a cost of at most $\varepsilon$ in the (empirical) risk, then, roughly speaking, we get

$$ R[\tilde{f}_\theta] \;\le\; \hat{R}_m[f_\theta] + \varepsilon + B\big((1-\rho)\,C(\theta),\, m\big), $$

where $\tilde{f}_\theta$ is the model obtained by re-initializing the robust fraction $\rho$ of the parameters of the trained model $f_\theta$. Note that generalization bounds based on parameter counting generally do not work well for deep learning: because of the heavy over-parameterization, the resulting bounds are usually trivial. However, as noted in Arora et al. (2018), most of the alternative generalization bounds proposed recently for deep neural network models are actually worse than naive parameter counting. Moreover, by tweaking the existing analyses with an additional layerwise robustness condition, some PAC-Bayes based bounds could potentially also be improved (Wang et al., 2018; Arora et al., 2018; Zhou et al., 2019).
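To make the effect concrete, here is a rough, illustrative instantiation (our own rendering, not a bound stated above), writing $\mathrm{VCdim}(W)$ for a VC-dimension bound expressed as a function of the number of weights $W$ (e.g. from Anthony and Bartlett, 2009):

$$ R[\tilde{f}_\theta] \;\lesssim\; \hat{R}_m[f_\theta] \;+\; \varepsilon \;+\; \sqrt{\frac{\mathrm{VCdim}\big((1-\rho)\,W\big)}{m}} , $$

so that, all else being equal, the capacity term shrinks as the robust fraction $\rho$ grows.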

Note that, like the results in Arora et al. (2018); Zhou et al. (2019), the bounds provided by re-initialization robustness are for a different model (in our case, the re-initialized one). Alternative approaches in the literature modify the training algorithm to explicitly optimize robustness or some derived generalization bound (Neyshabur et al., 2015; Dziugaite and Roy, 2016). However, neither type of argument provides guarantees for the model directly trained with SGD.

5 Joint robustness

The theoretical analysis suggests that robustness to either re-initialization or re-randomization could imply better generalization. Combined with the experimental results in the previous sections, this seems to offer a good explanation of the empirical observation that hugely over-parameterized networks can still generalize well: they use only a small portion of their full capacity. However, there is a caveat: the re-initialization and re-randomization analyses in Section 3 study each layer independently, and two or more layers being independently robust does not necessarily imply that they are jointly robust. If, for example, we want a generalization bound that uses only half of the capacity, we need to show that half of the layers are robust to re-initialization or re-randomization simultaneously.

5.1 Are robust layers jointly robust?

(a) layer2–6
(b) layer2,3,5,6
(c) layer2,4,6
Figure 8: Joint robustness analysis of an FCN on MNIST. The heatmap layout is the same as in Figure 1, but the layers are divided into two groups (indicated by the * mark on the layer names in each subfigure), and re-randomization and re-initialization are applied to all the layers in a group jointly. As a result, layers belonging to the same group have identical rows in the heatmap; we still show all layers to make the figures easier to read and to compare with the previous layer-wise robustness results. The subfigures show the results of three different grouping schemes.
(a) ResNet18: resblk2
(b) ResNet50: resblk2, 3, …
(c) ResNet152: resblk2, 3, …
Figure 9: Joint robustness analysis of ResNets on CIFAR10, based on the scheme that groups all but the first residual block in each residual stage. Grouping is indicated by the * on the layer names.
(a) ResNet50: resblk2, 3 …of stage2, 3
(b) ResNet50: every second resblk
(c) ResNet101: every second resblk
(d) ResNet152: every second resblk
Figure 10: Joint robustness analysis of ResNets on CIFAR10, with alternative grouping schemes. Grouping is indicated by the * on the layer names.

In this section, we perform the joint robustness analysis on groups of layers. From Section 3.1, we saw that on MNIST, for wide enough FCNs, all the layers above layer1 are robust to re-initialization. We therefore divide the layers into two groups, {layer1} and {layer2, layer3, …}, and perform the robustness studies on the two groups. The results for the FCN are shown in Figure 8(a). For clarity and ease of comparison, the figure still spells out all the layers individually, but the values from layer2 to layer6 are simply repeated rows. The values show that the upper-layer group is clearly not jointly robust to re-initialization (to checkpoint 0).
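The joint probe is a direct extension of the single-layer sketch from Section 2; an assumed version is shown below, resetting every layer in a group to its checkpoint values before evaluation.

```python
# Joint re-initialization: all layers whose names appear in `group` are reset
# to the given checkpoint simultaneously, then the mixed model is evaluated.
import copy

def reinitialize_group(model, group, checkpoint_state):
    probed = copy.deepcopy(model)
    new_state = copy.deepcopy(probed.state_dict())
    for key, value in checkpoint_state.items():
        if any(key.startswith(name + ".") for name in group):
            new_state[key] = value.clone()
    probed.load_state_dict(new_state)
    return probed

# e.g. the "every other layer" grouping of Figure 8(c):
# probed = reinitialize_group(model, ["layer2", "layer4", "layer6"], checkpoints[0])
```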

We also tried alternative grouping schemes: Figure 8(b) shows the results when we group two out of every three layers, which gives slightly improved joint robustness. The grouping scheme in Figure 8(c), which includes every other layer, shows that with a clever choice of grouping, about half of the layers can be jointly robust.

Results on ResNets are similar. Figure 9 shows the joint robustness analysis of ResNets trained on CIFAR10. The grouping is based on the layer-wise robustness results of Figure 5: all the residual blocks except the first one in each of stage1 to stage4 are bundled together and analyzed jointly. The results resemble those of the FCNs: ResNet18 is relatively robust, but the deeper ResNets are not jointly robust under this grouping. Two alternative grouping schemes are shown in Figure 10. By including only layers from stage1 and stage4, slightly improved robustness is obtained on ResNet50; the scheme that groups every second residual block shows further improvement.

In summary, individually robust layers are generally not jointly robust, but with a careful choice of a subset of the layers, joint robustness can still be achieved for up to half of the layers. In principle, one could enumerate all possible grouping schemes to find the best trade-off between robustness and the number of layers included.

5.2 Could robust layers be made jointly robust?

Results from the previous section show that there is a gap between the layer-wise robustness patterns and joint robustness. Here we try to see whether we can close the gap by letting the training algorithm know that we are interested in the robustness of a subset of the layers. It is complicated to express this desire algorithmically, but we can make a stronger request by asking the learning algorithm to explicitly not “use” those layers. More specifically, we try two approaches for the layers in the group that we wish to be robust: 1) freeze them, so that their parameters remain at their randomly initialized values; 2) remove the layers completely from the neural network architecture.
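A sketch of the two interventions (assumed PyTorch-style helpers, not the authors' code) is given below; removal assumes the targeted blocks can be replaced by identity mappings, which holds for residual blocks whose input and output shapes match.

```python
# Two ways of telling training not to "use" a group of layers:
# 1) freeze them at their random initialization; 2) remove them entirely.
import torch.nn as nn

def freeze_layers(model, group):
    for name, module in model.named_modules():
        if name in group:
            for p in module.parameters():
                p.requires_grad = False  # keep the random initial values fixed
    return model

def remove_layers(model, group):
    for name in group:
        parent_name, _, child_name = name.rpartition(".")
        parent = model.get_submodule(parent_name) if parent_name else model
        setattr(parent, child_name, nn.Identity())  # drop the block entirely
    return model

# When freezing, the optimizer can be built over
# [p for p in model.parameters() if p.requires_grad] so that frozen layers
# receive no updates.
```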

Architecture    Full Model    Layer-wise Robustness    Layers Frozen    Layers Removed

CIFAR10
ResNet50        8.40          9.77 ± 1.38              11.74            9.23
ResNet101       8.53          8.87 ± 0.50               9.21            9.23
ResNet152       8.54          8.74 ± 0.39               9.17            9.23

ImageNet
ResNet50        34.74         38.54 ± 5.36             44.36            41.50
ResNet101       32.78         33.84 ± 2.10             36.03            41.50
ResNet152       31.74         32.42 ± 1.55             35.75            41.50
Table 1: Error rates (%) on CIFAR10 (top rows) and ImageNet (bottom rows), respectively. Each row shows the performance of the full model, (the mean and std of) the layer-wise robustness to re-initialization, the performance when training with a subset of layers fixed at random initialization, and the performance when training with a subset of layers removed. In particular, the layer-wise robustness is averaged over all the residual blocks except the first one at each stage. The layer-freezing and layer-removal operations are also applied to those residual blocks (jointly).

The results are shown in Table 1. When we explicitly freeze the layers, the test error rates are still higher than the average layer-wise robustness measured on a normally trained model. However, the gap is much smaller than when directly measuring joint robustness (see Figure 9 for comparison). Moreover, on CIFAR10, similar performance can be achieved even if we completely remove those layers from the network. On ImageNet, on the other hand, the frozen random layers appear to be needed to achieve good performance, while the “layers-removed” variant under-performs by a large gap. In this case, the random projections (with non-linearities) in the frozen layers help the performance.

6 Connections to other notions of robustness

The notion of layer-wise (and joint) robustness to re-initialization and re-randomization can be related to other notions of robustness in deep learning. For example, the flatness of the solution is a notion of robustness with respect to local perturbations of the network parameters (at convergence), and is extensively discussed in the context of generalization (Hochreiter and Schmidhuber, 1997; Chaudhari et al., 2017; Keskar et al., 2017; Smith and Le, 2018; Poggio et al., 2018). For a fixed layer, our notion of robustness to re-initialization is more restricted, because the “perturbed values” can only come from the optimization trajectory, while robustness to re-randomization can potentially allow larger perturbation variances. However, as our studies show, the robustness or flatness of each layer can behave very differently, so analyzing each layer individually, in the context of a specific network architecture, provides more insight into these robustness behaviors.

On the other hand, adversarial robustness (Szegedy et al., 2013) focuses on robustness with respect to perturbations of the inputs. In particular, it has been found that trained deep neural network models are sensitive to input perturbations: small, adversarially generated perturbations can usually change the prediction to an arbitrary different class. A large number of attack and defense algorithms have been proposed along this line in recent years. Here we briefly discuss the connection to adversarial robustness. In particular, take a normally trained ResNet (we use a slightly modified variant with an explicit downsampling layer between stages, so that all residual blocks have true identity skip connections; see Figure 7) with a given number of stages and residual blocks per stage. Given a configuration $(r, s)$, during each test evaluation a subset of $s$ stages is chosen at random, and in each chosen stage a random residual block is picked and its weights are replaced with one of $r$ pre-initialized sets of random weights for that layer. We keep $r$ pre-allocated random weights for each residual block, instead of re-sampling random numbers on each evaluation call, primarily to reduce the computational burden at test time.
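A sketch (assumed, not the authors' code) of this stochastic classifier is shown below: make_random_bank pre-allocates the r random weight sets for each residual block, and stochastic_forward temporarily swaps one of them into a randomly chosen block in each of s randomly chosen stages.

```python
# Stochastic classifier built on a trained ResNet: at every forward pass,
# s stages are sampled, and in each sampled stage one residual block is
# evaluated with one of its r pre-allocated random initializations.
import copy
import random

def make_random_bank(model, block_names, r, init_fn):
    bank = {}
    for name in block_names:
        block = model.get_submodule(name)
        bank[name] = []
        for _ in range(r):
            rnd = copy.deepcopy(block)
            rnd.apply(init_fn)  # re-sample this block's weights
            bank[name].append(copy.deepcopy(rnd.state_dict()))
    return bank

def stochastic_forward(model, x, stages, bank, s):
    """`stages` maps a stage name to the list of its residual block names."""
    chosen = random.sample(list(stages), s)
    backups = {}
    for stage in chosen:
        name = random.choice(stages[stage])
        block = model.get_submodule(name)
        backups[name] = copy.deepcopy(block.state_dict())
        block.load_state_dict(random.choice(bank[name]))
    out = model(x)
    for name, state in backups.items():  # restore the trained weights
        model.get_submodule(name).load_state_dict(state)
    return out
```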

From the previous robustness analysis, we expect this stochastic classifier to suffer only a small performance drop when averaged over the test set. At the level of individual examples, however, the randomness of the network outputs makes it harder for an attacker to generate adversarial examples. We evaluate adversarial robustness against a weak FGSM attack (Goodfellow et al., 2014) and a strong PGD attack (Madry et al., 2017). The results in Table 2 show that, compared to the baseline (the exact same trained model before being turned into a stochastic classifier), the randomness significantly increases adversarial robustness against the weak attack. The accuracy under the strong PGD attack drops to a very low level, but a non-trivial gap relative to the baseline remains.
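For reference, a minimal sketch of the (standard) FGSM perturbation used here as the weak attack, assuming inputs scaled to [0, 1]; the epsilon value is illustrative rather than taken from the paper.

```python
# FGSM: perturb the input by epsilon in the direction of the sign of the
# gradient of the loss with respect to the input.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    with torch.no_grad():
        x_adv = x_adv + epsilon * x_adv.grad.sign()
        x_adv = x_adv.clamp(0.0, 1.0)  # stay in the valid input range
    return x_adv.detach()
```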

In summary, layer-wise robustness can improve the adversarial robustness of a trained model through injected stochasticity. However, it is not a good defense against strong attacks, and more sophisticated attacks that explicitly account for stochastic classifiers are likely to break this model completely.

Model Configuration Clean FGSM PGD
baseline
r=4,s=1
r=4,s=2
baseline
r=4,s=1
r=4,s=2
r=4,s=4
Table 2: Accuracies (%) of various model configurations on clean CIFAR10 test set and under a weak (FGSM) and a strong (PGD) adversarial attack, respectively.

The adversarial attacks are evaluated on a subset of 1000 test examples. Every experiment is repeated 5 times and the average performance is reported. The hyperparameters r and s in the model configurations denote, respectively, the number of random weight sets pre-created for each residual block and the number of stages that are re-randomized during each inference. The two groups of rows correspond to a ResNet architecture with two stages of four residual blocks each, and to one with four stages of four residual blocks each.

7 Conclusions

We studied a wide variety of popular models for image classification and investigated the functional structure of over-parameterized deep models on a layer-by-layer basis. We introduced the notions of re-initialization and re-randomization robustness. Using these notions, we provided evidence for the heterogeneous characteristics of layers, which can be roughly categorized as either “robust” or “critical”. Resetting the robust layers to their initial values has no negative consequence for the model's performance. Our empirical results give further evidence that mere parameter counting or norm accounting is too coarse for studying the generalization of deep models. Moreover, optimization-landscape-based analyses (e.g. flatness or sharpness at the minimizer) are better performed with respect to the network architecture, given the heterogeneous behaviors of different layers. For future work, we are interested in devising new algorithms that learn interleaved trained and partially random subnetworks within one large network.

Acknowledgments

The authors would like to thank David Grangier, Lechao Xiao, Kunal Talwar and Hanie Sedghi for helpful discussions and comments.

References

Appendix A Details on experiment setup

Our empirical studies are based on the MNIST, CIFAR10 and ILSVRC 2012 ImageNet datasets. Stochastic gradient descent (SGD) with Nesterov momentum of 0.9 is used to minimize the multi-class cross-entropy loss. Each model is trained for 100 epochs, using a stage-wise constant learning rate schedule with a multiplicative factor of 0.2 applied at epochs 30, 60 and 90. A batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where a batch size of 64 is used due to device memory constraints.
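A sketch of this setup in PyTorch (our assumed translation; the base learning rate is illustrative, since it is not stated here):

```python
# SGD with Nesterov momentum 0.9 and a piecewise constant learning rate
# schedule that multiplies the rate by 0.2 at epochs 30, 60 and 90.
import torch

def make_optimizer(model, base_lr=0.1):
    optimizer = torch.optim.SGD(model.parameters(), lr=base_lr,
                                momentum=0.9, nesterov=True)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30, 60, 90], gamma=0.2)
    return optimizer, scheduler

# Training loop: optimizer.step() per mini-batch, scheduler.step() once per
# epoch, for 100 epochs in total.
```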

We mainly study three types of neural network architectures:

  • FCNs: multi-layer perceptrons consisting of fully connected layers with equal output dimension and ReLU activations (except for the last layer, whose output dimension equals the number of classes and which has no ReLU). For example, an FCN with three hidden layers of output dimension 256 consists of those three fully connected layers plus an extra final (fully connected) classifier layer.

  • VGGs: widely used network architectures from Simonyan and Zisserman (2014).

  • ResNets: the results from our analysis are similar for ResNets V1 (He et al., 2016a) and V2 (He et al., 2016b). We report our results with ResNets V2 due to the slightly better performance in most of the cases. For large image sizes from ImageNet, the stage0 contains a convolution and a max pooling (both with stride 2) to reduce the spatial dimension (from 224 to 56). On smaller image sizes like CIFAR10, we use a convolution with stride 1 here to avoid reducing the spatial dimension.

During training, CIFAR10 images are padded with 4 pixels of zeros on all sides, then randomly flipped horizontally and randomly cropped. ImageNet images are randomly cropped during training and center-cropped during testing. The global mean and standard deviation are computed over all training pixels and used to normalize the inputs of each dataset.
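For CIFAR10, a torchvision-style sketch of this pipeline (assumed; the normalization constants below are the commonly used CIFAR10 statistics, shown for illustration rather than quoted from the paper):

```python
# CIFAR10 training augmentation: pad by 4 pixels, random 32x32 crop, random
# horizontal flip, then normalization by dataset mean and std.
from torchvision import transforms

cifar10_train_transform = transforms.Compose([
    transforms.RandomCrop(32, padding=4),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.4914, 0.4822, 0.4465),
                         std=(0.2470, 0.2435, 0.2616)),
])
```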

Appendix B Batch normalization and weight decay

The primary goal of this paper is to study the (co-)evolution of the representations at each layer during training and the robustness of each layer's representation with respect to the rest of the network. We try to minimize factors that explicitly encourage changes of the network weights or representations in the analysis. In particular, unless otherwise specified, weight decay and batch normalization are not used. This leads to some performance drop in the trained models, especially for deep residual networks: even though we can successfully train residual networks with 100+ layers without batch normalization, the final generalization performance can be considerably worse than the state of the art. Therefore, in this section we include studies on networks trained with weight decay and batch normalization for comparison.

Architecture N/A +wd +bn +wd+bn

CIFAR10

ResNet18 10.4 7.5 6.9 5.5
ResNet34 10.2 6.9 6.6 5.1
ResNet50 8.4 9.9 7.6 5.0
ResNet101 8.5 9.8 6.9 5.3
ResNet152 8.5 9.7 7.3 4.7
VGG11 11.8 10.7 9.4 8.2
VGG13 10.3 8.8 8.4 6.7
VGG16 11.0 11.4 8.5 6.7
VGG19 12.1 8.6 6.9

ImageNet

ResNet18 41.1 33.1 33.5 31.5
ResNet34 39.9 30.6 30.1 27.2
ResNet50 34.8 31.8 28.2 25.0
ResNet101 32.9 29.9 26.9 22.9
ResNet152 31.9 29.1 27.6 22.6
Table 3: Test performance (classification error rates %) of various models studied in this paper. The table shows how much of the final performance is affected by training with or without weight decay (+wd) and batch normalization (+bn).

In particular, Table 3 shows the final test error rates of models trained with or without weight decay and batch normalization. Note that the original VGG models do not use batch normalization (Simonyan and Zisserman, 2014); we list +bn variants here for comparison, obtained by applying batch normalization to the output of each convolutional layer. On CIFAR10, the performance gap varies from 3% to 5%, while on ImageNet a gap as large as 10% can be seen when training without weight decay and batch normalization. Figure 11 shows how the different training configurations affect the layerwise robustness patterns of VGG16 networks. We find that when batch normalization is used, none of the layers are robust any more.

(a) VGG16
(b) VGG16 +wd
(c) VGG16 +bn
(d) VGG16 +wd +bn
Figure 11: Layer-wise robustness analysis with VGG16 on CIFAR10. The subfigures show how training with weight decay (+wd) and batch normalization (+bn) affects the layerwise robustness patterns.
(a) ResNet50
(b) ResNet50 +wd
(c) ResNet50 +bn
(d) ResNet50 +wd +bn
Figure 12: Layer-wise robustness analysis with ResNet50 on CIFAR10. The subfigures show how training with weight decay (+wd) and batch normalization (+bn) affects the layerwise robustness patterns.
(a) ResNet50
(b) ResNet50 +wd
(c) ResNet50 +bn
(d) ResNet50 +wd +bn
Figure 13: Layer-wise robustness analysis with ResNet50 on ImageNet. The subfigures show how training with weight decay (+wd) and batch normalization (+bn) affects the layerwise robustness patterns.

Figure 12 and Figure 13 show similar comparisons for ResNet50 on CIFAR10 and ImageNet, respectively. Unlike for VGGs, we find that the layerwise robustness patterns remain quite pronounced for ResNets under the various training conditions. In Figure 12(d) and Figure 13(c,d), we see the puzzling phenomenon that re-initializing to checkpoint-1 is less robust than re-initializing to checkpoint-0 for many layers. We do not know exactly why this happens. It might be that during the early stages some aggressive learning takes place, causing changes of large magnitude in the parameters or their statistics, and that later on, when most of the training samples are classified correctly, the network gradually re-balances the layers into a more robust state. Figure 15(d-f) in the next section shows supporting evidence: in this case, the distance of the parameters between checkpoint-0 and checkpoint-1 is larger than between checkpoint-0 and the final checkpoint. On ImageNet, however, this correlation is no longer clear, as seen in Figure 16(d-f). See the discussion in the next section for more details.

Appendix C Robustness and distances

(a) Test error
(b)
(c)
Figure 14: Layer-wise robustness studies of VGG16 on CIFAR10. (a) shows the robustness analysis measured by the test error rate. (b) shows the normalized distance of the parameters at each layer to the versions realized during the re-randomization and re-initialization analysis. (c) is the same as (b), except with the second norm used in Section 3.1.
(a) Test error (-wd-bn)
(b)
(c)
(d) Test error (+wd+bn)
(e)
(f)
Figure 15: Layer robustness for ResNet50 on CIFAR10. Layouts are the same as in Figure 14. The first row (a-c) is for ResNet50 trained without weight decay and batch normalization. The second row (d-f) is with weight decay and batch normalization.
(a) Test error (-wd-bn)
(b)
(c)
(d) Test error (+wd+bn)
(e)
(f)
Figure 16: Layer-wise robustness studies of ResNet50 on ImageNet. Layouts are the same as in Figure 14. The first row (a-c) is for ResNet50 trained without weight decay and batch normalization. The second row (d-f) is with weight decay and batch normalization.

In Figure 1 in Section 3.1, we compared the layerwise robustness pattern to the layerwise distances of the parameters from their values at initialization (checkpoint-0). We found that for FCNs on MNIST there is no obvious correlation between the “amount of parameter updates received” by each layer and its robustness to re-initialization, for either of the two norm-based distances we measured. In this appendix, we list results for the other models and datasets studied in this paper for comparison.

Figure 14 shows the layerwise robustness plot along with the layerwise distance plots for VGG16 trained on CIFAR10. We find that under one of the distance measures the top layers move far from their initialization, yet the model is robust when we re-initialize those layers. The normalized distance, however, seems to correlate with the layerwise robustness pattern: the lower layers, which are less robust, have larger distances to their initialized values.

Similar plots for ResNet50 on CIFAR10 and ImageNet are shown in Figure 15 and Figure 16, respectively. In each figure, we also show extra results for models trained with weight decay and batch normalization. For the case without weight decay and batch normalization, we can see a weak correlation: the layers that are sensitive have slightly larger distances to their random initialization values. For the case with weight decay and batch normalization, the situation is less clear. First of all, in Figure 15(e-f) we see very large distances in a few layers at checkpoint-1. This provides a potential explanation for the puzzling pattern that re-initializing to checkpoint-1 is more sensitive than re-initializing to checkpoint-0. Similar observations can be made in Figure 16(e-f) for ImageNet.

Appendix D Alternative visualizations

The empirical results on layer robustness are mainly visualized as heatmaps in the main text. The heatmaps allow an uncluttered comparison of the results across layers and training epochs, but it is not easy to tell apart numerical values that are close to each other from the color coding alone. In this section, we provide alternative visualizations that show the same results as line plots. In particular, Figure 17 shows the layerwise robustness analysis for VGG16 on CIFAR10, and Figure 18 and Figure 19 show the results for ResNet50 on CIFAR10 and ImageNet, respectively.

(a) Test error
(b)
(c)
Figure 17: Alternative visualization of the layer robustness analysis for VGG16 models on CIFAR10. This shows the same results as Figure 14, but as curves instead of heatmaps.
(a) Test error
(b)
(c)
Figure 18: Alternative visualization of layer robustness analysis for ResNet50 on CIFAR10.
(a) Test error
(b)
(c)
Figure 19: Alternative visualization of layer robustness analysis for ResNet50 on ImageNet.