1 Introduction
Deep neural networks have been remarkably successful in many real-world machine learning applications. When applying them in critical domains, a distilled understanding of these systems is at least as important as their state-of-the-art performance. Recent work on understanding why deep networks perform so well in practice has focused on questions such as the networks' behavior under drifting or even adversarially perturbed data distributions. Another line of research relevant to this work studies how to interpret or explain the decision function of trained networks. While related, this work takes a different angle: we focus on the role of individual layers in trained networks and then relate the empirical results to generalization and robustness properties.
Theoretical research on the representation power of neural networks is well established. It is known that a neural network with a single, sufficiently wide hidden layer is a universal approximator for continuous functions over a compact domain (Gybenko, 1989; Hornik, 1991; Anthony and Bartlett, 2009). More recent research examines whether deep networks can have greater representation power than shallow ones with the same number of units or edges (Pinkus, 1999; Delalleau and Bengio, 2011; Montufar et al., 2014; Telgarsky, 2016; Shaham et al., 2015; Eldan and Shamir, 2015; Mhaskar and Poggio, 2016; Rolnick and Tegmark, 2017). The capacity to represent arbitrary functions with finite samples is also extensively discussed in recent work (Hardt and Ma, 2017; Zhang et al., 2017; Nguyen and Hein, 2018; Yun et al., 2018). However, the constructions used in the aforementioned work for building networks that approximate particular functions are typically "artificial" and unlikely to be obtained by gradient-based learning algorithms. We focus instead on empirically studying the roles that different layers in a deep architecture take on after gradient-based training.
Research on the generalization of deep neural networks has attracted a lot of interest. The observation that large neural networks can fit random labels on the training set (Zhang et al., 2017) makes it difficult to apply classical learning-theoretic results based on uniform convergence over the hypothesis space. One approach to circumvent this issue is to show that, while the space of neural networks of a given architecture is very large, gradient-based learning on "well-behaved" tasks leads to relatively "simple" models. More recent research focuses on the analysis of post-training complexity metrics such as norm, margin, robustness, flatness, or compressibility of the learned model, in contrast to the pre-training capacity of the entire hypothesis space. This line of work has resulted in improved generalization bounds for deep neural networks; see for instance Dziugaite and Roy (2016); Kawaguchi et al. (2017); Bartlett et al. (2017); Neyshabur et al. (2018, 2017); Liang et al. (2017); Arora et al. (2018); Zhou et al. (2019) and the references therein. This paper provides further empirical evidence and alludes to a potentially more fine-grained analysis.
In particular, we show empirically that the layers of a deep network are not homogeneous in the role they play in representing the prediction function. Some layers are critical to forming good predictions, while others are fairly robust to the assignment of their parameters during training. Moreover, depending on the capacity of the network and the complexity of the target function, networks trained with gradient-based methods conserve complexity by not using excess capacity. The exact definitions and the implications for generalization are discussed in the body of the paper.
Before proceeding to the body of the paper, we would like to point out a few related papers. Modern neural networks are typically overparameterized and thus have plenty of redundancy in their representation capabilities. Previous work exploited overparameterization to compress (Han et al., 2015) or distill (Hinton et al., 2015) a trained network. Rosenfeld and Tsotsos (2018) found that comparable performance can be achieved by training only a small fraction of the network parameters, such as a subset of the channels in each convolutional layer. Towards interpreting residual networks as ensembles of shallow networks, Veit et al. (2016) found that residual blocks in a trained network can, to some extent, be deleted or permuted without hurting the performance too much. In another line of research, it has been shown that under extreme overparameterization, such as when the network width is polynomial in the training set size and input dimension (Allen-Zhu et al., 2018; Du et al., 2018a, b; Zou et al., 2018), or even in the asymptotic regime of infinite width (Lee et al., 2019), the network weights move slowly during training. The observations in this paper show that in the more practical regime, different layers can behave very differently.
The rest of the paper is organized as follows: the experimental framework and our main notions of layerwise robustness are introduced in Section 2. Section 3 presents the results and analysis of layerwise robustness for a wide range of neural network models. Section 4 discusses the theoretical implications for generalization. Studies on joint robustness and connections to other notions of robustness are presented in Section 5 and Section 6, respectively. Finally, the paper ends with a conclusion that summarizes our main contributions.
2 Setting
Feed-forward networks naturally consist of multiple layers, where each unit in a layer takes inputs from units in the previous layer. It is common to form a layer using a linear transform (e.g. convolution), followed by some kind of normalization (e.g. batch normalization), and then a unit-wise nonlinear activation function (e.g. rectification, ReLU). We use the term layer in a more general sense to denote any layer-like computation block. In particular, a residual block in a residual network (He et al., 2016a) can also be treated as a layer.

Let $\mathcal{F}$ be the space of functions realizable by a particular neural network architecture with $D$ (parametric) layers, with parameters $\theta = (\theta_1, \dots, \theta_D)$. We use the term capacity to refer to properties of the entire space $\mathcal{F}$ before training takes place. Capacity is usually measured by notions such as the Rademacher complexity, the VC dimension, and various types of covering numbers. The term complexity is used in reference to properties of a single neural network $f \in \mathcal{F}$, often employing a norm of the parameters, possibly normalized by empirical quantities such as the margin.
We are interested in analyzing the post-training behavior of the layers of popular deep networks. Such networks are typically trained using stochastic gradient methods, which start by sampling random values $\theta^0 = (\theta_1^0, \dots, \theta_D^0)$ for the parameters of the $D$ layers from a predefined distribution $\mathcal{P}^0$. The choice of $\mathcal{P}^0_d$ typically depends on the type, fan-in, and fan-out of each layer. During training, the parameters are iteratively updated via
$$\theta^{\tau+1} = \theta^{\tau} - \eta^{\tau} g^{\tau}, \qquad \tau = 0, 1, 2, \dots \tag{1}$$

where $g^{\tau}$ designates a stochastic estimate of the gradient of the loss on a sample of examples with respect to $\theta^{\tau}$, and $\eta^{\tau}$ is the learning rate. Variants that use momentum and preconditioners, as well as adaptive learning-rate schedules for $\eta^{\tau}$, are commonly used in practice but are not employed in this paper. After training for $T$ iterations, the parameters $\theta^T$ are used as the final trained model.

A deep network builds up the representation of its inputs by incrementally applying the nonlinear transformations defined by each layer. As a result, the representation at a particular layer recursively depends on all the layers beneath it. This complex dependency makes it challenging to isolate and inspect each layer independently in theoretical studies. In this paper, we introduce and use the following two empirical probes to inspect the individual layers of a trained neural network.
Reinitialization
After training, for a given layer $d$, we can reinitialize its parameters through the assignment $\theta_d^T \leftarrow \theta_d^0$, while keeping the parameters of the other layers unchanged. The model with the mixed parameters is then evaluated. Unless noted otherwise, we use the term performance to designate classification error on test data. The performance of a network in which layer $d$ was reinitialized is referred to as the reinitialization robustness of layer $d$. Note that $\theta_d^0$ denotes the random values realized at the beginning of training. More generally, for a checkpoint saved at time step $\tau$, we can reinitialize the $d$-th layer by setting $\theta_d^T \leftarrow \theta_d^{\tau}$, and obtain the reinitialization robustness of layer $d$ after $\tau$ updates.
Rerandomization
To go one step further, we also examine rerandomization of a layer: resampling fresh random values $\tilde{\theta}_d \sim \mathcal{P}^0_d$ and evaluating the model's performance with $\theta_d^T \leftarrow \tilde{\theta}_d$. Analogously, we refer to the resulting performance as the rerandomization robustness of layer $d$.
Note that there is no retraining or fine-tuning after reinitialization or rerandomization; the network is evaluated directly with the mixed weights. When a network exhibits no, or only a negligible, decrease in performance after reinitializing or rerandomizing a layer, we say that the layer is robust; otherwise, the layer is called critical.
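The two probes can be sketched concretely on a toy model. The dictionary-of-layers representation, the layer names, and the initializer below are illustrative stand-ins, not the paper's implementation; the point is only that the probed model mixes trained and checkpoint (or freshly sampled) weights with no retraining.

```python
import copy
import random

def reinitialize(trained, checkpoints, layer, t=0):
    """Reinitialization probe: reset one layer to its checkpoint-t values
    (checkpoint 0 is the random initialization), keep the rest trained."""
    probed = copy.deepcopy(trained)
    probed[layer] = copy.deepcopy(checkpoints[t][layer])
    return probed

def rerandomize(trained, layer, init_fn, rng):
    """Rerandomization probe: resample one layer's weights afresh from
    the initialization distribution."""
    probed = copy.deepcopy(trained)
    probed[layer] = [init_fn(rng) for _ in probed[layer]]
    return probed

init    = {"layer1": [0.1, 0.2], "layer2": [0.3, 0.4]}   # checkpoint 0
trained = {"layer1": [1.1, 1.2], "layer2": [1.3, 1.4]}   # after training
probe = reinitialize(trained, {0: init}, "layer2")
# Evaluating `probe` directly (no retraining) gives the reinitialization
# robustness of layer2; layer1 keeps its trained values.
```

In practice the same pattern applies to framework checkpoints: load the trained weights, overwrite one layer's tensors from an earlier checkpoint or a fresh sample, and run evaluation.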
3 Robustness of individual layers
In this section, we study the layer robustness of commonly used neural networks on standard image classification benchmarks, in particular MNIST, CIFAR10, and ImageNet. All the networks are trained using SGD with Nesterov momentum and a piecewise-constant learning rate schedule. Please refer to Appendix A for further details.

3.1 Fully connected networks
Figure 1: (a) Test error rate: each row corresponds to one layer in the network. The last row shows the full model performance at the corresponding epoch (i.e. all the model parameters are loaded from that checkpoint) as a reference. The first column designates the robustness of each layer w.r.t. rerandomization, and the remaining columns designate reinitialization robustness at different checkpoints. The last column shows the final performance (at the last checkpoint during training) as a reference. (b-c) Weight distances: each cell in the heatmaps depicts the normalized $\ell_2$-norm (b) or $\ell_\infty$-norm (c) distance of the trained parameters to their initial weights.

We start by examining the robustness of fully connected networks (FCNs). An FCN consists of several fully connected layers, each with the same output dimension and a ReLU activation function. The extra final layer is a linear multiclass predictor with one output per class.
As a starting point, we trained an FCN on the MNIST digit classification task and applied the reinitialization and rerandomization analyses to the trained model. The results are shown in Figure 1(a). As expected, due to the intricate dependency of the classification function on each of the layers, rerandomizing any layer completely disintegrates the representations, and the classification accuracy drops to the level of random guessing. For reinitialization, however, we find that while the first layer is very sensitive, the remaining layers are robust to being reinitialized to their pre-training random weights.
A plausible explanation for this phenomenon is that gradient norms grow during backpropagation, to the point that the bottom layers are updated more aggressively than the top ones. Alas, if this were the case, we would expect a smooth transition rather than a sharp contrast at the first layer. We therefore measured how far the weights of each layer move from their initialization, "checkpoint 0", using the normalized $\ell_2$ norm and the $\ell_\infty$ norm. The results are shown in Figure 1(b) and (c), respectively. As we can see, robustness to reinitialization does not obviously correlate with either of the distances. Figure 2 shows the results on a deeper FCN, which demonstrates the same phenomenon. The figure also shows that the cross-entropy loss on the test set behaves similarly to the classification error. This suggests there may be something more intricate going on than a simple exploding-gradient issue. We loosely summarize the observations as follows:
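The two distance measures can be computed per layer as sketched below. Normalizing the $\ell_2$ distance by the norm of the initial weights is our assumption for illustration, since the exact normalization is not restated in this excerpt.

```python
import math

def l2(v):
    return math.sqrt(sum(x * x for x in v))

def normalized_l2_dist(w_final, w_init):
    """Normalized l2 distance of trained weights to initialization
    (normalizer chosen here as ||w_init||, an illustrative assumption)."""
    return l2([a - b for a, b in zip(w_final, w_init)]) / l2(w_init)

def linf_dist(w_final, w_init):
    """l-infinity distance: largest single-coordinate change."""
    return max(abs(a - b) for a, b in zip(w_final, w_init))

w0 = [3.0, 4.0]   # "checkpoint 0" weights of one layer, ||w0||_2 = 5
wT = [3.0, 1.0]   # weights of the same layer after training
d2   = normalized_l2_dist(wT, w0)   # ||(0, -3)||_2 / 5 = 0.6
dinf = linf_dist(wT, w0)            # 3.0
```

Computing these per layer across checkpoints reproduces the kind of heatmaps shown in Figure 1(b) and (c).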
Over-capacitated deep networks trained with stochastic gradient methods have low complexity due to self-restricting the number of critical layers.
Intuitively, if a subset of the parameters can be reinitialized to their random values at checkpoint 0 (which are independent of the training data), then the effective number of parameters, and as a result the complexity of the model, can be reduced. We defer a more detailed discussion of the theoretical implications to Section 4.
3.2 Adaptive complexity adjustment
We next applied the same analysis procedure to a large number of different configurations in order to assess the effects of network capacity and task complexity on layer robustness.
As the results in the previous section show, the first layer is rather sensitive to reinitialization while the remaining layers are quite robust. In Figure 3(a), we compare the average reinitialization robustness of all layers but the first for FCNs of varying hidden dimensions on MNIST. It is clear that the upper layers become more robust as the hidden dimension increases. We believe this reflects the fact that wider FCNs have higher model capacity. When the capacity is small, all layers are vigilant participants in representing the prediction function. As capacity increases, it suffices to use the bottom layer, while the rest act as random projections with nonlinearities.
Similarly, Figure 3(b) shows experiments on CIFAR10, which has the same number of classes and a comparable number of training examples as MNIST, but is more difficult to classify. While it is hard to directly compare the robustness of the same model across the two tasks, we still observe similar trends as the hidden dimension increases, though not as pronounced. Informally put, the difficulty of the learning task seems to necessitate more diligence in forming accurate predictions.
Figure 3: Each bar designates the difference in classification error between a fully trained model and a model with one layer reinitialized. The error bars designate one standard deviation obtained by running five experiments with different random initializations.
In summary, the empirical results presented in this section provide evidence that deep networks automatically adjust their de-facto complexity: when a big network is trained on an easy task, only a few layers seem to play critical roles.
3.3 Large convolutional networks
On typical computer vision tasks beyond MNIST, densely connected FCNs are significantly outperformed by convolutional neural networks. VGGs and ResNets are among the most widely used convolutional network architectures. Figure 4 and Figure 5 show the robustness analysis on the two types of networks, respectively. Since these networks are much deeper than the FCNs, we transpose the heatmaps to show the layers as columns. For VGGs, a large number of layers are sensitive to reinitialization, but the patterns are similar to the observations on the simple FCNs on MNIST: the bottom layers are critical while the upper layers are robust to reinitialization.
The results for ResNets in Figure 5 are to be considered together with the results on ImageNet in Figure 6. We found the robustness patterns of ResNets more interesting, mainly for two reasons:
ResNets redistribute sensitive layers.
Unlike the FCNs and VGGs, which put the sensitive layers at the bottom of the network, ResNets distribute them across the network. To better understand the patterns, let us briefly recap the ResNet architectures. In theoretical analysis, it is common to broadly define ResNets as any neural network architecture with residual blocks. In practice, a few "standard" architectures (and variants) that divide the network into a few "stages" are commonly used. At the bottom, there is a preprocessing stage (stage0) with vanilla convolutional layers. It is followed by a few (typically 4) residual stages (stage1 to stage4) consisting of residual blocks, and then global average pooling and the densely connected linear classifier (final_linear). The image size shrinks and the number of convolutional feature channels doubles from each residual stage to the next.^1 As a result, while most of the residual blocks have real identity skip connections, the first block of each stage (stage*.resblk1), which connects to the previous stage, has a non-identity skip connection due to the different input/output shapes. Figure 7 illustrates the two types of residual blocks.

^1 There are more subtle details, especially at stage1, depending on factors like the input image size, whether the residual blocks contain a bottleneck, and the version of the ResNet, etc.
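The two residual block types described above can be sketched as follows; the 1-D "feature vectors" and the transform/projection callables are toy stand-ins for the convolutional branches.

```python
def identity_block(x, residual_fn):
    """Residual block with a real identity skip: output = x + F(x)."""
    return [xi + ri for xi, ri in zip(x, residual_fn(x))]

def projection_block(x, residual_fn, project_fn):
    """First block of a stage: the skip path must change shape (e.g. the
    channel count doubles), so the skip is a non-identity projection of
    the input rather than the input itself."""
    return [ri + si for ri, si in zip(residual_fn(x), project_fn(x))]

x = [1.0, 2.0]
same_shape = identity_block(x, lambda v: [0.1 * vi for vi in v])
# toy "projection" doubling the channel count from 2 to 4
wider = projection_block(
    x,
    lambda v: [0.0, 0.0, 0.0, 0.0],        # residual branch, new shape
    lambda v: [v[0], v[0], v[1], v[1]],    # skip: project to new shape
)
```

The distinction matters for the robustness results: only blocks whose skip is a true identity can "pass through" their input unchanged when the residual branch is rerandomized.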
With this big picture of the ResNet architectures in mind, we can see that each stage in a ResNet acts as a subnetwork, and the layerwise robustness pattern within each stage resembles that of the VGGs and FCNs.
Residual blocks can be robust to rerandomization.
Among the layers that are robust to reinitialization, those that are residual blocks are also robust to rerandomization: e.g. compare the final_linear layer with any of the robust residual blocks. A possible reason is that the identity skip connection dominates the residual branch in those blocks. It is known from previous lesion studies (Veit et al., 2016) that residual blocks in a ResNet can be removed without seriously hurting performance. Our experiments, however, put this in context with other architectures and study the adaptive robustness with respect to the interplay between model capacity and task difficulty. In particular, comparing the results on CIFAR10 and ImageNet, we see that, especially on ResNet18 in Figure 6(a), many residual blocks with real identity skip connections also become sensitive compared to bigger models, due to the smaller capacity.
4 Theoretical Implications on Generalization
As mentioned earlier, if some parameters can be reassigned their randomly initialized values without affecting the model performance, then the effective number of parameters is reduced, since the random initialization is independent of the training data. The benefit for generalization is most easily demonstrated with a naive parameter-counting generalization bound. Suppose we have a generalization bound of the form

$$L(f) \le \hat{L}(f) + \mathcal{B}(C, n),$$

where $f$ is a model with $C$ parameters trained on $n$ i.i.d. samples, $\hat{L}$ and $L$ denote the empirical and expected risk, $C$ is a complexity measure based on counting the number of parameters, and $\mathcal{B}$ is the corresponding generalization bound. For example, Anthony and Bartlett (2009) provide various bounds on the VC dimension based on the number of weights in a neural network, which can then be plugged into standard VC-dimension-based generalization bounds for classification (Vapnik, 1998). Now, if we know that a fraction $\rho$ of the neural network weights will be robust to reinitialization after training, at a loss in the (empirical) risk of at most $\varepsilon$, then we get

$$L(\tilde{f}) \le \hat{L}(f) + \varepsilon + \mathcal{B}\big((1-\rho)\,C, n\big),$$

where $\tilde{f}$ is the model obtained by reinitializing the robust $\rho$-fraction of the parameters of the trained model $f$. Note that generalization bounds based on parameter counting generally do not work well for deep learning: because of the heavy overparameterization, the resulting bounds are usually trivial. However, as noted in Arora et al. (2018), most of the alternative generalization bounds recently proposed for deep neural network models are actually worse than naive parameter counting. Moreover, by tweaking existing analyses with an additional layerwise robustness condition, some PAC-Bayes based bounds could potentially be improved as well (Wang et al., 2018; Arora et al., 2018; Zhou et al., 2019).

Note that, like the results in Arora et al. (2018); Zhou et al. (2019), the bounds provided by reinitialization robustness are for a different model (in our case, the reinitialized one). Alternative approaches in the literature involve modifying the training algorithm to explicitly optimize the robustness or some derived generalization bound (Neyshabur et al., 2015; Dziugaite and Roy, 2016). However, neither type of argument provides guarantees for the model directly trained by SGD.
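Plugging toy numbers into the parameter-counting argument illustrates the effect. The $\sqrt{C/n}$ bound shape below is schematic for illustration only, not a bound from the paper.

```python
import math

def naive_bound(num_params, n):
    """Schematic parameter-counting bound ~ sqrt(C / n)."""
    return math.sqrt(num_params / n)

def reinit_bound(num_params, n, rho, eps):
    """Same schematic bound after a rho-fraction of the parameters is
    shown robust to reinitialization at an empirical-risk cost <= eps:
    only (1 - rho) * C parameters remain data-dependent."""
    return math.sqrt((1 - rho) * num_params / n) + eps

C, n = 1_000_000, 50_000
full    = naive_bound(C, n)                       # vacuous when C >> n
reduced = reinit_bound(C, n, rho=0.75, eps=0.01)  # strictly smaller
```

With three quarters of the weights robust at negligible risk cost, the schematic bound halves, which is the sense in which layerwise robustness could tighten parameter-counting bounds.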
5 Joint robustness
The theoretical analysis suggests that robustness to either reinitialization or rerandomization could imply better generalization. Combined with the experimental results of the previous sections, this seems to offer a way to explain the empirical observation that hugely overparameterized networks can still generalize well: they use only a small portion of their full capacity. However, there is a caveat: the reinitialization and rerandomization analysis in Section 3 studies each layer independently, and two or more layers being independently robust does not necessarily imply that they are robust jointly. If, for example, we want a generalization bound that uses only half of the capacity, we need to show that half of the layers are robust to reinitialization or rerandomization simultaneously.
5.1 Are robust layers jointly robust?
In this section, we perform a joint robustness analysis on groups of layers. From Section 3.1, we saw that on MNIST, for wide enough FCNs, all the layers above layer1 are robust to reinitialization. We therefore divide the layers into two groups, {layer1} and {layer2, layer3, …}, and perform the robustness studies on the two groups. The results for the FCN are shown in Figure 8(a). For clarity and ease of comparison, the figure still spells out all the layers individually, but the values from layer2 to layer6 are simply repeated rows. The values show that the upper-layer group is clearly not jointly robust to reinitialization (to checkpoint 0).
We also try some alternative grouping schemes: Figure 8(b) shows the results when we group two of every three layers, which slightly improves joint robustness. In Figure 8(c), the grouping scheme that includes every other layer shows that, with a clever grouping scheme, about half of the layers can be jointly robust.
Results on ResNets are similar. Figure 9 shows the joint robustness analysis on ResNets trained on CIFAR10. The grouping is based on the layerwise robustness results from Figure 5: all the residual blocks in stage1 to stage4 are bundled and analyzed jointly. The results resemble those for the FCNs: ResNet18 is relatively robust, but deeper ResNets are not jointly robust under this grouping. Two alternative grouping schemes are shown in Figure 10. By including only layers from stage1 and stage4, slightly improved robustness can be obtained on ResNet50. The scheme that groups every other residual block shows further improvements.
In summary, the individually robust layers are generally not jointly robust. But with a careful choice of a subset of the layers, joint robustness can still be achieved for up to half of the layers. In principle, one could enumerate all possible grouping schemes to find the best one, trading off robustness against the number of layers included.
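The enumeration mentioned above can be sketched as a search over subsets of layers. The `robustness_drop` function below is a hypothetical stand-in for actually evaluating the jointly reinitialized network; its "adjacent layers interfere" behavior merely mimics the every-other-layer observation.

```python
from itertools import combinations

def largest_robust_group(layers, robustness_drop, max_drop):
    """Enumerate grouping schemes from largest to smallest and return the
    first (hence largest) subset whose joint reinitialization causes an
    acceptable error increase."""
    for k in range(len(layers), 0, -1):
        for group in combinations(layers, k):
            if robustness_drop(group) <= max_drop:
                return group
    return ()

layers = ["layer2", "layer3", "layer4", "layer5"]

def robustness_drop(group):
    # Hypothetical model: adjacent layers are not jointly robust.
    idx = sorted(layers.index(g) for g in group)
    return 5.0 if any(b - a == 1 for a, b in zip(idx, idx[1:])) else 0.5

best = largest_robust_group(layers, robustness_drop, max_drop=1.0)
```

The search is exponential in the number of layers, so in practice one would rely on heuristics (such as the alternating scheme) rather than exhaustive enumeration.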
5.2 Could robust layers be made jointly robust?
Results from the previous section show that there is a gap between the layerwise robustness patterns and joint robustness. Here, we try to close the gap by letting the training algorithm know that we are interested in the robustness of a subset of the layers. Expressing this desire algorithmically is complicated, but we can make a stronger request by asking the learning algorithm to explicitly not "use" those layers. More specifically, we apply two interventions to the group of layers that we wish to be robust: 1) freeze them, so that their parameters keep their randomly initialized values; 2) remove the layers completely from the neural network architecture.
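The two interventions can be sketched on a toy dict-of-layers model; layer names, values, and gradients are illustrative.

```python
def train_step(weights, grads, lr, frozen):
    """One SGD step that skips layers flagged as frozen, leaving them at
    their randomly initialized values (intervention 1)."""
    return {name: (w if name in frozen
                   else [wi - lr * gi for wi, gi in zip(w, grads[name])])
            for name, w in weights.items()}

def remove_layers(weights, to_remove):
    """The stronger intervention: drop the layers from the architecture
    entirely (intervention 2)."""
    return {name: w for name, w in weights.items() if name not in to_remove}

weights = {"layer1": [1.0], "layer2": [2.0], "layer3": [3.0]}
grads   = {"layer1": [0.5], "layer2": [0.5], "layer3": [0.5]}
weights = train_step(weights, grads, lr=1.0, frozen={"layer2"})
# layer2 keeps its initialization; layer1 and layer3 are updated
slim = remove_layers(weights, {"layer2"})
```

In a real framework, freezing corresponds to excluding a layer's parameters from the optimizer (or disabling their gradients), while removal changes the architecture definition itself.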
Table 1: Test error rates (%).

          Arch       Full Model  Layerwise Robustness  Layers Frozen  Layers Removed
CIFAR10   ResNet50   8.40        9.77 ± 1.38           11.74          9.23
          ResNet101  8.53        8.87 ± 0.50           9.21           9.23
          ResNet152  8.54        8.74 ± 0.39           9.17           9.23
ImageNet  ResNet50   34.74       38.54 ± 5.36          44.36          41.50
          ResNet101  32.78       33.84 ± 2.10          36.03          41.50
          ResNet152  31.74       32.42 ± 1.55          35.75          41.50
The results are shown in Table 1. When we explicitly freeze the layers, the test error rates are still higher than the average layerwise robustness measured on a normally trained model. However, the gap is much smaller than when measuring the joint robustness directly (see Figure 9 for comparison). Moreover, on CIFAR10, we found that similar performance can be achieved even if we completely remove those layers from the network. On ImageNet, on the other hand, the frozen random layers seem to be needed to achieve good performance, while the "layers-removed" variant underperforms by a large gap. In this case, the random projections (with nonlinearities) in the frozen layers help the performance.
6 Connections to other notions of robustness
The notion of layerwise (and joint) robustness to reinitialization and rerandomization can be related to other notions of robustness in deep learning. For example, the flatness of a solution is a notion of robustness with respect to local perturbations of the network parameters (at convergence) and is extensively discussed in the context of generalization (Hochreiter and Schmidhuber, 1997; Chaudhari et al., 2017; Keskar et al., 2017; Smith and Le, 2018; Poggio et al., 2018). For a fixed layer, our notion of robustness to reinitialization is more restricted, because the "perturbed values" can only come from the optimization trajectory, while robustness to rerandomization potentially allows larger perturbation variances. However, as our studies show, the robustness or flatness at each layer can behave very differently, so analyzing each layer individually, in the context of a specific network architecture, allows us to gain more insight into the robustness behaviors.
On the other hand, adversarial robustness (Szegedy et al., 2013) focuses on robustness with respect to perturbations of the inputs. In particular, it has been found that trained deep neural network models are sensitive to input perturbations: small, adversarially generated perturbations can usually change the prediction to an arbitrary different class. A large number of attack and defense algorithms have been proposed along this line in recent years. Here, we briefly discuss the connection to adversarial robustness. Take a normally trained ResNet,^2 with a number of stages each containing several residual blocks. Given a configuration (r, s), during each test evaluation, s stages are randomly chosen, and for each of the chosen stages, a random residual block is picked and replaced with one of r pre-initialized sets of weights for that block. We keep r pre-allocated sets of weights for each residual block, instead of resampling random numbers on each evaluation call, primarily to reduce the computational burden at test time.

^2 We use a slightly modified variant with an explicit downsample layer between stages, so that all the residual blocks have real identity skip connections. See Figure 7.
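This stochastic evaluation procedure can be sketched as follows; the nested-dict weight layout and the stage/block names are illustrative stand-ins for the actual ResNet.

```python
import random

def randomized_weights(trained, preallocated, r, s, rng):
    """For one evaluation pass: choose s stages at random; in each chosen
    stage, replace one random residual block's weights with one of its r
    pre-allocated random weight sets (pre-allocation avoids resampling
    fresh random numbers at every call)."""
    out = {stage: dict(blocks) for stage, blocks in trained.items()}
    for stage in rng.sample(sorted(trained), s):
        block = rng.choice(sorted(trained[stage]))
        out[stage][block] = preallocated[stage][block][rng.randrange(r)]
    return out

trained = {"stage1": {"blk1": "w11", "blk2": "w12"},
           "stage2": {"blk1": "w21", "blk2": "w22"}}
prealloc = {st: {bk: [f"rand_{st}_{bk}_{i}" for i in range(4)]
                 for bk in blocks}
            for st, blocks in trained.items()}
rng = random.Random(0)
noisy = randomized_weights(trained, prealloc, r=4, s=1, rng=rng)
```

Each call produces a differently perturbed copy of the weights, so the classifier's outputs vary from evaluation to evaluation, which is what makes gradient-based attacks harder to mount.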
From the previous robustness analysis, we expect this stochastic classifier to incur only a small performance drop when averaged over the test set. At the level of individual examples, however, the randomness of the network outputs makes it harder for an attacker to generate adversarial examples. We evaluate the adversarial robustness against a weak FGSM (Goodfellow et al., 2014) attack and a strong PGD (Madry et al., 2017) attack. The results in Table 2 show that, compared to the baseline (the exact same trained model before being turned into a stochastic classifier), the randomness significantly increases adversarial robustness against weak attacks. The performance under the strong PGD attack drops to a very low level, but a nontrivial gap to the baseline remains.
In summary, layerwise robustness can improve the adversarial robustness of a trained model through injected stochasticity. However, it is not a good defense against strong attackers: more sophisticated attacks that explicitly account for stochastic classifiers are likely to break this model completely.
Model Configuration   Clean   FGSM   PGD
baseline
r=4, s=1
r=4, s=2
baseline
r=4, s=1
r=4, s=2
r=4, s=4
Table 2: The adversarial attacks are evaluated on a subset of 1000 test examples. Every experiment is repeated 5 times and the average performance is reported. The hyperparameters r and s in the model configurations denote the number of random weight sets pre-created for each residual block and the number of stages that are rerandomized during each inference pass, respectively. The first group of rows corresponds to a ResNet architecture with two stages, where each stage contains four residual blocks; the second corresponds to a ResNet with four stages, each with four residual blocks.

7 Conclusions
We studied a wide variety of popular models for image classification and investigated the functional structure of overparameterized deep models on a layer-by-layer basis. We introduced the notions of reinitialization and rerandomization robustness. Using these notions, we provided evidence for the heterogeneous character of layers, which can be broadly categorized as either "robust" or "critical". Resetting the robust layers to their initial values has no negative consequence for the model's performance. Our empirical results give further evidence that mere parameter counting or norm accounting is too coarse for studying the generalization of deep models. Moreover, optimization-landscape-based analysis (e.g. flatness or sharpness at the minimizer) is better performed with respect to the network architecture, due to the heterogeneous behaviors of the different layers. For future work, we are interested in devising new algorithms that learn interleaved trained and partially random subnetworks within one large network.
Acknowledgments
The authors would like to thank David Grangier, Lechao Xiao, Kunal Talwar and Hanie Sedghi for helpful discussions and comments.
References
 Allen-Zhu et al. (2018) Allen-Zhu, Z., Li, Y., and Song, Z. (2018). A convergence theory for deep learning via over-parameterization. CoRR, arXiv:1811.03962.
 Anthony and Bartlett (2009) Anthony, M. and Bartlett, P. L. (2009). Neural Network Learning: Theoretical Foundations. Cambridge University Press.
 Arora et al. (2018) Arora, S., Ge, R., Neyshabur, B., and Zhang, Y. (2018). Stronger generalization bounds for deep nets via a compression approach. CoRR, arXiv:1802.05296.
 Bartlett et al. (2017) Bartlett, P. L., Foster, D. J., and Telgarsky, M. J. (2017). Spectrallynormalized margin bounds for neural networks. In Advances in Neural Information Processing Systems, pages 6240–6249.
 Chaudhari et al. (2017) Chaudhari, P., Choromanska, A., Soatto, S., LeCun, Y., Baldassi, C., Borgs, C., Chayes, J., Sagun, L., and Zecchina, R. (2017). Entropy-SGD: Biasing gradient descent into wide valleys. In ICLR.
 Delalleau and Bengio (2011) Delalleau, O. and Bengio, Y. (2011). Shallow vs. deep sum-product networks. In NIPS, pages 666–674.
 Du et al. (2018a) Du, S. S., Lee, J. D., Li, H., Wang, L., and Zhai, X. (2018a). Gradient descent finds global minima of deep neural networks. CoRR, arXiv:1811.03804.
 Du et al. (2018b) Du, S. S., Zhai, X., Poczos, B., and Singh, A. (2018b). Gradient descent provably optimizes overparameterized neural networks. CoRR, arXiv:1810.02054.
 Dziugaite and Roy (2016) Dziugaite, G. K. and Roy, D. M. (2016). Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI.
 Eldan and Shamir (2015) Eldan, R. and Shamir, O. (2015). The Power of Depth for Feedforward Neural Networks. CoRR, arXiv:1512.03965.
 Goodfellow et al. (2014) Goodfellow, I. J., Shlens, J., and Szegedy, C. (2014). Explaining and harnessing adversarial examples. CoRR, arXiv:1412.6572.

 Gybenko (1989) Gybenko, G. (1989). Approximation by superposition of sigmoidal functions. Mathematics of Control, Signals and Systems, 2(4):303–314.
 Han et al. (2015) Han, S., Mao, H., and Dally, W. J. (2015). Deep compression: Compressing deep neural networks with pruning, trained quantization and huffman coding. CoRR, arXiv:1510.00149.
 Hardt and Ma (2017) Hardt, M. and Ma, T. (2017). Identity matters in deep learning. In ICLR.
 He et al. (2016a) He, K., Zhang, X., Ren, S., and Sun, J. (2016a). Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
 He et al. (2016b) He, K., Zhang, X., Ren, S., and Sun, J. (2016b). Identity mappings in deep residual networks. In European conference on computer vision, pages 630–645. Springer.
 Hinton et al. (2015) Hinton, G., Vinyals, O., and Dean, J. (2015). Distilling the knowledge in a neural network. CoRR, arXiv:1503.02531.
 Hochreiter and Schmidhuber (1997) Hochreiter, S. and Schmidhuber, J. (1997). Flat minima. Neural Computation, 9(1):1–42.
 Hornik (1991) Hornik, K. (1991). Approximation capabilities of multilayer feedforward networks. Neural networks, 4(2):251–257.
 Kawaguchi et al. (2017) Kawaguchi, K., Kaelbling, L. P., and Bengio, Y. (2017). Generalization in deep learning. CoRR, arXiv:1710.05468.
 Keskar et al. (2017) Keskar, N. S., Mudigere, D., Nocedal, J., Smelyanskiy, M., and Tang, P. T. P. (2017). On large-batch training for deep learning: Generalization gap and sharp minima. In ICLR.
 Lee et al. (2019) Lee, J., Xiao, L., Schoenholz, S., Bahri, Y., Sohl-Dickstein, J., and Pennington, J. (2019). Wide neural networks of any depth evolve as linear models under gradient descent. Technical report, private communication.
 Liang et al. (2017) Liang, T., Poggio, T., Rakhlin, A., and Stokes, J. (2017). Fisher-Rao metric, geometry, and complexity of neural networks. CoRR, arXiv:1711.01530.
 Madry et al. (2017) Madry, A., Makelov, A., Schmidt, L., Tsipras, D., and Vladu, A. (2017). Towards deep learning models resistant to adversarial attacks. CoRR, arXiv:1706.06083.
 Mhaskar and Poggio (2016) Mhaskar, H. and Poggio, T. A. (2016). Deep vs. shallow networks: An approximation theory perspective. CoRR, arXiv:1608.03287.
 Montufar et al. (2014) Montufar, G. F., Pascanu, R., Cho, K., and Bengio, Y. (2014). On the number of linear regions of deep neural networks. In Advances in neural information processing systems (NIPS), pages 2924–2932.
 Neyshabur et al. (2017) Neyshabur, B., Bhojanapalli, S., McAllester, D., and Srebro, N. (2017). Exploring generalization in deep learning. In Advances in Neural Information Processing Systems, pages 5947–5956.
 Neyshabur et al. (2018) Neyshabur, B., Bhojanapalli, S., and Srebro, N. (2018). A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR.
 Neyshabur et al. (2015) Neyshabur, B., Salakhutdinov, R., and Srebro, N. (2015). Path-SGD: Path-normalized optimization in deep neural networks. In NIPS, pages 2422–2430.
 Nguyen and Hein (2018) Nguyen, Q. and Hein, M. (2018). Optimization Landscape and Expressivity of Deep CNNs. In International Conference on Machine Learning, pages 3727–3736.
 Pinkus (1999) Pinkus, A. (1999). Approximation theory of the MLP model in neural networks. Acta Numerica, 8:143–195.
 Poggio et al. (2018) Poggio, T., Liao, Q., Miranda, B., Banburski, A., Boix, X., and Hidary, J. (2018). Theory IIIb: Generalization in deep networks. Technical report, MIT.
 Rolnick and Tegmark (2017) Rolnick, D. and Tegmark, M. (2017). The power of deeper networks for expressing natural functions. CoRR, arXiv:1705.05502.
 Rosenfeld and Tsotsos (2018) Rosenfeld, A. and Tsotsos, J. K. (2018). Intriguing Properties of Randomly Weighted Networks: Generalizing While Learning Next to Nothing. CoRR, arXiv:1802.00844.
 Shaham et al. (2015) Shaham, U., Cloninger, A., and Coifman, R. R. (2015). Provable approximation properties for deep neural networks. CoRR, arXiv:1509.07385.
 Simonyan and Zisserman (2014) Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, arXiv:1409.1556.
 Smith and Le (2018) Smith, S. L. and Le, Q. V. (2018). A Bayesian perspective on generalization and stochastic gradient descent. In ICLR.
 Szegedy et al. (2013) Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., and Fergus, R. (2013). Intriguing properties of neural networks. CoRR, arXiv:1312.6199.
 Telgarsky (2016) Telgarsky, M. (2016). Benefits of depth in neural networks. In Feldman, V., Rakhlin, A., and Shamir, O., editors, 29th Annual Conference on Learning Theory, volume 49 of Proceedings of Machine Learning Research, pages 1517–1539, Columbia University, New York, New York, USA. PMLR.
 Vapnik (1998) Vapnik, V. N. (1998). Statistical Learning Theory. Adaptive and learning systems for signal processing, communications, and control. Wiley.
 Veit et al. (2016) Veit, A., Wilber, M. J., and Belongie, S. (2016). Residual networks behave like ensembles of relatively shallow networks. In Advances in Neural Information Processing Systems, pages 550–558.
 Wang et al. (2018) Wang, H., Keskar, N. S., Xiong, C., and Socher, R. (2018). Identifying Generalization Properties in Neural Networks. CoRR, arXiv:1809.07402.
 Yun et al. (2018) Yun, C., Sra, S., and Jadbabaie, A. (2018). Finite sample expressive power of small-width ReLU networks. CoRR, arXiv:1810.07770.
 Zhang et al. (2017) Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals, O. (2017). Understanding deep learning requires rethinking generalization. In ICLR.
 Zhou et al. (2019) Zhou, W., Veitch, V., Austern, M., Adams, R. P., and Orbanz, P. (2019). Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In ICLR.
 Zou et al. (2018) Zou, D., Cao, Y., Zhou, D., and Gu, Q. (2018). Stochastic gradient descent optimizes overparameterized deep ReLU networks. CoRR, arXiv:1811.08888.
Appendix A Details on experiment setup
Our empirical studies are based on the MNIST, CIFAR10 and ILSVRC 2012 ImageNet datasets. Stochastic Gradient Descent (SGD) with a momentum of 0.9 is used to minimize the multiclass cross-entropy loss. Each model is trained for 100 epochs, using a stage-wise constant learning rate schedule that multiplies the rate by a factor of 0.2 at epochs 30, 60, and 90. A batch size of 128 is used, except for ResNets with more than 50 layers on ImageNet, where a batch size of 64 is used due to device memory constraints.
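For concreteness, the learning rate schedule above can be sketched as a small function; the base rate of 0.1 below is a placeholder, as the initial learning rate is not stated here:

```python
def learning_rate(epoch, base_lr=0.1):
    """Stage-wise constant schedule: multiply the rate by 0.2 at
    epochs 30, 60, and 90. base_lr = 0.1 is illustrative only."""
    factor = 1.0
    for milestone in (30, 60, 90):
        if epoch >= milestone:
            factor *= 0.2
    return base_lr * factor
```

The rate is thus constant within each stage and decays by 5x at each milestone.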
We mainly study three types of neural network architectures:

FCNs: multilayer perceptrons consisting of fully connected layers with equal output dimension and ReLU activation (except for the last layer, where the output dimension equals the number of classes and no ReLU is applied). For example, FCN
has three fully connected layers with output dimension 256, and an extra final (fully connected) classifier layer. 
VGGs: widely used network architectures from Simonyan and Zisserman (2014).

ResNets: the results from our analysis are similar for ResNets V1 (He et al., 2016a) and V2 (He et al., 2016b). We report our results with ResNets V2 due to its slightly better performance in most cases. For the large image sizes of ImageNet, stage 0 contains a convolution and a max pooling (both with stride 2) that reduce the spatial dimension (from 224 to 56). For smaller image sizes, as in CIFAR10, we use a convolution with stride 1 here to avoid reducing the spatial dimension.
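As a minimal sketch of the FCN forward pass described above (numpy, biases omitted; the He-style initialization and the concrete dimensions are our illustrative assumptions, not necessarily the exact training setup):

```python
import numpy as np

def init_fcn(input_dim, num_classes, width=256, depth=3, seed=0):
    # `depth` fully connected layers of equal output dimension `width`,
    # plus a final classifier layer with `num_classes` outputs.
    rng = np.random.default_rng(seed)
    dims = [input_dim] + [width] * depth + [num_classes]
    return [rng.standard_normal((m, n)) * np.sqrt(2.0 / m)
            for m, n in zip(dims[:-1], dims[1:])]

def fcn_forward(params, x):
    # ReLU after every layer except the last (classifier) layer.
    for W in params[:-1]:
        x = np.maximum(x @ W, 0.0)
    return x @ params[-1]
```

For MNIST, `input_dim` would be 784 and `num_classes` 10.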
During training, CIFAR10 images are padded with 4 pixels of zeros on all sides, then randomly flipped (horizontally) and cropped. ImageNet images are randomly cropped during training and center-cropped during testing. On each dataset, the global mean and standard deviation are computed over all training pixels and used to normalize the inputs.
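The CIFAR10 training augmentation above can be sketched as follows (a minimal numpy version; implementation details such as the relative order of flip and crop may differ in practice):

```python
import numpy as np

def augment_cifar(img, rng):
    """Pad 4 pixels of zeros on every side, randomly flip horizontally,
    then take a random 32x32 crop. `img` is a (32, 32, C) array."""
    padded = np.pad(img, ((4, 4), (4, 4), (0, 0)))  # -> (40, 40, C), zeros
    if rng.random() < 0.5:
        padded = padded[:, ::-1, :]                 # horizontal flip
    top, left = rng.integers(0, 9, size=2)          # 40 - 32 + 1 offsets
    return padded[top:top + 32, left:left + 32, :]
```

Each call draws a fresh flip and crop offset from the supplied `rng`.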
Appendix B Batch normalization and weight decay
The primary goal of this paper is to study the (co-)evolution of the representations at each layer during training and the robustness of these representations with respect to the rest of the network. In our analysis we try to minimize the factors that explicitly encourage the network weights or representations to change. In particular, unless otherwise specified, weight decay and batch normalization are not used. This leads to some performance drop in the trained models, especially for deep residual networks: even though we can successfully train residual networks with more than 100 layers without batch normalization, their final generalization performance can be considerably worse than the state of the art. Therefore, in this section, we include studies on networks trained with weight decay and batch normalization for comparison.
Table 3: Final test error rates (%) of models trained with and without weight decay (+wd) and batch normalization (+bn).

Architecture | N/A  | +wd  | +bn  | +wd+bn
CIFAR10
ResNet18     | 10.4 | 7.5  | 6.9  | 5.5
ResNet34     | 10.2 | 6.9  | 6.6  | 5.1
ResNet50     | 8.4  | 9.9  | 7.6  | 5.0
ResNet101    | 8.5  | 9.8  | 6.9  | 5.3
ResNet152    | 8.5  | 9.7  | 7.3  | 4.7
VGG11        | 11.8 | 10.7 | 9.4  | 8.2
VGG13        | 10.3 | 8.8  | 8.4  | 6.7
VGG16        | 11.0 | 11.4 | 8.5  | 6.7
VGG19        | 12.1 | –    | 8.6  | 6.9
ImageNet
ResNet18     | 41.1 | 33.1 | 33.5 | 31.5
ResNet34     | 39.9 | 30.6 | 30.1 | 27.2
ResNet50     | 34.8 | 31.8 | 28.2 | 25.0
ResNet101    | 32.9 | 29.9 | 26.9 | 22.9
ResNet152    | 31.9 | 29.1 | 27.6 | 22.6
In particular, Table 3 shows the final test error rates of models trained with or without weight decay and batch normalization. Note that the original VGG models do not use batch normalization (Simonyan and Zisserman, 2014); we list +bn variants here for comparison, obtained by applying batch normalization to the output of each convolutional layer. On CIFAR10, the performance gap varies from 3% to 5%, but on ImageNet a performance gap as large as 10% can be seen when training without weight decay and batch normalization. Figure 11 shows how different training configurations affect the layer-wise robustness patterns of VGG16 networks. We found that when batch normalization is used, none of the layers are robust anymore.
Figure 12 and Figure 13 show similar comparisons for ResNet50 on CIFAR10 and ImageNet, respectively. Unlike for VGGs, we found that the layer-wise robustness patterns remain quite pronounced under various training conditions for ResNets. In Figure 12(d) and Figure 13(c,d), we see the mysterious phenomenon that re-initializing with checkpoint 1 is less robust than with checkpoint 0 for many layers. We do not know exactly why this happens. It might be that aggressive learning during the early stages causes changes of large magnitude in the parameters or statistics, and that later on, when most of the training samples are classified correctly, the network gradually rebalances the layers into a more robust state. Figure 15(d–f) in the next section shows supporting evidence that, in this case, the distance of the parameters between checkpoint 0 and checkpoint 1 is larger than between checkpoint 0 and the final checkpoint. On ImageNet, however, this correlation is no longer clear, as seen in Figure 16(d–f). See the discussion in the next section for more details.
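For reference, the re-initialization probe used throughout this appendix amounts to the following sketch; the dict-of-weights representation and the `evaluate` callback are hypothetical stand-ins for the actual model code:

```python
import copy

def reinit_robustness(trained_params, checkpoint_params, layer, evaluate):
    """Reset a single layer to its value at an earlier checkpoint
    (checkpoint 0 = random initialization, checkpoint 1 = after the
    first epoch, ...), keep all other layers at their trained values,
    and re-evaluate the network without any retraining.

    trained_params / checkpoint_params: dicts mapping layer name -> weights.
    evaluate: callable returning the test error for a full parameter dict.
    """
    probed = copy.deepcopy(trained_params)
    probed[layer] = copy.deepcopy(checkpoint_params[layer])
    return evaluate(probed)
```

A layer is called robust when this probe barely changes the test error.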
Appendix C Robustness and distances
In Figure 1 in Section 3.1, we compared the layer-wise robustness pattern to the layer-wise distances of the parameters from their values at initialization (checkpoint 0). We found that for FCNs on MNIST, there is no obvious correlation between the “amount of parameter updates received” at each layer and its robustness to re-initialization for the two distances (the normalized ℓ2 and ℓ∞ norms) we measured. In this appendix, we list results on the other models and datasets studied in this paper for comparison.
Figure 14 shows the layer-wise robustness plot along with the layer-wise distance plots for VGG16 trained on CIFAR10. We found that the distances of the top layers are large, yet the model is robust when we re-initialize those layers. However, the normalized distance seems to be correlated with the layer-wise robustness patterns: the lower layers that are less robust have larger distances to their initialized values.
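The layer-wise distance computation can be sketched as follows; the two concrete measures here, a 2-norm normalized by the norm of the initialization and the infinity norm, are our assumptions about the distances discussed above:

```python
import numpy as np

def layer_distances(trained, init):
    """Per-layer distance between trained weights and their values at
    checkpoint 0. trained, init: dicts mapping layer name -> array."""
    out = {}
    for name, w in trained.items():
        d = (w - init[name]).ravel()
        out[name] = {
            # 2-norm of the update, normalized by the initialization's norm
            "normalized_l2": np.linalg.norm(d) / np.linalg.norm(init[name].ravel()),
            # largest single-coordinate change
            "linf": float(np.abs(d).max()),
        }
    return out
```

Comparing these per-layer values against the robustness heatmaps gives the correlation plots discussed in this appendix.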
Similar plots for ResNet50 on CIFAR10 and ImageNet are shown in Figure 15 and Figure 16, respectively. In each figure, we also show extra results for models trained with weight decay and batch normalization. For the case without weight decay and batch normalization, we can see a weak correlation: the layers that are sensitive have slightly larger distances to their random initialization values. For the case with weight decay and batch normalization, the situation is less clear. First of all, in Figure 15(e–f), we see very large distances in a few layers at checkpoint 1. This provides a potential explanation for the mysterious pattern that re-initialization to checkpoint 1 is more sensitive than to checkpoint 0. Similar observations can be made in Figure 16(e–f) for ImageNet.
Appendix D Alternative visualizations
The empirical results on layer robustness are mainly visualized as heatmaps in the main text. The heatmaps allow uncluttered comparison of the results across layers and training epochs. However, it is not easy to distinguish numerical values that are close to each other from the color coding alone. In this section, we provide alternative visualizations that show the same results as line plots. In particular, Figure 17 shows the layer-wise robustness analysis for VGG16 on CIFAR10. Figure 18 and Figure 19 show the results for ResNet50 on CIFAR10 and ImageNet, respectively.