1 Introduction
Deep Neural Networks (DNNs) have been extensively used in various domains [24]. Their success depends heavily on improved training techniques [15, 21, 14], e.g., careful weight initialization [15, 11, 37, 13], normalization of internal representations [21, 42], and well-designed optimization methods [45, 22]. These techniques are believed to be closely connected to the curvature of the loss [37, 36, 23], and analyzing this curvature is essential for understanding many learning behaviors of DNNs.
In the optimization community, conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. It has been well explored for linear models, both in regression [26] and classification [40], where the convergence of the optimization problem is controlled by the maximum eigenvalue of the curvature matrix [26, 25], and the learning time of the model is lower-bounded by the condition number of the curvature matrix [26, 25]. However, in the context of deep learning, conditioning analysis faces several barriers: 1) the model is over-parameterized, and it is unclear whether the directions corresponding to small/zero eigenvalues contribute to the optimization progress [35, 32]; 2) the memory and computational costs are extremely expensive [35, 10].
This paper aims to bridge the theoretical analysis in the optimization community and the empirical techniques in training DNNs, for a better understanding of the learning dynamics of DNNs. We propose a layer-wise conditioning analysis, in which we analyze the optimization landscape with respect to each layer independently by exploring the spectra of the layers' curvature matrices. The theoretical insight behind our layer-wise conditioning analysis is based on the recent success of second-order curvature approximation techniques in DNNs [30, 29, 1, 39, 3]. We show that, under mild assumptions, the maximum eigenvalue and the condition number of the block-wise Fisher Information Matrix (FIM) can be characterized by the spectra of the covariance matrices of the layer input and output-gradient, which makes evaluating optimization behavior practical for DNNs. Another theoretical basis is the recently proposed proximal back-propagation [6, 9, 46], in which the original optimization problem can be approximately decomposed into multiple independent sub-problems with respect to each layer [46]. We provide the connection between our analysis and proximal back-propagation.
Based on our layer-wise conditioning analysis, we show that Batch Normalization (BN) [21] can adjust the magnitude of the layer activations/gradients, and thus stabilizes training. However, this kind of stabilization can drive certain layers into a state, referred to as weight domination, in which the gradient update is feeble; this sometimes has detrimental effects on learning (see Section 4.1 for details). We also experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Furthermore, we find that the unnormalized network has many small eigenvalues in its layer curvature matrices, which is mainly caused by so-called dying neurons (Section 4.2), while this behavior is almost absent in the batch normalized network.
We further analyze an often-overlooked difficulty in training very deep residual networks [14]. Using our layer-wise conditioning analysis, we show that the difficulty mainly arises from the ill-conditioned behavior of the last linear layer. We solve this problem by adding only one BN layer before the last linear layer, which achieves improved performance over the original residual networks (Section 5).
2 Preliminaries
Optimization Objective
Consider a true data distribution and a sampled training set of size . We focus on a supervised learning task aiming to learn the conditional distribution using the model , where is represented as a function parameterized by . From an optimization perspective, we aim to minimize the empirical risk, averaged over the sample loss on the training set : .
Gradient Descent
In general, the gradient descent (GD) update seeks to iteratively reduce the loss by , where is the learning rate. Moreover, for large-scale learning, stochastic gradient descent (SGD) is extensively used to approximate the gradients with a mini-batch gradient. In theory, the convergence behavior (e.g., the number of iterations required for convergence to a stationary point) depends on the Lipschitz constant of the gradient function of (the loss's gradient function is assumed Lipschitz continuous with Lipschitz constant , i.e., for all and ), which characterizes the global smoothness of the optimization landscape. In practice, the Lipschitz constant is either unknown for complicated functions or too conservative to characterize the convergence behavior [5]. Researchers thus turn to local smoothness, characterized by the Hessian matrix under the condition that is twice differentiable.
Approximate Curvature Matrices
The Hessian describes the local curvature of the optimization landscape. Such curvature information intuitively guides the design of second-order optimization algorithms [35, 5], where the update direction is adjusted by multiplying by the inverse of a preconditioning matrix as: . The preconditioning matrix is a positive definite matrix that approximates the Hessian and is expected to capture the positive curvature of the Hessian. The second moment matrix of the sample gradients is usually used as the preconditioning matrix [34, 27]. Besides, Pascanu and Bengio [33] show that the Fisher Information Matrix (FIM) can be viewed as a preconditioning matrix when performing the natural gradient descent algorithm [33]. For more analysis of the connections among these matrices, please refer to [28, 5]. In this paper, we refer to the analysis of the spectrum of the (approximate) curvature matrices as conditioning analysis.
Conditioning Analysis for Linear Models
Consider a linear regression model with a scalar output , and mean square error loss . As shown in [26, 25], the learning dynamics on such a quadratic surface are fully controlled by the spectrum of the Hessian matrix . Two statistics are essential for evaluating the convergence behavior of the optimization problem. One is the maximum eigenvalue of the curvature matrix, λ_max, and the other is the condition number of the curvature matrix, denoted by κ = λ_max/λ_min, where λ_min is the minimum nonzero eigenvalue of the curvature matrix. Specifically, λ_max controls the upper bound and the optimal value of the learning rate (e.g., the optimal learning rate is and training will diverge if ). κ controls the number of iterations required for convergence (e.g., the lower bound on the iterations is [26]). If the curvature matrix is an identity matrix, which can be achieved by whitening the input, GD can converge within only one iteration. It is easy to extend the solution of linear regression from a scalar output to a vectorial output
. In this case, the Hessian is represented as

  H = E[xxᵀ] ⊗ I,   (1)

where ⊗ indicates the Kronecker product [12] and I denotes the identity matrix. For the linear classification model with cross-entropy loss, the Hessian is approximated by [40]:

  H ≈ E[xxᵀ] ⊗ A.   (2)

Here, A is defined by A = (1/C)(I − (1/C)1), where C is the number of categories and 1 denotes a matrix whose entries are all one. Eqn. 2 assumes that the Hessian does not change significantly from the initial region to the optimal region [40].
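As a concrete illustration of the two statistics above, the following sketch (our own toy example; names such as `gd_loss` are ours) fits a linear regression with gradient descent and checks the classic stability threshold 2/λ_max for quadratic objectives:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 1000, 10
X = rng.normal(size=(n, d)) * np.linspace(1.0, 5.0, d)  # badly scaled features
y = X @ rng.normal(size=d)

H = X.T @ X / n                          # Hessian of the quadratic MSE loss
eigs = np.linalg.eigvalsh(H)
lam_max, lam_min = eigs[-1], eigs[0]
kappa = lam_max / lam_min                # condition number

def gd_loss(lr, steps=200):
    """Final loss of gradient descent on 0.5 * ||Xw - y||^2 / n."""
    w = np.zeros(d)
    g_const = X.T @ y / n
    for _ in range(steps):
        w -= lr * (H @ w - g_const)      # exact gradient of the quadratic
    return 0.5 * np.mean((X @ w - y) ** 2)

# Stable just below the threshold 2/lam_max, divergent just above it.
loss_stable = gd_loss(1.9 / lam_max)
loss_diverged = gd_loss(2.2 / lam_max)
```

Note that the larger κ is, the slower the slowest direction contracts, which is exactly the iteration lower bound discussed above.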
3 Conditioning Analysis for DNNs
Considering a Multilayer Perceptron (MLP), the network can be represented as a layer-wise composition of linear and nonlinear transformations, as follows:

  hᵏ = φ(Wᵏ hᵏ⁻¹),  k = 1, …, K,   (3)

where h⁰ = x is the input, φ is the element-wise nonlinearity, and {Wᵏ} are the learnable parameters. To simplify notation, we set hᴷ as the output of the network.
Conditioning analysis on the full curvature matrix of a DNN is difficult due to the prohibitive memory and computational costs [10, 32]. We thus seek to analyze an approximation of the curvature matrix. One successful example in second-order optimization over DNNs is approximating the Fisher Information Matrix (FIM) by Kronecker products (KFAC) [29, 1, 39, 3]. The KFAC approach makes two assumptions: 1) the weight-gradients in different layers are uncorrelated; 2) the input and output-gradient in each layer are approximately independent. The full FIM can then be represented as a block diagonal matrix, where each block is a sub-FIM (the FIM with respect to the parameters of one layer), computed as:
  F = E[xxᵀ] ⊗ E[ggᵀ].   (4)
Here, x denotes the layer input and g denotes the layer output-gradient. We note that Eqn. 4 is similar to Eqns. 1 and 2: all of them depend on the covariance matrix of the (layer) input. The main difference is that in Eqn. 4, the covariance of the output-gradient is considered and its value changes across different optimization regions, while in Eqns. 1 and 2, the covariance of the output-gradient is constant.
Based on this observation, we propose layer-wise conditioning analysis: we analyze the spectrum of each sub-FIM independently. We expect the spectra of the sub-FIMs to reveal that of the full FIM, at least for the purpose of analyzing the learning dynamics of DNNs. Specifically, we analyze the maximum eigenvalue λ_max and the condition number κ. (Since DNNs are usually over-parameterized, we evaluate the more general condition number with respect to a percentage: κ_p = λ₁/λ_⌈pm⌉, where λ_i is the i-th eigenvalue in descending order and m is the number of eigenvalues; e.g., κ with p = 100% is the original definition of the condition number.) Based on the conclusions of the conditioning analysis for linear models in Section 2, there are two remarkable properties, which can be used to uncover the landscape of the optimization problem implicitly:

Property 1: λ_max indicates the magnitude of the weight-gradient in each layer, which shows the steepness of the landscape with respect to different layers.

Property 2: κ indicates how easy it is to optimize the corresponding layer.
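The generalized condition number with respect to a percentage can be read as code as follows (a sketch under one possible indexing convention, κ_p = λ₁/λ_⌈p·m⌉; the paper's exact convention may differ):

```python
import numpy as np

def generalized_condition_number(eigvals, p=1.0):
    """kappa_p = lambda_1 / lambda_{ceil(p*m)} over eigenvalues sorted in
    descending order; p=1.0 recovers the usual lambda_max / lambda_min."""
    lam = np.sort(np.asarray(eigvals, dtype=float))[::-1]
    m = lam.size
    idx = max(int(np.ceil(p * m)) - 1, 0)  # 1-based ceil(p*m) -> 0-based index
    return lam[0] / lam[idx]
```

With p below 100%, the tiny trailing eigenvalues of an over-parameterized model are ignored, which is the point of the generalization.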
Note that the assumptions required for our analysis are weaker than those of the KFAC approximation, since we only care about whether the spectrum of the full FIM matches that of the approximation. We conduct experiments to analyze the training dynamics of unnormalized ('Plain') and batch normalized [21] ('BN') networks by analyzing the spectra of the full FIM and the sub-FIMs. We find that the conclusions drawn by analyzing the full FIM can also be derived by analyzing the sub-FIMs. Please refer to Appendix B for details. Furthermore, we argue that layer-wise conditioning analysis can uncover more information about the training dynamics of DNNs than analysis using the full curvature matrix; e.g., we can diagnose and locate gradient vanishing/explosion in specific layers. We will elaborate on this in the subsequent sections.
Efficient Computation
We denote the covariance matrix of the layer input by Σ_x and the covariance matrix of the layer output-gradient by Σ_g. The condition number and maximum eigenvalue of the sub-FIM can be exactly derived from the spectra of Σ_x and Σ_g, as shown in the following proposition.
Proposition 1.
Given the sub-FIM F = Σ_x ⊗ Σ_g, with Σ_x and Σ_g positive semi-definite, we have: 1) λ_max(F) = λ_max(Σ_x) · λ_max(Σ_g); 2) κ(F) = κ(Σ_x) · κ(Σ_g).
The proof is shown in Appendix A.1. Proposition 1 provides an efficient way to calculate the maximum eigenvalue and condition number of a sub-FIM by computing those of Σ_x and Σ_g. In practice, we use the empirical distribution to approximate the expected distribution when calculating Σ_x and Σ_g, since this is very efficient and can be performed with only one forward and backward pass. Such an approximation has been shown to be effective and efficient in FIM approximation methods [29, 1].
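Proposition 1 rests on the fact that the eigenvalues of a Kronecker product are the pairwise products of the factors' eigenvalues. A quick numerical check (our own sketch; the factor order in the Kronecker product does not affect the spectrum):

```python
import numpy as np

rng = np.random.default_rng(1)

def random_cov(d):
    A = rng.normal(size=(d, d))
    return A @ A.T / d + 0.1 * np.eye(d)   # positive definite "covariance"

Sigma_x, Sigma_g = random_cov(4), random_cov(3)
F = np.kron(Sigma_g, Sigma_x)              # Kronecker-factored sub-FIM

lam_F = np.linalg.eigvalsh(F)              # ascending order
lam_x = np.linalg.eigvalsh(Sigma_x)
lam_g = np.linalg.eigvalsh(Sigma_g)

lam_max_product = lam_x[-1] * lam_g[-1]                          # = lambda_max(F)
kappa_product = (lam_x[-1] / lam_x[0]) * (lam_g[-1] / lam_g[0])  # = kappa(F)
```

Computing the two small eigendecompositions instead of the one on the full Kronecker product is exactly what makes the analysis cheap.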
3.1 Connection to Proximal Backpropagation
Carreira-Perpinan and Wang [6] propose the method of auxiliary coordinates (MAC) to redefine the optimization objective with equality constraints imposed on each neuron. They solve the constrained optimization by adding a quadratic penalty, and the optimization objective is defined as follows:
(5) 
where is a function with respect to each layer. It has been shown in [6] that the solution of minimizing converges to the solution of minimizing as , under mild conditions. Furthermore, the proximal propagation proposed in [9] reformulates each sub-problem independently in a backward order, minimizing each layer's objective given the target signal from the upper layer, as follows:
(6) 
It has been shown that the gradient updates with respect to can be equivalent to the gradient updates with respect to Eqn. 6, given an appropriate step size. Please refer to [9, 46] for more details.
Interestingly, if we view the auxiliary variable as the pre-activation of a certain layer, the sub-optimization problem in each layer is formulated as:
(7) 
It is clear that the sub-optimization problems with respect to (note that the target signal is also to be optimized in Eqn. 7, to provide the target signal to the lower layer) are actually linear classification (for k = K) or regression (for k < K) models. Their conditioning analysis is well characterized in Section 2.
This connection suggests that: 1) the quality (conditioning) of the full optimization problem is well correlated with that of its sub-optimization problems shown in Eqn. 7, whose local curvature matrices can be explored directly; 2) we can diagnose the ill-conditioned behaviors of a DNN by inspecting its spectrum with respect to a certain layer. We will elaborate on this in Sections 4 and 5.
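To make the connection tangible: for fixed targets, the inner-layer subproblem of Eqn. 7 is an ordinary least-squares regression. A minimal sketch (hypothetical shapes and variable names; the real subproblem also optimizes the incoming activations):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d_in, d_out = 128, 8, 4
z_prev = rng.normal(size=(n, d_in))     # activations from the layer below
z_target = rng.normal(size=(n, d_out))  # target pre-activations from above

# For k < K, minimizing ||z_prev @ W - z_target||^2 over W is linear
# regression, so its conditioning is governed by the spectrum of
# z_prev.T @ z_prev / n, exactly as analyzed in Section 2.
W, *_ = np.linalg.lstsq(z_prev, z_target, rcond=None)
grad_at_solution = z_prev.T @ (z_prev @ W - z_target) / n
```

At the least-squares solution, the gradient of the per-layer objective vanishes, which is what conditioning analysis of the quadratic predicts.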
4 Exploring Batch Normalized Networks
Let x denote the input for a given neuron in one layer of a DNN. Batch normalization (BN) [21] standardizes the neuron over the mini-batch data by:

  x̂ = (x − μ)/√(σ² + ε),   (8)

where μ and σ² are the mini-batch mean and variance, respectively, and ε is a small constant for numerical stability. The learnable parameters γ and β are used to recover the representation capacity. BN is a ubiquitously employed technique in various architectures [21, 14, 44, 18] due to its ability to stabilize and accelerate training. Here, we explore how BN stabilizes and accelerates training based on our layer-wise conditioning analysis.
4.1 Stabilizing Training
From the perspective of a practitioner, two phenomena relate to the instability of training a neural network: 1) the training loss first increases significantly and then diverges; 2) the training loss hardly changes compared to the initial condition (e.g., random guessing for classification). The former is mainly caused by weights with overly large updates (e.g., exploded gradients or optimization with a large learning rate). The latter is caused by weights with negligible updates (vanished gradients or optimization with a small learning rate). In the following theorem, we show that an unnormalized rectifier neural network is very likely to encounter both phenomena.
Theorem 1.
Given a rectifier neural network (Eqn. 3) with nonlinearity ( ), if the weight in each layer is scaled by ( and ), we have the scaled layer input: . Under the assumption that , we have the output-gradient: , and weight-gradient: , for all .
The proof is shown in Appendix A.2. From Theorem 1, we observe that the scaling factor of the weight in one layer will affect the weight-gradients of all other layers. Specifically, if all ( ), the weight-gradient will increase (decrease) exponentially across layers in one iteration. Moreover, such an exponentially increased weight-gradient will be sustained and amplified in subsequent iterations, due to the increased magnitude of the weights caused by the updates. This is why an unnormalized neural network diverges once the training loss increases over a few consecutive iterations.
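Theorem 1's exponential growth/decay is easy to reproduce numerically. The sketch below (our own toy setup, not the paper's experiment) propagates a batch through a 20-layer random ReLU MLP whose He-initialized weights are additionally scaled by a factor a:

```python
import numpy as np

rng = np.random.default_rng(4)
depth, width = 20, 256

def final_activation_norm(a):
    """Norm of the last layer's activations when every He-initialized
    weight matrix is scaled by a (Theorem 1's scaling factor)."""
    h = rng.normal(size=(64, width))
    for _ in range(depth):
        W = a * rng.normal(size=(width, width)) * np.sqrt(2.0 / width)
        h = np.maximum(h @ W, 0.0)          # ReLU layer, as in Eqn. 3
    return np.linalg.norm(h)

norm_stable = final_activation_norm(1.0)   # roughly preserved magnitude
norm_shrunk = final_activation_norm(0.5)   # decays roughly like 0.5^depth
norm_grown = final_activation_norm(2.0)    # grows roughly like 2^depth
```

A per-layer factor of 0.5 or 2 compounds to a factor of roughly a million over 20 layers, which matches the exponential behavior the theorem describes.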
We further show that such instability can be relieved by batch normalization, based on the following theorem.
Theorem 2.
Under the same conditions as in Theorem 1, for the normalized network with and , we have: , , , for all .
The proof is shown in Appendix A.3. From Theorem 2, the scaling factor of the weight will not affect other layers' activations/gradients. The magnitude of the weight-gradient is inversely proportional to the scaling factor. Such a mechanism stabilizes weight growth/reduction, as shown in [21, 43]. Note that these stabilizing behaviors (Theorem 2) also apply to other activation normalization methods [2, 41, 16]. We note that the scale-invariance of BN in stabilizing training has been analyzed in previous work [2]. Different from those analyses, which focus on the normalization layer itself, we provide an explicit formulation of the weight-gradients and output-gradients in a network, which is more important for characterizing the learning dynamics of DNNs.
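Theorem 2's scale-invariance can be checked directly: standardizing over the mini-batch cancels any scaling of the preceding weights (up to the ε term). A minimal sketch, assuming a BN without the learnable affine parameters:

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    """Standardize each neuron over the mini-batch (Eqn. 8, no affine part)."""
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

rng = np.random.default_rng(3)
x = rng.normal(size=(256, 32))
W = rng.normal(size=(32, 16))
a = 10.0                                   # weight scaling factor

out = batch_norm(x @ W)
out_scaled = batch_norm(x @ (a * W))       # scaling is washed out by BN
```

Because the two outputs coincide, the layers downstream of BN see the same activations regardless of how the weights are scaled, which is the stabilization mechanism the theorem formalizes.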
Empirical Analysis
We further conduct experiments to show how the activations/gradients are affected by initialization in unnormalized DNNs (indicated as 'Plain') and batch normalized DNNs (indicated as 'BN'). We observe each layer's λ_max(Σ_x) and λ_max(Σ_g), because: 1) λ_max(Σ_x) indicates the magnitude of the layer input; 2) λ_max(Σ_g) indicates the magnitude of the layer output-gradient; 3) the magnitude of the weight-gradient, which relates to the upper bound of the learning rate, can be derived from λ_max(Σ_x) and λ_max(Σ_g). We train a 20-layer MLP, with 256 neurons in each layer, for MNIST classification. The nonlinearity is ReLU. We use full gradient descent (we also perform SGD with a batch size of 1024, and further perform experiments on convolutional neural networks for CIFAR-10 classification; the results are shown in Appendix C.1, in which we have the same observations as for full gradient descent), and show the best performance among learning rates in . We observe that the magnitude of the layer input (output-gradient) of 'Plain' under random initialization [25] decreases exponentially during the forward (backward) pass (Figure 2 (a)). The main reason for this is that the weights have small magnitudes, per Theorem 1. This problem can be relieved by He initialization [13], under which the magnitude of the input/output-gradient is stable across layers (Figure 2 (d)). We observe that BN preserves the magnitude of the input/output-gradient across layers for both initialization methods, as shown in Figure 2 (a) and (d).
Weight Domination
It has been shown in [2] that the scale-invariant property of BN has an implicit early-stopping effect on the weight matrices, helping to stabilize learning towards convergence. Here, we show that such 'early stopping' is layer-wise, and in certain layers it has detrimental effects on learning, since it creates the false impression of a local minimum. For illustration, we provide a rough definition, called weight domination, with respect to a certain layer.
Definition 4.1.
Let W_k and G_k denote the weight matrix and its gradient in layer k. If σ_max(G_k) ≪ σ_max(W_k), where σ_max(·) indicates the maximum singular value of a matrix, we say that layer k is in a state of weight domination.

Weight domination implies a smoother gradient with respect to this layer. This is a desirable property for a linear model (where the distribution of the input is fixed), and is what optimization algorithms target. However, weight domination is not always desirable for a certain layer of a neural network, since such a state can be caused by an increased magnitude of the weight matrix or a decreased magnitude of the layer input (the non-convex optimization in Eqn. 7), rather than by the optimization objective itself. Although BN ensures a stable distribution of the layer input, a network with BN may still significantly increase the magnitude of the weights in certain layers. We experimentally observe this phenomenon, as shown in Appendix C.2. A similar phenomenon is also observed in [43], where BN results in large updates of the corresponding weights.
Weight domination sometimes harms the learning of the network, because this state limits the representation ability of the corresponding layer. To investigate this, we conduct experiments on a five-layer MLP and show the results in Figure 3. We observe that a network with certain layers in the state of weight domination can still decrease the loss, but it has degraded performance. We also conduct experiments on Convolutional Neural Networks (CNNs) on the CIFAR-10 dataset, shown in Appendix C.2.
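Under our reading of Definition 4.1, weight domination can be monitored during training with a single spectral-norm ratio per layer. A sketch (the threshold for "feeble" is a judgment call, not specified by the definition):

```python
import numpy as np

def weight_domination_ratio(W, grad_W):
    """sigma_max(gradient) / sigma_max(weight); a value much smaller than 1
    suggests the layer is in the weight-domination state of Definition 4.1."""
    return np.linalg.norm(grad_W, ord=2) / np.linalg.norm(W, ord=2)
```

In practice one would log this ratio for every layer at each epoch and flag layers whose ratio collapses while the others' do not.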
4.2 Improved Conditioning
One motivation for BN is that whitening the input can improve the conditioning of the optimization problem [21] (e.g., for a linear model, the Hessian will be an identity matrix if the input is whitened, based on Eqn. 1), and thus accelerate training [8, 20]. However, this motivation has hardly been further validated, either theoretically or empirically [8]. Furthermore, it holds only when BN is placed before the linear layer, while, in practice, BN is typically placed after the linear layer, as recommended in [21]. In this section, we empirically explore this motivation based on our layer-wise conditioning analysis in the scenario of training DNNs.
We first experimentally observe that BN not only improves the conditioning of the layer input's covariance matrix, but also improves the conditioning of the output-gradient's covariance matrix, as shown in Figure 4. It has been shown that centered data is more likely to be well-conditioned [26, 38, 31, 19]. This suggests that placing BN after the linear layer can improve the conditioning of the output-gradient, because centering the activations, with the gradient back-propagating through such a transformation [21], also centers the gradient.
We further observe that the unnormalized network ('Plain') has many small eigenvalues. For further exploration, we monitor the output of each neuron in each layer, and find that 'Plain' has some neurons that are never activated (zero output of the ReLU) for any training example; we refer to this kind of neuron as a dying neuron. We also observe that 'Plain' has some neurons that are activated for every training example; we refer to these as full neurons. This observation is most pronounced in the initial iterations. The number of dying/full neurons increases with depth (Figure 5). We conjecture that the dying neurons cause 'Plain' to have many small/zero eigenvalues. In contrast, batch normalized networks have no dying/full neurons, because the centering operation ensures that roughly half of the examples activate each neuron. This further suggests that placing BN before the nonlinear activation can improve the conditioning.
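Counting dying and full neurons only requires the post-ReLU activations of each layer over the training set. A minimal sketch (our own helper; `h` is assumed to hold one layer's ReLU outputs, one example per row):

```python
import numpy as np

def count_dying_full(h):
    """Return (#dying, #full) neurons from post-ReLU activations h of
    shape (num_examples, num_neurons): dying = never activated on any
    example, full = activated on every example."""
    active = h > 0
    dying = int((~active).all(axis=0).sum())
    full = int(active.all(axis=0).sum())
    return dying, full
```

A dying neuron contributes a zero row/column to the layer's input covariance for the next layer, which is one way the small/zero eigenvalues arise.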
5 Training Very Deep Residual Network
Residual networks [14] have significantly relieved the difficulty of training deep networks through their introduction of the residual connection, which makes training networks with hundreds or even thousands of layers possible. However, residual networks also suffer from degraded performance when the model is extremely deep (e.g., the 1202-layer residual network performs worse than the 110-layer one), as shown in [14]. He et al. [14] argue that this comes from over-fitting, not optimization difficulty. Here, we show that a very deep residual network may also suffer from optimization difficulty.
We first perform experiments on the CIFAR-10 dataset with residual networks, following the same experimental setup as in [14], except that we run the experiments on one GPU. We vary the network depth and show the training loss in Figure 6 (a). We observe that the residual networks have an increased loss in the initial iterations, which is amplified for deeper networks. Later, training gets stuck in a state where the network makes random guesses (the loss stays at the random-guess level). Although the network can escape such a state with enough iterations, it suffers from degraded training performance, especially for very deep networks.
Analysis of Learning Dynamics
To explore why residual networks exhibit such mysterious behavior, we perform the layer-wise conditioning analysis on the last linear layer (before the cross-entropy loss). We monitor the maximum eigenvalue of the input covariance matrix, the maximum eigenvalue of the second moment matrix of the weight-gradient, and the norm of the weight.
We observe that the initially increased loss is mainly caused by the large magnitude of the layer input (this large magnitude is caused mainly by the addition of multiple residual connections from the previous layers with ReLU outputs) (Figure 6 (b)), which results in a large magnitude of the weight-gradient (Figure 6 (d)). The increased gradient magnitude further facilitates the increase of the loss. However, the learning objective is to decrease the loss, and it is thus expected to decrease the magnitude of either the weight or the layer input (based on Eqn. 7) in this case. Apparently, the weight is harder to adjust, because the landscape of its loss surface is controlled by the layer input, all of whose values are non-negative and of large magnitude. The network thus tries to decrease the layer input, driven by the learning objective. We experimentally find that the learnable parameters of BN take a large number of negative values, which makes the ReLUs (positioned after the residual adding operation) inactive. Such dynamics result in a significant reduction in the magnitude of the layer input. The small input and the large weight drive the last linear layer of the network into the state of weight domination, and the network shows random-guess behavior. Although the residual network can escape such a state with enough iterations, weight domination hinders optimization and results in degraded training performance.
Table 1: Test errors (%, mean ± std) on CIFAR-10 with different depths.

method | depth-56 | depth-110 | depth-230 | depth-1202
ResNet | 7.52 ± 0.30 | 6.89 ± 0.52 | 7.35 ± 0.64 | 9.42 ± 3.10
ResNet | 6.50 ± 0.22 | 6.10 ± 0.09 | 5.94 ± 0.18 | 5.68 ± 0.14

Table 2: Test errors (%, mean ± std) on CIFAR-100 with different depths.

method | depth-56 | depth-110 | depth-230 | depth-1202
ResNet | 29.60 ± 0.41 | 28.3 ± 1.09 | 29.25 ± 0.44 | 30.49 ± 4.44
ResNet | 28.82 ± 0.38 | 27.05 ± 0.23 | 25.80 ± 0.10 | 25.51 ± 0.27
5.1 Proposed Solution
Based on the above analysis, the essential point is to avoid the large magnitude of the last linear layer's input. We propose a simple solution: add one BN layer before the last linear layer to normalize its input. We refer to this residual network as ResNet, and the original one as ResNet. We also provide the analysis of the last linear layer of the revised network, and we show the comparison between the revised and original networks on the 1202-layer model in Figure 7. We observe that the revised network trains stably: it does not encounter the state of weight domination or a large input magnitude in the last linear layer.
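The proposed revision touches only the network head. A sketch of the forward pass (a plain NumPy stand-in for the actual architecture; `features` plays the role of the globally pooled ReLU output feeding the classifier):

```python
import numpy as np

def batch_norm(z, eps=1e-5):
    return (z - z.mean(axis=0)) / np.sqrt(z.var(axis=0) + eps)

def head_original(features, W):
    # original ResNet head: pooled features -> last linear layer
    return features @ W

def head_revised(features, W):
    # revision: one BN layer before the last linear layer bounds the
    # magnitude of the classifier's input
    return batch_norm(features) @ W

rng = np.random.default_rng(6)
features = 50.0 * np.abs(rng.normal(size=(128, 64)))  # large, non-negative
W = rng.normal(size=(64, 10)) / np.sqrt(64.0)
```

With large non-negative inputs (as accumulated by the residual connections), the revised head produces logits of much smaller magnitude, which is the effect the analysis calls for.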
We also try a similar solution in which the input is divided by a constant before the last linear layer, and find that it also benefits training. However, the main disadvantage of this solution is that the value of the constant has to be carefully tuned for networks of different depths. We also try placing one BN layer before the average pooling, which has similar effects to placing it before the last linear layer. We note that Bjorck et al. [4] propose to train a 110-layer residual network with only one BN layer, placed before the average pooling, and show that it achieves good results. We argue that this does not hold for very deep networks: we perform an experiment on the 1202-layer residual network and find that the model always fails to train under various hyperparameters.
Table 3: Top-1 validation errors (%, single model and single-crop) on 18-, 50- and 101-layer residual networks for ImageNet classification.

Method | 18-layer | 50-layer | 101-layer
ResNet | 29.78 | 23.81 | 22.45
ResNet | 29.32 | 23.47 | 21.94
ResNet, as a simple revision of the original ResNet, achieves a significant improvement in performance for very deep residual networks. We show the experimental results on CIFAR and ImageNet classification as follows.
CIFAR Datasets
Figure 8 (a) and (b) show the training losses of the original and revised ResNet, respectively, on the CIFAR-10 dataset. We observe that the original ResNet, at a depth of 1202, exhibits degraded training performance, especially in the initial phase. Note that, as the depth increases, the original ResNet obtains worse training performance in the first 80 epochs (before the learning rate is reduced), which coincides with our previous analysis. The revised ResNet obtains nearly the same training loss for networks of different depths in the first 80 epochs. Moreover, it shows lower training loss with increasing depth. Comparing Figure 8 (b) to (a), we observe that the revised ResNet has better training loss than the original for all depths (e.g., at a depth of 56, the loss of the original ResNet is 0.081, while for the revised one it is 0.043).
Table 1 shows the test errors. We observe that the revised ResNet achieves better test performance with increasing depth, while the original has degraded performance. Compared to the original, the revised ResNet has consistently improved performance across different depths. In particular, it reduces the absolute test error of the original ResNet by , , and at depths of 56, 110, 230 and 1202, respectively. Due to its optimization efficiency, the training performance is likely to improve further if we strengthen the regularization. Thus, we set the weight decay to 0.0002 and double the training iterations, and find that the 1202-layer revised ResNet achieves a test error of . We also train a 2402-layer network. We observe that the original cannot converge, while the revised ResNet achieves a test error of .
We further perform experiments on CIFAR-100, using the same experimental setup as for CIFAR-10. Table 2 shows the test errors. The revised ResNet reduces the absolute test error of the original by , , and at depths of 56, 110, 230 and 1202, respectively.
ImageNet Dataset
We also validate the effectiveness of the revised ResNet on large-scale ImageNet classification with 1000 classes [7]. We use the official 1.28M training images as the training set, and evaluate the top-1 classification error on the validation set of 50k images. We perform experiments on the 18-, 50- and 101-layer networks. We follow the same setup as described in [14], except that: 1) we train over 100 epochs with an extra lowered learning rate at the 90th epoch; 2) we use 1 GPU for the 18-layer network and 4 GPUs for the 50- and 101-layer networks.
We also observe that the revised ResNet has better optimization efficiency than the original (see Appendix D). Table 3 shows the validation errors. For a fairer comparison, we report the best result from each network with dropout of 0.3 (before the last linear layer) and without dropout. We find that the revised ResNet improves performance over the original on all networks.
6 Conclusion
We propose a layer-wise conditioning analysis to characterize the optimization behavior of DNNs. Such an analysis is theoretically derived under mild assumptions that approximately hold in practice. Based on our layer-wise conditioning analysis, we show how batch normalization stabilizes training and improves the conditioning of the optimization problem. We further show that very deep residual networks still suffer from optimization difficulty, which is caused by the ill-conditioned state of the last linear layer. We remedy this by adding only one BN layer before the last linear layer. We expect our method to provide new insights for analyzing and understanding techniques for training DNNs.
References
 [1] Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using Kronecker-factored approximations. In ICLR, 2017.
 [2] Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
 [3] Alberto Bernacchia, Mate Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In NeurIPS. 2018.
 [4] Johan Bjorck, Carla Gomes, and Bart Selman. Understanding batch normalization. In NeurIPS, 2018.

 [5] Léon Bottou, Frank E Curtis, and Jorge Nocedal. Optimization methods for large-scale machine learning. SIAM Review, 60(2):223–311, 2018.
 [6] Miguel Carreira-Perpinan and Weiran Wang. Distributed optimization of deeply nested systems. In AISTATS, 2014.
 [7] J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. FeiFei. ImageNet: A LargeScale Hierarchical Image Database. In CVPR, 2009.
 [8] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and koray kavukcuoglu. Natural neural networks. In NeurIPS, 2015.

 [9] Thomas Frerix, Thomas Möllenhoff, Michael Möller, and Daniel Cremers. Proximal backpropagation. In ICLR, 2018.
 [10] Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In ICML, 2019.
 [11] Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, 2010.
 [12] Roger B. Grosse and James Martens. A Kronecker-factored approximate Fisher matrix for convolution layers. In ICML, 2016.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
 [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [15] G E Hinton and R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
 [16] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. NeurIPS, 2018.
 [17] Roger A. Horn and Charles R. Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
 [18] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [19] Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, and Dacheng Tao. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.
 [20] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
 [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [22] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
 [23] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.
 [24] Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521:436–444, 2015.
 [25] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50, 1998.
 [26] Yann LeCun, Ido Kanter, and Sara A. Solla. Second order properties of error surfaces: Learning time and generalization. In NeurIPS, 1990.
 [27] James Martens. Deep learning via Hessian-free optimization. In ICML, pages 735–742. Omnipress, 2010.
 [28] James Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.
 [29] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In ICML, 2015.
 [30] James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the hessian by backpropagating curvature. In ICML, 2012.

[31] Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann machines and the centering trick. Volume 7700 of LNCS, 2012.
 [32] Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. CoRR, abs/1811.07062, 2018.
 [33] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In ICLR, 2014.
 [34] Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NeurIPS, pages 849–856. Curran Associates, Inc., 2007.
 [35] Levent Sagun, Utku Evci, V. Ugur Güney, Yann N. Dauphin, and Léon Bottou. Empirical analysis of the hessian of overparametrized neural networks. CoRR, abs/1706.04454, 2017.
 [36] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In NeurIPS, 2018.
 [37] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
 [38] Nicol N. Schraudolph. Accelerated gradient descent by factorcentering decomposition. Technical report, 1998.
 [39] Ke Sun and Frank Nielsen. Relative Fisher information and natural gradient for learning large modular models. In ICML, 2017.
 [40] Simon Wiesler and Hermann Ney. A convergence analysis of log-linear training. In NeurIPS, 2011.
 [41] Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. CoRR, 2018.
 [42] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
 [43] Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In ICLR, 2019.
 [44] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
 [45] Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.
 [46] Huishuai Zhang, Wei Chen, and TieYan Liu. On the local hessian in backpropagation. In NeurIPS. 2018.
Appendix A Proof of Theorems
Here, we provide proofs for the proposition and two theorems in the paper.
a.1 Proof of Proposition 1
Proposition 1. Given $F = A \otimes B$, where $A$ and $B$ are positive semidefinite, we have: 1) $\lambda_{\max}(F) = \lambda_{\max}(A)\,\lambda_{\max}(B)$; 2) $\kappa(F) = \kappa(A)\,\kappa(B)$.
Proof.
The proof is mainly based on the conclusion from Theorem 4.2.12 in [17], which is restated as follows:
Lemma A.1.
Let $A \in \mathbb{R}^{n \times n}$ and $B \in \mathbb{R}^{m \times m}$. Furthermore, let $\lambda$ be an arbitrary eigenvalue of $A$ and $\mu$ be an arbitrary eigenvalue of $B$. Then $\lambda\mu$ is an eigenvalue of $A \otimes B$. Furthermore, any eigenvalue of $A \otimes B$ arises as such a product of eigenvalues of $A$ and $B$.
Based on their definitions, $A$ and $B$ are positive semidefinite. Therefore, all the eigenvalues of $A$/$B$ are nonnegative. Based on Lemma A.1, we have $\lambda_{\max}(A \otimes B) = \max_{i,j} \lambda_i(A)\lambda_j(B)$. Since all the eigenvalues of $A$ and $B$ are nonnegative, we thus have $\lambda_{\max}(A \otimes B) = \lambda_{\max}(A)\,\lambda_{\max}(B)$. Similarly, we can prove that $\lambda_{\min}(A \otimes B) = \lambda_{\min}(A)\,\lambda_{\min}(B)$. We thus have $\kappa(A \otimes B) = \kappa(A)\,\kappa(B)$.
∎
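Lemma A.1 and Proposition 1 can be verified numerically. The following is a minimal NumPy sketch (the matrix sizes are arbitrary): the spectrum of a Kronecker product of PSD factors is exactly the set of pairwise eigenvalue products, so its maximum eigenvalue and condition number factorize.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(n):
    m = rng.standard_normal((n, n + 2))
    return m @ m.T / n  # positive semidefinite (almost surely positive definite)

A, B = random_psd(4), random_psd(5)

eig_a = np.linalg.eigvalsh(A)               # ascending order
eig_b = np.linalg.eigvalsh(B)
eig_f = np.linalg.eigvalsh(np.kron(A, B))

# Lemma A.1: the spectrum of A (x) B is exactly the pairwise products.
products = np.sort(np.outer(eig_a, eig_b).ravel())
assert np.allclose(eig_f, products)

# Proposition 1: for nonnegative spectra, max eigenvalue and condition
# number both factorize over the Kronecker product.
assert np.isclose(eig_f[-1], eig_a[-1] * eig_b[-1])
assert np.isclose(eig_f[-1] / eig_f[0],
                  (eig_a[-1] / eig_a[0]) * (eig_b[-1] / eig_b[0]))
```

Note that nonnegativity of the eigenvalues is what lets the maximum of the pairwise products split into a product of maxima; the factorization fails for indefinite factors.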
a.2 Proof of Theorem 1
Theorem 1. Given a rectifier neural network (Eqn. 3) with nonlinearity ( ), if the weight in each layer is scaled by ( and ), we have the scaled layer input: . Under the assumption that , we have the output-gradient: , and the weight-gradient: , for all .
Proof.
(1) We first prove the claim for the scaled layer input ( ), using mathematical induction. It is easy to validate that and . We assume that and hold, for . When , we have
(9) 
We thus have
(10) 
By induction, we have , for . We also have for .
(2) We then prove the claim for the scaled output-gradient for , again by mathematical induction. Based on backpropagation, we have
(11) 
and
(12) 
Based on the assumption that , we have (we denote if ).
We assume that holds, for . When , we have
(13) 
We also have
(14) 
By induction, we thus have , for .
(3) Based on , and , it is easy to prove that for . ∎
a.3 Proof of Theorem 2
Theorem 2. Under the same conditions as Theorem 1, for the normalized network with and , we have: , , , for all .
Proof.
(1) Following the proof of Theorem 1, it is easy to demonstrate by mathematical induction that , and , for all .
(2) We also use mathematical induction to demonstrate that for all .
We first show the formulation of the gradient backpropagating through each neuron of the BN layer as:
(15) 
where
is the standard deviation and
denotes the expectation over mini-batch examples. We have based on . Since , we have . Therefore, we have from Eqn. 15. Assume that holds, for . When , we have:
(16) 
Following the proof of Theorem 1, it is easy to get . Based on and , we have from Eqn. 15.
By induction, we have , for all .
(3) Based on , and , we have that , for all .
∎
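The per-neuron BN gradient used in the proof above (Eqn. 15) can be checked numerically. The following is a minimal NumPy sketch, assuming the standard BN transform without affine parameters (gamma = 1, beta = 0) and the biased batch variance; it compares the closed-form backward pass against finite differences.

```python
import numpy as np

EPS = 1e-8

def bn_forward(x):
    """Per-column batch normalization, gamma=1, beta=0."""
    mu, var = x.mean(0), x.var(0)  # biased variance, as in BN
    return (x - mu) / np.sqrt(var + EPS)

def bn_backward(x, dy):
    """Gradient of sum(dy * bn_forward(x)) w.r.t. x: the output-gradient
    is centered and decorrelated from x_hat, then divided by sigma."""
    xhat = bn_forward(x)
    sigma = np.sqrt(x.var(0) + EPS)
    return (dy - dy.mean(0) - xhat * (dy * xhat).mean(0)) / sigma

rng = np.random.default_rng(0)
x = rng.standard_normal((16, 3))
dy = rng.standard_normal((16, 3))

# Finite-difference check of the analytic gradient, entry by entry.
num = np.zeros_like(x)
h = 1e-6
for idx in np.ndindex(*x.shape):
    xp, xm = x.copy(), x.copy()
    xp[idx] += h
    xm[idx] -= h
    num[idx] = (np.sum(dy * bn_forward(xp)) - np.sum(dy * bn_forward(xm))) / (2 * h)

assert np.allclose(bn_backward(x, dy), num, atol=1e-4)
```

The centering of `dy` in the backward pass is what yields the zero-mean output-gradient property used in the induction step.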
Appendix B Comparison of the Analysis with the Full Curvature Matrix and Sub-curvature Matrices
Here, we conduct experiments to analyze the training dynamics of the unnormalized (‘Plain’) and batch normalized [21] (‘BN’) networks, by analyzing the spectrum of the full curvature matrix and the sub-curvature matrices. Figure A1 shows the results using the Fisher Information Matrix (FIM) on an 8-layer MLP with 24 neurons in each layer. We calculate the maximum eigenvalue and the condition number with respect to different percentages, for both the full FIM and the sub-FIMs. By observing the results from the full FIM (Figure A1 (a)), we find that: 1) the unnormalized network suffers from vanishing gradients (the maximum eigenvalue is around ), while the batch normalized network has an appropriate gradient magnitude (the maximum eigenvalue is around ); 2) ‘BN’ has better conditioning than ‘Plain’, which suggests that batch normalization (BN) can improve the conditioning of the network, as observed in [36, 10]. We obtain similar conclusions from the sub-FIMs (Figure A1 (b) to (i)). This experiment shows that our layerwise conditioning analysis can uncover the same training dynamics of the networks as the full conditioning analysis.
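The practical appeal of the sub-FIMs is that their spectra never require forming the full matrix. The following is a minimal sketch, assuming the Kronecker factorization of a layer's sub-FIM into the input covariance and output-gradient covariance (as in Proposition 1); the data here is synthetic, standing in for one layer's mini-batch activations and gradients.

```python
import numpy as np

def sub_fim_spectrum(x, g):
    """Max eigenvalue and condition number of a layer's sub-FIM via its
    Kronecker factors: F_l ~ cov(input) (x) cov(output-gradient).
    x: (m, d_in) layer inputs; g: (m, d_out) output-gradients."""
    cov_x = x.T @ x / x.shape[0]
    cov_g = g.T @ g / g.shape[0]
    ex, eg = np.linalg.eigvalsh(cov_x), np.linalg.eigvalsh(cov_g)
    lam_max = ex[-1] * eg[-1]
    cond = (ex[-1] / ex[0]) * (eg[-1] / eg[0])
    return lam_max, cond

rng = np.random.default_rng(0)
x = rng.standard_normal((1024, 6))
g = 0.01 * rng.standard_normal((1024, 4))  # small gradients, as in 'Plain'

lam_max, cond = sub_fim_spectrum(x, g)

# Agrees with the spectrum of the explicitly formed Kronecker product,
# without ever building the (d_in * d_out)-dimensional matrix.
full = np.kron(x.T @ x / 1024, g.T @ g / 1024)
eig = np.linalg.eigvalsh(full)
assert np.isclose(eig[-1], lam_max)
assert np.isclose(eig[-1] / eig[0], cond)
```

Only the two small factor covariances are ever eigendecomposed, which is what makes the analysis tractable for wide layers where the full FIM block would be prohibitively large.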
We also conduct experiments to analyze ‘Plain’ and ‘BN’ using the second moment matrix of the sample gradients . The results are shown in Figure A2. We make the same observations as with the FIM.
Appendix C More Experiments in Exploring Batch Normalized Networks
In this section, we provide more experimental results relating to exploring batch normalization (BN) [21] by layerwise conditioning analysis, as discussed in Section 4 of the paper. We include experiments that train neural networks with Stochastic Gradient Descent (SGD), as well as experiments relating to the weight domination discussed in the paper.
c.1 Experiments with SGD
Here, we perform experiments on a multilayer perceptron (MLP) for MNIST classification and on convolutional neural networks (CNNs) for CIFAR-10 classification.
c.1.1 MLP for MNIST Classification
We use the same experimental setup as the MNIST experiments described in the paper, except that we use SGD with a batch size of 1024. The results are shown in Figure A3 and Figure A4. We obtain similar results to those obtained using full gradient descent, as described in Section 4 of the paper.
c.1.2 CNN for CIFAR-10 Classification
We perform a layerwise conditioning analysis on the VGG-style and residual network [14] architectures. Note that we view the activation at each spatial location of the feature map as an independent example when calculating the covariance matrices of the convolutional layer input and output-gradient. This is similar to the procedure proposed in BN to normalize convolutional layers [21].
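The reshaping described above can be sketched concretely. The following is a minimal NumPy example, assuming feature maps in the usual (N, C, H, W) layout; each of the N*H*W spatial positions becomes one C-dimensional sample for the covariance estimate.

```python
import numpy as np

def conv_input_covariance(feat):
    """Channel covariance of a conv feature map, treating each spatial
    position of each example as an independent sample, as BN does."""
    n, c, h, w = feat.shape
    samples = feat.transpose(0, 2, 3, 1).reshape(n * h * w, c)  # (N*H*W, C)
    centered = samples - samples.mean(0)
    return centered.T @ centered / samples.shape[0]  # (C, C)

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 16, 32, 32))  # illustrative mini-batch
cov = conv_input_covariance(feat)
assert cov.shape == (16, 16)
```

The resulting C x C matrix is small regardless of the spatial resolution, which keeps the analysis cheap for convolutional layers.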
We use the 20-layer residual network described in the paper [14] for CIFAR-10 classification. The VGG-style network is constructed from the 20-layer residual network by removing the residual connections.
We use the same setups as described in [14], except that we do not use weight decay (to simplify the analysis) and we run the experiments on one GPU. Since the unnormalized networks (both the VGG-style and the residual network) do not converge with the large learning rate of 0.1, we run additional experiments with a learning rate of 0.01 for the unnormalized networks and report these results.
Figure A5 and Figure A6 show the results on the VGG-style and residual networks, respectively. We make the same observations as for the MLP on MNIST classification.
c.2 Experiments Relating to Weight Domination
Gradient Explosion of BN
In Section 4.1 of the paper, we mention that for the network with BN there is still a possibility that the magnitude of the weights in certain layers increases significantly. Here, we provide the experimental results.
We conduct experiments on a 100-layer batch normalized MLP with 256 neurons in each layer for MNIST classification. We calculate the maximum eigenvalues of the sub-FIMs and show the results during the first 7 iterations in Figure A7 (a). We observe that the weight-gradient explodes exponentially at initialization (‘Iter0’). After a single step, the first-step gradients dominate the weights due to the gradient explosion in the lower layers, hence the exponential growth in the magnitude of the weights. This increased weight magnitude leads to small weight-gradients (‘Iter1’ to ‘Iter7’), which is caused by BN, as discussed in Section 4.1 of the paper. Therefore, some layers (especially the lower layers) of the network enter the state of weight domination. We make similar observations on the 110-layer VGG-style network for CIFAR-10 classification, as shown in Figure A7 (b).
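The weight-domination state described above is easy to monitor in practice. The following is a hypothetical NumPy sketch (the helper name, magnitudes, and threshold are illustrative): a layer is in this state when an SGD step is negligible relative to the current weight norm, e.g. after its weights have been inflated by an early gradient explosion.

```python
import numpy as np

def weight_domination_ratio(weight, grad, lr):
    """Relative size of an SGD step, ||lr * grad|| / ||weight||. A tiny
    value means the existing weight dominates the update, so the layer
    barely learns (the 'weight domination' state)."""
    return lr * np.linalg.norm(grad) / (np.linalg.norm(weight) + 1e-12)

rng = np.random.default_rng(0)

# Hypothetical snapshot after the first step of a very deep BN network:
# one layer's weight magnitude exploded, so later gradients of ordinary
# scale barely move it, while a normally scaled layer keeps learning.
w_exploded = 1e3 * rng.standard_normal((256, 256))
w_normal = rng.standard_normal((256, 256))
g = rng.standard_normal((256, 256))

r_dominated = weight_domination_ratio(w_exploded, g, lr=0.1)
r_healthy = weight_domination_ratio(w_normal, g, lr=0.1)
assert r_dominated < 1e-3 < r_healthy
```

Tracking this ratio per layer during the first few iterations would flag exactly the lower layers that Figure A7 shows entering weight domination.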
Investigation of Weight Domination
Weight domination sometimes harms the learning of the network, because this state limits the representational ability of the corresponding layer. We conducted experiments on a five-layer MLP and showed the results in Section 4. Here, we also conduct experiments on CNNs for the CIFAR-10 dataset, shown in Figure A8. We observe that a network in which certain layers are in the state of weight domination can still decrease the loss; however, its performance degrades.