Deep Neural Networks (DNNs) have been extensively used in various domains. Their success depends heavily on improvements in training techniques [15, 21, 14], e.g., careful weight initialization [15, 11, 37, 13], normalization of internal representations [21, 42], and well-designed optimization methods [45, 22]. It is believed that these techniques are closely connected to the curvature of the loss [37, 36, 23]. Analyzing the curvature of the loss is therefore essential for understanding many learning behaviors of DNNs.
In the optimization community, conditioning analysis uncovers the landscape of an optimization objective by exploring the spectrum of its curvature matrix. It has been well explored for linear models, both in regression and classification, where the convergence condition of the optimization problem is controlled by the maximum eigenvalue of the curvature matrix [26, 25], and the learning time of the model is lower-bounded by the condition number of the curvature matrix [26, 25]. However, in the context of deep learning, conditioning analysis faces several barriers: 1) the model is over-parameterized, and it is unclear whether the directions associated with small/zero eigenvalues contribute to the optimization progress [35, 32]; 2) the memory and computation costs are extremely high [35, 10].
This paper aims to bridge the theoretical analysis of the optimization community and the empirical techniques for training DNNs, in order to better understand the learning dynamics of DNNs. We propose a layer-wise conditioning analysis, in which we analyze the optimization landscape with respect to each layer independently by exploring the spectrum of its curvature matrix. The theoretical insight behind our layer-wise conditioning analysis builds on the recent success of second-order curvature approximation techniques in DNNs [30, 29, 1, 39, 3]. We show that, under mild assumptions, the maximum eigenvalue and the condition number of the block-wise Fisher Information Matrix (FIM) can be characterized by the spectra of the covariance matrices of the layer input and the output-gradient, which makes evaluating optimization behavior practical in DNNs. Another theoretical basis is the recently proposed proximal back-propagation [6, 9, 46], in which the original optimization problem can be approximately decomposed into multiple independent sub-problems, one for each layer. We provide the connection between our analysis and proximal back-propagation.
Based on our layer-wise conditioning analysis, we show that Batch Normalization (BN) can adjust the magnitude of the layer activations/gradients, and thus stabilizes training. However, this kind of stabilization can drive certain layers into a state, referred to as weight domination, in which the gradient update is feeble; this sometimes has detrimental effects on learning (see Section 4.1 for details). We also experimentally observe that BN can improve the layer-wise conditioning of the optimization problem. Furthermore, we find that the unnormalized network has many small eigenvalues in its layer curvature matrices, which is mainly caused by the so-called dying neurons (Section 4.2), while this behavior is almost absent in batch normalized networks.
We further analyze the overlooked difficulty in training very deep residual networks. Using our layer-wise conditioning analysis, we show that the difficulty mainly arises from the ill-conditioned behavior of the last linear layer. We solve this problem by adding only one BN layer before the last linear layer, which achieves improved performance over the original residual networks (Section 5).
Consider a true data distribution $p_*(x, y) = p(x)\,p(y|x)$ and a training set $D \sim p_*(x, y)$ of size $N$ sampled from it. We focus on a supervised learning task that aims to learn the conditional distribution $p(y|x)$ using the model $q(y|x)$, where $q(y|x)$ is represented as a function $f_\theta(x)$ parameterized by $\theta$. From an optimization perspective, we aim to minimize the empirical risk, i.e., the sample loss $\ell(y, f_\theta(x))$ averaged over the training set $D$: $L(\theta) = \frac{1}{N}\sum_{i=1}^{N} \ell(y^{(i)}, f_\theta(x^{(i)}))$.
In general, the gradient descent (GD) update seeks to iteratively reduce the loss $L$ by $\theta_{t+1} = \theta_t - \eta \frac{\partial L}{\partial \theta}$, where $\eta$ is the learning rate. Moreover, for large-scale learning, stochastic gradient descent (SGD) is extensively used to approximate the gradient $\frac{\partial L}{\partial \theta}$ with a mini-batch gradient. In theory, the convergence behaviors (e.g., the number of iterations required for convergence to a stationary point) depend on the Lipschitz constant $C$ of the gradient function of $L$, which characterizes the global smoothness of the optimization landscape. (Footnote 1: The loss's gradient function is assumed Lipschitz continuous with Lipschitz constant $C$, i.e., $\|\nabla_\theta L(\theta_1) - \nabla_\theta L(\theta_2)\| \le C\,\|\theta_1 - \theta_2\|$ for all $\theta_1$ and $\theta_2$.) In practice, the Lipschitz constant is either unknown for complicated functions or too conservative to characterize the convergence behaviors. Researchers thus turn to the local smoothness, characterized by the Hessian matrix $H$ of $L$ with respect to $\theta$, under the condition that $L$ is twice differentiable.
Approximate Curvature Matrices
The Hessian describes the local curvature of the optimization landscape. Such curvature information intuitively guides the design of second-order optimization algorithms [35, 5], where the update direction is adjusted by multiplying the gradient by the inverse of a pre-conditioning matrix $G$: $\theta_{t+1} = \theta_t - \eta\, G^{-1} \frac{\partial L}{\partial \theta}$. Here, $G$ is a positive definite matrix that approximates the Hessian and is expected to retain the positive curvature of the Hessian. The second moment matrix of the sample gradient, $M = E_{D}\big(\frac{\partial \ell}{\partial \theta}^T \frac{\partial \ell}{\partial \theta}\big)$ (treating the gradient as a row vector), is usually used as the pre-conditioning matrix [34, 27]. Besides, Pascanu and Bengio show that the Fisher Information Matrix (FIM), $F = E_{p(x),\,q(y|x)}\big(\frac{\partial \ell}{\partial \theta}^T \frac{\partial \ell}{\partial \theta}\big)$, can be viewed as a pre-conditioning matrix when performing the natural gradient descent algorithm. For more analyses of the connections among $H$, $M$ and $F$, please refer to [28, 5]. In this paper, we refer to the analysis of the spectrum of the (approximate) curvature matrices as conditioning analysis.
Conditioning Analysis for Linear Models
Consider a linear regression model with a scalar output, $f_w(x) = w^T x$, and the mean square error loss $\ell = \frac{1}{2}(y - f_w(x))^2$. As shown in [26, 25], the learning dynamics on such a quadratic surface are fully controlled by the spectrum of the Hessian matrix $H = E_D(x x^T)$. Two statistics are essential for evaluating the convergence behaviors of the optimization problem. One is the maximum eigenvalue of the curvature matrix, $\lambda_{max}$, and the other is the condition number of the curvature matrix, denoted by $\kappa = \lambda_{max}/\lambda_{min}$, where $\lambda_{min}$ is the minimum nonzero eigenvalue of the curvature matrix. Specifically, $\lambda_{max}$ controls the upper bound and the optimal value of the learning rate (e.g., the optimal learning rate is $\eta = 1/\lambda_{max}$ and the training will diverge if $\eta > 2/\lambda_{max}$). $\kappa$ controls the number of iterations required for convergence (e.g., the lower bound of the number of iterations is $\kappa$). If $H$ is an identity matrix, which can be achieved by whitening the input, GD can converge within only one iteration. It is easy to extend the solution of linear regression from a scalar output to a vectorial output. In this case, the Hessian is represented as
$$H = E_D(x x^T) \otimes I. \tag{1}$$
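For a linear classification model with cross-entropy loss, the Hessian is approximated by
$$H = E_D(x x^T) \otimes S. \tag{2}$$
Here, $S \in \mathbb{R}^{c \times c}$ is defined by $S = \frac{1}{c}\big(I_c - \frac{1}{c}\mathbf{1}_c\big)$, where $c$ is the number of categories and $\mathbf{1}_c \in \mathbb{R}^{c \times c}$ denotes a matrix whose entries are all one. Eqn. 2 assumes that the Hessian does not significantly change from the initial region to the optimal region.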
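The role of the maximum eigenvalue and the condition number can be illustrated on a toy quadratic. The following is a minimal sketch, assuming the loss $L(w) = \frac{1}{2} w^T H w$ with a diagonal $H$; the function name `gd_iterations`, the tolerance and the chosen matrices are illustrative assumptions, not part of the original experiments.

```python
import numpy as np

def gd_iterations(hessian, lr, tol=1e-8, max_iter=100_000):
    """Run GD on L(w) = 0.5 * w^T H w from w = (1, ..., 1).

    Returns the number of steps needed to bring ||w|| below tol,
    or max_iter if the iterate diverges or never gets there.
    """
    w = np.ones(hessian.shape[0])
    for step in range(1, max_iter + 1):
        w = w - lr * hessian @ w          # gradient of the quadratic is H w
        norm = np.linalg.norm(w)
        if not np.isfinite(norm) or norm > 1e12:
            return max_iter               # diverged
        if norm < tol:
            return step
    return max_iter

well = np.diag([1.0, 1.0])    # kappa = 1, e.g. a whitened input
ill = np.diag([1.0, 0.01])    # kappa = 100

steps_well = gd_iterations(well, lr=1.0)      # lr = 1 / lambda_max
steps_ill = gd_iterations(ill, lr=1.0)        # same lr, far more steps
steps_diverged = gd_iterations(ill, lr=2.5)   # lr > 2 / lambda_max
```

With the identity Hessian a single step reaches the optimum, the ill-conditioned case needs on the order of thousands of steps at the same learning rate, and a learning rate above $2/\lambda_{max}$ makes the iterate blow up.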
3 Conditioning Analysis for DNNs
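Considering a Multilayer Perceptron (MLP), $f_\theta(x)$ can be represented as a layer-wise linear and nonlinear transformation, as follows:
$$h_k = W_k x_{k-1}, \quad x_k = \phi(h_k), \quad k = 1, \ldots, K, \tag{3}$$
where $x_0 = x$, $W_k \in \mathbb{R}^{d_k \times d_{k-1}}$, and the learnable parameters are $\theta = \{W_k,\ k = 1, \ldots, K\}$. To simplify the notation, we set $x_K = h_K$ as the output of the network $f_\theta(x)$.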
Conditioning analysis on the full curvature matrix of a DNN is difficult because of the expensive memory and computational costs [10, 32]. We thus seek to analyze an approximation of the curvature matrix. One successful example in second-order optimization over DNNs is approximating the Fisher Information Matrix (FIM) by using the Kronecker product (K-FAC) [29, 1, 39, 3]. In the K-FAC approach, there are two assumptions: 1) weight-gradients in different layers are assumed to be uncorrelated; 2) the input and output-gradient in each layer are approximated as independent. The full FIM can then be represented as a block diagonal matrix, $F = \mathrm{diag}(F_1, \ldots, F_K)$, where $F_k$ is the sub-FIM (the FIM with respect to the parameters in layer $k$) and is computed as:
$$F_k = E_{p(x)}\big(x_{k-1} x_{k-1}^T\big) \otimes E_{q(y|x)}\Big(\frac{\partial \ell}{\partial h_k}^T \frac{\partial \ell}{\partial h_k}\Big). \tag{4}$$
Here, $x_{k-1}$ denotes the layer input, and $\frac{\partial \ell}{\partial h_k}$ denotes the layer output-gradient. We note that Eqn. 4 is similar to Eqn. 1 and 2, and all of them depend on the covariance matrix of the (layer) input. The main difference is that in Eqn. 4 the covariance of the output-gradient is taken into account and its value changes across optimization regions, while in Eqn. 1 and 2 the covariance of the output-gradient is constant.
Based on this observation, we propose layer-wise conditioning analysis, i.e., we analyze the spectrum of each sub-FIM independently. We expect the spectra of the sub-FIMs to reflect that of the full FIM well, at least in terms of analyzing the learning dynamics of DNNs. Specifically, we analyze the maximum eigenvalue $\lambda_{max}$ and the condition number $\kappa$. (Footnote 2: Since DNNs are usually over-parameterized, we evaluate the more general condition number with respect to a percentage $p$: $\kappa_p = \lambda_{max}/\lambda_p$, where $\lambda_p$ is the $pd$-th eigenvalue (in descending order) and $d$ is the number of eigenvalues; e.g., $\kappa_{100\%}$ is the original definition of the condition number.) Based on the conclusions of the conditioning analysis for linear models shown in Section 2, there are two remarkable properties, which can be used to uncover the landscape of the optimization problem implicitly:
Property 1: $\lambda_{max}$ indicates the magnitude of the weight-gradient in each layer, which shows the steepness of the landscape with respect to different layers.
Property 2: $\kappa$ indicates how easy it is to optimize the corresponding layer.
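The general condition number with respect to a percentage can be sketched as follows. This is a minimal illustration, assuming a symmetric positive semi-definite input; the function name `general_condition_number` and the rounding of the rank $pd$ with a ceiling are assumptions made here, not details given in the text.

```python
import numpy as np

def general_condition_number(matrix, p=1.0):
    """kappa_p = lambda_max / lambda_{ceil(p * d)} for a symmetric PSD matrix.

    Eigenvalues are sorted in descending order; p = 1.0 recovers the usual
    condition number lambda_max / lambda_min.
    """
    eigvals = np.sort(np.linalg.eigvalsh(matrix))[::-1]
    d = eigvals.shape[0]
    idx = max(int(np.ceil(p * d)), 1) - 1   # 1-based rank -> 0-based index
    return eigvals[0] / eigvals[idx]

cov = np.diag([100.0, 10.0, 1.0, 1e-6])
kappa_full = general_condition_number(cov, p=1.0)   # driven by the 1e-6 eigenvalue
kappa_75 = general_condition_number(cov, p=0.75)    # ignores the smallest quarter
```

Choosing $p < 100\%$ keeps a handful of near-zero eigenvalues, which are common in over-parameterized layers, from dominating the statistic.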
Note that the assumptions required for our analysis are weaker than those of the K-FAC approximation, since we only care about whether the spectrum of the full FIM matches well with the approximated one. We conduct experiments to analyze the training dynamics of unnormalized (‘Plain’) and batch normalized (‘BN’) networks, by analyzing the spectra of the full FIM and the sub-FIMs. We find that the conclusions drawn from analyzing the full FIM can also be derived by analyzing the sub-FIMs. Please refer to Appendix B for details. Furthermore, we argue that layer-wise conditioning analysis can uncover more information about the training dynamics of DNNs than the analysis using the full curvature matrix; e.g., we can diagnose and further locate gradient vanishing/explosion with respect to a specific layer. We will elaborate on this in the subsequent sections.
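We denote the covariance matrix of the layer input as $\Sigma_x = E_{p(x)}(x x^T)$ and the covariance matrix of the layer output-gradient as $\Sigma_{\nabla h} = E_{q(y|x)}\big(\frac{\partial \ell}{\partial h}^T \frac{\partial \ell}{\partial h}\big)$. The condition number and maximum eigenvalue of the sub-FIM $F$ can be exactly derived from the spectra of $\Sigma_x$ and $\Sigma_{\nabla h}$, as shown in the following proposition.
Proposition 1. Given $\Sigma_x$, $\Sigma_{\nabla h}$ and $F = \Sigma_x \otimes \Sigma_{\nabla h}$, we have: 1) $\lambda_{max}(F) = \lambda_{max}(\Sigma_x) \cdot \lambda_{max}(\Sigma_{\nabla h})$; 2) $\kappa(F) = \kappa(\Sigma_x) \cdot \kappa(\Sigma_{\nabla h})$.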
The proof is shown in Appendix A.1. Proposition 1 provides an efficient way to calculate the maximum eigenvalue and condition number of a sub-FIM by computing those of the layer-input and output-gradient covariance matrices. In practice, we use the empirical distribution to approximate the expected distributions when calculating these two covariance matrices, since this is very efficient and can be performed with only one forward and backward pass. Such an approximation has been shown to be effective and efficient in FIM approximation methods [29, 1].
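Proposition 1 can be checked numerically on small matrices. The sketch below assumes random stand-ins for the layer input and the output-gradient (the batch size, dimensions and variable names are hypothetical) and forms the Kronecker product explicitly only to verify the identity; in practice only the two small factors would be needed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mini-batch: 512 samples, 6-dim layer input, 4-dim output-gradient.
x = rng.normal(size=(512, 6))
grad_h = rng.normal(size=(512, 4))

# Empirical second-moment matrices of the input and the output-gradient.
sigma_x = x.T @ x / x.shape[0]
sigma_h = grad_h.T @ grad_h / grad_h.shape[0]

def lam_max(m):
    return np.linalg.eigvalsh(m)[-1]      # eigvalsh returns ascending order

def cond(m):
    ev = np.linalg.eigvalsh(m)
    return ev[-1] / ev[0]

sub_fim = np.kron(sigma_x, sigma_h)       # 24 x 24, built here only for the check

lhs_lam, rhs_lam = lam_max(sub_fim), lam_max(sigma_x) * lam_max(sigma_h)
lhs_kappa, rhs_kappa = cond(sub_fim), cond(sigma_x) * cond(sigma_h)
```

Both pairs agree up to floating-point error, because the eigenvalues of a Kronecker product of two symmetric matrices are the pairwise products of their eigenvalues.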
3.1 Connection to Proximal Back-propagation
Carreira-Perpinan and Wang propose the method of auxiliary coordinates (MAC), which redefines the optimization objective with equality constraints imposed on each neuron. They solve the constrained optimization by adding a quadratic penalty, and the resulting optimization objective is defined as follows:
Here, each layer is represented by its own function. It has been shown that, under mild conditions, the solution of minimizing the penalized objective converges to the solution of minimizing the original objective as the penalty weight tends to infinity. Furthermore, proximal back-propagation reformulates each sub-problem independently in a backward order, minimizing each layer's objective given the target signal from the upper layer, as follows:
Interestingly, if we view the auxiliary variable as the pre-activation in a certain layer, the sub-optimization problem in each layer is formulated as:
It is clear that the sub-optimization problems with respect to the layer weights are actually linear classification (for $k = K$) or linear regression (for $k < K$) models. (Footnote 3: Note that the layer input is also to be optimized in Eqn. 7, to provide the target signal to the lower layer.) Their conditioning analysis is well characterized in Section 2.
This connection suggests that: 1) the quality (conditioning) of the full optimization problem is well correlated with that of its sub-optimization problems shown in Eqn. 7, whose local curvature matrices can be well explored; 2) we can diagnose the ill behaviors of a DNN by inspecting its spectrum with respect to a certain layer. We will elaborate on this in Sections 4 and 5.
4 Exploring Batch Normalized Networks
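Let $x$ denote the input for a given neuron in one layer of a DNN. Batch normalization (BN) standardizes the neuron within a mini-batch of $m$ examples by:
$$BN(x^{(i)}) = \gamma\,\frac{x^{(i)} - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \tag{8}$$
where $\mu = \frac{1}{m}\sum_{i=1}^{m} x^{(i)}$ and $\sigma^2 = \frac{1}{m}\sum_{i=1}^{m} (x^{(i)} - \mu)^2$ are the mean and variance, respectively, and $\epsilon$ is a small constant for numerical stability. The learnable parameters $\gamma$ and $\beta$ are used to recover the representation capacity. BN is a ubiquitously employed technique in various architectures [21, 14, 44, 18] because of its ability to stabilize and accelerate training. Here, we explore how BN stabilizes and accelerates training based on our layer-wise conditioning analysis.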
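The standardization step can be written in a few lines. This is a minimal training-mode sketch in NumPy, assuming a (batch, neurons) layout; the function name `batch_norm` and the default `eps` are illustrative choices, and running statistics for inference are omitted.

```python
import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Standardize each neuron (column) over the mini-batch (rows),
    then apply the learnable scale gamma and shift beta."""
    mu = h.mean(axis=0)
    var = h.var(axis=0)        # biased variance over the mini-batch
    return gamma * (h - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
h = rng.normal(loc=3.0, scale=5.0, size=(128, 4))   # shifted, scaled activations
out = batch_norm(h, gamma=np.ones(4), beta=np.zeros(4))
```

With `gamma = 1` and `beta = 0`, every column of `out` has approximately zero mean and unit variance regardless of the scale of `h`.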
4.1 Stabilizing Training
From the perspective of a practitioner, two phenomena relate to the instability of training a neural network: 1) the training loss first increases significantly and then diverges; 2) the training loss hardly changes compared to the initial condition (e.g., random guess for classification). The former is mainly caused by weights with large updates (e.g., exploded gradients or optimization with a large learning rate). The latter is caused by weights with few updates (e.g., vanished gradients or optimization with a small learning rate). In the following theorem, we show that the unnormalized rectifier neural network is very likely to encounter both phenomena.
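Theorem 1. Given a rectifier neural network (Eqn. 3) with nonlinearity $\phi(\alpha x) = \alpha \phi(x)$ ($\alpha > 0$), if the weight in each layer is scaled by $\hat{W}_k = \alpha_k W_k$ ($k = 1, \ldots, K$ and $\alpha_k > 0$), we have the scaled layer input: $\hat{x}_k = \big(\prod_{i=1}^{k} \alpha_i\big) x_k$. With the assumption that $\frac{\partial L}{\partial \hat{h}_K} = \mu \frac{\partial L}{\partial h_K}$, we have the output-gradient: $\frac{\partial L}{\partial \hat{h}_k} = \mu \big(\prod_{i=k+1}^{K} \alpha_i\big) \frac{\partial L}{\partial h_k}$, and the weight-gradient: $\frac{\partial L}{\partial \hat{W}_k} = \mu \big(\prod_{i=1, i \neq k}^{K} \alpha_i\big) \frac{\partial L}{\partial W_k}$, for all $k = 1, \ldots, K$.
The proof is shown in Appendix A.2. From Theorem 1, we observe that the scaling factor $\alpha_k$ of the weight in layer $k$ affects all other layers' weight-gradients. Specifically, if all $\alpha_k > 1$ ($\alpha_k < 1$), the weight-gradient will increase (decrease) exponentially with depth within one iteration. Moreover, such an exponentially increased weight-gradient will be sustained and amplified in the subsequent iterations, due to the increased magnitude of the weight caused by the update. That is why the unnormalized neural network diverges once the training loss increases over a few consecutive iterations.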
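The scaling behavior in Theorem 1 can be reproduced with a small hand-written forward and backward pass. The sketch below assumes a bias-free ReLU MLP with a linear last layer, the same scaling factor in every layer, and a fixed gradient arriving at the top layer (i.e., $\mu = 1$); the helper `forward_backward` and all sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_backward(weights, x0, top_grad):
    """Bias-free ReLU MLP whose last layer is linear.

    Returns the layer inputs x_0..x_{K-1} and the weight-gradients for a
    fixed gradient top_grad arriving at the final pre-activation h_K.
    """
    xs, hs = [x0], []
    for k, w in enumerate(weights):
        h = xs[-1] @ w.T
        hs.append(h)
        if k < len(weights) - 1:
            xs.append(np.maximum(h, 0.0))
    grads, g = [None] * len(weights), top_grad
    for k in reversed(range(len(weights))):
        grads[k] = g.T @ xs[k]                        # dL/dW for this layer
        if k > 0:
            g = (g @ weights[k]) * (hs[k - 1] > 0)    # back through ReLU
    return xs, grads

K, alpha = 4, 2.0
weights = [rng.normal(size=(8, 8)) / np.sqrt(8) for _ in range(K)]
x0 = rng.normal(size=(16, 8))
top_grad = rng.normal(size=(16, 8))   # assumption: mu = 1, same top gradient

xs, grads = forward_backward(weights, x0, top_grad)
xs_s, grads_s = forward_backward([alpha * w for w in weights], x0, top_grad)

# Input of the last layer and gradient of the first layer both scale by alpha**(K-1).
input_ratio = np.linalg.norm(xs_s[-1]) / np.linalg.norm(xs[-1])
grad_ratio = np.linalg.norm(grads_s[0]) / np.linalg.norm(grads[0])
```

Doubling every weight in this four-layer network multiplies both ratios by $2^{3} = 8$, which is the exponential dependence on depth described above.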
We further show that such an instability behavior can be relieved by batch normalization, based on the following theorem.
Theorem 2. Under the same condition as Theorem 1, for the normalized network with $h_k = W_k x_{k-1}$ and $s_k = BN(h_k)$, we have: $\hat{x}_k = x_k$, $\frac{\partial L}{\partial \hat{h}_k} = \frac{1}{\alpha_k} \frac{\partial L}{\partial h_k}$, $\frac{\partial L}{\partial \hat{W}_k} = \frac{1}{\alpha_k} \frac{\partial L}{\partial W_k}$, for all $k = 1, \ldots, K$.
The proof is shown in Appendix A.3. From Theorem 2, the scaling factor of the weight does not affect other layers' activations/gradients. The magnitude of the weight-gradient is inversely proportional to the scaling factor. Such a mechanism stabilizes the weight growth/reduction, as shown in [21, 43]. Note that the behaviors when stabilizing training (Theorem 2) also apply to other activation normalization methods [2, 41, 16]. We note that the scale-invariance of BN in stabilizing training has been analyzed in previous work. Different from those analyses of the normalization layer itself, we provide an explicit formulation of weight-gradients and output-gradients in a network, which is more important for characterizing the learning dynamics of DNNs.
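The scale invariance behind Theorem 2 can be checked for a single normalized layer. The sketch below assumes BN without the learnable scale and shift and with `eps = 0`, and uses the standard closed-form backward pass of the standardization; the helper names `bn` and `bn_backward` are hypothetical.

```python
import numpy as np

def bn(h):
    """Standardize each column over the batch (no gamma/beta, eps = 0)."""
    return (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0))

def bn_backward(h, g):
    """Gradient of the loss w.r.t. h, given g = dL/ds for s = bn(h)."""
    s, sigma = bn(h), np.sqrt(h.var(axis=0))
    return (g - g.mean(axis=0) - s * (g * s).mean(axis=0)) / sigma

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 8))
w = rng.normal(size=(5, 8))
g = rng.normal(size=(64, 5))     # stand-in for the gradient arriving at s
alpha = 10.0

h, h_scaled = x @ w.T, x @ (alpha * w).T

# Forward: the normalized output ignores the weight scale.
max_diff = np.abs(bn(h) - bn(h_scaled)).max()

# Backward: the output-gradient shrinks by 1 / alpha.
grad_ratio = np.linalg.norm(bn_backward(h_scaled, g)) / np.linalg.norm(bn_backward(h, g))
```

The forward outputs coincide up to rounding, while the gradient with respect to the pre-activation is ten times smaller for the scaled weights, matching the $1/\alpha_k$ factor in the theorem.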
We further conduct experiments to show how the activation/gradient is affected by initialization in unnormalized DNNs (indicated as ‘Plain’) and batch normalized DNNs (indicated as ‘BN’). We observe each layer's $\lambda_{max}(\Sigma_x)$ and $\lambda_{max}(\Sigma_{\nabla h})$, because: 1) $\lambda_{max}(\Sigma_x)$ indicates the magnitude of the layer input; 2) $\lambda_{max}(\Sigma_{\nabla h})$ indicates the magnitude of the layer output-gradient; 3) $\lambda_{max}(F)$, which indicates the magnitude of the weight-gradient and relates to the upper bound of the learning rate, can be derived from $\lambda_{max}(\Sigma_x)$ and $\lambda_{max}(\Sigma_{\nabla h})$. We train a 20-layer MLP, with 256 neurons in each layer, for MNIST classification. The nonlinearity is ReLU. We use full gradient descent and show the best performance among the learning rates tried. (Footnote 4: We also perform SGD with a batch size of 1024, and further perform experiments on convolutional neural networks for CIFAR-10 classification. The results are shown in Appendix C.1, in which we make the same observations as with full gradient descent.) We observe that the magnitude of the layer input (output-gradient) of ‘Plain’ with random initialization decreases exponentially during the forward pass (backward pass) (Figure 2 (a)). The main reason for this is that the weights have a small magnitude, based on Theorem 1. This problem can be relieved by He-initialization, where the magnitude of the input/output-gradient is stable across layers (Figure 2 (d)). We observe that BN preserves the magnitude of the input/output-gradient well across different layers for both initialization methods, as shown in Figure 2 (a) and (d).
It has been shown that the scale-invariant property of BN has an implicit early stopping effect on the weight matrices and helps to stabilize learning towards convergence. Here we show that such ‘early stopping’ is layer-wise, and that certain ‘early stopping’ has detrimental effects on learning, since it creates the false impression of a local minimum. For illustration, we provide a rough definition, called weight domination, with respect to a certain layer.
Let $W_k$ and $\frac{\partial L}{\partial W_k}$ be the weight matrix and its gradient in layer $k$. If $\lambda\big(\frac{\partial L}{\partial W_k}\big) \ll \lambda(W_k)$, where $\lambda(\cdot)$ indicates the maximum singular value of a matrix, we say that layer $k$ is in a state of weight domination.
Weight domination implies a smoother gradient with respect to this layer. This is a desirable property for a linear model (where the distribution of the input is fixed), and it is the state that optimization algorithms aim for. However, weight domination is not always desirable for a given layer of a neural network, since such a state can be caused by an increased magnitude of the weight matrix or a decreased magnitude of the layer input (the non-convex optimization in Eqn. 7), rather than by the optimization objective. Although BN ensures a stable distribution of the layer input, a network with BN can still have the magnitude of the weight in certain layers significantly increased. We experimentally observe this phenomenon, as shown in Appendix C.2. A similar phenomenon is also observed in prior work, where BN results in large updates of the corresponding weights.
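The definition of weight domination can be turned into a simple diagnostic. The sketch below compares the maximum singular values of a weight matrix and its gradient; the function names and, in particular, the cut-off `ratio_threshold` used to stand in for "much smaller than" are assumptions made for illustration, since the text gives only a rough definition.

```python
import numpy as np

def max_singular_value(m):
    # For a 2-D array, the matrix 2-norm is the largest singular value.
    return np.linalg.norm(m, ord=2)

def is_weight_dominated(weight, weight_grad, ratio_threshold=1e-3):
    """True if sigma_max(grad) is far below sigma_max(weight).

    ratio_threshold is a hypothetical cut-off for '<<'.
    """
    ratio = max_singular_value(weight_grad) / max_singular_value(weight)
    return bool(ratio < ratio_threshold)

w = np.diag([10.0, 5.0])
dominated = is_weight_dominated(w, np.diag([1e-4, 1e-5]))      # feeble gradient
not_dominated = is_weight_dominated(w, np.diag([1.0, 0.5]))    # healthy gradient
```

Tracking this ratio per layer during training is one way to locate layers whose updates have become negligible relative to their weights.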
Weight domination sometimes harms the learning of the network, because this state limits the representation ability of the corresponding layer. To investigate this, we conduct experiments on a five-layer MLP and show the results in Figure 3. We observe that a network with certain layers in the state of weight domination can still decrease the loss, but its performance degenerates. We also conduct experiments on Convolutional Neural Networks (CNNs) for the CIFAR-10 dataset, shown in Appendix C.2.
4.2 Improved Conditioning
One motivation of BN is that whitening the input can improve the conditioning of the optimization (e.g., for a linear model, the Hessian will be an identity matrix under the condition that $E_D(x x^T) = I$, based on Eqn. 1), and thus can accelerate training [8, 20]. However, this motivation has hardly been validated further by either theoretical or empirical analysis. Furthermore, this motivation holds under the condition that BN is placed before the linear layer, while, in practice, BN is typically placed after the linear layer, as recommended by Ioffe and Szegedy [21]. In this section, we empirically explore this motivation based on our layer-wise conditioning analysis in the scenario of training DNNs.
We first experimentally observe that BN not only improves the conditioning of the layer input's covariance matrix, but also improves the conditioning of the output-gradient's covariance matrix, as shown in Figure 4. It has been shown that centered data is more likely to be well-conditioned [26, 38, 31, 19]. This suggests that placing BN after the linear layer can improve the conditioning of the output-gradient, because centering the activation also centers the gradient when the gradient is back-propagated through such a transformation.
We further observe that the unnormalized network (‘Plain’) has many small eigenvalues. For further exploration, we monitor the output of each neuron in each layer, and find that ‘Plain’ has some neurons that are not activated (zero output of ReLU) for any training example; we refer to this kind of neuron as a dying neuron. We also observe that ‘Plain’ has some neurons that are activated for every training example; we refer to them as full neurons. This observation is particularly obvious in the initial iterations. The number of dying/full neurons increases with the layer index (Figure 5). We conjecture that the dying neurons cause ‘Plain’ to have many small/zero eigenvalues. In contrast, batch normalized networks have no dying/full neurons, because the centering operation ensures that about half of the examples activate each neuron. This further suggests that placing BN before the nonlinear activation can improve the conditioning.
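Counting dying and full neurons only requires the sign pattern of the ReLU inputs. The following is a minimal sketch, assuming a (examples, neurons) array of pre-activations; the function name `count_dying_and_full` and the tiny example matrix are hypothetical.

```python
import numpy as np

def count_dying_and_full(pre_activations):
    """pre_activations: (num_examples, num_neurons) inputs to ReLU.

    A dying neuron is inactive for every example, a full neuron is
    active for every example.
    """
    active = pre_activations > 0
    dying = int((~active).all(axis=0).sum())
    full = int(active.all(axis=0).sum())
    return dying, full

h = np.array([[-1.0, 1.0, 1.0],
              [-2.0, 3.0, -1.0]])

plain_counts = count_dying_and_full(h)                       # one dying, one full
centered_counts = count_dying_and_full(h - h.mean(axis=0))   # centering removes both
```

After subtracting the per-neuron mean, every non-constant column contains both positive and negative entries, which is why centering rules out dying and full neurons.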
5 Training Very Deep Residual Network
Residual networks have significantly relieved the difficulty of training deep networks by introducing the residual connection, which makes training networks with hundreds or even thousands of layers possible. However, residual networks also suffer from degenerated performance when the model is extremely deep (e.g., the 1202-layer residual network performs worse than the 110-layer one). He et al. argue that this is due to over-fitting, not optimization difficulty. Here we show that a very deep residual network may also suffer from difficulty in optimization.
We first perform experiments on the CIFAR-10 dataset with residual networks, following the same experimental setup as in prior work, except that we run the experiments on one GPU. We vary the network depth over $\{56, 110, 230, 1202\}$, and show the training loss in Figure 6 (a). We observe that the residual networks have an increased loss in the initial iterations, which is amplified for deeper networks. Later, the training gets stuck in a state where the network outputs random guesses (the loss stays at about $\ln 10 \approx 2.3$, the cross-entropy of a uniform guess over 10 classes). Although the network can escape such a state with enough iterations, it suffers from degenerated training performance, especially for very deep networks.
Analysis of Learning Dynamics
To explore why the residual network shows such behavior, we perform the layer-wise conditioning analysis on the last linear layer (before the cross-entropy loss), whose input and weight we denote by $x$ and $W$. We monitor the maximum eigenvalue of the covariance matrix of the layer input, $\lambda_{\Sigma_x}$, the maximum eigenvalue of the second moment matrix of the weight-gradient, $\lambda_{\Sigma_{\nabla W}}$, and the norm of the weight, $\|W\|_2$.
We observe that the initially increased loss is mainly caused by the large magnitude of $\lambda_{\Sigma_x}$ (Figure 6 (b)), which results in a large magnitude of $\lambda_{\Sigma_{\nabla W}}$, and thus a large magnitude of $\|W\|_2$ (Figure 6 (d)). (Footnote 5: The large magnitude of $\lambda_{\Sigma_x}$ is caused mainly by the addition of multiple residual connections from the previous layers with ReLU output.) The increased $\|W\|_2$ further facilitates the increase of the loss. However, the learning objective is to decrease the loss, and thus the training is expected to decrease the magnitude of $W$ or $x$ (based on Eqn. 7) in this case. Apparently, $W$ is harder to adjust, because its loss landscape is controlled by $\Sigma_x$, and all the values of $x$ are non-negative with large magnitude. The network thus tries to decrease $x$ based on the given learning objective. We experimentally find that the learnable parameters of BN have a large number of negative values, which leave the ReLUs (positioned after the residual adding operation) inactive. Such a dynamic results in a significant reduction in the magnitude of $\lambda_{\Sigma_x}$. The small $x$ and the large $W$ drive the last linear layer of the network into the state of weight domination, and the network shows random-guess behavior. Although the residual network can escape such a state with enough iterations, the weight domination hinders optimization and results in degenerated training performance.
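Table 1: Test error (%) on CIFAR-10 for residual networks of different depths.

| Method | Depth 56 | Depth 110 | Depth 230 | Depth 1202 |
|---|---|---|---|---|
| ResNet (original) | 7.52 ± 0.30 | 6.89 ± 0.52 | 7.35 ± 0.64 | 9.42 ± 3.10 |
| ResNet with BN before the last linear layer | 6.50 ± 0.22 | 6.10 ± 0.09 | 5.94 ± 0.18 | 5.68 ± 0.14 |

Table 2: Test error (%) on CIFAR-100 for residual networks of different depths.

| Method | Depth 56 | Depth 110 | Depth 230 | Depth 1202 |
|---|---|---|---|---|
| ResNet (original) | 29.60 ± 0.41 | 28.3 ± 1.09 | 29.25 ± 0.44 | 30.49 ± 4.44 |
| ResNet with BN before the last linear layer | 28.82 ± 0.38 | 27.05 ± 0.23 | 25.80 ± 0.10 | 25.51 ± 0.27 |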
[Figure: training loss of residual networks with different depths on the CIFAR-10 dataset, evaluated with respect to the epochs.]
5.1 Proposed Solution
Based on the above analysis, the essential point is to restrain the large magnitude of the input to the last linear layer. We propose a simple solution: adding one BN layer before the last linear layer to normalize its input. In what follows, we distinguish this residual network with the additional BN layer from the original ResNet. We also provide the analysis on the last linear layer of the modified network, and we show the comparison between the two on the 1202-layer network in Figure 7. We observe that the network with the additional BN layer trains steadily. It does not encounter the state of weight domination or a large magnitude of the layer input in the last linear layer.
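The effect of normalizing the input of the last linear layer can be illustrated with a toy stand-in for the features. The sketch below assumes that the features are a sum of many non-negative (ReLU-like) random branches, which is only a crude proxy for a deep residual stack; the number of branches, the sizes, and the helper names are hypothetical, and the learnable scale and shift of BN are omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical features: sum of 200 non-negative branches, 256 examples, 16 channels.
num_blocks = 200
features = sum(np.maximum(rng.normal(size=(256, 16)), 0.0) for _ in range(num_blocks))

def lam_max_second_moment(x):
    """Largest eigenvalue of the empirical second-moment matrix of x."""
    return np.linalg.eigvalsh(x.T @ x / x.shape[0])[-1]

def standardize(x, eps=1e-5):
    """BN-style standardization over the batch, without gamma/beta."""
    return (x - x.mean(axis=0)) / np.sqrt(x.var(axis=0) + eps)

lam_raw = lam_max_second_moment(features)               # grows with the number of branches
lam_bn = lam_max_second_moment(standardize(features))   # bounded by the channel count
```

In this toy setting the raw features give a maximum eigenvalue around $10^5$, while after standardization it cannot exceed the number of channels, since each channel then has at most unit second moment.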
We also try a similar solution in which the input of the last linear layer is divided by a constant, and we find that it also benefits training. However, the main disadvantage of this solution is that the value of the constant has to be finely tuned for networks with different depths. We also try putting one BN layer before the average pooling; it has similar effects to the one placed before the last linear layer. We note that Bjorck et al. propose to train a 110-layer residual network with only one BN layer, placed before the average pooling, and show that it achieves good results. We argue that this does not hold for very deep networks: we perform experiments on the 1202-layer residual network, and we find that the model always fails to train under various hyper-parameters.
Table 3: Top-1 validation errors (%, single model and single-crop) on 18, 50 and 101-layer residual networks for ImageNet classification.
The residual network with the additional BN layer, a simple revision of ResNet, achieves significant improvements in performance for very deep residual networks. We show the experimental results on CIFAR and ImageNet classification as follows.
Figure 8 (a) and (b) show the training loss of the original ResNet and of the ResNet with the additional BN layer, respectively, on the CIFAR-10 dataset. We observe that the original ResNet, with a depth of 1202, appears to have degenerated training performance, especially in the initial phase. Note that, as the depth increases, the original ResNet obtains worse training performance in the first 80 epochs (before the learning rate is reduced), which coincides with our previous analysis. The ResNet with the additional BN layer obtains nearly the same training loss for networks with different depths in the first 80 epochs. Moreover, it shows lower training loss with increasing depth. Comparing Figure 8 (b) to (a), we observe that the ResNet with the additional BN layer has a better training loss than the original ResNet for all depths (e.g., at a depth of 56, the loss of the original ResNet is 0.081, while that of the modified network is 0.043).
Table 1 shows the test errors. We observe that the ResNet with the additional BN layer achieves better test performance with increasing depth, while the original ResNet has degenerated performance. Compared to the original ResNet, the modified network has consistently improved performance over different depths. In particular, it reduces the absolute test error of the original ResNet by 1.02%, 0.79%, 1.41% and 3.74% at depths of 56, 110, 230 and 1202, respectively (differences of the mean errors in Table 1). Due to the optimization efficiency of the modified network, its performance is likely to improve further if we amplify the regularization of the training. Thus, we set the weight decay to 0.0002 and double the training iterations, and find that the 1202-layer ResNet achieves a test error of . We also train a 2402-layer network. We observe that the original ResNet cannot converge, while the ResNet with the additional BN layer achieves a test error of .
We further perform the experiment on CIFAR-100, using the same experimental setup as for CIFAR-10. Table 2 shows the test errors. The ResNet with the additional BN layer reduces the absolute test error of the original ResNet by 0.78%, 1.25%, 3.45% and 4.98% at depths of 56, 110, 230 and 1202, respectively (differences of the mean errors in Table 2).
We also validate the effectiveness of the additional BN layer on large-scale ImageNet classification, with 1000 classes. We use the official 1.28M training images as the training set, and evaluate the top-1 classification errors on the validation set with 50k images. We perform the experiments on the 18, 50 and 101-layer networks. We follow the same setup as in prior work, except that: 1) we train for 100 epochs, with an extra lowering of the learning rate at the 90th epoch; 2) we use 1 GPU for the 18-layer network and 4 GPUs for the 50 and 101-layer networks.
We also observe that the ResNet with the additional BN layer has better optimization efficiency than the original ResNet (see Appendix D). We thus report the best result from the network with dropout of 0.3 (before the last linear layer) and without dropout, for a fairer comparison. Table 3 shows the validation errors. We find that the ResNet with the additional BN layer improves on the original ResNet for all networks.
We propose a layer-wise conditioning analysis to characterize the optimization behaviors of DNNs. Such an analysis is theoretically derived under mild assumptions that approximately hold in practice. Based on our layer-wise conditioning analysis, we show how batch normalization stabilizes training and improves the conditioning of the optimization problem. We further show that very deep residual networks still suffer difficulty in optimization, which is caused by the ill-conditioned state of the last linear layer. We remedy this by adding only one BN layer before the last linear layer. We expect our method to provide new insights in analyzing and understanding the techniques of training DNNs.
-  Jimmy Ba, Roger Grosse, and James Martens. Distributed second-order optimization using kronecker-factored approximations. In ICLR, 2017.
-  Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
-  Alberto Bernacchia, Mate Lengyel, and Guillaume Hennequin. Exact natural gradient in deep linear networks and its application to the nonlinear case. In NeurIPS. 2018.
-  Johan Bjorck, Carla Gomes, and Bart Selman. Understanding batch normalization. In NeurIPS, 2018.
Léon Bottou, Frank E Curtis, and Jorge Nocedal.
Optimization methods for large-scale machine learning.Siam Review, 60(2):223–311, 2018.
-  Miguel Carreira-Perpinan and Weiran Wang. Distributed optimization of deeply nested systems. In AISTATS, 2014.
-  J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR, 2009.
-  Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and koray kavukcuoglu. Natural neural networks. In NeurIPS, 2015.
Thomas Frerix, Thomas Möllenhoff, Michael Möller, and Daniel
Proximal backpropagation.In ICLR, 2018.
-  Behrooz Ghorbani, Shankar Krishnan, and Ying Xiao. An investigation into neural net optimization via hessian eigenvalue density. In ICML, 2019.
-  Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, AISTATS 2010, 2010.
-  Roger B. Grosse and James Martens. A kronecker-factored approximate fisher matrix for convolution layers. In ICML, 2016.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In ICCV, 2015.
-  Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
-  G E Hinton and R R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.
-  Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. NeurIPS, 2018.
-  Roger A Horn and C. A Johnson. Topics in Matrix Analysis. Cambridge University Press, 1991.
-  Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
-  Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, and Dacheng Tao. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.
-  Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
-  Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
-  Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014.
-  Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.
-  Yann LeCun, Yoshua Bengio, and Geoffrey E. Hinton. Deep learning. Nature, 521:436–444, 2015.
-  Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient backprop. In Neural Networks: Tricks of the Trade, pages 9–50, 1998.
-  Yann LeCun, Ido Kanter, and Sara A. Solla. Second order properties of error surfaces: Learning time and generalization. In NeurIPS, 1990.
-  James Martens. Deep learning via hessian-free optimization. In ICML, pages 735–742. Omnipress, 2010.
-  James Martens. New perspectives on the natural gradient method. CoRR, abs/1412.1193, 2014.
-  James Martens and Roger Grosse. Optimizing neural networks with kronecker-factored approximate curvature. In ICML, 2015.
-  James Martens, Ilya Sutskever, and Kevin Swersky. Estimating the hessian by back-propagating curvature. In ICML, 2012.
-  Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann Machines and the Centering Trick, volume 7700 of LNCS. 2012.
-  Vardan Papyan. The full spectrum of deep net hessians at scale: Dynamics with sample size. CoRR, abs/1811.07062, 2018.
-  Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In ICLR, 2014.
-  Nicolas Le Roux, Pierre-Antoine Manzagol, and Yoshua Bengio. Topmoumoute online natural gradient algorithm. In NeurIPS, pages 849–856. Curran Associates, Inc., 2007.
-  Levent Sagun, Utku Evci, V. Ugur Güney, Yann N. Dauphin, and Léon Bottou. Empirical analysis of the hessian of over-parametrized neural networks. CoRR, abs/1706.04454, 2017.
-  Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In NeurIPS, 2018.
-  Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks. In ICLR, 2014.
-  Nicol N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
-  Ke Sun and Frank Nielsen. Relative Fisher information and natural gradient for learning large modular models. In ICML, 2017.
-  Simon Wiesler and Hermann Ney. A convergence analysis of log-linear training. In NeurIPS, 2011.
-  Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. CoRR, 2018.
-  Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.
-  Greg Yang, Jeffrey Pennington, Vinay Rao, Jascha Sohl-Dickstein, and Samuel S. Schoenholz. A mean field theory of batch normalization. In ICLR, 2019.
-  Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.
-  Matthew D. Zeiler. ADADELTA: an adaptive learning rate method. CoRR, abs/1212.5701, 2012.
-  Huishuai Zhang, Wei Chen, and Tie-Yan Liu. On the local hessian in back-propagation. In NeurIPS, 2018.
Appendix A Proof of Theorems
Here, we provide proofs for the proposition and the two theorems in the paper.
A.1 Proof of Proposition 1
Proposition 1. Given , and , we have: 1) ; 2) ;
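The proof is mainly based on Theorem 4.2.12 in Horn and Johnson's Topics in Matrix Analysis, which is restated as follows:
Lemma A.1. Let $A \in \mathbb{R}^{m \times m}$ and $B \in \mathbb{R}^{n \times n}$. Furthermore, let $\lambda$ be an arbitrary eigenvalue of $A$ and $\mu$ be an arbitrary eigenvalue of $B$. Then $\lambda\mu$ is an eigenvalue of $A \otimes B$. Furthermore, any eigenvalue of $A \otimes B$ arises as a product of eigenvalues of $A$ and $B$.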
Based on the definitions of and , we have that and are positive semidefinite. Therefore, all the eigenvalues of / are non-negative. Let denote the eigenvalue of matrix . Based on Lemma A.1, we have . Since and are non-negative, we thus have . Similarly, we can prove that . We thus have .
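The statement of Proposition 1 can be checked numerically. The following is a minimal sketch, assuming the two factors are positive definite stand-ins for the covariance matrices of the layer input and the output-gradient; the names `sigma_in`, `sigma_out`, and `random_psd` are hypothetical and introduced only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_psd(dim, n_samples):
    """Sample second-moment matrix X^T X / n, which is positive semidefinite."""
    x = rng.standard_normal((n_samples, dim))
    return x.T @ x / n_samples

def lam_max(m):
    """Largest eigenvalue of a symmetric matrix (eigvalsh sorts ascending)."""
    return np.linalg.eigvalsh(m)[-1]

def cond(m):
    """Condition number: ratio of the largest to the smallest eigenvalue."""
    w = np.linalg.eigvalsh(m)
    return w[-1] / w[0]

sigma_in = random_psd(4, 50)    # stand-in for the layer-input covariance
sigma_out = random_psd(3, 50)   # stand-in for the output-gradient covariance
kron = np.kron(sigma_in, sigma_out)

# Both quantities of the Kronecker product factorize over the two matrices.
lam_max_matches = np.isclose(lam_max(kron), lam_max(sigma_in) * lam_max(sigma_out))
cond_matches = np.isclose(cond(kron), cond(sigma_in) * cond(sigma_out))
print(lam_max_matches, cond_matches)
```

The practical point is that only the two small factors need an eigendecomposition; the Kronecker product is formed here solely to verify the identity.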
A.2 Proof of Theorem 1
Theorem 1. Given a rectifier neural network (Eqn. 3) with nonlinearity ( ), if the weight in each layer is scaled by ( and ), we have the scaled layer input: . With the assumption that , we have the output-gradient: , and weight-gradient: , for all .
(1) We first demonstrate that the scaled layer input (), using mathematical induction. It is easy to validate that and . We assume that and hold, for . When , we have
We thus have
By induction, we have , for . We also have for .
(2) We then demonstrate that the scaled output-gradient for . We also prove this using mathematical induction. Based on back-propagation, we have
Based on the assumption that , we have . (Footnote 6: We denote if .)
We assume that holds, for . When , we have
We also have
By induction, we thus have , for .
(3) Based on , and , it is easy to prove that for . ∎
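The induction above rests on the positive homogeneity of the rectifier: relu(a·h) = a·relu(h) for a > 0. The sketch below illustrates the resulting scaling on a small bias-free ReLU MLP. It is not the paper's code; it assumes that the gradient at the top layer is identical for the original and the scaled network, and the widths, the scales in `alphas`, and the helper `forward_backward` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
widths = [5, 6, 6, 4]                       # input width followed by three layer widths
weights = [rng.standard_normal((widths[k + 1], widths[k])) for k in range(3)]
alphas = [0.5, 3.0, 2.0]                    # hypothetical positive scale for each layer
x0 = rng.standard_normal(widths[0])
top_grad = rng.standard_normal(widths[-1])  # assumed identical dL/dh at the top layer

def forward_backward(ws):
    """Return layer outputs, output-gradients dL/dh_k and weight-gradients dL/dW_k."""
    xs, hs = [x0], []
    for w in ws:
        hs.append(w @ xs[-1])               # pre-activation of this layer
        xs.append(np.maximum(hs[-1], 0.0))  # ReLU output, input of the next layer
    deltas = [None] * len(ws)
    deltas[-1] = top_grad
    for k in range(len(ws) - 2, -1, -1):
        back = ws[k + 1].T @ deltas[k + 1]  # gradient w.r.t. the output of layer k
        deltas[k] = back * (hs[k] > 0)      # back-propagate through the ReLU
    grads = [np.outer(deltas[k], xs[k]) for k in range(len(ws))]
    return xs[1:], deltas, grads

outs, deltas, grads = forward_backward(weights)
outs_s, deltas_s, grads_s = forward_backward([a * w for a, w in zip(alphas, weights)])

checks = []
for k in range(3):
    below = np.prod(alphas[:k + 1])         # scales of layers up to and including k
    above = np.prod(alphas[k + 1:])         # scales of the layers above k
    checks.append(np.allclose(outs_s[k], below * outs[k]))
    checks.append(np.allclose(deltas_s[k], above * deltas[k]))
    checks.append(np.allclose(grads_s[k], above * below / alphas[k] * grads[k]))
print(all(checks))
```

In this sketch the weight-gradient of layer k is scaled by the product of the scales of all other layers, so unequal scales across layers lead to very different gradient magnitudes per layer.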
A.3 Proof of Theorem 2
Theorem 2. Under the same conditions as Theorem 1, for the normalized network with and , we have: , , , for all .
(1) Following the proof of Theorem 1, by mathematical induction, it is easy to demonstrate that , and , for all .
(2) We also use mathematical induction to demonstrate for all
We first show the formulation of the gradient back-propagating through each neuron of the BN layer as:
is the standard deviation and denotes the expectation over mini-batch examples. We have based on . Since , we have . Therefore, we have from Eqn. 15.
Assume that for . When , we have:
By induction, we have , for all
(3) Based on , and , we have that , for all
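The key property used in this proof is that batch normalization removes the scale of the weight. The following sketch checks this on a single linear layer followed by BN. It makes several simplifying assumptions that are not taken from the paper: BN has no epsilon and no learnable affine parameters, the loss is a toy linear function of the normalized output, and the gradient is estimated by central finite differences. All names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 5))     # mini-batch of 32 examples with 5 features
w = rng.standard_normal((5, 3))      # weight of one linear layer
c = rng.standard_normal((32, 3))     # fixed coefficients of a toy linear loss
alpha = 4.0                          # hypothetical positive weight scale

def bn(h):
    """Normalize each neuron over the mini-batch (no epsilon, no affine part)."""
    return (h - h.mean(axis=0)) / h.std(axis=0)

def loss(weight):
    """Toy loss: a fixed linear function of the normalized layer output."""
    return np.sum(c * bn(x @ weight))

def numerical_grad(f, weight, step=1e-6):
    """Central finite-difference estimate of the gradient of f at weight."""
    grad = np.zeros_like(weight)
    for idx in np.ndindex(*weight.shape):
        bump = np.zeros_like(weight)
        bump[idx] = step
        grad[idx] = (f(weight + bump) - f(weight - bump)) / (2 * step)
    return grad

# The normalized output does not change when the weight is scaled.
same_output = np.allclose(bn(x @ (alpha * w)), bn(x @ w))

# The weight-gradient shrinks by the factor 1 / alpha.
grad_plain = numerical_grad(loss, w)
grad_scaled = numerical_grad(loss, alpha * w)
grad_shrinks = np.allclose(grad_scaled, grad_plain / alpha, rtol=1e-4, atol=1e-6)
print(same_output, grad_shrinks)
```

A larger weight norm therefore yields a proportionally smaller weight-gradient, which is the mechanism behind the weight domination experiments in Appendix C.2.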
Appendix B Comparison of the Analysis with Full Curvature Matrix and Sub-curvature Matrices
Here, we conduct experiments to analyze the training dynamics of the unnormalized (‘Plain’) and batch normalized (‘BN’) networks, by analyzing the spectrum of the full curvature matrix and of the sub-curvature matrices. Figure A1 shows the results using the Fisher Information Matrix (FIM) on an 8-layer MLP with 24 neurons in each layer. We calculate the maximum eigenvalue and the condition number with respect to different percentages for both the full FIM and the sub-FIMs. By observing the results from the full FIM (Figure A1 (a)), we find that: 1) the unnormalized network suffers from gradient vanishing (the maximum eigenvalue is around ), while the batch normalized network has an appropriate gradient magnitude (the maximum eigenvalue is around ); 2) ‘BN’ has better conditioning than ‘Plain’, which suggests that batch normalization (BN) can improve the conditioning of the network, as observed in [36, 10]. We reach a similar conclusion by observing the results from the sub-FIMs (Figure A1 (b) to (i)). This experiment shows that our layer-wise conditioning analysis can uncover the training dynamics of the networks if the full conditioning analysis can.
We also conduct experiments to analyze ‘Plain’ and ‘BN’ using the second moment matrix of the sample gradient. The results are shown in Figure A2. We make the same observations as with the FIM.
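The quantities reported in this appendix can be computed along the following lines. This is a hedged sketch rather than the code used for the figures: it assumes that the condition number "with respect to a percentage" p is the ratio of the largest eigenvalue to the eigenvalue at position ceil(p·d) in descending order, and it uses randomly generated per-sample gradients. The function `spectrum_summary` and the choice of the first four parameters as one layer's block are hypothetical.

```python
import numpy as np

def spectrum_summary(sample_grads, percentages=(0.5, 0.8, 1.0)):
    """Max eigenvalue and assumed percentage-based condition numbers of G^T G / N."""
    n = sample_grads.shape[0]
    second_moment = sample_grads.T @ sample_grads / n   # second moment of sample gradients
    eig = np.linalg.eigvalsh(second_moment)[::-1]       # eigenvalues in descending order
    summary = {"lambda_max": eig[0]}
    for p in percentages:
        idx = int(np.ceil(p * eig.size)) - 1            # position of the ceil(p*d)-th eigenvalue
        summary[f"kappa_{p}"] = eig[0] / eig[idx]
    return summary

rng = np.random.default_rng(0)
# Hypothetical per-sample gradients: 200 samples, 10 parameters in total,
# with decreasing scale across parameters to make the spectrum non-trivial.
grads = rng.standard_normal((200, 10)) * np.linspace(1.0, 0.1, 10)

full = spectrum_summary(grads)          # analysis of the full matrix
block = spectrum_summary(grads[:, :4])  # analysis of one layer's sub-matrix
print(full["lambda_max"], block["lambda_max"])
```

Because a sub-matrix built from one layer's parameters is a principal sub-matrix of the full one, its maximum eigenvalue never exceeds that of the full matrix, and its eigendecomposition is much cheaper.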
Appendix C More Experiments in Exploring Batch Normalized Networks
In this section, we provide more experimental results on exploring batch normalization (BN) by layer-wise conditioning analysis, as discussed in Section 4 of the paper. We include experiments that train neural networks with stochastic gradient descent (SGD) and experiments on the weight domination discussed in the paper.
C.1 Experiments with SGD
Here, we perform experiments on a multilayer perceptron (MLP) for MNIST classification and on convolutional neural networks (CNNs) for CIFAR-10 classification.
C.1.1 MLP for MNIST Classification
We use the same experimental setup as the experiments described in the paper for MNIST classification, except that we use SGD with a batch size of 1024. The results are shown in Figure A3 and Figure A4. We obtain results similar to those obtained using full gradient descent, as described in Section 4 of the paper.
C.1.2 CNN for CIFAR-10 Classification
We perform a layer-wise conditioning analysis on the VGG-style and residual network architectures. Note that we view the activation at each spatial location of the feature map as an independent example when calculating the covariance matrices of the convolutional layer input and output-gradient. This process is similar to the one used in BN to normalize a convolutional layer.
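One way to realize this treatment of spatial locations is sketched below. It assumes a feature map stored in (N, C, H, W) layout and a centered covariance over channels; both are assumptions made for illustration, and `spatial_covariance` is a hypothetical helper rather than a function from the paper.

```python
import numpy as np

def spatial_covariance(feature_map):
    """Channel covariance, treating every (example, row, column) position as one sample."""
    n, c, h, w = feature_map.shape
    # Move channels last, then flatten examples and spatial positions together.
    samples = feature_map.transpose(0, 2, 3, 1).reshape(n * h * w, c)
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / samples.shape[0]

rng = np.random.default_rng(0)
activations = rng.standard_normal((8, 16, 5, 5))  # hypothetical batch in (N, C, H, W) layout
cov = spatial_covariance(activations)
print(cov.shape)
```

The same routine applies to the output-gradient tensor of a convolutional layer, so the two covariance matrices have sizes that depend on the channel counts rather than on the spatial resolution.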
We use the 20-layer residual network described in the paper  for CIFAR-10 classification. The VGG-style network is constructed based on the 20-layer residual network by removing the residual connections.
We use the same setups as described in , except that we do not use weight decay, in order to simplify the analysis, and we run the experiments on one GPU. Since the unnormalized networks (both the VGG-style and the residual network) do not converge with the large learning rate of 0.1, we run additional experiments with a learning rate of 0.01 for the unnormalized networks and report these results.
C.2 Experiments Relating to Weight Domination
Gradient Explosion of BN
In Section 4.1 of the paper, we mention that, in a network with BN, the magnitude of the weights in certain layers can still increase significantly. Here, we provide the experimental results.
We conduct experiments on a 100-layer batch normalized MLP with 256 neurons in each layer for MNIST classification. We calculate the maximum eigenvalues of the sub-FIMs and show the results during the first 7 iterations in Figure A7 (a). We observe that the weight-gradient explodes exponentially at initialization (‘Iter0’). After a single step, the first-step gradients dominate the weights due to gradient explosion in the lower layers, hence the exponential growth in the magnitude of the weights. This increased weight magnitude leads to small weight gradients (‘Iter1’ to ‘Iter7’), which is caused by BN, as discussed in Section 4.1 of the paper. Therefore, some layers of the network (especially the lower layers) enter the state of weight domination. We make similar observations on the 110-layer VGG-style network for CIFAR-10 classification, as shown in Figure A7 (b).
Investigation of Weight Domination
Weight domination sometimes harms the learning of the network, because this state limits the representation ability of the corresponding layer. We conducted experiments on a five-layer MLP and showed the results in Section 4. Here, we also conduct experiments on CNNs for the CIFAR-10 dataset, shown in Figure A8. We observe that a network in which certain layers are in the state of weight domination can still decrease the loss, but its performance is degraded.