1 Introduction
Centering, scaling and decorrelating the input data is known as data whitening, which has demonstrated enormous success in speeding up training [27]. Batch Normalization (BN) [21]
extends the operations from the input layer to the activations of each intermediate layer, centering and scaling them within a minibatch so that each neuron has a zero mean and a unit variance (Figure 1 (a)). BN has been extensively used in various network architectures [13, 45, 14, 52, 44, 16] for its benefits in improving both optimization efficiency [21, 11, 23, 5, 39] and generalization capability [21, 3, 5, 50]. However, instead of performing whitening, BN only performs standardization, which centers and scales the activations but does not decorrelate them [21]. On the other hand, previous works suggest that further decorrelating the activations benefits both optimization [11, 31] and generalization [7, 51]. To improve BN with whitening, Decorrelated Batch Normalization (DBN) [19] was proposed to whiten the activations of each layer within a minibatch, such that the output of each layer has an identity covariance matrix (Figure 1 (b)). DBN improves over BN in both training efficiency and generalization capability, but it relies heavily on a large batch size and on eigendecomposition or singular value decomposition (SVD), which suffer from poor efficiency on GPUs.
To address these issues, we propose Iterative Normalization (IterNorm), which further enhances BN with more efficient whitening. IterNorm avoids eigendecomposition and SVD by employing Newton's iteration to approximate the whitening matrix, so the capacity of GPUs can be effectively exploited. The eigenvalues of the covariance matrix are normalized prior to the iterations so that the convergence condition of Newton's iteration is guaranteed. As illustrated in Figure 1 (c), IterNorm stretches the dimensions along the eigenvectors progressively, so that the associated eigenvalues converge to 1 after normalization. One desirable property is that the convergence speed of IterNorm along an eigenvector is proportional to the associated eigenvalue [4]. This means the dimensions that correspond to small/zero eigenvalues (e.g., when a small batch size is used) are largely ignored, given a fixed number of iterations. As a consequence, the sensitivity of IterNorm to the batch size is significantly reduced. When the data batch is undersized, it is known that the performance of both whitening and standardization on the test data can be significantly degraded [20, 50]. However, contrary to our expectation, we observe that the performance on the training set also significantly degenerates under the same condition. We further observe that this phenomenon is caused by the stochasticity introduced by minibatch based normalization [46, 41]. To allow a more comprehensive understanding and evaluation of this stochasticity, we introduce Stochastic Normalization Disturbance (SND), which is discussed in Section 4. With the support of SND, we provide a thorough analysis of the performance of normalization methods with respect to the batch size and feature dimensions, and show that IterNorm achieves a better tradeoff between optimization and generalization. Experiments on CIFAR-10 [24] and ILSVRC-2012 [10] demonstrate consistent improvements of IterNorm over BN and DBN.
2 Related Work
Normalized activations [40, 35, 33, 48] have long been known to benefit neural network training. Some methods attempt to normalize activations by viewing the population statistics as parameters and estimating them directly during training
[33, 48, 11]. These methods include activation centering in Restricted Boltzmann Machines [33, 48] and activation whitening [11, 31]. This type of normalization may suffer from instability (such as divergence or gradient explosion) due to 1) inaccurate approximation of the population statistics from local data samples [48, 21, 20, 19] and 2) the internal covariate shift problem [21]. Ioffe and Szegedy [21] propose to perform normalization as a function over minibatch data and to backpropagate through the transformation. Multiple standardization options have been explored for normalizing minibatch data, including $L_2$ standardization [21], $L_1$ standardization [49, 15] and $L_\infty$ standardization [15]. One critical issue with these methods, however, is that they normally require a reasonable batch size for estimating the mean and variance. To address this issue, a significant number of standardization approaches have been proposed [3, 50, 36, 32, 20, 29, 47, 26, 9]. Our work develops in an orthogonal direction to these approaches, and aims at improving BN with decorrelated activations.
Beyond standardization, Huang et al. [19] propose DBN, which performs ZCA whitening by eigendecomposition and backpropagates through the transformation. Our approach aims at a much more efficient approximation of the ZCA whitening matrix used in DBN, and suggests that approximate whitening is more effective, based on the analysis in Section 4.
Our approach is also related to works that normalize the network weights (e.g., through reparameterization [38, 18, 17] or weight regularization [25, 34, 37]), and to works that specially design either scaling coefficients and bias values [1] or nonlinear functions [22] to normalize activations implicitly [41]. IterNorm differs from these works in that it is a data-dependent normalization, while those approaches are independent of the data.
Newton's iteration is also employed in several other deep neural networks. These methods focus on constructing bilinear [30] or second-order pooling [28] by constraining the power of the covariance matrix, and are limited to producing fully-connected activations, while our work provides a generic module that can be built into various neural network architectures. Besides, our method computes the square root inverse of the covariance matrix, instead of the square root of the covariance matrix [30, 28].
3 Iterative Normalization
Let $\mathbf{X} \in \mathbb{R}^{d \times m}$ be a data matrix denoting the minibatch input of size $m$ in a certain layer. BN [21] works by standardizing the activations over the minibatch input:

$\mathbf{X}_N = \Lambda^{-1/2}(\mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T)$,  (1)

where $\boldsymbol{\mu} = \frac{1}{m}\mathbf{X}\mathbf{1}$ is the mean of $\mathbf{X}$, $\Lambda = \mathrm{diag}(\sigma_1^2 + \epsilon, \ldots, \sigma_d^2 + \epsilon)$, $\sigma_i^2$ is the dimension-wise variance corresponding to the $i$-th dimension, $\mathbf{1}$ is a column vector of all ones, and $\epsilon > 0$ is a small number to prevent numerical instability. Intuitively, standardization ensures that the normalized output gives equal importance to each dimension by multiplying by the scaling matrix $\Lambda^{-1/2}$ (Figure 1 (a)). DBN [19] further uses ZCA whitening to produce the whitened output (note that DBN and BN both use learnable dimension-wise scale and shift parameters to recover the possible loss of representation capability):
$\widehat{\mathbf{X}} = \mathbf{D}\Lambda^{-1/2}\mathbf{D}^T(\mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T)$,  (2)

where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ and $\mathbf{D} = [\mathbf{d}_1, \ldots, \mathbf{d}_d]$ are the eigenvalues and associated eigenvectors of $\Sigma$, i.e. $\Sigma = \mathbf{D}\Lambda\mathbf{D}^T$, and $\Sigma = \frac{1}{m}(\mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T)(\mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T)^T + \epsilon\mathbf{I}$ is the covariance matrix of the centered input. ZCA whitening works by stretching or squeezing the dimensions along the eigenvectors such that the associated eigenvalues become 1 (Figure 1 (b)). Whitening the activations ensures that all dimensions along the eigenvectors have equal importance in the subsequent linear layer.
One crucial problem of ZCA whitening is that calculating the whitening matrix requires eigendecomposition or SVD, as shown in Eqn. 2, which heavily constrains its practical application. We observe that the whitening matrix in Eqn. 2 can be viewed as the square root inverse of the covariance matrix, denoted by $\Sigma^{-1/2}$, which multiplies the centered input. The square root inverse of a matrix can be calculated using Newton's iteration [4], which avoids executing eigendecomposition or SVD.
3.1 Computing $\Sigma^{-1/2}$ by Newton's Iteration
Given a square matrix $\mathbf{A}$, Newton's method calculates $\mathbf{A}^{-1/2}$ by the following iterations [4]:

$\mathbf{P}_0 = \mathbf{I}$; $\mathbf{P}_k = \frac{1}{2}(3\mathbf{P}_{k-1} - \mathbf{P}_{k-1}^3\mathbf{A})$, $k = 1, 2, \ldots, T$,  (3)

where $k$ is the iteration number. $\mathbf{P}_k$ converges to $\mathbf{A}^{-1/2}$ under the condition $\|\mathbf{I} - \mathbf{A}\|_2 < 1$.
In terms of applying Newton's method to calculate the inverse square root of the covariance matrix $\Sigma$, one crucial problem is that $\Sigma$ cannot be guaranteed to satisfy the convergence condition $\|\mathbf{I} - \Sigma\|_2 < 1$, because $\Sigma$ is calculated over minibatch samples and thus varies during training. If the convergence condition is not satisfied, the training can be highly unstable [4, 28]. To address this issue, we observe that one sufficient condition for convergence is to ensure that all eigenvalues of the covariance matrix lie in $(0, 2)$, which for a symmetric matrix is equivalent to the condition above. We thus propose to construct a transformation $\Sigma_N = \phi(\Sigma)$ such that $\|\mathbf{I} - \Sigma_N\|_2 < 1$, and ensure the transformation is differentiable such that the gradients can backpropagate through it. One feasible transformation is to normalize the eigenvalues as follows:
$\Sigma_N = \Sigma / \mathrm{tr}(\Sigma)$,  (4)

where $\mathrm{tr}(\Sigma)$ indicates the trace of $\Sigma$. Note that $\Sigma_N$ is also a positive semi-definite matrix and thus all of its eigenvalues are greater than or equal to 0. Besides, $\Sigma_N$ has the property that the sum of its eigenvalues is 1. Therefore, $\Sigma_N$ can surely satisfy the convergence condition. We can thus calculate the inverse square root $\Sigma_N^{-1/2} = \mathbf{P}_T$ by Newton's method, as in Eqn. 3. Given $\Sigma_N^{-1/2}$, we can compute $\Sigma^{-1/2}$ based on Eqn. 4, as follows:

$\Sigma^{-1/2} = \Sigma_N^{-1/2} / \sqrt{\mathrm{tr}(\Sigma)}$.  (5)
Given $\Sigma^{-1/2}$, it's easy to whiten the activations by multiplying it with the centered input: $\widehat{\mathbf{X}} = \Sigma^{-1/2}(\mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T)$. In summary, Algorithm 1 describes our proposed method for whitening the activations in neural networks.
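The whole forward pass (center, trace-normalize via Eqn. 4, iterate via Eqn. 3, rescale via Eqn. 5) can be sketched in NumPy as follows; the function and variable names are our own:

```python
import numpy as np

def iternorm_whiten(X, T=5, eps=1e-5):
    """Whiten a d x m minibatch X: center, normalize the covariance by its
    trace so that Newton's iteration converges, run T iterations, rescale.
    Returns the whitened output and the whitening matrix Sigma^{-1/2}."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)
    tr = np.trace(Sigma)
    Sigma_N = Sigma / tr                  # Eqn. 4: eigenvalues now sum to 1
    P = np.eye(d)
    for _ in range(T):                    # Eqn. 3
        P = 0.5 * (3.0 * P - P @ P @ P @ Sigma_N)
    W = P / np.sqrt(tr)                   # Eqn. 5: Sigma^{-1/2}
    return W @ Xc, W

rng = np.random.default_rng(1)
X = rng.standard_normal((8, 1024)) * np.linspace(0.5, 3.0, 8)[:, None]
Xw, W = iternorm_whiten(X, T=10)
cov = Xw @ Xw.T / X.shape[1]
print(np.round(np.diag(cov), 2))   # each entry close to 1.0
```

With enough iterations the output covariance approaches the identity; with a small $T$, directions with tiny eigenvalues remain only partially whitened, which is the controlled-whitening behavior discussed in Section 4.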
Our method first normalizes the eigenvalues of the covariance matrix, such that the convergence condition of Newton's iteration is satisfied. We then progressively stretch the dimensions along the eigenvectors, such that the final associated eigenvalues are all "1", as shown in Figure 1 (c). Note that the speed of convergence along an eigenvector is proportional to the associated eigenvalue [4]: the larger the eigenvalue, the faster its associated dimension converges. This mechanism provides a remarkable means to control the extent of whitening, which is essential for the success of whitening activations, as pointed out in [19], and will be further discussed in Section 4.
3.2 Backpropagation
As pointed out by [21, 19], viewing standardization or whitening as functions over the minibatch data and backpropagating through the normalized transformation are essential for stabilizing training. Here, we derive the backpropagation pass of IterNorm. Denoting $\mathcal{L}$ as the loss function, the key is to calculate $\frac{\partial \mathcal{L}}{\partial \Sigma_N}$, given $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_T}$. Let's denote $\mathbf{P}_T = \Sigma_N^{-1/2}$, where $T$ is the iteration number. Based on the chain rule, we have:

$\frac{\partial \mathcal{L}}{\partial \Sigma_N} = -\frac{1}{2}\sum_{k=1}^{T}\left(\mathbf{P}_{k-1}^3\right)^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}$,  (6)

where $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}$ can be calculated by the following backward iterations, starting from the given $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_T}$:

$\frac{\partial \mathcal{L}}{\partial \mathbf{P}_{k-1}} = \frac{3}{2}\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k} - \frac{1}{2}\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\left(\mathbf{P}_{k-1}^2\Sigma_N\right)^T - \frac{1}{2}\left(\mathbf{P}_{k-1}^2\right)^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\Sigma_N^T - \frac{1}{2}\mathbf{P}_{k-1}^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\left(\mathbf{P}_{k-1}\Sigma_N\right)^T$.  (7)

The full derivation is given in Appendix A.
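The backward iterations can be checked numerically. Below is a NumPy sketch (with our own helper names) that implements Eqns. 6 and 7 and verifies one gradient entry against a central finite difference:

```python
import numpy as np

def forward(Sigma_N, T):
    """Run Eqn. 3 and keep every intermediate P_k for the backward pass."""
    Ps = [np.eye(Sigma_N.shape[0])]
    for _ in range(T):
        P = Ps[-1]
        Ps.append(0.5 * (3.0 * P - P @ P @ P @ Sigma_N))
    return Ps

def backward(Ps, Sigma_N, dPT):
    """Accumulate dL/dSigma_N (Eqn. 6) while iterating dL/dP_k backwards (Eqn. 7)."""
    dP = dPT
    dSigma = np.zeros_like(Sigma_N)
    for k in range(len(Ps) - 1, 0, -1):
        P = Ps[k - 1]
        dSigma += -0.5 * (P @ P @ P).T @ dP
        dP = (1.5 * dP
              - 0.5 * dP @ (P @ P @ Sigma_N).T
              - 0.5 * (P @ P).T @ dP @ Sigma_N.T
              - 0.5 * P.T @ dP @ (P @ Sigma_N).T)
    return dSigma

rng = np.random.default_rng(2)
B = rng.standard_normal((3, 3))
S = B @ B.T
S /= np.trace(S)                    # eigenvalues sum to 1: convergence holds
G = rng.standard_normal((3, 3))     # an arbitrary incoming gradient dL/dP_T
T = 5
dS = backward(forward(S, T), S, G)

# central finite-difference check of a single entry of dL/dSigma_N
eps, i, j = 1e-6, 0, 1
Sp, Sm = S.copy(), S.copy()
Sp[i, j] += eps
Sm[i, j] -= eps
num = (np.sum(G * forward(Sp, T)[-1]) - np.sum(G * forward(Sm, T)[-1])) / (2 * eps)
print(abs(num - dS[i, j]) < 1e-3 * max(1.0, abs(num)))
```

Here the scalar loss is taken as $\mathcal{L} = \sum_{ij} G_{ij}(\mathbf{P}_T)_{ij}$, so $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_T} = G$; the check passes for any such $G$.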
3.3 Training and Inference
Like previous methods for normalizing activations [21, 3, 19, 50], our IterNorm can be used as a module and inserted into a network extensively. Since IterNorm is also a minibatch based method, we use running averages to estimate the population mean $\hat{\boldsymbol{\mu}}$ and whitening matrix $\widehat{\Sigma^{-1/2}}$, which are used during inference. Specifically, during training, we initialize $\hat{\boldsymbol{\mu}}$ as $\mathbf{0}$ and $\widehat{\Sigma^{-1/2}}$ as $\mathbf{I}$, and update them as follows:

$\hat{\boldsymbol{\mu}} = (1 - \lambda)\hat{\boldsymbol{\mu}} + \lambda\boldsymbol{\mu}$, $\quad \widehat{\Sigma^{-1/2}} = (1 - \lambda)\widehat{\Sigma^{-1/2}} + \lambda\Sigma^{-1/2}$,  (8)

where $\boldsymbol{\mu}$ and $\Sigma^{-1/2}$ are the mean and whitening matrix calculated within each minibatch during training, and $\lambda$ is the momentum of the running average.
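The update of Eqn. 8 amounts to an exponential moving average; a minimal sketch (the class name and momentum value are our own choices):

```python
import numpy as np

class RunningWhitening:
    """Track the population mean and whitening matrix for inference (Eqn. 8)."""
    def __init__(self, d, momentum=0.1):
        self.mu_hat = np.zeros((d, 1))   # initialized to 0
        self.W_hat = np.eye(d)           # initialized to I
        self.momentum = momentum

    def update(self, mu, W):
        lam = self.momentum
        self.mu_hat = (1 - lam) * self.mu_hat + lam * mu
        self.W_hat = (1 - lam) * self.W_hat + lam * W

    def infer(self, X):
        """Whiten with the running statistics, as done at test time."""
        return self.W_hat @ (X - self.mu_hat)

rw = RunningWhitening(d=3)
rw.update(np.ones((3, 1)), 2.0 * np.eye(3))
print(rw.mu_hat.ravel())   # [0.1 0.1 0.1]
```

Note that the running average is taken over the whitening matrix $\Sigma^{-1/2}$ itself rather than over $\Sigma$, so no iteration is needed at inference time.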
Additionally, we use extra learnable dimension-wise parameters $\gamma$ and $\beta$, as in previous normalization methods [21, 3, 19, 50], since normalizing the activations constrains the model's capacity for representation. Such a process has been shown to be effective [21, 3, 19, 50].
Convolutional Layer
For a CNN, the input is $\mathbf{X} \in \mathbb{R}^{h \times w \times d \times m}$, where $h$ and $w$ indicate the height and width of the feature maps, and $d$ and $m$ are the numbers of feature maps and examples, respectively. Following [21], we view each spatial position of a feature map as a sample. We thus unroll $\mathbf{X}$ as $\mathbf{X} \in \mathbb{R}^{d \times (mhw)}$, with $mhw$ examples and $d$ feature maps. The whitening operation is performed over the unrolled $\mathbf{X}$.
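The unrolling can be sketched with a transpose and reshape (array shapes follow the text; names are our own):

```python
import numpy as np

# Unroll an (h, w, d, m) activation so that every spatial position is a
# sample: whitening then sees d features and m*h*w examples.
h, w, d, m = 4, 4, 8, 16
X = np.random.randn(h, w, d, m)
X_unrolled = X.transpose(2, 0, 1, 3).reshape(d, h * w * m)
print(X_unrolled.shape)   # (8, 256)

# After whitening, the inverse permutation restores the original layout.
X_back = X_unrolled.reshape(d, h, w, m).transpose(1, 2, 0, 3)
print(np.array_equal(X_back, X))   # True
```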
Computational Cost
The main computation of IterNorm includes calculating the covariance matrix, the iteration operation, and the whitened output. The computational costs of the first and the third operations are equivalent to a $1 \times 1$ convolution, each costing $mhwd^2$ multiply-adds; the iteration operation costs $O(Td^3)$. Overall, our method is comparable to a convolution operation. To be specific, given the internal activation $\mathbf{X} \in \mathbb{R}^{h \times w \times d \times m}$, a $3 \times 3$ convolution with the same number of input and output feature maps costs $9mhwd^2$, while our IterNorm costs $2mhwd^2 + O(Td^3)$. The relative cost of IterNorm to the convolution is thus roughly $\frac{2}{9}$ when $mhw \gg Td$. Further, we can use group-wise whitening, as introduced in [19], to improve the efficiency when the dimension $d$ is large. We also compare the wall-clock time of IterNorm, DBN [19] and convolution in Appendix B.
During inference, IterNorm can be viewed as a $1 \times 1$ convolution and merged into adjacent convolutions. Therefore, IterNorm does not introduce any extra memory or computational cost during inference.
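This merging works because the inference-time transform is a fixed affine map $\mathbf{x} \mapsto \widehat{W}(\mathbf{x} - \hat{\boldsymbol{\mu}})$. A sketch of folding it into a following linear layer, with hypothetical weights:

```python
import numpy as np

rng = np.random.default_rng(3)
d, k, m = 6, 4, 10
W_hat = rng.standard_normal((d, d))    # fixed whitening matrix at inference
mu_hat = rng.standard_normal((d, 1))   # fixed population mean
A = rng.standard_normal((k, d))        # weights of the next linear layer
b = rng.standard_normal((k, 1))        # its bias

# Fold: A (W_hat (x - mu_hat)) + b == A' x + b'
A_folded = A @ W_hat
b_folded = b - A @ W_hat @ mu_hat

X = rng.standard_normal((d, m))
y_ref = A @ (W_hat @ (X - mu_hat)) + b
y_fold = A_folded @ X + b_folded
print(np.allclose(y_ref, y_fold))   # True
```

The same algebra applies when the adjacent layer is a convolution, since the whitening acts as a $1 \times 1$ convolution across channels.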
4 Stochasticity of Normalization
Minibatch based normalization methods are sensitive to the batch size [20, 19, 50]. As described in [19], fully whitening the activations may suffer from degenerate performance when the number of samples in a minibatch is insufficient; they thus propose group-wise whitening [19]. Furthermore, standardization also suffers from degraded performance in the micro-batch scenario [47]. These works argue that an undersized data batch makes the estimated population statistics highly noisy, which results in degraded performance during inference [3, 20, 50].
In this section, we provide a more thorough analysis of the performance of normalization methods with respect to the batch size and feature dimensions. We show that normalization (standardization or whitening) with an undersized data batch not only suffers from degenerate performance during inference, but also encounters difficulty in optimization during training. This is caused by the Stochastic Normalization Disturbance (SND), which we describe below.
4.1 Stochastic Normalization Disturbance
Given a sample $\mathbf{x}$ from a distribution $P_\chi$, we take a sample set $X^B = \{\mathbf{x}_1, \ldots, \mathbf{x}_B\}$ with a size of $B$. We denote the normalization operation as $F(\cdot; X^B)$ and the normalized output as $\widehat{\mathbf{x}} = F(\mathbf{x}; X^B)$. For a certain $\mathbf{x}$, the sample set $X^B$ can be viewed as a random variable [2, 46]; $\widehat{\mathbf{x}}$ is thus a random variable, which exhibits the stochasticity. It's interesting to explore the statistical moments of $\widehat{\mathbf{x}}$ to measure the magnitude of this stochasticity. Here we define the Stochastic Normalization Disturbance (SND) for the sample $\mathbf{x}$ over the normalization $F$ as:

$\Delta_F(\mathbf{x}) = \mathbb{E}_{X^B}\big[\|F(\mathbf{x}; X^B) - \mathbb{E}_{X^B}[F(\mathbf{x}; X^B)]\|\big]$.  (9)

It's difficult to accurately compute this moment if no further assumptions are made over the random variable $X^B$; however, we can explore its empirical estimation over sampled sets as follows:

$\hat{\Delta}_F(\mathbf{x}) = \frac{1}{s}\sum_{i=1}^{s}\big\|F(\mathbf{x}; X_i^B) - \frac{1}{s}\sum_{j=1}^{s}F(\mathbf{x}; X_j^B)\big\|$,  (10)
where $s$ denotes the number of times of sampling. Figure 2 illustrates a sample $\mathbf{x}$'s SND with respect to the BN operation. We find that SND is closely related to the batch size: when the batch size is large, the given sample has a small SND and the transformed outputs have a compact distribution. As a consequence, the stochastic uncertainty is low.
SND can be used to evaluate the stochasticity of a sample after the normalization operation, which works like the dropout rate [43]. We can further define the normalization operation $F$'s SND as $\Delta_F = \mathbb{E}_{\mathbf{x}}[\Delta_F(\mathbf{x})]$, and its empirical estimation as $\hat{\Delta}_F = \frac{1}{N}\sum_{i=1}^{N}\hat{\Delta}_F(\mathbf{x}_i)$, where $N$ is the number of sampled examples. $\Delta_F$ describes the magnitude of stochasticity for the corresponding normalization operation.
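The empirical estimator of Eqn. 10 can be sketched for BN as follows (our own helper names; each of the $s$ batches is drawn afresh, and the fixed sample $\mathbf{x}$ is normalized together with it):

```python
import numpy as np

def bn(X, eps=1e-5):
    mu = X.mean(axis=1, keepdims=True)
    var = X.var(axis=1, keepdims=True)
    return (X - mu) / np.sqrt(var + eps)

def snd_bn(x, batch_size, s=200, seed=0):
    """Empirical SND of a fixed sample x under BN (Eqn. 10): the average
    distance of its s normalized outputs from their mean, with a fresh
    random batch drawn each time."""
    rng = np.random.default_rng(seed)
    d = x.shape[0]
    outs = []
    for _ in range(s):
        batch = rng.standard_normal((d, batch_size - 1))
        X = np.concatenate([x, batch], axis=1)   # x joins the minibatch
        outs.append(bn(X)[:, 0])                 # keep x's normalized output
    outs = np.stack(outs)
    return np.linalg.norm(outs - outs.mean(axis=0), axis=1).mean()

x = np.full((32, 1), 1.5)
print(snd_bn(x, 16) > snd_bn(x, 64))   # SND shrinks as the batch grows: True
```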
Exploring the exact statistical behavior of SND is difficult and out of the scope of this paper. We can, however, explore how SND relates to the batch size and feature dimension. We find that SND gives a reasonable explanation of why we should control the extent of whitening, and why minibatch based normalizations degrade when given a small batch size.
Figure 2: Illustration of SND with different batch sizes. We sample 3,000 examples (black points) from a Gaussian distribution. We show a given example $\mathbf{x}$ (red cross) and its BN outputs (blue plus signs), when normalized over different sample sets $X^B$. (a) and (b) show the results with batch sizes of 16 and 64, respectively.

4.2 Controlling the Extent of Whitening
We start with experiments on a multilayer perceptron (MLP) over the MNIST dataset, using the full-batch gradient (batch size = 60,000), as shown in Figure 3 (a). We find that all normalization methods significantly improve the performance. One interesting observation is that fully whitening the activations, even with such a large batch size, still underperforms the approximate whitening of IterNorm in terms of training efficiency. Intuitively, full-whitening may amplify the dimensions with small eigenvalues, which may correspond to noise. Exaggerating this noise can be harmful to learning, and especially lowers the generalization capability, as shown in Figure 3 (a), where DBN has diminished test performance. We provide further analysis based on SND, along with a conditioning analysis. It has been shown that improved conditioning can accelerate training [27, 11], while increased stochasticity can slow down training but is likely to improve generalization [43]. We experimentally explore the combined effects of improved conditioning [27] and SND through BN (standardization), DBN (full-whitening) and IterNorm (approximate whitening with 5 iterations). We calculate the condition number of the covariance matrix of the normalized output, and the SND, for the different normalization methods (Figure 4). We find that DBN has the best conditioning, with an exact condition number of 1; however, it significantly enlarges SND, especially in a high-dimensional space. Therefore, full-whitening cannot consistently improve training efficiency: the benefit of the highly improved conditioning is balanced out by the larger SND. This observation also explains why group-based whitening [19] (which reduces the number of dimensions to be whitened) works better from the training perspective.
IterNorm has consistently improved conditioning over different dimensions compared to BN. Interestingly, IterNorm has reduced SND in a high-dimensional space, since it adaptively normalizes the dimensions along different eigenvectors based on the convergence theory of Newton's iteration [4]. Therefore, IterNorm possesses a better tradeoff between improved conditioning and SND, which explains why it can be trained more efficiently. We also provide results of IterNorm with different iteration numbers in Appendix C.
4.3 Microbatch Problem of BN
BN suffers from degenerate test performance when the batch data is undersized [50]. We show that BN also suffers from optimization difficulty with a small batch size. Figure 3 (b) shows experimental results on the MNIST dataset with a batch size of 2. We find that BN can hardly learn and produces random results, while the naive network without normalization learns well. This observation clearly shows that BN suffers from training difficulties with an undersized data batch.
For an in-depth investigation, we sample data of dimension 128 (Figure 5 (a)), and find that BN has significantly increased SND. With increasing batch sizes, the SND of BN is gradually reduced; meanwhile, reduced SND leads to more stable training. When we fix the batch size to 2 and vary the dimension (Figure 5 (b)), we observe that the SND of BN is reduced in low dimensions and increased in a high-dimensional space. This explains why BN suffers from difficulty during training with a small batch, and why group-based normalization [50] (which reduces the dimension and implicitly adds examples to be standardized) alleviates the problem.
Compared to BN, IterNorm is much less sensitive to a small batch size in terms of SND. Besides, the SND of IterNorm is more stable, even with a significantly increased dimension. These characteristics of IterNorm are mainly attributed to its adaptive normalization mechanism: it stretches the dimensions along eigenvectors with large eigenvalues and correspondingly ignores those with small eigenvalues, given a fixed number of iterations [4].
5 Experiments
We evaluate IterNorm with CNNs on the CIFAR datasets [24] to show its better optimization efficiency and generalization capability compared to BN [21] and DBN [19]. Furthermore, we apply IterNorm to residual networks to show the performance improvements on CIFAR-10 and ImageNet [10] classification tasks. The code to reproduce the experiments is available at https://github.com/huangleiBuaa/IterNorm.
5.1 Sensitivity Analysis
We analyze the proposed method on CNN architectures over the CIFAR-10 dataset [24], which contains 10 classes with 50,000 training examples and 10,000 test examples of 32×32 color images with 3 channels. We use VGG networks [42] tailored for 32×32 inputs (16 convolution layers and 1 fully-connected layer); the details of the networks are shown in Appendix D.
The data are preprocessed with mean subtraction and variance division. We also apply standard data augmentation, such as random flips and random crops with padding, as described in [13].

Figure 6: Comparison among BN, DBN and IterNorm on VGG networks over the CIFAR-10 dataset. We report the training (solid lines) and test (dashed lines) errors with respect to epochs.
Experimental Setup
We use SGD with a batch size of 256 to optimize the model. We set the initial learning rate to 0.1, divide it by 5 at 60 and 120 epochs, and finish training at 160 epochs. All results are averaged over 3 runs. For DBN, we use a group size of 16 as recommended in [19]; we find that DBN is unstable for a group size of 32 or above, because the eigendecomposition operation fails to converge. The main reason is that the batch size is insufficient for DBN to fully whiten the activations of each layer. For IterNorm, we don't use group-wise whitening in the experiments, unless otherwise stated.
Effect of Iteration Number
The iteration number $T$ of IterNorm controls the extent of whitening. Here we explore the effect of $T$ on the performance of IterNorm. Note that when $T = 0$, our method reduces to normalizing the eigenvalues such that their sum is 1. Figure 7 (a) shows the results. We find that the smallest and the largest iteration numbers both have the worst performance in terms of training efficiency. Further, for the largest iteration number, IterNorm has significantly worse test performance. These observations show that (1) whitening within a minibatch can improve the optimization efficiency, since IterNorm progressively stretches the data along the eigenvectors such that the corresponding eigenvalues converge towards 1 with increasing $T$; and (2) controlling the extent of whitening is essential for its success, since stretching the dimensions along small eigenvalues may produce large SND, as described in Section 4, which not only makes estimating the population statistics difficult (therefore causing higher test error), but also makes optimization difficult. Unless otherwise stated, we use an iteration number of $T = 5$ in subsequent experiments.
Effects of Group Size
We also investigate the effect of group size. We vary the group size, and compare against the full-whitening operation of IterNorm (group size of 512). Note that IterNorm with a group size of 1, like DBN, reduces to Batch Normalization [21], which is ensured by Eqns. 4 and 5. The results are shown in Figure 7 (b). We find that IterNorm, unlike DBN, is not sensitive to large group sizes, in both training and testing. The main reason is that IterNorm gradually stretches the data along the eigenvectors such that the corresponding eigenvalues converge towards 1, and the speed of convergence for each dimension is proportional to the associated eigenvalue [4]. Even though there are many small or zero eigenvalues in a high-dimensional space, IterNorm only stretches the dimensions along the associated eigenvectors slightly, given a small iteration number $T$, which introduces little SND. In practice, we can use a smaller group size to reduce the computational cost. We use a group size of 64 in the experiments of Sections 5.2 and 5.3 for IterNorm.
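Group-wise whitening can be sketched by applying the trace-normalized Newton iteration independently to each block of channels (our own function name; with a group size of 1 this reduces to BN, as noted above):

```python
import numpy as np

def groupwise_whiten(X, group_size, T=5, eps=1e-5):
    """Whiten a d x m minibatch block by block: each group of channels is
    centered and whitened independently with the trace-normalized Newton
    iteration, which keeps every per-group covariance well conditioned."""
    d, m = X.shape
    out = np.empty_like(X)
    for start in range(0, d, group_size):
        G = X[start:start + group_size]
        Gc = G - G.mean(axis=1, keepdims=True)
        Sigma = Gc @ Gc.T / m + eps * np.eye(G.shape[0])
        tr = np.trace(Sigma)
        P = np.eye(G.shape[0])
        for _ in range(T):                       # Eqn. 3 on Sigma / tr
            P = 0.5 * (3.0 * P - P @ P @ P @ (Sigma / tr))
        out[start:start + group_size] = (P / np.sqrt(tr)) @ Gc   # Eqn. 5
    return out

X = np.random.randn(64, 256)
print(groupwise_whiten(X, group_size=16).shape)   # (64, 256)
```

For a group of size 1 the normalized covariance is the scalar 1, the iteration is exact at every step, and the output is the standardized activation, matching the reduction to BN claimed in the text.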
Comparison of Baselines
We compare our IterNorm with BN and DBN. Beyond the basic configuration, we also experiment with other configurations, including (1) using a large batch size of 1024; (2) using a small batch size of 16; and (3) increasing the learning rate by 10 times, considering that minibatch based normalizations are highly dependent on the batch size and that their benefits come from improved conditioning, which allows larger learning rates. All experimental setups are the same, except that we search for a different learning rate for each batch size, based on the linear scaling rule [12]. Figure 6 shows the results.
We find that our proposed IterNorm converges fastest with respect to epochs and generalizes best, compared to BN and DBN. DBN also shows better optimization and generalization than BN. In particular, IterNorm reduces the absolute test error of both BN and DBN in all four experiments above. The results demonstrate that IterNorm outperforms BN and DBN in terms of optimization quality and generalization capability.
5.2 Results on CIFAR10 with Wide Residual Networks
We apply our IterNorm to Wide Residual Networks (WRN) [52] to improve the performance on CIFAR-10. Following the conventional description in [52], we use the abbreviation WRN-d-k to indicate a WRN with depth d and width k. We adopt the publicly available Torch implementation (https://github.com/szagoruyko/wide-residual-networks) and follow the same setup as in [52]. We apply IterNorm to WRN-28-10 and WRN-40-10 by replacing all BN modules with IterNorm. Figure 8 gives the training and test errors with respect to the training epochs. We clearly find that the wide residual network with our proposed IterNorm improves over the original one with BN, in terms of optimization efficiency and generalization capability. Table 1 shows the final test errors, compared to previously reported results for the baselines and DBN [19]. The results show that IterNorm improves over the original WRN with BN and over DBN on CIFAR-10. In particular, our method reduces the test error to 3.56% on WRN-28-10, a relative improvement of 8.5% over the 'Baseline'.
Table 1: Test errors (%) on CIFAR-10 with wide residual networks ('*' indicates previously reported results).

Method  WRN-28-10  WRN-40-10
Baseline* [52]  3.89  3.80
DBN [19]  3.79 ± 0.09  3.74 ± 0.11
Baseline  3.89 ± 0.13  3.82 ± 0.11
IterNorm  3.56 ± 0.12  3.59 ± 0.07
Table 2: Ablation study on ResNet-18: Top-1 and Top-5 validation errors (%) on ImageNet ('*' indicates previously reported results).

Method  Top-1  Top-5
Baseline* [13]  30.43  10.76
DBN-L1* [19]  29.87  10.36
Baseline  29.76  10.39
DBN-L1  29.50  10.26
IterNorm-L1  29.34  10.22
IterNorm-full  29.30  10.21
IterNorm-L1 + DF  28.86  10.08
5.3 Results on ImageNet with Residual Network
We validate the effectiveness of our method on residual networks for ImageNet classification with 1000 classes [10]. We use the official 1.28M training images as the training set, and evaluate the Top-1 and Top-5 classification errors on the validation set with 50k images.
Ablation Study on Res18
We first perform an ablation study on the 18-layer residual network (ResNet-18) to explore multiple positions for replacing BN with IterNorm. The models are as follows: (a) 'IterNorm-L1': we only replace the first BN module of ResNet-18, so that the decorrelated information from previous layers can pass directly to the later layers through the identity connections, as described in [19]; (b) 'IterNorm-full': we replace all BN modules. We follow the same experimental setup as described in [13], except that we use 1 GPU and train over 100 epochs. We apply SGD with a minibatch size of 256, momentum of 0.9 and weight decay of 0.0001. The initial learning rate is set to 0.1 and divided by 10 at 30, 60 and 90 epochs, and training ends at 100 epochs.
We find that replacing only the first BN effectively improves the performance of the original residual network, whether using DBN or IterNorm, and IterNorm performs marginally better than DBN. Replacing all BN modules with IterNorm brings no significant improvement over replacing only the first layer. We conjecture the reason might be that the learned residual functions tend to have small responses, as shown in [13], and stretching these small responses to the same magnitude as the previous layers' may have negative effects. Based on 'IterNorm-L1', we further plug IterNorm in after the last average pooling (before the last linear layer) to learn a decorrelated feature representation. We find this significantly improves the performance, as shown in Table 2, referred to as 'IterNorm-L1 + DF'. This way of applying IterNorm improves the original residual networks with negligible computational cost. We also attempted to use DBN to decorrelate the feature representation; however, it consistently suffers from the problem that the eigendecomposition fails to converge.
Results on Res50/101
We further apply our method to the 50- and 101-layer residual networks (ResNet-50 and ResNet-101) and perform single-model, single-crop testing. We use the same experimental setup as before, except that we use 4 GPUs and train over 100 epochs. The results are shown in Table 3. We can see that 'IterNorm-L1' achieves lower test errors than the original residual networks, and 'IterNorm-L1 + DF' further improves the performance.
Table 3: Top-1 and Top-5 validation errors (%) on ImageNet with ResNet-50 and ResNet-101 ('*' indicates previously reported results).

  Res-50  Res-101
Method  Top-1  Top-5  Top-1  Top-5
Baseline* [13]  24.70  7.80  23.60  7.10
Baseline  23.95  7.02  22.45  6.29
IterNorm-L1  23.28  6.72  21.95  5.99
IterNorm-L1 + DF  22.91  6.47  21.77  5.94
6 Conclusions
In this paper, we proposed Iterative Normalization (IterNorm), based on Newton's iteration. It improves the optimization efficiency and generalization capability over standard BN by decorrelating activations, and improves efficiency over DBN by avoiding the computationally expensive eigendecomposition. We introduced Stochastic Normalization Disturbance (SND) to measure the inherent stochastic uncertainty in normalization. With the support of SND, we provided a thorough analysis of the performance of normalization methods with respect to the batch size and feature dimensions, and showed that IterNorm achieves a better tradeoff between optimization and generalization. We demonstrated consistent performance improvements of IterNorm on the CIFAR-10 and ImageNet datasets. The combined analysis of conditioning and SND can potentially lead to new directions for future work on normalization, and IterNorm can potentially be used in designing network architectures.
Appendix A Derivation of Backpropagation
We first show the forward pass for illustration. We follow the common matrix notation that vectors are column vectors by default, while their derivatives are row vectors. Given the minibatch input $\mathbf{X} \in \mathbb{R}^{d \times m}$, the forward pass of our Iterative Normalization (IterNorm) to compute the whitened output $\widehat{\mathbf{X}}$ is described below:

$\boldsymbol{\mu} = \frac{1}{m}\mathbf{X}\mathbf{1}$,  (11)
$\mathbf{X}_C = \mathbf{X} - \boldsymbol{\mu}\mathbf{1}^T$,  (12)
$\Sigma = \frac{1}{m}\mathbf{X}_C\mathbf{X}_C^T + \epsilon\mathbf{I}$,  (13)
$\Sigma_N = \Sigma / \mathrm{tr}(\Sigma)$,  (14)
$\Sigma^{-1/2} = \mathbf{P}_T / \sqrt{\mathrm{tr}(\Sigma)}$,  (15)
$\widehat{\mathbf{X}} = \Sigma^{-1/2}\mathbf{X}_C$,  (16)

where $\mathbf{P}_T$ is calculated by Newton's iterations as follows:

$\mathbf{P}_0 = \mathbf{I}$; $\mathbf{P}_k = \frac{1}{2}(3\mathbf{P}_{k-1} - \mathbf{P}_{k-1}^3\Sigma_N)$, $k = 1, 2, \ldots, T$.  (17)
The backpropagation pass is based on the chain rule. Given $\frac{\partial \mathcal{L}}{\partial \widehat{\mathbf{X}}}$, we can calculate $\frac{\partial \mathcal{L}}{\partial \mathbf{X}}$ based on Eqns. 11 and 12:

$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_C} + \frac{1}{m}\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}}\mathbf{1}^T$,  (18)

where $\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}}$ is calculated based on Eqn. 12:

$\frac{\partial \mathcal{L}}{\partial \boldsymbol{\mu}} = -\frac{\partial \mathcal{L}}{\partial \mathbf{X}_C}\mathbf{1}$,  (19)

and $\frac{\partial \mathcal{L}}{\partial \mathbf{X}_C}$ is calculated based on Eqns. 13 and 16:

$\frac{\partial \mathcal{L}}{\partial \mathbf{X}_C} = \left(\Sigma^{-1/2}\right)^T\frac{\partial \mathcal{L}}{\partial \widehat{\mathbf{X}}} + \frac{2}{m}\,\mathrm{sym}\!\left(\frac{\partial \mathcal{L}}{\partial \Sigma}\right)\mathbf{X}_C$,  (20)

where $\mathrm{sym}(\mathbf{M})$ means symmetrizing $\mathbf{M}$ by $\mathrm{sym}(\mathbf{M}) = \frac{1}{2}(\mathbf{M} + \mathbf{M}^T)$. Next, we will calculate $\frac{\partial \mathcal{L}}{\partial \Sigma}$, given $\frac{\partial \mathcal{L}}{\partial \Sigma^{-1/2}} = \frac{\partial \mathcal{L}}{\partial \widehat{\mathbf{X}}}\mathbf{X}_C^T$ from Eqn. 16. Based on Eqns. 14 and 15, we have:

$\frac{\partial \mathcal{L}}{\partial \Sigma} = \frac{1}{\mathrm{tr}(\Sigma)}\frac{\partial \mathcal{L}}{\partial \Sigma_N} - \frac{1}{\mathrm{tr}(\Sigma)^2}\mathrm{tr}\!\left(\left(\frac{\partial \mathcal{L}}{\partial \Sigma_N}\right)^T\Sigma\right)\mathbf{I} - \frac{1}{2}\mathrm{tr}(\Sigma)^{-3/2}\,\mathrm{tr}\!\left(\left(\frac{\partial \mathcal{L}}{\partial \Sigma^{-1/2}}\right)^T\mathbf{P}_T\right)\mathbf{I}$,  (21)

where $\frac{\partial \mathcal{L}}{\partial \Sigma_N}$ is derived below, and, from Eqn. 15,

$\frac{\partial \mathcal{L}}{\partial \mathbf{P}_T} = \frac{1}{\sqrt{\mathrm{tr}(\Sigma)}}\frac{\partial \mathcal{L}}{\partial \Sigma^{-1/2}}$.  (22)
Table A1: Comparison of wall-clock time (ms).

Module  d=64  d=128
'nn' convolution  8.89  17.46
'cudnn' convolution  5.65  13.62
DBN [19]  15.92  35.02
IterNorm-iter3  6.59  13.32
IterNorm-iter5  7.40  13.92
IterNorm-iter7  8.21  14.68
Based on Eqn. 17, we can derive $\frac{\partial \mathcal{L}}{\partial \Sigma_N}$ as follows:

$\frac{\partial \mathcal{L}}{\partial \Sigma_N} = -\frac{1}{2}\sum_{k=1}^{T}\left(\mathbf{P}_{k-1}^3\right)^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}$,  (23)

where $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}$ can be calculated based on Eqns. 15 and 17 by the following backward iterations, starting from the given $\frac{\partial \mathcal{L}}{\partial \mathbf{P}_T}$:

$\frac{\partial \mathcal{L}}{\partial \mathbf{P}_{k-1}} = \frac{3}{2}\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k} - \frac{1}{2}\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\left(\mathbf{P}_{k-1}^2\Sigma_N\right)^T - \frac{1}{2}\left(\mathbf{P}_{k-1}^2\right)^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\Sigma_N^T - \frac{1}{2}\mathbf{P}_{k-1}^T\frac{\partial \mathcal{L}}{\partial \mathbf{P}_k}\left(\mathbf{P}_{k-1}\Sigma_N\right)^T$.  (24)

Further, combining Eqns. 18 and 19, we can simplify the derivation of $\frac{\partial \mathcal{L}}{\partial \mathbf{X}}$ as:

$\frac{\partial \mathcal{L}}{\partial \mathbf{X}} = \frac{\partial \mathcal{L}}{\partial \mathbf{X}_C}\mathbf{M}$,  (25)

where $\mathbf{M} = \mathbf{I} - \frac{1}{m}\mathbf{1}\mathbf{1}^T$.
Appendix B Comparison of Wall Clock Time
As discussed in Section 3.3 of the main paper, the computational cost of our method is comparable to that of a convolution operation. To be specific, given the internal activation $\mathbf{X} \in \mathbb{R}^{h \times w \times d \times m}$, a $3 \times 3$ convolution with the same number of input and output feature maps costs $9mhwd^2$, while our IterNorm costs $2mhwd^2 + O(Td^3)$. The relative cost of IterNorm to the convolution is thus roughly $\frac{2}{9}$.
Here, we compare the wall-clock time of IterNorm, Decorrelated Batch Normalization (DBN) [19] and convolution. Our IterNorm is implemented in Torch [8]. The implementation of DBN is from the released code of the DBN paper [19]. We compare 'IterNorm' to the 'nn' convolution, the 'cudnn' convolution [6] and DBN [19] in Torch. The experiments are run on a TITAN Xp. We compare the results for feature dimensions $d = 64$ and $d = 128$, as shown in Table A1. We find that our unoptimized implementation of IterNorm (e.g., 'IterNorm-iter5') is faster than the 'nn' convolution, and slightly slower than the 'cudnn' convolution. Note that our IterNorm is implemented with the API provided by Torch [8]; it is thus fairer to compare IterNorm to the 'nn' convolution. Besides, our IterNorm is significantly faster than DBN.
We also conduct additional experiments to compare the training time of IterNorm and DBN on the VGG architecture described in Section 5.1 of the main paper, with a batch size of 256: DBN (group size 16) costs 1.366 s per iteration, while IterNorm costs 0.343 s.
input (32×32 RGB image) 
conv3(3,64) 
conv3(64,64) 
maxpool(2,2) 
conv3(64,128) 
conv3(128,128) 
maxpool(2,2) 
conv3(128,256) 
conv3(256,256) 
maxpool(2,2) 
conv3(256,512) 
conv3(512,512) 
maxpool(2,2) 
conv3(512,512) 
avepool(2,2) 
FC(512,10) 
softmax 
Appendix C Experiments of IterNorm with Different Iterations
Here, we show the results of IterNorm with different iteration numbers on the experiments described in Section 4.2 of the main paper. We also show the results of Batch Normalization (BN) [21] and Decorrelated Batch Normalization (DBN) [19] for comparison.
Figure A1 shows the results on the MNIST dataset. We explore the effect of the iteration number T on the performance of IterNorm, for a range of iteration numbers between T = 1 and T = 9. We observe that the smallest (T = 1) and the largest (T = 9) iteration numbers both yield worse performance in terms of training efficiency. Further, when T = 9, IterNorm has significantly worse test performance. These observations are consistent with the results on the VGG network described in Section 5.1 of the main paper.
Figure A2 shows the results of the SND and conditioning analysis. We observe that IterNorm achieves better conditioning but increasing SND as the iteration number T increases. The results show that the iteration number can effectively control the extent of whitening, and thereby trade off the improved conditioning against the introduced stochasticity.
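The effect of T on the extent of whitening can also be illustrated directly: the covariance of the normalized output approaches the identity as T grows. The following NumPy sketch (the sizes and anisotropic scaling are our own illustrative choices) measures the distance from full whitening after each Newton's iteration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 16, 1024
# Anisotropic data, so the covariance is badly conditioned.
X = rng.standard_normal((d, m)) * np.linspace(0.1, 3.0, d)[:, None]

Xc = X - X.mean(axis=1, keepdims=True)
Sigma = Xc @ Xc.T / m + 1e-5 * np.eye(d)
Sigma_N = Sigma / np.trace(Sigma)       # trace normalization (Eqn. 14)

errs = []
P = np.eye(d)
for T in range(1, 10):
    P = 0.5 * (3.0 * P - np.linalg.matrix_power(P, 3) @ Sigma_N)   # Eqn. 17
    X_hat = (P / np.sqrt(np.trace(Sigma))) @ Xc
    cov = X_hat @ X_hat.T / m
    errs.append(np.linalg.norm(cov - np.eye(d)))   # distance from full whitening
```

The error decreases monotonically with T, while the dimensions with the smallest eigenvalues converge last, matching the discussion of partial whitening above.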
Appendix D Details of the VGG Network
The VGG network used in Section 5.1 of the main paper is listed in the table above.
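For reference, the architecture in the table can be transcribed as follows. We use PyTorch purely for illustration (the paper's implementation is in Torch), and the placement of normalization and ReLU after each convolution is our assumption, with BatchNorm2d standing in for BN/DBN/IterNorm:

```python
import torch
import torch.nn as nn

def conv3(c_in, c_out):
    # conv3(c_in, c_out) in the table: a 3x3 convolution with padding 1.
    # Normalization + ReLU after each convolution is our assumption.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1),
                         nn.BatchNorm2d(c_out),
                         nn.ReLU(inplace=True))

vgg = nn.Sequential(
    conv3(3, 64), conv3(64, 64), nn.MaxPool2d(2, 2),
    conv3(64, 128), conv3(128, 128), nn.MaxPool2d(2, 2),
    conv3(128, 256), conv3(256, 256), nn.MaxPool2d(2, 2),
    conv3(256, 512), conv3(512, 512), nn.MaxPool2d(2, 2),
    conv3(512, 512), nn.AvgPool2d(2, 2),
    nn.Flatten(),
    nn.Linear(512, 10),  # FC(512,10); softmax is folded into the training loss
)
```

The four max-pooling stages and the final average pooling reduce a 32×32 input to 1×1, so the flattened feature has 512 dimensions, matching FC(512,10).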
References
 [1] Devansh Arpit, Yingbo Zhou, Bhargava Urala Kota, and Venu Govindaraju. Normalization propagation: A parametric technique for removing internal covariate shift in deep networks. In ICML, 2016.
 [2] Andrei Atanov, Arsenii Ashukha, Dmitry Molchanov, Kirill Neklyudov, and Dmitry Vetrov. Uncertainty estimation via stochastic batch normalization. In ICLR Workshop, 2018.
 [3] Lei Jimmy Ba, Ryan Kiros, and Geoffrey E. Hinton. Layer normalization. CoRR, abs/1607.06450, 2016.
 [4] Dario A. Bini, Nicholas J. Higham, and Beatrice Meini. Algorithms for the matrix pth root. Numerical Algorithms, 39(4):349–378, Aug 2005.
 [5] Johan Bjorck, Carla Gomes, and Bart Selman. Understanding batch normalization. In NIPS, 2018.
 [6] Sharan Chetlur, Cliff Woolley, Philippe Vandermersch, Jonathan Cohen, John Tran, Bryan Catanzaro, and Evan Shelhamer. cuDNN: Efficient primitives for deep learning. CoRR, abs/1410.0759, 2014.
 [7] Michael Cogswell, Faruk Ahmed, Ross B. Girshick, Larry Zitnick, and Dhruv Batra. Reducing overfitting in deep networks by decorrelating representations. In ICLR, 2016.

 [8] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A Matlab-like environment for machine learning. In BigLearn, NIPS Workshop, 2011.
 [9] Tim Cooijmans, Nicolas Ballas, César Laurent, and Aaron C. Courville. Recurrent batch normalization. In ICLR, 2017.
 [10] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In CVPR, 2009.
 [11] Guillaume Desjardins, Karen Simonyan, Razvan Pascanu, and Koray Kavukcuoglu. Natural neural networks. In NIPS, 2015.
 [12] Priya Goyal, Piotr Dollár, Ross B. Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, large minibatch SGD: Training ImageNet in 1 hour. CoRR, abs/1706.02677, 2017.
 [13] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [14] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
 [15] Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. arXiv preprint arXiv:1803.01814, 2018.
 [16] Gao Huang, Zhuang Liu, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
 [17] Lei Huang, Xianglong Liu, Bo Lang, Adams Wei Yu, Yongliang Wang, and Bo Li. Orthogonal weight normalization: Solution to optimization over multiple dependent stiefel manifolds in deep neural networks. In AAAI, 2018.
 [18] Lei Huang, Xianglong Liu, Yang Liu, Bo Lang, and Dacheng Tao. Centered weight normalization in accelerating training of deep neural networks. In ICCV, 2017.
 [19] Lei Huang, Dawei Yang, Bo Lang, and Jia Deng. Decorrelated batch normalization. In CVPR, 2018.
 [20] Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In NIPS, 2017.
 [21] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [22] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. Self-normalizing neural networks. In NIPS, 2017.
 [23] Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.
 [24] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, 2009.
 [25] Anders Krogh and John A. Hertz. A simple weight decay can improve generalization. In NIPS. 1992.

 [26] César Laurent, Gabriel Pereyra, Philemon Brakel, Ying Zhang, and Yoshua Bengio. Batch normalized recurrent neural networks. In ICASSP, 2016.
 [27] Yann LeCun, Léon Bottou, Genevieve B. Orr, and Klaus-Robert Müller. Efficient BackProp. In Neural Networks: Tricks of the Trade, pages 9–50, 1998.
 [28] Peihua Li, Jiangtao Xie, Qilong Wang, and Zilin Gao. Towards faster training of global covariance pooling networks by iterative matrix square root normalization. In CVPR, 2018.
 [29] Qianli Liao, Kenji Kawaguchi, and Tomaso Poggio. Streaming normalization: Towards simpler and more biologically-plausible normalizations for online and recurrent learning. arXiv preprint arXiv:1610.06160, 2016.
 [30] Tsung-Yu Lin and Subhransu Maji. Improved bilinear pooling with CNNs. In BMVC, 2017.
 [31] Ping Luo. Learning deep architectures via generalized whitened neural networks. In ICML, 2017.
 [32] Ping Luo, Jiamin Ren, and Zhanglin Peng. Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779, 2018.
 [33] Grégoire Montavon and Klaus-Robert Müller. Deep Boltzmann Machines and the Centering Trick, volume 7700 of LNCS. Springer, 2nd edition, 2012.
 [34] Behnam Neyshabur, Ruslan Salakhutdinov, and Nathan Srebro. Path-SGD: Path-normalized optimization in deep neural networks. In NIPS, 2015.
 [35] Tapani Raiko, Harri Valpola, and Yann LeCun. Deep learning made easier by linear transformations in perceptrons. In AISTATS, 2012.
 [36] Mengye Ren, Renjie Liao, Raquel Urtasun, Fabian H. Sinz, and Richard S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2017.
 [37] Pau Rodríguez, Jordi Gonzàlez, Guillem Cucurull, Josep M. Gonfaus, and F. Xavier Roca. Regularizing cnns with locally constrained decorrelations. In ICLR, 2017.
 [38] Tim Salimans and Diederik P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In NIPS, 2016.
 [39] Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization?(no, it is not about internal covariate shift). In NIPS, 2018.
 [40] Nicol N. Schraudolph. Accelerated gradient descent by factor-centering decomposition. Technical report, 1998.
 [41] Alexander Shekhovtsov and Boris Flach. Normalization of neural networks using analytic variance propagation. In Computer Vision Winter Workshop, 2018.
 [42] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In ICLR, 2015.
 [43] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. J. Mach. Learn. Res., 15(1):1929–1958, Jan. 2014.

 [44] Christian Szegedy, Sergey Ioffe, and Vincent Vanhoucke. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI, 2017.
 [45] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [46] Mattias Teye, Hossein Azizpour, and Kevin Smith. Bayesian uncertainty estimation for batch normalized deep networks. In ICML, 2018.
 [47] Guangrun Wang, Jiefeng Peng, Ping Luo, Xinjiang Wang, and Liang Lin. Kalman normalization: Normalizing internal representations across network layers. In NIPS, 2018.
 [48] Simon Wiesler, Alexander Richard, Ralf Schlüter, and Hermann Ney. Mean-normalized stochastic gradient for large-scale deep learning. In ICASSP, 2014.
 [49] Shuang Wu, Guoqi Li, Lei Deng, Liu Liu, Yuan Xie, and Luping Shi. L1-norm batch normalization for efficient training of deep neural networks. CoRR, 2018.
 [50] Yuxin Wu and Kaiming He. Group normalization. In ECCV, 2018.

 [51] Wei Xiong, Bo Du, Lefei Zhang, Ruimin Hu, and Dacheng Tao. Regularizing deep convolutional neural networks with a structured decorrelation constraint. In ICDM, 2016.
 [52] Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. In BMVC, 2016.