Decorrelated Batch Normalization
Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within minibatches. In this work, we propose Decorrelated Batch Normalization (DBN), which not only centers and scales activations but also whitens them. We explore multiple whitening techniques, and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.
Batch Normalization [27] is a technique for accelerating deep network training. Introduced by Ioffe and Szegedy, it has been widely used in a variety of state-of-the-art systems [19, 51, 21, 56, 50, 22]. Batch Normalization works by standardizing the activations of a deep network within a minibatch: transforming the output of a layer, or equivalently the input to the next layer, to have a zero mean and unit variance. Specifically, let $x_1, \dots, x_m$ be the original outputs of a single neuron on $m$ examples in a minibatch. Batch Normalization produces the transformed outputs

$$\hat{x}_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad (1)$$

where $\mu$ and $\sigma^2$ are the mean and variance of the minibatch, $\epsilon$ is a small number to prevent numerical instability, and $\gamma$ and $\beta$ are extra learnable parameters. Crucially, during training, Batch Normalization is part of both the inference computation (forward pass) and the gradient computation (backward pass). Batch Normalization can be inserted extensively into a network, typically between a linear mapping and a nonlinearity.
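As a concrete reference, Eqn. 1 can be sketched in a few lines of NumPy. This is a minimal training-mode forward pass for illustration only (the function and variable names are ours, not from the paper):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    """Standardize minibatch activations as in Eqn. 1 (one column per neuron)."""
    mu = x.mean(axis=0)                    # per-neuron minibatch mean
    var = x.var(axis=0)                    # per-neuron minibatch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable scale and shift

# m = 4 examples, 3 neurons
x = np.array([[1.0, 2.0, 0.0],
              [3.0, 0.0, 1.0],
              [5.0, 4.0, 2.0],
              [7.0, 6.0, 3.0]])
y = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
```

Each column of `y` has (approximately) zero mean and unit variance; $\gamma$ and $\beta$ then restore the representational capacity lost to the normalization.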
Batch Normalization was motivated by the well-known fact that whitening inputs (i.e., centering, decorrelating, and scaling) speeds up training [34]. It has been shown that better conditioning of the covariance matrix of the input leads to better conditioning of the Hessian in updating the weights, making the gradient descent updates closer to Newton updates [34, 52]. Batch Normalization exploits this fact further by seeking to whiten not only the input to the first layer of the network, but also the inputs to each internal layer. But instead of whitening, Batch Normalization only performs standardization: the activations are centered and scaled, but not decorrelated. This choice was justified in [27] by citing the cost and differentiability of whitening, but no actual attempts were made to derive or experiment with a whitening operation.
While standardization has proven effective for Batch Normalization, it remains an interesting question whether full whitening (adding decorrelation to Batch Normalization) can help further. Conceptually, there are clear cases where whitening is beneficial. For example, when the activations are close to being perfectly correlated (in 2D, this means all points lie close to a line, and Batch Normalization does not change the shape of the distribution), standardization barely improves the conditioning of the covariance matrix, whereas whitening remains effective. In addition, prior work has shown that decorrelated activations result in better features [3, 45, 5] and better generalization [10, 55], suggesting room for further improving Batch Normalization.
In this paper, we propose Decorrelated Batch Normalization, in which we whiten the activations of each layer within a minibatch. Let $x_i \in \mathbb{R}^d$ be the input to a layer for the $i$-th example in a minibatch of size $m$. The whitened input is given by

$$\hat{x}_i = \Sigma^{-1/2}(x_i - \mu), \qquad (2)$$

where $\mu$ is the minibatch mean and $\Sigma$ is the minibatch covariance matrix.
Several questions arise for implementing Decorrelated Batch Normalization. One is how to perform backpropagation, in particular, how to backpropagate through the inverse square root of a matrix (i.e., $\Sigma^{-1/2}$), whose key step is an eigendecomposition. The differentiability of this matrix transform was one of the reasons that whitening was not pursued in the Batch Normalization paper [27]. Desjardins et al. [14] whiten the activations but avoid backpropagating through the operation by treating the mean and the whitening matrix as model parameters, rather than as functions of the input examples. However, as has been pointed out [27, 26], doing so may lead to instability in training.
In this work, we decorrelate the activations and perform proper backpropagation during training. We achieve this by using the fact that eigendecomposition is differentiable, and its derivatives can be obtained using matrix differential calculus, as shown by prior work [17, 28]. We build upon these existing results and derive the backpropagation updates for Decorrelated Batch Normalization.
Another question is, perhaps surprisingly, the choice of how to compute the whitening matrix $\Sigma^{-1/2}$. The whitening matrix is not unique, because a whitened input stays whitened after an arbitrary rotation [29]. It turns out that PCA whitening, a standard choice [15], does not speed up training at all and in fact inflicts significant harm. The reason is that PCA whitening works by performing rotation followed by scaling, but the rotation can cause a problem we call stochastic axis swapping, which, as will be discussed in Section 3.1, in effect randomly permutes the neurons of a layer for each batch. Such permutation can drastically change the data representation from one batch to another, to the extent that training never converges.
To address this stochastic axis swapping issue, we discover that it is critical to use ZCA whitening [4, 29], which rotates the PCA-whitened activations back such that the distortion of the original activations is minimal. We show through experiments that the benefits of decorrelation are only observed with the additional rotation of ZCA whitening.
A third question is the amount of whitening to perform. Given a particular batch size, DBN may not have enough samples to obtain a suitable estimate of the full covariance matrix. We thus control the extent of whitening by decorrelating smaller groups of activations instead of all activations together. That is, for an output of dimension $d$, we divide it into groups of size $k_G$ and apply whitening within each group. This strategy has the added benefit of reducing the computational cost of whitening from $O(d^2 \max(d, m))$ to $O(d k_G \max(k_G, m))$, where $m$ is the minibatch size.

We conduct extensive experiments on multilayer perceptrons and convolutional neural networks, and show that Decorrelated Batch Normalization (DBN) improves upon the original Batch Normalization (BN) in terms of training speed and generalization performance. In particular, experiments demonstrate that using DBN can consistently improve the performance of residual networks [19, 21, 56] on CIFAR-10, CIFAR-100 [31] and ILSVRC-2012 [13].
Normalized activations [37, 53] and gradients [46, 40] have long been known to be beneficial for training neural networks. Batch Normalization [27] was the first to perform normalization per minibatch in a way that supports backpropagation. One drawback of Batch Normalization, however, is that it requires a reasonable batch size to estimate the mean and variance, and is not applicable when the batch size is very small. To address this issue, Ba et al. [2] proposed Layer Normalization, which normalizes a single example using the mean and variance of the activations from the same layer. Batch Normalization and Layer Normalization were later unified by Ren et al. under the Divisive Normalization framework [41]. Other attempts to improve Batch Normalization for small batch sizes include Batch Renormalization [26] and Stream Normalization [35]. There have also been efforts to adapt Batch Normalization to Recurrent Neural Networks [33, 12]. Our work extends Batch Normalization by decorrelating the activations, a direction orthogonal to all these prior works.

Our work is closely related to Natural Neural Networks [14, 36], which whiten activations by periodically estimating and updating a whitening matrix. Our work differs from Natural Neural Networks in two important ways. First, Natural Neural Networks perform whitening by treating the mean and the whitening matrix as model parameters, as opposed to functions of the input examples, which, as pointed out by Ioffe and Szegedy [27, 26], can cause instability in training, with symptoms such as divergence or gradient explosion. Second, during training, a Natural Neural Network uses a running estimate of the mean and whitening matrix to perform whitening for each minibatch; as a result, it cannot ensure that the transformed activations within each batch are in fact whitened, whereas in our case the activations within a minibatch are guaranteed to be whitened. Natural Neural Networks may thus suffer instability when training very deep neural networks [14].
Another way to obtain decorrelated activations is to introduce additional regularization in the loss function [10, 55]. Cogswell et al. [10] introduced the DeCov loss on the activations as a regularizer to encourage non-redundant representations. Xiong et al. [55] extend [10] to learn group-wise decorrelated representations. Note that these methods are not designed for speeding up training; in fact, empirically they often slow down training [10], probably because decorrelated activations are part of the learning objective and thus may not be achieved until later in training.
Our approach is also related to work that implicitly normalizes activations, either by normalizing the network weights (e.g. through reparameterization techniques [43, 25, 24], Riemannian optimization methods [23, 8], or additional weight regularization [32, 39, 42]) or by designing special scaling coefficients and bias values that can induce normalized activations under certain assumptions [1]. Our work bears some similarity to that of Huang et al. [24], which also backpropagates gradients through a ZCA-like normalization transform that involves eigendecomposition. But the work by Huang et al. normalizes weights instead of activations, which leads to significantly different derivations, especially with regard to convolutional layers; in addition, unlike ours, it does not involve a separately estimated whitening matrix during inference, nor does it discuss the stochastic axis swapping issue. Finally, all of these works, including [24], are orthogonal to ours in the sense that their normalization is data-independent, whereas ours is data-dependent. In fact, as shown in [25, 8, 24, 23], data-dependent and data-independent normalization can be combined to achieve even greater improvement.
Let $X \in \mathbb{R}^{d \times m}$ be a data matrix that represents inputs to a layer in a minibatch of size $m$. Let $x_i$ be the $i$-th column vector of $X$, i.e. the $d$-dimensional input from the $i$-th example. The whitening transformation is defined as

$$\hat{x}_i = \Sigma^{-1/2}(x_i - \mu), \qquad (3)$$

where $\mu = \frac{1}{m} X \mathbf{1}$ is the mean of $X$, $\Sigma = \frac{1}{m}(X - \mu \mathbf{1}^T)(X - \mu \mathbf{1}^T)^T + \epsilon I$ is the covariance matrix of the centered $X$, $\mathbf{1}$ is a column vector of all ones, and $\epsilon$ is a small positive number for numerical stability (preventing a singular $\Sigma$). The whitening transformation ensures that the transformed data $\hat{X} = [\hat{x}_1, \dots, \hat{x}_m]$ is whitened (up to the small $\epsilon$ term), i.e., $\frac{1}{m} \hat{X} \hat{X}^T = I$.
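The whitening transformation of Eqn. 3 can be transcribed directly into NumPy. The sketch below (ours, not the paper's implementation) uses $D \Lambda^{-1/2} D^T$ for $\Sigma^{-1/2}$, one of the valid choices discussed next:

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Whiten a d x m data matrix as in Eqn. 3: x_hat = Sigma^(-1/2) (x - mu)."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)       # d x 1 minibatch mean
    Xc = X - mu                              # centered data
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)  # regularized covariance matrix
    sigma, D = np.linalg.eigh(Sigma)         # eigenvalues and eigenvectors
    W = D @ np.diag(sigma ** -0.5) @ D.T     # one choice of Sigma^(-1/2) (ZCA)
    return W @ Xc

X_hat = whiten(np.random.RandomState(0).randn(4, 256))
```

The covariance of the output, $\frac{1}{m}\hat{X}\hat{X}^T$, equals the identity up to the $\epsilon$ regularizer.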
Although Eqn. 3 gives an analytical form of the whitening transformation, this transformation is in fact not unique. The reason is that $\Sigma^{-1/2}$, the inverse square root of the covariance matrix, is defined only up to rotation, and as a result there exist infinitely many whitening transformations. Thus, a natural question is whether the specific choice of $\Sigma^{-1/2}$ matters, and if so, which choice to use.
To answer this question, we first discuss a phenomenon we call stochastic axis swapping and show that not all whitening transformations are equally desirable.
Given a data point represented as a vector $x$ under the standard basis, its representation under another orthogonal basis is $D^T x$, where $D$ is an orthogonal matrix. We define stochastic axis swapping as follows:

Assume a training algorithm that iteratively updates the weights using a batch of randomly sampled data points per iteration. Stochastic axis swapping occurs when a data point is transformed to be $D_1^T x$ in one iteration and $D_2^T x$ in another iteration such that $D_2 = P D_1$, where $P \neq I$ is a permutation matrix solely determined by the statistics of a batch.
Stochastic axis swapping makes training difficult, because the random permutation of the input dimensions can greatly confuse the learning algorithm—in the extreme case where the permutation is completely random, what remains is only a bag of activation values (similar to scrambling all pixels in an image), potentially resulting in an extreme loss of information and discriminative power.
Here, we demonstrate that the whitening of activations, if not done properly, can cause stochastic axis swapping in training neural networks. We start with standard PCA whitening [15], which computes $\Sigma^{-1/2}$ through the eigendecomposition $\Sigma^{-1/2} = \Lambda^{-1/2} D^T$, where $\Lambda = \mathrm{diag}(\sigma_1, \dots, \sigma_d)$ and $D = [d_1, \dots, d_d]$ are the eigenvalues and eigenvectors of $\Sigma$, i.e. $\Sigma = D \Lambda D^T$. That is, the original data point (after centering) is rotated by $D^T$ and then scaled by $\Lambda^{-1/2}$. Without loss of generality, we assume that each eigenvector $d_i$ is unique by fixing the sign of its first element. A first opportunity for stochastic axis swapping is that the columns of $D$ and the corresponding entries of $\Lambda$ can be permuted while still giving a valid whitening transformation. But this is easy to fix: we can commit to a unique $D$ and $\Lambda$ by ordering the eigenvalues non-increasingly.

But it turns out that ensuring a unique $D$ and $\Lambda$ is insufficient to avoid stochastic axis swapping. Fig. 1 illustrates an example. Given a minibatch of data points in one iteration as shown in Fig. 1(a), PCA whitening rotates them by $[d_1, d_2]^T$ and stretches them along the new axis system by $\Lambda^{-1/2}$, where $\sigma_1 > \sigma_2$. Consider another iteration, shown in Fig. 1(b), where all data points except the red points are the same: the covariance has the same eigenvectors but different eigenvalues, with $\sigma_1' < \sigma_2'$. In this case, the rotation matrix becomes $[d_2, d_1]^T$, because we always order the eigenvalues non-increasingly. The blue data points thus have two different representations with the axes swapped.
To further justify our conjecture, we perform an experiment with multilayer perceptrons (MLPs) on the MNIST dataset, as shown in Figure 2. We refer to the network without whitened activations as 'plain' and the network with PCA whitening as DBN-PCA. We find that DBN-PCA has significantly inferior performance to 'plain'. In particular, on the 4-layer MLP, DBN-PCA behaves similarly to random guessing, which implies that it suffers severe stochastic axis swapping.
The stochastic axis swapping caused by PCA whitening arises because the rotation operation is executed over varying activations. Such variation is a result of two factors: (1) the activations change due to weight updates during training, following the internal covariate shift described in [27]; and (2) the optimization is based on random minibatches, so each batch contains a different random set of examples in each training epoch.
A similar phenomenon was also observed in [24], where PCA-style orthogonalization failed to learn orthogonal filters effectively in neural networks; however, no further analysis was provided to explain why this is the case.
To address the stochastic axis swapping problem, one straightforward idea is to rotate the transformed input back using the same rotation matrix $D$:

$$\hat{x}_i = D \Lambda^{-1/2} D^T (x_i - \mu). \qquad (4)$$

In other words, we scale along the eigenvectors to get the whitened activations under the original axis system. Such whitening is known as ZCA whitening [4], and has been shown to minimize the distortion introduced by whitening under L2 distance [4, 29, 24]. We perform the same experiments with ZCA whitening as we did with PCA whitening: with MLPs on MNIST. As shown in Figure 2, ZCA whitening (referred to as DBN-ZCA) improves training performance significantly compared to no whitening ('plain') and DBN-PCA. This shows that ZCA whitening is critical to addressing the stochastic axis swapping problem.
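The contrast between the two whitening choices is easy to reproduce numerically. In the sketch below (our construction, not the paper's code), two batches share the same eigenvectors but swap which eigenvalue dominates; PCA whitening then swaps the output axes of a shared point, while ZCA whitening keeps the dominant coordinate in place:

```python
import numpy as np

def whitening_matrix(X, mode, eps=1e-6):
    """PCA or ZCA whitening matrix with eigenvalues ordered non-increasingly."""
    d = X.shape[0]
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / X.shape[1] + eps * np.eye(d)
    sigma, D = np.linalg.eigh(Sigma)
    order = np.argsort(sigma)[::-1]           # commit to a unique ordering
    sigma, D = sigma[order], D[:, order]
    D = D * np.sign(D[0])                     # fix the sign of each eigenvector
    pca = np.diag(sigma ** -0.5) @ D.T        # PCA: rotate, then scale
    return pca if mode == "pca" else D @ pca  # ZCA: rotate back at the end

rng = np.random.RandomState(0)
base = rng.randn(2, 512)
A = np.diag([3.0, 1.0]) @ base  # batch A: first axis dominates
B = np.diag([1.0, 3.0]) @ base  # batch B: second axis dominates

x = np.array([1.0, 0.1])        # one data point, whitened under both batches
pca_A, pca_B = whitening_matrix(A, "pca") @ x, whitening_matrix(B, "pca") @ x
zca_A, zca_B = whitening_matrix(A, "zca") @ x, whitening_matrix(B, "zca") @ x
```

Under PCA whitening, the coordinate of `x` that carries most of its magnitude jumps from the first output axis (batch A) to the second (batch B), which is exactly the stochastic axis swapping defined above; under ZCA whitening it stays on the same axis.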
It is important to note that backpropagation through ZCA whitening is nontrivial. In our DBN, the mean $\mu$ and the covariance $\Sigma$ are not parameters of the whitening transform but functions of the minibatch data $X$. We therefore need to backpropagate the gradients through the eigendecomposition of $\Sigma$, as in [27, 24]. Here, we use the results from [28] to derive the backpropagation formulation of whitening:

$$\frac{\partial L}{\partial \Sigma} = D \left( K^T \odot \left( D^T \frac{\partial L}{\partial D} \right) + \left( \frac{\partial L}{\partial \Lambda} \right)_{\mathrm{diag}} \right) D^T, \qquad (5)$$

where $L$ is the loss function, $K \in \mathbb{R}^{d \times d}$ is 0-diagonal with $K_{ij} = 1/(\sigma_i - \sigma_j)$ for $i \neq j$, the $\odot$ operator is element-wise matrix multiplication, and $(\cdot)_{\mathrm{diag}}$ sets the off-diagonal elements of a matrix to zero.
Detailed derivation can be found in the appendix. Here we only show the simplified formulation:

$$\frac{\partial L}{\partial X} = \left( \Sigma^{-1/2} \frac{\partial L}{\partial \hat{X}} + \frac{2}{m} S_{\mathrm{sym}} X_c \right) \left( I - \frac{1}{m} \mathbf{1} \mathbf{1}^T \right), \qquad (6)$$

where $\hat{X} = [\hat{x}_1, \dots, \hat{x}_m]$, $S = \partial L / \partial \Sigma$ as given by Eqn. 5, $X_c = X - \mu \mathbf{1}^T$ is the centered data, and $\mathbf{1}$ is the column vector of all ones. The notation $S_{\mathrm{sym}} = \frac{1}{2}(S + S^T)$ represents symmetrizing the corresponding matrix.
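As a sanity check on the backward pass, the gradients of Eqns. 5 and 6 can be compared against numerical differentiation. The following is our own NumPy reconstruction under the definitions above (not the paper's code); `zca_backward` applies Eqn. 5 to obtain $\partial L / \partial \Sigma$ and then assembles the full input gradient:

```python
import numpy as np

def zca_forward(X, eps=1e-3):
    """ZCA whitening (Eqn. 4) of a d x m minibatch; returns output and cache."""
    d, m = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)
    sigma, D = np.linalg.eigh(Sigma)              # Sigma = D diag(sigma) D^T
    W = D @ np.diag(sigma ** -0.5) @ D.T          # Sigma^(-1/2), ZCA choice
    return W @ Xc, (Xc, sigma, D, W)

def zca_backward(G, cache):
    """Backpropagate G = dL/dX_hat through ZCA whitening (Eqns. 5 and 6)."""
    Xc, sigma, D, W = cache
    d, m = Xc.shape
    dW = G @ Xc.T                                  # dL/dW, since X_hat = W Xc
    dD = (dW + dW.T) @ D @ np.diag(sigma ** -0.5)  # dL/dD
    dLam = -0.5 * sigma ** -1.5 * np.diag(D.T @ dW @ D)  # dL/dsigma_i
    K = np.zeros((d, d))                           # K_ij = 1/(sigma_i - sigma_j)
    off = ~np.eye(d, dtype=bool)
    K[off] = 1.0 / (sigma[:, None] - sigma[None, :])[off]
    S = D @ (K.T * (D.T @ dD) + np.diag(dLam)) @ D.T     # Eqn. 5: dL/dSigma
    dXc = W @ G + (S + S.T) @ Xc / m               # direct term + covariance term
    return dXc - dXc.mean(axis=1, keepdims=True)   # account for the mean

rng = np.random.RandomState(0)
X, A = rng.randn(4, 16), rng.randn(4, 16)          # A defines a test loss
Y, cache = zca_forward(X)
dX = zca_backward(A, cache)                        # analytic grad of L = sum(A*Y)
```

Perturbing each entry of `X` and re-running the forward pass reproduces `dX` to finite-difference accuracy, which makes a useful unit test for any implementation of this backward pass.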
Decorrelated Batch Normalization (DBN) is a data-dependent whitening transformation with backpropagation formulations. Like Batch Normalization [27], it can be inserted extensively into a network. Algorithms 1 and 2 describe the forward pass and the backward pass of our proposed DBN, respectively. During training, the mean and the whitening matrix are calculated within each minibatch to ensure that the activations are whitened for each minibatch. We also maintain an expected mean and an expected whitening matrix for use during inference. Specifically, during training, we initialize the expected mean to zero and the expected whitening matrix to the identity, and update them by running averages as described in Lines 10 and 11 of Algorithm 1.
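A minimal sketch of such running-average updates for the inference-time statistics (the variable names and the momentum value are our assumptions for illustration):

```python
import numpy as np

def update_running_stats(mu_run, W_run, mu, W, momentum=0.1):
    """Exponential running averages of the minibatch mean and whitening
    matrix; at inference these replace the per-batch statistics."""
    mu_run = (1 - momentum) * mu_run + momentum * mu
    W_run = (1 - momentum) * W_run + momentum * W
    return mu_run, W_run

d = 3
mu_run, W_run = np.zeros((d, 1)), np.eye(d)  # identity-transform initialization
mu, W = np.ones((d, 1)), 2.0 * np.eye(d)     # current minibatch statistics
mu_run, W_run = update_running_stats(mu_run, W_run, mu, W)
# At inference time: x_hat = W_run @ (x - mu_run)
```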
Normalizing the activations constrains the model's capacity for representation. To remedy this, Ioffe and Szegedy [27] introduce the extra learnable parameters $\gamma$ and $\beta$ in Eqn. 1. In our observation, these learnable parameters often marginally improve the performance. For DBN, we also recommend using learnable parameters. Specifically, the learnable parameters can be merged into the following ReLU activation [38], resulting in the Translated ReLU (TReLU) [54].

For a convolutional neural network, the input to the DBN transformation is a tensor $X \in \mathbb{R}^{h \times w \times d \times m}$, where $h$ and $w$ indicate the height and width of the feature maps, and $d$ and $m$ are the numbers of feature maps and examples, respectively. Following [27], we view each spatial position of a feature map as a sample. We thus unroll $X$ as a matrix with $mhw$ examples and $d$ feature maps. The whitening operation is performed over the unrolled $X$.
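In code, this unrolling is a transpose followed by a reshape. The sketch below (ours) uses an examples-first m x d x h x w layout and round-trips the data:

```python
import numpy as np

m, d, h, w = 2, 3, 4, 4  # examples, feature maps, height, width
X = np.random.RandomState(0).randn(m, d, h, w)

# View every spatial position of every example as one sample:
# (m, d, h, w) -> (d, m*h*w). Whitening would then act on this d x (m*h*w) matrix.
X_unroll = X.transpose(1, 0, 2, 3).reshape(d, m * h * w)

# Invert the unrolling to restore the feature-map layout after whitening.
X_back = X_unroll.reshape(d, m, h, w).transpose(1, 0, 2, 3)
```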
As discussed in Section 1, it is necessary to control the extent of whitening such that there are sufficient examples in a batch for estimating the whitening matrix. To do this we use "group whitening": we divide the activations along the feature dimension of size $d$ into smaller groups of size $k_G$ ($k_G \leq d$) and perform whitening within each group. The extent of whitening is controlled by the hyperparameter $k_G$. In the case $k_G = 1$, Decorrelated Batch Normalization reduces to the original Batch Normalization.
In addition to controlling the extent of whitening, group whitening reduces the computational complexity [24]. Full whitening costs $O(d^2 \max(d, m))$ for a batch of size $m$. When using group whitening, the cost is reduced to $O(d k_G \max(k_G, m))$. Typically we choose $k_G < m$, so the cost of group whitening is $O(m d k_G)$.
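Group whitening amounts to slicing the feature dimension and whitening each slice independently. A sketch (ours, assuming $d$ is divisible by $k_G$):

```python
import numpy as np

def group_whiten(X, k_G, eps=1e-5):
    """ZCA-whiten a d x m activation matrix in feature groups of size k_G."""
    d, m = X.shape
    assert d % k_G == 0, "illustration assumes d is divisible by k_G"
    out = np.empty_like(X)
    for g in range(0, d, k_G):
        Xg = X[g:g + k_G]                        # one group of activations
        Xc = Xg - Xg.mean(axis=1, keepdims=True)
        Sigma = Xc @ Xc.T / m + eps * np.eye(k_G)
        sigma, D = np.linalg.eigh(Sigma)         # k_G x k_G eigendecomposition
        out[g:g + k_G] = D @ np.diag(sigma ** -0.5) @ D.T @ Xc
    return out

Y = group_whiten(np.random.RandomState(1).randn(8, 1024), k_G=4)
```

Within each group the output covariance is (up to $\epsilon$) the identity; correlations across different groups are left untouched, which is the trade-off that makes the covariance estimate reliable for small batches.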
DBN extends BN such that the activations are decorrelated over minibatch data. DBN thus inherits the beneficial properties of BN, such as the ability to perform efficient training with large learning rates and very deep networks. Here, we further highlight the benefits of DBN over BN, in particular achieving better dynamical isometry [44] and improved conditioning.
Saxe et al. [44] introduce dynamical isometry: the desirable property that the singular values of the product of Jacobians lie within a small range around 1. Enforcing this property, even approximately, is beneficial to training because it preserves gradient magnitudes during backpropagation and alleviates the vanishing and exploding gradient problems [44]. Ioffe and Szegedy [27] find that Batch Normalization achieves approximate dynamical isometry under the assumptions that (1) the transformation between two consecutive layers is approximately linear, and (2) the activations in each layer are Gaussian and uncorrelated. Our DBN inherently satisfies the second assumption, and therefore is more likely than BN to achieve dynamical isometry.

Desjardins et al. [14] demonstrated that whitening activations results in a block-diagonal Fisher Information Matrix (FIM) for each layer under certain assumptions [18]. Their experiments show that such a block-diagonal structure in the FIM can improve conditioning. The method proposed in [14], however, cannot whiten the activations effectively, as shown in [36] and also discussed in Section 2. DBN, on the other hand, whitens the activations directly. We therefore conjecture that DBN can further improve the conditioning of the FIM, and we justify this experimentally in Section 4.1.
We start with experiments that highlight the effectiveness of Decorrelated Batch Normalization (DBN) in improving conditioning and speeding up convergence on multilayer perceptrons (MLPs). We then conduct comprehensive experiments to compare DBN and BN on convolutional neural networks (CNNs). Finally, we apply DBN to residual networks on CIFAR-10, CIFAR-100 [31] and ILSVRC-2012 to show its ability to improve modern network architectures. The code to reproduce the experiments is available at https://github.com/umich-vl/DecorrelatedBN.
We focus on classification tasks, where the loss function is the negative log-likelihood $L = -\frac{1}{m} \sum_{i=1}^{m} \log p(y_i \mid x_i)$. Unless otherwise stated, we use random weight initialization as described in [34] and ReLU activations [38].
In this section, we verify the effectiveness of our proposed method in improving conditioning and speeding up convergence on MLPs. We also discuss the effect of the group size on the tradeoff between the performance and computation cost. We compare against several baselines, including the original network without any normalization (referred to as ‘plain’), Natural Neural Networks (NNN) [14], Layer Normalization (LN) [2], and Batch Normalization (BN) [27]. All results are averaged over 5 runs.
We perform conditioning analysis on the YaleB dataset [16], specifically the subset [6] with 2,414 images and 38 classes. We resize the images to 32x32 and reshape them as 1024-dimensional vectors. We then convert the images to grayscale and subtract the per-pixel mean.
For each method, we train a 5-layer MLP using full-batch gradient descent. Hyperparameters are selected by grid search based on the training loss. For all methods, the learning rate is chosen by grid search; for NNN, the revised term and the natural reparameterization interval are chosen the same way.
We evaluate the condition number of the relative Fisher Information Matrix (FIM) [49] with respect to the last layer. Figure 3 (a) shows the evolution of the condition number over training iterations. Figure 3 (b) shows the training loss over the wall clock time. Note that the experiments are performed on CPUs and the model with DBN is slower than the model with BN per iteration. From both figures, we see that NNN, LN, BN and DBN converge faster, and achieve better conditioning compared to ‘plain’. This shows that normalization is able to make the optimization problem easier. Also, DBN achieves the best conditioning compared to other normalization methods, and speeds up convergence significantly.
As discussed in Section 3.4, the group size $k_G$ controls the extent of whitening. Here we show the effects of this hyperparameter on the performance of DBN. We use a subset [6] of the PIE face recognition dataset [47] with 68 classes and 11,554 images. We adopt the same preprocessing strategy as with YaleB. We train a 6-layer MLP using Stochastic Gradient Descent (SGD) with a batch size of 256. Other configurations are chosen in the same way as in the previous experiment. Additionally, we explore a range of group sizes for DBN. Note that with a group size of 1, DBN is reduced to the original BN without the extra learnable parameters.

Figure 4(a) shows the training loss of DBN with different group sizes. We find that the largest (G128) and smallest (G1) group sizes both have noticeably slower convergence than intermediate group sizes such as G16. These results show that (1) decorrelating activations over a minibatch can improve optimization, and (2) controlling the extent of whitening is necessary, as the estimate of the full whitening matrix may be poor over minibatch samples. Moreover, the eigendecomposition with small group sizes (e.g. 16) is less computationally expensive. We thus recommend using group whitening when training deep models.
We also compared DBN with group whitening to the other baselines; the results are shown in Figure 4(b). We find that DBN converges significantly faster than the other normalization methods.
We design comprehensive experiments to evaluate the performance of DBN with CNNs against BN, the state-of-the-art normalization technique. For these experiments we use the CIFAR-10 dataset [31], which contains 10 classes, 50k training images, and 10k test images.
We compare DBN to BN over different experimental configurations, including the choice of optimization method, nonlinearity, and the position of DBN/BN in the network. We adopt the VGG-A architecture [48] for all experiments, and preprocess the data by subtracting the per-pixel mean and dividing by the variance.
We use SGD with a batch size of 256, momentum of 0.9 and weight decay of 0.0005, and decay the learning rate by half at a fixed interval of iterations. The hyperparameters are chosen by grid search over a random validation set of 5k examples taken from the training set; the grid search covers the initial learning rate and the decay interval. We use group whitening for DBN. Figure 5(a) compares the performance of BN and DBN under this configuration.
We also experiment with other configurations: (1) using Adam [30] as the optimization method, (2) replacing ReLU with another widely used nonlinearity, the Exponential Linear Unit (ELU) [9], and (3) inserting BN/DBN after the nonlinearity. The experimental setups are otherwise the same, except that for Adam [30] the initial learning rate is again chosen by grid search. The respective results are shown in Figure 5(b), (c) and (d).
In all configurations, DBN converges faster with respect to the number of epochs and generalizes better compared to BN. In particular, DBN reduces the absolute test error in each of the four experiments above. The results demonstrate that our Decorrelated Batch Normalization outperforms Batch Normalization in terms of optimization quality and regularization ability.
Table 1. Test errors (%) of residual networks on CIFAR-10.

| Method       | Res-20 | Res-32 | Res-44 | Res-56 |
|--------------|--------|--------|--------|--------|
| Baseline*    | 8.75   | 7.51   | 7.17   | 6.97   |
| Baseline     | 7.94   | 7.31   | 7.17   | 7.21   |
| DBN-L1       | 7.94   | 7.28   | 6.87   | 6.63   |
| DBN-scale-L1 | 7.77   | 6.94   | 6.83   | 6.49   |
Table 2. Test errors (%) of Wide Residual Networks on CIFAR-10 and CIFAR-100 (mean +/- std).

| Method    | CIFAR-10: Baseline* [56] | Baseline       | DBN-scale-L1   | CIFAR-100: Baseline* [56] | Baseline        | DBN-scale-L1    |
|-----------|--------------------------|----------------|----------------|---------------------------|-----------------|-----------------|
| WRN-28-10 | 3.89                     | 3.99 +/- 0.13  | 3.79 +/- 0.09  | 18.85                     | 18.75 +/- 0.28  | 18.36 +/- 0.17  |
| WRN-40-10 | 3.80                     | 3.80 +/- 0.11  | 3.74 +/- 0.11  | 18.3                      | 18.7 +/- 0.22   | 18.27 +/- 0.19  |
Table 3. Top-1 and Top-5 errors (%) on the ILSVRC-2012 validation set.

| Method       | Res-18 Top-1 | Top-5 | Res-34 Top-1 | Top-5 | Res-50 Top-1 | Top-5 | Res-101 Top-1 | Top-5 |
|--------------|--------------|-------|--------------|-------|--------------|-------|---------------|-------|
| Baseline*    | --           | --    | --           | --    | 24.70        | 7.80  | 23.60         | 7.10  |
| Baseline     | 30.21        | 10.87 | 26.53        | 8.59  | 24.87        | 7.58  | 22.54         | 6.38  |
| DBN-scale-L1 | 29.87        | 10.36 | 26.46        | 8.42  | 24.29        | 7.08  | 22.17         | 6.09  |
We conduct experiments to support the conclusions from Section 3.5, specifically that DBN has better stability and converges faster than BN with high learning rates in very deep networks. The experiments are conducted on the S-plain network, which follows the design of the residual network [20] but removes the identity maps and uses the same feature-map sizes throughout, for simplicity.
He et al. [20] described the degradation problem for networks without identity mappings: as the network depth increases, the training accuracy degrades rapidly, even when Batch Normalization is used. In our experiments, we demonstrate that DBN relieves this problem to some extent; in other words, a model with DBN is easier to optimize. We validate this on the S-plain architecture with 20, 32 and 44 layers. The models are trained with a minibatch size of 128, momentum of 0.9 and weight decay of 0.0005. We set the initial learning rate to 0.1, divide it by 5 at 80 and 120 epochs, and end training at 160 epochs. The results in Figure 6(a) show that, with increased depth, the model with BN is more difficult to optimize than the one with DBN. We conjecture that the approximate dynamical isometry of DBN alleviates this problem.
A network with Batch Normalization can benefit from high learning rates, and thus faster training, because BN reduces internal covariate shift [27]. Here, we show that DBN can help even more. We train networks with BN and DBN using a higher learning rate of 0.4, on the S-plain architecture, dividing the learning rate by 5 at 60, 100, 140, and 180 epochs. The results in Figure 6(b) show that DBN achieves significantly better training accuracy than BN. We argue that DBN benefits from higher learning rates because of its improved conditioning.
Due to our current unoptimized implementation of DBN, it would incur a high computational cost to replace all BN modules of a network with DBN (see the appendix for more details on computational cost). We instead only decorrelate the activations among a subset of layers. We find that in practice this is already effective for residual networks [19], because the information in previous layers can pass directly to the later layers through the identity connections. We also show that we can improve upon residual networks [19, 21, 56] by using only one DBN module before the first residual block, which introduces negligible computational cost. In principle, an optimized implementation of DBN would be much faster and could be injected in multiple places in the network with little overhead; however, optimizing the implementation of DBN is beyond the scope of this work.
We apply our method to residual networks [19] by using only one DBN module before the first residual block (denoted DBN-L1). We also consider DBN with adjustable scale (denoted DBN-scale-L1), as discussed in Section 3.3. We adopt the Torch implementation of residual networks (https://github.com/facebook/fb.resnet.torch) and follow the same experimental protocol as described in [19]. We train residual networks of depth 20, 32, 44 and 56 on CIFAR-10. Table 1 shows the test errors of these networks. Our methods obtain lower test errors compared to BN on all four networks, and the improvement is more dramatic for deeper networks. We also see that DBN-scale-L1 marginally outperforms DBN-L1 in all cases; we therefore focus on comparing DBN-scale-L1 to BN in later experiments.

We apply DBN to the Wide Residual Network (WRN) [56] to improve the performance on CIFAR-10 and CIFAR-100. Following the convention set in [56], we use the abbreviation WRN-d-k to indicate a WRN with depth d and width k. We again adopt the publicly available Torch implementation (https://github.com/szagoruyko/wide-residual-networks) and follow the same setup as in [56]. The results in Table 2 show that DBN improves the original WRN on both datasets and both networks, reducing the test error on CIFAR-10 and CIFAR-100 alike.
We further validate the scalability of our method on ILSVRC-2012 with 1000 classes [13]. We use the official 1.28M training images as the training set and evaluate the top-1 and top-5 classification errors on the validation set of 50k images. We use the 18-, 34-, 50- and 101-layer residual networks (Res18, Res34, Res50 and Res101) and perform single-model, single-crop testing. We follow the same experimental setup as described in [19], except for the number of GPUs: we train Res18 and Res34 (whose single-crop results have not been previously reported) on 2 GPUs, Res50 on 4 GPUs instead of 8, and Res101 on 8 GPUs. We apply SGD with a minibatch size of 256, momentum of 0.9, and weight decay of 0.0001; we set the initial learning rate to 0.1, divide it by 10 at epochs 30 and 60, and end training at 90 epochs. The results are shown in Table 3: DBN-scale-L1 achieves lower test errors than the original residual networks.
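The step schedule described above (initial rate 0.1, divided by 10 at epochs 30 and 60, training ending at 90 epochs) can be sketched as a small helper; this is an illustrative function of ours, not code from the released implementation:

```python
def learning_rate(epoch, base_lr=0.1, milestones=(30, 60), decay=0.1):
    """Step learning-rate schedule as in the ImageNet setup above:
    start at base_lr and multiply by `decay` at each milestone epoch."""
    return base_lr * decay ** sum(epoch >= m for m in milestones)

# Over the 90 training epochs:
# epochs 0-29 -> 0.1, epochs 30-59 -> 0.01, epochs 60-89 -> 0.001
```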
In this paper, we propose Decorrelated Batch Normalization (DBN), which extends Batch Normalization to include whitening over minibatch data. We find that PCA whitening can sometimes be detrimental to training because it causes stochastic axis swapping, and demonstrate that it is critical to use ZCA whitening, which avoids this issue. DBN retains the advantages of Batch Normalization while using decorrelated representations to further improve models’ optimization efficiency and generalization abilities. This is because DBN can maintain approximate dynamical isometry and improve the conditioning of the Fisher Information Matrix. These properties are experimentally validated, suggesting DBN has great potential to be used in designing DNN architectures.
Acknowledgement. This work is partially supported by the China Scholarship Council, NSFC-61370125 and SKLSDE-2017ZX03. We also thank Jonathan Stroud and Lanlan Liu for their help with proofreading and editing.
For illustration, we first provide the forward pass and then derive the backward pass. Regarding notation, all vectors are column vectors, except that gradient vectors are row vectors.
Given minibatch layer inputs \{x_1, \ldots, x_m\}, where m is the number of examples, the ZCA-whitened output \hat{x}_i for the input x_i can be calculated as follows:

\mu = \frac{1}{m} \sum_{i=1}^{m} x_i    (A.1)

\Sigma = \frac{1}{m} \sum_{i=1}^{m} (x_i - \mu)(x_i - \mu)^T + \epsilon I    (A.2)

\Sigma = D \Lambda D^T    (A.3)

x_{c,i} = x_i - \mu    (A.4)

\tilde{x}_i = \Lambda^{-1/2} D^T x_{c,i}    (A.5)

\hat{x}_i = D \tilde{x}_i    (A.6)

where \mu and \Sigma are the mean vector and the covariance matrix within the minibatch data, and \epsilon > 0 is a small constant for numerical stability. Eqn. A.3 is the eigendecomposition of \Sigma, where D is the orthogonal matrix of eigenvectors and \Lambda is a diagonal matrix whose diagonal elements are the eigenvalues. Note that x_{c,i} and \tilde{x}_i are auxiliary variables introduced for clarity; in fact, the \tilde{x}_i are the output of PCA whitening. However, PCA whitening hardly works for deep networks, as discussed in the paper.
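The forward pass above (Eqns. A.1–A.6) can be sketched in NumPy; this is an illustrative implementation of ours, not the released code, operating on a d × m minibatch matrix with one example per column:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a minibatch X of shape (d, m), one example per column.
    Follows Eqns. A.1-A.6: center, eigendecompose the covariance,
    PCA-whiten, then rotate back with D."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)            # (A.1) minibatch mean
    Xc = X - mu                                   # (A.4) centered inputs
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)       # (A.2) covariance + eps*I
    lam, D = np.linalg.eigh(Sigma)                # (A.3) eigendecomposition
    X_pca = (D.T @ Xc) / np.sqrt(lam)[:, None]    # (A.5) PCA-whitened
    X_zca = D @ X_pca                             # (A.6) ZCA-whitened
    return X_zca

rng = np.random.default_rng(0)
# Channels with very different scales, to make whitening visible.
X = rng.normal(size=(4, 1000)) * np.array([[3.0], [1.0], [0.5], [2.0]])
Xh = zca_whiten(X)
# The whitened minibatch has (approximately) identity covariance:
cov = Xh @ Xh.T / Xh.shape[1]
print(np.allclose(cov, np.eye(4), atol=1e-3))  # True
```

Note that the output is invariant to the signs and ordering of the eigenvectors returned by `eigh`, since D appears twice: ZCA whitening applies the full symmetric matrix \Sigma^{-1/2} = D \Lambda^{-1/2} D^T.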
Based on the chain rule and the result from [28], we can derive the backward-pass gradients as follows:

\frac{\partial L}{\partial \tilde{x}_i} = \frac{\partial L}{\partial \hat{x}_i} D    (A.7)

\frac{\partial L}{\partial D} = \sum_{i=1}^{m} \left[ \left(\frac{\partial L}{\partial \hat{x}_i}\right)^T \tilde{x}_i^T + x_{c,i} \frac{\partial L}{\partial \tilde{x}_i} \Lambda^{-1/2} \right]    (A.8)

\frac{\partial L}{\partial \Lambda} = -\frac{1}{2} \Lambda^{-3/2} \left( \sum_{i=1}^{m} \left(\frac{\partial L}{\partial \tilde{x}_i}\right)^T x_{c,i}^T D \right)_{diag}    (A.9)

\frac{\partial L}{\partial \mu} = -\sum_{i=1}^{m} \frac{\partial L}{\partial \tilde{x}_i} \Lambda^{-1/2} D^T    (A.10)

\frac{\partial L}{\partial \Sigma} = D \left\{ K^T \odot \left( D^T \frac{\partial L}{\partial D} \right) + \left( \frac{\partial L}{\partial \Lambda} \right)_{diag} \right\} D^T    (A.11)

\frac{\partial L}{\partial x_{c,i}} = \frac{\partial L}{\partial \tilde{x}_i} \Lambda^{-1/2} D^T + \frac{2}{m} x_{c,i}^T \, \mathrm{sym}\!\left(\frac{\partial L}{\partial \Sigma}\right)    (A.12)

\frac{\partial L}{\partial x_i} = \frac{\partial L}{\partial x_{c,i}} + \frac{1}{m} \frac{\partial L}{\partial \mu}    (A.13)

where L is the loss function, K is 0-diagonal with K_{ij} = 1/(\lambda_i - \lambda_j) for i \neq j, the operator \odot is element-wise matrix multiplication, (\cdot)_{diag} sets the off-diagonal elements to zero, and \mathrm{sym}(A) = \frac{1}{2}(A + A^T) symmetrizes A. Note that Eqn. A.11 is from the results in [28]. Besides, a similar formulation to backpropagate the gradient through the whitening transformation has been derived in the context of learning an orthogonal weight matrix in [24].
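The backward pass (Eqns. A.7–A.13) can likewise be sketched in NumPy and validated against numerical differentiation. This is an illustrative implementation of ours (function and variable names are not from the released code), assuming the eigenvalues of the minibatch covariance are distinct:

```python
import numpy as np

def zca_forward(X, eps=1e-3):
    """Forward pass (Eqns. A.1-A.6) on a (d, m) minibatch."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu
    Sigma = Xc @ Xc.T / m + eps * np.eye(d)
    lam, D = np.linalg.eigh(Sigma)
    Xt = (D.T @ Xc) / np.sqrt(lam)[:, None]      # PCA-whitened (A.5)
    Xh = D @ Xt                                  # ZCA-whitened (A.6)
    return Xh, (Xc, lam, D, Xt, m)

def zca_backward(Gh, cache):
    """Backward pass (Eqns. A.7-A.13). Gh is dL/dXh with shape (d, m);
    row-vector gradients from the text are stored as columns here."""
    Xc, lam, D, Xt, m = cache
    inv_sqrt = 1.0 / np.sqrt(lam)
    Gt = D.T @ Gh                                                # (A.7)
    dD = Gh @ Xt.T + (Xc @ Gt.T) * inv_sqrt[None, :]             # (A.8)
    dlam = -0.5 * lam ** -1.5 * np.sum(Gt * (D.T @ Xc), axis=1)  # (A.9)
    dmu = -(D @ (Gt * inv_sqrt[:, None])).sum(axis=1, keepdims=True)  # (A.10)
    diff = lam[:, None] - lam[None, :]           # K_ij = 1/(lam_i - lam_j)
    np.fill_diagonal(diff, np.inf)               # K has a zero diagonal
    K = 1.0 / diff
    dSigma = D @ (K.T * (D.T @ dD) + np.diag(dlam)) @ D.T        # (A.11)
    S = 0.5 * (dSigma + dSigma.T)                # sym(dL/dSigma)
    dXc = D @ (Gt * inv_sqrt[:, None]) + (2.0 / m) * (S @ Xc)    # (A.12)
    return dXc + dmu / m                                         # (A.13)

# Finite-difference check of the analytic gradient on a small problem,
# using the linear loss L = sum(G * Xh) for a fixed random G.
rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))
G = rng.normal(size=(4, 8))
Xh, cache = zca_forward(X)
dX = zca_backward(G, cache)
num = np.zeros_like(X)
h = 1e-6
for a in range(X.shape[0]):
    for i in range(X.shape[1]):
        Xp, Xm = X.copy(), X.copy()
        Xp[a, i] += h
        Xm[a, i] -= h
        num[a, i] = (np.sum(G * zca_forward(Xp)[0])
                     - np.sum(G * zca_forward(Xm)[0])) / (2 * h)
print(np.allclose(dX, num, rtol=1e-4, atol=1e-6))  # True
```

The finite-difference check is well-defined despite the sign ambiguity of `eigh` eigenvectors, because the ZCA output depends on D only through the symmetric product D \Lambda^{-1/2} D^T.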
For more efficient computation, we provide the simplified formulation as follows:

\frac{\partial L}{\partial x_i} = \left( \frac{\partial L}{\partial \tilde{x}_i} - f \right) \Lambda^{-1/2} D^T + \frac{2}{m} x_{c,i}^T \, \mathrm{sym}\!\left(\frac{\partial L}{\partial \Sigma}\right)    (A.14)

where f = \frac{1}{m} \sum_{j=1}^{m} \frac{\partial L}{\partial \tilde{x}_j} is the minibatch mean of the gradients with respect to \tilde{x}. Eqn. A.14 follows from substituting Eqns. A.10 and A.12 into Eqn. A.13 and collecting terms.
In this part, we analyze the computational cost of the DBN module for Convolutional Neural Networks (CNNs). Theoretically, a convolutional layer with an h × w input, a batch size of m, and d filters of size F × F over d input channels costs O(m h w d^2 F^2). Adding DBN with a group size of k_G incurs an overhead of O(m h w d k_G + d k_G^2): computing the per-group covariance matrices and applying the whitening transformation each cost O(m h w d k_G), and the eigendecompositions cost O(d k_G^2). The relative overhead is thus roughly k_G / (d F^2), which is negligible when k_G is small (e.g. 16).
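As a rough back-of-envelope check of this analysis, the sketch below counts multiply–accumulates under a cost model of our own simplification (constants are kept here, although they are dropped in the O(·) notation above):

```python
def conv_cost(h, w, m, d, F):
    """Approximate MAC count of a conv layer with d input and d output
    channels, F x F filters, and an h x w output map, batch size m."""
    return m * h * w * d * d * F * F

def dbn_overhead(h, w, m, d, kG):
    """Approximate DBN overhead: covariance estimation plus applying the
    whitening transform (m*h*w*d*kG each), plus eigendecompositions of
    d/kG groups (kG^3 each, i.e. d*kG^2 total)."""
    return 2 * m * h * w * d * kG + d * kG * kG

# Example: 32x32 feature maps, batch 64, 256 channels, 3x3 filters,
# whitening in groups of 16 channels.
rel = dbn_overhead(32, 32, 64, 256, 16) / conv_cost(32, 32, 64, 256, 3)
print(f"relative overhead: {rel:.4f}")  # ~ 2*kG/(d*F*F) = 0.0139
```

For a small group size the whitening overhead is a percent-level addition to the convolution itself, consistent with the analysis above.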
Empirically, our unoptimized implementation of DBN costs ms (forward pass + backward pass, averaged over runs) for full whitening, with a input, a batch size of . In comparison, the highly optimized cudnn convolution [7] in Torch [11] with the same input costs ms.
Table B.1: Time costs of BN and DBN with different group sizes (DBN-G8 denotes DBN with a group size of 8).

Method      Training   Inference
BN          163.68     11.47
DBN-G8      707.47     15.50
DBN-G16     466.92     14.41
DBN-G64     297.25     13.70
DBN-G256    440.88     13.64
DBN-G512    1004       13.68
Table B.2: Time costs of BN and DBN-scale on the CIFAR10 experiments.

                 Training               Inference
Method       BN        DBN-scale    BN       DBN-scale
Res56        69.53     86.80        4.57     5.12
Res44        55.03     72.06        3.65     4.36
Res32        40.36     57.47        2.80     3.33
Res20        25.97     42.87        1.94     2.44
WideRes40    643.94    659.55       25.69    26.08
WideRes28    440       457          36.56    38.10
Table B.3: Time costs of BN and DBN-scale on the ImageNet experiments.

                 Training               Inference
Method       BN        DBN-scale    BN       DBN-scale
Res101       1.52      4.21         0.42     0.57
Res50        0.51      1.12         0.19     0.25
Res34        1.21      2.04         0.55     0.71
Res18        0.81      1.45         0.40     0.51
Tables B.1, B.2 and B.3 show the wall-clock time for our CIFAR10 and ImageNet experiments described in the paper. Note that DBN with small groups (e.g. G8) can cost more time than with larger groups due to our unoptimized implementation: for example, we whiten the groups sequentially rather than in parallel, because Torch does not yet provide an easy way to use CUDA's linear algebra libraries in parallel. Our current implementation of DBN also has a much lower average GPU utilization than BN, so there is considerable room for a more efficient implementation.
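The group-wise whitening discussed above can be sketched as follows; this is an illustrative NumPy version of ours that, like our unoptimized implementation, whitens the groups sequentially:

```python
import numpy as np

def zca_whiten(X, eps=1e-5):
    """ZCA-whiten a (k, m) matrix: one example per column."""
    k, m = X.shape
    Xc = X - X.mean(axis=1, keepdims=True)
    lam, D = np.linalg.eigh(Xc @ Xc.T / m + eps * np.eye(k))
    return D @ ((D.T @ Xc) / np.sqrt(lam)[:, None])

def group_whiten(X, group_size):
    """Split the d channels into groups of `group_size` and ZCA-whiten
    each group independently (and, here, one after another)."""
    d, m = X.shape
    assert d % group_size == 0
    out = np.empty_like(X)
    for g in range(0, d, group_size):
        out[g:g + group_size] = zca_whiten(X[g:g + group_size])
    return out

rng = np.random.default_rng(2)
X = rng.normal(size=(64, 512)) * rng.uniform(0.5, 3.0, size=(64, 1))
Y = group_whiten(X, group_size=16)
# Each group of 16 channels now has (approximately) identity covariance.
cov0 = Y[:16] @ Y[:16].T / Y.shape[1]
print(np.allclose(cov0, np.eye(16), atol=1e-3))  # True
```

The per-group eigendecompositions are independent, so an optimized implementation could batch them on the GPU instead of looping.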
Table B.4: Time costs of BN and DBN-scale on ImageNet with a batch size of 32 on a single GPU.

                 Training               Inference
Method       BN        DBN-scale    BN       DBN-scale
Res101       1.24      1.35         0.35     0.37
Res50        0.69      0.80         0.19     0.22
Res34        0.34      0.45         0.12     0.14
Res18        0.20      0.31         0.07     0.09
We also observe that, for the ResNet experiments on ImageNet, the overhead of multi-GPU parallelization is relatively high in our current DBN implementation. We therefore perform another set of ResNet experiments with the same settings, except that we use a batch size of 32 and a single GPU. As shown in Table B.4, the difference in time cost between BN and DBN is much smaller on a single GPU, suggesting that there is room for optimizing our DBN implementation for multi-GPU training and inference.