Decorrelated Batch Normalization

Lei Huang, et al. · 04/23/2018

Batch Normalization (BN) is capable of accelerating the training of deep models by centering and scaling activations within mini-batches. In this work, we propose Decorrelated Batch Normalization (DBN), which not only centers and scales activations but also whitens them. We explore multiple whitening techniques and find that PCA whitening causes a problem we call stochastic axis swapping, which is detrimental to learning. We show that ZCA whitening does not suffer from this problem, permitting successful learning. DBN retains the desirable qualities of BN and further improves BN's optimization efficiency and generalization ability. We design comprehensive experiments to show that DBN can improve the performance of BN on multilayer perceptrons and convolutional neural networks. Furthermore, we consistently improve the accuracy of residual networks on CIFAR-10, CIFAR-100, and ImageNet.

1 Introduction

Batch Normalization [27] is a technique for accelerating deep network training. Introduced by Ioffe and Szegedy, it has been widely used in a variety of state-of-the-art systems [19, 51, 21, 56, 50, 22]. Batch Normalization works by standardizing the activations of a deep network within a mini-batch—transforming the output of a layer, or equivalently the input to the next layer, to have a zero mean and unit variance. Specifically, let $x_1, \ldots, x_m$ be the original outputs of a single neuron on $m$ examples in a mini-batch. Batch Normalization produces the transformed outputs

$$\hat{x}_i = \gamma \frac{x_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta, \qquad (1)$$

where $\mu$ and $\sigma^2$ are the mean and variance of the mini-batch, $\epsilon$ is a small number to prevent numerical instability, and $\gamma$ and $\beta$ are extra learnable parameters. Crucially, during training, Batch Normalization is part of both the inference computation (forward pass) as well as the gradient computation (backward pass). Batch Normalization can be inserted extensively into a network, typically between a linear mapping and a nonlinearity.
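As a point of reference for the whitening operation introduced below, here is a minimal NumPy sketch of the per-neuron standardization in Eqn. 1 (the function name and the vectorized handling of all neurons at once are our own, not taken from any released implementation):

```python
import numpy as np

def batch_norm_1d(x, gamma, beta, eps=1e-5):
    """Standardize a mini-batch of activations as in Eqn. 1.

    x: array of shape (m, d) -- m examples, d neurons.
    gamma, beta: learnable scale and shift, shape (d,).
    """
    mu = x.mean(axis=0)                    # per-neuron mini-batch mean
    var = x.var(axis=0)                    # per-neuron mini-batch variance
    x_hat = (x - mu) / np.sqrt(var + eps)  # zero mean, unit variance
    return gamma * x_hat + beta            # learnable affine transform
```

Each neuron is treated independently: the activations are centered and scaled but not decorrelated, which is exactly the gap DBN addresses.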

Batch Normalization was motivated by the well-known fact that whitening inputs (i.e. centering, decorrelating, and scaling) speeds up training [34]. It has been shown that better conditioning of the covariance matrix of the input leads to better conditioning of the Hessian in updating the weights, making the gradient descent updates closer to Newton updates [34, 52]. Batch Normalization exploits this fact further by seeking to whiten not only the input to the first layer of the network, but also the inputs to each internal layer in the network. But instead of whitening, Batch Normalization only performs standardization. That is, the activations are centered and scaled, but not decorrelated. Such a choice was justified in [27] by citing the cost and differentiability of whitening, but no actual attempts were made to derive or experiment with a whitening operation.

While standardization has proven effective for Batch Normalization, it remains an interesting question whether full whitening—adding decorrelation to Batch Normalization—can help further. Conceptually, there are clear cases where whitening is beneficial. For example, when the activations are close to being perfectly correlated (e.g., in 2D, all points lie close to a line, and Batch Normalization does not change the shape of the distribution), standardization barely improves the conditioning of the covariance matrix, whereas whitening remains effective. In addition, prior work has shown that decorrelated activations result in better features [3, 45, 5] and better generalization [10, 55], suggesting room for further improving Batch Normalization.
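To make the perfectly correlated case concrete, here is a small worked example of our own (not from the paper), in 2D:

```latex
% Two perfectly correlated activations: x_2 = 2 x_1 with Var(x_1) = 1.
% Covariance before normalization (rank 1, so the condition number is infinite):
\Sigma = \begin{pmatrix} 1 & 2 \\ 2 & 4 \end{pmatrix},
\qquad \text{eigenvalues } \{5, 0\}.
% Standardization rescales each coordinate to unit variance but keeps the correlation at 1:
\Sigma_{\mathrm{BN}} = \begin{pmatrix} 1 & 1 \\ 1 & 1 \end{pmatrix},
\qquad \text{eigenvalues } \{2, 0\} \quad \text{(still infinitely ill-conditioned)}.
% Whitening (with a small epsilon to keep the covariance invertible) maps the covariance
% to the identity, whose condition number is 1.
```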

In this paper, we propose Decorrelated Batch Normalization, in which we whiten the activations of each layer within a mini-batch. Let $x_i \in \mathbb{R}^d$ be the input to a layer for the $i$-th example in a mini-batch of size $m$. The whitened input is given by

$$\hat{x}_i = \Sigma^{-\frac{1}{2}}\,(x_i - \mu), \qquad (2)$$

where $\mu$ is the mini-batch mean and $\Sigma$ is the mini-batch covariance matrix.

Several questions arise for implementing Decorrelated Batch Normalization. One is how to perform back-propagation, in particular, how to back-propagate through the inverse square root of a matrix (i.e. $\Sigma^{-\frac{1}{2}}$), whose key step is an eigen decomposition. The differentiability of this matrix transform was one of the reasons that whitening was not pursued in the Batch Normalization paper [27]. Desjardins et al. [14] whiten the activations but avoid back-propagating through the whitening transform by treating the mean and the whitening matrix as model parameters, rather than as functions of the input examples. However, as has been pointed out [27, 26], doing so may lead to instability in training.

In this work, we decorrelate the activations and perform proper back-propagation during training. We achieve this by using the fact that eigen-decomposition is differentiable, and its derivatives can be obtained using matrix differential calculus, as shown by prior work [17, 28]. We build upon these existing results and derive the back-propagation updates for Decorrelated Batch Normalization.

Another question is, perhaps surprisingly, the choice of how to compute the whitening matrix $\Sigma^{-\frac{1}{2}}$. The whitening matrix is not unique because a whitened input stays whitened after an arbitrary rotation [29]. It turns out that PCA whitening, a standard choice [15], does not speed up training at all and in fact inflicts significant harm. The reason is that PCA whitening works by performing rotation followed by scaling, but the rotation can cause a problem we call stochastic axis swapping, which, as will be discussed in Section 3.1, in effect randomly permutes the neurons of a layer for each batch. Such permutation can drastically change the data representation from one batch to another to the extent that training never converges.

To address this stochastic axis swapping issue, we discover that it is critical to use ZCA whitening [4, 29], which rotates the PCA-whitened activations back such that the distortion of the original activations is minimal. We show through experiments that the benefits of decorrelation are only observed with the additional rotation of ZCA whitening.

A third question is the amount of whitening to perform. Given a particular batch size, DBN may not have enough samples to obtain a suitable estimate for the full covariance matrix. We thus control the extent of whitening by decorrelating smaller groups of activations instead of all activations together. That is, for an output of dimension $d$, we divide it into groups of size $k_G < d$ and apply whitening within each group. This strategy has the added benefit of reducing the computational cost of whitening from $O(d^2 m + d^3)$ to $O(k_G d m + k_G^2 d)$, where $m$ is the mini-batch size.

We conduct extensive experiments on multilayer perceptrons and convolutional neural networks, and show that Decorrelated Batch Normalization (DBN) improves upon the original Batch Normalization (BN) in terms of training speed and generalization performance. In particular, experiments demonstrate that using DBN can consistently improve the performance of residual networks [19, 21, 56] on CIFAR-10, CIFAR-100  [31] and ILSVRC-2012 [13].

2 Related Work

Normalized activations [37, 53] and gradients [46, 40] have long been known to be beneficial for training neural networks. Batch Normalization [27] was the first to perform normalization per mini-batch in a way that supports back-propagation. One drawback of Batch Normalization, however, is that it requires a reasonable batch size to estimate the mean and variance, and is not applicable when the batch size is very small. To address this issue, Ba et al. [2] proposed Layer Normalization, which performs normalization on a single example using the mean and variance of the activations from the same layer. Batch Normalization and Layer Normalization were later unified by Ren et al. under the Division Normalization framework [41]. Other attempts to improve Batch Normalization for small batch sizes include Batch Renormalization [26] and Stream Normalization [35]. There have also been efforts to adapt Batch Normalization to recurrent neural networks [33, 12]. Our work extends Batch Normalization by decorrelating the activations, which is a direction orthogonal to all of these prior works.

Our work is closely related to Natural Neural Networks [14, 36], which whiten activations by periodically estimating and updating a whitening matrix. Our work differs from Natural Neural Networks in two important ways. First, Natural Neural Networks perform whitening by treating the mean and the whitening matrix as model parameters as opposed to functions of the input examples, which, as pointed out by Ioffe and Szegedy [27, 26], can cause instability in training, with symptoms such as divergence or gradient explosion. Second, during training, a Natural Neural Network uses a running estimate of the mean and whitening matrix to perform whitening for each mini-batch; as a result, it cannot ensure that the transformed activations within each batch are in fact whitened, whereas in our case the activations within a mini-batch are guaranteed to be whitened. Natural Neural Networks may thus suffer from instability when training very deep neural networks [14].

Another way to obtain decorrelated activations is to introduce additional regularization in the loss function [10, 55]. Cogswell et al. [10] introduced the DeCov loss on the activations as a regularizer to encourage non-redundant representations. Xiong et al. [55] extend [10] to learn group-wise decorrelated representations. Note that these methods are not designed for speeding up training; in fact, they often empirically slow down training [10], probably because decorrelated activations are part of the learning objective and thus may not be achieved until later in training.

Our approach is also related to work that implicitly normalizes activations by either normalizing the network weights—e.g. through re-parameterization techniques [43, 25, 24], Riemannian optimization methods [23, 8], or additional weight regularization [32, 39, 42]—or by designing special scaling coefficients and bias values that can induce normalized activations under certain assumptions [1]. Our work bears some similarity to that of Huang et al. [24], which also back-propagates gradients through a ZCA-like normalization transform that involves eigen-decomposition. But the work by Huang et al. normalizes weights instead of activations, which leads to significantly different derivations, especially with regard to convolutional layers; in addition, unlike ours, it does not involve a separately estimated whitening matrix during inference, nor does it discuss the stochastic axis swapping issue. Finally, all of these works, including [24], are orthogonal to ours in the sense that their normalization is data independent, whereas ours is data dependent. In fact, as shown in [25, 8, 24, 23], data-dependent and data-independent normalization can be combined to achieve even greater improvement.

3 Decorrelated Batch Normalization

Let $X \in \mathbb{R}^{d \times m}$ be a data matrix that represents inputs to a layer in a mini-batch of size $m$. Let $x_i$ be the $i$-th column vector of $X$, i.e. the $d$-dimensional input from the $i$-th example. The whitening transformation is defined as

$$\phi(X) = \Sigma^{-\frac{1}{2}}\,(X - \mu\,\mathbf{1}^T), \qquad (3)$$

where $\mu = \frac{1}{m} X \mathbf{1}$ is the mean of $X$, $\Sigma = \frac{1}{m}(X - \mu\,\mathbf{1}^T)(X - \mu\,\mathbf{1}^T)^T + \epsilon I$ is the covariance matrix of the centered $X$, $\mathbf{1}$ is a column vector of all ones, and $\epsilon$ is a small positive number for numerical stability (preventing a singular $\Sigma$). The whitening transformation ensures that the transformed data $\hat{X} = \phi(X)$ is whitened, i.e., $\frac{1}{m}\hat{X}\hat{X}^T = I$.
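A minimal NumPy sketch of Eqn. 3 (our own illustration, using the symmetric inverse square root; the helper name `whiten` is an assumption, not the paper's code) also previews the non-uniqueness discussed next:

```python
import numpy as np

def whiten(X, eps=1e-5):
    """Whiten a mini-batch X of shape (d, m): columns are examples."""
    d, m = X.shape
    mu = X.mean(axis=1, keepdims=True)
    Xc = X - mu                                  # center
    sigma = Xc @ Xc.T / m + eps * np.eye(d)      # regularized covariance
    lam, D = np.linalg.eigh(sigma)               # sigma = D diag(lam) D^T
    W = D @ np.diag(lam ** -0.5) @ D.T           # symmetric inverse square root
    return W @ Xc

X = np.random.randn(4, 256)
X_hat = whiten(X)
print(np.round(X_hat @ X_hat.T / X.shape[1], 3))    # ~ identity

# Any rotation of whitened data stays whitened:
Q, _ = np.linalg.qr(np.random.randn(4, 4))          # random orthogonal matrix
X_rot = Q @ X_hat
print(np.round(X_rot @ X_rot.T / X.shape[1], 3))    # still ~ identity
```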

Although Eqn. 3 gives an analytical form of the whitening transformation, this transformation is in fact not unique. The reason is that $\Sigma^{-\frac{1}{2}}$, the inverse square root of the covariance matrix, is defined only up to rotation, and as a result there exist infinitely many whitening transformations. Thus, a natural question is whether the specific choice of $\Sigma^{-\frac{1}{2}}$ matters, and if so, which choice to use.

To answer this question, we first discuss a phenomenon we call stochastic axis swapping and show that not all whitening transformations are equally desirable.

3.1 Stochastic Axis Swapping

Given a data point represented as a vector $x$ under the standard basis, its representation under another orthogonal basis is $\hat{x} = Q^T x$, where $Q$ is an orthogonal matrix. We define stochastic axis swapping as follows:

Definition 3.1

Assume a training algorithm that iteratively updates weights using a batch of randomly sampled data points per iteration. Stochastic axis swapping occurs when a data point is transformed to be $\hat{x}_1 = Q_1^T x$ in one iteration and $\hat{x}_2 = Q_2^T x$ in another iteration such that $Q_1 = Q_2 P$, where $P \neq I$ is a permutation matrix solely determined by the statistics of a batch.

Stochastic axis swapping makes training difficult, because the random permutation of the input dimensions can greatly confuse the learning algorithm—in the extreme case where the permutation is completely random, what remains is only a bag of activation values (similar to scrambling all pixels in an image), potentially resulting in an extreme loss of information and discriminative power.

Here, we demonstrate that the whitening of activations, if not done properly, can cause stochastic axis swapping when training neural networks. We start with standard PCA whitening [15], which computes $\Sigma^{-\frac{1}{2}}_{PCA} = \Lambda^{-\frac{1}{2}} D^T$ through eigen decomposition, where $\Lambda = \mathrm{diag}(\lambda_1, \ldots, \lambda_d)$ and $D = [d_1, \ldots, d_d]$ are the eigenvalues and eigenvectors of $\Sigma$, i.e. $\Sigma = D \Lambda D^T$. That is, the original data point (after centering) is rotated by $D^T$ and then scaled by $\Lambda^{-\frac{1}{2}}$. Without loss of generality, we assume that $D$ is unique by fixing the sign of the first element of each eigenvector. A first opportunity for stochastic axis swapping is that the columns of $D$ (and the corresponding entries of $\Lambda$) can be permuted while still giving a valid whitening transformation. But this is easy to fix—we can commit to a unique $D$ and $\Lambda$ by ordering the eigenvalues non-increasingly.

But it turns out that ensuring a unique $D$ and $\Lambda$ is insufficient to avoid stochastic axis swapping. Fig. 1 illustrates an example. Given a mini-batch of data points in one iteration as shown in Fig. 1(a), PCA whitening rotates them by $D^T$ and stretches them along the new axis system by $\Lambda^{-\frac{1}{2}}$, where $\lambda_1 > \lambda_2$. Consider another iteration, shown in Figure 1(b), where all data points except the red points are the same: it has the same eigenvectors but different eigenvalues, and now the eigenvalue associated with $d_2$ is the larger one. In this case, the new rotation matrix becomes $[d_2, d_1]^T$, because we always order the eigenvalues non-increasingly. The blue data points thus have two different representations with the axes swapped.
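The effect is easy to reproduce numerically. The sketch below (our own illustration, not from the paper's code) builds two mini-batches that share most of their points; because eigenvalues are always sorted, the leading PCA axis corresponds to different original dimensions in the two batches, so the PCA-whitened coordinates of the shared points come out permuted:

```python
import numpy as np

def pca_whiten(X, eps=1e-5):
    """PCA-whiten columns of X (shape (d, m)): rotate by D^T, then scale by Lambda^{-1/2}."""
    Xc = X - X.mean(axis=1, keepdims=True)
    sigma = Xc @ Xc.T / X.shape[1] + eps * np.eye(X.shape[0])
    lam, D = np.linalg.eigh(sigma)
    order = np.argsort(lam)[::-1]            # order eigenvalues non-increasingly
    lam, D = lam[order], D[:, order]
    return np.diag(lam ** -0.5) @ D.T @ Xc

rng = np.random.default_rng(0)
shared = rng.normal(size=(2, 64)) * np.array([[3.0], [1.0]])  # most variance on axis 0
batch_a = shared
extra = rng.normal(size=(2, 64)) * np.array([[1.0], [6.0]])   # extra points dominated by axis 1
batch_b = np.concatenate([shared, extra], axis=1)

wa = pca_whiten(batch_a)[:, :5]   # whitened coordinates of 5 shared points
wb = pca_whiten(batch_b)[:, :5]
print(np.round(wa, 2))            # leading coordinate tracks original axis 0 ...
print(np.round(wb, 2))            # ... here it tracks original axis 1: the axes swapped
```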

To further justify our conjecture, we perform an experiment on multilayer perceptrons (MLPs) over the MNIST dataset as shown in Figure 2. We refer to the network without whitening activations as ‘plain’ and the network with PCA whitening as DBN-PCA. We find that DBN-PCA has significantly inferior performance to ‘plain’. Particularly, on the 4-layer MLP, DBN-PCA behaves similarly to random guessing, which implies that it causes severe stochastic axis swapping.

Figure 1: Illustration that PCA whitening suffers from stochastic axis swapping. (a) The axis alignment of PCA whitening in the initial iteration; (b) The axis alignment in another iteration.
Figure 2: Illustration of different whitening methods in training an MLP on MNIST. We use full-batch gradient descent and report the best result with respect to the training loss over the learning rates considered. (a) and (b) show the training loss of the 2-layer and 4-layer MLP, respectively. The number of neurons in each hidden layer is 100. We refer to the network without whitened activations as 'plain', with PCA-whitened activations as DBN-PCA, and with ZCA-whitened activations as DBN-ZCA.

The stochastic axis swapping caused by PCA whitening exists because the rotation operation is executed over varied activations. Such variation is a result of two factors: (1) the activations can change due to weight updates during training, following the internal covariate shift described in [27]; (2) the optimization is based on random mini-batches, which means that each batch will contain a different random set of examples in each training epoch.

A similar phenomenon was also observed in [24], where PCA-style orthogonalization failed to effectively learn orthogonal filters in neural networks; however, no further analysis was provided there to explain why this is the case.

3.2 ZCA Whitening

To address the stochastic axis swapping problem, one straightforward idea is to rotate the transformed input back using the same rotation matrix $D$:

$$\Sigma^{-\frac{1}{2}}_{ZCA} = D \Lambda^{-\frac{1}{2}} D^T. \qquad (4)$$

In other words, we scale along the eigenvectors and rotate back, obtaining the whitened activations under the original axis system. Such whitening is known as ZCA whitening [4], and has been shown to minimize the distortion introduced by whitening under the L2 distance [4, 29, 24]. We perform the same experiments with ZCA whitening as we did with PCA whitening: with MLPs on MNIST. As shown in Figure 2, ZCA whitening (referred to as DBN-ZCA) improves training performance significantly compared to no whitening ('plain') and to DBN-PCA. This shows that ZCA whitening is critical to addressing the stochastic axis swapping problem.
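A small NumPy sketch (ours, for illustration only) contrasting the PCA and ZCA whitening matrices built from the same eigendecomposition:

```python
import numpy as np

def whitening_matrices(X, eps=1e-5):
    """Return (PCA, ZCA) whitening matrices for a batch X of shape (d, m)."""
    Xc = X - X.mean(axis=1, keepdims=True)
    sigma = Xc @ Xc.T / X.shape[1] + eps * np.eye(X.shape[0])
    lam, D = np.linalg.eigh(sigma)
    W_pca = np.diag(lam ** -0.5) @ D.T        # rotate to eigenbasis, then scale
    W_zca = D @ W_pca                         # ... then rotate back (Eqn. 4)
    return W_pca, W_zca

X = np.random.randn(8, 512)
W_pca, W_zca = whitening_matrices(X)
Xc = X - X.mean(axis=1, keepdims=True)
for W in (W_pca, W_zca):
    Y = W @ Xc
    print(np.allclose(Y @ Y.T / X.shape[1], np.eye(8), atol=1e-3))  # both whiten
print(np.allclose(W_zca, W_zca.T))  # the ZCA matrix is symmetric; it keeps the original axes
```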

Back-propagation

It is important to note that back-propagation through ZCA whitening is non-trivial. In our DBN, the mean $\mu$ and the covariance $\Sigma$ are not parameters of the whitening transform, but are functions of the mini-batch data $X$. We therefore need to back-propagate the gradients through the whitening transform, as in [27, 24]. Here, we use the results from [28] to derive the back-propagation formulations of whitening:

(5)

where $L$ is the loss function, $K$ is a 0-diagonal matrix with $K_{ij} = \frac{1}{\lambda_i - \lambda_j}$ for $i \neq j$, the operator $\odot$ denotes element-wise matrix multiplication, and $(\cdot)_{diag}$ sets the off-diagonal elements of a matrix to zero.

Detailed derivation can be found in the appendix. Here we only show the simplified formulation:

(6)

where the auxiliary matrices appearing in Eqn. 6 are defined in the appendix, and the subscript 'sym' denotes symmetrizing the corresponding matrix.

3.3 Training and Inference

Decorrelated Batch Normalization (DBN) is a data-dependent whitening transformation with back-propagation formulations. Like Batch Normalization [27], it can be inserted extensively into a network. Algorithms 1 and 2 describe the forward pass and the backward pass of our proposed DBN, respectively. During training, the mean and the whitening matrix are calculated within each mini-batch to ensure that the activations are whitened for each mini-batch. We also maintain the expected mean and the expected whitening matrix for use during inference. Specifically, during training, we initialize the expected mean as $\mathbf{0}$ and the expected whitening matrix as the identity matrix $I$, and update them by running averages as described in Lines 10 and 11 of Algorithm 1.

Normalizing the activations constrains the model's capacity for representation. To remedy this, Ioffe and Szegedy [27] introduce extra learnable parameters $\gamma$ and $\beta$ in Eqn. 1. In our observation, these learnable parameters often marginally improve the performance. For DBN, we also recommend using learnable parameters. Specifically, the learnable parameters can be merged into the following ReLU activation [38], resulting in the Translated ReLU (TReLU) [54].

For a convolutional neural network, the input to the DBN transformation is $X \in \mathbb{R}^{h \times w \times d \times m}$, where $h$ and $w$ indicate the height and width of the feature maps, and $d$ and $m$ are the numbers of feature maps and examples, respectively. Following [27], we view each spatial position of the feature map as a sample. We thus unroll $X$ as a matrix of size $d \times (m h w)$, with $m h w$ examples and $d$ feature maps. The whitening operation is performed over the unrolled $X$.
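As an illustration of this unrolling (our own sketch; shapes follow PyTorch's NCHW convention rather than the paper's notation):

```python
import torch

def unroll_conv_input(x):
    """Reshape a conv feature map (m, d, h, w) into a (d, m*h*w) matrix so that
    every spatial position of every example is treated as one sample."""
    m, d, h, w = x.shape
    return x.permute(1, 0, 2, 3).reshape(d, m * h * w)

def roll_back(x_flat, shape):
    """Inverse of unroll_conv_input: restore the (m, d, h, w) layout."""
    m, d, h, w = shape
    return x_flat.reshape(d, m, h, w).permute(1, 0, 2, 3)

x = torch.randn(32, 64, 8, 8)          # batch of 32, 64 feature maps, 8x8 spatial
x_flat = unroll_conv_input(x)          # shape (64, 32*8*8) = (64, 2048)
x_back = roll_back(x_flat, x.shape)
print(torch.equal(x, x_back))          # True: the unrolling is lossless
```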

1:  Input: mini-batch inputs $X \in \mathbb{R}^{d \times m}$, expected mean $\mu_E$ and expected projection matrix $\Sigma_E^{-1/2}$.
2:  Hyperparameters: $\epsilon$, running average momentum $\lambda$.
3:  Output: the ZCA-whitened activations $\hat{X}$.
4:  calculate: $\mu = \frac{1}{m} X \mathbf{1}$.
5:  calculate: $\Sigma = \frac{1}{m}(X - \mu \mathbf{1}^T)(X - \mu \mathbf{1}^T)^T + \epsilon I$.
6:  execute eigenvalue decomposition: $\Sigma = D \Lambda D^T$.
7:  calculate PCA-whitening matrix: $U = \Lambda^{-1/2} D^T$.
8:  calculate PCA-whitened activation: $\tilde{X} = U (X - \mu \mathbf{1}^T)$.
9:  calculate ZCA-whitened output: $\hat{X} = D \tilde{X}$.
10:  update: $\mu_E \leftarrow (1 - \lambda)\,\mu_E + \lambda\,\mu$.
11:  update: $\Sigma_E^{-1/2} \leftarrow (1 - \lambda)\,\Sigma_E^{-1/2} + \lambda\, D U$.
Algorithm 1 Forward pass of DBN for each iteration.
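For concreteness, here is a NumPy sketch of Algorithm 1 (our own re-implementation for illustration; the class name, momentum convention, and default hyperparameters are assumptions, not taken from the released Torch code):

```python
import numpy as np

class DBNForward:
    """Forward pass of Decorrelated Batch Normalization for inputs X of shape (d, m)."""

    def __init__(self, d, eps=1e-5, momentum=0.1):
        self.eps = eps
        self.momentum = momentum
        self.running_mean = np.zeros((d, 1))   # expected mean, used at inference
        self.running_whiten = np.eye(d)        # expected whitening matrix, used at inference

    def forward(self, X, training=True):
        if not training:
            return self.running_whiten @ (X - self.running_mean)
        d, m = X.shape
        mu = X.mean(axis=1, keepdims=True)                    # line 4
        Xc = X - mu
        sigma = Xc @ Xc.T / m + self.eps * np.eye(d)          # line 5
        lam, D = np.linalg.eigh(sigma)                        # line 6: sigma = D diag(lam) D^T
        U = np.diag(lam ** -0.5) @ D.T                        # line 7: PCA-whitening matrix
        X_pca = U @ Xc                                        # line 8
        X_zca = D @ X_pca                                     # line 9: rotate back (ZCA)
        # lines 10-11: running averages for inference (momentum convention assumed)
        lam_m = self.momentum
        self.running_mean = (1 - lam_m) * self.running_mean + lam_m * mu
        self.running_whiten = (1 - lam_m) * self.running_whiten + lam_m * (D @ U)
        return X_zca
```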
1:  Input: mini-batch gradients with respect to the whitened outputs, $\frac{\partial L}{\partial \hat{X}}$. Other auxiliary data from the respective forward pass: (1) the eigenvalues; (2) the PCA-whitened activations $\tilde{X}$; (3) the rotation matrix $D$.
2:  Output: the gradients with respect to the inputs, $\frac{\partial L}{\partial X}$.
3:  calculate the gradients with respect to $\tilde{X}$ and $D$.
4:  calculate the gradient with respect to the eigenvalues.
5:  calculate the 0-diagonal matrix $K$ with $K_{ij} = \frac{1}{\lambda_i - \lambda_j}$ for $i \neq j$.
6:  generate the diagonal matrix $\Lambda$ from the eigenvalues.
7:  calculate the gradients with respect to $\Sigma$ and $\mu$.
8:  calculate the gradient contribution of the centering and whitening steps.
9:  calculate $\frac{\partial L}{\partial X}$ by Eqn. 6.
Algorithm 2 Backward pass of DBN for each iteration.
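As a sanity check (our own suggestion, not part of the paper), the same gradients can also be obtained numerically with an automatic-differentiation framework, since the eigendecomposition is differentiable; e.g. in PyTorch:

```python
import torch

def zca_whiten(X, eps=1e-5):
    """Differentiable ZCA whitening of a mini-batch X with shape (d, m)."""
    d, m = X.shape
    mu = X.mean(dim=1, keepdim=True)
    Xc = X - mu
    sigma = Xc @ Xc.T / m + eps * torch.eye(d, dtype=X.dtype)
    lam, D = torch.linalg.eigh(sigma)        # autograd knows how to differentiate eigh
    W = D @ torch.diag(lam.rsqrt()) @ D.T    # ZCA whitening matrix (Eqn. 4)
    return W @ Xc

X = torch.randn(8, 128, dtype=torch.double, requires_grad=True)
loss = zca_whiten(X).pow(2).sum()            # any scalar loss of the whitened activations
loss.backward()                              # gradients w.r.t. the mini-batch inputs
print(X.grad.shape)                          # torch.Size([8, 128])
```

This kind of check is only a convenience; the paper's implementation uses the closed-form gradients of Algorithm 2. Note that gradients through an eigendecomposition can become numerically unstable when eigenvalues are nearly equal.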

3.4 Group Whitening

As discussed in Section 1, it is necessary to control the extent of whitening such that there are sufficient examples in a batch for estimating the whitening matrix. To do this we use “group whitening”: specifically, we divide the activations along the feature dimension of size $d$ into smaller groups of size $k_G$ ($k_G \le d$) and perform whitening within each group. The extent of whitening is controlled by the hyperparameter $k_G$. In the case $k_G = 1$, Decorrelated Batch Normalization reduces to the original Batch Normalization.

In addition to controlling the extent of whitening, group whitening reduces the computational complexity [24]. Full whitening costs $O(d^2 m + d^3)$ for a batch of size $m$: $O(d^2 m)$ for the covariance and the whitening itself, plus $O(d^3)$ for the eigendecomposition. When using group whitening, the cost is reduced to $O(k_G d m + k_G^2 d)$. Typically we choose $k_G \ll d$, so the cost of group whitening is substantially lower than that of full whitening.
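A self-contained NumPy sketch of group whitening (ours, not the paper's implementation; it assumes $d$ is divisible by the group size):

```python
import numpy as np

def group_whiten(X, group_size, eps=1e-5):
    """ZCA-whiten X of shape (d, m) in groups of `group_size` features."""
    d, m = X.shape
    assert d % group_size == 0, "feature dimension must be divisible by the group size"
    out = np.empty_like(X)
    for start in range(0, d, group_size):
        grp = X[start:start + group_size]               # (group_size, m) slice of features
        gc = grp - grp.mean(axis=1, keepdims=True)
        sigma = gc @ gc.T / m + eps * np.eye(group_size)
        lam, D = np.linalg.eigh(sigma)
        out[start:start + group_size] = D @ np.diag(lam ** -0.5) @ D.T @ gc  # ZCA per group
    return out

X = np.random.randn(64, 256)
Y = group_whiten(X, group_size=16)
# Each 16-feature group is decorrelated internally; cross-group correlations may remain.
```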

3.5 Analysis and Discussion

DBN extends BN such that the activations are decorrelated over mini-batch data. DBN thus inherits the beneficial properties of BN, such as the ability to perform efficient training with large learning rates and very deep networks. Here, we further highlight the benefits of DBN over BN, in particular achieving better dynamical isometry [44] and improved conditioning.

Approximate Dynamical Isometry

Saxe et al. [44] introduce dynamical isometry—the desirable property that occurs when the singular values of the product of Jacobians lie within a small range around 1. Enforcing this property, even approximately, is beneficial to training because it preserves the gradient magnitudes during back-propagation and alleviates the vanishing and exploding gradient problems [44]. Ioffe and Szegedy [27] find that Batch Normalization achieves approximate dynamical isometry under the assumptions that (1) the transformation between two consecutive layers is approximately linear, and (2) the activations in each layer are Gaussian and uncorrelated. Our DBN inherently satisfies the second assumption, and is therefore more likely to achieve dynamical isometry than BN.

Improved Conditioning

Desjardins et al. [14] demonstrated that whitening activations results in a block-diagonal Fisher Information Matrix (FIM) for each layer under certain assumptions [18]. Their experiments show that such a block-diagonal structure in the FIM can improve conditioning. The proposed method in [14], however, cannot whiten the activations effectively, as shown in [36] and also discussed in Section 2. DBN, on the other hand, whitens the activations directly. Therefore, we conjecture that DBN can further improve the conditioning of the FIM, and we justify this experimentally in Section 4.1.

4 Experiments

We start with experiments to highlight the effectiveness of Decorrelated Batch Normalization (DBN) in improving conditioning and speeding up convergence on multilayer perceptrons (MLPs). We then conduct comprehensive experiments to compare DBN and BN on convolutional neural networks (CNNs). Finally, we apply DBN to residual networks on CIFAR-10, CIFAR-100 [31] and ILSVRC-2012 to show its ability to improve modern network architectures. The code to reproduce the experiments is available at https://github.com/umich-vl/DecorrelatedBN.

We focus on classification tasks, and the loss function is the negative log-likelihood. Unless otherwise stated, we use random weight initialization as described in [34] and ReLU activations [38].

Figure 3: Conditioning analysis with MLPs trained on the Yale-B dataset. (a) Condition number (log-scale) of relative FIM as a function of updates in the last layer; (b) training loss with respect to wall clock time.
Figure 4: Experiments with an MLP architecture on the PIE dataset. (a) The effect of the group size of DBN, where 'Gn' indicates a group size of $k_G = n$; (b) comparison of training loss with respect to wall clock time.

4.1 Ablation Studies on MLPs

In this section, we verify the effectiveness of our proposed method in improving conditioning and speeding up convergence on MLPs. We also discuss the effect of the group size on the tradeoff between the performance and computation cost. We compare against several baselines, including the original network without any normalization (referred to as ‘plain’), Natural Neural Networks (NNN) [14], Layer Normalization (LN) [2], and Batch Normalization (BN) [27]. All results are averaged over 5 runs.

Figure 5: Comprehensive performance comparison between DBN and BN with the VGG-A architecture on CIFAR-10: (a) basic configuration; (b) Adam optimization; (c) ELU non-linearity; (d) DBN/BN after the non-linearity. We show the training accuracy (solid line) and test accuracy (line marked with plus) for each epoch.
Conditioning Analysis

We perform conditioning analysis on the Yale-B dataset [16], specifically the subset [6] with 2,414 images and 38 classes. We resize the images to 32×32 and reshape them as 1024-dimensional vectors. We then convert the images to grayscale and subtract the per-pixel mean.

For each method, we train a 5-layer MLP with full-batch gradient descent. Hyper-parameters are selected by grid search based on the training loss: for all methods the learning rate is chosen from a fixed candidate set, and for NNN we additionally search over the revised term and the natural re-parameterization interval.

We evaluate the condition number of the relative Fisher Information Matrix (FIM) [49] with respect to the last layer. Figure 3 (a) shows the evolution of the condition number over training iterations. Figure 3 (b) shows the training loss over the wall clock time. Note that the experiments are performed on CPUs and the model with DBN is slower than the model with BN per iteration. From both figures, we see that NNN, LN, BN and DBN converge faster, and achieve better conditioning compared to ‘plain’. This shows that normalization is able to make the optimization problem easier. Also, DBN achieves the best conditioning compared to other normalization methods, and speeds up convergence significantly.

Effects of Group Size

As discussed in Section 3.4, the group size controls the extent of whitening. Here we show the effect of the hyperparameter $k_G$ on the performance of DBN. We use a subset [6] of the PIE face recognition dataset [47] with 68 classes and 11,554 images. We adopt the same pre-processing strategy as with Yale-B.

We train a 6-layer MLP and use Stochastic Gradient Descent (SGD) with a batch size of 256. Other configurations are chosen in the same way as in the previous experiment. Additionally, we explore group sizes for DBN ranging from $k_G = 1$ to $k_G = 128$. Note that when $k_G = 1$, DBN reduces to the original BN without the extra learnable parameters.

Figure 4 (a) shows the training loss of DBN with different group sizes. We find that the largest (G128) and smallest group sizes (G1) both have noticeably slower convergence compared to the ones with intermediate group sizes such as G16. These results show that (1) decorrelating activations over a mini-batch can improve optimization, and (2) controlling the extent of whitening is necessary, as the estimate of the full whitening matrix might be poor over mini-batch samples. Also, the eigendecomposition with small group sizes (e.g. 16) is less computationally expensive. We thus recommend using group whitening in training deep models.

We also compared DBN with group whitening to the other baselines; the results are shown in Figure 4 (b). We find that DBN converges significantly faster than the other normalization methods.

4.2 Experiments on CNNs

We design comprehensive experiments to evaluate the performance of DBN with CNNs against BN, the state-of-the-art normalization technique. For these experiments we use the CIFAR-10 dataset [31], which contains 10 classes, 50k training images, and 10k test images.

4.2.1 Comparison of DBN and BN

We compare DBN to BN over different experimental configurations, including the choice of optimization method, non-linearities, and the position of DBN/BN in the network. We adopt the VGG-A architecture [48] for all experiments, and pre-process the data by subtracting the per-pixel mean and dividing by the variance.

We use SGD with a batch size of 256, a momentum of 0.9 and a weight decay of 0.0005. We decay the learning rate by half at a fixed interval of iterations. The hyper-parameters are chosen by grid search over a random validation set of 5k examples taken from the training set; the grid search covers the initial learning rate and the decay interval. The group size of DBN is fixed for these experiments. Figure 5 (a) compares the performance of BN and DBN under this configuration.

We also experiment with other configurations, including (1) using Adam [30] as the optimization method, (2) replacing ReLU with another widely used non-linearity, the Exponential Linear Unit (ELU) [9], and (3) inserting BN/DBN after the non-linearity. The experimental setup is otherwise the same, except that Adam [30] is used with its own grid of initial learning rates. The respective results are shown in Figure 5 (b), (c) and (d).

In all configurations, DBN converges faster with respect to the number of epochs and generalizes better than BN. In particular, DBN reduces the absolute test error in each of the four experiments above. The results demonstrate that our Decorrelated Batch Normalization outperforms Batch Normalization in terms of optimization efficiency and regularization ability.

Figure 6: DBN can make optimization easier and benefits from a higher learning rate. Results are reported on the S-plain architecture over the CIFAR-10 dataset. (a) Deeper networks: comparison by varying the depth of the network, where '-nL' means the network has n layers; (b) learning rate: comparison when using a higher learning rate.
Method Res-20 Res-32 Res-44 Res-56
Baseline* 8.75 7.51 7.17 6.97
Baseline 7.94 7.31 7.17 7.21
DBN-L1 7.94 7.28 6.87 6.63
DBN-scale-L1 7.77 6.94 6.83 6.49
Table 1: Comparison of test errors (%) with residual networks on CIFAR-10. 'Res-L' indicates a residual network with L layers, and 'Baseline*' indicates the results reported in [19] with only one run. Our results are averaged over 5 runs.
CIFAR-10 CIFAR-100
Method Baseline* [56] Baseline DBN-scale-L1 Baseline* [56] Baseline DBN-scale-L1
WRN-28-10 3.89 3.99 ± 0.13 3.79 ± 0.09 18.85 18.75 ± 0.28 18.36 ± 0.17
WRN-40-10 3.80 3.80 ± 0.11 3.74 ± 0.11 18.3 18.7 ± 0.22 18.27 ± 0.19
Table 2: Test errors (%) on wide residual networks over CIFAR-10 and CIFAR-100. 'Baseline' and 'DBN-scale-L1' refer to the results of our own runs, based on the released code of [56], and are shown in the format 'mean ± std' computed over 5 random seeds. 'Baseline*' refers to the results reported by the authors of [56] on their GitHub. They report the median of 5 runs for WRN-28-10 and only perform one run for WRN-40-10.
Res-18 Res-34 Res-50 Res-101
Method Top-1 Top-5 Top-1 Top-5 Top-1 Top-5 Top-1 Top-5
Baseline* - - - - 24.70 7.80 23.60 7.10
Baseline 30.21 10.87 26.53 8.59 24.87 7.58 22.54 6.38
DBN-scale-L1 29.87 10.36 26.46 8.42 24.29 7.08 22.17 6.09
Table 3: Comparison of test errors (%, single model and single-crop) on 18, 34, 50, and 101-layer residual networks on ILSVRC-2012. 'Baseline*' indicates results obtained from https://github.com/KaimingHe/deep-residual-networks; these are available only for Res-50 and Res-101.

4.2.2 Analyzing the Properties of DBN

We conduct experiments to support the conclusions from Section 3.5, specifically that DBN has better stability and converges faster than BN with high learning rates in very deep networks. The experiments were conducted on the S-plain network, which follows the design of the residual network [20] but removes the identity maps and uses the same feature maps for simplicity.

Going Deeper

He et al. [20] describe the degradation problem for networks without identity mappings: when the network depth increases, the training accuracy degrades rapidly, even when Batch Normalization is used. In our experiments, we demonstrate that DBN relieves this problem to some extent; in other words, a model with DBN is easier to optimize. We validate this on the S-plain architecture with 20, 32 and 44 layers. The models are trained with a mini-batch size of 128, a momentum of 0.9 and a weight decay of 0.0005. We set the initial learning rate to 0.1, divide it by 5 at 80 and 120 epochs, and end training at 160 epochs. The results in Figure 6 (a) show that, with increased depth, the model with BN is more difficult to optimize than the model with DBN. We conjecture that the approximate dynamical isometry of DBN alleviates this problem.

Higher Learning Rate

A network with Batch Normalization can benefit from high learning rates, and thus faster training, because it reduces internal covariate shift [27]. Here, we show that DBN can help even more. We train the networks with BN and DBN using a higher learning rate of 0.4. We use the S-plain architecture and divide the learning rate by 5 at 60, 100, 140, and 180 epochs. The results in Figure 6 (b) show that DBN has significantly better training accuracy than BN. We argue that DBN benefits from higher learning rates because of its improved conditioning.

4.3 Applying DBN to Residual Network in Practice

Due to our current unoptimized implementation of DBN, it would incur a high computational cost to replace all BN modules of a network with DBN (see the appendix for more details on computational cost). We instead only decorrelate the activations among a subset of layers. We find that in practice this is already effective for residual networks [19], because the information in previous layers can pass directly to later layers through the identity connections. We also show that we can improve upon residual networks [19, 21, 56] by using only one DBN module before the first residual block, which introduces negligible computational cost. In principle, an optimized implementation of DBN would be much faster, and could be injected in multiple places in the network with little overhead. However, optimizing the implementation of DBN is beyond the scope of this work.

Residual Network on CIFAR-10

We apply our method to residual networks [19] by using only one DBN module before the first residual block (denoted as DBN-L1). We also consider DBN with an adjustable scale (denoted as DBN-scale-L1), as discussed in Section 3.3. We adopt the Torch implementation of residual networks (https://github.com/facebook/fb.resnet.torch) and follow the same experimental protocol as described in [19]. We train residual networks with depths 20, 32, 44 and 56 on CIFAR-10. Table 1 shows the test errors of these networks. Our methods obtain lower test errors compared to BN on all four networks, and the improvement is more pronounced for deeper networks. We also see that DBN-scale-L1 marginally outperforms DBN-L1 in all cases; therefore, we focus on comparing DBN-scale-L1 to BN in later experiments.

Wide Residual Network on CIFAR

We apply DBN to the Wide Residual Network (WRN) [56] to improve the performance on CIFAR-10 and CIFAR-100. Following the convention set in [56], we use the abbreviation WRN-d-k to indicate a WRN with depth d and width k. We again adopt the publicly available Torch implementation (https://github.com/szagoruyko/wide-residual-networks) and follow the same setup as in [56]. The results in Table 2 show that DBN improves the original WRN on both datasets and both networks, reducing the test error on CIFAR-10 and CIFAR-100 alike.

Residual Network on ILSVRC-2012

We further validate the scalability of our method on ILSVRC-2012 with 1000 classes [13]. We use the official 1.28M training images as the training set, and evaluate the top-1 and top-5 classification errors on the validation set of 50k images. We use 18, 34, 50 and 101-layer residual networks (Res-18, Res-34, Res-50 and Res-101) and perform single-model and single-crop testing. We follow the same experimental setup as described in [19], except that we use 4 GPUs instead of 8 for training Res-50 and 2 GPUs for training Res-18 and Res-34 (whose single-crop results have not been previously reported): we apply SGD with a mini-batch size of 256 over 4 GPUs for Res-50 and 8 GPUs for Res-101, a momentum of 0.9 and a weight decay of 0.0001; we set the initial learning rate to 0.1, divide it by 10 at 30 and 60 epochs, and end the training at 90 epochs. The results are shown in Table 3. We can see that DBN-scale-L1 achieves lower test errors compared to the original residual networks.

5 Conclusions

In this paper, we propose Decorrelated Batch Normalization (DBN), which extends Batch Normalization to include whitening over mini-batch data. We find that PCA whitening can sometimes be detrimental to training because it causes stochastic axis swapping, and demonstrate that it is critical to use ZCA whitening, which avoids this issue. DBN retains the advantages of Batch Normalization while using decorrelated representations to further improve models’ optimization efficiency and generalization abilities. This is because DBN can maintain approximate dynamical isometry and improve the conditioning of the Fisher Information Matrix. These properties are experimentally validated, suggesting DBN has great potential to be used in designing DNN architectures.

Acknowledgement This work is partially supported by China Scholarship Council, NSFC-61370125 and SKLSDE-2017ZX-03. We also thank Jonathan Stroud and Lanlan Liu for their help with proofreading and editing.

Appendix

A Derivation for Back-propagation

For illustration, we first provide the forward pass and then derive the backward pass. Regarding notation, all vectors are column vectors, except that gradient vectors are row vectors.

A.1 Forward Pass

Given mini-batch layer inputs $X \in \mathbb{R}^{d \times m}$, where $m$ is the number of examples, the ZCA-whitened output $\hat{X}$ can be calculated as follows:

$$\mu = \frac{1}{m} X \mathbf{1} \qquad (A.1)$$
$$\Sigma = \frac{1}{m}(X - \mu \mathbf{1}^T)(X - \mu \mathbf{1}^T)^T + \epsilon I \qquad (A.2)$$
$$\Sigma = D \Lambda D^T \qquad (A.3)$$
$$U = \Lambda^{-\frac{1}{2}} D^T \qquad (A.4)$$
$$\tilde{X} = U (X - \mu \mathbf{1}^T) \qquad (A.5)$$
$$\hat{X} = D \tilde{X} \qquad (A.6)$$

where $\mu$ and $\Sigma$ are the mean vector and the covariance matrix within the mini-batch data. Eqn. A.3 is the eigen decomposition, where $D$ is the matrix of eigenvectors and $\Lambda$ is a diagonal matrix whose diagonal elements are the eigenvalues. Note that $U$ and $\tilde{X}$ are auxiliary variables introduced for clarity; $\tilde{X}$ is in fact the output of PCA whitening. However, PCA whitening hardly works for deep networks, as discussed in the paper.

A.2 Back-propagation

Based on the chain rule and the results from [28], we can obtain the backward-pass derivatives as follows:

(A.7)
(A.8)
(A.9)
(A.10)
(A.11)
(A.12)
(A.13)

where $L$ is the loss function, $K$ is a 0-diagonal matrix with $K_{ij} = \frac{1}{\lambda_i - \lambda_j}$ for $i \neq j$, the operator $\odot$ denotes element-wise matrix multiplication, $(\cdot)_{diag}$ sets the off-diagonal elements of a matrix to zero, and the subscript 'sym' means symmetrizing the corresponding matrix. Note that Eqn. A.11 follows from the results in [28]. A similar formulation for back-propagating the gradient through the whitening transformation has also been derived in the context of learning an orthogonal weight matrix in [24].

A.3 Derivation for Simplified Formulation

For more efficient computation, we provide the simplified formulation as follows:

(A.14)

where the auxiliary quantities are defined in the derivation below.

The details of how to derive Eqn. A.14 are as follows.

Based on Eqn. A.4, A.5, A.8 and A.9, we can get

(A.15)

Denoting , we have

(A.16)

Based on Eqn. A.4 and A.10, we have:

(A.17)

Based on Eqn. A.11, we have :

(A.18)

where and . Based on Eqn. A.12, we have

(A.19)

where . We thus have:

(A.20)

B Computational Cost of DBN

In this part, we analyze the computational cost of the DBN module for Convolutional Neural Networks (CNNs). Theoretically, a convolutional layer with an $h \times w$ input, a batch size of $m$, $d$ input and output feature maps, and filters of size $F \times F$ costs $O(m h w d^2 F^2)$. Adding DBN with a group size of $k_G$ incurs an overhead of $O(m h w d k_G + d k_G^2)$, so the relative overhead is roughly $k_G / (d F^2)$, which is negligible when $k_G$ is small (e.g. 16).

Empirically, we measure the wall-clock time (forward pass + backward pass, averaged over multiple runs) of our unoptimized implementation of DBN for full whitening, and compare it with the time of the highly optimized cudnn convolution [7] in Torch [11] on the same input.

Methods Training Inference
BN 163.68 11.47
DBN-G8 707.47 15.50
DBN-G16 466.92 14.41
DBN-G64 297.25 13.70
DBN-G256 440.88 13.64
DBN-G512 1004 13.68

Table B.1: Time costs (s/per epoch) on VGG-A and CIFAR datasets with different groups.
Training Inference
Method BN DBN-scale BN DBN-scale
Res-56 69.53 86.80 4.57 5.12
Res-44 55.03 72.06 3.65 4.36
Res-32 40.36 57.47 2.80 3.33
Res-20 25.97 42.87 1.94 2.44
WideRes-40 643.94 659.55 25.69 26.08
WideRes-28 440 457 36.56 38.10
Table B.2: Time costs (s/per epoch) for residual networks and wide residual network on CIFAR.
Training Inference
Method BN DBN-scale BN DBN-scale
Res-101 1.52 4.21 0.42 0.57
Res-50 0.51 1.12 0.19 0.25
Res-34 1.21 2.04 0.55 0.71
Res-18 0.81 1.45 0.40 0.51
Table B.3: Time costs (s/per iteration) for residual networks on ImageNet, averaged 10 iterations, with multiple GPUs.

Tables B.1, B.2 and B.3 show the wall clock time for our CIFAR-10 and ImageNet experiments described in the paper. Note that DBN with small groups (e.g. G8) can cost more time than with larger groups due to our unoptimized implementation: for example, we whiten each group sequentially instead of in parallel, because Torch does not yet provide an easy way to use the CUDA linear algebra library in parallel. Our current implementation of DBN also has a much lower average GPU utilization than BN. Thus there is a lot of room for a more efficient implementation.

Training Inference
Method BN DBN-scale BN DBN-scale
Res-101 1.24 1.35 0.35 0.37
Res-50 0.69 0.80 0.19 0.22
Res-34 0.34 0.45 0.12 0.14
Res-18 0.20 0.31 0.07 0.09
Table B.4: Time costs (s/per iteration) for residual networks on ImageNet, on a single GPU with a batch size of 32, averaged 10 iterations.

We also observe that for the ResNet experiments on ImageNet, the overhead of multi-GPU parallelization is relatively high in our current DBN implementation. Thus, we perform another set of ResNet experiments with the same settings except that we use a batch size of 32 and a single GPU. As shown in Table B.4, the difference in time cost between BN and DBN is much smaller on a single GPU. This suggests that there is room for optimizing our DBN implementation for multi-GPU training and inference.

References