In this paper, we study normalization methods for neural networks from the perspective of elimination singularities. Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape, which slow down training and harm model performance. We show that channel-based normalizations (e.g., Layer Normalization and Group Normalization) are unable to guarantee a far distance from elimination singularities, in contrast with Batch Normalization, which by design keeps models from getting too close to them. To address this issue, we propose Batch-Channel Normalization (BCN), which uses batch knowledge to avoid the elimination singularities in the training of channel-normalized models. Unlike Batch Normalization, BCN is able to run in both large-batch and micro-batch training settings. The effectiveness of BCN is verified on many tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is here: https://github.com/joe-siyuan-qiao/Batch-Channel-Normalization.
Deep neural networks achieve state-of-the-art results in many vision tasks [4, 10, 12]. Despite being very effective, deep networks are hard to train. Normalization methods [2, 16] are crucial for stabilizing and accelerating network training. There are many theories explaining how normalization helps optimization. For example, Batch Normalization (BN) and Layer Normalization (LN) were proposed based on the conjecture that they reduce internal covariate shift, which negatively impacts training. Santurkar et al. argue that BN succeeds because it makes the loss landscape significantly smoother. Unlike previous work, we study normalization from the perspective of avoiding elimination singularities, which also have negative effects on training.
Elimination singularities refer to the points along the training trajectory where neurons in the networks get eliminated. As shown in prior work, the performance of neural networks is correlated with their distance to elimination singularities: the closer the model is to the elimination singularities, the worse it performs. Sec. 2 provides a closer look at the relationship between the performance and the distance through experiments. Because of this relationship, we ask:
Do all the normalization methods keep their models away from elimination singularities?
Here, we list our findings:
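Batch Normalization (BN) is able to keep models at far distances from the singularities.
Channel-based normalization methods, e.g., Layer Normalization (LN) and Group Normalization (GN), are unable to guarantee far distances, and the situation of LN is worse than that of GN.
Weight Standardization (WS) is able to alleviate this issue.
These findings provide a new way of understanding why GN performs better than LN, and how WS improves the performance of both.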
Since channel-based normalization methods (e.g., LN and GN) have issues with elimination singularities, we can improve their performance if we are able to push models away from these singularities. For this purpose, we propose Batch-Channel Normalization (BCN), which uses batch knowledge to prevent channel-normalized models from getting too close to the elimination singularities. Sec. 3 shows the detailed modeling of the proposed normalization method. Unlike BN, BCN is able to run in both large-batch and micro-batch training settings and can improve the performance of channel-normalized models.
To evaluate our proposed BCN, we test it on various popular vision tasks, including large-batch training of ResNet on ImageNet, large-batch training of DeepLabV3 on PASCAL VOC, and micro-batch training of Faster R-CNN and Mask R-CNN on the MS COCO dataset. Sec. 4 shows the experimental results, which demonstrate that our proposed BCN outperforms the baselines on all of these tasks. Finally, Sec. 5 discusses the related work, and Sec. 6 concludes the paper.
In this section, we will provide the background of normalization methods, discuss the relationship between the performance and the distance to elimination singularities, and show how well normalization methods are able to prevent models from getting too close to those singularities.
Based on how activations are normalized, we group the normalization methods into two types: batch-based normalization and channel-based normalization, where the batch-based normalization method corresponds to BN and the channel-based normalization methods include LN and GN.
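Suppose we are going to normalize a 2D feature map $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ denote the height and the width. For each channel $c$, BN normalizes $X$ by

$$Y_{\cdot c \cdot \cdot} = \frac{X_{\cdot c \cdot \cdot} - \mu_c}{\sigma_c}, \tag{1}$$

where $\mu_c$ and $\sigma_c$ denote the mean and the standard deviation of all the features of the channel $c$, $X_{\cdot c \cdot \cdot}$. Throughout the paper, we use $\cdot$ in the subscript to denote all the features along that dimension for convenience.
Unlike BN, which computes statistics on the batch dimension in addition to the height and width, channel-based normalization methods compute statistics on the channel dimension. Specifically, they divide the channels into several groups and normalize each group of channels together, i.e., $X$ is reshaped as $\dot{X} \in \mathbb{R}^{B \times G \times C/G \times H \times W}$, and then:

$$\dot{Y}_{b g \cdot \cdot \cdot} = \frac{\dot{X}_{b g \cdot \cdot \cdot} - \mu_{bg}}{\sigma_{bg}}, \tag{2}$$

for each sample $b$ of $B$ samples in a batch and each channel group $g$ out of all $G$ groups, where $\mu_{bg}$ and $\sigma_{bg}$ are the mean and the standard deviation of $\dot{X}_{b g \cdot \cdot \cdot}$. After Eq. 2, the output $\dot{Y}$ is reshaped back to the shape of $X$ and denoted by $Y$.
Both batch- and channel-based normalization methods optionally have an affine transformation, i.e.,

$$Z_{\cdot c \cdot \cdot} = \gamma_c \, Y_{\cdot c \cdot \cdot} + \beta_c,$$

where $\gamma_c$ and $\beta_c$ are the trainable scale and shift of channel $c$.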
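The difference between the two families can be made concrete with a small sketch. The code below is a minimal pure-Python illustration, not the authors' implementation: it assumes features are stored as `x[b][c]`, a flat list of the H*W values of channel `c` in sample `b`, it omits the affine transformation, and the constant `EPS` and all function names are illustrative choices.

```python
import math

EPS = 1e-5  # numerical-stability constant; an assumption, not from the text


def mean_std(values):
    """Mean and population standard deviation of a flat list of numbers."""
    m = sum(values) / len(values)
    var = sum((v - m) ** 2 for v in values) / len(values)
    return m, math.sqrt(var + EPS)


def batch_norm(x):
    """Normalize each channel over the batch and spatial dimensions."""
    B, C = len(x), len(x[0])
    y = [[None] * C for _ in range(B)]
    for c in range(C):
        m, s = mean_std([v for b in range(B) for v in x[b][c]])
        for b in range(B):
            y[b][c] = [(v - m) / s for v in x[b][c]]
    return y


def group_norm(x, num_groups):
    """Normalize each (sample, channel group) over its channels and positions."""
    B, C = len(x), len(x[0])
    size = C // num_groups  # channels per group; assumes C % num_groups == 0
    y = [[None] * C for _ in range(B)]
    for b in range(B):
        for g in range(num_groups):
            chans = range(g * size, (g + 1) * size)
            m, s = mean_std([v for c in chans for v in x[b][c]])
            for c in chans:
                y[b][c] = [(v - m) / s for v in x[b][c]]
    return y


# Toy input: 2 samples, 2 channels, 2 spatial positions per channel.
x = [[[1.0, 2.0], [10.0, 20.0]],
     [[3.0, 4.0], [30.0, 40.0]]]
bn_out = batch_norm(x)
ln_out = group_norm(x, num_groups=1)  # one group over all channels, as in LN
```

On this toy input, every value of channel 0 is negative after the single-group normalization because channel 1 dominates the shared statistics, whereas the batch-normalized channel 0 is zero-centered and keeps some positive values.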
Deep neural networks are hard to train partly due to the singularities caused by the non-identifiability of the model. These singularities include overlap singularities, linear dependence singularities, elimination singularities, etc. They cause degenerate manifolds in the loss landscape, and getting closer to these manifolds slows down learning and hurts model performance. In this paper, we focus on elimination singularities, which correspond to the points on the training trajectory where neurons in the model become constantly deactivated.
We focus on a basic building element that is widely used in neural networks: a convolutional layer followed by a normalization method (e.g., BN or LN) and ReLU, i.e.,

$$X^{\text{out}} = \mathrm{ReLU}\big(\mathrm{Norm}(\mathrm{Conv}(X^{\text{in}}))\big).$$

ReLU sets any values below $0$ to $0$; thus a neuron is constantly deactivated if its maximum value after normalization is below $0$. Its gradients will also be $0$ because of ReLU, making it hard to revive; hence, a singularity is created.
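The deactivation condition can be checked directly on the values of one channel. The following is a minimal sketch with illustrative helper names; it treats the derivative of ReLU at exactly zero as zero, which is a common convention and an assumption here.

```python
def relu(values):
    """Apply ReLU elementwise to a flat list of pre-activation values."""
    return [max(0.0, v) for v in values]


def relu_grad(values):
    """Derivative of ReLU w.r.t. its input (taken to be 0 at v == 0)."""
    return [1.0 if v > 0 else 0.0 for v in values]


def is_deactivated(channel_values):
    """True if no value of the channel is positive, so ReLU outputs only zeros."""
    return max(channel_values) <= 0.0


dead_channel = [-2.0, -0.5, -0.1]   # maximum is below 0
alive_channel = [-1.0, 0.3]         # at least one positive value
```

For `dead_channel`, both the ReLU output and the ReLU derivative are zero at every position, so no gradient flows back through this channel and it cannot recover through gradient updates of the preceding weights alone.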
Here, we study the effect of BN on elimination singularities. Since the normalization methods all have an optional affine transformation, we focus on the distinct part of BN, Eq. 1, which normalizes all channels to zero mean and unit variance, i.e., for the normalized features $Y_{\cdot c \cdot \cdot}$ of every channel $c$,

$$\mathbb{E}\big[Y_{\cdot c \cdot \cdot}\big] = 0, \qquad \mathbb{E}\big[Y_{\cdot c \cdot \cdot}^{2}\big] = 1, \qquad \forall c.$$

As a result, regardless of the weights and the distribution of the inputs, Eq. 1 guarantees that the activations of each channel are zero-centered with unit variance. Therefore, a channel cannot be constantly deactivated, because there are always some activations that are $> 0$, nor can it be almost constantly deactivated because its activation scale is very small compared with the others.
BN avoids singularities by normalizing each channel to zero mean and unit variance. What if they are normalized to other means and variances?
We ask this question because this is similar to what happens in channel-normalized models. Channel-based normalization methods, as they do not have batch information, are unable to make sure all neurons have zero mean and unit variance after normalization. Instead, the channels will have different statistics, which makes the model closer to singularities. Here, by closer, we mean the model is farther from BN, where each channel is zero-centered with unit variance, which avoids all singularities. To study the relationship between the performance and the distance to singularities (or how far from BN) caused by statistical differences, we conduct experiments on a 4-layer convolutional network. Each convolutional layer has 32 output channels and is followed by an average pooling layer which down-samples the features by a factor of 2. Finally, a global average pooling layer and a fully-connected layer output the logits for Softmax. The experiments are done on CIFAR-10.
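In the experiment, each channel $c$ will be normalized to a pre-defined mean $\hat{\mu}_c$ and a pre-defined standard deviation $\hat{\sigma}_c$ (i.e., variance $\hat{\sigma}_c^2$) that are drawn from two distributions, respectively:

$$\hat{\mu}_c \sim \mathcal{N}(0, \sigma_\mu), \qquad \hat{\sigma}_c \sim e^{\mathcal{N}(0, \sigma_\sigma)}.$$

The model will be closer to singularities when $\sigma_\mu$ or $\sigma_\sigma$ increases. BN corresponds to the case where $\sigma_\mu = \sigma_\sigma = 0$.
After getting $\hat{\mu}_c$ and $\hat{\sigma}_c$ for each channel, we compute

$$Y_{\cdot c \cdot \cdot} = \gamma_c \left( \hat{\sigma}_c \, \frac{X_{\cdot c \cdot \cdot} - \mu_c}{\sigma_c} + \hat{\mu}_c \right) + \beta_c,$$

where $\mu_c$ and $\sigma_c$ are the mean and the standard deviation of the features of channel $c$.
Note that $\hat{\mu}_c$ and $\hat{\sigma}_c$ are fixed during training while $\gamma_c$ and $\beta_c$ are trainable parameters in the affine transformation.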
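The experimental normalization described above can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: the per-channel target mean is drawn from a zero-mean Gaussian and the target scale from the exponential of a zero-mean Gaussian, the spread parameters are treated as standard deviations, and `EPS`, the seed, and all names are illustrative.

```python
import math
import random

EPS = 1e-5  # numerical-stability constant; an assumption


def draw_targets(num_channels, sigma_mu, sigma_sigma, seed=0):
    """Draw one fixed (target mean, target scale) pair per channel.

    Assumes sigma_mu and sigma_sigma are standard deviations.
    """
    rng = random.Random(seed)
    return [(rng.gauss(0.0, sigma_mu), math.exp(rng.gauss(0.0, sigma_sigma)))
            for _ in range(num_channels)]


def normalize_to_target(values, target_mean, target_scale, gamma=1.0, beta=0.0):
    """Standardize one channel, move it to the fixed target, then apply the affine."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values) + EPS)
    return [gamma * (target_scale * (v - m) / s + target_mean) + beta
            for v in values]


# With both spreads set to 0, every target is (0.0, 1.0), i.e. the BN case.
bn_like_targets = draw_targets(num_channels=3, sigma_mu=0.0, sigma_sigma=0.0)
```

Larger spreads produce targets farther from zero mean and unit scale; the targets stay fixed for the whole run, and only `gamma` and `beta` would be trained.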
Fig. 1 shows the experimental results. When the spreads $\sigma_\mu$ and $\sigma_\sigma$ of the pre-defined means and variances are closer to the origin, the normalization method is more similar to BN, and the model will be farther from the singularities. When their values increase, we observe that performance decreases. For extreme cases, we also observe training failures. These results indicate that although the affine transformation can theoretically find solutions that cancel the negative effects of normalizing channels to different statistics, its capability is limited by gradient-based training. These findings raise concerns about channel normalizations regarding their distance to singularities.
Following our concerns about channel-based normalization and their distance to singularities, we study the statistical differences between channels when they are normalized by a channel-based normalization such as GN or LN.
We train a ResNet-110 on CIFAR-10 normalized by GN or LN, with and without WS. During training, we keep a record of the running mean $\mu_c$ and variance $\sigma_c^2$ of each channel $c$ after convolutional layers. For each group $\Omega$ of channels that are normalized together, we compute their channel statistical difference, defined as the standard deviation of their means divided by the mean of their standard deviations, i.e.,

$$\mathrm{StatDiff}(\Omega) = \frac{\sqrt{\mathrm{Var}_{c \in \Omega}(\mu_c)}}{\mathbb{E}_{c \in \Omega}[\sigma_c]}. \tag{8}$$

We plot the average statistical differences of all the groups after every training epoch, as shown in Fig. 3.
By Eq. 8, $\mathrm{StatDiff}(\Omega) \geq 0$. In BN, all the channel means are the same, as are the variances, thus $\mathrm{StatDiff}(\Omega) = 0$. As the value of $\mathrm{StatDiff}(\Omega)$ goes up, the differences between channels within a group become larger. Since they will be normalized together as in Eq. 2, large differences will inevitably lead to underrepresented channels. Fig. 2 plots 3 examples of 2 channels before and after normalization in Eq. 2. Compared with those examples, it is clear that the models in Fig. 3 have many underrepresented channels.
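The statistical difference of one group follows directly from its verbal definition: the standard deviation of the channel means divided by the mean of the channel standard deviations. The sketch below assumes population (not sample) standard deviation across channels, and the function name is illustrative.

```python
import math
import statistics


def stat_diff(channel_means, channel_variances):
    """Std of the per-channel means divided by the mean of the per-channel stds."""
    spread_of_means = statistics.pstdev(channel_means)
    mean_of_stds = statistics.fmean([math.sqrt(v) for v in channel_variances])
    return spread_of_means / mean_of_stds


# Channels with identical statistics (the BN-like case) give 0.
same = stat_diff([0.0, 0.0], [1.0, 1.0])
# Channels whose means are spread as widely as their own scale give 1.
spread = stat_diff([-1.0, 1.0], [1.0, 1.0])
```

The ratio is scale-free: multiplying every activation in the group by the same constant leaves it unchanged, so it measures how different the channels are relative to their typical scale.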
Fig. 3 also provides explanations why GN performs better than LN. Comparing GN and LN, the major difference is their numbers of groups for channels: LN has only one group for all the channels in a layer while GN collects them into several groups. A strong benefit of having more than one group is that it guarantees that each group will at least have one neuron that is not suppressed by the others from the same group. Therefore, GN provides a mechanism to prevent the models from getting too close to the elimination singularities.
Fig. 3 also shows the statistical differences when WS is used. From the results, we can clearly see that WS makes StatDiff much closer to $0$. Consequently, the majority of the channels are not underrepresented with WS: most of them are frequently activated and have similar activation scales. This makes training with WS easier and the results better.
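Here, we also provide our understanding of why WS is able to achieve smaller statistical differences. Recall that WS adds constraints to the weight $W \in \mathbb{R}^{O \times I}$ of a convolutional layer with $O$ output channels and $I$ inputs such that $\forall c$,

$$\sum_{i=1}^{I} W_{c,i} = 0, \qquad \sum_{i=1}^{I} W_{c,i}^{2} = 1.$$

With the constraints of WS, the mean and the variance of output channel $c$, computed from inputs $X_i$ with means $\mu_i$ and variances $\sigma_i^2$, become

$$\mathbb{E}[Y_c] = \sum_{i=1}^{I} W_{c,i}\, \mu_i, \qquad \mathrm{Var}[Y_c] = \sum_{i=1}^{I} W_{c,i}^{2}\, \sigma_i^{2},$$

when we follow the assumptions in Xavier initialization. When the input channels are similar in their statistics, i.e., $\mu_i \approx \mu_j$, $\sigma_i \approx \sigma_j$, $\forall i, j$,

$$\mathbb{E}[Y_c] \approx \mu_1 \sum_{i=1}^{I} W_{c,i} = 0, \qquad \mathrm{Var}[Y_c] \approx \sigma_1^{2} \sum_{i=1}^{I} W_{c,i}^{2} = \sigma_1^{2}.$$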
In other words, WS can pass the statistical similarities from the input channels to the output channels, all the way from the image space where RGB channels are properly normalized. This is similar to the objective of Xavier initialization or Kaiming initialization, except that WS enforces it by reparameterization throughout the entire training process, and thus is able to reduce the statistical differences.
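The argument above can be checked numerically on a toy linear layer. The sketch below is an illustration under assumptions, not the WS implementation: each weight row is centered and rescaled to unit sum of squares (zero sum and unit sum of squares per output channel), the inputs are assumed independent so that variances add, and all names are hypothetical. The published WS divides by the standard deviation of the centered weights, which differs from this rescaling only by a constant factor per row.

```python
import math


def standardize_rows(weight):
    """Center each output channel's weights and rescale to unit sum of squares."""
    out = []
    for row in weight:
        m = sum(row) / len(row)
        centered = [w - m for w in row]
        norm = math.sqrt(sum(w * w for w in centered))
        out.append([w / norm for w in centered])
    return out


def output_stats(weight, in_means, in_vars):
    """Mean and variance of each output channel, assuming independent inputs."""
    means = [sum(w * m for w, m in zip(row, in_means)) for row in weight]
    variances = [sum(w * w * v for w, v in zip(row, in_vars)) for row in weight]
    return means, variances


raw = [[0.2, -0.5, 1.3], [2.0, 1.0, 0.5]]   # 2 output channels, 3 inputs
in_means, in_vars = [0.7] * 3, [2.0] * 3    # inputs with similar statistics
raw_means, raw_vars = output_stats(raw, in_means, in_vars)
ws_means, ws_vars = output_stats(standardize_rows(raw), in_means, in_vars)
```

With the raw weights the two output channels end up with clearly different means and variances, while with the standardized weights both have mean approximately 0 and variance approximately equal to the shared input variance.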
Here, we summarize this subsection. We have shown that channel-based normalization methods, as they do not have batch information, are not able to ensure a far distance from elimination singularities. Without the help of batch information, GN alleviates this issue by assigning channels to more than one group to encourage more activated neurons, and WS adds constraints to pull the channels to be not so statistically different. We notice that the batch information is not hard to collect in reality. This inspires us to equip channel-based normalization with batch information, and the result is Batch-Channel Normalization.
This section presents the definition of Batch-Channel Normalization, discusses why adding batch statistics to channel normalization is not redundant, and shows how BCN runs in large-batch and micro-batch training settings.
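Batch-Channel Normalization (BCN) adds batch constraints to channel-based normalization methods. Let $X \in \mathbb{R}^{B \times C \times H \times W}$ be the features to be normalized. Then, the normalization is done as follows. $\forall c$,

$$\dot{X}_{\cdot c \cdot \cdot} = \gamma^{b}_{c} \, \frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c} + \beta^{b}_{c}, \tag{13}$$

where the purpose of $\hat{\mu}_c$ and $\hat{\sigma}_c$ is to make

$$\mathbb{E}\left[\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c}\right] = 0, \qquad \mathbb{E}\left[\left(\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c}\right)^{2}\right] = 1.$$

Then, $\dot{X}$ is reshaped as $\dot{X} \in \mathbb{R}^{B \times G \times C/G \times H \times W}$ to have $G$ groups of channels. Next, $\forall g, b$,

$$\dot{Y}_{b g \cdot \cdot \cdot} = \gamma^{c}_{g} \, \frac{\dot{X}_{b g \cdot \cdot \cdot} - \mu_{bg}}{\sigma_{bg}} + \beta^{c}_{g}, \tag{15}$$

where $\mu_{bg}$ and $\sigma_{bg}$ are the mean and the standard deviation of $\dot{X}_{b g \cdot \cdot \cdot}$, and $\gamma^{b}, \beta^{b}$ and $\gamma^{c}, \beta^{c}$ are the affine parameters of the batch part and the channel part, respectively.
Finally, $\dot{Y}$ is reshaped back to $Y \in \mathbb{R}^{B \times C \times H \times W}$, which is the output of the Batch-Channel Normalization.
Note that in Eq. 13 and 15, only two statistics need batch information: $\hat{\mu}_c$ and $\hat{\sigma}_c$, as their values depend on more than one sample. Depending on how we obtain the values of $\hat{\mu}_c$ and $\hat{\sigma}_c$, we have different implementations for large-batch and micro-batch training settings.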
One of the motivations of channel normalization is to allow deep networks to train on tasks where the batch size is limited by the GPU memory. Therefore, it is important for Batch-Channel Normalization to be able to work in the micro-batch training setting.
Algorithm 1 shows the feed-forward implementation of micro-batch Batch-Channel Normalization. The basic idea behind this algorithm is to constantly estimate the values of the per-channel batch statistics $\hat{\mu}_c$ and $\hat{\sigma}_c$, which are initialized as $0$ and $1$, respectively, and to normalize the features based on these estimates. It is worth noting that in the algorithm, $\hat{\mu}_c$ and $\hat{\sigma}_c$ are not updated by the gradients computed from the loss function; instead, they are updated towards more accurate estimates of those statistics. Steps 3 and 4 in Algorithm 1 resemble the update steps in gradient descent; thus, the implementation can also be written in gradient descent by storing the differences $\Delta\hat{\mu}_c$ and $\Delta\hat{\sigma}_c$ as their gradients. Moreover, we set the update rate to be the learning rate of trainable parameters.
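The micro-batch procedure described above can be sketched in a few lines. This is a hedged illustration, not a transcription of Algorithm 1: it assumes features stored as `x[b][c]` (a flat list of H*W values), tracks a running variance rather than a standard deviation, measures the variance around the micro-batch mean, updates the estimates before normalizing, and omits both affine transformations. The class name, `EPS`, and the `rate` argument are illustrative.

```python
import math

EPS = 1e-5  # numerical-stability constant; an assumption, not given in the text


class MicroBatchBCN:
    """Estimate-based batch step followed by a group (channel) step."""

    def __init__(self, num_channels, num_groups, rate):
        self.mu = [0.0] * num_channels   # running mean estimates, initialized to 0
        self.var = [1.0] * num_channels  # running variance estimates, initialized to 1
        self.groups = num_groups         # assumes num_channels % num_groups == 0
        self.rate = rate                 # update rate of the estimates

    def forward(self, x):
        B, C = len(x), len(x[0])
        # Measure per-channel statistics on this micro-batch and move the
        # running estimates toward them (no loss gradient is involved).
        for c in range(C):
            vals = [v for b in range(B) for v in x[b][c]]
            m = sum(vals) / len(vals)
            var = sum((v - m) ** 2 for v in vals) / len(vals)
            self.mu[c] += self.rate * (m - self.mu[c])
            self.var[c] += self.rate * (var - self.var[c])
        # Batch step: normalize with the estimates, not the micro-batch statistics.
        xb = [[[(v - self.mu[c]) / math.sqrt(self.var[c] + EPS) for v in x[b][c]]
               for c in range(C)] for b in range(B)]
        # Channel step: normalize each (sample, group) over its channels and positions.
        size = C // self.groups
        y = [[None] * C for _ in range(B)]
        for b in range(B):
            for g in range(self.groups):
                chans = range(g * size, (g + 1) * size)
                vals = [v for c in chans for v in xb[b][c]]
                m = sum(vals) / len(vals)
                s = math.sqrt(sum((v - m) ** 2 for v in vals) / len(vals) + EPS)
                for c in chans:
                    y[b][c] = [(v - m) / s for v in xb[b][c]]
        return y


layer = MicroBatchBCN(num_channels=2, num_groups=1, rate=0.5)
y = layer.forward([[[1.0, 3.0], [10.0, 30.0]]])  # batch size 1
```

Because the estimates persist across calls, repeated micro-batches drawn from the same distribution move them toward the dataset statistics even when each call sees a single sample; the channel step then keeps the output well scaled while the estimates are still inaccurate.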
Algorithm 1 also raises an interesting question: when researchers studied the micro-batch issue of BN before, why not just use the estimates to batch-normalize the features? In fact, prior work tries a similar idea but does not fully solve the micro-batch issue: it needs a bootstrap phase to make the estimates meaningful, and the performance is usually not satisfactory. The underlying difference between micro-batch BCN and that approach is that BCN has a channel normalization following the estimate-based normalization. This makes the previously unstable estimate-based normalization stable. The reduction of Lipschitz constants, which speeds up training, is also done in the channel-based normalization part and is impossible to achieve with estimate-based normalization alone. In summary, channel-based normalization makes estimate-based normalization possible, and estimate-based normalization helps channel-based normalization to keep models away from elimination singularities.
Batch- and channel-based normalizations are similar in many ways. Is BCN thus redundant as it normalizes normalized features? Our answer is no. Channel normalizations need batch knowledge to keep the models away from elimination singularities; at the same time, it also brings benefits to the batch-based normalization, including:
Batch knowledge without large batches. Since BCN runs in both large-batch and micro-batch settings, it provides a way to utilize batch knowledge to normalize activations without relying on large training batch sizes.
Additional non-linearity. Batch Normalization is linear in the test mode or when the batch size is large in training. By contrast, channel-based normalization methods, as they normalize each sample individually, are not linear. They will add strong non-linearity and increase the model capacity.
Test-time normalization. Unlike BN that relies on estimated statistics on the training dataset for testing, channel normalization normalizes testing data again, thus allows the statistics to adapt to different samples. As a result, channel normalization will be more robust to statistical changes and show better generalizability for unseen data.
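The non-linearity point in the list above can be illustrated with a small check. This is a minimal sketch with illustrative names: normalization with fixed statistics (as BN uses in test mode) is an affine map of its input, whereas normalization with statistics recomputed from each sample is invariant to rescaling that sample and therefore cannot be linear.

```python
import math


def fixed_stat_norm(values, mean, std):
    """BN in test mode: statistics are constants, so the map is affine in the input."""
    return [(v - mean) / std for v in values]


def per_sample_norm(values):
    """Channel-style normalization: statistics are recomputed from the input itself."""
    m = sum(values) / len(values)
    s = math.sqrt(sum((v - m) ** 2 for v in values) / len(values))
    return [(v - m) / s for v in values]


x = [1.0, 2.0, 4.0]
doubled = [2.0 * v for v in x]

fixed_a = fixed_stat_norm(x, mean=1.0, std=2.0)
fixed_b = fixed_stat_norm(doubled, mean=1.0, std=2.0)   # changes with the input scale
sample_a = per_sample_norm(x)
sample_b = per_sample_norm(doubled)                     # matches sample_a up to rounding
```

Doubling the input changes the fixed-statistics output but leaves the per-sample output unchanged, which is also why per-sample normalization adapts to statistical shifts at test time.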
In this section, we test the proposed BCN on popular vision benchmarks, including image classification on CIFAR-10/100 and ImageNet, semantic segmentation on PASCAL VOC 2012, and object detection and instance segmentation on COCO.
CIFAR has two image datasets, CIFAR-10 (C10) and CIFAR-100 (C100). Both C10 and C100 have color images of size $32 \times 32$. The C10 dataset has 10 categories while the C100 dataset has 100 categories. Each of C10 and C100 has 50,000 images for training and 10,000 images for testing, and the categories are balanced in terms of the number of samples. In all the experiments shown here, the standard data augmentation schemes are used, i.e., mirroring and shifting, for these two datasets. We also standardize each channel of the datasets for data pre-processing.
Table 1 shows the experimental results that compare our proposed BCN with BN and GN. The results are grouped into 4 parts based on whether the training is large-batch or micro-batch, and whether the dataset is C10 or C100. On C10, our proposed BCN is better than BN on large-batch training, and is better than GN (with or without WS), which is specifically designed for micro-batch training. Here, micro-batch training assumes the batch size is 1, and RN110 is the 110-layer ResNet with basic block as the building block. The number of groups here for GN is .
Table 2 shows comparisons with more recent normalization methods, Switchable Normalization (SN) and Dynamic Normalization (DN), which were tested on a variant of ResNet for CIFAR: ResNet-18. To provide readers with direct comparisons, we also evaluate BCN on ResNet-18 with the group number set to for models that use GN. Again, all the results are organized based on whether they are trained in the micro-batch setting. Based on the results shown in Tables 1 and 2, BCN outperforms the baselines in both large-batch and micro-batch training settings.
This section shows the results of training models with BCN on ImageNet. The ImageNet dataset contains 1.28 million color images for training and 50,000 images for validation. There are 1,000 categories in the dataset, which are roughly balanced. We adopt the same training and testing procedures used in prior work, and the baseline performances are copied from there.
Fig. 4 shows the training dynamics of ResNet-50 with GN, GN+WS and BCN+WS, and Table 3 shows the top-1 and top-5 error rates of ResNet-50, ResNet-101 and ResNeXt-50 trained with different normalization methods. From the results, we observe that adding batch information to channel-based normalizations strongly improves their accuracy. As a result, GN, whose performance is similar to BN when used with WS, is now able to achieve better results than the BN baselines. We find improvements not only in the final model accuracy, but also in the training speed. As shown in Fig. 4, we see a big drop in training error rates at each epoch. This demonstrates that the model is now farther from elimination singularities, resulting in easier and faster learning.
validation set. Output stride is 16, without multi-scale or flipping when testing.
After evaluating BCN on classification tasks, we test it on dense prediction tasks. We start with semantic segmentation on PASCAL VOC. We choose DeepLabV3 as the evaluation model for its good performance and its use of the pre-trained ResNet-101 backbone.
Table 5 shows our results on PASCAL VOC, which has 21 different categories with background included. We follow the common practice to prepare the dataset: the training set is augmented with the additional annotations used in prior work, and thus has 10,582 images. We take our ResNet-101 pre-trained on ImageNet and fine-tune it for the task. Here, we list all the implementation details for easy reproduction of our results: the batch size is set to , the image crop size is , the learning rate follows polynomial decay with an initial rate . The model is trained for iterations, and the multi-grid is instead of . For testing, the output stride is set to 16, and we do not use multi-scale or horizontal flipping test augmentation. As shown in Table 5, by only changing the normalization methods from BN and GN to our BCN, mIoU increases by about , which is a significant improvement for the PASCAL VOC dataset. As we strictly follow the hyper-parameters used in the previous work, there could be even more room for improvement if we tuned them to favor BCN, which we do not explore in this paper and leave to future work.
As introduced in Sec. 3, our BCN can also be used for micro-batch training, which we evaluate in this section by showing detection and segmentation results on COCO. Object detection is a fundamental vision task, yet it runs into memory issues when large batch sizes are used.
We take our ResNet-50 and ResNet-101 normalized by BCN and pre-trained on ImageNet as the starting point of the backbone, and fine-tune them on the COCO train2017 dataset. After training, the models are tested on the COCO val2017 dataset. We use 4 GPUs to train all the models, and each GPU has one training sample. The learning rate is configured according to the batch size following the common practice provided in [3, 7]. Specifically, we use the 1X learning rate schedule for Faster R-CNN and the 2X learning rate schedule for Mask R-CNN to get the results reported in this paper. We use FPN and the 4conv1fc bounding box head. We add BCN to the backbone, bounding box heads, and mask heads. We keep everything else untouched to maximize comparison fairness. Please see [3, 7] for more details.
Table 4 compares our BCN with GN and GN+WS on Mask R-CNN, and Table 6 shows the comparisons on Faster R-CNN. The results shown in the tables are the Average Precision for bounding box ($\mathrm{AP}^{b}$) and instance segmentation ($\mathrm{AP}^{m}$). As the tables demonstrate, our BCN outperforms the baseline methods by a comfortable margin.
Experiments on COCO differ from the previous results on ImageNet and PASCAL VOC in that they train models in the micro-batch setting: each GPU holds only one training sample and the GPUs are not synchronized (even if they were, the batch size would be 4, which is still not large). The results on ImageNet and PASCAL VOC show that when large-batch training is available, having batch information strongly improves the results. The experiments on COCO demonstrate that even when large-batch training is not available, an estimate-based batch normalization is still helpful and provides improvements. The improvements over WS when GN is used show that although WS is able to alleviate the statistical difference issue, it does not fully solve it. However, we do not discard WS when we use BCN, because WS still has a smoothing effect on the loss landscape, which improves training from another perspective. Overall, the results in this section support the necessity of keeping models away from elimination singularities when training neural networks, and BCN improves results by avoiding them along the training trajectory.
Deep neural networks advance the state of the art in many computer vision tasks [4, 13, 20, 24, 31, 32, 34, 40, 42, 43, 46, 48]. But deep networks are hard to train. To speed up training, proper model initializations are widely used, as well as data normalization based on assumptions about the data distribution [8, 11]. On top of data normalization and model initialization, Batch Normalization was proposed to ensure certain distributions so that the normalization effects do not fade away during training. By performing normalization along the batch dimension, Batch Normalization achieves state-of-the-art performance in many tasks in addition to accelerating the training process. When the batch size decreases, however, the performance of Batch Normalization drops dramatically, since the batch statistics are not representative enough of the dataset statistics. Unlike Batch Normalization, which works on the batch dimension, Layer Normalization normalizes data on the channel dimension, and Instance Normalization performs Batch Normalization for each sample individually. Group Normalization also normalizes features on the channel dimension, but it finds a better middle point between Layer Normalization and Instance Normalization.
Batch Normalization, Layer Normalization, Group Normalization, and Instance Normalization are all activation-based normalization methods. Besides them, there are also weight-based normalization methods, such as Weight Normalization and Weight Standardization [33, 14]. Weight Normalization decouples the length and the direction of the weights, while Weight Standardization constrains the weights to have zero mean and unit variance. Weight Standardization narrows the performance gap between Batch Normalization and Group Normalization; therefore, in this paper, we use Weight Standardization with our proposed method to get all the results.
In this paper, we study normalization methods and elimination singularities [30, 44]. There are also other perspectives for understanding normalization methods. For example, from the perspective of training robustness, BN makes optimization trajectories more robust to parameter initialization. [38, 33] show that normalizations reduce the Lipschitz constants of the loss and the gradients, so training becomes easier and faster. From the angle of model generalization, prior work shows that Batch Normalization relies less on single directions of activations and thus has better generalization properties, and the regularization effects of Batch Normalization have also been studied. Length-direction decoupling in BN and Weight Normalization has been explored as well. Other work approaches normalization from the perspectives of gradient explosion and learning rate tuning.
Our method uses Batch Normalization and Group Normalization at the same time for one layer. Some previous work also uses multiple normalizations or a combined version of normalizations for one layer. For example, SN computes BN, IN, and LN at the same time and uses AutoML to determine how to combine them. SSN uses SparsestMax to get sparse SN. DN proposes a more flexible form to represent normalizations and finds better normalizations. Unlike them, our method is based on analysis from the angle of elimination singularities instead of AutoML, and our normalizations are used together as a composite function rather than linearly adding up the normalization effects in a flat way.
In this paper, we approach the normalization methods from the perspective of elimination singularities. We study how different normalizations can keep their models away from the elimination singularities, since getting close to them will harm the training of the models. We observe that Batch Normalization (BN) is able to guarantee a far distance from the elimination singularities, while Layer Normalization (LN) and Group Normalization (GN) are unable to keep far distances. We also observe that the situation of LN is worse than that of GN, and Weight Standardization (WS) is able to alleviate this issue. These findings are consistent with their performances. We notice that the cause of LN and GN being unable to keep models away is their lack of batch knowledge. Therefore, to improve their performances, we propose Batch-Channel Normalization (BCN), which adds batch knowledge to channel-normalized models. BCN is able to run and improve the performances in both large-batch and micro-batch settings. We test it on many popular vision benchmarks. The experimental results show that it is able to outperform the baselines effortlessly.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
Proceedings of the 32nd International Conference on Machine Learning (ICML).
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.