Rethinking Normalization and Elimination Singularity in Neural Networks

11/21/2019 · Siyuan Qiao et al. · Johns Hopkins University

In this paper, we study normalization methods for neural networks from the perspective of elimination singularity. Elimination singularities correspond to the points on the training trajectory where neurons become consistently deactivated. They cause degenerate manifolds in the loss landscape which slow down training and harm model performance. We show that channel-based normalizations (e.g., Layer Normalization and Group Normalization) are unable to guarantee a far distance from elimination singularities, in contrast with Batch Normalization, which by design keeps models from getting too close to them. To address this issue, we propose Batch-Channel Normalization (BCN), which uses batch knowledge to avoid the elimination singularities in the training of channel-normalized models. Unlike Batch Normalization, BCN is able to run in both large-batch and micro-batch training settings. The effectiveness of BCN is verified on many tasks, including image classification, object detection, instance segmentation, and semantic segmentation. The code is here: https://github.com/joe-siyuan-qiao/Batch-Channel-Normalization.


1 Introduction

Deep neural networks achieve state-of-the-art results in many vision tasks [4, 10, 12]. Despite being very effective, deep networks are hard to train. Normalization methods [2, 16] are crucial for stabilizing and accelerating network training. There are many theories explaining how normalization helps optimization. For example, Batch Normalization (BN) [16] and Layer Normalization (LN) [2] were proposed based on the conjecture that they reduce internal covariate shift, which negatively impacts training. Santurkar et al. [38] argue that the reason for the success of BN is that it makes the loss landscape significantly smoother. Unlike previous work, we study normalization from the perspective of avoiding elimination singularities [44], which also have negative effects on training.

Elimination singularities refer to the points along the training trajectory where neurons in the networks get eliminated. As shown in [30], the performance of neural networks is correlated with their distance to elimination singularities: the closer the model is to the elimination singularities, the worse it performs. Sec. 2 provides a closer look at the relationship between the performance and the distance through experiments. Because of this relationship, we ask:

Do all the normalization methods keep their models away from elimination singularities?

Here, we list our findings:

  1. Batch Normalization (BN) [16] is able to keep models at far distances from the singularities.

  2. Channel-based normalization, e.g., Layer Normalization (LN) [2] and Group Normalization (GN) [45], is unable to guarantee far distances, and the situation of LN is worse than that of GN.

  3. Weight Standardization (WS) [33, 14] is able to push models away from the elimination singularities.

These findings provide a new way of understanding why GN performs better than LN and how WS improves the performance of both.

Since channel-based normalization methods (e.g., LN and GN) have issues with elimination singularities, we can improve their performances if we are able to push models away from those singularities. For this purpose, we propose Batch-Channel Normalization (BCN), which uses batch knowledge to prevent channel-normalized models from getting too close to elimination singularities. Sec. 3 presents the detailed modeling of the proposed normalization method. Unlike BN, BCN is able to run in both large-batch and micro-batch training settings and can improve the performances of channel-normalized models.

To evaluate our proposed BCN, we test it on various popular vision tasks, including large-batch training of ResNet [12] on ImageNet [36], large-batch training of DeepLabV3 [5] on PASCAL VOC [6], and micro-batch training of Faster R-CNN [35] and Mask R-CNN [10] on the MS COCO [22] dataset. Sec. 4 shows the experimental results, which demonstrate that our proposed BCN is able to outperform the baselines effortlessly. Finally, Sec. 5 discusses the related work, and Sec. 6 concludes the paper.

2 Normalization and Elimination Singularity

In this section, we will provide the background of normalization methods, discuss the relationship between the performance and the distance to elimination singularities, and show how well normalization methods are able to prevent models from getting too close to those singularities.

2.1 Batch- and Channel-based Normalization

Based on how activations are normalized, we group the normalization methods into two types: batch-based normalization and channel-based normalization, where the batch-based normalization method corresponds to BN and the channel-based normalization methods include LN and GN.

Suppose we are going to normalize a 2D feature map $X \in \mathbb{R}^{B\times C\times H\times W}$, where $B$ is the batch size, $C$ is the number of channels, and $H$ and $W$ denote the height and the width. For each channel $c$, BN normalizes $X$ by

$$\hat{X}_{\cdot c \cdot \cdot} = \frac{X_{\cdot c \cdot \cdot} - \mu_{\cdot c \cdot \cdot}}{\sigma_{\cdot c \cdot \cdot}}, \qquad (1)$$

where $\mu_{\cdot c \cdot \cdot}$ and $\sigma_{\cdot c \cdot \cdot}$ denote the mean and the standard deviation of all the features of the channel $c$, i.e., $X_{\cdot c \cdot \cdot}$. Throughout the paper, we use $\cdot$ in the subscript to denote all the features along that dimension for convenience.

Unlike BN, which computes statistics on the batch dimension in addition to the height and width, channel-based normalization methods compute statistics on the channel dimension. Specifically, they divide the channels into several groups and normalize each group of channels together, i.e., $X$ is reshaped as $\dot{X} \in \mathbb{R}^{B\times G\times C/G\times H\times W}$, and then

$$\hat{X}_{b g \cdot \cdot \cdot} = \frac{\dot{X}_{b g \cdot \cdot \cdot} - \mu_{b g \cdot \cdot \cdot}}{\sigma_{b g \cdot \cdot \cdot}} \qquad (2)$$

for each sample $b$ of the $B$ samples in a batch and each channel group $g$ out of all $G$ groups. After Eq. 2, the output is reshaped back to $\mathbb{R}^{B\times C\times H\times W}$ and denoted by $\hat{X}$.

Both batch- and channel-based normalization methods optionally apply a per-channel affine transformation, i.e.,

$$Y_{\cdot c \cdot \cdot} = \gamma_c \hat{X}_{\cdot c \cdot \cdot} + \beta_c. \qquad (3)$$
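To make the two kinds of statistics concrete, the following sketch (our own illustration, not the released code) computes the BN statistics of Eq. 1 and the per-sample group statistics of Eq. 2 on a random tensor; the tensor sizes and the group count are arbitrary choices.

```python
# A minimal sketch contrasting batch-based (Eq. 1) and channel/group-based (Eq. 2)
# normalization, with the optional affine transformation of Eq. 3.
import torch

B, C, H, W, G = 8, 32, 16, 16, 4          # batch, channels, height, width, groups
eps = 1e-5
x = torch.randn(B, C, H, W)

# Eq. 1 (BN): one mean/std per channel, computed over (B, H, W).
mu_bn = x.mean(dim=(0, 2, 3), keepdim=True)
sd_bn = x.std(dim=(0, 2, 3), keepdim=True)
x_bn = (x - mu_bn) / (sd_bn + eps)

# Eq. 2 (GN/LN): reshape to (B, G, C/G, H, W); one mean/std per sample and group.
x_g = x.view(B, G, C // G, H, W)
mu_gn = x_g.mean(dim=(2, 3, 4), keepdim=True)
sd_gn = x_g.std(dim=(2, 3, 4), keepdim=True)
x_gn = ((x_g - mu_gn) / (sd_gn + eps)).view(B, C, H, W)   # LN is the special case G = 1

# Eq. 3: optional per-channel affine transformation.
gamma = torch.ones(1, C, 1, 1)
beta = torch.zeros(1, C, 1, 1)
y = gamma * x_gn + beta
```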

2.2 Performance and Distance to Singularities

Deep neural networks are hard to train partly due to the singularities caused by the non-identifiability of the model [44]. These singularities include overlap singularities, linear dependence singularities, elimination singularities, etc. They cause degenerate manifolds in the loss landscape, and getting closer to these manifolds slows down learning and harms model performance [30]. In this paper, we focus on elimination singularities, which correspond to the points on the training trajectory where neurons in the model become constantly deactivated.

We focus on a basic building element that is widely used in neural networks: a convolutional layer followed by a normalization method (e.g., BN or LN) and ReLU [29], i.e.,

$$Y = \mathrm{ReLU}\big(\mathrm{Norm}(\mathrm{Conv}(X))\big). \qquad (4)$$

ReLU sets any value below 0 to 0; thus, a neuron is constantly deactivated if its maximum value after normalization is below 0. Its gradients will also be 0 because of ReLU, making it hard to revive; hence, a singularity is created.
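The following is a small, hypothetical diagnostic (the function name and setup are ours, not from the paper) that flags such constantly deactivated channels in a batch of normalized pre-activations.

```python
# Detect channels eliminated by the ReLU of Eq. 4: a channel whose maximum value
# after normalization never exceeds 0 is zeroed by ReLU and gets zero gradient.
import torch
import torch.nn.functional as F

def dead_channel_mask(features: torch.Tensor) -> torch.Tensor:
    """features: normalized pre-activations of shape (B, C, H, W).
    Returns a boolean mask of shape (C,) marking channels that are
    constantly deactivated, i.e. entirely non-positive before ReLU."""
    per_channel_max = features.amax(dim=(0, 2, 3))   # max over batch and space
    return per_channel_max <= 0

x = torch.randn(4, 8, 7, 7)
x[:, 3] = -x[:, 3].abs()            # force one channel below zero for illustration
y = F.relu(x)                       # Eq. 4: the ReLU after normalization
print(dead_channel_mask(x))         # channel 3 is flagged as eliminated
print((y[:, 3] != 0).any())         # tensor(False): it never activates
```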

BN avoids elimination singularities.

Here, we study the effect of BN on elimination singularities. Since the normalization methods all have an optional affine transformation, we focus on the distinct part of BN, Eq. 1, which normalizes all channels to zero mean and unit variance, i.e.,

$$\mathbb{E}\big[\hat{X}_{\cdot c \cdot \cdot}\big] = 0, \qquad \mathrm{Var}\big[\hat{X}_{\cdot c \cdot \cdot}\big] = 1. \qquad (5)$$

As a result, regardless of the weights and the distribution of the inputs, Eq. 1 guarantees that the activations of each channel are zero-centered with unit variance. Therefore, no channel can be constantly deactivated, because there are always some activations that are greater than 0; nor can a channel be almost constantly deactivated because its activation scale is very small compared with the others.

Statistics affect the distance and the performance.

Figure 1: Model accuracy and distance to singularities. Larger circles correspond to higher performances. Red crosses represent failure cases. Circles that are closer to the origin are farther from the singularities.

BN avoids singularities by normalizing each channel to zero mean and unit variance. What if they are normalized to other means and variances?

We ask this question because it resembles what happens in channel-normalized models. Channel-based normalization methods, as they do not have batch information, are unable to make sure all neurons have zero mean and unit variance after normalization. Instead, different channels will have different statistics, which brings the model closer to singularities. Here, by closer, we mean the model is farther from the BN case where each channel is zero-centered with unit variance, which avoids all such singularities. To study the relationship between the performance and the distance to singularities (i.e., how far from the BN case) caused by statistical differences, we conduct experiments on a 4-layer convolutional network. Each convolutional layer has 32 output channels and is followed by an average pooling layer which down-samples the features by a factor of 2. Finally, a global average pooling layer and a fully-connected layer output the logits for Softmax. The experiments are done on CIFAR-10 [19].

In the experiment, each channel $c$ will be normalized to a pre-defined mean $\hat{\mu}_c$ and a pre-defined standard deviation $\hat{\sigma}_c$ that are drawn from two distributions, respectively:

$$\hat{\mu}_c \sim D_{\mu}, \qquad \hat{\sigma}_c \sim D_{\sigma}, \qquad (6)$$

where the spreads of $D_{\mu}$ and $D_{\sigma}$ are controlled by two scalars $\mu$ and $\sigma$. The model will be closer to singularities when $\mu$ or $\sigma$ increases. BN corresponds to the case where $\mu = \sigma = 0$, i.e., every channel is normalized to zero mean and unit variance.

After getting $\hat{\mu}_c$ and $\hat{\sigma}_c$ for each channel, we compute

$$Y_{\cdot c \cdot \cdot} = \gamma_c \big(\hat{\sigma}_c \hat{X}_{\cdot c \cdot \cdot} + \hat{\mu}_c\big) + \beta_c, \qquad (7)$$

where $\hat{X}$ is the output of Eq. 1. Note that $\hat{\mu}_c$ and $\hat{\sigma}_c$ are fixed during training while $\gamma_c$ and $\beta_c$ are the trainable parameters of the affine transformation.
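A minimal sketch of this controlled setup, under our reading of Eqs. 6-7, is given below; the module name and the sampling ranges are illustrative assumptions rather than the exact protocol of the experiment.

```python
# Each channel is first standardized as in Eq. 1 and then mapped to a fixed,
# randomly drawn target mean/std before the trainable affine transform (Eq. 7).
import torch
import torch.nn as nn

class FixedStatNorm2d(nn.Module):
    def __init__(self, num_channels: int, mu_max: float = 0.0, sigma_max: float = 0.0):
        super().__init__()
        # Frozen targets; mu_max = sigma_max = 0 recovers plain BN statistics.
        # The uniform sampling here is an assumption for illustration only.
        self.register_buffer("target_mu", mu_max * torch.rand(1, num_channels, 1, 1))
        self.register_buffer("target_sd", 1.0 + sigma_max * torch.rand(1, num_channels, 1, 1))
        self.gamma = nn.Parameter(torch.ones(1, num_channels, 1, 1))   # trainable affine
        self.beta = nn.Parameter(torch.zeros(1, num_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        mu = x.mean(dim=(0, 2, 3), keepdim=True)          # Eq. 1: per-channel statistics
        sd = x.std(dim=(0, 2, 3), keepdim=True) + 1e-5
        x_hat = (x - mu) / sd
        x_hat = self.target_sd * x_hat + self.target_mu   # move to the fixed targets
        return self.gamma * x_hat + self.beta             # Eq. 7: trainable affine
```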

Figure 2: Examples of normalizing two channels in a group when they have different means and variances. Transparent bars denote activations that become 0 after ReLU. StatDiff is defined in Eq. 8.

Fig. 1 shows the experimental results. When $\mu$ and $\sigma$ are closer to the origin, the normalization method is more similar to BN, and the model is farther from the singularities. When their values increase, we observe performance decreases. In extreme cases, we also observe training failures. These results indicate that although the affine transformation can theoretically find solutions that cancel the negative effects of normalizing channels to different statistics, its capability is limited by gradient-based training. These findings raise concerns about channel normalization regarding its distance to singularities.

2.3 Statistics in Channel Normalization

Following our concerns about channel-based normalization and their distance to singularities, we study the statistical differences between channels when they are normalized by a channel-based normalization such as GN or LN.

Figure 3: Means and standard deviations of the statistical differences (StatDiff in Eq. 8) of all layers in a ResNet-110 trained on CIFAR-10 with GN, GN+WS, LN, and LN+WS.

Statistical differences in GN, LN and WS.

We train a ResNet-110 [12] on CIFAR-10 [19] normalized by GN or LN, with and without WS [33]. During training, we keep records of the running mean and variance of each channel after the convolutional layers. For each group of channels that are normalized together, we compute their channel statistical difference, defined as the standard deviation of their means divided by the mean of their standard deviations, i.e.,

$$\mathrm{StatDiff} = \frac{\mathrm{std}\big(\{\mu_c : c \in \text{group}\}\big)}{\mathrm{mean}\big(\{\sigma_c : c \in \text{group}\}\big)}. \qquad (8)$$

We plot the average statistical differences of all the groups after every training epoch, as shown in Fig. 3.

By Eq. 8, $\mathrm{StatDiff} \geq 0$. In BN, all the channel means are the same, as are their variances, thus $\mathrm{StatDiff} = 0$. As the value of $\mathrm{StatDiff}$ goes up, the differences between channels within a group become larger. Since they will be normalized together as in Eq. 2, large differences will inevitably lead to underrepresented channels. Fig. 2 plots 3 examples of 2 channels before and after the normalization of Eq. 2. Compared with those examples, it is clear that the models in Fig. 3 have many underrepresented channels.
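For concreteness, a small helper (our own, not from the released code) that computes the StatDiff of Eq. 8 for one group from the recorded running statistics could look as follows.

```python
# StatDiff (Eq. 8): std of the channel means divided by the mean of the channel stds.
import torch

def stat_diff(running_means: torch.Tensor, running_stds: torch.Tensor) -> torch.Tensor:
    """running_means, running_stds: shape (channels_in_group,)."""
    return running_means.std() / running_stds.mean()

# BN-like group: identical statistics, StatDiff = 0.
print(stat_diff(torch.zeros(16), torch.ones(16)))
# A group with diverse means: StatDiff grows, signalling underrepresented channels.
print(stat_diff(torch.randn(16) * 3, torch.ones(16)))
```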

Why GN performs better than LN.

Fig. 3 also provides explanations why GN performs better than LN. Comparing GN and LN, the major difference is their numbers of groups for channels: LN has only one group for all the channels in a layer while GN collects them into several groups. A strong benefit of having more than one group is that it guarantees that each group will at least have one neuron that is not suppressed by the others from the same group. Therefore, GN provides a mechanism to prevent the models from getting too close to the elimination singularities.

Fig. 3 also shows the statistical differences when WS is used. From the results, we can clearly see that WS makes StatDiff much closer to 0. Consequently, the majority of the channels are not underrepresented in WS: most of them are frequently activated and they are at similar activation scales. This makes training with WS easier and its results better.

Why WS helps.

Here, we also provide our understanding of why WS is able to achieve smaller statistical differences. Recall that WS adds constraints to the weight $W \in \mathbb{R}^{O\times I}$ of a convolutional layer with $O$ output channels and $I$ inputs such that, $\forall o$,

$$\sum_{i=1}^{I} W_{o,i} = 0, \qquad \frac{1}{I}\sum_{i=1}^{I} W_{o,i}^{2} = 1. \qquad (9)$$

With the constraints of WS, the mean $\mu_{Y_o}$ and the variance $\sigma^2_{Y_o}$ of an output channel $Y_o = \sum_i W_{o,i} X_i$ become

$$\mu_{Y_o} = \sum_{i=1}^{I} W_{o,i}\,\mu_{X_i}, \qquad \sigma^2_{Y_o} = \sum_{i=1}^{I} W_{o,i}^{2}\,\sigma^2_{X_i}, \qquad (10)$$

when we follow the assumptions in Xavier initialization [8]. When the input channels are similar in their statistics, i.e., $\mu_{X_1} \approx \cdots \approx \mu_{X_I} = \mu_X$ and $\sigma_{X_1} \approx \cdots \approx \sigma_{X_I} = \sigma_X$,

$$\mu_{Y_o} \approx \mu_X \sum_{i=1}^{I} W_{o,i} = 0, \qquad (11)$$
$$\sigma^2_{Y_o} \approx \sigma^2_X \sum_{i=1}^{I} W_{o,i}^{2} = I\,\sigma^2_X, \qquad (12)$$

which are the same for all output channels $o$.
In other words, WS can pass the statistical similarities from the input channels to the output channels, all the way from the image space where RGB channels are properly normalized. This is similar to the objective of Xavier initialization [8] or Kaiming initialization [11], except that WS enforces it by reparameterization throughout the entire training process, thus is able to reduce the statistical differences.
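A minimal sketch of the WS reparameterization discussed here, following [33], is shown below; the epsilon and the exact standardization details may differ from the released implementation.

```python
# Weight Standardization as a drop-in Conv2d: each output channel's weights are
# standardized to zero mean and unit variance (Eq. 9) before the convolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class WSConv2d(nn.Conv2d):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        w = self.weight                                  # (O, I/groups, kH, kW)
        mean = w.mean(dim=(1, 2, 3), keepdim=True)       # per output channel
        std = w.std(dim=(1, 2, 3), keepdim=True) + 1e-5
        w_hat = (w - mean) / std                         # zero mean, unit variance
        return F.conv2d(x, w_hat, self.bias, self.stride,
                        self.padding, self.dilation, self.groups)
```

Because the standardization is applied at every forward pass, the constraint holds throughout training rather than only at initialization, which is the point made above.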

Here, we summarize this subsection. We have shown that channel-based normalization methods, as they do not have batch information, are not able to ensure a far distance from elimination singularities. Without the help of batch information, GN alleviates this issue by assigning channels to more than one group to encourage more activated neurons, and WS adds constraints that pull the channels closer together statistically. We notice that batch information is not hard to collect in practice. This inspires us to equip channel-based normalization with batch information, and the result is Batch-Channel Normalization.

3 Batch-Channel Normalization

This section presents the definition of Batch-Channel Normalization, discusses why adding batch statistics to channel normalization is not redundant, and shows how BCN runs in large-batch and micro-batch training settings.

3.1 Definition

Batch-Channel Normalization (BCN) adds batch constraints to channel-based normalization methods. Let $X \in \mathbb{R}^{B\times C\times H\times W}$ be the features to be normalized. Then, the normalization is done as follows. $\forall c$,

$$\dot{X}_{\cdot c \cdot \cdot} = \gamma^{b}_{c}\,\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c} + \beta^{b}_{c}, \qquad (13)$$

where the purpose of $\hat{\mu}_c$ and $\hat{\sigma}_c$ is to make

$$\mathbb{E}\Big[\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c}\Big] = 0, \qquad \mathrm{Var}\Big[\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c}\Big] = 1. \qquad (14)$$

Then, $\dot{X}$ is reshaped as $\dot{X} \in \mathbb{R}^{B\times G\times C/G\times H\times W}$ to have $G$ groups of channels. Next, $\forall b, g$,

$$Y_{b g \cdot \cdot \cdot} = \gamma^{c}_{g}\,\frac{\dot{X}_{b g \cdot \cdot \cdot} - \mu_{b g \cdot \cdot \cdot}}{\sigma_{b g \cdot \cdot \cdot}} + \beta^{c}_{g}. \qquad (15)$$

Finally, $Y$ is reshaped back to $\mathbb{R}^{B\times C\times H\times W}$, which is the output of the Batch-Channel Normalization.

3.2 Large- and Micro-batch Implementations

Note that in Eq. 13 and 15, only two statistics need batch information: $\hat{\mu}_c$ and $\hat{\sigma}_c$, as their values depend on more than one sample. Depending on how we obtain the values of $\hat{\mu}_c$ and $\hat{\sigma}_c$, we have different implementations for the large-batch and micro-batch training settings.

Large-batch training.

When the batch size is large, estimating $\hat{\mu}_c$ and $\hat{\sigma}_c$ is easy: we simply use a Batch Normalization layer to achieve the function of Eq. 13 and 14. As a result, the proposed BCN can be written as

$$\mathrm{BCN}(X) = \mathrm{CN}\big(\mathrm{BN}(X)\big), \qquad (16)$$

i.e., a BN layer followed by a channel normalization (CN) layer as in Eq. 15. Implementing it is also easy with modern deep learning libraries, so the details are omitted here.
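For reference, a minimal large-batch sketch of Eq. 16 in PyTorch simply chains a BatchNorm2d and a GroupNorm layer; the group count below is an illustrative choice, not a value prescribed here.

```python
# Large-batch BCN (Eq. 16): batch statistics via BatchNorm2d, then channel/group
# statistics via GroupNorm, each with its own affine transformation.
import torch.nn as nn

def batch_channel_norm(num_channels: int, num_groups: int = 32) -> nn.Module:
    return nn.Sequential(
        nn.BatchNorm2d(num_channels),              # Eq. 13-14: batch part
        nn.GroupNorm(num_groups, num_channels),    # Eq. 15: channel part
    )

# Usage: drop-in replacement for a normalization layer after a convolution.
block = nn.Sequential(nn.Conv2d(64, 128, 3, padding=1),
                      batch_channel_norm(128),
                      nn.ReLU())
```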

Micro-batch training.

One of the motivations of channel normalization is to allow deep networks to train on tasks where the batch size is limited by the GPU memory. Therefore, it is important for Batch-Channel Normalization to be able to work in the micro-batch training setting.

Input: $X \in \mathbb{R}^{B\times C\times H\times W}$, the current estimates $\hat{\mu}_c$ and $\hat{\sigma}_c$, and the update rate $r$.
Output: Normalized $Y \in \mathbb{R}^{B\times C\times H\times W}$.
1 Compute the batch mean of each channel: $\mu^{B}_{c} \leftarrow \mathrm{mean}\big(X_{\cdot c \cdot \cdot}\big)$;
2 Compute the batch standard deviation of each channel: $\sigma^{B}_{c} \leftarrow \mathrm{std}\big(X_{\cdot c \cdot \cdot}\big)$;
3 Update the estimate: $\hat{\mu}_c \leftarrow \hat{\mu}_c + r\,\big(\mu^{B}_{c} - \hat{\mu}_c\big)$;
4 Update the estimate: $\hat{\sigma}_c \leftarrow \hat{\sigma}_c + r\,\big(\sigma^{B}_{c} - \hat{\sigma}_c\big)$;
5 Normalize: $\dot{X}_{\cdot c \cdot \cdot} \leftarrow \gamma^{b}_{c}\,\frac{X_{\cdot c \cdot \cdot} - \hat{\mu}_c}{\hat{\sigma}_c} + \beta^{b}_{c}$;
6 Reshape $\dot{X}$ to $\mathbb{R}^{B\times G\times C/G\times H\times W}$;
7 Normalize: $Y_{b g \cdot \cdot \cdot} \leftarrow \gamma^{c}_{g}\,\frac{\dot{X}_{b g \cdot \cdot \cdot} - \mu_{b g \cdot \cdot \cdot}}{\sigma_{b g \cdot \cdot \cdot}} + \beta^{c}_{g}$;
8 Reshape $Y$ back to $\mathbb{R}^{B\times C\times H\times W}$;
Algorithm 1 Micro-batch BCN

Algorithm 1 shows the feed-forward implementation of the micro-batch Batch-Channel Normalization. The basic idea behind this algorithm is to constantly estimate the values of $\hat{\mu}_c$ and $\hat{\sigma}_c$, which are initialized as 0 and 1, respectively, and to normalize $X$ based on these estimates. It is worth noting that in the algorithm, $\hat{\mu}_c$ and $\hat{\sigma}_c$ are not updated by the gradients computed from the loss function; instead, they are updated towards more accurate estimates of those statistics. Steps 3 and 4 in Algorithm 1 resemble the update steps in gradient descent; thus, the implementation can also be written in the form of gradient descent by storing the differences $\hat{\mu}_c - \mu^{B}_{c}$ and $\hat{\sigma}_c - \sigma^{B}_{c}$ as their gradients. Moreover, we set the update rate $r$ to be the learning rate of the trainable parameters.
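Below is a sketch of Algorithm 1 as a PyTorch module, under our reading of the algorithm; the buffer names, the epsilon, and the default update rate are assumptions, and the released implementation may differ. The GroupNorm layer supplies the channel-normalization step and its affine parameters.

```python
# Micro-batch BCN: running estimates of per-channel mean/std are nudged toward
# the current micro-batch statistics (steps 1-4), then used for the batch part
# (step 5), followed by a channel/group normalization (steps 6-8).
import torch
import torch.nn as nn

class MicroBatchBCN(nn.Module):
    def __init__(self, num_channels: int, num_groups: int = 32,
                 r: float = 0.1, eps: float = 1e-5):
        super().__init__()
        self.r, self.eps = r, eps
        # Running estimates, initialized to 0 and 1 as described above.
        self.register_buffer("mu_hat", torch.zeros(1, num_channels, 1, 1))
        self.register_buffer("sigma_hat", torch.ones(1, num_channels, 1, 1))
        self.gamma_b = nn.Parameter(torch.ones(1, num_channels, 1, 1))   # batch-part affine
        self.beta_b = nn.Parameter(torch.zeros(1, num_channels, 1, 1))
        self.channel_norm = nn.GroupNorm(num_groups, num_channels)       # steps 6-8

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if self.training:
            mu_b = x.mean(dim=(0, 2, 3), keepdim=True)                   # step 1
            sigma_b = x.std(dim=(0, 2, 3), keepdim=True)                 # step 2
            # Steps 3-4: move the estimates toward the micro-batch statistics;
            # these buffers are not updated by loss gradients.
            self.mu_hat += self.r * (mu_b.detach() - self.mu_hat)
            self.sigma_hat += self.r * (sigma_b.detach() - self.sigma_hat)
        x = (x - self.mu_hat) / (self.sigma_hat + self.eps)              # step 5
        x = self.gamma_b * x + self.beta_b
        return self.channel_norm(x)
```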

Algorithm 1 also raises an interesting question: when researchers studied the micro-batch issue of BN before, why did they not simply use running estimates to batch-normalize the features? In fact, [17] tries a similar idea, but does not fully solve the micro-batch issue: it needs a bootstrap phase to make the estimates meaningful, and the performances are usually not satisfactory. The underlying difference between micro-batch BCN and [17] is that BCN applies a channel normalization after the estimate-based normalization. This makes the previously unstable estimate-based normalization stable. Moreover, the reduction of the Lipschitz constants, which speeds up training, is achieved by the channel-based normalization part, something estimate-based normalization alone cannot do. In summary, channel-based normalization makes estimate-based normalization possible, and estimate-based normalization helps channel-based normalization keep models away from elimination singularities.

3.3 Is Batch-Channel Normalization Redundant?

Batch- and channel-based normalizations are similar in many ways. Is BCN thus redundant, as it normalizes already normalized features? Our answer is no. Channel normalization needs batch knowledge to keep the models away from elimination singularities; at the same time, it also brings benefits to batch-based normalization, including:

Batch knowledge without large batches. Since BCN runs in both large-batch and micro-batch settings, it provides a way to utilize batch knowledge to normalize activations without relying on large training batch sizes.

Additional non-linearity. Batch Normalization is linear in the test mode or when the batch size is large in training. By contrast, channel-based normalization methods, as they normalize each sample individually, are not linear. They will add strong non-linearity and increase the model capacity.

Test-time normalization. Unlike BN that relies on estimated statistics on the training dataset for testing, channel normalization normalizes testing data again, thus allows the statistics to adapt to different samples. As a result, channel normalization will be more robust to statistical changes and show better generalizability for unseen data.

4 Experimental Results

In this section, we test the proposed BCN in popular vision benchmarks, including image classification on CIFAR-10/100 [19] and ImageNet [36], semantic segmentation on PASCAL VOC 2012 [6], and object detection and instance segmentation on COCO [22].

4.1 Image Classification on CIFAR

CIFAR has two image datasets, CIFAR-10 (C10) and CIFAR-100 (C100). Both C10 and C100 have color images of size 32 x 32. The C10 dataset has 10 categories while the C100 dataset has 100 categories. Each of C10 and C100 has 50,000 images for training and 10,000 images for testing, and the categories are balanced in terms of the number of samples. In all the experiments shown here, the standard data augmentation schemes, i.e., mirroring and shifting, are used for these two datasets. We also standardize each channel of the datasets for data pre-processing.
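A sketch of this pre-processing pipeline is shown below (assuming torchvision; the normalization constants are the commonly used CIFAR-10 channel statistics and are not values stated in this paper).

```python
# Standard CIFAR augmentation described above: mirroring, shifting via
# pad-and-crop, and per-channel standardization.
import torchvision.transforms as T

train_transform = T.Compose([
    T.RandomHorizontalFlip(),                        # mirroring
    T.RandomCrop(32, padding=4),                     # shifting
    T.ToTensor(),
    T.Normalize(mean=(0.4914, 0.4822, 0.4465),       # per-channel standardization
                std=(0.2470, 0.2435, 0.2616)),
])
```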

Table 1 shows the experimental results that compare our proposed BCN with BN and GN. The results are grouped into 4 parts based on whether the training is large-batch or micro-batch, and whether the dataset is C10 or C100. On C10, our proposed BCN is better than BN in large-batch training, and is better than GN (with or without WS), which is specifically designed for micro-batch training. Here, micro-batch training assumes the batch size is 1, and RN110 is the 110-layer ResNet [12] with the basic block as the building block. The number of groups here for GN is .

Table 2 shows comparisons with more recent normalization methods, Switchable Normalization (SN) [25] and Dynamic Normalization (DN) [27], which were tested on a variant of ResNet for CIFAR: ResNet-18. To provide readers with direct comparisons, we also evaluate BCN on ResNet-18 with the group number set to for models that use GN. Again, all the results are organized based on whether they are trained in the micro-batch setting. Based on the results shown in Tables 1 and 2, it is clear that BCN is able to outperform the baselines effortlessly in both large-batch and micro-batch training settings.

Dataset Model Method Micro-Batch WS Error
C10 RN110 BN - - 6.43
C10 RN110 BCN - ✓ 5.90
C10 RN110 GN ✓ - 7.45
C10 RN110 GN ✓ ✓ 6.82
C10 RN110 BCN ✓ ✓ 6.31
C100 RN110 BN - - 28.86
C100 RN110 BCN - ✓ 28.36
C100 RN110 GN ✓ - 32.86
C100 RN110 GN ✓ ✓ 29.49
C100 RN110 BCN ✓ ✓ 28.28
Table 1: Error rates of a 110-layer ResNet [12] on CIFAR-10/100 [19] trained with BN [16], GN [45] and our BCN. The results are grouped based on dataset and large/micro-batch training. Micro-batch assumes 1 sample per batch while large-batch uses 128 samples in each batch. The WS column indicates whether WS [33] is used for the weights.
Dataset Model Method Micro-Batch Error
C10 RN18 BN - 5.20
C10 RN18 SN - 5.60
C10 RN18 DN - 5.02
C10 RN18 BCN - 4.96
C10 RN18 BN ✓ 8.45
C10 RN18 SN ✓ 7.62
C10 RN18 DN ✓ 7.55
C10 RN18 BCN ✓ 5.43
Table 2: Error rates of ResNet-18 on CIFAR-10 trained with SN [25], DN [27] and our BCN. The results are grouped based on large/micro-batch training. The performances of BN, SN and DN are from [27]. Micro-batch for BN, SN and DN uses 2 images per batch, while BCN uses 1.

4.2 Image Classification on ImageNet

Figure 4: Training and validation error rates of ResNet-50 on ImageNet. The comparison is between the baselines GN [45], GN + WS [33], and our proposed Batch-Channel Normalization (BCN) with WS. Our method BCN not only significantly improves the training speed but also lowers the error rates of the final models by a comfortable margin.
Dataset Model Method WS Top-1 Top-5
ImageNet RN50 BN - 24.30 7.19
ImageNet RN50 BN ✓ 23.76 7.13
ImageNet RN50 GN ✓ 23.72 6.99
ImageNet RN50 BCN ✓ 23.09 6.55
ImageNet RN101 BN - 22.44 6.21
ImageNet RN101 BN ✓ 21.89 6.01
ImageNet RN101 GN ✓ 22.10 6.07
ImageNet RN101 BCN ✓ 21.29 5.60
ImageNet RX50 BN - 22.60 6.29
ImageNet RX50 GN ✓ 22.71 6.38
ImageNet RX50 BCN ✓ 22.08 5.99
Table 3: Top-1/5 error rates of ResNet-50, ResNet-101, and ResNeXt-50 on ImageNet. The test size is 224 x 224 with center cropping. All normalizations are trained in the large-batch setting without synchronization across GPUs.
Model Method WS AP^b AP^b_50 AP^b_75 AP^b_l AP^b_m AP^b_s AP^m AP^m_50 AP^m_75 AP^m_l AP^m_m AP^m_s
RN50 GN - 39.8 60.5 43.4 52.4 42.9 23.0 36.1 57.4 38.7 53.6 38.6 16.9
RN50 GN ✓ 40.8 61.6 44.8 52.7 44.0 23.5 36.5 58.5 38.9 53.5 39.3 16.6
RN50 BCN ✓ 41.4 62.2 45.2 54.7 45.0 24.2 37.3 59.4 39.8 55.0 40.1 17.9
RN101 GN - 41.5 62.0 45.5 54.8 45.0 24.1 37.0 59.0 39.6 54.5 40.0 17.5
RN101 GN ✓ 42.7 63.6 46.8 56.0 46.0 25.7 37.9 60.4 40.7 56.3 40.6 18.2
RN101 BCN ✓ 43.6 64.4 47.9 57.4 47.5 25.6 39.1 61.4 42.2 57.3 42.1 19.1
Table 4: Object detection and instance segmentation results on COCO val2017 [22] of Mask R-CNN [10] and FPN [21] with ResNet-50 and ResNet-101 [12] as backbones. AP^b denotes bounding-box AP and AP^m denotes mask AP; the subscripts 50, 75, l, m, and s denote the IoU thresholds and the large/medium/small object splits. The models are trained with different normalization methods, which are used in their backbones, bounding box heads, and mask heads.

This section shows the results of training models with BCN on ImageNet [36]. The ImageNet dataset contains 1.28 million color images for training and 50,000 images for validation. There are 1,000 categories in the dataset, which are roughly balanced. We adopt the same training and testing procedures as [33], and the baseline performances are copied from there.

Fig. 4 shows the training dynamics of ResNet-50 with GN, GN+WS and BCN+WS, and Table 3 shows the top-1 and top-5 error rates of ResNet-50, ResNet-101 and ResNeXt-50 trained with different normalization methods. From the results, we observe that adding batch information to channel-based normalization strongly improves its accuracy. As a result, GN, whose performance is similar to BN when used with WS, is now able to achieve better results than the BN baselines. We find improvements not only in the final model accuracy but also in the training speed. As shown in Fig. 4, we see a big drop of the training error rate at each epoch. This demonstrates that the model is farther from elimination singularities, resulting in easier and faster learning.

4.3 Semantic Segmentation on PASCAL VOC

Dataset Model Method WS mIoU
VOC Val RN101 GN - 74.90
VOC Val RN101 GN ✓ 77.20
VOC Val RN101 BN - 76.49
VOC Val RN101 BN ✓ 77.15
VOC Val RN101 BCN ✓ 78.10
Table 5: Comparisons of the semantic segmentation performance of DeepLabV3 [5] trained with different normalizations on the PASCAL VOC 2012 [6] validation set. The output stride is 16, without multi-scale or flipping when testing.

After evaluating BCN on classification tasks, we test it on dense prediction tasks. We start with semantic segmentation on PASCAL VOC [6]. We choose DeepLabV3 [5] as the evaluation model for its good performances and its use of the pre-trained ResNet-101 backbone.

Table 5 shows our results on PASCAL VOC, which has 21 different categories with the background included. We follow the common practice to prepare the dataset, and the training set is augmented by the annotations provided in [9], and thus has 10,582 images. We take our ResNet-101 pre-trained on ImageNet and fine-tune it for the task. The learning rate follows a polynomial decay schedule, and the batch size, crop size, initial learning rate, number of training iterations, and multi-grid setting are kept the same as in the previous work. For testing, the output stride is set to 16, and we do not use multi-scale or horizontal-flipping test augmentation. As shown in Table 5, by only changing the normalization method from BN or GN to our BCN, the mIoU increases by about 1%, which is a significant improvement on the PASCAL VOC dataset. As we strictly follow the hyper-parameters used in the previous work, there could be even more room for improvement if we tuned them to favor BCN, which we do not explore in this paper and leave to future work.

4.4 Object Detection and Segmentation on COCO

Model Method WS AP AP_50 AP_75 AP_l AP_m AP_s
RN50 GN - 38.0 59.1 41.2 49.5 40.9 22.4
RN50 GN ✓ 38.9 60.4 42.1 50.4 42.4 23.5
RN50 BCN ✓ 39.7 60.9 43.1 51.7 43.2 24.0
RN101 GN - 39.7 60.9 43.3 51.9 43.3 23.1
RN101 GN ✓ 41.3 62.8 45.1 53.9 45.2 24.7
RN101 BCN ✓ 41.8 63.4 45.8 54.1 45.6 25.6
RX50 GN ✓ 39.9 61.7 43.4 51.1 43.6 24.2
RX50 BCN ✓ 40.5 62.2 44.2 52.3 44.3 25.1
Table 6: Object detection results on COCO using Faster R-CNN [35] and FPN with different normalization methods. The subscripts 50, 75, l, m, and s denote the IoU thresholds and the large/medium/small object splits.

As introduced in Sec. 3, our BCN can also be used for micro-batch training, which we evaluate in this section by showing detection and segmentation results on COCO [22]. Object detection is a fundamental vision task, yet it runs into memory constraints when large batch sizes are used.

We take our ResNet-50 and ResNet-101 backbones normalized by BCN and pre-trained on ImageNet as the starting points, and fine-tune them on the COCO train2017 dataset. After training, the models are tested on the COCO val2017 dataset. We use 4 GPUs to train all the models, with one training sample per GPU. The learning rate is configured according to the batch size, following the common practice provided in [3, 7]. Specifically, we use the 1X learning rate schedule for Faster R-CNN and the 2X learning rate schedule for Mask R-CNN to get the results reported in this paper. We use FPN [21] and the 4conv1fc bounding box head. We add BCN to the backbone, the bounding box heads, and the mask heads. We keep everything else untouched to maximize the fairness of the comparison. Please see [3, 7] for more details.

Table 4 shows the results of Mask R-CNN [10], comparing our BCN with GN and GN+WS, and Table 6 shows the comparisons on Faster R-CNN [35]. The results shown in the tables are the Average Precision for bounding boxes (AP^b) and for instance segmentation (AP^m). As the tables demonstrate, our BCN is able to outperform the baseline methods by a comfortable margin.

The experiments on COCO differ from the previous results on ImageNet and PASCAL VOC in that the models are trained in the micro-batch setting: each GPU holds only one training sample and the GPUs are not synchronized; even if they were, the batch size would only be 4, which is still not large. The results on ImageNet and PASCAL VOC show that when large-batch training is available, having batch information strongly improves the results. The experiments on COCO demonstrate that even when large batches are not available, an estimate-based batch normalization is also helpful and provides improvements. The improvements over GN+WS show that although WS is able to alleviate the statistical difference issue, it does not fully solve it. However, we do not discard WS when we use BCN, because WS still has a smoothing effect on the loss landscape which improves training from another perspective. Overall, the results in this section demonstrate the importance of keeping models away from elimination singularities when training neural networks, and show that BCN improves results by avoiding them along the training trajectory.

5 Related Work

Deep neural networks advance the state of the art in many computer vision tasks [4, 13, 20, 24, 31, 32, 34, 40, 42, 43, 46, 48]. But deep networks are hard to train. To speed up training, proper model initialization is widely used, as well as data normalization based on assumptions about the data distribution [8, 11]. On top of data normalization and model initialization, Batch Normalization [16] was proposed to ensure certain distributions of the activations so that the normalization effects do not fade away during training. By performing normalization along the batch dimension, Batch Normalization achieves state-of-the-art performances in many tasks in addition to accelerating the training process. When the batch size decreases, however, the performance of Batch Normalization drops dramatically, since the batch statistics are no longer representative of the dataset statistics. Unlike Batch Normalization, which works on the batch dimension, Layer Normalization [2] normalizes data on the channel dimension, and Instance Normalization [41] performs the normalization for each sample individually. Group Normalization [45] also normalizes features on the channel dimension, but it finds a better middle point between Layer Normalization and Instance Normalization.

Batch Normalization, Layer Normalization, Group Normalization, and Instance Normalization are all activation-based normalization methods. Besides them, there are also weight-based normalization methods, such as Weight Normalization [37] and Weight Standardization [33, 14]. Weight Normalization decouples the length and the direction of the weights, while Weight Standardization constrains the weights to have zero mean and unit variance. Weight Standardization narrows the performance gap between Batch Normalization and Group Normalization; therefore, in this paper, we use Weight Standardization together with our proposed method to get all the results.

In this paper, we study normalization methods and elimination singularities [30, 44]. There are also other perspectives from which to understand normalization methods. For example, from the perspective of training robustness, BN is able to make optimization trajectories more robust to parameter initialization [15]. [38, 33] show that normalizations are able to reduce the Lipschitz constants of the loss and the gradients, so training becomes easier and faster. From the angle of model generalization, [28] shows that Batch Normalization relies less on single directions of activations and thus has better generalization properties, and [26] studies the regularization effects of Batch Normalization. [18] also explores length-direction decoupling in BN and Weight Normalization [37]. Other work approaches normalization from the perspective of gradient explosion [47] and learning rate tuning [1].

Our method uses Batch Normalization and Group Normalization at the same time for one layer. Some previous work also uses multiple normalizations or a combined version of normalizations for one layer. For example, SN [25] computes BN, IN, and LN at the same time and uses AutoML [23] to determine how to combine them. SSN [39] uses SparsestMax to get sparse SN. DN [27] proposes a more flexible form to represent normalizations and finds better normalizations. Unlike them, our method is based on analysis from the angle of elimination singularity instead of AutoML, and our normalizations are used together as a composite function rather than linearly adding up the normalization effects in a flat way.

6 Conclusion

In this paper, we approach normalization methods from the perspective of elimination singularities. We study how different normalizations keep their models away from elimination singularities, since getting close to them harms the training of the models. We observe that Batch Normalization (BN) is able to guarantee a far distance from the elimination singularities, while Layer Normalization (LN) and Group Normalization (GN) are unable to keep far distances. We also observe that the situation of LN is worse than that of GN, and that Weight Standardization (WS) is able to alleviate this issue. These findings are consistent with their performances. We notice that the cause of LN and GN being unable to keep models away from the singularities is their lack of batch knowledge. Therefore, to improve their performances, we propose Batch-Channel Normalization (BCN), which adds batch knowledge to channel-normalized models. BCN is able to run, and improves performances, in both large-batch and micro-batch settings. We test it on many popular vision benchmarks. The experimental results show that it is able to outperform the baselines effortlessly.

References

  • [1] S. Arora, Z. Li, and K. Lyu (2019) Theoretical analysis of auto rate-tuning by batch normalization. In International Conference on Learning Representations (ICLR).
  • [2] J. L. Ba, J. R. Kiros, and G. E. Hinton (2016) Layer normalization. arXiv preprint arXiv:1607.06450.
  • [3] K. Chen et al. (2019) MMDetection: open mmlab detection toolbox and benchmark. arXiv preprint arXiv:1906.07155.
  • [4] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille (2015) Semantic image segmentation with deep convolutional nets and fully connected CRFs. In International Conference on Learning Representations (ICLR).
  • [5] L. Chen, G. Papandreou, F. Schroff, and H. Adam (2017) Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587.
  • [6] M. Everingham, S. M. A. Eslami, L. J. V. Gool, C. K. I. Williams, J. M. Winn, and A. Zisserman (2015) The PASCAL visual object classes challenge: a retrospective. International Journal of Computer Vision 111 (1), pp. 98–136.
  • [7] R. Girshick, I. Radosavovic, G. Gkioxari, P. Dollár, and K. He (2018) Detectron. https://github.com/facebookresearch/detectron.
  • [8] X. Glorot and Y. Bengio (2010) Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pp. 249–256.
  • [9] B. Hariharan, P. Arbelaez, L. D. Bourdev, S. Maji, and J. Malik (2011) Semantic contours from inverse detectors. In IEEE International Conference on Computer Vision (ICCV), pp. 991–998.
  • [10] K. He, G. Gkioxari, P. Dollár, and R. Girshick (2017) Mask R-CNN. In IEEE International Conference on Computer Vision (ICCV), pp. 2961–2969.
  • [11] K. He, X. Zhang, S. Ren, and J. Sun (2015) Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In IEEE International Conference on Computer Vision (ICCV), pp. 1026–1034.
  • [12] K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770–778.
  • [13] G. Huang, Z. Liu, and K. Q. Weinberger (2017) Densely connected convolutional networks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269.
  • [14] L. Huang, X. Liu, Y. Liu, B. Lang, and D. Tao (2017) Centered weight normalization in accelerating training of deep neural networks. In IEEE International Conference on Computer Vision (ICCV).
  • [15] D. J. Im, M. Tao, and K. Branson (2016) An empirical analysis of deep network loss surfaces.
  • [16] S. Ioffe and C. Szegedy (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML).
  • [17] S. Ioffe (2017) Batch renormalization: towards reducing minibatch dependence in batch-normalized models. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1945–1953.
  • [18] J. Kohler, H. Daneshmand, A. Lucchi, M. Zhou, K. Neymeyr, and T. Hofmann (2018) Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694.
  • [19] A. Krizhevsky and G. Hinton (2009) Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto.
  • [20] A. Krizhevsky, I. Sutskever, and G. E. Hinton (2012) ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 1097–1105.
  • [21] T. Lin, P. Dollár, R. Girshick, K. He, B. Hariharan, and S. Belongie (2017) Feature pyramid networks for object detection. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2117–2125.
  • [22] T. Lin, M. Maire, S. J. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick (2014) Microsoft COCO: common objects in context. In European Conference on Computer Vision (ECCV), pp. 740–755.
  • [23] C. Liu, B. Zoph, M. Neumann, J. Shlens, W. Hua, L. Li, L. Fei-Fei, A. L. Yuille, J. Huang, and K. Murphy (2018) Progressive neural architecture search. In European Conference on Computer Vision (ECCV), pp. 19–35.
  • [24] J. Long, E. Shelhamer, and T. Darrell (2015) Fully convolutional networks for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3431–3440.
  • [25] P. Luo, J. Ren, and Z. Peng (2018) Differentiable learning-to-normalize via switchable normalization. arXiv preprint arXiv:1806.10779.
  • [26] P. Luo, X. Wang, W. Shao, and Z. Peng (2019) Towards understanding regularization in batch normalization. In International Conference on Learning Representations (ICLR).
  • [27] P. Luo, P. Zhanglin, S. Wenqi, Z. Ruimao, R. Jiamin, and W. Lingyun (2019) Differentiable dynamic normalization for learning deep representation. In International Conference on Machine Learning (ICML), pp. 4203–4211.
  • [28] A. S. Morcos, D. G. Barrett, N. C. Rabinowitz, and M. Botvinick (2018) On the importance of single directions for generalization. arXiv preprint arXiv:1803.06959.
  • [29] V. Nair and G. E. Hinton (2010) Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML).
  • [30] A. E. Orhan and X. Pitkow (2018) Skip connections eliminate singularities. In International Conference on Learning Representations (ICLR).
  • [31] S. Qiao, C. Liu, W. Shen, and A. L. Yuille (2018) Few-shot image recognition by predicting parameters from activations. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  • [32] S. Qiao, W. Shen, Z. Zhang, B. Wang, and A. Yuille (2018) Deep co-training for semi-supervised image recognition. In European Conference on Computer Vision (ECCV).
  • [33] S. Qiao, H. Wang, C. Liu, W. Shen, and A. Yuille (2019) Weight standardization. arXiv preprint arXiv:1903.10520.
  • [34] W. Qiu, F. Zhong, Y. Zhang, S. Qiao, Z. Xiao, T. S. Kim, and Y. Wang (2017) UnrealCV: virtual worlds for computer vision. In Proceedings of the 25th ACM International Conference on Multimedia, pp. 1221–1224.
  • [35] S. Ren, K. He, R. B. Girshick, and J. Sun (2015) Faster R-CNN: towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 91–99.
  • [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) ImageNet large scale visual recognition challenge. International Journal of Computer Vision (IJCV) 115 (3), pp. 211–252.
  • [37] T. Salimans and D. P. Kingma (2016) Weight normalization: a simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pp. 901–909.
  • [38] S. Santurkar, D. Tsipras, A. Ilyas, and A. Madry (2018) How does batch normalization help optimization? In Advances in Neural Information Processing Systems (NeurIPS), pp. 2488–2498.
  • [39] W. Shao, T. Meng, J. Li, R. Zhang, Y. Li, X. Wang, and P. Luo (2019) SSN: learning sparse switchable normalization via SparsestMax. arXiv preprint arXiv:1903.03793.
  • [40] K. Simonyan and A. Zisserman (2015) Very deep convolutional networks for large-scale image recognition. In International Conference on Learning Representations (ICLR).
  • [41] D. Ulyanov, A. Vedaldi, and V. Lempitsky (2016) Instance normalization: the missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022.
  • [42] Y. Wang, L. Xie, C. Liu, S. Qiao, Y. Zhang, W. Zhang, Q. Tian, and A. Yuille (2017) SORT: second-order response transform for visual recognition. In IEEE International Conference on Computer Vision (ICCV).
  • [43] Y. Wang, L. Xie, S. Qiao, Y. Zhang, W. Zhang, and A. L. Yuille (2018) Multi-scale spatially-asymmetric recalibration for image classification. In European Conference on Computer Vision (ECCV), pp. 509–525.
  • [44] H. Wei, J. Zhang, F. Cousseau, T. Ozeki, and S. Amari (2008) Dynamics of learning near singularities in layered networks. Neural Computation 20 (3), pp. 813–843.
  • [45] Y. Wu and K. He (2018) Group normalization. In European Conference on Computer Vision (ECCV), pp. 3–19.
  • [46] C. Yang, L. Xie, S. Qiao, and A. Yuille (2018) Knowledge distillation in generations: more tolerant teachers educate better students. AAAI.
  • [47] G. Yang, J. Pennington, V. Rao, J. Sohl-Dickstein, and S. S. Schoenholz (2019) A mean field theory of batch normalization. In International Conference on Learning Representations (ICLR).
  • [48] Z. Zhang, S. Qiao, C. Xie, W. Shen, B. Wang, and A. L. Yuille (2018) Single-shot object detection with enriched semantics. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 5813–5821.