Since AlexNet was proposed in krizhevsky2012imagenet, Deep Convolutional Neural Networks (DCNNs) have been a popular method for vision tasks including image classification deng2009imagenet, object detection lin2014microsoft and semantic segmentation pascal-voc-2012. DCNNs are usually composed of convolutional layers, normalization layers, activation layers, etc. Normalization layers are important for improving performance and speeding up training.
Batch Normalization (BN) ioffe2015batch was one of the early normalization methods and is widely used. It normalizes the feature map with the mean and variance calculated along the batch, height and width dimensions of a feature map, and then re-scales and re-shifts the normalized feature map to preserve the representation ability of the DCNN. Based on BN, many normalization methods for other tasks have been proposed. For example, Layer Normalization (LN) calculates the statistics along the channel, height and width dimensions for Recurrent Neural Networks (RNNs) ba2016layer; salimans2016weight. Divisive Normalization, which includes BN and LN as special cases, was proposed for image classification, language modeling and super-resolution ren2016normalizing. Instance Normalization (IN), where the statistics are calculated along the height and width dimensions, was proposed for fast stylization ulyanov2016instance. Instead of calculating the statistics from data, Normalization Propagation estimates them data-independently from the distribution in layers arpit2016normalization. Group Normalization (GN) divides the channels into groups and calculates the statistics for each grouped channel, height and width dimension, showing stability across batch sizes wu2018group. Positional Normalization (PN) calculates the statistics along the channel dimension for generative networks li2019positional.
Among these normalization methods, BN usually achieves good performance at medium and large batch sizes. However, its performance degrades at small batch sizes, as shown in previous works wu2018group; ioffe2017batch. Furthermore, as shown in our experiments, BN's performance saturates at extremely large batch sizes, e.g., 128 images per worker. GN is more stable across batch sizes, but slightly under-performs BN at medium and large batch sizes. Other normalization methods, including IN, LN and PN, perform well in specific tasks, but usually generalize less well to, and under-perform in, other vision tasks. As reviewed in Related Work, many works have proposed new normalization methods aiming at good performance, stability and generalizability.
In this paper, we propose Batch Group Normalization (BGN) which, unlike the reviewed works that use additional trainable parameters, extra computation or additional information, is parameter- and computation-efficient. It is well known that mini-batch training usually performs better than both single-batch training (a single image example is the DCNN input per iteration) and all-batch training (all image examples are the DCNN input per iteration): single-batch training yields noisy gradients, while all-batch training may not yield representative gradients (each image example contributes a gradient with a different direction, so summing them all gives a confused gradient). Inspired by this fact, we argue that the number of feature instances used for statistic calculation in normalization should also be moderate, i.e., the degraded/saturated performance of BN at small/extremely large batch sizes is due to noisy/confused statistic calculation.
Figure 1: (top) The difference in statistic calculation between BN, IN, LN, GN, PN and BGN. Each subplot shows a feature map tensor, with N, C and (H, W) as the batch, channel and spatial axes. The pixels in blue are used to compute the statistics. This figure is inspired by wu2018group. (bottom) The Top1 accuracy of ResNet-50 trained on ImageNet with different batch sizes, using BN, IN, LN, GN, PN and BGN as the normalization layer.
Hence, BGN is proposed to alleviate the performance degradation/saturation of BN at small/extremely large batch sizes by using the group technique from GN. It merges the channel, height and width dimensions into a new dimension, divides the new dimension into feature groups, and calculates the statistics across the whole mini-batch and each feature group. A hyper-parameter controls the level of feature division and supplies proper statistics for different batch sizes. The difference between BN, IN, LN, GN, PN and BGN, with respect to the dimensions along which the statistics are computed, is illustrated in Fig. 1(top). The Top1 accuracy of training ResNet-50 on ImageNet with BN, IN, LN, GN, PN and BGN as the normalization layer is shown in Fig. 1(bottom). Without adding trainable parameters, using extra information or requiring extra computation, BGN achieves both good performance and stability at different batch sizes. We further test on Neural Architecture Search (NAS), adversarial learning, Few Shot Learning (FSL) and Unsupervised Domain Adaptation (UDA), where BGN outperforms BN, IN, LN, GN and PN, showing its good generalizability.
Why does normalization work? The effectiveness of BN was originally attributed to reducing internal covariate shift, where the distribution of each layer's input changes, so that a lower learning rate and careful parameter initialization are essential to guarantee good training in DCNNs without normalization ioffe2015batch. Other works have investigated the reasons for the success of normalization. For example, Santurkar et al. santurkar2018does proposed that the effectiveness of BN has little to do with internal covariate shift; instead, BN makes the optimization landscape smoother and hence introduces stable gradients and faster training. Bjorck et al. bjorck2018understanding proposed that allowing larger learning rates is the main reason for BN to achieve faster convergence and better generalization. Theoretical support was provided for BN reducing the need to tune learning rates arora2019theoretical. Luo et al. demonstrated that BN and regularization share the same traits luo2018towards. A quantitative analysis was provided to compare gradient descent with and without BN on Ordinary Least Squares (OLS) cai2019quantitative. In fan2017revisit, it was shown with fuzzy neural networks that BN estimates the correct bias induced by the generalized Hamming distance. In contrast, BN was proved to be the cause of gradient explosion for DCNNs without residual learning yang2018a. In li2019understanding, the disharmony between dropout and BN was explained from theoretical and statistical aspects. It was claimed that BN is not unique in providing stable training, higher learning rates, accelerated convergence and improved generalization, and can be replaced by better initialization zhang2018residual; de2020batch. Normalization layers were shown to introduce a stable gradient magnitude when training Long Short-Term Memory (LSTM) networks hou2019normalization. Even though many interesting theories have been proposed to explain the effectiveness of BN, there is still a lack of consensus.
Improvements. Previous works can be further improved. Centered WN added a learnable parameter to adjust the weight norm in Weight Normalization (WN) huang2017centered. Recurrent BN applied BN not only to the input-to-hidden but also to the hidden-to-hidden transformation of RNNs cooijmans2016recurrent. Batch Renormalization was proposed to decrease the dependence of BN on the size of mini-batches ioffe2017batch. A Riemannian approach was combined with BN on the space of weight vectors, which improves BN's performance on various DCNN architectures and datasets cho2017riemannian. EvalNorm was proposed to estimate corrected normalization statistics during evaluation to address the performance degradation of BN singh2019evalnorm. Moving Average BN proposed using the batch statistics in the backward propagation of traditional BN yan2019towards. Based on LN, Adaptive Normalization modified the bias and gain by using a new transformation function xu2019understanding. Root Mean Square LN abandoned the re-centering and kept the re-scaling in LN zhang2019root.
Researchers are also looking into non-linear normalizing techniques with extra computation. For example, ZCA was used to replace the centering and scaling calculations in BN, resulting in Decorrelated BN huang2018decorrelated. Iterative Normalization used Newton iterations to avoid the eigen-decomposition in Decorrelated BN, giving much more efficient whitening huang2019iterative. Instead of normalizing in the spatial space, Spectral Normalization was proposed to normalize the spectral norm of weights and was used in Generative Adversarial Networks (GANs) miyato2018spectral and adversarial training farnia2018generalizable.
Learning-to-normalize is also explored. The normalization methods introduced above can be combined to achieve better performance and generalization. Batch-Instance Normalization (BIN) was proposed for adaptively style-invariant neural networks, using a trainable parameter to combine the feature maps calculated from BN and IN nam2018batch. Switchable Normalization (SN) applied trainable parameters to the mean and variance calculated in BN, IN and LN to learn new statistics luo2018differentiable; luo2019switchable. Sparse SN used SparseMax to make the trainable parameters in SN sparse shao2019ssn. In Instance-Level Meta Normalization jia2019instance, feature feed-forward and gradient back-propagation were used to learn the normalization parameters.
Others. Normalization can also be used to achieve a task directly. For example, real-time arbitrary style transfer was achieved by aligning the mean and variance of the content features to those of the style features huang2017arbitrary. Domain adaptation was achieved by changing the statistics from the source domain to the target domain li2018adaptive. New tasks can be solved by using BN to combine learned tasks data2018interpolating. Normalization methods have also been proposed specifically for cross-domain tasks li2019efficient; wang2019transferable, global covariance pooling networks li2018towards, multitask learning chen2018gradnorm; deecke2018mode, UDA chang2019domain, semantic image synthesis park2019semantic, medical applications zhou2019normalization; zhou2019u; zhou20193d, and scene text detection xu2019geometry. Kalman Normalization (KN) was proposed to combine internal representations across multiple DCNN layers wang2018kalman. Finally, instead of the $L^2$ norm, the $L^1$ and $L^\infty$ norms were proposed for numerical stability in low-precision calculation hoffer2018norm.
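In a DCNN with $L$ layers, the input feature map $F_0$ usually has four dimensions, $F_0 \in \mathbb{R}^{N \times C \times H \times W}$, where $N$, $C$, $H$ and $W$ are the batch, channel, height and width dimensions respectively. For simplicity, $n$, $c$, $h$ and $w$ denote the corresponding batch, channel, height and width indices and are not redefined below. The feature map at the $l$-th layer is calculated as:
$$F_l = A\left(\mathrm{Norm}\left(\mathrm{Conv}(F_{l-1};\, \mathbf{W}_l, \mathbf{b}_l);\, \gamma_l, \beta_l\right)\right),$$
where $\mathbf{W}_l$ and $\mathbf{b}_l$ are the trainable weight and bias parameters in convolutional layers, $\gamma_l$ and $\beta_l$ are the trainable re-scale and re-shift parameters in normalization layers, $A(\cdot)$ is the activation function, $\mathrm{Norm}(\cdot)$ is the normalization function and $\mathrm{Conv}(\cdot)$ is the convolution function.
A typical normalization layer includes four steps: (1) divide the feature map into feature groups; (2) calculate the mean $\mu$ and variance $\sigma^2$ statistics for each feature group; (3) normalize each feature group with the calculated statistics; (4) re-scale and re-shift the normalized feature map to maintain the DCNN representation ability. For example, in BN, the feature map is divided along the channel dimension, and $\mu_c$ and $\sigma_c^2$ are calculated along the batch, height and width dimensions as:
$$\mu_c = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{n,c,h,w}, \qquad \sigma_c^2 = \frac{1}{NHW} \sum_{n=1}^{N} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(F_{n,c,h,w} - \mu_c\right)^2.$$
Then the feature map is normalized as:
$$\hat{F}_{n,c,h,w} = \frac{F_{n,c,h,w} - \mu_c}{\sqrt{\sigma_c^2 + \epsilon}},$$
where $\epsilon$ is a small number added for division stability. In order to maintain the DCNN representation ability, extra trainable parameters are added for each feature channel:
$$F_{n,c,h,w} = \gamma_c \hat{F}_{n,c,h,w} + \beta_c.$$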
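The four steps above, instantiated for BN in training mode, can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the function name `batch_norm`, the default `eps` and the toy tensor shape are assumptions made for the example, and running statistics for the test stage are omitted.

```python
import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """Training-mode BN sketch for a feature map of shape (N, C, H, W)."""
    # Steps 1-2: one feature group per channel; statistics are taken
    # over the batch, height and width axes.
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    # Step 3: normalize each group with its own statistics.
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Step 4: per-channel re-scale and re-shift.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

# Toy input (shape chosen only for illustration).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))
y = batch_norm(x, gamma=np.ones(4), beta=np.zeros(4))
```

With the re-scale and re-shift parameters at 1 and 0, each output channel has approximately zero mean and unit variance over the batch and spatial axes.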
By including multiple examples in the statistic calculation, BN enjoys good performance at medium and large batch sizes and good generalizability to multiple vision tasks, e.g. NAS. However, its performance degrades dramatically at small batch sizes, as in our ImageNet experiment. In order to address this shortcoming, GN includes the grouped channel dimension in the statistic calculation:
$$\mu_{n,g} = \frac{1}{MHW} \sum_{c=(g-1)M+1}^{gM} \sum_{h=1}^{H} \sum_{w=1}^{W} F_{n,c,h,w}, \qquad \sigma_{n,g}^2 = \frac{1}{MHW} \sum_{c=(g-1)M+1}^{gM} \sum_{h=1}^{H} \sum_{w=1}^{W} \left(F_{n,c,h,w} - \mu_{n,g}\right)^2,$$
where $g \in \{1, \dots, G\}$ is the group index, $G$ is a hyper-parameter, the group number, $M = \lfloor C / G \rfloor$ is the number of channels per group, and $\lfloor \cdot \rfloor$ is floor division. GN is stable across batch sizes; however, its performance is slightly lower than BN at medium and large batch sizes and its generalizability to other vision tasks is weaker than that of BN. Beyond these known phenomena, our experiments show that BN saturates at extremely large batch sizes.
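As an illustration of the GN statistic calculation described above, the following NumPy sketch groups the channels and computes the statistics per sample and per group. The function name `group_norm`, the default `eps`, the toy shapes and the requirement that the channel count be divisible by the group number are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """GN sketch for a feature map of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    # Assumption of this sketch: C is divisible by the group number.
    assert c % num_groups == 0
    grouped = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics per sample and per channel group, over the grouped
    # channel, height and width axes (the batch axis is not involved).
    mean = grouped.mean(axis=(2, 3, 4), keepdims=True)
    var = grouped.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8, 3, 3))
y_batch = group_norm(x, 2, np.ones(8), np.zeros(8))
y_single = group_norm(x[:1], 2, np.ones(8), np.zeros(8))
```

Because no statistic crosses the batch axis, `y_single` matches the first sample of `y_batch`, which is the source of GN's stability with respect to batch size.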
We argue that the degradation/saturation of BN at small/extremely large batch sizes is caused by noisy/confused statistic calculation. A similar effect exists in mini-batch training, where single-batch/all-batch training is usually worse than mini-batch training because noisy/confused gradients are calculated. To alleviate this, we propose BGN, where the number of feature instances used for statistic calculation is kept at a proper level by using the group technique in GN. In more detail, we first merge the channel, height and width dimensions into a new dimension and obtain $F \in \mathbb{R}^{N \times D}$, where $D = C \cdot H \cdot W$. The mean and variance are calculated along the batch and new dimension as:
$$\mu_g = \frac{1}{NS} \sum_{n=1}^{N} \sum_{d=(g-1)S+1}^{gS} F_{n,d}, \qquad \sigma_g^2 = \frac{1}{NS} \sum_{n=1}^{N} \sum_{d=(g-1)S+1}^{gS} \left(F_{n,d} - \mu_g\right)^2,$$
where $g \in \{1, \dots, G\}$ is the group index, $G$ is the number of groups into which the new dimension is divided and is a hyper-parameter, and $S = \lfloor D / G \rfloor$ is the number of instances inside each divided feature group. When the batch size is small, a small $G$ is used to combine the whole new dimension into the statistic calculation to avoid noisy statistics, while when the batch size is large, a large $G$ is used to split the new dimension into small pieces for calculating the statistics to avoid confused statistics. $\gamma$ and $\beta$ are used in the same way as in BN. In BN, the $\mu$ and $\sigma$ used at test time are the moving averages of those in the training stage. The proposed BGN uses this policy as well.
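The BGN statistic calculation described above can be sketched in NumPy as follows: the channel, height and width axes are merged, the merged axis is split into groups, and the statistics are taken over the batch axis and each group. This is a hedged sketch, not the authors' code: the function name `batch_group_norm`, the default `eps`, the divisibility assumption and the per-channel application of the re-scale and re-shift parameters (read from "used in the same way as BN") are assumptions, and the moving averages used at test time are omitted.

```python
import numpy as np

def batch_group_norm(x, num_groups, gamma, beta, eps=1e-5):
    """Training-mode BGN sketch for a feature map of shape (N, C, H, W)."""
    n, c, h, w = x.shape
    d = c * h * w  # merged channel-height-width dimension
    # Assumption of this sketch: the merged dimension is divisible
    # by the group number.
    assert d % num_groups == 0
    grouped = x.reshape(n, num_groups, d // num_groups)
    # Statistics over the whole mini-batch and each feature group.
    mean = grouped.mean(axis=(0, 2), keepdims=True)
    var = grouped.var(axis=(0, 2), keepdims=True)
    x_hat = ((grouped - mean) / np.sqrt(var + eps)).reshape(n, c, h, w)
    # Re-scale and re-shift per channel, as in BN.
    return gamma.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=1.0, size=(2, 4, 3, 3))
y_one_group = batch_group_norm(x, 1, np.ones(4), np.zeros(4))
y_per_channel = batch_group_norm(x, 4, np.ones(4), np.zeros(4))
```

In this sketch, setting the group number equal to the channel count makes each group coincide with one channel, so the statistics reduce to those of BN, while a group number of 1 pools the entire mini-batch tensor into a single pair of statistics.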
Relation to General Batch Group Normalization (GBGN). GBGN summers2019four could be mistaken for a similar work to BGN, so we clarify our contributions as follows: 1. We are the first to report the saturation of BN at extremely large batch sizes. 2. We are the first to propose that the degradation/saturation of BN at small/extremely large batch sizes is caused by noisy/confused statistic calculation. 3. We propose grouping along the channel, height and width dimensions to compensate (in GBGN, the grouping is along the batch and channel dimensions). 4. We offer extensive experiments on image classification, NAS, adversarial learning, FSL and UDA to validate our claims.
Among the reviewed normalization methods, BN, IN, LN, GN and PN are suitable baselines for BGN, as other normalization methods usually use additional trainable parameters, non-linear normalization or information from multiple layers, which are orthogonal to BGN and can be combined with it to improve its performance further. (A re-injection path was used in the original PN li2019positional; for a fair comparison, it is not included in this paper.) BGN is validated on several applications, including image classification, NAS, adversarial training, FSL and UDA.
Image Classification on ImageNet with ResNet-50
Image classification is one of the applications used to validate BGN. We focus on ImageNet krizhevsky2012imagenet which contains training images and validation images. The model used is ResNet-50 he2016deep.
8 GPUs are used in all ImageNet experiments. The gradients used for backpropagation are averaged across the 8 GPUs, while the mean and variance used in BN and BGN are calculated within each GPU. The re-scale and re-shift parameters are initialized as 1 and 0 respectively, while all other trainable parameters are initialized as in he2016deep. Training runs for 120 epochs, with the learning rate decayed at the 30th, 60th and 90th epochs. The initial learning rates for the experiments with batch sizes of 128, 64, 32, 16, 8, 4 and 2 are 0.4, 0.2, 0.1, 0.05, 0.025, 0.0125 and 0.00625 respectively, following goyal2017accurate. Stochastic Gradient Descent (SGD) is used as the optimizer. Weight decay is applied to all trainable parameters. For validation, each image is center-cropped, and the Top1 accuracy is reported. Following wu2018group, the median accuracy over the last five epochs is reported to reduce the random variance. All experiments share the same implementation; only the normalization layer is swapped among BN, IN, LN, GN, PN and BGN.
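The initial learning rates listed above are proportional to the batch size, which is consistent with a linear scaling rule. The short sketch below reproduces the listed values; the function name and the choice of the (batch size 32, learning rate 0.1) pair as the reference point are assumptions made for illustration.

```python
def initial_learning_rate(batch_size, base_lr=0.1, base_batch_size=32):
    """Linear scaling: the learning rate grows in proportion to the batch size."""
    return base_lr * batch_size / base_batch_size

# Batch sizes used in the ImageNet experiments.
batch_sizes = (128, 64, 32, 16, 8, 4, 2)
learning_rates = [initial_learning_rate(bs) for bs in batch_sizes]
```

Halving the batch size halves the initial learning rate, matching the sequence from 0.4 down to 0.00625 stated in the text.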
Hyper-parameter $G$: to explore the hyper-parameter $G$, BGN with a group number of 512, 256, 128, 64, 32, 16, 8, 4, 2 and 1 respectively is used as the normalization layer in ResNet-50 for ImageNet classification. The largest (limited by GPU memory) and smallest batch sizes in our experiments, 128 and 2, are tested. The Top1 accuracy on the validation dataset is shown in Tab. 1. We can see that a large $G$ of 512 is suitable for the large batch size of 128, while a small $G$ of 1 is suitable for the small batch size of 2. This supports our claim that a proper number of feature instances is important for the statistic calculation in normalization. When the batch size is large/small, a large/small $G$ is used to split/combine the new dimension to avoid confused/noisy statistic calculation. The accuracy variance for the batch size of 128 is smaller than that for the batch size of 2, indicating either that saturation is less serious than degradation in normalization or that a batch size of 128 has not yet reached the saturation edge.
Comparison with baselines: BN, IN, LN, GN, PN, GBGN and BGN are used as the normalization layer in ResNet-50, with batch sizes of 128, 64, 32, 16, 8, 4 and 2 respectively. The group number in GN is set to 32, which was claimed to be the optimal configuration for GN wu2018group. For GBGN, based on our experience, the group number for the channel dimension is set to 32 while that for the batch dimension is set to the batch size. With this setting, GBGN equals BGN at moderate batch sizes; hence GBGN is only included in the ImageNet experiments and is omitted from the following experiments, where moderate batch sizes are used. The group number in BGN is set to 512, 256, 128, 64, 16, 2 and 1 for batch sizes of 128, 64, 32, 16, 8, 4 and 2 respectively. We choose $G$ for the largest and smallest batch sizes according to Tab. 1, and choose $G$ for the other batch sizes by interpolation. The Top1 accuracy is shown in Tab. 2. We can see that the proposed BGN out-performs all previous methods, including BN, IN, LN, GN, PN and GBGN, at all batch sizes. Specifically, BN approaches BGN's performance at large batch sizes, but its performance degrades quickly at small batch sizes. GBGN was proposed for small batch sizes, yet it under-performs BGN at the batch size of 2, indicating the importance of introducing the whole channel, height and width dimensions to compensate for the noisy statistic calculation. IN does not perform well overall on ImageNet classification. BGN also achieves a higher average Top1 accuracy than LN, GN and PN.
Image Classification on CIFAR-10 with NAS
Besides manually designed and regular DCNNs, BGN is applicable to automatically designed and less regular ones as well. We experiment with cell-based architectures designed automatically with NAS, specifically DARTS liu2018darts and Multi-agent Neural Architecture Search (MANAS) MANAS2019. For DARTS, we experiment with normalization methods for both the search and the training. For MANAS, we experiment with normalization methods for the training only.
DARTS and MANAS share the same search space. The family of architectures searched (the search space; see Fig. 2) is composed of a sequence of cells, where each cell is a directed acyclic graph with nodes representing feature maps and edges representing network operations, e.g. convolutions or pooling layers (MANAS2019 and references therein). Given a set of possible operations, DARTS encodes the architecture search space with continuous parameters to form a one-shot model and performs the search by training the one-shot model with bi-level optimization, where the model weights and architecture parameters are optimized alternately with training and validation data. MANAS uses a multi-agent learning approach (the search strategy) to find the combination of operations leading to the best-performing architecture according to the validation accuracy.
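To make the continuous encoding concrete, the sketch below shows one edge of a one-shot model in the DARTS style: the edge output is a softmax-weighted sum of all candidate operations, so the architecture parameters can be optimized by gradient descent. The candidate operations, the 1-D toy input and the simplified argmax discretization are hypothetical choices for illustration and do not reproduce the actual DARTS search space or its derivation rule.

```python
import numpy as np

def softmax(a):
    e = np.exp(a - a.max())
    return e / e.sum()

# Hypothetical candidate operations acting on a 1-D feature vector.
CANDIDATE_OPS = {
    "identity": lambda x: x,
    "zero": lambda x: np.zeros_like(x),
    "avg_pool_3": lambda x: np.convolve(x, np.ones(3) / 3.0, mode="same"),
}

def mixed_op(x, alpha):
    """Edge output of the one-shot model: softmax(alpha)-weighted sum of ops."""
    weights = softmax(alpha)
    return sum(wt * op(x) for wt, op in zip(weights, CANDIDATE_OPS.values()))

def discretize(alpha):
    """Simplified derivation step: keep the highest-weighted operation."""
    return list(CANDIDATE_OPS)[int(np.argmax(alpha))]

x = np.array([1.0, 2.0, 3.0, 4.0])
out_uniform = mixed_op(x, np.zeros(3))                    # equal mix of all ops
out_peaked = mixed_op(x, np.array([20.0, -20.0, -20.0]))  # close to identity
chosen = discretize(np.array([0.1, -1.0, 2.0]))
```

During the search, `alpha` would be updated on validation data and the operation weights on training data, in alternation, as described above.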
DARTS training configuration: we follow the same experiment setting as in liu2018darts. We replace the BN layers in DARTS with IN, LN, GN, PN and BGN in both search and evaluation stage. We search for 8 cells in epochs with batch size and initial number of channels as 16. We use SGD to optimize the model weights with initial learning rate , momentum and weight decay . Adam kingma2014adam is used to optimize architecture parameters with initial learning rate , momentum and weight decay . We use network of 20 cells and 36 initial channels for evaluation to ensure a comparable model size as other baseline models. We use the whole training set to train the model for 600 epochs with batch size 96 to ensure convergence. For GN, we use in wu2018group while for BGN, we use following Tab. 2. Other hyper-parameters are set the same as the ones in the search stage.
The best 20-cell architecture searched on CIFAR-10 by DARTS is trained from scratch with the same normalization method that was used during the search phase. The validation accuracy of each method is reported in Tab. 3. We can see that IN and LN fail to converge, while BGN out-performs GN and PN significantly and out-performs BN slightly. The accuracy of BN is from our re-implementation of liu2018darts.
MANAS training configuration: a single GPU is used to train the searched neural architectures (searched with BN), with the normalization layers replaced by BN, IN, LN, GN, PN and BGN. For GN, we use the best configuration in wu2018group while for BGN, we use . The network training protocol is the same as in MANAS2019, with the following hyperparameters: batch size, epochs , cutout length, drop path probability, initial channels , Cross-Entropy loss, SGD optimizer, learning rate decayed from to , momentum , weight decay .
The best 20-cell architecture searched on CIFAR-10 by MANAS is retrained from scratch with different normalization methods in place of the original BN used during the search phase. The validation accuracy of each method is reported in Tab. 4. We can see that IN, LN and PN fail to converge, while BGN out-performs GN significantly and under-performs BN only slightly. It is worth noting that BN is used as the normalization layer in the neural architecture search phase, hence BN is at an advantage in this comparison.
The DARTS experiment shows that BGN generalizes to NAS for both search and evaluation. The MANAS experiment shows that BGN generalizes to less regular neural architectures found by a NAS method.
Adversarial Training on CIFAR-10
DCNNs are known to be vulnerable to maliciously perturbed examples, known as adversarial attacks. Adversarial training was proposed to counter this problem. In this experiment, we apply BGN to adversarial training and compare its performance to BN, IN, LN, GN and PN.
Implementation details: the WideResNet BMVC2016_87 with the depth set as 10 and the wide factor set as 2 is used for image classification tasks on the CIFAR-10. The neural network is trained and evaluated against a four-step Projected Gradient Descent (PGD) attack. For the PGD attack, we set the step size as , and the maximum perturbation norm as 0.0157. 200 epochs are trained until convergence. Due to the specialty of adversarial training, is used in GN and BGN. It will divide images into patches, which can help to improve the robustness by breaking the correlation of adversarial attacks in different image blocks and constraining the adversarial attacks on the features within a limited range. This effect holds some similarity to the spectral normalization in farnia2018generalizable. In the experiment, we use the Adam optimizer with a learning rate of 0.01.
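For readers unfamiliar with the attack, a four-step PGD can be sketched on a toy logistic model as follows. This is only an illustration under stated assumptions: the logistic model stands in for the WideResNet, the perturbation bound is interpreted as an $L^\infty$ bound, the value of `step_size` is hypothetical, and the random initialization used in some PGD variants is omitted.

```python
import numpy as np

def loss_and_grad(x, y, w, b):
    """Binary cross-entropy of a logistic model and its gradient w.r.t. x."""
    p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
    loss = -(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return loss, (p - y) * w

def pgd_attack(x, y, w, b, epsilon=0.0157, step_size=0.005, steps=4):
    """Multi-step PGD under an assumed L-infinity bound of epsilon."""
    x_adv = x.copy()
    for _ in range(steps):
        _, grad = loss_and_grad(x_adv, y, w, b)
        x_adv = x_adv + step_size * np.sign(grad)         # gradient ascent step
        x_adv = np.clip(x_adv, x - epsilon, x + epsilon)  # project into the ball
        x_adv = np.clip(x_adv, 0.0, 1.0)                  # keep valid pixel range
    return x_adv

rng = np.random.default_rng(0)
x = rng.uniform(0.1, 0.9, size=16)  # toy "image" with pixel values in [0, 1]
w = rng.normal(size=16)             # toy model weights
x_adv = pgd_attack(x, y=1.0, w=w, b=0.0)
```

In adversarial training, examples such as `x_adv` would be generated on the fly and used as training inputs, and robust accuracy would be measured on attacked test images.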
The robust and clean accuracy of training WideResNet with BN, IN, LN, GN, PN and BGN as the normalization layer are shown in Tab. 5. The robust accuracy is more important than the clean accuracy in judging an adversarial network. PN experiences convergence difficulty and fails to converge. BGN out-performs BN and IN with a certain margin and out-performs LN and GN significantly.
Few Shot Learning
We evaluate BGN on the FSL task. FSL aims to train models capable of recognizing new, previously unseen categories using only limited training samples. A training dataset with sufficient annotated samples comprises the base categories. The test dataset contains novel classes, each of which is associated with only a few labelled samples; these labelled samples compose the support set, while the remaining unlabelled samples compose the query set and are used for evaluation (see Fig. 3). The problem is named after the number of novel classes (ways) and the number of labelled samples per class (shots), for example 5-way 1-shot FSL classification.
Implementation details: we experiment with the imprinted weights model qi2018low, which is one of the state-of-the-art metric-based FSL approaches and is widely used as a baseline in the current FSL community lifchitz2019dense; su2019boosting; gidaris2018dynamic. At training time, a cosine classifier is learned on top of the feature extraction layers, and each column of the classifier weights can be regarded as a prototype for the respective class. At test time, a new class prototype (a new column of the classifier weights) is defined by averaging the feature representations of the support images, and the unlabelled images are classified via a nearest neighbor strategy. We test different settings, including 5-way 1-shot and 5-way 5-shot, with the ResNet-12 backbone oreshkin2018tadam on miniImageNet vinyals2016matching. We use the training protocol described in gidaris2018dynamic: our model is optimized using SGD with Nesterov momentum, a weight decay of 0.0005, a mini-batch size of 256 and 60 epochs. All input images were resized to a common resolution. The learning rate was initialized to 0.1, and changed to 0.006, 0.0012 and 0.00024 at the 20th, 40th and 50th epochs respectively.
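The test-time procedure described above (averaging support features into a class prototype, then nearest-neighbor classification under cosine similarity) can be sketched as follows. The function names, the L2 normalization details and the synthetic features standing in for backbone outputs are assumptions of this sketch rather than the reference implementation of qi2018low.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def imprint_prototypes(support_features, support_labels, num_classes):
    """One prototype per novel class: the mean of its normalized support features."""
    feats = l2_normalize(support_features)
    protos = np.stack([feats[support_labels == k].mean(axis=0)
                       for k in range(num_classes)])
    return l2_normalize(protos)

def classify(query_features, prototypes):
    """Nearest prototype under cosine similarity."""
    scores = l2_normalize(query_features) @ prototypes.T
    return scores.argmax(axis=1)

# Synthetic 5-way 5-shot episode; real features would come from the backbone.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(5, 16)
support_labels = np.repeat(np.arange(5), 5)
support = centers[support_labels] + 0.1 * rng.normal(size=(25, 16))
query_labels = np.repeat(np.arange(5), 3)
query = centers[query_labels] + 0.1 * rng.normal(size=(15, 16))

prototypes = imprint_prototypes(support, support_labels, num_classes=5)
predictions = classify(query, prototypes)
```

In the 1-shot setting, each prototype is simply the normalized feature of the single support image of that class.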
The mean accuracy on the 5-way 1-shot and 5-way 5-shot tasks, obtained by training Imprinted Weights on miniImageNet with the normalization layers replaced by BN, IN, LN, GN, PN and BGN, is shown in Tab. 6. We can see that BGN out-performs BN slightly and out-performs IN, LN, GN and PN significantly, indicating the generalizability of BGN when only very limited labelled data is available.
Table 6: Results on miniImageNet for Imprinted Weights with ResNet-12 as the backbone. The normalization layer is replaced by BN, IN, LN, GN, PN and BGN. The mean accuracy over 600 randomly generated test episodes is reported with 95% confidence intervals.
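A mean accuracy with a 95% confidence interval over test episodes can be computed as sketched below. The normal approximation (the factor 1.96) and the function name are assumptions; the text does not state how the intervals were derived, and the accuracies in the example are synthetic.

```python
import numpy as np

def mean_with_ci95(episode_accuracies):
    """Mean accuracy and half-width of a normal-approximation 95% interval."""
    acc = np.asarray(episode_accuracies, dtype=float)
    half_width = 1.96 * acc.std(ddof=1) / np.sqrt(acc.size)
    return acc.mean(), half_width

# Synthetic accuracies for 600 episodes (illustrative values only).
episodes = [0.5] * 300 + [0.7] * 300
mean_acc, ci95 = mean_with_ci95(episodes)
```

The result would be reported as `mean_acc` plus or minus `ci95`.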
Unsupervised Domain Adaptation on Office-31
UDA aims to learn models on a target domain while annotations are only accessible in a related source domain. Normalization layers have the effect of aligning feature distributions and reducing the domain gap li2016revisiting; cariucci2017autodial. We evaluate BGN and other normalization layers on a widely adopted UDA benchmark, Office-31 saenko2010adapting, which consists of 4110 images belonging to 31 classes, with three different domains: Amazon, Webcam and Digital SLR camera (DSLR). CAN kang2019contrastive is adopted as our model, with the original BN replaced by different normalization layers.
Implementation details: we follow the officially released implementation of CAN and use an ImageNet-pretrained ResNet-50 as the model's backbone. For tasks d→a (from domain DSLR to Amazon), w→a and w→d, the hyper-parameter $G$ is set to 512. For a→d and a→w, we reduce $G$ to 1 and 8 respectively, as the backgrounds in the source domain Amazon are totally white and may result in noisy statistics when the group size is small. For d→w, $G$ is set to 32. We use the Adam optimizer kingma2014adam to optimize our model. The learning rate is set to 0.001 and exponential learning rate decay is applied with decay rate 0.1 and decay step 25. The mini-batch size is set to 30. The training stops when the distance between the centers of the source and target features is smaller than 0.001.
The results of BN, IN, LN, GN, PN and BGN are summarized in Tab. 7. We can see that BGN outperforms other normalization layers in most adaptation tasks, especially in w→a.
BGN is proposed with good performance, stability and generalizability, and without using additional trainable parameters, information across multiple layers or iterations, or extra computation. BGN alleviates the noisy/confused statistic calculation in BN by adaptively introducing feature instances from the grouped (channel, height and width) dimensions, and uses a hyper-parameter to control the size of the divided feature groups. It is intuitive to implement, is orthogonal to many methods reviewed in Related Work, and can be used in addition to them to further improve performance.