Standardizing weights to accelerate micro-batch training
In this paper, we propose Weight Standardization (WS) to accelerate deep network training. WS targets the micro-batch training setting, where each GPU typically holds only 1-2 images. Micro-batch training is hard because small batch sizes are not enough for training networks with Batch Normalization (BN), while other normalization methods that do not rely on batch knowledge still have difficulty matching the performance of BN in large-batch training. WS solves this problem: when used with Group Normalization and trained with 1 image/GPU, WS matches or outperforms BN trained with large batch sizes, with only 2 more lines of code. In micro-batch training, WS significantly outperforms other normalization methods. WS achieves these results by standardizing the weights in the convolutional layers, which we show smooths the loss landscape by reducing the Lipschitz constants of the loss and the gradients. The effectiveness of WS is verified on many tasks, including image classification, object detection, instance segmentation, video recognition, semantic segmentation, and point cloud recognition. The code is available here: https://github.com/joe-siyuan-qiao/WeightStandardization.
Most state-of-the-art deep networks use Batch Normalization (BN) in their architectures because BN in most cases accelerates training and helps the models converge to better solutions. BN stabilizes training by controlling the first two moments of the distributions of the layer inputs in each mini-batch, and is especially helpful for training very deep networks with hundreds of layers [13, 14]. Despite its practical success, BN has several shortcomings that have drawn a lot of attention from researchers: (1) we lack a good understanding of the reasons for BN's success, and (2) BN works well only when the batch size is sufficiently large, which prohibits its use in micro-batch training. We argue that these two drawbacks are related: a good understanding of BN may lead us to normalization techniques that train deep networks fast without relying on any batch knowledge, making micro-batch training possible. Although some normalization methods, such as Group Normalization (GN), are specifically designed for micro-batch training, they still have difficulty matching the performance of BN in large-batch training (Fig. 1).
The widely accepted explanation was internal covariate shift, until recent work showed that the performance gain of BN has little to do with reducing internal covariate shift and instead proved that BN makes the landscape of the corresponding optimization problem significantly smoother. We follow this idea and propose another normalization technique that further smooths the landscape. Our goal is to accelerate the training of deep networks as BN does, but without relying on large batch sizes during training.
In this paper, we propose Weight Standardization (WS), which smooths the loss landscape by standardizing the weights in convolutional layers. Different from previous normalization methods that focus on activations [3, 17, 40, 42], WS considers the smoothing effects of weights beyond length-direction decoupling. To show its effectiveness, we study WS from both theoretical and experimental viewpoints. The highlights of our contributions are:
Theoretically, we prove that WS reduces the Lipschitz constants of the loss and the gradients. Hence, WS smooths the loss landscape and improves training.
To show that WS is applicable to many vision tasks, we conduct comprehensive experiments, including image classification on the ImageNet dataset, object detection and instance segmentation on the COCO dataset, video recognition on the Something-SomethingV1 dataset, semantic image segmentation on PASCAL VOC, and point cloud classification on the ModelNet40 dataset. The results show that WS is able to accelerate training and improve performance on all of them.
Prior work shows that BN reduces the Lipschitz constant of the loss function and also makes the gradients more Lipschitz, i.e., the loss has better $\beta$-smoothness. These results concern the activations, which BN standardizes to have zero mean and unit variance.
We notice that these results consider the Lipschitz constants with respect to the activations, not the weights that the optimizer directly optimizes. Therefore, we argue that we can also standardize the weights in the convolutional layers to further smooth the landscape. By doing so, we do not need to transfer smoothing effects from activations to weights, and the smoothing effects on activations and weights are additive. Based on these motivations, we propose Weight Standardization.
Here, we present the detailed formulation of Weight Standardization (WS) (Fig. 2). Consider a standard convolutional layer with its bias term set to 0:
$$y = \hat{W} * x,$$
where $\hat{W} \in \mathbb{R}^{O \times I}$ denotes the weights in the layer and $*$ denotes the convolution operation. Here, $O$ is the number of output channels and $I$ corresponds to the number of input channels within the kernel region of each output channel; Fig. 2 illustrates an example. In Weight Standardization, instead of directly optimizing the loss $\mathcal{L}$ on the original weights $\hat{W}$, we reparameterize the weights $\hat{W}$ as a function of $W$, i.e., $\hat{W} = \mathrm{WS}(W)$, and optimize the loss $\mathcal{L}$ on $W$ by SGD:
$$\hat{W}_{i,j} = \frac{W_{i,j} - \mu_{W_{i,\cdot}}}{\sigma_{W_{i,\cdot}} + \epsilon}, \quad \mu_{W_{i,\cdot}} = \frac{1}{I}\sum_{j=1}^{I} W_{i,j}, \quad \sigma_{W_{i,\cdot}} = \sqrt{\frac{1}{I}\sum_{j=1}^{I} \left(W_{i,j} - \mu_{W_{i,\cdot}}\right)^2},$$
where $\epsilon$ is a small constant for numerical stability.
Similar to BN, WS controls the first and second moments of the weights of each output channel individually. Note that many initialization methods also initialize the weights in a similar way; different from those methods, WS standardizes the weights in a differentiable way, which normalizes the gradients during back-propagation. Note that we do not apply any affine transformation on $\hat{W}$. This is because we assume that normalization layers such as BN or GN will normalize this convolutional layer again, and adding an affine transformation harms training, as we will show in the experiments. In the following, we first discuss the normalization effects of WS on the gradients.
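As a sketch of this reparameterization, the following NumPy function (the function name and tensor layout are our own assumptions, not taken from the official repository) standardizes a convolutional weight tensor per output channel:

```python
import numpy as np

def weight_standardize(w, eps=1e-5):
    """Standardize conv weights per output channel.

    w has shape (O, C, Kh, Kw); each of the O filters (flattened to
    length I = C * Kh * Kw) is shifted to zero mean and scaled to
    roughly unit variance before being used in the convolution.
    """
    flat = w.reshape(w.shape[0], -1)          # (O, I)
    mu = flat.mean(axis=1, keepdims=True)     # per-output-channel mean
    sigma = flat.std(axis=1, keepdims=True)   # per-output-channel std
    return ((flat - mu) / (sigma + eps)).reshape(w.shape)
```

In a deep learning framework, the same two steps (subtract the per-channel mean, divide by the per-channel standard deviation) are inserted into the convolution's forward pass so that gradients flow through the standardization; this is what makes WS only a couple of extra lines of code.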
For convenience, we first focus on one output channel $c$. Let $y_c$ denote all the outputs of channel $c$ during one pass of feed-forwarding and back-propagation, and let $x$ denote the corresponding inputs. Then, we can rewrite Eq. 2 and 3 for this channel, where $\langle \cdot, \cdot \rangle$ denotes the dot product and $\circ$ denotes the Hadamard power. The gradients then follow by the chain rule (Eq. 8 and 9).
In Eq. 8, to compute the gradient with respect to the zero-centered weights, the gradient with respect to the standardized weights is first subtracted by a weighted average of its components and then divided by $\sigma$. Note that when BN is used to normalize this convolutional layer, BN will compute the scaling factor $\sigma$ again, so the effect of dividing the gradients by $\sigma$ is canceled in both feed-forwarding and back-propagation. As for the additive term, its effect depends on the statistics of the weights and the gradients; we will later show that this term reduces the gradient norm regardless of those statistics. As for Eq. 9, it zero-centers the back-propagated gradients. When the mean gradient is large, zero-centering will significantly affect the gradients passed to $W$.
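The zero-centering effect can be checked numerically: since adding a constant to all weights of one output channel leaves the standardized weights unchanged, the gradient of any loss with respect to the raw weights must have zero mean within each channel. A small finite-difference sketch (the toy loss and helper names are ours, for illustration only):

```python
import numpy as np

def standardize_rows(w, eps=1e-5):
    # per-row (per-output-channel) standardization, as in WS
    mu = w.mean(axis=1, keepdims=True)
    sigma = w.std(axis=1, keepdims=True)
    return (w - mu) / (sigma + eps)

def toy_loss(w):
    # arbitrary smooth loss applied to the standardized weights
    return np.sum(np.sin(standardize_rows(w)) ** 2)

def numeric_grad(f, w, h=1e-6):
    # central finite differences, one entry at a time
    g = np.zeros_like(w)
    for idx in np.ndindex(w.shape):
        wp, wm = w.copy(), w.copy()
        wp[idx] += h
        wm[idx] -= h
        g[idx] = (f(wp) - f(wm)) / (2 * h)
    return g

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 8))    # 3 output channels, 8 weights each
g = numeric_grad(toy_loss, W)
# each row of g averages to (numerically) zero
print(np.abs(g.mean(axis=1)).max())
```

Without the standardization step, the same loss applied directly to the weights produces gradients with non-zero row means, so the zero-centering really comes from back-propagating through WS.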
We will show that WS makes the loss landscape smoother. Specifically, we show that optimizing $\mathcal{L}$ on $W$ gives smaller Lipschitz constants for both the loss and the gradients than optimizing it on $\hat{W}$. The Lipschitz constant of a function $f$ is the smallest value $L$ such that $\|f(x_1) - f(x_2)\| \le L \|x_1 - x_2\|$ for all $x_1, x_2$; for the loss and the gradients, $f$ is $\mathcal{L}$ and $\nabla \mathcal{L}$, respectively, and $x$ is the weights. Smaller Lipschitz constants on the loss and gradients mean that the changes of the loss and the gradients during training are more tightly bounded. They give the optimizer more confidence to take a big step in the gradient direction, as the gradient direction varies less within the range of the step. In other words, the optimizer can take longer steps without worrying about sudden changes of the loss landscape and gradients. Therefore, WS is able to accelerate training.
By Eq. 6, we know that $\hat{W}_{c,\cdot} = \dot{W}_{c,\cdot} / \sigma_{\dot{W}_{c,\cdot}}$, where $\dot{W}$ denotes the weights after mean subtraction (Eq. 5). Then,
Since we assume that this convolutional layer is followed by a normalization layer such as BN or GN, the effect of the $1/\sigma$ scaling will be canceled. Therefore, the real effect on the gradient norm is the subtracted reduction term, which reduces the Lipschitz constant of the loss.
Next, we study the effect of Eq. 9. By definition,
By Eq. 8, we rewrite the second term:
Since the subtracted term is non-negative, we have
Although both Eq. 8 and 9 reduce the Lipschitz constant, their real effects depend on the statistics of the weights and the gradients. For example, the reduction effect of Eq. 9 depends on the average gradient on $W$. As for Eq. 8, its effect might be limited when the gradients are evenly distributed around zero. To understand their real effects, we conduct a case study on ResNet-50 trained on ImageNet to see whether Eq. 5 or Eq. 6 has the bigger effect, or whether they contribute similarly to smoothing the landscape.
Before the Lipschitzness study on the gradients, we first present a case study where we train ResNet-50 models on ImageNet following the conventional training procedure. In total, we train four models: ResNet-50 with GN, with GN+Eq. 5, with GN+Eq. 6, and with GN+Eq. 5&6. The training dynamics are shown in Fig. 4, from which we observe that Eq. 6 slightly improves the training speed and performance of models with or without Eq. 5, while the major improvements come from Eq. 5. This observation motivates us to study the real effects of Eq. 5 and 6. To investigate this, we examine the values of the gradient-norm components during training.
We train the models for 90 epochs and save the gradients and the weights at the first training iteration of each epoch. The right panel of Fig. 4 shows the average percentages of the gradient-norm components, from which we can see that the reduction term contributed by Eq. 6 is small compared with the other two components. In other words, although Eq. 6 decreases the gradient norm regardless of the statistics of the weights and gradients, its real effect is limited by the distributions of the weights and gradients. Nevertheless, from the left panels we can see that Eq. 6 still improves training; since it requires very little computation, we keep it in WS.
From the experiments above, we observe that the training speed boost is mainly due to Eq. 5. As the effect of Eq. 6 is limited, in this section we only study the effect of Eq. 5 on the Hessian of the loss with respect to the weights. We will show that Eq. 5 decreases the Frobenius norm of this Hessian matrix. With a smaller Frobenius norm, the gradients are more predictable; thus the loss is smoother and easier to optimize.
We use $H$ and $\dot{H}$ to denote the Hessian matrices of the loss with respect to the original weights and the zero-centered weights, respectively. We first derive the relationship between $H$ and $\dot{H}$:
Therefore, Eq. 5 not only zero-centers the feed-forward outputs and the back-propagated gradients, but also the Hessian matrix. Next, we compute its Frobenius norm:
[Table 1 header: Method (batch size): BN (64/32), SN (1), GN (1), BN+WS (64/32), GN+WS (1).]
In this section, we will present the experimental results, including image classification on ImageNet, object detection and instance segmentation on COCO, video recognition on the Something-SomethingV1 dataset, semantic segmentation on PASCAL VOC, and point cloud classification on ModelNet40.
The ImageNet dataset is a large-scale image classification dataset with about 1.28 million training images and 50K validation images across 1000 categories; each category has roughly 1300 training images and exactly 50 validation images. Table 2 shows the top-1 error rates of ResNet-50 on ImageNet when trained with different normalization methods, including Layer Normalization, Instance Normalization, Group Normalization, and Batch Normalization. From Table 2, we can see that when the batch size is limited to 1, GN+WS achieves performance comparable to BN with a large batch size. Therefore, we use GN+WS for micro-batch training, because GN shows the best results among all the normalization methods that can be trained with 1 image per GPU. Table 2 also shows WS with an affine transformation, which harms training.
[Table header: methods GN and GN+WS, with Top-1 and Top-5 error rates at batch size 1.]
Table 1 shows our major experimental results on the ImageNet dataset. Note that Table 1 only shows the error rates of ResNet-50 and ResNet-101; this is to compare with previous works that focus on the micro-batch training problem, e.g., Switchable Normalization (SN) and Group Normalization (GN). We run all the experiments using the official PyTorch implementations of the layers, except for SN, for which we report the performances from its paper. This ensures that all the experimental results are comparable and that our improvements are reproducible.
In Table 3, we also provide the experimental results on ResNeXt, and comparisons of the training curves of ResNet and ResNeXt trained with different normalization methods are shown in Fig. 5. Here, we compare ResNeXt+GN with ResNeXt+GN+WS. Note that GN originally did not report results on ResNeXt. Without tuning the hyper-parameters of the Group Normalization layers, we use 32 groups for each of them, the default configuration for ResNet, for which GN was originally proposed. ResNeXt-50 and 101 are 32x4d. We train the models for 100 epochs with the batch size set to 1 and the iteration size set to 32. As the table shows, the performance of GN on ResNeXt is unsatisfactory: the models perform even worse than the original ResNets. Especially for ResNeXt-101, the performance gap between GN and BN is large. In the same setting, WS makes training ResNeXt a lot easier. This enables training ResNeXt with GN in Mask R-CNN and Faster R-CNN, which we discuss in the next subsection.
Here, we list the hyper-parameters used for all these results. For all models, the learning rate is set to 0.1 initially and is multiplied by 0.1 after every 30 epochs. We use SGD to train the models, with weight decay 0.0001 and momentum 0.9. For ResNet-50 with BN or BN+WS, the training batch size is 256 across 4 GPUs; without synchronized BN, the effective batch size is 64. For the other ResNet-50 models, where the batch size is 1 image per GPU, we set the iteration size to 64, i.e., the gradients are averaged across every 64 iterations and then one step is taken. This ensures fair comparisons, because the total number of parameter updates is the same even though the batch sizes differ. We train ResNet-50 with the different normalization techniques for 90 epochs. For ResNet-101, we set the batch size to 128 because some of the models would use more than 12GB per GPU with a batch size of 256. In total, we train all ResNet-101 models for 100 epochs. Similarly, we set the iteration size for models trained with 1 image per GPU to 32 to compensate for the total number of parameter updates.
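The iteration-size mechanism above is plain gradient accumulation: gradients from several consecutive micro-batches are averaged before a single SGD step is taken, so the number of parameter updates matches the large-batch runs. A framework-agnostic sketch (the function and argument names are ours):

```python
import numpy as np

def accumulated_sgd_step(w, micro_grads, lr=0.1, momentum=0.9, velocity=None):
    """One SGD-with-momentum update from a list of micro-batch gradients.

    `micro_grads` holds the gradients of `iter_size` consecutive
    micro-batches; averaging them makes the update match what one large
    batch covering the same images would produce.
    """
    if velocity is None:
        velocity = np.zeros_like(w)
    g = np.mean(micro_grads, axis=0)     # average across the iteration window
    velocity = momentum * velocity + g
    return w - lr * velocity, velocity
```

With 1 image per GPU and iteration size 64, this performs the same number of parameter updates per epoch as batch-size-64 training, which is what keeps the comparisons fair.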
Unlike image classification on ImageNet, where we can afford large-batch training when the models are not too big, object detection and segmentation on COCO usually use only 1 or 2 images per GPU due to the high input resolution. Given that our method performs on par with large-batch BN training on ImageNet, we expect it to significantly improve performance on COCO because of this training setting.
We use a PyTorch-based Mask R-CNN framework (https://github.com/facebookresearch/maskrcnn-benchmark) for all the experiments except for SN, for which we use the official PyTorch implementation. We take the models pre-trained on ImageNet, fine-tune them on the COCO train2017 set, and test them on the COCO val2017 set. To maximize fairness, we use the models we pre-trained on ImageNet ourselves instead of downloading pre-trained models available online, except for SN, for which we use the downloaded SN(8,1) models per the instructions on their official GitHub page. We use 4 GPUs to train the models and apply the learning rate schedules for all models following the practice of the Mask R-CNN framework our work is based on: the 1X learning rate schedule for Faster R-CNN and the 2X schedule for Mask R-CNN. For ResNet-50 and ResNeXt-50, we use 2 images per GPU to train the models; for ResNet-101 and ResNeXt-101, we use 1 image per GPU because those models cannot fit in 12GB of GPU memory. We then adapt the learning rates and the training steps accordingly. The configurations use FPN and a 4conv1fc bounding box head. All the training procedures strictly follow the original settings.
Table 5 reports the Average Precision (AP) of Faster R-CNN trained with different methods, and Table 4 reports the Average Precision for bounding boxes and for instance segmentation masks. From the two tables, we observe results similar to those on ImageNet: GN yields limited performance improvements on more complicated architectures such as ResNet-101, ResNeXt-50, and ResNeXt-101, but when we add WS to GN, we are able to train the models much better. The improvements become more significant as the network complexity increases. Considering that deep networks keep becoming deeper and wider, a normalization technique such as WS eases training considerably, without the memory and batch-size issues.
In this subsection, we show the results of applying our method on video recognition on Something-SomethingV1 dataset . Something-SomethingV1 is a video dataset which includes a large number of video clips that show humans performing pre-defined basic actions. The dataset has 86,017 clips for training and 11,522 clips for validation.
We use the state-of-the-art method TSM for video recognition, which uses a ResNet-50 with BN as its backbone network. Our code is based on TRN and then adapted to TSM. Our reimplementation differs from the original TSM in that we use models pre-trained on ImageNet rather than on the Kinetics dataset as the starting points. We then fine-tune the pre-trained models on Something-SomethingV1 for 45 epochs. The batch size is set to 32 across 4 GPUs, and the learning rate is initially set to 0.0125, then divided by 10 at the 26th and 36th epochs. The batch normalization layers are not frozen during training. With these changes, the reimplemented TSM-BN achieves 44.30/74.53 top-1/5 accuracy, higher than the 43.4/73.2 originally reported in the paper.
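The schedule described above is a standard step decay; as a small sketch (the function name is ours), the learning rate is divided by 10 once each milestone epoch is reached:

```python
def step_lr(epoch, base_lr=0.0125, milestones=(26, 36)):
    """Learning rate at a given epoch: start from `base_lr` and divide
    by 10 at each milestone epoch (here the 26th and 36th of 45)."""
    return base_lr * 0.1 ** sum(epoch >= m for m in milestones)
```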
Then, we compare the performances of different normalization methods for training TSM. Table 6 shows the top-1/5 accuracies of TSM trained with GN, GN+WS, BN, and BN+WS. From the table, we can see that WS increases the top-1 accuracy for both GN and BN. The improvements help GN catch up with the performance of BN, and boost BN to even better accuracies, roughly matching the performance of the ensemble TSM with 24 frames reported in the paper.
We continue to show the general applicability of Weight Standardization on semantic image segmentation. Specifically, we choose PASCAL VOC 2012, which contains 21 categories including background. Following common practice, the training set is augmented with extra annotations, resulting in 10,582 training images.
We select DeepLabv3 as the base model for its competitive performance and use ResNet-101 as the backbone, fine-tuning from the corresponding ImageNet checkpoints. Our reimplementation follows every detail, including batch size 16, image crop size 513, learning rate 0.007 with polynomial decay, and 30K training iterations, except that we use multi-grid (1, 1, 1) instead of (1, 2, 4). During testing, we stick to output stride 16 and do not use multi-scale or left-right flipping of the input images. As shown in Table 7, BN achieves 76.49% mean IoU, which aligns with previously reported numbers. Adding WS improves upon BN and GN by 0.66% and 2.30%, respectively. This is further evidence that Weight Standardization also works well for dense image prediction tasks.
Here, we show the generalizability of our method on point cloud classification by evaluating it on ModelNet40, which contains 40 categories of CAD models, with 9,843 shapes for training and 2,468 for testing.
We follow the state-of-the-art method DGCNN and use the authors' implementation (https://github.com/WangYueFt/dgcnn) for all our experiments. All settings are kept exactly the same as in that implementation, and as in prior work we replace all BN layers with GN layers. The best number of groups is 16 based on our grid search, and it is fixed for all experiments.
Table 8 shows that applying WS to GN improves the mean class accuracy and overall accuracy by 1.8% and 1.5%, respectively, and BN+WS further improves over BN by 0.3%. This demonstrates that WS also works beyond grid-based CNNs.
Deep neural networks achieve state-of-the-art results in many vision tasks [4, 14, 20, 25, 31, 32, 39, 46]. Data normalization is widely used to speed up training through proper initialization based on assumptions about the data distribution [7, 11]. However, as training evolves, the normalization effects of initialization may fade away. To ensure normalization throughout the entire training process, Batch Normalization was proposed to normalize along the batch dimension, and it is now a fundamental component of many state-of-the-art deep networks thanks to its fast training speed and superior performance. Despite this great success, BN's performance can drop dramatically when the batch size is reduced or when the data statistics do not support mini-batch training. To explore other dimensions [2, 33], Layer Normalization normalizes along the channel dimension, Instance Normalization performs BN for each sample individually, and Group Normalization finds a middle ground between Layer Normalization and Instance Normalization. However, all of them have difficulty matching the performance of BN.
To alleviate BN's issues when batches become small, Batch Renormalization (BR) constrains the statistics of BN within a certain range to reduce their drift when the batch size is not sufficiently large, and synchronized BN shares statistics across multiple GPUs through engineering. Yet neither fully solves BN's issues, because both still rely on batch knowledge.
Weight Normalization is closest to our method in that it also normalizes weights instead of activations. But as we show in this paper, zero-centering the weights and the gradients, rather than division-based normalization, is the key to the success of our method. We also compare against it in the experiments, and our method outperforms it by a very large margin.
To fully solve BN's issues, many researchers have studied its underlying mechanism. To name a few results: BN makes optimization trajectories more robust to the parameter initialization; networks with BN have better generalization properties because they tend to rely less on single directions of activations; the effect of the length-direction decoupling used in BN has been explored; BN's regularization effects explain part of its generalization behavior; a mean field theory for BN studies gradient explosion issues; and theoretical analysis explains BN's automatic learning rate tuning property. Our work builds on the result that BN reduces the Lipschitz constants of the loss and the gradients.
Recently, Luo et al. used the idea of AutoML [24, 48] to learn how to combine IN, LN, and BN into a new normalization method, and Shao et al. extended this by learning a sparse switchable normalization, which is more similar to other AutoML techniques. Since these methods normalize activations, our method can also be applied to networks that use them. Due to time limits, we only apply WS on top of pre-defined normalization and leave learned normalization to future work.
In this paper, we propose a novel normalization method, Weight Standardization, motivated by the recent result that BN's benefits come from its smoothing effects on the activations. Similar to BN, which smooths the loss landscape by normalizing the activations, our method further smooths the loss landscape by standardizing the weights in convolutional layers. With WS, the performance of normalization methods that do not require batch knowledge improves by a very large margin.
We study WS from both theoretical and experimental viewpoints. On the theoretical side, we investigate the smoothing effects of WS on the loss landscape and show that WS reduces the Lipschitz constants of the loss and the gradients. On the experimental side, comprehensive experiments show the effectiveness of WS: GN+WS with batch size 1 matches the performance of BN with large batch sizes when large batches are available, and significantly improves performance when only micro-batch training is possible.
We thank Yuxin Wu for sharing the original drawing of Figure 2. We would also like to thank Yingda Xia and Chenxu Luo for sharing the implementation of TSM.
International Journal of Computer Vision, 111(1):98–136, 2015.
Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), pages 249–256, 2010.
IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.
Proceedings of the 32nd International Conference on Machine Learning (ICML), 2015.
ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NeurIPS), pages 1097–1105, 2012.