1 Introduction
The successes of deep learning in computer vision can be attributed to the growth of two factors: (1) data size and (2) network depth. For example, recent convolutional neural networks (CNNs) often have hundreds of layers, including handcrafted architectures such as Inception [26, 27], ResNet [10] and DenseNet [12], device-friendly architectures such as MobileNet [11] and ShuffleNet [33], as well as architectures found by neural architecture search [16, 23].
The above CNNs typically stack a basic network many times to build deep models. This “atomic” network consists of a convolutional layer, a normalization layer, and an activation function. The normalization method is therefore an indispensable component in these deep models, where each model may have tens of normalization layers. However, existing CNNs assume that all normalization layers use the same normalization approach uniformly, such as batch normalization (BN) [14], resulting in non-optimal performance.
Overview. This work presents the first systematic study of a novel perspective in deep learning: whether different convolutional layers in a CNN should use different normalizers. This viewpoint would impact many vision problems. We employ Switchable Normalization (SN) [20] as our approach, a recently proposed normalization method that learns to select an appropriate normalizer, such as BN, instance normalization (IN) [28], or layer normalization (LN) [1], for each normalization layer of a CNN.
We first explain necessary mathematics of our empirical setups and then investigate the selectivity of normalizers in three important vision tasks, including image recognition in ImageNet [6], object detection in COCO [15], and scene segmentation in Cityscapes [5] and ADE20K [35]. The ‘selectivity’ is defined as the learning dynamics when selecting normalizers. This work answers the following three questions.

Is it useful to allow each normalization layer to use its own normalization operation? We show that performance in ImageNet can be improved by placing distinct normalizers in appropriate positions of a network, because they have different properties that would help learning image representation. The learned features and chosen normalizers are transferable to COCO, Cityscapes, and ADE20K.

What impacts the choices of normalizers? By studying their learning dynamics in ImageNet, we find that the selection is more sensitive to depth, batch size, and input image size, but less sensitive to random parameter initialization, learning rate decay (e.g. stepwise decay, cosine decay [17]), and solver (e.g. SGD+momentum, RMSProp).

Do different networks, tasks, and datasets prefer different normalizers? Our empirical results suggest that they prefer different dynamics and configurations of normalizers.
We make three key contributions. (1) This is the first work to investigate the impact of using different normalization methods after different convolutional layers within a CNN. We verify this viewpoint in computer vision. (2) We identify key factors that affect the selectivity of normalizers. Our findings are useful in many vision tasks and may inspire work on other problems that are not covered here. (3) We make this study completely reproducible by organizing a codebase that contains the pretrained or finetuned models for many benchmarks. This codebase will be released.
2 Approaches and Setups
This section explains our approaches and empirical setups.
Selecting Normalizers. To investigate the selectivity of normalizers, we adopt switchable normalization (SN) [20], which learns to choose a normalizer from a set of normalization methods. SN is applied after each convolutional layer,

$$\hat{h}_{ncij} = \frac{h_{ncij} - \sum_{k\in\Omega}\lambda_k\mu_k}{\sqrt{\sum_{k\in\Omega}\lambda'_k\sigma_k^2}}, \qquad (1)$$

where $h_{ncij}$ and $\hat{h}_{ncij}$ are pixel values of each hidden channel before and after normalization, $\mu_k$ and $\sigma_k$ are the mean and standard deviation of $h$ estimated by using a certain normalizer $k$, and $\Omega$ indicates a set of normalizers. $\lambda_k$ and $\lambda'_k$ are importance ratios of the mean and variance respectively. They are learnable parameters normalized by a softmax function. We have $\sum_{k\in\Omega}\lambda_k = 1$, $\sum_{k\in\Omega}\lambda'_k = 1$, and $\lambda_k, \lambda'_k \in [0, 1]$. Existing work also applied a linear transformation after $\hat{h}$ and before ReLU, by learning a scale parameter $\gamma$ and a bias parameter $\beta$, that is $\gamma\hat{h} + \beta$. Eqn.(1) shows that each pixel is normalized by using a weighted average of statistics, which are estimated by a set of normalizers in $\Omega$.
$\mu$ and $\sigma$. We point out that the importance ratios $\lambda_k$ and $\lambda'_k$ are not identical in Eqn.(1). We are interested in this setup because $\mu$ and $\sigma$ have different impacts in training. We take BN as an example to understand this. The impacts of $\mu$ and $\sigma$ can be distinguished by using weight normalization (WN) [25], which is written as $\hat{h} = \gamma\frac{w^{\mathsf{T}}x}{\|w\|_2}$, where $w$ and $x$ represent a filter and an image patch respectively. Note that BN reduces to WN under simplifying assumptions on $\mu$ and $\sigma$ [25]. In fact, a network trained with WN might not perform as well as BN. However, by investigating how well mean-only BN and std-only BN would improve WN, we can distinguish the impacts of $\mu$ and $\sigma$. (Footnote 1: WN trained with mean-only BN or std-only BN is achieved by stacking a BN layer after a WN layer, where either $\mu$ or $\sigma$ is used in BN.)
In fact, previous work [25, 21] found that WN trained with either mean-only or std-only BN cannot outperform BN, and that both of them are important to achieve performance comparable to BN. Therefore, we treat the mean and standard-deviation statistics, and their importance ratios, separately throughout our studies to better understand their different behaviors.
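To make the weighted averaging in Eqn.(1) concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes an input of shape (N, C, H, W), three candidate normalizers ordered (IN, LN, BN), and a small `eps` added for numerical stability. The names `switchable_norm`, `mean_logits`, and `var_logits` are hypothetical.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over a 1-D array."""
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()


def switchable_norm(h, mean_logits, var_logits, eps=1e-5):
    """Normalize h of shape (N, C, H, W) with ratio-weighted statistics.

    mean_logits and var_logits are hypothetical names for the unnormalized
    control parameters, one entry per normalizer, ordered (IN, LN, BN).
    """
    # Axes over which each normalizer estimates its statistics.
    axes = {"IN": (2, 3), "LN": (1, 2, 3), "BN": (0, 2, 3)}
    means = [h.mean(axis=a, keepdims=True) for a in axes.values()]
    variances = [h.var(axis=a, keepdims=True) for a in axes.values()]

    # Separate importance ratios for the mean and for the variance.
    mean_ratios = softmax(np.asarray(mean_logits, dtype=float))
    var_ratios = softmax(np.asarray(var_logits, dtype=float))

    # Weighted averages broadcast to shape (N, C, 1, 1).
    mu = sum(r * m for r, m in zip(mean_ratios, means))
    var = sum(r * v for r, v in zip(var_ratios, variances))
    return (h - mu) / np.sqrt(var + eps)
```

With all logits equal, every normalizer receives a ratio of 1/3. Pushing one logit far above the others makes the layer behave like that single normalizer.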
Properties of Normalizers in $\Omega$. We define the set of normalizers as $\Omega = \{\mathrm{BN}, \mathrm{IN}, \mathrm{LN}\}$. We do not exhaustively enumerate the methods that could be placed in $\Omega$; BN, IN, and LN are chosen as representatives. Their importance ratios in training are sufficient to show the learning behaviors of distinct normalizers. To see this, we can also represent IN and LN by using WN. Therefore, the characteristics of BN, IN, LN, and SN can be compared in a unified way.
Specifically, the computation of WN is defined by $\hat{h}_i = \gamma\frac{w_i^{\mathsf{T}}x}{\|w_i\|_2}$, which normalizes the norm of each filter to ‘1’. We use $w_i$ to indicate a filter of the $i$-th channel and $x$ to indicate an image patch. As shown in Fig.1(a), WN normalizes each filter onto a unit sphere with length ‘1’, and then rescales the length to $\gamma$, which is a learnable scale parameter. To simplify our discussions, we suppose this scale parameter is shared among all channels. In other words, WN decomposes the optimization of a filter into its direction and length.
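The direction/length decomposition of WN can be sketched in a few lines of NumPy. This is an illustration under the standard WN definition of [25], with a single filter and a single image patch. The function name `weight_norm_response` is hypothetical.

```python
import numpy as np


def weight_norm_response(w, x, gamma=1.0):
    """Response of one weight-normalized filter w to one patch x."""
    w = np.asarray(w, dtype=float)
    x = np.asarray(x, dtype=float)
    # Direction: the filter projected onto the unit sphere.
    direction = w / np.linalg.norm(w)
    # Length: a separate learnable scale applied after the projection.
    return gamma * float(direction @ x)
```

Because only the direction of `w` enters the response, rescaling the filter leaves the output unchanged and the effective filter length is controlled by `gamma` alone.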
By assuming and for all the normalizers in , we see that they can be also represented by using WN. For instance, IN turns into that is identical to WN as shown in Fig.1(b). For LN, each channel is normalized by all the channels and thus where is the number of channels. The filter norm in LN is less constrained than WN and IN as visualized in Fig.1(c), where the filter norm can be either longer or shorter than , in the sense that LN increases learning capacity of the networks by diminishing regularization.
Furthermore, BN can be represented by WN as discussed before. Luo et al. [21] showed that BN imposes regularization on norms of all the filters and reduces correlations between filters. Its geometry interpretation can be viewed in Fig.1(d), where the filter norms would be shorter and angle between filters would be larger than the other normalizers. In conclusion, BN improves generalization [14]. In general, we would have the following relationships
Finally, SN can also be interpreted by using WN. To see this, we consider a simple case when . In this case, Eqn.(1) can be reformulated as by combing the above normalizers. It is seen that SN aggregates benefits from all three approaches as shown in Fig.1(e), by learning their important ratios to adjust both learning and generalization capacity.
2.1 Connections with Previous Work
Normalization Methods. There are normalization approaches in the literature other than those mentioned above. Although they are not considered in , we would like to acknowledge their contributions. For example, group normalization (GN) [31] divides channels into groups. Let be the number of groups. GN can be represented by WN as that is a special case of LN. Since GN introduces an additional hyperparameter and its learning behavior would be similar to LN, we do not add GN in . Furthermore, batch renormalization (BRN) [13] and batch kalman normalization (BKN) [30] extended BN to account for training in small batch size. And divisive normalization (DN) [24] normalized each pixel by using its neighboring region.
Understanding Depth. This work investigates the depth of CNNs with respect to the selectivity of normalizers. There are also many studies that explored network depth from other aspects. We review some representatives. Ba et al. [2] showed that by using a shallow student network to mimic a deep teacher network, a multilayer perceptron (MLP) can learn complex functions previously learned by deep CNNs while using a similar number of parameters. This was the first study to show that MLPs need not be deep to achieve good performance.
Urban et al. [29] found that the above observation does not apply when the student is a CNN. Many convolutional layers are required to learn functions of a deep teacher. This study showed that a CNN student should be sufficiently deep to achieve good performance, although it may not be as deep as its teacher. Moreover, Zhang et al. [32] showed that the small generalization error of a deep network may not be attributable to its depth and the regularization techniques used in training, because we are far from understanding why CNNs are relatively easy to optimize and why they have good generalization ability.
The above studies shed light on foundations to understand deep learning, which might not have direct guidance in practice. However, unlike them, our study explores a new viewpoint to understand and to design deep models, which is of great value in many practical problems.
3 Training Methods
Now we introduce pretraining and finetuning procedures that are used throughout this work.
Pretraining. For a CNN with a set of network parameters $\Theta$ and a set of training samples and their labels denoted as $\{(x_j, y_j)\}_{j=1}^{N}$, the training problem is formulated as minimizing an empirical loss function $\frac{1}{N}\sum_{j=1}^{N}\mathcal{L}(y_j, f(x_j; \Theta))$ with respect to $\Theta$, where $f$ is a function learned by the CNN to predict $y_j$. When a CNN is trained with SN, we replace all its previous normalization layers, such as BN, with SN. This step introduces a set of control parameters $\Phi$ for the normalization layers. Therefore, the optimization problem becomes $\min_{\Theta, \Phi}\frac{1}{N}\sum_{j=1}^{N}\mathcal{L}(y_j, f(x_j; \Theta, \Phi))$. For example, when pretraining in ImageNet, both $\Theta$ and $\Phi$ are optimized jointly in a single feedforward step by using SGD with momentum 0.9 and an initial learning rate of 0.1, which is divided by 10 after 30, 60, and 90 epochs.
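The stepwise schedule described above (initial learning rate 0.1, divided by 10 after 30, 60, and 90 epochs) can be written as a small helper. This is a sketch that assumes epochs are counted from 0, so that the first drop takes effect at epoch index 30. The function name `step_lr` is hypothetical.

```python
def step_lr(epoch, base_lr=0.1, milestones=(30, 60, 90), factor=0.1):
    """Learning rate for a 0-indexed epoch under stepwise decay."""
    # Count how many milestones have already been passed.
    drops = sum(1 for m in milestones if epoch >= m)
    return base_lr * factor ** drops
```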
The above learning problem has important differences compared to meta-learning [4, 22, 23], which is also able to learn control parameters. We defer this discussion to Appendix A due to the length limit of the paper.
Initializing SN. Given a CNN model, we set its network parameter initialization, batch size, learning rate, and the other hyperparameters (such as weight decay) by strictly following existing settings. The only difference is that the normalization layer is replaced by SN. We initialize the scale parameter $\gamma = 1$, the shift parameter $\beta = 0$, and all the control parameters as $1/3$. We also add weight decay to all these parameters. We would like to point out that initializing $\gamma$ specially in certain networks may improve performance. For example, Goyal et al. [7] showed that the performance of ResNet50 [10] would be improved when $\gamma$ of each residual block’s last normalization layer is initialized as 0 rather than 1. However, we did not adopt any such trick in order to keep parameter initializations as simple as possible.
Finetuning. We finetune a pretrained deep model by following its original protocols in the corresponding benchmarks. BN in SN is neither frozen nor synchronized across GPUs. Moreover, BN in SN is evaluated by using the batch average following [20] rather than the moving average.
4 Experiments
This section systematically investigates the selectivity behaviors in ImageNet [6], COCO [15], Cityscapes [5], and ADE20K [35]. In each benchmark, all models are trained by following common settings. Details are moved to Appendix C. By default, the models of SN are trained by following Sec.3.
4.1 Image Recognition in ImageNet
We first present results in ImageNet.
Comparisons. Table 1 shows that ResNet50 trained with distinct normalizers (using SN) outperforms when it is trained with BN, GN, IN, and LN alone. For example, SN surpasses BN by and in and . In small minibatch , SN achieves the secondbest performance next to GN. It is seen that GN has uniform accuracy around for all batch sizes [31], because it is independent with batch statistics. The accuracy of SN increases when batch size increases. In Table 1, SN indicates ratios of means and variances are tied . We observe that sharing the ratios in SN degenerates performance.
Subsets of . Now we repeat training ResNet50 with SN three times but removing one normalizer in at a time, resulting in three subsets {IN,LN}, {IN,BN}, and {LN,BN}. Their top1/top5 accuracies are , and respectively. They are comparable or better than just using IN, LN, and BN uniformly as reported in Table 1. However, the result is when . We conclude that all normalizers in are important to achieve the best accuracy.
Res50  BN [14]  GN [31]  IN [28]  LN [1]  SN  SN 
76.4(0.7)  75.9(1.2)  71.6(5.5)  74.7(2.4)  77.1  76.2(0.9)  
75.2(1.5)  76.0(0.7)  –  –  76.7  76.1(0.6)  
65.3(10.3)  75.9(0.3)  –  –  75.6  75.5(0.1) 


Random initializations. Here we show that the improvements of performance come from distinct normalizers but not random parameter initializations. This is demonstrated by training ResNet50+SN(8,32) three times. At each time, we use a different random seed to initialize the network parameters. The top1/top5 accuracies are , , and respectively. Their average is that only has a small variance (), demonstrating that results in Table 1 are significant.
4.1.1 Learning Dynamics of Ratios
This section studies learning behaviors of ratios.
Soft ratios vary in training. The values of soft ratios and are between and . Fig.3(a,b) plot their values for each normalization layer at every epoch. These values would have smooth fluctuation in training, implying that different epochs may have their own preference of normalizers. In general, we see that mostly prefers BN, while prefers IN and BN when receptive field (RF) 49 and prefers LN when RF299. Fig.3(c) shows the discrepancy between (a) and (b) by computing a symmetry metric, that is, where
is KullbackLeibler divergence. Larger value of
indicates larger discrepancy between the distributions of and .For example, and of the first layer choose different normalizers (see the first subfigure in (a,b)), making them had moderately large divergence (see the first subfigure in (c)). Moreover, the ratios of the 2, 3, and 4 layer are similar and they have small divergence . We also see that and prefer different normalizers when RF is 199299 and 299427 where , as shown in the last row of (c). In conclusion, more than 50% number of layers have large divergence between and , confirming that different ratios should be learned for and .
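As an illustration of how such a discrepancy between two ratio distributions can be measured, the sketch below sums the two directed Kullback-Leibler divergences. This particular symmetric form is an assumption made for illustration and may differ from the exact metric plotted in Fig.3(c). The function names and the small `eps` are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def symmetric_kl(p, q):
    """Symmetric discrepancy: KL(p || q) + KL(q || p)."""
    return kl_divergence(p, q) + kl_divergence(q, p)
```

The value is zero when the mean ratios and variance ratios of a layer coincide, and it grows as the two sets of ratios favor different normalizers.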
Hard ratios are relatively stable,
although the soft ratios are varying in training. A hard ratio is a sparse vector obtained by applying
function to or such as , that is, only one entry is ‘1’ and the others are ‘0’ to select only one normalizer. Fig.2 shows hard ratios for each layer in three snapshots, which are ResNet50+SN(8,32) trained after 30, 60, and 90 epochs respectively. For example, and use LN and IN respectively in the first layer in Fig.2.We have several observations. First, the size of filters ( or ) seem to have no preference of specific normalizer, while the skipconnections prefer different normalizers for and at the 90 epoch. Second, around number of layers select two different normalizers for and in these three snapshots, which are (), (), and () respectively. Third, the discrepancy between snapshots are small. For instance, layers are different when comparing between 30 and 60 epoch, while only layers are different between 60 and 90 epoch. Fourth, the layers that choose different normalizers are mainly presented when RF40 and 200, rendering depth would be a major factor that affects the ratios.
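The hard ratios described above can be sketched as a one-hot selection of the largest soft ratio. The code assumes the selection is an argmax over the soft ratios of a layer, and it also counts the layers whose mean and variance ratios pick different normalizers. Both function names are hypothetical.

```python
def hard_ratio(soft_ratio):
    """One-hot vector that keeps only the normalizer with the largest ratio."""
    winner = max(range(len(soft_ratio)), key=lambda i: soft_ratio[i])
    return [1 if i == winner else 0 for i in range(len(soft_ratio))]


def count_mismatched_layers(mean_ratios, var_ratios):
    """Number of layers whose mean and variance select different normalizers."""
    return sum(
        hard_ratio(m) != hard_ratio(v) for m, v in zip(mean_ratios, var_ratios)
    )
```

For example, `hard_ratio([0.2, 0.7, 0.1])` selects the second normalizer and zeroes out the other two.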
Performance of hard ratios. We further examine performance of hard ratios. We finetune the above snapshots by replacing soft ratios with hard ratios. The right of Table 2 shows that these models perform as good as their soft counterpart (), implying that we could increase sparsity in ratios to reduce computations of multiple normalizers while maintaining good performance.
Furthermore, we train the above models from scratch by initializing the ratios as the hard ratios in the above snapshots, rather than using the default initial value of 1/3. However, this setting harms generalization as shown in Table 2. We conjecture that all normalizers in the candidate set are helpful to smooth the loss landscape at the beginning of training. Therefore, the sparsity of ratios should be enhanced gently as training progresses. In other words, initializing the ratios uniformly as 1/3 is a good practice. Tuning this value is cumbersome and may also impede generalization ability.
Depth is a main factor. As shown by the ratios in Fig.2, IN mainly appears in lower layers to reduce variations in low-level features, LN appears in deeper layers to increase learning ability, while BN appears in the middle. A similar trait can be observed in Appendix Fig.9, which shows the ratios at the 100th epoch when training has converged. Furthermore, we find that analogous trends can also be observed in other networks such as ResNet101 and Inception-v3, shown in Appendix Fig.11.
Batch size is another major factor. We compare ratios by decreasing the batch size from to . Results are shown in Fig.4(17). For models trained by SN, the ratios of BN are reduced at every RF range, see (14) and Fig.3(a,b). Similar trend can be observed in (57) for SN. These plots show that the dynamics of ratios are closely related to batch size, where BN would be reduced as it is unstable in small batch. This can be also viewed by comparing Fig.5(14), where smaller batch size leads to larger divergence for both SN and SN.
Input size and sample size are subordinate factors. Here we investigate whether reducing image size and number of training data changes the dynamics of ratios. Fig.4(811) show the ratios of these two variants. In particular, ResNet50+SN(8,32) is trained by downsampling each image from to (89), and trained using only 50% ImageNet data while keeping the previous input size (1011).
The top1/top5 accuracies of the above two variants are and respectively. It is seen that reducing image size degenerates accuracy in ImageNet more severely than reducing sample size by half. To closely see this, we compare divergences of their ratios to the previous model that achieves . As shown in Fig.5(56), divergences in (5) are generally larger than (6) in every RF range.
Solvers, random initializations, and learning rates are less relevant factors. Fig.5(7-9) examine the ratios under three factors: solver (SGD vs. RMSProp), random parameter initialization, and learning rate decay (stepwise vs. cosine decay). For the first two factors in (7-8), we find that their divergences in ‘RF:ALL’ are mostly smaller than 0.5, while the divergences in (9) are moderately larger than 0.5, showing that solver may marginally affect the ratios.
More specifically, these three factors have more impact in upper layers such as RF299 than lower layers. But they are not the main factors that affect the dynamics. For example, the BN ratios are compared in Fig.6 where the difference among three random initializations are small, while the ratios are also similar for different solvers and learning rate decays.
4.2 Object Detection and Scene Segmentation
Now we use distinct normalizers in detection and segmentation. To obtain representative results, we employ advanced frameworks such as Mask R-CNN [9] for detection in COCO [15] and DeepLabv2 [3] for segmentation in Cityscapes [5] and ADE20K [35]. All these frameworks use ResNet50 as the backbone and are trained by following the protocols of the corresponding benchmarks, while only the normalization layers are replaced by SN. More empirical settings are provided in Appendix C.
Comparisons. Table 3 compares different models that are pretrained by using BN [14], GN [31], and SN with three batch sizes. All these models are finetuned by , which is a popular setup in these benchmarks. Moreover, by following usual practice, BN is frozen in COCO and it is synchronized across 8 GPUs in Cityscapes and ADE20K, while GN and SN are neither frozen nor synchronized.
We see that using BN and GN uniformly in the network does not perform well in these tasks, even though BN is frozen or synchronized. In COCO and ADE20K, ‘SN’ performs significantly better than the others. In these two tasks, ‘SN’ and ‘SN’ may reduce performance, because BN has large ratio in these models, implying that finetuning SN pretrained with large batch to small batch would be unstable. Nevertheless, ‘SN’ and ‘SN’ achieve better results than ‘SN’ in Cityscapes. This could be attributed to large input image size 713713 that diminishes noise in the batch statistics of BN (in SN).
BN(8,32)  GN(8,32)  SN(8,32)  SN(8,4)  SN(8,2)  
mAP  38.6/2.4  40.2/0.8  39.8/1.2  –  41.0 
mAP  34.2/2.3  35.7/0.8  35.3/1.2  –  36.5 
mIoU  72.7/3.1  72.2/3.6  75.8/+0.4  75.8/+0.4  75.4 
mIoU  37.7/1.5  36.3/2.9  38.4/0.8  39.0/0.2  39.2 
Dynamics of ratios are more smooth in finetuning than pretraining. Fig.7 visualizes ratios in finetuning, which are more smooth than pretraining as compared to Fig.4. Intuitively, this is because a small learning rate is typically used in finetuning. In Fig.7, IN and LN ratios are generally larger than BN because of small batch size. The lower layers (RF49) prefer IN more than the upper layers (RF299) that choose LN.
When comparing different tasks, Fig.7(b,c) in segmentation have analogue dynamics where BN ratios are gently decreased (0.3). But they are different from detection in (a) where BN gradually increases when 49RF299 (0.5). This could be attributed to the twostage pipeline of Mask RCNN. To see this, Fig.8(b) and Appendix Fig.10 plot the ratios in COCO. We see that BN has larger impact in backbone and box stream than the other components. Furthermore, ratios in the backbone have different dynamics in pretraining and finetuning by comparing two ResNet50 models in Fig.8(a,b), that is, the BN ratios decrease in pretraining for recognition, while increase in finetuning for detection even though the batch size is .
Finetuning appropriate pretrained models is a good practice. We observe that ratios pretrained with different batch sizes bring different impacts in finetuning. Using models that are pretrained and finetuned with comparable batch size would be a best practice for good results. Otherwise, performance may degenerate.
We take ADE20K as an example. Appendix Fig.12 shows the ratios when SN is used in pretraining but SN in finetuning. In line with expectation, the BN ratios are suppressed during finetuning, as the batch statistics become unstable. However, these ratios are still suboptimal until training converged as shown in Appendix Fig.13, where the BN ratios finetuned from SN are still larger than those directly finetuned from SN, reducing performance in ADE20K (see Table 3).
5 Summary and Future Work
This work studies a new viewpoint in deep learning, showing that each convolutional layer would be better to select its own normalizer. We investigate ratios of normalizers in popular benchmarks including ImageNet, COCO, Cityscapes and ADE20K, and summarize our findings.

Learning dynamics of ratios in ImageNet are more relevant to depth of networks, batch size, and image size, but less pertinent to random parameter initialization, learning rate decay, and solver.

The ratios of BN are proportional to batch size, that is, BN ratios increase along with the increase of batch size. IN and LN are inversely proportional to batch size. LN is proportional to depth. BN and IN are inversely proportional to depth.

Removing any one normalizer from the candidate set {BN, IN, LN} harms generalization in ImageNet. A similar trait has been observed in other models such as ResNet101 and Inception-v3.

Hard (sparse) ratios could outperform soft ratios in SN. However, the soft ratios may help smooth the loss landscape: initializing the ratios with hard ratios instead of the uniform value 1/3 harms generalization. In other words, the sparsity of ratios should be enhanced during training, rather than at the very beginning of training.

Recognition, detection, and segmentation have distinct learning dynamics of ratios. More important practice of SN, trial and error are summarized in Appendix B.
Future work involves five aspects. (1) Algorithms will be devised to learn sparse ratios for SN. (2) We will impose structure on the ratios, such as dividing them into groups to choose a normalizer for each group. (3) As IN and LN also work well in low-level vision tasks and recurrent neural networks (RNNs), trying SN in these problems is also a future direction. (4) We will try to understand the learning and generalization ability of SN theoretically, though it is an open and challenging problem in deep learning. (5) Switching between whitening [19, 18] and standardization (e.g. BN) will also be important and valuable.
References
 [1] J. L. Ba, J. R. Kiros, and G. E. Hinton. Layer normalization. arXiv:1607.06450, 2016.
 [2] L. J. Ba and R. Caruana. Do deep nets really need to be deep? In NIPS, 2014.
 [3] L. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell., 40(4):834–848, 2018.
 [4] B. Colson, P. Marcotte, and G. Savard. An overview of bilevel optimization. Annals of operations research, 2007.

 [5] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele. The cityscapes dataset for semantic urban scene understanding. In CVPR, 2016.
 [6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
 [7] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017.
 [8] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv:1706.02677, 2017.
 [9] K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask r-cnn. In ICCV, 2017.
 [10] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016.
 [11] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. In arXiv:1704.04861, 2017.
 [12] G. Huang, Z. Liu, K. Q. Weinberger, and L. van der Maaten. Densely connected convolutional networks. 2017.
 [13] S. Ioffe. Batch renormalization: Towards reducing minibatch dependence in batchnormalized models. arXiv:1702.03275, 2017.
 [14] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015.
 [15] T.Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
 [16] H. Liu, K. Simonyan, and Y. Yang. Darts: Differentiable architecture search. arXiv:1806.09055, 2018.
 [17] I. Loshchilov and F. Hutter. Sgdr: stochastic gradient descent with restarts. arXiv:1608.03983, 2016.
 [18] P. Luo. Eigennet: Towards fast and structural learning of deep neural networks. IJCAI, 2017.
 [19] P. Luo. Learning deep architectures via generalized whitened neural networks. ICML, 2017.
 [20] P. Luo, J. Ren, Z. Peng, R. Zhang, and J. Li. Differentiable learningtonormalize via switchable normalization. In arXiv:1806.10779, 2018.
 [21] P. Luo, X. Wang, W. Shao, and Z. Peng. Towards understanding regularization in batch normalization. arXiv:1809.00846, 2018.

 [22] D. Maclaurin, D. Duvenaud, and R. Adams. Gradient-based hyperparameter optimization through reversible learning. In ICML, 2015.
 [23] H. Pham, M. Y. Guan, B. Zoph, Q. V. Le, and J. Dean. Efficient neural architecture search via parameter sharing. arXiv:1802.03268, 2018.
 [24] M. Ren, R. Liao, R. Urtasun, F. H. Sinz, and R. S. Zemel. Normalizing the normalizers: Comparing and extending network normalization schemes. In ICLR, 2016.
 [25] T. Salimans and D. P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. arXiv:1602.07868, 2016.
 [26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015.
 [27] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, 2016.
 [28] D. Ulyanov, A. Vedaldi, and V. Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv:1607.08022, 2016.
 [29] G. Urban, K. J. Geras, S. E. Kahou, O. Aslan, S. Wang, A. Mohamed, M. Philipose, M. Richardson, and R. Caruana. Do deep convolutional nets really need to be deep and convolutional? In ICLR, 2016.
 [30] G. Wang, J. Peng, P. Luo, X. Wang, and L. Lin. Batch kalman normalization: Towards training deep neural networks with microbatches. NIPS, 2018.
 [31] Y. Wu and K. He. Group normalization. arXiv:1803.08494, 2018.
 [32] C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In ICLR, 2017.
 [33] X. Zhang, X. Zhou, M. Lin, and J. Sun. Shufflenet: An extremely efficient convolutional neural network for mobile devices. In arXiv:1707.01083, 2017.
 [34] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia. Pyramid scene parsing network. In CVPR, 2017.
 [35] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba. Semantic understanding of scenes through the ade20k dataset. arXiv:1608.05442, 2016.
Appendices
A Relation with Meta Learning
We draw a connection between SN’s learning problem and meta learning (ML) [4, 22, 23], which can be also used to learn the control parameters in SN. In general, ML is defined as where is a constant multiplier. Unlike SN trained with a single stage, this loss function is minimized by performing two feedforward stages iteratively until converged. First, by fixing the current estimated control parameters , the network parameters are optimized by . Second, by fixing , the control parameters are found by .
The above two stages are usually optimized by using two different sets of training data. For example, previous work [16, 23] used to search network architectures from a set of modules with different numbers of parameters and computational complexities. They divided an entire training set into a training and a validation set without overlapping, where is learned from the validation set while is learned from the training set. This is because would choose the module with large complexity to overfit training data, if both and are optimized in the same dataset.
The above 2stage training increases computations and runtime. In contrast, and for SN can be generally optimized within a single stage in the same dataset, because regularizes training by choosing different normalizers from to prevent overfitting.
B Trial & Error
In Table 4, we report several important practices when learning to select normalizers. Many of these practices shed light on future work of sparse SN and synchronized SN.
1.  Initialize the ratios of normalizers uniformly, e.g. 1/3. Carefully tuning the initial ratios may harm generalization. 
2.  Adding dropout with a small ratio (e.g. 0.1-0.2) after each SN layer provides minor improvement of generalization in ImageNet, but it reduces overfitting. 
3.  Adding dropout in the last fullyconnected layer helps generalization in ImageNet. 
4.  A model in pretraining and finetuning should have comparable batch size. 
5.  Do not put SN after global pooling when the feature map size is 1×1, because IN and LN are unstable after global pooling. 
6.  SN performs comparably well with SN when batch size is small e.g. . 
7.  Sparse SN improves SN in ImageNet. 
8.  Sparse SN reduces computational runtime in inference compared to SN. 50% number of layers in sparse SN select BN for both and , meaning that these BN layers can be turned into linear transformation to reduce runtime in inference. 
9.  Synchronizing BN in SN improves generalization. 
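Practice 8 relies on the fact that a BN layer with fixed statistics is an affine map, so it can be folded into a per-channel scale and shift. The following is a minimal sketch, assuming the mean and variance have been precomputed and are held fixed at inference. The function name `fold_bn` and the value of `eps` are hypothetical.

```python
import numpy as np


def fold_bn(mean, var, gamma, beta, eps=1e-5):
    """Fold fixed BN statistics into a per-channel scale and shift.

    gamma * (x - mean) / sqrt(var + eps) + beta  ==  scale * x + shift
    """
    scale = gamma / np.sqrt(var + eps)
    shift = beta - mean * scale
    return scale, shift
```

After folding, each such layer costs one multiply and one add per activation, and the statistics no longer need to be computed at inference time.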
C Experimental Protocols
ImageNet. All models in ImageNet are trained on 1.2M images and evaluated on 50K validation images. They are trained by using SGD with different settings of batch sizes, which are denoted as a 2-tuple, (number of GPUs, number of samples per GPU). For each setting, the gradients are aggregated over all GPUs, and the means and variances of the normalization methods are computed in each GPU. The network parameters are initialized by following [10]. For all normalization methods, all scale parameters $\gamma$ are initialized as 1 and all shift parameters $\beta$ as 0. The ratio parameters of SN (for both means and variances) are initialized as 1/3. We apply weight decay to all parameters, including $\gamma$ and $\beta$. All models are trained for 100 epochs with an initial learning rate of 0.1, which is decreased by a factor of 10 after 30, 60, and 90 epochs. For different batch sizes, the initial learning rate is linearly scaled according to [8]. During training, we employ the same data augmentation as [10]. The top-1 classification accuracy on the 224×224 center crop is reported.
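The linear scaling of the learning rate with batch size can be sketched as follows. The sketch assumes that the stated initial learning rate of 0.1 corresponds to a reference batch size of 256 (8 GPUs with 32 samples each), as in the linear scaling rule of [8]. The function name `scaled_lr` and the parameter `base_batch` are hypothetical.

```python
def scaled_lr(num_gpus, samples_per_gpu, base_lr=0.1, base_batch=256):
    """Initial learning rate scaled linearly with the total batch size."""
    total_batch = num_gpus * samples_per_gpu
    return base_lr * total_batch / base_batch
```

Under this assumption, a (8, 2) setting with 16 samples in total would start from a learning rate of 0.1 × 16 / 256.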
COCO. We train all models on 8 GPUs with 2 images per GPU. Each image is rescaled so that its shorter side is 800 pixels. In particular, the learning rate (LR) is initialized as 0.02 and is decreased following the 2× LR schedule. We set weight decay to 0 for both $\gamma$ and $\beta$ following [31].
All the above models are trained on the 2017 train set of COCO by using SGD with a momentum of 0.9 and weight decay on the network parameters, and tested on the 2017 val set. We report the standard metric of COCO, that is, average precision at IoU=0.5:0.05:0.95 (AP).
Cityscapes and ADE20K.
We use 2 samples per GPU for ADE20K and 1 sample per GPU for Cityscapes. We employ the open-source software in PyTorch (https://github.com/CSAILVision/semantic-segmentation-pytorch) and only replace the normalization layers in CNNs, with the other settings fixed. For both datasets, we use DeepLabv2 [3] with ResNet50 as the backbone network, where and the last two blocks in the original ResNet contain atrous convolution with and respectively. Following [34], we employ “poly” learning rate policy with and use the auxiliary loss with the weight during training. The bilinear operation is adopted to upsample the score maps in the validation phase.
In ADE20K, we resize each image to and train for iterations. We perform multi-scale testing with . In Cityscapes, we use random crop with the size and train for epochs. For multi-scale testing, the inference scales are .
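The “poly” learning rate policy referred to above decays the learning rate polynomially with the iteration count. A minimal sketch follows. The default `power=0.9` is an assumption based on the common setting in [34], not a value taken from this section, and the function name `poly_lr` is hypothetical.

```python
def poly_lr(base_lr, iteration, max_iterations, power=0.9):
    """Polynomially decayed learning rate at a given training iteration."""
    # Fraction of training that is still ahead, in [0, 1].
    remaining = 1.0 - iteration / max_iterations
    return base_lr * remaining ** power
```

The rate starts at `base_lr`, decreases monotonically, and reaches zero at the final iteration.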
D More Results
More results are plotted in the remaining figures due to the limited length of the paper.