Mode Normalization

10/12/2018 ∙ by Lucas Deecke, et al.

Normalization methods are a central building block in the deep learning toolbox. They accelerate and stabilize training, while decreasing the dependence on manually tuned learning rate schedules. When learning from multi-modal distributions, the effectiveness of batch normalization (BN), arguably the most prominent normalization method, is reduced. As a remedy, we propose a more flexible approach: by extending the normalization to more than a single mean and variance, we detect modes of data on-the-fly, jointly normalizing samples that share common features. We demonstrate that our method outperforms BN and other widely used normalization techniques in several experiments, including single and multi-task datasets.




1 Introduction

A fundamental challenge in optimizing deep learning models is the continuous change in input distributions at each layer, complicating the training process. Normalization methods, such as batch normalization (BN) (Ioffe & Szegedy, 2015), are aimed at overcoming this issue, often referred to as internal covariate shift (Shimodaira, 2000). (Footnote 1: Note that the underlying mechanisms are still being explored from a theoretical perspective, see Kohler et al. (2018); Santurkar et al. (2018).)

When applied successfully in practice, BN enables the training of very deep networks, shortens training times by supporting larger learning rates, and reduces sensitivity to parameter initializations. As a result, BN has become an integral element of many state-of-the-art machine learning techniques (He et al., 2016; Silver et al., 2017).

Despite its great success, BN has drawbacks due to its strong reliance on the mini-batch statistics. While the stochastic uncertainty of the batch statistics acts as a regularizer that can boost the robustness and generalization of the network, it also has significant disadvantages when the estimates of the mean and variance become less accurate. In particular, heterogeneous data (Bilen & Vedaldi, 2017) and small batch sizes (Ioffe, 2017; Wu & He, 2018) are reported to cause inaccurate estimations and thus have a detrimental effect on models that incorporate BN. For the former, Bilen & Vedaldi (2017) showed that when training a deep neural network on images that come from a diverse set of visual domains, each with significantly different statistics, BN is not effective at normalizing the activations with a single mean and variance.

In this paper we relax the assumption that the entire mini-batch should be normalized with the same mean and variance. We propose a novel normalization method, mode normalization (MN), that first assigns samples in a mini-batch to different modes via a gating network, and then normalizes each sample with estimators for its corresponding mode (see Figure 1). We further show that MN can be incorporated into other normalization techniques such as group normalization (GN) (Wu & He, 2018) by learning which filters should be grouped together. The proposed methods can easily be implemented as layers in standard deep learning libraries, and their parameters are learned jointly with the other parameters of the network in an end-to-end manner. We evaluate MN on multiple classification tasks and demonstrate that it achieves a consistent improvement over BN and GN.

In Section 2, we present how this paper is related to previous work. We then review BN and GN, and introduce our method in Section 3. The proposed methods are evaluated on multiple benchmarks in Section 4, and our findings are summarized in Section 5.

Figure 1: In mode normalization, incoming samples are weighted by a set of gating functions $\{g_k\}_{k=1}^K$. Gated samples contribute to component-wise estimators $\mu_k$ and $\sigma_k$, under which the data is normalized. After a weighted summation, the batch is passed on to the next layer. Note that during inference, estimators are computed from running averages instead.

2 Related work


Normalizing input data (LeCun et al., 1998) or initial weights of neural networks (Glorot & Bengio, 2010) are known techniques to support faster model convergence, and were studied extensively in previous work. More recently, normalization has evolved into functional layers that adjust the internal activations of neural networks. Local response normalization (LRN) (Lyu & Simoncelli, 2008; Jarrett et al., 2009) is used in various models (Krizhevsky et al., 2012; Sermanet et al., 2014) to perform normalization in a local neighborhood, and thereby enforce competition between adjacent pixels in a feature map. BN (Ioffe & Szegedy, 2015) implements a more global normalization along the batch dimension. In contrast to LRN, BN requires two distinct train and inference modes. At training time, samples in each batch are normalized with the batch statistics, while during inference samples are normalized using precomputed statistics from the training set. Small batch sizes or heterogeneity can lead to inconsistencies between training and test data. Our proposed method alleviates such issues by better dealing with different modes in the data, simultaneously discovering these and normalizing the data accordingly.

Several recent normalization methods (Ba et al., 2016; Ulyanov et al., 2017; Ioffe, 2017) have emerged that perform normalization along the channel dimension (Ba et al., 2016), or over a single sample (Ulyanov et al., 2017) to overcome the limitations of BN. Ioffe (2017) proposes a batch renormalization strategy that clips gradients for estimators by using a predefined range to prevent degenerate cases. While these methods are effective for training sequential and generative models respectively, they have not been able to reach the same level of performance as BN in supervised classification. In parallel with these developments, BN has started to attract attention from theoretical viewpoints (Kohler et al., 2018; Santurkar et al., 2018).

More recently, Wu & He (2018) have proposed a simple yet effective alternative to BN by first dividing the channels into groups and then performing normalization within each group. The authors show that group normalization (GN) can be coupled with small batch sizes without any significant performance loss, and delivers comparable results to BN when the batch size is large. We build on this method in Section 3.3, and show that it is possible to automatically infer filter groupings. An alternative normalization strategy is to design a data independent reparametrization of the weights in a neural network by implicitly whitening the representation obtained at each layer (Desjardins et al., 2015; Arpit et al., 2016). While these methods show promising results, they do not generalize to arbitrary non-linearities and layers.

Mixtures of experts.

Mixtures of experts (MoE) (Jacobs et al., 1991; Jordan & Jacobs, 1994) are a family of models that involve combining a collection of simple learners to split up the learning problem. Samples are thereby allocated to differing subregions of the model that are best suited to deal with a given example. There is a vast body of literature describing how to incorporate MoE with different types of expert architectures such as SVMs (Collobert et al., 2002), Gaussian processes (Tresp, 2001), or deep neural networks (Eigen et al., 2013; Shazeer et al., 2017). Most similar to ours, Eigen et al. (2013) propose to use a different gating network at each layer in a multilayer network to enable an exponential number of combinations of expert opinions. While our method also uses a gating function at every layer to assign the samples in a mini-batch to separate modes, it differs from the above MoE approaches in two key aspects: (i.) we use the assignments from the gating functions to normalize the data within a corresponding mode, (ii.) the normalized data is forwarded to a common module (i.e. a convolutional layer) rather than to multiple separate experts.

Our method is also loosely related to Squeeze-and-Excitation Networks (Hu et al., 2018), which adaptively recalibrate channel-wise feature responses with a gating function. In contrast to their approach, we use the outputs of the gating function to normalize the responses within each mode.

Multi-domain learning.

Our approach also relates to methods that parametrize neural networks with domain-agnostic and specific layers, and transfer the agnostic parameters to the analysis of very different types of images (Bilen & Vedaldi, 2017; Rebuffi et al., 2017, 2018). In contrast to these methods, which require the supervision of domain knowledge to train domain-agnostic parameters, our method can automatically learn to discover modes both in single and multi-domain settings, without any supervision.

3 Method

We first review the formulations of BN and GN in Section 3.1, and introduce our methods in Sections 3.2 and 3.3.

3.1 Batch and group normalization

Our goal is to learn a prediction rule $f: \mathcal{X} \to \mathcal{Y}$ that infers a class label $y \in \mathcal{Y}$ for a previously unseen sample $x \in \mathcal{X}$. For this purpose, we optimize the parameters of $f$ on a training set $\{x_i\}_{i=1}^{N_d}$ for which the corresponding label information $\{y_i\}_{i=1}^{N_d}$ is available, where $N_d$ denotes the number of samples in the data.

Without loss of generality, in this paper we consider image data as the input, and deep convolutional neural networks as our model. In a slight abuse of notation, we also use the symbol $x$ to represent the features computed by layers within the deep network, producing a three-dimensional tensor $x \in \mathbb{R}^{C \times H \times W}$, where the dimensions indicate the number of feature channels, height and width respectively. Batch normalization (BN) computes estimators for the mini-batch $\{x_n\}_{n=1}^N$ (usually $N \ll N_d$) by average pooling over all but the channel dimensions. (Footnote 2: How estimators are computed is what differentiates many of the normalization techniques currently available. Wu & He (2018) provide a detailed introduction.) Then BN normalizes the samples in the batch as

$$\mathrm{BN}(x_n) = \alpha \left( \frac{x_n - \mu}{\sigma} \right) + \beta, \tag{1}$$

where $\mu$ and $\sigma$ are the mean and standard deviation of the mini-batch, respectively. The parameters $\alpha$ and $\beta$ are $C$-dimensional vectors representing a learned affine transformation along the channel dimensions, purposed to retain each layer’s representative capacity (Ioffe & Szegedy, 2015). This normalizing transformation ensures that the mini-batch has zero mean and unit variance when viewed along the channel dimensions.
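The transformation in (1) can be illustrated with a short NumPy sketch. It is only a minimal example under stated assumptions: the function name `batch_norm`, the `(N, C, H, W)` tensor layout, and the stabilizing constant `eps` are choices made for this illustration and are not taken from the original text.

```python
import numpy as np

def batch_norm(x, alpha, beta, eps=1e-5):
    """Normalize a batch of shape (N, C, H, W) per channel (training mode).

    alpha, beta: learned scale and shift, each of shape (C,).
    eps: small constant for numerical stability (an assumption of this sketch).
    """
    # Average-pool over every axis except the channel axis.
    mu = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    x_hat = (x - mu) / np.sqrt(var + eps)
    # Broadcast the C-dimensional affine parameters along the channel axis.
    return alpha.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(8, 4, 5, 5))
y = batch_norm(x, alpha=np.ones(4), beta=np.zeros(4))
```

With unit scale and zero shift, each channel of `y` has approximately zero mean and unit variance across the batch and spatial positions.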

Group normalization (GN) performs a similar transformation to that in (1), but normalizes along different dimensions. As such, GN first separates channels $c = 1, \dots, C$ into fixed groups $G_j$, over which it then jointly computes estimators, e.g. for the mean $\mu_j = |G_j|^{-1} \sum_{x_c \in G_j} x_c$. Note that GN does not average the statistics along the mini-batch dimension, and is thus of particular interest when large batch sizes become a prohibitive factor.
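For comparison, group normalization can be sketched in the same style. This is a hedged illustration rather than a reference implementation: `group_norm`, the contiguous assignment of channels to groups, and `eps` are assumptions of the example.

```python
import numpy as np

def group_norm(x, num_groups, alpha, beta, eps=1e-5):
    """Normalize each sample of a (N, C, H, W) batch within channel groups."""
    n, c, h, w = x.shape
    assert c % num_groups == 0, "channels must divide evenly into groups"
    # Assumption: groups are formed from contiguous blocks of channels.
    xg = x.reshape(n, num_groups, c // num_groups, h, w)
    # Statistics are computed per sample and per group, never across the batch.
    mu = xg.mean(axis=(2, 3, 4), keepdims=True)
    var = xg.var(axis=(2, 3, 4), keepdims=True)
    x_hat = ((xg - mu) / np.sqrt(var + eps)).reshape(n, c, h, w)
    return alpha.reshape(1, -1, 1, 1) * x_hat + beta.reshape(1, -1, 1, 1)

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 6, 3, 3))
y = group_norm(x, 2, np.ones(6), np.zeros(6))
# Normalizing the first sample on its own gives the same result,
# because no statistic depends on the other samples in the batch.
y_first = group_norm(x[:1], 2, np.ones(6), np.zeros(6))
```

The comparison between `y[:1]` and `y_first` makes the independence from the batch dimension explicit.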

A potential problem when using GN is that channels that are grouped together may be prevented from developing distinct characteristics in feature space. In addition, computing estimators from manually engineered rules such as those found in BN and GN can be too restrictive under a number of circumstances, for example when jointly learning on multiple domains.

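Algorithm 1 Mode normalization, training phase.
  Input: parameters $\lambda$, $K$, batch of samples $\{x_n\}_{n=1}^N$, small $\varepsilon$, learnable $\alpha$, $\beta$, $\Psi: \mathcal{X} \to \mathbb{R}^K$.
  Compute expert assignments: $g_{nk} \leftarrow [\sigma \circ \Psi(x_n)]_k$
  for $k = 1$ to $K$ do
     Determine new component-wise statistics: $N_k \leftarrow \sum_n g_{nk}$, $\langle x \rangle_k \leftarrow \frac{1}{N_k} \sum_n g_{nk} x_n$, $\langle x^2 \rangle_k \leftarrow \frac{1}{N_k} \sum_n g_{nk} x_n^2$
     Update running means: $\overline{\langle x \rangle}_k \leftarrow \lambda \langle x \rangle_k + (1 - \lambda) \overline{\langle x \rangle}_k$, $\overline{\langle x^2 \rangle}_k \leftarrow \lambda \langle x^2 \rangle_k + (1 - \lambda) \overline{\langle x^2 \rangle}_k$
  end for
  for $n = 1$ to $N$ do
     Normalize samples with component-wise estimators: $\mu_k \leftarrow \langle x \rangle_k$, $\sigma_k^2 \leftarrow \langle x^2 \rangle_k - \langle x \rangle_k^2$, $y_{nk} \leftarrow g_{nk} \frac{x_n - \mu_k}{\sqrt{\sigma_k^2 + \varepsilon}}$
  end for
  Return: $\{\alpha \sum_k y_{nk} + \beta\}_{n=1}^N$

Algorithm 2 Mode normalization, test phase.
  Input: refer to Alg. 1.
  Compute expert assignments: $g_{nk} \leftarrow [\sigma \circ \Psi(x_n)]_k$
  for $n = 1$ to $N$ do
     Normalize samples with running average of component-wise estimators: $\mu_k \leftarrow \overline{\langle x \rangle}_k$, $\sigma_k^2 \leftarrow \overline{\langle x^2 \rangle}_k - \overline{\langle x \rangle}_k^2$, $y_{nk} \leftarrow g_{nk} \frac{x_n - \mu_k}{\sqrt{\sigma_k^2 + \varepsilon}}$
  end for
  Return: $\{\alpha \sum_k y_{nk} + \beta\}_{n=1}^N$

Algorithm 3 Mode group normalization.
  Input: parameter $K$, sample $x \in \mathbb{R}^C$, small $\varepsilon$, learnable $\alpha$, $\beta$, $\Psi: \mathbb{R} \to \mathbb{R}^K$.
  Compute channel-wise gates: $g_{ck} \leftarrow [\sigma \circ \Psi(x_c)]_k$
  for $k = 1$ to $K$ do
     Update estimators and normalize: $\mu_k \leftarrow \langle x \rangle_k$, $\sigma_k^2 \leftarrow \langle x^2 \rangle_k - \langle x \rangle_k^2$, $y_k \leftarrow \frac{x - \mu_k}{\sqrt{\sigma_k^2 + \varepsilon}}$
  end for
  Return: $\frac{\alpha}{K} \sum_k y_k + \beta$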

3.2 Mode normalization

The heterogeneous nature of complex datasets motivates us to propose a more flexible treatment of normalization. Before the actual normalization is carried out, the data is first organized into modes to which it likely belongs. To achieve this, we reformulate the normalization in the framework of mixtures of experts (MoE). In particular, we introduce a set of simple gating functions $\{g_k\}_{k=1}^K$ where $g_k: \mathcal{X} \to [0, 1]$ and $\sum_k g_k(x) = 1$. In mode normalization (MN, Alg. 1), each sample in the mini-batch is then normalized under voting from its gate assignment:

$$\mathrm{MN}(x_n) = \alpha \left( \sum_{k=1}^K g_k(x_n) \, \frac{x_n - \mu_k}{\sigma_k} \right) + \beta, \tag{2}$$

where $\alpha$ and $\beta$ are a learned affine transformation, just as in standard BN. (Footnote 3: We experimented with learning individual $\{(\alpha_k, \beta_k)\}_{k=1}^K$ for each mode. However, we have not observed any additional gains in performance from this.)

The estimators for mean and variance are computed under weighting from the gating network, e.g. the $k$'th mean is estimated from the batch as

$$\mu_k = \langle x \rangle_k = \frac{1}{N_k} \sum_n g_k(x_n) \, x_n, \tag{3}$$

where $N_k = \sum_n g_k(x_n)$. In our experiments, we parametrize the gating networks via an affine transformation $\Psi: \mathcal{X} \to \mathbb{R}^K$ which is jointly learned alongside the other parameters of the network. This transformation is followed by a softmax activation $\sigma: \mathbb{R}^K \to [0, 1]^K$, reminiscent of attention mechanisms (Denil et al., 2012; Vinyals et al., 2015). Note that if we set $K = 1$, or when the gates collapse $g_k(x_n) = \text{const.}$ for all $k, n$, then (2) becomes equivalent to BN, c.f. (1).
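The training-time computation described above can be sketched in NumPy as follows. This is an illustrative reading of the text, not the authors' implementation: the names `mode_norm_train`, `psi_w`, and `psi_b` are hypothetical, and feeding spatially average-pooled features to the affine gating map is an assumption of this sketch.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mode_norm_train(x, psi_w, psi_b, alpha, beta, eps=1e-5):
    """Mode normalization on a (N, C, H, W) batch with K soft modes.

    psi_w: (C, K) and psi_b: (K,) parametrize the affine gating map.
    Assumption: the gate input is the spatial average of each sample.
    """
    n, c, h, w = x.shape
    pooled = x.mean(axis=(2, 3))                      # (N, C)
    gates = softmax(pooled @ psi_w + psi_b, axis=1)   # (N, K), rows sum to 1
    out = np.zeros_like(x)
    for k in range(gates.shape[1]):
        g = gates[:, k].reshape(n, 1, 1, 1)
        n_k = gates[:, k].sum()                       # soft sample count of mode k
        # Gate-weighted first and second moments, one value per channel.
        mean = (g * x).sum(axis=(0, 2, 3), keepdims=True) / (n_k * h * w)
        second = (g * x ** 2).sum(axis=(0, 2, 3), keepdims=True) / (n_k * h * w)
        var = second - mean ** 2
        out += g * (x - mean) / np.sqrt(var + eps)
    return alpha.reshape(1, -1, 1, 1) * out + beta.reshape(1, -1, 1, 1), gates

rng = np.random.default_rng(0)
x = rng.normal(size=(16, 3, 4, 4))
alpha, beta = np.ones(3), np.zeros(3)
y, gates = mode_norm_train(x, rng.normal(size=(3, 2)), np.zeros(2), alpha, beta)
# With a single mode every gate equals one and the result matches batch norm.
y_single, _ = mode_norm_train(x, np.zeros((3, 1)), np.zeros(1), alpha, beta)
```

The single-mode call illustrates the remark that the method reduces to BN when only one mode is used.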

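As in BN, during training we normalize samples with estimators computed from the current batch. To normalize the data during inference (Alg. 2), we keep track of component-wise running estimates, borrowing from online EM approaches (Cappé & Moulines, 2009; Liang & Klein, 2009). Running estimates are updated in each iteration with a memory parameter $\lambda$, e.g. for the mean:

$$\overline{\langle x \rangle}_k \leftarrow \lambda \langle x \rangle_k + (1 - \lambda) \, \overline{\langle x \rangle}_k, \tag{4}$$

where $\langle x \rangle_k$ denotes the $k$'th mean estimated from the current batch and $\overline{\langle x \rangle}_k$ its running estimate.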
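A minimal sketch of the running update and its use at inference is given below. It assumes flat `(N, C)` features for brevity, and the names `update_running` and `mode_norm_eval` as well as the example value of `lam` are hypothetical choices for this illustration.

```python
import numpy as np

def update_running(running, batch_stat, lam):
    """Exponential running estimate with memory parameter lam."""
    return lam * batch_stat + (1.0 - lam) * running

def mode_norm_eval(x, gates, run_mean, run_second, alpha, beta, eps=1e-5):
    """Inference-time mode normalization with stored statistics.

    x: (N, C) features, gates: (N, K) soft assignments,
    run_mean, run_second: (K, C) running first and second moments.
    """
    var = run_second - run_mean ** 2                                  # (K, C)
    z = (x[:, None, :] - run_mean[None]) / np.sqrt(var[None] + eps)   # (N, K, C)
    return alpha * (gates[:, :, None] * z).sum(axis=1) + beta

# Toy example: two modes that happen to share the same statistics.
run_mean = update_running(np.zeros((2, 2)), np.ones((2, 2)), lam=0.5)  # all 0.5
run_second = np.full((2, 2), 1.25)      # variance 1.25 - 0.5**2 = 1
x = np.array([[1.5, -0.5]])
gates = np.array([[0.3, 0.7]])
y = mode_norm_eval(x, gates, run_mean, run_second, np.ones(2), np.zeros(2))
```

Because both modes carry identical statistics in this toy case, the gate weights have no effect and `y` is simply the standardized input, approximately `[[1, -1]]`.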


Bengio et al. (2015) and Shazeer et al. (2017) propose the use of additional losses that either prevent all samples from focusing on a single gate, encourage sparsity in the gate activations, or enforce variance over gate assignments. In MN, such additional penalties are not needed. Importantly, we want MN to be able to seek out a form in which it recovers traditional BN, whenever that is the optimal thing to do. In practice, we seldom observed this behavior: gates tend to receive an even share of samples overall, and they are usually assigned to individual modes.
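Whether gates receive an even share of samples can be checked with a simple diagnostic. The helper below is a hypothetical illustration (the name `gate_usage` and the toy gate matrix are not from the original text); it reports the fraction of the batch that is softly assigned to each mode.

```python
import numpy as np

def gate_usage(gates):
    """Fraction of the batch softly assigned to each mode.

    gates: (N, K) matrix of soft assignments whose rows sum to one.
    """
    return gates.sum(axis=0) / gates.shape[0]

# Toy gate matrix for four samples and two modes.
gates = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.5, 0.5],
                  [0.4, 0.6]])
usage = gate_usage(gates)
```

A usage vector close to uniform indicates that no mode has collapsed, while a vector concentrated on one entry would indicate behavior close to standard BN.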

3.3 Mode group normalization

As discussed in Section 2, GN is less sensitive to the batch size (Wu & He, 2018). Here, we show that similarly to BN, GN can also benefit from soft assignments into different modes. In contrast to BN, GN computes averages over individual samples instead of the entire mini-batch. This makes slight modifications necessary, resulting in mode group normalization (MGN, Alg. 3). Instead of learning mappings with their preimage in $\mathcal{X}$, in MGN we learn a gating network $g: \mathbb{R} \to \mathbb{R}^K$ that assigns channels to modes. After average-pooling over width and height, estimators are computed by averaging over channel values $x_c \in \mathbb{R}$, for example for the mean $\mu_k = \langle x \rangle_k = C_k^{-1} \sum_c g_k(x_c) \, x_c$, where $C_k = \sum_c g_k(x_c)$. Each sample is subsequently transformed via

$$\mathrm{MGN}(x) = \frac{\alpha}{K} \sum_{k=1}^K \frac{x - \mu_k}{\sigma_k} + \beta, \tag{5}$$

where $\alpha$ and $\beta$ are learnable parameters for channel-wise affine transformations. One of the notable advantages of MGN (that it shares with GN) is that inputs are transformed in the same way during training and inference.
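The per-sample computation can be sketched as follows for a single vector of pooled channel values. Several details are assumptions of this sketch rather than statements from the text: the scalar affine gate parameters `psi_w` and `psi_b`, the function name `mode_group_norm`, and the choice to combine the mode-wise normalizations by an unweighted average over modes.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)  # stabilize the exponentials
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mode_group_norm(x, psi_w, psi_b, alpha, beta, eps=1e-5):
    """Mode group normalization for one sample.

    x: (C,) channel values after spatial average pooling.
    psi_w, psi_b: (K,) parameters of a scalar-to-K affine gate (assumed form).
    alpha, beta: (C,) channel-wise affine parameters.
    """
    gates = softmax(np.outer(x, psi_w) + psi_b, axis=1)   # (C, K)
    c_k = gates.sum(axis=0)                               # soft channel counts
    mean = gates.T @ x / c_k                              # (K,)
    second = gates.T @ x ** 2 / c_k                       # (K,)
    var = second - mean ** 2
    z = (x[:, None] - mean) / np.sqrt(var + eps)          # (C, K)
    # Assumption: average the K mode-wise normalizations with equal weight.
    return alpha / gates.shape[1] * z.sum(axis=1) + beta

rng = np.random.default_rng(0)
x = rng.normal(size=8)
y = mode_group_norm(x, rng.normal(size=2), np.zeros(2), np.ones(8), np.zeros(8))
# With K = 1 all channels share one mode and x is standardized over channels.
y_single = mode_group_norm(x, np.zeros(1), np.zeros(1), np.ones(8), np.zeros(8))
```

No batch statistics or running averages appear anywhere in the function, which is why the same code path serves both training and inference.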

A potential risk for clustering approaches is that clusters or modes might collapse into one, as described by e.g. Xu et al. (2005). Although it is possible to address this with a regularizer, it has not been an issue in either MN or MGN experiments. This is likely a consequence of the large dimensionality of feature spaces that we study in this paper, as well as sufficient levels of variation in the data.

4 Experiments

We consider two experimental settings to evaluate our methods: (i.) multi-task, and (ii.) single task. All experiments use standard routines within PyTorch (Paszke et al., 2017).

4.1 Multi-task


In the first experiment, we wish to enforce heterogeneity in the data distribution, i.e. explicitly design a distribution of the form $p(x) = \sum_d \pi_d \, p_d(x)$. We realize this by generating a dataset whose images come from significantly diverse distributions, combining four image datasets: (i.) MNIST (LeCun, 1998), which contains grayscale scans of handwritten digits. The dataset has a total of 60000 training samples, as well as 10000 samples set aside for validation. (ii.) CIFAR-10 (Krizhevsky & Hinton, 2009) is a dataset of colored images that show real world objects of one of ten classes. It contains 50000 training and 10000 test images. (iii.) SVHN (Netzer et al., 2011) is a real-world dataset consisting of 73257 training samples, and 26032 samples for testing. Each image shows one of ten digits in natural scenes. (iv.) Fashion-MNIST (Xiao et al., 2017) consists of the same number of single-channel images as are contained in MNIST. The images contain fashion items such as sneakers, sandals, or dresses instead of digits as object classes. We assume that labels are mutually exclusive, and train a single network, LeNet (LeCun et al., 1989) with a 40-way classifier at the end, to jointly learn predictions on them.
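One simple way to obtain 40 mutually exclusive labels from four ten-class datasets is to offset each label by its dataset index. The mapping below is a hypothetical illustration: the ordering of the datasets and the helper name `joint_label` are assumptions, not details given in the text.

```python
# Assumed ordering of the four datasets; any fixed ordering would work.
DATASETS = ["MNIST", "CIFAR-10", "SVHN", "Fashion-MNIST"]
CLASSES_PER_DATASET = 10

def joint_label(dataset, label):
    """Map a (dataset, local label) pair to a label in the range 0..39."""
    if not 0 <= label < CLASSES_PER_DATASET:
        raise ValueError("label must lie between 0 and 9")
    return DATASETS.index(dataset) * CLASSES_PER_DATASET + label

# Every (dataset, label) pair receives its own class index.
all_labels = {joint_label(d, l) for d in DATASETS for l in range(CLASSES_PER_DATASET)}
```

Under this scheme a digit "3" from MNIST and a digit "3" from SVHN are treated as different classes, which matches the assumption of mutually exclusive labels.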

Mode normalization.

Training is carried out for 3.5 million data touches (15 epochs), with learning rate reductions by 1/10 after 2.5 and 3 million data touches, respectively. Note that training for additional epochs did not result in any notable performance gains. The batch size is set to , and running estimates are kept with . We vary the number of modes in MN over $K \in \{2, 4, 6\}$. Average performances over five random initializations as well as standard deviations are shown in Table 1. MN outperforms standard BN, as well as all other normalization methods. This shows that accounting for multiple modes is an effective way to normalize intermediate features when the data is heterogeneous.
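The step schedule described above, expressed in data touches rather than epochs, can be sketched as a small function. The initial learning rate is left as the parameter `base_lr` because its value is not given here; the function name and the choice to apply each reduction as soon as a milestone is reached are assumptions of this sketch.

```python
def learning_rate(samples_seen, base_lr, milestones=(2_500_000, 3_000_000), factor=0.1):
    """Step schedule: multiply the rate by `factor` at each milestone reached.

    samples_seen: number of training samples processed so far (data touches).
    base_lr: initial learning rate, a placeholder supplied by the caller.
    """
    drops = sum(samples_seen >= m for m in milestones)
    return base_lr * factor ** drops

# With a placeholder base rate of 1.0 the schedule takes three values.
lr_start = learning_rate(0, 1.0)
lr_mid = learning_rate(2_700_000, 1.0)
lr_end = learning_rate(3_200_000, 1.0)
```

Counting samples instead of epochs keeps the schedule well defined when the four datasets have different sizes.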

Interestingly, increasing $K$ does not always improve the performance. The reduction in effectiveness of higher mode numbers is likely a consequence of finite estimation, i.e. of computing estimates from smaller and smaller partitions of the batch, a known issue in traditional BN. (Footnote 4: Experiments with larger batch sizes support this argument, see Appendix.) In all remaining trials which involve single datasets and deeper networks, we therefore fixed $K = 2$. Note that the additional overhead from coupling LeNet with MN is limited. Even in our naive implementation, setting results in roughly a 5% increase in runtime.

Method   K    Error (%)
BN       -    26.91 ± 1.08
IN       -    28.87 ± 2.28
LN       -    27.31 ± 0.71
MN       2    23.16 ± 1.23
MN       4    24.25 ± 0.71
MN       6    25.12 ± 1.48
Table 1: Test set error rates (%) of batch norm (BN), instance norm (IN) (Ulyanov et al., 2017), layer norm (LN) (Ba et al., 2016), and mode norm (MN) in the multi-task setting for a batch size of . Shown are average top performances over five initializations alongside standard deviations. Additional results for $N \in \{256, 512\}$ are shown in the Appendix.

Mode group normalization.

Group normalization is designed specifically for applications in which large batch sizes become prohibitive. We therefore simulate this by reducing batch sizes to $N \in \{4, 8, 16\}$, and train each model for gradient updates. This uses the same configuration as previously, except for a smaller initial learning rate , which is reduced by 1/10 after and updates. In GN, we allocate two groups per layer, and accordingly set $K = 2$ in MGN. As a baseline, results for BN and MN were also included. Average performances over five initializations and their standard deviations are shown in Table 2. As previously shown by Wu & He (2018), BN fails to maintain its performance when the batch size is small during training. Though MN performs slightly better than BN, its performance also degrades in this regime. GN is more robust to small batch sizes, yet MGN further improves over GN and, by combining the advantages of GN and MN, achieves the best performance for different batch sizes among all four methods.

N     BN             MN             GN             MGN
4     33.40 ± 0.75   32.80 ± 1.59   32.15 ± 1.10   31.30 ± 1.65
8     31.98 ± 1.53   29.05 ± 1.51   28.60 ± 1.45   26.83 ± 1.34
16    30.38 ± 0.60   28.70 ± 0.68   27.63 ± 0.45   26.00 ± 1.68
Table 2: Test set error rates (%) for BN, MN, group norm (GN) and mode group norm (MGN) on small batch sizes $N$. Shown are average top performances over five initializations alongside standard deviations.

4.2 Single task


Here our method is evaluated in single image classification tasks, showing that it can be used to improve performance in several recently proposed convolutional networks. For this, we incorporate MN into multiple modern architectures, first evaluating it on CIFAR10 and CIFAR100 datasets and later on a large-scale dataset, ILSVRC12 (Deng et al., 2009). Unlike CIFAR10, CIFAR100 has 100 classes, but contains the same total number of images, with 600 images per class. ILSVRC12 contains around 1.2 million images from 1000 object categories.

Network In Network.

Since the original Network In Network (NIN) (Lin et al., 2013) does not contain any normalization layers, we modify the network architecture to add them, coupling each convolutional layer with a normalization layer (either BN or MN). We then train the resulting model on CIFAR10 and CIFAR100 for 100 epochs with SGD and momentum as optimizer, using a batch size of . Initial learning rates are set to , which we reduce by 1/10 at epochs 65 and 80 for all methods. Running averages are stored with . During training we randomly flip images horizontally, and crop each image after padding it with four pixels on each side. Dropout (Srivastava et al., 2014) is known to occasionally cause issues in combination with BN (Li et al., 2018), and reducing it to 0.25 (as opposed to 0.5 in the original publication) was beneficial to performance. Note that incorporating MN with $K = 2$ into NIN adds less than 1% to the number of trainable parameters.
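The augmentation described above (random horizontal flip, then a random crop after padding by four pixels) can be sketched in NumPy. The function name `augment`, the flip probability of 0.5, and the use of zero padding are assumptions of this example; the text does not specify them.

```python
import numpy as np

def augment(img, rng, pad=4):
    """Randomly flip and crop a single image of shape (C, H, W)."""
    c, h, w = img.shape
    if rng.random() < 0.5:              # assumed flip probability
        img = img[:, :, ::-1]           # horizontal flip along the width axis
    # Assumption: pad with zeros on each spatial side.
    padded = np.pad(img, ((0, 0), (pad, pad), (pad, pad)), mode="constant")
    # Pick the top-left corner of an H x W crop inside the padded image.
    top = rng.integers(0, 2 * pad + 1)
    left = rng.integers(0, 2 * pad + 1)
    return padded[:, top:top + h, left:left + w]

rng = np.random.default_rng(0)
img = rng.normal(size=(3, 32, 32))
out = augment(img, rng)
# Without padding the crop is the full image, so only the flip can change it.
unpadded = augment(img, rng, pad=0)
```

The crop keeps the original spatial size, so augmented images can be batched with unaugmented ones.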

We report the test error rates with NIN on CIFAR10 and CIFAR100 in Table 3 (left). We first observe that NIN with BN obtains an error rate similar to that reported for the original network in Lin et al. (2013). MN ($K = 2$) achieves an additional boost of 0.4% and 0.6% over BN on CIFAR10 and CIFAR100, respectively.

Network In Network   Lin et al.   BN      MN
CIFAR10              8.81         8.82    8.42
CIFAR100             -            32.30   31.66

VGG13                             BN      MN
CIFAR10                           8.28    7.79
CIFAR100                          31.15   30.06

ResNet20             He et al.    BN      MN
CIFAR10              8.75         8.44    7.99
CIFAR100             -            31.56   30.53

Table 3: Test set error rates (%) with BN and MN for deep architectures on CIFAR10, CIFAR100. Shown are NIN (left), VGG13 (middle) and ResNet20 (right).

VGG Networks.

Another popular class of deep convolutional neural networks is VGG networks (Simonyan & Zisserman, 2014). In particular we trained a VGG13 with BN and MN on CIFAR10 and CIFAR100. For both datasets we optimized using SGD with momentum for 100 epochs, setting the initial learning rate to , and reducing it at epochs 65, 80, and 90 by a factor of 1/10. The batch size is set to . As before, we set the number of modes in MN to $K = 2$, and keep estimators with . When incorporated into the network, MN improves the performance of VGG13 by 0.4% on CIFAR10, and gains over 1% on CIFAR100.

Residual Networks.

Contrary to NIN and VGG, Residual Networks (He et al., 2016) were originally conceptualized with layer-wise batch normalizations. We trained a ResNet20 on CIFAR10 and CIFAR100 in its original architecture (i.e. with BN), as well as with MN ($K = 2$), see Table 3 (right). On both datasets we follow the standard training procedure and train both models for 160 epochs, with SGD as optimizer, momentum parameter of 0.9, and weight decay of . Running estimates were kept with , the batch size set to . Our implementation of ResNet20 (BN in Table 3) performs slightly better than that reported in the original publication (8.44% versus 8.75%). Replacing BN with MN achieves a notable 0.45% and 0.7% performance gain over BN in CIFAR10 and CIFAR100, respectively.

We also test our method in the large-scale image recognition task of ILSVRC12. Concretely, we replaced BN in a ResNet18 with MN ($K = 2$), and trained both resulting models on ILSVRC12 for 90 epochs. We set the initial learning rate to , reducing it at epochs 30 and 60 by a factor of 1/10. SGD was used as optimizer (with momentum parameter set to 0.9, weight decay of ). To accelerate training we distributed the model over four GPUs, with an overall batch size of . As can be seen from Table 4, MN results in a small but consistent improvement over BN in terms of top-1 and top-5 errors.

Top-k error (%)   BN      MN
k = 1             30.25   30.07
k = 5             10.90   10.65
Table 4: Top-1 and top-5 error rates (%) of ResNet18 on ImageNet ILSVRC12, with BN and MN.
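For reference, a top-k error rate of the kind reported in Table 4 counts a prediction as correct when the true label is among the k highest-scoring classes. The helper below is a generic sketch with a hypothetical name and toy inputs, not the evaluation code used for the reported numbers.

```python
import numpy as np

def top_k_error(scores, labels, k):
    """Percentage of samples whose true label is not among the top-k scores.

    scores: (N, num_classes) array of class scores.
    labels: (N,) array of integer class labels.
    """
    # Indices of the k largest scores in each row, best first.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hit = (top_k == labels[:, None]).any(axis=1)
    return 100.0 * (1.0 - hit.mean())

# Toy example with three samples and three classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2],
                   [0.2, 0.3, 0.5]])
labels = np.array([1, 1, 0])
err1 = top_k_error(scores, labels, 1)   # only the first sample is correct
err2 = top_k_error(scores, labels, 2)   # the first two samples are correct
```

By construction the top-5 error can never exceed the top-1 error on the same predictions.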

Qualitative analysis.

In Fig. 2 we evaluated the experts for samples from the CIFAR10 test set in layers conv3-64-1 and conv3-256-1 of VGG13, and show those samples that have been assigned the highest probability to belong to either of the $K = 2$ modes. In the normalization belonging to conv3-64-1, MN is sensitive to a red-blue color mode, and separates images accordingly. In deeper layers such as conv3-256-1, separations seem to occur on the semantic level. In this particular example, MN separates smaller objects from those that occupy a large portion of the image.
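Selecting the samples that a layer assigns most confidently to each mode amounts to sorting the gate activations column by column. The sketch below uses a hypothetical helper name and a toy gate matrix; in practice the gates would come from the normalization layer under inspection.

```python
import numpy as np

def top_samples_per_mode(gates, n_top):
    """Indices of the n_top samples with the highest gate value for each mode.

    gates: (N, K) matrix of soft assignments from one normalization layer.
    Returns a list with one index array per mode, most confident first.
    """
    return [np.argsort(-gates[:, k])[:n_top] for k in range(gates.shape[1])]

# Toy gate activations for four samples and two modes.
gates = np.array([[0.90, 0.10],
                  [0.20, 0.80],
                  [0.60, 0.40],
                  [0.05, 0.95]])
top = top_samples_per_mode(gates, n_top=2)
```

For the toy matrix, samples 0 and 2 are selected for the first mode and samples 3 and 1 for the second.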

Figure 2: Test samples from CIFAR10 that were clustered together by two experts in an early layer (top) and a deeper layer (bottom) of VGG13.

5 Conclusion

Stabilizing the training process of deep neural networks is a challenging problem. Several normalization approaches that aim to tackle this issue have recently emerged, enabling training with higher learning rates, faster model convergence, and allowing for more complex network architectures.

Here, we showed that two widely used normalization techniques, BN and GN, can be extended to allow the network to jointly normalize its features within multiple modes. We further demonstrated that our method can be incorporated into various deep network architectures and consistently improves their classification performance with a negligible increase in computational overhead. As part of future work, we plan to explore customized, layer-wise mode numbers in MN, and automatically determining them, e.g. by utilizing concepts from sparsity regularization.


Appendix A Additional multi-task results

Shown in Table 5 are additional results for jointly training on MNIST, CIFAR10, SVHN, and Fashion-MNIST. The same network is used as in previous multi-task experiments, for hyperparameters see Section 4. In these additional experiments, we varied the batch size to $N \in \{256, 512\}$. For larger batch sizes, increasing $K$ to values larger than two increases performance, while for the smaller batch size used in Table 1, errors incurred by finite estimation prevent this benefit from appearing.

N     Method   K    Error (%)
256   BN       -    26.34 ± 1.82
256   IN       -    31.15 ± 3.45
256   LN       -    26.95 ± 2.51
256   MN       2    25.29 ± 1.31
256   MN       4    25.04 ± 1.88
256   MN       6    24.88 ± 1.24
512   BN       -    26.51 ± 1.15
512   IN       -    29.00 ± 1.85
512   LN       -    28.98 ± 1.32
512   MN       2    26.18 ± 1.86
512   MN       4    24.29 ± 1.82
512   MN       6    25.33 ± 1.33
Table 5: Test set error rates (%) of multiple normalization methods in the multi-task setting for large batch sizes. The table contains average performances over five initializations, alongside their standard deviation.