Batch Normalization (BN) (Ioffe and Szegedy 2015) normalizes the features of an input image using the statistics of a batch of images. This batch information can be regarded as batch noise that BN brings to the features of an instance. We offer the point of view that a self-attention mechanism can help regulate the batch noise by enhancing instance-specific information. Based on this view, we propose combining BN with a self-attention mechanism to adjust the batch noise, and give an attention-based version of BN called Instance Enhancement Batch Normalization (IEBN), which recalibrates channel information by a simple linear transformation. IEBN outperforms BN with a light parameter increment in various visual tasks across different network structures and benchmark datasets. Moreover, even under the attack of synthetic noise, IEBN can still stabilize network training with good generalization. The code of IEBN is available at https://github.com/gbup-group/IEBN
Mini-batch Stochastic Gradient Descent (SGD) is a simple and effective method for large-scale optimization that aggregates multiple samples at each iteration to reduce operation and memory cost. However, SGD is sensitive to the choice of hyperparameters, which may cause training instability [Luo, Xiong, and Liu 2019]. Normalization is one possible remedy that gives SGD methods better stability and generalization. Batch Normalization (BN) [Ioffe and Szegedy 2015] is a frequently used normalization method that normalizes the features of an image using the mean and variance of the features of a batch of images during training. Meanwhile, the tracked mean and variance, which estimate the statistics of the whole dataset, are used for normalization during testing. It has been shown that BN is an effective module to regularize parameters [Luo et al. 2019], stabilize training, smooth gradients [Santurkar et al. 2018], and enable a larger learning rate [Bjorck et al. 2018; Cai, Li, and Shen 2019] for faster convergence.
This paper is concerned with two kinds of noise effects in SGD and BN.
Estimation Noise. In BN, the mean and variance of a batch are used to estimate those of the whole dataset; in SGD, the gradient of the loss over the batch is applied to approximate that of the whole dataset. These estimation errors are called estimation noise.
Batch Noise. In the forward pass, BN incorporates batch information into the features of an instance via the normalization with batch statistics. In back-propagation, the gradient of an instance is disturbed by the batch information due to BN and SGD. These disturbances to an instance caused by the batch are referred to as batch noise.
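The estimation noise described above can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the "dataset" is a list of synthetic Gaussian scalars rather than image features, and the helper name `batch_mean_spread` is hypothetical. It measures how much the batch mean fluctuates around the dataset mean for a small and a large batch size.

```python
import random
import statistics

def batch_mean_spread(dataset, batch_size, trials=2000, seed=0):
    """Standard deviation of the batch mean over many randomly drawn batches."""
    rng = random.Random(seed)
    means = [statistics.fmean(rng.sample(dataset, batch_size)) for _ in range(trials)]
    return statistics.pstdev(means)

# Synthetic stand-in for one feature over a whole dataset (an assumption).
data_rng = random.Random(42)
dataset = [data_rng.gauss(0.0, 1.0) for _ in range(10_000)]

small = batch_mean_spread(dataset, batch_size=4)
large = batch_mean_spread(dataset, batch_size=128)
print(f"spread of batch mean, batch size 4:   {small:.3f}")
print(f"spread of batch mean, batch size 128: {large:.3f}")
```

The spread for the small batch is markedly larger, which matches the statement that a small batch size leads to a high variance of the statistics.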
The randomness of BN and SGD is well known to improve the performance of deep networks, and there is extensive work on optimizing its effectiveness by tuning the batch size. On the one hand, a small batch size leads to a high variance of the statistics and weakens training stability. On the other hand, a large batch size reduces the estimation noise but yields a sharp loss landscape [Keskar et al. 2016], making the optimization problem more challenging. It is therefore important to choose a batch size that strikes a good balance, but the noise still exists. Both kinds of noise ultimately influence the gradient through the forward pass and back-propagation. In fact, appropriate estimation noise and batch noise can benefit the generalization of the network: BN with estimation noise can work as an adaptive regularizer of parameters [Luo et al. 2019], and moderate noise can help escape bad local minima and saddle points [Jin et al. 2017; Ge et al. 2015].
It is an art to infuse a model with the appropriate noise. We argue that a self-attention mechanism is an adaptive noise regulator for the model by enhancing instance specificity. Appropriate noise enables a model with BN to ease optimization and benefit generalization, which motivates us to design a new normalization that combines the advantages of BN and self-attention. This paper proposes an attention-based BN, called Instance Enhancement Batch Normalization (IEBN), which adaptively emphasizes instance information. The idea behind IEBN is simple. As shown in Fig. 1, IEBN extracts the instance statistic of a channel before BN and applies it to rescale the output channel of BN with a pair of additional parameters. IEBN costs a light parameter increment and a low increment in computational complexity. Extensive experiments show that IEBN outperforms BN on benchmark datasets over popular architectures for image classification. Our contributions are summarized as follows:
We offer the point of view that a self-attention mechanism can regulate the batch noise adaptively.
We propose a simple yet effective attention-based BN called Instance Enhancement Batch Normalization (IEBN). We demonstrate empirically the effectiveness of IEBN on benchmark datasets with different network architectures.
This section reviews related work and focuses on two directions: normalization and self-attention mechanisms. We then discuss a work that combines the two.
Normalization. The normalization layer is an important component of a deep network, and multiple normalization methods have been proposed for different tasks. Batch Normalization [Ioffe and Szegedy 2015], which normalizes the input by mini-batch statistics, has been a foundation of visual recognition tasks [He et al. 2016a]. Instance Normalization [Ulyanov, Vedaldi, and Lempitsky 2017a] performs a BN-like normalization on a single instance and is widely used in generative models [Johnson, Alahi, and Fei-Fei 2016a; Zhu et al. 2017]. There are several variants of BN, such as Conditional Batch Normalization [de Vries et al. 2017] for Visual Question Answering, Group Normalization [Wu and He 2018] and Batch Renormalization [Ioffe 2017] for small-batch training, Adaptive Batch Normalization [Li et al. 2018] for domain adaptation, and Switchable Normalization [Luo, Ren, and Peng 2018], which learns to select different normalizers for different normalization layers. Among them, Conditional Batch Norm and Batch Renorm adjust the trainable parameters in the reparameterization step of BN. Both are closely related to our work, which modifies the trainable scaling parameter.
Self-attention Mechanism. A self-attention mechanism selectively focuses on the most informative components of a network via self-information processing and has achieved promising performance on vision tasks. The procedure of an attention mechanism can be divided into three parts. First, the added-in module extracts internal information of a network, which can be squeezed channel-wise information [Hu, Shen, and Sun 2018; Li et al. 2019; Huang et al. 2019] or spatial information [Wang et al. 2018; Li, Hu, and Yang 2019]. Next, the module processes the extracted information and generates a mask that measures the importance of features via a fully connected layer [Hu, Shen, and Sun 2018], a convolution layer [Wang et al. 2018] or an LSTM [Huang et al. 2019]. Last, the mask is applied back to the features to enhance feature importance [Hu, Shen, and Sun 2018; Li et al. 2019; Huang et al. 2019].
The cooperation of BN and attention dates back to Visual Question Answering (VQA), which takes an image and an image-related question as input and outputs the answer to the question. For this task, Conditional Batch Norm [de Vries et al. 2017] is proposed to influence the feature extraction of an image via the features collected from the question. A Recurrent Neural Network (RNN) is used to extract the features from the question, while a Convolutional Neural Network (CNN), a pre-trained ResNet, performs feature selection from the image. The shift and scale parameters of the BN in the pre-trained ResNet are conditioned on the features extracted from the question, such that the feature selection of the CNN is question-referenced and the overall network can handle different reasoning tasks. Note that for VQA, the features from the question can be viewed as external attention that guides the training of the overall network, since those features are external to the image. In our work, the proposed IEBN can also be viewed as a kind of Conditional Batch Norm, but the network training is guided by internal attention, since we use a self-attention mechanism to extract information from the image itself.
This section first reviews BN and then introduces IEBN.
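We consider a batch input $X \in \mathbb{R}^{B \times C \times H \times W}$, where $B$, $C$, $H$ and $W$ stand for batch size, number of channels (feature maps), height and width respectively. For simplicity, we denote $X_{b,c,i,j}$ as the value of pixel $(i,j)$ at channel $c$ of instance $b$, and $X_{b,c,\cdot,\cdot}$ as the tensor at channel $c$ of instance $b$.
The computation of BN can be divided into two steps: the batch-normalized step and the reparameterization step. Without loss of generality, we perform BN on the channel $c$ of the instance $b$, i.e., $X_{b,c,\cdot,\cdot}$.
In the batch-normalized step, each channel of features is normalized using the mean and variance of a batch over the channel,
$$\hat{X}_{b,c,\cdot,\cdot} = \frac{X_{b,c,\cdot,\cdot} - \mu_c^B}{\sigma_c^B}, \qquad (1)$$
where $\mu_c^B$ and $\sigma_c^B$ denote the mean and standard deviation of channel $c$ over the batch.
Then, in the reparameterization step, a pair of learnable parameters $\gamma_c$, $\beta_c$ scale and shift the normalized tensor $\hat{X}_{b,c,\cdot,\cdot}$ to restore the representation power,
$$Y_{b,c,\cdot,\cdot} = \hat{X}_{b,c,\cdot,\cdot}\,\gamma_c + \beta_c. \qquad (2)$$
As discussed in the Introduction, the batch noise mainly comes from the batch-normalized step, where the feature of the instance $b$ is mixed with information from the batch, i.e., $\mu_c^B$ and $\sigma_c^B$.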
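To make the two steps of BN and the resulting batch noise concrete, the following minimal sketch applies BN to a single channel in plain Python. It is an illustration rather than a reference implementation: the function name `batch_norm_channel`, the flat-list layout of a channel and the small constant `eps` added for numerical stability are assumptions of this sketch. The same instance is normalized together with two different companions and receives two different outputs.

```python
import math

def batch_norm_channel(channel, gamma=1.0, beta=0.0, eps=1e-5):
    """BN on one channel; `channel` holds one flat list of pixels per instance."""
    # Batch-normalized step: statistics are taken over every pixel of this
    # channel across the whole batch.
    pixels = [p for instance in channel for p in instance]
    mu = sum(pixels) / len(pixels)
    var = sum((p - mu) ** 2 for p in pixels) / len(pixels)
    sigma = math.sqrt(var + eps)  # eps is an implementation assumption
    # Reparameterization step: learnable scale and shift.
    return [[gamma * (p - mu) / sigma + beta for p in instance] for instance in channel]

instance = [1.0, 2.0, 3.0, 4.0]
batch_a = [instance, [0.0, 1.0, 0.0, 1.0]]
batch_b = [instance, [9.0, 8.0, 9.0, 8.0]]

out_a = batch_norm_channel(batch_a)[0]  # the instance normalized within batch A
out_b = batch_norm_channel(batch_b)[0]  # the same instance within batch B
print(out_a)
print(out_b)
```

The two printed lists differ although the instance itself is unchanged, which is the disturbance referred to as batch noise.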
An overview of IEBN is shown in Fig. 1, where we highlight the instance enhancement process of one channel. The detailed computation can be found in Alg. 1. IEBN is based on adjusting the trainable scaling parameter of BN, and its implementation consists of three operations: global squeezing, feature processing, and instance embedding.
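Global Squeezing. The global receptive field of a feature map is captured by average pooling $\mathrm{AVG}(\cdot)$. We obtain a shrinking feature descriptor $\delta_{bc}$ of the channel $c$ for the instance $b$ by taking the average over the channel,
$$\delta_{bc} = \mathrm{AVG}(X_{b,c,\cdot,\cdot}) = \frac{1}{H \times W}\sum_{i=1}^{H}\sum_{j=1}^{W} X_{b,c,i,j}. \qquad (3)$$
$\delta_{bc}$ will serve as a shrinking feature to adjust the channel after BN and is exclusive to the instance $b$.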
Feature Processing. The shrinking feature $\delta_{bc}$ is processed to generate a weight coefficient in the range $(0, 1)$ for self-recalibration of channel $c$. To enhance the self-regulating capacity, we introduce an additional pair of parameters $\hat{\gamma}_c$, $\hat{\beta}_c$ for the channel, which serve as scale and shift respectively to linearly transform $\delta_{bc}$. Then the sigmoid function (i.e., $\mathrm{Sigmoid}(z) = 1/(1+e^{-z})$) is applied to the value after the linear transformation as a gating mechanism:
$$\delta_{bc} \leftarrow \mathrm{Sigmoid}(\hat{\gamma}_c\,\delta_{bc} + \hat{\beta}_c). \qquad (4)$$
Specifically, the parameters $\hat{\gamma}_c$, $\hat{\beta}_c$ are initialized to the constants 0 and -1 respectively. We discuss the initialization in the Ablation Study.
Instance Embedding. $\delta_{bc}$ works as a weight coefficient to adjust the scaling in the reparameterization step of BN for the instance $b$. We embed the recalibration $\delta_{bc}$ into Eqn. 2 to compensate the instance information,
$$Y_{b,c,\cdot,\cdot} = \hat{X}_{b,c,\cdot,\cdot}\,(\gamma_c\,\delta_{bc}) + \beta_c. \qquad (5)$$
$\delta_{bc}$ is composed of a nonlinear activation function and an additional pair of parameters, which helps improve the nonlinearity of the reparameterization of BN.
We conduct IEBN on all channels, i.e., $c = 1, \dots, C$. Compared with BN, the parameter increment comes from the additional pair of parameters $\hat{\gamma}_c$, $\hat{\beta}_c$ that generate the coefficient $\delta_{bc}$ for each channel. The total parameter increment is equal to twice the number of channels.
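The three operations can be put together in a short single-channel sketch. This is a plain-Python illustration under stated assumptions, not the released implementation: the names `iebn_channel`, `gamma_hat`, `beta_hat` and `extra_parameters`, the flat-list layout and the stabilizing constant `eps` are choices made here. The defaults follow the initialization of 0 and -1 stated in the text.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def iebn_channel(channel, gamma=1.0, beta=0.0, gamma_hat=0.0, beta_hat=-1.0, eps=1e-5):
    """IEBN on one channel; `channel` holds one flat list of pixels per instance."""
    # Batch-normalized step: statistics over every pixel of the channel in the batch.
    pixels = [p for instance in channel for p in instance]
    mu = sum(pixels) / len(pixels)
    sigma = math.sqrt(sum((p - mu) ** 2 for p in pixels) / len(pixels) + eps)
    outputs, gates = [], []
    for instance in channel:
        squeezed = sum(instance) / len(instance)          # global squeezing (before BN)
        delta = sigmoid(gamma_hat * squeezed + beta_hat)  # feature processing (gate)
        gates.append(delta)
        # Instance embedding: the gate rescales gamma only; beta is left untouched.
        outputs.append([(p - mu) / sigma * (gamma * delta) + beta for p in instance])
    return outputs, gates

def extra_parameters(num_channels):
    """IEBN adds one (gamma_hat, beta_hat) pair per channel on top of BN."""
    return 2 * num_channels

batch = [[1.0, 2.0, 3.0, 4.0], [9.0, 8.0, 9.0, 8.0]]
_, init_gates = iebn_channel(batch)                  # initialization (0, -1)
_, later_gates = iebn_channel(batch, gamma_hat=0.5)  # a hypothetical trained value
print(init_gates, later_gates, extra_parameters(64))
```

With the initialization (0, -1) every instance starts from the same gate value, Sigmoid(-1); the gate only becomes instance-specific once the scale parameter moves away from zero during training.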
In this section, we evaluate the performance of IEBN on the image classification task and empirically demonstrate its effectiveness. We conduct experiments on benchmark datasets with popular networks.
Dataset and Model. We conduct experiments on CIFAR10, CIFAR100 [Krizhevsky and Hinton 2009], and ImageNet 2012 [Russakovsky et al. 2015]. CIFAR10 and CIFAR100 each have 50k training images and 10k test images of size 32 by 32, with 10 and 100 classes respectively. ImageNet 2012 [Russakovsky et al. 2015] comprises 1.28 million training and 50k validation images from 1000 classes, and random cropping of size 224 by 224 is used in our experiments. We evaluate our method with popular networks: ResNet [He et al. 2016a], PreResNet [He et al. 2016b] and ResNeXt [Xie et al. 2017]. In our experiments, we replace all the BNs in the original networks with IEBN. The implementation details can be found in the Appendix.
Image Classification. As shown in Table 1, IEBN improves the testing accuracy over BN for different datasets and different network backbones. For CIFAR10, which has few classes, the performance of the networks with BN is already good, so there is little room for improvement. However, for the CIFAR100 and ImageNet datasets, the networks with IEBN achieve a significant improvement in testing accuracy over BN. In particular, the performance improvement of ResNet with IEBN is the most remarkable. Given the popularity of ResNet and the light parameter increment, IEBN has good application potential in various deep learning tasks.
In this section, we explore the role of the self-attention mechanism in enhancing instance information and regulating the batch noise. We analyze it through style transfer and through experiments with synthetic noise attacks.
We explore the role of the self-attention mechanism in instance enhancement through the example of the style transfer task [Gatys, Ecker, and Bethge 2016]. We use the style transfer method that generates images with a network called the transformation network [Johnson, Alahi, and Fei-Fei 2016b].
It has been empirically shown that the type of normalization in the network has an impact on the quality of image generation [Ulyanov, Vedaldi, and Lempitsky 2017b; Huang and Belongie 2017; Dumoulin, Shlens, and Kudlur 2016]. Instance Normalization (IN) is widely used in generative models and has been shown to have a significant advantage over BN in style transfer tasks [Ulyanov, Vedaldi, and Lempitsky 2017b]. The formulation of IN is as follows,
$$\mathrm{IN}(X_{b,c,\cdot,\cdot}) = \frac{\gamma_c}{\sigma_{bc}}\,(X_{b,c,\cdot,\cdot} - \mu_{bc}) + \beta_c, \qquad (6)$$
where $\mu_{bc}$, $\sigma_{bc}$ denote the mean and standard deviation of the instance $b$ at the channel $c$. Similarly, the formulation of BN can be written in this form,
$$\mathrm{BN}(X_{b,c,\cdot,\cdot}) = \frac{\gamma_c}{\sigma_c^B}\,(X_{b,c,\cdot,\cdot} - \mu_c^B) + \beta_c. \qquad (7)$$
$\gamma_c$ and $\beta_c$ are learned parameters and both are closely related to the target style [Dumoulin, Shlens, and Kudlur 2016]. From Eqn. 6 and Eqn. 7, IN or BN directly leads to a scaling of $\gamma_c$ that affects the style of images. Different from BN, IN affects the style by self-information instead of batch information. Fig. 2 compares the quality of images generated by the network with BN, IN and the SE module. The style transfer task is noise-sensitive, and when batch noise is added by BN, the style of the generated image becomes more confused. We add the SE module [Hu, Shen, and Sun 2018] to the transformation network with BN to examine its effectiveness in regulating batch noise. Fig. 2 shows that the attention mechanism (SE) visually improves the effect of style transfer, and the quality of the generated images is closer to that of IN. Fig. 3 shows the training loss with respect to the iterations when applying the style Mosaic. The BN network with the SE module achieves smaller style loss and smaller content loss than BN, and is closer to IN (see the Appendix for more results on the loss when applying other styles). Therefore, although BN can bring batch information to an instance, it simultaneously introduces batch noise into network training. An attention mechanism such as the SE module may be good at alleviating the batch noise, and we investigate this further.
IEBN is a BN equipped with self-attention, and Fig. 2 shows the similarity between the images generated with the SE module and with IEBN. In fact, consider IEBN written in the same form:
$$\mathrm{IEBN}(X_{b,c,\cdot,\cdot}) = \frac{\gamma_c\,\delta_{bc}}{\sigma_c^B}\,(X_{b,c,\cdot,\cdot} - \mu_c^B) + \beta_c, \qquad (8)$$
where $\delta_{bc}$ is defined in Eqn. 4 and contains information from the instance $b$. It may seem that the added-in $\delta_{bc}$ is only applied directly to the scaling parameter $\gamma_c$ of BN, but it does scale the batch information (i.e., $\mu_c^B$, $\sigma_c^B$), regulating it by supplementing instance information. This adjustment of batch information via $\delta_{bc}$ makes Eqn. 8 closer to Eqn. 6 than Eqn. 7 is, and also leads to similar style transfer results for IN and IEBN.
To further study the ability of IEBN to regulate noise, two strategies are used to add synthetic noise in the batch-normalized step of BN.
We add constant noise into each BN in the batch-normalized step as follows,
$$\hat{X}_{b,c,\cdot,\cdot} = \left(\frac{X_{b,c,\cdot,\cdot} - \mu_c^B}{\sigma_c^B}\right) N_a + N_b, \qquad (9)$$
where $(N_a, N_b)$ is a pair of constants acting as the constant noise. Table 2 shows the testing accuracy of ResNet164 on CIFAR100 under different pairs of constant noise.
The added constant noise is equivalent to disturbing $\mu_c^B$ and $\sigma_c^B$, so that inaccurate estimates of the mean and variance of the whole dataset are used in training. Such bad estimates can lead to terrible performance. Denote $(X_{b,c,\cdot,\cdot} - \mu_c^B)/\sigma_c^B$ as $\tilde{X}_{b,c,\cdot,\cdot}$. Then in the reparameterization step of BN, we introduce the learnable parameters $\gamma_c$ and $\beta_c$ and get
$$Y_{b,c,\cdot,\cdot} = (\tilde{X}_{b,c,\cdot,\cdot} N_a + N_b)\,\gamma_c + \beta_c = \tilde{X}_{b,c,\cdot,\cdot}\,(\gamma_c N_a) + (\gamma_c N_b + \beta_c). \qquad (10)$$
According to Eqn. 10, the impact of the constant noise could in principle be neutralized by the linear transformation of $\gamma_c$ and $\beta_c$, because $N_a$ and $N_b$ are just constants. However, in Table 2, the network with only BN is not good at handling most of the constant noise pairs. The trainable $\gamma_c$ and $\beta_c$ of BN do not have enough power to help BN reduce the impact of the constant noise. Due to forward propagation, the noise accumulates as the depth increases, and a certain amount of noise leads to poor performance and training instability. As shown in Table 2, the SE module can partly alleviate this problem, but not sufficiently, as indicated by the high variance of the testing accuracy under most pairs of constant noise.
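The claim that a linear transformation can in principle neutralize the constant noise can be checked with scalar arithmetic. In the sketch below, the noise pair is written as `(n_a, n_b)` with a multiplicative and an additive component; this form and the helper names `bn_reparam_with_noise` and `compensate` are assumptions made for illustration. The code only shows that compensating parameters exist for a single layer; it says nothing about whether training finds them, which is the difficulty reported in Table 2.

```python
def bn_reparam_with_noise(x_tilde, gamma, beta, n_a, n_b):
    """Scale and shift a normalized value that carries constant noise (n_a, n_b)."""
    return (x_tilde * n_a + n_b) * gamma + beta

def compensate(gamma, beta, n_a, n_b):
    """Parameters (gamma', beta') that reproduce the noise-free output under noise.

    Requires n_a != 0.
    """
    gamma_c = gamma / n_a
    beta_c = beta - gamma_c * n_b
    return gamma_c, beta_c

gamma, beta = 1.5, -0.3
n_a, n_b = 0.8, 0.5  # a hypothetical noise pair
gamma_c, beta_c = compensate(gamma, beta, n_a, n_b)

for x_tilde in (-2.0, 0.0, 1.7):
    clean = bn_reparam_with_noise(x_tilde, gamma, beta, 1.0, 0.0)
    noisy = bn_reparam_with_noise(x_tilde, gamma_c, beta_c, n_a, n_b)
    print(x_tilde, clean, noisy)
```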
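For IEBN, we can rewrite Eqn. 10 as
$$Y_{b,c,\cdot,\cdot} = \tilde{X}_{b,c,\cdot,\cdot}\,(\gamma_c\,\delta_{bc}\,N_a) + (\gamma_c\,\delta_{bc}\,N_b + \beta_c), \qquad (11)$$
where $\delta_{bc}$ denotes the attention learned in IEBN, $\tilde{X}_{b,c,\cdot,\cdot}$ is the noise-free normalized feature and $(N_a, N_b)$ is the constant noise pair. Compared to Eqn. 10, Eqn. 11 with $\delta_{bc}$ from IEBN successfully adjusts the constant noise and even achieves better performance under some noise configurations. If $\delta_{bc}$ only excites $\beta_c$, we can rewrite Eqn. 11 as
$$Y_{b,c,\cdot,\cdot} = \tilde{X}_{b,c,\cdot,\cdot}\,(\gamma_c\,N_a) + (\gamma_c\,N_b + \beta_c\,\delta_{bc}), \qquad (12)$$
where $\delta_{bc}$ can only adjust the noise in $N_b$ instead of $N_a$. But if applied to $\gamma_c$, $\delta_{bc}$ can handle the noise of scale and bias simultaneously. This may be why the result of exciting only $\beta_c$ is worse than the others in Table 5, but better than the original model with BN in Table 1.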
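The difference between the two excitation positions can be seen by reading off the effective slope and offset of the output as a function of the normalized input. The sketch below is a scalar illustration under the same assumed noise form (a multiplicative `n_a` and an additive `n_b`); the function names are hypothetical.

```python
def excite_gamma(x_tilde, gamma, beta, delta, n_a, n_b):
    """Gate on the scale (Eqn. 11 style): delta multiplies gamma."""
    return (x_tilde * n_a + n_b) * (gamma * delta) + beta

def excite_beta(x_tilde, gamma, beta, delta, n_a, n_b):
    """Gate on the shift only (Eqn. 12 style): delta multiplies beta."""
    return (x_tilde * n_a + n_b) * gamma + beta * delta

def slope_and_offset(fn, **params):
    """Effective slope and offset of the affine map x_tilde -> fn(x_tilde)."""
    y0, y1 = fn(0.0, **params), fn(1.0, **params)
    return y1 - y0, y0

params = dict(gamma=2.0, beta=1.0, n_a=0.5, n_b=0.5)
for delta in (0.5, 1.0):
    print("gate on gamma:", delta, slope_and_offset(excite_gamma, delta=delta, **params))
    print("gate on beta: ", delta, slope_and_offset(excite_beta, delta=delta, **params))
```

When the gate acts on the scale, both the slope and the offset respond to the gate; when it acts on the shift only, the slope is fixed and only the offset moves.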
In this part, we consider interfering with $\mu_c^B$ and $\sigma_c^B$ by simultaneously training on datasets with different distributions in one network. Unlike constant noise, which is added to the network directly, this noise is implicit and is generated when BN computes the mean and variance of training data drawn from different distributions. These datasets differ widely in their distributions and cause severe batch noise. Compared with the constant noise, this noise is not easy to eliminate by the linear transformation of $\gamma_c$ and $\beta_c$.
In our experiments, we train ResNet164 on CIFAR100 but mix in MNIST [LeCun and Cortes 2010] or FashionMNIST [Xiao, Rasul, and Vollgraf 2017] images in each batch, and compare the performance of BN and IEBN. Table 3 shows the test accuracy on CIFAR100, where “C+$k$ M or F” means that at each training iteration we sample a batch consisting of 100 images from CIFAR100 (C) and $k$ images from MNIST (M) or FashionMNIST (F). As $k$ increases, the batch noise becomes more severe for CIFAR100, since $\mu_c^B$ and $\sigma_c^B$ contain more information about MNIST or FashionMNIST. In most cases, despite severe noise such as “C+2”, the model with IEBN still performs better than the model with BN trained merely on CIFAR100. Moreover, the drop in accuracy of IEBN is smaller than that of BN, and IEBN alleviates the degradation of network generalization. These phenomena illustrate that, even under the influence of MNIST or FashionMNIST, the model with IEBN has a stronger ability to resist the batch noise.
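How mixing in samples from another distribution disturbs the batch statistics can be sketched with synthetic numbers. Everything in this example is an assumption made for illustration: each image is reduced to one scalar feature, the two pools are Gaussian stand-ins for the main dataset and the mixed-in dataset, and `mixed_batch_stats` is a hypothetical helper. No real CIFAR100, MNIST or FashionMNIST data is involved.

```python
import random
import statistics

def mixed_batch_stats(main_pool, other_pool, k, main_count=100, seed=0):
    """Mean and std of one feature over a batch of `main_count` + `k` samples."""
    rng = random.Random(seed)
    batch = rng.sample(main_pool, main_count) + rng.sample(other_pool, k)
    return statistics.fmean(batch), statistics.pstdev(batch)

pool_rng = random.Random(1)
main_pool = [pool_rng.gauss(0.0, 1.0) for _ in range(5000)]   # stand-in for the main dataset
other_pool = [pool_rng.gauss(5.0, 0.2) for _ in range(5000)]  # stand-in for the mixed-in dataset

clean_mean, clean_std = mixed_batch_stats(main_pool, other_pool, k=0)
shifts = {}
for k in (1, 2, 3):
    mean_k, std_k = mixed_batch_stats(main_pool, other_pool, k)
    shifts[k] = mean_k - clean_mean
    print(f"k={k}: mean shift {shifts[k]:+.4f}, std {std_k:.4f} (clean std {clean_std:.4f})")
```

Even a few foreign samples in a batch of 100 move the batch mean, and the shift grows with the number of mixed-in samples, which is the implicit noise the normalization then injects into every instance.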
|Dataset||test acc||acc drop||test acc||acc drop|
In this section, we conduct experiments to explore the effect of different configurations of IEBN. We study different ways of generating $\delta_{bc}$, the position at which the attention is applied, the initialization of IEBN and the activation function used in IEBN. All experiments are performed on CIFAR100 with ResNet164 using 2 GPUs.
In this part, we study different ways of processing the squeezed features to generate $\delta_{bc}$. As shown in Alg. 1, IEBN squeezes the channel through global average pooling and processes the squeezed feature by a linear transformation (i.e., $\hat{\gamma}_c \cdot \mathrm{AVG}(X_{b,c,\cdot,\cdot}) + \hat{\beta}_c$) for each channel, denoted as “Linear”. We also consider two other methods of processing the information. In the first, we remove the additional trainable parameters $\hat{\gamma}_c$ and $\hat{\beta}_c$ for the linear transformation in IEBN and directly apply the squeezed feature, after the sigmoid function, to the channel, denoted as “Identity”. In the second, we use a fully connected layer consisting of a linear transformation, a ReLU layer, and another linear transformation to fuse the squeezed features of all channels, denoted as “FC”. “FC” is similar to the configuration of the SE module introduced in [Hu, Shen, and Sun 2018].
Table 4 shows the testing accuracy using different ways to process the squeezed features. The “FC” operator provides more nonlinearity than the “Linear” operator (IEBN), but such nonlinearity may lead to overfitting, whereas the “Linear” operator (IEBN) simplifies the squeezed feature processing and has better generalization ability. Furthermore, the result of “Identity” indicates that it is not enough to use instance information directly to enhance self-information without any trainable parameters. Operators with trainable parameters, such as “Linear” (IEBN) and “FC”, are needed to process the instance information, so that the adaptive and advantageous noise during training can be regulated to improve the performance.
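The three operators compared above can be written side by side for a vector of squeezed per-channel features. This is a plain-Python sketch under stated assumptions: the function names are hypothetical, the “FC” variant is written without biases and with a final sigmoid in the style of an SE module, and the hidden width of “FC” is a free choice that the text does not specify.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def gate_linear(squeezed, gamma_hat, beta_hat):
    # "Linear" (IEBN): per-channel scale and shift, then sigmoid.
    return [sigmoid(g * s + b) for s, g, b in zip(squeezed, gamma_hat, beta_hat)]

def gate_identity(squeezed):
    # "Identity": no trainable parameters, sigmoid of the squeezed feature.
    return [sigmoid(s) for s in squeezed]

def gate_fc(squeezed, w1, w2):
    # "FC": linear -> ReLU -> linear across all channels, then sigmoid.
    hidden = [max(0.0, sum(w * s for w, s in zip(row, squeezed))) for row in w1]
    return [sigmoid(sum(w * h for w, h in zip(row, hidden))) for row in w2]

def parameter_counts(num_channels, hidden):
    # Extra trainable parameters of each operator (bias-free "FC" assumed).
    return {"Linear": 2 * num_channels, "Identity": 0, "FC": 2 * num_channels * hidden}

squeezed = [1.0, 2.0]
print(gate_linear(squeezed, gamma_hat=[0.0, 0.0], beta_hat=[-1.0, -1.0]))
print(gate_identity(squeezed))
print(gate_fc(squeezed, w1=[[0.5, -0.5]], w2=[[1.0], [-1.0]]))
print(parameter_counts(num_channels=64, hidden=4))
```

“Linear” treats each channel independently with two parameters, while “FC” mixes all channels and grows with the product of the channel count and the hidden width.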
We study the influence of the position that $\delta_{bc}$ excites. For self-attention mechanisms like SENet [Hu, Shen, and Sun 2018], DIANet [Huang et al. 2019] and SGENet [Li, Hu, and Yang 2019], the rescaling coefficient usually excites both trainable parameters $\gamma_c$ and $\beta_c$ of BN. In IEBN, $\delta_{bc}$ is only applied to adjust the scaling parameter $\gamma_c$ in BN. To differentiate the influence of the excitation positions, Table 5 shows the testing accuracy for different positions that $\delta_{bc}$ excites. The performance is unsatisfactory when $\delta_{bc}$ excites only $\beta_c$. Moreover, there is a slight difference between exciting only $\gamma_c$ and exciting both $\gamma_c$ and $\beta_c$, and the former has better performance. From the point of view of adjusting noise, Eqn. 11 and Eqn. 12 can explain the result shown in Table 5. The results therefore suggest that, to make IEBN more effective, it is important to choose carefully the position that $\delta_{bc}$ excites.
This part studies the initialization of the trainable parameters $\hat{\gamma}_c$ and $\hat{\beta}_c$, which are used to process the squeezed feature in IEBN. According to the experiments in Table 4, the learnable parameters $\hat{\gamma}_c$ and $\hat{\beta}_c$ are indispensable for IEBN to be effective. Further study of different initialization configurations is therefore essential to understand IEBN in depth. To explore this impact, we run a grid search over the constants 1, 0 and -1 to find the best pair of initial values for $\hat{\gamma}_c$ and $\hat{\beta}_c$. We find that the initialization of the trainable parameters $\hat{\gamma}_c$ and $\hat{\beta}_c$ of IEBN has a significant impact on the performance of the model: in Table 6, the performance varies as different initializations are chosen. Note that the best choice of $\hat{\gamma}_c$ is 0 when we freeze the initialization of $\hat{\beta}_c$. Similarly, the model performs best when the initialization of $\hat{\beta}_c$ is fixed to be -1. The theory behind the best initialization configuration is left for future work.
We explore the choice of activation function in IEBN. We consider four options: sigmoid, tanh, ReLU and Softmax. The testing accuracy results are reported in Fig. 4. Note that ReLU is a poor choice here: the accuracy stays at only 1% throughout training. In addition, the performance of Softmax is evidently worse than that of sigmoid or tanh. The choice of sigmoid benefits both training stability and performance. In fact, sigmoid is used in many attention-based methods like SENet [Hu, Shen, and Sun 2018] to generate attention maps as a gating mechanism. The testing accuracy of different choices of activation functions in Table 4 shows that sigmoid helps IEBN act as a gate that rescales channel features better. The similar ablation study in the SENet paper [Hu, Shen, and Sun 2018] also ranks the activation functions as sigmoid > tanh > ReLU (bigger is better), which coincides with our reported results.
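For reference, the gate values that the four candidate activations produce can be compared directly. The sketch below evaluates them on the pre-activation obtained under the initialization (0, -1), which is -1 for every channel regardless of the input. The function name `gate_values` is hypothetical, and taking Softmax across channels is an assumption of this sketch. The numbers describe only the initial gate values and are not an explanation established by the experiments.

```python
import math

def gate_values(pre_activations):
    """Gate produced by each candidate activation for per-channel inputs."""
    exps = [math.exp(z) for z in pre_activations]
    total = sum(exps)
    return {
        "sigmoid": [1.0 / (1.0 + math.exp(-z)) for z in pre_activations],
        "tanh": [math.tanh(z) for z in pre_activations],
        "relu": [max(0.0, z) for z in pre_activations],
        "softmax": [e / total for e in exps],  # taken across channels (assumption)
    }

# Four channels at initialization: scale 0 and shift -1 give -1 everywhere.
init = gate_values([-1.0] * 4)
for name, values in init.items():
    print(name, [round(v, 4) for v in values])
```

At this initialization the sigmoid gate is a positive value inside (0, 1), the tanh gate is negative, the ReLU gate is exactly zero for every channel, and the Softmax gates are forced to sum to one across channels.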
In this paper, we introduce two kinds of noise brought by BN and offer the point of view that a self-attention mechanism can regulate the batch noise adaptively. We propose a simple yet effective attention-based BN called Instance Enhancement Batch Normalization (IEBN). We demonstrate empirically the effectiveness of IEBN on benchmark datasets with different network architectures, and also provide an ablation study to explore the effect of different configurations of IEBN.
S. Liang and H. Yang gratefully acknowledge the support of the National Supercomputing Center (NSCC) Singapore [nscc] and High Performance Computing (HPC) of the National University of Singapore for providing computational resources, and the support of NVIDIA Corporation with the donation of the Titan Xp GPU used for this research. Z. Huang thanks New Oriental AI Research Academy Beijing for GPU resources. H. Yang acknowledges the support of the start-up grant from the Department of Mathematics at the National University of Singapore and of the Ministry of Education in Singapore under the grant MOE2018-T2-2-147.
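Ioffe, S., and Szegedy, C. 2015. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML’15, 448–456. JMLR.org.
Johnson, J.; Alahi, A.; and Fei-Fei, L. 2016. Perceptual losses for real-time style transfer and super-resolution. In ECCV (2), 694–711.
Zhu, J.-Y.; Park, T.; Isola, P.; and Efros, A. A. 2017. Unpaired image-to-image translation using cycle-consistent adversarial networks. In ICCV, 2242–2251.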
The style transfer loss of different styles can be found in Fig. 5.