# Understanding Regularization in Batch Normalization

Batch Normalization (BN) makes the output of each hidden neuron have zero mean and unit variance, improving convergence and generalization when training neural networks. This work aims to understand these phenomena theoretically. We analyze BN using a building block of neural networks, which consists of a weight layer, a BN layer, and a nonlinear activation function. This simple network helps us understand the characteristics of BN, and the results are generalized to deep models in numerical studies. We explore BN in three aspects. First, by viewing BN as a stochastic process, we derive an analytical form of the regularization inherent in BN. Second, the optimization dynamics with this regularization show that BN enables training to converge with large maximum and effective learning rates. Third, BN's generalization with regularization is explored using random matrix theory and statistical mechanics. Both simulations and experiments support our analyses.

## 1 Introduction

Batch Normalization (BN) is an indispensable component in many deep neural networks (He et al., 2016; Huang et al., 2017). Experimental studies (Ioffe & Szegedy, 2015) suggested that BN improves convergence and generalization by enabling large learning rates and preventing overfitting during training. Understanding BN theoretically is a key question.

Notations. This work denotes a scalar and a vector by a lowercase letter (e.g. $x$) and a bold lowercase letter (e.g. $\mathbf{x}$) respectively. BN is investigated in a single-layer perceptron that is a building block of deep models, consisting of a kernel layer, a BN layer, and a nonlinear activation function. Its forward computation can be written as

$$y = g(\hat{h}), \quad \hat{h} = \gamma\,\frac{h-\mu_B}{\sigma_B} + \beta \quad\text{and}\quad h = \mathbf{w}^{T}\mathbf{x}, \tag{1}$$

where $g(\cdot)$ denotes an activation function such as ReLU, $h$ and $\hat{h}$ are the hidden values before and after normalization, and $\mathbf{w}$ and $\mathbf{x}$ are the kernel weight vector and the network input respectively. In BN, $\mu_B$ and $\sigma_B$ represent the mean and standard deviation of $h$. They are estimated within a batch of $M$ samples. $\gamma$ is a scale parameter and $\beta$ is a shift parameter.
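The forward computation of this building block can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `bn_perceptron_forward` and the small constant `eps` (added for numerical stability, and not part of Eqn.(1)) are assumptions.

```python
import numpy as np

def bn_perceptron_forward(X, w, gamma=1.0, beta=0.0, eps=1e-5):
    """One neuron of Eqn.(1) on a batch X of shape (M, N), with ReLU as g."""
    h = X @ w                              # h = w^T x for every sample in the batch
    mu_b = h.mean()                        # batch mean
    sigma_b = np.sqrt(h.var() + eps)       # batch standard deviation (eps is an assumption)
    h_hat = gamma * (h - mu_b) / sigma_b + beta
    y = np.maximum(h_hat, 0.0)             # ReLU activation
    return y, h_hat, mu_b, sigma_b

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 8))               # M = 32 samples, N = 8 inputs
w = rng.normal(size=8)
y, h_hat, mu_b, sigma_b = bn_perceptron_forward(X, w, gamma=2.0, beta=0.5)
```

After normalization, the batch mean of `h_hat` equals the shift parameter and its batch standard deviation is close to the scale parameter.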

Despite its simplicity, the above basic network is the building block of deep networks. It has been widely adopted in theoretical studies of topics such as proper initialization (Krogh & Hertz, 1992; Advani & Saxe, 2017), dropout (Wager et al., 2013), weight decay and data augmentation (Bös, 1998). Our analyses assume that neurons at the BN layer are independent, similar to (Salimans & Kingma, 2016; van Laarhoven, 2017), as the mean and the variance are estimated individually for each neuron. However, we drop the Gaussian assumption on the network input and the weight vector in theorem 1, which is our main result, so our assumptions are milder than those in (Yoshida et al., 2017; Ba et al., 2016; Salimans & Kingma, 2016). Several frequently used notations are summarized in Table 2 in the Appendix for reference.

### 1.1 Highlights of Results

Our main results are organized below in three aspects.

First, Sec.2 decomposes BN into population normalization (PN) and gamma decay, which is an explicit regularization form of $\mu_B$ and $\sigma_B$. These statistics have different impacts: (1) $\mu_B$ discourages reliance on a single neuron and encourages different neurons to have equal magnitude, in the sense that corrupting an individual neuron does not harm generalization. This phenomenon was also found empirically in a recent work (Morcos et al., 2018), but had not been established analytically. (2) $\sigma_B$ reduces the kurtosis of the input distribution as well as correlations between neurons. (3) The regularization strengths of these statistics are inversely proportional to the batch size $M$, indicating that BN with a large batch would decrease generalization. (4) Removing either one of $\mu_B$ and $\sigma_B$ could impede convergence and generalization.

Second, by using ordinary differential equations (ODEs), Sec.3 shows that gamma decay enables the network trained with BN to converge with large maximum and effective learning rates, leading to faster training compared to a network trained without BN or trained with weight normalization (WN) (Salimans & Kingma, 2016), which is a counterpart of BN.

Third, Sec.4 compares the generalization errors of BN, WN, and vanilla SGD by using statistical mechanics. The “large-scale” regime is of interest, where the number of samples $P$ and the number of neurons $N$ are both large but their ratio $P/N$ is finite. In this regime, the generalization errors are quantified both analytically and empirically.

Numerical results in Sec.5 show that BN in CNNs has the same traits of regularization as disclosed by the above analyses.

## 2 A Probabilistic Interpretation of BN

We treat $\mu_B$ and $\sigma_B$ as random variables to derive the regularization of BN. Since one sample $\mathbf{x}$ is seen many times in training, and each time it is presented with other samples in a randomly drawn batch, $\mu_B$ and $\sigma_B$ can be treated as injected random noise for $\mathbf{x}$.

Loss Function. Training a neural network typically involves minimizing a negative log likelihood function with respect to a set of parameters $\theta$. Then the loss function can be defined by

$$\frac{1}{P}\sum_{j=1}^{P}\ell(\hat{h}^j) = -\frac{1}{P}\sum_{j=1}^{P}\log p(y^j\,|\,\hat{h}^j;\theta) + \zeta\|\mathbf{w}\|_2^2, \tag{2}$$

where $p(y^j\,|\,\hat{h}^j;\theta)$ represents the likelihood function of the network and $P$ is the number of training samples. As a Gaussian distribution is often employed as the prior distribution for the weight parameters, we have the weight decay term $\zeta\|\mathbf{w}\|_2^2$ (Krizhevsky et al., 2012), which is a popular technique in deep learning.

Prior. Following (Teye et al., 2018), we find that BN also induces Gaussian priors for $\mu_B$ and $\sigma_B$. We have $\mu_B \sim \mathcal{N}(\mu_P, \frac{\sigma_P^2}{M})$ and $\sigma_B \sim \mathcal{N}(\sigma_P, \frac{\rho+2}{4M})$, where $M$ is the batch size, $\mu_P$ and $\sigma_P$ are the population mean and standard deviation respectively, and $\rho$ is the kurtosis that measures the peakedness of the distribution of $h$. These priors tell us that $\mu_B$ and $\sigma_B$ would produce Gaussian noise in training. There is a tradeoff regarding this noise. For example, when $M$ is small, training could diverge due to large noise. This is supported by an experiment on BN (Wu & He, 2018) where training diverges when $M=2$ in ImageNet (Russakovsky et al., 2015). When $M$ is large, the noise is reduced because $\mu_B$ and $\sigma_B$ get close to $\mu_P$ and $\sigma_P$. It is known that $M>30$ would provide moderate noise, as the sample statistics converge in probability to the population statistics by the weak Law of Large Numbers. This is also supported by the experiment in (Ioffe & Szegedy, 2015) where BN with $M=32$ already works well in ImageNet.
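The dependence of this noise on the batch size can be checked with a small simulation. The sketch below assumes a standard Gaussian hidden value (zero mean, unit variance, zero excess kurtosis), for which the usual large-sample approximations give a variance of about $1/M$ for the batch mean and about $1/(2M)$ for the batch standard deviation. The function name `batch_stat_spread` and all settings are illustrative assumptions.

```python
import numpy as np

def batch_stat_spread(M, n_batches=10000, seed=0):
    """Empirical variance of the batch mean and batch std over many random batches."""
    rng = np.random.default_rng(seed)
    h = rng.normal(loc=0.0, scale=1.0, size=(n_batches, M))  # assumed Gaussian hidden values
    mu_b = h.mean(axis=1)        # one batch mean per batch
    sigma_b = h.std(axis=1)      # one batch standard deviation per batch
    return mu_b.var(), sigma_b.var()

for M in (4, 32, 256):
    var_mu, var_sigma = batch_stat_spread(M)
    print(f"M={M:4d}  Var(mu_B)={var_mu:.5f}  Var(sigma_B)={var_sigma:.5f}")
```

Both variances shrink roughly in proportion to $1/M$, which is the sense in which larger batches inject less noise.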

### 2.1 A Regularization Form

The loss function in Eqn.(2) can be written as an expected loss by integrating over the priors of $\mu_B$ and $\sigma_B$, that is, $\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_B,\sigma_B}[\ell(\hat{h}^j)]$, where $\mathbb{E}[\cdot]$ denotes expectation. We show that $\mu_B$ and $\sigma_B$ impose regularization on the scale parameter $\gamma$, but result in different regularization strengths. In theorem 1, we employ the ReLU activation function, which is widely used in practice, as a concrete example. In general, the results can be extended to other activation functions as shown in Appendix C.2.

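###### Theorem 1 (Regularization of $\mu_B$, $\sigma_B$).

Let $\frac{1}{P}\sum_{j=1}^{P}\ell(\hat{h}^j)$ be the loss function of BN and the activation function be ReLU. Then

$$\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_B,\sigma_B}\,\ell(\hat{h}^j) \simeq \frac{1}{P}\sum_{j=1}^{P}\ell(\bar{h}^j) + \zeta(h)\gamma^2 \quad\text{and}\quad \zeta(h) = \underbrace{\frac{\rho+2}{8M}F_\gamma}_{\text{from }\sigma_B} + \underbrace{\frac{1}{2M}\frac{1}{P}\sum_{j=1}^{P}\sigma(\bar{h}^j)}_{\text{from }\mu_B}, \tag{3}$$

where $\bar{h}^j = \gamma\frac{\mathbf{w}^T\mathbf{x}^j-\mu_P}{\sigma_P}+\beta$ represents the population normalization (PN) and $h^j = \mathbf{w}^T\mathbf{x}^j$. $\zeta(h)$ is a data-dependent coefficient of gamma decay, $\rho$ is the kurtosis of the distribution of $h$, $F_\gamma$ represents the Fisher Information Matrix (FIM) of $\gamma$, and $\sigma(\cdot)$ is a sigmoid function.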
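To make the structure of the gamma decay coefficient concrete, the following sketch evaluates the two parts of $\zeta(h)$ for a single neuron. It is only an illustration under stated assumptions: the kurtosis is taken to be the excess kurtosis (zero for a Gaussian), the Fisher information of the scale parameter is passed in as a given scalar `fisher_gamma` rather than computed, and the function name is hypothetical.

```python
import numpy as np

def gamma_decay_coefficient(h, h_bar, M, fisher_gamma):
    """zeta(h) = (rho + 2) / (8 M) * F_gamma + 1 / (2 M) * mean(sigmoid(h_bar))."""
    z = (h - h.mean()) / h.std()
    rho = np.mean(z ** 4) - 3.0                       # excess kurtosis (assumed convention)
    from_sigma = (rho + 2.0) / (8.0 * M) * fisher_gamma
    from_mu = np.mean(1.0 / (1.0 + np.exp(-h_bar))) / (2.0 * M)
    return from_sigma + from_mu

rng = np.random.default_rng(0)
h = rng.normal(size=10000)                            # hidden values before normalization
h_bar = (h - h.mean()) / h.std()                      # population-normalized values, gamma=1, beta=0
for M in (16, 64, 256):
    print(M, gamma_decay_coefficient(h, h_bar, M, fisher_gamma=1.0))
```

Both parts carry a factor of $1/M$, so quadrupling the batch size reduces the coefficient by a factor of four.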

From theorem 1, we have several observations that are of both theoretical and practical values.

First, it decomposes BN into population normalization (PN) and gamma decay. PN replaces the batch statistics in BN by population statistics. In gamma decay, the computation of $\zeta(h)$ is data-dependent, which makes it different from weight decay, where the coefficient is determined empirically. In essence, Eqn.(3) represents the randomness in BN in a deterministic way, which not only enables us to apply methodologies such as ODEs and statistical mechanics to analyze BN, but also inspires us to imitate BN’s performance with WN, without computing batch statistics, in the numerical study.

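Second, PN is closely connected to WN, which is independent of the sample mean and variance. WN (Salimans & Kingma, 2016) is defined by $\upsilon\frac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{w}\|_2}$, which normalizes the weight vector to have unit variance, where $\upsilon$ is a learnable parameter. Let each diagonal element of the covariance matrix of $\mathbf{x}$ be $a$ and all the off-diagonal elements be zero, so that $\sigma_P = \sqrt{a}\,\|\mathbf{w}\|_2$. Then $\bar{h}^j$ can be rewritten as

$$\bar{h}^j = \gamma\frac{\mathbf{w}^T\mathbf{x}^j-\mu_P}{\sigma_P}+\beta = \upsilon\frac{\mathbf{w}^T\mathbf{x}^j}{\|\mathbf{w}\|_2}+b, \tag{4}$$

where $\upsilon = \frac{\gamma\|\mathbf{w}\|_2}{\sigma_P}$ and $b = \beta - \frac{\gamma\mu_P}{\sigma_P}$. Eqn.(4) removes the estimation of statistics and eases our analysis of regularization for BN.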
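The equivalence between population normalization and a weight-normalized neuron can be verified numerically. The sketch below assumes inputs with independent coordinates of mean 0.3 and standard deviation 1.5 (both values are arbitrary choices for illustration), so that the population mean and standard deviation of $\mathbf{w}^T\mathbf{x}$ are known in closed form. The scale and bias of the WN form are obtained by matching terms with the PN form.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 16
w = rng.normal(size=N)
X = rng.normal(loc=0.3, scale=1.5, size=(1000, N))   # isotropic inputs (assumed)
gamma, beta = 1.7, -0.2

mu_p = 0.3 * w.sum()                    # population mean of w^T x
sigma_p = 1.5 * np.linalg.norm(w)       # population std of w^T x for isotropic inputs

# Population normalization (PN) form.
pn = gamma * (X @ w - mu_p) / sigma_p + beta

# Weight normalization (WN) form with matched scale and bias.
upsilon = gamma * np.linalg.norm(w) / sigma_p
b = beta - gamma * mu_p / sigma_p
wn = upsilon * (X @ w) / np.linalg.norm(w) + b

print(np.max(np.abs(pn - wn)))          # difference at floating-point precision
```

With isotropic inputs the WN scale reduces to the BN scale divided by the input standard deviation, so no batch statistics are needed at all.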

Third, $\mu_B$ and $\sigma_B$ produce different parts in $\zeta(h)$. The strength from $\mu_B$ depends on the expectation of $\sigma(\bar{h}^j)$, which represents the excitation or inhibition of a neuron, meaning that a neuron with larger output may be exposed to larger regularization, encouraging different neurons to have equal magnitude. This is consistent with the empirical result of (Morcos et al., 2018), which prevented reliance on a single neuron to improve generalization. The strength from $\sigma_B$ works as a complement to that from $\mu_B$. For a single neuron, $F_\gamma$ represents the norm of the gradient, implying that BN punishes large gradient norms. For multiple neurons, $F_\gamma$ is the FIM of $\gamma$, meaning that BN would penalize correlations among neurons. Both $\sigma_B$ and $\mu_B$ are important, and removing either one of them would impede performance.

We observe in the experiments in Sec.5 that BN in CNNs shares similar traits of regularization. However, in deep models the priors for $\mu_B$ and $\sigma_B$ become multivariate Gaussian distributions, where relationships between layers may not be negligible. In this case, we did not find a meaningful analytical form for BN.

## 3 Optimization with Regularization

We show that BN converges with maximum and effective learning rates (lr) that are larger than those of a network trained without BN. Our result explains why a large lr can be used in practice with BN (Ioffe & Szegedy, 2015). Our analyses are conducted in three stages. First, we establish the dynamical equations of a teacher-student (T-S) model in the thermodynamic limit and acquire the fixed point. Second, we investigate the eigenvalues of the corresponding Jacobian matrix at this fixed point. Finally, we calculate the maximum and the effective lr.

Teacher-Student Model. We first introduce useful techniques from statistical mechanics (SM). In SM, a student network is dedicated to learning the relationship between an input and an output with a weight vector $\mathbf{w}$ as parameters. It is useful to characterize the behavior of the student by using a teacher network that uses $\mathbf{w}^*$ as a ground-truth vector. We treat the single-layer perceptron as the student, which is optimized by minimizing the Euclidean distance between its output and the supervision provided by a teacher without BN. The student and the teacher have identical activation functions.

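Loss Function. We define the loss function of the above T-S model by $\frac{1}{P}\sum_{j=1}^{P}\ell(\mathbf{x}^j) = \frac{1}{P}\sum_{j=1}^{P}\big[g(\mathbf{w}^{*T}\mathbf{x}^j) - g(\sqrt{N}\gamma\frac{\mathbf{w}^T\mathbf{x}^j}{\|\mathbf{w}\|_2})\big]^2 + \zeta\gamma^2$. Here $g(\mathbf{w}^{*T}\mathbf{x}^j)$ represents the supervision from the teacher, while $g(\sqrt{N}\gamma\frac{\mathbf{w}^T\mathbf{x}^j}{\|\mathbf{w}\|_2})$ is the output of the student trained to mimic its teacher. The student is defined by Eqn.(4) where $\upsilon=\sqrt{N}\gamma$ and the bias term is merged into $\mathbf{w}$. This loss function represents BN using WN with gamma decay, and it is sufficient to study the lr of different approaches. Let $\theta=\{\mathbf{w},\gamma\}$ be the set of parameters updated by SGD, i.e. $\theta^{j+1}=\theta^{j}-\eta\frac{\partial\ell(\mathbf{x}^j)}{\partial\theta^{j}}$, where $\eta$ denotes a learning rate. With constant factors absorbed into $\eta$, the update rules for $\mathbf{w}$ and $\gamma$ are

$$\mathbf{w}^{j+1}-\mathbf{w}^{j} = \eta\,\delta^{j}\Big(\frac{\gamma^{j}\sqrt{N}}{\|\mathbf{w}^{j}\|_2}\mathbf{x}^{j} - \frac{\tilde{\mathbf{w}}^{jT}\mathbf{x}^{j}}{\|\mathbf{w}^{j}\|_2^2}\mathbf{w}^{j}\Big) \quad\text{and}\quad \gamma^{j+1}-\gamma^{j} = \eta\Big(\frac{\delta^{j}\sqrt{N}\,\mathbf{w}^{jT}\mathbf{x}^{j}}{\|\mathbf{w}^{j}\|_2} - \zeta\gamma^{j}\Big), \tag{5}$$

where $\tilde{\mathbf{w}}^{j}$ denotes the normalized weight vector of the student, that is, $\tilde{\mathbf{w}}^{j}=\sqrt{N}\gamma^{j}\frac{\mathbf{w}^{j}}{\|\mathbf{w}^{j}\|_2}$, and $\delta^{j}=g'(\tilde{\mathbf{w}}^{jT}\mathbf{x}^{j})\big[g(\mathbf{w}^{*T}\mathbf{x}^{j})-g(\tilde{\mathbf{w}}^{jT}\mathbf{x}^{j})\big]$ represents the gradient for clarity of notation. Here $g'(x)$ denotes the first derivative of $g(x)$.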
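One step of these update rules can be written out directly. The sketch below is one reading of Eqn.(5) with ReLU chosen as the activation; the function name `sgd_step` and the tiny two-dimensional example are illustrative assumptions. It also returns the weight increment so that a structural property can be inspected: the increment is orthogonal to the current weight vector, because the student output only depends on the direction of the weights.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def relu_grad(z):
    return np.where(z > 0, 1.0, 0.0)

def sgd_step(w, gamma, x, w_star, eta, zeta):
    """One update of (w, gamma) following Eqn.(5) for a single sample x."""
    N = w.size
    norm_w = np.linalg.norm(w)
    w_tilde = np.sqrt(N) * gamma * w / norm_w            # normalized student weights
    s = w_tilde @ x                                      # student pre-activation
    delta = relu_grad(s) * (relu(w_star @ x) - relu(s))  # gradient factor delta
    dw = eta * delta * (gamma * np.sqrt(N) / norm_w * x - s / norm_w ** 2 * w)
    dgamma = eta * (delta * np.sqrt(N) * (w @ x) / norm_w - zeta * gamma)
    return w + dw, gamma + dgamma, dw

w = np.array([1.0, 0.0])
w_star = np.array([1.0, 0.0])
x = np.array([1.0, 1.0])
w_new, gamma_new, dw = sgd_step(w, 0.5, x, w_star, eta=0.1, zeta=0.1)
print(w @ dw)   # numerically zero: the step does not change the weights along w
```

Because the increment is orthogonal to the weights, the norm of the weight vector can only grow, and only at second order in the learning rate.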

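Order Parameters. As we are interested in the “large-scale” regime where both $N$ and $P$ are large and their ratio $P/N$ is finite, it is difficult to examine a student with parameters in high dimensions directly. Therefore, we transform the weight vectors to order parameters that fully characterize the interactions between the student and the teacher. In this case, the parameter vector can be reparameterized by a vector of three elements, $\gamma$, $R$, and $L$. In particular, $\gamma$ measures the norm of the normalized weight vector $\tilde{\mathbf{w}}$, that is, $\tilde{\mathbf{w}}^T\tilde{\mathbf{w}}=N\gamma^2$. The parameter $R$ measures the angle (overlapping ratio) between the weight vectors of the student and the teacher. We have $R=\frac{\tilde{\mathbf{w}}^T\mathbf{w}^*}{\|\tilde{\mathbf{w}}\|_2\|\mathbf{w}^*\|_2}=\frac{1}{N\gamma}\tilde{\mathbf{w}}^T\mathbf{w}^*$, where the norm of the ground-truth vector is $\frac{1}{N}\mathbf{w}^{*T}\mathbf{w}^*=1$. Moreover, $L$ represents the norm of the original weight vector $\mathbf{w}$, with $L^2=\frac{1}{N}\mathbf{w}^T\mathbf{w}$. With the above definitions, the relationship between $\tilde{\mathbf{w}}$ and $\mathbf{w}$ can be represented by $\tilde{\mathbf{w}}=\gamma\frac{\mathbf{w}}{L}$.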
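The three order parameters can be computed from concrete weight vectors as follows. This is a sketch under the assumption that the student's normalized weights are the scale parameter times the unit-direction weights, multiplied by the square root of the input dimension, and that the overlap is the cosine between student and teacher; the function name is hypothetical.

```python
import numpy as np

def order_parameters(w, gamma, w_star):
    """Return (gamma, R, L) and the normalized student weights w_tilde."""
    N = w.size
    L = np.linalg.norm(w) / np.sqrt(N)                   # L^2 = w^T w / N
    w_tilde = np.sqrt(N) * gamma * w / np.linalg.norm(w)
    R = (w_tilde @ w_star) / (np.linalg.norm(w_tilde) * np.linalg.norm(w_star))
    return gamma, R, L, w_tilde

rng = np.random.default_rng(1)
w = rng.normal(size=50)
w_star = rng.normal(size=50)
gamma, R, L, w_tilde = order_parameters(w, 0.7, w_star)
print(R, L, np.allclose(w_tilde, gamma * w / L))
```

Three scalars summarize a student of any dimension, which is what makes the dynamics tractable in the large-scale regime.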

### 3.1 Learning Dynamics of Order Parameters

Now we transform the update equations (5) by using the order parameters $\gamma$, $R$, and $L$. The update rule for $\gamma$ follows directly from the update rule of $\gamma$ in (5). Similarly, the update rules for $R$ and $L$ are obtained by multiplying both sides of the update rule of $\mathbf{w}$ in (5) by the teacher and the student weight vectors respectively.

To define the learning dynamics, we turn the above update rules into ODEs. We take $\gamma$ as an example. Its differential equation is obtained from its update rule in the continuous-time limit, where $t=\frac{j}{N}$ is a normalized sample index that can be treated as a continuous time variable. We have that $dt=\frac{1}{N}$ approaches zero in the thermodynamic limit when $N\to\infty$, and $\langle\cdot\rangle_{\mathbf{x}}$ denotes expectation over the distribution of $\mathbf{x}$. The differential equations of $R$ and $L$ can be defined in the same way. We simplify notations by writing $I_1$, $I_2$, and $I_3$ for the expectations over $\mathbf{x}$ that appear in these equations. We obtain a dynamical system

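$$\frac{d\gamma}{dt}=\eta\frac{I_1}{\gamma}-\eta\zeta\gamma, \quad \frac{dR}{dt}=\eta\frac{\gamma}{L^2}I_3-\eta\frac{R}{L^2}I_1-\eta^2\frac{\gamma^2R}{2L^4}I_2 \quad\text{and}\quad \frac{dL}{dt}=\eta^2\frac{\gamma^2}{2L^3}I_2. \tag{6}$$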

More results are provided in Appendix C.4.

### 3.2 Fixed Point of the Dynamical System

To investigate the lr of BN, we derive the fixed point of (6) by setting $\frac{d\gamma}{dt}=\frac{dR}{dt}=\frac{dL}{dt}=0$. The fixed points of BN, WN, and vanilla SGD (without BN and WN) are given in Table 1. In the thermodynamic limit, the optima for $(\gamma, R, L)$ are denoted $(\gamma_0, R_0, L_0)$. Our main interest is the overlapping ratio $R_0$ between the student and the teacher, because it optimizes the direction of the weight vector. We see that $R_0$ attains the optimum ‘1’ for all three approaches. Intuitively, in BN and WN, this optimal solution does not depend on $L_0$ because their weight vectors are normalized. In other words, WN and BN are easier to optimize than vanilla SGD, where both $R_0$ and $L_0$ have to be optimized. In BN, $\gamma_0$ depends on the activation function. For ReLU, we have $\gamma_0^{bn}=\frac{1}{2\zeta+1}$, meaning that the norm of the normalized weight vector relies on the decay factor $\zeta$. In WN, we have $\gamma_0^{wn}=1$, as WN has no regularization on $\gamma$.
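The ReLU fixed point of the scale parameter can be checked with a short Monte Carlo computation. The sketch below rests on several assumptions that are not spelled out in this section: the student is perfectly aligned with the teacher (overlap equal to 1), the teacher pre-activation is a standard Gaussian variable, and the expectation $I_1$ is taken to be the average of the gradient factor times the student pre-activation. Under these assumptions, setting the first equation of (6) to zero gives $\gamma = 1/(2\zeta+1)$.

```python
import numpy as np

def dgamma_dt(gamma, zeta, eta=1.0, n_samples=400_000, seed=0):
    """Monte Carlo estimate of the right-hand side of d(gamma)/dt at overlap R = 1."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=n_samples)      # teacher pre-activation w*^T x (assumed N(0, 1))
    s = gamma * u                       # student pre-activation when R = 1
    delta = (s > 0) * (np.maximum(u, 0.0) - np.maximum(s, 0.0))   # ReLU gradient factor
    I1 = np.mean(delta * s)             # assumed definition of I_1
    return eta * I1 / gamma - eta * zeta * gamma

zeta = 0.25
gamma0 = 1.0 / (2.0 * zeta + 1.0)
print(dgamma_dt(gamma0, zeta))          # close to zero at the fixed point
print(dgamma_dt(0.30, zeta))            # positive: gamma grows toward gamma0
print(dgamma_dt(0.95, zeta))            # negative: gamma shrinks toward gamma0
```

The sign change around the fixed point indicates that the scale parameter is pulled toward a value below 1 whenever the decay factor is positive.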

### 3.3 Maximum and Effective Learning Rates

With the above fixed points, we derive the maximum and the effective lr. Specifically, we analyze the eigenvalues and eigenvectors of the Jacobian matrix corresponding to (6). We are interested in the lr required to approach $R_0$. We find that this optimum value only depends on its corresponding eigenvalue, denoted as $\lambda_R$, which is expressed in terms of $\eta_{\max}$ and $\eta_{\mathrm{eff}}$, the maximum and effective lr (proposition 1 in Appendix C.4). They are given in Table 1. We demonstrate that $\lambda_R<0$ if and only if $\eta<\eta_{\max}$, such that the fixed point is stable for all approaches (proposition 2 in Appendix C.5). It can be shown that $\eta_{\max}$ of BN ($\eta_{\max}^{bn}$) is larger than that of WN and SGD, enabling $R$ to converge with a larger learning rate. With ReLU, a lower bound on $\eta_{\max}^{bn}$ relative to WN and SGD is given in proposition 3 in Appendix C.6. Moreover, the effective lr’s in Table 1 are consistent with previous work (van Laarhoven, 2017).

## 4 Generalization Analysis

To investigate the generalization of BN, we adopt a teacher-student model with identity activation function, which minimizes the squared error $\frac{1}{P}\sum_{j=1}^{P}(y^{*j}-y^{j})^2$, where $y^*$ represents the teacher’s output and $y$ is the student’s output. We compare BN with WN+gamma decay and vanilla SGD. All of them share the same teacher network, whose output is defined by $y^*=\mathbf{w}^{*T}\mathbf{x}+s$, where $\mathbf{x}$ is the randomly drawn network input and $s$ is an unobserved Gaussian noise. We are interested in how different methods resist this noise.

For vanilla SGD, the student is computed by $y=\mathbf{w}^T\mathbf{x}$ with $\mathbf{w}$ being the weight vector to optimize, where $\mathbf{w}$ has the same dimension as $\mathbf{w}^*$ so that the learning problem is realizable. The loss function of vanilla SGD is $\frac{1}{P}\sum_{j=1}^{P}(y^{*j}-\mathbf{w}^T\mathbf{x}^j)^2$, whose solution asymptotically approaches the Moore–Penrose pseudo-inverse solution. For BN, the student is defined as $y=\gamma\frac{\mathbf{w}^T\mathbf{x}-\mu_B}{\sigma_B}+\beta$. As our main interest is the weight vector, we freeze the bias, similar to vanilla SGD, by setting $\beta=0$. Therefore, the loss function is written as $\frac{1}{P}\sum_{j=1}^{P}\big(y^{*j}-\gamma\frac{\mathbf{w}^T\mathbf{x}^j-\mu_B}{\sigma_B}\big)^2$. For WN+gamma decay, the student is computed similarly to Eqn.(4) by $y=\sqrt{N}\gamma\frac{\mathbf{w}^T\mathbf{x}}{\|\mathbf{w}\|_2}$. Then the loss function is defined by $\frac{1}{P}\sum_{j=1}^{P}\big(y^{*j}-\sqrt{N}\gamma\frac{\mathbf{w}^T\mathbf{x}^j}{\|\mathbf{w}\|_2}\big)^2+\zeta\gamma^2$. In the T-S model with identity unit, the expression of $\zeta$ simplifies after applying theorem 1 (Appendix C.1). With the above definitions, the three approaches are studied under the same T-S model, so their generalization errors can be strictly compared with other factors ruled out.

### 4.1 Generalization Errors

We provide closed-form solutions of the generalization errors for vanilla SGD and WN+gamma decay. They are compared with numerical solutions of BN.

Vanilla SGD. The solution of the generalization error depends on the rank of the correlation matrix $\Sigma$ of the input. Here we define an effective load $\alpha=P/N$ that is the ratio between the number of samples $P$ and the number of input neurons $N$ (the number of learnable parameters). The generalization error, denoted as $\epsilon_{gen}^{sgd}$, can be acquired by using the distribution of eigenvalues of $\Sigma$ following (Advani & Saxe, 2017). If $\alpha<1$, $\epsilon_{gen}^{sgd}=1-\alpha+\frac{\alpha S}{1-\alpha}$. Otherwise, $\epsilon_{gen}^{sgd}=\frac{S}{\alpha-1}$, where $S$ is the variance of the noise injected into the teacher. The values of $\epsilon_{gen}^{sgd}$ with respect to $\alpha$ are plotted as the blue curve in the top of Fig.1. It first decreases but then increases as $\alpha$ increases from 0 to 1, diverges at $\alpha=1$, and decreases again when $\alpha>1$.
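The behavior of the pseudo-inverse solution around an effective load of 1 can be reproduced with a small simulation. The sketch below assumes standard Gaussian inputs, a unit-norm teacher, and teacher noise of variance `S`; under these assumptions the known large-system expressions for the excess error of the minimum-norm least-squares solution are $1-\alpha+\alpha S/(1-\alpha)$ below a load of 1 and $S/(\alpha-1)$ above it. Function names and sizes are illustrative.

```python
import numpy as np

def sgd_generalization_error(alpha, S):
    """Large-system excess error of the pseudo-inverse solution (assumed setup)."""
    if alpha < 1.0:
        return 1.0 - alpha + alpha * S / (1.0 - alpha)
    if alpha > 1.0:
        return S / (alpha - 1.0)
    return float("inf")                     # divergence at alpha = 1

def simulate(N, P, S, trials=100, seed=0):
    """Average squared distance between the pseudo-inverse solution and the teacher."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        w_star = rng.normal(size=N)
        w_star /= np.linalg.norm(w_star)    # unit-norm teacher
        X = rng.normal(size=(P, N))
        y = X @ w_star + np.sqrt(S) * rng.normal(size=P)
        w_hat = np.linalg.pinv(X) @ y       # Moore-Penrose pseudo-inverse solution
        errs.append(np.sum((w_hat - w_star) ** 2))
    return float(np.mean(errs))

print(sgd_generalization_error(0.5, 0.25), simulate(N=60, P=30, S=0.25))
print(sgd_generalization_error(2.0, 0.25), simulate(N=30, P=60, S=0.25))
```

At finite size the simulated values sit slightly above the asymptotic ones, and they blow up as the number of samples approaches the number of inputs.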

WN+gamma decay. The decay term turns the correlation matrix into $\Sigma+\zeta\mathbf{I}$, which is positive definite. Following statistical mechanics (Krogh & Hertz, 1992), the generalization error $\epsilon_{gen}^{wn}$ has a closed form that can be computed quantitatively given the values of $\zeta$ and $\alpha$. Let the variance of the noise injected into the teacher be 0.25. Fig.1 top shows that no other curve outperforms the curve with $\zeta=0.25$, a value equal to the noise magnitude. Values of $\zeta$ smaller than $0.25$ exhibit overtraining around $\alpha=1$, but they still perform significantly better than vanilla SGD.

Numerical Solutions of BN. We employ SGD with batch size $M$ to find solutions of $\mathbf{w}$ for BN. The generalization error is evaluated as the difference between the validation and the training loss. The number of input neurons is 4096 and the number of training samples is varied to change $\alpha$. The results are marked as black squares in the top of Fig.1. After applying theorem 1 to the T-S model, BN is equivalent to WN+gamma decay when $\zeta=\frac{1}{2M}$. BN falls in line with the curve of ‘$\zeta$=1/2M’ and thus quantitatively validates our derivations. Their generalization errors are further compared in the bottom of Fig.1 around $\alpha=1$, where vanilla SGD clearly diverges, while BN and WN+gamma decay are comparable.

## 5 Experiments in CNNs

This section shows that BN in CNNs exhibits traits of regularization similar to those in our analyses. We employ a 6-layer CNN similar to (Salimans & Kingma, 2016). For all experiments, the network architecture is fixed while the normalization layers can be changed. We adopt CIFAR10 (Krizhevsky, 2009), which contains 60k images of 10 categories (50k images for training and 10k images for test). All models are trained using SGD with momentum on a single GPU, while the initial learning rates are scaled proportionally for different batch sizes (Goyal et al., 2017). In order to study the regularization of BN, we discard any other trick such as weight decay and data augmentation. More empirical settings can be found in Appendix B.

Evaluation of Theorem 1. We compare BN with PN+gamma decay, where the population statistics and the regularization coefficient are estimated using a sufficient number of training samples. BN trained with a normal batch size is treated as the baseline in Fig.2. When the batch size increases, BN degrades in both loss and accuracy. For example, when $M$ is increased substantially, performance decreases because the regularization from the batch statistics is reduced in a large batch, resulting in overtraining (see the gap between the train and validation losses for the large-batch setting).

In comparison, we train PN by using 10k training samples to estimate the statistics. This further reduces regularization. We see that the loss of regularization can be compensated by gamma decay, making PN even outperform BN. This empirical result verifies our derivation of the regularization of BN. A similar trend can be observed in an experiment on a downsampled version of ImageNet (see Appendix B.1). Nevertheless, we would like to point out that PN+gamma decay is of interest in theoretical analysis, but it is computationally demanding when applied in practice because evaluating $\mu_P$, $\sigma_P$ and $\zeta(h)$ may require a sufficiently large number of samples.

Study of Regularization. We study the regularization strengths of vanilla SGD, BN, WN, WN+mean-only BN, and WN+variance-only BN. Fig.3 compares their training and validation losses. We see that the generalization error of BN is much lower than that of WN and vanilla SGD. The reason has been disclosed in this work: the stochastic behaviors of $\mu_B$ and $\sigma_B$ in BN improve generalization.

To investigate $\mu_B$ and $\sigma_B$ individually, we decompose their contributions by running a WN with mean-only BN as well as a WN with variance-only BN, to simulate their respective regularization. As shown in Fig.3, the improvements from the mean-only and the variance-only BN over WN verify our conclusion that the noises from $\mu_B$ and $\sigma_B$ have different regularization strengths. Both of them are essential to produce good results.
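The two ablations can be sketched for a single neuron as follows. The exact definitions used in the experiments are not given in this section, so the forms below are assumptions: mean-only BN subtracts the batch mean from the weight-normalized output, and variance-only BN divides it by the batch standard deviation without centering. All function names and the constant `eps` are illustrative.

```python
import numpy as np

def weight_norm(X, w, upsilon):
    """Weight-normalized pre-activation: upsilon * w^T x / ||w||."""
    return upsilon * (X @ w) / np.linalg.norm(w)

def mean_only_bn(h, beta=0.0):
    """Inject only the noise of the batch mean (assumed form)."""
    return h - h.mean() + beta

def variance_only_bn(h, gamma=1.0, eps=1e-5):
    """Inject only the noise of the batch standard deviation (assumed form)."""
    return gamma * h / np.sqrt(h.var() + eps)

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 10))
w = rng.normal(size=10)
h = weight_norm(X, w, upsilon=2.0)
print(mean_only_bn(h).mean(), variance_only_bn(h, gamma=1.5).std())
```

Each variant exposes the neuron to one source of batch noise while WN handles the rest of the normalization deterministically.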

Parameter Norm. We further demonstrate the impact of BN on the norm of parameters. We compare BN with vanilla SGD. A network is first trained with BN until it converges to a local minimum where the parameters do not change much. At this local minimum, the weight vector is frozen and denoted as $\mathbf{w}^{bn}$. Then this network is finetuned using vanilla SGD with a small learning rate, and its kernel parameters are initialized by $\mathbf{w}^{sgd}=\gamma\frac{\mathbf{w}^{bn}}{\sigma}$, where $\sigma$ is the moving average of $\sigma_B$.
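Freezing BN and continuing with vanilla SGD amounts to folding the frozen normalization into the kernel. The sketch below shows this for one neuron. The kernel rescaling follows the initialization described above; folding the moving-average mean and the shift parameter into a bias term is an added assumption, and the function name is hypothetical.

```python
import numpy as np

def fold_bn_into_weights(w_bn, gamma, beta, running_mean, running_std):
    """Return a kernel and bias that reproduce the frozen BN neuron without BN."""
    w_sgd = gamma * w_bn / running_std                  # rescaled kernel
    b_sgd = beta - gamma * running_mean / running_std   # assumed bias folding
    return w_sgd, b_sgd

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 5))
w_bn = rng.normal(size=5)
gamma, beta, running_mean, running_std = 1.3, 0.2, 0.4, 2.0

w_sgd, b_sgd = fold_bn_into_weights(w_bn, gamma, beta, running_mean, running_std)
frozen_bn_out = gamma * (X @ w_bn - running_mean) / running_std + beta
print(np.allclose(frozen_bn_out, X @ w_sgd + b_sgd))
```

The folded network starts from exactly the same function as the frozen BN network, so any later change in loss or parameter norm can be attributed to the removal of the batch noise.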

Fig.7 in Appendix B.2 visualizes the results. As $\mu_B$ and $\sigma_B$ are removed in vanilla SGD, the last two figures show that the training loss decreases while the validation loss increases, implying that the reduction in regularization makes the network converge to a sharper local minimum that generalizes less well. The magnitudes of the kernel parameters at different layers are also displayed in the first four figures. All of them increase after freezing BN, due to the release of regularization on these parameters.

Batch size. To study BN with different batch sizes, we train different networks but only add BN at one layer at a time. The regularization on the parameter $\gamma$ is compared in Fig.4 (a) when BN is located at different layers. The values of $\gamma$ increase with the batch size $M$ due to the weaker regularization for larger batches. The increase of $\gamma$ also makes all validation losses increase, as shown in Fig.4 (b).

BN+dropout. Despite the better generalization of BN with smaller batch sizes, large-batch training is more efficient in real cases. Therefore, improving the generalization of BN with large batches is more desirable. However, gamma decay requires estimating the population statistics, which increases computation. We also found that treating the decay coefficient as a constant hardly improves generalization for large batches. Therefore, we utilize dropout as an alternative to compensate for the insufficient regularization. Dropout has also been analytically viewed as a regularizer (Wager et al., 2013). We add a dropout layer after each BN layer to impose regularization.

Fig.5 plots the results. The generalization of BN deteriorates significantly when $M$ increases from 64 to 256. This is observed in the much higher validation loss (top) and lower validation accuracy (bottom) when $M=256$. If a dropout layer with a small ratio is added after each BN layer for $M=256$, the validation loss is suppressed and the accuracy is increased by a great margin. This superficially contradicts the original claim that BN reduces the need for dropout (Ioffe & Szegedy, 2015). There are two differences between our study and previous work.

First, in the previous study the batch size was fixed at a quite small value (e.g. 32), at which the regularization was already quite strong. Therefore, additional dropout could not further improve regularization, but on the contrary increased the instability of training and yielded lower accuracy. In contrast, our study explores relatively large batches that weaken the regularization of BN, and thus dropout with a small ratio can compensate. Second, the usual practice puts dropout before BN, which causes BN to have different variances during training and test. In contrast, dropout follows BN in this study, and thus the problem can be alleviated. The improvement from applying dropout after BN has also been observed in a recent work (Li et al., 2018).
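The ordering discussed above, with dropout placed after BN, can be sketched as a single layer function. This is a simplified illustration rather than the experimental code: it uses inverted dropout, normalizes with batch statistics in both modes (running statistics for test time are omitted), and the function name, the constant `eps`, and the ratio 0.1 are assumptions.

```python
import numpy as np

def bn_then_dropout(h, gamma, beta, drop_ratio, training, rng, eps=1e-5):
    """Normalize each column of h with batch statistics, then apply dropout."""
    h_hat = gamma * (h - h.mean(axis=0)) / np.sqrt(h.var(axis=0) + eps) + beta
    if not training or drop_ratio == 0.0:
        return h_hat                                  # dropout is the identity at test time
    keep = rng.random(h_hat.shape) >= drop_ratio      # Bernoulli keep mask
    return h_hat * keep / (1.0 - drop_ratio)          # inverted dropout keeps the expectation

rng = np.random.default_rng(0)
h = rng.normal(size=(256, 64))                        # batch of 256, 64 neurons
eval_out = bn_then_dropout(h, 1.0, 0.0, 0.1, training=False, rng=rng)
train_out = bn_then_dropout(h, 1.0, 0.0, 0.1, training=True, rng=rng)
print(np.mean(train_out == 0.0))                      # about 0.1 of the activations are dropped
```

Since the mask is applied after normalization, the statistics that BN estimates are computed on undropped activations in both training and test.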

WN+dropout. Since BN can be treated as WN trained with regularization in this study, combining WN with regularization should be able to match the performance of BN. As WN outperforms BN in running speed (it does not calculate statistics) and is better suited to RNNs than BN, improving its generalization is also of great importance. Fig.5 shows that WN can also be regularized by dropout. We apply dropout after each WN layer with ratio 0.25. The improvement in both validation accuracy and loss is surprisingly large. The accuracy increases from 0.73 to 0.80, surpassing ‘BN M=256’ and on par with ‘BN M=64’.

## 6 Related Work

Neural Network Analysis. Many studies have analyzed neural networks (Opper et al., 1990; Saad & Solla, 1996; Bös & Opper, 1998; Pennington & Bahri, 2017; Zhang et al., 2017b; Brutzkus & Globerson, 2017; Raghu et al., 2017; Mei et al., 2016; Tian, 2017). For example, for a multilayer network with linear activation function, Glorot & Bengio (2010) explored its SGD dynamics and Kawaguchi (2016) showed that every local minimum is global. Tian (2017) studied the critical points and convergence behaviors of a 2-layer network with ReLU units. Zhang et al. (2017b) investigated a teacher-student model when the activation function is harmonic. In (Saad & Solla, 1996), the learning dynamics of a committee machine were discussed when the activation function is the error function $\mathrm{erf}(x)$. Unlike previous work, this work analyzes the regularization that emerges in BN and its impact on both learning and generalization, which have not yet been studied in the literature.

Normalization. Many normalization methods have been proposed recently. For example, BN (Ioffe & Szegedy, 2015) was introduced to stabilize the distribution of the input data of each hidden layer. Weight normalization (WN) (Salimans & Kingma, 2016) decouples the lengths of the network parameter vectors from their directions, by normalizing the parameter vectors to unit length. The dynamics of WN were studied using a single-layer network (Yoshida et al., 2017). Moreover, Li et al. (2018) diagnosed the compatibility of BN and dropout (Srivastava et al., 2014) by reducing the variance shift produced by them. van Laarhoven (2017) showed that weight decay has no regularization effect when used together with BN or WN. Ba et al. (2016) demonstrated that when BN or WN is employed, back-propagating gradients through a hidden layer is scale-invariant with respect to the network parameters. Santurkar et al. (2018) gave another perspective on the role of BN during training, instead of reducing the covariate shift. They argued that BN results in a smoother optimization landscape and that the Lipschitzness is strengthened in networks trained with BN. However, both analytical and empirical results on regularization in BN are still desirable. Our study explores the regularization, optimization, and generalization of BN in the scenario of online learning.

Regularization. Ioffe & Szegedy (2015) conjectured that BN implicitly regularizes training to prevent overfitting. Zhang et al. (2017a) categorized BN as an implicit regularizer based on experimental results. Szegedy et al. (2015) also conjectured that in the Inception network, BN behaves similarly to dropout in improving the generalization ability. Gitman & Ginsburg (2017) experimentally compared BN and WN, and also confirmed the better generalization of BN. In the literature there are also implicit regularization schemes other than BN. For instance, random noise in the input layer for data augmentation has long been known to be equivalent to a weight decay method, in the sense that the inverse of the signal-to-noise ratio acts as the decay factor (Krogh & Hertz, 1992; Rifai et al., 2011). Dropout (Srivastava et al., 2014) was also proved to regularize training by using the generalized linear model (Wager et al., 2013).

## 7 Discussions and Future Work

This work investigated the regularization that emerges in BN. By utilizing a single-layer perceptron, BN was decomposed into PN and gamma decay, where the regularization strengths from $\mu_B$ and $\sigma_B$ are different and their impacts on training were explored. Moreover, the convergence and generalization of BN with regularization were derived and compared with vanilla SGD, WN, and WN+gamma decay, showing that BN enables training to converge with large maximum and effective learning rates, and leads to better generalization. Our analytical results explain many existing empirical phenomena. Experiments in CNNs showed that BN in deep networks shares the same traits of regularization. Furthermore, a combination of dropout and BN might improve BN when the batch size is large. Our result also encourages us to combine WN and dropout, which outperforms BN in some respects without estimating batch statistics.

In future work, we are interested in finding an analytical form of regularization for BN in deep networks, although it might involve multivariate Gaussian prior distributions, making it a non-trivial problem. Moreover, investigating other normalizers such as instance normalization (IN) (Ulyanov et al., 2016) and layer normalization (LN) (Ba et al., 2016) is also important. Understanding the characteristics of these normalizers should be the first step toward analyzing some recent best practices such as group normalization (Wu & He, 2018), which merges IN and LN, and switchable normalization (Luo et al., 2018), which chooses among BN, IN, and LN in each normalization layer. Furthermore, devising an efficient counterpart of gamma decay is desirable and will be investigated in the future, as it may improve the generalization of WN, which is independent of batch statistics.

## Appendices

### B More Empirical Settings and Results

All experiments in Sec.5 are conducted on CIFAR10 with a CNN architecture similar to that of (Salimans & Kingma, 2016), summarized as ‘conv(3,32)-conv(3,32)-conv(3,64)-conv(3,64)-pool(2,2)-fc(512)-fc(10)’, where ‘conv(3,32)’ represents a convolution with kernel size 3 and 32 channels, ‘pool(2,2)’ is max-pooling with kernel size 2 and stride 2, and ‘fc’ indicates a fully connected layer. We train with SGD using a momentum value of 0.9 and continuously decay the learning rate by a constant factor at each step. For different batch sizes, the initial learning rate is scaled proportionally with the batch size to maintain similar learning dynamics (Goyal et al., 2017).
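The learning-rate handling described above can be sketched as follows. This is a minimal illustration of proportional (linear) scaling and per-step decay, not the authors' training script; `base_lr`, `base_batch_size`, and `decay` are hypothetical placeholder values chosen only for illustration.

```python
def scaled_initial_lr(batch_size, base_lr=0.01, base_batch_size=128):
    """Scale the initial learning rate proportionally with the batch size.

    base_lr and base_batch_size are hypothetical reference values.
    """
    return base_lr * batch_size / base_batch_size


def lr_at_step(step, initial_lr, decay=0.999):
    """Decay the learning rate by a constant factor at every step.

    decay is a hypothetical placeholder for the per-step factor.
    """
    return initial_lr * decay ** step


for batch_size in (128, 256, 512):
    lr0 = scaled_initial_lr(batch_size)
    print(batch_size, lr0, lr_at_step(1000, lr0))
```

Doubling the batch size doubles the initial learning rate, while the per-step decay is left unchanged.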

#### B.1 Results in downsampled ImageNet

Besides CIFAR10, we also evaluate Theorem 1 on a downsampled version of ImageNet (Loshchilov & Hutter, 2016), which contains the same 1.2 million images and 1k categories as the original ImageNet, but with each image rescaled to $32\times32$. We train ResNet18 on downsampled ImageNet by following the training protocol used in (He et al., 2016). In particular, ResNet18 is trained using SGD with a momentum of 0.9 and an initial learning rate of 0.1, which is then decayed by a factor of 10 after 30, 60, and 90 training epochs.
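The step schedule just described can be written as a small function. This is a sketch under one stated assumption: epochs are counted from 0, so the first drop applies from epoch index 30 onward. The function name `step_lr` is illustrative.

```python
def step_lr(epoch, initial_lr=0.1, milestones=(30, 60, 90), factor=10.0):
    """Learning rate at a given epoch for a step-decay schedule.

    The rate is divided by `factor` once for every milestone that has
    already been reached (epochs counted from 0, an assumption).
    """
    drops = sum(1 for m in milestones if epoch >= m)
    return initial_lr / factor ** drops


for epoch in (0, 29, 30, 60, 90):
    print(epoch, step_lr(epoch))
```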

In downsampled ImageNet, we observe trends similar to those presented for CIFAR10. For example, BN impairs both loss and accuracy when the batch size increases. When the batch size is increased, as shown in Fig.6, both the loss and the validation accuracy decrease because the regularization from the random batch statistics weakens at large batch sizes, resulting in overtraining. This can be seen from the gap between the training and validation losses. Nevertheless, the reduction in regularization can be compensated for when PN is trained with adaptive gamma decay, which makes PN perform comparably to BN in downsampled ImageNet.

#### B.2 Impact of BN on the Norm of Parameters

We demonstrate the impact of BN on the norm of parameters. We compare BN with vanilla SGD: a network is first trained with BN until it converges to a local minimum where the parameters no longer change much. At this local minimum, the weight vector is frozen. The network is then finetuned using vanilla SGD with a small learning rate, with the kernel parameters initialized from the frozen weights and the moving average of the batch statistics.

Fig.7 visualizes the results. As $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ are removed in vanilla SGD, the last two figures show that the training loss decreases while the validation loss increases, meaning that the reduced regularization makes the network converge to a sharper local minimum that generalizes less well. The magnitudes of the kernel parameters at different layers are displayed in the first four figures. All of them increase after freezing BN, due to the release of regularization on these parameters.

### C Proof of Results

#### C.1 Proof of Theorem 1

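###### Theorem 1 (Regularization of $\mu_{\mathcal{B}}, \sigma_{\mathcal{B}}$).

Let $\zeta$ be the strength (coefficient) of the regularization and the activation function be ReLU. Then

$$
\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\,\ell(\hat{h}^j) \simeq \frac{1}{P}\sum_{j=1}^{P}\ell(\bar{h}^j)+\zeta\gamma^2, \quad\text{and}\quad \zeta=\underbrace{\frac{\rho+2}{8M}F_{\gamma}}_{\text{from }\sigma_{\mathcal{B}}}+\underbrace{\frac{1}{2M}\frac{1}{P}\sum_{j=1}^{P}\sigma(\bar{h}^j)}_{\text{from }\mu_{\mathcal{B}}},
$$

where $\rho$ is the kurtosis of the distribution of $w^{T}x$, $F_{\gamma}$ is a Fisher Information Matrix of $\gamma$, and $\sigma(\cdot)$ is a sigmoid function.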
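To make the two contributions to $\zeta$ concrete, the following sketch evaluates the expression in Theorem 1 numerically. It rests on assumptions not stated in the theorem itself: $M$ is taken to be the batch size, $F_{\gamma}$ is treated as a scalar supplied by the caller, and the function names are illustrative.

```python
import math


def sigmoid(h):
    """Numerically stable logistic function."""
    if h >= 0:
        return 1.0 / (1.0 + math.exp(-h))
    e = math.exp(h)
    return e / (1.0 + e)


def zeta(rho, batch_size, fisher_gamma, h_bar):
    """Regularization strength from Theorem 1.

    rho          : kurtosis term appearing in the theorem
    batch_size   : M (assumed to be the batch size)
    fisher_gamma : F_gamma, treated here as a scalar (assumption)
    h_bar        : list of population-normalized hidden values
    """
    from_sigma = (rho + 2.0) / (8.0 * batch_size) * fisher_gamma
    from_mu = sum(sigmoid(h) for h in h_bar) / (2.0 * batch_size * len(h_bar))
    return from_sigma + from_mu


h_bar = [-1.0, 0.0, 0.5, 2.0]  # made-up values for illustration
for m in (16, 64, 256):
    print(m, zeta(rho=0.0, batch_size=m, fisher_gamma=1.0, h_bar=h_bar))
```

Both terms scale as $1/M$, so under these assumptions the strength shrinks as the batch size grows, in line with the weaker regularization observed at large batch sizes.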

###### Proof.

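Let $\hat{h}^j=\gamma\frac{w^{T}x^j-\mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}}+\beta$ and $\bar{h}^j=\gamma\frac{w^{T}x^j-\mu_{\mathcal{P}}}{\sigma_{\mathcal{P}}}+\beta$. We prove Theorem 1 by performing a Taylor expansion of a function $A(\hat{h}^j)$ at $\bar{h}^j$, where $A(\cdot)$ is a function of $h$ defined according to a particular activation function. The negative log-likelihood function of the single-layer perceptron can be generally defined as $\ell(\hat{h}^j)=A(\hat{h}^j)-y^j\hat{h}^j$, which is similar to the loss function of generalized linear models with different activation functions.

$$
\begin{aligned}
\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[\ell(\hat{h}^j)\big]
&=\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[A(\hat{h}^j)-y^j\hat{h}^j\big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\big(A(\bar{h}^j)-y^j\bar{h}^j\big)+\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[-y^j(\hat{h}^j-\bar{h}^j)+A(\hat{h}^j)-A(\bar{h}^j)\big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\ell(\bar{h}^j)+\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[(A'(\bar{h}^j)-y^j)(\hat{h}^j-\bar{h}^j)\big]\\
&\quad+\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\Big[\frac{A''(\bar{h}^j)}{2}(\hat{h}^j-\bar{h}^j)^2\Big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\ell(\bar{h}^j)+R^f+R^q,
\end{aligned}
$$

where $A'(\cdot)$ and $A''(\cdot)$ denote the first and second derivatives of the function $A(\cdot)$. The first- and second-order terms in the expansion are represented by $R^f$ and $R^q$ respectively. To derive the analytical forms of $R^f$ and $R^q$, we take second-order Taylor expansions of $\frac{1}{\sigma_{\mathcal{B}}}$ and $\frac{1}{\sigma_{\mathcal{B}}^2}$ around $\sigma_{\mathcal{P}}$; it suffices to have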

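$$
\frac{1}{\sigma_{\mathcal{B}}}\approx\frac{1}{\sigma_{\mathcal{P}}}+\Big(-\frac{1}{\sigma_{\mathcal{P}}^2}\Big)(\sigma_{\mathcal{B}}-\sigma_{\mathcal{P}})+\frac{1}{\sigma_{\mathcal{P}}^3}(\sigma_{\mathcal{B}}-\sigma_{\mathcal{P}})^2
$$

and

$$
\frac{1}{\sigma_{\mathcal{B}}^2}\approx\frac{1}{\sigma_{\mathcal{P}}^2}+\Big(-\frac{2}{\sigma_{\mathcal{P}}^3}\Big)(\sigma_{\mathcal{B}}-\sigma_{\mathcal{P}})+\frac{3}{\sigma_{\mathcal{P}}^4}(\sigma_{\mathcal{B}}-\sigma_{\mathcal{P}})^2.
$$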
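As a quick sanity check, the two second-order expansions above can be compared against the exact reciprocals for values of $\sigma_{\mathcal{B}}$ close to $\sigma_{\mathcal{P}}$. The function names and test values below are illustrative only.

```python
def inv_sigma_taylor(sigma_b, sigma_p):
    """Second-order Taylor expansion of 1/sigma_B around sigma_P."""
    d = sigma_b - sigma_p
    return 1.0 / sigma_p - d / sigma_p**2 + d**2 / sigma_p**3


def inv_sigma_sq_taylor(sigma_b, sigma_p):
    """Second-order Taylor expansion of 1/sigma_B^2 around sigma_P."""
    d = sigma_b - sigma_p
    return 1.0 / sigma_p**2 - 2.0 * d / sigma_p**3 + 3.0 * d**2 / sigma_p**4


sigma_p = 1.0
for sigma_b in (0.9, 1.0, 1.1):
    err1 = abs(1.0 / sigma_b - inv_sigma_taylor(sigma_b, sigma_p))
    err2 = abs(1.0 / sigma_b**2 - inv_sigma_sq_taylor(sigma_b, sigma_p))
    print(f"sigma_B={sigma_b}: |error| in 1/sigma_B = {err1:.2e}, "
          f"in 1/sigma_B^2 = {err2:.2e}")
```

The remaining error is third order in $\sigma_{\mathcal{B}}-\sigma_{\mathcal{P}}$, so the approximation tightens as the batch statistic concentrates around the population value.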

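By applying the distributions of $\mu_{\mathcal{B}}$ and $\sigma_{\mathcal{B}}$ in the paper, $R^f$ can be derived as

$$
\begin{aligned}
R^f&=\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[(A'(\bar{h}^j)-y^j)(\hat{h}^j-\bar{h}^j)\big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\Big[(A'(\bar{h}^j)-y^j)\Big(\gamma\frac{w^{T}x^j-\mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}}-\gamma\frac{w^{T}x^j-\mu_{\mathcal{P}}}{\sigma_{\mathcal{P}}}\Big)\Big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\Big[(A'(\bar{h}^j)-y^j)\Big(\gamma w^{T}x^j\Big(\frac{1}{\sigma_{\mathcal{B}}}-\frac{1}{\sigma_{\mathcal{P}}}\Big)+\gamma\Big(-\frac{\mu_{\mathcal{B}}}{\sigma_{\mathcal{B}}}+\frac{\mu_{\mathcal{P}}}{\sigma_{\mathcal{P}}}\Big)\Big)\Big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\gamma(A'(\bar{h}^j)-y^j)(w^{T}x^j-\mu_{\mathcal{P}})\,\mathbb{E}_{\sigma_{\mathcal{B}}}\Big[\frac{1}{\sigma_{\mathcal{B}}}-\frac{1}{\sigma_{\mathcal{P}}}\Big]\\
&=\frac{1}{P}\sum_{j=1}^{P}\frac{\rho+2}{4M}\gamma(A'(\bar{h}^j)-y^j)\frac{w^{T}x^j-\mu_{\mathcal{P}}}{\sigma_{\mathcal{P}}}.
\end{aligned}
$$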
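The last step of the derivation of $R^f$ replaces the expectation of $\frac{1}{\sigma_{\mathcal{B}}}-\frac{1}{\sigma_{\mathcal{P}}}$ with $\frac{\rho+2}{4M\sigma_{\mathcal{P}}}$. The sketch below checks this by Monte Carlo under an explicit assumption that is not stated in this appendix: $\sigma_{\mathcal{B}}$ is drawn from a Gaussian with mean $\sigma_{\mathcal{P}}$ and variance $\frac{(\rho+2)\sigma_{\mathcal{P}}^2}{4M}$, chosen so that the second-order expansion reproduces the factor above. The function names and parameter values are illustrative.

```python
import math
import random


def mc_mean_inverse_gap(sigma_p, rho, batch_size, n_samples=200_000, seed=0):
    """Monte Carlo estimate of E[1/sigma_B - 1/sigma_P].

    Assumes sigma_B ~ N(sigma_P, (rho + 2) * sigma_P^2 / (4 M)); this
    distribution is an assumption made for the check.
    """
    rng = random.Random(seed)
    std = sigma_p * math.sqrt((rho + 2.0) / (4.0 * batch_size))
    total = 0.0
    for _ in range(n_samples):
        sigma_b = rng.gauss(sigma_p, std)
        total += 1.0 / sigma_b - 1.0 / sigma_p
    return total / n_samples


def predicted_gap(sigma_p, rho, batch_size):
    """Second-order prediction (rho + 2) / (4 M sigma_P)."""
    return (rho + 2.0) / (4.0 * batch_size * sigma_p)


estimate = mc_mean_inverse_gap(sigma_p=1.0, rho=0.0, batch_size=64)
print("Monte Carlo:", estimate)
print("Prediction: ", predicted_gap(sigma_p=1.0, rho=0.0, batch_size=64))
```

The two numbers should agree up to sampling noise and the higher-order terms dropped by the expansion, both of which become smaller as $M$ grows.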

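The term $R^f$ can be understood as follows. Let $h=\frac{w^{T}x-\mu_{\mathcal{P}}}{\sigma_{\mathcal{P}}}$ and let the distribution of the population data be $p_{xy}$. We establish the following relationship

$$
\begin{aligned}
\mathbb{E}_{(x,y)\sim p_{xy}}\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\big[(A'(\bar{h})-y)h\big]
&=\mathbb{E}_{\mu_{\mathcal{B}},\sigma_{\mathcal{B}}}\mathbb{E}_{x\sim p_{x}}\mathbb{E}_{y|x\sim p_{y|x}}\big[(A'(\bar{h})-y)h\big]\\
&=0.
\end{aligned}
$$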
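The vanishing of this expectation rests on a standard property of generalized linear models, stated here as an assumption because the text does not spell it out: if the labels follow the model, the conditional mean of $y$ given $x$ equals $A'(\bar{h})$. Under that assumption, and because $h$ and $\bar{h}$ depend only on $x$, the inner expectation can be sketched as

$$
\mathbb{E}_{y|x\sim p_{y|x}}\big[(A'(\bar{h})-y)h\big]=h\big(A'(\bar{h})-\mathbb{E}_{y|x}[y]\big)=h\big(A'(\bar{h})-A'(\bar{h})\big)=0.
$$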

Since the sample mean converges in probability to the population mean by the Weak Law of Large Numbers, for any tolerance and a sufficiently large number of samples, the probability that the sample average deviates from its population expectation (which is zero here) by more than that tolerance becomes arbitrarily small. This means that $R^f$ is sufficiently small given a moderately large number of data points.

On the other hand, $R^q$ can be derived as

 Rq =1PP∑j=1EμB,σB[A′′(¯hj)2(^hj−¯hj)2] =1PP∑j=1A′′(¯hj)2EμB,σB[(γwTxj−μBσB+β−γwTxj−μPσP+β)2] =1PP∑j=1A′′(¯hj)2EμB,σB[(γwTxj)2(1σB−1σP)2−2γμPwTxj(1σB−1σP)2+(μBσB−μPσP)2] ≃1PP∑j=1γ2A′′(¯hj)2((wTxj−μP)2EμB,σ