The Deep Neural Network (DNN) community can be roughly split into two groups. One group drives innovation empirically and has delivered important advances in quantization [Lin2016], ResNets [He2016], BatchNorm [ioffe2015batch], and binary Convolutional Neural Networks (CNNs) [hubara2016binarized]. The other group focuses on understanding why these innovations work; see for instance Hanin [Hanin2018] and Yang et al. [yang2019MeanFieldTheoryBatchNorm]. However, due to widespread access to large datasets and GPU-accelerated frameworks [PyTorchNEURIPS2019_9015], experiment-driven research has become the prevailing paradigm, leading to a growing gap between DNN innovations and their theoretical understanding.
This paper focuses on improving our theoretical understanding of BatchNorm, a popular normalization technique used in CNN training. The primary benefit of BatchNorm is that it enables training with higher learning rates and thus higher training speed. It is also critical to the convergence of advanced networks such as ResNet [He2016]. In their original work, Ioffe and Szegedy [ioffe2015batch] theorized that BatchNorm reduces the effect of Internal Covariate Shift (ICS), but later publications have abandoned this idea. Santurkar et al. [santurkar2018does] demonstrated that they could deliberately induce ICS without affecting convergence speed. Zhang et al. [Zhang2019Fixup] hypothesized that BatchNorm helps control exploding gradients and proposed an initialization that overcomes this issue in ResNets without BatchNorm. Balduzzi et al. [Balduzzi2017] showed experimentally that BatchNorm prevents exploding gradients, but did not explain how BatchNorm achieves this. It is also worth noting that Santurkar et al. [santurkar2018does] and Zhang et al. [Zhang2019Fixup] provide conflicting hypotheses on BatchNorm, with experiments supporting both.
Our work leverages insight from the traditional adaptive filter domain. Although modern DNNs are considered a separate field, neural networks (NNs) originated from the signal processing community in the early 1960s [widrow1960adaline]. Back then, NNs were shallower, fully connected (FC), and treated as a special kind of adaptive filter. Like modern DNNs, adaptive filters are trained by minimizing a loss function such as Least Mean Squares (LMS) [widrow1960adaptive] using gradient descent. More significantly, BatchNorm appears similar to Normalized Least Mean Squares (NLMS) [nagumo1967learning], a technique used to increase convergence speed in adaptive filters. Motivated by this similarity, this work:
restructures CNNs to follow the traditional adaptive filter notation (Section 2). We address the handling of CNN nonlinearities and the global cost function over multiple layers. This explicit recasting is necessary to reconcile the differences between CNNs and adaptive filters.
demonstrates that CNNs, similar to adaptive filters, have natural modes, stability bounds and training speeds that are controlled by input autocorrelation matrix eigenvalues. We analyze two variants of BatchNorm to explore its effects on the eigenvalues (Section 3).
proves that the Principle of Minimum Disturbance (PMD) can be applied to CNNs and that under certain conditions, BatchNorm placed before the convolution operation is equivalent to NLMS (Section 4).
2 Casting CNN Features as Adaptive Filters
2.1 Definition of Adaptive Filter Variables
For adaptive filters with vector weights and scalar outputs, we define the following variables at time step $n$ [Haykin:2002]:

$\mathbf{x}(n)$: column vector input

$\mathbf{w}(n)$: column vector weight

$d(n)$: scalar desired response

$\Delta\mathbf{w}(n)$: difference between old and new weights

The goal is to update $\mathbf{w}(n)$ using gradient descent so that the output $y(n) = \mathbf{w}^{T}(n)\mathbf{x}(n)$ eventually matches $d(n)$. The weight update is $\mathbf{w}(n+1) = \mathbf{w}(n) - \mu\nabla J(n)$, where $J(n) = \mathbb{E}[e^{2}(n)]$ with error $e(n) = d(n) - y(n)$, and $\mathbb{E}$ is the expectation. $\nabla J(n)$ is the gradient of $J(n)$ with respect to the weights, and $\mu$ is the learning parameter.
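As a concrete illustration, the update rule above can be sketched as a noiseless system-identification LMS loop in NumPy (the filter length, step size, and signals are arbitrary choices for this example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Unknown system the adaptive filter should identify.
w_true = np.array([0.5, -0.3, 0.2])

# Adaptive filter weights, initialized to zero.
w = np.zeros(3)
mu = 0.05  # learning parameter

for n in range(2000):
    x = rng.standard_normal(3)  # input vector x(n)
    d = w_true @ x              # desired response d(n)
    e = d - w @ x               # estimation error e(n)
    w += mu * e * x             # LMS: stochastic gradient step toward w_true
```

After enough iterations, `w` approximates `w_true`; the single-sample update is the stochastic approximation of the expected gradient step described above.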
In this paper, convolution refers to the CNN's spatial convolutions, in contrast to the convolution through time commonly assumed in signal processing (Figure 1).
2.2 The Convolution Layer as an Adaptive Filter
In CNNs, the weights are a 4D tensor with dimensions input channel $\times$ output channel $\times$ height $\times$ width ($C_{in} \times C_{out} \times h \times w$). Both the input and the output of any given layer are 3D arrays. The input array has dimensions input height $\times$ input width $\times$ input channels ($H_{in} \times W_{in} \times C_{in}$). Similarly, the output array has dimensions output height $\times$ output width $\times$ output channels ($H_{out} \times W_{out} \times C_{out}$).
In our analysis, we restructure the 4D spatial convolution into a matrix-vector multiplication (Figure 2). The sliding window of inputs during the convolution is rearranged as a vector, and inputs slide in/out during each spatial stride. Since operations along the output channels and batches are independent of one another, our analysis is restricted to a single output channel and a single batch without loss of generality.
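The unrolling described here is the familiar im2col construction; the following minimal sketch (our own helper, stride 1, no padding) expresses one output channel's convolution as a matrix-vector product:

```python
import numpy as np

def im2col(x, kh, kw):
    """Unroll conv-window patches of a (C, H, W) input into columns.

    Returns a (C*kh*kw, M) array whose m-th column is the flattened
    patch seen by the filter at output pixel m (stride 1, no padding).
    """
    C, H, W = x.shape
    out_h, out_w = H - kh + 1, W - kw + 1
    cols = np.empty((C * kh * kw, out_h * out_w))
    m = 0
    for i in range(out_h):
        for j in range(out_w):
            cols[:, m] = x[:, i:i + kh, j:j + kw].ravel()
            m += 1
    return cols

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5, 5))   # C x H x W input feature map
w = rng.standard_normal((2, 3, 3))   # one output channel's filter
X = im2col(x, 3, 3)                  # N x M, with N = C*kh*kw
y = w.ravel() @ X                    # M-element output vector
```

Each column of `X` plays the role of an adaptive filter's input vector, and the flattened filter is the shared weight vector.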
Given this restructured form, we adapt the framework from Section 2.1 to describe the CNN layer components. In this section, we use $k$ to index and $n$ for the time step. For a given convolution layer $l$, we have:

$\mathbf{w}_{l}$: weight filter. $w_{l}[k]$ is the weight element indexed by $k$ in $\mathbf{w}_{l}$.

$\mathbf{e}_{l}$: unrolled $M$-element local error vector. $e_{l}[m]$ is the $m$th element in $\mathbf{e}_{l}$.

$\mathbf{d}_{l}$: unrolled $M$-element local desired response vector. $d_{l}[m]$ is the $m$th element in $\mathbf{d}_{l}$.

$\mathbf{y}_{l}$: $M$-element output vector of the convolution. $y_{l}[m]$ is the $m$th element in $\mathbf{y}_{l}$.

$\mathbf{x}_{l,m}$: unrolled $N$-element patch of the spatial input map that is convolved with the filter to produce output pixel $m$. $x_{l,m}[k]$ is the $k$th element in this vector.

For $M$ output pixels, the input patches are unrolled into $\mathbf{X}_{l}$, with dimensions $N \times M$. $\mathbf{x}_{l,m}$ is the $m$th column of $\mathbf{X}_{l}$. We use this array to define the output vector $\mathbf{y}_{l}$. Figure 3 illustrates this arrangement.
The desired output is required for further analysis. In adaptive filters, the desired response is provided externally, so the error can be solved for directly. However, the convolution layer's nonlinearity prevents the derivation of a local desired response as a function of the global loss function. Therefore, we limit the analysis to the operations sandwiched between any two nonlinearities, enabling us to derive the desired response from the local error provided by backpropagation.
The CNN global cost function, $\mathcal{L}$, defined at the network output, requires further approximations. This is because $\mathcal{L}(\Theta)$, where $\Theta$ is the set of all layer parameters, is defined over the entire network architecture and thus has a component for every weight element of every layer. This makes further analysis intractable. To solve this issue, we approximate $\mathcal{L}$ as a function of only the layer input $\mathbf{X}_{l}$, the layer weights $\mathbf{w}_{l}$ for layer $l$, and the downstream weights $\mathbf{w}_{m}$, where $m > l$. At time step $n$, we fix all downstream weights as constants, since their update does not affect the weight update at layer $l$. Therefore, the only variables in $\mathcal{L}$ are due to $\mathbf{X}_{l}$ and $\mathbf{w}_{l}$. With this constraint, $\mathcal{L}(\Theta) \approx \mathcal{L}(\mathbf{X}_{l}, \mathbf{w}_{l})$.
The local gradient $\nabla_{l}$ is calculated using the chain rule:

$$\nabla_{l}[k] = \mathbb{E}_{B}\!\left[\frac{\partial \mathcal{L}}{\partial w_{l}[k]}\right] = -\,\mathbb{E}_{B}\!\left[\mathbf{x}_{l}^{(k)} \cdot \mathbf{e}_{l}\right],$$

where $\mathbb{E}_{B}$ is the expectation over the batch and $\mathbf{x}_{l}^{(k)}$ is the $k$th row of $\mathbf{X}_{l}$.
3 Layer Dynamics and Normalization Effects
3.1 Natural Modes
Here, we show that CNN layers have natural modes similar to adaptive filters, as described in [Haykin:2002][widrow1971adaptive]. To aid this analysis, we define the following expressions: $\mathbf{R} = \mathbb{E}[\mathbf{X}\mathbf{X}^{H}]$ is the input autocorrelation matrix, $\mathbf{p} = \mathbb{E}[\mathbf{X}\mathbf{d}]$ is the cross-correlation vector between the input and the desired response, and $\mathbf{e}$ is the unrolled local error vector (see Section 2.2).
The principle of orthogonality applied to the CNN layer states that, at a special operating condition, the estimation error vector $\mathbf{e}$ is orthogonal to the input $\mathbf{x}$. Extended to all the rows of $\mathbf{X}$, $\mathbf{e}$ is orthogonal to every row of $\mathbf{X}$. This condition is met when the weights for layer $l$ are at their optimum value, $\mathbf{w}_{opt}$. At this point, $\mathbb{E}[\mathbf{X}\mathbf{e}] = \mathbf{0}$. Applying this condition to the error from Section 2.2, $\mathbf{e} = \mathbf{d} - \mathbf{X}^{H}\mathbf{w}$: $\mathbb{E}[\mathbf{X}(\mathbf{d} - \mathbf{X}^{H}\mathbf{w}_{opt})] = \mathbf{0}$. Then: $\mathbb{E}[\mathbf{X}\mathbf{d}] = \mathbb{E}[\mathbf{X}\mathbf{X}^{H}]\,\mathbf{w}_{opt}$.

At the optimal operating condition, the gradient is zero, the error vector is at its minimum, denoted $\mathbf{e}_{min}$, and the weights are $\mathbf{w}_{opt}$. That is, $\nabla = \mathbf{0}$, giving the Wiener-Hopf equations for the CNN layer:

$$\mathbf{R}\,\mathbf{w}_{opt} = \mathbf{p} \tag{2}$$

After reintroducing the time step and applying (2) to the weight update equation, we find:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \mu\left(\mathbf{p} - \mathbf{R}\,\mathbf{w}(n)\right) \tag{5}$$
Here, $\mathbf{R}$ is a Hermitian matrix and has the eigendecomposition $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{H}$, where the columns of $\mathbf{Q}$ are the eigenvectors of $\mathbf{R}$, $(\cdot)^{H}$ is the Hermitian transpose, and $\boldsymbol{\Lambda}$ is the diagonal matrix containing the eigenvalues of $\mathbf{R}$. Let $\mathbf{v}(n) = \mathbf{Q}^{H}\left(\mathbf{w}(n) - \mathbf{w}_{opt}\right)$. Substituting $\mathbf{R} = \mathbf{Q}\boldsymbol{\Lambda}\mathbf{Q}^{H}$ and $\mathbf{p} = \mathbf{R}\,\mathbf{w}_{opt}$ into (5) gives a transformed set of weight update equations: $\mathbf{v}(n+1) = (\mathbf{I} - \mu\boldsymbol{\Lambda})\,\mathbf{v}(n)$. Let $v_{k}(n)$ be a single element in $\mathbf{v}(n)$, indexed by $k$, such that $v_{k}(n+1) = (1 - \mu\lambda_{k})\,v_{k}(n)$ for eigenvalues $\lambda_{k}$, and assume an initial starting point $v_{k}(0)$. We then have:

$$v_{k}(n) = (1 - \mu\lambda_{k})^{n}\,v_{k}(0) \tag{6}$$
The stability of the weight update rests on the stability of (6). It is stable when $|1 - \mu\lambda_{k}| < 1$ for every $k$, i.e. $0 < \mu < 2/\lambda_{max}$, so the largest eigenvalue sets the tightest bound. Assuming the system is stable, each entry in the weight error vector decays via an exponential with time constant $\tau_{k} \approx 1/(\mu\lambda_{k})$. The larger the time constant, the longer the mode takes to decay, with the upper bound set by the smallest eigenvalue. This suggests that training converges faster using techniques that boost the smallest eigenvalues. On the other hand, the stability condition implies that any CNN architecture that uses a technique to suppress the largest eigenvalues remains stable at comparatively higher learning rates.
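These modes are easy to check numerically. The sketch below builds an autocorrelation matrix with chosen eigenvalues (the values are arbitrary, for illustration only) and iterates the steepest-descent weight update: it converges for a step size inside the $2/\lambda_{max}$ bound and diverges just outside it.

```python
import numpy as np

rng = np.random.default_rng(0)

# Construct R with known eigenvalues so the condition number is controlled.
lam = np.array([0.5, 1.0, 2.0, 4.0])
Q, _ = np.linalg.qr(rng.standard_normal((4, 4)))
R = Q @ np.diag(lam) @ Q.T            # Hermitian autocorrelation matrix

w_opt = rng.standard_normal(4)
p = R @ w_opt                         # Wiener-Hopf: p = R w_opt

def final_error(mu, steps=200):
    """Iterate w(n+1) = w(n) + mu (p - R w(n)); return |w - w_opt|."""
    w = np.zeros(4)
    for _ in range(steps):
        w = w + mu * (p - R @ w)
    return np.linalg.norm(w - w_opt)

mu_max = 2.0 / lam.max()              # stability bound set by lambda_max
err_stable = final_error(0.5 * mu_max)    # all modes decay
err_unstable = final_error(1.1 * mu_max)  # the lambda_max mode diverges
```

Note that the slowest-decaying mode in the stable run belongs to the smallest eigenvalue, matching the time-constant argument above.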
3.2 Channel Normalization and Eigenvalues
A difference between adaptive filter convolution and CNN convolution lies in the channel mechanics of the CNN layer. In the multi-channel CNN case, strides are local to the individual feature maps. In the unrolled input vector, pixels still rotate through, but are restricted to a region corresponding to their channel. Therefore, $\mathbf{x}$ is a block vector of each channel's unrolled patch vectors. In a 3-channel example, $\mathbf{x}$ is a concatenation of three vectors: $\mathbf{x}_{1}$, $\mathbf{x}_{2}$ and $\mathbf{x}_{3}$, which are the unrolled patches from channels 1, 2 and 3, respectively (Figure 4). As the filters stride over the feature maps, pixels from the input channel indexed by $j$ rotate only through $\mathbf{x}_{j}$.
The autocorrelation matrix becomes a concatenation of block correlation matrices. Along the main diagonal are the per-channel input autocorrelation matrices $\mathbf{R}_{1}$, $\mathbf{R}_{2}$ and $\mathbf{R}_{3}$, where $\mathbf{R}_{j} = \mathbb{E}[\mathbf{x}_{j}\mathbf{x}_{j}^{H}]$. Consider the effect of scaling along the channels of the 3D input feature map, such that channel $j$ is scaled by $c_{j}$ to create a new input vector $\mathbf{x}'$ with $\mathbf{x}'_{j} = c_{j}\mathbf{x}_{j}$. The block autocorrelation matrix indexed by $j$ along the main diagonal is then scaled by $c_{j}^{2}$.
The resulting eigenvalues do not exactly follow a similarly neat scaling effect, but are instead scaled by a mixture of factors from all the channels. However, as channels become increasingly decorrelated, the off-diagonal blocks approach zero, and each eigenvalue becomes more closely tied to a single channel. In the extreme case where the channels are completely decorrelated, blocks of eigenvalues are scaled exactly by their corresponding channel scale factor $c_{j}^{2}$. In other words, channel-wise scaling can control the maximum and minimum eigenvalues, as long as the feature maps between channels are decorrelated and the maximum and minimum eigenvalues correspond most strongly with different channels.
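This prediction is easy to verify for fully decorrelated channels. In the sketch below (two synthetic, independent channels standing in for unrolled patch vectors), scaling one channel by $c$ scales its block of sample-autocorrelation eigenvalues by roughly $c^{2}$ while leaving the other channel's block untouched:

```python
import numpy as np

rng = np.random.default_rng(0)

# Two decorrelated "channels": independent unrolled patch vectors.
n_samples, patch = 100_000, 3
x1 = rng.standard_normal((n_samples, patch))
x2 = rng.standard_normal((n_samples, patch))

def autocorr(a, b):
    x = np.hstack([a, b])        # block input vector per sample
    return x.T @ x / len(x)      # sample estimate of R = E[x x^T]

c = 3.0
R = autocorr(x1, x2)
R_scaled = autocorr(x1, c * x2)  # scale channel 2 by c

ev = np.sort(np.linalg.eigvalsh(R))              # all near 1
ev_scaled = np.sort(np.linalg.eigvalsh(R_scaled))
# Channel 1's block stays near 1; channel 2's block is scaled by ~c**2.
```

With unit-variance channels, `ev_scaled` splits cleanly into a block near 1 and a block near `c**2`, as the decorrelated-channel argument predicts.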
This analysis implies that BatchNorm and other channel-wise normalization techniques have two separate effects. For channels that have small power levels, normalization amplifies the associated eigenvalues. For channels that have large power levels, normalization suppresses those eigenvalues.
To test these ideas, we compare BatchNorm with two variants: BN_Amplify and BN_Suppress. BN_Amplify applies normalization only to channels below a power threshold of 1.0, amplifying their eigenvalues. BN_Suppress applies normalization only to channels above a power threshold of 1.0, suppressing their eigenvalues (Figure 4(c)). The networks are based on LeNet [le1989handwritten] with one FC layer removed (to reduce memory requirements for saving activations from multiple training steps). We compare four networks:
Baseline: Described in Table 1
BatchNorm: BatchNorm layer after each conv layer
BN_Amplify: BN_Amplify layer after each conv layer
BN_Suppress: BN_Suppress layer after each conv layer
All networks are trained with 5 seeds, starting with the same set of random weights, on the MNIST dataset [lecun1998gradientMNIST]. The learning rates are swept to observe how convergence speed changes. BatchNorm parameters, convolution biases and FC layers are trained at a fixed learning rate of 0.1 without dropout. The training algorithm uses stochastic gradient descent and cross-entropy loss.
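A minimal sketch of how such thresholded variants could be implemented (the function name and the omission of BatchNorm's learnable scale and shift are our simplifications; the channel-selection logic follows the amplification/suppression roles described above):

```python
import numpy as np

def thresholded_bn(x, mode, threshold=1.0, eps=1e-5):
    """BatchNorm restricted by channel power; x has shape (B, C, H, W).

    mode="amplify":  normalize only low-power channels (power < threshold),
                     which boosts the eigenvalues those channels contribute.
    mode="suppress": normalize only high-power channels (power > threshold),
                     which suppresses their eigenvalues.
    """
    mean = x.mean(axis=(0, 2, 3), keepdims=True)
    var = x.var(axis=(0, 2, 3), keepdims=True)
    power = (x * x).mean(axis=(0, 2, 3), keepdims=True)
    normed = (x - mean) / np.sqrt(var + eps)
    pick = power < threshold if mode == "amplify" else power > threshold
    return np.where(pick, normed, x)  # normalize picked channels only
```

Applying both modes unconditionally to every channel would recover standard (affine-free) BatchNorm.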
3.3.1 Training Convergence Speed Experiments
Figure 5 plots the training curves after 20 epochs. To provide insight early in training, the minimum eigenvalues are plotted after 5 steps (Figure 6). At low learning rates, BatchNorm has the same amplification effect on the smallest eigenvalues as BN_Amplify, resulting in improved speed. As the learning rate increases, there is less need to amplify the smaller eigenvalues, and the training curves for Baseline and BN_Suppress eventually catch up to BatchNorm's performance.
Training curves at different learning rates. Color bands denote 2nd to 3rd quartile spread of validation error from 5 seeds.
3.3.2 Stability Experiments
We plot the validation error after 20 epochs. Figure 7(a) shows that BatchNorm and BN_Suppress remain stable at high learning rates. The maximum eigenvalues are plotted after 5 training steps (Figure 7(b)). Early in training, BN_Suppress suppresses the largest eigenvalue, allowing the network to remain stable at high learning rates. At high learning rates, BatchNorm has a similar effect, allowing the network to remain stable and match BN_Suppress’ performance. In contrast, BN_Amplify and Baseline cannot suppress the largest eigenvalues and therefore become unstable at high learning rates.
4 Insights from PMD
In this section, we apply the PMD optimization problem to CNNs to draw connections between NLMS and BatchNorm. Unless otherwise noted, we drop references to the layer index $l$. To establish the required background, consider the NLMS update equation, in which the input is normalized by its power:

$$\mathbf{w}(n+1) = \mathbf{w}(n) + \frac{\mu}{\|\mathbf{x}(n)\|^{2}}\,e(n)\,\mathbf{x}(n)$$

NLMS is derived from an optimization problem based on PMD [Haykin:2002], and the division by the power is an artifact of the problem setup. In Section 4.1, we show that under certain assumptions, BatchNorm's division by the standard deviation is equivalent.
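The NLMS recursion can be sketched in a few lines of NumPy. With inputs whose power fluctuates by two orders of magnitude between steps, the normalized update still converges with a fixed $\mu$ (all signals here are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
w_true = np.array([0.5, -0.3, 0.2])  # unknown system to identify
w = np.zeros(3)
mu, eps = 0.5, 1e-8                  # eps guards against zero-power inputs

for n in range(500):
    # Input power fluctuates strongly from step to step.
    scale = 10.0 ** rng.uniform(-1, 1)
    x = scale * rng.standard_normal(3)
    e = w_true @ x - w @ x
    # NLMS: the step is normalized by the instantaneous input power,
    # making the update size invariant to the input scale.
    w += mu * e * x / (x @ x + eps)
```

Plain LMS with a fixed step size would have to be tuned for the worst-case input power; the normalization removes that dependence, which is exactly the property later attributed to BatchNorm placed before the convolution.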
4.1 PMD in CNNs and the Connection to NLMS
Starting at time step $n$ with a single pixel in an output feature map, the output $y(n)$ is scalar and the weight update can be expressed as in Section 2.1. However, applying the PMD requires the desired output $d(n)$. For an adaptive filter, $d(n)$ is provided externally. In CNNs, internal layers do not have a local desired response, but instead have the local error $e(n)$. Therefore, we assume the following relationship to derive the desired response: $d(n) = y(n) + e(n)$. The NLMS derivation can now be applied.
Recall that for every output pixel $m$, where $0 \leq m < M$, there is an unrolled input patch $\mathbf{x}_{m}$ of size $N$. However, there is a distinct set of weights for the block row of weights corresponding to each input channel. Using the notation developed in Section 2.2 for block rows in $\mathbf{X}$, we can use $k$ to index into this patch and the associated weight element for input channel $j$:
We extend the PMD optimization to an array of output pixels. Any given single weight value is used in the calculation of $M$ output pixels. Therefore, in a single time step, each weight receives updates from $M$ sources. Because the weight update contributions from each output pixel are summed together to create the total weight update, these updates can be treated as independent of each other. Instead of a single optimization with $M$ constraints, there are $M$ separate optimization problems. Therefore, to extend (8) over an output pixel array, we sum over all constraint equations and introduce the learning parameter $\mu$, which yields:
Explicit normalization by the input power in (9) can be removed if the input is normalized to zero mean and unit variance. Assume that within each single $H \times W$ feature-map patch the pixels have the same variance. Then, the variance for channel $j$ is the same for all patches and pixels, and is denoted $\sigma_{j}^{2}$. Assume that the input along channel $j$ is zero mean, $\mu_{j} = 0$. Then, $\sigma_{j}^{2}$ can replace the explicit power normalization:
Here, the variance is estimated over a single input feature map, which can be as small as $7 \times 7$. If weight updates are calculated across input batches, the channel variances are estimated from a sample of size $B \times H \times W$ and become more accurate.
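The substitution can be checked numerically: for zero-mean, unit-variance inputs, the instantaneous power $\|\mathbf{x}\|^{2}$ concentrates around the patch size $N$, so normalizing by an estimated channel variance approximates the explicit per-patch power division (a small sketch with synthetic patches):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 49                                      # e.g. an unrolled 7x7 patch
patches = rng.standard_normal((10_000, N))  # zero mean, unit variance

power = (patches ** 2).sum(axis=1)          # instantaneous power ||x||^2
# For normalized inputs, ||x||^2 clusters tightly around N, so dividing
# by a variance estimate is nearly equivalent to per-patch power division.
mean_power = power.mean()
rel_spread = power.std() / mean_power       # ~sqrt(2/N) for Gaussian pixels
```

The relative spread shrinks as $N$ grows, which is why the approximation is already reasonable for small patches and improves when variances are pooled over a batch.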
4.2 BatchNorm and NLMS
Now we analyze how BatchNorm deviates from the analysis in Section 4.1. Starting with the base arrangement of a convolution layer followed by a ReLU nonlinearity (Figure 8(a)), there are two possible placements of BatchNorm. While the common convention is to place BatchNorm after the convolution layer (Figure 8(b)), in [Mishkin2016], the authors found that networks performed slightly better when the BatchNorm layer was placed before the convolution layer (Figure 8(d)). Figure 8(c) illustrates how NLMS operates on the weight update during the backward pass. We do not expect the configuration in Figure 8(b) to match NLMS’ performance because it is normalizing the outputs instead of the inputs.
Assuming BN_Prior (Figure 8(d)), the key deviation from the normalization requirements of PMD is the pair of BatchNorm channel-wise scale and shift parameters, $\gamma$ and $\beta$ respectively. These parameters change the channel-wise mean from 0 to $\beta$ and the variance from 1 to $\gamma^{2}$. To follow the PMD exactly, the common CNN update equation needs the following change:
where $\gamma$ and $\beta$ are usually initialized to 1 and 0, respectively. Therefore, the CNN follows the PMD exactly in the first training step, the point in training when NLMS is needed most. During the first training steps, when the weights have not settled, the input power still fluctuates strongly. Even though $\gamma$ and $\beta$ allow deviations from the weight update size mandated by the PMD, as long as they stay close to their initial values, we expect BN_Prior to come close to satisfying the PMD.
4.3 Experiments with NLMS
In this section, we quantify the effect of NLMS on gradient noise, and compare NLMS to BatchNorm before and after the convolution operation. The network variants (Figure 9) are:
Baseline: Network described in Table 1.
BatchNorm: BatchNorm layer after the convolution.
NLMS_L1: NLMS, with L1 norm.
NLMS_L2: NLMS, with L2 norm.
BN_Prior: BatchNorm layer before the convolution.
To isolate the effects of NLMS on the convolution layers, and to prevent the learning power of the FC layers from compensating for the noise in the weight updates, we leave the FC layer untrained and only apply BatchNorm and NLMS to the second convolution layer. This results in a higher validation error, dropping the accuracy of the baseline from 99% to 95%. This is acceptable for our analysis because we are only interested in the variance in the validation error due to noise injection. To measure the effect of NLMS in a controlled manner, we inject noise into the local error (before the weight gradient calculation). The noise is drawn from a Gaussian and scaled according to the local error variance. We train with stochastic gradient descent, no dropout, and no weight decay or momentum over 40 epochs (Figure 10).
Figure 10 shows that NLMS, which satisfies the PMD exactly, has the least noise amplification, resulting in the smoothest curves in the presence of noise. NLMS_L2 has the smallest band and shows the greatest resilience to noise. BN_Prior, which comes close to satisfying the PMD, performs similarly to NLMS_L2 and shows the second-lowest sensitivity to noise. The other networks, including both BatchNorm and Baseline, are very sensitive to injected noise. As a result of leaving the FC layer untrained, the variant with BatchNorm placed after the convolution layer settles at a higher error.
5 Related Works
Our work is similar to others that focus on the Hessian matrix, because the input autocorrelation matrix is the expectation of the local Hessian of the layer weights. Zhang et al. [Zhang2018] study the relationship between the local Hessian and backpropagation, and similarly propose that BatchNorm applied to an FC layer controls the spectrum of the local Hessian eigenvalues, which can lead to better training speed. In this work, we study BatchNorm applied to convolution layers, and separate the amplification and suppression effects of BatchNorm to demonstrate that it is the amplification of the smallest eigenvalues that leads to the increase in training speed. LeCun et al. [Lecun1993][LeCun2012] derive the Hessian and draw conclusions similar to ours, extending them to Hessian-free techniques for determining an adaptive learning rate. In this work, we derive the relationship between BatchNorm and the eigenvalues of the input autocorrelation matrix by applying adaptive filter ideas (the principle of orthogonality and the Wiener-Hopf equations).
6 Conclusion

We used tools from adaptive filter theory to study the inner workings of BatchNorm. We showed that convolution layers have natural modes that impose bounds on stability and convergence speed as functions of the input autocorrelation eigenvalues. We demonstrated that BatchNorm has two separate effects on the eigenvalues, and that these lead to its commonly associated stability and convergence speed benefits. At lower learning rates, BatchNorm amplifies the smallest eigenvalues, leading to higher convergence speed. We separately showed that BatchNorm suppresses the largest eigenvalues, which increases the largest learning rate at which the network can stably train.
Although BatchNorm and NLMS bear some similarity, only BatchNorm placed before the convolution operation allows the weight update algorithm to meet the PMD condition. We use injected noise to show that BatchNorm layers placed before the convolution have the same effect on training as NLMS. This similarity is not observed when BatchNorm is placed after the convolution operation.