Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

12/10/2018 ∙ by Sanjeev Arora, et al. ∙ Princeton University, Tsinghua University

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help optimization as well as generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here theoretical support is provided for one of its conjectured properties, namely, the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if we fix the learning rate of the scale-invariant parameters (e.g., the weights of each layer with BN) to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of $T^{-1/2}$ in $T$ iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate $T^{-1/4}$ is also shown for stochastic gradient descent.

1 Introduction

Batch Normalization (abbreviated as BatchNorm or BN) (Ioffe & Szegedy, 2015) is one of the most important innovations in deep learning, widely used in modern neural network architectures such as ResNet (He et al., 2016), Inception (Szegedy et al., 2017), and DenseNet (Huang et al., 2017). It also inspired a series of other normalization methods (Ulyanov et al., 2016; Ba et al., 2016; Ioffe, 2017; Wu & He, 2018).

BatchNorm standardizes the output of each layer to have zero mean and unit variance within a mini-batch. For a single neuron, if $z_1, \dots, z_B$ are its original outputs in a mini-batch, then adding a BatchNorm layer modifies the outputs to

(1)   $\mathrm{BN}(z)_i = \gamma \, \frac{z_i - \mu}{\sigma} + \beta,$

where $\mu = \frac{1}{B}\sum_{i=1}^B z_i$ and $\sigma^2 = \frac{1}{B}\sum_{i=1}^B (z_i - \mu)^2$ are the mean and variance within the mini-batch, and $\gamma$ and $\beta$ are two learnable parameters. BN appears to stabilize and speed up training, and improve generalization. The inventors suggested (Ioffe & Szegedy, 2015) that these benefits derive from the following:

  1. By stabilizing layer outputs it reduces a phenomenon called Internal Covariate Shift, whereby the training of a higher layer is continuously undermined or undone by changes in the distribution of its inputs due to parameter changes in previous layers;

  2. By making the network output invariant to scaling of the weights, it reduces the dependence of training on the scale of parameters and enables the use of a higher learning rate;

  3. By implicitly regularizing the model it improves generalization.

But these three benefits are not fully understood in theory. Understanding generalization for deep models remains an open problem (with or without BN). Furthermore, in a demonstration that intuition can sometimes mislead, recent experimental results suggest that BN does not reduce internal covariate shift (Santurkar et al., 2018), and the authors of that study suggest that the true explanation for BN's effectiveness may lie in a smoothening effect (i.e., lowering of the Hessian norm) on the objective. Another recent paper (Kohler et al., 2018) tries to quantify the benefits of BN for simple machine learning problems such as regression but does not analyze deep models.

Provable quantification of Effect 2 (learning rates).

Our study quantifies the effect of BN on learning rates. Ioffe & Szegedy (2015) observed that without BatchNorm, a large learning rate leads to a rapid growth of the parameter scale. Introducing BatchNorm usually stabilizes the growth of weights and appears to implicitly tune the learning rate so that the effective learning rate adapts during the course of the algorithm. They explained this intuitively as follows. After BN, the output of a neuron is unaffected when its weight vector $w$ is scaled, i.e., for any scalar $c > 0$,

$\mathrm{BN}(c\, w^\top x) = \mathrm{BN}(w^\top x).$

Taking derivatives, one finds that the gradient at $c\,w$ equals the gradient at $w$ multiplied by a factor $1/c$. Thus, even though the scale of the weights of a linear layer preceding a BatchNorm no longer affects the function represented by the neural network, the growth of this scale has the effect of reducing the learning rate.
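
The following toy numpy check (our own illustration, not code from the paper; the function and variable names are made up) verifies both claims numerically for a single batch-normalized neuron as in equation (1): scaling the weight vector $w$ by $c > 0$ leaves the BN outputs unchanged, and the gradient of a downstream loss with respect to $w$ shrinks by the factor $1/c$.

    import numpy as np

    def bn_neuron(w, X, gamma=1.5, beta=0.2):
        # Equation (1) applied to the pre-activations of one neuron over a mini-batch.
        z = X @ w
        mu, var = z.mean(), z.var()
        return gamma * (z - mu) / np.sqrt(var) + beta

    def loss(w, X, y):
        return 0.5 * np.mean((bn_neuron(w, X) - y) ** 2)

    def num_grad(f, w, h=1e-6):
        # Central finite differences, accurate enough for this sanity check.
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = h
            g[i] = (f(w + e) - f(w - e)) / (2 * h)
        return g

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(32, 5)), rng.normal(size=32)
    w, c = rng.normal(size=5), 3.0

    print(np.allclose(bn_neuron(w, X), bn_neuron(c * w, X)))     # True: output is scale-invariant
    g1 = num_grad(lambda v: loss(v, X, y), w)
    g2 = num_grad(lambda v: loss(v, X, y), c * w)
    print(np.allclose(g2, g1 / c, atol=1e-4))                    # True: gradient scales as 1/c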

Our paper considers the following question: Can we rigorously capture the above intuitive behavior?

Theoretical analyses of the speed of gradient descent algorithms in nonconvex settings study the number of iterations required for convergence to a stationary point (i.e., where the gradient vanishes). But they need to assume that the learning rate has been set (magically) to a small enough number determined by the smoothness constant of the loss function, which in practice is of course unknown. With this tuned learning rate, the norm of the gradient reduces asymptotically as $T^{-1/2}$ in $T$ iterations. In the case of stochastic gradient descent, the reduction is like $T^{-1/4}$. Thus a potential way to quantify the rate-tuning behavior of BN would be to show that even when the learning rate is fixed to a suitable constant, say 0.3, from the start, after introducing BN the convergence to a stationary point is asymptotically just as fast (essentially) as it would be with a hand-tuned learning rate required by earlier analyses. The current paper rigorously establishes such auto-tuning behavior of BN (see below for an important clarification about scale-invariance).

We note that a recent paper (Wu et al., 2018) introduced a new algorithm WNGrad that is motivated by BN and provably has the above auto-tuning behavior as well. That paper did not establish such behavior for BN itself, but it was a clear inspiration for our analysis of BN.

Scale-invariant and scale-variant parameters.

The intuition of Ioffe & Szegedy (2015) applies to all scale-invariant parameters, but the actual algorithm also involves other parameters, such as $\gamma$ and $\beta$, whose scale does matter. Our analysis partitions the parameters of the neural network into two groups, $W$ (scale-invariant) and $g$ (scale-variant). The first group, $W$, consists of all parameters whose scale does not affect the loss, i.e., scaling $w$ to $c\,w$ for any $c > 0$ does not change the loss (see Definition 2.1 for a formal definition); the second group, $g$, consists of all other parameters, which are not scale-invariant. In a feedforward neural network with BN added at each layer, the layer weights are all scale-invariant. This is also true for BN with $\ell_p$ normalization strategies (Santurkar et al., 2018; Hoffer et al., 2018) and for other normalization layers, such as Weight Normalization (Salimans & Kingma, 2016), Layer Normalization (Ba et al., 2016), and Group Normalization (Wu & He, 2018) (see Table 1 in Ba et al. (2016) for a summary).

1.1 Our contributions

In this paper, we show that the scale-invariant parameters do not require learning rate tuning for lowering the training loss. To illustrate this, we consider the case in which we set learning rates separately for the scale-invariant parameters $W$ and the scale-variant parameters $g$. Under some assumptions on the smoothness of the loss and the boundedness of the noise, we show that

  1. In full-batch gradient descent, if the learning rate for $g$ is set optimally, then no matter how the learning rate for $W$ is set, the training procedure converges to a first-order stationary point at the rate $T^{-1/2}$, which asymptotically matches the convergence rate of gradient descent with the optimal choice of learning rates for all parameters (Theorem 3.1);

  2. In stochastic gradient descent, if the learning rate for $g$ is set optimally, then no matter how the learning rate for $W$ is set, the training procedure converges to a first-order stationary point at the rate $T^{-1/4}$, which asymptotically matches the convergence rate of stochastic gradient descent with the optimal choice of learning rates for all parameters (up to a polylogarithmic factor) (Theorem 4.2).

In the usual case in which we set a unified learning rate for all parameters, our results imply that we only need to set a learning rate that is suitable for $g$. This means that introducing scale-invariance into neural networks potentially reduces the effort of tuning learning rates, since there are fewer parameters we need to worry about in order to guarantee asymptotically fastest convergence.

In our study, the loss function is assumed to be smooth. However, BN introduces non-smoothness in extreme cases, due to division by zero when the input variance is zero (see equation 1). Note that the implementation of BN suggested by Ioffe & Szegedy (2015) uses a smoothening constant in the whitening step, but this does not preserve scale-invariance. In order to avoid this issue, we describe a simple modification of the smoothening that maintains scale-invariance. Also, our result cannot be applied to neural networks with ReLU, but it is applicable to its smooth approximation softplus (Dugas et al., 2001).

We include some experiments in Appendix D, showing that it is indeed the auto-tuning behavior analyzed in this paper that enables BN to converge with an arbitrary learning rate for the scale-invariant parameters. On the generalization side, a tuned learning rate is still needed for the best test accuracy, and our experiments show that the auto-tuning behavior of BN also leads to a wider range of learning rates that give good generalization.

1.2 Related works

Previous work on understanding Batch Normalization. Only a few recent works have tried to theoretically understand BatchNorm. Santurkar et al. (2018) was described earlier. Kohler et al. (2018) aims to find theoretical settings in which training neural networks with BatchNorm is faster than without BatchNorm. In particular, the authors analyzed three types of shallow neural networks, but rather than considering gradient descent, they designed task-specific training methods when discussing neural networks with BatchNorm. Bjorck et al. (2018) observes that the higher learning rates enabled by BatchNorm improve generalization.

Convergence of adaptive algorithms. Our analysis is inspired by the proof for WNGrad (Wu et al., 2018), where the authors analyzed an adaptive algorithm, WNGrad, motivated by Weight Normalization (Salimans & Kingma, 2016). Other works analyzing the convergence of adaptive methods are (Ward et al., 2018; Li & Orabona, 2018; Zou & Shen, 2018; Zhou et al., 2018).

Invariance by Batch Normalization. Cho & Lee (2017) proposed to run Riemannian gradient descent on the Grassmann manifold, since the weight matrix is scale-invariant with respect to the loss function. Hoffer et al. (2018) observed that the effective step size is proportional to $\eta / \|w_t\|_2^2$.

2 General framework

In this section, we introduce our general framework in order to study the benefits of scale-invariance.

2.1 Motivating examples of neural networks

Scale-invariance is common in neural networks with BatchNorm. We formally state the definition of scale-invariance below:

Definition 2.1.

(Scale-invariance) Let $F(\theta)$ be a loss function. We say that $w$ is a scale-invariant parameter of $F$ if $F(w) = F(c\,w)$ for all $c > 0$ (with the other parameters held fixed); if $w$ is not scale-invariant, then we say $w$ is a scale-variant parameter of $F$.

We consider the following $L$-layer “fully-batch-normalized” feedforward network for illustration:

(2)   $F_{\mathcal{B}} = \frac{1}{B} \sum_{i=1}^{B} \ell_{y_i}\Big( \mathrm{BN}\big( W^{(L)} \, \phi\big( \mathrm{BN}( W^{(L-1)} \cdots \phi( \mathrm{BN}( W^{(1)} x_i ) ) \cdots ) \big) \big) \Big),$

where $\mathcal{B} = \{(x_1, y_1), \dots, (x_B, y_B)\}$ is a mini-batch of pairs of input data and ground-truth labels from a data set $\mathcal{D}$, and $\ell_{y}$ is an objective function depending on the label; e.g., $\ell_{y}$ could be a cross-entropy loss in classification tasks. $W^{(1)}, \dots, W^{(L)}$ are the weight matrices of each layer.

$\phi$ is a nonlinear activation function which processes its input elementwise (such as ReLU or sigmoid). Given a batch of inputs $z_1, \dots, z_B$ (the pre-activations of one neuron over the mini-batch), BN outputs a vector whose $i$-th entry is defined as

(3)   $\mathrm{BN}(z)_i = \gamma \, \frac{z_i - \mu}{\sigma} + \beta,$

where $\mu = \frac{1}{B} \sum_{i=1}^{B} z_i$ and $\sigma^2 = \frac{1}{B} \sum_{i=1}^{B} (z_i - \mu)^2$ are the mean and variance of $\{z_i\}$, and $\gamma$ and $\beta$ are two learnable parameters which rescale and offset the normalized outputs to retain the representation power. The neural network is thus parameterized by the weight matrices $W^{(l)}$ of each layer and the learnable parameters $\gamma$ and $\beta$ of each BN.

BN has the property that its output is unchanged when the batch inputs are scaled or shifted simultaneously. For $z$ being the output of a linear layer, it is easy to see that $\mathrm{BN}(z)$ is invariant to the scale of the layer's weights, and thus each row vector of the weight matrices $W^{(1)}, \dots, W^{(L)}$ is a scale-invariant parameter of $F_{\mathcal{B}}$. In convolutional neural networks with BatchNorm, a similar argument can be made. In particular, each filter of a convolutional layer normalized by BN is scale-invariant.

With a general nonlinear activation, the other parameters in $\theta$, namely the scale and shift parameters $\gamma$ and $\beta$ in each BN, are scale-variant. When ReLU or Leaky ReLU (Maas et al., 2013) is used as the activation $\phi$, the $\gamma$ vector of each BN at every layer except the last one is in fact scale-invariant. This can be deduced by using the (positive) homogeneity of these two types of activations and noticing that the output of each internal activation is processed by a BN in the next layer. Nevertheless, we are not able to analyse either ReLU or Leaky ReLU activations because we need the loss to be smooth in our analysis. We can instead analyse smooth activations, such as sigmoid, tanh, softplus (Dugas et al., 2001), etc.
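
To make the partition concrete, here is a toy numpy sketch (our own illustration, not code from the paper) of a two-layer fully-batch-normalized network with a smooth activation (tanh): scaling a weight matrix leaves the loss unchanged, so the weights are scale-invariant, while scaling a BN parameter $\gamma$ generally changes the loss, so $\gamma$ is scale-variant.

    import numpy as np

    def bn(Z, gamma, beta):
        # Batch normalization over the batch dimension (axis 0), as in equation (3).
        mu, var = Z.mean(axis=0, keepdims=True), Z.var(axis=0, keepdims=True)
        return gamma * (Z - mu) / np.sqrt(var) + beta

    def loss(params, X, y):
        W1, g1, b1, W2, g2, b2 = params
        H = np.tanh(bn(X @ W1, g1, b1))          # layer 1: linear -> BN -> smooth activation
        out = bn(H @ W2, g2, b2)                 # layer 2: linear -> BN
        return 0.5 * np.mean((out.squeeze() - y) ** 2)

    rng = np.random.default_rng(0)
    X, y = rng.normal(size=(64, 8)), rng.normal(size=64)
    params = [rng.normal(size=(8, 6)), np.ones(6), np.zeros(6),
              rng.normal(size=(6, 1)), np.ones(1), np.zeros(1)]

    scaled_W = list(params); scaled_W[0] = 10.0 * params[0]    # scale W1 by c = 10
    scaled_g = list(params); scaled_g[1] = 10.0 * params[1]    # scale gamma_1 by c = 10

    print(np.isclose(loss(params, X, y), loss(scaled_W, X, y)))   # True:  W1 is scale-invariant
    print(np.isclose(loss(params, X, y), loss(scaled_g, X, y)))   # False: gamma_1 is scale-variant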

2.2 Framework

Now we introduce our general framework. Consider a neural network parameterized by $\theta$. Let $\mathcal{D}$ be a dataset, where each data point $z \in \mathcal{D}$ is associated with a loss function $F_z(\theta)$ ($\mathcal{D}$ can be the set of all possible mini-batches). We partition the parameters into $\theta = (W, g)$, where $W = \{w_1, \dots, w_m\}$ consists of the parameters that are scale-invariant to all $F_z$, and $g$ contains the remaining parameters. The goal of training the neural network is to minimize the expected loss over the dataset: $F(\theta) = \mathbb{E}_{z \sim \mathcal{D}}[F_z(\theta)]$. In order to illustrate the optimization benefits of scale-invariance, we consider the process of training this neural network by stochastic gradient descent with separate learning rates $\eta_t^{(w)}$ and $\eta_t^{(g)}$ for $W$ and $g$:

(4)   $w_{t+1} = w_t - \eta_t^{(w)} \nabla_{w} F_{z_t}(W_t, g_t) \;\; \text{for each } w \in W, \qquad g_{t+1} = g_t - \eta_t^{(g)} \nabla_{g} F_{z_t}(W_t, g_t),$

where $z_t$ is sampled from $\mathcal{D}$ at step $t$.
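
The update rule in equation (4) amounts to ordinary SGD with two parameter groups. A minimal sketch (our own pseudocode-style illustration; the function names and signatures are not from the paper) is given below.

    from typing import Callable, Dict, Tuple
    import numpy as np

    def sgd_two_groups(
        W: Dict[str, np.ndarray],              # scale-invariant parameters (e.g., BN'd layer weights)
        g: Dict[str, np.ndarray],              # scale-variant parameters (e.g., gamma, beta)
        sample_batch: Callable[[], object],    # draws a mini-batch z_t from D
        grads: Callable[[object, Dict, Dict], Tuple[Dict, Dict]],  # gradients of F_{z_t} w.r.t. W and g
        lr_w: float, lr_g: float, steps: int,
    ) -> Tuple[Dict[str, np.ndarray], Dict[str, np.ndarray]]:
        for t in range(steps):
            z = sample_batch()
            gW, gg = grads(z, W, g)
            for k in W:                        # group whose learning rate needs no tuning (Theorems 3.1, 4.2)
                W[k] = W[k] - lr_w * gW[k]
            for k in g:                        # group whose learning rate is tuned as usual
                g[k] = g[k] - lr_g * gg[k]
        return W, g

Fixed learning rates lr_w and lr_g correspond to the full-batch setting of Section 3; the schedules considered in Section 4 decay them with $t$.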

2.3 The intrinsic optimization problem

Thanks to the scale-invariance properties, the scale of each weight $w \in W$ does not affect loss values. However, the scale does affect the gradients. Let $\tilde{W} = \{\tilde{w}_1, \dots, \tilde{w}_m\}$ be the set of normalized weights, where $\tilde{w} = w / \|w\|_2$. The following simple lemma can be easily shown:

Lemma 2.2 (Implied by Ioffe & Szegedy (2015)).

For any $w \in W$ and any $c > 0$,

(5)   $\nabla_{w} F_z \big|_{c\,w} = \frac{1}{c}\, \nabla_{w} F_z \big|_{w},$

where the gradient is taken with the other parameters held fixed.

To make $\|\nabla_{w} F_z\|_2$ small, one can therefore just scale the weights by a large factor. Thus there are ways to reduce the norm of the gradient that do not reduce the loss.
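
For completeness, here is a short derivation of equation (5) (our own restatement of the standard chain-rule argument, in the notation above): differentiating the scale-invariance identity $F_z(c\,w) = F_z(w)$ with respect to $w$ gives

    % scale-invariance: F_z(c w) = F_z(w) for all c > 0 (other parameters held fixed)
    \nabla_w \big[ F_z(c\,w) \big] = c\,(\nabla_w F_z)(c\,w) = (\nabla_w F_z)(w)
    \;\Longrightarrow\;
    (\nabla_w F_z)(c\,w) = \tfrac{1}{c}\,(\nabla_w F_z)(w).

In particular, taking the base point $\tilde{w} = w/\|w\|_2$ and $c = \|w\|_2$ gives $(\nabla_w F_z)(w) = \frac{1}{\|w\|_2}\,(\nabla_w F_z)(\tilde{w})$, which relates gradients at $w$ to gradients at the normalized weight.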

For this reason, we define the intrinsic optimization problem for training the neural network. Instead of optimizing $W$ and $g$ over all possible solutions, we focus on parameters for which $\|w\|_2 = 1$ for all $w \in W$. This does not change our objective, since the scale of $w$ does not affect the loss.

Definition 2.3 (Intrinsic optimization problem).

Let $\Theta = \{ \theta = (W, g) : \|w\|_2 = 1 \text{ for all } w \in W \}$ be the intrinsic domain. The intrinsic optimization problem is defined as optimizing the original problem in $\Theta$:

(6)   $\min_{\theta \in \Theta} \; F(\theta) = \mathbb{E}_{z \sim \mathcal{D}}\big[ F_z(\theta) \big].$

For $\{\theta_t = (W_t, g_t)\}$ a sequence of points optimizing the original optimization problem, we can define $\tilde{\theta}_t = (\tilde{W}_t, g_t)$, where $\tilde{w}_t = w_t / \|w_t\|_2$ for each $w \in W$, as a sequence of points optimizing the intrinsic optimization problem.

In this paper, we aim to show that training the neural network on the original optimization problem by gradient descent can be seen as training by an adaptive method on the intrinsic optimization problem, and that it converges to a first-order stationary point of the intrinsic optimization problem with no need to tune the learning rates for $W$.

2.4 Assumptions on the loss

We assume $F_z$ is defined and twice continuously differentiable at any $\theta$ such that none of the $w \in W$ is $\mathbf{0}$. Also, we assume that the expected loss $F(\theta)$ is lower bounded by some finite value $F^*$.

Furthermore, for $\theta \in \Theta$, i.e., $\|w\|_2 = 1$ for all $w \in W$, we assume the following bounds on the smoothness: the second-order partial derivatives of $F_z$ with respect to the scale-invariant parameters, with respect to $g$, and the mixed ones are bounded in norm by constants $L_{ww}$, $L_{gg}$, and $L_{wg}$, respectively.

In addition, we assume that the noise on the stochastic gradient of the loss in SGD is upper bounded by $\sigma^2$: $\mathbb{E}_{z}\big[ \| \nabla F_z(\theta) - \nabla F(\theta) \|_2^2 \big] \le \sigma^2$ for all $\theta \in \Theta$.

Smoothed version of the motivating neural networks. Note that the neural network illustrated in Section 2.1 does not meet the smoothness conditions, since the loss function could be non-smooth. We can make some mild modifications to the motivating example to smoothen it (our results for this network are rather conceptual, since the smoothness upper bound can be very large, growing with the number of layers and the maximum width of each layer):

  1. The activation $\phi$ could be non-smooth. A possible solution is to use smooth nonlinearities, e.g., sigmoid, tanh, softplus (Dugas et al., 2001), etc. Note that softplus can be seen as a smooth approximation of the most commonly used activation, ReLU.

  2. The formula of BN shown in equation 3 may suffer from the problem of division by zero. To avoid this, the inventors of BN, Ioffe & Szegedy (2015), add a small smoothening constant $\epsilon > 0$ to the denominator, i.e.,

    (7)   $\mathrm{BN}^{\epsilon}(z)_i = \gamma \, \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon}} + \beta.$

    However, adding a constant $\epsilon$ directly breaks the scale-invariance of the weights feeding into the BN. We can preserve the scale-invariance by making the smoothening term proportional to the second moment of the batch, i.e., replacing $\sigma^2 + \epsilon$ with $\sigma^2 + \epsilon\,\|z\|_2^2 / B$, where $\|z\|_2^2 / B = \mu^2 + \sigma^2$. By simple linear algebra and letting $\epsilon' = \epsilon / (1 + \epsilon)$, this smoothed version of BN can also be written as

    (8)   $\mathrm{BN}^{\epsilon}(z)_i = \frac{\gamma}{\sqrt{1+\epsilon}} \, \frac{z_i - \mu}{\sqrt{\sigma^2 + \epsilon' \mu^2}} + \beta.$

    Since the variance of the inputs is usually large in practice, for small $\epsilon$ the effect of the smoothening term is negligible except in extreme cases. (A numerical sketch of this scale-invariant smoothing follows this list.)
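
The following toy numpy check (our own illustration, not code from the paper; the exact proportionality constant in the smoothening term is an assumption) contrasts the two variants: the constant-$\epsilon$ smoothing of equation (7) changes its output when the batch is rescaled, while the modification above keeps the output exactly scale-invariant.

    import numpy as np

    def bn_eps(z, gamma=1.0, beta=0.0, eps=0.1):
        # Equation (7): constant smoothening term, breaks scale-invariance.
        mu, var = z.mean(), z.var()
        return gamma * (z - mu) / np.sqrt(var + eps) + beta

    def bn_eps_scale_inv(z, gamma=1.0, beta=0.0, eps=0.1):
        # Smoothening term proportional to the batch's second moment, which scales
        # as c^2 when z -> c z, so the whole expression stays scale-invariant.
        mu, var = z.mean(), z.var()
        return gamma * (z - mu) / np.sqrt(var + eps * np.mean(z ** 2)) + beta

    rng = np.random.default_rng(0)
    z, c = rng.normal(size=32), 7.0

    print(np.allclose(bn_eps(z), bn_eps(c * z)))                      # False: invariance is broken
    print(np.allclose(bn_eps_scale_inv(z), bn_eps_scale_inv(c * z)))  # True:  invariance is preserved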

Using the above two modifications, the loss function becomes smooth. However, the scale of the scale-variant parameters may be unbounded during training, which could make the smoothness constants unbounded. To avoid this issue, we can either project the scale-variant parameters to a bounded set, or use weight decay for those parameters (see Appendix C for a proof for the latter solution).

2.5 Key observation: the growth of weights

The following lemma is our key observation. It establishes a connection between scale-invariance and the growth of the weight scale, which further implies an automatic decay of the effective learning rate:

Lemma 2.4.

For any scale-invariant weight $w \in W$ of the network, we have:

  1. $\nabla_{w} F_{z_t}(\theta_t)$ and $w_t$ are always perpendicular;

  2. $\|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \big(\eta_t^{(w)}\big)^2 \, \big\|\nabla_{w} F_{z_t}(\theta_t)\big\|_2^2$.

Proof.

Let $\theta'$ denote all the parameters in the network other than $w$. Taking derivatives with respect to $c$ on both sides of $F_{z_t}(c\,w;\, \theta') = F_{z_t}(w;\, \theta')$, the left hand side equals $\langle \nabla_{w} F_{z_t}(c\,w;\, \theta'),\, w \rangle$ and the right hand side equals $0$, so the first proposition follows by taking $c = 1$. Applying the Pythagorean theorem and Lemma 2.2 to the update $w_{t+1} = w_t - \eta_t^{(w)} \nabla_{w} F_{z_t}(\theta_t)$, the second proposition directly follows. ∎
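
A quick numerical sanity check of Lemma 2.4 (our own illustration, not from the paper) on a toy scale-invariant loss $L(w) = f(w / \|w\|_2)$: the gradient is perpendicular to $w$, so a gradient step grows the squared norm by exactly $\eta^2 \|\nabla L(w)\|_2^2$.

    import numpy as np

    def L(w):
        # Any function of w / ||w|| is scale-invariant in w.
        u = w / np.linalg.norm(w)
        return np.sin(u[0]) + u[1] * u[2]

    def grad(w, h=1e-6):
        # Central finite differences.
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = h
            g[i] = (L(w + e) - L(w - e)) / (2 * h)
        return g

    rng = np.random.default_rng(0)
    w, eta = rng.normal(size=3), 0.3
    g = grad(w)
    w_next = w - eta * g

    print(abs(w @ g) < 1e-6)                                       # gradient is perpendicular to w
    print(np.isclose(w_next @ w_next, w @ w + eta**2 * (g @ g)))   # Pythagorean growth of the norm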

Using Lemma 2.4, we can show that performing gradient descent for the original problem is equivalent to performing an adaptive gradient method for the intrinsic optimization problem:

Theorem 2.5.

Let $\tilde{w}_t = w_t / \|w_t\|_2$ for each $w \in W$. Then for all $t$,

(9)   $\tilde{w}_{t+1} = \Pi\!\left( \tilde{w}_t - \frac{\eta_t^{(w)}}{\|w_t\|_2^2} \, \nabla_{\tilde{w}} F_{z_t}(\tilde{W}_t, g_t) \right), \qquad \|w_{t+1}\|_2^2 = \|w_t\|_2^2 + \big(\eta_t^{(w)}\big)^2 \big\| \nabla_{w} F_{z_t}(\theta_t) \big\|_2^2,$

where $\Pi$ is a projection operator which maps any nonzero vector $v$ to $v / \|v\|_2$.

Remark 2.6.

Wu et al. (2018) noticed that Theorem 2.5 is true for Weight Normalization by direct calculation of gradients. Inspired by this, they proposed a new adaptive method called WNGrad. Our theorem is more general since it holds for any normalization method as long as it induces scale-invariance in the network. The adaptive update rule derived in our theorem can be seen as WNGrad with a projection to the unit sphere after each step.

Proof for Theorem 2.5.

Using Lemma 2.2, we have $\nabla_{w} F_{z_t}(\theta_t) = \frac{1}{\|w_t\|_2} \nabla_{\tilde{w}} F_{z_t}(\tilde{W}_t, g_t)$, so the gradient descent update can be written as

$w_{t+1} = w_t - \eta_t^{(w)} \nabla_{w} F_{z_t}(\theta_t) = \|w_t\|_2 \left( \tilde{w}_t - \frac{\eta_t^{(w)}}{\|w_t\|_2^2} \nabla_{\tilde{w}} F_{z_t}(\tilde{W}_t, g_t) \right),$

which implies the first equation after normalizing both sides. The second equation is by Lemma 2.4. ∎
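
The equivalence in Theorem 2.5 can also be checked numerically. The sketch below (our own illustration on a toy scale-invariant loss, not code from the paper) runs plain gradient descent with a fixed learning rate and, in parallel, the projected adaptive update of equation (9) whose effective learning rate is $\eta / \|w_t\|_2^2$; the normalized GD iterates coincide with the adaptive iterates.

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.normal(size=(4, 4)); A = A + A.T            # symmetric matrix defining a toy loss

    def L(w):                                           # scale-invariant: L(c w) = L(w) for c > 0
        u = w / np.linalg.norm(w)
        return u @ A @ u

    def grad(w, h=1e-6):                                # central finite differences
        g = np.zeros_like(w)
        for i in range(len(w)):
            e = np.zeros_like(w); e[i] = h
            g[i] = (L(w + e) - L(w - e)) / (2 * h)
        return g

    eta, T = 0.3, 40
    w = rng.normal(size=4)                              # GD iterate in the original space
    u, r2 = w / np.linalg.norm(w), np.linalg.norm(w) ** 2  # adaptive iterate on the sphere, ||w_t||^2

    for _ in range(T):
        # plain GD with a fixed learning rate on the scale-invariant parameter
        w = w - eta * grad(w)
        # equivalent adaptive update on the intrinsic problem (equation (9))
        g_u = grad(u)                                   # gradient at the normalized point
        r2_new = r2 + eta**2 * (g_u @ g_u) / r2         # norm growth from Lemma 2.4 and Lemma 2.2
        v = u - (eta / r2) * g_u
        u, r2 = v / np.linalg.norm(v), r2_new

    print(np.allclose(w / np.linalg.norm(w), u, atol=1e-3))  # True: the trajectories coincide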

While popular adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014) adjust the learning rate of each single coordinate, the adaptive gradient method described in Theorem 2.5 sets a learning rate for each scale-invariant parameter separately. In this paper, we call $\eta_t^{(w)} / \|w_t\|_2^2$ the effective learning rate of $w$ (or of $\tilde{w}$), because it is this ratio, rather than $\eta_t^{(w)}$ alone, that determines the trajectory of gradient descent given the normalized scale-invariant parameter $\tilde{w}_t$. In other words, the magnitude of the initialization of parameters before BN is as important as their learning rates: multiplying the initialization of a scale-invariant parameter by a constant $c$ is equivalent to dividing its learning rate by $c^2$. Thus we suggest that researchers report the initialization as well as the learning rates in future experiments.
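
The equivalence between scaling the initialization and rescaling the learning rate follows from Lemma 2.2 by a one-line induction (our own restatement, in the notation of equation (4)): starting from $w'_0 = c\, w_0$ with learning rate $c^2 \eta_t^{(w)}$ reproduces $w'_t = c\, w_t$, hence the same normalized trajectory $\tilde{w}'_t = \tilde{w}_t$.

    % induction step, assuming w'_t = c w_t:
    w'_{t+1} = w'_t - c^2 \eta_t^{(w)} \nabla_w F_{z_t}(w'_t)
             = c\, w_t - c^2 \eta_t^{(w)} \cdot \tfrac{1}{c}\, \nabla_w F_{z_t}(w_t)   % by Lemma 2.2
             = c \big( w_t - \eta_t^{(w)} \nabla_w F_{z_t}(w_t) \big) = c\, w_{t+1}.

Consequently, multiplying the initialization by $c$ while keeping the learning rate fixed has the same effect on $\tilde{w}_t$ as dividing the learning rate by $c^2$. (The updates of the scale-variant parameters are unaffected, since $F_{z_t}$ depends on $w$ only through $\tilde{w}$.)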

3 Training by full-batch gradient descent

In this section, we rigorously analyze the effect of scale-invariance when training a neural network by full-batch gradient descent. We use the framework introduced in Section 2.2 and the assumptions from Section 2.4. We focus on full-batch training, i.e., $z_t$ is always the whole training set, so that $F_{z_t} = F$.

3.1 Settings and main theorem

Assumptions on learning rates. We consider the case in which we use fixed learning rates for both $W$ and $g$, i.e., $\eta_t^{(w)} = \eta^{(w)}$ and $\eta_t^{(g)} = \eta^{(g)}$ for all $t$. We assume that $\eta^{(g)}$ is tuned carefully to the order of the inverse smoothness constant, as required by standard analyses of gradient descent, up to a constant factor. For $\eta^{(w)}$, we do not make any assumption, i.e., it can be set to any positive value.

Theorem 3.1.

Consider the process of training by gradient descent with $\eta^{(g)}$ tuned as above and arbitrary $\eta^{(w)} > 0$. Then $\tilde{\theta}_t$ converges to a stationary point at the rate of

(10)   $\min_{t \le T} \big\| \nabla F(\tilde{\theta}_t) \big\|_2 = \tilde{O}\big( T^{-1/2} \big),$

where $\tilde{O}(\cdot)$ suppresses polynomial factors in the smoothness constants, the initial suboptimality $F(\theta_0) - F^*$, $\eta^{(w)}$, $1/\eta^{(w)}$, and $\|w_0\|_2$, $1/\|w_0\|_2$ for all $w \in W$, all of which we regard as constants.

This matches the asymptotic convergence rate of GD by Carmon et al. (2018).

3.2 Proof sketch

The high-level idea is to use the decrease of the loss function to upper bound the sum of the squared norms of the gradients. Note that the gradient splits into the part with respect to the (normalized) scale-invariant parameters and the part with respect to $g$. For the first part, Lemma 2.4 gives

(11)   $\sum_{t=0}^{T-1} \big(\eta^{(w)}\big)^2 \big\| \nabla_{w} F(\theta_t) \big\|_2^2 = \|w_T\|_2^2 - \|w_0\|_2^2 .$

Thus the core of the proof is to show that the monotonically increasing quantity $\|w_t\|_2^2$ has an upper bound for all $t$. It is shown that for every $w \in W$, the whole training process can be divided into at most two phases. In the first phase, the effective learning rate $\eta^{(w)} / \|w_t\|_2^2$ is larger than some threshold (defined in Lemma 3.2), and in the second phase it is smaller.

Lemma 3.2 (Taylor Expansion).

Let . Then

(12)

If $\eta^{(w)}$ is large enough that the process enters the second phase, then by Lemma 3.2 the loss function decreases in each step by an amount proportional to the squared gradient norm, which is in turn proportional to the increase of $\|w_t\|_2^2$ (recall that $\|w_{t+1}\|_2^2 - \|w_t\|_2^2 = (\eta^{(w)})^2 \|\nabla_{w} F(\theta_t)\|_2^2$ by Lemma 2.4). Since $F$ is lower-bounded, we can conclude that $\|w_t\|_2^2$ is also bounded. For the second part, we can also show by Lemma 3.2 that the sum of the squared gradient norms with respect to $g$ is bounded in terms of the total decrease of the loss.

Thus we can conclude the convergence rate of $\min_{t \le T} \|\nabla F(\tilde{\theta}_t)\|_2$ as follows.

The full proof is postponed to Appendix A.

4 Training by stochastic gradient descent

In this section, we analyze the effect related to the scale-invariant properties when training a neural network by stochastic gradient descent. We use the framework introduced in Section 2.2 and assumptions from Section 2.4.

4.1 Settings and main theorem

Assumptions on learning rates. As usual, we assume that the learning rate for $g$ is chosen carefully and the learning rate for $W$ is chosen rather arbitrarily. More specifically, we consider the case in which the learning rates are chosen as

$\eta_t^{(w)} = \eta_0^{(w)} \cdot (t+1)^{-\alpha}, \qquad \eta_t^{(g)} = \eta_0^{(g)} \cdot (t+1)^{-1/2}.$

We assume that the initial learning rate $\eta_0^{(g)}$ for $g$ is tuned carefully, up to a constant factor, as in standard analyses of SGD. Note that this learning rate schedule matches the best known convergence rate of SGD in the case of smooth non-convex loss functions (Ghadimi & Lan, 2013).

For the learning rates of $W$, we only assume that $0 \le \alpha \le 1/2$, i.e., the learning rate decays as fast as or slower than the optimal SGD learning rate schedule. $\eta_0^{(w)}$ can be set to any positive value. Note that this includes the case in which we set a fixed learning rate for $W$, by taking $\alpha = 0$.

Remark 4.1.

Note that the auto-tuning behavior induced by scale-invariance only ever decreases the effective learning rate, since $\|w_t\|_2$ is non-decreasing. Thus, if we set $\eta_t^{(w)}$ to decay faster than $t^{-1/2}$, there is no hope of adjusting the effective learning rate to the optimal $\Theta(t^{-1/2})$ strategy. Indeed, in this case, the learning rate in the intrinsic optimization process decays at the rate of $t^{-\alpha}$ at best, which is the best possible learning rate that can be achieved without increasing the original learning rate.

Theorem 4.2.

Consider the process of training by stochastic gradient descent with $\eta_t^{(g)} = \eta_0^{(g)} (t+1)^{-1/2}$ and $\eta_t^{(w)} = \eta_0^{(w)} (t+1)^{-\alpha}$, where $0 \le \alpha \le 1/2$ and $\eta_0^{(w)} > 0$ is arbitrary. Then $\tilde{\theta}_t$ converges to a stationary point at the rate of

(13)   $\min_{t \le T} \mathbb{E}\big[ \| \nabla F(\tilde{\theta}_t) \|_2 \big] = \tilde{O}\big( T^{-1/4} \big),$

where $\tilde{O}(\cdot)$ suppresses polynomial factors in the smoothness constants, the noise bound $\sigma^2$, the initial suboptimality $F(\theta_0) - F^*$, $\eta_0^{(w)}$, $1/\eta_0^{(w)}$, $\eta_0^{(g)}$, and $\|w_0\|_2$, $1/\|w_0\|_2$ for all $w \in W$, all of which we regard as constants, as well as polylogarithmic factors in $T$.

Note that this matches the asymptotic convergence rate of SGD, within a polylogarithmic factor.

4.2 Proof sketch

We defer the full proof to Appendix B and give a proof sketch in a simplified setting in which there are no scale-variant parameters $g$ and the learning rate for $W$ is fixed (i.e., $\alpha = 0$). We also assume there is only one scale-invariant parameter, that is, $W = \{w\}$, and omit the index.

By Taylor expansion, we have

(14)

We can lower bound the effective learning rate and upper bound the second order term respectively in the following way:

  1. For all , the effective learning rate ;

  2. .

Taking expectation over equation 14 and summing it up, we have

Plugging the above bounds into this inequality completes the proof.

5 Conclusions and future works

In this paper, we studied how scale-invariance in neural networks with BN helps optimization, and showed that (stochastic) gradient descent can achieve the asymptotically best convergence rate without tuning learning rates for the scale-invariant parameters. Our analysis suggests that the scale-invariance introduced into neural networks by BN reduces the effort of tuning learning rates to fit the training data.

However, our analysis only applies to smooth loss functions. In modern neural networks, ReLU or Leaky ReLU activations are often used, which makes the loss non-smooth. Showing similar results in non-smooth settings would have broader implications. Also, we only considered gradient descent in this paper. It can be shown that if we perform (stochastic) gradient descent with momentum, the norm of the scale-invariant parameters is also monotonically increasing. It would be interesting to use this fact to show similar convergence results for a wider class of gradient methods.

Acknowledgments

We thank Yuanzhi Li, Wei Hu, and Noah Golowich for helpful discussions. This research was done with support from NSF, ONR, Simons Foundation, Mozilla Research, Schmidt Foundation, DARPA, and SRC.

References

Appendix A Proof for Full-Batch Gradient Descent

By the scale-invariant property of , we know that . Also, the following identities about derivatives can be easily obtained:

Thus, the assumptions on the smoothness imply

(15)
(16)
(17)
Proof for Lemma 3.2.

Using Taylor expansion, we have , such that for ,

Note that is perpendicular to , we have

Thus,