1 Introduction
Batch Normalization (abbreviated as BatchNorm or BN) (Ioffe & Szegedy, 2015) is one of the most important innovations in deep learning, widely used in modern neural network architectures such as ResNet (He et al., 2016), Inception (Szegedy et al., 2017), and DenseNet (Huang et al., 2017). It also inspired a series of other normalization methods (Ulyanov et al., 2016; Ba et al., 2016; Ioffe, 2017; Wu & He, 2018).
BatchNorm consists of standardizing the output of each layer to have zero mean and unit variance. For a single neuron, if $x_1, \dots, x_B$ are the original outputs in a mini-batch, then adding a BatchNorm layer modifies the outputs to
$$\mathrm{BN}(x_i) = \gamma \frac{x_i - \mu}{\sigma} + \beta, \qquad (1)$$
where $\mu = \frac{1}{B}\sum_{i=1}^{B} x_i$ and $\sigma^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu)^2$ are the mean and variance within the mini-batch, and $\gamma$ and $\beta$ are two learnable parameters. BN appears to stabilize and speed up training, and improve generalization. The inventors suggested (Ioffe & Szegedy, 2015) that these benefits derive from the following:

1. By stabilizing layer outputs, it reduces a phenomenon called Internal Covariate Shift, whereby the training of a higher layer is continuously undermined or undone by changes in the distribution of its inputs due to parameter changes in previous layers;

2. Making the weights invariant to scaling appears to reduce the dependence of training on the scale of parameters and enables us to use a higher learning rate;

3. By implicitly regularizing the model, it improves generalization.
But these three benefits are not fully understood in theory. Understanding generalization for deep models remains an open problem (with or without BN). Furthermore, in a demonstration that intuition can sometimes mislead, recent experimental results suggest that BN does not reduce internal covariate shift either (Santurkar et al., 2018), and the authors of that study suggest that the true explanation for BN's effectiveness may lie in a smoothening effect (i.e., lowering of the Hessian norm) on the objective. Another recent paper (Kohler et al., 2018) tries to quantify the benefits of BN for simple machine learning problems such as regression but does not analyze deep models.
Provable quantification of Effect 2 (learning rates).
Our study consists of quantifying the effect of BN on learning rates. Ioffe & Szegedy (2015) observed that without BatchNorm, a large learning rate leads to a rapid growth of the parameter scale. Introducing BatchNorm usually stabilizes the growth of weights and appears to implicitly tune the learning rate so that the effective learning rate adapts during the course of the algorithm. They explained this intuitively as follows. After BN the output of a neuron $z = \mathrm{BN}(w^\top x)$ is unaffected when the weight $w$ is scaled, i.e., for any scalar $c > 0$,
$$\mathrm{BN}(w^\top x) = \mathrm{BN}((cw)^\top x).$$
Taking derivatives one finds that the gradient at $cw$ equals the gradient at $w$ multiplied by a factor $1/c$. Thus, even though the scale of the weight parameters of a linear layer preceding a BatchNorm no longer means anything to the function represented by the neural network, their growth has the effect of reducing the learning rate.
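The scaling argument above can be checked numerically. The following is a minimal sketch, not the authors' code: the toy squared loss, the mini-batch, and the helper names `batch_norm`, `loss` and `num_grad` are all assumptions made for illustration, and the gradient is approximated by central finite differences.

```python
import math

def batch_norm(zs, gamma=1.0, beta=0.0):
    """Standardize a batch of scalars, then rescale by gamma and shift by beta."""
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    return [gamma * (z - mu) / math.sqrt(var) + beta for z in zs]

def loss(w, xs, targets):
    """Toy squared loss on BN(w^T x) over one mini-batch (hypothetical objective)."""
    zs = [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]
    outs = batch_norm(zs)
    return sum((o - t) ** 2 for o, t in zip(outs, targets)) / len(xs)

def num_grad(w, xs, targets, h=1e-6):
    """Central finite-difference approximation of the gradient with respect to w."""
    grad = []
    for i in range(len(w)):
        w_plus, w_minus = list(w), list(w)
        w_plus[i] += h
        w_minus[i] -= h
        grad.append((loss(w_plus, xs, targets) - loss(w_minus, xs, targets)) / (2 * h))
    return grad

# Made-up mini-batch of B = 4 inputs and targets.
xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]
targets = [1.0, -1.0, 0.0, 0.5]
w, c = [0.7, -0.4], 5.0
cw = [c * wi for wi in w]

grad_w = num_grad(w, xs, targets)
grad_cw = num_grad(cw, xs, targets)
print(loss(w, xs, targets), loss(cw, xs, targets))  # equal up to rounding
print(grad_w, [c * gi for gi in grad_cw])           # equal up to finite-difference error
```

Multiplying the gradient at $cw$ by $c$ recovers the gradient at $w$, which is the $1/c$ factor described in the text.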
Our paper considers the following question: Can we rigorously capture the above intuitive behavior?
Theoretical analyses of the speed of gradient descent algorithms in nonconvex settings study the number of iterations required for convergence to a stationary point (i.e., where the gradient vanishes). But they need to assume that the learning rate has been set (magically) to a small enough number determined by the smoothness constant of the loss function, which in practice is of course unknown. With this tuned learning rate, the norm of the gradient reduces asymptotically as $T^{-1/2}$ in $T$ iterations. In the case of stochastic gradient descent, the reduction is like $T^{-1/4}$. Thus a potential way to quantify the rate-tuning behavior of BN would be to show that even when the learning rate is fixed to a suitable constant, say $0.1$, from the start, after introducing BN the convergence to a stationary point is asymptotically just as fast (essentially) as it would be with a hand-tuned learning rate required by earlier analyses. The current paper rigorously establishes such auto-tuning behavior of BN (see below for an important clarification about scale-invariance).
We note that a recent paper (Wu et al., 2018) introduced a new algorithm, WNGrad, that is motivated by BN and provably has the above auto-tuning behavior as well. That paper did not establish such behavior for BN itself, but it was a clear inspiration for our analysis of BN.
Scale-invariant and scale-variant parameters.
The intuition of Ioffe & Szegedy (2015) applies to all scale-invariant parameters, but the actual algorithm also involves other parameters such as $\gamma$ and $\beta$ whose scale does matter. Our analysis partitions the parameters in the neural networks into two groups $W$ (scale-invariant) and $g$ (scale-variant). The first group, $W = \{w^{(1)}, \dots, w^{(m)}\}$, consists of all the parameters whose scale does not affect the loss, i.e., scaling $w^{(i)}$ to $c\,w^{(i)}$ for any $c > 0$ does not change the loss (see Definition 2.1 for a formal definition); the second group, $g$, consists of all other parameters that are not scale-invariant. In a feedforward neural network with BN added at each layer, the layer weights are all scale-invariant. This is also true for BN with $\ell_p$ normalization strategies (Santurkar et al., 2018; Hoffer et al., 2018) and other normalization layers, such as Weight Normalization (Salimans & Kingma, 2016), Layer Normalization (Ba et al., 2016), and Group Normalization (Wu & He, 2018) (see Table 1 in Ba et al. (2016) for a summary).
1.1 Our contributions
In this paper, we show that the scale-invariant parameters do not require rate tuning for lowering the training loss. To illustrate this, we consider the case in which we set learning rates separately for scale-invariant parameters $W$ and scale-variant parameters $g$. Under some assumptions on the smoothness of the loss and the boundedness of the noise, we show that

1. In full-batch gradient descent, if the learning rate for $g$ is set optimally, then no matter how the learning rates for $W$ are set, $(W; g)$ converges to a first-order stationary point at the rate $O(T^{-1/2})$, which asymptotically matches the convergence rate of gradient descent with an optimal choice of learning rates for all parameters (Theorem 3.1);

2. In stochastic gradient descent, if the learning rate for $g$ is set optimally, then no matter how the learning rate for $W$ is set, $(W; g)$ converges to a first-order stationary point at the rate $\tilde{O}(T^{-1/4})$, which asymptotically matches the convergence rate of gradient descent with an optimal choice of learning rates for all parameters (up to a $\mathrm{polylog}(T)$ factor) (Theorem 4.2).
In the usual case where we set a unified learning rate for all parameters, our results imply that we only need to set a learning rate that is suitable for $g$. This means that introducing scale-invariance into neural networks potentially reduces the effort needed to tune learning rates, since there are fewer parameters whose learning rates must be chosen carefully in order to guarantee an asymptotically fastest convergence.
In our study, the loss function is assumed to be smooth. However, BN introduces non-smoothness in extreme cases due to division by zero when the input variance is zero (see equation 1). Note that the suggested implementation of BN by Ioffe & Szegedy (2015) uses a smoothening constant in the whitening step, but it does not preserve scale-invariance. In order to avoid this issue, we describe a simple modification of the smoothening that maintains scale-invariance. Also, our result cannot be applied to neural networks with ReLU, but it is applicable to its smooth approximation softplus (Dugas et al., 2001).
We include some experiments in Appendix D, showing that it is indeed the auto-tuning behavior analysed in this paper that empowers BN to have such convergence with an arbitrary learning rate for scale-invariant parameters. In the generalization aspect, a tuned learning rate is still needed for the best test accuracy, and we show in the experiments that the auto-tuning behavior of BN also leads to a wider range of suitable learning rates for good generalization.
1.2 Related works
Previous work for understanding Batch Normalization. Only a few recent works have tried to theoretically understand BatchNorm. Santurkar et al. (2018) was described earlier. Kohler et al. (2018) aim to find theoretical settings in which training neural networks with BatchNorm is faster than without BatchNorm. In particular, the authors analyzed three types of shallow neural networks, but rather than considering gradient descent, they designed task-specific training methods when discussing neural networks with BatchNorm. Bjorck et al. (2018) observe that the higher learning rates enabled by BatchNorm improve generalization.
Convergence of adaptive algorithms. Our analysis is inspired by the proof for WNGrad (Wu et al., 2018), where the authors analyzed an adaptive algorithm, WNGrad, motivated by Weight Normalization (Salimans & Kingma, 2016). Other works analyzing the convergence of adaptive methods are (Ward et al., 2018; Li & Orabona, 2018; Zou & Shen, 2018; Zhou et al., 2018).
2 General framework
In this section, we introduce our general framework in order to study the benefits of scale-invariance.
2.1 Motivating examples of neural networks
Scale-invariance is common in neural networks with BatchNorm. We formally state the definition of scale-invariance below:
Definition 2.1.
(Scale-invariance) Let $\mathcal{F}(w, \theta')$ be a loss function. We say that $w$ is a scale-invariant parameter of $\mathcal{F}$ if for all $c > 0$, $\mathcal{F}(w, \theta') = \mathcal{F}(cw, \theta')$; if $w$ is not scale-invariant, then we say $w$ is a scale-variant parameter of $\mathcal{F}$.
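We consider the following $L$-layer "fully-batch-normalized" feedforward network $\Phi$ for illustration:
$$\mathcal{L}(\theta) = \mathbb{E}_{Z \sim \mathcal{D}^B}\left[\frac{1}{B}\sum_{b=1}^{B} f_{y_b}(\mathrm{BN}(W^{(L)}\sigma(\mathrm{BN}(W^{(L-1)}\cdots\sigma(\mathrm{BN}(W^{(1)}x_b))))))\right]. \qquad (2)$$
$Z = \{(x_1, y_1), \dots, (x_B, y_B)\}$ is a mini-batch of $B$ pairs of input data and ground-truth label from a data set $\mathcal{D}$. $f_y$ is an objective function depending on the label, e.g., $f_y$ could be a cross-entropy loss in classification tasks. $W^{(1)}, \dots, W^{(L)}$ are the weight matrices of each layer. $\sigma: \mathbb{R} \to \mathbb{R}$ is a nonlinear activation function which processes its input elementwise (such as ReLU, sigmoid). Given a batch of inputs $z_1, \dots, z_B \in \mathbb{R}^m$, $\mathrm{BN}(z_b)$ outputs a vector $\tilde{z}_b$ defined as
$$\tilde{z}_{b,k} := \gamma_k \frac{z_{b,k} - \mu_k}{\sigma_k} + \beta_k, \qquad (3)$$
where $\mu_k = \mathbb{E}_{b \in [B]}[z_{b,k}]$ and $\sigma_k^2 = \mathbb{E}_{b \in [B]}[(z_{b,k} - \mu_k)^2]$ are the mean and variance of $z_b$, and $\gamma_k$ and $\beta_k$ are two learnable parameters which rescale and offset the normalized outputs to retain the representation power. The neural network $\Phi$ is thus parameterized by the weight matrices $W^{(i)}$ in each layer and the learnable parameters $\gamma_k, \beta_k$ in each BN.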
BN has the property that the output is unchanged when the batch inputs $z_{1,k}, \dots, z_{B,k}$ are scaled or shifted simultaneously. For $z_{b,k} = w_k^\top \hat{x}_b$ being the output of a linear layer, it is easy to see that $w_k$ is scale-invariant, and thus each row vector of the weight matrices $W^{(1)}, \dots, W^{(L)}$ in $\Phi$ is a scale-invariant parameter of $\mathcal{L}(\theta)$. In convolutional neural networks with BatchNorm, a similar argument can be made. In particular, each filter of a convolutional layer normalized by BN is scale-invariant.
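As a quick sanity check of this invariance, the sketch below applies a per-coordinate BN in the style of equation 3 to the outputs of a small linear layer and compares the result before and after rescaling one row of the weight matrix, and before and after shifting all pre-activations. The data, the weights and the helper names `bn_layer` and `linear` are made up for illustration.

```python
import math

def bn_layer(batch, gammas, betas):
    """Apply equation-3-style BN independently to each coordinate k of a batch of vectors."""
    B, m = len(batch), len(batch[0])
    out = [[0.0] * m for _ in range(B)]
    for k in range(m):
        col = [z[k] for z in batch]
        mu = sum(col) / B
        sigma = math.sqrt(sum((v - mu) ** 2 for v in col) / B)
        for b in range(B):
            out[b][k] = gammas[k] * (col[b] - mu) / sigma + betas[k]
    return out

def linear(W, batch):
    """Compute z_b = W x_b for every input x_b in the batch."""
    return [[sum(wi * xi for wi, xi in zip(row, x)) for row in W] for x in batch]

# Made-up batch (B = 4, input dimension 3) and a 2 x 3 weight matrix.
X = [[1.0, 0.0, 2.0], [0.5, -1.0, 0.3], [-2.0, 1.5, 0.7], [0.1, 0.4, -1.2]]
W = [[0.3, -0.8, 0.5], [1.2, 0.1, -0.6]]
gammas, betas = [1.5, 0.7], [0.2, -0.3]

base = bn_layer(linear(W, X), gammas, betas)

# Rescale only the first row of W by c = 10: the BN outputs should not change.
W_scaled = [[10.0 * w for w in W[0]], W[1]]
scaled = bn_layer(linear(W_scaled, X), gammas, betas)

# Shift every pre-activation by the same constant: the BN outputs should not change either.
shifted = bn_layer([[v + 3.0 for v in z] for z in linear(W, X)], gammas, betas)

max_diff = max(abs(a - b) for r1, r2 in zip(base, scaled) for a, b in zip(r1, r2))
max_diff_shift = max(abs(a - b) for r1, r2 in zip(base, shifted) for a, b in zip(r1, r2))
print(max_diff, max_diff_shift)  # both at rounding-error level
```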
With a general nonlinear activation, the other parameters in $\Phi$, namely the scale and shift parameters $\gamma_k$ and $\beta_k$ in each BN, are scale-variant. When ReLU or Leaky ReLU (Maas et al., 2013) is used as the activation $\sigma$, the vector $(\gamma_1, \dots, \gamma_m, \beta_1, \dots, \beta_m)$ of each BN at layer $1 \le i < L$ (except the last one) is indeed scale-invariant. This can be deduced by using the (positive) homogeneity of these two types of activations and noticing that the output of internal activations is processed by a BN in the next layer. Nevertheless, we are not able to analyse either ReLU or Leaky ReLU activations because we need the loss to be smooth in our analysis. We can instead analyse smooth activations, such as sigmoid, tanh, softplus (Dugas et al., 2001), etc.
2.2 Framework
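Now we introduce our general framework. Let $\Phi$ be a neural network parameterized by $\theta$. Let $\mathcal{D}$ be a dataset, where each data point $z \sim \mathcal{D}$ is associated with a loss function $\mathcal{F}_z(\theta)$ ($\mathcal{D}$ can be the set of all possible mini-batches). We partition the parameters $\theta$ into $(W; g)$, where $W = \{w^{(1)}, \dots, w^{(m)}\}$ consists of the parameters that are scale-invariant with respect to all $\mathcal{F}_z$, and $g$ contains the remaining parameters. The goal of training the neural network is to minimize the expected loss over the dataset: $\mathcal{L}(W; g) := \mathbb{E}_{z \sim \mathcal{D}}[\mathcal{F}_z(W; g)]$. In order to illustrate the optimization benefits of scale-invariance, we consider the process of training this neural network by stochastic gradient descent with separate learning rates for $W$ and $g$:
$$w^{(i)}_{t+1} \leftarrow w^{(i)}_t - \eta_{w,t} \nabla_{w^{(i)}_t} \mathcal{F}_{z_t}(\theta_t), \qquad g_{t+1} \leftarrow g_t - \eta_{g,t} \nabla_{g_t} \mathcal{F}_{z_t}(\theta_t). \qquad (4)$$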
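The update with separate learning rates can be sketched in a few lines. The example below is only an illustration under stated assumptions: the per-sample loss $F_z(w, g) = \frac{1}{2}(g \cdot \langle a, w\rangle / \|w\|_2 - y)^2$ is a hypothetical stand-in that is scale-invariant in $w$ and scale-variant in the scalar $g$, and the data and the names `grads` and `sgd_step` are made up.

```python
import math
import random

def grads(w, g, a, y):
    """Gradients of the toy loss F_z(w, g) = 0.5 * (g * <a, w>/||w|| - y)^2."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    s = sum(ai * wi for ai, wi in zip(a, w)) / norm   # scale-invariant feature
    r = g * s - y                                      # residual
    grad_w = [r * g * (ai / norm - s * wi / norm ** 2) for ai, wi in zip(a, w)]
    grad_g = r * s
    return grad_w, grad_g

def sgd_step(w, g, sample, eta_w, eta_g):
    """One update in the style of equation 4: separate learning rates for w and g."""
    a, y = sample
    grad_w, grad_g = grads(w, g, a, y)
    w_next = [wi - eta_w * gi for wi, gi in zip(w, grad_w)]
    g_next = g - eta_g * grad_g
    return w_next, g_next

random.seed(0)
data = [([1.0, 0.5], 0.8), ([0.2, -1.0], -0.5), ([-0.7, 0.3], -0.1)]  # made-up samples
w, g = [1.0, 1.0], 0.5
for t in range(200):
    w, g = sgd_step(w, g, random.choice(data), eta_w=0.5, eta_g=0.1)
print(w, g)
```

Keeping `eta_w` and `eta_g` as separate arguments mirrors the analysis, where only the learning rate of the scale-variant group is assumed to be tuned.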
2.3 The intrinsic optimization problem
Thanks to the scale-invariant properties, the scale of each weight $w^{(i)}$ does not affect loss values. However, the scale does affect the gradients. Let $V = \{v^{(1)}, \dots, v^{(m)}\}$ be the set of normalized weights, where $v^{(i)} = w^{(i)} / \|w^{(i)}\|_2$. The following simple lemma can be easily shown:
Lemma 2.2 (Implied by Ioffe & Szegedy (2015)).
For any $W$ and $g$,
$$\nabla_{w^{(i)}} \mathcal{F}_z(W; g) = \frac{1}{\|w^{(i)}\|_2} \nabla_{v^{(i)}} \mathcal{F}_z(V; g), \qquad \nabla_g \mathcal{F}_z(W; g) = \nabla_g \mathcal{F}_z(V; g). \qquad (5)$$
To make $\|\nabla_{w^{(i)}} \mathcal{F}_z(W; g)\|_2$ small, one can just scale the weights by a large factor. Thus there are ways to reduce the norm of the gradient that do not reduce the loss.
For this reason, we define the intrinsic optimization problem for training the neural network. Instead of optimizing $W$ and $g$ over all possible solutions, we focus on parameters $\theta$ in which $\|w^{(i)}\|_2 = 1$ for all $w^{(i)} \in W$. This does not change our objective, since the scale of $W$ does not affect the loss.
Definition 2.3 (Intrinsic optimization problem).
Let $\mathcal{U} = \{\theta \mid \|w^{(i)}\|_2 = 1 \text{ for all } i\}$ be the intrinsic domain. The intrinsic optimization problem is defined as optimizing the original problem in $\mathcal{U}$:
$$\min_{(W; g) \in \mathcal{U}} \mathcal{L}(W; g). \qquad (6)$$
For $\{\theta_t\}$ being a sequence of points for optimizing the original optimization problem, we can define $\{\tilde{\theta}_t\}$, where $\tilde{\theta}_t = (V_t; g_t)$, as a sequence of points optimizing the intrinsic optimization problem.
In this paper, we aim to show that training a neural network for the original optimization problem by gradient descent can be seen as training by adaptive methods for the intrinsic optimization problem, and that it converges to a first-order stationary point of the intrinsic optimization problem with no need for tuning learning rates for $W$.
2.4 Assumptions on the loss
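We assume $\mathcal{F}_z(W; g)$ is defined and twice continuously differentiable at any $\theta$ for which none of the $w^{(i)}$ is $0$. Also, we assume that the expected loss $\mathcal{L}(\theta)$ is lower-bounded by $\mathcal{L}_{\min}$.
Furthermore, for $V = \{v^{(1)}, \dots, v^{(m)}\}$, where $v^{(i)} = w^{(i)} / \|w^{(i)}\|_2$, we assume the following bounds on the smoothness:
$$\left\| \frac{\partial^2}{\partial v^{(i)} \partial v^{(j)}} \mathcal{F}_z(V; g) \right\|_2 \le L^{vv}_{ij}, \qquad \left\| \frac{\partial^2}{\partial v^{(i)} \partial g} \mathcal{F}_z(V; g) \right\|_2 \le L^{vg}_{i}, \qquad \left\| \nabla_g^2 \mathcal{F}_z(V; g) \right\|_2 \le L^{gg}.$$
In addition, we assume that the noise on the gradient of $g$ in SGD is upper bounded by $G_g$:
$$\mathbb{E}\left[ \left\| \nabla_g \mathcal{F}_z(V; g) - \mathbb{E}_{z \sim \mathcal{D}}\left[\nabla_g \mathcal{F}_z(V; g)\right] \right\|_2^2 \right] \le G_g^2.$$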
Smoothed version of motivating neural networks. Note that the neural network illustrated in Section 2.1 does not meet the smoothness conditions at all, since the loss function could be non-smooth. We can make some mild modifications to the motivating example to smoothen it (our results for this network are rather conceptual, since the smoothness upper bound can be as large as $M^{O(L)}$, where $L$ is the number of layers and $M$ is the maximum width of each layer):

1. The activation could be non-smooth. A possible solution is to use smooth nonlinearities, e.g., sigmoid, tanh, softplus (Dugas et al., 2001), etc. Note that softplus can be seen as a smooth approximation of the most commonly used activation, ReLU.

2. The formula of BN shown in equation 3 may suffer from the problem of division by zero. To avoid this, the inventors of BN, Ioffe & Szegedy (2015), add a small smoothening parameter $\epsilon > 0$ to the denominator, i.e.,
$$\tilde{z}_{b,k} := \gamma_k \frac{z_{b,k} - \mu_k}{\sqrt{\sigma_k^2 + \epsilon}} + \beta_k. \qquad (7)$$
However, when $z_{b,k} = w_k^\top \hat{x}_b$, adding a constant $\epsilon$ directly breaks the scale-invariance of $w_k$. We can preserve the scale-invariance by making the smoothening term proportional to $\|w_k\|_2^2$, i.e., replacing $\epsilon$ with $\epsilon \|w_k\|_2^2$. By simple linear algebra and letting $u := \mathbb{E}_{b \in [B]}[\hat{x}_b]$ and $S := \mathrm{Var}_{b \in [B]}(\hat{x}_b)$, this smoothed version of BN can also be written as
$$\tilde{z}_{b,k} := \gamma_k \frac{w_k^\top (\hat{x}_b - u)}{\|w_k\|_{S + \epsilon I}} + \beta_k, \qquad (8)$$
where $\|w\|_A := \sqrt{w^\top A w}$. Since the variance of inputs is usually large in practice, for small $\epsilon$, the effect of the smoothening term is negligible except in extreme cases.
Using the above two modifications, the loss function is already smooth. However, the scale of the scale-variant parameters may be unbounded during training, which could make the smoothness unbounded. To avoid this issue, we can either project the scale-variant parameters onto a bounded set, or use weight decay for those parameters (see Appendix C for a proof for the latter solution).
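To see concretely why a constant smoothing term breaks scale-invariance while a term that grows with $\|w_k\|_2^2$ preserves it, the sketch below compares the two variants for a single neuron with $\gamma = 1$ and $\beta = 0$. The data, the deliberately large `eps = 0.1` and the function names are assumptions chosen so that the difference is easy to see.

```python
import math

def pre_activations(w, xs):
    """z_b = w^T x_b for each input in the batch."""
    return [sum(wi * xi for wi, xi in zip(w, x)) for x in xs]

def bn_const_eps(w, xs, eps):
    """BN with a constant eps added to the variance (gamma = 1, beta = 0)."""
    zs = pre_activations(w, xs)
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    return [(z - mu) / math.sqrt(var + eps) for z in zs]

def bn_scaled_eps(w, xs, eps):
    """Scale-invariant variant: eps is multiplied by ||w||_2^2 (gamma = 1, beta = 0)."""
    zs = pre_activations(w, xs)
    mu = sum(zs) / len(zs)
    var = sum((z - mu) ** 2 for z in zs) / len(zs)
    sq_norm = sum(wi * wi for wi in w)
    return [(z - mu) / math.sqrt(var + eps * sq_norm) for z in zs]

xs = [[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]]  # made-up batch
w, c, eps = [0.7, -0.4], 100.0, 0.1
cw = [c * wi for wi in w]

def max_gap(f):
    """Largest change in the BN outputs when w is replaced by c * w."""
    return max(abs(a - b) for a, b in zip(f(w, xs, eps), f(cw, xs, eps)))

print(max_gap(bn_const_eps))   # clearly nonzero: scale-invariance is broken
print(max_gap(bn_scaled_eps))  # rounding-error level: scale-invariance is preserved
```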
2.5 Key observation: the growth of weights
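The following lemma is our key observation. It establishes a connection between the scale-invariant property and the growth of the weight scale, which further implies an automatic decay of learning rates:
Lemma 2.4.
For any scale-invariant weight $w^{(i)}$ in the network $\Phi$, we have:

1. $w^{(i)}_t$ and $\nabla_{w^{(i)}_t} \mathcal{F}_{z_t}(\theta_t)$ are always perpendicular;

2. $\|w^{(i)}_{t+1}\|_2^2 = \|w^{(i)}_t\|_2^2 + \eta_{w,t}^2 \|\nabla_{w^{(i)}_t} \mathcal{F}_{z_t}(\theta_t)\|_2^2 = \|w^{(i)}_t\|_2^2 + \frac{\eta_{w,t}^2}{\|w^{(i)}_t\|_2^2} \|\nabla_{v^{(i)}_t} \mathcal{F}_{z_t}(\tilde{\theta}_t)\|_2^2$.
Proof.
Let $\theta'_t$ be all the parameters in $\theta_t$ other than $w^{(i)}_t$. Taking derivatives with respect to $c$ on both sides of $\mathcal{F}_{z_t}(w^{(i)}_t, \theta'_t) = \mathcal{F}_{z_t}(c\,w^{(i)}_t, \theta'_t)$, we have $0 = \frac{\partial}{\partial c} \mathcal{F}_{z_t}(c\,w^{(i)}_t, \theta'_t)$. The right-hand side equals $\nabla_{c w^{(i)}_t} \mathcal{F}_{z_t}(c\,w^{(i)}_t, \theta'_t)^\top w^{(i)}_t$, so the first proposition follows by taking $c = 1$. Applying the Pythagorean theorem and Lemma 2.2, the second proposition directly follows. ∎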
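Both items of Lemma 2.4 are easy to verify numerically. The sketch below uses a hypothetical scale-invariant toy loss $F(w) = \langle a, w \rangle / \|w\|_2$ with a closed-form gradient (an assumption made for illustration, not a loss from the paper) and runs a few plain gradient steps.

```python
import math

def grad(w, a):
    """Gradient of the scale-invariant toy loss F(w) = <a, w> / ||w||_2."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    s = sum(ai * wi for ai, wi in zip(a, w)) / norm
    return [ai / norm - s * wi / norm ** 2 for ai, wi in zip(a, w)]

a = [0.6, 0.8, 0.0]   # made-up direction defining the toy loss
w = [1.0, 0.0, 2.0]   # made-up initialization
eta = 0.3
checks = []
for t in range(5):
    g = grad(w, a)
    inner = sum(wi * gi for wi, gi in zip(w, g))   # item 1: <w_t, grad> should vanish
    sq_norm = sum(wi * wi for wi in w)
    sq_grad = sum(gi * gi for gi in g)
    w = [wi - eta * gi for wi, gi in zip(w, g)]    # plain gradient step
    new_sq_norm = sum(wi * wi for wi in w)
    # item 2: ||w_{t+1}||^2 - (||w_t||^2 + eta^2 ||grad||^2) should vanish
    checks.append((inner, new_sq_norm - (sq_norm + eta ** 2 * sq_grad)))
print(checks)  # every entry is at rounding-error level
```

Because the gradient is always orthogonal to the weight, each step can only increase the norm, which is the source of the automatic decay of the effective learning rate.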
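Using Lemma 2.4, we can show that performing gradient descent for the original problem is equivalent to performing an adaptive gradient method for the intrinsic optimization problem:
Theorem 2.5.
Let $G^{(i)}_t = \|w^{(i)}_t\|_2^2$. Then for all $t \ge 0$,
$$v^{(i)}_{t+1} = \Pi\left( v^{(i)}_t - \frac{\eta_{w,t}}{G^{(i)}_t} \nabla_{v^{(i)}_t} \mathcal{F}_{z_t}(\tilde{\theta}_t) \right), \qquad G^{(i)}_{t+1} = G^{(i)}_t + \frac{\eta_{w,t}^2}{G^{(i)}_t} \left\| \nabla_{v^{(i)}_t} \mathcal{F}_{z_t}(\tilde{\theta}_t) \right\|_2^2, \qquad (9)$$
where $\Pi$ is a projection operator which maps any vector $w$ to $w / \|w\|_2$.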
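The equivalence stated in Theorem 2.5 can be illustrated by running both procedures side by side. In the sketch below, the toy loss $F(w) = \langle a, w \rangle / \|w\|_2$ and all numerical values are assumptions for illustration: one loop runs plain gradient descent on $w$, the other applies the adaptive update of equation 9 to the pair $(v, G)$.

```python
import math

def norm(u):
    return math.sqrt(sum(ui * ui for ui in u))

def grad(w, a):
    """Gradient of the scale-invariant toy loss F(w) = <a, w> / ||w||_2."""
    n = norm(w)
    s = sum(ai * wi for ai, wi in zip(a, w)) / n
    return [ai / n - s * wi / n ** 2 for ai, wi in zip(a, w)]

def project(u):
    """The projection Pi: map u to u / ||u||_2."""
    n = norm(u)
    return [ui / n for ui in u]

a, eta = [0.6, 0.8, 0.0], 0.5
w = [1.0, 0.0, 2.0]               # original parameterization
v, G = project(w), norm(w) ** 2   # intrinsic parameterization (v_0, G_0)

for t in range(50):
    # Gradient descent on the original problem.
    w = [wi - eta * gi for wi, gi in zip(w, grad(w, a))]
    # Adaptive update of equation 9; grad(v, a) is the gradient at the unit vector v.
    gv = grad(v, a)
    v_next = project([vi - (eta / G) * gi for vi, gi in zip(v, gv)])
    G = G + (eta ** 2 / G) * sum(gi * gi for gi in gv)
    v = v_next

direction_gap = max(abs(x - y) for x, y in zip(project(w), v))
norm_gap = abs(norm(w) ** 2 - G)
print(direction_gap, norm_gap)  # both at rounding-error level
```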
Remark 2.6.
Wu et al. (2018) noticed that Theorem 2.5 is true for Weight Normalization by direct calculation of gradients. Inspired by this, they proposed a new adaptive method called WNGrad. Our theorem is more general since it holds for any normalization method as long as it induces scale-invariant properties in the network. The adaptive update rule derived in our theorem can be seen as WNGrad with projection onto the unit sphere after each step.
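Proof for Theorem 2.5.
Using Lemma 2.2, we have $w^{(i)}_{t+1} = w^{(i)}_t - \frac{\eta_{w,t}}{\|w^{(i)}_t\|_2} \nabla_{v^{(i)}_t} \mathcal{F}_{z_t}(\tilde{\theta}_t) = \|w^{(i)}_t\|_2 \left( v^{(i)}_t - \frac{\eta_{w,t}}{G^{(i)}_t} \nabla_{v^{(i)}_t} \mathcal{F}_{z_t}(\tilde{\theta}_t) \right)$, which implies the first part after normalization. The second part follows from Lemma 2.4. ∎
While popular adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSprop (Tieleman & Hinton, 2012), and Adam (Kingma & Ba, 2014) adjust learning rates for each single coordinate, the adaptive gradient method described in Theorem 2.5 sets learning rates for each scale-invariant parameter separately. In this paper, we call $\eta_{w,t} / G^{(i)}_t$ the effective learning rate of $v^{(i)}_t$ or $w^{(i)}_t$, because it is $\eta_{w,t} / \|w^{(i)}_t\|_2^2$ instead of $\eta_{w,t}$ alone that really determines the trajectory of gradient descent given the normalized scale-invariant parameter $v^{(i)}_t$. In other words, the magnitude of the initialization of parameters before BN is as important as their learning rates: multiplying the initialization of a scale-invariant parameter by a constant $C$ is equivalent to dividing its learning rate by $C^2$. Thus we suggest that researchers report initialization as well as learning rates in future experiments.
3 Training by full-batch gradient descent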
In this section, we rigorously analyze the effect of the scale-invariant properties when training a neural network by full-batch gradient descent. We use the framework introduced in Section 2.2 and the assumptions from Section 2.4. We focus on full-batch training, i.e., $z_t$ is always equal to the whole training set and $\mathcal{F}_{z_t}(\theta) = \mathcal{L}(\theta)$.
3.1 Settings and main theorem
Assumptions on learning rates. We consider the case that we use fixed learning rates for both and , i.e., and . We assume that is tuned carefully to for some constant . For , we do not make any assumption, i.e., can be set to any positive value.
Theorem 3.1.
Consider the process of training by gradient descent with and arbitrary . Then converges to a stationary point in the rate of
(10) 
where with , suppresses polynomial factors in , , , , , for all , and we see .
This matches the asymptotic convergence rate of GD by Carmon et al. (2018).
3.2 Proof sketch
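The high-level idea is to use the decrement of the loss function to upper bound the sum of the squared norms of the gradients. Note that $\|\nabla \mathcal{L}(V_t; g_t)\|_2^2 = \sum_{i=1}^{m} \|\nabla_{v^{(i)}_t} \mathcal{L}(V_t; g_t)\|_2^2 + \|\nabla_{g_t} \mathcal{L}(V_t; g_t)\|_2^2$. For the first part $\sum_{i=1}^{m} \|\nabla_{v^{(i)}_t} \mathcal{L}(V_t; g_t)\|_2^2$, Lemma 2.2 and Lemma 2.4 give
$$\sum_{t=0}^{T-1} \|\nabla_{v^{(i)}_t} \mathcal{L}(V_t; g_t)\|_2^2 = \sum_{t=0}^{T-1} \|w^{(i)}_t\|_2^2 \, \|\nabla_{w^{(i)}_t} \mathcal{L}(W_t; g_t)\|_2^2 \le \|w^{(i)}_T\|_2^2 \cdot \frac{\|w^{(i)}_T\|_2^2 - \|w^{(i)}_0\|_2^2}{\eta_w^2}. \qquad (11)$$
Thus the core of the proof is to show that the monotonically increasing $\|w^{(i)}_T\|_2$ has an upper bound for all $T$. It is shown that for every $w^{(i)}$, the whole training process can be divided into at most two phases. In the first phase, the effective learning rate $\eta_w / \|w^{(i)}_t\|_2^2$ is larger than some threshold (defined in Lemma 3.2) and in the second phase it is smaller.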
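The behavior described in this proof sketch can be observed in a small simulation. The sketch below is an illustration under assumptions, not an experiment from the paper: it runs gradient descent on the hypothetical scale-invariant toy loss $F(w) = \langle a, w \rangle / \|w\|_2$ with a deliberately oversized learning rate and records $\|w_t\|_2^2$ and the effective learning rate $\eta_w / \|w_t\|_2^2$.

```python
import math

def norm(u):
    return math.sqrt(sum(ui * ui for ui in u))

def grad(w, a):
    """Gradient of the scale-invariant toy loss F(w) = <a, w> / ||w||_2."""
    n = norm(w)
    s = sum(ai * wi for ai, wi in zip(a, w)) / n
    return [ai / n - s * wi / n ** 2 for ai, wi in zip(a, w)]

a = [0.6, 0.8, 0.0]   # made-up direction defining the toy loss
w = [1.0, 0.0, 0.0]   # made-up initialization with ||w_0|| = 1
eta = 10.0            # deliberately far too large for this initialization

sq_norms, eff_lrs = [], []
for t in range(300):
    G = norm(w) ** 2
    sq_norms.append(G)
    eff_lrs.append(eta / G)   # effective learning rate eta_w / ||w_t||^2
    w = [wi - eta * gi for wi, gi in zip(w, grad(w, a))]

v = [wi / norm(w) for wi in w]
intrinsic_grad_norm = norm(grad(v, a))   # gradient norm at the normalized weight
print(eff_lrs[0], eff_lrs[1], eff_lrs[-1])
print(intrinsic_grad_norm)
```

In this run the squared norm jumps after the first step, so the effective learning rate drops from 10 to roughly 0.15 and then stays nearly constant while the normalized weight converges to a stationary point of the intrinsic problem.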
Lemma 3.2 (Taylor Expansion).
Let . Then
(12) 
If is large enough and that the process enters the second phase, then by Lemma 3.2 in each step the loss function will decrease by (Recall that by Lemma 2.4). Since is lowerbounded, we can conclude is also bounded. For the second part, we can also show that by Lemma 3.2
Thus we can conclude convergence rate of as follows.
The full proof is postponed to Appendix A.
4 Training by stochastic gradient descent
In this section, we analyze the effect of the scale-invariant properties when training a neural network by stochastic gradient descent. We use the framework introduced in Section 2.2 and the assumptions from Section 2.4.
4.1 Settings and main theorem
Assumptions on learning rates. As usual, we assume that the learning rate for $g$ is chosen carefully and the learning rate for $W$ is chosen rather arbitrarily. More specifically, we consider the case that the learning rates are chosen as
$$\eta_{w,t} = \eta_w \cdot (t+1)^{-\alpha}, \qquad \eta_{g,t} = \eta_g \cdot (t+1)^{-1/2}.$$
We assume that the initial learning rate of is tuned carefully to for some constant . Note that this learning rate schedule matches the best known convergence rate of SGD in the case of smooth nonconvex loss functions (Ghadimi & Lan, 2013).
For the learning rates of $W$, we only assume that $0 \le \alpha \le 1/2$, i.e., the learning rate decays as fast as or slower than the optimal SGD learning rate schedule. $\eta_w$ can be set to any positive value. Note that this includes the case that we set a fixed learning rate for $W$ by taking $\alpha = 0$.
Remark 4.1.
Note that the auto-tuning behavior induced by scale-invariance always decreases the learning rates. Thus, if we set $\alpha > 1/2$, there is no hope of adjusting the learning rate to the optimal strategy $\Theta(t^{-1/2})$. Indeed, in this case, the learning rate in the intrinsic optimization process decays exactly at the rate of $\tilde{\Theta}(t^{-\alpha})$, which is the best possible learning rate that can be achieved without increasing the original learning rate.
Theorem 4.2.
Consider the process of training by gradient descent with and , where and is arbitrary. Then converges to a stationary point in the rate of
(13) 
where with , suppresses polynomial factors in , , , , , , for all , and we see .
Note that this matches the asymptotic convergence rate of SGD, within a $\mathrm{polylog}(T)$ factor.
4.2 Proof sketch
We delay the full proof into Appendix B and give a proof sketch in a simplified setting where there is no and . We also assume there’s only one , that is, and omit the index .
By Taylor expansion, we have
(14) 
We can lower bound the effective learning rate and upper bound the second order term respectively in the following way:

For all , the effective learning rate ;

.
Taking expectation over equation 14 and summing it up, we have
Plug the above bounds into the above inequality, we complete the proof.
5 Conclusions and future works
In this paper, we studied how scale-invariance in neural networks with BN helps optimization, and showed that (stochastic) gradient descent can achieve the asymptotically best convergence rate without tuning learning rates for scale-invariant parameters. Our analysis suggests that the scale-invariance introduced into neural networks by BN reduces the effort needed to tune the learning rate to fit the training data.
However, our analysis only applies to smooth loss functions. In modern neural networks, ReLU or Leaky ReLU are often used, which makes the loss non-smooth. Showing similar results in non-smooth settings would have broader implications. Also, we only considered gradient descent in this paper. It can be shown that if we perform (stochastic) gradient descent with momentum, the norm of scale-invariant parameters will also be monotonically increasing. It would be interesting to use this fact to show similar convergence results for more gradient methods.
Acknowledgments
We thank Yuanzhi Li, Wei Hu and Noah Golowich for helpful discussions. This research was done with support from NSF, ONR, Simons Foundation, Mozilla Research, Schmidt Foundation, DARPA, and SRC.
References
Ba et al. (2016) Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E Hinton. Layer normalization. arXiv preprint arXiv:1607.06450, 2016.
Bjorck et al. (2018) Nils Bjorck, Carla P Gomes, Bart Selman, and Kilian Q Weinberger. Understanding batch normalization. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 7705–7716. Curran Associates, Inc., 2018.
Carmon et al. (2018) Yair Carmon, John C Duchi, Oliver Hinder, and Aaron Sidford. Lower bounds for finding stationary points of nonconvex, smooth high-dimensional functions. 2018.
Cho & Lee (2017) Minhyung Cho and Jaehyung Lee. Riemannian approach to batch normalization. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 5225–5235. Curran Associates, Inc., 2017.
Duchi et al. (2011) John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
Dugas et al. (2001) Charles Dugas, Yoshua Bengio, François Bélisle, Claude Nadeau, and René Garcia. Incorporating second-order functional knowledge for better option pricing. In T. K. Leen, T. G. Dietterich, and V. Tresp (eds.), Advances in Neural Information Processing Systems 13, pp. 472–478. MIT Press, 2001.
Ghadimi & Lan (2013) Saeed Ghadimi and Guanghui Lan. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization, 23(4):2341–2368, 2013.
Glorot & Bengio (2010) Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Yee Whye Teh and Mike Titterington (eds.), Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics, volume 9 of Proceedings of Machine Learning Research, pp. 249–256, Chia Laguna Resort, Sardinia, Italy, 13–15 May 2010. PMLR.
He et al. (2016) Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
Hoffer et al. (2018) Elad Hoffer, Ron Banner, Itay Golan, and Daniel Soudry. Norm matters: efficient and accurate normalization schemes in deep networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 2164–2174. Curran Associates, Inc., 2018.
Huang et al. (2017) Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2261–2269. IEEE, 2017.
Ioffe (2017) Sergey Ioffe. Batch renormalization: Towards reducing minibatch dependence in batch-normalized models. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett (eds.), Advances in Neural Information Processing Systems 30, pp. 1945–1953. Curran Associates, Inc., 2017.
Ioffe & Szegedy (2015) Sergey Ioffe and Christian Szegedy. Batch normalization: accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, pp. 448–456. JMLR.org, 2015.
Kingma & Ba (2014) Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kohler et al. (2018) Jonas Kohler, Hadi Daneshmand, Aurelien Lucchi, Ming Zhou, Klaus Neymeyr, and Thomas Hofmann. Towards a theoretical understanding of batch normalization. arXiv preprint arXiv:1805.10694, 2018.
Li & Orabona (2018) Xiaoyu Li and Francesco Orabona. On the convergence of stochastic gradient descent with adaptive stepsizes. arXiv preprint arXiv:1805.08114, 2018.
Maas et al. (2013) Andrew L Maas, Awni Y Hannun, and Andrew Y Ng. Rectifier nonlinearities improve neural network acoustic models. In ICML Workshop on Deep Learning for Audio, Speech and Language Processing. Citeseer, 2013.
Salimans & Kingma (2016) Tim Salimans and Durk P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett (eds.), Advances in Neural Information Processing Systems 29, pp. 901–909. Curran Associates, Inc., 2016.
Santurkar et al. (2018) Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, and Aleksander Madry. How does batch normalization help optimization? In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett (eds.), Advances in Neural Information Processing Systems 31, pp. 2488–2498. Curran Associates, Inc., 2018.
Simonyan & Zisserman (2014) Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014.
Szegedy et al. (2017) Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, and Alexander Alemi. Inception-v4, Inception-ResNet and the impact of residual connections on learning. In AAAI Conference on Artificial Intelligence, 2017.
Tieleman & Hinton (2012) Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural networks for machine learning, 4(2):26–31, 2012.
Ulyanov et al. (2016) Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Instance normalization: The missing ingredient for fast stylization. arXiv preprint arXiv:1607.08022, 2016.
Ward et al. (2018) Rachel Ward, Xiaoxia Wu, and Léon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes, from any initialization. arXiv preprint arXiv:1806.01811, 2018.
Wu et al. (2018) Xiaoxia Wu, Rachel Ward, and Léon Bottou. WNGrad: Learn the Learning Rate in Gradient Descent. arXiv preprint arXiv:1803.02865, 2018.
Wu & He (2018) Yuxin Wu and Kaiming He. Group normalization. In The European Conference on Computer Vision (ECCV), September 2018.
Zhou et al. (2018) Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. arXiv preprint arXiv:1808.05671, 2018.
Zou & Shen (2018) Fangyu Zou and Li Shen. On the convergence of adagrad with momentum for training deep neural networks. arXiv preprint arXiv:1808.03408, 2018.
Appendix A Proof for Full-Batch Gradient Descent
By the scaleinvariant property of , we know that . Also, the following identities about derivatives can be easily obtained:
Thus, the assumptions on the smoothness imply
(15)  
(16)  
(17) 
Proof for Lemma 3.2.
Using Taylor expansion, we have , such that for ,
Note that is perpendicular to , we have
Thus,