Theoretical Analysis of Auto Rate-Tuning by Batch Normalization

12/10/2018
by Sanjeev Arora et al.

Batch Normalization (BN) has become a cornerstone of deep learning across diverse architectures, appearing to help both optimization and generalization. While the idea makes intuitive sense, theoretical analysis of its effectiveness has been lacking. Here, theoretical support is provided for one of its conjectured properties: the ability to allow gradient descent to succeed with less tuning of learning rates. It is shown that even if the learning rate for the scale-invariant parameters (e.g., the weights of each layer with BN) is fixed to a constant (say, 0.3), gradient descent still approaches a stationary point (i.e., a solution where the gradient is zero) at a rate of T^{-1/2} over T iterations, asymptotically matching the best bound for gradient descent with well-tuned learning rates. A similar result with convergence rate T^{-1/4} is shown for stochastic gradient descent.
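The scale invariance at the heart of this result is easy to verify directly: for a linear unit followed by batch normalization, rescaling the incoming weights w -> c*w leaves the loss unchanged, while the gradient shrinks by a factor of 1/c. The minimal PyTorch sketch below (an illustration, not code from the paper; the batch size, feature dimension, and squared loss are arbitrary choices) checks both facts numerically.

import torch

torch.manual_seed(0)
X = torch.randn(64, 10)      # a batch of 64 examples with 10 features
target = torch.randn(64)

def bn_unit_loss(w):
    """Squared loss of a single linear unit followed by batch normalization."""
    z = X @ w                                                   # pre-activations
    z_hat = (z - z.mean()) / torch.sqrt(z.var(unbiased=False) + 1e-8)
    return ((z_hat - target) ** 2).mean()

w = torch.randn(10, requires_grad=True)
c = 3.0
w_scaled = (c * w.detach()).requires_grad_()

loss = bn_unit_loss(w)
loss.backward()
loss_scaled = bn_unit_loss(w_scaled)
loss_scaled.backward()

# Same loss for w and c*w (scale invariance), up to the small BN epsilon.
print(torch.allclose(loss, loss_scaled, atol=1e-5))
# Gradient at c*w equals (1/c) times the gradient at w.
print(torch.allclose(w.grad, c * w_scaled.grad, atol=1e-5))

Roughly speaking, because the gradient of a scale-invariant loss is orthogonal to w, gradient descent never shrinks the weight norm, and the gradient scales like 1/||w||; the effective step size therefore behaves like eta / ||w||^2 and adjusts itself, which is the intuition behind the fixed-learning-rate guarantee.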


Related research

WNGrad: Learn the Learning Rate in Gradient Descent (03/07/2018)
Adjusting the learning rate schedule in stochastic gradient methods is a...

On the Convergence and Robustness of Batch Normalization (09/29/2018)
Despite its empirical success, the theoretical underpinnings of the stab...

Convergence Analysis of Gradient Descent Algorithms with Proportional Updates (01/09/2018)
The rise of deep learning in recent years has brought with it increasing...

Towards Accelerating Training of Batch Normalization: A Manifold Perspective (01/08/2021)
Batch normalization (BN) has become a crucial component across diverse d...

Understanding Generalization in the Interpolation Regime using the Rate Function (06/19/2023)
In this paper, we present a novel characterization of the smoothness of ...

On the influence of roundoff errors on the convergence of the gradient descent method with low-precision floating-point computation (02/24/2022)
The employment of stochastic rounding schemes helps prevent stagnation o...

Convergence Acceleration via Chebyshev Step: Plausible Interpretation of Deep-Unfolded Gradient Descent (10/26/2020)
Deep unfolding is a promising deep-learning technique, whose network arc...
