Spherical Perspective on Learning with Batch Norm

06/23/2020
by Simon Roburin, et al.

Batch Normalization (BN) is a prominent deep learning technique. In spite of its apparent simplicity, its implications for optimization have yet to be fully understood. In this paper, we study the optimization of neural networks with BN layers from a geometric perspective. We leverage the radial invariance of groups of parameters, such as neurons for multi-layer perceptrons or filters for convolutional neural networks, and translate several popular optimization schemes onto the L_2 unit hypersphere. This formulation and its associated geometric interpretation shed new light on the training dynamics and on the relation between different optimization schemes. In particular, we use it to derive the effective learning rates of Adam and of stochastic gradient descent (SGD) with momentum, and we show that, in the presence of BN layers, performing SGD alone is actually equivalent to a variant of Adam constrained to the unit hypersphere. Our analysis also leads us to introduce new variants of Adam. We empirically show, over a variety of datasets and architectures, that they improve accuracy in classification tasks. The complete source code for our experiments is available at: https://github.com/ymontmarin/adamsrt
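To make the radial invariance concrete, here is a minimal sketch (assuming PyTorch, which the linked adamsrt repository targets; the snippet is illustrative and not taken from the paper): rescaling the weights of a convolutional filter that feeds a BN layer by any positive factor leaves the output unchanged, so only the direction of the filter, i.e. its projection on the unit hypersphere, matters.

    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # A group of filters followed by Batch Normalization (train mode: batch statistics).
    conv = nn.Conv2d(3, 8, kernel_size=3, bias=False)
    bn = nn.BatchNorm2d(8)

    x = torch.randn(4, 3, 16, 16)
    y = bn(conv(x))

    # Radially rescale the filter weights by an arbitrary positive factor.
    with torch.no_grad():
        conv.weight.mul_(5.0)

    y_rescaled = bn(conv(x))

    # BN normalizes the scale away: the output depends only on the filter direction.
    print(torch.allclose(y, y_rescaled, atol=1e-5))  # True

Because of this invariance, the gradient step seen by such a parameter group is effectively a step on the unit hypersphere, which is what makes the notion of an effective learning rate for SGD with momentum and Adam well defined.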


Related research

06/15/2020 | Slowing Down the Weight Norm Increase in Momentum-based Optimizers
Normalization techniques, such as batch normalization (BN), have led to ...

08/02/2019 | Calibrating the Learning Rate for Adaptive Gradient Methods to Improve Generalization Performance
Although adaptive gradient methods (AGMs) have fast speed in training de...

02/18/2021 | Attempted Blind Constrained Descent Experiments
Blind Descent uses constrained but, guided approach to learn the weights...

03/31/2021 | Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization
It is well-known that stochastic gradient noise (SGN) acts as implicit r...

02/21/2020 | The Break-Even Point on Optimization Trajectories of Deep Neural Networks
The early phase of training of deep neural networks is critical for thei...

04/24/2017 | Active Bias: Training More Accurate Neural Networks by Emphasizing High Variance Samples
Self-paced learning and hard example mining re-weight training instances...

02/24/2023 | On the Training Instability of Shuffling SGD with Batch Normalization
We uncover how SGD interacts with batch normalization and can exhibit un...
