On the Training Instability of Shuffling SGD with Batch Normalization

02/24/2023
by David X. Wu, et al.

We uncover how SGD interacts with batch normalization and can exhibit undesirable training dynamics such as divergence. More precisely, we study how Single Shuffle (SS) and Random Reshuffle (RR) – two widely used variants of SGD – interact surprisingly differently in the presence of batch normalization: RR leads to a much more stable evolution of the training loss than SS. As a concrete example, for regression using a linear network with batch normalization, we prove that SS and RR converge to distinct global optima that are "distorted" away from those of gradient descent. Next, for classification, we characterize conditions under which training divergence for SS and RR can and cannot occur. We present explicit constructions showing how SS leads to distorted optima in regression and to divergence in classification, whereas RR avoids both distortion and divergence. We validate our results by confirming them empirically in realistic settings, and conclude that the separation between SS and RR when used with batch normalization is relevant in practice.
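The distinction between the two shuffling schemes is purely about how mini-batches are drawn: Single Shuffle fixes one permutation of the data and reuses it every epoch, so each batch-norm layer sees the exact same batch compositions (and hence the same batch statistics) epoch after epoch, while Random Reshuffle draws a fresh permutation per epoch. The minimal sketch below (not from the paper; all names are illustrative) makes this concrete:

```python
import numpy as np

def epoch_batches(n, batch_size, mode, rng, fixed_perm=None):
    """Yield index arrays for one epoch's mini-batches.

    mode == "SS" (Single Shuffle): reuse one fixed permutation every epoch,
    so batch-norm statistics are computed over the same fixed groups of
    examples each epoch.
    mode == "RR" (Random Reshuffle): draw a fresh permutation per epoch,
    so batch compositions -- and hence batch statistics -- change each epoch.
    """
    perm = fixed_perm if mode == "SS" else rng.permutation(n)
    for i in range(0, n, batch_size):
        yield perm[i:i + batch_size]

rng = np.random.default_rng(0)
n, b = 12, 4
ss_perm = rng.permutation(n)  # drawn once, before training starts

ss_epochs = [[list(idx) for idx in epoch_batches(n, b, "SS", rng, ss_perm)]
             for _ in range(2)]
rr_epochs = [[list(idx) for idx in epoch_batches(n, b, "RR", rng)]
             for _ in range(2)]

print(ss_epochs[0] == ss_epochs[1])  # True: SS batches are identical every epoch
```

Under SS, the repeated batch groupings let batch normalization "memorize" the fixed batch statistics, which is the mechanism behind the distorted optima and divergence the paper analyzes; RR's per-epoch reshuffling breaks this repetition.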
