The Implicit Biases of Stochastic Gradient Descent on Deep Neural Networks with Batch Normalization

02/06/2021
by Ziquan Liu, et al.

Deep neural networks with batch normalization (BN-DNNs) are invariant to weight rescaling because of their normalization operations. Nevertheless, weight decay (WD) still benefits these weight-scale-invariant networks, an effect often attributed to an increase of the effective learning rate as the weight norms shrink. In this paper, we demonstrate that this explanation is insufficient and investigate the implicit biases of stochastic gradient descent (SGD) on BN-DNNs to provide a theoretical account of the efficacy of weight decay. We identify two implicit biases of SGD on BN-DNNs: 1) the weight norms remain constant under SGD in the continuous-time domain but keep increasing in the discrete-time domain; 2) SGD optimizes weight vectors in fully-connected networks, and convolution kernels in convolutional neural networks, by updating only the components lying in the span of the input features, leaving the components orthogonal to that span unchanged. Consequently, SGD without WD accumulates weight noise orthogonal to the input feature span and cannot eliminate it. Our empirical studies corroborate the hypothesis that weight decay suppresses this weight noise that SGD leaves untouched. Furthermore, we propose weight rescaling (WRS) as a replacement for weight decay that achieves the same regularization effect while avoiding the performance degradation WD causes with some momentum-based optimizers. Our empirical results on image recognition show that, regardless of the optimization method and network architecture, training BN-DNNs with WRS achieves performance similar to or better than training with WD. We also show that training with WRS generalizes better than WD on other computer vision tasks.
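As a concrete illustration of how weight rescaling could slot into an ordinary training loop, below is a minimal PyTorch sketch, not the paper's implementation: after every few SGD steps, the scale-invariant weights (those feeding into a BN layer) are rescaled back to their initial norms. The rescaling interval and the choice of target norm are assumptions made for illustration only; the abstract does not specify them.

import torch
import torch.nn as nn

# Toy BN-DNN: a convolution followed by BN is scale-invariant in its weights,
# since BN renormalizes its input; the final classifier layer is not.
model = nn.Sequential(
    nn.Conv2d(3, 16, 3, padding=1, bias=False),
    nn.BatchNorm2d(16),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)
opt = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)  # no weight decay
criterion = nn.CrossEntropyLoss()

# Only the convolution weight feeds into BN here, so only it is rescaled.
scale_invariant = {"0.weight"}
target_norms = {n: p.detach().norm() for n, p in model.named_parameters()
                if n in scale_invariant}

@torch.no_grad()
def weight_rescale(model, target_norms):
    # Rescale scale-invariant weights back to a fixed norm. Because of BN, the
    # represented function is unchanged, but the rescaling caps the norm growth
    # that discrete-time SGD otherwise accumulates.
    for n, p in model.named_parameters():
        if n in target_norms:
            p.mul_(target_norms[n] / (p.norm() + 1e-12))

for step in range(100):
    x = torch.randn(32, 3, 8, 8)              # stand-in for a real data loader
    y = torch.randint(0, 10, (32,))
    opt.zero_grad()
    criterion(model(x), y).backward()
    opt.step()
    if (step + 1) % 10 == 0:                   # rescaling interval: an assumption
        weight_rescale(model, target_norms)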


