Understanding the Disharmony between Weight Normalization Family and Weight Decay: ε-shifted L_2 Regularizer

11/14/2019
by   Li Xiang, et al.

The merits of fast convergence and potentially better performance of the weight normalization family have drawn increasing attention in recent years. These methods use standardization or normalization to map the weight W to W', making W' independent of the magnitude of W. Surprisingly, W must still be decayed during gradient descent; otherwise a severe under-fitting problem emerges, which is counter-intuitive since weight decay is widely known to prevent deep networks from over-fitting. In this paper, we theoretically prove that the weight decay term (λ/2)||W||^2 merely modulates the effective learning rate to improve objective optimization, and has no influence on generalization when composed with the weight normalization family. Furthermore, we expose several critical problems that arise when the weight decay term is combined with the weight normalization family, including the absence of a global minimum and training instability. To address these problems, we propose an ϵ-shifted L_2 regularizer, which shifts the L_2 objective by a positive constant ϵ. This simple operation theoretically guarantees the existence of a global minimum, while preventing the network weights from becoming too small and thus avoiding gradient float overflow. It significantly improves training stability and achieves slightly better performance in our practice. The effectiveness of the ϵ-shifted L_2 regularizer is comprehensively validated on the ImageNet, CIFAR-100, and COCO datasets. Our code and pretrained models will be released at https://github.com/implus/PytorchInsight.
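Below is a minimal PyTorch sketch of how such a regularizer could be added to a training loop. The abstract does not give the exact functional form, so this assumes the shifted penalty (λ/2)(||W||_2 − ϵ)^2 per weight tensor, whose minimum sits at ||W||_2 = ϵ > 0 rather than at W = 0; the values of lam and eps are illustrative, not the paper's.

```python
# Sketch of an epsilon-shifted L_2 regularizer (assumed form, see note above).
import torch
import torch.nn as nn

def eps_shifted_l2(model: nn.Module, lam: float = 1e-4, eps: float = 1e-3) -> torch.Tensor:
    """Sum (lam/2) * (||W||_2 - eps)^2 over weight tensors (conv/linear), skipping biases."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for _, p in model.named_parameters():
        if p.dim() > 1:  # weight matrices / conv kernels; biases and norm affine params excluded
            penalty = penalty + 0.5 * lam * (p.norm(p=2) - eps) ** 2
    return penalty

# Usage: disable the optimizer's built-in weight decay and add the penalty to the loss.
model = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU())
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=0.0)
x, target = torch.randn(8, 3, 32, 32), torch.randint(0, 16, (8,))
logits = model(x).mean(dim=(2, 3))  # toy classifier head for the example
loss = nn.functional.cross_entropy(logits, target) + eps_shifted_l2(model)
loss.backward()
optimizer.step()
```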
