Logit Attenuating Weight Normalization

08/12/2021
by Aman Gupta, et al.

Over-parameterized deep networks trained using gradient-based optimizers are a popular choice for solving classification and ranking problems. Without appropriately tuned ℓ_2 regularization or weight decay, such networks tend to make their output scores (logits) and network weights large, causing the training loss to become too small and the network to lose its adaptivity (ability to move around) in the parameter space. Although regularization is typically understood from an overfitting perspective, we highlight its role in making the network more adaptive and enabling it to escape more easily from weights that generalize poorly. To provide such a capability, we propose a method called Logit Attenuating Weight Normalization (LAWN), which can be stacked onto any gradient-based optimizer. LAWN controls the logits by constraining the weight norms of layers in the final homogeneous sub-network. Empirically, we show that the resulting LAWN variant of the optimizer makes a deep network more adaptive, helping it find minima with superior generalization performance on large-scale image classification and recommender systems. While LAWN is particularly impressive in improving Adam, it greatly improves all optimizers when used with large batch sizes.
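The core mechanism described here, constraining the weight norms of the final layers so the logit scale cannot grow unboundedly, can be sketched as a projection step after each optimizer update. The following is a minimal illustrative sketch in PyTorch, not the authors' implementation; the choice of constrained parameters, the `target_norms` captured at the switch point, and the toy model are assumptions for illustration.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a LAWN-style constraint (assumptions noted above):
# after each optimizer step, project the weights of the final homogeneous
# layer back to a fixed norm, capping the logit scale so the training loss
# cannot collapse by simply inflating the weights.

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# Assumption: the last Linear layer plays the role of the final
# homogeneous sub-network whose norms are to be constrained.
constrained = [model[-1].weight, model[-1].bias]
target_norms = [p.detach().norm() for p in constrained]  # norms frozen here

def train_step(x, y):
    optimizer.zero_grad()
    loss = nn.functional.cross_entropy(model(x), y)
    loss.backward()
    optimizer.step()
    # Projection step: rescale each constrained tensor back to its
    # fixed target norm after the gradient update.
    with torch.no_grad():
        for p, t in zip(constrained, target_norms):
            p.mul_(t / (p.norm() + 1e-12))
    return loss.item()
```

Because the projection is applied after the optimizer update, this sketch composes with any gradient-based optimizer (SGD, Adam, etc.), matching the "stacked onto any optimizer" framing in the abstract.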

