GradInit: Learning to Initialize Neural Networks for Stable and Efficient Training

02/16/2021
by   Chen Zhu, et al.

Changes in neural architectures have fostered significant breakthroughs in language modeling and computer vision. Unfortunately, novel architectures often require re-thinking the choice of hyperparameters (e.g., learning rate, warmup schedule, and momentum coefficients) to maintain the stability of the optimizer. This optimizer instability is often the result of poor parameter initialization, and can be avoided by architecture-specific initialization schemes. In this paper, we present GradInit, an automated and architecture-agnostic method for initializing neural networks. GradInit is based on a simple heuristic: the variance of each network layer is adjusted so that a single step of SGD or Adam results in the smallest possible loss value. This adjustment is done by introducing a scalar multiplier variable in front of each parameter block and then optimizing these variables using a simple numerical scheme. GradInit accelerates convergence and improves the test performance of many convolutional architectures, both with and without skip connections, and even without normalization layers. It also enables training the original Post-LN Transformer for machine translation without learning rate warmup, under a wide range of learning rates and momentum coefficients. Code is available at https://github.com/zhuchen03/gradinit.
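To make the heuristic concrete, here is a minimal PyTorch sketch of the idea described in the abstract, not the authors' released implementation (that is in the repository linked above). The names model, data_iter, loss_fn and the hyperparameters eta, steps, lr are placeholders; the choice of SGD for the simulated step, the use of a second minibatch to evaluate it, and the clamp that keeps the multipliers positive are simplifying assumptions of this sketch.

```python
# Sketch of the GradInit heuristic: learn one positive scalar per parameter block
# so that the loss after a single simulated SGD step on the rescaled parameters
# is as small as possible, then bake the learned scales into the initialization.
import torch
from torch.func import functional_call


def gradinit(model, data_iter, loss_fn, eta=0.1, steps=100, lr=1e-2):
    # Frozen copy of the initial parameters and one learnable scale per block.
    base, scales = {}, {}
    for name, p in model.named_parameters():
        base[name] = p.detach()
        scales[name] = torch.ones(1, device=p.device, requires_grad=True)
    opt = torch.optim.Adam(scales.values(), lr=lr)

    for _ in range(steps):
        x0, y0 = next(data_iter)  # minibatch for the simulated step
        x1, y1 = next(data_iter)  # minibatch for evaluating the step (assumption)

        # Rescaled initialization: theta_i = alpha_i * theta_i^0
        params = {name: scales[name] * base[name] for name in base}

        # Loss and gradients at the rescaled initialization,
        # kept differentiable w.r.t. the scales via create_graph=True.
        loss0 = loss_fn(functional_call(model, params, (x0,)), y0)
        grads = torch.autograd.grad(loss0, list(params.values()), create_graph=True)

        # One simulated SGD step with learning rate eta, then the post-step loss,
        # which is the objective the scale variables are trained to minimize.
        stepped = {n: p - eta * g for (n, p), g in zip(params.items(), grads)}
        loss1 = loss_fn(functional_call(model, stepped, (x1,)), y1)

        opt.zero_grad()
        loss1.backward()
        opt.step()
        with torch.no_grad():
            for s in scales.values():
                s.clamp_(min=1e-3)  # keep multipliers positive (sketch-level guard)

    # Write the learned scales back into the model's parameters.
    with torch.no_grad():
        for name, p in model.named_parameters():
            p.mul_(scales[name].item())
    return model
```

After this pre-training pass, the model is trained as usual with SGD or Adam; only the per-block scales of the initialization have changed, not the architecture or the optimizer.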

