Amos: An Adam-style Optimizer with Adaptive Weight Decay towards Model-Oriented Scale

10/21/2022
by Ran Tian, et al.

We present Amos, a stochastic gradient-based optimizer designed for training deep neural networks. It can be viewed as an Adam optimizer with theoretically supported, adaptive learning-rate decay and weight decay. A key insight behind Amos is that it leverages model-specific information to determine the initial learning-rate and decaying schedules. When used for pre-training BERT variants and T5, Amos consistently converges faster than the state-of-the-art settings of AdamW, achieving better validation loss within <=70% training steps and time, while requiring <=51% memory for slot variables. Our code is open-sourced at: https://github.com/google-research/jestimator
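The abstract describes Amos only at a high level. As a rough, hypothetical illustration of the ingredients it names (an Adam-style update, a decaying learning rate, decoupled weight decay, and a per-variable scale that modulates both), below is a minimal JAX sketch. The function name adam_style_step and the scale argument are assumptions made for illustration; this is not the Amos update rule itself, which is derived in the paper and implemented in the linked jestimator repository.

```python
# Illustrative sketch only: an Adam-style update with decoupled weight decay
# and a decaying learning rate, modulated by a per-parameter "scale" hint.
# This mirrors the ingredients named in the abstract but is NOT the Amos
# algorithm; see https://github.com/google-research/jestimator for the real one.
import jax.numpy as jnp


def adam_style_step(param, grad, m, v, step, *,
                    scale=1.0,      # hypothetical per-parameter scale hint
                    base_lr=1e-3, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One update of an Adam-style optimizer with LR decay and weight decay."""
    # Exponential moving averages of the gradient and its square (as in Adam).
    m = b1 * m + (1.0 - b1) * grad
    v = b2 * v + (1.0 - b2) * grad * grad
    # Bias correction, as in Adam.
    m_hat = m / (1.0 - b1 ** step)
    v_hat = v / (1.0 - b2 ** step)
    # A simple 1/sqrt(t) learning-rate decay, scaled by the per-parameter hint.
    lr = scale * base_lr / jnp.sqrt(step)
    # Decoupled (AdamW-style) weight decay applied alongside the Adam direction.
    new_param = param - lr * (m_hat / (jnp.sqrt(v_hat) + eps) + wd * param)
    return new_param, m, v
```

In a hypothetical training loop, m and v would be zero-initialized with the parameter's shape and the returned state threaded through successive steps.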

Related research

Training Aware Sigmoidal Optimizer (02/17/2021)
Proper optimization of deep neural networks is an open research question...

HyperAdam: A Learnable Task-Adaptive Adam for Network Training (11/22/2018)
Deep neural networks are traditionally trained using human-designed stoc...

How to decay your learning rate (03/23/2021)
Complex learning rate schedules have become an integral part of deep lea...

To Raise or Not To Raise: The Autonomous Learning Rate Question (06/16/2021)
There is a parameter ubiquitous throughout the deep learning world: lear...

Logit Attenuating Weight Normalization (08/12/2021)
Over-parameterized deep networks trained using gradient-based optimizers...

General Cyclical Training of Neural Networks (02/17/2022)
This paper describes the principle of "General Cyclical Training" in mac...

Tom: Leveraging trend of the observed gradients for faster convergence (09/07/2021)
The success of deep learning can be attributed to various factors such a...
