Adaptivity without Compromise: A Momentumized, Adaptive, Dual Averaged Gradient Method for Stochastic Optimization

01/26/2021
by Aaron Defazio, et al.

We introduce MADGRAD, a novel optimization method in the family of AdaGrad adaptive gradient methods. MADGRAD shows excellent performance on deep learning optimization problems from multiple fields, including classification and image-to-image tasks in vision, and recurrent and bidirectionally-masked models in natural language processing. For each of these tasks, MADGRAD matches or outperforms both SGD and ADAM in test set performance, even on problems for which adaptive methods normally perform poorly.
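
The abstract does not spell out the update rule, so the following is only a minimal NumPy sketch of a momentumized, dual-averaged, AdaGrad-style step of the kind the title describes, not the paper's definitive algorithm. The sqrt(k+1) step weighting, the cube-root denominator, the momentum-as-averaging form, and the helper name dual_averaged_adaptive_step are illustrative assumptions; consult the paper for the exact MADGRAD update.

```python
# Minimal sketch (not the authors' implementation) of a momentumized,
# dual-averaged, adaptive gradient step. The sqrt(k+1) weighting,
# cube-root denominator, and momentum form are illustrative assumptions.
import numpy as np

def dual_averaged_adaptive_step(state, grad, lr=1e-2, momentum=0.9, eps=1e-6):
    """Apply one update in place; state holds x0, x, s, nu, and the step count k."""
    k = state["k"]
    lam = lr * np.sqrt(k + 1)                 # growing step-size weight
    state["s"] += lam * grad                  # dual-averaged gradient sum
    state["nu"] += lam * grad * grad          # weighted squared-gradient sum
    z = state["x0"] - state["s"] / (np.cbrt(state["nu"]) + eps)
    c = 1.0 - momentum
    state["x"] = (1.0 - c) * state["x"] + c * z   # momentum as iterate averaging
    state["k"] += 1
    return state["x"]

# Toy usage: least-squares objective 0.5 * ||A x - b||^2.
rng = np.random.default_rng(0)
A, b = rng.normal(size=(20, 5)), rng.normal(size=20)
state = {"x0": np.zeros(5), "x": np.zeros(5),
         "s": np.zeros(5), "nu": np.zeros(5), "k": 0}
for _ in range(1000):
    grad = A.T @ (A @ state["x"] - b)
    dual_averaged_adaptive_step(state, grad)
print("final loss:", 0.5 * np.linalg.norm(A @ state["x"] - b) ** 2)
```

Note that, unlike Adam-style exponential moving averages, a dual-averaging scheme accumulates all past (weighted) gradients and re-derives each iterate from the starting point x0, which is the structural difference the title's "dual averaged" refers to.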

Related research

Dual Averaging is Surprisingly Effective for Deep Learning Optimization (10/20/2020)
First-order stochastic optimization methods are currently the most widel...

AdaFamily: A family of Adam-like adaptive gradient methods (03/03/2022)
We propose AdaFamily, a novel method for training deep neural networks. ...

Conjugate-gradient-based Adam for stochastic optimization and its application to deep learning (02/29/2020)
This paper proposes a conjugate-gradient-based Adam algorithm blending A...

Adaptive Stochastic Optimization (01/18/2020)
Optimization lies at the heart of machine learning and signal processing...

LaProp: a Better Way to Combine Momentum with Adaptive Gradient (02/12/2020)
Identifying a divergence problem in Adam, we propose a new optimizer, La...

ACMo: Angle-Calibrated Moment Methods for Stochastic Optimization (06/12/2020)
Due to its simplicity and outstanding ability to generalize, stochastic ...

Kernelized Wasserstein Natural Gradient (10/21/2019)
Many machine learning problems can be expressed as the optimization of s...
