# On the Convergence of AdaBound and its Connection to SGD

Adaptive gradient methods such as Adam have gained extreme popularity due to their success in training complex neural networks and their lower sensitivity to hyperparameter tuning compared to SGD. However, it has recently been shown that Adam can fail to converge and might cause poor generalization; this has led to the design of new, sophisticated adaptive methods which attempt to generalize well while being theoretically reliable. In this technical report we focus on AdaBound, a promising, recently proposed optimizer. We present a stochastic convex problem for which AdaBound can provably take arbitrarily long to converge in terms of a factor which is not accounted for in the convergence rate guarantee of Luo et al. (2019). We present a new $O(\sqrt T)$ regret guarantee under different assumptions on the bound functions, and provide empirical results on CIFAR suggesting that a specific form of momentum SGD can match AdaBound's performance while having fewer hyperparameters and lower computational costs.


## 1 Introduction

We consider first-order optimization methods which are concerned with problems of the following form:

$$\min_{x \in \mathcal{F}} f(x) \tag{1}$$

where $\mathcal{F}$ is the feasible set of solutions and $f$ is the objective function. First-order methods typically operate in an iterative fashion: at each step $t$, the current candidate solution $x_t$ is updated using both zeroth- and first-order information about $f$ (e.g., $f(x_t)$ and $\nabla f(x_t)$, or unbiased estimates of each). Methods such as gradient descent and its stochastic counterpart can be written as:

$$x_{t+1} = \Pi_{\mathcal{F}}\left(x_t - \alpha_t \cdot m_t\right) \tag{2}$$

where $\alpha_t$ is the learning rate at step $t$, $m_t$ is the update direction (e.g., $m_t = \nabla f(x_t)$ for deterministic gradient descent), and $\Pi_{\mathcal{F}}$ denotes a projection onto $\mathcal{F}$. The behavior of vanilla gradient-based methods is well-understood under different frameworks and assumptions ($O(\sqrt T)$ regret in the online convex framework (Zinkevich, 2003), $O(1/\sqrt T)$ suboptimality in the stochastic convex framework, and so on).
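As a concrete illustration of the update in Eq. (2), here is a minimal NumPy sketch of one projected (sub)gradient step over a box-shaped feasible set (the function name and the box bounds are our own choices for illustration):

```python
import numpy as np

def projected_gd_step(x, grad, alpha, lo=-1.0, hi=1.0):
    """One step of Eq. (2) with F = [lo, hi]^d: take a gradient step,
    then project back onto the box (coordinate-wise clipping)."""
    return np.clip(x - alpha * grad, lo, hi)
```

For a box-shaped feasible set the projection is just coordinate-wise clipping; general convex sets require their own projection operators.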

Adaptive gradient methods such as AdaGrad (Duchi et al., 2011), RMSProp (Tieleman and Hinton, 2012) and Adam (Kingma and Ba, 2015) propose to compute a different learning rate for each parameter in the model. In particular, the parameters are updated according to the following rule:

$$x_{t+1} = \Pi_{\mathcal{F}}\left(x_t - \eta_t \odot m_t\right) \tag{3}$$

where $\eta_t \in \mathbb{R}^d$ are parameter-wise learning rates and $\odot$ denotes element-wise multiplication. For Adam, we have $\eta_t = \alpha_t/\sqrt{v_t}$ and $m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t$ with $v_t = \beta_2 v_{t-1} + (1-\beta_2)g_t^2$, where $g_t$ captures first-order information of the objective function (e.g., $g_t = \nabla f_t(x_t)$ in the stochastic setting).
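The Adam-style update above can be sketched as follows (a minimal NumPy version without bias correction or projection; `eps` is the usual numerical-stability constant added in implementations):

```python
import numpy as np

def adam_step(x, m, v, g, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step without bias correction, following Eq. (3):
    x <- x - eta * m, with parameter-wise rates eta = alpha / sqrt(v)."""
    m = beta1 * m + (1 - beta1) * g         # first-moment estimate m_t
    v = beta2 * v + (1 - beta2) * g ** 2    # second-moment estimate v_t
    eta = alpha / (np.sqrt(v) + eps)        # parameter-wise learning rates
    return x - eta * m, m, v
```

Note how the effective step size for each coordinate shrinks when that coordinate's recent gradients are large, which is the "adaptivity" discussed throughout this report.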

Adaptive methods have become popular due to their flexibility in terms of hyperparameters, which typically require less tuning than SGD's. In particular, Adam is currently the de facto optimizer for training complex models such as BERT (Devlin et al., 2018) and VQ-VAE (van den Oord et al., 2017).

Recently, it has been observed that Adam has both theoretical and empirical gaps. Reddi et al. (2018) showed that Adam can fail to converge even in the stochastic convex setting, while Wilson et al. (2017) have formally demonstrated that Adam can cause poor generalization, a fact often observed when training simpler CNN-based models such as ResNets (He et al., 2016). While the theoretical gap has been closed in Reddi et al. (2018) with AMSGrad, an Adam variant with provable convergence for online convex problems, achieving SGD-like performance with adaptive methods has remained an open problem.

AdaBound (Luo et al., 2019) is a recently proposed adaptive gradient method that aims to bridge the empirical gap between Adam-like methods and SGD, and consists of enforcing dynamic bounds on the parameter-wise learning rates $\eta_t$ such that, as $t$ goes to infinity, $\eta_t$ converges to a vector whose components are all equal, hence degenerating to SGD. AdaBound comes with a $O(\sqrt T)$ regret rate in the online convex setting, yielding an immediate $O(1/\sqrt T)$ convergence guarantee in the stochastic convex framework due to Cesa-Bianchi et al. (2006). Moreover, empirical experiments suggest that it is capable of outperforming SGD in image classification tasks, problems where adaptive methods have historically failed to provide competitive results.

In Section 3, we highlight issues in the convergence rate proof of AdaBound (Theorem 4 of Luo et al. (2019)), and present a stochastic convex problem for which AdaBound can take arbitrarily long to converge. More importantly, we show that the presented problem leads to a contradiction with the convergence guarantee of AdaBound while satisfying all of its assumptions, implying that Theorem 4 of Luo et al. (2019) is indeed incorrect. In Section 4, we introduce a new assumption which yields a $O(\sqrt T)$ regret guarantee without assuming that the bound functions are monotonic nor that they converge to the same limit. Driven by the new guarantee, in Section 5 we re-evaluate the performance of AdaBound on the CIFAR dataset, and observe that its performance can be matched by a specific form of momentum SGD (SGDM), whose computational cost is significantly smaller than that of Adam-like methods.

## 2 Notation

For vectors $a, b \in \mathbb{R}^d$ and scalar $c$, we use the following notation: $a/b$ for element-wise division ($(a/b)_i = a_i/b_i$), $\sqrt{a}$ for element-wise square root ($(\sqrt{a})_i = \sqrt{a_i}$), $a + c$ for element-wise addition ($(a+c)_i = a_i + c$), and $a \odot b$ for element-wise multiplication ($(a \odot b)_i = a_i b_i$). Moreover, $\|a\|$ is used to denote the $\ell_2$-norm: other norms will be specified whenever used (e.g., $\|a\|_\infty$).

For subscripts and vector indexing, we adopt the following convention: the subscript $t$ is used to denote an object related to the $t$-th iteration of an algorithm (e.g., $x_t$ denotes the iterate at time step $t$); the subscript $i$ is used for indexing: $x_i$ denotes the $i$-th coordinate of $x$. When used together, $t$ precedes $i$: $x_{t,i}$ denotes the $i$-th coordinate of $x_t$.

## 3 AdaBound’s Arbitrarily Slow Convergence

AdaBound is given as Algorithm 1, following (Luo et al., 2019). It consists of an update rule similar to Adam's, except for the extra element-wise clipping operation $\hat\eta_t = \text{Clip}\left(\alpha/\sqrt{v_t},\, \eta_l(t),\, \eta_u(t)\right)$, which assures that $\eta_l(t) \le \hat\eta_{t,i} \le \eta_u(t)$ for all $i$. The bound functions are chosen such that $\eta_l$ is non-decreasing, $\eta_u$ is non-increasing, and $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t) = \alpha^*$, for some $\alpha^* > 0$. It then follows that $\hat\eta_{t,i} \to \alpha^*$ for every coordinate $i$, thus AdaBound degenerates to SGD in the time limit.
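Since the pseudocode of Algorithm 1 is not reproduced here, a single AdaBound step can be sketched as follows (a minimal NumPy sketch based on the description above and on Luo et al. (2019); the projection step is omitted and the function signature is our own):

```python
import numpy as np

def adabound_step(x, m, v, g, t, alpha, eta_l, eta_u, beta1=0.9, beta2=0.999):
    """One AdaBound step: an Adam-style update whose per-parameter learning
    rates are clipped to [eta_l(t), eta_u(t)] before the 1/sqrt(t) decay."""
    m = beta1 * m + (1 - beta1) * g             # first-moment EMA
    v = beta2 * v + (1 - beta2) * g ** 2        # second-moment EMA
    eta_hat = np.clip(alpha / np.sqrt(v), eta_l(t), eta_u(t))  # element-wise clip
    eta = eta_hat / np.sqrt(t)                  # decayed, bounded learning rates
    return x - eta * m, m, v
```

With constant bound functions $\eta_l = \eta_u = \alpha^*$, the clip makes the update exactly momentum SGD with step size $\alpha^*/\sqrt t$, which is the degenerate behavior described above.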

In (Luo et al., 2019), the authors present the following Theorem:

###### Theorem 1.

(Theorem 4 of Luo et al. (2019)) Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\beta_1/\sqrt{\beta_2} < 1$. Suppose $\eta_l(t+1) \ge \eta_l(t) > 0$, $\eta_u(t+1) \le \eta_u(t)$, $\eta_l(t) \to \alpha^*$ as $t \to \infty$, $\eta_u(t) \to \alpha^*$ as $t \to \infty$, and $R_\infty = \eta_u(1)$. Assume that $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$ and $\|\nabla f_t(x)\| \le G_2$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using the AdaBound algorithm, we have the following bound on the regret:

$$R_T \le \frac{D_\infty^2\sqrt T}{2(1-\beta_1)}\sum_{i=1}^d \hat\eta_{T,i}^{-1} + \frac{D_\infty^2}{2(1-\beta_1)}\sum_{t=1}^T\sum_{i=1}^d \beta_{1t}\,\eta_{t,i}^{-1} + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{4}$$

Its proof claims that $\eta_{t,i}^{-1} \ge \eta_{t-1,i}^{-1}$ follows from the definition of $\eta_t$ in AdaBound, a fact that only holds in general if $\sqrt{t}\,\eta_l(t-1) \ge \sqrt{t-1}\,\eta_u(t)$ for all $t$. Even for the bound functions considered in Luo et al. (2019) and used in the released code, this requirement is not satisfied for any $t$. Finally, it is also possible to show that AMSBound does not meet this requirement either, hence the proof of Theorem 5 of Luo et al. (2019) is also problematic.

It turns out that the convergence of AdaBound in the stochastic convex case can be arbitrarily slow, even for bound functions that satisfy the assumptions in Theorem 1:

###### Theorem 2.

For any constant $K \in \mathbb{N}$ and initial step size $\alpha > 0$, there exist bound functions such that $\eta_l(t+1) \ge \eta_l(t) > 0$, $\eta_u(t+1) \le \eta_u(t)$, $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t) = \alpha^*$ for some $\alpha^* > 0$, and a stochastic convex optimization problem for which the iterates produced by AdaBound satisfy $\mathbb{E}[f(x_t)] - \min_{x\in\mathcal{F}}f(x) \ge 0.02$ for all $t \le K$.

###### Proof.

We consider the same stochastic problem as presented in Reddi et al. (2018), for which Adam fails to converge. In particular, a one-dimensional problem over $\mathcal{F} = [-1, 1]$, where $f_t$ is chosen i.i.d. as follows:

$$f_t(x) = \begin{cases} Cx\,, & \text{with probability } p \coloneqq \frac{1+\delta}{C+1}\\ -x\,, & \text{with probability } 1-p\end{cases} \tag{5}$$

Here, $C$ is taken to be large in terms of $\beta_1$ and $\beta_2$, and $\delta = 0.02$. Now, consider the following bound functions:

$$\eta_l(t) = \alpha/C\,,\qquad \eta_u(t;K) = \begin{cases}\alpha/\sqrt{1-\beta_2}\,, & \text{for } t \le K\\ \alpha/C\,, & \text{otherwise}\end{cases} \tag{6}$$

and check that $\eta_l(t)$ and $\eta_u(t;K)$ are non-decreasing and non-increasing in $t$, respectively, and $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t;K) = \alpha/C$. We will show that such bound functions can be effectively ignored for $t \le K$. Check that, for all $t$:

$$v_t = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2 \le C^2(1-\beta_2^t) \le C^2\,,\qquad v_t = (1-\beta_2)\sum_{i=1}^t \beta_2^{t-i}g_i^2 \ge 1-\beta_2^t \ge 1-\beta_2 \tag{7}$$

where we used the facts that $g_t^2 \le C^2$ and that $g_t^2 \ge 1$. Hence, we have, for $t \le K$:

$$\eta_l(t) = \frac{\alpha}{C} \le \frac{\alpha}{\sqrt{v_t}} \le \frac{\alpha}{\sqrt{1-\beta_2}} = \eta_u(t;K) \tag{8}$$

Since $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t;K)$ for all $t \le K$, the clipping operation acts as an identity mapping and $\hat\eta_t = \alpha/\sqrt{v_t}$. Therefore, in this setting, AdaBound produces the same iterates as Adam. We can then invoke Theorem 3 of Reddi et al. (2018), and have that, with $C$ large enough (as a function of $\delta$, $\beta_1$, $\beta_2$), for all $t$, we have $\mathbb{E}[x_t] \ge 0$. In particular, with $\delta = 0.02$, $\mathbb{E}[f(x_t)] - f(x^*) = \delta\left(\mathbb{E}[x_t] + 1\right) \ge 0.02$ for all $t \le K$. Setting $\alpha^* = \alpha/C$ finishes the proof. ∎
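The key step of the proof, namely that the clipping is vacuous on this problem, can be checked numerically (a small simulation of the second-moment recursion under the gradient distribution of Eq. (5); the specific constants below are our own choices):

```python
import numpy as np

rng = np.random.default_rng(0)
C, delta, beta2, alpha, K = 10.0, 0.02, 0.999, 0.1, 2000
p = (1 + delta) / (C + 1)

v = 0.0
for t in range(1, K + 1):
    # Gradient of f_t from Eq. (5): C with probability p, -1 otherwise.
    g = C if rng.random() < p else -1.0
    v = beta2 * v + (1 - beta2) * g ** 2
    # By Eqs. (7)-(8), alpha/sqrt(v_t) never leaves [alpha/C, alpha/sqrt(1-beta2)],
    # so AdaBound's clipping acts as an identity for t <= K.
    assert alpha / C <= alpha / np.sqrt(v) <= alpha / np.sqrt(1 - beta2) + 1e-12
```

The simulation confirms that, for these bound functions, AdaBound's iterates coincide with Adam's on the construction above.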

While the bound functions considered in the Theorem above might seem artificial, the same result holds for bound functions of the form $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$ and $\eta_u(t) = 1 + \frac{1}{\gamma t}$, considered in Luo et al. (2019) and in the publicly released implementation of AdaBound:

###### Claim 1.

Theorem 2 also holds for the bound functions $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$ and $\eta_u(t) = 1 + \frac{1}{\gamma t}$ with

$$\gamma = \frac{1}{K}\cdot\min\left(\frac{\alpha}{C - \alpha},\; \frac{\sqrt{1-\beta_2}}{\alpha}\right) \tag{9}$$
###### Proof.

Check that, for all $t \le K$:

$$\eta_l(t) \le 1 - \frac{1}{\gamma K + 1} \le 1 - \frac{1}{\frac{\alpha}{C-\alpha} + 1} = 1 - \frac{C-\alpha}{C} = \frac{\alpha}{C}\,,\qquad \eta_u(t) \ge 1 + \frac{1}{\gamma K} \ge 1 + \frac{\alpha}{\sqrt{1-\beta_2}} \ge \frac{\alpha}{\sqrt{1-\beta_2}} \tag{10}$$

Hence, for the stochastic problem in Theorem 2, we also have that $\eta_l(t) \le \alpha/\sqrt{v_t} \le \eta_u(t)$ for all $t \le K$. ∎

Note that it is straightforward to prove a similar result for the online convex setting by invoking Theorem 2 instead of Theorem 3 of Reddi et al. (2018) – this would immediately imply that Theorem 1 is incorrect. Instead, Theorem 2 was presented in the stochastic convex setup as it yields a stronger result, and it almost immediately implies that Theorem 1 might not hold:

###### Corollary 1.

There exists an instance where Theorem 1 does not hold.

###### Proof.

Consider AdaBound with the bound functions presented in Theorem 2 and $\beta_1 = 0$. For any sequence $\{f_t\}$ drawn for the stochastic problem in Theorem 2, setting $R_\infty = \eta_u(1) = \alpha/\sqrt{1-\beta_2}$ and $G_2 = C$ in Theorem 1 yields, for $T = K$:

$$R_K = \sum_{t=1}^K\left(f_t(x_t) - f_t(x^*)\right) \le \frac{2dC\sqrt K}{\alpha} + \frac{(2\sqrt K - 1)\alpha C^2}{\sqrt{1-\beta_2}} \tag{11}$$

where we used the fact that $D_\infty = 2$ and $\hat\eta_{K,i} \ge \alpha/C$. Pick $K$ large enough such that the right-hand side of (11) is smaller than $0.01K$. Taking expectation over sequences $\{f_t\}$ and dividing by $K$:

$$\frac{1}{K}\sum_{t=1}^K \mathbb{E}\left[f(x_t)\right] - f(x^*) < 0.01 \tag{12}$$

However, Theorem 2 assures $\mathbb{E}[f(x_t)] - f(x^*) \ge 0.02$ for all $t \le K$, raising a contradiction. ∎

Note that while the above result shows that Theorem 1 is indeed incorrect, it does not imply that AdaBound might fail to converge.

## 4 A New Guarantee

The results in the previous section suggest that Theorem 1 fails to capture all relevant properties of the bound functions. Although it is indeed possible to show that $\lim_{T\to\infty} R_T/T = 0$, it is not clear whether a $O(\sqrt T)$ regret rate can be guaranteed for general bound functions.

It turns out that replacing the previous requirements on the bound functions by the assumption that $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ for all $t$ suffices to guarantee a regret of $O(\sqrt T)$:

###### Theorem 3.

Let $\{x_t\}$ and $\{v_t\}$ be the sequences obtained from Algorithm 1, $\beta_1 = \beta_{11}$, $\beta_{1t} \le \beta_1$ for all $t \in [T]$ and $\beta_1/\sqrt{\beta_2} < 1$. Suppose $\eta_l(t) > 0$ and $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ for all $t \in [T]$, and let $R_\infty = \max_{t\in[T]}\eta_u(t)$. Assume that $\|x - y\|_\infty \le D_\infty$ for all $x, y \in \mathcal{F}$ and $\|\nabla f_t(x)\| \le G_2$ for all $t \in [T]$ and $x \in \mathcal{F}$. For $x_t$ generated using the AdaBound algorithm, we have the following bound on the regret:

$$R_T \le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{13}$$
###### Proof.

We start from an intermediate result of the original proof of Theorem 4 of Luo et al. (2019):

###### Lemma 1.

For the setting in Theorem 3, we have:

$$R_T \le \underbrace{\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\left[\left\|\eta_t^{-1/2}\odot(x_t - x^*)\right\|^2 - \left\|\eta_t^{-1/2}\odot(x_{t+1} - x^*)\right\|^2\right]}_{S_1} + \underbrace{\sum_{t=1}^T \frac{\beta_{1t}}{2(1-\beta_{1t})}\left\|\eta_t^{-1/2}\odot(x_t - x^*)\right\|^2}_{S_2} + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{14}$$
###### Proof.

The result follows from the proof of Theorem 4 in Luo et al. (2019), up to (but not including) Equation 6. ∎

We will proceed to bound $S_1$ and $S_2$ from the above Lemma. Starting with $S_1$:

$$\begin{aligned}
S_1 &= \sum_{i=1}^d\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\left[\eta_{t,i}^{-1}(x_{t,i}-x_i^*)^2 - \eta_{t,i}^{-1}(x_{t+1,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\left[\frac{\eta_{t,i}^{-1}}{2(1-\beta_{1t})} - \frac{\eta_{t-1,i}^{-1}}{2(1-\beta_{1(t-1)})}\right](x_{t,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{\left[\eta_{t,i}^{-1}-\eta_{t-1,i}^{-1}\right]}{2(1-\beta_{1(t-1)})}(x_{t,i}-x_i^*)^2\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\left[\frac{\sqrt t}{\eta_l(t)}-\frac{\sqrt{t-1}}{\eta_u(t-1)}\right]\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{1}{\sqrt t}\left[\frac{t}{\eta_l(t)}-\frac{t-1}{\eta_u(t-1)}\right]\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \sum_{i=1}^d\left[\frac{\eta_{1,i}^{-1}(x_{1,i}-x_i^*)^2}{2(1-\beta_{11})} + \sum_{t=2}^T\frac{M}{\sqrt t}\cdot\frac{(x_{t,i}-x_i^*)^2}{2(1-\beta_{1(t-1)})}\right] \\
&\le \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^d\left[\eta_{1,i}^{-1} + 2M(\sqrt T - 1)\right] = \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\eta_{1,i}^{-1}\right]
\end{aligned} \tag{15}$$

In the second inequality we used $\beta_{1t} \le \beta_{1(t-1)}$, in the third the definition of $\eta_t$ along with the fact that $\eta_l(t)/\sqrt t \le \eta_{t,i} \le \eta_u(t)/\sqrt t$ for all $t$ and $i$, in the fifth the assumption that $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$, and in the sixth we used the bound $D_\infty$ on the feasible region, along with $\sum_{t=2}^T \frac{1}{\sqrt t} \le 2(\sqrt T - 1)$ and $\beta_{1t} \le \beta_1$ for all $t$.

For $S_2$, we have:

$$S_2 = \sum_{i=1}^d\sum_{t=1}^T \frac{1}{2(1-\beta_{1t})}\beta_{1t}\,\eta_{t,i}^{-1}(x_{t,i}-x_i^*)^2 \le \frac{D_\infty^2}{2(1-\beta_1)}\sum_{i=1}^d\sum_{t=1}^T \beta_{1t}\,\eta_{t,i}^{-1} \tag{16}$$

where we used the bound on the feasible region, and the fact that for all .

Combining (15) and (16) into (14), we get:

$$R_T \le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \tag{17}$$

which concludes the proof. ∎

The above regret guarantee is similar to the one in Theorem 4 of Luo et al. (2019), except for the term $2dM(\sqrt T - 1)$, which accounts for the assumption introduced. Note that Theorem 3 does not require $\eta_u$ to be non-increasing, $\eta_l$ to be non-decreasing, nor that $\lim_{t\to\infty}\eta_l(t) = \lim_{t\to\infty}\eta_u(t)$.

It is easy to see that the assumption $\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le M$ indeed holds for the bound functions in Luo et al. (2019):

###### Proposition 1.

For the bound functions

$$\eta_l(t) = 1 - \frac{1}{\gamma t + 1}\,,\qquad \eta_u(t) = 1 + \frac{1}{\gamma t} \tag{18}$$

if $\gamma > 0$, we have:

$$\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} \le 3 + 2\gamma^{-1} \tag{19}$$
###### Proof.

First, check that $\frac{t}{\eta_l(t)} = t + \frac{1}{\gamma}$ and $\frac{t-1}{\eta_u(t-1)} = \frac{\gamma(t-1)^2}{\gamma(t-1)+1} = (t-1) - \frac{t-1}{\gamma(t-1)+1}$. Then, we have:

$$\frac{t}{\eta_l(t)} - \frac{t-1}{\eta_u(t-1)} = 1 + \frac{1}{\gamma} + \frac{t-1}{\gamma(t-1)+1} \le 1 + \frac{2}{\gamma} \le 3 + 2\gamma^{-1} \tag{20}$$

In the first inequality we used $\frac{t-1}{\gamma(t-1)+1} \le \frac{1}{\gamma}$ for all $t \ge 1$, which is equivalent to $\gamma(t-1) \le \gamma(t-1) + 1$. ∎
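Proposition 1 can also be checked numerically (a brute-force sweep over ranges of $t$ and $\gamma$; the grids below are arbitrary, and $t \ge 2$ avoids the undefined $\eta_u(0)$):

```python
def eta_l(t, gamma):
    return 1 - 1 / (gamma * t + 1)

def eta_u(t, gamma):
    return 1 + 1 / (gamma * t)

def lhs(t, gamma):
    # The quantity bounded in Proposition 1 / assumed in Theorem 3.
    return t / eta_l(t, gamma) - (t - 1) / eta_u(t - 1, gamma)

for gamma in (1e-3, 1e-2, 0.1, 1.0, 10.0):
    for t in range(2, 5001):  # t >= 2 so that eta_u(t - 1) is defined
        assert lhs(t, gamma) <= 3 + 2 / gamma
```

The sweep passes for every tested pair, matching the bound in Eq. (19).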

With this in hand, we have the following regret bound for AdaBound:

###### Corollary 2.

Suppose $\eta_l(t) = 1 - \frac{1}{\gamma t + 1}$, $\eta_u(t) = 1 + \frac{1}{\gamma t}$, and $\beta_{1t} = \beta_1/t$ for all $t \in [T]$ in Theorem 3. Then, we have:

$$R_T \le \frac{5\sqrt T}{1-\beta_1}(1+\gamma^{-1})\left(dD_\infty^2 + G_2^2\right) \tag{21}$$
###### Proof.

From the bound in Theorem 3, it follows that:

$$\begin{aligned}
R_T &\le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + \sum_{i=1}^d\left[\eta_{1,i}^{-1} + \sum_{t=1}^T\beta_{1t}\,\eta_{t,i}^{-1}\right]\right] + \frac{(2\sqrt T - 1)R_\infty G_2^2}{1-\beta_1} \\
&\le \frac{D_\infty^2}{2(1-\beta_1)}\left[2dM(\sqrt T - 1) + d(1+\gamma^{-1}) + d\beta_1(1+\gamma^{-1})\sum_{t=1}^T\frac{1}{\sqrt t}\right] + \frac{(2\sqrt T - 1)(1+\gamma^{-1})G_2^2}{1-\beta_1} \\
&\le \frac{D_\infty^2 d(1+\gamma^{-1})}{2(1-\beta_1)}\left[6(\sqrt T - 1) + 1 + 2\beta_1\sqrt T\right] + \frac{2\sqrt T(1+\gamma^{-1})G_2^2}{1-\beta_1} \\
&\le \frac{\sqrt T(1+\gamma^{-1})}{1-\beta_1}\left[4dD_\infty^2 + 2G_2^2\right] \le \frac{5\sqrt T}{1-\beta_1}(1+\gamma^{-1})\left(dD_\infty^2 + G_2^2\right)
\end{aligned} \tag{22}$$

In the first inequality we used the facts that $\eta_{t,i}^{-1} \le \frac{\sqrt t}{\eta_l(t)} \le \sqrt t(1+\gamma^{-1})$, that $\beta_{1t} = \beta_1/t$, and that $R_\infty = \eta_u(1) = 1 + \gamma^{-1}$. In the second, that $M \le 3 + 2\gamma^{-1} \le 3(1+\gamma^{-1})$ from Proposition 1, along with $\sum_{t=1}^T t^{-1/2} \le 2\sqrt T$. In the third, that $6(\sqrt T - 1) + 1 + 2\beta_1\sqrt T \le 8\sqrt T$ since $\beta_1 < 1$. ∎

It is easy to check that the previous results also hold for AMSBound (Algorithm 3 in Luo et al. (2019)), since no assumptions were made on the point-wise behavior of $\eta_t$.

###### Remark 1.

Theorem 3 and Corollary 2 also hold for AMSBound.

## 5 Experiments on AdaBound and SGD

Unfortunately, the regret bound in Corollary 2 is minimized in the limit $\gamma \to \infty$, where AdaBound immediately degenerates to SGD. To inspect whether this fact has empirical value or is just an artifact of the presented analysis, we evaluate the performance of AdaBound when training neural networks on the CIFAR dataset (Krizhevsky, 2009) with an extremely small value for the parameter $\gamma$.

Note that $\gamma = 10^{-3}$ was used for the CIFAR results in Luo et al. (2019), for which we have $\eta_u(t) \le 2$ and $\eta_l(t) \ge 1/2$ after only 3 epochs ($\approx 390$ iterations per epoch for a batch size of 128), hence we believe results with considerably smaller/larger values for $\gamma$ are required to understand its impact on the performance of AdaBound.
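As a rough sanity check of how quickly the default bound functions tighten (assuming the normalized bound functions of Proposition 1, $\gamma = 10^{-3}$, and CIFAR's 50000 training images with batch size 128):

```python
gamma = 1e-3                     # default gamma in the released AdaBound code
iters_per_epoch = 50000 // 128   # CIFAR with batch size 128 -> 390 iterations
t = 3 * iters_per_epoch          # step count after 3 epochs

eta_l = 1 - 1 / (gamma * t + 1)  # lower bound function (normalized, alpha* = 1)
eta_u = 1 + 1 / (gamma * t)      # upper bound function

# After only 3 epochs, both bounds are already within a factor of 2 of the limit.
assert 0.5 < eta_l < 1 < eta_u < 2
```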

We trained a Wide ResNet-28-2 (Zagoruyko and Komodakis, 2016) using the same settings in Luo et al. (2019) and its released code (https://github.com/Luolc/AdaBound, version 2e928c3): 200 epochs, a weight decay of $5\times 10^{-4}$, a learning rate decay of factor 10 at epoch 150, and batch size of 128. For AdaBound, we used the author's implementation with $\alpha = 10^{-3}$, $\beta_1 = 0.9$, $\beta_2 = 0.999$, $\alpha^* = 0.1$, and for SGD we used $\alpha = 0.1$, $\beta_1 = 0.9$. Experiments were done in PyTorch.

To clarify our network choice, note that the model used in Luo et al. (2019) is not a ResNet-34 from He et al. (2016), but a variant used in DeVries and Taylor (2017), often referred to as ResNet-34. In particular, the ResNet-34 from He et al. (2016) consists of 3 stages and less than 0.5M parameters, while the network used in Luo et al. (2019) has 4 stages and around 21M parameters. The network we used has roughly 1.5M parameters.

Our preliminary results suggest that the final test performance of AdaBound is monotonically increasing with $\gamma$. More interestingly, there is no significant difference throughout training between large values of $\gamma$ and the limit $\gamma \to \infty$ (for the latter, the clipping maps every parameter-wise learning rate to $\alpha^*$, and AdaBound degenerates to momentum SGD from the first iteration).

To see why AdaBound with large $\gamma$ behaves so differently than SGDM, check that the momentum updates slightly differ between the two: for AdaBound, we have:

$$m_t = \beta_1 m_{t-1} + (1-\beta_1)g_t \tag{23}$$

while, for the implementation of SGDM used in Luo et al. (2019), we have:

$$m_t = \beta_1 m_{t-1} + (1-\kappa)g_t \tag{24}$$

where $\kappa$ is the dampening factor. The results in Luo et al. (2019) use $\kappa = 0$, which can cause $m_t$ to be larger by a factor of $\frac{1}{1-\beta_1} = 10$ compared to AdaBound. In principle, setting $\kappa = \beta_1$ in SGDM should yield dynamics similar to AdaBound's as long as $\gamma$ is not extremely small.
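The gap between the two momentum buffers, and the effect of setting the dampening equal to $\beta_1$, can be illustrated with a short simulation (a sketch; `sgdm_step` mirrors the PyTorch-style buffer update of Eq. (24)):

```python
def sgdm_step(m, g, beta1=0.9, dampening=0.0):
    """PyTorch-style momentum buffer: m <- beta1 * m + (1 - dampening) * g."""
    return beta1 * m + (1 - dampening) * g

g = 1.0                  # constant gradient, to expose steady-state behavior
m_ema, m_plain = 0.0, 0.0
for _ in range(300):
    m_ema = sgdm_step(m_ema, g, dampening=0.9)      # dampening = beta1: EMA (Eq. 23)
    m_plain = sgdm_step(m_plain, g, dampening=0.0)  # PyTorch/TF default

assert abs(m_ema - 1.0) < 1e-9     # EMA buffer converges to g
assert abs(m_plain - 10.0) < 1e-9  # default buffer converges to g / (1 - beta1)
```

At steady state the default buffer is $1/(1-\beta_1) = 10\times$ larger than the EMA buffer, which is why the two variants need very different learning rates to behave comparably.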

Figure 1 presents our main empirical results: setting an extremely small $\gamma$ causes noticeable performance degradation compared to larger values in AdaBound, as Corollary 2 might suggest. Moreover, setting $\kappa = \beta_1$ in SGDM causes a dramatic performance increase throughout training. In particular, it slightly outperforms AdaBound in terms of final test accuracy (averaged over 5 runs), while being comparably fast and consistent in terms of progress during optimization.

We believe SGDM with $\kappa = \beta_1$ (which is currently not the default in either PyTorch or TensorFlow) might be a reasonable alternative to adaptive gradient methods in some settings, as it also requires less computational resources: per step, AdaBound and Adam update two moment buffers and perform element-wise divisions and square roots (plus clipping, for AdaBound), while SGDM updates a single momentum buffer; in terms of memory, AdaBound and Adam store two auxiliary vectors of size $d$ against SGDM's one. Moreover, AdaBound has 5 hyperparameters ($\alpha$, $\beta_1$, $\beta_2$, $\alpha^*$, $\gamma$), while SGDM with $\kappa = \beta_1$ has only 2 ($\alpha$, $\beta_1$). Studying the effectiveness of ‘dampened’ SGDM, however, requires extensive experiments which are out of the scope of this technical report.

Lastly, we evaluated whether performing bias correction on this form of SGDM affects its performance. More specifically, we divide the learning rate at step $t$ by a factor of $1 - \beta_1^t$. We observed that bias correction has no significant effect on the average performance, but yields smaller variance: the standard deviation of the final test accuracy over 5 runs decreased when bias correction was used.
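The bias correction described above amounts to the following (a one-line sketch; the function name is ours):

```python
def bias_corrected_lr(alpha, beta1, t):
    """Divide the step size at (1-indexed) step t by 1 - beta1**t,
    mirroring Adam-style first-moment bias correction."""
    return alpha / (1 - beta1 ** t)
```

Early steps get a proportionally larger step size (a factor of 10 at $t = 1$ for $\beta_1 = 0.9$), compensating for the zero-initialized momentum buffer; the factor decays to 1 as $t$ grows.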

## 6 Discussion

In this technical report, we identified issues in the proof of the main Theorem of Luo et al. (2019), which presents a $O(\sqrt T)$ regret rate guarantee for AdaBound. We presented an instance where the statement does not hold, and provided a $O(\sqrt T)$ regret guarantee under different, and arguably less restrictive, assumptions. Finally, we observed empirically that AdaBound with a theoretically optimal $\gamma$ indeed yields superior performance, although it degenerates to a specific form of momentum SGD. Our experiments suggest that this form of SGDM (with a dampening factor equal to its momentum) performs competitively to AdaBound on CIFAR.

### Acknowledgements

We are in debt to Rachit Nimavat for proofreading the manuscript and the extensive discussion, and thank Sudarshan Babu and Liangchen Luo for helpful comments.

## References

• N. Cesa-Bianchi, A. Conconi, and C. Gentile (2006) On the generalization ability of on-line learning algorithms. IEEE Trans. Inf. Theory. Cited by: §1.
• J. Devlin, M. Chang, K. Lee, and K. Toutanova (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805. Cited by: §1.
• T. DeVries and G. W. Taylor (2017) Improved regularization of convolutional neural networks with cutout. arXiv:1708.04552. Cited by: §5.
• J. Duchi, E. Hazan, and Y. Singer (2011) Adaptive subgradient methods for online learning and stochastic optimization. JMLR. Cited by: §1.
• K. He, X. Zhang, S. Ren, and J. Sun (2016) Deep residual learning for image recognition. CVPR. Cited by: §1, §5.
• D. P. Kingma and J. Ba (2015) Adam: a method for stochastic optimization. ICLR. Cited by: §1.
• A. Krizhevsky (2009) Learning multiple layers of features from tiny images. Technical report Cited by: §5.
• L. Luo, Y. Xiong, Y. Liu, and X. Sun (2019) Adaptive gradient methods with dynamic bound of learning rate. ICLR (arXiv:1902.09843). Cited by: On the Convergence of AdaBound and its Connection to SGD, §1, §3, §4, §5, §6, Theorem 1.
• S. J. Reddi, S. Kale, and S. Kumar (2018) On the convergence of adam and beyond. ICLR. Cited by: §1, §3, §3.
• T. Tieleman and G. Hinton (2012) Lecture 6.5 - RMSProp: divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning. Cited by: §1.
• A. van den Oord, O. Vinyals, and K. Kavukcuoglu (2017) Neural Discrete Representation Learning. arXiv:1711.00937. Cited by: §1.
• A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht (2017) The Marginal Value of Adaptive Gradient Methods in Machine Learning. NIPS. Cited by: §1.
• S. Zagoruyko and N. Komodakis (2016) Wide residual networks. BMVC. Cited by: §5.
• M. Zinkevich (2003) Online convex programming and generalized infinitesimal gradient ascent. ICML. Cited by: §1.