Equilibrated adaptive learning rates for non-convex optimization

02/15/2015
by Yann N. Dauphin, et al.

Parameter-specific adaptive learning rate methods are computationally efficient ways to reduce the ill-conditioning problems encountered when training large deep networks. Following recent work that strongly suggests that most of the critical points encountered when training such networks are saddle points, we examine how accounting for the presence of negative eigenvalues of the Hessian can help design better-suited adaptive learning rate schemes. We show that the popular Jacobi preconditioner has undesirable behavior in the presence of both positive and negative curvature, and present theoretical and empirical evidence that the so-called equilibration preconditioner is comparatively better suited to non-convex problems. We introduce a novel adaptive learning rate scheme, called ESGD, based on the equilibration preconditioner. Our experiments show that ESGD performs as well as or better than RMSProp in terms of convergence speed, always clearly improving over plain stochastic gradient descent.
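To make the contrast concrete: the Jacobi preconditioner scales coordinate i by the diagonal curvature |H_ii|, whereas the equilibration preconditioner uses the Hessian row norm ||H_i,:||_2, which can be estimated without forming H because E[(Hv)_i^2] = ||H_i,:||_2^2 for v ~ N(0, I). The sketch below illustrates an equilibrated SGD update built on that estimator. It is a minimal illustration, not the paper's exact Algorithm 1: the refresh interval of 20 steps, the damping constant, the step size, and the toy quadratic are assumptions made for this example, and in a deep network the Hessian-vector product would come from a double-backward pass (the R-operator) rather than an explicit matrix.

```python
import numpy as np

def esgd(grad_fn, hvp_fn, theta0, lr=0.05, damping=1e-4, num_steps=1000, seed=0):
    """Minimal sketch of an equilibrated SGD update (not the paper's exact algorithm).

    grad_fn(theta) returns the gradient; hvp_fn(theta, v) returns the
    Hessian-vector product H v.  The equilibration preconditioner
    D_ii ~ ||H_i,:||_2 is estimated from E[(Hv)_i^2] with v ~ N(0, I),
    accumulated as a running average that is refreshed periodically.
    """
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    sq_sum = np.zeros_like(theta)  # running sum of (Hv)^2 samples
    k = 0                          # number of samples accumulated
    for t in range(num_steps):
        if t % 20 == 0:            # refresh the curvature estimate (interval is an assumption)
            v = rng.standard_normal(theta.shape)
            sq_sum += hvp_fn(theta, v) ** 2
            k += 1
        precond = np.sqrt(sq_sum / k) + damping   # approximate row norms of H, plus damping
        theta = theta - lr * grad_fn(theta) / precond
    return theta

# Toy problem: a badly scaled quadratic f(x) = 0.5 x^T A x with curvatures spanning 10^4.
A = np.diag([100.0, 1.0, 0.01])
grad_fn = lambda x: A @ x
hvp_fn = lambda x, v: A @ v          # exact Hessian-vector product for the quadratic
print(esgd(grad_fn, hvp_fn, theta0=[1.0, 1.0, 1.0]))   # all coordinates shrink at a similar rate
```

Because the preconditioner approximates the per-coordinate row norms, every coordinate of the toy quadratic contracts at roughly the same rate despite the four-orders-of-magnitude spread in curvature. The estimate also stays meaningful under mixed curvature: a diagonal entry H_ii can be driven toward zero by cancellation between positive and negative eigenvalue contributions, while the squared terms in the row norm cannot cancel.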

Related research

05/22/2017 - Training Deep Networks without Learning Rates Through Coin Betting
Deep learning methods achieve state-of-the-art performance in many appli...

03/23/2022 - An Adaptive Gradient Method with Energy and Momentum
We introduce a novel algorithm for gradient-based optimization of stocha...

02/09/2022 - Optimal learning rate schedules in high-dimensional non-convex optimization problems
Learning rate schedules are ubiquitously used to speed up and improve op...

04/21/2020 - AdaX: Adaptive Gradient Descent with Exponential Long Term Memory
Although adaptive optimization algorithms such as Adam show fast converg...

06/29/2020 - Adai: Separating the Effects of Adaptive Learning Rate and Momentum Inertia
Adaptive Momentum Estimation (Adam), which combines Adaptive Learning Ra...

05/27/2019 - Stochastic Gradient Methods with Layer-wise Adaptive Moments for Training of Deep Networks
We propose NovoGrad, a first-order stochastic gradient method with layer...

09/17/2021 - AdaLoss: A computationally-efficient and provably convergent adaptive gradient method
We propose a computationally-friendly adaptive learning rate schedule, "...
