Second-order Information in First-order Optimization Methods

12/20/2019
by Yuzheng Hu, et al.

In this paper, we try to uncover the second-order essence of several first-order optimization methods. For Nesterov Accelerated Gradient, we rigorously prove that the algorithm makes use of the difference between past and current gradients, thereby approximating the Hessian and accelerating training. For adaptive methods, we relate Adam and Adagrad to a powerful technique in computational statistics, Natural Gradient Descent. These adaptive methods can in fact be treated as relaxations of NGD, with the only slight difference lying in the square root in the denominator of the update rules. Skeptical about the effect of this difference, we design a new algorithm, AdaSqrt, which removes the square root from the denominator and scales the learning rate by sqrt(T). Surprisingly, our new algorithm is comparable to various first-order methods (such as SGD and Adam) on MNIST and even beats Adam on CIFAR-10! This phenomenon casts doubt on the conventional view that the square root is crucial and that training without it leads to terrible performance. We argue that, so long as the algorithm tries to exploit second- or even higher-order information about the loss surface, proper scaling of the learning rate alone will guarantee fast training and good generalization performance. To the best of our knowledge, this is the first paper that seriously considers the necessity of the square root in adaptive methods. We believe that our work can shed light on the importance of higher-order information and inspire the design of more powerful algorithms in the future.
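To make the contrast concrete, the sketch below compares a standard Adagrad step with an AdaSqrt-style step as the abstract describes it: the squared-gradient accumulator is kept, the square root in the denominator is dropped, and the step is rescaled by sqrt(T). This is a minimal sketch based only on the abstract; function names and hyperparameters (adagrad_step, adasqrt_step, lr, eps) are illustrative and may differ from the paper's exact formulation.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """Standard Adagrad update: square root in the denominator."""
    accum = accum + grad ** 2                          # running sum of squared gradients
    return theta - lr * grad / (np.sqrt(accum) + eps), accum

def adasqrt_step(theta, grad, accum, t, lr=0.01, eps=1e-8):
    """AdaSqrt-style update per the abstract: no square root in the
    denominator, and the step is rescaled by sqrt(t)."""
    accum = accum + grad ** 2                          # same accumulator as Adagrad
    return theta - lr * np.sqrt(t) * grad / (accum + eps), accum

# Toy usage on the quadratic loss f(x) = 0.5 * ||x||^2, whose gradient is x.
theta = np.array([1.0, -2.0])
accum = np.zeros_like(theta)
for t in range(1, 101):
    grad = theta                                       # gradient of the toy loss
    theta, accum = adasqrt_step(theta, grad, accum, t)
print(theta)                                           # moves toward the minimizer at the origin
```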


Related research

10/26/2022  Adaptive scaling of the learning rate by second order automatic differentiation
In the context of the optimization of Deep Neural Networks, we propose t...

06/16/2020  Curvature is Key: Sub-Sampled Loss Surfaces and the Implications for Large Batch Training
We study the effect of mini-batching on the loss landscape of deep neura...

06/01/2020  ADAHESSIAN: An Adaptive Second Order Optimizer for Machine Learning
We introduce AdaHessian, a second order stochastic optimization algorith...

07/09/2022  Improved Binary Forward Exploration: Learning Rate Scheduling Method for Stochastic Optimization
A new gradient-based optimization approach by automatically scheduling t...

02/28/2020  Do optimization methods in deep learning applications matter?
With advances in deep learning, exponential data growth and increasing m...

07/06/2020  TDprop: Does Jacobi Preconditioning Help Temporal Difference Learning?
We investigate whether Jacobi preconditioning, accounting for the bootst...

06/10/2021  Investigating Alternatives to the Root Mean Square for Adaptive Gradient Methods
Adam is an adaptive gradient method that has experienced widespread adop...
