Reverse engineering learned optimizers reveals known and novel mechanisms

11/04/2020
by Niru Maheswaranathan, et al.

Learned optimizers are algorithms that can themselves be trained to solve optimization problems. In contrast to baseline optimizers (such as momentum or Adam) that use simple update rules derived from theoretical principles, learned optimizers use flexible, high-dimensional, nonlinear parameterizations. Although this can lead to better performance in certain settings, their inner workings remain a mystery. How is a learned optimizer able to outperform a well-tuned baseline? Has it learned a sophisticated combination of existing optimization techniques, or is it implementing completely new behavior? In this work, we address these questions by careful analysis and visualization of learned optimizers. We study learned optimizers trained from scratch on three disparate tasks, and discover that they have learned interpretable mechanisms, including momentum, gradient clipping, learning rate schedules, and a new form of learning rate adaptation. Moreover, we show how the dynamics of learned optimizers enable these behaviors. Our results help elucidate the previously murky understanding of how learned optimizers work, and establish tools for interpreting future learned optimizers.
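To make the contrast in the abstract concrete, the sketch below compares a hand-designed update rule (SGD with momentum) against a toy "learned optimizer" whose per-parameter update comes from a small nonlinear network. This is a minimal illustration only, not the authors' architecture or training setup: the function names, the two-layer network, and the feature choice are assumptions, and the weights `theta` are left untrained here, whereas a real learned optimizer would meta-train them across a distribution of tasks.

```python
# Minimal sketch (hypothetical, for illustration): a hand-designed rule vs. a
# "learned" rule whose update is a nonlinear function of gradient features.
import numpy as np

def sgd_momentum_update(grad, state, lr=0.1, beta=0.9):
    """Hand-designed baseline: a fixed, low-dimensional update rule."""
    velocity = beta * state["velocity"] + grad
    return -lr * velocity, {"velocity": velocity}

def learned_update(grad, state, theta):
    """Toy learned rule: a small per-parameter network maps gradient features
    to an update. In practice `theta` would be meta-trained, not random."""
    # Features per parameter: the current gradient plus an accumulated,
    # momentum-like state (one of the mechanisms the paper finds re-emerging).
    features = np.stack([grad, state["accum"]], axis=-1)     # (n_params, 2)
    hidden = np.tanh(features @ theta["W1"] + theta["b1"])   # (n_params, h)
    step = hidden @ theta["W2"] + theta["b2"]                 # (n_params, 1)
    new_accum = 0.9 * state["accum"] + grad                   # recurrent state
    return step[..., 0], {"accum": new_accum}

# Toy usage on a quadratic loss f(x) = 0.5 * ||x||^2, so grad = x.
rng = np.random.default_rng(0)
n, h = 5, 8
theta = {"W1": rng.normal(size=(2, h)) * 0.1, "b1": np.zeros(h),
         "W2": rng.normal(size=(h, 1)) * 0.1, "b2": np.zeros(1)}
x, state = rng.normal(size=n), {"accum": np.zeros(n)}
for _ in range(10):
    step, state = learned_update(grad=x, state=state, theta=theta)
    x = x + step
print("final ||x||:", np.linalg.norm(x))
```

The point of the contrast is the parameterization: the baseline's behavior is fully described by two scalars (lr, beta), while the learned rule's behavior is determined by many weights, which is why reverse-engineering tools are needed to see that mechanisms such as momentum, gradient clipping, and learning rate schedules have been rediscovered inside it.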


