
Online Learning Rate Adaptation with Hypergradient Descent

by Atilim Gunes Baydin, et al.

We introduce a general method for improving the convergence rate of gradient-based optimizers that is easy to implement and works well in practice. We analyze the effectiveness of the method by applying it to stochastic gradient descent, stochastic gradient descent with Nesterov momentum, and Adam, showing that it improves upon these commonly used algorithms on a range of optimization problems; in particular the kinds of objective functions that arise frequently in deep neural network training. Our method works by dynamically updating the learning rate during optimization using the gradient with respect to the learning rate of the update rule itself. Computing this "hypergradient" needs little additional computation, requires only one extra copy of the original gradient to be stored in memory, and relies upon nothing more than what is provided by reverse-mode automatic differentiation.
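As a rough illustration of the idea in the abstract, the following is a minimal sketch for the plain SGD case only. It assumes the update rule theta_t = theta_{t-1} - alpha * grad_{t-1}; differentiating the objective at theta_t with respect to alpha then gives the hypergradient -(grad_t . grad_{t-1}), so a descent step on alpha adds beta * (grad_t . grad_{t-1}). The names `sgd_hd`, `grad_fn`, `alpha0`, and `beta` are hypothetical and chosen for this example; the default values are illustrative, not recommendations from the paper.

```python
def sgd_hd(grad_fn, theta0, alpha0=0.01, beta=1e-4, steps=100):
    """Sketch of SGD with a hypergradient-adapted learning rate.

    grad_fn: hypothetical callable returning the gradient (a list of floats)
             of the objective at the given parameters.
    beta:    step size used to update the learning rate itself (assumed name).
    """
    theta = list(theta0)
    alpha = alpha0
    # The single extra copy of the gradient that has to be kept in memory.
    prev_grad = [0.0] * len(theta)
    for _ in range(steps):
        grad = grad_fn(theta)
        # The hypergradient w.r.t. alpha is -(grad . prev_grad), so a descent
        # step on alpha adds beta * (grad . prev_grad).
        alpha += beta * sum(g * p for g, p in zip(grad, prev_grad))
        # Ordinary SGD step using the freshly updated learning rate.
        theta = [t - alpha * g for t, g in zip(theta, grad)]
        prev_grad = grad
    return theta, alpha


if __name__ == "__main__":
    # Toy objective f(x) = 0.5 * ||x||^2, whose gradient is x itself.
    theta, alpha = sgd_hd(lambda th: list(th), [1.0, -2.0],
                          alpha0=0.01, beta=0.01, steps=100)
    print("final parameters:", theta)
    print("final learning rate:", alpha)
```

On this toy quadratic, successive gradients point in the same direction, so the learning rate grows from its small initial value until progress per step shrinks. When successive gradients point in opposing directions the dot product is negative and the learning rate is reduced instead.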


Stochastic Learning Rate Optimization in the Stochastic Approximation and Online Learning Settings

In this work, multiplicative stochasticity is applied to the learning ra...

Scaling transition from momentum stochastic gradient descent to plain stochastic gradient descent

The plain stochastic gradient descent and momentum stochastic gradient d...

Automatic Tuning of Stochastic Gradient Descent with Bayesian Optimisation

Many machine learning models require a training procedure based on runni...

The Curse of Unrolling: Rate of Differentiating Through Optimization

Computing the Jacobian of the solution of an optimization problem is a c...

FASFA: A Novel Next-Generation Backpropagation Optimizer

This paper introduces the fast adaptive stochastic function accelerator ...

Adaptive scaling of the learning rate by second order automatic differentiation

In the context of the optimization of Deep Neural Networks, we propose t...

Reverse engineering learned optimizers reveals known and novel mechanisms

Learned optimizers are algorithms that can themselves be trained to solv...