On the adequacy of untuned warmup for adaptive optimization

10/09/2019
by Jerry Ma, et al.

Adaptive optimization algorithms such as Adam (Kingma & Ba, 2014) are widely used in deep learning. The stability of such algorithms is often improved with a warmup schedule for the learning rate. Motivated by the difficulty of choosing and tuning warmup schedules, Liu et al. (2019) propose automatic variance rectification of Adam's adaptive learning rate, claiming that this rectified approach ("RAdam") surpasses the vanilla Adam algorithm and reduces the need for expensive tuning of Adam with warmup. In this work, we point out various shortcomings of this analysis. We then provide an alternative explanation for the necessity of warmup based on the magnitude of the update term, which is of greater relevance to training stability. Finally, we provide some "rule-of-thumb" warmup schedules, and we demonstrate that simple untuned warmup of Adam performs more-or-less identically to RAdam in typical practical settings. We conclude by suggesting that practitioners stick to linear warmup with Adam, with a sensible default being linear warmup over 2 / (1 - β_2) training iterations.
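As a rough illustration of the suggested default (not the authors' reference implementation), the rule of thumb could be wired up in PyTorch roughly as follows; the model, base learning rate, and training loop here are placeholder assumptions, and any decay schedule could be composed after the warmup phase.

```python
# Minimal sketch: linear warmup of Adam's learning rate over
# 2 / (1 - beta2) iterations, as suggested in the abstract above.
import torch

model = torch.nn.Linear(10, 1)   # placeholder model for illustration
beta2 = 0.999
base_lr = 1e-3                   # assumed base learning rate

optimizer = torch.optim.Adam(model.parameters(), lr=base_lr, betas=(0.9, beta2))

warmup_steps = int(2.0 / (1.0 - beta2))  # = 2000 when beta2 = 0.999

# Linear warmup: the multiplicative factor grows from ~0 to 1 over
# `warmup_steps` updates, then stays at 1.
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps),
)

for step in range(3000):                                  # dummy training loop
    optimizer.zero_grad()
    loss = model(torch.randn(32, 10)).pow(2).mean()       # dummy objective
    loss.backward()
    optimizer.step()
    scheduler.step()
```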


