Why ADAM Beats SGD for Attention Models

12/06/2019
by Jingzhao Zhang, et al.

While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD on important tasks, such as training attention models. The settings under which SGD performs poorly relative to Adam are not yet well understood. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is a root cause of SGD's poor performance. Based on this observation, we study clipped variants of SGD that circumvent this issue and analyze their convergence under heavy-tailed noise. Furthermore, we develop a new adaptive coordinate-wise clipping algorithm (ACClip) tailored to such settings. We then show how adaptive methods like Adam can be viewed through the lens of clipping, which helps explain Adam's strong performance under heavy-tailed noise. Finally, we show that the proposed ACClip outperforms Adam on both BERT pretraining and finetuning tasks.
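As a rough illustration of the two clipping strategies mentioned in the abstract, the NumPy sketch below contrasts global norm clipping with per-coordinate clipping on a gradient that has one heavy-tailed outlier coordinate. This is not the paper's exact ACClip update (which additionally adapts its thresholds), and the function names `clipped_sgd_step` and `coordinatewise_clipped_step` are hypothetical.

```python
import numpy as np

def clipped_sgd_step(x, grad, lr=0.1, tau=1.0):
    """One globally clipped SGD step: rescale the whole gradient so
    its l2 norm never exceeds the threshold tau."""
    scale = min(1.0, tau / (np.linalg.norm(grad) + 1e-12))
    return x - lr * scale * grad

def coordinatewise_clipped_step(x, grad, lr=0.1, tau=1.0):
    """One step with coordinate-wise clipping: each gradient coordinate
    is clipped to [-tau, tau] independently, so a single heavy-tailed
    coordinate cannot shrink the update of all the others."""
    return x - lr * np.clip(grad, -tau, tau)

# Toy gradient whose last coordinate is a heavy-tailed outlier.
x = np.zeros(4)
grad = np.array([0.5, -0.3, 0.2, 50.0])

print(clipped_sgd_step(x, grad))             # global clipping shrinks every coordinate
print(coordinatewise_clipped_step(x, grad))  # only the outlier coordinate is clipped
```

Under global clipping, the outlier inflates the gradient norm and suppresses the update in all coordinates, whereas coordinate-wise clipping leaves the well-behaved coordinates untouched; this is the intuition behind preferring a coordinate-wise scheme under heavy-tailed noise.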
