Toward Understanding Why Adam Converges Faster Than SGD for Transformers

05/31/2023
by Yan Pan, et al.

While stochastic gradient descent (SGD) is still the most popular optimization algorithm in deep learning, adaptive algorithms such as Adam have established empirical advantages over SGD in some deep learning applications, such as training transformers. However, it remains an open question why Adam converges significantly faster than SGD in these scenarios. In this paper, we propose an explanation for why Adam converges faster than SGD using a new concept, directional sharpness. We argue that the performance of optimization algorithms is closely related to the directional sharpness of their update steps, and show that SGD has much worse directional sharpness than adaptive algorithms. We further observe that only a small fraction of the coordinates causes the poor sharpness and slow convergence of SGD, and propose coordinate-wise clipping as a remedy for SGD and other optimization algorithms. We demonstrate the effect of coordinate-wise clipping on reducing sharpness and speeding up the convergence of optimization algorithms under various settings, and show that it improves local loss reduction when only a small fraction of the coordinates has poor sharpness. We conclude that the sharpness-reduction effect of adaptive coordinate-wise scaling is the reason for Adam's success in practice, and suggest coordinate-wise clipping as a universal technique for speeding up deep learning optimization.
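In practice, the coordinate-wise clipping discussed in the abstract amounts to capping each coordinate of the gradient (or update) before the optimizer step, so that a few coordinates with large magnitude cannot dominate the direction. Below is a minimal sketch of this idea in PyTorch, assuming a standard training loop; the function name `coordinatewise_clip_` and the threshold `tau` are illustrative choices, not values or APIs taken from the paper.

```python
import torch

def coordinatewise_clip_(parameters, tau=1e-3):
    """Clip every gradient coordinate to the interval [-tau, tau], in place.

    A sketch of coordinate-wise clipping applied before the optimizer step;
    tau is an illustrative hyperparameter, not a value from the paper.
    """
    for p in parameters:
        if p.grad is not None:
            p.grad.clamp_(min=-tau, max=tau)

# Hypothetical usage with a plain SGD optimizer:
#   loss.backward()
#   coordinatewise_clip_(model.parameters(), tau=1e-3)
#   optimizer.step()
```

Note the contrast with the more common global norm clipping: clipping the whole gradient vector by its norm rescales all coordinates uniformly, whereas coordinate-wise clipping only affects the few coordinates whose magnitude exceeds the threshold, which is what targets the small fraction of badly sharp coordinates described above.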


