Related research:
- The Heavy-Tail Phenomenon in SGD
- Stochastic Optimization with Heavy-Tailed Noise via Accelerated Gradient Clipping
- Convergence Rates of Stochastic Gradient Descent under Infinite Noise Variance
- On Minibatch Noise: Discrete-Time SGD, Overparametrization, and Bayes
- Eliminating Sharp Minima from SGD with Truncated Heavy-tailed Noise
- Escaping Saddle Points with Adaptive Gradient Methods
- On the Heavy-Tailed Theory of Stochastic Gradient Descent for Deep Neural Networks
Why ADAM Beats SGD for Attention Models
While stochastic gradient descent (SGD) is still the de facto algorithm in deep learning, adaptive methods like Adam have been observed to outperform SGD on important tasks such as attention models. The settings under which SGD performs poorly relative to Adam are not yet well understood. In this paper, we provide empirical and theoretical evidence that a heavy-tailed distribution of the noise in stochastic gradients is a root cause of SGD's poor performance. Based on this observation, we study clipped variants of SGD that circumvent this issue and analyze their convergence under heavy-tailed noise. Furthermore, we develop a new adaptive coordinate-wise clipping algorithm (ACClip) tailored to such settings. We then show how adaptive methods like Adam can be viewed through the lens of clipping, which helps explain Adam's strong performance under heavy-tailed noise. Finally, we show that the proposed ACClip outperforms Adam on both BERT pretraining and finetuning tasks.
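To make the clipping idea concrete, below is a minimal sketch of coordinate-wise clipped SGD on a toy problem with heavy-tailed gradient noise. It is not the paper's exact ACClip algorithm: the function name, the exponential-moving-average scale estimate `tau`, and all hyperparameters (`lr`, `beta`, `eps`) are illustrative assumptions chosen to show the general mechanism of clipping each coordinate to an adaptively estimated scale.

```python
# Sketch only: illustrates coordinate-wise clipping under heavy-tailed noise,
# not the exact ACClip update rule from the paper.
import numpy as np

def clipped_sgd_step(w, grad, tau, lr=0.1, beta=0.99, eps=1e-8):
    """One coordinate-wise clipped SGD step.

    w    : current parameter vector
    grad : stochastic gradient (possibly heavy-tailed)
    tau  : running per-coordinate scale estimate (same shape as w)
    """
    # Update the per-coordinate scale estimate as an exponential moving
    # average of gradient magnitudes; rare heavy-tailed outliers move it slowly.
    tau = beta * tau + (1.0 - beta) * np.abs(grad)
    # Clip each coordinate of the gradient to its own estimated scale.
    clipped = np.clip(grad, -(tau + eps), tau + eps)
    # Plain SGD step using the clipped gradient.
    w = w - lr * clipped
    return w, tau

# Toy usage: quadratic objective 0.5*||w||^2 with Student-t gradient noise,
# whose variance is infinite for df < 2 (the heavy-tailed regime studied here).
rng = np.random.default_rng(0)
w = np.ones(5)
tau = np.ones(5)
for _ in range(1000):
    noise = rng.standard_t(df=1.5, size=w.shape)
    grad = w + noise
    w, tau = clipped_sgd_step(w, grad, tau)
print(w)  # stays near the minimizer at 0 despite infinite-variance noise
```

The per-coordinate bound is what distinguishes this from global norm clipping: a single coordinate with occasional extreme noise is capped without shrinking the update in every other direction.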