Towards understanding how momentum improves generalization in deep learning

07/13/2022
by Samy Jelassi, et al.

Stochastic gradient descent (SGD) with momentum is widely used for training modern deep learning architectures. While it is well understood that momentum can lead to a faster convergence rate in various settings, it has also been observed that momentum yields better generalization. Prior work argues that momentum stabilizes the SGD noise during training, and that this leads to better generalization. In this paper, we adopt a different perspective and first empirically show that gradient descent with momentum (GD+M) significantly improves generalization compared to gradient descent (GD) in some deep learning problems. From this observation, we formally study how momentum improves generalization. We devise a binary classification setting in which a one-hidden-layer (over-parameterized) convolutional neural network trained with GD+M provably generalizes better than the same network trained with GD, when both algorithms are similarly initialized. The key insight in our analysis is that momentum is beneficial on datasets where the examples share some feature but differ in their margins. In contrast to GD, which memorizes the small-margin data, GD+M still learns the feature in these examples thanks to its historical gradients. Finally, we empirically validate our theoretical findings.
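To make the two update rules concrete, below is a minimal NumPy sketch of plain GD, w_{t+1} = w_t - lr * grad(w_t), and heavy-ball GD+M, which accumulates historical gradients in a velocity term v_t. The learning rate, momentum coefficient, and toy quadratic objective here are illustrative assumptions, not the paper's exact setting or parameterization.

```python
import numpy as np

def gd_plus_m(grad_fn, w0, lr=0.1, beta=0.9, steps=200):
    """Heavy-ball momentum: v_t = beta * v_{t-1} + grad(w_t),
    then w_{t+1} = w_t - lr * v_t. Setting beta=0 recovers plain GD."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)  # historical gradients persist in v
        w = w - lr * v
    return w

# Toy quadratic objective L(w) = 0.5 * w^T A w (illustrative only).
A = np.diag([1.0, 0.05])  # one well-conditioned and one flat direction
grad = lambda w: A @ w

w_gd  = gd_plus_m(grad, [1.0, 1.0], beta=0.0)  # plain GD
w_gdm = gd_plus_m(grad, [1.0, 1.0], beta=0.9)  # GD with momentum
print("GD   :", w_gd)
print("GD+M :", w_gdm)
```

With beta=0.9, the flat (small-gradient) direction keeps making progress because past gradients accumulate in v; this loosely mirrors the paper's intuition that momentum's historical gradients keep learning the shared feature on small-margin examples that plain GD would instead memorize.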

Related research

02/24/2020 · Scheduled Restart Momentum for Accelerated Stochastic Gradient Descent
Stochastic gradient descent (SGD) with constant momentum and its variant...

10/04/2020 · Provable Acceleration of Neural Net Training via Polyak's Momentum
Incorporating a so-called "momentum" dynamic in gradient descent methods...

04/11/2020 · Exploit Where Optimizer Explores via Residuals
To train neural networks faster, many research efforts have been devoted...

10/16/2018 · Quasi-hyperbolic momentum and Adam for deep learning
Momentum-based acceleration of stochastic gradient descent (SGD) is wide...

03/31/2021 · Positive-Negative Momentum: Manipulating Stochastic Gradient Noise to Improve Generalization
It is well-known that stochastic gradient noise (SGN) acts as implicit r...

05/31/2016 · Asynchrony begets Momentum, with an Application to Deep Learning
Asynchronous methods are widely used in deep learning, but have limited ...

10/08/2021 · Momentum Doesn't Change the Implicit Bias
The momentum acceleration technique is widely adopted in many optimizati...
