
Empirically explaining SGD from a line search perspective

by Maximus Mutschler, et al.

Optimization in Deep Learning is mainly guided by vague intuitions and strong assumptions, with limited understanding of how and why these work in practice. To shed more light on this, our work provides a deeper understanding of how SGD behaves by empirically analyzing the trajectory taken by SGD from a line search perspective. Specifically, we perform a costly quantitative analysis of the full-batch loss along SGD trajectories of commonly used models trained on a subset of CIFAR-10. Our core results include that the full-batch loss along lines in the update step direction is highly parabolic. Furthermore, we show that there exists a learning rate with which SGD always performs almost exact line searches on the full-batch loss. Finally, we provide a different perspective on why increasing the batch size has almost the same effect as decreasing the learning rate by the same factor.
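The core measurement described above can be illustrated with a minimal sketch (not the authors' code): probe the loss along the line in the negative-gradient direction, fit a parabola to the sampled values, and read off the exact line-search step size from the parabola's vertex. A toy quadratic stands in for the full-batch loss here.

```python
import numpy as np

# Toy quadratic loss standing in for the full-batch loss (an assumption
# for illustration; the paper measures real network losses).
A = np.array([[3.0, 0.5],
              [0.5, 1.0]])

def loss(w):
    return 0.5 * w @ A @ w

def grad(w):
    return A @ w

w = np.array([1.0, -2.0])
d = -grad(w)                    # update step direction (full-batch gradient here)
ts = np.linspace(0.0, 0.5, 21)  # step sizes probed along the line
vals = np.array([loss(w + t * d) for t in ts])

# Fit l(t) ~ a*t^2 + b*t + c; the vertex -b/(2a) is the step size an
# exact line search on this one-dimensional slice would take.
a, b, c = np.polyfit(ts, vals, 2)
t_star = -b / (2 * a)

print(a > 0, t_star, loss(w + t_star * d) < loss(w))
```

For a truly parabolic slice, as the paper reports for the full-batch loss, the fitted vertex coincides with the exact line-search minimizer; for real losses it is an approximation whose quality is what the empirical analysis quantifies.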



