1 Introduction
Deep neural networks (DNNs) are commonly trained using stochastic gradient descent (SGD), or one of its variants. During training, the learning rate is typically decreased according to some schedule (e.g., every few epochs we multiply the learning rate by some constant factor). Determining the learning rate schedule, and its dependency on other factors, such as the minibatch size, has been the subject of a rapidly increasing number of recent empirical works (Hoffer et al. (2017); Goyal et al. (2017); Jastrzebski et al. (2017); Smith et al. (2018) are a few examples). Therefore, it is desirable to improve our understanding of such issues. However, somewhat surprisingly, we observe that we do not even have a satisfying answer to the basic question: Why do we need to decrease the learning rate during training?
At first, it may seem that this question has already been answered. Many previous works have analyzed SGD theoretically (e.g., see Robbins and Monro (1951); Bertsekas (1999); Geary and Bertsekas (2001); Bach and Moulines (2011); Ben-David and Shalev-Shwartz (2014); Ghadimi et al. (2013); Bubeck (2015); Bottou et al. (2016); Ma et al. (2017) and references therein), under various assumptions. In all previous works, to the best of our knowledge, one must assume a vanishing learning rate schedule, averaging of the SGD iterates, partial strong convexity (i.e., strong convexity in some subspace), or the Polyak-Łojasiewicz (PL) condition (Bassily et al., 2018) — so that the SGD increments or the loss (in the convex case) will converge to zero for generic datasets. However, even near its global minima, a neural network loss is not partially strongly convex, and the PL condition does not hold. Therefore, without a vanishing learning rate or iterate averaging, the gradients are only guaranteed to decrease below some constant value, proportional to the learning rate. Thus, in this case, we may fluctuate near a critical point, but never converge to it.
Consequently it may seem that in neural networks we should always decrease the learning rate in SGD or average the weights, to enable the convergence of the weights to a critical point, and to decrease the loss. However, this reasoning does not hold empirically. In many datasets, even with a fixed learning rate and without averaging, we observe that the training loss can converge to zero. For example, we examine the learning dynamics of a ResNet18 trained on CIFAR10 in Figure 1. Even though the learning rate is fixed, the training loss converges to zero (and so does the classification error).
Notably, we do not observe any convergence issues, as we may have suspected from previous theoretical results. In fact, if we decrease the learning rate at any point, this only slows the convergence of the training loss to zero. The main benefit of decreasing the learning rate is that it typically improves generalization performance. Such a contradiction between existing theoretical and empirical results may indicate a significant gap in our understanding. We are therefore interested in closing this gap.
Figure 1: Training of a convolutional neural network on CIFAR10 using stochastic gradient descent with a constant learning rate, softmax output, and a cross-entropy loss. We observe that, approximately: (1) the training loss and (classification) error both decay to zero; (2) after a while, the validation loss starts to increase; and (3) in contrast, the validation (classification) error slowly improves. In Soudry et al. (2018b), the authors observed similar results with momentum.

To do so, we first examine the network dynamics in Figure 1. Since the training error has reached zero after a certain number of iterations, by then the last hidden layer must have become linearly separable. Since the network is trained using the monotone cross-entropy loss (with softmax outputs), increasing the norm of the weights decreases the loss. Therefore, if the loss is minimized then the weights must diverge to infinity — as indeed happens. This weight divergence does not affect the scale-insensitive validation (classification) error, which continues to decrease during training. In contrast, the validation loss starts to increase.
To explain this behavior, Soudry et al. (2018b, a) focused on the dynamics of the last layer, for a fixed separable input and no bias. For Gradient Descent (GD) dynamics, Soudry et al. (2018b, a) proved that the training loss converges to zero as $O(1/t)$, the direction of the weight vector converges to the max margin direction as $O(1/\log t)$, and the validation loss increases as $O(\log t)$. These are similar dynamics to those observed in Figure 1. However, the dynamics of GD are simpler than those of SGD. Notably, it is well known that on smooth functions, for the iterates of GD, the gradient converges to zero even with a fixed learning rate — as long as this learning rate is below some fixed threshold (which depends on the smoothness of the function).
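As a toy illustration of these GD dynamics (our sketch, not the paper's experiment), consider full-batch gradient descent on a logistic loss over a small separable dataset: with a fixed learning rate, the training loss still heads to zero while the weight norm keeps growing.

```python
import numpy as np

# Sketch (ours, not the paper's experiment): full-batch GD on the logistic
# loss over a tiny linearly separable dataset, with a FIXED learning rate.
# The training loss converges toward zero while the weight norm diverges
# (roughly like log t), mirroring the GD dynamics described above.
X = np.array([[2.0, 1.0], [1.5, -0.5], [-2.0, 0.5], [-1.0, -1.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Xy = X * y[:, None]              # fold the labels into the data points

w = np.zeros(2)
eta = 0.1                        # fixed, never decayed
losses, norms = [], []
for t in range(20000):
    m = Xy @ w                                        # margins
    losses.append(np.mean(np.log1p(np.exp(-m))))      # logistic loss
    grad = -(Xy * (1.0 / (1.0 + np.exp(m)))[:, None]).mean(axis=0)
    w -= eta * grad
    norms.append(np.linalg.norm(w))

print(losses[0], losses[-1], norms[-1])
```

Here the loss keeps decreasing without any learning rate decay, while the weight norm grows without bound.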
Our contributions.
In this paper we examine SGD optimization of homogeneous linear classifiers with smooth monotone loss functions, where the data is sampled either with replacement (the sampling regime typically examined in theory), or without replacement (the sampling regime typically used in practice). For simplicity, we focus on binary classification (e.g., logistic regression). First, we prove three basic results:

The norm of the weights diverges to infinity for any learning rate.

For a sufficiently small fixed learning rate, the loss and gradients converge to zero.

The upper bound we derive for the maximal learning rate is proportional to the minibatch size, when the data in SGD is sampled with replacement.
Similar behavior to the last property is also observed in deep networks (Goyal et al., 2017; Smith et al., 2018). Next, given an additional assumption that the loss function has an exponential tail (e.g., logistic regression), we prove that for almost all linearly separable datasets (i.e., except for measure zero cases):

The direction of the weight vector converges to that of the max margin solution.

The margin converges as $O(1/\log t)$, while the training loss converges as $O(1/t)$.
These conclusions for SGD are the same as for GD (Soudry et al., 2018b) — the only difference is the value of the maximal learning rate, which depends on the minibatch size. Therefore, we believe our SGD results might be similarly extended, as was done for GD, to multiclass problems (Soudry et al., 2018a), other loss functions (Nacson et al., 2019), other optimization methods (Gunasekar et al., 2018b), linear convolutional neural networks (Gunasekar et al., 2018a), and hopefully to nonlinear deep networks.
Finally, under the assumption that the SVM support vectors span the dataset, we further characterize the asymptotic behavior of the SGD iterates. Specifically, we show that, if we keep the learning rate proportional to the minibatch size, then:

The minibatch size does not affect the asymptotic convergence rate of SGD, in terms of epochs.

In terms of SGD iterations, the fastest asymptotic convergence rate is obtained at full batch size, i.e., GD.
2 Preliminaries
Consider a dataset $\{(x_n, y_n)\}_{n=1}^N$, with binary labels $y_n \in \{-1, 1\}$. We analyze learning by minimizing an empirical loss of homogeneous linear predictors (i.e., without bias), of the form
(1) $\mathcal{L}(w) = \sum_{n=1}^{N} \ell\left(y_n w^\top x_n\right)$
where $w$ is the weight vector. To simplify notation, we assume that $\forall n: y_n = 1$ — this is true without loss of generality, since we can always redefine $y_n x_n$ as $x_n$.
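A minimal sketch (ours) of this label-folding reduction and of the empirical loss of eq. 1, using the logistic loss as the example $\ell$:

```python
import numpy as np

# A minimal sketch (ours) of eq. 1 with the logistic loss as l: folding
# the labels into the data points (x_n <- y_n * x_n) reduces the empirical
# loss to a sum of per-sample losses of the margins w^T x_n.
def logistic(u):
    return np.log1p(np.exp(-u))

def empirical_loss(w, X, y):
    Xy = X * y[:, None]             # redefine x_n as y_n * x_n
    return logistic(Xy @ w).sum()   # eq. 1: sum_n l(w^T x_n)

X = np.array([[1.0, 2.0], [-1.0, 0.5]])
y = np.array([1.0, -1.0])
w = np.array([3.0, 0.0])
print(empirical_loss(w, X, y))      # both margins equal 3 here
```

Scaling up $w$ decreases this loss, which is the mechanism behind the weight divergence discussed above.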
We are particularly interested in problems that are linearly separable and have a smooth, strictly decreasing, nonnegative loss function. Therefore, we assume:
Assumption 1.
The dataset is strictly linearly separable: $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$.
Given that the data is linearly separable, the maximal margin is strictly positive:
(2) $\gamma \equiv \max_{\|w\| \le 1} \min_n w^\top x_n > 0$
Assumption 2.
$\ell(u)$ is a positive, differentiable, smooth function (i.e., its derivative is Lipschitz), monotonically decreasing to zero¹ (so $\lim_{u \to \infty} \ell(u) = \lim_{u \to \infty} \ell'(u) = 0$ and $\forall u: \ell'(u) < 0$), and $\lim_{u \to -\infty} \ell'(u) \neq 0$.

¹The requirement of nonnegativity and that the loss asymptotes to zero is purely for convenience. It is enough to require that the loss is monotonically decreasing and bounded from below. Any such loss asymptotes to some constant, and is thus equivalent to one that satisfies this assumption, up to a shift by that constant.
Many common loss functions, including the logistic and probit losses, satisfy Assumption 2. Assumption 2 also straightforwardly implies that $\mathcal{L}(w)$ is a $\beta\sigma_{\max}^2$-smooth function, where $\beta$ is the Lipschitz constant of $\ell'$, the columns of the matrix $X$ are all the samples $x_n$, and $\sigma_{\max}$ is the maximal singular value of $X$. Under these conditions, the infimum of the optimization problem is zero, but it is not attained at any finite $w$. Furthermore, no finite critical point exists. We consider minimizing eq. 1 using Stochastic Gradient Descent (SGD) with a fixed learning rate $\eta$, i.e., with steps of the form:
(3) $w(t+1) = w(t) - \frac{\eta}{b} \sum_{n \in B(t)} \ell'\left(w(t)^\top x_n\right) x_n$
where $B(t) \subset \{1, \ldots, N\}$ is a minibatch of $b$ distinct indices, chosen so that $N/b$ is an integer, and so that it satisfies one of the following assumptions. The first option is the assumption of random sampling with replacement:
Assumption 3a.
[Random sampling with replacement] At each iteration $t$ we randomly and uniformly sample a minibatch $B(t)$ of $b$ distinct indices, so each sample has an identical probability of being selected.
For example, this assumption holds if at each iteration we uniformly sample the $b$ indices without replacement from $\{1, \ldots, N\}$, or if we uniformly sample one minibatch out of some fixed partition of the data indices into minibatches of size $b$.
This assumption is rather common in theoretical analyses, but less so in practice. The following alternative sampling method is the one typically used in practice:
Assumption 3b (Sampling without replacement).
At each epoch, the minibatches partition the data: every index in $\{1, \ldots, N\}$ appears in exactly one of the epoch's $N/b$ minibatches.
This way, each sample is chosen exactly once in each epoch, and SGD completes balanced passes over the data. An important special case of this assumption is random sampling without replacement, which is the method commonly used in practice. Other special cases are periodic sampling (round-robin), and even adversarial selection of the order of the samples.
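The two sampling regimes can be sketched as follows (our illustration; `with_replacement` and `without_replacement` are hypothetical helper names, not the paper's):

```python
import numpy as np

# Sketch (ours) of the two sampling regimes for minibatch indices.
# Assumption 3a: each minibatch is drawn uniformly at each iteration
# (indices distinct within a batch). Assumption 3b: within every epoch
# the minibatches partition the indices, so each sample is used exactly
# once per epoch (random order shown; round-robin also satisfies 3b).
rng = np.random.default_rng(0)
N, b = 12, 4

def with_replacement(num_iters):          # Assumption 3a
    return [rng.choice(N, size=b, replace=False) for _ in range(num_iters)]

def without_replacement(num_epochs):      # Assumption 3b
    batches = []
    for _ in range(num_epochs):
        perm = rng.permutation(N)
        batches.extend(perm[i:i + b] for i in range(0, N, b))
    return batches

epoch = without_replacement(1)
seen = np.sort(np.concatenate(epoch))
print(seen)   # every index 0..N-1 appears exactly once
```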
3 Main Result 1: The Loss Converges to a Global Infimum
The weight norm always diverges to infinity, for any learning rate, as we prove next.
Lemma 1.
Let $w(t)$ be the iterates of SGD (eq. 3) with any learning rate $\eta > 0$. Under Assumptions 1 and 2, $\lim_{t \to \infty} \|w(t)\| = \infty$.
Proof.
Since the data is linearly separable, $\exists w_*$ such that $\forall n: w_*^\top x_n > 0$. We examine the dot product of $w_*$ with the iterates of SGD. Since $-\ell'(u) > 0$ for any finite $u$ and $w_*^\top x_n > 0$ for all $n$, this dot product is non-decreasing, so either $w_*^\top w(t) \to \infty$ or the increments vanish. In the first case, from the Cauchy-Schwarz inequality, $\|w(t)\| \ge w_*^\top w(t) / \|w_*\| \to \infty$. In the second case, since $-\ell'(u)$ is strictly positive for any finite value, and achieves zero only as $u \to \infty$, we must have $w(t)^\top x_n \to \infty$, which again implies $\|w(t)\| \to \infty$. Combining both cases, we prove the lemma. ∎
As the weights go to infinity, we wish to understand the asymptotic behavior of the loss. As the next theorem shows, if the fixed learning rate is sufficiently small, then the loss converges to zero.
Theorem 1.
Let $w(t)$ be the iterates of SGD (eq. 3) from any starting point $w(0)$, where samples are either (case 1) selected randomly with replacement (Assumption 3a) and with learning rate
(4) 
or (case 2) sampled without replacement (Assumption 3b) and with learning rate
(5) 
For linearly separable data (Assumption 1) and a smooth monotone loss function (Assumption 2), we have the following, almost surely (with probability 1) in the first case, and surely in the second case:

The loss converges to zero: $\lim_{t \to \infty} \mathcal{L}(w(t)) = 0$.

All samples are correctly classified, given sufficiently long time: $\exists t_0$ such that $\forall t > t_0,\ \forall n:\ w(t)^\top x_n > 0$.

The increments of SGD are square summable: $\sum_{t=0}^{\infty} \|w(t+1) - w(t)\|^2 < \infty$.
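These three conclusions can be checked numerically on a toy separable problem (our sketch, not the paper's experiment; the minibatch gradient is taken as the average over the batch):

```python
import numpy as np

# A numerical sanity check (ours, not from the paper) of Theorem 1's
# conclusions on a toy separable problem: with a small fixed learning rate,
# the loss goes to zero, all samples end up correctly classified, and the
# sum of squared SGD increments stays bounded.
rng = np.random.default_rng(1)
Xy = np.array([[2.0, 1.0], [1.0, -1.0], [1.5, 0.5], [0.5, 1.5]])  # labels folded in
N, b, eta = 4, 2, 0.1

w = np.array([-1.0, -1.0])       # start by misclassifying everything
sq_increments = 0.0
for t in range(40000):
    idx = rng.choice(N, size=b, replace=False)        # sampling as in 3a
    m = Xy[idx] @ w
    grad = (-(1.0 / (1.0 + np.exp(m)))[:, None] * Xy[idx]).mean(axis=0)
    step = -eta * grad
    w += step
    sq_increments += step @ step

final_loss = np.log1p(np.exp(-(Xy @ w))).sum()
print(final_loss, (Xy @ w > 0).all(), sq_increments)
```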
The complete proof of this theorem is given in section A in the appendix. The proof relies on the following key lemma.
Lemma 2.
For any dataset, $\max_{\|w\| \le 1} \min_n w^\top x_n = \min_{v \succeq 0,\ \|v\|_1 = 1} \|Xv\|$. (6)
Proof.
In this proof, we define $\bar{v}$ as the minimizer of the right hand side of eq. 6, and $\bar{w}$ as the maximizer of the optimization problem on the left hand side of the same equation. On the one hand,
(7)
where in the first step we used the Cauchy-Schwarz inequality, and in the second we used the definition of $\bar{v}$ and that $\|\bar{w}\| \le 1$. On the other hand,
(8)
where in the first step we used the definition of the max margin from the left hand side of eq. 6, in the second we used the triangle inequality, and in the third we used that $\|\bar{v}\|_1 = 1$. Together, eqs. 7 and 8 imply the lemma. ∎
This lemma is useful since the SGD weight increments in eq. 3 have the form $Xv$, where $v$ is some vector with nonnegative components. This enables us to bound the norm of the SGD updates using the norm of the full gradient, which allows us to use a similar analysis as for GD. Additionally, we note that the regime we analyze in Theorem 1 is somewhat unusual, as the weight vector goes to infinity. In many previous works it is assumed that there exists a finite critical point, or that the weights are bounded within a compact domain.
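As a numeric sanity check (ours) of the kind of bound this lemma provides, relating $\|Xv\|$ for nonnegative $v$ to the max margin $\gamma$, take the toy points $x_1 = (1, 1)$ and $x_2 = (1, -1)$, for which $\gamma = 1$ (attained by $w = (1, 0)$):

```python
import numpy as np

# Numeric sanity check (ours) of the bound relating ||X v|| for
# nonnegative v to the max margin gamma. For the columns x1 = (1, 1)
# and x2 = (1, -1), the max margin is gamma = 1 (attained by w = (1, 0)),
# and ||X v|| >= gamma * ||v||_1 for every nonnegative v, with equality
# on the simplex at v = (1/2, 1/2).
X = np.array([[1.0, 1.0],
              [1.0, -1.0]]).T       # columns are the data points
gamma = 1.0

rng = np.random.default_rng(0)
V = rng.random((1000, 2))           # random nonnegative vectors
lhs = np.linalg.norm(X @ V.T, axis=0)
rhs = gamma * V.sum(axis=1)         # gamma * ||v||_1
print((lhs >= rhs - 1e-9).all())                  # True
print(np.linalg.norm(X @ np.array([0.5, 0.5])))   # exactly gamma
```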
Theorem 1 Implications.
In both sampling regimes, we obtained that a fixed (non-vanishing) learning rate results in convergence to zero error. In the case of random sampling with replacement (Assumption 3a) we obtained a better upper bound on the learning rate (eq. 4), which is proportional to the minibatch size $b$. Interestingly, this bound matches the empirical findings of Goyal et al. (2017); Smith et al. (2018), who observed that $\eta \propto b$ works well over a large range of $b$. In our case, the relation holds exactly for all $b$ in the maximum learning rate (eq. 4). In contrast, for linear regression, the relation becomes sublinear for large $b$ (Ma et al., 2017).

We also considered here the case where the datapoints are sampled without replacement (Assumption 3b). This is in contrast to most theoretical SGD results, which typically assume sampling with replacement (which is less common in practice). There are a few notable exceptions (Geary and Bertsekas (2001); Bertsekas (2011); Shamir (2016), and references therein). Perhaps the most similar previous result is the classical one of Proposition 2.1 in Geary and Bertsekas (2001), which has a similar sampling schedule, and in which the weights can go to infinity. However, in that result the learning rate must go to zero for the SGD iterates to converge. In our case, we are able to relax this assumption since we focus on linear classification with a monotone loss and separable data.
When assuming sampling without replacement (Assumption 3b), the learning rate bound (eq. 5) becomes significantly lower. This is because such a sampling assumption is very pessimistic (e.g., the samples can be selected by an adversary). Therefore, a small (yet non-vanishing) learning rate is required to guarantee convergence. Such a dependence on $N$ is expected, since in this case we need to use an incremental-gradient-method type of proof, where such low learning rates are common. For example, in Bertsekas (2011), Proposition 3.2b, to get a low final error we must use a correspondingly low learning rate.
4 Main Result 2: The Weight Vector Direction Converges to the Max Margin
Next, we focus on a special case of monotone loss functions:
Definition 1.
A function $f(u)$ has a "tight exponential tail" if there exist positive constants $c$, $a$, $\mu_+$, $\mu_-$, and $u_0$ such that $\forall u > u_0$:
$c\left(1 - e^{-\mu_+ u}\right) e^{-au} \le f(u) \le c\left(1 + e^{-\mu_- u}\right) e^{-au}.$
Assumption 4.
The negative loss derivative $-\ell'(u)$ has a tight exponential tail.
Specifically, this applies to the logistic loss function. Given this additional assumption, we prove that SGD converges to the max margin solution.
Theorem 2.
For almost all datasets for which the assumptions of Theorem 1 hold, if $-\ell'(u)$ has a tight exponential tail (Assumption 4), then the iterates of SGD, for any minibatch size $b$, will behave as:
(9) $w(t) = \hat{w} \log t + \rho(t)$
where $\hat{w}$ is the following max margin separator:
(10) $\hat{w} = \operatorname{argmin}_{w} \|w\|^2 \text{ s.t. } \forall n: w^\top x_n \ge 1$
and the residual $\rho(t)$ is bounded, almost surely in the first case of Theorem 1 (random sampling with replacement), or surely in the second case (sampling without replacement).
Thus, from Theorem 2, for almost any linearly separable dataset (e.g., with probability 1 if the data is sampled from an absolutely continuous distribution), the normalized weight vector converges to the normalized max margin vector, i.e., $w(t)/\|w(t)\| \to \hat{w}/\|\hat{w}\|$, with rate $O(1/\log t)$, identically to GD (Soudry et al., 2018b). Interestingly, the number of minibatches per epoch affects only the constants. Intuitively, this is reasonable, since if we rescale the time units, then the log term in eq. 9 will only add a constant to the residual $\rho(t)$.
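A toy simulation (ours, not the paper's experiment) of this direction convergence: for the folded points $x_1 = (1, 1)$ and $x_2 = (1, -1)$, the solution of $\min \|w\|^2$ s.t. $w^\top x_n \ge 1$ is $\hat{w} = (1, 0)$, and the normalized SGD iterate slowly aligns with it.

```python
import numpy as np

# Toy check (ours) that the SGD direction approaches the max margin
# separator. For the folded points x1 = (1, 1) and x2 = (1, -1), the
# max margin problem min ||w||^2 s.t. w^T x_n >= 1 is solved by
# w_hat = (1, 0). Convergence in direction is logarithmically slow.
rng = np.random.default_rng(0)
Xy = np.array([[1.0, 1.0], [1.0, -1.0]])
w_hat = np.array([1.0, 0.0])

w = np.zeros(2)
eta = 0.2                              # fixed learning rate, minibatch size 1
for t in range(100000):
    x = Xy[rng.integers(2)]            # sample one data point
    m = x @ w
    w += eta * x / (1.0 + np.exp(m))   # -l'(m) = 1/(1 + e^m) for logistic loss

cosine = (w @ w_hat) / (np.linalg.norm(w) * np.linalg.norm(w_hat))
print(cosine)   # slowly approaches 1
```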
Proof idea.
The theorem is proved in appendix section B.1. The proof builds on the results of Soudry et al. (2018b) for GD: as the weights diverge, the loss converges to zero, and only the gradients of the support vectors remain significant. This implies that the gradient direction, as a positive linear combination of support vectors, converges to the direction of the max margin. The main difficulty in extending the proof to the case of SGD is that at each iteration, $w(t)$ is updated using only a subset of the data points. This could potentially lead to a large difference from the GD solution. However, we show that this difference of $w(t)$ from the GD dynamics solution is bounded. The main novel idea here is that in order to calculate this difference at time $t$, we use information on sampling selections made in the future, i.e., at times larger than $t$.
Convergence Rates.
Theorem 2 directly implies the same convergence rates as in GD (Soudry et al., 2018b). Specifically, in the distance,
(11) $\left\| \frac{w(t)}{\|w(t)\|} - \frac{\hat{w}}{\|\hat{w}\|} \right\| = O\left( \frac{1}{\log t} \right)$
in the angle,
(12) $1 - \frac{w(t)^\top \hat{w}}{\|w(t)\| \|\hat{w}\|} = O\left( \frac{1}{\log^2 t} \right)$
and in the margin gap,
(13) $\gamma - \min_n \frac{w(t)^\top x_n}{\|w(t)\|} = O\left( \frac{1}{\log t} \right)$
On the other hand, the loss itself decreases as
(14) $\mathcal{L}(w(t)) = O\left( \frac{1}{t} \right)$
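A one-dimensional sketch (ours) of this $1/t$-type loss decay: for a single folded data point $x = 1$ with logistic loss and gradient descent, $t \cdot \mathcal{L}(w(t))$ approaches the constant $1/\eta$, so the loss decays like $1/t$.

```python
import numpy as np

# 1-D sketch (ours) of the O(1/t) training-loss decay: a single folded
# data point x = 1 with logistic loss and gradient descent. Asymptotically
# e^{w(t)} ~ eta * t, so t * L(w(t)) approaches 1/eta and L(w(t)) ~ 1/t.
eta = 0.5
w = 0.0
checkpoints = {}
for t in range(1, 80001):
    w += eta / (1.0 + np.exp(w))          # GD step: -l'(w) = 1/(1 + e^w)
    if t in (20000, 80000):
        checkpoints[t] = t * np.log1p(np.exp(-w))
print(checkpoints)   # both values near 1/eta = 2
```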
In Figure 2 we visualize these results. Additionally, in Figure 3 we observe that the convergence rates remain nearly the same for different minibatch sizes — as long as we linearly scale the learning rate with the minibatch size, i.e., $\eta \propto b$. This behavior fits the behavior of the maximal learning rate for which SGD converges in the case of sampling with replacement (eq. 4). However, it is not clear from Theorem 2 why the convergence rate stays almost exactly the same with such a linear scaling, since we do not know how $\rho(t)$ depends on $\eta$ and $b$. In the special case where the SVM support vectors span the dataset, we can further characterize the asymptotic dependence on $\eta$ and $b$. We define $P$ as the orthogonal projection matrix onto the subspace spanned by the support vectors, and $\bar{P} = I - P$ as the complementary projection. In addition, we denote $\alpha$ as the SVM dual variables, so that $\hat{w} = \sum_n \alpha_n x_n$.
Theorem 3.
Under the conditions and notation of Theorem 2, for almost all datasets, if in addition the support vectors span the data (i.e., $\operatorname{rank}(X_S) = \operatorname{rank}(X)$, where $X_S$ is a matrix whose columns are only those data points $x_n$ such that $\hat{w}^\top x_n = 1$), then $\rho(t) \to \tilde{w}$, where $\tilde{w}$ is a solution to
(15) 
The theorem is proved in appendix section B.2. Note that $\tilde{w}$ depends only on the dataset and the initialization. This fact enables us to state the following result for the asymptotic behavior of SGD.
Corollary 1.
Under the conditions and notation of Theorem 3, the SGD iterates will behave as:
where $\hat{w}$ is the maximum-margin separator, $\tilde{w}$ is the solution of eq. 15 (which does not depend on $\eta$ and $b$), and the remaining term is vanishing. Therefore, if the step size is kept proportional to the minibatch size, i.e., $\eta \propto b$, changing the number of minibatches is equivalent to linearly rescaling the time units of $w(t)$.
From the corollary, we expect the same asymptotic convergence rates for all batch sizes, as long as we scale the learning rate linearly with the batch size, i.e., keep $\eta \propto b$. This is exactly the behavior we observe in Figure 3. Since changing the number of minibatches is equivalent to linearly rescaling the time units, a smaller number of minibatches per epoch implies faster asymptotic convergence in terms of iterations, assuming full parallelization capabilities (i.e., the minibatch size does not affect the iterate time). Additionally, note that the corollary only guarantees the same asymptotic behavior. In particular, different initializations and datasets can exhibit different behavior initially. It remains an interesting direction for future work to understand the dependence on $\eta$ and $b$ in the case when the support vectors do not span the dataset.
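The linear-scaling prediction can be sketched numerically (our toy experiment; `loss_after` is a hypothetical helper name): with $\eta$ proportional to $b$, different minibatch sizes give nearly the same training loss after the same number of epochs.

```python
import numpy as np

# Sketch (ours) of the linear-scaling prediction: keep eta = eta0 * b and
# run SGD without replacement; different minibatch sizes b then give nearly
# the same training loss after the same number of EPOCHS. The minibatch
# gradient is taken as the average over the batch.
rng = np.random.default_rng(0)
Xy = np.array([[2.0, 1.0], [1.0, -1.0], [1.5, 0.5], [0.5, 1.5]])  # labels folded in
N, eta0, epochs = 4, 0.05, 20000

def loss_after(b):
    eta = eta0 * b                        # linear scaling of the step size
    w = np.zeros(2)
    for _ in range(epochs):
        for idx in rng.permutation(N).reshape(-1, b):  # one epoch, as in 3b
            m = Xy[idx] @ w
            grad = (-(1.0 / (1.0 + np.exp(m)))[:, None] * Xy[idx]).mean(axis=0)
            w -= eta * grad
    return np.log1p(np.exp(-(Xy @ w))).sum()

l1, l2, l4 = loss_after(1), loss_after(2), loss_after(4)
print(l1, l2, l4)   # nearly identical per-epoch behavior
```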
Lastly, for the logistic regression loss, the validation loss (calculated on an independent validation set) increases as $O(\log t)$.
Notably, as was observed in Soudry et al. (2018b), these asymptotic rates also match what we observe numerically for the convnet in Figure 1: the training loss decreases as $O(1/t)$, the validation loss increases as $O(\log t)$, and the validation (classification) error improves very slowly, similarly to the logarithmic decay of the angle gap (so the convnet might have a similarly slow decay to its respective implicit bias).
5 Discussion and Related Works
In Theorem 1 we proved that for monotone smooth loss functions on linearly separable data, the iterates of SGD with a sufficiently small (but nonvanishing) learning rate converge to zero loss. In contrast to typical convergence to finite critical points, in this case, the "noise" inherent in SGD vanishes asymptotically. Therefore, we do not need to decrease the learning rate, or average the SGD iterates, to ensure exact convergence. Decaying the learning rate during training will only decrease the convergence speed of the loss.
To the best of our knowledge, such an exact convergence result previously required that either (1) the loss function is partially strongly convex, i.e., strongly convex except on some subspace (where the dynamics are frozen), as shown in Ma et al. (2017) for the case of overparameterized linear regression (with more parameters than samples); or (2) the Polyak-Łojasiewicz (PL) condition applies (Bassily et al., 2018). However, in this paper we do not require such conditions, which do not hold for deep networks, even in the vicinity of the (finite or infinite) critical points. Moreover, the dependence of the learning rate on the minibatch size is different, as we discuss next.
We proved Theorem 1 both for random sampling with replacement (Assumption 3a) and for sampling without replacement (Assumption 3b). In the first case, eq. 4 implies that, to guarantee convergence, we need to increase the learning rate proportionally to the minibatch size. In the second case (sampling without replacement) the learning rate bound (eq. 5) is more pessimistic, since our assumption is more general (e.g., it includes adversarial sampling).
In Theorem 2, we proved, given the additional assumption of an exponential tail (e.g., as in logistic regression), that for almost all datasets the weight vector converges to the max margin in direction as $O(1/\log t)$, and that the training loss converges to zero as $O(1/t)$. We believe these results could be extended to every dataset, using the techniques of Soudry et al. (2018a). Again, decaying the learning rate will only degrade the convergence speed to the max margin direction. In fact, the results of Nacson et al. (2019) indicate that we may need to increase the learning rate to improve convergence: for GD, Nacson et al. (2019) proved that this can drastically improve the margin convergence rate. It is yet to be seen if such results might also be applied to deep networks.
In Theorem 3 we further characterized the asymptotic behaviour of the weights under the additional assumption that the SVM support vectors span the dataset. Combining the results of Theorems 2 and 3, we obtain Corollary 1. This corollary states that, under linear scaling of the learning rate with the batch size, the asymptotic convergence rate of SGD, in terms of epochs, is not affected by the minibatch size.
Thus, we have shown that exact linear scaling of the learning rate with the minibatch size ($\eta \propto b$) is beneficial in two ways: (a) in Theorem 1, for the upper bound on the learning rate in the case of random sampling with replacement; (b) in Corollary 1, for the asymptotic behaviour of the weights, assuming a tight exponential loss tail and that the SVM support vectors span the data. This exact linear scaling stands in contrast to previous theoretical results with exact convergence (Ma et al., 2017), in which there exists a "saturation limit": above this limit we should not increase the learning rate linearly with the minibatch size, or the convergence rate will be degraded, and eventually we will lose the convergence guarantee. As predicted by Corollary 1, in Figure 3 we observe that with a linear scaling $\eta \propto b$, the convergence plots exactly match: as we can see, there is almost no asymptotic difference between different minibatch sizes. Therefore, in contrast to Ma et al. (2017), there is no "optimal" minibatch size. In this case, to minimize the number of SGD iterations we should use the largest minibatch possible. This will speed up convergence in wall clock time (as was done in Goyal et al. (2017); Smith et al. (2018)) if it is possible to parallelize the calculation of a minibatch — so that one SGD update with a large minibatch takes less time than the corresponding number of SGD updates with smaller minibatches.
An early version of this manuscript previously appeared on arXiv. However, it contained only the results in the case of sampling without replacement, and no Theorem 3. Two other related SGD results appeared on arXiv in parallel (with less than a week difference).
First, Ji and Telgarsky (2018) analyzed logistic regression optimized by SGD on separable data (in addition to other results on GD when the data is non-separable). Ji and Telgarsky (2018) also assume a fixed learning rate, but use averaging of the iterates (which is known to enable exact convergence). They focus on the case in which the datapoints are independently sampled from a separable distribution, while we focus on the case of sampling from a fixed dataset. They show that, with high probability, the population risk converges to zero as $\tilde{O}(1/t)$. As explained in Ji and Telgarsky (2018), such a fast rate was previously proven only for strongly convex loss functions (the logistic loss is not strongly convex). We showed a similar rate, but for the empirical risk (eq. 14). We additionally showed that the weight vector converges in direction to that of the max margin.
Second, among other results, Xu et al. (2018) also examined optimizing logistic regression with SGD on a fixed dataset, using random sampling with replacement, iterate averaging, and a vanishing learning rate. There, in Theorems 3.2 and 3.3, it is shown that the expectation of the loss and the expectation of the averaged iterates (in norm) converge at rates slower than our results. Thus, in contrast to both works (Ji and Telgarsky, 2018; Xu et al., 2018), we did not assume iterate averaging or a decreasing learning rate. Additionally, our new results on sampling with replacement give a linear relationship between the learning rate and the minibatch size, and Corollary 1 shows the effect of the minibatch size on the asymptotic convergence rate.
6 Conclusions
We found that, for logistic regression with no bias on separable data, SGD behaves similarly to GD in terms of the implicit bias and convergence rate. The only difference is that the maximum possible learning rate changes proportionally to the minibatch size. It remains to be seen if this also holds for deep networks.
Acknowledgements
The authors are grateful to C. Zeno, and I. Golan for helpful comments on the manuscript. This research was supported by the Israel Science foundation (grant No. 31/1031), and by the Taub foundation. A Titan Xp used for this research was donated by the NVIDIA Corporation. NS was partially supported by NSF awards IIS1302662 and IIS1764032.
References

Bach and Moulines (2011)
Francis Bach and Eric Moulines.
NonAsymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning.
NIPS, pages –, 2011.  Bassily et al. (2018) Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in nonconvex overparametrized learning. pages 1–7, 2018.
Ben-David and Shalev-Shwartz (2014) Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
 Bertsekas (1999) D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
 Bertsekas (2011) Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, jul 2011.
Bottou et al. (2016) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint, 2016.
Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends® in Machine Learning, 8(3-4):231–357, 2015.
 Geary and Bertsekas (2001) A. Geary and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. Proceedings of the 38th IEEE Conference on Decision and Control (Cat. No.99CH36304), 1(1):907–912, 2001.
Ghadimi et al. (2013) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Minibatch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Math. Prog., 155(1-2):267–305, 2013.
Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.
Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In NIPS, 2018a.
 Gunasekar et al. (2018b) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
 Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In Advances in Neural Information Processing Systems, pages 1729–1739, 2017.
 Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv, pages 1–21, 2017.
 Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Overparametrized Learning. arXiv preprint, 2017.
Nacson et al. (2019) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. AISTATS, 2019.
 Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Shamir (2016) Ohad Shamir. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization. pages 1–36, 2016.
 Smith et al. (2018) Samuel L. Smith, PieterJan Kindermans, Chris Ying, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
 Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint: 1710.10345v3, 2018a.
 Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. ICLR, 2018b.
Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models? arXiv, 2018.
References

Bach and Moulines (2011) Francis Bach and Eric Moulines. Non-Asymptotic Analysis of Stochastic Approximation Algorithms for Machine Learning. In NIPS, 2011.
Bassily et al. (2018) Raef Bassily, Mikhail Belkin, and Siyuan Ma. On exponential convergence of SGD in non-convex over-parametrized learning. arXiv preprint, pages 1–7, 2018.
Ben-David and Shalev-Shwartz (2014) Shai Ben-David and Shai Shalev-Shwartz. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
Bertsekas (1999) D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
Bertsekas (2011) Dimitri P. Bertsekas. Incremental proximal methods for large scale convex optimization. Mathematical Programming, 129(2):163–195, 2011.
Bottou et al. (2016) Léon Bottou, Frank E. Curtis, and Jorge Nocedal. Optimization Methods for Large-Scale Machine Learning. arXiv preprint, 2016.
Bubeck (2015) Sébastien Bubeck. Convex optimization: Algorithms and complexity. Foundations and Trends in Machine Learning, 8(3–4):231–357, 2015.
Geary and Bertsekas (2001) A. Geary and D.P. Bertsekas. Incremental subgradient methods for nondifferentiable optimization. In Proceedings of the 38th IEEE Conference on Decision and Control, 1(1):907–912, 2001.
Ghadimi et al. (2013) Saeed Ghadimi, Guanghui Lan, and Hongchao Zhang. Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 155(1–2):267–305, 2013.
Goyal et al. (2017) Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint, 2017.
Gunasekar et al. (2018a) Suriya Gunasekar, Jason Lee, Daniel Soudry, and Nathan Srebro. Implicit Bias of Gradient Descent on Linear Convolutional Networks. In NIPS, 2018a.
Gunasekar et al. (2018b) Suriya Gunasekar, Jason D. Lee, Daniel Soudry, and Nathan Srebro. Characterizing implicit bias in terms of optimization geometry. In ICML, 2018b.
Hoffer et al. (2017) Elad Hoffer, Itay Hubara, and Daniel Soudry. Train longer, generalize better: closing the generalization gap in large batch training of neural networks. In NIPS, pages 1729–1739, 2017.
Jastrzebski et al. (2017) Stanislaw Jastrzebski, Zachary Kenton, Devansh Arpit, Nicolas Ballas, Asja Fischer, Yoshua Bengio, and Amos Storkey. Three Factors Influencing Minima in SGD. arXiv preprint, pages 1–21, 2017.
Ji and Telgarsky (2018) Ziwei Ji and Matus Telgarsky. Risk and parameter convergence of logistic regression. arXiv preprint arXiv:1803.07300v2, 2018.
Ma et al. (2017) Siyuan Ma, Raef Bassily, and Mikhail Belkin. The Power of Interpolation: Understanding the Effectiveness of SGD in Modern Over-parametrized Learning. arXiv preprint, 2017.
Nacson et al. (2019) Mor Shpigel Nacson, Jason Lee, Suriya Gunasekar, Nathan Srebro, and Daniel Soudry. Convergence of Gradient Descent on Separable Data. In AISTATS, 2019.
Robbins and Monro (1951) Herbert Robbins and Sutton Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, 22(3):400–407, 1951.
Shamir (2016) Ohad Shamir. Without-Replacement Sampling for Stochastic Gradient Methods: Convergence Results and Application to Distributed Optimization. arXiv preprint, pages 1–36, 2016.
Smith et al. (2018) Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, and Quoc V. Le. Don’t Decay the Learning Rate, Increase the Batch Size. In ICLR, 2018.
Soudry et al. (2018a) Daniel Soudry, Elad Hoffer, Mor Shpigel Nacson, Suriya Gunasekar, and Nathan Srebro. The implicit bias of gradient descent on separable data. arXiv preprint arXiv:1710.10345v3, 2018a.
Soudry et al. (2018b) Daniel Soudry, Elad Hoffer, and Nathan Srebro. The implicit bias of gradient descent on separable data. In ICLR, 2018b.
Xu et al. (2018) Tengyu Xu, Yi Zhou, Kaiyi Ji, and Yingbin Liang. When Will Gradient Methods Converge to Max-margin Classifier under ReLU Models? arXiv preprint, 2018.
Appendix A Proof of Theorem 1
Our proof relies on Lemma 2. Specifically, since we assumed , this lemma implies that
(17) 
Next, we will rely on this key fact to prove our results for each case.
a.1 Case 1: Random sampling with replacement
From the smoothness of the loss
Taking expectation, we have
where in we defined as a random variable equal to if sample is selected at time , or otherwise; in we used the definition of ; in we used and ; and in we used eq. 17. Therefore, if (18) 
then
and we can write
Summing over we have
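As a side illustration of the sampling scheme analyzed in this case, here is a minimal, self-contained sketch of fixed-learning-rate SGD with with-replacement sampling; the quadratic toy loss, data, and learning rate are our own placeholders, not the paper's setting:

```python
import random

random.seed(0)  # deterministic run for illustration

def sgd_with_replacement(w, grad_i, n, lr, steps):
    """SGD where each step draws one sample index uniformly at random,
    with replacement, from {0, ..., n-1} (Case 1 of the proof)."""
    for _ in range(steps):
        i = random.randrange(n)    # independent draw at every step
        w -= lr * grad_i(w, i)     # update on the single sampled loss
    return w

# Toy problem: average of per-sample losses 0.5*(w - x_i)^2,
# whose full-batch minimizer is the mean of the x_i (here 3.0).
xs = [1.0, 2.0, 3.0, 6.0]
grad = lambda w, i: w - xs[i]
w_final = sgd_with_replacement(0.0, grad, len(xs), lr=0.1, steps=5000)
```

In this non-interpolating toy problem the iterate keeps fluctuating around the minimizer rather than converging to it, matching the introduction's point that with a fixed learning rate the gradients are only guaranteed to fall below a constant proportional to the learning rate.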
a.2 Case 2: Sampling without replacement
Linear separability enforces a lower bound on the norm of these increments (eq. 17, which follows from Lemma 2). This enables us to bound the SGD increments, and other related quantities, in terms of the norm of the full gradient (Lemma 3 below).
Lemma 3.
For all and , such that and are in the same epoch, we have
Proof.
See appendix section A.3. ∎
Together, these bounds enable us to complete the proof. First, we assume that is the first iteration in some epoch, i.e., for some . The smoothness of the loss function (Assumption 1), implies that is smooth. This entails that
(21) 
and therefore,
where in we used eq. 21 and the first two equations in Lemma 3, in we recalled our assumption in eq. 5, and in we denoted . Summing over we obtain
since and according to our assumption on .
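The smoothness step invoked around eq. 21 is the standard descent lemma; since the paper's symbols did not survive extraction, we sketch its generic form in our own notation (an L-smooth loss f and step size η), which is only an assumption about the intended statement:

```latex
% Descent lemma: for an L-smooth function f and any points x, y,
f(y) \le f(x) + \nabla f(x)^{\top}(y - x) + \tfrac{L}{2}\,\|y - x\|^{2}.
% Applied to a gradient step y = x - \eta \nabla f(x), it yields
f\big(x - \eta \nabla f(x)\big) \le f(x)
    - \eta\Big(1 - \tfrac{L\eta}{2}\Big)\,\|\nabla f(x)\|^{2}.
```

For step sizes η < 2/L the coefficient of the squared gradient norm is positive, which is the standard mechanism for extracting a per-step decrease of the loss.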
Next, we consider a general time (i.e., not only the first iteration of each epoch, as we assumed until now). We note that, for any such that is in the same epoch as , we have that
where we used the last equation in Lemma 3. Thus, combining the last two equations we obtain
(22) 
which also implies that . Next, we recall eq. 17 to obtain
Therefore, . Since is strictly positive, and equals zero only at (from Assumption 1), we obtain that
Finally, using eq. 17 again, we obtain
(23) 
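For contrast with Case 1, the without-replacement (epoch shuffling) scheme analyzed in this case can be sketched as follows; again, the toy data, loss, and learning rate are placeholders of ours, not the paper's setting:

```python
import random

random.seed(1)  # deterministic run for illustration

def sgd_without_replacement(w, grad_i, n, lr, epochs):
    """SGD where every epoch visits each sample exactly once,
    in a freshly shuffled order (Case 2 of the proof)."""
    order = list(range(n))
    for _ in range(epochs):
        random.shuffle(order)   # new permutation each epoch
        for i in order:         # each index used exactly once per epoch
            w -= lr * grad_i(w, i)
    return w

xs = [1.0, 2.0, 3.0, 6.0]
grad = lambda w, i: w - xs[i]   # gradient of 0.5*(w - x_i)^2
w_final = sgd_without_replacement(0.0, grad, len(xs), lr=0.1, epochs=500)
```

The only difference from the with-replacement sketch is the per-epoch permutation: each index is used exactly once per epoch, which is the structure the epoch-wise argument above exploits.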
a.3 Proof of Lemma 3
First, we prove the following technical lemma.
Lemma 4.
Let and be two positive constants. If then
(24) 
and
(25) 
Proof.
We prove this by direct calculation
Also, from the first and last lines in the above equation, we have
∎
With this result in hand, we complete the proof by direct calculation
(26) 
where in we used the triangle inequality, in we defined , and used
in we used the fact that is the Lipschitz constant of , and in we used the definition of . The above bound implies the following bound
(27) 
where in we added and subtracted the same term, in we used the triangle inequality, and in we used eq. 26 and also eq. 17 to obtain
(28) 