Convergence of SGD in Learning ReLU Models with Separable Data

06/12/2018 ∙ by Tengyu Xu, et al. ∙ The Ohio State University

We consider the binary classification problem in which the objective function is the exponential loss with a ReLU model, and study the convergence property of the stochastic gradient descent (SGD) algorithm on linearly separable data. We show that the gradient descent (GD) algorithm does not always learn desirable model parameters due to the nonlinear ReLU model. Then, we identify a certain condition on the data samples under which we show that SGD can learn a proper classifier with implicit bias. Specifically, we establish a sub-linear convergence rate of the function value generated by SGD to the global minimum. We further show that SGD in fact converges in expectation to the maximum-margin classifier with respect to the samples with +1 label under the ReLU model at the rate O(1/ln t). We also extend our study to the case of multiple ReLU neurons, and show that SGD converges to a certain non-linear maximum-margin classifier for a class of non-linearly separable data.


1 Introduction

It has been observed in various machine learning problems recently that the gradient descent (GD) algorithm and the stochastic gradient descent (SGD) algorithm converge to solutions with certain properties even without explicit regularization in the objective function. Correspondingly, theoretical analysis has been developed to explain such implicit regularization property. For example, it has been shown in Gunasekar et al. (2018, 2017) that GD converges to the solution with the minimum norm under certain initialization for regression problems, even without an explicit norm constraint.

Another type of implicit regularization, where GD converges to the max-margin classifier, has been recently studied in Gunasekar et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a); Soudry et al. (2017, 2018) for classification problems as we describe below. Given a set of training samples $(x_i, y_i)$ for $i = 1, \ldots, n$, where $x_i \in \mathbb{R}^d$ denotes a feature vector and $y_i \in \{+1, -1\}$ denotes the corresponding label, the goal is to find a desirable linear model (i.e., a classifier) by solving the following empirical risk minimization problem

$$\min_{w \in \mathbb{R}^d} \ \frac{1}{n} \sum_{i=1}^{n} \ell\big(y_i\, w^\top x_i\big). \qquad (1)$$

It has been shown in Nacson et al. (2018a); Soudry et al. (2017, 2018) that if the loss function is monotonically strictly decreasing and satisfies proper tail conditions (e.g., the exponential loss), and the data are linearly separable, then GD converges to the solution with infinite norm and the maximum margin direction of the data, although there is no explicit regularization towards the max-margin direction in the objective function. Such a phenomenon is referred to as the implicit bias of GD, and can help to explain some experimental results. For example, even when the training error achieves zero (i.e., the resulting model enters into the linearly separable region that correctly classifies the data), the testing error continues to decrease, because the direction of the model parameter continues to have an improved margin. Such a study has been further generalized to hold for various other types of gradient-based algorithms Gunasekar et al. (2018). Moreover, Ji & Telgarsky (2018) analyzed the convergence of GD with no assumption on the data separability, and characterized the implicit regularization to be in a subspace-based form.
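For reference, the implicit bias result described above can be stated compactly as follows; this is a standard formulation of the result of Soudry et al. (2017, 2018), written in our own notation.

```latex
\hat{w} \;=\; \arg\min_{w \in \mathbb{R}^d} \|w\|_2^2
\quad \text{s.t.} \quad y_i\, w^{\top} x_i \ge 1 \ \ \text{for all } i,
\qquad \text{and} \qquad
\lim_{t \to \infty} \frac{w_t}{\|w_t\|_2} \;=\; \frac{\hat{w}}{\|\hat{w}\|_2},
```

i.e., the GD iterates diverge in norm but converge in direction to the hard-margin SVM solution.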

The focus of this paper is on the following two fundamental issues, which have not been well addressed by existing studies.

  • Existing studies have so far focused only on the linear classifier model. A natural question is what happens for the more general nonlinear leaky ReLU and ReLU models: will GD still converge, and if so, will it converge to the max-margin direction? Our study provides new insights for the ReLU model that have not been observed for the linear model in previous studies.

  • Existing studies mainly analyzed the convergence of GD, with the only exceptions being Ji & Telgarsky (2018); Nacson et al. (2018b), which considered SGD. However, Ji & Telgarsky (2018) did not establish the convergence of SGD to the max-margin direction, and Nacson et al. (2018b) established the convergence to the max-margin solution only epoch-wise for cyclic SGD (not iteration-wise for SGD under random sampling with replacement). Moreover, both studies considered only the linear model. Here, our interest is to explore the iteration-wise convergence of SGD under random sampling with replacement to the max-margin direction, and our result also sheds light on online SGD. Furthermore, our study provides new understanding for the nonlinear ReLU and leaky ReLU models.

1.1 Main Contributions

We summarize our main contributions, where our focus is on the exponential loss function under the ReLU model.

We first characterize the landscape of the empirical risk function under the ReLU model, which is nonconvex and nonsmooth. We show that such a risk function has asymptotic global minima and asymptotic spurious local minima. Such a landscape is in sharp contrast to that under the linear model previously studied in Soudry et al. (2017), where there exist only equivalent global minima.

Based on the landscape property, we show that the implicit bias in the course of the convergence of GD can fall into four cases: GD converges to the asymptotic global minimum along the max-margin direction, converges to an asymptotic local minimum along a local max-margin direction, stops at a finite spurious local minimum, or oscillates between the linearly separable and misclassified regions without convergence. Such diverse behavior is in sharp contrast to that under the linear model Soudry et al. (2017), where GD always converges to the max-margin direction.

We then take a further step to study the implicit bias of SGD. We show that the expected averaged weight vector normalized by its expected norm converges to the global max-margin direction or local max-margin direction, as long as SGD stays either in the linearly separable region or in a region of the local minima defined by a subset of data samples with positive label. The proof here requires considerable new technical developments, which are very different from the traditional analysis of SGD, e.g., Bottou et al. (2016); Duchi & Singer (2009); Nemirovskii et al. (1983); Shalev-Shwartz et al. (2009); Xiao (2010); Bach & Moulines (2013); Bach (2014). This is because our focus here is on the exponential loss function without attainable global/local minima, whereas traditional analysis typically assumed that the minimum of the loss function is attainable. Furthermore, our goal is to analyze the implicit bias property of SGD, which is also beyond traditional analysis of SGD.

We further extend our analysis to the leaky ReLU model and multi-neuron networks.

1.2 Related Work

Implicit bias of gradient descent: Gunasekar et al. (2018) studied the implicit bias of GD and SGD for minimizing the squared loss function with a bounded global minimum, and showed that some of these algorithms converge to a global minimum that is closest to the initial point. Another collection of papers Gunasekar et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a); Soudry et al. (2017); Telgarsky (2013); Soudry et al. (2018) characterized the implicit bias of algorithms for loss functions without attainable global minimum. Telgarsky (2013) showed that AdaBoost converges to an approximate max-margin classifier. Soudry et al. (2017, 2018) studied the convergence of GD in logistic regression with linearly separable data and showed that GD converges in direction to the solution of the support vector machine. Nacson et al. (2018a) improved the rate of this directional convergence under the exponential loss via normalized gradient descent. Gunasekar et al. (2018) further showed that steepest descent can lead to margin maximization under generic norms. Ji & Telgarsky (2018) analyzed the convergence of GD on an arbitrary dataset, and provided the convergence rates along the strongly convex subspace and the separable subspace. Our work studies the convergence of GD and SGD under the nonlinear ReLU model with the exponential loss, as opposed to the linear model studied by all the above previous works on the same type of loss functions.

Implicit bias of SGD: Ji & Telgarsky (2018) analyzed averaged SGD (under random sampling) with a fixed learning rate and proved the convergence of the population risk, but did not establish the parameter convergence of SGD to the max-margin direction. Nacson et al. (2018b) established the epoch-wise convergence of cyclic SGD in direction to the max-margin classifier. Our work differs from these two studies first in that we study the ReLU model, whereas both of these studies analyzed the linear model. Furthermore, we show that under SGD with random sampling, the expectation of the averaged weight vector converges in direction to the max-margin classifier at a rate of O(1/ln t).

Generalization of SGD: There have been extensive studies of the convergence and generalization performance of SGD under various models, of which we cannot provide a comprehensive list due to space limitations. In general, these types of studies either characterize the convergence rate of SGD or provide generalization error bounds at the convergence of SGD, e.g., Brutzkus et al. (2017); Wang et al. (2018); Li & Liang (2018), but do not characterize the implicit regularization property of SGD, such as the convergence to the max-margin direction as provided in our paper.

2 ReLU Classification Model

We consider the binary classification problem, in which we are given a set of training samples $\{(x_i, y_i)\}_{i=1}^{n}$. Each training sample contains an input data $x_i \in \mathbb{R}^d$ and a corresponding binary label $y_i \in \{+1, -1\}$. We denote $\mathcal{I}_+$ as the set of indices of samples with label $+1$ and denote $\mathcal{I}_-$ in a similar way. Their cardinalities are denoted as $n_+$ and $n_-$, respectively, and are assumed to be non-zero. We consider all datasets that are linearly separable, i.e., there exists a linear classifier $u$ such that $y_i\, u^\top x_i > 0$ for all $i$.

We are interested in training a ReLU model for the classification task. Specifically, for a given input data $x$, the model outputs $\sigma(w^\top x)$, where $\sigma(z) := \max\{z, 0\}$ is the ReLU activation function and $w \in \mathbb{R}^d$ denotes the weight parameters. The predicted label is set to be the sign of the model output. Our goal is to learn a classifier by solving the following empirical risk minimization problem, where we adopt the exponential loss.

$$\min_{w \in \mathbb{R}^d} \ L(w) := \frac{1}{n} \sum_{i=1}^{n} \exp\!\big(-y_i\, \sigma(w^\top x_i)\big). \qquad \text{(P)}$$

The ReLU activation causes the loss function in problem (P) to be nonconvex and nonsmooth. Therefore, it is important to first understand the landscape property of the loss function, which is critical for characterizing the implicit bias property of the GD and SGD algorithms.
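To make the setup concrete, the following minimal NumPy sketch implements the ReLU model and the exponential empirical risk of problem (P). The helper names and the convention of mapping a zero model output to the label -1 are our own choices.

```python
import numpy as np

def relu(z):
    # ReLU activation: sigma(z) = max(z, 0)
    return np.maximum(z, 0.0)

def empirical_risk(w, X, y):
    """Exponential loss of the ReLU model sigma(w^T x) on data (X, y).

    X: (n, d) array of features; y: (n,) array of labels in {+1, -1}.
    """
    margins = y * relu(X @ w)           # y_i * sigma(w^T x_i)
    return np.mean(np.exp(-margins))    # (1/n) sum_i exp(-y_i sigma(w^T x_i))

def predict(w, X):
    # predicted label: sign of the model output, mapping a zero output to -1
    return np.where(relu(X @ w) > 0, 1, -1)
```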

3 Implicit Bias of GD in Learning ReLU Model

3.1 Landscape of ReLU Model

In order to understand the convergence of GD under the ReLU model, we first study the landscape of the loss function in problem (P), which turns out to be very different from that under the linear activation model. As shown in Soudry et al. (2017); Ji & Telgarsky (2018), the loss function in problem (P) under linear activation is convex and achieves its global minimum only asymptotically, i.e., the loss approaches its infimum only as the classifier scales to infinity along a direction in the linearly separable region. In contrast, under the ReLU model, the asymptotic critical points can be either global minima or (spurious) local minima depending on the training dataset, and hence the convergence property of GD can be very different in nature from that under the linear model.

The following theorem characterizes the landscape properties of problem (P). Throughout, we denote the infimum of the objective function in problem (P) as $L^*$. Furthermore, we call a direction asymptotically critical if the gradient of the objective vanishes as the weight vector scales to infinity along that direction.

Theorem 3.1 (Asymptotic landscape property).

For problem (P) under the ReLU model, any asymptotic critical direction falls into one of the following cases:

  1. (Asymptotic global minimum): the direction classifies all data samples correctly. Then, scaling the weights to infinity along this direction drives the objective to its infimum $L^*$.

  2. (Asymptotic local minimum): the direction classifies some (but not all) of the samples with label $+1$ as negative and correctly classifies the remaining samples. Then, scaling the weights to infinity along this direction drives the objective to a sub-optimal value strictly larger than $L^*$.

  3. (Local minimum): the direction classifies all data samples as negative, so that the ReLU unit is never activated. Then, the objective has constant value 1 along this direction.

To further elaborate Theorem 3.1, if a direction classifies all data correctly (item 1), then the objective function possibly achieves the global minimum along this direction. On the other hand, if the direction classifies some data with label $+1$ as negative (item 2), then the objective function approaches a sub-optimal value along this direction. In the worst case where all data samples are classified as negative (item 3), the ReLU unit is never activated and hence the corresponding objective function has constant value 1. We note that the cases in items 2 and 3 may or may not take place depending on the specific dataset, but if they do occur, the corresponding directions are spurious (asymptotic) local minima. In summary, the landscape under the ReLU model can be partitioned into different regions, where gradient descent algorithms can have different implicit bias, as we show next.
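The three cases can be observed numerically. The toy 2-D dataset and the three directions below are hypothetical choices of ours, used only to illustrate Theorem 3.1: scaling along a direction that classifies everything correctly drives the loss to its infimum (here 1/3), a direction that suppresses one positively labeled sample settles at a strictly larger value (2/3), and a direction that deactivates the ReLU on every sample gives the constant loss 1.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)
loss = lambda w, X, y: np.mean(np.exp(-y * relu(X @ w)))

# hypothetical toy dataset: two samples with label +1, one with label -1
X = np.array([[1.0, 0.5], [1.0, -0.5], [0.5, 1.0]])
y = np.array([1, 1, -1])

directions = {
    "case 1 (all correct)":          np.array([1.0, -1.0]),
    "case 2 (one +1 suppressed)":    np.array([0.0, -1.0]),
    "case 3 (ReLU never activated)": np.array([-1.0, 0.0]),
}

for name, d in directions.items():
    values = [loss(c * d, X, y) for c in (1.0, 10.0, 100.0)]
    print(name, np.round(values, 4))
```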

3.2 Convergence of GD

In this subsection, we analyze the convergence of GD in learning the ReLU model. At each iteration $t$, GD performs the update

$$w_{t+1} = w_t - \eta\, \nabla L(w_t), \qquad \text{(GD)}$$

where $\eta > 0$ denotes the stepsize. For the linear model, whose loss function has infinitely many asymptotic global minima, it has been shown in Soudry et al. (2017) that GD always converges to the max-margin direction. Such a phenomenon is regarded as the implicit bias property of GD. Here, for the ReLU model, we are interested in whether such an implicit bias property still holds. Furthermore, since the loss function under the ReLU model possibly contains spurious asymptotic local minima, the convergence of GD under the ReLU model can be very different from that under the linear model.
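As a sanity check, a bare-bones gradient descent loop on this loss can be written as below; the subgradient convention sigma'(0) = 0, the constant stepsize, and the iteration budget are illustrative choices of ours.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def loss_and_grad(w, X, y):
    z = X @ w
    active = (z > 0).astype(float)              # sigma'(z), with sigma'(0) := 0
    e = np.exp(-y * relu(z))                    # per-sample exponential losses
    grad = -(X.T @ (e * y * active)) / len(y)   # gradient of the averaged loss
    return e.mean(), grad

def gd(X, y, w0, eta=0.1, iters=5000):
    w = w0.astype(float).copy()
    for _ in range(iters):
        _, g = loss_and_grad(w, X, y)
        w -= eta * g
    return w
```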

Next, we introduce various notions of margin in order to characterize the implicit bias under the ReLU model. The global max-margin direction is defined with respect to the samples in $\mathcal{I}_+$ only. Such a notion of max-margin is natural because the ReLU activation function suppresses negative inputs. We note that this direction may not lie in the linearly separable region, and hence it may not be parallel to any (asymptotic) global minimum. As we show next, only when the global max-margin direction lies in the linearly separable region may GD converge in direction to it under the ReLU model. Furthermore, for each given subset of $\mathcal{I}_+$, we define an associated local max-margin direction in the same way, restricted to that subset.
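For concreteness, a standard way to write these two directions is given below; the unit-norm normalization and the notation ($\mathcal{I}_+$ for the positively labeled samples, $\mathcal{I}_s \subseteq \mathcal{I}_+$ for a subset) are our assumptions, so this is a sketch rather than the paper's exact definition.

```latex
w^{\ast} \;=\; \arg\max_{\|w\|_2 = 1} \; \min_{i \in \mathcal{I}_+} \, w^{\top} x_i ,
\qquad\qquad
w^{\ast}_{s} \;=\; \arg\max_{\|w\|_2 = 1} \; \min_{i \in \mathcal{I}_s} \, w^{\top} x_i .
```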

We further denote the set of asymptotic local minima associated with such a subset (see Theorem 3.1, item 2). Of course, this set may or may not be empty for a given subset, and the associated local max-margin direction may or may not belong to it, depending on the specific training dataset. As we show next, only when such a set is non-empty and contains the corresponding local max-margin direction may GD converge to such an asymptotic local minimum direction under the ReLU model. Next, we present the implicit bias of GD for learning the ReLU model in problem (P).

Theorem 3.2.

Apply GD to solve problem (P) with arbitrary initialization and a small enough constant stepsize. Then, the sequence generated by GD falls into one of the following cases.

  1. The loss converges to the global infimum, the norm of $w_t$ grows to infinity, and $w_t$ converges in direction to the global max-margin direction, which lies in the linearly separable region;

  2. the direction of $w_t$ does not converge and oscillates between the linearly separable and misclassified regions; in this case the global max-margin direction does not lie in the linearly separable region;

  3. the loss converges to the value of a spurious (asymptotic) local minimum, the norm of $w_t$ grows to infinity, and $w_t$ converges in direction to a local max-margin direction that belongs to the associated set of asymptotic local minima;

  4. $w_t$ reaches a point at which the gradient is zero, i.e., GD terminates within finite steps.

Theorem 3.2 characterizes various instances of the implicit bias of GD in learning the ReLU model, where the nature of the convergence is different from that in learning the linear model. Specifically, GD can either converge in direction to the global max-margin direction, which leads to the global minimum, or converge to a local max-margin direction, which leads to a spurious local minimum. Furthermore, GD may oscillate between the linearly separable region and the misclassified region due to the suppression effect of the ReLU function; in this case, GD has neither an implicit bias property nor a convergence guarantee. We provide two simple examples in the supplementary material to further elaborate these cases.

3.3 Implicit Bias of SGD in Learning ReLU Models

In this subsection, we analyze the convergence property and the implicit bias of SGD for solving problem (P). At each iteration $t$, SGD samples an index $\xi_t$ uniformly at random with replacement and performs the update

$$w_{t+1} = w_t - \eta_t\, \nabla \ell_{\xi_t}(w_t), \qquad \text{where } \ \ell_i(w) := \exp\!\big(-y_i\, \sigma(w^\top x_i)\big). \qquad \text{(SGD)}$$

Similarly to the convergence of GD characterized in Theorem 3.2, SGD may oscillate between the linearly separable and misclassified regions. Therefore, our major interest here is the implicit bias of SGD when it does converge, either to an asymptotic global minimum or to an asymptotic local minimum. Thus, without loss of generality, we implicitly assume that the iterates eventually stay either in the linearly separable region or in a region of asymptotic local minima; otherwise, SGD does not even converge.
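A minimal sketch of this update rule is given below. The polynomially diminishing stepsize and the plain running average of the iterates are illustrative choices of ours; the paper's exact stepsize schedule is the one referred to in Proposition 1 and Theorems 3.3 and 3.4.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def sgd(X, y, w0, q=0.75, iters=20000, seed=0):
    """SGD with replacement on the exponential loss of the ReLU model."""
    rng = np.random.default_rng(seed)
    w = w0.astype(float).copy()
    w_avg = np.zeros_like(w)
    n = len(y)
    for t in range(iters):
        i = rng.integers(n)                      # uniform sampling with replacement
        z = float(X[i] @ w)
        if z > 0:                                # ReLU active: nonzero stochastic gradient
            w += (t + 1.0) ** (-q) * np.exp(-y[i] * z) * y[i] * X[i]
        w_avg += (w - w_avg) / (t + 1)           # running average of the iterates
    return w, w_avg
```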

The implicit bias of SGD with replacement sampling has not been studied in the existing literature, and the proof of the convergence and the characterization of the implicit bias require substantial new technical developments. In particular, traditional analysis of SGD for convex functions requires the assumption that the variance of the gradient is bounded Bottou et al. (2016); Bach (2014); Bach & Moulines (2013). Instead of making such an assumption, we next prove that SGD enjoys a nearly-constant bound on the variance, up to a factor that is logarithmic in $t$, in learning the ReLU model.

Proposition 1 (Variance bound).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ either stays in the linearly separable region or stays in a region of asymptotic local minima, then, with an appropriately chosen diminishing stepsize, the accumulated variance of the stochastic gradients sampled by SGD along the iteration path grows at most logarithmically in $t$ for all $t \ge t_0$.

Proposition 1 shows that the summation of the norms of the stochastic gradients grows only logarithmically fast. This implies that the variance of the stochastic gradients is well controlled. In particular, for a suitable choice of the stepsize parameter, the bound in Proposition 1 implies that the corresponding variance term stays at a constant level. Based on the variance bound in Proposition 1, we next establish the convergence rate of SGD for learning the ReLU model. Throughout, we denote by $\bar w_t$ the averaged iterate generated by SGD.

Theorem 3.3 (Convergence rate of loss).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in the linearly separable region, then, with an appropriately chosen diminishing stepsize, the expected loss of the averaged iterate $\bar w_t$ generated by SGD converges sub-linearly to the global minimum value of problem (P).

If instead there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in a region of asymptotic local minima, then, with the same stepsize, the expected loss of $\bar w_t$ converges at the same rate to the corresponding local minimum value.

Theorem 3.3 establishes the convergence rate of the expected risk of the averaged iterates generated by SGD. It can be seen that SGD converges to different loss values, corresponding to the global or the local minimum, in the different regions. The stepsize is set to be diminishing to compensate for the variance introduced by SGD. In particular, if the stepsize exponent is chosen to be sufficiently close to its upper limit, then the convergence rate nearly matches the standard result of SGD in convex optimization up to a logarithmic factor. Theorem 3.3 also implies that the convergence of SGD is attained as the norm of the averaged iterate grows unboundedly. We note that the analysis of Theorem 3.3 is different from that of SGD in traditional convex optimization, which requires the global minimum to be achieved at a bounded point and assumes the variance of the stochastic gradients to be bounded by a constant Shalev-Shwartz et al. (2009); Duchi & Singer (2009); Nemirovski et al. (2009). These assumptions do not hold here.

Theorem 3.4 (Implicit bias of SGD).

Apply SGD to solve problem (P) with any initialization. If there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in the linearly separable region, then, with an appropriately chosen diminishing stepsize, the expected averaged iterate generated by SGD converges in direction to the global max-margin direction.

If instead there exists $t_0$ such that for all $t \ge t_0$ the iterate $w_t$ stays in a region of asymptotic local minima, then, with the same stepsize, the expected averaged iterate converges in direction to the corresponding local max-margin direction.

Theorem 3.4 shows that the direction of the expected averaged iterate generated by SGD converges to the (global or local) max-margin direction, without any explicit regularizer in the objective function. The proof of Theorem 3.4 requires a detailed analysis of the SGD update under the ReLU model and is substantially different from that under the linear model Soudry et al. (2018); Ji & Telgarsky (2018); Nacson et al. (2018a, b). In particular, we need to handle the variance of the stochastic gradients introduced by SGD and exploit its classification properties under the ReLU model.

We next provide an example class of datasets (which has been studied in Combes et al. (2018)), for which we show that SGD stays stably in the linearly separable region.

Proposition 2.

If the linearly separable samples satisfy the following conditions given in Combes et al. (2018):

  1. For all , it holds that ;

  2. For all , it holds that ,

then there exists a $t_0$ such that for all $t \ge t_0$ the sequence generated by SGD stays in the linearly separable region, as long as SGD is not initialized at the local minima described in item 3 of Theorem 3.1.

We also point out that any linearly separable dataset can be made to satisfy the condition in Proposition 2 after a proper transformation, e.g., data augmentation that pads an extra coordinate equal to $+1$ to the samples with label $+1$ and equal to $-1$ to the samples with label $-1$. Such a data transformation changes the landscape of the ReLU model into a more optimization-friendly version that helps regularize the SGD path.
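A minimal sketch of this augmentation, with the padded coordinate appended as the last feature (an implementation choice of ours):

```python
import numpy as np

def augment(X, y):
    """Append one extra coordinate equal to the label: +1 for samples with
    label +1 and -1 for samples with label -1."""
    pad = y.reshape(-1, 1).astype(float)
    return np.hstack([X, pad])

# usage:
# X_aug = augment(X, y)   # train the ReLU model on (X_aug, y) instead of (X, y)
```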

4 Further Extensions and Discussions

4.1 Leaky ReLU Models

The leaky ReLU activation takes the form $\sigma_\alpha(z) = \max\{z, \alpha z\}$, where the parameter $\alpha \in (0, 1)$. Clearly, leaky ReLU takes the linear and ReLU models as two special cases, corresponding to $\alpha = 1$ and $\alpha = 0$, respectively. Since the convergence of GD/SGD under the ReLU model is very different from that under the linear model, a natural question is whether leaky ReLU with an intermediate parameter behaves like the linear model or like the ReLU model.

It can be shown that the loss function in problem (P) under the leaky ReLU model has only asymptotic global minima, achieved in the separable region as the norm of the classifier grows to infinity (there exist no asymptotic local minima). Hence, the convergence of GD is similar to that under the linear model, where the only difference is that the max-margin classifier needs to be defined based on leaky ReLU as follows.

For the given set of linearly separable data samples, we construct a new set of data in which the samples with label $-1$ are scaled by the leaky ReLU parameter $\alpha$ and the samples with label $+1$ are kept unchanged. Without loss of generality, we assume that the max-margin classifier for this transformed data passes through the origin after a proper translation. Then, we define the max-margin direction under the leaky ReLU model as the max-margin direction of the transformed data.

Then, following the result under the linear model in Soudry et al. (2017), it can be shown that GD with arbitrary initialization and a small enough constant stepsize for solving problem (P) under the leaky ReLU model drives the loss to zero and converges in direction to this max-margin direction, with the norm of the iterate going to infinity.

Furthermore, following our result of Theorem 3.4, it can be shown that for SGD applied to solve problem (P) under the leaky ReLU model with any initialization, if there exists $t_0$ such that for all $t \ge t_0$ the iterate stays in the linearly separable region, then, with the same diminishing stepsize as before, the expected averaged iterate generated by SGD converges in direction to this max-margin direction.

Thus, for SGD under the leaky ReLU model, the normalized average of the parameter vector converges in direction to the max-margin classifier.
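The following sketch spells out the leaky ReLU model and the scaled dataset used to define its max-margin direction. Scaling the negatively labeled samples by alpha reflects our reading of the construction above and should be treated as an assumption.

```python
import numpy as np

def leaky_relu(z, alpha=0.1):
    # linear model when alpha = 1, ReLU model when alpha = 0
    return np.where(z > 0, z, alpha * z)

def leaky_loss(w, X, y, alpha=0.1):
    # exponential loss under the leaky ReLU model
    return np.mean(np.exp(-y * leaky_relu(X @ w, alpha)))

def scaled_dataset(X, y, alpha=0.1):
    # transformed data: samples with label -1 are scaled by alpha,
    # samples with label +1 are unchanged
    scale = np.where(y == -1, alpha, 1.0).reshape(-1, 1)
    return X * scale, y
```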

4.2 Multi-neuron Networks

In this subsection, we extend our study of the ReLU model to the problem of training a one-hidden-layer ReLU neural network with $K$ hidden neurons for binary classification. Here, we do not assume linear separability of the dataset. The output of the network is given by

$$f(x) = v^\top \sigma(W^\top x), \qquad (2)$$

where $W \in \mathbb{R}^{d \times K}$ with each column $w_j$ representing the weights of the $j$th neuron in the hidden layer, $v \in \mathbb{R}^{K}$ denotes the weights of the output neuron, and $\sigma(\cdot)$ represents the entry-wise ReLU activation function. We assume that $v$ is a fixed vector whose entries are nonzero and have both positive and negative values. Such an assumption is natural, as it allows the model to have enough capacity to achieve zero loss. The predicted label is set to be the sign of $f(x)$, and the objective function under the exponential loss is given by

$$L(W) = \frac{1}{n} \sum_{i=1}^{n} \exp\!\big(-y_i\, v^\top \sigma(W^\top x_i)\big). \qquad (3)$$
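In code, the model output in eq. 2 and the loss in eq. 3 can be sketched as follows; the hidden width, the particular fixed output vector v with mixed signs, and the 1/n averaging are our own choices.

```python
import numpy as np

relu = lambda z: np.maximum(z, 0.0)

def network_output(W, v, X):
    # f(x) = v^T sigma(W^T x), computed for all samples at once
    # X: (n, d), W: (d, K), v: (K,)
    return relu(X @ W) @ v

def multi_neuron_loss(W, v, X, y):
    # exponential loss of the one-hidden-layer ReLU network
    return np.mean(np.exp(-y * network_output(W, v, X)))

# example of a fixed output layer with both positive and negative entries:
# v = np.array([1.0, 1.0, -1.0, -1.0])
```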

Our goal is to characterize the implicit bias of GD and SGD for learning the weight parameters of the multi-neuron model. In general, such a problem is challenging, as we have shown that GD may not converge to a desirable classifier even under the single-neuron ReLU model. For this reason, we adopt the same setting as that in (Soudry et al., 2017, Corollary 8), which assumes that the activated neurons do not change their activation status and the training error converges to zero after a sufficient number of iterations, but our result presented below characterizes the implicit bias of GD and SGD in the original feature space, which is different from that in (Soudry et al., 2017, Corollary 8). For each sample, we define an activation pattern vector whose $j$th entry is $1$ if the sample activates the $j$th neuron, i.e., $w_j^\top x > 0$, and $0$ otherwise. We then partition the set of all training samples into subsets, so that the samples in the same subset have the same ReLU activation pattern and the samples in different subsets have different ReLU activation patterns. We call each such subset a pattern partition. Then, for any sample in a given pattern partition, the output of the network reduces to a linear function of the input, with an effective weight vector aggregated over the neurons that are active on that partition.
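The activation patterns, the pattern partitions, and the per-partition effective linear classifier described above can be computed as in the sketch below; the helper names are ours, and ties at exactly zero pre-activation are treated as inactive.

```python
import numpy as np

def activation_patterns(W, X):
    # binary pattern per sample: entry j is 1 if neuron j is active, i.e. w_j^T x > 0
    return (X @ W > 0).astype(int)            # shape (n, K)

def pattern_partitions(W, X):
    # group sample indices by their ReLU activation pattern
    partitions = {}
    for i, p in enumerate(map(tuple, activation_patterns(W, X))):
        partitions.setdefault(p, []).append(i)
    return partitions

def effective_classifier(W, v, pattern):
    # for samples with this pattern, f(x) = (sum_{j active} v_j w_j)^T x
    active = np.asarray(pattern, dtype=float)
    return W @ (v * active)                   # effective weight vector in R^d
```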

We next present our characterization of the implicit bias property of GD and SGD under the above ReLU network model. For each pattern partition, we define the corresponding max-margin direction of its samples in the same way as in the single-neuron case.

Then the following theorem characterizes the implicit bias of GD under the multi-neuron network.

Theorem 4.1.

Suppose that GD optimizes the loss in eq. 3 to zero and there exists $t_0$ such that for all $t \ge t_0$ the neurons in the hidden layer do not change their activation status. If the entry-wise logical AND of the activation patterns of any two different pattern partitions is zero (i.e., different partitions activate disjoint sets of neurons), then the samples in the same pattern partition of the ReLU activation have the same label, and the corresponding effective classifier converges in direction to the max-margin direction of that partition.

In contrast to (Soudry et al., 2017, Corollary 8), which studies the convergence of the vectorized weight matrix so that the implicit bias of GD is with respect to features lifted to an extended-dimensional space, Theorem 4.1 characterizes the convergence of the weight parameters and the implicit bias in the original feature space. In particular, Theorem 4.1 implies that although the ReLU neural network is a nonlinear classifier, it is equivalent to a ReLU classifier for the samples in the same pattern partition (which are from the same class), and this classifier converges in direction to the max-margin classifier of those data samples. The following theorem establishes the analogous implicit bias of SGD.

Theorem 4.2.

Suppose that SGD optimizes the loss in eq. 3 so that there exists $t_0$ such that for all $t \ge t_0$ the neurons in the hidden layer do not change their activation status and the activation patterns of different pattern partitions remain disjoint. Then, with an appropriately chosen diminishing stepsize, the samples in the same pattern partition of the ReLU activation have the same label, and the expected averaged effective classifier converges in direction to the max-margin direction of that partition.

Similarly to GD, averaged SGD in expectation maximizes the margin for every pattern partition. At a high level, Theorems 4.1 and 4.2 imply the following generalization behavior of the ReLU network under study. After a sufficiently large number of iterations, the neural network partitions the data samples into different subsets, and for each subset, the distance from the samples to the decision boundary is maximized by GD and SGD. Thus, the learned classifier is robust to small perturbations of the data, resulting in good generalization performance.

5 Conclusion

In this paper, we study the problem of learning a ReLU neural network via gradient descent methods, and establish the corresponding risk and parameter convergence under the exponential loss function. In particular, we show that due to the possible existence of spurious asymptotic local minima, GD and SGD can converge either to the global or to a local max-margin direction, whose nature of convergence is very different from that under the linear model in previous studies. We also discuss the extensions of our analysis to the more general leaky ReLU model and to multi-neuron networks. In the future, it is worthwhile to explore the implicit bias of GD and SGD in learning multi-layer neural network models and under more general (not necessarily linearly separable) datasets.

References

Appendix A Proof of Theorem 3.1

The gradient is given by

If for all , then as , we have

and

Recall that . If for all , then as , we obtain

and

If for all , then

The proof is now complete.

Appendix B Proof of Theorem 3.2

First, consider the case where the limiting direction lies in the linearly separable region and no local minimum exists along the update path. We call the region where all vectors correctly classify the samples with label $-1$ (i.e., the ReLU is inactive on them) the negative correctly classified region. As shown in Soudry et al. (2017), the objective is non-negative and smooth, which implies that

Based on the above inequality, we have

which, in conjunction with , implies that

Thus, the gradient vanishes as $t \to \infty$. By Theorem 3.1, the gradient vanishes only when all samples with label $-1$ are correctly classified, and thus GD eventually enters the negative correctly classified region and the iterate diverges to infinity. Theorem 3 of Soudry et al. (2017) shows that when GD diverges to infinity, it simultaneously converges in the direction of the max-margin classifier of all samples that remain activated. Thus, under our setting, GD converges either in the direction of the global max-margin classifier:

or the local max-margin classifier:

Next, consider the case where the limiting direction is not in the linearly separable region and no local minimum exists along the update path. In such a case, we conclude that GD cannot stay in the linearly separable region; otherwise, it would converge in a direction that is not in the linearly separable region, which leads to a contradiction. If an asymptotic local minimum exists, then GD may converge in its direction. If it does not exist, GD cannot stay in either the misclassified region or the linearly separable region, and thus oscillates between these two regions.

In the case where GD reaches a local minimum, by Theorem 3.1, the gradient is zero, and thus GD stops immediately and does not diverge to infinity.

Appendix C Examples of Convergence of GD in ReLU Model

Example 1 (Figure 1, left).

The dataset consists of two samples with label and one sample with label . These samples satisfy and .

For this example, if we initialize GD at the green classifier, then GD converges to the max-margin direction of the sample . Clearly, such a classifier misclassifies the data sample .

Example 2 (Figure 1, right).

The dataset consists of one sample with label and one sample with label . These two samples satisfy .

For this example, if we initialize at the green classifier, then GD oscillates around the direction and does not converge.

Figure 1: Failure of GD in learning ReLU models

Proof of Example 1

Consider the first iteration. Note that the sample has label , and from the illustration of Figure 1 (left) we have , and . Therefore, only the sample contributes to the gradient, which is given by

(4)

By the update rule of GD, we obtain that for all

(5)

By telescoping eq. 5, it is clear that the corresponding inner product remains negative for all iterations, since it is negative at initialization. This implies that the sample is always misclassified.

Proof of Example 2

Since we initialize GD at such that and , the sample does not contribute to the GD update due to the ReLU activation. Next, we argue that there must exist a such that . Suppose such a does not exist; then we always have . In that case, the linear classifier generated by GD stays between and , and the corresponding objective function reduces to a linear model that depends on the sample (note that the other sample contributes a constant due to the ReLU activation). Following the results in Ji & Telgarsky (2018); Soudry et al. (2017) for the linear model, we conclude that converges to the max-margin direction as . Since , this implies that as , contradicting the assumption.

Next, we consider the iterations such that and ; the objective function is given by

and the corresponding gradient is given by

Next, we consider the case where for all . Otherwise, both samples are on the negative side of the classifier and GD cannot make any progress, as the corresponding gradient is zero. In the case where for all , by the update rule of GD, we obtain that

(6)

Clearly, the sequence is strictly decreasing with a constant gap, and hence within finitely many steps we must have .

Appendix D Proof of Proposition 1

Since SGD eventually stays in the linearly separable region, only the data samples in contribute to the gradient update due to the ReLU activation function. For this reason, we reduce the original minimization problem (P) to the following optimization problem

(7)

which corresponds to a linear model with samples in . Similarly, if SGD stays in , only the data samples in contribute to the gradient update, and the original minimization problem (P) is reduced to

(8)

The proof contains three main steps.

Step 1: For any , bounding the term : By the update rule of SGD, we have

(9)

where

By convexity we obtain that . Then, eq. 9 further becomes

(10)

Telescoping the above inequality yields that

(11)

Taking expectation on both sides of the above inequality and noting that for all , we further obtain that

(12)

Note that whenever the data samples are correctly classified and for all , , and without loss of generality, we can assume . Hence, the term can be upper bounded by

Then, noting that , eq. 12 can be upper bounded by

(13)

Next, setting and noting that for all , we conclude that . Substituting this into the above inequality and noting that and , we further obtain that