Fitting ReLUs via SGD and Quantized SGD

In this paper we focus on the problem of finding the optimal weights of the shallowest of neural networks: one consisting of a single Rectified Linear Unit (ReLU). These functions take the form x → max(0, ⟨w, x⟩), with w ∈ R^d denoting the weight vector. We focus on a planted model where the inputs are drawn i.i.d. from a Gaussian distribution and the labels are generated according to a planted weight vector. We first show that mini-batch stochastic gradient descent, when suitably initialized, converges at a geometric rate to the planted model with a number of samples that is optimal up to numerical constants. Next we focus on a parallel implementation where, in each iteration, the mini-batch gradient is calculated in a distributed manner across multiple processors and then broadcast to a master or to all other processors. To reduce the communication cost in this setting we utilize a Quantized Stochastic Gradient Descent (QSGD) scheme in which the partial gradients are quantized. Perhaps unexpectedly, we show that QSGD maintains the fast convergence of SGD to a globally optimal model while significantly reducing the communication cost. We further corroborate our findings via various numerical experiments, including distributed implementations over Amazon EC2.
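The setup above can be illustrated with a minimal NumPy sketch. All dimensions, step sizes, the batch size, and the quantization level `s` below are illustrative choices, not the paper's; the moment-based initialization `w0 = (2/n) Σ y_i x_i` (which is unbiased for a unit-norm planted vector under Gaussian inputs) stands in for the paper's "suitable initialization", and `qsgd_quantize` is a generic unbiased QSGD-style stochastic quantizer, not the paper's exact scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Planted model (sizes are illustrative): y = max(0, <w_star, x>),
# with inputs drawn i.i.d. from a standard Gaussian.
d, n = 20, 2000
w_star = rng.standard_normal(d)
w_star /= np.linalg.norm(w_star)
X = rng.standard_normal((n, d))
y = np.maximum(0.0, X @ w_star)

def qsgd_quantize(v, s=8):
    """Unbiased QSGD-style stochastic quantization with s levels per coordinate."""
    norm = np.linalg.norm(v)
    if norm == 0.0:
        return v
    scaled = s * np.abs(v) / norm
    lower = np.floor(scaled)
    # Round up with probability equal to the fractional part, so E[Q(v)] = v.
    xi = lower + (rng.random(v.shape) < scaled - lower)
    return np.sign(v) * norm * xi / s

def fit_relu(quantize, lr=0.2, batch=128, iters=1000):
    # Moment-based initialization: E[y x] = w_star / 2 for unit-norm w_star,
    # used here as a stand-in for the paper's suitable initialization.
    w = 2.0 * X.T @ y / n
    for _ in range(iters):
        idx = rng.integers(0, n, size=batch)
        Xb, yb = X[idx], y[idx]
        resid = np.maximum(0.0, Xb @ w) - yb
        # (Sub)gradient of the least-squares loss through the ReLU.
        g = Xb.T @ (resid * (Xb @ w > 0)) / batch
        w -= lr * (qsgd_quantize(g) if quantize else g)
    return w

w_sgd = fit_relu(quantize=False)
w_qsgd = fit_relu(quantize=True)
print(np.linalg.norm(w_sgd - w_star))   # error of plain mini-batch SGD
print(np.linalg.norm(w_qsgd - w_star))  # error with quantized gradients
```

Because the model is realizable, the mini-batch gradient vanishes at the planted vector, so both variants drive the error geometrically toward zero; the quantized run communicates each coordinate with only a sign, a shared norm, and a few bits per level, which is the source of the communication savings.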

