Sparsified SGD with Memory

09/20/2018
by Sebastian U. Stich et al.

Huge-scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e., algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works have proposed quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by sending only the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes have shown very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the improved scalability for distributed applications.
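The scheme described in the abstract, applying only a k-sparse part of the step and keeping the discarded remainder in memory for the next iteration, can be sketched in a few lines. The following is a minimal illustration only: the top-k compressor and error memory follow the abstract, but the least-squares objective, step size, and all other parameters are assumptions made for the example and are not taken from the paper.

```python
# Sketch of SGD with top-k sparsification and error memory.
# Everything below the compressor and memory update is an assumed toy setup.
import numpy as np

def top_k(v, k):
    """Keep the k largest-magnitude entries of v, zero out the rest."""
    out = np.zeros_like(v)
    idx = np.argpartition(np.abs(v), -k)[-k:]
    out[idx] = v[idx]
    return out

def mem_sgd(grad_fn, x0, lr=0.01, k=5, steps=20000, rng=None):
    """Apply only a k-sparse compression of the memory-corrected step;
    the compression error is stored and re-added at the next iteration."""
    x = x0.copy()
    memory = np.zeros_like(x0)
    for _ in range(steps):
        g = grad_fn(x, rng)           # stochastic gradient
        corrected = memory + lr * g   # add back accumulated error
        update = top_k(corrected, k)  # sparse part that would be communicated
        memory = corrected - update   # keep the rest for later steps
        x -= update
    return x

# Toy usage: single-sample gradients of an assumed least-squares problem.
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d = 100
    A = rng.standard_normal((200, d))
    b = rng.standard_normal(200)

    def grad_fn(x, rng):
        i = rng.integers(0, A.shape[0])
        return (A[i] @ x - b[i]) * A[i]

    x = mem_sgd(grad_fn, np.zeros(d), rng=rng)
    print("residual norm:", np.linalg.norm(A @ x - b))
```

The key point of the sketch is that the compression error is never discarded but fed back into the next step; this error compensation is what, according to the abstract, lets the sparsified scheme match the convergence rate of vanilla SGD.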


Related research

Escaping Saddle Points with Compressed SGD (05/21/2021)
Stochastic gradient descent (SGD) is a prevalent optimization technique ...

Error Compensated Quantized SGD and its Applications to Large-scale Distributed Optimization (06/21/2018)
Large-scale distributed optimization is of great importance in various a...

Sparse Communication for Training Deep Networks (09/19/2020)
Synchronous stochastic gradient descent (SGD) is the most common method ...

Evaluation and Optimization of Gradient Compression for Distributed Deep Learning (06/15/2023)
To accelerate distributed training, many gradient compression methods ha...

Communication-Efficient Distributed SGD with Compressed Sensing (12/15/2021)
We consider large scale distributed optimization over a set of edge devi...

Communication-Censored Distributed Stochastic Gradient Descent (09/09/2019)
This paper develops a communication-efficient algorithm to solve the sto...

Drill the Cork of Information Bottleneck by Inputting the Most Important Data (05/15/2021)
Deep learning has become the most powerful machine learning tool in the ...
