Communication-efficient distributed SGD with Sketching

03/12/2019
by Nikita Ivkin, et al.

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we propose a sketching-based approach to minimize the communication costs between nodes without losing accuracy. In our proposed method, workers in a distributed, synchronous training setting send sketches of their gradient vectors to the parameter server instead of the full gradient vector. Leveraging the theoretical properties of sketches, we show that this method recovers the favorable convergence guarantees of single-machine top-k SGD. Furthermore, when applied to a model with d dimensions on W workers, our method requires only Θ(kW) bytes of communication, compared to Ω(dW) for vanilla distributed SGD. To validate our method, we run experiments using a residual network trained on the CIFAR-10 dataset. We achieve no drop in validation accuracy with a compression ratio of 4, or about 1 percentage point drop with a compression ratio of 8. We also demonstrate that our method scales to many workers.
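The approach described above can be illustrated with a minimal, hedged sketch (not the authors' code): each worker compresses its gradient with a Count Sketch, a linear data structure, so the parameter server can merge the W worker sketches by simple summation and then recover an approximate top-k of the aggregated gradient from the merged table. All names, sketch dimensions, and the direct top-k recovery below are illustrative assumptions; the paper's full protocol and convergence machinery are not reproduced here.

    # Minimal illustration (Python/NumPy) of sketched gradient aggregation.
    # Hypothetical names; sketch sizes below are arbitrary illustrative choices.
    import numpy as np

    class CountSketch:
        """Linear Count Sketch: sketch(g1 + g2) = sketch(g1) + sketch(g2)."""
        def __init__(self, rows, cols, dim, seed=0):
            rng = np.random.default_rng(seed)            # shared seed => all workers use identical hashes
            self.rows, self.cols, self.dim = rows, cols, dim
            self.bucket = rng.integers(0, cols, size=(rows, dim))   # bucket hash h_r(i)
            self.sign = rng.choice([-1.0, 1.0], size=(rows, dim))   # sign hash s_r(i)
            self.table = np.zeros((rows, cols))

        def accumulate(self, vec):
            # Add a d-dimensional vector into the small (rows x cols) table.
            for r in range(self.rows):
                np.add.at(self.table[r], self.bucket[r], self.sign[r] * vec)

        def merge(self, other):
            self.table += other.table                    # server-side aggregation of worker sketches

        def estimate(self):
            # Median-over-rows estimate of every coordinate of the sketched vector.
            per_row = np.stack([self.sign[r] * self.table[r, self.bucket[r]]
                                for r in range(self.rows)])
            return np.median(per_row, axis=0)

    def sketched_topk_update(worker_grads, k, rows=5, cols=200):
        dim = worker_grads[0].size
        merged = CountSketch(rows, cols, dim)
        for g in worker_grads:                           # each worker ships only its small sketch
            s = CountSketch(rows, cols, dim)
            s.accumulate(g)
            merged.merge(s)                              # parameter server sums the sketches
        est = merged.estimate()                          # approximate aggregate gradient
        idx = np.argsort(np.abs(est))[-k:]               # approximate top-k coordinates
        update = np.zeros(dim)
        update[idx] = est[idx]
        return update                                    # sparse update applied to the model

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        dim, workers, k = 1000, 4, 10
        grads = [rng.normal(0.0, 0.01, dim) for _ in range(workers)]
        for g in grads:
            g[:k] += 1.0                                 # plant k heavy coordinates
        print(sketched_topk_update(grads, k)[:k])        # heavy coordinates should be recovered (about 4.0 each)

Because the sketch is linear and its size is governed by k (up to logarithmic factors) rather than by the model dimension d, each worker's message stays small, which is the source of the Θ(kW)-versus-Ω(dW) communication gap stated in the abstract.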

Related research

11/01/2021: Communication-Compressed Adaptive Gradient Method for Distributed Nonconvex Optimization
Due to the explosion in the size of the training datasets, distributed l...

02/16/2018: Variance-based Gradient Compression for Efficient Distributed Deep Learning
Due to the substantial computational cost, training state-of-the-art dee...

11/20/2019: Layer-wise Adaptive Gradient Sparsification for Distributed Deep Learning with Convergence Guarantees
To reduce the long training time of large deep neural network (DNN) mode...

10/14/2022: Communication-Efficient Adam-Type Algorithms for Distributed Data Mining
Distributed data mining is an emerging research topic to effectively and...

08/22/2018: Don't Use Large Mini-Batches, Use Local SGD
Mini-batch stochastic gradient methods are the current state of the art ...

11/12/2020: Distributed Sparse SGD with Majority Voting
Distributed learning, particularly variants of distributed stochastic gr...

10/17/2019: Communication-Efficient Asynchronous Stochastic Frank-Wolfe over Nuclear-norm Balls
Large-scale machine learning training suffers from two prior challenges,...
