
Communication-efficient distributed SGD with Sketching

by Nikita Ivkin, et al.
Johns Hopkins University
UC Berkeley

Large-scale distributed training of neural networks is often limited by network bandwidth, wherein the communication time overwhelms the local computation time. Motivated by the success of sketching methods in sub-linear/streaming algorithms, we propose a sketching-based approach to minimize the communication costs between nodes without losing accuracy. In our proposed method, workers in a distributed, synchronous training setting send sketches of their gradient vectors to the parameter server instead of the full gradient vector. Leveraging the theoretical properties of sketches, we show that this method recovers the favorable convergence guarantees of single-machine top-k SGD. Furthermore, when applied to a model with d dimensions on W workers, our method requires only Θ(kW) bytes of communication, compared to Ω(dW) for vanilla distributed SGD. To validate our method, we run experiments using a residual network trained on the CIFAR-10 dataset. We achieve no drop in validation accuracy with a compression ratio of 4, or about 1 percentage point drop with a compression ratio of 8. We also demonstrate that our method scales to many workers.
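The key property the abstract relies on is that sketches are linear: each worker sends a small sketch of its gradient, the parameter server sums the sketches to obtain a sketch of the aggregate gradient, and the heaviest coordinates can then be recovered approximately, mimicking top-k SGD. The snippet below is a simplified illustration of that idea with a basic Count Sketch; the names (`CountSketch`, `server_topk`) and parameters are hypothetical, not the paper's actual implementation, which includes heavy-hitter recovery and error accumulation.

```python
import numpy as np

class CountSketch:
    """Minimal Count Sketch: a linear projection of R^d into a rows x cols table.

    Linearity means a sketch of a sum equals the sum of the sketches, so the
    server can merge workers' sketches without seeing full gradients.
    """

    def __init__(self, d, rows, cols, seed=0):
        rng = np.random.default_rng(seed)
        self.rows, self.cols = rows, cols
        # Per-row hash bucket and random sign for every coordinate.
        self.bucket = rng.integers(0, cols, size=(rows, d))
        self.sign = rng.choice([-1.0, 1.0], size=(rows, d))

    def sketch(self, g):
        S = np.zeros((self.rows, self.cols))
        for r in range(self.rows):
            # Scatter-add each signed coordinate into its hashed bucket.
            np.add.at(S[r], self.bucket[r], self.sign[r] * g)
        return S

    def estimate(self, S):
        # Median across rows of the signed bucket contents gives a robust
        # per-coordinate estimate despite hash collisions.
        per_row = np.stack(
            [self.sign[r] * S[r, self.bucket[r]] for r in range(self.rows)]
        )
        return np.median(per_row, axis=0)


def server_topk(cs, worker_sketches, k):
    """Server side: merge the workers' sketches, keep the k heaviest coordinates."""
    merged = sum(worker_sketches)          # linearity of the sketch
    est = cs.estimate(merged)
    heavy = np.argsort(np.abs(est))[-k:]   # approximate top-k by magnitude
    out = np.zeros_like(est)
    out[heavy] = est[heavy]
    return out


# Toy round: two workers hold pieces of a gradient with two heavy coordinates.
d = 1000
cs = CountSketch(d, rows=7, cols=256, seed=1)
g1 = np.zeros(d); g1[7] = 6.0; g1[42] = -5.0
g2 = np.zeros(d); g2[7] = 4.0; g2[42] = -3.0
rec = server_topk(cs, [cs.sketch(g1), cs.sketch(g2)], k=2)
```

Each worker ships a `rows x cols` table (here 7 x 256 floats) instead of the full d-dimensional gradient, which is where the Θ(kW) vs. Ω(dW) communication gap in the abstract comes from.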





Code Repositories


TopH Count Sketch in Python
