TernGrad: Ternary Gradients to Reduce Communication in Distributed Deep Learning

05/22/2017
by   Wei Wen, et al.

High network communication cost for synchronizing gradients and parameters is the well-known bottleneck of distributed training. In this work, we propose TernGrad, which uses ternary gradients to accelerate distributed deep learning in data parallelism. Our approach requires only three numerical levels {-1, 0, 1}, which can aggressively reduce communication time. We mathematically prove the convergence of TernGrad under the assumption of a bound on gradients. Guided by the bound, we propose layer-wise ternarizing and gradient clipping to improve its convergence. Our experiments show that applying TernGrad on AlexNet does not incur any accuracy loss and can even improve accuracy. The accuracy loss of GoogLeNet induced by TernGrad is less than 2% on average. Finally, a performance model is proposed to study the scalability of TernGrad. Experiments show significant speed gains for various deep neural networks. Our source code is available.
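As a rough illustration of the technique the abstract describes, the following is a minimal NumPy sketch of ternarizing one layer's gradient to the levels {-1, 0, 1} scaled by a layer-wise maximum, with sigma-based gradient clipping applied first. The function name ternarize, the clip_factor value of 2.5, and the exact stochastic rounding details are illustrative assumptions, not the authors' released implementation.

import numpy as np

def ternarize(grad, clip_factor=2.5, rng=None):
    # Sketch of stochastic ternarization for one layer's gradient tensor.
    # clip_factor and the rounding scheme are illustrative assumptions.
    rng = rng or np.random.default_rng()

    # Gradient clipping: bound the range so a few large entries do not
    # inflate the layer-wise scaler.
    sigma = grad.std()
    clipped = np.clip(grad, -clip_factor * sigma, clip_factor * sigma)

    # Layer-wise scaler: the largest magnitude within this layer.
    s = np.abs(clipped).max()
    if s == 0:
        return np.zeros_like(clipped)

    # Keep each element's sign with probability |g_i| / s, otherwise send 0,
    # so the quantized gradient equals the clipped gradient in expectation.
    keep = rng.random(clipped.shape) < np.abs(clipped) / s
    return s * np.sign(clipped) * keep

In this sketch, only the sign pattern and the single per-layer scaler s would need to be communicated, and the quantized tensor equals the clipped gradient in expectation, which is the kind of unbiasedness a convergence analysis under bounded gradients can build on.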


