Optimal Gradient Quantization Condition for Communication-Efficient Distributed Training

02/25/2020
by   An Xu, et al.

Communicating gradients is costly when training deep neural networks across multiple devices in computer vision applications. In particular, the growing size of deep learning models leads to higher communication overhead that prevents the ideal linear training speedup in the number of devices. Gradient quantization is a common approach to reducing communication cost, but it introduces quantization error into training and can degrade model performance. In this work, we derive the optimal condition for both binary and multi-level gradient quantization under any gradient distribution. Based on this optimal condition, we develop two novel quantization schemes that dynamically determine the optimal quantization levels: biased BinGrad for binary quantization and unbiased ORQ for multi-level quantization. Extensive experiments on the CIFAR and ImageNet datasets with several popular convolutional neural networks show the superiority of the proposed methods.
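To make the setting concrete, here is a minimal sketch of unbiased multi-level stochastic gradient quantization in the QSGD style. This is a generic baseline for illustration, not the paper's ORQ scheme: the function names, the uniform level placement, and the choice of `levels=4` are all assumptions; the paper's contribution is precisely to choose the levels optimally rather than uniformly.

```python
import numpy as np

def quantize(grad, levels=4, rng=None):
    """Unbiased stochastic multi-level quantization (generic QSGD-style
    baseline, NOT the paper's ORQ). Each |g_i| / ||g||_inf is mapped to one
    of `levels` uniformly spaced levels, rounding up with probability equal
    to the fractional part so that E[dequantize(quantize(g))] = g."""
    rng = np.random.default_rng() if rng is None else rng
    scale = np.max(np.abs(grad))
    if scale == 0:
        return np.zeros(grad.shape, dtype=np.int64), np.sign(grad), scale
    x = np.abs(grad) / scale * levels          # position in [0, levels]
    lower = np.floor(x)
    prob_up = x - lower                        # round up w.p. fractional part
    q = lower + (rng.random(grad.shape) < prob_up)
    return q.astype(np.int64), np.sign(grad), scale

def dequantize(q, sign, scale, levels=4):
    # Reconstruct the (unbiased) gradient estimate from the compressed triple.
    return sign * q / levels * scale

# Averaging many independent quantizations recovers the original gradient,
# illustrating unbiasedness; a single draw only needs the small integer q,
# the sign bits, and one float scale instead of full-precision values.
g = np.array([0.8, -0.3, 0.05, 0.0])
reps = np.mean(
    [dequantize(*quantize(g, rng=np.random.default_rng(s))) for s in range(20000)],
    axis=0,
)
```

The biased binary case (as in BinGrad) would instead send only signs and a learned magnitude; the multi-level scheme above trades a few extra bits per coordinate for an unbiased estimate.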


Related research

04/29/2020  Quantized Adam with Error Feedback
    In this paper, we present a distributed variant of adaptive stochastic g...

11/30/2019  A binary-activation, multi-level weight RNN and training algorithm for processing-in-memory inference with eNVM
    We present a new algorithm for training neural networks with binary acti...

07/30/2021  DQ-SGD: Dynamic Quantization in SGD for Communication-Efficient Distributed Learning
    Gradient quantization is an emerging technique in reducing communication...

10/27/2020  A Statistical Framework for Low-bitwidth Training of Deep Neural Networks
    Fully quantized training (FQT), which uses low-bitwidth hardware by quan...

11/24/2018  On Periodic Functions as Regularizers for Quantization of Neural Networks
    Deep learning models have been successfully used in computer vision and ...

12/03/2020  Accumulated Decoupled Learning: Mitigating Gradient Staleness in Inter-Layer Model Parallelization
    Decoupled learning is a branch of model parallelism which parallelizes t...

10/23/2020  Adaptive Gradient Quantization for Data-Parallel SGD
    Many communication-efficient variants of SGD use gradient quantization s...
