AdaComp : Adaptive Residual Gradient Compression for Data-Parallel Distributed Training

12/07/2017
by Chia-Yu Chen, et al.

Highly distributed training of Deep Neural Networks (DNNs) on future compute platforms (offering 100s of TeraOps/s of computational capacity) is expected to be severely communication constrained. To overcome this limitation, new gradient compression techniques are needed that are computationally friendly, applicable to the wide variety of layers seen in Deep Neural Networks, and adaptable to variations in network architectures as well as their hyper-parameters. In this paper, we introduce a novel technique: the Adaptive Residual Gradient Compression (AdaComp) scheme. AdaComp is based on localized selection of gradient residues and automatically tunes the compression rate depending on local activity. We show excellent results on a wide spectrum of state-of-the-art Deep Learning models in multiple domains (vision, speech, language), datasets (MNIST, CIFAR10, ImageNet, BN50, Shakespeare), optimizers (SGD with momentum, Adam), and network parameters (number of learners, minibatch size, etc.). Exploiting both sparsity and quantization, we demonstrate end-to-end compression rates of 200X for fully-connected and recurrent layers, and 40X for convolutional layers, without any noticeable degradation in model accuracies.
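To make the abstract's description concrete, below is a minimal sketch in Python/NumPy of the localized residue-selection idea: gradients that are not sent accumulate as residues, each layer's residual gradient is split into fixed-length bins, entries that are significant relative to their bin-local maximum are quantized and sent, and the remainder is carried forward. The function name, bin length, and sign-magnitude quantization at the bin-local scale are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def adacomp_sketch(grad, residue, bin_len=256):
    """Sketch of local-bin residual gradient selection for one layer.

    grad:    latest gradient (1-D float array)
    residue: accumulated, not-yet-sent gradient from earlier steps (1-D float array)
    Returns (sparse_update, new_residue).
    """
    G = residue + grad                # accumulated residual gradient
    H = G + grad                      # emphasize the most recent gradient
    n = G.size
    pad = (-n) % bin_len              # pad so the vector splits into equal bins
    G_bins = np.pad(np.abs(G), (0, pad)).reshape(-1, bin_len)
    H_bins = np.pad(np.abs(H), (0, pad)).reshape(-1, bin_len)
    g_max = G_bins.max(axis=1, keepdims=True)        # bin-local maximum magnitude
    mask = (H_bins >= g_max).reshape(-1)[:n]         # locally "active" entries
    scale = np.repeat(g_max.ravel(), bin_len)[:n]    # per-entry bin scale
    sparse_update = np.zeros_like(G)
    sparse_update[mask] = np.sign(G[mask]) * scale[mask]  # quantized values to send
    new_residue = G - sparse_update   # remainder carried to the next iteration
    return sparse_update, new_residue
```

Because the suppressed remainder is carried forward rather than discarded, small gradients eventually accumulate enough to cross their bin-local threshold and get transmitted, which is what allows the effective compression rate to adapt to local activity without a hand-tuned global threshold.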

Related research

ScaleCom: Scalable Sparsified Gradient Compression for Communication-Efficient Distributed Training (04/21/2021)
Large-scale distributed training of Deep Neural Networks (DNNs) on state...

Compression of Deep Neural Networks on the Fly (09/29/2015)
Thanks to their state-of-the-art performance, deep neural networks are i...

Learning Sparsity and Quantization Jointly and Automatically for Neural Network Compression via Constrained Optimization (10/14/2019)
Deep Neural Networks (DNNs) are widely applied in a wide range of usecas...

L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression (10/31/2022)
Data-parallel distributed training of deep neural networks (DNN) has gai...

Wide Compression: Tensor Ring Nets (02/25/2018)
Deep neural networks have demonstrated state-of-the-art performance in a...

Quantized Distributed Training of Large Models with Convergence Guarantees (02/05/2023)
Communication-reduction techniques are a popular way to improve scalabil...

Compression-aware Training of Deep Networks (11/07/2017)
In recent years, great progress has been made in a variety of applicatio...
