
Efficient Distributed Auto-Differentiation

02/18/2021
by Bradley T. Baker, et al.

Although distributed machine learning has opened up numerous frontiers of research, the separation of large models across different devices, nodes, and sites can invite significant communication overhead, making reliable training difficult. The focus on gradients as the primary shared statistic during training has led to a number of intuitive algorithms for distributed deep learning; however, gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy, often requiring additional modifications such as sparsity constraints, compression, or quantization to lower bandwidth. We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient. The error backpropagation process can be modified to share these smaller intermediate values instead of the gradient, reducing communication overhead with no impact on accuracy. The process affords the flexibility of averaging gradients during backpropagation, enabling novel, flexible training schemes while leaving room for further bandwidth reduction via existing gradient compression methods. Finally, consideration of the matrices used to compute the gradient inspires a new approach to compression via structured power iterations, which can not only reduce bandwidth but also enable introspection into distributed training dynamics, without significant performance loss.
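To make the bandwidth argument concrete, the sketch below is a minimal NumPy illustration of the underlying linear-algebra fact rather than the authors' released code: for a fully-connected layer, the weight gradient is the product of the layer's input activations and its backpropagated errors, so workers that share those two thin matrices let an aggregator reconstruct exactly the averaged gradient while sending far fewer values. The final snippet shows how the same factored form admits a further lossy rank-r compression via a PowerSGD-style power iteration, which is offered only as an analogue of the "structured power iterations" mentioned above. All layer sizes, batch sizes, ranks, and variable names are arbitrary placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out, batch, n_workers = 4096, 1024, 32, 4

# Per-worker intermediate values produced during backpropagation for one
# fully-connected layer: input activations and backpropagated errors.
acts   = [rng.standard_normal((batch, n_in))  for _ in range(n_workers)]
deltas = [rng.standard_normal((batch, n_out)) for _ in range(n_workers)]

# Gradient-sharing baseline: every worker forms its full (n_in x n_out)
# gradient and the gradients are averaged across workers.
grad_avg = sum(a.T @ d for a, d in zip(acts, deltas)) / n_workers

# Sharing the intermediates instead: stack the thin matrices and
# reconstruct exactly the same averaged gradient from them.
A = np.concatenate(acts)     # (n_workers * batch, n_in)
D = np.concatenate(deltas)   # (n_workers * batch, n_out)
grad_from_stats = (A.T @ D) / n_workers
assert np.allclose(grad_avg, grad_from_stats)

# Floats communicated per worker: full gradient vs. the two intermediates.
print(f"gradient: {n_in * n_out:,}  intermediates: {batch * (n_in + n_out):,}")

# Optional further (lossy) compression in the spirit of power-iteration
# methods such as PowerSGD: approximate the gradient at rank r without
# ever materializing G = A.T @ D. (On random data the rank-r error is
# large; real gradients are typically much closer to low rank.)
r = 8
Q = rng.standard_normal((n_out, r))
P = A.T @ (D @ Q)            # G @ Q, computed in factored form
P, _ = np.linalg.qr(P)       # orthonormal basis, (n_in, r)
Q = D.T @ (A @ P)            # G.T @ P, (n_out, r)
grad_rank_r = (P @ Q.T) / n_workers
```

With the placeholder dimensions above, each worker ships 163,840 floats instead of 4,194,304 (roughly a 25x reduction), and the reconstruction is exact before any lossy low-rank step is applied.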
