
Efficient Distributed Auto-Differentiation

by Bradley T. Baker, et al.

Although distributed machine learning has opened numerous frontiers of research, splitting large models across devices, nodes, and sites can introduce significant communication overhead, making reliable training difficult. The focus on gradients as the primary shared statistic during training has led to a number of intuitive algorithms for distributed deep learning; however, gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy, often requiring additional modifications such as sparsity constraints, compression, or quantization to lower bandwidth. We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient. The error backpropagation process can be modified to share these smaller intermediate values instead of the gradient, reducing communication overhead with no impact on accuracy. The process preserves the flexibility of averaging gradients during backpropagation, enabling novel training schemes while leaving room for further bandwidth reduction via existing gradient-compression methods. Finally, consideration of the matrices used to compute the gradient inspires a new approach to compression via structured power iterations, which can not only reduce bandwidth but also enable introspection into distributed training dynamics, without significant performance loss.
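The abstract does not name the shared statistic, but for a fully-connected layer the weight gradient factors exactly into the layer's input activations and its backpropagated deltas, and those two factors are far smaller than the full gradient whenever the batch size is below the layer widths. The NumPy sketch below illustrates that bandwidth arithmetic under assumed shapes; all names and sizes here are illustrative, not the paper's implementation.

```python
import numpy as np

# Hypothetical shapes for a single fully-connected layer.
batch, n_in, n_out = 32, 4096, 4096

rng = np.random.default_rng(0)
acts = rng.standard_normal((batch, n_in))     # forward activations a
deltas = rng.standard_normal((batch, n_out))  # backpropagated errors d

# The local weight gradient is the product of the two small factors.
grad_local = acts.T @ deltas                   # shape (n_in, n_out)

# Per-worker communication cost, in number of values sent:
full_cost = grad_local.size                    # n_in * n_out          = 16,777,216
factor_cost = acts.size + deltas.size          # batch * (n_in + n_out) =    262,144

# Exactness check: summing the gradients of two simulated workers equals
# the gradient formed from the gathered (batch-concatenated) factors.
acts2 = rng.standard_normal((batch, n_in))
deltas2 = rng.standard_normal((batch, n_out))
summed = acts.T @ deltas + acts2.T @ deltas2
gathered = np.vstack([acts, acts2]).T @ np.vstack([deltas, deltas2])
assert np.allclose(summed, gathered)

print(f"full gradient: {full_cost:,} values; factors: {factor_cost:,} values")
```

Because the gathered factors multiply back to exactly the summed gradient, moving the all-reduce from the gradient to the factors changes where averaging happens but not its result, consistent with the abstract's "no impact on accuracy" claim.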

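For the closing claim about compression via structured power iterations, here is a minimal single-node sketch of low-rank gradient compression using one warm-started power (subspace) iteration, in the spirit of PowerSGD-style methods; the function names, the rank, and the warm-start scheme are assumptions for illustration, not the paper's actual algorithm.

```python
import numpy as np

def power_compress(grad, q, rank=4):
    """Approximate grad (m x n) by p @ q_new.T with one power iteration.

    q is the previous step's right factor (n x rank), reused as a warm
    start so the factorization can track a slowly changing gradient
    subspace. Only p (m x rank) and q_new (n x rank) would need to be
    communicated, instead of the full m x n gradient.
    """
    p = grad @ q                 # project onto the current subspace estimate
    p, _ = np.linalg.qr(p)       # orthonormalize the left factor
    q_new = grad.T @ p           # refreshed right factor
    return p, q_new

def power_decompress(p, q):
    return p @ q.T

rng = np.random.default_rng(0)
m, n, rank = 1024, 1024, 4
# Toy gradient: approximately rank-4 plus small noise.
grad = rng.standard_normal((m, rank)) @ rng.standard_normal((rank, n))
grad += 0.01 * rng.standard_normal((m, n))

q = rng.standard_normal((n, rank))   # warm-start factor, persisted across steps
p, q = power_compress(grad, q)
approx = power_decompress(p, q)

sent = p.size + q.size
print(f"sent {sent:,} values instead of {grad.size:,}")
print("relative error:", np.linalg.norm(grad - approx) / np.linalg.norm(grad))
```

In a distributed run, the collective would act on the small factors rather than the full gradient, and persisting q across steps lets the subspace estimate improve over time, which is also what would permit the kind of introspection into training dynamics the abstract mentions.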
