Efficient Distributed Auto-Differentiation

02/18/2021
by Bradley T. Baker, et al.

Although distributed machine learning has opened up numerous frontiers of research, the separation of large models across different devices, nodes, and sites can invite significant communication overhead, making reliable training difficult. The focus on gradients as the primary shared statistic during training has led to a number of intuitive algorithms for distributed deep learning; however, gradient-based algorithms for training large deep neural networks (DNNs) are communication-heavy, often requiring additional modifications such as sparsity constraints, compression, or quantization to lower bandwidth. We introduce a surprisingly simple statistic for training distributed DNNs that is more communication-friendly than the gradient. The error backpropagation process can be modified to share these smaller intermediate values instead of the gradient, reducing communication overhead with no impact on accuracy. The process retains the flexibility of averaging gradients during backpropagation, enabling novel, flexible training schemes while leaving room for further bandwidth reduction via existing gradient compression methods. Finally, consideration of the matrices used to compute the gradient inspires a new approach to compression via structured power iterations, which can not only reduce bandwidth but also enable introspection into distributed training dynamics without significant performance loss.
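
The abstract does not spell out which intermediate values are shared, but for a fully connected layer the weight gradient factors into the outer product of the layer's input activations and its backpropagated deltas, so exchanging those two factors instead of the assembled gradient is one natural reading of the idea. The NumPy sketch below illustrates that reading only; the helper names (local_factors, aggregate_gradient) and the simple gather-then-reconstruct exchange are illustrative assumptions, not the paper's implementation.

    import numpy as np

    # Sketch (assumed reading): for a dense layer, the weight gradient is the
    # outer product of the layer's input activations A (batch x fan_in) and the
    # backpropagated deltas D (batch x fan_out):
    #     grad_W = A.T @ D
    # Each worker can therefore transmit the much smaller (A, D) pair instead
    # of the full fan_in x fan_out gradient whenever
    #     batch * (fan_in + fan_out) < fan_in * fan_out.

    def local_factors(A, D):
        """Values a worker would transmit in place of its local gradient."""
        return A, D

    def aggregate_gradient(factor_pairs):
        """Reconstruct the averaged gradient from all workers' (A, D) pairs.

        (1/N) * sum_k A_k.T @ D_k equals the average of the per-worker
        gradients, so reconstruction after the exchange is exact.
        """
        fan_in = factor_pairs[0][0].shape[1]
        fan_out = factor_pairs[0][1].shape[1]
        grad = np.zeros((fan_in, fan_out))
        for A_k, D_k in factor_pairs:
            grad += A_k.T @ D_k
        return grad / len(factor_pairs)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        batch, fan_in, fan_out, workers = 32, 1024, 512, 4

        pairs = [(rng.standard_normal((batch, fan_in)),
                  rng.standard_normal((batch, fan_out))) for _ in range(workers)]

        # Reference: average of per-worker full gradients.
        reference = sum(A.T @ D for A, D in pairs) / workers
        assert np.allclose(aggregate_gradient(pairs), reference)

        floats_factors = batch * (fan_in + fan_out)   # per worker, factor exchange
        floats_gradient = fan_in * fan_out            # per worker, gradient exchange
        print(f"factors: {floats_factors} floats vs gradient: {floats_gradient} floats")

In this reading, the savings depend on the per-worker batch size being small relative to the layer dimensions, and the transmitted factor matrices are exactly the objects on which a structured power-iteration scheme could operate for further bandwidth reduction, as the abstract suggests.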

