On the Discrepancy between the Theoretical Analysis and Practical Implementations of Compressed Communication for Distributed Deep Learning

11/19/2019
by   Aritra Dutta, et al.

Compressed communication, in the form of sparsification or quantization of stochastic gradients, is employed to reduce communication costs in distributed data-parallel training of deep neural networks. However, there exists a discrepancy between theory and practice: while theoretical analysis of most existing compression methods assumes compression is applied to the gradients of the entire model, many practical implementations operate individually on the gradients of each layer of the model. In this paper, we prove that layer-wise compression is, in theory, better, because the convergence rate is upper bounded by that of entire-model compression for a wide range of biased and unbiased compression methods. However, despite the theoretical bound, our experimental study of six well-known methods shows that convergence, in practice, may or may not be better, depending on the actual trained model and compression ratio. Our findings suggest that it would be advantageous for deep learning frameworks to include support for both layer-wise and entire-model compression.
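To make the distinction in the abstract concrete, below is a minimal sketch (not the authors' code) contrasting entire-model and layer-wise compression, using top-k sparsification as a representative compressor. The helper names (top_k, compress_entire_model, compress_layer_wise) are illustrative assumptions, not an API from any framework: entire-model compression selects the top-k entries across the concatenated gradient of all layers, while layer-wise compression applies the same ratio to each layer's gradient independently.

```python
# Sketch of the two compression modes discussed in the abstract.
# Assumptions: top-k sparsification as the compressor; helper names are hypothetical.
import numpy as np

def top_k(grad: np.ndarray, ratio: float) -> np.ndarray:
    """Keep the `ratio` fraction of entries with largest magnitude; zero the rest."""
    flat = grad.ravel()
    k = max(1, int(ratio * flat.size))
    idx = np.argpartition(np.abs(flat), -k)[-k:]   # indices of the k largest magnitudes
    out = np.zeros_like(flat)
    out[idx] = flat[idx]
    return out.reshape(grad.shape)

def compress_entire_model(layer_grads, ratio):
    """Entire-model compression: concatenate all layer gradients, compress once."""
    flat = np.concatenate([g.ravel() for g in layer_grads])
    compressed = top_k(flat, ratio)
    # Split the compressed vector back into per-layer tensors for the update.
    out, offset = [], 0
    for g in layer_grads:
        out.append(compressed[offset:offset + g.size].reshape(g.shape))
        offset += g.size
    return out

def compress_layer_wise(layer_grads, ratio):
    """Layer-wise compression: apply the compressor to each layer's gradient separately."""
    return [top_k(g, ratio) for g in layer_grads]

# Toy usage: two "layers" with very different gradient scales, so the two
# modes select noticeably different entries.
grads = [np.random.randn(4, 4), 10.0 * np.random.randn(8)]
entire = compress_entire_model(grads, 0.25)
layerwise = compress_layer_wise(grads, 0.25)
```

With entire-model compression the large-magnitude layer can dominate the selected entries, whereas layer-wise compression guarantees each layer retains its own fraction of coordinates; this is the practical difference whose theoretical and empirical behavior the paper studies.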


