L-GreCo: An Efficient and General Framework for Layerwise-Adaptive Gradient Compression
Data-parallel distributed training of deep neural networks (DNN) has gained very widespread adoption, but can still experience communication bottlenecks due to gradient transmission. To address this issue, entire families of lossy gradient compression mechanisms have been developed, including quantization, sparsification, and low-rank approximation, some of which are seeing significant practical adoption. Despite this progress, almost all known compression schemes apply compression uniformly across DNN layers, although layers are heterogeneous in terms of parameter count and their impact on model accuracy. In this work, we provide a general framework for adapting the degree of compression across the model's layers dynamically during training, significantly improving the overall compression without sacrificing accuracy. Our framework, called L-GreCo, is based on an efficient adaptive algorithm, which automatically picks the optimal compression parameters for model layers guaranteeing the best compression ratio while respecting a theoretically-justified error constraint. Our extensive experimental study over image classification and language modeling tasks shows that L-GreCo is effective across all three compression families, and achieves up to 2.5× training speedup and up to 5× compression improvement over efficient implementations of standard approaches while recovering full accuracy. Moreover, we show that L-GreCo is complementary to existing adaptive algorithms improving their compression ratio by 50
READ FULL TEXT