Adaptive Periodic Averaging: A Practical Approach to Reducing Communication in Distributed Learning

07/13/2020
by Peng Jiang, et al.

Stochastic Gradient Descent (SGD) is the key learning algorithm for many machine learning tasks. Because of its computational cost, there is growing interest in accelerating SGD on HPC resources such as GPU clusters. However, the performance of parallel SGD is still bottlenecked by high communication costs, even with fast interconnects among the machines. A simple approach to alleviating this problem, used in many existing efforts, is to communicate only every few iterations, using a constant averaging period. In this paper, we show that the averaging period that is optimal in terms of convergence and communication cost is not a constant but varies over the course of the execution. Specifically, we observe that reducing the variance of the model parameters among the computing nodes is critical to the convergence of periodic parameter averaging SGD. Given a fixed communication budget, we show that it is more beneficial to synchronize more frequently in early iterations, to reduce the initially large variance, and less frequently in the later phase of training. We propose a practical algorithm, named ADaptive Periodic parameter averaging SGD (ADPSGD), that achieves a smaller overall variance of the model parameters, and thus better convergence, than Constant Periodic parameter averaging SGD (CPSGD). We evaluate our method on several image classification benchmarks and show that ADPSGD indeed achieves smaller training losses and higher test accuracies with less communication than CPSGD. Compared with gradient-quantization SGD, our algorithm achieves faster convergence with only half the communication. Compared with full-communication SGD, ADPSGD achieves 1.14x to 1.27x speedups with a 100Gbps connection among computing nodes, and the speedups increase to 1.46x to 1.95x with a 10Gbps connection.
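To make the idea concrete, the minimal sketch below shows periodic parameter averaging with a growing averaging period, written against PyTorch's torch.distributed API. The schedule in adaptive_period, the helper names, and the training-loop structure are illustrative assumptions for exposition, not the paper's exact variance-driven ADPSGD rule, which is given in the full text; the sketch also assumes the process group has already been initialized by the launcher.

```python
import torch
import torch.distributed as dist


def average_parameters(model):
    # One periodic-averaging step: all-reduce each parameter tensor and
    # divide by the number of workers so every node holds the average model.
    # Assumes dist.init_process_group(...) has already been called.
    world_size = dist.get_world_size()
    for p in model.parameters():
        dist.all_reduce(p.data, op=dist.ReduceOp.SUM)
        p.data /= world_size


def adaptive_period(step, total_steps, min_period=1, max_period=16):
    # Illustrative schedule (an assumption, not the paper's exact rule):
    # synchronize every iteration at the start, then stretch the averaging
    # period toward max_period as training progresses.
    frac = step / max(1, total_steps)
    return max(min_period, round(min_period + frac * (max_period - min_period)))


def train(model, optimizer, loader, total_steps):
    # Each worker runs local SGD and averages parameters whenever the
    # (growing) averaging period has elapsed since the last synchronization.
    step, last_sync = 0, 0
    for x, y in loader:
        loss = torch.nn.functional.cross_entropy(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        step += 1
        if step - last_sync >= adaptive_period(step, total_steps):
            average_parameters(model)
            last_sync = step
        if step >= total_steps:
            break
```

In this sketch, setting max_period equal to min_period recovers a constant averaging period, i.e., the CPSGD baseline the paper compares against.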

Related research

06/12/2020  O(1) Communication for Distributed SGD through Two-Level Gradient Averaging
10/27/2014  Parallel training of DNNs with Natural Gradient and Parameter Averaging
05/19/2021  Accelerating Gossip SGD with Periodic Global Averaging
03/02/2020  Iterate Averaging Helps: An Alternative Perspective in Deep Learning
12/06/2018  Elastic Gossip: Distributing Neural Network Training Using Gossip-like Protocols
11/27/2018  Stochastic Gradient Push for Distributed Deep Learning
10/13/2021  Adaptive Elastic Training for Sparse Deep Learning on Heterogeneous Multi-GPU Servers
