A Distributed Hierarchical SGD Algorithm with Sparse Global Reduction

03/12/2019
by Fan Zhou, et al.

Reducing communication overhead is a major challenge in large-scale distributed training. To address this issue, we present a hierarchical averaging stochastic gradient descent (Hier-AVG) algorithm that reduces the number of global reductions (averaging) by employing less costly local reductions. As a very general form of parallel SGD, Hier-AVG can recover several commonly adopted synchronous parallel SGD variants by adjusting its parameters. We establish standard convergence results for Hier-AVG on non-convex smooth optimization problems. In the non-asymptotic regime, we show that Hier-AVG with less frequent global averaging can sometimes train faster. In addition, we show that more frequent local averaging with more participants involved can lead to faster convergence. By comparing Hier-AVG with another distributed training algorithm, K-AVG, we show that by deploying local averaging in place of some global averaging, Hier-AVG achieves comparable training speed while consistently obtaining better test accuracy. As a result, local averaging can serve as an effective alternative for reducing communication overhead when the number of learners is large. We evaluate Hier-AVG with several state-of-the-art deep neural networks on CIFAR-10 to validate our analysis, and further experiments comparing Hier-AVG with K-AVG on ImageNet-1K also show Hier-AVG's superiority over K-AVG.
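The hierarchical averaging idea described above can be illustrated with a short simulation: learners are arranged into groups, each learner runs independent SGD steps, every group averages locally (a cheap intra-node reduction) after a fixed number of steps, and a global average across all groups is taken only every few local rounds. The sketch below is a minimal toy illustration on a quadratic objective; the parameter names (n_groups, workers_per_group, local_period, global_period) and the exact schedule are illustrative assumptions, not the paper's notation or its precise algorithm.

```python
# Minimal sketch of hierarchical averaging SGD on a toy quadratic loss.
# Hypothetical names and schedule; not the paper's exact Hier-AVG specification.
import numpy as np

rng = np.random.default_rng(0)
dim = 10
target = rng.normal(size=dim)           # optimum of the toy loss 0.5 * ||w - target||^2

def stoch_grad(w):
    """Stochastic gradient of the toy quadratic loss with additive noise."""
    return (w - target) + 0.1 * rng.normal(size=dim)

n_groups, workers_per_group = 4, 4      # 16 learners arranged hierarchically (assumed sizes)
lr = 0.1
local_period = 4                        # local (intra-group) averaging every 4 SGD steps
global_period = 3                       # global averaging only every 3 local rounds
n_rounds = 30

# One parameter vector per learner, indexed [group][worker].
w = [[np.zeros(dim) for _ in range(workers_per_group)] for _ in range(n_groups)]

for r in range(n_rounds):
    # Each learner takes `local_period` independent SGD steps.
    for g in range(n_groups):
        for i in range(workers_per_group):
            for _ in range(local_period):
                w[g][i] -= lr * stoch_grad(w[g][i])
    # Local reduction: average within each group (cheap, e.g. intra-node communication).
    for g in range(n_groups):
        group_avg = np.mean(w[g], axis=0)
        w[g] = [group_avg.copy() for _ in range(workers_per_group)]
    # Global reduction: average across groups, performed less frequently.
    if (r + 1) % global_period == 0:
        global_avg = np.mean([w[g][0] for g in range(n_groups)], axis=0)
        w = [[global_avg.copy() for _ in range(workers_per_group)]
             for _ in range(n_groups)]

print("final loss:", 0.5 * np.sum((w[0][0] - target) ** 2))
```

In this sketch the expensive cross-group reduction happens only once every `global_period` rounds, while the cheaper intra-group reduction happens every round, which is the communication trade-off the abstract attributes to Hier-AVG.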

07/17/2018

Parallel Restarted SGD for Non-Convex Optimization with Faster Convergence and Less Communication

For large scale non-convex stochastic optimization, parallel mini-batch ...
04/30/2020

Breaking (Global) Barriers in Parallel Stochastic Optimization with Wait-Avoiding Group Averaging

Deep learning at scale is dominated by communication time. Distributing ...
11/27/2018

Stochastic Gradient Push for Distributed Deep Learning

Large mini-batch parallel SGD is commonly used for distributed training ...
09/29/2022

Stop Wasting My Time! Saving Days of ImageNet and BERT Training with Latest Weight Averaging

Training vision or language models on large datasets can take days, if n...
05/19/2021

Accelerating Gossip SGD with Periodic Global Averaging

Communication overhead hinders the scalability of large-scale distribute...
06/23/2016

Parallel SGD: When does averaging help?

Consider a number of workers running SGD independently on the same pool ...
02/17/2021

Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View

Deep learning has become an indispensable part of life, such as face rec...
