Pipe-SGD: A Decentralized Pipelined SGD Framework for Distributed Deep Net Training

11/08/2018
by   Youjie Li, et al.
0

Distributed training of deep nets is an important technique to address some of the present day computing challenges like memory consumption and computational demands. Classical distributed approaches, synchronous or asynchronous, are based on the parameter server architecture, i.e., worker nodes compute gradients which are communicated to the parameter server while updated parameters are returned. Recently, distributed training with AllReduce operations gained popularity as well. While many of those operations seem appealing, little is reported about wall-clock training time improvements. In this paper, we carefully analyze the AllReduce based setup, propose timing models which include network latency, bandwidth, cluster size and compute time, and demonstrate that a pipelined training with a width of two combines the best of both synchronous and asynchronous training. Specifically, for a setup consisting of a four-node GPU cluster we show wall-clock time training improvements of up to 5.4x compared to conventional approaches.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
06/13/2019

Layered SGD: A Decentralized and Synchronous SGD Algorithm for Scalable Deep Neural Network Training

Stochastic Gradient Descent (SGD) is the most popular algorithm for trai...
research
01/21/2023

ABS: Adaptive Bounded Staleness Converges Faster and Communicates Less

Wall-clock convergence time and communication rounds are critical perfor...
research
02/07/2018

MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster

In this paper, we present a co-designed petascale high-density GPU clust...
research
04/04/2016

Revisiting Distributed Synchronous SGD

Distributed training of deep learning models on large-scale training dat...
research
01/12/2022

Adaptive Worker Grouping For Communication-Efficient and Straggler-Tolerant Distributed SGD

Wall-clock convergence time and communication load are key performance m...
research
07/02/2021

ResIST: Layer-Wise Decomposition of ResNets for Distributed Training

We propose , a novel distributed training protocol for Residual Networks...
research
11/24/2018

Hydra: A Peer to Peer Distributed Training & Data Collection Framework

The world needs diverse and unbiased data to train deep learning models....

Please sign up or login with your details

Forgot password? Click here to reset