Dynamic Stale Synchronous Parallel Distributed Training for Deep Learning

08/16/2019
by Xing Zhao, et al.

Deep learning is a popular machine learning technique that has been applied to many real-world problems. However, training a deep neural network is very time-consuming, especially on big data, and it has become difficult for a single machine to train a large model over a large dataset. A popular solution is to distribute and parallelize training across multiple machines using the parameter server framework. In this paper, we present a distributed paradigm on the parameter server framework called Dynamic Stale Synchronous Parallel (DSSP), which improves on the state-of-the-art Stale Synchronous Parallel (SSP) paradigm by determining the staleness threshold dynamically at run time. Conventionally, to run distributed training under SSP, the user must specify a particular staleness threshold as a hyper-parameter. However, users typically do not know how to set this threshold and often resort to finding a value through trial and error, which is time-consuming. Based on workers' recent processing times, DSSP adaptively adjusts the threshold per iteration at run time to reduce the time faster workers spend waiting for synchronization of the globally shared parameters. This increases the frequency of parameter updates (i.e., the iteration throughput), which in turn speeds up convergence. We compare DSSP with other paradigms such as Bulk Synchronous Parallel (BSP), Asynchronous Parallel (ASP), and SSP by running deep neural network (DNN) models over GPU clusters in both homogeneous and heterogeneous environments. The results show that in a heterogeneous environment, where the cluster consists of mixed models of GPUs, DSSP converges to a higher accuracy much earlier than SSP and BSP and performs similarly to ASP. In a homogeneous cluster, DSSP exhibits more stable and slightly better performance than SSP and ASP, and converges much faster than BSP.

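To make the idea concrete, below is a minimal, hypothetical Python sketch (not the authors' implementation) of a server-side check in the spirit of DSSP: the permissible staleness floats between a lower and an upper bound, and a fast worker is allowed to run ahead of the base threshold only when the workers' recently observed per-iteration processing times suggest the straggler will catch up. All names (DynamicStalenessController, may_proceed, s_min, s_max) and the catch-up heuristic are assumptions for illustration.

```python
import time
from collections import defaultdict, deque


class DynamicStalenessController:
    """Illustrative DSSP-style check: decide whether a worker may continue
    past the base staleness threshold without waiting for stragglers.
    Hypothetical sketch; the bounds and heuristic are not from the paper."""

    def __init__(self, num_workers, s_min=3, s_max=15, history=5):
        self.s_min = s_min                                   # base SSP-style threshold
        self.s_max = s_max                                   # hard upper bound on staleness
        self.clock = {w: 0 for w in range(num_workers)}      # worker id -> completed iterations
        self.iter_times = defaultdict(lambda: deque(maxlen=history))
        self.last_seen = {}

    def report_iteration(self, worker_id):
        """Record one finished iteration (e.g., when a worker pushes gradients)."""
        now = time.monotonic()
        if worker_id in self.last_seen:
            self.iter_times[worker_id].append(now - self.last_seen[worker_id])
        self.last_seen[worker_id] = now
        self.clock[worker_id] += 1

    def may_proceed(self, worker_id):
        """True if the worker can pull parameters and continue without blocking."""
        slowest_id = min(self.clock, key=self.clock.get)
        gap = self.clock[worker_id] - self.clock[slowest_id]

        if gap <= self.s_min:     # within the base threshold: never block
            return True
        if gap >= self.s_max:     # beyond the hard bound: always block
            return False

        def avg(times):
            return sum(times) / len(times) if times else float("inf")

        # In the dynamic zone, let the fast worker run ahead only while the
        # straggler's recent iteration time suggests it will catch up soon
        # (a placeholder heuristic for the paper's decision logic).
        return avg(self.iter_times[slowest_id]) <= 2.0 * avg(self.iter_times[worker_id])
```

In use, a parameter server loop would call report_iteration each time a worker pushes gradients and consult may_proceed before serving that worker's next parameter pull; the specific bounds and the catch-up rule above stand in for the run-time decision described in the paper.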
Related research:
Dynamic backup workers for parallel machine learning (04/30/2020)
Elastic Bulk Synchronous Parallel Model for Distributed Deep Learning (01/06/2020)
Accelerating Distributed ML Training via Selective Synchronization (07/16/2023)
Fast Distributed Deep Learning via Worker-adaptive Batch Sizing (06/07/2018)
Sync-Switch: Hybrid Parameter Synchronization for Distributed Deep Learning (04/16/2021)
Heterogeneity-Aware Asynchronous Decentralized Training (09/17/2019)
Oscars: Adaptive Semi-Synchronous Parallel Model for Distributed Deep Learning with Global View (02/17/2021)
