High Throughput Synchronous Distributed Stochastic Gradient Descent

03/12/2018
by Michael Teng, et al.

We introduce a new high-throughput, synchronous, distributed, data-parallel stochastic gradient descent learning algorithm. This algorithm uses amortized inference in a compute-cluster-specific deep generative dynamical model to perform joint posterior predictive inference of the mini-batch gradient computation times of all worker nodes in a parallel computing cluster. We show that a synchronous parameter server can, by utilizing such a model, choose a cutoff time, beyond which mini-batch gradient messages from slow workers are ignored, that maximizes the overall rate of mini-batch gradient computations per second. In keeping with earlier findings, we observe that, under realistic conditions, eagerly discarding the mini-batch gradient computations of stragglers not only increases throughput but actually increases the overall rate of convergence as a function of wall-clock time by virtue of eliminating idleness. The principal novel contribution of this work goes beyond this: we demonstrate that using the run-times predicted by a generative model of cluster worker performance to dynamically adjust the cutoff improves substantially over the static-cutoff prior art, leading, among other things, to significantly reduced deep neural network training times on large computer clusters.
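
The core mechanism, predicting per-worker run-times and picking the cutoff that maximizes gradient computations per second, can be illustrated with a short sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (choose_cutoff, synchronous_step) are hypothetical, and the log-normal samples merely stand in for the posterior predictive run-times that the paper's deep generative dynamical model of cluster performance would produce.

```python
# Minimal sketch of dynamic-cutoff selection for a synchronous parameter server.
# Assumes posterior predictive samples of per-worker mini-batch gradient times are
# already available; here a log-normal draw stands in for the cluster model's output.

import numpy as np

def choose_cutoff(predicted_times, candidate_cutoffs):
    """Pick the cutoff that maximizes expected gradient computations per second.

    predicted_times: array (num_samples, num_workers) of posterior predictive
        samples of each worker's next mini-batch gradient time, in seconds.
    candidate_cutoffs: 1-D array of cutoff times (seconds) to evaluate.
    """
    best_cutoff, best_rate = None, -np.inf
    for t in candidate_cutoffs:
        # Expected number of workers whose gradients arrive before the cutoff.
        expected_done = (predicted_times <= t).sum(axis=1).mean()
        rate = expected_done / t  # expected gradients per second at this cutoff
        if rate > best_rate:
            best_cutoff, best_rate = t, rate
    return best_cutoff, best_rate

def synchronous_step(actual_times, gradients, cutoff):
    """Average only the gradients of workers that finish before the cutoff."""
    on_time = actual_times <= cutoff
    if not on_time.any():                    # degenerate case: keep the fastest worker
        on_time = actual_times == actual_times.min()
    return gradients[on_time].mean(axis=0), int(on_time.sum())

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_workers, dim = 32, 10

    # Stand-in for posterior predictive run-time samples from the cluster model.
    predicted = rng.lognormal(mean=0.0, sigma=0.5, size=(1000, num_workers))
    cutoff, rate = choose_cutoff(predicted, np.linspace(0.5, 5.0, 50))
    print(f"chosen cutoff: {cutoff:.2f}s, expected throughput: {rate:.2f} grads/s")

    # One simulated synchronous step: stragglers past the cutoff are ignored.
    actual = rng.lognormal(mean=0.0, sigma=0.5, size=num_workers)
    grads = rng.normal(size=(num_workers, dim))
    avg_grad, used = synchronous_step(actual, grads, cutoff)
    print(f"aggregated {used}/{num_workers} worker gradients")
```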

Related research

06/14/2022
MBGDT: Robust Mini-Batch Gradient Descent
In high dimensions, most machine learning method perform fragile even th...

12/07/2018
Nonlinear Conjugate Gradients For Scaling Synchronous Distributed DNN Training
Nonlinear conjugate gradient (NLCG) based optimizers have shown superior...

05/31/2020
DaSGD: Squeezing SGD Parallelization Performance in Distributed Training Using Delayed Averaging
The state-of-the-art deep learning algorithms rely on distributed traini...

05/20/2023
Taming Resource Heterogeneity In Distributed ML Training With Dynamic Batching
Current techniques and systems for distributed model training mostly ass...

10/17/2014
mS2GD: Mini-Batch Semi-Stochastic Gradient Descent in the Proximal Setting
We propose a mini-batching scheme for improving the theoretical complexi...

02/07/2018
MiMatrix: A Massively Distributed Deep Learning Framework on a Petascale High-density Heterogeneous Cluster
In this paper, we present a co-designed petascale high-density GPU clust...

10/06/2020
A Closer Look at Codistillation for Distributed Training
Codistillation has been proposed as a mechanism to share knowledge among...
