Dynamic Scheduling of MPI-based Distributed Deep Learning Training Jobs

08/21/2019
by Tim Capes, et al.

There is a general trend towards solving problems suited to deep learning with more complex architectures trained on larger training sets. This requires longer compute times and more data parallelism or model parallelism. Both data and model parallelism have historically been faster in parameter server architectures, but data parallelism is starting to be faster in ring architectures due to algorithmic improvements. In this paper, we analyze the math behind ring architectures and use that analysis to adapt dynamic scheduling to them. To do so, we formulate a non-convex, non-linear, NP-hard integer programming problem and propose a new, efficient doubling heuristic to solve it. We build upon Horovod, an open-source ring architecture framework over TensorFlow. We show that Horovod jobs have a low cost to stop and restart, and that stopping and restarting ring architecture jobs leads to faster completion times. These two facts make dynamic scheduling of ring architecture jobs feasible. Lastly, we simulate a scheduler using these runs and show that it more than halves average job time on some workload patterns.
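The abstract does not reproduce the ring analysis itself. As general background only, and not as the authors' formulation, the following is a minimal pure-Python simulation of the textbook ring all-reduce that ring architectures rely on. The function name `ring_allreduce` and the list-based "workers" are hypothetical choices for illustration; no Horovod or MPI calls are used. The sketch assumes the buffer length is divisible by the number of workers.

```python
def ring_allreduce(buffers):
    """Simulate a sum all-reduce over a logical ring of workers.

    buffers: list of equal-length lists, one per worker.
    Returns (reduced buffers, elements sent per worker).
    """
    n = len(buffers)
    length = len(buffers[0])
    assert length % n == 0, "sketch assumes length divisible by worker count"
    chunk = length // n
    data = [list(b) for b in buffers]  # copy so inputs are not mutated
    sent = [0] * n

    def span(c):
        # Index range covered by chunk c.
        return range(c * chunk, (c + 1) * chunk)

    def exchange(chunk_of, combine):
        # Snapshot all outgoing messages first to mimic simultaneous sends.
        msgs = []
        for i in range(n):
            c = chunk_of(i)
            msgs.append((c, [data[i][k] for k in span(c)]))
        for i, (c, payload) in enumerate(msgs):
            dst = (i + 1) % n  # each worker sends to its right neighbour
            for off, k in enumerate(span(c)):
                data[dst][k] = combine(data[dst][k], payload[off])
            sent[i] += chunk

    # Phase 1 (scatter-reduce): after n-1 steps, worker i holds the
    # fully summed chunk (i + 1) % n.
    for step in range(n - 1):
        exchange(lambda i: (i - step) % n, lambda old, new: old + new)

    # Phase 2 (all-gather): circulate the finished chunks around the ring.
    for step in range(n - 1):
        exchange(lambda i: (i + 1 - step) % n, lambda old, new: new)

    return data, sent


if __name__ == "__main__":
    result, sent = ring_allreduce([[1, 2, 3], [10, 20, 30], [100, 200, 300]])
    print(result)
    print(sent)
```

In this simulation each worker sends 2(N-1) chunks of size M/N, that is 2(N-1)/N times the buffer size M, so the per-worker traffic stays below 2M however many workers join the ring.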

research
02/02/2022

GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs

Fueled by advances in distributed deep learning (DDL), recent years have...
research
07/16/2022

On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention

Powered by advances in deep learning (DL) techniques, machine learning a...
research
05/10/2018

Unifying Data, Model and Hybrid Parallelism in Deep Learning via Tensor Tiling

Deep learning systems have become vital tools across many fields, but th...
research
10/03/2018

Learning Scheduling Algorithms for Data Processing Clusters

Efficiently scheduling data processing jobs on distributed compute clust...
research
12/30/2020

New Partitioning Techniques and Faster Algorithms for Approximate Interval Scheduling

Interval scheduling is a basic problem in the theory of algorithms and a...
research
05/28/2021

A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

In recent years, to sustain the resource-intensive computational needs f...
research
08/24/2012

Parallel ACO with a Ring Neighborhood for Dynamic TSP

The current paper introduces a new parallel computing technique based on...
