A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

05/28/2021
by Menglu Yu, et al.

In recent years, to sustain the resource-intensive computational needs of training deep neural networks (DNNs), it has become widely accepted that exploiting the parallelism of large-scale computing clusters is critical for the efficient deployment of DNN training jobs. However, existing resource schedulers for traditional computing clusters are not well suited to DNN training, resulting in unsatisfactory job completion times. The limitations of these resource scheduling schemes motivate us to propose a new computing cluster resource scheduling framework that leverages the special layered structure of DNN jobs and significantly improves their job completion times. Our contributions in this paper are three-fold: i) we develop a new analytical resource scheduling model that accounts for the layered structure of DNNs, which enables us to analytically formulate the resource scheduling optimization problem for DNN training in computing clusters; ii) based on this analytical model, we develop an efficient resource scheduling algorithm for the widely adopted parameter-server architecture, using a sum-of-ratios multi-dimensional-knapsack decomposition (SMD) method to offer a strong performance guarantee; iii) we conduct extensive numerical experiments to demonstrate the effectiveness of the proposed scheduling algorithm and its superior performance over the state of the art.
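The abstract names the optimization structure (a sum-of-ratios objective over a multi-dimensional-knapsack feasible set) without spelling out the formulation. Below is a minimal, hypothetical Python sketch of that problem shape: each scheduled job contributes a utility-to-delay ratio, and resource demands must fit multi-dimensional capacities. All names (profit, delay, weight, capacity) and numbers are illustrative assumptions, not taken from the paper, and the brute-force solver is only a correctness reference for a toy instance, not the paper's SMD algorithm.

```python
import itertools
import numpy as np

# Toy instance (hypothetical sizes and values, not from the paper):
# n jobs compete for m resource types; scheduling job i yields utility
# profit[i] but adds delay[i] to that job's completion-time term.
rng = np.random.default_rng(0)
n, m = 10, 3
profit = rng.uniform(1.0, 5.0, size=n)       # per-job utility (numerators)
delay = rng.uniform(1.0, 3.0, size=n)        # per-job time cost (denominators)
weight = rng.uniform(0.5, 2.0, size=(m, n))  # resource demand of each job
capacity = 0.5 * weight.sum(axis=1)          # each resource fits ~half the jobs

def objective(x):
    # Sum-of-ratios objective: each scheduled job (x_i = 1) contributes
    # profit_i / (1 + delay_i); unscheduled jobs contribute 0.
    x = np.asarray(x, dtype=float)
    return float(np.sum(x * profit / (1.0 + x * delay)))

def is_feasible(x):
    # Multi-dimensional knapsack constraints: the total demand on every
    # resource type must stay within that resource's capacity.
    return bool(np.all(weight @ np.asarray(x, dtype=float) <= capacity + 1e-9))

# Brute-force reference solver (exponential in n; fine for n = 10).
best_x, best_val = None, -np.inf
for x in itertools.product([0, 1], repeat=n):
    if is_feasible(x):
        val = objective(x)
        if val > best_val:
            best_x, best_val = x, val

print("scheduled jobs:", [i for i, xi in enumerate(best_x) if xi])
print("objective value: %.3f" % best_val)
```

Per the abstract, the paper's SMD method presumably avoids this exponential enumeration by decomposing the sum-of-ratios objective into tractable knapsack subproblems with a performance guarantee; the sketch above fixes only the problem shape.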

Related research

02/02/2022 - GADGET: Online Resource Optimization for Scheduling Ring-All-Reduce Learning Jobs
07/16/2022 - On Scheduling Ring-All-Reduce Learning Jobs in Multi-Tenant GPU Clusters with Communication Contention
08/06/2021 - Toward Efficient Online Scheduling for Distributed Machine Learning Systems
01/03/2018 - Online Job Scheduling in Distributed Machine Learning Clusters
06/22/2021 - BFTrainer: Low-Cost Training of Neural Networks on Unfillable Supercomputer Nodes
04/25/2016 - Do the Hard Stuff First: Scheduling Dependent Computations in Data-Analytics Clusters
08/21/2019 - Dynamic Scheduling of MPI-based Distributed Deep Learning Training Jobs
