Online Job Scheduling in Distributed Machine Learning Clusters

01/03/2018
by   Yixin Bao, et al.
0

Nowadays large-scale distributed machine learning systems have been deployed to support various analytics and intelligence services in IT firms. To train a large dataset and derive the prediction/inference model, e.g., a deep neural network, multiple workers are run in parallel to train partitions of the input dataset, and update shared model parameters. In a shared cluster handling multiple training jobs, a fundamental issue is how to efficiently schedule jobs and set the number of concurrent workers to run for each job, such that server resources are maximally utilized and model training can be completed in time. Targeting a distributed machine learning system using the parameter server framework, we design an online algorithm for scheduling the arriving jobs and deciding the adjusted numbers of concurrent workers and parameter servers for each job over its course, to maximize overall utility of all jobs, contingent on their completion times. Our online algorithm design utilizes a primal-dual framework coupled with efficient dual subroutines, achieving good long-term performance guarantees with polynomial time complexity. Practical effectiveness of the online algorithm is evaluated using trace-driven simulation and testbed experiments, which demonstrate its outperformance as compared to commonly adopted scheduling algorithms in today's cloud systems.

READ FULL TEXT

page 1

page 2

page 3

page 4

research
02/24/2020

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

Distributed Deep Learning (DDL) has rapidly grown its popularity since i...
research
10/06/2018

Towards Self-Tuning Parameter Servers

Recent years, many applications have been driven advances by the use of ...
research
05/11/2018

Peacock: Probe-Based Scheduling of Jobs by Rotating Between Elastic Queues

In this paper, we propose Peacock, a new distributed probe-based schedul...
research
12/26/2021

Large-scale Machine Learning Cluster Scheduling via Multi-agent Graph Reinforcement Learning

Efficient scheduling of distributed deep learning (DL) jobs in large GPU...
research
10/02/2019

Scheduling Stochastic Real-Time Jobs in Unreliable Workers

We consider a distributed computing network consisting of a master and m...
research
08/06/2021

Toward Efficient Online Scheduling for Distributed Machine Learning Systems

Recent years have witnessed a rapid growth of distributed machine learni...
research
05/28/2021

A Sum-of-Ratios Multi-Dimensional-Knapsack Decomposition for DNN Resource Scheduling

In recent years, to sustain the resource-intensive computational needs f...

Please sign up or login with your details

Forgot password? Click here to reset