Online Evolutionary Batch Size Orchestration for Scheduling Deep Learning Workloads in GPU Clusters

08/08/2021
by Zhengda Bian, et al.

Efficient GPU resource scheduling is essential for maximizing resource utilization and reducing training costs as the number of deep learning workloads in shared GPU clusters grows. Existing GPU schedulers largely rely on static policies to exploit the performance characteristics of deep learning jobs. Because these policies lack elasticity, they rarely reach optimal efficiency. To address this problem, we propose ONES, an ONline Evolutionary Scheduler for elastic batch size orchestration. ONES automatically manages the elasticity of each job by adjusting its training batch size, so as to maximize GPU utilization and improve scheduling efficiency. It determines the batch size for each job through an online evolutionary search that continuously optimizes the scheduling decisions. We evaluate ONES with 64 GPUs on TACC's Longhorn supercomputer. The results show that ONES outperforms prior deep learning schedulers, achieving a significantly shorter average job completion time.
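
To make the idea of an evolutionary search over batch sizes concrete, here is a minimal Python sketch. It is not the ONES algorithm: the abstract does not specify the fitness function, mutation operators, or constraints, so everything below is an illustrative assumption. In particular, the saturating `throughput` model, the `capacity` limit on the total batch size, the job parameters `peak` and `half`, and the halve-or-double mutation are all hypothetical.

```python
import random


def throughput(batch_size, peak, half_point):
    # Hypothetical samples/sec model with diminishing returns (an assumption,
    # not a measurement or a formula from the paper).
    return peak * batch_size / (batch_size + half_point)


def fitness(candidate, jobs, capacity):
    # A candidate assigns one batch size per job. Candidates whose total batch
    # size exceeds the assumed cluster capacity are treated as infeasible.
    if sum(candidate) > capacity:
        return float("-inf")
    return sum(throughput(b, j["peak"], j["half"]) for b, j in zip(candidate, jobs))


def mutate(candidate, rng, min_bs=8, max_bs=1024):
    # Halve or double the batch size of one randomly chosen job, within bounds.
    child = list(candidate)
    i = rng.randrange(len(child))
    factor = rng.choice([0.5, 2])
    child[i] = int(min(max_bs, max(min_bs, child[i] * factor)))
    return child


def evolve(jobs, capacity, generations=50, population_size=8, seed=0):
    rng = random.Random(seed)
    # Start every job at the minimum batch size (assumed to be feasible).
    population = [[8] * len(jobs) for _ in range(population_size)]
    for _ in range(generations):
        offspring = [mutate(p, rng) for p in population]
        # Keep the best of parents and offspring, so the best fitness never drops.
        pool = population + offspring
        pool.sort(key=lambda c: fitness(c, jobs, capacity), reverse=True)
        population = pool[:population_size]
    return population[0]


if __name__ == "__main__":
    # Three hypothetical jobs with different saturation behaviour.
    jobs = [
        {"peak": 1000.0, "half": 64.0},
        {"peak": 400.0, "half": 16.0},
        {"peak": 2500.0, "half": 256.0},
    ]
    print(evolve(jobs, capacity=512))
```

In an online setting, a loop like this would presumably be re-run or continued as jobs arrive and finish, with the fitness driven by measured rather than modeled throughput; that part is left out of the sketch.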

research
02/24/2020

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

Distributed Deep Learning (DDL) has rapidly grown its popularity since i...
research
10/01/2021

Characterizing Concurrency Mechanisms for NVIDIA GPUs under Deep Learning Workloads

We investigate the performance of the concurrency mechanisms available o...
research
01/08/2020

HyperSched: Dynamic Resource Reallocation for Model Development on a Deadline

Prior research in resource scheduling for machine learning training work...
research
08/20/2020

Heterogeneity-Aware Cluster Scheduling Policies for Deep Learning Workloads

Specialized accelerators such as GPUs, TPUs, FPGAs, and custom ASICs hav...
research
12/31/2018

Dynamic Space-Time Scheduling for GPU Inference

Serving deep neural networks in latency critical interactive settings of...
research
11/10/2022

Job Scheduling in Datacenters using Constraint Controlled RL

This paper studies a model for online job scheduling in green datacenter...
research
12/19/2021

Efficient Strong Scaling Through Burst Parallel Training

As emerging deep neural network (DNN) models continue to grow in size, u...
