Deep Learning Workload Scheduling in GPU Datacenters: Taxonomy, Challenges and Vision

05/24/2022
by Wei Gao, et al.

Deep learning (DL) has flourished in a wide variety of fields. Developing a DL model is a time-consuming and resource-intensive procedure, so dedicated GPU accelerators are pooled into GPU datacenters. An efficient scheduler design for such a GPU datacenter is crucial for reducing operational cost and improving resource utilization. However, traditional approaches designed for big data or high-performance computing workloads cannot help DL workloads fully utilize GPU resources. Recently, a substantial number of schedulers tailored to DL workloads in GPU datacenters have been proposed. This paper surveys existing research efforts for both training and inference workloads. We primarily present how existing schedulers facilitate the respective workloads in terms of their scheduling objectives and resource consumption features. Finally, we outline several promising directions for future research. A more detailed summary, with links to the surveyed papers and code, can be found at our project website: https://github.com/S-Lab-System-Group/Awesome-DL-Scheduling-Papers
