Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

09/03/2021
by Qinghao Hu, et al.

Modern GPU datacenters are critical for delivering Deep Learning (DL) models and services in both the research community and industry. When operating a datacenter, optimizing resource scheduling and management can bring significant financial benefits. Achieving this goal requires a deep understanding of job features and user behaviors. We present a comprehensive study of the characteristics of DL jobs and resource management. First, we perform a large-scale analysis of real-world job traces from SenseTime. We draw conclusions from the perspectives of clusters, jobs, and users that can inform cluster system design. Second, we introduce a general-purpose framework that manages resources based on historical data. As case studies, we design a Quasi-Shortest-Service-First scheduling service, which can reduce the cluster-wide average job completion time by up to 6.5x, and a Cluster Energy Saving service, which improves overall cluster utilization by up to 13
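
The abstract names a Quasi-Shortest-Service-First scheduling service but does not describe how it orders jobs. As a rough illustration of why serving short jobs first can lower average job completion time, the toy sketch below compares arrival order with shortest-predicted-service-first order. It is not the authors' implementation. Everything in it is an assumption made for illustration: the `Job` class, the `predicted_gpu_hours` field (standing in for an estimate derived from historical data), the sample values, and the simplification that all jobs arrive at time zero and the cluster finishes one GPU-hour of work per hour, one job at a time.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical job record; the fields are illustrative, not from the paper.
    name: str
    predicted_gpu_hours: float  # assumed estimate based on historical data


def shortest_service_first(jobs):
    """Return the jobs ordered by ascending predicted service."""
    return sorted(jobs, key=lambda job: job.predicted_gpu_hours)


def average_completion_time(jobs):
    """Average completion time when jobs run back to back in the given order.

    Toy model: all jobs are queued at time zero and the cluster processes
    one GPU-hour of work per hour, one job at a time.
    """
    clock = 0.0
    total = 0.0
    for job in jobs:
        clock += job.predicted_gpu_hours  # this job finishes at `clock`
        total += clock
    return total / len(jobs)


# Made-up sample queue, listed in arrival order.
jobs = [Job("A", 8.0), Job("B", 1.0), Job("C", 3.0)]

fifo_avg = average_completion_time(jobs)
ssf_avg = average_completion_time(shortest_service_first(jobs))
```

In this toy queue, running the one-hour job before the eight-hour job means two of the three jobs finish much earlier, so `ssf_avg` comes out lower than `fifo_avg`. A real scheduler would also have to handle jobs that arrive over time, multi-GPU placement, and prediction error, none of which this sketch models.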


