Energy-Efficient GPU Clusters Scheduling for Deep Learning

04/13/2023
by Diandian Gu, et al.

Training deep neural networks (DNNs) is a major datacenter workload today, and it is driving rapid growth in energy consumption. It is therefore important to reduce energy consumption while still completing deep learning (DL) training jobs early. In this paper, we propose PowerFlow, a GPU cluster scheduler that reduces the average job completion time (JCT) under an energy budget. We first present performance models that predict the throughput and energy consumption of DL training jobs under different configurations. Based on these models, PowerFlow dynamically allocates GPUs and adjusts the GPU-level or job-level configurations of DL training jobs. For job placement, PowerFlow applies network packing and buddy allocation, which avoids the extra energy consumed by cluster fragmentation. Evaluation results show that, under the same energy consumption, PowerFlow improves the average JCT by up to 1.57× to 3.39× compared to competitive baselines.
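
The abstract names buddy allocation as one of the placement techniques but does not describe how PowerFlow implements it. As a rough illustration only, the sketch below shows the classic buddy scheme applied to a pool of GPUs: requests are rounded up to a power of two, larger free blocks are split on demand, and freed blocks are merged with their "buddy" so that contiguous capacity is recovered. The class name `BuddyAllocator`, its methods, and the flat GPU index space are all assumptions made for this example and are not taken from the paper.

```python
class BuddyAllocator:
    """Toy buddy allocator over a power-of-two pool of GPUs (hypothetical)."""

    def __init__(self, total_gpus):
        if total_gpus <= 0 or total_gpus & (total_gpus - 1):
            raise ValueError("total_gpus must be a power of two")
        self.total = total_gpus
        # free[size] lists the start indices of free blocks of that size.
        self.free = {total_gpus: [0]}

    def allocate(self, n):
        """Return (start, size) of a block with at least n GPUs, or None."""
        if n <= 0:
            raise ValueError("n must be positive")
        size = 1
        while size < n:  # round the request up to a power of two
            size *= 2
        cand = size
        while cand <= self.total and not self.free.get(cand):
            cand *= 2  # look for the smallest free block that fits
        if cand > self.total:
            return None  # no block is large enough right now
        start = min(self.free[cand])
        self.free[cand].remove(start)
        while cand > size:  # split, keeping the upper half free
            cand //= 2
            self.free.setdefault(cand, []).append(start + cand)
        return (start, size)

    def release(self, start, size):
        """Free a block and merge it with its buddy while possible."""
        while size < self.total:
            buddy = start ^ size  # the buddy differs only in the 'size' bit
            peers = self.free.get(size, [])
            if buddy not in peers:
                break
            peers.remove(buddy)
            start = min(start, buddy)
            size *= 2
        self.free.setdefault(size, []).append(start)


pool = BuddyAllocator(8)
job_a = pool.allocate(2)   # splits the 8-GPU block down to a 2-GPU block
job_b = pool.allocate(3)   # rounded up to a 4-GPU block
pool.release(*job_a)
pool.release(*job_b)       # buddies merge back into one 8-GPU block
```

Because every block is aligned to its own size, a freed block can only merge with one specific neighbour, which keeps the bookkeeping simple and limits fragmentation. A real cluster scheduler would also have to account for node and network topology, which this sketch ignores.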

research
02/24/2020

Communication Contention Aware Scheduling of Multiple Deep Learning Training Jobs

Distributed Deep Learning (DDL) has rapidly grown its popularity since i...
research
08/12/2022

Zeus: Understanding and Optimizing GPU Energy Consumption of DNN Training

Training deep neural networks (DNNs) is becoming increasingly more resou...
research
10/12/2021

Synergy: Resource Sensitive DNN Scheduling in Multi-Tenant Clusters

Training Deep Neural Networks (DNNs) is a widely popular workload in bot...
research
03/04/2023

Chasing Low-Carbon Electricity for Practical and Sustainable DNN Training

Deep learning has experienced significant growth in recent years, result...
research
05/11/2021

ANDREAS: Artificial intelligence traiNing scheDuler foR accElerAted resource clusterS

Artificial Intelligence (AI) and Deep Learning (DL) algorithms are curre...
research
05/30/2022

A Transistor Operations Model for Deep Learning Energy Consumption Scaling Law

Deep Learning (DL) has transformed the automation of a wide range of ind...
research
12/13/2019

Queueing Analysis of GPU-Based Inference Servers with Dynamic Batching: A Closed-Form Characterization

GPU-accelerated computing is a key technology to realize high-speed infe...
