DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

04/04/2023
by   Qinlong Wang, et al.

The cloud remains a popular platform for distributed deep learning (DL) training jobs, since resource sharing in the cloud improves resource utilization and reduces overall costs. However, such sharing also brings challenges for DL training jobs: for example, high-priority jobs can degrade or even interrupt low-priority jobs. Meanwhile, most existing distributed DL training systems require users to configure a job's resources (i.e., the number of nodes and the resources, such as CPU and memory, allocated to each node) manually before submission, and cannot adjust those resources at runtime. A job's resource configuration deeply affects its performance (e.g., training throughput, resource utilization, and completion rate), yet jobs often perform poorly because users fail to provide an optimal configuration. DLRover is a distributed DL framework that automatically configures a DL job's initial resources and dynamically tunes them during runtime for better performance. With its elastic capability, DLRover effectively adjusts a job's resources when performance issues are detected or when a job fails due to faults or eviction. Evaluation results show that DLRover outperforms carefully hand-tuned resource configurations. Furthermore, in our production Kubernetes cluster, DLRover reduces the median job completion time by 31%, and, compared with manual configuration, improves the job completion rate by 6%, CPU utilization by 15%, and memory utilization by 20%.
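The dynamic-tuning behavior described above can be illustrated with a minimal sketch. The code below is not DLRover's actual API; it is a hypothetical auto-scaling policy (the `JobResources` type, `tune_resources` function, and the 0.8/1.2 throughput thresholds are all assumptions made for illustration) showing how a controller might grow or shrink a job's worker count based on observed versus target throughput:

```python
from dataclasses import dataclass

@dataclass
class JobResources:
    """Hypothetical resource plan for one elastic training job."""
    num_workers: int
    cpu_per_worker: float
    memory_gb_per_worker: float

def tune_resources(res: JobResources,
                   throughput: float,
                   target_throughput: float,
                   max_workers: int = 32) -> JobResources:
    """Return an adjusted plan: scale out when throughput lags the
    target, scale in when it comfortably exceeds it, else keep as-is.
    Thresholds (0.8x / 1.2x) are illustrative, not from the paper."""
    if throughput < 0.8 * target_throughput and res.num_workers < max_workers:
        # Under-performing: double the workers, capped at max_workers.
        return JobResources(min(res.num_workers * 2, max_workers),
                            res.cpu_per_worker, res.memory_gb_per_worker)
    if throughput > 1.2 * target_throughput and res.num_workers > 1:
        # Over-provisioned: halve the workers to free shared capacity.
        return JobResources(max(res.num_workers // 2, 1),
                            res.cpu_per_worker, res.memory_gb_per_worker)
    return res

# Example: a lagging job is scaled out from 4 to 8 workers.
plan = tune_resources(JobResources(4, 2.0, 8.0),
                      throughput=50.0, target_throughput=100.0)
print(plan.num_workers)  # 8
```

A real controller would run such a decision in a monitoring loop and hand the new plan to the cluster scheduler (e.g., Kubernetes), which is where elasticity lets the job survive node eviction as well.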


