An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud

09/08/2021
by Liang Hu, et al.

Cloud training platforms, such as Amazon Web Services and Huawei Cloud, provide users with computational resources to train their deep learning jobs. Elastic training is a service embedded in cloud training platforms that dynamically scales up or down the resources allocated to a job. The core task of an elastic training system is to allocate limited resources among heterogeneous jobs so as to shorten queueing delay and raise training efficiency. This paper presents an optimal resource allocator for elastic training systems that leverages a mixed-integer programming (MIP) model to maximize the training progress of deep learning jobs. We use real-world job data obtained from ModelArts, the deep learning training platform of Huawei Cloud, and conduct simulation experiments to compare the optimal resource allocator with a greedy one as a benchmark. Numerical results show that the proposed allocator can reduce queueing time by up to 32% and improve training efficiency by up to 24% relative to the greedy allocator, thereby greatly improving user experience with Huawei ModelArts and potentially enabling higher profits for the product. The optimal resource allocator is also fast in decision-making, taking merely 0.4 seconds on average.
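To make the contrast between an optimal and a greedy allocator concrete, the following is a minimal sketch and not the paper's MIP model. It assumes a hypothetical set of three jobs with minimum and maximum GPU counts, an assumed diminishing-returns progress function (square root of the GPU count), and a tiny cluster, so the integer program can be solved by exhaustive search instead of a MIP solver. All names, job sizes, and the progress function are illustrative assumptions.

```python
from itertools import product

# Hypothetical jobs: (name, min_gpus, max_gpus). Not taken from the paper.
JOBS = [("a", 1, 4), ("b", 2, 4), ("c", 1, 2)]


def progress(gpus):
    """Assumed diminishing-returns training progress for a GPU count."""
    return gpus ** 0.5


def feasible_choices(min_gpus, max_gpus):
    """A job is either queued (0 GPUs) or runs within its elastic range."""
    return [0] + list(range(min_gpus, max_gpus + 1))


def optimal_allocation(jobs, capacity):
    """Enumerate every integer allocation and keep the best feasible one."""
    best, best_value = None, -1.0
    for alloc in product(*(feasible_choices(lo, hi) for _, lo, hi in jobs)):
        if sum(alloc) > capacity:
            continue  # violates the cluster capacity constraint
        value = sum(progress(g) for g in alloc)
        if value > best_value:
            best, best_value = alloc, value
    return best, best_value


def greedy_allocation(jobs, capacity):
    """Serve jobs in arrival order, giving each as many GPUs as it can use."""
    alloc, remaining = [], capacity
    for _, lo, hi in jobs:
        gpus = min(hi, remaining)
        if gpus < lo:
            gpus = 0  # not enough left to start this job; it stays queued
        alloc.append(gpus)
        remaining -= gpus
    return tuple(alloc), sum(progress(g) for g in alloc)


if __name__ == "__main__":
    print("optimal:", optimal_allocation(JOBS, capacity=4))
    print("greedy: ", greedy_allocation(JOBS, capacity=4))
```

With a capacity of 4 GPUs in this toy instance, the exhaustive search starts all three jobs, while the greedy rule hands every GPU to the first job and leaves the other two queued. Exhaustive search only works at this scale; a real allocator would pass the same objective and constraints to a MIP solver.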

research
04/04/2023

DLRover: An Elastic Deep Training Extension with Auto Job Resource Recommendation

The cloud is still a popular platform for distributed deep learning (DL)...
research
10/10/2020

A Predictive Autoscaler for Elastic Batch Jobs

Large batch jobs such as Deep Learning, HPC and Spark require far more c...
research
09/14/2022

Cost-efficient Auto-scaling of Container-based Elastic Processes

In business process landscapes, a common challenge is to provide the nec...
research
01/18/2018

Batch Auction Design For Cloud Container Services

Cloud containers represent a new, light-weight alternative to virtual ma...
research
07/10/2018

Cost-Efficient Orchestration of Containers in Clouds: A Vision, Architectural Elements, and Future Directions

This paper proposes an architectural framework for the efficient orchest...
research
10/21/2020

Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

In the past decade, we have witnessed a dramatically increasing volume o...
research
04/03/2019

Model Slicing for Supporting Complex Analytics with Elastic Inference Cost and Resource Constraints

Deep learning models have been used to support analytics beyond simple a...
