Effective Elastic Scaling of Deep Learning Workloads

06/24/2020
by Vaibhav Saxena, et al.

The increased use of deep learning (DL) in academia, government and industry has, in turn, led to the popularity of on-premise and cloud-hosted deep learning platforms, whose goals are to enable organizations to utilize expensive resources effectively, and to share those resources among multiple teams in a fair and effective manner. In this paper, we examine the elastic scaling of DL jobs over large-scale training platforms and propose a novel resource allocation strategy for DL training jobs, resulting in improved job run time performance as well as increased cluster utilization. We begin by analyzing DL workloads and exploit the fact that DL jobs can be run with a range of batch sizes without affecting their final accuracy. We formulate an optimization problem that explores a dynamic batch size allocation to individual DL jobs based on their scaling efficiency when running on multiple nodes. We design a fast dynamic-programming-based optimizer to solve this problem in real time to determine jobs that can be scaled up or down, and use this optimizer in an autoscaler to dynamically change the allocated resources and batch sizes of individual DL jobs. We demonstrate empirically that our elastic scaling algorithm can complete up to ≈2× as many jobs as a strong baseline algorithm that also scales the number of GPUs but does not change the batch size. We also demonstrate that the average completion time with our algorithm is up to ≈10× faster than that of the baseline.
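To make the idea of a dynamic-programming allocator concrete, the following is a minimal sketch and not the paper's formulation. It assumes each job comes with a hypothetical table `throughput[j][g]` giving its scaled throughput on `g` GPUs (with `g = 0` meaning the job waits), and it picks a GPU count per job that maximizes total throughput under a cluster-wide GPU budget. The function name `allocate_gpus`, the table layout, and the objective are all assumptions made for illustration; the optimizer described in the abstract also chooses batch sizes, which this sketch leaves out.

```python
from typing import List, Tuple


def allocate_gpus(throughput: List[List[float]], total_gpus: int) -> Tuple[float, List[int]]:
    """Split total_gpus across jobs to maximize summed throughput.

    throughput[j][g] is a hypothetical (assumed) throughput of job j on g GPUs;
    index 0 must be present and represents giving the job no GPUs.
    """
    n = len(throughput)
    # best[j][k]: best total throughput for the first j jobs using at most k GPUs.
    best = [[0.0] * (total_gpus + 1) for _ in range(n + 1)]
    # choice[j][k]: GPUs given to job j in that best solution.
    choice = [[0] * (total_gpus + 1) for _ in range(n)]

    for j in range(n):
        for k in range(total_gpus + 1):
            best_val, best_g = float("-inf"), 0
            max_g = min(k, len(throughput[j]) - 1)
            for g in range(max_g + 1):
                val = best[j][k - g] + throughput[j][g]
                if val > best_val:  # strict comparison: ties keep the smaller g
                    best_val, best_g = val, g
            best[j + 1][k] = best_val
            choice[j][k] = best_g

    # Walk backwards through the choices to recover the per-job allocation.
    alloc = [0] * n
    k = total_gpus
    for j in range(n - 1, -1, -1):
        alloc[j] = choice[j][k]
        k -= alloc[j]
    return best[n][total_gpus], alloc


if __name__ == "__main__":
    # Job 0 scales well across GPUs; job 1 scales poorly (made-up numbers).
    tables = [
        [0.0, 1.0, 1.9, 2.7, 3.4],
        [0.0, 1.0, 1.5, 1.8, 2.0],
    ]
    total, alloc = allocate_gpus(tables, total_gpus=4)
    print(total, alloc)
```

With the made-up tables above, the job that scales efficiently receives most of the GPUs while the poorly scaling job keeps one. This matches the intuition in the abstract that resources should follow scaling efficiency. The running time is proportional to the number of jobs times the GPU budget times the largest per-job allocation, which is small enough to re-run each time jobs arrive or finish.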


