Speculative Container Scheduling for Deep Learning Applications in a Kubernetes Cluster

10/21/2020
by Ying Mao, et al.

In the past decade, the volume of data collected from varied sources has increased dramatically. This explosion of data has transformed the world, as more information is available for collection and analysis than ever before. To make the most of it, various machine learning and deep learning models, e.g., CNN [1] and RNN [2], have been developed to study data and extract valuable information from different perspectives. While data-driven applications improve countless products, training models for hyperparameter tuning is still a time-consuming and resource-intensive process. Cloud computing provides infrastructure support for training deep learning applications. Cloud service providers, such as Amazon Web Services [3], create isolated virtual environments (virtual machines and containers) for clients, who share physical resources, e.g., CPU and memory. On the cloud, resource management schemes are implemented to enable better sharing among users and boost system-wide performance. However, general scheduling approaches, such as spread priority and balanced resource schedulers, do not work well with deep learning workloads. In this project, we propose SpeCon, a novel container scheduler that is optimized for short-lived deep learning applications. Based on container technologies such as Kubernetes [4] and Docker [5], SpeCon analyzes the common characteristics of training processes. We design a suite of algorithms to monitor the progress of training and speculatively migrate slow-growing models to release resources for fast-growing ones. Extensive experiments demonstrate that SpeCon improves the completion time of an individual job by up to 41.5% [...] system-wide and 24.7% [...]
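
The monitor-and-migrate idea in the abstract can be illustrated with a minimal sketch. This is not SpeCon's actual algorithm: the `TrainingJob` record, the `growth_rate` window, the `threshold` value, and the rule that a slow job is only a candidate when its node also hosts a fast-growing job are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class TrainingJob:
    """Hypothetical record of one containerized training job."""
    name: str
    node: str
    # Evaluation metric (e.g. accuracy) sampled at regular intervals.
    history: List[float] = field(default_factory=list)

    def growth_rate(self, window: int = 3) -> float:
        """Average metric improvement per interval over the last `window` intervals."""
        if len(self.history) < 2:
            return float("inf")  # too few samples: treat the job as fast-growing
        recent = self.history[-(window + 1):]
        return (recent[-1] - recent[0]) / (len(recent) - 1)


def pick_migrations(jobs: List[TrainingJob],
                    threshold: float = 0.01) -> Dict[str, List[str]]:
    """Return slow-growing jobs per node, as migration candidates.

    Assumed rule: a slow job is only worth moving if its node also hosts
    a fast-growing job that could use the released resources.
    """
    slow: Dict[str, List[str]] = {}
    nodes_with_fast = set()
    for job in jobs:
        if job.growth_rate() < threshold:
            slow.setdefault(job.node, []).append(job.name)
        else:
            nodes_with_fast.add(job.node)
    return {node: names for node, names in slow.items()
            if node in nodes_with_fast}


jobs = [
    TrainingJob("cnn-a", "node-1", [0.10, 0.30, 0.50, 0.65]),     # still improving
    TrainingJob("rnn-b", "node-1", [0.900, 0.905, 0.907, 0.908]),  # plateaued
    TrainingJob("cnn-c", "node-2", [0.950, 0.951, 0.952, 0.952]),  # plateaued, alone
]
candidates = pick_migrations(jobs)
print(candidates)
```

In this toy input only `rnn-b` is selected: it has plateaued while sharing `node-1` with a job that is still improving, whereas `cnn-c` stays put because no fast-growing job on `node-2` would benefit from the move.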

research
05/22/2018

DRAPS: Dynamic and Resource-Aware Placement Scheme for Docker Containers in a Heterogeneous Cluster

Virtualization is a promising technology that has facilitated cloud comp...
research
10/24/2020

Differentiate Quality of Experience Scheduling for Deep Learning Applications with Docker Containers in the Cloud

With the prevalence of big-data-driven applications, such as face recogn...
research
09/03/2021

Characterization and Prediction of Deep Learning Workloads in Large-Scale GPU Datacenters

Modern GPU datacenters are critical for delivering Deep Learning (DL) mo...
research
09/08/2021

An Optimal Resource Allocator of Elastic Training for Deep Learning Jobs on Cloud

Cloud training platforms, such as Amazon Web Services and Huawei Cloud p...
research
11/01/2018

Modeling Conceptual Characteristics of Virtual Machines for CPU Utilization Prediction

Cloud services have grown rapidly in recent years, which provide high fl...
research
09/20/2020

VirtualFlow: Decoupling Deep Learning Model Execution from Underlying Hardware

State-of-the-art deep learning systems tightly couple model execution wi...
research
06/22/2018

Assumption Commitment Types for Resource Management in Virtually Timed Ambients

This paper introduces a type system for resource management in the conte...
