Multiverse: Dynamic VM Provisioning for Virtualized High Performance Computing Clusters

06/22/2020
by Jashwant Raj Gunasekaran, et al.

Traditionally, HPC workloads have been deployed on bare-metal clusters, but advances in virtualization have paved the way for deploying them on virtualized clusters. However, HPC cluster administrators and providers still face challenges with resource elasticity and large-scale virtual machine (VM) provisioning, because a traditional HPC scheduler and the VM hypervisor (the resource management layer) do not coordinate with each other. This lack of interaction leads to low cluster utilization and low job completion throughput. Furthermore, VM provisioning delays directly affect the overall performance of jobs in the cluster. Hence, there is a need to provision virtualized HPC clusters effectively, making the best use of the physical hardware with minimal provisioning overhead. To this end, we propose Multiverse, a VM provisioning framework that dynamically spawns VMs for incoming jobs in a virtualized HPC cluster by integrating the HPC scheduler with the VM resource manager. We have implemented this framework on the Slurm scheduler together with the vSphere VM resource manager. To reduce VM provisioning overhead, we use instant cloning, which shares both disk and memory with the parent VM, in contrast to full VM cloning, which has to boot a new VM from scratch. Measurements with real-world HPC workloads demonstrate that instant cloning is 2.5x faster than full cloning in terms of VM provisioning time. Further, it improves resource utilization by up to 40% and cluster throughput by up to 1.5x compared to full cloning in bursty job arrival scenarios.
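To make the effect of provisioning delay concrete, the following is a minimal toy model, not the Multiverse implementation. It assumes every job gets a fresh VM spawned on demand, that a host has a fixed number of VM slots, and that a slot is occupied for the provisioning delay plus the job's run time. The function name `simulate_makespan`, the 50-second full-clone delay, the 30-second run time, and the slot count are all hypothetical values chosen for illustration; only the 2.5x ratio between full and instant cloning comes from the abstract.

```python
import heapq


def simulate_makespan(arrival_times, run_time, provision_delay, slots):
    """Toy model: return the time at which the last job finishes.

    Each job waits for a free slot, then pays provision_delay to spawn
    its VM, then runs for run_time. All values are in seconds.
    """
    free_at = [0.0] * slots  # time at which each slot next becomes free
    heapq.heapify(free_at)
    finish = 0.0
    for arrival in sorted(arrival_times):
        slot_free = heapq.heappop(free_at)        # earliest available slot
        start = max(arrival, slot_free)           # job may have to queue
        end = start + provision_delay + run_time  # spawn VM, then run job
        heapq.heappush(free_at, end)
        finish = max(finish, end)
    return finish


# Hypothetical numbers; only the 2.5x ratio is taken from the abstract.
FULL_CLONE_DELAY = 50.0
INSTANT_CLONE_DELAY = FULL_CLONE_DELAY / 2.5

burst = [0.0] * 8  # a bursty arrival: eight jobs submitted at once
full = simulate_makespan(burst, 30.0, FULL_CLONE_DELAY, slots=4)
instant = simulate_makespan(burst, 30.0, INSTANT_CLONE_DELAY, slots=4)
print(f"full clone makespan: {full} s, instant clone makespan: {instant} s")
```

In this simplified setting the provisioning delay is paid once per job on the critical path, so shortening it shrinks the makespan of a burst directly. The real system's gains depend on workload mix, host capacity, and scheduler behavior that this sketch does not model.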


research
11/01/2022

Using Unused: Non-Invasive Dynamic FaaS Infrastructure with HPC-Whisk

Modern HPC workload managers and their careful tuning contribute to the ...
research
01/12/2023

Analyzing Resource Utilization in an HPC System: A Case Study of NERSC Perlmutter

The resource demands of HPC applications vary significantly. However, it...
research
06/22/2021

Energy hardware and workload aware job scheduling towards interconnected HPC environments

New HPC machines are getting close to the exascale. Power consumption fo...
research
05/03/2018

Why do Users Kill HPC Jobs?

Given the cost of HPC clusters, making best use of them is crucial to im...
research
06/18/2016

Scalability of VM Provisioning Systems

Virtual machines and virtualized hardware have been around for over half...
research
08/04/2021

The MIT Supercloud Dataset

Artificial intelligence (AI) and Machine learning (ML) workloads are an ...
research
02/19/2020

Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

In job scheduling, the concept of malleability has been explored since m...
