Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

02/19/2020
by Marco D'Amico, et al.

In job scheduling, malleability has been studied for many years. Research shows that malleability improves system performance, yet it never became widespread in HPC. The causes are the difficulty of developing malleable applications and the lack of support and integration across the layers of the HPC software stack. In recent years, however, malleability in job scheduling has become more critical because of the increasing complexity of hardware and workloads. In this context, allocating nodes exclusively, as in traditional HPC jobs where applications were highly tuned for static allocations but offered no flexibility for dynamic execution, is not always the most efficient solution. This paper proposes a new holistic, dynamic job scheduling policy, Slowdown Driven (SD-Policy), which exploits the malleability of applications as the key technology to reduce the average slowdown and response time of jobs. SD-Policy is based on backfill and node sharing. It applies malleability to running jobs to make room for jobs that will run with a reduced set of resources, but only when the estimated slowdown improves over the static approach. We implemented SD-Policy in SLURM and evaluated it in a real production environment and with a simulator, using workloads of up to 198K jobs. Results show better resource utilization, with reductions in makespan, response time, slowdown, and energy consumption for the evaluated workloads.
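
The decision rule in the abstract (start a job on reduced resources only when its estimated slowdown beats the static alternative) can be illustrated with a minimal sketch. This is not the SD-Policy implementation: the function names, the inverse-proportional runtime model, and the choice to ignore the slowdown of the running jobs that are shrunk are all assumptions made for illustration. Slowdown is taken here as response time divided by the runtime on the full static allocation.

```python
def slowdown(wait_time, run_time, base_run_time):
    """Response time divided by the runtime on the full static allocation."""
    return (wait_time + run_time) / base_run_time


def stretched_runtime(base_run_time, fraction):
    """Hypothetical model (an assumption, not from the paper):
    runtime grows inversely with the share of resources the job gets."""
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    return base_run_time / fraction


def should_start_malleable(base_run_time, static_wait, fraction):
    """Return True when starting now on a reduced share gives a lower
    estimated slowdown than waiting for a full static allocation."""
    # Static option: wait in the queue, then run at full speed.
    static_sd = slowdown(static_wait, base_run_time, base_run_time)
    # Malleable option: start immediately, but run longer on fewer resources.
    malleable_sd = slowdown(
        0.0, stretched_runtime(base_run_time, fraction), base_run_time
    )
    return malleable_sd < static_sd


# A 100 s job facing a 300 s queue wait benefits from starting on half
# of its resources; the same job facing a 50 s wait does not.
long_wait = should_start_malleable(100.0, 300.0, 0.5)
short_wait = should_start_malleable(100.0, 50.0, 0.5)
```

In the first case the static slowdown is (300 + 100) / 100 = 4 against 200 / 100 = 2 for the malleable start; in the second it is 1.5 against 2, so the job waits for its static allocation.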


