Extending SLURM for Dynamic Resource-Aware Adaptive Batch Scheduling

09/16/2020
by Mohak Chadha, et al.

With growing constraints on power budgets and increasing hardware failure rates, the operation of future exascale systems faces several challenges. To address these, resource awareness and adaptivity through malleable jobs have been actively researched in the HPC community. Malleable jobs can change their computing resources at runtime and can significantly improve HPC system performance. However, due to the rigid nature of popular parallel programming paradigms such as MPI and the lack of support for dynamic resource management in batch systems, malleable jobs have remained largely unrealized. In this paper, we extend the SLURM batch system to support the execution and batch scheduling of malleable jobs. The malleable applications are written using a new adaptive parallel paradigm called Invasive MPI, which extends the MPI standard to support resource adaptivity at runtime. We propose two malleable job scheduling strategies to support performance-aware and power-aware dynamic reconfiguration decisions at runtime. We implement the strategies in SLURM and evaluate them on a production HPC system. Results for our performance-aware scheduling strategy show improvements in makespan, average system utilization, and average response and waiting times compared to other scheduling strategies. Moreover, we demonstrate dynamic power corridor management using our power-aware strategy.


research
02/19/2020

Holistic Slowdown Driven Scheduling and Resource Management for Malleable Jobs

In job scheduling, the concept of malleability has been explored since m...
research
03/24/2021

Towards Accommodating Real-time Jobs on HPC Platforms

Increasing data volumes in scientific experiments necessitate the use of...
research
06/25/2021

RFaaS: RDMA-Enabled FaaS Platform for Serverless High-Performance Computing

The rigid MPI programming model and batch scheduling dominate high-perfo...
research
02/14/2020

An optimal scheduling architecture for accelerating batch algorithms on Neural Network processor architectures

In neural network topologies, algorithms are running on batches of data ...
research
12/17/2021

Mitigating inefficient task mappings with an Adaptive Resource-Moldable Scheduler (ARMS)

Efficient runtime task scheduling on complex memory hierarchy becomes in...
research
05/24/2019

Performance-Feedback Autoscaling with Budget Constraints for Cloud-based Workloads of Workflows

The growing popularity of workflows in the cloud domain promoted the dev...
research
08/31/2021

Plan-based Job Scheduling for Supercomputers with Shared Burst Buffers

The ever-increasing gap between compute and I/O performance in HPC platf...
