Towards Accommodating Real-time Jobs on HPC Platforms

03/24/2021
by Sam Nickolay, et al.

Increasing data volumes in scientific experiments necessitate the use of high-performance computing (HPC) resources for data analysis. In many scientific fields, the data generated by scientific instruments and supercomputer simulations must be analyzed rapidly, and the demand for near-instant feedback is growing: scientists want to use results from one experiment to guide the selection of the next, or even to improve the course of a single experiment. Current HPC systems are typically batch-scheduled under policies in which an arriving job runs immediately only if enough resources are available; otherwise, it is queued. Such systems struggle to support real-time jobs. To meet their requirements, real-time jobs must sometimes preempt batch jobs and/or be scheduled ahead of batch jobs that were submitted earlier. Accommodating real-time jobs may also reduce system utilization, especially when preemption and restart of batch jobs are involved. We first explore several existing scheduling strategies that make real-time jobs more likely to be scheduled in time. We then rigorously formulate the problem as a mixed-integer linear program for offline scheduling and develop novel scheduling heuristics for online scheduling. We perform simulation studies using trace logs of Mira, the IBM BG/Q system at Argonne National Laboratory, to quantify the impact of real-time jobs on batch job performance for various percentages of real-time jobs in the workload. We present new insights gained from grouping jobs into categories based on runtime and the number of nodes used, and from studying the performance of each category. Our results show that, with 10% real-time jobs in the workload, just-in-time checkpointing combined with our heuristic can improve the slowdowns of real-time jobs by 35% [...] of batch jobs to 10%.
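As a rough illustration of the batch policy described above (an arriving job runs immediately only if enough nodes are free, and is queued otherwise), the following toy sketch may help. It is not the paper's heuristic or its MILP formulation. The names `Job` and `ToyScheduler` are hypothetical, and the rule that a queued real-time job is placed ahead of queued batch jobs is an assumption made only for illustration. Preemption, checkpointing, and backfilling are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int
    realtime: bool = False


class ToyScheduler:
    """Toy model: start a job on arrival if nodes are free, else queue it."""

    def __init__(self, total_nodes):
        self.free_nodes = total_nodes
        self.queue = []    # waiting jobs, head of the queue at index 0
        self.running = []  # jobs currently holding nodes

    def submit(self, job):
        """Return True if the job started immediately, False if it was queued."""
        if job.nodes <= self.free_nodes:
            self._start(job)
            return True
        if job.realtime:
            # Assumption: a real-time job waits behind earlier real-time jobs
            # but ahead of every queued batch job.
            pos = next(
                (i for i, q in enumerate(self.queue) if not q.realtime),
                len(self.queue),
            )
            self.queue.insert(pos, job)
        else:
            self.queue.append(job)
        return False

    def finish(self, job):
        """Release a finished job's nodes, then start queued jobs in order."""
        self.running.remove(job)
        self.free_nodes += job.nodes
        # Strict queue order: stop at the first job that does not fit.
        while self.queue and self.queue[0].nodes <= self.free_nodes:
            self._start(self.queue.pop(0))

    def _start(self, job):
        self.free_nodes -= job.nodes
        self.running.append(job)


if __name__ == "__main__":
    sched = ToyScheduler(total_nodes=10)
    first = Job("batch-a", nodes=8)
    sched.submit(first)                             # starts: 8 of 10 nodes
    sched.submit(Job("batch-b", nodes=4))           # queued: only 2 free
    sched.submit(Job("rt", nodes=6, realtime=True)) # queued ahead of batch-b
    print([j.name for j in sched.queue])
    sched.finish(first)
    print([j.name for j in sched.running])
```

Even in this toy model, the real-time job still waits until `batch-a` finishes. That delay is what the preemption and checkpointing strategies named in the abstract are meant to address.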
