Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

04/28/2022
by Nick Brown, et al.

There is increasing interest in the use of HPC machines for urgent workloads to help tackle disasters as they unfold. Whilst batch queue systems are not ideal for supporting such workloads, many of their disadvantages can be worked around by accurately predicting when a waiting job will start to run. However, there are numerous challenges in achieving such a prediction with high accuracy, not least because the queue's state can change rapidly and depends upon many factors. In this work we explore a novel machine learning approach for predicting queue wait times, hypothesising that such a model can capture the complex behaviour resulting from the queue policy and other interactions to generate accurate job start times. For ARCHER2 (HPE Cray EX), Cirrus (HPE 8600) and 4-cabinet (HPE Cray EX) we explore how different machine learning approaches and techniques improve the accuracy of our predictions, comparing against the estimate generated by Slurm. We demonstrate that our techniques deliver the most accurate predictions across our machines of interest: we predict job start times to within one minute of the actual start time for around 65% of jobs on ARCHER2 and 4-cabinet, and for 76% of jobs on Cirrus. Compared against what Slurm can deliver, this represents around 3.8 times better accuracy on ARCHER2 and 18 times better on Cirrus. Furthermore, our approach can accurately predict the start time of three quarters of all jobs to within ten minutes of the actual start time on ARCHER2 and 4-cabinet, and of 90% of jobs on Cirrus. Whilst the driver of this work has been to better facilitate placement of urgent workloads across HPC machines, the insights gained can also provide wider benefits to users, enrich existing batch queue systems and inform policy.
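
The accuracy figures above are expressed as the share of jobs whose predicted start time falls within a tolerance (one minute, ten minutes) of the actual start time. A minimal sketch of how such a figure could be computed is given below. The function name `fraction_within` and the sample timestamps are hypothetical and purely illustrative; they are not taken from the paper or from Slurm.

```python
from datetime import datetime, timedelta


def fraction_within(predicted, actual, tolerance):
    """Return the share of jobs whose predicted start time lies within
    `tolerance` of the actual start time (hypothetical helper)."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if not predicted:
        return 0.0
    hits = sum(1 for p, a in zip(predicted, actual) if abs(p - a) <= tolerance)
    return hits / len(predicted)


# Illustrative data only: four made-up jobs, not measurements from any machine.
base = datetime(2022, 1, 1, 12, 0, 0)
actual = [base + timedelta(minutes=m) for m in (0, 10, 20, 30)]
errors = [
    timedelta(seconds=30),    # 30 seconds late
    timedelta(minutes=5),     # 5 minutes late
    timedelta(minutes=-12),   # 12 minutes early
    timedelta(0),             # exact
]
predicted = [a + e for a, e in zip(actual, errors)]

within_1_min = fraction_within(predicted, actual, timedelta(minutes=1))
within_10_min = fraction_within(predicted, actual, timedelta(minutes=10))
print(within_1_min, within_10_min)
```

The same function can be applied to two sets of predictions for the same jobs (for example a model's output and a scheduler's own estimate), and the ratio of the two resulting shares gives a relative comparison of the "times better" kind.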


