End-to-End Predictions-Based Resource Management Framework for Supercomputer Jobs

08/19/2020
by Swetha Hariharan, et al.

Job submissions of parallel applications to production supercomputer systems must be carefully tuned, in terms of their submission parameters, to obtain minimum response times. In this work, we have developed an end-to-end resource management framework that uses predictions of queue waiting times and execution times to minimize the response times of user jobs submitted to supercomputer systems. Our method for predicting queue waiting times adaptively chooses a prediction method based on the cluster structure of similar jobs. Our strategy for execution time prediction dynamically learns the impact of load on execution times and uses it to predict a set of execution time ranges for the target job. We have developed two resource management techniques that employ these predictions: one selects the number of processors for execution, and the other additionally changes the job submission time dynamically. Using workload simulations of large supercomputer traces, we show large improvements in prediction accuracy and reductions in response times over existing techniques and baseline strategies.
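As a rough illustration of the first technique (selecting the number of processors), the sketch below picks the processor count with the smallest predicted response time, taking response time to be predicted queue wait plus predicted execution time. It is not the paper's implementation: `choose_processors`, `toy_wait`, and `toy_exec` are hypothetical names, and the toy predictors are made-up stand-ins for the learned wait-time and execution-time predictors the abstract describes.

```python
from typing import Callable, Iterable, Tuple


def choose_processors(
    candidates: Iterable[int],
    predict_wait: Callable[[int], float],
    predict_exec: Callable[[int], float],
) -> Tuple[int, float]:
    """Return (processor count, predicted response time) for the candidate
    with the smallest predicted response time (wait + execution)."""
    best = None
    for p in candidates:
        response = predict_wait(p) + predict_exec(p)
        if best is None or response < best[1]:
            best = (p, response)
    if best is None:
        raise ValueError("no candidate processor counts given")
    return best


# Hypothetical toy predictors, in seconds. Assumption for illustration only:
# larger requests wait longer in the queue but run faster.
def toy_wait(p: int) -> float:
    return 2.0 * p


def toy_exec(p: int) -> float:
    return 3600.0 / p + 60.0


best_p, best_response = choose_processors([16, 32, 64, 128], toy_wait, toy_exec)
print(best_p, best_response)
```

With these toy predictors, neither the smallest nor the largest request wins, since a larger request shortens execution but lengthens the queue wait. The second technique in the abstract would extend the search to also cover candidate submission times, which this sketch does not attempt.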


research
07/03/2019

CloudCoaster: Transient-aware Bursty Datacenter Workload Scheduling

Today's clusters often have to divide resources among a diverse set of j...
research
04/28/2022

Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

There is increasing interest in the use of HPC machines for urgent workl...
research
11/19/2021

START: Straggler Prediction and Mitigation for Cloud Computing Environments using Encoder LSTM Networks

Modern large-scale computing systems distribute jobs into multiple small...
research
12/06/2021

End-to-end Adaptive Distributed Training on PaddlePaddle

Distributed training has become a pervasive and effective approach for t...
research
11/30/2018

Optimized Portfolio Contracts for Bidding the Cloud

Amazon EC2 provides two most popular pricing schemes--i) the costly on-...
research
03/24/2022

Adaptive job and resource management for the growing quantum cloud

As the popularity of quantum computing continues to grow, efficient quan...
research
05/23/2019

The Supermarket Model with Known and Predicted Service Times

The supermarket model typically refers to a system with a large number o...
