I/O Burst Prediction for HPC Clusters using Darshan Logs

08/20/2023
by Ehsan Saeedizade, et al.

Understanding cluster-wide I/O patterns of large-scale HPC clusters is essential to minimize the occurrence and impact of I/O interference. Yet most previous work in this area has focused on monitoring and predicting task-level and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five-minute intervals. We observe significant (over 100x) fluctuations in read and write I/O rates in all three clusters. We then train machine learning models to estimate the occurrence of system-level I/O bursts 5 to 120 minutes ahead. Evaluation results show that we can predict I/O bursts with more than 90% accuracy (F-1 score) five minutes ahead and more than 87% at longer lead times. We also show that the ML models attain more than 70% accuracy when estimating the degree of the I/O burst. We believe that high-accuracy predictions of I/O bursts can be used in multiple ways, such as postponing delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential applications (e.g., file system scrubbers), and devising I/O-aware job scheduling methods. To validate this claim, we simulated a burst-aware job scheduler that can postpone the start time of applications to avoid I/O bursts. We show that burst-aware job scheduling can reduce application runtime by up to 5x.
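As a rough illustration of the first step described above, the sketch below aggregates per-job I/O records into five-minute, system-level rates and flags unusually busy intervals. It is not the paper's pipeline: the `(start, end, nbytes)` tuples are a hypothetical flattened form of log entries (Darshan parsing is not shown), and the "rate above `factor` times the median" rule is an assumed stand-in for whatever burst definition the authors use.

```python
from collections import defaultdict
from statistics import median

BIN_SECONDS = 300  # five-minute intervals, as in the abstract


def bin_io_rates(records, bin_seconds=BIN_SECONDS):
    """Return {bin_index: bytes_per_second} for the whole system.

    `records` is an iterable of (start_ts, end_ts, nbytes) tuples
    (hypothetical format). Each record's bytes are spread evenly
    over its duration and split across the bins it overlaps.
    """
    totals = defaultdict(float)
    for start, end, nbytes in records:
        if end <= start:
            # Zero-length record: attribute all bytes to the starting bin.
            totals[int(start // bin_seconds)] += nbytes
            continue
        rate = nbytes / (end - start)
        t = start
        while t < end:
            idx = int(t // bin_seconds)
            bin_end = (idx + 1) * bin_seconds
            totals[idx] += rate * (min(end, bin_end) - t)
            t = bin_end
    return {idx: total / bin_seconds for idx, total in totals.items()}


def flag_bursts(rates, factor=10.0):
    """Mark bins whose rate exceeds `factor` times the median rate.

    The threshold rule is an assumption made for illustration only.
    """
    if not rates:
        return {}
    baseline = median(rates.values())
    return {idx: rate > factor * baseline for idx, rate in rates.items()}


if __name__ == "__main__":
    sample = [(0, 600, 600), (300, 600, 30000), (600, 900, 300), (900, 1200, 300)]
    rates = bin_io_rates(sample)
    print(rates)
    print(flag_bursts(rates))
```

The per-bin burst flags produced this way could serve as labels for a classifier that predicts bursts some number of bins ahead, which is the kind of task the abstract describes.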


research
04/28/2022

Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

There is increasing interest in the use of HPC machines for urgent workl...
research
06/22/2021

Energy hardware and workload aware job scheduling towards interconnected HPC environments

New HPC machines are getting close to the exascale. Power consumption fo...
research
08/01/2023

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

We present CASSINI, a network-aware job scheduler for machine learning (...
research
11/21/2022

Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters

Containerization technology offers lightweight OS-level virtualization, ...
research
02/06/2023

Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing

Jobs on high-performance computing (HPC) clusters can suffer significant...
research
02/16/2020

Demystifying the Performance of HPC Scientific Applications on NVM-based Memory Systems

The emergence of high-density byte-addressable non-volatile memory (NVM)...
research
08/26/2022

Affinity-Aware Resource Provisioning for Long-Running Applications in Shared Clusters

Resource provisioning plays a pivotal role in determining the right amou...
