I/O Burst Prediction for HPC Clusters using Darshan Logs

by Ehsan Saeedizade, et al.

Understanding cluster-wide I/O patterns of large-scale HPC clusters is essential to minimize the occurrence and impact of I/O interference. Yet most previous work in this area has focused on monitoring and predicting task- and node-level I/O burst events. This paper analyzes Darshan reports from three supercomputers to extract system-level read and write I/O rates in five-minute intervals. We observe significant (over 100x) fluctuations in read and write I/O rates in all three clusters. We then train machine learning models to estimate the occurrence of system-level I/O bursts 5 to 120 minutes ahead. Evaluation results show that we can predict I/O bursts with more than 90% accuracy (F-1 score) five minutes ahead and more than 87% at longer horizons. We also show that the ML models attain more than 70% accuracy in estimating the degree of the I/O burst. We believe that high-accuracy predictions of I/O bursts can be used in multiple ways, such as postponing delay-tolerant I/O operations (e.g., checkpointing), pausing nonessential applications (e.g., file system scrubbers), and devising I/O-aware job scheduling methods. To validate this claim, we simulated a burst-aware job scheduler that can postpone the start time of applications to avoid I/O bursts. We show that burst-aware job scheduling can reduce application runtime by up to 5x.
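To make the idea of system-level I/O rates in five-minute intervals concrete, the following is a minimal sketch and not the authors' pipeline. It assumes a hypothetical record layout of (start_s, end_s, total_bytes) per job, assumes each job's bytes are spread uniformly over its runtime, and uses an illustrative burst rule (rate above a multiple of the median rate) that the abstract does not specify. The function names and the `factor` parameter are likewise assumptions.

```python
from collections import defaultdict
from statistics import median

BIN_SECONDS = 300  # five-minute intervals, as described in the abstract


def system_io_rates(jobs, bin_seconds=BIN_SECONDS):
    """Aggregate per-job byte counts into system-level rates (bytes/s) per bin.

    jobs: iterable of (start_s, end_s, total_bytes) tuples (hypothetical layout).
    Assumption: a job's bytes are spread uniformly over its runtime.
    """
    totals = defaultdict(float)
    for start, end, nbytes in jobs:
        duration = end - start
        if duration <= 0:
            continue  # skip malformed or zero-length records
        rate = nbytes / duration
        b = int(start // bin_seconds)
        # Walk every bin the job overlaps and add the bytes that fall in it.
        while b * bin_seconds < end:
            lo = max(start, b * bin_seconds)
            hi = min(end, (b + 1) * bin_seconds)
            totals[b] += rate * (hi - lo)
            b += 1
    return {b: total / bin_seconds for b, total in sorted(totals.items())}


def label_bursts(rates, factor=10.0):
    """Mark a bin as a burst when its rate exceeds factor x the median rate.

    The threshold rule is an illustrative assumption, not the paper's definition.
    """
    if not rates:
        return {}
    baseline = median(rates.values())
    return {b: r > factor * baseline for b, r in rates.items()}


if __name__ == "__main__":
    # One long low-rate job plus one short heavy job in the third interval.
    jobs = [(0, 1500, 1500), (600, 900, 30000)]
    rates = system_io_rates(jobs)
    print(rates)
    print(label_bursts(rates))
```

The per-bin burst labels produced this way could serve as targets for a classifier trained on lagged rate features, which is one plausible reading of the prediction task described above.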




Predicting batch queue job wait times for informed scheduling of urgent HPC workloads

There is increasing interest in the use of HPC machines for urgent workl...

Energy hardware and workload aware job scheduling towards interconnected HPC environments

New HPC machines are getting close to the exascale. Power consumption fo...

CASSINI: Network-Aware Job Scheduling in Machine Learning Clusters

We present CASSINI, a network-aware job scheduler for machine learning (...

Fine-Grained Scheduling for Containerized HPC Workloads in Kubernetes Clusters

Containerization technology offers lightweight OS-level virtualization, ...

Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing

Jobs on high-performance computing (HPC) clusters can suffer significant...

Demystifying the Performance of HPC Scientific Applications on NVM-based Memory Systems

The emergence of high-density byte-addressable non-volatile memory (NVM)...

Affinity-Aware Resource Provisioning for Long-Running Applications in Shared Clusters

Resource provisioning plays a pivotal role in determining the right amou...
