Optimisation of job scheduling for supercomputers with burst buffers

09/29/2021
by Jan Kopanski, et al.

The ever-increasing gap between compute and I/O performance in HPC platforms, together with the development of novel NVMe storage devices (NVRAM), led to the emergence of the burst buffer concept: an intermediate persistent storage layer logically positioned between random-access main memory and a parallel file system. Since the appearance of this technology, numerous supercomputers have been equipped with burst buffers, exploring various architectures. Despite the development of real-world architectures as well as research concepts, Resource and Job Management Systems, such as Slurm, provide only marginal support for scheduling jobs with burst buffer requirements. This research is primarily motivated by the alarming observation that existing job schedulers omit burst buffers from the reservations made during backfilling. In this dissertation, we build a detailed supercomputer simulator based on Batsim and SimGrid, which is capable of simulating I/O contention and I/O congestion effects. Due to the lack of publicly available workloads with burst buffer requests, we create a burst buffer request distribution model derived from Parallel Workload Archive logs. We investigate the impact of burst buffer reservations on the overall efficiency of online job scheduling for canonical algorithms: First-Come-First-Served (FCFS) and Shortest-Job-First (SJF) EASY-backfilling. Our results indicate that the lack of burst buffer reservations in backfilling may significantly deteriorate scheduling performance. [...] Furthermore, this lack of reservations may cause the starvation of medium-size and wide jobs. Finally, we propose a burst-buffer-aware plan-based scheduling algorithm with simulated annealing optimisation, which improves the mean waiting time by over 20% [...] slowdown by 27% [...]
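
To make the backfilling issue concrete, the following is a minimal Python sketch of an EASY-style backfilling check over two resources, compute nodes and burst buffer capacity. It is an illustration under simplifying assumptions (a single reservation for the head-of-queue job, exact runtime estimates, burst buffer held for the whole job duration), not the simulator or the algorithms from the dissertation. The names `Job`, `reservation`, `can_backfill`, and the flag `reserve_bb` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    nodes: int     # requested compute nodes
    bb: int        # requested burst buffer capacity (e.g. in GB)
    walltime: int  # user-supplied runtime estimate


def _fits(head, nodes, bb, reserve_bb):
    # With reserve_bb=False the burst buffer request is ignored,
    # mimicking a reservation that covers compute nodes only.
    return nodes >= head.nodes and (not reserve_bb or bb >= head.bb)


def reservation(head, running, free_nodes, free_bb, now, reserve_bb=True):
    """Return (start, spare_nodes, spare_bb) for the head-of-queue job.

    running is a list of (end_time, Job) pairs for jobs currently executing.
    """
    nodes, bb, start = free_nodes, free_bb, now
    for end, job in sorted(running, key=lambda r: r[0]):
        if _fits(head, nodes, bb, reserve_bb):
            break
        # Resources released when this running job finishes.
        nodes += job.nodes
        bb += job.bb
        start = end
    if not _fits(head, nodes, bb, reserve_bb):
        raise ValueError("head job exceeds the capacity of the platform")
    spare_bb = bb - head.bb if reserve_bb else float("inf")
    return start, nodes - head.nodes, spare_bb


def can_backfill(cand, head, running, free_nodes, free_bb, now, reserve_bb=True):
    """EASY rule: a candidate may start now if it does not delay the head job."""
    if cand.nodes > free_nodes or cand.bb > free_bb:
        return False  # does not fit right now
    start, spare_nodes, spare_bb = reservation(
        head, running, free_nodes, free_bb, now, reserve_bb
    )
    if now + cand.walltime <= start:
        return True  # finishes before the reservation begins
    # Otherwise it may only use resources the head job leaves spare.
    return cand.nodes <= spare_nodes and cand.bb <= spare_bb


# Toy platform: 4 nodes, 100 GB of burst buffer; one job is already running.
running = [(10, Job("R", nodes=2, bb=50, walltime=10))]
head = Job("H", nodes=3, bb=70, walltime=10)
cand = Job("C", nodes=1, bb=50, walltime=100)

with_bb = can_backfill(cand, head, running, 2, 50, 0, reserve_bb=True)
without_bb = can_backfill(cand, head, running, 2, 50, 0, reserve_bb=False)
```

In this toy case the candidate is rejected when the reservation covers the burst buffer, but accepted when it covers nodes only. In the latter case only 50 GB of burst buffer would be free at time 10, so the head job, which needs 70 GB, could not start at its reserved time.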
