Optimization of Topology-Aware Job Allocation on a High-Performance Computing Cluster by Neural Simulated Annealing

02/06/2023
by Zekang Lan, et al.

Jobs on high-performance computing (HPC) clusters can suffer significant performance degradation due to inter-job network interference. The topology-aware job allocation problem (TJAP) is the problem of deciding how to dedicate nodes to specific applications so as to mitigate this interference. In this paper, we study the window-based TJAP on a fat-tree network, with the aim of minimizing communication hop cost, a metric defined to capture inter-job interference. The window-based scheduling approach periodically takes the jobs in the queue and solves an assignment problem that maps them to the available nodes. Two special allocation strategies are considered: the static continuity assignment strategy (SCAS) and the dynamic continuity assignment strategy (DCAS). For the SCAS, a 0-1 integer programming model is developed. For the DCAS, we propose an approach called neural simulated annealing (NSA), an extension of simulated annealing (SA) that learns a repair operator and employs it in a guided heuristic search. The efficacy of NSA is demonstrated in a computational study against SA and SCIP. The results of numerical experiments indicate that both the model and the algorithm proposed in this paper are effective.
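To make the SA baseline mentioned in the abstract more concrete, the following is a minimal sketch of plain simulated annealing on a toy job-to-node assignment. It is not the paper's NSA: there is no learned repair operator, and the cost function is an assumed stand-in for the communication hop cost (it counts pairs of nodes belonging to the same job that sit under different leaf switches). The names `hop_cost`, `anneal`, and `nodes_per_leaf`, as well as the cooling parameters, are hypothetical and chosen only for illustration.

```python
import math
import random


def hop_cost(assignment, nodes_per_leaf):
    """Assumed toy cost: number of same-job node pairs split across leaf switches.

    assignment[n] is the job id placed on node n; node n hangs off
    leaf switch n // nodes_per_leaf.
    """
    leaves_by_job = {}
    for node, job in enumerate(assignment):
        leaves_by_job.setdefault(job, []).append(node // nodes_per_leaf)
    cost = 0
    for leaves in leaves_by_job.values():
        for i in range(len(leaves)):
            for j in range(i + 1, len(leaves)):
                if leaves[i] != leaves[j]:
                    cost += 1
    return cost


def anneal(assignment, nodes_per_leaf, steps=5000, t0=2.0, alpha=0.999, seed=0):
    """Plain simulated annealing with a swap move and geometric cooling."""
    rng = random.Random(seed)
    current = list(assignment)
    cur_cost = hop_cost(current, nodes_per_leaf)
    best, best_cost = list(current), cur_cost
    t = t0
    for _ in range(steps):
        # Neighbor move: swap the jobs placed on two distinct nodes.
        i, j = rng.sample(range(len(current)), 2)
        current[i], current[j] = current[j], current[i]
        new_cost = hop_cost(current, nodes_per_leaf)
        delta = new_cost - cur_cost
        # Metropolis rule: always accept improvements, sometimes accept worse moves.
        if delta <= 0 or rng.random() < math.exp(-delta / t):
            cur_cost = new_cost
            if cur_cost < best_cost:
                best, best_cost = list(current), cur_cost
        else:
            current[i], current[j] = current[j], current[i]  # undo the swap
        t *= alpha
    return best, best_cost


# Two jobs interleaved across two leaf switches of four nodes each.
initial = [0, 1, 0, 1, 0, 1, 0, 1]
best, best_cost = anneal(initial, nodes_per_leaf=4)
print(hop_cost(initial, 4), best_cost)
```

Because the swap move only permutes the assignment, each job keeps its node count. A learned repair operator, as described for NSA, would take the place of or augment this simple neighbor move.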


