Application-aware Congestion Mitigation for High-Performance Computing Systems

12/14/2020
by Archit Patke, et al.

High-performance computing (HPC) systems frequently experience congestion, which leads to significant variation in application performance. However, the impact of congestion on runtime differs from application to application, depending on each application's network characteristics (such as bandwidth and latency requirements). We leverage this insight to develop Netscope, an automated ML-driven framework that considers those network characteristics to dynamically mitigate congestion. We evaluate Netscope on four Cray Aries systems, including a production supercomputer, using real scientific applications. Netscope has a lower training cost and accurately estimates the impact of congestion on application runtime, with a correlation between 0.7 and 0.9 for common scientific applications. Moreover, we find that Netscope reduces tail runtime variability by up to 14.9 times while improving median system utility by 12


