High-Quality Fault-Resiliency in Fat-Tree Networks (Extended Abstract)

11/21/2022
by John Gliksberg, et al.

Coupling regular topologies with optimized routing algorithms is key to pushing the performance of interconnection networks in HPC systems. In this paper we present Dmodc, a fast deterministic routing algorithm for Parallel Generalized Fat-Trees (PGFTs) that minimizes congestion risk even under massive topology degradation caused by equipment failure. It applies a modulo-based computation of forwarding tables among switches closer to the destination, using only knowledge of subtrees for pre-modulo division. Dmodc allows complete rerouting of topologies with tens of thousands of nodes in less than a second, which greatly helps centralized fabric management react to faults with high-quality routing tables and no impact on running applications in current and future very large-scale HPC clusters. We compare Dmodc against routing algorithms available in the InfiniBand control software (OpenSM), first for routing execution time to show feasibility at scale, and then for congestion risk under degradation to demonstrate robustness. The latter comparison is done using static analysis of routing tables under random permutation (RP), shift permutation (SP) and all-to-all (A2A) traffic patterns. Results for Dmodc show A2A and RP congestion risks under heavy degradation similar to those of the most stable algorithms compared, and near-optimal SP congestion risk up to 1
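
To make the modulo-based idea concrete, here is a minimal Python sketch of how one switch might pick an output port per destination. It is an illustration under stated assumptions, not the authors' Dmodc implementation: the function names (`select_port`, `build_table`, `max_port_load`), the integer destination IDs, the precomputed list of candidate ports leading closer to the destination, and the `divider` value standing in for the pre-modulo division are all hypothetical.

```python
from collections import Counter

def select_port(dest_id, candidate_ports, divider):
    """Pick one port for a destination with a divide-then-modulo rule.

    candidate_ports: ports assumed to lead to switches closer to dest_id
    divider: hypothetical pre-modulo divisor derived from subtree sizes
    """
    if not candidate_ports:
        raise ValueError("destination unreachable: no candidate ports")
    return candidate_ports[(dest_id // divider) % len(candidate_ports)]

def build_table(dest_ids, candidate_ports, divider=1):
    """Build a forwarding table {destination: port} for one switch."""
    return {d: select_port(d, candidate_ports, divider) for d in dest_ids}

def max_port_load(table):
    """Crude static indicator: the largest number of destinations on one port."""
    return max(Counter(table.values()).values())

dests = range(8)
healthy = build_table(dests, [0, 1, 2, 3])   # all four up-ports alive
degraded = build_table(dests, [0, 1, 3])     # port 2 lost to a fault
```

In this toy case the healthy table spreads eight destinations evenly, two per port, and the degraded table still spreads them over the three surviving ports (3, 3 and 2) without any special-case logic. The same rule is simply re-evaluated on the shorter candidate list.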

