1 Introduction and Motivation
High performance computing (HPC) systems are costly, and optimizing application performance on them is a challenging problem. A critical component of an HPC system is the interconnect, whose properties determine the scalability of any MPI-based parallel algorithm. The behavior of the interconnect is largely controlled by routing algorithms, which determine how packets move through the interconnect topology.
We consider the Angara interconnect [simonov2014pervoye], developed at JSC NICEVT in Russia. Angara is a low-latency, high-bandwidth interconnect with a 4D torus topology; the minimum measured MPI latency between two adjacent nodes is 850 ns. The Angara-C1 [agarkov2016performance] and Desmos [stegailov2019angara] cluster systems are based on the Angara interconnect, and many authors have reported results obtained with it [khalilov2018optimization, ostroumova2019reactive, polyakov2018, stegailovvasp2019, tolstykh2018structure, stegailov2020]. Currently, the largest Angara-based HPC system has 96 nodes.
Deterministic routing in the Angara interconnect is based on direction order routing (DOR) [adiga2005blue, scott1996cray], implemented using direction bits [scott1996cray] and additional First and Last Steps [scott1996cray] for bypassing failed nodes and links. The First and Last Steps method allows the DOR rule to be violated. In our previous work [mukosey2019extended] we proposed an algorithm for generating and analyzing routing tables that guarantees the absence of deadlocks in the Angara interconnect. That algorithm increases the number of routable systems and improves fault tolerance. However, it is based on breadth-first search in a graph and takes practically no account of communication channel load. Moreover, we have never evaluated the influence of the routing table generation algorithm on the performance of a real-world Angara-based cluster.
In this work we focus on developing an advanced routing table generation algorithm and on evaluating routing algorithms on the Angara-based Desmos cluster.
1.1 Related Work
Routing algorithms can be divided into two categories: oblivious and adaptive [dally2004principles]. An oblivious algorithm chooses a route for each source-destination pair without considering any information about the network’s traffic. An adaptive algorithm adapts to the current network traffic conditions. Deterministic routing algorithms are a subset of oblivious algorithms: they always choose the same path for a given source-destination pair, even if multiple paths are possible.
Path diversity is the network property of providing multiple minimal routes between most pairs of nodes. It adds to the robustness of the network by balancing load across channels and by allowing the network to tolerate faulty channels and nodes. Balancing the load across the communication channels affects network performance.
The channel load is the number of routes crossing a given network channel. The edge-forwarding index is the maximum channel load over all communication channels of a network [heydemann1989forwarding]. It addresses link contention and thus characterizes the balancedness and quality of a routing table.
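As a minimal sketch of these two definitions, the following Python fragment counts the load of each directed channel and takes the maximum. The function names, the representation of a route as a list of node identifiers and the 4-node ring used as input are assumptions made for illustration; they are not part of the Angara software.

```python
from collections import Counter

def channel_loads(routes):
    """Count how many routes cross each directed channel (u, v)."""
    loads = Counter()
    for path in routes.values():
        for channel in zip(path, path[1:]):
            loads[channel] += 1
    return loads

def edge_forwarding_index(routes):
    """Maximum channel load over all channels used by the routes."""
    loads = channel_loads(routes)
    return max(loads.values()) if loads else 0

def ring_route(src, dst, n):
    """Shortest route on an n-node ring; ties are broken clockwise."""
    clockwise_hops = (dst - src) % n
    if clockwise_hops <= n // 2:
        return [(src + k) % n for k in range(clockwise_hops + 1)]
    return [(src - k) % n for k in range(n - clockwise_hops + 1)]

# Hypothetical routing table: all ordered node pairs on a 4-node ring.
ring_routes = {(s, d): ring_route(s, d, 4)
               for s in range(4) for d in range(4) if s != d}
loads = channel_loads(ring_routes)
print(edge_forwarding_index(ring_routes), dict(loads))
```

Because ties are always broken clockwise in this toy table, the clockwise channels carry more routes than the counter-clockwise ones, which is the kind of imbalance that a load-balancing routing table generator tries to remove.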
The perfect channel load is the average channel load in a system with a routing table in which each route has minimal length. By analogy with [sancho2003routing] we define the deviation metric for a routing table . The deviation metric characterizes the deviation of channel loads from the perfect channel load and can be more descriptive than the edge-forwarding index.
Optimal deterministic routing (with a minimal edge-forwarding index) in arbitrary networks is NP-hard [saad1993complexity]. Several approaches and techniques exist to address the balancing problem in general networks and in torus topology networks.
The first approach to load-balanced routing for torus networks on top of dimension (or direction) order routing is to forward packets depending on the relative position of the source and destination nodes [montanana2009balanced]. This algorithm is simple and can be implemented in hardware, but it does not take into account non-symmetric topologies or channel faults.
Instead of choosing routing paths locally at each router, it is possible to use global routing, in which all communications are optimized over all source-destination node pairs, particularly when a communication pattern is known. Mixed integer linear programming (MILP) can be used to obtain an optimal solution of the global routing problem [kinsy2009application, abdel2011transcom] for an arbitrary topology, but because of its computational complexity this approach does not scale well.
In [abdel2016scalable] A. Abdel-Gawad proposed a decoupled approach, SGR, which consists of discovering the best possible channel loads and then constructing source-destination paths that achieve these loads. The best channel loads are obtained by casting the global routing problem as a linear programming (LP) problem, which is solvable in polynomial time. However, the resulting channel loads imply splitting of source-destination flows, and the second SGR stage, which constructs the paths, is based on a greedy algorithm that achieves the optimal channel loads by allowing additional paths for a source-destination pair. MILP and LP formulations take injection flows into account, so global routing can be application specific.
Graph methods are a natural way to balance routing paths. A network can be modeled as a directed graph whose vertices represent the network nodes and whose edges represent the communication links between them. The channel dependency graph is a directed graph whose vertices are the channels (edges) of the network; the routing function defines its edges, each of which corresponds to a possible step of a path from one channel to the next. According to [dally1988deadlock, schwiebert2001deadlock], a routing function is deadlock-free if the corresponding channel dependency graph is acyclic.
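The acyclicity condition can be illustrated with a short sketch that builds a channel dependency graph from a set of routes and tests it for cycles with Kahn's topological sort. The helper names and the two toy route sets are assumptions for illustration; the sketch ignores virtual channels and is not the analysis code used for Angara.

```python
def build_cdg(routes):
    """Channel dependency graph: one vertex per channel (u, v), and an edge
    from channel c1 to channel c2 if some route uses c2 right after c1."""
    cdg = {}
    for path in routes:
        channels = list(zip(path, path[1:]))
        for c in channels:
            cdg.setdefault(c, set())
        for c1, c2 in zip(channels, channels[1:]):
            cdg[c1].add(c2)
    return cdg

def is_acyclic(graph):
    """Kahn's algorithm: the graph is acyclic iff every vertex can be
    removed in topological order."""
    indegree = {v: 0 for v in graph}
    for v in graph:
        for w in graph[v]:
            indegree[w] += 1
    ready = [v for v, deg in indegree.items() if deg == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for w in graph[v]:
            indegree[w] -= 1
            if indegree[w] == 0:
                ready.append(w)
    return removed == len(graph)

# Two-hop clockwise routes on a 4-node ring: the dependencies close a cycle.
ring_routes = [[i, (i + 1) % 4, (i + 2) % 4] for i in range(4)]
# Routes along a 4-node line in both directions: no cyclic dependency.
line_routes = [[0, 1, 2, 3], [3, 2, 1, 0]]

print(is_acyclic(build_cdg(ring_routes)), is_acyclic(build_cdg(line_routes)))
```

The ring example shows why a torus needs an additional rule inside each ring, while the line example corresponds to the deadlock-free case.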
T. Hoefler et al. proposed [hoefler2009optimized] a balancing routing algorithm for arbitrary interconnect topologies based on Dijkstra's single-source shortest path (SSSP) algorithm [dijkstra1959note]. For each network node, routing paths are constructed by running SSSP on the network graph. Balancing is achieved through edge weights, which are incremented by one for every edge of a path from a source to a destination node. The first variant of the algorithm builds, for each source node, the routing paths to all possible destination nodes and only then modifies the edge weights. It does not balance the edge-forwarding index ideally, but it needs only one SSSP call per network node. A more accurate variant performs SSSP for each source-destination pair and updates the edge weights every time, so its runtime is dominated by one SSSP call per node pair. SSSP-based algorithms might introduce cyclic dependencies, which might lead to network deadlocks. For that reason the second part of the algorithm, called DFSSSP [domke2011deadlock], places all obtained paths into a channel dependency graph and then breaks cycles in the graph by moving paths to virtual layers, each of which is assigned to a virtual channel.
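The per-source variant can be sketched as follows. This is a simplified illustration of the weight-update idea only, not the implementation from [hoefler2009optimized]: the initial edge weight of 1, the adjacency-list input and all function names are assumptions, and the deadlock-removal stage is omitted.

```python
import heapq

def dijkstra_parents(adj, weight, source):
    """Dijkstra's algorithm; returns the parent of each reached vertex."""
    dist = {source: 0}
    parent = {}
    heap = [(0, source)]
    settled = set()
    while heap:
        d, u = heapq.heappop(heap)
        if u in settled:
            continue
        settled.add(u)
        for v in adj[u]:
            nd = d + weight[(u, v)]
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def per_source_balanced_routes(adj, initial_weight=1):
    """One SSSP call per source; edge weights are increased by one for every
    path edge only after all paths from that source have been built."""
    weight = {(u, v): initial_weight for u in adj for v in adj[u]}
    routes = {}
    for src in sorted(adj):
        parent = dijkstra_parents(adj, weight, src)
        new_paths = []
        for dst in sorted(adj):
            if dst == src or dst not in parent:
                continue
            path = [dst]
            while path[-1] != src:
                path.append(parent[path[-1]])
            path.reverse()
            routes[(src, dst)] = path
            new_paths.append(path)
        for path in new_paths:
            for channel in zip(path, path[1:]):
                weight[channel] += 1
    return routes, weight

# Hypothetical input: a bidirectional 4-node ring.
ring = {i: [(i + 1) % 4, (i - 1) % 4] for i in range(4)}
routes, weight = per_source_balanced_routes(ring)
print(routes[(0, 2)], max(weight.values()))
```

With a small initial weight such as the one assumed here, accumulated weights can eventually make a longer path cheaper than a minimal one, so the choice of the initial weight matters in practice.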
Breadth-first and depth-first search through the network graph can be used to generate routing tables for arbitrary irregular network topologies [sancho2004effective].
A genetic algorithm is a well-known search heuristic for NP-hard problems that relies on the process of natural selection. For large networks, however, the time complexity of genetic algorithms becomes a problem.
In [sancho2000improving] a heuristic iterative load balancing algorithm is proposed. First, all possible paths are constructed for every source-destination node pair. Then, at each step, the channel with the highest load is determined and a routing path crossing this channel is selected; the path is removed if there is more than one routing path between its source and destination nodes. The algorithm finishes when the number of routing paths between every pair of nodes is reduced to one.
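A compact sketch of this iterative pruning is given below. It follows only the description in the paragraph above; which of several removable paths is dropped first is not specified there, so taking the first one found is an assumption, as are the function name and the input format.

```python
from collections import Counter

def prune_to_single_paths(candidates):
    """Repeatedly take the most loaded channel and drop one path crossing it,
    as long as that path's node pair still has an alternative path."""
    paths = {pair: [tuple(p) for p in ps] for pair, ps in candidates.items()}
    excluded = set()  # channels crossed only by paths that must be kept
    while any(len(ps) > 1 for ps in paths.values()):
        loads = Counter()
        for ps in paths.values():
            for p in ps:
                for channel in zip(p, p[1:]):
                    loads[channel] += 1
        # Most loaded channel still under consideration (ties broken by label).
        busiest = max((c for c in loads if c not in excluded),
                      key=lambda c: (loads[c], c))
        removable = [(pair, p)
                     for pair, ps in paths.items() if len(ps) > 1
                     for p in ps if busiest in zip(p, p[1:])]
        if not removable:
            excluded.add(busiest)
            continue
        pair, p = removable[0]  # assumption: drop the first removable path
        paths[pair].remove(p)
    return {pair: ps[0] for pair, ps in paths.items()}

# Hypothetical candidates: pair (0, 3) has two minimal paths, pair (0, 1) one.
candidates = {(0, 3): [[0, 1, 3], [0, 2, 3]], (0, 1): [[0, 1]]}
table = prune_to_single_paths(candidates)
print(table)
```

In this example channel (0, 1) is the most loaded one, so the path of pair (0, 3) that crosses it is the one removed.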
The major contributions of our work are:

We present a full description of a routing graph notation that makes it possible to build routes in the Angara interconnect, whose routing restricts the further route of a packet depending on its previous route steps.

We propose a deadlock-free routing algorithm based on a fast SSSP algorithm for the deterministic routing of the Angara interconnect with a single virtual channel.

We evaluate the proposed SSSP-based algorithm, a genetic algorithm and the previously proposed BFS-based algorithm, and compare the edge-forwarding index, routing time and application performance for synthetic communication patterns and NPB benchmarks on the Desmos cluster.
The paper is structured as follows. Section 2 describes the Angara interconnect architecture and its routing. In Section 3 we present the routing graph notation, together with a condition and an algorithm for adding edges to a routing graph without introducing deadlocks. Section 4 presents the previously proposed BFS-based algorithm, the devised genetic algorithm and the proposed SSSP-based algorithm. In Section 5 we discuss the improvements that SSSP routing brings in balancedness quality, runtime and application performance. Section 6 concludes the paper.
2 The Angara Interconnect Architecture
The Angara interconnect is a Russian-designed communication network with a 4D torus topology. The interconnect ASIC was developed by JSC NICEVT and manufactured by TSMC using a 65 nm process.
The Angara ASIC chip (Figure 1) consists of an adapter and a router. The router contains 8 link blocks that are connected via a crossbar. The crossbar can simultaneously transmit flits (128 bits) from links if there are no conflicts.
In each direction there are five virtual channels: two channels for deterministic routing (a blocking request channel and a non-blocking response channel); a separate virtual channel for adaptive routing, with the ability to switch to the non-blocking deterministic channel in case of a potential deadlock; and two more virtual channels used to send messages over a virtual subnet for collective operations. Support for the collective operations (broadcast and reduce) is implemented using a virtual subnet with a tree topology, which is mapped onto the multidimensional torus topology. Deterministic routing preserves the order of packet transmission and prevents deadlocks; adaptive routing uses one of the possible minimal routes for delivering packets, ignoring the order of directions, which makes it possible to bypass overloaded and broken parts of the network.
The link layer supports fault-tolerant transmission of packets using packet numbering, checksum calculation for each packet, and retransmission if the checksum recorded in the last flit of the packet does not match the one calculated after the transfer. There is also a mechanism for bypassing failed communication channels and nodes by rebuilding the routing tables and using nonstandard first / last steps of the packet route. For various service operations, including setting up / rebuilding routing tables and performing some calculations, a service processor can be used that interacts with the adapter via the ELB interface.
The Angara chip supports simultaneous operations with multiple threads/processes of a user task; it is implemented as several injection channels available for use by independent packet buffers.
The network adapter at the hardware level supports remote (RDMA) write, read, and atomic operations. Atomic operations of two types are available – addition and exclusive OR. Each node has a dedicated memory region available for remote access from other nodes to support OpenSHMEM and PGAS. Different programming models are supported, including MPI and OpenMP. The Angara software stack is presented in Figure 2.
The network adapter extension card can be connected to the card in adjacent nodes by up to six cables (or up to eight with an extension card). The following topologies are supported: a ring, 2D, 3D and 4D torus.
2.1 Deterministic Routing
Deadlock-free deterministic routing in the Angara interconnect is performed preserving the order of directions (direction order routing, DOR) [dally1986torus]. All directions in a network route are used in a predetermined order . Direction order algorithms guarantee the absence of mutual locks between the rings of a torus topology for any number of simultaneous requests for data transmission over the network. To avoid deadlocks within a ring (movement without changing direction), the bubble routing rule is used [adiga2005blue, scott1996cray, puente1999adaptive].
The second Angara routing restriction is the direction bit rule, which does not allow both the positive and the negative direction of a dimension to be used in a route. denotes a set of directions that satisfies the direction bit rule.
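The two restrictions can be checked on a route with a few lines of Python. The encoding of directions as signed integers and, in particular, the direction order used in the example are assumptions made for illustration; the actual order used by Angara is not reproduced here.

```python
# Directions are encoded as signed integers: +i / -i is the positive /
# negative direction of dimension i (i = 1, 2, ...). Illustrative encoding.
ASSUMED_ORDER_2D = [+1, +2, -1, -2]  # assumed direction order for the example

def satisfies_direction_order(route, order):
    """True if the directions of the route appear in non-decreasing
    position with respect to the given direction order."""
    ranks = [order.index(d) for d in route]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

def satisfies_direction_bits(route):
    """True if no dimension is used in both its positive and its
    negative direction."""
    used = set(route)
    return not any(-d in used for d in used)

for route in ([+1, +1, +2], [+2, +1], [+1, -1]):
    print(route,
          satisfies_direction_order(route, ASSUMED_ORDER_2D),
          satisfies_direction_bits(route))
```

The last route shows that the two rules are independent: under the assumed order it respects the direction order but still violates the direction bit rule.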
The First Step / Last Step (FS/LS) method is used in the Angara interconnect as a mechanism for relaxing the routing rules. The FS/LS method allows a nonstandard first positive step and/or a nonstandard last negative step in a route. A route from to nodes with FS/LS steps can be written as , where is a positive step, is a negative step. Note that a set of directions can violate the direction bit and the direction order rules, so a deadlock can arise.
The Angara system software includes a SLURM plugin called Angara Node Selection Utility (ANSU). Each router in a network node contains a routing table, which is written by ANSU before each user application run. The deterministic Angara routing rules described above allow path diversity, i.e. multiple routes between a source-destination pair. The routing table of each node contains the routing information that determines a route to every destination node. When a packet is injected into the network, the router creates a 128-bit header flit with the route information taken from the routing table for the destination node. The header flit is added to the network packet, and the packet then moves deterministically through the network.
3 Routing Graph
Let be a set of network nodes. The dimensional torus has nodes, with nodes along each dimension , where for . Each node is identified by coordinates , where for .
The direction is a vector , where each coordinate is zero except a position and it is equal , if or , if . The direction is positive, when and is negative, when . Two nodes and are neighbors in a direction if . A set of all directions is denoted by . On the set of directions we introduce the order: , if .
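The torus coordinates, directions and neighbor relation described above can be sketched in Python as follows. Encoding a direction as a tuple with a single +1 or -1 entry, the wrap-around arithmetic and the function names are assumptions made for illustration.

```python
from itertools import product

def torus_nodes(sizes):
    """All nodes of a torus, each identified by a coordinate tuple."""
    return list(product(*(range(k) for k in sizes)))

def directions(n):
    """The 2n directions of an n-dimensional torus: for every dimension a
    positive and a negative unit step (assumed encoding)."""
    dirs = []
    for i in range(n):
        for sign in (+1, -1):
            dirs.append(tuple(sign if j == i else 0 for j in range(n)))
    return dirs

def neighbor(node, direction, sizes):
    """Neighbor of a node in the given direction, with wrap-around."""
    return tuple((c + s) % k for c, s, k in zip(node, direction, sizes))

sizes = (3, 4)  # hypothetical 2D torus with 3 x 4 nodes
print(len(torus_nodes(sizes)), directions(2))
print(neighbor((2, 0), (1, 0), sizes), neighbor((0, 0), (0, -1), sizes))
```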
The channel dependency graph is not suitable for building routes in the Angara interconnect because it does not support the Angara direction bit routing rule: it does not make it possible to analyze the route history of a packet, which is needed to ensure that the route does not use both the positive and the negative direction of a dimension.
For Angara routing table generation we propose a special routing graph (RG). The idea of the routing graph is to store a history of a possible packet route with a contracted representation of the history.
3.1 First and Last Steps Preserving Direction Order Routing
Initially, we consider a case of FS/LS steps that preserve the direction order. In a routing graph for each network node a vertex set , , , , is defined according to the following.

is a vertex from which a path is started (injection to the network).

is a vertex that can be reached by completing the first step and is a positive direction, .

is a vertex that can be reached by completing steps in the ordered direction list, . The ordered direction list satisfies the direction bit rule, so can be presented as a dimension vector in which, for each network dimension, a packet makes one of three kinds of steps: in the positive direction, in the negative direction, or no steps.

is a vertex that can be reached by completing the last step and is a negative direction, .

is a vertex in which a path is finished (ejection from the network).
Vertices of a are connected in such a way that a movement from a vertex to another vertex corresponds to packet route through the corresponding network nodes and satisfies the Angara routing rules. The directed edge set of a routing graph is defined according to the following possible steps.

Consider a vertex corresponding to a network node , which is a start vertex for a path. A first step of a route from can be

a positive nonstandard step in a direction to a vertex , which corresponds to a node

a step in a direction to a vertex , where corresponds to one single direction and a node

a step to a vertex, which denotes the end of a path.


Consider the vertices, , which correspond to a node . From the vertices it can be possible to make a step

to the vertices, which correspond to a step in a direction, in case of . The index corresponds to one single direction.

to a vertex, the end of a path.


Consider the vertices corresponding to the routes through a node with the ordered direction list . From the vertices it can be possible to make a step

to a vertex in a direction , in case of a last direction of and ,

to a vertex in a direction, in case of a last direction of and

to a vertex, the end of a path.


Consider the vertices corresponding to the routes through a node with a last negative nonstandard step in a direction . From the vertices it can be possible to make a step to a vertex, the end of a path.
Figure 3 presents an example subgraph of a routing graph for a 2D torus topology system. The subgraph includes two sets of vertices for nodes and and edges corresponding to a step from to along direction .
For each network node a number of vertices . Consider an edge set for a vertex set . First, the vertices have edges. Secondly, for each nonstandard step there are edges, the total is . Thirdly, for each of vertices there are at most edges. Finally, we have edges. A total edge number for each network node is .
Theorem 1
From a network node to a node there is a route if and only if there is a path from a vertex to a vertex in a routing graph , which corresponds to the network topology.
The proof is based on the routing graph construction algorithm. For each step , of a route in a network there is an edge between vertices and in a corresponding routing graph, where if the step does not change a route direction or switches to according to the step, if the step changes a route direction. Conversely, for a path in the routing graph there is a route in the network.
3.2 DeadlockFree Routing Graph
In [mukosey2019extended] we relax the direction order rule by augmenting the edge set of the routing graph with edges that do not lead to deadlocks. We determine these edges using a channel dependency graph. The node set of a channel dependency graph consists of the communication channels, which for a torus topology can be represented as a pair , where is a network node, is a direction. The edge set of can be presented as channel pairs , i.e. , where , , and .
For a given torus topology we construct a routing graph and a channel dependency graph , which provides the direction order rule satisfaction.
To evaluate a possibility of making steps with violation of the deadlockfreedom rules, we build for each vertex of a channel dependency graph a used direction set in the paths from this vertex. The used direction set is a set of directions , which can be used in all possible paths started from a vertex in the channel dependency graph.
Theorem 2
New edge of , where and does not create cycles in the graph if .
Proof by contradiction: suppose we add the edge to the and there is a path in the that creates a cycle. Then there is a path from to , hence direction , which contradicts the condition of the theorem.
The algorithm for adding edges to the channel dependency graph can be divided into two stages. In the first stage the breadth-first search (BFS) algorithm is used to build a used direction set for each vertex. BFS starts from each vertex and stores the directions of all reachable vertices.
In the second stage the algorithm finds possible edges satisfying Theorem 2. For each vertex the algorithm adds edges to all the neighbor vertices , when is not in the used direction set of . Then we rebuild the used direction set for each vertex from which there is a path to the vertex by starting BFS from in with the reversed edges and adding directions from to all reachable vertices. The second stage is repeated until all possible edges are constructed.
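The first stage and the edge test of the second stage can be sketched as follows. The sketch reads the condition from the description above: an edge from one channel to another is admissible when the direction of the source channel is not in the used direction set of the target channel. Counting the start vertex itself among the reachable vertices, the string labels of nodes and directions, and the function names are assumptions for illustration; the incremental rebuilding of the sets is omitted.

```python
from collections import deque

def used_directions(cdg, start):
    """Directions of all channel dependency graph vertices reachable from
    `start` by BFS; the start vertex itself is included (assumption)."""
    seen = {start}
    queue = deque([start])
    while queue:
        v = queue.popleft()
        for w in cdg.get(v, ()):
            if w not in seen:
                seen.add(w)
                queue.append(w)
    return {direction for _node, direction in seen}

def can_add_edge(cdg, src, dst):
    """If the direction of `src` is not used by any vertex reachable from
    `dst`, then `src` is not reachable from `dst`, so the edge src -> dst
    cannot close a cycle."""
    return src[1] not in used_directions(cdg, dst)

# Hypothetical channel dependency graph; vertices are (node, direction).
cdg = {
    ("A", "+x"): {("B", "+y")},
    ("B", "+y"): set(),
    ("C", "+x"): set(),
    ("D", "+y"): set(),
}
print(can_add_edge(cdg, ("D", "+y"), ("A", "+x")),
      can_add_edge(cdg, ("D", "+y"), ("C", "+x")))
```

The test is conservative: it may reject an edge that would not in fact create a cycle, because it compares directions rather than individual channels.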
After channel dependency graph analysis we need to augment the routing graph. Each added edge of allows a new route turn that violates the direction order rule, i.e. . The Angara routing allows the direction order rule violation only for a first positive and/or a last negative nonstandard steps. When is a first positive step, for a corresponding routing graph vertex we add an edge to a vertex with . When is a last negative step, for the corresponding vertices, such that is a last direction in the ordered direction list , we add an edge to a vertex with .
The added routing graph edges preserve Theorem 1 correctness.
4 Routing Algorithms
In this section we present the previously proposed BFS-based algorithm and the devised Genetic and SSSP algorithms.
4.1 BFS Algorithm
The BFS routing table generation algorithm is based on breadth-first search in a routing graph. The paths are built by iterative application of breadth-first search, see Algorithm 2. The edge link attribute denotes the link corresponding to an edge; we note that for an edge set there is a link. The reverse path is used to generate the routing tables, called . All edge weights along the built paths are increased by 1. Ascending vertex order in by outgoing edge weights provides better load balancing.
The whole BFS routing algorithm, see Algorithm 1, iterates over all nodes in the given set to find the paths from a source to all other nodes. The next source node is the one most distant from the current node (the GET_MAX_REMOTE procedure); this heuristic provides better load balancing. Initially all edge weights are set to 0. The breadth-first search in the routing graph has a complexity of , the time complexity of the BFS routing algorithm is , where and are relatively large constants; we note that these constants depend on the number of torus dimensions .
4.2 Genetic Algorithm
We design a Genetic algorithm for heuristic exploration of the search space. We consider it as an alternative way to estimate the routing table quality that can be achieved. The Genetic algorithm operates on a population of solution candidates represented as chromosomes. Based on the current generation, it applies crossover to randomly selected chromosome pairs and mutation to randomly selected chromosomes, and then selects the best chromosomes according to a fitness function to form the next generation. Details of the algorithm are as follows.
A chromosome is a set of genes that represents a routing table for all pairs of network nodes. A gene is a route between a pair of network nodes, where is the position of the gene in a chromosome (network node pairs can be mapped to positive integers) and is the variant number of the route for the network node pair. We construct an initial generation by randomly producing each of 100 chromosomes. We use a two-point crossover operation with panmixia as the parent selection algorithm. In the mutation operation each gene of a child is randomly changed with probability 2%. As a fitness function we use with , which we found to provide rapid convergence. The Genetic algorithm stops when it cannot improve the best chromosomes after a rather large number of generations. In our simulation the Genetic algorithm is terminated after 30 consecutive iterations or if the fitness value of the next best solution is less than 0.05 of the fitness value of the global best solution.
4.3 SSSP Algorithm
The third routing table generation algorithm is presented in Algorithm 3 and is based on a single-source shortest-path (SSSP) algorithm in a routing graph.
In our algorithm we aim to reduce the number of SSSP calls while preserving the desired balancedness quality. First, we count the number and the length of paths from each network node to all other network nodes by breadth-first search in the routing graph . We find that for the 49 2D, 339 3D and 1173 4D random torus topology systems, where each dimension size , on average 81.6%, 58.2% and 36.6% of source-destination node pairs, respectively, have a single minimal route under the Angara routing rules. We then apply these unique routes and adjust the edge weights.
Secondly, we sort and group source-destination node pairs in ascending order of the number of route turns in the torus topology and then in ascending order of route length. We also group node pairs by source node so that we can run BUILD_SSSP (Algorithm 4) from a source to a set of destination nodes. Then, for the sake of the best balancing quality, we apply those built paths that do not cross each other, adjust the edge weights and repeat the BUILD_SSSP call for the set of remaining destination nodes until this set is empty.
Algorithm 4 builds single-source shortest paths from a source to a given set . The initial edge weight of the routing graph is , which forces minimal shortest paths, following T. Hoefler in [domke2011deadlock]. Only paths are considered, each ; it follows that the shortest-path algorithm with this edge initialization never chooses a detour.
We observe that this edge initialization allows the vertices of each breadth-first search level to be considered in arbitrary order. When the processing of a level is finished, the tentative distance of every vertex belonging to this level is settled. Therefore we actually use breadth-first search with edge relaxation instead of the Dijkstra algorithm and thus reduce the complexity of the proposed algorithm to instead of . The proposed algorithm also stops when all vertices are processed.
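The idea of replacing the priority queue by level-by-level processing can be sketched as follows. This is a simplified illustration on a plain graph rather than on the routing graph: the sketch enforces the minimal hop count through the BFS levels themselves instead of through the initial edge weight, and the function name, input format and example graphs are assumptions.

```python
def bfs_with_relaxation(adj, weight, source):
    """Among all minimal-hop paths from `source`, select for every vertex a
    path of minimal total edge weight. Vertices are processed level by level
    without a priority queue, so every edge is examined once."""
    level = {source: 0}
    cost = {source: 0}
    parent = {}
    frontier = [source]
    while frontier:
        next_frontier = []
        for u in frontier:  # the order inside a level does not matter
            for v in adj[u]:
                candidate = cost[u] + weight[(u, v)]
                if v not in level:
                    level[v] = level[u] + 1
                    cost[v] = candidate
                    parent[v] = u
                    next_frontier.append(v)
                elif level[v] == level[u] + 1 and candidate < cost[v]:
                    cost[v] = candidate  # relax within the next level only
                    parent[v] = u
        frontier = next_frontier  # costs of this level are now settled
    return level, cost, parent

# Hypothetical 4-cycle 0-1-3-2-0 in which channel (0, 1) is already loaded.
square = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
w_square = {(u, v): 1 for u in square for v in square[u]}
w_square[(0, 1)] = 5
level, cost, parent = bfs_with_relaxation(square, w_square, 0)

# Triangle in which the direct link is heavy: the detour 0-2-1 is cheaper
# by weight but longer in hops, so it is not chosen.
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
w_triangle = {(u, v): 1 for u in triangle for v in triangle[u]}
w_triangle[(0, 1)] = 10
_, _, parent_triangle = bfs_with_relaxation(triangle, w_triangle, 0)
print(parent[3], parent_triangle[1])
```

A vertex can only be improved by vertices of the previous level, which is why its cost is final once that level has been processed.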
The complexity of the SSSP routing algorithm is dominated by the second algorithm stage. Figure 4 presents the number of calls of the BUILD_SSSP algorithm. If the routing algorithm consists only of the second stage, called ’Sort and group’, then the number of BUILD_SSSP calls is between and . Adding the first stage, called ’Unique routes’, significantly reduces the number of BUILD_SSSP calls. The first stage has a complexity as calls of the breadth-first search algorithm. But still an upper-bound complexity of the SSSP routing algorithm is .
5 Experimental Evaluation
We implemented the BFS, Genetic and SSSP algorithms in C as part of the SLURM plugin called Angara Node Selection Utility (ANSU). We compare the routing table generation algorithms by balancedness quality and runtime on a synthetic set of systems, and benchmark performance for communication patterns and NPB benchmarks on the Desmos cluster.
5.1 Balancedness and Runtime
We consider 49 2D, 339 3D and 1173 4D random torus topology systems, each dimension size . When is 2, the system topology along this dimension is a mesh; otherwise it is a torus.
Figure 5 compares the quality and runtime of the presented routing algorithms. We divide the 2D, 3D and 4D torus topology systems into three groups by comparing (less, equal or greater) the edge-forwarding index of the BFS algorithm with that of the Genetic and SSSP algorithms. For each group we report the ratio of the total edge-forwarding index of the Genetic and SSSP algorithms to that of the BFS algorithm, as well as the number of systems in the group (sys_num in Figure 5). The SSSP algorithm outperforms Genetic and BFS in most cases. In most cases, on average, SSSP is better than Genetic by 4% and 7% for 3D and 4D torus topologies, and better than BFS by 13%, 11% and 17% for 2D, 3D and 4D torus topologies (see Figures 5(b) and 5(c)).
Figures 5(d), 5(e) and 5(f) present the algorithm runtime. The evaluation shows that the Genetic algorithm is too computationally expensive to be used in practice. Recall (Section 3.1) that the total number of vertices and edges in a routing graph depends substantially on the number of dimensions. We take 0.25 s as the practical threshold for the algorithm runtime; then the SSSP algorithm can be used in practice for up to 150 nodes with a 3D torus topology, up to 90 nodes with a 4D torus topology, and for 2D torus topology systems. The BFS algorithm can be used in practice up to 160 and 300 nodes of 3D and 4D torus topologies, respectively.
5.2 RealWorld System
Table I: Desmos system configuration
Cluster | Desmos
Chassis | SuperServer 1018GR-T
Processor | E5-1650v3 (6c, 3.0 GHz)
GPU | AMD Radeon Instinct MI50
Memory | DDR4 16 GB
Number of nodes | 32
Interconnect | Angara 4D torus
Operating system | SLES 11 SP4
Compiler | gcc 7.4.1
MPI | MPICH 3.3
We benchmarked and compared the routing table generation algorithms on the 32-node Desmos cluster at the Joint Institute for High Temperatures of the Russian Academy of Sciences [stegailov2019angara]. Table I provides the Desmos system configuration. All codes are compiled with gcc version 7.4.1 using -O2.
For the Desmos cluster we have chosen a node set for each number of nodes and provide SLURM with the same node set for each benchmark. For each node number Table II compares the BFS, Genetic and SSSP algorithms. The maximum diameter of a routing table is the same for each node number because the routing algorithms generate minimal paths. On 2 and 4 nodes all the algorithms obtain an optimal routing table, and . On 16 and 32 nodes the BFS algorithm is significantly worse than the Genetic and SSSP algorithms, and the Genetic algorithm is slightly better than the SSSP algorithm.
To evaluate the routing table generation algorithms without application-level impact we use four synthetic communication patterns: Transpose, Neighbor, Tornado and Alltoall [dally2004principles]. We implement the Transpose, Neighbor and Tornado patterns using the MPI library. The Alltoall pattern implementation is the osu_alltoall benchmark from OSU Micro-Benchmarks version 5.6.2 [osumb].
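Table II: Comparison of the BFS, Genetic and SSSP routing algorithms on the Desmos cluster
Routing Algorithm | # nodes |  |  |  |  
BFS | 2 | 1 | 1 | 0 | 1
BFS | 4 | 2 | 2 | 0 | 2
BFS | 8 | 6 | 2 | 1.429 | 3
BFS | 16 | 18 | 3 | 4.341 | 4
BFS | 32 | 48 | 3 | 12.254 | 5
Genetic | 2 | 1 | 1 | 0 | 1
Genetic | 4 | 2 | 2 | 0 | 2
Genetic | 8 | 4 | 4 | 0 | 3
Genetic | 16 | 11 | 5 | 1.745 | 4
Genetic | 32 | 27 | 8 | 6.274 | 5
SSSP | 2 | 1 | 1 | 0 | 1
SSSP | 4 | 2 | 2 | 0 | 2
SSSP | 8 | 5 | 3 | 0.707 | 3
SSSP | 16 | 12 | 3 | 2.319 | 4
SSSP | 32 | 28 | 8 | 6.298 | 5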
Table III: NPB performance results on 32 Desmos nodes
Benchmark | BFS Mops | BFS CoV | Genetic Mops | Genetic CoV | SSSP Mops | SSSP CoV | Improvement in % of SSSP to BFS
LU | 364353 | 0.38% | 367771 | 0.65% | 368548 | 0.50% | 101.15%
MG | 374789 | 0.59% | 376040 | 0.39% | 377562 | 0.51% | 100.74%
CG | 88426 | 0.57% | 90354 | 0.56% | 90443 | 0.59% | 102.28%
FT | 172567 | 0.54% | 181744 | 0.50% | 181911 | 0.55% | 105.41%
IS | 7771 | 0.29% | 8073 | 0.18% | 8181 | 0.13% | 105.28%
Figure 6 compares the communication pattern results on the Desmos system. The message size is 1 MB, and we perform 5 runs for each benchmark and node number. The running times are very stable: the coefficient of variation for a run does not exceed 0.71%. On 2, 4 and 8 nodes the results are approximately the same. The Genetic algorithm is worse than BFS by 3.9% on 32 nodes for the Transpose pattern (Figure 6(a)). In the other cases Genetic and SSSP are better than the BFS algorithm: on 32 nodes the improvement is up to 11.6% for the Neighbor pattern (Figure 6(b)), up to 52.9% for the Tornado pattern (Figure 6(c)) and 11.1% for the Alltoall pattern. For the Alltoall pattern (Figure 6(d)) Genetic outperforms the SSSP algorithm by 1.1% on 32 nodes; the obtained routing table is slightly better balanced, which is characterized by and values, see Table II. The coefficient of variation for that case is 0.08% and 0.14% for the Genetic and SSSP algorithms, respectively.
For the application benchmarks we use the MPI-based NAS Parallel Benchmarks suite [npb] version 3.3.1, class C, with 4 MPI processes per node. We use five parallel benchmarks, LU, MG, FT, CG and IS, which represent a wide range of computation-to-communication ratios. Table III presents the NPB performance results on 32 Desmos nodes and the relative performance improvement of SSSP routing over BFS. Some benchmark results were unstable; for that reason we adjusted the number of internal NPB iterations for each benchmark and obtained fairly small values of the coefficient of variation.
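The last column of Table III can be reproduced from the Mops columns with a few lines of Python, which also show how a coefficient of variation is computed. The function names are illustrative, and using the population standard deviation for the coefficient of variation is an assumption, since the estimator is not stated in the text.

```python
import statistics

def coefficient_of_variation(samples):
    """Coefficient of variation in percent (population standard deviation
    is an assumption)."""
    return statistics.pstdev(samples) / statistics.mean(samples) * 100.0

def relative_performance(mops_new, mops_base):
    """Performance of one routing relative to another, in percent."""
    return mops_new / mops_base * 100.0

# (BFS Mops, SSSP Mops) taken from Table III.
table_iii = {
    "LU": (364353, 368548),
    "MG": (374789, 377562),
    "CG": (88426, 90443),
    "FT": (172567, 181911),
    "IS": (7771, 8181),
}
relative = {name: f"{relative_performance(sssp, bfs):.2f}"
            for name, (bfs, sssp) in table_iii.items()}
for name, value in relative.items():
    print(f"{name}: {value}%")
```

The values are ratios of SSSP to BFS performance, so an entry above 100% means that the benchmark runs faster with the SSSP routing table.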
LU is a symmetric successive over-relaxation solver with a regular block matrix; it has a high computation-to-communication ratio. The observable performance improvement of SSSP can be explained by the sensitivity of the benchmark to the latency of small messages between neighbor nodes; recall the better SSSP routing table for the Neighbor pattern.
MG is a simplified multigrid method and a memory-intensive benchmark. It requires highly structured short- and long-distance communication, but the total volume of communicated data is lower than for the LU benchmark. Therefore the performance difference between the considered routing algorithms is negligible.
CG is a conjugate gradient method with a large sparse symmetric positive-definite matrix. The communication pattern of CG is similar to that of MG, but the volume of communicated data is greater than for the MG benchmark [riesen2006communication]. As a result, performance with SSSP-based routing exceeds performance with BFS-based routing by 2.28%.
FT solves a 3-dimensional partial differential equation using fast Fourier transforms, and IS is an integer sort. The FT and IS benchmarks use MPI collective operations, mainly all-to-all communication. The advantage of SSSP routing for the Alltoall synthetic benchmark leads to a performance improvement of more than 5% for FT and IS on 32 nodes of the Desmos cluster.
6 Conclusion
We proposed a deadlock-free routing table generation algorithm based on a fast SSSP algorithm for the deterministic routing of the torus topology Angara interconnect with a single virtual channel. The algorithm uses the routing graph that we proposed in order to build routes in the torus topology Angara interconnect, whose routing restricts the further route of a packet depending on its previous route steps. Compared with the BFS and Genetic routing algorithms, in most cases, on average, the edge-forwarding index of the proposed algorithm is better than that of Genetic by 4% and 7% for 3D and 4D torus topologies, and better than that of BFS by 13%, 11% and 17% for 2D, 3D and 4D torus topologies.
We investigated the considered routing algorithms on the 32-node Desmos cluster system and measured a performance improvement of the proposed algorithm of 11.1% for the Alltoall communication pattern and of more than 5% for the FT and IS application kernels.
The proposed routing table generation algorithm can be used in practice for up to 150 nodes of 3D, up to 90 nodes of 4D and for 2D torus topology systems. In future work we will try to improve the runtime of the routing algorithm in order to support larger numbers of nodes while preserving the routing quality.
The research was carried out with a grant from the Russian Science Foundation (project No. 20-71-10127).