Optimized Graph Based Routing Algorithm for the Angara Interconnect

10/02/2021
by   Anatoly Mukosey, et al.

JSC NICEVT has developed the Angara high-speed interconnect with 4D torus topology. The Angara interconnect router implements deterministic routing based on bubble flow control, direction order routing (DOR) and direction bit rules. The router chip also supports non-standard First Step / Last Step for bypassing failed nodes and links; these steps can violate the DOR rule. In previous work we proposed an algorithm for generation and analysis of routing tables that guarantees no deadlocks in the Angara interconnect. It is based on a breadth-first search algorithm in a graph and it practically does not take communication channel load into consideration. Also, we have never evaluated the influence of the routing table generation algorithm on the performance of a real-world Angara based cluster. In this paper we present a routing graph notation that makes it possible to build routes in the torus topology of the Angara interconnect. We propose a deadlock-free routing algorithm based on a fast single-source shortest path algorithm for the deterministic Angara routing with a single virtual channel. We evaluated the considered routing algorithms on a 32-node Desmos cluster system and measured a performance improvement of the proposed algorithm of 11.1% for the Alltoall communication pattern and of more than 5% for application kernels.

1 Introduction and Motivation

High performance computing (HPC) systems are costly hardware, and optimization of application performance is a very challenging problem. The critical component of an HPC system is the interconnect, whose properties determine the scalability of any MPI based parallel algorithm. The interconnect is largely controlled by routing algorithms, which determine how to move packets through the interconnect topology.

We consider the Angara interconnect [simonov2014pervoye] that was developed at JSC NICEVT in Russia. Angara is a low-latency, high-bandwidth interconnect with 4D torus topology; the minimum obtained MPI latency between 2 adjacent nodes is 850 ns. The Angara-C1 [agarkov2016performance] and Desmos [stegailov2019angara] cluster systems are based on the Angara interconnect. Many authors have obtained results using the Angara interconnect [khalilov2018optimization, ostroumova2019reactive, polyakov2018, stegailovvasp2019, tolstykh2018structure, stegailov2020]. Currently the largest Angara based high performance computing system has 96 nodes.

Deterministic routing in the Angara interconnect is based on direction order routing (DOR) [adiga2005blue, scott1996cray] implemented using direction bits [scott1996cray] and additional First and Last Steps [scott1996cray] for bypassing failed nodes and links. The First and Last Steps method makes it possible to violate the DOR rule. In previous work [mukosey2019extended] we proposed an algorithm for generation and analysis of routing tables that guarantees no deadlocks in the Angara interconnect. That algorithm increases the number of different routable systems and improves fault tolerance. It is based on a breadth-first search algorithm in a graph and practically does not take communication channel load into consideration. Also, we have never evaluated the influence of the routing table generation algorithm on the performance of a real-world Angara based cluster.

In this work we focus on development of an advanced routing table generation algorithm and on the routing algorithm evaluation on the Desmos Angara based cluster.

1.1 Related Work

Routing algorithms can be divided into two categories: oblivious and adaptive algorithms [dally2004principles]. An oblivious algorithm chooses a route for each source-destination pair without considering any information about the network’s traffic. An adaptive algorithm adapts to the current network traffic conditions. Deterministic routing algorithms are a subset of oblivious algorithms and always choose the same path for a given source-destination pair, even if there are multiple possible paths.

Path diversity is the network property of providing multiple minimal routes between most pairs of nodes. It adds to the robustness of the network by balancing load across channels and allowing the network to tolerate faulty channels and nodes. Balancing the load across the communication channels affects the network performance.

The edge-forwarding index is the maximum channel load over the set of all communication channels of a network [heydemann1989forwarding]. The channel load is the number of routes crossing a given network channel. The edge-forwarding index addresses link contention and thus characterizes the balancedness and the quality of the routing table.
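As a small illustration of these two definitions, the following Python sketch counts channel loads and derives the edge-forwarding index. The representation of a routing table as a dictionary from (source, destination) pairs to node lists, and the names `channel_loads` and `edge_forwarding_index`, are assumptions made for this example only.

```python
from collections import Counter

def channel_loads(routes):
    """Count how many routes cross each directed channel (u, v)."""
    loads = Counter()
    for path in routes.values():
        # Consecutive nodes of a route form the channels it crosses.
        for u, v in zip(path, path[1:]):
            loads[(u, v)] += 1
    return loads

def edge_forwarding_index(routes):
    """Maximum channel load over all channels used by the routing table."""
    loads = channel_loads(routes)
    return max(loads.values(), default=0)

# Hypothetical routing table on a 4-node ring 0-1-2-3-0.
routes = {
    (0, 2): [0, 1, 2],
    (1, 2): [1, 2],
    (3, 1): [3, 0, 1],
}
loads = channel_loads(routes)
efi = edge_forwarding_index(routes)
```

Here the channels (0, 1) and (1, 2) are each crossed by two routes, so they determine the index.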

The perfect channel load is the average channel load in a system with a routing table in which each route has minimal length. By analogy with [sancho2003routing] we define a deviation metric for a routing table. The deviation metric characterizes the deviation of channel loads from the perfect channel load and can be more descriptive than the edge-forwarding index.

Optimal deterministic routing (with a minimal edge-forwarding index) in arbitrary networks is NP-hard [saad1993complexity]. Several approaches and techniques exist to address the balancing problem in general and torus topology networks.

The first approach to implementing load balanced routing for torus networks on top of dimension (or direction) order routing is to forward packets depending on the relative position of the source and destination network nodes [montanana2009balanced]. This algorithm is simple and can be implemented in hardware, but it does not take nonsymmetric topologies and channel faults into consideration.

Instead of choosing routing paths locally at each router, it is possible to use global routing: all communications can be optimized for each source-destination node pair, particularly when a communication pattern is known. Mixed integer linear programming (MILP) can be used to obtain an optimal solution of the global routing problem [kinsy2009application, abdel2011transcom] for an arbitrary topology, but because of the computational complexity this approach does not scale well.

In [abdel2016scalable] A. Abdel-Gawad has proposed a decoupled approach, SGR, that consists of discovering the best possible channel loads and constructing paths for source-destination pairs that achieve such optimal channel loads. The best channel loads are obtained by casting the global routing problem as a linear programming (LP) problem, which is solvable in polynomial time. However, the obtained best channel loads imply splitting of source-destination flows, and the second SGR stage, which constructs the source-destination paths, is based on a greedy algorithm that achieves optimal channel loads by allowing additional paths for a source-destination pair. MILP and LP formulations take injection flows into consideration; as a result the global routing can be application specific.

Graph methods are a natural way to balance routing paths. A network can be modeled as a directed network graph whose vertices represent the network nodes and whose edges represent the communication links between the nodes. The channel dependency graph is a directed graph whose vertices are the edges (channels) of the network graph; the routing function defines its edges, each of which is a possible path step in the network from one channel to the next. According to [dally1988deadlock, schwiebert2001deadlock], a routing function is deadlock-free if the corresponding channel dependency graph is acyclic.
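The acyclicity criterion can be checked mechanically. The sketch below is only an illustration under assumed data structures: routes are lists of nodes, a channel is a pair of consecutive nodes, and `build_cdg` and `is_acyclic` are hypothetical helper names, not part of the Angara software.

```python
def build_cdg(routes):
    """Channel dependency graph: vertices are channels (u, v); an edge joins
    two channels that some route uses one right after the other."""
    cdg = {}
    for path in routes.values():
        channels = list(zip(path, path[1:]))
        for c in channels:
            cdg.setdefault(c, set())
        for c1, c2 in zip(channels, channels[1:]):
            cdg[c1].add(c2)
    return cdg

def is_acyclic(graph):
    """Iterative three-colour DFS; a grey successor means a cycle."""
    WHITE, GREY, BLACK = 0, 1, 2
    colour = {v: WHITE for v in graph}
    for root in graph:
        if colour[root] != WHITE:
            continue
        colour[root] = GREY
        stack = [(root, iter(graph[root]))]
        while stack:
            v, successors = stack[-1]
            for w in successors:
                if colour[w] == GREY:
                    return False
                if colour[w] == WHITE:
                    colour[w] = GREY
                    stack.append((w, iter(graph[w])))
                    break
            else:
                # All successors explored: v is finished.
                colour[v] = BLACK
                stack.pop()
    return True

# Two-hop clockwise routes on a unidirectional 4-node ring close a cycle.
ring_routes = {(0, 2): [0, 1, 2], (1, 3): [1, 2, 3],
               (2, 0): [2, 3, 0], (3, 1): [3, 0, 1]}
cyclic = not is_acyclic(build_cdg(ring_routes))

# Dropping one route breaks the cycle of channel dependencies.
open_routes = {k: v for k, v in ring_routes.items() if k != (3, 1)}
safe = is_acyclic(build_cdg(open_routes))
```

The ring example shows the classic case in which every route is minimal and yet the dependencies between channels form a cycle.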

T. Hoefler et al. have proposed [hoefler2009optimized] a balancing routing algorithm for arbitrary interconnect topologies based on the single-source shortest path (SSSP) Dijkstra algorithm [dijkstra1959note]. For each network node, routing paths are constructed by the single-source shortest path algorithm through the network graph. The balancing is achieved through edge weights: the weight is incremented by one for all edges of a path from a source to a destination node. The P-SSSP algorithm variant builds, for each source node, routing paths to all possible destination nodes and then modifies the edge weights. P-SSSP does not balance the edge-forwarding index ideally, but its runtime includes only P SSSP calls, where P is the number of network nodes. The more accurate P²-SSSP algorithm performs SSSP for each source-destination pair and updates the edge weights every time. Clearly, the P²-SSSP runtime is dominated by P² calls of SSSP. The SSSP based algorithms might introduce cyclic dependencies, which might lead to network deadlocks. For that reason the second part of the algorithm, called DFSSSP [domke2011deadlock], places all obtained paths into a channel dependency graph and then breaks cycles in the graph by moving paths to virtual graph layers; each layer is assigned to a virtual channel.
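The weight-update idea can be sketched in a few lines of Python. This is not the implementation from [hoefler2009optimized]; it is a minimal per-source variant under stated assumptions: nodes are integers, `adj` is an adjacency dictionary, and the initial edge weight `n ** 3` is chosen here only because it is large enough to keep every path hop-minimal in this toy setting.

```python
import heapq

def dijkstra_parents(adj, weight, src):
    """Plain Dijkstra; returns the parent map of the shortest-path tree."""
    dist = {src: 0}
    parent = {src: None}
    heap = [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        for v in adj[u]:
            nd = d + weight[(u, v)]
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return parent

def balanced_routes(adj):
    """One SSSP run per source; weights are bumped along each chosen path."""
    n = len(adj)
    base = n ** 3  # assumption: exceeds the total increments on any path
    weight = {(u, v): base for u in adj for v in adj[u]}
    routes = {}
    for src in sorted(adj):
        parent = dijkstra_parents(adj, weight, src)
        for dst in sorted(parent):
            if dst == src:
                continue
            path = [dst]
            while parent[path[-1]] is not None:
                path.append(parent[path[-1]])
            path.reverse()
            routes[(src, dst)] = path
            for u, v in zip(path, path[1:]):
                weight[(u, v)] += 1
    loads = {e: w - base for e, w in weight.items()}
    return routes, loads

# Bidirectional 4-node ring.
ring = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
routes, loads = balanced_routes(ring)
```

On this ring the later sources avoid the channels that earlier sources already loaded, so the two-hop routes are spread over both ring directions.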

The breadth-first and depth-first search algorithms through the network graph can be used to generate routing tables for arbitrary irregular network topology [sancho2004effective].

A genetic algorithm is a well-known search heuristic to solve NP-hard problems which relies on the process of natural selection. But for large networks the problem is the time complexity of genetic algorithms.

In [sancho2000improving] a heuristic iterative load balancing algorithm is proposed. For all source-destination node pairs all possible paths are constructed. Then, in each step, the channel with the highest load value is determined, and a routing path crossing this channel is selected and removed from consideration, provided that there is more than one routing path between the source and destination nodes of this routing path. The algorithm finishes when the number of routing paths between every pair of nodes is reduced to one.

The major contributions of our work are:

  • We present full description of a routing graph notation that provides a possibility to build routes in the Angara interconnect with a routing that places restrictions on a further packet route with the previous route steps in mind.

  • We propose a deadlock-free routing algorithm based on a fast SSSP algorithm for the deterministic routing of the Angara interconnect with a single virtual channel.

  • We evaluate the SSSP based proposed algorithm, a genetic algorithm and a previously proposed BFS based algorithm and compare the edge-forwarding index, routing time and application performance for synthetic communication patterns and NPB benchmarks on the Desmos cluster.

The paper is structured as follows. Section 2 describes the Angara interconnect architecture and its routing. In Section 3 we present the routing graph notation, together with a condition and an algorithm for adding edges to a routing graph without introducing deadlocks. Section 4 presents the previously proposed BFS based algorithm, the devised genetic algorithm and the proposed SSSP based algorithm. In Section 5 we discuss the improvements of the SSSP routing in balancedness quality, runtime and application performance. Section 6 concludes the paper.

2 The Angara Interconnect Architecture

The Angara interconnect is a Russian-designed communication network with 4D torus topology. The interconnect ASIC was developed by JSC NICEVT and manufactured by TSMC with the 65-nm process.

The Angara ASIC chip (Figure 1) consists of an adapter and a router. The router contains 8 link blocks that are connected via a crossbar. The crossbar can simultaneously transmit flits (128 bits) from links if there are no conflicts.

In each direction there are five virtual channels: two channels for deterministic routing (a blocking request channel and a non-blocking response channel); a separate virtual channel for adaptive routing with the ability to switch to a non-blocking deterministic channel in case of a potential deadlock; and two more virtual channels for sending messages over a virtual subnet for collective operations. Support for the collective operations (broadcast and reduce) is implemented using a virtual subnet with a tree topology, which is mapped onto the multidimensional torus topology. Deterministic routing preserves the order of packet transmission and prevents deadlocks; adaptive routing uses one of the possible minimum routes for delivering packets, ignoring the order of directions, which makes it possible to bypass overloaded and broken parts of the network.

The link layer supports fault tolerant transmission of packets using packet numbering, checksum calculation for each packet and retransmission if the checksum recorded in the last flit of the packet does not coincide with the one calculated after the transfer. There is also a mechanism for bypassing failed communication channels and nodes by rebuilding the routing tables and using non-standard first / last steps of the packet route. For performing various service operations, including setting up / rebuilding routing tables, and performing some calculations, a service processor can be used that interacts with the adapter via the ELB interface.

Fig. 1: The Angara ASIC architecture.
Fig. 2: The Angara software stack.

The Angara chip supports simultaneous operations with multiple threads/processes of a user task; it is implemented as several injection channels available for use by independent packet buffers.

The network adapter at the hardware level supports remote (RDMA) write, read, and atomic operations. Atomic operations of two types are available – addition and exclusive OR. Each node has a dedicated memory region available for remote access from other nodes to support OpenSHMEM and PGAS. Different programming models are supported, including MPI and OpenMP. The Angara software stack is presented in Figure 2.

The network adapter extension card can be connected to the card in adjacent nodes by up to six cables (or up to eight with an extension card). The following topologies are supported: a ring, 2D, 3D and 4D torus.

2.1 Deterministic Routing

Deadlock-free deterministic routing in the Angara interconnect preserves the order of directions (direction order routing, DOR) [dally1986torus]. All directions in a network route are used in a predetermined order. Direction order algorithms guarantee the absence of mutual blocking between the rings of the torus topology for any number of simultaneous requests for data transmission over the network. To avoid deadlocks within a ring (movement without changing direction) the bubble routing rule is used [adiga2005blue, scott1996cray, puente1999adaptive].

The second Angara routing restriction is a direction bit rule, which does not allow both positive and negative directions of a dimension in a route. denotes a set of directions matches the direction bit rule.
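The two rules can be illustrated with a small Python check. Everything concrete here is an assumption for the example: a direction is encoded as a pair (dimension, sign), a route is a list of such steps, and the order `order_2d` (positive directions first) is a hypothetical choice, not the order used by the Angara hardware.

```python
def satisfies_direction_bits(steps):
    """True if no dimension is travelled in both the + and - direction."""
    sign_of = {}
    for dim, sign in steps:
        if sign_of.setdefault(dim, sign) != sign:
            return False
    return True

def satisfies_dor(steps, order):
    """True if the directions of the route appear in the given order."""
    rank = {direction: i for i, direction in enumerate(order)}
    ranks = [rank[direction] for direction in steps]
    return all(a <= b for a, b in zip(ranks, ranks[1:]))

# Hypothetical direction order for a 2D torus: +x, +y, -x, -y.
order_2d = [(0, +1), (1, +1), (0, -1), (1, -1)]

ok_route = [(0, +1), (0, +1), (1, -1)]       # +x, +x, -y
wrong_order = [(1, -1), (0, +1)]             # -y before +x
both_signs = [(0, +1), (1, +1), (0, -1)]     # +x, +y, -x
```

The last route respects the assumed direction order but uses both directions of dimension 0, which shows that the direction bit rule is a separate restriction.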

The First Step / Last Step (FS/LS) method is used in the Angara interconnect as a mechanism for relaxing routing rules. The FS/LS method allows a first positive and/or a last negative non-standard step in a route. A route from to nodes with FS/LS steps can be written as , where is a positive step, is a negative step. Note that a set of directions can violate the direction bit and the direction order rules, so a deadlock can arise.

Angara system software includes a SLURM plugin called Angara Node Selection Utility (ANSU). Each router in a network node contains a routing table, which is written by ANSU before each user application run. The described deterministic Angara routing rules allow path diversity, i.e. multiple routes between a source-destination pair. The routing table of each node contains routing information that determines a route to every other node. When a packet is injected into the network, the router creates a 128-bit header flit with the route information taken from the routing table for the destination node. The header flit is added to the network packet, and then the packet moves deterministically through the network.

3 Routing Graph

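Let $N$ be the set of network nodes. The $n$-dimensional torus has $|N| = k_1 \times k_2 \times \dots \times k_n$ nodes, with $k_i$ nodes along each dimension $i$, where $k_i \ge 2$ for $1 \le i \le n$. Each node $a \in N$ is identified by $n$ coordinates $(a_1, a_2, \dots, a_n)$, where $0 \le a_i < k_i$ for $1 \le i \le n$.

The direction $d$ is a vector $(d_1, \dots, d_n)$ in which every coordinate is zero except one position $i$, where it equals $+1$ or $-1$. The direction is positive when this coordinate is $+1$ and negative when it is $-1$.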

Two nodes and are neighbors in a direction if . A set of all directions is denoted by . On the set of directions we introduce the order: , if .
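The neighbor relation on a torus amounts to coordinate arithmetic modulo the dimension size. The sketch below is an illustration only; the encoding of a direction as a pair (dimension index, sign) and the function names are assumptions of this example.

```python
def neighbor(node, direction, dims):
    """Node reached from `node` by one step in `direction` on a torus.

    node      -- tuple of coordinates
    direction -- (dimension index, +1 or -1)
    dims      -- tuple with the number of nodes along each dimension
    """
    dim, sign = direction
    coords = list(node)
    coords[dim] = (coords[dim] + sign) % dims[dim]  # wrap-around link
    return tuple(coords)

def all_directions(ndims):
    """Positive and negative direction of every dimension."""
    return [(i, +1) for i in range(ndims)] + [(i, -1) for i in range(ndims)]

dims = (4, 2)
wrapped = neighbor((3, 0), (0, +1), dims)   # leaves the last x position
back = neighbor((0, 0), (1, -1), dims)      # negative step in y
```

For a 4D torus `all_directions(4)` yields eight directions, one per link block of the router described above.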

The channel dependency graph is not suitable for building routes in the Angara interconnect, because it does not support the Angara direction bit routing rule: the channel dependency graph does not carry the history of a packet route, which is needed to ensure that the route does not use both the positive and the negative direction of a dimension.

For Angara routing table generation we propose a special routing graph (RG). The idea of the routing graph is to store a history of a possible packet route with a contracted representation of the history.

3.1 First and Last Steps Preserving Direction Order Routing

Initially, we consider a case of FS/LS steps that preserve the direction order. In a routing graph for each network node a vertex set , , , , is defined according to the following.

  1. is a vertex from which a path is started (injection to the network).

  2. is a vertex that can be reached by completing the first step and is a positive direction, .

  3. is a vertex that can be reached by completing steps in the ordered direction list, . The ordered direction list satisfies the direction bit rule, then can be presented as a -dimension vector, in which for each network dimension a packet might do 3 kind of steps: in a positive direction, in a negative direction or no steps.

  4. is a vertex that can be reached by completing the last step and is a negative direction, .

  5. is a vertex in which a path is finished (ejection from the network).

Fig. 3: Subgraph example of a routing graph for a 2D torus topology system.

Vertices of a are connected in such a way that a movement from a vertex to another vertex corresponds to packet route through the corresponding network nodes and satisfies the Angara routing rules. The directed edge set of a routing graph is defined according to the following possible steps.

  1. Consider a vertex corresponding to a network node , which is a start vertex for a path. A first step of a route from can be

    • a positive non-standard step in a direction to a vertex , which corresponds to a node

    • a step in a direction to a vertex , where corresponds to one single direction and a node

    • a step to a vertex, which denotes the end of a path.

  2. Consider the vertices, , which correspond to a node . From the vertices it can be possible to make a step

    • to the vertices, which correspond to a step in a direction, in case of . The index corresponds to one single direction.

    • to a vertex, the end of a path.

  3. Consider the vertices corresponding to the routes through a node with the ordered direction list . From the vertices it can be possible to make a step

    • to a vertex in a direction , in case of a last direction of and ,

    • to a vertex in a direction, in case of a last direction of and

    • to a vertex, the end of a path.

  4. Consider the vertices corresponding to the routes through a node with a last negative non-standard step in a direction . From the vertices it can be possible to make a step to a vertex, the end of a path.

Figure 3 presents a subgraph example of a routing graph for a 2D torus topology system. The subgraph includes two set of vertices for nodes and and edges corresponding to a step from to along direction .

For each network node a number of vertices . Consider an edge set for a vertex set . First, the vertices have edges. Secondly, for each non-standard step there are edges, the total is . Thirdly, for each of vertices there are at most edges. Finally, we have edges. A total edge number for each network node is .

Theorem 1

From a network node to a node there is a route if and only if there is a path from a vertex to a vertex in a routing graph , which corresponds to the network topology.

The proof is based on a routing graph construction algorithm. For each step , of a route in a network there is an edge between vertices and in a corresponding routing graph, where if the step does not change a route direction or switches to according to the step, if the step changes a route direction. Conversely, for a path in the routing graph there is a route in the network.

3.2 Deadlock-Free Routing Graph

In [mukosey2019extended] we relax the direction order rule by augmenting the edge set of the routing graph by adding the edges, which do not lead to deadlocks. We determine these edges by a channel dependency graph. The node set of a channel dependency graph consists of the communications channels that can be presented for a torus topology as a pair , where is a network node, is a direction. The edge set of can be presented as channel pairs , i.e. , where , , and .

For a given torus topology we construct a routing graph and a channel dependency graph , which provides the direction order rule satisfaction.

To evaluate a possibility of making steps with violation of the deadlock-freedom rules, we build for each vertex of a channel dependency graph a used direction set in the paths from this vertex. The used direction set is a set of directions , which can be used in all possible paths started from a vertex in the channel dependency graph.

Theorem 2

New edge of , where and does not create cycles in the graph if .

By contradiction, we add the edge to the and suppose there is a path in the that creates a cycle. Then there is a path from to , then direction , which contradicts the theorem condition.

Algorithm for adding edges to the channel dependency graph can be divided into two stages. In the first stage the breadth-first search (BFS) algorithm is used to build a used direction set for each vertex. BFS starts from each vertex and stores all directions from the reachable vertices.

In the second stage the algorithm finds possible edges satisfying Theorem 2. For each vertex the algorithm adds edges to all the neighbor vertices , when is not in the used direction set of . Then we rebuild the used direction set for each vertex from which there is a path to the vertex by starting BFS from in with the reversed edges and adding directions from to all reachable vertices. The second stage is repeated until all possible edges are constructed.
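The edge test of the two stages can be sketched as follows. This is a reading of the text rather than the authors' code: a channel is assumed to be a pair (node, direction), the used direction set of a channel is taken to include the channel's own direction, and a new edge from `c1` to `c2` is accepted only if the direction of `c1` is not used on any path starting at `c2`. The names `used_directions` and `can_add_edge` are hypothetical.

```python
from collections import deque

def used_directions(cdg, start):
    """Directions of all channels reachable from `start` (itself included)."""
    seen = {start}
    queue = deque([start])
    while queue:
        channel = queue.popleft()
        for nxt in cdg.get(channel, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return {direction for _node, direction in seen}

def can_add_edge(cdg, c1, c2):
    """Sufficient condition: if c1 were reachable from c2, the direction of
    c1 would be in the used direction set of c2, so its absence rules out
    a cycle through the new edge c1 -> c2."""
    return c1[1] not in used_directions(cdg, c2)

# Toy channel dependency graph with channels written as (node, direction).
cdg = {
    ("a", "+x"): {("b", "+y")},
    ("b", "+y"): set(),
    ("c", "-x"): set(),
}
closes_cycle = not can_add_edge(cdg, ("b", "+y"), ("a", "+x"))
harmless = can_add_edge(cdg, ("b", "+y"), ("c", "-x"))
```

The condition is conservative: it may reject an edge that would in fact be safe, but it never accepts an edge that closes a cycle.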

After channel dependency graph analysis we need to augment the routing graph. Each added edge of allows a new route turn that violates the direction order rule, i.e. . The Angara routing allows the direction order rule violation only for a first positive and/or a last negative non-standard steps. When is a first positive step, for a corresponding routing graph vertex we add an edge to a vertex with . When is a last negative step, for the corresponding vertices, such that is a last direction in the ordered direction list , we add an edge to a vertex with .

The added routing graph edges preserve Theorem 1 correctness.

4 Routing Algorithms

In this section we present the previously proposed BFS based algorithm, devised Genetic and SSSP algorithms.

4.1 BFS Algorithm

The BFS routing table generation algorithm is based on the breadth-first search algorithm in a routing graph. The paths are built by iterative application of the breadth-first search algorithm, see Algorithm 2. The edge link attribute denotes the link corresponding to an edge; note that for an edge set there is a link. The reverse path is used to generate the routing tables, called . All edge weights along the built paths are increased by 1. Processing the vertices in in ascending order of outgoing edge weights provides better load balancing.

The whole BFS routing algorithm, see Algorithm 1, iterates over all nodes in the given set to find the paths from a source to all other nodes. The next source node is the one most distant from the current node (the GET_MAX_REMOTE procedure); this heuristic provides better load balancing. Initially all edge weights are set to 0. The breadth-first search in the routing graph has a complexity of , the time complexity of the BFS routing algorithm is , where and are relatively large constants. Note that these constants depend on the number of torus dimensions .

Input: RG – routing graph, N – node set
Output: Routing table RT for N

procedure build_rt_bfs(RG, N)
     v = RG.beginVertex(N.first())
     repeat
         build_bfs_routes(RG, v)
         mark_processed(v)
         v = get_max_remote(N, v)
     until all_processed(N)
end procedure

procedure get_max_remote(N, v)
     r = v
     for all u ∈ not_processed(N) do
         if length(v, u) > length(v, r) then
              r = u
         end if
     end for
     return r
end procedure

Algorithm 1 Build routes by BFS in a routing graph

Input: Routing graph RG = (V, E), start vertex s
Output: All shortest paths from s

procedure build_bfs_routes(RG, s)
     /* build BFS tree from source to all targets */
     for all v ∈ V do
         initialize (v.parent = NULL)
     end for
     Q = {s}
     while Q ≠ ∅ do
         Q' = ∅
         /* ascending vertex sort in Q by outgoing edge weights */
         for all vertex u ∈ ascending sort Q do
              for all (u, v) ∈ E do
                  if v.parent == NULL then
                       v.parent = u
                       Q' = Q' ∪ {v}
                  end if
              end for
         end for
         Q = Q'
     end while
     /* build paths from BFS tree in reverse order */
     for all end vertex t do
         v = t
         repeat
              RT[s.node, t.node].push_back(v.node)
              for all e ∈ E, where e.link == (v.parent, v).link do
                  e.weight = e.weight + 1
              end for
              v = v.parent
         until v == s
     end for
end procedure

Algorithm 2 Build routes by BFS in a routing graph from a source vertex

4.2 Genetic Algorithm

We design a Genetic algorithm for heuristic search space exploration. We consider the Genetic algorithm as an alternative way to evaluate an affordable routing table quality. The Genetic algorithm operates with a population of solution candidates represented as chromosomes. Then, based on current generation, it produces the next generation by applying crossover on some randomly selected chromosome pairs or mutates some random ones and then it selects the best chromosomes based on a fitness function to produce the next generation. Details of this algorithm are as follows.

A chromosome is a set of genes representing a routing table for each pair of network nodes. A gene is a route between a pair of network nodes, where is the position of the gene in a chromosome (network node pairs can be mapped to positive integers) and is the variant route number for the network node pair. We construct an initial generation by randomly producing each of 100 chromosomes. We use a two-point crossover operation with panmixia as the parent selection algorithm. In the mutation operation each child gene is randomly changed with probability 2%. As a fitness function we use with ; we find that it provides rapid convergence. The Genetic algorithm stops when it cannot improve the best chromosomes after a rather large number of generations. In our simulation the Genetic algorithm is terminated after 30 consecutive iterations or if the fitness value of the next best solution is less than 0.05 of the fitness value of the global best solution.
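The two genetic operators named above can be sketched in Python. The chromosome encoding used here (a list in which gene `k` holds the index of the chosen route variant for node pair `k`) and the helper names are assumptions for illustration; the fitness function is not reproduced.

```python
import random

def two_point_crossover(a, b, rng):
    """Swap the segment between two random cut points of the parents."""
    i, j = sorted(rng.sample(range(len(a) + 1), 2))
    child1 = a[:i] + b[i:j] + a[j:]
    child2 = b[:i] + a[i:j] + b[j:]
    return child1, child2

def mutate(chromosome, variants, rng, p=0.02):
    """Replace each gene by a random route variant with probability p.

    variants[k] is the number of route variants available for pair k.
    """
    return [rng.randrange(variants[k]) if rng.random() < p else gene
            for k, gene in enumerate(chromosome)]

rng = random.Random(0)            # fixed seed, only for reproducibility
parent_a = [0] * 6
parent_b = [1] * 6
child1, child2 = two_point_crossover(parent_a, parent_b, rng)
mutant = mutate(child1, variants=[3] * 6, rng=rng)
```

With panmixia, any two chromosomes of the current generation may be paired, so parent selection reduces to drawing two individuals uniformly at random before calling `two_point_crossover`.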

4.3 SSSP Algorithm

The third routing table generation algorithm is presented in Algorithm 3 and is based on a single-source shortest path (SSSP) algorithm in a routing graph.

In our algorithm we aim to reduce the number of SSSP calls while preserving the desired balancedness quality. First, we count the number of paths and the path length from each network node to all other network nodes by the breadth-first search algorithm in the routing graph. We find that for the 49 2D, 339 3D and 1173 4D random torus topology systems, where each dimension size , on average 81.6%, 58.2% and 36.6% of source-destination node pairs, respectively, have a single minimum route under the Angara routing rules. We then apply these unique routes and adjust the edge weights.

Secondly, we sort and group source-destination node pairs in ascending order of the number of route turns in the torus topology and then in ascending order of route length. We also group node pairs by source node so that we can run BUILD_SSSP (Algorithm 4) from a source to a set of destination nodes. Then, for the sake of the best balancing quality, we apply only those built paths that do not cross each other, adjust the edge weights and repeat the BUILD_SSSP call for the set of remaining destination nodes until this set is empty.

Algorithm 4 builds single-source shortest paths from a source to a given set . An initial edge weight of routing graph is to force minimal shortest paths following T. Hoefler in [domke2011deadlock]. Only paths are considered, each , from this it follows that the shortest-path algorithm with the edge initialization never chooses a detour.

Fig. 4: Number of BUILD_SSSP calls within the SSSP routing algorithm.

We observe that this edge initialization makes it possible to consider the vertices of each breadth-first search level in arbitrary order. When the processing of a level is finished, the tentative distance of every vertex belonging to this level is settled. Therefore we actually use the breadth-first search algorithm with edge relaxation instead of the Dijkstra algorithm and thus reduce the complexity of the proposed algorithm to instead of . Also, the proposed algorithm stops as soon as all vertices are processed.
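A level-by-level search with relaxation can be sketched as below. It is a simplified stand-in for BUILD_SSSP, not a transcription of it: the graph is a plain adjacency dictionary, there is no destination set or early exit, and the edge weights are assumed to consist of a large common base plus small increments, so that a path with fewer hops is always cheaper.

```python
import math

def bfs_relax(adj, weight, src):
    """Breadth-first search by levels with edge relaxation.

    Assumes every weight is `base + small increment`, so the distance of a
    vertex is settled once the level that discovers it has been processed.
    """
    dist = {v: math.inf for v in adj}
    parent = {v: None for v in adj}
    dist[src] = 0
    frontier = [src]
    while frontier:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                nd = dist[u] + weight[(u, v)]
                if nd < dist[v]:
                    if dist[v] == math.inf:
                        next_frontier.append(v)  # first time v is reached
                    dist[v] = nd                 # relax within the level
                    parent[v] = u
        frontier = next_frontier
    return dist, parent

# Diamond graph: two 2-hop paths from 0 to 3 with different loads.
adj = {0: [1, 2], 1: [3], 2: [3], 3: []}
weight = {(0, 1): 103, (0, 2): 100, (1, 3): 100, (2, 3): 101}
dist, parent = bfs_relax(adj, weight, 0)
```

Vertex 3 is first reached through vertex 1 and then relaxed through the less loaded vertex 2 within the same level, which is exactly the behaviour that a priority queue would otherwise provide.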

The complexity of the SSSP routing algorithm is dominated by the second algorithm stage. Figure 4 presents the number of calls of the BUILD_SSSP algorithm. If the routing algorithm consists of only the second stage, called ’Sort and group’, then the number of BUILD_SSSP calls is between and . Adding the first stage, called ’Unique routes’, significantly reduces the number of BUILD_SSSP calls. The first stage has a complexity as calls of the breadth-first search algorithm. But still an upper-bound complexity of the SSSP routing algorithm is .

Input: – routing graph, – node set
      Output: Routing table for

1:procedure build_rt_sssp()
2:     /* 1) Unique routes */
3:     for all  do
4:         count a path number and a path length from .beginVertex() to end vertices of all other nodes by BFS
5:     end for
6:     for all . path_number == 1 do
7:         [, ] = buildBFSRoute(, )
8:     end for
9:     /* 2) Sort and group */
10:      = sort and group pairs by 1) ascending order of turn number in a route 2) ascending order of path length 3)
11:     for all  do
12:         while  do
13:               = build_sssp()
14:               routes corresponding to do not cross with each other
15:              applyRoutes(
16:                  [])
17:              increase edge weight by 1 along all
18:              
19:         end while
20:     end for
21:end procedure
Algorithm 3 Build routes by SSSP in a routing graph

Input: Routing graph , start vertex , destination node set
      Output: All shortest paths from to

1:procedure build_sssp(, , )
2:     /* build a SSSP tree from source to all targets */
3:     for all  do
4:         initialize (.distance = , .parent = )
5:     end for
6:      = ,
7:     while  do
8:          =
9:         for all vertex , all  do
10:              if .distance + .distance then
11:                  if .distance == then
12:                       
13:                  .distance = .distance +
14:                  .parent =
15:                   =
16:              end if
17:         end for
18:         if then break
19:         
20:     end while
21:end procedure
Algorithm 4 Build single source shortest paths by BFS with relaxation in a routing graph
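
The idea of a breadth-first search with edge relaxation can be illustrated with the following sketch. It is not the authors' implementation: the function names (build_sssp, path_to) and the dictionary-based graph representation are assumptions made for illustration, and the code relies on the assumption that every edge weight is a large base value plus a small load increment, so that a path with fewer hops is always cheaper than one with more hops.

```python
from math import inf


def build_sssp(adj, weight, source, targets):
    """Level-by-level BFS with edge relaxation (illustrative sketch).

    adj     -- dict: vertex -> list of successor vertices (every vertex is a key)
    weight  -- dict: (u, v) -> current edge weight
    targets -- vertices that must be reached before the search may stop
    """
    distance = {v: inf for v in adj}
    parent = {v: None for v in adj}
    distance[source] = 0
    frontier = [source]
    remaining = set(targets) - {source}
    while frontier and remaining:
        next_frontier = []
        for u in frontier:
            for v in adj[u]:
                candidate = distance[u] + weight[(u, v)]
                if candidate < distance[v]:
                    if distance[v] == inf:
                        next_frontier.append(v)  # v is reached for the first time
                    distance[v] = candidate
                    parent[v] = u
        # Under the weight assumption, the vertices first reached from this
        # level now hold their final distances.
        remaining.difference_update(next_frontier)
        frontier = next_frontier
    return distance, parent


def path_to(parent, source, target):
    """Walk parent pointers back from a reached target to the source."""
    path = [target]
    while path[-1] != source:
        path.append(parent[path[-1]])
    return path[::-1]
```

On a four-vertex ring in which one edge carries a slightly higher weight, the search picks the equally short route that avoids the loaded edge, which is the balancing effect the weight increments are meant to produce.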

5 Experimental Evaluation

We implemented the BFS, Genetic and SSSP algorithms in C as part of a SLURM plugin called Angara Node Selection Utility (ANSU). We compare the routing table generation algorithms by balancing quality and runtime on a set of synthetic systems, and benchmark performance for communication patterns and the NPB benchmarks on the Desmos cluster.

5.1 Balancedness and Runtime

Fig. 5: Comparison of the edge-forwarding index ratio with respect to the BFS algorithm (panels (a) 2D, (b) 3D, (c) 4D) and of the runtime of the routing algorithms (panels (d) 2D, (e) 3D, (f) 4D) on torus topologies.

We consider randomly generated torus topology systems: 49 2D, 339 3D and 1173 4D systems, each dimension size . When the size of a dimension is 2, the system topology along this dimension is a mesh; in all other cases it is a torus.
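
The mesh-versus-torus rule for a dimension can be made concrete with a small sketch. The function name neighbors and the coordinate-tuple representation are assumptions for illustration only; the sketch simply treats a dimension of size 2 (or less) as having no wraparound link.

```python
def neighbors(coord, dims):
    """Return the sorted neighbor coordinates of a node.

    A dimension of size greater than 2 wraps around (torus);
    a dimension of size 2 has no wraparound link (mesh).
    """
    result = set()
    for axis, k in enumerate(dims):
        for step in (-1, 1):
            c = coord[axis] + step
            if k > 2:
                c %= k                      # torus: wrap around the ring
            elif not 0 <= c < k:
                continue                    # mesh: step falls off the edge
            result.add(coord[:axis] + (c,) + coord[axis + 1:])
    return sorted(result)
```

For example, in a 2x3 system the node (0, 0) has one neighbor along the first dimension and two along the second, while every node of a 4x4x4x4 torus has eight neighbors.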

Figure 5 compares the quality and runtime of the presented routing algorithms. We divide the 2D, 3D and 4D torus topology systems into three groups by comparing (less, equal or greater) the edge-forwarding index of the BFS algorithm with that of the Genetic and SSSP algorithms. For each group we report the ratio of the total edge-forwarding index of the Genetic and SSSP algorithms to that of the BFS algorithm, together with the number of systems in the group (sys_num in Figure 5). The SSSP algorithm outperforms Genetic and BFS in most cases. In most cases, on average, SSSP is better than Genetic by 4% and 7% for 3D and 4D torus topologies, respectively, and better than BFS by 13%, 11% and 17% for 2D, 3D and 4D torus topologies, respectively (see Figures 5(b) and 5(c)).
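
As an illustration of the balance metric, the sketch below counts how many routes cross each directed edge and takes the maximum. It assumes the common definition of the edge-forwarding index as the maximum number of routes over any edge; the exact definition used in this paper is given in an earlier section, and the function names and the vertex-list route representation are hypothetical.

```python
from collections import Counter


def edge_loads(routes):
    """Count how many routes traverse each directed edge (u, v)."""
    loads = Counter()
    for route in routes:
        for u, v in zip(route, route[1:]):
            loads[(u, v)] += 1
    return loads


def edge_forwarding_index(routes):
    """Maximum number of routes sharing a single directed edge."""
    loads = edge_loads(routes)
    return max(loads.values(), default=0)
```

A routing table in which many routes share one edge yields a high value, whereas a well-balanced table keeps the value low.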

Figures 5(d), 5(e) and 5(f) present the algorithm runtime. The evaluation shows that the Genetic algorithm is too computationally expensive to be used in practice. Recall (Section 3.1) that the total number of vertices and edges in a routing graph depends substantially on the number of dimensions. We take 0.25 sec as the practical threshold for the algorithm runtime; under this threshold the SSSP algorithm can be used in practice for up to 150 nodes in a 3D torus topology, up to 90 nodes in a 4D torus topology, and for 2D torus topology systems. The BFS algorithm can be used in practice for up to 160 and 300 nodes in 3D and 4D torus topologies, respectively.

5.2 Real-World System

Cluster | Desmos
Chassis | SuperServer 1018GR-T
Processor | E5-1650v3 (6c, 3.0 GHz)
GPU | AMD Radeon Instinct MI50
Memory | DDR4 16 GB
Number of nodes | 32
Interconnect | Angara 4D torus
Operating system | SLES 11 SP4
Compiler | gcc 7.4.1
MPI | MPICH 3.3
TABLE I: The Desmos cluster configuration

We benchmarked and compared the routing table generation algorithms on the 32-node Desmos cluster at the Joint Institute for High Temperatures of the Russian Academy of Sciences [stegailov2019angara]. Table I provides the Desmos system configuration. All codes are compiled with gcc version 7.4.1 using -O2.

Fig. 6: Comparison of the routing algorithms for communication patterns on the Desmos cluster: (a) Transpose, (b) Neighbor, (c) Tornado, (d) Alltoall.

For the Desmos cluster we chose a node set for each number of nodes and provided SLURM with the same node set for each benchmark. Table II compares the BFS, Genetic and SSSP algorithms for each number of nodes. The maximum diameter of a routing table is the same for each number of nodes because the routing algorithms generate minimal paths. On 2 and 4 nodes all the algorithms obtain an optimal routing table, and . On 16 and 32 nodes the BFS algorithm is significantly worse than the Genetic and SSSP algorithms, while the Genetic algorithm is slightly better than the SSSP algorithm.

To evaluate the routing table generation algorithms without application-level impact we use four synthetic communication patterns: Transpose, Neighbor, Tornado and Alltoall [dally2004principles]. We implement the Transpose, Neighbor and Tornado patterns using the MPI library. For the Alltoall pattern we use the osu_alltoall benchmark from OSU Micro-Benchmarks version 5.6.2 [osumb].
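
For orientation, the destination functions of three of these patterns can be sketched as below, following their textbook definitions: Neighbor and Tornado per dimension of radix k, and Transpose on a b-bit node address. How the patterns are mapped onto MPI ranks on the Desmos cluster is not stated here, so the function names and this mapping are assumptions; the Transpose sketch additionally assumes an even number of address bits.

```python
from math import ceil


def neighbor_dest(x, k):
    """Neighbor: each node sends to the next position in the dimension."""
    return (x + 1) % k


def tornado_dest(x, k):
    """Tornado: each node sends almost halfway around the ring."""
    return (x + ceil(k / 2) - 1) % k


def transpose_dest(s, b):
    """Transpose: swap the upper and lower halves of a b-bit address (b even)."""
    half = b // 2
    mask = (1 << b) - 1
    return ((s >> half) | (s << half)) & mask
```

In the Alltoall pattern every node sends to every other node, so no destination function is needed.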

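Routing Algorithm | # nodes | | | |
BFS | 2 | 1 | 1 | 0 | 1
BFS | 4 | 2 | 2 | 0 | 2
BFS | 8 | 6 | 2 | 1.429 | 3
BFS | 16 | 18 | 3 | 4.341 | 4
BFS | 32 | 48 | 3 | 12.254 | 5
Genetic | 2 | 1 | 1 | 0 | 1
Genetic | 4 | 2 | 2 | 0 | 2
Genetic | 8 | 4 | 4 | 0 | 3
Genetic | 16 | 11 | 5 | 1.745 | 4
Genetic | 32 | 27 | 8 | 6.274 | 5
SSSP | 2 | 1 | 1 | 0 | 1
SSSP | 4 | 2 | 2 | 0 | 2
SSSP | 8 | 5 | 3 | 0.707 | 3
SSSP | 16 | 12 | 3 | 2.319 | 4
SSSP | 32 | 28 | 8 | 6.298 | 5
TABLE II: Comparison of routing table generation algorithms on the Desmos cluster.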
Benchmark | BFS Mops | BFS CoV | Genetic Mops | Genetic CoV | SSSP Mops | SSSP CoV | Improvement in % of SSSP to BFS
LU | 364353 | 0.38% | 367771 | 0.65% | 368548 | 0.50% | 101.15%
MG | 374789 | 0.59% | 376040 | 0.39% | 377562 | 0.51% | 100.74%
CG | 88426 | 0.57% | 90354 | 0.56% | 90443 | 0.59% | 102.28%
FT | 172567 | 0.54% | 181744 | 0.50% | 181911 | 0.55% | 105.41%
IS | 7771 | 0.29% | 8073 | 0.18% | 8181 | 0.13% | 105.28%
TABLE III: NPB performance results on 32 nodes of the Desmos cluster for different routing table generation algorithms.

Figure 6 compares the communication pattern results on the Desmos system. The message size is 1 MB, and we perform 5 runs for each benchmark and number of nodes. The running times are very stable: the coefficient of variation for a run does not exceed 0.71%. On 2, 4 and 8 nodes the results are approximately the same. The Genetic algorithm is worse than BFS by 3.9% on 32 nodes for the Transpose pattern (Figure 6(a)). In the other cases Genetic and SSSP are better than the BFS algorithm: on 32 nodes the improvement is up to 11.6% for the Neighbor pattern (Figure 6(b)), up to 52.9% for the Tornado pattern (Figure 6(c)) and 11.1% for the Alltoall pattern. For the Alltoall pattern (Figure 6(d)) Genetic outperforms the SSSP algorithm by 1.1% on 32 nodes; the routing table it obtains is slightly better balanced, which is characterized by and values, see Table II. The coefficient of variation for that case is 0.08% and 0.14% for the Genetic and SSSP algorithms, respectively.
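
The coefficient of variation quoted above is the standard deviation of the run times divided by their mean. A minimal sketch follows; whether the sample or the population standard deviation was used is not stated, so the sample form is an assumption, and the function name is hypothetical.

```python
from statistics import mean, stdev


def coefficient_of_variation(samples):
    """Return the coefficient of variation of the samples, in percent.

    Uses the sample standard deviation (an assumption).
    """
    return 100.0 * stdev(samples) / mean(samples)
```

With five run times per configuration, a value below 1% indicates that the runs differ from their mean by well under one percent on average.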

For the application benchmarks we use the MPI-based NAS Parallel Benchmarks suite [npb] version 3.3.1, class C, with 4 MPI processes per node. We use five parallel benchmarks, LU, MG, FT, CG and IS, which represent widely different computation-to-communication ratios. Table III presents the NPB performance results on 32 Desmos nodes and the relative performance improvement of the SSSP routing over BFS. Some benchmark results were unstable; for that reason we adjusted the number of internal NPB iterations for each benchmark and obtained fairly small coefficient of variation values.
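
The last column of Table III can be reproduced from the Mops values as the ratio of the SSSP result to the BFS result, expressed in percent. The short sketch below does this for the values copied from the table; the function name and the dictionary layout are illustrative only.

```python
# (BFS Mops, SSSP Mops) per benchmark, copied from Table III.
NPB_MOPS = {
    "LU": (364353, 368548),
    "MG": (374789, 377562),
    "CG": (88426, 90443),
    "FT": (172567, 181911),
    "IS": (7771, 8181),
}


def sssp_to_bfs_percent(bfs_mops, sssp_mops):
    """SSSP performance relative to BFS, in percent, rounded to two digits."""
    return round(100.0 * sssp_mops / bfs_mops, 2)


RATIOS = {name: sssp_to_bfs_percent(*mops) for name, mops in NPB_MOPS.items()}
```

A value of 102.28 for CG, for instance, corresponds to the 2.28% improvement of SSSP over BFS discussed for that benchmark.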

LU is a symmetric successive over-relaxation solver with a regular block matrix and exhibits a high ratio between computation and communication. The observable performance improvement of SSSP can be explained by the benchmark's sensitivity to small-message latency between neighbor nodes; recall the better SSSP routing table for the Neighbor pattern.

MG is a simplified multigrid method and a memory-intensive benchmark. It requires highly structured short- and long-distance communication, but the total volume of communication data is lower than for the LU benchmark. Therefore the performance difference between the considered routing algorithms is negligible.

CG is a conjugate gradient method with a large sparse symmetric positive-definite matrix. The communication pattern of CG is similar to that of MG, but the volume of communication data for CG is greater than for the MG benchmark [riesen2006communication]. As a result, the performance with SSSP routing exceeds that with BFS routing by 2.28%.

FT solves a 3-dimensional partial differential equation using fast Fourier transforms; IS is an integer sort. The FT and IS benchmarks use MPI collective operations, mainly all-to-all communication. The advantage of the SSSP routing for the Alltoall synthetic benchmark leads to more than 5% performance improvement for FT and IS on 32 nodes of the Desmos cluster.

6 Conclusion

We proposed a deadlock-free routing table generation algorithm based on a fast SSSP algorithm for the deterministic routing of the torus topology Angara interconnect with a single virtual channel. The algorithm uses the routing graph that we proposed in order to build routes in the torus topology of the Angara interconnect, whose routing places restrictions on the further route of a packet depending on its previous route steps. Compared with the BFS and Genetic routing algorithms, in most cases, on average, the edge-forwarding index of the proposed algorithm is better than Genetic by 4% and 7% for 3D and 4D torus topologies, respectively, and better than BFS by 13%, 11% and 17% for 2D, 3D and 4D torus topologies, respectively.

We evaluated the considered routing algorithms on the 32-node Desmos cluster system and measured a performance improvement of the proposed algorithm over BFS of 11.1% for the Alltoall communication pattern and of more than 5% for the FT and IS application kernels.

The proposed routing table generation algorithm can be used in practice for up to 150 nodes in 3D, up to 90 nodes in 4D, and for 2D torus topology systems. In future work, we will try to improve the runtime of the routing algorithm to support larger numbers of nodes while preserving the routing quality.

The research was carried out with a grant from the Russian Science Foundation (project No. 20-71-10127).

References