Large-scale matrix- or graph-based applications such as numerical simulations (Trobec and Kosec, 2015) or massive network analytics (Que et al., 2015) often run on parallel systems with distributed memory. The iterative nature of the underlying algorithms typically requires recurring communication between the processing elements (PEs). To optimize the running time of such an application, the computational load should be evenly distributed onto the PEs, while at the same time, the communication volume between them should be low. When mapping processes to PEs on non-uniform memory access (NUMA) systems, it should also be taken into account that the cost for communication operations depends on the locations of the PEs involved. In particular, one wants to map heavily communicating processes “close to each other”.
More formally, let the application be modeled by an application graph $G_a = (V_a, E_a, \omega_a)$ that needs to be distributed over the PEs. A vertex in $V_a$ represents a computational task of the application, while an edge $e_a = \{u_a, v_a\} \in E_a$ indicates data exchanges between tasks $u_a$ and $v_a$, with $\omega_a(e_a)$ specifying the amount of data to be exchanged. The topology of the parallel system is modeled by a processor graph $G_p = (V_p, E_p, \omega_p)$, where the edge weight $\omega_p(\{u_p, v_p\})$ indicates the cost of exchanging one data unit between the PEs $u_p$ and $v_p$ (Deveci et al., 2015; Glantz et al., 2015). A balanced distribution of processes onto PEs thus corresponds to a mapping $\mu: V_a \to V_p$ such that, for some small $\varepsilon \geq 0$,
$$|\mu^{-1}(v_p)| \leq (1 + \varepsilon) \cdot \left\lceil \frac{|V_a|}{|V_p|} \right\rceil \qquad (1)$$
for all $v_p \in V_p$. Therefore, $\mu$ induces a balanced partition of $G_a$ with blocks $\mu^{-1}(v_p)$, $v_p \in V_p$. Conversely, one can find a mapping $\mu: V_a \to V_p$ that fulfills Eq. (1) by first partitioning $V_a$ into $|V_p|$ balanced parts (Bichot and Siarry, 2011) and then specifying a one-to-one mapping from the blocks of the partition onto the vertices of $G_p$. For an accurate modeling of communication costs, one would also have to specify the path(s) over which two vertices communicate in $G_p$. This is, however, system-specific. For the purpose of generic algorithms, it is thus often abstracted away by assuming routing on shortest paths in $G_p$ (Hoefler and Snir, 2011). In this work, we make the same assumption.
In order to steer an optimization process for obtaining good mappings, different objective functions have been proposed (Hoefler and Snir, 2011; Kim and Lai, 1991; Deveci et al., 2015). One of the most commonly used (Pellegrini, 2011) objective functions is $\mathrm{Coco}(\cdot)$ (also referred to as hop-byte in (Yu et al., 2006)), cf. Eq. (3):
$$\mathrm{Coco}(\mu) := \sum_{e_a = \{u_a, v_a\} \in E_a} \omega_a(e_a) \, d_{G_p}(\mu(u_a), \mu(v_a)), \qquad (3)$$
where $d_{G_p}(\mu(u_a), \mu(v_a))$ denotes the distance between $\mu(u_a)$ and $\mu(v_a)$ in $G_p$, i. e. the number of edges on shortest paths. Broadly speaking, $\mathrm{Coco}(\cdot)$ is minimised when pairs of highly communicating processes are placed on nearby processors. Finding a balanced mapping $\mu^*$ that minimizes $\mathrm{Coco}(\cdot)$ is $\mathcal{NP}$-hard. Indeed, finding $\mu^*$ for a complete graph $G_p$ amounts to graph partitioning, which is an $\mathcal{NP}$-hard problem (Garey and Johnson, 1979).
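To make the objective concrete, the following minimal sketch evaluates the weighted sum of hop distances for two mappings of a toy instance. The graphs, the task names and the helper names `bfs_distances` and `coco` are hypothetical and only serve as illustration; distances are plain hop counts, in line with the shortest-path assumption made above.

```python
from collections import deque

def bfs_distances(adj, source):
    """Hop distances from `source` in an unweighted graph (adjacency lists)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def coco(app_edges, mapping, proc_adj):
    """Sum over application edges of weight times hop distance of the mapped PEs."""
    dist = {p: bfs_distances(proc_adj, p) for p in proc_adj}
    return sum(w * dist[mapping[u]][mapping[v]] for u, v, w in app_edges)

# Hypothetical toy instance: four tasks on a path of four PEs (0 - 1 - 2 - 3).
proc_adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
app_edges = [("a", "b", 5), ("b", "c", 1), ("c", "d", 5), ("a", "d", 1)]

good = {"a": 0, "b": 1, "c": 2, "d": 3}  # heavy edges span one hop
bad = {"a": 0, "b": 3, "c": 1, "d": 2}   # heavy edge a-b spans three hops

print(coco(app_edges, good, proc_adj))  # 14
print(coco(app_edges, bad, proc_adj))   # 24
```

Both mappings are perfectly balanced (one task per PE); they differ only in how far the heavily communicating pairs end up from each other.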
Related work and motivation
One way of looking at previous algorithmic work from a high-level perspective (more details can be found in the overview articles by Pellegrini (Pellegrini, 2011), Buluç et al. (Buluç et al., 2016), and Aubanel (Aubanel, 2009)) is to group it into two categories. One line has tried to couple mapping with graph partitioning. To this end, the objective function for partitioning, usually the edge cut, is replaced by an objective function like $\mathrm{Coco}(\cdot)$ that considers distances between PEs (Walshaw and Cross, 2001). To avoid recomputing these distances, Walshaw and Cross (Walshaw and Cross, 2001) store them in a network cost matrix (NCM). When our proposed method TiMEr (Topology-induced Mapping Enhancer) is used to enhance a mapping, it does so without an NCM, thus avoiding its quadratic space complexity.
The second line of research decouples partitioning and mapping. First, $G_a$ is partitioned into $|V_p|$ blocks without taking $G_p$ into account. Typically, this step involves multilevel approaches whose hierarchies on $G_a$ are built with edge contraction (Karypis and Kumar, 1998; Osipov and Sanders, 2010; Sanders and Schulz, 2013a) or weighted aggregation (Chevalier and Safro, 2009; Meyerhenke et al., 2009). Contraction of the blocks of $G_a$ into single vertices yields the communication graph $G_c = (V_c, E_c, \omega_c)$, where $\omega_c(\{u_c, v_c\})$, $u_c, v_c \in V_c$, aggregates the weight of edges in $G_a$ with end vertices in different blocks. For an example see Figures (a) and (b). Finally, one computes a bijection $\nu: V_c \to V_p$ that minimizes $\mathrm{Coco}(\cdot)$ or a related function, using (for example) greedy methods (Hoefler and Snir, 2011; Brandfass et al., 2013; Glantz et al., 2015; Deveci et al., 2015) or metaheuristics (Brandfass et al., 2013; Ucar et al., 2006), see Figure (c). When TiMEr is used to enhance such a mapping, it modifies not only $\nu$, but also affects the partition of $V_a$ (and thus possibly $G_c$). Hence, deficiencies due to the decoupling of partitioning and mapping can be compensated. Note that, since TiMEr is proposed as an improvement on mappings derived from state-of-the-art methods, we assume that an initial mapping is provided. This is no limitation: a mapping can be easily computed from the solution of a graph partitioner using the identity mapping from $V_c$ to $V_p$.
Being man-made and cost-effective, the topologies of parallel computers exhibit special properties that can be used to find good mappings. A widely used property is that PEs are often “hierarchically organized into, e. g., islands, racks, nodes, processors, cores with corresponding communication links of similar quality” (Schulz and Träff, 2017). Thus, dual recursive bisection (DRB) (Pellegrini, 1994), the recent embedded sectioning (Kirmani et al., 2017), and what can be called recursive multisection (Chan et al., 2012; Jeannot et al., 2013; Schulz and Träff, 2017) suggest themselves for mapping: DRB and embedded sectioning cut both $G_a$ (or $G_c$) and $G_p$ into two blocks recursively and assign the respective blocks to each other in a top-down manner. Recursive multisection as performed by (Schulz and Träff, 2017) models the hierarchy inherent in $G_p$ as a tree. The tree’s fan-out then determines into how many blocks $G_a$ needs to be partitioned. Again, TiMEr is complementary in that we do not need an actual hierarchy of the topology.
Overview of our method.
To the best of our knowledge, TiMEr is the first method to exploit the fact that many parallel topologies are partial cubes. A partial cube is an isometric subgraph of a hypercube, see Section 2 for a formal definition. This graph class includes all rectangular and cubic meshes, any such torus with even extensions in each dimension, all hypercubes, and all trees.
$G_p$ being a partial cube allows us to label its vertices with bitvectors such that the distance between two vertices in $G_p$ amounts to the Hamming distance (number of different bits) between the vertices’ labels (see Sections 2 and 3). This will allow us to compute the contribution of an edge $e_a \in E_a$ to $\mathrm{Coco}(\cdot)$ quickly. The labels are then extended to labels for vertices in $G_a$ (Section 4). More specifically, a label of a vertex $v_a \in V_a$ consists of a left and a right part, where the left part encodes the mapping $\mu: V_a \to V_p$ and the right part makes the labels unique. Note that it can be instructive to view the vertex labels of $V_a$, $l_a(\cdot)$, as a recursive bipartitioning of $V_a$: the $i$-th leftmost bit of $l_a(v_a)$ defines whether $v_a$ belongs to one or the other side of the corresponding cut in the $i$-th recursion level. Vertices connected by an edge with high weight should then be assigned labels that share many of the bits in the left part.
We extend the objective function $\mathrm{Coco}(\cdot)$ by a factor that accounts for the labels’ right part and thus for the uniqueness of the label. We do so without limiting the original objective that improves the labels’ left part (and thus $\mu$) (Section 5). Finally, we use the labeling of $V_a$ to devise a multi-hierarchical search method in which label swaps improve $\mu$. One can view this as a solution method for finding a $\mathrm{Coco}$-optimal numbering of the vertices in $V_a$, where the numbering is determined by the labels.
Compared to simple hill-climbing methods, we increase the local search space by employing very diverse hierarchies. These hierarchies result from random permutations of the label entries; they are imposed on the vertices of $G_a$ and built by label contractions performed digit by digit. This approach provides a fine-to-coarse view during the optimization process (Section 6) that is orthogonal to typical multilevel graph partitioning methods.
As the underlying application for our experiments, we assume parallel complex network analysis on systems with distributed memory. Thus, in Section 7, we apply TiMEr to further improve mappings of complex networks onto partial cubes with 256 and 512 nodes. The initial mappings are computed using KaHIP (Sanders and Schulz, 2013b), Scotch (Pellegrini, 2007) and LibTopoMap (Hoefler and Snir, 2011). To evaluate the quality of TiMEr's results, we use $\mathrm{Coco}(\cdot)$, for which we observe a relative improvement from 6% up to 34%, depending on the choice of the initial mapping. TiMEr is on average 42% faster than partitioning the graphs with KaHIP.
2. Partial cubes and their hierarchies
The graph-theoretical concept of a partial cube is central to this paper. Its definition is based on the hypercube.
Definition 2.1 (Hypercube).
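A $d$-dimensional hypercube $H$ is the graph $H = (V, E)$ with $V := \{0, 1\}^d$ and $E := \{\{u, v\} \mid u, v \in V \text{ and } d_h(u, v) = 1\}$, where $d_h(u, v)$ denotes the Hamming distance (number of different bits) between $u$ and $v$.
More generally, $d_H(u, v) = d_h(u, v)$ for all $u, v \in V$, i. e., the (unweighted) shortest path length equals the Hamming distance in a hypercube.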
Partial cubes are isometric subgraphs of hypercubes, i. e., the distance between any two nodes in a partial cube is the same as their distance in the hypercube. Put differently:
Definition 2.2 (Partial cube).
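A graph $G_p$ with vertex set $V_p$ is a partial cube of dimension $d$ if (i) there exists a labeling $l_p: V_p \to \{0, 1\}^d$ such that $d_{G_p}(u_p, v_p) = d_h(l_p(u_p), l_p(v_p))$ for all $u_p, v_p \in V_p$ and (ii) $d$ is as small as possible.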
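As an illustration of condition (i), the sketch below labels a rectangular grid and checks that Hamming distances between labels equal hop distances. The "thermometer" encoding (position $i$ on a path becomes $i$ ones followed by zeros) and all function names are assumptions made for this example; they are one possible way to obtain such a labeling, not necessarily the one used later in the paper.

```python
from collections import deque
from itertools import product

def hamming(a, b):
    """Number of positions at which two equally long bitstrings differ."""
    return sum(x != y for x, y in zip(a, b))

def thermometer(i, n):
    """Label for position i on a path with n vertices: i ones, then zeros."""
    return "1" * i + "0" * (n - 1 - i)

def grid_adjacency(rows, cols):
    """Adjacency lists of a rows x cols grid with vertices (r, c)."""
    steps = ((1, 0), (-1, 0), (0, 1), (0, -1))
    return {(r, c): [(r + dr, c + dc) for dr, dc in steps
                     if 0 <= r + dr < rows and 0 <= c + dc < cols]
            for r, c in product(range(rows), range(cols))}

def grid_labels(rows, cols):
    """Concatenate the thermometer codes of the row and the column index."""
    return {(r, c): thermometer(r, rows) + thermometer(c, cols)
            for r, c in product(range(rows), range(cols))}

def hop_distances(adj, source):
    """Breadth-first search distances from `source`."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def is_isometric(adj, labels):
    """True if label Hamming distance equals graph distance for all pairs."""
    for u in adj:
        dist = hop_distances(adj, u)
        if any(dist[v] != hamming(labels[u], labels[v]) for v in adj):
            return False
    return True

print(is_isometric(grid_adjacency(3, 4), grid_labels(3, 4)))  # True

# Counterexample: plain binary numbering of a path is not isometric,
# since neighbors 1 ("01") and 2 ("10") differ in two bits.
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
binary = {i: format(i, "02b") for i in path}
print(is_isometric(path, binary))  # False
```

The grid labels have $(3 - 1) + (4 - 1) = 5$ bits, one per row or column boundary that a shortest path may cross.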
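The labeling $l_p$ gives rise to hierarchies on $V_p$ as follows. For any permutation $\pi$ of $\{1, \dots, d\}$, one can group the vertices in $V_p$ by means of the equivalence relations $\sim_{\pi, i}$, where $1 \leq i \leq d$:
$$u_p \sim_{\pi, i} v_p \;:\Leftrightarrow\; l_p(u_p)[\pi(j)] = l_p(v_p)[\pi(j)]$$
for all $1 \leq j \leq i$ (where $l[j]$ refers to the $j$-th character in string $l$). As an example, $\sim_{id, i}$, where $id$ is the identity on $\{1, \dots, d\}$, gives rise to the partition in which $u_p$ and $v_p$ belong to the same part if and only if their labels agree at the first $i$ positions. More generally, for each permutation $\pi$ from the set of all permutations on $\{1, \dots, d\}$, the equivalence relations $\sim_{\pi, i}$, $i = d, \dots, 1$, give rise to a hierarchy of increasingly coarse partitions. As an example, the hierarchies defined by the permutations $id$ and $\pi(j) = d + 1 - j$, $1 \leq j \leq d$, respectively, are opposite hierarchies, see Figure 5.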
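The grouping by label positions can be sketched as follows. This is a minimal illustration with hypothetical names (`partition`, `hierarchy`); positions are 0-based here, and a permutation is given as a tuple of positions.

```python
from collections import defaultdict

def partition(labels, perm, i):
    """Group vertices whose labels agree at positions perm[0], ..., perm[i-1]."""
    groups = defaultdict(list)
    for vertex, label in labels.items():
        key = tuple(label[p] for p in perm[:i])
        groups[key].append(vertex)
    return sorted(sorted(group) for group in groups.values())

def hierarchy(labels, perm):
    """Partitions from fine (all positions) to coarse (one position)."""
    return [partition(labels, perm, i) for i in range(len(perm), 0, -1)]

# Labels of a 2-dimensional hypercube (a 4-cycle).
labels = {"a": "00", "b": "01", "c": "10", "d": "11"}

print(hierarchy(labels, (0, 1)))
# [[['a'], ['b'], ['c'], ['d']], [['a', 'b'], ['c', 'd']]]
print(hierarchy(labels, (1, 0)))
# [[['a'], ['b'], ['c'], ['d']], [['a', 'c'], ['b', 'd']]]
```

The two permutations yield the same finest partition but cut the 4-cycle along different directions on the coarser level, which is the kind of diversity the hierarchies are meant to provide.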
3. Vertex labels of processor graph
In this section we provide a way to recognize whether a graph $G_p$ is a partial cube and, if so, we describe how a labeling $l_p$ on $V_p$ can be obtained. To this end, we characterize partial cubes in terms of (cut-sets of) convex cuts.
Definition 3.1 (Convex cut).
Let $G_p = (V_p, E_p)$ be a graph, and let $(V_p^0, V_p^1)$ be a cut of $G_p$. The cut is called convex if no shortest path from a vertex in $V_p^0$ [$V_p^1$] to another vertex in $V_p^0$ [$V_p^1$] contains a vertex of $V_p^1$ [$V_p^0$].
The cut-set of a cut $(V_p^0, V_p^1)$ of $G_p$ consists of all edges in $E_p$ with one end vertex in $V_p^0$ and the other one in $V_p^1$. Given the above definitions, $G_p$ is a partial cube if and only if (i) $G_p$ is bipartite and (ii) the cut-sets of $G_p$’s convex cuts partition $E_p$ (Ovchinnikov, 2008). In this case, the equivalence relation behind the partition is the Djoković relation $\theta$ (Ovchinnikov, 2008). Let $e_p = \{x, y\} \in E_p$. An edge $f$ is Djoković related to $e_p$ if one of $f$’s end vertices is closer to $x$ than to $y$, while the other end vertex of $f$ is closer to $y$ than to $x$. Formally,
$$e_p \, \theta \, f \;:\Leftrightarrow\; |f \cap W_{x,y}| = |f \cap W_{y,x}| = 1, \quad \text{where } W_{x,y} := \{w \in V_p \mid d_{G_p}(w, x) < d_{G_p}(w, y)\}.$$
Consequently, no pair of edges on any shortest path of $G_p$ is Djoković related. In the following, we use the above definitions to find out whether $G_p$ is a partial cube and, if so, to compute a labeling $l_p$ for $G_p$ according to Definition 2.2:
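1. Test whether $G_p$ is bipartite (in asymptotic time $O(|V_p| + |E_p|)$). If $G_p$ is not bipartite, it is not a partial cube.
2. Pick an arbitrary edge $e_1$ and compute the edge set $E_p(e_1)$ of all edges that are Djoković related to $e_1$.
3. Keep picking edges $e_j$ that are not contained in an edge set computed so far and compute the edge sets $E_p(e_j)$, $j \geq 2$. If there is an overlap with a previous edge set, $G_p$ is not a partial cube.
4. While calculating $E_p(e_j)$ where $e_j = \{x_j, y_j\}$, set the $j$-th entry of the label of every vertex $u \in V_p$: $l_p(u)[j] := 0$ if $u$ is closer to $x_j$ than to $y_j$, and $l_p(u)[j] := 1$ otherwise.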
For an example of such vertex labels see Figure 9a.
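The steps above can be sketched as follows. This is a minimal, unoptimized illustration that assumes a connected graph with comparable vertex identifiers (here integers) and precomputes all pairwise hop distances; the function names are hypothetical.

```python
from collections import deque

def all_pairs_hops(adj):
    """Hop distances between all vertex pairs via one BFS per vertex."""
    dist = {}
    for source in adj:
        d = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in d:
                    d[v] = d[u] + 1
                    queue.append(v)
        dist[source] = d
    return dist

def partial_cube_labels(adj):
    """Return {vertex: bitstring}, or None if a test of the steps above fails."""
    dist = all_pairs_hops(adj)
    edges = [(u, v) for u in adj for v in adj[u] if u < v]
    # Step 1: a connected graph is bipartite iff no edge joins two vertices
    # whose distances to a fixed root have the same parity.
    root = next(iter(adj))
    if any(dist[root][u] % 2 == dist[root][v] % 2 for u, v in edges):
        return None
    labels = {v: "" for v in adj}
    classified = set()
    for x, y in edges:
        if (x, y) in classified:
            continue  # Steps 2 and 3: only pick edges not covered so far
        related = {(a, b) for a, b in edges
                   if (dist[a][x] < dist[a][y]) != (dist[b][x] < dist[b][y])}
        if related & classified:
            return None  # overlap with an earlier edge set
        classified |= related
        for u in adj:  # Step 4: one new label entry per edge set
            labels[u] += "0" if dist[u][x] < dist[u][y] else "1"
    return labels

cycle4 = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [0, 2]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
k23 = {0: [2, 3, 4], 1: [2, 3, 4], 2: [0, 1], 3: [0, 1], 4: [0, 1]}

print(partial_cube_labels(cycle4))    # {0: '00', 1: '10', 2: '11', 3: '01'}
print(partial_cube_labels(triangle))  # None (not bipartite)
print(partial_cube_labels(k23))       # None (edge sets overlap)
```

In a bipartite graph no vertex is equidistant from the two end vertices of an edge, so the strict comparisons in the sketch never hit a tie.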
Assuming that all distances between node pairs are computed beforehand, calculating for and setting the labeling (in steps , and ) take time. Detecting overlaps with already calculated edge sets takes time (in step ), which summarizes the time complexity of the proposed method.
Note that the problem has to be solved only once for a parallel computer. Since $|E_p| = O(|V_p|)$ for 2D/3D grids and tori (the processor graphs of highest interest in this paper) and $|E_p| = O(|V_p| \log |V_p|)$ for all partial cubes, our simple method is (almost) as fast, asymptotically, as the fastest and rather involved methods that solve the problem. Indeed, the fastest method to solve the problem takes $O(|V_p| |E_p|)$ time (Imrich, 1993). Assuming that integers of at least $\log_2 |V_p|$ bits can be stored in a single machine word and that addition, bitwise Boolean operations, comparisons and table look-ups can be performed on these words in constant time, the asymptotic time is reduced to $O(|V_p|^2)$ (Eppstein, 2011).
4. Vertex labels of application graph
Given an application graph $G_a = (V_a, E_a, \omega_a)$, a processor graph $G_p = (V_p, E_p)$ that is a partial cube and a mapping $\mu: V_a \to V_p$, the aim of this section is to define a labeling $l_a$ of $V_a$. This labeling is then used later to improve $\mu$ w. r. t. Eq. (3) by swapping labels between vertices of $G_a$. In particular, for any $u_a, v_a \in V_a$, the effect of their label swap on $\mathrm{Coco}(\cdot)$ should be a function of their labels and those of their neighbors in $G_a$. It turns out that being able to access the vertices of $G_a$ by their (unique) labels is crucial for the efficiency of our method. The following requirements on $l_a$ meet our needs.
1. $l_a$ encodes $\mu$.
2. For any two vertices $u_a, v_a \in V_a$, we can derive the distance between $\mu(u_a)$ and $\mu(v_a)$ from $l_a(u_a)$ and $l_a(v_a)$. Thus, for any edge $e_a$ of $G_a$, we can find out how many hops it takes in $G_p$ for its end vertices to exchange information, see Figure (c).
3. The labels are unique on $V_a$.
To compute such $l_a$, we first transport the labeling $l_p$ from $V_p$ to $V_a$ through $l_p(\mu(v_a))$ for all $v_a \in V_a$. This new labeling already fulfills items 1) and 2). Indeed, item 1) holds, since labels are unique on $V_p$; item 2) holds, because $G_p$ is a partial cube. To fulfill item 3), we extend the labeling $l_p \circ \mu$ to a labeling $l_a: V_a \to \{0, 1\}^{\dim(G_a)}$, where the yet undefined $\dim(G_a)$ should exceed $\dim(G_p)$ only by the smallest amount necessary to ensure that $l_a(u_a) \neq l_a(v_a)$ whenever $u_a \neq v_a$. The gap $\dim(G_a) - \dim(G_p)$ depends on the size of the largest part in the partition induced by $\mu$.
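Definition 4.1 ($\dim(G_a)$).
Let $\mu: V_a \to V_p$ be a mapping. We set:
$$\dim(G_a) := \dim(G_p) + \left\lceil \log_2 \left( \max_{v_p \in V_p} |\mu^{-1}(v_p)| \right) \right\rceil.$$
For any $v_a \in V_a$, $l_a(v_a)$ is a bitvector of length $\dim(G_a)$; its first $\dim(G_p)$ entries coincide with $l_p(\mu(v_a))$, see above, and its last $\dim(G_a) - \dim(G_p)$ entries serve to make the labeling unique. We denote the bitvector formed by the last $\dim(G_a) - \dim(G_p)$ entries by $l_e(v_a)$. Here, the subscript $e$ in $l_e$ stands for “extension”. To summarize,
$$l_a(v_a) = l_p(\mu(v_a)) \circ l_e(v_a) \quad \text{for all } v_a \in V_a,$$
where $\circ$ stands for concatenation. Except for temporary permutations of the labels’ entries, the set of labels $\{l_a(v_a) \mid v_a \in V_a\}$ will remain the same. A label swap between $u_a$ and $v_a$ alters $\mu$ if and only if $l_p(\mu(u_a)) \neq l_p(\mu(v_a))$. The balance of the partition of $V_a$, as induced by $\mu$, is preserved by swapping the labels of $V_a$.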
Computing $l_e$ is straightforward. First, the vertices in each $\mu^{-1}(v_p)$, $v_p \in V_p$, are numbered from $0$ to $|\mu^{-1}(v_p)| - 1$. Second, these decimal numbers are then interpreted as bitvectors/binary numbers. Finally, the entries of the corresponding bitvectors are shuffled, so as to provide a good random starting point for the improvement of $\mu$, see Lemma 5.1 in Section 5.
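The construction of the extended labels can be sketched as follows. The function name `extend_labels` and the toy data are hypothetical. The sketch also fixes one reading of the shuffling step as an assumption: within each block, a single random permutation of the extension positions is applied to all vertices of that block, which keeps the labels unique.

```python
import random
from collections import defaultdict

def extend_labels(mu, proc_labels, seed=0):
    """Processor label of mu(v) followed by a block-local extension."""
    blocks = defaultdict(list)
    for vertex, pe in mu.items():
        blocks[pe].append(vertex)
    largest = max(len(block) for block in blocks.values())
    ext_len = (largest - 1).bit_length()  # equals ceil(log2(largest))
    rng = random.Random(seed)
    labels = {}
    for pe, block in blocks.items():
        positions = list(range(ext_len))
        rng.shuffle(positions)  # assumption: one permutation per block
        for number, vertex in enumerate(block):
            bits = format(number, "b").zfill(ext_len) if ext_len else ""
            extension = "".join(bits[p] for p in positions)
            labels[vertex] = proc_labels[pe] + extension
    return labels

# Hypothetical instance: five tasks on two PEs labeled "0" and "1".
mu = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1}
proc_labels = {0: "0", 1: "1"}
labels = extend_labels(mu, proc_labels)
print(labels)
```

With a largest block of three vertices, two extension bits are needed, so every label has $1 + 2 = 3$ bits; the first bit alone determines the PE.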
5. Extension of the objective function
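Given the labeling of vertices in $V_a$, i. e., $l_a = l_p \circ l_e$ (where we write $l_p(v_a)$ as a shorthand for $l_p(\mu(v_a))$), it is easy to see that $l_p$ solely determines the value of $\mathrm{Coco}(\cdot)$. (This fact results from $l_p$ encoding the distances between vertices in $V_p$.) On the other hand, due to the uniqueness of the labels $l_a$, $l_e$ restricts $l_p$: $l_e(u_a) = l_e(v_a)$ implies $l_p(u_a) \neq l_p(v_a)$, i. e., that $u_a$ and $v_a$ are mapped to different PEs.
The plan of the following is to ease this restriction by avoiding as many cases of $l_e(u_a) = l_e(v_a)$ as possible. To this end, we incorporate $l_e$ into the objective function, thus replacing $\mathrm{Coco}(\cdot)$ by a modified one. Observe that $l_p$ and $l_e$ give rise to two disjoint subsets of $E_a$, i. e. subsets of edges, the two end vertices of which agree on $l_p$ and $l_e$, respectively:
$$E_a^p := \{\{u_a, v_a\} \in E_a \mid l_p(u_a) = l_p(v_a)\}, \qquad E_a^e := \{\{u_a, v_a\} \in E_a \mid l_e(u_a) = l_e(v_a)\}.$$
In general these two sets do not form a partition of $E_a$, since there can be edges whose end vertices disagree both on $l_p$ and on $l_e$. With $d_h(\cdot, \cdot)$ denoting the Hamming distance, optimizing $\mathrm{Coco}(\cdot)$ in Eq. (3) can be rewritten as follows. Find
$$l_p^* := \operatorname*{argmin}_{l_p} \sum_{\{u_a, v_a\} \in E_a \setminus E_a^p} \omega_a(\{u_a, v_a\}) \, d_h(l_p(u_a), l_p(v_a)). \qquad (9)$$
For all $\{u_a, v_a\} \in E_a^e$ it follows that $l_p(u_a) \neq l_p(v_a)$, implying that $u_a$ and $v_a$ are mapped to different PEs. Thus, any edge $e_a = \{u_a, v_a\} \in E_a^e$ increases the value of $\mathrm{Coco}(\cdot)$ with a damage of
$$\omega_a(e_a) \, d_h(l_p(u_a), l_p(v_a)) > 0. \qquad (10)$$
This suggests that reducing $E_a^e$ may be good for minimizing $\mathrm{Coco}(\cdot)$. The crucial question here is whether reducing $E_a^e$ can obstruct our primary goal, i. e., growing $E_a^p$ (see Eq. (9)). Lemma 5.1 below shows that this is not the case, at least for perfectly balanced $\mu$. A less technical version of Lemma 5.1 is the following: If $\mu$ is perfectly balanced, then two mappings, where one has bigger $E_a^p$ and one has smaller $E_a^e$ (both w. r. t. set inclusion), can be combined to a third mapping that has the bigger $E_a^p$ and the smaller $E_a^e$. The third mapping provides better prospects for minimizing $\mathrm{Coco}(\cdot)$ than the mapping from which it inherited the big $E_a^p$, as, due to the smaller $E_a^e$, the third mapping is less constrained by the uniqueness requirement:
Lemma 5.1 (Reducing $E_a^e$ compatible with growing $E_a^p$).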
Let be such that for all (perfect balance). Furthermore, let and be bijective labelings that correspond to as specified in Section 4. Then, and implies that there exists that also corresponds to with (i) and (ii) .
Proof. Set . Then, fulfills (i) and (ii) in the claim, and the first entries of specify . It remains to show that the labeling is unique on . This is a consequence of (a) being unique and (b) . ∎
In practice we usually do not have perfect balance. Yet, the imbalance is typically low, e. g. 3%. Thus, we still expect that having a small set of edges whose end vertices share the same label extension is beneficial for minimizing $\mathrm{Coco}(\cdot)$.
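Minimization of the damage to $\mathrm{Coco}(\cdot)$ from edges whose end vertices share the same label extension, see Eq. (10), amounts to maximizing the diversity of the label extensions in $G_a$. Formally, in order to diversify, we want to find
$$l_e^* := \operatorname*{argmax}_{l_e} \mathrm{Div}(l_e), \quad \text{where } \mathrm{Div}(l_e) := \sum_{\{u_a, v_a\} \in E_a} \omega_a(\{u_a, v_a\}) \, d_h(l_e(u_a), l_e(v_a)).$$
We combine our two objectives, i. e., minimization of $\mathrm{Coco}(\cdot)$ and maximization of $\mathrm{Div}(\cdot)$, with the objective function $\mathrm{Coco}^+(\cdot)$:
$$\mathrm{Coco}^+(l_a) := \mathrm{Coco}(l_p) - \mathrm{Div}(l_e).$$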
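With labels stored as integers, both terms reduce to bit counts of XOR-ed labels. The following sketch evaluates the communication term on the left label part, a diversity term on the right part, and their difference. Treating the combined objective as the plain difference of the two terms is an assumption made for illustration, as are the function name `objectives` and the toy labels.

```python
def objectives(app_edges, labels, ext_len):
    """Return (coco, div, coco - div) for integer labels.

    The ext_len low-order bits of a label form the extension part,
    the remaining high-order bits form the processor part.
    """
    mask = (1 << ext_len) - 1
    coco = div = 0
    for u, v, weight in app_edges:
        diff = labels[u] ^ labels[v]  # bits in which the labels differ
        coco += weight * bin(diff >> ext_len).count("1")
        div += weight * bin(diff & mask).count("1")
    return coco, div, coco - div

# Hypothetical labels: two processor bits followed by one extension bit.
labels = {"u": 0b000, "v": 0b001, "w": 0b110}
app_edges = [("u", "v", 3), ("v", "w", 2)]

print(objectives(app_edges, labels, ext_len=1))  # (4, 5, -1)
```

Edge u-v stays on one PE and only contributes to the diversity term; edge v-w crosses two hypercube dimensions and differs in the extension bit, so it contributes to both terms.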
6. Multi-hierarchical label swapping
After formulating the mapping problem as finding an optimal labeling for the vertices in $V_a$, we can now turn our attention to how to find such a labeling, or at least a very good one. Our algorithm is meant to improve mappings and resembles a key ingredient of the classical graph partitioning algorithm by Kernighan and Lin (KL) (Kernighan and Lin, 1970): vertices of $V_a$ swap their membership to certain subsets of vertices. Our strategy differs from KL in that we rely on a rather simple local search which is, however, performed on multiple (and very diverse) random hierarchies on $V_a$. These hierarchies are oblivious to $G_a$’s edges and correspond to recursive bipartitions of $V_a$, which, in turn, are extensions of natural recursive bipartitions of $V_p$.
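The basic move, a label swap between two vertices, only affects edges incident to these two vertices. The sketch below computes the resulting change of the communication term alone; it is a simplified illustration with hypothetical names and data, not the gain function used in Algorithm 1, which refers to the extended objective and to coarser hierarchy levels.

```python
def hamming(a, b):
    """Number of differing bits of two integer labels."""
    return bin(a ^ b).count("1")

def swap_gain(u, v, labels, adj, ext_len):
    """Reduction of the communication term if u and v exchange labels.

    adj maps each vertex to {neighbor: edge weight}; only the processor
    part of a label (all but the ext_len low-order bits) is compared.
    """
    swapped = {u: labels[v], v: labels[u]}

    def label(x, after):
        return swapped[x] if after and x in swapped else labels[x]

    def local_cost(after):
        total = 0
        for x in (u, v):
            for y, weight in adj[x].items():
                if x == v and y == u:
                    continue  # edge {u, v} was already counted from u
                total += weight * hamming(label(x, after) >> ext_len,
                                          label(y, after) >> ext_len)
        return total

    return local_cost(False) - local_cost(True)

# Hypothetical instance: one processor bit, one extension bit.
labels = {"a": 0b00, "b": 0b01, "c": 0b10, "d": 0b11}
adj = {"a": {"b": 1, "c": 5}, "b": {"a": 1},
       "c": {"a": 5, "d": 1}, "d": {"c": 1}}

print(swap_gain("b", "c", labels, adj, ext_len=1))  # 3
print(swap_gain("a", "b", labels, adj, ext_len=1))  # 0
```

Swapping b and c moves c next to its heavy neighbor a, which lowers the cost from 5 to 2. Swapping a and b changes nothing, since both share the same PE; in either case the block sizes stay the same.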
6.1. Algorithm TiMEr
Our algorithm, see procedure TiMEr in Algorithm 1, takes as input (i) an application graph $G_a$, (ii) a processor graph $G_p$ with the partial cube property, (iii) an initial mapping $\mu: V_a \to V_p$ and (iv) the number of hierarchies, $N_H$. $N_H$ controls the tradeoff between running time and the quality of the results. The output of TiMEr consists of a bijective labeling $l_a$ such that the extended objective $\mathrm{Coco}^+(l_a)$ is low (but not necessarily optimal). Recall from Section 1 that requiring $\mu$ as input is no major limitation. An initial bijection $l_a$ representing this $\mu$ is found in lines 1, 2 of Algorithm 1.
In lines 3 through 21 we take $N_H$ attempts to improve $l_a$, where each attempt uses another hierarchy on $V_a$. Before the new hierarchy is built, the old labeling is saved in case the label swapping in the new hierarchy turns out to be a setback w. r. t. the extended objective $\mathrm{Coco}^+$ (line 4). This may occur, since the gain w. r. t. $\mathrm{Coco}^+$ on a coarser level of a hierarchy is only an estimate of the gain on the finest level (see below). The hierarchy and the current mapping are encoded by (i) a sequence of graphs $G^1, G^2, \dots$ with $G^1 = G_a$, (ii) a sequence of labelings $l^1, l^2, \dots$, one per graph, and (iii) a vector, called parent, that provides the hierarchical relation between the vertices. From now on, we interpret vertex labels as integers whenever convenient. More precisely, an integer arises from a label, i. e., a bitvector, by interpreting the latter as a binary number.
In lines 6 and 7, the entries of the vertex labels of $G_a$ are permuted according to a random permutation $\pi$. The construction of the hierarchy (lines 9 through 14) goes hand in hand with label swapping (lines 10-12) and label coarsening (in line 13). The contraction function in line 13 contracts any pair of vertices of $G^i$ whose labels agree on all but the last digit, thus generating $G^{i+1}$. The same function also cuts off the last digit of $l^i(v)$ for all $v \in V^i$, and creates the labeling $l^{i+1}$ for the next coarser level. Finally, it builds the vector parent (for encoding the hierarchy of the vertices). In line 15, the call to the function of Algorithm 2 derives a new labeling $l_a$ from the labelings $l^i$ on the levels of a hierarchy (read more in Section 6.2). The permutation of $l_a$’s entries is undone in line 16, and $l_a$ is kept only if it is better than the saved labeling, see lines 17 to 19. Figure 10 depicts a snapshot of TiMEr on a small instance.
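One level of label coarsening can be sketched as follows. This is a minimal illustration under simplifying assumptions: labels are integers, a coarse vertex is identified with its coarse label, and the name `contract` together with the data layout is hypothetical rather than taken from Algorithm 1.

```python
from collections import defaultdict

def contract(labels, edges):
    """Merge vertices whose labels agree on all but the last digit.

    labels: {vertex: integer label}
    edges:  {(u, v): weight}
    Returns (coarse_labels, coarse_edges, parent).
    """
    # Cutting off the last digit is a right shift by one bit; two vertices
    # with the same shifted label end up in the same coarse vertex.
    parent = {vertex: label >> 1 for vertex, label in labels.items()}
    coarse_labels = {p: p for p in set(parent.values())}
    coarse_edges = defaultdict(int)
    for (u, v), weight in edges.items():
        pu, pv = parent[u], parent[v]
        if pu != pv:  # edges inside a coarse vertex disappear
            coarse_edges[(min(pu, pv), max(pu, pv))] += weight
    return coarse_labels, dict(coarse_edges), parent

labels = {"a": 0b00, "b": 0b01, "c": 0b10, "d": 0b11}
edges = {("a", "b"): 1, ("b", "c"): 2, ("a", "d"): 3, ("c", "d"): 4}

coarse_labels, coarse_edges, parent = contract(labels, edges)
print(coarse_labels)  # {0: 0, 1: 1}
print(coarse_edges)   # {(0, 1): 5}
print(parent)         # {'a': 0, 'b': 0, 'c': 1, 'd': 1}
```

Because the label entries are randomly permuted before each hierarchy is built, the digit that is cut off first differs from hierarchy to hierarchy, so repeated calls explore different groupings of the same vertices.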
The function of Algorithm 2, called in line 15 of TiMEr, turns the hierarchy of graphs $G^i$ into a new labeling $l_a$ for $G_a$, digit by digit, using the labels $l^i$ (in Algorithm 2, “$\ll i$” denotes a left shift by $i$ digits). The least and the most significant digit of any $l_a(v)$ are inherited from the permuted labeling (line 7 of Algorithm 1) and do not change (lines 2, 17 and 18). The remaining digits are set in the loop from line 5 to 16. Whenever possible, digit $i$ of $l_a(v)$ is set to the last digit of the parent of $v$ (= preferred digit) on level $i$, see lines 9, 11. This might, however, lead to a label that is not in the set of labels any more, which would change the set of labels and may violate the balance constraint coming from $\mu$. To avoid such balance problems, we take the preferred digit if possible (in lines 9-11) or, if not, we switch to the (old) inverted digit, see line 13. Since $G^1 = G_a$, the new $l^1$ on $V^1$ can be taken as the new $l_a$ on $V_a$, see line 18 in Algorithm 1.
6.3. Running time analysis
The expected asymptotic running time of the function of Algorithm 2 is $O(|V_a| \cdot \dim(G_a))$. Here, “expected” is due to the condition in line 10 that is checked in expected constant time. (We use a hashing-based C++ std::unordered_map to find a vertex with a certain label. A plain array would be too large for large grids and tori, especially if the blocks are large, too.) For Algorithm 1, the expected running time is dominated by the loop between lines 9 and 14. The loop between lines 10 and 12 takes amortized expected time $O(|E_a|)$ (“expected”, because we have to go from the labels to the vertices and “amortized”, because we have to check the neighborhoods of all vertices on the current level). The contraction in line 13 takes time $O(|E_a|)$, too. Thus, the loop between lines 9 and 14 takes time $O(|E_a| \cdot \dim(G_a))$. In total, the expected running time of Algorithm 1 is $O(N_H \cdot |E_a| \cdot \dim(G_a))$.
An effective first step toward a parallel version of our algorithm would be simple loop parallelization in lines 10-12 of Algorithm 1. To avoid stale data, label accesses need to be coordinated.
7.1. Description of experiments
In this section we specify our test instances, our experimental setup and the way we evaluate the computed mappings.
The application graphs are the 15 complex networks used by Safro et al. (Safro et al., 2012) for partitioning experiments and in (Glantz et al., 2015) for mapping experiments, see Table 1. Regarding the processor graphs, we loosely follow current architectural trends. Several leading supercomputers have a torus architecture (Deveci et al., 2015), and grids (= meshes) are of rising importance in emerging multicore chips (Clauss et al., 2011). As processor graphs we therefore use a 2D grid, a 3D grid, a 2D torus, a 3D torus and, for more theoretical reasons, a hypercube, with 256 or 512 nodes each. In our experiments, we set the number of hierarchies for TiMEr to 50, and for partitioning/mapping with state-of-the-art tools, the load imbalance is set to 3%. All computations are based on sequential C++ code. Each experiment is executed on a node with two Intel Xeon E5-2680 processors (Sandy Bridge) at 2.7 GHz equipped with 32 GB RAM and 8 cores per processor.
Table 1: Complex networks used as application graphs.
| Name | #vertices | #edges | Type |
|---|---|---|---|
| p2p-Gnutella | 6 405 | 29 215 | file-sharing network |
| PGPgiantcompo | 10 680 | 24 316 | largest connected component in network of PGP users |
| email-EuAll | 16 805 | 60 260 | network of connections via email |
| as-22july06 | 22 963 | 48 436 | network of internet routers |
| soc-Slashdot0902 | 28 550 | 379 445 | news network |
| loc-brightkite_edges | 56 739 | 212 945 | location-based friendship network |
| loc-gowalla_edges | 196 591 | 950 327 | location-based friendship network |
| citationCiteseer | 268 495 | 1 156 647 | citation network |