FastSV: A Distributed-Memory Connected Component Algorithm with Fast Convergence

10/14/2019, by Yongzhe Zhang, et al.

This paper presents a new distributed-memory algorithm called FastSV for finding connected components in an undirected graph. Our algorithm simplifies the classic Shiloach-Vishkin algorithm and employs several novel and efficient hooking strategies for faster convergence. We map different steps of FastSV to linear-algebraic operations and implement them with the help of scalable graph libraries. FastSV uses sparse operations to avoid redundant work and optimized MPI communication to avoid bottlenecks. The resulting algorithm shows high performance and scalability: it can find the connected components of a hyperlink graph with over 134B edges in 30 seconds using 262K cores on a Cray XC40 supercomputer. FastSV outperforms the state-of-the-art algorithm by an average speedup of 2.21x (max 4.27x) on a variety of real-world graphs.


1 Introduction

This paper presents a distributed-memory parallel algorithm for finding connected components (CC) in an undirected graph G = (V, E), where V and E are the sets of vertices and edges, respectively. A connected component is a subgraph of G in which every pair of vertices is connected by a path and no vertex in the subgraph is connected to any vertex outside the subgraph. Finding connected components has numerous applications in bioinformatics [van2000graph], computer vision [yang1989improved], and scientific computing [pothen1990computing].

Sequentially, connected components of a graph with n vertices and m edges can be found easily by breadth-first search (BFS) or depth-first search in O(n + m) time. While this approach performs linear work, its depth is proportional to the sum of the diameters of the connected components. Therefore, BFS-based parallel algorithms are not suitable for high-diameter graphs or graphs with millions of connected components. Connectivity algorithms based on the “tree hooking” scheme work by arranging the vertices into disjoint trees such that, at the end of the algorithm, all vertices in a tree represent a connected component. Shiloach and Vishkin [sv] used this idea to develop a highly parallel PRAM (parallel random access machine) algorithm that runs in O(log n) time using O(n + m) processors. Their algorithm is not work efficient as it performs O((n + m) log n) work, but the abundance of parallel work made it an attractive choice for large-scale distributed-memory systems. Therefore, the Shiloach-Vishkin (SV) algorithm and its variants are frequently used in scalable distributed-memory CC algorithms such as LACC [lacc], ParConnect [parconnect], and Hash-Min [hashmin].

To the best of our knowledge, LACC [lacc] is the most scalable published CC algorithm: it scales to 262K cores when clustering graphs with more than 50B edges. LACC is based on the Awerbuch-Shiloach (AS) algorithm, which is a simplification of the SV algorithm. The AS algorithm consists of four steps: (a) finding stars (trees of height 1), (b) hooking stars conditionally onto other trees, (c) hooking stars unconditionally onto other trees, and (d) shortcutting to reduce the height of trees. LACC mapped these operations to parallel linear-algebraic operations such as those defined in the GraphBLAS standard [graphblas] and implemented them in the CombBLAS [combblas] library for scalability and performance. We observed that LACC’s requirements of star hooking and unconditional hooking can be safely removed to design a simplified SV algorithm with just two steps: (a) hooking trees conditionally onto other trees and (b) shortcutting. After mapping these two operations to linear algebra and performing other simplifications, we developed a distributed-memory SV algorithm that is both simpler and faster than LACC. Since each of the four operations in LACC takes about 25% of the total runtime, each iteration of our simplified SV is usually about twice as fast as an iteration of LACC when run on the same number of processors. However, the simplified SV requires more iterations than LACC because of the removal of unconditional hooking. To alleviate this problem, we developed several novel hooking strategies for faster convergence; hence the new algorithm is named FastSV.

The simplicity of FastSV along with its fast convergence schemes makes it suitable for distributed-memory platforms. We map different steps of FastSV to linear-algebraic operations and implement the algorithm using the CombBLAS library. We choose CombBLAS due to its high scalability and the fact that the state-of-the-art connected component algorithms LACC and ParConnect also rely on CombBLAS. We further employ several optimization techniques to eliminate communication bottlenecks. The resulting algorithm is highly parallel: it scales up to 262K cores of a Cray XC40 supercomputer and can find CCs in graphs with billions of vertices and hundreds of billions of edges in just 30 seconds. FastSV advances the state of the art in parallel CC algorithms as it is on average 2.21x faster than the previous fastest algorithm, LACC. Overall, we make the following technical contributions in this paper:


  • We develop a simple and efficient algorithm FastSV for finding connected components in distributed memory. FastSV uses novel hooking strategies for fast convergence.

  • We present FastSV using a handful of GraphBLAS operations and implement the algorithm in CombBLAS for distributed-memory platforms and in SuiteSparse:GraphBLAS for shared-memory platforms. We dynamically use sparse operations to avoid redundant work and optimize MPI communication to avoid bottlenecks.

  • Both shared- and distributed-memory implementations of FastSV are significantly faster than the state-of-the-art algorithm LACC. The distributed-memory implementation of FastSV can find CCs in a hyperlink graph with 3.27B vertices and 124.9B edges in just 30 seconds using 262K cores of a Cray XC40 supercomputer.

2 Background

2.1 Notations.

This paper assumes that the connected component algorithm operates on an undirected graph G = (V, E) with n = |V| vertices and m = |E| edges. For each vertex v, we use N(v) to denote v's neighbors, the set of vertices adjacent to v. We use the term pointer graph to refer to an auxiliary directed graph P for G, in which every vertex v has exactly one outgoing edge (v, f[v]) with f[v] ∈ V. If we ignore the self-loops (v, v), P defines a forest of directed rooted trees where every vertex can follow the directed edges to reach the root of its tree. In P, a tree is called a star if every vertex in the tree points to a root vertex (a root points to itself).
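For concreteness, here is a minimal Python sketch (not from the paper's code; the array name f and the helper functions are illustrative) of the pointer graph as a parent array, together with a check of the star property:

```python
# A pointer graph over n vertices is just an array f where f[u] is u's parent.
# Roots are the vertices with f[u] == u; a tree is a star when every vertex
# points directly to a root.

def find_root(f, u):
    """Follow parent pointers from u until a root (a self-loop) is reached."""
    while f[u] != u:
        u = f[u]
    return u

def is_star(f, vertices):
    """A set of vertices forms a star if each of them points to a root."""
    return all(f[f[u]] == f[u] for u in vertices)

f = [0, 0, 1, 3, 3]            # two trees: {0, 1, 2} (a chain, not a star) and {3, 4} (a star)
print(find_root(f, 2))         # 0
print(is_star(f, [3, 4]))      # True
print(is_star(f, [0, 1, 2]))   # False: vertex 2 points to 1, which is not a root
```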

2.2 GraphBLAS.

Expressing graph algorithms in the language of linear algebra is appealing. By using a small set of matrix and vector (linear algebra) operations, many scalable graph algorithms can be described clearly, the parallelism is hidden from the programmer, and high performance can be achieved by performance experts implementing those primitives on various architectures. Several independent systems have emerged that use matrix algebra to perform graph computations [combblas, gpi, graphmat]. Recently, GraphBLAS [graphblas] defined a standard set of linear-algebraic operations (and C APIs [graphblas-C]) for implementing graph algorithms. In this paper, we use functions from the GraphBLAS API to describe our algorithms due to their conciseness. Our distributed implementation is based on CombBLAS [combblas].

2.3 The original SV algorithm.

The SV algorithm stores the connectivity information in a forest of rooted trees, where each vertex v maintains a parent field f[v] through which it points to either itself or another vertex in the same connected component. All vertices in a tree belong to the same component, and at termination of the algorithm, all vertices in a connected component belong to the same tree. Each tree has a designated root (a vertex having a self-loop) that serves as the representative vertex of the corresponding component. This data structure is the pointer graph defined above, and it changes dynamically during the course of the algorithm.

The algorithm begins with single-vertex trees and iteratively merges trees to find connected components. Each iteration of the original SV algorithm performs a sequence of four operations: (a) conditional hooking, (b) shortcutting, (c) unconditional hooking, and (d) another shortcutting. Here, hooking is a process in which the root of a tree becomes a child of a vertex from another tree. Conditional hooking of a root r is allowed only when r's id is larger than the id of the vertex it is hooked onto. Unconditional hooking can hook any tree that remained unchanged in the preceding conditional hooking. The shortcutting step reduces the height of trees by replacing a vertex's parent with its grandparent. With these four steps, the SV algorithm is guaranteed to finish in O(log n) iterations, where each iteration performs O(n + m) parallel work.

The original Shiloach-Vishkin paper mentioned that the last shortcutting exists only to simplify the proof of their algorithm; hence, it can be removed without sacrificing correctness or convergence speed. If we also remove unconditional hooking, the algorithm is still correct, but it may need O(n) iterations in the worst case. Nevertheless, practical parallel algorithms often remove the unconditional hooking [palgol, cc-ipdps18] because it needs to keep track of unchanged trees (also known as stagnant trees), which is expensive, especially in distributed memory. We follow the same route and use the simplified SV algorithm discussed next.

1:procedure SV(G(V, E), f)
2:     for every vertex u ∈ V do
3:          f[u] ← u;  fnext[u] ← u
4:     repeat
5:          ▷ Step 1: Tree hooking
6:          for every (u, v) ∈ E do in parallel
7:               if f[u] = f[f[u]] and f[v] < f[u] then
8:                    fnext[f[u]] ← f[v]
9:          f ← fnext
10:          ▷ Step 2: Shortcutting
11:          for every u ∈ V do in parallel
12:               if f[u] ≠ f[f[u]] then
13:                    fnext[u] ← f[f[u]]
14:          f ← fnext
15:     until f remains unchanged
Algorithm 1 The SV algorithm. Input: An undirected graph G(V, E). Output: The parent vector f.

2.4 A simplified SV algorithm

Algorithm 1 describes the simplified SV algorithm, which is the basis of our parallel algorithm. Initially, the parent f[u] of a vertex u is set to u itself to denote single-vertex trees. We additionally maintain a copy fnext of the parent vector so that the parallel algorithm reads from f and writes to fnext. Given a fixed ordering of the vertices, each execution of Algorithm 1 generates exactly the same pointer graph after the i-th iteration because separate vectors are used for reading and writing. Hence, the convergence pattern of this parallel algorithm is completely deterministic, making it suitable for massively parallel distributed systems. By contrast, concurrently reading from and writing to a single vector still delivers the correct connected components, but the structures of the intermediate pointer graphs are not deterministic.

In each iteration, the algorithm performs tree hooking and shortcutting operations in order:

  • Tree hooking (lines 6–8): for every edge (u, v), if u's parent f[u] is a root and f[v] < f[u], then make f[u] point to f[v]. As mentioned before, the updated parents are stored in the separate vector fnext so that they are not used in the current iteration.

  • Shortcutting (lines 11–13): if a vertex u does not point to a root vertex, make u point to its grandparent f[f[u]].

The algorithm terminates when the parent vector remains unchanged in the latest iteration. At termination, every tree becomes a star, and the vertices of a star constitute a connected component. The correctness of this algorithm is discussed in previous work [greiner1994comparison]. However, as mentioned before, without the unconditional hooking used in the original SV algorithm, we can no longer guarantee that Algorithm 1 converges in O(log n) iterations. We will show in Section 5 that Algorithm 1 indeed converges slowly, but it does not hit the worst-case bound of O(n) iterations for the practical graphs we considered. Nevertheless, the extra iterations needed by Algorithm 1 increase the runtime of parallel SV algorithms. To alleviate this problem, we develop several novel hooking schemes, ensuring that the improved algorithm, FastSV, is as simple as Algorithm 1 but converges faster.
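The following Python sketch is a simplified sequential rendering of Algorithm 1 under the conventions above (read from f, write to fnext); the function name and the edge-list encoding are our assumptions, not the paper's code:

```python
def simplified_sv(n, edges):
    """Simplified SV (Algorithm 1): conditional tree hooking + shortcutting.

    n     -- number of vertices with ids 0..n-1
    edges -- iterable of undirected edges (u, v); both directions are processed
    Returns the parent vector f; vertices with equal f belong to one component.
    """
    f = list(range(n))
    while True:
        f_before = list(f)
        # Step 1: tree hooking (reads f, writes fnext)
        fnext = list(f)
        for a, b in edges:
            for u, v in ((a, b), (b, a)):        # treat each undirected edge both ways
                if f[u] == f[f[u]] and f[v] < f[u]:
                    # ties between "concurrent" writes are resolved by taking the minimum
                    fnext[f[u]] = min(fnext[f[u]], f[v])
        f = fnext
        # Step 2: shortcutting (reads f, writes fnext)
        fnext = list(f)
        for u in range(n):
            if f[u] != f[f[u]]:
                fnext[u] = f[f[u]]
        f = fnext
        if f == f_before:                        # parent vector stabilized: every tree is a star
            return f

print(simplified_sv(5, [(0, 1), (1, 2), (3, 4)]))   # [0, 0, 0, 3, 3]
```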

3 The FastSV algorithm

In this section, we introduce four important optimizations for the simplified SV algorithm, obtaining FastSV with faster convergence.

3.1 Hooking to grandparent.

Figure 1: Two different ways of performing the tree hooking for an edge (u, v): (1) the original algorithm hooks u's parent f[u] onto v's parent f[v]; (2) the new strategy hooks u's parent f[u] onto v's grandparent f[f[v]]. Both strategies are correct, and the latter improves the convergence.

In the original algorithm, the tree hooking is represented by the assignment f[f[u]] ← f[v] (line 8 in Algorithm 1), requiring f[u] to be a root vertex and f[v] < f[u]. It is not hard to see that if we instead perform the tree hooking using v's grandparent, i.e., f[f[u]] ← f[f[v]], the algorithm still produces the correct answer. To show this, we visualize both operations in Figure 1.

Suppose (u, v) is an edge in the input graph and f[v] < f[u]. The original hooking operation is represented by the green arrow in the figure, which hooks f[u] onto v's parent f[v]. Our new strategy simply changes the target to v's grandparent f[f[v]], resulting in the red arrow from f[u] to f[f[v]]. As long as the chosen target is a vertex in the same tree as v, the correctness of the algorithm is easily established. One can also expect that any value like v's k-th level ancestor would work as well.

Intuitively, choosing a higher ancestor of v in the tree hooking will likely create shorter trees, leading to faster convergence (all trees are stars at termination). However, finding higher ancestors may incur additional computational cost. Here, we choose grandparents because they are needed in the shortcutting operation anyway; hence, using grandparents does not incur additional cost in the hooking operation.
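As a sketch of this change (reusing the illustrative names from the Algorithm 1 sketch above), only the target of the hooking assignment moves from v's parent to v's grandparent:

```python
def hook_to_grandparent(f, fnext, edges):
    """One pass of tree hooking that targets v's grandparent f[f[v]] rather than f[v];
    the min keeps an assignment only when it actually lowers the target entry."""
    for a, b in edges:
        for u, v in ((a, b), (b, a)):
            if f[u] == f[f[u]]:                  # f[u] must still be a root, as in Algorithm 1
                fnext[f[u]] = min(fnext[f[u]], f[f[v]])
```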

Figure 2: The stochastic hooking strategy. Two edges activate the hooking operation. The red arrows are the potential modifications to the pointer graph due to our stochastic hooking strategy, which tries to hook a non-root vertex onto another vertex. The solid red arrow successfully modifies a vertex's pointer, but the dashed arrows do not take effect due to the ordering on the vertices.

3.2 Stochastic hooking.

The original SV algorithm and Algorithm 1 always hook the root of a tree onto another tree (see Figure 1 for an example). Therefore, the hooking operation in Algorithm 1 never breaks a tree into multiple parts that are hooked onto different trees. This restriction is enforced by the equality check f[u] = f[f[u]] in line 7 of Algorithm 1, which is satisfied only by roots and their children. We observed that this restriction is not necessary for the correctness of the SV algorithm. Intuitively, we can split a tree into multiple parts and hook them independently, because these tree fragments will eventually be merged into a single connected component when the algorithm terminates. We call this strategy stochastic hooking.

The stochastic hooking strategy can be employed by simply removing the condition f[u] = f[f[u]] from line 7 of Algorithm 1. Then, any part of a tree is allowed to hook onto another vertex when the remaining hooking conditions are satisfied. It should be noted that after removing this condition, a tree may hook onto a vertex in the same tree; this does not affect correctness. In that case, the effect of stochastic hooking is similar to shortcutting, which re-points a vertex to some other vertex with a smaller identifier.

Figure 2 shows an example of stochastic hooking (the solid red arrow). In the original algorithm, a vertex never modifies its non-root parent's pointer, but stochastic hooking re-points that parent to the grandparent of one of the vertex's neighbors. After this hooking, the parent and its former descendants might no longer be in the same tree (assuming the two endpoints of the edge were in different trees). Possibly splitting trees is a policy that differs from the conventional SV algorithm, but it gives a non-root vertex an opportunity to be hooked. In Figure 2, the new parent has a smaller identifier than the old one, which can expedite the convergence.

Algorithm 2 presents a high-level description of FastSV using the new hooking strategies. Here, x ←min y denotes a compare-and-assign operation that updates the entry x only when the right-hand side y is smaller. The stochastic hooking is shown in lines 6–7, and the shortcutting operation in lines 12–13 is likewise performed without a guarding predicate, relying on ←min instead.

Figure 3: The aggressive hooking strategy. Two edges activate the hooking operation. The green arrows represent the hooking strategies introduced so far, and the red arrows represent our aggressive hooking strategy, where a vertex may point to one of its neighbors' grandparents. Some vertices may have multiple incoming arrows, and which vertex to hook onto is decided by the ordering on the vertices.

3.3 Aggressive hooking.

Next, we give a vertex another chance to hook itself onto another tree whenever possible. This strategy is called aggressive hooking, performed by f[u] ←min f[f[v]] for every edge (u, v). Figure 3 gives an example of aggressive hooking by the red arrows. A vertex whose pointer would not be modified by any hooking operation introduced so far can now be re-pointed to a neighbor's newest grandparent, which has an effect similar to an additional shortcutting. We should mention that an additional shortcutting is expensive due to the recalculation of grandparents, while aggressive hooking is essentially a cheap element-wise operation over f that reuses intermediate results from the stochastic hooking. We discuss how both are implemented in Section 4.1.

For the other vertex in Figure 3, the original algorithm would only perform the shortcutting operation, while aggressive hooking offers it a second candidate parent. Our implementation lets the vertex point to the smaller of the two candidates, which is expected to give the best convergence of the vector f.

1:procedure FastSV(G(V, E), f)
2:     for every vertex u ∈ V do
3:          f[u] ← u;  fnext[u] ← u
4:     repeat
5:          ▷ Step 1: Stochastic hooking
6:          for every (u, v) ∈ E do in parallel
7:               fnext[f[u]] ←min f[f[v]]
8:          ▷ Step 2: Aggressive hooking
9:          for every (u, v) ∈ E do in parallel
10:               fnext[u] ←min f[f[v]]
11:          ▷ Step 3: Shortcutting
12:          for every u ∈ V do in parallel
13:               fnext[u] ←min f[f[u]]
14:          f ← fnext
15:     until f remains unchanged
Algorithm 2 The FastSV algorithm. Input: An undirected graph G(V, E). Output: The parent vector f.
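A compact Python sketch of Algorithm 2, with all three steps expressed through min-based compare-and-assign; as before, the function and variable names are illustrative, not taken from the paper's implementation:

```python
def fastsv(n, edges):
    """FastSV (Algorithm 2): stochastic hooking, aggressive hooking, and shortcutting."""
    f = list(range(n))
    while True:
        f_before = list(f)
        fnext = list(f)
        for a, b in edges:
            for u, v in ((a, b), (b, a)):
                gf_v = f[f[v]]                   # v's grandparent in the old vector
                # Step 1: stochastic hooking -- any vertex's parent may be re-pointed
                fnext[f[u]] = min(fnext[f[u]], gf_v)
                # Step 2: aggressive hooking -- u itself also tries its neighbor's grandparent
                fnext[u] = min(fnext[u], gf_v)
        # Step 3: shortcutting -- every vertex also tries its own grandparent
        for u in range(n):
            fnext[u] = min(fnext[u], f[f[u]])
        f = fnext
        if f == f_before:                        # Algorithm 2's termination test on f
            return f

print(fastsv(6, [(0, 1), (1, 2), (2, 3), (4, 5)]))   # [0, 0, 0, 0, 4, 4]
```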

3.4 Early termination.

The last optimization is a generic one that applies to most variations of the SV algorithm. SV's termination test is based on the stabilization of the parent vector f, which means that even if f reaches the converged state (where every vertex points to the smallest vertex in its connected component), we need an additional iteration to verify it. We will see in Section 5.2 that for most real-world graphs, FastSV usually takes 5 to 10 iterations to converge; hence, this additional iteration can consume a significant portion of the runtime. Removing the last iteration is possible by detecting the stabilization of the grandparent vector gf (where gf[u] = f[f[u]]) instead of f. The following lemma ensures the correctness of this new termination condition: after an iteration, if the grandparent vector gf remains unchanged, then the vector f will not change afterwards.

Proof. See Appendix A.

In practice, we found that on most practical graphs, FastSV identifies all the connected components before f converges, and the last iteration merely performs the shortcutting that turns the trees into stars. In such cases, the grandparent vector gf converges one iteration earlier than f.
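In the sketches above, the early termination amounts to tracking the grandparent vector instead of f; a minimal illustration (the helper name is ours):

```python
def grandparent(f):
    """gf[u] = f[f[u]] for every vertex u."""
    return [f[f[u]] for u in range(len(f))]

# In the fastsv sketch above, replace the test `if f == f_before` with a check on gf:
#
#   gf = grandparent(f)
#   while True:
#       ... hooking and shortcutting as before, producing the new f ...
#       gf_new = grandparent(f)
#       if gf_new == gf:     # gf stabilized: f will not change afterwards (lemma, Appendix A)
#           return f
#       gf = gf_new
```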

4 Implementation of FastSV in linear algebra

In this section, we first give a formal description of FastSV in GraphBLAS, a standardized set of linear-algebra primitives for describing graph algorithms. We then present its distributed-memory implementation in Combinatorial BLAS (CombBLAS) [combblas] and discuss two optimization techniques for improving its performance.

4.1 Implementation in GraphBLAS.

In GraphBLAS, we assume that the vertices are indexed from 0 to n − 1; the vertices and their associated values are then stored as a GraphBLAS object GrB_Vector. The graph's adjacency matrix is stored as a GraphBLAS object GrB_Matrix. For completeness, we concisely describe the GraphBLAS functions used in our implementation below; the formal descriptions of these functions can be found in the API document [graphblas]. We pass GrB_NULL for input parameters that are ignored.


  • The function GrB_mxv multiplies a matrix A with a vector u on a semiring and outputs the result to a vector w. When the accumulator (a binary operation accum) is specified, the multiplication result is combined with w's original values instead of overwriting them.

  • The function GrB_extract extracts a sub-vector w from the specified positions of an input vector u. Given an index array I, we can regard this operation as w[i] = u[I[i]] for 0 ≤ i < l, where l is the length of the array I and also of the vector w.

  • The function GrB_assign assigns the entries of an input vector u to the specified positions of an output vector w. Given an index array I, we can regard it as w[I[i]] = u[i] for 0 ≤ i < l, where l is the length of the array I and also of the vector u. The accumulator accum plays the same role as in GrB_mxv.

  • The function GrB_eWiseMult performs an element-wise (generalized) multiplication on the intersection of the elements of two vectors u and v and outputs the vector w.

  • The function GrB_Vector_extractTuples extracts the nonzero elements (tuples of index and value) of a vector into two separate arrays I and X, and reports the element count.

Among the remaining functions, GrB_Vector_dup duplicates a vector, GrB_reduce reduces a vector to a scalar value through a user-specified binary operation, and GrB_Matrix_nrows obtains the dimension of a matrix.

Algorithm 3 describes the FastSV algorithm in GraphBLAS. Before every iteration, we hold the current grandparent gf[u] = f[f[u]] of every vertex u. First, we perform the stochastic hooking in lines 9–10. GraphBLAS has no primitive that directly implements the parallel-for over an edge list (line 6 in Algorithm 2), so we first aggregate, for every edge (u, v), v's grandparent gf[v] to vertex u, obtaining a vector mngf that stores the minimum grandparent among each vertex's neighbors. This can be implemented by a matrix-vector multiplication using the (select2nd, min) semiring. Next, the hooking operation f[f[u]] ←min mngf[u] is implemented by the GrB_assign function in line 10, where the indices are the values of f extracted either in line 5 (before the first iteration) or in line 16 (in the previous iteration). The accumulator GrB_MIN prevents the nondeterminism caused by concurrent modifications to the same entry of f, and taking the minimum gives the best convergence in practice.

Aggressive hooking is then implemented by an element-wise multiplication in line 12. Although it is the other operation in FastSV that performs a parallel-for over the edge list, it can reuse the vector mngf computed in the previous step, so the aggressive hooking is actually cheap. Shortcutting is also implemented by an element-wise multiplication, in line 14. Next, we calculate the grandparent vector gf with the GrB_extract function in line 17, where the indices are the values of f extracted in line 16.

At the end of each iteration, we calculate the number of modified entries of gf in lines 19–20 to check whether the algorithm has converged. A copy of gf is stored in the vector dup for the termination check of the next iteration.

1:procedure FastSV(A, f)
2:     n ← GrB_Matrix_nrows(A)
3:     gf ← GrB_Vector_dup(f)                                         ▷ initial grandparent
4:     dup ← GrB_Vector_dup(gf)                                       ▷ copy of gf for the termination check
5:     V ← GrB_Vector_extractTuples(f)                                ▷ V holds the current values of f
6:     ▷ allocate the working vectors mngf and diff
7:     repeat
8:          ▷ Step 1: Stochastic hooking
9:          mngf ← GrB_mxv(A, gf) on the (select2nd, min) semiring    ▷ mngf[u] = min of neighbors' grandparents
10:          GrB_assign(f, V, mngf) with accumulator GrB_MIN           ▷ f[f[u]] ←min mngf[u], since V holds f
11:          ▷ Step 2: Aggressive hooking
12:          f ← GrB_eWiseMult(f, mngf) with the min operator          ▷ f[u] ←min mngf[u]
13:          ▷ Step 3: Shortcutting
14:          f ← GrB_eWiseMult(f, gf) with the min operator            ▷ f[u] ←min gf[u]
15:          ▷ Step 4: Calculate grandparents
16:          V ← GrB_Vector_extractTuples(f)
17:          gf ← GrB_extract(f, V)                                    ▷ gf[u] = f[f[u]]
18:          ▷ Step 5: Check termination
19:          diff ← GrB_eWiseMult(dup, gf) with an inequality operator ▷ mark the modified entries of gf
20:          nonstop ← GrB_reduce(diff) with the plus operator         ▷ count the modified entries
21:          dup ← GrB_Vector_dup(gf)
22:     until nonstop = 0
Algorithm 3 The linear-algebra FastSV algorithm. Input: The adjacency matrix A and the parent vector f. Output: The parent vector f.
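For readers without a GraphBLAS installation, the following NumPy sketch mimics the data flow of Algorithm 3: the (select2nd, min) matrix-vector product, the min-accumulated assign, the element-wise mins, and the gf-based termination test. The names (mngf, gf, dup, idx) and the edge-array encoding of the adjacency matrix are illustrative assumptions, not the paper's CombBLAS code.

```python
import numpy as np

def fastsv_linear_algebra(n, edges):
    """FastSV with vectorized NumPy operations mirroring the data flow of Algorithm 3."""
    edges = np.asarray(edges)
    src = np.concatenate([edges[:, 0], edges[:, 1]])   # both directions = symmetric A
    dst = np.concatenate([edges[:, 1], edges[:, 0]])

    f = np.arange(n)
    gf = f[f]                      # initial grandparents (here just f itself)
    dup = gf.copy()                # previous gf, kept for the termination test
    idx = f.copy()                 # values of f, used as indices by the "assign"
    while True:
        # GrB_mxv on the (select2nd, min) semiring: mngf[u] = min over neighbors v of gf[v]
        mngf = np.full(n, n)       # n acts as "+infinity"; isolated vertices stay put
        np.minimum.at(mngf, src, gf[dst])

        # GrB_assign with the MIN accumulator: stochastic hooking f[f[u]] <-min mngf[u]
        np.minimum.at(f, idx, mngf)
        # GrB_eWiseMult with MIN: aggressive hooking f[u] <-min mngf[u]
        f = np.minimum(f, mngf)
        # GrB_eWiseMult with MIN: shortcutting f[u] <-min gf[u]
        f = np.minimum(f, gf)

        # GrB_Vector_extractTuples + GrB_extract: recompute the indices and the grandparents
        idx = f.copy()
        gf = f[idx]
        # GrB_eWiseMult + GrB_reduce: stop when no entry of gf changed
        if np.array_equal(gf, dup):
            return f
        dup = gf.copy()

print(fastsv_linear_algebra(6, [[0, 1], [1, 2], [2, 3], [4, 5]]))   # [0 0 0 0 4 4]
```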

4.2 Distributed implementation using CombBLAS.

The distributed version of FastSV is implemented in CombBLAS [combblas]. CombBLAS provides all the operations needed for FastSV, but its API differs from the GraphBLAS standard. GraphBLAS's collections (matrices and vectors) are opaque datatypes whose internal representations (sparse or dense) are not exposed to users, whereas CombBLAS distinguishes them in the user interface. Moreover, GraphBLAS functions often combine multiple operations (such as masking, accumulation, and the main operation), as described in Section 4.1, while in CombBLAS we usually perform a single operation at a time. Despite these differences, a straightforward implementation of FastSV on CombBLAS can be obtained by transforming each GraphBLAS function into semantically equivalent CombBLAS operations, using dense vectors in all scenarios.

The parallel complexity of the main linear-algebraic operations used in FastSV (the vector variants of GrB_extract and GrB_assign, and GrB_mxv), as well as potential optimizations, are discussed in the LACC paper [lacc]. Due to the similarity of FastSV and LACC in algorithmic structure, they benefit from similar optimization techniques, which we briefly summarize below.

Broadcast-based implementation of the extract and assign operations. The extract and assign operations fetch or write data at specified locations of a vector, which may cause a load-balancing issue when a few locations receive too many accesses. In FastSV, these locations are exactly the set of parent vertices in the pointer graph, and due to the skewed structure of the pointer graph, the root vertices (especially those belonging to a large component) carry an extremely high workload. When using the default assign and extract implementations in CombBLAS via all-to-all communication, several processes become the bottleneck and slow down the whole operation significantly. The solution is a manual implementation of these two operations that detects the hot spots and broadcasts the entries held by those processes.
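A rough mpi4py sketch of the hot-spot idea, detecting heavily requested entries of a block-distributed vector and broadcasting them from their owners while leaving the rest to the regular all-to-all exchange; the function name, the threshold, the dense counting array, and the block distribution are simplifying assumptions, not CombBLAS's actual implementation:

```python
from bisect import bisect_right
from mpi4py import MPI
import numpy as np

def gather_hot_entries(comm, local_values, local_offset, requested, hot_threshold=2):
    """Broadcast 'hot' entries of a block-distributed vector to every rank.

    local_values  -- this rank's contiguous block of the vector
    local_offset  -- global index of local_values[0]
    requested     -- global indices this rank needs to read (e.g. an extract's index list)
    hot_threshold -- an index is hot if at least this many ranks request it
    Returns {global index: value} for the hot entries; the remaining indices are
    left to the usual all-to-all exchange (not shown here).
    """
    rank = comm.Get_rank()
    offsets = comm.allgather(local_offset)                 # block starts, ordered by rank
    n = comm.allreduce(len(local_values), op=MPI.SUM)      # global vector length

    # Count how many ranks request each global index (dense count, simplified for the sketch).
    wanted = np.zeros(n, dtype=np.int32)
    wanted[np.unique(np.asarray(requested, dtype=np.int64))] = 1
    counts = np.empty_like(wanted)
    comm.Allreduce(wanted, counts, op=MPI.SUM)

    hot_values = {}
    for g in np.nonzero(counts >= hot_threshold)[0]:
        owner = bisect_right(offsets, g) - 1               # rank owning global index g
        value = local_values[g - local_offset] if rank == owner else None
        hot_values[int(g)] = comm.bcast(value, root=owner)
    return hot_values
```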

Taking advantage of sparsity. The matrix-vector multiplication is an expensive operation in FastSV (see our performance profiling in Section 5.7). The straightforward implementation naturally chooses sparse-matrix dense-vector (SpMV) multiplication, since all the vectors in FastSV are dense. Alternatively, we can use an incremental implementation that multiplies A with Δgf, a sparse vector containing only the modified entries of gf, so the multiplication becomes a sparse-matrix sparse-vector multiplication (SpMSpV) [azad2017work]. Depending on the sparsity of Δgf, SpMSpV can have much lower computation and communication cost than SpMV. We use a threshold on the fraction of modified entries of gf to decide which method to use in each iteration, which effectively reduces the computation time. Section 5.7 presents a detailed evaluation.
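A small sketch of the switching rule, in the style of the NumPy formulation above; the threshold value and the incremental update are illustrative assumptions about how such a switch can be organized, not the exact CombBLAS code (the incremental pass is valid because gf only decreases between iterations):

```python
import numpy as np

def min_neighbor_grandparent(src, dst, gf, mngf_prev, changed, threshold=0.1):
    """Compute mngf[u] = min over neighbors v of gf[v], switching between a full
    'SpMV-like' pass and an incremental 'SpMSpV-like' pass over changed vertices.

    src, dst  -- directed edge arrays (both directions of each undirected edge)
    gf        -- current grandparent vector
    mngf_prev -- mngf from the previous iteration (still an upper bound, since gf only decreases)
    changed   -- boolean mask of vertices whose gf changed in the last iteration
    """
    n = len(gf)
    if changed.mean() >= threshold:
        # Dense pass: touch every edge (analogous to SpMV with a dense vector).
        mngf = np.full(n, n)
        np.minimum.at(mngf, src, gf[dst])
        return np.minimum(mngf, mngf_prev)
    # Sparse pass: only edges whose destination's grandparent changed can lower mngf
    # (analogous to SpMSpV with the sparse vector of modified gf entries).
    mask = changed[dst]
    mngf = mngf_prev.copy()
    np.minimum.at(mngf, src[mask], gf[dst[mask]])
    return mngf
```

Before the first iteration, mngf_prev can be initialized to a vector of n (acting as infinity) and changed to all True, so the dense branch runs once.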

5 Experiments

Graph Vertices Directed edges Components Description
Queen_4147 4.15M 166.82M 1 3D structural problem [davis2011university]
kmer_A2a 170.73M 180.29M 5353 Protein k-mer graphs from GenBank [davis2011university]
archaea 1.64M 204.78M 59794 archaea protein-similarity network [hipmcl]
kmer_V1r 214.01M 232.71M 9 Protein k-mer graphs, from GenBank [davis2011university]
HV15R 2.02M 283.07M 1 Computational Fluid Dynamics Problem [davis2011university]
uk-2002 18.48M 298.11M 1990 2002 web crawl of .uk domain [davis2011university]
eukarya 3.24M 359.74M 164156 eukarya protein-similarity network [hipmcl]
uk-2005 39.45M 936.36M 7727 2005 web crawl of .uk domain [davis2011university]
twitter7 41.65M 1.47B 1 twitter follower network [davis2011university]
SubDomain 82.92M 1.94B 246969 1st-level subdomain graph extracted from Hyperlink [meusel2014graph]
sk-2005 50.64M 1.95B 45 2005 web crawl of .sk domain [davis2011university]
MOLIERE_2016 30.22M 3.34B 4457 automatic biomedical hypothesis generation system [davis2011university]
Metaclust50 282.20M 37.28B 15982994 similarities of proteins in Metaclust50 [hipmcl]
Hyperlink 3.27B 124.90B 29360027 hyperlink graph extracted from the Common Crawl [meusel2014graph]
Table 1: Graph datasets used to evaluate the parallel connected component algorithms.
Figure 4: Number of iterations the simplified SV takes after performing each of the optimizations (sv5 is exactly our FastSV algorithm), and the number of iterations LACC takes.
Figure 5: Performance of the parallel FastSV and LACC in SuiteSparse:GraphBLAS on six small graphs.
Figure 6: Strong scaling of distributed-memory FastSV and LACC using up to 16,384 cores (256 nodes).
Figure 7: The speedup of FastSV over LACC on twelve small datasets using 256 cores (bar chart) and each graph's density in terms of average vertex degree (line chart). A positive correlation between the two metrics can be observed, except for the two outliers archaea and eukarya, where LACC makes use of its detection of converged connected components to accelerate its execution.

In this section, we evaluate various aspects of FastSV, showing its fast convergence, shared- and distributed-memory performance, scalability, and several other performance characteristics. We compare FastSV with LACC [lacc], which has demonstrated superior performance over other distributed-memory parallel CC algorithms. Table 1 shows the diverse collection of large graphs used to evaluate the CC algorithms. To the best of our knowledge, the Hyperlink graph [meusel2014graph] with 3.27B vertices and 124.90B edges is the largest publicly available graph.

5.1 Evaluation platform.

We evaluate the performance of the distributed algorithms on the NERSC Cori supercomputer. Each node of Cori has an Intel KNL processor with 68 cores and 96GB of memory. All operations in CombBLAS are parallelized with OpenMP and MPI. Given p MPI processes, we always use a square √p × √p process grid. In our experiments, we used 16 threads per MPI process. The execution pattern of our distributed algorithm follows the bulk synchronous parallel (BSP) model, where all MPI processes perform local computation followed by synchronized communication rounds.

We also show the shared-memory performance of FastSV implemented using the SuiteSparse:GraphBLAS library [suitesparse]. These experiments are conducted on an Amazon EC2 r5.4xlarge instance (128GB memory, 16 threads).

5.2 Speed of convergence.

First, we show how different hooking strategies impact the convergence of the SV and FastSV algorithms. We start with the simplified SV algorithm (Algorithm 1) and incrementally add different hooking strategies, as shown in Figure 4. The rightmost bars report the number of iterations needed by LACC.

Figure 4 shows that the simplified SV without unconditional hooking can take considerably more iterations than LACC. We note that despite needing more iterations, Algorithm 1 can run faster than LACC in practice because each of its iterations is faster than an iteration of LACC. Figure 4 demonstrates that SV converges faster as we incrementally apply the advanced hooking strategies. In fact, every hooking strategy improves the convergence on some graphs, and their combination improves the convergence on all graphs. Finally, the early termination discussed in Section 3.4 always removes one additional iteration needed by the other variants. With all improvements, sv5, which represents Algorithm 2, substantially reduces the number of iterations relative to Algorithm 1. Therefore, FastSV converges as quickly as, or faster than, LACC.

5.3 Performance in shared-memory platform using SuiteSparse:GraphBLAS.

To check the correctness of Algorithm 3, we implemented it in SuiteSparse:GraphBLAS, a multi-threaded implementation of the GraphBLAS standard. LACC also has an unoptimized SuiteSparse:GraphBLAS implementation available as part of the LAGraph library [lagraph]. We compare the performance of FastSV and LACC in this setting on an Amazon EC2 r5.4xlarge instance with 16 threads. Figure 5 shows that FastSV is significantly faster than LACC on all six graphs. Although both algorithms are designed for distributed-memory platforms, we still observe better performance from FastSV, thanks to its simplicity.

5.4 Performance in distributed-memory platform using CombBLAS.

We now evaluate the performance of FastSV implemented using CombBLAS and compare it with LACC on the Cori supercomputer. Both algorithms are implemented in CombBLAS, so they share many common operations and optimization techniques (see Section 4.2), making the comparison between the two algorithms fair. Generally, FastSV uses simpler computation logic and less expensive parallel operations than LACC. However, depending on the structure of the graph, LACC can detect already-converged connected components on the fly and can potentially use more sparse operations. Hence, the structure of the input graph often influences the relative performance of these algorithms.

Figure 6 summarizes the performance of FastSV and LACC on twelve small datasets. We observe that both FastSV and LACC scale on all the graphs, and for the majority of the graphs (8 out of 12) they continue scaling to the largest core counts. The four graphs on which they stop scaling earlier are simply too small: both algorithms finish on them within 2 seconds. FastSV outperforms LACC on all instances. On 256 cores, FastSV is on average 2.21x faster than LACC (max 4.27x). When increasing the number of nodes, the performance gap between FastSV and LACC shrinks slightly, but FastSV still outperforms LACC on average on 1024, 4096, and 16384 cores.

To see how the performance of FastSV and LACC is affected by the graph structure, we plot the average degree and the speedup of FastSV over LACC for each graph (using 1024 cores) in Figure 7. Generally, FastSV tends to outperform LACC by a larger margin on denser graphs. This is mainly because FastSV uses fewer matrix-vector multiplications, whose parallel complexity is highly related to the density of the graph. The outliers archaea and eukarya are graphs with a large number of small connected components: a very large fraction of their vertices is detected as converged early. On such graphs, LACC's detection of converged connected components gives it better opportunities to employ sparse operations, while such detection is not available in FastSV. Nevertheless, LACC's sparsity optimization still cannot compensate for its high per-iteration computational cost.

5.5 Performance of FastSV with bigger graphs.

Figure 8: Performance of FastSV and LACC on the two large graphs on Cori KNL (up to 262,144 cores using 4096 nodes).

We separately analyze the performance of FastSV and LACC on the two largest graphs in Table 1. Hyperlink is perhaps the largest publicly available graph, making it the largest connectivity problem we can currently solve. Since each of these two graphs requires more than 1TB of memory, it may be impossible to process them on a typical shared-memory server. Figure 8 shows the strong scaling of both algorithms and the better performance of FastSV. On the smaller graph Metaclust50, both algorithms keep scaling across the tested core counts, with FastSV running faster than LACC. On the Hyperlink graph containing 124.9 billion edges, they continue scaling to 262,144 cores, where FastSV achieves a clear speedup over LACC.

Figure 9: Performance breakdown of FastSV on four representative graphs.

5.6 Performance characteristics for operations.

Figure 9 shows the execution time of FastSV broken into three parts: finding the grandparents, the matrix-vector multiplication, and the hooking operations. The time spent on checking the termination is omitted, since it is insignificant relative to the other operations. Each of these three operations contributes significantly to the total execution time. Finding the grandparents and the hooking operations essentially reflect the parallel complexity of the extract and assign operations, and their ratio is relatively stable across all graphs. By contrast, the execution time of SpMV varies considerably across graphs, because SpMV's complexity depends on the density of the graph.

Figure 10: Percentage of vertices that participate in the matrix-vector multiplication in each iteration (bar chart), and the runtime of the multiplication (line chart). A vertex participates in the multiplication if its grandparent changed in the previous iteration.

5.7 Execution time reduced by the sparsity optimization.

As mentioned in Section 4.2, FastSV dynamically selects SpMV or SpMSpV based on the changes in the grandparent vector gf. This optimization is particularly effective for high-density graphs, where SpMV usually dominates the runtime (see Figure 9). Figure 10 illustrates the benefit of sparsity on four representative graphs, where we plot the fraction of vertices modified in each iteration. We observe that only a small fraction of the vertices participates in the last few iterations, when SpMSpV can be used instead of SpMV. As shown by the red runtime lines in Figure 10, the use of SpMSpV drastically reduces the runtime of the last few iterations.

6 Related work

Finding the connected components of an undirected graph is a well-studied problem in the PRAM model. Many of these algorithms, such as the Shiloach-Vishkin (SV) algorithm, assume the CRCW (concurrent-read, concurrent-write) model. The SV algorithm [sv] takes O(log n) time on O(n + m) processors. The Awerbuch-Shiloach (AS) algorithm [as] is a simplification of SV that uses a different termination condition. Transforming the complete SV or AS algorithm to distributed memory is possible [svppa, lacc], but the detection of stagnant trees in SV's unconditional hooking step is not well suited to a distributed-memory implementation, as it introduces considerable computation and communication cost. Therefore, we based our distributed-memory FastSV on a simplified SV algorithm that preserves only the essential steps, and we introduce efficient hooking steps to ensure fast convergence in practice.

Several distributed-memory connected component algorithms have been proposed in the literature. Parallel BFS is a popular method that has been implemented and optimized in various systems [parallel-bfs, graphx, powerlyra, gemini], but its complexity is bounded by the diameter of the graph, so it is mainly used on small-world networks. LACC [lacc] is the state-of-the-art algorithm prior to our work; it guarantees convergence in O(log n) iterations by transforming the complete AS algorithm into linear-algebraic operations. FastSV's high performance comes from a much simpler computation logic than LACC's. ParConnect [parconnect] is another distributed-memory algorithm that adaptively uses parallel BFS and SV and dynamically selects which method to use. For other software architectures, there are Hash-Min [hashmin] for MapReduce systems and S-V PPA [svppa] for vertex-centric message-passing systems [pregel]. MapReduce algorithms tend to perform poorly on the tightly-coupled parallel systems our work targets, compared to the loosely-coupled architectures that are optimized for cloud workloads. The S-V PPA algorithm, due to its requirement of communication between non-neighboring vertices, is only supported by several Pregel-like systems [giraph, pregelplus, pregel-channel], and these frameworks tend to have limited scalability on multi-core clusters due to the lack of multi-threading support.

7 Conclusion

In this paper, we present a new distributed-memory connected component algorithm, FastSV, that scales to hundreds of thousands of processors on modern supercomputers. FastSV achieves its efficiency by first keeping only the backbone of the Shiloach-Vishkin algorithm and then employing several novel hooking strategies for fast convergence. FastSV attains high performance by employing scalable GraphBLAS operations and optimized communication. Given the generic nature of our algorithm, it can be easily implemented on other computing platforms, for example using GraphBLAST [yang2019graphblast] for GPUs, and in other programming languages, for example using pygraphblas (https://github.com/michelp/pygraphblas) in Python.

Finding CCs is a fundamental operation in many large-scale applications such as metagenome assembly and protein clustering. With the exponential growth of genomic data, these applications will generate graphs with billions of vertices and trillions of edges and will use upcoming exascale computers to solve scientific problems. FastSV is generic enough to be integrated with many existing libraries and scalable enough to be integrated with massively parallel applications.

References

Appendix A Correctness of the early termination

The lemma in Section 3.4 states that, in FastSV, if the grandparent vector gf remains unchanged after an iteration, then the vector f will not change afterwards. The proof makes use of the following lemmas.

Lemma A.1. Throughout the algorithm, f[u] ≤ u holds for every vertex u.

Proof. Initially f[u] = u for every vertex u, so the lemma holds trivially. The ←min operation ensures that f[u] can only decrease, so the lemma continues to hold.

Lemma A.2. After an iteration, if the grandparent vector gf remains unchanged, then every vertex hooked onto its grandparent in the previous iteration, i.e., f[u] equals the old gf[u].

Proof. By contradiction. Suppose a vertex u changed its pointer to some vertex w other than its old grandparent gf[u]; since this assignment overrides the shortcutting operation f[u] ←min gf[u], we know that w < gf[u]. By Lemma A.1, u's new grandparent f[w] ≤ w < gf[u], so the grandparent of u has changed, a contradiction.

Lemma A.3. After an iteration, if the grandparent vector gf remains unchanged, then every vertex now points to a root.

Proof. By contradiction. Suppose u's new parent f[u] is not a root. By Lemma A.2, f[u] equals u's old grandparent gf[u], and by Lemma A.1, u's new grandparent f[f[u]] < f[u] = gf[u] (the inequality is strict because f[u] is not a root). Hence u's grandparent has changed, a contradiction.

We now prove the lemma of Section 3.4.

Proof. We show that no hooking operation can modify f if gf remains unchanged after an iteration. The aggressive hooking, in the form of f[u] ←min f[f[v]], was overridden by the shortcutting operation in the previous iteration (by Lemma A.2), meaning that gf[u] ≤ gf[v] for every edge (u, v). Since then gf has not changed, so the aggressive hooking will not take effect in the current iteration either. The stochastic hooking will not take effect because, for every edge (u, v), we have f[f[u]] = f[u] = gf[u] ≤ gf[v], since every vertex now points to a root (Lemma A.3). Shortcutting will not take effect either, because every vertex already points to a root (Lemma A.3). Therefore no operation modifies f, and the vector f remains unchanged afterwards.