The Reverse Cuthill-McKee Algorithm in Distributed-Memory

by   Ariful Azad, et al.
Berkeley Lab

Ordering vertices of a graph is key to minimize fill-in and data structure size in sparse direct solvers, maximize locality in iterative solvers, and improve performance in graph algorithms. Except for naturally parallelizable ordering methods such as nested dissection, many important ordering methods have not been efficiently mapped to distributed-memory architectures. In this paper, we present the first-ever distributed-memory implementation of the reverse Cuthill-McKee (RCM) algorithm for reducing the profile of a sparse matrix. Our parallelization uses a two-dimensional sparse matrix decomposition. We achieve high performance by decomposing the problem into a small number of primitives and utilizing optimized implementations of these primitives. Our implementation shows strong scaling up to 1024 cores for smaller matrices and up to 4096 cores for larger matrices.



There are no comments yet.


page 8

page 9


An SSD-based eigensolver for spectral analysis on billion-node graphs

Many eigensolvers such as ARPACK and Anasazi have been developed to comp...

Communication-Avoiding and Memory-Constrained Sparse Matrix-Matrix Multiplication at Extreme Scale

Sparse matrix-matrix multiplication (SpGEMM) is a widely used kernel in ...

The Chunks and Tasks Matrix Library 2.0

We present a C++ header-only parallel sparse matrix library, based on sp...

Parallel Enumeration of Triangulations

We report on the implementation of an algorithm for computing the set of...

Accelerating Clique Counting in Sparse Real-World Graphs via Communication-Reducing Optimizations

Counting instances of specific subgraphs in a larger graph is an importa...

Efficient Distributed Transposition Of Large-Scale Multigraphs And High-Cardinality Sparse Matrices

Graph-based representations underlie a wide range of scientific problems...

H2OPUS-TLR: High Performance Tile Low Rank Symmetric Factorizations using Adaptive Randomized Approximation

Tile low rank representations of dense matrices partition them into bloc...
This week in AI

Get the week's most popular data science and artificial intelligence research sent straight to your inbox every Saturday.

I Introduction

Reordering a sparse matrix to reduce its bandwidth or profile can speed up many sparse matrix computations [1, 2]. For example, a matrix with a small profile is useful in direct methods for solving sparse linear systems since it allows a simple data structure to be used. It is also useful in iterative methods because the nonzero elements will be clustered close to the diagonal, thereby enhancing data locality. Given a symmetric matrix , a bandwidth-reduction ordering aims to find a permutation so that the bandwidth of is small. Since obtaining a reordering to minimize bandwidth is an NP-complete problem [3]

, various heuristics are used in practice such as Cuthill-McKee, Reverse Cuthill-McKee (RCM), and Sloan’s algorithms 

[4, 5, 6]. This paper solely focuses on the RCM algorithm [5] because, with careful algorithm design, it is amenable to massive distributed-memory parallelism – the primary topic of interest of this paper.

Fig. 1: The time to solve thermal2 using the conjugate gradient method with block Jacobi preconditioner in PETSc on NERSC/Edison. thermal2 has 1.2M rows and columns and 4.9M nonzeros. The bandwidths of the original and RCM-permuted matrix are 1,226,000 and 795, respectively.

The need for distributed memory RCM algorithms is driven by the move to extreme-scale computing. In a large scale scientific application, the matrix has often been distributed already by the time one arrives at the numerical phases of sparse matrix computations. Hence, it would be a waste of time to gather a distributed matrix onto a single processor to compute an ordering using sequential or multithreaded algorithms. Furthermore, by clustering the nonzeros closer to the diagonal, RCM ordering not only increases cache performance of iterative solvers, but it can often restrict the communication to resemble more of a nearest-neighbor pattern. Figure 1 illustrates the performance effects of RCM ordering on a preconditioned conjugate gradient solver of the popular PETSc package [7]. Notice that the performance benefit of RCM ordering actually increases as the number of cores increases, possible due to reduced communication costs. Our goal in this paper is therefore to design and develop a scalable distributed-memory parallel implementation of the RCM algorithm [5].

Similar to many sparse matrix computations, RCM ordering has been shown to be a difficult problem to parallelize [8]. The RCM algorithm relies on the repeated application of breadth-first search (BFS) and its computational load is highly dynamic, especially if the graph has high diameter. The problem exacerbates on high concurrency, where load imbalance and communication overhead degrade the performance of the parallel algorithm. RCM is harder to parallelize than BFS primarily for two reasons. First, RCM imposes additional restrictions on how the vertices are numbered in the graph traversal, limiting available parallelism. Second, RCM is often used for matrices with medium-to-high diameter while most of the existing work on parallel BFS has focused on low-diameter graphs such as synthetic graphs used by the Graph500 benchmark. A higher diameter increases the critical path of level-synchronous BFS algorithms, also limiting available parallelism.

In this paper, we aim to overcome these challenges by using the graph-matrix duality and replacing unstructured graph operations by structured matrix/vector operations. We make the following contributions:

  • We present a scalable distributed-memory algorithm for RCM ordering. Our algorithm relies on a handful of bulk-synchronous parallel primitives that are optimized for both shared-memory and massively parallel distributed memory systems.

  • The quality (bandwidth and envelope) of ordering from our distributed-memory implementation is comparable to the state-of-the-art and remains insensitive to the degree of concurrency.

  • We provide a hybrid OpenMP-MPI implementation of the RCM ordering that attains up to 38 speedup on matrices from various applications on 1024 cores of a Cray XC30 supercomputer. We provide detailed performance evaluation on up to 4096 cores, which sheds light on the performance bottlenecks and opportunities for future research.

Ii Preliminaries

Ii-a Serial Algorithms

Let be the number of columns in a symmetric matrix . Let be the the column subscript of the first nonzero element in column of : . The -th bandwidth is defined as . The overall bandwidth of matrix is denoted and is defined as . Using these notations, the envelope is defined as:

The quantity is called the profile or envelope size of .

Reverse Cuthill-McKee ordering, or RCM, introduced by George [9], is a variant of the Cuthill-McKee ordering [4], which aims at reducing the bandwidth of a sparse symmetric matrix . This is of particular importance when the matrix is to be stored using a profile based format. Finding a reordering of the rows/columns of corresponds to the process of labeling vertices of the graph associated with .

RCM repeatedly labels vertices adjacent to the current vertex until all have been labeled, as depicted in Algorithm LABEL:algo:seq_rcm. The algorithm essentially processes vertices levels by levels and labels them in reverse order. The resulting reordered matrix often has a smaller profile [9].


For the graph associated with , is the set of vertices and is the set of edges in . The number of vertices in is denoted and is the number of edges . The eccentricity [10] of a vertex is defined as where is the graph distance between and .

The rooted level structure [11] of a vertex is the partioning of satisfying


The length of corresponds to the eccentricity while the width of is defined by

The first vertex being labeled strongly impacts the bandwidth of the permuted matrix. Experience shows that it is better to start with a node having a large eccentricity. A peripheral vertex is a vertex with maximum eccentricity. Finding such vertex is prohibitively expensive, and a common heuristic is to use a pseudo-peripheral vertex instead [12, 5]. A pseudo-peripheral vertex is a vertex displaying a high eccentricity, as close to the graph diameter as possible. Gibbs et al. [13] introduced an algorithm to find such a vertex, which is later refined by George and Liu [12]. The process, given in Algorithm LABEL:algo:seq_pseudo, starts with an arbitrary vertex in and computes its rooted level structure. Then, a vertex in the last level is picked and the corresponding level structure is computed. This process is repeated until the number of levels in the rooted level structure converges. Computing the level structure corresponds to a BFS of the graph .


Function Arguments Returns Serial Communication
Ind : a sparse vector local indices of None
nonzero entries of
Select : a sparse vector an empty sparse vector
: a dense vector for None
: logical expr. on    if then
Set : a sparse vector for None
: a dense vector    
SpMSpV : a sparse matrix AllGather &
: a sparse vector returns AlltoAll on
SR: a semiring subcommunicator [14]
Reduce : a sparse vector mv = maximum value in
: a dense vector for AllReduce
a reduction operation    mv min{mv, }
SortPerm an empty array of tuples AllToAll
: a sparse vector for
: a dense vector    
sort and return the permutation
TABLE I: Matrix-algebraic primitives needed for the RCM algorithm.

Ii-B Previous Work on Parallel RCM

There is a strong connection between BFS, finding a pseudo-peripheral vertex and computing the RCM ordering. However, RCM is often used for matrices with higher diameter than graphs for which parallel BFS is often optimized [14, 15]. In addition, computing the actual RCM ordering is even harder to parallelize because the vertices within each level of the traversal tree has to be ordered by degree.

Computing sparse matrix orderings in parallel have received intermittent attention over the last few decades. While RCM is often used to accelerate iterative solvers, one of its first reported supercomputer-scale implementations was in the context of direct solvers by Ashcraft et al. [16] on a CRAY X-MP. It is hard to compare results from 30 years ago, where both the architectures and the sizes of matrices were significantly different. Karantasis et al. [8] recently studied the shared-memory parallelization of various reordering algorithms including RCM.

Iii Algorithms based on matrix algebra

In this section, we present a distributed memory algorithm for computing the RCM ordering that takes advantage of the equivalence between graph algorithms and sparse matrix algebra to exploit latest hardware platforms. The function computes the number of nonzeros in its input, e.g., returns the number of nonzeros in . We utilize the Matlab colon notation: denotes the th column, and denotes the th row.



Iii-a Primitives for the RCM algorithm

With an aim to design a scalable parallel algorithm, we redesign the sequential algorithms for RCM ordering (Algorithm LABEL:algo:seq_rcm) and finding a pseudo-peripheral vertex (Algorithm LABEL:algo:seq_pseudo) in terms of matrix-algebraic operations. For this purpose, we use the primitives summarized in Table I. A sparse vector is used to represent a subset of vertices. Consider that a subset of unique vertices is represented by a sparse vector . Then, for each vertex in , there is a nonzero entry in the -th location of (i.e., ), and the number of nonzeros in is equal to . The nonzero entries of the sparse vector can store arbitrary values depending on the context of the algorithm. By contrast, a dense vector stores information about all vertices in the graph, hence the length of is always . In sparse matrix computations, we often are interested in where the nonzero elements are in the sparse matrices. Hence, if is an sparse symmetric matrix, then we use a graph to describe the sparsity structure of : , where corresponds to row/column of , for , and when .

Let be a sparse vector, be a dense vector, be a logical unary operation and be a sparse matrix. Ind() returns the indices of nonzero entries in . Select() returns every nonzero entry of where is true. Set() replaces values of by corresponding nonzero entries of (other entries of remain unchanged). Reduce() considers only the nonzero indices of the sparse vector and performs a reduction on the values of the dense vector using the operation . In Table I, we provide an example where the reduction operation is to find the minimum value. SortPerm() creates a tuple for each nonzero index in . The list of tuples are lexicographically sorted and the function returns the sorted permutation (the indices passed to the sorting routine are used to obtain the permutation). Finally, SpMSpv() performs a sparse matrix-sparse vector multiplication over the semiring . For the purposes of this work, a semiring is defined over (potentially separate) sets of ‘scalars’, and has its two operations ‘multiplication’ and ‘addition’ redefined. We refer to a semiring by listing its scaling operations, such as the (multiply, add) semiring. The usual semiring multiply for BFS is select2nd, which returns the second value it is passed. The BFS semiring is defined over two sets: the matrix elements are from the set of binary numbers, whereas the vector elements are from the set of integers.

Fig. 2: An example of using specialized sparse matrix-sparse vector multiplication in exploring the next-level vertices in the RCM algorithm. On the left, we show a BFS tree rooted at vertex . The current frontier is the set of two vertices {, }. Labels of the already explored vertices are shown in red numbers. The adjacency matrix representing the graph is shown on the right. The input sparse vector corresponding to the current frontier is show at the top of the matrix. The sparse vector has two nonzero entries corresponding to and . The values of the nonzero entries are the labels of the vertices.The SpMSpV algorithm over a (select2nd, min) semiring then selects the columns of the matrix corresponding to the nonzero indices of the vector shown in gray columns and retains the minimum product in each row where there exists at least one nonzero entry in the selected columns. The indices of the output vector represent the vertices in the next level while the values in the vector denote the labels of parent vertices. The (select2nd, min) semiring ensures that a child always attaches itself to a parent with the minimum label among all of ’s visited neighbors. Therefore, for vertex , vertex is selected as a parent instead of vertex because the former has a smaller label than the latter.

Iii-B The RCM algorithm using matrix-algebraic primitives

Algorithm LABEL:algo:dist_RCM describes the RCM algorithm using the operations from Table I. Here, we assume that the graph is connected. The case for more than connected components can be handled by repeatedly invoking Algorithm LABEL:algo:dist_RCM for each connected component. Algorithm LABEL:algo:dist_RCM takes a sparse adjacency matrix and a pseudo-peripheral vertex as inputs, and returns the RCM ordering as a dense vector .

The -th element of is initialized to and it retains the initial value until the -th vertex is visited and labeled. Let and be the set of vertices in the current and the next BFS level, respectively. is called the BFS frontier (the set of current active vertices). The algorithm starts by labeling the pseudo-peripheral vertex , inserting it into .

The while loop (lines LABEL:algo:li_while1-LABEL:algo:li_while2 of Algorithm LABEL:algo:dist_RCM) explores the vertices level-by-level until the frontier becomes empty. Vertices in have been already labeled in the previous iteration or during initialization. Hence each iteration of the while loop traverses the unvisited neighbors of and labels vertices in .

At the beginning of the while loop of Algorithm LABEL:algo:dist_RCM, the values of the sparse vector are set to the labels of vertices. Next, we discover vertices that are adjacent to the current frontier by multiplying by over (select2nd, min) semiring (line 7). Here the overloaded multiplication operation selec2nd passes the labels of parents to the children and the overloaded addition operation min ensures that each vertex in becomes a child of a vertex in with the minimum label. This step is explained in Figure 2 with an example. Notice that the specific operator overloading by a (select2nd, min) semiring makes the vertex exploration deterministic, and parallelizing the specialized SpMSpV becomes more challenging than traditional BFS. Thankfully, the linear algebraic formulation with its operator overloading feature hides the RCM-specific complexity and is able to achieve high performance despite the deterministic nature of the computation.

After the vertices in the next level are discovered via SpMSpV, previously labeled vertices are removed from (line 8). The next step is to label the vertices in . For this purpose, we sort vertices in based on the labels of the parents of and the degrees of vertices . The sorted permutation of vertices in is used to set their labels. Finally, we update the labels of the newly visited vertices (line 12) and proceed to the next iteration of the while loop. When the current frontier becomes empty, Algorithm LABEL:algo:dist_RCM returns with the RCM ordering by reversing the labels of the vertices.

Iii-C Finding the pseudo-peripheral vertex

Algorithm LABEL:algo:dist_pseudo finds a pseudo-peripheral vertex using matrix-algebraic operations from Table I. Similar to Algorithm LABEL:algo:seq_pseudo, Algorithm LABEL:algo:dist_pseudo starts with an arbitrary vertex . Initially, the number of levels in the current and previous BFSs are set to and , respectively, so that the while loop in line 4 iterates at least twice before termination. Each iteration of the while loop in line 4 runs a full BFS starting with . As before, and represent the subsets of vertices in the current and next levels of the BFS. stores the BFS level to which each vertex belongs ( denotes an unvisited vertex). The do-while loop (lines 8-16) performs the BFS traversal similar to Algorithm LABEL:algo:dist_RCM. The overloaded addition operation of the SpMSpV at line 12 is set to min only for convenience. It can be replaced by any equivalent operation because the order of vertices within a level of BFS does not matter in the discovery of a pseudo-peripheral vertex. At the end of the BFS, we find a vertex with the minimum degree from the last non-empty level and make it the root for the next BFS (line 17). The while loop at line 4 terminates when the number of levels in the latest BFS is not greater than the previous BFS. At this point, Algorithm LABEL:algo:dist_pseudo returns the pseudo-peripheral vertex .

Iv Distributed memory algorithm

In order to parallelize the RCM algorithm, we simply need to parallelize the matrix-algebraic functions described in Table I.

Iv-a Data distribution and storage

We use the Combinatorial BLAS (CombBLAS) framework [17], which distributes its sparse matrices on a 2D processor grid. Processor stores the submatrix of dimensions in its local memory. Vectors are also distributed on the same 2D processor grid. CombBLAS supports different formats to store its local submatrices. In this work, we use the CSC format as we found it to be the fastest for the SpMSpV operation with very sparse vectors, which is often the case for matrices where RCM is commonly used. CombBLAS uses a vector of {index, value} pairs for storing sparse vectors. To balance load across processors, we randomly permute the input matrix before running the RCM algorithm.

Iv-B Analysis of the distributed algorithm

We measure communication by the number of words moved () and the number of messages sent (). The cost of communicating a length message is where is the latency and is the inverse bandwidth, both defined relative to the cost of a single arithmetic operation. Hence, an algorithm that performs arithmetic operations, sends messages, and moves words takes time.

Since the amount of work is variable across BFS iterations, we analyze the aggregate cost over the whole BFS. For ease of analysis, matrix and vector nonzeros are assumed to be i.i.d. distributed. We also assume a square processor grid . Number of complete breadth-first searches is denoted by . The value of is exactly one in Algorithm LABEL:algo:dist_RCM, but it can be more than one in Algorithm LABEL:algo:dist_pseudo.

We leverage the 2D SpMSpV algorithms implemented in CombBLAS. The complexity of parallel SpMSpV algorithm has been analyzed in this context recently [18], hence we just state the result here:

As described before, the SortPerm function returns the sorted permutation of the vertices in the next frontier based on the lexicographic order of (parent_label, degree, vertex_id), where parent_label is the label of the parent of a vertex in . We observe that parents of are a subset of the current frontier and vertices in are incrementally labeled in the previous iteration. Hence, the number of unique parent labels is less than or equal to . This observation motivates us to design a distributed bucket sort algorithm where the -th processor is responsible for sorting vertices whose parent’s labels fall in the range

Hence, processors exchange tuples using AllToAll and the responsible processors sort tuples locally. After sorting, the vertex indices associated with the tuples work as the inverse permutation. Processors conduct another round of AllToAll (only the indices) to obtain the sorted permutation.

The total number of nonzeros sorted is exactly the sum of the frontier sizes over all iterations, which is . Hence the per-process cost of SortPerm is upper bounded by

using personalized all-to-all [19]. We found our specialized bucket sort to be faster than state-of-the-art general sorting libraries, such as HykSort [20].

V Results

Name Spy Plot Dimensions BW (pre-RCM)
Description Nonzeros BW (post-RCM)
nd24k 72K72K 68,114
3D mesh 29M 10,294
problem 14

Ldoor 952K952K 686,979
structural prob. 42.49M 9,259

Serena 1.39M1.39M 81,578
gas reservoir 64.1M 81,218
simulation 58

audikw_1 943K943K 925,946
structural prob 78M 35,170

dielFilterV3real 1.1M1.1M 1,036,475
higher-order 89.3M 23,813
finite element 84

1.6M1.6M 20,702
3D model of 114M 20,600
a steel flange 199

Li7Nmax6 664K664K 663,498
nuclear configuration 212M 490,000
interaction calculations 7

Nm7 4M4M 4,073,382
nuclear configuration 437M 3,692,599
interaction calculations 5

nlpkkt240 78M78M 14,169,841
Sym. indefinite 760M 361,755
KKT matrix 243

Fig. 3: Structural information on the sparse matrices used in our experiments. All matrices, except two, are from the University of Florida sparse matrix collection [21]. Li7Nmax6 and Nm7 [22] are from nuclear configuration interaction calculations.

V-a Experimental Platform

We evaluated the performance of parallel RCM algorithm on Edison, a Cray XC30 supercomputer at NERSC. In Edison, nodes are interconnected with the Cray Aries network using a Dragonfly topology. Each compute node is equipped with 64 GB RAM and two 12-core 2.4 GHz Intel Ivy Bridge processors, each with 30 MB L3 cache. We used Cray’s MPI implementation, which is based on MPICH2. We used OpenMP for intra-node multithreading and compiled the code with gcc 5.2.0 with -O2 -fopenmp flags. In our experiments, we only used square process grids because rectangular grids are not supported in CombBLAS [17]. When cores were allocated for an experiment, we created a process grid, where was the number of threads per process. In our hybrid OpenMP-MPI implementation, all MPI processes performed local computation followed by synchronized communication rounds. Local computation in every matrix-algebraic kernel was fully multithreaded using OpenMP. Only one thread in every process made MPI calls in the communication rounds.

Graphs SpMP Distributed RCM
BW Runtime (sec) Runtime (sec)
1t 6t 24t 1t 6t 24t
nd24k 10,608 0.26 0.06 0.03 1.45 0.38 0.12
ldoor 9,099 3.25 0.52 0.28 4.63 1.52 0.74
Serena 85,229 1.64 0.49 0.66 7.75 2.26 1.08
audikw_1 34,202 1.31 0.34 0.16 7.31 1.99 0.81
dielFilterV3real 25,436 1.99 0.73 0.46 8.63 2.37 0.95
Flan_1565 20,849 1.86 0.44 0.17 12.11 3.88 1.35
Li7Nmax6 443,991 4.62 1.48 0.87 20.28 4.91 2.85
Nm7 - - - - - - -
nlpkkt240 346,556 57.21 25.17 9.92 - - -
TABLE II: The bandwidth and runtime of the shared-memory RCM implementation in SpMP. SpMP did not finish in 30 minutes for Nm7. On Nm7 and nlpkkt240 matrices, our distributed-memory implementation ran out of memory on a single node of Edison.

V-B Matrix Suite

The sparse matrix test suite used in our experiments are shown in Figure 3. These matrices came from a set of real applications, where either a sparse system

is solved or an eigenvalue problem

is solved. The matrices were chosen to represent a variety of different structures and nonzero densities. Since RCM is only well defined on symmetric matrices, all matrices are symmetric.

In the last column of Figure 3, the original (pre-RCM) as well as final (post-RCM) bandwidth of the matrix are shown. In the majority of the cases, RCM effectively reduces the bandwidth. Serena and Flan_1565 seem to be the only two matrices where RCM was ineffective in that regard.

V-C Shared-memory performance

Our implementation is fully multithreaded to take advantage of the shared-memory parallelism available within a node of modern supercomputers. Here, we compare the quality and runtime of our algorithm with the RCM implementation in SpMP (Sparse Matrix Pre-processing) by Park et al. [23], which is based on optimization from [24] and on the algorithm presented in [8]. The results from SpMP is shown in Table II. For four out of eight matrices where SpMP was able to compute RCM, the RCM ordering from our distributed-memory algorithm (shown in Figure 3) yields smaller bandwidths than SpMP. SpMP is faster than our implementation in shared-memory due to our distributed-memory parallelization overheads. However, SpMP sometimes loses efficiency across NUMA domains. For example, SpMP slows down for Serena on 24 cores compared to 6 cores.

Fig. 13: Runtime breakdown of distributed-memory RCM on different graphs on Edison. Six threads per MPI process are used when the number of cores is greater than or equal to six. Nm7 and nlpkkt240 ran out of memory on a single node of Edison.

One very important thing to note is that in order to compute an ordering using a serial or multithreaded implementation of the RCM algorithm such as SpMP, the matrix structure has to be gathered on a single node. Indeed, in many real-life applications, the matrix is already distributed and this mandatory communication step has a non-negligible cost. For example, it takes over 9 seconds to gather the nlpkkt240 matrix from being distributed over 1024 cores into a single node/core. This time is approximately longer than computing RCM using our algorithm on the same number of cores. One of the key benefit of our approach is that it does not require this step. Adding those times to the time required to compute the ordering itself using SpMP makes our approach highly competitive and often faster.

Fig. 23: Computation and communication time in all SpMSpV calls on different graphs on Edison. Six threads per MPI process are used.
Fig. 24: Breakdown of total runtime when computing RCM ordering for ldoor using one thread per MPI process (flat MPI) on Edison.

V-D Distributed-memory performance

We ran the distributed-memory RCM algorithm on up to 4096 cores of Edison. All performance results shown in this section used six threads per MPI process as it performed the best. Figure 13 shows the strong scaling of the distributed-memory RCM algorithm for nine graphs from Table 3. To better understand the performance, we break down the runtime into five parts at each concurrency where the height of a bar denotes the total runtime of identifying a pseudo-peripheral vertex and computing the RCM ordering.

Our distributed algorithm scales up to 1024 cores for all of the nine graphs as shown in Figure 13. The RCM algorithm attains the best speedup of for Li7Nmax6 and on nd24k on 1024 cores. By contrast, it achieves and speedups for ldoor and Flan_1565. The sharp drop in parallel efficiency on these graphs is due to their relatively high diameters. The level-synchronous nature of our BFS incurs high latency costs and decreases the amount of work per processor on high-diameter graphs. Figure 13 shows that SpMSpV is usually the most expensive operation on lower concurrency. However, SortPerm starts to dominate on high concurrency because it performs an AllToAll among all processes, which has higher latency. The size of the matrix also contributes to the scalability of the distributed-memory RCM algorithm. For example, the largest two matrices in our test set (Nm7 and nlpkkt240) continue to scale on more than 4K cores whereas smaller problems do not scale beyond 1K cores. This demonstrates that our algorithm can scale on even higher core count if larger matrices are given as inputs.

Since the distributed-memory SpMSpV is the most expensive step of our RCM algorithm, we investigate its performance more closely. Figure 23 shows the breakdown of computation and communication time of the distributed SpMSpV primitive. We observe that on lower concurrency, SpMSpV is dominated by its computation, as expected. The communication time of SpMSpV starts to dominate the computation time on higher concurrency, and the crossover point where the communication becomes more expensive than computation depends on the properties of the matrices. Graphs with higher diameters have higher communication overhead than graphs with low diameters. For example, ldoor has one of the highest pseudo-diameters among problems we considered. Hence, the RCM algorithm on ldoor becomes communication bound on 1K cores. By contrast, other lower diameter graphs with similar sizes continue to compute bound and scale beyond 1K cores on Edison.

We have also experimented with a flat MPI approach. This non-threaded RCM implementation had higher communication overhead and ran slower than the multithreaded implementation, especially on higher concurrencies. For example, the flat MPI implementation took longer to compute RCM on the ldoor matrix using 4096 cores of Edison as shown in Figure 24.

Vi Conclusion and Future Work

We have introduced the first distributed-memory implementation of the RCM algorithm. In particular, we have described how the RCM algorithm can be reformulated using the sparse matrix / graph duality, so that the parallel implementation can be accomplished using a small set of parallel primitives that have been highly optimized. Our experiments have shown that the distributed-memory implementation of the RCM algorithm presented in this work scales well up to 1024 processors for smaller matrices, and even to 4096 cores for the largest problems we have evaluated.

We have shown that this new approach is overall faster than the traditional approach that gathers the matrix structure on a single node, followed by the application of a serial or a multithreaded implementation of the RCM algorithm, and then redistributing the permuted matrix. More importantly, our approach removes the memory bottleneck that may be caused by having to store the entire graph with a single node.

A scalable implementation of the RCM algorithm is of prime importance for many iterative solvers that benefit from reordering the matrix. We have assessed this on a sample problem using the conjugate gradient method with block Jacobi preconditioner available in PETSc.

Immediate future work involves finding alternatives to sorting (i.e. global sorting at the end, or not sorting at all and sacrifice some quality). Longer term work would investigate alternative BFS formulations that are not level-synchronous, as the existing approach has trouble scaling to high-diameter graphs.


We thank Hari Sundar for sharing the source code of HykSort. This work is supported by the Applied Mathematics and the SciDAC Programs of the DOE Office of Advanced Scientific Computing Research under contract number DE-AC02-05CH11231. We used resources of the NERSC supported by the Office of Science of the DOE under Contract No. DE-AC02-05CH11231.


  • [1] I. S. Duff, J. K. Reid, and J. A. Scott, “The use of profile reduction algorithms with a frontal code,” International Journal for Numerical Methods in Engineering, vol. 28, no. 11, pp. 2555–2568, 1989.
  • [2] I. S. Duff and G. A. Meurant, “The effect of ordering on preconditioned conjugate gradients,” BIT Numerical Mathematics, vol. 29, no. 4, pp. 635–657, 1989.
  • [3] C. H. Papadimitriou, “The NP-completeness of the bandwidth minimization problem,” Computing, vol. 16, no. 3, pp. 263–270, 1976.
  • [4] E. Cuthill and J. McKee, “Reducing the bandwidth of sparse symmetric matrices,” in Proc. of 24th national conference.   ACM, 1969, pp. 157–172.
  • [5] A. George and J. W.-H. Liu, Computer Solution of Large Sparse Positive Definite Systems.   Englewood Cliffs, New Jersey: Prentice-Hall Inc., 1981.
  • [6] S. Sloan, “An algorithm for profile and wavefront reduction of sparse matrices,” International Journal for Numerical Methods in Engineering, vol. 23, no. 2, pp. 239–251, 1986.
  • [7] S. Balay, S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, H. Zhang, and H. Zhang, “PETSc users manual,” Argonne National Laboratory, Tech. Rep. ANL-95/11 - Revision 3.7, 2016. [Online]. Available:
  • [8] K. I. Karantasis, A. Lenharth, D. Nguyen, M. J. Garzaran, and K. Pingali, “Parallelization of reordering algorithms for bandwidth and wavefront reduction,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’14).   IEEE, 2014, pp. 921–932.
  • [9] J. A. George, “Computer implementation of the finite element method,” Ph.D. dissertation, Stanford, CA, USA, 1971.
  • [10] B. Claude, The Theory of Graphs and its Applications.   John Wiley & Sons, 1962.
  • [11] I. Arany, L. Szoda, and W. Smyth, “An improved method for reducing the bandwidth of sparse symmetric matrices.” in IFIP Congress (2), 1971, pp. 1246–1250.
  • [12] A. George and J. W. H. Liu, “An implementation of a pseudoperipheral node finder,” ACM Transactions on Mathematical Software (TOMS), vol. 5, no. 3, pp. 284–295, Sep. 1979.
  • [13] N. E. Gibbs, W. G. Poole, Jr, and P. K. Stockmeyer, “An algorithm for reducing the bandwidth and profile of a sparse matrix,” SIAM Journal on Numerical Analysis (SINUM), vol. 13, no. 2, pp. 236–250, 1976.
  • [14] A. Buluç and K. Madduri, “Parallel breadth-first search on distributed memory systems,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’11).   ACM, 2011, pp. 65:1–65:12.
  • [15] F. Checconi, F. Petrini, J. Willcock, A. Lumsdaine, A. R. Choudhury, and Y. Sabharwal, “Breaking the speed and scalability barriers for graph exploration on distributed-memory machines,” in International Conference for High Performance Computing, Networking, Storage and Analysis (SC’12).   IEEE, 2012, pp. 1–12.
  • [16] C. C. Ashcraft, R. G. Grimes, J. G. Lewis, B. W. Peyton, H. D. Simon, and P. E. Bjørstad, “Progress in sparse matrix methods for large linear systems on vector supercomputers,” International Journal of High-Performance Computing Applications (IJHPCA), vol. 1, no. 4, pp. 10–30, 1987.
  • [17] A. Buluç and J. R. Gilbert, “The Combinatorial BLAS: Design, implementation, and applications,” International Journal of High-Performance Computing Applications (IJHPCA), vol. 25, no. 4, 2011.
  • [18] A. Azad and A. Buluç, “Distributed-memory algorithms for maximum cardinality matching in bipartite graphs,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS), 2016.
  • [19] J. Bruck, C.-T. Ho, S. Kipnis, E. Upfal, and D. Weathersby, “Efficient algorithms for all-to-all communications in multiport message-passing systems,” IEEE Transactions on Parallel and Distributed Systems (TPDS), vol. 8, no. 11, pp. 1143–1156, 1997.
  • [20] H. Sundar, D. Malhotra, and G. Biros, “HykSort: a new variant of hypercube quicksort on distributed memory architectures,” in International Conference on Supercomputing (ICS).   ACM, 2013, pp. 293–302.
  • [21] T. A. Davis and Y. Hu, “The university of florida sparse matrix collection,” ACM Transactions on Mathematical Software (TOMS), vol. 38, no. 1, p. 1, 2011.
  • [22] H. M. Aktulga, A. Buluç, S. Williams, and C. Yang, “Optimizing sparse matrix-multiple vectors multiplication for nuclear configuration interaction calculations,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS).   IEEE Computer Society, 2014.
  • [23] J. Park et al., “SpMP: Sparse Matrix Pre-processing.” [Online]. Available:
  • [24] J. Chhugani, N. Satish, C. Kim, J. Sewall, and P. Dubey, “Fast and efficient graph traversal algorithm for CPUs: Maximizing single-node efficiency,” in IEEE International Parallel & Distributed Processing Symposium (IPDPS).   IEEE, 2012, pp. 378–389.