1. Introduction
The use of graphs to model large-scale, real-world data is ubiquitous in our everyday lives. In order to analyze and study the relationships these graphs model, graph analytic operations, such as finding patterns of interest, analyzing the community structure and connectivity, and determining influential entities in a given graph, are commonly used.
One such graph analytic operation is counting the number of triangles in a graph. The number of triangles in a graph is an important statistic that is used as an intermediary step in various applications. It is used in computing the clustering coefficient and the transitivity ratio of graphs, both of which characterize the tendency of the nodes in a graph to cluster together. Furthermore, the computations involved in triangle counting form an important step in computing the truss decomposition of a graph, detecting community structures, studying motif occurrences, detecting spamming activities, and understanding the structure of biological networks (Girvan and Newman, 2002; Smith et al., 2017; Shrivastava et al., 2008; Watts and Strogatz, 1998).
In recent years, driven by the growing size of the graphs that need to be analyzed, there has been significant research in improving the efficiency of parallel algorithms for computing the exact and approximate number of triangles. Parallel triangle counting algorithms have been specifically built for GPU, external-memory, shared-memory, and distributed-memory platforms (Green et al., 2014; Hu et al., 2018; Bisson and Fatica, 2018; Polak, 2016; Wang et al., 2016; Arifuzzaman et al., 2017; Parimalarangan et al., 2017; Shun and Tangwongsan, 2015; Yaşar et al., 2018). The shared-memory class of solutions is limited by the amount of memory that is available on a single processor, which limits the size of the graphs that can be analyzed. Moreover, in many practical settings, such large graphs are stored in a distributed fashion in the aggregate memory of a distributed-memory system. Successfully analyzing large graphs in such scenarios requires developing distributed-memory algorithms for counting the number of triangles. However, despite the advantages of a distributed-memory system, these algorithms incur higher communication and synchronization costs than their shared-memory counterparts. Furthermore, distributed-memory graph algorithms also entail the problem of intelligent graph partitioning in order to reduce the costs involved in communicating with neighboring vertices.
We present an MPI-based distributed-memory algorithm for triangle counting using a set-intersection-based approach (Tom et al., 2017). The key difference between our algorithm and previously proposed approaches is that it utilizes a 2D decomposition of the data and associated computations, which increases the concurrency that can be exploited and reduces the overall communication cost. Furthermore, our algorithm moves the data among the processors by utilizing a sequence of communication steps that are similar to those used by Cannon's parallel matrix-matrix multiplication algorithm. This ensures that our algorithm is memory scalable and incurs low communication overhead. We also include key optimizations that leverage the sparsity of the graph and the way the computations are structured. Some of these optimizations include enumerating the triangles using an ordering that specifically leverages hashmaps in the set intersection computation, changing the hashing routine for vertices based on the density of their adjacency lists, and eliminating unnecessary intersection operations.
We evaluate the performance of our algorithm on various real-world and synthetically generated graphs and compare it against other existing state-of-the-art approaches. Our experiments show that, using 169 MPI ranks, we obtain a relative speedup that ranges between 3.06 and 6.93 out of an expected 10.56 across the datasets, over the performance achieved by 16 MPI ranks. Moreover, the performance of our parallel algorithm compares favorably against those achieved by existing distributed-memory algorithms that rely on 1D decomposition (Pearce et al., 2013; Arifuzzaman et al., 2017; Kanewala et al., 2018).
The rest of the paper is organized as follows. Section 2 introduces the notation and definitions used in the paper, Section 3 details the necessary background, Section 4 describes competing approaches, Section 5 explains our 2D parallel algorithm, Section 6 describes our experimental methodology, and finally, Section 7 details several experiments that demonstrate our algorithm's speedup and scaling capabilities.
2. Definitions and notations
We will assume that the graphs that we operate on are simple and undirected, and are represented using the standard notation G = (V, E), where V is the set of vertices and E is the set of edges. We will use Adj(v) to denote the adjacency list of vertex v; i.e., the set of vertices that are adjacent to v. We will use d(v) to denote the degree of v, i.e., d(v) = |Adj(v)|. We will use % to indicate a mod operation and / to indicate a divide operation.
We will use A to denote the adjacency matrix of a symmetric n-vertex graph G, in which A(i, j) = 1, if there is an edge between vertices v_i and v_j, and 0, otherwise. Furthermore, we will use U and L to denote the upper and the lower triangular portions of the adjacency matrix of G. A triangle is a set of three vertices {v_i, v_j, v_k} such that the edges (v_i, v_j), (v_j, v_k), and (v_i, v_k) exist in E. The problem of triangle counting is to compute the total number of unique triangles in G.
Lastly, let there be p processors in the system, which can be arranged in a square grid of the form √p × √p. The processor in the ith row and jth column of this square grid is denoted by p_{i,j}.
3. Background
3.1. Triangle counting
Triangle counting algorithms iterate over all the edges in a graph and, for each edge, count the number of triangles that the edge is a part of. To ensure that they do not count the same triangle multiple times, they impose strict vertex enumeration rules and ignore any triangles that do not satisfy them. There are two such rules, which we refer to as ⟨i, j, k⟩ and ⟨j, i, k⟩, where i < j < k. The ⟨i, j, k⟩ rule dictates that the algorithm starts from the first vertex (i = 1) and iterates over the nonzero entries of U in a row-wise fashion (i.e., edges of the form (i, j)) and considers only the triangle-closing vertices k, where k > j. In contrast, the ⟨j, i, k⟩ rule dictates that the algorithm also starts from the first vertex and considers the triangle-closing vertices k, where k > j, but iterates over the nonzero entries of L in a column-wise fashion (i.e., edges of the form (j, i)).
Using the ⟨i, j, k⟩ rule, it is easy to show that the number of triangles, tc(i, j), that result from U's edge (i, j) is

(1) tc(i, j) = Σ_{k=j+1}^{n} U(i, k) U(j, k),

and the total number of triangles in the graph is

(2) tc(G) = Σ_{(i, j) ∈ U} tc(i, j).
The computation associated with Equation 1 can be viewed as a set-intersection operation between the adjacency lists of v_i and v_j, where (i, j) ∈ E, and can be modeled as a list-based intersection or a map-based intersection. For every edge (i, j) in U, the list-based intersection involves jointly traversing the adjacency lists of vertices v_i and v_j, and finding the common vertices v_k between them, such that k > j. In the map-based approach, we use an auxiliary data structure and hash Adj(v_i) (or, Adj(v_j)) within it. This auxiliary data structure is then used to look up the vertices of Adj(v_j) (or, Adj(v_i)). Each successful lookup accounts for a complete triangle. On comparing the performance of the two approaches, we observe that map-based approaches are faster than list-based approaches. This is because, for each vertex, the hashmap used to store its adjacency list can be reused across all of the edges incident on it. The list-based and map-based methods are further detailed in (Parimalarangan et al., 2017; Shun and Tangwongsan, 2015; Tom et al., 2017).
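The map-based intersection can be sketched as follows; the open-addressing table, its power-of-two size MAPSZ, and the helper names are our assumptions for illustration, not the exact structures of (Tom et al., 2017).

```c
#define MAPSZ 16          /* assumed power-of-two hashmap size */
#define EMPTY (-1)

/* Insert every vertex of adj[] into an open-addressing hashmap. */
static void map_build(int *map, const int *adj, int len) {
    for (int s = 0; s < MAPSZ; s++) map[s] = EMPTY;
    for (int t = 0; t < len; t++) {
        int slot = adj[t] & (MAPSZ - 1);          /* hash = id mod MAPSZ */
        while (map[slot] != EMPTY)                /* linear probing */
            slot = (slot + 1) & (MAPSZ - 1);
        map[slot] = adj[t];
    }
}

/* Count triangles closed by edge (i, j): look up each w in adj_j with
   w > j inside the hashmap built from the adjacency list of v_i. */
static int map_intersect(const int *map, const int *adj_j, int len_j, int j) {
    int count = 0;
    for (int t = 0; t < len_j; t++) {
        int w = adj_j[t];
        if (w <= j) continue;                     /* enumeration rule k > j */
        int slot = w & (MAPSZ - 1);
        while (map[slot] != EMPTY) {
            if (map[slot] == w) { count++; break; }
            slot = (slot + 1) & (MAPSZ - 1);
        }
    }
    return count;
}
```

The hashmap built by map_build for one vertex can be reused for all of its edges, which is the source of the speed advantage noted above.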
Note that an alternative way of writing Equation 1 is

(3) tc(i, j) = Σ_{k=1}^{n} U(i, k) L(k, j),

since all entries of L(k, j) for k ≤ j are by definition 0. Given this, we can conceptually view the computations associated with triangle counting as that of computing certain entries of the matrix that results from multiplying the matrices U and L. Specifically, the entries that we care about are those that correspond to the nonzero entries of U. We will use Unz to denote the nonzero entries of U, and we will use

(4) C = Unz ∘ (U L)

to denote this operation of computing the entries of C, where ∘ restricts the product to the positions of the nonzero entries of U; the total number of triangles is then the sum of the entries of C.
Furthermore, studies (Arifuzzaman et al., 2017) have shown that ordering the vertices in non-decreasing degree before the triangle counting step leads to lower runtimes. This optimization has been incorporated in most triangle counting algorithms that have been developed so far. By using such a degree-based ordering, the length of v_i's adjacency list will tend to be smaller than that of v_j's for every edge (i, j) with i < j. Therefore, if v_j's adjacency list is considerably larger than that of v_i's, then it is beneficial to employ the ⟨j, i, k⟩ enumeration scheme and hash v_j's adjacency list. In our previous work on triangle counting algorithms for shared-memory platforms (Tom et al., 2017), the ⟨j, i, k⟩ enumeration scheme along with map-based intersection approaches gave us faster runtimes as compared to other approaches. In order to use the ⟨j, i, k⟩ enumeration scheme, the nonzero entries of L are used, and Lnz is used to denote the nonzero entries of L. As in Equation 4, C is equal to Lnz ∘ (U L).
3.2. Cannon’s parallel matrix multiplication algorithm
Cannon’s algorithm is a widely used method for multiplying two matrices in parallel (Cannon, 1969). It is a dense, memory-efficient matrix multiplication algorithm that performs the multiplication operation in √p shifts. Let A and B be the input matrices that need to be multiplied, and C be the resulting matrix, i.e., C = A × B. The algorithm assumes a 2D block distribution of the input matrices and distributes these blocks across the square (√p × √p) processor grid such that processor p_{i,j} locally stores the blocks A_{i,(i+j) % √p} and B_{(i+j) % √p, j}. The algorithm multiplies these local blocks first. The local blocks of A and B are then shifted left along the row and up along the column, respectively, and multiplied again. For dense matrix multiplication, every processor is busy with computations after each shift, and the number of communication steps is bounded by √p.
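A serial sketch of Cannon's schedule (with 1×1 "blocks" on an assumed 3×3 grid) shows how the initial alignment plus the shifts together cover every term of C = A × B; the grid size Q and function name are ours.

```c
#define Q 3  /* assumed grid dimension, for illustration */

/* Serial simulation of Cannon's schedule: at shift k, the processor at
   grid position (i, j) multiplies A[i][(i+j+k) % Q] with
   B[(i+j+k) % Q][j] and accumulates into C[i][j].  Over the Q shifts,
   the index s = (i+j+k) % Q takes every value 0..Q-1, reproducing the
   full inner product of C = A * B. */
static void cannon(int A[Q][Q], int B[Q][Q], int C[Q][Q]) {
    for (int i = 0; i < Q; i++)
        for (int j = 0; j < Q; j++)
            C[i][j] = 0;
    for (int k = 0; k < Q; k++)              /* the Q shifts */
        for (int i = 0; i < Q; i++)
            for (int j = 0; j < Q; j++) {
                int s = (i + j + k) % Q;     /* initial alignment + k shifts */
                C[i][j] += A[i][s] * B[s][j];
            }
}
```

In the real algorithm each (i, j) pair is a separate processor holding one block per shift, so the loop over k corresponds to the sequence of left/up block rotations.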
4. Related work
Distributed-memory algorithms are based on concepts similar to the ones described in Section 3.
Pearce et al. (Pearce, 2017) developed a triangle counting application on top of HavoqGT (Pearce et al., 2013), an asynchronous vertex-centric graph analytics framework for distributed memory. Their approach distributes the undirected graph using distributed delegates (Pearce, 2017), followed by a two-core decomposition of the graph, which removes the vertices that cannot be part of any triangle. After this, they reorder the vertices based on degree to compute wedges. These wedges, partitioned using a 1D decomposition, are then queried for closure to check for the existence of a triangle.
Arifuzzaman et al. (Arifuzzaman et al., 2017) developed two different approaches: one avoids communication by using overlapping partitions, whereas the other optimizes memory usage. In the communication-avoiding approach, the vertices of the graph are partitioned into p disjoint subsets, where p is the number of processors. Each processor is responsible for storing one of these subsets and the corresponding adjacency lists. Additionally, in order for each processor to work independently, the adjacency lists of the vertices adjacent to these vertices are stored as well. Since most real-world sparse graphs follow a power-law degree distribution, a naive partitioning of the vertices of such a graph will lead to high memory overheads, as the lengths of the adjacency lists will be very skewed. Furthermore, the triangle counting operation will also incur high load imbalance, which will negatively impact the performance. Arifuzzaman et al. have explored these aspects and have developed various partitioning schemes in order to load balance their computations. In order to reduce the memory overheads of the above approach, Arifuzzaman et al. have further developed a space-efficient method, which involves higher communication costs. In this approach, they partition the vertices across processors into disjoint subsets and only store the adjacency lists of these vertices. Consequently, only one copy of the graph exists across all the processors. For every intersection operation, they follow a push-based mechanism, in which the adjacency list of a vertex is sent to the processors that require this particular list for performing the intersection. However, this leads to high communication overheads.
Kanewala et al. (Kanewala et al., 2018) describe a distributed-memory algorithm for triangle counting. Their algorithm explores different combinations of the upper and the lower triangular parts of the adjacency matrix to perform the set intersection operations (refer to Section 3) between the adjacency lists. In order to parallelize the algorithm in a distributed setting, they perform a 1D decomposition of the adjacency matrix and send the adjacency list of a vertex to the rank that stores the adjacency lists of its adjacent vertices. However, in order to curb the number of messages generated, they block vertices and their adjacency lists and process them in blocks.
5. Methods
This section presents our parallel algorithm for triangle counting on distributed-memory parallel systems. Our algorithm reads the input graph, preprocesses it to reorder the vertices in non-decreasing degree among other operations, and stores the graph using the compressed sparse row (CSR) format prior to triangle counting. Our implementation is based on the map-based triangle counting approach outlined in (Tom et al., 2017), which was briefly described in Section 3.
5.1. Parallelization
Task Decomposition and Data Distribution
In our algorithm, we treat the computations required for computing an entry of the matrix C (Equation 3) as an indivisible computational task. We decompose the computations among the processors by mapping the tasks associated with the nonzero entries of U using a 2D decomposition of Unz. Specifically, the processors are organized in the form of a √p × √p processor grid and each processor is responsible for the elements of C that correspond to the entries of Unz that were assigned to it.
However, as we only consider the upper triangular matrix, a naive 2D block partitioning will lead to load imbalance. Moreover, as the vertices are sorted in non-decreasing degree, the length of the adjacency list increases as the vertex id increases. Therefore, the tasks associated with the extreme right and the lower parts of the matrix will be computationally more expensive, as they employ such vertices for the intersection. This further contributes to the load imbalance.
To address both issues and evenly distribute the work among the processors, we perform a cyclic distribution of Unz over the √p × √p processor grid. Because of the degree-based ordering, successive rows/columns in the upper and lower triangular portions of the adjacency matrix will have a similar number of nonzeros. Consequently, a cell-by-cell cyclic distribution will tend to assign a similar number of nonzeros (tasks) of Unz and, at the same time, a similar number of light and heavy tasks to each processor.
Furthermore, in order to align the input blocks of U and L with the tasks owned by the processors, we decompose both matrices using a cyclic distribution over the √p × √p processor grid. After the decomposition, as the set of vertices assigned to a processor is no longer contiguous, the adjacency list of a vertex is accessed using its transformed index in the per-processor CSR representation. Let the blocks U_{i,j} and L_{i,j} be the respective decompositions of U and L over the grid, such that processor p_{i,j} is responsible for those blocks.
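Under a cyclic distribution, the mapping between a global vertex index and its owner and transformed (local) index reduces to mod and divide operations; a minimal sketch, with hypothetical helper names:

```c
/* Cyclic distribution of matrix rows over a grid with q rows (sketch):
   global row g is owned by grid row g % q and is stored locally at
   position g / q, which plays the role of the "transformed index" used
   to address the per-processor CSR arrays. */
static int owner_row(int g, int q)    { return g % q; }
static int local_index(int g, int q)  { return g / q; }
static int global_index(int l, int r, int q) { return l * q + r; }  /* inverse */
```

The same pair of operations is applied independently along the row and column dimensions of the grid.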
Orchestration of Computation and Communication
Consider the set of tasks Unz_{i,j} that processor p_{i,j} is responsible for. For every task in that block, in order to apply Equation 4, it requires the adjacency lists of the associated sets of row and column vertices. Thus, p_{i,j} needs the blocks U_{i,0}, ..., U_{i,√p−1} and L_{0,j}, ..., L_{√p−1,j} to determine the number of triangles in Unz_{i,j}, and following the convention of Equation 4, the associated computations can be written as

(5) C_{i,j} = Unz_{i,j} ∘ ( Σ_{k=0}^{√p−1} U_{i,k} L_{k,j} ).
Though this can be done by having each processor first collect the necessary row and column blocks of the matrices U and L, respectively, and then proceed to perform the required computations, such an approach will increase the memory overhead of the algorithm. We address this problem by observing that the above summation can be performed using the communication pattern of Cannon's 2D parallel matrix multiplication algorithm, and we utilize the same pattern in our algorithm (refer to Section 3). In terms of this communication pattern, Equation 5 can be rewritten as (recall that % indicates a mod operation)

(6) C_{i,j} = Unz_{i,j} ∘ ( Σ_{k=0}^{√p−1} U_{i,(i+j+k) % √p} L_{(i+j+k) % √p, j} ).
In each shift, for each nonzero element (i, j) that exists in the task block, we hash the row U(i, :) that exists in the current block of U and look up the vertices that exist in the column L(:, j) of the current block of L to find the number of triangles incident on the edge (i, j).
The initial shifts of Cannon's algorithm send the block U_{i,(i+j) % √p} and the block L_{(i+j) % √p, j} to processor p_{i,j}. After performing the triangle counting operation on the blocks associated with that shift, the block of U is sent left to p_{i,(j−1) % √p} and the block of L is sent up to p_{(i−1) % √p, j} in the subsequent shifts, and the triangle counting operation is performed as before. Every processor accumulates the count of triangles corresponding to the tasks stored in its task block over the √p shifts, and this count is globally reduced over all the processors in the grid at the end.
Finally, as discussed in Section 3, to leverage the benefits of enumerating the triangles using the ⟨j, i, k⟩ scheme in the map-based triangle counting approach, we define a task based on the nonzero elements in L instead of U, as L contains the incidence list for each vertex. Therefore, L, instead of U, is cyclically distributed to construct the task blocks, denoted by Lnz_{i,j}.
5.2. Optimizations
We include several optimizations that leverage the characteristics of sparse graphs to further increase the performance of our distributed-memory algorithm. These are detailed below.
Modifying the hashing routine for sparser vertices
Due to the 2D decomposition and the fact that we perform the required computations by operating on blocks of U and L, the lengths of the adjacency lists that are being intersected will tend to be smaller (on average, they should be smaller by a factor of √p). A direct consequence of this is that, even with a moderately sized hashmap, the number of collisions will tend to be small. In order to take advantage of this, before we hash the adjacency list of a vertex within the triangle counting routine, we heuristically determine whether the vertex will be involved in collisions by utilizing the length of its adjacency list. If the length of the adjacency list is less than the size of the hashmap, then those vertices will face no collision while being hashed. Such vertices are hashed by performing a direct bitwise AND operation without involving any probing.
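A sketch of this idea; note that we make the collision check explicit here (the helper name and the power-of-two table size are our assumptions), whereas the text above uses the adjacency-list length as the heuristic trigger:

```c
#include <string.h>

#define MAPSZ 16  /* assumed power-of-two hashmap size */

/* Decide whether a vertex can use the cheap, probe-free hashing path:
   insertion with a single bitwise AND is only safe when no two
   neighbors share a slot under id & (MAPSZ - 1).  A short adjacency
   list makes this likely, and this sketch verifies it explicitly
   before switching to the probe-free path. */
static int direct_hash_ok(const int *adj, int len, char *used) {
    if (len >= MAPSZ) return 0;       /* too dense: must probe */
    memset(used, 0, MAPSZ);
    for (int t = 0; t < len; t++) {
        int slot = adj[t] & (MAPSZ - 1);
        if (used[slot]) return 0;     /* collision: fall back to probing */
        used[slot] = 1;
    }
    return 1;
}
```

When this returns 1, both insertion and lookup reduce to one AND and one array access, with no probing loop.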
Doubly sparse traversal of the CSR structure
As we are performing a block cyclic distribution of the upper and the lower triangular portions of the adjacency matrix, multiple vertices allocated to a processor may not contain any adjacent vertices. This is because each processor will have roughly n/√p of the adjacency lists, and if all of the vertices in the adjacency list of a vertex have a degree that is less than the degree of the vertex itself, then that vertex's adjacency list in the local block could be rendered empty. However, these vertices cannot be directly eliminated from the CSR structure due to the indexing scheme we use to avoid maintaining offsets. Therefore, in order to eliminate the unnecessary looping over these vertices while performing the set intersection operation, we use a data structure that is inspired by the doubly compressed sparse row structure (Buluç and Gilbert, 2010) to store the task matrix, as well as the upper and the lower triangular portions of the adjacency matrix. In our algorithm, while creating the CSR structure for each processor, we also associate with it a list of the vertices that have nonempty adjacency lists. We use this list of vertices to index into the CSR structure; thus, we avoid any vertices that have empty adjacency lists.
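A minimal sketch of the doubly-sparse traversal, assuming a CSR block augmented with the list of nonempty rows (the struct and field names are ours, not the paper's):

```c
/* CSR block plus the list of rows that actually have nonzeros,
   inspired by the doubly compressed sparse row structure. */
typedef struct {
    const int *rowptr;   /* size nrows + 1 */
    const int *colind;   /* column indices of the nonzeros */
    const int *nzrows;   /* rows r with rowptr[r+1] > rowptr[r] */
    int nnzrows;         /* number of such rows */
} dcsr_t;

/* Traverse only the nonempty rows, skipping empty ones entirely. */
static int dcsr_count_entries(const dcsr_t *m) {
    int total = 0;
    for (int t = 0; t < m->nnzrows; t++) {
        int r = m->nzrows[t];                      /* next nonempty row */
        total += m->rowptr[r + 1] - m->rowptr[r];
    }
    return total;
}
```

The loop bound is the number of nonempty rows rather than the number of rows, which is the entire point of the optimization when many local adjacency lists are empty.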
Eliminating unnecessary intersection operations
While performing the intersection operation between two adjacency lists, we only need to consider those triangle-closing vertices v_k that satisfy k > j. However, as the adjacency lists are split over the processors, it is possible that some of the participating vertices are assigned to other processors, and performing an intersection operation with these entries will result in no triangles. Therefore, to overcome this problem, while performing the hashmap lookups, we traverse the row of U backwards, that is, starting from the last adjacent vertex of v_i, and break out of the loop as soon as we encounter an adjacent vertex id that is smaller than the last adjacent vertex id in the column of L. With this, we weed out the unnecessary intersection operations that would otherwise result in no common vertices. Since the adjacency lists stored in the CSR structure are not otherwise required to be sorted for any given vertex, this optimization requires an initial sort before we start the triangle counting phase. However, the cost incurred by this sorting is amortized over the many set intersection operations that take place in this phase.
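The backward traversal with early termination can be sketched as a reverse scan of a sorted adjacency list that stops at a cutoff id (hypothetical helper name; the cutoff plays the role of the last adjacent vertex id in the corresponding column):

```c
/* The adjacency list is sorted ascending, so walking it backwards
   visits the largest ids first; once an id drops below the cutoff,
   every remaining entry is smaller too and the scan can stop, skipping
   lookups that cannot produce a triangle. */
static int lookups_at_or_above(const int *adj, int len, int cutoff) {
    int n = 0;
    for (int t = len - 1; t >= 0; t--) {
        if (adj[t] < cutoff) break;   /* all earlier entries are smaller */
        n++;
    }
    return n;
}
```

The initial sort mentioned above is what makes the early break sound: without sorted lists, a small id would not imply that the remaining entries are also small.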
Reducing overheads associated with communication
In order to eliminate the cost of serializing and deserializing memory during the MPI communication steps, we allocate the memory associated with all of the information for a sparse matrix as a single blob, and "allocate" from within that blob the various arrays of the sparse matrix that are required for the processing of the algorithm. Specifically, we convert and store the blocks of U and L as blobs of bytes before the shifts begin. This gives us some savings with respect to the amount of time spent in communication.
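A sketch of the single-blob layout, assuming a CSR block whose rowptr and colind arrays are carved out of one contiguous allocation (the struct and helper names are ours):

```c
#include <stddef.h>

/* A CSR block whose arrays live back to back inside one allocation, so
   the whole block can be shipped with a single send of
   csr_blob_bytes(...) bytes and no (de)serialization. */
typedef struct { int nrows, nnz; int *rowptr, *colind; } csr_t;

static size_t csr_blob_bytes(int nrows, int nnz) {
    return (size_t)(nrows + 1 + nnz) * sizeof(int);
}

/* Point the CSR arrays at their regions inside the blob. */
static void csr_attach(csr_t *m, void *blob, int nrows, int nnz) {
    m->nrows  = nrows;
    m->nnz    = nnz;
    m->rowptr = (int *)blob;            /* first nrows + 1 ints */
    m->colind = m->rowptr + nrows + 1;  /* remaining nnz ints */
}
```

After a shift, the receiver simply calls csr_attach on the received buffer to recover a usable CSR block in place.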
5.3. Preprocessing
Our algorithm assumes that the graph is initially stored using a 1D distribution, in which each processor has n/p vertices and their associated adjacency lists. Given that, it proceeds to perform a series of preprocessing steps whose goals are to (i) perform a cyclic distribution of the graph followed by a relabeling of the vertices, (ii) reorder the vertices of the graph in non-decreasing degree, (iii) redistribute the graph based on the 2D cyclic distribution required by our parallel formulation, and (iv) create the upper and the lower triangular portions of the adjacency matrix. The rationale behind these preprocessing steps and our corresponding parallel formulations are described in the rest of this section.
Initial redistribution
In some cases, the initial distribution of the graph may be such that, even though each processor has n/p vertices, it may have adjacency lists with significant variation in the number of nonzero elements. In order to reduce the load imbalance incurred while preprocessing graph datasets that contain localized sets of highly dense vertices, we perform an initial cyclic distribution of the graph and relabel the vertices in the adjacency lists accordingly.
Reorder vertices based on nondecreasing degree
Recall from the discussion in Section 3 that, prior to triangle counting, the graph vertices are reordered in non-decreasing degree, as this significantly reduces the time required for triangle counting. To achieve this, the vertices of the graph and the vertices in their respective adjacency lists are relabeled based on their positions after sorting the vertices in non-decreasing degree. To perform this relabeling efficiently, we use a distributed counting sort. Note that a side effect of this reordering is on the utilization of the available locality. Hashmap-based triangle counting routines could potentially make use of the locality obtained by processing the vertices in an order such that vertices with similar adjacency lists are processed together. However, although the degree-based ordering destroys this locality, the gains achieved by such a reordering result in faster runtimes, as shown in (Arifuzzaman et al., 2017).
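A serial sketch of the counting-sort relabeling (the distributed version replaces the local prefix sum with a prefix scan across ranks; the function name and the MAXDEG bound are our assumptions):

```c
#include <string.h>

#define MAXDEG 8  /* assumed upper bound on the degree, for illustration */

/* Counting sort on degrees: newlabel[v] is the position of vertex v
   after sorting all vertices in non-decreasing degree.  Ties are
   broken by original vertex id, which keeps the sort stable. */
static void relabel_by_degree(const int *deg, int n, int *newlabel) {
    int count[MAXDEG + 1];
    memset(count, 0, sizeof(count));
    for (int v = 0; v < n; v++) count[deg[v]]++;   /* histogram */
    int pos = 0;                                   /* prefix sum */
    for (int d = 0; d <= MAXDEG; d++) {
        int c = count[d];
        count[d] = pos;     /* first position for vertices of degree d */
        pos += c;
    }
    for (int v = 0; v < n; v++)
        newlabel[v] = count[deg[v]]++;
}
```

In the distributed setting, the histogram is reduced across ranks and the per-degree start positions come from a prefix scan, but the per-vertex relabeling step is the same.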
Create the upper and the lower triangular portion of the adjacency matrix
Recall from Subsection 5.1 that the upper and the lower triangular portions of the adjacency matrix are used to perform the computations involved in the triangle counting phase. The algorithm first processes its local chunk of vertices and the associated adjacency lists to determine the vertices that need to be distributed to remote nodes to form the 2D cyclic distribution. Once each node has received its portion of the adjacency matrix, a single scan through the adjacency list of each vertex in the chunk is performed to create the upper and the lower triangular portions. To split the matrix into these triangular portions, the degree of a vertex is compared with the degree of an adjacent vertex, and the adjacent vertex is placed in the upper portion if its degree is greater than that of the former; if not, it is placed in the lower portion of the adjacency matrix. Moreover, since the vertices are reordered based on non-decreasing degree, the global positions of the vertices can be used to compare their degrees. However, in many scenarios, the position of the adjacent vertex is not locally available. Thus, this requires us to perform a communication step with all other nodes, which adds to the overheads of the parallel algorithm.
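The split of an adjacency list into its upper and lower triangular portions, using the relabeled vertex ids as degree proxies, can be sketched as follows (hypothetical helper name):

```c
/* After the degree-based relabeling, a neighbor with a larger label
   has a larger (or equal) degree and goes to the upper triangular
   part; a neighbor with a smaller label goes to the lower part. */
static void split_adj(int v, const int *adj, int len,
                      int *upper, int *nu, int *lower, int *nl) {
    *nu = *nl = 0;
    for (int t = 0; t < len; t++) {
        if (adj[t] > v) upper[(*nu)++] = adj[t];
        else            lower[(*nl)++] = adj[t];
    }
}
```

The communication step mentioned above is needed precisely to obtain the relabeled id of a remote neighbor before this comparison can be made.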
5.4. Cost Analysis
In order to analyze our algorithm, we derive the parallel time complexity of the computation and the communication steps in the preprocessing and the triangle counting phases of our algorithm. Let n be the total number of vertices of the graph, m be the total number of edges of the graph, and p be the total number of ranks used. Moreover, let d_avg and d_max be the average degree and the maximum degree of the graph, respectively. The computations in the preprocessing phase, as discussed above, involve multiple scans over the chunk of the adjacency matrix owned by each rank, which amounts to a computation time of O((n + m)/p). Moreover, the communication step in the preprocessing phase requires an all-to-all personalized communication operation. Since we implemented this communication step using point-to-point send and receive operations, its complexity is lower bounded by O(p + m/p). The preprocessing phase also includes the distributed counting sort for relabeling the vertices in non-decreasing degree order. The computations associated with this sort include two scans of the local vertices to determine the local maximum degree and the local positions, an overall reduction to compute the global maximum degree, and a scan with a cost of O(n/p + d_max) to determine the new labels of the vertices by computing the new positions in the distributed system. Furthermore, to find the new positions, we perform a prefix sum, which incurs a communication time of O(d_max log p). Thus, the total preprocessing phase takes time O((n + m)/p + p + d_max log p).
In the triangle counting phase, for each shift, the amount of computation is on the average O((m d_avg)/(p √p)), whereas the communication cost is O(m/p). As a result, the overall amount of time, across the √p shifts, spent in computation and communication for the triangle counting phase is O((m d_avg)/p) and O(m/√p), respectively.
Graph  #vertices  #edges  #triangles 

twitter (Kwak et al., 2010)  41,652,230  1,202,513,046  34,824,916,864 
friendster (Samsi et al., 2017)  119,432,957  1,799,999,986  191,716 
g500s26 (Nelson et al., 2014)  67,108,864  1,073,741,824  49,158,464,716 
g500s27 (Nelson et al., 2014)  134,217,728  2,147,483,648  106,858,898,940 
g500s28 (Nelson et al., 2014)  268,435,456  4,294,967,296  231,425,307,324 
g500s29 (Nelson et al., 2014)  536,870,912  8,589,934,592  499,542,556,876 
Table 1: Summary of the graphs that were used to evaluate the performance of our triangle counting algorithms. The number of triangles in each graph is listed as well.
datasets  ranks  #nodes used  expected speedup  ppt time  ppt speedup  tct time  tct speedup  overall runtime  overall speedup

g500s28  16  8  151.71  576.30  728.23  
25  7  1.56  113.80  1.33  408.21  1.41  522.31  1.39  
36  9  2.25  76.19  1.99  291.11  1.98  367.47  1.98  
49  9  3.06  63.96  2.37  222.64  2.59  286.78  2.54  
64  7  4.00  56.40  2.69  221.65  2.60  278.26  2.62  
81  9  5.06  46.12  3.29  160.30  3.60  206.61  3.52  
100  10  6.25  43.12  3.52  136.90  4.21  180.18  4.04  
121  11  7.56  40.22  3.77  121.33  4.75  161.67  4.50  
144  18  9.00  34.35  4.42  105.60  5.46  140.05  5.20  
169  17  10.56  30.69  4.94  79.82  7.22  110.51  6.59  
g500s29  16  16  323.50  1371.75  1695.57  
25  25  1.56  158.90  2.04  731.32  1.88  890.56  1.90  
36  18  2.25  163.41  1.98  697.33  1.97  861.21  1.97  
49  17  3.06  126.98  2.55  510.08  2.69  637.41  2.66  
64  16  4.00  100.59  3.22  473.49  2.90  574.34  2.95  
81  17  5.06  88.60  3.65  386.65  3.55  475.55  3.57  
100  17  6.25  72.13  4.48  280.05  4.90  352.36  4.81  
121  21  7.56  65.69  4.92  250.37  5.48  316.24  5.36  
144  24  9.00  63.06  5.13  232.09  5.91  295.33  5.74  
169  29  10.56  53.54  6.04  191.16  7.18  244.81  6.93  
twitter  16  2  60.76  109.46  170.45  
25  2  1.56  39.59  1.53  64.73  1.69  104.50  1.63  
36  2  2.25  39.63  1.53  61.33  1.78  101.17  1.68  
49  3  3.06  33.45  1.82  45.31  2.42  79.04  2.16  
64  3  4.00  30.16  2.01  42.13  2.60  72.48  2.35  
81  4  5.06  29.08  2.09  30.46  3.59  59.68  2.86  
100  5  6.25  32.74  1.86  30.81  3.55  63.70  2.68  
121  6  7.56  32.64  1.86  24.75  4.42  57.50  2.96  
144  7  9.00  33.36  1.82  25.39  4.31  58.85  2.90  
169  8  10.56  31.62  1.92  18.52  5.91  50.29  3.39  
friendster  16  3  91.54  95.41  187.21  
25  2  1.56  57.84  1.58  71.82  1.33  129.78  1.44  
36  3  2.25  48.51  1.89  64.29  1.48  112.98  1.66  
49  5  3.06  36.47  2.51  46.75  2.04  83.37  2.25  
64  4  4.00  35.80  2.56  45.61  2.09  81.66  2.29  
81  5  5.06  33.24  2.75  35.36  2.70  68.78  2.72  
100  5  6.25  35.56  2.57  35.24  2.71  71.04  2.64  
121  7  7.56  29.51  3.10  27.51  3.47  57.09  3.28  
144  6  9.00  38.62  2.37  37.65  2.53  76.53  2.45  
169  8  10.56  31.55  2.90  29.43  3.24  61.23  3.06 

Table 2: The column labeled ppt shows the preprocessing runtime, tct shows the triangle counting runtime, and the column labeled overall shows the overall runtime for the datasets. The runtimes are in seconds. The speedup and efficiency were computed relative to the 16-rank runtimes. The column labeled #nodes used corresponds to the number of nodes used to run the algorithm by distributing the various ranks over them. We select the smallest number of nodes whose aggregate memory satisfies the memory requirement of our algorithm.
6. Experimental methodology
6.1. Datasets
We used six large real-world and synthetic graphs with varying degree distributions to evaluate the performance of our algorithms. Various statistics related to these graphs and the sources from which they were obtained are shown in Table 1. The twitter and friendster datasets are sparse social networks; g500s26, g500s27, g500s28, and g500s29 were generated using the graph500 generator provided in (Nelson et al., 2014) and follow the RMAT graph specifications (Graph500, 2018). Our algorithm generates these synthetic graphs in memory at the start of each run, prior to calling our triangle counting routine. This way, we avoid reading the big graphs from disk. We converted all the graph datasets to undirected, simple graphs.
6.2. Experimental setup
Our experiments were conducted on up to 29 dual-socket nodes with 24 cores of Intel Haswell E5-2680 v3, each with 64 GB of memory. More details about the system are provided in (Institute, 2018). Our programs were developed using C and OpenMPI (v3.1.2), and compiled using GCC (v8.1.0) with -O3 optimization. We ran our MPI programs with the option --bind-to core. We ran our experiments with ranks ranging from 16 to 169, such that the number of ranks is a perfect square, which forms a √p × √p processor grid. The number of nodes used for the different numbers of ranks is detailed in Table 2. In order to best utilize the available resources, we minimize the number of nodes used such that the aggregate memory of the selected nodes satisfies the memory requirement of our algorithm. Moreover, we have bound ranks to sockets as well. For example, with 36 ranks, we bind one rank per socket to get better performance while ensuring our program does not run out of memory. We start our measurements at 16 ranks, as the largest graph in our experiments, i.e., g500s29, required all the memory provided by 16 nodes for its processing.
Our competing approach, Havoq, was executed on 48 nodes, with all 24 cores of each node in use. The program was compiled using GCC (v5.4.0) and OpenMPI (v3.1.2). We used this particular version of GCC because it was the one on which Havoq ran successfully; moreover, Havoq produced faster runtimes with it than with GCC (v8.1.0).
6.3. Performance metrics
We use the following two performance metrics to assess the performance of our parallel formulation:

Speedup: Speedup is computed by dividing the runtime of the baseline algorithm by the runtime of the parallel algorithm. We consider the runtime obtained with 16 ranks as the baseline and report speedups relative to it.

Efficiency: Similar to the above, we compute efficiency using the runtime of the 16-rank case as the baseline. Specifically, if p is the number of processors and Tp is the parallel runtime of our algorithm (or one of its phases), then the relative efficiency is E = (16 · T16) / (p · Tp), where T16 is the 16-rank runtime.
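The two metrics above can be sketched directly (T16 denotes the 16-rank baseline runtime; perfect scaling gives an efficiency of 1):

```c
/* Speedup of a parallel run relative to a baseline runtime. */
double speedup(double t_baseline, double t_parallel) {
    return t_baseline / t_parallel;
}

/* Relative efficiency with respect to the 16-rank baseline:
 * E = (16 * T16) / (p * Tp). */
double efficiency(double t16, int p, double tp) {
    return (16.0 * t16) / ((double)p * tp);
}
```

For instance, if the 16-rank run takes 100 s and a 64-rank run takes 25 s, the speedup is 4 and the relative efficiency is (16 · 100) / (64 · 25) = 1.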
7. Results
7.1. Parallel Performance
Table 2 shows the performance achieved by our algorithm on the two larger synthetic graphs and the two real-world graphs. From these results, we notice that as the number of ranks increases, both the preprocessing and triangle counting times decrease. The overall speedup on 169 ranks relative to 16 ranks falls below the ideal speedup of 169/16 ≈ 10.6. The synthetic graphs achieve better speedups than the real-world graphs because the synthetic graphs in our experiments are larger. Since the performance advantage of our algorithm tapers off after a certain number of ranks (which also depends on the size of the dataset), we restrict ourselves to measuring runtimes up to a 13×13 processor grid.
Comparing the scalability of the preprocessing and triangle counting phases, we can see that the latter scales better: the relative speedup of the triangle counting phase on 169 ranks is, on average, higher than that achieved by the preprocessing phase. This can also be seen in the efficiency plots of Figure 1, where the efficiency of the preprocessing phase decreases faster than that of the triangle counting phase as the number of ranks increases. In light of the analysis presented in Section 5.4, this scaling behavior was expected. The communication and computation costs of the preprocessing phase are of the same order, whereas the computation in the triangle counting phase is of a higher order than that of the preprocessing phase. Thus, the triangle counting phase continues to scale better than the preprocessing phase as the number of ranks increases.
Moreover, for almost all graphs, we observe superlinear speedup between successive rank counts at the lower end of the range. We believe this happens because both the triangle counting and preprocessing phases utilize caches better as the aggregate amount of cache memory grows with the number of ranks. This is further quantified in Figure 2, which plots the operation rate in kOps/second for both phases on g500s29. We can see that although the preprocessing phase continues to show higher operation rates with increasing ranks, the triangle counting phase peaks at an intermediate number of ranks.
Finally, on analyzing the real-world graphs, we notice that twitter attains better speedups than friendster. We believe this happens because the triangle counting phase involves more work for twitter than for friendster. We measured the average number of probes performed per shift in every rank for both twitter and friendster, and observed that twitter performs more probes than friendster.
7.2. Sources of overhead
| ranks | maximum runtime | average runtime | load imbalance |
| 25    | 187.93          | 177.81          | 1.05           |
| 36    | 106.65          | 93.79           | 1.14           |
Table 3. We measure the maximum per-shift runtime of the triangle counting routine and compute the associated load imbalance as the ratio of the maximum per-shift runtime to the average, on the g500s29 dataset for 25 and 36 MPI ranks.
We analyze three different sources of parallel overheads in our triangle counting algorithm. The first has to do with load imbalance as a result of assigning a different number of computations to each rank during each one of the steps of the algorithm. The second is due to the redundant work performed with increasing number of ranks. Finally, the third overhead we analyze is due to the time spent in communication as we increase the number of ranks.
Load imbalance
Recall from Section 5 that the computations during triangle counting are organized in phases; in each phase, each processor performs a shift operation and processes the pair of blocks it just received. If the amount of computation associated with processing each pair of blocks differs, then this can lead to load imbalance. In order to quantify this load imbalance in just the compute phase, we performed a series of experiments in which we measured the per-shift time of the computations involved in the triangle counting phase for 25 and 36 MPI ranks. The load imbalance was measured as the ratio of the maximum time over all pairs of blocks to the average time. The results are shown in Table 3. For 25 MPI ranks the load imbalance is 1.05, and for 36 MPI ranks it is 1.14. We also quantify how distributing the data and the tasks contributes to the load imbalance: we count the number of nonzero tasks associated with each rank with increasing grid sizes and compute the load imbalance. In general, the task-count load imbalance that we observed was small, which partially explains the runtime load imbalance observed for the 25- and 36-rank runs.
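The load-imbalance metric used above is a simple max-over-average ratio; a minimal sketch (the function name is illustrative):

```c
/* Load imbalance of n per-shift (or per-rank) runtimes: the ratio of
 * the maximum runtime to the average runtime. A value of 1.0 means
 * perfectly balanced work. */
double load_imbalance(const double *times, int n) {
    double max = times[0], sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (times[i] > max) max = times[i];
        sum += times[i];
    }
    return max / (sum / n);
}
```

For example, per-shift times of {2.0, 1.0, 1.0} give a maximum of 2.0 against an average of 4/3, i.e., an imbalance of 1.5.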
Redundant work
In Section 5, we discussed various optimizations that were designed to efficiently operate on the very sparse blocks (e.g., the doubly sparse traversal) in order to eliminate redundant computations. However, those optimizations do not entirely eliminate the redundant computations. To measure how much extra work we do as we increase the number of ranks, we instrumented our code to count the number of tasks that result in the map-based set intersection operation throughout the execution of the triangle counting phase. The number of such tasks for g500s29 on 16, 25, and 36 ranks is shown in Table 4. We see that the number of tasks increases as the number of ranks grows from 16 to 25 and from 25 to 36. This extra work is responsible for some of the loss in the potential speedup observed in the results shown in Table 2.
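A map-based set intersection of the kind counted above can be sketched as follows; this is an illustrative version using a flat lookup table as the "map" (the name and signature are hypothetical, not the paper's actual routine):

```c
/* Count common neighbors of two adjacency lists by marking one list in
 * a lookup table and probing it with the other. `mark` must be
 * zero-initialized and hold at least nvtxs entries; it is restored to
 * all zeros before returning so it can be reused across tasks. */
long map_intersect(const int *adjA, int lenA,
                   const int *adjB, int lenB,
                   unsigned char *mark) {
    long count = 0;
    for (int i = 0; i < lenA; i++) mark[adjA[i]] = 1;  /* build "map"  */
    for (int i = 0; i < lenB; i++) count += mark[adjB[i]]; /* probe it  */
    for (int i = 0; i < lenA; i++) mark[adjA[i]] = 0;  /* restore      */
    return count;
}
```

Each such call is one "task" in the instrumentation above; as the grid grows, the same vertex pair can trigger intersection tasks in more blocks, which is the redundant work being measured.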
Communication overheads
The fraction of time spent on communication over the entire runtime of the preprocessing and the triangle counting phase can be identified by looking at Figure 3. This plot shows that for both preprocessing and triangle counting, the bulk of the time is spent in computations for the largest graph in our testbed. However, the portion of the overall runtime that is attributed to the communication keeps increasing as we increase the number of ranks.
7.3. Quantifying the gains achieved by the optimizations
Recall from Section 5.2 that we introduced various optimizations in our triangle counting phase that leverage the sparsity of the graphs and the structure of the computations. In order to quantify the resulting reduction in the runtime of the triangle counting phase, we chose the two optimizations that we believe provided the greatest benefits, (i) the doubly sparse traversal of the CSR structure and (ii) the modified hashing routine for sparser vertices, and recorded the runtime of the triangle counting phase without each of them.
Based on the results obtained from these experiments on g500s29, we observe that the doubly sparse traversal of the vertices reduced the runtime of the triangle counting phase at both rank counts we measured, and the modified hashing routine likewise reduced the triangle counting runtime at both rank counts. Moreover, we also recorded the improvement obtained from the triangle enumeration scheme used in our algorithm: the triangle counting runtime decreased when using our enumeration scheme compared to the alternative scheme.
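The doubly sparse traversal can be sketched as follows: besides the usual CSR arrays, a block stores the list of its nonempty rows, so iteration skips empty rows entirely instead of scanning all row pointers. The struct and function names here are illustrative, not the paper's actual data structures.

```c
typedef struct {
    const int *xadj;    /* CSR row pointers, size nvtxs+1          */
    const int *adjncy;  /* CSR column indices                      */
    const int *rowind;  /* indices of the nonempty rows            */
    int nrows;          /* number of nonempty rows (nrows << nvtxs) */
} csr_block;

/* Traverse only the nonempty rows of a very sparse block; here we
 * simply count its nonzeros, but the triangle counting kernel iterates
 * the same way. Cost is O(nrows + nnz) rather than O(nvtxs + nnz). */
long count_nnz_sparse(const csr_block *blk) {
    long nnz = 0;
    for (int i = 0; i < blk->nrows; i++) {
        int r = blk->rowind[i];
        nnz += blk->xadj[r + 1] - blk->xadj[r];
    }
    return nnz;
}
```

For the very sparse blocks produced by a 2D decomposition, nrows is far smaller than the number of vertices, which is where the measured savings come from.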
7.4. Comparison against other algorithms
| ranks used | task counts | percent increase with respect to previous rank |
Table 4. We count the number of tasks that result in the map-based set intersection operation on the g500s29 dataset to measure the redundant work with increasing number of ranks.
| dataset    | 2-core time | directed wedge counting time | our runtime | speedup obtained |
| g500s26    | 1.59        | 239.64                       | 20.35       | 11.9             |
| g500s27    | 3.37        | 576.45                       | 41.93       | 13.7             |
| g500s28    | 7.32        | 1395.11                      | 79.82       | 14.6             |
| twitter    | 1.88        | 124.72                       | 18.52       | 6.2              |
| friendster | 3.29        | 24.75                        | 29.43       |                  |
Table 5. Runtimes obtained by Havoq's triangle counting routine on the different input datasets. The 2-core time is the time Havoq takes to generate directed wedges, and the directed wedge counting time is the time it takes to count how many of the generated wedges are closed. We used the ingest_edgelist executable provided with Havoq to convert the input data to their format and persist it in /dev/shm. The 2-core time and the directed wedge counting time reported by Havoq are added to obtain its total triangle counting time.
| algorithm                              | fastest runtime reported | cores used |
| Our work                               | 51.7                     | 169        |
| AOP (Arifuzzaman et al., 2017)         | 564.0                    | 200        |
| Surrogate (Arifuzzaman et al., 2017)   | 739.8                    | 200        |
| OPT-PSP (Kanewala et al., 2018)        |                          | 2048       |
Table 6. We compare three state-of-the-art approaches from (Arifuzzaman et al., 2017; Kanewala et al., 2018) against our algorithm on the twitter dataset using the fastest runtimes (in seconds) reported. We also report the number of cores used to obtain those runtimes. The OPT-PSP runtime has been extrapolated from the strong scaling results in (Kanewala et al., 2018), using the speedup achieved and their fastest sequential runtime.
Comparison against Havoq (Pearce, 2017)
As discussed in Section 4, various distributed-memory parallel triangle counting algorithms have been developed. We perform two different evaluations. First, we perform a direct comparison with Havoq on the graphs detailed in Table 1. Table 5 compares the triangle counting runtime obtained by Havoq with that obtained by our approach. Havoq runtimes were obtained on 1152 cores (using 48 nodes), whereas our runtimes were obtained on 169 cores. On average, we achieve a substantial speedup over their approach. On friendster, our approach is slower than Havoq; we believe this is because Havoq uses an edge-based partitioning scheme (referred to as delegate partitioning), which scales better than our method, which incurs more overhead as the number of ranks increases. Furthermore, for g500s29, Havoq required more nodes than were available on our system, so we could not obtain its runtime on that graph.
Comparison against other distributed-memory algorithms (Arifuzzaman et al., 2017; Kanewala et al., 2018)
We also contrast the performance achieved by our algorithm against that of other previous approaches on the twitter graph only, since this was the common benchmark. For this second evaluation, we use the runtimes reported in the respective papers, which were obtained on different architectures and numbers of ranks; thus, the comparisons presented in Table 6 should be interpreted with this caveat in mind. The Algorithm with Overlapping Partitioning (AOP) (Arifuzzaman et al., 2017) was run on 200 cores; their experimental setup comprised 64 compute nodes (QDR InfiniBand interconnect), each with 16 processors (Sandy Bridge E5-2670, 2.6 GHz), 4 GB of memory per processor, and the CentOS Linux 6 operating system. Surrogate (Arifuzzaman et al., 2017) was also run on 200 cores with the same setup as AOP. The OPT-PSP algorithm (Kanewala et al., 2018) was run on 2048 cores on a Cray XC system with two 22-core Broadwell Intel Xeon processors per node, and their scaling experiments uniformly doubled the problem size. From these results we can see that our implementation compares favorably with all previous approaches. Moreover, the relative performance advantage of our method still holds when we account for the fact that some of the runtimes reported in Table 6 were obtained using more cores than those used in our experiments.
8. Conclusion
In this paper we presented a distributed-memory formulation for triangle counting and evaluated its performance on real-world and synthetic graphs. Compared to prior parallel formulations, our formulation utilizes a 2D decomposition, which increases the concurrency that it can exploit while reducing the overall communication overhead. The experimental results showed that these features lead to good scaling performance. Our analysis also identified areas that can benefit from further algorithmic and data structure improvements in order to better balance the work and reduce the amount of redundant computation. Moreover, we note that this work can be easily extended to deal with rectangular processor grids using the SUMMA algorithm (Van De Geijn and Watts, 1997).
Acknowledgements.
This work was supported in part by NSF (1447788, 1704074, 1757916, 1834251), Army Research Office (W911NF1810344), Intel Corp, and the Digital Technology Center at the University of Minnesota. Access to research and computing facilities was provided by the Digital Technology Center and the Minnesota Supercomputing Institute.
References
 Arifuzzaman et al. (2017) Shaikh Arifuzzaman, Maleq Khan, and Madhav Marathe. 2017. Distributed-Memory Parallel Algorithms for Counting and Listing Triangles in Big Graphs. arXiv preprint arXiv:1706.05151 (2017).
 Bisson and Fatica (2018) Mauro Bisson and Massimiliano Fatica. 2018. Update on Static Graph Challenge on GPU. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–8.
 Buluç and Gilbert (2010) Aydın Buluç and John R Gilbert. 2010. Highly parallel sparse matrix-matrix multiplication. arXiv preprint arXiv:1006.2183 (2010).

 Cannon (1969) Lynn E Cannon. 1969. A cellular computer to implement the Kalman filter algorithm. Technical Report. Montana State University Bozeman, Engineering Research Labs.
 Girvan and Newman (2002) Michelle Girvan and Mark EJ Newman. 2002. Community structure in social and biological networks. Proceedings of the national academy of sciences 99, 12 (2002), 7821–7826.
 Graph500 (2018) Graph500. 2018. graph500. https://graph500.org
 Green et al. (2014) Oded Green, Pavan Yalamanchili, and Lluís-Miquel Munguía. 2014. Fast triangle counting on the GPU. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms. IEEE Press, 1–8.
 Hu et al. (2018) Yang Hu, Hang Liu, and H Howie Huang. 2018. High-Performance Triangle Counting on GPUs. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–5.
 Institute (2018) Minnesota Supercomputing Institute. 2018. Mesabi Description. https://www.msi.umn.edu/content/mesabi
 Kanewala et al. (2018) Thejaka Amila Kanewala, Marcin Zalewski, and Andrew Lumsdaine. 2018. Distributed, Shared-Memory Parallel Triangle Counting. In Proceedings of the Platform for Advanced Scientific Computing Conference. ACM, 5.
 Kwak et al. (2010) Haewoon Kwak, Changhyun Lee, Hosung Park, and Sue Moon. 2010. What is Twitter, a social network or a news media?. In Proceedings of the 19th international conference on World wide web. ACM, 591–600.
 Nelson et al. (2014) Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, and Mark Oskin. 2014. Grappa: A latency-tolerant runtime for large-scale irregular applications. In International Workshop on Rack-Scale Computing (WRSC w/EuroSys).
 Parimalarangan et al. (2017) Sindhuja Parimalarangan, George M Slota, and Kamesh Madduri. 2017. Fast Parallel Triad Census and Triangle Listing on Shared-Memory Platforms. In Parallel and Distributed Processing Symposium Workshop (IPDPSW), 2017 IEEE International. IEEE.
 Pearce (2017) Roger Pearce. 2017. Triangle counting for scalefree graphs at scale in distributed memory. In 2017 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–4.
 Pearce et al. (2013) R. Pearce, M. Gokhale, and N. M. Amato. 2013. Scaling Techniques for Massive Scale-Free Graphs in Distributed (External) Memory. In 2013 IEEE 27th International Symposium on Parallel and Distributed Processing. 825–836. https://doi.org/10.1109/IPDPS.2013.72
 Polak (2016) Adam Polak. 2016. Counting triangles in large graphs on GPU. In Parallel and Distributed Processing Symposium Workshops, 2016 IEEE International. IEEE, 740–746.
 Samsi et al. (2017) Siddharth Samsi, Vijay Gadepally, Michael Hurley, Michael Jones, Edward Kao, Sanjeev Mohindra, Paul Monticciolo, Albert Reuther, Steven Smith, William Song, Diane Staheli, and Jeremy Kepner. 2017. Static Graph Challenge: Subgraph Isomorphism. IEEE HPEC (2017).
 Shrivastava et al. (2008) Nisheeth Shrivastava, Anirban Majumder, and Rajeev Rastogi. 2008. Mining (social) network graphs to detect random link attacks. In Data Engineering, 2008. ICDE 2008. IEEE 24th International Conference on. IEEE, 486–495.
 Shun and Tangwongsan (2015) Julian Shun and Kanat Tangwongsan. 2015. Multicore triangle computations without tuning. In Data Engineering (ICDE), 2015 IEEE 31st International Conference on. IEEE, 149–160.
 Smith et al. (2017) Shaden Smith, Xing Liu, Nesreen K Ahmed, Ancy Sarah Tom, Fabrizio Petrini, and George Karypis. 2017. Truss decomposition on shared-memory parallel systems. In High Performance Extreme Computing Conference (HPEC). IEEE, 1–6.
 Tom et al. (2017) Ancy Sarah Tom, Narayanan Sundaram, Nesreen K Ahmed, Shaden Smith, Stijn Eyerman, Midhunchandra Kodiyath, Ibrahim Hur, Fabrizio Petrini, and George Karypis. 2017. Exploring optimizations on shared-memory platforms for parallel triangle counting algorithms. In High Performance Extreme Computing Conference (HPEC), 2017 IEEE. IEEE, 1–7.
 Van De Geijn and Watts (1997) Robert A Van De Geijn and Jerrell Watts. 1997. SUMMA: Scalable universal matrix multiplication algorithm. Concurrency: Practice and Experience 9, 4 (1997), 255–274.
 Wang et al. (2016) Leyuan Wang, Yangzihao Wang, Carl Yang, and John D Owens. 2016. A comparative study on exact triangle counting algorithms on the GPU. In Proceedings of the ACM Workshop on High Performance Graph Processing. ACM, 1–8.
 Watts and Strogatz (1998) Duncan J Watts and Steven H Strogatz. 1998. Collective dynamics of small-world networks. Nature 393, 6684 (1998), 440–442.
 Yaşar et al. (2018) Abdurrahman Yaşar, Sivasankaran Rajamanickam, Michael Wolf, Jonathan Berry, and Ümit V Çatalyürek. 2018. Fast Triangle Counting Using Cilk. In 2018 IEEE High Performance extreme Computing Conference (HPEC). IEEE, 1–7.