K-Clique Counting on GPUs

Counting k-cliques in a graph is an important problem in graph analysis with many applications. Counting k-cliques is typically done by traversing search trees starting at each vertex in the graph. An important optimization is to eliminate search tree branches that discover the same clique redundantly. Eliminating redundant clique discovery is typically done via graph orientation or pivoting. Parallel implementations for both of these approaches have demonstrated promising performance on CPUs. In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Our implementations explore both vertex-centric and edge-centric parallelization schemes, and replace recursive search tree traversal with iterative traversal based on an explicitly-managed shared stack. We also apply various optimizations to reduce memory consumption and improve the utilization of parallel execution resources. Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also evaluate the impact of the choice of parallelization scheme and the incremental speedup of each optimization. Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.

1. Introduction

Dense sub-graph counting and listing is an important computation in graph mining (Danisch et al., 2017; Lee et al., 2010; Gionis and Tsourakakis, 2015). A k-clique is a fundamental dense sub-graph with great importance in many applications such as community detection (Fang et al., 2019; Gregori et al., 2012; Sariyüce et al., 2017; Tsourakakis, 2015; Yuan et al., 2017; Hao et al., 2018), graph partition and compression (Glaria et al., 2021; Rossi and Zhou, 2018, 2019), learning network embedding (Yu et al., 2019; Rossi et al., 2018a), and recommendation systems (Vilakone et al., 2018; Manoharan, 2020). A k-clique (or a k-vertex clique) in a graph is a complete sub-graph with exactly k vertices and k(k−1)/2 edges, such that every vertex in the clique is connected to every other vertex.

A common approach to k-clique counting is to traverse a search tree for each vertex in the graph and find the k-cliques that contain that vertex. This approach is commonly parallelized by processing different trees or subtrees in parallel. One fundamental optimization is to eliminate the search tree branches that discover the same clique redundantly. Two prominent approaches to eliminate redundant clique discovery during k-clique counting are graph orientation (Shi et al., 2020; Danisch et al., 2018; Chiba and Nishizeki, 1985; Eppstein et al., 2010; Li et al., 2020; Finocchi et al., 2015) and pivoting (Jain and Seshadhri, 2020).

The graph orientation approach orients the graph into a directed graph so that each k-clique, being a symmetric substructure, is found from only one of the vertices it contains. Common orientation criteria include vertex degree (Chiba and Nishizeki, 1985; Shi et al., 2020; Finocchi et al., 2015), graph coloring (Li et al., 2020), and degeneracy based on k-core decomposition (Eppstein et al., 2010; Danisch et al., 2018; Shi et al., 2020) or relaxations of k-core (Shi et al., 2020). On the other hand, the pivoting approach (Jain and Seshadhri, 2020) for k-clique counting is inspired by the Bron-Kerbosch maximal clique finding approach (Eppstein et al., 2010). Rather than searching for all k-cliques individually, the pivoting approach finds the largest cliques, then calculates the number of k-cliques they contain. To the best of our knowledge, the state-of-the-art parallel implementations for the graph orientation and pivoting approaches are ARB-COUNT (Shi et al., 2020) and Pivoter (Jain and Seshadhri, 2020), respectively. Both of these parallel implementations are designed for CPUs.

In this paper, we present our GPU implementations of k-clique counting for both the graph orientation and pivoting approaches. Compared to CPUs, parallelizing k-clique counting on GPUs comes with unique opportunities and challenges such as fine-grain parallelism, a multi-level parallelization structure, inefficient support for recursion, and constrained memory requirements. Our implementations explore both vertex-centric and edge-centric parallelization across thread blocks, as well as different ways for partitioning work within thread blocks for each approach. We replace the recursive search tree traversal with an iterative traversal whereby threads jointly traverse a subtree by explicitly managing a shared stack. We use binary encoding of induced sub-graphs to reduce memory consumption and allow for highly parallel list intersection operations. We take advantage of the Independent Thread Scheduling feature (NVIDIA, 2017) in the NVIDIA Volta architecture and its successors to allow threads to collaborate at sub-warp granularity for better utilization of parallel execution resources. We also employ various techniques to reduce the memory consumption of our implementations.

Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by a geometric mean speedup of 12.39x, 6.21x, and 18.99x for k = 4, 7, and 10, respectively. We also show that the binary encoding and sub-warp partitioning optimizations yield significant performance gains, and we provide insights on the choice of parallelization scheme and sub-warp partitioning granularity.

In summary, we make the following contributions:

  • A GPU implementation of the graph orientation approach to k-clique counting that outperforms the state-of-the-art parallel CPU implementation

  • A GPU implementation of the pivoting approach to k-clique counting that outperforms the state-of-the-art parallel CPU implementation

  • An exploration of the vertex-centric and edge-centric parallelization schemes for k-clique counting on GPUs

  • Various optimization techniques for k-clique counting on GPUs that are common to both approaches and parallelization schemes, including binary encoding of induced sub-graphs, sub-warp partitioning, and efficient memory management

Our code will be open-sourced to enable further research on parallelizing k-clique counting on GPUs.

2. Background

2.1. Clique Counting

Consider an undirected simple graph G = (V, E), where V is the set of vertices in the graph, E is the set of edges in the graph, and Adj(v) is the adjacency list of a vertex v. A k-clique in G is a complete sub-graph of G with exactly k vertices and k(k−1)/2 edges. That is, every vertex in the clique is connected to every other vertex with an edge. This paper focuses on k-clique counting, which counts the number of k-cliques in a given graph G for a given k.

A common approach to counting k-cliques in a graph is to traverse a search tree for each vertex in the graph to find the k-cliques that contain that vertex. The search tree for each vertex (1-clique) branches out to the vertex’s neighbors to find edges (2-cliques), then for each edge, branches out to the common neighbors of the edge’s end-points to find triangles (3-cliques), then for each triangle, branches out to the common neighbors of the vertices in the triangle to find 4-cliques, and so on. In general, for each i-clique, the tree branches out to the common neighbors of the vertices in the clique to find the (i+1)-cliques. Finding the common neighbors of a set of vertices requires adjacency list intersection operations. This approach to k-clique counting is commonly parallelized by processing different trees or subtrees in parallel.
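To make the traversal concrete, the following is a minimal sequential C++ sketch of this approach (illustrative only; the GPU implementation in Section 3 replaces the recursion with an iterative, explicitly managed stack). It assumes sorted adjacency lists; if adj holds the oriented out-neighbor lists of Section 2.2, each clique is found exactly once, whereas on undirected lists each k-clique is found k! times.

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using AdjList = std::vector<std::vector<int>>;  // sorted adjacency (or out-neighbor) lists

// cand holds the candidate vertices for level `depth`: the common neighbors of all
// vertices already on the path. When depth == k, each candidate completes one k-clique.
static uint64_t countFromLevel(const AdjList& adj, const std::vector<int>& cand,
                               int depth, int k) {
    if (depth == k) return cand.size();
    uint64_t count = 0;
    for (int v : cand) {
        std::vector<int> next;  // candidates for the next level: cand ∩ Adj(v)
        std::set_intersection(cand.begin(), cand.end(), adj[v].begin(), adj[v].end(),
                              std::back_inserter(next));
        count += countFromLevel(adj, next, depth + 1, k);
    }
    return count;
}

uint64_t countKCliques(const AdjList& adj, int k) {
    uint64_t total = 0;
    for (int v = 0; v < (int)adj.size(); ++v)        // one search tree per vertex (level 1)
        total += countFromLevel(adj, adj[v], 2, k);  // level-2 candidates: neighbors of v
    return total;
}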

One key distinguishing feature among algorithms is how they avoid discovering the same clique redundantly from multiple root vertices. Avoiding redundant clique discovery results in a substantial reduction in the amount of work done, thereby improving performance. The two major approaches to avoiding redundant clique discovery are: (1) orienting the graph before traversal, and (2) pivoting. These two approaches are described in Sections 2.2 and 2.3, respectively.

Figure 1. Counting 4-cliques in an example graph using different approaches

2.2. Graph Orientation Approach

One common approach to avoiding redundant clique discovery is orienting the graph before traversing it (Shi et al., 2020; Danisch et al., 2018; Chiba and Nishizeki, 1985; Eppstein et al., 2010; Li et al., 2020; Finocchi et al., 2015). Graph orientation (or vertex ordering) is a preprocessing step that converts the graph from an undirected graph to a directed one. Common orientation criteria include vertex degree (Chiba and Nishizeki, 1985; Shi et al., 2020; Finocchi et al., 2015), graph coloring (Li et al., 2020), and degeneracy based on k-core decomposition (Eppstein et al., 2010; Danisch et al., 2018; Shi et al., 2020) or relaxations of k-core (Shi et al., 2020). Graph orientation relies on the fact that a clique is a symmetric substructure, hence, it can be found by starting from any vertex it contains.

Figure 1(b) shows how the graph in Figure 1(a) is explored in the graph orientation approach to find all the 4-cliques. Assume that the edges are oriented in alphabetical order (from the earlier letter to the later letter). The first level contains all the vertices in the graph representing the root of their respective search trees. At the second level, each tree branches out from the root vertex to its neighbors. For example, the branch A→B indicates that there is an edge from vertex A to vertex B. At the third level, each edge branches out to the triangles it participates in. For example, the path A→B→C indicates that there is a triangle containing vertices A, B, and C. Here, C is found while intersecting the adjacency lists of vertices A and B (i.e., Adj(A) ∩ Adj(B)). Finally, at the fourth level, each triangle branches out to the 4-cliques it participates in. For example, the path A→B→C→D indicates that there is a 4-clique containing vertices A, B, C, and D. Here, D is found while intersecting the adjacency lists for A, B, and C. Since Adj(A) ∩ Adj(B) was computed in the previous level, what remains is intersecting the previous result with Adj(C). Since we are looking for 4-cliques, the tree traversal stops at the fourth level. In general, while looking for k-cliques, the tree traversal stops at level k.

Graph orientation has two main benefits. The first benefit is that it eliminates redundant clique discovery as previously mentioned. In the example in Figure 1(b), the 4-clique containing vertices A, B, C, and D is only discovered in the tree rooted at vertex A. It is not redundantly discovered in the other trees because vertex A is not reachable from the other vertices. The second benefit of graph orientation is that it reduces the out-degrees of the vertices and the maximum out-degree of the graph. The maximum out-degree has a quadratic impact on memory consumption, as we show in Section 2.4.

Parallel k-clique counting based on graph orientation has been extensively studied on CPU (Shi et al., 2020; Danisch et al., 2018; Li et al., 2020; Chiba and Nishizeki, 1985). Vertices or edges are typically distributed across CPU threads, and each thread performs a sequential depth-first traversal of the tree or subtree to count the underlying cliques. To the best of our knowledge, ARB-COUNT (Shi et al., 2020) is the best performing parallel CPU implementation of k-clique counting based on graph orientation. In this paper, we present our parallel GPU implementation of k-clique counting based on graph orientation and compare its performance to ARB-COUNT.

2.3. Pivoting Approach

Another approach to avoiding redundant clique discovery is pivoting (Jain and Seshadhri, 2020). The idea of pivoting is inspired by the Bron-Kerbosch maximal clique finding approach (Eppstein et al., 2010). Pivoting relies on the fact that an n-clique contains C(n, k) k-cliques, so instead of searching for all of these k-cliques, it is sufficient to find the largest clique; all the k-cliques it contains are then counted. For example, the graph in Figure 1(a) contains a 5-clique consisting of vertices A, B, C, D, and J. This 5-clique contains five different 4-cliques. In the graph orientation approach in Figure 1(b), these five 4-cliques are discovered by five different paths in the search trees. The pivoting approach aims at discovering the 5-clique, then concluding the existence of five 4-cliques.

Figure 1(c) shows an example of how the graph in Figure 1(a) is explored using the pivoting approach. As the search tree is traversed, at every branching step, one pivot child vertex is selected, which is typically the vertex that has the largest number of common neighbors with its parent. All the pivot’s neighbors are then excluded from the branching since these neighbors are eventually reachable from the pivot vertex (i.e., the pivot vertex is their parent in the search tree). For example, at the first level, vertex B is selected as the pivot vertex because it has the largest number of neighbors. Accordingly, all of B’s neighbors are excluded from the first level. At the second level, when branching from vertex B to its neighbors, vertex D is selected as the pivot because it has the largest number of common neighbors with B. Accordingly, all of D’s neighbors are excluded from the branch. The tree traversal proceeds in this way. Unlike the graph orientation approach, the pivoting approach does not stop at level k, but keeps exploring until it reaches the bottom of the tree (or satisfies a stopping criterion (Jain and Seshadhri, 2020)). To calculate the number of k-cliques found on a path, the combinatorial formula C(p, k − (n − p)) is used, where p is the number of pivots in the branch and n is the number of vertices in the branch.
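As a concrete illustration of this counting rule (using the symbols p and n as above; a sketch of the formula, not the paper's code):

#include <cstdint>

// Binomial coefficient C(n, r) via the multiplicative formula (exact for small inputs).
uint64_t choose(uint64_t n, uint64_t r) {
    if (r > n) return 0;
    uint64_t result = 1;
    for (uint64_t i = 1; i <= r; ++i)
        result = result * (n - r + i) / i;
    return result;
}

// A root-to-leaf path with n vertices, p of which are pivots, contributes
// choose(p, k - (n - p)) k-cliques: the n - p non-pivot vertices appear in every
// counted clique, and the remaining k - (n - p) members are chosen among the pivots.
uint64_t cliquesOnPath(int n, int p, int k) {
    int nonPivots = n - p;
    if (nonPivots > k) return 0;   // too many mandatory vertices: no k-clique possible
    return choose((uint64_t)p, (uint64_t)(k - nonPivots));
}

For example, cliquesOnPath(5, 5, 4) = choose(5, 4) = 5, matching the five 4-cliques contained in the 5-clique of Figure 1(a) when that clique is covered by a single all-pivot path.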

Compared to graph orientation, pivoting has the advantage that it reduces the search space significantly by eliminating the neighbors of the pivot vertex from the branching. This reduction is clear when comparing Figures 1(b) and 1(c). The reduction in branching makes pivoting particularly suitable for large k, or for counting k-cliques for all values of k. On the other hand, pivoting has several disadvantages. The first disadvantage of pivoting is that it requires deeper exploration of the search tree which exacerbates load imbalance in parallel implementations. The second disadvantage of pivoting is that it reduces the amount of parallelism available. The third disadvantage of pivoting is that it requires an undirected graph, which makes adjacency list intersection operations prohibitively expensive for large graphs where vertices have much higher degree, hence longer adjacency lists. The second and third disadvantages are mitigated (Jain and Seshadhri, 2020) by starting with the graph orientation approach for the initial level(s) of the tree using a directed graph, then switching to the pivoting approach for the remaining levels using an undirected induced sub-graph (see Section 2.4). An example of this hybrid approach is illustrated in Figures 1(d) and 1(e).

To the best of our knowledge, Pivoter (Jain and Seshadhri, 2020) is the only implementation of k-clique counting based on pivoting. The implementation is parallelized on the CPU, again by distributing vertices or edges across CPU threads, and having each thread perform a sequential depth-first traversal of the tree or subtree. In this paper, we present our parallel GPU implementation of k-clique counting based on pivoting and compare its performance to Pivoter.

2.4. Induced Sub-graph Optimization

Both the graph orientation and pivoting approaches spend a significant amount of time performing adjacency list intersection operations. A common optimization is to shrink the adjacency lists being intersected by removing, for each search tree, the vertices that can never be reached in that tree. For example, in Figure 1, when traversing the tree rooted at vertex A, only the neighbors of A can ever be reached. Hence, any vertex that is not a neighbor of A can be removed from the graph before traversing the tree rooted at A.

In general, when traversing a tree rooted at vertex v, the first step is to extract the vertex-induced sub-graph consisting of the vertices in Adj(v). This induced sub-graph is used throughout the tree traversal instead of the full graph. The induced sub-graph is typically much smaller than the full graph, with smaller adjacency lists resulting in faster adjacency list intersection operations. Note that an induced sub-graph may be extracted at any level in the search tree. For example, if the tree contains a path v1 → v2 → … → vi, the subtree rooted at vi can only reach vertices that are neighbors of all the vertices v1, v2, …, vi. Therefore, the induced sub-graph consisting of the vertices in Adj(v1) ∩ Adj(v2) ∩ … ∩ Adj(vi) is sufficient for traversing the subtree. However, extracting the induced sub-graph at each level is usually not worth the overhead. The induced sub-graph is typically extracted at one level, either the first or the second. In this paper, both alternatives are explored.

The largest possible induced sub-graph has d_max vertices, where d_max is the maximum out-degree of the graph. Therefore, the upper bound on the size of an induced sub-graph is O(d_max²). Since the maximum out-degree of the graph has a quadratic impact on memory consumption, the choice of graph orientation is critical for reducing memory consumption, as mentioned in Section 2.2. It becomes even more critical in parallel implementations because when the trees or subtrees are processed in parallel and each has a different induced sub-graph, each needs its own memory space to store its induced sub-graph. Section 3.4 describes how the memory consumption of the induced sub-graphs is further reduced in our parallel GPU implementation.

3. Clique Counting on GPUs

This section describes our implementation of k-clique counting on GPUs, including the graph format, parallelization schemes, iterative subtree traversal, and various optimizations.

3.1. Graph Format and Orientation Criteria

Figure 2. Hybrid CSR+COO representation of the graph in Figure 1(a)

We represent the input graph using the hybrid CSR+COO storage format. Figure 2 shows an example of how this storage format is used to represent the graph in Figure 1(a). The row pointer and column index arrays together constitute the CSR part. They facilitate finding the adjacency list of a given vertex, which is useful for vertex-centric processing and parallelization. The row index and column index arrays together constitute the COO part. They facilitate finding the source and destination vertex of a given edge, which is useful for edge-centric processing and parallelization.
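The sketch below shows one way such a hybrid container might look (field and method names are illustrative assumptions, not the paper's actual data structures):

#include <utility>
#include <vector>

struct HybridGraph {
    int numVertices = 0;
    std::vector<int> rowPtr;  // CSR: edges of vertex v occupy indices rowPtr[v] .. rowPtr[v+1]-1
    std::vector<int> colIdx;  // shared by CSR and COO: destination vertex of each edge
    std::vector<int> rowIdx;  // COO: source vertex of each edge (same length as colIdx)

    // Vertex-centric access: begin/end pointers of v's adjacency list.
    std::pair<const int*, const int*> adj(int v) const {
        return { colIdx.data() + rowPtr[v], colIdx.data() + rowPtr[v + 1] };
    }
    // Edge-centric access: (source, destination) of edge e.
    std::pair<int, int> edge(int e) const { return { rowIdx[e], colIdx[e] }; }
};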

Before clique counting begins, we first orient the graph into a directed graph. Recall that both the graph orientation approach and the pivoting approach require the graph to be oriented at the beginning. Our implementation supports two different orientation criteria: degree orientation and degeneracy orientation. Degree orientation orients the edges from the vertex with the lower degree to the vertex with the higher degree. Degeneracy orientation orients the edges from the vertex with the lower k-core order to the vertex with the higher k-core order. The k-core order is obtained from k-core decomposition, which iteratively removes the minimum-degree vertices from the graph; the k-core order of a vertex is the position at which that vertex is removed from the graph.

We implement both orientation criteria on the GPU, the latter using a bucket-based approach (Che et al., 2020). In both cases, after determining which edges to keep, the undesired edges are filtered out and the row pointers are recomputed with histogram and exclusive scan operations. The tradeoff between orientation criteria is evaluated in Section 4.4.
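As an illustration, a hedged sketch of the degree-orientation step is shown below; the predicate and kernel names are ours, and the compaction and scan steps are assumed to use a library such as CUB as noted above.

#include <cuda_runtime.h>

// Degree orientation: keep edge (u, v) only if it goes from the lower-degree vertex to
// the higher-degree vertex, breaking ties by vertex ID so that exactly one direction of
// each undirected edge survives.
__host__ __device__ inline bool keepEdge(int u, int v, const int* degree) {
    return (degree[u] < degree[v]) || (degree[u] == degree[v] && u < v);
}

// One thread per COO edge: flag the surviving edges. A stream compaction (e.g.,
// cub::DeviceSelect::Flagged) then removes the rest, and a histogram of the remaining
// sources followed by an exclusive scan rebuilds the CSR row pointers.
__global__ void markOrientedEdges(const int* rowIdx, const int* colIdx, const int* degree,
                                  int numEdges, int* keepFlag) {
    int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e < numEdges)
        keepFlag[e] = keepEdge(rowIdx[e], colIdx[e], degree) ? 1 : 0;
}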

We note that although degree orientation and degeneracy orientation are currently supported, our implementation can easily be extended to support other orientation criteria. The choice of orientation criteria is orthogonal to our work and not intended as a contribution of this paper.

3.2. Parallelization Schemes

Our main strategy for parallelizing k-clique counting on GPU is to traverse different search trees or subtrees in parallel. GPUs provide a massive amount of parallelism and are capable of running tens of thousands to hundreds of thousands of threads concurrently. A grid of threads running on a GPU is divided into thread blocks. Threads in the same thread block can collaborate by synchronizing at a barrier and sharing a fast scratchpad memory (also called shared memory). Thread blocks are further divided into warps which typically consist of 32 threads bound by the SIMD execution model. Threads in the same warp can collaborate using low cost warp-level primitives.

In this paper, we implement two different parallelization schemes: a vertex-centric scheme and an edge-centric scheme. In the vertex-centric scheme, each thread block is assigned to a vertex (level 1 in the search tree) and is responsible for traversing the tree rooted at that vertex. The threads in the block collaborate to extract the induced sub-graph consisting of the vertex’s neighbors, then proceed to traverse the vertex’s tree using the induced sub-graph. In the edge-centric scheme, each thread block is assigned to an edge (level 2 in the search tree) and is responsible for traversing the subtree rooted at that edge. The threads in the block collaborate to extract the induced sub-graph consisting of the common neighbors of the edge’s endpoints, then proceed to traverse the edge’s subtree using the induced sub-graph.

The advantage of the vertex-centric scheme over the edge-centric scheme is that it extracts an induced sub-graph for each vertex’s tree as opposed to each edge’s subtree, thereby amortizing the cost of extracting the induced sub-graph over a larger tree. The advantage of the edge-centric scheme over the vertex-centric scheme is that it extracts smaller induced sub-graphs (albeit more of them), thereby resulting in shorter adjacency lists. It also exposes finer-grain parallelism which makes it more robust against load imbalance. We compare the performance of the vertex-centric and edge-centric schemes in Section 4.3.

Figure 3. Parallelization schemes

Parallelization of work within each thread block differs between the graph orientation approach and the pivoting approach. In the graph orientation approach, we partition the blocks into groups of threads and each group independently traverses one of the subtrees in the next level. In the vertex-centric scheme, each group of threads is assigned to an outgoing edge (level 2 in the search tree) of the block’s vertex, and the group independently traverses the subtree rooted at the edge. An example is shown in Figure 3(a). In the edge-centric scheme, each group of threads is assigned to a triangle (level 3 in the search tree) that the block’s edge participates in, and the group independently traverses the subtree rooted at that triangle. An example is shown in Figure 3(b). In both cases, the group of threads jointly perform a depth-first traversal of the subtree it is assigned to, visiting the nodes in the search tree sequentially. The threads in the group collaborate to perform the adjacency list intersection operation in parallel.

In the pivoting approach, we also partition the blocks into groups of threads, however these groups do not process next-level subtrees independently. Instead, all threads in the block stay together as they jointly perform a sequential depth-first traversal of the tree/subtree that the block is assigned to. However, at each node in the search tree, identifying which neighbor is the pivot vertex requires performing a list intersection operation for each neighbor. Therefore, each group of threads is assigned to a different neighbor and performs a list intersection operation for that neighbor to check if it is the pivot. Examples of the vertex-centric and edge-centric schemes for the pivoting approach are shown in Figures 3(c) and 3(d), respectively.

There are two reasons why, in the pivoting approach, groups of threads are not assigned to process the next-level subtrees independently. The first reason is that the process of finding the pivot element at each tree node is expensive and already provides enough work to be parallelized across groups. The second reason is that the pivoting approach has deeper trees than the graph orientation approach, thereby requiring more memory to store intermediate results. Parallelizing the next-level subtrees across groups of threads would require too much memory for storing the intermediate results of each subtree.

For now, a group of threads can be thought of as a warp, which is the most natural way to partition a block. However, we improve on this partitioning granularity in Section 3.5.

3.3. Traversing a Subtree

In Section 3.2, we saw how the trees or subtrees are distributed across threads to be traversed in parallel. In this section, we explain how a block or group of threads jointly traverse a single tree or subtree. Tree traversal on CPUs is typically done using recursion. However, using recursion on the GPU is not an efficient approach. Due to the massive number of threads that run on the GPU, each thread can only afford to have a small stack which limits the recursion depth. Moreover, since multiple threads are used to jointly traverse a tree, these threads access the same intermediate traversal data. Replicating this traversal data in all the threads’ private stacks is not memory efficient. For this reason, our implementation replaces recursion with an iterative tree traversal whereby threads traversing the same tree explicitly manage a shared stack.

The shared stack we use for traversal consists of the following data:

  • ℓ: a counter to track the current level

  • levelVertices: an array that stores the list of vertices at each level

  • numLevelVertices: an array that stores the number of vertices at each level

  • currVertexIdx: an array that stores the index of the current vertex at each level (note that this is not the vertex ID, but the index into the levelVertices array from which to obtain the vertex ID)

1: procedure traverseSubtree(k, ℓ₀, G, initLevelVertices, initNumLevelVertices)
2:     ℓ = ℓ₀
3:     levelVertices[ℓ] = initLevelVertices
4:     numLevelVertices[ℓ] = initNumLevelVertices
5:     currVertexIdx[ℓ] = 0
6:     while currVertexIdx[ℓ] < numLevelVertices[ℓ] do
7:         v = levelVertices[ℓ][currVertexIdx[ℓ]]
8:         levelVertices[ℓ+1] =
9:             intersect(levelVertices[ℓ], Adj(v) in G)
10:        numLevelVertices[ℓ+1] = |levelVertices[ℓ+1]|
11:        if numLevelVertices[ℓ+1] > 0 and ℓ+1 < k then
12:            ++ℓ                                    ▷ Go to child level
13:            currVertexIdx[ℓ] = 0
14:        else
15:            if ℓ+1 == k then
16:                numCliques += numLevelVertices[ℓ+1]
17:            ++currVertexIdx[ℓ]                     ▷ Go to next sibling
18:            while currVertexIdx[ℓ] == numLevelVertices[ℓ] and ℓ > ℓ₀ do
19:                --ℓ                                ▷ Go to parent level
20:                ++currVertexIdx[ℓ]                 ▷ Go to parent’s next sibling
Algorithm 1 Iterative Traversal for Graph Orientation Approach

Algorithm 1 shows the pseudocode for our iterative subtree traversal for the graph orientation approach. The routine takes k, an initial level ℓ₀, the induced sub-graph G, the list of vertices at the initial level, and the number of vertices at the initial level (line 1). First, the parameters are used to initialize the stack data (lines 2-5). Next, a loop iterates until there are no more vertices to process (line 6). Inside the loop, the current vertex is read from the list of vertices (line 7). The children of the vertex are obtained by intersecting the list of vertices at the current level (which is also the list of common neighbors of all the vertex’s ancestors) with the adjacency list of the vertex itself (lines 8-10). The adjacency list is obtained from the induced sub-graph. The threads in the group collaborate to perform the intersection operation in parallel. After the intersection, if the vertex is found to have children that are not at the bottom of the tree, we go to the next level to traverse the children (lines 11-13). Otherwise, if the bottom of the tree is reached, we accumulate the number of cliques (lines 15-16). We then increment the vertex index to go to the vertex’s next sibling (line 17). If the vertex has no more siblings, then we are done traversing the vertex’s parent, so we move on to the parent’s next sibling (lines 18-20).
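For clarity, the following is a sequential C++ rendering of Algorithm 1 (illustrative; in the GPU version, one thread group runs this control flow with the stack arrays in shared memory and performs each intersection cooperatively on binary-encoded lists, as described in Section 3.4):

#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using AdjList = std::vector<std::vector<int>>;   // sorted adjacency lists of the induced sub-graph

uint64_t traverseSubtree(int k, int l0, const AdjList& G, std::vector<int> initVertices) {
    std::vector<std::vector<int>> levelVertices(k + 1);
    std::vector<int> currVertexIdx(k + 1, 0);
    uint64_t numCliques = 0;
    int l = l0;
    levelVertices[l] = std::move(initVertices);

    while (currVertexIdx[l] < (int)levelVertices[l].size()) {
        int v = levelVertices[l][currVertexIdx[l]];
        // Children of v: common neighbors of v and of all vertices on the current path.
        levelVertices[l + 1].clear();
        std::set_intersection(levelVertices[l].begin(), levelVertices[l].end(),
                              G[v].begin(), G[v].end(),
                              std::back_inserter(levelVertices[l + 1]));
        if (!levelVertices[l + 1].empty() && l + 1 < k) {
            ++l;                                   // go to child level
            currVertexIdx[l] = 0;
        } else {
            if (l + 1 == k)
                numCliques += levelVertices[l + 1].size();
            ++currVertexIdx[l];                    // go to next sibling
            while (l > l0 && currVertexIdx[l] == (int)levelVertices[l].size()) {
                --l;                               // go back to parent level
                ++currVertexIdx[l];                // parent's next sibling
            }
        }
    }
    return numCliques;
}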

For the pivoting approach, we also use the shared stack to store the following additional data:

  • pivotVertex: an array that stores the pivot vertex ID at each level

  • pivotCount: an array that stores, at each level, the number of pivot elements on the path to the vertex at the current level

  • intersection: an array that stores, at each level, the list of vertices that are common neighbors of all the vertices at the previous levels (in the graph orientation approach, intersection is equivalent to levelVertices, but in the pivoting approach, it also includes the neighbors of the pivot that are excluded from levelVertices)

1: procedure traverseSubtree(k, ℓ₀, G, initIntersection)
2:     ℓ = ℓ₀
3:     intersection[ℓ] = initIntersection
4:     currVertexIdx[ℓ] = 0
5:     pivotCount[ℓ] = 0
6:     pivotVertex[ℓ] = findPivotVertex(intersection[ℓ], G)
7:     levelVertices[ℓ] =
8:         intersection[ℓ] − (Adj(pivotVertex[ℓ]) in G)
9:     numLevelVertices[ℓ] = |levelVertices[ℓ]|
10:    while currVertexIdx[ℓ] < numLevelVertices[ℓ] do
11:        v = levelVertices[ℓ][currVertexIdx[ℓ]]
12:        pivotCount[ℓ+1] = pivotCount[ℓ] + ((v == pivotVertex[ℓ]) ? 1 : 0)
13:        if (ℓ+1) − pivotCount[ℓ+1] > k then
14:            traverseChildLevel = false
15:        else
16:            intersection[ℓ+1] =
17:                intersect(intersection[ℓ], Adj(v) in G)
18:                    − levelVertices[ℓ][0 : currVertexIdx[ℓ]]
19:            traverseChildLevel = |intersection[ℓ+1]| > 0
20:        if traverseChildLevel then
21:            ++ℓ                                    ▷ Go to child level
22:            currVertexIdx[ℓ] = 0
23:            pivotVertex[ℓ] = findPivotVertex(intersection[ℓ], G)
24:            levelVertices[ℓ] =
25:                intersection[ℓ] − (Adj(pivotVertex[ℓ]) in G)
26:            numLevelVertices[ℓ] = |levelVertices[ℓ]|
27:        else
28:            if ℓ+1 ≥ k then
29:                numCliques +=
30:                    choose(pivotCount[ℓ+1], (ℓ+1) − k)
31:            ++currVertexIdx[ℓ]                     ▷ Go to next sibling
32:            while currVertexIdx[ℓ] == numLevelVertices[ℓ] and ℓ > ℓ₀ do
33:                --ℓ                                ▷ Go to parent level
34:                ++currVertexIdx[ℓ]                 ▷ Go to parent’s next sibling
Algorithm 2 Iterative Traversal for Pivoting Approach

Algorithm 2 shows the pseudocode for our iterative subtree traversal for the pivoting approach. The routine takes k, an initial level ℓ₀, the induced sub-graph G, and the common neighbors of all the vertices up to the initial level (line 1). First, the parameters are used to initialize the stack data (lines 2-4). Next, we find the pivot vertex for the current level (line 6). Recall that in the pivoting approach, all threads in the block jointly traverse the same tree or subtree. To find the pivot vertex, the vertices in intersection[ℓ] are distributed across the groups of threads, and each group intersects its vertex’s adjacency list with intersection[ℓ] to find the number of common neighbors the vertex has with its parent. A max operation is then performed across groups to identify the pivot vertex. After the pivot vertex is found, the neighbors of the pivot are excluded from the level vertices (lines 7-9). A loop then begins to iterate until there are no more vertices to process (line 10).

Inside the loop, the current vertex is read from the list of vertices (line 11). Based on whether the current vertex is the pivot vertex or not, the number of pivot elements up to the next level is set (line 12). Next, we must figure out whether we need to traverse the child level or we have reached the bottom of the tree (lines 13-19). If the stopping criterion (Jain and Seshadhri, 2020) is met, we set a flag not to go to the child level (lines 13-14). Otherwise, we find the child vertices by intersecting the current vertex’s adjacency list with the common neighbors of its parents and removing the current vertex’s already processed siblings (lines 16-18). If child vertices are found, we set a flag to traverse the child level, otherwise we have reached the bottom (line 19). If we have determined that we need to traverse the child level (line 20), we now go to the child level (lines 21-22). We also identify the pivot vertex at the child level and exclude its neighbors from the level vertices (lines 23-26) following the same procedure that was used for the initial level (lines 6-9). Otherwise, if we have determined that we have reached the bottom (line 27), we identify the number of non-redundant k-cliques in the clique that we found and accumulate the number of cliques (lines 28-30). We then increment the current vertex index to go to the vertex’s next sibling (line 31). If the vertex has no more siblings, then we are done traversing the vertex’s parent, so we move on to the parent’s next sibling (lines 32-34).
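To illustrate how the groups of threads cooperate on pivot selection, below is a hedged CUDA sketch in which each group is a full warp; the names, data layout, and tie-breaking are illustrative assumptions, not the paper's exact kernel code. It assumes the block size is a multiple of 32 and that all threads of the block call the function together.

#include <cuda_runtime.h>

// Each warp computes |Adj(candidate) ∩ candidateSet| on binary-encoded lists
// (bitwise AND + popcount), then a block-wide arg-max picks the pivot.
// encodedAdj: one row of `words` 32-bit words per sub-graph vertex.
__device__ int findPivotVertex(const unsigned* encodedAdj, const unsigned* candidateSet,
                               const int* candidates, int numCandidates, int words,
                               int* sBestCount, int* sBestVertex /* shared memory */) {
    int warpId = threadIdx.x / 32, lane = threadIdx.x % 32;
    int numWarps = blockDim.x / 32;
    if (threadIdx.x == 0) { *sBestCount = -1; *sBestVertex = -1; }
    __syncthreads();

    int bestCount = -1, bestVertex = -1;
    for (int c = warpId; c < numCandidates; c += numWarps) {
        int v = candidates[c];
        int cnt = 0;
        for (int w = lane; w < words; w += 32)             // cooperative AND + popcount
            cnt += __popc(encodedAdj[v * words + w] & candidateSet[w]);
        for (int off = 16; off > 0; off >>= 1)             // warp sum reduction into lane 0
            cnt += __shfl_down_sync(0xffffffff, cnt, off);
        if (lane == 0 && cnt > bestCount) { bestCount = cnt; bestVertex = v; }
    }
    if (lane == 0) atomicMax(sBestCount, bestCount);        // block-wide max of per-warp bests
    __syncthreads();
    if (lane == 0 && bestCount == *sBestCount) *sBestVertex = bestVertex;  // ties: arbitrary
    __syncthreads();
    return *sBestVertex;
}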

3.4. Binary Encoding of Induced Sub-graphs

Recall that the first step a thread block performs before traversing its tree or subtree is to extract an induced sub-graph. In the vertex-centric scheme, the induced sub-graph consists of the vertex’s neighbors, whereas in the edge-centric scheme, the induced sub-graph consists of the common neighbors of the edge’s endpoints. Each thread block has a different induced sub-graph. The block’s induced sub-graph is used by all threads in the block to perform the adjacency list intersection operations throughout the tree traversal.

The storage format used for the induced sub-graph is of utmost importance for both memory consumption and execution time. For memory consumption, since each thread block has a different induced sub-graph, enough memory must be allocated for all the blocks running simultaneously so they can each store a private induced sub-graph. Hence, the space efficiency of the induced sub-graph storage format is crucial for the overall memory consumption. For execution time, since threads use the induced sub-graph to perform adjacency list intersection operations, the storage format must be designed to enable low-latency parallel intersections.

To optimize both memory consumption and execution time, we use binary encoding to represent the induced sub-graph. Figure 4 shows an example of how an induced sub-graph can be binary encoded. Assume that a thread block is assigned to traverse the tree rooted at vertex A in the graph in Figure 4(a). The only vertices ever visited in that tree are the neighbors of A. Therefore, the thread block starts by extracting the induced sub-graph consisting of the neighbors of A shown in Figure 4(b). This induced sub-graph is binary encoded as shown in Figure 4(c). The adjacency list of each vertex in the sub-graph consists of a bit vector with a 1 for each neighbor and a 0 otherwise. Any two lists can be intersected by performing a simple bitwise-and operation between the two bit vectors. We use a directed induced sub-graph for the graph orientation approach, and an undirected induced sub-graph for the pivoting approach. Note that in addition to the induced sub-graph being binary encoded, all the intermediate vertex lists (i.e., the levelVertices and intersection lists in Algorithms 1 and 2) are also binary encoded, and intersections with these intermediate vertex lists also use bitwise-and.

Figure 4. Binary encoding example

The advantage of binary encoding for memory consumption is that each vertex in an adjacency list is represented with a single bit. Since dynamic memory allocation is not efficient on GPUs, the space for each thread block to store its induced sub-graph and its intermediate vertex lists must be pre-allocated with enough capacity for the largest possible sub-graph. The largest possible sub-graph may have d_max vertices, where d_max is the maximum out-degree of the graph, and the sub-graph may be completely dense. In this case, storing the sub-graph would require O(d_max²) memory, and storing the intermediate vertex lists would require O(d_max) memory per level. Binary encoding reduces these amounts by a factor of 32 by encoding each vertex in each list as a single bit.
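As a concrete example (using the degree-orientation d_max of as-skitter from Table 2, and assuming 32-bit vertex IDs for the uncompressed baseline): with d_max = 231, an uncompressed induced sub-graph could occupy up to 231 × 231 × 4 bytes ≈ 208 KB per thread block, whereas the binary-encoded form occupies 231 × ⌈231/32⌉ × 4 bytes ≈ 7 KB.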

The advantage of binary encoding for execution time is that it allows list intersection operations to be performed using simple bitwise-and operations. Traditional adjacency list intersection techniques on GPUs are complex to parallelize, suffer from control divergence, and exhibit uncoalesced memory access patterns. On the other hand, performing bitwise-and on a bit vector is easy to parallelize across threads in a group or block, does not suffer from control divergence, and enables coalescing of memory accesses.

The reason binary encoding is particularly effective for the induced sub-graphs and intermediate vertex lists is that they typically consist of a small number of vertices, especially when a good graph orientation criterion is used. Moreover, the induced sub-graphs are typically denser than the full graph. In contrast, binary encoding for the full graph is impractical because the full graph has many more vertices and is usually much sparser, resulting in many wasted 0 bits. For this reason, we continue to represent the full graph using the hybrid CSR+COO format (see Section 3.1) and only represent the induced sub-graphs using binary encoding. To extract the binary-encoded induced sub-graph from the full hybrid CSR+COO graph, we intersect the adjacency lists of the hybrid CSR+COO graph using binary-search-based intersections.
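The sketch below shows how such an extraction might look for the vertex-centric case (one kernel call for a single root vertex; the names and the one-thread-per-vertex-pair layout are illustrative assumptions, not the paper's exact code):

#include <cuda_runtime.h>

// Binary search for `target` in the sorted CSR segment colIdx[begin .. end).
__device__ bool isNeighbor(const int* colIdx, int begin, int end, int target) {
    while (begin < end) {
        int mid = (begin + end) / 2;
        if (colIdx[mid] == target) return true;
        if (colIdx[mid] < target) begin = mid + 1; else end = mid;
    }
    return false;
}

// Build the binary-encoded induced sub-graph of `root`: its neighbors become sub-graph
// vertices 0..n-1, and bit j of row i is set when neighbor i has neighbor j in its
// adjacency list. `encoded` must be zero-initialized and hold n rows of ceil(n/32) words.
__global__ void buildEncodedSubgraph(const int* rowPtr, const int* colIdx, int root,
                                     unsigned* encoded) {
    int n = rowPtr[root + 1] - rowPtr[root];
    int words = (n + 31) / 32;
    const int* nbrs = colIdx + rowPtr[root];
    for (int idx = blockIdx.x * blockDim.x + threadIdx.x; idx < n * n;
         idx += gridDim.x * blockDim.x) {
        int i = idx / n, j = idx % n;
        if (i != j && isNeighbor(colIdx, rowPtr[nbrs[i]], rowPtr[nbrs[i] + 1], nbrs[j]))
            atomicOr(&encoded[i * words + j / 32], 1u << (j % 32));
    }
}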

To the best of our knowledge, our implementation is the first to use binary encoding for the induced sub-graphs in k-clique counting. Related work on the maximal clique problem on GPUs uses binary encoding for the entire graph (VanCompernolle et al., 2016a) or for specialized data structures to represent and operate on the candidate maximal cliques (Yu and Liu, 2019; Dasari et al., 2014). Many other graph processing works use binary encoding in different contexts. We evaluate the performance improvement of using binary encoded induced sub-graphs in Section 4.3.

3.5. Sub-warp Partitioning

In Section 3.2, we explained that for both the graph orientation approach and the pivoting approach, our implementation partitions thread blocks into groups of threads and distributes the block’s work across these groups. In the graph orientation approach, a block is assigned to a tree or subtree, and has each group of threads process one of the next level subtrees in parallel. In the pivoting approach, a block is also assigned to a tree or subtree, but the groups of threads jointly iterate over the tree nodes sequentially. However, for each tree node, when determining which child vertex is the pivot, each group of threads is used to check a different child in parallel.

One important design consideration is the granularity at which thread blocks are partitioned into groups. The most natural granularity is the warp because threads in the same warp are bound together by SIMD and are able to collaborate using low cost warp-level primitives. However, the introduction of binary encoding makes the warp granularity often too coarse. Recall that threads within a group collaborate to perform a single list intersection operation in parallel. With binary encoding, each thread can intersect 32 list elements simultaneously with a single bitwise-and operation. Therefore, to fully utilize all 32 threads in the warp, the intersection needs to contain 1024 list elements. We show in Section 4.4 that with proper graph orientation criteria, the maximum out-degree (and consequently, the largest binary encoded list size) is often much smaller than that. Therefore, partitioning blocks at the warp granularity would lead to underutilization of parallel execution resources.

To address this issue, we implement sub-warp partitioning, where thread blocks are partitioned into groups smaller than a warp. Since the Volta architecture, the introduction of Independent Thread Scheduling (NVIDIA, 2017) has enabled fine-grain collaboration between a subset of threads within a warp. This can be done by providing a 32-bit mask to the warp-level primitives to specify the threads participating in the intra-warp synchronization operation. For example, the operation __syncwarp(0x0000000F) synchronizes the first four threads in a warp, while the remaining threads execute independently. We leverage such primitives to enable the creation of thread groups that are 32, 16, 8, 4, 2, or 1 threads in size. The number of threads per group is a tunable parameter, and the same traversal code works for any group size. We evaluate the performance improvement of using sub-warp partitioning in Section 4.3.
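A minimal sketch of this kind of sub-warp grouping is shown below (the helper is ours, not the paper's code; it assumes GROUP_SIZE is a power of two no larger than 32 and that the block size is a multiple of 32):

#include <cuda_runtime.h>

template <int GROUP_SIZE>
__device__ int groupReduceSum(int val) {
    int laneInWarp = threadIdx.x % 32;
    int groupStart = laneInWarp & ~(GROUP_SIZE - 1);             // first lane of this group
    unsigned mask  = (GROUP_SIZE == 32) ? 0xffffffffu
                                        : (((1u << GROUP_SIZE) - 1u) << groupStart);
    __syncwarp(mask);                                            // sync only the group's lanes
    for (int off = GROUP_SIZE / 2; off > 0; off >>= 1)
        val += __shfl_down_sync(mask, val, off, GROUP_SIZE);     // stay within the group
    return val;                                                  // group's sum in its first lane
}

For example, with GROUP_SIZE = 8, lanes 0-7, 8-15, 16-23, and 24-31 of a warp act as four independent groups, each summing its own partial intersection counts.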

3.6. Memory Management

The issue of memory consumption is exacerbated on the GPU compared to the CPU for two key reasons. The first reason is that the capacity of the device memory on a typical GPU is much smaller than the capacity of main memory on a typical CPU. The second reason is that GPUs are massively parallel processors so they perform much more work in parallel, thereby requiring much more intermediate data to be stored simultaneously. For example, a CPU may run tens of threads at a time, so it only needs enough memory to maintain that many different sets of induced sub-graphs and intermediate vertex lists. In contrast, a GPU may run hundreds to thousands of thread blocks at once, so it needs enough memory to maintain hundreds to thousands of different sets of induced sub-graphs and intermediate vertex lists. Therefore, with a lower memory capacity and a higher demand for memory, it becomes critical to manage memory efficiently on the GPU.

We have discussed multiple techniques that we use for reducing memory consumption throughout this paper. In Section 3.2, we discuss how an induced sub-graph is extracted once per thread block and shared by multiple groups of threads in their traversal of different subtrees. In Section 3.3, we discuss how threads in the same group or block explicitly manage a shared stack for storing the intermediate vertex lists, rather than each storing a private copy. In Section 3.4, we discuss how binary encoding of sub-graphs and intermediate vertex lists reduces their memory requirement. In this section, we discuss one more technique for reducing memory consumption, then analyze the overall memory consumption of our approaches.

A GPU grid typically consists of many more thread blocks than the number that can run simultaneously. A block executes to completion and is replaced by another thread block after it is done. For this reason, it is unnecessary to pre-allocate memory space for the induced sub-graph and intermediate vertex lists of all the blocks in the grid. Instead, we pre-allocate memory space for the maximum number of blocks that can run simultaneously, and reuse the same space for multiple consecutive blocks.
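A hedged sketch of how this sizing could be computed with the CUDA occupancy API is shown below (the kernel is a placeholder, and bytesPerBlock stands for the per-block induced sub-graph plus stack footprint from Table 1):

#include <cuda_runtime.h>
#include <cstddef>

__global__ void countCliquesKernel() { /* placeholder for the actual counting kernel */ }

// Scratch space only needs to cover the blocks that can be resident at the same time;
// consecutive blocks then reuse the same slots.
size_t scratchBytesNeeded(int blockSize, size_t bytesPerBlock) {
    int device = 0, numSMs = 0, blocksPerSM = 0;
    cudaGetDevice(&device);
    cudaDeviceGetAttribute(&numSMs, cudaDevAttrMultiProcessorCount, device);
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, countCliquesKernel,
                                                  blockSize, 0 /* dynamic smem */);
    return (size_t)numSMs * (size_t)blocksPerSM * bytesPerBlock;
}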

Data structure      Graph Orientation Approach               Pivoting Approach
                    Memory Space   Memory Consumption        Memory Space   Memory Consumption
G                   Global         d_max² bits               Global         d_max² bits
levelVertices       Global         g × k × d_max bits        Global         d_max² bits
numLevelVertices    Shared         g × k words               Global         d_max words
currVertexIdx       Shared         g × k words               Global         d_max words
pivotVertex         -              -                         Shared         d_max words
pivotCount          -              -                         Global         d_max words
intersection        -              -                         Global         d_max² bits
Table 1. Upper bound on memory consumed per block for our graph orientation and pivoting implementations

Table 1 shows the upper bound on the memory consumed per thread block for each of the graph orientation approach and the pivoting approach. For both approaches, a thread block has one copy of the induced sub-graph G, which consumes d_max² bits of memory as explained in Section 3.4. The remaining data structures are part of the shared stack, and their size depends on the maximum depth of the stack as well as the number of shared stacks within the block (i.e., the number of thread groups processing subtrees independently). In the graph orientation approach, recall that each group of threads processes a subtree in parallel, hence there are g shared stacks, where g is the number of thread groups in a block. Since the maximum depth of the stack is k, each stack data structure has k elements. In the pivoting approach, recall that all the groups of threads process the same tree or subtree, hence there is only one shared stack for the entire block. However, pivoting does not stop at level k but goes to the bottom of the tree, so the maximum depth of the stack is the maximum number of vertices in the induced sub-graph, which is d_max. Hence, each stack data structure has d_max elements. The levelVertices and intersection arrays are binary-encoded vertex lists, so the size of each element is d_max bits. The rest of the data structures consist of a single value per element. Note that some of the stack data structures are placed in shared memory for fast access.

4. Evaluation

4.1. Methodology

Evaluation Platform. We evaluate our GPU implementations on a dual socket machine with 512 GB of main memory. Each socket has a 20-core (with 2-way hyperthreading) Intel Xeon Gold 6230 CPU and two V100 GPUs with 32 GB and 16 GB of HBM2 global memory. The four GPUs are connected through NVLink 2.0. We only use a single 32 GB V100 GPU in our evaluation. We compile our code with NVCC (CUDA 10.2) and GCC 8.3.1 with the -O3 flag. The CUDA driver version used is 440.82. For the pre-processing filter and exclusive scan operations, we use NVIDIA CUB 1.8.0 (Merrill, 2015).

Figure 5. Comparing execution time with state-of-the-art parallel CPU implementations (lower is better, exact times in Table 3)

Datasets. We use the same graphs used by ARB-COUNT (Shi et al., 2020) for exact k-clique counting evaluation, namely: as-skitter, com-dblp, com-orkut, com-friendster, and com-lj. These graphs are real-world undirected graphs from the Stanford Network Analysis Project (SNAP) (Leskovec and Krevl, 2014).

Baselines. We compare our GPU implementations to two CPU baselines: ARB-COUNT (Shi et al., 2020) and Pivoter (Jain and Seshadhri, 2020). ARB-COUNT (Shi et al., 2020) represents the state-of-the-art parallel graph orientation implementation on CPU, which significantly outperforms other graph orientation implementations (Danisch et al., 2018; Li et al., 2020). Pivoter (Jain and Seshadhri, 2020) represents the state-of-the-art parallel pivoting implementation on CPU. We use the execution times reported in ARB-COUNT (Shi et al., 2020) for the parallel implementations of ARB-COUNT and Pivoter. The execution times they report include the time spent pre-processing and counting, and exclude the time spent loading the graph. ARB-COUNT reports the best time achieved using different graph orientation and parallelization strategies.

Reporting of Measurements. For our GPU implementations, the execution times we report include the time spent pre-processing and counting, and exclude the time spent loading the graph. Unless otherwise specified, we report the time achieved for the best combination of orientation criteria (degree or degeneracy), parallelization scheme (vertex-centric or edge-centric), binary encoding, and sub-warp partition size. Similar to ARB-COUNT (Shi et al., 2020), we do not report times greater than five hours.

4.2. Comparison with State-of-the-Art Parallel CPU Implementations

Figure 5 compares the execution time of the state-of-the-art parallel CPU implementations with our GPU implementations for both the graph orientation approach and the pivoting approach. The execution times in Figure 5 are also reported in Table 3 at the end of the paper. The missing datapoints in the figure are those that take longer than 5 hrs to run, or run out of memory in the case of com-friendster using Pivoter. Note that our GPU implementations do not run out of memory for any scenario, despite the constrained memory requirements on the GPU. Based on the results in Figure 5, we make two key observations.

Graph Orientation vs. Pivoting. The first observation is that the graph orientation approach performs better than the pivoting approach for small values of k, while the latter performs better for large values of k. This observation is consistent with that made in prior work (Shi et al., 2020). The observation applies for both CPU and GPU implementations. Recall from Section 2.3 that pivoting has the advantage of reducing the amount of branching but the disadvantage of having deeper search trees. For small values of k, branching for the graph orientation approach is moderate, whereas the deep trees in the pivoting approach create load imbalance. However, as k gets larger, the branching drastically increases, causing the graph orientation approach to suffer. In most cases, the transition from the graph orientation approach being fastest to the pivoting approach being fastest happens between k = 7 and k = 8.

GPU vs. CPU. The second observation is that our best GPU implementation consistently and significantly outperforms the best state-of-the-art parallel CPU implementation across all values of k. Our best GPU implementation outperforms the best CPU implementation by a geometric mean speedup of 12.39, 6.21, and 18.99 for k = 4, 7, and 10, respectively. This result demonstrates the effectiveness of GPUs at accelerating k-clique counting.

4.3. Impact of Parallelization Schemes and Speedup of Optimizations

Figure 6. Impact of parallelization schemes and speedup of optimizations for each approach (higher is better)

Figures 6(a) and 6(b) show the incremental speedup of binary encoding and sub-warp partitioning for both parallelization schemes for the graph orientation approach and the pivoting approach, respectively. VC and EC refer to the vertex-centric and edge-centric parallelization schemes, respectively, with induced sub-graphs represented using the CSR format and blocks partitioned into groups at warp granularity. The +BE suffix refers to when binary encoding is applied to the induced sub-graphs instead of using CSR. The +P suffix refers to when sub-warp partitioning is applied and the best partition size is used. The omitted datapoints in the figure are those where the baseline (VC) takes longer than 5 hrs to run.

Parallelization Scheme. We observe that the vertex-centric parallelization scheme is more effective than the edge-centric parallelization scheme for small values of k, particularly for the graph orientation approach. For example, for the smallest value of k evaluated, VC+BE+P has a geometric mean speedup over EC+BE+P of 2.27 for the graph orientation approach and 1.09 for the pivoting approach. However, as k gets larger, the edge-centric parallelization scheme becomes more effective. For example, for the largest value of k evaluated, EC+BE+P has a geometric mean speedup over VC+BE+P of 4.68 for the graph orientation approach and 1.45 for the pivoting approach. Recall from Section 3.2 that the vertex-centric scheme has the advantage of amortizing the cost of extracting the induced sub-graph over a larger tree, whereas the edge-centric scheme has the advantage of extracting more parallelism, making it more robust against load imbalance. The graph orientation approach for small k has the smallest trees, hence amortizing the induced sub-graph extraction across larger trees makes the vertex-centric scheme attractive. However, as k increases, load imbalance increases, which makes the edge-centric scheme attractive. We also note that the transition from the vertex-centric scheme being fastest to the edge-centric scheme being fastest happens at a much smaller k for our GPU implementation than it does for prior parallel CPU implementations (Shi et al., 2020). GPUs exhibit more parallel execution resources than CPUs, which makes them more sensitive to load imbalance, thereby favoring an earlier transition to the more load-balanced edge-centric scheme.

Binary Encoding. We observe that binary encoding gives consistent performance improvement for both the graph orientation and pivoting approaches across all graphs, parallelization schemes, and values of k. The geometric mean speedup of applying binary encoding is 2.17 for the graph orientation approach, and 1.38 for the pivoting approach. Recall from Section 3.4 that binary encoding improves execution time because it enables lower-latency parallel list intersection operations. The graph orientation approach spends the majority of time performing list intersection operations, whereas the pivoting approach performs other kinds of operations like finding a maximum. For this reason, it is expected that the speedup of binary encoding would be more pronounced in the graph orientation approach.

Sub-warp Partitioning. We observe that sub-warp partitioning also gives consistent performance improvement for both the graph orientation and pivoting approaches across all graphs, parallelization schemes, and values of k. The geometric mean speedup of applying sub-warp partitioning is 1.98 for the graph orientation approach, and 1.73 for the pivoting approach. Recall from Section 3.2 that the groups of threads within a block in the graph orientation approach operate completely independently, whereas in pivoting, these groups collaborate with each other to find the pivot vertex. For this reason, it is expected that the speedup of unleashing more parallelism via sub-warp partitioning will be more pronounced in the graph orientation approach. Regarding the choice of the best partition size, recall from Section 3.5 that sub-warp partitioning allows better utilization of the threads in the thread group when the adjacency lists being intersected are small. Our analysis shows that graphs with lower maximum degree (hence, shorter adjacency lists) favor having fewer threads per group, whereas graphs with larger maximum degree (hence, longer adjacency lists) favor having more threads per group.

4.4. Impact of Graph Orientation Criteria

Graph            Undirected    Degree Orientation            Degeneracy Orientation
                 d_max         d_max    Preprocessing (s)    d_max    Preprocessing (s)
as-skitter       35,455        231      0.005                111      0.205
com-dblp         343           113      0.002                113      0.051
com-orkut        33,313        535      0.056                253      0.833
com-friendster   5,214         868      5.465                304      12.294
com-lj           14,815        524      0.016                360      0.421
Table 2. Achieved maximum out-degree (d_max) and pre-processing time (in seconds) of the two orientation criteria

Table 2 shows the maximum out-degree (d_max) achieved by the two orientation criteria used in this work, namely degree orientation and degeneracy orientation, along with their pre-processing times. It is clear that degeneracy orientation achieves a lower maximum out-degree but has a higher pre-processing time. Recall that the maximum out-degree forms an upper bound on the length of the adjacency list intersections performed throughout the traversal. Therefore, a lower maximum out-degree results in faster list intersection operations.

Our analysis of the best orientation criteria for different runs shows that runs with lower values of k favor degree orientation, whereas runs with higher values of k favor degeneracy orientation. Runs with higher values of k take longer and perform more list intersections, hence there is enough work reduction to amortize the higher pre-processing cost. On the other hand, runs with lower values of k do not perform enough list intersections to amortize the pre-processing cost. This trend is consistent with the trend observed in prior work (Shi et al., 2020). We include this analysis here for completeness; however, as mentioned in Section 3.1, the choice of orientation criteria is orthogonal to our work and not intended as a contribution of this paper.

Graph  |V|  |E|  k  # k-cliques  ARB-COUNT (Shi et al., 2020)  Pivoter (Jain and Seshadhri, 2020)  GPU-GraphOrientation  GPU-Pivot
as-skitter 1,696,415 11,095,298 4 148,834,439 0.60 20.90 0.034 0.459
5 1,183,885,507 0.67 22.54 0.069 0.721
6 9,759,000,981 1.24 25.30 0.245 1.01
7 73,142,566,591 5.73 27.14 1.434 1.269
8 481,576,204,696 28.38 27.46 9.531 1.585
9 2,781,731,674,867 158.45 28.45 68.221 1.835
10 14,217,188,170,569 854.87 28.45 474.785 1.777
11 64,975,151,572,336 4,158.53 28.45 2997.385 1.778
com-dblp 317,080 1,049,866 4 16,713,192 0.10 2.88 0.008 0.109
5 262,663,639 0.13 2.88 0.016 0.109
6 4,221,802,226 0.30 2.88 0.042 0.109
7 60,913,718,813 2.05 2.88 0.545 0.109
8 777,232,734,905 24.06 2.88 9.031 0.109
9 8,813,264,533,265 281.39 2.88 139.046 0.109
10 89,563,892,212,629 2,981.74 2.88 2262.99 0.109
11 822,551,101,011,469 >5 hours 2.88 >5 hours 0.109
com-orkut 3,072,441 117,185,083 4 3,221,946,137 3.10 292.35 0.426 8.83
5 15,766,607,860 4.94 385.04 1.014 13.869
6 75,249,427,585 12.57 462.05 3.506 17.229
7 353,962,921,685 42.09 517.29 11.719 20.331
8 1,632,691,821,296 150.87 559.75 45.319 26.137
9 7,248,102,160,867 584.39 598.88 212.912 33.644
10 30,288,138,110,629 2,315.89 647.18 1,002.165 39.957
11 117,138,620,358,191 8,843.51 647.18 4,421.597 48.101
com-friendster 65,608,366 1,806,067,135 4 8,963,503,263 109.46 Out M 10.215 44.897
5 21,710,817,218 111.75 Out M 11.796 53.874
6 59,926,510,355 115.52 Out M 17.22 63.874
7 296,858,496,789 139.98 Out M 45.697 66.544
8 3,120,447,373,827 300.62 Out M 99.866 67.064
9 40,033,489,612,826 1,796.12 Out M 803.53 71.404
10 487,090,833,092,739 16,836.41 Out M 12,775.67 71.051
11 5,403,375,502,221,431 >5 hours Out M >5 hours 71.448
com-lj 3,997,962 34,681,189 4 5,216,918,441 1.77 268.06 0.104 10.864
5 246,378,629,120 7.52 1,475.99 0.943 68.966
6 10,990,740,312,954 258.46 7,816.13 23.792 379.88
7 449,022,426,169,164 10,733.21 >5 hours 1,077.66 1,639.537
8 16,890,998,195,437,619 >5 hours >5 hours >5 hours 6,850.989
Table 3. Comparison of total execution time in seconds between ARB-COUNT (Shi et al., 2020), Pivoter (Jain and Seshadhri, 2020), and our GPU implementations ("Out M" denotes a run that exceeded available memory)

5. Related Work

Graph Orientation Approach to Clique Counting. Graph orientation is a fundamental approach to avoiding redundant clique discovery (Chiba and Nishizeki, 1985; Finocchi et al., 2015; Danisch et al., 2018; Shi et al., 2020; Li et al., 2020). The Chiba-Nishizeki algorithm (Chiba and Nishizeki, 1985) was the first practical sequential algorithm to search for dense structures in graphs; it uses degree orientation. Several works implement parallel k-clique counting on CPUs. Finocchi et al. (Finocchi et al., 2015) present parallel k-clique counting algorithms for MapReduce, also using degree orientation as a pre-processing step. KCList (Danisch et al., 2018) is designed to list all the cliques in a graph; it is based on the Chiba-Nishizeki algorithm but uses degeneracy (k-core) orientation instead of degree orientation, and explores both vertex-centric and edge-centric parallelizations. ARB-COUNT (Shi et al., 2020) includes a parallel work-efficient CPU implementation of k-clique counting. It supports different orientation criteria such as degree, degeneracy, Barenboim-Elkin (Barenboim and Elkin, 2010), and Goodrich-Pszona (Goodrich and Pszona, 2011), and also explores both vertex-centric and edge-centric parallelizations. ARB-COUNT reports being faster than other parallel k-clique implementations such as KCList (Danisch et al., 2018), as well as other sub-graph matching implementations such as the worst-case optimal join algorithm (Mhedhbi and Salihoglu, 2019), Lai et al.'s implementation of a binary join algorithm (Lai et al., 2019), and the Escape algorithm (Pinar et al., 2017). Li et al. (Li et al., 2020) propose an orientation based on graph coloring for k-clique listing, but report that ARB-COUNT (Shi et al., 2020) is faster than their approach. All of these parallel implementations of the graph orientation approach are designed for CPUs. We implement the graph orientation approach for k-clique counting on the GPU and compare our performance with ARB-COUNT (Shi et al., 2020).

Pivoting Approach to Clique Counting. Pivoter (Jain and Seshadhri, 2020) is a recent work on k-clique counting that is inspired by the classical pivoting idea of the Bron-Kerbosch maximal clique finding algorithm (Eppstein et al., 2010). ARB-COUNT (Shi et al., 2020) compares against Pivoter and shows that Pivoter is faster for large values of k. Pivoter is implemented on the CPU. We implement the pivoting approach for k-clique counting on the GPU and compare our performance with Pivoter (Jain and Seshadhri, 2020).

Maximal Clique Enumeration. Enumerating the maximal cliques in a graph has been extensively studied on CPUs (Cheng et al., 2012; Schmidt et al., 2009; Das et al., 2020; Xu et al., 2014) and GPUs (Lessley et al., 2017; Jenkins et al., 2011; VanCompernolle et al., 2016b; Henry and Eng, 2014; VanCompernolle et al., 2016a; Yu and Liu, 2019; Dasari et al., 2014). The pivoting approach is inspired by techniques used in maximal clique enumeration. Our work solves the k-clique counting problem.

Triangle Counting. Many works perform triangle counting on the CPU (Pagh and Tsourakakis, 2012; Al Hasan and Dave, 2018; Halappanavar and Ghosh, 2020; Kolountzakis et al., 2012) or the GPU (Pearson et al., 2019; Wang and Owens, 2019; Pandey et al., 2019; Hu et al., 2018; Mailthody et al., 2018; Bisson and Fatica, 2018b; Green et al., 2014a, b; Wang et al., 2016b). A triangle is a 3-clique, a special case of a k-clique. Our implementation performs k-clique counting for any value of k.

Generalized Sub-graph Matching. Many works perform generalized sub-graph matching on the CPU (Pinar et al., 2017; Ahmed et al., 2017; Elenberg et al., 2016; Rossi et al., 2018b; Wang et al., 2017) and the GPU (Chen et al., 2020; Guo et al., 2020; Wang et al., 2016a; Lin et al., 2016; Tran et al., 2015; Zeng et al., 2020; Wang and Owens, 2020). These frameworks search for an arbitrary k-vertex sub-graph and support different values of k. Because of this generality, such sub-graph matching frameworks suffer from memory explosion or prolonged execution times. Our implementation is specialized for k-cliques, which are an important special case of a k-vertex sub-graph. Specializing for k-cliques enables optimizations that are not applicable to general sub-graphs, thereby providing better scalability for large values of k.

Truss Decomposition. Recently, k-truss decomposition has received significant attention on CPUs (Wang and Cheng, 2012; Pearce and Sanders, 2018; Low et al., 2018; Conte et al., 2018) and GPUs (Almasri et al., 2019; Diab et al., 2020; Che et al., 2020; Green et al., 2017; Bisson and Fatica, 2018a; Blanco et al., 2019). A k-truss is a relaxation of a k-clique. Our work solves the k-clique counting problem.

List Intersections. List intersection is an important operation at the heart of many sub-graph search algorithms. Different list intersection strategies have been proposed for GPUs such as pointer-chasing (Almasri et al., 2019), binary-search (Pearson et al., 2019), merge-path (Green et al., 2014b), hash-based (Pandey et al., 2019), tile-based (Diab et al., 2020), bitmap-based (Bisson and Fatica, 2018b), and others. Our implementations use binary-search intersections for extracting an induced sub-graph, and binary encoding for the lists in the induced sub-graph.
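For concreteness, the following device functions are a minimal sketch of the binary-search intersection strategy; the function names and grouping parameters are illustrative assumptions. Each thread of a group takes elements of the shorter, sorted list and binary-searches for them in the longer, sorted list, which is the kind of intersection used when extracting an induced sub-graph before binary encoding is applied.

```cuda
// Illustrative binary-search intersection between two sorted adjacency lists.
__device__ bool binarySearch(const int* list, int len, int key) {
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = (lo + hi) / 2;
        if (list[mid] == key) return true;
        if (list[mid] < key)  lo = mid + 1;
        else                  hi = mid - 1;
    }
    return false;
}

__device__ int intersectBinarySearch(const int* shortList, int shortLen,
                                     const int* longList, int longLen,
                                     int lane, int groupSize) {
    int count = 0;
    // Lanes of the group stride over the shorter list and probe the longer one.
    for (int i = lane; i < shortLen; i += groupSize) {
        if (binarySearch(longList, longLen, shortList[i])) count++;
    }
    return count;  // per-thread partial count; reduce across the group afterwards
}
```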

6. Conclusion

We present GPU implementations of k-clique counting for both the graph orientation approach and the pivoting approach. We explore vertex-centric and edge-centric parallelization schemes, replace the recursive search tree traversal with an iterative traversal, and apply various optimizations such as binary encoding and sub-warp partitioning to reduce memory consumption and efficiently utilize parallel execution resources. Our evaluation shows that our GPU implementations substantially outperform their state-of-the-art CPU counterparts. We also show that the binary encoding and sub-warp partitioning optimizations yield significant performance gains, and provide insights on the choice of parallelization scheme and sub-warp partitioning granularity.

References

  • N. K. Ahmed, J. Neville, R. A. Rossi, N. G. Duffield, and T. L. Willke (2017) Graphlet decomposition: framework, algorithms, and applications. Knowledge and Information Systems 50 (3), pp. 689–722. Cited by: §5.
  • M. Al Hasan and V. S. Dave (2018) Triangle counting in large networks: a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (2), pp. e1226. Cited by: §5.
  • M. Almasri, O. Anjum, C. Pearson, Z. Qureshi, V. S. Mailthody, R. Nagi, J. Xiong, and W. Hwu (2019) Update on k-truss decomposition on gpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
  • L. Barenboim and M. Elkin (2010) Sublogarithmic distributed mis algorithm for sparse graphs using nash-williams decomposition. Distributed Computing 22 (5-6), pp. 363–379. Cited by: §5.
  • M. Bisson and M. Fatica (2018a) Update on static graph challenge on gpu. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–8. Cited by: §5.
  • M. Bisson and M. Fatica (2018b) Update on static graph challenge on gpu. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–8. Cited by: §5, §5.
  • M. Blanco, T. M. Low, and K. Kim (2019) Exploration of fine-grained parallelism for load balancing eager k-truss on gpu and cpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5.
  • Y. Che, Z. Lai, S. Sun, Y. Wang, and Q. Luo (2020) Accelerating truss decomposition on heterogeneous processors. Proceedings of the VLDB Endowment 13 (10), pp. 1751–1764. Cited by: §3.1, §5.
  • X. Chen, R. Dathathri, G. Gill, and K. Pingali (2020) Pangolin: an efficient and flexible graph mining system on cpu and gpu. Proceedings of the VLDB Endowment 13 (10), pp. 1190–1205. Cited by: §5.
  • J. Cheng, L. Zhu, Y. Ke, and S. Chu (2012) Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1240–1248. Cited by: §5.
  • N. Chiba and T. Nishizeki (1985) Arboricity and subgraph listing algorithms. SIAM Journal on computing 14 (1), pp. 210–223. Cited by: §1, §1, §2.2, §2.2, §5.
  • A. Conte, D. De Sensi, R. Grossi, A. Marino, and L. Versari (2018) Discovering k-trusses in large-scale networks. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–6. Cited by: §5.
  • M. Danisch, O. Balalau, and M. Sozio (2018) Listing k-cliques in sparse real-world graphs. In Proceedings of the 2018 World Wide Web Conference, pp. 589–598. Cited by: §1, §1, §2.2, §2.2, §4.1, §5.
  • M. Danisch, T. H. Chan, and M. Sozio (2017) Large scale density-friendly graph decomposition via convex programming. In Proceedings of the 26th International Conference on World Wide Web, pp. 233–242. Cited by: §1.
  • A. Das, S. Sanei-Mehri, and S. Tirthapura (2020) Shared-memory parallel maximal clique enumeration from static and dynamic graphs. ACM Transactions on Parallel Computing (TOPC) 7 (1), pp. 1–28. Cited by: §5.
  • N. S. Dasari, R. Desh, and Z. M (2014) PbitMCE: a bit-based approach for maximal clique enumeration on multicore processors. In 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), pp. 478–485. Cited by: §3.4, §5.
  • S. Diab, M. G. Olabi, and I. El Hajj (2020) KTrussExplorer: exploring the design space of k-truss decomposition optimizations on gpus. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8. Cited by: §5, §5.
  • E. R. Elenberg, K. Shanmugam, M. Borokhovich, and A. G. Dimakis (2016) Distributed estimation of graph 4-profiles. In Proceedings of the 25th International Conference on World Wide Web, pp. 483–493. Cited by: §5.
  • D. Eppstein, M. Löffler, and D. Strash (2010) Listing all maximal cliques in sparse graphs in near-optimal time. In International Symposium on Algorithms and Computation, pp. 403–414. Cited by: §1, §1, §2.2, §2.3, §5.
  • Y. Fang, K. Yu, R. Cheng, L. V. Lakshmanan, and X. Lin (2019) Efficient algorithms for densest subgraph discovery. Proceedings of the VLDB Endowment 12 (11), pp. 1719–1732. Cited by: §1.
  • I. Finocchi, M. Finocchi, and E. G. Fusco (2015) Clique counting in mapreduce: algorithms and experiments. Journal of Experimental Algorithmics (JEA) 20, pp. 1–20. Cited by: §1, §1, §2.2, §5.
  • A. Gionis and C. E. Tsourakakis (2015) Dense subgraph discovery: kdd 2015 tutorial. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2313–2314. Cited by: §1.
  • F. Glaria, C. Hernández, S. Ladra, G. Navarro, and L. Salinas (2021) Compact structure for sparse undirected graphs based on a clique graph partition. Information Sciences 544, pp. 485–499. Cited by: §1.
  • M. T. Goodrich and P. Pszona (2011) External-memory network analysis algorithms for naturally sparse graphs. In European Symposium on Algorithms, pp. 664–676. Cited by: §5.
  • O. Green, J. Fox, E. Kim, F. Busato, N. Bombieri, K. Lakhotia, S. Zhou, S. Singapura, H. Zeng, R. Kannan, V. Prasanna, and D. Bader (2017) Quickly finding a truss in a haystack. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5.
  • O. Green, P. Yalamanchili, and L. Munguía (2014a) Fast triangle counting on the gpu. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms, pp. 1–8. Cited by: §5.
  • O. Green, P. Yalamanchili, and L. Munguía (2014b) Fast triangle counting on the gpu. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms, IA^3 '14, pp. 1–8. ISBN 9781479970568. Cited by: §5, §5.
  • E. Gregori, L. Lenzini, and S. Mainardi (2012) Parallel k-clique community detection on large-scale networks. IEEE Transactions on Parallel and Distributed Systems 24 (8), pp. 1651–1660. Cited by: §1.
  • W. Guo, Y. Li, and K. Tan (2020) Exploiting reuse for gpu subgraph enumeration. IEEE Transactions on Knowledge and Data Engineering. Cited by: §5.
  • M. Halappanavar and S. Ghosh (2020) TriC: distributed-memory triangle counting by exploiting the graph structure. Technical report Pacific Northwest National Lab.(PNNL), Richland, WA (United States). Cited by: §5.
  • F. Hao, L. Wang, Y. Sun, and D. Park (2018) Detecting (k, r)-clique communities from social networks. In Advanced Multimedia and Ubiquitous Engineering, pp. 583–590. Cited by: §1.
  • C. Henry and P. Eng (2014) A parallel gpu solution to the maximal clique enumeration problem for cbir. In GPU Technology Conference (GTC 2014), Cited by: §5.
  • Y. Hu, H. Liu, and H. H. Huang (2018) High-performance triangle counting on gpus. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–5. Cited by: §5.
  • S. Jain and C. Seshadhri (2020) The power of pivoting for exact clique counting. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 268–276. Cited by: §1, §1, §2.3, §2.3, §2.3, §2.3, §3.3, §4.1, Table 3, §5.
  • J. Jenkins, I. Arkatkar, J. D. Owens, A. Choudhary, and N. F. Samatova (2011) Lessons learned from exploring the backtracking paradigm on the gpu. In European Conference on Parallel Processing, pp. 425–437. Cited by: §5.
  • M. N. Kolountzakis, G. L. Miller, R. Peng, and C. E. Tsourakakis (2012) Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Mathematics 8 (1-2), pp. 161–185. Cited by: §5.
  • L. Lai, Z. Qing, Z. Yang, X. Jin, Z. Lai, R. Wang, K. Hao, X. Lin, L. Qin, W. Zhang, et al. (2019) Distributed subgraph matching on timely dataflow. Proceedings of the VLDB Endowment 12 (10), pp. 1099–1112. Cited by: §5.
  • V. E. Lee, N. Ruan, R. Jin, and C. Aggarwal (2010) A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pp. 303–336. Cited by: §1.
  • J. Leskovec and A. Krevl (2014) SNAP datasets: stanford large network dataset collection. Cited by: §4.1.
  • B. Lessley, T. Perciano, M. Mathai, H. Childs, and E. W. Bethel (2017) Maximal clique enumeration with data-parallel primitives. In 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), pp. 16–25. Cited by: §5.
  • R. Li, S. Gao, L. Qin, G. Wang, W. Yang, and J. X. Yu (2020) Ordering heuristics for k-clique listing. Proceedings of the VLDB Endowment 13 (12), pp. 2536–2548. Cited by: §1, §1, §2.2, §2.2, §4.1, §5.
  • W. Lin, X. Xiao, X. Xie, and X. Li (2016) Network motif discovery: a gpu approach. IEEE transactions on knowledge and data engineering 29 (3), pp. 513–528. Cited by: §5.
  • T. M. Low, D. G. Spampinato, A. Kutuluru, U. Sridhar, D. T. Popovici, F. Franchetti, and S. McMillan (2018) Linear algebraic formulation of edge-centric k-truss algorithms with adjacency matrices. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5.
  • V. S. Mailthody, K. Date, Z. Qureshi, C. Pearson, R. Nagi, J. Xiong, and W. Hwu (2018) Collaborative (cpu+ gpu) algorithms for triangle counting and truss decomposition. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5.
  • S. Manoharan (2020) Patient diet recommendation system using k clique and deep learning classifiers. Journal of Artificial Intelligence 2 (02), pp. 121–130. Cited by: §1.
  • D. Merrill (2015) CUB. NVIDIA Research. Cited by: §4.1.
  • A. Mhedhbi and S. Salihoglu (2019) Optimizing subgraph queries by combining binary and worst-case optimal joins. arXiv preprint arXiv:1903.02076. Cited by: §5.
  • NVIDIA (2017) Tesla V100 GPU architecture whitepaper. NVIDIA. Cited by: §1, §3.5.
  • R. Pagh and C. E. Tsourakakis (2012) Colorful triangle counting and a mapreduce implementation. Information Processing Letters 112 (7), pp. 277–281. Cited by: §5.
  • S. Pandey, X. S. Li, A. Buluc, J. Xu, and H. Liu (2019) H-index: hash-indexing for parallel triangle counting on gpus. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
  • R. Pearce and G. Sanders (2018) K-truss decomposition for scale-free graphs at scale in distributed memory. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–6. Cited by: §5.
  • C. Pearson, M. Almasri, O. Anjum, V. S. Mailthody, Z. Qureshi, R. Nagi, J. Xiong, and W. Hwu (2019) Update on triangle counting on gpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
  • A. Pinar, C. Seshadhri, and V. Vishal (2017) Escape: efficiently counting all 5-vertex subgraphs. In Proceedings of the 26th international conference on world wide web, pp. 1431–1440. Cited by: §5, §5.
  • R. A. Rossi, N. K. Ahmed, and E. Koh (2018a) Higher-order network representation learning. In Companion Proceedings of the The Web Conference 2018, pp. 3–4. Cited by: §1.
  • R. A. Rossi, R. Zhou, and N. K. Ahmed (2018b) Estimation of graphlet counts in massive networks. IEEE transactions on neural networks and learning systems 30 (1), pp. 44–57. Cited by: §5.
  • R. A. Rossi and R. Zhou (2018) Graphzip: a clique-based sparse graph compression method. Journal of Big Data 5 (1), pp. 1–14. Cited by: §1.
  • R. A. Rossi and R. Zhou (2019) System and method for compressing graphs via cliques. Google Patents. Note: US Patent 10,217,241 Cited by: §1.
  • A. E. Sariyüce, C. Seshadhri, A. Pinar, and Ü. V. Çatalyürek (2017) Nucleus decompositions for identifying hierarchy of dense subgraphs. ACM Transactions on the Web (TWEB) 11 (3), pp. 1–27. Cited by: §1.
  • M. C. Schmidt, N. F. Samatova, K. Thomas, and B. Park (2009) A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69 (4), pp. 417–428. Cited by: §5.
  • J. Shi, L. Dhulipala, and J. Shun (2020) Parallel clique counting and peeling algorithms. arXiv preprint arXiv:2002.10047. Cited by: §1, §1, §2.2, §2.2, §4.1, §4.1, §4.1, §4.2, §4.3, §4.4, Table 3, §5, §5.
  • H. Tran, J. Kim, and B. He (2015) Fast subgraph matching on large graphs using graphics processors. In International Conference on Database Systems for Advanced Applications, pp. 299–315. Cited by: §5.
  • C. Tsourakakis (2015) The k-clique densest subgraph problem. In Proceedings of the 24th international conference on world wide web, pp. 1122–1132. Cited by: §1.
  • M. VanCompernolle, L. Barford, and F. Harris (2016a) Maximum clique solver using bitsets on gpus. In Information Technology: New Generations, pp. 327–337. Cited by: §3.4, §5.
  • M. VanCompernolle, L. Barford, and F. Harris (2016b) Maximum clique solver using bitsets on gpus. In Information Technology: New Generations, pp. 327–337. Cited by: §5.
  • P. Vilakone, D. Park, K. Xinchang, and F. Hao (2018) An efficient movie recommendation algorithm based on improved k-clique. Human-centric Computing and Information Sciences 8 (1), pp. 1–15. Cited by: §1.
  • J. Wang and J. Cheng (2012) Truss decomposition in massive networks. arXiv preprint arXiv:1205.6693. Cited by: §5.
  • L. Wang and J. D. Owens (2019) Fast bfs-based triangle counting on gpus. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. Cited by: §5.
  • L. Wang and J. D. Owens (2020) Fast gunrock subgraph matching (gsm) on gpus. arXiv preprint arXiv:2003.01527. Cited by: §5.
  • L. Wang, Y. Wang, and J. D. Owens (2016a) Fast parallel subgraph matching on the gpu. In HPDC, Cited by: §5.
  • L. Wang, Y. Wang, C. Yang, and J. D. Owens (2016b) A comparative study on exact triangle counting algorithms on the gpu. In Proceedings of the ACM Workshop on High Performance Graph Processing, pp. 1–8. Cited by: §5.
  • P. Wang, J. Zhao, X. Zhang, Z. Li, J. Cheng, J. C. Lui, D. Towsley, J. Tao, and X. Guan (2017) MOSS-5: a fast method of approximating counts of 5-node graphlets in large graphs. IEEE Transactions on Knowledge and Data Engineering 30 (1), pp. 73–86. Cited by: §5.
  • Y. Xu, J. Cheng, A. W. Fu, and Y. Bu (2014) Distributed maximal clique computation. In 2014 IEEE International Congress on Big Data, pp. 160–167. Cited by: §5.
  • T. Yu and M. Liu (2019) A memory efficient maximal clique enumeration method for sparse graphs with a parallel implementation. Parallel Computing 87, pp. 46–59. ISSN 0167-8191. Cited by: §3.4, §5.
  • Y. Yu, Z. Lu, J. Liu, G. Zhao, and J. Wen (2019) Rum: network representation learning using motifs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1382–1393. Cited by: §1.
  • L. Yuan, L. Qin, W. Zhang, L. Chang, and J. Yang (2017) Index-based densest clique percolation community search in networks. IEEE Transactions on Knowledge and Data Engineering 30 (5), pp. 922–935. Cited by: §1.
  • L. Zeng, L. Zou, M. T. Özsu, L. Hu, and F. Zhang (2020) GSI: gpu-friendly subgraph isomorphism. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), pp. 1249–1260. Cited by: §5.