1. Introduction
Dense subgraph counting and listing are important computations in graph mining (Danisch et al., 2017; Lee et al., 2010; Gionis and Tsourakakis, 2015). A clique is a fundamental dense subgraph with great importance in many applications such as community detection (Fang et al., 2019; Gregori et al., 2012; Sariyüce et al., 2017; Tsourakakis, 2015; Yuan et al., 2017; Hao et al., 2018), graph partitioning and compression (Glaria et al., 2021; Rossi and Zhou, 2018, 2019), learning network embeddings (Yu et al., 2019; Rossi et al., 2018a), and recommendation systems (Vilakone et al., 2018; Manoharan, 2020). A $k$-clique (or a $k$-vertex clique) in a graph is a complete subgraph with exactly $k$ vertices and $\binom{k}{2}$ edges, such that every vertex in the clique is connected to every other vertex.
A common approach to clique counting is to traverse a search tree for each vertex in the graph and find the cliques that contain that vertex. This approach is commonly parallelized by processing different trees or subtrees in parallel. One fundamental optimization is to eliminate the search tree branches that discover the same clique redundantly. Two prominent approaches to eliminate redundant clique discovery during clique counting are graph orientation (Shi et al., 2020; Danisch et al., 2018; Chiba and Nishizeki, 1985; Eppstein et al., 2010; Li et al., 2020; Finocchi et al., 2015) and pivoting (Jain and Seshadhri, 2020).
The graph orientation approach converts the graph into a directed graph so that each symmetric clique is only found from one of the vertices it contains. Common orientation criteria include vertex degree (Chiba and Nishizeki, 1985; Shi et al., 2020; Finocchi et al., 2015), graph coloring (Li et al., 2020), and degeneracy based on core decomposition (Eppstein et al., 2010; Danisch et al., 2018; Shi et al., 2020) or relaxations of core (Shi et al., 2020). On the other hand, the pivoting approach (Jain and Seshadhri, 2020) for clique counting is inspired by the Bron-Kerbosch maximal clique finding approach (Eppstein et al., 2010). Rather than searching for all $k$-cliques individually, the pivoting approach finds the largest cliques, then calculates the number of $k$-cliques they contain. To the best of our knowledge, the state-of-the-art parallel implementations of the graph orientation and pivoting approaches are ARB-COUNT (Shi et al., 2020) and Pivoter (Jain and Seshadhri, 2020), respectively. Both of these parallel implementations are designed for CPUs.
In this paper, we present our GPU implementations of clique counting for both the graph orientation and pivoting approaches. Compared to CPUs, parallelizing clique counting on GPUs comes with unique opportunities and challenges, such as fine-grain parallelism, a multi-level parallelization structure, inefficient support for recursion, and constrained memory capacity. Our implementations explore both vertex-centric and edge-centric parallelization across thread blocks, as well as different ways of partitioning work within thread blocks for each approach. We replace the recursive search tree traversal with an iterative traversal whereby threads jointly traverse a subtree by explicitly managing a shared stack. We use binary encoding of induced subgraphs to reduce memory consumption and allow for highly parallel list intersection operations. We take advantage of the Independent Thread Scheduling feature (NVIDIA, 2017) in the NVIDIA Volta architecture and its successors to allow threads to collaborate at sub-warp granularity for better utilization of parallel execution resources. We also employ various techniques to reduce the memory consumption of our implementations.
Our evaluation shows that our best GPU implementation outperforms the best state-of-the-art parallel CPU implementation by geometric mean speedups of 9.88×, 5.71×, and 17.23× across three representative values of $k$. We also show that the binary encoding and sub-warp partitioning optimizations yield significant performance gains, and we provide insights on the choice of parallelization scheme and sub-warp partitioning granularity.
In summary, we make the following contributions:

- A GPU implementation of the graph orientation approach to clique counting that outperforms the state-of-the-art parallel CPU implementation
- A GPU implementation of the pivoting approach to clique counting that outperforms the state-of-the-art parallel CPU implementation
- An exploration of the vertex-centric and edge-centric parallelization schemes for clique counting on GPUs
- Various optimization techniques for clique counting on GPUs that are common to both approaches and parallelization schemes, including binary encoding of induced subgraphs, sub-warp partitioning, and efficient memory management
Our code will be open-sourced to enable further research on parallelizing clique counting on GPUs.
2. Background
2.1. Clique Counting
Consider an undirected simple graph $G = (V, E)$, where $V$ is the set of vertices in the graph, $E$ is the set of edges in the graph, and $adj(v)$ denotes the adjacency list of a vertex $v$. A $k$-clique in $G$ is a complete subgraph of $G$ with exactly $k$ vertices and $\binom{k}{2}$ edges. That is, every vertex in the clique is connected to every other vertex with an edge. This paper focuses on $k$-clique counting, which counts the number of $k$-cliques in a given graph $G$ for a given value of $k$.
A common approach to counting the cliques in a graph is to traverse a search tree for each vertex in the graph to find the cliques that contain that vertex. The search tree for each vertex (1-clique) branches out to the vertex's neighbors to find edges (2-cliques), then for each edge, branches out to the common neighbors of the edge's endpoints to find triangles (3-cliques), then for each triangle, branches out to the common neighbors of the vertices in the triangle to find 4-cliques, and so on. In general, for each $\ell$-clique, the tree branches out to the common neighbors of the vertices in the clique to find the $(\ell+1)$-cliques. Finding the common neighbors of a set of vertices requires adjacency list intersection operations. This approach to clique counting is commonly parallelized by processing different trees or subtrees in parallel.
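To make this search-tree formulation concrete, the following is a minimal sequential sketch of the approach. It is not our GPU implementation: it assumes the graph has already been oriented so that each clique is discovered exactly once (Section 2.2 explains this step), adjacency lists are sorted vectors, and all names are illustrative.

```cuda
#include <algorithm>
#include <cstdint>
#include <iterator>
#include <vector>

using AdjList = std::vector<int>;

// Sorted intersection of a candidate list with a vertex's (oriented) adjacency list.
static std::vector<int> intersect(const std::vector<int>& a, const AdjList& b) {
    std::vector<int> out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// Counts the k-cliques that extend the partial clique built so far. `cand` holds
// the common out-neighbors of all chosen vertices; `level` is how many vertices
// have been chosen (1 after picking the root of the search tree).
static uint64_t countFrom(const std::vector<AdjList>& adj, const std::vector<int>& cand,
                          int level, int k) {
    if (level == k - 1) return cand.size();   // each candidate completes one k-clique
    uint64_t count = 0;
    for (int v : cand)
        count += countFrom(adj, intersect(cand, adj[v]), level + 1, k);
    return count;
}

uint64_t countKCliques(const std::vector<AdjList>& adj, int k) {
    uint64_t count = 0;
    for (int u = 0; u < (int)adj.size(); ++u)  // one search tree per vertex
        count += countFrom(adj, adj[u], /*level=*/1, k);
    return count;
}
```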
One key distinguishing feature among algorithms is how they avoid discovering the same clique redundantly from multiple root vertices. Avoiding redundant clique discovery results in a substantial reduction in the amount of work done, thereby improving performance. The two major approaches to avoiding redundant clique discovery are: (1) orienting the graph before traversal, and (2) pivoting. These two approaches are described in Sections 2.2 and 2.3, respectively.
2.2. Graph Orientation Approach
One common approach to avoiding redundant clique discovery is orienting the graph before traversing it (Shi et al., 2020; Danisch et al., 2018; Chiba and Nishizeki, 1985; Eppstein et al., 2010; Li et al., 2020; Finocchi et al., 2015). Graph orientation (or vertex ordering) is a preprocessing step that converts the graph from an undirected graph to a directed one. Common orientation criteria include vertex degree (Chiba and Nishizeki, 1985; Shi et al., 2020; Finocchi et al., 2015), graph coloring (Li et al., 2020), and degeneracy based on core decomposition (Eppstein et al., 2010; Danisch et al., 2018; Shi et al., 2020) or relaxations of core (Shi et al., 2020). Graph orientation relies on the fact that a clique is a symmetric substructure, and hence can be found by starting from any of the vertices it contains.
Figure 1(b) shows how the graph in Figure 1(a) is explored in the graph orientation approach to find all the 4-cliques. Assume that the edges are oriented in alphabetical order (from the earlier letter to the later letter). The first level contains all the vertices in the graph, each representing the root of its respective search tree. At the second level, each tree branches out from the root vertex to its neighbors. For example, the branch A→B indicates that there is an edge from vertex A to vertex B. At the third level, each edge branches out to the triangles it participates in. For example, the path A→B→C indicates that there is a triangle containing vertices A, B, and C. Here, C is found while intersecting the adjacency lists of vertices A and B (i.e., $adj(A) \cap adj(B)$). Finally, at the fourth level, each triangle branches out to the 4-cliques it participates in. For example, the path A→B→C→D indicates that there is a 4-clique containing vertices A, B, C, and D. Here, D is found while intersecting the adjacency lists of A, B, and C. Since $adj(A) \cap adj(B)$ was computed at the previous level, what remains is intersecting the previous result with $adj(C)$. Since we are looking for 4-cliques, the tree traversal stops at the fourth level. In general, while looking for $k$-cliques, the tree traversal stops at level $k$.
Graph orientation has two main benefits. The first benefit is that it eliminates redundant clique discovery, as previously mentioned. In the example in Figure 1(b), the 4-clique containing vertices A, B, C, and D is only discovered in the tree rooted at vertex A. It is not redundantly discovered in the other trees because vertex A is not reachable from the other vertices. The second benefit of graph orientation is that it reduces the out-degrees of the vertices and the maximum out-degree of the graph. The maximum out-degree has a quadratic impact on memory consumption, as we show in Section 2.4.
Parallel clique counting based on graph orientation has been extensively studied on CPUs (Shi et al., 2020; Danisch et al., 2018; Li et al., 2020; Chiba and Nishizeki, 1985). Vertices or edges are typically distributed across CPU threads, and each thread performs a sequential depth-first traversal of the tree or subtree to count the underlying cliques. To the best of our knowledge, ARB-COUNT (Shi et al., 2020) is the best performing parallel CPU implementation of clique counting based on graph orientation. In this paper, we present our parallel GPU implementation of clique counting based on graph orientation and compare its performance to ARB-COUNT.
2.3. Pivoting Approach
Another approach to avoiding redundant clique discovery is pivoting (Jain and Seshadhri, 2020). The idea of pivoting is inspired by the Bron-Kerbosch maximal clique finding approach (Eppstein et al., 2010). Pivoting relies on the fact that an $n$-clique contains $\binom{n}{k}$ $k$-cliques, so instead of searching for each of these $k$-cliques individually, it is sufficient to find the largest clique; all the $k$-cliques it contains are then accounted for directly. For example, the graph in Figure 1(a) contains a 5-clique consisting of vertices A, B, C, D, and J. This 5-clique contains $\binom{5}{4} = 5$ different 4-cliques. In the graph orientation approach in Figure 1(b), these five 4-cliques are discovered by five different paths in the search trees. The pivoting approach aims to discover the 5-clique once, then conclude the existence of the five 4-cliques.
Figure 1(c) shows an example of how the graph in Figure 1(a) is explored using the pivoting approach. As the search tree is traversed, at every branching step, one pivot child vertex is selected, typically the vertex that has the largest number of common neighbors with its parent. All the pivot's neighbors are then excluded from the branching since these neighbors are eventually reachable from the pivot vertex (i.e., the pivot vertex is their parent in the search tree). For example, at the first level, vertex B is selected as the pivot vertex because it has the largest number of neighbors. Accordingly, all of B's neighbors are excluded from the first level. At the second level, when branching from vertex B to its neighbors, vertex D is selected as the pivot because it has the largest number of common neighbors with B. Accordingly, all of D's neighbors are excluded from the branch. The tree traversal proceeds in this way. Unlike the graph orientation approach, the pivoting approach does not stop at level $k$, but keeps exploring until it reaches the bottom of the tree (or satisfies a stopping criterion (Jain and Seshadhri, 2020)). To calculate the number of $k$-cliques found on a path, the combinatorial formula $\binom{p}{k-(n-p)}$ is used, where $p$ is the number of pivots on the path and $n$ is the total number of vertices on the path.
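As a concrete instance of this counting step (using $p$ for the number of pivots on a path and $n$ for the total number of vertices on it, notation that is ours):

```latex
% Every k-clique counted from a single root-to-leaf path must contain all of the
% n - p non-pivot vertices on that path, plus any k - (n - p) of its p pivots:
\[
  \#\,k\text{-cliques on the path} \;=\; \binom{p}{\,k - (n - p)\,}.
\]
% Example: a path with n = 6 vertices, p = 4 of which are pivots, contributes
% $\binom{4}{4 - 2} = 6$ distinct 4-cliques.
```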
Compared to graph orientation, pivoting has the advantage that it reduces the search space significantly by eliminating the neighbors of the pivot vertex from the branching. This reduction is clear when comparing Figures 1(b) and 1(c). The reduction in branching makes pivoting particularly suitable for large $k$, or for counting cliques of all sizes at once. On the other hand, pivoting has several disadvantages. The first disadvantage of pivoting is that it requires deeper exploration of the search tree, which exacerbates load imbalance in parallel implementations. The second disadvantage is that it reduces the amount of parallelism available. The third disadvantage is that it requires an undirected graph, which makes adjacency list intersection operations prohibitively expensive for large graphs where vertices have much higher degrees, hence longer adjacency lists. The second and third disadvantages are mitigated (Jain and Seshadhri, 2020) by starting with the graph orientation approach for the initial level(s) of the tree using a directed graph, then switching to the pivoting approach for the remaining levels using an undirected induced subgraph (see Section 2.4). An example of this hybrid approach is illustrated in Figures 1(d) and 1(e).
To the best of our knowledge, Pivoter (Jain and Seshadhri, 2020) is the only implementation of clique counting based on pivoting. The implementation is parallelized on the CPU, again by distributing vertices or edges across CPU threads, and having each thread perform a sequential depthfirst traversal of the tree or subtree. In this paper, we present our parallel GPU implementation of clique counting based on pivoting and compare its performance to Pivoter.
2.4. Induced Subgraph Optimization
Both the graph orientation and pivoting approaches spend a significant amount of time performing adjacency list intersection operations. A common optimization is to shrink the size of the adjacency lists being intersected by removing from the graph, for each search tree, the vertices that can never be reached in that tree. For example, in Figure 1, when traversing the tree rooted at vertex A, only the neighbors of A can ever be reached. Hence, any vertex that is not a neighbor of A can be removed from the graph before traversing the tree rooted at A.
In general, when traversing a tree rooted at a vertex $v$, the first step is to extract the vertex-induced subgraph consisting of the vertices in $adj(v)$. This induced subgraph is used throughout the tree traversal instead of the full graph. The induced subgraph is typically much smaller than the full graph, and its smaller adjacency lists result in faster adjacency list intersection operations. Note that an induced subgraph may be extracted at any level in the search tree. For example, if the tree contains a path $v_1 \rightarrow v_2 \rightarrow \dots \rightarrow v_\ell$, the subtree rooted at $v_\ell$ can only reach vertices that are neighbors of all of $v_1, v_2, \dots, v_\ell$. Therefore, the induced subgraph consisting of the vertices in $adj(v_1) \cap adj(v_2) \cap \dots \cap adj(v_\ell)$ is sufficient for traversing that subtree. However, extracting the induced subgraph at each level is usually not worth the overhead. The induced subgraph is typically extracted at one level, either the first or the second. In this paper, both alternatives are explored.
The largest possible induced subgraph has $d_{max}$ vertices, where $d_{max}$ is the maximum out-degree of the graph. Therefore, the upper bound on the size of an induced subgraph is $O(d_{max}^2)$. Since the maximum out-degree of the graph has a quadratic impact on memory consumption, the choice of graph orientation is critical for reducing memory consumption, as mentioned in Section 2.2. It becomes even more critical in parallel implementations because when the trees or subtrees are processed in parallel and each has a different induced subgraph, each needs its own memory space to store its induced subgraph. Section 3.4 describes how the memory consumption of the induced subgraphs is further reduced in our parallel GPU implementation.
3. Clique Counting on GPUs
This section describes our implementation of $k$-clique counting on GPUs, including the graph format, parallelization schemes, iterative subtree traversal, and various optimizations.
3.1. Graph Format and Orientation Criteria
We represent the input graph using a hybrid CSR+COO storage format. Figure 2 shows an example of how this storage format is used to represent the graph in Figure 1(a). The row pointer and column index arrays together constitute the CSR part. They facilitate finding the adjacency list of a given vertex, which is useful for vertex-centric processing and parallelization. The row index and column index arrays together constitute the COO part. They facilitate finding the source and destination vertices of a given edge, which is useful for edge-centric processing and parallelization.
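A minimal sketch of this hybrid layout follows; the field names are illustrative, not the names used in our code.

```cuda
// Hybrid CSR+COO representation: colIdx is shared by both views. Together with
// rowPtr it forms the CSR part (adjacency list of a vertex); together with
// rowIdx it forms the COO part (endpoints of an edge).
struct HybridGraph {
    int   numVertices;
    long  numEdges;
    long* rowPtr;   // size numVertices + 1; adjacency list of v is colIdx[rowPtr[v] .. rowPtr[v+1])
    int*  rowIdx;   // size numEdges; source vertex of edge e (COO)
    int*  colIdx;   // size numEdges; destination vertex of edge e (CSR and COO)
};
```

Vertex-centric kernels index into rowPtr/colIdx, while edge-centric kernels read (rowIdx[e], colIdx[e]) directly.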
Before clique counting begins, we first orient the graph into a directed graph. Recall that both the graph orientation approach and the pivoting approach require the graph to be oriented at the beginning. Our implementation supports two different orientation criteria: degree orientation and degeneracy orientation. Degree orientation orients each edge from the vertex with the lower degree to the vertex with the higher degree. Degeneracy orientation orients each edge from the vertex with the lower core order to the vertex with the higher core order. The core order is obtained from core decomposition, which iteratively removes the minimum-degree vertices from the graph; a vertex's core order is the order in which it is removed from the graph.
We implement both orientation criteria on the GPU, the latter using a bucket-based approach (Che et al., 2020). In both cases, after determining which edges to keep, the undesired edges are filtered out and the row pointers are recomputed with histogram and exclusive scan operations. The trade-off between the orientation criteria is evaluated in Section 4.4.
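For illustration, a sketch of the edge-filtering predicate for degree orientation; the tie-breaking rule by vertex ID is an assumption, but some deterministic tie-break is needed so that exactly one direction of every undirected edge survives. The degeneracy variant is analogous, comparing core orders instead of degrees.

```cuda
// Keep edge (u, v) only if it points from the lower-degree endpoint to the
// higher-degree one; ties are broken by vertex ID.
__host__ __device__ inline bool keepEdgeDegree(int u, int v, const int* degree) {
    return (degree[u] < degree[v]) || (degree[u] == degree[v] && u < v);
}
```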
We note that although only degree orientation and degeneracy orientation are currently supported, our implementation can easily be extended to support other orientation criteria. The choice of orientation criterion is orthogonal to our work and not intended as a contribution of this paper.
3.2. Parallelization Schemes
Our main strategy for parallelizing $k$-clique counting on the GPU is to traverse different search trees or subtrees in parallel. GPUs provide a massive amount of parallelism and are capable of running tens of thousands to hundreds of thousands of threads concurrently. A grid of threads running on a GPU is divided into thread blocks. Threads in the same thread block can collaborate by synchronizing at a barrier and sharing a fast scratchpad memory (also called shared memory). Thread blocks are further divided into warps, which typically consist of 32 threads bound by the SIMD execution model. Threads in the same warp can collaborate using low-cost warp-level primitives.
In this paper, we implement two different parallelization schemes: a vertex-centric scheme and an edge-centric scheme. In the vertex-centric scheme, each thread block is assigned to a vertex (level 1 in the search tree) and is responsible for traversing the tree rooted at that vertex. The threads in the block collaborate to extract the induced subgraph consisting of the vertex's neighbors, then proceed to traverse the vertex's tree using the induced subgraph. In the edge-centric scheme, each thread block is assigned to an edge (level 2 in the search tree) and is responsible for traversing the subtree rooted at that edge. The threads in the block collaborate to extract the induced subgraph consisting of the common neighbors of the edge's endpoints, then proceed to traverse the edge's subtree using the induced subgraph.
The advantage of the vertex-centric scheme over the edge-centric scheme is that it extracts an induced subgraph for each vertex's tree as opposed to each edge's subtree, thereby amortizing the cost of extracting the induced subgraph over a larger tree. The advantage of the edge-centric scheme over the vertex-centric scheme is that it extracts smaller induced subgraphs (albeit more of them), thereby resulting in shorter adjacency lists. It also exposes finer-grain parallelism, which makes it more robust against load imbalance. We compare the performance of the vertex-centric and edge-centric schemes in Section 4.3.
Parallelization of work within each thread block differs between the graph orientation approach and the pivoting approach. In the graph orientation approach, we partition the block into groups of threads, and each group independently traverses one of the subtrees at the next level. In the vertex-centric scheme, each group of threads is assigned to an outgoing edge (level 2 in the search tree) of the block's vertex, and the group independently traverses the subtree rooted at that edge. An example is shown in Figure 3(a). In the edge-centric scheme, each group of threads is assigned to a triangle (level 3 in the search tree) that the block's edge participates in, and the group independently traverses the subtree rooted at that triangle. An example is shown in Figure 3(b). In both cases, the group of threads jointly performs a depth-first traversal of the subtree it is assigned to, visiting the nodes in the search tree sequentially. The threads in the group collaborate to perform each adjacency list intersection operation in parallel.
In the pivoting approach, we also partition the block into groups of threads; however, these groups do not process next-level subtrees independently. Instead, all threads in the block stay together as they jointly perform a sequential depth-first traversal of the tree or subtree that the block is assigned to. However, at each node in the search tree, identifying which neighbor is the pivot vertex requires performing a list intersection operation for each neighbor. Therefore, each group of threads is assigned to a different neighbor and performs a list intersection operation for that neighbor to check if it is the pivot. Examples of the vertex-centric and edge-centric schemes for the pivoting approach are shown in Figures 3(c) and 3(d), respectively.
There are two reasons why, in the pivoting approach, groups of threads are not assigned to process the next-level subtrees independently. The first reason is that the process of finding the pivot element at each tree node is expensive and already provides enough work to be parallelized across groups. The second reason is that the pivoting approach has deeper trees than the graph orientation approach, thereby requiring more memory to store intermediate results. Parallelizing the next-level subtrees across groups of threads would require too much memory for storing the intermediate results of each subtree.
For now, a group of threads can be thought of as a warp, which is the most natural way to partition a block. However, we improve on this partitioning granularity in Section 3.5.
3.3. Traversing a Subtree
In Section 3.2, we saw how the trees or subtrees are distributed across threads to be traversed in parallel. In this section, we explain how a block or group of threads jointly traverse a single tree or subtree. Tree traversal on CPUs is typically done using recursion. However, using recursion on the GPU is not an efficient approach. Due to the massive number of threads that run on the GPU, each thread can only afford to have a small stack which limits the recursion depth. Moreover, since multiple threads are used to jointly traverse a tree, these threads access the same intermediate traversal data. Replicating this traversal data in all the threads’ private stacks is not memory efficient. For this reason, our implementation replaces recursion with an iterative tree traversal whereby threads traversing the same tree explicitly manage a shared stack.
The shared stack we use for traversal consists of the following data:

- a counter that tracks the current level
- an array that stores the list of vertices at each level
- an array that stores the number of vertices at each level
- an array that stores the index of the current vertex at each level (note that this is not the vertex ID, but the index into the level's vertex list from which the vertex ID is obtained)
Algorithm 1 shows the pseudocode for our iterative subtree traversal for the graph orientation approach. The routine takes $k$, an initial level, the induced subgraph, the list of vertices at the initial level, and the number of vertices at the initial level (line 1). First, the parameters are used to initialize the stack data (lines 2-5). Next, a loop iterates until there are no more vertices to process (line 6). Inside the loop, the current vertex is read from the list of vertices (line 7). The children of the vertex are obtained by intersecting the list of vertices at the current level (which is also the list of common neighbors of all the vertex's ancestors) with the adjacency list of the vertex itself (lines 8-10). The adjacency list is obtained from the induced subgraph. The threads in the group collaborate to perform the intersection operation in parallel. After the intersection, if the vertex is found to have children that are not at the bottom of the tree, we go to the next level to traverse the children (lines 11-13). Otherwise, if the bottom of the tree is reached, we accumulate the number of cliques (lines 15-16). We then increment the vertex index to go to the vertex's next sibling (line 17). If the vertex has no more siblings, then we are done traversing the vertex's parent, so we move on to the parent's next sibling (lines 18-20).
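The following is a hedged sketch of this traversal for the graph orientation approach. It mirrors the logic of Algorithm 1 but is simplified: the stack stores plain vertex-index lists rather than the binary-encoded lists described in Section 3.4, the intersection is written as a serial loop executed by one thread, and all names and array bounds are illustrative. In the actual kernel, the threads of a group perform the intersection in parallel.

```cuda
#define MAX_DEPTH 12    // illustrative bound on k (k <= MAX_DEPTH)
#define MAX_VERTS 1024  // illustrative bound on the induced subgraph size

// Shared stack for one group of threads traversing one subtree.
struct Stack {
    int verts[MAX_DEPTH][MAX_VERTS];  // candidate vertices (local IDs) at each level
    int count[MAX_DEPTH];             // number of candidates at each level
    int index[MAX_DEPTH];             // index of the candidate currently being expanded
};

// `adjOff`/`adjList` hold the directed (oriented) induced subgraph in CSR form with
// local vertex IDs. `startLevel` is the number of vertices already fixed (1 for the
// vertex-centric scheme, 2 for the edge-centric scheme); `startCand`/`startCount`
// are the candidates that extend them.
__device__ unsigned long long traverseSubtree(const int* adjOff, const int* adjList, int k,
                                              int startLevel, const int* startCand,
                                              int startCount, Stack* st) {
    unsigned long long cliques = 0;
    int l = startLevel;
    st->count[l] = startCount;
    st->index[l] = 0;
    for (int i = 0; i < startCount; ++i) st->verts[l][i] = startCand[i];

    while (true) {
        if (l == k - 1) {                       // every candidate completes a k-clique
            cliques += st->count[l];
            if (l == startLevel) break;
            --l; ++st->index[l];                // back to the parent's next sibling
            continue;
        }
        if (st->index[l] >= st->count[l]) {     // no more siblings at this level
            if (l == startLevel) break;
            --l; ++st->index[l];
            continue;
        }
        int v = st->verts[l][st->index[l]];     // current vertex to expand
        int childCount = 0;                     // children = candidates ∩ adj(v)
        for (int i = 0; i < st->count[l]; ++i) {
            int u = st->verts[l][i];
            for (int j = adjOff[v]; j < adjOff[v + 1]; ++j)
                if (adjList[j] == u) { st->verts[l + 1][childCount++] = u; break; }
        }
        if (childCount > 0) {                   // descend into the children
            ++l;
            st->count[l] = childCount;
            st->index[l] = 0;
        } else {
            ++st->index[l];                     // dead end: move to the next sibling
        }
    }
    return cliques;
}
```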
For the pivoting approach, we also use the shared stack to store the following additional data:

- an array that stores the pivot vertex ID at each level
- an array that stores, at each level, the number of pivot vertices on the path to the current vertex
- an array that stores, at each level, the list of vertices that are common neighbors of all the vertices at the previous levels (in the graph orientation approach, this list is equivalent to the level's vertex list, but in the pivoting approach, it also includes the neighbors of the pivot that are excluded from the vertex list)
Algorithm 2 shows the pseudocode for our iterative subtree traversal for the pivoting approach. The routine takes $k$, an initial level, the induced subgraph, and the common neighbors of all the vertices up to the initial level (line 1). First, the parameters are used to initialize the stack data (lines 2-4). Next, we find the pivot vertex for the current level (line 6). Recall that in the pivoting approach, all threads in the block jointly traverse the same tree or subtree. To find the pivot vertex, the candidate vertices are distributed across the groups of threads, and each group intersects its vertex's adjacency list with the common neighbor list to find the number of common neighbors the vertex has with its parent. A max operation is then performed across groups to identify the pivot vertex. After the pivot vertex is found, the neighbors of the pivot are excluded from the level's vertices (lines 7-9). A loop then begins to iterate until there are no more vertices to process (line 10).
Inside the loop, the current vertex is read from the list of vertices (line 11). Based on whether the current vertex is the pivot vertex or not, the number of pivot vertices up to the next level is set (line 12). Next, we must figure out whether we need to traverse the child level or we have reached the bottom of the tree (lines 13-19). If the stopping criterion (Jain and Seshadhri, 2020) is met, we set a flag not to go to the child level (lines 13-14). Otherwise, we find the child vertices by intersecting the current vertex's adjacency list with the common neighbors of its parents and removing the current vertex's already processed siblings (lines 16-18). If child vertices are found, we set a flag to traverse the child level, otherwise we have reached the bottom (line 19). If we have determined that we need to traverse the child level (line 20), we now go to the child level (lines 21-22). We also identify the pivot vertex at the child level and exclude its neighbors from the level's vertices (lines 23-26), following the same procedure that was used for the initial level (lines 6-9). Otherwise, if we have determined that we have reached the bottom (line 27), we identify the number of non-redundant $k$-cliques in the clique that we found and accumulate the number of cliques (lines 28-30). We then increment the current vertex index to go to the vertex's next sibling (line 31). If the vertex has no more siblings, then we are done traversing the vertex's parent, so we move on to the parent's next sibling (lines 32-34).
3.4. Binary Encoding of Induced Subgraphs
Recall that the first step a thread block performs before traversing its tree or subtree is to extract an induced subgraph. In the vertex-centric scheme, the induced subgraph consists of the vertex's neighbors, whereas in the edge-centric scheme, the induced subgraph consists of the common neighbors of the edge's endpoints. Each thread block has a different induced subgraph. The block's induced subgraph is used by all threads in the block to perform the adjacency list intersection operations throughout the tree traversal.
The storage format used for the induced subgraph is of utmost importance for both memory consumption and execution time. For memory consumption, since each thread block has a different induced subgraph, enough memory must be allocated for all the blocks running simultaneously so they can each store a private induced subgraph. Hence, the space efficiency of the induced subgraph storage format is crucial for the overall memory consumption. For execution time, since threads use the induced subgraph to perform adjacency list intersection operations, the storage format must be designed to enable low-latency parallel intersections.
To optimize both memory consumption and execution time, we use binary encoding to represent the induced subgraph. Figure 4 shows an example of how an induced subgraph can be binary encoded. Assume that a thread block is assigned to traverse the tree rooted at vertex A in the graph in Figure 4(a). The only vertices ever visited in that tree are the neighbors of A. Therefore, the thread block starts by extracting the induced subgraph consisting of the neighbors of A, shown in Figure 4(b). This induced subgraph is binary encoded as shown in Figure 4(c). The adjacency list of each vertex in the subgraph consists of a bit vector with a 1 for each neighbor and a 0 otherwise. Any two lists can be intersected by performing a simple bitwise AND operation between the two bit vectors. We use a directed induced subgraph for the graph orientation approach, and an undirected induced subgraph for the pivoting approach. Note that in addition to the induced subgraph being binary encoded, all the intermediate vertex lists (i.e., the per-level vertex lists in Algorithms 1 and 2) are also binary encoded, and intersections with these intermediate lists also use bitwise AND operations.

The advantage of binary encoding for memory consumption is that each vertex in an adjacency list is represented with a single bit. Since dynamic memory allocation is not efficient on GPUs, the space for each thread block to store its induced subgraph and its intermediate vertex lists must be preallocated with enough capacity for the largest possible subgraph. The largest possible subgraph may have $d_{max}$ vertices, where $d_{max}$ is the maximum out-degree of the graph, and the subgraph may be completely dense. In this case, storing the subgraph requires $O(d_{max}^2)$ space, and storing the intermediate vertex lists requires $O(d_{max})$ space per level. Binary encoding reduces these amounts by a factor of 32 compared to storing each vertex as a 32-bit integer, by encoding each vertex in each list as a single bit.
The advantage of binary encoding for execution time is that it allows list intersection operations to be performed using simple bitwise AND operations. Traditional adjacency list intersection techniques on GPUs are complex to parallelize, suffer from control divergence, and exhibit uncoalesced memory access patterns. On the other hand, performing a bitwise AND on a bit vector is easy to parallelize across the threads in a group or block, does not suffer from control divergence, and enables coalescing of memory accesses.
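The following sketch shows what such a group-parallel intersection can look like on binary-encoded lists; the function and parameter names are ours, not those of our implementation, and groups are assumed to be contiguous, power-of-two-sized, and aligned within a warp. Each thread ANDs a strided subset of the 32-bit words, and the surviving-vertex count is reduced across the group with warp shuffles.

```cuda
// `candidates` is the binary-encoded vertex list of the current level and
// `adjBitmap` the encoded adjacency list of the current vertex, both `words`
// 32-bit words long. `lane` is the thread's rank within its group of
// `groupSize` threads, and `groupMask` selects the group's lanes in the warp.
__device__ int intersectEncoded(const unsigned* candidates, const unsigned* adjBitmap,
                                unsigned* children, int words,
                                int lane, int groupSize, unsigned groupMask) {
    int localCount = 0;
    for (int w = lane; w < words; w += groupSize) {
        unsigned bits = candidates[w] & adjBitmap[w];  // 32 membership tests in one AND
        children[w]   = bits;
        localCount   += __popc(bits);                  // count surviving vertices
    }
    // Reduce the per-thread counts across the group and broadcast the total.
    for (int offset = groupSize / 2; offset > 0; offset /= 2)
        localCount += __shfl_down_sync(groupMask, localCount, offset);
    int firstLane = __ffs(groupMask) - 1;              // lowest warp lane of the group
    return __shfl_sync(groupMask, localCount, firstLane);
}
```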
The reason binary encoding is particularly effective for the induced subgraphs and intermediate vertex lists is that they typically consist of a small number of vertices, especially when a good graph orientation criterion is used. Moreover, the induced subgraphs are typically denser than the full graph. In contrast, binary encoding for the full graph is impractical because the full graph has many more vertices and is usually much sparser, resulting in many wasted 0 bits. For this reason, we continue to represent the full graph using the hybrid CSR+COO format (see Section 3.1) and only represent the induced subgraphs using binary encoding. To extract the binary encoded induced subgraph from the full hybrid CSR+COO graph, we intersect the adjacency lists of the hybrid CSR+COO graph using binary-search-based intersections.
To the best of our knowledge, our implementation is the first to use binary encoding for the induced subgraphs in clique counting. Related work on the maximal clique problem on GPUs uses binary encoding for the entire graph (VanCompernolle et al., 2016a) or for specialized data structures to represent and operate on the candidate maximal cliques (Yu and Liu, 2019; Dasari et al., 2014). Many other graph processing works use binary encoding in different contexts. We evaluate the performance improvement of using binary encoded induced subgraphs in Section 4.3.
3.5. Sub-warp Partitioning
In Section 3.2, we explained that for both the graph orientation approach and the pivoting approach, our implementation partitions thread blocks into groups of threads and distributes the block's work across these groups. In the graph orientation approach, a block is assigned to a tree or subtree, and each group of threads processes one of the next-level subtrees in parallel. In the pivoting approach, a block is also assigned to a tree or subtree, but the groups of threads jointly iterate over the tree nodes sequentially; for each tree node, when determining which child vertex is the pivot, each group of threads checks a different child in parallel.
One important design consideration is the granularity at which thread blocks are partitioned into groups. The most natural granularity is the warp because threads in the same warp are bound together by SIMD and are able to collaborate using low-cost warp-level primitives. However, the introduction of binary encoding often makes the warp granularity too coarse. Recall that threads within a group collaborate to perform a single list intersection operation in parallel. With binary encoding, each thread can intersect 32 list elements simultaneously with a single bitwise AND operation. Therefore, to fully utilize all 32 threads in the warp, the lists being intersected need to contain 1,024 elements. We show in Section 4.4 that with a proper graph orientation criterion, the maximum out-degree (and consequently, the largest binary encoded list size) is often much smaller than that. Therefore, partitioning blocks at the warp granularity would lead to underutilization of parallel execution resources.
To address this issue, we implement sub-warp partitioning, where thread blocks are partitioned into groups smaller than a warp. Since the Volta architecture, the introduction of Independent Thread Scheduling (NVIDIA, 2017) has enabled fine-grained collaboration between a subset of threads within a warp. This can be done by providing a 32-bit mask to the warp-level primitives to specify the threads participating in the intra-warp synchronization operation. For example, the operation __syncwarp(0x0000000F) synchronizes the first four threads in a warp, while the remaining threads execute independently. We leverage such primitives to enable the creation of thread groups that are 32, 16, 8, 4, 2, or 1 threads in size. The number of threads per group is a tunable parameter, and the same traversal code works for any group size. We evaluate the performance improvement of using sub-warp partitioning in Section 4.3.
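A sketch of how such masked primitives can be used to form groups, assuming contiguous, power-of-two-sized groups aligned within the warp; GROUP_SIZE is the tunable parameter mentioned above, and the names are illustrative.

```cuda
__device__ void groupSetupExample(int GROUP_SIZE) {
    int warpLane  = threadIdx.x % 32;       // this thread's lane in its warp
    int groupId   = warpLane / GROUP_SIZE;  // which sub-warp group it belongs to
    int groupLane = warpLane % GROUP_SIZE;  // its rank within the group

    // Bitmask selecting only the lanes of this group.
    unsigned groupMask = (GROUP_SIZE == 32)
                             ? 0xFFFFFFFFu
                             : ((1u << GROUP_SIZE) - 1u) << (groupId * GROUP_SIZE);

    __syncwarp(groupMask);  // synchronize only the GROUP_SIZE threads of this group
    // ... group-local work, e.g. intersectEncoded(..., groupLane, GROUP_SIZE, groupMask) ...
    (void)groupLane;        // silence unused-variable warnings in this sketch
}
```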
3.6. Memory Management
The issue of memory consumption is exacerbated on the GPU compared to the CPU for two key reasons. The first reason is that the capacity of the device memory on a typical GPU is much smaller than the capacity of main memory on a typical CPU. The second reason is that GPUs are massively parallel processors so they perform much more work in parallel, thereby requiring much more intermediate data to be stored simultaneously. For example, a CPU may run tens of threads at a time, so it only needs enough memory to maintain that many different sets of induced subgraphs and intermediate vertex lists. In contrast, a GPU may run hundreds to thousands of thread blocks at once, so it needs enough memory to maintain hundreds to thousands of different sets of induced subgraphs and intermediate vertex lists. Therefore, with a lower memory capacity and a higher demand for memory, it becomes critical to manage memory efficiently on the GPU.
We have discussed multiple techniques that we use for reducing memory consumption throughout this paper. In Section 3.2, we discuss how an induced subgraph is extracted once per thread block and shared by multiple groups of threads in their traversal of different subtrees. In Section 3.3, we discuss how threads in the same group or block explicitly manage a shared stack for storing the intermediate vertex lists, rather than each storing a private copy. In Section 3.4, we discuss how binary encoding of subgraphs and intermediate vertex lists reduces their memory requirement. In this section, we discuss one more technique for reducing memory consumption, then analyze the overall memory consumption of our approaches.
A GPU grid typically consists of many more thread blocks than the number that can run simultaneously. A block executes to completion and is replaced by another thread block after it is done. For this reason, it is unnecessary to preallocate memory space for the induced subgraph and intermediate vertex lists of all the blocks in the grid. Instead, we preallocate memory space for the maximum number of blocks that can run simultaneously, and reuse the same space for multiple consecutive blocks.
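One way to realize this (a sketch under our own assumptions, not necessarily how our code is organized) is to launch only as many blocks as can be resident at once, as reported by cudaOccupancyMaxActiveBlocksPerMultiprocessor, give each block the scratch slot indexed by its block ID, and have blocks pull successive root vertices from a global work counter so that the same slot is reused across many trees.

```cuda
__device__ unsigned int nextRoot = 0;   // next unprocessed root vertex

__global__ void persistentCountKernel(int numVertices, int k,
                                      unsigned* scratch, size_t slotWords,
                                      unsigned long long* totalCliques) {
    // One preallocated scratch slot (induced subgraph + stacks) per resident block.
    unsigned* myScratch = scratch + (size_t)blockIdx.x * slotWords;

    __shared__ unsigned root;
    while (true) {
        if (threadIdx.x == 0) root = atomicAdd(&nextRoot, 1);  // claim the next tree
        __syncthreads();
        if (root >= (unsigned)numVertices) break;
        // ... extract the induced subgraph of `root` into myScratch, traverse its
        //     search tree, and atomically add the block's count to *totalCliques ...
        __syncthreads();  // make sure everyone is done with `root` before it is overwritten
    }
    (void)k; (void)myScratch; (void)totalCliques;  // body elided in this sketch
}
```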
Table 1. Upper bound on the memory consumed per thread block by each data structure (the induced subgraph and the per-level stack arrays) for the graph orientation and pivoting approaches, along with whether each data structure is placed in global or shared memory.
Table 1 shows the upper bound on the memory consumed per thread block for the graph orientation approach and the pivoting approach. For both approaches, a thread block has one copy of the induced subgraph, which consumes $O(d_{max}^2)$ bits of memory as explained in Section 3.4. The remaining data structures are part of the shared stack, and their size depends on the maximum depth of the stack as well as the number of shared stacks within the block (i.e., the number of thread groups processing subtrees independently). In the graph orientation approach, recall that each group of threads processes a subtree in parallel, hence there are $g$ shared stacks, where $g$ is the number of thread groups in a block. Since the maximum depth of the stack is $k$, each stack data structure has $k$ elements. In the pivoting approach, recall that all the groups of threads process the same tree or subtree, hence there is only one shared stack for the entire block. However, pivoting does not stop at level $k$ but goes to the bottom of the tree, so the maximum depth of the stack is the maximum number of vertices in the induced subgraph, which is $d_{max}$. Hence, each stack data structure has $d_{max}$ elements. The per-level vertex list arrays are binary encoded, so the size of each element is $d_{max}$ bits. The rest of the data structures consist of a single value per element. Note that some of the stack data structures are placed in shared memory for fast access.
4. Evaluation
4.1. Methodology
Evaluation Platform. We evaluate our GPU implementations on a dual-socket machine with 512 GB of main memory. Each socket has a 20-core (with 2-way hyperthreading) Intel Xeon Gold 6230 CPU and two V100 GPUs with 32 GB and 16 GB of HBM2 global memory. The four GPUs are connected through NVLink 2.0. We only use a single 32 GB V100 GPU in our evaluation. We compile our code with NVCC (CUDA 10.2) and GCC 8.3.1 with the -O3 flag. The CUDA driver version used is 440.82. For the preprocessing filter and exclusive scan operations, we use NVIDIA CUB 1.8.0 (Merrill, 2015).
Datasets. We use the same graphs used by ARB-COUNT (Shi et al., 2020) for exact clique counting evaluation, namely: as-skitter, com-dblp, com-orkut, com-friendster, and com-lj. These graphs are real-world undirected graphs from the Stanford Network Analysis Project (SNAP) (Leskovec and Krevl, 2014).
Baselines. We compare our GPU implementations to two CPU baselines: ARB-COUNT (Shi et al., 2020) and Pivoter (Jain and Seshadhri, 2020). ARB-COUNT (Shi et al., 2020) represents the state-of-the-art parallel graph orientation implementation on the CPU, which significantly outperforms other graph orientation implementations (Danisch et al., 2018; Li et al., 2020). Pivoter (Jain and Seshadhri, 2020) represents the state-of-the-art parallel pivoting implementation on the CPU. We use the execution times reported in ARB-COUNT (Shi et al., 2020) for the parallel implementations of ARB-COUNT and Pivoter. The execution times they report include the time spent preprocessing and counting, and exclude the time spent loading the graph. ARB-COUNT reports the best time achieved using different graph orientation and parallelization strategies.
Reporting of Measurements. For our GPU implementations, the execution times we report include the time spent preprocessing and counting, and exclude the time spent loading the graph. Unless otherwise specified, we report the time achieved for the best combination of orientation criterion (degree or degeneracy), parallelization scheme (vertex-centric or edge-centric), binary encoding, and sub-warp partition size. Similar to ARB-COUNT (Shi et al., 2020), we do not report times greater than five hours.
4.2. Comparison with State-of-the-Art Parallel CPU Implementations
Figure 5 compares the execution time of the state-of-the-art parallel CPU implementations with our GPU implementations for both the graph orientation approach and the pivoting approach. The execution times in Figure 5 are also reported in Table 3 at the end of the paper. The missing data points in the figure are those that take longer than 5 hours to run, or run out of memory in the case of com-friendster using Pivoter. Note that our GPU implementations do not run out of memory for any scenario, despite the constrained memory capacity of the GPU. Based on the results in Figure 5, we make two key observations.
Graph Orientation vs. Pivoting. The first observation is that the graph orientation approach performs better than the pivoting approach for small values of $k$, while the latter performs better for large values of $k$. This observation is consistent with that made in prior work (Shi et al., 2020), and it applies to both the CPU and GPU implementations. Recall from Section 2.3 that pivoting has the advantage of reducing the amount of branching but the disadvantage of having deeper search trees. For small values of $k$, branching in the graph orientation approach is moderate, whereas the deep trees in the pivoting approach create load imbalance. However, as $k$ gets larger, the branching drastically increases, causing the graph orientation approach to suffer. In most cases, the transition from the graph orientation approach being fastest to the pivoting approach being fastest happens at around $k = 7$ or $k = 8$ for the graphs we evaluate (see Table 3).
GPU vs. CPU. The second observation is that our best GPU implementation consistently and significantly outperforms the best state-of-the-art parallel CPU implementation across all values of $k$. Our best GPU implementation outperforms the best CPU implementation by geometric mean speedups of 12.39×, 6.21×, and 18.99× for three representative values of $k$. This result demonstrates the effectiveness of GPUs at accelerating clique counting.
4.3. Impact of Parallelization Schemes and Speedup of Optimizations
Figures 6(a) and 6(b) show the incremental speedup of binary encoding and sub-warp partitioning for both parallelization schemes for the graph orientation approach and the pivoting approach, respectively. VC and EC refer to the vertex-centric and edge-centric parallelization schemes, respectively, with induced subgraphs represented using the CSR format and blocks partitioned into groups at warp granularity. The +BE suffix refers to when binary encoding is applied to the induced subgraphs instead of using CSR. The +P suffix refers to when sub-warp partitioning is applied and the best partition size is used. The omitted data points in the figure are those where the baseline (VC) takes longer than 5 hours to run.
Parallelization Scheme. We observe that the vertex-centric parallelization scheme is more effective than the edge-centric parallelization scheme for small values of $k$, particularly for the graph orientation approach. For example, for small $k$, VC+BE+P has a geometric mean speedup over EC+BE+P of 2.27× for the graph orientation approach and 1.09× for the pivoting approach. However, as $k$ gets larger, the edge-centric parallelization scheme becomes more effective. For example, for large $k$, EC+BE+P has a geometric mean speedup over VC+BE+P of 4.68× for the graph orientation approach and 1.45× for the pivoting approach. Recall from Section 3.2 that the vertex-centric scheme has the advantage of amortizing the cost of extracting the induced subgraph over a larger tree, whereas the edge-centric scheme has the advantage of extracting more parallelism, making it more robust against load imbalance. The graph orientation approach for small $k$ has the smallest trees, hence amortizing the induced subgraph extraction across larger trees makes the vertex-centric scheme attractive. However, as $k$ increases, load imbalance increases, which makes the edge-centric scheme attractive. We also note that the transition from the vertex-centric scheme being fastest to the edge-centric scheme being fastest happens at a much smaller $k$ for our GPU implementation than it does for prior parallel CPU implementations (Shi et al., 2020). GPUs have more parallel execution resources than CPUs, which makes them more sensitive to load imbalance, thereby favoring an earlier transition to the more load-balanced edge-centric scheme.
Binary Encoding. We observe that binary encoding gives a consistent performance improvement for both the graph orientation and pivoting approaches across all graphs, parallelization schemes, and values of $k$. The geometric mean speedup of applying binary encoding is 2.17× for the graph orientation approach and 1.38× for the pivoting approach. Recall from Section 3.4 that binary encoding improves execution time because it enables lower-latency parallel list intersection operations. The graph orientation approach spends the majority of its time performing list intersection operations, whereas the pivoting approach also performs other kinds of operations, such as finding a maximum. For this reason, it is expected that the speedup from binary encoding is more pronounced in the graph orientation approach.
Sub-warp Partitioning. We observe that sub-warp partitioning also gives a consistent performance improvement for both the graph orientation and pivoting approaches across all graphs, parallelization schemes, and values of $k$. The geometric mean speedup of applying sub-warp partitioning is 1.98× for the graph orientation approach and 1.73× for the pivoting approach. Recall from Section 3.2 that the groups of threads within a block in the graph orientation approach operate completely independently, whereas in pivoting, these groups collaborate with each other to find the pivot vertex. For this reason, it is expected that the speedup from unleashing more parallelism via sub-warp partitioning is more pronounced in the graph orientation approach. Regarding the choice of the best partition size, recall from Section 3.5 that sub-warp partitioning allows better utilization of the threads in a group when the adjacency lists being intersected are small. Our analysis shows that graphs with a lower maximum degree (hence, shorter adjacency lists) favor having fewer threads per group, whereas graphs with a larger maximum degree (hence, longer adjacency lists) favor having more threads per group.
4.4. Impact of Graph Orientation Criteria
Graph | Undirected max degree | Degree orientation: max out-degree | Degree orientation: preprocessing time (s) | Degeneracy orientation: max out-degree | Degeneracy orientation: preprocessing time (s)
as-skitter | 35,455 | 231 | 0.005 | 111 | 0.205
com-dblp | 343 | 113 | 0.002 | 113 | 0.051
com-orkut | 33,313 | 535 | 0.056 | 253 | 0.833
com-friendster | 5,214 | 868 | 5.465 | 304 | 12.294
com-lj | 14,815 | 524 | 0.016 | 360 | 0.421
Table 2 shows the maximum out-degree of the graph and the preprocessing time achieved by the two orientation criteria used in this work, namely degree orientation and degeneracy orientation. It is clear that degeneracy orientation achieves a lower maximum out-degree but has a higher preprocessing time. Recall that the maximum out-degree forms an upper bound on the length of the adjacency list intersections performed throughout the traversal. Therefore, a lower maximum out-degree results in faster list intersection operations.
Our analysis of the best orientation criterion for different runs shows that runs with lower values of $k$ favor degree orientation, whereas runs with higher values of $k$ favor degeneracy orientation. Runs with higher values of $k$ take longer and perform more list intersections, hence there is enough work reduction to amortize the higher preprocessing cost. On the other hand, runs with lower values of $k$ do not perform enough list intersections to amortize the preprocessing cost. This trend is consistent with the trend observed in prior work (Shi et al., 2020). We include this analysis here for completeness; however, as mentioned in Section 3.1, the choice of orientation criterion is orthogonal to our work and not intended as a contribution of this paper.
Graph  |V|  |E|  k  # k-cliques  ARB-COUNT (Shi et al., 2020)  Pivoter (Jain and Seshadhri, 2020)  GPU-Graph-Orientation  GPU-Pivot
as-skitter  1,696,415  11,095,298  4  148,834,439  0.60  20.90  0.034  0.459
5  1,183,885,507  0.67  22.54  0.069  0.721  
6  9,759,000,981  1.24  25.30  0.245  1.01  
7  73,142,566,591  5.73  27.14  1.434  1.269  
8  481,576,204,696  28.38  27.46  9.531  1.585  
9  2,781,731,674,867  158.45  28.45  68.221  1.835  
10  14,217,188,170,569  854.87  28.45  474.785  1.777  
11  64,975,151,572,336  4,158.53  28.45  2997.385  1.778  
com-dblp  317,080  1,049,866  4  16,713,192  0.10  2.88  0.008  0.109
5  262,663,639  0.13  2.88  0.016  0.109  
6  4,221,802,226  0.30  2.88  0.042  0.109  
7  60,913,718,813  2.05  2.88  0.545  0.109  
8  777,232,734,905  24.06  2.88  9.031  0.109  
9  8,813,264,533,265  281.39  2.88  139.046  0.109  
10  89,563,892,212,629  2,981.74  2.88  2262.99  0.109  
11  822,551,101,011,469  >5 hours  2.88  >5 hours  0.109  
com-orkut  3,072,441  117,185,083  4  3,221,946,137  3.10  292.35  0.426  8.83
5  15,766,607,860  4.94  385.04  1.014  13.869  
6  75,249,427,585  12.57  462.05  3.506  17.229  
7  353,962,921,685  42.09  517.29  11.719  20.331  
8  1,632,691,821,296  150.87  559.75  45.319  26.137  
9  7,248,102,160,867  584.39  598.88  212.912  33.644  
10  30,288,138,110,629  2,315.89  647.18  1,002.165  39.957  
11  117,138,620,358,191  8,843.51  647.18  4,421.597  48.101  
com-friendster  65,608,366  1,806,067,135  4  8,963,503,263  109.46  Out M  10.215  44.897
5  21,710,817,218  111.75  Out M  11.796  53.874  
6  59,926,510,355  115.52  Out M  17.22  63.874  
7  296,858,496,789  139.98  Out M  45.697  66.544  
8  3,120,447,373,827  300.62  Out M  99.866  67.064  
9  40,033,489,612,826  1,796.12  Out M  803.53  71.404  
10  487,090,833,092,739  16,836.41  Out M  12,775.67  71.051  
11  5,403,375,502,221,431  >5 hours  Out M  >5 hours  71.448  
com-lj  3,997,962  34,681,189  4  5,216,918,441  1.77  268.06  0.104  10.864
5  246,378,629,120  7.52  1,475.99  0.943  68.966  
6  10,990,740,312,954  258.46  7,816.13  23.792  379.88  
7  449,022,426,169,164  10,733.21  >5 hours  1,077.66  1,639.537  
8  16,890,998,195,437,619  >5hours  >5 hours  >5 hours  6850.989 
5. Related Work
Graph Orientation Approach to Clique Counting. Graph orientation is a fundamental approach to avoiding redundant clique discovery (Chiba and Nishizeki, 1985; Finocchi et al., 2015; Danisch et al., 2018; Shi et al., 2020; Li et al., 2020). The Chiba-Nishizeki algorithm (Chiba and Nishizeki, 1985) was the first practical sequential algorithm to search for dense structures in graphs. This algorithm uses degree orientation. There have been several works that implement parallel clique counting on CPUs. Finocchi et al. (Finocchi et al., 2015) present parallel clique counting algorithms for MapReduce; degree orientation is also used as a preprocessing step. KCList (Danisch et al., 2018) is designed to list all the cliques in a graph. It is based on the Chiba-Nishizeki algorithm, but uses degeneracy (core) orientation instead of degree orientation. Vertex-centric and edge-centric parallelizations are both explored. ARB-COUNT (Shi et al., 2020) includes a parallel work-efficient CPU implementation of clique counting. It supports different orientation criteria such as degree, degeneracy, Barenboim-Elkin (Barenboim and Elkin, 2010), and Goodrich-Pszona (Goodrich and Pszona, 2011). Vertex-centric and edge-centric parallelizations are also both explored. ARB-COUNT reports being faster than other parallel clique counting implementations such as KCList (Danisch et al., 2018), as well as other subgraph matching implementations such as the worst-case optimal join algorithm (Mhedhbi and Salihoglu, 2019), Lai et al.'s implementation of a binary join algorithm (Lai et al., 2019), and the Escape algorithm (Pinar et al., 2017). Li et al. (Li et al., 2020) propose an orientation based on graph coloring for clique listing, but report that ARB-COUNT (Shi et al., 2020) is faster than their approach. All these parallel implementations of the graph orientation approach are designed for CPUs. We implement the graph orientation approach for clique counting on the GPU and compare our performance with ARB-COUNT (Shi et al., 2020).
Pivoting Approach to Clique Counting. Pivoter (Jain and Seshadhri, 2020) is a recent work on clique counting that is inspired by the classical pivoting idea of Bron-Kerbosch maximal clique finding (Eppstein et al., 2010). ARB-COUNT (Shi et al., 2020) compares to Pivoter and shows that Pivoter is faster for large values of $k$. Pivoter is implemented on the CPU. We implement the pivoting approach for clique counting on the GPU and compare our performance with Pivoter (Jain and Seshadhri, 2020).
Maximal Clique Enumeration. Enumerating the maximal cliques in a graph has been extensively studied on CPUs (Cheng et al., 2012; Schmidt et al., 2009; Das et al., 2020; Xu et al., 2014) and GPUs (Lessley et al., 2017; Jenkins et al., 2011; VanCompernolle et al., 2016b; Henry and Eng, 2014; VanCompernolle et al., 2016a; Yu and Liu, 2019; Dasari et al., 2014). The pivoting approach is inspired by techniques used in maximal clique enumeration. Our work solves the clique counting problem.
Triangle Counting. Many works perform triangle counting on the CPU (Pagh and Tsourakakis, 2012; Al Hasan and Dave, 2018; Halappanavar and Ghosh, 2020; Kolountzakis et al., 2012) or the GPU (Pearson et al., 2019; Wang and Owens, 2019; Pandey et al., 2019; Hu et al., 2018; Mailthody et al., 2018; Bisson and Fatica, 2018b; Green et al., 2014a, b; Wang et al., 2016b). A triangle is a 3-clique, which is a special case of a clique. Our implementation performs clique counting for any value of $k$.
Generalized Subgraph Matching. Many works perform generalized subgraph matching on the CPU (Pinar et al., 2017; Ahmed et al., 2017; Elenberg et al., 2016; Rossi et al., 2018b; Wang et al., 2017) and the GPU (Chen et al., 2020; Guo et al., 2020; Wang et al., 2016a; Lin et al., 2016; Tran et al., 2015; Zeng et al., 2020; Wang and Owens, 2020). These frameworks search for an arbitrary $k$-vertex subgraph and support different values of $k$. Due to this generality, such subgraph matching frameworks suffer from memory explosion or prolonged execution times. Our implementation is specialized for cliques, which are an important special case of a $k$-vertex subgraph. Specializing for cliques enables optimizations that are not applicable to general subgraphs, thereby providing better scalability for large values of $k$.
Truss Decomposition. Recently, truss decomposition has received significant attention on CPUs (Wang and Cheng, 2012; Pearce and Sanders, 2018; Low et al., 2018; Conte et al., 2018) and GPUs (Almasri et al., 2019; Diab et al., 2020; Che et al., 2020; Green et al., 2017; Bisson and Fatica, 2018a; Blanco et al., 2019). A truss is a relaxation of a clique. Our work solves the clique counting problem.
List Intersections. List intersection is an important operation at the heart of many subgraph search algorithms. Different list intersection strategies have been proposed for GPUs, such as pointer-chasing (Almasri et al., 2019), binary-search (Pearson et al., 2019), merge-path (Green et al., 2014b), hash-based (Pandey et al., 2019), tile-based (Diab et al., 2020), bitmap-based (Bisson and Fatica, 2018b), and others. Our implementations use binary-search intersections for extracting an induced subgraph, and binary encoding for the lists in the induced subgraph.
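As a rough illustration of how these two choices fit together, the sketch below shows one way a GPU kernel could intersect two sorted adjacency lists with binary search while writing the result as a bitmap; the kernel and array names are our own illustrative assumptions, not the actual implementation.

__device__ int binarySearch(const int* list, int len, int key) {
    // Standard binary search over a sorted adjacency list; returns the index
    // of key in list, or -1 if key is absent.
    int lo = 0, hi = len - 1;
    while (lo <= hi) {
        int mid = (lo + hi) >> 1;
        if (list[mid] == key) return mid;
        if (list[mid] < key) lo = mid + 1;
        else hi = mid - 1;
    }
    return -1;
}

__global__ void encodeIntersection(const int* __restrict__ listA, int lenA,
                                   const int* __restrict__ listB, int lenB,
                                   unsigned int* bitmap /* ceil(lenA/32) words, zero-initialized */) {
    // Each thread takes one element of listB, binary-searches it in listA,
    // and, on a hit, sets the corresponding bit of the encoded intersection.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < lenB;
         i += gridDim.x * blockDim.x) {
        int pos = binarySearch(listA, lenA, listB[i]);
        if (pos >= 0) {
            atomicOr(&bitmap[pos >> 5], 1u << (pos & 31));
        }
    }
}

With adjacency rows encoded this way, deeper levels of the search tree can intersect lists with bitwise AND and population counts rather than repeated searches.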
6. Conclusion
We present GPU implementations of clique counting for both the graph orientation approach and the pivoting approach. We explore vertex-centric and edge-centric parallelization schemes, replace the recursive search tree traversal with an iterative traversal, and apply various optimizations such as binary encoding and subwarp partitioning to reduce memory consumption and efficiently utilize parallel execution resources. Our evaluation shows that our GPU implementations substantially outperform their state-of-the-art CPU counterparts. We also show that the binary encoding and subwarp partitioning optimizations yield significant performance gains, and provide insights on the choice of parallelization scheme and subwarp partitioning granularity.
References
 Graphlet decomposition: framework, algorithms, and applications. Knowledge and Information Systems 50 (3), pp. 689–722. Cited by: §5.
 Triangle counting in large networks: a review. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 8 (2), pp. e1226. Cited by: §5.
 Update on k-truss decomposition on gpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
 Sublogarithmic distributed mis algorithm for sparse graphs using nash-williams decomposition. Distributed Computing 22 (5-6), pp. 363–379. Cited by: §5.
 Update on static graph challenge on gpu. In 2018 IEEE High Performance extreme Computing Conference (HPEC), Vol. , pp. 1–8. External Links: Document Cited by: §5.
 Update on static graph challenge on gpu. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–8. Cited by: §5, §5.
 Exploration of fine-grained parallelism for load balancing eager k-truss on gpu and cpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–7. External Links: Document Cited by: §5.
 Accelerating truss decomposition on heterogeneous processors. Proceedings of the VLDB Endowment 13 (10), pp. 1751–1764. Cited by: §3.1, §5.
 Pangolin: an efficient and flexible graph mining system on cpu and gpu. Proceedings of the VLDB Endowment 13 (10), pp. 1190–1205. Cited by: §5.
 Fast algorithms for maximal clique enumeration with limited memory. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 1240–1248. Cited by: §5.
 Arboricity and subgraph listing algorithms. SIAM Journal on computing 14 (1), pp. 210–223. Cited by: §1, §1, §2.2, §2.2, §5.
 Discovering trusses in large-scale networks. In 2018 IEEE High Performance extreme Computing Conference (HPEC), Vol. , pp. 1–6. External Links: Document Cited by: §5.
 Listing k-cliques in sparse real-world graphs. In Proceedings of the 2018 World Wide Web Conference, pp. 589–598. Cited by: §1, §1, §2.2, §2.2, §4.1, §5.
 Large scale density-friendly graph decomposition via convex programming. In Proceedings of the 26th International Conference on World Wide Web, pp. 233–242. Cited by: §1.
 Shared-memory parallel maximal clique enumeration from static and dynamic graphs. ACM Transactions on Parallel Computing (TOPC) 7 (1), pp. 1–28. Cited by: §5.
 PbitMCE: a bit-based approach for maximal clique enumeration on multicore processors. In 2014 20th IEEE International Conference on Parallel and Distributed Systems (ICPADS), Vol. , pp. 478–485. External Links: Document Cited by: §3.4, §5.
 KTrussExplorer: exploring the design space of k-truss decomposition optimizations on gpus. In 2020 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–8. Cited by: §5, §5.

 Distributed estimation of graph 4-profiles. In Proceedings of the 25th International Conference on World Wide Web, pp. 483–493. Cited by: §5.
 Listing all maximal cliques in sparse graphs in near-optimal time. In International Symposium on Algorithms and Computation, pp. 403–414. Cited by: §1, §1, §2.2, §2.3, §5.
 Efficient algorithms for densest subgraph discovery. Proceedings of the VLDB Endowment 12 (11), pp. 1719–1732. Cited by: §1.
 Clique counting in mapreduce: algorithms and experiments. Journal of Experimental Algorithmics (JEA) 20, pp. 1–20. Cited by: §1, §1, §2.2, §5.
 Dense subgraph discovery: kdd 2015 tutorial. In Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 2313–2314. Cited by: §1.
 Compact structure for sparse undirected graphs based on a clique graph partition. Information Sciences 544, pp. 485–499. Cited by: §1.
 External-memory network analysis algorithms for naturally sparse graphs. In European Symposium on Algorithms, pp. 664–676. Cited by: §5.
 Quickly finding a truss in a haystack. In 2017 IEEE High Performance Extreme Computing Conference (HPEC), Vol. , pp. 1–7. External Links: Document Cited by: §5.
 Fast triangle counting on the gpu. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms, pp. 1–8. Cited by: §5.
 Fast triangle counting on the gpu. In Proceedings of the 4th Workshop on Irregular Applications: Architectures and Algorithms, IA^3 ’14, pp. 1–8. External Links: ISBN 9781479970568 Cited by: §5, §5.
 Parallel k-clique community detection on large-scale networks. IEEE Transactions on Parallel and Distributed Systems 24 (8), pp. 1651–1660. Cited by: §1.
 Exploiting reuse for gpu subgraph enumeration. IEEE Transactions on Knowledge and Data Engineering. Cited by: §5.
 TriC: distributed-memory triangle counting by exploiting the graph structure. Technical report, Pacific Northwest National Lab. (PNNL), Richland, WA (United States). Cited by: §5.
 Detecting (k, r)-clique communities from social networks. In Advanced Multimedia and Ubiquitous Engineering, pp. 583–590. Cited by: §1.
 A parallel gpu solution to the maximal clique enumeration problem for cbir. In GPU Technology Conference (GTC 2014), Cited by: §5.
 High-performance triangle counting on gpus. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–5. Cited by: §5.
 The power of pivoting for exact clique counting. In Proceedings of the 13th International Conference on Web Search and Data Mining, pp. 268–276. Cited by: §1, §1, §2.3, §2.3, §2.3, §2.3, §3.3, §4.1, Table 3, §5.
 Lessons learned from exploring the backtracking paradigm on the gpu. In European Conference on Parallel Processing, pp. 425–437. Cited by: §5.
 Efficient triangle counting in large graphs via degree-based vertex partitioning. Internet Mathematics 8 (1-2), pp. 161–185. Cited by: §5.
 Distributed subgraph matching on timely dataflow. Proceedings of the VLDB Endowment 12 (10), pp. 1099–1112. Cited by: §5.
 A survey of algorithms for dense subgraph discovery. In Managing and Mining Graph Data, pp. 303–336. Cited by: §1.
 SNAP datasets: stanford large network dataset collection. Cited by: §4.1.
 Maximal clique enumeration with data-parallel primitives. In 2017 IEEE 7th Symposium on Large Data Analysis and Visualization (LDAV), pp. 16–25. Cited by: §5.

 Ordering heuristics for k-clique listing. Proceedings of the VLDB Endowment 13 (12), pp. 2536–2548. Cited by: §1, §1, §2.2, §2.2, §4.1, §5.
 Network motif discovery: a gpu approach. IEEE Transactions on Knowledge and Data Engineering 29 (3), pp. 513–528. Cited by: §5.
 Linear algebraic formulation of edge-centric k-truss algorithms with adjacency matrices. In 2018 IEEE High Performance extreme Computing Conference (HPEC), Vol. , pp. 1–7. External Links: Document Cited by: §5.
 Collaborative (cpu + gpu) algorithms for triangle counting and truss decomposition. In 2018 IEEE High Performance extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5.

 Patient diet recommendation system using k clique and deep learning classifiers. Journal of Artificial Intelligence 2 (02), pp. 121–130. Cited by: §1.
 Cub. NVIDIA Research. Cited by: §4.1.
 Optimizing subgraph queries by combining binary and worst-case optimal joins. arXiv preprint arXiv:1903.02076. Cited by: §5.
 V100 gpu architecture whitepaper. NVIDIA. Cited by: §1, §3.5.
 Colorful triangle counting and a mapreduce implementation. Information Processing Letters 112 (7), pp. 277–281. Cited by: §5.
 H-index: hash-indexing for parallel triangle counting on gpus. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
 K-truss decomposition for scale-free graphs at scale in distributed memory. In 2018 IEEE High Performance extreme Computing Conference (HPEC), Vol. , pp. 1–6. External Links: Document Cited by: §5.
 Update on triangle counting on gpu. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–7. Cited by: §5, §5.
 Escape: efficiently counting all 5-vertex subgraphs. In Proceedings of the 26th international conference on world wide web, pp. 1431–1440. Cited by: §5, §5.
 Higher-order network representation learning. In Companion Proceedings of the The Web Conference 2018, pp. 3–4. Cited by: §1.

 Estimation of graphlet counts in massive networks. IEEE Transactions on Neural Networks and Learning Systems 30 (1), pp. 44–57. Cited by: §5.
 Graphzip: a clique-based sparse graph compression method. Journal of Big Data 5 (1), pp. 1–14. Cited by: §1.
 System and method for compressing graphs via cliques. Google Patents. Note: US Patent 10,217,241 Cited by: §1.
 Nucleus decompositions for identifying hierarchy of dense subgraphs. ACM Transactions on the Web (TWEB) 11 (3), pp. 1–27. Cited by: §1.
 A scalable, parallel algorithm for maximal clique enumeration. Journal of Parallel and Distributed Computing 69 (4), pp. 417–428. Cited by: §5.
 Parallel clique counting and peeling algorithms. arXiv preprint arXiv:2002.10047. Cited by: §1, §1, §2.2, §2.2, §4.1, §4.1, §4.1, §4.2, §4.3, §4.4, Table 3, §5, §5.
 Fast subgraph matching on large graphs using graphics processors. In International Conference on Database Systems for Advanced Applications, pp. 299–315. Cited by: §5.
 The k-clique densest subgraph problem. In Proceedings of the 24th international conference on world wide web, pp. 1122–1132. Cited by: §1.
 Maximum clique solver using bitsets on gpus. In Information Technology: New Generations, pp. 327–337. Cited by: §3.4, §5.
 Maximum clique solver using bitsets on gpus. In Information Technology: New Generations, pp. 327–337. Cited by: §5.
 An efficient movie recommendation algorithm based on improved k-clique. Human-centric Computing and Information Sciences 8 (1), pp. 1–15. Cited by: §1.
 Truss decomposition in massive networks. arXiv preprint arXiv:1205.6693. Cited by: §5.
 Fast bfs-based triangle counting on gpus. In 2019 IEEE High Performance Extreme Computing Conference (HPEC), pp. 1–6. Cited by: §5.
 Fast gunrock subgraph matching (gsm) on gpus. arXiv preprint arXiv:2003.01527. Cited by: §5.
 Fast parallel subgraph matching on the gpu. In HPDC, Cited by: §5.
 A comparative study on exact triangle counting algorithms on the gpu. In Proceedings of the ACM Workshop on High Performance Graph Processing, pp. 1–8. Cited by: §5.
 MOSS-5: a fast method of approximating counts of 5-node graphlets in large graphs. IEEE Transactions on Knowledge and Data Engineering 30 (1), pp. 73–86. Cited by: §5.
 Distributed maximal clique computation. In 2014 IEEE International Congress on Big Data, pp. 160–167. Cited by: §5.
 A memory efficient maximal clique enumeration method for sparse graphs with a parallel implementation. Parallel Computing 87, pp. 46–59. External Links: ISSN 0167-8191, Document, Link Cited by: §3.4, §5.
 Rum: network representation learning using motifs. In 2019 IEEE 35th International Conference on Data Engineering (ICDE), pp. 1382–1393. Cited by: §1.
 Index-based densest clique percolation community search in networks. IEEE Transactions on Knowledge and Data Engineering 30 (5), pp. 922–935. Cited by: §1.
 GSI: gpu-friendly subgraph isomorphism. In 2020 IEEE 36th International Conference on Data Engineering (ICDE), Vol. , pp. 1249–1260. External Links: Document Cited by: §5.