1. Introduction
We study Maximal Clique Enumeration (MCE), the problem of enumerating all cliques (complete subgraphs) in a graph that are maximal. A clique in a graph $G$ is a (dense) subgraph $C$ such that every pair of vertices in $C$ is directly connected by an edge. A clique $C$ is said to be maximal when there is no clique $C'$ such that $C$ is a proper subgraph of $C'$ (see Figure 1). Maximal cliques are perhaps the most fundamental dense subgraphs, and MCE has been widely used in diverse research areas, e.g. the works of Palla et al. (PD+05) on discovering protein groups by mining cliques in a protein-protein network, of Koichi et al. (KA+14) on discovering chemical structures using MCE on a graph derived from large-scale chemical databases, various problems on mining biological data (GA+93; HO+03; MBR04; CC05; JB06; RWY07; ZP+08), overlapping community detection (YQ+TKDE17), location-based services on spatial databases (ZZ+ICDE19), and inference in graphical models (KF09). MCE is important in static as well as dynamic networks, i.e. networks that change over time due to the addition and deletion of vertices and edges. For example, Chateau et al. (CRR11) use MCE in studying changes in the genetic rearrangements of bacterial genomes when new genomes are added to the existing genome population; there, maximal cliques are building blocks for computing approximate common intervals of multiple genomes. Duan et al. (DL+12) use MCE on a dynamic graph to track highly interconnected communities in a dynamic social network.
MCE is a computationally hard problem since it is at least as hard as the maximum clique problem, a classical NP-complete combinatorial problem that asks for a clique of the largest size in a graph. Note that maximal clique and maximum clique are two related, but distinct notions. A maximum clique is also a maximal clique, but a maximal clique need not be a maximum clique. The computational cost of enumerating maximal cliques can be higher than the cost of finding the maximum clique, since the output size (the set of all maximal cliques) can itself be very large. Moon and Moser (MM65) showed that a graph on $n$ vertices can have as many as $3^{n/3}$ maximal cliques, and this bound is tight. Real-world networks typically do not have clique structure of such high complexity and, as a result, it is feasible to enumerate maximal cliques from large graphs. The literature is rich on sequential algorithms for MCE. Bron and Kerbosch (BK73) introduced a backtracking search method to enumerate maximal cliques. Tomita et al. (TTT06) introduced the idea of “pivoting” in the backtracking search, which led to a significant improvement in the runtime. This has been followed up by further work such as that of Eppstein et al. (ELS10), who used a degeneracy-based vertex ordering on top of the pivot selection strategy.
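To make the Moon and Moser bound concrete, the following minimal sketch builds a small complete multipartite graph with parts of size three and counts its maximal cliques by brute force. The helper names (`moon_moser_graph`, `maximal_cliques_bruteforce`) are hypothetical and chosen only for illustration; the exhaustive subset test is usable only on tiny graphs and is not one of the algorithms discussed in this paper.

```python
from itertools import combinations

def moon_moser_graph(k):
    """Complete k-partite graph with parts of size 3 (n = 3k vertices)."""
    adj = {v: set() for v in range(3 * k)}
    for u, v in combinations(range(3 * k), 2):
        if u // 3 != v // 3:          # vertices in different parts are adjacent
            adj[u].add(v)
            adj[v].add(u)
    return adj

def is_clique(adj, s):
    return all(v in adj[u] for u, v in combinations(s, 2))

def maximal_cliques_bruteforce(adj):
    """Enumerate maximal cliques by testing every vertex subset (tiny graphs only)."""
    verts = list(adj)
    cliques = [frozenset(s)
               for r in range(1, len(verts) + 1)
               for s in combinations(verts, r)
               if is_clique(adj, s)]
    # A clique is maximal if no other clique strictly contains it.
    return [c for c in cliques if not any(c < d for d in cliques)]

adj = moon_moser_graph(3)             # n = 9 vertices
maximal = maximal_cliques_bruteforce(adj)
print(len(maximal))                   # 27, i.e. 3 ** (9 // 3)
```

In this particular graph every maximal clique picks exactly one vertex from each part, so all maximal cliques are also maximum cliques; in general the two notions differ.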
Sequential approaches to MCE can lead to high runtimes on large graphs. Based on our experiments, enumerating all maximal cliques of the real-world network Orkut, which has millions of vertices and edges, takes on the order of hours using an efficient sequential algorithm due to Tomita et al. (TTT06). Graphs that are larger and/or more complex cannot be handled by sequential algorithms with a reasonable turnaround time, and the high computational complexity of MCE calls for parallel methods.
Figure 1. (a) Input graph $G$; (b) a non-maximal clique in $G$; (c) a maximal clique in $G$.
In this work, we consider shared-memory parallel methods for MCE. In the shared-memory model, the input graph can reside within globally shared memory, and multiple threads can work in parallel on enumerating maximal cliques. Shared-memory parallelism is of high interest today since machines with tens to hundreds of cores and hundreds of gigabytes of shared memory are readily available. The advantages of a shared-memory approach over a distributed-memory approach are: (1) Unlike in distributed memory, it is not necessary to divide the graph into subgraphs and communicate the subgraphs among processors; different threads can work concurrently on a single shared copy of the graph. (2) Subproblems generated during MCE are often highly imbalanced, and it is hard to predict which subproblems are small and which are large while initially dividing the problem into subproblems. With a shared-memory method, it is possible to further subdivide subproblems and process them in parallel. With a distributed-memory method, handling such imbalances in subproblem sizes requires greater coordination and is more complex.
To show how imbalanced the subproblems can be, in Figure 2 we show data for two real-world networks, AsSkitter and WikiTalk. These two networks have millions of edges and tens of millions of maximal cliques (for statistics on these networks, see Section 6). Consider a division of the MCE problem into per-vertex subproblems, where each subproblem corresponds to the set of all maximal cliques containing a single vertex in a network, and suppose these subproblems were solved independently, while taking care to avoid searching for the same maximal clique multiple times. For AsSkitter, we observed that a disproportionately large share of the total runtime required for MCE is taken by only a small fraction of the subproblems, and that a small fraction of all subproblems yields a disproportionately large share of all maximal cliques. An even larger skew in subproblem sizes is observed in the WikiTalk graph. This data demonstrates that load balancing is a central issue for parallel MCE.
Figure 2. Panels: (a) AsSkitter, (b) WikiTalk, (c) AsSkitter, (d) WikiTalk.
Prior works on parallel MCE have largely focused on distributed-memory algorithms (WY+09; SS+09; LGG10; SMT+15). There are a few works on shared-memory parallel algorithms (ZA+05; DW+09; LP+17). However, these algorithms do not scale to larger graphs due to memory or computational bottlenecks: either the algorithms miss out on significant pruning opportunities, as in (DW+09), or they need to generate a large number of non-maximal cliques, as in (ZA+05; LP+17).
1.1. Our Contributions
We make the following contributions towards enumerating all maximal cliques in a simple graph.
Theoretically Efficient Parallel Algorithm: We present a shared-memory parallel algorithm that takes as input a graph $G$ and enumerates all maximal cliques in $G$. It is an efficient parallelization of the algorithm due to Tomita, Tanaka, and Takahashi (TTT06). Our analysis using a work-depth model of computation (S17) shows that it is work-efficient when compared with (TTT06) and has a low parallel depth. To our knowledge, this is the first shared-memory parallel algorithm for MCE with such provable properties.
Optimized Parallel Algorithm: We present a second shared-memory parallel algorithm that builds on the first and yields improved practical performance. The first algorithm starts with a single task at the top level that spawns recursive subtasks as it proceeds, leading to a lack of parallelism at the top level of recursion; the optimized algorithm instead spawns multiple parallel tasks at the top level. To achieve this, it uses per-vertex parallelization, where a separate subproblem is created for each vertex and different subproblems are processed in parallel. Each subproblem is required to enumerate cliques which contain the assigned vertex, and care is taken to prevent overlap between subproblems. Each per-vertex subproblem is further processed in parallel using the first algorithm. This additional (recursive) level of parallelism is important since different per-vertex subproblems may have significantly different computational costs, and having each run as a separate sequential task may lead to uneven load balance. To further address load balance, we use a vertex ordering in assigning cliques to different per-vertex subproblems. For ordering the vertices, we use various metrics such as degree, triangle count, and the degeneracy number of the vertices.
Incremental Parallel Algorithm: Next, we present a parallel algorithm that can maintain the set of maximal cliques in a dynamic graph when the graph is updated due to the addition of new edges. When a batch of edges is added to the graph, the algorithm can (in parallel) enumerate the set of all new maximal cliques that emerged and the set of all maximal cliques that are no longer maximal (subsumed cliques). It consists of two parts: one for enumerating new maximal cliques, and one for enumerating subsumed maximal cliques. We analyze the algorithm using the work-depth model and show that it is work-efficient relative to an efficient sequential algorithm, and has a low parallel depth. A summary of our algorithms is shown in Table 1.
Table 1. Summary of our algorithms.
Algorithm | Type | Description
 | Static | A work-efficient parallel algorithm for MCE on a static graph
 | Static | A practical parallel algorithm for MCE on a static graph dealing with load imbalance
 | Dynamic |
Experimental Evaluation: We implemented all our algorithms, and our experiments show a speedup of 15x to 21x when compared with an efficient sequential algorithm (due to Tomita et al. (TTT06)) on a multicore machine with terabyte-scale RAM. For example, on the Wikipedia network, with millions of vertices, edges, and maximal cliques, the theoretically efficient algorithm achieves a 16.5x parallel speedup over the sequential algorithm, and the optimized algorithm achieves a 21.5x speedup and completes in approximately two minutes. In contrast, prior shared-memory parallel algorithms for MCE (ZA+05; DW+09; LP+17) failed to handle the input graphs that we considered, and either ran out of memory (ZA+05; LP+17) or did not complete in 5 hours (DW+09).
On dynamic graphs, we observe that our incremental parallel algorithm gives a 3x to 19x speedup over a state-of-the-art sequential algorithm (DST16) on a multicore machine with 32 cores. Interestingly, the speedup of the parallel algorithm increases with the magnitude of change in the set of maximal cliques: the “harder” the dynamic enumeration task is, the larger is the speedup obtained. For example, on a dense graph such as CaCitHepTh (with original graph density of 0.01), we get approximately a 19x speedup over the sequential algorithm (DST16). More details are presented in Section 6.
Techniques for Load Balancing: Our parallel methods can effectively balance the load in solving parallel MCE. As shown in Figure 2, “natural” subproblems of MCE are highly imbalanced, and, therefore, load balancing is not trivial. In our algorithms, subproblems of MCE are broken down into smaller subproblems, according to the search used by the sequential algorithm (TTT06), and this process continues recursively. As a result, the final subproblem that is solved in a single task is not so large as to create load imbalances. Our experiments in Section 6 demonstrate that the recursive splitting of subproblems in MCE is essential for achieving a high speedup over existing algorithms (TTT06). In order to efficiently assign these (dynamically created) tasks to threads at runtime, we utilize a work stealing scheduler (R95; RL99).
Roadmap. The rest of the paper is organized as follows. We discuss related works in Section 2, we present preliminaries in Section 3, followed by a description of algorithms for a static graph in Section 4, algorithms for a dynamic graph in Section 5, an experimental evaluation in Section 6, and conclusions in Section 7.
2. Related Work
Maximal Clique Enumeration (MCE) from a graph is a fundamental problem that has been extensively studied for more than two decades, and there are multiple prior works on sequential and parallel algorithms. We first discuss sequential algorithms for MCE, followed by parallel algorithms.
Sequential MCE: Bron and Kerbosch (BK73) presented an algorithm for MCE based on depth-first search. Following their work, a number of algorithms have been presented (TI+77; CN85; TTT06; Koch01; MU04; CK08; ELS10). The algorithm of Tomita et al. (TTT06) has a worst-case time complexity of $O(3^{n/3})$ for an $n$-vertex graph, which is optimal in the worst case, since the size of the output can be as large as $3^{n/3}$ (MM65). Eppstein et al. (ELS10; ES11) present an algorithm for sparse graphs whose complexity can be parameterized by the degeneracy of the graph, a measure of graph sparsity.
Another approach to MCE is a class of “output-sensitive” algorithms whose time complexity for enumerating maximal cliques is a function of the size of the output. There exist many such output-sensitive algorithms for MCE, including (CG+16; CN85; MU04; TI+77), which can be viewed as instances of a general paradigm called “reverse search” (AF93). A recent algorithm (CG+16) provides favorable tradeoffs for delay (the time between two enumerated maximal cliques) when compared with prior works. In terms of practical performance, the best output-sensitive algorithms (CG+16; CN85; MU04) are not as efficient as the best depth-first-search based algorithms (TTT06; ELS10). Other sequential methods for MCE include algorithms due to Kose et al. (KW+01), Johnson et al. (JYP88), and Li et al. (LS+DASFAA19), an algorithm for a special class of graphs due to Fox et al. (FRSWSIAMJ18), and an algorithm for temporal graphs due to Qin et al. (QL+ICDE19). Multiple works have considered sequential algorithms for maintaining the set of maximal cliques on a dynamic graph (DST16; S04; OV10), and, to our knowledge, the most efficient algorithm is the one due to Das et al. (DST16).
Parallel MCE: There are multiple prior works on parallel algorithms for MCE (ZA+05; DW+06; WY+09; SS+09; LGG10; SMT+15; YLPC19). We first discuss shared-memory algorithms and then distributed-memory algorithms. Zhang et al. (ZA+05) present a shared-memory parallel algorithm based on the sequential algorithm due to Kose et al. (KW+01). This algorithm computes maximal cliques in an iterative manner, and, in each iteration, it maintains a set of cliques that are not necessarily maximal and, for each such clique, maintains the set of vertices that can be added to form larger cliques. This algorithm does not provide a theoretical guarantee on the runtime and suffers from a large memory requirement. Du et al. (DW+06) present an output-sensitive shared-memory parallel algorithm for MCE, but their algorithm suffers from poor load balancing, as also pointed out by Schmidt et al. (SS+09). Lessley et al. (LP+17) present a shared-memory parallel algorithm that generates maximal cliques using an iterative method, where in each iteration, cliques of size $k$ are extended to cliques of size $k+1$. The algorithm of (LP+17) is memory-intensive, since it needs to store a number of intermediate non-maximal cliques in each iteration. Note that the number of non-maximal cliques may be far higher than the number of maximal cliques that are finally emitted, and a number of distinct non-maximal cliques may finally lead to a single maximal clique. In the extreme case, a complete graph on $n$ vertices has a number of non-maximal cliques that is exponential in $n$, and only a single maximal clique. We present a comparison of our algorithm with (LP+17; ZA+05; DW+06) in later sections.
Distributed-memory parallel algorithms for MCE include works due to Wu et al. (WY+09), designed for the MapReduce framework, Lu et al. (LGG10), which is based on the sequential algorithm due to Tsukiyama et al. (TI+77), Svendsen et al. (SMT+15), Wang et al. (WC+17), and an algorithm for sparse graphs due to Yu and Liu (YLPC19). Other works on parallel and sequential algorithms for enumerating dense subgraphs from a massive graph include sequential algorithms for enumerating quasi-cliques (SD+18; LL08; U10), parallel algorithms for enumerating $k$-cores (MD+13; DDZ14; KM17; SSP17), $k$-trusses (SSP17; KM172; KM173), and nuclei (SSP17), and distributed-memory algorithms for enumerating $k$-plexes (CM+18) and bicliques (MT17).
3. Preliminaries
We consider a simple undirected graph without self loops or multiple edges. For graph $G$, let $V(G)$ denote the set of vertices in $G$ and $E(G)$ denote the set of edges in $G$. Let $n$ denote the size of $V(G)$, and $m$ denote the size of $E(G)$. For vertex $u \in V(G)$, let $\Gamma_G(u)$ denote the set of vertices adjacent to $u$ in $G$. When the graph $G$ is clear from the context, we use $\Gamma(u)$ to mean $\Gamma_G(u)$. Let $\mathcal{C}(G)$ denote the set of all maximal cliques in $G$.
Sequential Algorithm TTT: The algorithm due to Tomita, Tanaka, and Takahashi (TTT06), which we call TTT, is a recursive backtracking-based algorithm for enumerating all maximal cliques in an undirected graph, with a worst-case time complexity of $O(3^{n/3})$, where $n$ is the number of vertices in the graph. In practice, this is one of the most efficient sequential algorithms for MCE. Since we use TTT as a subroutine in our parallel algorithms, we present a short description here.
In any recursive call, TTT maintains three disjoint sets of vertices $K$, $\mathrm{cand}$, and $\mathrm{fini}$, where $K$ is a candidate clique to be extended, $\mathrm{cand}$ is the set of vertices that can be used to extend $K$, and $\mathrm{fini}$ is the set of vertices that are adjacent to all of $K$ but need not be used to extend $K$ (these are being explored along other search paths). Each recursive call iterates over vertices from $\mathrm{cand}$, and in each iteration a vertex $q \in \mathrm{cand}$ is added to $K$ and a new recursive call is made with parameters $K \cup \{q\}$, $\mathrm{cand}_q$, and $\mathrm{fini}_q$ for generating all maximal cliques of $G$ that extend $K \cup \{q\}$ but do not contain any vertices from $\mathrm{fini}_q$. The sets $\mathrm{cand}_q$ and $\mathrm{fini}_q$ can only contain vertices that are adjacent to all vertices in $K \cup \{q\}$. The clique $K$ is a maximal clique when both $\mathrm{cand}$ and $\mathrm{fini}$ are empty.
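The ingredient that makes TTT different from the algorithm due to Bron and Kerbosch (BK73) is the use of a “pivot”: a vertex $u \in \mathrm{cand} \cup \mathrm{fini}$ is selected that maximizes $|\mathrm{cand} \cap \Gamma(u)|$, where $\Gamma(u)$ is the set of neighbors of $u$. Once the pivot $u$ is computed, it is sufficient to iterate over the vertices of $\mathrm{cand} \setminus \Gamma(u)$, instead of iterating over all vertices of $\mathrm{cand}$. The pseudocode of TTT is presented in Algorithm 1. For the initial call, $K$ and $\mathrm{fini}$ are initialized to the empty set, and $\mathrm{cand}$ is the set of all vertices of $G$.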
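The description above can be sketched as a short recursive routine. This is a minimal illustration of backtracking with pivoting, assuming the graph is given as a dictionary mapping each vertex to its neighbor set; the function name `ttt` and the variable names are chosen here for illustration and do not reproduce Algorithm 1 line by line.

```python
def ttt(adj, K, cand, fini, out):
    """Backtracking with pivoting: append every maximal clique extending K to `out`."""
    if not cand and not fini:
        out.append(frozenset(K))      # nothing can extend K, so K is maximal
        return
    # Pivot: the vertex of cand | fini with the most neighbors inside cand.
    pivot = max(cand | fini, key=lambda u: len(cand & adj[u]))
    # Only vertices of cand that are NOT neighbors of the pivot need a branch.
    for q in list(cand - adj[pivot]):
        ttt(adj, K | {q}, cand & adj[q], fini & adj[q], out)
        cand = cand - {q}             # q is done: remove it from cand ...
        fini = fini | {q}             # ... and remember it in fini

# Triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
out = []
ttt(adj, frozenset(), set(adj), set(), out)
print(sorted(sorted(c) for c in out))   # [[0, 1, 2], [2, 3]]
```

Note how each iteration of the loop depends on the `cand` and `fini` sets left behind by the previous iteration; this is the sequential dependency that a parallel version has to remove.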
Parallel Cost Model: For analyzing our shared-memory parallel algorithms, we use the CRCW PRAM model (BM10), which is a model of shared parallel computation that assumes concurrent reads and concurrent writes. Our parallel algorithm can also work in other related models of shared memory such as EREW PRAM (exclusive reads and exclusive writes), with a logarithmic factor increase in work as well as parallel depth. We measure the effectiveness of the parallel algorithm using the work-depth model (S17). Here, the “work” of a parallel algorithm is equal to the total number of operations of the parallel algorithm, and the “depth” (also called the “parallel time” or the “span”) is the longest chain of dependent computations in the algorithm. A parallel algorithm is said to be work-efficient if its total work is of the same order as the work due to the best sequential algorithm. (Note that work-efficiency in the CRCW PRAM model does not imply work-efficiency in the EREW PRAM model.) We aim for work-efficient algorithms with a low depth, ideally polylogarithmic in the size of the input. Using Brent’s theorem (BM10), it can be seen that a parallel algorithm on input size $n$ with a depth of $d$ can theoretically achieve $\Theta(p)$ speedup on $p$ processors as long as $p = O(n/d)$.
We next restate a result on concurrent hash tables (SS06) that we use in proving the work and depth bounds of our parallel algorithms.
Theorem 3.1 (Theorem 3.15 (SS06)).
There is an implementation of a hash table, which, given a hash function with expected uniform distribution, performs $n$ insert, delete and find operations in parallel using $O(n)$ work and $O(1)$ depth on average.
4. Parallel MCE Algorithms on a Static Graph
In this section, we present new shared-memory parallel algorithms for MCE. We first describe a parallel algorithm that parallelizes the sequential algorithm of (TTT06), together with an analysis of its theoretical properties. Then, we discuss bottlenecks in this algorithm that arise in practice, leading us to a second algorithm with better practical performance. The second algorithm uses the first as a subroutine: it creates appropriate subproblems that can be solved in parallel and hands off the enumeration task to the first algorithm.
4.1. Algorithm
Our first algorithm is a work-efficient parallelization of the sequential algorithm (TTT06). The two main components of the sequential algorithm (Algorithm 1) are (1) selection of the pivot element and (2) sequential backtracking for extending candidate cliques until all maximal cliques are explored. We discuss how to parallelize each of these steps.
Parallel Pivot Selection:
Within a single recursive call, the pivot element is computed in parallel using two steps, as described in Algorithm 2. In the first step, the size of the intersection $|\mathrm{cand} \cap \Gamma(u)|$ is computed in parallel for each vertex $u \in \mathrm{cand} \cup \mathrm{fini}$, where $\mathrm{cand}$ and $\mathrm{fini}$ are the candidate and excluded vertex sets of the current call and $\Gamma(u)$ is the neighborhood of $u$. In the second step, the vertex with the maximum intersection size is selected. The following lemma proves that the parallel pivot selection is work-efficient with logarithmic depth:
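Lemma 4.1.
The total work of the parallel pivot selection (Algorithm 2) is $O\left(\sum_{u \in \mathrm{cand} \cup \mathrm{fini}} \min\{|\mathrm{cand}|, |\Gamma(u)|\}\right)$, which is $O(n^2)$, and its depth is $O(\log n)$.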
Proof.
If sets and are stored as hashsets, then, for vertex , the size can be computed sequentially in time – the intersection of two sets and can be found by considering the smaller set among the two, say , and searching for its elements within the larger set, say . It is possible to parallelize the computation of by executing the search elements of in parallel, followed by counting the number of elements that lie in the intersection, which can also be done in parallel in a workefficient manner using depth using Theorem 3.1. Since computing the maximum of a set of numbers can be accomplished using work and depth , for vertex , can be computed using work and depth . Once (i.e. ) is computed for every vertex , can be obtained using additional work and depth . Hence, the total work of is . Since the size of , , and are bounded by , this is , but typically much smaller in practice. ∎
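The two steps of the parallel pivot selection can be sketched as follows. This is only an illustration of the structure, assuming integer vertex identifiers and an adjacency dictionary of sets; the name `par_pivot` is hypothetical. Python threads are used purely to show which computations are independent (they do not give true parallel speedup for this CPU-bound work), and the final maximum is computed sequentially here, whereas the analysis assumes a parallel reduction.

```python
from concurrent.futures import ThreadPoolExecutor

def par_pivot(adj, cand, fini, pool):
    """Two-step pivot selection: score every candidate independently, then take the max."""
    candidates = sorted(cand | fini)

    def score(u):
        # Scan the smaller of the two sets and look each element up in the larger one.
        small, large = sorted((cand, adj[u]), key=len)
        return sum(1 for x in small if x in large)

    # Step 1: the intersection sizes are independent of each other.
    scores = list(pool.map(score, candidates))
    # Step 2: select a vertex with the maximum intersection size.
    best = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best]

# Triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
with ThreadPoolExecutor(max_workers=4) as pool:
    pivot = par_pivot(adj, {0, 1, 2, 3}, set(), pool)
print(pivot)   # 2, the only vertex with three neighbors in cand
```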
Parallelization of Backtracking:
We first note that there is a sequential dependency among the different iterations within a recursive call of the sequential algorithm. In particular, the contents of the sets $\mathrm{cand}$ and $\mathrm{fini}$ in a given iteration are derived from the contents of $\mathrm{cand}$ and $\mathrm{fini}$ in the previous iteration. Such sequential dependence of updates to $\mathrm{cand}$ and $\mathrm{fini}$ prevents us from making the recursive calls for different vertices in parallel. To remove this dependency, we adopt a different view of the algorithm which enables us to make the recursive calls in parallel. The vertices to be considered for extending a maximal clique, i.e. the vertices over which the loop iterates, are arranged in a predefined total order. Then, we unroll the loop and explicitly compute the parameters $\mathrm{cand}_q$ and $\mathrm{fini}_q$ for the recursive calls.
Suppose $v_1, v_2, \ldots, v_k$ is the order of these vertices. Each vertex $v_i$, once added to $K$, should be removed from further consideration in $\mathrm{cand}$. To ensure this, in the parallel algorithm, we explicitly remove vertices $v_1, \ldots, v_{i-1}$ from $\mathrm{cand}$ and add them to $\mathrm{fini}$ before making the $i$th recursive call. As a consequence, the parameters of the $i$th iteration are computed independently of prior iterations.
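The following sketch contrasts the two views of the loop. Both functions return the list of parameter triples that would be passed to the recursive calls; the names `sequential_subproblems` and `unrolled_subproblems`, and the argument `ext` for the ordered list of vertices to branch on, are hypothetical names used only for this illustration.

```python
def sequential_subproblems(adj, K, cand, fini, ext):
    """Loop-carried version: iteration i needs the sets left by iteration i-1."""
    cand, fini = set(cand), set(fini)
    tasks = []
    for q in ext:
        tasks.append((K | {q}, cand & adj[q], fini & adj[q]))
        cand.discard(q)
        fini.add(q)
    return tasks

def unrolled_subproblems(adj, K, cand, fini, ext):
    """Unrolled version: iteration i is computed from the ORIGINAL cand and fini only."""
    tasks = []
    for i, q in enumerate(ext):
        earlier = set(ext[:i])                  # v_1, ..., v_{i-1}
        cand_q = (cand - earlier) & adj[q]      # drop earlier vertices from cand
        fini_q = (fini | earlier) & adj[q]      # and treat them as already explored
        tasks.append((K | {q}, cand_q, fini_q))
    return tasks

# 4-cycle 0-1-2-3 with chord 0-2; branch on all vertices in the order 0, 1, 2, 3.
adj = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1, 3}, 3: {0, 2}}
K, cand, fini, ext = frozenset(), {0, 1, 2, 3}, set(), [0, 1, 2, 3]
same = (sequential_subproblems(adj, K, cand, fini, ext)
        == unrolled_subproblems(adj, K, cand, fini, ext))
print(same)   # True
```

Because no iteration of the unrolled version reads state written by another iteration, its iterations could be handed to separate tasks; the price is the extra set arithmetic per iteration, which is the additional work accounted for in the analysis that follows.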
We prove the work efficiency and low depth of the parallel algorithm in the following lemma:
Lemma 4.2.
The total work of the parallel algorithm (Algorithm 3) is $O(3^{n/3})$ and its depth is $O(M \log n)$, where $n$ is the number of vertices in the graph and $M$ is the size of a maximum clique in $G$.
Proof.
First, we analyze the total work. The computation in the parallel algorithm differs from that of the sequential algorithm at Line 8 and Line 9: at iteration $i$, we remove all vertices $v_1, \ldots, v_{i-1}$ from $\mathrm{cand}$ and add all these vertices to $\mathrm{fini}$, as opposed to removing a single vertex from $\mathrm{cand}$ and adding that vertex to $\mathrm{fini}$ (Line 8 and Line 9 of Algorithm 1). Therefore, the parallel algorithm requires additional work due to the independent computation of $\mathrm{cand}_q$ and $\mathrm{fini}_q$. The total work of a single call, excluding the call to the pivot selection, is $O(n^2)$. Adding the work of the pivot selection, which is $O(n^2)$, gives $O(n^2)$ total work for each single call excluding further recursive calls (Algorithm 3, Line 10), which is the same as in the original sequential algorithm (Section 4, (TTT06)). Hence, using Lemma 2 and Theorem 3 of (TTT06), we infer that the total work of the parallel algorithm is the same as that of the sequential algorithm and is bounded by $O(3^{n/3})$.
Next, we analyze the depth of the algorithm. The depth consists of the (sum of the) following components: (1) the depth of the pivot selection, (2) the depth of computing the set of vertices to branch on, and (3) the maximum depth of an iteration in the for loop from Line 5 to Line 10. According to Lemma 4.1, the depth of the pivot selection is $O(\log n)$. The depth of computing the set of vertices to branch on is $O(1)$, since it takes $O(1)$ time to check whether an element of $\mathrm{cand}$ is in the neighborhood of the pivot. Similarly, the depth of computing $\mathrm{cand}_q$ and $\mathrm{fini}_q$ at Line 8 and Line 9 is $O(1)$. What remains is the depth of the recursive call at Line 10. The recursion continues until there is no further vertex to add for expanding $K$, so its depth is at most the size $M$ of a maximum clique because, at each recursive call, the size of $K$ increases by $1$. Thus, the overall depth of the parallel algorithm is $O(M \log n)$. ∎
Corollary.
Using $p$ shared-memory parallel processors, the parallel algorithm (Algorithm 3) can achieve a worst-case parallel time of $O(3^{n/3}/p)$. This is work-optimal and work-efficient as long as $p = O\left(3^{n/3}/(M \log n)\right)$.
Proof.
The parallel time follows from Brent’s theorem (BM10), which states that the parallel time using $p$ processors is $O(w/p + d)$, where $w$ and $d$ are the work and the depth of the algorithm respectively. If the number of processors is $p = O\left(3^{n/3}/(M \log n)\right)$, then using Lemma 4.2, the parallel time is $O(3^{n/3}/p)$. The total work across all processors is $O(3^{n/3})$, which is worst-case optimal, since the size of the output can be as large as $3^{n/3}$ maximal cliques (Moon and Moser (MM65)). ∎
4.2. Algorithm
While our first algorithm is theoretically work-efficient, we note that its runtime can be further improved. Although its worst-case work complexity matches that of the sequential algorithm with pivoting, in practice its work can be higher, since the computation of $\mathrm{cand}_q$ and $\mathrm{fini}_q$ has additional and growing overhead as the sizes of $\mathrm{cand}$ and $\mathrm{fini}$ increase. This can result in a lower speedup than the theoretically expected one.
We set out to improve on this and derive a more efficient parallel implementation through a more selective use of parallel pivot selection. The cost of pivoting can be reduced by choosing many pivots in parallel, instead of a single pivot element at the beginning of the algorithm. We first note that the cost of pivot selection is highest during the iteration when the set $K$ (the clique in the search space) is empty. During this iteration, the number of vertices in $\mathrm{cand} \cup \mathrm{fini}$ can be as large as the number of vertices in the graph. To improve upon this, we can perform the first few steps of pivoting, when $K$ is empty, using a sequential algorithm. Once $K$ contains at least one element, the number of vertices in $\mathrm{cand} \cup \mathrm{fini}$ drops to no more than the size of the intersection of the neighborhoods of all vertices in $K$, which is typically much smaller than the number of vertices in the graph (it is at most the smallest degree of a vertex in $K$). Problem instances with $K = \{v\}$, each assigned to a single vertex $v$, can be seen as subproblems, and, on each of these subproblems, the overhead of pivot computation is much smaller since the number of vertices that have to be dealt with is also much smaller.
Based on this observation, we present a parallel algorithm which works as follows. The algorithm can be viewed as considering, for each vertex $v$, a subgraph $G_v$ that is induced by the vertex $v$ and its neighborhood $\Gamma(v)$. It enumerates all maximal cliques from each subgraph $G_v$ in parallel using our first algorithm. While processing subproblem $G_v$, it is important not to enumerate maximal cliques that are being enumerated elsewhere, in other subproblems. To handle this, the algorithm considers a specific ordering of all vertices in $V(G)$ such that $v$ is the least ranked vertex in each maximal clique enumerated from $G_v$. The subgraphs $G_v$ are handled in parallel and need not be processed in any particular order. However, the ordering allows us to populate the $\mathrm{cand}$ and $\mathrm{fini}$ sets such that each maximal clique is enumerated in exactly one subproblem. The order in which the vertices are considered is defined by a “rank” function rank, which indicates the position of a vertex in the total order. This global ordering on vertices has an impact on the total work of the algorithm, as well as on the load balance across subproblems.
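A minimal sketch of the per-vertex decomposition is given below. It assumes an adjacency dictionary of sets and a dictionary `rank` giving each vertex a distinct, comparable rank; the names `expand` and `per_vertex_mce` are hypothetical. The subproblems are processed in a plain sequential loop here, whereas in the algorithm described above each subproblem would be an independent parallel task that is itself solved by a parallel enumeration routine.

```python
def expand(adj, K, cand, fini, out):
    """Pivoting backtracking search (sequential stand-in for the parallel subroutine)."""
    if not cand and not fini:
        out.append(frozenset(K))
        return
    pivot = max(cand | fini, key=lambda u: len(cand & adj[u]))
    for q in list(cand - adj[pivot]):
        expand(adj, K | {q}, cand & adj[q], fini & adj[q], out)
        cand = cand - {q}
        fini = fini | {q}

def per_vertex_mce(adj, rank):
    """One subproblem per vertex v: report exactly the maximal cliques in which v has the lowest rank."""
    out = []
    for v in adj:                                   # iterations are independent tasks
        cand = {w for w in adj[v] if rank[w] > rank[v]}   # higher-ranked neighbors may extend {v}
        fini = {w for w in adj[v] if rank[w] < rank[v]}   # lower-ranked neighbors own their cliques
        expand(adj, frozenset({v}), cand, fini, out)
    return out

# Triangle 0-1-2 with a pendant edge 2-3, ranked by (degree, id).
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
rank = {v: (len(adj[v]), v) for v in adj}
cliques = per_vertex_mce(adj, rank)
print(sorted(sorted(c) for c in cliques))   # [[0, 1, 2], [2, 3]]
```

Placing lower-ranked neighbors in `fini` is what prevents a clique from being reported by more than one subproblem: a clique containing a lower-ranked neighbor of `v` is left to that neighbor's subproblem.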
Load Balancing: Notice that the sizes of the subproblems may vary widely for two reasons: (1) The subgraphs $G_v$ themselves may be of different sizes, depending on the vertex degrees. (2) The number of maximal cliques and the sizes of the maximal cliques containing $v$ can vary widely from one vertex to another. Clearly, the subproblems that deal with a large number of maximal cliques or maximal cliques of a large size are more computationally expensive than others.
In order to keep the sizes of the subproblems approximately balanced, we use an idea from PECO (SMT+15), where we choose the rank function on the vertices in such a way that for any two vertices $v$ and $w$, rank($v$) > rank($w$) if the complexity of enumerating maximal cliques from $G_v$ is higher than the complexity of enumerating maximal cliques from $G_w$. Indeed, by giving a higher rank to $v$ than to $w$, we are decreasing the complexity of the subproblem $G_v$, since the subproblem at $G_v$ need not enumerate maximal cliques that involve any vertex whose rank is less than that of $v$. Therefore, the higher the rank of vertex $v$, the lower is its “share” of the maximal cliques in $G_v$ that it belongs to. We use this idea for approximately balancing the workload across subproblems. Our algorithm differs from PECO as follows: (1) PECO is designed for distributed memory, so that the subgraphs and subproblems have to be explicitly copied across the network. (2) In our algorithm, the vertex-specific subproblem dealing with $G_v$ is itself handled through a parallel algorithm, while in PECO the subproblem for each vertex was handled through a sequential algorithm.
Note that it is computationally expensive to accurately count the number of maximal cliques within $G_v$, and, hence, it is not possible to compute the rank of each vertex exactly according to the complexity of handling $G_v$. Instead, we estimate the cost of handling $G_v$ using some easy-to-evaluate metrics on the subgraphs. In particular, we consider the following:

Degree Based Ranking: For vertex $v$, define rank($v$) $= (d(v), id(v))$, where $d(v)$ and $id(v)$ are the degree and the identifier of $v$, respectively. For two vertices $v$ and $w$, rank($v$) > rank($w$) if $d(v) > d(w)$, or $d(v) = d(w)$ and $id(v) > id(w)$; rank($v$) < rank($w$) otherwise.

Triangle Count Based Ranking: For vertex $v$, define rank($v$) $= (t(v), id(v))$, where $t(v)$ is the number of triangles (cycles of length three) that vertex $v$ is a part of. Note that ranking based on triangle count is more expensive to compute than degree based ranking but may yield a better estimate of the complexity of maximal cliques within $G_v$.

Degeneracy Based Ranking (ELS10): For a vertex $v$, define rank($v$) $= (degen(v), id(v))$, where $degen(v)$ is the degeneracy of vertex $v$. A vertex $v$ has degeneracy number $k$ when it belongs to a $k$-core but to no $(k+1)$-core, where a $k$-core is a maximal induced subgraph in which every vertex has degree at least $k$. A computational overhead of using this ranking is due to computing the degeneracy of the vertices, which takes $O(n + m)$ time, where $n$ is the number of vertices and $m$ is the number of edges.
We implement three variants of the algorithm, one for each of the degree, triangle count, and degeneracy based rankings.
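The three rankings can be computed as in the following sketch, which assumes integer vertex identifiers and an adjacency dictionary of sets. The function names are hypothetical. The degeneracy computation below repeatedly peels a minimum-degree vertex with a linear scan, which is simpler but slower than the linear-time bucket-based procedure assumed in the text.

```python
def degree_rank(adj):
    """rank(v) = (degree of v, id of v); Python tuples compare lexicographically."""
    return {v: (len(adj[v]), v) for v in adj}

def triangle_rank(adj):
    """rank(v) = (number of triangles containing v, id of v)."""
    # Summing common neighbors over all neighbors w of v counts each triangle at v twice.
    tri = {v: sum(len(adj[v] & adj[w]) for w in adj[v]) // 2 for v in adj}
    return {v: (tri[v], v) for v in adj}

def degeneracy_rank(adj):
    """rank(v) = (core number of v, id of v), by peeling minimum-degree vertices."""
    deg = {v: len(adj[v]) for v in adj}
    remaining = set(adj)
    core, k = {}, 0
    while remaining:
        v = min(remaining, key=lambda x: (deg[x], x))
        k = max(k, deg[v])            # core numbers never decrease during peeling
        core[v] = k
        remaining.remove(v)
        for w in adj[v]:
            if w in remaining:
                deg[w] -= 1
    return {v: (core[v], v) for v in adj}

# Triangle 0-1-2 with a pendant edge 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(degree_rank(adj)[2])       # (3, 2)
print(triangle_rank(adj)[3])     # (0, 3)
print(degeneracy_rank(adj)[3])   # (1, 3)
```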
5. Parallel MCE Algorithm on a Dynamic Graph
When the graph changes over time due to the addition of edges, the maximal cliques of the updated graph also change. The update in the set of maximal cliques consists of (1) the set of new maximal cliques, i.e. the maximal cliques that are newly formed, and (2) the set of subsumed cliques, i.e. maximal cliques of the original graph that are subsumed by the new maximal cliques. The combined set of new and subsumed maximal cliques is called the set of changes, and the size of this set is referred to as the size of change in the set of maximal cliques (see Figure 3).
Figure 3. (a) Input graph; (b) a new edge; (c) three new edges.
Note that, upon the addition of a single new edge, the size of change can be as small as a constant or as large as exponential in the size of the graph. For example, consider a graph on $n$ vertices which is missing a single edge from being a clique. The size of change is only $3$ when that missing edge is added to the graph, because there will be only one new maximal clique of size $n$ and two subsumed cliques, each of size $n-1$. On the other hand, consider a Moon-Moser graph (MM65) on $n$ vertices. Addition of a single edge to this graph makes the size of change of the order of $3^{n/3}$.
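The first example above can be checked on a small instance. The sketch below computes the maximal cliques before and after the edge insertion by brute force and takes set differences; the helper name `maximal_cliques` is hypothetical, and this exhaustive recomputation is only for illustration, not the incremental method developed in this section.

```python
from itertools import combinations

def maximal_cliques(n, edges):
    """All maximal cliques of a graph on vertices 0..n-1, by exhaustive search (tiny n only)."""
    adj = {v: set() for v in range(n)}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    cliques = [frozenset(s)
               for r in range(1, n + 1)
               for s in combinations(range(n), r)
               if all(b in adj[a] for a, b in combinations(s, 2))]
    return {c for c in cliques if not any(c < d for d in cliques)}

n = 4
# Complete graph on 4 vertices with the single edge (0, 1) missing.
before_edges = [e for e in combinations(range(n), 2) if e != (0, 1)]
before = maximal_cliques(n, before_edges)
after = maximal_cliques(n, before_edges + [(0, 1)])

new = after - before          # maximal in the updated graph only
subsumed = before - after     # maximal in the original graph only
print(sorted(map(sorted, new)))        # [[0, 1, 2, 3]]
print(sorted(map(sorted, subsumed)))   # [[0, 2, 3], [1, 2, 3]]
print(len(new) + len(subsumed))        # 3
```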
In a previous work, we presented a sequential algorithm (DST16) which efficiently tackles the problem of updating the set of maximal cliques of a dynamic graph in an incremental model, as new edges are added. It consists of one procedure for computing new maximal cliques and another for computing subsumed cliques. However, the sequential algorithm is still unable to update the set of maximal cliques quickly when the size of change is large. For instance, it takes on the order of hours to update the set of maximal cliques when an initial sequence of edges of the graph CaCitHepTh is added incrementally (with the original graph density 0.01). The high computational cost of the sequential algorithm calls for parallel methods.
In this section, we present parallel algorithms for enumerating the sets of new and subsumed cliques when a set of edges is added to a graph. Our parallel algorithms are based on the sequential algorithm of (DST16). In this work, we focus on (1) processing new edges in parallel, (2) enumerating new maximal cliques with a parallel enumeration procedure, and (3) parallelizing the subsumed-clique procedure of (DST16). First, we describe an efficient parallel algorithm for generating new maximal cliques, i.e. the maximal cliques of the updated graph that are not present in the original graph, and then an efficient parallel algorithm for generating subsumed maximal cliques, i.e. the cliques that are maximal in the original graph but not maximal in the updated graph. Together they form a shared-memory parallel algorithm for the incremental maintenance of maximal cliques, consisting of (1) an algorithm for enumerating new maximal cliques and (2) an algorithm for enumerating subsumed cliques. Table 2 briefly describes the algorithms. Figure 4 sketches the parallel shared-memory setup for enumerating maximal cliques in a dynamic graph upon the addition of new batches.
Table 2. Overview of Parallel Algorithms, listed by objective.
5.1. Parallel Enumeration of New Maximal Cliques
Here, we present a parallel algorithm for enumerating the set of new maximal cliques when a set of edges is added to the graph. The idea is to iterate over the new edges in parallel and, at each (parallel) iteration, construct a subgraph of the original graph and enumerate all maximal cliques within that subgraph. We present an efficient parallel algorithm (Algorithm 6) for this enumeration. The overall procedure is described in Algorithm 5.
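The per-edge structure described above can be sketched as follows. This is a simplified illustration under several assumptions, not Algorithm 5 or Algorithm 6: the subgraph for an edge (u, v) is taken to be the common neighbourhood of u and v in the graph after the batch is added; duplicates are avoided by filtering on the position of the edge in the batch rather than by pruning inside the search; and `new_edges` is assumed to contain no repeated edge. All function names are hypothetical. Python threads are used only to show the task decomposition and will not give real speedup for this CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def maximal_cliques(adj, vertices):
    """Plain Bron-Kerbosch on the subgraph induced by `vertices`."""
    out = []

    def expand(r, p, x):
        if not p and not x:
            out.append(frozenset(r))
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(frozenset(), set(vertices), set())
    return out


def new_maximal_cliques(adj, new_edges, workers=4):
    """adj: adjacency sets of the UPDATED graph; new_edges: list of (u, v)."""

    def task(i):
        u, v = new_edges[i]
        common = adj[u] & adj[v]  # vertices adjacent to both endpoints
        found = []
        for c in maximal_cliques(adj, common):
            clique = c | {u, v}
            # Report the clique only from the earliest new edge it contains,
            # so that each new maximal clique is produced exactly once.
            if not any(a in clique and b in clique for a, b in new_edges[:i]):
                found.append(clique)
        return found

    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(task, range(len(new_edges))))
    return {c for part in parts for c in part}
```

Every maximal clique of the updated graph that contains a new edge is new, and every new maximal clique contains at least one new edge, which is why iterating over new edges suffices in this sketch.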
Note that this algorithm builds on an existing sequential algorithm (DST16) for enumerating new maximal cliques (Algorithm 8) (DST16), which lists all new maximal cliques without any duplication. The parallel algorithm avoids duplication in maximal clique enumeration with a similar technique and follows a similar parallelization approach. More specifically, it follows a global ordering of the new edges to avoid redundancy in the enumeration process. The correctness of the parallel algorithm follows from the correctness of the sequential algorithm. The following lemma states its work efficiency and depth.
Lemma.
Given a graph and a new edge set, the parallel algorithm is work-efficient, i.e., its total work is of the same order as the time complexity of the sequential algorithm. Its depth is bounded in terms of the maximum degree and the size of a maximum clique in the graph.
We prove this lemma in Appendix A by showing the equivalence of the operations of the parallel and sequential algorithms.
5.2. Parallel Enumeration of Subsumed Cliques
In this section, we present a parallel algorithm, based on the sequential algorithm of (DST16), for enumerating subsumed cliques. We parallelize in the following order: (1) removing a single new edge from all the candidates in parallel and (2) checking the candidacy of the subsumed cliques in parallel. We present the algorithm in Algorithm 7. The following lemma states its work efficiency and depth:
Lemma.
Given a graph and a new edge set, the parallel algorithm is work-efficient: its total work is of the same order as the time complexity of the sequential algorithm. The depth for processing each new maximal clique is bounded in terms of the size of a maximum clique in the graph and the number of new edges.
Proof.
First, note that the parallel procedure is exactly the same as the sequential procedure, except that the loops at Line 6 and Line 13 are parallel in the parallel algorithm, whereas they are sequential in the sequential algorithm. Since all the computations in the parallel algorithm are exactly the same as those in the sequential algorithm, except for the loop parallelization, the parallel algorithm is work-efficient.
For proving the parallel depth of , first note that all the elements of at Line 6 of are processed in parallel, and the total cost of executing Lines 7 to 11 is . In addition, the depth of the operation at Line 12 of is using the concurrent hashtable. Therefore, the overall depth of the procedures from Line 4 to Line 12 is the number of new edges in a new maximal clique , processed at Line 2 of which is . Next, the depth of Lines 14 to 16 is because it only takes to execute Lines 15 and 16 using Theorem 3.1. As a result, for each new maximal clique, the depth of the procedure for enumerating all subsumed cliques within the new maximal clique is . ∎
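The two parallel steps of Section 5.2 can be sketched as follows. This is an illustration reconstructed from the description in the text, not Algorithm 7 itself: for each new maximal clique, every new edge inside it is "removed" from all current candidates by dropping one endpoint or the other, and the surviving candidates are then checked against the set of maximal cliques of the original graph. The function name, the use of a Python set in place of a concurrent hash table, and the thread pool are assumptions of this example.

```python
from concurrent.futures import ThreadPoolExecutor


def subsumed_cliques(old_maximal, new_cliques, new_edges, workers=4):
    """old_maximal: set of frozensets, the maximal cliques of the original graph.
    new_cliques: iterable of frozensets, the new maximal cliques.
    new_edges: list of (u, v) pairs added to the graph.
    """
    result = set()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for c in new_cliques:
            candidates = {c}
            for u, v in new_edges:
                if u in c and v in c:
                    # Step (1): remove edge (u, v) from all candidates in parallel
                    # by dropping either endpoint.
                    def split(s, u=u, v=v):
                        if u in s and v in s:
                            return (s - {u}, s - {v})
                        return (s,)

                    pieces = pool.map(split, list(candidates))
                    candidates = {t for group in pieces for t in group}
            # Step (2): check all candidates in parallel; a candidate is
            # subsumed if it was a maximal clique of the original graph.
            cand_list = list(candidates)
            hits = pool.map(lambda s: s in old_maximal, cand_list)
            result.update(s for s, hit in zip(cand_list, hits) if hit)
    return result
```

Collecting the hits into a set removes a candidate that is reached from more than one new maximal clique, which stands in for the hash-table bookkeeping mentioned in the proof.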
5.3. Decremental Case
We have so far discussed the incremental maintenance of new and subsumed maximal cliques when the stream contains only new edges. Our algorithm can also support the deletion of edges through a reduction to the incremental case, similar to the methods used for sequential algorithms. See Sections 4.4 and 4.5 of (DST16) for further details.
6. Evaluation
In this section, we experimentally evaluate the performance of our shared-memory parallel static and dynamic algorithms for MCE on static and dynamic (real-world and synthetic) graphs, to show their parallel speedup and scalability over the corresponding efficient sequential algorithms. We also compare our algorithms with state-of-the-art parallel algorithms for MCE to show that they substantially improve on prior work. We run the experiments on a computer with an Intel Xeon(R) E5-4620 CPU running at 2.20GHz, organized as 4 NUMA nodes, with terabyte-scale RAM.
6.1. Datasets
We use eight different real-world static and dynamic networks from the publicly available repositories KONECT (konnect13), SNAP (JA14), and Network Repository (RN15) for our experiments. Dataset statistics are summarized in Table 3. For our experiments, we convert these networks to simple undirected graphs by removing self-loops, edge weights, parallel edges, and edge directions.
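The conversion to a simple undirected graph can be sketched as below. This is a minimal illustration, assuming each input record is a tuple whose first two fields are the endpoints and whose remaining fields (weight, timestamp, and so on) can be ignored; the function name and record format are hypothetical, and vertex ids are assumed to be mutually comparable.

```python
def to_simple_undirected(records):
    """Return a sorted list of undirected edges (u, v) with u < v.

    Drops self-loops, extra fields such as weights, parallel edges,
    and edge directions.
    """
    edges = set()
    for rec in records:
        u, v = rec[0], rec[1]
        if u == v:
            continue  # self-loop
        # store each edge with its smaller endpoint first, so (u, v) and
        # (v, u) collapse to one entry; the set removes parallel edges
        edges.add((u, v) if u < v else (v, u))
    return sorted(edges)
```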
We consider networks DBLPCoauthor, AsSkitter, Wikipedia, WikiTalk, and Orkut for the evaluation of the algorithms on static graphs and networks DBLPCoauthor, Flickr, Wikipedia, LiveJournal, and CaCitHepTh for the evaluation of the algorithms on dynamic graphs.
DBLPCoauthor shows the collaboration of authors of papers from the DBLP computer science bibliography. In this graph, vertices represent authors, and there is an edge between two authors if they have published a paper together (RN15). AsSkitter is an Internet topology graph that represents autonomous systems connected to each other on the Internet (konnect13). Wikipedia is a network of the English Wikipedia in 2013, where vertices represent pages and there is an edge between two pages if one page contains a hyperlink to the other (JA14). WikiTalk contains users of Wikipedia as vertices, where an edge between two users indicates that one of the users has edited the talk page of the other user on Wikipedia (JA14). Orkut is a social network where each vertex represents a user and there is an edge if there is a friendship relation between two users (konnect13). Similar to Orkut, Flickr and LiveJournal are also social networks where a vertex represents a user and an edge represents the friendship between two users. CaCitHepTh is a citation network of high energy physics theory where a vertex represents a paper and there is an edge between two papers if one cites the other.
For the evaluation on static graphs, we give the entire graph as input to the algorithm in the form of an edge list. For the evaluation on dynamic graphs, we start with an empty graph that contains all vertices but no edges and add one set of edges at a time, in increasing order of timestamps, computing the changes in the set of maximal cliques after each addition. For the dynamic algorithm, we use real dynamic graphs, where each edge has a timestamp. In addition, we evaluate it on LiveJournal, which is a static graph. We convert this graph to a dynamic graph by randomly permuting the edges and processing them in that order.
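The two ways of producing an edge stream described above can be sketched as follows. This is an illustrative sketch only: the function names, the (u, v, timestamp) record layout, and the fixed random seed are assumptions of this example, not details given in the paper.

```python
import random


def timestamp_batches(timed_edges, batch_size):
    """Yield batches of (u, v) in increasing order of timestamp.

    timed_edges: iterable of (u, v, timestamp) records.
    """
    ordered = sorted(timed_edges, key=lambda e: e[2])
    for i in range(0, len(ordered), batch_size):
        yield [(u, v) for u, v, _ in ordered[i:i + batch_size]]


def shuffled_batches(edges, batch_size, seed=0):
    """Turn a static edge list into a stream by randomly permuting it."""
    edges = list(edges)
    random.Random(seed).shuffle(edges)  # seeded for reproducibility
    for i in range(0, len(edges), batch_size):
        yield edges[i:i + batch_size]
```

Each yielded batch would then be passed to the incremental algorithm as the set of new edges for one update step.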
Figure 5. Frequency distribution of the sizes of maximal cliques. Panels: (a) DBLPCoauthor, (b) AsSkitter, (c) Wikipedia, (d) WikiTalk, (e) Orkut.
To understand the datasets better, we illustrate the frequency distribution of the sizes of maximal cliques in Figure 5. A maximal clique in DBLPCoauthor can have as many as 119 vertices. Although the number of such large maximal cliques is small, the search space for discovering them can grow exponentially. As shown in Figure 5, most of the graphs contain tens of millions of maximal cliques or more. For example, Orkut contains more than two billion maximal cliques. The depth and breadth of the search space thus make MCE a challenging problem to solve.
Table 3. Dataset statistics.
Dataset | #Vertices | #Edges | #Maximal Cliques | Avg. Size of Maximal Cliques | Size of Largest Maximal Clique
DBLPCoauthor | 1,282,468 | 5,179,996 | 1,219,320 | 3 | 119
Orkut | 3,072,441 | 117,184,899 | 2,270,456,447 | 20 | 51
AsSkitter | 1,696,415 | 11,095,298 | 37,322,355 | 19 | 67
WikiTalk | 2,394,385 | 4,659,565 | 86,333,306 | 13 | 26
Wikipedia | 1,870,709 | 36,532,531 | 131,652,971 | 6 | 31
LiveJournal | 4,033,137 | 27,933,062 | 38,413,665 | 29 | 214
Flickr | 2,302,925 | 22,838,276 | > 400 billion | - | -
CaCitHepTh | 22,908 | 2,444,798 | > 400 billion | - | -
6.2. Implementation of the Algorithms
In our parallel implementations, we use parallel_for and parallel_for_each from the Intel TBB parallel library (RL99). We also use concurrent_hash_map for atomic operations on the hash table. We implement the algorithms in the C++11 standard and compile all the sources using the Intel ICC compiler version 18.0.3 with optimization level ‘-O3’. We use the command ‘numactl -i all’ to balance memory on a NUMA machine. System-level load balancing is performed by the dynamic work-stealing scheduler (RL99) in TBB.
To compare with prior work on MCE, we implement some of the algorithms (TTT06; ELS10; DW+09; SMT+15; ZA+05) in C++, and for the rest ((SAS18), (LP+17), and (WC+17)) we use the executables of the C++ implementations provided by the respective authors. See Subsection 6.4 for more details.
We compute the degeneracy number and triangle count of each vertex using sequential procedures. While the computation of per-vertex triangle counts and the degeneracy ordering could potentially be parallelized, implementing a parallel method to rank vertices by degeneracy number or triangle count is itself a non-trivial task. We decided not to parallelize these routines since the degeneracy- and triangle-based orderings did not yield significant benefits compared with the degree-based ordering, which is trivially available without any additional computation.
We assume that the entire graph is available in a shared global memory. The Total Runtime (TR) of the parallel algorithm consists of (1) Ranking Time (RT): the time required to rank the vertices of the graph by the ranking metric used in the algorithm, i.e. degree, degeneracy number, or triangle count, and (2) Enumeration Time (ET): the time required to enumerate all maximal cliques. For the degeneracy- and triangle-based variants, the runtime of ranking is also reported. Figures 6 and 7 show the parallel speedup (with respect to the runtime of the sequential algorithm) and the enumeration time under different vertex ordering strategies. Table 5 shows the breakdown of the Total Runtime (TR) into Ranking Time (RT) and Enumeration Time (ET).
The runtime of the dynamic algorithm consists of (1) the computation time for new maximal cliques and (2) the computation time for subsumed cliques. In the implementation, instead of running the enumeration on the entire (sub)graph for an edge, as in Line 9 of the algorithm, we follow the design of our static algorithm and run it on per-vertex subproblems in parallel. To balance load, we use a degree-based ordering of the vertices when creating the subproblems. We chose this implementation because we observed a significant improvement in performance when we used a degree-based vertex ordering. For the experiments on dynamic graphs, we use one batch size for all graphs except the extremely dense graph CaCitHepTh (with original graph density 0.01), for which we use a different batch size.
6.3. Discussion of the Results
Here, we empirically evaluate our parallel algorithms and prior methods. First, we show the parallel speedup (with respect to the sequential algorithms) and scalability (with respect to the number of cores) of our algorithms. Next, we compare our work with the state-of-the-art sequential and parallel algorithms for MCE. The results demonstrate that our solutions are faster than prior methods by significant margins.
6.3.1. Parallel MCE on Static Graphs
The total runtimes of the parallel algorithms with 32 threads are shown in Table 4. We observe that achieves a speedup of 5x to 14x over the sequential algorithm . The three versions of , i.e. , , , achieve a speedup of 15x to 21x with 32 threads, when we consider the runtime of the enumeration task alone. The speedup decreases in and if we include the time taken by the ranking strategies (see Figure 6).
The runtime of is higher than the runtime of , due to a heavier cumulative overhead of pivot computation and of processing sets and in . For example, in DBLPCoauthor graph, when we run , the cumulative overhead of computing is 248 seconds, and the cumulative overhead of updating and is 38 seconds while, in , these runtimes are 156 and 21 seconds, respectively. This helps accelerate the entire process, which achieves 2x speedup over .
Dataset  
DBLPCoauthor  42  4  2  3  3 
Orkut  28923  3472  1676  2350  1959 
AsSkitter  660  68  39  43  48 
WikiTalk  961  109  52  78  58 
Wikipedia  2646  160  123  155  179 
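Figure 6. Parallel speedup as a function of the number of threads. Panels: (a) DBLPCoauthor, (b) AsSkitter, (c) Wikipedia, (d) WikiTalk, (e) Orkut.
Figure 7. Runtime as a function of the number of threads. Panels: (a) DBLPCoauthor, (b) AsSkitter, (c) Wikipedia, (d) WikiTalk, (e) Orkut.
Figure (dynamic graphs). Panels: (a) DBLPCoauthor, (b) Flickr, (c) LiveJournal, (d) CaCitHepTh, (e) Wikipedia.
Figure (dynamic graphs). Panels: (a) DBLPCoauthor, (b) Flickr, (c) LiveJournal, (d) CaCitHepTh, (e) Wikipedia.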
Impact of vertex ordering on overall performance. Here, we study the impact of different vertex ordering strategies, i.e. degree, degeneracy, and triangle count, on the overall performance of the algorithm. Table 5 presents the total computation time under the different orderings. We observe that the degree-based ordering is in most cases the fastest strategy for clique enumeration, even when we do not count the computation time for generating the degeneracy- and triangle-based orderings. If we add the runtime of the ordering step, the degree-based ordering is clearly better than the degeneracy- or triangle-based orderings, since it is available for free when the input graph is read, while the other two require additional computation.
Scaling up with the degree of parallelism.
As the number of threads (and the degree of parallelism) increases, the runtimes of our parallel algorithms decrease. Figure 6 presents the speedup factor over the state-of-the-art sequential algorithm as a function of the number of threads, and Figure 7 presents the runtime. Our algorithm achieves a speedup of more than 15x on all graphs when 32 threads are used.
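Table 5. Total Runtime (TR) broken down into Ranking Time (RT) and Enumeration Time (ET) under the degree-, degeneracy-, and triangle-based orderings.
Dataset | Degree: TR | Degeneracy: RT | Degeneracy: ET | Degeneracy: TR | Triangle: RT | Triangle: ET | Triangle: TR
DBLPCoauthor | 3 | 25 | 3 | 28 | 42 | 3 | 45
Orkut | 1676 | 928 | 2350 | 3278 | 2166 | 1959 | 4125
AsSkitter | 39 | 41 | 43 | 84 | 122 | 48 | 170
WikiTalk |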