Clique percolation method: memory efficient almost exact communities

10/04/2021 ∙ by Alexis Baudin, et al. ∙ Laboratoire d'Informatique de Paris 6 ∙ Université de Bourgogne

Automatic detection of relevant groups of nodes in large real-world graphs, i.e. community detection, has applications in many fields and has received a lot of attention in the last twenty years. The most popular method designed to find overlapping communities (where a node can belong to several communities) is perhaps the clique percolation method (CPM). This method formalizes the notion of community as a maximal union of k-cliques that can be reached from each other through a series of adjacent k-cliques, where two cliques are adjacent if and only if they overlap on k-1 nodes. Despite much effort CPM has not been scalable to large graphs for medium values of k. Recent work has shown that it is possible to efficiently list all k-cliques in very large real-world graphs for medium values of k. We build on top of this work and scale up CPM. In cases where this first algorithm faces memory limitations, we propose another algorithm, CPMZ, that provides a solution close to the exact one, using more time but less memory.

1 Introduction

The problem of detecting communities in real networks has received a lot of attention in recent years. Many definitions of communities have been proposed, corresponding to different requirements on the type of communities that are to be detected and/or on some properties of the studied graph: some definitions depend on global or local properties of the graph, nodes can belong to several communities or to a single one, links may have weights, communities may have a hierarchical structure, etc. In practice, algorithms also have to be designed to extract the communities from large graphs. Many definitions of communities, together with their corresponding algorithms, already exist [5]. Most real networks are characterized by communities that may overlap, i.e. a node may belong to several communities. In the context of social networks for instance, each person belongs to several communities such as colleagues, family, leisure activities, etc.

One of the most popular methods designed to find overlapping communities is the clique percolation method (cpm), which produces $k$-clique communities [14].

Definition 1 ($k$-clique)

A $k$-clique is a fully connected set of $k$ nodes, i.e. every pair of its vertices is connected by a link in the graph.

For example, in Figure 2, the set is a -clique.

Definition 2 ($k$-cliques adjacency)

Two $k$-cliques are said to be adjacent if and only if they share $k-1$ nodes.

For example, in Figure 2, the two -cliques and are adjacent.

Definition 3 (cpm community)

A $k$-clique community (or cpm community) is the set of the vertices belonging to a maximal set of $k$-cliques that can be reached from each other through a series of adjacent $k$-cliques.

Though the correspondence between the obtained communities and real-world ones is hard to characterize in a general manner, the advantages of this definition are well known [8]: it is formally well defined, totally deterministic, does not use any heuristics or function optimizations that are hard to interpret, allows communities to overlap, and each community is defined locally. (Footnote 1: notice that if a node does not belong to at least one $k$-clique, it does not belong to any community.)
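Definitions 1 to 3 can be illustrated directly on a toy graph. The sketch below is a brute-force transcription of the definitions, not the algorithm proposed in this paper: it enumerates every $k$-subset of nodes, so it is only usable on tiny inputs. The function names and the example edge list are hypothetical.

```python
from itertools import combinations


def k_cliques(adj, k):
    """Brute-force listing of all k-cliques (tiny graphs only)."""
    nodes = sorted(adj)
    return [c for c in combinations(nodes, k)
            if all(v in adj[u] for u, v in combinations(c, 2))]


def cpm_by_definition(edges, k):
    """Group k-cliques that are reachable through adjacent k-cliques."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    cliques = k_cliques(adj, k)
    seen, communities = set(), []
    for start in cliques:
        if start in seen:
            continue
        seen.add(start)
        stack, members = [start], set()
        while stack:
            c = stack.pop()
            members.update(c)
            for d in cliques:
                # Two k-cliques are adjacent iff they share k-1 nodes.
                if d not in seen and len(set(c) & set(d)) == k - 1:
                    seen.add(d)
                    stack.append(d)
        communities.append(members)
    return communities


# Hypothetical toy graph: two triangles sharing an edge, plus a third
# triangle attached through node 4 only.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
communities = cpm_by_definition(edges, 3)
```

On this toy graph node 4 belongs to both communities, which illustrates the overlap allowed by the definition.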

Despite much effort, cpm has not been scalable to large graphs for medium values of $k$, i.e., values between 5 and 10. We therefore seek in this work to extend the computation of cpm to larger graphs. A bottleneck for most previous contributions is the memory required. Indeed, exact methods need to store in memory either all $k$-cliques, or all maximal cliques (cliques which are not included in any other clique), which is prohibitive in many cases.

Our contribution is twofold:

  1. we improve on the state of the art concerning the computation of cpm communities, by leveraging an existing algorithm able to list $k$-cliques in a very efficient manner;

  2. in cases where this first algorithm faces memory limitations, we propose another algorithm that provides a solution close to the exact one, using more time but less memory.

We will show that these algorithms make it possible to compute exact solutions in cases where this was not possible before, and to compute a close result of good quality in cases where the graph is so large that our exact method does not work due to memory limitations.

The rest of the paper is organized as follows. In Section 2, we present the related work. In Section 3, we present our exact and relaxed algorithms. We discuss time and memory requirements in Section 4. We then evaluate the performance of our exact algorithm against the state of the art in Section 5, and we compare the results and performances of our exact and relaxed algorithms. We conclude in Section 6, and present some perspectives for future work.

2 Related Work

There are many algorithms for computing overlapping communities, as shown in a dedicated survey [17]. The focus of our paper is on the computation of the $k$-clique communities. Existing algorithms to compute $k$-clique communities in a graph can be split into two categories:

  1. algorithms that compute all maximal cliques of size $k$ or more and use them to compute all $k$-clique communities. Indeed, two maximal cliques that overlap on $k-1$ nodes or more belong to the same $k$-clique community. Most state-of-the-art approaches [8, 14, 15] belong to this category;

  2. Kumpula et al. [9] compute all $k$-cliques and then compute $k$-clique communities from them strictly following the definition of a $k$-clique community, i.e. detect which $k$-cliques are adjacent.

Algorithms of the first category differ in the method used to find which maximal cliques are adjacent. However, the first step which consists in computing all maximal cliques is always the same and is done sequentially. While this problem is NP-hard, there exist algorithms scalable to relatively large sparse real-world graphs, based on the Bron-Kerbosch algorithm [1, 3, 4].

Any large clique, with more than $k$ vertices, will be included in a single $k$-clique community, and there is no need to list all $k$-cliques of this large clique. This is the main reason why there are more methods following the approach of listing maximal cliques, category (1), rather than listing $k$-cliques, category (2). However, it has been found that most real-world graphs actually do not contain very large cliques and that listing $k$-cliques for small and medium values of $k$ is a scalable problem in practice [2, 12]; in many cases it is more tractable than listing all maximal cliques. This makes algorithms in category (2) more interesting for practical scenarios.

The algorithm of [9] proposes a method to list all $k$-cliques, then merges the found $k$-cliques into $k$-clique communities using a Union-Find [7], a very efficient data structure [6, 16] which we describe briefly in Section 3.1. In the context of [9] the Union-Find contains all $(k-1)$-cliques (as elements) and each $k$-clique triggers the union of the subsets that contain at least one of its $(k-1)$-cliques.

Our first contribution builds on the same idea. We first propose to use an efficient algorithm for listing $k$-cliques [2], which improves the overall performance. Going further, in order to provide an approximation of the community structure for graphs for which it is not possible to obtain the exact result due to memory limits, we propose to perform unions of sets of $z$-cliques, for a smaller clique size $z$, instead of $(k-1)$-cliques. This construction is discussed in detail in the next section.

3 Algorithm

A graph $G = (V, E)$ consists of its vertex set $V$ and its edge set $E$. In the following $c_k$ will always denote a $k$-clique; from the context it will be clear which one exactly.

3.1 Union-Find structure

The algorithms we will present rely on the Union-Find structure, also known in the literature as a disjoint-set data structure. It stores a collection of disjoint sets, allowing very efficient union operations between them. The structure is a forest, whose trees correspond to disjoint subsets, and nodes correspond to the elements. The operations on the nodes are the following:

  • Find($p$): returns the root of the tree containing a Union-Find node $p$;

  • Union($r_1, \dots, r_l$): performs the union of the trees represented by their roots $r_1, \dots, r_l$ by making one root the parent of all others;

  • MakeSet(): creates a new tree with one node $p$, corresponding to a new empty set, and returns $p$.
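The three operations above can be sketched as follows. This is a minimal illustration under simplifying assumptions, not the authors' C implementation: nodes are integer identifiers, Find uses path compression, and Union attaches all roots to the first one without any rank or size heuristic. The class and method names are hypothetical.

```python
class UnionFind:
    """Minimal disjoint-set forest with Find, Union and MakeSet."""

    def __init__(self):
        self.parent = []  # parent[p] == p means that p is a root

    def make_set(self):
        """Create a new single-node tree and return its identifier."""
        p = len(self.parent)
        self.parent.append(p)
        return p

    def find(self, p):
        """Return the root of the tree containing node p."""
        root = p
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: attach the visited nodes directly to the root.
        while self.parent[p] != root:
            self.parent[p], p = root, self.parent[p]
        return root

    def union(self, roots):
        """Merge the trees given by their roots; return the new root."""
        roots = list(roots)
        head = roots[0]
        for r in roots[1:]:
            if r != head:
                self.parent[r] = head
        return head


uf = UnionFind()
a, b, c = uf.make_set(), uf.make_set(), uf.make_set()
merged_root = uf.union([uf.find(a), uf.find(b)])
```

After this usage example, `a` and `b` are in the same set while `c` remains alone in its own tree.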

3.2 Exact cpm algorithm

First we build on the idea introduced in [9]. A cpm community is represented as the set of all the $(k-1)$-cliques it contains. These communities are represented by a Union-Find structure whose nodes are $(k-1)$-cliques. The algorithm then iterates over all $k$-cliques and tests if the current $k$-clique belongs to a community, by testing whether it has a $(k-1)$-clique in common with it.

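UF $\leftarrow$ Empty Union-Find data structure
Dict $\leftarrow$ Empty Dictionary
for each $k$-clique $c_k \in G$ do
     $S \leftarrow \emptyset$ ▷ communities of $c_k$ to merge
     for each $(k-1)$-clique $c_{k-1} \subset c_k$ do
         if $c_{k-1} \in$ Dict then
              $p \leftarrow$ Dict[$c_{k-1}$]
         else
              $p \leftarrow$ UF.MakeSet()
              Dict[$c_{k-1}$] $\leftarrow p$
         $S$.add(UF.Find($p$))
     UF.Union($S$)
Algorithm 1 Exact cpm algorithm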

Algorithm 1 considers all $k$-cliques one by one. For every $k$-clique $c_k$ it iterates over its $(k-1)$-cliques $c_{k-1}$. For every $c_{k-1}$, it identifies the set $p$ to which it belongs in the Union-Find. Then, it performs the union of all sets $p$. Several algorithms exist for efficiently listing $k$-cliques [12]. We substitute one of the best [2] for the one proposed by the authors of [9].
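The exact procedure can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' C implementation: $k$-cliques are listed by brute force over all $k$-subsets of nodes instead of the efficient listing algorithm of [2], the Union-Find is a plain parent array without rank heuristics, and the final loop performs the post-processing that turns sets of $(k-1)$-cliques into node sets. The function name and the toy edge list are hypothetical.

```python
from itertools import combinations


def exact_cpm(edges, k):
    """Exact k-clique communities via a Union-Find over (k-1)-cliques."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    parent = []  # Union-Find forest; parent[p] == p marks a root

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    node_of = {}  # dictionary: (k-1)-clique -> Union-Find node
    for ck in combinations(sorted(adj), k):
        # Brute-force clique test (assumption: tiny graphs only).
        if not all(v in adj[u] for u, v in combinations(ck, 2)):
            continue
        roots = set()
        for sub in combinations(ck, k - 1):
            if sub not in node_of:
                node_of[sub] = len(parent)
                parent.append(len(parent))
            roots.add(find(node_of[sub]))
        head = min(roots)
        for r in roots:  # union of all sets touched by this k-clique
            parent[r] = head

    # Post-processing: collect the nodes of each Union-Find tree.
    communities = {}
    for sub, p in node_of.items():
        communities.setdefault(find(p), set()).update(sub)
    return sorted(sorted(c) for c in communities.values())


# Hypothetical toy graph with two overlapping 3-clique communities.
edges = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4), (4, 5), (4, 6), (5, 6)]
result = exact_cpm(edges, 3)
```

The dictionary `node_of` holds one entry per $(k-1)$-clique, which is exactly the memory cost that motivates the relaxed algorithm.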

As the number of $(k-1)$-cliques can be very large, this approach is problematic as in some cases it is not possible to store them all in memory. This leads us to a new algorithm which requires less memory but in rare cases incorrectly merges some cpm communities together.

3.3 Memory efficient cpm approximation

For relatively small values of $z$, there are far fewer $z$-cliques than $k$-cliques in real-world graphs. To get an intuition for this, consider the case of a large clique of size $n$. It contains $\binom{n}{z}$ $z$-cliques and this number increases with $z$ for $z < n/2$. Therefore storing all $z$-cliques is feasible in cases where it is not possible to store all the $(k-1)$-cliques. We use this idea to propose an algorithm computing relaxed communities.
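To make this intuition concrete, the snippet below counts the sub-cliques of a single large clique. The clique size of 30 and the value $k = 8$ are arbitrary illustrative choices, not values taken from the experiments.

```python
from math import comb

n, k = 30, 8  # hypothetical clique size and value of k

# Number of z-cliques inside one clique of size n, for z = 2, 3 and k-1.
counts = {z: comb(n, z) for z in (2, 3, k - 1)}

# The count grows with z as long as z stays below n/2.
increasing = all(comb(n, z) < comb(n, z + 1) for z in range(1, n // 2))
```

Here a single clique of 30 nodes contains 435 edges and 4,060 triangles, but 2,035,800 cliques of size 7.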

Definition 4 (cpmz community)

An agglomerated $k$-clique community (or cpmz community) is the union of one or more cpm communities.

Our memory efficient method, called the cpmz algorithm, takes a graph $G$, the size $k$ of $k$-cliques and an integer $z$, and returns a set of agglomerated $k$-clique communities, such that each cpm community is included in one and only one cpmz community (see Theorem 3.1).

In the cpm algorithm, a community is represented as a set of $(k-1)$-cliques, and communities correspond to disjoint sets of $(k-1)$-cliques. In the following, a community is represented as a set of $z$-cliques, and cpmz communities are represented as non-disjoint sets of $z$-cliques.

The main idea of our cpmz algorithm is to identify each $(k-1)$-clique with the set of $z$-cliques it contains. The algorithm is very similar to cpm. For each $k$-clique, all its $(k-1)$-cliques are considered. Since we consider that a $(k-1)$-clique is represented by the set of its $z$-cliques, the community of a $(k-1)$-clique is the one that contains all its $z$-cliques.

The cpmz algorithm uses two principal data structures. UF is a Union-Find data structure, whose nodes are identifiers of $z$-clique sets. We will call these Union-Find nodes. The operations defined on this structure are presented in Section 3.1. Each $z$-clique can belong to several Union-Find nodes. This is recorded in the Setz dictionary, which associates to each $z$-clique the set of Union-Find nodes to which it belongs. See Figure 1 for an example.

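[Figure 1 (image not reproduced): a graph on nodes 1, 3, 4, 6, 7, 8, 9 and 10, the associated Setz information, and the Union-Find forest.]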

Figure 1: Example of a graph and the corresponding data structures of the cpmz algorithm. In this example, we have $k = 4$ and $z = 2$. There are 17 $z$-cliques belonging to 4 $k$-cliques, namely {1,3,4,6}, {3,6,8,9}, {4,7,6,10} and {1,3,6,9}. The nodes of the Union-Find structure are represented using letters. Each $z$-clique is associated to one or more Union-Find nodes, as shown in the Setz information on the top-right. The Union-Find structure represents two sets because there are two different root nodes: the first set contains all 2-cliques associated with either of the two nodes of the first tree, the second contains the 2-cliques associated with the single node of the second tree.

More formally, Setz is a dictionary with $z$-cliques as keys. For a $z$-clique $c_z$:

  • Setz[$c_z$] is a set of Union-Find nodes;

  • Setz[$c_z$].add($p$) adds the Union-Find node $p$ to the set of Union-Find nodes of $c_z$. It can also be seen as the action of adding $c_z$ to the set identified by $p$.

At the end of the algorithm, every tree corresponds to a cpmz community represented as a union of sets containing $z$-cliques.

Note that during the execution of the algorithm, the same $z$-clique can belong to several Union-Find nodes of the same Union-Find set, which creates redundancies in Setz[$c_z$]. This is the case for instance in the example of Figure 1, in which Setz[(3,6)] contains two Union-Find nodes that belong to the same Union-Find set. This situation can occur if a $z$-clique belongs to two different Union-Find sets which are merged later.

In our cpmz algorithm (presented below) we eliminate these redundancies when we detect them (see Line 8).

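1: UF $\leftarrow$ Empty Union-Find data structure
2: Setz $\leftarrow$ Empty Dictionary
3: for each $k$-clique $c_k \in G$ do
4:     $S \leftarrow \emptyset$ ▷ Sets of $z$-cliques to merge
5:     for each $(k-1)$-clique $c_{k-1} \subset c_k$ do
6:         $P \leftarrow$ undefined
7:         for each $z$-clique $c_z \subset c_{k-1}$ do
8:              Setz[$c_z$] $\leftarrow$ {UF.Find($q$) : $q \in$ Setz[$c_z$]}
9:              if $P$ is undefined then
10:                  $P \leftarrow$ Setz[$c_z$]
11:              else
12:                  $P \leftarrow P \cap$ Setz[$c_z$]
13:         $S \leftarrow S \cup P$
14:     ▷ $p$: Identifier of the resulting set of $z$-cliques
15:     if $S = \emptyset$ then
16:         $p \leftarrow$ UF.MakeSet()
17:     else
18:         $p \leftarrow$ UF.Union($S$)
19:     for each $z$-clique $c_z \subset c_k$ do
20:         Setz[$c_z$].add($p$)
Algorithm 2 cpmz pseudocode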

Algorithm 2 is the pseudo-code of cpmz. The for loop on Line 3 iterates over each $k$-clique $c_k$ of a graph $G$. As in the cpm algorithm, the idea is to identify the communities of each $(k-1)$-clique $c_{k-1}$ of $c_k$ and perform their union. As explained above, the communities of a $(k-1)$-clique are the ones that contain all its $z$-cliques, which is why we compute their intersection in the set $P$ in Line 12. The set $P$ is computed for all $(k-1)$-cliques in Lines 4-13 and their union is computed in set $S$. Then all the sets in $S$ are merged in Line 18.

It may turn out that $S$ is empty after the loop of Line 5. This corresponds to the case where none of the $(k-1)$-cliques of $c_k$ were observed before: if a $(k-1)$-clique has not yet been seen in the algorithm, its $z$-cliques may not belong to a common Union-Find set, and therefore $P$ computed at Line 12 is empty. If this happens for all $(k-1)$-cliques of $c_k$, $S$ is empty and a new set (Union-Find node) is created on Line 16. The identifier of the resulting (new or merged) set is added to the set of Union-Find nodes for every $z$-clique of the current $k$-clique (Line 20).

In some rare cases, Line 12 will consider that a $(k-1)$-clique belongs to a Union-Find set while this is not true in the cpm exact case. Figure 2 gives an example. In that case this causes an incorrect $k$-clique adjacency detection and results in an incorrect merge of two or more $k$-clique communities.
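The relaxed procedure described above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions rather than the authors' C implementation: $k$-cliques are listed by brute force, the Union-Find is a plain parent array, and the variable names (`setz`, `common`, `to_merge`) are hypothetical. The usage example rebuilds the graph of Figure 1 from its four 4-cliques.

```python
from itertools import combinations


def cpmz(edges, k, z):
    """Agglomerated k-clique communities; assumes 1 <= z <= k-1."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    parent = []  # Union-Find forest over identifiers of z-clique sets

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    setz = {}  # z-clique -> set of Union-Find nodes it belongs to
    for ck in combinations(sorted(adj), k):
        # Brute-force clique test (assumption: tiny graphs only).
        if not all(v in adj[u] for u, v in combinations(ck, 2)):
            continue
        to_merge = set()
        for ck1 in combinations(ck, k - 1):
            common = None
            for cz in combinations(ck1, z):
                roots = {find(p) for p in setz.get(cz, ())}
                if cz in setz:
                    setz[cz] = roots  # drop redundant identifiers
                common = set(roots) if common is None else common & roots
            to_merge |= common  # sets containing all z-cliques of ck1
        if not to_merge:
            p = len(parent)  # MakeSet
            parent.append(p)
        else:
            p = min(to_merge)  # Union of all sets in to_merge
            for r in to_merge:
                parent[r] = p
        for cz in combinations(ck, z):
            setz.setdefault(cz, set()).add(p)

    # Post-processing: collect the nodes of each Union-Find tree.
    communities = {}
    for cz, ids in setz.items():
        for p in ids:
            communities.setdefault(find(p), set()).update(cz)
    return sorted(sorted(c) for c in communities.values())


# Graph of Figure 1, rebuilt from its four 4-cliques.
cliques = [(1, 3, 4, 6), (3, 6, 8, 9), (4, 6, 7, 10), (1, 3, 6, 9)]
edges = sorted({e for c in cliques for e in combinations(c, 2)})
result = cpmz(edges, 4, 2)
```

On this graph the sketch ends with two Union-Find trees, in agreement with the caption of Figure 1, and the edge (4,6) is associated with identifiers from both trees.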

[Figure 2 (image not reproduced): a graph on nodes 1 to 10.]
Figure 2: In this example and . A -clique percolation community is formed by the nodes of the following -cliques: , , , , , , . The middle -clique with nodes is formed by the -cliques (edges) of other -cliques, whereas it is not itself a part of any -clique given above. When a new -clique is observed, the cpmz algorithm will produce one community , but the exact cpm algorithm gives two communities, namely and .
Theorem 3.1 (cpmz validity)

The cpmz algorithm returns a set of agglomerated $k$-clique communities, such that each cpm community is included in one and only one agglomerated community.

Proof

If two $k$-cliques $c_k$ and $c'_k$ are adjacent, this means they share a $(k-1)$-clique $c_{k-1}$. This will be correctly detected by Lines 5-12 of Algorithm 2: after the iteration on $c_k$ in the main loop, all $z$-cliques of $c_{k-1}$ will belong to a common Union-Find set. The root of this set will belong to $S$ during the iteration on $c'_k$, ensuring that the Union-Find sets of both $k$-cliques will be merged. In other words, cpm communities are never split by the cpmz algorithm and each cpm community belongs to a single agglomerated community. Conversely, an agglomerated community may contain more than one cpm community.
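The property stated by the theorem can be checked empirically on node sets once both outputs are available. The helper below is a hypothetical sanity check, not part of the authors' code: it only verifies that every cpm community is contained in at least one cpmz community, working on node sets rather than on the underlying sets of cliques.

```python
def is_agglomeration(cpm_communities, cpmz_communities):
    """True if each cpm community is a subset of some cpmz community."""
    return all(
        any(set(c) <= set(a) for a in cpmz_communities)
        for c in cpm_communities
    )


# Hypothetical outputs: the second cpmz community merges two cpm ones.
cpm_out = [{1, 2, 3}, {4, 5, 6}, {6, 7, 8}]
cpmz_ok = [{1, 2, 3}, {4, 5, 6, 7, 8}]
cpmz_bad = [{1, 2}, {3, 4, 5, 6, 7, 8}]

ok = is_agglomeration(cpm_out, cpmz_ok)
bad = is_agglomeration(cpm_out, cpmz_bad)
```

A result of False would indicate that some cpm community has been split, which the theorem rules out for a correct implementation.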

4 Analysis

We denote the number of $k$-cliques in graph $G$ by $n_k$. For each $k$-clique, the cpm algorithm performs a Find and a Union operation for each of its $(k-1)$-cliques (see Algorithm 1). It is well known (see for example [16]) that the Union-Find data structure performs operations in $O(\alpha(n))$ amortized time, where $\alpha$ is the inverse Ackermann function, and $n$ is the number of elements. It grows extremely slowly. Therefore, the number of operations is proportional in practice to $k \cdot n_k$. The space complexity of this algorithm is dominated by the tree on the $(k-1)$-cliques and the corresponding cost is proportional to $n_{k-1}$.

cpmz is a tradeoff between memory and time. Indeed, we will see that the cpmz has a higher running time than the exact cpm algorithm, but requires less memory.

For the cpmz algorithm as well, the time required by the Union and Find operations can still be considered as constant, as is the time required for the MakeSet and Add operations. The total number of operations is then dominated by the number of Find operations of Line 8. This line runs on each distinct triplet $(c_z, c_{k-1}, c_k)$, where $c_z \subset c_{k-1}$ and $c_{k-1} \subset c_k$, and there are $\binom{k-1}{z} \cdot k \cdot n_k$ such triplets. Each $z$-clique of Line 8 belongs to a number of Union-Find nodes that depends on the considered clique but also varies during the execution of the algorithm: it can either increase as new Union-Find nodes are added to Setz[$c_z$] (on Line 20) or decrease (on Line 8) if some Union-Find trees are merged (on Line 18). This number is bounded by the number of $k$-cliques each $z$-clique belongs to, which can theoretically be quite high. However, we computed in practice the average number of Find operations performed for each $z$-clique, and we will see in Section 5.2 that it is very often equal to 1 or 2 and never exceeds 6 in our experiments. The main difference in the running time with respect to the exact cpm algorithm is, therefore, the extra $\binom{k-1}{z}$ factor.
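As a quick numerical illustration of this count, the snippet below evaluates the number of triplets for a given number of $k$-cliques. The function name and the input values ($k = 6$, $z = 2$, one million $k$-cliques) are hypothetical and are not taken from the experiments.

```python
from math import comb


def line8_visits(n_k, k, z):
    """Number of (z-clique, (k-1)-clique, k-clique) triplets visited."""
    return comb(k - 1, z) * k * n_k


n_k, k, z = 1_000_000, 6, 2  # hypothetical values
visits = line8_visits(n_k, k, z)
extra_factor = visits // (k * n_k)  # overhead relative to exact cpm
```

With these values each $(k-1)$-clique contains 10 edges, so the relaxed algorithm visits ten times more items than the $k \cdot n_k$ visits of the exact one.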

Concerning the space requirements, the cpmz algorithm needs to store all $z$-cliques, which takes a space proportional to $n_z$. Each $z$-clique then belongs to a number of Union-Find nodes that varies during the algorithm execution. Even if the average of this number is low, we are interested now in the maximum space taken at any time of the execution. Finally, the number of nodes in the Union-Find structure is equal to the number of MakeSet operations that have been performed during the execution. In theory this number can also be quite high. However, we will see in Section 5 that in practice the memory requirements of the cpmz algorithm are much lower than those of the cpm algorithm.

Finally, notice that both our algorithms result in the Union-Find structure whose nodes represent sets of cliques. This structure encodes the communities. In order to obtain the actual node list of each community, post-processing is needed. It consists in iterating over all cliques in the Union-Find structure. Then for each clique one must find its root node in the Union-Find and add the clique nodes to the corresponding set. We do not take into account this post-processing in the reported running time and memory usage in the next section.
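The post-processing step described above can be sketched as follows. The data layout is an assumption made for illustration: the forest is a list `parent` where a root points to itself, and `clique_ids` maps each stored clique to the set of Union-Find nodes it is attached to (a single node for cpm, possibly several for cpmz). Both names are hypothetical.

```python
def extract_communities(parent, clique_ids):
    """Turn a Union-Find forest over cliques into sets of graph nodes."""

    def find(p):
        while parent[p] != p:
            p = parent[p]
        return p

    communities = {}
    for clique, ids in clique_ids.items():
        for p in ids:
            # Add the nodes of the clique to the community of its root.
            communities.setdefault(find(p), set()).update(clique)
    return list(communities.values())


# Hypothetical forest: node 1 is attached to root 0, node 2 is a root.
parent = [0, 0, 2]
clique_ids = {(1, 2): {0}, (2, 3): {1}, (3, 4): {1, 2}, (4, 5): {2}}
communities = extract_communities(parent, clique_ids)
```

The clique (3, 4) is attached to both trees, so its nodes appear in both resulting communities.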

5 Experimental evaluation

Machine

We carried out our experiments on a Linux machine DELL PowerEdge R440, equipped with 2 processors Intel Xeon Silver 4216 with 32 cores each, and with 384Gb of RAM.

Datasets

We consider several real-world graphs that we obtained from [11]. Their characteristics are presented in Table 1. We distinguish between three categories of graphs according to their number of $k$-cliques. For graphs with small core values all algorithms are able to run; for graphs with medium core values, the state-of-the-art algorithms take too long to complete (with the exception of DBLP discussed below) while our cpm algorithm obtains results for small values of $k$; for graphs with large core values even our cpm algorithm runs out of memory or time except for very small values of $k$, but we will show that we are able to obtain relaxed results of high quality with our cpmz algorithm.

network        nodes          edges            c     k_min - k_max   n_k at k_min     n_k at k_max
soc-pokec      1,632,803      22,031,964       47    3 - 15          32,557,458       353,958,854
loc-gowalla    196,591        950,327          51    3 - 15          2,273,138        201,454,150
Youtube        1,134,890      2,987,624        51    3 - 15          3,056,386        1,068
zhishi-baidu   2,140,198      17,014,946       78    3 - 15          25,207,196       1,080,702,188

as-skitter     1,696,415      11,095,298       111   3 - 6           28,769,868       9,759,000,981
DBLP           425,957        1,049,866        113   3 - 7           2,224,385        60,913,718,813
WikiTalk       2,394,385      4,659,565        131   3 - 7           9,203,519        5,490,986,046

Orkut          3,072,627      117,185,083      253   3 - 5           627,584,181      15,766,607,860
Friendster     124,836,180    1,806,067,135    304   3 - 4           4,173,724,124    8,963,503,236
LiveJournal    4,036,538      34,681,189       360   3 - 4           177,820,130      5,216,918,441

Table 1: Our dataset of real-world graphs, ordered by core value $c$. $k_{\min}$ and $k_{\max}$ represent the minimum and maximum $k$ on which we could run our cpm algorithm. $n_k$ is the number of $k$-cliques of the graph.

Implementation

We implemented our cpm and cpmz algorithms in C. The implementation is available on the following gitlab repository: https://gitlab.lip6.fr/baudin/cpm-cpmz. For the competitors, we used the publicly available implementations of their algorithms [8, 9, 14, 15].

Computing domain

For each graph and each algorithm, we performed the computations of cpm for all values of $k$ from 3 to 15, unless we were not able to finish the computation for one of the following reasons:

  • the memory exceeded 390 Gb of RAM,

  • or the computation time exceeded 72 hours.

We ran the cpmz algorithm for and . It could be computed on all the cases for which cpm works, except for in graphs zhishi-baidu with and DBLP with .

The interesting point is that we manage to have results with cpmz in cases where the computation could not be carried out by the cpm algorithm: for as-skitter , WikiTalk , Orkut and Friendster .

The detail of all the calculated values, with which the following figures were generated, is available on the following gitlab repository: https://gitlab.lip6.fr/baudin/cpm-supplementary-material.

5.1 Comparison with the state of the art

The algorithm proposed by Palla et al. in the original paper introducing cpm [14] is quadratic in the number of maximal cliques. Given that we are interested in graphs with at least several million cliques, these graphs are too big to be processed by this algorithm. We do not perform experiments with this algorithm.

In addition, our tests have shown that the algorithm by Reid et al. [15] performs better than that of Gregori et al. [8] (sequential version); therefore we do not present the results obtained with the version of Gregori et al.

We observed that there are indeed linearity factors:

  • in time: for each $k$-clique, each of its $(k-1)$-cliques is processed in constant time, hence the running time of cpm is indeed linear in $n_k$;

  • in memory: the memory is used to store the Union-Find structure on the $(k-1)$-cliques: one node per $(k-1)$-clique encoded on $k-1$ integers, hence the memory needed by cpm is linear in $n_{k-1}$.

Figure 3 compares the time and the memory necessary for the computation of the cpm communities by our cpm algorithm and the remaining competitive algorithms in the state of the art, proposed by Reid et al. [15] and Kumpula et al. [9]. For each competitive algorithm, we plot its running time (resp. memory usage) divided by the running time (resp. memory usage) of our cpm algorithm. We display the results as a function of $n_k$, where $n_k$ is the number of $k$-cliques of the input graph. In some cases, our cpm algorithm obtains results whereas one of the state-of-the-art algorithms does not. This can happen because this algorithm exceeds either the time or memory limit. We display this by placing a symbol on the corresponding horizontal line at the top of the figure.

Figure 3: Comparison between the time and memory consumption of the state-of-the-art cpm methods and ours. For each competitive algorithm, we plot its running time (resp. memory usage) divided by the running time (resp. memory usage) of our cpm algorithm. We display the results as a function of $n_k$, where $n_k$ is the number of $k$-cliques of the input graph. The maximum time is limited to 72h and the maximum memory to 390Gb. A marker placed at the corresponding line therefore indicates a computation that did not finish, because of either the time or the memory limit.

First notice that the Reid et al. algorithm is better than ours for the four smallest graphs in our dataset (soc-pokec, loc-gowalla, Youtube, zhishi-baidu). Indeed, this algorithm begins by computing the maximal cliques, then processes them to form communities. In the case of these small graphs, the maximal cliques are easily computed. For such graphs, the memory used does not depend on $k$ because in all cases the maximal cliques are stored; interestingly, the computation time decreases with $k$ as only the maximal cliques of size larger than or equal to $k$ have to be tested for adjacency.

This algorithm is also more efficient than ours in certain configurations, when the graph is already well segmented into large cliques. This is the case with our DBLP graph, for which the Reid et al. algorithm manages to compute the communities in 10 seconds when we need several hours to process the large number of $k$-cliques.

Notice however that their algorithm cannot process the largest graphs of our dataset. The as-skitter intermediate graph contains too many cliques and their algorithm does not provide a result in less than 72 hours. For denser graphs (WikiTalk, Orkut, Friendster, LiveJournal), there are too many maximal cliques to fit in RAM, and the algorithm cannot run, while ours is able to compute the result.

Finally, the algorithm of Kumpula et al. is systematically less efficient than ours. Our algorithm is also able to obtain results in cases where no other algorithm can provide any (see the points on the two horizontal lines on top of the figures).

5.2 Memory gain of the cpmz algorithm

Figure 4 (right) compares the memory used by our algorithms cpm and cpmz with . We show the memory used by cpmz divided by the memory used by cpm, as a function of . As for the previous figure, we represent cases where cpmz exceeds the time limit on a horizontal line on top of the figure. In addition, cases where we obtain results with cpmz and not cpm are represented by symbols on a horizontal line at the bottom of the figure. For some small graphs, the number of -cliques, which are stored by cpm, is smaller than the number of -cliques. For these graphs cpmz requires more memory than cpm. In most cases however, we observe a huge memory gain, and in some cases it is even possible to obtain results unachievable by our cpm algorithm.

Figure 4: Comparison between time and memory consumption of the cpmz algorithm and our cpm. For cpmz with and , we plot its running time (resp. memory usage) divided by the running time (resp. memory usage) of our cpm algorithm. We display the results as a function of , where is the number of -cliques of the input graph. The maximum time is limited to 72h, and the memory is not limiting in comparison with cpm. A marker placed at the top line therefore indicates a computation that did not finish, because of time limit. A marker placed at the bottom line indicates a computation that finishes for cpmz but not for cpm.

In addition to this display and to the discussion about memory requirements in Section 4, we performed experiments to evaluate the memory associated with each $z$-clique in the algorithm. To do so, we computed the mean number of Union-Find nodes to which each $z$-clique belongs. In Algorithm 2, we therefore computed the sum of the sizes of all Setz[$c_z$] every time the algorithm reaches Line 8. Then, we divided this sum by the number of times this line is visited, which is $\binom{k-1}{z} \cdot k \cdot n_k$. We observed that this factor remains low; for all graphs and all values of $k$ and $z$ it is between 1 and 2, except for some combinations of $k$ and $z$ on the small graphs Youtube, with a value around 4, and zhishi-baidu, with a value around 5.

Concerning the running time, as discussed in Section 4, there are two factors that cause cpmz to be slower than cpm. The first one is the fact that a $z$-clique can belong to several Union-Find nodes. As we have just shown, this number is small in practice and therefore it does not play a strong role in the running time. The other factor is the extra $\binom{k-1}{z}$ factor induced by the fact that we consider all $z$-cliques included in a $(k-1)$-clique. This factor is high and therefore the computation time remains the limiting factor. Figure 4 (left) compares it to the running time of cpm.

5.3 cpmz communities are very close to cpm communities

To measure the precision of our algorithm, we compare the agglomerated communities computed by the cpmz algorithm with those of the cpm algorithm. To do so, we use an implementation of a Normalized Mutual Information measure for sets of overlapping clusters, called onmi, provided by McDaid et al. [13]. This tool measures how similar two sets of overlapping communities are.

We carried out the similarity comparisons on all the graphs of our dataset, with all the values of $k$ for which we can compute the communities with the cpm algorithm (see Table 1). We observe the following:

  • for cpmz with , the average similarity is , the median is and all values are larger than ;

  • for cpmz with , the average similarity is , the median is and all values are larger than .

This confirms that the incorrect merges between communities performed by the cpmz algorithm have little influence on the final result: the structure of communities is barely impacted by the cpmz algorithm.

6 Conclusion and discussions

In this paper we addressed the problem of overlapping community detection on graphs through the clique percolation method (cpm). Our contributions are twofold: first we proposed an improvement in the computation of the exact result by leveraging a state-of-the-art $k$-clique listing method; then we proposed a heuristic algorithm called cpmz that provides agglomerated communities, i.e. communities that are supersets of the exact communities; this algorithm uses much less memory than the exact algorithm, at the cost of a higher running time.

Through extensive experiments on a large set of graphs coming from different contexts, we show that:

  • our exact cpm algorithm outperforms the state of the art algorithms in many cases, and we are able to compute the cpm communities in cases where it was not possible before;

  • our relaxed cpmz algorithm uses significantly less memory than the exact algorithm; even though its running time is higher, this allows us to obtain agglomerated communities in cases where no other algorithm can run;

  • finally the results provided by the cpmz algorithm have an excellent accuracy, according to the onmi method and obtain a score very close to 1 in the vast majority of cases.

Notice however that for the DBLP graph, which is of medium size, the approach proposed in [15] works better than ours. This can be explained because this graph naturally has a strong clique structure. Indeed, a link exists in this graph if two authors have written a paper together, and each paper therefore induces a clique on the set of its authors. In this case computing the maximal cliques and extracting the community out of them is more efficient than detecting adjacent -cliques. This raises the interesting question of whether it is possible to predict which method will be more efficient on a graph by studying this graph’s structure.

Several other interesting perspectives arise from our work. The order in which k-cliques are processed plays an important role in the incorrect community agglomerations performed by the cpmz algorithm, in a way that we do not yet fully understand. Our experiments show that in practice only a few merges of k-clique communities happen. This gives rise to many interesting graph-theoretical questions about the characterisation of the sub-graphs that can produce an incorrect k-clique adjacency detection: how many of them are there in a typical real-world graph? We conjecture that it is possible to construct examples in which no processing order of the k-cliques leads to the exact solution. However, in many cases, including real-world graphs, it is possible that a certain processing order of k-cliques yields results of a higher quality than other orderings. This raises the question of how to design such an ordering. Another interesting possibility would be to run the cpmz algorithm with two or more different k-clique orderings and compare their outputs: since the cpmz communities are coarser than the exact cpm communities, it is possible to compare the communities of both outputs to obtain a better result than any of the individual runs.
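One possible way to combine two runs is sketched below. It assumes that each run can report a community label for every k-clique, which is an assumption made for illustration and not necessarily how cpmz outputs its result; the function name `common_refinement` is hypothetical. Since each run only merges exact communities, two k-cliques of the same exact community receive the same label in both runs, so grouping k-cliques by their pair of labels gives a result that is still coarser than or equal to the exact one, and at least as fine as each run.

```python
from collections import defaultdict

def common_refinement(labels_a, labels_b):
    """Group k-cliques by the pair of labels they received in two runs.

    labels_a, labels_b: dicts mapping each k-clique (a hashable, e.g. a
    sorted tuple of nodes) to the community label assigned by one run.
    Returns a list of sets of k-cliques.
    """
    groups = defaultdict(set)
    for clique, label_a in labels_a.items():
        groups[(label_a, labels_b[clique])].add(clique)
    return list(groups.values())

# Toy example with four 3-cliques whose exact communities are
# {c1, c2}, {c3} and {c4} (hypothetical data).
c1, c2, c3, c4 = (0, 1, 2), (1, 2, 3), (5, 6, 7), (8, 9, 10)
run_a = {c1: 0, c2: 0, c3: 0, c4: 1}  # run A wrongly merges {c3} with {c1, c2}
run_b = {c1: 0, c2: 0, c3: 1, c4: 1}  # run B wrongly merges {c3} with {c4}
refined = common_refinement(run_a, run_b)
```

In this toy case the two runs make different incorrect merges, and the refinement recovers the three exact communities. When both runs make the same incorrect merge, the refinement cannot undo it.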

Finally, the linkstream formalism [10] makes it possible to represent interactions that occur at different times, which is a natural framework to represent people meeting at different times during the week or computers exchanging IP packets on the internet. It would be interesting to investigate the community structure and its temporal aspects in such data by extending the definition of k-clique communities to linkstreams.

Acknowledgements

This work was partly supported by projects ANER ARTICO (Bourgogne-Franche-Comté region), ANR (French National Agency of Research) Limass project (under grant ANR-19-CE23-0010), ANR FiT LabCom and ANR COREGRAPHIE project (grant ANR-20-CE23-0002). We would like to greatly thank Lionel Tabourier for insightful discussions, useful comments and suggestions, and Fabrice Lecuyer for his careful proofreading.

References

  • [1] C. Bron and J. Kerbosch (1973) Algorithm 457: finding all cliques of an undirected graph. Communications of the ACM 16 (9), pp. 575–577. Cited by: §2.
  • [2] M. Danisch, O. D. Balalau, and M. Sozio (2018) Listing k-cliques in sparse real-world graphs. In WWW, P. Champin, F. L. Gandon, M. Lalmas, and P. G. Ipeirotis (Eds.), pp. 589–598. Cited by: §2, §2, §3.2.
  • [3] D. Eppstein, M. Löffler, and D. Strash (2010) Listing all maximal cliques in sparse graphs in near-optimal time. In International Symposium on Algorithms and Computation, pp. 403–414. Cited by: §2.
  • [4] D. Eppstein and D. Strash (2011) Listing all maximal cliques in large sparse real-world graphs. Experimental Algorithms, pp. 364–375. Cited by: §2.
  • [5] S. Fortunato and C. Castellano (2012) Community structure in graphs. In Computational Complexity: Theory, Techniques, and Applications, R. A. Meyers (Ed.), pp. 490–512. External Links: ISBN 978-1-4614-1800-9 Cited by: §1.
  • [6] M. Fredman and M. Saks (1989) The cell probe complexity of dynamic data structures. In Proceedings of the Twenty-First Annual ACM Symposium on Theory of Computing, STOC '89, pp. 345–354. External Links: ISBN 0897913078 Cited by: §2.
  • [7] B. A. Galler and M. J. Fisher (1964-05) An improved equivalence algorithm. Commun. ACM 7 (5), pp. 301–303. External Links: ISSN 0001-0782 Cited by: §2.
  • [8] E. Gregori, L. Lenzini, and S. Mainardi (2013) Parallel k-clique community detection on large-scale networks. IEEE Transactions on Parallel and Distributed Systems 24 (8), pp. 1651–1660. Note: Implementation available at: https://sourceforge.net/p/cosparallel Cited by: §1, item (1), §5, §5.1.
  • [9] J. M. Kumpula, M. Kivelä, K. Kaski, and J. Saramäki (2008) Sequential algorithm for fast clique percolation. Physical Review E 78 (2), pp. 026109. Note: Implementation available at: https://github.com/CxAalto/scp Cited by: item (2), §2, §3.2, §3.2, §5, §5.1.
  • [10] M. Latapy, T. Viard, and C. Magnien (2018-12) Stream Graphs and Link Streams for the Modeling of Interactions over Time. Social Networks Analysis and Mining 8 (1), pp. 61:1–61:29. External Links: Link, Document Cited by: §6.
  • [11] J. Leskovec and A. Krevl (2014-06) SNAP Datasets: Stanford large network dataset collection. Note: http://snap.stanford.edu/data Cited by: §5.
  • [12] R. Li, S. Gao, L. Qin, G. Wang, W. Yang, and J. Xu Yu (2020) Ordering heuristics for k-clique listing. In PVLDB, Cited by: §2, §3.2.
  • [13] A. F. McDaid, D. Greene, and N. Hurley (2013) Normalized mutual information to evaluate overlapping community finding algorithms. Note: https://arxiv.org/abs/1110.2515 External Links: 1110.2515 Cited by: §5.3.
  • [14] G. Palla, I. Derényi, I. Farkas, and T. Vicsek (2005) Uncovering the overlapping community structure of complex networks in nature and society. Nature 435 (7043), pp. 814–818. Note: Implementation available at: http://www.cfinder.org/ Cited by: §1, item (1), §5, §5.1.
  • [15] F. Reid, A. McDaid, and N. Hurley (2012) Percolation computation in complex networks. In Advances in Social Networks Analysis and Mining (ASONAM), 2012 IEEE/ACM International Conference on, pp. 274–281. Note: Implementation available at: https://sites.google.com/site/cliqueperccomp/home Cited by: item (1), §5, §5.1, §5.1, §6.
  • [16] R. E. Tarjan and J. van Leeuwen (1984-03) Worst-case analysis of set union algorithms. J. ACM 31 (2), pp. 245–281. External Links: ISSN 0004-5411 Cited by: §2, §4.
  • [17] J. Xie, S. Kelley, and B. K. Szymanski (2013) Overlapping community detection in networks: the state-of-the-art and comparative study. ACM Computing Surveys (CSUR) 45 (4), pp. 43. Cited by: §2.