A Fast Counting Method for 6-motifs with Low Connectivity

02/17/2020
by   Taha Sevim, et al.

A k-motif (or graphlet) is a subgraph on k nodes in a graph or network. Counting motifs in complex networks is a well-studied problem in the analysis of real-world graphs arising from social networks and bioinformatics. In particular, the triangle counting problem has received much attention due to its significance in understanding the behavior of social networks. Subgraphs with more than 3 nodes have likewise received much attention recently. Although successful methods have been developed for this problem, most existing algorithms do not scale to large networks with millions of nodes and edges. The main contribution of this paper is a preliminary study that generalizes the exact counting algorithm of Pinar, Seshadhri and Vishal to a collection of 6-motifs. The method uses the counts of smaller motifs to obtain the counts of 6-motifs with low connectivity, that is, those containing a cut-vertex or a cut-edge. It thereby circumvents the combinatorial explosion that naturally arises when counting subgraphs in large networks.

1 Introduction

In social network analysis, any fixed subgraph with $k$ nodes is called a $k$-motif (or graphlet), and the analysis of motifs has been a useful way to characterize the structure of real-world graphs. It has been observed, particularly in social networks, that some motifs are more common than others and that the structure of such networks differs from that of random graphs [12, 18, 29]. Although the analysis of these subgraphs alone is not sufficient to understand the structure of a network, it has been validated that motif frequencies provide substantial information about the local network structure in various domains [12, 8, 9]. By counting the number of embeddings of each motif in a network, it is possible to create a profile of sufficient statistics that characterizes the network structure [24].

Although there has been a significant amount of success and impact in areas ranging from social science to biology, the search for faster and more efficient algorithms to compute the frequencies of graph patterns continues. The main reason to study faster motif counting algorithms is combinatorial explosion. The running time of algorithms that exactly count $k$-motifs on the vertex set $V$ by enumeration is of the order $|V|^k$. The counts of 6-motifs are in the order of billions to trillions for graphs with more than a few million edges. Thus, an enumeration algorithm cannot terminate in a reasonable time. The idea presented in [19], and extended to 6-motifs here, is a framework of counting with minimal enumeration. The main contribution of this paper is a preliminary study that generalizes the exact counting algorithm provided in [19] to a collection of 6-motifs. To the best of our knowledge, this is the only study that counts 6-motifs using exact computation and performs all counts in graphs with millions of edges in minutes. As preliminary work, we are able to exactly count the motifs shown in Figure 2. This subset of motifs is chosen because each of them contains a cut-vertex or a cut-edge, that is, a vertex or edge whose removal disconnects the motif. The main idea is to build a framework that cuts each pattern on 6 nodes into smaller patterns, each of which contains that particular cutting subset, also called the cut-set. Enumeration is then needed only for these smaller patterns rather than for the big pattern. For our purposes, we do not carry out the enumeration ourselves and instead use the counts of these smaller patterns obtained in [19].

There are various approximation algorithms [4, 14, 20, 23, 31]; however, the results they provide are not exact, and they do not scale to counting larger motifs with more than 4 nodes, whereas the method presented here scales to very large networks. As presented in Section 3, our method counts the 6-motifs in Figure 2 for a network with 3 million edges in under 5 minutes. Most studies on counting motifs have focused on smaller motifs of size at most 4. In particular, the count of triangles has been widely studied due to its importance in the analysis of social networks [28]. These results have been helpful for graph classification and are often used as graph attributes. Another group of recent studies uses subgraph counts to detect communities and dense subgraphs, such as [2, 22, 25, 27]. More recent algorithmic improvements on counting triangles can be found in [23, 26]. Exact and approximate algorithms for computing the number of non-induced 4-motifs are proposed in [10].

It has been observed that sampling algorithms [4, 20, 30, 31] and randomized algorithms, such as the color coding method [3, 13, 32], are not feasible for counting motifs of size larger than 4. One of the most recently developed algorithms, given in [5], estimates the number of 7-motifs on a graph with 65M nodes and 1.8B edges in around 40 minutes. Exact counting algorithms as in [16, 15, 31] exist, but they are very slow and do not scale to large graphs. The recent study in [19] presented an exact counting technique that counts all patterns with at most 5 vertices on graphs with tens of millions of edges in several minutes.

The main contribution of this paper is to cut each pattern in the chosen collection into smaller patterns and to use the enumeration of these smaller patterns to count the big pattern within the framework of [19]. Other algorithms that use ideas to avoid enumeration can be found in [1, 6, 7, 11].

2 Methodology

The input graph $G = (V, E)$ is undirected, where $V$ and $E$ denote the vertex set and the edge set of $G$, respectively. A subgraph of $G$ is called induced if all edges present in the host graph between its vertices also exist as edges in that subgraph. Otherwise, it is called non-induced. In our counting method, a subgraph means a non-induced subgraph. We call a triangle with a missing edge a wedge, a $K_4$ with a missing edge a diamond, and a triangle with an edge attached to one of its vertices a tailed triangle.

Figure 1: The collection of connected 5-motifs [19].

In our notation, each vertex $v$ has a degree and a count of the triangles that contain it, and each edge $e$ has a count of the triangles that contain it. Similarly, the numbers of 4-cycles and 4-cliques that contain a given vertex (resp. edge) are recorded per vertex (resp. per edge). Global counts of 4-vertex patterns in the graph, such as the numbers of diamonds and tailed triangles, have their own symbols, as do the numbers of wedges between two given vertices and of wedges ending at a given vertex. The counts of the 5-motifs in Figure 1 and of the 6-motifs in Figure 2 are indexed by the motif labels used in those figures.
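The local and global quantities described above can be sketched in a few lines of Python. This is an illustration only, not the ESCAPE implementation: the function names `local_counts` and `global_counts` are hypothetical, vertices are assumed to be integers, and the closed forms for wedges, tailed triangles and diamonds are the standard non-induced counting identities.

```python
from math import comb


def local_counts(edges):
    """Return (degree, triangles per vertex, triangles per edge).

    Assumes a simple undirected graph given as pairs of integer vertices;
    loops are skipped and repeated edges collapse in the sets.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    degree = {v: len(nbrs) for v, nbrs in adj.items()}

    # Triangles on an edge = common neighbours of its two endpoints.
    tri_edge = {}
    for u in adj:
        for v in adj[u]:
            if u < v:
                tri_edge[(u, v)] = len(adj[u] & adj[v])

    # Every triangle at v is seen from its two edges incident to v,
    # so summing the edge counts around v gives twice the vertex count.
    twice = {v: 0 for v in adj}
    for (u, v), t in tri_edge.items():
        twice[u] += t
        twice[v] += t
    tri_vertex = {v: t // 2 for v, t in twice.items()}
    return degree, tri_vertex, tri_edge


def global_counts(edges):
    """Non-induced counts of wedges, tailed triangles and diamonds."""
    degree, tri_vertex, tri_edge = local_counts(edges)
    return {
        # A wedge is a centre vertex plus two of its neighbours.
        "wedges": sum(comb(d, 2) for d in degree.values()),
        # A tailed triangle is a triangle at v plus one further neighbour of v.
        "tailed_triangles": sum(tri_vertex[v] * (degree[v] - 2) for v in degree),
        # A diamond is two distinct triangles sharing an edge.
        "diamonds": sum(comb(t, 2) for t in tri_edge.values()),
    }
```

For the complete graph on 4 vertices, this sketch reports 12 wedges, 12 tailed triangles and 6 diamonds, which matches a count by hand.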

Figure 2: The 6-motifs with low connectivity.

A standard method for counting triangles is to enumerate the wedges and find the triangles by checking whether the missing edge is present. By a similar idea, the formulation here uses a cut-set, say $S$, for each motif $H$, whose removal disconnects $H$. Let the components be $C_1, C_2, \ldots$ and let $H_i$ be the subgraph of $H$ induced by $C_i \cup S$. Some care is needed in choosing this cut-set; in our algorithm it is typically a vertex or an edge. The count for each possible $H_i$ that contains $S$ is obtained from the counts of 4-motifs and 5-motifs given in [19]. The collection of 5-motifs used in our counting method can be seen in Figure 1.
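The wedge-based triangle count mentioned above can be written as a minimal sketch. The function name `count_triangles_via_wedges` is hypothetical, and the code favors clarity over the degree-ordering optimizations that practical implementations use.

```python
from itertools import combinations


def count_triangles_via_wedges(edges):
    """Count triangles by enumerating wedges and testing the closing edge."""
    adj = {}
    for u, v in edges:
        if u != v:  # ignore loops; repeated edges collapse in the sets
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    closed_wedges = 0
    for center, nbrs in adj.items():
        # Each pair of neighbours (a, b) forms the wedge a - center - b.
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:  # the missing edge is present
                closed_wedges += 1

    # Every triangle closes exactly three wedges, one per centre vertex.
    return closed_wedges // 3
```

The same pattern, enumerating a small piece and checking or combining the rest, is what the cut-set formulation generalizes to larger motifs.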

2.1 Main Theorems

The exact computation for the motifs presented in Figure 2 is obtained in the following theorems. We refer the reader to [19] for the technical details of the method used. Here, we briefly discuss two examples to present the general idea and how we apply it to obtain Theorems 2.1 and 2.2.

Theorem 2.1 (Cut is a vertex)

For example, one of the expressions in Theorem 2.1 has no overcounting: it considers every vertex $v$ as a cut-vertex and counts by pairing the triangles that contain $v$ with three further neighbors attached to $v$.
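To make the cut-vertex idea concrete, the following sketch counts non-induced copies of a 6-vertex pattern consisting of a triangle with three extra neighbors attached to one of its vertices. The closed form used here, the number of triangles at $v$ times the number of ways to choose 3 of the remaining $d(v) - 2$ neighbors, is our reading of the description above and is an assumption, not a formula copied from the theorem; the function name is hypothetical.

```python
from itertools import combinations
from math import comb


def count_triangle_with_three_tails(edges):
    """Count triangles at a cut-vertex v paired with 3 other neighbours of v."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    total = 0
    for v, nbrs in adj.items():
        # Triangles containing v: adjacent pairs among the neighbours of v.
        triangles_at_v = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        # A triangle uses two neighbours of v; pick 3 of the remaining ones.
        remaining = max(len(nbrs) - 2, 0)
        total += triangles_at_v * comb(remaining, 3)
    return total
```

Because the cut-vertex, the triangle and the set of three extra neighbors determine the subgraph uniquely, each copy is counted exactly once and no correction term is needed in this sketch.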

Figure 3: The chosen cut-edge for motif-17.

However, in the calculation of the count of motif-17 in Theorem 2.2, we subtract the numbers of other motifs that are counted unnecessarily. Here, the cut-set is an ordered pair of adjacent vertices. One example of overcounting occurs when the vertices labeled 1 and 3 in Figure 3 are chosen to be the same, meaning that 5-motifs with index 10 are also counted. Thus, we subtract that count twice, considering the mapping to the vertices labeled either 1 or 4 (in Figure 1). Similarly, all subtractions remove the contributions of overcounting.
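The overcounting correction can be illustrated on a smaller, well-known case. The sketch below is not the motif-17 formula; it counts non-induced paths on 4 vertices by treating each edge as the cut-edge, extending both endpoints independently, and subtracting the cases where the two extensions pick the same vertex (which are triangles rather than paths). The function name `count_three_edge_paths` is hypothetical and vertices are assumed to be integers.

```python
def count_three_edge_paths(edges):
    """Count non-induced paths with 3 edges using a cut-edge and a correction."""
    adj = {}
    for u, v in edges:
        if u != v:
            adj.setdefault(u, set()).add(v)
            adj.setdefault(v, set()).add(u)

    raw = 0        # independent extensions of both endpoints of the cut-edge
    collapsed = 0  # extensions where both endpoints picked the same vertex
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                raw += (len(adj[u]) - 1) * (len(adj[v]) - 1)
                # A shared choice is a common neighbour of u and v.
                collapsed += len(adj[u] & adj[v])

    # Summed over all edges, the collapsed term equals 3 times the number
    # of triangles, so this is the familiar "raw count minus 3T" identity.
    return raw - collapsed
```

The subtraction plays the same role as the correction terms described above: the raw product also counts degenerate placements in which two vertices of the pattern coincide, and each such placement corresponds to a smaller motif whose count is already known.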

Theorem 2.2 (Cut is an edge)

Here, the two pair notations used in the sums distinguish an ordered pair of vertices from an unordered pair.

3 Experimental Results and Conclusions

Experiments are performed on a computer with a 2.7 GHz dual-core Intel Core i5 processor, 3 MB of L3 cache and 8 GB of 1867 MHz LPDDR3 memory. Our counting formulas are implemented in C++ using the ESCAPE framework [19]. The datasets are taken from [21, 17].

The input graph is undirected and has $n$ vertices and $m$ edges, where multiple edges and loops are ignored. The input graph is stored as an adjacency list in which each list is a hash table. Thus, edge queries can be made in expected constant time.
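The storage scheme described above can be sketched as follows. The experiments use C++ with ESCAPE; this Python version is only an illustration of the data structure, and the names `build_adjacency` and `has_edge` are hypothetical.

```python
def build_adjacency(edge_list):
    """Store an undirected graph as a dict of hash sets.

    Loops are dropped and multiple edges collapse, matching the
    preprocessing described in the text.
    """
    adj = {}
    for u, v in edge_list:
        if u == v:
            continue  # ignore loops
        adj.setdefault(u, set()).add(v)  # adding to a set removes duplicates
        adj.setdefault(v, set()).add(u)
    return adj


def has_edge(adj, u, v):
    """Edge query by hash lookup, expected constant time."""
    return v in adj.get(u, ())


# Usage: a repeated edge and a loop in the input leave no trace.
graph = build_adjacency([(1, 2), (2, 1), (2, 3), (3, 3)])
print(has_edge(graph, 1, 2), has_edge(graph, 1, 3), has_edge(graph, 3, 3))
```

Constant-time edge queries are what make the "check whether the closing edge exists" step cheap when pieces of a motif are combined.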

Figure 4: The counts of 6-motifs in the given networks.

The motifs studied here are not induced; however, the motif analysis in Figure 4 still reveals the behavior of the relationships in the corresponding network. As expected, the most common motif is the 5-star, and the tree motifs occur more frequently. One exception is the 5-star with an edge added. This is not surprising, since this motif and the 5-star are both abundant at the hub vertices with very high degrees found in most social networks. Figure 4 also indicates that the higher the clique number of a motif, the lower its count.

In Table 1, the runtimes of the algorithm together with the size of each network are provided. The fourth column shows the runtimes obtained in [19] to evaluate the counts of motifs with 4 and 5 nodes. The runtime spent only for the counts of 6-motifs by our algorithm is listed in the last column.

Network                 |V|       |E|       4-5 motifs [19]   6-motifs
com-youtube             1.1M      2.9M      168.880           4.896
web-wiki-ch-internal    1.9M      8.9M      2017.165          17.047
web-stanford            281.9K    1.9M      222.296           3.233
tech-as-skitter         1.7M      11.1M     1401.271          15.991
soc-brightkite          56.7K     212.9K    6.629             0.242
tech-RL-caida           190.9K    607.6K    4.719             0.729
flickr                  757.2K    1.4M      13.008            1.886
com-amazon              334.8K    925.8K    2.908             1.272
web-google-dir          875.5K    4.3M      63.511            5.589
ia-email-EU-dir         265.0K    364.4K    6.537             0.479
Table 1: The runtimes in seconds for the motif counts of various networks

The runtimes to count smaller motifs were predicted for any network in [19]. Similarly, we obtain predictions using the runtimes in the last column of Table 1, as shown in Figure 5. All counts in Theorems 2.1 and 2.2 can be computed in time bounded in terms of the number of vertices and the number of edges. Because social networks are sparse graphs, the prediction in Figure 5 expresses the runtime in seconds as a function of the number of edges. As observed in Table 1, our algorithm computes the counts of all 6-motifs in Figure 2 in under 20 seconds, excluding the runtime spent to obtain the counts of smaller motifs.
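One way to derive such a prediction from Table 1 is a least-squares fit of the 6-motif runtime against the number of edges. The sketch below assumes a linear model through the origin, which is our assumption for illustration and may differ from the fit behind Figure 5; the data points are copied from Table 1, and the function names are hypothetical.

```python
# (network, edges in millions, 6-motif runtime in seconds), from Table 1.
TABLE_1 = [
    ("com-youtube", 2.9, 4.896),
    ("web-wiki-ch-internal", 8.9, 17.047),
    ("web-stanford", 1.9, 3.233),
    ("tech-as-skitter", 11.1, 15.991),
    ("soc-brightkite", 0.2129, 0.242),
    ("tech-RL-caida", 0.6076, 0.729),
    ("flickr", 1.4, 1.886),
    ("com-amazon", 0.9258, 1.272),
    ("web-google-dir", 4.3, 5.589),
    ("ia-email-EU-dir", 0.3644, 0.479),
]


def fit_through_origin(rows):
    """Least-squares slope for runtime = slope * edges (no intercept)."""
    numerator = sum(edges * seconds for _, edges, seconds in rows)
    denominator = sum(edges * edges for _, edges, _ in rows)
    return numerator / denominator


def predict_seconds(slope, edges_in_millions):
    """Predicted 6-motif runtime for a graph with the given edge count."""
    return slope * edges_in_millions


slope = fit_through_origin(TABLE_1)
print(f"seconds per million edges: {slope:.3f}")
print(f"predicted for 5M edges: {predict_seconds(slope, 5.0):.2f} s")
```

A model with an intercept or a dependence on degree structure may fit better; the point is only that the runtimes in the last column grow roughly in proportion to the number of edges.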

Figure 5: The prediction of runtimes in seconds.

3.1 Conclusions

In this study, we presented preliminary work that generalizes the exact counting method for motifs of networks in [19] to a collection of 6-motifs with low connectivity. We performed experiments to analyze the motif structure in real-world graphs and analyzed the runtime efficiency of the computations. Counting 6-motifs with algorithms based on the enumeration of smaller motifs results in much shorter runtimes compared to other state-of-the-art algorithms. In a future study, we plan to extend this counting method to the remaining connected 6-motifs and to use this data to obtain the counts of induced 6-motifs.

Acknowledgements

The research of the third author was supported in part by the BAGEP Award of the Science Academy of Turkey and by the TUBITAK Grant 11E283.

References

  • [1] Ahmed, N.K., Neville, J., Rossi, R.A., Duffield, N.: Efficient graphlet counting for large networks. In: 2015 IEEE International Conference on Data Mining. pp. 1–10. IEEE (2015)
  • [2] Benson, A.R., Gleich, D.F., Leskovec, J.: Higher-order organization of complex networks. Science 353(6295), 163–166 (2016)
  • [3] Betzler, N., Van Bevern, R., Fellows, M.R., Komusiewicz, C., Niedermeier, R.: Parameterized algorithmics for finding connected motifs in biological networks. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 8(5), 1296–1308 (2011)
  • [4] Bhuiyan, M.A., Rahman, M., Rahman, M., Al Hasan, M.: Guise: Uniform sampling of graphlets for large graph analysis. In: 2012 IEEE 12th International Conference on Data Mining. pp. 91–100. IEEE (2012)
  • [5] Bressan, M., Leucci, S., Panconesi, A.: Motivo: fast motif counting via succinct color coding and adaptive sampling (2019), https://arxiv.org/pdf/1906.01599.pdf
  • [6] Elenberg, E.R., Shanmugam, K., Borokhovich, M., Dimakis, A.G.: Beyond triangles: A distributed framework for estimating 3-profiles of large graphs. In: Proceedings of the 21th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. pp. 229–238. ACM (2015)
  • [7] Elenberg, E.R., Shanmugam, K., Borokhovich, M., Dimakis, A.G.: Distributed estimation of graph 4-profiles. In: Proceedings of the 25th International Conference on World Wide Web. pp. 483–493. International World Wide Web Conferences Steering Committee (2016)
  • [8] Faust, K.: A puzzle concerning triads in social networks: Graph constraints and the triad census. Social Networks 32(3), 221–233 (2010)
  • [9] Frank, O.: Triad count statistics. In: Annals of Discrete Mathematics, vol. 38, pp. 141–149. Elsevier (1988)
  • [10] Gonen, M., Shavitt, Y.: Approximating the number of network motifs. Internet Mathematics 6(3), 349–372 (2009)
  • [11] Hočevar, T., Demšar, J.: A combinatorial approach to graphlet counting. Bioinformatics 30(4), 559–565 (2014)
  • [12] Holland, P.W., Leinhardt, S.: Local structure in social networks. Sociological methodology 7, 1–45 (1976)
  • [13] Hormozdiari, F., Berenbrink, P., Pržulj, N., Sahinalp, S.C.: Not all scale-free networks are born equal: the role of the seed graph in PPI network evolution. PLoS Computational Biology 3(7), e118 (2007)
  • [14] Jha, M., Seshadhri, C., Pinar, A.: Path sampling: A fast and provable method for estimating 4-vertex subgraph counts. In: Proceedings of the 24th International Conference on World Wide Web. pp. 495–505. International World Wide Web Conferences Steering Committee (2015)
  • [15] Kashani, Z.R.M., Ahrabian, H., Elahi, E., Nowzari-Dalini, A., Ansari, E.S., Asadi, S., Mohammadi, S., Schreiber, F., Masoudi-Nejad, A.: Kavosh: a new algorithm for finding network motifs. BMC Bioinformatics 10(1), 318 (2009)
  • [16] Kashtan, N., Itzkovitz, S., Milo, R., Alon, U.: Efficient sampling algorithm for estimating subgraph concentrations and detecting network motifs. Bioinformatics 20(11), 1746–1758 (2004)
  • [17] Leskovec, J., Krevl, A.: Stanford large network dataset collection. https://snap.stanford.edu/data (2014)
  • [18] Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U.: Network motifs: simple building blocks of complex networks. Science 298(5594), 824–827 (2002)
  • [19] Pinar, A., Seshadhri, C., Vishal, V.: Escape: Efficiently counting all 5-vertex subgraphs. In: Proceedings of the 26th International Conference on World Wide Web. pp. 1431–1440. International World Wide Web Conferences Steering Committee (2017)
  • [20] Rahman, M., Bhuiyan, M.A., Al Hasan, M.: Graft: An efficient graphlet counting method for large graph analysis. IEEE Transactions on Knowledge and Data Engineering 26(10), 2466–2478 (2014)
  • [21] Rossi, R., Ahmed, N.: Network data repository. https://networkrepository.com (2012)
  • [22] Sariyuce, A.E., Seshadhri, C., Pinar, A., Catalyurek, U.V.: Finding the hierarchy of dense subgraphs using nucleus decompositions. In: Proceedings of the 24th International Conference on World Wide Web. pp. 927–937. International World Wide Web Conferences Steering Committee (2015)
  • [23] Seshadhri, C., Pinar, A., Kolda, T.G.: Fast triangle counting through wedge sampling. In: Proceedings of the SIAM Conference on Data Mining. vol. 4, p. 5 (2013)
  • [24] Shervashidze, N., Vishwanathan, S., Petri, T., Mehlhorn, K., Borgwardt, K.: Efficient graphlet kernels for large graph comparison. In: Artificial Intelligence and Statistics. pp. 488–495 (2009)
  • [25] Tsourakakis, C.: The k-clique densest subgraph problem. In: Proceedings of the 24th international conference on world wide web. pp. 1122–1132. International World Wide Web Conferences Steering Committee (2015)
  • [26] Tsourakakis, C.E., Kolountzakis, M.N., Miller, G.L.: Triangle sparsifiers. J. Graph Algorithms Appl. 15(6), 703–726 (2011)
  • [27] Tsourakakis, C.E., Pachocki, J., Mitzenmacher, M.: Scalable motif-aware graph clustering. In: Proceedings of the 26th International Conference on World Wide Web. pp. 1451–1460. International World Wide Web Conferences Steering Committee (2017)
  • [28] Ugander, J., Backstrom, L., Kleinberg, J.: Subgraph frequencies: Mapping the empirical and extremal geography of large graph collections. In: Proceedings of the 22nd international conference on World Wide Web. pp. 1307–1318. ACM (2013)
  • [29] Watts, D.J., Strogatz, S.H.: Collective dynamics of ‘small-world’ networks. Nature 393(6684), 440 (1998)
  • [30] Wernicke, S.: Efficient detection of network motifs. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) 3(4), 347–359 (2006)
  • [31] Wernicke, S., Rasche, F.: Fanmod: a tool for fast network motif detection. Bioinformatics 22(9), 1152–1153 (2006)
  • [32] Zhao, Z., Wang, G., Butt, A.R., Khan, M., Kumar, V.A., Marathe, M.V.: Sahad: Subgraph analysis in massive networks using hadoop. In: 2012 IEEE 26th International Parallel and Distributed Processing Symposium. pp. 390–401. IEEE (2012)