Higher-order clustering in networks

04/12/2017 ∙ by Hao Yin, et al. ∙ Cornell University, Stanford University

A fundamental property of complex networks is the tendency for edges to cluster. The extent of the clustering is typically quantified by the clustering coefficient, which is the probability that a length-2 path is closed, i.e., induces a triangle in the network. However, higher-order structures beyond triangles are crucial to understanding complex networks, and the clustering behavior with respect to such higher-order patterns is not well understood. Here we introduce higher-order clustering coefficients that measure the closure probability of higher-order network structures and provide a more comprehensive view of how the edges of complex networks cluster. Our higher-order clustering coefficients are a natural generalization of the traditional clustering coefficient. We derive several properties about higher-order clustering coefficients and analyze them under common random graph models. Finally, we use higher-order clustering coefficients to gain new insights into the structure of real-world networks from several domains.


I Introduction

Networks are a fundamental tool for understanding and modeling complex physical, social, informational, and biological systems Newman (2003). Although such networks are typically sparse, a recurring trait of networks throughout all of these domains is the tendency of edges to appear in small clusters or cliques Rapoport (1953); Watts and Strogatz (1998). In many cases, such clustering can be explained by local evolutionary processes. For example, in social networks, clusters appear due to the formation of triangles where two individuals who share a common friend are more likely to become friends themselves, a process known as triadic closure Rapoport (1953); Granovetter (1973). Similar triadic closures occur in other networks: in citation networks, two references appearing in the same publication are more likely to be on the same topic and hence more likely to cite each other Wu and Holme (2009) and in co-authorship networks, scientists with a mutual collaborator are more likely to collaborate in the future Jin et al. (2001). In other cases, local clustering arises from highly connected functional units operating within a larger system, e.g., metabolic networks are organized by densely connected modules Ravasz and Barabási (2003).

The clustering coefficient quantifies the extent to which edges of a network cluster in terms of triangles. The clustering coefficient is defined as the fraction of length-2 paths, or wedges, that are closed with a triangle Watts and Strogatz (1998); Barrat and Weigt (2000) (Fig. 1, row $C_2$). In other words, the clustering coefficient measures the probability of triadic closure in the network. However, the clustering coefficient is inherently restrictive as it measures the closure probability of just one simple structure: the triangle. Moreover, higher-order structures such as larger cliques are crucial to the structure and function of complex networks Benson et al. (2016); Yaveroğlu et al. (2014); Rosvall et al. (2014). For example, 4-cliques reveal community structure in word association and protein-protein interaction networks Palla et al. (2005) and cliques of sizes 5–7 are more frequent than triangles in many real-world networks with respect to certain null models Slater et al. (2014). However, the extent of clustering of such higher-order structures has been neither well understood nor quantified.

[Figure 1 column titles: 1. Start with an $\ell$-clique; 2. Find an adjacent edge to form an $\ell$-wedge; 3. Check for an $(\ell+1)$-clique]

Figure 1: Overview of higher-order clustering coefficients as clique expansion probabilities. The $\ell$th-order clustering coefficient $C_\ell$ measures the probability that an $\ell$-clique and an adjacent edge, i.e., an $\ell$-wedge, is closed, meaning that the $\ell - 1$ possible edges between the $\ell$-clique and the outside node in the adjacent edge exist to form an $(\ell+1)$-clique.

Here, we provide a framework to quantify higher-order clustering in networks by measuring the normalized frequency at which higher-order cliques are closed, which we call higher-order clustering coefficients. We derive our higher-order clustering coefficients by extending a novel interpretation of the classical clustering coefficient as a form of clique expansion (Fig. 1). We then derive several properties about higher-order clustering coefficients and analyze them under the $G_{n,p}$ and small-world null models.

Using our theoretical analysis as a guide, we analyze the higher-order clustering behavior of real-world networks from a variety of domains. We find that each domain of networks has its own higher-order clustering pattern, which the traditional clustering coefficient does not show on its own. Conventional wisdom in network science posits that practically all real-world networks exhibit clustering; however, we find that not all networks exhibit higher-order clustering. More specifically, once we control for the clustering as measured by the classical clustering coefficient, some networks do not show significant clustering in terms of higher-order cliques. In addition to the theoretical properties and empirical findings exhibited in this paper, our related work also demonstrates a connection between higher-order clustering and community detection Yin et al. (2017).

II Derivation of higher-order clustering coefficients

In this section, we derive our higher-order clustering coefficients and some of their basic properties. We first present an alternative interpretation of the classical clustering coefficient and then show how this novel interpretation seamlessly generalizes to arrive at our definition of higher-order clustering coefficients. We then provide some probabilistic interpretations of higher-order clustering coefficients that will be useful for our subsequent analysis.

II.1 Alternative interpretation of the classical clustering coefficient

Here we give an alternative interpretation of the clustering coefficient that will later allow us to generalize it and quantify clustering of higher-order network structures (this interpretation is summarized in Fig. 1). Our interpretation is based on a notion of clique expansion. First, we consider a 2-clique $K$ in a graph $G$ (that is, a single edge $K$; see Fig. 1, row $C_2$, column 1). Next, we expand the clique $K$ by considering any edge $e$ adjacent to $K$, i.e., $e$ and $K$ share exactly one node (Fig. 1, row $C_2$, column 2). This expanded subgraph forms a wedge, i.e., a length-2 path. The classical global clustering coefficient $C$ of $G$ (sometimes called the transitivity of $G$ Boccaletti et al. (2006)) is then defined as the fraction of wedges that are closed, meaning that the 2-clique and adjacent edge induce a 3-clique, or a triangle (Fig. 1, row $C_2$, column 3) Barrat and Weigt (2000); Luce and Perry (1949). The novelty of our interpretation of the clustering coefficient is considering it as a form of clique expansion, rather than as the closure of a length-2 path, which is key to our generalizations in the next section.

Formally, the classical global clustering coefficient is

$$C = \frac{6\,|K_3|}{|W|}, \tag{1}$$

where $K_3$ is the set of 3-cliques (triangles), $W$ is the set of wedges, and the coefficient 6 comes from the fact that each 3-clique closes 6 wedges: the 6 ordered pairs of edges in the triangle.

We can also reinterpret the local clustering coefficient Watts and Strogatz (1998) in this way. In this case, each wedge again consists of a 2-clique and adjacent edge (Fig. 1, row $C_2$, column 2), and we call the unique node in the intersection of the 2-clique and adjacent edge the center of the wedge. The local clustering coefficient of a node $u$ is the fraction of wedges centered at $u$ that are closed:

$$C(u) = \frac{2\,|K_3(u)|}{|W(u)|}, \tag{2}$$

where $K_3(u)$ is the set of 3-cliques containing $u$ and $W(u)$ is the set of wedges with center $u$ (if $|W(u)| = 0$, we say that $C(u)$ is undefined). The average clustering coefficient $\bar{C}$ is the mean of the local clustering coefficients,

$$\bar{C} = \frac{1}{|\tilde{V}|} \sum_{u \in \tilde{V}} C(u), \tag{3}$$

where $\tilde{V}$ is the set of nodes in the network where the local clustering coefficient is defined.
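As a rough illustration of Eqs. 1–3, the following sketch counts wedges as ordered pairs of distinct edges sharing a center, so that each triangle closes 6 wedges globally and 2 wedges at each of its nodes. The function name `classical_clustering`, the adjacency-dictionary representation, and the toy graph are assumptions made for this example, not part of the original analysis.

```python
from itertools import combinations


def classical_clustering(adj):
    """Return (global C, dict of local C(u), average C) for a simple graph.

    adj maps each node to the set of its neighbours (assumed representation).
    """
    local = {}
    closed_total = 0  # closed wedges, summed over all centers
    wedge_total = 0   # all wedges, summed over all centers
    for u, nbrs in adj.items():
        d = len(nbrs)
        wedges = d * (d - 1)  # ordered pairs of distinct edges at u
        triangles = sum(1 for v, w in combinations(nbrs, 2) if w in adj[v])
        closed = 2 * triangles  # each triangle at u closes 2 ordered wedges
        wedge_total += wedges
        closed_total += closed
        if wedges > 0:
            local[u] = closed / wedges  # Eq. 2
    global_c = closed_total / wedge_total if wedge_total else None  # Eq. 1
    average_c = sum(local.values()) / len(local) if local else None  # Eq. 3
    return global_c, local, average_c


# Toy graph: triangle 0-1-2 with a pendant node 3 attached to node 0.
toy = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
global_c, local_c, average_c = classical_clustering(toy)
print(global_c, local_c, average_c)
```

In the toy graph, node 3 has degree 1 and is the center of no wedge, so it is left out of the local dictionary and of the average, matching the convention that $C(u)$ is undefined there.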

II.2 Generalizing to higher-order clustering coefficients

Our alternative interpretation of the clustering coefficient, described above as a form of clique expansion, leads to a natural generalization to higher-order cliques. Instead of expanding 2-cliques to 3-cliques, we expand $\ell$-cliques to $(\ell+1)$-cliques (Fig. 1, rows $C_3$ and $C_4$). Formally, we define an $\ell$-wedge to consist of an $\ell$-clique and an adjacent edge for $\ell \ge 2$. Then we define the global $\ell$th-order clustering coefficient $C_\ell$ as the fraction of $\ell$-wedges that are closed, meaning that they induce an $(\ell+1)$-clique in the network. We can write this as

$$C_\ell = \frac{(\ell^2 + \ell)\,|K_{\ell+1}|}{|W_\ell|}, \tag{4}$$

where $K_{\ell+1}$ is the set of $(\ell+1)$-cliques, and $W_\ell$ is the set of $\ell$-wedges. The coefficient $\ell^2 + \ell$ comes from the fact that each $(\ell+1)$-clique closes that many wedges: each $(\ell+1)$-clique contains $\ell + 1$ $\ell$-cliques, and each $\ell$-clique contains $\ell$ nodes which may serve as the center of an $\ell$-wedge. Note that the classical definition of the global clustering coefficient given in Eq. 1 is equivalent to the definition in Eq. 4 when $\ell = 2$.

We also define higher-order local clustering coefficients:

$$C_\ell(u) = \frac{\ell\,|K_{\ell+1}(u)|}{|W_\ell(u)|}, \tag{5}$$

where $K_{\ell+1}(u)$ is the set of $(\ell+1)$-cliques containing node $u$, $W_\ell(u)$ is the set of $\ell$-wedges with center $u$ (where the center is the unique node in the intersection of the $\ell$-clique and adjacent edge comprising the wedge; see Fig. 1), and the coefficient $\ell$ comes from the fact that each $(\ell+1)$-clique containing $u$ closes that many $\ell$-wedges in $W_\ell(u)$. The $\ell$th-order clustering coefficient of a node is defined for any node that is the center of at least one $\ell$-wedge, and the average $\ell$th-order clustering coefficient is the mean of the local clustering coefficients:

$$\bar{C}_\ell = \frac{1}{|\tilde{V}_\ell|} \sum_{u \in \tilde{V}_\ell} C_\ell(u), \tag{6}$$

where $\tilde{V}_\ell$ is the set of nodes that are the centers of at least one $\ell$-wedge.

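To understand how to compute higher-order clustering coefficients, we substitute the following useful identity

$$|W_\ell(u)| = |K_\ell(u)| \cdot (d_u - \ell + 1), \tag{7}$$

where $d_u$ is the degree of node $u$, into Eq. 5 to get

$$C_\ell(u) = \frac{\ell\,|K_{\ell+1}(u)|}{(d_u - \ell + 1)\,|K_\ell(u)|}. \tag{8}$$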

From Eq. 8, it is easy to see that we can compute all local $\ell$th-order clustering coefficients by enumerating all $(\ell+1)$-cliques and $\ell$-cliques in the graph. The computational complexity of the algorithm is thus bounded by the time to enumerate $(\ell+1)$-cliques and $\ell$-cliques. Using the Chiba and Nishizeki algorithm Chiba and Nishizeki (1985), the complexity is $O(\ell a^{\ell-2} m)$, where $a$ is the arboricity of the graph, and $m$ is the number of edges. The arboricity $a$ may be as large as $\sqrt{m}$, so this algorithm is only guaranteed to take polynomial time if $\ell$ is a constant. In general, determining if there exists a single clique with at least $\ell$ nodes is NP-complete Karp (1972).

For the global clustering coefficient, note that

$$|W_\ell| = \sum_{u \in V} |W_\ell(u)|. \tag{9}$$

Thus, it suffices to enumerate $\ell$-cliques (to compute $|W_\ell|$ using Eq. 7) and to count the total number of $(\ell+1)$-cliques. In practice, we use the Chiba and Nishizeki algorithm to enumerate cliques and simultaneously compute $C_\ell$ and $C_\ell(u)$ for all nodes $u$. This suffices for our clustering analysis with $\ell = 2, 3, 4$ on networks with over a hundred million edges in Section IV.
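The computation described in this section can be sketched in a few lines. The example uses the local formula $C_\ell(u) = \ell |K_{\ell+1}(u)| / ((d_u - \ell + 1) |K_\ell(u)|)$ and sums wedge counts over nodes for the global coefficient. It is a minimal brute-force sketch for small graphs, not the Chiba and Nishizeki enumeration; the helper names (`cliques_containing`, `local_cc`, `global_cc`) and the toy graph are hypothetical.

```python
from itertools import combinations


def cliques_containing(adj, u, size):
    """Count the `size`-cliques that contain node u (brute force)."""
    count = 0
    for others in combinations(sorted(adj[u]), size - 1):
        if all(b in adj[a] for a, b in combinations(others, 2)):
            count += 1
    return count


def local_cc(adj, u, ell):
    """C_ell(u) via Eq. 8; None if u is the center of no ell-wedge."""
    wedges = cliques_containing(adj, u, ell) * (len(adj[u]) - ell + 1)  # Eq. 7
    if wedges <= 0:
        return None
    return ell * cliques_containing(adj, u, ell + 1) / wedges


def global_cc(adj, ell):
    """C_ell via Eqs. 4, 7 and 9; None if the graph has no ell-wedge."""
    wedges = sum(
        cliques_containing(adj, u, ell) * (len(adj[u]) - ell + 1) for u in adj
    )
    if wedges <= 0:
        return None
    # Summing |K_{ell+1}(u)| over nodes counts every (ell+1)-clique ell+1 times,
    # so ell * sum equals (ell^2 + ell) * |K_{ell+1}|.
    closed = ell * sum(cliques_containing(adj, u, ell + 1) for u in adj)
    return closed / wedges


# Toy graph: a 4-clique on nodes 0..3 plus a pendant node 4 attached to node 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
adj = {u: set() for u in range(5)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

for ell in (2, 3):
    print(ell, global_cc(adj, ell), [local_cc(adj, u, ell) for u in adj])
```

Nodes 1 to 3 sit only inside the clique, so their local coefficients equal 1 at both orders, while the pendant edge at node 0 creates open wedges and lowers its coefficients.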

II.3 Probabilistic interpretations of higher-order clustering coefficients

To facilitate understanding of higher-order clustering coefficients and to aid our analysis in Section III, we present a few probabilistic interpretations of the quantities. First, we can interpret $C_\ell(u)$ as the probability that a wedge $w$ chosen uniformly at random from all wedges centered at $u$ is closed:

$$C_\ell(u) = \mathbb{P}\left[w \text{ is closed}\right], \quad w \text{ uniform over } W_\ell(u). \tag{10}$$

The variant of this interpretation for the classical clustering case of $\ell = 2$ has been useful for graph algorithm development Seshadhri et al. (2013).

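For the next probabilistic interpretation, it is useful to analyze the structure of the 1-hop neighborhood graph $N_1(u)$ of a given node $u$ (not containing node $u$). The vertex set of $N_1(u)$ is the set of all nodes adjacent to $u$, and the edge set consists of all edges between neighbors of $u$, i.e., $\{(v, w) \mid (u, v), (u, w), (v, w) \in E\}$, where $E$ is the edge set of the graph.

Any $\ell$-clique in $G$ containing node $u$ corresponds to a unique $(\ell-1)$-clique in $N_1(u)$, and specifically for $\ell = 2$, any edge $(u, v)$ corresponds to a node $v$ in $N_1(u)$. Therefore, each $\ell$-wedge centered at $u$ corresponds to an $(\ell-1)$-clique $K$ in $N_1(u)$ and one of the $d_u - \ell + 1$ nodes outside $K$ (i.e., in $N_1(u) \setminus K$). Thus, Eq. 8 can be re-written as

$$C_\ell(u) = \frac{\ell\,|K_\ell(N_1(u))|}{(d_u - \ell + 1)\,|K_{\ell-1}(N_1(u))|}, \tag{11}$$

where $|K_k(N_1(u))|$ denotes the number of $k$-cliques in $N_1(u)$.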

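If we uniformly at random select an $(\ell-1)$-clique $K$ from $N_1(u)$ and then also uniformly at random select a node $v$ from $N_1(u)$ outside of this clique, then $C_\ell(u)$ is the probability that these $\ell$ nodes form an $\ell$-clique:

$$C_\ell(u) = \mathbb{P}\left[K \cup \{v\} \in K_\ell(N_1(u))\right]. \tag{12}$$

Moreover, if we condition on observing an $\ell$-clique from this sampling procedure, then the $\ell$-clique itself is selected uniformly at random from all $\ell$-cliques in $N_1(u)$. Therefore, $C_{\ell-1}(u) \cdot C_\ell(u)$ is the probability that an $(\ell-2)$-clique and two nodes selected uniformly at random from $N_1(u)$ form an $\ell$-clique. Applying this recursively gives

$$\prod_{j=2}^{\ell} C_j(u) = \frac{|K_\ell(N_1(u))|}{\binom{d_u}{\ell}}. \tag{13}$$

In other words, the product of the higher-order local clustering coefficients of node $u$ up to order $\ell$ is the $\ell$-clique density amongst $u$'s neighbors.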
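The product identity can be checked numerically on a small graph. The sketch computes each local coefficient from clique counts in the 1-hop neighborhood, $C_j(u) = j |K_j(N_1(u))| / ((d_u - j + 1) |K_{j-1}(N_1(u))|)$, multiplies them for $j = 2, \ldots, \ell$, and compares the result with the $\ell$-clique density $|K_\ell(N_1(u))| / \binom{d_u}{\ell}$. The function names and the toy graph are assumptions for illustration.

```python
from itertools import combinations
from math import comb


def neighborhood_cliques(adj, u, k):
    """Number of k-cliques in the 1-hop neighborhood graph of u (u excluded)."""
    nbrs = sorted(adj[u])
    return sum(
        1
        for nodes in combinations(nbrs, k)
        if all(b in adj[a] for a, b in combinations(nodes, 2))
    )


def local_cc(adj, u, ell):
    """C_ell(u) written in terms of neighborhood clique counts (Eq. 11)."""
    d = len(adj[u])
    denom = (d - ell + 1) * neighborhood_cliques(adj, u, ell - 1)
    if denom <= 0:
        return None
    return ell * neighborhood_cliques(adj, u, ell) / denom


# Toy graph: a 4-clique on nodes 0..3 plus a pendant node 4 attached to node 0.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (0, 4)]
adj = {u: set() for u in range(5)}
for a, b in edges:
    adj[a].add(b)
    adj[b].add(a)

u, ell = 0, 3
product = 1.0
for j in range(2, ell + 1):
    product *= local_cc(adj, u, j)
density = neighborhood_cliques(adj, u, ell) / comb(len(adj[u]), ell)
print(product, density)
```

For node 0 the neighborhood has four nodes and a single triangle, so both sides equal 1/4.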

III Theoretical analysis and higher-order clustering in random graph models

We now provide some theoretical analysis of our higher-order clustering coefficients. We first give some extremal bounds on the values that higher-order clustering coefficients can take given the value of the traditional (second-order) clustering coefficient. After, we analyze the values of higher-order clustering coefficients in two common random graph models: the $G_{n,p}$ and small-world models. The theory from this section will be a useful guide for interpreting the clustering behavior of real-world networks in Section IV.

III.1 Extremal bounds

Figure 2: Example 1-hop neighborhoods of a node $u$ with degree $d$ with different higher-order clustering. Left: For cliques, $C_\ell(u) = 1$ for any $\ell$. Middle: If $u$'s neighbors form a complete bipartite graph, $C_2(u)$ is constant while $C_\ell(u) = 0$, $\ell \ge 3$. Right: If half of $u$'s neighbors form a star and half form a clique with $u$, then $C_\ell(u) \approx \sqrt{C_2(u)}$, which is the upper bound in Proposition 1.

We first analyze the relationships between local higher-order clustering coefficients of different orders. Our technical result is Proposition 1, which provides essentially tight lower and upper bounds for higher-order local clustering coefficients in terms of the traditional local clustering coefficient. The main ideas of the proof are illustrated in Fig. 2.

Proposition 1.

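For any fixed $\ell \ge 3$,

$$0 \le C_\ell(u) \le \sqrt{C_2(u)}. \tag{14}$$

Moreover,

  1. There exists a finite graph $G$ with a node $u$ such that the lower bound is tight and $C_2(u)$ is within $\epsilon$ of any prescribed value in $\left[0, \frac{\ell-2}{\ell-1}\right]$.

  2. There exists a finite graph $G$ with a node $u$ such that $C_\ell(u)$ is within $\epsilon$ of the upper bound for any prescribed value of $C_2(u) \in [0, 1]$.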

Proof.

Clearly, $0 \le C_\ell(u)$ if the local clustering coefficient is well defined. This bound is tight when $N_1(u)$ is $(\ell-1)$-partite, as in the middle column of Fig. 2. In the $(\ell-1)$-partite case, $C_2(u) = \frac{\ell-2}{\ell-1}$. By removing edges from this extremal case in a sufficiently large graph, we can make $C_2(u)$ arbitrarily close to any value in $\left[0, \frac{\ell-2}{\ell-1}\right]$.

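To derive the upper bound, consider the 1-hop neighborhood $N_1(u)$, and let

$$\delta_\ell(N_1(u)) = \frac{|K_\ell(N_1(u))|}{\binom{d_u}{\ell}} \tag{15}$$

denote the $\ell$-clique density of $N_1(u)$. The Kruskal-Katona theorem Kruskal (1963); Katona (1966) implies that

$$\delta_\ell(N_1(u)) \le \left[\delta_{\ell-1}(N_1(u))\right]^{\ell/(\ell-1)}, \qquad \delta_{\ell-1}(N_1(u)) \le \left[\delta_2(N_1(u))\right]^{(\ell-1)/2}.$$

Combining this with Eq. 8 gives

$$C_\ell(u) = \frac{\delta_\ell(N_1(u))}{\delta_{\ell-1}(N_1(u))} \le \left[\delta_{\ell-1}(N_1(u))\right]^{1/(\ell-1)} \le \sqrt{\delta_2(N_1(u))} = \sqrt{C_2(u)},$$

where the last equality uses the fact that $C_2(u)$ is the edge density of $N_1(u)$.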

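The upper bound becomes tight when $N_1(u)$ consists of a clique and isolated nodes (Fig. 2, right) and the neighborhood is sufficiently large. Specifically, let $N_1(u)$ consist of a clique of size $c$ and $b$ isolated nodes. When $\ell = 2$,

$$C_2(u) = \frac{\binom{c}{2}}{\binom{c+b}{2}} = \frac{(c-1)\,c}{(c+b-1)(c+b)} \to \left(\frac{c}{c+b}\right)^2,$$

and by Eq. 11, when $3 \le \ell \le c$,

$$C_\ell(u) = \frac{\ell \binom{c}{\ell}}{(c+b-\ell+1)\binom{c}{\ell-1}} = \frac{c-\ell+1}{c+b-\ell+1} \to \frac{c}{c+b}.$$

By adjusting the ratio $c/(b+c)$ in $N_1(u)$, we can construct a family of graphs such that $C_2(u)$ takes any value in the interval $[0, 1]$ as $d_u \to \infty$ and $C_\ell(u) \to \sqrt{C_2(u)}$ as $d_u \to \infty$. ∎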
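To get a feel for how quickly the clique-plus-isolated-nodes construction approaches the upper bound $\sqrt{C_2(u)}$, the sketch evaluates the neighborhood form of the local coefficient, $C_\ell(u) = \ell |K_\ell(N_1(u))| / ((d_u - \ell + 1) |K_{\ell-1}(N_1(u))|)$, for a neighborhood made of a $c$-clique and $b$ isolated nodes. The closed-form clique counts and the function names are assumptions of this example: the number of $k$-cliques in such a neighborhood is $\binom{c}{k}$ for $k \ge 2$ and $c + b$ for $k = 1$.

```python
from math import comb, sqrt


def clique_count(c, b, k):
    """k-cliques in a neighborhood made of a c-clique plus b isolated nodes."""
    return c + b if k == 1 else comb(c, k)


def local_cc(c, b, ell):
    """C_ell(u) for that neighborhood, from neighborhood clique counts."""
    d = c + b  # degree of the center node u
    return ell * clique_count(c, b, ell) / (
        (d - ell + 1) * clique_count(c, b, ell - 1)
    )


c, b = 1000, 1000  # half of the neighbors in a clique, half isolated
c2 = local_cc(c, b, 2)
c3 = local_cc(c, b, 3)
c4 = local_cc(c, b, 4)
print(c2, sqrt(c2), c3, c4)
```

With a thousand nodes on each side, $C_2(u)$ is close to 1/4 while $C_3(u)$ and $C_4(u)$ are both just under 1/2, within about $10^{-3}$ of $\sqrt{C_2(u)}$.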

The second part of the result requires the neighborhoods to be sufficiently large in order to reach the upper bound. However, we will see later that in some real-world data, there are nodes $u$ for which $C_3(u)$ is close to the upper bound for several values of $C_2(u)$.

Next, we analyze higher-order clustering coefficients in two common random graph models: the Erdős-Rényi model with edge probability $p$ (i.e., the $G_{n,p}$ model Erdös and Rényi (1959)) and the small-world model Watts and Strogatz (1998).

III.2 Analysis for the $G_{n,p}$ model

Now, we analyze higher-order clustering coefficients in the classical Erdős-Rényi random graph model, where each edge exists independently with probability $p$ (i.e., the $G_{n,p}$ model Erdös and Rényi (1959)). We implicitly assume that $\ell$ is small in the following analysis so that there should be at least one $\ell$-wedge in the graph (with high probability and $n$ large, there is no clique of size greater than $(2 + \epsilon) \log n / \log(1/p)$ for any $\epsilon > 0$ Bollobás and Erdös (1976)). Therefore, the global and local clustering coefficients are well-defined.

In the $G_{n,p}$ model, we first observe that any $\ell$-wedge is closed if and only if the $\ell - 1$ possible edges between the $\ell$-clique and the outside node in the adjacent edge exist to form an $(\ell+1)$-clique. Each of the $\ell - 1$ edges exists independently with probability $p$ in the $G_{n,p}$ model, which means that the higher-order clustering coefficients should scale as $p^{\ell-1}$. We formalize this in the following proposition.

Proposition 2.

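Let $G$ be a random graph drawn from the $G_{n,p}$ model. For constant $\ell$,

  1. $\mathbb{E}_G[C_\ell] = p^{\ell-1}$

  2. $\mathbb{E}_G[C_\ell(u) \mid |W_\ell(u)| > 0] = p^{\ell-1}$ for any node $u$

  3. $\mathbb{E}_G[\bar{C}_\ell] = p^{\ell-1}$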

Proof.

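We prove the first part by conditioning on the set of $\ell$-wedges, $W_\ell$:

$$\mathbb{E}_G[C_\ell] = \mathbb{E}_G\left[\mathbb{E}\left[C_\ell \mid W_\ell\right]\right] = \mathbb{E}_G\left[\frac{1}{|W_\ell|} \sum_{w \in W_\ell} \mathbb{P}[w \text{ is closed}]\right] = \mathbb{E}_G\left[\frac{1}{|W_\ell|} \sum_{w \in W_\ell} p^{\ell-1}\right] = p^{\ell-1}.$$

As noted above, the second equality is well defined (with high probability) for small $\ell$. The third equality comes from the fact that any $\ell$-wedge is closed if and only if the $\ell - 1$ possible edges between the $\ell$-clique and the outside node in the adjacent edge exist to form an $(\ell+1)$-clique.

The proof of the second part is essentially the same, except we condition over the set of possible cases where $|W_\ell(u)| > 0$.

Recall that $\tilde{V}_\ell$ is the set of nodes at the center of at least one $\ell$-wedge. To prove the third part, we take the conditional expectation over $\tilde{V}_\ell$ and use our result from the second part. ∎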
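The $p^{\ell-1}$ scaling can be checked empirically on one sampled graph. The sketch draws a single $G_{n,p}$ instance with a fixed seed and computes the global coefficients as $C_\ell = \ell \sum_u |K_{\ell+1}(u)| / \sum_u |W_\ell(u)|$ with $|W_\ell(u)| = |K_\ell(u)| (d_u - \ell + 1)$. The values of $n$, $p$, the seed, and all function names are arbitrary choices for this illustration; a single sample only approximates the expectation.

```python
import random
from itertools import combinations


def gnp(n, p, seed):
    """Sample an Erdos-Renyi graph as a dict of neighbour sets."""
    rng = random.Random(seed)
    adj = {u: set() for u in range(n)}
    for u, v in combinations(range(n), 2):
        if rng.random() < p:
            adj[u].add(v)
            adj[v].add(u)
    return adj


def cliques_containing(adj, u, size):
    """Count the `size`-cliques that contain node u (brute force)."""
    return sum(
        1
        for others in combinations(sorted(adj[u]), size - 1)
        if all(b in adj[a] for a, b in combinations(others, 2))
    )


def global_cc(adj, ell):
    """Global ell-th order clustering coefficient; None without ell-wedges."""
    wedges = sum(
        cliques_containing(adj, u, ell) * (len(adj[u]) - ell + 1) for u in adj
    )
    if wedges <= 0:
        return None
    closed = ell * sum(cliques_containing(adj, u, ell + 1) for u in adj)
    return closed / wedges


n, p = 80, 0.3
graph = gnp(n, p, seed=0)
c2 = global_cc(graph, 2)
c3 = global_cc(graph, 3)
print(c2, p ** 1)  # expected to be close to p
print(c3, p ** 2)  # expected to be close to p^2
```

Brute-force enumeration is only practical at this small scale; larger graphs would need a proper clique-listing algorithm.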

The above results say that the global, local, and average $\ell$th-order clustering coefficients decrease exponentially in $\ell$. It turns out that if we also condition on the second-order clustering coefficient having some fixed value, then the higher-order clustering coefficients still decay exponentially in $\ell$ for the $G_{n,p}$ model. This will be useful for interpreting the distribution of local clustering coefficients on real-world networks.

Proposition 3.

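Let $G$ be a random graph drawn from the $G_{n,p}$ model. Then for constant $\ell$,

$$\mathbb{E}_G\left[C_\ell(u) \mid C_2(u), |W_\ell(u)| > 0\right] = \left(C_2(u) - (1 - C_2(u)) \cdot O(1/d_u^2)\right)^{\ell-1}.$$

Proof.

Similar to the proof of Proposition 2, we look at the conditional expectation over $W_\ell(u)$:

$$\mathbb{E}_G\left[C_\ell(u) \mid C_2(u), W_\ell(u)\right] = \frac{1}{|W_\ell(u)|} \sum_{w \in W_\ell(u)} \mathbb{P}\left[w \text{ is closed} \mid C_2(u), w \in W_\ell(u)\right].$$

Now, note that $N_1(u)$ has $m = C_2(u) \cdot \binom{d_u}{2}$ edges. Knowing that $w \in W_\ell(u)$ accounts for $\binom{\ell-1}{2}$ of these edges. By symmetry, the other $q = m - \binom{\ell-1}{2}$ edges appear in any of the remaining $r = \binom{d_u}{2} - \binom{\ell-1}{2}$ pairs of nodes uniformly at random. There are $\binom{r}{q}$ ways to place these edges, of which $\binom{r-\ell+1}{q-\ell+1}$ would close the wedge $w$. Thus,

$$\mathbb{P}\left[w \text{ is closed} \mid C_2(u), w \in W_\ell(u)\right] = \frac{\binom{r-\ell+1}{q-\ell+1}}{\binom{r}{q}} = \prod_{k=0}^{\ell-2} \frac{q-k}{r-k}.$$

Now, for any small nonnegative integer $k$,

$$\frac{q-k}{r-k} = C_2(u) - (1 - C_2(u)) \cdot \frac{\binom{\ell-1}{2} + k}{r-k} = C_2(u) - (1 - C_2(u)) \cdot O(1/d_u^2).$$

(Recall that $\ell$ is constant by assumption, so the big-O notation is appropriate). The above expression approaches $C_2(u)$ when $C_2(u) \to 1$ as well as when $d_u \to \infty$. ∎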

Proposition 3 says that even if the second-order local clustering coefficient is large, the $\ell$th-order clustering coefficient will still decay exponentially in $\ell$, at least in the limit as $d_u$ grows large. By examining higher-order clique closures, this allows us to distinguish between nodes $u$ whose neighborhoods are “dense but random” ($C_2(u)$ is large but $C_\ell(u) \approx C_2(u)^{\ell-1}$) or “dense and structured” ($C_2(u)$ is large and $C_\ell(u) > C_2(u)^{\ell-1}$). Only the latter case exhibits higher-order clustering. We use this in our analysis of real-world networks in Section IV.

III.3 Analysis for the small-world model

We also study higher-order clustering in the small-world random graph model Watts and Strogatz (1998). The model begins with a ring network where each node connects to its $2k$ nearest neighbors. Then, for each node $u$ and each of the $k$ edges $(u, v)$ with $v$ following $u$ clockwise in the ring, the edge is rewired to $(u, w)$ with probability $p$, where $w$ is chosen uniformly at random.

Figure 3: Average higher-order clustering coefficient $\bar{C}_\ell$ as a function of rewiring probability $p$ in small-world networks for $\ell = 2, 3, 4$ ($n = 20{,}000$, $k = 5$). Proposition 4 shows that the $\ell$th-order clustering coefficient when $p = 0$ predicts that the clustering should decrease modestly as $\ell$ increases.

With no rewiring ($p = 0$) and $k \ll n$, it is known that $\bar{C}_2 \approx 3/4$ Watts and Strogatz (1998). As $p$ increases, the average clustering coefficient $\bar{C}_2$ slightly decreases until a phase transition near $p = 0.1$, where $\bar{C}_2$ decays to $0$ Watts and Strogatz (1998) (also see Fig. 3). Here, we generalize these results for higher-order clustering coefficients.

Proposition 4.

In the small-world model without rewiring ($p = 0$),

$$\bar{C}_\ell \to \frac{\ell+1}{2\ell}$$

for any constant $\ell$ as $n \to \infty$ and $k \to \infty$ while $2k < n$.

Proof.

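Applying Eq. 8, it suffices to show that

$$|K_\ell(u)| = \frac{\ell}{(\ell-1)!}\, k^{\ell-1} + O(k^{\ell-2}) \tag{16}$$

as

$$C_\ell(u) = \frac{\ell\,|K_{\ell+1}(u)|}{(2k - \ell + 1)\,|K_\ell(u)|} = \frac{\ell \cdot \frac{\ell+1}{\ell!}\, k^{\ell} + O(k^{\ell-1})}{(2k - \ell + 1)\left(\frac{\ell}{(\ell-1)!}\, k^{\ell-1} + O(k^{\ell-2})\right)},$$

which approaches $\frac{\ell+1}{2\ell}$ as $k \to \infty$.

Now we give a derivation of Eq. 16. We first label the $2k$ neighbors of $u$ as $1, 2, \ldots, 2k$ by their clockwise ordering in the ring. Since $2k < n$, these nodes are unique. Next, define the span of any $\ell$-clique containing $u$ as the difference between the largest and smallest label of the nodes in the clique other than $u$. The span $s$ of any $\ell$-clique satisfies $s \le k - 1$ since any node is directly connected with a node of label difference no greater than $k - 1$. Also, $s \ge \ell - 2$ since there are $\ell - 1$ nodes in an $\ell$-clique other than $u$. For each span $s$, we can find $2k - s$ pairs of $(i, j)$ such that $1 \le i, j \le 2k$, and $j - i = s$. Finally, for every such pair $(i, j)$, there are $\binom{s-1}{\ell-3}$ choices of $\ell - 3$ nodes between $i$ and $j$ which will form an $\ell$-clique together with nodes $u$, $i$, and $j$. Therefore,

$$|K_\ell(u)| = \sum_{s=\ell-2}^{k-1} (2k - s) \binom{s-1}{\ell-3}.$$

If we ignore lower-order terms and note that $\binom{s-1}{\ell-3} \approx \frac{s^{\ell-3}}{(\ell-3)!}$, we get

$$|K_\ell(u)| \approx \sum_{s=\ell-2}^{k-1} (2k - s)\, \frac{s^{\ell-3}}{(\ell-3)!} \approx \frac{\ell}{(\ell-1)!}\, k^{\ell-1}. \qquad ∎$$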

Proposition 4 shows that, when $p = 0$, $\bar{C}_\ell$ decreases as $\ell$ increases. Furthermore, via simulation, we observe the same behavior as for $\bar{C}_2$ when adjusting the rewiring probability $p$ (Fig. 3). Regardless of $\ell$, the phase transition happens near $p = 0.1$. Essentially, once there is enough rewiring, all local clique structure is lost, and clustering at all orders is lost. This is partly a consequence of Proposition 1, which says that $C_\ell(u) \to 0$ as $C_2(u) \to 0$ for any $\ell$.
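The no-rewiring case can be verified by brute force on a small ring lattice. The sketch builds a ring in which every node is joined to its $2k$ nearest neighbors and evaluates $C_\ell(u) = \ell |K_{\ell+1}(u)| / ((d_u - \ell + 1) |K_\ell(u)|)$ directly. For comparison it uses the finite-$k$ expression $(\ell+1)(k-\ell+1) / (\ell (2k-\ell+1))$, which is an assumption of this example obtained by counting cliques in the ring (valid when $n > 3k$), and which tends to $(\ell+1)/(2\ell)$ as $k$ grows. All function names are hypothetical.

```python
from itertools import combinations


def ring_lattice(n, k):
    """Ring of n nodes, each joined to its k nearest neighbours on each side."""
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for step in range(1, k + 1):
            v = (u + step) % n
            adj[u].add(v)
            adj[v].add(u)
    return adj


def cliques_containing(adj, u, size):
    """Count the `size`-cliques that contain node u (brute force)."""
    return sum(
        1
        for others in combinations(sorted(adj[u]), size - 1)
        if all(b in adj[a] for a, b in combinations(others, 2))
    )


def local_cc(adj, u, ell):
    """Local ell-th order clustering coefficient of u; None if undefined."""
    wedges = cliques_containing(adj, u, ell) * (len(adj[u]) - ell + 1)
    if wedges <= 0:
        return None
    return ell * cliques_containing(adj, u, ell + 1) / wedges


def finite_k_value(k, ell):
    """Assumed closed form for the ring lattice; tends to (ell+1)/(2*ell)."""
    return (ell + 1) * (k - ell + 1) / (ell * (2 * k - ell + 1))


n, k = 30, 4
ring = ring_lattice(n, k)
values = {ell: local_cc(ring, 0, ell) for ell in (2, 3, 4)}
for ell, value in values.items():
    print(ell, value, finite_k_value(k, ell), (ell + 1) / (2 * ell))
```

By symmetry every node has the same local coefficient, so the value at node 0 is also the average coefficient. At $k = 4$ the values are still well below their large-$k$ limits.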

IV Experimental results on real-world networks

We now analyze the higher-order clustering of real-world networks. We first study how the higher-order global and average clustering coefficients vary as we increase the order of the clustering coefficient on a collection of 20 networks from several domains. After, we concentrate on a few representative networks and compare the higher-order clustering of real-world networks to null models. We find that only some networks exhibit higher-order clustering once the traditional clustering coefficient is controlled. Finally, we examine the local clustering of real-world networks.

IV.1 Higher-order global and average clustering

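Network Nodes Edges $C_2$ $C_3$ $C_4$ $\bar{C}_2$ $\bar{C}_3$ $\bar{C}_4$ $|\tilde{V}_2|/|V|$ $|\tilde{V}_3|/|V|$ $|\tilde{V}_4|/|V|$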
Erdős-Rényi Erdös and Rényi (1959) 1,000 99,831 0.200 0.040 0.008 0.200 0.040 0.008 1.000 1.000 1.000
Small-world Watts and Strogatz (1998) 20,000 100,000 0.480 0.359 0.229 0.489 0.350 0.205 1.000 1.000 0.999
P. pacificus Bumbarger et al. (2013) 50 576 0.015 0.051 0.035 0.073 0.052 0.034 0.880 0.580 0.440
C. elegans Watts and Strogatz (1998) 297 2,148 0.181 0.080 0.056 0.308 0.137 0.062 0.949 0.926 0.808
Drosophila-medulla Takemura et al. (2013) 1,781 32,311 0.000 0.002 0.001 0.116 0.061 0.024 0.803 0.616 0.425
mouse-retina Helmstaedter et al. (2013) 1,076 577,350 0.008 0.038 0.029 0.033 0.100 0.085 0.998 0.996 0.994
fb-Stanford Traud et al. (2012) 11,621 568,330 0.157 0.107 0.116 0.253 0.181 0.157 0.955 0.922 0.877
fb-Cornell Traud et al. (2012) 18,660 790,777 0.136 0.106 0.121 0.225 0.169 0.148 0.973 0.951 0.923
Pokec Takac and Zabovsky (2012) 1,632,803 22,301,964 0.047 0.044 0.046 0.122 0.084 0.061 0.900 0.675 0.508
Orkut Mislove et al. (2007) 3,072,441 117,185,083 0.041 0.022 0.019 0.170 0.131 0.110 0.978 0.949 0.878
arxiv-HepPh Leskovec et al. (2007) 12,008 118,505 0.659 0.749 0.788 0.698 0.586 0.519 0.876 0.723 0.567
arxiv-AstroPh Leskovec et al. (2007) 18,772 198,050 0.318 0.326 0.359 0.677 0.609 0.561 0.932 0.839 0.740
congress-committees Porter et al. (2005) 871 248,848 0.037 0.080 0.063 0.082 0.142 0.126 1.000 1.000 1.000
DBLP Yang and Leskovec (2015) 317,080 1,049,866 0.306 0.634 0.821 0.732 0.613 0.517 0.864 0.675 0.489
email-Enron-core Klimt and Yang (2004) 148 1,356 0.383 0.245 0.192 0.496 0.363 0.277 0.966 0.946 0.946
email-Eu-core Yin et al. (2017); Leskovec et al. (2007) 1,005 16,064 0.267 0.170 0.135 0.450 0.329 0.264 0.887 0.847 0.784
CollegeMsg Panzarasa et al. (2009) 1,899 41,579 0.004 0.005 0.003 0.053 0.017 0.006 0.829 0.591 0.332
wiki-Talk Leskovec et al. (2010) 2,394,385 4,659,565 0.002 0.011 0.010 0.201 0.081 0.051 0.262 0.077 0.027
oregon2-010526 Leskovec et al. (2005) 11,461 32,730 0.037 0.085 0.097 0.494 0.294 0.300 0.711 0.269 0.121
as-caida-20071105 Leskovec et al. (2005) 26,475 53,381 0.007 0.012 0.015 0.333 0.159 0.134 0.625 0.171 0.060
p2p-Gnutella31 Ripeanu et al. (2002); Leskovec et al. (2007) 62,586 147,892 0.004 0.003 0.000 0.010 0.001 0.000 0.542 0.067 0.001
as-skitter Leskovec et al. (2005) 1,696,415 11,095,298 0.005 0.007 0.011 0.296 0.126 0.109 0.871 0.633 0.335
Table 1: Higher-order clustering coefficients on random graph models, neural connections, online social networks, collaboration networks, human communication, and technological systems. Broadly, networks from the same domain have similar higher-order clustering characteristics. Since $\tilde{V}_\ell$ is the set of nodes at the center of at least one $\ell$-wedge (see Eq. 6), $|\tilde{V}_\ell|/|V|$ is the fraction of nodes at the center of at least one $\ell$-wedge (the higher-order average clustering coefficient is only measured over those nodes participating in at least one $\ell$-wedge).

We compute and analyze the higher-order clustering for networks from a variety of domains (Table 1). We briefly describe the collection of networks and their categorization below:

  1. Two synthetic networks: a random instance of an Erdős-Rényi graph with $n = 1{,}000$ nodes and edge probability $p = 0.2$ and a small-world network with $n = 20{,}000$ nodes, $k = 5$, and rewiring probability $p = 0.1$;

  2. Four neural networks: the complete neural systems of the nematode worms P. pacificus and C. elegans as well as the neural connections of the Drosophila medulla and mouse retina;

  3. Four online social networks: two Facebook friendship networks between students at universities from 2005 (fb-Stanford, fb-Cornell) and two complete online friendship networks (Pokec and Orkut);

  4. Four collaboration networks: two co-authorship networks constructed from arxiv submission categories (arxiv-AstroPh and arxiv-HepPh), a co-authorship network constructed from DBLP, and the co-committee membership network of United States congresspersons (congress-committees);

  5. Four human communication networks: two email networks (email-Enron-core, email-Eu-core), a Facebook-like messaging network from a college (CollegeMsg), and the edits of user talk pages by other users on Wikipedia (wiki-Talk); and

  6. Four technological systems networks: three autonomous systems (oregon2-010526, as-caida-20071105, as-skitter) and a peer-to-peer connection network (p2p-Gnutella31).

In all cases, we take the edges as undirected, even if the original network data is directed.

Table 1 lists the $\ell$th-order global and average clustering coefficients for $\ell = 2, 3, 4$ as well as the fraction of nodes that are the center of at least one $\ell$-wedge (recall that the average clustering coefficient is the mean only over higher-order local clustering coefficients of nodes participating in at least one $\ell$-wedge; see Kaiser (2008) for a discussion on how this can affect network analyses). We highlight some important trends in the raw clustering coefficients, and in the next section, we focus on higher-order clustering compared to what one gets in a null model.

Propositions 2 and 4 say that we should expect the higher-order global and average clustering coefficients to decrease as we increase the order $\ell$ for both the Erdős-Rényi and small-world models, and indeed $\bar{C}_2 > \bar{C}_3 > \bar{C}_4$ for these networks. This trend also holds for most of the real-world networks (mouse-retina, congress-committees, and oregon2-010526 are the exceptions). Thus, when averaging over nodes, higher-order cliques are overall less likely to close in both the synthetic and real-world networks.

C. elegans fb-Stanford arxiv-AstroPh email-Enron-core oregon2-010526
original CM MRCN original CM MRCN original CM MRCN original CM MRCN original CM MRCN
Table 2: Average higher-order clustering coefficients for five networks as well as the clustering with respect to two null models: a Configuration Model (CM) that samples random graphs with the same degree distribution Bollobás (1980); Milo et al. (2003), and Maximally Random Clustered Networks (MRCN) that preserve degree distribution as well as $\bar{C}_2$ Park and Newman (2004); Colomer-de Simón et al. (2013). For the random networks, we report the mean over 100 samples. An asterisk ($*$) denotes when the value in the original network is at least five standard deviations above the mean and a dagger ($\dagger$) denotes when the value in the original network is at least five standard deviations below the mean. Although all networks exhibit clustering with respect to CM, only some of the networks exhibit higher-order clustering when controlling for $\bar{C}_2$ with MRCN.

The relationship between the higher-order global clustering coefficient $C_\ell$ and the order $\ell$ is less uniform over the datasets. For the three co-authorship networks (arxiv-HepPh, arxiv-AstroPh, and DBLP) and the three autonomous systems networks (oregon2-010526, as-caida-20071105, and as-skitter), $C_\ell$ increases with $\ell$, although the base clustering levels are much higher for co-authorship networks. This is not simply due to the presence of cliques, since a clique has the same clustering for any order (Fig. 2, left). Instead, these datasets have nodes that serve as the center of a star and also participate in a clique (Fig. 2, right; see also Proposition 1). On the other hand, $C_\ell$ decreases with $\ell$ for the two email networks and the two nematode worm neural networks. Finally, the change in $C_\ell$ need not be monotonic in $\ell$. In three of the four online social networks, $C_3 < C_2$ but $C_4 > C_3$.

Overall, the trends in the higher-order clustering coefficients can be different within one of our dataset categories, but tend to be uniform within sub-categories: the change of $C_\ell$ and $\bar{C}_\ell$ with $\ell$ is the same for the two nematode worms within the neural networks, the two email networks within the communication networks, and the three co-authorship networks within the collaboration networks. These trends hold even if the (classical) second-order clustering coefficients differ substantially in absolute value.

While the raw clustering values are informative, it is also useful to compare the clustering to what one expects from null models. We find in the next section that this reveals additional insights into our data.

IV.2 Comparison against null models

For one real-world network from each dataset category, we also measure the higher-order clustering coefficients with respect to two null models (Table 2). First, we compare against the Configuration Model (CM) that samples uniformly from simple graphs with the same degree distribution Bollobás (1980); Milo et al. (2003). In real-world networks, $\bar{C}_2$ is much larger than expected with respect to the CM null model. We find that the same holds for $\bar{C}_3$.

Second, we use a null model that samples graphs preserving both degree distribution and $\bar{C}_2$. Specifically, these are samples from an ensemble of exponential graphs where the Hamiltonian measures the absolute value of the difference between $\bar{C}_2$ of the original network and of the sampled network Park and Newman (2004). Such samples are referred to as Maximally Random Clustered Networks (MRCN) and are sampled with a simulated annealing procedure Colomer-de Simón et al. (2013). Comparing $\bar{C}_3$ between the real-world and the null network, we observe different behavior in higher-order clustering across our datasets. Compared to the MRCN null model, C. elegans has significantly less than expected higher-order clustering (in terms of $\bar{C}_3$), the Facebook friendship and autonomous system networks have significantly more than expected higher-order clustering, and the co-authorship and email networks have slightly (but not significantly) more than expected higher-order clustering (Table 2). Put another way, all real-world networks exhibit clustering in the classical sense of triadic closure. However, the higher-order clustering coefficients reveal that the friendship and autonomous systems networks exhibit significant clustering beyond what is given by triadic closure. These results suggest the need for models that directly account for closure in node neighborhoods Bhat et al. (2016); Lambiotte et al. (2016).

Our finding about the lack of higher-order clustering in C. elegans agrees with previous results that 4-cliques are under-expressed, while open 3-wedges related to cooperative information propagation are over-expressed Benson et al. (2016); Milo et al. (2002); Varshney et al. (2011). This also lends credence to the “3-layer” model of C. elegans Varshney et al. (2011). The observed clustering in the friendship network is consistent with prior work showing the relative infrequency of open 3-wedges in many Facebook network subgraphs with respect to a null model accounting for triadic closure Ugander et al. (2013). Co-authorship networks and email networks are both constructed from “events” that create multiple edges: a paper with $k$ authors induces a $k$-clique in the co-authorship graph and an email sent from one address to $k$ others induces $k$ edges. This event-driven graph construction creates enough closure structure so that the average third-order clustering coefficient is not much larger than in random graphs where the classical second-order clustering coefficient and degree sequence are kept the same.

Figure 4: Top row: Joint distributions of (, ) for (A) C. elegans, (B) Facebook friendship, (C) arxiv co-authorship, (D) email, and (E) autonomous systems networks. Each blue dot represents a node, and the red curve tracks the average over logarithmic bins. The upper trend line is the bound in Eq. 14, and the lower trend line is expected Erdős-Rényi behavior from Proposition 3. Bottom row: Average higher-order clustering coefficients as a function of degree.
Figure 5: Analogous plots of Fig. 4 for synthetic (A) Erdős-Rényi and (B) small-world networks. Top row: Joint distributions of (, ). Bottom row: Average higher-order clustering coefficients as a function of degree.

We emphasize that simple clique counts are not sufficient to obtain these results. For example, the discrepancy in the third-order average clustering of C. elegans and the MRCN null model is not simply due to the presence of 4-cliques. The original neural network has nearly twice as many 4-cliques (2,010) as the samples from the MRCN model (mean 1,006.2, standard deviation 73.6), but the third-order clustering coefficient is larger in MRCN. The reason is that clustering coefficients normalize clique counts with respect to opportunities for closure.
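The normalization effect can be illustrated on toy graphs. The following is a minimal sketch, not the code used for the experiments. It assumes the convention suggested by the text, in which the classical coefficient is second-order: an ℓ-wedge centered at u is an ℓ-clique containing u together with an edge from u to a node outside the clique, and the wedge is closed when the clique and that node form an (ℓ+1)-clique. The function names and the two example graphs are hypothetical.

```python
from itertools import combinations

def build_adjacency(edges):
    """Undirected adjacency sets from an edge list."""
    adj = {}
    for a, b in edges:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj

def cliques_containing(adj, u, size):
    """All cliques on `size` nodes that contain u (brute force, toy graphs only)."""
    found = []
    for others in combinations(sorted(adj[u]), size - 1):
        if all(b in adj[a] for a, b in combinations(others, 2)):
            found.append(frozenset(others) | {u})
    return found

def count_cliques(adj, size):
    """Each clique is found once per member node, so divide by its size."""
    return sum(len(cliques_containing(adj, u, size)) for u in adj) // size

def global_cc(adj, ell):
    """Fraction of ell-wedges (over all center nodes) that are closed."""
    closed = total = 0
    for u in adj:
        for clique in cliques_containing(adj, u, ell):
            for v in adj[u] - clique:          # edge (u, v) leaving the clique
                total += 1
                if clique - {u} <= adj[v]:     # v adjacent to all other clique nodes
                    closed += 1
    return closed / total if total else float("nan")

# Toy graph 1: a single 4-clique.
one_k4 = build_adjacency(combinations([0, 1, 2, 3], 2))
# Toy graph 2: two 4-cliques sharing node 0.
two_k4 = build_adjacency(list(combinations([0, 1, 2, 3], 2))
                         + list(combinations([0, 4, 5, 6], 2)))

print(count_cliques(one_k4, 4), round(global_cc(one_k4, 3), 3))  # 1 1.0
print(count_cliques(two_k4, 4), round(global_cc(two_k4, 3), 3))  # 2 0.571
```

The second graph has twice as many 4-cliques, but the shared node is the center of many open 3-wedges that span the two cliques, so its third-order coefficient is lower. This mirrors, on a much smaller scale, how a network can contain more 4-cliques than its null model and still have lower third-order clustering.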

Thus far, we have analyzed global and average higher-order clustering, which both summarize the clustering of the entire network. In the next section, we look at more localized properties, namely the distribution of higher-order local clustering coefficients and the higher-order average clustering coefficient as a function of node degree.

iv.3 Higher-order local clustering coefficients and degree dependencies

We now examine more localized clustering properties of our networks. Figure 4 (top) plots the joint distribution of and for the five networks analyzed in Table 2, and Fig. 5 (top) provides the analogous plots for the Erdős-Rényi and small-world networks. In these plots, the lower dashed trend line represents the expected Erdős-Rényi behavior, i.e., the expected clustering if the edges in the neighborhood of a node were configured randomly, as formalized in Proposition 3. The upper dashed trend line is the maximum possible value of given , as given by Proposition 1.

For many nodes in C. elegans, local clustering is nearly random (Fig. 4A, top), i.e., resembles the Erdős-Rényi joint distribution (Fig. 5A, top). In other words, there are many nodes that lie on the lower trend line. This provides further evidence that C. elegans lacks higher-order clustering. In the arxiv co-authorship network, there are many nodes with a large value of that have an even larger value of near the upper bound of Eq. 14 (see the inset of Fig. 4C, top). This implies that some nodes appear both in cliques and as the center of star-like patterns, as in Fig. 2. On the other hand, only a handful of nodes in the Facebook friendship, Enron email, and Oregon autonomous systems networks are close to the upper bound (insets of Figs. 4B, 4D, and 4E, top).

Figures 4 and 5 (bottom) plot higher-order average clustering as a function of node degree in the real-world and synthetic networks. In the Erdős-Rényi, small-world, C. elegans, and Enron email networks, there is a distinct gap between the average higher-order clustering coefficients for nodes of all degrees. Thus, our previous finding that the average clustering coefficient decreases with in these networks is independent of degree. In the Facebook friendship network, is larger than and on average for nodes of all degrees, but and are roughly the same for nodes of all degrees, which means that 4-cliques and 5-cliques close at roughly the same rate, independent of degree, albeit at a smaller rate than traditional triadic closure (Fig. 4B, bottom). In the co-authorship network, nodes have roughly the same for , , , which means that -cliques close at about the same rate, independent of (Fig. 4C, bottom). In the Oregon autonomous systems network, we see that, on average, for nodes with large degree (Fig. 4E, bottom). This explains how the global clustering coefficient increases with the order, but the average clustering does not, as observed in Table 1.
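The degree-dependent averages shown in the bottom rows can be sketched as follows. This is a hedged illustration under stated assumptions rather than the procedure used to produce the figures: it takes the local ℓ-th order coefficient of a node to be the fraction of ℓ-wedges centered at that node that are closed (with the classical coefficient as the second-order case), it skips nodes that are the center of no ℓ-wedge, and it groups nodes by exact degree instead of logarithmic bins. The names `local_cc` and `average_by_degree` and the toy graph are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def local_cc(adj, u, ell):
    """Fraction of ell-wedges centered at u that are closed.

    Returns None when u is the center of no ell-wedge (assumption:
    such nodes are left out of the averages).
    """
    closed = total = 0
    for others in combinations(sorted(adj[u]), ell - 1):
        if not all(b in adj[a] for a, b in combinations(others, 2)):
            continue  # u together with `others` is not an ell-clique
        for v in adj[u] - set(others):        # edge (u, v) leaving the clique
            total += 1
            closed += all(v in adj[w] for w in others)
    return closed / total if total else None

def average_by_degree(adj, ell):
    """Mean local ell-th order clustering coefficient for each node degree."""
    buckets = defaultdict(list)
    for u in adj:
        value = local_cc(adj, u, ell)
        if value is not None:
            buckets[len(adj[u])].append(value)
    return {deg: sum(vals) / len(vals) for deg, vals in sorted(buckets.items())}

# Toy graph: two 4-cliques sharing node 0 (node 0 has degree 6, the rest degree 3).
adj = defaultdict(set)
for group in ([0, 1, 2, 3], [0, 4, 5, 6]):
    for a, b in combinations(group, 2):
        adj[a].add(b)
        adj[b].add(a)

print(average_by_degree(adj, 2))  # {3: 1.0, 6: 0.4}
print(average_by_degree(adj, 3))  # {3: 1.0, 6: 0.25}
```

In this toy graph, the gap between the second-order and third-order averages appears only at the high-degree node, which is the kind of degree dependence the bottom-row plots are meant to expose.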

V Discussion

We have proposed higher-order clustering coefficients to study higher-order closure patterns in networks. These coefficients generalize the widely used clustering coefficient that measures triadic closure. Our work complements other recent developments on the importance of higher-order information in network navigation Rosvall et al. (2014); Scholtes (2017) and on temporal community structure Sekara et al. (2016); in contrast, we examine higher-order clique closure and only implicitly consider time as a motivation for closure.

Prior efforts in generalizing clustering coefficients have focused on shortest paths Fronczak et al. (2002), cycle formation Caldarelli et al. (2004), and triangle frequency in -hop neighborhoods Andrade et al. (2006); Jiang and Claramunt (2004). Such approaches fail to capture closure patterns of cliques, suffer from challenging computational issues, and are difficult to theoretically analyze in random graph models more sophisticated than the Erdős-Rényi model. On the other hand, our higher-order clustering coefficients are simple but effective measurements that are analyzable and easily computable (we rely only on clique enumeration, a well-studied algorithmic task). Furthermore, our methodology provides new insights into the clustering behavior of several real-world networks and random graph models, and our theoretical analysis provides intuition for the way in which higher-order clustering coefficients describe local clustering in graphs.

Finally, we focused on higher-order clustering coefficients as a global network measurement and as a node-level measurement, and in related work we also show that large higher-order clustering implies the existence of mesoscale clique-dense community structure Yin et al. (2017).

Acknowledgements.

This research has been supported in part by NSF IIS-1149837, ARO MURI, DARPA, ONR, Huawei, and Stanford Data Science Initiative. We thank Will Hamilton and Marinka Žitnik for insightful comments. We thank Mason Porter and Peter Mucha for providing the congress committee membership data.

References

  • Newman (2003) M. E. J. Newman, SIAM Review 45, 167 (2003).
  • Rapoport (1953) A. Rapoport, The Bulletin of Mathematical Biophysics 15, 523 (1953).
  • Watts and Strogatz (1998) D. J. Watts and S. H. Strogatz, Nature 393, 440 (1998).
  • Granovetter (1973) M. S. Granovetter, American Journal of Sociology 78, 1360 (1973).
  • Wu and Holme (2009) Z.-X. Wu and P. Holme, Physical Review E 80, 037101 (2009).
  • Jin et al. (2001) E. M. Jin, M. Girvan,  and M. E. J. Newman, Physical Review E 64, 046132 (2001).
  • Ravasz and Barabási (2003) E. Ravasz and A.-L. Barabási, Physical Review E 67, 026112 (2003).
  • Barrat and Weigt (2000) A. Barrat and M. Weigt, The European Physical Journal B: Condensed Matter and Complex Systems 13, 547 (2000).
  • Benson et al. (2016) A. R. Benson, D. F. Gleich,  and J. Leskovec, Science 353, 163 (2016).
  • Yaveroğlu et al. (2014) Ö. N. Yaveroğlu, N. Malod-Dognin, D. Davis, Z. Levnajic, V. Janjic, R. Karapandza, A. Stojmirovic,  and N. Pržulj, Scientific Reports 4 (2014).
  • Rosvall et al. (2014) M. Rosvall, A. V. Esquivel, A. Lancichinetti, J. D. West,  and R. Lambiotte, Nature Communications 5 (2014).
  • Palla et al. (2005) G. Palla, I. Derényi, I. Farkas,  and T. Vicsek, Nature 435, 814 (2005).
  • Slater et al. (2014) N. Slater, R. Itzchack,  and Y. Louzoun, Network Science 2, 387 (2014).
  • Yin et al. (2017) H. Yin, A. R. Benson, J. Leskovec,  and D. F. Gleich, in Proceedings of the 23rd ACM SIGKDD international conference on Knowledge discovery and data mining (2017) (To appear).
  • Boccaletti et al. (2006) S. Boccaletti, V. Latora, Y. Moreno, M. Chavez,  and D.-U. Hwang, Physics reports 424, 175 (2006).
  • Luce and Perry (1949) R. D. Luce and A. D. Perry, Psychometrika 14, 95 (1949).
  • Chiba and Nishizeki (1985) N. Chiba and T. Nishizeki, SIAM Journal on Computing 14, 210 (1985).
  • Karp (1972) R. M. Karp, in Complexity of computer computations (Springer, 1972) pp. 85–103.
  • Seshadhri et al. (2013) C. Seshadhri, A. Pinar,  and T. G. Kolda, in Proceedings of the 2013 SIAM International Conference on Data Mining (SIAM, 2013) pp. 10–18.
  • Kruskal (1963) J. B. Kruskal, Mathematical Optimization Techniques 10, 251 (1963).
  • Katona (1966) G. Katona, in Theory of Graphs: Proceedings of the Colloquium held at Tihany, Hungary (1966) pp. 187–207.
  • Erdös and Rényi (1959) P. Erdös and A. Rényi, Publicationes Mathematicae (Debrecen) 6, 290 (1959).
  • Bollobás and Erdös (1976) B. Bollobás and P. Erdös, in Mathematical Proceedings of the Cambridge Philosophical Society, Vol. 80 (Cambridge University Press, 1976) pp. 419–427.
  • Bumbarger et al. (2013) D. J. Bumbarger, M. Riebesell, C. Rödelsperger,  and R. J. Sommer, Cell 152, 109 (2013).
  • Takemura et al. (2013) S.-y. Takemura, A. Bharioke, Z. Lu, A. Nern, S. Vitaladevuni, P. K. Rivlin, W. T. Katz, D. J. Olbris, S. M. Plaza, P. Winston, et al., Nature 500, 175 (2013).
  • Helmstaedter et al. (2013) M. Helmstaedter, K. L. Briggman, S. C. Turaga, V. Jain, H. S. Seung,  and W. Denk, Nature 500, 168 (2013).
  • Traud et al. (2012) A. L. Traud, P. J. Mucha,  and M. A. Porter, Physica A: Statistical Mechanics and its Applications 391, 4165 (2012).
  • Takac and Zabovsky (2012) L. Takac and M. Zabovsky, in International Scientific Conference and International Workshop Present Day Trends of Innovations, Vol. 1 (2012).
  • Mislove et al. (2007) A. Mislove, M. Marcon, K. P. Gummadi, P. Druschel,  and B. Bhattacharjee, in Proceedings of the 5th ACM/Usenix Internet Measurement Conference (IMC’07) (San Diego, CA, 2007).
  • Leskovec et al. (2007) J. Leskovec, J. Kleinberg,  and C. Faloutsos, ACM Transactions on Knowledge Discovery from Data (TKDD) 1, 2 (2007).
  • Porter et al. (2005) M. A. Porter, P. J. Mucha, M. E. J. Newman,  and C. M. Warmbrand, Proceedings of the National Academy of Sciences 102, 7057 (2005).
  • Yang and Leskovec (2015) J. Yang and J. Leskovec, Knowledge and Information Systems 42, 181 (2015).
  • Klimt and Yang (2004) B. Klimt and Y. Yang, in CEAS (2004).
  • Panzarasa et al. (2009) P. Panzarasa, T. Opsahl,  and K. M. Carley, Journal of the Association for Information Science and Technology 60, 911 (2009).
  • Leskovec et al. (2010) J. Leskovec, D. P. Huttenlocher,  and J. M. Kleinberg, in Proceedings of the International Conference on Web and Social Media (2010).
  • Leskovec et al. (2005) J. Leskovec, J. Kleinberg,  and C. Faloutsos, in Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining (ACM, 2005) pp. 177–187.
  • Ripeanu et al. (2002) M. Ripeanu, A. Iamnitchi,  and I. Foster, IEEE Internet Computing 6, 50 (2002).
  • Kaiser (2008) M. Kaiser, New Journal of Physics 10, 083042 (2008).
  • Bollobás (1980) B. Bollobás, European Journal of Combinatorics 1, 311 (1980).
  • Milo et al. (2003) R. Milo, N. Kashtan, S. Itzkovitz, M. E. J. Newman,  and U. Alon, arXiv preprint cond-mat/0312028  (2003).
  • Park and Newman (2004) J. Park and M. E. J. Newman, Physical Review E 70, 066117 (2004).
  • Colomer-de Simón et al. (2013) P. Colomer-de Simón, M. Á. Serrano, M. G. Beiró, J. I. Alvarez-Hamelin,  and M. Boguñá, Scientific Reports 3, 2517 (2013).
  • Bhat et al. (2016) U. Bhat, P. Krapivsky, R. Lambiotte,  and S. Redner, Physical Review E 94, 062302 (2016).
  • Lambiotte et al. (2016) R. Lambiotte, P. Krapivsky, U. Bhat,  and S. Redner, Physical Review Letters 117, 218301 (2016).
  • Milo et al. (2002) R. Milo, S. Shen-Orr, S. Itzkovitz, N. Kashtan, D. Chklovskii,  and U. Alon, Science 298, 824 (2002).
  • Varshney et al. (2011) L. R. Varshney, B. L. Chen, E. Paniagua, D. H. Hall,  and D. B. Chklovskii, PLOS Computational Biology 7, e1001066 (2011).
  • Ugander et al. (2013) J. Ugander, L. Backstrom,  and J. Kleinberg, in Proceedings of the 22nd international conference on World Wide Web (ACM, 2013) pp. 1307–1318.
  • Scholtes (2017) I. Scholtes, arXiv:1702.05499  (2017).
  • Sekara et al. (2016) V. Sekara, A. Stopczynski,  and S. Lehmann, Proceedings of the National Academy of Sciences 113, 9977 (2016).
  • Fronczak et al. (2002) A. Fronczak, J. A. Hołyst, M. Jedynak,  and J. Sienkiewicz, Physica A: Statistical Mechanics and its Applications 316, 688 (2002).
  • Caldarelli et al. (2004) G. Caldarelli, R. Pastor-Satorras,  and A. Vespignani, The European Physical Journal B: Condensed Matter and Complex Systems 38, 183 (2004).
  • Andrade et al. (2006) R. F. Andrade, J. G. Miranda,  and T. P. Lobão, Physical Review E 73, 046101 (2006).
  • Jiang and Claramunt (2004) B. Jiang and C. Claramunt, Environment and Planning B: Planning and Design 31, 151 (2004).