Spectral Clustering with Epidemic Diffusion

03/11/2013 ∙ by Laura M. Smith, et al.

Spectral clustering is widely used to partition graphs into distinct modules or communities. Existing methods for spectral clustering use the eigenvalues and eigenvectors of the graph Laplacian, an operator that is closely associated with random walks on graphs. We propose a new spectral partitioning method that exploits the properties of epidemic diffusion. An epidemic is a dynamic process that, unlike the random walk, simultaneously transitions to all the neighbors of a given node. We show that the replicator, an operator describing epidemic diffusion, is equivalent to the symmetric normalized Laplacian of a reweighted graph with edges reweighted by the eigenvector centralities of their incident nodes. Thus, more weight is given to edges connecting more central nodes. We describe a method that partitions the nodes based on the componentwise ratio of the replicator's second eigenvector to the first, and compare its performance to traditional spectral clustering techniques on synthetic graphs with known community structure. We demonstrate that the replicator gives preference to dense, clique-like structures, enabling it to more effectively discover communities that may be obscured by dense intercommunity linking.

I Introduction

Graph partitioning is used in many applications, including community detection Leskovec08www , image segmentation ShiMalik00 , and data mining Bertozzi2012 , where it is necessary to partition a graph into modules or clusters of similar, or similarly behaving, nodes. Spectral partitioning uses the eigenvectors associated with the smallest eigenvalues of the graph Laplacian matrix (or its normalized version) to partition the graph into clusters Chung:Spectral:97 ; Ng:2001:NIPS ; Spielman07 ; spectral-tutorial .

Existing methods for spectral partitioning are closely associated with random walks on graphs. A random walk is a stochastic dynamic process where transitions take place from a node to a random neighbor of that node, and it is described by the (normalized) graph Laplacian. The existence of a good partition implies that random walks take a long time to reach a stationary distribution on the graph Jerrum88 ; ShiMalik00 , because they spend a long time within a module and seldom pass between modules Rosvall08 . This forms a basis for objective functions used to select which edges to cut so as to partition the graph, such as normalized cut and conductance, though these functions have trouble partitioning real-world graphs where many inter-module edges obscure the underlying structure Leskovec08www .

Epidemic diffusion is another type of dynamic process on a graph. An epidemic undergoes transitions simultaneously to all the neighbors of a given node, rather than a single neighbor, and is often used to model the spread of a virus or an innovation through a social network Anderson91 ; Rogers03 . Recently, Lerman and Ghosh introduced the replicator matrix Lerman12pre , an analog of the graph Laplacian, to describe epidemic diffusion on graphs. They used the replicator to simulate dynamics of synchronization in a network of oscillators, showing that oscillators coupled via epidemic diffusion synchronize into different structures than oscillators coupled via random walk-like diffusion.

We propose a method for spectral graph partitioning based on epidemic diffusion. First, we show that the replicator is equivalent to the symmetric normalized Laplacian of a reweighted graph, where new edge weights are the product of old edge weights and the eigenvector centralities of the two end points. The eigenvector centrality Bonacich01 of a graph is given by the eigenvector corresponding to the largest eigenvalue of the adjacency matrix. Therefore, edges linking central nodes are given a higher weight by the reweighting scheme.

The equivalence between the replicator and symmetric normalized Laplacian of a reweighted graph allows us to exploit well-known relationships between spectral clustering and graph partitioning. To use the replicator for spectral partitioning, we give a computationally efficient procedure that orders nodes based on the componentwise ratio of the second to first eigenvectors and selects a partition that minimizes a quality function computed on the reweighted graph. This tends to preserve dense structures, since edges linking more central nodes in such dense clusters are less likely to be cut.

We study the performance of the proposed spectral partitioning method using synthetic graphs with known community structure. We demonstrate that spectral clustering based on epidemics leads to a better recovery of ground truth communities than traditional methods based on the graph Laplacian, especially in graphs that are more challenging because of the presence of many edges between clusters. Our work suggests that epidemic diffusion can be a useful probe of graph structure, as it can illuminate properties of graphs that are distinct from those found by methods based on the random walk.

II Spectral Clustering

An unweighted graph $G = (V, E)$, with vertices (or nodes) $V$ and edges (or links) $E$, can be represented by an $n \times n$ adjacency matrix $A$, where $n = |V|$, with $A_{ij} = 1$ if the edge $(i, j) \in E$, and $A_{ij} = 0$ otherwise. By convention $A_{ii} = 0$. We consider undirected graphs, where $A_{ij} = A_{ji}$. The degree of node $i$ is defined as the number of edges incident on it, $d_i = \sum_j A_{ij}$. Other useful constructs are $D$, a diagonal degree matrix where $D_{ii} = d_i$, and the $n \times n$ identity matrix $I$.

II.1 Graph Laplacian and Spectral Clustering

The graph Laplacian matrix is defined as $L = D - A$. The eigenvalues and eigenvectors of $L$ capture many properties of the graph. In the simplest case, if the graph has $k$ disjoint components, the $k$ smallest eigenvalues of $L$ are zero, and the associated eigenvectors are indicator functions assigning nodes to their respective cluster or community spectral-tutorial . Even if the $k$ smallest eigenvalues are not all zero, their corresponding eigenvectors can be used to partition nodes into $k$ clusters by projecting these nodes onto a subspace of the first $k$ eigenvectors and using standard clustering techniques such as $k$-means Ng:2001:NIPS ; ShiMalik00 . The simplest spectral clustering method, spectral bisection, partitions nodes based on the values of the second eigenvector $u_2$ of the adjacency matrix or the graph Laplacian. A splitting value $s$ is used to divide the nodes into different clusters based on whether $u_2(i) > s$ or $u_2(i) \le s$ Spielman07 . A range of splitting values have been used, including zero, the median value within the vector, the largest gap, and the value producing the best ratio cut, best conductance Vempala2009 , or another measure.

In practice, normalized versions of the graph Laplacian produce better results in spectral clustering applications Ng:2001:NIPS ; Bertozzi2012 . Two examples are the symmetric normalized Laplacian $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ and the random walk Laplacian $L_{\mathrm{rw}} = I - D^{-1} A$, so named because the matrix of transition probabilities for a random walk on a graph is given by $P = D^{-1} A$.
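As a rough illustration of the operators discussed in this subsection, the following sketch builds the three Laplacians from an adjacency matrix and performs a spectral bisection with splitting value zero. It assumes the standard definitions $L = D - A$, $L_{\mathrm{sym}} = I - D^{-1/2} A D^{-1/2}$ and $L_{\mathrm{rw}} = I - D^{-1} A$; the function names and the six-node toy graph (two triangles joined by one edge) are hypothetical and not taken from the paper.

```python
import numpy as np

def laplacians(A):
    """Return (L, L_sym, L_rw) for a symmetric adjacency matrix without isolated nodes."""
    A = np.asarray(A, dtype=float)
    d = A.sum(axis=1)                      # node degrees
    I = np.eye(len(d))
    L = np.diag(d) - A                     # unnormalized Laplacian D - A
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L_sym = I - d_inv_sqrt[:, None] * A * d_inv_sqrt[None, :]
    L_rw = I - A / d[:, None]              # I - D^{-1} A
    return L, L_sym, L_rw

def spectral_bisection(L, split=0.0):
    """Threshold the eigenvector of the second-smallest eigenvalue of a symmetric L."""
    eigvals, eigvecs = np.linalg.eigh(L)   # eigenvalues in ascending order
    second = eigvecs[:, 1]
    return second > split

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

L, L_sym, L_rw = laplacians(A)
labels = spectral_bisection(L)             # boolean cluster indicator per node
```

On this toy graph the sign of the second eigenvector separates the two triangles; which triangle is labelled `True` depends on the arbitrary sign of the eigenvector returned by the solver.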

II.2 Graph Cuts and Their Quality Measures

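Intuitively, a cluster is a set of nodes $S \subset V$ that are more tightly connected to each other than to nodes outside of the cluster. We use $\bar{S}$ to denote the complement of $S$, which consists of nodes that are not in $S$. In order to bisect the graph into two disjoint clusters, one typically wants to minimize the number of cut edges between clusters,

$$\mathrm{cut}(S, \bar{S}) = \sum_{i \in S,\, j \in \bar{S}} A_{ij},$$

while maximizing cluster size, which may be measured by the number of nodes it contains, $|S|$, or the sum of the degrees of the nodes in the set, $\mathrm{vol}(S) = \sum_{i \in S} d_i$, also called the volume of the set.

Several functions have been proposed for measuring the quality of a graph cut. The best known of these are ratio cut ($\mathrm{rcut}$) and normalized cut ($\mathrm{ncut}$):

$$\mathrm{rcut}(S) = \mathrm{cut}(S, \bar{S}) \left( \frac{1}{|S|} + \frac{1}{|\bar{S}|} \right) \tag{1}$$

$$\mathrm{ncut}(S) = \mathrm{cut}(S, \bar{S}) \left( \frac{1}{\mathrm{vol}(S)} + \frac{1}{\mathrm{vol}(\bar{S})} \right) \tag{2}$$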
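To make the two quality functions concrete, here is a minimal sketch that evaluates them for a given bisection. It assumes the commonly used forms $\mathrm{rcut}(S) = \mathrm{cut}(S,\bar{S})\,(1/|S| + 1/|\bar{S}|)$ and $\mathrm{ncut}(S) = \mathrm{cut}(S,\bar{S})\,(1/\mathrm{vol}(S) + 1/\mathrm{vol}(\bar{S}))$; the function names and the toy graph are hypothetical.

```python
import numpy as np

def cut_weight(W, in_S):
    """Total weight of edges between S (boolean mask) and its complement."""
    return W[np.ix_(in_S, ~in_S)].sum()

def ratio_cut(W, in_S):
    """cut(S, S-bar) * (1/|S| + 1/|S-bar|)."""
    c = cut_weight(W, in_S)
    return c * (1.0 / in_S.sum() + 1.0 / (~in_S).sum())

def normalized_cut(W, in_S):
    """cut(S, S-bar) * (1/vol(S) + 1/vol(S-bar)), with vol = sum of (weighted) degrees."""
    d = W.sum(axis=1)
    c = cut_weight(W, in_S)
    return c * (1.0 / d[in_S].sum() + 1.0 / d[~in_S].sum())

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

in_S = np.array([True, True, True, False, False, False])
rcut_value = ratio_cut(A, in_S)        # 1 * (1/3 + 1/3) = 2/3
ncut_value = normalized_cut(A, in_S)   # 1 * (1/7 + 1/7) = 2/7
```

Because both functions accept any symmetric weight matrix, the same code can be applied to a reweighted graph by passing its weight matrix in place of `A`.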

There is a relationship between graph cuts and spectral clustering. Deciding which edges to cut to optimize any of these quality functions is an NP-complete problem. Spectral clustering solves a relaxation of the problem, where the discrete indicator variables that assign nodes to clusters become continuous. Although in general there are no useful bounds for the approximation produced by this relaxation spectral-tutorial , in practice it often provides a simple and effective clustering method. Solutions to the relaxed optimization problem are given by the second eigenvector of the graph Laplacian or the normalized graph Laplacian Spielman07 . Relaxing ratio cut leads to spectral clustering using the graph Laplacian $L$, while relaxing normalized cut leads to spectral clustering using the symmetric normalized Laplacian $L_{\mathrm{sym}}$ ShiMalik00 ; spectral-tutorial . Such relaxation methods have also been applied productively to the popular modularity maximization method for community detection Newman06pnas ; Fortunato10community . By analogy with spectral bisection Spielman07 , the leading eigenvector approach assigns nodes to clusters based on the sign of the components of the leading eigenvector of the modularity matrix.

II.3 Spectral Clustering and Random Walks

There exists a further relationship between spectral clustering, the partition quality function, and properties of random walks. A random walk on a graph is a stochastic process where transitions take place to a randomly chosen neighbor of a given node. Cluster properties of the graph can be expressed in terms of the transition matrix $P = D^{-1} A$ Lovasz93 of a random walk. Spectral clustering finds a partition such that a random walk stays within the same cluster for a long time and seldom jumps between clusters ShiMalik00 ; Rosvall08 . Therefore, the presence of a good partition (low normalized cut value) implies that it will take a random walk a long time to reach its equilibrium distribution.
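The link between normalized cut and random walks can be checked numerically. The sketch below assumes the transition matrix $P = D^{-1} A$ and uses the known fact that the stationary distribution of a random walk on a connected undirected graph is proportional to the degrees. It computes the probability that a walker, started from the stationary distribution restricted to a cluster, leaves that cluster in one step. The function name and toy graph are hypothetical.

```python
import numpy as np

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

d = A.sum(axis=1)
P = A / d[:, None]        # row-stochastic transition matrix D^{-1} A
pi = d / d.sum()          # stationary distribution, proportional to degree

def escape_probability(P, pi, in_S):
    """P(next node outside S | current node in S), walker in stationarity."""
    flow = (pi[in_S][:, None] * P[np.ix_(in_S, ~in_S)]).sum()
    return flow / pi[in_S].sum()

in_S = np.array([True, True, True, False, False, False])
p_out = escape_probability(P, pi, in_S)     # equals cut / vol(S) = 1/7
p_back = escape_probability(P, pi, ~in_S)   # equals cut / vol(S-bar) = 1/7
```

The sum `p_out + p_back` equals the normalized cut of the bisection (here $2/7$), which is one way to see why a low normalized cut corresponds to a walk that rarely crosses between clusters.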

III Epidemic Diffusion on Graphs

An epidemic is a dynamic process that simultaneously undergoes transitions to every neighbor of the current node. Epidemics are used to model the spread of disease Hethcote00 and innovation Rogers03 in social networks. Epidemics differ from random walks in important ways. First, rather than choosing a single neighbor to transition to or “infect” as the random walk does, an epidemic will attempt to “infect” every neighbor of a node. Second, in a random walk, the probability of finding the walker in a given location is a conserved quantity that diffuses through the graph, and the random walk transition matrix is a stochastic matrix. Epidemics, on the other hand, replicate themselves with each successful transmission, without following a conservation law Lerman12pre .

Lerman and Ghosh Lerman12pre introduced the replicator operator $R = I - \frac{1}{\lambda_{\max}} A$ to describe dynamics of synchronization in a network of nodes coupled via epidemic diffusion. Here $\lambda_{\max}$ is the largest eigenvalue of $A$; its reciprocal $1/\lambda_{\max}$ is known as the epidemic threshold Wang03 . In this system, a dynamic variable $\theta_i$ associated with node $i$ can change its value based on the values of its neighbors according to:

$$\frac{d\theta_i}{dt} = -\sum_j R_{ij}\, \theta_j, \tag{3}$$

where $R$ replaces the Laplacian $L$ used in the analogous heat equation that gives the (diffusive) evolution of a random walk on a graph Chung07pnas . By construction, the replicator has a steady state given by $v_1$, the eigenvector of $A$ associated with $\lambda_{\max}$: $R v_1 = 0$. The vector $v_1$ is also known as the eigenvector centrality Bonacich01 , and was introduced by Bonacich to explain the importance of actors in a social network based on the importance of the actors to which they were connected.

Clusters of nodes with similar values of the dynamic variable emerge as the system of coupled nodes evolves towards the steady state Lerman12pre . This motivates a community detection method with nodes classified according to the rate of convergence to their steady-state values. For large time $t$, we approximate the solution to Eq. 3 using the two leading eigenvectors $v_1$ and $v_2$ of the replicator $R$,

$$\theta(t) \approx c_1 v_1 + c_2 e^{-\lambda_2 t} v_2,$$

where $c_1$ and $c_2$ are constants, and $\lambda_2$ is the second smallest eigenvalue of $R$ associated with eigenvector $v_2$, guaranteed to be nonzero if the graph is connected. The deviation of node $i$ from its steady-state value $c_1 v_1(i)$, measured relative to that value, is proportional to $v_2(i)/v_1(i)$. Therefore, convergence depends on $v_2/v_1$, the componentwise ratio of the second to first eigenvectors. Note that eigenvectors of $R$ corresponding to $R$’s two smallest eigenvalues are the same as the eigenvectors of $A$ corresponding to $A$’s two largest eigenvalues.
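The convergence argument above can be illustrated numerically. The following sketch assumes the replicator has the form $R = I - A/\lambda_{\max}$ and that the dynamics are the linear system $d\theta/dt = -R\theta$; the symbol $\theta$, the function names, the initial condition and the toy graph are all assumptions made for illustration. Since $R$ is symmetric, the exact solution can be written through its eigendecomposition.

```python
import numpy as np

def replicator(A):
    """Return (R, v1, v2, lam_max) with R = I - A / lam_max (assumed form)."""
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)   # ascending eigenvalues of A
    lam_max = eigvals[-1]
    v1 = np.abs(eigvecs[:, -1])            # Perron vector: one sign on a connected graph
    v2 = eigvecs[:, -2]                    # eigenvector of the second-largest eigenvalue
    R = np.eye(len(A)) - A / lam_max
    return R, v1, v2, lam_max

def evolve(R, theta0, t):
    """Exact solution of d(theta)/dt = -R theta at time t, for symmetric R."""
    mu, U = np.linalg.eigh(R)
    return U @ (np.exp(-mu * t) * (U.T @ theta0))

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

R, v1, v2, lam_max = replicator(A)
theta0 = np.arange(6, dtype=float)         # arbitrary initial condition
theta_late = evolve(R, theta0, t=100.0)    # close to the steady state (v1 . theta0) * v1
ratio = v2 / v1                            # componentwise ratio used to order nodes
```

On this toy graph the sign of `ratio` separates the two triangles, and `theta_late` is numerically indistinguishable from the projection of `theta0` onto `v1`.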

III.1 Replicator as the Symmetric Normalized Laplacian of a Reweighted Graph

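In a social network, one might expect nodes of high “importance” to attract other nodes, resulting in communities forming around nodes with large eigenvector centrality values $v_1(i)$. In this section we propose a modification of our graph, converting the unweighted network into a weighted one where weights are given by the product of the eigenvector centralities of an edge’s end points: $W_{ij} = A_{ij}\, v_1(i)\, v_1(j)$. Moreover, we show that the replicator $R = I - A/\lambda_{\max}$ on the unweighted graph given by $A$ is in fact exactly equivalent to the symmetric normalized Laplacian of the reweighted graph given by $W$.

In the reweighted graph, the degree of node $i$ is given by

$$d^W_i = \sum_j W_{ij} = v_1(i) \sum_j A_{ij}\, v_1(j) = \lambda_{\max}\, v_1(i)^2,$$

where the last step uses $A v_1 = \lambda_{\max} v_1$. For convenience, define $C$ as the diagonal matrix whose elements are the components of eigenvector $v_1$, i.e., $C_{ii} = v_1(i)$. Then, from $W_{ij}$ and $d^W_i$ above,

$$W = C A C, \qquad D_W = \lambda_{\max}\, C^2. \tag{4}$$

We can now write the symmetric normalized Laplacian of the reweighted graph:

$$L^{W}_{\mathrm{sym}} = I - D_W^{-1/2}\, W\, D_W^{-1/2} = I - \lambda_{\max}^{-1/2} C^{-1} (C A C)\, C^{-1} \lambda_{\max}^{-1/2} = I - \frac{1}{\lambda_{\max}} A.$$

Hence, $L^{W}_{\mathrm{sym}} = R$.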
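The equivalence can be verified numerically on a small graph. The sketch below assumes the replicator is $R = I - A/\lambda_{\max}$ and that edges are reweighted as $W_{ij} = A_{ij} v_1(i) v_1(j)$, with $v_1$ the eigenvector centrality; the function name and the toy graph (a 4-clique with a two-node tail) are hypothetical.

```python
import numpy as np

def reweight_by_centrality(A):
    """Return (W, v1, lam_max) where W_ij = A_ij * v1_i * v1_j (assumed reweighting)."""
    A = np.asarray(A, dtype=float)
    eigvals, eigvecs = np.linalg.eigh(A)
    lam_max = eigvals[-1]
    v1 = np.abs(eigvecs[:, -1])            # eigenvector centrality (connected graph)
    W = A * np.outer(v1, v1)
    return W, v1, lam_max

# Hypothetical toy graph: clique on {0,1,2,3} with a tail 3-4-5.
A = np.zeros((6, 6))
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]
for i, j in edges:
    A[i, j] = A[j, i] = 1.0

W, v1, lam_max = reweight_by_centrality(A)
d_W = W.sum(axis=1)                        # degrees in the reweighted graph

# Symmetric normalized Laplacian of the reweighted graph.
L_sym_W = np.eye(6) - W / np.sqrt(np.outer(d_W, d_W))

# Replicator of the original, unweighted graph.
R = np.eye(6) - A / lam_max

same_operator = np.allclose(L_sym_W, R)
```

In this example an edge inside the clique, such as `W[0, 1]`, receives a larger weight than the tail edge `W[4, 5]`, in line with the statement that edges between central nodes are weighted more heavily.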

The equivalence between epidemics and the diffusive process of random walks is at first surprising. Diffusive processes conserve the total amount of the substance diffusing, whereas no such conservation law holds for epidemics Lerman12pre . The intuition for the equivalence of the two processes is the following. A node’s eigenvector centrality gives the number of paths connecting it to all other nodes in the graph Ghosh11physrev ; hence, the product of eigenvector centralities of a pair of nodes captures how much of the substance is newly created when the epidemic follows the edge linking the pair. By encoding the amount of non-conservation in edge reweighting, this scheme allows the epidemic to be reduced to diffusion.

III.2 Quality Measure for the Replicator

The equivalence proved above allows us to exploit the properties of the symmetric normalized Laplacian, along with its relationship to graph partitioning, for epidemic diffusion. Since the replicator $R$ is simply the symmetric normalized Laplacian $L_{\mathrm{sym}}$ of the reweighted graph $W$, spectral clustering using the replicator corresponds to a relaxation of normalized cut on this reweighted graph. The appropriate measure for assessing graph cut quality with the replicator is therefore normalized cut on the reweighted graph $W$.

III.3 An Illustrative Example

We use a simple example to highlight the differences between traditional graph partitioning and one based on epidemics. Consider the graph in Figure 1, which shows a dense cluster connected through node 6 to a sparsely linked cluster. Such a configuration is common in social networks, where a high-degree hub linking different communities may obscure community boundaries. We expect a good partition to group node with other nodes in its clique. However, the cut () that minimizes normalized cut () groups node with nodes and assigns nodes to the other cluster. Multiple cuts minimize ratio cut (), including one that groups together nodes .

Node has the highest eigenvector centrality. Furthermore, nodes that belong to the clique have higher centrality values than other nodes. Consequently, in the reweighted graph, the edges linking node to the rest of the clique are more “expensive” to cut, and nodes are grouped together by the preferred cut () that minimizes both the ratio cut and the normalized cut on the reweighted graph. The quality measures of the cuts are shown in the table in Fig. 1. By giving edges linking central nodes a higher weight, epidemic-based graph partitioning thus preserves dense, clique-like structures. Accordingly, deleting these edges will have the greatest impact on reducing the spread of an epidemic Tong:2012 .

Quality | Cut A | Cut B
Original graph, ratio cut | 1.83 | 1.83
Original graph, normalized cut | 0.528 | 0.417
Reweighted graph, ratio cut | 11.4 | 32.3
Reweighted graph, normalized cut | 0.747 | 0.778

Figure 1: (Color online) (Left) An example graph. The possible cuts are shown by the dotted curves A and B. (Right) Quality measures of cuts A and B on the original and reweighted graph.

III.4 Efficient Spectral Partitioning

We now describe an efficient method for spectral clustering using epidemic diffusion based on spectral bisection Spielman07 . First, we create a vector $y$ that is the componentwise ratio of the second eigenvector to the first eigenvector of the operator $R$, $y_i = v_2(i)/v_1(i)$, and sort its values. Next, we examine all $n-1$ cuts in this ordering (where $n = |V|$) and pick one corresponding to the partition that minimizes an appropriate quality measure. The quality measure we use with $R$ is normalized cut on the reweighted graph ($\mathrm{ncut}_W$). We compare the resulting partition with those produced by applying an analogous splitting procedure to $L$, with quality measure $\mathrm{rcut}$, and $L_{\mathrm{sym}}$, with quality measure $\mathrm{ncut}$ (on the original graph).

The proposed optimization procedure is exhaustive, since it tests all possible cuts within the ordering produced by $y$. It may seem that there would be some loss in accuracy from restricting our search to cuts in a one-dimensional projection, rather than searching over the entire subspace spanned by the first two eigenvectors $v_1$ and $v_2$. However, it has been observed weiss ; ShiMalik00 that the componentwise ratio of the second to first eigenvector of $L_{\mathrm{sym}}$ is precisely equal to the second eigenvector of the random walk Laplacian $L_{\mathrm{rw}}$, whose first eigenvector is a constant vector. Thus, our algorithm is effective because it is a computationally efficient procedure for finding the best normalized cut in the two-dimensional eigenspace of $R$, i.e., $L_{\mathrm{sym}}$ on the reweighted graph. The advantages of using $L_{\mathrm{rw}}$ in spectral clustering are discussed in spectral-tutorial .
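The procedure described in this subsection can be sketched as a sweep over the sorted ratio vector. This is a minimal illustration rather than the authors' implementation: it assumes $R = I - A/\lambda_{\max}$ (so the two leading eigenvectors are those of $A$), the reweighting $W_{ij} = A_{ij} v_1(i) v_1(j)$, and the normalized cut form $\mathrm{cut}\,(1/\mathrm{vol}(S) + 1/\mathrm{vol}(\bar{S}))$. The function name and the toy graph are hypothetical.

```python
import numpy as np

def replicator_sweep_cut(A):
    """Bisect a connected graph by sweeping over v2/v1 and minimizing ncut on W."""
    A = np.asarray(A, dtype=float)
    n = len(A)
    eigvals, eigvecs = np.linalg.eigh(A)
    v1 = np.abs(eigvecs[:, -1])            # eigenvector centrality
    v2 = eigvecs[:, -2]
    W = A * np.outer(v1, v1)               # reweighted graph
    d_W = W.sum(axis=1)
    vol_total = d_W.sum()

    order = np.argsort(v2 / v1)            # node ordering by componentwise ratio
    in_S = np.zeros(n, dtype=bool)
    cut, vol_S = 0.0, 0.0
    best_score, best_k = np.inf, 0
    for k in range(n - 1):                 # the n-1 cuts along the ordering
        i = order[k]
        # Moving i into S: its edges into S stop being cut, its other edges become cut.
        cut += W[i, ~in_S].sum() - W[i, in_S].sum()
        in_S[i] = True
        vol_S += d_W[i]
        score = cut * (1.0 / vol_S + 1.0 / (vol_total - vol_S))
        if score < best_score:
            best_score, best_k = score, k

    labels = np.zeros(n, dtype=bool)
    labels[order[:best_k + 1]] = True
    return labels, best_score

# Hypothetical toy graph: triangles {0,1,2} and {3,4,5} joined by the edge (2,3).
A = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    A[i, j] = A[j, i] = 1.0

labels, score = replicator_sweep_cut(A)
```

The cut weight and volume are updated incrementally as each node moves across the boundary, so the sweep itself costs time proportional to the number of nonzero entries touched rather than recomputing every cut from scratch. On the toy graph the best cut separates the two triangles.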

IV Evaluation on Synthetic Graphs

We use synthetic graphs to gain better insight into the differences between operators $L$, $L_{\mathrm{sym}}$, and $R$ and the characteristics of graphs for which different operators find better solutions. Lancichinetti and Fortunato have proposed an algorithm to generate random graphs with known hierarchical community structure Fortunato2009 . The nodes are divided into macro communities, which are themselves composed of micro communities, and then edges between nodes are created using mixing parameters $\mu_1$ and $\mu_2$. The parameter $\mu_1$ designates the fraction of a node’s edges that will connect to nodes in a different macro community, and $\mu_2$ gives the fraction of edges that will connect to nodes in a different micro community within the same macro community. The remaining fraction of edges link to other nodes within the same micro and macro communities. These benchmark networks allow us to systematically explore the performance of different spectral clustering approaches.

Figure 2: Each pixel represents the mean average clustering coefficient (left) and the standard deviation (right) across 100 runs for fixed $(\mu_1, \mu_2)$.

Using software available on FortunatoWebsite , we generated 100 graphs for each set of parameter values. We took with two macro communities. We varied and between and . The average clustering coefficient ranged between 0.23 and 0.6421, suggesting that the synthetic graphs have properties similar to those often found in real world networks Watts1999 .

We partition each benchmark graph using $L$, $L_{\mathrm{sym}}$, and $R$ by minimizing their respective quality measures. To evaluate the resulting partitions, we use the Normalized Mutual Information (NMI) measure Danon2005 , which compares the partition to the ground truth communities. When the value of this measure is 1.0, the partitioning method has successfully recovered the underlying community structure. We calculate the average and standard deviation of the NMI scores for a fixed set of parameters and display the results in Figure 3.
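For reference, the NMI score can be computed directly from the confusion matrix of two partitions. The sketch below assumes the usual normalization $\mathrm{NMI} = 2\, I(X;Y) / (H(X) + H(Y))$; the function name is hypothetical, and returning 1.0 when both partitions consist of a single cluster is a convention chosen here.

```python
import numpy as np

def nmi(labels_a, labels_b):
    """Normalized mutual information 2*I(X;Y) / (H(X) + H(Y)) of two partitions."""
    a = np.unique(np.asarray(labels_a).ravel(), return_inverse=True)[1]
    b = np.unique(np.asarray(labels_b).ravel(), return_inverse=True)[1]
    n = len(a)

    # Confusion matrix: N[i, j] = number of nodes in cluster i of A and cluster j of B.
    N = np.zeros((a.max() + 1, b.max() + 1))
    np.add.at(N, (a, b), 1)
    Na, Nb = N.sum(axis=1), N.sum(axis=0)

    nz = N > 0                             # skip empty cells (0 * log 0 = 0)
    expected = np.outer(Na, Nb)
    mutual = (N[nz] * np.log(N[nz] * n / expected[nz])).sum()
    entropy_sum = -(Na * np.log(Na / n)).sum() - (Nb * np.log(Nb / n)).sum()
    if entropy_sum == 0:                   # both partitions have a single cluster
        return 1.0
    return 2.0 * mutual / entropy_sum

truth = [0, 0, 0, 1, 1, 1]
perfect = nmi(truth, [1, 1, 1, 0, 0, 0])        # same partition, labels swapped
unrelated = nmi([0, 0, 1, 1], [0, 1, 0, 1])     # statistically independent partitions
```

The score is invariant to how clusters are named, so a partition that matches the ground truth up to a relabeling scores 1.0, while the independent pair in the last line scores 0.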

Laplacian $L$: Ratio Cut
Symmetric Normalized Laplacian $L_{\mathrm{sym}}$: Normalized Cut
Replicator $R$: Normalized Cut (reweighted)
Figure 3: NMI scores for minimizing the respective operators’ quality measure. Each pixel represents the average (left) or standard deviation (right) NMI score across 100 runs for fixed $(\mu_1, \mu_2)$.

As the proportion of a node’s edges that connect to individuals in the opposite community, $\mu_1$, increases, it becomes more difficult to divide the network into the correct communities. We find that $L$ and $L_{\mathrm{sym}}$ give better results when $\mu_1$ is small (very few links between the two communities). As $\mu_1$ increases, $R$ dominates with a higher NMI score. Additionally, $R$ has the lowest standard deviation of the three operators, indicating a consistent performance in identifying the underlying communities.

V Conclusion

Spectral partitioning traditionally uses the graph Laplacian. In this paper, we have introduced a method for spectral partitioning using the replicator, an operator describing epidemic diffusion on graphs. We have shown that this operator is equivalent to the symmetric normalized Laplacian on a different graph, where edges are reweighted according to the eigenvector centrality measure. By reweighting the edges, a higher weight is placed on globally important nodes. Thus, this method tends to preserve cliques and other dense clusters.

We have introduced a spectral bisection approach based on the componentwise ratio of the second to the first eigenvector of the replicator $R$, choosing the partition by splitting the sorted vector so as to minimize an appropriate quality measure. Comparing the performance of different methods on synthetic graphs with known community structure, we have shown that spectral partitioning using the replicator is better able to recover the underlying community structure, especially in cases where more edges between the two macro communities make it more difficult for the Laplacian and symmetric normalized Laplacian to identify communities. By reweighting the edges using eigenvector centrality, the replicator assigns more importance to central nodes. Thus, the edges that pass between clusters are given less influence if they do not link nodes of high centrality. By limiting the cuts to influential edges, the method leads to a more accurate reconstruction of the community structure.

Acknowledgements

The authors are grateful to Arjuna Flenner, Yves van Gennip, and Blake Hunter for many instructive conversations and suggestions. KL and RG are also greatly indebted to Shanghua Teng, whose insights and enthusiasm continue to inspire them. This work has been funded by the Air Force Office of Scientific Research under contracts FA9550-10-1-0569 and FA9550-10-1-0102, DARPA under Contract No. W911NF-12-1-0034, the National Science Foundation under grant 1217605, and the Department of Energy Office of Science Advanced Computing Research (ASCR) program in Applied Mathematics.

References

  • (1) J. Leskovec, K. J. Lang, A. Dasgupta, and M. W. Mahoney, in Proceedings of 17th International World Wide Web Conference (WWW2008) (2008).
  • (2) J. Shi and J. Malik, IEEE Transactions on Pattern Analysis and Machine Intelligence 22(8), 888 (2000).
  • (3) A. Bertozzi and A. Flenner, Multiscale Modeling & Simulation 10(3), 1090 (2012).
  • (4) F. R. K. Chung, Spectral Graph Theory, vol. 92 of CBMS Regional Conference Series in Mathematics (American Mathematical Society, 1996).
  • (5) A. Y. Ng, M. I. Jordan, and Y. Weiss, in Advances in Neural Information Processing Systems (2001), vol. 14, pp. 849–856.
  • (6) D. A. Spielman and S.-H. Teng, Linear Algebra and its Applications 421(2-3), 284 (Mar. 2007).
  • (7) U. von Luxburg, Statistics and Computing 17(4), 395 (2007).
  • (8) M. Jerrum and A. Sinclair, in Proc. ACM Symposium on Theory of Computing (1988), STOC ’88, pp. 235–244.
  • (9) M. Rosvall and C. T. Bergstrom, Proc. National Academy of Sciences 105(4), 1118 (2008).
  • (10) R. M. Anderson and R. May, Infectious diseases of humans: dynamics and control (Oxford University Press, 1991).
  • (11) E. M. Rogers, Diffusion of Innovations, 5th Edition (Free Press, 2003).
  • (12) K. Lerman and R. Ghosh, Phys. Rev. E 86(026108) (2012).
  • (13) P. Bonacich and P. Lloyd, Social Networks 23(3), 191 (2001).
  • (14) R. Kannan and S. Vempala, Spectral Algorithms, vol. 4 of Foundations and Trends in Theoretical Computer Science (Now Publishers Inc., Hanover, MA, 2009).
  • (15) M. E. J. Newman, Proc. National Academy of Sciences 103(23), 8577 (2006).
  • (16) S. Fortunato, Physics Reports 486, 75 (2010).
  • (17) L. Lovász, Random Walks on Graphs: A Survey (1993), pp. 353–397.
  • (18) H. W. Hethcote, SIAM Review 42(4), 599 (2000).
  • (19) Y. Wang, D. Chakrabarti, C. Wang, and C. Faloutsos, in Proc. 22nd Symposium on Reliable Distributed Systems (IEEE, 2003), pp. 25–34.
  • (20) F. Chung, Proc. National Academy of Sciences 104(50), 19735 (2007).
  • (21) R. Ghosh and K. Lerman, Phys. Rev. E 83(6), 066118 (2011).
  • (22) H. Tong, B. A. Prakash, T. Eliassi-Rad, M. Faloutsos, and C. Faloutsos, in Proc. 21st ACM International Conference on Information and knowledge management (2012), CIKM ’12, pp. 245–254.
  • (23) Y. Weiss, in Proceedings IEEE International Conference on Computer Vision (1999), pp. 975–982.
  • (24) A. Lancichinetti and S. Fortunato, Phys. Rev. E 80(1) (2009).
  • (25) S. Fortunato, Benchmark graphs to test community detection algorithms (October 2011), https://sites.google.com/site/santofortunato/inthepress2.
  • (26) D. J. Watts, The American Journal of Sociology 105(2), 493 (1999).
  • (27) L. Danon, A. Díaz-Guilera, J. Duch, and A. Arenas, Journal of Statistical Mechanics: Theory and Experiment (2005).