Understanding Regularized Spectral Clustering via Graph Conductance

06/05/2018 ∙ by Yilin Zhang, et al. ∙ University of Wisconsin-Madison

This paper uses the relationship between graph conductance and spectral clustering to study (i) the failures of spectral clustering and (ii) the benefits of regularization. The explanation is simple. Sparse and stochastic graphs create a lot of small trees that are connected to the core of the graph by only one edge. Graph conductance is sensitive to these noisy `dangling sets'. Spectral clustering inherits this sensitivity. The second part of the paper starts from a previously proposed form of regularized spectral clustering and shows that it is related to the graph conductance on a `regularized graph'. We call the conductance on the regularized graph CoreCut. Based upon previous arguments that relate graph conductance to spectral clustering (e.g. Cheeger inequality), minimizing CoreCut relaxes to regularized spectral clustering. Simple inspection of CoreCut reveals why it is less sensitive to small cuts in the graph. Together, these results show that unbalanced partitions from spectral clustering can be understood as overfitting to noise in the periphery of a sparse and stochastic graph. Regularization fixes this overfitting. In addition to this statistical benefit, these results also demonstrate how regularization can improve the computational speed of spectral clustering. We provide simulations and data examples to illustrate these results.


1 Introduction

Spectral clustering partitions the nodes of a graph into groups based upon the eigenvectors of the graph Laplacian shi2000normalized; von2007tutorial. Despite spectral clustering's reputation for being "popular", in applied research using graph data, spectral clustering (without regularization) often returns a partition of the nodes that is uninteresting, typically finding one large cluster that contains most of the nodes and many smaller clusters, each with only a few nodes. Examples include brain graphs binkiewicz2017covariate and social networks from Facebook zhang2017discovering and Twitter zhang2017attention. One key motivation for spectral clustering is that it relaxes a discrete optimization problem of minimizing graph conductance. Previous research has shown that across a wide range of social and information networks, the clusters with the smallest graph conductance are often rather small leskovec2009community. Figure 1 illustrates the leading singular vectors of a communication network from Facebook during the 2012 French presidential election zhang2017discovering. The singular vectors localize on a few nodes, which leads to a highly unbalanced partition.

amini2013pseudo proposed regularized spectral clustering, which adds a weak edge between every pair of nodes with edge weight $\tau/N$, where $N$ is the number of nodes in the network and $\tau$ is a tuning parameter. chaudhuri2012spectral proposed a related technique. Figure 1 illustrates how regularization changes the leading singular vectors in the Facebook example. The singular vectors are more spread across the nodes.
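To make this construction concrete, the following sketch (ours, not the authors' code) forms the regularized adjacency matrix by adding $\tau/N$ to every entry and returns the leading eigenvectors of the corresponding normalized Laplacian; defaulting $\tau$ to the average degree is a common heuristic that we assume here, not a choice made in this paper.

import numpy as np

def regularized_spectral_embedding(A, tau=None, k=2):
    """Leading eigenpairs of the regularized normalized Laplacian.

    A   : symmetric adjacency matrix as a dense numpy array (N x N)
    tau : weight of the weak edge added to every pair of nodes, as tau/N;
          defaults to the average degree (an assumed heuristic)
    k   : number of leading (smallest-eigenvalue) eigenpairs to return
    """
    N = A.shape[0]
    if tau is None:
        tau = A.sum(axis=1).mean()
    A_tau = A + tau / N                          # weak edge of weight tau/N on every pair
    d_tau = A_tau.sum(axis=1)                    # regularized degrees = row sums of A_tau
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d_tau))
    L_tau = np.eye(N) - D_inv_sqrt @ A_tau @ D_inv_sqrt
    vals, vecs = np.linalg.eigh(L_tau)           # eigenvalues in ascending order
    return vals[:k], vecs[:, :k]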

Many empirical networks have a core-periphery structure, where nodes in the core of the graph are more densely connected and nodes in the periphery are sparsely connected borgatti2000models. In Figure 1, regularized spectral clustering leads to a “deeper cut” into the core of the graph. In this application, regularization helps spectral clustering provide a more balanced partition, revealing a more salient political division.

Figure 1: This figure shows the leading singular vectors of the communication network. In the left panel, the singular vectors from vanilla spectral clustering are localized on a few nodes. In the right panel, the singular vectors from regularized spectral clustering provide a more balanced partition.

Previous research has studied how regularization improves the spectral convergence of the graph Laplacian qin2013regularized; joseph2016impact; le2015sparse. This paper aims to provide an alternative interpretation of regularization by relating it to graph conductance. We call spectral clustering without regularization Vanilla-SC and with edge-wise regularization Regularized-SC amini2013pseudo.

This paper demonstrates (1) what makes Vanilla-SC fail and (2) how Regularized-SC fixes that problem. One key motivation for Vanilla-SC is that it relaxes a discrete optimization problem of minimizing graph conductance chung1997spectral. Yet, this graph conductance problem is fragile to small cuts in the graph. The fundamental fragility of graph conductance that is studied in this paper comes from the type of subgraph illustrated in Figure 2 and defined here.

Definition 1.1.

In an unweighted graph $G = (V, E)$, a subset $S \subset V$ is $g$-dangling if and only if the following conditions hold.

  1. $S$ contains exactly $g$ nodes.

  2. There are exactly $g - 1$ edges within $S$ and they do not form any cycles (i.e. the node-induced subgraph on $S$ is a tree).

  3. There is exactly one edge between nodes in $S$ and nodes in $S^c$.

Figure 2: A $g$-dangling set.

The argument in this paper is structured as follows:

  1. A $g$-dangling set has a small graph conductance, approximately $1/(2g)$ (Section 3.1).

  2. For any fixed $g$, graphs sampled from a sparse inhomogeneous model with $N$ nodes have a number of $g$-dangling sets that grows proportionally to $N$ in expectation (Theorem 3.4). As such, $g$-dangling sets are created as an artifact of the sparse and stochastic noise.

  3. This creates many eigenvalues in the normalized graph Laplacian with an average value less than $1/(2g-1)$ (Theorem 3.5); these eigenvalues reveal only noise. They are so numerous that they conceal good cuts to the core of the graph.

  4. The abundance of eigenvalues smaller than $1/(2g-1)$ also makes the eigengap exceptionally small. This slows down the numerical convergence of computing the leading eigenvectors and eigenvalues.

  5. CoreCut, which is graph conductance on the regularized graph, does not assign a small value to small sets of nodes. This prevents all of the statistical and computational consequences listed above for $g$-dangling sets and any other small noisy subgraphs that have a small conductance. Regularized-SC inherits the advantages of CoreCut.

The penultimate section evaluates the overfitting of spectral clustering in an experiment with several empirical graphs from SNAP snapnets. This experiment randomly divides the edges into a training set and a test set, runs spectral clustering using only the training edges, and then, with the resulting partition, compares the "training edge conductance" to the "testing edge conductance". This shows that Vanilla-SC overfits and Regularized-SC does not. Moreover, Vanilla-SC tends to identify highly unbalanced partitions, while Regularized-SC provides a balanced partition.
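The following sketch outlines one way such an edge-splitting evaluation can be run; the 50/50 split, the dense-matrix representation, and the boolean partition vector are our assumptions for illustration, not details taken from the experiment section.

import numpy as np

def split_edges(A, train_frac=0.5, seed=0):
    """Randomly assign each undirected edge to a training or a testing graph."""
    rng = np.random.default_rng(seed)
    iu, ju = np.triu_indices_from(A, k=1)
    keep = A[iu, ju] > 0                          # existing edges only
    iu, ju = iu[keep], ju[keep]
    in_train = rng.random(iu.size) < train_frac
    A_train = np.zeros_like(A, dtype=float)
    A_test = np.zeros_like(A, dtype=float)
    A_train[iu[in_train], ju[in_train]] = 1.0
    A_test[iu[~in_train], ju[~in_train]] = 1.0
    return A_train + A_train.T, A_test + A_test.T

def conductance(A, S):
    """Conductance of the node set indicated by the boolean vector S."""
    cut = A[S][:, ~S].sum()
    vol = min(A[S].sum(), A[~S].sum())
    return cut / vol if vol > 0 else np.inf

# Usage: cluster using A_train only, obtain a boolean partition S_hat, then
# compare conductance(A_train, S_hat) with conductance(A_test, S_hat).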

The paper concludes with a discussion which illustrates how these results might help inform the construction of neural architectures for a generalization of Convolutional Neural Networks to cases where the input data has an estimated dependence structure that is represented as a graph lecun1995convolutional; bruna2013spectral; kipf2016semi; levie2017cayleynets.

2 Notation

Graph notation

The graph or network $G = (V, E)$ consists of node set $V = \{1, \dots, N\}$ and edge set $E$. For a weighted graph, the edge weight $w_{ij}$ can take any non-negative value for $(i, j) \in E$, and define $w_{ij} = 0$ if $(i, j) \notin E$. For an unweighted graph, define the edge weight $w_{ij} = 1$ if $(i, j) \in E$ and $w_{ij} = 0$ otherwise. For each node $i \in V$, we denote its degree as $d_i = \sum_{j \in V} w_{ij}$. Given $S \subseteq V$, the node-induced subgraph of $S$ in $G$ is the graph with vertex set $S$ that includes every edge whose endpoints are both in $S$, i.e. its edge set is $E(S) = \{(i, j) \in E : i, j \in S\}$.
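As a small illustration of this notation (our own toy example; the edge weights are arbitrary), weighted degrees and node-induced subgraphs can be computed directly with networkx:

import networkx as nx

# Toy weighted graph: V = {1, 2, 3, 4}, with illustrative edge weights.
G = nx.Graph()
G.add_weighted_edges_from([(1, 2, 1.0), (2, 3, 0.5), (1, 3, 2.0), (3, 4, 1.0)])

d = dict(G.degree(weight="weight"))   # d_i = sum_j w_ij
S = {1, 2, 3}
G_S = G.subgraph(S)                   # node-induced subgraph: only edges with both endpoints in S
print(d)                              # {1: 3.0, 2: 1.5, 3: 3.5, 4: 1.0}
print(list(G_S.edges()))              # [(1, 2), (1, 3), (2, 3)]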

Graph cut notation

For any subset $S \subseteq V$, we denote its complement as $S^c = V \setminus S$, and its volume in graph $G$ as $\mathrm{vol}(S) = \sum_{i \in S} d_i$. Note that any non-empty subset $S \subsetneq V$ forms a partition of $V$ with its complement $S^c$. We denote the cut for such a partition on graph $G$ as

$$\mathrm{cut}(S) = \sum_{i \in S,\, j \in S^c} w_{ij},$$

and denote the graph conductance of any subset $S$ with $\mathrm{vol}(S) \le \mathrm{vol}(S^c)$ as

$$\phi(S) = \frac{\mathrm{cut}(S)}{\mathrm{vol}(S)}.$$

Without loss of generality, we focus on non-empty subsets $S$ with $\mathrm{vol}(S) \le \mathrm{vol}(S^c)$.
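Translated directly into code (a sketch, assuming the adjacency matrix is a dense numpy array and $S$ is given as a boolean membership vector over the $N$ nodes):

import numpy as np

def vol(A, S):
    """vol(S) = sum of (weighted) degrees of the nodes in S."""
    return A[S].sum()

def cut(A, S):
    """cut(S) = total weight of edges between S and its complement S^c."""
    return A[S][:, ~S].sum()

def phi(A, S):
    """Graph conductance phi(S), assuming vol(S) <= vol(S^c) as in the text."""
    return cut(A, S) / vol(A, S)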

Notation for Vanilla-SC and Regularized-SC

We denote the adjacency matrix as $A \in \mathbb{R}^{N \times N}$ with $A_{ij} = w_{ij}$, and the degree matrix as $D \in \mathbb{R}^{N \times N}$ with $D_{ii} = d_i$ and $D_{ij} = 0$ for $i \ne j$. The normalized graph Laplacian matrix is

$$L = I - D^{-1/2} A D^{-1/2},$$

with eigenvalues $\lambda_1 \le \lambda_2 \le \dots \le \lambda_N$ (here and elsewhere, “leading” refers to the smallest eigenvalues). Let $v_1, \dots, v_K$ represent the eigenvectors/eigenfunctions of $L$ corresponding to the eigenvalues $\lambda_1, \dots, \lambda_K$.

There is a broad class of spectral clustering algorithms which represent each node $i \in V$ with the vector $(v_1(i), \dots, v_K(i)) \in \mathbb{R}^K$ and cluster the nodes by clustering their representations in $\mathbb{R}^K$ with some algorithm. For simplicity, this paper focuses on the setting $K = 2$ and only uses $v_2$. We call Vanilla-SC the algorithm which returns the set $\hat{S}$ that solves

$$\hat{S} = \arg\min_{S \in \{\{i \,:\, v_2(i) > t\} \,:\, t \in \mathbb{R}\}} \phi(S). \qquad (2.1)$$

This construction of a partition appears in both shi2000normalized and in the proof of the Cheeger inequality chung1996laplacians; chung1997spectral, which says that

$$\frac{\lambda_2}{2} \;\le\; \min_{S} \phi(S) \;\le\; \sqrt{2 \lambda_2}.$$

Edge-wise regularization amini2013pseudo adds $\tau/N$ to every element of the adjacency matrix, where $\tau > 0$ is a tuning parameter. It replaces $A$ by the matrix $A_\tau = A + \frac{\tau}{N} \mathbf{1}\mathbf{1}^T$, and the node degree matrix $D$ by $D_\tau$, which is computed with the row sums of $A_\tau$ (instead of the row sums of $A$), to get

$$L_\tau = I - D_\tau^{-1/2} A_\tau D_\tau^{-1/2}.$$

We define $G_\tau$ to be the weighted graph with adjacency matrix $A_\tau$ as defined above. Regularized-SC partitions the graph using the leading eigenvectors of $L_\tau$, which we represent by $v_1^\tau, \dots, v_K^\tau$. Similarly, we only use $v_2^\tau$ when $K = 2$. We call Regularized-SC the algorithm which returns the set $\hat{S}_\tau$ that solves

$$\hat{S}_\tau = \arg\min_{S \in \{\{i \,:\, v_2^\tau(i) > t\} \,:\, t \in \mathbb{R}\}} \phi_{G_\tau}(S),$$

where $\phi_{G_\tau}$ denotes the graph conductance computed on $G_\tau$.
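A minimal sketch of the two sweep-cut procedures follows; setting tau = 0 gives Vanilla-SC and tau > 0 gives Regularized-SC. The dense eigendecomposition, the $D^{-1/2}$ rescaling of the eigenvector before sweeping, sweeping over the conductance of the (possibly regularized) graph, and the assumption that no node is isolated are our implementation choices, not details fixed by the paper.

import numpy as np

def sweep_sc(A, tau=0.0):
    """Sweep cut over v2 of the (possibly regularized) normalized Laplacian.

    Returns the threshold set of v2 with the smallest conductance, computed
    on the same graph that produced the eigenvector.
    """
    N = A.shape[0]
    A_t = A + tau / N
    d = A_t.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(N) - D_inv_sqrt @ A_t @ D_inv_sqrt
    _, vecs = np.linalg.eigh(L)
    v2 = D_inv_sqrt @ vecs[:, 1]              # eigenfunction on the nodes
    order = np.argsort(v2)                    # sweep over the threshold sets of v2
    best_S, best_phi = None, np.inf
    total_vol = d.sum()
    for k in range(1, N):
        S = np.zeros(N, dtype=bool)
        S[order[:k]] = True
        vol_S = d[S].sum()
        cut_S = A_t[S][:, ~S].sum()
        phi_S = cut_S / min(vol_S, total_vol - vol_S)
        if phi_S < best_phi:
            best_S, best_phi = S, phi_S
    return best_S, best_phi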

3 Vanilla-SC and the periphery of sparse and stochastic graphs

For notational simplicity, this section only considers unweighted graphs.

3.1 Dangling sets have small graph conductance.

The following fact follows directly from the definition of a $g$-dangling set.

Fact 3.1.

If $S$ is a $g$-dangling set, then its graph conductance is

$$\phi(S) = \frac{1}{2(g-1) + 1} = \frac{1}{2g - 1}.$$

To interpret the scale of this graph conductance, imagine that a graph is generated from a Stochastic Blockmodel with two equal-size blocks, where any two nodes from the same block connect with probability $p$ and two nodes from different blocks connect with probability $q < p$ holland1983stochastic. Then, the graph conductance of one of the blocks is approximately $q/(p + q)$ (up to random fluctuations). If there is a $g$-dangling set with $2g - 1 > (p + q)/q$, then the $g$-dangling set has a smaller graph conductance than the block.
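A quick numerical check of this comparison, with parameter values chosen purely for illustration ($g = 5$, $p = 0.10$, $q = 0.03$):

# Fact 3.1: a g-dangling set has cut = 1 and vol = 2(g - 1) + 1 = 2g - 1.
g = 5
phi_dangling = 1 / (2 * g - 1)     # = 1/9, about 0.111

# Two-block Stochastic Blockmodel: within-block p, between-block q.
p, q = 0.10, 0.03
phi_block = q / (p + q)            # about 0.231, up to random fluctuations

print(phi_dangling < phi_block)    # True: the noisy dangling set looks like the "better" cut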

3.2 There are many dangling sets in sparse and stochastic social networks.

We consider random graphs sampled from the following model which generalizes Stochastic Blockmodels. Its key assumption is that edges are independent.

Definition 3.2.

A graph $G$ is generated from an inhomogeneous random graph model if the vertex set $V$ contains $N$ nodes and all edges are independent. That is, for any two nodes $i, j \in V$, node $i$ connects to node $j$ with some probability $p_{ij}$, and this event is independent of the formation of any other edges. We only consider undirected graphs with no self-loops.

Definition 3.3.

Node $i$ is a peripheral node in an inhomogeneous random graph with $N$ nodes if there exists some constant $b > 0$ such that $p_{ij} \le b/N$ for all other nodes $j$, where we allow $p_{ij} = 0$.

For example, an Erdős–Rényi graph is an inhomogeneous random graph. If the Erdős–Rényi edge probability is specified as $\lambda/N$ for some fixed $\lambda > 0$, then all nodes are peripheral. As another example, a common assumption in the statistical literature on Stochastic Blockmodels is that the minimum expected degree grows faster than $\log N$. Under this assumption, there are no peripheral nodes in the graph. That assumption is perhaps controversial because empirical graphs often have many low-degree nodes.

Theorem 3.4.

Suppose an inhomogeneous random graph model on $N$ nodes such that, for some $\epsilon > 0$, $p_{ij} \ge \epsilon/N$ for all pairs of nodes $i, j$. If that model contains a non-vanishing fraction of peripheral nodes $P \subset V$, with $|P| \ge \eta N$ for some $\eta > 0$, then the expected number of distinct $g$-dangling sets in the sampled graph grows proportionally to $N$.

Theorem 3.4 studies graphs sampled from an inhomogeneous random graph model with a non-vanishing fraction of peripheral nodes. Throughout the paper, we refer to these graphs more simply as graphs with a sparse and stochastic periphery. In fact, the proof of Theorem 3.4 only relies on the randomness of the edges in the periphery, i.e. the edges that have an end point in $P$. The proof does not rely on the distribution of the node-induced subgraph of the "core" $P^c$. Combined with Fact 3.1, Theorem 3.4 shows that graphs with a sparse and stochastic periphery generate an abundance of $g$-dangling sets, which creates an abundance of cuts that have small conductance but might only reveal noise. leskovec2009community also shows, using real datasets, that a substantial fraction of nodes barely connect to the rest of the graph, often through 1-whiskers, which generalize $g$-dangling sets.
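In the same spirit (a simulation sketch, not the proof of Theorem 3.4), one can sample sparse Erdős–Rényi graphs of growing size and count dangling sets directly; for simplicity the check below only counts 2-dangling sets, i.e. adjacent node pairs with a single edge leaving the pair, and the expected degree lambda = 2 is our own choice.

import networkx as nx

def count_2_dangling(G):
    """Count 2-dangling sets: adjacent pairs {u, v} with exactly one edge to the rest."""
    count = 0
    for u, v in G.edges():
        # deg(u) + deg(v) = 2 (the internal edge counted twice) + number of external edges
        if G.degree(u) + G.degree(v) == 3:
            count += 1
    return count

lam = 2.0                                        # expected degree; sparse regime p_ij = lam / N
for N in [1000, 2000, 4000, 8000]:
    G = nx.fast_gnp_random_graph(N, lam / N, seed=0)
    print(N, count_2_dangling(G))                # counts grow roughly linearly in N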

Theorem 3.5.

If a graph contains $K$ $g$-dangling sets, and the rest of the graph has volume at least $K(2g - 1)$, then the normalized graph Laplacian has at least $K/2$ eigenvalues that are at most $1/(2g - 1)$.

Theorem 3.5 shows that every two dangling sets lead to a small eigenvalue. Due to the abundance of $g$-dangling sets (Theorem 3.4), there are many small eigenvalues, and their corresponding eigenvectors are localized on a small set of nodes. This explains what we see in the data example in Figure 1. Each of these many eigenvectors is costly to compute (due to the small eigengaps), and then one needs to decide which of them are localized (which requires another tuning step).
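The localization can also be seen in a small synthetic example (our own construction): attach several $g$-dangling paths to a dense Erdős–Rényi core and inspect the bottom of the spectrum of $L$. The sizes below are arbitrary; the point is the cluster of eigenvalues at or below roughly $1/(2g - 1)$, as Theorem 3.5 predicts.

import numpy as np
import networkx as nx

def core_with_dangling_paths(n_core=60, p_core=0.3, n_sets=20, g=4, seed=0):
    """Dense ER core plus n_sets dangling paths of g nodes, each attached to the core by one edge."""
    G = nx.gnp_random_graph(n_core, p_core, seed=seed)
    rng = np.random.default_rng(seed)
    next_id = n_core
    for _ in range(n_sets):
        path = list(range(next_id, next_id + g))
        nx.add_path(G, path)                            # a g-node tree (here, a path)
        G.add_edge(path[0], int(rng.integers(n_core)))  # the single edge into the core
        next_id += g
    return G

G = core_with_dangling_paths()
A = nx.to_numpy_array(G)
d = A.sum(axis=1)
L = np.eye(A.shape[0]) - np.diag(d ** -0.5) @ A @ np.diag(d ** -0.5)
vals, vecs = np.linalg.eigh(L)
print(np.round(vals[:25], 3))   # a cluster of small eigenvalues near and below 1/(2g - 1)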

4 CoreCut ignores small cuts and relaxes to Regularized-SC.

Similar to the way that minimizing graph conductance relaxes to Vanilla-SC chung1997spectral; shi2000normalized; von2007tutorial, we introduce a new graph conductance, CoreCut, whose minimization relaxes to Regularized-SC. The following diagram illustrates these relations. This section compares $\phi$ and CoreCut. For ease of exposition, we continue to focus our attention on partitioning $V$ into two sets.
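Reading CoreCut as the graph conductance computed on the regularized graph $G_\tau$ (as the abstract describes), a direct implementation is the sketch below; the dense-matrix representation and the boolean set indicator are assumptions made for illustration.

import numpy as np

def corecut(A, S, tau):
    """Conductance of S on the regularized graph G_tau (adjacency A + tau/N on every entry)."""
    N = A.shape[0]
    A_tau = A + tau / N
    cut_tau = A_tau[S][:, ~S].sum()
    vol_tau = min(A_tau[S].sum(), A_tau[~S].sum())
    return cut_tau / vol_tau

# Relative to the vanilla conductance, a set S gains roughly tau * |S| in volume
# and (tau / N) * |S| * |S^c| in cut, so very small sets such as dangling sets can
# no longer achieve values near zero.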