
Spectral methods for testing cluster structure of graphs

12/30/2018
by Sandeep Silwal, et al. (MIT)

In the framework of graph property testing, we study the problem of determining if a graph admits a cluster structure. We say that a graph is (k, ϕ)-clusterable if it can be partitioned into at most k parts such that each part has conductance at least ϕ. We present an algorithm that accepts all graphs that are (2, ϕ)-clusterable with probability at least 2/3 and rejects all graphs that are ϵ-far from (2, ϕ^*)-clusterable for ϕ^* ≤ μϕ²ϵ² with probability at least 2/3, where μ > 0 is a parameter that affects the query complexity. This improves upon the work of Czumaj, Peng, and Sohler by removing a log n factor from the denominator of the bound on ϕ^* for the case of k = 2. Our work was concurrent with the work of Chiplunkar et al., who achieved the same improvement for all values of k. Our approach for the case k = 2 relies on the geometric structure of the eigenvectors of the graph Laplacian and results in an algorithm with query complexity O(n^{1/2 + O(1)μ} · poly(1/ϵ, 1/ϕ, log n)).


1 Introduction

In this paper we study property testing of graphs in the bounded degree model. The input is a graph G = (V, E) on n vertices where all the vertices have degree at most d. Given a graph property P, we say that G is ϵ-far from satisfying P if at least ϵdn edges need to be added to or removed from G to satisfy P. A property testing algorithm for P is an algorithm that accepts every graph satisfying P with probability at least 2/3 and rejects every graph that is ϵ-far from satisfying P with probability at least 2/3.

G is represented as an oracle that returns the i-th neighbor of any vertex v for any i ≤ d. If i is larger than the degree of v, a special symbol is returned. The goal of property testing is to find an algorithm with an efficient query complexity, defined as the number of oracle queries that the algorithm performs. This framework of property testing of graphs was developed by Goldreich and Ron [6] and has been applied to study various properties such as bipartiteness [5] and 3-colorability [6]. See [6] and [10] for more examples.

Our paper deals with a generalization of property testing. We are interested in testing for a family of properties (P_ϕ)_{ϕ > 0} that depends on a single parameter ϕ and is nested, satisfying P_ϕ ⊆ P_{ϕ'} for all ϕ ≥ ϕ'. Our goal is an algorithm which accepts graphs satisfying P_ϕ with probability at least 2/3 and rejects graphs that are ϵ-far from satisfying P_{ϕ'} with probability at least 2/3, where ϕ' ≤ ϕ. A diagram for this generalization of property testing is shown in Figure 1.

Figure 1: We want to accept graphs satisfying P_ϕ and reject graphs that are ϵ-far from satisfying P_{ϕ'} where ϕ' ≤ ϕ.

We are interested in the property of (k, ϕ)-clusterability as defined by Czumaj, Peng, and Sohler in [3]. Roughly speaking, a graph is (k, ϕ)-clusterable if it can be partitioned into at most k clusters where vertices in the same cluster are “well-connected.” The connectedness of the clusters is measured in terms of their inner conductance, defined below. The idea of using conductance for graph clustering has been studied in numerous works, such as [11].

Testing for (k, ϕ)-clusterability is inspired by expansion testing, which has been studied extensively. A graph G is called an α-expander if every S ⊆ V of size at most n/2 has neighborhood of size at least α|S|. Czumaj and Sohler [4] showed that an algorithm proposed by Goldreich and Ron in [7] can distinguish between α-expanders and graphs which are ϵ-far from having expansion at least Ω(α²/log n) in the bounded degree model. This work was subsequently improved by Kale and Seshadhri [8] and then by Nachmias and Shapira [9], who showed that the same algorithm distinguishes graphs which are α-expanders from graphs which are ϵ-far from Ω(α²)-expanders. The work of Nachmias and Shapira also shows that expansion testing is related to the second eigenvalue of the Laplacian matrix. In addition, as shown by [2], testing for (k, ϕ)-clusterability is related to the (k+1)-st eigenvalue of the Laplacian, so (k, ϕ)-clusterability testing is a natural extension of expansion testing.

We now define conductance, which is closely related to expansion. Let S ⊆ V be such that |S| ≤ n/2. The conductance of S is defined to be ϕ(S) = E(S, V ∖ S)/(d|S|), where E(S, V ∖ S) is the number of edges between S and V ∖ S. The conductance of G is defined to be the minimum conductance over all subsets S with |S| ≤ n/2 and is denoted ϕ(G). Now for any C ⊆ V, let G[C] denote the induced subgraph of G on the vertex set defined by C. We let ϕ(G[C]) denote the conductance of this subgraph. To avoid confusion we call ϕ(G[C]) the inner conductance of C.
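To make the definition above concrete, the following small Python sketch computes ϕ(S) and ϕ(G) by brute force. It is our own illustration (the function names and dense-adjacency representation are not from the paper); the d·|S| normalization follows the definition just given.

```python
import numpy as np
from itertools import combinations

def conductance_of_set(A, S, d):
    """phi(S) = E(S, V \\ S) / (d |S|) for a set S with |S| <= n/2,
    where A is the adjacency matrix and d is the maximum degree."""
    mask = np.zeros(A.shape[0], dtype=bool)
    mask[list(S)] = True
    crossing = A[np.ix_(mask, ~mask)].sum()  # edges between S and V \ S
    return crossing / (d * len(S))

def conductance(A, d):
    """phi(G): minimum conductance over all subsets of size <= n/2.
    Brute force, exponential in n -- for tiny illustrative graphs only."""
    n = A.shape[0]
    return min(conductance_of_set(A, S, d)
               for size in range(1, n // 2 + 1)
               for S in combinations(range(n), size))

# example: the 4-cycle with d = 2 has conductance 1/2,
# achieved by cutting it into two paths of two vertices each
C4 = np.zeros((4, 4))
for i in range(4):
    C4[i, (i + 1) % 4] = C4[(i + 1) % 4, i] = 1.0
phi_C4 = conductance(C4, d=2)
```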

We say G is (k, ϕ)-clusterable if there exists a partition of V into at most k subsets C_1, …, C_h such that ϕ(G[C_i]) ≥ ϕ for all 1 ≤ i ≤ h. This definition is slightly different from the one used by Czumaj, Peng, and Sohler in [3] because their definition also requires the outer conductance ϕ(C_i) to be bounded from above for all i. The algorithm of Czumaj, Peng, and Sohler accepts all (k, ϕ)-clusterable graphs with probability at least 2/3 and rejects all graphs that are ϵ-far from (k, ϕ^*)-clusterable where ϕ^* = O(ϕ²ϵ⁴/log n) and where the hidden constant depends only on k [3]. Our work improves upon this result by removing the log n dependency for the case of k = 2.

Our main result is an algorithm in the bounded degree model that accepts every (2, ϕ)-clusterable graph with probability at least 2/3 and rejects every graph that is ϵ-far from (2, ϕ^*)-clusterable with probability at least 2/3 if ϕ^* ≤ μϕ²ϵ², where μ > 0 is a parameter that we can choose which affects the query complexity. Our algorithm has query complexity O(n^{1/2 + O(1)μ} · poly(1/ϵ, 1/ϕ, log n)), where poly denotes a polynomial in 1/ϵ, 1/ϕ, and log n.

The work of Czumaj et al. for testing (k, ϕ)-clusterability uses property testing of distributions, such as testing the ℓ2 norm of a discrete distribution and testing the closeness of two discrete distributions. For some work on testing the ℓ2 norm of a discrete distribution and testing closeness of discrete distributions, see [4] and [1] respectively.

Our work was concurrent with the work of Chiplunkar et al. [2], who give an algorithm that for any fixed k accepts every (k, ϕ)-clusterable graph with probability at least 2/3 and rejects every graph that is ϵ-far from (k, ϕ^*)-clusterable for ϕ^* ≤ μϕ²ϵ² with probability at least 2/3 using O(n^{1/2 + O(1)μ} · poly(1/ϵ, 1/ϕ, log n)) queries. This matches the query bound achieved by Nachmias and Shapira in the expander testing setting (k = 1). The algorithm of Chiplunkar et al. also looks at the (k+1)-st largest eigenvalue of a transformation of the lazy random walk matrix M, and accepts if this eigenvalue is below a certain threshold. We essentially employ the same approach in the case of k = 2. Our proof of correctness is rather different, relying on the geometric properties of the endpoint distributions of random walks on the input graph to deduce the size of the eigenvalues of M.

We present our algorithm, Cluster-Test, in Section 2. We prove that Cluster-Test accepts (2, ϕ)-clusterable graphs in Section 3.1 and that it rejects graphs that are ϵ-far from (2, ϕ^*)-clusterable in Section 3.2.

1.1 Definitions

Definition 1.1 (Graph clusterability).

G is (2, ϕ)-clusterable if the conductance of G is at least ϕ, or V can be partitioned into two subsets C_1 and C_2 such that the inner conductance of C_i is at least ϕ for each i ∈ {1, 2}.

The motivating idea in designing Cluster-Test is to compute the rank of M^t − J/n, where M is the lazy random walk matrix and J is the matrix of all 1’s. Essentially, we show that when G is (2, ϕ)-clusterable, M^t − J/n is “close” to a rank 1 matrix, while when G is ϵ-far from (2, ϕ^*)-clusterable, M^t − J/n is not “close” to rank 1. The intuition for this comes from Lemma 3.3, which tells us that the third eigenvalue of M is small if G is (2, ϕ)-clusterable.

Because computing the eigenvalues of M^t − J/n is too expensive, we instead look at the eigenvalues of 2 by 2 principal submatrices of (M^t − J/n)^T(M^t − J/n). These principal submatrices are the Gram matrices of the endpoint distribution vectors of random walks on G minus the uniform vector (1/n)·1. This allows us to show that if G is (2, ϕ)-clusterable then we can expect all of these submatrices to have at least one small eigenvalue, while if G is ϵ-far from (2, ϕ^*)-clusterable, both of the eigenvalues of most of these principal submatrices are large. This is essentially what our algorithm tests for.

Before we present our algorithm we introduce some standard definitions and tools that we use. Given a graph G with maximum degree d, we work with the lazy random walk matrix M defined as follows: the off-diagonal entries of M are 1/(2d) times the corresponding entry in the adjacency matrix, while the diagonal entries of M are set so that the columns of M add to 1, which corresponds to adding self loops of the appropriate weights in G. We then define the Laplacian matrix as L = I − M. Our definition of the Laplacian follows the convention used in [3] so that we can easily use some of their results.
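The construction of M and L, together with the rank-1 intuition from the previous section, can be checked numerically. The sketch below is our own illustration on a hypothetical example (two 5-cliques joined by a bridge, a (2, ϕ)-clusterable graph): M^t − J/n has essentially one large singular value.

```python
import numpy as np

def lazy_walk_matrix(A, d):
    """Off-diagonal entries are 1/(2d) times the adjacency entries;
    the diagonal is set so every column sums to 1 (lazy self-loops)."""
    M = A / (2 * d)
    np.fill_diagonal(M, 0.0)
    M += np.diag(1.0 - M.sum(axis=0))
    return M

# two 5-cliques joined by a single bridge edge
n = 10
A = np.zeros((n, n))
A[:5, :5] = 1.0
A[5:, 5:] = 1.0
np.fill_diagonal(A, 0.0)
A[4, 5] = A[5, 4] = 1.0

M = lazy_walk_matrix(A, d=5)
L = np.eye(n) - M                         # Laplacian, as defined above
P = np.linalg.matrix_power(M, 30) - np.ones((n, n)) / n
svals = np.linalg.svd(P, compute_uv=False)
# svals[0] stays bounded away from 0 (the two-cluster direction),
# while the remaining singular values are essentially 0.
```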

Let λ_1 ≤ λ_2 ≤ ⋯ ≤ λ_n denote the eigenvalues of L and let v_1, …, v_n denote the corresponding orthonormal eigenvectors. Let β_1 ≥ β_2 ≥ ⋯ ≥ β_n denote the eigenvalues of M, where β_i = 1 − λ_i for 1 ≤ i ≤ n. For a vertex x ∈ V, we define m_x^t to be the probability distribution of the endpoint of a length t lazy random walk that starts at vertex x. That is,

m_x^t = M^t 1_x,   (1)

where 1_x is the vector with 1 in the entry corresponding to the vertex x and 0 elsewhere. Because v_1 = (1/√n)·1, we typically work with m_x^t − (1/n)·1 for convenience. From Eq. (1), we have

m_x^t = Σ_{i=1}^n β_i^t ⟨1_x, v_i⟩ v_i.   (2)
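Equations (1) and (2) can be verified directly on a small graph. This sketch (our own, on a hypothetical 6-cycle) computes m_x^t both by matrix powering and through the eigendecomposition of L:

```python
import numpy as np

# 6-cycle, maximum degree d = 2
n, d = 6, 2
A = np.zeros((n, n))
for i in range(n):
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1.0
M = A / (2 * d)
M += np.diag(1.0 - M.sum(axis=0))
L = np.eye(n) - M

lam, V = np.linalg.eigh(L)   # lambda_1 <= ... <= lambda_n, orthonormal columns
beta = 1.0 - lam             # eigenvalues of M: beta_i = 1 - lambda_i

t, x = 4, 0
one_x = np.zeros(n)
one_x[x] = 1.0
m_xt = np.linalg.matrix_power(M, t) @ one_x     # Eq. (1)

# Eq. (2): m_x^t = sum_i beta_i^t <1_x, v_i> v_i
m_xt_eig = sum(beta[i] ** t * (one_x @ V[:, i]) * V[:, i] for i in range(n))
```

Both computations agree, and m_x^t is a genuine probability distribution (nonnegative, summing to 1).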

1.2 Preliminary Results

In this paper ∥·∥ always denotes the ℓ2 norm unless stated otherwise. We need the following classical result from [12], which roughly states that eigenvalues are stable under small perturbations.

Proposition 1.2 (Weyl’s Inequality).

Let A, E ∈ R^{n×n} be symmetric and suppose A has eigenvalues α_1 ≥ ⋯ ≥ α_n and A + E has eigenvalues α̃_1 ≥ ⋯ ≥ α̃_n. Furthermore, suppose ∥E∥_F ≤ ξ, where ∥·∥_F denotes the Frobenius norm. Then |α_i − α̃_i| ≤ ξ for all 1 ≤ i ≤ n.
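A quick numerical check of Weyl’s inequality (our own illustration, with an arbitrary symmetric matrix and a small symmetric perturbation):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
A = rng.standard_normal((n, n)); A = (A + A.T) / 2        # symmetric A
E = 1e-3 * rng.standard_normal((n, n)); E = (E + E.T) / 2  # small symmetric E

alpha = np.linalg.eigvalsh(A)            # eigenvalues of A, sorted
alpha_tilde = np.linalg.eigvalsh(A + E)  # eigenvalues of A + E, sorted
shift = np.abs(alpha - alpha_tilde).max()
frob = np.linalg.norm(E, "fro")
# Weyl: every eigenvalue moves by at most ||E||_2 <= ||E||_F
```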

Our work relies on estimating dot products and norms of various distributions, where we view distributions over n elements as vectors in R^n. To estimate these quantities, we use the following result about distribution property testing.

Theorem 1.3 (Theorem 1.2 in [1]).

Let p and q be two distributions with ∥p∥, ∥q∥ ≤ b. There is an algorithm which computes an estimate of ⟨p, q⟩ that is accurate to within additive error ξ with probability at least 1 − δ and requires c·√b·log(1/δ)/ξ² samples from each of the distributions p and q for some absolute constant c.
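The idea behind such estimators can be illustrated with a naive unbiased sketch (our own; it is not the sample-optimal algorithm of [1]): for one draw X ~ p and one draw Y ~ q we have Pr[X = Y] = ⟨p, q⟩, so averaging the collision indicator over cross pairs of samples estimates the inner product.

```python
import numpy as np

def estimate_inner_product(samples_p, samples_q):
    """Naive unbiased estimate of <p, q>: average the indicator
    1[X = Y] over all cross pairs of samples from p and q."""
    sp = np.asarray(samples_p)
    sq = np.asarray(samples_q)
    return (sp[:, None] == sq[None, :]).mean()

rng = np.random.default_rng(1)
p = np.array([0.5, 0.3, 0.2])
q = np.array([0.2, 0.2, 0.6])
est = estimate_inner_product(rng.choice(3, size=4000, p=p),
                             rng.choice(3, size=4000, p=q))
# true value: <p, q> = 0.5*0.2 + 0.3*0.2 + 0.2*0.6 = 0.28
```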

Theorem 1.4 (Lemma 3.2 in [3]).

Let p be a distribution over n elements. There is an algorithm that accepts if ∥p∥ ≤ σ and rejects if ∥p∥ > 2σ with probability at least 2/3 and requires O(√n) samples from p. A condition on the input σ is that it must be at least 1/√n.
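The quantity such norm testers estimate also has a collision interpretation: ∥p∥² is the probability that two independent draws from p collide. A minimal sketch (our own, not the tester of [3]):

```python
import numpy as np

def estimate_sq_norm(samples):
    """||p||^2 is the collision probability of two independent draws
    from p; count collisions among all ordered pairs of distinct samples."""
    s = np.asarray(samples)
    m = len(s)
    collisions = (s[:, None] == s[None, :]).sum() - m   # drop the diagonal
    return collisions / (m * (m - 1))

rng = np.random.default_rng(2)
p = np.array([0.7, 0.2, 0.1])   # ||p||^2 = 0.49 + 0.04 + 0.01 = 0.54
est = estimate_sq_norm(rng.choice(3, size=3000, p=p))
```

A tester then compares this estimate against the threshold σ².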

2 Algorithm

We now describe our algorithm Cluster-Test. Our algorithm performs multiple lazy random walks on the input graph and uses the distribution testing results from Theorems 1.3 and 1.4 to approximate a principal submatrix of (M^t − J/n)^T(M^t − J/n). As shown in Sections 3.1 and 3.2, our choice of the walk length t in Theorem 2.1 is large enough so that a random walk of length t mixes well in the case that G is (2, ϕ)-clusterable and small enough so that the random walk does not mix well if G is ϵ-far from (2, ϕ^*)-clusterable.

We use the notation A_{u,v} for the 2 by 2 submatrix of (M^t − J/n)^T(M^t − J/n) with rows and columns indexed by the vertices u and v. Noting that the column of M^t − J/n indexed by x is m_x^t − (1/n)·1, we can write

A_{u,v} = [ ⟨m_u^t − (1/n)·1, m_u^t − (1/n)·1⟩   ⟨m_u^t − (1/n)·1, m_v^t − (1/n)·1⟩ ]
          [ ⟨m_v^t − (1/n)·1, m_u^t − (1/n)·1⟩   ⟨m_v^t − (1/n)·1, m_v^t − (1/n)·1⟩ ].

Note that we assume t depends on the parameter μ that is passed as input to Cluster-Test.

1: for s rounds do
2:     Pick a pair of vertices u and v uniformly at random from V.
3:     Run random walks of length t starting from u and starting from v.
4:     Run the ℓ2-norm tester of Theorem 1.4 on m_u^t and m_v^t using the samples from step 3. If either trial rejects, abort and reject G.
5:     Use the estimator of Theorem 1.3 with the results of step 3 to approximate each entry of A_{u,v} to within additive error η. Call the approximation Â_{u,v}.
6:     Abort and reject G if both of the eigenvalues of Â_{u,v} are larger than the threshold τ.
7: end for
8: Accept G.
Algorithm 1: Cluster-Test
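For intuition, here is a self-contained sketch of Cluster-Test in which the sampling-based estimates are replaced by exact computation of the endpoint distributions (our own simplification; the real algorithm only has oracle access to the graph and uses the estimators of Theorems 1.3 and 1.4, and the parameter values here are hypothetical):

```python
import numpy as np

def cluster_test_exact(A, d, t, tau, rounds, rng):
    """Accepts iff no sampled pair (u, v) yields a 2x2 Gram matrix
    A_{u,v} with both eigenvalues above tau. Entries of A_{u,v} are
    computed exactly instead of being estimated from random walks."""
    n = A.shape[0]
    M = A / (2 * d)
    np.fill_diagonal(M, 0.0)
    M += np.diag(1.0 - M.sum(axis=0))
    D = np.linalg.matrix_power(M, t) - 1.0 / n   # columns: m_x^t - (1/n) 1
    for _ in range(rounds):
        u, v = rng.integers(0, n, size=2)
        B = D[:, [u, v]]
        gram = B.T @ B                           # the submatrix A_{u,v}
        if np.linalg.eigvalsh(gram)[0] > tau:    # smallest eigenvalue
            return False                         # both eigenvalues large
    return True

def clique_blocks(k, size):
    """k disjoint cliques of the given size (adjacency matrix)."""
    n = k * size
    A = np.zeros((n, n))
    for b in range(k):
        A[b*size:(b+1)*size, b*size:(b+1)*size] = 1.0
    np.fill_diagonal(A, 0.0)
    return A

rng = np.random.default_rng(0)
A2 = clique_blocks(2, 5)
A2[4, 5] = A2[5, 4] = 1.0          # bridged cliques: (2, phi)-clusterable
accept = cluster_test_exact(A2, d=5, t=30, tau=0.01, rounds=60, rng=rng)

A3 = clique_blocks(3, 5)           # three well-separated clusters
reject = cluster_test_exact(A3, d=4, t=30, tau=0.01, rounds=60, rng=rng)
```

On the two bridged cliques every pair yields a near-rank-1 Gram matrix and the test accepts; on three disjoint cliques, pairs straddling two different cliques make both eigenvalues large and the test rejects.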

We now present our main theorem about the guarantees of Cluster-Test.

Theorem 2.1 (Main Theorem).

Let G be an n vertex graph with maximum degree at most d. For any μ > 0 we set the walk length t, the number of rounds s, the additive error of step 5, and the eigenvalue threshold τ as functions of n, ϵ, ϕ, and μ, where the constants involved are defined in Theorem 1.3 and Lemmas 3.3, 3.5, and 3.10 respectively. Then,

  1. Cluster-Test with the parameters defined above accepts every (2, ϕ)-clusterable graph with probability at least 2/3.

  2. Cluster-Test with the parameters defined above rejects every graph that is ϵ-far from (2, ϕ^*)-clusterable for any ϕ^* ≤ μϕ²ϵ² with probability at least 2/3.

Furthermore, the query complexity of Cluster-Test is O(n^{1/2 + O(1)μ} · poly(1/ϵ, 1/ϕ, log n)).

3 Proof of Main Theorem

3.1 Completeness: accepting (2, ϕ)-clusterable graphs

In this section we show that Cluster-Test with the parameters defined in Theorem 2.1 accepts G with probability greater than 2/3 if G is (2, ϕ)-clusterable. We first introduce the main geometric property of our paper.

Definition 3.1.

Vectors a and b are ξ-close to collinear if they can be moved a total distance of at most ξ so that they lie on a line through the origin. Vectors a and b are ξ-far from collinear if they are not ξ-close to collinear. See Figure 4 for reference.

Let G be a (2, ϕ)-clusterable graph. We show that Cluster-Test accepts G with probability at least 2/3 using the following argument.

  • First, in Lemma 3.2 we show that how close m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are to collinear corresponds to how small the eigenvalues of A_{u,v} are.

  • We show in Lemma 3.4 that any pair of vectors m_u^t − (1/n)·1 and m_v^t − (1/n)·1, where u and v are vertices of G, are very close to collinear. This relies on a result about the eigenvalues of L from [3] which is restated in Lemma 3.3.

  • We finally show that this implies that Cluster-Test accepts G with probability greater than 2/3 in Lemma 3.6.

Lemma 3.2.

If m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are ξ-close to collinear then the smallest eigenvalue of A_{u,v} is less than 2ξ√(∥m_u^t − (1/n)·1∥² + ∥m_v^t − (1/n)·1∥²) + ξ². Conversely, if m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are ξ-far from collinear then both of the eigenvalues of A_{u,v} are larger than ξ².

Proof.

Write B = [a b], the matrix with columns a = m_u^t − (1/n)·1 and b = m_v^t − (1/n)·1. Then A_{u,v} = B^T B. Because A_{u,v} is positive semidefinite, we can also write

A_{u,v} = σ_1² w_1 w_1^T + σ_2² w_2 w_2^T,

where w_1, w_2 are orthonormal and σ_1 ≥ σ_2 ≥ 0. Suppose a and b are ξ-close to collinear. An equivalent formulation of Definition 3.1 is that there exist a′ and b′ such that √(∥a − a′∥² + ∥b − b′∥²) ≤ ξ and a′ and b′ lie on a line through the origin. This implies that the matrix B′ with columns a′, b′ has rank at most 1. Therefore, B′^T B′ is also a matrix of rank at most 1, so it has a zero eigenvalue. Because

B^T B − B′^T B′ = B^T(B − B′) + (B − B′)^T B′,

we know by Weyl’s inequality that A_{u,v} has an eigenvalue less than ∥B^T B − B′^T B′∥_F, where ∥·∥_F denotes the Frobenius norm. Because ∥B − B′∥_F ≤ ξ, we can easily compute that

∥B^T B − B′^T B′∥_F ≤ ∥B∥_F ∥B − B′∥_F + ∥B − B′∥_F ∥B′∥_F ≤ ξ(2∥B∥_F + ξ).

Therefore,

λ_min(A_{u,v}) ≤ 2ξ∥B∥_F + ξ²,

which proves the first part of our lemma.

For the second part, we prove the contrapositive. Suppose that σ_2² ≤ ξ². We wish to show that a and b are ξ-close to collinear. Write the singular value decomposition B = σ_1 u_1 w_1^T + σ_2 u_2 w_2^T and define

B̃ = σ_1 u_1 w_1^T.

Let ã and b̃ denote the columns of B̃. Then B̃ has rank at most 1 and both ã and b̃ lie on the line through the origin spanned by u_1. By the orthogonality of u_1 and u_2, we have ∥B − B̃∥_F = σ_2. Therefore,

√(∥a − ã∥² + ∥b − b̃∥²) = ∥B − B̃∥_F = σ_2 ≤ ξ.

Thus a and b are ξ-close to collinear, as desired. ∎
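The two directions of Lemma 3.2 can be sanity-checked numerically (our own illustration with hypothetical vectors): a near-collinear pair produces a Gram matrix with a tiny smallest eigenvalue, while an orthogonal pair does not.

```python
import numpy as np

def smallest_gram_eigenvalue(a, b):
    """Smallest eigenvalue of the 2x2 Gram matrix B^T B for B = [a b]."""
    B = np.column_stack([a, b])
    return np.linalg.eigvalsh(B.T @ B)[0]

a = np.array([1.0, 0.0, 0.0])
b = np.array([2.0, 1e-3, 0.0])   # roughly 1e-3-close to collinear with a
near = smallest_gram_eigenvalue(a, b)   # tiny

c = np.array([0.0, 1.0, 0.0])    # far from collinear with a
far = smallest_gram_eigenvalue(a, c)    # stays large (here it equals 1)
```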

We proceed to show that if the input graph G is (2, ϕ)-clusterable then for any pair of vertices u, v, the vectors m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are close to collinear. To show this, we need the following lemma from [3] which relates the property of being (2, ϕ)-clusterable to the eigenvalues of the Laplacian matrix.

Lemma 3.3 (Lemma in [3]).

There exists a constant c₁ > 0 depending on d such that for a (2, ϕ)-clusterable graph G of maximum degree at most d, λ₃ ≥ c₁ϕ², where λ₃ is the 3-rd smallest eigenvalue of the Laplacian matrix of G.

We note here that there is a short proof that A_{u,v} has at most one large eigenvalue if G is (2, ϕ)-clusterable. Lemma 3.3 states that M − J/n has at most one large eigenvalue, hence (M^t − J/n)^T(M^t − J/n) also has at most one large eigenvalue. Then the Cauchy interlacing theorem implies that all the 2 by 2 principal submatrices also have at most one large eigenvalue. However, we present the longer proof that uses the definition of ξ-close to collinear to highlight the similarities between the proofs of the soundness and completeness cases.

We now show that given Lemma 3.3, it follows that m_x^t − (1/n)·1 is close to the line spanned by v₂.

Lemma 3.4.

Let G be (2, ϕ)-clusterable. Then for any pair of vertices u, v, the vectors m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are 2(1 − c₁ϕ²)^t-close to collinear, where c₁ is the constant defined in Lemma 3.3.

Proof.

Recall that m_x^t = M^t 1_x, where m_x^t is the probability distribution of the endpoint of a length t lazy random walk starting at vertex x. Writing 1_x in the eigenbasis of M gives us

m_x^t − (1/n)·1 = Σ_{i=2}^n β_i^t ⟨1_x, v_i⟩ v_i.

Therefore, the component of m_x^t − (1/n)·1 orthogonal to v₂ has norm

∥Σ_{i=3}^n β_i^t ⟨1_x, v_i⟩ v_i∥ ≤ β₃^t ∥1_x∥ ≤ (1 − c₁ϕ²)^t,

where the last inequality uses Lemma 3.3. It follows that for any vertices u and v, the vectors m_u^t − (1/n)·1 and m_v^t − (1/n)·1 can each be moved onto the line spanned by v₂ with movement at most (1 − c₁ϕ²)^t, so they are 2(1 − c₁ϕ²)^t-close to collinear. ∎

Lemmas 3.2 and 3.4 together guarantee that both of the eigenvalues of A_{u,v} cannot be large. We now want to show that this also holds when Cluster-Test approximates A_{u,v}. We need the following lemma, which tells us that the ℓ2-norm tester accepts with high probability in step 4 of Cluster-Test. This lemma is just a technicality that we need for the query complexity of Theorem 1.3.

Lemma 3.5 (Lemma in [3]).

Let t ≥ 1. There exists a constant c₂ such that for a (2, ϕ)-clusterable graph, there exists a subset U ⊆ V with |U| ≥ (1 − ϵ/c₂)n such that for any x ∈ U and any t, the following holds:

∥m_x^t∥ ≤ c₂/√n.

We now prove that Cluster-Test with the parameters defined in Theorem 2.1 passes the completeness case.

Lemma 3.6.

Cluster-Test with the parameters defined in Theorem 2.1 accepts (2, ϕ)-clusterable graphs with probability greater than 2/3.

Proof.

Let G be a (2, ϕ)-clusterable graph. We analyze one round of Cluster-Test and calculate the rejection probability of one round. Note that Cluster-Test samples a pair of vertices u and v uniformly at random from V at each round. There are three ways one round can reject G:

  1. One of the vertices u or v lies in the complement of the set U in Lemma 3.5.

  2. The ℓ2-norm tester rejects m_u^t or m_v^t in step 4 of Cluster-Test.

  3. Both of the eigenvalues of Â_{u,v} are larger than τ.

Setting the constant appropriately in Lemma 3.5, we see that both u and v lie inside the set U of Lemma 3.5 with probability at least 1 − 1/(9s). Therefore, the rejection probability of case 1 is at most 1/(9s).

If x ∈ U as defined in Lemma 3.5, then ∥m_x^t∥ ≤ c₂/√n. Given this along with the fact that the threshold passed to the ℓ2-norm tester is at least c₂/√n, we have that the tester accepts m_x^t with probability at least 1 − 1/(18s). Therefore, the rejection probability of case 2 is also at most 1/(9s).

By Lemma 3.4, m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are ξ-close to collinear for ξ = 2(1 − c₁ϕ²)^t. Recall the choice of the walk length t in Theorem 2.1. Therefore, by Lemma 3.2, A_{u,v} has at least one eigenvalue smaller than τ/2.

The matrix Â_{u,v} that Cluster-Test computes can be written as Â_{u,v} = A_{u,v} + E, where each entry of the 2 by 2 error matrix E is at most the additive error of step 5 with probability 1 − 1/(9s), due to Theorem 1.3. Therefore, ∥E∥_F is at most twice this additive error with the same probability. If this holds, then by Weyl’s inequality, Â_{u,v} has an eigenvalue at most τ/2 plus twice the additive error, which is at most τ by our choice of parameters. Therefore, the rejection probability of case 3 is at most 1/(9s).

Adding up the rejection probabilities of each of the three cases tells us that one round rejects G with probability at most 1/(3s). Thus the total probability that we reject G in one of the s rounds is at most 1/3, as desired. The query complexity is O(n^{1/2 + O(1)μ} · poly(1/ϵ, 1/ϕ, log n)). ∎

3.2 Soundness: rejecting graphs ϵ-far from (2, ϕ^*)-clusterable

In this section we show that Cluster-Test rejects G with probability greater than 2/3 if G is ϵ-far from (2, ϕ^*)-clusterable for ϕ^* ≤ μϕ²ϵ². We introduce two properties that expand on the property of being ξ-close to collinear.

Definition 3.7.

Vectors a and b are ξ-close to antipodal if they can be moved a total distance of at most ξ to lie on a line through the origin where the origin lies between the two moved points. Vectors a and b are ξ-far from antipodal if they are not ξ-close to antipodal.

Definition 3.8.

Vectors a and b are ξ-close to podal if they can be moved a total distance of at most ξ to lie on a line through the origin where the origin does not lie between the two moved points. Vectors a and b are ξ-far from podal if they are not ξ-close to podal.

See Figure 4 for reference. Note that vectors a and b are ξ-far from collinear if and only if they are ξ-far from both antipodal and podal.

Figure 4: The origin is denoted as 0. Vectors a and b are ξ-close to collinear in both cases. (a): Vectors a and b are ξ-close to antipodal. (b): Vectors a and b are ξ-close to podal.
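The distinction between the antipodal and podal cases can be explored with a small numeric sketch. This is our own heuristic (not the exact optimization in the definitions): it fits the best common line through the origin via the top singular direction of [a b] and classifies the pair by the signs of the two projections.

```python
import numpy as np

def collinearity_profile(a, b):
    """Approximate distance the pair must move to reach the best common
    line through the origin, plus whether the projections put the origin
    between them ("antipodal") or on the same side ("podal")."""
    B = np.column_stack([a, b])
    u = np.linalg.svd(B)[0][:, 0]      # best-fit line direction
    pa, pb = a @ u, b @ u              # signed projections onto the line
    move = np.hypot(np.linalg.norm(a - pa * u), np.linalg.norm(b - pb * u))
    kind = "antipodal" if pa * pb < 0 else "podal"
    return move, kind

m1, k1 = collinearity_profile(np.array([1.0, 0.01]), np.array([-2.0, 0.02]))
m2, k2 = collinearity_profile(np.array([1.0, 0.01]), np.array([2.0, -0.02]))
```

The first pair points in nearly opposite directions (antipodal), the second in nearly the same direction (podal), and both need only a small movement to become collinear.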

We now outline our argument which shows that Cluster-Test rejects graph G if G is ϵ-far from (2, ϕ^*)-clusterable. We do this by showing that there are many pairs of vertices u, v where m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are far from collinear, which allows us to say that the eigenvalues of A_{u,v} are large due to Lemma 3.2. This is a harder task than showing that m_u^t − (1/n)·1 and m_v^t − (1/n)·1 are close to collinear in the completeness case, so we need a more complicated argument, which is detailed below. For S ⊆ V, we define m_S^t = Σ_{x ∈ S} m_x^t.

  • We first present a result from [3] in Lemma 3.10 which says that G has two large subsets of vertices P₁ and P₂ that are each separated from the rest of the vertices by sparse cuts.

  • We let Π be the projection onto the span of the eigenvectors of M with “large” eigenvalues. We use the above result to show that the aggregate vectors Π m_{P₁}^t and Π m_{P₂}^t are far from collinear in Lemma 3.11. This projection trick is necessary to relate these aggregate quantities to the individual vectors m_x^t later on.

We now want to use the fact that the aggregate vectors Π m_{P₁}^t and Π m_{P₂}^t are far from collinear to find many pairs of vectors that are far from collinear.

  • We use the pigeonhole principle to deduce that there are many pairs of vertices u ∈ P₁ and v ∈ P₂ such that Π m_u^t and Π m_v^t are far from antipodal. Similarly, we show that there are many pairs of vertices such that Π m_u^t and Π m_v^t are far from podal. This is shown in Lemma 3.13.

Note that the above point does not immediately imply that there are many pairs of vertices u, v such that Π m_u^t and Π m_v^t are far from both podal and antipodal.

  • We use results from the previous step along with geometric properties of the vectors Π m_x^t to show that there are many pairs of vertices u, v such that Π m_u^t and Π m_v^t are far from collinear in Lemmas 3.14 and 3.15.

  • Using properties of Π, we transfer this result on the vectors Π m_x^t to the vectors m_x^t.

  • Finally we refer back to Lemma 3.2 to argue that there are many pairs u, v such that both the eigenvalues of A_{u,v} are sufficiently large, which means that Cluster-Test rejects G with probability at least 2/3. This is shown in Lemmas 3.16 and 3.17.

We now give quantitative versions of the definitions of antipodal and podal which will be useful later on in our argument.

Lemma 3.9.

If vectors a and b are ξ-close to antipodal then

⟨a, b⟩ ≤ ξ(∥a∥ + ∥b∥) + 3ξ².   (3)

Similarly, if a and b are ξ-close to podal then

⟨a, b⟩ ≥ −ξ(∥a∥ + ∥b∥) − 3ξ².   (4)

Proof.

If a and b are ξ-close to collinear then there exist ã and b̃ such that ã and b̃ lie on the same line through the origin and ∥a − ã∥ + ∥b − b̃∥ ≤ ξ. If a and b are ξ-close to antipodal then we can find a unit vector w such that ã = αw and b̃ = −γw with α, γ ≥ 0. We have

⟨a, b⟩ = ⟨ã, b̃⟩ + ⟨a − ã, b̃⟩ + ⟨ã, b − b̃⟩ + ⟨a − ã, b − b̃⟩ ≤ −αγ + ξγ + ξα + ξ².

Therefore, using α ≤ ∥a∥ + ξ and γ ≤ ∥b∥ + ξ along with αγ ≥ 0,

⟨a, b⟩ ≤ ξ(∥a∥ + ∥b∥) + 3ξ²,

which proves Eq. (3). A similar calculation for the podal case proves Eq. (4). ∎

We restate a lemma from [3] which says that we can partition a graph that is ϵ-far from (2, ϕ^*)-clusterable into three subsets of vertices that are separated by sparse cuts.

Lemma 3.10 (Lemma in [3]).

Let G be a graph with maximum degree at most d. There are constants c₃ and c₄, that depend on d, such that if G is ϵ-far from (2, ϕ^*)-clusterable with ϕ^* ≤ c₃ϵ², then there exists a partition of V into three subsets P₁, P₂, P₃ such that for each i, we have |P_i| ≥ c₄ϵn and ϕ(P_i) ≤ c₃ϕ^*.

From now on we assume that P₁ and P₂ are the two smallest of the three parts, so |P₁|, |P₂| ≤ n/2 always holds.

We begin by showing that the projections of the aggregate vectors are far from collinear, using tools from [8].

Lemma 3.11.

Let S₁ and S₂ be two disjoint subsets of vertices such that the cut (S_i, V ∖ S_i) has conductance less than ϕ₀ for i ∈ {1, 2}. Suppose that |S₁|, |S₂| ≤ n/2 and let Π denote the projection onto the span of the eigenvectors of M with eigenvalue greater than 1 − 2ϕ₀. Then Π m_{S₁}^t and Π m_{S₂}^t are far from collinear.

Proof.

Recall that L is the Laplacian and M is the lazy random walk matrix, related by the equation L = I − M. Also recall that the eigenvalues of M are β₁ ≥ β₂ ≥ ⋯ ≥ β_n with corresponding eigenvectors v₁, …, v_n. Let S = S₁ ∪ S₂ and define the vector z as

z_x = 1/|S₁| if x ∈ S₁,  z_x = −θ/|S₂| if x ∈ S₂,  z_x = 0 otherwise,

where θ is any constant in [0, 1]. Write z in the eigenbasis of M as z = Σ_{i=1}^n α_i v_i. We have M^t z = Σ_{i=1}^n β_i^t α_i v_i and one can compute from the definition of z that M^t z = m_{S₁}^t/|S₁| − θ·m_{S₂}^t/|S₂|. Equating these two gives

m_{S₁}^t/|S₁| − θ·m_{S₂}^t/|S₂| = Σ_{i=1}^n β_i^t α_i v_i.   (5)

We now also compute z^T L z in two different ways. We have z^T L z = Σ_{i=1}^n λ_i α_i². On the other hand, using the quadratic form of L gives us z^T L z = (1/(2d)) Σ_{{x,y} ∈ E} (z_x − z_y)². Now note that there are three cases where the term (z_x − z_y)² is nonzero:

  1. One of vertex x and vertex y lies in S₁ and the other lies in S₂,

  2. One of vertex x and vertex y lies in S₁ and the other lies in V ∖ S,

  3. One of vertex x and vertex y lies in S₂ and the other lies in V ∖ S.

In these three cases, (z_x − z_y)² evaluates to (1/|S₁| + θ/|S₂|)², 1/|S₁|², and θ²/|S₂|² respectively. We bound these expressions from above by 2/|S₁|² + 2θ²/|S₂|², 2/|S₁|², and 2θ²/|S₂|² respectively to extract the bound

z^T L z ≤ (1/(2d)) ( 2·E(S₁, V ∖ S₁)/|S₁|² + 2θ²·E(S₂, V ∖ S₂)/|S₂|² ).

Now using the fact that the cut (S_i, V ∖ S_i) has conductance less than ϕ₀ for each i ∈ {1, 2}, we have

z^T L z ≤ ϕ₀ (1/|S₁| + θ²/|S₂|).

It follows that