Multisection in the Stochastic Block Model using Semidefinite Programming

07/08/2015 · by Naman Agarwal, et al.

We consider the problem of identifying underlying community-like structures in graphs. Towards this end we study the Stochastic Block Model (SBM) on k clusters: a random model on n = km vertices, partitioned into k equal sized clusters, with edges sampled independently across clusters with probability q and within clusters with probability p, p > q. The goal is to recover the initial "hidden" partition of [n]. We study semidefinite programming (SDP) based algorithms in this context. In the regime p = α log(m)/m and q = β log(m)/m we show that a certain natural SDP based algorithm solves the problem of exact recovery in the k-community SBM, with high probability, whenever √α − √β > 1, as long as k = o(log n). This threshold is known to be information-theoretically optimal. We also study the case k = Θ(log n). In this regime, however, our recovery guarantees no longer match the optimal condition √α − √β > 1, leaving optimality for this range as an open question.

1 Introduction

Identifying underlying structure in graphs is a primitive question for scientists: can existing communities be located in a large graph? Is it possible to partition the vertices of a graph into strongly connected clusters? Several of these questions have been shown to be hard to answer, even approximately, so instead of looking for worst-case guarantees attention has shifted towards average-case analyses. In order to study such questions, the usual approach is to consider a random [McS01] or a semi-random [FK01, MMV14] generative model of graphs, and use it as a benchmark to test existing algorithms or to develop new ones. With respect to identifying underlying community structure, the Stochastic Block Model (SBM) (or planted partition model) has, in recent times, been one of the most popular choices. Its growing popularity is largely due to the fact that its structure is simple to describe, but at the same time it has interesting and involved phase transition properties which have only recently been discovered ([DKMZ11, MNS12, MNS13, ABH14, CX14, MNS14b, HWX14, HWX15, AS15, Ban15]).

In this paper we consider the SBM on k communities, defined as follows. Let n be a multiple of k, let [n] be the set of vertices, and let P = (P_1, …, P_k) be a partition of [n] into k equal sized clusters, each of size m = n/k. Construct a random graph G on [n] by adding an edge between any two vertices in the same cluster independently with probability p, and between any two vertices in distinct clusters independently with probability q, where p > q. We write G ∼ G(n, k, p, q) to denote that a graph G is generated from the above model. Given such a G, the goal is to recover (with high probability) the initial hidden partition P. The SBM can be seen as an extension of the Erdős–Rényi random graph model [ER59] with the additional property of possessing a non-trivial underlying community structure (something which the Erdős–Rényi model lacks). This richer structure not only makes the model interesting to study theoretically, but also renders it closer to real world inputs, which tend to have a community structure. It is also worth noting that, as pointed out in [CX14], a slight generalization of the SBM encompasses several classical planted random graph problems, including planted clique [AKS98, McS01], planted coloring [AK97], planted dense subgraph [AV13] and planted partition [Bop87, CK01, FK01].

There are two natural problems that arise in the context of the SBM: exact recovery, where the aim is to recover the hidden partition completely; and detection, where the aim is to recover the partition better than what a random guess would achieve. In this paper we focus on exact recovery. Note that exact recovery necessarily requires the hidden clusters to be connected (since otherwise there would be no way to match the parts of the partition lying in one component with those in another component), and it is easy to see that the threshold for connectivity occurs at p = Θ(log m / m). Therefore the right scale for the threshold behavior of the parameters is p = α log(m)/m and q = β log(m)/m, which is what we consider in this paper.

In the case of two communities (k = 2), Abbe et al. [ABH14] recently established a sharp phase transition phenomenon from information-theoretic impossibility to computational feasibility of exact recovery. However, the existence of such a phenomenon in the case of k > 2 was left open until it was settled, for constant k, in independent parallel research [AS15, HWX15]. In this paper we resolve the above question by showing the existence of a sharp phase transition for all k = o(log n). More precisely, we study a semidefinite programming (SDP) based algorithm that, for k = o(log n), exactly recovers the planted k-partition of G ∼ G(n, k, p, q) with high probability, for an optimal range of parameters. The range of parameters is optimal in the following sense: it can be shown that this parameter range exhibits a sharp phase transition from information-theoretic impossibility to computational feasibility through the SDP algorithm studied in this paper. An interesting aspect of our result is that, for k = o(log n), the threshold is the same as for k = 2. This means that, even if an oracle reveals all of the cluster memberships except for two clusters, the problem has essentially the same difficulty.

We also consider the case k = Θ(log n). Unfortunately, in this regime we can no longer guarantee exact recovery up to the proposed information-theoretic threshold. Similar behavior was observed and reported by Chen et al. [CX14], and in our work we observe that the divergence between our information-theoretic lower bound and our computational upper bound sets in at k = Θ(log n). This is formally summarized in the following theorems.
Theorem 1. Given a graph G ∼ G(n, k, p, q) with k hidden clusters each of size m, and p = α log(m)/m and q = β log(m)/m, where α > β are fixed constants, the semidefinite program (4), with probability 1 − o(1), recovers the clusters when:

  • for k = o(log n), as long as √α − √β > 1;

  • for k = γ log n for a fixed γ > 0, as long as a stronger (suboptimal) condition involving a universal constant c holds.

We complement the above theorem by showing the following lower bound, which is a straightforward extension of the lower bound for k = 2 from [ABH14]. Given a graph G ∼ G(n, k, p, q) with k hidden clusters each of size m, where k is O(log n), if p = α log(m)/m and q = β log(m)/m, where α > β are fixed constants, then it is information theoretically impossible to recover the clusters exactly with high probability if √α − √β < 1.

Note that Theorem 1 establishes a sharp phase transition between computational feasibility and information theoretic impossibility when k = o(log n). At k = Θ(log n) we see that our lower and upper bounds diverge. We leave as an open problem to determine whether such divergence is necessary or a shortcoming of the SDP approach. At the heart of our argument is the following theorem, which establishes a sufficient condition for exact recovery with high probability. Let G ∼ G(n, k, p, q); with probability 1 − o(1) over the choice of G, if the following condition is satisfied, the semidefinite program (4) recovers the hidden partition:

(1)

where c is a universal constant and Δ(i) is defined as the difference between the number of neighbors a vertex i has in its own cluster and the maximum number of neighbors it has in any other cluster (with respect to the hidden partition). In other words, with probability 1 − o(1), condition (1) implies exact recovery.

We are able to give sharp guarantees for the semidefinite programming algorithm based essentially on the behavior of the inner and outer degrees of the vertices. This is achieved by constructing a candidate dual certificate and using bounds on the spectral norms of random matrices to show that the constructed candidate is indeed a valid one. The problem is then reduced to the easier task of understanding the typical values of such degrees. Remarkably, the conditions required for these quantities are very similar to the ones required for the problem to be information-theoretically solvable (which essentially correspond to each node having larger in-degree than out-degree). This helps explain the optimality of our algorithm. The approach of reducing the validity of a dual certificate to conditions on an interpretable quantity appeared in [Ban15] for a considerably simpler class of problems where the dual certificate construction is straightforward (which includes the stochastic block model for k = 2 but not k > 2). In contrast, in the current setting, the dual certificate construction is complex, requiring a different and considerably more involved analysis. Moreover, the estimates we need (both of spectral norms and of inner and outer degrees) do not fall under the class of the ones studied in [Ban15]. We also show that our algorithm recovers the planted partition exactly in the presence of a monotone adversary, a semi-random model defined in [FK01].
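The spectral norm bounds alluded to above are of the following standard flavor for the adjacency matrix A of G (a hedged restatement for intuition only; the precise statements and constants are the ones used in our proofs):

```latex
% Heuristic version of the spectral estimate used when verifying the dual
% certificate: for a random graph with independent edges and maximum expected
% degree d = max_i E[deg(i)] = Omega(log n), with high probability
\[
  \|A - \mathbb{E}A\| \;\le\; C\sqrt{d},
  \qquad\text{and here }\;
  d \;=\; p\,m + q\,(n-m) \;=\; \bigl(\alpha + (k-1)\beta\bigr)\log m .
\]
```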

1.1 Related Previous and Parallel Work

The graph partitioning problem has been studied over the years with various different objectives and guarantees. There has been a significant recent concentration of literature around the bipartition (bisection) problem and the general k-partition problem (multisection) in random and semi-random models ([DKMZ11], [MNS12], [MNS13], [YP14], [MNS14a], [Mas14], [ABH14], [CX14], [MNS14b], [Vu14], [CRV15]). Some of the first results on partitioning random graphs were due to Bui et al. [BCLS84], who presented algorithms for finding bipartitions in dense graphs. Boppana [Bop87] showed a spectral algorithm that, for a large range of parameters, recovers a planted bipartition in a graph. Feige and Kilian [FK01] present an SDP based algorithm to solve the problem of planted bipartition (along with the problems of finding Independent Sets and Graph Coloring). Independently, McSherry [McS01] gave a spectral algorithm that solved the problems of Multisection, Clique and Graph Coloring.

More recently, a spate of results have established very interesting phase transition phenomena for SBMs, both for detection and for exact recovery. For the case of detection, where the aim is to recover partitions better than a random guess asymptotically, recent works [MNS12, MNS13, Mas14] established a striking sharp phase transition from information theoretic impossibility to computational feasibility for the case of k = 2. For the case of exact recovery, Abbe et al. [ABH14], and independently [MNS14b], established the existence of a similar phase transition phenomenon, albeit in a different parameter range. More recently the same phenomenon was shown to exist for a semidefinite programming relaxation, for k = 2, in [HWX14, Ban15]. However, the works described above established phase transitions for k = 2, and the case of larger k was left open. Our paper bridges the gap for larger k, up to k = o(log n), for the case of exact recovery. To put our work into context, the corresponding case of establishing such behavior for the problem of detection remains open. In fact, it is conjectured in [DKMZ11, MNS12] that, for the detection problem, there exists a gap between the thresholds for computational feasibility and information theoretic impossibility for any number of communities greater than 4. In this paper, we show that this is not the case for the exact recovery problem.

Chen et al. [CX14] also study the k-community SBM and provide convex programming based algorithms and information theoretic lower bounds for exact recovery. Their results are similar to ours in the sense that they also conjecture a separation between information theoretic impossibility and computational feasibility as k grows. In comparison, we focus strongly on the case of k slightly superconstant (k = o(log n)) and mildly growing (k = Θ(log n)), and show exact recovery up to the optimal (even up to constants) threshold in the former case. Very recently, in independent and parallel work, Abbe and Sandon [AS15] studied the problem of exact recovery for a fixed number of communities where the symmetry constraint (equality of cluster sizes, and the probabilities of connection being the same across different clusters) is removed. Our result, in contrast to theirs, is based on the integrality of a semidefinite relaxation, which has the added benefit of producing an explicit certificate of optimality (i.e. when the solution is “integral” we know for sure that it is the optimal balanced k-partition). Abbe and Sandon [AS15] comment in their paper that their results can be extended to slightly superconstant k, but leave it as future work.
In another parallel and independent work, Hajek et al. [HWX15] study semidefinite programming relaxations for exact recovery in SBMs and achieve results similar to ours. We remark that the semidefinite program considered in [HWX15] is the same as the semidefinite program (4) considered by us (up to an additive/multiplicative shift), and both works achieve the same optimality guarantee for constant k. They also consider the problem of the SBM with 2 unequal sized clusters and the Binary Censored Block Model. In contrast, we show that the guarantees extend to the case where k is superconstant, and provide sufficient guarantees for the case of k = Θ(log n), pointing to a possible divergence between information theoretic possibility and computational feasibility at k = Θ(log n), which we leave as an open question.

1.2 Preliminaries

In this section we describe the notation and definitions which we use throughout the rest of the paper.

Notation. Throughout the rest of the paper we reserve capital letters such as A, X for matrices and write A_ij for the corresponding entries. In particular, J denotes the all ones matrix and I the identity matrix. Let ⟨A, B⟩ = Σ_ij A_ij B_ij be the element-wise inner product of two matrices. We note that all logarithms used in this paper are natural logarithms, i.e. with base e. Let G be a graph, n the number of vertices and A its adjacency matrix. With G ∼ G(n, k, p, q) we denote a graph drawn from the stochastic block model distribution described earlier, with k denoting the number of hidden clusters, each of size m = n/k. We denote the underlying hidden partition by P = (P_1, …, P_k). Let c(·) be the function that maps a vertex i to the cluster containing i. To avoid confusion in the notation, note that P_t denotes the t-th cluster and P_c(i) denotes the cluster containing the vertex i.

We now describe a few quantities which will be useful in the further discussion of our results as well as their proofs. Define δ(i, t) to be the “degree” of vertex i to cluster P_t. Formally,

δ(i, t) = |{ j ∈ P_t : (i, j) ∈ E(G) }|.

Similarly, for any two clusters P_s and P_t, define δ(s, t) as the number of edges between them,

δ(s, t) = Σ_{i ∈ P_s} δ(i, t).

Define the “in degree” of a vertex i, denoted δ_in(i), to be the number of edges going from the vertex i to its own cluster,

δ_in(i) = δ(i, c(i)),

and define δ_out(i) to be the maximum “out degree” of the vertex i to any other cluster,

δ_out(i) = max_{t ≠ c(i)} δ(i, t).

Finally, define

Δ(i) = δ_in(i) − δ_out(i).

Δ(i) will be the crucial parameter in our threshold. Remember that for G ∼ G(n, k, p, q), Δ(i) is a random variable; let E[Δ(i)] denote its expectation (the same for all i).
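For concreteness, the following short Python sketch samples a graph from the model described above and computes δ_in, δ_out and Δ for every vertex. The function and variable names are ours, chosen for illustration; they do not appear in the paper.

```python
import numpy as np

def sample_sbm(k, m, p, q, rng=np.random.default_rng(0)):
    """Sample the adjacency matrix of G ~ G(n, k, p, q) with n = k*m.

    Vertices are ordered so that cluster t is {t*m, ..., (t+1)*m - 1}."""
    n = k * m
    labels = np.repeat(np.arange(k), m)              # hidden partition
    same = labels[:, None] == labels[None, :]        # same-cluster indicator
    prob = np.where(same, p, q)                      # edge probabilities
    upper = np.triu(rng.random((n, n)) < prob, 1)    # sample strict upper triangle
    A = (upper | upper.T).astype(int)                # symmetrize, zero diagonal
    return A, labels

def degree_gaps(A, labels, k):
    """Return Delta(i) = delta_in(i) - delta_out(i) for every vertex i."""
    # deg_to_cluster[i, t] = delta(i, t): neighbours of i inside cluster t
    deg_to_cluster = np.stack([A[:, labels == t].sum(axis=1) for t in range(k)], axis=1)
    delta_in = deg_to_cluster[np.arange(len(labels)), labels]
    masked = deg_to_cluster.copy()
    masked[np.arange(len(labels)), labels] = -1      # exclude the own cluster
    delta_out = masked.max(axis=1)
    return delta_in - delta_out

# Example in the regime p = alpha*log(m)/m, q = beta*log(m)/m:
k, m, alpha, beta = 4, 300, 6.0, 1.0
p, q = alpha * np.log(m) / m, beta * np.log(m) / m
A, labels = sample_sbm(k, m, p, q)
print("min Delta(i) =", degree_gaps(A, labels, k).min())
```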
Paper Organization. The rest of this paper is structured as follows. In Section 2 we discuss the two SDP relaxations we consider in the paper. We state sufficient conditions for exact recovery for both of them, the latter (Theorem 2) being a restatement of Theorem 1, and provide an intuitive explanation of why condition (1) is sufficient for recovery up to the optimal threshold. We provide formal proofs of Theorem 1 and of the lower bound in the Appendix, in Sections LABEL:sec:proofmaintheoremupperbound and LABEL:sec:proofmaintheoremlowerbound respectively. We provide the proof of Theorem 2 in Section 3. Further, in Section 4 we show how our result can be extended to a semi-random model with a monotone adversary. Lastly, in the Appendix we collect the proofs of all the lemmas and theorems left unproven in the main sections.

2 SDP relaxations and main results

In this section we present two candidate SDPs which we use to recover the hidden partition. The first SDP is inspired by the Max-k-Cut SDP introduced by Frieze and Jerrum [FJ95], and does not explicitly encode the fact that each cluster contains an equal number of vertices. The second SDP explicitly encodes the fact that each cluster has exactly m vertices. We state our main theorems, which provide sufficient conditions for exact recovery, for both SDPs. Indeed the latter SDP, being stronger, is the one we use to prove our main theorem, Theorem 1.

Before describing the SDPs, let us first consider the Maximum Likelihood Estimator (MLE) of the hidden partition. It is easy to see that the MLE corresponds to the following problem, which we refer to as the Multisection problem: given a graph G, divide the set of vertices into k clusters of m vertices each such that the number of edges whose endpoints lie in different clusters is minimized. (This problem has been studied under the name of Min-Balanced-k-partition [KNS09].) In this section we consider two SDP relaxations for the Multisection problem. Since SDPs can be solved in polynomial time, the relaxations provide polynomial time algorithms to recover the hidden partition.

A natural relaxation to consider for the problem of multisection in the Stochastic Block Model is the Min-k-Cut SDP relaxation studied by Frieze and Jerrum [FJ95] (they actually study the Max-k-Cut problem, but one can analogously study the min cut version). The Min-k-Cut SDP formulates the problem as an instance of Min-k-Cut, where one tries to separate the graph into k parts with the objective of minimizing the number of edges cut by the partition. Note that the k-Cut version does not have any explicit constraints ensuring balancedness. However, studying Min-k-Cut through SDPs has a natural difficulty: the relaxation must explicitly contain a constraint that tells it to divide the graph into at least k clusters. In the case of SBMs with parameters p = α log(m)/m and q = β log(m)/m, one can try to overcome this difficulty by making use of the fact that the generated graph is very sparse. Thus, instead of looking directly at the Min-k-Cut objective, we can consider the following objective: minimizing the difference between the number of edges cut and the number of non-edges cut. Indeed, for sparse graphs the second term in the difference is the dominant term, and hence the SDP has an incentive to produce more clusters. Note that the above objective can also be thought of as doing Min-k-Cut on the signed adjacency matrix 2A − (J − I) (where J is the all ones matrix). Following the above intuition, we consider the following SDP (2), which is inspired by the Max-k-Cut formulation of Frieze and Jerrum [FJ95]. In the Appendix, Section LABEL:sec:nonuniquegames, we provide a reduction to the k-Cut SDP we study in this paper from a more general class of SDPs studied by Charikar et al. [CMM06] for Unique Games, and more recently by Bandeira et al. [BCS15] in a more general setting.

max ⟨2A − (J − I), X⟩
s.t.  X_ii = 1 for all i,
      X_ij ≥ −1/(k − 1) for all i, j,
      X ⪰ 0.
(2)
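A minimal sketch of this relaxation in Python using cvxpy is given below, assuming the Frieze–Jerrum style feasible region (unit diagonal, entries bounded below by −1/(k−1), positive semidefiniteness) and the signed-adjacency objective described in the text; this is our reading of the relaxation, not code from the paper.

```python
import cvxpy as cp
import numpy as np

def solve_kcut_sdp(A, k):
    """Max-k-Cut style SDP (2): maximize <2A - (J - I), X> over the
    Frieze-Jerrum feasible region. A is the 0/1 adjacency matrix."""
    n = A.shape[0]
    B = 2 * A - (np.ones((n, n)) - np.eye(n))   # signed adjacency matrix
    X = cp.Variable((n, n), symmetric=True)
    constraints = [
        X >> 0,                                  # X is positive semidefinite
        cp.diag(X) == 1,                         # unit diagonal
        X >= -1.0 / (k - 1),                     # entrywise lower bound
    ]
    objective = cp.Maximize(cp.sum(cp.multiply(B, X)))   # <B, X>
    cp.Problem(objective, constraints).solve()
    return X.value
```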

To see that the above SDP is a relaxation of the multisection problem, note that for the hidden partition we can define a candidate solution as follows: X_ij = 1 if i and j belong to the same cluster, and X_ij = −1/(k − 1) if i and j belong to different clusters. Note that although the objective does not directly minimize the number of edges cut, it is an additive/multiplicative shift of it. For the above SDP we prove the following theorem in the Appendix in Section LABEL:sec:proofSDP2. Given G ∼ G(n, k, p, q), define

Let G ∼ G(n, k, p, q), with p = α log(m)/m and q = β log(m)/m, where α > β are constants. Consider the SDP given by (2). With probability 1 − o(1) over the choice of G, if the following condition is satisfied then the SDP recovers the hidden partition:

(3)

where c is a universal constant. In other words, with probability 1 − o(1), condition (3) implies exact recovery. The proof of the above theorem is included in the Appendix in Section LABEL:sec:proofSDP2. We note that the above condition is not optimal for exact recovery, and we discuss this issue next. It is quite possible that the above SDP recovers the planted multisection all the way down to the optimal threshold; however, we have not been able to establish this and leave it as an open question. Indeed, to prove our results we consider a stronger SDP with which we establish optimality. We have empirically tested the performance of both SDPs and include the results in the Appendix in Section LABEL:sec:experiments.

We now take a closer look at the sufficient condition (3) and argue why it is not strong enough to achieve optimal results. Note that, in expectation, the maximization term in the definition of the quantity appearing in (3) picks up an extra factor, since the maximization runs through all pairs of vertices: the maximum of order n² such terms concentrates roughly a √(log n) factor above a typical term. For condition (3) to hold with at least constant probability, we therefore expect a requirement that is strictly stronger than the one in Theorem 1. Substituting the parameter range we are interested in, p = α log(m)/m and q = β log(m)/m, this extra term dominates and we cannot expect to get the tight results we hope for in Theorem 1. A closer look at the above calculation reveals that the major barrier towards achieving the optimal result is the additional factor due to the maximization over all pairs of vertices. For instance, if one could replace that maximization with a term that takes, for each vertex, the maximum over all clusters, one would pick up only a √(log k) factor (as there are only k clusters) and hopefully achieve optimality. In the context of the above discussion, we suggest the following SDP, in which we explicitly add a per-row constraint bounding the number of vertices belonging to the same cluster as the vertex in contention.

max ⟨A, X⟩
s.t.  X_ii = 1 for all i,
      Σ_j X_ij = m for all i,
      X_ij ≥ 0 for all i, j,
      X ⪰ 0.
(4)

To see that the above SDP is a relaxation of the MLE discussed above, note that for any balanced partition Q = (Q_1, …, Q_k) we can associate a canonical matrix X^Q with it, defined by X^Q_ij = 1 if i and j belong to the same cluster of Q, and X^Q_ij = 0 otherwise. Note that X^Q satisfies the SDP constraints, and the SDP maximizes the number of edges within the clusters, which is equivalent to minimizing the number of edges across the clusters. The second constraint above, since X is symmetric, says that the sum of the values along each row is m, which is the number of vertices in a cluster. For the SDP above we show the following theorem, which is a restatement of Theorem 1. Let G ∼ G(n, k, p, q). With probability 1 − o(1) over the choice of G, if the following condition is satisfied then the SDP defined by (4) recovers the hidden partition:

(5)

In other words, with probability 1 − o(1), condition (5) implies exact recovery. We remark that the above statement is true for all values of the parameters. For the specific range that we are interested in, we show in Section LABEL:sec:proofmaintheoremupperbound how condition (5) leads to the optimal threshold. In the next section we provide an intuitive explanation of why this is so.
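The following Python sketch implements this stronger relaxation with cvxpy, using the constraints described in the text (unit diagonal, nonnegative entries, each row summing to the cluster size m, positive semidefiniteness); when the solver returns the integral solution corresponding to the hidden partition, the clusters can be read off directly from its rows. The code is illustrative and not taken from the paper.

```python
import cvxpy as cp
import numpy as np

def solve_multisection_sdp(A, k):
    """SDP (4): maximize <A, X> subject to X psd, X_ii = 1, X_ij >= 0,
    and every row of X summing to the cluster size m = n/k."""
    n = A.shape[0]
    m = n // k
    X = cp.Variable((n, n), symmetric=True)
    constraints = [
        X >> 0,                       # positive semidefinite
        cp.diag(X) == 1,              # unit diagonal
        X >= 0,                       # entrywise nonnegativity
        cp.sum(X, axis=1) == m,       # per-row constraint: row sums equal m
    ]
    cp.Problem(cp.Maximize(cp.sum(cp.multiply(A, X))), constraints).solve()
    return X.value

def clusters_from_solution(X, tol=0.5):
    """If X is (close to) the integral solution of the hidden partition,
    vertices i and j are in the same cluster exactly when X_ij is close to 1."""
    n = X.shape[0]
    unassigned, clusters = set(range(n)), []
    while unassigned:
        i = min(unassigned)
        cluster = {j for j in unassigned if X[i, j] > tol}
        clusters.append(sorted(cluster))
        unassigned -= cluster
    return clusters
```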

2.1 Optimality of Theorem 2

In this section we give an intuitive, high level explanation for the optimality of the condition (5) in Theorem 2 for k = o(log n). We prove it formally in the appendix. As stated earlier, the regime we consider is p = α log(m)/m and q = β log(m)/m, where α > β are constants. Note that for the MLE to succeed, the values of α and β should be such that Δ(i) > 0 for every vertex i w.h.p., since otherwise one expects there to be many vertices i for which δ_in(i) < δ(i, t) for some cluster t ≠ c(i), and in particular a pair of such vertices in different clusters that can be exchanged to give a better partition than the planted one; this would imply that the MLE itself does not recover the hidden partition. Recall that E[Δ(i)] is (α − β) log m up to lower order terms. We show that the deviation in Δ(i) required by Theorem 2 is of the same order as the deviation below which the MLE fails, and therefore, informally, one can expect that whenever exact recovery is information-theoretically possible, condition (5) also holds, which implies that the SDP in Theorem 2 recovers the partition optimally.

Following the intuition above, we prove Theorem 1 and the corresponding lower bound in the appendix, which together imply that our SDP is optimal. In the Appendix (Section LABEL:sec:experiments) we present an experimental evaluation of the two SDPs considered in this section. The experiments corroborate Theorem 1 and also show that the SDP in (2) experimentally seems to have recovery performance similar to the (stronger) SDP in (4); however, we could only prove a suboptimal result about it. We leave the possible optimality of the SDP in (2) as an open question.
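A back-of-the-envelope version of this intuition (a heuristic sketch, not the formal argument of the appendix) is the following: for a fixed vertex i and a fixed other cluster t, δ_in(i) and δ(i, t) are sums of roughly m independent Bernoulli variables with means p and q respectively, and standard Chernoff-type estimates give

```latex
% Heuristic calculation behind the threshold sqrt(alpha) - sqrt(beta) > 1
% (constants and lower-order terms suppressed).
\[
  \Pr\bigl[\delta_{\mathrm{in}}(i) \le \delta(i,t)\bigr]
    \;=\; m^{-(\sqrt{\alpha}-\sqrt{\beta})^{2} + o(1)},
  \qquad p = \tfrac{\alpha \log m}{m},\; q = \tfrac{\beta \log m}{m}.
\]
% Union bound over the n = km vertices and the k - 1 other clusters; for
% k = o(log n) we have n k = m^{1 + o(1)}, so the bound tends to 0 iff
\[
  (\sqrt{\alpha}-\sqrt{\beta})^{2} \;>\; 1 .
\]
```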

3 Proof of the main theorem

In this section we prove our main theorem, Theorem 2, about the SDP defined by (4). We restate the SDP here:

max ⟨A, X⟩
s.t.  X_ii = 1 for all i,
      Σ_j X_ij = m for all i,
      X_ij ≥ 0 for all i, j,
      X ⪰ 0.
(6)

Let X* be the matrix corresponding to the hidden partition P, i.e. X*_ij = 1 if i and j belong to the same cluster and X*_ij = 0 otherwise. Let OPT be the optimal value of the above SDP. We will show that X* is the unique solution to SDP (4) w.h.p. as long as the conditions in Theorem 2 are satisfied; this proves Theorem 2. Our proof is based on a dual certificate. In that context, consider the dual formulation of the above SDP, which is the following:

min  Σ_i D_ii + m Σ_i z_i
s.t.  D + Σ_i (R_i + C_i) − Λ − A ⪰ 0,  Λ ≥ 0 with Λ_ii = 0,
(7)

where D is a diagonal matrix, the z_i are scalars, Λ is a non-negative symmetric matrix (corresponding to the entrywise nonnegativity constraints) with zeros in the diagonal entries, R_i is the matrix with z_i in every entry of row i and 0 otherwise, and C_i is the matrix with z_i in every entry of column i and 0 otherwise. We will first exhibit a feasible dual solution which, with high probability, has dual objective value equal to ⟨A, X*⟩. Since, by weak duality, the value of every primal feasible solution is at most the value of every dual feasible solution, this shows that X* is an optimal solution to the above SDP. We will also show uniqueness via complementary slackness. Before moving further, it will be convenient to introduce the following definitions, which will be used in the proof later. We also encourage the reader to revisit the Notation section (Section 1.2) at this time, as it will help with the reading of what follows. Given a partition P = (P_1, …, P_k) of the vertices, we define the vectors 1_{P_1}, …, 1_{P_k} ∈ ℝ^n to be the indicator vectors of the clusters. We further define the following subspaces, which are perpendicular to each other and together span ℝ^n:

  • R: the subspace spanned by the vectors 1_{P_1}, …, 1_{P_k}, i.e. the subspace of vectors with equal values in each cluster,

  • R^⊥: the subspace perpendicular to R, i.e. the subspace of vectors whose sum on each cluster is equal to 0.
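To see why these subspaces are the natural ones, note that the intended primal solution X* defined above can be written in terms of the cluster indicator vectors, and R and R^⊥ are exactly its eigenspaces (a direct computation, using only the definitions above):

```latex
% X* is the 0/1 block matrix of the hidden partition; R and R^perp are its
% eigenspaces, with eigenvalues m and 0 respectively.
\[
  X^{*} \;=\; \sum_{t=1}^{k} \mathbf{1}_{P_t}\mathbf{1}_{P_t}^{\top},
  \qquad
  X^{*}v = m\,v \ \ (v \in R),
  \qquad
  X^{*}v = 0 \ \ (v \in R^{\perp}).
\]
```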

At this point it is useful to look at what the complementary slackness condition implies. Since strong duality holds for our SDP (it is easy to check that Slater's condition is satisfied), the complementary slackness term is zero, i.e. ⟨Y, X*⟩ = 0 for any optimal dual solution, where Y denotes the positive semidefinite slack matrix of the dual constraint. Since both Y and X* are PSD, and R is the column space of X*, this forces R to be an eigenspace of Y with eigenvalue 0, which implies

(8)

Having established the conditions that must be satisfied by any optimal dual solution, we describe our candidate dual solution (D, {z_i}, Λ). We begin by describing the choice of Λ, whose entries are defined by one expression when vertices i and j belong to the same cluster and by another when they do not. It is easy to see that the matrix Λ is symmetric, by noting that exchanging i and j in the defining expression leads to the same value. Also, to see that each entry of Λ is non-negative, note that each entry is a sum of non-negative terms. Having defined Λ as above, we choose the z_i such that the condition given in Equation (8) holds for the non-diagonal blocks. Finally, we define D to balance out the sum along the diagonal blocks coming from A and Λ as well as from the z_i.

Interestingly, this dual certificate construction seems to share some features with the one proposed by Awasthi et al. [ABC15] for an SDP relaxation of k-means clustering. While we were not able to make a formal connection, it would be very interesting if the reason for the similarities were the existence of some type of canonical way of building certificates for clustering problems; we leave this for future investigation. Now consider the objective of the dual program (7); it is easy to see that it is equal to ⟨A, X*⟩, the value of the primal candidate X*.

The following lemma, the proof of which we provide in the Appendix in Section LABEL:sec:proof_main_lemma, implies that the above mentioned solution is a valid dual solution, proving that X* is an optimal solution to the above program (by weak duality). The candidate dual solution constructed above is such that, with probability 1 − o(1), if condition (5) is satisfied, then it is feasible for the dual program (7).

It is easy to show, using complementary slackness, that X* is indeed the unique optimal solution with high probability. For completeness we provide the proof in the Appendix in Section LABEL:sec:uniqueness.

4 Note about the Monotone Adversary

In this section, we extend our result to the following semi-random model considered in the paper of Feige and Kilian [FK01]. We first define a monotone adversary (we define it for the “homophilic” case). Given a graph G and a partition P, a monotone adversary is allowed to take any of the following two actions on the graph:

  • Arbitrarily remove edges across clusters, i.e. remove any edge (i, j) ∈ E(G) with c(i) ≠ c(j).

  • Arbitrarily add edges within clusters, i.e. add any edge (i, j) ∉ E(G) with c(i) = c(j).

Given a graph G, let G′ be the resulting graph after the adversary’s actions. The adversary is monotone in the sense that the set of optimal multisections of G′ is contained in the set of optimal multisections of G. Let OPT be the number of edges cut in an optimal multisection. We now consider the following semi-random model: we first randomly pick a graph G ∼ G(n, k, p, q), and then the algorithm is given G′, where the monotone adversary has acted on G. The following theorem shows that our algorithm is robust against such a monotone adversary: given a graph G′ generated by the semi-random model described above, with probability 1 − o(1) the algorithm described in Section 3 recovers the original (hidden) partition. The probability is over the randomness in the production of G, on which the adversary acts. We provide the proof of the above theorem in the Appendix in Section LABEL:sec:adversaryproof. A small illustrative sketch of such an adversary is given below.
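The following Python sketch implements one admissible monotone adversary on an adjacency matrix; the specific random strategy is just an example, since any combination of the two allowed actions is permitted.

```python
import numpy as np

def monotone_adversary(A, labels, remove_frac=0.5, add_frac=0.1,
                       rng=np.random.default_rng(1)):
    """Apply monotone changes to the SBM adjacency matrix A: remove some
    cross-cluster edges and add some within-cluster non-edges."""
    Ap = A.copy()
    n = A.shape[0]
    same = labels[:, None] == labels[None, :]
    iu, ju = np.triu_indices(n, 1)
    for i, j in zip(iu, ju):
        if Ap[i, j] == 1 and not same[i, j] and rng.random() < remove_frac:
            Ap[i, j] = Ap[j, i] = 0          # remove an edge across clusters
        elif Ap[i, j] == 0 and same[i, j] and rng.random() < add_frac:
            Ap[i, j] = Ap[j, i] = 1          # add an edge within a cluster
    return Ap
```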

References